2009-06-04 15:21:07

by Frederic Weisbecker

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

Hi,


On Thu, May 28, 2009 at 01:46:33PM +0200, Jens Axboe wrote:
> Hi,
>
> Here's the 9th version of the writeback patches. Changes since v8:
>
> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
> issue.
> - Get rid of the explicit wait queues, we can just use wake_up_process()
> since it's just for that one task.
> - Add separate "sync_supers" thread that makes sure that the dirty
> super blocks get written. We cannot safely do this from bdi_forker_task(),
> as that risks deadlocking on ->s_umount. Artem, I implemented this
> by doing the wake ups from a timer so that it would be easier for you
> to just deactivate the timer when there are no super blocks.
>
> For ease of patching, I've put the full diff here:
>
> http://kernel.dk/writeback-v9.patch
>
> and also stored this in a writeback-v9 branch that will not change,
> you can pull that into Linus tree from here:
>
> git://git.kernel.dk/linux-2.6-block.git writeback-v9
>
> block/blk-core.c | 1 +
> drivers/block/aoe/aoeblk.c | 1 +
> drivers/char/mem.c | 1 +
> fs/btrfs/disk-io.c | 24 +-
> fs/buffer.c | 2 +-
> fs/char_dev.c | 1 +
> fs/configfs/inode.c | 1 +
> fs/fs-writeback.c | 804 ++++++++++++++++++++++++++++-------
> fs/fuse/inode.c | 1 +
> fs/hugetlbfs/inode.c | 1 +
> fs/nfs/client.c | 1 +
> fs/ntfs/super.c | 33 +--
> fs/ocfs2/dlm/dlmfs.c | 1 +
> fs/ramfs/inode.c | 1 +
> fs/super.c | 3 -
> fs/sync.c | 2 +-
> fs/sysfs/inode.c | 1 +
> fs/ubifs/super.c | 1 +
> include/linux/backing-dev.h | 73 ++++-
> include/linux/fs.h | 11 +-
> include/linux/writeback.h | 15 +-
> kernel/cgroup.c | 1 +
> mm/Makefile | 2 +-
> mm/backing-dev.c | 518 ++++++++++++++++++++++-
> mm/page-writeback.c | 151 +------
> mm/pdflush.c | 269 ------------
> mm/swap_state.c | 1 +
> mm/vmscan.c | 2 +-
> 28 files changed, 1286 insertions(+), 637 deletions(-)
>
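
The quoted changelog mentions a separate sync_supers thread that is woken from
a timer instead of from bdi_forker_task(). Below is a minimal sketch of that
idea; the names, the 5-second interval and the init comment are illustrative
assumptions, not the patch's actual code.

#include <linux/kthread.h>
#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/fs.h>

static struct task_struct *sync_supers_tsk;
static struct timer_list sync_supers_timer;

/* Timer callback: poke the thread and re-arm. No explicit wait queue is
 * needed, since only this one task ever sleeps waiting for it. */
static void sync_supers_timer_fn(unsigned long unused)
{
        wake_up_process(sync_supers_tsk);
        mod_timer(&sync_supers_timer, jiffies + msecs_to_jiffies(5000));
}

static int sync_supers_thread(void *unused)
{
        while (!kthread_should_stop()) {
                sync_supers();  /* write back dirty superblocks */
                set_current_state(TASK_INTERRUPTIBLE);
                schedule();     /* sleep until the timer wakes us again */
        }
        return 0;
}

/* At init (not shown): sync_supers_tsk = kthread_run(sync_supers_thread,
 * NULL, "sync_supers"); setup_timer(&sync_supers_timer, sync_supers_timer_fn,
 * 0); then mod_timer() to arm it. Deactivating the timer stops the wakeups. */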


I've just tested it on UP in a single disk.

I've run two parallel dbench tests on two partitions,
once with this patch and once without.

I used 30 processes each, running for 600 seconds.

You can see the result in attachment.
And also there:

http://kernel.org/pub/linux/kernel/people/frederic/dbench.pdf
http://kernel.org/pub/linux/kernel/people/frederic/bdi-writeback-hda1.log
http://kernel.org/pub/linux/kernel/people/frederic/bdi-writeback-hda3.log
http://kernel.org/pub/linux/kernel/people/frederic/pdflush-hda1.log
http://kernel.org/pub/linux/kernel/people/frederic/pdflush-hda3.log


As you can see, bdi writeback is faster than pdflush on hda1 and slower
on hda3. But, well, that's not the point.

What I can observe here is the difference in the standard deviation
of the rate between two parallel writers on the same device (but on
two different partitions, and therefore superblocks).

With pdflush, the rate is much better balanced between the writers than
with bdi writeback on a single device.

I'm not sure why. Is there something in these patches that makes
several bdi flusher threads for the same bdi not well balanced
between them?

Frederic.


Attachments:
dbench.pdf (21.37 kB)
bdi-writeback-hda1.log (25.97 kB)
bdi-writeback-hda3.log (22.97 kB)
pdflush-hda1.log (28.05 kB)
pdflush-hda3.log (27.15 kB)

2009-06-04 19:08:29

by Andrew Morton

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <[email protected]> wrote:

> I've just tested it on UP in a single disk.

I must say, I'm stunned at the amount of testing which people are
performing on this patchset. Normally when someone sends out a
patchset it just sort of lands with a dull thud.

I'm not sure what Jens did right to make all this happen, but thanks!

2009-06-04 19:13:22

by Frederic Weisbecker

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <[email protected]> wrote:
>
> > I've just tested it on UP in a single disk.
>
> I must say, I'm stunned at the amount of testing which people are
> performing on this patchset. Normally when someone sends out a
> patchset it just sort of lands with a dull thud.
>
> I'm not sure what Jens did right to make all this happen, but thanks!


I don't know how he did it either. I was reading these patches and *something*
pushed me to my testbox, and then I tested...

Jens, how do you do that?

2009-06-04 19:50:24

by Jens Axboe

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <[email protected]> wrote:
> >
> > > I've just tested it on UP in a single disk.
> >
> > I must say, I'm stunned at the amount of testing which people are
> > performing on this patchset. Normally when someone sends out a
> > patchset it just sort of lands with a dull thud.
> >
> > I'm not sure what Jens did right to make all this happen, but thanks!
>
>
> I don't know how he did either. I was reading theses patches and *something*
> pushed me to my testbox, and then I tested...
>
> Jens, how do you do that?

Heh, not sure :-)

But indeed, thanks for the testing. It looks quite interesting. I'm
guessing it probably has to do with who ends up doing the balancing and
with the fact that the flusher threads block; that may change the picture
a bit. So it may just be that it'll require a few vm tweaks. I'll
definitely look into it and try to reproduce your results.

Did you run it a 2nd time on each drive and check if the results were
(approximately) consistent on the two drives?

--
Jens Axboe

2009-06-04 20:10:24

by Jens Axboe

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Thu, Jun 04 2009, Jens Axboe wrote:
> On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <[email protected]> wrote:
> > >
> > > > I've just tested it on UP in a single disk.
> > >
> > > I must say, I'm stunned at the amount of testing which people are
> > > performing on this patchset. Normally when someone sends out a
> > > patchset it just sort of lands with a dull thud.
> > >
> > > I'm not sure what Jens did right to make all this happen, but thanks!
> >
> >
> > I don't know how he did either. I was reading theses patches and *something*
> > pushed me to my testbox, and then I tested...
> >
> > Jens, how do you do that?
>
> Heh, not sure :-)
>
> But indeed, thanks for the testing. It looks quite interesting. I'm
> guessing it probably has to do with who ends up doing the balancing and
> that the flusher threads block, it may change the picture a bit. So it
> may just be that it'll require a few vm tweaks. I'll definitely look
> into it and try and reproduce your results.
>
> Did you run it a 2nd time on each drive and check if the results were
> (approximately) consistent on the two drives?

each partition... What IO scheduler did you use on hda?

The main difference with this test case is that before we had two super
blocks, each with lists of dirty inodes. pdflush would attack those. Now
we have both the inodes from the two supers on a single set of lists on
the bdi. So either we have some ordering issue there (which is causing
the unfairness), or something else is.
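
As a rough sketch of the structural change described above (field names are
simplified; this is not the actual kernel code):

#include <linux/list.h>

/* Before: each superblock keeps its own dirty-inode lists, which pdflush
 * walks per superblock. */
struct sb_dirty_lists_before {
        struct list_head s_dirty;       /* dirty inodes of this sb only */
        struct list_head s_io;          /* inodes queued for writeback */
};

/* After: the lists live on the backing device, so dirty inodes from both
 * partitions' superblocks end up interleaved on one set of lists, serviced
 * by that bdi's flusher thread. */
struct bdi_dirty_lists_after {
        struct list_head b_dirty;
        struct list_head b_io;
        struct list_head b_more_io;
};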

So perhaps you can try with noop on hda to see if that changes the
picture?

--
Jens Axboe

2009-06-04 21:43:24

by Frederic Weisbecker

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Thu, Jun 04, 2009 at 09:50:13PM +0200, Jens Axboe wrote:
> On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <[email protected]> wrote:
> > >
> > > > I've just tested it on UP in a single disk.
> > >
> > > I must say, I'm stunned at the amount of testing which people are
> > > performing on this patchset. Normally when someone sends out a
> > > patchset it just sort of lands with a dull thud.
> > >
> > > I'm not sure what Jens did right to make all this happen, but thanks!
> >
> >
> > I don't know how he did either. I was reading theses patches and *something*
> > pushed me to my testbox, and then I tested...
> >
> > Jens, how do you do that?
>
> Heh, not sure :-)
>
> But indeed, thanks for the testing. It looks quite interesting. I'm
> guessing it probably has to do with who ends up doing the balancing and
> that the flusher threads block, it may change the picture a bit. So it
> may just be that it'll require a few vm tweaks. I'll definitely look
> into it and try and reproduce your results.
>
> Did you run it a 2nd time on each drive and check if the results were
> (approximately) consistent on the two drives?


Another snapshot, only with bdi-writeback this time.

http://kernel.org/pub/linux/kernel/people/frederic/dbench2.pdf

Looks like the same effect, but the difference is smaller this time.

I guess there is a good deal of entropy in there, so it's hard to tell :)
I'll test with the noop scheduler.

2009-06-04 22:35:07

by Frederic Weisbecker

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Thu, Jun 04, 2009 at 10:10:12PM +0200, Jens Axboe wrote:
> On Thu, Jun 04 2009, Jens Axboe wrote:
> > On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <[email protected]> wrote:
> > > >
> > > > > I've just tested it on UP in a single disk.
> > > >
> > > > I must say, I'm stunned at the amount of testing which people are
> > > > performing on this patchset. Normally when someone sends out a
> > > > patchset it just sort of lands with a dull thud.
> > > >
> > > > I'm not sure what Jens did right to make all this happen, but thanks!
> > >
> > >
> > > I don't know how he did either. I was reading theses patches and *something*
> > > pushed me to my testbox, and then I tested...
> > >
> > > Jens, how do you do that?
> >
> > Heh, not sure :-)
> >
> > But indeed, thanks for the testing. It looks quite interesting. I'm
> > guessing it probably has to do with who ends up doing the balancing and
> > that the flusher threads block, it may change the picture a bit. So it
> > may just be that it'll require a few vm tweaks. I'll definitely look
> > into it and try and reproduce your results.
> >
> > Did you run it a 2nd time on each drive and check if the results were
> > (approximately) consistent on the two drives?
>
> each partition... What IO scheduler did you use on hda?


CFQ.


> The main difference with this test case is that before we had two super
> blocks, each with lists of dirty inodes. pdflush would attack those. Now
> we have both the inodes from the two supers on a single set of lists on
> the bdi. So either we have some ordering issue there (which is causing
> the unfairness), or something else is.


Yeah.
But although these flushers are per-bdi, with a single list (well, three)
of dirty inodes, it looks like the writeback is still performed per
superblock: the bdi work carries the superblock concerned, and the bdi
list is iterated in generic_sync_wb_inodes(), which only processes the
inodes belonging to that superblock. So there is a bit of per-superblock
serialization there and....


(Note: the above is mostly written for myself, in the secret hope that
writing down my brainstorming will help me understand these patches better...)
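
A rough sketch of the per-superblock filtering described above; the shape and
the field names (b_io, i_list) are assumptions based on this discussion and
2.6.30-era structures, not the patch's actual code:

#include <linux/fs.h>
#include <linux/writeback.h>
#include <linux/backing-dev.h>

static void sketch_sync_wb_inodes(struct bdi_writeback *wb,
                                  struct super_block *sb,
                                  struct writeback_control *wbc)
{
        struct inode *inode, *next;

        list_for_each_entry_safe(inode, next, &wb->b_io, i_list) {
                /* When a specific superblock is requested, skip inodes that
                 * belong to other superblocks; with sb == NULL, every inode
                 * on the bdi list is processed. */
                if (sb && inode->i_sb != sb)
                        continue;
                /* ... write back this inode according to wbc ... */
        }
}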


> So perhaps you can try with noop on hda to see if that changes the
> picture?



The result with noop is even more impressive.

See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf

Also a comparison, noop with pdflush against noop with bdi writeback:

http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf


Frederic.

2009-06-05 01:14:47

by Yanmin Zhang

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Thu, 2009-06-04 at 17:20 +0200, Frederic Weisbecker wrote:
> Hi,
>
>
> On Thu, May 28, 2009 at 01:46:33PM +0200, Jens Axboe wrote:
> > Hi,
> >
> > Here's the 9th version of the writeback patches. Changes since v8:

> I've just tested it on UP in a single disk.
>
> I've run two parallels dbench tests on two partitions and
> tried it with this patch and without.
I also tested V9 with a multiple-dbench workload by starting multiple
dbench tasks, each with 4 processes doing I/O on one partition (file
system). Mostly I use JBODs with 7/11/13 disks.

I didn't find any regression between the vanilla and V9 kernels on this
workload.

>
> I used 30 proc each during 600 secs.
>
> You can see the result in attachment.
> And also there:
>
> http://kernel.org/pub/linux/kernel/people/frederic/dbench.pdf
> http://kernel.org/pub/linux/kernel/people/frederic/bdi-writeback-hda1.log
> http://kernel.org/pub/linux/kernel/people/frederic/bdi-writeback-hda3.log
> http://kernel.org/pub/linux/kernel/people/frederic/pdflush-hda1.log
> http://kernel.org/pub/linux/kernel/people/frederic/pdflush-hda3.log
>
>
> As you can see, bdi writeback is faster than pdflush on hda1 and slower
> on hda3. But, well that's not the point.
>
> What I can observe here is the difference on the standard deviation
> for the rate between two parallel writers on a same device (but
> two different partitions, then superblocks).
>
> With pdflush, the distributed rate is much better balanced than
> with bdi writeback in a single device.
>
> I'm not sure why. Is there something in these patches that makes
> several bdi flusher threads for a same bdi not well balanced
> between them?
>
> Frederic.

2009-06-05 19:15:38

by Jens Axboe

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> On Thu, Jun 04, 2009 at 10:10:12PM +0200, Jens Axboe wrote:
> > On Thu, Jun 04 2009, Jens Axboe wrote:
> > > On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > > > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <[email protected]> wrote:
> > > > >
> > > > > > I've just tested it on UP in a single disk.
> > > > >
> > > > > I must say, I'm stunned at the amount of testing which people are
> > > > > performing on this patchset. Normally when someone sends out a
> > > > > patchset it just sort of lands with a dull thud.
> > > > >
> > > > > I'm not sure what Jens did right to make all this happen, but thanks!
> > > >
> > > >
> > > > I don't know how he did either. I was reading theses patches and *something*
> > > > pushed me to my testbox, and then I tested...
> > > >
> > > > Jens, how do you do that?
> > >
> > > Heh, not sure :-)
> > >
> > > But indeed, thanks for the testing. It looks quite interesting. I'm
> > > guessing it probably has to do with who ends up doing the balancing and
> > > that the flusher threads block, it may change the picture a bit. So it
> > > may just be that it'll require a few vm tweaks. I'll definitely look
> > > into it and try and reproduce your results.
> > >
> > > Did you run it a 2nd time on each drive and check if the results were
> > > (approximately) consistent on the two drives?
> >
> > each partition... What IO scheduler did you use on hda?
>
>
> CFQ.
>
>
> > The main difference with this test case is that before we had two super
> > blocks, each with lists of dirty inodes. pdflush would attack those. Now
> > we have both the inodes from the two supers on a single set of lists on
> > the bdi. So either we have some ordering issue there (which is causing
> > the unfairness), or something else is.
>
>
> Yeah.
> But although these flushers are per-bdi, with a single list (well, three)
> of dirty inodes, it looks like the writeback is still performed per
> superblock, I mean the bdi work gives the concerned superblock
> and the bdi list is iterated in generic_sync_wb_inodes() which
> only processes the inodes for the given superblock. So there is
> a bit of a per superblock serialization there and....

But in most cases sb == NULL, which means that the writeback does not
care. It should only pass in a valid sb if someone explicitly wants to
sync that sb.

But the way the lists are organized now definitely does open some
windows of unfairness for a test like yours. It's at the top of the
list to investigate on Monday.

> > So perhaps you can try with noop on hda to see if that changes the
> > picture?
>
>
>
> The result with noop is even more impressive.
>
> See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
>
> Also a comparison, noop with pdflush against noop with bdi writeback:
>
> http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf

OK, so things aren't exactly peachy here to begin with. It may not
actually BE an issue, or at least not a new one, but that doesn't mean
that we should not attempt to quantify the impact.

How are you starting these runs? With a test like this, even a small
difference in start time can make a huge difference.

--
Jens Axboe

2009-06-05 19:16:20

by Jens Axboe

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Fri, Jun 05 2009, Zhang, Yanmin wrote:
> On Thu, 2009-06-04 at 17:20 +0200, Frederic Weisbecker wrote:
> > Hi,
> >
> >
> > On Thu, May 28, 2009 at 01:46:33PM +0200, Jens Axboe wrote:
> > > Hi,
> > >
> > > Here's the 9th version of the writeback patches. Changes since v8:
>
> > I've just tested it on UP in a single disk.
> >
> > I've run two parallels dbench tests on two partitions and
> > tried it with this patch and without.
> I also tested V9 with multiple-dbench workload by starting multiple
> dbench tasks and every task has 4 processes to do I/O on one partition (file
> system). Mostly I use JBODs which have 7/11/13 disks.
>
> I didn't find result regression between vanilla and V9 kernel on
> this workload.

Ah that's good, thanks for that result as well :-)

--
Jens Axboe

2009-06-05 21:14:52

by Jan Kara

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > The result with noop is even more impressive.
> >
> > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> >
> > Also a comparison, noop with pdflush against noop with bdi writeback:
> >
> > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
>
> OK, so things aren't exactly peachy here to begin with. It may not
> actually BE an issue, or at least now a new one, but that doesn't mean
> that we should not attempt to quantify the impact.
What also looks interesting is the overall throughput. With pdflush we
get 2.5 MB/s + 26 MB/s, while with per-bdi we get 2.7 MB/s + 13 MB/s.
So per-bdi seems to be *more* fair, but throughput suffers a lot (which
might be inevitable due to the incurred seeks).
Frederic, how much does dbench achieve for you on just one partition
(test both consecutively if possible) with as many threads as those
two dbench instances have together? Thanks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-06 00:19:29

by Chris Mason

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > The result with noop is even more impressive.
> > >
> > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > >
> > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > >
> > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> >
> > OK, so things aren't exactly peachy here to begin with. It may not
> > actually BE an issue, or at least now a new one, but that doesn't mean
> > that we should not attempt to quantify the impact.
> What looks interesting is also the overall throughput. With pdflush we
> get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> So per-bdi seems to be *more* fair but throughput suffers a lot (which
> might be inevitable due to incurred seeks).
> Frederic, how much does dbench achieve for you just on one partition
> (test both consecutively if possible) with as many threads as have those
> two dbench instances together? Thanks.

Is the graph showing us dbench tput or disk tput? I'm assuming it is
disk tput, so bdi may just be writing less?

-chris

2009-06-06 00:23:49

by Jan Kara

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Fri 05-06-09 20:18:15, Chris Mason wrote:
> On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > The result with noop is even more impressive.
> > > >
> > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > >
> > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > >
> > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > >
> > > OK, so things aren't exactly peachy here to begin with. It may not
> > > actually BE an issue, or at least now a new one, but that doesn't mean
> > > that we should not attempt to quantify the impact.
> > What looks interesting is also the overall throughput. With pdflush we
> > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > might be inevitable due to incurred seeks).
> > Frederic, how much does dbench achieve for you just on one partition
> > (test both consecutively if possible) with as many threads as have those
> > two dbench instances together? Thanks.
>
> Is the graph showing us dbench tput or disk tput? I'm assuming it is
> disk tput, so bdi may just be writing less?
Good question. I was assuming dbench throughput :).

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-06 00:36:01

by Frederic Weisbecker

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Fri, Jun 05, 2009 at 09:15:28PM +0200, Jens Axboe wrote:
> On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > On Thu, Jun 04, 2009 at 10:10:12PM +0200, Jens Axboe wrote:
> > > On Thu, Jun 04 2009, Jens Axboe wrote:
> > > > On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > > > > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > > > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <[email protected]> wrote:
> > > > > >
> > > > > > > I've just tested it on UP in a single disk.
> > > > > >
> > > > > > I must say, I'm stunned at the amount of testing which people are
> > > > > > performing on this patchset. Normally when someone sends out a
> > > > > > patchset it just sort of lands with a dull thud.
> > > > > >
> > > > > > I'm not sure what Jens did right to make all this happen, but thanks!
> > > > >
> > > > >
> > > > > I don't know how he did either. I was reading theses patches and *something*
> > > > > pushed me to my testbox, and then I tested...
> > > > >
> > > > > Jens, how do you do that?
> > > >
> > > > Heh, not sure :-)
> > > >
> > > > But indeed, thanks for the testing. It looks quite interesting. I'm
> > > > guessing it probably has to do with who ends up doing the balancing and
> > > > that the flusher threads block, it may change the picture a bit. So it
> > > > may just be that it'll require a few vm tweaks. I'll definitely look
> > > > into it and try and reproduce your results.
> > > >
> > > > Did you run it a 2nd time on each drive and check if the results were
> > > > (approximately) consistent on the two drives?
> > >
> > > each partition... What IO scheduler did you use on hda?
> >
> >
> > CFQ.
> >
> >
> > > The main difference with this test case is that before we had two super
> > > blocks, each with lists of dirty inodes. pdflush would attack those. Now
> > > we have both the inodes from the two supers on a single set of lists on
> > > the bdi. So either we have some ordering issue there (which is causing
> > > the unfairness), or something else is.
> >
> >
> > Yeah.
> > But although these flushers are per-bdi, with a single list (well, three)
> > of dirty inodes, it looks like the writeback is still performed per
> > superblock, I mean the bdi work gives the concerned superblock
> > and the bdi list is iterated in generic_sync_wb_inodes() which
> > only processes the inodes for the given superblock. So there is
> > a bit of a per superblock serialization there and....
>
> But in most cases sb == NULL, which means that the writeback does not
> care. It should only pass in a valid sb if someone explicitly wants to
> sync that sb.


Ah ok.


> But the way that the lists are organized now does definitely open some
> windows of unfairness for a test like yours. It's on the top of the
> investigate list for monday.



I'll stay tuned.



> > > So perhaps you can try with noop on hda to see if that changes the
> > > picture?
> >
> >
> >
> > The result with noop is even more impressive.
> >
> > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> >
> > Also a comparison, noop with pdflush against noop with bdi writeback:
> >
> > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
>
> OK, so things aren't exactly peachy here to begin with. It may not
> actually BE an issue, or at least now a new one, but that doesn't mean
> that we should not attempt to quantify the impact.
>
> How are you starting these runs? With a test like this, even a small
> difference in start time can make a huge difference.


Hmm, in a kind of rough way :)
I pre-type the command on two consoles, one for each partition
concerned, then I hit enter on each one.

So one is always started slightly before the other. And it looks
like the first one often wins the race.

Frederic.



> --
> Jens Axboe
>

2009-06-06 01:00:47

by Frederic Weisbecker

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > The result with noop is even more impressive.
> > >
> > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > >
> > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > >
> > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> >
> > OK, so things aren't exactly peachy here to begin with. It may not
> > actually BE an issue, or at least now a new one, but that doesn't mean
> > that we should not attempt to quantify the impact.
> What looks interesting is also the overall throughput. With pdflush we
> get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> So per-bdi seems to be *more* fair but throughput suffers a lot (which
> might be inevitable due to incurred seeks).



Heh, indeed. I was confused by the colors here, but yes, pdflush has a
higher total throughput and more unfairness with noop, at least in this test.



> Frederic, how much does dbench achieve for you just on one partition
> (test both consecutively if possible) with as many threads as have those
> two dbench instances together? Thanks.



Good idea, I'll try that out so that there won't be any per-superblock
ordering there, or whatever that could be.

Thanks.


> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2009-06-06 01:06:42

by Frederic Weisbecker

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > The result with noop is even more impressive.
> > > > >
> > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > >
> > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > >
> > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > >
> > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > actually BE an issue, or at least now a new one, but that doesn't mean
> > > > that we should not attempt to quantify the impact.
> > > What looks interesting is also the overall throughput. With pdflush we
> > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > might be inevitable due to incurred seeks).
> > > Frederic, how much does dbench achieve for you just on one partition
> > > (test both consecutively if possible) with as many threads as have those
> > > two dbench instances together? Thanks.
> >
> > Is the graph showing us dbench tput or disk tput? I'm assuming it is
> > disk tput, so bdi may just be writing less?
> Good, question. I was assuming dbench throughput :).
>
> Honza


Yeah, it's dbench. Maybe that's not the right tool to measure the writeback
layer, even though dbench results are necessarily influenced by the writeback
behaviour.

Maybe I should use something else?

Note that if you want, I can put some surgical trace_printk() calls
in fs/fs-writeback.c.
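
For reference, a hypothetical example of that kind of instrumentation; the
hook point and the fields printed are assumptions, not an agreed-on patch:

#include <linux/kernel.h>
#include <linux/fs.h>

/* Could be called from fs/fs-writeback.c, e.g. just before an inode is
 * written back. The output lands in the ftrace ring buffer
 * (/sys/kernel/debug/tracing/trace). */
static inline void trace_wb_inode(const char *what, struct inode *inode)
{
        trace_printk("%s: sb %s ino %lu state 0x%lx\n",
                     what, inode->i_sb->s_id, inode->i_ino, inode->i_state);
}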


>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2009-06-08 09:23:47

by Jens Axboe

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > The result with noop is even more impressive.
> > > > > >
> > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > >
> > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > >
> > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > >
> > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > actually BE an issue, or at least now a new one, but that doesn't mean
> > > > > that we should not attempt to quantify the impact.
> > > > What looks interesting is also the overall throughput. With pdflush we
> > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > might be inevitable due to incurred seeks).
> > > > Frederic, how much does dbench achieve for you just on one partition
> > > > (test both consecutively if possible) with as many threads as have those
> > > > two dbench instances together? Thanks.
> > >
> > > Is the graph showing us dbench tput or disk tput? I'm assuming it is
> > > disk tput, so bdi may just be writing less?
> > Good, question. I was assuming dbench throughput :).
> >
> > Honza
>
>
> Yeah it's dbench. May be that's not the right tool to measure the writeback
> layer, even though dbench results are necessarily influenced by the writeback
> behaviour.
>
> May be I should use something else?
>
> Note that if you want I can put some surgicals trace_printk()
> in fs/fs-writeback.c

FWIW, I ran a similar test here just now. CFQ was used, two partitions
on an (otherwise) idle drive. I used 30 clients per dbench and 600s
runtime. Results are nearly identical, both throughout the run and
total:

/dev/sdb1
Throughput 165.738 MB/sec 30 clients 30 procs max_latency=459.002 ms

/dev/sdb2
Throughput 165.773 MB/sec 30 clients 30 procs max_latency=607.198 ms

The flusher threads see very little exercise here.

--
Jens Axboe

2009-06-08 12:23:18

by Jan Kara

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Mon 08-06-09 11:23:38, Jens Axboe wrote:
> On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> > On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > > The result with noop is even more impressive.
> > > > > > >
> > > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > >
> > > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > >
> > > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > >
> > > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > > actually BE an issue, or at least now a new one, but that doesn't mean
> > > > > > that we should not attempt to quantify the impact.
> > > > > What looks interesting is also the overall throughput. With pdflush we
> > > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > > might be inevitable due to incurred seeks).
> > > > > Frederic, how much does dbench achieve for you just on one partition
> > > > > (test both consecutively if possible) with as many threads as have those
> > > > > two dbench instances together? Thanks.
> > > >
> > > > Is the graph showing us dbench tput or disk tput? I'm assuming it is
> > > > disk tput, so bdi may just be writing less?
> > > Good, question. I was assuming dbench throughput :).
> > >
> > > Honza
> >
> >
> > Yeah it's dbench. May be that's not the right tool to measure the writeback
> > layer, even though dbench results are necessarily influenced by the writeback
> > behaviour.
> >
> > May be I should use something else?
> >
> > Note that if you want I can put some surgicals trace_printk()
> > in fs/fs-writeback.c
>
> FWIW, I ran a similar test here just now. CFQ was used, two partitions
> on an (otherwise) idle drive. I used 30 clients per dbench and 600s
> runtime. Results are nearly identical, both throughout the run and
> total:
>
> /dev/sdb1
> Throughput 165.738 MB/sec 30 clients 30 procs max_latency=459.002 ms
>
> /dev/sdb2
> Throughput 165.773 MB/sec 30 clients 30 procs max_latency=607.198 ms
Hmm, interesting. 165 MB/sec (in fact 330 MB/sec for that drive) sounds
like quite a lot ;). This usually happens with dbench when the processes
manage to delete / redirty data before the writeback thread gets to it (so
some IO happens only in memory and throughput is bound by CPU / memory
speed). So I think you are on a different part of the performance curve
than Frederic. Probably you would have to run with more threads so that the
dbench threads get throttled by the total amount of dirty data generated...
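
For context, a minimal sketch of the throttling path being referred to; the
surrounding function is illustrative, only the ratelimited call is the real
2.6.30-era entry point:

#include <linux/writeback.h>

/* After dirtying pages in the buffered write path, a writer calls into the
 * dirty balancing code; once global dirty memory passes the threshold it
 * must write back (or wait) before dirtying more, which is where dbench
 * clients get throttled. */
static void sketch_after_dirtying(struct address_space *mapping,
                                  unsigned long nr_pages_dirtied)
{
        balance_dirty_pages_ratelimited_nr(mapping, nr_pages_dirtied);
}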

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-08 12:28:42

by Jens Axboe

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Mon, Jun 08 2009, Jan Kara wrote:
> On Mon 08-06-09 11:23:38, Jens Axboe wrote:
> > On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> > > On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > > > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > > > The result with noop is even more impressive.
> > > > > > > >
> > > > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > > >
> > > > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > > >
> > > > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > > >
> > > > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > > > actually BE an issue, or at least now a new one, but that doesn't mean
> > > > > > > that we should not attempt to quantify the impact.
> > > > > > What looks interesting is also the overall throughput. With pdflush we
> > > > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > > > might be inevitable due to incurred seeks).
> > > > > > Frederic, how much does dbench achieve for you just on one partition
> > > > > > (test both consecutively if possible) with as many threads as have those
> > > > > > two dbench instances together? Thanks.
> > > > >
> > > > > Is the graph showing us dbench tput or disk tput? I'm assuming it is
> > > > > disk tput, so bdi may just be writing less?
> > > > Good, question. I was assuming dbench throughput :).
> > > >
> > > > Honza
> > >
> > >
> > > Yeah it's dbench. May be that's not the right tool to measure the writeback
> > > layer, even though dbench results are necessarily influenced by the writeback
> > > behaviour.
> > >
> > > May be I should use something else?
> > >
> > > Note that if you want I can put some surgicals trace_printk()
> > > in fs/fs-writeback.c
> >
> > FWIW, I ran a similar test here just now. CFQ was used, two partitions
> > on an (otherwise) idle drive. I used 30 clients per dbench and 600s
> > runtime. Results are nearly identical, both throughout the run and
> > total:
> >
> > /dev/sdb1
> > Throughput 165.738 MB/sec 30 clients 30 procs max_latency=459.002 ms
> >
> > /dev/sdb2
> > Throughput 165.773 MB/sec 30 clients 30 procs max_latency=607.198 ms
> Hmm, interesting. 165 MB/sec (in fact 330 MB/sec for that drive) sounds
> like quite a lot ;). This usually happens with dbench when the processes
> manage to delete / redirty data before writeback thread gets to them (so
> some IO happens in memory only and throughput is bound by the CPU / memory
> speed). So I think you are on a different part of the performance curve
> than Frederic. Probably you have to run with more threads so that dbench
> threads get throttled because of total amount of dirty data generated...

Certainly, the actual disk data rate was consistently in the
60-70MB/sec region. The issue is likely that the box has 6GB of RAM; if
I boot with less, then 30 clients will do.

But unless the situation changes radically with memory pressure, it
still shows a fair distribution of IO between the two. Since they have
identical results throughout, it should be safe to assume that they have
equal bandwidth distribution at the disk end. A fast dbench run is one
that doesn't touch the disk at all; once you start touching the disk, you
lose :-)

--
Jens Axboe

2009-06-08 13:02:34

by Jan Kara

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Mon 08-06-09 14:28:34, Jens Axboe wrote:
> On Mon, Jun 08 2009, Jan Kara wrote:
> > On Mon 08-06-09 11:23:38, Jens Axboe wrote:
> > > On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> > > > On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > > > > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > > > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > > > > The result with noop is even more impressive.
> > > > > > > > >
> > > > > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > > > >
> > > > > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > > > >
> > > > > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > > > >
> > > > > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > > > > actually BE an issue, or at least now a new one, but that doesn't mean
> > > > > > > > that we should not attempt to quantify the impact.
> > > > > > > What looks interesting is also the overall throughput. With pdflush we
> > > > > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > > > > might be inevitable due to incurred seeks).
> > > > > > > Frederic, how much does dbench achieve for you just on one partition
> > > > > > > (test both consecutively if possible) with as many threads as have those
> > > > > > > two dbench instances together? Thanks.
> > > > > >
> > > > > > Is the graph showing us dbench tput or disk tput? I'm assuming it is
> > > > > > disk tput, so bdi may just be writing less?
> > > > > Good, question. I was assuming dbench throughput :).
> > > > >
> > > > > Honza
> > > >
> > > >
> > > > Yeah it's dbench. May be that's not the right tool to measure the writeback
> > > > layer, even though dbench results are necessarily influenced by the writeback
> > > > behaviour.
> > > >
> > > > May be I should use something else?
> > > >
> > > > Note that if you want I can put some surgicals trace_printk()
> > > > in fs/fs-writeback.c
> > >
> > > FWIW, I ran a similar test here just now. CFQ was used, two partitions
> > > on an (otherwise) idle drive. I used 30 clients per dbench and 600s
> > > runtime. Results are nearly identical, both throughout the run and
> > > total:
> > >
> > > /dev/sdb1
> > > Throughput 165.738 MB/sec 30 clients 30 procs max_latency=459.002 ms
> > >
> > > /dev/sdb2
> > > Throughput 165.773 MB/sec 30 clients 30 procs max_latency=607.198 ms
> > Hmm, interesting. 165 MB/sec (in fact 330 MB/sec for that drive) sounds
> > like quite a lot ;). This usually happens with dbench when the processes
> > manage to delete / redirty data before writeback thread gets to them (so
> > some IO happens in memory only and throughput is bound by the CPU / memory
> > speed). So I think you are on a different part of the performance curve
> > than Frederic. Probably you have to run with more threads so that dbench
> > threads get throttled because of total amount of dirty data generated...
>
> Certainly, the actual disk data rate was consistenctly in the
> 60-70MB/sec region. The issue is likely that the box has 6GB of RAM, if
> I boot with less than 30 clients will do.
Yes, that would do as well.

> But unless the situation changes radically with memory pressure, it
> still shows a fair distribution of IO between the two. Since they have
> identical results throughout, it should be safe to assume that the have
> equal bandwidth distribution at the disk end. A fast dbench run is one
Yes, I agree. Your previous test indirectly shows fair distribution
on the disk end (with blktrace you could actually confirm it directly).

> that doesn't touch the disk at all, once you start touching disk you
> lose :-)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-09 18:40:12

by Frederic Weisbecker

Subject: Re: [PATCH 0/11] Per-bdi writeback flusher threads v9

On Mon, Jun 08, 2009 at 02:28:34PM +0200, Jens Axboe wrote:
> On Mon, Jun 08 2009, Jan Kara wrote:
> > On Mon 08-06-09 11:23:38, Jens Axboe wrote:
> > > On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> > > > On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > > > > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > > > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > > > > The result with noop is even more impressive.
> > > > > > > > >
> > > > > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > > > >
> > > > > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > > > >
> > > > > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > > > >
> > > > > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > > > > actually BE an issue, or at least now a new one, but that doesn't mean
> > > > > > > > that we should not attempt to quantify the impact.
> > > > > > > What looks interesting is also the overall throughput. With pdflush we
> > > > > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > > > > might be inevitable due to incurred seeks).
> > > > > > > Frederic, how much does dbench achieve for you just on one partition
> > > > > > > (test both consecutively if possible) with as many threads as have those
> > > > > > > two dbench instances together? Thanks.
> > > > > >
> > > > > > Is the graph showing us dbench tput or disk tput? I'm assuming it is
> > > > > > disk tput, so bdi may just be writing less?
> > > > > Good, question. I was assuming dbench throughput :).
> > > > >
> > > > > Honza
> > > >
> > > >
> > > > Yeah it's dbench. May be that's not the right tool to measure the writeback
> > > > layer, even though dbench results are necessarily influenced by the writeback
> > > > behaviour.
> > > >
> > > > May be I should use something else?
> > > >
> > > > Note that if you want I can put some surgicals trace_printk()
> > > > in fs/fs-writeback.c
> > >
> > > FWIW, I ran a similar test here just now. CFQ was used, two partitions
> > > on an (otherwise) idle drive. I used 30 clients per dbench and 600s
> > > runtime. Results are nearly identical, both throughout the run and
> > > total:
> > >
> > > /dev/sdb1
> > > Throughput 165.738 MB/sec 30 clients 30 procs max_latency=459.002 ms
> > >
> > > /dev/sdb2
> > > Throughput 165.773 MB/sec 30 clients 30 procs max_latency=607.198 ms
> > Hmm, interesting. 165 MB/sec (in fact 330 MB/sec for that drive) sounds
> > like quite a lot ;). This usually happens with dbench when the processes
> > manage to delete / redirty data before writeback thread gets to them (so
> > some IO happens in memory only and throughput is bound by the CPU / memory
> > speed). So I think you are on a different part of the performance curve
> > than Frederic. Probably you have to run with more threads so that dbench
> > threads get throttled because of total amount of dirty data generated...
>
> Certainly, the actual disk data rate was consistenctly in the
> 60-70MB/sec region. The issue is likely that the box has 6GB of RAM, if
> I boot with less than 30 clients will do.
>
> But unless the situation changes radically with memory pressure, it
> still shows a fair distribution of IO between the two. Since they have
> identical results throughout, it should be safe to assume that the have
> equal bandwidth distribution at the disk end. A fast dbench run is one
> that doesn't touch the disk at all, once you start touching disk you
> lose :-)



When I ran my tests, I only had 384 MB of memory, 100 threads and
only one CPU. So I was in constant writeback, which should
be smoother with 6 GB of memory and 30 threads.

Maybe that's why you got such a well-balanced result... Or maybe
there is too much entropy in my testbox :)



> --
> Jens Axboe
>