LinuxLists.cc - [RFC][PATCH] mm: stop balance_dirty

2009-06-24 10:38:35

Subject: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

When writing to 2 (or more) devices at the same time, stop
balance_dirty_pages moving dirty pages to writeback when it has reached
the bdi threshold. This prevents balance_dirty_pages overshooting its
limits and moving all dirty pages to writeback.

Signed-off-by: Richard Kennedy <[email protected]>
---
balance_dirty_pages can overreact and move all of the dirty pages to
writeback unnecessarily.

balance_dirty_pages makes its decision to throttle based on the number
of dirty plus writeback pages that are over the calculated limit,so it
will continue to move pages even when there are plenty of pages in
writeback and less than the threshold still dirty.

This allows it to overshoot its limits and move all the dirty pages to
writeback while waiting for the drives to catch up and empty the
writeback list.

A simple fio test easily demonstrates this problem.

fio --name=f1 --directory=/disk1 --size=2G -rw=write
--name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10

The attached graph before.png shows how all pages are moved to writeback
as the second write starts and the throttling kicks in.

after.png is the same test with the patch applied, which clearly shows
that it keeps dirty_background_ratio dirty pages in the buffer.
The values and timings of the graphs are only approximate but are good
enough to show the behaviour.

This is the simplest fix I could find, but I'm not entirely sure that it
alone will be enough for all cases. But it certainly is an improvement
on my desktop machine writing to 2 disks.

Do we need something more for machines with large arrays where
bdi_threshold * number_of_drives is greater than the dirty_ratio ?

regards
Richard

This patch against 2.6.30 git latest, built and tested on x86_64.

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7b0dcea..7687879 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -541,8 +541,11 @@ static void balance_dirty_pages(struct address_space *mapping)
* filesystems (i.e. NFS) in which data may have been
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
+ * Only move pages to writeback if this bdi is over its
+ * threshold otherwise wait until the disk writes catch
+ * up.
*/
- if (bdi_nr_reclaimable) {
+ if (bdi_nr_reclaimable > bdi_thresh) {
writeback_inodes(&wbc);
pages_written += write_chunk - wbc.nr_to_write;
get_dirty_limits(&background_thresh, &dirty_thresh,

Attachments:

before.png (2.55 kB)
after.png (2.19 kB)
Download all attachments

2009-06-24 22:28:19

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Wed, 24 Jun 2009 11:38:24 +0100
Richard Kennedy <[email protected]> wrote:

> When writing to 2 (or more) devices at the same time, stop
> balance_dirty_pages moving dirty pages to writeback when it has reached
> the bdi threshold. This prevents balance_dirty_pages overshooting its
> limits and moving all dirty pages to writeback.
>
>
> Signed-off-by: Richard Kennedy <[email protected]>
> ---
> balance_dirty_pages can overreact and move all of the dirty pages to
> writeback unnecessarily.
>
> balance_dirty_pages makes its decision to throttle based on the number
> of dirty plus writeback pages that are over the calculated limit,so it
> will continue to move pages even when there are plenty of pages in
> writeback and less than the threshold still dirty.
>
> This allows it to overshoot its limits and move all the dirty pages to
> writeback while waiting for the drives to catch up and empty the
> writeback list.
>
> A simple fio test easily demonstrates this problem.
>
> fio --name=f1 --directory=/disk1 --size=2G -rw=write
> --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10
>
> The attached graph before.png shows how all pages are moved to writeback
> as the second write starts and the throttling kicks in.
>
> after.png is the same test with the patch applied, which clearly shows
> that it keeps dirty_background_ratio dirty pages in the buffer.
> The values and timings of the graphs are only approximate but are good
> enough to show the behaviour.
>
> This is the simplest fix I could find, but I'm not entirely sure that it
> alone will be enough for all cases. But it certainly is an improvement
> on my desktop machine writing to 2 disks.
>
> Do we need something more for machines with large arrays where
> bdi_threshold * number_of_drives is greater than the dirty_ratio ?
>

um. Interesting find. Jens, was any of your performance testing using
multiple devices? If so, it looks like the results just got invalidated :)

>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 7b0dcea..7687879 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -541,8 +541,11 @@ static void balance_dirty_pages(struct address_space *mapping)
> * filesystems (i.e. NFS) in which data may have been
> * written to the server's write cache, but has not yet
> * been flushed to permanent storage.
> + * Only move pages to writeback if this bdi is over its
> + * threshold otherwise wait until the disk writes catch
> + * up.
> */
> - if (bdi_nr_reclaimable) {
> + if (bdi_nr_reclaimable > bdi_thresh) {
> writeback_inodes(&wbc);
> pages_written += write_chunk - wbc.nr_to_write;
> get_dirty_limits(&background_thresh, &dirty_thresh,

yup, we need to think about the effect with zillions of disks. Peter,
could you please take a look?

Also... get_dirty_limits() is rather hard to grok. The callers of
get_dirty_limits() treat its three return values as "thresholds", but
they're not named as thresholds within get_dirty_limits() itself, which
is a bit confusing. And the meaning of each of those return values is
pretty obscure from the code - could we document them please?

2009-06-25 05:17:00

by Jens Axboe

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Wed, Jun 24 2009, Andrew Morton wrote:
> On Wed, 24 Jun 2009 11:38:24 +0100
> Richard Kennedy <[email protected]> wrote:
>
> > When writing to 2 (or more) devices at the same time, stop
> > balance_dirty_pages moving dirty pages to writeback when it has reached
> > the bdi threshold. This prevents balance_dirty_pages overshooting its
> > limits and moving all dirty pages to writeback.
> >
> >
> > Signed-off-by: Richard Kennedy <[email protected]>
> > ---
> > balance_dirty_pages can overreact and move all of the dirty pages to
> > writeback unnecessarily.
> >
> > balance_dirty_pages makes its decision to throttle based on the number
> > of dirty plus writeback pages that are over the calculated limit,so it
> > will continue to move pages even when there are plenty of pages in
> > writeback and less than the threshold still dirty.
> >
> > This allows it to overshoot its limits and move all the dirty pages to
> > writeback while waiting for the drives to catch up and empty the
> > writeback list.
> >
> > A simple fio test easily demonstrates this problem.
> >
> > fio --name=f1 --directory=/disk1 --size=2G -rw=write
> > --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10
> >
> > The attached graph before.png shows how all pages are moved to writeback
> > as the second write starts and the throttling kicks in.
> >
> > after.png is the same test with the patch applied, which clearly shows
> > that it keeps dirty_background_ratio dirty pages in the buffer.
> > The values and timings of the graphs are only approximate but are good
> > enough to show the behaviour.
> >
> > This is the simplest fix I could find, but I'm not entirely sure that it
> > alone will be enough for all cases. But it certainly is an improvement
> > on my desktop machine writing to 2 disks.
> >
> > Do we need something more for machines with large arrays where
> > bdi_threshold * number_of_drives is greater than the dirty_ratio ?
> >
>
> um. Interesting find. Jens, was any of your performance testing using
> multiple devices? If so, it looks like the results just got invalidated :)

"invalidated" is a bit too much I think, skewed is more like it. But
most of my testing has been on a single spindle, only the very first
patches used 10 disks as a test base.

> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 7b0dcea..7687879 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -541,8 +541,11 @@ static void balance_dirty_pages(struct address_space *mapping)
> > * filesystems (i.e. NFS) in which data may have been
> > * written to the server's write cache, but has not yet
> > * been flushed to permanent storage.
> > + * Only move pages to writeback if this bdi is over its
> > + * threshold otherwise wait until the disk writes catch
> > + * up.
> > */
> > - if (bdi_nr_reclaimable) {
> > + if (bdi_nr_reclaimable > bdi_thresh) {
> > writeback_inodes(&wbc);
> > pages_written += write_chunk - wbc.nr_to_write;
> > get_dirty_limits(&background_thresh, &dirty_thresh,
>
> yup, we need to think about the effect with zillions of disks. Peter,
> could you please take a look?
>
> Also... get_dirty_limits() is rather hard to grok. The callers of
> get_dirty_limits() treat its three return values as "thresholds", but
> they're not named as thresholds within get_dirty_limits() itself, which
> is a bit confusing. And the meaning of each of those return values is
> pretty obscure from the code - could we document them please?

This is indeed a pretty interesting find!

--
Jens Axboe

2009-06-25 08:00:34

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> On Wed, 24 Jun 2009 11:38:24 +0100
> Richard Kennedy <[email protected]> wrote:
>
> > When writing to 2 (or more) devices at the same time, stop
> > balance_dirty_pages moving dirty pages to writeback when it has reached
> > the bdi threshold. This prevents balance_dirty_pages overshooting its
> > limits and moving all dirty pages to writeback.
> >
> >
> > Signed-off-by: Richard Kennedy <[email protected]>
> > ---

Acked-by: Peter Zijlstra <[email protected]>

[ moved explanation below ]

> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 7b0dcea..7687879 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -541,8 +541,11 @@ static void balance_dirty_pages(struct address_space *mapping)
> > * filesystems (i.e. NFS) in which data may have been
> > * written to the server's write cache, but has not yet
> > * been flushed to permanent storage.
> > + * Only move pages to writeback if this bdi is over its
> > + * threshold otherwise wait until the disk writes catch
> > + * up.
> > */
> > - if (bdi_nr_reclaimable) {
> > + if (bdi_nr_reclaimable > bdi_thresh) {
> > writeback_inodes(&wbc);
> > pages_written += write_chunk - wbc.nr_to_write;
> > get_dirty_limits(&background_thresh, &dirty_thresh,
>
> yup, we need to think about the effect with zillions of disks. Peter,
> could you please take a look?

Looks to have been in that form forever (immediate git history).

When reading the code I read it like:

if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
break;

if (nr_reclaimable + nr_writeback <
(background_thresh + dirty_thresh) / 2)
break;

if (bdi_nr_reclaimable) {
writeback_inodes(&wbc);

Which to me reads:

- if there's not enough to do, drop out
- see if background write-out can catch up, drop out
- is there anything to do, yay! work.

/me goes read the changelog, maybe there's a clue in there :-)

> > balance_dirty_pages can overreact and move all of the dirty pages to
> > writeback unnecessarily.
> >
> > balance_dirty_pages makes its decision to throttle based on the number
> > of dirty plus writeback pages that are over the calculated limit,so it
> > will continue to move pages even when there are plenty of pages in
> > writeback and less than the threshold still dirty.
> >
> > This allows it to overshoot its limits and move all the dirty pages to
> > writeback while waiting for the drives to catch up and empty the
> > writeback list.

Ahhh, indeed, how silly of me not to notice that before!

> > This is the simplest fix I could find, but I'm not entirely sure that it
> > alone will be enough for all cases. But it certainly is an improvement
> > on my desktop machine writing to 2 disks.

Seems good to me.

> > Do we need something more for machines with large arrays where
> > bdi_threshold * number_of_drives is greater than the dirty_ratio ?

[ I assumed s/dirty_ratio/dirty_thresh/, since dirty_ratio is a ratio
and bdi_threshold is an actual value, therefore the inequality above
doesn't make sense ]

That cannot actually happen (aside from small numerical glitches).

bdi_threshold = P_i * dirty_thresh, where \Sum P_i = 1

The proportion is relative to the recent writeout speed of the device.

On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> Also... get_dirty_limits() is rather hard to grok. The callers of
> get_dirty_limits() treat its three return values as "thresholds", but
> they're not named as thresholds within get_dirty_limits() itself, which
> is a bit confusing. And the meaning of each of those return values is
> pretty obscure from the code - could we document them please?

Does something like this help?

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7b0dcea..dc2cee1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -426,6 +426,13 @@ unsigned long determine_dirtyable_memory(void)
return x + 1; /* Ensure that we never return 0 */
}

+/*
+ * get_dirty_limits() - compute the various dirty limits
+ *
+ * @pbackground - dirty limit at which we want to start background write-out
+ * @pdirty - total dirty limit, we should not have more dirty than this
+ * @pdbi_dirty - the share of @pdirty available to @bdi
+ */
void
get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
unsigned long *pbdi_dirty, struct backing_dev_info *bdi)

2009-06-25 09:10:39

by Jens Axboe

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Thu, Jun 25 2009, Peter Zijlstra wrote:
> On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> > On Wed, 24 Jun 2009 11:38:24 +0100
> > Richard Kennedy <[email protected]> wrote:
> >
> > > When writing to 2 (or more) devices at the same time, stop
> > > balance_dirty_pages moving dirty pages to writeback when it has reached
> > > the bdi threshold. This prevents balance_dirty_pages overshooting its
> > > limits and moving all dirty pages to writeback.
> > >
> > >
> > > Signed-off-by: Richard Kennedy <[email protected]>
> > > ---
>
> Acked-by: Peter Zijlstra <[email protected]>

After doing some integration and update work on the writeback branch, I
threw 2.6.31-rc1, 2.6.31-rc1+patch, 2.6.31-rc1+writeback into the test
mix. The writeback series include this patch as a prep patch. Results
for the mmap write test case:

Kernel Throughput usr sys ctx util
--------------------------------------------------------------
vanilla 184MB/sec 19.51% 50.49% 12995 82.88%
vanilla 184MB/sec 19.60% 50.77% 12846 83.47%
vanilla 182MB/sec 19.25% 51.18% 14692 82.76%
vanilla+patch 169MB/sec 18.08% 43.61% 9507 76.38%
vanilla+patch 170MB/sec 18.37% 43.46% 10275 76.62%
vanilla+patch 165MB/sec 17.59% 42.06% 10165 74.39%
writeback 215MB/sec 22.69% 53.23% 4085 92.32%
writeback 214MB/sec 24.31% 52.90% 4495 92.40%
writeback 208MB/sec 23.14% 52.12% 4067 91.68%

To be perfectly clear:

vanilla 2.6.31-rc1 stock
vanilla+patch 2.6.31-rc1 + bdi_thresh patch
writeback 2.6.31-rc1 + bdi_thresh patch + writeback series

This is just a single spindle w/ext4, nothing fancy. I'll do a 3-series
run with the writeback and this patch backed out, to see if it makes a
difference here. I didn't do that initially, since the results were in
the range that I expected.

--
Jens Axboe

2009-06-25 09:26:48

by Jens Axboe

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Thu, Jun 25 2009, Jens Axboe wrote:
> On Thu, Jun 25 2009, Peter Zijlstra wrote:
> > On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> > > On Wed, 24 Jun 2009 11:38:24 +0100
> > > Richard Kennedy <[email protected]> wrote:
> > >
> > > > When writing to 2 (or more) devices at the same time, stop
> > > > balance_dirty_pages moving dirty pages to writeback when it has reached
> > > > the bdi threshold. This prevents balance_dirty_pages overshooting its
> > > > limits and moving all dirty pages to writeback.
> > > >
> > > >
> > > > Signed-off-by: Richard Kennedy <[email protected]>
> > > > ---
> >
> > Acked-by: Peter Zijlstra <[email protected]>
>
> After doing some integration and update work on the writeback branch, I
> threw 2.6.31-rc1, 2.6.31-rc1+patch, 2.6.31-rc1+writeback into the test
> mix. The writeback series include this patch as a prep patch. Results
> for the mmap write test case:
>
> Kernel Throughput usr sys ctx util
> --------------------------------------------------------------
> vanilla 184MB/sec 19.51% 50.49% 12995 82.88%
> vanilla 184MB/sec 19.60% 50.77% 12846 83.47%
> vanilla 182MB/sec 19.25% 51.18% 14692 82.76%
> vanilla+patch 169MB/sec 18.08% 43.61% 9507 76.38%
> vanilla+patch 170MB/sec 18.37% 43.46% 10275 76.62%
> vanilla+patch 165MB/sec 17.59% 42.06% 10165 74.39%
> writeback 215MB/sec 22.69% 53.23% 4085 92.32%
> writeback 214MB/sec 24.31% 52.90% 4495 92.40%
> writeback 208MB/sec 23.14% 52.12% 4067 91.68%
>
> To be perfectly clear:
>
> vanilla 2.6.31-rc1 stock
> vanilla+patch 2.6.31-rc1 + bdi_thresh patch
> writeback 2.6.31-rc1 + bdi_thresh patch + writeback series
>
> This is just a single spindle w/ext4, nothing fancy. I'll do a 3-series
> run with the writeback and this patch backed out, to see if it makes a
> difference here. I didn't do that initially, since the results were in
> the range that I expected.

Results for writeback without the bdi_thresh patch

Kernel Throughput usr sys ctx util
--------------------------------------------------------------
wb-bdi_thresh 211MB/sec 22.71% 53.30% 4050 91.19%
wb-bdi_thresh 212MB/sec 22.78% 53.55% 4809 91.51%
wb-bdi_thresh 212MB/sec 22.99% 54.23% 4715 93.10%

Not a lot of difference there, without more than three runs it's hard to
say what is significant. Could be a small decrease in throughput, if the
208MB/sec results from above is an outlier (I think it is, ~215MB/sec is
usually the most consistent result).

--
Jens Axboe

2009-06-25 12:32:17

by Al Boldi

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

Jens Axboe wrote:
> On Thu, Jun 25 2009, Jens Axboe wrote:
> > On Thu, Jun 25 2009, Peter Zijlstra wrote:
> > > On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> > > > On Wed, 24 Jun 2009 11:38:24 +0100
> > > >
> > > > Richard Kennedy <[email protected]> wrote:
> > > > > When writing to 2 (or more) devices at the same time, stop
> > > > > balance_dirty_pages moving dirty pages to writeback when it has
> > > > > reached the bdi threshold. This prevents balance_dirty_pages
> > > > > overshooting its limits and moving all dirty pages to writeback.
> > > > >
> > > > >
> > > > > Signed-off-by: Richard Kennedy <[email protected]>
> > > > > ---
> > >
> > > Acked-by: Peter Zijlstra <[email protected]>
> >
> > After doing some integration and update work on the writeback branch, I
> > threw 2.6.31-rc1, 2.6.31-rc1+patch, 2.6.31-rc1+writeback into the test
> > mix. The writeback series include this patch as a prep patch. Results
> > for the mmap write test case:
> >
> > Kernel Throughput usr sys ctx util
> > --------------------------------------------------------------
> > vanilla 184MB/sec 19.51% 50.49% 12995 82.88%
> > vanilla 184MB/sec 19.60% 50.77% 12846 83.47%
> > vanilla 182MB/sec 19.25% 51.18% 14692 82.76%
> > vanilla+patch 169MB/sec 18.08% 43.61% 9507 76.38%
> > vanilla+patch 170MB/sec 18.37% 43.46% 10275 76.62%
> > vanilla+patch 165MB/sec 17.59% 42.06% 10165 74.39%
> > writeback 215MB/sec 22.69% 53.23% 4085 92.32%
> > writeback 214MB/sec 24.31% 52.90% 4495 92.40%
> > writeback 208MB/sec 23.14% 52.12% 4067 91.68%
> >
> > To be perfectly clear:
> >
> > vanilla 2.6.31-rc1 stock
> > vanilla+patch 2.6.31-rc1 + bdi_thresh patch
> > writeback 2.6.31-rc1 + bdi_thresh patch + writeback series
> >
> > This is just a single spindle w/ext4, nothing fancy. I'll do a 3-series
> > run with the writeback and this patch backed out, to see if it makes a
> > difference here. I didn't do that initially, since the results were in
> > the range that I expected.
>
> Results for writeback without the bdi_thresh patch
>
> Kernel Throughput usr sys ctx util
> --------------------------------------------------------------
> wb-bdi_thresh 211MB/sec 22.71% 53.30% 4050 91.19%
> wb-bdi_thresh 212MB/sec 22.78% 53.55% 4809 91.51%
> wb-bdi_thresh 212MB/sec 22.99% 54.23% 4715 93.10%
>
> Not a lot of difference there, without more than three runs it's hard to
> say what is significant. Could be a small decrease in throughput, if the
> 208MB/sec results from above is an outlier (I think it is, ~215MB/sec is
> usually the most consistent result).

What's the iowait on these runs?

Thanks!

--
Al

2009-06-25 12:43:47

by Jens Axboe

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Thu, Jun 25 2009, Al Boldi wrote:
> Jens Axboe wrote:
> > On Thu, Jun 25 2009, Jens Axboe wrote:
> > > On Thu, Jun 25 2009, Peter Zijlstra wrote:
> > > > On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> > > > > On Wed, 24 Jun 2009 11:38:24 +0100
> > > > >
> > > > > Richard Kennedy <[email protected]> wrote:
> > > > > > When writing to 2 (or more) devices at the same time, stop
> > > > > > balance_dirty_pages moving dirty pages to writeback when it has
> > > > > > reached the bdi threshold. This prevents balance_dirty_pages
> > > > > > overshooting its limits and moving all dirty pages to writeback.
> > > > > >
> > > > > >
> > > > > > Signed-off-by: Richard Kennedy <[email protected]>
> > > > > > ---
> > > >
> > > > Acked-by: Peter Zijlstra <[email protected]>
> > >
> > > After doing some integration and update work on the writeback branch, I
> > > threw 2.6.31-rc1, 2.6.31-rc1+patch, 2.6.31-rc1+writeback into the test
> > > mix. The writeback series include this patch as a prep patch. Results
> > > for the mmap write test case:
> > >
> > > Kernel Throughput usr sys ctx util
> > > --------------------------------------------------------------
> > > vanilla 184MB/sec 19.51% 50.49% 12995 82.88%
> > > vanilla 184MB/sec 19.60% 50.77% 12846 83.47%
> > > vanilla 182MB/sec 19.25% 51.18% 14692 82.76%
> > > vanilla+patch 169MB/sec 18.08% 43.61% 9507 76.38%
> > > vanilla+patch 170MB/sec 18.37% 43.46% 10275 76.62%
> > > vanilla+patch 165MB/sec 17.59% 42.06% 10165 74.39%
> > > writeback 215MB/sec 22.69% 53.23% 4085 92.32%
> > > writeback 214MB/sec 24.31% 52.90% 4495 92.40%
> > > writeback 208MB/sec 23.14% 52.12% 4067 91.68%
> > >
> > > To be perfectly clear:
> > >
> > > vanilla 2.6.31-rc1 stock
> > > vanilla+patch 2.6.31-rc1 + bdi_thresh patch
> > > writeback 2.6.31-rc1 + bdi_thresh patch + writeback series
> > >
> > > This is just a single spindle w/ext4, nothing fancy. I'll do a 3-series
> > > run with the writeback and this patch backed out, to see if it makes a
> > > difference here. I didn't do that initially, since the results were in
> > > the range that I expected.
> >
> > Results for writeback without the bdi_thresh patch
> >
> > Kernel Throughput usr sys ctx util
> > --------------------------------------------------------------
> > wb-bdi_thresh 211MB/sec 22.71% 53.30% 4050 91.19%
> > wb-bdi_thresh 212MB/sec 22.78% 53.55% 4809 91.51%
> > wb-bdi_thresh 212MB/sec 22.99% 54.23% 4715 93.10%
> >
> > Not a lot of difference there, without more than three runs it's hard to
> > say what is significant. Could be a small decrease in throughput, if the
> > 208MB/sec results from above is an outlier (I think it is, ~215MB/sec is
> > usually the most consistent result).
>
> What's the iowait on these runs?

Not sure, I didn't check. Why do you ask?

--
Jens Axboe

2009-06-25 13:45:31

by Al Boldi

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

Jens Axboe wrote:
> On Thu, Jun 25 2009, Al Boldi wrote:
> > Jens Axboe wrote:
> > > On Thu, Jun 25 2009, Jens Axboe wrote:
> > > > On Thu, Jun 25 2009, Peter Zijlstra wrote:
> > > > > On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> > > > > > On Wed, 24 Jun 2009 11:38:24 +0100
> > > > > >
> > > > > > Richard Kennedy <[email protected]> wrote:
> > > > > > > When writing to 2 (or more) devices at the same time, stop
> > > > > > > balance_dirty_pages moving dirty pages to writeback when it has
> > > > > > > reached the bdi threshold. This prevents balance_dirty_pages
> > > > > > > overshooting its limits and moving all dirty pages to
> > > > > > > writeback.
> > > > > > >
> > > > > > >
> > > > > > > Signed-off-by: Richard Kennedy <[email protected]>
> > > > > > > ---
> > > > >
> > > > > Acked-by: Peter Zijlstra <[email protected]>
> > > >
> > > > After doing some integration and update work on the writeback branch,
> > > > I threw 2.6.31-rc1, 2.6.31-rc1+patch, 2.6.31-rc1+writeback into the
> > > > test mix. The writeback series include this patch as a prep patch.
> > > > Results for the mmap write test case:
> > > >
> > > > Kernel Throughput usr sys ctx util
> > > > --------------------------------------------------------------
> > > > vanilla 184MB/sec 19.51% 50.49% 12995 82.88%
> > > > vanilla 184MB/sec 19.60% 50.77% 12846 83.47%
> > > > vanilla 182MB/sec 19.25% 51.18% 14692 82.76%
> > > > vanilla+patch 169MB/sec 18.08% 43.61% 9507 76.38%
> > > > vanilla+patch 170MB/sec 18.37% 43.46% 10275 76.62%
> > > > vanilla+patch 165MB/sec 17.59% 42.06% 10165 74.39%
> > > > writeback 215MB/sec 22.69% 53.23% 4085 92.32%
> > > > writeback 214MB/sec 24.31% 52.90% 4495 92.40%
> > > > writeback 208MB/sec 23.14% 52.12% 4067 91.68%
> > > >
> > > > To be perfectly clear:
> > > >
> > > > vanilla 2.6.31-rc1 stock
> > > > vanilla+patch 2.6.31-rc1 + bdi_thresh patch
> > > > writeback 2.6.31-rc1 + bdi_thresh patch + writeback series
> > > >
> > > > This is just a single spindle w/ext4, nothing fancy. I'll do a
> > > > 3-series run with the writeback and this patch backed out, to see if
> > > > it makes a difference here. I didn't do that initially, since the
> > > > results were in the range that I expected.
> > >
> > > Results for writeback without the bdi_thresh patch
> > >
> > > Kernel Throughput usr sys ctx util
> > > --------------------------------------------------------------
> > > wb-bdi_thresh 211MB/sec 22.71% 53.30% 4050 91.19%
> > > wb-bdi_thresh 212MB/sec 22.78% 53.55% 4809 91.51%
> > > wb-bdi_thresh 212MB/sec 22.99% 54.23% 4715 93.10%
> > >
> > > Not a lot of difference there, without more than three runs it's hard
> > > to say what is significant. Could be a small decrease in throughput, if
> > > the 208MB/sec results from above is an outlier (I think it is,
> > > ~215MB/sec is usually the most consistent result).
> >
> > What's the iowait on these runs?
>
> Not sure, I didn't check. Why do you ask?

iowait gives you an indication of seekactivity.

Thanks!

--
Al

2009-06-25 14:44:56

by Jens Axboe

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Thu, Jun 25 2009, Al Boldi wrote:
> Jens Axboe wrote:
> > On Thu, Jun 25 2009, Al Boldi wrote:
> > > Jens Axboe wrote:
> > > > On Thu, Jun 25 2009, Jens Axboe wrote:
> > > > > On Thu, Jun 25 2009, Peter Zijlstra wrote:
> > > > > > On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> > > > > > > On Wed, 24 Jun 2009 11:38:24 +0100
> > > > > > >
> > > > > > > Richard Kennedy <[email protected]> wrote:
> > > > > > > > When writing to 2 (or more) devices at the same time, stop
> > > > > > > > balance_dirty_pages moving dirty pages to writeback when it has
> > > > > > > > reached the bdi threshold. This prevents balance_dirty_pages
> > > > > > > > overshooting its limits and moving all dirty pages to
> > > > > > > > writeback.
> > > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Richard Kennedy <[email protected]>
> > > > > > > > ---
> > > > > >
> > > > > > Acked-by: Peter Zijlstra <[email protected]>
> > > > >
> > > > > After doing some integration and update work on the writeback branch,
> > > > > I threw 2.6.31-rc1, 2.6.31-rc1+patch, 2.6.31-rc1+writeback into the
> > > > > test mix. The writeback series include this patch as a prep patch.
> > > > > Results for the mmap write test case:
> > > > >
> > > > > Kernel Throughput usr sys ctx util
> > > > > --------------------------------------------------------------
> > > > > vanilla 184MB/sec 19.51% 50.49% 12995 82.88%
> > > > > vanilla 184MB/sec 19.60% 50.77% 12846 83.47%
> > > > > vanilla 182MB/sec 19.25% 51.18% 14692 82.76%
> > > > > vanilla+patch 169MB/sec 18.08% 43.61% 9507 76.38%
> > > > > vanilla+patch 170MB/sec 18.37% 43.46% 10275 76.62%
> > > > > vanilla+patch 165MB/sec 17.59% 42.06% 10165 74.39%
> > > > > writeback 215MB/sec 22.69% 53.23% 4085 92.32%
> > > > > writeback 214MB/sec 24.31% 52.90% 4495 92.40%
> > > > > writeback 208MB/sec 23.14% 52.12% 4067 91.68%
> > > > >
> > > > > To be perfectly clear:
> > > > >
> > > > > vanilla 2.6.31-rc1 stock
> > > > > vanilla+patch 2.6.31-rc1 + bdi_thresh patch
> > > > > writeback 2.6.31-rc1 + bdi_thresh patch + writeback series
> > > > >
> > > > > This is just a single spindle w/ext4, nothing fancy. I'll do a
> > > > > 3-series run with the writeback and this patch backed out, to see if
> > > > > it makes a difference here. I didn't do that initially, since the
> > > > > results were in the range that I expected.
> > > >
> > > > Results for writeback without the bdi_thresh patch
> > > >
> > > > Kernel Throughput usr sys ctx util
> > > > --------------------------------------------------------------
> > > > wb-bdi_thresh 211MB/sec 22.71% 53.30% 4050 91.19%
> > > > wb-bdi_thresh 212MB/sec 22.78% 53.55% 4809 91.51%
> > > > wb-bdi_thresh 212MB/sec 22.99% 54.23% 4715 93.10%
> > > >
> > > > Not a lot of difference there, without more than three runs it's hard
> > > > to say what is significant. Could be a small decrease in throughput, if
> > > > the 208MB/sec results from above is an outlier (I think it is,
> > > > ~215MB/sec is usually the most consistent result).
> > >
> > > What's the iowait on these runs?
> >
> > Not sure, I didn't check. Why do you ask?
>
> iowait gives you an indication of seekactivity.

The test case is random mmap writes to files that have been laid out
sequentially. So it's all seeks. The target drive is an SSD disk though,
so it doesn't matter a whole lot (it's a good SSD).

--
Jens Axboe

2009-06-25 17:09:33

by Al Boldi

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

Jens Axboe wrote:
> The test case is random mmap writes to files that have been laid out
> sequentially. So it's all seeks. The target drive is an SSD disk though,
> so it doesn't matter a whole lot (it's a good SSD).

Oh, SSD. What numbers do you get for normal disks?

Thanks!

--
Al

2009-06-26 05:02:52

by Jens Axboe

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Thu, Jun 25 2009, Al Boldi wrote:
> Jens Axboe wrote:
> > The test case is random mmap writes to files that have been laid out
> > sequentially. So it's all seeks. The target drive is an SSD disk though,
> > so it doesn't matter a whole lot (it's a good SSD).
>
> Oh, SSD. What numbers do you get for normal disks?

I haven't run this particular test on rotating storage. The type of
drive should not matter a lot, I'm mostly interested in comparing
vanilla and the writeback patches on identical workloads and storage.

--
Jens Axboe

2009-06-26 09:16:04

by Richard Kennedy

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Thu, 2009-06-25 at 11:10 +0200, Jens Axboe wrote:
> On Thu, Jun 25 2009, Peter Zijlstra wrote:
> > On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> > > On Wed, 24 Jun 2009 11:38:24 +0100
> > > Richard Kennedy <[email protected]> wrote:
> > >
> > > > When writing to 2 (or more) devices at the same time, stop
> > > > balance_dirty_pages moving dirty pages to writeback when it has reached
> > > > the bdi threshold. This prevents balance_dirty_pages overshooting its
> > > > limits and moving all dirty pages to writeback.
> > > >
> > > >
> > > > Signed-off-by: Richard Kennedy <[email protected]>
> > > > ---
> >
> > Acked-by: Peter Zijlstra <[email protected]>
>
> After doing some integration and update work on the writeback branch, I
> threw 2.6.31-rc1, 2.6.31-rc1+patch, 2.6.31-rc1+writeback into the test
> mix. The writeback series include this patch as a prep patch. Results
> for the mmap write test case:
>
> Kernel Throughput usr sys ctx util
> --------------------------------------------------------------
> vanilla 184MB/sec 19.51% 50.49% 12995 82.88%
> vanilla 184MB/sec 19.60% 50.77% 12846 83.47%
> vanilla 182MB/sec 19.25% 51.18% 14692 82.76%
> vanilla+patch 169MB/sec 18.08% 43.61% 9507 76.38%
> vanilla+patch 170MB/sec 18.37% 43.46% 10275 76.62%
> vanilla+patch 165MB/sec 17.59% 42.06% 10165 74.39%
> writeback 215MB/sec 22.69% 53.23% 4085 92.32%
> writeback 214MB/sec 24.31% 52.90% 4495 92.40%
> writeback 208MB/sec 23.14% 52.12% 4067 91.68%
>
> To be perfectly clear:
>
> vanilla 2.6.31-rc1 stock
> vanilla+patch 2.6.31-rc1 + bdi_thresh patch
> writeback 2.6.31-rc1 + bdi_thresh patch + writeback series
>
> This is just a single spindle w/ext4, nothing fancy. I'll do a 3-series
> run with the writeback and this patch backed out, to see if it makes a
> difference here. I didn't do that initially, since the results were in
> the range that I expected.

Intriguing numbers. It would tell us a lot if we could find out why
vanilla + patch is slower than vanilla. I'll run some tests using mmap
and see if I can find anything.
What block size are you using ?

I see that the last test of each group is the slowest. I wonder if this
is showing a slowdown over time or just noise? Any chance you could run
more tests in each group?

regards
Richard

2009-06-26 09:20:29

by Jens Axboe

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Fri, Jun 26 2009, Richard Kennedy wrote:
> On Thu, 2009-06-25 at 11:10 +0200, Jens Axboe wrote:
> > On Thu, Jun 25 2009, Peter Zijlstra wrote:
> > > On Wed, 2009-06-24 at 15:27 -0700, Andrew Morton wrote:
> > > > On Wed, 24 Jun 2009 11:38:24 +0100
> > > > Richard Kennedy <[email protected]> wrote:
> > > >
> > > > > When writing to 2 (or more) devices at the same time, stop
> > > > > balance_dirty_pages moving dirty pages to writeback when it has reached
> > > > > the bdi threshold. This prevents balance_dirty_pages overshooting its
> > > > > limits and moving all dirty pages to writeback.
> > > > >
> > > > >
> > > > > Signed-off-by: Richard Kennedy <[email protected]>
> > > > > ---
> > >
> > > Acked-by: Peter Zijlstra <[email protected]>
> >
> > After doing some integration and update work on the writeback branch, I
> > threw 2.6.31-rc1, 2.6.31-rc1+patch, 2.6.31-rc1+writeback into the test
> > mix. The writeback series include this patch as a prep patch. Results
> > for the mmap write test case:
> >
> > Kernel Throughput usr sys ctx util
> > --------------------------------------------------------------
> > vanilla 184MB/sec 19.51% 50.49% 12995 82.88%
> > vanilla 184MB/sec 19.60% 50.77% 12846 83.47%
> > vanilla 182MB/sec 19.25% 51.18% 14692 82.76%
> > vanilla+patch 169MB/sec 18.08% 43.61% 9507 76.38%
> > vanilla+patch 170MB/sec 18.37% 43.46% 10275 76.62%
> > vanilla+patch 165MB/sec 17.59% 42.06% 10165 74.39%
> > writeback 215MB/sec 22.69% 53.23% 4085 92.32%
> > writeback 214MB/sec 24.31% 52.90% 4495 92.40%
> > writeback 208MB/sec 23.14% 52.12% 4067 91.68%
> >
> > To be perfectly clear:
> >
> > vanilla 2.6.31-rc1 stock
> > vanilla+patch 2.6.31-rc1 + bdi_thresh patch
> > writeback 2.6.31-rc1 + bdi_thresh patch + writeback series
> >
> > This is just a single spindle w/ext4, nothing fancy. I'll do a 3-series
> > run with the writeback and this patch backed out, to see if it makes a
> > difference here. I didn't do that initially, since the results were in
> > the range that I expected.
>
> Intriguing numbers. It would tell us a lot if we could find out why
> vanilla + patch is slower than vanilla. I'll run some tests using mmap
> and see if I can find anything.
> What block size are you using ?

It's using 4kb block size.

> I see that the last test of each group is the slowest. I wonder if this
> is showing a slowdown over time or just noise? Any chance you could run
> more tests in each group?

The runs are actually inverted, so the last entry is the first run. It's
a bit confusing. So the first run is usually the odd one out, after that
they are stable.

--
Jens Axboe

2009-06-26 11:36:54

by Al Boldi

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

Jens Axboe wrote:
> On Thu, Jun 25 2009, Al Boldi wrote:
> > Jens Axboe wrote:
> > > The test case is random mmap writes to files that have been laid out
> > > sequentially. So it's all seeks. The target drive is an SSD disk
> > > though, so it doesn't matter a whole lot (it's a good SSD).
> >
> > Oh, SSD. What numbers do you get for normal disks?
>
> I haven't run this particular test on rotating storage. The type of
> drive should not matter a lot, I'm mostly interested in comparing
> vanilla and the writeback patches on identical workloads and storage.

I think drive type matters a lot. Access strategy on drives with high seek
delays differs from those with no seek delays. So it would probably be of
interest to see this test run on rotating storage, unless the writeback
patches are only meant for SSD?

Thanks!

--
Al

2009-06-26 12:35:37

by Jens Axboe

[permalink] [raw]

Subject: Re: [RFC][PATCH] mm: stop balance_dirty_pages doing too much work

On Fri, Jun 26 2009, Al Boldi wrote:
> Jens Axboe wrote:
> > On Thu, Jun 25 2009, Al Boldi wrote:
> > > Jens Axboe wrote:
> > > > The test case is random mmap writes to files that have been laid out
> > > > sequentially. So it's all seeks. The target drive is an SSD disk
> > > > though, so it doesn't matter a whole lot (it's a good SSD).
> > >
> > > Oh, SSD. What numbers do you get for normal disks?
> >
> > I haven't run this particular test on rotating storage. The type of
> > drive should not matter a lot, I'm mostly interested in comparing
> > vanilla and the writeback patches on identical workloads and storage.
>
> I think drive type matters a lot. Access strategy on drives with high seek
> delays differs from those with no seek delays. So it would probably be of
> interest to see this test run on rotating storage, unless the writeback
> patches are only meant for SSD?

Don't get me wrong, I've tested a lot on rotating drives too. The
changelog for one of the patches even mentions results and vmstat for a
10 disk setup, all rotating disks.

What I'm saying is that this particular test is tailored for SSD's and
writeback. The patchset is not for SSD's in particular, it applies
equally across the board (even for non-block IO, like NFS).

--
Jens Axboe