2009-10-14 11:10:29

by Richard Kennedy

Subject: bdi_threshold slow to reach steady state

Hi Peter,

I've been running simple tests that use fio to write 2GB while reading
the bdi dirty threshold once a second from debugfs.

The graph of bdi dirty threshold is nice and smooth but takes a long
time to reach a steady state, 60 seconds or more. (run on 2.6.32-rc4)

By eye it seems as though a first-order control system is a good model
for its behavior, so it approximates to 1-e^(-t/T). It just seems too
heavily damped (at least on my machine).

For fun, I changed calc_period_shift to
return ilog2(dirty_total - 1) - 2;

and it now reaches a steady state much quicker, around 4-5 seconds.

Tests that write to 2 disks at the same time show no significant
performance differences but are much more consistent, i.e. the standard
deviation is lower across multiple runs.

I have noticed that the first test run on a freshly booted machine is
always the slowest of any sequence of tests, but this change to
calc_period_shift greatly reduces this effect.

So I wondered how you chose these values, and whether there are any
other tests that would be useful for exploring this?

I know that my machine is getting a bit old now, it's an AMD X2 and only
has SATA 150 drives, so I'm not suggesting that this change is going to
be correct for all machines, but maybe we can set a better default, or
take more factors into account than just memory size?

BTW why is it ilog2(dirty_total -1) -- what does the -1 do?

regards
Richard
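
For readers who want to see what the first-order model above implies,
here is a minimal sketch in C. It is not kernel code: the function name
steps_to_reach, the time constant tau, and the unit step input are all
assumptions made for illustration. It iterates a discrete form of
1-e^(-t/T) and counts the steps needed to reach a given fraction of the
final value.

```c
/*
 * Illustrative only: a discrete first-order lag, y += (1 - y) / tau,
 * driven by a unit step. This models the shape 1 - e^(-t/T) described
 * above; it is NOT how the kernel computes bdi thresholds.
 */

/* Count steps until the response reaches `target` (0 < target < 1, tau >= 1). */
int steps_to_reach(double tau, double target)
{
	double y = 0.0;
	int steps = 0;

	while (y < target) {
		y += (1.0 - y) / tau;	/* move a fraction 1/tau toward 1.0 */
		steps++;
	}
	return steps;
}
```

With tau = 10 the sketch needs 29 steps to reach 95% of the final value,
and with tau = 160 it needs roughly 16 times as many, so in this model
the settling time scales with the time constant.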


2009-10-14 11:38:49

by Peter Zijlstra

Subject: Re: bdi_threshold slow to reach steady state

On Wed, 2009-10-14 at 12:09 +0100, Richard Kennedy wrote:
> Hi Peter,
>
> I've been running simple tests that use fio to write 2GB while reading
> the bdi dirty threshold once a second from debugfs.
>
> The graph of bdi dirty threshold is nice and smooth but takes a long
> time to reach a steady state, 60 seconds or more. (run on 2.6.32-rc4)
>
> By eye it seems as though a first-order control system is a good model
> for its behavior, so it approximates to 1-e^(-t/T). It just seems too
> heavily damped (at least on my machine).
>
> For fun, I changed calc_period_shift to
> return ilog2(dirty_total - 1) - 2;
>
> and it now reaches a steady state much quicker, around 4-5 seconds.
>
> Tests that write to 2 disks at the same time show no significant
> performance differences but are much more consistent, i.e. the standard
> deviation is lower across multiple runs.
>
> I have noticed that the first test run on a freshly booted machine is
> always the slowest of any sequence of tests, but this change to
> calc_period_shift greatly reduces this effect.
>
> So I wondered how you chose these values, and whether there are any
> other tests that would be useful for exploring this?

Right, so we measure time in page writeback completions, and the measure
I used was dirty_thresh rounded up to a power of two. We adjust in
the same time it takes to write out a full dirty_thresh amount of data.

The idea was that people would scale their dirty_thresh according to
their writeout capacity, etc.

Martin J Bligh complained about this very same issue and I told them to
experiment with that same scale function. But I guess the result of that
got lost in the Google filter (stuff goes in, nothing ever comes back
out).

Anyway, the dirty_thresh relation seems sensible still, but the exact
parameters could be poked at. I have no objection to reducing the period
by a factor of 16 like you did, except that we need some more
feedback, preferably from people with more than a few spindles.

(The initial ramp will be roughly twice as slow, since the steady state
of this approximation is half-full).

> I know that my machine is getting a bit old now, it's an AMD X2 and only
> has SATA 150 drives, so I'm not suggesting that this change is going to
> be correct for all machines, but maybe we can set a better default, or
> take more factors into account than just memory size?
>
> BTW why is it ilog2(dirty_total -1) -- what does the -1 do?

http://lkml.org/lkml/2007/1/26/143
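
One way to read the expression, offered as an illustration of the
arithmetic rather than a summary of the linked message: ilog2() returns
the floor of log2, so subtracting 1 first keeps an exact power of two
from being bumped up to the next one when rounding up. The sketch below
uses a hypothetical userspace helper, ilog2_ul, as a stand-in for the
kernel's ilog2(); roundup_log2 is likewise a made-up name.

```c
/* Hypothetical userspace stand-in for the kernel's ilog2(): floor(log2(x)), x > 0. */
unsigned int ilog2_ul(unsigned long x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

/* Smallest n such that (1UL << n) >= x, valid for x > 1. */
unsigned int roundup_log2(unsigned long x)
{
	return 1 + ilog2_ul(x - 1);
}
```

For x = 1024 this gives 10, whereas 1 + ilog2_ul(1024) without the
subtraction would give 11; for x = 1025 both forms give 11.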

2009-10-14 13:56:04

by Richard Kennedy

Subject: Re: bdi_threshold slow to reach steady state

On Wed, 2009-10-14 at 13:37 +0200, Peter Zijlstra wrote:
> On Wed, 2009-10-14 at 12:09 +0100, Richard Kennedy wrote:
> > Hi Peter,

>
> Right, so we measure time in page writeback completions, and the measure
> I used was dirty_thresh rounded up to a power of two. We adjust in
> the same time it takes to write out a full dirty_thresh amount of data.
>
> The idea was that people would scale their dirty_thresh according to
> their writeout capacity, etc.
>
> Martin J Bligh complained about this very same issue and I told them to
> experiment with that same scale function. But I guess the result of that
> got lost in the Google filter (stuff goes in, nothing ever comes back
> out).
>
> Anyway, the dirty_thresh relation seems sensible still, but the exact
> parameters could be poked at. I have no objection to reducing the period
> by a factor of 16 like you did, except that we need some more
> feedback, preferably from people with more than a few spindles.

Sure, hopefully big fast machines have large amounts of memory so it
should be a good fit.

Yes, it would be good if someone with a big box tested this ;)
Here's a patch just in case anyone does feel like giving it a spin.

> (The initial ramp will be roughly twice as slow, since the steady state
> of this approximation is half-full).
>
> > I know that my machine is getting a bit old now, it's an AMD X2 and only
> > has SATA 150 drives, so I'm not suggesting that this change is going to
> > be correct for all machines, but maybe we can set a better default, or
> > take more factors into account than just memory size?
> >
> > BTW why is it ilog2(dirty_total -1) -- what does the -1 do?
>
> http://lkml.org/lkml/2007/1/26/143
>
thanks for that
regards
Richard

(patch against 2.6.32-rc4)

commit 11735a2336ba08cf21aebf79a706c86aca5e44b2
Author: Richard Kennedy <[email protected]>
Date: Wed Oct 14 14:46:21 2009 +0100

mm: speed up per bdi dirty threshold calculations


Signed-off-by: Richard Kennedy <[email protected]>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a3b1409..018024e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -144,7 +144,7 @@ static int calc_period_shift(void)
 	else
 		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
 				100;
-	return 2 + ilog2(dirty_total - 1);
+	return ilog2(dirty_total - 1) - 2;
 }

/*
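
To make the size of this change concrete: the two return expressions
differ by 4 in the shift, so a period taken as 1 << shift shrinks by a
factor of 2^4 = 16. Below is a small userspace sketch. The names
ilog2_ul, old_shift, and new_shift are hypothetical, and treating the
period as 1UL << shift is an assumption made for illustration, not a
statement about how the kernel consumes the value.

```c
/* Hypothetical userspace stand-in for the kernel's ilog2(): floor(log2(x)), x > 0. */
unsigned int ilog2_ul(unsigned long x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

/* Shift before the patch: 2 + ilog2(dirty_total - 1). Requires dirty_total > 1. */
int old_shift(unsigned long dirty_total)
{
	return 2 + (int)ilog2_ul(dirty_total - 1);
}

/*
 * Shift after the patch: ilog2(dirty_total - 1) - 2.
 * In this sketch the result is negative when dirty_total < 5,
 * so callers are assumed to pass dirty_total >= 5.
 */
int new_shift(unsigned long dirty_total)
{
	return (int)ilog2_ul(dirty_total - 1) - 2;
}
```

For example, with dirty_total = 100000 pages, old_shift() returns 18 and
new_shift() returns 14, so 1UL << shift drops from 262144 to 16384.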

2009-10-14 14:05:31

by Peter Zijlstra

Subject: Re: bdi_threshold slow to reach steady state

On Wed, 2009-10-14 at 14:55 +0100, Richard Kennedy wrote:
>
> commit 11735a2336ba08cf21aebf79a706c86aca5e44b2
> Author: Richard Kennedy <[email protected]>
> Date: Wed Oct 14 14:46:21 2009 +0100
>
> mm: speed up per bdi dirty threshold calculations

I think the subject is confusing; we don't actually compute things
faster in the fewer-cycles sense.

We reduce the damping of the control system, yielding faster
convergence.

> Signed-off-by: Richard Kennedy <[email protected]>
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a3b1409..018024e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -144,7 +144,7 @@ static int calc_period_shift(void)
> else
> dirty_total = (vm_dirty_ratio *
> determine_dirtyable_memory()) /
> 100;
> - return 2 + ilog2(dirty_total - 1);
> + return ilog2(dirty_total - 1) - 2;
> }

2009-10-15 09:23:22

by Richard Kennedy

Subject: Re: bdi_threshold slow to reach steady state

On Wed, 2009-10-14 at 16:04 +0200, Peter Zijlstra wrote:
> On Wed, 2009-10-14 at 14:55 +0100, Richard Kennedy wrote:
> >
> > commit 11735a2336ba08cf21aebf79a706c86aca5e44b2
> > Author: Richard Kennedy <[email protected]>
> > Date: Wed Oct 14 14:46:21 2009 +0100
> >
> > mm: speed up per bdi dirty threshold calculations
>
> I think the subject is confusing; we don't actually compute things
> faster in the fewer-cycles sense.
>
> We reduce the damping of the control system, yielding faster
> convergence.
Ah yes, sorry about that. That was a bit of a placeholder.

I'll write a proper change log & re-post.
regards
Richard