2011-04-13 09:07:26

by Fengguang Wu

Subject: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

Reduce the dampening for the control system, yielding faster
convergence. The change is a bit conservative, as smaller values may
lead to noticeable bdi threshold fluctuations in low-memory JBOD setups.

CC: Peter Zijlstra <[email protected]>
CC: Richard Kennedy <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-03-02 14:52:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-03-02 15:00:17.000000000 +0800
@@ -145,7 +145,7 @@ static int calc_period_shift(void)
 	else
 		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
 				100;
-	return 2 + ilog2(dirty_total - 1);
+	return ilog2(dirty_total - 1);
 }
 
 /*
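
For reference, the proportion period is roughly 2^shift completion events, so dropping the "+2" shortens it by a factor of 4. A minimal userspace sketch of the arithmetic (the 4GB box and 20% dirty ratio are illustrative assumptions, and ilog2() is reimplemented here since the kernel helper is not available in userspace):

#include <stdio.h>

/* floor(log2(v)), mimicking the kernel's ilog2() for this sketch */
static int ilog2(unsigned long v)
{
	int r = -1;

	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

int main(void)
{
	unsigned long dirtyable = 1000000;                 /* pages: ~4GB of dirtyable memory (assumed) */
	unsigned long dirty_total = 20 * dirtyable / 100;  /* 200,000 pages at vm_dirty_ratio = 20 */
	int old_shift = 2 + ilog2(dirty_total - 1);        /* 19 -> period ~2^19 pages ~= 2GB   */
	int new_shift = ilog2(dirty_total - 1);            /* 17 -> period ~2^17 pages ~= 512MB */

	printf("old shift %d, new shift %d, period shrinks %dx\n",
	       old_shift, new_shift, 1 << (old_shift - new_shift));
	return 0;
}

At the ~50MB/s figure mentioned later in the thread, that is roughly 40s versus 10s worth of completions, which is consistent with the observed ~100s -> ~25s ramp-up.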


2011-04-13 22:04:49

by Jan Kara

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> Reduce the dampening for the control system, yielding faster
> convergence. The change is a bit conservative, as smaller values may
> lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
>
> CC: Peter Zijlstra <[email protected]>
> CC: Richard Kennedy <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
Well, I have nothing against this change as such, but what I don't like is
that it just trades the magical +2 for a similarly magical +0. It's clear that
this will lead to more rapid updates of the proportions of each bdi's share of
writeback and each thread's share of dirtying, but why +0? Why not +1 or -1? So
I'd prefer to get some understanding of why we need to update the
proportion period and why 4-times faster is just the right amount of faster
:) If I remember right you had some numbers for this, didn't you?

Honza
> ---
> mm/page-writeback.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux-next.orig/mm/page-writeback.c 2011-03-02 14:52:19.000000000 +0800
> +++ linux-next/mm/page-writeback.c 2011-03-02 15:00:17.000000000 +0800
> @@ -145,7 +145,7 @@ static int calc_period_shift(void)
> else
> dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> 100;
> - return 2 + ilog2(dirty_total - 1);
> + return ilog2(dirty_total - 1);
> }
>
> /*
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-13 23:31:27

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > Reduce the dampening for the control system, yielding faster
> > convergence. The change is a bit conservative, as smaller values may
> > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> >
> > CC: Peter Zijlstra <[email protected]>
> > CC: Richard Kennedy <[email protected]>
> > Signed-off-by: Wu Fengguang <[email protected]>
> Well, I have nothing against this change as such but what I don't like is
> that it just changes magical +2 for similarly magical +0. It's clear that

The patch tends to make the ramp-up time a bit more reasonable for
common desktops: from 100s to 25s (see below).

> this will lead to more rapid updates of proportions of bdi's share of
> writeback and thread's share of dirtying but why +0? Why not +1 or -1? So

Yes, it will especially be a problem on _small memory_ JBOD setups.
Richard actually requested a much more radical change (decrease by 6),
but that looks like too much.

My team has a 12-disk JBOD with only 6G of memory. The memory is pretty
small for a server, but it's a real setup and serves well as the
reference minimal setup that Linux should be able to run well on.

It will surely create more fluctuations, but they are still acceptable in my
tests. For example,

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-6G/xfs-128dd-1M-16p-5904M-20%25-2.6.38-rc6-dt6+-2011-02-23-19-46/balance_dirty_pages-pages.png

> I'd prefer to get some understanding of why do we need to update the
> proportion period and why 4-times faster is just the right amount of faster
> :) If I remember right you had some numbers for this, didn't you?

Even better, I have a graph :)

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/4G/xfs-1dd-1M-8p-3911M-20%25-2.6.38-rc7+-2011-03-07-21-55/balance_dirty_pages-pages.png

It shows that with 1 dd on a 4G box, it took more than 100s to
ramp up. The patch will reduce that to 25 seconds for a typical desktop.
The disk has 50MB/s throughput; with a modern HDD or SSD it will
converge even faster.

Thanks,
Fengguang

> > ---
> > mm/page-writeback.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > --- linux-next.orig/mm/page-writeback.c 2011-03-02 14:52:19.000000000 +0800
> > +++ linux-next/mm/page-writeback.c 2011-03-02 15:00:17.000000000 +0800
> > @@ -145,7 +145,7 @@ static int calc_period_shift(void)
> > else
> > dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > 100;
> > - return 2 + ilog2(dirty_total - 1);
> > + return ilog2(dirty_total - 1);
> > }
> >
> > /*
> >
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2011-04-13 23:52:18

by Dave Chinner

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > Reduce the dampening for the control system, yielding faster
> > > convergence. The change is a bit conservative, as smaller values may
> > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > >
> > > CC: Peter Zijlstra <[email protected]>
> > > CC: Richard Kennedy <[email protected]>
> > > Signed-off-by: Wu Fengguang <[email protected]>
> > Well, I have nothing against this change as such but what I don't like is
> > that it just changes magical +2 for similarly magical +0. It's clear that
>
> The patch tends to make the rampup time a bit more reasonable for
> common desktops. From 100s to 25s (see below).
>
> > this will lead to more rapid updates of proportions of bdi's share of
> > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
>
> Yes, it will especially be a problem on _small memory_ JBOD setups.
> Richard actually has requested for a much radical change (decrease by
> 6) but that looks too much.
>
> My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> small as a server, but it's a real setup and serves well as the
> reference minimal setup that Linux should be able to run well on.

FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
raid setups that have <= 1GB of RAM (many of them run XFS), so even
your setup could be considered large by a significant fraction of
the storage world. Hence you need to be careful of optimising for
what you think is a "normal" server, because there simply isn't such
a thing....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-14 00:23:09

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > Reduce the dampening for the control system, yielding faster
> > > > convergence. The change is a bit conservative, as smaller values may
> > > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > > >
> > > > CC: Peter Zijlstra <[email protected]>
> > > > CC: Richard Kennedy <[email protected]>
> > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > Well, I have nothing against this change as such but what I don't like is
> > > that it just changes magical +2 for similarly magical +0. It's clear that
> >
> > The patch tends to make the rampup time a bit more reasonable for
> > common desktops. From 100s to 25s (see below).
> >
> > > this will lead to more rapid updates of proportions of bdi's share of
> > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> >
> > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > Richard actually has requested for a much radical change (decrease by
> > 6) but that looks too much.
> >
> > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > small as a server, but it's a real setup and serves well as the
> > reference minimal setup that Linux should be able to run well on.
>
> FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> raid setups that have <= 1GB of RAM (many of them run XFS), so even
> your setup could be considered large by a significant fraction of
> the storage world. Hence you need to be careful of optimising for
> what you think is a "normal" server, because there simply isn't such
> a thing....

Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
I'll test the setup.

I did test low memory setups -- but only on simple 1-disk cases.

For example, when dirty thresh is lowered to 7MB, the dirty pages are
fluctuating like mad within the controlled scope:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-pages.png

But still, it achieves 100% disk utilization

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/iostat-util.png

and good IO throughput:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-bandwidth.png

And even better, less than 120ms writeback latencies:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-pause.png

Thanks,
Fengguang

2011-04-14 10:36:30

by Richard Kennedy

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Thu, 2011-04-14 at 08:23 +0800, Wu Fengguang wrote:
> On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> > On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > > Reduce the dampening for the control system, yielding faster
> > > > > convergence. The change is a bit conservative, as smaller values may
> > > > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > > > >
> > > > > CC: Peter Zijlstra <[email protected]>
> > > > > CC: Richard Kennedy <[email protected]>
> > > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > Well, I have nothing against this change as such but what I don't like is
> > > > that it just changes magical +2 for similarly magical +0. It's clear that
> > >
> > > The patch tends to make the rampup time a bit more reasonable for
> > > common desktops. From 100s to 25s (see below).
> > >
> > > > this will lead to more rapid updates of proportions of bdi's share of
> > > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> > >
> > > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > > Richard actually has requested for a much radical change (decrease by
> > > 6) but that looks too much.
> > >
> > > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > > small as a server, but it's a real setup and serves well as the
> > > reference minimal setup that Linux should be able to run well on.
> >
> > FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> > raid setups that have <= 1GB of RAM (many of them run XFS), so even
> > your setup could be considered large by a significant fraction of
> > the storage world. Hence you need to be careful of optimising for
> > what you think is a "normal" server, because there simply isn't such
> > a thing....
>
> Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
> I'll test the setup.
>
> I did test low memory setups -- but only on simple 1-disk cases.
>
> For example, when dirty thresh is lowered to 7MB, the dirty pages are
> fluctuating like mad within the controlled scope:
>
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-pages.png
>
> But still, it achieves 100% disk utilization
>
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/iostat-util.png
>
> and good IO throughput:
>
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-bandwidth.png
>
> And even better, less than 120ms writeback latencies:
>
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-pause.png
>
> Thanks,
> Fengguang
>

I'm only testing on a desktop with 2 drives. I use a simple test that
writes 2GB to sda and then 2GB to sdb while recording the threshold values.
On 2.6.39-rc3, after the 2nd write starts it takes approx 90 seconds for
sda's threshold value to drop from its maximum to minimum and sdb's to
rise from min to max. So this seems much too slow for normal desktop
workloads.

I haven't tested with this patch on 2.6.39-rc3 yet, but I'm just about
to set that up.

I know it's difficult to pick one magic number to fit every case, but I
don't see any easy way to make this more adaptive. We could make this
calculation take account of more things, but I don't know what.


Nice graphs :) BTW do you know what's causing that 10-second (1/10 Hz)
fluctuation in write bandwidth? And does this change affect that in any
way?

regards
Richard

2011-04-14 13:49:46

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Thu, Apr 14, 2011 at 06:36:22PM +0800, Richard Kennedy wrote:
> On Thu, 2011-04-14 at 08:23 +0800, Wu Fengguang wrote:
> > On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> > > On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > > > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > > > Reduce the dampening for the control system, yielding faster
> > > > > > convergence. The change is a bit conservative, as smaller values may
> > > > > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > > > > >
> > > > > > CC: Peter Zijlstra <[email protected]>
> > > > > > CC: Richard Kennedy <[email protected]>
> > > > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > > Well, I have nothing against this change as such but what I don't like is
> > > > > that it just changes magical +2 for similarly magical +0. It's clear that
> > > >
> > > > The patch tends to make the rampup time a bit more reasonable for
> > > > common desktops. From 100s to 25s (see below).
> > > >
> > > > > this will lead to more rapid updates of proportions of bdi's share of
> > > > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> > > >
> > > > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > > > Richard actually has requested for a much radical change (decrease by
> > > > 6) but that looks too much.
> > > >
> > > > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > > > small as a server, but it's a real setup and serves well as the
> > > > reference minimal setup that Linux should be able to run well on.
> > >
> > > FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> > > raid setups that have <= 1GB of RAM (many of them run XFS), so even
> > > your setup could be considered large by a significant fraction of
> > > the storage world. Hence you need to be careful of optimising for
> > > what you think is a "normal" server, because there simply isn't such
> > > a thing....
> >
> > Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
> > I'll test the setup.
> >
> > I did test low memory setups -- but only on simple 1-disk cases.
> >
> > For example, when dirty thresh is lowered to 7MB, the dirty pages are
> > fluctuating like mad within the controlled scope:
> >
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-pages.png
> >
> > But still, it achieves 100% disk utilization
> >
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/iostat-util.png
> >
> > and good IO throughput:
> >
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-bandwidth.png
> >
> > And even better, less than 120ms writeback latencies:
> >
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-pause.png
> >
> > Thanks,
> > Fengguang
> >
>
> I'm only testing on a desktop with 2 drives. I use a simple test to
> write 2gb to sda then 2gb to sdb while recording the threshold values.
> On 2.6.39-rc3, after the 2nd write starts it take approx 90 seconds for
> sda's threshold value to drop from its maximum to minimum and sdb's to
> rise from min to max. So this seems much too slow for normal desktop
> workloads.

Yes.

> I haven't tested with this patch on 2.6.39-rc3 yet, but I'm just about
> to set that up.

It will surely help, but the problem is now the low-memory NAS servers...

Fortunately my patchset makes the dirty pages ramp up much faster
than the per-bdi threshold does, and it is also less sensitive to the
fluctuations of per-bdi thresholds in a JBOD setup.

In fact my main concern in the low-memory NAS setup is how to prevent
the disks from going idle from time to time due to bdi dirty pages running
low. The fluctuations of per-bdi thresholds are no longer relevant for me
in this case. I ended up adding a rule to throttle the task less when
the bdi is running low on dirty pages. I find that the vanilla kernel
also has this problem.

> I know it's difficult to pick one magic number to fit every case, but I
> don't see any easy way to make this more adaptive. We could make this
> calculation take account of more things, but I don't know what.
>
>
> Nice graphs :) BTW do you know what's causing that 10 second (1/10 Hz)
> fluctuation in write bandwidth? and does this change effect that in any
> way?

In fact each filesystem fluctuates in its own unique way. For example,

ext4, 4 dd
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/ext4-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-49/balance_dirty_pages-bandwidth.png

btrfs, 4 dd
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/btrfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-15-03/balance_dirty_pages-bandwidth.png

btrfs, 1 dd
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/btrfs-1dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-56/balance_dirty_pages-bandwidth.png

I'm not sure about the exact root cause, but it's more or less related
to the fluctuations of IO completion events. For example, the
"written" curve is not a strictly straight line:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/btrfs-1dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-56/global_dirtied_written.png

Thanks,
Fengguang

2011-04-14 14:08:26

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

> > I'm only testing on a desktop with 2 drives. I use a simple test to
> > write 2gb to sda then 2gb to sdb while recording the threshold values.
> > On 2.6.39-rc3, after the 2nd write starts it take approx 90 seconds for
> > sda's threshold value to drop from its maximum to minimum and sdb's to
> > rise from min to max. So this seems much too slow for normal desktop
> > workloads.
>
> Yes.
>
> > I haven't tested with this patch on 2.6.39-rc3 yet, but I'm just about
> > to set that up.
>
> It will sure help, but the problem is now the low-memory NAS servers..
>
> Fortunately my patchset could make the dirty pages ramp up much more
> fast than the ramp up speed of the per-bdi threshold, and is also less
> sensitive to the fluctuations of per-bdi thresholds in JBOD setup.

Look at the attached graph. You cannot see an obvious "ramp-up"
stage in the number of dirty pages (red line) at all :)

Thanks,
Fengguang


Attachments:
(No filename) (930.00 B)
balance_dirty_pages-pages.png (79.76 kB)

2011-04-14 15:57:01

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Thu, Apr 14, 2011 at 11:14:24PM +0800, Wu Fengguang wrote:
> On Thu, Apr 14, 2011 at 08:23:02AM +0800, Wu Fengguang wrote:
> > On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> > > On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > > > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > > > Reduce the dampening for the control system, yielding faster
> > > > > > convergence. The change is a bit conservative, as smaller values may
> > > > > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > > > > >
> > > > > > CC: Peter Zijlstra <[email protected]>
> > > > > > CC: Richard Kennedy <[email protected]>
> > > > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > > Well, I have nothing against this change as such but what I don't like is
> > > > > that it just changes magical +2 for similarly magical +0. It's clear that
> > > >
> > > > The patch tends to make the rampup time a bit more reasonable for
> > > > common desktops. From 100s to 25s (see below).
> > > >
> > > > > this will lead to more rapid updates of proportions of bdi's share of
> > > > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> > > >
> > > > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > > > Richard actually has requested for a much radical change (decrease by
> > > > 6) but that looks too much.
> > > >
> > > > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > > > small as a server, but it's a real setup and serves well as the
> > > > reference minimal setup that Linux should be able to run well on.
> > >
> > > FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> > > raid setups that have <= 1GB of RAM (many of them run XFS), so even
> > > your setup could be considered large by a significant fraction of
> > > the storage world. Hence you need to be careful of optimising for
> > > what you think is a "normal" server, because there simply isn't such
> > > a thing....
> >
> > Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
> > I'll test the setup.
>
> Just did a comparison of the IO-less patches' performance with and
> without this patch. I hardly notice any differences besides some more
> bdi goal fluctuations in the attached graphs. The write throughput is
> a bit large with this patch (80MB/s vs 76MB/s), however the delta is
> within the even larger stddev range (20MB/s).
>
> The basic conclusion is, my IO-less patchset is very insensible to the
> bdi threshold fluctuations. In this kind of low memory case, just take
> care to stop the bdi pages from dropping too low and you get good
> performance. (well, the disks are still not 100% utilized at times...)

> Fluctuations in disk throughput and dirty rate and virtually
> everything are unavoidable due to the low memory situation.

Yeah, the fluctuations in the dirty rate are worse than in memory-abundant
situations, however they are still a lot better than what the vanilla kernel
can provide.

The attached graphs are collected with this patch. They show <=20ms
pause times, and progress that is not perfectly straight but nowhere bumpy.

Thanks,
Fengguang


Attachments:
(No filename) (3.20 kB)
balance_dirty_pages-task-bw.png (38.80 kB)
balance_dirty_pages-pause.png (49.10 kB)

2011-04-14 18:16:17

by Jan Kara

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Thu 14-04-11 23:14:25, Wu Fengguang wrote:
> On Thu, Apr 14, 2011 at 08:23:02AM +0800, Wu Fengguang wrote:
> > On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> > > On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > > > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > > > Reduce the dampening for the control system, yielding faster
> > > > > > convergence. The change is a bit conservative, as smaller values may
> > > > > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > > > > >
> > > > > > CC: Peter Zijlstra <[email protected]>
> > > > > > CC: Richard Kennedy <[email protected]>
> > > > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > > Well, I have nothing against this change as such but what I don't like is
> > > > > that it just changes magical +2 for similarly magical +0. It's clear that
> > > >
> > > > The patch tends to make the rampup time a bit more reasonable for
> > > > common desktops. From 100s to 25s (see below).
> > > >
> > > > > this will lead to more rapid updates of proportions of bdi's share of
> > > > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> > > >
> > > > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > > > Richard actually has requested for a much radical change (decrease by
> > > > 6) but that looks too much.
> > > >
> > > > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > > > small as a server, but it's a real setup and serves well as the
> > > > reference minimal setup that Linux should be able to run well on.
> > >
> > > FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> > > raid setups that have <= 1GB of RAM (many of them run XFS), so even
> > > your setup could be considered large by a significant fraction of
> > > the storage world. Hence you need to be careful of optimising for
> > > what you think is a "normal" server, because there simply isn't such
> > > a thing....
> >
> > Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
> > I'll test the setup.
>
> Just did a comparison of the IO-less patches' performance with and
> without this patch. I hardly notice any differences besides some more
> bdi goal fluctuations in the attached graphs. The write throughput is
> a bit large with this patch (80MB/s vs 76MB/s), however the delta is
> within the even larger stddev range (20MB/s).
Thanks for the test, but I cannot tell from the numbers you provided
how much the per-bdi thresholds fluctuated in this low-memory NAS case.
You can gather the current bdi threshold from /sys/kernel/debug/bdi/<dev>/stats
so it shouldn't be hard to get the numbers...
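
A minimal sampling sketch (the device id 8:0 and the exact field names in the stats file are assumptions; adjust them to the actual bdi):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* assumed path for sda; pick the right <dev> under /sys/kernel/debug/bdi/ */
	const char *path = "/sys/kernel/debug/bdi/8:0/stats";
	char line[256];
	int i;

	for (i = 0; i < 60; i++) {		/* one sample per second for a minute */
		FILE *f = fopen(path, "r");

		if (!f) {
			perror(path);
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (strstr(line, "DirtyThresh"))	/* matches the *DirtyThresh lines */
				fputs(line, stdout);
		fclose(f);
		sleep(1);
	}
	return 0;
}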

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-15 03:43:12

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Fri, Apr 15, 2011 at 02:16:09AM +0800, Jan Kara wrote:
> On Thu 14-04-11 23:14:25, Wu Fengguang wrote:
> > On Thu, Apr 14, 2011 at 08:23:02AM +0800, Wu Fengguang wrote:
> > > On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> > > > On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > > > > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > > > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > > > > Reduce the dampening for the control system, yielding faster
> > > > > > > convergence. The change is a bit conservative, as smaller values may
> > > > > > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > > > > > >
> > > > > > > CC: Peter Zijlstra <[email protected]>
> > > > > > > CC: Richard Kennedy <[email protected]>
> > > > > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > > > Well, I have nothing against this change as such but what I don't like is
> > > > > > that it just changes magical +2 for similarly magical +0. It's clear that
> > > > >
> > > > > The patch tends to make the rampup time a bit more reasonable for
> > > > > common desktops. From 100s to 25s (see below).
> > > > >
> > > > > > this will lead to more rapid updates of proportions of bdi's share of
> > > > > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> > > > >
> > > > > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > > > > Richard actually has requested for a much radical change (decrease by
> > > > > 6) but that looks too much.
> > > > >
> > > > > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > > > > small as a server, but it's a real setup and serves well as the
> > > > > reference minimal setup that Linux should be able to run well on.
> > > >
> > > > FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> > > > raid setups that have <= 1GB of RAM (many of them run XFS), so even
> > > > your setup could be considered large by a significant fraction of
> > > > the storage world. Hence you need to be careful of optimising for
> > > > what you think is a "normal" server, because there simply isn't such
> > > > a thing....
> > >
> > > Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
> > > I'll test the setup.
> >
> > Just did a comparison of the IO-less patches' performance with and
> > without this patch. I hardly notice any differences besides some more
> > bdi goal fluctuations in the attached graphs. The write throughput is
> > a bit large with this patch (80MB/s vs 76MB/s), however the delta is
> > within the even larger stddev range (20MB/s).
> Thanks for the test but I cannot find out from the numbers you provided
> how much did the per-bdi thresholds fluctuate in this low memory NAS case?
> You can gather current bdi threshold from /sys/kernel/debug/bdi/<dev>/stats
> so it shouldn't be hard to get the numbers...

Hi Jan, attached are your results w/o this patch. The "bdi goal" (gray
line) is calculated as (bdi_thresh - bdi_thresh/8) and is fluctuating
all over the place... and the average wkB/s is only 49MB/s...

Thanks,
Fengguang
---

wfg ~/bee% cat xfs-1dd-1M-16p-5907M-3:2-2.6.39-rc3-jan-bdp+-2011-04-15.11:11/iostat-avg
avg-cpu:   %user   %nice  %system  %iowait  %steal     %idle
sum        2.460   0.000   71.080  767.240   0.000  1859.220
avg        0.091   0.000    2.633   28.416   0.000    68.860
stddev     0.064   0.000    0.659    7.903   0.000     7.792


Device:  rrqm/s  wrqm/s    r/s        w/s  rkB/s         wkB/s   avgrq-sz  avgqu-sz     await   svctm     %util
sum       0.000  58.100  0.000   2926.980  0.000   1331730.590  18278.540   962.290  4850.450  97.470  1315.600
avg       0.000   2.152  0.000    108.407  0.000     49323.355    676.983    35.640   179.646   3.610    48.726
stddev    0.000   5.336  0.000    104.398  0.000     47602.790    400.410    40.696   169.289   2.212    45.870


Attachments:
(No filename) (3.96 kB)
balance_dirty_pages-pages.png (108.63 kB)
balance_dirty_pages-task-bw.png (35.80 kB)
balance_dirty_pages-pause.png (27.71 kB)
iostat (63.65 kB)

2011-04-15 22:13:21

by Jan Kara

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Fri 15-04-11 22:37:11, Wu Fengguang wrote:
> On Fri, Apr 15, 2011 at 11:43:00AM +0800, Wu Fengguang wrote:
> > On Fri, Apr 15, 2011 at 02:16:09AM +0800, Jan Kara wrote:
> > > On Thu 14-04-11 23:14:25, Wu Fengguang wrote:
> > > > On Thu, Apr 14, 2011 at 08:23:02AM +0800, Wu Fengguang wrote:
> > > > > On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> > > > > > On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > > > > > > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > > > > > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > > > > > > Reduce the dampening for the control system, yielding faster
> > > > > > > > > convergence. The change is a bit conservative, as smaller values may
> > > > > > > > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > > > > > > > >
> > > > > > > > > CC: Peter Zijlstra <[email protected]>
> > > > > > > > > CC: Richard Kennedy <[email protected]>
> > > > > > > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > > > > > Well, I have nothing against this change as such but what I don't like is
> > > > > > > > that it just changes magical +2 for similarly magical +0. It's clear that
> > > > > > >
> > > > > > > The patch tends to make the rampup time a bit more reasonable for
> > > > > > > common desktops. From 100s to 25s (see below).
> > > > > > >
> > > > > > > > this will lead to more rapid updates of proportions of bdi's share of
> > > > > > > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> > > > > > >
> > > > > > > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > > > > > > Richard actually has requested for a much radical change (decrease by
> > > > > > > 6) but that looks too much.
> > > > > > >
> > > > > > > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > > > > > > small as a server, but it's a real setup and serves well as the
> > > > > > > reference minimal setup that Linux should be able to run well on.
> > > > > >
> > > > > > FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> > > > > > raid setups that have <= 1GB of RAM (many of them run XFS), so even
> > > > > > your setup could be considered large by a significant fraction of
> > > > > > the storage world. Hence you need to be careful of optimising for
> > > > > > what you think is a "normal" server, because there simply isn't such
> > > > > > a thing....
> > > > >
> > > > > Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
> > > > > I'll test the setup.
> > > >
> > > > Just did a comparison of the IO-less patches' performance with and
> > > > without this patch. I hardly notice any differences besides some more
> > > > bdi goal fluctuations in the attached graphs. The write throughput is
> > > > a bit large with this patch (80MB/s vs 76MB/s), however the delta is
> > > > within the even larger stddev range (20MB/s).
> > > Thanks for the test but I cannot find out from the numbers you provided
> > > how much did the per-bdi thresholds fluctuate in this low memory NAS case?
> > > You can gather current bdi threshold from /sys/kernel/debug/bdi/<dev>/stats
> > > so it shouldn't be hard to get the numbers...
> >
> > Hi Jan, attached are your results w/o this patch. The "bdi goal" (gray
> > line) is calculated as (bdi_thresh - bdi_thresh/8) and is fluctuating
> > all over the place.. and average wkB/s is only 49MB/s..
>
> I got the numbers for vanilla kernel: XFS can do 57MB/s and 63MB/s in
> the two runs. There are large fluctuations in the attached graphs, too.
Hmm, so the graphs from the previous email are with the longer proportion
period (i.e. without the patch we discuss here) and the graphs from this
email are with it?

> To summary it up, for a 1GB mem, 4 disks JBOD setup, running 1 dd per
> disk:
>
> vanilla: 57MB/s, 63MB/s
> Jan: 49MB/s, 103MB/s
> Wu: 76MB/s, 80MB/s
>
> The balance_dirty_pages-task-bw-jan.png and
> balance_dirty_pages-pages-jan.png shows very unfair allocation of
> dirty pages and throughput among the disks...
Fengguang, can we please stay on topic? It's good to know that throughput
fluctuates so much with my patches (although that's not that surprising given
the fluctuations of bdi limits), but for the sake of this patch, throughput
numbers with different balance_dirty_pages() implementations do not seem
that interesting. What is interesting (at least to me) is how this
particular patch changes the fluctuations of bdi thresholds (fractions) in
the vanilla kernel. In the graphs, I can see only the bdi goal - that is the
per-bdi threshold we have in balance_dirty_pages(), am I right? And it is
there for only a single device, right?

Anyway, either with or without the patch, the bdi thresholds are jumping rather
wildly if I'm interpreting the graphs right. Hmm, which is not that surprising
given that in the ideal case we should have about 0.5s worth of writeback for
each disk in the page cache. So with your patch the period for proportion
estimation is also just about 0.5s worth of page writeback, which is
understandably susceptible to fluctuations. Thinking about it, the original
period of 4*"dirty limit" on your machine is about 2.5 GB, which is about
50s worth of writeback on that machine, so it matches your
observation that it takes ~100s for the bdi threshold to climb up.

So the takeaway from this for me is that scaling the period
with the dirty limit is not the right thing. If you had 4 times more
memory, your choice of "dirty limit" as the period would be as bad as the
current 4*"dirty limit". What would seem like a better choice of period
to me would be to have the period on the order of a few seconds' worth of
writeback. That would allow the bdi limit to scale up reasonably fast when
a new bdi starts to be used and still not make it fluctuate that much
(hopefully).
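
(For scale: at the 50MB/s mentioned earlier, a few seconds' worth of writeback
is roughly 3s * 12800 pages/s ~ 38000 pages ~ 150MB, versus the ~2.5GB period
above; the numbers are only illustrative.)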

Looking at the math in lib/proportions.c, nothing really fundamental requires
that each period has the same length. So it shouldn't be hard to actually
create a proportions calculator that would have timer-triggered periods -
simply whenever the timer fires, we would declare a new period. The only
things which would be broken by this are (t represents the global counter of
events):
a) counting of periods as t/period_len - we would have to maintain a global
period counter, but that's trivial
b) the trick that we don't do t=t/2 for each new period but rather use
period_len/2+(t % (period_len/2)) when calculating fractions - again we
would have to bite the bullet and divide the global counter when we declare a
new period, but again it's not a big deal in our case.

Peter what do you think about this? Do you (or anyone else) think it makes
sense?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-16 06:05:14

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Sat, Apr 16, 2011 at 06:13:14AM +0800, Jan Kara wrote:
> On Fri 15-04-11 22:37:11, Wu Fengguang wrote:
> > On Fri, Apr 15, 2011 at 11:43:00AM +0800, Wu Fengguang wrote:
> > > On Fri, Apr 15, 2011 at 02:16:09AM +0800, Jan Kara wrote:
> > > > On Thu 14-04-11 23:14:25, Wu Fengguang wrote:
> > > > > On Thu, Apr 14, 2011 at 08:23:02AM +0800, Wu Fengguang wrote:
> > > > > > On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> > > > > > > On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > > > > > > > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > > > > > > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > > > > > > > Reduce the dampening for the control system, yielding faster
> > > > > > > > > > convergence. The change is a bit conservative, as smaller values may
> > > > > > > > > > lead to noticeable bdi threshold fluctuates in low memory JBOD setup.
> > > > > > > > > >
> > > > > > > > > > CC: Peter Zijlstra <[email protected]>
> > > > > > > > > > CC: Richard Kennedy <[email protected]>
> > > > > > > > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > > > > > > Well, I have nothing against this change as such but what I don't like is
> > > > > > > > > that it just changes magical +2 for similarly magical +0. It's clear that
> > > > > > > >
> > > > > > > > The patch tends to make the rampup time a bit more reasonable for
> > > > > > > > common desktops. From 100s to 25s (see below).
> > > > > > > >
> > > > > > > > > this will lead to more rapid updates of proportions of bdi's share of
> > > > > > > > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> > > > > > > >
> > > > > > > > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > > > > > > > Richard actually has requested for a much radical change (decrease by
> > > > > > > > 6) but that looks too much.
> > > > > > > >
> > > > > > > > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > > > > > > > small as a server, but it's a real setup and serves well as the
> > > > > > > > reference minimal setup that Linux should be able to run well on.
> > > > > > >
> > > > > > > FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> > > > > > > raid setups that have <= 1GB of RAM (many of them run XFS), so even
> > > > > > > your setup could be considered large by a significant fraction of
> > > > > > > the storage world. Hence you need to be careful of optimising for
> > > > > > > what you think is a "normal" server, because there simply isn't such
> > > > > > > a thing....
> > > > > >
> > > > > > Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
> > > > > > I'll test the setup.
> > > > >
> > > > > Just did a comparison of the IO-less patches' performance with and
> > > > > without this patch. I hardly notice any differences besides some more
> > > > > bdi goal fluctuations in the attached graphs. The write throughput is
> > > > > a bit large with this patch (80MB/s vs 76MB/s), however the delta is
> > > > > within the even larger stddev range (20MB/s).
> > > > Thanks for the test but I cannot find out from the numbers you provided
> > > > how much did the per-bdi thresholds fluctuate in this low memory NAS case?
> > > > You can gather current bdi threshold from /sys/kernel/debug/bdi/<dev>/stats
> > > > so it shouldn't be hard to get the numbers...
> > >
> > > Hi Jan, attached are your results w/o this patch. The "bdi goal" (gray
> > > line) is calculated as (bdi_thresh - bdi_thresh/8) and is fluctuating
> > > all over the place.. and average wkB/s is only 49MB/s..
> >
> > I got the numbers for vanilla kernel: XFS can do 57MB/s and 63MB/s in
> > the two runs. There are large fluctuations in the attached graphs, too.
> Hmm, so the graphs from previous email are with longer "proportion
> period (without patch we discuss here)" and graphs from this email are
> with it?

All graphs for the vanilla and your IO-less kernels were collected without
this patch.

I only showed in the previous email how my IO-less kernel works with and
without this patch, and the conclusion is that it's not sensitive to it
and works fine in both cases.

> > To summary it up, for a 1GB mem, 4 disks JBOD setup, running 1 dd per
> > disk:
> >
> > vanilla: 57MB/s, 63MB/s
> > Jan: 49MB/s, 103MB/s
> > Wu: 76MB/s, 80MB/s
> >
> > The balance_dirty_pages-task-bw-jan.png and
> > balance_dirty_pages-pages-jan.png shows very unfair allocation of
> > dirty pages and throughput among the disks...
> Fengguang, can we please stay on topic? It's good to know that throughput
> fluctuates so much with my patches (although not that surprising seeing the
> fluctuations of bdi limits) but for the sake of this patch throughput
> numbers with different balance_dirty_pages() implementations do not seem
> that interesting. What is interesting (at least to me) is how this
> particular patch changes fluctuations of bdi thresholds (fractions) in
> vanilla kernel. In the graphs, I can see only bdi goal - that is the
> per-bdi threshold we have in balance_dirty_pages() am I right? And it is
> there for only a single device, right?

bdi_goal = bdi_thresh * 7/8. They are close, so by looking at the bdi
goal curve, you get an idea of how bdi_thresh fluctuates over time.

balance_dirty_pages-pages-jan.png looks very much like the single-device
situation, because the bdi goal is so high! But that's exactly the
problem: the first bdi is consuming most of the dirty pages quota and runs at
full speed, while the other bdis run mostly idle. You can confirm
the imbalance in balance_dirty_pages-task-bw-jan.png and iostat.

It looks similar to the problem described here:

https://lkml.org/lkml/2010/12/5/6

> Anyway either with or without the patch, bdi thresholds are jumping rather
> wildly if I'm interpreting the graphs right. Hmm, which is not that surprising
> given that in ideal case we should have about 0.5s worth of writeback for
> each disk in the page cache. So with your patch the period for proportion
> estimation is also just about 0.5s worth of page writeback which is
> understandably susceptible to fluctuations. Thinking about it, the original
> period of 4*"dirty limit" on your machine is about 2.5 GB which is about
> 50s worth of writeback on that machine so it is in match with your
> observation that it takes ~100s for bdi threshold to climb up.
>
> So what is a takeaway from this for me is that scaling the period
> with the dirty limit is not the right thing. If you'd have 4-times more
> memory, your choice of "dirty limit" as the period would be as bad as
> current 4*"dirty limit". What would seem like a better choice of period
> to me would be to have the period in an order of a few seconds worth of
> writeback. That would allow the bdi limit to scale up reasonably fast when
> new bdi starts to be used and still not make it fluctuate that much
> (hopefully).

Yes, it would be good to make it more bandwidth- and time-based. I'll be glad
if you can improve the algorithm :)

Thanks,
Fengguang

> Looking at math in lib/proportions.c, nothing really fundamental requires
> that each period has the same length. So it shouldn't be hard to actually
> create proportions calculator that would have timer triggered periods -
> simply whenever the timer fires, we would declare a new period. The only
> things which would be broken by this are (t represents global counter of
> events):
> a) counting of periods as t/period_len - we would have to maintain global
> period counter but that's trivial
> b) trick that we don't do t=t/2 for each new period but rather use
> period_len/2+(t % (period_len/2)) when calculating fractions - again we
> would have to bite the bullet and divide the global counter when we declare
> new period but again it's not a big deal in our case.
>
> Peter what do you think about this? Do you (or anyone else) think it makes
> sense?
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2011-04-16 08:39:15

by Peter Zijlstra

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Sat, 2011-04-16 at 00:13 +0200, Jan Kara wrote:
>
> So what is a takeaway from this for me is that scaling the period
> with the dirty limit is not the right thing. If you'd have 4-times more
> memory, your choice of "dirty limit" as the period would be as bad as
> current 4*"dirty limit". What would seem like a better choice of period
> to me would be to have the period in an order of a few seconds worth of
> writeback. That would allow the bdi limit to scale up reasonably fast when
> new bdi starts to be used and still not make it fluctuate that much
> (hopefully).

No, best would be to scale the period with the writeout bandwidth, but
lacking that, the dirty limit had to do. Since we're counting pages, and
bandwidth is pages/second, we'll end up with a time measure, exactly the
thing you wanted.

> Looking at math in lib/proportions.c, nothing really fundamental requires
> that each period has the same length. So it shouldn't be hard to actually
> create proportions calculator that would have timer triggered periods -
> simply whenever the timer fires, we would declare a new period. The only
> things which would be broken by this are (t represents global counter of
> events):
> a) counting of periods as t/period_len - we would have to maintain global
> period counter but that's trivial
> b) trick that we don't do t=t/2 for each new period but rather use
> period_len/2+(t % (period_len/2)) when calculating fractions - again we
> would have to bite the bullet and divide the global counter when we declare
> new period but again it's not a big deal in our case.
>
> Peter what do you think about this? Do you (or anyone else) think it makes
> sense?

But if you don't have a fixed-size period, then how do you catch up on
fractions that haven't been updated for several periods? You cannot go and
remember all the individual period lengths.

The whole trick to the proportion stuff is that it's all O(1) regardless
of the number of contestants. There isn't a single loop that iterates
over all BDIs or tasks to update their cycle; that wouldn't have scaled.
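
Concretely, the fixed period length is what makes that catch-up O(1): a
contestant that slept through N period boundaries just halves its event count
N times. A minimal sketch of the idea (made-up names, not the actual
lib/proportions.c code):

#include <stdio.h>

struct local_prop {
	unsigned long period;	/* last global period this contestant synced to */
	unsigned long events;	/* decaying event count */
};

/* O(1) catch-up: halve the local count once per missed period boundary */
static void catch_up(struct local_prop *pl, unsigned long global_period)
{
	unsigned long missed = global_period - pl->period;

	if (missed >= 8 * sizeof(unsigned long))
		pl->events = 0;		/* fully decayed away */
	else
		pl->events >>= missed;
	pl->period = global_period;
}

int main(void)
{
	struct local_prop bdi = { .period = 3, .events = 1024 };

	catch_up(&bdi, 7);	/* slept through 4 periods: 1024 -> 64 */
	printf("events after catch-up: %lu\n", bdi.events);
	return 0;
}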

2011-04-16 14:23:57

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Sat, Apr 16, 2011 at 04:33:29PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-04-16 at 00:13 +0200, Jan Kara wrote:
> >
> > So what is a takeaway from this for me is that scaling the period
> > with the dirty limit is not the right thing. If you'd have 4-times more
> > memory, your choice of "dirty limit" as the period would be as bad as
> > current 4*"dirty limit". What would seem like a better choice of period
> > to me would be to have the period in an order of a few seconds worth of
> > writeback. That would allow the bdi limit to scale up reasonably fast when
> > new bdi starts to be used and still not make it fluctuate that much
> > (hopefully).
>
> No best would be to scale the period with the writeout bandwidth, but
> lacking that the dirty limit had to do. Since we're counting pages, and
> bandwidth is pages/second we'll end up with a time measure, exactly the
> thing you wanted.

I owe you the patch :) Here is a tested one for doing the bandwidth-based
scaling. It's based on the attached global writeout bandwidth
estimation.

I tried updating the shift on both rising and falling bandwidth, however
that leads to resets of the accumulated proportion values. So here the
shift will only be increased and never decreased.
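
To put numbers on it (a 50MB/s disk and 4KB pages, assumed for illustration),
the patch below gives:

	avg_write_bandwidth ~ 50MB/s / 4KB     = 12800 pages/s
	shift               = 2 + ilog2(12800) = 2 + 13 = 15
	period              ~ 2^15 pages       = 128MB, i.e. a few seconds worth of writeback

so the period now tracks the disk speed rather than the size of memory.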

Thanks,
Fengguang
---
Subject: writeback: scale dirty proportions period with writeout bandwidth
Date: Sat Apr 16 18:38:41 CST 2011

CC: Peter Zijlstra <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/page-writeback.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-04-16 21:02:24.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-04-16 21:04:08.000000000 +0800
@@ -121,20 +121,13 @@ static struct prop_descriptor vm_complet
 static struct prop_descriptor vm_dirties;
 
 /*
- * couple the period to the dirty_ratio:
+ * couple the period to global write throughput:
  *
- * period/2 ~ roundup_pow_of_two(dirty limit)
+ * period/2 ~ roundup_pow_of_two(write IO throughput)
  */
 static int calc_period_shift(void)
 {
-	unsigned long dirty_total;
-
-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
-	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
-				100;
-	return 2 + ilog2(dirty_total - 1);
+	return 2 + ilog2(default_backing_dev_info.avg_write_bandwidth);
 }
 
 /*
@@ -143,6 +136,13 @@ static int calc_period_shift(void)
 static void update_completion_period(void)
 {
 	int shift = calc_period_shift();
+
+	if (shift > PROP_MAX_SHIFT)
+		shift = PROP_MAX_SHIFT;
+
+	if (shift <= vm_completions.pg[0].shift)
+		return;
+
 	prop_change_shift(&vm_completions, shift);
 	prop_change_shift(&vm_dirties, shift);
 }
@@ -180,7 +180,6 @@ int dirty_ratio_handler(struct ctl_table
 
 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
-		update_completion_period();
 		vm_dirty_bytes = 0;
 	}
 	return ret;
@@ -196,7 +195,6 @@ int dirty_bytes_handler(struct ctl_table
 
 	ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 	if (ret == 0 && write && vm_dirty_bytes != old_bytes) {
-		update_completion_period();
 		vm_dirty_ratio = 0;
 	}
 	return ret;
@@ -1026,6 +1024,7 @@ void bdi_update_bandwidth(struct backing
 				global_page_state(NR_WRITTEN));
 		gbdi->bw_time_stamp = now;
 		gbdi->written_stamp = global_page_state(NR_WRITTEN);
+		update_completion_period();
 	}
 	if (thresh) {
 		bdi_update_dirty_ratelimit(bdi, thresh, dirty,


Attachments:
(No filename) (3.45 kB)
writeback-global-write-bandwidth.patch (1.35 kB)

2011-04-17 02:11:22

by Fengguang Wu

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Sat, Apr 16, 2011 at 10:21:14PM +0800, Wu Fengguang wrote:
> On Sat, Apr 16, 2011 at 04:33:29PM +0800, Peter Zijlstra wrote:
> > On Sat, 2011-04-16 at 00:13 +0200, Jan Kara wrote:
> > >
> > > So what is a takeaway from this for me is that scaling the period
> > > with the dirty limit is not the right thing. If you'd have 4-times more
> > > memory, your choice of "dirty limit" as the period would be as bad as
> > > current 4*"dirty limit". What would seem like a better choice of period
> > > to me would be to have the period in an order of a few seconds worth of
> > > writeback. That would allow the bdi limit to scale up reasonably fast when
> > > new bdi starts to be used and still not make it fluctuate that much
> > > (hopefully).
> >
> > No best would be to scale the period with the writeout bandwidth, but
> > lacking that the dirty limit had to do. Since we're counting pages, and
> > bandwidth is pages/second we'll end up with a time measure, exactly the
> > thing you wanted.
>
> I owe you the patch :) Here is a tested one for doing the bandwidth
> based scaling. It's based on the attached global writeout bandwidth
> estimation.
>
> I tried updating the shift both on rosed and fallen bandwidth, however
> that leads to reset of the accumulated proportion values. So here the
> shift will only be increased and never decreased.

I cannot reproduce the issue now. It may be that the bandwidth
estimation went wrong and produced tiny values at times in an early patch,
thus "resetting" the proportional values.

I'll carry the below version in future tests. In theory we could do
coarser tracking with

	if (abs(shift - vm_completions.pg[0].shift) <= 1)
		return;

but let's do it more diligently for now.

Thanks,
Fengguang
---
@@ -143,6 +136,13 @@ static int calc_period_shift(void)
 static void update_completion_period(void)
 {
 	int shift = calc_period_shift();
+
+	if (shift > PROP_MAX_SHIFT)
+		shift = PROP_MAX_SHIFT;
+
+	if (shift == vm_completions.pg[0].shift)
+		return;
+
 	prop_change_shift(&vm_completions, shift);
 	prop_change_shift(&vm_dirties, shift);
 }

2011-04-18 14:59:39

by Jan Kara

Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Sat 16-04-11 10:33:29, Peter Zijlstra wrote:
> On Sat, 2011-04-16 at 00:13 +0200, Jan Kara wrote:
> >
> > So what is a takeaway from this for me is that scaling the period
> > with the dirty limit is not the right thing. If you'd have 4-times more
> > memory, your choice of "dirty limit" as the period would be as bad as
> > current 4*"dirty limit". What would seem like a better choice of period
> > to me would be to have the period in an order of a few seconds worth of
> > writeback. That would allow the bdi limit to scale up reasonably fast when
> > new bdi starts to be used and still not make it fluctuate that much
> > (hopefully).
>
> No best would be to scale the period with the writeout bandwidth, but
> lacking that the dirty limit had to do. Since we're counting pages, and
> bandwidth is pages/second we'll end up with a time measure, exactly the
> thing you wanted.
Yes, I was thinking about this as well. We could measure the throughput,
but essentially it's a changing entity (dependent on the type of load and
possibly other things like network load for NFS, or other machines
accessing your NAS). So I'm not sure one constant value will work (esp.
because you have to measure it and you never know in which state you did
the measurement). And when you have changing values, you have to solve the
same problem as with time-based periods - that's how I came to them.

> > Looking at math in lib/proportions.c, nothing really fundamental requires
> > that each period has the same length. So it shouldn't be hard to actually
> > create proportions calculator that would have timer triggered periods -
> > simply whenever the timer fires, we would declare a new period. The only
> > things which would be broken by this are (t represents global counter of
> > events):
> > a) counting of periods as t/period_len - we would have to maintain global
> > period counter but that's trivial
> > b) trick that we don't do t=t/2 for each new period but rather use
> > period_len/2+(t % (period_len/2)) when calculating fractions - again we
> > would have to bite the bullet and divide the global counter when we declare
> > new period but again it's not a big deal in our case.
> >
> > Peter what do you think about this? Do you (or anyone else) think it makes
> > sense?
>
> But if you don't have a fixed sized period, then how do you catch up on
> fractions that haven't been updated for several periods? You cannot go
> remember all the individual period lengths.
OK, I wrote the expressions down and the way I want to do it would get
different fractions than your original formula:

Your formula is:
p(j)=\sum_i x_i(j)/(t_i*2^{i+1})
where $i$ sums from 0 to \infty, x_i(j) is the number of events of type
$j$ in period $i$, $t_i$ is the total number of events in period $i$.

I want to compute
l(j)=\sum_i x_i(j)/2^{i+1}
g=\sum_i t_i/2^{i+1}
and
p(j)=l(j)/g

Clearly, all these values can be computed in O(1). Now for t_i = t for every
i, the results of both formulas are the same (which is what made me make my
mistake). But when t_i differ, the results are different. I'd say that the
new formula also provides a meaningful notion of writeback share although
it's hard to quantify how far the computations will be in practice...
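
To make the difference concrete, here is a quick userspace sketch
(illustration only, hypothetical code, nothing that exists in the
kernel) that evaluates both formulas on the same event counts; with
unequal t_i the two proportions come out clearly different:

	/* index 0 is the most recent period; weights are 1/2^(i+1) */
	#include <stdio.h>

	#define PERIODS 3
	#define DEVS    2

	int main(void)
	{
		/* x[i][j]: events of device j in period i; t[i] totals them */
		double x[PERIODS][DEVS] = { {10, 0}, {100, 100}, {1000, 3000} };
		double t[PERIODS]       = { 10, 200, 4000 };
		int i, j;

		for (j = 0; j < DEVS; j++) {
			double p_old = 0, l = 0, g = 0, w = 0.5;

			for (i = 0; i < PERIODS; i++, w /= 2) {
				p_old += w * x[i][j] / t[i]; /* x_i(j)/(t_i*2^(i+1)) */
				l     += w * x[i][j];        /* x_i(j)/2^(i+1)       */
				g     += w * t[i];           /* t_i/2^(i+1)          */
			}
			printf("device %d: p=%.3f (old formula), p=%.3f (new formula)\n",
			       j, p_old, l / g);
		}
		return 0;
	}

With these numbers the old formula gives roughly 0.66/0.22 and the new
one 0.28/0.72, which shows how much weight the per-period
normalization moves around.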

> The whole trick to the proportion stuff is that its all O(1) regardless
> of the number of contestants. There isn't a single loop that iterates
> over all BDIs or tasks to update their cycle, that wouldn't have scaled.
Sure, I understand.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-24 12:21:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

Sorry for the delay, life got interesting and then it slipped my mind.

On Mon, 2011-04-18 at 16:59 +0200, Jan Kara wrote:
> Your formula is:
> p(j)=\sum_i x_i(j)/(t_i*2^{i+1})
> where $i$ sums from 0 to \infty, x_i(j) is the number of events of type
> $j$ in period $i$, $t_i$ is the total number of events in period $i$.

Actually:

p_j = \Sum_{i=0} (d/dt_i) * x_j / 2^(i+1)

[ discrete differential ]

Where x_j is the total number of events for the j-th element of the set
and t_i is the i-th last period.

Also, the 1/2^(i+1) factor ensures recent history counts heavier while
still maintaining a normalized distribution.

Furthermore, by measuring time in the same measure as the events we get:

t = \Sum_i x_i

which yields that:

p_j = x_j * {\Sum_i (d/dt_i)} * {\Sum 2^(-i-1)}
= x_j * (1/t) * 1

Thus

\Sum_j p_j = \Sum_j x_j / (\Sum_i x_i) = 1

> I want to compute
> l(j)=\sum_i x_i(j)/2^{i+1}
> g=\sum_i t_i/2^{i+1}
> and
> p(j)=l(j)/g

Which gives me:

p_j = x_j * \Sum_i 1/t_i
= x_j / t

Again, if we then measure t in the same events as x, such that:

t = \Sum_i x_i

we again get:

\Sum_j p_j = \Sum_j x_j / \Sum_i x_i = 1

However, if you start measuring t differently that breaks, and the
result is no longer normalized and thus not suitable as a proportion.

Furthermore, while x_j/t is an average, it does not have decaying
history, resulting in past behaviour always affecting current results.
The decaying history thing will ensure that past behaviour will slowly
be 'forgotten' so that when the media is used differently (seeky to
non-seeky workload transition) the slow writeout speed will be forgotten
and we'll end up at the high writeout speed corresponding to less seeks.
Your average will end up hovering in the middle of the slow and fast
modes.

> Clearly, all these values can be computed in O(1).

True, but you get to keep x and t counts over all history, which could
lead to overflow scenarios (although switching to u64 should mitigate
that problem in our lifetime).

> Now for t_i = t for every
> i, the results of both formulas are the same (which is what made me make my
> mistake).

I'm not actually seeing how the averages will be the same; as
explained, yours seems to never forget history.

> But when t_i differ, the results are different.

From what I can tell, when you stop measuring t in the same events as
x, everything breaks down because then the sum of proportions isn't
normalized.

> I'd say that the
> new formula also provides a meaningful notion of writeback share although
> it's hard to quantify how far the computations will be in practice...

s/far/fair/ ?

2011-05-24 12:38:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Tue, 2011-05-24 at 14:24 +0200, Peter Zijlstra wrote:
> Again, if we then measure t in the same events as x, such that:
>
> t = \Sum_i x_i

> However, if you start measuring t differently that breaks, and the
> result is no longer normalized and thus not suitable as a proportion.

Ah, I made a mistake there: your proposal would keep the above relation
true, but the discrete periods t_i wouldn't be uniform.

So disregard the non normalized criticism.

2011-06-09 23:58:11

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

On Tue 24-05-11 14:24:29, Peter Zijlstra wrote:
> Sorry for the delay, life got interesting and then it slipped my mind.
And I missed your reply, so sorry for my delay as well :).

> On Mon, 2011-04-18 at 16:59 +0200, Jan Kara wrote:
> > Your formula is:
> > p(j)=\sum_i x_i(j)/(t_i*2^{i+1})
> > where $i$ sums from 0 to \infty, x_i(j) is the number of events of type
> > $j$ in period $i$, $t_i$ is the total number of events in period $i$.
>
> Actually:
>
> p_j = \Sum_{i=0} (d/dt_i) * x_j / 2^(i+1)
>
> [ discrete differential ]
>
> Where x_j is the total number of events for the j-th element of the set
> and t_i is the i-th last period.
>
> Also, the 1/2^(i+1) factor ensures recent history counts heavier while
> still maintaining a normalized distribution.
>
> Furthermore, by measuring time in the same measure as the events we get:
>
> t = \Sum_i x_i
>
> which yields that:
>
> p_j = x_j * {\Sum_i (d/dt_i)} * {\Sum 2^(-i-1)}
> = x_j * (1/t) * 1
>
> Thus
>
> \Sum_j p_j = \Sum_j x_j / (\Sum_i x_i) = 1
Yup, I understand this.

> > I want to compute
> > l(j)=\sum_i x_i(j)/2^{i+1}
> > g=\sum_i t_i/2^{i+1}
> > and
> > p(j)=l(j)/g
>
> Which gives me:
>
> p_j = x_j * \Sum_i 1/t_i
> = x_j / t
It cannot really be simplified like this - the 2^{i+1} factors do not
cancel out in p(j). Let's write the formula in an iterative manner so
that it becomes clearer. The first step almost looks as if the 2^{i+1}
factors cancel out (note that I use x_1 and t_1 instead of x_0 and t_0
so that I don't have to renumber when going to the next step):
l'(j) = x_1/2 + l(j)/2
g' = t_1/2 + g/2
thus
p'(j) = l'(j) / g'
= (x_1 + l(j))/2 / ((t_1 + g)/2)
= (x_1 + l(j)) / (t_1+g)

But if you properly expand to the next step you'll get:
l''(j) = x_0/2 + l'(j)/2
= x_0/2 + x_1/4 + l(j)/4
g'' = t_0/2 + g'/2
= t_0/2 + t_1/4 + g/4
thus we only get:
p''(j) = l''(j)/g''
= (x_0/2 + x_1/4 + l(j)/4) / (t_0/2 + t_1/4 + g/4)
= (x_0 + x_1/2 + l(j)/2) / (t_0 + t_1/2 + g/2)

Hmm, I guess I should have written the formulas as

l(j) = \sum_i x_i(j)/2^i
g = \sum_i t_i/2^i

It is equivalent and less confusing for the iterative expression where
we get directly:

l'(j)=x_0+l(j)/2
g'=t_0+g/2

which directly shows what's going on.
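
As a sketch of how this could be implemented (hypothetical names,
plain C, not the existing lib/proportions.c), the timer only touches
the global state and each device lazily folds in the periods it slept
through, so declaring a new period stays O(1):

	struct prop_global_s {
		unsigned long g;	/* g = sum_i t_i/2^i over completed periods */
		unsigned long t0;	/* events in the currently running period   */
		unsigned long period;	/* number of completed periods              */
	};

	struct prop_local_s {
		unsigned long l;	/* l(j) = sum_i x_i(j)/2^i, completed periods */
		unsigned long x0;	/* this device's events in the running period */
		unsigned long period;	/* completed periods already folded into l    */
	};

	/* timer fires: the running slice becomes period 0 */
	static void prop_new_period(struct prop_global_s *pg)
	{
		pg->g = pg->t0 + pg->g / 2;		/* g' = t_0 + g/2 */
		pg->t0 = 0;
		pg->period++;
	}

	/* lazily fold the periods this device slept through into l(j) */
	static void prop_sync(struct prop_global_s *pg, struct prop_local_s *pl)
	{
		unsigned long missed = pg->period - pl->period;

		if (missed) {
			pl->l = pl->x0 + pl->l / 2;	/* l'(j) = x_0(j) + l(j)/2 */
			pl->x0 = 0;
			missed--;
			/* the remaining missed periods contributed no events */
			pl->l = missed < BITS_PER_LONG ? pl->l >> missed : 0;
			pl->period = pg->period;
		}
	}

	/* account one writeout completion against this device */
	static void prop_event(struct prop_global_s *pg, struct prop_local_s *pl)
	{
		prop_sync(pg, pl);
		pl->x0++;
		pg->t0++;
	}

	/* p(j) = l(j)/g over completed periods, returned in per-mille */
	static unsigned long prop_fraction(struct prop_global_s *pg,
					   struct prop_local_s *pl)
	{
		prop_sync(pg, pl);
		return pg->g ? 1000 * pl->l / pg->g : 0;
	}

The shift by the number of missed periods is what keeps the timer
handler O(1): an idle device simply decays its l(j) the next time it
is touched.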

> Again, if we then measure t in the same events as x, such that:
>
> t = \Sum_i x_i
>
> we again get:
>
> \Sum_j p_j = \Sum_j x_j / \Sum_i x_i = 1
>
> However, if you start measuring t differently that breaks, and the
> result is no longer normalized and thus not suitable as a proportion.
The normalization works with my formula as you noted in your next email
(I just expand it here for other readers):
\Sum_j p_j = \Sum_j l(j)/g
= 1/g * \Sum_j \Sum_i x_i(j)/2^(i+1)
= 1/g * \Sum_i (1/2^(i+1) * \Sum_j x_i(j))
(*) = 1/g * \Sum_i t_i/2^(i+1)
= 1

(*) Here we use that t_i = \Sum_j x_i(j) because that's the definition of
t_i.

Note that exactly the same equality holds when 2^(i+1) is replaced with
2^i in g and l(j).

> Furthermore, while x_j/t is an average, it does not have decaying
> history, resulting in past behaviour always affecting current results.
> The decaying history thing will ensure that past behaviour will slowly
> be 'forgotten' so that when the media is used differently (seeky to
> non-seeky workload transition) the slow writeout speed will be forgotten
> and we'll end up at the high writeout speed corresponding to less seeks.
> Your average will end up hovering in the middle of the slow and fast
> modes.
So this is the most disputable point of my formulas, I believe :). You
are right that if, for example, nothing happens during a time slice
(i.e. t_0 = 0, x_0(j) = 0), the proportions don't change (well, after
some time rounding starts to have an effect but let's ignore that for
now). Generally, if t_i was previously big and then became small
(system bandwidth lowered; e.g. t_5=10000, t_4=10, t_3=20, ...), it
will take roughly log_2(maximum t_i / current t_i) time slices for the
contribution of the terms with big t_i to become comparable with the
contribution of the later terms with small t_i - in the example above,
log_2(10000/10), i.e. about 10 slices. After that many time slices, the
proportions will catch up with the change.

On the other hand, when t_i was small for some time and then becomes
big, the proportions effectively reflect the current state. So when
someone starts writing to a device on an otherwise quiet system, the
device immediately gets a fraction close to 1.

I'm not sure how big a problem the above behavior is, or what the
desirable behavior would actually be...

> > Clearly, all these values can be computed in O(1).
>
> True, but you get to keep x and t counts over all history, which could
> lead to overflow scenarios (although switching to u64 should mitigate
> that problem in our lifetime).
I think even 32-bit numbers might be fine. The numbers we need to keep
are on the order of the total maximum bandwidth of the system. If you
plug maxbw in place of all x_i(j) and t_i, you'll get l(j)=maxbw (or
2*maxbw if we use 2^i in the formula) and similarly for g. So the math
will work in 32 bits up to a bandwidth on the order of a TB per slice
(2^32 events is about 16 TB if each event is a 4 KB page), where I
expect a slice to be something between 0.1 and 10 s. Reasonable given
today's HW, although you are right that we'll probably have to go to
64 bits soon.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR