2007-08-16 08:39:11

by Peter Zijlstra

Subject: [PATCH 00/23] per device dirty throttling -v9

Per device dirty throttling patches

These patches aim to improve balance_dirty_pages() and directly address three
issues:
1) inter device starvation
2) stacked device deadlocks
3) inter process starvation

1 and 2 are a direct result of removing the global dirty limit and using
per device dirty limits. By giving each device its own dirty limit it will
no longer starve another device, and the cyclic dependency on the dirty limit
is broken.

In order to efficiently distribute the dirty limit across the independent
devices a floating proportion is used; this will allocate a share of the total
limit proportional to the device's recent activity.
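
As a rough illustration of the idea (not the code from the patch series), a
floating proportion can be pictured as each device's slice of a decaying event
count. The names `struct dev_prop`, `prop_event`, `prop_decay` and `dev_limit`
are hypothetical, and halving the counters is only one assumed way of letting
old activity fade.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device state: a count of recent writeback completions. */
struct dev_prop {
	uint64_t events;
};

/* Record one completion on a device. */
static void prop_event(struct dev_prop *d)
{
	d->events++;
}

/* Age all counters by halving them, so that old activity fades out. */
static void prop_decay(struct dev_prop *devs, int n)
{
	for (int i = 0; i < n; i++)
		devs[i].events /= 2;
}

/*
 * Share of total_limit for device idx, proportional to its recent events.
 * Assumes total_limit * events fits in 64 bits.
 */
static uint64_t dev_limit(const struct dev_prop *devs, int n, int idx,
			  uint64_t total_limit)
{
	uint64_t sum = 0;

	assert(n > 0 && idx >= 0 && idx < n);
	for (int i = 0; i < n; i++)
		sum += devs[i].events;
	if (sum == 0)			/* no history yet: split evenly */
		return total_limit / (uint64_t)n;
	return total_limit * devs[idx].events / sum;
}
```

A device that completed three times as many writes as its neighbour would get
three quarters of the limit in this sketch, and the split drifts as the
counters decay and new events arrive.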

3 is done by also scaling the dirty limit proportional to the current task's
recent dirty rate.

Changes since -v8:
- cleanup of the proportion code
- fix percpu_counter_add(&counter, -(unsigned long))
- fix per task dirty rate code
- fwd port to .23-rc2-mm2
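
The `percpu_counter_add(&counter, -(unsigned long))` item presumably refers to
a general C pitfall: negating an unsigned value wraps around, and widening the
result to a larger signed type then yields a large positive number instead of
a negative one. The sketch below shows the effect with fixed-width types. It
assumes a signed 64-bit delta and a 32-bit unsigned operand (as `unsigned long`
is on 32-bit targets); `counter_add`, `sub_buggy` and `sub_fixed` are stand-ins,
not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a counter update that takes a signed 64-bit delta. */
static int64_t counter_add(int64_t counter, int64_t amount)
{
	return counter + amount;
}

/* Buggy: the negation is done in 32-bit unsigned arithmetic and wraps. */
static int64_t sub_buggy(int64_t counter, uint32_t n)
{
	return counter_add(counter, -n);
}

/* Fixed: convert to the signed type first, then negate. */
static int64_t sub_fixed(int64_t counter, uint32_t n)
{
	return counter_add(counter, -(int64_t)n);
}
```

With a 32-bit `int`, `sub_buggy(10, 1)` adds 4294967295 to the counter, while
`sub_fixed(10, 1)` subtracts 1 as intended.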

--


2007-08-16 12:50:03

by Martin Knoblauch

Subject: RE: [PATCH 00/23] per device dirty throttling -v9

>Per device dirty throttling patches
>
>These patches aim to improve balance_dirty_pages() and directly
>address three issues:
>1) inter device starvation
>2) stacked device deadlocks
>3) inter process starvation
>
>1 and 2 are a direct result of removing the global dirty
>limit and using per device dirty limits. By giving each device
>its own dirty limit it will no longer starve another device,
>and the cyclic dependency on the dirty limit is broken.
>
>In order to efficiently distribute the dirty limit across
>the independent devices a floating proportion is used; this
>will allocate a share of the total limit proportional to the
>device's recent activity.
>
>3 is done by also scaling the dirty limit proportional to the
>current task's recent dirty rate.
>
>Changes since -v8:
>- cleanup of the proportion code
>- fix percpu_counter_add(&counter, -(unsigned long))
>- fix per task dirty rate code
>- fwd port to .23-rc2-mm2

Peter,

any chance to get a rollup against 2.6.22-stable?

The 2.6.23 series may not be usable for me due to the
nosharedcache changes for NFS (the new default will massively
disturb the user-space automounter).

Cheers
Martin


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

2007-08-16 12:55:33

by Peter Zijlstra

Subject: RE: [PATCH 00/23] per device dirty throttling -v9

On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:

> Peter,
>
> any chance to get a rollup against 2.6.22-stable?
>
> The 2.6.23 series may not be usable for me due to the
> nosharedcache changes for NFS (the new default will massively
> disturb the user-space automounter).

I'll see what I can do, bit busy with other stuff atm, hopefully after
the weekend.



2007-08-16 13:22:54

by Martin Knoblauch

Subject: RE: [PATCH 00/23] per device dirty throttling -v9


--- Peter Zijlstra wrote:

> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
>
> > Peter,
> >
> > any chance to get a rollup against 2.6.22-stable?
> >
> > The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
>
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after the weekend.
>
Hi Peter,

that would be highly appreciated. Thanks a lot in advance.

Martin


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

2007-08-16 21:29:32

by Christoph Lameter

Subject: Re: [PATCH 00/23] per device dirty throttling -v9

Is there any way to make the global limits on which the dirty rate
calculations are based cpuset specific?

A process is part of a cpuset and that cpuset has only a fraction of
memory of the whole system.

And only a fraction of that fraction can be dirtied. We do not currently
enforce such limits which can cause the amount of dirty pages in
cpusets to become excessively high. I have posted several patchsets that
deal with that issue. See http://lkml.org/lkml/2007/1/16/5

It seems that limiting dirty pages in cpusets may be much easier to
realize in the context of this patchset. The tracking of the dirty pages
per node is not necessary if one would calculate the maximum amount of
dirtyable pages in a cpuset and use that as a base, right?




2007-08-17 07:19:31

by Peter Zijlstra

Subject: Re: [PATCH 00/23] per device dirty throttling -v9

On Thu, 2007-08-16 at 14:29 -0700, Christoph Lameter wrote:
> Is there any way to make the global limits on which the dirty rate
> calculations are based cpuset specific?
>
> A process is part of a cpuset and that cpuset has only a fraction of
> memory of the whole system.
>
> And only a fraction of that fraction can be dirtied. We do not currently
> enforce such limits which can cause the amount of dirty pages in
> cpusets to become excessively high. I have posted several patchsets that
> deal with that issue. See http://lkml.org/lkml/2007/1/16/5
>
> It seems that limiting dirty pages in cpusets may be much easier to
> realize in the context of this patchset. The tracking of the dirty pages
> per node is not necessary if one would calculate the maximum amount of
> dirtyable pages in a cpuset and use that as a base, right?


Currently we do:
dirty = total_dirty * bdi_completions_p * task_dirty_p

As dgc pointed out before, there is the issue of bdi/task correlation,
that is, we do not track task dirty rates per bdi, so now a task that
heavily dirties on one bdi will also get penalised on the others (and
similar issues).
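
Taken literally, the formula above can be sketched as follows. Only the three
factor names come from the formula; `struct frac` and `dirty_limit` are
hypothetical, and how the two proportions are measured is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* A proportion in [0, 1], stored as a fraction to avoid floating point. */
struct frac {
	uint64_t num;
	uint64_t den;
};

/* dirty = total_dirty * bdi_completions_p * task_dirty_p */
static uint64_t dirty_limit(uint64_t total_dirty, struct frac bdi_completions_p,
			    struct frac task_dirty_p)
{
	uint64_t limit;

	assert(bdi_completions_p.den != 0 && task_dirty_p.den != 0);
	limit = total_dirty * bdi_completions_p.num / bdi_completions_p.den;
	return limit * task_dirty_p.num / task_dirty_p.den;
}
```

Because `task_dirty_p` is a single per-task number, the same factor applies no
matter which bdi the task is writing to, which is the bdi/task correlation
issue described above.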

If we were to change it so:
dirty = cpuset_dirty * bdi_completions_p * task_dirty_p

We get additional correlation issues: cpuset/bdi, cpuset/task.
Which could yield surprising results if some bdis are strictly per
cpuset.

The cpuset/task correlation has a strict mapping and could be solved by
keeping the vm_dirties counter per cpuset. However, this would seriously
complicate the code and I'm not sure if it would gain us much.

Anyway, things to ponder. But overall it should be quite doable.



2007-08-17 20:37:23

by Christoph Lameter

Subject: Re: [PATCH 00/23] per device dirty throttling -v9

On Fri, 17 Aug 2007, Peter Zijlstra wrote:

> Currently we do:
> dirty = total_dirty * bdi_completions_p * task_dirty_p
>
> As dgc pointed out before, there is the issue of bdi/task correlation,
> that is, we do not track task dirty rates per bdi, so now a task that
> heavily dirties on one bdi will also get penalised on the others (and
> similar issues).

I think that is tolerable.
>
> If we were to change it so:
> dirty = cpuset_dirty * bdi_completions_p * task_dirty_p
>
> We get additional correlation issues: cpuset/bdi, cpuset/task.
> Which could yield surprising results if some bdis are strictly per
> cpuset.

If we do not do the above then the dirty page calculation for a small
cpuset (e.g. 1 node of a 128-node system) could allow an amount of dirty
pages that will fill up the whole node.
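
A small worked example of that arithmetic, under illustrative assumptions
(equal-sized nodes and a 10% ratio, neither taken from this thread);
`dirty_limit_pages` and `can_fill_node` are hypothetical helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Dirty limit in pages: ratio_pct percent of the pages used as the base. */
static uint64_t dirty_limit_pages(uint64_t base_pages, unsigned int ratio_pct)
{
	assert(ratio_pct <= 100);
	return base_pages * ratio_pct / 100;
}

/* True if a limit is large enough to dirty every page of one node. */
static int can_fill_node(uint64_t limit_pages, uint64_t node_pages)
{
	return limit_pages >= node_pages;
}
```

With 128 nodes of 262144 pages each, a limit based on the whole system is about
3.3 million pages, more than twelve times the size of a single node, so a
one-node cpuset can be filled entirely with dirty pages. Based on the cpuset's
own node, the limit is 26214 pages.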

> The cpuset/task correlation has a strict mapping and could be solved by
> keeping the vm_dirties counter per cpuset. However, this would seriously
> complicate the code and I'm not sure if it would gain us much.

The patchset that I referred to has code to calculate the dirty count and
ratio per cpuset by looping over the nodes. Currently we are having
trouble with small cpusets not performing writeout correctly. This
sometimes may result in OOM conditions because the whole node is full of
dirty pages. If the cpuset boundaries are enforced in a strict way then the
application may fail with an OOM.

We can compensate by recalculating the dirty_ratio based on the smallest
cpuset but then larger cpusets are penalized. Also one cannot set the
dirty_ratio below a certain minimum.

2007-08-23 15:59:40

by Martin Knoblauch

Subject: RE: [PATCH 00/23] per device dirty throttling -v9


--- Peter Zijlstra wrote:

> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
>
> > Peter,
> >
> > any chance to get a rollup against 2.6.22-stable?
> >
> > The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
>
> I'll see what I can do, bit busy with other stuff atm, hopefully after
> the weekend.
>
Hi Peter,

any progress on a version against 2.6.22.5? I have seen the very
positive report from Jeffrey W. Baker and would really love to test
your patch. But as I said, anything newer than 2.6.22.x might not be an
option due to the NFS changes.

Kind regards
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

2007-08-23 17:42:23

by Peter Zijlstra

Subject: RE: [PATCH 00/23] per device dirty throttling -v9

On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> --- Peter Zijlstra wrote:
>
> > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> >
> > > Peter,
> > >
> > > any chance to get a rollup against 2.6.22-stable?
> > >
> > > The 2.6.23 series may not be usable for me due to the
> > > nosharedcache changes for NFS (the new default will massively
> > > disturb the user-space automounter).
> >
> > I'll see what I can do, bit busy with other stuff atm, hopefully after
> > the weekend.
> >
> Hi Peter,
>
> any progress on a version against 2.6.22.5? I have seen the very
> positive report from Jeffrey W. Baker and would really love to test
> your patch. But as I said, anything newer than 2.6.22.x might not be an
> option due to the NFS changes.

mindless port, seems to compile and boot on my test box ymmv.

I think .5 should not present anything other than trivial rejects if
anything. But I'm not keeping -stable in my git remotes so I can't say
for sure.


Attachments:
bdi-rollup-v9-v2.6.22.patch (68.24 kB)

2007-08-24 10:47:21

by Martin Knoblauch

Subject: RE: [PATCH 00/23] per device dirty throttling -v9


--- Peter Zijlstra wrote:

> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra wrote:
> >
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > >
> > > > Peter,
> > > >
> > > > any chance to get a rollup against 2.6.22-stable?
> > > >
> > > > The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > >
> > > I'll see what I can do, bit busy with other stuff atm, hopefully after
> > > the weekend.
> > >
> > Hi Peter,
> >
> > any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not be an
> > option due to the NFS changes.
>
> mindless port, seems to compile and boot on my test box ymmv.
>
> I think .5 should not present anything other than trivial rejects if
> anything. But I'm not keeping -stable in my git remotes so I can't say
> for sure.

Hi Peter,

thanks a lot. It applies to 2.6.22.5 almost cleanly, with just one
8-line offset in readahead.c.

I will report testing-results separately.

Thanks
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

2007-09-03 15:20:36

by Martin Knoblauch

Subject: RFC: [PATCH] Small patch on top of per device dirty throttling -v9


--- Peter Zijlstra wrote:

> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra wrote:
> >
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > >
> > > > Peter,
> > > >
> > > > any chance to get a rollup against 2.6.22-stable?
> > > >
> > > > The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > >
> > > I'll see what I can do, bit busy with other stuff atm, hopefully after
> > > the weekend.
> > >
> > Hi Peter,
> >
> > any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not be an
> > option due to the NFS changes.
>
> mindless port, seems to compile and boot on my test box ymmv.
>
Hi Peter,

while doing my tests I observed that setting dirty_ratio below 5% did
not make a difference at all. Just by chance I found that this
apparently is an enforced limit in mm/page-writeback.c.

With the patch below I have lowered the limit to 2%. With that, things
look a lot better on my systems. Load during writes stays below 1.5 for
one writer. Responsiveness is good.

This may even help without the throttling patch. Not sure that this is
the right thing to do, but it helps :-)

Cheers
Martin

--- linux-2.6.22.5-bdi-v9/mm/page-writeback.c
+++ linux-2.6.22.6+bdi-v9/mm/page-writeback.c
@@ -311,8 +311,11 @@
 	if (dirty_ratio > unmapped_ratio / 2)
 		dirty_ratio = unmapped_ratio / 2;
 
-	if (dirty_ratio < 5)
-		dirty_ratio = 5;
+/*
+** MKN: Lower enforced limit from 5% to 2%
+*/
+	if (dirty_ratio < 2)
+		dirty_ratio = 2;
 
 	background_ratio = dirty_background_ratio;
 	if (background_ratio >= dirty_ratio)
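
The two clamps visible in the hunk above explain why settings below 5% made no
difference. A stand-alone sketch of just those lines, with the floor turned
into a parameter for comparison (`clamp_dirty_ratio` and `floor_pct` are
hypothetical names, not part of the kernel source):

```c
#include <assert.h>

/*
 * Mirror of the two clamps in the hunk: cap the ratio at half of
 * unmapped_ratio, then raise it to the enforced floor if it is lower.
 */
static int clamp_dirty_ratio(int dirty_ratio, int unmapped_ratio, int floor_pct)
{
	assert(floor_pct >= 0);
	if (dirty_ratio > unmapped_ratio / 2)
		dirty_ratio = unmapped_ratio / 2;
	if (dirty_ratio < floor_pct)
		dirty_ratio = floor_pct;
	return dirty_ratio;
}
```

With a floor of 5, requested values of 1 to 4 all come out as 5. With a floor
of 2, a requested 3 stays at 3 and only values below 2 are raised.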


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de