2007-08-14 14:33:42

by Christoph Lameter

Subject: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

The following patchset implements recursive reclaim. Recursive reclaim
is necessary if we run out of memory in the writeout path from reclaim.

This is f.e. important for stacked filesystems or anything that does
complicated processing in the writeout path.

Recursive reclaim works because it limits itself to only reclaim pages
that do not require writeout. It will only remove clean pages from the LRU.
The dirty throttling of the VM during regular reclaim ensures that the amount
of dirty pages is limited. If recursive reclaim causes too many clean pages
to be removed then regular reclaim will throttle all processes until the
dirty ratio is restored. This means that the amount of memory that can
be reclaimed via recursive reclaim is limited to clean memory. The default
ratio is 10%. This means that recursive reclaim can reclaim 90% of memory
before failing. Reclaiming excessive amounts of clean pages may have a
significant performance impact because this means that executable pages
will be removed. However, it ensures that we will no longer fail in the
writeout path.

A patch is included to test this functionality. The test involved allocating
12 Megabytes from the reclaim paths when __PF_MEMALLOC is set. This is enough
to exhaust the reserves.
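
A minimal sketch of the idea follows, for illustration only: reclaim_clean_pages()
is a hypothetical stand-in for the restricted reclaim pass the patches add, not an
actual kernel function. The point is that an allocation failing inside the writeout
path (PF_MEMALLOC already set) falls back to a reclaim pass that may not write
anything out, so it can only take clean pages off the LRU, and then retries.

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/sched.h>

/* Hypothetical helper: reclaim only clean LRU pages, never calling
 * ->writepage() or swapping.  Returns the number of pages freed. */
extern unsigned long reclaim_clean_pages(gfp_t gfp_mask);

static struct page *alloc_page_recursive(gfp_t gfp_mask)
{
	struct page *page = alloc_page(gfp_mask | __GFP_NOWARN);

	/* Only recurse when we are already inside reclaim/writeout. */
	if (!page && (current->flags & PF_MEMALLOC)) {
		if (reclaim_clean_pages(gfp_mask))
			page = alloc_page(gfp_mask | __GFP_NOWARN);
	}
	return page;
}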

--


2007-08-14 14:37:07

by Peter Zijlstra

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tue, 2007-08-14 at 07:21 -0700, Christoph Lameter wrote:
> The following patchset implements recursive reclaim. Recursive reclaim
> is necessary if we run out of memory in the writeout path from reclaim.
>
> This is f.e. important for stacked filesystems or anything that does
> complicated processing in the writeout path.
>
> Recursive reclaim works because it limits itself to only reclaim pages
> that do not require writeout. It will only remove clean pages from the LRU.
> The dirty throttling of the VM during regular reclaim ensures that the amount
> of dirty pages is limited.

No it doesn't. All memory can be tied up by anonymous pages - which are
dirty by definition and are not clamped by the dirty limit.

> If recursive reclaim causes too many clean pages
> to be removed then regular reclaim will throttle all processes until the
> dirty ratio is restored. This means that the amount of memory that can
> be reclaimed via recursive reclaim is limited to clean memory. The default
> ratio is 10%. This means that recursive reclaim can reclaim 90% of memory
> before failing. Reclaiming excessive amounts of clean pages may have a
> significant performance impact because this means that executable pages
> will be removed. However, it ensures that we will no longer fail in the
> writeout path.
>
> A patch is included to test this functionality. The test involved allocating
> 12 Megabytes from the reclaim paths when __PF_MEMALLOC is set. This is enough
> to exhaust the reserves.
>

2007-08-14 15:29:29

by Christoph Lameter

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tue, 14 Aug 2007, Peter Zijlstra wrote:

> On Tue, 2007-08-14 at 07:21 -0700, Christoph Lameter wrote:
> > The following patchset implements recursive reclaim. Recursive reclaim
> > is necessary if we run out of memory in the writeout path from reclaim.
> >
> > This is f.e. important for stacked filesystems or anything that does
> > complicated processing in the writeout path.
> >
> > Recursive reclaim works because it limits itself to only reclaim pages
> > that do not require writeout. It will only remove clean pages from the LRU.
> > The dirty throttling of the VM during regular reclaim ensures that the amount
> > of dirty pages is limited.
>
> No it doesn't. All memory can be tied up by anonymous pages - which are
> dirty by definition and are not clamped by the dirty limit.

Ok but that could be addressed by making sure that a certain portion of
memory is reserved for clean file backed pages.

2007-08-14 19:33:24

by Peter Zijlstra

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tue, 2007-08-14 at 08:29 -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Peter Zijlstra wrote:
>
> > On Tue, 2007-08-14 at 07:21 -0700, Christoph Lameter wrote:
> > > The following patchset implements recursive reclaim. Recursive reclaim
> > > is necessary if we run out of memory in the writeout path from reclaim.
> > >
> > > This is f.e. important for stacked filesystems or anything that does
> > > complicated processing in the writeout path.
> > >
> > > Recursive reclaim works because it limits itself to only reclaim pages
> > > that do not require writeout. It will only remove clean pages from the LRU.
> > > The dirty throttling of the VM during regular reclaim ensures that the amount
> > > of dirty pages is limited.
> >
> > No it doesn't. All memory can be tied up by anonymous pages - which are
> > dirty by definition and are not clamped by the dirty limit.
>
> Ok but that could be addressed by making sure that a certain portion of
> memory is reserved for clean file backed pages.

Which gets us back to the initial problem of sizing this portion and
ensuring it is big enough to service the need.



2007-08-14 19:41:24

by Christoph Lameter

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tue, 14 Aug 2007, Peter Zijlstra wrote:

> > Ok but that could be addressed by making sure that a certain portion of
> > memory is reserved for clean file backed pages.
>
> Which gets us back to the initial problem of sizing this portion and
> ensuring it is big enough to service the need.

Clean file backed pages dominate memory on most boxes. They can be
calculated as NR_FILE_PAGES - NR_FILE_DIRTY.

On my 2G system that is

Cached: 1731480 kB
Dirty: 424 kB

So for most loads the patch as-is will fix your issues. The problem arises
if you have extreme loads that make the majority of pages anonymous.

We could change min_free_kbytes to specify the number of free + clean
pages required (if we can do atomic reclaim then we do not need it
anymore). Then we can specify a large portion of memory for
min_free_kbytes. 20%? That would give you 400M on my box which would
certainly suffice.

If the amount of clean file backed pages falls below that limit then do
the usual reclaim. If we write anonymous pages out to swap then they
can also become clean and reclaimable.
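
As a rough sketch of that check (min_clean_pages is a hypothetical tunable
analogous to min_free_kbytes; NR_FILE_PAGES and NR_FILE_DIRTY are the existing
vmstat counters):

#include <linux/vmstat.h>

/* Hypothetical tunable, in pages, analogous to min_free_kbytes. */
extern unsigned long min_clean_pages;

static int clean_pages_low(void)
{
	long clean = global_page_state(NR_FILE_PAGES) -
		     global_page_state(NR_FILE_DIRTY);

	/* Below the reserve of clean file-backed pages: run the usual
	 * reclaim/writeout (or swap anonymous pages) to replenish it. */
	return clean < (long)min_clean_pages;
}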


2007-08-15 12:23:10

by Nick Piggin

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tue, Aug 14, 2007 at 07:21:03AM -0700, Christoph Lameter wrote:
> The following patchset implements recursive reclaim. Recursive reclaim
> is necessary if we run out of memory in the writeout path from reclaim.
>
> This is f.e. important for stacked filesystems or anything that does
> complicated processing in the writeout path.

Filesystems (most of them) that require complicated allocations at
writeout time suck. That said, especially with network ones, it
seems like making them preallocate or reserve required memory isn't
progressing very smoothly. I think these patchsets are definitely
worth considering as an alternative.

No substantial comments though. I've been sick all week.



> Recursive reclaim works because it limits itself to only reclaim pages
> that do not require writeout. It will only remove clean pages from the LRU.
> The dirty throttling of the VM during regular reclaim ensures that the amount
> of dirty pages is limited. If recursive reclaim causes too many clean pages
> to be removed then regular reclaim will throttle all processes until the
> dirty ratio is restored. This means that the amount of memory that can
> be reclaimed via recursive reclaim is limited to clean memory. The default
> ratio is 10%. This means that recursive reclaim can reclaim 90% of memory
> before failing. Reclaiming excessive amounts of clean pages may have a
> significant performance impact because this means that executable pages
> will be removed. However, it ensures that we will no longer fail in the
> writeout path.
>
> A patch is included to test this functionality. The test involved allocating
> 12 Megabytes from the reclaim paths when __PF_MEMALLOC is set. This is enough
> to exhaust the reserves.
>
> --

2007-08-15 13:12:31

by Peter Zijlstra

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 2007-08-15 at 14:22 +0200, Nick Piggin wrote:
> On Tue, Aug 14, 2007 at 07:21:03AM -0700, Christoph Lameter wrote:
> > The following patchset implements recursive reclaim. Recursive reclaim
> > is necessary if we run out of memory in the writeout path from reclaim.
> >
> > This is f.e. important for stacked filesystems or anything that does
> > complicated processing in the writeout path.
>
> Filesystems (most of them) that require complicated allocations at
> writeout time suck. That said, especially with network ones, it
> seems like making them preallocate or reserve required memory isn't
> progressing very smoothly.

Mainly because we seem to go in circles :-(

> I think these patchsets are definitely
> worth considering as an alternative.

Honestly, I don't. They very much do not solve the problem, they just
displace it.

Christoph's suggestion to set min_free_kbytes to 20% is ridiculous - nor
does it solve all deadlocks :-(


> No substantial comments though.

Please do ponder the problem and its proposed solutions, because I'm
going crazy here.

The problem with networked swap is:

TX
- we need some memory to initiate writeout
- writeout needs to be throttled in order to make this bounded

(currently sort-of done by throttle_vm_writeout() - but Evgeniy and
Daniel Phillips are working on a more generic approach)

RX
- we basically need infinite memory to receive the network reply
to complete writeout. Consider the following scenario:

3 machines, A, B, C;

A: * networked swapped
* networked service

B: * client for networked service

C: * server for networked swap

C becomes unreachable/slow for a while
B sends massive amounts of traffic toward A
A consumes all memory with non-critical traffic from B and wedges

- so we need a threshold of some sorts to start tossing non-critical
network packets away. (because the consumer of these packets may be
the one swapping and is therefore frozen)

- we also need to ensure memory doesn't fragment too badly during the
receiving -> tossing phase. Otherwise we might again wedge due to OOM

and then there is a TCP-specific deadlock: TCP has a global limit on
the amount of skb memory that can be in socket receive queues. Once we
hit this limit with non-critical data (because the consumers are waiting
on swap) all further packets will be tossed and we'll never receive C's
completion


<> Now my solution was to have a reserve just big enough to fit:
- TX
- RX (large enough to overflow the IP fragment reassembly)

that way, whenever we receive a packet and find we need the reserve
to back this packet we must only use this for critical services.
(this provides the threshold previously mentioned)

we then process the packet until socket demux (where the skb gets
associated with a sk - and can therefore determine whether it is
critical or not) and toss all packets that are non-critical. This frees
up the memory to receive the next packet, and this can continue ad
infinitum - until we finally do get C's completion and get out of the
tight spot.
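
In code, the demux-time decision would look roughly like this (skb_emergency()
and sk_is_memalloc() are illustrative names for "was this packet backed by the
reserve" and "does this socket serve swap I/O"; they are not the actual helpers
from the patch set):

#include <linux/skbuff.h>
#include <net/sock.h>

/* Hypothetical helpers, named for illustration only. */
extern int skb_emergency(const struct sk_buff *skb);
extern int sk_is_memalloc(const struct sock *sk);

/* Returns 1 if the packet may proceed, 0 if it was dropped. */
static int sock_filter_emergency(struct sock *sk, struct sk_buff *skb)
{
	if (!skb_emergency(skb))	/* normal memory: normal path */
		return 1;
	if (sk_is_memalloc(sk))		/* critical (swap) socket: keep it */
		return 1;
	/* Non-critical packet backed by the reserve: toss it so the
	 * reserve is free again to receive the next packet. */
	kfree_skb(skb);
	return 0;
}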


<> What Christoph is proposing is doing recursive reclaim and not
initiating writeout. This will only work _IFF_ there are clean pages
about. Which in the general case need not be true (memory might be
packed with anonymous pages - consider an MPI cluster doing computation
stuff). So this gets us a workload-dependent solution - which IMHO is
bad!

Also his suggestion to crank up min_free_kbytes to 20% of machine memory
is not workable (again imagine this MPI cluster losing 20% of its
collective memory, very much out of the question).

Nor does that solve the TCP deadlock, you need some additional condition
to break that.

> I've been sick all week.

Do get well.



2007-08-15 13:21:56

by Andi Kleen

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

Peter Zijlstra <[email protected]> writes:
>
> Christoph's suggestion to set min_free_kbytes to 20% is ridiculous - nor
> does it solve all deadlocks :-(

A minimum enforced reclaimable non-dirty threshold wouldn't be
that ridiculous though. So the memory could be used, just not
for dirty data.

His patchkit essentially turns the GFP_ATOMIC requirements
from free to easily reclaimable. I see that as a general improvement.

I remember sct talked about this many years ago and it's still
a good idea.

-Andi

2007-08-15 13:55:54

by Peter Zijlstra

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 2007-08-15 at 16:15 +0200, Andi Kleen wrote:
> Peter Zijlstra <[email protected]> writes:
> >
> > Christoph's suggestion to set min_free_kbytes to 20% is ridiculous - nor
> > does it solve all deadlocks :-(
>
> A minimum enforced reclaimable non-dirty threshold wouldn't be
> that ridiculous though. So the memory could be used, just not
> for dirty data.

Sure, and note that various patches to that effect have already been
posted (even one by myself); they introduce a third reclaim list on
which clean pages live. If you add to that a requirement to keep that
list at a certain level, one could replace part (or all) of the reserves
with that.

But that is more an optimisation than anything else.

The thing I strongly objected to was the 20%.

Also his approach misses the threshold - the extra condition needed to
break out of the various network deadlocks. There is no point that says
- ok, and now we're in trouble, drop anything non-critical. Without that
you'll always run into a wall.

> His patchkit essentially turns the GFP_ATOMIC requirements
> from free to easily reclaimable. I see that as a general improvement.
>
> I remember sct talked about this many years ago and it's still
> a good idea.

That is his second patch-set, and I do worry about the irq latency that
that will introduce. It very much has the potential to ruin everything
that cares about interactiveness or latency.

Hence my suggestion to look at threaded interrupts, in which case it
would only ruin the latency of the interrupt that does this, but does
not hold off other interrupts/processes. Granted PI would be nice to
ensure the threaded handler does eventually finish.




2007-08-15 14:34:32

by Andi Kleen

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

> That is his second patch-set, and I do worry about the irq latency that
> that will introduce. It very much has the potential to ruin everything
> that cares about interactiveness or latency.

I proposed a simple way to avoid increasing interrupt latency.

-Andi

2007-08-15 20:29:59

by Christoph Lameter

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 15 Aug 2007, Peter Zijlstra wrote:

> Christoph's suggestion to set min_free_kbytes to 20% is ridiculous - nor
> does it solve all deadlocks :-(

Only if min_free_kbytes is really the minimum number of free pages and not
the minimum number of clean pages as I suggested.

All deadlocks? There are numerous ones that can come about for different
reasons. Which ones are we talking about?

> RX
> - we basically need infinite memory to receive the network reply
> to complete writeout. Consider the following scenario:

There is no infinite memory. At some point you need to bound the amount
of memory that the network allocates.

> - so we need a threshold of some sorts to start tossing non-critical
> network packets away. (because the consumer of these packets may be
> the one swapping and is therefore frozen)

Right.

> <> What Christoph is proposing is doing recursive reclaim and not
> initiating writeout. This will only work _IFF_ there are clean pages
> about. Which in the general case need not be true (memory might be
> packed with anonymous pages - consider an MPI cluster doing computation
> stuff). So this gets us a workload-dependent solution - which IMHO is
> bad!

In the general case this is true even for an MPI job, because the MPI job
needs to have executable code and libraries in memory. At minimum these
are reclaimable.

> Also his suggestion to crank up min_free_kbytes to 20% of machine memory
> is not workable (again imagine this MPI cluster losing 20% of its
> collective memory, very much out of the question).

It is workable. If you crank the min_clean_pages (this is essentially
what it is) up to 20% then you basically reserve 20% of your memory for
executable pages and page cache pages. And in an emergency these can be
reclaimed to resolve any OOM issues. Note that my patch only accesses
these reserves when we would otherwise OOM. This is rare.

> Nor does that solve the TCP deadlock, you need some additional condition
> to break that.

But that is an issue that is better handled in the network stack.

2007-08-15 20:32:19

by Christoph Lameter

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 15 Aug 2007, Peter Zijlstra wrote:

> The thing I strongly objected to was the 20%.

Well then set it to 10%. We have min_free_kbytes now and so we are used
to these limits.

> Also his approach misses the threshold - the extra condition needed to
> break out of the various network deadlocks. There is no point that says
> - ok, and now we're in trouble, drop anything non-critical. Without that
> you'll always run into a wall.

Networking?

> That is his second patch-set, and I do worry about the irq latency that
> that will introduce. It very much has the potential to ruin everything
> that cares about interactiveness or latency.

Where is the patchset introducing additional latencies? Most of the time
it only saves and restores flags. We already enable and disable interrupts
in the reclaim path but we assume that interupts are always enabled when
we enter reclaim.

2007-08-16 03:29:34

by Nick Piggin

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, Aug 15, 2007 at 03:12:06PM +0200, Peter Zijlstra wrote:
> On Wed, 2007-08-15 at 14:22 +0200, Nick Piggin wrote:
> > On Tue, Aug 14, 2007 at 07:21:03AM -0700, Christoph Lameter wrote:
> > > The following patchset implements recursive reclaim. Recursive reclaim
> > > is necessary if we run out of memory in the writeout path from reclaim.
> > >
> > > This is f.e. important for stacked filesystems or anything that does
> > > complicated processing in the writeout path.
> >
> > Filesystems (most of them) that require complicated allocations at
> > writeout time suck. That said, especially with network ones, it
> > seems like making them preallocate or reserve required memory isn't
> > progressing very smoothly.
>
> Mainly because we seem to go in circles :-(
>
> > I think these patchsets are definitely
> > worth considering as an alternative.
>
> Honestly, I don't. They very much do not solve the problem, they just
> displace it.

Well perhaps it doesn't work for networked swap, because dirty accounting
doesn't work the same way with anonymous memory... but for _filesystems_,
right?

I mean, it intuitively seems like a good idea to terminate the recursive
allocation problem with an attempt to reclaim clean pages rather than
immediately let them have-at our memory reserve that is used for other
things as well. Any and all writepage() via reclaim is allowed to eat
into all of memory (I hate that writepage() ever has to use any memory,
and have prototyped how to fix that for simple block based filesystems
in fsblock, but others will require it).


> Christoph's suggestion to set min_free_kbytes to 20% is ridiculous - nor
> does it solve all deadlocks :-(

Well of course it doesn't, but it is a pragmatic way to reduce some
memory depletion cases. I don't see too much harm in it (although I didn't
see the 20% suggestion?)


> > No substantial comments though.
>
> Please do ponder the problem and its proposed solutions, because I'm
> going crazy here.

Well yeah I think you simply have to reserve a minimum amount of memory in
order to reclaim a page, and I don't see any other way to do it other than
what you describe to be _technically_ deadlock free.

But firstly, you don't _want_ to start dropping packets when you hit a tough
patch in reclaim -- even if you are strictly deadlock free. And secondly,
I think recursive reclaim could reduce the deadlocks in practice which is
not a bad thing as your patches aren't merged.

How are your deadlock patches going anyway? AFAIK they are mostly a network
issue and I haven't been keeping up with them for a while. Do you really need
networked swap and actually encounter the deadlock, or is it just a question of
wanting to fix the bugs? If the former, what for, may I ask?


> <> What Christoph is proposing is doing recursive reclaim and not
> initiating writeout. This will only work _IFF_ there are clean pages
> about. Which in the general case need not be true (memory might be
> packed with anonymous pages - consider an MPI cluster doing computation
> stuff). So this gets us a workload-dependent solution - which IMHO is
> bad!

Although you will quite likely have at least a couple of MB worth of
clean program text. The important part of recursive reclaim is that it
doesn't so easily allow reclaim to blow all memory reserves (including
interrupt context). Sure you still have theoretical deadlocks, but if
I understand correctly, they are going to be lessened. I would be
really interested to see if even just these recursive reclaim patches
eliminate the problem in practice.


> > I've been sick all week.
>
> Do get well.

Thanks!

2007-08-16 20:27:34

by Christoph Lameter

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Thu, 16 Aug 2007, Nick Piggin wrote:

> > Honestly, I don't. They very much do not solve the problem, they just
> > displace it.
>
> Well perhaps it doesn't work for networked swap, because dirty accounting
> doesn't work the same way with anonymous memory... but for _filesystems_,
> right?

Regular reclaim also cannot immediately write out pages. Writes are
usually deferred. If you have too many anonymous pages in regular reclaim
then you can have the same issues.

The difference is that recursive reclaim does not trigger writeout at
the moment but we could address that by having a pageout list that then
starts writes from another context. Then both reclaims would be able to
trigger writeout.
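
A sketch of that idea (all names here are hypothetical, not from any posted
patch): reclaim queues the dirty page on a list instead of calling
->writepage() itself, and a worker running in process context issues the
writes later.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

static LIST_HEAD(deferred_pageout_list);
static DEFINE_SPINLOCK(deferred_pageout_lock);

/* Runs in process context and performs the writes reclaim deferred. */
static void deferred_pageout_fn(struct work_struct *work)
{
	spin_lock(&deferred_pageout_lock);
	while (!list_empty(&deferred_pageout_list)) {
		struct page *page = list_entry(deferred_pageout_list.next,
					       struct page, lru);
		list_del(&page->lru);
		spin_unlock(&deferred_pageout_lock);
		/* Pages are queued locked; write_one_page() starts the
		 * write and unlocks the page. */
		write_one_page(page, 0);
		spin_lock(&deferred_pageout_lock);
	}
	spin_unlock(&deferred_pageout_lock);
}

static DECLARE_WORK(deferred_pageout_work, deferred_pageout_fn);

/* Called from reclaim instead of calling ->writepage() directly. */
static void defer_pageout(struct page *page)
{
	spin_lock(&deferred_pageout_lock);
	list_add_tail(&page->lru, &deferred_pageout_list);
	spin_unlock(&deferred_pageout_lock);
	schedule_work(&deferred_pageout_work);
}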

2007-08-20 03:51:47

by Peter Zijlstra

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Thu, 2007-08-16 at 05:29 +0200, Nick Piggin wrote:
> On Wed, Aug 15, 2007 at 03:12:06PM +0200, Peter Zijlstra wrote:
> > On Wed, 2007-08-15 at 14:22 +0200, Nick Piggin wrote:
> > > On Tue, Aug 14, 2007 at 07:21:03AM -0700, Christoph Lameter wrote:
> > > > The following patchset implements recursive reclaim. Recursive reclaim
> > > > is necessary if we run out of memory in the writeout path from reclaim.
> > > >
> > > > This is f.e. important for stacked filesystems or anything that does
> > > > complicated processing in the writeout path.
> > >
> > > Filesystems (most of them) that require complicated allocations at
> > > writeout time suck. That said, especially with network ones, it
> > > seems like making them preallocate or reserve required memory isn't
> > > progressing very smoothly.
> >
> > Mainly because we seem to go in circles :-(
> >
> > > I think these patchsets are definitely
> > > worth considering as an alternative.
> >
> > Honestly, I don't. They very much do not solve the problem, they just
> > displace it.
>
> Well perhaps it doesn't work for networked swap, because dirty accounting
> doesn't work the same way with anonymous memory... but for _filesystems_,
> right?
>
> I mean, it intuitively seems like a good idea to terminate the recursive
> allocation problem with an attempt to reclaim clean pages rather than
> immediately let them have-at our memory reserve that is used for other
> things as well.

I'm concerned about the worst case scenarios, and those don't change.
The proposed changes can be seen as an optimisation of various things,
but they do not change the fundamental issues.

> Any and all writepage() via reclaim is allowed to eat
> into all of memory (I hate that writepage() ever has to use any memory,
> and have prototyped how to fix that for simple block based filesystems
> in fsblock, but others will require it).
>
>
> > Christoph's suggestion to set min_free_kbytes to 20% is ridiculous - nor
> > does it solve all deadlocks :-(
>
> Well of course it doesn't, but it is a pragmatic way to reduce some
> memory depletion cases. I don't see too much harm in it (although I didn't
> see the 20% suggestion?)

Sure, and on that note I don't object to them, they might be quite
useful at times. It just doesn't help the worst case scenarios.

> > > No substantial comments though.
> >
> > Please do ponder the problem and its proposed solutions, because I'm
> > going crazy here.
>
> Well yeah I think you simply have to reserve a minimum amount of memory in
> order to reclaim a page, and I don't see any other way to do it other than
> what you describe to be _technically_ deadlock free.

Right, and I guess I have to go at it again, this time ensuring not to
touch the fast-path nor sacrificing anything NUMA for simplicity in the
reclaim path.

(I think it's a good thing to be technically deadlock free - and if your
work on the fault path rewrite and buffered write rework shows anything
it is that you seem to agree with this)

> But firstly, you don't _want_ to start dropping packets when you hit a tough
> patch in reclaim -- even if you are strictly deadlock free. And secondly,
> I think recursive reclaim could reduce the deadlocks in practice which is
> not a bad thing as your patches aren't merged.

None of the people who have actually used these patches seem to object to
the dropping packets thing. Nor do I see that as a real problem:
networks are assumed lossy - also if you really need that traffic for a
RT app that also runs on the machine you need networked swap on (odd
combination but hey, it should be possible) then I can make that work as
well with a little bit more effort.

Also, I'm very reluctant to accept a known deadlock, esp. since the
changes needed are not _that_ complex.

> How are your deadlock patches going anyway? AFAIK they are mostly a network
> issue and I haven't been keeping up with them for a while.

They really do rely on some VM interaction too; the network does not have
enough information to break out of the deadlock on its own.

As for how it's going, it seems to work quite reliably in my test setup -
that is, I can shut down the NFS server, swamp the client in network
traffic for hours (yes it will quickly stop userspace) and then restart
the NFS server and the client will reconnect and resume operation.

There are also a few people running various versions of my patches in
production environments. One university is running it on a 500-node
cluster and another on ~500 thin-clients and there is someone using it
in a blade product.

> Do you really need
> networked swap and actually encounter the deadlock, or is it just a question of
> wanting to fix the bugs? If the former, what for, may I ask?

Yes (we - not I personally) want networked swap. There is quite the
demand for it in the market. It allows clusters and blades to be built
without any storage - which not only saves on the initial cost of a hard
drive [1] but also on maintenance and, more importantly, on energy cost
and heat production.

[1] a single drive is not that expensive, but when you're talking about
a 1000 node cluster things tend to add up.

> > <> What Christoph is proposing is doing recursive reclaim and not
> > initiating writeout. This will only work _IFF_ there are clean pages
> > about. Which in the general case need not be true (memory might be
> > packed with anonymous pages - consider an MPI cluster doing computation
> > stuff). So this gets us a workload-dependent solution - which IMHO is
> > bad!
>
> Although you will quite likely have at least a couple of MB worth of
> clean program text. The important part of recursive reclaim is that it
> doesn't so easily allow reclaim to blow all memory reserves (including
> interrupt context). Sure you still have theoretical deadlocks, but if
> I understand correctly, they are going to be lessened. I would be
> really interested to see if even just these recursive reclaim patches
> eliminate the problem in practice.

were we much bothered by the buffered write deadlock? - why accept a
known deadlock if a solid solution is quite attainable?



2007-08-20 19:15:18

by Christoph Lameter

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 20 Aug 2007, Peter Zijlstra wrote:

> > > <> What Christoph is proposing is doing recursive reclaim and not
> > > initiating writeout. This will only work _IFF_ there are clean pages
> > > about. Which in the general case need not be true (memory might be
> > > packed with anonymous pages - consider an MPI cluster doing computation
> > > stuff). So this gets us a workload-dependent solution - which IMHO is
> > > bad!
> >
> > Although you will quite likely have at least a couple of MB worth of
> > clean program text. The important part of recursive reclaim is that it
> > doesn't so easily allow reclaim to blow all memory reserves (including
> > interrupt context). Sure you still have theoretical deadlocks, but if
> > I understand correctly, they are going to be lessened. I would be
> > really interested to see if even just these recursive reclaim patches
> > eliminate the problem in practice.
>
> were we much bothered by the buffered write deadlock? - why accept a
> known deadlock if a solid solution is quite attainable?

Buffered write deadlock? How does that exactly occur? Memory allocation in
the writeout path while we hold locks?

There are many worst-case scenarios in the current reclaim implementation
that we have so far not addressed, because the code is very sensitive and
it is not clear that the complexity introduced by such changes is offset
by the benefits gained.

2007-08-21 00:28:41

by Nick Piggin

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, Aug 20, 2007 at 05:51:34AM +0200, Peter Zijlstra wrote:
> On Thu, 2007-08-16 at 05:29 +0200, Nick Piggin wrote:
> > Well perhaps it doesn't work for networked swap, because dirty accounting
> > doesn't work the same way with anonymous memory... but for _filesystems_,
> > right?
> >
> > I mean, it intuitively seems like a good idea to terminate the recursive
> > allocation problem with an attempt to reclaim clean pages rather than
> > immediately let them have-at our memory reserve that is used for other
> > things as well.
>
> I'm concerned about the worst case scenarios, and those don't change.
> The proposed changes can be seen as an optimisation of various things,
> but they do not change the fundamental issues.

No, although it sounded like you didn't see any use in these patches.
Which may be true if you're just looking at solving the theoretical deadlocks,
but I think they might be worth looking at to practically solve some
of them and give better reclaim behaviour in general (but in saying
that I'm not excluding your patches).


> > Well yeah I think you simply have to reserve a minimum amount of memory in
> > order to reclaim a page, and I don't see any other way to do it other than
> > what you describe to be _technically_ deadlock free.
>
> Right, and I guess I have to go at it again, this time ensuring not to
> touch the fast-path nor sacrificing anything NUMA for simplicity in the
> reclaim path.
>
> (I think its a good thing to be technically deadlock free - and if your
> work on the fault path rewrite and buffered write rework shows anything
> it is that you seem to agree with this)

I do of course. It is one thing to have a real lock deadlock
in some core path, and another to have this memory deadlock in a
known-to-be-dodgy configuration (Linus said last year that he didn't
want to go out of our way to support this, right?)... But if you can
solve it without impacting fastpaths etc. then I don't see any
objection to it.


> > But firstly, you don't _want_ to start dropping packets when you hit a tough
> > patch in reclaim -- even if you are strictly deadlock free. And secondly,
> > I think recursive reclaim could reduce the deadlocks in practice which is
> > not a bad thing as your patches aren't merged.
>
> None of the people who have actually used these patches seem to object to
> the dropping packets thing. Nor do I see that as a real problem,
> networks are assumed lossy - also if you really need that traffic for a
> RT app that also runs on the machine you need networked swap on (odd
> combination but hey, it should be possible) then I can make that work as
> well with a little bit more effort.

I don't mean for correctness, but for throughput. If you're doing a
lot of network operations right near the memory limit, then it could
be possible that these deadlock paths get triggered relatively often.
With Christoph's patches, I think it would tend to be less.


> > How are your deadlock patches going anyway? AFAIK they are mostly a network
> > issue and I haven't been keeping up with them for a while.
>
> They really do rely on some VM interaction too, network does not have
> enough information to break out of the deadlock on its own.

The thing I don't much like about your patches is the addition of more
of these global reserve type things in the allocators. They kind of
suck (not your code, just the concept of them in general -- ie. including
the PF_MEMALLOC reserve). I'd like to eventually reach a model where
reclaimable memory from a given subsystem is always backed by enough
resources to be able to reclaim it. What stopped you from going that
route with the network subsystem? (too much churn, or something
fundamental?)


> > Although you will quite likely have at least a couple of MB worth of
> > clean program text. The important part of recursive reclaim is that it
> > doesn't so easily allow reclaim to blow all memory reserves (including
> > interrupt context). Sure you still have theoretical deadlocks, but if
> > I understand correctly, they are going to be lessened. I would be
> > really interested to see if even just these recursive reclaim patches
> > eliminate the problem in practice.
>
> were we much bothered by the buffered write deadlock? - why accept a
> known deadlock if a solid solution is quite attainable?

As a general statement, I agree of course ;)

2007-08-21 00:32:21

by Nick Piggin

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, Aug 20, 2007 at 12:15:01PM -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
>
> > > > <> What Christoph is proposing is doing recursive reclaim and not
> > > > initiating writeout. This will only work _IFF_ there are clean pages
> > > > about. Which in the general case need not be true (memory might be
> > > > packed with anonymous pages - consider an MPI cluster doing computation
> > > > stuff). So this gets us a workload-dependent solution - which IMHO is
> > > > bad!
> > >
> > > Although you will quite likely have at least a couple of MB worth of
> > > clean program text. The important part of recursive reclaim is that it
> > > doesn't so easily allow reclaim to blow all memory reserves (including
> > > interrupt context). Sure you still have theoretical deadlocks, but if
> > > I understand correctly, they are going to be lessened. I would be
> > > really interested to see if even just these recursive reclaim patches
> > > eliminate the problem in practice.
> >
> > were we much bothered by the buffered write deadlock? - why accept a
> > known deadlock if a solid solution is quite attainable?
>
> Buffered write deadlock? How does that exactly occur? Memory allocation in
> the writeout path while we hold locks?

Different topic. Peter was talking about the write(2) write deadlock
where we take a page fault while holding a page lock (which leads to
lock inversion, taking the lock twice etc.)

2007-08-21 15:29:40

by Peter Zijlstra

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

[ now with CCs ]

On Tue, 2007-08-21 at 02:28 +0200, Nick Piggin wrote:

> I do of course. It is one thing to have a real lock deadlock
> in some core path, and another to have this memory deadlock in a
> known-to-be-dodgy configuration (Linus said last year that he didn't
> want to go out of our way to support this, right?)... But if you can
> solve it without impacting fastpaths etc. then I don't see any
> objection to it.

That has been my intention, getting the problem solved without touching
fast paths and with minimal changes to how things are currently done.

> I don't mean for correctness, but for throughput. If you're doing a
> lot of network operations right near the memory limit, then it could
> be possible that these deadlock paths get triggered relatively often.
> With Christoph's patches, I think it would tend to be less.

Christoph's patches all rely on file backed memory being predominant.
[ and to a certain degree fully ignore anonymous memory loads :-( ]

Whereas quite a few realistic loads strive to minimise these - I'll
again fall back to my MPI cluster example: they would want to use so
much anonymous memory to perform their calculations that nothing
except the hot paths of the code stays present in memory. In these
scenarios 1 MB of text would already be a lot.

> > > How are your deadlock patches going anyway? AFAIK they are mostly a network
> > > issue and I haven't been keeping up with them for a while.
> >
> > They really do rely on some VM interaction too, network does not have
> > enough information to break out of the deadlock on its own.
>
> The thing I don't much like about your patches is the addition of more
> of these global reserve type things in the allocators. They kind of
> suck (not your code, just the concept of them in general -- ie. including
> the PF_MEMALLOC reserve). I'd like to eventually reach a model where
> reclaimable memory from a given subsystem is always backed by enough
> resources to be able to reclaim it. What stopped you from going that
> route with the network subsystem? (too much churn, or something
> fundamental?)

I'm wanting to keep the patches as non-intrusive as possible, exactly
because some people consider this a fringe functionality. Doing as you
say does sound like a noble goal, but would require massive overhauls.

Also, I'm not quite sure how this would apply to networking. It
generally doesn't have much reclaimable memory sitting around, and it
heavily relies on kmalloc so an alloc/free cycle accounting system would
quickly involve a lot of the things I'm already doing.

(also one advantage of keeping it all in the buddy allocator is that it
can more easily form larger order pages)



2007-08-23 03:02:48

by Nick Piggin

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tue, Aug 21, 2007 at 05:29:27PM +0200, Peter Zijlstra wrote:
> [ now with CCs ]
>
> On Tue, 2007-08-21 at 02:28 +0200, Nick Piggin wrote:
>
> > I do of course. It is one thing to have a real lock deadlock
> > in some core path, and another to have this memory deadlock in a
> > known-to-be-dodgy configuration (Linus said last year that he didn't
> > want to go out of our way to support this, right?)... But if you can
> > solve it without impacting fastpaths etc. then I don't see any
> > objection to it.
>
> That has been my intention, getting the problem solved without touching
> fast paths and with minimal changes to how things are currently done.
>
> > I don't mean for correctness, but for throughput. If you're doing a
> > lot of network operations right near the memory limit, then it could
> > be possible that these deadlock paths get triggered relatively often.
> > With Christoph's patches, I think it would tend to be less.
>
> Christoph's patches all rely on file backed memory being predominant.
> [ and to a certain degree fully ignore anonymous memory loads :-( ]

Yes.


> Whereas quite a few realistic loads strive to minimise these - I'll
> again fall back to my MPI cluster example, they would want to use so
> much anonymous memory to preform their calculations that everything
> except the hot paths of code are present in memory. In these scenarios 1
> MB of text would already be a lot.

OK, I don't know exactly about MPI workloads. But I mean a few basic
things like the C and MPI libraries could already be quite big before
you even consider the application text (OK it won't be all paged in).

Maybe it won't be enough, but I think some form of recursive reclaim
will be better than our current scheme. Even assuming your patches are
in the kernel, don't you think it is a good idea to _not_ have potentially
complex writeout from reclaim just default to using up memory reserves?


> > > > How are your deadlock patches going anyway? AFAIK they are mostly a network
> > > > issue and I haven't been keeping up with them for a while.
> > >
> > > They really do rely on some VM interaction too, network does not have
> > > enough information to break out of the deadlock on its own.
> >
> > The thing I don't much like about your patches is the addition of more
> > of these global reserve type things in the allocators. They kind of
> > suck (not your code, just the concept of them in general -- ie. including
> > the PF_MEMALLOC reserve). I'd like to eventually reach a model where
> > reclaimable memory from a given subsystem is always backed by enough
> > resources to be able to reclaim it. What stopped you from going that
> > route with the network subsystem? (too much churn, or something
> > fundamental?)
>
> I'm wanting to keep the patches as non-intrusive as possible, exactly
> because some people consider this a fringe functionality. Doing as you
> say does sound like a noble goal, but would require massive overhauls.

But the code would end up better, wouldn't it? And it could be done
incrementally?


> Also, I'm not quite sure how this would apply to networking. It
> generally doesn't have much reclaimable memory sitting around, and it
> heavily relies on kmalloc so an alloc/free cycle accounting system would
> quickly involve a lot of the things I'm already doing.

It wouldn't use reclaimable memory as such, but would have some small
amounts of reserve memory for allocating all those things required to
get a response from critical sockets. NBD for example would also then
be sure to reserve enough memory to at least clean one page etc. That's
the way the block layer has gone, which seems to be pretty good and I
think much better than having it in the buddy allocator.
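
A sketch of that per-subsystem reserve model (the pool and function names are
illustrative, not an actual NBD patch): the driver pre-allocates a small
mempool that guarantees it can always build the request needed to clean at
least one page, instead of leaning on a global PF_MEMALLOC-style reserve.

#include <linux/blkdev.h>
#include <linux/mempool.h>

static mempool_t *nbd_req_pool;		/* illustrative name */

static int nbd_reserve_init(void)
{
	/* Keep a couple of requests available even under complete
	 * memory exhaustion - enough to write out one page. */
	nbd_req_pool = mempool_create_kmalloc_pool(2, sizeof(struct request));
	return nbd_req_pool ? 0 : -ENOMEM;
}

static struct request *nbd_get_request(gfp_t gfp)
{
	/* Falls back to the reserved elements when kmalloc fails,
	 * so writeout can always make forward progress. */
	return mempool_alloc(nbd_req_pool, gfp);
}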

> (also one advantage of keeping it all in the buddy allocator is that it
> can more easily form larger order pages)

I don't know if that is a really good advantage. The amount of memory
involved should just be pretty small. I mean it is an advantage, but
there are other disadvantages (imagine the mess if other subsystems used
their own global reserves in the allocator rather than mempools etc). I
don't see why networking is fundamentally more deserving of its own pools
in the allocator than anybody else.

2007-09-05 09:21:15

by Daniel Phillips

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tuesday 14 August 2007 07:21, Christoph Lameter wrote:
> The following patchset implements recursive reclaim. Recursive
> reclaim is necessary if we run out of memory in the writeout path
> from reclaim.
>
> This is f.e. important for stacked filesystems or anything that does
> complicated processing in the writeout path.
>
> Recursive reclaim works because it limits itself to only reclaim
> pages that do not require writeout. It will only remove clean pages
> from the LRU. The dirty throttling of the VM during regular reclaim
> ensures that the amount of dirty pages is limited. If recursive
> reclaim causes too many clean pages to be removed then regular
> reclaim will throttle all processes until the dirty ratio is
> restored. This means that the amount of memory that can be reclaimed
> via recursive reclaim is limited to clean memory. The default ratio
> is 10%. This means that recursive reclaim can reclaim 90% of memory
> before failing. Reclaiming excessive amounts of clean pages may have
> a significant performance impact because this means that executable
> pages will be removed. However, it ensures that we will no longer
> fail in the writeout path.
>
> A patch is included to test this functionality. The test involved
> allocating 12 Megabytes from the reclaim paths when __PF_MEMALLOC is
> set. This is enough to exhaust the reserves.

Hi Christoph,

Over the last two weeks we have tested your patch set in the context of
ddsnap, which used to be prone to deadlock before we added a series of
anti-deadlock measures, including Peter's anti-deadlock patch set, our
own bio throttling code and judicious use of PF_MEMALLOC mode. This
cocktail of patches finally banished the deadlocks, none of which have
been seen during several months of heavy testing. The question in
which you are interested, no doubt, is whether your patch set also
solves the same deadlocks.

The results are mixed. I will briefly describe the test setup now. If
you are interested in specific details for independent verification, we
can provide the full recipe separately. We used the patches here:

http://zumastor.googlecode.com/svn/trunk/ddsnap/patches/2.6.21.1/

driven by the scripted storage application here:

http://zumastor.googlecode.com/svn/trunk/zumastor/

If we remove our anti-deadlock measures, including the ddsnap.vm.fixes
(a roll-up of Peter's patch set) and the request throttling code in
dm-ddsnap.c, and apply your patch set instead, we hit deadlock on the
socket write path after a few hours (traceback tomorrow). So your
patch set by itself is a stability regression.

There is also some good news for you here. The combination of our
throttling code, plus your recursive reclaim patches and some fiddling
with PF_LESS_THROTTLE has so far survived testing without deadlocking.
In other words, as far as we have tested it, your patch set can
substitute for Peter's and produce the same effect, provided that we
throttle the block IO traffic.

Just to recap, we have identified two essential ingredients in the
recipe for writeout deadlock prevention:

1) Throttle block IO traffic to a bounded maximum memory use.

2) Guarantee availability of the required amount of memory.

Now we have learned that (1) is not optional with either the peterz or
the clameter approach, and we are wondering which is the better way to
handle (2).

If we accept for the moment that both approaches to (2) are equally
effective at preventing deadlock (this is debatable) then the next
criterion on the list for deciding the winner would be efficiency. A
slight oversimplification to be sure, since we are also interested in
issues of maintainability, provability and general forward progress.
However, since none of the latter is directly measurable, efficiency is
a good place to start.

It is clear which approach is more efficient: Peter's. This is because
no scanning is required to pop a free page off a free list, so scanning
work is not duplicated. How much more efficient is an open question.
Hopefully we will measure that soon.

Briefly touching on other factors:

* Peter's patch set is much bigger than yours. The active ingredients
need to be separated out from the other peterz bits such as reserve
management APIs so we can make a fairer comparison.

* Your patch set here does not address the question of atomic
allocation, though I see you have been busy with that elsewhere.
Adding code to take care of this means you will start catching up
with Peter in complexity.

* The questions Peter raised about how you will deal with loads
involving heavy anonymous allocations are still open. This looks
like more complexity on the way.

* You depend on maintaining a global dirty page limit while Peter's
approach does not. So we see the peterz approach as progress
towards eliminating one of the great thorns in our side:
congestion_wait deadlocks, which we currently hack around in a
thoroughly disgusting way (PF_LESS_THROTTLE abuse).

* Which approach allows us to run with a higher dirty page threshold?
More dirty page caching is better. We will test the two approaches
head to head on this issue pretty soon.

Regards,

Daniel

2007-09-05 10:42:47

by Christoph Lameter

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 5 Sep 2007, Daniel Phillips wrote:

> If we remove our anti-deadlock measures, including the ddsnap.vm.fixes
> (a roll-up of Peter's patch set) and the request throttling code in
> dm-ddsnap.c, and apply your patch set instead, we hit deadlock on the
> socket write path after a few hours (traceback tomorrow). So your
> patch set by itself is a stability regression.

Na, that cannot be the case since it only activates when an OOM condition
would otherwise result.

> There is also some good news for you here. The combination of our
> throttling code, plus your recursive reclaim patches and some fiddling
> with PF_LESS_THROTTLE has so far survived testing without deadlocking.
> In other words, as far as we have tested it, your patch set can
> substitute for Peter's and produce the same effect, provided that we
> throttle the block IO traffic.

Ah. That is good news.

> It is clear which approach is more efficient: Peter's. This is because
> no scanning is required to pop a free page off a free list, so scanning
> work is not duplicated. How much more efficient is an open question.
> Hopefully we will measure that soon.

Efficiency is not a criterion for a rarely used emergency recovery
measure.

> Briefly touching on other factors:
>
> * Peter's patch set is much bigger than yours. The active ingredients
> need to be separated out from the other peterz bits such as reserve
> management APIs so we can make a fairer comparison.

Peter's patch is much more invasive and requires a coupling of various
subsystems that is not good.

> * Your patch set here does not address the question of atomic
> allocation, though I see you have been busy with that elsewhere.
> Adding code to take care of this means you will start catching up
> with Peter in complexity.

Given your tests, it looks like we do not need it.

> * The questions Peter raised about how you will deal with loads
> involving heavy anonymous allocations are still open. This looks
> like more complexity on the way.

Either not necessary or also needed without these patches.

> * You depend on maintaining a global dirty page limit while Peter's
> approach does not. So we see the peterz approach as progress
> towards eliminating one of the great thorns in our side:
> congestion_wait deadlocks, which we currently hack around in a
> thoroughly disgusting way (PF_LESS_THROTTLE abuse).

We have a global dirty page limit already. I fully support Peter's work on
dirty throttling.

These results show that Peter's invasive approach is not needed. Reclaiming
easily reclaimable pages when necessary is sufficient.

2007-09-05 11:42:54

by Nick Piggin

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, Sep 05, 2007 at 03:42:35AM -0700, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Daniel Phillips wrote:
>
> > If we remove our anti-deadlock measures, including the ddsnap.vm.fixes
> > (a roll-up of Peter's patch set) and the request throttling code in
> > dm-ddsnap.c, and apply your patch set instead, we hit deadlock on the
> > socket write path after a few hours (traceback tomorrow). So your
> > patch set by itself is a stability regression.
>
> Na, that cannot be the case since it only activates when an OOM condition
> would otherwise result.
>
> > There is also some good news for you here. The combination of our
> > throttling code, plus your recursive reclaim patches and some fiddling
> > with PF_LESS_THROTTLE has so far survived testing without deadlocking.
> > In other words, as far as we have tested it, your patch set can
> > substitute for Peter's and produce the same effect, provided that we
> > throttle the block IO traffic.
>
> Ah. That is good news.
>
> > It is clear which approach is more efficient: Peter's. This is because
> > no scanning is required to pop a free page off a free list, so scanning
> > work is not duplicated. How much more efficient is an open question.
> > Hopefully we will measure that soon.
>
> Efficiency is not a criterion for a rarely used emergency recovery
> measure.
>
> > Briefly touching on other factors:
> >
> > * Peter's patch set is much bigger than yours. The active ingredients
> > need to be separated out from the other peterz bits such as reserve
> > management APIs so we can make a fairer comparison.
>
> Peter's patch is much more invasive and requires a coupling of various
> subsystems that is not good.
>
> > * Your patch set here does not address the question of atomic
> > allocation, though I see you have been busy with that elsewhere.
> > Adding code to take care of this means you will start catching up
> > with Peter in complexity.
>
> Given your tests: It looks like we do not need it.
>
> > * The questions Peter raised about how you will deal with loads
> > involving heavy anonymous allocations are still open. This looks
> > like more complexity on the way.
>
> Either not necessary or also needed without these patches.
>
> > * You depend on maintaining a global dirty page limit while Peter's
> > approach does not. So we see the peterz approach as progress
> > towards eliminating one of the great thorns in our side:
> > congestion_wait deadlocks, which we currently hack around in a
> > thoroughly disgusting way (PF_LESS_THROTTLE abuse).
>
> We have a global dirty page limit already. I fully support Peter's work on
> dirty throttling.
>
> These results show that Peter's invasive approach is not needed. Reclaiming
> easily reclaimable pages when necessary is sufficient.


First of all, I'm not surprised these patches solve the deadlock here.
And that's a good thing, and it means it is likely that we want to merge
it (actually, I quite like the idea in general, regardless of whether it
solves the deadlock or not).

However I really have an aversion to the near enough is good enough way of
thinking. Especially when it comes to fundamental deadlocks in the VM. I
don't know whether Peter's patch is completely clean yet, but fixing the
fundamentally broken code has my full support.

I hate it that there are theoretical bugs still left even if they would
be hit less frequently than hardware failure. And that people are really
happy to put even more of these things in :(

Anyway, as you know I like your patch and if that gives Peter a little
more breathing space then it's a good thing. But I really hope he doesn't
give up on it, and it should be merged one day.

2007-09-05 12:14:18

by Christoph Lameter

Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 5 Sep 2007, Nick Piggin wrote:

> However I really have an aversion to the near enough is good enough way of
> thinking. Especially when it comes to fundamental deadlocks in the VM. I
> don't know whether Peter's patch is completely clean yet, but fixing the
> fundamentally broken code has my full support.

Uhh. There are already numerous other issues why the VM is failing that are
independent of Peter's approach.

> I hate it that there are theoretical bugs still left even if they would
> be hit less frequently than hardware failure. And that people are really
> happy to put even more of these things in :(

Theoretical bugs? Depends on one's creativity to come up with them, I
guess. So far we do not even get around to addressing the known issues,
and this multi-subsystem patch has the potential of creating more.

> Anyway, as you know I like your patch and if that gives Peter a little
> more breathing space then it's a good thing. But I really hope he doesn't
> give up on it, and it should be merged one day.

Using the VM to throttle networking is a pretty bad thing because it
assumes a single critical user of memory. There are other consumers of
memory, and if you have a load that depends on things other than networking
then you should not kill the other things that want memory.

2007-09-05 12:19:48

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, Sep 05, 2007 at 05:14:06AM -0700, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Nick Piggin wrote:
>
> > However I really have an aversion to the near enough is good enough way of
> > thinking. Especially when it comes to fundamental deadlocks in the VM. I
> > don't know whether Peter's patch is completely clean yet, but fixing the
> > fundamentally broken code has my full support.
>
> Uhh. There are already numerous other issues why the VM is failing that are
> independent of Peter's approach.

I don't know what your point is? We either ignore it, or try to fix things
one at a time.


> > I hate it that there are theoretical bugs still left even if they would
> > be hit less frequently than hardware failure. And that people are really
> > happy to put even more of these things in :(
>
> Theoretical bugs? Depends on one's creativity to come up with them I
> guess. So far we do not even get around to addressing the known issues, and
> this multi-subsystem patch has the potential to create more.

I can't direct people as to what bugs to work on.


> > Anyway, as you know I like your patch and if that gives Peter a little
> > more breathing space then it's a good thing. But I really hope he doesn't
> > give up on it, and it should be merged one day.
>
> Using the VM to throttle networking is a pretty bad thing because it
> assumes a single critical user of memory. There are other consumers of
> memory, and if you have a load that depends on things other than networking
> then you should not kill the other things that want memory.

Implementation issues aside, the problem is there and I would like to
see it fixed regardless if some/most/or all users in practice don't
hit it.

2007-09-05 16:16:27

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wednesday 05 September 2007 03:42, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Daniel Phillips wrote:
> > If we remove our anti-deadlock measures, including the
> > ddsnap.vm.fixes (a roll-up of Peter's patch set) and the request
> > throttling code in dm-ddsnap.c, and apply your patch set instead,
> > we hit deadlock on the socket write path after a few hours
> > (traceback tomorrow). So your patch set by itself is a stability
> > regression.
>
> Na, that cannot be the case since it only activates when an OOM
> condition would otherwise result.

I did not express myself clearly then. Compared to our current
anti-deadlock patch set, your patch set is a regression. Because
without help from some of our other patches, it does deadlock.
Obviously, we cannot have that.

> > It is clear which approach is more efficient: Peter's. This is
> > because no scanning is required to pop a free page off a free list,
> > so scanning work is not duplicated. How much more efficient is an
> > open question. Hopefully we will measure that soon.
>
> Efficiency is not a criterion for a rarely used emergency recovery
> measure.

That depends on how rarely used. Under continuous, heavy load this may
not be rare at all. This must be measured.

> > Briefly touching on other factors:
> >
> > * Peter's patch set is much bigger than yours. The active
> > ingredients need to be separated out from the other peterz bits
> > such as reserve management APIs so we can make a fairer comparison.
>
> Peters patch is much more invasive and requires a coupling of various
> subsystems that is not good.

I agree that Peter's patch set is larger than necessary. I do not agree
that it couples subsystems unnecessarily.

> > * Your patch set here does not address the question of atomic
> > allocation, though I see you have been busy with that
> > elsewhere. Adding code to take care of this means you will start
> > catching up with Peter in complexity.
>
> Given your tests: It looks like we do not need it.

I do not agree with that line of thinking. A single test load only
provides evidence, not proof. Your approach is not obviously correct,
quite the contrary. The tested patch set does not help atomic alloc at
all, which is clearly a problem we can hit, we just did not hit it this
time.

> > * The questions Peter raised about how you will deal with loads
> > involving heavy anonymous allocations are still open. This
> > looks like more complexity on the way.
>
> Either not necessary or also needed without these patches.

Your proof?

> > * You depend on maintaining a global dirty page limit while
> > Peter's approach does not. So we see the peterz approach as
> > progress towards eliminating one of the great thorns in our side:
> > congestion_wait deadlocks, which we currently hack around in a
> > thoroughly disgusting way (PF_LESS_THROTTLE abuse).
>
> We have a global dirty page limit already. I fully support Peters
> work on dirty throttling.

Alas, I communicated exactly the opposite of what I intended. We do not
like the global dirty limit. It makes the vm complex and fragile,
unnecessarily. We favor an approach that places less reliance on the
global dirty limit so that we can remove some of the fragile and hard
to support workarounds we have had to implement because of it.

> These results show that Peters invasive approach is not needed.
> Reclaiming easy reclaimable pages when necessary is sufficient.

These results do not show that at all, I apologize for not making that
sufficiently clear.

Regards,

Daniel

2007-09-08 05:12:34

by Mike Snitzer

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On 9/5/07, Daniel Phillips <[email protected]> wrote:
> On Wednesday 05 September 2007 03:42, Christoph Lameter wrote:
> > On Wed, 5 Sep 2007, Daniel Phillips wrote:
> > > If we remove our anti-deadlock measures, including the
> > > ddsnap.vm.fixes (a roll-up of Peter's patch set) and the request
> > > throttling code in dm-ddsnap.c, and apply your patch set instead,
> > > we hit deadlock on the socket write path after a few hours
> > > (traceback tomorrow). So your patch set by itself is a stability
> > > regression.
> >
> > Na, that cannot be the case since it only activates when an OOM
> > condition would otherwise result.
>
> I did not express myself clearly then. Compared to our current
> anti-deadlock patch set, your patch set is a regression. Because
> without help from some of our other patches, it does deadlock.
> Obviously, we cannot have that.

Can you be specific about which changes to existing mainline code were
needed to make recursive reclaim "work" in your tests (albeit less
ideally than peterz's patchset in your view)?

Also, in a previous post you stated:

> Just to recap, we have identified two essential ingredients in the
> recipe for writeout deadlock prevention:
>
> 1) Throttle block IO traffic to a bounded maximum memory use.
>
> 2) Guarantee availability of the required amount of memory.

Which changes allowed you to address 1? I had a look at the various
patches you provided (via svn) and it wasn't clear which subset
fulfilled 1 for you. Does it work for all Block IO and not just
specially tuned drivers like ddsnap et al?

regards,
Mike

2007-09-10 19:25:33

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 5 Sep 2007, Daniel Phillips wrote:

> > Na, that cannot be the case since it only activates when an OOM
> > condition would otherwise result.
>
> I did not express myself clearly then. Compared to our current
> anti-deadlock patch set, your patch set is a regression. Because
> without help from some of our other patches, it does deadlock.
> Obviously, we cannot have that.

Of course boundless allocations from interrupt / reclaim context will
ultimately crash the system. To fix that you need to stop the networking
layer from performing these.

> > Given your tests: It looks like we do not need it.
>
> I do not agree with that line of thinking. A single test load only
> provides evidence, not proof. Your approach is not obviously correct,
> quite the contrary. The tested patch set does not help atomic alloc at
> all, which is clearly a problem we can hit, we just did not hit it this
> time.

The patch is obviously correct because it provides memory where we used to
fail.

> > We have a global dirty page limit already. I fully support Peters
> > work on dirty throttling.
>
> Alas, I communicated exactly the opposite of what I intended. We do not
> like the global dirty limit. It makes the vm complex and fragile,
> unnecessarily. We favor an approach that places less reliance on the
> global dirty limit so that we can remove some of the fragile and hard
> to support workarounds we have had to implement because of it.

So far our experience has just been the opposite and Peter's other patches
demonstrate the same. Dirty limits make the VM stable and increase I/O
performance.




2007-09-10 19:29:42

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 5 Sep 2007, Nick Piggin wrote:

> Implementation issues aside, the problem is there and I would like to
> see it fixed regardless if some/most/or all users in practice don't
> hit it.

I am all for fixing the problem but the solution can be much simpler and
more universal. F.e. the amount of tcp data in flight may be controlled
via some limit so that other subsystems can continue to function even if
we are overwhelmed by network traffic. Peter's approach establishes the
limit by failing PF_MEMALLOC allocations. If that occurs then other
subsystems (like the disk, or even fork/exec or memory management
allocation) will no longer operate since their allocations no longer
succeed which will make the system even more fragile and may lead to
subsequent failures.

2007-09-10 19:37:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 2007-09-10 at 12:29 -0700, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Nick Piggin wrote:
>
> > Implementation issues aside, the problem is there and I would like to
> > see it fixed regardless if some/most/or all users in practice don't
> > hit it.
>
> I am all for fixing the problem but the solution can be much simpler and
> more universal. F.e. the amount of tcp data in flight may be controlled
> via some limit so that other subsystems can continue to function even if
> we are overwhelmed by network traffic.

With swap over network you not only need to protect other subsystems from
networking, but you also have to guarantee that networking will in some form
stay functional, otherwise you'll never receive the writeout completion.

> Peter's approach establishes the
> limit by failing PF_MEMALLOC allocations.

I'm not failing PF_MEMALLOC allocations. I'm more stringent in failing !
PF_MEMALLOC allocations.
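
Roughly, the distinction boils down to something like the sketch below
once only the emergency reserve is left. This is a hedged illustration,
not the actual patch; the helper name is invented and the real logic
lives in the page allocator slow path:

#include <linux/gfp.h>
#include <linux/sched.h>

/* Sketch: serve reclaim/writeout contexts (PF_MEMALLOC) from the
 * reserve, and fail everyone else early instead of letting them grind
 * away while no ordinary memory is left. */
static struct page *reserve_alloc_sketch(gfp_t gfp_mask, unsigned int order)
{
        if (!(current->flags & PF_MEMALLOC) || (gfp_mask & __GFP_NOMEMALLOC))
                return NULL;                    /* !PF_MEMALLOC allocations fail here */

        return alloc_pages(gfp_mask, order);    /* may dip below the watermarks */
}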

> If that occurs then other
> subsystems (like the disk, or even fork/exec or memory management
> allocation) will no longer operate since their allocations no longer
> succeed which will make the system even more fragile and may lead to
> subsequent failures.

Failing allocations should never be a stability problem, we have the
fault-injection framework which allows allocations to fail randomly -
this should never crash the kernel - if it does, it's a BUG.



2007-09-10 19:41:57

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 10 Sep 2007, Peter Zijlstra wrote:

> > Peter's approach establishes the
> > limit by failing PF_MEMALLOC allocations.
>
> I'm not failing PF_MEMALLOC allocations. I'm more stringent in failing !
> PF_MEMALLOC allocations.

Right, you are failing other allocations.

> > If that occurs then other
> > subsystems (like the disk, or even fork/exec or memory management
> > allocation) will no longer operate since their allocations no longer
> > succeed which will make the system even more fragile and may lead to
> > subsequent failures.
>
> Failing allocations should never be a stability problem, we have the
> fault-injection framework which allows allocations to fail randomly -
> this should never crash the kernel - if it does, it's a BUG.

Allright maybe you can get the kernel to be stable in the face of having
no memory and debug all the fallback paths in the kernel when an OOM
condition occurs.

But system calls will fail? Like fork/exec? etc? There may be daemons
running that are essential for the system to survive and that cannot
easily take an OOM condition? Various reclaim paths also need memory and
if the allocation fails then reclaim cannot continue.

2007-09-10 19:55:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 2007-09-10 at 12:41 -0700, Christoph Lameter wrote:
> On Mon, 10 Sep 2007, Peter Zijlstra wrote:
>
> > > Peter's approach establishes the
> > > limit by failing PF_MEMALLOC allocations.
> >
> > I'm not failing PF_MEMALLOC allocations. I'm more stringent in failing !
> > PF_MEMALLOC allocations.
>
> Right, you are failing other allocations.
>
> > > If that occurs then other
> > > subsystems (like the disk, or even fork/exec or memory management
> > > allocation) will no longer operate since their allocations no longer
> > > succeed which will make the system even more fragile and may lead to
> > > subsequent failures.
> >
> > Failing allocations should never be a stability problem, we have the
> > fault-injection framework which allows allocations to fail randomly -
> > this should never crash the kernel - if it does its a BUG.
>
> Allright maybe you can get the kernel to be stable in the face of having
> no memory and debug all the fallback paths in the kernel when an OOM
> condition occurs.
>
> But system calls will fail? Like fork/exec? etc? There may be daemons
> running that are essential for the system to survive and that cannot
> easily take an OOM condition? Various reclaim paths also need memory and
> if the allocation fails then reclaim cannot continue.

I'm not making any of these paths significantly more likely to occur
than they already are. Lots and lots of users run swap heavy loads day
in day out - they don't get funny systems (well sometimes they do, and
theoretically we can easily run out of the PF_MEMALLOC reserves -
HOWEVER in practice it seems to work quite reliably).



2007-09-10 19:56:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 2007-09-10 at 12:25 -0700, Christoph Lameter wrote:

> Of course boundless allocations from interrupt / reclaim context will
> ultimately crash the system. To fix that you need to stop the networking
> layer from performing these.

Trouble is, I don't only need a network layer to not endlessly consume
memory, I need it to 'fully' function so that we can receive the
writeout completion.

Let us define a strict meaning for a few phrases:

use memory - an alloc / free cycle where the free is unconditional
consume memory - an alloc / free cycle where the free is conditional
and/or might be delayed for some unspecified time.

Currently networking has two states:

1) it receives packets and consumes memory
2) it doesn't receive any packets and doesn't use any memory.

In order to use swap over network you need to operate the network stack
in a bounded memory model (PF_MEMALLOC). So we need a state that:

- receives packets
- does NOT consume memory
- but does use memory - albeit limited.

There are two ways to do this:

- reserve a specified amount of memory per socket
(allegedly IRIX has this)

or

- have a global reserve and selectively serves sockets
(what I've been doing)

These two models can be seen as the same. There is no fundamental
difference between having various small reserves and one larger one that is
carved up using strict accounting.

So, if you will, you can view my approach as a reserve per socket, where
most sockets get a reserve of 0 and a few (those serving the VM) !0.
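
In code terms that could look something like this hedged sketch; the
flag and helper follow the naming in the patch set (SOCK_MEMALLOC,
sk_set_memalloc()), while the surrounding functions are invented for
illustration:

#include <net/sock.h>

/* Sketch: only sockets serving the VM (e.g. the swap-over-network
 * socket) are marked; every other socket effectively has a reserve
 * of 0. */
static void mark_vm_socket(struct sock *sk)
{
        sk_set_memalloc(sk);
}

/* On receive, a packet paid for out of the reserve is only kept if it
 * turns out to belong to such a socket; otherwise it is dropped, so
 * the reserve is used but never consumed. */
static bool keep_reserve_skb(const struct sock *sk)
{
        return sk && sock_flag(sk, SOCK_MEMALLOC);
}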

What part are you disagreeing with or unclear on?



2007-09-10 20:18:15

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 10 Sep 2007, Peter Zijlstra wrote:

> > Allright maybe you can get the kernel to be stable in the face of having
> > no memory and debug all the fallback paths in the kernel when an OOM
> > condition occurs.
> >
> > But system calls will fail? Like fork/exec? etc? There may be daemons
> > running that are essential for the system to survive and that cannot
> > easily take an OOM condition? Various reclaim paths also need memory and
> > if the allocation fails then reclaim cannot continue.
>
> I'm not making any of these paths significantly more likely to occur
> than they already are. Lots and lots of users run swap heavy loads day
> in day out - they don't get funny systems (well sometimes they do, and
> theoretically we can easily run out of the PF_MEMALLOC reserves -
> HOWEVER in practice it seems to work quite reliably).
>

The patchset increases these failures significantly since there will be a
longer time period where these allocations can fail.

The swap loads are fine as long as we do not exhaust the reserve pools.

IMHO the right solution is to throttle the networking layer to not do
unbounded allocations. You can likely do this by checking certain VM
counters like SLAB_UNRECLAIMABLE. If need be we can add a new category of
SLAB_TEMPORARY for temporary allocs and track these. If they get too large
then throttle.
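
As a hedged sketch of that heuristic (the counter is a real one of this
era; the threshold, the suggested SLAB_TEMPORARY category and the hook
point are invented for illustration):

#include <linux/mm.h>
#include <linux/vmstat.h>

/* Sketch: start dropping incoming packets once unreclaimable (or
 * "temporary") slab usage passes an arbitrary fraction of RAM. */
static bool net_rx_should_drop(void)
{
        unsigned long slab_pages = global_page_state(NR_SLAB_UNRECLAIMABLE);

        return slab_pages > totalram_pages / 10;        /* arbitrary 10% cap */
}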

2007-09-10 20:22:28

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 10 Sep 2007, Peter Zijlstra wrote:

> On Mon, 2007-09-10 at 12:25 -0700, Christoph Lameter wrote:
>
> > Of course boundless allocations from interrupt / reclaim context will
> > ultimately crash the system. To fix that you need to stop the networking
> > layer from performing these.
>
> Trouble is, I don't only need a network layer to not endlessly consume
> memory, I need it to 'fully' function so that we can receive the
> writeout completion.

You need to drop packets after having inspected them, right? Why won't
dropping packets after a certain amount of memory has been allocated work?
What is so difficult about that?

> or
>
> - have a global reserve and selectively serves sockets
> (what I've been doing)

That is a scalability problem on large systems! Global means global
serialization, cacheline bouncing and possibly livelocks. If we get into
this global shortage then all cpus may end up taking the same locks
cycling through the same allocation paths.

> So, if you will, you can view my approach as a reserve per socket, where
> most sockets get a reserve of 0 and a few (those serving the VM) !0.

Well it looks like you know how to do it. Why not implement it?



2007-09-10 20:48:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 2007-09-10 at 13:17 -0700, Christoph Lameter wrote:
> On Mon, 10 Sep 2007, Peter Zijlstra wrote:
>
> > > Allright maybe you can get the kernel to be stable in the face of having
> > > no memory and debug all the fallback paths in the kernel when an OOM
> > > condition occurs.
> > >
> > > But system calls will fail? Like fork/exec? etc? There may be daemons
> > > running that are essential for the system to survive and that cannot
> > > easily take an OOM condition? Various reclaim paths also need memory and
> > > if the allocation fails then reclaim cannot continue.
> >
> > I'm not making any of these paths significantly more likely to occur
> > than they already are. Lots and lots of users run swap heavy loads day
> > in day out - they don't get funny systems (well sometimes they do, and
> > theoretically we can easily run out of the PF_MEMALLOC reserves -
> > HOWEVER in practice it seems to work quite reliably).
> >
>
> The patchset increases these failures significantly since there will be a
> longer time period where these allocations can fail.
>
> The swap loads are fine as long as we do not exhaust the reserve pools.

And I'm working hard to guarantee the additional logic does not exhaust
said pools by making it strictly bounded.

> IMHO the right solution is to throttle the networking layer to not do
> unbounded allocations.

Am I not doing exactly that?

> You can likely do this by checking certain VM
> counters like SLAB_UNRECLAIMABLE. If need be we can add a new category of
> SLAB_TEMPORARY for temporary allocs and track these. If they get too large
> then throttle.

I'm utterly confused as to why you propose all these heuristics when I
have a perfectly good solution that is exact.



2007-09-10 20:48:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 2007-09-10 at 13:22 -0700, Christoph Lameter wrote:
> On Mon, 10 Sep 2007, Peter Zijlstra wrote:
>
> > On Mon, 2007-09-10 at 12:25 -0700, Christoph Lameter wrote:
> >
> > > Of course boundless allocations from interrupt / reclaim context will
> > > ultimately crash the system. To fix that you need to stop the networking
> > > layer from performing these.
> >
> > Trouble is, I don't only need a network layer to not endlessly consume
> > memory, I need it to 'fully' function so that we can receive the
> > writeout completion.
>
> You need to drop packets after having inspected them, right? Why won't
> dropping packets after a certain amount of memory has been allocated work?
> What is so difficult about that?

That puts the burden of tracking skb allocations and all that on the
fast path.

The 'simplicity' of my current approach is that we only start
bean-counting (and incur the overhead thereof) once we need it.

> > or
> >
> > - have a global reserve and selectively serves sockets
> > (what I've been doing)
>
> That is a scalability problem on large systems! Global means global
> serialization, cacheline bouncing and possibly livelocks. If we get into
> this global shortage then all cpus may end up taking the same locks
> cycling through the same allocation paths.

Dude, breathe, these boxens of yours will never swap over network simply
because you never configure swap.

And, _no_, it does not necessarily mean global serialisation. By simply
saying there must be N pages available I say nothing about on which node
they should be available, and the way the watermarks work they will be
evenly distributed over the appropriate zones.

> > So, if you will, you can view my approach as a reserve per socket, where
> > most sockets get a reserve of 0 and a few (those serving the VM) !0.
>
> Well it looks like you know how to do it. Why not implement it?

/me confused, I already have!

If you talk about the IRIX model, I'm very hesitant to do that simply
because that would incur the bean-counting overhead on the normal case
and that will greatly upset the network people - nor would that mean
that I don't need this stricter PF_MEMALLOC behaviour.



2007-09-11 07:41:36

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, Sep 10, 2007 at 12:29:32PM -0700, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Nick Piggin wrote:
>
> > Implementation issues aside, the problem is there and I would like to
> > see it fixed regardless if some/most/or all users in practice don't
> > hit it.
>
> I am all for fixing the problem but the solution can be much simpler and
> more universal. F.e. the amount of tcp data in flight may be controlled
> via some limit so that other subsystems can continue to function even if
> we are overwhelmed by network traffic. Peter's approach establishes the
> limit by failing PF_MEMALLOC allocations. If that occurs then other

Can you propose a solution that is much simpler and more universal?


> subsystems (like the disk, or even fork/exec or memory management
> allocation) will no longer operate since their allocations no longer
> succeed which will make the system even more fragile and may lead to
> subsequent failures.

You're saying we shouldn't fix an out-of-memory deadlock because
that might result in ENOMEM errors being returned, rather than the
system locking up?

2007-09-12 10:53:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 2007-09-05 at 05:14 -0700, Christoph Lameter wrote:

> Using the VM to throttle networking is a pretty bad thing because it
> assumes a single critical user of memory. There are other consumers of
> memory, and if you have a load that depends on things other than networking
> then you should not kill the other things that want memory.

The VM is a _critical_ user of memory. And I dare say it is the _most_
important user.

Every user of memory relies on the VM, and we only get into trouble if
the VM in turn relies on one of these users. Traditionally that has only
been the block layer, and we special cased that using mempools and
PF_MEMALLOC.

Why do you object to me doing a similar thing for networking?

The problem of circular dependencies on and with the VM is rather
limited to kernel IO subsystems, and we only have a limited amount of
them.

You talk about something generic, do you mean an approach that is
generic across all these subsystems?

If so, my approach would be it, I can replace mempools as we have them
with the reserve system I introduce.



2007-09-12 22:39:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tue, 21 Aug 2007, Nick Piggin wrote:

> The thing I don't much like about your patches is the addition of more
> of these global reserve type things in the allocators. They kind of
> suck (not your code, just the concept of them in general -- ie. including
> the PF_MEMALLOC reserve). I'd like to eventually reach a model where
> reclaimable memory from a given subsystem is always backed by enough
> resources to be able to reclaim it. What stopped you from going that
> route with the network subsystem? (too much churn, or something
> fundamental?)

That sounds very right aside from the global reserve. A given subsystem
may exist in multiple instances and serve sub partitions of the system.
F.e. there may be a network card on node 5 and a job running on nodes 3-7
and another network card on node 15 with the corresponding nodes 13-17
doing I/O through it.

2007-09-12 22:47:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 12 Sep 2007, Peter Zijlstra wrote:

> > assumes a single critical user of memory. There are other consumers of
> > memory, and if you have a load that depends on things other than networking
> > then you should not kill the other things that want memory.
>
> The VM is a _critical_ user of memory. And I dare say it is the _most_
> important user.

The users of memory are various subsystems. The VM itself of course also
uses memory to manage memory but the important thing is that the VM
provides services to other subsystems

> Every user of memory relies on the VM, and we only get into trouble if
> the VM in turn relies on one of these users. Traditionally that has only
> been the block layer, and we special cased that using mempools and
> PF_MEMALLOC.
>
> Why do you object to me doing a similar thing for networking?

I have not seen you using mempools for the networking layer. I would not
object to such a solution. It already exists for other subsystems.

> The problem of circular dependencies on and with the VM is rather
> limited to kernel IO subsystems, and we only have a limited amount of
> them.

The kernel has to use the filesystems and other subsystems for I/O. These
subsystems compete for memory in order to make progress. I would not
strictly consider them part of the VM. The kernel reclaim may trigger I/O
in multiple I/O subsystems simultaneously.

> You talk about something generic, do you mean an approach that is
> generic across all these subsystems?

Yes an approach that is fair and does not allow one single subsystem to
hog all of memory.

> If so, my approach would be it, I can replace mempools as we have them
> with the reserve system I introduce.

Replacing the mempools for the block layer sounds pretty good. But how do
these various subsystems that may live in different portions of the system
for various devices avoid global serialization and livelock through your
system? And how is fairness addressed? I may want to run a fileserver on
some nodes and an HPC application that relies on a fiberchannel connection
on other nodes. How do we guarantee that the HPC application is not
impacted if the network services of the fileserver flood the system with
messages and exhaust memory?

2007-09-13 08:19:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Wed, 2007-09-12 at 15:47 -0700, Christoph Lameter wrote:
> On Wed, 12 Sep 2007, Peter Zijlstra wrote:
>
> > > assumes single critical user of memory. There are other consumers of
> > > memory and if you have a load that depends on other things than networking
> > > then you should not kill the other things that want memory.
> >
> > The VM is a _critical_ user of memory. And I dare say it is the _most_
> > important user.
>
> The users of memory are various subsystems. The VM itself of course also
> uses memory to manage memory but the important thing is that the VM
> provides services to other subsystems

Exactly, and because it services every other subsystem and userspace,
it's the most important one; if it doesn't work, nothing else will.

> > Every user of memory relies on the VM, and we only get into trouble if
> > the VM in turn relies on one of these users. Traditionally that has only
> > been the block layer, and we special cased that using mempools and
> > PF_MEMALLOC.
> >
> > Why do you object to me doing a similar thing for networking?
>
> I have not seen you using mempools for the networking layer. I would not
> object to such a solution. It already exists for other subsystems.

Dude, listen, how often do I have to say this: I cannot use mempools for
the network subsystem because it's built on kmalloc! What I've done is
build a replacement for mempools - a reserve system - that does work
similar to mempools but also provides the flexibility of kmalloc.

That is all, no more, no less.
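
To make the contrast concrete, here is a minimal sketch of ordinary
mempool usage (standard mempool calls; the sizes and function names are
invented). A mempool guarantees progress for one fixed element size,
which fits the block layer's pre-sized requests but not a stack that
kmallocs buffers of many different sizes:

#include <linux/gfp.h>
#include <linux/mempool.h>

static mempool_t *req_pool;

static int my_pool_init(void)
{
        /* keep 16 elements of one fixed size (256 bytes) in reserve */
        req_pool = mempool_create_kmalloc_pool(16, 256);
        return req_pool ? 0 : -ENOMEM;
}

static void *my_request_alloc(void)
{
        /* falls back to the reserved elements, sleeping for one if needed */
        return mempool_alloc(req_pool, GFP_NOIO);
}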

> > The problem of circular dependencies on and with the VM is rather
> > limited to kernel IO subsystems, and we only have a limited amount of
> > them.
>
> The kernel has to use the filesystems and other subsystems for I/O. These
> subsystems compete for memory in order to make progress. I would not
> strictly consider them part of the VM. The kernel reclaim may trigger I/O
> in multiple I/O subsystems simultaneously.

I'm confused by this, I've never claimed part of, or such a thing. All
I'm saying is that because of the circular dependency between the VM and
the IO subsystem used for swap (not file backed paging [*], just swap)
you have to do something special to avoid deadlocks.

[*] the dirty limit along with 'atomic' swap ensures that file backed
paging does not get into this tight spot.

> > You talk about something generic, do you mean an approach that is
> > generic across all these subsystems?
>
> Yes an approach that is fair and does not allow one single subsystem to
> hog all of memory.

I do no such thing! My reserve system works much like mempools: you
reserve a certain number of pages and use no more.

> > If so, my approach would be it, I can replace mempools as we have them
> > with the reserve system I introduce.
>
> Replacing the mempools for the block layer sounds pretty good. But how do
> these various subsystems that may live in different portions of the system
> for various devices avoid global serialization and livelock through your
> system?

The reserves are spread over all kernel mapped zones, the slab allocator
is still per cpu, the page allocator tries to get pages from the nearest
node.

> > And how is fairness addressed? I may want to run a fileserver on
> > some nodes and an HPC application that relies on a fiberchannel connection
> on other nodes. How do we guarantee that the HPC application is not
> impacted if the network services of the fileserver flood the system with
> messages and exhaust memory?

The network system reserves A pages, the block layer reserves B pages,
once they start getting pages from the reserves they go bean counting,
once they reach their respective limit they stop.

The serialisation impact of the bean counting depends on how
fine-grained you place them. Currently I only have a machine-wide
network bean counter because the network subsystem is machine-wide -
initially I tried to do something per net-device but that doesn't work
out. If someone more skilled in this area comes along and sees a better
way to place the bean counters they are free to do so.

But do notice that the bean counting is only done once we hit the
reserves, the normal mode of operation is not penalised by the extra
overhead thereof.

Also note that mempools also serialise their access once the backing
allocator fails, so I don't differ from them in that respect either.
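
Concretely, the bean counting amounts to something like the following
sketch (all names invented, not the actual patch); nothing is counted
while normal allocations still succeed:

#include <linux/atomic.h>

struct mem_reserve {
        atomic_t        used;   /* reserve pages currently outstanding */
        int             limit;  /* pages this subsystem reserved (the A or B above) */
};

/* Charge @pages against the reserve; refuse once the quota is spent. */
static bool reserve_charge(struct mem_reserve *res, int pages)
{
        if (atomic_add_return(pages, &res->used) > res->limit) {
                atomic_sub(pages, &res->used);
                return false;
        }
        return true;
}

/* Called when the memory comes back, e.g. on writeout completion. */
static void reserve_uncharge(struct mem_reserve *res, int pages)
{
        atomic_sub(pages, &res->used);
}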



2007-09-13 18:32:41

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Thu, 13 Sep 2007, Peter Zijlstra wrote:

>
> > > Every user of memory relies on the VM, and we only get into trouble if
> > > the VM in turn relies on one of these users. Traditionally that has only
> > > been the block layer, and we special cased that using mempools and
> > > PF_MEMALLOC.
> > >
> > > Why do you object to me doing a similar thing for networking?
> >
> > I have not seen you using mempools for the networking layer. I would not
> > object to such a solution. It already exists for other subsystems.
>
> Dude, listen, how often do I have to say this: I cannot use mempools for
> the network subsystem because it's built on kmalloc! What I've done is
> build a replacement for mempools - a reserve system - that does work
> similar to mempools but also provides the flexibility of kmalloc.
>
> That is all, no more, no less.

It's different since it becomes a privileged player that can suck all
the available memory out of the page allocator.

> I'm confused by this, I've never claimed part of, or such a thing. All
> I'm saying is that because of the circular dependency between the VM and
> the IO subsystem used for swap (not file backed paging [*], just swap)
> you have to do something special to avoid deadlocks.

How are dirty file backed pages different? They may also be written out
by the VM during reclaim.

> > Replacing the mempools for the block layer sounds pretty good. But how do
> > these various subsystems that may live in different portions of the system
> > for various devices avoid global serialization and livelock through your
> > system?
>
> The reserves are spread over all kernel mapped zones, the slab allocator
> is still per cpu, the page allocator tries to get pages from the nearest
> node.

But it seems that you have unbounded allocations with PF_MEMALLOC now for
the networking case? So networking can exhaust all reserves?

> > And how is fairness addressed? I may want to run a fileserver on
> > some nodes and an HPC application that relies on a fiberchannel connection
> > on other nodes. How do we guarantee that the HPC application is not
> > impacted if the network services of the fileserver flood the system with
> > messages and exhaust memory?
>
> The network system reserves A pages, the block layer reserves B pages,
> once they start getting pages from the reserves they go bean counting,
> once they reach their respective limit they stop.

That sounds good.

2007-09-13 19:25:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)


On Thu, 2007-09-13 at 11:32 -0700, Christoph Lameter wrote:
> On Thu, 13 Sep 2007, Peter Zijlstra wrote:
>
> >
> > > > Every user of memory relies on the VM, and we only get into trouble if
> > > > the VM in turn relies on one of these users. Traditionally that has only
> > > > been the block layer, and we special cased that using mempools and
> > > > PF_MEMALLOC.
> > > >
> > > > Why do you object to me doing a similar thing for networking?
> > >
> > > I have not seen you using mempools for the networking layer. I would not
> > > object to such a solution. It already exists for other subsystems.
> >
> > Dude, listen, how often do I have to say this: I cannot use mempools for
> > the network subsystem because it's built on kmalloc! What I've done is
> > build a replacement for mempools - a reserve system - that does work
> > similar to mempools but also provides the flexibility of kmalloc.
> >
> > That is all, no more, no less.
>
> It's different since it becomes a privileged player that can suck all
> the available memory out of the page allocator.

No, each reserve user comes with a bean-counter that will limit the
usage.

> > I'm confused by this, I've never claimed part of, or such a thing. All
> > I'm saying is that because of the circular dependency between the VM and
> > the IO subsystem used for swap (not file backed paging [*], just swap)
> > you have to do something special to avoid deadlocks.
>
> How are dirty file backed pages different? They may also be written out
> by the VM during reclaim.

When you have dirty file-backed pages, the rest of the memory can only
consist of clean file pages and/or anonymous pages - due to the dirty
limit. If you can guarantee that swap doesn't use memory (well, it does,
but it's PF_MEMALLOC memory that cannot be used by others) then you can
always free memory by dropping clean pages or swapping out, and thus
make progress for file-based writeback.

This is why the dirty page tracking made mmap over NFS usable.

> > > Replacing the mempools for the block layer sounds pretty good. But how do
> > > these various subsystems that may live in different portions of the system
> > > for various devices avoid global serialization and livelock through your
> > > system?
> >
> > The reserves are spread over all kernel mapped zones, the slab allocator
> > is still per cpu, the page allocator tries to get pages from the nearest
> > node.
>
> But it seems that you have unbounded allocations with PF_MEMALLOC now for
> the networking case? So networking can exhaust all reserves?

No, networking will bean-count all PF_MEMALLOC memory it receives, and
stop allocating once it hits its limit. It knows that when it has that
much memory outstanding it is guaranteed that memory will be freed soon.

> > > And how is fairness addressed? I may want to run a fileserver on
> > > some nodes and an HPC application that relies on a fiberchannel connection
> > > on other nodes. How do we guarantee that the HPC application is not
> > > impacted if the network services of the fileserver flood the system with
> > > messages and exhaust memory?
> >
> > The network system reserves A pages, the block layer reserves B pages,
> > once they start getting pages from the reserves they go bean counting,
> > once they reach their respective limit they stop.
>
> That sounds good.

Ok, so next time I'll post the whole series again - I know some people
found it too much - but that way you can see the bean counter.

2007-09-18 00:28:44

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Friday 07 September 2007 22:12, Mike Snitzer wrote:
> Can you be specific about which changes to existing mainline code
> were needed to make recursive reclaim "work" in your tests (albeit
> less ideally than peterz's patchset in your view)?

Sorry, I was incommunicado out on the high seas all last week. OK, the
measures that actually prevent our ddsnap driver from deadlocking are:

- Statically prove bounded memory use of all code in the writeout
path.

- Implement any special measures required to be able to make such a
proof.

- All allocations performed by the block driver must have access
to dedicated memory resources.

- Disable the congestion_wait mechanism for our code as much as
possible, at least enough to obtain the maximum memory resources
that can be used on the writeout path.

The specific measure we implement in order to prove a bound is:

- Throttle IO on our block device to a known amount of traffic for
which we are sure that the MEMALLOC reserve will always be
adequate.

Note that the boundedness proof we use is somewhat loose at the moment.
It goes something like "we only need at most X kilobytes of reserve and
there are X megabytes available". Much of Peter's patch set is aimed
at getting more precise about this, but to be sure, handwaving just
like this has been part of core kernel since day one without too many
ill effects.

The way we provide guaranteed access to memory resources is:

- Run critical daemons in PF_MEMALLOC mode, including
any userspace daemons that must execute in the block IO path
(cluster coders take note!)
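
The in-kernel side of "run in PF_MEMALLOC mode" is as simple as it
sounds; here is a hedged sketch (the thread body is invented, and a
userspace daemon needs a small kernel patch to get the same flag set):

#include <linux/kthread.h>
#include <linux/sched.h>

/* Sketch: a writeout helper thread whose allocations may dip into the
 * emergency reserve because it runs with PF_MEMALLOC set. */
static int writeout_daemon(void *data)
{
        current->flags |= PF_MEMALLOC;

        while (!kthread_should_stop()) {
                /* ... pull queued requests and push them out ... */
                schedule_timeout_interruptible(HZ);
        }
        return 0;
}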

Right now, all writeout submitted to ddsnap gets handed off to a daemon
running in PF_MEMALLOC mode. This is a needless inefficiency that we
want to remove in future, and handle as many of those submissions as
possible entirely in the context of the submitter. To do this, further
measures are needed:

- Network writes performed by the block driver must have access to
dedicated memory resources.

We have not yet managed to trigger network read memory deadlock, but it
is just a matter of time, additional fancy virtual block devices, and
enough stress. So:

- Network reads need some fancy extra support because dedicated
memory resources must be consumed before knowing whether the
network traffic belongs to a block device or not.

Now, the interesting thing about this whole discussion is, none of the
measures that we are actually using at the moment are implemented in
either Peter's or Christoph's patch set. In other words, at present we
do not require either patch set in order to run under heavy load
without deadlocking. But in order to generalize our solution to a
wider range of virtual block devices and other problematic systems such
as userspace filesystems, we need to incorporate a number of elements
of Peter's patch set.

As far as Christoph's proposal goes, it is not required to prevent
deadlocks. Whether or not it is a good optimization is an open
question.

Of all the patches posted so far related to this work, the only
indispensable one is the bio throttling patch developed by Evgeniy and
I in a parallel thread. The other essential pieces are all implemented
in our block driver for now. Some of those can be generalized and
moved at least partially into core, and some cannot.

I do need to write some sort of primer on this, because there is no
fire-and-forget magic core kernel solution. There are helpful things
we can do in core, but some of it can only be implemented in the
drivers themselves.

Regards,

Daniel

2007-09-18 03:27:36

by Mike Snitzer

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On 9/17/07, Daniel Phillips <[email protected]> wrote:
> On Friday 07 September 2007 22:12, Mike Snitzer wrote:
> > Can you be specific about which changes to existing mainline code
> > were needed to make recursive reclaim "work" in your tests (albeit
> > less ideally than peterz's patchset in your view)?
>
> Sorry, I was incommunicado out on the high seas all last week. OK, the
> measures that actually prevent our ddsnap driver from deadlocking are:

Hope you enjoyed yourself. First off, as always thanks for the
extremely insightful reply.

To give you context for where I'm coming from; I'm looking to get NBD
to survive the mke2fs hell I described here:
http://marc.info/?l=linux-mm&m=118981112030719&w=2

> - Statically prove bounded memory use of all code in the writeout
> path.
>
> - Implement any special measures required to be able to make such a
> proof.

Once the memory requirements of a userspace daemon (e.g. nbd-server)
are known, should one mlockall() the memory, similar to how it is done in
the heartbeat daemon's realtime library?

Bigger question for me is what kind of hell am I (or others) in for to
try to cap nbd-server's memory usage? All those glib-gone-wild
changes over the recent past feel problematic but I'll look to work
with Wouter to see if we can get things bounded.

> - All allocations performed by the block driver must have access
> to dedicated memory resources.
>
> - Disable the congestion_wait mechanism for our code as much as
> possible, at least enough to obtain the maximum memory resources
> that can be used on the writeout path.

Would peter's per bdi dirty page accounting patchset provide this? If
not, what steps are you taking to disable this mechanism? I've found
that nbd-server is frequently locked with 'blk_congestion_wait' in its
call trace when I hit the deadlock.

> The specific measure we implement in order to prove a bound is:
>
> - Throttle IO on our block device to a known amount of traffic for
> which we are sure that the MEMALLOC reserve will always be
> adequate.

I've embraced Evgeniy's bio throttle patch on a 2.6.22.6 kernel
http://thread.gmane.org/gmane.linux.network/68021/focus=68552

But are you referring to that (as you did below) or is this more a
reference to peterz's bdi dirty accounting patchset?

> Note that the boundedness proof we use is somewhat loose at the moment.
> It goes something like "we only need at most X kilobytes of reserve and
> there are X megabytes available". Much of Peter's patch set is aimed
> at getting more precise about this, but to be sure, handwaving just
> like this has been part of core kernel since day one without too many
> ill effects.
>
> The way we provide guaranteed access to memory resources is:
>
> - Run critical daemons in PF_MEMALLOC mode, including
> any userspace daemons that must execute in the block IO path
> (cluster coders take note!)

I've been using Avi Kivity's patch from some time ago:
http://lkml.org/lkml/2004/7/26/68

to get nbd-server to run in PF_MEMALLOC mode (could've just used
the _POSIX_PRIORITY_SCHEDULING hack instead right?)... it didn't help
on its own; I likely didn't have enough of the stars aligned to see my
MD+NBD mke2fs test not deadlock.

> Right now, all writeout submitted to ddsnap gets handed off to a daemon
> running in PF_MEMALLOC mode. This is a needless inefficiency that we
> want to remove in future, and handle as many of those submissions as
> possible entirely in the context of the submitter. To do this, further
> measures are needed:
>
> - Network writes performed by the block driver must have access to
> dedicated memory resources.

I assume peterz's network deadlock avoidance patchset (or some subset
of it) has you covered here?

> We have not yet managed to trigger network read memory deadlock, but it
> is just a matter of time, additional fancy virtual block devices, and
> enough stress. So:
>
> - Network reads need some fancy extra support because dedicated
> memory resources must be consumed before knowing whether the
> network traffic belongs to a block device or not.
>
> Now, the interesting thing about this whole discussion is, none of the
> measures that we are actually using at the moment are implemented in
> either Peter's or Christoph's patch set. In other words, at present we
> do not require either patch set in order to run under heavy load
> without deadlocking. But in order to generalize our solution to a
> wider range of virtual block devices and other problematic systems such
> as userspace filesystems, we need to incorporate a number of elements
> of Peter's patch set.
>
> As far as Christoph's proposal goes, it is not required to prevent
> deadlocks. Whether or not it is a good optimization is an open
> question.

OK, yes I've included Christoph's recursive reclaim patch and didn't
have any luck either. Good to know that patch isn't _really_ going to
help me.

> Of all the patches posted so far related to this work, the only
> indispensable one is the bio throttling patch developed by Evgeniy and
> I in a parallel thread. The other essential pieces are all implemented
> in our block driver for now. Some of those can be generalized and
> moved at least partially into core, and some cannot.

I've been working off-list (with Evgeniy's help!) to give the bio
throttling patch a try. I hacked MD (md.c and raid1.c) to limit NBD
members to only 10 in-flight IOs. Without this throttle I'd see up to
170 IOs on the raid1's nbd0 member; with it the IOs holds farely
constant at ~16. But this didn't help my deadlock test either. Also,
throttling in-flight IOs like this feels inherently sub-optimal. Have
you taken any steps to make the 'bio-limit' dynamic in some way?

Anyway, I'm thinking I need to be stacking more/all of these things
together rather than trying them piece-wise.

I'm going to try adding all the things I've learned into the mix all
at once; including both of peterz's patchsets. Peter, do you have a
git repo or website/ftp site for your latest per-bdi and network
deadlock patchsets? Pulling them out of LKML archives isn't "fun".

Also, I've noticed that the more recent network deadlock avoidance
patchsets haven't included NBD changes; any reason why these have been
dropped? Should I just look to shoe-horn in previous NBD-oriented
patches from an earlier version of that patchset?

> I do need to write some sort of primer on this, because there is no
> fire-and-forget magic core kernel solution. There are helpful things
> we can do in core, but some of it can only be implemented in the
> drivers themselves.

That would be quite helpful; all that I've learned has largely been
from your various posts (or others' responses to your posts).
Requires a hell of a lot of digging and ultimately I'm still missing
something.

In closing, if you (or others) are aware of a minimalist recipe that
would help me defeat this mke2fs MD+NBD deadlock test (as detailed in
my linux-mm post that I referenced above) I'd be hugely grateful.

thanks,
Mike

2007-09-18 05:38:12

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

(Reposted for completeness. Previously rejected by vger due to
accidental send as html mail. CC's except for Mike and vger deleted)

On Monday 17 September 2007 20:27, Mike Snitzer wrote:
> To give you context for where I'm coming from; I'm looking to get NBD
> to survive the mke2fs hell I described here:
> http://marc.info/?l=linux-mm&m=118981112030719&w=2

The dread blk_congestion_wait is biting you hard. We're very familiar
with the feeling. Congestion_wait is basically the traffic cop that
implements the dirty page limit. I believe it was conceived as a
method of fixing writeout deadlocks, but in our experience it does not
help, in fact it introduces a new kind of deadlock
(blk_congestion_wait) that is much easier to trigger. One of the
things we do to get ddsnap running reliably is disable congestion_wait
via the PF_LESS_THROTTLE hack that was introduced to stop local NFS
clients from deadlocking. NBD will need a similar treatment.
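
The hack itself is tiny; a hedged sketch of the shape it takes (the
writeout body is a placeholder, the flag is the one nfsd uses):

#include <linux/sched.h>

/* Sketch: temporarily exempt this task from most dirty-page throttling
 * while it performs writeout on the block device's behalf. */
static void do_writeout_less_throttled(void)
{
        unsigned long old_flags = current->flags;

        current->flags |= PF_LESS_THROTTLE;
        /* ... push the device's dirty pages out over the socket ... */
        if (!(old_flags & PF_LESS_THROTTLE))
                current->flags &= ~PF_LESS_THROTTLE;
}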

Actually, I hope to show quite soon that dirty page limiting is not
needed at all in order to prevent writeout deadlock. In which case we
can just get rid of the dirty limits and go back to being able to use
all of non-reserve memory as a write cache, the way things used to be
in the days of yore.

It has been pointed out to me that congestion_wait not only enforces
the dirty limit, it controls the balancing of memory resources between
slow and fast block devices. The Peterz/Phillips approach to deadlock
prevention does not provide any such balancing and so it seems to me
that congestion_wait is ideally situated in the kernel to provide that
missing functionality. As I see it, blk_congestion_wait can easily be
modified to balance the _rate_ at which cache memory is dirtied for
various block devices of different speeds. This should turn out to
be less finicky than balancing the absolute ratios, after all you can
make a lot of mistakes in rate limiting and still not deadlock so long
as dirty rate doesn't drop to zero and stay there for any block
device. Gotta be easy, hmm?

Please note: this plan is firmly in the category of speculation until
we have actually tried it and have patches to show, but I thought that
now is about the right time to say something about where we think
this storage robustness work is headed.

> > - Statically prove bounded memory use of all code in the writeout
> > path.
> >
> > - Implement any special measures required to be able to make such
> > a proof.
>
> Once the memory requirements of a userspace daemon (e.g. nbd-server)
> are known, should one mlockall() the memory, similar to how it is done in
> the heartbeat daemon's realtime library?

Yes, and also inspect the code to ensure it doesn't violate mlockall
by execing programs (no shell scripts!), dynamically loading
libraries, etc.
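
For the daemon itself the call is trivial; a minimal illustrative
sketch of what it would do at startup:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
        /* Pin everything mapped now and in the future so the daemon
         * never page-faults into reclaim while it sits in the block
         * IO path. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");
                return EXIT_FAILURE;
        }

        /* ... serve block IO requests ... */
        return 0;
}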

> Bigger question for me is what kind of hell am I (or others) in for
> to try to cap nbd-server's memory usage? All those glib-gone-wild
> changes over the recent past feel problematic but I'll look to work
> with Wouter to see if we can get things bounded.

Avoiding glib is a good start. Look at your library dependencies and
prune them mercilessly. Just don't use any libraries that you can
code up yourself in a few hundred bytes of program text for the
functionality you need.

> > - All allocations performed by the block driver must have access
> > to dedicated memory resources.
> >
> > - Disable the congestion_wait mechanism for our code as much as
> > possible, at least enough to obtain the maximum memory
> > resources that can be used on the writeout path.
>
> Would peter's per bdi dirty page accounting patchset provide this?
> If not, what steps are you taking to disable this mechanism? I've
> found that nbd-server is frequently locked with 'blk_congestion_wait'
> in its call trace when I hit the deadlock.

See PF_LESS_THROTTLE. Also notice that this mechanism is somewhat
less than general. In mainline it only has one user, NFS, and it only
can have one user before you have to fiddle that code to create things
like PF_EVEN_LESS_THROTTLE.

As far as I can see, not having any dirty page limit for normal
allocations is the way to go, it avoids this mess nicely. Now we just
need to prove that this works ;-)

> > The specific measure we implement in order to prove a bound is:
> >
> > - Throttle IO on our block device to a known amount of traffic
> > for which we are sure that the MEMALLOC reserve will always be
> > adequate.
>
> I've embraced Evgeniy's bio throttle patch on a 2.6.22.6 kernel
> http://thread.gmane.org/gmane.linux.network/68021/focus=68552
>
> But are you referring to that (as you did below) or is this more a
> reference to peterz's bdi dirty accounting patchset?

No, it's a patch I wrote based on Evgeniy's original, that appeared
quietly later in the thread. At the time we hadn't tested it and now
we have. It works fine, it's short, general, efficient and easy to
understand. So it will get a post of its own pretty soon.

> > Note that the boundedness proof we use is somewhat loose at the
> > moment. It goes something like "we only need at most X kilobytes of
> > reserve and there are X megabytes available". Much of Peter's
> > patch set is aimed at getting more precise about this, but to be
> > sure, handwaving just like this has been part of core kernel since
> > day one without too many ill effects.
> >
> > The way we provide guaranteed access to memory resources is:
> >
> > - Run critical daemons in PF_MEMALLOC mode, including
> > any userspace daemons that must execute in the block IO path
> > (cluster coders take note!)
>
> I've been using Avi Kivity's patch from some time ago:
> http://lkml.org/lkml/2004/7/26/68

Yes. Ddsnap includes a bit of code almost identical to that, which we
wrote independently. Seems wild and crazy at first blush, doesn't it?
But this approach has proved robust in practice, and is, to my mind,
obviously correct.
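
For anybody who does not want to dig up that old patch: the essential
trick is nothing more than letting a suitably privileged task set
PF_MEMALLOC on itself. A hypothetical sketch of such a hook (the
PR_SET_MEMALLOC name is invented here, it is not what Avi's patch used):

	/* kernel/sys.c, inside sys_prctl() -- hypothetical option */
	case PR_SET_MEMALLOC:
		/* Only root gets to dip into the reserves. */
		if (!capable(CAP_SYS_ADMIN))
			return -EPERM;
		if (arg2)
			current->flags |= PF_MEMALLOC;
		else
			current->flags &= ~PF_MEMALLOC;
		return 0;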

> to get nbd-server to run in PF_MEMALLOC mode (could've just used
> the _POSIX_PRIORITY_SCHEDULING hack instead right?)... it didn't help
> on its own; I likely didn't have enough of the stars aligned to see
> my MD+NBD mke2fs test not deadlock.

You do need the block IO throttling, and you need to bypass the dirty
page limiting.

Without throttling, your block driver will quickly consume any amount
of reserve memory you have, and you are dead. Without an exemption
from dirty page limiting, the number of pages your user space daemon
can allocate without deadlocking is zero, which makes life very
difficult.

I will post our in-production version of the throttling patch in a day
or two.

> > Right now, all writeout submitted to ddsnap gets handed off to a
> > daemon running in PF_MEMALLOC mode. This is a needless
> > inefficiency that we want to remove in future, and handle as many
> > of those submissions as possible entirely in the context of the
> > submitter. To do this, further measures are needed:
> >
> > - Network writes performed by the block driver must have access
> > to dedicated memory resources.
>
> I assume peterz's network deadlock avoidance patchset (or some subset
> of it) has you covered here?

Yes.

> > Of all the patches posted so far related to this work, the only
> > indispensable one is the bio throttling patch developed by Evgeniy
> > and I in a parallel thread. The other essential pieces are all
> > implemented in our block driver for now. Some of those can be
> > generalized and moved at least partially into core, and some
> > cannot.
>
> I've been working off-list (with Evgeniy's help!) to give the bio
> throttling patch a try. I hacked MD (md.c and raid1.c) to limit NBD
> members to only 10 in-flight IOs. Without this throttle I'd see up
> to 170 IOs on the raid1's nbd0 member; with it the IOs hold fairly
> constant at ~16. But this didn't help my deadlock test either.
> Also, throttling in-flight IOs like this feels inherently
> sub-optimal. Have you taken any steps to make the 'bio-limit'
> dynamic in some way?

Yes, at least for device mapper devices. In our production device
mapper throttling patch, which I will post pretty soon, we provide an
arbitrary limit by default, and the device mapper device may change
it in its constructor method. Something similar should work for NBD.

As far as sub-optimal throughput goes, we run with a limit of 1,000
bvecs in flight (about 4 MB) and that does not seem to restrict
throughput measurably.
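
To make that concrete, the core of the throttle is nothing more than a
counter and a wait queue. A rough sketch along these lines (simplified
names, not the literal patch):

	struct bio_throttle {
		atomic_t inflight;	/* bvecs currently in flight */
		unsigned int limit;	/* ~1000 bvecs by default, the
					 * device constructor may change it */
		wait_queue_head_t wait;
	};

	/* Called on the submission path, in process context. */
	static void throttle_bio(struct bio_throttle *t, struct bio *bio)
	{
		/* Block the submitter while the device is over its limit.
		 * The check and the add are not atomic together; for a
		 * throttle, a small occasional overshoot does not matter. */
		wait_event(t->wait, atomic_read(&t->inflight) < t->limit);
		atomic_add(bio->bi_vcnt, &t->inflight);
	}

	/* Called from the bio completion path. */
	static void bio_throttle_done(struct bio_throttle *t, struct bio *bio)
	{
		if (atomic_sub_return(bio->bi_vcnt, &t->inflight) < t->limit)
			wake_up(&t->wait);
	}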

Though you also need this throttling, it is apparent from the traceback
you linked above that you ran aground on blk_congestion_wait. Try
setting your user space daemon into PF_LESS_THROTTLE mode and see what
happens.

> Anyway, I'm thinking I need to be stacking more/all of these things
> together rather than trying them piece-wise.

A VM Dagwood sandwich; I hope it tastes good :-)

Well, pretty soon we will join you in the NBD rehabilitation effort
because we require it for the next round of storage work, which
centers around the ddraid distributed block device. This requires an
NBD that functions reliably, even when accessing an exported block
device locally.

> I'm going to try adding all the things I've learned into the mix all
> at once; including both of peterz's patchsets. Peter, do you have a
> git repo or website/ftp site for your latest per-bdi and network
> deadlock patchsets? Pulling them out of LKML archives isn't "fun".
>
> Also, I've noticed that the more recent network deadlock avoidance
> patchsets haven't included NBD changes; any reason why these have
> been dropped? Should I just look to shoe-horn in previous
> NBD-oriented patches from an earlier version of that patchset?

I thought Peter was swapping over NBD? Anyway, we have not moved into
the NBD problem yet because we are still busy chasing
non-deadlock-related ddsnap bugs. Which require increasingly creative
efforts to trigger by the way, but we haven't quite run out of new
bugs, so we don't get to play with distributed storage just yet.

> > I do need to write some sort of primer on this, because there is no
> > fire-and-forget magic core kernel solution. There are helpful
> > things we can do in core, but some of it can only be implemented in
> > the drivers themselves.
>
> That would be quite helpful; all that I've learned has largely been
> from your various posts (or others' responses to your posts).
> Requires a hell of a lot of digging and ultimately I'm still missing
> something.
>
> In closing, if you (or others) are aware of a minimalist recipe that
> would help me defeat this mke2fs MD+NBD deadlock test (as detailed in
> my linux-mm post that I referenced above) I'd be hugely grateful.

Seeing as we have a virtually identical target configuration in mind,
you can expect quite a lot of help from our direction in the near
future, and in the meantime we can provide encouragement, information
and perhaps a few useful lines of code.

Regards,

Daniel

2007-09-18 08:12:24

by Wouter Verhelst

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, Sep 17, 2007 at 10:11:25PM -0700, Daniel Phillips wrote:
> On Monday 17 September 2007 20:27, Mike Snitzer wrote:
> > > - Statically prove bounded memory use of all code in the writeout
> > > path.
> > >
> > > - Implement any special measures required to be able to make such
> > > a proof.
> >
> > Once the memory requirements of a userspace daemon (e.g. nbd-server)
> > are known; should one mlockall() the memory similar to how is done in
> > heartbeat daemon's realtime library?
>
> Yes, and also inspect the code to ensure it doesn't violate mlockall()
> by exec'ing programs (no shell scripts!), dynamically loading
> libraries, etc.

In nbd-server, there's no dlopen(), and I do not currently plan to add
that. Are there problems with using shared libraries that are linked in
at build time that I'm not aware of?

There are plans to add the possibility of shell script callouts, but
those would always be optional. I see no reason why we couldn't make an
mlockall() call an option, too; preferably one that would be
incompatible with the shell script callout stuff.

> > Bigger question for me is what kind of hell am I (or others) in for
> > to try to cap nbd-server's memory usage? All those glib-gone-wild
> > changes over the recent past feel problematic but I'll look to work
> > with Wouter to see if we can get things bounded.
>
> Avoiding glib is a good start. Look at your library dependencies and
> prune them mercilessly. Just don't use any libraries that you can
> code up yourself in a few hundred bytes of program text for the
> functionality you need.

I'm currently using glib because I wanted some utility functions that it
provides and because I already knew glib; to me, it feels stupid to
reimplement the same things over and over again when there are libraries
that provide them.

If using glib is problematic for whatever reason, I'll certainly be
willing to switch to "something else"; I just didn't feel like
reinventing the wheel for no particular reason.

[...]
> to get nbd-server to run in PF_MEMALLOC mode (could've just used
> > the _POSIX_PRIORITY_SCHEDULING hack instead right?)... it didn't help
> > on its own; I likely didn't have enough of the stars aligned to see
> > my MD+NBD mke2fs test not deadlock.
>
> You do need the block IO throttling, and you need to bypass the dirty
> page limiting.
>
> Without throttling, your block driver will quickly consume any amount
> of reserve memory you have, and you are dead. Without an exemption
> from dirty page limiting, the number of pages your user space daemon
> can allocate without deadlocking is zero, which makes life very
> difficult.

Would having the server use O_DIRECT help here? I would think that it
would avoid marking pages dirty, but I'm not very familiar with
the in-kernel bits.
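
To be concrete, what I'm thinking of is just the usual O_DIRECT dance
on the export file; a sketch, not current nbd-server code:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	/* Write one 4K block at 'offset' to the export, bypassing the
	 * page cache so the server never creates dirty cache pages.
	 * (A real server would open the export once, not per block.) */
	static int direct_write_block(const char *exportname, off_t offset,
				      const void *data)
	{
		void *buf;
		int fd = open(exportname, O_RDWR | O_DIRECT);

		if (fd < 0)
			return -1;
		/* O_DIRECT requires block-aligned buffers, offsets, sizes. */
		if (posix_memalign(&buf, 4096, 4096)) {
			close(fd);
			return -1;
		}
		memcpy(buf, data, 4096);
		if (pwrite(fd, buf, 4096, offset) != 4096) {
			/* report the error to the client */
		}
		free(buf);
		close(fd);
		return 0;
	}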

--
<Lo-lan-do> Home is where you have to wash the dishes.
-- #debian-devel, Freenode, 2004-09-22

2007-09-18 09:31:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 17 Sep 2007 23:27:25 -0400 "Mike Snitzer" <[email protected]>
wrote:

> I'm going to try adding all the things I've learned into the mix all
> at once; including both of peterz's patchsets. Peter, do you have a
> git repo or website/ftp site for your latest per-bdi and network
> deadlock patchsets? Pulling them out of LKML archives isn't "fun".

BDI should be back in -mm; as for the other, it's in shambles atm. I'll
tell you where to find it when I've put it back together.

I should find myself some time to read up on how to push relative git
trees, as I did get myself a kernel.org account.

> Also, I've noticed that the more recent network deadlock avoidance
> patchsets haven't included NBD changes; any reason why these have been
> dropped? Should I just look to shoe-horn in previous NBD-oriented
> patches from an earlier version of that patchset?

NBD has some serious block layer issues. I once talked with Jens about
it and he explained what needed to be done to get NBD back into
shape again, but I could not be bothered to spend time on it.
[ and have since forgotten most of the details :-/ ]

For me NBD is dead and broken beyond repair; it needs a wholesale
rewrite.

2007-09-18 09:58:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Mon, 17 Sep 2007 22:11:25 -0700 Daniel Phillips <[email protected]>
wrote:


> > I've been using Avi Kivity's patch from some time ago:
> > http://lkml.org/lkml/2004/7/26/68
>
> Yes. Ddsnap includes a bit of code almost identical to that, which we
> wrote independently. Seems wild and crazy at first blush, doesn't it?
> But this approach has proved robust in practice, and is, to my mind,
> obviously correct.

I'm so not liking this :-(

Can't we just run the user-space part as mlockall and extend netlink
to work with PF_MEMALLOC where needed?

I did something like that for iSCSI.

2007-09-18 16:56:57

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tuesday 18 September 2007 02:58, Peter Zijlstra wrote:
> On Mon, 17 Sep 2007 22:11:25 -0700 Daniel Phillips wrote:
> > > I've been using Avi Kivity's patch from some time ago:
> > > http://lkml.org/lkml/2004/7/26/68
> >
> > Yes. Ddsnap includes a bit of code almost identical to that, which
> > we wrote independently. Seems wild and crazy at first blush,
> > doesn't it? But this approach has proved robust in practice, and is
> > to my mind, obviously correct.
>
> I'm so not liking this :-(

Why don't you share your specific concerns?

> Can't we just run the user-space part as mlockall and extend netlink
> to work with PF_MEMALLOC where needed?
>
> I did something like that for iSCSI.

Not sure what you mean by "extend netlink". We do run the user daemons
under mlockall, of course; this is one of the rules I stated earlier for
daemons running in the block IO path. The problem is, if this
userspace daemon allocates even one page, for example in sys_open, it
can deadlock. Running the daemon in PF_MEMALLOC mode fixes this
problem robustly, provided that the necessary audit of memory
allocation patterns and library dependencies has been done.

I suppose you are worried that the userspace code could unexpectedly
allocate a large amount of memory and exhaust the entire PF_MEMALLOC
reserve? Kernel code could do that too. This userspace code just
needs to be checked carefully. Perhaps we could come up with a kernel
debugging option to verify that a task does in fact stay within some
bounded number of page allocs while in PF_MEMALLOC mode.
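
To sketch what I mean, and this is purely hypothetical, nothing like it
exists today: every allocation made while PF_MEMALLOC is set could bump
a per-task counter, and the allocator could complain when a task blows
past a configured budget:

	/* Hypothetical check in the page allocator slow path, hit only
	 * when the caller is dipping into the PF_MEMALLOC reserve.
	 * memalloc_pages and memalloc_debug_limit are invented names. */
	if (unlikely(current->flags & PF_MEMALLOC)) {
		current->memalloc_pages++;
		if (current->memalloc_pages > memalloc_debug_limit)
			printk(KERN_WARNING
			       "%s(%d) exceeded its PF_MEMALLOC budget\n",
			       current->comm, current->pid);
	}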

Regards,

Daniel

2007-09-18 19:18:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Tue, 18 Sep 2007 09:56:06 -0700 Daniel Phillips <[email protected]>
wrote:

> On Tuesday 18 September 2007 02:58, Peter Zijlstra wrote:
> > On Mon, 17 Sep 2007 22:11:25 -0700 Daniel Phillips wrote:
> > > > I've been using Avi Kivity's patch from some time ago:
> > > > http://lkml.org/lkml/2004/7/26/68
> > >
> > > Yes. Ddsnap includes a bit of code almost identical to that, which
> > > we wrote independently. Seems wild and crazy at first blush,
> > > doesn't it? But this approach has proved robust in practice, and is
> > > to my mind, obviously correct.
> >
> > I'm so not liking this :-(
>
> Why don't you share your specific concerns?
>
> > Can't we just run the user-space part as mlockall and extend netlink
> > to work with PF_MEMALLOC where needed?
> >
> > I did something like that for iSCSI.
>
> Not sure what you mean by extend netlink. We do run the user daemons
> under mlockall of course, this is one of the rules I stated earlier for
> daemons running in the block IO path. The problem is, if this
> userspace daemon allocates even one page, for example in sys_open, it
> can deadlock. Running the daemon in PF_MEMALLOC mode fixes this
> problem robustly, provided that the necessary audit of memory
> allocation patterns and library dependencies has been done.
>
> I suppose you are worried that the userspace code could unexpectedly
> allocate a large amount of memory and exhaust the entire PF_MEMALLOC
> reserve? Kernel code could do that too. This userspace code just
> needs to be checked carefully. Perhaps we could come up with a kernel
> debugging option to verify that a task does in fact stay within some
> bounded number of page allocs while in PF_MEMALLOC mode.

As I said on IRC, my main concern is exposing PF_MEMALLOC to user-space
at all.

I'm sure you have good programmers who write perfect user-space
code. But once the thing is out there, there is little to no control.
Of course, once root, trashing your box isn't hard, but let's not make
it easier.

The iSCSI daemon was mlockall'ed but only communicated with the kernel
using netlink, so by sprinkling pixie dust on the netlink code one can
inject user-space policy stuff in a safe way.
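
By pixie dust I mean something as simple as the following shape (names
invented, just to illustrate): the kernel-side netlink input routine
borrows reserve access for the duration of one message from the trusted,
mlockall'ed daemon, so the daemon itself never runs with PF_MEMALLOC.

	static void my_netlink_rcv(struct sk_buff *skb)
	{
		unsigned long pflags = current->flags;

		/* Only the kernel code acting on the daemon's message
		 * gets reserve access, and only for this one message. */
		current->flags |= PF_MEMALLOC;
		handle_daemon_message(skb);	/* hypothetical handler */
		current->flags = pflags;
	}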

2007-10-26 17:44:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

Hi!

> > > or
> > >
> > > - have a global reserve and selectively serves sockets
> > > (what I've been doing)
> >
> > That is a scalability problem on large systems! Global means global
> > serialization, cacheline bouncing and possibly livelocks. If we get into
> > this global shortage then all cpus may end up taking the same locks,
> > cycling through the same allocation paths.
>
> Dude, breathe, these boxens of yours will never swap over network simply
> because you never configure swap.
>
> And, _no_, it does not necessarily mean global serialisation. By simply
> saying there must be N pages available I say nothing about on which node
> they should be available, and the way the watermarks work they will be
> evenly distributed over the appropriate zones.

Agreed. Scalability of the emergency swapping reserves is simply
unimportant. Please, let's get swapping to _work_ first, then we can
make it faster.

No, I do not think we'll ever see a livelock on this.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-10-26 17:56:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Fri, 26 Oct 2007, Pavel Machek wrote:

> > And, _no_, it does not necessarily mean global serialisation. By simply
> > saying there must be N pages available I say nothing about on which node
> > they should be available, and the way the watermarks work they will be
> > evenly distributed over the appropriate zones.
>
> Agreed. Scalability of the emergency swapping reserves is simply
> unimportant. Please, let's get swapping to _work_ first, then we can
> make it faster.

Global reserve means that any cpuset that runs out of memory may exhaust
the global reserve and thereby impact the rest of the system. The
emergencies that are currently localized to a subset of the system and
may lead to the failure of a job may now become global and lead to the
failure of all jobs running on it.

But Peter mentioned that he has some way of tracking the amount of memory
used in a certain context (beancounter?) which would address the issue.

2007-10-27 22:59:15

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Friday 26 October 2007 10:55, Christoph Lameter wrote:
> On Fri, 26 Oct 2007, Pavel Machek wrote:
> > > And, _no_, it does not necessarily mean global serialisation. By
> > > simply saying there must be N pages available I say nothing about
> > > on which node they should be available, and the way the
> > > watermarks work they will be evenly distributed over the
> > > appropriate zones.
> >
> > Agreed. Scalability of the emergency swapping reserves is simply
> > unimportant. Please, let's get swapping to _work_ first, then we can
> > make it faster.
>
> Global reserve means that any cpuset that runs out of memory may
> exhaust the global reserve and thereby impact the rest of the system.
> The emergencies that are currently localized to a subset of the
> system and may lead to the failure of a job may now become global and
> lead to the failure of all jobs running on it.

If it does, it is a bug in the reserve accounting. That said, I still
agree with you that per-node reserve is a desirable goal for numa. I
would just like to be clear that it is not necessary, even for numa,
just nice. By all means somebody should be hacking on a numa feature
for per-node emergency reserves, but as far as fixing the immediate,
serious kernel block IO deadlocks goes, it does not matter.

Pavel, I do not agree that efficiency is unimportant on the
under-pressure path. I do not even like to call that the "emergency"
path, because under heavy load it is normal for a machine to spend a
significant fraction of its time in that state. However, the
efficiency goal there does not need to be quite the same as normal
mode.

To illustrate, I would expect to see something like 95% of normal block
IO performance on a numa machine in the case that "emergency" (aka
memalloc memory) is allocated globally instead of locally, thus paying
a (modest compared to the disk transfer itself) penalty for transfer of
disk data over the numa interconnect. 95% of normal throughput on the
block IO path is not a problem: if the machine spends 5% of its time on
the "emergency" (aka memalloc) path at 95% efficiency, then overall
efficiency will be 95% + (5% * 95%) = 99.75%.

Moral of this story: let's get the memory recursion fixes done in the
most obviously correct way and not get distracted by illusory
efficiency requirements for numa, which do not have a big bottom-line
impact.

I'm glad to see everybody still interested in these problems. Though we
have been a little quiet on this issue over here for a while, it does
not mean that progress has stopped. In fact, we are testing our
solutions more heavily than ever, and getting closer to a solution that
not only works solidly, but that should enable mass deletion of the
whole creaky notion of dirty page limits in favor of nice, tight
per-device control of in-flight write traffic as I have described
previously.

Regards,

Daniel

2007-10-27 23:09:12

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

On Friday 26 October 2007 10:44, Peter wrote:
> > ...the way the watermarks work they will be evenly distributed
> > over the appropriate zones. ..

Hi Peter,

The term is "highwater mark" not "high watermark". A watermark is an
anti-counterfeiting device printed on paper money. "Highwater" is how
high water gets, which I believe is the sense we intend in Linux.
Therefore any occurrence of "watermark" in the kernel source is a
spelling mistake, unless it has something to do with printing paper
money.

While fixing this entrenched terminology abuse in our kernel source may
be difficult, sticking to the correct English on lkml is quite easy :-)

Regards,

Daniel