There is a fundamental deadlock associated with paging; writing out a page
in order to free memory can itself require free memory to complete. The
usual solution is to keep a small amount of memory reserved at all times so
that this can be overcome. This, however, assumes that the amount of memory
needed for writeout is bounded and smaller than the provided reserve.
It is this latter assumption that breaks when doing writeout over the
network. Networking can consume an unbounded amount of memory while waiting
for a reply to our write request. This re-introduces the deadlock: we might
never complete the writeout, because we might not have enough memory left
to receive the completion message.
The proposed solution is simple: only allow traffic servicing the VM to make
use of the reserves.
This, however, implies knowing which packets are destined for whom, which,
generally speaking, we don't until they have been received. Hence we must
receive all packets, but discard any packet allocated from the reserves as
soon as we find it is not bound for the VM.
Recognizing VM-bound traffic also needs a little help, hence we introduce
the socket flag SOCK_VMIO to mark such sockets with.
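
To make that concrete, a minimal sketch of the receive-side check
(illustration only, not the patch code; skb_emergency(), sk_is_vmio() and
the skb->emergency bit are made-up names used here, only SOCK_VMIO is
actually introduced by the series):

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <net/sock.h>

static inline int skb_emergency(const struct sk_buff *skb)
{
        return unlikely(skb->emergency);        /* assumed: set when data came from the reserves */
}

static inline int sk_is_vmio(struct sock *sk)
{
        return sock_flag(sk, SOCK_VMIO);        /* socket marked as servicing VM I/O */
}

/* in a protocol's receive path, once the destination socket is known */
static int example_rcv(struct sock *sk, struct sk_buff *skb)
{
        if (skb_emergency(skb) && !sk_is_vmio(sk)) {
                /* reserve memory may only service VM-bound traffic */
                kfree_skb(skb);
                return NET_RX_DROP;
        }
        /* ... normal delivery ... */
        return NET_RX_SUCCESS;
}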
Of course, since we are paging, all of this has to happen in kernel space;
user space might simply not be there.
Since packet processing itself might require memory, this also implies that
those auxiliary allocations may use the reserves while an emergency packet
is being processed. This is accomplished by using PF_MEMALLOC.
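
Roughly along these lines (again just a sketch of the PF_MEMALLOC idiom;
the function name is made up, the flag save/restore is the relevant part):

#include <linux/sched.h>
#include <linux/skbuff.h>

static void process_emergency_packet(struct sk_buff *skb)
{
        int memalloc = current->flags & PF_MEMALLOC;

        /* let auxiliary allocations in this context dip into the reserves */
        current->flags |= PF_MEMALLOC;

        /* ... protocol processing that may need to allocate memory ... */

        if (!memalloc)
                current->flags &= ~PF_MEMALLOC;
}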
How much memory to reserve is also an issue: enough to saturate both the
route cache and IP fragment reassembly, along with various constant
overheads.
This patch-set comes in 6 parts:
1) introduce the memory reserve and make the SLAB allocator play nice with it.
patches 01-10
2) add some needed infrastructure to the network code
patches 11-13
3) implement the idea outlined above
patches 14-20
4) teach the swap machinery to use generic address_spaces
patches 21-24
5) implement swap over NFS using all the new stuff
patches 25-31
6) implement swap over iSCSI
patches 32-40
Patches can also be found here:
http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v12/
If I receive no feedback, I will assume the various maintainers do not object
and I will respin the series against -mm and submit for inclusion.
There is interest in this feature from the stateless Linux world; that is,
both the virtualization world and the cluster world.
I have been contacted by various groups; some have just expressed their
interest, others have been testing this work in their environments.
Various hardware vendors have also expressed interest, and, of course, my
employer finds it important enough to have me work on it.
Also, while it doesn't present a full-fledged reserve-based allocator API
yet, it does lay most of the groundwork for one. There is a GFP_NOFAIL
elimination project that wants to use this as a foundation. Eliminating
GFP_NOFAIL would greatly improve the basic soundness and stability of the
code that currently uses that construct - most disk-based filesystems.
--
On Fri, 2007-05-04 at 12:26 +0200, Peter Zijlstra wrote:
> 1) introduce the memory reserve and make the SLAB allocator play nice with it.
> patches 01-10
>
> 2) add some needed infrastructure to the network code
> patches 11-13
>
> 3) implement the idea outlined above
> patches 14-20
>
> 4) teach the swap machinery to use generic address_spaces
> patches 21-24
>
> 5) implement swap over NFS using all the new stuff
> patches 25-31
>
> 6) implement swap over iSCSI
> patches 32-40
This is kind of a lot of patches all at once.. Have you released any of
these patch sets prior to this release?
Daniel
On Fri, 2007-05-04 at 08:22 -0700, Daniel Walker wrote:
> On Fri, 2007-05-04 at 12:26 +0200, Peter Zijlstra wrote:
>
> > 1) introduce the memory reserve and make the SLAB allocator play nice with it.
> > patches 01-10
> >
> > 2) add some needed infrastructure to the network code
> > patches 11-13
> >
> > 3) implement the idea outlined above
> > patches 14-20
> >
> > 4) teach the swap machinery to use generic address_spaces
> > patches 21-24
> >
> > 5) implement swap over NFS using all the new stuff
> > patches 25-31
> >
> > 6) implement swap over iSCSI
> > patches 32-40
>
> This is kind of a lot of patches all at once.. Have you released any of
> these patch sets prior to this release?
Like the -v12 suggests, this is the 12th posting of this patch set.
Some is the same, some has changed.
On Fri, 2007-05-04 at 17:38 +0200, Peter Zijlstra wrote:
> >
> > This is kind of a lot of patches all at once.. Have you released any of
> > these patch sets prior to this release?
>
> Like the -v12 suggests, this is the 12th posting of this patch set.
> Some is the same, some has changed.
I can find one prior release with this subject (-v11); what was the
subject prior to that release? It's not a hard rule, but usually >15
patches is too many (check Documentation/SubmittingPatches under
references). You might want to consider submitting a URL instead.
I think it's a benefit to release less at a time, since a developer
(like myself) might know very little about "Swap over Networked
storage"; if you submit 10 patches that developer might still review
them, whereas with 40 patches they likely wouldn't.
Daniel
On 5/4/07, Daniel Walker <[email protected]> wrote:
> On Fri, 2007-05-04 at 17:38 +0200, Peter Zijlstra wrote:
> > >
> > > This is kind of a lot of patches all at once.. Have you released any of
> > > these patch sets prior to this release?
> >
> > Like the -v12 suggests, this is the 12th posting of this patch set.
> > Some is the same, some has changed.
>
> I can find one prior release with this subject (-v11); what was the
> subject prior to that release? It's not a hard rule, but usually >15
> patches is too many (check Documentation/SubmittingPatches under
> references). You might want to consider submitting a URL instead.
Previous subjects were like:
[PATCH 00/20] vm deadlock avoidance for NFS, NBD and iSCSI (take 7)
A URL doesn't allow for true discussion about a particular patch
unless the reviewer takes the initiative to create a new thread to
discuss the Nth patch in a patchset, thereby taking on the burden of a
structured subject and so on. It would get out of control on a large
patchset that actually got a lot of simultaneous feedback... reviewers
don't have a forum to talk about each individual change without
stepping on each others' toes.
> I think it's a benefit to release less at a time, since a developer
> (like myself) might know very little about "Swap over Networked
> storage"; if you submit 10 patches that developer might still review
> them, whereas with 40 patches they likely wouldn't.
The _suggestions_ in Documentation/SubmittingPatches are nice and all
but the quantity of patches shouldn't _really_ matter.
Documentation/SubmittingPatches actually doesn't cover how to post a
large change because it first states:
"Separate _logical changes_ into a single patch file."
then:
"If you cannot condense your patch set into a smaller set of patches,
then only post say 15 or so at a time and wait for review and integration."
These suggestions conflict in the case of a large patchset: the second
can't be met if you honor the first (more important suggestion IMHO).
Unless you leave something out... and I can't see the value in leaving
out the auxiliary consumers of the core changes.
Reviewing 10 patches that are quite large/overloaded is actually
harder than 40 broken-out/well-documented patches. But maybe others
disagree.
*shrug*
From: Peter Zijlstra <[email protected]>
Date: Fri, 04 May 2007 12:26:51 +0200
> There is a fundamental deadlock associated with paging;
I know you'd really like people like myself to review this work, but a
set of 40 patches is just too much to try and digest at once
especially when I have other things going on. When I have lots of
other things already on my plate, when I see a huge patch set like
this I have to just say "delete" because I don't kid myself since
I know I'll never get to it.
Sorry, there's no way I can review this with my current workload.
On Fri, 2007-05-04 at 14:09 -0400, Mike Snitzer wrote:
> On 5/4/07, Daniel Walker <[email protected]> wrote:
> > On Fri, 2007-05-04 at 17:38 +0200, Peter Zijlstra wrote:
> > > >
> > > > This is kind of a lot of patches all at once.. Have you released any of
> > > > these patch sets prior to this release?
> > >
> > > Like the -v12 suggests, this is the 12th posting of this patch set.
> > > Some is the same, some has changed.
> >
> > I can find one prior release with this subject (-v11); what was the
> > subject prior to that release? It's not a hard rule, but usually >15
> > patches is too many (check Documentation/SubmittingPatches under
> > references). You might want to consider submitting a URL instead.
>
> Previous subjects were like:
> [PATCH 00/20] vm deadlock avoidance for NFS, NBD and iSCSI (take 7)
>
> A URL doesn't allow for true discussion about a particular patch
> unless the reviewer takes the initiative to create a new thread to
> discuss the Nth patch in a patchset, thereby taking on the burden of a
> structured subject and so on. It would get out of control on a large
> patchset that actually got a lot of simultaneous feedback... reviewers
> don't have a forum to talk about each individual change without
> stepping on each others' toes.
True ..
> > I think it's a benefit to release less at a time, since a developer
> > (like myself) might know very little about "Swap over Networked
> > storage"; if you submit 10 patches that developer might still review
> > them, whereas with 40 patches they likely wouldn't.
>
> The _suggestions_ in Documentation/SubmittingPatches are nice and all
> but the quantity of patches shouldn't _really_ matter.
I guess I take the documentation more seriously than you do. It's
clearly not mandatory, but for my reviewing I appreciate fewer than 15
sets of "logical changes".
Daniel
On Fri, 2007-05-04 at 12:27 -0700, David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Fri, 04 May 2007 12:26:51 +0200
>
> > There is a fundamental deadlock associated with paging;
>
> I know you'd really like people like myself to review this work, but a
> set of 40 patches is just too much to try and digest at once
> especially when I have other things going on.
I realize this; however, I expected you to mainly look at the 10
network-related patches, namely 11/40 - 20/40.
I know they build upon the previous 10 patches, which are mostly VM, and
you seem to have an interest in that as well, so that would be 20
patches to look at. Still a sizable set.
How would you prefer I present these?
The other patches are NFS and iSCSI, I'd not expect you to review those
in depth.
From: "Mike Snitzer" <[email protected]>
Date: Fri, 4 May 2007 14:09:40 -0400
> These suggestions conflict in the case of a large patchset: the second
> can't be met if you honor the first (more important suggestion IMHO).
> Unless you leave something out... and I can't see the value in leaving
> out the auxiliary consumers of the core changes.
They do not conflict.
If you say you're setting up infrastructure for a well-defined
purpose, then each and every one of the patches can stand on its
own just fine. You can even post them one at a time and the review
process would work just fine.
From: Peter Zijlstra <[email protected]>
Date: Fri, 04 May 2007 21:41:49 +0200
> How would you prefer I present these?
How about 8 or 9 at a time? You are building infrastructure
and therefore you could post them 1 at a time for review
since each patch should be able to stand on its own.
David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Fri, 04 May 2007 21:41:49 +0200
>
>> How would you prefer I present these?
>
> How about 8 or 9 at a time? You are building infrastructure
> and therefore you could post them 1 at a time for review
> since each patch should be able to stand on its own.
Indeed. Just glancing over the patchset, there are quite a few "easy to
apply" cleanup patches that could be fast-forwarded to upstream, without
requiring deep thought on the swap-over-storage MM changes or net
allocator changes.
Jeff
Daniel Walker wrote:
> On Fri, 2007-05-04 at 12:26 +0200, Peter Zijlstra wrote:
>
>
>> 1) introduce the memory reserve and make the SLAB allocator play nice with it.
>> patches 01-10
>>
>> 2) add some needed infrastructure to the network code
>> patches 11-13
>>
>> 3) implement the idea outlined above
>> patches 14-20
>>
>> 4) teach the swap machinery to use generic address_spaces
>> patches 21-24
>>
>> 5) implement swap over NFS using all the new stuff
>> patches 25-31
>>
>> 6) implement swap over iSCSI
>> patches 32-40
>>
>
> This is kind of a lot of patches all at once.. Have you released any of
> these patch sets prior to this release?
>
Yes, several times AFAIK.
- Arnaldo
On Fri, May 04, 2007 at 12:27:16PM -0700, David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Fri, 04 May 2007 12:26:51 +0200
>
> > There is a fundamental deadlock associated with paging;
>
> I know you'd really like people like myself to review this work, but a
> set of 40 patches is just too much to try and digest at once
> especially when I have other things going on. When I have lots of
> other things already on my plate, when I see a huge patch set like
> this I have to just say "delete" because I don't kid myself since
> I know I'll never get to it.
>
> Sorry, there's no way I can review this with my current workload.
There's also quite a lot of only semi-related things in there. It would
be much better to do only the network stack and iscsi parts first
and leave nfs out for a while, especially as the former are definitely
useful, while I strongly doubt that for swap over nfs.
On Fri, 04 May 2007 12:26:51 +0200, Peter Zijlstra <[email protected]> wrote:
>>> There is a fundamental deadlock associated with paging;
On Fri, May 04, 2007 at 12:27:16PM -0700, David Miller wrote:
>> I know you'd really like people like myself to review this work, but a
>> set of 40 patches is just too much to try and digest at once
>> especially when I have other things going on. When I have lots of
>> other things already on my plate, when I see a huge patch set like
>> this I have to just say "delete" because I don't kid myself since
>> I know I'll never get to it.
>> Sorry, there's no way I can review this with my current workload.
On Sat, May 05, 2007 at 10:43:00AM +0100, Christoph Hellwig wrote:
> There's also quite a lot of only semi-related things in there. It would
> be much better to do only the network stack and iscsi parts first
> and leave nfs out for a while, especially as the former are definitely
> useful, while I strongly doubt that for swap over nfs.
This is backward. As much as we hate it, the common case is swap over
nfs, essentially because that is/was how things were commonly set up
for other operating systems. I'm not a Solaris administrator, though,
so various disclaimers apply.
-- wli