2008-02-20 15:23:49

by Peter Zijlstra

Subject: [PATCH 00/28] Swap over NFS -v16

Hi,

Another posting of the full swap over NFS series.

Andrew/Linus, could we start thinking of sticking this in -mm?

[ patches against 2.6.25-rc2-mm1, also to be found online at:
http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.25-rc2-mm1/ ]

The patch-set can be split into roughly 5 parts, for each of which I shall give
a description.


Part 1, patches 1-11

The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.

Swap over network has the problem that the network subsystem does not use fixed
sized allocations, but heavily relies on kmalloc(). This makes mempools
unusable.

This first part provides a generic reserve framework. This framework
could also be used to get rid of some of the __GFP_NOFAIL users.

Care is taken to only affect the slow paths - when we're low on memory.

Caveats: it currently doesn't do SLOB.

1 - mm: gfp_to_alloc_flags()
2 - mm: tag reserve pages
3 - mm: sl[au]b: add knowledge of reserve pages
4 - mm: kmem_estimate_pages()
5 - mm: allow PF_MEMALLOC from softirq context
6 - mm: serialize access to min_free_kbytes
7 - mm: emergency pool
8 - mm: system wide ALLOC_NO_WATERMARK
9 - mm: __GFP_MEMALLOC
10 - mm: memory reserve management
11 - selinux: tag avc cache alloc as non-critical
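To give a feel for the interface, a reserve user would do something like
the below. [ Illustrative sketch only; the exact function names and
signatures live in patch 10 and may differ from what is written here. ]

	static struct mem_reserve my_reserve;

	/* declare how much kmalloc memory we need to guarantee progress */
	mem_reserve_init(&my_reserve, "my subsystem");
	mem_reserve_kmalloc_set(&my_reserve, 16 * 1024);

	/* attach to the root reserve, raising the page watermarks */
	mem_reserve_connect(&my_reserve, &mem_reserve_root);

	/* in the critical path: charge first, then allocate with
	 * access to the reserves */
	if (mem_reserve_kmalloc_charge(&my_reserve, size, 0))
		obj = kmalloc(size, gfp | __GFP_MEMALLOC);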


Part 2, patches 12-14

Provide some generic network infrastructure needed later on.

12 - net: wrap sk->sk_backlog_rcv()
13 - net: packet split receive api
14 - net: sk_allocation() - concentrate socket related allocations


Part 3, patches 15-21

Now that we have a generic memory reserve system, use it on the network stack.
The thing that makes this interesting is that, contrary to BIO, both the
transmit and receive paths require memory allocations.

That is, in the BIO layer write back completion is usually just an ISR flipping
a bit and waking stuff up. A network write back completion involves receiving
packets, which, when there is no memory, is rather hard. And even when there is
memory there is no guarantee that the required packet arrives within the
window that memory buys us.

The solution to this problem is found in the fact that the network is to be
assumed lossy. Even now, when there is no memory to receive packets, the
network card has to discard packets. What we do is move this into the network
stack.

So we reserve a little pool to act as a receive buffer; this allows us to
inspect packets before tossing them. This way, we can filter out those packets
that ensure progress (writeback completion) and disregard the others (as would
have happened anyway). [ NOTE: this is a stable mode of operation with limited
memory usage, exactly the kind of thing we need ]

Again, care is taken to confine most of the overhead to the slow path. Only
packets allocated from the reserves will suffer the extra atomic overhead
needed for accounting.

15 - netvm: network reserve infrastructure
16 - netvm: INET reserves.
17 - netvm: hook skb allocation to reserves
18 - netvm: filter emergency skbs.
19 - netvm: prevent a TCP specific deadlock
20 - netfilter: NF_QUEUE vs emergency skbs
21 - netvm: skb processing


Part 4, patches 22-23

Generic vm infrastructure to handle swapping to a filesystem instead of a block
device.

This provides new a_ops to handle swapcache pages and could be used to obsolete
the bmap usage for swapfiles.

22 - mm: add support for non block device backed swap files
23 - mm: methods for teaching filesystems about PG_swapcache pages
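For illustration, the new a_ops take roughly the following shape. [ Sketch
only; the method names (->swapfile, ->swapout, ->swapin) follow the
descriptions in this series, but the exact signatures live in patch 23 and
may differ. ]

	struct address_space_operations {
		...
		/* enable/disable use of a file as swap space */
		int (*swapfile)(struct address_space *mapping, int enable);
		/* like writepage/readpage, but for PG_swapcache pages */
		int (*swapout)(struct file *file, struct page *page,
			       struct writeback_control *wbc);
		int (*swapin)(struct file *file, struct page *page);
	};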


Part 5, patches 24-28

Finally, convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

24 - nfs: remove mempools
25 - nfs: teach the NFS client how to treat PG_swapcache pages
26 - nfs: disable data cache revalidation for swapfiles
27 - nfs: enable swap on NFS
28 - nfs: fix various memory recursions possible with swap over NFS.


Changes since -v15:
- fwd port
- more SGE fragment drivers ported
- made the new swapfile logic unconditional
- various bug fixes and cleanups


2008-02-23 08:21:49

by Andrew Morton

Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[email protected]> wrote:

> Another posting of the full swap over NFS series.

Well I looked. There's rather a lot of it and I wouldn't pretend to
understand it.

What is the NFS and net people's take on all of this?

2008-02-26 06:03:35

by NeilBrown

Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Saturday February 23, [email protected] wrote:
> On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[email protected]> wrote:
>
> > Another posting of the full swap over NFS series.
>
> Well I looked. There's rather a lot of it and I wouldn't pretend to
> understand it.

But pretending is fun :-)

>
> What is the NFS and net people's take on all of this?

Well I'm only vaguely an NFS person, barely a net person, sporadically
an mm person, but I've had a look and it seems to mostly make sense.

We introduce a new "emergency" concept for page allocation.
The size of the emergency pool is set by various reservations by
different potential users.
If the number of free pages is below the "emergency" size, then only
users with a "MEMALLOC" flag get to allocate pages. Further, those
pages get a "reserve" flag set which propagates into slab/slub so
kmalloc/kmem_cache_alloc only return memory from those pages to MEMALLOC
users.
MEMALLOC users are those that set PF_MEMALLOC. A socket can get
SOCK_MEMALLOC set which will cause certain pieces of code to
temporarily set PF_MEMALLOC while working on that socket.
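To make that last point concrete, the sk_backlog_rcv() wrapping presumably
ends up looking something like the following. This is my reconstruction
from the patch descriptions, not the actual code; skb_emergency() is my
guess at the helper that tests whether an skb came from the reserves.

	static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
	{
		int ret;

		if (skb_emergency(skb)) {
			unsigned long pflags = current->flags;

			/* temporarily grant access to the memory reserves */
			current->flags |= PF_MEMALLOC;
			ret = sk->sk_backlog_rcv(sk, skb);
			/* restore the previous PF_MEMALLOC state */
			current->flags = (current->flags & ~PF_MEMALLOC) |
					 (pflags & PF_MEMALLOC);
		} else
			ret = sk->sk_backlog_rcv(sk, skb);

		return ret;
	}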

The upshot is that, provided any MEMALLOC user reserves an appropriate
amount of emergency space, returns the emergency memory promptly, and
sets PF_MEMALLOC whenever allocating memory, its memory allocations
should never fail.

As memory is requested in small units, but allocated as pages, there
needs to be a conversion from small units to pages. One of the
patches does this and appears to err on the side of being over-generous,
which is the right thing to do.


Memory reservations are organised in a tree. I really don't
understand the tree. Is it just to make /proc/reserve_info look more
helpful?
Certainly all the individual reservations need to be recorded, and the
cumulative reservation needs also to be recorded (currently in the
root of the tree) but what are all the other levels used for?

Reservations are used for all the transient memory that might be used
by the network stack. This particularly includes the route cache and
skbs for incoming messages. I have no idea if there is anything else
that needs to be allowed for.

Filesystems can advertise (via address_space_operations) that files
may be used as swap files. They then provide swapout/swapin methods
which are like writepage/readpage but may behave differently and have
a different way to get credentials from a 'struct file'.


So in general, the patch set looks to have the right sort of shape. I
cannot be very authoritative on the details as there are a lot of
them, and they touch code that I'm not very familiar with.

Some specific comments on patches:


reserve-slub.patch

Please avoid irrelevant reformatting in patches. It makes them
harder to read. e.g.:

-static void setup_object(struct kmem_cache *s, struct page *page,
- void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)


mm-kmem_estimate_pages.patch

This introduces
kestimate
kestimate_single
kmem_estimate_pages

The last obviously returns a number of pages. The contrast seems
to suggest the others don't. But they do...
I don't think the names are very good, but I concede that it is
hard to choose good names here. Maybe:
kmalloc_estimate_variable
kmalloc_estimate_fixed
kmem_alloc_estimate
???

mm-reserve.patch

I'm confused by __mem_reserve_add.

+ reserve = mem_reserve_root.pages;
+ __calc_reserve(res, pages, 0);
+ reserve = mem_reserve_root.pages - reserve;

__calc_reserve will always add 'pages' to mem_reserve_root.pages.
So this is a complex way of doing
reserve = pages;
__calc_reserve(res, pages, 0);

And as you can calculate reserve before calling __calc_reserve
(which seems odd when stated that way), the whole function looks
like it could become:

ret = adjust_memalloc_reserve(pages);
if (!ret)
	__calc_reserve(res, pages, limit);
return ret;

What am I missing?

Also, mem_reserve_disconnect really should be a "void" function.
Just put a BUG_ON(ret) and don't return anything.

Finally, I'll just repeat that the purpose of the tree structure
eludes me.

net-sk_allocation.patch

Why are the "GFP_KERNEL" call sites just changed to
"sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??

I assume there is a good reason, and seeing it in the change log
would educate me and make the patch more obviously correct.

netvm-reserve.patch

Function names again:

sk_adjust_memalloc
sk_set_memalloc

sound similar. Purpose is completely different.

mm-page_file_methods.patch

This makes page_offset and others more expensive by adding a
conditional jump to a function call that is not usually made.

Why do swap pages have a different index to everyone else?

nfs-swap_ops.patch

What happens if you have two swap files on the same NFS
filesystem?
I assume ->swapfile gets called twice. But it hasn't been written
to nest, so the first swapoff will disable swapping for both
files??

That's all for now :-)

NeilBrown

2008-02-26 10:51:17

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16

Hi Neil,


On Tue, 2008-02-26 at 17:03 +1100, Neil Brown wrote:
> On Saturday February 23, [email protected] wrote:

> > What is the NFS and net people's take on all of this?
>
> Well I'm only vaguely an NFS person, barely a net person, sporadically
> an mm person, but I've had a look and it seems to mostly make sense.

Thanks for taking a look, and giving such elaborate feedback. I'll try
and address these issues asap, but first let me reply to a few points
here.

> We introduce a new "emergency" concept for page allocation.
> The size of the emergency pool is set by various reservations by
> different potential users.
> If the number of free pages is below the "emergency" size, then only
> users with a "MEMALLOC" flag get to allocate pages. Further, those
> pages get a "reserve" flag set which propagates into slab/slub so
> kmalloc/kmem_cache_alloc only return memory from those pages to MEMALLOC
> users.
> MEMALLOC users are those that set PF_MEMALLOC. A socket can get
> SOCK_MEMALLOC set which will cause certain pieces of code to
> temporarily set PF_MEMALLOC while working on that socket.

Small detail: there is also __GFP_MEMALLOC, which is used for single
allocations to avoid setting and unsetting PF_MEMALLOC - like in the skb
alloc, once we have determined we would otherwise fail and still have room.

> The upshot is that, provided any MEMALLOC user reserves an appropriate
> amount of emergency space, returns the emergency memory promptly, and
> sets PF_MEMALLOC whenever allocating memory, its memory allocations
> should never fail.
>
> As memory is requested in small units, but allocated as pages, there
> needs to be a conversion from small units to pages. One of the
> patches does this and appears to err on the side of being over-generous,
> which is the right thing to do.
>
>
> Memory reservations are organised in a tree. I really don't
> understand the tree. Is it just to make /proc/reserve_info look more
> helpful?
> Certainly all the individual reservations need to be recorded, and the
> cumulative reservation needs also to be recorded (currently in the
> root of the tree) but what are all the other levels used for?

Ah, there is a little trick there, I hint at that in the reserve.c
description comment:

+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent with
+ * who is charged, resource usage is not propagated up the tree (for
+ * performance reasons).

And I actually use that. If we show a little of the tree (which Andrew
rightly dislikes for not being machine-parseable - will fix):

+ * localhost ~ # cat /proc/reserve_info
+ * total reserve 8156K (0/544817)
+ * total network reserve 8156K (0/544817)
+ * network TX reserve 196K (0/49)
+ * protocol TX pages 196K (0/49)
+ * network RX reserve 7960K (0/544768)
+ * IPv6 route cache 1372K (0/4096)
+ * IPv4 route cache 5468K (0/16384)
+ * SKB data reserve 1120K (0/524288)
+ * IPv6 fragment cache 560K (0/262144)
+ * IPv4 fragment cache 560K (0/262144)

We see that the 'SKB data reserve' is built up from the IPv4 and IPv6
fragment cache reserves.

I use the 'SKB data reserve' to charge memory against and account usage,
but use its children to grow/shrink the actual reserve.

This allows you to see the individual reserves, but still use an
aggregate.

The tree form is the simplest structure that allowed such things; another
nice thing is that you can easily detach whole sub-trees to stop actually
reserving the memory, while continuing to track their potential needs.

This is done when there are no SOCK_MEMALLOC sockets around. The 'total
network reserve' is detached, reducing the 'total reserve' to 0
(assuming no other reserve trees) but the individual reserves are still
tracking their potential need for when it will be re-attached.

With only a single user this might seem like a little too much, but I have
hopes for more users.

> Reservations are used for all the transient memory that might be used
> by the network stack. This particularly includes the route cache and
> skbs for incoming messages. I have no idea if there is anything else
> that needs to be allowed for.

This is something I'd like feedback on from the network gurus. In my
reading there weren't many other allocation sites, but hey, I'm not much
of a net person myself. (I did write some instrumentation to track
allocations, but I'm sure I didn't get full coverage of the stack with
my simple usage.)

> Filesystems can advertise (via address_space_operations) that files
> may be used as swap files. They then provide swapout/swapin methods
> which are like writepage/readpage but may behave differently and have
> a different way to get credentials from a 'struct file'.

Yes, the added benefit is that even regular blockdev filesystem swap
files could move to this interface and we'd finally be able to remove
->bmap().

> So in general, the patch set looks to have the right sort of shape. I
> cannot be very authoritative on the details as there are a lot of
> them, and they touch code that I'm not very familiar with.
>
> Some specific comments on patches:
>
>
> reserve-slub.patch
>
> Please avoid irrelevant reformatting in patches. It makes them
> harder to read. e.g.:
>
> -static void setup_object(struct kmem_cache *s, struct page *page,
> - void *object)
> +static void setup_object(struct kmem_cache *s, struct page *page, void *object)

Right, I'll split out the cleanups and send those in separately.

> mm-kmem_estimate_pages.patch
>
> This introduces
> kestimate
> kestimate_single
> kmem_estimate_pages
>
> The last obviously returns a number of pages. The contrast seems
> to suggest the others don't. But they do...
> I don't think the names are very good, but I concede that it is
> hard to choose good names here. Maybe:
> kmalloc_estimate_variable
> kmalloc_estimate_fixed
> kmem_alloc_estimate
> ???

You caught me here (and further on), I'm one of those who needs a little
help when it comes to names :-). I'll try and improve the ones you
pointed out.

> mm-reserve.patch
>
> I'm confused by __mem_reserve_add.
>
> + reserve = mem_reserve_root.pages;
> + __calc_reserve(res, pages, 0);
> + reserve = mem_reserve_root.pages - reserve;
>
> __calc_reserve will always add 'pages' to mem_reserve_root.pages.
> So this is a complex way of doing
> reserve = pages;
> __calc_reserve(res, pages, 0);
>
> And as you can calculate reserve before calling __calc_reserve
> (which seems odd when stated that way), the whole function looks
> like it could become:
>
> ret = adjust_memalloc_reserve(pages);
> if (!ret)
> 	__calc_reserve(res, pages, limit);
> return ret;
>
> What am I missing?

Probably the horrible twist my brain has. Looking at it makes me doubt
my own sanity. I think you're right - it would also clean up
__calc_reserve() a little.

This is what review is for :-)

> Also, mem_reserve_disconnect really should be a "void" function.
> Just put a BUG_ON(ret) and don't return anything.

Agreed, I was being over cautious here. I'll WARN_ON, as Andrew is
scared of BUGs :-)

> Finally, I'll just repeat that the purpose of the tree structure
> eludes me.

I hope to have cleared that up. It came from the desire of my users
(there are quite a few out there who use some form of this code) to see
what the reserve is made up of - to see what tunables to use, and their
effect.

The tree form was the easiest that allowed me to keep individual
reserves and work with aggregates. (Must confess, some nodes are purely
decoration).

> net-sk_allocation.patch
>
> Why are the "GFP_KERNEL" call sites just changed to
> "sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??
>
> I assume there is a good reason, and seeing it in the change log
> would educate me and make the patch more obviously correct.

Good point. I think it's because of legacy (with this patch-set) reasons.
It's not needed for my use of sk_allocation(), but that gives me no right
to deny other people more creative uses of this construct. I'll rectify
this.

> netvm-reserve.patch
>
> Function names again:
>
> sk_adjust_memalloc
> sk_set_memalloc
>
> sound similar. Purpose is completely different.

The naming thing,... I'll try and come up with better ones.

> mm-page_file_methods.patch
>
> This makes page_offset and others more expensive by adding a
> conditional jump to a function call that is not usually made.
>
> Why do swap pages have a different index to everyone else?

Because the page->index of an anonymous page is related to its (anon)vma
so that it satisfies the constraints for vm_normal_page().

The index in the swap file is totally unrelated and quite random. Hence
the swap-cache uses page->private to store it.

Moving these functions inline (esp __page_file_index seems doable)
results in a horrible include hell.
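Roughly, simplified from the patch (the details of the swp_entry_t
handling may differ):

	pgoff_t __page_file_index(struct page *page)
	{
		swp_entry_t swap = { .val = page_private(page) };

		/* swapcache pages keep their swapfile offset in
		 * page->private, not page->index */
		VM_BUG_ON(!PageSwapCache(page));
		return swp_offset(swap);
	}

	static inline pgoff_t page_file_index(struct page *page)
	{
		if (unlikely(PageSwapCache(page)))
			return __page_file_index(page);

		return page->index;
	}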

> nfs-swap_ops.patch
>
> What happens if you have two swap files on the same NFS
> filesystem?
> I assume ->swapfile gets called twice. But it hasn't been written
> to nest, so the first swapoff will disable swapping for both
> files??

Hmm,.. you are quite right (again). I failed to consider this. Not sure
how to rectify this as xprt->swapper is but a single bit.

I'll think about this.

Thanks!

2008-02-26 12:01:12

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Tue, 2008-02-26 at 11:50 +0100, Peter Zijlstra wrote:

> > mm-reserve.patch
> >
> > I'm confused by __mem_reserve_add.
> >
> > + reserve = mem_reserve_root.pages;
> > + __calc_reserve(res, pages, 0);
> > + reserve = mem_reserve_root.pages - reserve;
> >
> > __calc_reserve will always add 'pages' to mem_reserve_root.pages.
> > So this is a complex way of doing
> > reserve = pages;
> > __calc_reserve(res, pages, 0);
> >
> > And as you can calculate reserve before calling __calc_reserve
> > (which seems odd when stated that way), the whole function looks
> > like it could become:
> >
> > ret = adjust_memalloc_reserve(pages);
> > if (!ret)
> > 	__calc_reserve(res, pages, limit);
> > return ret;
> >
> > What am I missing?
>
> Probably the horrible twist my brain has. Looking at it makes me doubt
> my own sanity. I think you're right - it would also clean up
> __calc_reserve() a little.
>
> This is what review is for :-)

Ah, you confused me. Well, I confused me - this does deserve a comment,
it's tricksy.

It's correct. The trick is, the mem_reserve in question (res) need not be
connected to mem_reserve_root.

In that case, mem_reserve_root.pages will not change, but we do
propagate the change as far up as possible, so that
mem_reserve_connect() can just observe the parent and child without
being bothered by the rest of the hierarchy.


2008-02-26 15:30:30

by Miklos Szeredi

Subject: Re: [PATCH 00/28] Swap over NFS -v16

> > mm-page_file_methods.patch
> >
> > This makes page_offset and others more expensive by adding a
> > conditional jump to a function call that is not usually made.
> >
> > Why do swap pages have a different index to everyone else?
>
> Because the page->index of an anonymous page is related to its (anon)vma
> so that it satisfies the constraints for vm_normal_page().
>
> The index in the swap file is totally unrelated and quite random. Hence
> the swap-cache uses page->private to store it.

Yeah, and putting the condition into page_offset() will confuse code
which uses it for finding the offset in the VMA or in a tmpfs file.

So why not just have a separate page_swap_offset() function, used
exclusively by swap_in/out()?

Miklos

2008-02-26 15:42:32

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Tue, 2008-02-26 at 16:29 +0100, Miklos Szeredi wrote:
> > > mm-page_file_methods.patch
> > >
> > > This makes page_offset and others more expensive by adding a
> > > conditional jump to a function call that is not usually made.
> > >
> > > Why do swap pages have a different index to everyone else?
> >
> > Because the page->index of an anonymous page is related to its (anon)vma
> > so that it satisfies the constraints for vm_normal_page().
> >
> > The index in the swap file is totally unrelated and quite random. Hence
> > the swap-cache uses page->private to store it.
>
> Yeah, and putting the condition into page_offset() will confuse code
> which uses it for finding the offset in the VMA

Right, do we do that anywhere?

> or in a tmpfs file.

Good point. I really should go read tmpfs some day; it's really a blind
spot for me.

> So why not just have a separate page_swap_offset() function, used
> exclusively by swap_in/out()?

That would require duplicating quite a lot of NFS code from what I can
see.

2008-02-26 15:44:36

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Tue, 2008-02-26 at 16:29 +0100, Miklos Szeredi wrote:
> > > mm-page_file_methods.patch
> > >
> > > This makes page_offset and others more expensive by adding a
> > > conditional jump to a function call that is not usually made.
> > >
> > > Why do swap pages have a different index to everyone else?
> >
> > Because the page->index of an anonymous page is related to its (anon)vma
> > so that it satisfies the constraints for vm_normal_page().
> >
> > The index in the swap file is totally unrelated and quite random. Hence
> > the swap-cache uses page->private to store it.
>
> Yeah, and putting the condition into page_offset() will confuse code
> which uses it for finding the offset in the VMA or in a tmpfs file.
>
> So why not just have a separate page_swap_offset() function, used
> exclusively by swap_in/out()?

Ah, we can add page_file_offset() to match page_file_index() and
page_file_mapping(), and convert NFS to use page_file_offset() where
appropriate, as I already did for the others.

That would sort out the mess, right?
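Something like this, that is [ sketch; assuming the same shift convention
as page_offset() ]:

	static inline loff_t page_file_offset(struct page *page)
	{
		return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
	}

Identical to page_offset(), except it goes through page_file_index() so
that swapcache pages yield their offset in the swapfile.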

2008-02-26 15:47:59

by Miklos Szeredi

Subject: Re: [PATCH 00/28] Swap over NFS -v16

> > > > mm-page_file_methods.patch
> > > >
> > > > This makes page_offset and others more expensive by adding a
> > > > conditional jump to a function call that is not usually made.
> > > >
> > > > Why do swap pages have a different index to everyone else?
> > >
> > > Because the page->index of an anonymous page is related to its (anon)vma
> > > so that it satisfies the constraints for vm_normal_page().
> > >
> > > The index in the swap file is totally unrelated and quite random. Hence
> > > the swap-cache uses page->private to store it.
> >
> > Yeah, and putting the condition into page_offset() will confuse code
> > which uses it for finding the offset in the VMA or in a tmpfs file.
> >
> > So why not just have a separate page_swap_offset() function, used
> > exclusively by swap_in/out()?
>
> Ah, we can do the page_file_offset() to match page_file_index() and
> page_file_mapping(). And convert NFS to use page_file_offset() where
> appropriate, as I already did for these others.
>
> That would sort out the mess, right?

Yes, that sounds perfect.

Miklos

2008-02-26 17:57:28

by Andrew Morton

Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Tue, 26 Feb 2008 11:50:42 +0100 Peter Zijlstra <[email protected]> wrote:

> On Tue, 2008-02-26 at 17:03 +1100, Neil Brown wrote:
> > On Saturday February 23, [email protected] wrote:
>
> > > What is the NFS and net people's take on all of this?
> >
> > Well I'm only vaguely an NFS person, barely a net person, sporadically
> > an mm person, but I've had a look and it seems to mostly make sense.
>
> Thanks for taking a look, and giving such elaborate feedback. I'll try
> and address these issues asap, but first let me reply to a few points
> here.

Neil's overview of what-all-this-is and how-it-all-works is really good.
I'd suggest that you take it over, flesh it out and attach it firmly to the
patchset. It really helps.

2008-02-27 05:51:23

by NeilBrown

Subject: Re: [PATCH 00/28] Swap over NFS -v16


Hi Peter,

On Tuesday February 26, [email protected] wrote:
> Hi Neil,
>
> Thanks for taking a look, and giving such elaborate feedback. I'll try
> and address these issues asap, but first let me reply to a few points
> here.

Thanks... the tree thing is starting to make sense, and I'm not
confused by __mem_reserve_add any more :-)

I've been having a closer read of some of the code that I skimmed over
before and I have some more questions.

1/ I note there is no way to tell if memory returned by kmalloc is
from the emergency reserve - which contrasts with alloc_page
which does make that information available through page->reserve.
This seems a slightly unfortunate aspect of the interface.

It seems to me that __alloc_skb could be simpler if this
information was available. It currently tries the allocation
normally, then if that fails it retries with __GFP_MEMALLOC set and
if that works it assumes it was from the emergency pool ... which it
might not be, though the assumption is safe enough (see the sketch
after this list).

It would seem to be nicer if you could just make the one alloc call,
setting __GFP_MEMALLOC if that might be appropriate. Then if the
memory came from the emergency reserve, flag it as an emergency skb.

However doing that would have issues with reservations.... the
mem_reserve would need to be passed to kmalloc :-(

2/ It doesn't appear to be possible to wait on a reservation. i.e. if
mem_reserve_*_charge fails, we might sometimes want to wait until
it succeeds. This feature is an integral part of how mempools
work and are used. If you want reservations to (be able to)
replace mempools, then waiting really is needed.

It seems that waiting would work best if it was done quite low down
inside kmalloc. That would require kmalloc to know which
'mem_reserve' you are using, which it currently doesn't.

If it did, then it could choose to use emergency pages if the
mem_reserve allowed it more space, otherwise require a regular page.
And if __GFP_WAIT is set then each time around the loop it could
revise whether to use an emergency page or not, based on whether it
can successfully charge the reservation.

Of course, having a mem_reserve available for PF_MEMALLOC
allocations would require changing every kmalloc call in the
protected region, which might not be ideal, but might not be a
major hassle, and would ensure that you find all kmalloc calls that
might be made while in PF_MEMALLOC state.

3/ Thinking about the tree structure a bit more: Your motivation
seems to be that it allows you to make two separate reservations,
and then charge some memory usage against either one or the other.
This seems to go against a key attribute of reservations. I would
have thought that an important rule about reservations is that
no-one else can use memory reserved for a particular purpose.
So if IPv6 reserves some memory, and IPv4 uses it, that doesn't
seem like something we want to encourage...


4/ __netdev_alloc_page is bothering me a bit.
This is used to allocate memory for incoming fragments in a
(potentially multi-fragment) packet. But we only call rx_emergency_get
for each page as it arrives rather than all at once at the start.
So you could have a situation where two very large packets are
arriving at the same time and there is enough reserve to hold
either of them but not both. The current code will hand out that
reservation a little (well, one page) at a time to each packet and
will run out before either packet has been fully received. This
seems like a bad thing. Is it?

Is it possible to do the rx_emergency_get just once for the whole
packet, I wonder?
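
[ The sketch referred to in 1/ - the current double-allocation dance in
__alloc_skb as I read it; simplified from memory, not the patch verbatim,
and rx_emergency_put() is assumed to be the obvious inverse of
rx_emergency_get():

	data = kmalloc(size, gfp_mask);
	if (!data && rx_emergency_get(size)) {
		/* retry with access to the reserves, and *assume* the
		 * object actually came from there */
		data = kmalloc(size, gfp_mask | __GFP_MEMALLOC);
		if (data)
			emergency = 1;
		else
			rx_emergency_put(size);
	}

If kmalloc could tell us where the memory came from, both the double
allocation and the assumption would disappear. ]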

NeilBrown

2008-02-27 07:59:10

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Wed, 2008-02-27 at 16:51 +1100, Neil Brown wrote:
> Hi Peter,
>
> On Tuesday February 26, [email protected] wrote:
> > Hi Neil,
> >
> > Thanks for taking a look, and giving such elaborate feedback. I'll try
> > and address these issues asap, but first let me reply to a few points
> > here.
>
> Thanks... the tree thing is starting to make sense, and I'm not
> confused my __mem_reserve_add any more :-)
>
> I've been having a closer read of some of the code that I skimmed over
> before and I have some more questions.
>
> 1/ I note there is no way to tell if memory returned by kmalloc is
> from the emergency reserve - which contrasts with alloc_page
> which does make that information available through page->reserve.
> This seems a slightly unfortunate aspect of the interface.

Yes, but alas there is no room to store such information in kmalloc().
That is, in a sane way. I think it was Daniel Phillips who suggested
encoding it in the return pointer by flipping the low bit - but that is
just too ugly and breaks all current kmalloc sites to boot.

> It seems to me that __alloc_skb could be simpler if this
> information was available. It currently tries the allocation
> normally, then if that fails it retries with __GFP_MEMALLOC set and
> if that works it assumes it was from the emergency pool ... which it
> might not be, though the assumption is safe enough.
>
> It would seem to be nicer if you could just make the one alloc call,
> setting __GFP_MEMALLOC if that might be appropriate. Then if the
> memory came from the emergency reserve, flag it as an emergency skb.
>
> However doing that would have issues with reservations.... the
> mem_reserve would need to be passed to kmalloc :-(

Yes, it would require a massive overhaul of quite a few things. I agree,
it would all be nicer, but I think you see why I didn't do it.

> 2/ It doesn't appear to be possible to wait on a reservation. i.e. if
> mem_reserve_*_charge fails, we might sometimes want to wait until
> it succeeds. This feature is an integral part of how mempools
> work and are used. If you want reservations to (be able to)
> replace mempools, then waiting really is needed.
>
> It seems that waiting would work best if it was done quite low down
> inside kmalloc. That would require kmalloc to know which
> 'mem_reserve' you are using, which it currently doesn't.
>
> If it did, then it could choose to use emergency pages if the
> mem_reserve allowed it more space, otherwise require a regular page.
> And if __GFP_WAIT is set then each time around the loop it could
> revise whether to use an emergency page or not, based on whether it
> can successfully charge the reservation.

Like mempools, we could add a wrapper with a mem_reserve and waitqueue
inside, strip __GFP_WAIT, try, see if the reservation allows, and wait
if not.

I haven't yet done such a wrapper because it wasn't needed. But it could
be done.
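
Something along these lines, say [ untested sketch; the wrapper names are
made up on the spot ]:

	struct reserve_pool {
		struct mem_reserve	*res;
		wait_queue_head_t	wait;
	};

	void *reserve_kmalloc(struct reserve_pool *pool, size_t size, gfp_t gfp)
	{
		void *obj;

		/* first attempt: never sleep, never touch the reserves */
		obj = kmalloc(size, gfp & ~__GFP_WAIT);
		if (obj)
			return obj;

		if (!mem_reserve_kmalloc_charge(pool->res, size, 0)) {
			if (!(gfp & __GFP_WAIT))
				return NULL;
			/* wait until enough reserve memory is returned
			 * for the charge to succeed */
			wait_event(pool->wait,
				   mem_reserve_kmalloc_charge(pool->res, size, 0));
		}

		obj = kmalloc(size, (gfp & ~__GFP_WAIT) | __GFP_MEMALLOC);
		if (!obj)
			mem_reserve_kmalloc_charge(pool->res, -size, 0);
		return obj;
	}

The matching reserve_kfree() would un-charge and do a
wake_up(&pool->wait).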

> Of course, having a mem_reserve available for PF_MEMALLOC
> allocations would require changing every kmalloc call in the
> protected region, which might not be ideal, but might not be a
> major hassle, and would ensure that you find all kmalloc calls that
> might be made while in PF_MEMALLOC state.

I assumed the current PF_MEMALLOC usage was good enough for the current
reserves - not quite true as it's potentially unlimited, but it seems to
work in practice.

I did try to find all allocation sites in the paths I enabled
PF_MEMALLOC over.

> 3/ Thinking about the tree structure a bit more: Your motivation
> seems to be that it allows you to make two separate reservations,
> and then charge some memory usage against either one or the other.
> This seems to go against a key attribute of reservations. I would
> have thought that an important rule about reservations is that
> no-one else can use memory reserved for a particular purpose.
> So if IPv6 reserves some memory, and IPv4 uses it, that doesn't
> seem like something we want to encourage...

Well, we only have one kind of skb; a network packet doesn't know if it
belongs to IPv4 or IPv6 (or a whole different address family) when
it comes in. So we grow the skb pool to overflow both defragment caches.

But yeah, it's something where you need to know what you're doing - as
with so many other things in the kernel, hence I didn't worry too much.

> 4/ __netdev_alloc_page is bothering me a bit.
> This is used to allocate memory for incoming fragments in a
> (potentially multi-fragment) packet. But we only call rx_emergency_get
> for each page as it arrives rather than all at once at the start.
> So you could have a situation where two very large packets are
> arriving at the same time and there is enough reserve to hold
> either of them but not both. The current code will hand out that
> reservation a little (well, one page) at a time to each packet and
> will run out before either packet has been fully received. This
> seems like a bad thing. Is it?
>
> Is it possible to do the rx_emergency_get just once for the whole
> packet, I wonder?

I honestly don't know enough about network cards and drivers to answer
this. It was a great feat that I managed this much :-)

2008-02-27 08:05:42

by Pekka Enberg

Subject: Re: [PATCH 00/28] Swap over NFS -v16

Hi Peter,

On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <[email protected]> wrote:
> > 1/ I note there is no way to tell if memory returned by kmalloc is
> > from the emergency reserve - which contrasts with alloc_page
> > which does make that information available through page->reserve.
> > This seems a slightly unfortunate aspect of the interface.
>
> Yes, but alas there is no room to store such information in kmalloc().
> That is, in a sane way. I think it was Daniel Phillips who suggested
> encoding it in the return pointer by flipping the low bit - but that is
> just too ugly and breaks all current kmalloc sites to boot.

Why can't you add a kmem_is_emergency() to SLUB that looks up the
cache/slab/page (whatever is the smallest unit of the emergency pool
here) for the object and use that?

2008-02-27 08:14:45

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> Hi Peter,
>
> On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <[email protected]> wrote:
> > > 1/ I note there is no way to tell if memory returned by kmalloc is
> > > from the emergency reserve - which contrasts with alloc_page
> > > which does make that information available through page->reserve.
> > > This seems a slightly unfortunate aspect of the interface.
> >
> > Yes, but alas there is no room to store such information in kmalloc().
> > That is, in a sane way. I think it was Daniel Phillips who suggested
> > encoding it in the return pointer by flipping the low bit - but that is
> > just too ugly and breaks all current kmalloc sites to boot.
>
> Why can't you add a kmem_is_emergency() to SLUB that looks up the
> cache/slab/page (whatever is the smallest unit of the emergency pool
> here) for the object and use that?

There is an idea.. :-) It would mean preserving page->reserve, but SLUB
has plenty of page flags to pick from. Or maybe I should move the thing
to a page flag anyway. If we do that, SLAB would allow something similar:
just look up the page for whatever address you get and look at PG_emerg
or something.

Having this would clean things up. I'll go work on this.

2008-02-27 08:34:31

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Wed, 2008-02-27 at 09:14 +0100, Peter Zijlstra wrote:
> On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> > Hi Peter,
> >
> > On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <[email protected]> wrote:
> > > > 1/ I note there is no way to tell if memory returned by kmalloc is
> > > > from the emergency reserve - which contrasts with alloc_page
> > > > which does make that information available through page->reserve.
> > > > This seems a slightly unfortunate aspect of the interface.
> > >
> > > Yes, but alas there is no room to store such information in kmalloc().
> > > That is, in a sane way. I think it was Daniel Phillips who suggested
> > > encoding it in the return pointer by flipping the low bit - but that is
> > > just too ugly and breaks all current kmalloc sites to boot.
> >
> > Why can't you add a kmem_is_emergency() to SLUB that looks up the
> > cache/slab/page (whatever is the smallest unit of the emergency pool
> > here) for the object and use that?
>
> There is an idea.. :-) It would mean preserving page->reserve, but SLUB
> has plenty of page flags to pick from. Or maybe I should move the thing
> to a page flag anyway. If we do that, SLAB would allow something similar:
> just look up the page for whatever address you get and look at PG_emerg
> or something.
>
> Having this would clean things up. I'll go work on this.

Humm, and here I sit staring at the screen. Perhaps I should go get my
morning juice, but...

if (mem_reserve_kmalloc_charge(my_res, sizeof(*foo), 0)) {
	foo = kmalloc(sizeof(*foo), gfp | __GFP_MEMALLOC);
	if (!kmem_is_emergency(foo))
		mem_reserve_kmalloc_charge(my_res, -sizeof(*foo), 0);
} else
	foo = kmalloc(sizeof(*foo), gfp);

Just doesn't look too pretty..

And needing to always account the allocation seems wrong.. but I'll take
poison and see if that wakes up my mind.


2008-02-27 08:43:45

by Pekka Enberg

Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Wed, 27 Feb 2008, Peter Zijlstra wrote:
> Humm, and here I sit staring at the screen. Perhaps I should go get my
> morning juice, but...
>
> if (mem_reserve_kmalloc_charge(my_res, sizeof(*foo), 0)) {
> 	foo = kmalloc(sizeof(*foo), gfp | __GFP_MEMALLOC);
> 	if (!kmem_is_emergency(foo))
> 		mem_reserve_kmalloc_charge(my_res, -sizeof(*foo), 0);
> } else
> 	foo = kmalloc(sizeof(*foo), gfp);
>
> Just doesn't look too pretty..
>
> And needing to always account the allocation seems wrong.. but I'll take
> poison and see if that wakes up my mind.

Hmm, perhaps this is just hand-waving but why don't you have a
kmalloc_reserve() function in SLUB that does the accounting properly?

Pekka

2008-02-29 01:29:27

by NeilBrown

Subject: Re: [PATCH 00/28] Swap over NFS -v16


So I've been pondering all this some more trying to find the pattern,
and things are beginning to crystalise (I hope).

One of the approaches I have been taking is to compare it to mempools
(which I think I understand) and work out what the important
differences are.

One difference is that you don't wait for memory to become available
(as I mentioned earlier). Rather you just try to get the memory and
if it isn't available, you drop the packet. This probably makes sense
for incoming packets as you rely on the packet being re-sent, and
hopefully various back-off algorithms will slow things down a bit so
that there is a good chance that memory will be available next time...

For outgoing messages I'm less clear on exactly what is going on.
Maybe I haven't looked at that code properly yet, but I would expect
there would be a place for waiting for memory to become available
somewhere in the out-going path ??

But there is another important difference to mempools which I think is
worth exploring. With mempools, you are certain that the memory will
only be used to make forward progress in writing out dirty data. So
if you find that there isn't enough memory at the moment and you have
to wait, you can be sure someone else is making forward progress and
so waiting isn't such a bad thing.

With your reservations it isn't quite the same. Reserved memory can
be used for related purposes. In particular, any incoming packet can
use some reserved memory. Once the purpose of that packet is
discovered (i.e. the matching socket is found), the memory will be
freed again. But there is a period of time when memory is being used
for an inappropriate purpose. The consequences of this should be
clearly understood.

In particular, the memory that is reserved for the emergency pool
should include some overhead to acknowledge the fact that memory
might be used for short periods of time for unrelated purposes.

I think we can fit this acknowledgement into the current model quite
easily, and it makes the tree structure suddenly make lots of sense
(whereas before I was still struggling with it).

A key observation in this design is "Sometimes we need to allocate
emergency memory without knowing exactly what it is going to be used
for". I think we should make that explicit in the implementation as
follows:

We have a tree of reservations (as you already do) where levels in
the tree correspond to more explicit knowledge of how the memory
will be used.
At the top level there is a generic 'page' reservation. Below that
to one side we have a 'SLUB/SLAB' reservation. I'm not sure yet
exactly what that will look like.
Also below the 'page' reservation is a reservation for pages to hold
incoming network fragments.
Below the SLxB reservation is a reservation for skbs, which is
parent to a reservation for IPv4 skbs and another for IPv6 skbs.

Each of these nodes has its own independent reservation - parents are
not simply the sum of the children.
The sum over the whole tree is given to the VM as the size of the
emergency pool to reserve for emergency allocations.

Now, every actual allocation from the emergency pool effectively comes
in at the top of the tree and moves down as its purpose is more fully
understood. Every emergency allocation is *always* charged to one
node in the tree, though which node may change.

e.g.
A network driver asks for a page to store a fragment.
netdev_alloc_page calls alloc_page with __GFP_MEMALLOC set.
If alloc_page needs to dive into the emergency pool, it first
charges the one page against the root for the reservation tree.
If this succeeds, it returns the page with ->reserve set. If the
reservation fails, it ignores the __GFP_MEMALLOC and fails.
netdev_alloc_page notices that the page is a ->reserve page, and
knows that it has been charged to the top 'page' reservation, but it
should be charged to the network-page reservation. So it tries to
charge against the network-pages reservation, and reverses the
charge against 'pages'. If the network-pages reservation fails, the
page is freed and netdev_alloc_page fails.
As you can see, the charge moves down the tree as more information
becomes available.

Similarly a charge might move from 'pages' to 'SLxB' to 'net_skb' to
'ipv4_skb'.
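
[ In code, the netdev_alloc_page step might look like the following -
purely illustrative, mem_reserve_pages_charge() and the reserve names
are invented:

	struct page *__netdev_alloc_page(gfp_t gfp)
	{
		struct page *page = alloc_page(gfp | __GFP_MEMALLOC);

		if (page && page->reserve) {
			/* alloc_page() charged the root 'pages' node;
			 * move the charge down to the network-pages node */
			if (!mem_reserve_pages_charge(&net_rx_reserve, 1)) {
				/* freeing reverses the root charge */
				__free_page(page);
				return NULL;
			}
			mem_reserve_pages_charge(&page_reserve, -1);
		}
		return page;
	}
]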

At the bottom levels, the reservation says how much memory is
needed for that particular usage to be able to make sensible forward
progress.
At the higher levels, the reservation says how much overhead we need
to allow to ensure that transient invalid uses don't unduly limit
available emergency memory. As pages are likely to be immediately
re-charged lower down the tree, the reservation at the top level
would probably be proportional to the number of CPUs (probably one
page per CPU would be perfect). Lower down, different calculations
might suggest different intermediate reservations.

Of course, these things don't need to be explicitly structured as a
tree. There is no need for 'parent' or 'sibling' pointers. The code
implicitly knows where to move charges from and to.
You still need an explicit structure to allow groups of reservations
that are activated or de-activated as a whole. That can use your
current tree structure, or whatever else turns out to make sense.

This model, I think, captures the important "allocate before charging"
aspect of reservations that you need (particularly for incoming
network packets) and it makes that rule apply throughout the different
stages that an allocated chunk of memory goes through.

With this model, alloc_page could fail more often, as it now also
fails if the top level reservation is exhausted. This may seem
un-necessary, but I think it could be a good thing. It means that at
very busy times (when lots of requests are needing emergency memory)
we drop requests randomly and very early. If we are going to drop a
request eventually, dropping it early means we waste less time on it
which is probably a good thing.


So: Does this model help others with understanding how the
reservations work, or am I just over-engineering?

NeilBrown

2008-02-29 10:22:15

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Fri, 2008-02-29 at 12:29 +1100, Neil Brown wrote:
> So I've been pondering all this some more trying to find the pattern,
> and things are beginning to crystalise (I hope).
>
> One of the approaches I have been taking is to compare it to mempools
> (which I think I understand) and work out what the important
> differences are.
>
> One difference is that you don't wait for memory to become available
> (as I mentioned earlier). Rather you just try to get the memory and
> if it isn't available, you drop the packet. This probably makes sense
> for incoming packets as you rely on the packet being re-sent, and
> hopefully various back-off algorithms will slow things down a bit so
> that there is a good chance that memory will be available next time...
>
> For outgoing messages I'm less clear on exactly what is going on.
> Maybe I haven't looked at that code properly yet, but I would expect
> there would be a place for waiting for memory to become available
> somewhere in the out-going path ??

The tx path is a bit fuzzy. I assume it has an upper limit, take a stab
at that upper limit and leave it at that.

It should be full of holes, and there is some work on writeout
throttling to fill some of them - but I haven't seen any lockups in this
area for a long long while.

> But there is another important difference to mempools which I think is
> worth exploring. With mempools, you are certain that the memory will
> only be used to make forward progress in writing out dirty data. So
> if you find that there isn't enough memory at the moment and you have
> to wait, you can be sure someone else is making forward progress and
> so waiting isn't such a bad thing.
>
> With your reservations it isn't quite the same. Reserved memory can
> be used for related purposes. In particular, any incoming packet can
> use some reserved memory. Once the purpose of that packet is
> discovered (i.e. the matching socket is found), the memory will be
> freed again. But there is a period of time when memory is being used
> for an inappropriate purpose. The consequences of this should be
> clearly understood.

IIRC the route-cache is in this state. Entries there can be added before
we can decide to keep or toss the packet. So we reserve enough memory to
overflow the route-cache (route-cache reclaim keeps it in bounds).

> In particular, the memory that is reserved for the emergency pool
> should include some overhead to acknowledge the fact that memory
> might be used for short periods of time for unrelated purposes.
>
> I think we can fit this acknowledgement into the current model quite
> easily, and it makes the tree structure suddenly make lots of sense
> (whereas before I was still struggling with it).
>
> A key observation in this design is "Sometimes we need to allocate
> emergency memory without knowing exactly what it is going to be used
> for". I think we should make that explicit in the implementation as
> follows:
>
> We have a tree of reservations (as you already do) where levels in
> the tree correspond to more explicit knowledge of how the memory
> will be used.
> At the top level there is a generic 'page' reservation. Below that
> to one side we have a 'SLUB/SLAB' reservation. I'm not sure yet
> exactly what that will look like.
> Also below the 'page' reservation is a reservation for pages to hold
> incoming network fragments.
> Below the SLxB reservation is a reservation for skbs, which is
> parent to a reservation for IPv4 skbs and another for IPv6 skbs.
>
> Each of these nodes has its own independent reservation - parents are
> not simply the sum of the children.
> The sum over the whole tree is given to the VM as the size of the
> emergency pool to reserve for emergency allocations.
>
> Now, every actual allocation from the emergency pool effectively comes
> in at the top of the tree and moves down as its purpose is more fully
> understood. Every emergency allocation is *always* charged to one
> node in the tree, though which node may change.
>
> e.g.
> A network driver asks for a page to store a fragment.
> netdev_alloc_page calls alloc_page with __GFP_MEMALLOC set.
> If alloc_page needs to dive into the emergency pool, it first
> charges the one page against the root for the reservation tree.
> If this succeeds, it returns the page with ->reserve set. If the
> reservation fails, it ignores the __GFP_MEMALLOC and fails.
> netdev_alloc_page notices that the page is a ->reserve page, and
> knows that it has been charged to the top 'page' reservation, but it
> should be charged to the network-page reservation. So it tries to
> charge against the network-pages reservation, and reverses the
> charge against 'pages'. If the network-pages reservation fails, the
> page is freed and netdev_alloc_page fails.
> As you can see, the charge moves down the tree as more information
> becomes available.
>
> Similarly a charge might move from 'pages' to 'SLxB' to 'net_skb' to
> 'ipv4_skb'.
>
> At the bottom levels, the reservation says how much memory is
> needed for that particular usage to be able to make sensible forward
> progress.
> At the higher levels, the reservation says how much overhead we need
> to allow to ensure that transient invalid uses don't unduly limit
> available emergency memory. As pages are likely to be immediately
> re-charged lower down the tree, the reservation at the top level
> would probably be proportional to the number of CPUs (probably one
> page per CPU would be perfect). Lower down, different calculations
> might suggest different intermediate reservations.
>
> Of course, these things don't need to be explicitly structured as a
> tree. There is no need for 'parent' or 'sibling' pointers. The code
> implicitly knows where to move charges from and to.
> You still need an explicit structure to allow groups of reservations
> that are activated or de-activated as a whole. That can use your
> current tree structure, or whatever else turns out to make sense.
>
> This model, I think, captures the important "allocate before charging"
> aspect of reservations that you need (particularly for incoming
> network packets) and it makes that rule apply throughout the different
> stages that an allocated chunk of memory goes through.

I'm a bit confused here; the only way to keep the allocations bounded is
by accounting before allocation (well, the other way is to bound the
number of concurrent allocations).

Also, I try not to account when not needed, like with the route-cache.
We already know it has bounded memory usage because it maintains that
itself. So by just supplying enough memory to overflow the thing you're
home safe.

While the model of moving the accounting down might work, I think it is
not needed. We don't need to know if it's IPv4 or IPv6 or yet another
protocol, as long as we have enough skb room to overflow whatever caches
are in between incoming packets and socket de-multiplex.

> With this model, alloc_page could fail more often, as it now also
> fails if the top level reservation is exhausted. This may seem
> unnecessary, but I think it could be a good thing. It means that at
> very busy times (when lots of requests are needing emergency memory)
> we drop requests randomly and very early. If we are going to drop a
> request eventually, dropping it early means we waste less time on it
> which is probably a good thing.

But, might you not be dropping the few packets we do want, early as
well?

> So: Does this model help others with understanding how the
> reservations work, or am I just over-engineering?

Sounds like a bit of overkill to me.

2008-02-29 11:52:23

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> Hi Peter,
>
> On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <[email protected]> wrote:
> > > 1/ I note there is no way to tell if memory returned by kmalloc is
> > > from the emergency reserve - which contrasts with alloc_page
> > > which does make that information available through page->reserve.
> > > This seems a slightly unfortunate aspect of the interface.
> >
> > Yes, but alas there is no room to store such information in kmalloc().
> > That is, in a sane way. I think it was Daniel Phillips who suggested
> > encoding it in the return pointer by flipping the low bit - but that is
> > just too ugly and breaks all current kmalloc sites to boot.
>
> Why can't you add a kmem_is_emergency() to SLUB that looks up the
> cache/slab/page (whatever is the smallest unit of the emergency pool
> here) for the object and use that?

I made page->reserve into PG_emergency and made that bit stick for the
lifetime of that page allocation. I then made kmem_is_emergency() look
up the head page backing that allocation's slab and return
PageEmergency().

This gives a consistent kmem_is_emergency() - that is if during the
lifetime of the kmem allocation it returns true once, it must return
true always.

You can then, using this properly, push the accounting into
kmalloc_reserve() and kfree_reserve() (and
kmem_cache_{alloc,free}_reserve).

Which yields very pretty code all round. (I can make it public if you'd
like to see it.)
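
For flavour, the alloc side then reduces to roughly [ from memory, the
details may differ ]:

	void *kmalloc_reserve(size_t size, gfp_t flags, struct mem_reserve *res)
	{
		void *obj;

		/* fast path - no reserves needed */
		obj = kmalloc(size, flags);
		if (obj)
			return obj;

		if (!mem_reserve_kmalloc_charge(res, size, 0))
			return NULL;

		obj = kmalloc(size, flags | __GFP_MEMALLOC);
		if (!obj || !kmem_is_emergency(obj))
			/* no reserve memory was used after all */
			mem_reserve_kmalloc_charge(res, -size, 0);

		return obj;
	}

with kfree_reserve() symmetrically un-charging iff kmem_is_emergency().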

However...

This is a stricter model than I had before, and has one ramification I'm
not entirely sure I like.

It means the page remains a reserve page throughout its lifetime, which
means the slab remains a reserve slab throughout its lifetime. Therefore
it may never be used for !reserve allocations. Which in turn generates
complexities for the partial list.

In my previous model I had the reserve accounting external to the
allocation, which relaxed the strict need for consistency here, and I
dropped the reserve status once we were above the page limits again.

I managed to complicate the SLUB patch with this extra constraint, by
checking reserve against PageEmergency() when scanning the partial list,
but gave up on SLAB.

Does this sound like something I should pursue? I feel it might
complicate the slab allocators too much..

2008-02-29 11:59:13

by Pekka Enberg

Subject: Re: [PATCH 00/28] Swap over NFS -v16

Hi Peter,

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> I made page->reserve into PG_emergency and made that bit stick for the
> lifetime of that page allocation. I then made kmem_is_emergency() look
> up the head page backing that allocation's slab and return
> PageEmergency().

[snip]

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> This is a stricter model than I had before, and has one ramification I'm
> not entirely sure I like.
>
> It means the page remains a reserve page throughout its lifetime, which
> means the slab remains a reserve slab throughout its lifetime. Therefore
> it may never be used for !reserve allocations. Which in turn generates
> complexities for the partial list.

Hmm, so why don't we just clear the PG_emergency flag and
allocate a fresh new page to the reserves?

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> Does this sound like something I should pursue? I feel it might
> complicate the slab allocators too much...

I can't answer that question until I see the code ;-). But overall, I
think it's better to put that code in SLUB rather than trying to work
around it elsewhere. The fact is, as soon as you have some sort of
reservation for _objects_, you need help from the SLUB allocator.

Pekka

2008-02-29 12:20:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Fri, 2008-02-29 at 13:58 +0200, Pekka Enberg wrote:
> Hi Peter,
>
> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> > I made page->reserve into PG_emergency and made that bit stick for the
> > lifetime of that page allocation. I then made kmem_is_emergency() look
> > up the head page backing that allocation's slab and return
> > PageEmergency().
>
> [snip]
>
> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> > This is a stricter model than I had before, and has one ramification I'm
> > not entirely sure I like.
> >
> > It means the page remains a reserve page throughout its lifetime, which
> > means the slab remains a reserve slab throughout its lifetime. Therefore
> > it may never be used for !reserve allocations. Which in turn generates
> > complexities for the partial list.
>
> Hmm, so why don't we then clear the PG_emergency flag

Clearing PG_emergency would mean kmem_is_emergency() would return false
in kfree_reserve() and fail to un-charge the object.

Previously objects would track their account status themselves (when
needed) and freeing PG_emergency wouldn't be a problem.

> and allocate a fresh page to the reserves?

Not sure I understand this properly. We would only do this once the page
watermarks are high enough, so the reserves are full again.

> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> > Does this sound like something I should pursue? I feel it might
> > complicate the slab allocators too much...
>
> I can't answer that question until I see the code ;-). But overall, I
> think it's better to put that code in SLUB rather than trying to work
> around it elsewhere. The fact is, as soon as you have some sort of
> reservation for _objects_, you need help from the SLUB allocator.

Well, I agree that consolidating it makes sense. And like I said,
it gives pretty code. However, it also puts the burden of this feature
on everyone and might affect performance - it's only the slow path,
but still.

2008-02-29 12:29:42

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Fri, Feb 29, 2008 at 2:18 PM, Peter Zijlstra <[email protected]> wrote:
> Clearing PG_emergency would mean kmem_is_emergency() would return false
> in kfree_reserve() and fail to un-charge the object.
>
> Previously objects would track their account status themselves (when
> needed) and freeing PG_emergency wouldn't be a problem.
>
> > and allocate a fresh page to the reserves?
>
> Not sure I understand this properly. We would only do this once the page
> watermarks are high enough, so the reserves are full again.

The problem with PG_emergency is that, once the watermarks are high
again, SLUB keeps holding on to the emergency page and it cannot be used
for regular kmalloc allocations, right?

So the way to fix this is to batch uncharge the objects and clear
PG_emergency for the said SLUB pages, thus freeing them for regular
allocations. And to compensate for the loss in the reserves, we ask
the page allocator to give us a new one that SLUB knows nothing about.

If you don't do this, the reserve page can only contain a few objects,
making them unavailable for regular allocations. So we might be
forced into "emergency mode" even though there's enough memory
available to satisfy the allocation.

Pekka

2008-03-02 22:18:53

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Friday February 29, [email protected] wrote:
>
> The tx path is a bit fuzzy. I assume it has an upper limit, take a stab
> at that upper limit and leave it at that.
>
> It should be full of holes, and there is some work on writeout
> throttling to fill some of them - but I haven't seen any lockups in this
> area for a long long while.

I think this is very interesting and useful.

You seem to be saying that the write-throttling is enough to avoid any
starvation on the transmit path. i.e. the VM is limiting the amount
of dirty memory so that when we desperately need memory on the
writeout path we can always get it, without lots of careful
accounting.

So why doesn't this work for the receive side? What - exactly - is
the difference?

I think the difference is that on the receive side we have to hand
out memory before we know how it will be used (i.e. whether it is for
a SK_MEMALLOC socket or not) and so emergency memory could get stuck
in some non-emergency usage.

So suppose we forget about all the allocation tracking (that doesn't
seem to be needed on the send side, so maybe isn't on the receive side)
and just focus on not letting emergency memory get used for the wrong
thing.

So: Have some global flag that says "we are into the emergency pool"
which gets set the first time an allocation has to dip below the low
water mark, and cleared when an allocation succeeds without needing to
dip that low.
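Something like this is what I have in mind (page->reserve is from your
patches; the flag itself and its placement are only illustrative):

int vm_emergency;	/* "we are into the emergency pool" */

struct page *alloc_page_flagged(gfp_t gfp)
{
	struct page *page = alloc_page(gfp);

	/* page->reserve: we had to dip below the low watermark;
	 * a clean success clears the flag again */
	if (page)
		vm_emergency = page->reserve;
	return page;
}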

Then whenever we have memory that might have been allocated from below
the watermark (i.e. an incoming packet) and we find out that it isn't
required for write-out (i.e. it gets attached to a socket for which
SK_MEMALLOC is not set) we test the global flag and if it is set, we
drop the packet and free the memory.

To clarify:
no new accounting
no new reservations
just "drop non-writeout packets when we are low on memory"

Is there any chance that could work reliably?

I call this the "Linus" approach because I remember reading somewhere
(that google cannot find for me) where Linus said he didn't think a
provably correct implementation was the way to go - just something
that made it very likely that we won't run out of memory at an awkward
time.

I guess my position is that any accounting that we do needs to have a
clear theoretical model underneath it so we can reason about it.
I cannot see the clear model in beneath the current code so I'm trying
to come up with models that seem to capture the important elements of
the code, and to explore them.

NeilBrown

2008-03-02 23:33:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Mon, 2008-03-03 at 09:18 +1100, Neil Brown wrote:
> On Friday February 29, [email protected] wrote:
> >
> > The tx path is a bit fuzzy. I assume it has an upper limit, take a stab
> > at that upper limit and leave it at that.
> >
> > It should be full of holes, and there is some work on writeout
> > throttling to fill some of them - but I haven't seen any lockups in this
> > area for a long long while.
>
> I think this is very interesting and useful.
>
> You seem to be saying that the write-throttling is enough to avoid any
> starvation on the transmit path. i.e. the VM is limiting the amount
> of dirty memory so that when we desperately need memory on the
> writeout path we can always get it, without lots of careful
> accounting.
>
> So why doesn't this work for the receive side? What - exactly - is
> the difference?

The TX path needs to be able to make progress in that it must be able to
send out at least one full request (page). The thing the TX path must
not do is tie up so much memory sending out pages that we can't receive
any incoming packets.

So, having a throttle on the amount of writes in progress, and
sufficient memory to back those, seems like a solid way here.

NFS has such a limit in its congestion logic. But I'm quite sure I'm
failing to allocate enough memory to back it, as I got confused by the
whole RPC code.

> I think the difference is that on the receive side we have to hand
> out memory before we know how it will be used (i.e. whether it is for
> a SK_MEMALLOC socket or not) and so emergency memory could get stuck
> in some non-emergency usage.
>
> So suppose we forget about all the allocation tracking (that doesn't
> seem to be needed on the send side, so maybe isn't on the receive side)
> and just focus on not letting emergency memory get used for the wrong
> thing.
>
> So: Have some global flag that says "we are into the emergency pool"
> which gets set the first time an allocation has to dip below the low
> water mark, and cleared when an allocation succeeds without needing to
> dip that low.

That is basically what the slub logic I added does. Except that global
flags in the vm make people very nervous, so it's a little more complex.

> Then whenever we have memory that might have been allocated from below
> the watermark (i.e. an incoming packet)

Which is what I do in the skb_alloc() path.

> and we find out that it isn't
> required for write-out (i.e. it gets attached to a socket for which
> SK_MEMALLOC is not set) we test the global flag and if it is set, we
> drop the packet and free the memory.

Which is somewhat more complex than you make it sound, but that is
exactly what I do.

> To clarify:
> no new accounting
> no new reservations
> just "drop non-writeout packets when we are low on memory"
>
> Is there any chance that could work reliably?

You need to be able to overflow the ip fragment assembly cache, or we
could get stuck with all memory in fragments.

Same for other memory usage before we hit the socket de-multiplex, like
the route-cache.

I just refined those points here; you need to drop more than
non-writeout packets - you need to drop all packets not meant for
SK_MEMALLOC sockets.

You also need to allow some writeout packets, because if you hit 'oom'
and need to write-out some pages to free up memory,...

I did the reservation because I wanted some guarantee we'd be able to
overflow the caches mentioned. The alternative is working with the
variable ratio that the current reserve has.

The accounting makes the whole system more robust. I wanted to make the
state stable enough to survive a connection drop, or server reset for a
long while, and it does. During a swapping workload and heavy network
load, I can pull the network cable, or shut down the NFS server and
leave it down for over 30 minutes. When I bring it back up again, stuff
resumes.

> I call this the "Linus" approach because I remember reading somewhere
> (that google cannot find for me) where Linus said he didn't think a
> provably correct implementation was the way to go - just something
> that made it very likely that we won't run out of memory at an awkward
> time.
>
> I guess my position is that any accounting that we do needs to have a
> clear theoretical model underneath it so we can reason about it.
> I cannot see the clear model in beneath the current code so I'm trying
> to come up with models that seem to capture the important elements of
> the code, and to explore them.

From my POV there is a model, and I've tried to convey it, but clearly
I'm failing horribly. Let me try again:

Create a stable state where you can receive an unlimited amount of
network packets awaiting the one packet you need to move forward.

To do so we need to distinguish needed from unneeded packets; we do this
by means of SK_MEMALLOC. So we need to be able to receive packets up to
that point.

The unlimited amount of packets means unlimited time; which means that
our state must not consume memory, merely use memory. That is, the
amount of memory used must not grow unbounded over time.

So we must guarantee that all memory allocated will be promptly freed
again, and never allocate more than available.

Because this state is not the normal state, we need a trigger to enter
this state (and consequently a trigger to leave this state). We do that
by detecting a low memory situation just like you propose. We enter this
state once normal memory allocations fail and leave this state once they
start succeeding again.

We need the accounting to ensure we never allocate more than is
available, but more importantly because we need to ensure progress for
those packets we already have allocated.

A packet is received, it can be a fragment, it will be placed in the
fragment cache for packet re-assembly.

We need to ensure we can overflow this fragment cache in order that
something will come out at the other end. If under a fragment attack,
the fragment cache limit will prune the oldest fragments, freeing up
memory to receive new ones.

Eventually we'd be able to receive either a whole packet, or enough
fragments to assemble one.

Next comes routing the packet; we need to know where to process the
packet; local or non-local. This potentially involves filling the
route-cache.

If at this point there is no memory available because we forgot to limit
the amount of memory available for skb allocation we again are stuck.

The route-cache, like the fragment assembly, is already accounted and
will prune old (unused) entries once the total memory usage exceeds a
pre-determined amount of memory.

Eventually we'll end up at socket demux, matching packets to sockets
which allows us to either toss the packet or consume it. Dropping
packets is allowed because network is assumed lossy, and we have not yet
acknowledged the receive.
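In code the demux test is essentially the following - sk_has_memalloc()
standing in for however the SK_MEMALLOC bit is tested:

static int sk_accept_skb(struct sock *sk, struct sk_buff *skb)
{
	/* reserve-backed skbs may only queue on SK_MEMALLOC sockets */
	if (skb->emergency && !sk_has_memalloc(sk)) {
		kfree_skb(skb);	/* drop; network is lossy, peer resends */
		return NET_RX_DROP;
	}
	return sk_backlog_rcv(sk, skb);	/* normal delivery */
}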

Does this make sense?


Then we have TX, which like I said above needs to operate under certain
limits as well. We need to be able to send out packets when under
pressure in order to relieve said pressure.

We need to ensure doing so will not exhaust our reserves.

Writing out a page typically takes a little memory, you fudge some
packets with protocol info, mtu size etc.. send them out, and wait for
an acknowledge from the other end, and drop the stuff and go on writing
other pages.

So sending out pages does not consume memory if we're able to receive
ACKs. Being able to receive packets is what all the previous was
about.

Now of course there is some RPC concurrency, TCP windows and other
funnies going on, but I assumed - and I don't think that's a wrong
assumption - that sending out pages will not consume endless amounts of
memory.

Nor will it keep on sending pages, once there is a certain amount of
packets outstanding (nfs congestion logic), it will wait, at which point
it should have no memory in use at all.

Anyway I did get lost in the RPC code, and I know I didn't fully account
everything, but under some (hopefully realistic) assumptions I think the
model is sound.

Does this make sense?

2008-03-03 23:41:47

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16


Hi Peter,

Thanks for trying to spell it out for me. :-)

On Monday March 3, [email protected] wrote:
>
> From my POV there is a model, and I've tried to convey it, but clearly
> I'm failing horribly. Let me try again:
>
> Create a stable state where you can receive an unlimited amount of
> network packets awaiting the one packet you need to move forward.

Yep.

>
> To do so we need to distinguish needed from unneeded packets; we do this
> by means of SK_MEMALLOC. So we need to be able to receive packets up to
> that point.

Yep.

>
> The unlimited amount of packets means unlimited time; which means that
> our state must not consume memory, merely use memory. That is, the
> amount of memory used must not grow unbounded over time.

Yes. Good point.

>
> So we must guarantee that all memory allocated will be promptly freed
> again, and never allocate more than available.

Definitely.

>
> Because this state is not the normal state, we need a trigger to enter
> this state (and consequently a trigger to leave this state). We do that
> by detecting a low memory situation just like you propose. We enter this
> state once normal memory allocations fail and leave this state once they
> start succeeding again.

Agreed.

>
> We need the accounting to ensure we never allocate more than is
> available, but more importantly because we need to ensure progress for
> those packets we already have allocated.

Maybe...
1/ Memory is used
a/ in caches, such as the fragment cache and the route cache
b/ in transient allocations on their way from one place to
another. e.g. network card to fragment cache, frag cache to
socket.
The caches can (do?) impose a natural limit on the amount of
memory they use. The transient allocations should be satisfied
from the normal low watermark pool. When we are in low memory
conditions we can expect packet loss so we expect network streams
to slow down, so we expect there to be fewer bits in transit.
Also in low memory conditions the caches would be extra-cautious
not to use too much memory.
So it isn't completely clear (to me) that extra accounting is needed.

2/ If we were to do accounting to "ensure progress for those packets
we already have allocated", then I would expect a reservation
(charge) of max_packet_size when a fragment arrives on the network
card - or at least when a new fragment is determined to not match
any packet already in the fragment cache. But I didn't see that
in your code. I saw incremental charges as each page arrived.
And that implementation does not seem to fit the model.

>
> A packet is received, it can be a fragment, it will be placed in the
> fragment cache for packet re-assembly.

Understood.

>
> We need to ensure we can overflow this fragment cache in order that
> something will come out at the other end. If under a fragment attack,
> the fragment cache limit will prune the oldest fragments, freeing up
> memory to receive new ones.

I don't understand why we want to "overflow this fragment cache".
I picture the cache having a target size. When under this size,
fragments might be allowed to live longer. When at or over the target
size, old fragments are pruned earlier. When in a low memory
situation it might be even more keen to prune old fragments, to keep
beneath the target size.
When you say "overflow this fragment cache", I picture deliberately
allowing the cache to get bigger than the target size. I don't
understand why you would want to do that.

>
> Eventually we'd be able to receive either a whole packet, or enough
> fragments to assemble one.

That would be important, yes.

>
> Next comes routing the packet; we need to know where to process the
> packet; local or non-local. This potentially involves filling the
> route-cache.
>
> If at this point there is no memory available because we forgot to limit
> the amount of memory available for skb allocation we again are stuck.

Those skbs we allocated - they are either sitting in the fragment
cache, or have been attached to a SK_MEMALLOC socket, or have been
freed - correct? If so, then there is already a limit to how much
memory they can consume.

>
> The route-cache, like the fragment assembly, is already accounted and
> will prune old (unused) entries once the total memory usage exceeds a
> pre-determined amount of memory.

Good. So as long as the normal emergency reserves covers the size of
the route cache plus the size of the fragment cache plus a little bit
of slack, we should be safe - yes?

>
> Eventually we'll end up at socket demux, matching packets to sockets
> which allows us to either toss the packet or consume it. Dropping
> packets is allowed because network is assumed lossy, and we have not yet
> acknowledged the receive.
>
> Does this make sense?

Lots of it does, yes.

>
>
> Then we have TX, which like I said above needs to operate under certain
> limits as well. We need to be able to send out packets when under
> pressure in order to relieve said pressure.

Catch-22 ?? :-)

>
> We need to ensure doing so will not exhaust our reserves.
>
> Writing out a page typically takes a little memory, you fudge some
> packets with protocol info, mtu size etc.. send them out, and wait for
> an acknowledge from the other end, and drop the stuff and go on writing
> other pages.

Yes, rate-limiting those write-outs should keep that moving.

>
> So sending out pages does not consume memory if we're able to receive
> ACKs. Being able to receive packets is what all the previous was
> about.
>
> Now of course there is some RPC concurrency, TCP windows and other
> funnies going on, but I assumed - and I don't think that's a wrong
> assumption - that sending out pages will not consume endless amounts of
> memory.

Sounds fair.

>
> Nor will it keep on sending pages, once there is a certain amount of
> packets outstanding (nfs congestion logic), it will wait, at which point
> it should have no memory in use at all.

Providing it frees any headers it attached to each page (or had
allocated them from a private pool), it should have no memory in use.
I'd have to check through the RPC code (I get lost in there too) to
see how much memory is tied up by each outstanding page write.

>
> Anyway I did get lost in the RPC code, and I know I didn't fully account
> everything, but under some (hopefully realistic) assumptions I think the
> model is sound.
>
> Does this make sense?

Yes.

So I can see two possible models here.

The first is the "bounded cache" or "locally bounded" model.
At every step in the path from writepage to clear_page_writeback,
the amount of extra memory used is bounded by some local rules.
NFS and RPC uses congestion logic to limit the number of outstanding
writes. For incoming packets, the fragment cache and route cache
impose their own limits.
We simply need that the VM reserves a total amount of memory to meet
the sum of those local limits.

Your code embodies this model with the tree of reservations. The root
of the tree stores the sum of all the reservations below, and this
number is given to the VM.
The value of the tree is that different components can register their
needs independently, and the whole tree (or subtrees) can be attached
or not depending on global conditions, such as whether there are any
SK_MEMALLOC sockets or not.

However I don't see how the charging that you implemented fits into
this model.
You don't do any significant charging for the route cache. But you do
for skbs. Why? Don't the majority of those skbs live in the fragment
cache? Doesn't it account their size? (Maybe it doesn't.... maybe it
should?).

I also don't see the value of tracking pages to see if they are
'reserve' pages or not. The decision to drop an skb that is not for
an SK_MEMALLOC socket should be based on whether we are currently
short on memory. Not whether we were short on memory when the skb was
allocated.

The second model that could fit is "total accounting".
In this model we reserve memory at each stage including the transient
stages (packet that has arrived but isn't in fragment cache yet).
As memory moves around, we move the charging from one reserve to
another. If the target reserve doesn't have any space, we drop the
message.
On the transmit side, that means putting the page back on a queue for
sending later. On the receive side that means discarding the packet
and waiting for a resend.
This model makes it easy for the various limits to be very different
while under memory pressure than otherwise. It also means they are
imposed differently, which isn't so good.

So:
- Why do you impose skb allocation limits beyond what is imposed
by the fragment cache?
- Why do you need to track whether each allocation is a reserve or
not?

Thanks,
NeilBrown

2008-03-04 10:29:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> Hi Peter,
>
> Thanks for trying to spell it out for me. :-)
>
> On Monday March 3, [email protected] wrote:
> >
> > From my POV there is a model, and I've tried to convey it, but clearly
> > I'm failing horribly. Let me try again:

> > Create a stable state where you can receive an unlimited amount of
> > network packets awaiting the one packet you need to move forward.
>
> Yep.
>
> >
> > To do so we need to distinguish needed from unneeded packets; we do this
> > by means of SK_MEMALLOC. So we need to be able to receive packets up to
> > that point.
>
> Yep.
>
> >
> > The unlimited amount of packets means unlimited time; which means that
> > our state must not consume memory, merely use memory. That is, the
> > amount of memory used must not grow unbounded over time.
>
> Yes. Good point.
>
> >
> > So we must guarantee that all memory allocated will be promptly freed
> > again, and never allocate more than available.
>
> Definitely.
>
> >
> > Because this state is not the normal state, we need a trigger to enter
> > this state (and consequently a trigger to leave this state). We do that
> > by detecting a low memory situation just like you propose. We enter this
> > state once normal memory allocations fail and leave this state once they
> > start succeeding again.
>
> Agreed.
>
> >
> > We need the accounting to ensure we never allocate more than is
> > available, but more importantly because we need to ensure progress for
> > those packets we already have allocated.
>
> Maybe...
> 1/ Memory is used
> a/ in caches, such as the fragment cache and the route cache
> b/ in transient allocations on their way from one place to
> another. e.g. network card to fragment cache, frag cache to
> socket.
> The caches can (do?) impose a natural limit on the amount of
> memory they use. The transient allocations should be satisfied
> from the normal low watermark pool. When we are in low memory
> conditions we can expect packet loss so we expect network streams
> to slow down, so we expect there to be fewer bits in transit.
> Also in low memory conditions the caches would be extra-cautious
> not to use too much memory.
> So it isn't completely clear (to me) that extra accounting is needed.
>
> 2/ If we were to do accounting to "ensure progress for those packets
> we already have allocated", then I would expect a reservation
> (charge) of max_packet_size when a fragment arrives on the network
> card - or at least when a new fragment is determined to not match
> any packet already in the fragment cache. But I didn't see that
> in your code. I saw incremental charges as each page arrived.
> And that implementation does not seem to fit the model.

Ah, the extra accounting I do counts the number of bytes associated
with skb data, so that we don't exhaust the reserves with incoming
packets. Like you said, packets need a little more memory in their
travels up to the socket demux.

When you look at __alloc_skb(), you'll find we charge the data size to
the reserves, and if you look at __netdev_alloc_page(), you'll see
PAGE_SIZE being charged against the skb reserve.

If we did not do this, and the incoming packet rate were high
enough, we could exhaust the reserves and leave the packets no memory to
use on their travels to the socket demux.
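So __netdev_alloc_page() ends up doing something like this - modulo the
exact reserve API spellings:

struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp)
{
	struct page *page = alloc_page(gfp | __GFP_MEMALLOC);

	/* charge PAGE_SIZE against the skb reserve; if that fails the
	 * reserve is exhausted and we drop rather than dig deeper */
	if (page && page->reserve &&
	    !mem_reserve_pages_charge(&net_skb_reserve, 1)) {
		__free_page(page);
		page = NULL;
	}
	return page;
}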

> > A packet is received, it can be a fragment, it will be placed in the
> > fragment cache for packet re-assembly.
>
> Understood.
>
> >
> > We need to ensure we can overflow this fragment cache in order that
> > something will come out at the other end. If under a fragment attack,
> > the fragment cache limit will prune the oldest fragments, freeing up
> > memory to receive new ones.
>
> I don't understand why we want to "overflow this fragment cache".
> I picture the cache having a target size. When under this size,
> fragments might be allowed to live longer. When at or over the target
> size, old fragments are pruned earlier. When in a low memory
> situation it might be even more keen to prune old fragments, to keep
> beneath the target size.
> When you say "overflow this fragment cache", I picture deliberately
> allowing the cache to get bigger than the target size. I don't
> understand why you would want to do that.

What I mean by overflowing is: by providing more than the cache can
handle we guarantee that forward progress is made, because either the
cache will prune items - giving us memory to continue - or we'll get out
a fully assembled packet which can continue its quest :-)

If we provide less memory than the cache can hold, all memory can be
tied up in the cache, waiting for something to happen - which won't,
because we're out of memory.

> > Eventually we'd be able to receive either a whole packet, or enough
> > fragments to assemble one.
>
> That would be important, yes.
>
> >
> > Next comes routing the packet; we need to know where to process the
> > packet; local or non-local. This potentially involves filling the
> > route-cache.
> >
> > If at this point there is no memory available because we forgot to limit
> > the amount of memory available for skb allocation we again are stuck.
>
> Those skbs we allocated - they are either sitting in the fragment
> cache, or have been attached to a SK_MEMALLOC socket, or have been
> freed - correct? If so, then there is already a limit to how much
> memory they can consume.

Not really, there is no natural limit to the amount of packets that can
be in transit between RX and socket demux. So we need the extra (skb)
accounting to impose that.

> > The route-cache, like the fragment assembly, is already accounted and
> > will prune old (unused) entries once the total memory usage exceeds a
> > pre-determined amount of memory.
>
> Good. So as long as the normal emergency reserves covers the size of
> the route cache plus the size of the fragment cache plus a little bit
> of slack, we should be safe - yes?

Basically (except for the last point), unless I've missed something in
the net-stack.

> > Eventually we'll end up at socket demux, matching packets to sockets
> > which allows us to either toss the packet or consume it. Dropping
> > packets is allowed because network is assumed lossy, and we have not yet
> > acknowledged the receive.
> >
> > Does this make sense?
>
> Lots of it does, yes.

Good, making progress here :-)

> > Then we have TX, which like I said above needs to operate under certain
> > limits as well. We need to be able to send out packets when under
> > pressure in order to relieve said pressure.
>
> Catch-22 ?? :-)

Yeah, swapping is such fun - which is why we have these reserves. We
only pretend we're out of memory, but secretly we do have some left. But
sssh, don't tell user-space :-)

> > We need to ensure doing so will not exhaust our reserves.
> >
> > Writing out a page typically takes a little memory, you fudge some
> > packets with protocol info, mtu size etc.. send them out, and wait for
> > an acknowledge from the other end, and drop the stuff and go on writing
> > other pages.
>
> Yes, rate-limiting those write-outs should keep that moving.
>
> >
> > So sending out pages does not consume memory if we're able to receive
> > ACKs. Being able to receive packets is what all the previous was
> > about.
> >
> > Now of course there is some RPC concurrency, TCP windows and other
> > funnies going on, but I assumed - and I don't think that's a wrong
> > assumption - that sending out pages will not consume endless amounts of
> > memory.
>
> Sounds fair.
>
> >
> > Nor will it keep on sending pages, once there is a certain amount of
> > packets outstanding (nfs congestion logic), it will wait, at which point
> > it should have no memory in use at all.
>
> Providing it frees any headers it attached to each page (or had
> allocated them from a private pool), it should have no memory in use.
> I'd have to check through the RPC code (I get lost in there too) to
> see how much memory is tied up by each outstanding page write.
>
> >
> > Anyway I did get lost in the RPC code, and I know I didn't fully account
> > everything, but under some (hopefully realistic) assumptions I think the
> > model is sound.
> >
> > Does this make sense?
>
> Yes.
>
> So I can see two possible models here.
>
> The first is the "bounded cache" or "locally bounded" model.
> At every step in the path from writepage to clear_page_writeback,
> the amount of extra memory used is bounded by some local rules.
> NFS and RPC uses congestion logic to limit the number of outstanding
> writes. For incoming packets, the fragment cache and route cache
> impose their own limits.
> We simply need that the VM reserves a total amount of memory to meet
> the sum of those local limits.
>
> Your code embodies this model with the tree of reservations. The root
> of the tree stores the sum of all the reservations below, and this
> number is given to the VM.
> The value of the tree is that different components can register their
> needs independently, and the whole tree (or subtrees) can be attached
> or not depending on global conditions, such as whether there are any
> SK_MEMALLOC sockets or not.
>
> However I don't see how the charging that you implemented fits into
> this model.
> You don't do any significant charging for the route cache. But you do
> for skbs. Why? Don't the majority of those skbs live in the fragment
> cache? Doesn't it account their size? (Maybe it doesn't.... maybe it
> should?).

To impose a limit on the amount of skb data in transit. As stated
above, there is (afaik) no natural limit on this.

> I also don't see the value of tracking pages to see if they are
> 'reserve' pages or not. The decision to drop an skb that is not for
> an SK_MEMALLOC socket should be based on whether we are currently
> short on memory. Not whether we were short on memory when the skb was
> allocated.

That comes from accounting: once you need to account data you need to
know when to start accounting, and keep state so that you can properly
un-account.

Also, from a practical POV it's easier to detect our lack of memory from
an allocation site than outside of it.

> The second model that could fit is "total accounting".
> In this model we reserve memory at each stage including the transient
> stages (packet that has arrived but isn't in fragment cache yet).
> As memory moves around, we move the charging from one reserve to
> another. If the target reserve doesn't have any space, we drop the
> message.
> On the transmit side, that means putting the page back on a queue for
> sending later. On the receive side that means discarding the packet
> and waiting for a resend.
> This model makes it easy for the various limits to be very different
> while under memory pressure than otherwise. It also means they are
> imposed differently, which isn't so good.
>
> So:
> - Why do you impose skb allocation limits beyond what is imposed
> by the fragment cache?

To impose a limit on the amount of skbs in transit. The fragment cache
only imposes a limit on the amount of data held for packet assembly. Not
the total amount of skb data between receive and socket demux.

> - Why do you need to track whether each allocation is a reserve or
> not?

To do accounting.

2008-03-07 03:33:56

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Tuesday March 4, [email protected] wrote:
>
> On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> >
> > Those skbs we allocated - they are either sitting in the fragment
> > cache, or have been attached to a SK_MEMALLOC socket, or have been
> > freed - correct? If so, then there is already a limit to how much
> > memory they can consume.
>
> Not really, there is no natural limit to the amount of packets that can
> be in transit between RX and socket demux. So we need the extra (skb)
> accounting to impose that.

Isn't there? A brief look at the code suggests that (except for
fragment handling) there is a fairly straight path from
network-receive to socket demux. No queues along the way.
That suggests the number of in-transit skbs should be limited by the
number of CPUs. Did I miss something? Or is the number of CPUs
potentially too large to be a suitable limit (seems unlikely).

While looking at the code it also occurred to me that:
1/ tcpdump could be sent incoming packets. Is there a limit
to the number of packets that can be in-kernel waiting for
tcpdump to collect them? Should this limit be added to the base
reserve?
2/ If the host is routing network packets, then incoming packets
might go on an outbound queue. Is this space limited? and
included in the reserve?

Not major points, but I thought I would mention them.

> > I also don't see the value of tracking pages to see if they are
> > 'reserve' pages or not. The decision to drop an skb that is not for
> > an SK_MEMALLOC socket should be based on whether we are currently
> > short on memory. Not whether we were short on memory when the skb was
> > allocated.
>
> That comes from accounting: once you need to account data you need to
> know when to start accounting, and keep state so that you can properly
> un-account.
>

skbs are the main (only?) thing you do accounting on, so focusing on
those:

Suppose that every time you allocate memory for an skb, you check
if the allocation had to dip into emergency reserves, and account the
memory if so - releasing the memory and dropping the packet if we are
over the limit.
And any time you free memory associated with an skb, you check if the
accounts currently say '0', and if not subtract the size of the
allocation from the accounts.

Then you have quite workable accounting that doesn't need to tag every
piece of memory with its 'reserve' status, and only pays the
accounting cost (presumably a spinlock) when running out of memory, or
just recovering.
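Concretely, I imagine something like the following - the limit and all
the names are only for illustration:

#define SKB_RESERVE_MAX	(1024 * 1024)	/* bytes, illustrative */

static DEFINE_SPINLOCK(skb_account_lock);
static long skb_reserve_used;

static int skb_charge(size_t size, int hit_reserve)
{
	int ok = 1;

	if (!hit_reserve)
		return 1;	/* common case: no locking, no accounting */

	spin_lock(&skb_account_lock);
	if (skb_reserve_used + size > SKB_RESERVE_MAX)
		ok = 0;		/* over the limit: drop the packet */
	else
		skb_reserve_used += size;
	spin_unlock(&skb_account_lock);
	return ok;
}

static void skb_uncharge(size_t size)
{
	if (!skb_reserve_used)	/* accounts say '0': nothing to undo */
		return;

	spin_lock(&skb_account_lock);
	skb_reserve_used -= min_t(long, skb_reserve_used, size);
	spin_unlock(&skb_account_lock);
}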

This more relaxed approach to accounting reserved vs non-reserved
memory has a strong parallel in your slub code (which I now
understand). When sl[au]b first gets a ->reserve page, it sets the
->reserve flag on the memcache and leaves it set until it sometime
later gets a non-"->reserve" page. Any memory freed in the mean time
(whether originally reserved or not) is treated as reserve memory in
that it will only be returned for ALLOC_NO_WATERMARKS allocations.
I think this is a good way of working with reserved memory. It isn't
precise, but it is low-cost and good enough to get you through the
difficult patch.
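i.e., as I read it, roughly:

/* refill: the cache-wide flag follows the newest page's status */
static void slab_set_reserve(struct kmem_cache *s, struct page *page)
{
	s->reserve = page->reserve;	/* sticks until a !reserve page */
}

/* alloc: while flagged, objects only go to emergency allocations */
static int slab_may_alloc(struct kmem_cache *s, gfp_t gfp)
{
	return !s->reserve || (gfp & __GFP_MEMALLOC) ||
	       (current->flags & PF_MEMALLOC);
}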

Your netvm-skbuff-reserve.patch has some code to make sure that all
the allocations in an skb have the same 'reserve' status. I don't
think that is needed and just makes the code messy - plus it requires
the 'overcommit' flag to mem_reserve_kmalloc_charge which is a bit of
a wart on the interface.

I would suggest getting rid of that. Just flag the whole skb if any
part gets a 'reserve' allocation, and use that flag to decide to drop
packets arriving at non-SK_MEMALLOC sockets.



So: I think I now really understand what your code is doing, so I will
try to explain it in terms that even I understand... This text is
explicitly available under GPLv2 in case you want it.

It actually describes something a bit different to what your code
currently does, but I think it is very close to the spirit. Some
differences follow from my observations above. For others, the way that
seemed to make sense while describing the problem and solution
differed slightly from what I saw the code doing. Obviously the code
and the description should be aligned one way or another before being
finalised.
The description is a bit long ... sorry about that. But I wanted to
make sure I included motivation and various assumptions. Some of my
understanding may well be wrong, but I present it here anyway. It is
easier for you to correct if it is clearly visible:-)

Problem:
When Linux needs to allocate memory it may find that there is
insufficient free memory so it needs to reclaim space that is in
use but not needed at the moment. There are several options:

1/ Shrink a kernel cache such as the inode or dentry cache. This
is fairly easy but provides limited returns.
2/ Discard 'clean' pages from the page cache. This is easy, and
works well as long as there are clean pages in the page cache.
Similarly clean 'anonymous' pages can be discarded - if there
are any.
3/ Write out some dirty page-cache pages so that they become clean.
The VM limits the number of dirty page-cache pages to e.g. 40%
of available memory so that (among other reasons) a "sync" will
not take excessively long. So there should never be excessive
amounts of dirty pagecache.
Writing out dirty page-cache pages involves work by the
filesystem which may need to allocate memory itself. To avoid
deadlock, filesystems use GFP_NOFS when allocating memory on the
write-out path. When this is used, cleaning dirty page-cache
pages is not an option so if the filesystem finds that memory
is tight, another option must be found.
4/ Write out dirty anonymous pages to the "Swap" partition/file.
This is the most interesting for a couple of reasons.
a/ Unlike dirty page-cache pages, there is no need to write anon
pages out unless we are actually short of memory. Thus they
tend to be left to last.
b/ Anon pages tend to be updated randomly and unpredictably, and
flushing them out of memory can have a very significant
performance impact on the process using them. This contrasts
with page-cache pages which are often written sequentially
and often treated as "write-once, read-many".
So anon pages tend to be left until last to be cleaned, and may
be the only cleanable pages while there are still some dirty
page-cache pages (which are waiting on a GFP_NOFS allocation).

[I don't find the above wholly satisfying. There seems to be too much
hand-waving. If someone can provide better text explaining why
swapout is a special case, that would be great.]

So we need to be able to write to the swap file/partition without
needing to allocate any memory ... or only a small well controlled
amount.

The VM reserves a small amount of memory that can only be allocated
for use as part of the swap-out procedure. It is only available to
processes with the PF_MEMALLOC flag set, which is typically just the
memory cleaner.

Traditionally swap-out is performed directly to block devices (swap
files on block-device filesystems are supported by examining the
mapping from file offset to device offset in advance, and then using
the device offsets to write directly to the device). Block devices
are (required to be) written to pre-allocate any memory that might be
needed during write-out, and to block when the pre-allocated memory is
exhausted and no other memory is available. They can be sure not to
block forever as the pre-allocated memory will be returned as soon as
the data it is being used for has been written out. The primary
mechanism for pre-allocating memory is called "mempools".

This approach does not work for writing anonymous pages
(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.


The main reason that it does not work is that when data from an anon
page is written to the network, we must wait for a reply to confirm
the data is safe. Receiving that reply will consume memory and,
significantly, we need to allocate memory to an incoming packet before
we can tell if it is the reply we are waiting for or not.

The secondary reason is that the network code is not written to use
mempools and in most cases does not need to use them. Changing all
allocations in the networking layer to use mempools would be quite
intrusive, and would waste memory, and probably cause a slow-down in
the common case of not swapping over the network.

These problems are addressed by enhancing the system of memory
reserves used by PF_MEMALLOC and requiring any in-kernel networking
client that is used for swap-out to indicate which sockets are used
for swapout so they can be handled specially in low memory situations.

There are several major parts to this enhancement:

1/ PG_emergency, GFP_MEMALLOC

To handle low memory conditions we need to know when those
conditions exist. Having a global "low on memory" flag seems easy,
but its implementation is problematic. Instead we make it possible
to tell if a recent memory allocation required use of the emergency
memory pool.
For pages returned by alloc_page, the new page flag PG_emergency
can be tested. If this is set, then a low memory condition was
current when the page was allocated, so the memory should be used
carefully.

For memory allocated using slab/slub: If a page that is added to a
kmem_cache is found to have PG_emergency set, then a ->reserve
flag is set for the whole kmem_cache. Further allocations will only
be returned from that page (or any other page in the cache) if they
are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
Non-emergency allocations will block in alloc_page until a
non-reserve page is available. Once a non-reserve page has been
added to the cache, the ->reserve flag on the cache is removed.
When memory is returned by slab/slub, PG_emergency is set on the page
holding the memory to match the ->reserve flag on that cache.

After memory has been returned by kmem_cache_alloc or kmalloc, the
page's PG_emergency flag can be tested. If it is set, then the most
recent allocation from that cache required reserve memory, so this
allocation should be used with care.

It is not safe to test the cache's ->reserve flag immediately after
an allocation as that flag is in per-cpu data, and the process could
have been rescheduled to a different cpu if preemption is enabled.
Thus the use of PG_emergency to carry this information.

This allows us to
a/ request use of the emergency pool when allocating memory
(GFP_MEMALLOC), and
b/ to find out if the emergency pool was used.

2/ SK_MEMALLOC, sk_buff->emergency.

When memory from the reserve is used to store incoming network
packets, the memory must be freed (and the packet dropped) as soon
as we find out that the packet is not for a socket that is used for
swap-out.
To achieve this we have an ->emergency flag for skbs, and an
SK_MEMALLOC flag for sockets.
When memory is allocated for an skb, it is allocated with
GFP_MEMALLOC (if we are currently swapping over the network at
all). If a subsequent test shows that the emergency pool was used,
->emergency is set.
When the skb is finally attached to its destination socket, the
SK_MEMALLOC flag on the socket is tested. If the skb has
->emergency set, but the socket does not have SK_MEMALLOC set, then
the skb is immediately freed and the packet is dropped.
This ensures that reserve memory is never queued on a socket that is
not used for swapout.

Similarly, if an skb is ever queued for delivery to user-space, for
example by netfilter, the ->emergency flag is tested and the skb is
released if ->emergency is set.

This ensures that memory from the emergency reserve can be used to
allow swapout to proceed, but will not get caught up in any other
network queue.


3/ pages_emergency

The above would be sufficient if the total memory below the lowest
memory watermark (i.e. the size of the emergency reserve) were known
to be enough to hold all transient allocations needed for writeout.
I'm a little blurry on how big the current emergency pool is, but it
isn't big and certainly hasn't been sized to allow network traffic
to consume any.

We could simply make the size of the reserve bigger. However in the
common case that we are not swapping over the network, that would be
a waste of memory.

So a new "watermark" is defined: pages_emergency. This is
effectively added to the current low water marks, so that pages from
this emergency pool can only be allocated if one of PF_MEMALLOC or
GFP_MEMALLOC is set.

pages_emergency can be changed dynamically based on need. When
swapout over the network is required, pages_emergency is increased
to cover the maximum expected load. When network swapout is
disabled, pages_emergency is decreased.
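I imagine the adjustment looks something like this, assuming the
rebuild goes through the existing setup_per_zone_pages_min():

long pages_emergency;	/* the new watermark component */

static void adjust_pages_emergency(long delta)
{
	pages_emergency += delta;
	setup_per_zone_pages_min();	/* fold into zone watermarks */
}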

To determine how much to increase it by, we introduce reservation
groups....

3a/ reservation groups

The memory used transiently for swapout can be in a number of
different places. e.g. the network route cache, the network
fragment cache, in transit between network card and socket, or (in
the case of NFS) in sunrpc data structures awaiting a reply.
We need to ensure each of these is limited in the amount of memory
they use, and that the maximum is included in the reserve.

The memory required by the network layer only needs to be reserved
once, even if there are multiple swapout paths using the network
(e.g. NFS and NBD and iSCSI, though using all three for swapout at
the same time would be unusual).

So we create a tree of reservation groups. The network might
register a collection of reservations, but not mark them as being in
use. NFS and sunrpc might similarly register a collection of
reservations, and attach it to the network reservations as it
depends on them.
When swapout over NFS is requested, the NFS/sunrpc reservations are
activated which implicitly activates the network reservations.

The total new reservation is added to pages_emergency.

Provided each memory usage stays beneath the registered limit (at
least when allocating memory from reserves), the system will never
run out of emergency memory, and swapout will not deadlock.
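So registration might look something like this - I am guessing at the
exact reserve API spellings:

#define NFS_RESERVE_PAGES	256	/* illustrative only */

static struct mem_reserve net_reserve;	/* generic network reserves */
static struct mem_reserve nfs_reserve;	/* NFS + sunrpc, child of net */

static int nfs_swap_activate(void)
{
	/* attaching the subtree implicitly activates its parents and
	 * adds the subtree's total into pages_emergency */
	mem_reserve_init(&nfs_reserve, "nfs", &net_reserve);
	return mem_reserve_pages_set(&nfs_reserve, NFS_RESERVE_PAGES);
}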

It is worth noting here that it is not critical that each usage
stays beneath the limit 100% of the time. Occasional excess is
acceptable provided that the memory will be freed again within a
short amount of time that does *not* require waiting for any event
that itself might require memory.
This is because, at all stages of transmit and receive, it is
acceptable to discard all transient memory associated with a
particular writeout and try again later. On transmit, the page can
be re-queued for later transmission. On receive, the packet can be
dropped assuming that the peer will resend after a timeout.

Thus allocations that are truly transient and will be freed without
blocking do not strictly need to be reserved for. Doing so might
still be a good idea to ensure forward progress doesn't take too
long.

4/ lo-mem accounting

Most places that might hold on to emergency memory (e.g. route
cache, fragment cache etc) already place a limit on the amount of
memory that they can use. This limit can simply be reserved using
the above mechanism and no more needs to be done.

However some memory usage might not be accounted with sufficient
firmness to allow an appropriate emergency reservation. The
in-flight skbs for incoming packets are (claimed to be) one such
example.

To support this, a low-overhead mechanism for accounting memory
usage against the reserves is provided. This mechanism uses the
same data structure that is used to store the emergency memory
reservations through the addition of a 'usage' field.

When memory allocation for a particular purpose succeeds, the memory
is checked to see if it is 'reserve' memory. If it is, the size of
the allocation is added to the 'usage'. If this exceeds the
reservation, the usage is reduced again and the memory that was
allocated is freed.

When memory that was allocated for that purpose is freed, the
'usage' field is checked again. If it is non-zero, then the size of
the freed memory is subtracted from the usage, making sure the usage
never becomes less than zero.

This provides adequate accounting with minimal overheads when not in
a low memory condition. When a low memory condition is encountered
it does add the cost of a spin lock necessary to serialise updates
to 'usage'.



5/ swapfile/swap_out/swap_in

So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
any network socket that it uses, and can know when to account
reserve memory carefully, new address_space_operations are
available.
"swapfile" requests that an address space (i.e a file) be make ready
for swapout. swap_out and swap_in request the actual IO. They
together must ensure that each swap_out request can succeed without
allocating more emergency memory than was reserved by swapfile.
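In other words the a_ops would have this shape (the nfs_*
implementations here are of course hypothetical):

static const struct address_space_operations nfs_file_aops = {
	/* ... the usual read/write a_ops ... */
	.swapfile	= nfs_swapfile,	/* reserve memory, set SK_MEMALLOC */
	.swap_out	= nfs_swap_out,	/* write one page within the reserve */
	.swap_in	= nfs_swap_in,	/* read the page back in */
};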


Thanks for reading this far. I hope it made sense :-)

NeilBrown

2008-03-07 11:18:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

Hi Neil,

I'm so glad you are working with me on this and writing this in human
English. It seems to be my eternal shortcoming that I can't communicate
my ideas clearly :-/. Thanks for your effort!

On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> On Tuesday March 4, [email protected] wrote:
> >
> > On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> > >
> > > Those skbs we allocated - they are either sitting in the fragment
> > > cache, or have been attached to a SK_MEMALLOC socket, or have been
> > > freed - correct? If so, then there is already a limit to how much
> > > memory they can consume.
> >
> > Not really, there is no natural limit to the amount of packets that can
> > be in transit between RX and socket demux. So we need the extra (skb)
> > accounting to impose that.
>
> Isn't there? A brief look at the code suggests that (except for
> fragment handling) there is a fairly straight path from
> network-receive to socket demux. No queues along the way.
> That suggests the number of in-transit skbs should be limited by the
> number of CPUs. Did I miss something? Or is the number of CPUs
> potentially too large to be a suitable limit (seems unlikely).

That would be so if the whole path from RX to socket demux had
hard-irqs disabled. However, I didn't see that. Moreover I think the
whole purpose of the NetPoll interface is to allow some RX queueing to
cut down on softirq overhead.

> While looking at the code it also occurred to me that:
> 1/ tcpdump could be sent incoming packets. Is there a limit
> to the number of packets that can be in-kernel waiting for
> tcpdump to collect them? Should this limit be added to the base
> reserve?

We could indeed do something like that; building a cache there that
recycles the oldest waiting skbs, allowing taps to continue working
under light pressure, seems like a reasonable thing.

Good suggestion for future work.

> 2/ If the host is routing network packets, then incoming packets
> might go on an outbound queue. Is this space limited? and
> included in the reserve?

Not sure, somewhere along the routing code I lost it again. Constructive
input from someone versed in that part of the kernel would be most
welcome.

> Not major points, but I thought I would mention them.
>
> > > I also don't see the value of tracking pages to see if they are
> > > 'reserve' pages or not. The decision to drop an skb that is not for
> > > an SK_MEMALLOC socket should be based on whether we are currently
> > > short on memory. Not whether we were short on memory when the skb was
> > > allocated.
> >
> > That comes from accounting: once you need to account data you need to
> > know when to start accounting, and keep state so that you can properly
> > un-account.
> >
>
> skbs are the main (only?) thing you do accounting on, so focusing on
> those:
>
> Suppose that every time you allocate memory for an skb, you check
> if the allocation had to dip into emergency reserves, and account the
> memory if so - releasing the memory and dropping the packet if we are
> over the limit.
> And any time you free memory associated with an skb, you check if the
> accounts currently say '0', and if not subtract the size of the
> allocation from the accounts.
>
> Then you have quite workable accounting that doesn't need to tag every
> piece of memory with its 'reserve' status, and only pays the
> accounting cost (presumably a spinlock) when running out of memory, or
> just recovering.

Quite so, that has been the intent.

> This more relaxed approach to accounting reserved vs non-reserved
> memory has a strong parallel in your slub code (which I now
> understand). When sl[au]b first gets a ->reserve page, it sets the
> ->reserve flag on the memcache and leaves it set until it sometime
> later gets a non-"->reserve" page. Any memory freed in the mean time
> (whether originally reserved or not) is treated as reserve memory in
> that it will only be returned for ALLOC_NO_WATERMARKS allocations.
> I think this is a good way of working with reserved memory. It isn't
> precise, but it is low-cost and good enough to get you through the
> difficult patch.

Agreed.

> Your netvm-skbuff-reserve.patch has some code to make sure that all
> the allocations in an skb have the same 'reserve' status. I don't
> think that is needed and just makes the code messy - plus it requires
> the 'overcommit' flag to mem_reserve_kmalloc_charge which is a bit of
> a wart on the interface.

It is indeed.

> I would suggest getting rid of that. Just flag the whole skb if any
> part gets a 'reserve' allocation, and use that flag to decide to drop
> packets arriving at non-SK_MEMALLOC sockets.

OK, I'll look into doing that. I must have been in pedantic mode when I
wrote that code.

> So: I think I now really understand what your code is doing, so I will
> try to explain it in terms that even I understand... This text is
> explicitly available under GPLv2 in case you want it.

Great work! Thanks!

> It actually describes something a bit different to what your code
> currently does, but I think it is very close to the spirit. Some
> differences follow from my observations above. Others arise because
> the way that seemed to make sense while describing the problem and
> solution differed slightly from what I saw the code doing. Obviously the code
> and the description should be aligned one way or another before being
> finalised.
> The description is a bit long ... sorry about that. But I wanted to
> make sure I included motivation and various assumptions. Some of my
> understanding may well be wrong, but I present it here anyway. It is
> easier for you to correct if it is clearly visible:-)
>
> Problem:
> When Linux needs to allocate memory it may find that there is
> insufficient free memory so it needs to reclaim space that is in
> use but not needed at the moment. There are several options:
>
> 1/ Shrink a kernel cache such as the inode or dentry cache. This
> is fairly easy but provides limited returns.
> 2/ Discard 'clean' pages from the page cache. This is easy, and
> works well as long as there are clean pages in the page cache.
> Similarly clean 'anonymous' pages can be discarded - if there
> are any.
> 3/ Write out some dirty page-cache pages so that they become clean.
> The VM limits the number of dirty page-cache pages to e.g. 40%
> of available memory so that (among other reasons) a "sync" will
> not take excessively long. So there should never be excessive
> amounts of dirty pagecache.
> Writing out dirty page-cache pages involves work by the
> filesystem which may need to allocate memory itself. To avoid
> deadlock, filesystems use GFP_NOFS when allocating memory on the
> write-out path. When this is used, cleaning dirty page-cache
> pages is not an option so if the filesystem finds that memory
> is tight, another option must be found.
> 4/ Write out dirty anonymous pages to the "Swap" partition/file.
> This is the most interesting for a couple of reasons.
> a/ Unlike dirty page-cache pages, there is no need to write anon
> pages out unless we are actually short of memory. Thus they
> tend to be left to last.
> b/ Anon pages tend to be updated randomly and unpredictably, and
> flushing them out of memory can have a very significant
> performance impact on the process using them. This contrasts
> with page-cache pages which are often written sequentially
> and often treated as "write-once, read-many".
> So anon pages tend to be left until last to be cleaned, and may
> be the only cleanable pages while there are still some dirty
> page-cache pages (which are waiting on a GFP_NOFS allocation).
>
> [I don't find the above wholly satisfying. There seems to be too much
> hand-waving. If someone can provide better text explaining why
> swapout is a special case, that would be great.]

Anonymous pages are dirty by definition (except the zero page, but I
think we recently ditched it). So shrinking of the anonymous pool will
require swapping.

It is indeed the last refuge for those with GFP_NOFS. Along with the
strict limit on the amount of dirty file pages, it also ensures that writing
those out will never deadlock the machine, as there are always clean file
pages and/or anonymous pages to launder.

Your observation about the difference in swap vs file disk patterns is
the motivation (one of them, at least) for Rik van Riel's split VM
series; it is currently not of any consequence.

Swap is indeed special in that it requires 'atomic' writeout. This
requirement comes from the cyclic dependency you outlined and we have to
break: we're out of memory, but need memory to write out pages to free
memory. So we must make it appear as if swap writes are indeed atomic.

> So we need to be able to write to the swap file/partition without
> needing to allocate any memory ... or only a small well controlled
> amount.
>
> The VM reserves a small amount of memory that can only be allocated
> for use as part of the swap-out procedure. It is only available to
> processes with the PF_MEMALLOC flag set, which is typically just the
> memory cleaner.
>
> Traditionally swap-out is performed directly to block devices (swap
> files on block-device filesystems are supported by examining the
> mapping from file offset to device offset in advance, and then using
> the device offsets to write directly to the device). Block device
> drivers are (required to be) written so as to pre-allocate any memory that
> might be needed during write-out, and to block when the pre-allocated memory is
> exhausted and no other memory is available. They can be sure not to
> block forever as the pre-allocated memory will be returned as soon as
> the data it is being used for has been written out. The primary
> mechanism for pre-allocating memory is called "mempools".
>
> This approach does not work for writing anonymous pages
> (i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
>
>
> The main reason that it does not work is that when data from an anon
> page is written to the network, we must wait for a reply to confirm
> the data is safe. Receiving that reply will consume memory and,
> significantly, we need to allocate memory to an incoming packet before
> we can tell if it is the reply we are waiting for or not.
>
> The secondary reason is that the network code is not written to use
> mempools and in most cases does not need to use them. Changing all
> allocations in the networking layer to use mempools would be quite
> intrusive, and would waste memory, and probably cause a slow-down in
> the common case of not swapping over the network.
>
> These problems are addressed by enhancing the system of memory
> reserves used by PF_MEMALLOC and requiring any in-kernel networking
> client that is used for swap-out to indicate which sockets are used
> for swapout so they can be handled specially in low memory situations.
>
> There are several major parts to this enhancement:
>
> 1/ PG_emergency, GFP_MEMALLOC
>
> To handle low memory conditions we need to know when those
> conditions exist. Having a global "low on memory" flag seems easy,
> but its implementation is problematic. Instead we make it possible
> to tell if a recent memory allocation required use of the emergency
> memory pool.
> For pages returned by alloc_page, the new page flag PG_emergency
> can be tested. If this is set, then a low memory condition was
> current when the page was allocated, so the memory should be used
> carefully.
>
> For memory allocated using slab/slub: If a page that is added to a
> kmem_cache is found to have PG_emergency set, then a ->reserve
> flag is set for the whole kmem_cache. Further allocations will only
> be returned from that page (or any other page in the cache) if they
> are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
> Non-emergency allocations will block in alloc_page until a
> non-reserve page is available. Once a non-reserve page has been
> added to the cache, the ->reserve flag on the cache is removed.
> When memory is returned by slab/slub, PG_emergency is set on the page
> holding the memory to match the ->reserve flag on that cache.
>
> After memory has been returned by kmem_cache_alloc or kmalloc, the
> page's PG_emergency flag can be tested. If it is set, then the most
> recent allocation from that cache required reserve memory, so this
> allocation should be used with care.
>
> It is not safe to test the cache's ->reserve flag immediately after
> an allocation as that flag is in per-cpu data, and the process could
> have been rescheduled to a different cpu if preemption is enabled.
> Thus the use of PG_emergency to carry this information.
>
> This allows us to
> a/ request use of the emergency pool when allocating memory
> (GFP_MEMALLOC), and
> b/ find out if the emergency pool was used.

Right. I've had a long conversation on PG_emergency with Pekka. And I
think the conclusion was that PG_emergency will create more headaches
than it solves. I probably have the conversation in my IRC logs and
could email it if you're interested (and Pekka doesn't object).

> 2/ SK_MEMALLOC, sk_buff->emergency.
>
> When memory from the reserve is used to store incoming network
> packets, the memory must be freed (and the packet dropped) as soon
> as we find out that the packet is not for a socket that is used for
> swap-out.
> To achieve this we have an ->emergency flag for skbs, and an
> SK_MEMALLOC flag for sockets.
> When memory is allocated for an skb, it is allocated with
> GFP_MEMALLOC (if we are currently swapping over the network at
> all). If a subsequent test shows that the emergency pool was used,
> ->emergency is set.
> When the skb is finally attached to its destination socket, the
> SK_MEMALLOC flag on the socket is tested. If the skb has
> ->emergency set, but the socket does not have SK_MEMALLOC set, then
> the skb is immediately freed and the packet is dropped.
> This ensures that reserve memory is never queued on a socket that is
> not used for swapout.
>
> Similarly, if an skb is ever queued for delivery to user-space, for
> example by netfilter, the ->emergency flag is tested and the skb is
> released if ->emergency is set.
>
> This ensures that memory from the emergency reserve can be used to
> allow swapout to proceed, but will not get caught up in any other
> network queue.
>
>
> 3/ pages_emergency
>
> The above would be sufficient if the total memory below the lowest
> memory watermark (i.e. the size of the emergency reserve) were known
> to be enough to hold all transient allocations needed for writeout.
> I'm a little blurry on how big the current emergency pool is, but it
> isn't big and certainly hasn't been sized to allow network traffic
> to consume any.
>
> We could simply make the size of the reserve bigger. However in the
> common case that we are not swapping over the network, that would be
> a waste of memory.
>
> So a new "watermark" is defined: pages_emergency. This is
> effectively added to the current low water marks, so that pages from
> this emergency pool can only be allocated if one of PF_MEMALLOC or
> GFP_MEMALLOC is set.
>
> pages_emergency can be changed dynamically based on need. When
> swapout over the network is required, pages_emergency is increased
> to cover the maximum expected load. When network swapout is
> disabled, pages_emergency is decreased.
>
> To determine how much to increase it by, we introduce reservation
> groups....
>
> 3a/ reservation groups
>
> The memory used transiently for swapout can be in a number of
> different places. e.g. the network route cache, the network
> fragment cache, in transit between network card and socket, or (in
> the case of NFS) in sunrpc data structures awaiting a reply.
> We need to ensure each of these is limited in the amount of memory
> they use, and that the maximum is included in the reserve.
>
> The memory required by the network layer only needs to be reserved
> once, even if there are multiple swapout paths using the network
> (e.g. NFS and NBD and iSCSI, though using all three for swapout at
> the same time would be unusual).
>
> So we create a tree of reservation groups. The network might
> register a collection of reservations, but not mark them as being in
> use. NFS and sunrpc might similarly register a collection of
> reservations, and attach it to the network reservations as it
> depends on them.
> When swapout over NFS is requested, the NFS/sunrpc reservations are
> activated which implicitly activates the network reservations.
>
> The total new reservation is added to pages_emergency.
>
> Provided each memory usage stays beneath the registered limit (at
> least when allocating memory from reserves), the system will never
> run out of emergency memory, and swapout will not deadlock.
>
> It is worth noting here that it is not critical that each usage
> stays beneath the limit 100% of the time. Occasional excess is
> acceptable provided that the memory will be freed again within a
> short amount of time that does *not* require waiting for any event
> that itself might require memory.
> This is because, at all stages of transmit and receive, it is
> acceptable to discard all transient memory associated with a
> particular writeout and try again later. On transmit, the page can
> be re-queued for later transmission. On receive, the packet can be
> dropped assuming that the peer will resend after a timeout.
>
> Thus allocations that are truly transient and will be freed without
> blocking do not strictly need to be reserved for. Doing so might
> still be a good idea to ensure forward progress doesn't take too
> long.
>
> 4/ lo-mem accounting
>
> Most places that might hold on to emergency memory (e.g. route
> cache, fragment cache etc) already place a limit on the amount of
> memory that they can use. This limit can simply be reserved using
> the above mechanism and no more needs to be done.
>
> However some memory usage might not be accounted with sufficient
> firmness to allow an appropriate emergency reservation. The
> in-flight skbs for incoming packets are (claimed to be) one such
> example.

:-)

> To support this, a low-overhead mechanism for accounting memory
> usage against the reserves is provided. This mechanism uses the
> same data structure that is used to store the emergency memory
> reservations through the addition of a 'usage' field.
>
> When memory allocation for a particular purpose succeeds, the memory
> is checked to see if it is 'reserve' memory. If it is, the size of
> the allocation is added to the 'usage'. If this exceeds the
> reservation, the usage is reduced again and the memory that was
> allocated is freed.
>
> When memory that was allocated for that purpose is freed, the
> 'usage' field is checked again. If it is non-zero, then the size of
> the freed memory is subtracted from the usage, making sure the usage
> never becomes less than zero.
>
> This provides adequate accounting with minimal overheads when not in
> a low memory condition. When a low memory condition is encountered
> it does add the cost of a spin lock necessary to serialise updates
> to 'usage'.

Agreed, minimizing the overhead for the common !net_swap case has been
my goal.
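
The shape this gives the reservation structure is roughly as follows;
the field names are illustrative rather than the patch's exact layout:

#include <linux/list.h>
#include <linux/spinlock.h>

/* One node in the tree of reservation groups (3a above), with the
 * 'usage' field from point 4 bolted on. */
struct mem_reserve {
        struct mem_reserve *parent;     /* e.g. NFS chains to the net group */
        struct list_head children;
        struct list_head siblings;
        const char *name;
        long pages;             /* worst-case reservation */
        long usage;             /* current use; only tracked under pressure */
        spinlock_t lock;        /* serialises updates to 'usage' */
};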

> 5/ swapfile/swap_out/swap_in
>
> So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
> any network socket that it uses, and can know when to account
> reserve memory carefully, new address_space_operations are
> available.
> "swapfile" requests that an address space (i.e a file) be make ready
> for swapout. swap_out and swap_in request the actual IO. They
> together must ensure that each swap_out request can succeed without
> allocating more emergency memory than was reserved by swapfile.

Miklos kindly provided code that slightly outdates this description, but
not in a conceptual way.

I've already heard interest from other people in using these hooks to
provide swap on other non-block filesystems such as jffs2, logfs and the
like.
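
For reference, the operations described in point 5 would look roughly
like the sketch below; the signatures are inferred from the discussion
(and Miklos' changes have already shifted them a little), so treat this
as illustrative:

#include <linux/fs.h>
#include <linux/writeback.h>

/* Sketch of the additions to address_space_operations; shown as a
 * separate struct here purely to keep the example self-contained. */
struct swap_aops_sketch {
        /* make the file ready (or not) for swapout */
        int (*swapfile)(struct address_space *mapping, int enable);
        /* write one swapcache page without further memory allocation */
        int (*swap_out)(struct file *file, struct page *page,
                        struct writeback_control *wbc);
        /* read one swapcache page back in */
        int (*swap_in)(struct file *file, struct page *page);
};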

> Thanks for reading this far. I hope it made sense :-)

It does, and I hope it does so for more people.

2008-03-07 11:55:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Fri, 2008-03-07 at 12:17 +0100, Peter Zijlstra wrote:

> That would be so if the whole path from RX to socket demux would have
> hard-irqs disabled. However I didn't see that. Moreover I think the
> whole purpose of the NetPoll interface is to allow some RX queueing to
> cut down on softirq overhead.

s/NetPoll/NAPI/

More specifically, look at net/core/dev.c:netif_rx().
It has an input queue per device.

> > 2/ If the host is routing network packets, then incoming packets
> > might go on an outbound queue. Is this space limited? and
> > included in the reserve?
>
> Not sure; somewhere along the routing code I lost track again. Constructive
> input from someone versed in that part of the kernel would be most
> welcome.

To clarify, I think we just send it on, as I saw no reason why that could
fail. However the more fancy stuff like egress or QoS might spoil the
party; that is where I lost track.

2008-03-10 05:16:21

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Friday March 7, [email protected] wrote:
> Hi Neil,
>
> I'm so glad you are working with me on this and writing this in human
> English. It seems to be my eternal shortcoming to communicate my ideas
> clearly :-/. Thanks for your effort!

:-)
It always helps to have a second brain with a different perspective.


>
> On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> >
> > [I don't find the above wholly satisfying. There seems to be too much
> > hand-waving. If someone can provide better text explaining why
> > swapout is a special case, that would be great.]
>
> Anonymous pages are dirty by definition (except the zero page, but I
> think we recently ditched it). So shrinking of the anonymous pool will
> require swapping.

Well, there is the swap cache. That's probably what I was thinking of
when I said "clean anonymous pages". I suspect they are the first to
go!

>
> It is indeed the last refuge for those with GFP_NOFS. Along with the
> strict limit on the amount of dirty file pages, it also ensures that writing
> those out will never deadlock the machine, as there are always clean file
> pages and/or anonymous pages to launder.

The difficulty I have is justifying exactly why page-cache writeout
will not deadlock. What if all the memory that is not dirty-pagecache
is anonymous, and if swap isn't enabled?
Maybe the number returned by "determine_dirtyable_memory" in
page-writeback.c excludes anonymous pages? I wonder if the meaning of
NR_FREE_PAGES, NR_INACTIVE, etc is documented anywhere....

...
>
> Right. I've had a long conversation on PG_emergency with Pekka. And I
> think the conclusion was that PG_emergency will create more headaches
> than it solves. I probably have the conversation in my IRC logs and
> could email it if you're interested (and Pekka doesn't object).

Maybe that depends on the exact semantic of PG_emergency ??
I remember you being concerned that PG_emergency never changes between
allocation and freeing, and that wouldn't work well with slub.
My envisioned semantic has it possibly changing quite often.
What it means is:
The last allocation done from this page was in a low-memory
condition.

You really need some way to tell if the result of kmalloc/kmem_cache_alloc
should be treated as reserved.
I think you had code which first tried the allocation without
GFP_MEMALLOC and then if that failed, tried again *with*
GFP_MEMALLOC. If that then succeeded, it is assumed to be an
allocation from reserves. That seemed rather ugly, though I guess you
could wrap it in a function to hide the ugliness:

void *kmalloc_reserve(size_t size, int *reserve, gfp_t gfp_flags)
{
        void *result = kmalloc(size, gfp_flags & ~GFP_MEMALLOC);
        if (result) {
                *reserve = 0;
                return result;
        }
        result = kmalloc(size, gfp_flags | GFP_MEMALLOC);
        if (result) {
                *reserve = 1;
                return result;
        }
        return NULL;
}
???
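
A hypothetical caller would then read as follows, re-using the skb
->emergency flag from earlier in the thread ('size' and 'gfp_mask' stand
for whatever the call site already has):

        int reserve;
        void *data = kmalloc_reserve(size, &reserve, gfp_mask);

        if (data && reserve)
                skb->emergency = 1;     /* whole skb is reserve-backed */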

>
> I've already heard interest from other people in using these hooks to
> provide swap on other non-block filesystems such as jffs2, logfs and the
> like.

I'm interested in the swap_in/swap_out interface for external
write-intent bitmaps for md/raid arrays.
You can have a write-intent bitmap which records which blocks might be
dirty if the host crashes, so that resync is much faster.
It can be stored in a file in a separate filesystem, but that is
currently implemented by using bmap to enumerate the blocks and then
reading/writing directly to the device (like swap). Your interface
would be much nicer for that (not that I think having a
write-intent-bitmap on an NFS filesystem would be a clever idea ;-)

I'll look forward to your next patch set....

One thing I had thought odd while reading the patches, but haven't
found an opportunity to mention before, is the "IS_SWAPFILE" test in
nfs-swapper.patch.
This seems like a layering violation. It would be better if the test
was based on whether ->swapfile had been called on the file. That way
my write-intent-bitmaps would get the same benefit.

NeilBrown

2008-03-10 09:18:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Mon, 2008-03-10 at 16:15 +1100, Neil Brown wrote:

> > On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> > >
> > > [I don't find the above wholly satisfying. There seems to be too much
> > > hand-waving. If someone can provide better text explaining why
> > > swapout is a special case, that would be great.]
> >
> > Anonymous pages are dirty by definition (except the zero page, but I
> > think we recently ditched it). So shrinking of the anonymous pool will
> > require swapping.
>
> Well, there is the swap cache. That's probably what I was thinking of
> when I said "clean anonymous pages". I suspect they are the first to
> go!

Ah, right, we could consider those clean anonymous. Alas, they are just
part of the aging lists and do not get special priority.

> > It is indeed the last refuge for those with GFP_NOFS. Along with the
> > strict limit on the amount of dirty file pages, it also ensures that writing
> > those out will never deadlock the machine, as there are always clean file
> > pages and/or anonymous pages to launder.
>
> The difficulty I have is justifying exactly why page-cache writeout
> will not deadlock. What if all the memory that is not dirty-pagecache
> is anonymous, and if swap isn't enabled?

Ah, I never considered the !SWAP case.

> Maybe the number returned by "determine_dirtyable_memory" in
> page-writeback.c excludes anonymous pages? I wonder if the meaning of
> NR_FREE_PAGES, NR_INACTIVE, etc is documented anywhere....

I don't think they are, but it should be obvious once you know the VM,
har har har :-)

NR_FREE_PAGES are the pages in the page allocator's free lists.
NR_INACTIVE are the pages on the inactive list.
NR_ACTIVE are the pages on the active list.

NR_INACTIVE+NR_ACTIVE are the number of pages on the page reclaim lists.

So, if you consider !SWAP, we could get in a deadlock when all of memory
is anonymous except for a few (<=dirty limit) dirty file pages.

But I guess the !SWAP people know what they're doing; large anon usage
without swap is asking for trouble.

> > Right. I've had a long conversation on PG_emergency with Pekka. And I
> > think the conclusion was that PG_emergency will create more headaches
> > than it solves. I probably have the conversation in my IRC logs and
> > could email it if you're interested (and Pekka doesn't object).
>
> Maybe that depends on the exact semantic of PG_emergency ??
> I remember you being concerned that PG_emergency never changes between
> allocation and freeing, and that wouldn't work well with slub.
> My envisioned semantic has it possibly changing quite often.
> What it means is:
> The last allocation done from this page was in a low-memory
> condition.

Yes, that works, except that we'd need to iterate all pages and clear
PG_emergency - which would imply tracking all these pages, etc.

Hence it would be better not to keep persistent state and do as we do
now; use some non-persistent state on allocation.

> You really need some way to tell if the result of kmalloc/kmem_cache_alloc
> should be treated as reserved.
> I think you had code which first tried the allocation without
> GFP_MEMALLOC and then if that failed, tried again *with*
> GFP_MEMALLOC. If that then succeeded, it is assumed to be an
> allocation from reserves. That seemed rather ugly, though I guess you
> could wrap it in a function to hide the ugliness:
>
> void *kmalloc_reserve(size_t size, int *reserve, gfp_t gfp_flags)
> {
>         void *result = kmalloc(size, gfp_flags & ~GFP_MEMALLOC);
>         if (result) {
>                 *reserve = 0;
>                 return result;
>         }
>         result = kmalloc(size, gfp_flags | GFP_MEMALLOC);
>         if (result) {
>                 *reserve = 1;
>                 return result;
>         }
>         return NULL;
> }
> ???

Yeah, I think this is the best we can do; just split this part out into
helper functions. I've been thinking of doing this - just haven't gotten
around to implementing it. I hope to do so this week and send out a new
series.

> > I've already heard interest from other people in using these hooks to
> > provide swap on other non-block filesystems such as jffs2, logfs and the
> > like.
>
> I'm interested in the swap_in/swap_out interface for external
> write-intent bitmaps for md/raid arrays.
> You can have a write-intent bitmap which records which blocks might be
> dirty if the host crashes, so that resync is much faster.
> It can be stored in a file in a separate filesystem, but that is
> currently implemented by using bmap to enumerate the blocks and then
> reading/writing directly to the device (like swap). Your interface
> would be much nicer for that (not that I think having a
> write-intent-bitmap on an NFS filesystem would be a clever idea ;-)

Hmm, right. But for that purpose the names swap_* are a tad misleading.
I remember hch mentioning this at some point. What would be a more
suitable naming scheme so we can both use it?

> I'll look forward to your next patch set....
>
> One thing I had thought odd while reading the patches, but haven't
> found an opportunity to mention before, is the "IS_SWAPFILE" test in
> nfs-swapper.patch.
> This seems like a layering violation. It would be better if the test
> was based on whether ->swapfile had been called on the file. That way
> my write-intent-bitmaps would get the same benefit.

I'll look into this; I didn't think using an inode test inside a
filesystem implementation was too weird.
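
One way to do that, sketched here with a made-up flag name, would be for
->swapfile() to set a bit on the mapping and for the filesystem to test
that bit instead of IS_SWAPFILE():

#include <linux/fs.h>
#include <linux/bitops.h>

#define AS_SWAPIO 6     /* made-up mapping flag; would live in pagemap.h */

/* Set by ->swapfile(..., 1), cleared by ->swapfile(..., 0). */
static int mapping_is_swapio(struct address_space *mapping)
{
        return test_bit(AS_SWAPIO, &mapping->flags);
}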

2008-03-14 05:23:46

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Monday March 10, [email protected] wrote:
> >
> > Maybe that depends on the exact semantic of PG_emergency ??
> > I remember you being concerned that PG_emergency never changes between
> > allocation and freeing, and that wouldn't work well with slub.
> > My envisioned semantic has it possibly changing quite often.
> > What it means is:
> > The last allocation done from this page was in a low-memory
> > condition.
>
> Yes, that works, except that we'd need to iterate all pages and clear
> PG_emergency - which would imply tracking all these pages, etc.
>

I don't see why you need to clear PG_emergency at all.
If the semantic is:

> > The last allocation done from this page was in a low-memory
> > condition.

Then you only need to (potentially) modify its value when you
allocate it, or an element within it.

But if it doesn't fit well in the overall picture, then by all means
get rid of it.

>
> Hmm, right. But for that purpose the names swap_* are a tad misleading.
> I remember hch mentioning this at some point. What would be a more
> suitable naming scheme so we can both use it?

One could argue that "swap" is already a misleading term.
Version 7 Unix used to do swapping. It would write one process image
out to swap space, and read a different one in. Moving whole
processes at a time was called swapping.
When this clever idea of only moving pages at a time was introduced (I
think in 4BSD, but possibly in 2BSD and elsewhere) it was called
"demand paging" or just "paging".

So we don't have a swap partition any more. We have a paging
partition.

But everyone calls it 'swap' and we know what it means. I don't think
there would be a big cost in keeping the swap_ names but allowing them
to be used for occasional things other than swap.
And I suspect you would lose a lot if you tried to use a different
name that people didn't immediately identify with...

NeilBrown