2008-02-20 15:23:49

by Peter Zijlstra

Subject: [PATCH 00/28] Swap over NFS -v16

Hi,

Another posting of the full swap over NFS series.

Andrew/Linus, could we start thinking of sticking this in -mm?

[ patches against 2.6.25-rc2-mm1, also to be found online at:
http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.25-rc2-mm1/ ]

The patch-set can be split into roughly 5 parts, for each of which I shall give
a description.


Part 1, patches 1-11

The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.

Swap over network has the problem that the network subsystem does not use fixed
sized allocations, but heavily relies on kmalloc(). This makes mempools
unusable.

This first part provides a generic reserve framework. This framework
could also be used to get rid of some of the __GFP_NOFAIL users.

Care is taken to only affect the slow paths - when we're low on memory.

Caveats: it currently doesn't do SLOB.

1 - mm: gfp_to_alloc_flags()
2 - mm: tag reserve pages
3 - mm: sl[au]b: add knowledge of reserve pages
4 - mm: kmem_estimate_pages()
5 - mm: allow PF_MEMALLOC from softirq context
6 - mm: serialize access to min_free_kbytes
7 - mm: emergency pool
8 - mm: system wide ALLOC_NO_WATERMARK
9 - mm: __GFP_MEMALLOC
10 - mm: memory reserve management
11 - selinux: tag avc cache alloc as non-critical
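To give a feel for the interface, a reserve user would do something like
the below. [ Illustrative sketch only; the exact function names and
signatures live in patch 10 and may differ from what is written here. ]

	static struct mem_reserve my_reserve;

	/* declare how much kmalloc memory we need to guarantee progress */
	mem_reserve_init(&my_reserve, "my subsystem");
	mem_reserve_kmalloc_set(&my_reserve, 16 * 1024);

	/* attach to the root reserve, raising the page watermarks */
	mem_reserve_connect(&my_reserve, &mem_reserve_root);

	/* in the critical path: charge first, then allocate with
	 * access to the reserves */
	if (mem_reserve_kmalloc_charge(&my_reserve, size, 0))
		obj = kmalloc(size, gfp | __GFP_MEMALLOC);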


Part 2, patches 12-14

Provide some generic network infrastructure needed later on.

12 - net: wrap sk->sk_backlog_rcv()
13 - net: packet split receive api
14 - net: sk_allocation() - concentrate socket related allocations


Part 3, patches 15-21

Now that we have a generic memory reserve system, use it on the network stack.
The thing that makes this interesting is that, contrary to BIO, both the
transmit and receive paths require memory allocations.

That is, in the BIO layer write back completion is usually just an ISR flipping
a bit and waking stuff up. A network write back completion involves receiving
packets, which, when there is no memory, is rather hard. And even when there is
memory there is no guarantee that the required packet arrives within the
window that memory buys us.

The solution to this problem is found in the fact that the network is to be
assumed lossy. Even now, when there is no memory to receive packets, the
network card has to discard packets. What we do is move this into the network
stack.

So we reserve a little pool to act as a receive buffer; this allows us to
inspect packets before tossing them. This way, we can filter out those packets
that ensure progress (writeback completion) and disregard the others (as would
have happened anyway). [ NOTE: this is a stable mode of operation with limited
memory usage, exactly the kind of thing we need ]

Again, care is taken to confine most of the overhead to the slow path. Only
packets allocated from the reserves will suffer the extra atomic overhead
needed for accounting.

15 - netvm: network reserve infrastructure
16 - netvm: INET reserves.
17 - netvm: hook skb allocation to reserves
18 - netvm: filter emergency skbs.
19 - netvm: prevent a TCP specific deadlock
20 - netfilter: NF_QUEUE vs emergency skbs
21 - netvm: skb processing


Part 4, patches 22-23

Generic vm infrastructure to handle swapping to a filesystem instead of a block
device.

This provides new a_ops to handle swapcache pages and could be used to obsolete
the bmap usage for swapfiles.

22 - mm: add support for non block device backed swap files
23 - mm: methods for teaching filesystems about PG_swapcache pages
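For illustration, the new a_ops take roughly the following shape. [ Sketch
only; the method names (->swapfile, ->swapout, ->swapin) follow the
descriptions in this series, but the exact signatures live in patch 23 and
may differ. ]

	struct address_space_operations {
		...
		/* enable/disable use of a file as swap space */
		int (*swapfile)(struct address_space *mapping, int enable);
		/* like writepage/readpage, but for PG_swapcache pages */
		int (*swapout)(struct file *file, struct page *page,
			       struct writeback_control *wbc);
		int (*swapin)(struct file *file, struct page *page);
	};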


Part 5, patches 24-28

Finally, convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

24 - nfs: remove mempools
25 - nfs: teach the NFS client how to treat PG_swapcache pages
26 - nfs: disable data cache revalidation for swapfiles
27 - nfs: enable swap on NFS
28 - nfs: fix various memory recursions possible with swap over NFS.


Changes since -v15:
- fwd port
- more SGE fragment drivers ported
- made the new swapfile logic unconditional
- various bug fixes and cleanups


2008-02-23 08:21:49

by Andrew Morton

Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[email protected]> wrote:

> Another posting of the full swap over NFS series.

Well I looked. There's rather a lot of it and I wouldn't pretend to
understand it.

What is the NFS and net people's take on all of this?

2008-02-26 06:03:35

by NeilBrown

Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Saturday February 23, [email protected] wrote:
> On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[email protected]> wrote:
>
> > Another posting of the full swap over NFS series.
>
> Well I looked. There's rather a lot of it and I wouldn't pretend to
> understand it.

But pretending is fun :-)

>
> What is the NFS and net people's take on all of this?

Well I'm only vaguely an NFS person, barely a net person, sporadically
an mm person, but I've had a look and it seems to mostly make sense.

We introduce a new "emergency" concept for page allocation.
The size of the emergency pool is set by various reservations by
different potential users.
If the number of free pages is below the "emergency" size, then only
users with a "MEMALLOC" flag get to allocate pages. Further, those
pages get a "reserve" flag set which propagates into slab/slub so
kmalloc/kmem_cache_alloc only return memory from those pages to MEMALLOC
users.
MEMALLOC users are those that set PF_MEMALLOC. A socket can get
SOCK_MEMALLOC set which will cause certain pieces of code to
temporarily set PF_MEMALLOC while working on that socket.
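To make that last point concrete, the sk_backlog_rcv() wrapping presumably
ends up looking something like the following. This is my reconstruction
from the patch descriptions, not the actual code; skb_emergency() is my
guess at the helper that tests whether an skb came from the reserves.

	static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
	{
		int ret;

		if (skb_emergency(skb)) {
			unsigned long pflags = current->flags;

			/* temporarily grant access to the memory reserves */
			current->flags |= PF_MEMALLOC;
			ret = sk->sk_backlog_rcv(sk, skb);
			/* restore the previous PF_MEMALLOC state */
			current->flags = (current->flags & ~PF_MEMALLOC) |
					 (pflags & PF_MEMALLOC);
		} else
			ret = sk->sk_backlog_rcv(sk, skb);

		return ret;
	}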

The upshot is that, provided any MEMALLOC user reserves an appropriate
amount of emergency space, returns the emergency memory promptly, and
sets PF_MEMALLOC whenever allocating memory, its memory allocations
should never fail.

As memory is requested in small units, but allocated as pages, there
needs to be a conversion from small units to pages. One of the
patches does this and appears to err on the side of being over-generous,
which is the right thing to do.


Memory reservations are organised in a tree. I really don't
understand the tree. Is it just to make /proc/reserve_info look more
helpful?
Certainly all the individual reservations need to be recorded, and the
cumulative reservation needs also to be recorded (currently in the
root of the tree) but what are all the other levels used for?

Reservations are used for all the transient memory that might be used
by the network stack. This particularly includes the route cache and
skbs for incoming messages. I have no idea if there is anything else
that needs to be allowed for.

Filesystems can advertise (via address_space_operations) that files
may be used as swap files. They then provide swapout/swapin methods
which are like writepage/readpage but may behave differently and have
a different way to get credentials from a 'struct file'.


So in general, the patch set looks to have the right sort of shape. I
cannot be very authoritative on the details as there are a lot of
them, and they touch code that I'm not very familiar with.

Some specific comments on patches:


reserve-slub.patch

Please avoid irrelevant reformatting in patches. It makes them
harder to read. e.g.:

-static void setup_object(struct kmem_cache *s, struct page *page,
- void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)


mm-kmem_estimate_pages.patch

This introduces
kestimate
kestimate_single
kmem_estimate_pages

The last obviously returns a number of pages. The contrast seems
to suggest the others don't. But they do...
I don't think the names are very good, but I concede that it is
hard to choose good names here. Maybe:
kmalloc_estimate_variable
kmalloc_estimate_fixed
kmem_alloc_estimate
???

mm-reserve.patch

I'm confused by __mem_reserve_add.

+ reserve = mem_reserve_root.pages;
+ __calc_reserve(res, pages, 0);
+ reserve = mem_reserve_root.pages - reserve;

__calc_reserve will always add 'pages' to mem_reserve_root.pages.
So this is a complex way of doing
reserve = pages;
__calc_reserve(res, pages, 0);

And as you can calculate reserve before calling __calc_reserve
(which seems odd when stated that way), the whole function looks
like it could become:

ret = adjust_memalloc_reserve(pages);
if (!ret)
	__calc_reserve(res, pages, limit);
return ret;

What am I missing?

Also, mem_reserve_disconnect really should be a "void" function.
Just put a BUG_ON(ret) and don't return anything.

Finally, I'll just repeat that the purpose of the tree structure
eludes me.

net-sk_allocation.patch

Why are the "GFP_KERNEL" call sites just changed to
"sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??

I assume there is a good reason, and seeing it in the change log
would educate me and make the patch more obviously correct.

netvm-reserve.patch

Function names again:

sk_adjust_memalloc
sk_set_memalloc

sound similar. Purpose is completely different.

mm-page_file_methods.patch

This makes page_offset and others more expensive by adding a
conditional jump to a function call that is not usually made.

Why do swap pages have a different index to everyone else?

nfs-swap_ops.patch

What happens if you have two swap files on the same NFS
filesystem?
I assume ->swapfile gets called twice. But it hasn't been written
to nest, so the first swapoff will disable swapping for both
files??

That's all for now :-)

NeilBrown

2008-02-26 10:51:17

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16

Hi Neil,


On Tue, 2008-02-26 at 17:03 +1100, Neil Brown wrote:
> On Saturday February 23, [email protected] wrote:

> > What is the NFS and net people's take on all of this?
>
> Well I'm only vaguely an NFS person, barely a net person, sporadically
> an mm person, but I've had a look and it seems to mostly make sense.

Thanks for taking a look, and giving such elaborate feedback. I'll try
and address these issues asap, but first let me reply to a few points
here.

> We introduce a new "emergency" concept for page allocation.
> The size of the emergency pool is set by various reservations by
> different potential users.
> If the number of free pages is below the "emergency" size, then only
> users with a "MEMALLOC" flag get to allocate pages. Further, those
> pages get a "reserve" flag set which propagates into slab/slub so
> kmalloc/kmem_cache_alloc only return memory from those pages to MEMALLOC
> users.
> MEMALLOC users are those that set PF_MEMALLOC. A socket can get
> SOCK_MEMALLOC set which will cause certain pieces of code to
> temporarily set PF_MEMALLOC while working on that socket.

Small detail: there is also __GFP_MEMALLOC, which is used for single
allocations to avoid setting and unsetting PF_MEMALLOC - like in the skb
alloc, once we have determined we would otherwise fail and still have room.

> The upshot is that, provided any MEMALLOC user reserves an appropriate
> amount of emergency space, returns the emergency memory promptly, and
> sets PF_MEMALLOC whenever allocating memory, its memory allocations
> should never fail.
>
> As memory is requested in small units, but allocated as pages, there
> needs to be a conversion from small units to pages. One of the
> patches does this and appears to err on the side of being over-generous,
> which is the right thing to do.
>
>
> Memory reservations are organised in a tree. I really don't
> understand the tree. Is it just to make /proc/reserve_info look more
> helpful?
> Certainly all the individual reservations need to be recorded, and the
> cumulative reservation needs also to be recorded (currently in the
> root of the tree) but what are all the other levels used for?

Ah, there is a little trick there, I hint at that in the reserve.c
description comment:

+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent with
+ * who is charged, resource usage is not propagated up the tree (for
+ * performance reasons).

And I actually use that. If we show a little of the tree (which Andrew
rightly dislikes for not being machine-parseable - will fix):

+ * localhost ~ # cat /proc/reserve_info
+ * total reserve 8156K (0/544817)
+ * total network reserve 8156K (0/544817)
+ * network TX reserve 196K (0/49)
+ * protocol TX pages 196K (0/49)
+ * network RX reserve 7960K (0/544768)
+ * IPv6 route cache 1372K (0/4096)
+ * IPv4 route cache 5468K (0/16384)
+ * SKB data reserve 1120K (0/524288)
+ * IPv6 fragment cache 560K (0/262144)
+ * IPv4 fragment cache 560K (0/262144)

We see that the 'SKB data reserve' is built up from the IPv4 and IPv6
fragment cache reserves.

I use the 'SKB data reserve' to charge memory against and account usage,
but use its children to grow/shrink the actual reserve.

This allows you to see the individual reserves, but still use an
aggregate.

The tree form is the simplest structure that allowed such things; another
nice thing is that you can easily detach whole sub-trees to stop actually
reserving the memory, while continuing to track their potential needs.

This is done when there are no SOCK_MEMALLOC sockets around. The 'total
network reserve' is detached, reducing the 'total reserve' to 0
(assuming no other reserve trees) but the individual reserves are still
tracking their potential need for when it will be re-attached.

With only a single user this might seem like a little too much, but I have
hopes for more users.

> Reservations are used for all the transient memory that might be used
> by the network stack. This particularly includes the route cache and
> skbs for incoming messages. I have no idea if there is anything else
> that needs to be allowed for.

This is something I'd like feedback on from the network gurus. In my
reading there weren't many other allocation sites, but hey, I'm not much
of a net person myself. (I did write some instrumentation to track
allocations, but I'm sure I didn't get full coverage of the stack with
my simple usage.)

> Filesystems can advertise (via address_space_operations) that files
> may be used as swap files. They then provide swapout/swapin methods
> which are like writepage/readpage but may behave differently and have
> a different way to get credentials from a 'struct file'.

Yes, the added benefit is that even regular blockdev filesystem swap
files could move to this interface and we'd finally be able to remove
->bmap().

> So in general, the patch set looks to have the right sort of shape. I
> cannot be very authoritative on the details as there are a lot of
> them, and they touch code that I'm not very familiar with.
>
> Some specific comments on patches:
>
>
> reserve-slub.patch
>
> Please avoid irrelevant reformatting in patches. It makes them
> harder to read. e.g.:
>
> -static void setup_object(struct kmem_cache *s, struct page *page,
> - void *object)
> +static void setup_object(struct kmem_cache *s, struct page *page, void *object)

Right, I'll split out the cleanups and send those in separately.

> mm-kmem_estimate_pages.patch
>
> This introduces
> kestimate
> kestimate_single
> kmem_estimate_pages
>
> The last obviously returns a number of pages. The contrast seems
> to suggest the others don't. But they do...
> I don't think the names are very good, but I concede that it is
> hard to choose good names here. Maybe:
> kmalloc_estimate_variable
> kmalloc_estimate_fixed
> kmem_alloc_estimate
> ???

You caught me here (and further on), I'm one of those who needs a little
help when it comes to names :-). I'll try and improve the ones you
pointed out.

> mm-reserve.patch
>
> I'm confused by __mem_reserve_add.
>
> + reserve = mem_reserve_root.pages;
> + __calc_reserve(res, pages, 0);
> + reserve = mem_reserve_root.pages - reserve;
>
> __calc_reserve will always add 'pages' to mem_reserve_root.pages.
> So this is a complex way of doing
> reserve = pages;
> __calc_reserve(res, pages, 0);
>
> And as you can calculate reserve before calling __calc_reserve
> (which seems odd when stated that way), the whole function looks
> like it could become:
>
> ret = adjust_memalloc_reserve(pages);
> if (!ret)
> 	__calc_reserve(res, pages, limit);
> return ret;
>
> What am I missing?

Probably the horrible twist my brain has. Looking at it makes me doubt
my own sanity. I think you're right - it would also clean up
__calc_reserve() a little.

This is what review is for :-)

> Also, mem_reserve_disconnect really should be a "void" function.
> Just put a BUG_ON(ret) and don't return anything.

Agreed, I was being over cautious here. I'll WARN_ON, as Andrew is
scared of BUGs :-)

> Finally, I'll just repeat that the purpose of the tree structure
> eludes me.

I hope to have cleared that up. It came from the desire of my users
(there are quite a few out there who use some form of this code) to see
what the reserve is made up of - to see what tunables to use, and their
effect.

The tree form was the easiest that allowed me to keep individual
reserves and work with aggregates. (Must confess, some nodes are purely
decoration).

> net-sk_allocation.patch
>
> Why are the "GFP_KERNEL" call sites just changed to
> "sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??
>
> I assume there is a good reason, and seeing it in the change log
> would educate me and make the patch more obviously correct.

Good point. I think it's because of legacy (with this patch-set) reasons.
It's not needed for my use of sk_allocation(), but that gives me no right
to deny other people more creative uses of this construct. I'll rectify
this.

> netvm-reserve.patch
>
> Function names again:
>
> sk_adjust_memalloc
> sk_set_memalloc
>
> sound similar. Purpose is completely different.

The naming thing,... I'll try and come up with better ones.

> mm-page_file_methods.patch
>
> This makes page_offset and others more expensive by adding a
> conditional jump to a function call that is not usually made.
>
> Why do swap pages have a different index to everyone else?

Because the page->index of an anonymous page is related to its (anon)vma
so that it satisfies the constraints for vm_normal_page().

The index in the swap file is totally unrelated and quite random. Hence
the swap-cache uses page->private to store it.

Moving these functions inline (esp __page_file_index seems doable)
results in a horrible include hell.
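Roughly, simplified from the patch (the details of the swp_entry_t
handling may differ):

	pgoff_t __page_file_index(struct page *page)
	{
		swp_entry_t swap = { .val = page_private(page) };

		/* swapcache pages keep their swapfile offset in
		 * page->private, not page->index */
		VM_BUG_ON(!PageSwapCache(page));
		return swp_offset(swap);
	}

	static inline pgoff_t page_file_index(struct page *page)
	{
		if (unlikely(PageSwapCache(page)))
			return __page_file_index(page);

		return page->index;
	}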

> nfs-swap_ops.patch
>
> What happens if you have two swap files on the same NFS
> filesystem?
> I assume ->swapfile gets called twice. But it hasn't been written
> to nest, so the first swapoff will disable swapping for both
> files??

Hmm,.. you are quite right (again). I failed to consider this. Not sure
how to rectify this as xprt->swapper is but a single bit.

I'll think about this.

Thanks!

2008-02-26 12:01:12

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Tue, 2008-02-26 at 11:50 +0100, Peter Zijlstra wrote:

> > mm-reserve.patch
> >
> > I'm confused by __mem_reserve_add.
> >
> > + reserve = mem_reserve_root.pages;
> > + __calc_reserve(res, pages, 0);
> > + reserve = mem_reserve_root.pages - reserve;
> >
> > __calc_reserve will always add 'pages' to mem_reserve_root.pages.
> > So this is a complex way of doing
> > reserve = pages;
> > __calc_reserve(res, pages, 0);
> >
> > And as you can calculate reserve before calling __calc_reserve
> > (which seems odd when stated that way), the whole function looks
> > like it could become:
> >
> > ret = adjust_memalloc_reserve(pages);
> > if (!ret)
> > 	__calc_reserve(res, pages, limit);
> > return ret;
> >
> > What am I missing?
>
> Probably the horrible twist my brain has. Looking at it makes me doubt
> my own sanity. I think you're right - it would also clean up
> __calc_reserve() a little.
>
> This is what review is for :-)

Ah, you confused me. Well, I confused me - this does deserve a comment,
it's tricksy.

It's correct. The trick is, the mem_reserve in question (res) need not be
connected to mem_reserve_root.

In that case, mem_reserve_root.pages will not change, but we do
propagate the change as far up as possible, so that
mem_reserve_connect() can just observe the parent and child without
being bothered by the rest of the hierarchy.


2008-02-26 15:30:30

by Miklos Szeredi

Subject: Re: [PATCH 00/28] Swap over NFS -v16

> > mm-page_file_methods.patch
> >
> > This makes page_offset and others more expensive by adding a
> > conditional jump to a function call that is not usually made.
> >
> > Why do swap pages have a different index to everyone else?
>
> Because the page->index of an anonymous page is related to its (anon)vma
> so that it satisfies the constraints for vm_normal_page().
>
> The index in the swap file is totally unrelated and quite random. Hence
> the swap-cache uses page->private to store it.

Yeah, and putting the condition into page_offset() will confuse code
which uses it for finding the offset in the VMA or in a tmpfs file.

So why not just have a separate page_swap_offset() function, used
exclusively by swap_in/out()?

Miklos

2008-02-26 15:42:32

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Tue, 2008-02-26 at 16:29 +0100, Miklos Szeredi wrote:
> > > mm-page_file_methods.patch
> > >
> > > This makes page_offset and others more expensive by adding a
> > > conditional jump to a function call that is not usually made.
> > >
> > > Why do swap pages have a different index to everyone else?
> >
> > Because the page->index of an anonymous page is related to its (anon)vma
> > so that it satisfies the constraints for vm_normal_page().
> >
> > The index in the swap file is totally unrelated and quite random. Hence
> > the swap-cache uses page->private to store it.
>
> Yeah, and putting the condition into page_offset() will confuse code
> which uses it for finding the offset in the VMA

Right, do we do that anywhere?

> or in a tmpfs file.

Good point. I really should go read tmpfs some day; it's really a blind
spot for me.

> So why not just have a separate page_swap_offset() function, used
> exclusively by swap_in/out()?

That would require duplicating quite a lot of NFS code from what I can
see.

2008-02-26 15:44:36

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Tue, 2008-02-26 at 16:29 +0100, Miklos Szeredi wrote:
> > > mm-page_file_methods.patch
> > >
> > > This makes page_offset and others more expensive by adding a
> > > conditional jump to a function call that is not usually made.
> > >
> > > Why do swap pages have a different index to everyone else?
> >
> > Because the page->index of an anonymous page is related to its (anon)vma
> > so that it satisfies the constraints for vm_normal_page().
> >
> > The index in the swap file is totally unrelated and quite random. Hence
> > the swap-cache uses page->private to store it.
>
> Yeah, and putting the condition into page_offset() will confuse code
> which uses it for finding the offset in the VMA or in a tmpfs file.
>
> So why not just have a separate page_swap_offset() function, used
> exclusively by swap_in/out()?

Ah, we can add page_file_offset() to match page_file_index() and
page_file_mapping(), and convert NFS to use page_file_offset() where
appropriate, as I already did for the others.

That would sort out the mess, right?
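Something like this, that is [ sketch; assuming the same shift convention
as page_offset() ]:

	static inline loff_t page_file_offset(struct page *page)
	{
		return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
	}

Identical to page_offset(), except it goes through page_file_index() so
that swapcache pages yield their offset in the swapfile.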

2008-02-26 15:47:59

by Miklos Szeredi

Subject: Re: [PATCH 00/28] Swap over NFS -v16

> > > > mm-page_file_methods.patch
> > > >
> > > > This makes page_offset and others more expensive by adding a
> > > > conditional jump to a function call that is not usually made.
> > > >
> > > > Why do swap pages have a different index to everyone else?
> > >
> > > Because the page->index of an anonymous page is related to its (anon)vma
> > > so that it satisfies the constraints for vm_normal_page().
> > >
> > > The index in the swap file is totally unrelated and quite random. Hence
> > > the swap-cache uses page->private to store it.
> >
> > Yeah, and putting the condition into page_offset() will confuse code
> > which uses it for finding the offset in the VMA or in a tmpfs file.
> >
> > So why not just have a separate page_swap_offset() function, used
> > exclusively by swap_in/out()?
>
> Ah, we can do the page_file_offset() to match page_file_index() and
> page_file_mapping(). And convert NFS to use page_file_offset() where
> appropriate, as I already did for these others.
>
> That would sort out the mess, right?

Yes, that sounds perfect.

Miklos

2008-02-26 17:57:28

by Andrew Morton

Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Tue, 26 Feb 2008 11:50:42 +0100 Peter Zijlstra <[email protected]> wrote:

> On Tue, 2008-02-26 at 17:03 +1100, Neil Brown wrote:
> > On Saturday February 23, [email protected] wrote:
>
> > > What is the NFS and net people's take on all of this?
> >
> > Well I'm only vaguely an NFS person, barely a net person, sporadically
> > an mm person, but I've had a look and it seems to mostly make sense.
>
> Thanks for taking a look, and giving such elaborate feedback. I'll try
> and address these issues asap, but first let me reply to a few points
> here.

Neil's overview of what-all-this-is and how-it-all-works is really good.
I'd suggest that you take it over, flesh it out and attach it firmly to the
patchset. It really helps.

2008-02-27 05:51:23

by NeilBrown

Subject: Re: [PATCH 00/28] Swap over NFS -v16


Hi Peter,

On Tuesday February 26, [email protected] wrote:
> Hi Neil,
>
> Thanks for taking a look, and giving such elaborate feedback. I'll try
> and address these issues asap, but first let me reply to a few points
> here.

Thanks... the tree thing is starting to make sense, and I'm not
confused by __mem_reserve_add any more :-)

I've been having a closer read of some of the code that I skimmed over
before and I have some more questions.

1/ I note there is no way to tell if memory returned by kmalloc is
from the emergency reserve - which contrasts with alloc_page
which does make that information available through page->reserve.
This seems a slightly unfortunate aspect of the interface.

It seems to me that __alloc_skb could be simpler if this
information was available. It currently tries the allocation
normally, then if that fails it retries with __GFP_MEMALLOC set and
if that works it assumes it was from the emergency pool ... which it
might not be, though the assumption is safe enough (see the sketch
after this list).

It would seem to be nicer if you could just make the one alloc call,
setting __GFP_MEMALLOC if that might be appropriate. Then if the
memory came from the emergency reserve, flag it as an emergency skb.

However doing that would have issues with reservations.... the
mem_reserve would need to be passed to kmalloc :-(

2/ It doesn't appear to be possible to wait on a reservation. i.e. if
mem_reserve_*_charge fails, we might sometimes want to wait until
it succeeds. This feature is an integral part of how mempools
work and are used. If you want reservations to (be able to)
replace mempools, then waiting really is needed.

It seems that waiting would work best if it was done quite low down
inside kmalloc. That would require kmalloc to know which
'mem_reserve' you are using, which it currently doesn't.

If it did, then it could choose to use emergency pages if the
mem_reserve allowed it more space, otherwise require a regular page.
And if __GFP_WAIT is set then each time around the loop it could
revise whether to use an emergency page or not, based on whether it
can successfully charge the reservation.

Of course, having a mem_reserve available for PF_MEMALLOC
allocations would require changing every kmalloc call in the
protected region, which might not be ideal, but might not be a
major hassle, and would ensure that you find all kmalloc calls that
might be made while in PF_MEMALLOC state.

3/ Thinking about the tree structure a bit more: Your motivation
seems to be that it allows you to make two separate reservations,
and then charge some memory usage against either one or the other.
This seems to go against a key attribute of reservations. I would
have thought that an important rule about reservations is that
no-one else can use memory reserved for a particular purpose.
So if IPv6 reserves some memory, and IPv4 uses it, that doesn't
seem like something we want to encourage...


4/ __netdev_alloc_page is bothering me a bit.
This is used to allocate memory for incoming fragments in a
(potentially multi-fragment) packet. But we only call rx_emergency_get
for each page as it arrives rather than all at once at the start.
So you could have a situation where two very large packets are
arriving at the same time and there is enough reserve to hold
either of them but not both. The current code will hand out that
reservation a little (well, one page) at a time to each packet and
will run out before either packet has been fully received. This
seems like a bad thing. Is it?

Is it possible to do the rx_emergency_get just once for the whole
packet, I wonder?
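
[ The sketch referred to in 1/ - the current double-allocation dance in
__alloc_skb as I read it; simplified from memory, not the patch verbatim,
and rx_emergency_put() is assumed to be the obvious inverse of
rx_emergency_get():

	data = kmalloc(size, gfp_mask);
	if (!data && rx_emergency_get(size)) {
		/* retry with access to the reserves, and *assume* the
		 * object actually came from there */
		data = kmalloc(size, gfp_mask | __GFP_MEMALLOC);
		if (data)
			emergency = 1;
		else
			rx_emergency_put(size);
	}

If kmalloc could tell us where the memory came from, both the double
allocation and the assumption would disappear. ]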

NeilBrown

2008-02-27 07:59:10

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Wed, 2008-02-27 at 16:51 +1100, Neil Brown wrote:
> Hi Peter,
>
> On Tuesday February 26, [email protected] wrote:
> > Hi Neil,
> >
> > Thanks for taking a look, and giving such elaborate feedback. I'll try
> > and address these issues asap, but first let me reply to a few points
> > here.
>
> Thanks... the tree thing is starting to make sense, and I'm not
> confused my __mem_reserve_add any more :-)
>
> I've been having a closer read of some of the code that I skimmed over
> before and I have some more questions.
>
> 1/ I note there is no way to tell if memory returned by kmalloc is
> from the emergency reserve - which contrasts with alloc_page
> which does make that information available through page->reserve.
> This seems a slightly unfortunate aspect of the interface.

Yes, but alas there is no room to store such information in kmalloc().
That is, in a sane way. I think it was Daniel Phillips who suggested
encoding it in the return pointer by flipping the low bit - but that is
just too ugly and breaks all current kmalloc sites to boot.

> It seems to me that __alloc_skb could be simpler if this
> information was available. It currently tries the allocation
> normally, then if that fails it retries with __GFP_MEMALLOC set and
> if that works it assumes it was from the emergency pool ... which it
> might not be, though the assumption is safe enough.
>
> It would seem to be nicer if you could just make the one alloc call,
> setting __GFP_MEMALLOC if that might be appropriate. Then if the
> memory came from the emergency reserve, flag it as an emergency skb.
>
> However doing that would have issues with reservations.... the
> mem_reserve would need to be passed to kmalloc :-(

Yes, it would require a massive overhaul of quite a few things. I agree,
it would all be nicer, but I think you see why I didn't do it.

> 2/ It doesn't appear to be possible to wait on a reservation. i.e. if
> mem_reserve_*_charge fails, we might sometimes want to wait until
> it succeeds. This feature is an integral part of how mempools
> work and are used. If you want reservations to (be able to)
> replace mempools, then waiting really is needed.
>
> It seems that waiting would work best if it was done quite low down
> inside kmalloc. That would require kmalloc to know which
> 'mem_reserve' you are using, which it currently doesn't.
>
> If it did, then it could choose to use emergency pages if the
> mem_reserve allowed it more space, otherwise require a regular page.
> And if __GFP_WAIT is set then each time around the loop it could
> revise whether to use an emergency page or not, based on whether it
> can successfully charge the reservation.

Like mempools, we could add a wrapper with a mem_reserve and waitqueue
inside, strip __GFP_WAIT, try, see if the reservation allows, and wait
if not.

I haven't yet done such a wrapper because it wasn't needed. But it could
be done.
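
Something along these lines, say [ untested sketch; the wrapper names are
made up on the spot ]:

	struct reserve_pool {
		struct mem_reserve	*res;
		wait_queue_head_t	wait;
	};

	void *reserve_kmalloc(struct reserve_pool *pool, size_t size, gfp_t gfp)
	{
		void *obj;

		/* first attempt: never sleep, never touch the reserves */
		obj = kmalloc(size, gfp & ~__GFP_WAIT);
		if (obj)
			return obj;

		if (!mem_reserve_kmalloc_charge(pool->res, size, 0)) {
			if (!(gfp & __GFP_WAIT))
				return NULL;
			/* wait until enough reserve memory is returned
			 * for the charge to succeed */
			wait_event(pool->wait,
				   mem_reserve_kmalloc_charge(pool->res, size, 0));
		}

		obj = kmalloc(size, (gfp & ~__GFP_WAIT) | __GFP_MEMALLOC);
		if (!obj)
			mem_reserve_kmalloc_charge(pool->res, -size, 0);
		return obj;
	}

The matching reserve_kfree() would un-charge and do a
wake_up(&pool->wait).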

> Of course, having a mem_reserve available for PF_MEMALLOC
> allocations would require changing every kmalloc call in the
> protected region, which might not be ideal, but might not be a
> major hassle, and would ensure that you find all kmalloc calls that
> might be made while in PF_MEMALLOC state.

I assumed the current PF_MEMALLOC usage was good enough for the current
reserves - not quite true as it's potentially unlimited, but it seems to
work in practice.

I did try to find all allocation sites in the paths I enabled
PF_MEMALLOC over.

> 3/ Thinking about the tree structure a bit more: Your motivation
> seems to be that it allows you to make two separate reservations,
> and then charge some memory usage against either one or the other.
> This seems to go against a key attribute of reservations. I would
> have thought that an important rule about reservations is that
> no-one else can use memory reserved for a particular purpose.
> So if IPv6 reserves some memory, and IPv4 uses it, that doesn't
> seem like something we want to encourage...

Well, we only have one kind of skb; a network packet doesn't know if it
belongs to IPv4 or IPv6 (or a whole different address family) when
it comes in. So we grow the skb pool to overflow both defragment caches.

But yeah, it's something where you need to know what you're doing - as
with so many other things in the kernel, hence I didn't worry too much.

> 4/ __netdev_alloc_page is bothering me a bit.
> This is used to allocate memory for incoming fragments in a
> (potentially multi-fragment) packet. But we only call rx_emergency_get
> for each page as it arrives rather than all at once at the start.
> So you could have a situation where two very large packets are
> arriving at the same time and there is enough reserve to hold
> either of them but not both. The current code will hand out that
> reservation a little (well, one page) at a time to each packet and
> will run out before either packet has been fully received. This
> seems like a bad thing. Is it?
>
> Is it possible to do the rx_emergency_get just once for the whole
> packet, I wonder?

I honestly don't know enough about network cards and drivers to answer
this. It was a great feat that I managed this much :-)

2008-02-27 08:05:42

by Pekka Enberg

Subject: Re: [PATCH 00/28] Swap over NFS -v16

Hi Peter,

On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <[email protected]> wrote:
> > 1/ I note there is no way to tell if memory returned by kmalloc is
> > from the emergency reserve - which contrasts with alloc_page
> > which does make that information available through page->reserve.
> > This seems a slightly unfortunate aspect of the interface.
>
> Yes, but alas there is no room to store such information in kmalloc().
> That is, in a sane way. I think it was Daniel Phillips who suggested
> encoding it in the return pointer by flipping the low bit - but that is
> just too ugly and breaks all current kmalloc sites to boot.

Why can't you add a kmem_is_emergency() to SLUB that looks up the
cache/slab/page (whatever is the smallest unit of the emergency pool
here) for the object and use that?

2008-02-27 08:14:45

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> Hi Peter,
>
> On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <[email protected]> wrote:
> > > 1/ I note there is no way to tell if memory returned by kmalloc is
> > > from the emergency reserve - which contrasts with alloc_page
> > > which does make that information available through page->reserve.
> > > This seems a slightly unfortunate aspect of the interface.
> >
> > Yes, but alas there is no room to store such information in kmalloc().
> > That is, in a sane way. I think it was Daniel Phillips who suggested
> > encoding it in the return pointer by flipping the low bit - but that is
> > just too ugly and breaks all current kmalloc sites to boot.
>
> Why can't you add a kmem_is_emergency() to SLUB that looks up the
> cache/slab/page (whatever is the smallest unit of the emergency pool
> here) for the object and use that?

There is an idea.. :-) It would mean preserving page->reserve, but SLUB
has plenty of page flags to pick from. Or maybe I should move the thing
to a page flag anyway. If we do that, SLAB would allow something similar:
just look up the page for whatever address you get and look at PG_emerg
or something.

Having this would clean things up. I'll go work on this.

2008-02-27 08:34:31

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Wed, 2008-02-27 at 09:14 +0100, Peter Zijlstra wrote:
> On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> > Hi Peter,
> >
> > On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <[email protected]> wrote:
> > > > 1/ I note there is no way to tell if memory returned by kmalloc is
> > > > from the emergency reserve - which contrasts with alloc_page
> > > > which does make that information available through page->reserve.
> > > > This seems a slightly unfortunate aspect of the interface.
> > >
> > > Yes, but alas there is no room to store such information in kmalloc().
> > > That is, in a sane way. I think it was Daniel Phillips who suggested
> > > encoding it in the return pointer by flipping the low bit - but that is
> > > just too ugly and breaks all current kmalloc sites to boot.
> >
> > Why can't you add a kmem_is_emergency() to SLUB that looks up the
> > cache/slab/page (whatever is the smallest unit of the emergency pool
> > here) for the object and use that?
>
> There is an idea.. :-) It would mean preserving page->reserve, but SLUB
> has plenty of page flags to pick from. Or maybe I should move the thing
> to a page flag anyway. If we do that, SLAB would allow something similar:
> just look up the page for whatever address you get and look at PG_emerg
> or something.
>
> Having this would clean things up. I'll go work on this.

Humm, and here I sit staring at the screen. Perhaps I should go get my
morning juice, but...

if (mem_reserve_kmalloc_charge(my_res, sizeof(*foo), 0)) {
	foo = kmalloc(sizeof(*foo), gfp | __GFP_MEMALLOC);
	if (!kmem_is_emergency(foo))
		mem_reserve_kmalloc_charge(my_res, -sizeof(*foo), 0);
} else
	foo = kmalloc(sizeof(*foo), gfp);

Just doesn't look too pretty..

And needing to always account the allocation seems wrong.. but I'll take
poison and see if that wakes up my mind.


2008-02-27 08:43:45

by Pekka Enberg

Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Wed, 27 Feb 2008, Peter Zijlstra wrote:
> Humm, and here I sit staring at the screen. Perhaps I should go get my
> morning juice, but...
>
> if (mem_reserve_kmalloc_charge(my_res, sizeof(*foo), 0)) {
> 	foo = kmalloc(sizeof(*foo), gfp | __GFP_MEMALLOC);
> 	if (!kmem_is_emergency(foo))
> 		mem_reserve_kmalloc_charge(my_res, -sizeof(*foo), 0);
> } else
> 	foo = kmalloc(sizeof(*foo), gfp);
>
> Just doesn't look too pretty..
>
> And needing to always account the allocation seems wrong.. but I'll take
> poison and see if that wakes up my mind.

Hmm, perhaps this is just hand-waving but why don't you have a
kmalloc_reserve() function in SLUB that does the accounting properly?

Pekka

2008-02-29 01:29:27

by NeilBrown

Subject: Re: [PATCH 00/28] Swap over NFS -v16


So I've been pondering all this some more trying to find the pattern,
and things are beginning to crystalise (I hope).

One of the approaches I have been taking is to compare it to mempools
(which I think I understand) and work out what the important
differences are.

One difference is that you don't wait for memory to become available
(as I mentioned earlier). Rather you just try to get the memory and
if it isn't available, you drop the packet. This probably makes sense
for incoming packets as you rely on the packet being re-sent, and
hopefully various back-off algorithms will slow things down a bit so
that there is a good chance that memory will be available next time...

For outgoing messages I'm less clear on exactly what is going on.
Maybe I haven't looked at that code properly yet, but I would expect
there would be a place for waiting for memory to become available
somewhere in the out-going path ??

But there is another important difference to mempools which I think is
worth exploring. With mempools, you are certain that the memory will
only be used to make forward progress in writing out dirty data. So
if you find that there isn't enough memory at the moment and you have
to wait, you can be sure someone else is making forward progress and
so waiting isn't such a bad thing.

With your reservations it isn't quite the same. Reserved memory can
be used for related purposes. In particular, any incoming packet can
use some reserved memory. Once the purpose of that packet is
discovered (i.e. the matching socket is found), the memory will be
freed again. But there is a period of time when memory is being used
for an inappropriate purpose. The consequences of this should be
clearly understood.

In particular, the memory that is reserved for the emergency pool
should include some overhead to acknowledge the fact that memory
might be used for short periods of time for unrelated purposes.

I think we can fit this acknowledgement into the current model quite
easily, and it makes the tree structure suddenly make lots of sense
(whereas before I was still struggling with it).

A key observation in this design is "Sometimes we need to allocate
emergency memory without knowing exactly what it is going to be used
for". I think we should make that explicit in the implementation as
follows:

We have a tree of reservations (as you already do) where levels in
the tree correspond to more explicit knowledge of how the memory
will be used.
At the top level there is a generic 'page' reservation. Below that
to one side we have a 'SLUB/SLAB' reservation. I'm not sure yet
exactly what that will look like.
Also below the 'page' reservation is a reservation for pages to hold
incoming network fragments.
Below the SLxB reservation is a reservation for skbs, which is
parent to a reservation for IPv4 skbs and another for IPv6 skbs.

Each of these nodes has its own independent reservation - parents are
not simply the sum of the children.
The sum over the whole tree is given to the VM as the size of the
emergency pool to reserve for emergency allocations.

Now, every actual allocation from the emergency pool effectively comes
in at the top of the tree and moves down as its purpose is more fully
understood. Every emergency allocation is *always* charged to one
node in the tree, though which node may change.

e.g.
A network driver asks for a page to store a fragment.
netdev_alloc_page calls alloc_page with __GFP_MEMALLOC set.
If alloc_page needs to dive into the emergency pool, it first
charges the one page against the root for the reservation tree.
If this succeeds, it returns the page with ->reserve set. If the
reservation fails, it ignores the __GFP_MEMALLOC and fails.
netdev_alloc_page notices that the page is a ->reserve page, and
knows that it has been charged to the top 'page' reservation, but it
should be charged to the network-page reservation. So it tries to
charge against the network-pages reservation, and reverses the
charge against 'pages'. If the network-pages reservation fails, the
page is freed and netdev_alloc_page fails.
As you can see, the charge moves down the tree as more information
becomes available.

Similarly a charge might move from 'pages' to 'SLxB' to 'net_skb' to
'ipv4_skb'.
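
[ In code, the netdev_alloc_page step might look like the following -
purely illustrative, mem_reserve_pages_charge() and the reserve names
are invented:

	struct page *__netdev_alloc_page(gfp_t gfp)
	{
		struct page *page = alloc_page(gfp | __GFP_MEMALLOC);

		if (page && page->reserve) {
			/* alloc_page() charged the root 'pages' node;
			 * move the charge down to the network-pages node */
			if (!mem_reserve_pages_charge(&net_rx_reserve, 1)) {
				/* freeing reverses the root charge */
				__free_page(page);
				return NULL;
			}
			mem_reserve_pages_charge(&page_reserve, -1);
		}
		return page;
	}
]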

At the bottom levels, the reservation says how much memory is
needed for that particular usage to be able to make sensible forward
progress.
At the higher levels, the reservation says how much overhead we need
to allow to ensure that transient invalid uses don't unduly limit
available emergency memory. As pages are likely to be immediately
re-charged lower down the tree, the reservation at the top level
would probably be proportional to the number of CPUs (probably one
page per CPU would be perfect). Lower down, different calculations
might suggest different intermediate reservations.

Of course, these things don't need to be explicitly structured as a
tree. There is no need for 'parent' or 'sibling' pointers. The code
implicitly knows where to move charges from and to.
You still need an explicit structure to allow groups of reservations
that are activated or de-activated as a whole. That can use your
current tree structure, or whatever else turns out to make sense.

This model, I think, captures the important "allocate before charging"
aspect of reservations that you need (particularly for incoming
network packets) and it makes that rule apply throughout the different
stages that an allocated chunk of memory goes through.

With this model, alloc_page could fail more often, as it now also
fails if the top level reservation is exhausted. This may seem
un-necessary, but I think it could be a good thing. It means that at
very busy times (when lots of requests are needing emergency memory)
we drop requests randomly and very early. If we are going to drop a
request eventually, dropping it early means we waste less time on it
which is probably a good thing.


So: Does this model help others with understanding how the
reservations work, or am I just over-engineering?

NeilBrown

2008-02-29 10:22:15

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Fri, 2008-02-29 at 12:29 +1100, Neil Brown wrote:
> So I've been pondering all this some more trying to find the pattern,
> and things are beginning to crystalise (I hope).
>
> One of the approaches I have been taking is to compare it to mempools
> (which I think I understand) and work out what the important
> differences are.
>
> One difference is that you don't wait for memory to become available
> (as I mentioned earlier). Rather you just try to get the memory and
> if it isn't available, you drop the packet. This probably makes sense
> for incoming packets as you rely on the packet being re-sent, and
> hopefully various back-off algorithms will slow things down a bit so
> that there is a good chance that memory will be available next time...
>
> For outgoing messages I'm less clear on exactly what is going on.
> Maybe I haven't looked at that code properly yet, but I would expect
> there would be a place for waiting for memory to become available
> somewhere in the out-going path ??

The tx path is a bit fuzzy. I assume it has an upper limit, take a stab
at that upper limit and leave it at that.

It should be full of holes, and there is some work on writeout
throttling to fill some of them - but I haven't seen any lockups in this
area for a long long while.

> But there is another important difference to mempools which I think is
> worth exploring. With mempools, you are certain that the memory will
> only be used to make forward progress in writing out dirty data. So
> if you find that there isn't enough memory at the moment and you have
> to wait, you can be sure someone else is making forward progress and
> so waiting isn't such a bad thing.
>
> With your reservations it isn't quite the same. Reserved memory can
> be used for related purposes. In particular, any incoming packet can
> use some reserved memory. Once the purpose of that packet is
> discovered (i.e. the matching socket is found), the memory will be
> freed again. But there is a period of time when memory is being used
> for an inappropriate purpose. The consequences of this should be
> clearly understood.

IIRC the route-cache is in this state. Entries there can be added before
we can decide to keep or toss the packet. So we reserve enough memory to
overflow the route-cache (route-cache reclaim keeps it in bounds).

> In particular, the memory that is reserved for the emergency pool
> should include some overhead to acknowledge the fact that memory
> might be used for short periods of time for unrelated purposes.
>
> I think we can fit this acknowledgement into the current model quite
> easily, and it makes the tree structure suddenly make lots of sense
> (whereas before I was still struggling with it).
>
> A key observation in this design is "Sometimes we need to allocate
> emergency memory without knowing exactly what it is going to be used
> for". I think we should make that explicit in the implementation as
> follows:
>
> We have a tree of reservations (as you already do) where levels in
> the tree correspond to more explicit knowledge of how the memory
> will be used.
> At the top level there is a generic 'page' reservation. Below that
> to one side we have a 'SLUB/SLAB' reservation. I'm not sure yet
> exactly what that will look like.
> Also below the 'page' reservation is a reservation for pages to hold
> incoming network fragments.
> Below the SLxB reservation is a reservation for skbs, which is
> parent to a reservation for IPv4 skbs and another for IPv6 skbs.
>
> Each of these nodes has its own independent reservation - parents are
> not simply the sum of the children.
> The sum over the whole tree is given to the VM as the size of the
> emergency pool to reserve for emergency allocations.
>
> Now, every actual allocation from the emergency pool effectively comes
> in at the top of the tree and moves down as its purpose is more fully
> understood. Every emergency allocation is *always* charged to one
> node in the tree, though which node may change.
>
> e.g.
> A network driver asks for a page to store a fragment.
> netdev_alloc_page calls alloc_page with __GFP_MEMALLOC set.
> If alloc_page needs to dive into the emergency pool, it first
> charges the one page against the root for the reservation tree.
> If this succeeds, it returns the page with ->reserve set. If the
> reservation fails, it ignores the __GFP_MEMALLOC and fails.
> netdev_alloc_page notices that the page is a ->reserve page, and
> knows that it has been charged to the top 'page' reservation, but it
> should be charged to the network-page reservation. So it tries to
> charge against the network-pages reservation, and reverses the
> charge against 'pages'. If the network-pages reservation fails, the
> page is freed and netdev_alloc_page fails.
> As you can see, the charge moves down the tree as more information
> becomes available.
>
> Similarly a charge might move from 'pages' to 'SLxB' to 'net_skb' to
> 'ipv4_skb'.
>
> At the bottom levels, the reservation says how much memory is
> needed for that particular usage to be able to make sensible forward
> progress.
> At the higher levels, the reservation says how much overhead we need
> to allow to ensure that transient invalid uses don't unduly limit
> available emergency memory. As pages are likely to be immediately
> re-charged lower down the tree, the reservation at the top level
> would probably be proportional to the number of CPUs (probably one
> page per CPU would be perfect). Lower down, different calculations
> might suggest different intermediate reservations.
>
> Of course, these things don't need to be explicitly structured as a
> tree. There is no need for 'parent' or 'sibling' pointers. The code
> implicitly knows where to move charges from and to.
> You still need an explicit structure to allow groups of reservations
> that are activated or de-activated as a whole. That can use your
> current tree structure, or whatever else turns out to make sense.
>
> This model, I think, captures the important "allocate before charging"
> aspect of reservations that you need (particularly for incoming
> network packets) and it makes that rule apply throughout the different
> stages that an allocated chunk of memory goes through.

I'm a bit confused here; the only way to keep the allocations bounded is
by accounting before allocation (well, the other way is to bound the
number of concurrent allocations).

Also, I try not to account when not needed, like with the route-cache.
We already know it has bounded memory usage because it maintains that
itself. So by just supplying enough memory to overflow the thing you're
home safe.

While the model of moving the accounting down might work, I think it is
not needed. We don't need to know if it's IPv4 or IPv6 or yet another
protocol, as long as we have enough skb room to overflow whatever caches
are in between incoming packets and socket de-multiplex.

> With this model, alloc_page could fail more often, as it now also
> fails if the top level reservation is exhausted. This may seem
> unnecessary, but I think it could be a good thing. It means that at
> very busy times (when lots of requests are needing emergency memory)
> we drop requests randomly and very early. If we are going to drop a
> request eventually, dropping it early means we waste less time on it
> which is probably a good thing.

But, might you not be dropping the few packets we do want, early as
well?

> So: Does this model help others with understanding how the
> reservations work, or am I just over-engineering?

Sounds like a bit of overkill to me.

2008-02-29 11:52:23

by Peter Zijlstra

Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> Hi Peter,
>
> On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <[email protected]> wrote:
> > > 1/ I note there is no way to tell if memory returned by kmalloc is
> > > from the emergency reserve - which contrasts with alloc_page
> > > which does make that information available through page->reserve.
> > > This seems a slightly unfortunate aspect of the interface.
> >
> > Yes, but alas there is no room to store such information in kmalloc().
> > That is, in a sane way. I think it was Daniel Phillips who suggested
> > encoding it in the return pointer by flipping the low bit - but that is
> > just too ugly and breaks all current kmalloc sites to boot.
>
> Why can't you add a kmem_is_emergency() to SLUB that looks up the
> cache/slab/page (whatever is the smallest unit of the emergency pool
> here) for the object and use that?

I made page->reserve into PG_emergency and made that bit stick for the
lifetime of that page allocation. I then made kmem_is_emergency() look
up the head page backing that allocation's slab and return
PageEmergency().

This gives a consistent kmem_is_emergency() - that is if during the
lifetime of the kmem allocation it returns true once, it must return
true always.

You can then, using this properly, push the accounting into
kmalloc_reserve() and kfree_reserve() (and
kmem_cache_{alloc,free}_reserve).

Which yields very pretty code all round. (I can make it public if you'd
like to see it.)
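
For flavour, the alloc side then reduces to roughly [ from memory, the
details may differ ]:

	void *kmalloc_reserve(size_t size, gfp_t flags, struct mem_reserve *res)
	{
		void *obj;

		/* fast path - no reserves needed */
		obj = kmalloc(size, flags);
		if (obj)
			return obj;

		if (!mem_reserve_kmalloc_charge(res, size, 0))
			return NULL;

		obj = kmalloc(size, flags | __GFP_MEMALLOC);
		if (!obj || !kmem_is_emergency(obj))
			/* no reserve memory was used after all */
			mem_reserve_kmalloc_charge(res, -size, 0);

		return obj;
	}

with kfree_reserve() symmetrically un-charging iff kmem_is_emergency().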

However...

This is a stricter model than I had before, and has one ramification I'm
not entirely sure I like.

It means the page remains a reserve page throughout its lifetime, which
means the slab remains a reserve slab throughout its lifetime. Therefore
it may never be used for !reserve allocations. Which in turn generates
complexities for the partial list.

In my previous model I had the reserve accounting external to the
allocation, which relaxed the strict need for consistency here, and I
dropped the reserve status once we were above the page limits again.

I managed to complicate the SLUB patch with this extra constraint, by
checking reserve against PageEmergency() when scanning the partial list,
but gave up on SLAB.

Does this sound like something I should pursue? I feel it might
complicate the slab allocators too much..

2008-02-29 11:59:13

by Pekka Enberg

Subject: Re: [PATCH 00/28] Swap over NFS -v16

Hi Peter,

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> I made page->reserve into PG_emergency and made that bit stick for the
> lifetime of that page allocation. I then made kmem_is_emergency() look
> up the head page backing that allocation's slab and return
> PageEmergency().

[snip]

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> This is a stricter model than I had before, and has one ramification I'm
> not entirely sure I like.
>
> It means the page remains a reserve page throughout its lifetime, which
> means the slab remains a reserve slab throughout its lifetime. Therefore
> it may never be used for !reserve allocations. Which in turn generates
> complexities for the partial list.

Hmm, so why don't we just clear the PG_emergency flag and
allocate a fresh new page to the reserves?

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> Does this sound like something I should pursue? I feel it might
> complicate the slab allocators too much...

I can't answer that question until I see the code ;-). But overall, I
think it's better to put that code in SLUB rather than trying to work
around it elsewhere. The fact is, as soon as you have some sort of
reservation for _objects_, you need help from the SLUB allocator.

Pekka

2008-02-29 12:20:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Fri, 2008-02-29 at 13:58 +0200, Pekka Enberg wrote:
> Hi Peter,
>
> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> > I made page->reserve into PG_emergency and made that bit stick for the
> > lifetime of that page allocation. I then made kmem_is_emergency() look
> > up the head page backing that allocation's slab and return
> > PageEmergency().
>
> [snip]
>
> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> > This is a stricter model than I had before, and has one ramification I'm
> > not entirely sure I like.
> >
> > It means the page remains a reserve page throughout its lifetime, which
> > means the slab remains a reserve slab throughout its lifetime. Therefore
> > it may never be used for !reserve allocations. Which in turn generates
> > complexities for the partial list.
>
> Hmm, so why don't we then clear the PG_emergency flag

Clearing PG_emergency would mean kmem_is_emergency() would return false
in kfree_reserve() and fail to un-charge the object.

Previously objects would track their account status themselves (when
needed) and freeing PG_emergency wouldn't be a problem.

> and allocate a fresh page to the reserves?

Not sure I understand this properly. We would only do this once the page
watermarks are high enough, so the reserves are full again.

> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> > Does this sound like something I should pursue? I feel it might
> > complicate the slab allocators too much...
>
> I can't answer that question until I see the code ;-). But overall, I
> think it's better to put that code in SLUB rather than trying to work
> around it elsewhere. The fact is, as soon as you have some sort of
> reservation for _objects_, you need help from the SLUB allocator.

Well, I agree that consolidating it makes sense. And like I said,
it gives pretty code. However, it also puts the burden of this feature
on everyone and might affect performance - it's only the slow path,
but still.

2008-02-29 12:29:42

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Fri, Feb 29, 2008 at 2:18 PM, Peter Zijlstra <[email protected]> wrote:
> Clearing PG_emergency would mean kmem_is_emergency() would return false
> in kfree_reserve() and fail to un-charge the object.
>
> Previously objects would track their account status themselves (when
> needed) and freeing PG_emergency wouldn't be a problem.
>
> > and allocate a fresh page to the reserves?
>
> Not sure I understand this properly. We would only do this once the page
> watermarks are high enough, so the reserves are full again.

The problem with PG_emergency is that, once the watermarks are high
again, SLUB keeps holding on to the emergency page and it cannot be used
for regular kmalloc allocations, right?

So the way to fix this is to batch uncharge the objects and clear
PG_emergency for the said SLUB pages, thus freeing them for regular
allocations. And to compensate for the loss in the reserves, we ask
the page allocator to give us a new one that SLUB knows nothing about.

If you don't do this, the reserve page can only contain a few objects,
making them unavailable for regular allocations. So we might be
forced into "emergency mode" even though there's enough memory
available to satisfy the allocation.

Pekka

2008-03-02 22:18:53

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Friday February 29, [email protected] wrote:
>
> The tx path is a bit fuzzy. I assume it has an upper limit, take a stab
> at that upper limit and leave it at that.
>
> It should be full of holes, and there is some work on writeout
> throttling to fill some of them - but I haven't seen any lockups in this
> area for a long long while.

I think this is very interesting and useful.

You seem to be saying that the write-throttling is enough to avoid any
starvation on the transmit path. i.e. the VM is limiting the amount
of dirty memory so that when we desperately need memory on the
writeout path we can always get it, without lots of careful
accounting.

So why doesn't this work for the receive side? What - exactly - is
the difference?

I think the difference is that on the receive side we have to hand
out memory before we know how it will be used (i.e. whether it is for
a SK_MEMALLOC socket or not) and so emergency memory could get stuck
in some non-emergency usage.

So suppose we forget about all the allocation tracking (that doesn't
seem to be needed on the send side, so maybe isn't on the receive side)
and just focus on not letting emergency memory get used for the wrong
thing.

So: Have some global flag that says "we are into the emergency pool"
which gets set the first time an allocation has to dip below the low
water mark, and cleared when an allocation succeeds without needing to
dip that low.
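Something like this is what I have in mind (page->reserve is from your
patches; the flag itself and its placement are only illustrative):

int vm_emergency;	/* "we are into the emergency pool" */

struct page *alloc_page_flagged(gfp_t gfp)
{
	struct page *page = alloc_page(gfp);

	/* page->reserve: we had to dip below the low watermark;
	 * a clean success clears the flag again */
	if (page)
		vm_emergency = page->reserve;
	return page;
}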

Then whenever we have memory that might have been allocated from below
the watermark (i.e. an incoming packet) and we find out that it isn't
required for write-out (i.e. it gets attached to a socket for which
SK_MEMALLOC is not set) we test the global flag and if it is set, we
drop the packet and free the memory.

To clarify:
no new accounting
no new reservations
just "drop non-writeout packets when we are low on memory"

Is there any chance that could work reliably?

I call this the "Linus" approach because I remember reading somewhere
(that google cannot find for me) where Linus said he didn't think a
provably correct implementation was the way to go - just something
that made it very likely that we won't run out of memory at an awkward
time.

I guess my position is that any accounting that we do needs to have a
clear theoretical model underneath it so we can reason about it.
I cannot see the clear model in beneath the current code so I'm trying
to come up with models that seem to capture the important elements of
the code, and to explore them.

NeilBrown

2008-03-02 23:33:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Mon, 2008-03-03 at 09:18 +1100, Neil Brown wrote:
> On Friday February 29, [email protected] wrote:
> >
> > The tx path is a bit fuzzy. I assume it has an upper limit, take a stab
> > at that upper limit and leave it at that.
> >
> > It should be full of holes, and there is some work on writeout
> > throttling to fill some of them - but I haven't seen any lockups in this
> > area for a long long while.
>
> I think this is very interesting and useful.
>
> You seem to be saying that the write-throttling is enough to avoid any
> starvation on the transmit path. i.e. the VM is limiting the amount
> of dirty memory so that when we desperately need memory on the
> writeout path we can always get it, without lots of careful
> accounting.
>
> So why doesn't this work for the receive side? What - exactly - is
> the difference?

The TX path needs to be able to make progress in that it must be able to
send out at least one full request (page). The thing the TX path must
not do is tie up so much memory sending out pages that we can't receive
any incoming packets.

So, having a throttle on the amount of writes in progress, and
sufficient memory to back those, seems like a solid way here.

NFS has such a limit in its congestion logic. But I'm quite sure I'm
failing to allocate enough memory to back it, as I got confused by the
whole RPC code.

> I think the difference is that on the receive side we have to hand
> out memory before we know how it will be used (i.e. whether it is for
> a SK_MEMALLOC socket or not) and so emergency memory could get stuck
> in some non-emergency usage.
>
> So suppose we forget about all the allocation tracking (that doesn't
> seem to be needed on the send side, so maybe isn't on the receive side)
> and just focus on not letting emergency memory get used for the wrong
> thing.
>
> So: Have some global flag that says "we are into the emergency pool"
> which gets set the first time an allocation has to dip below the low
> water mark, and cleared when an allocation succeeds without needing to
> dip that low.

That is basically what the slub logic I added does. Except that global
flags in the vm make people very nervous, so it's a little more complex.

> Then whenever we have memory that might have been allocated from below
> the watermark (i.e. an incoming packet)

Which is what I do in the skb_alloc() path.

> and we find out that it isn't
> required for write-out (i.e. it gets attached to a socket for which
> SK_MEMALLOC is not set) we test the global flag and if it is set, we
> drop the packet and free the memory.

Which is somewhat more complex than you make it sound, but that is
exactly what I do.

> To clarify:
> no new accounting
> no new reservations
> just "drop non-writeout packets when we are low on memory"
>
> Is there any chance that could work reliably?

You need to be able to overflow the ip fragment assembly cache, or we
could get stuck with all memory in fragments.

Same for other memory usage before we hit the socket de-multiplex, like
the route-cache.

I just refined those points here; you need to drop more than
non-writeout packets - you need to drop all packets not meant for
SK_MEMALLOC sockets.

You also need to allow some writeout packets, because if you hit 'oom'
and need to write-out some pages to free up memory,...

I did the reservation because I wanted some guarantee we'd be able to
overflow the caches mentioned. The alternative is working with the
variable ratio that the current reserve has.

The accounting makes the whole system more robust. I wanted to make the
state stable enough to survive a connection drop, or server reset for a
long while, and it does. During a swapping workload and heavy network
load, I can pull the network cable, or shut down the NFS server and
leave it down for over 30 minutes. When I bring it back up again, stuff
resumes.

> I call this the "Linus" approach because I remember reading somewhere
> (that google cannot find for me) where Linus said he didn't think a
> provably correct implementation was the way to go - just something
> that made it very likely that we won't run out of memory at an awkward
> time.
>
> I guess my position is that any accounting that we do needs to have a
> clear theoretical model underneath it so we can reason about it.
> I cannot see the clear model in beneath the current code so I'm trying
> to come up with models that seem to capture the important elements of
> the code, and to explore them.

From my POV there is a model, and I've tried to convey it, but clearly
I'm failing horribly. Let me try again:

Create a stable state where you can receive an unlimited amount of
network packets awaiting the one packet you need to move forward.

To do so we need to distinguish needed from unneeded packets; we do this
by means of SK_MEMALLOC. So we need to be able to receive packets up to
that point.

The unlimited amount of packets means unlimited time; which means that
our state must not consume memory, merely use memory. That is, the
amount of memory used must not grow unbounded over time.

So we must guarantee that all memory allocated will be promptly freed
again, and never allocate more than available.

Because this state is not the normal state, we need a trigger to enter
this state (and consequently a trigger to leave this state). We do that
by detecting a low memory situation just like you propose. We enter this
state once normal memory allocations fail and leave this state once they
start succeeding again.

We need the accounting to ensure we never allocate more than is
available, but more importantly because we need to ensure progress for
those packets we already have allocated.

A packet is received, it can be a fragment, it will be placed in the
fragment cache for packet re-assembly.

We need to ensure we can overflow this fragment cache in order that
something will come out at the other end. If under a fragment attack,
the fragment cache limit will prune the oldest fragments, freeing up
memory to receive new ones.

Eventually we'd be able to receive either a whole packet, or enough
fragments to assemble one.

Next comes routing the packet; we need to know where to process the
packet; local or non-local. This potentially involves filling the
route-cache.

If at this point there is no memory available because we forgot to limit
the amount of memory available for skb allocation we again are stuck.

The route-cache, like the fragment assembly, is already accounted and
will prune old (unused) entries once the total memory usage exceeds a
pre-determined amount of memory.

Eventually we'll end up at socket demux, matching packets to sockets
which allows us to either toss the packet or consume it. Dropping
packets is allowed because network is assumed lossy, and we have not yet
acknowledged the receive.
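In code the demux test is essentially the following - sk_has_memalloc()
standing in for however the SK_MEMALLOC bit is tested:

static int sk_accept_skb(struct sock *sk, struct sk_buff *skb)
{
	/* reserve-backed skbs may only queue on SK_MEMALLOC sockets */
	if (skb->emergency && !sk_has_memalloc(sk)) {
		kfree_skb(skb);	/* drop; network is lossy, peer resends */
		return NET_RX_DROP;
	}
	return sk_backlog_rcv(sk, skb);	/* normal delivery */
}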

Does this make sense?


Then we have TX, which like I said above needs to operate under certain
limits as well. We need to be able to send out packets when under
pressure in order to relieve said pressure.

We need to ensure doing so will not exhaust our reserves.

Writing out a page typically takes a little memory, you fudge some
packets with protocol info, mtu size etc.. send them out, and wait for
an acknowledge from the other end, and drop the stuff and go on writing
other pages.

So sending out pages does not consume memory if we're able to receive
ACKs. Being able to receive packets is what all the previous was
about.

Now of course there is some RPC concurrency, TCP windows and other
funnies going on, but I assumed - and I don't think that's a wrong
assumption - that sending out pages will not consume endless amounts of
memory.

Nor will it keep on sending pages, once there is a certain amount of
packets outstanding (nfs congestion logic), it will wait, at which point
it should have no memory in use at all.

Anyway I did get lost in the RPC code, and I know I didn't fully account
everything, but under some (hopefully realistic) assumptions I think the
model is sound.

Does this make sense?

2008-03-03 23:41:47

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16


Hi Peter,

Thanks for trying to spell it out for me. :-)

On Monday March 3, [email protected] wrote:
>
> From my POV there is a model, and I've tried to convey it, but clearly
> I'm failing horribly. Let me try again:
>
> Create a stable state where you can receive an unlimited amount of
> network packets awaiting the one packet you need to move forward.

Yep.

>
> To do so we need to distinguish needed from unneeded packets; we do this
> by means of SK_MEMALLOC. So we need to be able to receive packets up to
> that point.

Yep.

>
> The unlimited amount of packets means unlimited time; which means that
> our state must not consume memory, merely use memory. That is, the
> amount of memory used must not grow unbounded over time.

Yes. Good point.

>
> So we must guarantee that all memory allocated will be promptly freed
> again, and never allocate more than available.

Definitely.

>
> Because this state is not the normal state, we need a trigger to enter
> this state (and consequently a trigger to leave this state). We do that
> by detecting a low memory situation just like you propose. We enter this
> state once normal memory allocations fail and leave this state once they
> start succeeding again.

Agreed.

>
> We need the accounting to ensure we never allocate more than is
> available, but more importantly because we need to ensure progress for
> those packets we already have allocated.

Maybe...
1/ Memory is used
a/ in caches, such as the fragment cache and the route cache
b/ in transient allocations on their way from one place to
another. e.g. network card to fragment cache, frag cache to
socket.
The caches can (do?) impose a natural limit on the amount of
memory they use. The transient allocations should be satisfied
from the normal low watermark pool. When we are in low memory
conditions we can expect packet loss so we expect network streams
to slow down, so we expect there to be fewer bits in transit.
Also in low memory conditions the caches would be extra-cautious
not to use too much memory.
So it isn't completely clear (to me) that extra accounting is needed.

2/ If we were to do accounting to "ensure progress for those packets
we already have allocated", then I would expect a reservation
(charge) of max_packet_size when a fragment arrives on the network
card - or at least when a new fragment is determined to not match
any packet already in the fragment cache. But I didn't see that
in your code. I saw incremental charges as each page arrived.
And that implementation does not seem to fit the model.

>
> A packet is received, it can be a fragment, it will be placed in the
> fragment cache for packet re-assembly.

Understood.

>
> We need to ensure we can overflow this fragment cache in order that
> something will come out at the other end. If under a fragment attack,
> the fragment cache limit will prune the oldest fragments, freeing up
> memory to receive new ones.

I don't understand why we want to "overflow this fragment cache".
I picture the cache having a target size. When under this size,
fragments might be allowed to live longer. When at or over the target
size, old fragments are pruned earlier. When in a low memory
situation it might be even more keen to prune old fragments, to keep
beneath the target size.
When you say "overflow this fragment cache", I picture deliberately
allowing the cache to get bigger than the target size. I don't
understand why you would want to do that.

>
> Eventually we'd be able to receive either a whole packet, or enough
> fragments to assemble one.

That would be important, yes.

>
> Next comes routing the packet; we need to know where to process the
> packet; local or non-local. This potentially involves filling the
> route-cache.
>
> If at this point there is no memory available because we forgot to limit
> the amount of memory available for skb allocation we again are stuck.

Those skbs we allocated - they are either sitting in the fragment
cache, or have been attached to a SK_MEMALLOC socket, or have been
freed - correct? If so, then there is already a limit to how much
memory they can consume.

>
> The route-cache, like the fragment assembly, is already accounted and
> will prune old (unused) entries once the total memory usage exceeds a
> pre-determined amount of memory.

Good. So as long as the normal emergency reserves covers the size of
the route cache plus the size of the fragment cache plus a little bit
of slack, we should be safe - yes?

>
> Eventually we'll end up at socket demux, matching packets to sockets
> which allows us to either toss the packet or consume it. Dropping
> packets is allowed because network is assumed lossy, and we have not yet
> acknowledged the receive.
>
> Does this make sense?

Lots of it does, yes.

>
>
> Then we have TX, which like I said above needs to operate under certain
> limits as well. We need to be able to send out packets when under
> pressure in order to relieve said pressure.

Catch-22 ?? :-)

>
> We need to ensure doing so will not exhaust our reserves.
>
> Writing out a page typically takes a little memory, you fudge some
> packets with protocol info, mtu size etc.. send them out, and wait for
> an acknowledge from the other end, and drop the stuff and go on writing
> other pages.

Yes, rate-limiting those write-outs should keep that moving.

>
> So sending out pages does not consume memory if we're able to receive
> ACKs. Being able to receive packets is what all the previous was
> about.
>
> Now of course there is some RPC concurrency, TCP windows and other
> funnies going on, but I assumed - and I don't think that's a wrong
> assumption - that sending out pages will not consume endless amounts of
> memory.

Sounds fair.

>
> Nor will it keep on sending pages, once there is a certain amount of
> packets outstanding (nfs congestion logic), it will wait, at which point
> it should have no memory in use at all.

Providing it frees any headers it attached to each page (or had
allocated them from a private pool), it should have no memory in use.
I'd have to check through the RPC code (I get lost in there too) to
see how much memory is tied up by each outstanding page write.

>
> Anyway I did get lost in the RPC code, and I know I didn't fully account
> everything, but under some (hopefully realistic) assumptions I think the
> model is sound.
>
> Does this make sense?

Yes.

So I can see two possible models here.

The first is the "bounded cache" or "locally bounded" model.
At every step in the path from writepage to clear_page_writeback,
the amount of extra memory used is bounded by some local rules.
NFS and RPC uses congestion logic to limit the number of outstanding
writes. For incoming packets, the fragment cache and route cache
impose their own limits.
We simply need that the VM reserves a total amount of memory to meet
the sum of those local limits.

Your code embodies this model with the tree of reservations. The root
of the tree stores the sum of all the reservations below, and this
number is given to the VM.
The value of the tree is that different components can register their
needs independently, and the whole tree (or subtrees) can be attached
or not depending on global conditions, such as whether there are any
SK_MEMALLOC sockets or not.

However I don't see how the charging that you implemented fits into
this model.
You don't do any significant charging for the route cache. But you do
for skbs. Why? Don't the majority of those skbs live in the fragment
cache? Doesn't it account their size? (Maybe it doesn't.... maybe it
should?).

I also don't see the value of tracking pages to see if they are
'reserve' pages or not. The decision to drop an skb that is not for
an SK_MEMALLOC socket should be based on whether we are currently
short on memory. Not whether we were short on memory when the skb was
allocated.

The second model that could fit is "total accounting".
In this model we reserve memory at each stage including the transient
stages (packet that has arrived but isn't in fragment cache yet).
As memory moves around, we move the charging from one reserve to
another. If the target reserve doesn't have any space, we drop the
message.
On the transmit side, that means putting the page back on a queue for
sending later. On the receive side that means discarding the packet
and waiting for a resend.
This model makes it easy for the various limits to be very different
while under memory pressure than otherwise. It also means they are
imposed differently, which isn't so good.

So:
- Why do you impose skb allocation limits beyond what is imposed
by the fragment cache?
- Why do you need to track whether each allocation is a reserve or
not?

Thanks,
NeilBrown

2008-03-04 10:29:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16


On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> Hi Peter,
>
> Thanks for trying to spell it out for me. :-)
>
> On Monday March 3, [email protected] wrote:
> >
> > From my POV there is a model, and I've tried to convey it, but clearly
> > I'm failing horribly. Let me try again:

> > Create a stable state where you can receive an unlimited amount of
> > network packets awaiting the one packet you need to move forward.
>
> Yep.
>
> >
> > To do so we need to distinguish needed from unneeded packets; we do this
> > by means of SK_MEMALLOC. So we need to be able to receive packets up to
> > that point.
>
> Yep.
>
> >
> > The unlimited amount of packets means unlimited time; which means that
> > our state must not consume memory, merely use memory. That is, the
> > amount of memory used must not grow unbounded over time.
>
> Yes. Good point.
>
> >
> > So we must guarantee that all memory allocated will be promptly freed
> > again, and never allocate more than available.
>
> Definitely.
>
> >
> > Because this state is not the normal state, we need a trigger to enter
> > this state (and consequently a trigger to leave this state). We do that
> > by detecting a low memory situation just like you propose. We enter this
> > state once normal memory allocations fail and leave this state once they
> > start succeeding again.
>
> Agreed.
>
> >
> > We need the accounting to ensure we never allocate more than is
> > available, but more importantly because we need to ensure progress for
> > those packets we already have allocated.
>
> Maybe...
> 1/ Memory is used
> a/ in caches, such as the fragment cache and the route cache
> b/ in transient allocations on their way from one place to
> another. e.g. network card to fragment cache, frag cache to
> socket.
> The caches can (do?) impose a natural limit on the amount of
> memory they use. The transient allocations should be satisfied
> from the normal low watermark pool. When we are in low memory
> conditions we can expect packet loss so we expect network streams
> to slow down, so we expect there to be fewer bits in transit.
> Also in low memory conditions the caches would be extra-cautious
> not to use too much memory.
> So it isn't completely clear (to me) that extra accounting is needed.
>
> 2/ If we were to do accounting to "ensure progress for those packets
> we already have allocated", then I would expect a reservation
> (charge) of max_packet_size when a fragment arrives on the network
> card - or at least when a new fragment is determined to not match
> any packet already in the fragment cache. But I didn't see that
> in your code. I saw incremental charges as each page arrived.
> And that implementation does not seem to fit the model.

Ah, the extra accounting I do counts the number of bytes associated
with skb data, so that we don't exhaust the reserves with incoming
packets. Like you said, packets need a little more memory in their
travels up to the socket demux.

When you look at __alloc_skb(), you'll find we charge the data size to
the reserves, and if you look at __netdev_alloc_page(), you'll see
PAGE_SIZE being charged against the skb reserve.

If we did not do this, and the incoming packet rate were high
enough, we could exhaust the reserves and leave the packets no memory to
use on their travels to the socket demux.
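So __netdev_alloc_page() ends up doing something like this - modulo the
exact reserve API spellings:

struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp)
{
	struct page *page = alloc_page(gfp | __GFP_MEMALLOC);

	/* charge PAGE_SIZE against the skb reserve; if that fails the
	 * reserve is exhausted and we drop rather than dig deeper */
	if (page && page->reserve &&
	    !mem_reserve_pages_charge(&net_skb_reserve, 1)) {
		__free_page(page);
		page = NULL;
	}
	return page;
}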

> > A packet is received, it can be a fragment, it will be placed in the
> > fragment cache for packet re-assembly.
>
> Understood.
>
> >
> > We need to ensure we can overflow this fragment cache in order that
> > something will come out at the other end. If under a fragment attack,
> > the fragment cache limit will prune the oldest fragments, freeing up
> > memory to receive new ones.
>
> I don't understand why we want to "overflow this fragment cache".
> I picture the cache having a target size. When under this size,
> fragments might be allowed to live longer. When at or over the target
> size, old fragments are pruned earlier. When in a low memory
> situation it might be even more keen to prune old fragments, to keep
> beneath the target size.
> When you say "overflow this fragment cache", I picture deliberately
> allowing the cache to get bigger than the target size. I don't
> understand why you would want to do that.

What I mean by overflowing is: by providing more than the cache can
handle we guarantee that forward progress is made, because either the
cache will prune items - giving us memory to continue - or we'll get out
a fully assembled packet which can continue its quest :-)

If we provide less memory than the cache can hold, all memory can be
tied up in the cache, waiting for something to happen - which won't,
because we're out of memory.

> > Eventually we'd be able to receive either a whole packet, or enough
> > fragments to assemble one.
>
> That would be important, yes.
>
> >
> > Next comes routing the packet; we need to know where to process the
> > packet; local or non-local. This potentially involves filling the
> > route-cache.
> >
> > If at this point there is no memory available because we forgot to limit
> > the amount of memory available for skb allocation we again are stuck.
>
> Those skbs we allocated - they are either sitting in the fragment
> cache, or have been attached to a SK_MEMALLOC socket, or have been
> freed - correct? If so, then there is already a limit to how much
> memory they can consume.

Not really, there is no natural limit to the amount of packets that can
be in transit between RX and socket demux. So we need the extra (skb)
accounting to impose that.

> > The route-cache, like the fragment assembly, is already accounted and
> > will prune old (unused) entries once the total memory usage exceeds a
> > pre-determined amount of memory.
>
> Good. So as long as the normal emergency reserves covers the size of
> the route cache plus the size of the fragment cache plus a little bit
> of slack, we should be safe - yes?

Basically (except for the last point), unless I've missed something in
the net-stack.

> > Eventually we'll end up at socket demux, matching packets to sockets
> > which allows us to either toss the packet or consume it. Dropping
> > packets is allowed because network is assumed lossy, and we have not yet
> > acknowledged the receive.
> >
> > Does this make sense?
>
> Lots of it does, yes.

Good, making progress here :-)

> > Then we have TX, which like I said above needs to operate under certain
> > limits as well. We need to be able to send out packets when under
> > pressure in order to relieve said pressure.
>
> Catch-22 ?? :-)

Yeah, swapping is such fun - which is why we have these reserves. We
only pretend we're out of memory, but secretly we do have some left. But
sssh, don't tell user-space :-)

> > We need to ensure doing so will not exhaust our reserves.
> >
> > Writing out a page typically takes a little memory, you fudge some
> > packets with protocol info, mtu size etc.. send them out, and wait for
> > an acknowledge from the other end, and drop the stuff and go on writing
> > other pages.
>
> Yes, rate-limiting those write-outs should keep that moving.
>
> >
> > So sending out pages does not consume memory if we're able to receive
> > ACKs. Being able to receive packets is what all the previous was
> > about.
> >
> > Now of course there is some RPC concurrency, TCP windows and other
> > funnies going on, but I assumed - and I don't think that's a wrong
> > assumption - that sending out pages will not consume endless amounts of
> > memory.
>
> Sounds fair.
>
> >
> > Nor will it keep on sending pages, once there is a certain amount of
> > packets outstanding (nfs congestion logic), it will wait, at which point
> > it should have no memory in use at all.
>
> Providing it frees any headers it attached to each page (or had
> allocated them from a private pool), it should have no memory in use.
> I'd have to check through the RPC code (I get lost in there too) to
> see how much memory is tied up by each outstanding page write.
>
> >
> > Anyway I did get lost in the RPC code, and I know I didn't fully account
> > everything, but under some (hopefully realistic) assumptions I think the
> > model is sound.
> >
> > Does this make sense?
>
> Yes.
>
> So I can see two possible models here.
>
> The first is the "bounded cache" or "locally bounded" model.
> At every step in the path from writepage to clear_page_writeback,
> the amount of extra memory used is bounded by some local rules.
> NFS and RPC uses congestion logic to limit the number of outstanding
> writes. For incoming packets, the fragment cache and route cache
> impose their own limits.
> We simply need that the VM reserves a total amount of memory to meet
> the sum of those local limits.
>
> Your code embodies this model with the tree of reservations. The root
> of the tree stores the sum of all the reservations below, and this
> number is given to the VM.
> The value of the tree is that different components can register their
> needs independently, and the whole tree (or subtrees) can be attached
> or not depending on global conditions, such as whether there are any
> SK_MEMALLOC sockets or not.
>
> However I don't see how the charging that you implemented fits into
> this model.
> You don't do any significant charging for the route cache. But you do
> for skbs. Why? Don't the majority of those skbs live in the fragment
> cache? Doesn't it account their size? (Maybe it doesn't.... maybe it
> should?).

To impose a limit on the amount of skb data in transit. As stated
above, there is (afaik) no natural limit on this.

> I also don't see the value of tracking pages to see if they are
> 'reserve' pages or not. The decision to drop an skb that is not for
> an SK_MEMALLOC socket should be based on whether we are currently
> short on memory. Not whether we were short on memory when the skb was
> allocated.

That comes from accounting: once you need to account data you need to
know when to start accounting, and keep state so that you can properly
un-account.

Also, from a practical POV it's easier to detect our lack of memory from
an allocation site than outside of it.

> The second model that could fit is "total accounting".
> In this model we reserve memory at each stage including the transient
> stages (packet that has arrived but isn't in fragment cache yet).
> As memory moves around, we move the charging from one reserve to
> another. If the target reserve doesn't have any space, we drop the
> message.
> On the transmit side, that means putting the page back on a queue for
> sending later. On the receive side that means discarding the packet
> and waiting for a resend.
> This model makes it easy for the various limits to be very different
> while under memory pressure than otherwise. It also means they are
> imposed differently, which isn't so good.
>
> So:
> - Why do you impose skb allocation limits beyond what is imposed
> by the fragment cache?

To impose a limit on the amount of skbs in transit. The fragment cache
only imposes a limit on the amount of data held for packet assembly. Not
the total amount of skb data between receive and socket demux.

> - Why do you need to track whether each allocation is a reserve or
> not?

To do accounting.

2008-03-07 03:33:56

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Tuesday March 4, [email protected] wrote:
>
> On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> >
> > Those skbs we allocated - they are either sitting in the fragment
> > cache, or have been attached to a SK_MEMALLOC socket, or have been
> > freed - correct? If so, then there is already a limit to how much
> > memory they can consume.
>
> Not really, there is no natural limit to the amount of packets that can
> be in transit between RX and socket demux. So we need the extra (skb)
> accounting to impose that.

Isn't there? A brief look at the code suggests that (except for
fragment handling) there is a fairly straight path from
network-receive to socket demux. No queues along the way.
That suggests the number of in-transit skbs should be limited by the
number of CPUs. Did I miss something? Or is the number of CPUs
potentially too large to be a suitable limit (seems unlikely).

While looking at the code it also occurred to me that:
1/ tcpdump could be sent incoming packets. Is there a limit
to the number of packets that can be in-kernel waiting for
tcpdump to collect them? Should this limit be added to the base
reserve?
2/ If the host is routing network packets, then incoming packets
might go on an outbound queue. Is this space limited? and
included in the reserve?

Not major points, but I thought I would mention them.

> > I also don't see the value of tracking pages to see if they are
> > 'reserve' pages or not. The decision to drop an skb that is not for
> > an SK_MEMALLOC socket should be based on whether we are currently
> > short on memory. Not whether we were short on memory when the skb was
> > allocated.
>
> That comes from accounting: once you need to account data you need to
> know when to start accounting, and keep state so that you can properly
> un-account.
>

skbs are the main (only?) thing you do accounting on, so focusing on
those:

Suppose that every time you allocate memory for an skb, you check
if the allocation had to dip into emergency reserves, and account the
memory if so - releasing the memory and dropping the packet if we are
over the limit.
And any time you free memory associated with an skb, you check if the
accounts currently say '0', and if not subtract the size of the
allocation from the accounts.

Then you have quite workable accounting that doesn't need to tag every
piece of memory with its 'reserve' status, and only pays the
accounting cost (presumably a spinlock) when running out of memory, or
just recovering.
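Concretely, I imagine something like the following - the limit and all
the names are only for illustration:

#define SKB_RESERVE_MAX	(1024 * 1024)	/* bytes, illustrative */

static DEFINE_SPINLOCK(skb_account_lock);
static long skb_reserve_used;

static int skb_charge(size_t size, int hit_reserve)
{
	int ok = 1;

	if (!hit_reserve)
		return 1;	/* common case: no locking, no accounting */

	spin_lock(&skb_account_lock);
	if (skb_reserve_used + size > SKB_RESERVE_MAX)
		ok = 0;		/* over the limit: drop the packet */
	else
		skb_reserve_used += size;
	spin_unlock(&skb_account_lock);
	return ok;
}

static void skb_uncharge(size_t size)
{
	if (!skb_reserve_used)	/* accounts say '0': nothing to undo */
		return;

	spin_lock(&skb_account_lock);
	skb_reserve_used -= min_t(long, skb_reserve_used, size);
	spin_unlock(&skb_account_lock);
}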

This more relaxed approach to accounting reserved vs non-reserved
memory has a strong parallel in your slub code (which I now
understand). When sl[au]b first gets a ->reserve page, it sets the
->reserve flag on the memcache and leaves it set until it sometime
later gets a non-"->reserve" page. Any memory freed in the mean time
(whether originally reserved or not) is treated as reserve memory in
that it will only be returned for ALLOC_NO_WATERMARKS allocations.
I think this is a good way of working with reserved memory. It isn't
precise, but it is low-cost and good enough to get you through the
difficult patch.
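i.e., as I read it, roughly:

/* refill: the cache-wide flag follows the newest page's status */
static void slab_set_reserve(struct kmem_cache *s, struct page *page)
{
	s->reserve = page->reserve;	/* sticks until a !reserve page */
}

/* alloc: while flagged, objects only go to emergency allocations */
static int slab_may_alloc(struct kmem_cache *s, gfp_t gfp)
{
	return !s->reserve || (gfp & __GFP_MEMALLOC) ||
	       (current->flags & PF_MEMALLOC);
}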

Your netvm-skbuff-reserve.patch has some code to make sure that all
the allocations in an skb have the same 'reserve' status. I don't
think that is needed and just makes the code messy - plus it requires
the 'overcommit' flag to mem_reserve_kmalloc_charge which is a bit of
a wart on the interface.

I would suggest getting rid of that. Just flag the whole skb if any
part gets a 'reserve' allocation, and use that flag to decide to drop
packets arriving at non-SK_MEMALLOC sockets.



So: I think I now really understand what your code is doing, so I will
try to explain it in terms that even I understand... This text is
explicitly available under GPLv2 in case you want it.

It actually describes something a bit different to what your code
currently does, but I think it is very close to the spirit. Some
differences follow from my observations above. For others, the way that
seemed to make sense while describing the problem and solution
differed slightly from what I saw the code doing. Obviously the code
and the description should be aligned one way or another before being
finalised.
The description is a bit long ... sorry about that. But I wanted to
make sure I included motivation and various assumptions. Some of my
understanding may well be wrong, but I present it here anyway. It is
easier for you to correct if it is clearly visible:-)

Problem:
When Linux needs to allocate memory it may find that there is
insufficient free memory so it needs to reclaim space that is in
use but not needed at the moment. There are several options:

1/ Shrink a kernel cache such as the inode or dentry cache. This
is fairly easy but provides limited returns.
2/ Discard 'clean' pages from the page cache. This is easy, and
works well as long as there are clean pages in the page cache.
Similarly clean 'anonymous' pages can be discarded - if there
are any.
3/ Write out some dirty page-cache pages so that they become clean.
The VM limits the number of dirty page-cache pages to e.g. 40%
of available memory so that (among other reasons) a "sync" will
not take excessively long. So there should never be excessive
amounts of dirty pagecache.
Writing out dirty page-cache pages involves work by the
filesystem which may need to allocate memory itself. To avoid
deadlock, filesystems use GFP_NOFS when allocating memory on the
write-out path. When this is used, cleaning dirty page-cache
pages is not an option so if the filesystem finds that memory
is tight, another option must be found.
4/ Write out dirty anonymous pages to the "Swap" partition/file.
This is the most interesting for a couple of reasons.
a/ Unlike dirty page-cache pages, there is no need to write anon
pages out unless we are actually short of memory. Thus they
tend to be left to last.
b/ Anon pages tend to be updated randomly and unpredictably, and
flushing them out of memory can have a very significant
performance impact on the process using them. This contrasts
with page-cache pages which are often written sequentially
and often treated as "write-once, read-many".
So anon pages tend to be left until last to be cleaned, and may
be the only cleanable pages while there are still some dirty
page-cache pages (which are waiting on a GFP_NOFS allocation).

[I don't find the above wholly satisfying. There seems to be too much
hand-waving. If someone can provide better text explaining why
swapout is a special case, that would be great.]

So we need to be able to write to the swap file/partition without
needing to allocate any memory ... or only a small well controlled
amount.

The VM reserves a small amount of memory that can only be allocated
for use as part of the swap-out procedure. It is only available to
processes with the PF_MEMALLOC flag set, which is typically just the
memory cleaner.

Traditionally swap-out is performed directly to block devices (swap
files on block-device filesystems are supported by examining the
mapping from file offset to device offset in advance, and then using
the device offsets to write directly to the device). Block devices
are (required to be) written to pre-allocate any memory that might be
needed during write-out, and to block when the pre-allocated memory is
exhausted and no other memory is available. They can be sure not to
block forever as the pre-allocated memory will be returned as soon as
the data it is being used for has been written out. The primary
mechanism for pre-allocating memory is called "mempools".

This approach does not work for writing anonymous pages
(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.


The main reason that it does not work is that when data from an anon
page is written to the network, we must wait for a reply to confirm
the data is safe. Receiving that reply will consume memory and,
significantly, we need to allocate memory to an incoming packet before
we can tell if it is the reply we are waiting for or not.

The secondary reason is that the network code is not written to use
mempools and in most cases does not need to use them. Changing all
allocations in the networking layer to use mempools would be quite
intrusive, and would waste memory, and probably cause a slow-down in
the common case of not swapping over the network.

These problems are addressed by enhancing the system of memory
reserves used by PF_MEMALLOC and requiring any in-kernel networking
client that is used for swap-out to indicate which sockets are used
for swapout so they can be handled specially in low memory situations.

There are several major parts to this enhancement:

1/ PG_emergency, GFP_MEMALLOC

To handle low memory conditions we need to know when those
conditions exist. Having a global "low on memory" flag seems easy,
but its implementation is problematic. Instead we make it possible
to tell if a recent memory allocation required use of the emergency
memory pool.
For pages returned by alloc_page, the new page flag PG_emergency
can be tested. If this is set, then a low memory condition was
current when the page was allocated, so the memory should be used
carefully.

For memory allocated using slab/slub: If a page that is added to a
kmem_cache is found to have PG_emergency set, then a ->reserve
flag is set for the whole kmem_cache. Further allocations will only
be returned from that page (or any other page in the cache) if they
are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
Non-emergency allocations will block in alloc_page until a
non-reserve page is available. Once a non-reserve page has been
added to the cache, the ->reserve flag on the cache is removed.
When memory is returned by slab/slub, PG_emergency is set on the page
holding the memory to match the ->reserve flag on that cache.

After memory has been returned by kmem_cache_alloc or kmalloc, the
page's PG_emergency flag can be tested. If it is set, then the most
recent allocation from that cache required reserve memory, so this
allocation should be used with care.

It is not safe to test the cache's ->reserve flag immediately after
an allocation as that flag is in per-cpu data, and the process could
have been rescheduled to a different cpu if preemption is enabled.
Thus the use of PG_emergency to carry this information.

This allows us to
a/ request use of the emergency pool when allocating memory
(GFP_MEMALLOC), and
b/ to find out if the emergency pool was used.

2/ SK_MEMALLOC, sk_buff->emergency.

When memory from the reserve is used to store incoming network
packets, the memory must be freed (and the packet dropped) as soon
as we find out that the packet is not for a socket that is used for
swap-out.
To achieve this we have an ->emergency flag for skbs, and an
SK_MEMALLOC flag for sockets.
When memory is allocated for an skb, it is allocated with
GFP_MEMALLOC (if we are currently swapping over the network at
all). If a subsequent test shows that the emergency pool was used,
->emergency is set.
When the skb is finally attached to its destination socket, the
SK_MEMALLOC flag on the socket is tested. If the skb has
->emergency set, but the socket does not have SK_MEMALLOC set, then
the skb is immediately freed and the packet is dropped.
This ensures that reserve memory is never queued on a socket that is
not used for swapout.

Similarly, if an skb is ever queued for delivery to user-space, for
example by netfilter, the ->emergency flag is tested and the skb is
released if ->emergency is set.

This ensures that memory from the emergency reserve can be used to
allow swapout to proceed, but will not get caught up in any other
network queue.


3/ pages_emergency

The above would be sufficient if the total memory below the lowest
memory watermark (i.e. the size of the emergency reserve) were known
to be enough to hold all transient allocations needed for writeout.
I'm a little blurry on how big the current emergency pool is, but it
isn't big and certainly hasn't been sized to allow network traffic
to consume any.

We could simply make the size of the reserve bigger. However in the
common case that we are not swapping over the network, that would be
a waste of memory.

So a new "watermark" is defined: pages_emergency. This is
effectively added to the current low water marks, so that pages from
this emergency pool can only be allocated if one of PF_MEMALLOC or
GFP_MEMALLOC is set.

pages_emergency can be changed dynamically based on need. When
swapout over the network is required, pages_emergency is increased
to cover the maximum expected load. When network swapout is
disabled, pages_emergency is decreased.
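I imagine the adjustment looks something like this, assuming the
rebuild goes through the existing setup_per_zone_pages_min():

long pages_emergency;	/* the new watermark component */

static void adjust_pages_emergency(long delta)
{
	pages_emergency += delta;
	setup_per_zone_pages_min();	/* fold into zone watermarks */
}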

To determine how much to increase it by, we introduce reservation
groups....

3a/ reservation groups

The memory used transiently for swapout can be in a number of
different places. e.g. the network route cache, the network
fragment cache, in transit between network card and socket, or (in
the case of NFS) in sunrpc data structures awaiting a reply.
We need to ensure each of these is limited in the amount of memory
they use, and that the maximum is included in the reserve.

The memory required by the network layer only needs to be reserved
once, even if there are multiple swapout paths using the network
(e.g. NFS and NBD and iSCSI, though using all three for swapout at
the same time would be unusual).

So we create a tree of reservation groups. The network might
register a collection of reservations, but not mark them as being in
use. NFS and sunrpc might similarly register a collection of
reservations, and attach it to the network reservations as it
depends on them.
When swapout over NFS is requested, the NFS/sunrpc reservations are
activated which implicitly activates the network reservations.

The total new reservation is added to pages_emergency.

Provided each memory usage stays beneath the registered limit (at
least when allocating memory from reserves), the system will never
run out of emergency memory, and swapout will not deadlock.
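So registration might look something like this - I am guessing at the
exact reserve API spellings:

#define NFS_RESERVE_PAGES	256	/* illustrative only */

static struct mem_reserve net_reserve;	/* generic network reserves */
static struct mem_reserve nfs_reserve;	/* NFS + sunrpc, child of net */

static int nfs_swap_activate(void)
{
	/* attaching the subtree implicitly activates its parents and
	 * adds the subtree's total into pages_emergency */
	mem_reserve_init(&nfs_reserve, "nfs", &net_reserve);
	return mem_reserve_pages_set(&nfs_reserve, NFS_RESERVE_PAGES);
}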

It is worth noting here that it is not critical that each usage
stays beneath the limit 100% of the time. Occasional excess is
acceptable provided that the memory will be freed again within a
short amount of time that does *not* require waiting for any event
that itself might require memory.
This is because, at all stages of transmit and receive, it is
acceptable to discard all transient memory associated with a
particular writeout and try again later. On transmit, the page can
be re-queued for later transmission. On receive, the packet can be
dropped assuming that the peer will resend after a timeout.

Thus allocations that are truly transient and will be freed without
blocking do not strictly need to be reserved for. Doing so might
still be a good idea to ensure forward progress doesn't take too
long.

4/ lo-mem accounting

Most places that might hold on to emergency memory (e.g. route
cache, fragment cache etc) already place a limit on the amount of
memory that they can use. This limit can simply be reserved using
the above mechanism and no more needs to be done.

However some memory usage might not be accounted with sufficient
firmness to allow an appropriate emergency reservation. The
in-flight skbs for incoming packets are (claimed to be) one such
example.

To support this, a low-overhead mechanism for accounting memory
usage against the reserves is provided. This mechanism uses the
same data structure that is used to store the emergency memory
reservations through the addition of a 'usage' field.

When memory allocation for a particular purpose succeeds, the memory
is checked to see if it is 'reserve' memory. If it is, the size of
the allocation is added to the 'usage'. If this exceeds the
reservation, the usage is reduced again and the memory that was
allocated is freed.

When memory that was allocated for that purpose is freed, the
'usage' field is checked again. If it is non-zero, then the size of
the freed memory is subtracted from the usage, making sure the usage
never becomes less than zero.

This provides adequate accounting with minimal overheads when not in
a low memory condition. When a low memory condition is encountered
it does add the cost of a spin lock necessary to serialise updates
to 'usage'.



5/ swapfile/swap_out/swap_in

So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
any network socket that it uses, and can know when to account
reserve memory carefully, new address_space_operations are
available.
"swapfile" requests that an address space (i.e a file) be make ready
for swapout. swap_out and swap_in request the actual IO. They
together must ensure that each swap_out request can succeed without
allocating more emergency memory than was reserved by swapfile.
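In other words the a_ops would have this shape (the nfs_*
implementations here are of course hypothetical):

static const struct address_space_operations nfs_file_aops = {
	/* ... the usual read/write a_ops ... */
	.swapfile	= nfs_swapfile,	/* reserve memory, set SK_MEMALLOC */
	.swap_out	= nfs_swap_out,	/* write one page within the reserve */
	.swap_in	= nfs_swap_in,	/* read the page back in */
};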


Thanks for reading this far. I hope it made sense :-)

NeilBrown

2008-03-07 11:18:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

Hi Neil,

I'm so glad you are working with me on this and writing this in human
English. It seems to be my eternal shortcoming that I can't communicate
my ideas clearly :-/. Thanks for your effort!

On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> On Tuesday March 4, [email protected] wrote:
> >
> > On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> > >
> > > Those skbs we allocated - they are either sitting in the fragment
> > > cache, or have been attached to a SK_MEMALLOC socket, or have been
> > > freed - correct? If so, then there is already a limit to how much
> > > memory they can consume.
> >
> > Not really, there is no natural limit to the amount of packets that can
> > be in transit between RX and socket demux. So we need the extra (skb)
> > accounting to impose that.
>
> Isn't there? A brief look at the code suggests that (except for
> fragment handling) there is a fairly straight path from
> network-receive to socket demux. No queues along the way.
> That suggests the number of in-transit skbs should be limited by the
> number of CPUs. Did I miss something? Or is the number of CPUs
> potentially too large to be a suitable limit (seems unlikely).

That would be so if the whole path from RX to socket demux had
hard-irqs disabled. However, I didn't see that. Moreover I think the
whole purpose of the NetPoll interface is to allow some RX queueing to
cut down on softirq overhead.

> While looking at the code it also occurred to me that:
> 1/ tcpdump could be sent incoming packets. Is there a limit
> to the number of packets that can be in-kernel waiting for
> tcpdump to collect them? Should this limit be added to the base
> reserve?

We could indeed do something like that; building a cache there that
recycles the oldest waiting skbs, allowing taps to continue working
under light pressure, seems like a reasonable thing.

Good suggestion for future work.

> 2/ If the host is routing network packets, then incoming packets
> might go on an outbound queue. Is this space limited? and
> included in the reserve?

Not sure, somewhere along the routing code I lost it again. Constructive
input from someone versed in that part of the kernel would be most
welcome.

> Not major points, but I thought I would mention them.
>
> > > I also don't see the value of tracking pages to see if they are
> > > 'reserve' pages or not. The decision to drop an skb that is not for
> > > an SK_MEMALLOC socket should be based on whether we are currently
> > > short on memory. Not whether we were short on memory when the skb was
> > > allocated.
> >
> > That comes from accounting: once you need to account data you need to
> > know when to start accounting, and keep state so that you can properly
> > un-account.
> >
>
> skbs are the main (only?) thing you do accounting on, so focusing on
> those:
>
> Suppose that every time you allocate memory for an skb, you check
> if the allocation had to dip into emergency reserves, and account the
> memory if so - releasing the memory and dropping the packet if we are
> over the limit.
> And any time you free memory associated with an skb, you check if the
> accounts currently say '0', and if not subtract the size of the
> allocation from the accounts.
>
> Then you have quite workable accounting that doesn't need to tag every
> piece of memory with its 'reserve' status, and only pays the
> accounting cost (presumably a spinlock) when running out of memory, or
> just recovering.

Quite so, that has been the intent.

> This more relaxed approach to accounting reserved vs non-reserved
> memory has a strong parallel in your slub code (which I now
> understand). When sl[au]b first gets a ->reserve page, it sets the
> ->reserve flag on the memcache and leaves it set until it sometime
> later gets a non-"->reserve" page. Any memory freed in the mean time
> (whether originally reserved or not) is treated as reserve memory in
> that it will only be returned for ALLOC_NO_WATERMARKS allocations.
> I think this is a good way of working with reserved memory. It isn't
> precise, but it is low-cost and good enough to get you through the
> difficult patch.

Agreed.

> Your netvm-skbuff-reserve.patch has some code to make sure that all
> the allocations in an skb have the same 'reserve' status. I don't
> think that is needed and just makes the code messy - plus it requires
> the 'overcommit' flag to mem_reserve_kmalloc_charge which is a bit of
> a wart on the interface.

It is indeed.

> I would suggest getting rid of that. Just flag the whole skb if any
> part gets a 'reserve' allocation, and use that flag to decide to drop
> packets arriving at non-SK_MEMALLOC sockets.

OK, I'll look into doing that. I must have been in pedantic mode when I
wrote that code.

> So: I think I now really understand what your code is doing, so I will
> try to explain it in terms that even I understand... This text is
> explicitly available under GPLv2 in case you want it.

Great work! Thanks!

> It actually describes something a bit different to what your code
> currently does, but I think it is very close to the spirit. Some
> differences follow from my observations above. Others arise because
> the way that seemed to make sense while describing the problem and
> solution differed slightly from what I saw the code doing. Obviously the code
> and the description should be aligned one way or another before being
> finalised.
> The description is a bit long ... sorry about that. But I wanted to
> make sure I included motivation and various assumptions. Some of my
> understanding may well be wrong, but I present it here anyway. It is
> easier for you to correct if it is clearly visible:-)
>
> Problem:
> When Linux needs to allocate memory it may find that there is
> insufficient free memory so it needs to reclaim space that is in
> use but not needed at the moment. There are several options:
>
> 1/ Shrink a kernel cache such as the inode or dentry cache. This
> is fairly easy but provides limited returns.
> 2/ Discard 'clean' pages from the page cache. This is easy, and
> works well as long as there are clean pages in the page cache.
> Similarly clean 'anonymous' pages can be discarded - if there
> are any.
> 3/ Write out some dirty page-cache pages so that they become clean.
> The VM limits the number of dirty page-cache pages to e.g. 40%
> of available memory so that (among other reasons) a "sync" will
> not take excessively long. So there should never be excessive
> amounts of dirty pagecache.
> Writing out dirty page-cache pages involves work by the
> filesystem which may need to allocate memory itself. To avoid
> deadlock, filesystems use GFP_NOFS when allocating memory on the
> write-out path. When this is used, cleaning dirty page-cache
> pages is not an option so if the filesystem finds that memory
> is tight, another option must be found.
> 4/ Write out dirty anonymous pages to the "Swap" partition/file.
> This is the most interesting for a couple of reasons.
> a/ Unlike dirty page-cache pages, there is no need to write anon
> pages out unless we are actually short of memory. Thus they
> tend to be left to last.
> b/ Anon pages tend to be updated randomly and unpredictably, and
> flushing them out of memory can have a very significant
> performance impact on the process using them. This contrasts
> with page-cache pages which are often written sequentially
> and often treated as "write-once, read-many".
> So anon pages tend to be left until last to be cleaned, and may
> be the only cleanable pages while there are still some dirty
> page-cache pages (which are waiting on a GFP_NOFS allocation).
>
> [I don't find the above wholly satisfying. There seems to be too much
> hand-waving. If someone can provide better text explaining why
> swapout is a special case, that would be great.]

Anonymous pages are dirty by definition (except the zero page, but I
think we recently ditched it). So shrinking of the anonymous pool will
require swapping.

It is indeed the last refuge for those with GFP_NOFS. Along with the
strict limit on the amount of dirty file pages, it also ensures that writing
those out will never deadlock the machine, as there are always clean file
pages and/or anonymous pages to launder.

Your observation about the difference in swap vs file disk patterns is
the motivation (one of them, at least) for Rik van Riel's split VM
series; it is currently not of any consequence.

Swap is indeed special in that it requires 'atomic' writeout. This
requirement comes from the cyclic dependency you outlined and we have to
break: we're out of memory, but need memory to write out pages to free
memory. So we must make it appear as if swap writes are indeed atomic.

> So we need to be able to write to the swap file/partition without
> needing to allocate any memory ... or only a small well controlled
> amount.
>
> The VM reserves a small amount of memory that can only be allocated
> for use as part of the swap-out procedure. It is only available to
> processes with the PF_MEMALLOC flag set, which is typically just the
> memory cleaner.
>
> Traditionally swap-out is performed directly to block devices (swap
> files on block-device filesystems are supported by examining the
> mapping from file offset to device offset in advance, and then using
> the device offsets to write directly to the device). Block device
> drivers are (required to be) written so as to pre-allocate any memory that
> might be needed during write-out, and to block when the pre-allocated memory is
> exhausted and no other memory is available. They can be sure not to
> block forever as the pre-allocated memory will be returned as soon as
> the data it is being used for has been written out. The primary
> mechanism for pre-allocating memory is called "mempools".
>
> This approach does not work for writing anonymous pages
> (i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
>
>
> The main reason that it does not work is that when data from an anon
> page is written to the network, we must wait for a reply to confirm
> the data is safe. Receiving that reply will consume memory and,
> significantly, we need to allocate memory to an incoming packet before
> we can tell if it is the reply we are waiting for or not.
>
> The secondary reason is that the network code is not written to use
> mempools and in most cases does not need to use them. Changing all
> allocations in the networking layer to use mempools would be quite
> intrusive, and would waste memory, and probably cause a slow-down in
> the common case of not swapping over the network.
>
> These problems are addressed by enhancing the system of memory
> reserves used by PF_MEMALLOC and requiring any in-kernel networking
> client that is used for swap-out to indicate which sockets are used
> for swapout so they can be handled specially in low memory situations.
>
> There are several major parts to this enhancement:
>
> 1/ PG_emergency, GFP_MEMALLOC
>
> To handle low memory conditions we need to know when those
> conditions exist. Having a global "low on memory" flag seems easy,
> but its implementation is problematic. Instead we make it possible
> to tell if a recent memory allocation required use of the emergency
> memory pool.
> For pages returned by alloc_page, the new page flag PG_emergency
> can be tested. If this is set, then a low memory condition was
> current when the page was allocated, so the memory should be used
> carefully.
>
> For memory allocated using slab/slub: If a page that is added to a
> kmem_cache is found to have PG_emergency set, then a ->reserve
> flag is set for the whole kmem_cache. Further allocations will only
> be returned from that page (or any other page in the cache) if they
> are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
> Non-emergency allocations will block in alloc_page until a
> non-reserve page is available. Once a non-reserve page has been
> added to the cache, the ->reserve flag on the cache is removed.
> When memory is returned by slab/slub, PG_emergency is set on the page
> holding the memory to match the ->reserve flag on that cache.
>
> After memory has been returned by kmem_cache_alloc or kmalloc, the
> page's PG_emergency flag can be tested. If it is set, then the most
> recent allocation from that cache required reserve memory, so this
> allocation should be used with care.
>
> It is not safe to test the cache's ->reserve flag immediately after
> an allocation as that flag is in per-cpu data, and the process could
> have been rescheduled to a different cpu if preemption is enabled.
> Thus the use of PG_emergency to carry this information.
>
> This allows us to
> a/ request use of the emergency pool when allocating memory
> (GFP_MEMALLOC), and
> b/ find out if the emergency pool was used.

Right. I've had a long conversation on PG_emergency with Pekka. And I
think the conclusion was that PG_emergency will create more headaches
than it solves. I probably have the conversation in my IRC logs and
could email it if you're interested (and Pekka doesn't object).

> 2/ SK_MEMALLOC, sk_buff->emergency.
>
> When memory from the reserve is used to store incoming network
> packets, the memory must be freed (and the packet dropped) as soon
> as we find out that the packet is not for a socket that is used for
> swap-out.
> To achieve this we have an ->emergency flag for skbs, and an
> SK_MEMALLOC flag for sockets.
> When memory is allocated for an skb, it is allocated with
> GFP_MEMALLOC (if we are currently swapping over the network at
> all). If a subsequent test shows that the emergency pool was used,
> ->emergency is set.
> When the skb is finally attached to its destination socket, the
> SK_MEMALLOC flag on the socket is tested. If the skb has
> ->emergency set, but the socket does not have SK_MEMALLOC set, then
> the skb is immediately freed and the packet is dropped.
> This ensures that reserve memory is never queued on a socket that is
> not used for swapout.
>
> Similarly, if an skb is ever queued for delivery to user-space, for
> example by netfilter, the ->emergency flag is tested and the skb is
> released if ->emergency is set.
>
> This ensures that memory from the emergency reserve can be used to
> allow swapout to proceed, but will not get caught up in any other
> network queue.
>
>
> 3/ pages_emergency
>
> The above would be sufficient if the total memory below the lowest
> memory watermark (i.e. the size of the emergency reserve) were known
> to be enough to hold all transient allocations needed for writeout.
> I'm a little blurry on how big the current emergency pool is, but it
> isn't big and certainly hasn't been sized to allow network traffic
> to consume any.
>
> We could simply make the size of the reserve bigger. However in the
> common case that we are not swapping over the network, that would be
> a waste of memory.
>
> So a new "watermark" is defined: pages_emergency. This is
> effectively added to the current low water marks, so that pages from
> this emergency pool can only be allocated if one of PF_MEMALLOC or
> GFP_MEMALLOC is set.
>
> pages_emergency can be changed dynamically based on need. When
> swapout over the network is required, pages_emergency is increased
> to cover the maximum expected load. When network swapout is
> disabled, pages_emergency is decreased.
>
> To determine how much to increase it by, we introduce reservation
> groups....
>
> 3a/ reservation groups
>
> The memory used transiently for swapout can be in a number of
> different places. e.g. the network route cache, the network
> fragment cache, in transit between network card and socket, or (in
> the case of NFS) in sunrpc data structures awaiting a reply.
> We need to ensure each of these is limited in the amount of memory
> they use, and that the maximum is included in the reserve.
>
> The memory required by the network layer only needs to be reserved
> once, even if there are multiple swapout paths using the network
> (e.g. NFS and NBD and iSCSI, though using all three for swapout at
> the same time would be unusual).
>
> So we create a tree of reservation groups. The network might
> register a collection of reservations, but not mark them as being in
> use. NFS and sunrpc might similarly register a collection of
> reservations, and attach it to the network reservations as it
> depends on them.
> When swapout over NFS is requested, the NFS/sunrpc reservations are
> activated which implicitly activates the network reservations.
>
> The total new reservation is added to pages_emergency.
>
> Provided each memory usage stays beneath the registered limit (at
> least when allocating memory from reserves), the system will never
> run out of emergency memory, and swapout will not deadlock.
>
> It is worth noting here that it is not critical that each usage
> stays beneath the limit 100% of the time. Occasional excess is
> acceptable provided that the memory will be freed again within a
> short amount of time that does *not* require waiting for any event
> that itself might require memory.
> This is because, at all stages of transmit and receive, it is
> acceptable to discard all transient memory associated with a
> particular writeout and try again later. On transmit, the page can
> be re-queued for later transmission. On receive, the packet can be
> dropped assuming that the peer will resend after a timeout.
>
> Thus allocations that are truly transient and will be freed without
> blocking do not strictly need to be reserved for. Doing so might
> still be a good idea to ensure forward progress doesn't take too
> long.
>
> 4/ lo-mem accounting
>
> Most places that might hold on to emergency memory (e.g. route
> cache, fragment cache etc) already place a limit on the amount of
> memory that they can use. This limit can simply be reserved using
> the above mechanism and no more needs to be done.
>
> However some memory usage might not be accounted with sufficient
> firmness to allow an appropriate emergency reservation. The
> in-flight skbs for incoming packets are (claimed to be) one such
> example.

:-)

> To support this, a low-overhead mechanism for accounting memory
> usage against the reserves is provided. This mechanism uses the
> same data structure that is used to store the emergency memory
> reservations through the addition of a 'usage' field.
>
> When memory allocation for a particular purpose succeeds, the memory
> is checked to see if it is 'reserve' memory. If it is, the size of
> the allocation is added to the 'usage'. If this exceeds the
> reservation, the usage is reduced again and the memory that was
> allocated is freed.
>
> When memory that was allocated for that purpose is freed, the
> 'usage' field is checked again. If it is non-zero, then the size of
> the freed memory is subtracted from the usage, making sure the usage
> never becomes less than zero.
>
> This provides adequate accounting with minimal overheads when not in
> a low memory condition. When a low memory condition is encountered
> it does add the cost of a spin lock necessary to serialise updates
> to 'usage'.

Agreed, minimizing the overhead for the common !net_swap case has been
my goal.
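
The shape this gives the reservation structure is roughly as follows;
the field names are illustrative rather than the patch's exact layout:

#include <linux/list.h>
#include <linux/spinlock.h>

/* One node in the tree of reservation groups (3a above), with the
 * 'usage' field from point 4 bolted on. */
struct mem_reserve {
        struct mem_reserve *parent;     /* e.g. NFS chains to the net group */
        struct list_head children;
        struct list_head siblings;
        const char *name;
        long pages;             /* worst-case reservation */
        long usage;             /* current use; only tracked under pressure */
        spinlock_t lock;        /* serialises updates to 'usage' */
};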

> 5/ swapfile/swap_out/swap_in
>
> So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
> any network socket that it uses, and can know when to account
> reserve memory carefully, new address_space_operations are
> available.
> "swapfile" requests that an address space (i.e a file) be make ready
> for swapout. swap_out and swap_in request the actual IO. They
> together must ensure that each swap_out request can succeed without
> allocating more emergency memory than was reserved by swapfile.

Miklos kindly provided code that slightly outdates this description, but
not in a conceptual way.

I've already heard interest from other people in using these hooks to
provide swap on other non-block filesystems such as jffs2, logfs and the
like.
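
For reference, the operations described in point 5 would look roughly
like the sketch below; the signatures are inferred from the discussion
(and Miklos' changes have already shifted them a little), so treat this
as illustrative:

#include <linux/fs.h>
#include <linux/writeback.h>

/* Sketch of the additions to address_space_operations; shown as a
 * separate struct here purely to keep the example self-contained. */
struct swap_aops_sketch {
        /* make the file ready (or not) for swapout */
        int (*swapfile)(struct address_space *mapping, int enable);
        /* write one swapcache page without further memory allocation */
        int (*swap_out)(struct file *file, struct page *page,
                        struct writeback_control *wbc);
        /* read one swapcache page back in */
        int (*swap_in)(struct file *file, struct page *page);
};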

> Thanks for reading this far. I hope it made sense :-)

It does, and I hope it does so for more people.

2008-03-07 11:55:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Fri, 2008-03-07 at 12:17 +0100, Peter Zijlstra wrote:

> That would be so if the whole path from RX to socket demux would have
> hard-irqs disabled. However I didn't see that. Moreover I think the
> whole purpose of the NetPoll interface is to allow some RX queueing to
> cut down on softirq overhead.

s/NetPoll/NAPI/

More specifically, look at net/core/dev.c:netif_rx().
It has an input queue per device.

> > 2/ If the host is routing network packets, then incoming packets
> > might go on an outbound queue. Is this space limited? and
> > included in the reserve?
>
> Not sure; somewhere along the routing code I lost track again. Constructive
> input from someone versed in that part of the kernel would be most
> welcome.

To clarify, I think we just send it on, as I saw no reason why that could
fail. However the more fancy stuff like egress or QoS might spoil the
party; that is where I lost track.

2008-03-10 05:16:21

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Friday March 7, [email protected] wrote:
> Hi Neil,
>
> I'm so glad you are working with me on this and writing this in human
> English. It seems to be my eternal shortcoming to communicate my ideas
> clearly :-/. Thanks for your effort!

:-)
It always helps to have a second brain with a different perspective.


>
> On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> >
> > [I don't find the above wholly satisfying. There seems to be too much
> > hand-waving. If someone can provide better text explaining why
> > swapout is a special case, that would be great.]
>
> Anonymous pages are dirty by definition (except the zero page, but I
> think we recently ditched it). So shrinking of the anonymous pool will
> require swapping.

Well, there is the swap cache. That's probably what I was thinking of
when I said "clean anonymous pages". I suspect they are the first to
go!

>
> It is indeed the last refuge for those with GFP_NOFS. Along with the
> strict limit on the amount of dirty file pages, it also ensures that writing
> those out will never deadlock the machine, as there are always clean file
> pages and/or anonymous pages to launder.

The difficulty I have is justifying exactly why page-cache writeout
will not deadlock. What if all the memory that is not dirty-pagecache
is anonymous, and if swap isn't enabled?
Maybe the number returned by "determine_dirtyable_memory" in
page-writeback.c excludes anonymous pages? I wonder if the meaning of
NR_FREE_PAGES, NR_INACTIVE, etc is documented anywhere....

...
>
> Right. I've had a long conversation on PG_emergency with Pekka. And I
> think the conclusion was that PG_emergency will create more headaches
> than it solves. I probably have the conversation in my IRC logs and
> could email it if you're interested (and Pekka doesn't object).

Maybe that depends on the exact semantic of PG_emergency ??
I remember you being concerned that PG_emergency never changes between
allocation and freeing, and that wouldn't work well with slub.
My envisioned semantic has it possibly changing quite often.
What it means is:
The last allocation done from this page was in a low-memory
condition.

You really need some way to tell if the result of kmalloc/kmem_cache_alloc
should be treated as reserved.
I think you had code which first tried the allocation without
GFP_MEMALLOC and then if that failed, tried again *with*
GFP_MEMALLOC. If that then succeeded, it is assumed to be an
allocation from reserves. That seemed rather ugly, though I guess you
could wrap it in a function to hide the ugliness:

void *kmalloc_reserve(size_t size, int *reserve, gfp_t gfp_flags)
{
        void *result = kmalloc(size, gfp_flags & ~GFP_MEMALLOC);
        if (result) {
                *reserve = 0;
                return result;
        }
        result = kmalloc(size, gfp_flags | GFP_MEMALLOC);
        if (result) {
                *reserve = 1;
                return result;
        }
        return NULL;
}
???
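
A hypothetical caller would then read as follows, re-using the skb
->emergency flag from earlier in the thread ('size' and 'gfp_mask' stand
for whatever the call site already has):

        int reserve;
        void *data = kmalloc_reserve(size, &reserve, gfp_mask);

        if (data && reserve)
                skb->emergency = 1;     /* whole skb is reserve-backed */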

>
> I've already heard interest from other people in using these hooks to
> provide swap on other non-block filesystems such as jffs2, logfs and the
> like.

I'm interested in the swap_in/swap_out interface for external
write-intent bitmaps for md/raid arrays.
You can have a write-intent bitmap which records which blocks might be
dirty if the host crashes, so that resync is much faster.
It can be stored in a file in a separate filesystem, but that is
currently implemented by using bmap to enumerate the blocks and then
reading/writing directly to the device (like swap). Your interface
would be much nicer for that (not that I think having a
write-intent-bitmap on an NFS filesystem would be a clever idea ;-)

I'll look forward to your next patch set....

One thing I had thought odd while reading the patches, but haven't
found an opportunity to mention before, is the "IS_SWAPFILE" test in
nfs-swapper.patch.
This seems like a layering violation. It would be better if the test
was based on whether ->swapfile had been called on the file. That way
my write-intent-bitmaps would get the same benefit.

NeilBrown

2008-03-10 09:18:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Mon, 2008-03-10 at 16:15 +1100, Neil Brown wrote:

> > On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> > >
> > > [I don't find the above wholly satisfying. There seems to be too much
> > > hand-waving. If someone can provide better text explaining why
> > > swapout is a special case, that would be great.]
> >
> > Anonymous pages are dirty by definition (except the zero page, but I
> > think we recently ditched it). So shrinking of the anonymous pool will
> > require swapping.
>
> Well, there is the swap cache. That's probably what I was thinking of
> when I said "clean anonymous pages". I suspect they are the first to
> go!

Ah, right, we could consider those clean anonymous. Alas, they are just
part of the aging lists and do not get special priority.

> > It is indeed the last refuge for those with GFP_NOFS. Along with the
> > strict limit on the amount of dirty file pages, it also ensures that writing
> > those out will never deadlock the machine, as there are always clean file
> > pages and/or anonymous pages to launder.
>
> The difficulty I have is justifying exactly why page-cache writeout
> will not deadlock. What if all the memory that is not dirty-pagecache
> is anonymous, and if swap isn't enabled?

Ah, I never considered the !SWAP case.

> Maybe the number returned by "determine_dirtyable_memory" in
> page-writeback.c excludes anonymous pages? I wonder if the meaning of
> NR_FREE_PAGES, NR_INACTIVE, etc is documented anywhere....

I don't think they are, but it should be obvious once you know the VM,
har har har :-)

NR_FREE_PAGES are the pages in the page allocator's free lists.
NR_INACTIVE are the pages on the inactive list.
NR_ACTIVE are the pages on the active list.

NR_INACTIVE+NR_ACTIVE are the number of pages on the page reclaim lists.

So, if you consider !SWAP, we could get in a deadlock when all of memory
is anonymous except for a few (<=dirty limit) dirty file pages.

But I guess the !SWAP people know what they're doing; large anon usage
without swap is asking for trouble.

> > Right. I've had a long conversation on PG_emergency with Pekka. And I
> > think the conclusion was that PG_emergency will create more headaches
> > than it solves. I probably have the conversation in my IRC logs and
> > could email it if you're interested (and Pekka doesn't object).
>
> Maybe that depends on the exact semantic of PG_emergency ??
> I remember you being concerned that PG_emergency never changes between
> allocation and freeing, and that wouldn't work well with slub.
> My envisioned semantic has it possibly changing quite often.
> What it means is:
> The last allocation done from this page was in a low-memory
> condition.

Yes, that works, except that we'd need to iterate all pages and clear
PG_emergency - which would imply tracking all these pages, etc.

Hence it would be better not to keep persistent state and do as we do
now; use some non-persistent state on allocation.

> You really need some way to tell if the result of kmalloc/kmem_cache_alloc
> should be treated as reserved.
> I think you had code which first tried the allocation without
> GFP_MEMALLOC and then if that failed, tried again *with*
> GFP_MEMALLOC. If that then succeeded, it is assumed to be an
> allocation from reserves. That seemed rather ugly, though I guess you
> could wrap it in a function to hide the ugliness:
>
> void *kmalloc_reserve(size_t size, int *reserve, gfp_t gfp_flags)
> {
>         void *result = kmalloc(size, gfp_flags & ~GFP_MEMALLOC);
>         if (result) {
>                 *reserve = 0;
>                 return result;
>         }
>         result = kmalloc(size, gfp_flags | GFP_MEMALLOC);
>         if (result) {
>                 *reserve = 1;
>                 return result;
>         }
>         return NULL;
> }
> ???

Yeah, I think this is the best we can do; just split this part out into
helper functions. I've been thinking of doing this - just haven't gotten
around to implementing it. I hope to do so this week and send out a new
series.

> > I've already heard interest from other people in using these hooks to
> > provide swap on other non-block filesystems such as jffs2, logfs and the
> > like.
>
> I'm interested in the swap_in/swap_out interface for external
> write-intent bitmaps for md/raid arrays.
> You can have a write-intent bitmap which records which blocks might be
> dirty if the host crashes, so that resync is much faster.
> It can be stored in a file in a separate filesystem, but that is
> currently implemented by using bmap to enumerate the blocks and then
> reading/writing directly to the device (like swap). Your interface
> would be much nicer for that (not that I think having a
> write-intent-bitmap on an NFS filesystem would be a clever idea ;-)

Hmm, right. But for that purpose the names swap_* are a tad misleading.
I remember hch mentioning this at some point. What would be a more
suitable naming scheme so we can both use it?

> I'll look forward to your next patch set....
>
> One thing I had thought odd while reading the patches, but haven't
> found an opportunity to mention before, is the "IS_SWAPFILE" test in
> nfs-swapper.patch.
> This seems like a layering violation. It would be better if the test
> was based on whether ->swapfile had been called on the file. That way
> my write-intent-bitmaps would get the same benefit.

I'll look into this; I didn't think using an inode test inside a
filesystem implementation was too weird.
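
One way to do that, sketched here with a made-up flag name, would be for
->swapfile() to set a bit on the mapping and for the filesystem to test
that bit instead of IS_SWAPFILE():

#include <linux/fs.h>
#include <linux/bitops.h>

#define AS_SWAPIO 6     /* made-up mapping flag; would live in pagemap.h */

/* Set by ->swapfile(..., 1), cleared by ->swapfile(..., 0). */
static int mapping_is_swapio(struct address_space *mapping)
{
        return test_bit(AS_SWAPIO, &mapping->flags);
}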

2008-03-14 05:23:46

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/28] Swap over NFS -v16

On Monday March 10, [email protected] wrote:
> >
> > Maybe that depends on the exact semantic of PG_emergency ??
> > I remember you being concerned that PG_emergency never changes between
> > allocation and freeing, and that wouldn't work well with slub.
> > My envisioned semantic has it possibly changing quite often.
> > What it means is:
> > The last allocation done from this page was in a low-memory
> > condition.
>
> Yes, that works, except that we'd need to iterate all pages and clear
> PG_emergency - which would imply tracking all these pages, etc.
>

I don't see why you need to clear PG_emergency at all.
If the semantic is:

> > The last allocation done from this page was in a low-memory
> > condition.

Then you only need to (potentially) modify its value when you
allocate it, or an element within it.

But if it doesn't fit well in the overall picture, then by all means
get rid of it.

>
> Hmm, right. But for that purpose the names swap_* are a tad misleading.
> I remember hch mentioning this at some point. What would be a more
> suitable naming scheme so we can both use it?

One could argue that "swap" is already a misleading term.
Version 7 Unix used to do swapping. It would write one process image
out to swap space, and read a different one in. Moving whole
processes at a time was called swapping.
When this clever idea of only moving pages at a time was introduced (I
think in 4BSD, but possibly in 2BSD and elsewhere) it was called
"demand paging" or just "paging".

So we don't have a swap partition any more. We have a paging
partition.

But everyone calls it 'swap' and we know what it means. I don't think
there would be a big cost in keeping the swap_ names but allowing them
to be used for occasional things other than swap.
And I suspect you would lose a lot if you tried to use a different
name that people didn't immediately identify with...

NeilBrown