2009-10-01 14:03:10

by Suresh Jayaraman

Subject: [PATCH 00/31] Swap over NFS -v20

Hi,

Here's the latest version of swap over NFS series since -v19 last October by
Peter Zijlstra. Peter does not have time to pursue this further (though he has
not lost interest) and that led me to take over this patchset and try merging
upstream.

The patches are against the current mmotm. It does not support SLQB, yet.
These patches can also be found online here:
http://www.suse.de/~sjayaraman/patches/swap-over-nfs/

The swap over NFS patches have been shipping with openSUSE 11.1 and SLE 11 (with
CONFIG_NFS_SWAP enabled by default) for several months now. No bugs caused by
these patches have been reported so far, and the series has proven stable.

Changes since -v19:
- rebased patches against current -mm
- adapted changes pertaining to using zone->watermarks array
- dropped cleanup patches/fixes that have already made it upstream
- dropped the patch that removes nfs mempools
- fixed racy nature of sync_page in swap_sync_page (NeilBrown)
- fixed use of uninitialized variable in cache_grow() (Miklos Szeredi)
- fixed a bug in bnx2 driver (Jiri Bohac)
- fixed null-pointer dereferences in swapfile code path when s_bdev is NULL

Thanks,
Suresh Jayaraman

--

Peter Zijlstra (26)
mm: serialize access to min_free_kbytes
mm: expose gfp_to_alloc_flags()
mm: tag reseve pages
mm: sl[au]b: add knowledge of reserve pages
mm: kmem_alloc_estimate()
mm: allow PF_MEMALLOC from softirq context
mm: emergency pool
mm: system wide ALLOC_NO_WATERMARK
mm: __GFP_MEMALLOC
mm: memory reserve management
mm: add support for non block device backed swap files
mm: methods for teaching filesystems about PG_swapcache pages
net: packet split receive api
net: sk_allocation() - concentrate socket related allocations
selinux: tag avc cache alloc as non-critical
netvm: network reserve infrastructure
netvm: INET reserves
netvm: hook skb allocation to reserves
netvm: filter emergency skbs
netvm: prevent a stream specific deadlock
netvm: skb processing
netfilter: NF_QUEUE vs emergency skbs
nfs: teach the NFS client how to treat PG_swapcache pages
nfs: disable data cache revalidation for swapfiles
nfs: enable swap on NFS
nfs: fix various memory recursions possible with swap over NFS

Jeff Mahoney (1)
Fix initialization of ipv4_route_lock

Neil Brown (2)
swap over network documentation
Cope with racy nature of sync_page in swap_sync_page

Miklos Szeredi (1)
Fix use of uninitialized variable in cache_grow()

Suresh Jayaraman (1)
swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL


fs/nfs/file.c | 18
fs/nfs/pagelist.c | 2
fs/nfs/write.c | 99 ++++
include/linux/mm_types.h | 1
include/linux/skbuff.h | 28 +
include/linux/slab.h | 19
include/net/sock.h | 55 ++
mm/page_alloc.c | 120 ++++--
mm/page_io.c | 2
mm/slab.c | 80 +++-
mm/slob.c | 67 +++
mm/slub.c | 89 ++++
mm/swapfile.c | 53 ++
Documentation/filesystems/Locking | 22 +
Documentation/filesystems/vfs.txt | 18
Documentation/network-swap.txt | 270 +++++++++++++
drivers/net/bnx2.c | 9
drivers/net/e1000e/netdev.c | 7
drivers/net/igb/igb_main.c | 9
drivers/net/ixgbe/ixgbe_main.c | 14
drivers/net/sky2.c | 16
fs/nfs/Kconfig | 10
fs/nfs/file.c | 6
fs/nfs/inode.c | 6
fs/nfs/internal.h | 7
fs/nfs/pagelist.c | 6
fs/nfs/read.c | 6
fs/nfs/write.c | 53 +-
include/linux/buffer_head.h | 1
include/linux/fs.h | 9
include/linux/gfp.h | 3
include/linux/mm.h | 25 +
include/linux/mm_types.h | 1
include/linux/mmzone.h | 3
include/linux/nfs_fs.h | 2
include/linux/pagemap.h | 5
include/linux/reserve.h | 198 +++++++++
include/linux/sched.h | 7
include/linux/skbuff.h | 3
include/linux/slab.h | 4
include/linux/slub_def.h | 1
include/linux/sunrpc/xprt.h | 5
include/linux/swap.h | 4
include/net/inet_frag.h | 7
include/net/netns/ipv6.h | 4
include/net/sock.h | 5
kernel/softirq.c | 3
mm/Makefile | 2
mm/internal.h | 15
mm/page_alloc.c | 16
mm/page_io.c | 51 ++
mm/reserve.c | 637 ++++++++++++++++++++++++++++++++
mm/slab.c | 61 ++-
mm/slob.c | 16
mm/slub.c | 43 +-
mm/swap_state.c | 4
mm/swapfile.c | 30 +
mm/vmstat.c | 6
net/Kconfig | 3
net/core/dev.c | 57 ++
net/core/filter.c | 3
net/core/skbuff.c | 137 +++++-
net/core/sock.c | 107 +++++
net/ipv4/inet_fragment.c | 3
net/ipv4/ip_fragment.c | 86 ++++
net/ipv4/route.c | 70 +++
net/ipv4/tcp.c | 3
net/ipv4/tcp_input.c | 12
net/ipv4/tcp_output.c | 12
net/ipv6/reassembly.c | 85 ++++
net/ipv6/route.c | 77 +++
net/ipv6/tcp_ipv6.c | 15
net/netfilter/core.c | 3
net/sctp/ulpevent.c | 2
net/sunrpc/Kconfig | 5
net/sunrpc/sched.c | 9
net/sunrpc/xprtsock.c | 68 +++
security/selinux/avc.c | 2
net/core/sock.c | 18
net/ipv4/route.c | 2

80 files changed, 2797 insertions(+), 245 deletions(-)


2009-10-01 17:42:05

by Christoph Hellwig

Subject: Re: [PATCH 00/31] Swap over NFS -v20

On Thu, Oct 01, 2009 at 07:34:18PM +0530, Suresh Jayaraman wrote:
> Hi,
>
> Here's the latest version of swap over NFS series since -v19 last October by
> Peter Zijlstra. Peter does not have time to pursue this further (though he has
> not lost interest) and that led me to take over this patchset and try merging
> upstream.
>
> The patches are against the current mmotm. It does not support SLQB, yet.
> These patches can also be found online here:
> http://www.suse.de/~sjayaraman/patches/swap-over-nfs/

My advice, again, is what I already gave Peter long ago: it's almost
impossible to get a patchset this large, touching this many subsystems,
merged.

Split it into smaller series that make sense on their own. One of them
would be the whole VM/net work to just make swap over nbd/iscsi safe.

The other really big one is adding a proper method for safe, page-backed
kernelspace I/O on files. That is not something like the grotty
swap-tied address_space operations in this patch, but more something in
the direction of the kernel direct I/O patches that Jens Axboe did
for use in the loop driver. But even those aren't complete as they
don't touch the locking issue yet.

Especially the latter is an absolutely essential step to make any
progress here, and an excellent patch series of its own as there are
multiple users for this, like making swap safe on btrfs files, making
the MD bitmap code actually safe or improving the loop driver.

2009-10-02 05:51:07

by NeilBrown

Subject: Re: [PATCH 00/31] Swap over NFS -v20

On Thursday October 1, Christoph Hellwig wrote:
>
> The other really big one is adding a proper method for safe, page-backed
> kernelspace I/O on files. That is not something like the grotty
> swap-tied address_space operations in this patch, but more something in
> the direction of the kernel direct I/O patches that Jens Axboe did
> for use in the loop driver. But even those aren't complete as they
> don't touch the locking issue yet.

Do you have a problem with the proposed address_space operations apart
from their names including the word "swap"? Would something like:
direct_on, direct_off, direct_read, direct_write
be better?
Semantics being that the read and write:
- bypass the page cache (invalidation is up to caller)
- must not make a blocking non-emergency memory allocation
direct_on does any pre-allocation and pre-reading needed to ensure that
those semantics can be provided.

I have wondered if an extra flag along the lines of "I don't care
about this data after a crash" would be useful.
It would be set for swap, but not set for other users. Thus
e.g. RAID1 could easily avoid resyncing an area that was used only for
swap.
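As a rough userspace illustration of the semantics described above (not
the interface from the patch set), the four operations could be sketched
as a table of function pointers. Everything here is an assumption made
for the sketch: the struct names, the signatures, the in-memory backing
store standing in for a file, and the DIRECT_VOLATILE flag standing in
for the "don't care after a crash" hint. The point it tries to show is
that direct_on does all the allocation up front, so direct_read and
direct_write never have to allocate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical flag: contents need not survive a crash (e.g. swap). */
#define DIRECT_VOLATILE 0x1u

struct direct_file;

/* Hypothetical operations table, named after the four calls in the mail. */
struct direct_ops {
	int  (*direct_on)(struct direct_file *f, size_t npages, unsigned int flags);
	void (*direct_off)(struct direct_file *f);
	int  (*direct_read)(struct direct_file *f, size_t idx, void *page);
	int  (*direct_write)(struct direct_file *f, size_t idx, const void *page);
};

struct direct_file {
	const struct direct_ops *ops;
	unsigned char *store;	/* stands in for the backing file */
	size_t npages;
	unsigned int flags;
};

static int mem_direct_on(struct direct_file *f, size_t npages, unsigned int flags)
{
	/* All allocation happens here, so read/write never need to allocate. */
	f->store = calloc(npages, SKETCH_PAGE_SIZE);
	if (!f->store)
		return -1;
	f->npages = npages;
	f->flags = flags;
	return 0;
}

static void mem_direct_off(struct direct_file *f)
{
	free(f->store);
	f->store = NULL;
	f->npages = 0;
}

static int mem_direct_read(struct direct_file *f, size_t idx, void *page)
{
	if (!f->store || idx >= f->npages)
		return -1;
	/* Copies straight from the backing store; no cache, no allocation. */
	memcpy(page, f->store + idx * SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
	return 0;
}

static int mem_direct_write(struct direct_file *f, size_t idx, const void *page)
{
	if (!f->store || idx >= f->npages)
		return -1;
	memcpy(f->store + idx * SKETCH_PAGE_SIZE, page, SKETCH_PAGE_SIZE);
	return 0;
}

static const struct direct_ops mem_direct_ops = {
	.direct_on    = mem_direct_on,
	.direct_off   = mem_direct_off,
	.direct_read  = mem_direct_read,
	.direct_write = mem_direct_write,
};

/* Round-trips one page and checks bounds; returns 0 on success. */
int direct_demo(void)
{
	struct direct_file f = { .ops = &mem_direct_ops };
	unsigned char out[SKETCH_PAGE_SIZE], in[SKETCH_PAGE_SIZE];
	int ret = -1;

	memset(out, 0xab, sizeof(out));
	if (f.ops->direct_on(&f, 4, DIRECT_VOLATILE))
		return -1;
	if (f.ops->direct_write(&f, 2, out) == 0 &&
	    f.ops->direct_read(&f, 2, in) == 0 &&
	    memcmp(in, out, sizeof(in)) == 0 &&
	    f.ops->direct_read(&f, 4, in) != 0)	/* out of range must fail */
		ret = 0;
	f.ops->direct_off(&f);
	return ret;
}
```

A caller would invoke direct_demo() from its own main(). In a real
kernel interface the flag would presumably be consulted by lower layers
(for example RAID1 deciding whether to resync a region); this sketch
only stores it.
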

The only thing of Jens' that I could find used bmap - is there
something more recent I should look for?

>
> Especially the latter is an absolutely essential step to make any
> progress here, and an excellent patch series of its own as there are
> multiple users for this, like making swap safe on btrfs files, making
> the MD bitmap code actually safe or improving the loop driver.

100% agree.

Thanks,
NeilBrown

2009-10-02 08:20:48

by Suresh Jayaraman

Subject: Re: [PATCH 00/31] Swap over NFS -v20

Christoph Hellwig wrote:
> On Thu, Oct 01, 2009 at 07:34:18PM +0530, Suresh Jayaraman wrote:
>
> The other really big one is adding a proper method for safe, page-backed
> kernelspace I/O on files. That is not something like the grotty
> swap-tied address_space operations in this patch, but more something in

I'm not sure I understand what problems you see with the proposed
address_space operations. Could you please elaborate a bit more?

> the direction of the kernel direct I/O patches that Jens Axboe did
> for use in the loop driver. But even those aren't complete as they
> don't touch the locking issue yet.
>

Thanks,

--
Suresh Jayaraman

2009-10-04 21:42:05

by Peter Zijlstra

Subject: Re: [PATCH 00/31] Swap over NFS -v20

On Thu, 2009-10-01 at 13:42 -0400, Christoph Hellwig wrote:
> One of them
> would be the whole VM/net work to just make swap over nbd/iscsi safe.

Getting those two 'fixed' is going to be tons of interesting work
because they involve interaction with userspace daemons.

NBD has fairly simple userspace, but iSCSI has a rather large userspace
footprint and a rather complicated user/kernel interaction which will be
mighty interesting to get allocation safe.

Ideally the swap-over-$foo bits have no userspace component.

That said, Wouter is the NBD userspace maintainer and has expressed
interest in looking at making that work, but it's sure going to be
non-trivial, esp. since exposing PF_MEMALLOC to userspace is an
over-my-dead-body kind of thing.

2009-10-10 12:07:22

by Pavel Machek

Subject: Re: [PATCH 00/31] Swap over NFS -v20

Hi!

> > One of them
> > would be the whole VM/net work to just make swap over nbd/iscsi safe.
>
> Getting those two 'fixed' is going to be tons of interesting work
> because they involve interaction with userspace daemons.
>
> NBD has fairly simple userspace, but iSCSI has a rather large userspace
> footprint and a rather complicated user/kernel interaction which will be
> mighty interesting to get allocation safe.
>
> Ideally the swap-over-$foo bits have no userspace component.
>
> That said, Wouter is the NBD userspace maintainer and has expressed
> interest in looking at making that work, but it's sure going to be
> non-trivial, esp. since exposing PF_MEMALLOC to userspace is an
> over-my-dead-body kind of thing.

Well, as long as nbd-server is on a separate machine (with real swap),
safe swapping over network should be ok, without PF_MEMALLOC for
userspace or similar nightmares, right?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-10-10 12:24:32

by Peter Zijlstra

Subject: Re: [PATCH 00/31] Swap over NFS -v20

On Sat, 2009-10-10 at 14:06 +0200, Pavel Machek wrote:
> Hi!
>
> > > One of them
> > > would be the whole VM/net work to just make swap over nbd/iscsi safe.
> >
> > Getting those two 'fixed' is going to be tons of interesting work
> > because they involve interaction with userspace daemons.
> >
> > NBD has fairly simple userspace, but iSCSI has a rather large userspace
> > footprint and a rather complicated user/kernel interaction which will be
> > mighty interesting to get allocation safe.
> >
> > Ideally the swap-over-$foo bits have no userspace component.
> >
> > That said, Wouter is the NBD userspace maintainer and has expressed
> > interest in looking at making that work, but it's sure going to be
> > non-trivial, esp. since exposing PF_MEMALLOC to userspace is an
> > over-my-dead-body kind of thing.
>
> Well, as long as nbd-server is on a separate machine (with real swap),
> safe swapping over network should be ok, without PF_MEMALLOC for
> userspace or similar nightmares, right?

Nope, as soon as the nbd-client loses its connection you're up shit
creek.

2009-10-10 21:11:09

by Pavel Machek

Subject: Re: [PATCH 00/31] Swap over NFS -v20

On Sat 2009-10-10 14:23:41, Peter Zijlstra wrote:
> On Sat, 2009-10-10 at 14:06 +0200, Pavel Machek wrote:
> > Hi!
> >
> > > > One of them
> > > > would be the whole VM/net work to just make swap over nbd/iscsi safe.
> > >
> > > Getting those two 'fixed' is going to be tons of interesting work
> > > because they involve interaction with userspace daemons.
> > >
> > > NBD has fairly simple userspace, but iSCSI has a rather large userspace
> > > footprint and a rather complicated user/kernel interaction which will be
> > > mighty interesting to get allocation safe.
> > >
> > > Ideally the swap-over-$foo bits have no userspace component.
> > >
> > > That said, Wouter is the NBD userspace maintainer and has expressed
> > > interest in looking at making that work, but it's sure going to be
> > > non-trivial, esp. since exposing PF_MEMALLOC to userspace is an
> > > over-my-dead-body kind of thing.
> >
> > Well, as long as nbd-server is on a separate machine (with real swap),
> > safe swapping over network should be ok, without PF_MEMALLOC for
> > userspace or similar nightmares, right?
>
> Nope, as soon as the nbd-client loses its connection you're up shit
> creek.

Oops, right. Putting reconnect logic into the kernel would make sense.

I misunderstood your proposal. I thought you'd want to put
nbd-_server_ into the kernel too. I guess we violently agree that
that's unnecessary.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html