2022-02-08 11:29:38

by NeilBrown

[permalink] [raw]
Subject: [PATCH 00/21 V4] Repair SWAP-over_NFS

This 4th version of the series address review comment, particularly
tidying up "NFS: swap IO handling is slightly different for O_DIRECT IO"
and collect reviewed-by etc.

I've also move 3 NFS patches which depend on the MM patches to the end
in case they helps maintainers land the patches in a consistent order.
Those three patches might go through the NFS free after the next merge
window.

Original intro follows.
Thanks,
NeilBrown

swap-over-NFS currently has a variety of problems.

swap writes call generic_write_checks(), which always fails on a swap
file, so it completely fails.
Even without this, various deadlocks are possible - largely due to
improvements in NFS memory allocation (using NOFS instead of ATOMIC)
which weren't tested against swap-out.

NFS is the only filesystem that has supported fs-based swap IO, and it
hasn't worked for several releases, so now is a convenient time to clean
up the swap-via-filesystem interfaces - we cannot break anything !

So the first few patches here clean up and improve various parts of the
swap-via-filesystem code. ->activate_swap() is given a cleaner
interface, a new ->swap_rw is introduced instead of burdening
->direct_IO, etc.

Current swap-to-filesystem code only ever submits single-page reads and
writes. These patches change that to allow multi-page IO when adjacent
requests are submitted. Writes are also changed to be async rather than
sync. This substantially speeds up write throughput for swap-over-NFS.

---

NeilBrown (21):
MM - in merge winow
MM: create new mm/swap.h header file.
MM: drop swap_set_page_dirty
MM: move responsibility for setting SWP_FS_OPS to ->swap_activate
MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space
MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw
DOC: update documentation for swap_activate and swap_rw
MM: submit multipage reads for SWP_FS_OPS swap-space
MM: submit multipage write for SWP_FS_OPS swap-space
VFS: Add FMODE_CAN_ODIRECT file flag

NFS - in merge window
NFS: remove IS_SWAPFILE hack
SUNRPC/call_alloc: async tasks mustn't block waiting for memory
SUNRPC/auth: async tasks mustn't block waiting for memory
SUNRPC/xprt: async tasks mustn't block waiting for memory
SUNRPC: remove scheduling boost for "SWAPPER" tasks.
NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
NFSv4: keep state manager thread active if swap is enabled

NFS - after merge window
NFS: rename nfs_direct_IO and use as ->swap_rw
NFS: swap IO handling is slightly different for O_DIRECT IO
NFS: swap-out must always use STABLE writes.


Documentation/filesystems/locking.rst | 18 +-
Documentation/filesystems/vfs.rst | 17 +-
drivers/block/loop.c | 4 +-
fs/cifs/file.c | 7 +-
fs/fcntl.c | 9 +-
fs/nfs/direct.c | 67 +++++---
fs/nfs/file.c | 39 +++--
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 +++
fs/nfs/nfs4state.c | 39 ++++-
fs/nfs/read.c | 4 -
fs/nfs/write.c | 2 +
fs/open.c | 9 +-
fs/overlayfs/file.c | 13 +-
include/linux/fs.h | 4 +
include/linux/nfs_fs.h | 15 +-
include/linux/nfs_xdr.h | 2 +
include/linux/sunrpc/auth.h | 1 +
include/linux/sunrpc/sched.h | 1 -
include/linux/swap.h | 7 +-
include/linux/writeback.h | 7 +
include/trace/events/sunrpc.h | 1 -
mm/madvise.c | 8 +-
mm/memory.c | 2 +-
mm/page_io.c | 237 +++++++++++++++++++-------
mm/swap.h | 30 +++-
mm/swap_state.c | 22 ++-
mm/swapfile.c | 13 +-
mm/vmscan.c | 38 +++--
net/sunrpc/auth.c | 8 +-
net/sunrpc/auth_gss/auth_gss.c | 6 +-
net/sunrpc/auth_unix.c | 10 +-
net/sunrpc/clnt.c | 7 +-
net/sunrpc/sched.c | 29 ++--
net/sunrpc/xprt.c | 19 +--
net/sunrpc/xprtrdma/transport.c | 10 +-
net/sunrpc/xprtsock.c | 8 +
37 files changed, 519 insertions(+), 215 deletions(-)

--
Signature



2022-02-08 11:30:10

by NeilBrown

[permalink] [raw]
Subject: [PATCH 04/21] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space

If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
safe to enter the FS for reclaim.
So only down-grade the requirement for swap pages to __GFP_IO after
checking that SWP_FS_OPS are not being used.

This makes the calculation of "may_enter_fs" slightly more complex, so
move it into a separate function. with that done, there is little value
in maintaining the bool variable any more. So replace the
may_enter_fs variable with a may_enter_fs() function. This removes any
risk for the variable becoming out-of-date.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
mm/swap.h | 8 ++++++++
mm/vmscan.c | 29 ++++++++++++++++++++---------
2 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 13e72a5023aa..5c676e55f288 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -47,6 +47,10 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);

+static inline unsigned int page_swap_flags(struct page *page)
+{
+ return page_swap_info(page)->flags;
+}
#else /* CONFIG_SWAP */
static inline int swap_readpage(struct page *page, bool do_poll)
{
@@ -126,4 +130,8 @@ static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
{
}

+static inline unsigned int page_swap_flags(struct page *page)
+{
+ return 0;
+}
#endif /* CONFIG_SWAP */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5c734ffc6057..ad5026d06aa8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1506,6 +1506,22 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
return nr_succeeded;
}

+static bool may_enter_fs(struct page *page, gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_FS)
+ return true;
+ if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
+ return false;
+ /*
+ * We can "enter_fs" for swap-cache with only __GFP_IO
+ * providing this isn't SWP_FS_OPS.
+ * ->flags can be updated non-atomicially (scan_swap_map_slots),
+ * but that will never affect SWP_FS_OPS, so the data_race
+ * is safe.
+ */
+ return !data_race(page_swap_flags(page) & SWP_FS_OPS);
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -1531,7 +1547,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
struct address_space *mapping;
struct page *page;
enum page_references references = PAGEREF_RECLAIM;
- bool dirty, writeback, may_enter_fs;
+ bool dirty, writeback;
unsigned int nr_pages;

cond_resched();
@@ -1555,9 +1571,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!sc->may_unmap && page_mapped(page))
goto keep_locked;

- may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
- (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
-
/*
* The number of dirty pages determines if a node is marked
* reclaim_congested. kswapd will stall and start writing
@@ -1602,7 +1615,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
* not to fs). In this case mark the page for immediate
* reclaim and continue scanning.
*
- * Require may_enter_fs because we would wait on fs, which
+ * Require may_enter_fs() because we would wait on fs, which
* may not have submitted IO yet. And the loop driver might
* enter reclaim, and deadlock if it waits on a page for
* which it is needed to do the write (loop masks off
@@ -1634,7 +1647,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,

/* Case 2 above */
} else if (writeback_throttling_sane(sc) ||
- !PageReclaim(page) || !may_enter_fs) {
+ !PageReclaim(page) || !may_enter_fs(page, sc->gfp_mask)) {
/*
* This is slightly racy - end_page_writeback()
* might have just cleared PageReclaim, then
@@ -1724,8 +1737,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
goto activate_locked_split;
}

- may_enter_fs = true;
-
/* Adding to swap updated mapping */
mapping = page_mapping(page);
}
@@ -1795,7 +1806,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,

if (references == PAGEREF_RECLAIM_CLEAN)
goto keep_locked;
- if (!may_enter_fs)
+ if (!may_enter_fs(page, sc->gfp_mask))
goto keep_locked;
if (!sc->may_writepage)
goto keep_locked;



2022-02-08 11:31:54

by NeilBrown

[permalink] [raw]
Subject: [PATCH 11/21] NFS: remove IS_SWAPFILE hack

This code is pointless as IS_SWAPFILE is always defined.
So remove it.

Suggested-by: Mark Hemment <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 9e2def045111..4d4750738aeb 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -44,11 +44,6 @@

static const struct vm_operations_struct nfs_file_vm_ops;

-/* Hack for future NFS swap support */
-#ifndef IS_SWAPFILE
-# define IS_SWAPFILE(inode) (0)
-#endif
-
int nfs_check_flags(int flags)
{
if ((flags & (O_APPEND | O_DIRECT)) == (O_APPEND | O_DIRECT))



2022-02-08 11:32:59

by NeilBrown

[permalink] [raw]
Subject: [PATCH 18/21] NFSv4: keep state manager thread active if swap is enabled

If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.

This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 15 ++++++++++++---
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 ++++++++++++++++++++
fs/nfs/nfs4state.c | 39 +++++++++++++++++++++++++++++++++------
include/linux/nfs_xdr.h | 2 ++
net/sunrpc/clnt.c | 2 ++
6 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 4d4750738aeb..81fe996c6272 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -486,8 +486,9 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
unsigned long blocks;
long long isize;
int ret;
- struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = file_inode(file);
+ struct rpc_clnt *clnt = NFS_CLIENT(inode);
+ struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;

if (!file->f_mapping->a_ops->swap_rw)
/* Cannot support swap */
@@ -512,14 +513,22 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
}
*span = sis->pages;
sis->flags |= SWP_FS_OPS;
+
+ if (cl->rpc_ops->enable_swap)
+ cl->rpc_ops->enable_swap(inode);
+
return ret;
}

static void nfs_swap_deactivate(struct file *file)
{
- struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
+ struct inode *inode = file_inode(file);
+ struct rpc_clnt *clnt = NFS_CLIENT(inode);
+ struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;

rpc_clnt_swap_deactivate(clnt);
+ if (cl->rpc_ops->disable_swap)
+ cl->rpc_ops->disable_swap(file_inode(file));
}

const struct address_space_operations nfs_file_aops = {
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 84f39b6f1b1e..79df6e83881b 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -42,6 +42,7 @@ enum nfs4_client_state {
NFS4CLNT_LEASE_MOVED,
NFS4CLNT_DELEGATION_EXPIRED,
NFS4CLNT_RUN_MANAGER,
+ NFS4CLNT_MANAGER_AVAILABLE,
NFS4CLNT_RECALL_RUNNING,
NFS4CLNT_RECALL_ANY_LAYOUT_READ,
NFS4CLNT_RECALL_ANY_LAYOUT_RW,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index b18f31b2c9e7..d3549f48b012 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10461,6 +10461,24 @@ static ssize_t nfs4_listxattr(struct dentry *dentry, char *list, size_t size)
return error + error2 + error3;
}

+static void nfs4_enable_swap(struct inode *inode)
+{
+ /* The state manager thread must always be running.
+ * It will notice the client is a swapper, and stay put.
+ */
+ struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+
+ nfs4_schedule_state_manager(clp);
+}
+
+static void nfs4_disable_swap(struct inode *inode)
+{
+ /* The state manager thread will now exit once it is
+ * woken.
+ */
+ wake_up_var(&NFS_SERVER(inode)->nfs_client->cl_state);
+}
+
static const struct inode_operations nfs4_dir_inode_operations = {
.create = nfs_create,
.lookup = nfs_lookup,
@@ -10538,6 +10556,8 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
.create_server = nfs4_create_server,
.clone_server = nfs_clone_server,
.discover_trunking = nfs4_discover_trunking,
+ .enable_swap = nfs4_enable_swap,
+ .disable_swap = nfs4_disable_swap,
};

static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = {
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index f5a62c0d999b..5dc52eefaffb 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1205,10 +1205,17 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
{
struct task_struct *task;
char buf[INET6_ADDRSTRLEN + sizeof("-manager") + 1];
+ struct rpc_clnt *cl = clp->cl_rpcclient;
+
+ while (cl != cl->cl_parent)
+ cl = cl->cl_parent;

set_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state);
- if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
+ if (test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state) != 0) {
+ wake_up_var(&clp->cl_state);
return;
+ }
+ set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
__module_get(THIS_MODULE);
refcount_inc(&clp->cl_count);

@@ -1224,6 +1231,7 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
printk(KERN_ERR "%s: kthread_run: %ld\n",
__func__, PTR_ERR(task));
nfs4_clear_state_manager_bit(clp);
+ clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
nfs_put_client(clp);
module_put(THIS_MODULE);
}
@@ -2669,11 +2677,8 @@ static void nfs4_state_manager(struct nfs_client *clp)
clear_bit(NFS4CLNT_RECALL_RUNNING, &clp->cl_state);
}

- /* Did we race with an attempt to give us more work? */
- if (!test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
- return;
- if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
- return;
+ return;
+
} while (refcount_read(&clp->cl_count) > 1 && !signalled());
goto out_drain;

@@ -2693,9 +2698,31 @@ static void nfs4_state_manager(struct nfs_client *clp)
static int nfs4_run_state_manager(void *ptr)
{
struct nfs_client *clp = ptr;
+ struct rpc_clnt *cl = clp->cl_rpcclient;
+
+ while (cl != cl->cl_parent)
+ cl = cl->cl_parent;

allow_signal(SIGKILL);
+again:
+ set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
nfs4_state_manager(clp);
+ if (atomic_read(&cl->cl_swapper)) {
+ wait_var_event_interruptible(&clp->cl_state,
+ test_bit(NFS4CLNT_RUN_MANAGER,
+ &clp->cl_state));
+ if (atomic_read(&cl->cl_swapper) &&
+ test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
+ goto again;
+ /* Either no longer a swapper, or were signalled */
+ }
+ clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
+
+ if (refcount_read(&clp->cl_count) > 1 && !signalled() &&
+ test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state) &&
+ !test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state))
+ goto again;
+
nfs_put_client(clp);
module_put_and_kthread_exit(0);
return 0;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 728cb0c1f0b6..6861ac8af808 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1798,6 +1798,8 @@ struct nfs_rpc_ops {
struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
struct nfs_fattr *, rpc_authflavor_t);
int (*discover_trunking)(struct nfs_server *, struct nfs_fh *);
+ void (*enable_swap)(struct inode *inode);
+ void (*disable_swap)(struct inode *inode);
};

/*
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 842366a2fc57..04ccf6a06ca7 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -3069,6 +3069,8 @@ rpc_clnt_swap_activate_callback(struct rpc_clnt *clnt,
int
rpc_clnt_swap_activate(struct rpc_clnt *clnt)
{
+ while (clnt != clnt->cl_parent)
+ clnt = clnt->cl_parent;
if (atomic_inc_return(&clnt->cl_swapper) == 1)
return rpc_clnt_iterate_for_each_xprt(clnt,
rpc_clnt_swap_activate_callback, NULL);



2022-02-08 15:46:09

by NeilBrown

[permalink] [raw]
Subject: [PATCH 09/21] MM: submit multipage write for SWP_FS_OPS swap-space

swap_writepage() is given one page at a time, but may be called repeatedly
in succession.
For block-device swapspace, the blk_plug functionality allows the
multiple pages to be combined together at lower layers.
That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
only active when CONFIG_BLOCK=y. Consequently all swap reads over NFS
are single page reads.

With this patch we pass a pointer-to-pointer via the wbc.
swap_writepage can store state between calls - much like the pointer
passed explicitly to swap_readpage. After calling swap_writepage() some
number of times, the state will be passed to swap_write_unplug() which
can submit the combined request.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/writeback.h | 7 ++++
mm/page_io.c | 78 ++++++++++++++++++++++++++++++++-------------
mm/swap.h | 4 ++
mm/vmscan.c | 9 ++++-
4 files changed, 74 insertions(+), 24 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fec248ab1fec..32b35f21cb97 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -80,6 +80,13 @@ struct writeback_control {

unsigned punt_to_cgroup:1; /* cgrp punting, see __REQ_CGROUP_PUNT */

+ /* To enable batching of swap writes to non-block-device backends,
+ * "plug" can be set point to a 'struct swap_iocb *'. When all swap
+ * writes have been submitted, if with swap_iocb is not NULL,
+ * swap_write_unplug() should be called.
+ */
+ struct swap_iocb **swap_plug;
+
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *wb; /* wb this writeback is issued under */
struct inode *inode; /* inode being written out */
diff --git a/mm/page_io.c b/mm/page_io.c
index fc82e8750e9b..7684a8d81dcd 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -308,8 +308,9 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
{
struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
struct page *page = sio->bvec[0].bv_page;
+ int p;

- if (ret != PAGE_SIZE) {
+ if (ret != PAGE_SIZE * sio->pages) {
/*
* In the case of swap-over-nfs, this can be a
* temporary failure if the system has limited
@@ -320,43 +321,63 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
* the normal direct-to-bio case as it could
* be temporary.
*/
- set_page_dirty(page);
- ClearPageReclaim(page);
pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
ret, page_file_offset(page));
+ for (p = 0; p < sio->pages; p++) {
+ page = sio->bvec[p].bv_page;
+ set_page_dirty(page);
+ ClearPageReclaim(page);
+ }
} else
- count_vm_event(PSWPOUT);
- end_page_writeback(page);
+ count_vm_events(PSWPOUT, sio->pages);
+
+ for (p = 0; p < sio->pages; p++)
+ end_page_writeback(sio->bvec[p].bv_page);
+
mempool_free(sio, sio_pool);
}

static int swap_writepage_fs(struct page *page, struct writeback_control *wbc)
{
- struct swap_iocb *sio;
+ struct swap_iocb *sio = NULL;
struct swap_info_struct *sis = page_swap_info(page);
struct file *swap_file = sis->swap_file;
- struct address_space *mapping = swap_file->f_mapping;
- struct iov_iter from;
- int ret;
+ loff_t pos = page_file_offset(page);

set_page_writeback(page);
unlock_page(page);
- sio = mempool_alloc(sio_pool, GFP_NOIO);
- init_sync_kiocb(&sio->iocb, swap_file);
- sio->iocb.ki_complete = sio_write_complete;
- sio->iocb.ki_pos = page_file_offset(page);
- sio->bvec[0].bv_page = page;
- sio->bvec[0].bv_len = PAGE_SIZE;
- sio->bvec[0].bv_offset = 0;
- iov_iter_bvec(&from, WRITE, &sio->bvec[0], 1, PAGE_SIZE);
- ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
- if (ret != -EIOCBQUEUED)
- sio_write_complete(&sio->iocb, ret);
- return ret;
+ if (wbc->swap_plug)
+ sio = *wbc->swap_plug;
+ if (sio) {
+ if (sio->iocb.ki_filp != swap_file ||
+ sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+ swap_write_unplug(sio);
+ sio = NULL;
+ }
+ }
+ if (!sio) {
+ sio = mempool_alloc(sio_pool, GFP_NOIO);
+ init_sync_kiocb(&sio->iocb, swap_file);
+ sio->iocb.ki_complete = sio_write_complete;
+ sio->iocb.ki_pos = pos;
+ sio->pages = 0;
+ }
+ sio->bvec[sio->pages].bv_page = page;
+ sio->bvec[sio->pages].bv_len = PAGE_SIZE;
+ sio->bvec[sio->pages].bv_offset = 0;
+ sio->pages += 1;
+ if (sio->pages == ARRAY_SIZE(sio->bvec) || !wbc->swap_plug) {
+ swap_write_unplug(sio);
+ sio = NULL;
+ }
+ if (wbc->swap_plug)
+ *wbc->swap_plug = sio;
+
+ return 0;
}

int __swap_writepage(struct page *page, struct writeback_control *wbc,
- bio_end_io_t end_write_func)
+ bio_end_io_t end_write_func)
{
struct bio *bio;
int ret;
@@ -388,6 +409,19 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return 0;
}

+void swap_write_unplug(struct swap_iocb *sio)
+{
+ struct iov_iter from;
+ struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+ int ret;
+
+ iov_iter_bvec(&from, WRITE, sio->bvec, sio->pages,
+ PAGE_SIZE * sio->pages);
+ ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+ if (ret != -EIOCBQUEUED)
+ sio_write_complete(&sio->iocb, ret);
+}
+
static void sio_read_complete(struct kiocb *iocb, long ret)
{
struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
diff --git a/mm/swap.h b/mm/swap.h
index 7aab4c82e2d0..58f28e94ba56 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -13,6 +13,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
if (unlikely(plug))
__swap_read_unplug(plug);
}
+void swap_write_unplug(struct swap_iocb *sio);
int swap_writepage(struct page *page, struct writeback_control *wbc);
void end_swap_bio_write(struct bio *bio);
int __swap_writepage(struct page *page, struct writeback_control *wbc,
@@ -68,6 +69,9 @@ static inline int swap_readpage(struct page *page, bool do_poll,
{
return 0;
}
+static inline void swap_write_unplug(struct swap_iocb *sio)
+{
+}

static inline struct address_space *swap_address_space(swp_entry_t entry)
{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ad5026d06aa8..572627412e7d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1164,7 +1164,8 @@ typedef enum {
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
*/
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping,
+ struct swap_iocb **plug)
{
/*
* If the page is dirty, only perform writeback if that write
@@ -1211,6 +1212,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
.range_start = 0,
.range_end = LLONG_MAX,
.for_reclaim = 1,
+ .swap_plug = plug,
};

SetPageReclaim(page);
@@ -1537,6 +1539,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
unsigned int nr_reclaimed = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
+ struct swap_iocb *plug = NULL;

memset(stat, 0, sizeof(*stat));
cond_resched();
@@ -1817,7 +1820,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
* starts and then write it out here.
*/
try_to_unmap_flush_dirty();
- switch (pageout(page, mapping)) {
+ switch (pageout(page, mapping, &plug)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
@@ -1971,6 +1974,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);

+ if (plug)
+ swap_write_unplug(plug);
return nr_reclaimed;
}




2022-02-08 17:00:13

by NeilBrown

[permalink] [raw]
Subject: [PATCH 19/21] NFS: rename nfs_direct_IO and use as ->swap_rw

The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a
while. We now need a ->swap_rw function which behaves slightly
differently, returning zero for success rather than a byte count.

So modify nfs_direct_IO accordingly, rename it, and use it as the
->swap_rw function.

Note: it still won't work - that will be fixed in later patches.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/direct.c | 23 ++++++++++-------------
fs/nfs/file.c | 5 +----
include/linux/nfs_fs.h | 2 +-
3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eabfdab543c8..b929dd5b0c3a 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -153,28 +153,25 @@ nfs_direct_count_bytes(struct nfs_direct_req *dreq,
}

/**
- * nfs_direct_IO - NFS address space operation for direct I/O
+ * nfs_swap_rw - NFS address space operation for swap I/O
* @iocb: target I/O control block
* @iter: I/O buffer
*
- * The presence of this routine in the address space ops vector means
- * the NFS client supports direct I/O. However, for most direct IO, we
- * shunt off direct read and write requests before the VFS gets them,
- * so this method is only ever called for swap.
+ * Perform IO to the swap-file. This is much like direct IO.
*/
-ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
{
- struct inode *inode = iocb->ki_filp->f_mapping->host;
-
- /* we only support swap file calling nfs_direct_IO */
- if (!IS_SWAPFILE(inode))
- return 0;
+ ssize_t ret;

VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);

if (iov_iter_rw(iter) == READ)
- return nfs_file_direct_read(iocb, iter);
- return nfs_file_direct_write(iocb, iter);
+ ret = nfs_file_direct_read(iocb, iter);
+ else
+ ret = nfs_file_direct_write(iocb, iter);
+ if (ret < 0)
+ return ret;
+ return 0;
}

static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 81fe996c6272..7d42117b210d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -490,10 +490,6 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;

- if (!file->f_mapping->a_ops->swap_rw)
- /* Cannot support swap */
- return -EINVAL;
-
spin_lock(&inode->i_lock);
blocks = inode->i_blocks;
isize = inode->i_size;
@@ -549,6 +545,7 @@ const struct address_space_operations nfs_file_aops = {
.error_remove_page = generic_error_remove_page,
.swap_activate = nfs_swap_activate,
.swap_deactivate = nfs_swap_deactivate,
+ .swap_rw = nfs_swap_rw,
};

/*
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index ff8b3820409c..58807406aff6 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -506,7 +506,7 @@ static inline const struct cred *nfs_file_cred(struct file *file)
/*
* linux/fs/nfs/direct.c
*/
-extern ssize_t nfs_direct_IO(struct kiocb *, struct iov_iter *);
+int nfs_swap_rw(struct kiocb *, struct iov_iter *);
extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
struct iov_iter *iter);
extern ssize_t nfs_file_direct_write(struct kiocb *iocb,



2022-02-09 04:10:01

by NeilBrown

[permalink] [raw]
Subject: [PATCH 07/21] DOC: update documentation for swap_activate and swap_rw

This documentation for ->swap_activate() has been out-of-date for a long
time. This patch updates it to match recent changes, and adds
documentation for the associated ->swap_rw()

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
Documentation/filesystems/locking.rst | 18 ++++++++++++------
Documentation/filesystems/vfs.rst | 17 ++++++++++++-----
2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 3f9b1497ebb8..fbb10378d5ee 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -260,8 +260,9 @@ prototypes::
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
- int (*swap_activate)(struct file *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);

locking rules:
All except set_page_dirty and freepage may block
@@ -290,6 +291,7 @@ is_partially_uptodate: yes
error_remove_page: yes
swap_activate: no
swap_deactivate: no
+swap_rw: yes, unlocks
====================== ======================== ========= ===============

->write_begin(), ->write_end() and ->readpage() may be called from
@@ -392,15 +394,19 @@ cleaned, or an error value if not. Note that in order to prevent the page
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.

-->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+->swap_activate() will be called to prepare the given file for swap. It
+should perform any validation and preparation necessary to ensure that
+writes can be performed with minimal memory allocation. It should call
+add_swap_extent(), or the helper iomap_swapfile_activate(), and return
+the number of extents added. If IO should be submitted through
+->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
+directly to the block device ``sis->bdev``.

->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.

+->swap_rw will be called for swap IO if SWP_FS_OPS was set by ->swap_activate().
+
file_lock_operations
====================

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf5c48066fac..779d23fc7954 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -751,8 +751,9 @@ cache in your filesystem. The following members are defined:
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page);
- int (*swap_activate)(struct file *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};

``writepage``
@@ -959,15 +960,21 @@ cache in your filesystem. The following members are defined:
unless you have them locked or reference counts increased.

``swap_activate``
- Called when swapon is used on a file to allocate space if
- necessary and pin the block lookup information in memory. A
- return value of zero indicates success, in which case this file
- can be used to back swapspace.
+
+ Called to prepare the given file for swap. It should perform
+ any validation and preparation necessary to ensure that writes
+ can be performed with minimal memory allocation. It should call
+ add_swap_extent(), or the helper iomap_swapfile_activate(), and
+ return the number of extents added. If IO should be submitted
+ through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will
+ be submitted directly to the block device ``sis->bdev``.

``swap_deactivate``
Called during swapoff on files where swap_activate was
successful.

+``swap_rw``
+ Called to read or write swap pages when SWP_FS_OPS is set.

The File Object
===============



2022-02-09 06:49:25

by NeilBrown

[permalink] [raw]
Subject: [PATCH 16/21] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS

NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests. This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/read.c | 4 ----
include/linux/nfs_fs.h | 5 -----
include/linux/sunrpc/sched.h | 1 -
include/trace/events/sunrpc.h | 1 -
net/sunrpc/auth.c | 2 +-
5 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index eb00229c1a50..cd797ce3a67c 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -194,10 +194,6 @@ static void nfs_initiate_read(struct nfs_pgio_header *hdr,
const struct nfs_rpc_ops *rpc_ops,
struct rpc_task_setup *task_setup_data, int how)
{
- struct inode *inode = hdr->inode;
- int swap_flags = IS_SWAPFILE(inode) ? NFS_RPC_SWAPFLAGS : 0;
-
- task_setup_data->flags |= swap_flags;
rpc_ops->read_setup(hdr, msg);
trace_nfs_initiate_read(hdr);
}
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 02aa49323d1d..ff8b3820409c 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -45,11 +45,6 @@
*/
#define NFS_MAX_TRANSPORTS 16

-/*
- * These are the default flags for swap requests
- */
-#define NFS_RPC_SWAPFLAGS (RPC_TASK_SWAPPER|RPC_TASK_ROOTCREDS)
-
/*
* Size of the NFS directory verifier
*/
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index db964bb63912..56710f8056d3 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -124,7 +124,6 @@ struct rpc_task_setup {
#define RPC_TASK_MOVEABLE 0x0004 /* nfs4.1+ rpc tasks */
#define RPC_TASK_NULLCREDS 0x0010 /* Use AUTH_NULL credential */
#define RPC_CALL_MAJORSEEN 0x0020 /* major timeout seen */
-#define RPC_TASK_ROOTCREDS 0x0040 /* force root creds */
#define RPC_TASK_DYNAMIC 0x0080 /* task was kmalloc'ed */
#define RPC_TASK_NO_ROUND_ROBIN 0x0100 /* send requests on "main" xprt */
#define RPC_TASK_SOFT 0x0200 /* Use soft timeouts */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 29982d60b68a..ac33892da411 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -311,7 +311,6 @@ TRACE_EVENT(rpc_request,
{ RPC_TASK_MOVEABLE, "MOVEABLE" }, \
{ RPC_TASK_NULLCREDS, "NULLCREDS" }, \
{ RPC_CALL_MAJORSEEN, "MAJORSEEN" }, \
- { RPC_TASK_ROOTCREDS, "ROOTCREDS" }, \
{ RPC_TASK_DYNAMIC, "DYNAMIC" }, \
{ RPC_TASK_NO_ROUND_ROBIN, "NO_ROUND_ROBIN" }, \
{ RPC_TASK_SOFT, "SOFT" }, \
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 6bfa19f9fa6a..682fcd24bf43 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -670,7 +670,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
/* If machine cred couldn't be bound, try a root cred */
if (new)
;
- else if (cred == &machine_cred || (flags & RPC_TASK_ROOTCREDS))
+ else if (cred == &machine_cred)
new = rpcauth_bind_root_cred(task, lookupflags);
else if (flags & RPC_TASK_NULLCREDS)
new = authnull_ops.lookup_cred(NULL, NULL, 0);



2022-02-09 08:38:12

by NeilBrown

[permalink] [raw]
Subject: [PATCH 03/21] MM: move responsibility for setting SWP_FS_OPS to ->swap_activate

If a filesystem wishes to handle all swap IO itself (via ->direct_IO and
->readpage), rather than just providing devices addresses for
submit_bio(), SWP_FS_OPS must be set.
Currently the protocol for setting this it to have ->swap_activate
return zero. In that case SWP_FS_OPS is set, and add_swap_extent()
is called for the entire file.

This is a little clumsy as different return values for ->swap_activate
have quite different meanings, and it makes it hard to search for which
filesystems require SWP_FS_OPS to be set.

So remove the special meaning of a zero return, and require the
filesystem to set SWP_FS_OPS if it so desires, and to always call
add_swap_extent() as required.

Currently only NFS and CIFS return zero for add_swap_extent().

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
fs/cifs/file.c | 3 ++-
fs/nfs/file.c | 13 +++++++++++--
include/linux/swap.h | 6 ++++++
mm/swapfile.c | 10 +++-------
4 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index e7af802dcfa6..fe49f1cab018 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4917,7 +4917,8 @@ static int cifs_swap_activate(struct swap_info_struct *sis,
* from reading or writing the file
*/

- return 0;
+ sis->flags |= SWP_FS_OPS;
+ return add_swap_extent(sis, 0, sis->max, 0);
}

static void cifs_swap_deactivate(struct file *file)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 76d76acbc594..d5aa55c7edb0 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -488,6 +488,7 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
{
unsigned long blocks;
long long isize;
+ int ret;
struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
struct inode *inode = file->f_mapping->host;

@@ -500,9 +501,17 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
return -EINVAL;
}

+ ret = rpc_clnt_swap_activate(clnt);
+ if (ret)
+ return ret;
+ ret = add_swap_extent(sis, 0, sis->max, 0);
+ if (ret < 0) {
+ rpc_clnt_swap_deactivate(clnt);
+ return ret;
+ }
*span = sis->pages;
-
- return rpc_clnt_swap_activate(clnt);
+ sis->flags |= SWP_FS_OPS;
+ return ret;
}

static void nfs_swap_deactivate(struct file *file)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a43929f7033e..b57cff3c5ac2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -573,6 +573,12 @@ static inline swp_entry_t get_swap_page(struct page *page)
return entry;
}

+static inline int add_swap_extent(struct swap_info_struct *sis,
+ unsigned long start_page,
+ unsigned long nr_pages, sector_t start_block)
+{
+ return -EINVAL;
+}
#endif /* CONFIG_SWAP */

#ifdef CONFIG_THP_SWAP
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 71c7a31dd291..ed6028aea8bf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2347,13 +2347,9 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)

if (mapping->a_ops->swap_activate) {
ret = mapping->a_ops->swap_activate(sis, swap_file, span);
- if (ret >= 0)
- sis->flags |= SWP_ACTIVATED;
- if (!ret) {
- sis->flags |= SWP_FS_OPS;
- ret = add_swap_extent(sis, 0, sis->max, 0);
- *span = sis->pages;
- }
+ if (ret < 0)
+ return ret;
+ sis->flags |= SWP_ACTIVATED;
return ret;
}




2022-02-10 23:05:57

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 00/21 V4] Repair SWAP-over_NFS

Hi Neil,

On Wed, Feb 9, 2022 at 11:29 AM NeilBrown <[email protected]> wrote:
> This 4th version of the series address review comment, particularly
> tidying up "NFS: swap IO handling is slightly different for O_DIRECT IO"
> and collect reviewed-by etc.
>
> I've also move 3 NFS patches which depend on the MM patches to the end
> in case they helps maintainers land the patches in a consistent order.
> Those three patches might go through the NFS free after the next merge
> window.

Thanks for the update!
Tested-by: Geert Uytterhoeven <[email protected]>
(on Renesas RSK+RZA1 with 32 MiB of SDRAM)

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds