2021-11-16 02:47:38

by NeilBrown

Subject: [PATCH 00/13] Repair SWAP-over-NFS

swap-over-NFS currently has a variety of problems.

Due to a newish test in generic_write_checks(), all writes to swap
currently fail.
With that fixed, there are various sources of deadlocks that can cause
a swapping system to freeze.

Swap has never worked over NFSv4 due to the occasional need to start the
state-management thread - which cannot happen under high memory
pressure.

This series addresses all the problems that I could find, and also
changes writes to be asynchronous, and both reads and writes to use
multi-page RPC requests when possible (the last 2 patches).

This last change has interesting performance effects. The rate of
writes to the swap file (measured in K/sec) increases by a factor of
about 5 (not precisely measured). However, interactive response degrades
noticeably (response times of multiple seconds, but not minutes). So
while it seems like it should be a good idea, I'm not sure if we want it
until it is better understood.

I'd be very happy if others could test out some swapping scenarios to
see how it performs. I've been using
stress-ng --brk 2 --stack 2 --bigheap 2
which doesn't give me any insight into whether more useful work is
getting done.

Apart from the last two patches, I think this series is ready.

Thanks,
NeilBrown

---

NeilBrown (13):
NFS: move generic_write_checks() call from nfs_file_direct_write() to nfs_file_write()
NFS: do not take i_rwsem for swap IO
MM: reclaim mustn't enter FS for swap-over-NFS
SUNRPC/call_alloc: async tasks mustn't block waiting for memory
SUNRPC/auth: async tasks mustn't block waiting for memory
SUNRPC/xprt: async tasks mustn't block waiting for memory
SUNRPC: remove scheduling boost for "SWAPPER" tasks.
NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
NFSv4: keep state manager thread active if swap is enabled
NFS: swap-out must always use STABLE writes.
MM: use AIO/DIO for reads from SWP_FS_OPS swap-space
MM: use AIO for DIO writes to swap


fs/nfs/direct.c | 12 +-
fs/nfs/file.c | 21 ++-
fs/nfs/io.c | 9 ++
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 +++
fs/nfs/nfs4state.c | 39 ++++-
fs/nfs/read.c | 4 -
fs/nfs/write.c | 2 +
include/linux/nfs_fs.h | 8 +-
include/linux/nfs_xdr.h | 2 +
include/linux/sunrpc/auth.h | 1 +
include/linux/sunrpc/sched.h | 1 -
include/trace/events/sunrpc.h | 1 -
mm/page_io.c | 243 +++++++++++++++++++++++++++-----
mm/vmscan.c | 12 +-
net/sunrpc/auth.c | 8 +-
net/sunrpc/auth_gss/auth_gss.c | 6 +-
net/sunrpc/auth_unix.c | 10 +-
net/sunrpc/clnt.c | 7 +-
net/sunrpc/sched.c | 29 ++--
net/sunrpc/xprt.c | 19 +--
net/sunrpc/xprtrdma/transport.c | 10 +-
net/sunrpc/xprtsock.c | 8 ++
23 files changed, 374 insertions(+), 99 deletions(-)

--
Signature



2021-11-16 02:47:44

by NeilBrown

Subject: [PATCH 01/13] NFS: move generic_write_checks() call from nfs_file_direct_write() to nfs_file_write()

generic_write_checks() is not needed for swap-out writes, and fails if
they are attempted.
nfs_file_direct_write() currently calls generic_write_checks() and is in
turn called from:
nfs_direct_IO - only for swap-out
nfs_file_write - for normal O_DIRECT write

So move the generic_write_checks() call into nfs_file_write(). This
allows NFS swap-out writes to complete.

Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/direct.c | 5 +----
fs/nfs/file.c | 6 +++++-
2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 9cff8709c80a..1e80d243ba25 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -905,10 +905,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);

- result = generic_write_checks(iocb, iter);
- if (result <= 0)
- return result;
- count = result;
+ count = iov_iter_count(iter);
nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count);

pos = iocb->ki_pos;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 24e7dccce355..45d8180b7be3 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -615,8 +615,12 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
if (result)
return result;

- if (iocb->ki_flags & IOCB_DIRECT)
+ if (iocb->ki_flags & IOCB_DIRECT) {
+ result = generic_write_checks(iocb, from);
+ if (result <= 0)
+ return result;
return nfs_file_direct_write(iocb, from);
+ }

dprintk("NFS: write(%pD2, %zu@%Ld)\n",
file, iov_iter_count(from), (long long) iocb->ki_pos);



2021-11-16 02:47:51

by NeilBrown

Subject: [PATCH 02/13] NFS: do not take i_rwsem for swap IO

Taking the i_rwsem for swap IO triggers lockdep warnings regarding
possible deadlocks with "fs_reclaim". These deadlocks could, I believe,
occur if a buffered read of the swapfile were attempted.

We don't need coherence with the page cache for a swap file, and
buffered writes are forbidden anyway. There is no other need for
i_rwsem during direct IO.

So don't take the rwsem or set the NFS_INO_ODIRECT flag during IO to the
swap file.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/io.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/fs/nfs/io.c b/fs/nfs/io.c
index b5551ed8f648..83b4dfbb826d 100644
--- a/fs/nfs/io.c
+++ b/fs/nfs/io.c
@@ -118,11 +118,18 @@ static void nfs_block_buffered(struct nfs_inode *nfsi, struct inode *inode)
* NFS_INO_ODIRECT.
* Note that buffered writes and truncates both take a write lock on
* inode->i_rwsem, meaning that those are serialised w.r.t. O_DIRECT.
+ *
+ * When inode IS_SWAPFILE we ignore the flag and don't take the rwsem
+ * as it triggers lockdep warnings and possible deadlocks.
+ * Buffered writes are forbidden anyway, and buffered reads will not
+ * be coherent.
*/
void
nfs_start_io_direct(struct inode *inode)
{
struct nfs_inode *nfsi = NFS_I(inode);
+ if (IS_SWAPFILE(inode))
+ return;
/* Be an optimist! */
down_read(&inode->i_rwsem);
if (test_bit(NFS_INO_ODIRECT, &nfsi->flags) != 0)
@@ -144,5 +151,7 @@ nfs_start_io_direct(struct inode *inode)
void
nfs_end_io_direct(struct inode *inode)
{
+ if (IS_SWAPFILE(inode))
+ return;
up_read(&inode->i_rwsem);
}



2021-11-16 02:47:57

by NeilBrown

Subject: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS

If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
safe to enter the FS for reclaim.
So only down-grade the requirement for swap pages to __GFP_IO after
checking that SWP_FS_OPS are not being used.

Signed-off-by: NeilBrown <[email protected]>
---
mm/vmscan.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..049ff4494081 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1513,8 +1513,14 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!sc->may_unmap && page_mapped(page))
goto keep_locked;

+ /* ->flags can be updated non-atomically (scan_swap_map_slots),
+ * but that will never affect SWP_FS_OPS, so the data_race
+ * is safe.
+ */
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
- (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
+ (PageSwapCache(page) &&
+ !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
+ (sc->gfp_mask & __GFP_IO));

/*
* The number of dirty pages determines if a node is marked
@@ -1682,7 +1688,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
goto activate_locked_split;
}

- may_enter_fs = true;
+ if ((sc->gfp_mask & __GFP_FS) ||
+ !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
+ may_enter_fs = true;

/* Adding to swap updated mapping */
mapping = page_mapping(page);



2021-11-16 02:48:05

by NeilBrown

Subject: [PATCH 04/13] SUNRPC/call_alloc: async tasks mustn't block waiting for memory

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.

mempools are a particular problem as memory can only be released back
to the mempool by an async rpc task running. If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.

rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.

Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/sched.c | 4 +++-
net/sunrpc/xprtrdma/transport.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index e2c835482791..d5b6e897f5a5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1023,8 +1023,10 @@ int rpc_malloc(struct rpc_task *task)
struct rpc_buffer *buf;
gfp_t gfp = GFP_NOFS;

+ if (RPC_IS_ASYNC(task))
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
if (RPC_IS_SWAPPER(task))
- gfp = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+ gfp |= __GFP_MEMALLOC;

size += sizeof(struct rpc_buffer);
if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 16e5696314a4..a52277115500 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -574,8 +574,10 @@ xprt_rdma_allocate(struct rpc_task *task)
gfp_t flags;

flags = RPCRDMA_DEF_GFP;
+ if (RPC_IS_ASYNC(task))
+ flags = GFP_NOWAIT | __GFP_NOWARN;
if (RPC_IS_SWAPPER(task))
- flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+ flags |= __GFP_MEMALLOC;

if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
flags))



2021-11-16 02:48:43

by NeilBrown

Subject: [PATCH 06/13] SUNRPC/xprt: async tasks mustn't block waiting for memory

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything. So it
must not block waiting for memory.

xprt_dynamic_alloc_slot can block indefinitely. This can tie up all
workqueue threads and NFS can deadlock. So when called from a
workqueue, set __GFP_NORETRY.

The rdma alloc_slot already does not block. However, it sets the error
to -EAGAIN, suggesting this will trigger a sleep. It does not. As we
can see in call_reserveresult(), only -ENOMEM causes a sleep; -EAGAIN
causes an immediate retry.
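
To illustrate the distinction, here is a minimal sketch of the behaviour
described above - not the actual net/sunrpc/clnt.c code, which handles
more cases; it assumes <linux/sunrpc/sched.h> for struct rpc_task and
rpc_delay():

#include <linux/sunrpc/sched.h>

/* Sketch only: -ENOMEM puts the task to sleep briefly before the
 * reservation is retried, while -EAGAIN is retried immediately with
 * no back-off, so a non-blocking allocator should report -ENOMEM.
 */
static void example_reserve_result(struct rpc_task *task, int status)
{
	switch (status) {
	case -ENOMEM:
		rpc_delay(task, HZ >> 4);	/* sleep, then retry */
		break;
	case -EAGAIN:
		/* re-queued and retried at once - no sleep */
		break;
	}
}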

Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/xprt.c | 5 ++++-
net/sunrpc/xprtrdma/transport.c | 2 +-
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a02de2bddb28..47d207e416ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1687,12 +1687,15 @@ static bool xprt_throttle_congested(struct rpc_xprt *xprt, struct rpc_task *task
static struct rpc_rqst *xprt_dynamic_alloc_slot(struct rpc_xprt *xprt)
{
struct rpc_rqst *req = ERR_PTR(-EAGAIN);
+ gfp_t gfp_mask = GFP_NOFS;

if (xprt->num_reqs >= xprt->max_reqs)
goto out;
++xprt->num_reqs;
spin_unlock(&xprt->reserve_lock);
- req = kzalloc(sizeof(struct rpc_rqst), GFP_NOFS);
+ if (current->flags & PF_WQ_WORKER)
+ gfp_mask |= __GFP_NORETRY | __GFP_NOWARN;
+ req = kzalloc(sizeof(struct rpc_rqst), gfp_mask);
spin_lock(&xprt->reserve_lock);
if (req != NULL)
goto out;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index a52277115500..32df23796747 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -521,7 +521,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
return;

out_sleep:
- task->tk_status = -EAGAIN;
+ task->tk_status = -ENOMEM;
xprt_add_backlog(xprt, task);
}




2021-11-16 02:49:22

by NeilBrown

Subject: [PATCH 09/13] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC

rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC
to be used for some allocations. This is needed in some cases, but not
in all the cases where it is currently provided, and it is needed in
some cases where it isn't currently provided.

Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.

GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However, xdr_alloc_bvec() does not get GFP_MEMALLOC - though it often
does need it.

xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks,
then it blocks all other queued tasks. So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.

Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.

So with this patch:
1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
2/ __rpc_execute() sets PF_MEMALLOC while handling any task
with RPC_TASK_SWAPPER set, or when handling any task that
holds the XPRT_LOCKED lock on an xprt used for swap.
This removes the need for the RPC_IS_SWAPPER() test
in ->buf_alloc handlers.
3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
any task to a swapper xprt. __rpc_execute() will clear it.
4/ PF_MEMALLOC is set for all the connect workers.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/write.c | 2 ++
net/sunrpc/clnt.c | 2 --
net/sunrpc/sched.c | 20 +++++++++++++++++---
net/sunrpc/xprt.c | 3 +++
net/sunrpc/xprtrdma/transport.c | 6 ++++--
net/sunrpc/xprtsock.c | 8 ++++++++
6 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 9b7619ce17a7..0c7a304c9e74 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1408,6 +1408,8 @@ static void nfs_initiate_write(struct nfs_pgio_header *hdr,
{
int priority = flush_task_priority(how);

+ if (IS_SWAPFILE(hdr->inode))
+ task_setup_data->flags |= RPC_TASK_SWAPPER;
task_setup_data->priority = priority;
rpc_ops->write_setup(hdr, msg, &task_setup_data->rpc_client);
trace_nfs_initiate_write(hdr);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 238b2ef5491f..cb76fbea3ed5 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1085,8 +1085,6 @@ void rpc_task_set_client(struct rpc_task *task, struct rpc_clnt *clnt)
task->tk_flags |= RPC_TASK_TIMEOUT;
if (clnt->cl_noretranstimeo)
task->tk_flags |= RPC_TASK_NO_RETRANS_TIMEOUT;
- if (atomic_read(&clnt->cl_swapper))
- task->tk_flags |= RPC_TASK_SWAPPER;
/* Add to the client's list of all tasks */
spin_lock(&clnt->cl_lock);
list_add_tail(&task->tk_task, &clnt->cl_tasks);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 256302bf6557..9020cedb7c95 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -869,6 +869,15 @@ void rpc_release_calldata(const struct rpc_call_ops *ops, void *calldata)
ops->rpc_release(calldata);
}

+static bool xprt_needs_memalloc(struct rpc_xprt *xprt, struct rpc_task *tk)
+{
+ if (!xprt)
+ return false;
+ if (!atomic_read(&xprt->swapper))
+ return false;
+ return test_bit(XPRT_LOCKED, &xprt->state) && xprt->snd_task == tk;
+}
+
/*
* This is the RPC `scheduler' (or rather, the finite state machine).
*/
@@ -877,6 +886,7 @@ static void __rpc_execute(struct rpc_task *task)
struct rpc_wait_queue *queue;
int task_is_async = RPC_IS_ASYNC(task);
int status = 0;
+ unsigned long pflags = current->flags;

WARN_ON_ONCE(RPC_IS_QUEUED(task));
if (RPC_IS_QUEUED(task))
@@ -899,6 +909,10 @@ static void __rpc_execute(struct rpc_task *task)
}
if (!do_action)
break;
+ if (RPC_IS_SWAPPER(task) ||
+ xprt_needs_memalloc(task->tk_xprt, task))
+ current->flags |= PF_MEMALLOC;
+
trace_rpc_task_run_action(task, do_action);
do_action(task);

@@ -936,7 +950,7 @@ static void __rpc_execute(struct rpc_task *task)
rpc_clear_running(task);
spin_unlock(&queue->lock);
if (task_is_async)
- return;
+ goto out;

/* sync task: sleep here */
trace_rpc_task_sync_sleep(task, task->tk_action);
@@ -960,6 +974,8 @@ static void __rpc_execute(struct rpc_task *task)

/* Release all resources associated with the task */
rpc_release_task(task);
+out:
+ current_restore_flags(pflags, PF_MEMALLOC);
}

/*
@@ -1018,8 +1034,6 @@ int rpc_malloc(struct rpc_task *task)

if (RPC_IS_ASYNC(task))
gfp = GFP_NOWAIT | __GFP_NOWARN;
- if (RPC_IS_SWAPPER(task))
- gfp |= __GFP_MEMALLOC;

size += sizeof(struct rpc_buffer);
if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a0a2583fe941..0614e7463d4b 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1492,6 +1492,9 @@ bool xprt_prepare_transmit(struct rpc_task *task)
return false;

}
+ if (atomic_read(&xprt->swapper))
+ /* This will be cleared in __rpc_execute */
+ current->flags |= PF_MEMALLOC;
return true;
}

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 32df23796747..256b06a92391 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -239,8 +239,11 @@ xprt_rdma_connect_worker(struct work_struct *work)
struct rpcrdma_xprt *r_xprt = container_of(work, struct rpcrdma_xprt,
rx_connect_worker.work);
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
+ unsigned int pflags = current->flags;
int rc;

+ if (atomic_read(&xprt->swapper))
+ current->flags |= PF_MEMALLOC;
rc = rpcrdma_xprt_connect(r_xprt);
xprt_clear_connecting(xprt);
if (!rc) {
@@ -254,6 +257,7 @@ xprt_rdma_connect_worker(struct work_struct *work)
rpcrdma_xprt_disconnect(r_xprt);
xprt_unlock_connect(xprt, r_xprt);
xprt_wake_pending_tasks(xprt, rc);
+ current_restore_flags(pflags, PF_MEMALLOC);
}

/**
@@ -576,8 +580,6 @@ xprt_rdma_allocate(struct rpc_task *task)
flags = RPCRDMA_DEF_GFP;
if (RPC_IS_ASYNC(task))
flags = GFP_NOWAIT | __GFP_NOWARN;
- if (RPC_IS_SWAPPER(task))
- flags |= __GFP_MEMALLOC;

if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
flags))
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index ae48c9c84ee1..062ff1f37702 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2047,7 +2047,10 @@ static void xs_udp_setup_socket(struct work_struct *work)
struct rpc_xprt *xprt = &transport->xprt;
struct socket *sock;
int status = -EIO;
+ unsigned int pflags = current->flags;

+ if (atomic_read(&xprt->swapper))
+ current->flags |= PF_MEMALLOC;
sock = xs_create_sock(xprt, transport,
xs_addr(xprt)->sa_family, SOCK_DGRAM,
IPPROTO_UDP, false);
@@ -2067,6 +2070,7 @@ static void xs_udp_setup_socket(struct work_struct *work)
xprt_clear_connecting(xprt);
xprt_unlock_connect(xprt, transport);
xprt_wake_pending_tasks(xprt, status);
+ current_restore_flags(pflags, PF_MEMALLOC);
}

/**
@@ -2226,7 +2230,10 @@ static void xs_tcp_setup_socket(struct work_struct *work)
struct socket *sock = transport->sock;
struct rpc_xprt *xprt = &transport->xprt;
int status;
+ unsigned int pflags = current->flags;

+ if (atomic_read(&xprt->swapper))
+ current->flags |= PF_MEMALLOC;
if (!sock) {
sock = xs_create_sock(xprt, transport,
xs_addr(xprt)->sa_family, SOCK_STREAM,
@@ -2291,6 +2298,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
xprt_clear_connecting(xprt);
out_unlock:
xprt_unlock_connect(xprt, transport);
+ current_restore_flags(pflags, PF_MEMALLOC);
}

/**



2021-11-16 02:49:40

by NeilBrown

Subject: [PATCH 10/13] NFSv4: keep state manager thread active if swap is enabled

If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.

This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 19 +++++++++++++++----
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 ++++++++++++++++++++
fs/nfs/nfs4state.c | 39 +++++++++++++++++++++++++++++++++------
include/linux/nfs_xdr.h | 2 ++
net/sunrpc/clnt.c | 2 ++
6 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 45d8180b7be3..59c271f42ea5 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -489,8 +489,10 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
{
unsigned long blocks;
long long isize;
- struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = file_inode(file);
+ struct rpc_clnt *clnt = NFS_CLIENT(inode);
+ struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
+ int ret;

spin_lock(&inode->i_lock);
blocks = inode->i_blocks;
@@ -503,14 +505,23 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,

*span = sis->pages;

- return rpc_clnt_swap_activate(clnt);
+ ret = rpc_clnt_swap_activate(clnt);
+
+ if (ret == 0 && cl->rpc_ops->enable_swap)
+ cl->rpc_ops->enable_swap(inode);
+
+ return ret;
}

static void nfs_swap_deactivate(struct file *file)
{
- struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
+ struct inode *inode = file_inode(file);
+ struct rpc_clnt *clnt = NFS_CLIENT(inode);
+ struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;

rpc_clnt_swap_deactivate(clnt);
+ if (cl->rpc_ops->disable_swap)
+ cl->rpc_ops->disable_swap(file_inode(file));
}

const struct address_space_operations nfs_file_aops = {
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index ed5eaca6801e..8a9ce0f42efd 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -42,6 +42,7 @@ enum nfs4_client_state {
NFS4CLNT_LEASE_MOVED,
NFS4CLNT_DELEGATION_EXPIRED,
NFS4CLNT_RUN_MANAGER,
+ NFS4CLNT_MANAGER_AVAILABLE,
NFS4CLNT_RECALL_RUNNING,
NFS4CLNT_RECALL_ANY_LAYOUT_READ,
NFS4CLNT_RECALL_ANY_LAYOUT_RW,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ee3bc79f6ca3..ab6382f9cbf0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10347,6 +10347,24 @@ static ssize_t nfs4_listxattr(struct dentry *dentry, char *list, size_t size)
return error + error2 + error3;
}

+static void nfs4_enable_swap(struct inode *inode)
+{
+ /* The state manager thread must always be running.
+ * It will notice the client is a swapper, and stay put.
+ */
+ struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+
+ nfs4_schedule_state_manager(clp);
+}
+
+static void nfs4_disable_swap(struct inode *inode)
+{
+ /* The state manager thread will now exit once it is
+ * woken.
+ */
+ wake_up_var(&NFS_SERVER(inode)->nfs_client->cl_state);
+}
+
static const struct inode_operations nfs4_dir_inode_operations = {
.create = nfs_create,
.lookup = nfs_lookup,
@@ -10423,6 +10441,8 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
.free_client = nfs4_free_client,
.create_server = nfs4_create_server,
.clone_server = nfs_clone_server,
+ .enable_swap = nfs4_enable_swap,
+ .disable_swap = nfs4_disable_swap,
};

static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = {
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index ecc4594299d6..2408e9b98f68 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1205,10 +1205,17 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
{
struct task_struct *task;
char buf[INET6_ADDRSTRLEN + sizeof("-manager") + 1];
+ struct rpc_clnt *cl = clp->cl_rpcclient;
+
+ while (cl != cl->cl_parent)
+ cl = cl->cl_parent;

set_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state);
- if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
+ if (test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state) != 0) {
+ wake_up_var(&clp->cl_state);
return;
+ }
+ set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
__module_get(THIS_MODULE);
refcount_inc(&clp->cl_count);

@@ -1224,6 +1231,7 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
printk(KERN_ERR "%s: kthread_run: %ld\n",
__func__, PTR_ERR(task));
nfs4_clear_state_manager_bit(clp);
+ clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
nfs_put_client(clp);
module_put(THIS_MODULE);
}
@@ -2661,11 +2669,8 @@ static void nfs4_state_manager(struct nfs_client *clp)
clear_bit(NFS4CLNT_RECALL_RUNNING, &clp->cl_state);
}

- /* Did we race with an attempt to give us more work? */
- if (!test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
- return;
- if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
- return;
+ return;
+
} while (refcount_read(&clp->cl_count) > 1 && !signalled());
goto out_drain;

@@ -2685,9 +2690,31 @@ static void nfs4_state_manager(struct nfs_client *clp)
static int nfs4_run_state_manager(void *ptr)
{
struct nfs_client *clp = ptr;
+ struct rpc_clnt *cl = clp->cl_rpcclient;
+
+ while (cl != cl->cl_parent)
+ cl = cl->cl_parent;

allow_signal(SIGKILL);
+again:
+ set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
nfs4_state_manager(clp);
+ if (atomic_read(&cl->cl_swapper)) {
+ wait_var_event_interruptible(&clp->cl_state,
+ test_bit(NFS4CLNT_RUN_MANAGER,
+ &clp->cl_state));
+ if (atomic_read(&cl->cl_swapper) &&
+ test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
+ goto again;
+ /* Either no longer a swapper, or were signalled */
+ }
+ clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
+
+ if (refcount_read(&clp->cl_count) > 1 && !signalled() &&
+ test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state) &&
+ !test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state))
+ goto again;
+
nfs_put_client(clp);
module_put_and_exit(0);
return 0;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 967a0098f0a9..04cf3a8fb949 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1795,6 +1795,8 @@ struct nfs_rpc_ops {
struct nfs_server *(*create_server)(struct fs_context *);
struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
struct nfs_fattr *, rpc_authflavor_t);
+ void (*enable_swap)(struct inode *inode);
+ void (*disable_swap)(struct inode *inode);
};

/*
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index cb76fbea3ed5..4cb403a0f334 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -3066,6 +3066,8 @@ rpc_clnt_swap_activate_callback(struct rpc_clnt *clnt,
int
rpc_clnt_swap_activate(struct rpc_clnt *clnt)
{
+ while (clnt != clnt->cl_parent)
+ clnt = clnt->cl_parent;
if (atomic_inc_return(&clnt->cl_swapper) == 1)
return rpc_clnt_iterate_for_each_xprt(clnt,
rpc_clnt_swap_activate_callback, NULL);



2021-11-16 02:49:54

by NeilBrown

Subject: [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space

When pages are read from SWP_FS_OPS swap-space, the reads are submitted as
separate reads for each page. This is generally less efficient than
larger reads.

We can use the block-plugging infrastructure to delay submitting the
read request until multiple contiguous pages have been collected. This
requires using ->direct_IO to submit the read (as ->readpages isn't
suitable for swap).

If the caller schedules before unplugging, we hand the read-in task off
to system_wq to avoid any possible locking issues.

Signed-off-by: NeilBrown <[email protected]>
---
mm/page_io.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 103 insertions(+), 4 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 9725c7e1eeea..30d613881995 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -282,6 +282,14 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
#define bio_associate_blkg_from_page(bio, page) do { } while (0)
#endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */

+struct swap_iocb {
+ struct blk_plug_cb cb; /* Must be first */
+ struct kiocb iocb;
+ struct bio_vec bvec[SWAP_CLUSTER_MAX];
+ struct work_struct work;
+ int pages;
+};
+
int __swap_writepage(struct page *page, struct writeback_control *wbc,
bio_end_io_t end_write_func)
{
@@ -353,6 +361,59 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return 0;
}

+static void sio_read_complete(struct kiocb *iocb, long ret)
+{
+ struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+ int p;
+
+ for (p = 0; p < sio->pages; p++) {
+ struct page *page = sio->bvec[p].bv_page;
+
+ if (ret != PAGE_SIZE * sio->pages) {
+ SetPageError(page);
+ ClearPageUptodate(page);
+ pr_alert_ratelimited("Read-error on swap-device\n");
+ } else {
+ SetPageUptodate(page);
+ count_vm_event(PSWPIN);
+ }
+ unlock_page(page);
+ }
+ kfree(sio);
+}
+
+static void sio_read_unplug(struct blk_plug_cb *cb, bool from_schedule);
+
+static void sio_read_unplug_worker(struct work_struct *work)
+{
+ struct swap_iocb *sio = container_of(work, struct swap_iocb, work);
+ sio_read_unplug(&sio->cb, 0);
+}
+
+static void sio_read_unplug(struct blk_plug_cb *cb, bool from_schedule)
+{
+ struct swap_iocb *sio = container_of(cb, struct swap_iocb, cb);
+ struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+ struct iov_iter from;
+ unsigned int nofs_flag;
+ int ret;
+
+ if (from_schedule) {
+ INIT_WORK(&sio->work, sio_read_unplug_worker);
+ schedule_work(&sio->work);
+ return;
+ }
+
+ iov_iter_bvec(&from, READ, sio->bvec,
+ sio->pages, PAGE_SIZE * sio->pages);
+ /* nofs needs as ->direct_IO may take the same mutex it takes for write */
+ nofs_flag = memalloc_nofs_save();
+ ret = mapping->a_ops->direct_IO(&sio->iocb, &from);
+ memalloc_nofs_restore(nofs_flag);
+ if (ret != -EIOCBQUEUED)
+ sio_read_complete(&sio->iocb, ret);
+}
+
int swap_readpage(struct page *page, bool synchronous)
{
struct bio *bio;
@@ -380,10 +441,48 @@ int swap_readpage(struct page *page, bool synchronous)
if (data_race(sis->flags & SWP_FS_OPS)) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
-
- ret = mapping->a_ops->readpage(swap_file, page);
- if (!ret)
- count_vm_event(PSWPIN);
+ struct blk_plug_cb *cb;
+ struct swap_iocb *sio;
+ loff_t pos = page_file_offset(page);
+ struct blk_plug plug;
+ int p;
+
+ /* We are sometimes called without a plug active.
+ * By calling blk_start_plug() here, we ensure blk_check_plugged
+ * only fails if memory allocation fails.
+ */
+ blk_start_plug(&plug);
+ cb = blk_check_plugged(sio_read_unplug, swap_file, sizeof(*sio));
+ sio = container_of(cb, struct swap_iocb, cb);
+ if (cb && sio->pages &&
+ sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+ /* Not contiguous - hide this sio from lookup */
+ cb->data = NULL;
+ cb = blk_check_plugged(sio_read_unplug, swap_file,
+ sizeof(*sio));
+ sio = container_of(cb, struct swap_iocb, cb);
+ }
+ if (!cb) {
+ blk_finish_plug(&plug);
+ ret = mapping->a_ops->readpage(swap_file, page);
+ if (!ret)
+ count_vm_event(PSWPIN);
+ goto out;
+ }
+ if (sio->pages == 0) {
+ init_sync_kiocb(&sio->iocb, swap_file);
+ sio->iocb.ki_pos = pos;
+ sio->iocb.ki_complete = sio_read_complete;
+ }
+ p = sio->pages;
+ sio->bvec[p].bv_page = page;
+ sio->bvec[p].bv_len = PAGE_SIZE;
+ sio->bvec[p].bv_offset = 0;
+ p += 1;
+ sio->pages = p;
+ if (p == ARRAY_SIZE(sio->bvec))
+ cb->data = NULL;
+ blk_finish_plug(&plug);
goto out;
}




2021-11-16 02:50:01

by NeilBrown

Subject: [PATCH 13/13] MM: use AIO for DIO writes to swap

When swap-out goes through the filesystem (as with NFS), we currently
perform synchronous writes with ->direct_IO. This serializes swap
writes and causes kswapd to block waiting for writes to complete. This
is quite different to swap-out to a block device (always async), and
possibly hurts liveness.

So switch to AIO writes. If the necessary kiocb structure cannot be
allocated, fall back to sync writes using a kiocb on the stack.

Signed-off-by: NeilBrown <[email protected]>
---
mm/page_io.c | 136 ++++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 103 insertions(+), 33 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 30d613881995..59a2d49e53c3 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -25,6 +25,7 @@
#include <linux/psi.h>
#include <linux/uio.h>
#include <linux/sched/task.h>
+#include "internal.h"

void end_swap_bio_write(struct bio *bio)
{
@@ -288,8 +289,70 @@ struct swap_iocb {
struct bio_vec bvec[SWAP_CLUSTER_MAX];
struct work_struct work;
int pages;
+ bool on_stack;
};

+static void sio_aio_complete(struct kiocb *iocb, long ret)
+{
+ struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+ int p;
+
+ if (ret != PAGE_SIZE * sio->pages) {
+ /*
+ * In the case of swap-over-nfs, this can be a
+ * temporary failure if the system has limited
+ * memory for allocating transmit buffers.
+ * Mark the page dirty and avoid
+ * rotate_reclaimable_page but rate-limit the
+ * messages but do not flag PageError like
+ * the normal direct-to-bio case as it could
+ * be temporary.
+ */
+ pr_err_ratelimited("Write error on dio swapfile (%llu - %d pages)\n",
+ page_file_offset(sio->bvec[0].bv_page),
+ sio->pages);
+ for (p = 0; p < sio->pages; p++) {
+ set_page_dirty(sio->bvec[p].bv_page);
+ ClearPageReclaim(sio->bvec[p].bv_page);
+ }
+ }
+ for (p = 0; p < sio->pages; p++)
+ end_page_writeback(sio->bvec[p].bv_page);
+ if (!sio->on_stack)
+ kfree(sio);
+}
+
+static void sio_aio_unplug(struct blk_plug_cb *cb, bool from_schedule);
+
+static void sio_write_unplug_worker(struct work_struct *work)
+{
+ struct swap_iocb *sio = container_of(work, struct swap_iocb, work);
+ sio_aio_unplug(&sio->cb, 0);
+}
+
+static void sio_aio_unplug(struct blk_plug_cb *cb, bool from_schedule)
+{
+ struct swap_iocb *sio = container_of(cb, struct swap_iocb, cb);
+ struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+ struct iov_iter from;
+ int ret;
+ unsigned int noreclaim_flag;
+
+ if (from_schedule) {
+ INIT_WORK(&sio->work, sio_write_unplug_worker);
+ queue_work(mm_percpu_wq, &sio->work);
+ return;
+ }
+
+ noreclaim_flag = memalloc_noreclaim_save();
+ iov_iter_bvec(&from, WRITE, sio->bvec,
+ sio->pages, PAGE_SIZE * sio->pages);
+ ret = mapping->a_ops->direct_IO(&sio->iocb, &from);
+ memalloc_noreclaim_restore(noreclaim_flag);
+ if (ret != -EIOCBQUEUED)
+ sio_aio_complete(&sio->iocb, ret);
+}
+
int __swap_writepage(struct page *page, struct writeback_control *wbc,
bio_end_io_t end_write_func)
{
@@ -299,44 +362,51 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

VM_BUG_ON_PAGE(!PageSwapCache(page), page);
if (data_race(sis->flags & SWP_FS_OPS)) {
- struct kiocb kiocb;
+ struct swap_iocb *sio, sio_on_stack;
+ struct blk_plug_cb *cb;
struct file *swap_file = sis->swap_file;
- struct address_space *mapping = swap_file->f_mapping;
- struct bio_vec bv = {
- .bv_page = page,
- .bv_len = PAGE_SIZE,
- .bv_offset = 0
- };
- struct iov_iter from;
-
- iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
- init_sync_kiocb(&kiocb, swap_file);
- kiocb.ki_pos = page_file_offset(page);
+ loff_t pos = page_file_offset(page);
+ int p;

set_page_writeback(page);
unlock_page(page);
- ret = mapping->a_ops->direct_IO(&kiocb, &from);
- if (ret == PAGE_SIZE) {
- count_vm_event(PSWPOUT);
- ret = 0;
- } else {
- /*
- * In the case of swap-over-nfs, this can be a
- * temporary failure if the system has limited
- * memory for allocating transmit buffers.
- * Mark the page dirty and avoid
- * folio_rotate_reclaimable but rate-limit the
- * messages but do not flag PageError like
- * the normal direct-to-bio case as it could
- * be temporary.
- */
- set_page_dirty(page);
- ClearPageReclaim(page);
- pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
- page_file_offset(page));
+ cb = blk_check_plugged(sio_aio_unplug, swap_file, sizeof(*sio));
+ sio = container_of(cb, struct swap_iocb, cb);
+ if (cb && sio->pages &&
+ sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+ /* Not contiguous - hide this sio from lookup */
+ cb->data = NULL;
+ cb = blk_check_plugged(sio_aio_unplug, swap_file,
+ sizeof(*sio));
+ sio = container_of(cb, struct swap_iocb, cb);
}
- end_page_writeback(page);
- return ret;
+ if (!cb) {
+ sio = &sio_on_stack;
+ sio->pages = 0;
+ sio->on_stack = true;
+ }
+
+ if (sio->pages == 0) {
+ init_sync_kiocb(&sio->iocb, swap_file);
+ sio->iocb.ki_pos = pos;
+ if (sio != &sio_on_stack)
+ sio->iocb.ki_complete = sio_aio_complete;
+ }
+ p = sio->pages;
+ sio->bvec[p].bv_page = page;
+ sio->bvec[p].bv_len = PAGE_SIZE;
+ sio->bvec[p].bv_offset = 0;
+ p += 1;
+ sio->pages = p;
+ if (!cb)
+ sio_aio_unplug(&sio->cb, 0);
+ else if (p >= ARRAY_SIZE(sio->bvec))
+ /* Don't try to add to this */
+ cb->data = NULL;
+
+ count_vm_event(PSWPOUT);
+
+ return 0;
}

ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);



2021-11-16 02:50:26

by NeilBrown

Subject: [PATCH 11/13] NFS: swap-out must always use STABLE writes.

The commit handling code is not safe against memory-pressure deadlocks
when writing to swap. In particular, nfs_commitdata_alloc() blocks
indefinitely waiting for memory, and this can consume all available
workqueue threads.

Swap-out most likely uses STABLE writes anyway, as COND_STABLE indicates
that a stable write should be used if the write fits in a single
request, and it normally does. However, if we ever swap with a small
wsize, or gather unusually large numbers of pages for a single write,
this might change.

For safety, make it explicit in the code that direct writes used for swap
must always use FLUSH_STABLE.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/direct.c | 12 +++++++-----
fs/nfs/file.c | 2 +-
include/linux/nfs_fs.h | 3 ++-
3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1e80d243ba25..8d3b12402725 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -173,7 +173,7 @@ ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)

if (iov_iter_rw(iter) == READ)
return nfs_file_direct_read(iocb, iter);
- return nfs_file_direct_write(iocb, iter);
+ return nfs_file_direct_write(iocb, iter, FLUSH_STABLE);
}

static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
@@ -789,7 +789,7 @@ static const struct nfs_pgio_completion_ops nfs_direct_write_completion_ops = {
*/
static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
struct iov_iter *iter,
- loff_t pos)
+ loff_t pos, int ioflags)
{
struct nfs_pageio_descriptor desc;
struct inode *inode = dreq->inode;
@@ -797,7 +797,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
size_t requested_bytes = 0;
size_t wsize = max_t(size_t, NFS_SERVER(inode)->wsize, PAGE_SIZE);

- nfs_pageio_init_write(&desc, inode, FLUSH_COND_STABLE, false,
+ nfs_pageio_init_write(&desc, inode, ioflags, false,
&nfs_direct_write_completion_ops);
desc.pg_dreq = dreq;
get_dreq(dreq);
@@ -875,6 +875,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
* nfs_file_direct_write - file direct write operation for NFS files
* @iocb: target I/O control block
* @iter: vector of user buffers from which to write data
+ * @ioflags: flags for nfs_pageio_init_write()
*
* We use this function for direct writes instead of calling
* generic_file_aio_write() in order to avoid taking the inode
@@ -891,7 +892,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
* Note that O_APPEND is not supported for NFS direct writes, as there
* is no atomic O_APPEND write facility in the NFS protocol.
*/
-ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+ int ioflags)
{
ssize_t result, requested;
size_t count;
@@ -935,7 +937,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)

nfs_start_io_direct(inode);

- requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+ requested = nfs_direct_write_schedule_iovec(dreq, iter, pos, ioflags);

if (mapping->nrpages) {
invalidate_inode_pages2_range(mapping,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 59c271f42ea5..878a6a510a5e 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -630,7 +630,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
result = generic_write_checks(iocb, from);
if (result <= 0)
return result;
- return nfs_file_direct_write(iocb, from);
+ return nfs_file_direct_write(iocb, from, FLUSH_COND_STABLE);
}

dprintk("NFS: write(%pD2, %zu@%Ld)\n",
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 5a605e51f4b1..ca312aea6bec 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -509,7 +509,8 @@ extern ssize_t nfs_direct_IO(struct kiocb *, struct iov_iter *);
extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
struct iov_iter *iter);
extern ssize_t nfs_file_direct_write(struct kiocb *iocb,
- struct iov_iter *iter);
+ struct iov_iter *iter,
+ int ioflags);

/*
* linux/fs/nfs/dir.c



2021-11-16 02:50:26

by NeilBrown

Subject: [PATCH 08/13] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS

NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER, which gives some memory-allocation priority to
requests. This is not needed for swap READs - though it is for writes,
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/read.c | 4 ----
include/linux/nfs_fs.h | 5 -----
include/linux/sunrpc/sched.h | 1 -
include/trace/events/sunrpc.h | 1 -
net/sunrpc/auth.c | 2 +-
5 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index d11af2a9299c..a8f2b884306c 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -194,10 +194,6 @@ static void nfs_initiate_read(struct nfs_pgio_header *hdr,
const struct nfs_rpc_ops *rpc_ops,
struct rpc_task_setup *task_setup_data, int how)
{
- struct inode *inode = hdr->inode;
- int swap_flags = IS_SWAPFILE(inode) ? NFS_RPC_SWAPFLAGS : 0;
-
- task_setup_data->flags |= swap_flags;
rpc_ops->read_setup(hdr, msg);
trace_nfs_initiate_read(hdr);
}
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 05f249f20f55..5a605e51f4b1 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -45,11 +45,6 @@
*/
#define NFS_MAX_TRANSPORTS 16

-/*
- * These are the default flags for swap requests
- */
-#define NFS_RPC_SWAPFLAGS (RPC_TASK_SWAPPER|RPC_TASK_ROOTCREDS)
-
/*
* Size of the NFS directory verifier
*/
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index db964bb63912..56710f8056d3 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -124,7 +124,6 @@ struct rpc_task_setup {
#define RPC_TASK_MOVEABLE 0x0004 /* nfs4.1+ rpc tasks */
#define RPC_TASK_NULLCREDS 0x0010 /* Use AUTH_NULL credential */
#define RPC_CALL_MAJORSEEN 0x0020 /* major timeout seen */
-#define RPC_TASK_ROOTCREDS 0x0040 /* force root creds */
#define RPC_TASK_DYNAMIC 0x0080 /* task was kmalloc'ed */
#define RPC_TASK_NO_ROUND_ROBIN 0x0100 /* send requests on "main" xprt */
#define RPC_TASK_SOFT 0x0200 /* Use soft timeouts */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 3a99358c262b..141dc34a450e 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -311,7 +311,6 @@ TRACE_EVENT(rpc_request,
{ RPC_TASK_MOVEABLE, "MOVEABLE" }, \
{ RPC_TASK_NULLCREDS, "NULLCREDS" }, \
{ RPC_CALL_MAJORSEEN, "MAJORSEEN" }, \
- { RPC_TASK_ROOTCREDS, "ROOTCREDS" }, \
{ RPC_TASK_DYNAMIC, "DYNAMIC" }, \
{ RPC_TASK_NO_ROUND_ROBIN, "NO_ROUND_ROBIN" }, \
{ RPC_TASK_SOFT, "SOFT" }, \
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 6bfa19f9fa6a..682fcd24bf43 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -670,7 +670,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
/* If machine cred couldn't be bound, try a root cred */
if (new)
;
- else if (cred == &machine_cred || (flags & RPC_TASK_ROOTCREDS))
+ else if (cred == &machine_cred)
new = rpcauth_bind_root_cred(task, lookupflags);
else if (flags & RPC_TASK_NULLCREDS)
new = authnull_ops.lookup_cred(NULL, NULL, 0);



2021-11-16 02:50:29

by NeilBrown

Subject: [PATCH 07/13] SUNRPC: remove scheduling boost for "SWAPPER" tasks.

Currently, tasks marked as "swapper" tasks get put to the front of
non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
the transport's ->xmit_queue.

This is pointless as currently *all* tasks for a mount that has swap
enabled on *any* file are marked as "swapper" tasks. So the net result
is that the non-priority rpc_queues are reverse-ordered (LIFO).

This scheduling boost is not necessary to avoid deadlocks, and hurts
fairness, so remove it. If there were a need to expedite some requests,
the tk_priority mechanism is a more appropriate tool.

Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/sched.c | 7 -------
net/sunrpc/xprt.c | 11 -----------
2 files changed, 18 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index d5b6e897f5a5..256302bf6557 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -186,11 +186,6 @@ static void __rpc_add_wait_queue_priority(struct rpc_wait_queue *queue,

/*
* Add new request to wait queue.
- *
- * Swapper tasks always get inserted at the head of the queue.
- * This should avoid many nasty memory deadlocks and hopefully
- * improve overall performance.
- * Everyone else gets appended to the queue to ensure proper FIFO behavior.
*/
static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
struct rpc_task *task,
@@ -199,8 +194,6 @@ static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
INIT_LIST_HEAD(&task->u.tk_wait.timer_list);
if (RPC_IS_PRIORITY(queue))
__rpc_add_wait_queue_priority(queue, task, queue_priority);
- else if (RPC_IS_SWAPPER(task))
- list_add(&task->u.tk_wait.list, &queue->tasks[0]);
else
list_add_tail(&task->u.tk_wait.list, &queue->tasks[0]);
task->tk_waitqueue = queue;
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 47d207e416ab..a0a2583fe941 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1354,17 +1354,6 @@ xprt_request_enqueue_transmit(struct rpc_task *task)
INIT_LIST_HEAD(&req->rq_xmit2);
goto out;
}
- } else if (RPC_IS_SWAPPER(task)) {
- list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
- if (pos->rq_cong || pos->rq_bytes_sent)
- continue;
- if (RPC_IS_SWAPPER(pos->rq_task))
- continue;
- /* Note: req is added _before_ pos */
- list_add_tail(&req->rq_xmit, &pos->rq_xmit);
- INIT_LIST_HEAD(&req->rq_xmit2);
- goto out;
- }
} else if (!req->rq_seqno) {
list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
if (pos->rq_task->tk_owner != task->tk_owner)



2021-11-16 02:51:25

by NeilBrown

Subject: [PATCH 05/13] SUNRPC/auth: async tasks mustn't block waiting for memory

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything. So it
must not block waiting for memory.

mempools are a particular problem as memory can only be released back
to the mempool by an async rpc task running. If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.

lookup_cred() can block on a mempool or kmalloc - and this can cause
deadlocks. So add a new RPCAUTH_LOOKUP flag for async lookups and don't
block on memory. If the -ENOMEM gets back to call_refreshresult(), wait
a short while and try again. HZ>>4 is chosen as it is used elsewhere
for -ENOMEM retries.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/sunrpc/auth.h | 1 +
net/sunrpc/auth.c | 6 +++++-
net/sunrpc/auth_gss/auth_gss.c | 6 +++++-
net/sunrpc/auth_unix.c | 10 ++++++++--
net/sunrpc/clnt.c | 3 +++
5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/auth.h b/include/linux/sunrpc/auth.h
index 98da816b5fc2..3e6ce288a7fc 100644
--- a/include/linux/sunrpc/auth.h
+++ b/include/linux/sunrpc/auth.h
@@ -99,6 +99,7 @@ struct rpc_auth_create_args {

/* Flags for rpcauth_lookupcred() */
#define RPCAUTH_LOOKUP_NEW 0x01 /* Accept an uninitialised cred */
+#define RPCAUTH_LOOKUP_ASYNC 0x02 /* Don't block waiting for memory */

/*
* Client authentication ops
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index a9f0d17fdb0d..6bfa19f9fa6a 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -615,6 +615,8 @@ rpcauth_bind_root_cred(struct rpc_task *task, int lookupflags)
};
struct rpc_cred *ret;

+ if (RPC_IS_ASYNC(task))
+ lookupflags |= RPCAUTH_LOOKUP_ASYNC;
ret = auth->au_ops->lookup_cred(auth, &acred, lookupflags);
put_cred(acred.cred);
return ret;
@@ -631,6 +633,8 @@ rpcauth_bind_machine_cred(struct rpc_task *task, int lookupflags)

if (!acred.principal)
return NULL;
+ if (RPC_IS_ASYNC(task))
+ lookupflags |= RPCAUTH_LOOKUP_ASYNC;
return auth->au_ops->lookup_cred(auth, &acred, lookupflags);
}

@@ -654,7 +658,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
};

if (flags & RPC_TASK_ASYNC)
- lookupflags |= RPCAUTH_LOOKUP_NEW;
+ lookupflags |= RPCAUTH_LOOKUP_NEW | RPCAUTH_LOOKUP_ASYNC;
if (task->tk_op_cred)
/* Task must use exactly this rpc_cred */
new = get_rpccred(task->tk_op_cred);
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 5f42aa5fc612..df72d6301f78 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1341,7 +1341,11 @@ gss_hash_cred(struct auth_cred *acred, unsigned int hashbits)
static struct rpc_cred *
gss_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
{
- return rpcauth_lookup_credcache(auth, acred, flags, GFP_NOFS);
+ gfp_t gfp = GFP_NOFS;
+
+ if (flags & RPCAUTH_LOOKUP_ASYNC)
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
+ return rpcauth_lookup_credcache(auth, acred, flags, gfp);
}

static struct rpc_cred *
diff --git a/net/sunrpc/auth_unix.c b/net/sunrpc/auth_unix.c
index e7df1f782b2e..e5819265dd1b 100644
--- a/net/sunrpc/auth_unix.c
+++ b/net/sunrpc/auth_unix.c
@@ -43,8 +43,14 @@ unx_destroy(struct rpc_auth *auth)
static struct rpc_cred *
unx_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
{
- struct rpc_cred *ret = mempool_alloc(unix_pool, GFP_NOFS);
-
+ gfp_t gfp = GFP_NOFS;
+ struct rpc_cred *ret;
+
+ if (flags & RPCAUTH_LOOKUP_ASYNC)
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
+ ret = mempool_alloc(unix_pool, gfp);
+ if (!ret)
+ return ERR_PTR(-ENOMEM);
rpcauth_init_cred(ret, acred, auth, &unix_credops);
ret->cr_flags = 1UL << RPCAUTH_CRED_UPTODATE;
return ret;
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index a312ea2bc440..238b2ef5491f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1745,6 +1745,9 @@ call_refreshresult(struct rpc_task *task)
task->tk_cred_retry--;
trace_rpc_retry_refresh_status(task);
return;
+ case -ENOMEM:
+ rpc_delay(task, HZ >> 4);
+ return;
}
trace_rpc_refresh_status(task);
rpc_call_rpcerror(task, status);



2021-11-16 03:55:53

by NeilBrown

Subject: Re: [PATCH 00/13] Repair SWAP-over-NFS

On Tue, 16 Nov 2021, Matthew Wilcox wrote:
> On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> > swap-over-NFS currently has a variety of problems.
> >
> > Due to a newish test in generic_write_checks(), all writes to swap
> > currently fail.
>
> And by "currently", you mean "for over two years" (August 2019).

That's about the time scale for "enterprise" releases...
Actually, the earliest patches that impacted swap-over-NFS were more like
4 years ago. I didn't bother tracking Fixes: tags for everything that
was a fix, as I didn't think it would really help and might encourage
people to backport little bits of the series which I wouldn't recommend.

> Does swap-over-NFS (or any other network filesystem) actually have any
> users, and should we fix it or rip it out?
>
>
We have at least one user (why else would I be working on this?). I
think we have more, though they are presumably still on an earlier
release.

I'd prefer "fix it" over "rip it out".

I don't think any other network filesystem supports swap, but it is
not trivial to grep for. There must be a 'swap_activate' method, and it
must return 0. There must also be a direct_IO that works.
The only other network filesystem with swap_activate is cifs. It
returns 0, but direct_IO returns -EINVAL.
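
To make that concrete, here is a hedged sketch of the minimum a network
filesystem would need to wire up; the examplefs_* names are made up,
and a real implementation (like the NFS one in this series) does
considerably more, such as recording the swap extent and activating the
RPC swapper state:

#include <linux/fs.h>
#include <linux/swap.h>
#include <linux/uio.h>

/* Hypothetical sketch: ->swap_activate must succeed (return 0) and
 * ->direct_IO must actually perform the IO.  cifs fails the second
 * requirement because its direct_IO returns -EINVAL.
 */
static ssize_t examplefs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
	/* a real filesystem does the read or write here */
	return -EIO;	/* placeholder only; -EINVAL would rule out swap */
}

static int examplefs_swap_activate(struct swap_info_struct *sis,
				   struct file *file, sector_t *span)
{
	*span = sis->pages;
	return 0;	/* non-zero makes swapon() fail */
}

static void examplefs_swap_deactivate(struct file *file)
{
}

static const struct address_space_operations examplefs_aops = {
	.direct_IO	 = examplefs_direct_IO,
	.swap_activate	 = examplefs_swap_activate,
	.swap_deactivate = examplefs_swap_deactivate,
};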

NeilBrown

2021-11-16 05:44:55

by Matthew Wilcox

Subject: Re: [PATCH 00/13] Repair SWAP-over-NFS

On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> swap-over-NFS currently has a variety of problems.
>
> Due to a newish test in generic_write_checks(), all writes to swap
> currently fail.

And by "currently", you mean "for over two years" (August 2019).
Does swap-over-NFS (or any other network filesystem) actually have any
users, and should we fix it or rip it out?


2021-11-16 07:52:27

by Christoph Hellwig

Subject: Re: [PATCH 02/13] NFS: do not take i_rwsem for swap IO

I'd really much prefer the variant we discussed before where
swap I/O uses its own method instead of overloading the normal
file I/O path all over.


2021-11-16 08:32:06

by Christoph Hellwig

Subject: Re: [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space

On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> When pages are read from SWP_FS_OPS swap-space, the reads are submitted as
> separate reads for each page. This is generally less efficient than
> larger reads.
>
> We can use the block-plugging infrastructure to delay submitting the
> read request until multiple contiguous pages have been collected. This
> requires using ->direct_IO to submit the read (as ->readpages isn't
> suitable for swap).

Abusing the block code here seems a little ugly. Also this won't
compile if CONFIG_BLOCK is not set, will it?

What is the problem with just batching up manually?

> + /* nofs needs as ->direct_IO may take the same mutex it takes for write */

Overly long line.

2021-11-16 08:32:58

by Christoph Hellwig

Subject: Re: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS

On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> + /* ->flags can be updated non-atomically (scan_swap_map_slots),
> + * but that will never affect SWP_FS_OPS, so the data_race
> + * is safe.
> + */
> may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> + (PageSwapCache(page) &&
> + !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
> + (sc->gfp_mask & __GFP_IO));

You might want to move the comment and SWP_FS_OPS into a little
inline helper. That makes it a lot more readable and also avoids the
overly long line in the second hunk.

2021-11-16 21:35:31

by NeilBrown

Subject: Re: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS

On Tue, 16 Nov 2021, Christoph Hellwig wrote:
> On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> > + /* ->flags can be updated non-atomically (scan_swap_map_slots),
> > + * but that will never affect SWP_FS_OPS, so the data_race
> > + * is safe.
> > + */
> > may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> > + (PageSwapCache(page) &&
> > + !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
> > + (sc->gfp_mask & __GFP_IO));
>
> You might want to move the comment and SWP_FS_OPS into a little
> inline helper. That makes it a lot more readable and also avoids the
> overly long line in the second hunk.

Yes, that's a good idea. Something like this....

Thanks,
NeilBrown

From a85d09cc3d671c45e32d782454afeeaaaece96c7 Mon Sep 17 00:00:00 2001
From: NeilBrown <[email protected]>
Date: Fri, 29 Oct 2021 13:35:56 +1100
Subject: [PATCH] MM: reclaim mustn't enter FS for swap-over-NFS

If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
safe to enter the FS for reclaim.
So only down-grade the requirement for swap pages to __GFP_IO after
checking that SWP_FS_OPS are not being used.

Signed-off-by: NeilBrown <[email protected]>
---
mm/vmscan.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..e672fcc14bac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1464,6 +1464,21 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
return nr_succeeded;
}

+static bool test_may_enter_fs(struct page *page, gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_FS)
+ return true;
+ if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
+ return false;
+ /* We can "enter_fs" for swap-cache with only __GFP_IO
+ * providing this isn't SWP_FS_OPS.
+ * ->flags can be updated non-atomically (scan_swap_map_slots),
+ * but that will never affect SWP_FS_OPS, so the data_race
+ * is safe.
+ */
+ return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -1513,8 +1528,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!sc->may_unmap && page_mapped(page))
goto keep_locked;

- may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
- (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
+ may_enter_fs = test_may_enter_fs(page, sc->gfp_mask);

/*
* The number of dirty pages determines if a node is marked
@@ -1682,7 +1696,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
goto activate_locked_split;
}

- may_enter_fs = true;
+ may_enter_fs = test_may_enter_fs(page,
+ sc->gfp_mask);

/* Adding to swap updated mapping */
mapping = page_mapping(page);
--
2.33.1


2021-11-16 21:46:22

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space

On Tue, 16 Nov 2021, Christoph Hellwig wrote:
> On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> > When pages are read from SWP_FS_OPS swap-space, the reads are submitted as
> > separate reads for each page. This is generally less efficient than
> > larger reads.
> >
> > We can use the block-plugging infrastructure to delay submitting the
> > read request until multiple contiguous pages have been collected. This
> > requires using ->direct_IO to submit the read (as ->readpages isn't
> > suitable for swap).
>
> Abusing the block code here seems a little ugly. Also this won't
> compile if CONFIG_BLOCK is not set, will it?

There is nothing really block-layer-specific about the plugging
interfaces. I think it would be quite reasonable to move them to lib/.
But you are correct that currently without CONFIG_BLOCK the code will
compile but not work.

>
> What is the problem with just batching up manually?

That would require a bigger change to common code, which would only
benefit one user. The plugging mechanism works well for batching
requests to a block device. Why not use it for non-block-devices too?
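
To illustrate the idea (a sketch only, not the code in this series; it
assumes the usual linux/blkdev.h and linux/swap.h declarations):

	/* Batch a run of swap-in reads under one plug. */
	static void swap_read_batch(struct page **pages, int nr)
	{
		struct blk_plug plug;
		int i;

		blk_start_plug(&plug);
		for (i = 0; i < nr; i++)
			swap_readpage(pages[i], false);
		/*
		 * For SWP_FS_OPS, swap_readpage() could use
		 * blk_check_plugged() to queue each page on a per-plug
		 * list instead of issuing it immediately; the registered
		 * callback then runs at blk_finish_plug() time and can
		 * submit the whole contiguous run as a single
		 * ->direct_IO request.
		 */
		blk_finish_plug(&plug);
	}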

Thanks,
NeilBrown


>
> > + /* nofs is needed as ->direct_IO may take the same mutex it takes for write */
>
> Overly long line.
>
>

2021-11-16 21:50:21

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 02/13] NFS: do not take i_rwsem for swap IO

On Tue, 16 Nov 2021, Christoph Hellwig wrote:
> I'd really much prefer the variant we discussed before where
> swap I/O uses it's own method instead of overloading the normal
> file I/O path all over.
>
>
This would be David Howells' "mm: Use DIO for swap and fix NFS
swapfiles" series? I'd be very happy to work with that once it lands.
I might try it out and see how the two work together.

Thanks,
NeilBrown

2021-11-17 05:49:43

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 02/13] NFS: do not take i_rwsem for swap IO

On Wed, Nov 17, 2021 at 08:50:12AM +1100, NeilBrown wrote:
> This would be David Howells' "mm: Use DIO for swap and fix NFS
> swapfiles" series?

Yes.

> I'd be very happy to work with that once it lands.
> I might try it out and see how the two work together.

Dave: is it ok if Neil takes over the swap vs NFS work given that he's
working on a customer requirement?

2021-11-17 05:50:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS

On Wed, Nov 17, 2021 at 08:35:23AM +1100, NeilBrown wrote:
> + /* We can "enter_fs" for swap-cache with only __GFP_IO

normal kernel style would be the

/*

on a line of its own. But except for this nitpick, this looks good.
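
i.e. the comment block in test_may_enter_fs() would then read something
like:

	/*
	 * We can "enter_fs" for swap-cache with only __GFP_IO providing
	 * this isn't SWP_FS_OPS.  ->flags can be updated non-atomically
	 * (scan_swap_map_slots), but that will never affect SWP_FS_OPS,
	 * so the data_race is safe.
	 */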

2021-11-18 01:43:33

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.16-rc1]
[also build test ERROR on next-20211117]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master rostedt-trace/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211116-104822
base: fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf
config: mips-randconfig-r031-20211116 (attached as .config)
compiler: mipsel-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/b2f1d12df57f816d09ef57fa73758fec820a23f1
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211116-104822
git checkout b2f1d12df57f816d09ef57fa73758fec820a23f1
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=mips

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from <command-line>:
mm/vmscan.c: In function 'shrink_page_list':
>> mm/vmscan.c:1522:37: error: implicit declaration of function 'page_swap_info'; did you mean 'swp_swap_info'? [-Werror=implicit-function-declaration]
1522 | !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
| ^~~~~~~~~~~~~~
include/linux/compiler_types.h:291:27: note: in definition of macro '__unqual_scalar_typeof'
291 | _Generic((x), \
| ^
mm/vmscan.c:1522:27: note: in expansion of macro 'data_race'
1522 | !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
| ^~~~~~~~~
>> mm/vmscan.c:1522:57: error: invalid type argument of '->' (have 'int')
1522 | !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
| ^~
include/linux/compiler_types.h:291:27: note: in definition of macro '__unqual_scalar_typeof'
291 | _Generic((x), \
| ^
mm/vmscan.c:1522:27: note: in expansion of macro 'data_race'
1522 | !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
| ^~~~~~~~~
In file included from arch/mips/include/asm/bug.h:5,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from mm/vmscan.c:15:
>> mm/vmscan.c:1522:57: error: invalid type argument of '->' (have 'int')
1522 | !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
| ^~
include/linux/compiler.h:218:17: note: in definition of macro 'data_race'
218 | expr; \
| ^~~~
In file included from <command-line>:
mm/vmscan.c:1692:68: error: invalid type argument of '->' (have 'int')
1692 | !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
| ^~
include/linux/compiler_types.h:291:27: note: in definition of macro '__unqual_scalar_typeof'
291 | _Generic((x), \
| ^
mm/vmscan.c:1692:38: note: in expansion of macro 'data_race'
1692 | !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
| ^~~~~~~~~
In file included from arch/mips/include/asm/bug.h:5,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from mm/vmscan.c:15:
mm/vmscan.c:1692:68: error: invalid type argument of '->' (have 'int')
1692 | !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
| ^~
include/linux/compiler.h:218:17: note: in definition of macro 'data_race'
218 | expr; \
| ^~~~
cc1: some warnings being treated as errors


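The compiler's "did you mean 'swp_swap_info'?" hint suggests this is a
CONFIG_SWAP=n build, in which page_swap_info() is not declared. A minimal
sketch of one possible way to keep the SWP_FS_OPS test building in that
configuration (the helper name page_swap_flags() is hypothetical):

	/* In <linux/swap.h>: provide the flags even when CONFIG_SWAP=n. */
	#ifdef CONFIG_SWAP
	static inline unsigned long page_swap_flags(struct page *page)
	{
		return page_swap_info(page)->flags;
	}
	#else
	static inline unsigned long page_swap_flags(struct page *page)
	{
		return 0;	/* no swap, so never SWP_FS_OPS */
	}
	#endif

The check in shrink_page_list() would then become
!data_race(page_swap_flags(page) & SWP_FS_OPS).
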
vim +1522 mm/vmscan.c

1466
1467 /*
1468 * shrink_page_list() returns the number of reclaimed pages
1469 */
1470 static unsigned int shrink_page_list(struct list_head *page_list,
1471 struct pglist_data *pgdat,
1472 struct scan_control *sc,
1473 struct reclaim_stat *stat,
1474 bool ignore_references)
1475 {
1476 LIST_HEAD(ret_pages);
1477 LIST_HEAD(free_pages);
1478 LIST_HEAD(demote_pages);
1479 unsigned int nr_reclaimed = 0;
1480 unsigned int pgactivate = 0;
1481 bool do_demote_pass;
1482
1483 memset(stat, 0, sizeof(*stat));
1484 cond_resched();
1485 do_demote_pass = can_demote(pgdat->node_id, sc);
1486
1487 retry:
1488 while (!list_empty(page_list)) {
1489 struct address_space *mapping;
1490 struct page *page;
1491 enum page_references references = PAGEREF_RECLAIM;
1492 bool dirty, writeback, may_enter_fs;
1493 unsigned int nr_pages;
1494
1495 cond_resched();
1496
1497 page = lru_to_page(page_list);
1498 list_del(&page->lru);
1499
1500 if (!trylock_page(page))
1501 goto keep;
1502
1503 VM_BUG_ON_PAGE(PageActive(page), page);
1504
1505 nr_pages = compound_nr(page);
1506
1507 /* Account the number of base pages even though THP */
1508 sc->nr_scanned += nr_pages;
1509
1510 if (unlikely(!page_evictable(page)))
1511 goto activate_locked;
1512
1513 if (!sc->may_unmap && page_mapped(page))
1514 goto keep_locked;
1515
1516 /* ->flags can be updated non-atomicially (scan_swap_map_slots),
1517 * but that will never affect SWP_FS_OPS, so the data_race
1518 * is safe.
1519 */
1520 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
1521 (PageSwapCache(page) &&
> 1522 !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
1523 (sc->gfp_mask & __GFP_IO));
1524
1525 /*
1526 * The number of dirty pages determines if a node is marked
1527 * reclaim_congested. kswapd will stall and start writing
1528 * pages if the tail of the LRU is all dirty unqueued pages.
1529 */
1530 page_check_dirty_writeback(page, &dirty, &writeback);
1531 if (dirty || writeback)
1532 stat->nr_dirty++;
1533
1534 if (dirty && !writeback)
1535 stat->nr_unqueued_dirty++;
1536
1537 /*
1538 * Treat this page as congested if the underlying BDI is or if
1539 * pages are cycling through the LRU so quickly that the
1540 * pages marked for immediate reclaim are making it to the
1541 * end of the LRU a second time.
1542 */
1543 mapping = page_mapping(page);
1544 if (((dirty || writeback) && mapping &&
1545 inode_write_congested(mapping->host)) ||
1546 (writeback && PageReclaim(page)))
1547 stat->nr_congested++;
1548
1549 /*
1550 * If a page at the tail of the LRU is under writeback, there
1551 * are three cases to consider.
1552 *
1553 * 1) If reclaim is encountering an excessive number of pages
1554 * under writeback and this page is both under writeback and
1555 * PageReclaim then it indicates that pages are being queued
1556 * for IO but are being recycled through the LRU before the
1557 * IO can complete. Waiting on the page itself risks an
1558 * indefinite stall if it is impossible to writeback the
1559 * page due to IO error or disconnected storage so instead
1560 * note that the LRU is being scanned too quickly and the
1561 * caller can stall after page list has been processed.
1562 *
1563 * 2) Global or new memcg reclaim encounters a page that is
1564 * not marked for immediate reclaim, or the caller does not
1565 * have __GFP_FS (or __GFP_IO if it's simply going to swap,
1566 * not to fs). In this case mark the page for immediate
1567 * reclaim and continue scanning.
1568 *
1569 * Require may_enter_fs because we would wait on fs, which
1570 * may not have submitted IO yet. And the loop driver might
1571 * enter reclaim, and deadlock if it waits on a page for
1572 * which it is needed to do the write (loop masks off
1573 * __GFP_IO|__GFP_FS for this reason); but more thought
1574 * would probably show more reasons.
1575 *
1576 * 3) Legacy memcg encounters a page that is already marked
1577 * PageReclaim. memcg does not have any dirty pages
1578 * throttling so we could easily OOM just because too many
1579 * pages are in writeback and there is nothing else to
1580 * reclaim. Wait for the writeback to complete.
1581 *
1582 * In cases 1) and 2) we activate the pages to get them out of
1583 * the way while we continue scanning for clean pages on the
1584 * inactive list and refilling from the active list. The
1585 * observation here is that waiting for disk writes is more
1586 * expensive than potentially causing reloads down the line.
1587 * Since they're marked for immediate reclaim, they won't put
1588 * memory pressure on the cache working set any longer than it
1589 * takes to write them to disk.
1590 */
1591 if (PageWriteback(page)) {
1592 /* Case 1 above */
1593 if (current_is_kswapd() &&
1594 PageReclaim(page) &&
1595 test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
1596 stat->nr_immediate++;
1597 goto activate_locked;
1598
1599 /* Case 2 above */
1600 } else if (writeback_throttling_sane(sc) ||
1601 !PageReclaim(page) || !may_enter_fs) {
1602 /*
1603 * This is slightly racy - end_page_writeback()
1604 * might have just cleared PageReclaim, then
1605 * setting PageReclaim here end up interpreted
1606 * as PageReadahead - but that does not matter
1607 * enough to care. What we do want is for this
1608 * page to have PageReclaim set next time memcg
1609 * reclaim reaches the tests above, so it will
1610 * then wait_on_page_writeback() to avoid OOM;
1611 * and it's also appropriate in global reclaim.
1612 */
1613 SetPageReclaim(page);
1614 stat->nr_writeback++;
1615 goto activate_locked;
1616
1617 /* Case 3 above */
1618 } else {
1619 unlock_page(page);
1620 wait_on_page_writeback(page);
1621 /* then go back and try same page again */
1622 list_add_tail(&page->lru, page_list);
1623 continue;
1624 }
1625 }
1626
1627 if (!ignore_references)
1628 references = page_check_references(page, sc);
1629
1630 switch (references) {
1631 case PAGEREF_ACTIVATE:
1632 goto activate_locked;
1633 case PAGEREF_KEEP:
1634 stat->nr_ref_keep += nr_pages;
1635 goto keep_locked;
1636 case PAGEREF_RECLAIM:
1637 case PAGEREF_RECLAIM_CLEAN:
1638 ; /* try to reclaim the page below */
1639 }
1640
1641 /*
1642 * Before reclaiming the page, try to relocate
1643 * its contents to another node.
1644 */
1645 if (do_demote_pass &&
1646 (thp_migration_supported() || !PageTransHuge(page))) {
1647 list_add(&page->lru, &demote_pages);
1648 unlock_page(page);
1649 continue;
1650 }
1651
1652 /*
1653 * Anonymous process memory has backing store?
1654 * Try to allocate it some swap space here.
1655 * Lazyfree page could be freed directly
1656 */
1657 if (PageAnon(page) && PageSwapBacked(page)) {
1658 if (!PageSwapCache(page)) {
1659 if (!(sc->gfp_mask & __GFP_IO))
1660 goto keep_locked;
1661 if (page_maybe_dma_pinned(page))
1662 goto keep_locked;
1663 if (PageTransHuge(page)) {
1664 /* cannot split THP, skip it */
1665 if (!can_split_huge_page(page, NULL))
1666 goto activate_locked;
1667 /*
1668 * Split pages without a PMD map right
1669 * away. Chances are some or all of the
1670 * tail pages can be freed without IO.
1671 */
1672 if (!compound_mapcount(page) &&
1673 split_huge_page_to_list(page,
1674 page_list))
1675 goto activate_locked;
1676 }
1677 if (!add_to_swap(page)) {
1678 if (!PageTransHuge(page))
1679 goto activate_locked_split;
1680 /* Fallback to swap normal pages */
1681 if (split_huge_page_to_list(page,
1682 page_list))
1683 goto activate_locked;
1684 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
1685 count_vm_event(THP_SWPOUT_FALLBACK);
1686 #endif
1687 if (!add_to_swap(page))
1688 goto activate_locked_split;
1689 }
1690
1691 if ((sc->gfp_mask & __GFP_FS) ||
1692 !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
1693 may_enter_fs = true;
1694
1695 /* Adding to swap updated mapping */
1696 mapping = page_mapping(page);
1697 }
1698 } else if (unlikely(PageTransHuge(page))) {
1699 /* Split file THP */
1700 if (split_huge_page_to_list(page, page_list))
1701 goto keep_locked;
1702 }
1703
1704 /*
1705 * THP may get split above, need minus tail pages and update
1706 * nr_pages to avoid accounting tail pages twice.
1707 *
1708 * The tail pages that are added into swap cache successfully
1709 * reach here.
1710 */
1711 if ((nr_pages > 1) && !PageTransHuge(page)) {
1712 sc->nr_scanned -= (nr_pages - 1);
1713 nr_pages = 1;
1714 }
1715
1716 /*
1717 * The page is mapped into the page tables of one or more
1718 * processes. Try to unmap it here.
1719 */
1720 if (page_mapped(page)) {
1721 enum ttu_flags flags = TTU_BATCH_FLUSH;
1722 bool was_swapbacked = PageSwapBacked(page);
1723
1724 if (unlikely(PageTransHuge(page)))
1725 flags |= TTU_SPLIT_HUGE_PMD;
1726
1727 try_to_unmap(page, flags);
1728 if (page_mapped(page)) {
1729 stat->nr_unmap_fail += nr_pages;
1730 if (!was_swapbacked && PageSwapBacked(page))
1731 stat->nr_lazyfree_fail += nr_pages;
1732 goto activate_locked;
1733 }
1734 }
1735
1736 if (PageDirty(page)) {
1737 /*
1738 * Only kswapd can writeback filesystem pages
1739 * to avoid risk of stack overflow. But avoid
1740 * injecting inefficient single-page IO into
1741 * flusher writeback as much as possible: only
1742 * write pages when we've encountered many
1743 * dirty pages, and when we've already scanned
1744 * the rest of the LRU for clean pages and see
1745 * the same dirty pages again (PageReclaim).
1746 */
1747 if (page_is_file_lru(page) &&
1748 (!current_is_kswapd() || !PageReclaim(page) ||
1749 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
1750 /*
1751 * Immediately reclaim when written back.
1752 * Similar in principal to deactivate_page()
1753 * except we already have the page isolated
1754 * and know it's dirty
1755 */
1756 inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
1757 SetPageReclaim(page);
1758
1759 goto activate_locked;
1760 }
1761
1762 if (references == PAGEREF_RECLAIM_CLEAN)
1763 goto keep_locked;
1764 if (!may_enter_fs)
1765 goto keep_locked;
1766 if (!sc->may_writepage)
1767 goto keep_locked;
1768
1769 /*
1770 * Page is dirty. Flush the TLB if a writable entry
1771 * potentially exists to avoid CPU writes after IO
1772 * starts and then write it out here.
1773 */
1774 try_to_unmap_flush_dirty();
1775 switch (pageout(page, mapping)) {
1776 case PAGE_KEEP:
1777 goto keep_locked;
1778 case PAGE_ACTIVATE:
1779 goto activate_locked;
1780 case PAGE_SUCCESS:
1781 stat->nr_pageout += thp_nr_pages(page);
1782
1783 if (PageWriteback(page))
1784 goto keep;
1785 if (PageDirty(page))
1786 goto keep;
1787
1788 /*
1789 * A synchronous write - probably a ramdisk. Go
1790 * ahead and try to reclaim the page.
1791 */
1792 if (!trylock_page(page))
1793 goto keep;
1794 if (PageDirty(page) || PageWriteback(page))
1795 goto keep_locked;
1796 mapping = page_mapping(page);
1797 fallthrough;
1798 case PAGE_CLEAN:
1799 ; /* try to free the page below */
1800 }
1801 }
1802
1803 /*
1804 * If the page has buffers, try to free the buffer mappings
1805 * associated with this page. If we succeed we try to free
1806 * the page as well.
1807 *
1808 * We do this even if the page is PageDirty().
1809 * try_to_release_page() does not perform I/O, but it is
1810 * possible for a page to have PageDirty set, but it is actually
1811 * clean (all its buffers are clean). This happens if the
1812 * buffers were written out directly, with submit_bh(). ext3
1813 * will do this, as well as the blockdev mapping.
1814 * try_to_release_page() will discover that cleanness and will
1815 * drop the buffers and mark the page clean - it can be freed.
1816 *
1817 * Rarely, pages can have buffers and no ->mapping. These are
1818 * the pages which were not successfully invalidated in
1819 * truncate_cleanup_page(). We try to drop those buffers here
1820 * and if that worked, and the page is no longer mapped into
1821 * process address space (page_count == 1) it can be freed.
1822 * Otherwise, leave the page on the LRU so it is swappable.
1823 */
1824 if (page_has_private(page)) {
1825 if (!try_to_release_page(page, sc->gfp_mask))
1826 goto activate_locked;
1827 if (!mapping && page_count(page) == 1) {
1828 unlock_page(page);
1829 if (put_page_testzero(page))
1830 goto free_it;
1831 else {
1832 /*
1833 * rare race with speculative reference.
1834 * the speculative reference will free
1835 * this page shortly, so we may
1836 * increment nr_reclaimed here (and
1837 * leave it off the LRU).
1838 */
1839 nr_reclaimed++;
1840 continue;
1841 }
1842 }
1843 }
1844
1845 if (PageAnon(page) && !PageSwapBacked(page)) {
1846 /* follow __remove_mapping for reference */
1847 if (!page_ref_freeze(page, 1))
1848 goto keep_locked;
1849 /*
1850 * The page has only one reference left, which is
1851 * from the isolation. After the caller puts the
1852 * page back on lru and drops the reference, the
1853 * page will be freed anyway. It doesn't matter
1854 * which lru it goes. So we don't bother checking
1855 * PageDirty here.
1856 */
1857 count_vm_event(PGLAZYFREED);
1858 count_memcg_page_event(page, PGLAZYFREED);
1859 } else if (!mapping || !__remove_mapping(mapping, page, true,
1860 sc->target_mem_cgroup))
1861 goto keep_locked;
1862
1863 unlock_page(page);
1864 free_it:
1865 /*
1866 * THP may get swapped out in a whole, need account
1867 * all base pages.
1868 */
1869 nr_reclaimed += nr_pages;
1870
1871 /*
1872 * Is there need to periodically free_page_list? It would
1873 * appear not as the counts should be low
1874 */
1875 if (unlikely(PageTransHuge(page)))
1876 destroy_compound_page(page);
1877 else
1878 list_add(&page->lru, &free_pages);
1879 continue;
1880
1881 activate_locked_split:
1882 /*
1883 * The tail pages that are failed to add into swap cache
1884 * reach here. Fixup nr_scanned and nr_pages.
1885 */
1886 if (nr_pages > 1) {
1887 sc->nr_scanned -= (nr_pages - 1);
1888 nr_pages = 1;
1889 }
1890 activate_locked:
1891 /* Not a candidate for swapping, so reclaim swap space. */
1892 if (PageSwapCache(page) && (mem_cgroup_swap_full(page) ||
1893 PageMlocked(page)))
1894 try_to_free_swap(page);
1895 VM_BUG_ON_PAGE(PageActive(page), page);
1896 if (!PageMlocked(page)) {
1897 int type = page_is_file_lru(page);
1898 SetPageActive(page);
1899 stat->nr_activate[type] += nr_pages;
1900 count_memcg_page_event(page, PGACTIVATE);
1901 }
1902 keep_locked:
1903 unlock_page(page);
1904 keep:
1905 list_add(&page->lru, &ret_pages);
1906 VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
1907 }
1908 /* 'page_list' is always empty here */
1909
1910 /* Migrate pages selected for demotion */
1911 nr_reclaimed += demote_page_list(&demote_pages, pgdat);
1912 /* Pages that could not be demoted are still in @demote_pages */
1913 if (!list_empty(&demote_pages)) {
1914 /* Pages which failed to demoted go back on @page_list for retry: */
1915 list_splice_init(&demote_pages, page_list);
1916 do_demote_pass = false;
1917 goto retry;
1918 }
1919
1920 pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
1921
1922 mem_cgroup_uncharge_list(&free_pages);
1923 try_to_unmap_flush();
1924 free_unref_page_list(&free_pages);
1925
1926 list_splice(&ret_pages, page_list);
1927 count_vm_events(PGACTIVATE, pgactivate);
1928
1929 return nr_reclaimed;
1930 }
1931

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (22.16 kB)
.config.gz (32.54 kB)