Because nfs_release_page() submits a 'COMMIT' nfs request and waits
for it to complete, and does this during memory reclaim, it is
susceptible to deadlocks if memory allocation happens anywhere in
sending the COMMIT message. If the NFS server is on the same host
(i.e. loop-back NFS), then any memory allocations in the NFS server
can also cause deadlocks.
nfs_release_page() already has some code to avoid deadlocks in some
circumstances, but these are not sufficient for loopback NFS.
This patch set changes the approach to deadlock avoidance. Rather
than detecting cases that could deadlock and avoiding the COMMIT, it
always tries the COMMIT, but only waits a short time (1 second).
This avoids any possibility of deadlock, at the cost of never waiting
longer than 1 second even when no deadlock is pending.
nfs_release_page() does not *need* to wait longer - all callers that
matter handle a failure gracefully - they move on to other pages.
This set:
- adds some "_timeout()" functions to "wait_on_bit". Only a
wait_on_page version is actually used.
- exports page wake_up support. NFS knows that the COMMIT is complete
when PG_private is clear. So nfs_release_page will use
wait_on_page_bit_killable_timeout to wait for the bit to clear,
and needs access to wake_up_page()
- changes nfs_release_page() to use
wait_on_page_bit_killable_timeout()
- removes the other deadlock avoidance mechanisms from
nfs_release_page, so that PF_FSTRANS is again only used
by XFS.
As such, it needs buy-in from sched people, mm people, and NFS people.
Assuming I get that buy-in, suggestions for how these patches can flow
into mainline would be appreciated ... I daren't hope they can all go
in through one tree....
Thanks,
NeilBrown
---
NeilBrown (4):
SCHED: add some "wait..on_bit...timeout()" interfaces.
MM: export page_wakeup functions
NFS: avoid deadlocks with loop-back mounted NFS filesystems.
NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
fs/nfs/file.c | 22 ++++++++++++----------
fs/nfs/write.c | 2 ++
include/linux/pagemap.h | 12 ++++++++++--
include/linux/wait.h | 5 ++++-
kernel/sched/wait.c | 36 ++++++++++++++++++++++++++++++++++++
mm/filemap.c | 21 +++++++++++++++------
net/sunrpc/sched.c | 2 --
net/sunrpc/xprtrdma/transport.c | 2 --
net/sunrpc/xprtsock.c | 10 ----------
9 files changed, 79 insertions(+), 33 deletions(-)
--
Signature
This will allow NFS to wait for PG_private to be cleared and,
particularly, to send a wake-up when it is.
Signed-off-by: NeilBrown <[email protected]>
---
include/linux/pagemap.h | 10 ++++++++--
mm/filemap.c | 8 ++------
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 87f9e4230d3a..2dca0cef3506 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -496,8 +496,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
}
/*
- * This is exported only for wait_on_page_locked/wait_on_page_writeback.
- * Never use this directly!
+ * This is exported only for wait_on_page_locked/wait_on_page_writeback,
+ * and for filesystems which need to wait on PG_private.
*/
extern void wait_on_page_bit(struct page *page, int bit_nr);
@@ -512,6 +512,12 @@ static inline int wait_on_page_locked_killable(struct page *page)
return 0;
}
+extern wait_queue_head_t *page_waitqueue(struct page *page);
+static inline void wake_up_page(struct page *page, int bit)
+{
+ __wake_up_bit(page_waitqueue(page), &page->flags, bit);
+}
+
/*
* Wait for a page to be unlocked.
*
diff --git a/mm/filemap.c b/mm/filemap.c
index 4a19c084bdb1..c9ba09f2ad3c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -670,17 +670,13 @@ EXPORT_SYMBOL(__page_cache_alloc);
* at a cost of "thundering herd" phenomena during rare hash
* collisions.
*/
-static wait_queue_head_t *page_waitqueue(struct page *page)
+wait_queue_head_t *page_waitqueue(struct page *page)
{
const struct zone *zone = page_zone(page);
return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}
-
-static inline void wake_up_page(struct page *page, int bit)
-{
- __wake_up_bit(page_waitqueue(page), &page->flags, bit);
-}
+EXPORT_SYMBOL(page_waitqueue);
void wait_on_page_bit(struct page *page, int bit_nr)
{
On Thu, 18 Sep 2014 16:42:22 +0200 Peter Zijlstra <[email protected]>
wrote:
> On Tue, Sep 16, 2014 at 03:31:35PM +1000, NeilBrown wrote:
> > In commit c1221321b7c25b53204447cff9949a6d5a7ddddc
> > sched: Allow wait_on_bit_action() functions to support a timeout
> >
> > I suggested that a "wait_on_bit_timeout()" interface would not meet my
> > need. This isn't true - I was just over-engineering.
> >
> > Including a 'private' field in wait_bit_key instead of a focused
> > "timeout" field was just premature generalization. If some other
> > use is ever found, it can be generalized or added later.
> >
> > So this patch renames "private" to "timeout" with a meaning "stop
> > waiting when "jiffies" reaches or passes "timeout",
> > and adds two of the many possible wait..bit..timeout() interfaces:
> >
> > wait_on_page_bit_killable_timeout(), which is the one I want to use,
> > and out_of_line_wait_on_bit_timeout() which is a reasonably general
> > example. Others can be added as needed.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > include/linux/pagemap.h | 2 ++
> > include/linux/wait.h | 5 ++++-
> > kernel/sched/wait.c | 36 ++++++++++++++++++++++++++++++++++++
> > mm/filemap.c | 13 +++++++++++++
> > 4 files changed, 55 insertions(+), 1 deletion(-)
> >
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
Thanks.
I assume that means it is OK for this patch to go to Linus via the NFS tree,
so we get to keep everything together.
Now I just need an Ack from akpm for the mm bits (please...)
NeilBrown
On Tue, Sep 16, 2014 at 03:31:35PM +1000, NeilBrown wrote:
> In commit c1221321b7c25b53204447cff9949a6d5a7ddddc
> sched: Allow wait_on_bit_action() functions to support a timeout
>
> I suggested that a "wait_on_bit_timeout()" interface would not meet my
> need. This isn't true - I was just over-engineering.
>
> Including a 'private' field in wait_bit_key instead of a focused
> "timeout" field was just premature generalization. If some other
> use is ever found, it can be generalized or added later.
>
> So this patch renames "private" to "timeout" with a meaning "stop
> waiting when "jiffies" reaches or passes "timeout",
> and adds two of the many possible wait..bit..timeout() interfaces:
>
> wait_on_page_bit_killable_timeout(), which is the one I want to use,
> and out_of_line_wait_on_bit_timeout() which is a reasonably general
> example. Others can be added as needed.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> include/linux/pagemap.h | 2 ++
> include/linux/wait.h | 5 ++++-
> kernel/sched/wait.c | 36 ++++++++++++++++++++++++++++++++++++
> mm/filemap.c | 13 +++++++++++++
> 4 files changed, 55 insertions(+), 1 deletion(-)
>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
On Tue, 16 Sep 2014 18:04:55 -0400 Trond Myklebust
<[email protected]> wrote:
> Hi Neil,
>
> On Tue, Sep 16, 2014 at 1:31 AM, NeilBrown <[email protected]> wrote:
> > Now that nfs_release_page() doesn't block indefinitely, other deadlock
> > avoidance mechanisms aren't needed.
> > - it doesn't hurt for kswapd to block occasionally. If it doesn't
> > want to block it would clear __GFP_WAIT. The current_is_kswapd()
> > was only added to avoid deadlocks and we have a new approach for
> > that.
> > - memory allocation in the SUNRPC layer can very rarely try to
> > ->releasepage() a page it is trying to handle. The deadlock
> > is removed as nfs_release_page() doesn't block indefinitely.
> >
> > So we don't need to set PF_FSTRANS for sunrpc network operations any
> > more.
>
> Jeff Layton and I had a little discussion about this earlier today.
> The issue that Jeff raised was that these 1 second waits, although
> they will eventually complete, can nevertheless have a cumulative
> large effect if, say, the reason why we're not making progress is that
> we're being called as part of a socket reconnect attempt in
> xs_tcp_setup_socket().
>
> In that case, any attempts to call nfs_release_page() on pages that
> need to use that socket, will result in a 1 second wait, and no
> progress in satisfying the allocation attempt.
>
> Our conclusion was that we still need the PF_FSTRANS in order to deal
> with that case, where we need to actually circumvent the new wait in
> order to guarantee progress on the task of allocating and connecting
> the new socket.
>
> Comments?
This is the one weak point in the patch that had occurred to me.
What if shrink_page_list() gets a list of pages all in the same NFS file? It
will then spend one second on each of those pages...
It will typically only do 32 pages at a time (I think), but that could still
be rather long.
When I was testing with only one large NFS file, and lots of dirty anon pages
to create the required pressure, I didn't see any evidence of extensive
delays, though it is possible that I didn't look in the right place.
My general feeling is that these deadlocks are very rare and an occasional
one- or two-second pause is a small price to pay - a price you would be
unlikely to even notice.
However ... something else occurs to me. We could use the bdi congestion
markers to guide the timeout.
When the wait for PG_private times out, or when a connection re-establishment
is required (and maybe other similar times) we could set_bdi_congested().
Then in nfs_release_page() we could completely avoid the wait if
bdi_write_congested().
The congestion setting should encourage vmscan away from the filesystem so it
won't keep calling nfs_release_page(), which is a bonus.
Setting bdi_congestion from the RPC layer might be awkward from a layering
perspective, but probably isn't necessary.
Would the following allay your concerns? The change to
nfs_inode_remove_request ensures that any congestion is removed when a
'commit' completes.
We certainly could keep the PF_FSTRANS setting in the SUNRPC layer - that was
why it was a separate patch. It would be nice to find a uniform solution
though.
Thanks,
NeilBrown
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5949ca37cd18..bc674ad250ce 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
* benefit that someone else can worry about the freezer.
*/
if (mapping) {
+ struct nfs_server *nfss = NFS_SERVER(mapping->host);
nfs_commit_inode(mapping->host, 0);
- if ((gfp & __GFP_WAIT))
+ if ((gfp & __GFP_WAIT) &&
+ !bdi_write_congested(&nfss->backing_dev_info))
wait_on_page_bit_killable_timeout(page, PG_private,
HZ);
+ if (PagePrivate(page))
+ set_bdi_congested(&nfss->backing_dev_info,
+ BLK_RW_ASYNC);
}
/* If PagePrivate() is set, then the page is not freeable */
if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 700e7a865e6d..3ab122e92c9d 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -726,6 +726,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
struct inode *inode = req->wb_context->dentry->d_inode;
struct nfs_inode *nfsi = NFS_I(inode);
struct nfs_page *head;
+ struct nfs_server *nfss = NFS_SERVER(inode);
if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
head = req->wb_head;
@@ -742,6 +743,9 @@ static void nfs_inode_remove_request(struct nfs_page *req)
spin_unlock(&inode->i_lock);
}
+ if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+ clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags))
nfs_release_request(req);
else
Hi Neil,
On Tue, Sep 16, 2014 at 1:31 AM, NeilBrown <[email protected]> wrote:
> Now that nfs_release_page() doesn't block indefinitely, other deadlock
> avoidance mechanisms aren't needed.
> - it doesn't hurt for kswapd to block occasionally. If it doesn't
> want to block it would clear __GFP_WAIT. The current_is_kswapd()
> was only added to avoid deadlocks and we have a new approach for
> that.
> - memory allocation in the SUNRPC layer can very rarely try to
> ->releasepage() a page it is trying to handle. The deadlock
> is removed as nfs_release_page() doesn't block indefinitely.
>
> So we don't need to set PF_FSTRANS for sunrpc network operations any
> more.
Jeff Layton and I had a little discussion about this earlier today.
The issue that Jeff raised was that these 1 second waits, although
they will eventually complete, can nevertheless have a cumulative
large effect if, say, the reason why we're not making progress is that
we're being called as part of a socket reconnect attempt in
xs_tcp_setup_socket().
In that case, any attempts to call nfs_release_page() on pages that
need to use that socket, will result in a 1 second wait, and no
progress in satisfying the allocation attempt.
Our conclusion was that we still need the PF_FSTRANS in order to deal
with that case, where we need to actually circumvent the new wait in
order to guarantee progress on the task of allocating and connecting
the new socket.
Comments?
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
- it doesn't hurt for kswapd to block occasionally. If it doesn't
want to block it would clear __GFP_WAIT. The current_is_kswapd()
was only added to avoid deadlocks and we have a new approach for
that.
- memory allocation in the SUNRPC layer can very rarely try to
->releasepage() a page it is trying to handle. The deadlock
is removed as nfs_release_page() doesn't block indefinitely.
So we don't need to set PF_FSTRANS for sunrpc network operations any
more.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 16 +++++++---------
net/sunrpc/sched.c | 2 --
net/sunrpc/xprtrdma/transport.c | 2 --
net/sunrpc/xprtsock.c | 10 ----------
4 files changed, 7 insertions(+), 23 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8d74983417af..5949ca37cd18 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -469,18 +469,16 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
/* Always try to initiate a 'commit' if relevant, but only
- * wait for it if __GFP_WAIT is set and the calling process is
- * allowed to block. Even then, only wait 1 second. Waiting
- * indefinitely can cause deadlocks when the NFS server is on
- * this machine, and there is no particular need to wait
- * extensively here. A short wait has the benefit that
- * someone else can worry about the freezer.
+ * wait for it if __GFP_WAIT is set. Even then, only wait 1
+ * second. Waiting indefinitely can cause deadlocks when the
+ * NFS server is on this machine, when a new TCP connection is
+ * needed and in other rare cases. There is no particular
+ * need to wait extensively here. A short wait has the
+ * benefit that someone else can worry about the freezer.
*/
if (mapping) {
nfs_commit_inode(mapping->host, 0);
- if ((gfp & __GFP_WAIT) &&
- !current_is_kswapd() &&
- !(current->flags & PF_FSTRANS))
+ if ((gfp & __GFP_WAIT))
wait_on_page_bit_killable_timeout(page, PG_private,
HZ);
}
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79fd589..fe3441abdbe5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -821,9 +821,7 @@ void rpc_execute(struct rpc_task *task)
static void rpc_async_schedule(struct work_struct *work)
{
- current->flags |= PF_FSTRANS;
__rpc_execute(container_of(work, struct rpc_task, u.tk_work));
- current->flags &= ~PF_FSTRANS;
}
/**
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac4940563..6a4615dd0261 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -205,7 +205,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
struct rpc_xprt *xprt = &r_xprt->xprt;
int rc = 0;
- current->flags |= PF_FSTRANS;
xprt_clear_connected(xprt);
dprintk("RPC: %s: %sconnect\n", __func__,
@@ -216,7 +215,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
dprintk("RPC: %s: exit\n", __func__);
xprt_clear_connecting(xprt);
- current->flags &= ~PF_FSTRANS;
}
/*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89eacfab..4707c0c8568b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1927,8 +1927,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
struct socket *sock;
int status = -EIO;
- current->flags |= PF_FSTRANS;
-
clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
status = __sock_create(xprt->xprt_net, AF_LOCAL,
SOCK_STREAM, 0, &sock, 1);
@@ -1968,7 +1966,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
return status;
}
@@ -2071,8 +2068,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
struct socket *sock = transport->sock;
int status = -EIO;
- current->flags |= PF_FSTRANS;
-
/* Start by resetting any existing state */
xs_reset_transport(transport);
sock = xs_create_sock(xprt, transport,
@@ -2092,7 +2087,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
}
/*
@@ -2229,8 +2223,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
struct rpc_xprt *xprt = &transport->xprt;
int status = -EIO;
- current->flags |= PF_FSTRANS;
-
if (!sock) {
clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
sock = xs_create_sock(xprt, transport,
@@ -2276,7 +2268,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
case -EINPROGRESS:
case -EALREADY:
xprt_clear_connecting(xprt);
- current->flags &= ~PF_FSTRANS;
return;
case -EINVAL:
/* Happens, for instance, if the user specified a link
@@ -2294,7 +2285,6 @@ out_eagain:
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
}
/**
On Tue, 23 Sep 2014 12:10:52 +1000 NeilBrown <[email protected]> wrote:
> Now I just need an Ack from akpm for the mm bits (please...)
Ack
On Tue, 16 Sep 2014 15:31:34 +1000
NeilBrown <[email protected]> wrote:
> Because nfs_release_page() submits a 'COMMIT' nfs request and waits
> for it to complete, and does this during memory reclaim, it is
> susceptible to deadlocks if memory allocation happens anywhere in
> sending the COMMIT message. If the NFS server is on the same host
> (i.e. loop-back NFS), then any memory allocations in the NFS server
> can also cause deadlocks.
>
> nfs_release_page() already has some code to avoid deadlocks in some
> circumstances, but these are not sufficient for loopback NFS.
>
> This patch set changes the approach to deadlock avoidance. Rather
> than detecting cases that could deadlock and avoiding the COMMIT, it
> always tries the COMMIT, but only waits a short time (1 second).
> This avoid any deadlock possibility at the expense of not waiting
> longer than 1 second even if no deadlock is pending.
>
> nfs_release_page() does not *need* to wait longer - all callers that
> matter handle a failure gracefully - they move on to other pages.
>
> This set:
> - adds some "_timeout()" functions to "wait_on_bit". Only a
> wait_on_page version is actually used.
> - exports page wake_up support. NFS knows that the COMMIT is complete
> when PG_private is clear. So nfs_release_page will use
> wait_on_page_bit_killable_timeout to wait for the bit to clear,
> and needs access to wake_up_page()
> - changes nfs_release_page() to use
> wait_on_page_bit_killable_timeout()
> - removes the other deadlock avoidance mechanisms from
> nfs_release_page, so that PF_FSTRANS is again only used
> by XFS.
>
> As such, it needs buy-in from sched people, mm people, and NFS people.
> Assuming I get that buy-in, suggests for how these patches can flow
> into mainline would be appreciated ... I daren't hope they can all go
> in through one tree....
>
> Thanks,
> NeilBrown
>
>
> ---
>
> NeilBrown (4):
> SCHED: add some "wait..on_bit...timeout()" interfaces.
> MM: export page_wakeup functions
> NFS: avoid deadlocks with loop-back mounted NFS filesystems.
> NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
>
>
> fs/nfs/file.c | 22 ++++++++++++----------
> fs/nfs/write.c | 2 ++
> include/linux/pagemap.h | 12 ++++++++++--
> include/linux/wait.h | 5 ++++-
> kernel/sched/wait.c | 36 ++++++++++++++++++++++++++++++++++++
> mm/filemap.c | 21 +++++++++++++++------
> net/sunrpc/sched.c | 2 --
> net/sunrpc/xprtrdma/transport.c | 2 --
> net/sunrpc/xprtsock.c | 10 ----------
> 9 files changed, 79 insertions(+), 33 deletions(-)
>
On balance, I like the NFS parts of this set -- particular the fact
that we get rid of the PF_FSTRANS abortion, and simplify the code quite
a bit. My only real concern is that you could end up stalling in this
code in situations where you really can't release the page.
For instance, suppose you're trying to reconnect the socket to the
server (a'la xs_tcp_setup_socket). The VM is low on memory and tries to
release a page that needs that socket in order to issue a COMMIT. That
situation is going to end up with the page unable to be released, but
you'll still wait 1s before returning. If the VM tries to release a
bunch of pages like this, then those waits could add up.
Also, we call things like invalidate_complete_page2 from the cache
invalidation code. Will we end up with potential problems now that we
have a stronger possibility that a page might not be freeable when it
calls releasepage? (no idea on this -- I'm just spitballing)
--
Jeff Layton <[email protected]>
On Tue, Sep 16, 2014 at 9:10 PM, NeilBrown <[email protected]> wrote:
>
> However ... something else occurs to me. We could use the bdi congestion
> markers to guide the timeout.
> When the wait for PG_private times out, or when a connection re-establishment
> is required (and maybe other similar times) we could set_bdi_congested().
> Then in nfs_release_page() we could completely avoid the wait if
> bdi_write_congested().
>
> The congestion setting should encourage vmscan away from the filesystem so it
> won't keep calling nfs_release_page() which is a bonus.
>
> Setting bdi_congestion from the RPC layer might be awkward from a layering
> perspective, but probably isn't necessary.
>
> Would the following allay your concerns? The change to
> nfs_inode_remove_request ensures that any congestion is removed when a
> 'commit' completes.
>
> We certainly could keep the PF_FSTRANS setting in the SUNRPC layer - that was
> why it was a separate patch. It would be nice to find a uniform solution
> though.
>
> Thanks,
> NeilBrown
>
>
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 5949ca37cd18..bc674ad250ce 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> * benefit that someone else can worry about the freezer.
> */
> if (mapping) {
> + struct nfs_server *nfss = NFS_SERVER(mapping->host);
> nfs_commit_inode(mapping->host, 0);
> - if ((gfp & __GFP_WAIT))
> + if ((gfp & __GFP_WAIT) &&
> + !bdi_write_congested(&nfss->backing_dev_info))
> wait_on_page_bit_killable_timeout(page, PG_private,
> HZ);
> + if (PagePrivate(page))
> + set_bdi_congested(&nfss->backing_dev_info,
> + BLK_RW_ASYNC);
> }
> /* If PagePrivate() is set, then the page is not freeable */
> if (PagePrivate(page))
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 700e7a865e6d..3ab122e92c9d 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -726,6 +726,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> struct inode *inode = req->wb_context->dentry->d_inode;
> struct nfs_inode *nfsi = NFS_I(inode);
> struct nfs_page *head;
> + struct nfs_server *nfss = NFS_SERVER(inode);
>
> if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> head = req->wb_head;
> @@ -742,6 +743,9 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> spin_unlock(&inode->i_lock);
> }
>
> + if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> + clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
Hmm.... We already have this equivalent functionality in
nfs_end_page_writeback(), so adding it to nfs_inode_remove_request()
is just causing duplication as far as the stable writeback path is
concerned. How about adding it to nfs_commit_release_pages() instead?
Otherwise, yes, the above does indeed look at if it has merit. Have
you got a good test?
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Tue, 16 Sep 2014 07:47:41 -0400 Jeff Layton <[email protected]>
wrote:
> Also, we call things like invalidate_complete_page2 from the cache
> invalidation code. Will we end up with potential problems now that we
> have a stronger possibility that a page might not be freeable when it
> calls releasepage? (no idea on this -- I'm just spitballing)
>
Answering just this part here:
invalidate_complete_page2() is only called immediately after a call to
do_launder_page().
For nfs, that means nfs_launder_page() was called, which calls nfs_wb_page()
which in turn calls
ret = nfs_commit_inode(inode, FLUSH_SYNC);
so the inode is fully committed when invalidate_complete_page2 is called, so
nfs_release_page will succeed.
So there shouldn't be a problem there.
Thanks,
NeilBrown
On Tue, 16 Sep 2014 21:32:43 -0400 Trond Myklebust
<[email protected]> wrote:
> On Tue, Sep 16, 2014 at 9:10 PM, NeilBrown <[email protected]> wrote:
> >
> > However ... something else occurs to me. We could use the bdi congestion
> > markers to guide the timeout.
> > When the wait for PG_private times out, or when a connection re-establishment
> > is required (and maybe other similar times) we could set_bdi_congested().
> > Then in nfs_release_page() we could completely avoid the wait if
> > bdi_write_congested().
> >
> > The congestion setting should encourage vmscan away from the filesystem so it
> > won't keep calling nfs_release_page() which is a bonus.
> >
> > Setting bdi_congestion from the RPC layer might be awkward from a layering
> > perspective, but probably isn't necessary.
> >
> > Would the following allay your concerns? The change to
> > nfs_inode_remove_request ensures that any congestion is removed when a
> > 'commit' completes.
> >
> > We certainly could keep the PF_FSTRANS setting in the SUNRPC layer - that was
> > why it was a separate patch. It would be nice to find a uniform solution
> > though.
> >
> > Thanks,
> > NeilBrown
> >
> >
> >
> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > index 5949ca37cd18..bc674ad250ce 100644
> > --- a/fs/nfs/file.c
> > +++ b/fs/nfs/file.c
> > @@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> > * benefit that someone else can worry about the freezer.
> > */
> > if (mapping) {
> > + struct nfs_server *nfss = NFS_SERVER(mapping->host);
> > nfs_commit_inode(mapping->host, 0);
> > - if ((gfp & __GFP_WAIT))
> > + if ((gfp & __GFP_WAIT) &&
> > + !bdi_write_congested(&nfss->backing_dev_info))
> > wait_on_page_bit_killable_timeout(page, PG_private,
> > HZ);
> > + if (PagePrivate(page))
> > + set_bdi_congested(&nfss->backing_dev_info,
> > + BLK_RW_ASYNC);
> > }
> > /* If PagePrivate() is set, then the page is not freeable */
> > if (PagePrivate(page))
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 700e7a865e6d..3ab122e92c9d 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -726,6 +726,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> > struct inode *inode = req->wb_context->dentry->d_inode;
> > struct nfs_inode *nfsi = NFS_I(inode);
> > struct nfs_page *head;
> > + struct nfs_server *nfss = NFS_SERVER(inode);
> >
> > if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > head = req->wb_head;
> > @@ -742,6 +743,9 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> > spin_unlock(&inode->i_lock);
> > }
> >
> > + if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> > + clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
>
> Hmm.... We already have this equivalent functionality in
> nfs_end_page_writeback(), so adding it to nfs_inode_remove_request()
> is just causing duplication as far as the stable writeback path is
> concerned. How about adding it to nfs_commit_release_pages() instead?
>
> Otherwise, yes, the above does indeed look at if it has merit. Have
> you got a good test?
>
Altered patch below. I'll post a proper one after some testing.
For testing I create a memory pressure load with:
=============================
#!/bin/bash
umount /mnt/ramdisk
umount /mnt/ramdisk
mount -t tmpfs -o size=4G none /mnt/ramdisk
#swapoff -a
i=0
while [ $i -le 10000 ]; do
i=$(($i+1))
dd if=/dev/zero of=/mnt/ramdisk/testdata.dd bs=1M count=6500
date
done
==============================
Where the '4G' matches memory size, and then write out to an NFS file with
=================================
#!/bin/bash
umount /mnt2 /mnt3
umount /mnt2 /mnt3
mount /dev/sdd /mnt2
exportfs -avu
exportfs -av
mount $* 127.0.0.1:/mnt2 /mnt3
for j in {1..100}; do
i=1
while [ $i -le 10000 ]; do
echo "Step $i"
date +%H:%M:%S
i=$(($i+1))
zcat /boot/vmlinux-3.13.3-1-desktop.gz | uuencode -
date +%H:%M:%S
done | dd of=/mnt3/testdat.file bs=1M
done
====================================
Pick your own way to create a large file of random data ... though
probably /dev/zero would do.
With both those going for a few hours the current kernel will deadlock.
With my patches it doesn't.
I'll see if I can come up with some way to measure maximum delay in
try_to_free_pages() and see how the 'congestion' change affects that.
Thanks,
NeilBrown
----------------------
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5949ca37cd18..bc674ad250ce 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
* benefit that someone else can worry about the freezer.
*/
if (mapping) {
+ struct nfs_server *nfss = NFS_SERVER(mapping->host);
nfs_commit_inode(mapping->host, 0);
- if ((gfp & __GFP_WAIT))
+ if ((gfp & __GFP_WAIT) &&
+ !bdi_write_congested(&nfss->backing_dev_info))
wait_on_page_bit_killable_timeout(page, PG_private,
HZ);
+ if (PagePrivate(page))
+ set_bdi_congested(&nfss->backing_dev_info,
+ BLK_RW_ASYNC);
}
/* If PagePrivate() is set, then the page is not freeable */
if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 700e7a865e6d..8d4aae9d977a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1641,6 +1641,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
struct nfs_page *req;
int status = data->task.tk_status;
struct nfs_commit_info cinfo;
+ struct nfs_server *nfss;
while (!list_empty(&data->pages)) {
req = nfs_list_entry(data->pages.next);
@@ -1674,6 +1675,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
next:
nfs_unlock_and_release_request(req);
}
+ nfss = NFS_SERVER(data->inode);
+ if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+ clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
nfs_init_cinfo(&cinfo, data->inode, data->dreq);
if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
nfs_commit_clear_lock(NFS_I(data->inode));
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.
If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service. If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.
nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem, it can send
and wait for a COMMIT through nfs_release_page().
This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second. If the commit doesn't complete
in this time, nfs_release_page() will fail. This means it might now
fail in some cases where it wouldn't before. These cases are only
when 'gfp' includes '__GFP_WAIT'.
nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with the required 'gfp' flags from
- page_cache_pipe_buf_steal() in splice.c
- shrink_page_list() in vmscan.c
- invalidate_inode_pages2_range() in truncate.c
The first two handle failure quite safely. The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.
So aborting if the commit takes longer than 1 second is perfectly safe.
One second may be longer than is really necessary, but it is much
shorter than the current maximum wait, so this is not a regression.
Some waiting is needed to help slow memory allocation down to the
rate at which we can complete writeout of pages.
In those rare cases where it is nfsd, or something that nfsd is
waiting for, that is calling nfs_release_page(), this delay will at
most cause a small hiccup in places where it currently deadlocks.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 24 ++++++++++++++----------
fs/nfs/write.c | 2 ++
2 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 524dd80d1898..8d74983417af 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -468,17 +468,21 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
- /* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
- * doing this memory reclaim for a fs-related allocation.
+ /* Always try to initiate a 'commit' if relevant, but only
+ * wait for it if __GFP_WAIT is set and the calling process is
+ * allowed to block. Even then, only wait 1 second. Waiting
+ * indefinitely can cause deadlocks when the NFS server is on
+ * this machine, and there is no particular need to wait
+ * extensively here. A short wait has the benefit that
+ * someone else can worry about the freezer.
*/
- if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
- !(current->flags & PF_FSTRANS)) {
- int how = FLUSH_SYNC;
-
- /* Don't let kswapd deadlock waiting for OOM RPC calls */
- if (current_is_kswapd())
- how = 0;
- nfs_commit_inode(mapping->host, how);
+ if (mapping) {
+ nfs_commit_inode(mapping->host, 0);
+ if ((gfp & __GFP_WAIT) &&
+ !current_is_kswapd() &&
+ !(current->flags & PF_FSTRANS))
+ wait_on_page_bit_killable_timeout(page, PG_private,
+ HZ);
}
/* If PagePrivate() is set, then the page is not freeable */
if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 175d5d073ccf..b5d83c7545d4 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
if (likely(!PageSwapCache(head->wb_page))) {
set_page_private(head->wb_page, 0);
ClearPagePrivate(head->wb_page);
+ smp_mb__after_atomic();
+ wake_up_page(head->wb_page, PG_private);
clear_bit(PG_MAPPED, &head->wb_flags);
}
nfsi->npages--;
In commit c1221321b7c25b53204447cff9949a6d5a7ddddc
  "sched: Allow wait_on_bit_action() functions to support a timeout"
I suggested that a "wait_on_bit_timeout()" interface would not meet my
need. This isn't true - I was just over-engineering.
Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization. If some other
use is ever found, it can be generalized or added later.
So this patch renames "private" to "timeout", with the meaning "stop
waiting when 'jiffies' reaches or passes 'timeout'",
and adds two of the many possible wait..bit..timeout() interfaces:
wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example. Others can be added as needed.
Signed-off-by: NeilBrown <[email protected]>
---
include/linux/pagemap.h | 2 ++
include/linux/wait.h | 5 ++++-
kernel/sched/wait.c | 36 ++++++++++++++++++++++++++++++++++++
mm/filemap.c | 13 +++++++++++++
4 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3df8c7db7a4e..87f9e4230d3a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -502,6 +502,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
extern void wait_on_page_bit(struct page *page, int bit_nr);
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern int wait_on_page_bit_killable_timeout(struct page *page,
+ int bit_nr, unsigned long timeout);
static inline int wait_on_page_locked_killable(struct page *page)
{
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 6fb1ba5f9b2f..80115bf88671 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -25,7 +25,7 @@ struct wait_bit_key {
void *flags;
int bit_nr;
#define WAIT_ATOMIC_T_BIT_NR -1
- unsigned long private;
+ unsigned long timeout;
};
struct wait_bit_queue {
@@ -154,6 +154,7 @@ int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_ac
void wake_up_bit(void *, int);
void wake_up_atomic_t(atomic_t *);
int out_of_line_wait_on_bit(void *, int, wait_bit_action_f *, unsigned);
+int out_of_line_wait_on_bit_timeout(void *, int, wait_bit_action_f *, unsigned, unsigned long);
int out_of_line_wait_on_bit_lock(void *, int, wait_bit_action_f *, unsigned);
int out_of_line_wait_on_atomic_t(atomic_t *, int (*)(atomic_t *), unsigned);
wait_queue_head_t *bit_waitqueue(void *, int);
@@ -859,6 +860,8 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
extern int bit_wait(struct wait_bit_key *);
extern int bit_wait_io(struct wait_bit_key *);
+extern int bit_wait_timeout(struct wait_bit_key *);
+extern int bit_wait_io_timeout(struct wait_bit_key *);
/**
* wait_on_bit - wait for a bit to be cleared
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 15cab1a4f84e..380678b3cba4 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -343,6 +343,18 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
}
EXPORT_SYMBOL(out_of_line_wait_on_bit);
+int __sched out_of_line_wait_on_bit_timeout(
+ void *word, int bit, wait_bit_action_f *action,
+ unsigned mode, unsigned long timeout)
+{
+ wait_queue_head_t *wq = bit_waitqueue(word, bit);
+ DEFINE_WAIT_BIT(wait, word, bit);
+
+ wait.key.timeout = jiffies + timeout;
+ return __wait_on_bit(wq, &wait, action, mode);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_bit_timeout);
+
int __sched
__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
wait_bit_action_f *action, unsigned mode)
@@ -520,3 +532,27 @@ __sched int bit_wait_io(struct wait_bit_key *word)
return 0;
}
EXPORT_SYMBOL(bit_wait_io);
+
+__sched int bit_wait_timeout(struct wait_bit_key *word)
+{
+ unsigned long now = ACCESS_ONCE(jiffies);
+ if (signal_pending_state(current->state, current))
+ return 1;
+ if (time_after_eq(now, word->timeout))
+ return -EAGAIN;
+ schedule_timeout(word->timeout - now);
+ return 0;
+}
+EXPORT_SYMBOL(bit_wait_timeout);
+
+__sched int bit_wait_io_timeout(struct wait_bit_key *word)
+{
+ unsigned long now = ACCESS_ONCE(jiffies);
+ if (signal_pending_state(current->state, current))
+ return 1;
+ if (time_after_eq(now, word->timeout))
+ return -EAGAIN;
+ io_schedule_timeout(word->timeout - now);
+ return 0;
+}
+EXPORT_SYMBOL(bit_wait_io_timeout);
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcdf948d..4a19c084bdb1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -703,6 +703,19 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
bit_wait_io, TASK_KILLABLE);
}
+int wait_on_page_bit_killable_timeout(struct page *page,
+ int bit_nr, unsigned long timeout)
+{
+ DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+
+ wait.key.timeout = jiffies + timeout;
+ if (!test_bit(bit_nr, &page->flags))
+ return 0;
+ return __wait_on_bit(page_waitqueue(page), &wait,
+ bit_wait_io_timeout, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(wait_on_page_bit_killable_timeout);
+
/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
* @page: Page defining the wait queue of interest
On Wed, 17 Sep 2014 09:41:13 +1000
NeilBrown <[email protected]> wrote:
> On Tue, 16 Sep 2014 07:47:41 -0400 Jeff Layton <[email protected]>
> wrote:
>
>
> > Also, we call things like invalidate_complete_page2 from the cache
> > invalidation code. Will we end up with potential problems now that we
> > have a stronger possibility that a page might not be freeable when it
> > calls releasepage? (no idea on this -- I'm just spitballing)
> >
>
> Answering just this part here:
> invalidate_complete_page2() is only called immediately after a call to
> do_launder_page().
> For nfs, that means nfs_launder_page() was called, which calls nfs_wb_page()
> which in turn calls
> ret = nfs_commit_inode(inode, FLUSH_SYNC);
>
> so the inode is fully committed when invalidate_complete_page2 is called, so
> nfs_release_page will succeed.
>
> So there shouldn't be a problem there.
>
Yep, Trond pointed that out today when we were discussing it. Thanks
for confirming it here though...
--
Jeff Layton <[email protected]>