2014-09-18 06:04:21

by NeilBrown

Subject: [PATCH 0/4] Remove possible deadlocks in nfs_release_page() - V2

These two patches are updated versions of the last two patches of this
series. They include the use of congestion to avoid excessive
waiting.

(I'm not resending 1/4 and 2/4, they are unchanged).

Without the congestion check, I've seen wait times in
try_to_free_pages as long as 208 seconds.
With no waiting at all in nfs_release_page() I've seen wait times as long
as 1.4 seconds.
With the 1 second wait, I've seen 2 seconds.
These numbers will vary based on numerous factors, but it does seem
to suggest that 1 second is a good ball-park number.

NeilBrown

---

NeilBrown (2):
NFS: avoid deadlocks with loop-back mounted NFS filesystems.
NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()


fs/nfs/file.c | 28 ++++++++++++++++++----------
fs/nfs/write.c | 7 +++++++
net/sunrpc/sched.c | 2 --
net/sunrpc/xprtrdma/transport.c | 2 --
net/sunrpc/xprtsock.c | 10 ----------
5 files changed, 25 insertions(+), 24 deletions(-)




2014-09-18 12:01:11

by Jeff Layton

Subject: Re: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.

On Thu, 18 Sep 2014 16:03:17 +1000
NeilBrown <[email protected]> wrote:

> Support for loop-back mounted NFS filesystems is useful when NFS is
> used to access shared storage in a high-availability cluster.
>
> If the node running the NFS server fails, some other node can mount the
> filesystem and start providing NFS service. If that node already had
> the filesystem NFS mounted, it will now have it loop-back mounted.
>
> nfsd can suffer a deadlock when allocating memory and entering direct
> reclaim.
> While direct reclaim does not write to the NFS filesystem it can send
> and wait for a COMMIT through nfs_release_page().
>
> This patch modifies nfs_release_page() to wait a limited time for the
> commit to complete - one second. If the commit doesn't complete
> in this time, nfs_release_page() will fail. This means it might now
> fail in some cases where it wouldn't before. These cases are only
> when 'gfp' includes '__GFP_WAIT'.
>
> nfs_release_page() is only called by try_to_release_page(), and that
> can only be called on an NFS page with required 'gfp' flags from
> - page_cache_pipe_buf_steal() in splice.c
> - shrink_page_list() in vmscan.c
> - invalidate_inode_pages2_range() in truncate.c
>
> The first two handle failure quite safely. The last is only called
> after ->launder_page() has been called, and that will have waited
> for the commit to finish already.
>
> So aborting if the commit takes longer than 1 second is perfectly safe.
>
> If nfs_release_page() is called on a sequence of pages which are all
> in the same file which is blocked on COMMIT, each page could
> contribute a 1 second delay which could become excessive.  I have
> seen delays of as much as 208 seconds.
>
> To keep the delay to one second, the bdi is marked as write-congested
> if the commit didn't finish. Once it does finish, the
> write-congested flag will be cleared.
>
> With this, the longest total delay in try_to_free_pages that I have
> seen is under 3 seconds. With no waiting in nfs_release_page at all
> I have seen delays of nearly 1.5 seconds.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfs/file.c | 30 ++++++++++++++++++++----------
> fs/nfs/write.c | 7 +++++++
> 2 files changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 524dd80d1898..febba950d8a6 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>
> dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>
> - /* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> - * doing this memory reclaim for a fs-related allocation.
> + /* Always try to initiate a 'commit' if relevant, but only
> + * wait for it if __GFP_WAIT is set and the calling process is
> + * allowed to block. Even then, only wait 1 second and only
> + * if the 'bdi' is not congested.
> + * Waiting indefinitely can cause deadlocks when the NFS
> + * server is on this machine, and there is no particular need
> + * to wait extensively here. A short wait has the benefit
> + * that someone else can worry about the freezer.
> */
> - if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> - !(current->flags & PF_FSTRANS)) {
> - int how = FLUSH_SYNC;
> -
> - /* Don't let kswapd deadlock waiting for OOM RPC calls */
> - if (current_is_kswapd())
> - how = 0;
> - nfs_commit_inode(mapping->host, how);
> + if (mapping) {
> + struct nfs_server *nfss = NFS_SERVER(mapping->host);
> + nfs_commit_inode(mapping->host, 0);
> + if ((gfp & __GFP_WAIT) &&
> + !current_is_kswapd() &&
> + !(current->flags & PF_FSTRANS) &&
> + !bdi_write_congested(&nfss->backing_dev_info))
> + wait_on_page_bit_killable_timeout(page, PG_private,
> + HZ);
> + if (PagePrivate(page))
> + set_bdi_congested(&nfss->backing_dev_info,
> + BLK_RW_ASYNC);

I've never had a great feel for the BDI congestion stuff, but won't
this have some unintended effects?

For instance, suppose the VM decides to try to free this page and
passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the
COMMIT, but don't wait for it. The COMMIT is actually going to go
reasonably fast, but we now set the BDI congested because we didn't
wait for it to occur.

That in turn causes writeout for other inodes on this BDI to get
throttled even though there really is no congestion. It just looks that
way due to how releasepage got called.

Am I making mountains out of molehills here?

> }
> /* If PagePrivate() is set, then the page is not freeable */
> if (PagePrivate(page))
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 175d5d073ccf..3066c7fcb565 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> if (likely(!PageSwapCache(head->wb_page))) {
> set_page_private(head->wb_page, 0);
> ClearPagePrivate(head->wb_page);
> + smp_mb__after_atomic();
> + wake_up_page(head->wb_page, PG_private);
> clear_bit(PG_MAPPED, &head->wb_flags);
> }
> nfsi->npages--;
> @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> struct nfs_page *req;
> int status = data->task.tk_status;
> struct nfs_commit_info cinfo;
> + struct nfs_server *nfss;
>
> while (!list_empty(&data->pages)) {
> req = nfs_list_entry(data->pages.next);
> @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> next:
> nfs_unlock_and_release_request(req);
> }
> + nfss = NFS_SERVER(data->inode);
> + if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> + clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> +
> nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
> nfs_commit_clear_lock(NFS_I(data->inode));
>
>


--
Jeff Layton <[email protected]>

2014-09-18 06:04:23

by NeilBrown

Subject: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.

Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.

If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service. If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.

nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().

This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second. If the commit doesn't complete
in this time, nfs_release_page() will fail. This means it might now
fail in some cases where it wouldn't before. These cases are only
when 'gfp' includes '__GFP_WAIT'.

nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
- page_cache_pipe_buf_steal() in splice.c
- shrink_page_list() in vmscan.c
- invalidate_inode_pages2_range() in truncate.c

The first two handle failure quite safely. The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.

So aborting if the commit takes longer than 1 second is perfectly safe.

If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could become excessive.  I have
seen delays of as much as 208 seconds.

To keep the delay to one second, the bdi is marked as write-congested
if the commit didn't finish. Once it does finish, the
write-congested flag will be cleared.

With this, the longest total delay in try_to_free_pages that I have
seen is under 3 seconds. With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 30 ++++++++++++++++++++----------
fs/nfs/write.c | 7 +++++++
2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 524dd80d1898..febba950d8a6 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)

dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);

- /* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
- * doing this memory reclaim for a fs-related allocation.
+ /* Always try to initiate a 'commit' if relevant, but only
+ * wait for it if __GFP_WAIT is set and the calling process is
+ * allowed to block. Even then, only wait 1 second and only
+ * if the 'bdi' is not congested.
+ * Waiting indefinitely can cause deadlocks when the NFS
+ * server is on this machine, and there is no particular need
+ * to wait extensively here. A short wait has the benefit
+ * that someone else can worry about the freezer.
*/
- if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
- !(current->flags & PF_FSTRANS)) {
- int how = FLUSH_SYNC;
-
- /* Don't let kswapd deadlock waiting for OOM RPC calls */
- if (current_is_kswapd())
- how = 0;
- nfs_commit_inode(mapping->host, how);
+ if (mapping) {
+ struct nfs_server *nfss = NFS_SERVER(mapping->host);
+ nfs_commit_inode(mapping->host, 0);
+ if ((gfp & __GFP_WAIT) &&
+ !current_is_kswapd() &&
+ !(current->flags & PF_FSTRANS) &&
+ !bdi_write_congested(&nfss->backing_dev_info))
+ wait_on_page_bit_killable_timeout(page, PG_private,
+ HZ);
+ if (PagePrivate(page))
+ set_bdi_congested(&nfss->backing_dev_info,
+ BLK_RW_ASYNC);
}
/* If PagePrivate() is set, then the page is not freeable */
if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 175d5d073ccf..3066c7fcb565 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
if (likely(!PageSwapCache(head->wb_page))) {
set_page_private(head->wb_page, 0);
ClearPagePrivate(head->wb_page);
+ smp_mb__after_atomic();
+ wake_up_page(head->wb_page, PG_private);
clear_bit(PG_MAPPED, &head->wb_flags);
}
nfsi->npages--;
@@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
struct nfs_page *req;
int status = data->task.tk_status;
struct nfs_commit_info cinfo;
+ struct nfs_server *nfss;

while (!list_empty(&data->pages)) {
req = nfs_list_entry(data->pages.next);
@@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
next:
nfs_unlock_and_release_request(req);
}
+ nfss = NFS_SERVER(data->inode);
+ if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+ clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
nfs_init_cinfo(&cinfo, data->inode, data->dreq);
if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
nfs_commit_clear_lock(NFS_I(data->inode));
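
The effect of the write-congested flag described in the commit message
above can be illustrated with a small user-space toy model. This is only
a sketch, not kernel code: every name in it (toy_bdi, release_blocked_page,
total_wait_secs, WAIT_LIMIT_SECS) is made up for the illustration, and it
assumes that every page in the run sits behind the same COMMIT, which
never completes while the run is processed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model, NOT kernel code; all names are hypothetical. */
#define WAIT_LIMIT_SECS 1	/* stands in for the HZ timeout in the patch */

struct toy_bdi {
	bool write_congested;	/* stands in for the bdi write-congested flag */
};

/*
 * Model one release attempt on a page whose COMMIT has not finished.
 * Returns the number of seconds the caller spent waiting.
 */
int release_blocked_page(struct toy_bdi *bdi, bool use_congestion)
{
	/* The bdi is already known to be stuck: skip the wait entirely. */
	if (use_congestion && bdi->write_congested)
		return 0;

	/* Wait for the full timeout; in this model the COMMIT never finishes. */
	if (use_congestion)
		bdi->write_congested = true;	/* later pages will not wait again */
	return WAIT_LIMIT_SECS;
}

/* Total wait for a run of pages that are all blocked on the same COMMIT. */
int total_wait_secs(int blocked_pages, bool use_congestion)
{
	struct toy_bdi bdi = { .write_congested = false };
	int total = 0;

	assert(blocked_pages >= 0);
	for (int i = 0; i < blocked_pages; i++)
		total += release_blocked_page(&bdi, use_congestion);
	return total;
}
```

In this model, total_wait_secs(5, false) adds up one timeout per page,
while total_wait_secs(5, true) pays for a single timeout and then skips
the remaining pages. The real delays quoted above also depend on when the
COMMIT completes and clears the flag, which the model leaves out.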



2014-09-18 06:04:33

by NeilBrown

Subject: [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()

Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
- it doesn't hurt for kswapd to block occasionally. If it doesn't
want to block it would clear __GFP_WAIT. The current_is_kswapd()
was only added to avoid deadlocks and we have a new approach for
that.
- memory allocation in the SUNRPC layer can very rarely try to
->releasepage() a page it is trying to handle. The deadlock
is removed as nfs_release_page() doesn't block indefinitely.

So we don't need to set PF_FSTRANS for sunrpc network operations any
more.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 14 ++++++--------
net/sunrpc/sched.c | 2 --
net/sunrpc/xprtrdma/transport.c | 2 --
net/sunrpc/xprtsock.c | 10 ----------
4 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index febba950d8a6..3c032b1f1b75 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -469,20 +469,18 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);

/* Always try to initiate a 'commit' if relevant, but only
- * wait for it if __GFP_WAIT is set and the calling process is
- * allowed to block. Even then, only wait 1 second and only
- * if the 'bdi' is not congested.
+ * wait for it if __GFP_WAIT is set. Even then, only wait 1
+ * second and only if the 'bdi' is not congested.
* Waiting indefinitely can cause deadlocks when the NFS
- * server is on this machine, and there is no particular need
- * to wait extensively here. A short wait has the benefit
- * that someone else can worry about the freezer.
+ * server is on this machine, when a new TCP connection is
+ * needed and in other rare cases. There is no particular
+ * need to wait extensively here. A short wait has the
+ * benefit that someone else can worry about the freezer.
*/
if (mapping) {
struct nfs_server *nfss = NFS_SERVER(mapping->host);
nfs_commit_inode(mapping->host, 0);
if ((gfp & __GFP_WAIT) &&
- !current_is_kswapd() &&
- !(current->flags & PF_FSTRANS) &&
!bdi_write_congested(&nfss->backing_dev_info))
wait_on_page_bit_killable_timeout(page, PG_private,
HZ);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79fd589..fe3441abdbe5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -821,9 +821,7 @@ void rpc_execute(struct rpc_task *task)

static void rpc_async_schedule(struct work_struct *work)
{
- current->flags |= PF_FSTRANS;
__rpc_execute(container_of(work, struct rpc_task, u.tk_work));
- current->flags &= ~PF_FSTRANS;
}

/**
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac4940563..6a4615dd0261 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -205,7 +205,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
struct rpc_xprt *xprt = &r_xprt->xprt;
int rc = 0;

- current->flags |= PF_FSTRANS;
xprt_clear_connected(xprt);

dprintk("RPC: %s: %sconnect\n", __func__,
@@ -216,7 +215,6 @@ xprt_rdma_connect_worker(struct work_struct *work)

dprintk("RPC: %s: exit\n", __func__);
xprt_clear_connecting(xprt);
- current->flags &= ~PF_FSTRANS;
}

/*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89eacfab..4707c0c8568b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1927,8 +1927,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
struct socket *sock;
int status = -EIO;

- current->flags |= PF_FSTRANS;
-
clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
status = __sock_create(xprt->xprt_net, AF_LOCAL,
SOCK_STREAM, 0, &sock, 1);
@@ -1968,7 +1966,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
return status;
}

@@ -2071,8 +2068,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
struct socket *sock = transport->sock;
int status = -EIO;

- current->flags |= PF_FSTRANS;
-
/* Start by resetting any existing state */
xs_reset_transport(transport);
sock = xs_create_sock(xprt, transport,
@@ -2092,7 +2087,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
}

/*
@@ -2229,8 +2223,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
struct rpc_xprt *xprt = &transport->xprt;
int status = -EIO;

- current->flags |= PF_FSTRANS;
-
if (!sock) {
clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
sock = xs_create_sock(xprt, transport,
@@ -2276,7 +2268,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
case -EINPROGRESS:
case -EALREADY:
xprt_clear_connecting(xprt);
- current->flags &= ~PF_FSTRANS;
return;
case -EINVAL:
/* Happens, for instance, if the user specified a link
@@ -2294,7 +2285,6 @@ out_eagain:
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
}

/**
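
The change to the wait condition between patch 3/4 and patch 4/4 can be
summarised with the following user-space sketch. It is an illustration
only: the struct, the function names and the SKETCH_GFP_WAIT bit value
are hypothetical stand-ins for current_is_kswapd(), PF_FSTRANS,
__GFP_WAIT and bdi_write_congested(), not the kernel definitions.

```c
#include <stdbool.h>

/* Toy user-space model, NOT kernel code; all names are hypothetical. */
#define SKETCH_GFP_WAIT 0x1u	/* arbitrary bit standing in for __GFP_WAIT */

struct sketch_caller {
	bool is_kswapd;	/* stands in for current_is_kswapd() */
	bool fstrans;	/* stands in for current->flags & PF_FSTRANS */
};

/* Wait condition as it stands after patch 3/4. */
bool sketch_should_wait_v3(unsigned int gfp, struct sketch_caller c,
			   bool bdi_congested)
{
	return (gfp & SKETCH_GFP_WAIT) && !c.is_kswapd && !c.fstrans &&
	       !bdi_congested;
}

/* Wait condition after patch 4/4: only the gfp mask and the bdi matter. */
bool sketch_should_wait_v4(unsigned int gfp, bool bdi_congested)
{
	return (gfp & SKETCH_GFP_WAIT) && !bdi_congested;
}
```

The only callers whose behaviour differs between the two versions are
kswapd and tasks with PF_FSTRANS set: with the waiting bit set and an
uncongested bdi they now take the bounded one-second wait instead of
skipping it.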



2014-09-16 12:39:45

by Anna Schumaker

Subject: Re: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.

On 09/16/2014 01:31 AM, NeilBrown wrote:
> Support for loop-back mounted NFS filesystems is useful when NFS is
> used to access shared storage in a high-availability cluster.
>
> If the node running the NFS server fails, some other node can mount the
> filesystem and start providing NFS service. If that node already had
> the filesystem NFS mounted, it will now have it loop-back mounted.
>
> nfsd can suffer a deadlock when allocating memory and entering direct
> reclaim.
> While direct reclaim does not write to the NFS filesystem it can send
> and wait for a COMMIT through nfs_release_page().

Is there anything that can be done on the nfsd side to prevent the deadlocks?

Anna

>
> This patch modifies nfs_release_page() to wait a limited time for the
> commit to complete - one second. If the commit doesn't complete
> in this time, nfs_release_page() will fail. This means it might now
> fail in some cases where it wouldn't before. These cases are only
> when 'gfp' includes '__GFP_WAIT'.
>
> nfs_release_page() is only called by try_to_release_page(), and that
> can only be called on an NFS page with required 'gfp' flags from
> - page_cache_pipe_buf_steal() in splice.c
> - shrink_page_list() in vmscan.c
> - invalidate_inode_pages2_range() in truncate.c
>
> The first two handle failure quite safely. The last is only called
> after ->launder_page() has been called, and that will have waited
> for the commit to finish already.
>
> So aborting if the commit takes longer than 1 second is perfectly safe.
>
> 1 second may be longer than is really necessary, but it is much
> shorter than the current maximum wait, so this is not a regression.
> Some waiting is needed to help slow down memory allocation to the
> rate that we can complete writeout of pages.
>
> In those rare cases where it is nfsd, or something that nfsd is
> waiting for, that is calling nfs_release_page(), this delay will at
> most cause a small hic-cough in places where it currently deadlocks.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfs/file.c | 24 ++++++++++++++----------
> fs/nfs/write.c | 2 ++
> 2 files changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 524dd80d1898..8d74983417af 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -468,17 +468,21 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>
> dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>
> - /* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> - * doing this memory reclaim for a fs-related allocation.
> + /* Always try to initiate a 'commit' if relevant, but only
> + * wait for it if __GFP_WAIT is set and the calling process is
> + * allowed to block. Even then, only wait 1 second. Waiting
> + * indefinitely can cause deadlocks when the NFS server is on
> + * this machine, and there is no particular need to wait
> + * extensively here. A short wait has the benefit that
> + * someone else can worry about the freezer.
> */
> - if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> - !(current->flags & PF_FSTRANS)) {
> - int how = FLUSH_SYNC;
> -
> - /* Don't let kswapd deadlock waiting for OOM RPC calls */
> - if (current_is_kswapd())
> - how = 0;
> - nfs_commit_inode(mapping->host, how);
> + if (mapping) {
> + nfs_commit_inode(mapping->host, 0);
> + if ((gfp & __GFP_WAIT) &&
> + !current_is_kswapd() &&
> + !(current->flags & PF_FSTRANS))
> + wait_on_page_bit_killable_timeout(page, PG_private,
> + HZ);
> }
> /* If PagePrivate() is set, then the page is not freeable */
> if (PagePrivate(page))
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 175d5d073ccf..b5d83c7545d4 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> if (likely(!PageSwapCache(head->wb_page))) {
> set_page_private(head->wb_page, 0);
> ClearPagePrivate(head->wb_page);
> + smp_mb__after_atomic();
> + wake_up_page(head->wb_page, PG_private);
> clear_bit(PG_MAPPED, &head->wb_flags);
> }
> nfsi->npages--;
>
>


2014-09-16 23:38:19

by NeilBrown

Subject: Re: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.

On Tue, 16 Sep 2014 08:39:39 -0400 Anna Schumaker <[email protected]>
wrote:

> On 09/16/2014 01:31 AM, NeilBrown wrote:
> > Support for loop-back mounted NFS filesystems is useful when NFS is
> > used to access shared storage in a high-availability cluster.
> >
> > If the node running the NFS server fails, some other node can mount the
> > filesystem and start providing NFS service. If that node already had
> > the filesystem NFS mounted, it will now have it loop-back mounted.
> >
> > nfsd can suffer a deadlock when allocating memory and entering direct
> > reclaim.
> > While direct reclaim does not write to the NFS filesystem it can send
> > and wait for a COMMIT through nfs_release_page().
>
> Is there anything that can be done on the nfsd side to prevent the deadlocks?
>

I went down that path first and it didn't work out.
Setting PF_FSTRANS in nfsd (when the request comes from localhost) and then
arranging that __GFP_FS is cleared when that flag is set overcomes a number of
possible deadlock sources, but not all.

There are a number of situations where nfsd is waiting on some other thread
(which doesn't have PF_FSTRANS set) and that thread tries to reclaim memory
and hits nfs_release_page().
It was a long and complex patch set, and nobody liked it.
And the common thread was that it always blocked in nfs_release_page().
So it seemed to make sense to just remove that blockage.

Thanks,
NeilBrown



2014-09-22 01:37:19

by NeilBrown

Subject: Re: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.

On Thu, 18 Sep 2014 08:01:07 -0400 Jeff Layton <[email protected]>
wrote:

> On Thu, 18 Sep 2014 16:03:17 +1000
> NeilBrown <[email protected]> wrote:
>
> > Support for loop-back mounted NFS filesystems is useful when NFS is
> > used to access shared storage in a high-availability cluster.
> >
> > If the node running the NFS server fails, some other node can mount the
> > filesystem and start providing NFS service. If that node already had
> > the filesystem NFS mounted, it will now have it loop-back mounted.
> >
> > nfsd can suffer a deadlock when allocating memory and entering direct
> > reclaim.
> > While direct reclaim does not write to the NFS filesystem it can send
> > and wait for a COMMIT through nfs_release_page().
> >
> > This patch modifies nfs_release_page() to wait a limited time for the
> > commit to complete - one second. If the commit doesn't complete
> > in this time, nfs_release_page() will fail. This means it might now
> > fail in some cases where it wouldn't before. These cases are only
> > when 'gfp' includes '__GFP_WAIT'.
> >
> > nfs_release_page() is only called by try_to_release_page(), and that
> > can only be called on an NFS page with required 'gfp' flags from
> > - page_cache_pipe_buf_steal() in splice.c
> > - shrink_page_list() in vmscan.c
> > - invalidate_inode_pages2_range() in truncate.c
> >
> > The first two handle failure quite safely. The last is only called
> > after ->launder_page() has been called, and that will have waited
> > for the commit to finish already.
> >
> > So aborting if the commit takes longer than 1 second is perfectly safe.
> >
> > If nfs_release_page() is called on a sequence of pages which are all
> > in the same file which is blocked on COMMIT, each page could
> > contribute a 1 second delay which could become excessive.  I have
> > seen delays of as much as 208 seconds.
> >
> > To keep the delay to one second, the bdi is marked as write-congested
> > if the commit didn't finish. Once it does finish, the
> > write-congested flag will be cleared.
> >
> > With this, the longest total delay in try_to_free_pages that I have
> > seen is under 3 seconds. With no waiting in nfs_release_page at all
> > I have seen delays of nearly 1.5 seconds.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > fs/nfs/file.c | 30 ++++++++++++++++++++----------
> > fs/nfs/write.c | 7 +++++++
> > 2 files changed, 27 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > index 524dd80d1898..febba950d8a6 100644
> > --- a/fs/nfs/file.c
> > +++ b/fs/nfs/file.c
> > @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> >
> > dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
> >
> > - /* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> > - * doing this memory reclaim for a fs-related allocation.
> > + /* Always try to initiate a 'commit' if relevant, but only
> > + * wait for it if __GFP_WAIT is set and the calling process is
> > + * allowed to block. Even then, only wait 1 second and only
> > + * if the 'bdi' is not congested.
> > + * Waiting indefinitely can cause deadlocks when the NFS
> > + * server is on this machine, and there is no particular need
> > + * to wait extensively here. A short wait has the benefit
> > + * that someone else can worry about the freezer.
> > */
> > - if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> > - !(current->flags & PF_FSTRANS)) {
> > - int how = FLUSH_SYNC;
> > -
> > - /* Don't let kswapd deadlock waiting for OOM RPC calls */
> > - if (current_is_kswapd())
> > - how = 0;
> > - nfs_commit_inode(mapping->host, how);
> > + if (mapping) {
> > + struct nfs_server *nfss = NFS_SERVER(mapping->host);
> > + nfs_commit_inode(mapping->host, 0);
> > + if ((gfp & __GFP_WAIT) &&
> > + !current_is_kswapd() &&
> > + !(current->flags & PF_FSTRANS) &&
> > + !bdi_write_congested(&nfss->backing_dev_info))
> > + wait_on_page_bit_killable_timeout(page, PG_private,
> > + HZ);
> > + if (PagePrivate(page))
> > + set_bdi_congested(&nfss->backing_dev_info,
> > + BLK_RW_ASYNC);
>
> I've never had a great feel for the BDI congestion stuff, but won't
> this have some unintended effects?
>
> For instance, suppose the VM decides to try to free this page and
> passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the
> COMMIT, but don't wait for it. The COMMIT is actually going to go
> reasonably fast, but we now set the BDI congested because we didn't
> wait for it to occur.
>
> That in turn causes writeout for other inodes on this BDI to get
> throttled even though there really is no congestion. It just looks that
> way due to how releasepage got called.
>
> Am I making mountains out of molehills here?

Excellent molehill - thanks :-)

I was being lazy. The 'if (PagePrivate())' should really be inside the
other if statement with the wait_on_page_bit...(). I've moved it there.
Once I get an Ack for the mm bits I'll repost it all.
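
For readers following the thread, the restructuring described here can be
sketched as a user-space toy model. It is an assumption about the shape of
the revised code, not the reposted patch: all names (toy2_page, toy2_bdi,
toy2_release_page, TOY2_GFP_WAIT) are invented for the sketch, and the
kswapd and PF_FSTRANS conditions from patch 3/4 are left out for brevity.

```c
#include <stdbool.h>

/* Toy user-space model, NOT kernel code; all names are hypothetical. */
#define TOY2_GFP_WAIT 0x1u	/* arbitrary bit standing in for __GFP_WAIT */

struct toy2_page { bool private_set; };	/* stands in for PagePrivate() */
struct toy2_bdi  { bool write_congested; };

/*
 * commit_done_within_timeout says whether the COMMIT would complete (and
 * clear the private flag) during the bounded wait, if the caller waits.
 * Returns true when the page is freeable.
 */
bool toy2_release_page(struct toy2_page *page, struct toy2_bdi *bdi,
		       unsigned int gfp, bool commit_done_within_timeout)
{
	/* The COMMIT is always started asynchronously at this point. */
	if ((gfp & TOY2_GFP_WAIT) && !bdi->write_congested) {
		/* Bounded wait for the private flag to clear. */
		if (commit_done_within_timeout)
			page->private_set = false;
		/*
		 * The check now sits inside the waiting branch: only a wait
		 * that actually timed out marks the bdi as congested.
		 */
		if (page->private_set)
			bdi->write_congested = true;
	}
	/* If the private flag is still set, the page is not freeable. */
	return !page->private_set;
}
```

With the check in this position, a caller whose gfp mask lacks the
waiting bit still fails to free the page, but no longer marks the bdi as
congested, which is the case Jeff raised.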

Thanks,
NeilBrown

(Molehills are worse for your suspension than mountains!)


>
> > }
> > /* If PagePrivate() is set, then the page is not freeable */
> > if (PagePrivate(page))
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 175d5d073ccf..3066c7fcb565 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> > if (likely(!PageSwapCache(head->wb_page))) {
> > set_page_private(head->wb_page, 0);
> > ClearPagePrivate(head->wb_page);
> > + smp_mb__after_atomic();
> > + wake_up_page(head->wb_page, PG_private);
> > clear_bit(PG_MAPPED, &head->wb_flags);
> > }
> > nfsi->npages--;
> > @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> > struct nfs_page *req;
> > int status = data->task.tk_status;
> > struct nfs_commit_info cinfo;
> > + struct nfs_server *nfss;
> >
> > while (!list_empty(&data->pages)) {
> > req = nfs_list_entry(data->pages.next);
> > @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> > next:
> > nfs_unlock_and_release_request(req);
> > }
> > + nfss = NFS_SERVER(data->inode);
> > + if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> > + clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> > +
> > nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> > if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
> > nfs_commit_clear_lock(NFS_I(data->inode));
> >
> >
>
>

