Now that the new wait_on_bit code is in the kernel (in 3.17-rc1) these
last two patches to support deadlock-free loop-back NFS mounts can be
applied (hopefully to go upstream for 3.18-rc1).
A deadlock can happen if nfsd tries to allocate memory, calls
->releasepage() on a page in an NFS filesystem, and nfs_release_page()
blocks waiting for the nfsd to confirm the COMMIT.
With this patch nfs_release_page() will not wait more than 100ms
for COMMIT to a non-remote nfs fileserver.
Thanks,
NeilBrown
---
NeilBrown (2):
SUNRPC: track when a client connection is routed to the local host.
NFS: avoid deadlocks with loop-back mounted NFS filesystems.
fs/nfs/file.c | 2 +
fs/nfs/write.c | 72 ++++++++++++++++++++++++++++++++++++++++---
include/linux/freezer.h | 10 ++++++
include/linux/sunrpc/clnt.h | 1 +
include/linux/sunrpc/xprt.h | 1 +
include/uapi/linux/nfs_fs.h | 3 ++
net/sunrpc/clnt.c | 25 +++++++++++++++
net/sunrpc/xprtsock.c | 9 +++++
8 files changed, 117 insertions(+), 6 deletions(-)
--
Signature
On Mon, 2014-08-18 at 16:22 +1000, NeilBrown wrote:
> Support for loop-back mounted NFS filesystems is useful when NFS is
> used to access shared storage in a high-availability cluster.
>
> If the node running the NFS server fails, some other node can mount the
> filesystem and start providing NFS service. If that node already had
> the filesystem NFS mounted, it will now have it loop-back mounted.
>
> nfsd can suffer a deadlock when allocating memory and entering direct
> reclaim.
> While direct reclaim does not write to the NFS filesystem it can send
> and wait for a COMMIT through nfs_release_page().
>
> This patch modifies nfs_release_page() and the functions it calls so
> that if the COMMIT is sent to the local host (i.e. is routed over the
> loop-back interface) then nfs_release_page() will not wait forever for
> that COMMIT to complete. This is achieved using a new flag
> FLUSH_COND_CONNECTED. When this is set the flush is conditional and
> will only wait if the client is connected to a non-local server.
>
> Aborting early should be safe as kswapd uses the same code but never
> waits for the COMMIT.
>
> Always aborting early could upset the balance of memory management so
> when the commit was sent to the local host we still wait but with a
> 100ms timeout. This is enough to significantly slow down any process
> that is allocating lots of memory and often long enough to let the
> commit complete.
>
> In those rare cases where it is nfsd, or something that nfsd is
> waiting for, that is calling nfs_release_page(), 100ms is not so long
> that throughput will be seriously affected.
>
> When fail-over happens a remote (foreign) client will first become
> disconnected and then turn into a local client.
> To prevent deadlocks from happening at this point, we still have a
> timeout when the COMMIT has been sent to a remote client. In this case
> the timeout is longer (1 second).
>
> So when a node that has mounted a remote filesystem loses the
> connection, nfs_release_page() will stop blocking and start failing.
> Memory allocators will then be able to make progress without blocking
> in NFS. Any process which writes to a file will still be blocked in
> balance_dirty_pages().
>
> This patch makes use of the new 'private' field in
> "struct wait_bit_key" to store the start time of a commit, so the
> 'action' function called by __wait_on_bit() knows how much longer
> it is appropriate to wait.
>
This puts way too much RPC connection logic in the NFS layer: we really
should not have to care. Is there any reason why we could not just use
RPC_TASK_SOFTCONN to have the commit fail if the connection is broken?
Note that all this has to come with a huge WARNING: all your data may be
lost even if you think you just fsync()ed it. Is there any difference
between doing this and using the 'async' export option on the knfsd
side?
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Wed, 20 Aug 2014 20:45:16 -0400 Trond Myklebust
<[email protected]> wrote:
> On Mon, 2014-08-18 at 16:22 +1000, NeilBrown wrote:
> > Support for loop-back mounted NFS filesystems is useful when NFS is
> > used to access shared storage in a high-availability cluster.
> >
> > If the node running the NFS server fails, some other node can mount the
> > filesystem and start providing NFS service. If that node already had
> > the filesystem NFS mounted, it will now have it loop-back mounted.
> >
> > nfsd can suffer a deadlock when allocating memory and entering direct
> > reclaim.
> > While direct reclaim does not write to the NFS filesystem it can send
> > and wait for a COMMIT through nfs_release_page().
> >
> > This patch modifies nfs_release_page() and the functions it calls so
> > that if the COMMIT is sent to the local host (i.e. is routed over the
> > loop-back interface) then nfs_release_page() will not wait forever for
> > that COMMIT to complete. This is achieved using a new flag
> > FLUSH_COND_CONNECTED. When this is set the flush is conditional and
> > will only wait if the client is connected to a non-local server.
> >
> > Aborting early should be safe as kswapd uses the same code but never
> > waits for the COMMIT.
> >
> > Always aborting early could upset the balance of memory management so
> > when the commit was sent to the local host we still wait but with a
> > 100ms timeout. This is enough to significantly slow down any process
> > that is allocating lots of memory and often long enough to let the
> > commit complete.
> >
> > In those rare cases where it is nfsd, or something that nfsd is
> > waiting for, that is calling nfs_release_page(), 100ms is not so long
> > that throughput will be seriously affected.
> >
> > When fail-over happens a remote (foreign) client will first become
> > disconnected and then turn into a local client.
> > To prevent deadlocks from happening at this point, we still have a
> > timeout when the COMMIT has been sent to a remote client. In this case
> > the timeout is longer (1 second).
> >
> > So when a node that has mounted a remote filesystem loses the
> > connection, nfs_release_page() will stop blocking and start failing.
> > Memory allocators will then be able to make progress without blocking
> > in NFS. Any process which writes to a file will still be blocked in
> > balance_dirty_pages().
> >
> > This patch makes use of the new 'private' field in
> > "struct wait_bit_key" to store the start time of a commit, so the
> > 'action' function called by __wait_on_bit() knows how much longer
> > it is appropriate to wait.
> >
>
> This puts way too much RPC connection logic in the NFS layer: we really
> should not have to care. Is there any reason why we could not just use
> RPC_TASK_SOFTCONN to have the commit fail if the connection is broken?
I tried to keep as much in the RPC layer as I could....
There really is a big difference between talking to an 'nfsd' on the same
machine and talking to one on a different machine. In the first case you
need to be cautious of deadlocks, in the second you don't. That is a
difference that matters to NFS, not to RPC.
I guess we could always have a timeout, even when connected to a remote
server. We would just end up blocking somewhere else when the server was not
responding.
I don't think SOFTCONN is really appropriate. We don't want the COMMIT to
stop being retried. We just don't want the memory reclaim code to block
waiting for the COMMIT.
>
> Note that all this has to come with a huge WARNING: all your data may be
> lost even if you think you just fsync()ed it. Is there any difference
> between doing this and using the 'async' export option on the knfsd
> side?
>
I don't think this patch risks losing data at all - that certainly isn't the
intent.
This patch only affects the behaviour of ->releasepage, and only allows it to
fail instead of block. It is perfectly acceptable for releasepage to fail
and it doesn't result in data loss. It just means that the page is busy.
Thanks,
NeilBrown
On Mon, 2014-08-18 at 16:22 +1000, NeilBrown wrote:
> If requests are being sent to the local host, then NFS will
> need to take care to avoid deadlocks.
>
> So keep track when accepting a connection or sending a UDP request
> and set a flag in the svc_xprt when the peer connected to is local.
>
> The interface rpc_is_foreign() is provided to check is a given client
> is connected to a foreign server. When it returns zero it is either
> not connected or connected to a local server and in either case
> greater care is needed.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> include/linux/sunrpc/clnt.h | 1 +
> include/linux/sunrpc/xprt.h | 1 +
> net/sunrpc/clnt.c | 25 +++++++++++++++++++++++++
> net/sunrpc/xprtsock.c | 9 +++++++++
> 4 files changed, 36 insertions(+)
>
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index 70736b98c721..cd79b2a28ceb 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -175,6 +175,7 @@ void rpc_force_rebind(struct rpc_clnt *);
> size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
> const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
> int rpc_localaddr(struct rpc_clnt *, struct sockaddr *, size_t);
> +int rpc_is_foreign(struct rpc_clnt *);
>
> #endif /* __KERNEL__ */
> #endif /* _LINUX_SUNRPC_CLNT_H */
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index fcbfe8783243..6a9dffcb9d3f 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -357,6 +357,7 @@ int xs_swapper(struct rpc_xprt *xprt, int enable);
> #define XPRT_CONNECTION_ABORT (7)
> #define XPRT_CONNECTION_CLOSE (8)
> #define XPRT_CONGESTED (9)
> +#define XPRT_LOCAL (10)
>
Can we please rename that to XPRT_LOOPBACK or something along those
lines? To me XPRT_LOCAL looks a little too close for comfort to
AF_LOCAL, which is also a supported transport.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Thu, 21 Aug 2014 12:15:44 +1000 NeilBrown <[email protected]> wrote:
> On Wed, 20 Aug 2014 21:42:55 -0400 Trond Myklebust
> <[email protected]> wrote:
>
> > On Wed, Aug 20, 2014 at 9:11 PM, NeilBrown <[email protected]> wrote:
> > > On Wed, 20 Aug 2014 20:45:16 -0400 Trond Myklebust
> > > <[email protected]> wrote:
> > >
> > >> On Mon, 2014-08-18 at 16:22 +1000, NeilBrown wrote:
> > >> > Support for loop-back mounted NFS filesystems is useful when NFS is
> > >> > used to access shared storage in a high-availability cluster.
> > >> >
> > >> > If the node running the NFS server fails, some other node can mount the
> > >> > filesystem and start providing NFS service. If that node already had
> > >> > the filesystem NFS mounted, it will now have it loop-back mounted.
> > >> >
> > >> > nfsd can suffer a deadlock when allocating memory and entering direct
> > >> > reclaim.
> > >> > While direct reclaim does not write to the NFS filesystem it can send
> > >> > and wait for a COMMIT through nfs_release_page().
> > >> >
> > >> > This patch modifies nfs_release_page() and the functions it calls so
> > >> > that if the COMMIT is sent to the local host (i.e. is routed over the
> > >> > loop-back interface) then nfs_release_page() will not wait forever for
> > >> > that COMMIT to complete. This is achieved using a new flag
> > >> > FLUSH_COND_CONNECTED. When this is set the flush is conditional and
> > >> > will only wait if the client is connected to a non-local server.
> > >> >
> > >> > Aborting early should be safe as kswapd uses the same code but never
> > >> > waits for the COMMIT.
> > >> >
> > >> > Always aborting early could upset the balance of memory management so
> > >> > when the commit was sent to the local host we still wait but with a
> > >> > 100ms timeout. This is enough to significantly slow down any process
> > >> > that is allocating lots of memory and often long enough to let the
> > >> > commit complete.
> > >> >
> > >> > In those rare cases where it is nfsd, or something that nfsd is
> > >> > waiting for, that is calling nfs_release_page(), 100ms is not so long
> > >> > that throughput will be seriously affected.
> > >> >
> > >> > When fail-over happens a remote (foreign) client will first become
> > >> > disconnected and then turn into a local client.
> > >> > To prevent deadlocks from happening at this point, we still have a
> > >> > timeout when the COMMIT has been sent to a remote client. In this case
> > >> > the timeout is longer (1 second).
> > >> >
> > >> > So when a node that has mounted a remote filesystem loses the
> > >> > connection, nfs_release_page() will stop blocking and start failing.
> > >> > Memory allocators will then be able to make progress without blocking
> > >> > in NFS. Any process which writes to a file will still be blocked in
> > >> > balance_dirty_pages().
> > >> >
> > >> > This patch makes use of the new 'private' field in
> > >> > "struct wait_bit_key" to store the start time of a commit, so the
> > >> > 'action' function called by __wait_on_bit() knows how much longer
> > >> > it is appropriate to wait.
> > >> >
> > >>
> > >> This puts way too much RPC connection logic in the NFS layer: we really
> > >> should not have to care. Is there any reason why we could not just use
> > >> RPC_TASK_SOFTCONN to have the commit fail if the connection is broken?
> > >
> > > I tried to keep as much in the RPC layer as I could....
> > >
> > > There really is a big difference between talking to an 'nfsd' on the same
> > > machine and talking to one on a different machine. In the first case you
> > > need to be cautious of deadlocks, in the second you don't. That is a
> > > difference that matters to NFS, not to RPC.
> > >
> > > I guess we could always have a timeout, even when connected to a remote
> > > server. We would just end up blocking somewhere else when the server was not
> > > responding.
> > >
> > > I don't think SOFTCONN is really appropriate. We don't want the COMMIT to
> > > stop being retried. We just don't want the memory reclaim code to block
> > > waiting for the COMMIT.
> > >
> > >>
> > >> Note that all this has to come with a huge WARNING: all your data may be
> > >> lost even if you think you just fsync()ed it. Is there any difference
> > >> between doing this and using the 'async' export option on the knfsd
> > >> side?
> > >>
> > >
> > > I don't think this patch risks losing data at all - that certainly isn't the
> > > intent.
> > > This patch only affects the behaviour of ->releasepage, and only allows it to
> > > fail instead of block. It is perfectly acceptable for releasepage to fail
> > > and it doesn't result in data loss. It just means that the page is busy.
> > >
> >
> > So why not just change the behaviour of ->releasepage() to always
> > initiate a non-blocking commit and then wait up to 1 second for the
> > PG_private bit to be released?
> >
>
> Would that work?
> PG_private seems to be set when the page is being written, not when the
> COMMIT is happening.
Ahh... it is set when the page is written and not cleared until it is
committed. So yes, we could do it that way if you prefer.
Are you happy with the same timeout whether the the server is local or remote?
The dynamics of the situation are very different, but I guess I don't hit
deadlocks in the local-server case so often that the occasional 1 second
delay would hurt.
I'd need to add a 'wake_up_bit(PG_private, &page->flags)' after
ClearPageBit().
I'll send you a patch early next week.
Thanks,
NeilBrown
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.
If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service. If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.
nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().
This patch modifies nfs_release_page() and the functions it calls so
that if the COMMIT is sent to the local host (i.e. is routed over the
loop-back interface) then nfs_release_page() will not wait forever for
that COMMIT to complete. This is achieved using a new flag
FLUSH_COND_CONNECTED. When this is set the flush is conditional and
will only wait if the client is connected to a non-local server.
Aborting early should be safe as kswapd uses the same code but never
waits for the COMMIT.
Always aborting early could upset the balance of memory management so
when the commit was sent to the local host we still wait but with a
100ms timeout. This is enough to significantly slow down any process
that is allocating lots of memory and often long enough to let the
commit complete.
In those rare cases where it is nfsd, or something that nfsd is
waiting for, that is calling nfs_release_page(), 100ms is not so long
that throughput will be seriously affected.
When fail-over happens a remote (foreign) client will first become
disconnected and then turn into a local client.
To prevent deadlocks from happening at this point, we still have a
timeout when the COMMIT has been sent to a remote client. In this case
the timeout is longer (1 second).
So when a node that has mounted a remote filesystem loses the
connection, nfs_release_page() will stop blocking and start failing.
Memory allocators will then be able to make progress without blocking
in NFS. Any process which writes to a file will still be blocked in
balance_dirty_pages().
This patch makes use of the new 'private' field in
"struct wait_bit_key" to store the start time of a commit, so the
'action' function called by __wait_on_bit() knows how much longer
it is appropriate to wait.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 2 +
fs/nfs/write.c | 72 ++++++++++++++++++++++++++++++++++++++++---
include/linux/freezer.h | 10 ++++++
include/uapi/linux/nfs_fs.h | 3 ++
4 files changed, 81 insertions(+), 6 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 524dd80d1898..9135af74b896 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -473,7 +473,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
*/
if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
!(current->flags & PF_FSTRANS)) {
- int how = FLUSH_SYNC;
+ int how = FLUSH_SYNC | FLUSH_COND_CONNECTED;
/* Don't let kswapd deadlock waiting for OOM RPC calls */
if (current_is_kswapd())
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index e3b5cf28bdc5..9694f65e2f39 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -14,6 +14,7 @@
#include <linux/writeback.h>
#include <linux/swap.h>
#include <linux/migrate.h>
+#include <linux/freezer.h>
#include <linux/sunrpc/clnt.h>
#include <linux/nfs_fs.h>
@@ -1457,6 +1458,55 @@ static void nfs_writeback_result(struct rpc_task *task,
#if IS_ENABLED(CONFIG_NFS_V3) || IS_ENABLED(CONFIG_NFS_V4)
+
+static int _wait_bit_connected(struct rpc_clnt *cl,
+ struct wait_bit_key *key)
+{
+ if (fatal_signal_pending(current))
+ return -ERESTARTSYS;
+ if (cl && rpc_is_foreign(cl)) {
+ /* We are connected to non-local server,
+ * but might not be so indefinitely, so
+ * don't wait indefinitely.
+ */
+ freezable_schedule_timeout_unsafe(HZ);
+ } else {
+ unsigned long waited;
+ if (key->private == 0) {
+ key->private = jiffies;
+ if (key->private == 0)
+ key->private -= 1;
+ }
+ waited = jiffies - key->private;
+
+ /* We might be waiting for ourselves, so don't
+ * wait too long.
+ */
+ if (waited >= HZ/10)
+ /* Too long, give up */
+ return -EAGAIN;
+
+ freezable_schedule_timeout_unsafe(HZ/10 - waited);
+ }
+ return 0;
+}
+
+static int nfs_wait_bit_connected(struct wait_bit_key *key)
+{
+ struct nfs_inode *nfsi = container_of(key->flags,
+ struct nfs_inode, flags);
+
+ return _wait_bit_connected(NFS_CLIENT(&nfsi->vfs_inode), key);
+}
+
+static int rpc_wait_bit_connected(struct wait_bit_key *key)
+{
+ struct rpc_task *task = container_of(key->flags,
+ struct rpc_task, tk_runstate);
+
+ return _wait_bit_connected(task->tk_client, key);
+}
+
static int nfs_commit_set_lock(struct nfs_inode *nfsi, int may_wait)
{
int ret;
@@ -1467,7 +1517,9 @@ static int nfs_commit_set_lock(struct nfs_inode *nfsi, int may_wait)
return 0;
ret = out_of_line_wait_on_bit_lock(&nfsi->flags,
NFS_INO_COMMIT,
- nfs_wait_bit_killable,
+ (may_wait & FLUSH_COND_CONNECTED)
+ ? nfs_wait_bit_connected
+ : nfs_wait_bit_killable,
TASK_KILLABLE);
return (ret < 0) ? ret : 1;
}
@@ -1518,8 +1570,12 @@ int nfs_initiate_commit(struct rpc_clnt *clnt, struct nfs_commit_data *data,
task = rpc_run_task(&task_setup_data);
if (IS_ERR(task))
return PTR_ERR(task);
- if (how & FLUSH_SYNC)
- rpc_wait_for_completion_task(task);
+ if (how & FLUSH_SYNC) {
+ if (how & FLUSH_COND_CONNECTED)
+ __rpc_wait_for_completion_task(task, rpc_wait_bit_connected);
+ else
+ rpc_wait_for_completion_task(task);
+ }
rpc_put_task(task);
return 0;
}
@@ -1695,7 +1751,7 @@ int nfs_commit_inode(struct inode *inode, int how)
{
LIST_HEAD(head);
struct nfs_commit_info cinfo;
- int may_wait = how & FLUSH_SYNC;
+ int may_wait = how & (FLUSH_SYNC | FLUSH_COND_CONNECTED);
int res;
res = nfs_commit_set_lock(NFS_I(inode), may_wait);
@@ -1711,10 +1767,16 @@ int nfs_commit_inode(struct inode *inode, int how)
return error;
if (!may_wait)
goto out_mark_dirty;
+
error = wait_on_bit_action(&NFS_I(inode)->flags,
NFS_INO_COMMIT,
- nfs_wait_bit_killable,
+ (how & FLUSH_COND_CONNECTED)
+ ? nfs_wait_bit_connected
+ : nfs_wait_bit_killable,
TASK_KILLABLE);
+
+ if ((how & FLUSH_COND_CONNECTED) && error == -EAGAIN)
+ goto out_mark_dirty;
if (error < 0)
return error;
} else
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 7fd81b8c4897..1a32fe0c1255 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -193,6 +193,16 @@ static inline long freezable_schedule_timeout(long timeout)
return __retval;
}
+/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
+static inline long freezable_schedule_timeout_unsafe(long timeout)
+{
+ long __retval;
+ freezer_do_not_count();
+ __retval = schedule_timeout(timeout);
+ freezer_count_unsafe();
+ return __retval;
+}
+
/*
* Like schedule_timeout_interruptible(), but should not block the freezer. Do not
* call this with locks held.
diff --git a/include/uapi/linux/nfs_fs.h b/include/uapi/linux/nfs_fs.h
index 49142287999c..896b8d976896 100644
--- a/include/uapi/linux/nfs_fs.h
+++ b/include/uapi/linux/nfs_fs.h
@@ -35,6 +35,9 @@
#define FLUSH_HIGHPRI 16 /* high priority memory reclaim flush */
#define FLUSH_COND_STABLE 32 /* conditional stable write - only stable
* if everything fits in one RPC */
+#define FLUSH_COND_CONNECTED 64 /* Don't wait for COMMIT unless client
+ * is connected to a remote host.
+ */
/*
On Wed, 20 Aug 2014 20:33:04 -0400 Trond Myklebust
<[email protected]> wrote:
> On Mon, 2014-08-18 at 16:22 +1000, NeilBrown wrote:
> > If requests are being sent to the local host, then NFS will
> > need to take care to avoid deadlocks.
> >
> > So keep track when accepting a connection or sending a UDP request
> > and set a flag in the svc_xprt when the peer connected to is local.
> >
> > The interface rpc_is_foreign() is provided to check is a given client
> > is connected to a foreign server. When it returns zero it is either
> > not connected or connected to a local server and in either case
> > greater care is needed.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > include/linux/sunrpc/clnt.h | 1 +
> > include/linux/sunrpc/xprt.h | 1 +
> > net/sunrpc/clnt.c | 25 +++++++++++++++++++++++++
> > net/sunrpc/xprtsock.c | 9 +++++++++
> > 4 files changed, 36 insertions(+)
> >
> > diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> > index 70736b98c721..cd79b2a28ceb 100644
> > --- a/include/linux/sunrpc/clnt.h
> > +++ b/include/linux/sunrpc/clnt.h
> > @@ -175,6 +175,7 @@ void rpc_force_rebind(struct rpc_clnt *);
> > size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
> > const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
> > int rpc_localaddr(struct rpc_clnt *, struct sockaddr *, size_t);
> > +int rpc_is_foreign(struct rpc_clnt *);
> >
> > #endif /* __KERNEL__ */
> > #endif /* _LINUX_SUNRPC_CLNT_H */
> > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > index fcbfe8783243..6a9dffcb9d3f 100644
> > --- a/include/linux/sunrpc/xprt.h
> > +++ b/include/linux/sunrpc/xprt.h
> > @@ -357,6 +357,7 @@ int xs_swapper(struct rpc_xprt *xprt, int enable);
> > #define XPRT_CONNECTION_ABORT (7)
> > #define XPRT_CONNECTION_CLOSE (8)
> > #define XPRT_CONGESTED (9)
> > +#define XPRT_LOCAL (10)
> >
>
> Can we please rename that to XPRT_LOOPBACK or something along those
> lines? To me XPRT_LOCAL looks a little too close for comfort to
> AF_LOCAL, which is also a supported transport.
>
I'm fine with that. I'll resend once we are in agreement about the other
bits.
Thanks,
NeilBrown
On Wed, Aug 20, 2014 at 10:15 PM, NeilBrown <[email protected]> wrote:
> On Wed, 20 Aug 2014 21:42:55 -0400 Trond Myklebust
> <[email protected]> wrote:
>
>> On Wed, Aug 20, 2014 at 9:11 PM, NeilBrown <[email protected]> wrote:
>> > On Wed, 20 Aug 2014 20:45:16 -0400 Trond Myklebust
>> > <[email protected]> wrote:
>> >
>> >> On Mon, 2014-08-18 at 16:22 +1000, NeilBrown wrote:
>> >> > Support for loop-back mounted NFS filesystems is useful when NFS is
>> >> > used to access shared storage in a high-availability cluster.
>> >> >
>> >> > If the node running the NFS server fails, some other node can mount the
>> >> > filesystem and start providing NFS service. If that node already had
>> >> > the filesystem NFS mounted, it will now have it loop-back mounted.
>> >> >
>> >> > nfsd can suffer a deadlock when allocating memory and entering direct
>> >> > reclaim.
>> >> > While direct reclaim does not write to the NFS filesystem it can send
>> >> > and wait for a COMMIT through nfs_release_page().
>> >> >
>> >> > This patch modifies nfs_release_page() and the functions it calls so
>> >> > that if the COMMIT is sent to the local host (i.e. is routed over the
>> >> > loop-back interface) then nfs_release_page() will not wait forever for
>> >> > that COMMIT to complete. This is achieved using a new flag
>> >> > FLUSH_COND_CONNECTED. When this is set the flush is conditional and
>> >> > will only wait if the client is connected to a non-local server.
>> >> >
>> >> > Aborting early should be safe as kswapd uses the same code but never
>> >> > waits for the COMMIT.
>> >> >
>> >> > Always aborting early could upset the balance of memory management so
>> >> > when the commit was sent to the local host we still wait but with a
>> >> > 100ms timeout. This is enough to significantly slow down any process
>> >> > that is allocating lots of memory and often long enough to let the
>> >> > commit complete.
>> >> >
>> >> > In those rare cases where it is nfsd, or something that nfsd is
>> >> > waiting for, that is calling nfs_release_page(), 100ms is not so long
>> >> > that throughput will be seriously affected.
>> >> >
>> >> > When fail-over happens a remote (foreign) client will first become
>> >> > disconnected and then turn into a local client.
>> >> > To prevent deadlocks from happening at this point, we still have a
>> >> > timeout when the COMMIT has been sent to a remote client. In this case
>> >> > the timeout is longer (1 second).
>> >> >
>> >> > So when a node that has mounted a remote filesystem loses the
>> >> > connection, nfs_release_page() will stop blocking and start failing.
>> >> > Memory allocators will then be able to make progress without blocking
>> >> > in NFS. Any process which writes to a file will still be blocked in
>> >> > balance_dirty_pages().
>> >> >
>> >> > This patch makes use of the new 'private' field in
>> >> > "struct wait_bit_key" to store the start time of a commit, so the
>> >> > 'action' function called by __wait_on_bit() knows how much longer
>> >> > it is appropriate to wait.
>> >> >
>> >>
>> >> This puts way too much RPC connection logic in the NFS layer: we really
>> >> should not have to care. Is there any reason why we could not just use
>> >> RPC_TASK_SOFTCONN to have the commit fail if the connection is broken?
>> >
>> > I tried to keep as much in the RPC layer as I could....
>> >
>> > There really is a big difference between talking to an 'nfsd' on the same
>> > machine and talking to one on a different machine. In the first case you
>> > need to be cautious of deadlocks, in the second you don't. That is a
>> > difference that matters to NFS, not to RPC.
>> >
>> > I guess we could always have a timeout, even when connected to a remote
>> > server. We would just end up blocking somewhere else when the server was not
>> > responding.
>> >
>> > I don't think SOFTCONN is really appropriate. We don't want the COMMIT to
>> > stop being retried. We just don't want the memory reclaim code to block
>> > waiting for the COMMIT.
>> >
>> >>
>> >> Note that all this has to come with a huge WARNING: all your data may be
>> >> lost even if you think you just fsync()ed it. Is there any difference
>> >> between doing this and using the 'async' export option on the knfsd
>> >> side?
>> >>
>> >
>> > I don't think this patch risks losing data at all - that certainly isn't the
>> > intent.
>> > This patch only affects the behaviour of ->releasepage, and only allows it to
>> > fail instead of block. It is perfectly acceptable for releasepage to fail
>> > and it doesn't result in data loss. It just means that the page is busy.
>> >
>>
>> So why not just change the behaviour of ->releasepage() to always
>> initiate a non-blocking commit and then wait up to 1 second for the
>> PG_private bit to be released?
>>
>
> Would that work?
> PG_private seems to be set when the page is being written, not when the
> COMMIT is happening.
> That is marked with the NFS_INO_COMMIT flag.
The PG_private flag remains set until either the write requests
associated with that page have been committed to stable storage or we
get a fatal error that prevents this from occurring. This is precisely
why we call COMMIT before testing PagePrivate(page) in
nfs_release_page().
> So what my code does is wait a little while for NFS_INO_COMMIT to be released.
>
> I tried to leave the flow largely unchanged for a non-local server and only
> introduced a timeout for a local server. That probably creates the
> complexity that is bothering you(?).
The layering violations are what bother me; the fact that we're
introducing the concept of transport connection state to the NFS
layer. That kind of concept should be hidden deep in the bowels of the
transport sub-layer of the RPC client code so that we do not have to
worry, when debugging an issue with a hard mount, about whether or not
the client was using TCP or UDP, and what the state was of the
connection.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Wed, 20 Aug 2014 21:42:55 -0400 Trond Myklebust
<[email protected]> wrote:
> On Wed, Aug 20, 2014 at 9:11 PM, NeilBrown <[email protected]> wrote:
> > On Wed, 20 Aug 2014 20:45:16 -0400 Trond Myklebust
> > <[email protected]> wrote:
> >
> >> On Mon, 2014-08-18 at 16:22 +1000, NeilBrown wrote:
> >> > Support for loop-back mounted NFS filesystems is useful when NFS is
> >> > used to access shared storage in a high-availability cluster.
> >> >
> >> > If the node running the NFS server fails, some other node can mount the
> >> > filesystem and start providing NFS service. If that node already had
> >> > the filesystem NFS mounted, it will now have it loop-back mounted.
> >> >
> >> > nfsd can suffer a deadlock when allocating memory and entering direct
> >> > reclaim.
> >> > While direct reclaim does not write to the NFS filesystem it can send
> >> > and wait for a COMMIT through nfs_release_page().
> >> >
> >> > This patch modifies nfs_release_page() and the functions it calls so
> >> > that if the COMMIT is sent to the local host (i.e. is routed over the
> >> > loop-back interface) then nfs_release_page() will not wait forever for
> >> > that COMMIT to complete. This is achieved using a new flag
> >> > FLUSH_COND_CONNECTED. When this is set the flush is conditional and
> >> > will only wait if the client is connected to a non-local server.
> >> >
> >> > Aborting early should be safe as kswapd uses the same code but never
> >> > waits for the COMMIT.
> >> >
> >> > Always aborting early could upset the balance of memory management so
> >> > when the commit was sent to the local host we still wait but with a
> >> > 100ms timeout. This is enough to significantly slow down any process
> >> > that is allocating lots of memory and often long enough to let the
> >> > commit complete.
> >> >
> >> > In those rare cases where it is nfsd, or something that nfsd is
> >> > waiting for, that is calling nfs_release_page(), 100ms is not so long
> >> > that throughput will be seriously affected.
> >> >
> >> > When fail-over happens a remote (foreign) client will first become
> >> > disconnected and then turn into a local client.
> >> > To prevent deadlocks from happening at this point, we still have a
> >> > timeout when the COMMIT has been sent to a remote client. In this case
> >> > the timeout is longer (1 second).
> >> >
> >> > So when a node that has mounted a remote filesystem loses the
> >> > connection, nfs_release_page() will stop blocking and start failing.
> >> > Memory allocators will then be able to make progress without blocking
> >> > in NFS. Any process which writes to a file will still be blocked in
> >> > balance_dirty_pages().
> >> >
> >> > This patch makes use of the new 'private' field in
> >> > "struct wait_bit_key" to store the start time of a commit, so the
> >> > 'action' function called by __wait_on_bit() knows how much longer
> >> > it is appropriate to wait.
> >> >
> >>
> >> This puts way too much RPC connection logic in the NFS layer: we really
> >> should not have to care. Is there any reason why we could not just use
> >> RPC_TASK_SOFTCONN to have the commit fail if the connection is broken?
> >
> > I tried to keep as much in the RPC layer as I could....
> >
> > There really is a big difference between talking to an 'nfsd' on the same
> > machine and talking to one on a different machine. In the first case you
> > need to be cautious of deadlocks, in the second you don't. That is a
> > difference that matters to NFS, not to RPC.
> >
> > I guess we could always have a timeout, even when connected to a remote
> > server. We would just end up blocking somewhere else when the server was not
> > responding.
> >
> > I don't think SOFTCONN is really appropriate. We don't want the COMMIT to
> > stop being retried. We just don't want the memory reclaim code to block
> > waiting for the COMMIT.
> >
> >>
> >> Note that all this has to come with a huge WARNING: all your data may be
> >> lost even if you think you just fsync()ed it. Is there any difference
> >> between doing this and using the 'async' export option on the knfsd
> >> side?
> >>
> >
> > I don't think this patch risks losing data at all - that certainly isn't the
> > intent.
> > This patch only affects the behaviour of ->releasepage, and only allows it to
> > fail instead of block. It is perfectly acceptable for releasepage to fail
> > and it doesn't result in data loss. It just means that the page is busy.
> >
>
> So why not just change the behaviour of ->releasepage() to always
> initiate a non-blocking commit and then wait up to 1 second for the
> PG_private bit to be released?
>
Would that work?
PG_private seems to be set when the page is being written, not when the
COMMIT is happening.
That is marked with the NFS_INO_COMMIT flag.
So what my code does is wait a little while for NFS_INO_COMMIT to be released.
I tried to leave the flow largely unchanged for a non-local server and only
introduced a timeout for a local server. That probably creates the
complexity that is bothering you(?).
NeilBrown
If requests are being sent to the local host, then NFS will
need to take care to avoid deadlocks.
So keep track when accepting a connection or sending a UDP request
and set a flag in the svc_xprt when the peer connected to is local.
The interface rpc_is_foreign() is provided to check is a given client
is connected to a foreign server. When it returns zero it is either
not connected or connected to a local server and in either case
greater care is needed.
Signed-off-by: NeilBrown <[email protected]>
---
include/linux/sunrpc/clnt.h | 1 +
include/linux/sunrpc/xprt.h | 1 +
net/sunrpc/clnt.c | 25 +++++++++++++++++++++++++
net/sunrpc/xprtsock.c | 9 +++++++++
4 files changed, 36 insertions(+)
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 70736b98c721..cd79b2a28ceb 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -175,6 +175,7 @@ void rpc_force_rebind(struct rpc_clnt *);
size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
int rpc_localaddr(struct rpc_clnt *, struct sockaddr *, size_t);
+int rpc_is_foreign(struct rpc_clnt *);
#endif /* __KERNEL__ */
#endif /* _LINUX_SUNRPC_CLNT_H */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index fcbfe8783243..6a9dffcb9d3f 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -357,6 +357,7 @@ int xs_swapper(struct rpc_xprt *xprt, int enable);
#define XPRT_CONNECTION_ABORT (7)
#define XPRT_CONNECTION_CLOSE (8)
#define XPRT_CONGESTED (9)
+#define XPRT_LOCAL (10)
static inline void xprt_set_connected(struct rpc_xprt *xprt)
{
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 488ddeed9363..1559d03e468e 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1117,6 +1117,31 @@ const char *rpc_peeraddr2str(struct rpc_clnt *clnt,
}
EXPORT_SYMBOL_GPL(rpc_peeraddr2str);
+/**
+ * rpc_is_foreign - report is rpc client was recently connected to
+ * remote host
+ * @clnt: RPC client structure
+ *
+ * If the client is not connected, or connected to the local host
+ * (any IP address), then return 0. Only return non-zero if the
+ * most recent state was a connection to a remote host.
+ * For UDP the client always appears to be connected, and the
+ * remoteness of the host is of the destination of the last transmission.
+ */
+int rpc_is_foreign(struct rpc_clnt *clnt)
+{
+ struct rpc_xprt *xprt;
+ int conn_foreign;
+
+ rcu_read_lock();
+ xprt = rcu_dereference(clnt->cl_xprt);
+ conn_foreign = (xprt && xprt_connected(xprt)
+ && !test_bit(XPRT_LOCAL, &xprt->state));
+ rcu_read_unlock();
+ return conn_foreign;
+}
+EXPORT_SYMBOL_GPL(rpc_is_foreign);
+
static const struct sockaddr_in rpc_inaddr_loopback = {
.sin_family = AF_INET,
.sin_addr.s_addr = htonl(INADDR_ANY),
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89eacfab..70942643e88c 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -643,6 +643,11 @@ static int xs_udp_send_request(struct rpc_task *task)
xdr->len - req->rq_bytes_sent, status);
if (status >= 0) {
+ if (sock_is_loopback(transport->sock->sk))
+ set_bit(XPRT_LOCAL, &xprt->state);
+ else
+ clear_bit(XPRT_LOCAL, &xprt->state);
+
req->rq_xmit_bytes_sent += status;
if (status >= req->rq_slen)
return 0;
@@ -1550,6 +1555,10 @@ static void xs_tcp_state_change(struct sock *sk)
xprt_wake_pending_tasks(xprt, -EAGAIN);
}
+ if (sock_is_loopback(sk))
+ set_bit(XPRT_LOCAL, &xprt->state);
+ else
+ clear_bit(XPRT_LOCAL, &xprt->state);
spin_unlock(&xprt->transport_lock);
break;
case TCP_FIN_WAIT1:
On Wed, Aug 20, 2014 at 9:11 PM, NeilBrown <[email protected]> wrote:
> On Wed, 20 Aug 2014 20:45:16 -0400 Trond Myklebust
> <[email protected]> wrote:
>
>> On Mon, 2014-08-18 at 16:22 +1000, NeilBrown wrote:
>> > Support for loop-back mounted NFS filesystems is useful when NFS is
>> > used to access shared storage in a high-availability cluster.
>> >
>> > If the node running the NFS server fails, some other node can mount the
>> > filesystem and start providing NFS service. If that node already had
>> > the filesystem NFS mounted, it will now have it loop-back mounted.
>> >
>> > nfsd can suffer a deadlock when allocating memory and entering direct
>> > reclaim.
>> > While direct reclaim does not write to the NFS filesystem it can send
>> > and wait for a COMMIT through nfs_release_page().
>> >
>> > This patch modifies nfs_release_page() and the functions it calls so
>> > that if the COMMIT is sent to the local host (i.e. is routed over the
>> > loop-back interface) then nfs_release_page() will not wait forever for
>> > that COMMIT to complete. This is achieved using a new flag
>> > FLUSH_COND_CONNECTED. When this is set the flush is conditional and
>> > will only wait if the client is connected to a non-local server.
>> >
>> > Aborting early should be safe as kswapd uses the same code but never
>> > waits for the COMMIT.
>> >
>> > Always aborting early could upset the balance of memory management so
>> > when the commit was sent to the local host we still wait but with a
>> > 100ms timeout. This is enough to significantly slow down any process
>> > that is allocating lots of memory and often long enough to let the
>> > commit complete.
>> >
>> > In those rare cases where it is nfsd, or something that nfsd is
>> > waiting for, that is calling nfs_release_page(), 100ms is not so long
>> > that throughput will be seriously affected.
>> >
>> > When fail-over happens a remote (foreign) client will first become
>> > disconnected and then turn into a local client.
>> > To prevent deadlocks from happening at this point, we still have a
>> > timeout when the COMMIT has been sent to a remote client. In this case
>> > the timeout is longer (1 second).
>> >
>> > So when a node that has mounted a remote filesystem loses the
>> > connection, nfs_release_page() will stop blocking and start failing.
>> > Memory allocators will then be able to make progress without blocking
>> > in NFS. Any process which writes to a file will still be blocked in
>> > balance_dirty_pages().
>> >
>> > This patch makes use of the new 'private' field in
>> > "struct wait_bit_key" to store the start time of a commit, so the
>> > 'action' function called by __wait_on_bit() knows how much longer
>> > it is appropriate to wait.
>> >
>>
>> This puts way too much RPC connection logic in the NFS layer: we really
>> should not have to care. Is there any reason why we could not just use
>> RPC_TASK_SOFTCONN to have the commit fail if the connection is broken?
>
> I tried to keep as much in the RPC layer as I could....
>
> There really is a big difference between talking to an 'nfsd' on the same
> machine and talking to one on a different machine. In the first case you
> need to be cautious of deadlocks, in the second you don't. That is a
> difference that matters to NFS, not to RPC.
>
> I guess we could always have a timeout, even when connected to a remote
> server. We would just end up blocking somewhere else when the server was not
> responding.
>
> I don't think SOFTCONN is really appropriate. We don't want the COMMIT to
> stop being retried. We just don't want the memory reclaim code to block
> waiting for the COMMIT.
>
>>
>> Note that all this has to come with a huge WARNING: all your data may be
>> lost even if you think you just fsync()ed it. Is there any difference
>> between doing this and using the 'async' export option on the knfsd
>> side?
>>
>
> I don't think this patch risks losing data at all - that certainly isn't the
> intent.
> This patch only affects the behaviour of ->releasepage, and only allows it to
> fail instead of block. It is perfectly acceptable for releasepage to fail
> and it doesn't result in data loss. It just means that the page is busy.
>
So why not just change the behaviour of ->releasepage() to always
initiate a non-blocking commit and then wait up to 1 second for the
PG_private bit to be released?
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]