2014-04-23 03:48:12

by NeilBrown

Subject: [PATCH/RFC 0/5] Support loop-back NFS mounts - take 2

This is a somewhat shorter patchset for loop-back NFS support than
last time, thanks to the excellent feedback, particularly from Dave
Chinner.

Avoiding the wait-for-congestion which can trigger a livelock is much
the same, though I've reduced the cases in which the wait is
by-passed.
I did this using current->backing_dev_info which is otherwise serving
no purpose on the current kernel.

Avoiding the deadlocks has been turned on its head.
Instead of nfsd checking if it is a loop-back mount and setting
PF_FSTRANS, which then needs lots of changes to PF_FSTRANS and
__GFP_FS handling, it is now NFS which checks for a loop-back
filesystem.

There is more verbosity in that patch (Fifth of Five) but the essence
is that nfs_release_page will now not wait indefinitely for a COMMIT
request to complete when sent to the local host. It still waits a
little while as some delay can be important. But it won't wait
forever.
The duration of "a little while" is currently 100ms, though I do
wonder if a bigger number would serve just as well.

Unlike the previous series, this set should remove deadlocks that
could happen during the actual fail-over process. This is achieved by
having nfs_release_page monitor the connection: if it changes from a
remote to a local connection, or just disconnects, the wait times out.
It currently polls every second, though this could probably be longer
too. The interval only needs to be the same order of magnitude as the
time it takes for node failure to be detected and failover to happen,
and I suspect that is closer to 1 minute. So maybe a 10 or 20 second
poll interval would be just as good.

Implementing this timeout requires some horrible code as the
wait_on_bit functions don't support timeouts. If the general approach
is found acceptable I'll explore ways to improve the timeout code.

Comments, criticism, etc. are very welcome as always,

Thanks,
NeilBrown


---

NeilBrown (5):
MM: avoid throttling reclaim for loop-back nfsd threads.
SUNRPC: track whether a request is coming from a loop-back interface.
nfsd: Only set PF_LESS_THROTTLE when really needed.
SUNRPC: track when a client connection is routed to the local host.
NFS: avoid deadlocks with loop-back mounted NFS filesystems.


fs/nfs/file.c | 2 +
fs/nfs/write.c | 73 +++++++++++++++++++++++++++++++++++----
fs/nfsd/nfssvc.c | 6 ---
fs/nfsd/vfs.c | 12 ++++++
include/linux/freezer.h | 10 +++++
include/linux/sunrpc/clnt.h | 1 +
include/linux/sunrpc/svc.h | 1 +
include/linux/sunrpc/svc_xprt.h | 1 +
include/linux/sunrpc/xprt.h | 1 +
include/uapi/linux/nfs_fs.h | 3 ++
mm/vmscan.c | 18 +++++++++-
net/sunrpc/clnt.c | 25 +++++++++++++
net/sunrpc/svcsock.c | 10 +++++
net/sunrpc/xprtsock.c | 17 +++++++++
14 files changed, 163 insertions(+), 17 deletions(-)




2014-04-23 03:48:30

by NeilBrown

Subject: [PATCH 2/5] SUNRPC: track whether a request is coming from a loop-back interface.

If an incoming NFS request is coming from the local host, then
nfsd will need to perform some special handling. So detect that
possibility and make the source visible in rq_local.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/sunrpc/svc.h | 1 +
include/linux/sunrpc/svc_xprt.h | 1 +
net/sunrpc/svcsock.c | 10 ++++++++++
3 files changed, 12 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 04e763221246..a0dbbd1e00e9 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -254,6 +254,7 @@ struct svc_rqst {
u32 rq_prot; /* IP protocol */
unsigned short
rq_secure : 1; /* secure port */
+ unsigned short rq_local : 1; /* local request */

void * rq_argp; /* decoded arguments */
void * rq_resp; /* xdr'd results */
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index b05963f09ebf..b99bdfb0fcf9 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -63,6 +63,7 @@ struct svc_xprt {
#define XPT_DETACHED 10 /* detached from tempsocks list */
#define XPT_LISTENER 11 /* listening endpoint */
#define XPT_CACHE_AUTH 12 /* cache auth info */
+#define XPT_LOCAL 13 /* connection from loopback interface */

struct svc_serv *xpt_server; /* service for transport */
atomic_t xpt_reserved; /* space on outq that is rsvd */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index b6e59f0a9475..193115fe968c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -811,6 +811,7 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
struct socket *newsock;
struct svc_sock *newsvsk;
int err, slen;
+ struct dst_entry *dst;
RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);

dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
@@ -867,6 +868,14 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
}
svc_xprt_set_local(&newsvsk->sk_xprt, sin, slen);

+ clear_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
+ rcu_read_lock();
+ dst = rcu_dereference(newsock->sk->sk_dst_cache);
+ if (dst && dst->dev &&
+ (dst->dev->features & NETIF_F_LOOPBACK))
+ set_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
+ rcu_read_unlock();
+
if (serv->sv_stats)
serv->sv_stats->nettcpconn++;

@@ -1112,6 +1121,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)

rqstp->rq_xprt_ctxt = NULL;
rqstp->rq_prot = IPPROTO_TCP;
+ rqstp->rq_local = !!test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags);

p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
calldir = p[1];



2014-04-23 03:48:58

by NeilBrown

Subject: [PATCH 5/5] NFS: avoid deadlocks with loop-back mounted NFS filesystems.

Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.

If the node running the NFS server fails some other node can mount the
filesystem and start providing NFS service. If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.

nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem, it can send
and wait for a COMMIT through nfs_release_page().

This patch modifies nfs_release_page() and the functions it calls so
that if the COMMIT is sent to the local host (i.e. is routed over the
loop-back interface) then nfs_release_page() will not wait forever for
that COMMIT to complete.

Aborting early should be safe as kswapd uses the same code but never
waits for the COMMIT.

Always aborting early could upset the balance of memory management,
so when the commit was sent to the local host we still wait, but with
a 100ms timeout. This is enough to significantly slow down any process
that is allocating lots of memory, and often long enough to let the
commit complete.

In those rare cases where it is nfsd, or something that nfsd is
waiting for, that is calling nfs_release_page(), 100ms is not so long
that throughput will be affected.

When fail-over happens, a remote (foreign) connection will first be
lost and will then become a local connection.
To prevent deadlocks from happening at this point, we still have a
timeout when the COMMIT has been sent to a remote host. In this case
the timeout is longer (1 second).

So when a node that has mounted a remote filesystem loses the
connection, nfs_release_page() will stop blocking and start failing.
Memory allocators will then be able to make progress without blocking
in NFS. Any process which writes to a file will still be blocked in
balance_dirty_pages().

I really need a version of wait_on_bit and friends which supports
timeouts for this patch. I've hacked something that works by abusing
current->journal_info, but I don't expect that to go upstream.

I cannot simply use schedule_timeout in the wait_bit function as there
could be false wakeups due to hash collisions in bit_waitqueue().
One possibility would be to have a 'private' field in 'struct
wait_bit_key' that was initialized to 0, and have the address of
that struct passed to the 'action' function.
There are currently 26 of these action functions so it wouldn't be too
hard to change the interface...

It might be good to use something like SetPageReclaim() to ensure that
the page is reclaimed as soon as the COMMIT does complete, but I don't
understand that code well enough yet to supply a patch.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 2 +
fs/nfs/write.c | 73 ++++++++++++++++++++++++++++++++++++++-----
include/linux/freezer.h | 10 ++++++
include/uapi/linux/nfs_fs.h | 3 ++
4 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..64a2999fe290 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -474,7 +474,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
*/
if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
!(current->flags & PF_FSTRANS)) {
- int how = FLUSH_SYNC;
+ int how = FLUSH_SYNC | FLUSH_COND_CONNECTED;

/* Don't let kswapd deadlock waiting for OOM RPC calls */
if (current_is_kswapd())
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 9a3b6a4cd6b9..eadd3317fd85 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -14,6 +14,7 @@
#include <linux/writeback.h>
#include <linux/swap.h>
#include <linux/migrate.h>
+#include <linux/freezer.h>

#include <linux/sunrpc/clnt.h>
#include <linux/nfs_fs.h>
@@ -1440,6 +1441,43 @@ void nfs_writeback_done(struct rpc_task *task, struct nfs_write_data *data)


#if IS_ENABLED(CONFIG_NFS_V3) || IS_ENABLED(CONFIG_NFS_V4)
+
+static int _wait_bit_connected(struct rpc_clnt *cl)
+{
+ if (fatal_signal_pending(current))
+ return -ERESTARTSYS;
+ if (!cl || rpc_is_foreign(cl)) {
+ /* it might not stay foreign forever */
+ freezable_schedule_timeout_unsafe(HZ);
+ } else {
+ unsigned long *commit_start = current->journal_info;
+ unsigned long waited = jiffies - *commit_start;
+
+ /* we could block forever here ... */
+ if (waited >= HZ/10)
+ /* Too long, give up */
+ return -EAGAIN;
+
+ freezable_schedule_timeout_unsafe(HZ/10 - waited);
+ }
+ return 0;
+}
+
+static int nfs_wait_bit_connected(void *word)
+{
+ struct nfs_inode *nfsi = container_of(word, struct nfs_inode, flags);
+
+ return _wait_bit_connected(NFS_CLIENT(&nfsi->vfs_inode));
+}
+
+static int rpc_wait_bit_connected(void *word)
+{
+ struct rpc_task *task = container_of(word, struct rpc_task, tk_runstate);
+
+ return _wait_bit_connected(task->tk_client);
+}
+
+
static int nfs_commit_set_lock(struct nfs_inode *nfsi, int may_wait)
{
int ret;
@@ -1450,7 +1488,8 @@ static int nfs_commit_set_lock(struct nfs_inode *nfsi, int may_wait)
return 0;
ret = out_of_line_wait_on_bit_lock(&nfsi->flags,
NFS_INO_COMMIT,
- nfs_wait_bit_killable,
+ (may_wait & FLUSH_COND_CONNECTED) ?
+ nfs_wait_bit_connected : nfs_wait_bit_killable,
TASK_KILLABLE);
return (ret < 0) ? ret : 1;
}
@@ -1501,8 +1540,12 @@ int nfs_initiate_commit(struct rpc_clnt *clnt, struct nfs_commit_data *data,
task = rpc_run_task(&task_setup_data);
if (IS_ERR(task))
return PTR_ERR(task);
- if (how & FLUSH_SYNC)
- rpc_wait_for_completion_task(task);
+ if (how & FLUSH_SYNC) {
+ if (how & FLUSH_COND_CONNECTED)
+ __rpc_wait_for_completion_task(task, rpc_wait_bit_connected);
+ else
+ rpc_wait_for_completion_task(task);
+ }
rpc_put_task(task);
return 0;
}
@@ -1678,8 +1721,15 @@ int nfs_commit_inode(struct inode *inode, int how)
{
LIST_HEAD(head);
struct nfs_commit_info cinfo;
- int may_wait = how & FLUSH_SYNC;
+ int may_wait = how & (FLUSH_SYNC | FLUSH_COND_CONNECTED);
int res;
+ int error;
+ unsigned long commit_start;
+
+ if (how & FLUSH_COND_CONNECTED) {
+ commit_start = jiffies;
+ current->journal_info = &commit_start;
+ }

res = nfs_commit_set_lock(NFS_I(inode), may_wait);
if (res <= 0)
@@ -1687,21 +1737,24 @@ int nfs_commit_inode(struct inode *inode, int how)
nfs_init_cinfo_from_inode(&cinfo, inode);
res = nfs_scan_commit(inode, &head, &cinfo);
if (res) {
- int error;

error = nfs_generic_commit_list(inode, &head, how, &cinfo);
if (error < 0)
- return error;
+ goto error;
if (!may_wait)
goto out_mark_dirty;
error = wait_on_bit(&NFS_I(inode)->flags,
NFS_INO_COMMIT,
- nfs_wait_bit_killable,
+ (how & FLUSH_COND_CONNECTED) ?
+ nfs_wait_bit_connected : nfs_wait_bit_killable,
TASK_KILLABLE);
+ if ((how & FLUSH_COND_CONNECTED) && error == -EAGAIN)
+ goto out_mark_dirty;
if (error < 0)
- return error;
+ goto error;
} else
nfs_commit_clear_lock(NFS_I(inode));
+ current->journal_info = NULL;
return res;
/* Note: If we exit without ensuring that the commit is complete,
* we must mark the inode as dirty. Otherwise, future calls to
@@ -1709,8 +1762,12 @@ int nfs_commit_inode(struct inode *inode, int how)
* that the data is on the disk.
*/
out_mark_dirty:
+ current->journal_info = NULL;
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
return res;
+error:
+ current->journal_info = NULL;
+ return error;
}

static int nfs_commit_unstable_pages(struct inode *inode, struct writeback_control *wbc)
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 7fd81b8c4897..1a32fe0c1255 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -193,6 +193,16 @@ static inline long freezable_schedule_timeout(long timeout)
return __retval;
}

+/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
+static inline long freezable_schedule_timeout_unsafe(long timeout)
+{
+ long __retval;
+ freezer_do_not_count();
+ __retval = schedule_timeout(timeout);
+ freezer_count_unsafe();
+ return __retval;
+}
+
/*
* Like schedule_timeout_interruptible(), but should not block the freezer. Do not
* call this with locks held.
diff --git a/include/uapi/linux/nfs_fs.h b/include/uapi/linux/nfs_fs.h
index 49142287999c..896b8d976896 100644
--- a/include/uapi/linux/nfs_fs.h
+++ b/include/uapi/linux/nfs_fs.h
@@ -35,6 +35,9 @@
#define FLUSH_HIGHPRI 16 /* high priority memory reclaim flush */
#define FLUSH_COND_STABLE 32 /* conditional stable write - only stable
* if everything fits in one RPC */
+#define FLUSH_COND_CONNECTED 64 /* Don't wait for COMMIT unless client
+ * is connected to a remote host.
+ */


/*



2014-04-23 03:48:21

by NeilBrown

Subject: [PATCH 1/5] MM: avoid throttling reclaim for loop-back nfsd threads.

When a loop-back NFS mount is active and the backing device for the
NFS mount becomes congested, that can impose throttling delays on the
nfsd threads.

These delays significantly reduce throughput and so the NFS mount
remains congested.

This results in a livelock, and the reduced throughput persists.

This livelock has been found in testing with the 'wait_iff_congested'
call, and could possibly also be caused by the 'congestion_wait' call.

This livelock is similar to the deadlock which justified the
introduction of PF_LESS_THROTTLE, and the same flag can be used to
remove this livelock.

To minimise the impact of the change, we still throttle nfsd when the
filesystem it is writing to is congested, but not when some separate
filesystem (e.g. the NFS filesystem) is congested.

Signed-off-by: NeilBrown <[email protected]>
---
mm/vmscan.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..e011a646de95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1424,6 +1424,18 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
list_splice(&pages_to_free, page_list);
}

+/* If a kernel thread (such as nfsd for loop-back mounts) services
+ * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
+ * In that case we should only throttle if the backing device it is
+ * writing to is congested. In other cases it is safe to throttle.
+ */
+static int current_may_throttle(void)
+{
+ return !(current->flags & PF_LESS_THROTTLE) ||
+ current->backing_dev_info == NULL ||
+ bdi_write_congested(current->backing_dev_info);
+}
+
/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
@@ -1552,7 +1564,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* implies that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (nr_unqueued_dirty == nr_taken || nr_immediate)
+ if ((nr_unqueued_dirty == nr_taken || nr_immediate)
+ && current_may_throttle())
congestion_wait(BLK_RW_ASYNC, HZ/10);
}

@@ -1561,7 +1574,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* is congested. Allow kswapd to continue until it starts encountering
* unqueued dirty pages or cycling through the LRU too quickly.
*/
- if (!sc->hibernation_mode && !current_is_kswapd())
+ if (!sc->hibernation_mode && !current_is_kswapd()
+ && current_may_throttle())
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,



2014-04-23 03:48:39

by NeilBrown

Subject: [PATCH 3/5] nfsd: Only set PF_LESS_THROTTLE when really needed.

PF_LESS_THROTTLE has a very specific use case: to avoid deadlocks
and live-locks while writing to the page cache in a loop-back
NFS mount situation.

It therefore makes sense to *only* set PF_LESS_THROTTLE in this
situation.
We now know when a request came from the local host, so it could be a
loop-back mount. We already know when we are handling write requests,
and when we are doing anything else.

So combine those two to allow nfsd to still be throttled (like any
other process) in every situation except when it is known to be
problematic.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfssvc.c | 6 ------
fs/nfsd/vfs.c | 12 ++++++++++++
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 9a4a5f9e7468..1879e43f2868 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -591,12 +591,6 @@ nfsd(void *vrqstp)
nfsdstats.th_cnt++;
mutex_unlock(&nfsd_mutex);

- /*
- * We want less throttling in balance_dirty_pages() so that nfs to
- * localhost doesn't cause nfsd to lock up due to all the client's
- * dirty pages.
- */
- current->flags |= PF_LESS_THROTTLE;
set_freezable();

/*
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6d7be3f80356..2acd00445ad0 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -913,6 +913,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
int stable = *stablep;
int use_wgather;
loff_t pos = offset;
+ unsigned int pflags = current->flags;
+
+ if (rqstp->rq_local)
+ /*
+ * We want less throttling in balance_dirty_pages()
+ * and shrink_inactive_list() so that nfs to
+ * localhost doesn't cause nfsd to lock up due to all
+ * the client's dirty pages or its congested queue.
+ */
+ current->flags |= PF_LESS_THROTTLE;

dentry = file->f_path.dentry;
inode = dentry->d_inode;
@@ -950,6 +960,8 @@ out_nfserr:
err = 0;
else
err = nfserrno(host_err);
+ if (rqstp->rq_local)
+ tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
return err;
}




2014-04-23 22:47:52

by NeilBrown

Subject: Re: [PATCH 1/5] MM: avoid throttling reclaim for loop-back nfsd threads.

On Wed, 23 Apr 2014 15:03:18 -0700 Andrew Morton <[email protected]>
wrote:

> On Wed, 23 Apr 2014 12:40:58 +1000 NeilBrown <[email protected]> wrote:
>
> > When a loop-back NFS mount is active and the backing device for the
> > NFS mount becomes congested, that can impose throttling delays on the
> > nfsd threads.
> >
> > These delays significantly reduce throughput and so the NFS mount
> > remains congested.
> >
> > This results in a live lock and the reduced throughput persists.
> >
> > This live lock has been found in testing with the 'wait_iff_congested'
> > call, and could possibly be caused by the 'congestion_wait' call.
> >
> > This livelock is similar to the deadlock which justified the
> > introduction of PF_LESS_THROTTLE, and the same flag can be used to
> > remove this livelock.
> >
> > To minimise the impact of the change, we still throttle nfsd when the
> > filesystem it is writing to is congested, but not when some separate
> > filesystem (e.g. the NFS filesystem) is congested.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > mm/vmscan.c | 18 ++++++++++++++++--
> > 1 file changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a9c74b409681..e011a646de95 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1424,6 +1424,18 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
> > list_splice(&pages_to_free, page_list);
> > }
> >
> > +/* If a kernel thread (such as nfsd for loop-back mounts) services
>
> /*
> * If ...
>
> please
>
> > + * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
> > + * In that case we should only throttle if the backing device it is
> > + * writing to is congested. In other cases it is safe to throttle.
> > + */
> > +static int current_may_throttle(void)
> > +{
> > + return !(current->flags & PF_LESS_THROTTLE) ||
> > + current->backing_dev_info == NULL ||
> > + bdi_write_congested(current->backing_dev_info);
> > +}
> > +
> > /*
> > * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
> > * of reclaimed pages
> > @@ -1552,7 +1564,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > * implies that pages are cycling through the LRU faster than
> > * they are written so also forcibly stall.
> > */
> > - if (nr_unqueued_dirty == nr_taken || nr_immediate)
> > + if ((nr_unqueued_dirty == nr_taken || nr_immediate)
> > + && current_may_throttle())
>
> foo &&
> bar
>
> please. As you did in in current_may_throttle().
>
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > }
> >
> > @@ -1561,7 +1574,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > * is congested. Allow kswapd to continue until it starts encountering
> > * unqueued dirty pages or cycling through the LRU too quickly.
> > */
> > - if (!sc->hibernation_mode && !current_is_kswapd())
> > + if (!sc->hibernation_mode && !current_is_kswapd()
> > + && current_may_throttle())
>
> ditto
>
> > wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> >
> > trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
> >

Thanks. I've made those changes and will resend just that patch.
As it is quite independent of all the others, can you take it and I'll funnel
the others through other trees?

Thanks
NeilBrown



2014-04-24 01:21:21

by Dave Chinner

Subject: Re: [PATCH/RFC 0/5] Support loop-back NFS mounts - take 2

On Wed, Apr 23, 2014 at 12:40:58PM +1000, NeilBrown wrote:
> This is a somewhat shorter patchset for loop-back NFS support than
> last time, thanks to the excellent feedback and particularly to Dave
> Chinner. Thanks.
>
> Avoiding the wait-for-congestion which can trigger a livelock is much
> the same, though I've reduced the cases in which the wait is
> by-passed.
> I did this using current->backing_dev_info which is otherwise serving
> no purpose on the current kernel.
>
> Avoiding the deadlocks has been turned on its head.
> Instead of nfsd checking if it is a loop-back mount and setting
> PF_FSTRANS, which then needs lots of changes too PF_FSTRANS and
> __GFP_FS handling, it is now NFS which checks for a loop-back
> filesystem.
>
> There is more verbosity in that patch (Fifth of Five) but the essence
> is that nfs_release_page will now not wait indefinitely for a COMMIT
> request to complete when sent to the local host. It still waits a
> little while as some delay can be important. But it won't wait
> forever.
> The duration of "a little while" is currently 100ms, though I do
> wonder if a bigger number would serve just as well.
>
> Unlike the previous series, this set should remove deadlocks that
> could happen during the actual fail-over process. This is achieved by
> having nfs_release_page monitor the connection and if it changes from
> a remote to a local connection, or just disconnects, then it will
> timeout. It currently polls every second, though this probably could
> be longer too. It only needs to be the same order of magnitude as the
> time it takes node failure to be detected and failover to happen, and
> I suspect that is closer to 1 minute. So maybe a 10 or 20 second poll
> interval would be just as good.
>
> Implementing this timeout requires some horrible code as the
> wait_on_bit functions don't support timeouts. If the general approach
> is found acceptable I'll explore ways to improve the timeout code.
>
> Comments, criticism, etc very welcome as always,

Looks much less intrusive to me, and doesn't appear to affect any
other filesystem or the recursion patterns of memory reclaim,
so I like it very much more than the previous patchset. Nice work!
:)

The code changes are really outside my area of expertise now, so I
don't really feel qualified to review the changes. However, consider
the overall approach:

Acked-by: Dave Chinner <[email protected]>

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-23 03:48:49

by NeilBrown

Subject: [PATCH 4/5] SUNRPC: track when a client connection is routed to the local host.

If requests are being sent to the local host, then NFS will
need to take care to avoid deadlocks.

So keep track when accepting a connection or sending a UDP request
and set a flag in the svc_xprt when the peer connected to is local.

The interface rpc_is_foreign() is provided to check whether a given
client is connected to a foreign server. When it returns zero the
client is either not connected or connected to a local server, and in
either case greater care is needed.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/sunrpc/clnt.h | 1 +
include/linux/sunrpc/xprt.h | 1 +
net/sunrpc/clnt.c | 25 +++++++++++++++++++++++++
net/sunrpc/xprtsock.c | 17 +++++++++++++++++
4 files changed, 44 insertions(+)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 8af2804bab16..5d626cc5ab01 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -173,6 +173,7 @@ void rpc_force_rebind(struct rpc_clnt *);
size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
int rpc_localaddr(struct rpc_clnt *, struct sockaddr *, size_t);
+int rpc_is_foreign(struct rpc_clnt *);

#endif /* __KERNEL__ */
#endif /* _LINUX_SUNRPC_CLNT_H */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 8097b9df6773..318ee37bc358 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -340,6 +340,7 @@ int xs_swapper(struct rpc_xprt *xprt, int enable);
#define XPRT_CONNECTION_ABORT (7)
#define XPRT_CONNECTION_CLOSE (8)
#define XPRT_CONGESTED (9)
+#define XPRT_LOCAL (10)

static inline void xprt_set_connected(struct rpc_xprt *xprt)
{
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0edada973434..454cea69b373 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1109,6 +1109,31 @@ const char *rpc_peeraddr2str(struct rpc_clnt *clnt,
}
EXPORT_SYMBOL_GPL(rpc_peeraddr2str);

+/**
+ * rpc_is_foreign - report whether rpc client was recently connected
+ * to a remote host
+ * @clnt: RPC client structure
+ *
+ * If the client is not connected, or connected to the local host
+ * (any IP address), then return 0. Only return non-zero if the
+ * most recent state was a connection to a remote host.
+ * For UDP the client always appears to be connected, and the
+ * remoteness reported is that of the destination of the last transmission.
+ */
+int rpc_is_foreign(struct rpc_clnt *clnt)
+{
+ struct rpc_xprt *xprt;
+ int conn_foreign;
+
+ rcu_read_lock();
+ xprt = rcu_dereference(clnt->cl_xprt);
+ conn_foreign = (xprt && xprt_connected(xprt)
+ && !test_bit(XPRT_LOCAL, &xprt->state));
+ rcu_read_unlock();
+ return conn_foreign;
+}
+EXPORT_SYMBOL_GPL(rpc_is_foreign);
+
static const struct sockaddr_in rpc_inaddr_loopback = {
.sin_family = AF_INET,
.sin_addr.s_addr = htonl(INADDR_ANY),
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 0addefca8e77..74796cf37d5b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -642,6 +642,15 @@ static int xs_udp_send_request(struct rpc_task *task)
xdr->len - req->rq_bytes_sent, status);

if (status >= 0) {
+ struct dst_entry *dst;
+ rcu_read_lock();
+ dst = rcu_dereference(transport->sock->sk->sk_dst_cache);
+ if (dst && dst->dev && (dst->dev->features & NETIF_F_LOOPBACK))
+ set_bit(XPRT_LOCAL, &xprt->state);
+ else
+ clear_bit(XPRT_LOCAL, &xprt->state);
+ rcu_read_unlock();
+
req->rq_xmit_bytes_sent += status;
if (status >= req->rq_slen)
return 0;
@@ -1527,6 +1536,7 @@ static void xs_sock_mark_closed(struct rpc_xprt *xprt)
static void xs_tcp_state_change(struct sock *sk)
{
struct rpc_xprt *xprt;
+ struct dst_entry *dst;

read_lock_bh(&sk->sk_callback_lock);
if (!(xprt = xprt_from_sock(sk)))
@@ -1556,6 +1566,13 @@ static void xs_tcp_state_change(struct sock *sk)

xprt_wake_pending_tasks(xprt, -EAGAIN);
}
+ rcu_read_lock();
+ dst = rcu_dereference(sk->sk_dst_cache);
+ if (dst && dst->dev && (dst->dev->features & NETIF_F_LOOPBACK))
+ set_bit(XPRT_LOCAL, &xprt->state);
+ else
+ clear_bit(XPRT_LOCAL, &xprt->state);
+ rcu_read_unlock();
spin_unlock(&xprt->transport_lock);
break;
case TCP_FIN_WAIT1:



2014-04-24 12:46:22

by Anna Schumaker

Subject: Re: [PATCH 4/5] SUNRPC: track when a client connection is routed to the local host.

On 04/23/2014 07:14 PM, NeilBrown wrote:
> On Wed, 23 Apr 2014 09:44:12 -0400 Anna Schumaker <[email protected]>
> wrote:
>
>> On 04/22/2014 10:40 PM, NeilBrown wrote:
>>> If requests are being sent to the local host, then NFS will
>>> need to take care to avoid deadlocks.
>>>
>>> So keep track when accepting a connection or sending a UDP request
>>> and set a flag in the svc_xprt when the peer connected to is local.
>>>
>>> The interface rpc_is_foreign() is provided to check is a given client
>>> is connected to a foreign server. When it returns zero it is either
>>> not connected or connected to a local server and in either case
>>> greater care is needed.
>>>
>>> Signed-off-by: NeilBrown <[email protected]>
>>> ---
>>> include/linux/sunrpc/clnt.h | 1 +
>>> include/linux/sunrpc/xprt.h | 1 +
>>> net/sunrpc/clnt.c | 25 +++++++++++++++++++++++++
>>> net/sunrpc/xprtsock.c | 17 +++++++++++++++++
>>> 4 files changed, 44 insertions(+)
>>>
>>> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
>>> index 8af2804bab16..5d626cc5ab01 100644
>>> --- a/include/linux/sunrpc/clnt.h
>>> +++ b/include/linux/sunrpc/clnt.h
>>> @@ -173,6 +173,7 @@ void rpc_force_rebind(struct rpc_clnt *);
>>> size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
>>> const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
>>> int rpc_localaddr(struct rpc_clnt *, struct sockaddr *, size_t);
>>> +int rpc_is_foreign(struct rpc_clnt *);
>>>
>>> #endif /* __KERNEL__ */
>>> #endif /* _LINUX_SUNRPC_CLNT_H */
>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>> index 8097b9df6773..318ee37bc358 100644
>>> --- a/include/linux/sunrpc/xprt.h
>>> +++ b/include/linux/sunrpc/xprt.h
>>> @@ -340,6 +340,7 @@ int xs_swapper(struct rpc_xprt *xprt, int enable);
>>> #define XPRT_CONNECTION_ABORT (7)
>>> #define XPRT_CONNECTION_CLOSE (8)
>>> #define XPRT_CONGESTED (9)
>>> +#define XPRT_LOCAL (10)
>>>
>>> static inline void xprt_set_connected(struct rpc_xprt *xprt)
>>> {
>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>>> index 0edada973434..454cea69b373 100644
>>> --- a/net/sunrpc/clnt.c
>>> +++ b/net/sunrpc/clnt.c
>>> @@ -1109,6 +1109,31 @@ const char *rpc_peeraddr2str(struct rpc_clnt *clnt,
>>> }
>>> EXPORT_SYMBOL_GPL(rpc_peeraddr2str);
>>>
>>> +/**
>>> + * rpc_is_foreign - report is rpc client was recently connected to
>>> + * remote host
>>> + * @clnt: RPC client structure
>>> + *
>>> + * If the client is not connected, or connected to the local host
>>> + * (any IP address), then return 0. Only return non-zero if the
>>> + * most recent state was a connection to a remote host.
>>> + * For UDP the client always appears to be connected, and the
>>> + * remoteness of the host is that of the destination of the last transmission.
>>> + */
>>> +int rpc_is_foreign(struct rpc_clnt *clnt)
>>> +{
>>> + struct rpc_xprt *xprt;
>>> + int conn_foreign;
>>> +
>>> + rcu_read_lock();
>>> + xprt = rcu_dereference(clnt->cl_xprt);
>>> + conn_foreign = (xprt && xprt_connected(xprt)
>>> + && !test_bit(XPRT_LOCAL, &xprt->state));
>>> + rcu_read_unlock();
>>> + return conn_foreign;
>>> +}
>>> +EXPORT_SYMBOL_GPL(rpc_is_foreign);
>>> +
>>> static const struct sockaddr_in rpc_inaddr_loopback = {
>>> .sin_family = AF_INET,
>>> .sin_addr.s_addr = htonl(INADDR_ANY),
>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>> index 0addefca8e77..74796cf37d5b 100644
>>> --- a/net/sunrpc/xprtsock.c
>>> +++ b/net/sunrpc/xprtsock.c
>>> @@ -642,6 +642,15 @@ static int xs_udp_send_request(struct rpc_task *task)
>>> xdr->len - req->rq_bytes_sent, status);
>>>
>>> if (status >= 0) {
>>> + struct dst_entry *dst;
>>> + rcu_read_lock();
>>> + dst = rcu_dereference(transport->sock->sk->sk_dst_cache);
>>> + if (dst && dst->dev && (dst->dev->features & NETIF_F_LOOPBACK))
>>> + set_bit(XPRT_LOCAL, &xprt->state);
>>> + else
>>> + clear_bit(XPRT_LOCAL, &xprt->state);
>>> + rcu_read_unlock();
>>> +
>> You repeat this block of code a bit later. Can you please make it an inline helper function?
>
> Thanks for the suggestion.
> I've put
>
> static inline int sock_is_loopback(struct sock *sk)
> {
> 	struct dst_entry *dst;
> 	int loopback = 0;
> 	rcu_read_lock();
> 	dst = rcu_dereference(sk->sk_dst_cache);
> 	if (dst && dst->dev &&
> 	    (dst->dev->features & NETIF_F_LOOPBACK))
> 		loopback = 1;
> 	rcu_read_unlock();
> 	return loopback;
> }
>
>
> in sunrpc.h, and used it for both the server-side and the client side.

Awesome, thanks!

Anna
>
> NeilBrown
>


2014-04-23 13:44:36

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH 4/5] SUNRPC: track when a client connection is routed to the local host.

On 04/22/2014 10:40 PM, NeilBrown wrote:
> If requests are being sent to the local host, then NFS will
> need to take care to avoid deadlocks.
>
> So keep track when accepting a connection or sending a UDP request
> and set a flag in the svc_xprt when the peer connected to is local.
>
> The interface rpc_is_foreign() is provided to check if a given client
> is connected to a foreign server. When it returns zero it is either
> not connected or connected to a local server and in either case
> greater care is needed.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> include/linux/sunrpc/clnt.h | 1 +
> include/linux/sunrpc/xprt.h | 1 +
> net/sunrpc/clnt.c | 25 +++++++++++++++++++++++++
> net/sunrpc/xprtsock.c | 17 +++++++++++++++++
> 4 files changed, 44 insertions(+)
>
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index 8af2804bab16..5d626cc5ab01 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -173,6 +173,7 @@ void rpc_force_rebind(struct rpc_clnt *);
> size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
> const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
> int rpc_localaddr(struct rpc_clnt *, struct sockaddr *, size_t);
> +int rpc_is_foreign(struct rpc_clnt *);
>
> #endif /* __KERNEL__ */
> #endif /* _LINUX_SUNRPC_CLNT_H */
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 8097b9df6773..318ee37bc358 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -340,6 +340,7 @@ int xs_swapper(struct rpc_xprt *xprt, int enable);
> #define XPRT_CONNECTION_ABORT (7)
> #define XPRT_CONNECTION_CLOSE (8)
> #define XPRT_CONGESTED (9)
> +#define XPRT_LOCAL (10)
>
> static inline void xprt_set_connected(struct rpc_xprt *xprt)
> {
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index 0edada973434..454cea69b373 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -1109,6 +1109,31 @@ const char *rpc_peeraddr2str(struct rpc_clnt *clnt,
> }
> EXPORT_SYMBOL_GPL(rpc_peeraddr2str);
>
> +/**
> + * rpc_is_foreign - report if the rpc client was recently connected
> + * to a remote host
> + * @clnt: RPC client structure
> + *
> + * If the client is not connected, or connected to the local host
> + * (any IP address), then return 0. Only return non-zero if the
> + * most recent state was a connection to a remote host.
> + * For UDP the client always appears to be connected, and the
> + * remoteness of the host is that of the destination of the last transmission.
> + */
> +int rpc_is_foreign(struct rpc_clnt *clnt)
> +{
> + struct rpc_xprt *xprt;
> + int conn_foreign;
> +
> + rcu_read_lock();
> + xprt = rcu_dereference(clnt->cl_xprt);
> + conn_foreign = (xprt && xprt_connected(xprt)
> + && !test_bit(XPRT_LOCAL, &xprt->state));
> + rcu_read_unlock();
> + return conn_foreign;
> +}
> +EXPORT_SYMBOL_GPL(rpc_is_foreign);
> +
> static const struct sockaddr_in rpc_inaddr_loopback = {
> .sin_family = AF_INET,
> .sin_addr.s_addr = htonl(INADDR_ANY),
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 0addefca8e77..74796cf37d5b 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -642,6 +642,15 @@ static int xs_udp_send_request(struct rpc_task *task)
> xdr->len - req->rq_bytes_sent, status);
>
> if (status >= 0) {
> + struct dst_entry *dst;
> + rcu_read_lock();
> + dst = rcu_dereference(transport->sock->sk->sk_dst_cache);
> + if (dst && dst->dev && (dst->dev->features & NETIF_F_LOOPBACK))
> + set_bit(XPRT_LOCAL, &xprt->state);
> + else
> + clear_bit(XPRT_LOCAL, &xprt->state);
> + rcu_read_unlock();
> +
You repeat this block of code a bit later. Can you please make it an inline helper function?

Anna

> req->rq_xmit_bytes_sent += status;
> if (status >= req->rq_slen)
> return 0;
> @@ -1527,6 +1536,7 @@ static void xs_sock_mark_closed(struct rpc_xprt *xprt)
> static void xs_tcp_state_change(struct sock *sk)
> {
> struct rpc_xprt *xprt;
> + struct dst_entry *dst;
>
> read_lock_bh(&sk->sk_callback_lock);
> if (!(xprt = xprt_from_sock(sk)))
> @@ -1556,6 +1566,13 @@ static void xs_tcp_state_change(struct sock *sk)
>
> xprt_wake_pending_tasks(xprt, -EAGAIN);
> }
> + rcu_read_lock();
> + dst = rcu_dereference(sk->sk_dst_cache);
> + if (dst && dst->dev && (dst->dev->features & NETIF_F_LOOPBACK))
> + set_bit(XPRT_LOCAL, &xprt->state);
> + else
> + clear_bit(XPRT_LOCAL, &xprt->state);
> + rcu_read_unlock();
> spin_unlock(&xprt->transport_lock);
> break;
> case TCP_FIN_WAIT1:
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2014-04-23 22:03:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/5] MM: avoid throttling reclaim for loop-back nfsd threads.

On Wed, 23 Apr 2014 12:40:58 +1000 NeilBrown <[email protected]> wrote:

> When a loop-back NFS mount is active and the backing device for the
> NFS mount becomes congested, that can impose throttling delays on the
> nfsd threads.
>
> These delays significantly reduce throughput and so the NFS mount
> remains congested.
>
> This results in a live lock and the reduced throughput persists.
>
> This live lock has been found in testing with the 'wait_iff_congested'
> call, and could possibly be caused by the 'congestion_wait' call.
>
> This livelock is similar to the deadlock which justified the
> introduction of PF_LESS_THROTTLE, and the same flag can be used to
> remove this livelock.
>
> To minimise the impact of the change, we still throttle nfsd when the
> filesystem it is writing to is congested, but not when some separate
> filesystem (e.g. the NFS filesystem) is congested.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> mm/vmscan.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a9c74b409681..e011a646de95 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1424,6 +1424,18 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
> list_splice(&pages_to_free, page_list);
> }
>
> +/* If a kernel thread (such as nfsd for loop-back mounts) services

/*
* If ...

please

> + * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
> + * In that case we should only throttle if the backing device it is
> + * writing to is congested. In other cases it is safe to throttle.
> + */
> +static int current_may_throttle(void)
> +{
> + return !(current->flags & PF_LESS_THROTTLE) ||
> + current->backing_dev_info == NULL ||
> + bdi_write_congested(current->backing_dev_info);
> +}
> +
> /*
> * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
> * of reclaimed pages
> @@ -1552,7 +1564,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> * implies that pages are cycling through the LRU faster than
> * they are written so also forcibly stall.
> */
> - if (nr_unqueued_dirty == nr_taken || nr_immediate)
> + if ((nr_unqueued_dirty == nr_taken || nr_immediate)
> + && current_may_throttle())

foo &&
bar

please. As you did in current_may_throttle().

> congestion_wait(BLK_RW_ASYNC, HZ/10);
> }
>
> @@ -1561,7 +1574,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> * is congested. Allow kswapd to continue until it starts encountering
> * unqueued dirty pages or cycling through the LRU too quickly.
> */
> - if (!sc->hibernation_mode && !current_is_kswapd())
> + if (!sc->hibernation_mode && !current_is_kswapd()
> + && current_may_throttle())

ditto

> wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
>
> trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>

2014-04-23 23:14:31

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 4/5] SUNRPC: track when a client connection is routed to the local host.

On Wed, 23 Apr 2014 09:44:12 -0400 Anna Schumaker <[email protected]>
wrote:

> On 04/22/2014 10:40 PM, NeilBrown wrote:
> > If requests are being sent to the local host, then NFS will
> > need to take care to avoid deadlocks.
> >
> > So keep track when accepting a connection or sending a UDP request
> > and set a flag in the svc_xprt when the peer connected to is local.
> >
> > The interface rpc_is_foreign() is provided to check if a given client
> > is connected to a foreign server. When it returns zero it is either
> > not connected or connected to a local server and in either case
> > greater care is needed.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > include/linux/sunrpc/clnt.h | 1 +
> > include/linux/sunrpc/xprt.h | 1 +
> > net/sunrpc/clnt.c | 25 +++++++++++++++++++++++++
> > net/sunrpc/xprtsock.c | 17 +++++++++++++++++
> > 4 files changed, 44 insertions(+)
> >
> > diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> > index 8af2804bab16..5d626cc5ab01 100644
> > --- a/include/linux/sunrpc/clnt.h
> > +++ b/include/linux/sunrpc/clnt.h
> > @@ -173,6 +173,7 @@ void rpc_force_rebind(struct rpc_clnt *);
> > size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
> > const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
> > int rpc_localaddr(struct rpc_clnt *, struct sockaddr *, size_t);
> > +int rpc_is_foreign(struct rpc_clnt *);
> >
> > #endif /* __KERNEL__ */
> > #endif /* _LINUX_SUNRPC_CLNT_H */
> > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > index 8097b9df6773..318ee37bc358 100644
> > --- a/include/linux/sunrpc/xprt.h
> > +++ b/include/linux/sunrpc/xprt.h
> > @@ -340,6 +340,7 @@ int xs_swapper(struct rpc_xprt *xprt, int enable);
> > #define XPRT_CONNECTION_ABORT (7)
> > #define XPRT_CONNECTION_CLOSE (8)
> > #define XPRT_CONGESTED (9)
> > +#define XPRT_LOCAL (10)
> >
> > static inline void xprt_set_connected(struct rpc_xprt *xprt)
> > {
> > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> > index 0edada973434..454cea69b373 100644
> > --- a/net/sunrpc/clnt.c
> > +++ b/net/sunrpc/clnt.c
> > @@ -1109,6 +1109,31 @@ const char *rpc_peeraddr2str(struct rpc_clnt *clnt,
> > }
> > EXPORT_SYMBOL_GPL(rpc_peeraddr2str);
> >
> > +/**
> > + * rpc_is_foreign - report if the rpc client was recently connected
> > + * to a remote host
> > + * @clnt: RPC client structure
> > + *
> > + * If the client is not connected, or connected to the local host
> > + * (any IP address), then return 0. Only return non-zero if the
> > + * most recent state was a connection to a remote host.
> > + * For UDP the client always appears to be connected, and the
> > + * remoteness of the host is that of the destination of the last transmission.
> > + */
> > +int rpc_is_foreign(struct rpc_clnt *clnt)
> > +{
> > + struct rpc_xprt *xprt;
> > + int conn_foreign;
> > +
> > + rcu_read_lock();
> > + xprt = rcu_dereference(clnt->cl_xprt);
> > + conn_foreign = (xprt && xprt_connected(xprt)
> > + && !test_bit(XPRT_LOCAL, &xprt->state));
> > + rcu_read_unlock();
> > + return conn_foreign;
> > +}
> > +EXPORT_SYMBOL_GPL(rpc_is_foreign);
> > +
> > static const struct sockaddr_in rpc_inaddr_loopback = {
> > .sin_family = AF_INET,
> > .sin_addr.s_addr = htonl(INADDR_ANY),
> > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > index 0addefca8e77..74796cf37d5b 100644
> > --- a/net/sunrpc/xprtsock.c
> > +++ b/net/sunrpc/xprtsock.c
> > @@ -642,6 +642,15 @@ static int xs_udp_send_request(struct rpc_task *task)
> > xdr->len - req->rq_bytes_sent, status);
> >
> > if (status >= 0) {
> > + struct dst_entry *dst;
> > + rcu_read_lock();
> > + dst = rcu_dereference(transport->sock->sk->sk_dst_cache);
> > + if (dst && dst->dev && (dst->dev->features & NETIF_F_LOOPBACK))
> > + set_bit(XPRT_LOCAL, &xprt->state);
> > + else
> > + clear_bit(XPRT_LOCAL, &xprt->state);
> > + rcu_read_unlock();
> > +
> You repeat this block of code a bit later. Can you please make it an inline helper function?

Thanks for the suggestion.
I've put

static inline int sock_is_loopback(struct sock *sk)
{
	struct dst_entry *dst;
	int loopback = 0;
	rcu_read_lock();
	dst = rcu_dereference(sk->sk_dst_cache);
	if (dst && dst->dev &&
	    (dst->dev->features & NETIF_F_LOOPBACK))
		loopback = 1;
	rcu_read_unlock();
	return loopback;
}


in sunrpc.h, and used it for both the server-side and the client side.

NeilBrown


Attachments:
signature.asc (828.00 B)

2014-05-06 20:54:26

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 3/5] nfsd: Only set PF_LESS_THROTTLE when really needed.

On Wed, Apr 23, 2014 at 12:40:58PM +1000, NeilBrown wrote:
> PF_LESS_THROTTLE has a very specific use case: to avoid deadlocks
> and live-locks while writing to the page cache in a loop-back
> NFS mount situation.
>
> It therefore makes sense to *only* set PF_LESS_THROTTLE in this
> situation.
> We now know when a request came from the local-host so it could be a
> loop-back mount. We already know when we are handling write requests,
> and when we are doing anything else.
>
> So combine those two to allow nfsd to still be throttled (like any
> other process) in every situation except when it is known to be
> problematic.

Looks simple enough, ACK.--b.

>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfsd/nfssvc.c | 6 ------
> fs/nfsd/vfs.c | 12 ++++++++++++
> 2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 9a4a5f9e7468..1879e43f2868 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -591,12 +591,6 @@ nfsd(void *vrqstp)
> nfsdstats.th_cnt++;
> mutex_unlock(&nfsd_mutex);
>
> - /*
> - * We want less throttling in balance_dirty_pages() so that nfs to
> - * localhost doesn't cause nfsd to lock up due to all the client's
> - * dirty pages.
> - */
> - current->flags |= PF_LESS_THROTTLE;
> set_freezable();
>
> /*
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 6d7be3f80356..2acd00445ad0 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -913,6 +913,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> int stable = *stablep;
> int use_wgather;
> loff_t pos = offset;
> + unsigned int pflags = current->flags;
> +
> + if (rqstp->rq_local)
> + /*
> + * We want less throttling in balance_dirty_pages()
> + * and shrink_inactive_list() so that nfs to
> + * localhost doesn't cause nfsd to lock up due to all
> + * the client's dirty pages or its congested queue.
> + */
> + current->flags |= PF_LESS_THROTTLE;
>
> dentry = file->f_path.dentry;
> inode = dentry->d_inode;
> @@ -950,6 +960,8 @@ out_nfserr:
> err = 0;
> else
> err = nfserrno(host_err);
> + if (rqstp->rq_local)
> + tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> return err;
> }
>
>
>

2014-05-12 15:32:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/5] nfsd: Only set PF_LESS_THROTTLE when really needed.

On Mon 12-05-14 11:04:37, NeilBrown wrote:
> On Tue, 06 May 2014 17:05:01 -0400 Rik van Riel <[email protected]> wrote:
>
> > On 04/22/2014 10:40 PM, NeilBrown wrote:
> > > PF_LESS_THROTTLE has a very specific use case: to avoid deadlocks
> > > and live-locks while writing to the page cache in a loop-back
> > > NFS mount situation.
> > >
> > > It therefore makes sense to *only* set PF_LESS_THROTTLE in this
> > > situation.
> > > We now know when a request came from the local-host so it could be a
> > > loop-back mount. We already know when we are handling write requests,
> > > and when we are doing anything else.
> > >
> > > So combine those two to allow nfsd to still be throttled (like any
> > > other process) in every situation except when it is known to be
> > > problematic.
> >
> > The FUSE code has something similar, but on the "client"
> > side.
> >
> > See BDI_CAP_STRICTLIMIT in mm/writeback.c
> >
> > Would it make sense to use that flag on loopback-mounted
> > NFS filesystems?
> >
>
> I don't think so.
>
> I don't fully understand BDI_CAP_STRICTLIMIT, but it seems to be very
> fuse-specific and relates to NR_WRITEBACK_TEMP, which only fuse uses. NFS
> doesn't need any 'strict' limits.
> i.e. it looks like fuse-specific code inside core-vm code, which I would
> rather steer clear of.
It doesn't really relate to NR_WRITEBACK_TEMP. We have two dirty limits
in the VM - the global one and a per-bdi one (which is a fraction of the
global one, computed based on how much the device has been writing back in the
past). Normally, until we have more than (dirty_limit +
dirty_background_limit) / 2 dirty pages globally, the per bdi limit is
ignored. And BDI_CAP_STRICTLIMIT means that the per-bdi dirty limit is
always observed. Together with max_ratio and min_ratio this is useful for
limiting the amount of dirty pages for specific bdis. And FUSE uses it so that
userspace filesystems cannot easily lock up the system by creating lots of
dirty pages which cannot be written back.

So I actually don't think BDI_CAP_STRICTLIMIT is a particularly good fit
for your problem although I agree with Rik that FUSE faces a similar
problem.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-05-12 01:04:50

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 3/5] nfsd: Only set PF_LESS_THROTTLE when really needed.

On Tue, 06 May 2014 17:05:01 -0400 Rik van Riel <[email protected]> wrote:

> On 04/22/2014 10:40 PM, NeilBrown wrote:
> > PF_LESS_THROTTLE has a very specific use case: to avoid deadlocks
> > and live-locks while writing to the page cache in a loop-back
> > NFS mount situation.
> >
> > It therefore makes sense to *only* set PF_LESS_THROTTLE in this
> > situation.
> > We now know when a request came from the local-host so it could be a
> > loop-back mount. We already know when we are handling write requests,
> > and when we are doing anything else.
> >
> > So combine those two to allow nfsd to still be throttled (like any
> > other process) in every situation except when it is known to be
> > problematic.
>
> The FUSE code has something similar, but on the "client"
> side.
>
> See BDI_CAP_STRICTLIMIT in mm/writeback.c
>
> Would it make sense to use that flag on loopback-mounted
> NFS filesystems?
>

I don't think so.

I don't fully understand BDI_CAP_STRICTLIMIT, but it seems to be very
fuse-specific and relates to NR_WRITEBACK_TEMP, which only fuse uses. NFS
doesn't need any 'strict' limits.
i.e. it looks like fuse-specific code inside core-vm code, which I would
rather steer clear of.

Setting a bdi flag for a loopback-mounted NFS filesystem isn't really
possible because the "is it loopback mounted" state is fluid. IP addresses can
be migrated (for HA cluster failover) and what was originally a remote-NFS
mount can become a loopback NFS mount (and that is exactly the case I need to
deal with).

So we can only really assess "is it loop-back" on a per-request basis.

This patch does that assessment in nfsd to limit the use of PF_LESS_THROTTLE.
Another patch does it in nfs to limit the waiting in nfs_release_page.

Thanks,
NeilBrown


Attachments:
signature.asc (828.00 B)

2014-05-12 01:06:07

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 3/5] nfsd: Only set PF_LESS_THROTTLE when really needed.

On Tue, 6 May 2014 16:54:18 -0400 "J. Bruce Fields" <[email protected]>
wrote:

> On Wed, Apr 23, 2014 at 12:40:58PM +1000, NeilBrown wrote:
> > PF_LESS_THROTTLE has a very specific use case: to avoid deadlocks
> > and live-locks while writing to the page cache in a loop-back
> > NFS mount situation.
> >
> > It therefore makes sense to *only* set PF_LESS_THROTTLE in this
> > situation.
> > We now know when a request came from the local-host so it could be a
> > loop-back mount. We already know when we are handling write requests,
> > and when we are doing anything else.
> >
> > So combine those two to allow nfsd to still be throttled (like any
> > other process) in every situation except when it is known to be
> > problematic.
>
> Looks simple enough, ACK.--b.
>

Thanks.
I'll resend the bits needed for just this.

The NFS side needs to wait for wait_on_bit improvements, which seem to be on a
slow path at the moment.

Thanks,
NeilBrown


Attachments:
signature.asc (828.00 B)

2014-05-06 21:05:25

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 3/5] nfsd: Only set PF_LESS_THROTTLE when really needed.

On 04/22/2014 10:40 PM, NeilBrown wrote:
> PF_LESS_THROTTLE has a very specific use case: to avoid deadlocks
> and live-locks while writing to the page cache in a loop-back
> NFS mount situation.
>
> It therefore makes sense to *only* set PF_LESS_THROTTLE in this
> situation.
> We now know when a request came from the local-host so it could be a
> loop-back mount. We already know when we are handling write requests,
> and when we are doing anything else.
>
> So combine those two to allow nfsd to still be throttled (like any
> other process) in every situation except when it is known to be
> problematic.

The FUSE code has something similar, but on the "client"
side.

See BDI_CAP_STRICTLIMIT in mm/writeback.c

Would it make sense to use that flag on loopback-mounted
NFS filesystems?

--
All rights reversed