2016-12-16 16:48:27

by Chuck Lever III

Subject: [PATCH v1] NFS: Detect unreachable NFS/RDMA servers more reliably

Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.
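[Editorial note: for socket transports, the transport-level keepalive in question is ordinary TCP keepalive, configured per socket. A minimal userspace sketch of enabling it follows as a point of contrast with RDMA RC, which has no equivalent; the timing values are illustrative assumptions, not what the kernel RPC client configures.]

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on a socket. The idle/interval/count values
 * below are illustrative only. */
static int enable_keepalive(int sock)
{
	int on = 1;
	int idle = 60;   /* seconds idle before the first probe */
	int intvl = 10;  /* seconds between probes */
	int cnt = 5;     /* failed probes before the peer is declared dead */

	if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	/* TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific */
	if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
		return -1;
	if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
		return -1;
	return setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}
```

With TCP, probes flow and a dead peer eventually resets the connection; an RDMA RC queue pair sends nothing comparable on its own, which is the gap this patch addresses.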

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, we reserve one RPC-over-RDMA credit that may be
used only for an NFS NULL. A periodic RPC ping is done on transports
whenever there are outstanding RPCs.
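[Editorial note: the credit reservation can be modeled in a few lines. This is a userspace illustration, not the kernel source, and the value of RPC_CWNDSHIFT here is an assumption for the sketch; the actual update appears in the rpc_rdma.c hunk of the patch.]

```c
/* Model of the congestion-window update: of the credits granted by
 * the server, one is held back so a keepalive NULL can always be
 * sent, and the remainder is scaled as the RPC congestion logic
 * expects. RPC_CWNDSHIFT's value is an illustrative assumption. */
#define RPC_CWNDSHIFT 8UL

static unsigned long cwnd_with_reserved_credit(unsigned long credits)
{
	if (credits == 0)
		return 0;	/* no grants at all */
	return (credits - 1) << RPC_CWNDSHIFT;
}
```

Note the consequence: with a single granted credit the window for regular NFS traffic is zero, since the lone credit is reserved for the ping.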

The purpose of this ping is to drive traffic regularly on each
connection to force the transport layer to disconnect it if it is no
longer viable. Some RDMA operations are fully offloaded to the HCA,
and can be successful even if the remote host has crashed. Thus an
operation that requires that the server is responsive is used for
the ping.

This implementation re-uses existing generic RPC infrastructure to
form each NULL Call. An rpc_clnt context must be available to start
an RPC. Thus a generic keepalive mechanism is introduced so that
both an rpc_clnt and an rpc_xprt are available to perform the ping.

Signed-off-by: Chuck Lever <[email protected]>
---

Before sending this for internal testing, I'd like to hear comments
on this approach. It's a little more churn than I had hoped for.


fs/nfs/nfs4client.c | 1
include/linux/sunrpc/clnt.h | 2 +
include/linux/sunrpc/sched.h | 3 +
include/linux/sunrpc/xprt.h | 1
net/sunrpc/clnt.c | 101 +++++++++++++++++++++++++++++++++++++++
net/sunrpc/sched.c | 19 +++++++
net/sunrpc/xprt.c | 5 ++
net/sunrpc/xprtrdma/rpc_rdma.c | 4 +-
net/sunrpc/xprtrdma/transport.c | 13 +++++
9 files changed, 148 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 074ac71..c5f5ce8 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct nfs_client *clp,
error = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_UNIX);
if (error < 0)
goto error;
+ rpc_schedule_keepalive(clp->cl_rpcclient);

/* If no clientaddr= option was specified, find a usable cb address */
if (ip_addr == NULL) {
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 85cc819..443a955 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -69,6 +69,7 @@ struct rpc_clnt {
struct dentry *cl_debugfs; /* debugfs directory */
#endif
struct rpc_xprt_iter cl_xpi;
+ struct delayed_work cl_ka_worker;
};

/*
@@ -187,6 +188,7 @@ struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred,
size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
int rpc_localaddr(struct rpc_clnt *, struct sockaddr *, size_t);
+void rpc_schedule_keepalive(struct rpc_clnt *clnt);

int rpc_clnt_iterate_for_each_xprt(struct rpc_clnt *clnt,
int (*fn)(struct rpc_clnt *, struct rpc_xprt *, void *),
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 7ba040c..fd5d7ca 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -127,6 +127,7 @@ struct rpc_task_setup {
#define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
#define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
#define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
+#define RPC_TASK_PRIORITY 0x8000 /* skip congestion control */

#define RPC_IS_ASYNC(t) ((t)->tk_flags & RPC_TASK_ASYNC)
#define RPC_IS_SWAPPER(t) ((t)->tk_flags & RPC_TASK_SWAPPER)
@@ -135,6 +136,7 @@ struct rpc_task_setup {
#define RPC_IS_SOFT(t) ((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
#define RPC_IS_SOFTCONN(t) ((t)->tk_flags & RPC_TASK_SOFTCONN)
#define RPC_WAS_SENT(t) ((t)->tk_flags & RPC_TASK_SENT)
+#define RPC_HAS_PRIORITY(t) ((t)->tk_flags & RPC_TASK_PRIORITY)

#define RPC_TASK_RUNNING 0
#define RPC_TASK_QUEUED 1
@@ -238,6 +240,7 @@ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *,
bool (*)(struct rpc_task *, void *),
void *);
void rpc_wake_up_status(struct rpc_wait_queue *, int);
+bool rpc_wait_queue_is_active(struct rpc_wait_queue *queue);
void rpc_delay(struct rpc_task *, unsigned long);
int rpc_malloc(struct rpc_task *);
void rpc_free(struct rpc_task *);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index a5da60b..603cd67 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -222,6 +222,7 @@ struct rpc_xprt {
unsigned long last_used,
idle_timeout,
max_reconnect_timeout;
+ bool keepalive;

/*
* Send stuff
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 62a4827..ff46c79 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -79,6 +79,7 @@
static __be32 *rpc_encode_header(struct rpc_task *task);
static __be32 *rpc_verify_header(struct rpc_task *task);
static int rpc_ping(struct rpc_clnt *clnt);
+static void rpc_clnt_keepalive(struct work_struct *work);

static void rpc_register_client(struct rpc_clnt *clnt)
{
@@ -413,6 +414,7 @@ static struct rpc_clnt * rpc_new_client(const struct rpc_create_args *args,
rpc_clnt_set_transport(clnt, xprt, timeout);
xprt_iter_init(&clnt->cl_xpi, xps);
xprt_switch_put(xps);
+ INIT_DELAYED_WORK(&clnt->cl_ka_worker, rpc_clnt_keepalive);

clnt->cl_rtt = &clnt->cl_rtt_default;
rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout->to_initval);
@@ -871,6 +873,7 @@ void rpc_shutdown_client(struct rpc_clnt *clnt)
rcu_dereference(clnt->cl_xprt)->servername);
if (clnt->cl_parent != clnt)
parent = clnt->cl_parent;
+ cancel_delayed_work_sync(&clnt->cl_ka_worker);
rpc_clnt_debugfs_unregister(clnt);
rpc_clnt_remove_pipedir(clnt);
rpc_unregister_client(clnt);
@@ -2782,6 +2785,104 @@ bool rpc_clnt_xprt_switch_has_addr(struct rpc_clnt *clnt,
}
EXPORT_SYMBOL_GPL(rpc_clnt_xprt_switch_has_addr);

+struct rpc_keepalive_calldata {
+ struct rpc_xprt *xprt;
+};
+
+static void rpc_keepalive_done(struct rpc_task *task, void *calldata)
+{
+ struct rpc_keepalive_calldata *data = calldata;
+
+ dprintk("RPC: %s: keepalive ping on xprt %p, status %d\n",
+ __func__, data->xprt, task->tk_status);
+
+ if (task->tk_status)
+ xprt_force_disconnect(data->xprt);
+}
+
+static void rpc_keepalive_release(void *calldata)
+{
+ struct rpc_keepalive_calldata *data = calldata;
+
+ data->xprt->keepalive = true;
+ xprt_put(data->xprt);
+ kfree(data);
+}
+
+static const struct rpc_call_ops rpc_keepalive_call_ops = {
+ .rpc_call_done = rpc_keepalive_done,
+ .rpc_release = rpc_keepalive_release,
+};
+
+static int rpc_xprt_keepalive(struct rpc_clnt *clnt, struct rpc_xprt *xprt,
+ void *unused)
+{
+ struct rpc_keepalive_calldata *data;
+ struct rpc_cred *cred;
+ struct rpc_task *task;
+
+ if (!xprt->keepalive)
+ goto out;
+ if (!xprt_connected(xprt))
+ goto out;
+
+ /* When there are no pending RPCs, squelch keepalive so that a
+ * truly idle connection can be auto-closed.
+ */
+ if (!rpc_wait_queue_is_active(&xprt->pending))
+ goto out;
+
+ dprintk("RPC: %s: sending keepalive ping on xprt %p\n",
+ __func__, xprt);
+
+ data = kmalloc(sizeof(*data), GFP_NOFS);
+ if (!data)
+ goto out;
+ data->xprt = xprt_get(xprt);
+
+ /* Send only one keepalive ping at a time.
+ */
+ xprt->keepalive = false;
+
+ cred = authnull_ops.lookup_cred(NULL, NULL, 0);
+ task = rpc_call_null_helper(clnt, xprt, cred,
+ RPC_TASK_SOFT |
+ RPC_TASK_ASYNC |
+ RPC_TASK_PRIORITY,
+ &rpc_keepalive_call_ops,
+ data);
+
+ put_rpccred(cred);
+ if (!IS_ERR(task))
+ rpc_put_task(task);
+out:
+ return 0;
+}
+
+static void rpc_clnt_keepalive(struct work_struct *work)
+{
+ struct rpc_clnt *clnt = container_of(work, struct rpc_clnt,
+ cl_ka_worker.work);
+
+ rpc_clnt_iterate_for_each_xprt(clnt, rpc_xprt_keepalive, NULL);
+ rpc_schedule_keepalive(clnt);
+}
+
+/**
+ * rpc_schedule_keepalive - Start keepalive heartbeat
+ * @clnt: rpc_clnt with transports that might need keepalive
+ *
+ * For transport classes that do not have a native keepalive mechanism,
+ * detect dead transports as quickly as possible. An RPC NULL is used
+ * as the ping.
+ */
+void rpc_schedule_keepalive(struct rpc_clnt *clnt)
+{
+ schedule_delayed_work(&clnt->cl_ka_worker,
+ clnt->cl_timeout->to_initval >> 1);
+}
+EXPORT_SYMBOL_GPL(rpc_schedule_keepalive);
+
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
static void rpc_show_header(void)
{
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 5db68b3..bb98a9f 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -635,6 +635,25 @@ void rpc_wake_up_status(struct rpc_wait_queue *queue, int status)
}
EXPORT_SYMBOL_GPL(rpc_wake_up_status);

+/**
+ * rpc_wait_queue_is_active - check if there are queue waiters
+ * @queue: rpc_wait_queue on which the tasks are sleeping
+ *
+ * Grabs queue->lock
+ */
+bool rpc_wait_queue_is_active(struct rpc_wait_queue *queue)
+{
+ struct list_head *head;
+ bool result;
+
+ spin_lock_bh(&queue->lock);
+ head = &queue->tasks[queue->maxpriority];
+ result = !list_empty(head);
+ spin_unlock_bh(&queue->lock);
+
+ return result;
+}
+
static void __rpc_queue_timer_fn(unsigned long ptr)
{
struct rpc_wait_queue *queue = (struct rpc_wait_queue *)ptr;
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 685e6d2..941949c 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta
{
struct rpc_rqst *req = task->tk_rqstp;

+ if (RPC_HAS_PRIORITY(task)) {
+ req->rq_cong = 0;
+ return 1;
+ }
if (req->rq_cong)
return 1;
dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = %lu\n",
@@ -1328,6 +1332,7 @@ static void xprt_init(struct rpc_xprt *xprt, struct net *net)
xprt->last_used = jiffies;
xprt->cwnd = RPC_INITCWND;
xprt->bind_index = 0;
+ xprt->keepalive = false;

rpc_init_wait_queue(&xprt->binding, "xprt_binding");
rpc_init_wait_queue(&xprt->pending, "xprt_pending");
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c52e0f2..9631fcf 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1083,7 +1083,9 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,

spin_lock_bh(&xprt->transport_lock);
cwnd = xprt->cwnd;
- xprt->cwnd = atomic_read(&r_xprt->rx_buf.rb_credits) << RPC_CWNDSHIFT;
+ /* Reserve one credit for keepalive ping */
+ xprt->cwnd = atomic_read(&r_xprt->rx_buf.rb_credits) - 1;
+ xprt->cwnd <<= RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)
xprt_release_rqst_cong(rqst->rq_task);

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 534c178..cb6e67b 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -312,6 +312,18 @@
module_put(THIS_MODULE);
}

+static bool
+rpcrdma_need_keepalive(struct rpcrdma_xprt *r_xprt)
+{
+ struct rdma_cm_id *id = r_xprt->rx_ia.ri_id;
+
+ /* RDMA RC on InfiniBand has no native keepalive
+ * mechanism. iWARP runs on a lower layer that
+ * already provides keepalive.
+ */
+ return !rdma_protocol_iwarp(id->device, id->port_num);
+}
+
static const struct rpc_timeout xprt_rdma_default_timeout = {
.to_initval = 60 * HZ,
.to_maxval = 60 * HZ,
@@ -433,6 +445,7 @@
xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
if (xprt->max_payload == 0)
goto out4;
+ xprt->keepalive = rpcrdma_need_keepalive(new_xprt);
xprt->max_payload <<= PAGE_SHIFT;
dprintk("RPC: %s: transport data payload maximum: %zu bytes\n",
__func__, xprt->max_payload);



2017-01-12 17:38:49

by Trond Myklebust

Subject: Re: [PATCH v1] NFS: Detect unreachable NFS/RDMA servers more reliably

On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
> Current NFS clients rely on connection loss to determine when to
> retransmit. In particular, for protocols like NFSv4, clients no
> longer rely on RPC timeouts to drive retransmission: NFSv4 servers
> are required to terminate a connection when they need a client to
> retransmit pending RPCs.
>
> When a server is no longer reachable, either because it has crashed
> or because the network path has broken, the server cannot actively
> terminate a connection. Thus NFS clients depend on transport-level
> keepalive to determine when a connection must be replaced and
> pending RPCs retransmitted.
>
> However, RDMA RC connections do not have a native keepalive
> mechanism. If an NFS/RDMA server crashes after a client has sent
> RPCs successfully (an RC ACK has been received for all OTW RDMA
> requests), there is no way for the client to know the connection is
> moribund.
>
> In addition, new RDMA requests are subject to the RPC-over-RDMA
> credit limit. If the client has consumed all granted credits with
> NFS traffic, it is not allowed to send another RDMA request until
> the server replies. Thus it has no way to send a true keepalive when
> the workload has already consumed all credits with pending RPCs.
>
> To address this, we reserve one RPC-over-RDMA credit that may be
> used only for an NFS NULL. A periodic RPC ping is done on transports
> whenever there are outstanding RPCs.
>
> The purpose of this ping is to drive traffic regularly on each
> connection to force the transport layer to disconnect it if it is no
> longer viable. Some RDMA operations are fully offloaded to the HCA,
> and can be successful even if the remote host has crashed. Thus an
> operation that requires that the server is responsive is used for
> the ping.
>
> This implementation re-uses existing generic RPC infrastructure to
> form each NULL Call. An rpc_clnt context must be available to start
> an RPC. Thus a generic keepalive mechanism is introduced so that
> both an rpc_clnt and an rpc_xprt is available to perform the ping.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
>
> Before sending this for internal testing, I'd like to hear comments
> on this approach. It's a little more churn than I had hoped for.
>
>
> fs/nfs/nfs4client.c | 1
> include/linux/sunrpc/clnt.h | 2 +
> include/linux/sunrpc/sched.h | 3 +
> include/linux/sunrpc/xprt.h | 1
> net/sunrpc/clnt.c | 101 +++++++++++++++++++++++++++++++++++++++
> net/sunrpc/sched.c | 19 +++++++
> net/sunrpc/xprt.c | 5 ++
> net/sunrpc/xprtrdma/rpc_rdma.c | 4 +-
> net/sunrpc/xprtrdma/transport.c | 13 +++++
> 9 files changed, 148 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index 074ac71..c5f5ce8 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct nfs_client *clp,
> 	error = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_UNIX);
> 	if (error < 0)
> 		goto error;
> +	rpc_schedule_keepalive(clp->cl_rpcclient);

Why do we want to enable this for non-RDMA transports? Shouldn't this
functionality be hidden in the RDMA client code, in the same way that
the TCP keepalive is hidden in the socket code.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2017-01-12 17:42:13

by Chuck Lever III

Subject: Re: [PATCH v1] NFS: Detect unreachable NFS/RDMA servers more reliably


> On Jan 12, 2017, at 12:38 PM, Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
>> Current NFS clients rely on connection loss to determine when to
>> retransmit. In particular, for protocols like NFSv4, clients no
>> longer rely on RPC timeouts to drive retransmission: NFSv4 servers
>> are required to terminate a connection when they need a client to
>> retransmit pending RPCs.
>>
>> When a server is no longer reachable, either because it has crashed
>> or because the network path has broken, the server cannot actively
>> terminate a connection. Thus NFS clients depend on transport-level
>> keepalive to determine when a connection must be replaced and
>> pending RPCs retransmitted.
>>
>> However, RDMA RC connections do not have a native keepalive
>> mechanism. If an NFS/RDMA server crashes after a client has sent
>> RPCs successfully (an RC ACK has been received for all OTW RDMA
>> requests), there is no way for the client to know the connection is
>> moribund.
>>
>> In addition, new RDMA requests are subject to the RPC-over-RDMA
>> credit limit. If the client has consumed all granted credits with
>> NFS traffic, it is not allowed to send another RDMA request until
>> the server replies. Thus it has no way to send a true keepalive when
>> the workload has already consumed all credits with pending RPCs.
>>
>> To address this, we reserve one RPC-over-RDMA credit that may be
>> used only for an NFS NULL. A periodic RPC ping is done on transports
>> whenever there are outstanding RPCs.
>>
>> The purpose of this ping is to drive traffic regularly on each
>> connection to force the transport layer to disconnect it if it is no
>> longer viable. Some RDMA operations are fully offloaded to the HCA,
>> and can be successful even if the remote host has crashed. Thus an
>> operation that requires that the server is responsive is used for
>> the ping.
>>
>> This implementation re-uses existing generic RPC infrastructure to
>> form each NULL Call. An rpc_clnt context must be available to start
>> an RPC. Thus a generic keepalive mechanism is introduced so that
>> both an rpc_clnt and an rpc_xprt is available to perform the ping.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>>
>> Before sending this for internal testing, I'd like to hear comments
>> on this approach. It's a little more churn than I had hoped for.
>>
>>
>> fs/nfs/nfs4client.c | 1
>> include/linux/sunrpc/clnt.h | 2 +
>> include/linux/sunrpc/sched.h | 3 +
>> include/linux/sunrpc/xprt.h | 1
>> net/sunrpc/clnt.c | 101
>> +++++++++++++++++++++++++++++++++++++++
>> net/sunrpc/sched.c | 19 +++++++
>> net/sunrpc/xprt.c | 5 ++
>> net/sunrpc/xprtrdma/rpc_rdma.c | 4 +-
>> net/sunrpc/xprtrdma/transport.c | 13 +++++
>> 9 files changed, 148 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
>> index 074ac71..c5f5ce8 100644
>> --- a/fs/nfs/nfs4client.c
>> +++ b/fs/nfs/nfs4client.c
>> @@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct
>> nfs_client *clp,
>> error = nfs_create_rpc_client(clp, cl_init,
>> RPC_AUTH_UNIX);
>> if (error < 0)
>> goto error;
>> + rpc_schedule_keepalive(clp->cl_rpcclient);
>
> Why do we want to enable this for non-RDMA transports? Shouldn't this
> functionality be hidden in the RDMA client code, in the same way that
> the TCP keepalive is hidden in the socket code.

Sending a NULL request by re-using the normal RPC infrastructure
requires a struct rpc_clnt. Thus it has to be driven by an upper
layer context.

I'm open to suggestions.


--
Chuck Lever




2017-01-12 22:15:13

by Trond Myklebust

Subject: Re: [PATCH v1] NFS: Detect unreachable NFS/RDMA servers more reliably


> On Jan 12, 2017, at 12:42, Chuck Lever <[email protected]> wrote:
>
>
>> On Jan 12, 2017, at 12:38 PM, Trond Myklebust <[email protected]> wrote:
>>
>> On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
>>> Current NFS clients rely on connection loss to determine when to
>>> retransmit. In particular, for protocols like NFSv4, clients no
>>> longer rely on RPC timeouts to drive retransmission: NFSv4 servers
>>> are required to terminate a connection when they need a client to
>>> retransmit pending RPCs.
>>>
>>> When a server is no longer reachable, either because it has crashed
>>> or because the network path has broken, the server cannot actively
>>> terminate a connection. Thus NFS clients depend on transport-level
>>> keepalive to determine when a connection must be replaced and
>>> pending RPCs retransmitted.
>>>
>>> However, RDMA RC connections do not have a native keepalive
>>> mechanism. If an NFS/RDMA server crashes after a client has sent
>>> RPCs successfully (an RC ACK has been received for all OTW RDMA
>>> requests), there is no way for the client to know the connection is
>>> moribund.
>>>
>>> In addition, new RDMA requests are subject to the RPC-over-RDMA
>>> credit limit. If the client has consumed all granted credits with
>>> NFS traffic, it is not allowed to send another RDMA request until
>>> the server replies. Thus it has no way to send a true keepalive when
>>> the workload has already consumed all credits with pending RPCs.
>>>
>>> To address this, we reserve one RPC-over-RDMA credit that may be
>>> used only for an NFS NULL. A periodic RPC ping is done on transports
>>> whenever there are outstanding RPCs.
>>>
>>> The purpose of this ping is to drive traffic regularly on each
>>> connection to force the transport layer to disconnect it if it is no
>>> longer viable. Some RDMA operations are fully offloaded to the HCA,
>>> and can be successful even if the remote host has crashed. Thus an
>>> operation that requires that the server is responsive is used for
>>> the ping.
>>>
>>> This implementation re-uses existing generic RPC infrastructure to
>>> form each NULL Call. An rpc_clnt context must be available to start
>>> an RPC. Thus a generic keepalive mechanism is introduced so that
>>> both an rpc_clnt and an rpc_xprt is available to perform the ping.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>>
>>> Before sending this for internal testing, I'd like to hear comments
>>> on this approach. It's a little more churn than I had hoped for.
>>>
>>>
>>> fs/nfs/nfs4client.c | 1
>>> include/linux/sunrpc/clnt.h | 2 +
>>> include/linux/sunrpc/sched.h | 3 +
>>> include/linux/sunrpc/xprt.h | 1
>>> net/sunrpc/clnt.c | 101
>>> +++++++++++++++++++++++++++++++++++++++
>>> net/sunrpc/sched.c | 19 +++++++
>>> net/sunrpc/xprt.c | 5 ++
>>> net/sunrpc/xprtrdma/rpc_rdma.c | 4 +-
>>> net/sunrpc/xprtrdma/transport.c | 13 +++++
>>> 9 files changed, 148 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
>>> index 074ac71..c5f5ce8 100644
>>> --- a/fs/nfs/nfs4client.c
>>> +++ b/fs/nfs/nfs4client.c
>>> @@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct
>>> nfs_client *clp,
>>> 		error = nfs_create_rpc_client(clp, cl_init,
>>> RPC_AUTH_UNIX);
>>> 	if (error < 0)
>>> 		goto error;
>>> +	rpc_schedule_keepalive(clp->cl_rpcclient);
>>
>> Why do we want to enable this for non-RDMA transports? Shouldn't this
>> functionality be hidden in the RDMA client code, in the same way that
>> the TCP keepalive is hidden in the socket code.
>
> Sending a NULL request by re-using the normal RPC infrastructure
> requires a struct rpc_clnt. Thus it has to be driven by an upper
> layer context.
>
> I'm open to suggestions.
>

Ideally we just want this to operate when there are outstanding RPC calls waiting for a reply, am I correct?

If so, perhaps we might have it triggered by a timer that is armed in xprt->ops->send_request() and disarmed in xprt->ops->release_xprt()? It might then configure itself by looking in the xprt->recv list to find a hanging rpc_task and steal its rpc_client info.

Cheers
Trond


2017-01-13 15:13:34

by Chuck Lever III

Subject: Re: [PATCH v1] NFS: Detect unreachable NFS/RDMA servers more reliably


> On Jan 12, 2017, at 5:15 PM, Trond Myklebust <[email protected]> wrote:
>
>>
>> On Jan 12, 2017, at 12:42, Chuck Lever <[email protected]> wrote:
>>
>>
>>> On Jan 12, 2017, at 12:38 PM, Trond Myklebust <[email protected]> wrote:
>>>
>>> On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
>>>> Current NFS clients rely on connection loss to determine when to
>>>> retransmit. In particular, for protocols like NFSv4, clients no
>>>> longer rely on RPC timeouts to drive retransmission: NFSv4 servers
>>>> are required to terminate a connection when they need a client to
>>>> retransmit pending RPCs.
>>>>
>>>> When a server is no longer reachable, either because it has crashed
>>>> or because the network path has broken, the server cannot actively
>>>> terminate a connection. Thus NFS clients depend on transport-level
>>>> keepalive to determine when a connection must be replaced and
>>>> pending RPCs retransmitted.
>>>>
>>>> However, RDMA RC connections do not have a native keepalive
>>>> mechanism. If an NFS/RDMA server crashes after a client has sent
>>>> RPCs successfully (an RC ACK has been received for all OTW RDMA
>>>> requests), there is no way for the client to know the connection is
>>>> moribund.
>>>>
>>>> In addition, new RDMA requests are subject to the RPC-over-RDMA
>>>> credit limit. If the client has consumed all granted credits with
>>>> NFS traffic, it is not allowed to send another RDMA request until
>>>> the server replies. Thus it has no way to send a true keepalive when
>>>> the workload has already consumed all credits with pending RPCs.
>>>>
>>>> To address this, we reserve one RPC-over-RDMA credit that may be
>>>> used only for an NFS NULL. A periodic RPC ping is done on transports
>>>> whenever there are outstanding RPCs.
>>>>
>>>> The purpose of this ping is to drive traffic regularly on each
>>>> connection to force the transport layer to disconnect it if it is no
>>>> longer viable. Some RDMA operations are fully offloaded to the HCA,
>>>> and can be successful even if the remote host has crashed. Thus an
>>>> operation that requires that the server is responsive is used for
>>>> the ping.
>>>>
>>>> This implementation re-uses existing generic RPC infrastructure to
>>>> form each NULL Call. An rpc_clnt context must be available to start
>>>> an RPC. Thus a generic keepalive mechanism is introduced so that
>>>> both an rpc_clnt and an rpc_xprt is available to perform the ping.
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>>
>>>> Before sending this for internal testing, I'd like to hear comments
>>>> on this approach. It's a little more churn than I had hoped for.
>>>>
>>>>
>>>> fs/nfs/nfs4client.c | 1
>>>> include/linux/sunrpc/clnt.h | 2 +
>>>> include/linux/sunrpc/sched.h | 3 +
>>>> include/linux/sunrpc/xprt.h | 1
>>>> net/sunrpc/clnt.c | 101
>>>> +++++++++++++++++++++++++++++++++++++++
>>>> net/sunrpc/sched.c | 19 +++++++
>>>> net/sunrpc/xprt.c | 5 ++
>>>> net/sunrpc/xprtrdma/rpc_rdma.c | 4 +-
>>>> net/sunrpc/xprtrdma/transport.c | 13 +++++
>>>> 9 files changed, 148 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
>>>> index 074ac71..c5f5ce8 100644
>>>> --- a/fs/nfs/nfs4client.c
>>>> +++ b/fs/nfs/nfs4client.c
>>>> @@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct
>>>> nfs_client *clp,
>>>> error = nfs_create_rpc_client(clp, cl_init,
>>>> RPC_AUTH_UNIX);
>>>> if (error < 0)
>>>> goto error;
>>>> + rpc_schedule_keepalive(clp->cl_rpcclient);
>>>
>>> Why do we want to enable this for non-RDMA transports? Shouldn't this
>>> functionality be hidden in the RDMA client code, in the same way that
>>> the TCP keepalive is hidden in the socket code.
>>
>> Sending a NULL request by re-using the normal RPC infrastructure
>> requires a struct rpc_clnt. Thus it has to be driven by an upper
>> layer context.
>>
>> I'm open to suggestions.
>>
>
> Ideally we just want this to operate when there are outstanding RPC calls waiting for a reply, am I correct?
>
> If so, perhaps we might have it triggered by a timer that is armed in xprt->ops->send_request() and disarmed in xprt->ops->release_xprt()? It might then configure itself by looking in the xprt->recv list to find a hanging rpc_task and steal its rpc_client info.

Perhaps, but I was hoping to find a solution that did not add more
overhead (arming and disarming another timer) to the send_request
path.

__mod_timer can do an irqsave spinlock in some cases, for example.

This impacts all I/O on all transports to handle a case that will
be very rare.

We could mitigate the timer flapping by arming when xprt_transmit
finds the recv list empty before adding, and when xprt_lookup_rqst
empties the recv list.
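[Editorial note: the arm-on-transition mitigation sketched above can be modeled in a few lines of userspace C. All names here are hypothetical, not kernel code: the timer is touched only on the empty/non-empty transitions of the recv list, so steady-state I/O never flaps it.]

```c
#include <stdbool.h>

/* Userspace model of the proposed mitigation: arm the keepalive
 * timer only on the empty -> non-empty transition of the recv list,
 * and disarm it when the list drains. */
struct recv_model {
	int pending;       /* requests waiting for a reply */
	bool timer_armed;  /* keepalive timer state */
	int arm_events;    /* how many times the timer was (re)armed */
};

/* Called where xprt_transmit would add a request to the recv list. */
static void model_transmit(struct recv_model *m)
{
	if (m->pending++ == 0) {	/* empty -> non-empty transition */
		m->timer_armed = true;
		m->arm_events++;
	}
}

/* Called where xprt_lookup_rqst would remove a matched request. */
static void model_reply(struct recv_model *m)
{
	if (m->pending > 0 && --m->pending == 0)
		m->timer_armed = false;	/* list drained: disarm */
}
```

Under this model a burst of sends arms the timer once, and replies that fully drain the list disarm it, avoiding per-request timer churn in the transmit path.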


> Cheers
> Trond
>

--
Chuck Lever




2017-01-13 17:27:35

by Trond Myklebust

Subject: Re: [PATCH v1] NFS: Detect unreachable NFS/RDMA servers more reliably

On Fri, 2017-01-13 at 10:13 -0500, Chuck Lever wrote:
> > On Jan 12, 2017, at 5:15 PM, Trond Myklebust <[email protected]> wrote:
> >
> > Ideally we just want this to operate when there are outstanding RPC
> > calls waiting for a reply, am I correct?
> >
> > If so, perhaps we might have it triggered by a timer that is armed
> > in xprt->ops->send_request() and disarmed in xprt->ops-
> > >release_xprt()? It might then configure itself by looking in the
> > xprt->recv list to find a hanging rpc_task and steal its rpc_client
> > info.
>
> Perhaps, but I was hoping to find a solution that did not add more
> overhead (arming and disarming another timer) to the send_request
> path.
>
> __mod_timer can do an irqsave spinlock in some cases, for example.
>
> This impacts all I/O on all transports to handle a case that will
> be very rare.
>
> We could mitigate the timer flapping by arming when xprt_transmit
> finds the recv list empty before adding, and when xprt_lookup_rqst
> empties the recv list.

Alternatively, how about just putting the trigger in xprt_timer (i.e.
in the xprt->ops->timer() callback)? That requires no new timers, and
it solves the problem of which rpc_clnt to use.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2017-01-13 19:08:50

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1] NFS: Detect unreachable NFS/RDMA servers more reliably


> On Jan 13, 2017, at 12:27 PM, Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2017-01-13 at 10:13 -0500, Chuck Lever wrote:
>>> On Jan 12, 2017, at 5:15 PM, Trond Myklebust <[email protected]> wrote:
>>>
>>>>
>>>> On Jan 12, 2017, at 12:42, Chuck Lever <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>>> On Jan 12, 2017, at 12:38 PM, Trond Myklebust <[email protected]> wrote:
>>>>>
>>>>> On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
>>>>>> Current NFS clients rely on connection loss to determine when
>>>>>> to
>>>>>> retransmit. In particular, for protocols like NFSv4, clients
>>>>>> no
>>>>>> longer rely on RPC timeouts to drive retransmission: NFSv4
>>>>>> servers
>>>>>> are required to terminate a connection when they need a
>>>>>> client to
>>>>>> retransmit pending RPCs.
>>>>>>
>>>>>> When a server is no longer reachable, either because it has
>>>>>> crashed
>>>>>> or because the network path has broken, the server cannot
>>>>>> actively
>>>>>> terminate a connection. Thus NFS clients depend on transport-
>>>>>> level
>>>>>> keepalive to determine when a connection must be replaced and
>>>>>> pending RPCs retransmitted.
>>>>>>
>>>>>> However, RDMA RC connections do not have a native keepalive
>>>>>> mechanism. If an NFS/RDMA server crashes after a client has
>>>>>> sent
>>>>>> RPCs successfully (an RC ACK has been received for all OTW
>>>>>> RDMA
>>>>>> requests), there is no way for the client to know the
>>>>>> connection is
>>>>>> moribund.
>>>>>>
>>>>>> In addition, new RDMA requests are subject to the RPC-over-
>>>>>> RDMA
>>>>>> credit limit. If the client has consumed all granted credits
>>>>>> with
>>>>>> NFS traffic, it is not allowed to send another RDMA request
>>>>>> until
>>>>>> the server replies. Thus it has no way to send a true
>>>>>> keepalive when
>>>>>> the workload has already consumed all credits with pending
>>>>>> RPCs.
>>>>>>
>>>>>> To address this, we reserve one RPC-over-RDMA credit that may
>>>>>> be
>>>>>> used only for an NFS NULL. A periodic RPC ping is done on
>>>>>> transports
>>>>>> whenever there are outstanding RPCs.
>>>>>>
>>>>>> The purpose of this ping is to drive traffic regularly on
>>>>>> each
>>>>>> connection to force the transport layer to disconnect it if
>>>>>> it is no
>>>>>> longer viable. Some RDMA operations are fully offloaded to
>>>>>> the HCA,
>>>>>> and can be successful even if the remote host has crashed.
>>>>>> Thus an
>>>>>> operation that requires that the server is responsive is used
>>>>>> for
>>>>>> the ping.
>>>>>>
>>>>>> This implementation re-uses existing generic RPC
>>>>>> infrastructure to
>>>>>> form each NULL Call. An rpc_clnt context must be available to
>>>>>> start
>>>>>> an RPC. Thus a generic keepalive mechanism is introduced so
>>>>>> that
>>>>>> both an rpc_clnt and an rpc_xprt is available to perform the
>>>>>> ping.
>>>>>>
>>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>>> ---
>>>>>>
>>>>>> Before sending this for internal testing, I'd like to hear
>>>>>> comments
>>>>>> on this approach. It's a little more churn than I had hoped
>>>>>> for.
>>>>>>
>>>>>>
>>>>>> fs/nfs/nfs4client.c             |    1
>>>>>> include/linux/sunrpc/clnt.h     |    2 +
>>>>>> include/linux/sunrpc/sched.h    |    3 +
>>>>>> include/linux/sunrpc/xprt.h     |    1
>>>>>> net/sunrpc/clnt.c               |  101 +++++++++++++++++++++++++++++++++++++++
>>>>>> net/sunrpc/sched.c              |   19 +++++++
>>>>>> net/sunrpc/xprt.c               |    5 ++
>>>>>> net/sunrpc/xprtrdma/rpc_rdma.c  |    4 +-
>>>>>> net/sunrpc/xprtrdma/transport.c |   13 +++++
>>>>>> 9 files changed, 148 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
>>>>>> index 074ac71..c5f5ce8 100644
>>>>>> --- a/fs/nfs/nfs4client.c
>>>>>> +++ b/fs/nfs/nfs4client.c
>>>>>> @@ -378,6 +378,7 @@ struct nfs_client
>>>>>> *nfs4_init_client(struct
>>>>>> nfs_client *clp,
>>>>>> error = nfs_create_rpc_client(clp, cl_init,
>>>>>> RPC_AUTH_UNIX);
>>>>>> if (error < 0)
>>>>>> goto error;
>>>>>> + rpc_schedule_keepalive(clp->cl_rpcclient);
>>>>>
>>>>> Why do we want to enable this for non-RDMA transports?
>>>>> Shouldn't this
>>>>> functionality be hidden in the RDMA client code, in the same
>>>>> way that
>>>>> the TCP keepalive is hidden in the socket code.
>>>>
>>>> Sending a NULL request by re-using the normal RPC infrastructure
>>>> requires a struct rpc_clnt. Thus it has to be driven by an upper
>>>> layer context.
>>>>
>>>> I'm open to suggestions.
>>>>
>>>
>>> Ideally we just want this to operate when there are outstanding RPC
>>> calls waiting for a reply, am I correct?
>>>
>>> If so, perhaps we might have it triggered by a timer that is armed
>>> in xprt->ops->send_request() and disarmed in xprt->ops-
>>>> release_xprt()? It might then configure itself by looking in the
>>> xprt->recv list to find a hanging rpc_task and steal its rpc_client
>>> info.
>>
>> Perhaps, but I was hoping to find a solution that did not add more
>> overhead (arming and disarming another timer) to the send_request
>> path.
>>
>> __mod_timer can do an irqsave spinlock in some cases, for example.
>>
>> This impacts all I/O on all transports to handle a case that will
>> be very rare.
>>
>> We could mitigate the timer flapping by arming when xprt_transmit
>> finds the recv list empty before adding, and when xprt_lookup_rqst
>> empties the recv list.
>>
>
> Alternatively, how about just putting the trigger in xprt_timer (i.e.
> in the xprt->ops->timer() callback)? That requires no new timers, and
> it solves the problem of which rpc_clnt to use.

I was thinking of wiring something into call_timeout, but xprt_timer
looks like it would perform the same job, and there is already a
per-xprt hook. I'll have a look.

Is it safe to call rpc_run_task while transport_lock is held? If not
I can simply schedule a generic worker thread to construct and send
the NULL.


> --
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> [email protected]

--
Chuck Lever




2017-01-13 19:20:33

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v1] NFS: Detect unreachable NFS/RDMA servers more reliably

On Fri, 2017-01-13 at 14:08 -0500, Chuck Lever wrote:

> I was thinking of wiring something into call_timeout, but xprt_timer
> looks like it would perform the same job, and there is already a
> per-xprt hook. I'll have a look.
>
> Is it safe to call rpc_run_task while transport_lock is held? If not
> I can simply schedule a generic worker thread to construct and send
> the NULL.

You shouldn't call rpc_run_task() from a spinlock context, but let's
consider moving that spinlock into the xprt->ops->timer() callback. I
don't really see how it is needed for the cases where you don't have a
callback.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]