LinuxLists.cc - [PATCH v2 0/7] NFS/RDMA client-side patches for 4.11

2017-01-31 18:38:14

Subject: [PATCH v2 0/7] NFS/RDMA client-side patches for 4.11

These are bug fixes based on v4.10-rc6 + my for-4.10-rc fixes.

The keepalive patches are still waiting for internal testing
resources.

Available in the "nfs-rdma-for-4.11" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.11

Changes since v1:
- Rebased on v4.10-rc6
- Tested-by and additional clean-up in 1/7
- Patch description clarifications
- Renamed some constants and variables

---

Chuck Lever (7):
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
xprtrdma: Handle stale connection rejection
xprtrdma: Refactor management of mw_list field
sunrpc: Allow xprt->ops->timer method to sleep
sunrpc: Enable calls to rpc_call_null_helper() from other modules
xprtrdma: Detect unreachable NFS/RDMA servers more reliably
sunrpc: Allow keepalive ping on a credit-full transport

fs/nfs/nfs4proc.c | 3 -
fs/nfsd/nfs4callback.c | 2 -
include/linux/sunrpc/clnt.h | 5 ++
include/linux/sunrpc/sched.h | 4 ++
net/sunrpc/clnt.c | 28 ++++++------
net/sunrpc/xprt.c | 6 ++
net/sunrpc/xprtrdma/fmr_ops.c | 5 --
net/sunrpc/xprtrdma/frwr_ops.c | 11 ++---
net/sunrpc/xprtrdma/rpc_rdma.c | 20 +++++---
net/sunrpc/xprtrdma/transport.c | 74 +++++++++++++++++++++++++++++--
net/sunrpc/xprtrdma/verbs.c | 94 ++++++++++++++-------------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 23 ++++++++++
net/sunrpc/xprtsock.c | 2 +
13 files changed, 174 insertions(+), 103 deletions(-)

--
Chuck Lever

2017-01-31 18:38:19

by Chuck Lever III

[permalink] [raw]

Subject: [PATCH v2 1/7] xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs

Sriharsha ([email protected]) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.

When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.

Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.

But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.

When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.

But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.

- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.

- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.

The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.

Reported-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Sriharsha Basavapatna <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 14 ++++++++------
net/sunrpc/xprtrdma/transport.c | 4 ----
2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d889883..72b3ca0 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -759,13 +759,13 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
iptr = headerp->rm_body.rm_chunks;
iptr = rpcrdma_encode_read_list(r_xprt, req, rqst, iptr, rtype);
if (IS_ERR(iptr))
- goto out_unmap;
+ goto out_err;
iptr = rpcrdma_encode_write_list(r_xprt, req, rqst, iptr, wtype);
if (IS_ERR(iptr))
- goto out_unmap;
+ goto out_err;
iptr = rpcrdma_encode_reply_chunk(r_xprt, req, rqst, iptr, wtype);
if (IS_ERR(iptr))
- goto out_unmap;
+ goto out_err;
hdrlen = (unsigned char *)iptr - (unsigned char *)headerp;

dprintk("RPC: %5u %s: %s/%s: hdrlen %zd rpclen %zd\n",
@@ -776,12 +776,14 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
if (!rpcrdma_prepare_send_sges(&r_xprt->rx_ia, req, hdrlen,
&rqst->rq_snd_buf, rtype)) {
iptr = ERR_PTR(-EIO);
- goto out_unmap;
+ goto out_err;
}
return 0;

-out_unmap:
- r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
+out_err:
+ pr_err("rpcrdma: rpcrdma_marshal_req failed, status %ld\n",
+ PTR_ERR(iptr));
+ r_xprt->rx_stats.failed_marshal_count++;
return PTR_ERR(iptr);
}

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6990581..c717f54 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -709,10 +709,6 @@
return 0;

failed_marshal:
- dprintk("RPC: %s: rpcrdma_marshal_req failed, status %i\n",
- __func__, rc);
- if (rc == -EIO)
- r_xprt->rx_stats.failed_marshal_count++;
if (rc != -ENOTCONN)
return rc;
drop_connection:

2017-01-31 18:38:36

by Chuck Lever III

[permalink] [raw]

Subject: [PATCH v2 3/7] xprtrdma: Refactor management of mw_list field

Clean up some duplicate code.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 5 +----
net/sunrpc/xprtrdma/frwr_ops.c | 11 ++++-------
net/sunrpc/xprtrdma/rpc_rdma.c | 6 +++---
net/sunrpc/xprtrdma/verbs.c | 15 +++++----------
net/sunrpc/xprtrdma/xprt_rdma.h | 16 ++++++++++++++++
5 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 1ebb09e..59e6402 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -310,10 +310,7 @@ enum {
struct rpcrdma_mw *mw;

while (!list_empty(&req->rl_registered)) {
- mw = list_first_entry(&req->rl_registered,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
-
+ mw = rpcrdma_pop_mw(&req->rl_registered);
if (sync)
fmr_op_recover_mr(mw);
else
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 47bed53..f81dd93 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -466,8 +466,8 @@
struct ib_send_wr *first, **prev, *last, *bad_wr;
struct rpcrdma_rep *rep = req->rl_reply;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct rpcrdma_mw *mw, *tmp;
struct rpcrdma_frmr *f;
+ struct rpcrdma_mw *mw;
int count, rc;

dprintk("RPC: %s: req %p\n", __func__, req);
@@ -534,10 +534,10 @@
* them to the free MW list.
*/
unmap:
- list_for_each_entry_safe(mw, tmp, &req->rl_registered, mw_list) {
+ while (!list_empty(&req->rl_registered)) {
+ mw = rpcrdma_pop_mw(&req->rl_registered);
dprintk("RPC: %s: DMA unmapping frmr %p\n",
__func__, &mw->frmr);
- list_del_init(&mw->mw_list);
ib_dma_unmap_sg(ia->ri_device,
mw->mw_sg, mw->mw_nents, mw->mw_dir);
rpcrdma_put_mw(r_xprt, mw);
@@ -571,10 +571,7 @@
struct rpcrdma_mw *mw;

while (!list_empty(&req->rl_registered)) {
- mw = list_first_entry(&req->rl_registered,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
-
+ mw = rpcrdma_pop_mw(&req->rl_registered);
if (sync)
frwr_op_recover_mr(mw);
else
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 72b3ca0..a044be2 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -322,7 +322,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
false, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);

*iptr++ = xdr_one; /* item present */

@@ -390,7 +390,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
true, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);

iptr = xdr_encode_rdma_segment(iptr, mw);

@@ -455,7 +455,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
true, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);

iptr = xdr_encode_rdma_segment(iptr, mw);

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index d1ee33f..81cd31a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -776,9 +776,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)

spin_lock(&buf->rb_recovery_lock);
while (!list_empty(&buf->rb_stale_mrs)) {
- mw = list_first_entry(&buf->rb_stale_mrs,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
+ mw = rpcrdma_pop_mw(&buf->rb_stale_mrs);
spin_unlock(&buf->rb_recovery_lock);

dprintk("RPC: %s: recovering MR %p\n", __func__, mw);
@@ -796,7 +794,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;

spin_lock(&buf->rb_recovery_lock);
- list_add(&mw->mw_list, &buf->rb_stale_mrs);
+ rpcrdma_push_mw(mw, &buf->rb_stale_mrs);
spin_unlock(&buf->rb_recovery_lock);

schedule_delayed_work(&buf->rb_recovery_worker, 0);
@@ -1072,11 +1070,8 @@ struct rpcrdma_mw *
struct rpcrdma_mw *mw = NULL;

spin_lock(&buf->rb_mwlock);
- if (!list_empty(&buf->rb_mws)) {
- mw = list_first_entry(&buf->rb_mws,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
- }
+ if (!list_empty(&buf->rb_mws))
+ mw = rpcrdma_pop_mw(&buf->rb_mws);
spin_unlock(&buf->rb_mwlock);

if (!mw)
@@ -1099,7 +1094,7 @@ struct rpcrdma_mw *
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;

spin_lock(&buf->rb_mwlock);
- list_add_tail(&mw->mw_list, &buf->rb_mws);
+ rpcrdma_push_mw(mw, &buf->rb_mws);
spin_unlock(&buf->rb_mwlock);
}

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 852dd0a..171a351 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -354,6 +354,22 @@ struct rpcrdma_req {
return rqst->rq_xprtdata;
}

+static inline void
+rpcrdma_push_mw(struct rpcrdma_mw *mw, struct list_head *list)
+{
+ list_add_tail(&mw->mw_list, list);
+}
+
+static inline struct rpcrdma_mw *
+rpcrdma_pop_mw(struct list_head *list)
+{
+ struct rpcrdma_mw *mw;
+
+ mw = list_first_entry(list, struct rpcrdma_mw, mw_list);
+ list_del(&mw->mw_list);
+ return mw;
+}
+
/*
* struct rpcrdma_buffer -- holds list/queue of pre-registered memory for
* inline requests/replies, and client/server credits.

2017-01-31 18:38:32

by Chuck Lever III

[permalink] [raw]

Subject: [PATCH v2 2/7] xprtrdma: Handle stale connection rejection

A server rejects a connection attempt with STALE_CONNECTION when a
client attempts to connect to a working remote service, but uses a
QPN and GUID that corresponds to an old connection that was
abandoned. This might occur after a client crashes and restarts.

Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
and rejection of stale connection parameters.

As an additional clean-up, remove the code that retries the
connection attempt with different ORD/IRD values. Code audit of
other ULP initiators shows no similar special case handling of
initiator_depth or responder_resources.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 66 ++++++++++++++-----------------------------
1 file changed, 21 insertions(+), 45 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 61d16c3..d1ee33f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -54,6 +54,7 @@
#include <linux/sunrpc/svc_rdma.h>
#include <asm/bitops.h>
#include <linux/module.h> /* try_module_get()/module_put() */
+#include <rdma/ib_cm.h>

#include "xprt_rdma.h"

@@ -279,7 +280,14 @@
connstate = -ENETDOWN;
goto connected;
case RDMA_CM_EVENT_REJECTED:
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+ pr_info("rpcrdma: connection to %pIS:%u on %s rejected: %s\n",
+ sap, rpc_get_port(sap), ia->ri_device->name,
+ rdma_reject_msg(id, event->status));
+#endif
connstate = -ECONNREFUSED;
+ if (event->status == IB_CM_REJ_STALE_CONN)
+ connstate = -EAGAIN;
goto connected;
case RDMA_CM_EVENT_DISCONNECTED:
connstate = -ECONNABORTED;
@@ -643,20 +651,21 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
int
rpcrdma_ep_connect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
{
+ struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
+ rx_ia);
struct rdma_cm_id *id, *old;
+ struct sockaddr *sap;
+ unsigned int extras;
int rc = 0;
- int retry_count = 0;

if (ep->rep_connected != 0) {
- struct rpcrdma_xprt *xprt;
retry:
dprintk("RPC: %s: reconnecting...\n", __func__);

rpcrdma_ep_disconnect(ep, ia);

- xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
- id = rpcrdma_create_id(xprt, ia,
- (struct sockaddr *)&xprt->rx_data.addr);
+ sap = (struct sockaddr *)&r_xprt->rx_data.addr;
+ id = rpcrdma_create_id(r_xprt, ia, sap);
if (IS_ERR(id)) {
rc = -EHOSTUNREACH;
goto out;
@@ -711,51 +720,18 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
}

wait_event_interruptible(ep->rep_connect_wait, ep->rep_connected != 0);
-
- /*
- * Check state. A non-peer reject indicates no listener
- * (ECONNREFUSED), which may be a transient state. All
- * others indicate a transport condition which has already
- * undergone a best-effort.
- */
- if (ep->rep_connected == -ECONNREFUSED &&
- ++retry_count <= RDMA_CONNECT_RETRY_MAX) {
- dprintk("RPC: %s: non-peer_reject, retry\n", __func__);
- goto retry;
- }
if (ep->rep_connected <= 0) {
- /* Sometimes, the only way to reliably connect to remote
- * CMs is to use same nonzero values for ORD and IRD. */
- if (retry_count++ <= RDMA_CONNECT_RETRY_MAX + 1 &&
- (ep->rep_remote_cma.responder_resources == 0 ||
- ep->rep_remote_cma.initiator_depth !=
- ep->rep_remote_cma.responder_resources)) {
- if (ep->rep_remote_cma.responder_resources == 0)
- ep->rep_remote_cma.responder_resources = 1;
- ep->rep_remote_cma.initiator_depth =
- ep->rep_remote_cma.responder_resources;
+ if (ep->rep_connected == -EAGAIN)
goto retry;
- }
rc = ep->rep_connected;
- } else {
- struct rpcrdma_xprt *r_xprt;
- unsigned int extras;
-
- dprintk("RPC: %s: connected\n", __func__);
-
- r_xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
- extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
-
- if (extras) {
- rc = rpcrdma_ep_post_extra_recv(r_xprt, extras);
- if (rc) {
- pr_warn("%s: rpcrdma_ep_post_extra_recv: %i\n",
- __func__, rc);
- rc = 0;
- }
- }
+ goto out;
}

+ dprintk("RPC: %s: connected\n", __func__);
+ extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
+ if (extras)
+ rpcrdma_ep_post_extra_recv(r_xprt, extras);
+
out:
if (rc)
ep->rep_connected = rc;

2017-01-31 18:38:44

by Chuck Lever III

[permalink] [raw]

Subject: [PATCH v2 4/7] sunrpc: Allow xprt->ops->timer method to sleep

The transport lock is needed to protect the xprt_adjust_cwnd() call
in xs_udp_timer, but it is not necessary for accessing the
rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate
the lock into UDP's xs_udp_timer method, where it is required.

The ->timer method has to take the transport lock if needed, but it
can now sleep safely, or even call back into the RPC scheduler.

This is more a clean-up than a fix, but the "issue" was introduced
by my transport switch patches back in 2005.

Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprt.c | 2 --
net/sunrpc/xprtsock.c | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 9a6be03..b530a28 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -897,13 +897,11 @@ static void xprt_timer(struct rpc_task *task)
return;
dprintk("RPC: %5u xprt_timer\n", task->tk_pid);

- spin_lock_bh(&xprt->transport_lock);
if (!req->rq_reply_bytes_recvd) {
if (xprt->ops->timer)
xprt->ops->timer(xprt, task);
} else
task->tk_status = 0;
- spin_unlock_bh(&xprt->transport_lock);
}

/**
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index af392d9..d9bb644 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1734,7 +1734,9 @@ static void xs_udp_set_buffer_size(struct rpc_xprt *xprt, size_t sndsize, size_t
*/
static void xs_udp_timer(struct rpc_xprt *xprt, struct rpc_task *task)
{
+ spin_lock_bh(&xprt->transport_lock);
xprt_adjust_cwnd(xprt, task, -ETIMEDOUT);
+ spin_unlock_bh(&xprt->transport_lock);
}

static unsigned short xs_get_random_port(void)

2017-01-31 18:38:57

by Chuck Lever III

[permalink] [raw]

Subject: [PATCH v2 5/7] sunrpc: Enable calls to rpc_call_null_helper() from other modules

To help detect unreachable servers, I'd like to emit an RPC ping
from rpcrdma.ko.

authnull_ops is not visible outside the sunrpc.ko module, so fold
the common case into rpc_call_null_helper, and export it, so that
it can be invoked from other kernel modules.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfs/nfs4proc.c | 3 +--
fs/nfsd/nfs4callback.c | 2 +-
include/linux/sunrpc/clnt.h | 5 +++++
include/linux/sunrpc/sched.h | 2 ++
net/sunrpc/clnt.c | 28 ++++++++++++++--------------
5 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 0a0eaec..0091e5a 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7640,8 +7640,7 @@ static int _nfs4_proc_exchange_id(struct nfs_client *clp, struct rpc_cred *cred,
if (xprt) {
calldata->xprt = xprt;
task_setup_data.rpc_xprt = xprt;
- task_setup_data.flags =
- RPC_TASK_SOFT|RPC_TASK_SOFTCONN|RPC_TASK_ASYNC;
+ task_setup_data.flags = RPC_TASK_SOFTPING | RPC_TASK_ASYNC;
calldata->args.verifier = &clp->cl_confirm;
} else {
calldata->args.verifier = &verifier;
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index eb78109..e1e2224 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1182,7 +1182,7 @@ static void nfsd4_process_cb_update(struct nfsd4_callback *cb)
}

cb->cb_msg.rpc_cred = clp->cl_cb_cred;
- rpc_call_async(clnt, &cb->cb_msg, RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
+ rpc_call_async(clnt, &cb->cb_msg, RPC_TASK_SOFTPING,
cb->cb_ops ? &nfsd4_cb_ops : &nfsd4_cb_probe_ops, cb);
}

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 333ad11..f576127 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -173,6 +173,11 @@ int rpc_call_async(struct rpc_clnt *clnt,
void *calldata);
int rpc_call_sync(struct rpc_clnt *clnt,
const struct rpc_message *msg, int flags);
+struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
+ struct rpc_xprt *xprt,
+ struct rpc_cred *cred, int flags,
+ const struct rpc_call_ops *ops,
+ void *data);
struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred,
int flags);
int rpc_restart_call_prepare(struct rpc_task *);
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 7ba040c..13822e6 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -128,6 +128,8 @@ struct rpc_task_setup {
#define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
#define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */

+#define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN)
+
#define RPC_IS_ASYNC(t) ((t)->tk_flags & RPC_TASK_ASYNC)
#define RPC_IS_SWAPPER(t) ((t)->tk_flags & RPC_TASK_SWAPPER)
#define RPC_DO_ROOTOVERRIDE(t) ((t)->tk_flags & RPC_TASK_ROOTCREDS)
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 1dc9f3b..642c93d 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2520,12 +2520,11 @@ static int rpc_ping(struct rpc_clnt *clnt)
};
int err;
msg.rpc_cred = authnull_ops.lookup_cred(NULL, NULL, 0);
- err = rpc_call_sync(clnt, &msg, RPC_TASK_SOFT | RPC_TASK_SOFTCONN);
+ err = rpc_call_sync(clnt, &msg, RPC_TASK_SOFTPING);
put_rpccred(msg.rpc_cred);
return err;
}

-static
struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
struct rpc_xprt *xprt, struct rpc_cred *cred, int flags,
const struct rpc_call_ops *ops, void *data)
@@ -2542,9 +2541,17 @@ struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
.callback_data = data,
.flags = flags,
};
+ struct rpc_task *task;

- return rpc_run_task(&task_setup_data);
+ if (!cred)
+ msg.rpc_cred = authnull_ops.lookup_cred(NULL, NULL, 0);
+ task = rpc_run_task(&task_setup_data);
+ if (!cred)
+ put_rpccred(msg.rpc_cred);
+
+ return task;
}
+EXPORT_SYMBOL_GPL(rpc_call_null_helper);

struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred, int flags)
{
@@ -2591,7 +2598,6 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt,
void *dummy)
{
struct rpc_cb_add_xprt_calldata *data;
- struct rpc_cred *cred;
struct rpc_task *task;

data = kmalloc(sizeof(*data), GFP_NOFS);
@@ -2600,11 +2606,9 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt,
data->xps = xprt_switch_get(xps);
data->xprt = xprt_get(xprt);

- cred = authnull_ops.lookup_cred(NULL, NULL, 0);
- task = rpc_call_null_helper(clnt, xprt, cred,
- RPC_TASK_SOFT|RPC_TASK_SOFTCONN|RPC_TASK_ASYNC,
- &rpc_cb_add_xprt_call_ops, data);
- put_rpccred(cred);
+ task = rpc_call_null_helper(clnt, xprt, NULL,
+ RPC_TASK_SOFTPING | RPC_TASK_ASYNC,
+ &rpc_cb_add_xprt_call_ops, data);
if (IS_ERR(task))
return PTR_ERR(task);
rpc_put_task(task);
@@ -2635,7 +2639,6 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt,
struct rpc_xprt *xprt,
void *data)
{
- struct rpc_cred *cred;
struct rpc_task *task;
struct rpc_add_xprt_test *xtest = (struct rpc_add_xprt_test *)data;
int status = -EADDRINUSE;
@@ -2647,11 +2650,8 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt,
goto out_err;

/* Test the connection */
- cred = authnull_ops.lookup_cred(NULL, NULL, 0);
- task = rpc_call_null_helper(clnt, xprt, cred,
- RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
+ task = rpc_call_null_helper(clnt, xprt, NULL, RPC_TASK_SOFTPING,
NULL, NULL);
- put_rpccred(cred);
if (IS_ERR(task)) {
status = PTR_ERR(task);
goto out_err;

2017-01-31 18:39:00

by Chuck Lever III

[permalink] [raw]

Subject: [PATCH v2 6/7] xprtrdma: Detect unreachable NFS/RDMA servers more reliably

Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, emit an RPC NULL ping when an RPC retransmit
timeout occurs.

The purpose of this ping is to drive traffic on the connection to
force the transport layer to disconnect it if it is no longer
viable. Some RDMA operations are fully offloaded to the HCA, and can
be successful even if the server O/S has crashed. Thus an operation
that requires that the server is responsive is used for the ping.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 69 +++++++++++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 7 ++++
2 files changed, 76 insertions(+)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index c717f54..3a5a805 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -484,6 +484,74 @@
dprintk("RPC: %s: %u\n", __func__, port);
}

+static void rpcrdma_keepalive_done(struct rpc_task *task, void *calldata)
+{
+ struct rpc_xprt *xprt = (struct rpc_xprt *)calldata;
+ struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+
+ if (task->tk_status) {
+ struct sockaddr *sap =
+ (struct sockaddr *)&r_xprt->rx_ep.rep_remote_addr;
+
+ pr_err("rpcrdma: keepalive to %pIS:%u failed (%d)\n",
+ sap, rpc_get_port(sap), task->tk_status);
+ xprt_disconnect_done(xprt);
+ }
+ clear_bit(RPCRDMA_IA_RSVD_CREDIT, &r_xprt->rx_ia.ri_flags);
+}
+
+static void rpcrdma_keepalive_release(void *calldata)
+{
+ struct rpc_xprt *xprt = (struct rpc_xprt *)calldata;
+
+ xprt_put(xprt);
+}
+
+static const struct rpc_call_ops rpcrdma_keepalive_call_ops = {
+ .rpc_call_done = rpcrdma_keepalive_done,
+ .rpc_release = rpcrdma_keepalive_release,
+};
+
+/**
+ * xprt_rdma_timer - invoked when an RPC times out
+ * @xprt: controlling RPC transport
+ * @task: RPC task that timed out
+ *
+ * Some RDMA transports do not have any form of connection
+ * keepalive. In some circumstances, unviable connections
+ * can continue to live for a long time.
+ *
+ * Send a NULL RPC to see if the server still responds. On
+ * a moribund connection, this should trigger either an RPC
+ * or transport layer timeout and kill the connection.
+ */
+static void
+xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
+{
+ struct rpcrdma_xprt *r_xprt =
+ container_of(xprt, struct rpcrdma_xprt, rx_xprt);
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+ struct rpcrdma_ep *ep = &r_xprt->rx_ep;
+ struct sockaddr *sap = (struct sockaddr *)&ep->rep_remote_addr;
+#endif
+ struct rpc_task *null_task;
+ void *data;
+
+ /* Ensure only one is sent at a time */
+ if (!test_and_set_bit(RPCRDMA_IA_RSVD_CREDIT, &r_xprt->rx_ia.ri_flags))
+ return;
+
+ dprintk("RPC: %s: sending keepalive ping to %pIS:%u\n",
+ __func__, sap, rpc_get_port(sap));
+
+ data = xprt_get(xprt);
+ null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
+ RPC_TASK_SOFTPING | RPC_TASK_ASYNC,
+ &rpcrdma_keepalive_call_ops, data);
+ if (!IS_ERR(null_task))
+ rpc_put_task(null_task);
+}
+
static void
xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
{
@@ -776,6 +844,7 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
.alloc_slot = xprt_alloc_slot,
.release_request = xprt_release_rqst_cong, /* ditto */
.set_retrans_timeout = xprt_set_retrans_timeout_def, /* ditto */
+ .timer = xprt_rdma_timer,
.rpcbind = rpcb_getport_async, /* sunrpc/rpcb_clnt.c */
.set_port = xprt_rdma_set_port,
.connect = xprt_rdma_connect,
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 171a351..95c1e3d 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -77,11 +77,18 @@ struct rpcrdma_ia {
unsigned int ri_max_send_sges;
bool ri_reminv_expected;
bool ri_implicit_roundup;
+ unsigned long ri_flags;
enum ib_mr_type ri_mrtype;
struct ib_qp_attr ri_qp_attr;
struct ib_qp_init_attr ri_qp_init_attr;
};

+/* ri_flags bits
+ */
+enum {
+ RPCRDMA_IA_RSVD_CREDIT = 0,
+};
+
/*
* RDMA Endpoint -- one per transport instance
*/

2017-01-31 18:39:13

by Chuck Lever III

[permalink] [raw]

Subject: [PATCH v2 7/7] sunrpc: Allow keepalive ping on a credit-full transport

Allow RPC-over-RDMA to send NULL pings even when the transport has
hit its credit limit. One RPC-over-RDMA credit is reserved for
operations like keepalive.

For transports that convey NFSv4, it seems like lease renewal would
also be a candidate for using a priority transport slot. I'd like to
see a mechanism better than RPCRDMA_PRIORITY that can ensure only
one priority operation is in use at a time.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/sched.h | 2 ++
net/sunrpc/xprt.c | 4 ++++
net/sunrpc/xprtrdma/transport.c | 3 ++-
net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
net/sunrpc/xprtrdma/xprt_rdma.h | 2 +-
5 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 13822e6..fcea158 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -127,6 +127,7 @@ struct rpc_task_setup {
#define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
#define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
#define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
+#define RPC_TASK_NO_CONG 0x8000 /* skip congestion control */

#define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN)

@@ -137,6 +138,7 @@ struct rpc_task_setup {
#define RPC_IS_SOFT(t) ((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
#define RPC_IS_SOFTCONN(t) ((t)->tk_flags & RPC_TASK_SOFTCONN)
#define RPC_WAS_SENT(t) ((t)->tk_flags & RPC_TASK_SENT)
+#define RPC_SKIP_CONG(t) ((t)->tk_flags & RPC_TASK_NO_CONG)

#define RPC_TASK_RUNNING 0
#define RPC_TASK_QUEUED 1
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index b530a28..a477ee6 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta
{
struct rpc_rqst *req = task->tk_rqstp;

+ if (RPC_SKIP_CONG(task)) {
+ req->rq_cong = 0;
+ return 1;
+ }
if (req->rq_cong)
return 1;
dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = %lu\n",
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 3a5a805..073fecd 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void *calldata)

data = xprt_get(xprt);
null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
- RPC_TASK_SOFTPING | RPC_TASK_ASYNC,
+ RPC_TASK_SOFTPING | RPC_TASK_ASYNC |
+ RPC_TASK_NO_CONG,
&rpcrdma_keepalive_call_ops, data);
if (!IS_ERR(null_task))
rpc_put_task(null_task);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 81cd31a..d9b5fa7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -136,19 +136,20 @@
static void
rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
{
- struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf);
struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
+ __be32 *p = rep->rr_rdmabuf->rg_base;
u32 credits;

if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
return;

- credits = be32_to_cpu(rmsgp->rm_credit);
+ credits = be32_to_cpup(p + 2);
+ if (credits > buffer->rb_max_requests)
+ credits = buffer->rb_max_requests;
+ /* Reserve one credit for keepalive ping */
+ credits--;
if (credits == 0)
credits = 1; /* don't deadlock */
- else if (credits > buffer->rb_max_requests)
- credits = buffer->rb_max_requests;
-
atomic_set(&buffer->rb_credits, credits);
}

@@ -915,6 +916,8 @@ struct rpcrdma_rep *
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
int i, rc;

+ if (r_xprt->rx_data.max_requests < 2)
+ return -EINVAL;
buf->rb_max_requests = r_xprt->rx_data.max_requests;
buf->rb_bc_srv_max_requests = 0;
atomic_set(&buf->rb_credits, 1);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 95c1e3d..dd1340f 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -77,8 +77,8 @@ struct rpcrdma_ia {
unsigned int ri_max_send_sges;
bool ri_reminv_expected;
bool ri_implicit_roundup;
- unsigned long ri_flags;
enum ib_mr_type ri_mrtype;
+ unsigned long ri_flags;
struct ib_qp_attr ri_qp_attr;
struct ib_qp_init_attr ri_qp_init_attr;
};

2017-02-01 14:50:19

by Anna Schumaker

[permalink] [raw]

Subject: Re: [PATCH v2 7/7] sunrpc: Allow keepalive ping on a credit-full transport

Hi Chuck,

On 01/31/2017 01:39 PM, Chuck Lever wrote:
> Allow RPC-over-RDMA to send NULL pings even when the transport has
> hit its credit limit. One RPC-over-RDMA credit is reserved for
> operations like keepalive.
>
> For transports that convey NFSv4, it seems like lease renewal would
> also be a candidate for using a priority transport slot. I'd like to
> see a mechanism better than RPCRDMA_PRIORITY that can ensure only
> one priority operation is in use at a time.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/sched.h | 2 ++
> net/sunrpc/xprt.c | 4 ++++
> net/sunrpc/xprtrdma/transport.c | 3 ++-
> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
> net/sunrpc/xprtrdma/xprt_rdma.h | 2 +-
> 5 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
> index 13822e6..fcea158 100644
> --- a/include/linux/sunrpc/sched.h
> +++ b/include/linux/sunrpc/sched.h
> @@ -127,6 +127,7 @@ struct rpc_task_setup {
> #define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
> #define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
> +#define RPC_TASK_NO_CONG 0x8000 /* skip congestion control */
>
> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN)
>
> @@ -137,6 +138,7 @@ struct rpc_task_setup {
> #define RPC_IS_SOFT(t) ((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & RPC_TASK_SOFTCONN)
> #define RPC_WAS_SENT(t) ((t)->tk_flags & RPC_TASK_SENT)
> +#define RPC_SKIP_CONG(t) ((t)->tk_flags & RPC_TASK_NO_CONG)
>
> #define RPC_TASK_RUNNING 0
> #define RPC_TASK_QUEUED 1
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index b530a28..a477ee6 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta
> {
> struct rpc_rqst *req = task->tk_rqstp;
>
> + if (RPC_SKIP_CONG(task)) {
> + req->rq_cong = 0;
> + return 1;
> + }
> if (req->rq_cong)
> return 1;
> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = %lu\n",
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 3a5a805..073fecd 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void *calldata)
>
> data = xprt_get(xprt);
> null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
> - RPC_TASK_SOFTPING | RPC_TASK_ASYNC,
> + RPC_TASK_SOFTPING | RPC_TASK_ASYNC |
> + RPC_TASK_NO_CONG,
> &rpcrdma_keepalive_call_ops, data);
> if (!IS_ERR(null_task))
> rpc_put_task(null_task);
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 81cd31a..d9b5fa7 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -136,19 +136,20 @@
> static void
> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
> {
> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf);
> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
> + __be32 *p = rep->rr_rdmabuf->rg_base;
> u32 credits;
>
> if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
> return;
>
> - credits = be32_to_cpu(rmsgp->rm_credit);
> + credits = be32_to_cpup(p + 2);
> + if (credits > buffer->rb_max_requests)
> + credits = buffer->rb_max_requests;
> + /* Reserve one credit for keepalive ping */
> + credits--;
> if (credits == 0)
> credits = 1; /* don't deadlock */
> - else if (credits > buffer->rb_max_requests)
> - credits = buffer->rb_max_requests;
> -
> atomic_set(&buffer->rb_credits, credits);
> }
>
> @@ -915,6 +916,8 @@ struct rpcrdma_rep *
> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> int i, rc;
>
> + if (r_xprt->rx_data.max_requests < 2)
> + return -EINVAL;
> buf->rb_max_requests = r_xprt->rx_data.max_requests;
> buf->rb_bc_srv_max_requests = 0;
> atomic_set(&buf->rb_credits, 1);
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 95c1e3d..dd1340f 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -77,8 +77,8 @@ struct rpcrdma_ia {
> unsigned int ri_max_send_sges;
> bool ri_reminv_expected;
> bool ri_implicit_roundup;
> - unsigned long ri_flags;
> enum ib_mr_type ri_mrtype;
> + unsigned long ri_flags;

Any reason for the reordering here?

Thanks,
Anna

> struct ib_qp_attr ri_qp_attr;
> struct ib_qp_init_attr ri_qp_init_attr;
> };
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2017-02-01 15:31:43

by Chuck Lever III

[permalink] [raw]

Subject: Re: [PATCH v2 7/7] sunrpc: Allow keepalive ping on a credit-full transport

> On Feb 1, 2017, at 9:50 AM, Anna Schumaker <[email protected]> wrote:
>
> Hi Chuck,
>
> On 01/31/2017 01:39 PM, Chuck Lever wrote:
>> Allow RPC-over-RDMA to send NULL pings even when the transport has
>> hit its credit limit. One RPC-over-RDMA credit is reserved for
>> operations like keepalive.
>>
>> For transports that convey NFSv4, it seems like lease renewal would
>> also be a candidate for using a priority transport slot. I'd like to
>> see a mechanism better than RPCRDMA_PRIORITY that can ensure only
>> one priority operation is in use at a time.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/sched.h | 2 ++
>> net/sunrpc/xprt.c | 4 ++++
>> net/sunrpc/xprtrdma/transport.c | 3 ++-
>> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
>> net/sunrpc/xprtrdma/xprt_rdma.h | 2 +-
>> 5 files changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>> index 13822e6..fcea158 100644
>> --- a/include/linux/sunrpc/sched.h
>> +++ b/include/linux/sunrpc/sched.h
>> @@ -127,6 +127,7 @@ struct rpc_task_setup {
>> #define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
>> #define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
>> +#define RPC_TASK_NO_CONG 0x8000 /* skip congestion control */
>>
>> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN)
>>
>> @@ -137,6 +138,7 @@ struct rpc_task_setup {
>> #define RPC_IS_SOFT(t) ((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
>> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & RPC_TASK_SOFTCONN)
>> #define RPC_WAS_SENT(t) ((t)->tk_flags & RPC_TASK_SENT)
>> +#define RPC_SKIP_CONG(t) ((t)->tk_flags & RPC_TASK_NO_CONG)
>>
>> #define RPC_TASK_RUNNING 0
>> #define RPC_TASK_QUEUED 1
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index b530a28..a477ee6 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta
>> {
>> struct rpc_rqst *req = task->tk_rqstp;
>>
>> + if (RPC_SKIP_CONG(task)) {
>> + req->rq_cong = 0;
>> + return 1;
>> + }
>> if (req->rq_cong)
>> return 1;
>> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = %lu\n",
>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>> index 3a5a805..073fecd 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void *calldata)
>>
>> data = xprt_get(xprt);
>> null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
>> - RPC_TASK_SOFTPING | RPC_TASK_ASYNC,
>> + RPC_TASK_SOFTPING | RPC_TASK_ASYNC |
>> + RPC_TASK_NO_CONG,
>> &rpcrdma_keepalive_call_ops, data);
>> if (!IS_ERR(null_task))
>> rpc_put_task(null_task);
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 81cd31a..d9b5fa7 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -136,19 +136,20 @@
>> static void
>> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
>> {
>> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf);
>> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
>> + __be32 *p = rep->rr_rdmabuf->rg_base;
>> u32 credits;
>>
>> if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
>> return;
>>
>> - credits = be32_to_cpu(rmsgp->rm_credit);
>> + credits = be32_to_cpup(p + 2);
>> + if (credits > buffer->rb_max_requests)
>> + credits = buffer->rb_max_requests;
>> + /* Reserve one credit for keepalive ping */
>> + credits--;
>> if (credits == 0)
>> credits = 1; /* don't deadlock */
>> - else if (credits > buffer->rb_max_requests)
>> - credits = buffer->rb_max_requests;
>> -
>> atomic_set(&buffer->rb_credits, credits);
>> }
>>
>> @@ -915,6 +916,8 @@ struct rpcrdma_rep *
>> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> int i, rc;
>>
>> + if (r_xprt->rx_data.max_requests < 2)
>> + return -EINVAL;
>> buf->rb_max_requests = r_xprt->rx_data.max_requests;
>> buf->rb_bc_srv_max_requests = 0;
>> atomic_set(&buf->rb_credits, 1);
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 95c1e3d..dd1340f 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -77,8 +77,8 @@ struct rpcrdma_ia {
>> unsigned int ri_max_send_sges;
>> bool ri_reminv_expected;
>> bool ri_implicit_roundup;
>> - unsigned long ri_flags;
>> enum ib_mr_type ri_mrtype;
>> + unsigned long ri_flags;
>
> Any reason for the reordering here?

I thought 7/7 added this field, but 6/7 actually does.
6/7 should have put the ri_flags field in the right place.

I can respin the series.

> Thanks,
> Anna
>
>> struct ib_qp_attr ri_qp_attr;
>> struct ib_qp_init_attr ri_qp_init_attr;
>> };
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever