2017-01-26 18:21:19

by Chuck Lever III

Subject: [PATCH v1 0/7] NFS/RDMA client-side patches for 4.11

Sorry this is later than usual. Fortunately the patches are small
and few in number.

These are bug fixes based on v4.10-rc5 + my for-4.10-rc fixes. They
are still being tested; I'm posting for review only.

I've included an updated set of patches that implement keepalive
for RPC-over-RDMA with the changes discussed recently in this
thread:

http://marc.info/?l=linux-nfs&m=148433523625936&w=2


Available in the "nfs-rdma-for-4.11" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git


Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.11


---

Chuck Lever (7):
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
xprtrdma: Handle stale connection rejection
xprtrdma: Refactor management of mw_list field
sunrpc: Allow xprt->ops->timer method to sleep
sunrpc: Enable calls to rpc_call_null_helper() from other modules
xprtrdma: Detect unreachable NFS/RDMA servers more reliably
sunrpc: Allow keepalive ping on a credit-full transport


include/linux/sunrpc/clnt.h | 5 ++
include/linux/sunrpc/sched.h | 2 +
net/sunrpc/clnt.c | 21 +++++----
net/sunrpc/xprt.c | 6 ++-
net/sunrpc/xprtrdma/fmr_ops.c | 5 --
net/sunrpc/xprtrdma/frwr_ops.c | 11 ++---
net/sunrpc/xprtrdma/rpc_rdma.c | 7 +--
net/sunrpc/xprtrdma/transport.c | 63 ++++++++++++++++++++++++++
net/sunrpc/xprtrdma/verbs.c | 93 ++++++++++++++-------------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 21 +++++++++
net/sunrpc/xprtsock.c | 2 +
11 files changed, 149 insertions(+), 87 deletions(-)

--
Chuck Lever


2017-01-26 17:56:36

by Chuck Lever III

Subject: [PATCH v1 1/7] xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs

Sriharsha ([email protected]) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.

When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.

Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.

But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.

When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.

But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, all while frwr_wc_fastreg might be running.

- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.

- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.

The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. It's safe to wait until the RPC is
retransmitted.
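
The losing interleaving can be simulated in a few lines of user-space
C (a toy model of the state transitions described above; recover(),
map(), and the simulation function are illustrative names, not the
kernel code):

```c
#include <assert.h>

/* Toy model of the FRMR state machine; not the kernel code. */
enum frmr_state { FRMR_IS_INVALID, FRMR_IS_VALID, FRMR_FLUSHED_FR };

/* MR recovery: DMA unmap (counted here) and mark the MR invalid */
static void recover(enum frmr_state *state, int *dma_unmaps)
{
	(*dma_unmaps)++;
	*state = FRMR_IS_INVALID;
}

/* ->frwr_op_map re-use path: a non-INVALID MR is recovered first */
static void map(enum frmr_state *state, int *dma_unmaps)
{
	if (*state != FRMR_IS_INVALID)
		recover(state, dma_unmaps);
	*state = FRMR_IS_VALID;
}

static int simulate_losing_interleaving(void)
{
	enum frmr_state state = FRMR_IS_VALID;
	int dma_unmaps = 0;

	recover(&state, &dma_unmaps);	/* recovery thread runs first */
	state = FRMR_FLUSHED_FR;	/* frwr_wc_fastreg runs last */
	map(&state, &dma_unmaps);	/* next use sees non-INVALID */
	return dma_unmaps;		/* same MR unmapped twice */
}
```

Running the simulation returns 2: the MR is DMA unmapped once by the
recovery thread and again on re-use, which is exactly the double unmap
being reported.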

Reported-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d889883..0ce7a7b 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -781,7 +781,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
return 0;

out_unmap:
- r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
return PTR_ERR(iptr);
}



2017-01-26 17:56:36

by Chuck Lever III

Subject: [PATCH v1 2/7] xprtrdma: Handle stale connection rejection

A server rejects a connection attempt with STALE_CONNECTION when a
client attempts to connect to a working remote service, but uses a
QPN and GUID that correspond to an old connection that was
abandoned. This might occur after a client crashes and restarts.

Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
and rejection of stale connection parameters.

As an additional clean-up, remove the code that retries the
connection attempt with different ORD/IRD values. Code audit of
other ULP initiators shows no similar special case handling of
initiator_depth or responder_resources.
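
The new handling can be sketched as a small mapping function (a
user-space model only; the real constant comes from <rdma/ib_cm.h>,
and the value used here is an assumption for the sketch):

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the IB CM stale-connection reject reason; the real
 * definition lives in <rdma/ib_cm.h>. */
#define IB_CM_REJ_STALE_CONN 10

/* A stale-connection reject becomes -EAGAIN so the connect path
 * retries after the stale state is torn down; any other rejection
 * remains a hard -ECONNREFUSED. */
static int map_rejection(int cm_status)
{
	if (cm_status == IB_CM_REJ_STALE_CONN)
		return -EAGAIN;
	return -ECONNREFUSED;
}
```

In rpcrdma_ep_connect(), a rep_connected value of -EAGAIN then drives
the `goto retry` path instead of failing the connect attempt.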

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 65 +++++++++++++------------------------------
1 file changed, 20 insertions(+), 45 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 61d16c3..45db2b4 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -54,6 +54,7 @@
#include <linux/sunrpc/svc_rdma.h>
#include <asm/bitops.h>
#include <linux/module.h> /* try_module_get()/module_put() */
+#include <rdma/ib_cm.h>

#include "xprt_rdma.h"

@@ -279,7 +280,14 @@
connstate = -ENETDOWN;
goto connected;
case RDMA_CM_EVENT_REJECTED:
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+ pr_info("rpcrdma: connection to %pIS:%u on %s rejected: %s\n",
+ sap, rpc_get_port(sap), ia->ri_device->name,
+ rdma_reject_msg(id, event->status));
+#endif
connstate = -ECONNREFUSED;
+ if (event->status == IB_CM_REJ_STALE_CONN)
+ connstate = -EAGAIN;
goto connected;
case RDMA_CM_EVENT_DISCONNECTED:
connstate = -ECONNABORTED;
@@ -643,20 +651,20 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
int
rpcrdma_ep_connect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
{
+ struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
+ rx_ia);
struct rdma_cm_id *id, *old;
+ unsigned int extras;
int rc = 0;
- int retry_count = 0;

if (ep->rep_connected != 0) {
- struct rpcrdma_xprt *xprt;
retry:
dprintk("RPC: %s: reconnecting...\n", __func__);

rpcrdma_ep_disconnect(ep, ia);

- xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
- id = rpcrdma_create_id(xprt, ia,
- (struct sockaddr *)&xprt->rx_data.addr);
+ id = rpcrdma_create_id(r_xprt, ia,
+ (struct sockaddr *)&r_xprt->rx_data.addr);
if (IS_ERR(id)) {
rc = -EHOSTUNREACH;
goto out;
@@ -711,51 +719,18 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
}

wait_event_interruptible(ep->rep_connect_wait, ep->rep_connected != 0);
-
- /*
- * Check state. A non-peer reject indicates no listener
- * (ECONNREFUSED), which may be a transient state. All
- * others indicate a transport condition which has already
- * undergone a best-effort.
- */
- if (ep->rep_connected == -ECONNREFUSED &&
- ++retry_count <= RDMA_CONNECT_RETRY_MAX) {
- dprintk("RPC: %s: non-peer_reject, retry\n", __func__);
- goto retry;
- }
if (ep->rep_connected <= 0) {
- /* Sometimes, the only way to reliably connect to remote
- * CMs is to use same nonzero values for ORD and IRD. */
- if (retry_count++ <= RDMA_CONNECT_RETRY_MAX + 1 &&
- (ep->rep_remote_cma.responder_resources == 0 ||
- ep->rep_remote_cma.initiator_depth !=
- ep->rep_remote_cma.responder_resources)) {
- if (ep->rep_remote_cma.responder_resources == 0)
- ep->rep_remote_cma.responder_resources = 1;
- ep->rep_remote_cma.initiator_depth =
- ep->rep_remote_cma.responder_resources;
+ if (ep->rep_connected == -EAGAIN)
goto retry;
- }
rc = ep->rep_connected;
- } else {
- struct rpcrdma_xprt *r_xprt;
- unsigned int extras;
-
- dprintk("RPC: %s: connected\n", __func__);
-
- r_xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
- extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
-
- if (extras) {
- rc = rpcrdma_ep_post_extra_recv(r_xprt, extras);
- if (rc) {
- pr_warn("%s: rpcrdma_ep_post_extra_recv: %i\n",
- __func__, rc);
- rc = 0;
- }
- }
+ goto out;
}

+ dprintk("RPC: %s: connected\n", __func__);
+ extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
+ if (extras)
+ rpcrdma_ep_post_extra_recv(r_xprt, extras);
+
out:
if (rc)
ep->rep_connected = rc;


2017-01-26 17:56:36

by Chuck Lever III

Subject: [PATCH v1 7/7] sunrpc: Allow keepalive ping on a credit-full transport

Allow RPC-over-RDMA to send NULL pings even when the transport has
hit its credit limit.

One credit is reserved. It may be used only to send a keepalive
ping.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/sched.h | 2 ++
net/sunrpc/xprt.c | 4 ++++
net/sunrpc/xprtrdma/transport.c | 4 +++-
net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 7ba040c..a65278e 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -127,6 +127,7 @@ struct rpc_task_setup {
#define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
#define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
#define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
+#define RPC_TASK_PRIORITY 0x8000 /* skip congestion control */

#define RPC_IS_ASYNC(t) ((t)->tk_flags & RPC_TASK_ASYNC)
#define RPC_IS_SWAPPER(t) ((t)->tk_flags & RPC_TASK_SWAPPER)
@@ -135,6 +136,7 @@ struct rpc_task_setup {
#define RPC_IS_SOFT(t) ((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
#define RPC_IS_SOFTCONN(t) ((t)->tk_flags & RPC_TASK_SOFTCONN)
#define RPC_WAS_SENT(t) ((t)->tk_flags & RPC_TASK_SENT)
+#define RPC_HAS_PRIORITY(t) ((t)->tk_flags & RPC_TASK_PRIORITY)

#define RPC_TASK_RUNNING 0
#define RPC_TASK_QUEUED 1
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index b530a28..e8b7b9f 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta
{
struct rpc_rqst *req = task->tk_rqstp;

+ if (RPC_HAS_PRIORITY(task)) {
+ req->rq_cong = 0;
+ return 1;
+ }
if (req->rq_cong)
return 1;
dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = %lu\n",
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index f97c851..e43cc85 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -538,7 +538,9 @@ static void rpcrdma_keepalive_release(void *calldata)

data = xprt_get(xprt);
null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
- RPC_TASK_SOFT | RPC_TASK_ASYNC,
+ RPC_TASK_SOFT |
+ RPC_TASK_ASYNC |
+ RPC_TASK_PRIORITY,
&rpcrdma_keepalive_call_ops, data);
if (!IS_ERR(null_task))
rpc_put_task(null_task);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 7e14b73..3e218dd 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -136,19 +136,22 @@
static void
rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
{
- struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf);
struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
+ __be32 *p = rep->rr_rdmabuf->rg_base;
u32 credits;

if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
return;

- credits = be32_to_cpu(rmsgp->rm_credit);
+ credits = be32_to_cpup(p + 2);
+ if (credits > buffer->rb_max_requests)
+ credits = buffer->rb_max_requests;
+ /* Reserve one credit for keepalive ping. Guard the decrement
+ * so a zero-credit grant cannot underflow. */
+ if (credits > 1)
+ credits--;
if (credits == 0)
credits = 1; /* don't deadlock */
- else if (credits > buffer->rb_max_requests)
- credits = buffer->rb_max_requests;
-
atomic_set(&buffer->rb_credits, credits);
}

@@ -914,6 +915,8 @@ struct rpcrdma_rep *
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
int i, rc;

+ if (r_xprt->rx_data.max_requests < 2)
+ return -EINVAL;
buf->rb_max_requests = r_xprt->rx_data.max_requests;
buf->rb_bc_srv_max_requests = 0;
atomic_set(&buf->rb_credits, 1);


2017-01-26 17:56:36

by Chuck Lever III

Subject: [PATCH v1 4/7] sunrpc: Allow xprt->ops->timer method to sleep

The transport lock is needed to protect the xprt_adjust_cwnd() call
in xs_udp_timer, but it is not necessary for accessing the
rq_reply_bytes_recvd or tk_status fields.
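
The lock-scope change can be modeled in user-space C (a sketch with
illustrative names; pthread_mutex stands in for the bh-disabling
transport_lock):

```c
#include <assert.h>
#include <pthread.h>

/* Before: the caller held transport_lock across the whole ->timer
 * invocation, so the method could not sleep. After: only the cwnd
 * adjustment, which actually needs serialization, takes the lock. */
static pthread_mutex_t transport_lock = PTHREAD_MUTEX_INITIALIZER;
static long cwnd = 16;

static void udp_timer(void)
{
	/* stands in for xs_udp_timer() -> xprt_adjust_cwnd() */
	pthread_mutex_lock(&transport_lock);
	cwnd /= 2;
	pthread_mutex_unlock(&transport_lock);
}

static long run_timer_path(long reply_bytes_recvd)
{
	/* no lock here: reading the reply count and clearing
	 * status need no serialization */
	if (!reply_bytes_recvd)
		udp_timer();
	return cwnd;
}
```

A timer method that does not touch shared congestion state, like the
RDMA keepalive added later in this series, can then block freely.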

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprt.c | 2 --
net/sunrpc/xprtsock.c | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 9a6be03..b530a28 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -897,13 +897,11 @@ static void xprt_timer(struct rpc_task *task)
return;
dprintk("RPC: %5u xprt_timer\n", task->tk_pid);

- spin_lock_bh(&xprt->transport_lock);
if (!req->rq_reply_bytes_recvd) {
if (xprt->ops->timer)
xprt->ops->timer(xprt, task);
} else
task->tk_status = 0;
- spin_unlock_bh(&xprt->transport_lock);
}

/**
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index af392d9..d9bb644 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1734,7 +1734,9 @@ static void xs_udp_set_buffer_size(struct rpc_xprt *xprt, size_t sndsize, size_t
*/
static void xs_udp_timer(struct rpc_xprt *xprt, struct rpc_task *task)
{
+ spin_lock_bh(&xprt->transport_lock);
xprt_adjust_cwnd(xprt, task, -ETIMEDOUT);
+ spin_unlock_bh(&xprt->transport_lock);
}

static unsigned short xs_get_random_port(void)


2017-01-26 17:56:36

by Chuck Lever III

Subject: [PATCH v1 5/7] sunrpc: Enable calls to rpc_call_null_helper() from other modules

I'd like to emit an RPC ping from rpcrdma.ko.

authnull_ops is not visible outside the sunrpc.ko module, so fold
the common case into rpc_call_null_helper, and export it.
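
The API change follows a "NULL means use the default" pattern, which
can be sketched in standalone C (toy types only; cred,
lookup_null_cred, and put_cred are stand-ins for the sunrpc
structures, not real APIs):

```c
#include <assert.h>
#include <stddef.h>

struct cred { int refcount; };

static struct cred null_cred;

/* stands in for authnull_ops.lookup_cred(NULL, NULL, 0) */
static struct cred *lookup_null_cred(void)
{
	null_cred.refcount++;
	return &null_cred;
}

static void put_cred(struct cred *c)
{
	c->refcount--;
}

/* Callers may pass a NULL cred; the helper then supplies and
 * releases an AUTH_NULL credential itself, so out-of-module
 * callers never need authnull_ops. */
static int call_null_helper(struct cred *cred)
{
	struct cred *used = cred ? cred : lookup_null_cred();
	int ran = (used != NULL);	/* stands in for rpc_run_task() */

	if (!cred)
		put_cred(used);		/* helper owns the default cred */
	return ran;
}
```

This is why the two rpc_clnt_*_add_xprt callers below can drop their
local lookup_cred/put_rpccred pairs and simply pass NULL.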

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/clnt.h | 5 +++++
net/sunrpc/clnt.c | 21 +++++++++++----------
2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 85cc819..a79237d 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -173,6 +173,11 @@ int rpc_call_async(struct rpc_clnt *clnt,
void *calldata);
int rpc_call_sync(struct rpc_clnt *clnt,
const struct rpc_message *msg, int flags);
+struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
+ struct rpc_xprt *xprt,
+ struct rpc_cred *cred, int flags,
+ const struct rpc_call_ops *ops,
+ void *data);
struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred,
int flags);
int rpc_restart_call_prepare(struct rpc_task *);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 1efbe48..decaf97 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2520,7 +2520,6 @@ static int rpc_ping(struct rpc_clnt *clnt)
return err;
}

-static
struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
struct rpc_xprt *xprt, struct rpc_cred *cred, int flags,
const struct rpc_call_ops *ops, void *data)
@@ -2537,9 +2536,17 @@ struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
.callback_data = data,
.flags = flags,
};
+ struct rpc_task *task;

- return rpc_run_task(&task_setup_data);
+ if (!cred)
+ msg.rpc_cred = authnull_ops.lookup_cred(NULL, NULL, 0);
+ task = rpc_run_task(&task_setup_data);
+ if (!cred)
+ put_rpccred(msg.rpc_cred);
+
+ return task;
}
+EXPORT_SYMBOL_GPL(rpc_call_null_helper);

struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred, int flags)
{
@@ -2586,7 +2593,6 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt,
void *dummy)
{
struct rpc_cb_add_xprt_calldata *data;
- struct rpc_cred *cred;
struct rpc_task *task;

data = kmalloc(sizeof(*data), GFP_NOFS);
@@ -2595,11 +2601,9 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt,
data->xps = xprt_switch_get(xps);
data->xprt = xprt_get(xprt);

- cred = authnull_ops.lookup_cred(NULL, NULL, 0);
- task = rpc_call_null_helper(clnt, xprt, cred,
+ task = rpc_call_null_helper(clnt, xprt, NULL,
RPC_TASK_SOFT|RPC_TASK_SOFTCONN|RPC_TASK_ASYNC,
&rpc_cb_add_xprt_call_ops, data);
- put_rpccred(cred);
if (IS_ERR(task))
return PTR_ERR(task);
rpc_put_task(task);
@@ -2630,7 +2634,6 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt,
struct rpc_xprt *xprt,
void *data)
{
- struct rpc_cred *cred;
struct rpc_task *task;
struct rpc_add_xprt_test *xtest = (struct rpc_add_xprt_test *)data;
int status = -EADDRINUSE;
@@ -2642,11 +2645,9 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt,
goto out_err;

/* Test the connection */
- cred = authnull_ops.lookup_cred(NULL, NULL, 0);
- task = rpc_call_null_helper(clnt, xprt, cred,
+ task = rpc_call_null_helper(clnt, xprt, NULL,
RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
NULL, NULL);
- put_rpccred(cred);
if (IS_ERR(task)) {
status = PTR_ERR(task);
goto out_err;


2017-01-26 17:56:36

by Chuck Lever III

Subject: [PATCH v1 3/7] xprtrdma: Refactor management of mw_list field

Clean up some duplicate code.
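
The pattern being factored out can be modeled in user-space C (a toy
intrusive list standing in for the kernel's list_head; names are
illustrative):

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
	struct list_node *next, *prev;
};

static void list_head_init(struct list_node *head)
{
	head->next = head->prev = head;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

struct mw {
	int id;
	struct list_node mw_list;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* the two helpers the patch factors out, modeled here */
static void push_mw(struct mw *mw, struct list_node *list)
{
	list_add_tail_node(&mw->mw_list, list);
}

static struct mw *pop_mw(struct list_node *list)
{
	struct mw *mw = container_of(list->next, struct mw, mw_list);

	list_del_node(&mw->mw_list);
	return mw;
}
```

Each open-coded list_first_entry/list_del_init pair in the diff below
collapses into one pop_mw() call, and each list_add into push_mw().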

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 5 +----
net/sunrpc/xprtrdma/frwr_ops.c | 11 ++++-------
net/sunrpc/xprtrdma/rpc_rdma.c | 6 +++---
net/sunrpc/xprtrdma/verbs.c | 15 +++++----------
net/sunrpc/xprtrdma/xprt_rdma.h | 16 ++++++++++++++++
5 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 1ebb09e..59e6402 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -310,10 +310,7 @@ enum {
struct rpcrdma_mw *mw;

while (!list_empty(&req->rl_registered)) {
- mw = list_first_entry(&req->rl_registered,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
-
+ mw = rpcrdma_pop_mw(&req->rl_registered);
if (sync)
fmr_op_recover_mr(mw);
else
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 47bed53..f81dd93 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -466,8 +466,8 @@
struct ib_send_wr *first, **prev, *last, *bad_wr;
struct rpcrdma_rep *rep = req->rl_reply;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct rpcrdma_mw *mw, *tmp;
struct rpcrdma_frmr *f;
+ struct rpcrdma_mw *mw;
int count, rc;

dprintk("RPC: %s: req %p\n", __func__, req);
@@ -534,10 +534,10 @@
* them to the free MW list.
*/
unmap:
- list_for_each_entry_safe(mw, tmp, &req->rl_registered, mw_list) {
+ while (!list_empty(&req->rl_registered)) {
+ mw = rpcrdma_pop_mw(&req->rl_registered);
dprintk("RPC: %s: DMA unmapping frmr %p\n",
__func__, &mw->frmr);
- list_del_init(&mw->mw_list);
ib_dma_unmap_sg(ia->ri_device,
mw->mw_sg, mw->mw_nents, mw->mw_dir);
rpcrdma_put_mw(r_xprt, mw);
@@ -571,10 +571,7 @@
struct rpcrdma_mw *mw;

while (!list_empty(&req->rl_registered)) {
- mw = list_first_entry(&req->rl_registered,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
-
+ mw = rpcrdma_pop_mw(&req->rl_registered);
if (sync)
frwr_op_recover_mr(mw);
else
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 0ce7a7b..4276ee4 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -322,7 +322,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
false, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);

*iptr++ = xdr_one; /* item present */

@@ -390,7 +390,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
true, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);

iptr = xdr_encode_rdma_segment(iptr, mw);

@@ -455,7 +455,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
true, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);

iptr = xdr_encode_rdma_segment(iptr, mw);

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 45db2b4..7e14b73 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -775,9 +775,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)

spin_lock(&buf->rb_recovery_lock);
while (!list_empty(&buf->rb_stale_mrs)) {
- mw = list_first_entry(&buf->rb_stale_mrs,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
+ mw = rpcrdma_pop_mw(&buf->rb_stale_mrs);
spin_unlock(&buf->rb_recovery_lock);

dprintk("RPC: %s: recovering MR %p\n", __func__, mw);
@@ -795,7 +793,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;

spin_lock(&buf->rb_recovery_lock);
- list_add(&mw->mw_list, &buf->rb_stale_mrs);
+ rpcrdma_push_mw(mw, &buf->rb_stale_mrs);
spin_unlock(&buf->rb_recovery_lock);

schedule_delayed_work(&buf->rb_recovery_worker, 0);
@@ -1071,11 +1069,8 @@ struct rpcrdma_mw *
struct rpcrdma_mw *mw = NULL;

spin_lock(&buf->rb_mwlock);
- if (!list_empty(&buf->rb_mws)) {
- mw = list_first_entry(&buf->rb_mws,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
- }
+ if (!list_empty(&buf->rb_mws))
+ mw = rpcrdma_pop_mw(&buf->rb_mws);
spin_unlock(&buf->rb_mwlock);

if (!mw)
@@ -1098,7 +1093,7 @@ struct rpcrdma_mw *
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;

spin_lock(&buf->rb_mwlock);
- list_add_tail(&mw->mw_list, &buf->rb_mws);
+ rpcrdma_push_mw(mw, &buf->rb_mws);
spin_unlock(&buf->rb_mwlock);
}

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 852dd0a..171a351 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -354,6 +354,22 @@ struct rpcrdma_req {
return rqst->rq_xprtdata;
}

+static inline void
+rpcrdma_push_mw(struct rpcrdma_mw *mw, struct list_head *list)
+{
+ list_add_tail(&mw->mw_list, list);
+}
+
+static inline struct rpcrdma_mw *
+rpcrdma_pop_mw(struct list_head *list)
+{
+ struct rpcrdma_mw *mw;
+
+ mw = list_first_entry(list, struct rpcrdma_mw, mw_list);
+ list_del(&mw->mw_list);
+ return mw;
+}
+
/*
* struct rpcrdma_buffer -- holds list/queue of pre-registered memory for
* inline requests/replies, and client/server credits.


2017-01-26 17:56:36

by Chuck Lever III

Subject: [PATCH v1 6/7] xprtrdma: Detect unreachable NFS/RDMA servers more reliably

Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, emit an RPC NULL ping when an RPC retransmit
timeout occurs.

The purpose of this ping is to drive traffic on the connection to
force the transport layer to disconnect it if it is no longer
viable. Some RDMA operations are fully offloaded to the HCA and can
succeed even if the server OS has crashed. Thus the ping uses an
operation that requires the server to be responsive.
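
The one-ping-at-a-time gate can be sketched with a C11 atomic flag (a
user-space stand-in for test_and_set_bit/clear_bit on ri_flags; names
are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Only one keepalive NULL call may be outstanding per transport. */
static atomic_flag keepalive_in_flight = ATOMIC_FLAG_INIT;

static bool keepalive_try_start(void)
{
	/* like test_and_set_bit(): the caller that sees the old
	 * value clear wins the right to send the ping */
	return !atomic_flag_test_and_set(&keepalive_in_flight);
}

static void keepalive_done(void)
{
	/* like clear_bit() in the ->rpc_call_done callback */
	atomic_flag_clear(&keepalive_in_flight);
}
```

Losers of the race simply return; the ping already in flight will
either complete or force a disconnect on their behalf.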

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 61 +++++++++++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 5 +++
2 files changed, 66 insertions(+)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6990581..f97c851 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -484,6 +484,66 @@
dprintk("RPC: %s: %u\n", __func__, port);
}

+static void rpcrdma_keepalive_done(struct rpc_task *task, void *calldata)
+{
+ struct rpc_xprt *xprt = (struct rpc_xprt *)calldata;
+ struct rpcrdma_xprt *r_xprt =
+ container_of(xprt, struct rpcrdma_xprt, rx_xprt);
+
+ /* Convert to dprintk before merging */
+ pr_info("RPC: %s: keepalive ping on xprt %p, status %d\n",
+ __func__, xprt, task->tk_status);
+
+ if (task->tk_status)
+ rdma_disconnect(r_xprt->rx_ia.ri_id);
+
+ clear_bit(RPCRDMA_KEEPALIVE, &r_xprt->rx_ia.ri_flags);
+}
+
+static void rpcrdma_keepalive_release(void *calldata)
+{
+ struct rpc_xprt *xprt = (struct rpc_xprt *)calldata;
+
+ xprt_put(xprt);
+}
+
+static const struct rpc_call_ops rpcrdma_keepalive_call_ops = {
+ .rpc_call_done = rpcrdma_keepalive_done,
+ .rpc_release = rpcrdma_keepalive_release,
+};
+
+/**
+ * xprt_rdma_timer - invoked when an RPC times out
+ * @xprt: controlling RPC transport
+ * @task: RPC task that timed out
+ *
+ * Send a NULL RPC to see if the server still responds.
+ * If it doesn't, close the connection.
+ */
+static void
+xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
+{
+ struct rpcrdma_xprt *r_xprt =
+ container_of(xprt, struct rpcrdma_xprt, rx_xprt);
+ struct rpc_task *null_task;
+ void *data;
+
+ /* Ensure only one ping is sent at a time */
+ if (test_and_set_bit(RPCRDMA_KEEPALIVE, &r_xprt->rx_ia.ri_flags))
+ return;
+
+ /* Convert to dprintk before merging */
+ pr_info("RPC: %s: sending keepalive ping on xprt %p\n",
+ __func__, xprt);
+
+ data = xprt_get(xprt);
+ null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
+ RPC_TASK_SOFT | RPC_TASK_ASYNC,
+ &rpcrdma_keepalive_call_ops, data);
+ if (!IS_ERR(null_task))
+ rpc_put_task(null_task);
+}
+
static void
xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
{
@@ -780,6 +840,7 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
.alloc_slot = xprt_alloc_slot,
.release_request = xprt_release_rqst_cong, /* ditto */
.set_retrans_timeout = xprt_set_retrans_timeout_def, /* ditto */
+ .timer = xprt_rdma_timer,
.rpcbind = rpcb_getport_async, /* sunrpc/rpcb_clnt.c */
.set_port = xprt_rdma_set_port,
.connect = xprt_rdma_connect,
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 171a351..35cd0ac 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -77,11 +77,16 @@ struct rpcrdma_ia {
unsigned int ri_max_send_sges;
bool ri_reminv_expected;
bool ri_implicit_roundup;
+ unsigned long ri_flags;
enum ib_mr_type ri_mrtype;
struct ib_qp_attr ri_qp_attr;
struct ib_qp_init_attr ri_qp_init_attr;
};

+enum {
+ RPCRDMA_KEEPALIVE = 0,
+};
+
/*
* RDMA Endpoint -- one per transport instance
*/


2017-01-27 22:07:29

by Anna Schumaker

Subject: Re: [PATCH v1 5/7] sunrpc: Enable calls to rpc_call_null_helper() from other modules

Hi Chuck,

On 01/26/2017 12:56 PM, Chuck Lever wrote:
> I'd like to emit an RPC ping from rpcrdma.ko.

The patch itself looks fine, but I was wondering if you could add a few extra words to this sentence to say that it'll be for keepalive pings?

Thanks,
Anna

2017-01-27 22:53:57

by Chuck Lever III

Subject: Re: [PATCH v1 5/7] sunrpc: Enable calls to rpc_call_null_helper() from other modules


> On Jan 27, 2017, at 5:07 PM, Anna Schumaker <[email protected]> wrote:
>
> Hi Chuck,
>
> On 01/26/2017 12:56 PM, Chuck Lever wrote:
>> I'd like to emit an RPC ping from rpcrdma.ko.
>
> The patch itself looks fine, but I was wondering if you could add a few extra words to this sentence to say that it'll be for keepalive pings?

Check out:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=commit;h=c07f835228bfbe0388a4d109899377bd48dd7eff

And feel free to suggest alternate text.


>> &rpc_cb_add_xprt_call_ops, data);
>> - put_rpccred(cred);
>> if (IS_ERR(task))
>> return PTR_ERR(task);
>> rpc_put_task(task);
>> @@ -2630,7 +2634,6 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt,
>> struct rpc_xprt *xprt,
>> void *data)
>> {
>> - struct rpc_cred *cred;
>> struct rpc_task *task;
>> struct rpc_add_xprt_test *xtest = (struct rpc_add_xprt_test *)data;
>> int status = -EADDRINUSE;
>> @@ -2642,11 +2645,9 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt,
>> goto out_err;
>>
>> /* Test the connection */
>> - cred = authnull_ops.lookup_cred(NULL, NULL, 0);
>> - task = rpc_call_null_helper(clnt, xprt, cred,
>> + task = rpc_call_null_helper(clnt, xprt, NULL,
>> RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
>> NULL, NULL);
>> - put_rpccred(cred);
>> if (IS_ERR(task)) {
>> status = PTR_ERR(task);
>> goto out_err;
>>

--
Chuck Lever
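
The pattern being exported here can be sketched in userspace C. All names
below are illustrative stand-ins, not the kernel's types; only the
"NULL cred means the helper acquires and releases a default credential
itself" shape is the point, since that is what lets callers outside
sunrpc.ko avoid touching authnull_ops:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's credential machinery;
 * only the refcount dance matters for the illustration. */
struct cred { int refcount; };

static struct cred authnull_cred;	/* plays the role of authnull */

static struct cred *lookup_default_cred(void)
{
	authnull_cred.refcount++;
	return &authnull_cred;
}

static void put_cred(struct cred *c)
{
	c->refcount--;
}

/*
 * The shape of the refactor: a NULL cred tells the helper to acquire
 * (and later release) the default credential itself, so callers in
 * other modules never need access to authnull_ops directly.
 */
static int call_null_helper(struct cred *cred)
{
	struct cred *used = cred ? cred : lookup_default_cred();
	int ret = 0;	/* ... a real helper would run the RPC task here ... */

	(void)used;
	if (!cred)
		put_cred(used);
	return ret;
}
```

Callers that previously did the lookup_cred()/put_rpccred() pair around
the call simply pass NULL instead, which is exactly the simplification
the hunks above apply to rpc_clnt_test_and_add_xprt() and
rpc_clnt_setup_test_and_add_xprt().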




2017-01-30 20:29:41

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] sunrpc: Allow keepalive ping on a credit-full transport

Hi Chuck,

On 01/26/2017 12:56 PM, Chuck Lever wrote:
> Allow RPC-over-RDMA to send NULL pings even when the transport has
> hit its credit limit.
>
> One credit is reserved. It may be used only to send a keepalive
> ping.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/sched.h | 2 ++
> net/sunrpc/xprt.c | 4 ++++
> net/sunrpc/xprtrdma/transport.c | 4 +++-
> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
> 4 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
> index 7ba040c..a65278e 100644
> --- a/include/linux/sunrpc/sched.h
> +++ b/include/linux/sunrpc/sched.h
> @@ -127,6 +127,7 @@ struct rpc_task_setup {
> #define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
> #define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
> +#define RPC_TASK_PRIORITY 0x8000 /* skip congestion control */

This file also defines various RPC_PRIORITY_<whatever> values that get used for the task->tk_priority field. I wonder if there is a better name for this flag that would help keep people from confusing the two later down the line, but I'm also struggling to come up with anything better.

Thoughts?
Anna

>
> #define RPC_IS_ASYNC(t) ((t)->tk_flags & RPC_TASK_ASYNC)
> #define RPC_IS_SWAPPER(t) ((t)->tk_flags & RPC_TASK_SWAPPER)
> @@ -135,6 +136,7 @@ struct rpc_task_setup {
> #define RPC_IS_SOFT(t) ((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags & RPC_TASK_SOFTCONN)
> #define RPC_WAS_SENT(t) ((t)->tk_flags & RPC_TASK_SENT)
> +#define RPC_HAS_PRIORITY(t) ((t)->tk_flags & RPC_TASK_PRIORITY)
>
> #define RPC_TASK_RUNNING 0
> #define RPC_TASK_QUEUED 1
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index b530a28..e8b7b9f 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta
> {
> struct rpc_rqst *req = task->tk_rqstp;
>
> + if (RPC_HAS_PRIORITY(task)) {
> + req->rq_cong = 0;
> + return 1;
> + }
> if (req->rq_cong)
> return 1;
> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = %lu\n",
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index f97c851..e43cc85 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -538,7 +538,9 @@ static void rpcrdma_keepalive_release(void *calldata)
>
> data = xprt_get(xprt);
> null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
> - RPC_TASK_SOFT | RPC_TASK_ASYNC,
> + RPC_TASK_SOFT |
> + RPC_TASK_ASYNC |
> + RPC_TASK_PRIORITY,
> &rpcrdma_keepalive_call_ops, data);
> if (!IS_ERR(null_task))
> rpc_put_task(null_task);
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 7e14b73..3e218dd 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -136,19 +136,20 @@
> static void
> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
> {
> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf);
> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
> + __be32 *p = rep->rr_rdmabuf->rg_base;
> u32 credits;
>
> if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
> return;
>
> - credits = be32_to_cpu(rmsgp->rm_credit);
> + credits = be32_to_cpup(p + 2);
> + if (credits > buffer->rb_max_requests)
> + credits = buffer->rb_max_requests;
> + /* Reserve one credit for keepalive ping */
> + credits--;
> if (credits == 0)
> credits = 1; /* don't deadlock */
> - else if (credits > buffer->rb_max_requests)
> - credits = buffer->rb_max_requests;
> -
> atomic_set(&buffer->rb_credits, credits);
> }
>
> @@ -914,6 +915,8 @@ struct rpcrdma_rep *
> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> int i, rc;
>
> + if (r_xprt->rx_data.max_requests < 2)
> + return -EINVAL;
> buf->rb_max_requests = r_xprt->rx_data.max_requests;
> buf->rb_bc_srv_max_requests = 0;
> atomic_set(&buf->rb_credits, 1);
>
>
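
The reordered credit arithmetic in the verbs.c hunk above can be
sketched as a standalone function (the constant and names here are
illustrative, not the kernel's): clamp the server-granted value first,
then reserve one credit for the keepalive ping, flooring at 1 so
ordinary RPCs cannot deadlock.

```c
#include <stdint.h>

/* Illustrative stand-in for buffer->rb_max_requests. */
#define MAX_REQUESTS 32u

/* Mirrors the logic of the patched rpcrdma_update_granted_credits(). */
static uint32_t update_granted_credits(uint32_t granted)
{
	uint32_t credits = granted;

	if (credits > MAX_REQUESTS)
		credits = MAX_REQUESTS;
	credits--;		/* reserve one credit for keepalive ping */
	if (credits == 0)
		credits = 1;	/* don't deadlock */
	return credits;
}
```

Note that, as in the posted hunk, a granted value of 0 would wrap
before the zero test; the logic relies on the server granting at least
one credit, and the max_requests < 2 check added later in the patch
guarantees the reservation leaves at least one usable credit.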

2017-01-30 20:32:22

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] sunrpc: Allow keepalive ping on a credit-full transport


> On Jan 30, 2017, at 3:29 PM, Anna Schumaker <[email protected]> wrote:
>
> Hi Chuck,
>
> On 01/26/2017 12:56 PM, Chuck Lever wrote:
>> Allow RPC-over-RDMA to send NULL pings even when the transport has
>> hit its credit limit.
>>
>> One credit is reserved. It may be used only to send a keepalive
>> ping.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/sched.h | 2 ++
>> net/sunrpc/xprt.c | 4 ++++
>> net/sunrpc/xprtrdma/transport.c | 4 +++-
>> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
>> 4 files changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>> index 7ba040c..a65278e 100644
>> --- a/include/linux/sunrpc/sched.h
>> +++ b/include/linux/sunrpc/sched.h
>> @@ -127,6 +127,7 @@ struct rpc_task_setup {
>> #define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
>> #define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
>> +#define RPC_TASK_PRIORITY 0x8000 /* skip congestion control */
>
> This file also defines various RPC_PRIORITY_<whatever> values that get used for the task->tk_priority field. I wonder if there is a better name for this flag that would help keep people from confusing the two later down the line, but I'm also struggling to come up with anything better.

PRIORITY was a (lead) trial balloon, and also a sign of weak imagination.

The point of the flag is to bypass congestion control. RPC_TASK_SKIPCONG ?
Meh.



--
Chuck Lever




2017-01-30 21:36:13

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] sunrpc: Allow keepalive ping on a credit-full transport



On 01/30/2017 03:32 PM, Chuck Lever wrote:
>
>> On Jan 30, 2017, at 3:29 PM, Anna Schumaker <[email protected]> wrote:
>>
>> Hi Chuck,
>>
>> On 01/26/2017 12:56 PM, Chuck Lever wrote:
>>> Allow RPC-over-RDMA to send NULL pings even when the transport has
>>> hit its credit limit.
>>>
>>> One credit is reserved. It may be used only to send a keepalive
>>> ping.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> include/linux/sunrpc/sched.h | 2 ++
>>> net/sunrpc/xprt.c | 4 ++++
>>> net/sunrpc/xprtrdma/transport.c | 4 +++-
>>> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
>>> 4 files changed, 17 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>>> index 7ba040c..a65278e 100644
>>> --- a/include/linux/sunrpc/sched.h
>>> +++ b/include/linux/sunrpc/sched.h
>>> @@ -127,6 +127,7 @@ struct rpc_task_setup {
>>> #define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
>>> #define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
>>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
>>> +#define RPC_TASK_PRIORITY 0x8000 /* skip congestion control */
>>
>> This file also defines various RPC_PRIORITY_<whatever> values that get used for the task->tk_priority field. I wonder if there is a better name for this flag that would help keep people from confusing the two later down the line, but I'm also struggling to come up with anything better.
>
> PRIORITY was a (lead) trial balloon, and also a sign of weak imagination.
>
> The point of the flag is to bypass congestion control. RPC_TASK_SKIPCONG ?

Something like that could work, yeah.

Anna
