Hi Anna-
Starting set of NFS/RDMA client-side changes for the next merge
window. These are clean-ups and simple fixes. Let me know what you
think.
---
Chuck Lever (8):
xprtrdma: Fix latency regression on NUMA NFS/RDMA clients
xprtrdma: Remove arbitrary limit on initiator depth
xprtrdma: Remove xprt-specific connect cookie
xprtrdma: ->send_request returns -EAGAIN when there are no free MRs
xprtrdma: Reduce number of MRs created by rpcrdma_mrs_create
xprtrdma: "Support" call-only RPCs
xprtrdma: Chain Send to FastReg WRs
xprtrdma: Move creation of rl_rdmabuf to rpcrdma_create_req
include/linux/sunrpc/clnt.h | 7 +++++
net/sunrpc/sunrpc.h | 6 ----
net/sunrpc/xprtrdma/backchannel.c | 7 -----
net/sunrpc/xprtrdma/fmr_ops.c | 13 ++++++++-
net/sunrpc/xprtrdma/frwr_ops.c | 53 +++++++++++++++++++++++++------------
net/sunrpc/xprtrdma/rpc_rdma.c | 32 +++++++++++++++-------
net/sunrpc/xprtrdma/transport.c | 42 ++++++-----------------------
net/sunrpc/xprtrdma/verbs.c | 31 ++++++++++++++--------
net/sunrpc/xprtrdma/xprt_rdma.h | 4 +--
9 files changed, 108 insertions(+), 87 deletions(-)
--
Chuck Lever
With v4.15, on one of my NFS/RDMA clients I measured nearly a
doubling in the latency of small read and write system calls. There
was no change in server round-trip time. The extra latency appears
across the whole RPC execution path.
"git bisect" settled on commit ccede7598588 ("xprtrdma: Spread reply
processing over more CPUs").
After some experimentation, I found that leaving the WQ bound and
allowing the scheduler to pick the dispatch CPU seems to eliminate
the long latencies, and it does not introduce any new regressions.
The fix is implemented by reverting only the part of
commit ccede7598588 ("xprtrdma: Spread reply processing over more
CPUs") that dispatches RPC replies specifically on the CPU where the
matching RPC call was made.
Interestingly, saving the CPU number and later queuing reply
processing there was effective _only_ for NFS READ and WRITE
requests. On my NUMA client, in-kernel RPC reply processing for
asynchronous RPCs was dispatched on the same CPU where the RPC call
was made, as expected. However, synchronous RPCs seem to get their
reply dispatched on some CPU other than the one where the call was
made, every time.
Fixes: ccede7598588 ("xprtrdma: Spread reply processing over more CPUs")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 2 +-
net/sunrpc/xprtrdma/transport.c | 2 --
net/sunrpc/xprtrdma/xprt_rdma.h | 1 -
3 files changed, 1 insertion(+), 4 deletions(-)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index f0855a9..4bc0f4d 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1366,7 +1366,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
- queue_work_on(req->rl_cpu, rpcrdma_receive_wq, &rep->rr_work);
+ queue_work(rpcrdma_receive_wq, &rep->rr_work);
return;
out_badstatus:
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 4b1ecfe..f86021e 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -52,7 +52,6 @@
#include <linux/slab.h>
#include <linux/seq_file.h>
#include <linux/sunrpc/addr.h>
-#include <linux/smp.h>
#include "xprt_rdma.h"
@@ -651,7 +650,6 @@
if (!rpcrdma_get_recvbuf(r_xprt, req, rqst->rq_rcvsize, flags))
goto out_fail;
- req->rl_cpu = smp_processor_id();
req->rl_connect_cookie = 0; /* our reserved value */
rpcrdma_set_xprtdata(rqst, req);
rqst->rq_buffer = req->rl_sendbuf->rg_base;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 69883a9..430a6de 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -334,7 +334,6 @@ enum {
struct rpcrdma_buffer;
struct rpcrdma_req {
struct list_head rl_list;
- int rl_cpu;
unsigned int rl_connect_cookie;
struct rpcrdma_buffer *rl_buffer;
struct rpcrdma_rep *rl_reply;
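For context, queue_work() is just queue_work_on() with
WORK_CPU_UNBOUND, so the revert stops forcing the CPU recorded at
call time and lets the workqueue machinery choose where the reply
handler runs. As defined in include/linux/workqueue.h (current as of
this series):

static inline bool queue_work(struct workqueue_struct *wq,
			      struct work_struct *work)
{
	return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}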
Clean up: We need to check only that the value does not exceed the
range of the u8 field it's going into.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e6f84a6..7cc3465 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -589,11 +589,8 @@
/* Client offers RDMA Read but does not initiate */
ep->rep_remote_cma.initiator_depth = 0;
- if (ia->ri_device->attrs.max_qp_rd_atom > 32) /* arbitrary but <= 255 */
- ep->rep_remote_cma.responder_resources = 32;
- else
- ep->rep_remote_cma.responder_resources =
- ia->ri_device->attrs.max_qp_rd_atom;
+ ep->rep_remote_cma.responder_resources =
+ min_t(int, U8_MAX, ia->ri_device->attrs.max_qp_rd_atom);
/* Limit transport retries so client can detect server
* GID changes quickly. RPC layer handles re-establishing
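For reference, the u8 field in question lives in struct
rdma_conn_param (excerpt from include/rdma/rdma_cm.h; unrelated
fields elided):

struct rdma_conn_param {
	/* ... */
	u8 responder_resources;
	u8 initiator_depth;
	/* ... */
};

min_t(int, U8_MAX, ...) therefore clamps only when the device's
max_qp_rd_atom exceeds what the field can represent, instead of
imposing the old arbitrary cutoff of 32.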
Clean up: The generic rq_connect_cookie is sufficient to detect RPC
Call retransmission.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 6 +-----
net/sunrpc/xprtrdma/verbs.c | 2 ++
net/sunrpc/xprtrdma/xprt_rdma.h | 1 -
3 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index f86021e..47b4604 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -236,8 +236,6 @@
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
spin_lock_bh(&xprt->transport_lock);
- if (++xprt->connect_cookie == 0) /* maintain a reserved value */
- ++xprt->connect_cookie;
if (ep->rep_connected > 0) {
if (!xprt_test_and_set_connected(xprt))
xprt_wake_pending_tasks(xprt, 0);
@@ -650,7 +648,6 @@
if (!rpcrdma_get_recvbuf(r_xprt, req, rqst->rq_rcvsize, flags))
goto out_fail;
- req->rl_connect_cookie = 0; /* our reserved value */
rpcrdma_set_xprtdata(rqst, req);
rqst->rq_buffer = req->rl_sendbuf->rg_base;
rqst->rq_rbuffer = req->rl_recvbuf->rg_base;
@@ -721,9 +718,8 @@
rpcrdma_recv_buffer_get(req);
/* Must suppress retransmit to maintain credits */
- if (req->rl_connect_cookie == xprt->connect_cookie)
+ if (rqst->rq_connect_cookie == xprt->connect_cookie)
goto drop_connection;
- req->rl_connect_cookie = xprt->connect_cookie;
__set_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags);
if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 7cc3465..520e7e4 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -255,6 +255,7 @@
/* Return 1 to ensure the core destroys the id. */
return 1;
case RDMA_CM_EVENT_ESTABLISHED:
+ ++xprt->rx_xprt.connect_cookie;
connstate = 1;
rpcrdma_update_connect_private(xprt, &event->param.conn);
goto connected;
@@ -273,6 +274,7 @@
connstate = -EAGAIN;
goto connected;
case RDMA_CM_EVENT_DISCONNECTED:
+ ++xprt->rx_xprt.connect_cookie;
connstate = -ECONNABORTED;
connected:
xprt->rx_buf.rb_credits = 1;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 430a6de..29ea0b4 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -334,7 +334,6 @@ enum {
struct rpcrdma_buffer;
struct rpcrdma_req {
struct list_head rl_list;
- unsigned int rl_connect_cookie;
struct rpcrdma_buffer *rl_buffer;
struct rpcrdma_rep *rl_reply;
struct xdr_stream rl_stream;
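For context, the generic transport layer already snapshots the cookie
on each successful transmit, which is what makes the rq_connect_cookie
comparison above sufficient. A condensed sketch of the relevant step
in xprt_transmit() in net/sunrpc/xprt.c (paraphrased, not a verbatim
hunk):

	/* After ->send_request() succeeds: */
	req->rq_connect_cookie = xprt->connect_cookie;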
RPC-over-RDMA version 1 credit accounting relies on there being a
response message for every RPC Call. This means that RPC procedures
that have no reply disrupt credit accounting in just the same way a
retransmit does (a retransmit is sent precisely because no reply has
arrived). Deal with the "no reply" case the same way.
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/clnt.h | 7 +++++++
net/sunrpc/sunrpc.h | 6 ------
net/sunrpc/xprtrdma/transport.c | 6 ++++++
3 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index ed761f7..9b11b6a 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -217,5 +217,12 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *,
bool rpc_clnt_xprt_switch_has_addr(struct rpc_clnt *clnt,
const struct sockaddr *sap);
void rpc_cleanup_clids(void);
+
+static inline int rpc_reply_expected(struct rpc_task *task)
+{
+ return (task->tk_msg.rpc_proc != NULL) &&
+ (task->tk_msg.rpc_proc->p_decode != NULL);
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_SUNRPC_CLNT_H */
diff --git a/net/sunrpc/sunrpc.h b/net/sunrpc/sunrpc.h
index f2b7cb5..09a0315 100644
--- a/net/sunrpc/sunrpc.h
+++ b/net/sunrpc/sunrpc.h
@@ -37,12 +37,6 @@ struct rpc_buffer {
char data[];
};
-static inline int rpc_reply_expected(struct rpc_task *task)
-{
- return (task->tk_msg.rpc_proc != NULL) &&
- (task->tk_msg.rpc_proc->p_decode != NULL);
-}
-
static inline int sock_is_loopback(struct sock *sk)
{
struct dst_entry *dst;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 0819689..7e39faa 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -728,6 +728,12 @@
rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
rqst->rq_bytes_sent = 0;
+
+ /* An RPC with no reply will throw off credit accounting,
+ * so drop the connection to reset the credit grant.
+ */
+ if (!rpc_reply_expected(task))
+ goto drop_connection;
return 0;
failed_marshal:
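To make "call-only" concrete: a procedure declares that it expects no
reply by leaving p_decode unset in its rpc_procinfo, which is exactly
what rpc_reply_expected() tests. A hypothetical sketch (the procedure
number, encoder, and name below are made up for illustration):

static const struct rpc_procinfo example_callonly_proc = {
	.p_proc   = 42,			/* hypothetical procedure number */
	.p_encode = example_enc_args,	/* hypothetical XDR encoder */
	/* .p_decode left NULL: rpc_reply_expected() returns 0 */
	.p_name   = "EXAMPLE_CALLONLY",
};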
Currently, when the MR free list is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait of HZ >> 2 jiffies (a quarter second) before the marshal and
send is retried.
With this change, the transport instead places such an RPC task on
the pending queue, and wakes it as soon as more MRs have been
created. Creating more MRs typically takes less than a millisecond,
and this waking mechanism is less prone to deadlock.
Moreover, the waiting RPC task holds the transport's write lock,
which blocks the transport from sending any other RPCs. Faster
recovery from MR exhaustion is therefore desirable.
This is the same mechanism that the TCP transport uses when it
handles write buffer space exhaustion.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 2 +-
net/sunrpc/xprtrdma/frwr_ops.c | 2 +-
net/sunrpc/xprtrdma/rpc_rdma.c | 30 +++++++++++++++++++++---------
net/sunrpc/xprtrdma/transport.c | 3 ++-
net/sunrpc/xprtrdma/verbs.c | 3 ++-
5 files changed, 27 insertions(+), 13 deletions(-)
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index d5f95bb..629e539 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -191,7 +191,7 @@ enum {
mr = rpcrdma_mr_get(r_xprt);
if (!mr)
- return ERR_PTR(-ENOBUFS);
+ return ERR_PTR(-EAGAIN);
pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 90f688f..e21781c 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -367,7 +367,7 @@
rpcrdma_mr_defer_recovery(mr);
mr = rpcrdma_mr_get(r_xprt);
if (!mr)
- return ERR_PTR(-ENOBUFS);
+ return ERR_PTR(-EAGAIN);
} while (mr->frwr.fr_state != FRWR_IS_INVALID);
frwr = &mr->frwr;
frwr->fr_state = FRWR_IS_VALID;
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 4bc0f4d..e8adad3 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -365,7 +365,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
false, &mr);
if (IS_ERR(seg))
- return PTR_ERR(seg);
+ goto out_maperr;
rpcrdma_mr_push(mr, &req->rl_registered);
if (encode_read_segment(xdr, mr, pos) < 0)
@@ -377,6 +377,11 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
} while (nsegs);
return 0;
+
+out_maperr:
+ if (PTR_ERR(seg) == -EAGAIN)
+ xprt_wait_for_buffer_space(rqst->rq_task, NULL);
+ return PTR_ERR(seg);
}
/* Register and XDR encode the Write list. Supports encoding a list
@@ -423,7 +428,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
true, &mr);
if (IS_ERR(seg))
- return PTR_ERR(seg);
+ goto out_maperr;
rpcrdma_mr_push(mr, &req->rl_registered);
if (encode_rdma_segment(xdr, mr) < 0)
@@ -440,6 +445,11 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
*segcount = cpu_to_be32(nchunks);
return 0;
+
+out_maperr:
+ if (PTR_ERR(seg) == -EAGAIN)
+ xprt_wait_for_buffer_space(rqst->rq_task, NULL);
+ return PTR_ERR(seg);
}
/* Register and XDR encode the Reply chunk. Supports encoding an array
@@ -481,7 +491,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
true, &mr);
if (IS_ERR(seg))
- return PTR_ERR(seg);
+ goto out_maperr;
rpcrdma_mr_push(mr, &req->rl_registered);
if (encode_rdma_segment(xdr, mr) < 0)
@@ -498,6 +508,11 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
*segcount = cpu_to_be32(nchunks);
return 0;
+
+out_maperr:
+ if (PTR_ERR(seg) == -EAGAIN)
+ xprt_wait_for_buffer_space(rqst->rq_task, NULL);
+ return PTR_ERR(seg);
}
/**
@@ -724,8 +739,8 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
* Returns:
* %0 if the RPC was sent successfully,
* %-ENOTCONN if the connection was lost,
- * %-EAGAIN if not enough pages are available for on-demand reply buffer,
- * %-ENOBUFS if no MRs are available to register chunks,
+ * %-EAGAIN if the caller should call again with the same arguments,
+ * %-ENOBUFS if the caller should call again after a delay,
* %-EMSGSIZE if the transport header is too small,
* %-EIO if a permanent problem occurred while marshaling.
*/
@@ -868,10 +883,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
return 0;
out_err:
- if (ret != -ENOBUFS) {
- pr_err("rpcrdma: header marshaling failed (%d)\n", ret);
- r_xprt->rx_stats.failed_marshal_count++;
- }
+ r_xprt->rx_stats.failed_marshal_count++;
return ret;
}
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 47b4604..0819689 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -689,7 +689,8 @@
* Returns:
* %0 if the RPC message has been sent
* %-ENOTCONN if the caller should reconnect and call again
- * %-ENOBUFS if the caller should call again later
+ * %-EAGAIN if the caller should call again
+ * %-ENOBUFS if the caller should call again after a delay
* %-EIO if a permanent error occurred and the request was not
* sent. Do not try to send this message again.
*/
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 520e7e4..d36c18f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1048,8 +1048,9 @@ void rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc)
list_splice(&all, &buf->rb_all);
r_xprt->rx_stats.mrs_allocated += count;
spin_unlock(&buf->rb_mrlock);
-
trace_xprtrdma_createmrs(r_xprt, count);
+
+ xprt_write_space(&r_xprt->rx_xprt);
}
static void
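In outline, the new wait/wake pairing added above works like this (a
condensed restatement of the hunks in this patch, not new code):

	/* Wait side, in the chunk encoders: no MR was available,
	 * so park the sending task on the transport's pending queue.
	 */
	if (PTR_ERR(seg) == -EAGAIN)
		xprt_wait_for_buffer_space(rqst->rq_task, NULL);

	/* Wake side, at the end of rpcrdma_mrs_create(): fresh MRs
	 * now exist, so wake any tasks waiting for them.
	 */
	xprt_write_space(&r_xprt->rx_xprt);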
With FRWR, the client transport can perform memory registration and
post a Send with just a single ib_post_send.
This reduces contention between the send_request path and the Send
Completion handlers, and reduces the overhead of registering a chunk
that has multiple segments.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 11 ++++++++
net/sunrpc/xprtrdma/frwr_ops.c | 51 +++++++++++++++++++++++++++------------
net/sunrpc/xprtrdma/verbs.c | 3 +-
net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
4 files changed, 49 insertions(+), 18 deletions(-)
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 629e539..5cc68a8 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -251,6 +251,16 @@ enum {
return ERR_PTR(-EIO);
}
+/* Post Send WR containing the RPC Call message.
+ */
+static int
+fmr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
+{
+ struct ib_send_wr *bad_wr;
+
+ return ib_post_send(ia->ri_id->qp, &req->rl_sendctx->sc_wr, &bad_wr);
+}
+
/* Invalidate all memory regions that were registered for "req".
*
* Sleeps until it is safe for the host CPU to access the
@@ -305,6 +315,7 @@ enum {
const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+ .ro_send = fmr_op_send,
.ro_unmap_sync = fmr_op_unmap_sync,
.ro_recover_mr = fmr_op_recover_mr,
.ro_open = fmr_op_open,
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index e21781c..c5743a0 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -357,8 +357,7 @@
struct rpcrdma_mr *mr;
struct ib_mr *ibmr;
struct ib_reg_wr *reg_wr;
- struct ib_send_wr *bad_wr;
- int rc, i, n;
+ int i, n;
u8 key;
mr = NULL;
@@ -407,22 +406,12 @@
ib_update_fast_reg_key(ibmr, ++key);
reg_wr = &frwr->fr_regwr;
- reg_wr->wr.next = NULL;
- reg_wr->wr.opcode = IB_WR_REG_MR;
- frwr->fr_cqe.done = frwr_wc_fastreg;
- reg_wr->wr.wr_cqe = &frwr->fr_cqe;
- reg_wr->wr.num_sge = 0;
- reg_wr->wr.send_flags = 0;
reg_wr->mr = ibmr;
reg_wr->key = ibmr->rkey;
reg_wr->access = writing ?
IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
IB_ACCESS_REMOTE_READ;
- rc = ib_post_send(ia->ri_id->qp, &reg_wr->wr, &bad_wr);
- if (rc)
- goto out_senderr;
-
mr->mr_handle = ibmr->rkey;
mr->mr_length = ibmr->length;
mr->mr_offset = ibmr->iova;
@@ -442,11 +431,40 @@
frwr->fr_mr, n, mr->mr_nents);
rpcrdma_mr_defer_recovery(mr);
return ERR_PTR(-EIO);
+}
-out_senderr:
- pr_err("rpcrdma: FRWR registration ib_post_send returned %i\n", rc);
- rpcrdma_mr_defer_recovery(mr);
- return ERR_PTR(-ENOTCONN);
+/* Post Send WR containing the RPC Call message.
+ *
+ * For FRMR, chain any FastReg WRs to the Send WR. Only a
+ * single ib_post_send call is needed to register memory
+ * and then post the Send WR.
+ */
+static int
+frwr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
+{
+ struct ib_send_wr *post_wr, *bad_wr;
+ struct rpcrdma_mr *mr;
+
+ post_wr = &req->rl_sendctx->sc_wr;
+ list_for_each_entry(mr, &req->rl_registered, mr_list) {
+ struct rpcrdma_frwr *frwr;
+
+ frwr = &mr->frwr;
+
+ frwr->fr_cqe.done = frwr_wc_fastreg;
+ frwr->fr_regwr.wr.next = post_wr;
+ frwr->fr_regwr.wr.wr_cqe = &frwr->fr_cqe;
+ frwr->fr_regwr.wr.num_sge = 0;
+ frwr->fr_regwr.wr.opcode = IB_WR_REG_MR;
+ frwr->fr_regwr.wr.send_flags = 0;
+
+ post_wr = &frwr->fr_regwr.wr;
+ }
+
+ /* If ib_post_send fails, the next ->send_request for
+ * @req will queue these MWs for recovery.
+ */
+ return ib_post_send(ia->ri_id->qp, post_wr, &bad_wr);
}
/* Handle a remotely invalidated mr on the @mrs list
@@ -561,6 +579,7 @@
const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
+ .ro_send = frwr_op_send,
.ro_reminv = frwr_op_reminv,
.ro_unmap_sync = frwr_op_unmap_sync,
.ro_recover_mr = frwr_op_recover_mr,
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ab86724..626fd30 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1535,7 +1535,6 @@ struct rpcrdma_regbuf *
struct rpcrdma_req *req)
{
struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
- struct ib_send_wr *send_wr_fail;
int rc;
if (req->rl_reply) {
@@ -1554,7 +1553,7 @@ struct rpcrdma_regbuf *
--ep->rep_send_count;
}
- rc = ib_post_send(ia->ri_id->qp, send_wr, &send_wr_fail);
+ rc = ia->ri_ops->ro_send(ia, req);
trace_xprtrdma_post_send(req, rc);
if (rc)
return -ENOTCONN;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 29ea0b4..3d3b423 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -472,6 +472,8 @@ struct rpcrdma_memreg_ops {
(*ro_map)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *, int, bool,
struct rpcrdma_mr **);
+ int (*ro_send)(struct rpcrdma_ia *ia,
+ struct rpcrdma_req *req);
void (*ro_reminv)(struct rpcrdma_rep *rep,
struct list_head *mrs);
void (*ro_unmap_sync)(struct rpcrdma_xprt *,
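The chain that frwr_op_send() builds is assembled back-to-front; for
a request with two registered MRs, the single verb call hands the HCA
something like this (a sketch, ignoring list-ordering details):

	/*
	 *   post_wr --> REG_MR(mr2) --> REG_MR(mr1) --> Send WR
	 *
	 * WRs on a QP's send queue execute in order, so both FastReg
	 * WRs are processed before the Send that depends on them.
	 */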
Create fewer MRs at a time: many workloads don't need 32 MRs per
batch, and the transport can now quickly restock the MR free list
when it runs low.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index d36c18f..ab86724 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1023,7 +1023,7 @@ void rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc)
LIST_HEAD(free);
LIST_HEAD(all);
- for (count = 0; count < 32; count++) {
+ for (count = 0; count < 3; count++) {
struct rpcrdma_mr *mr;
int rc;
Refactor: Both rpcrdma_create_req call sites have to allocate the
buffer where the transport header is built, so move that allocation
into rpcrdma_create_req itself.
This buffer is a fixed size. No information needed for this
allocation is available in call_allocate that is not also available
when the transport is created.
The original purpose of allocating these buffers on demand was to
reduce the chance that an allocation failure during transport
creation would hork the mount operation in low-memory scenarios.
Some relief for this rare possibility is coming up in the next few
patches.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/backchannel.c | 7 -------
net/sunrpc/xprtrdma/transport.c | 25 -------------------------
net/sunrpc/xprtrdma/verbs.c | 14 ++++++++++++--
3 files changed, 12 insertions(+), 34 deletions(-)
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index ed1a4a3..47ebac9 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -44,13 +44,6 @@ static int rpcrdma_bc_setup_rqst(struct rpcrdma_xprt *r_xprt,
if (IS_ERR(req))
return PTR_ERR(req);
- rb = rpcrdma_alloc_regbuf(RPCRDMA_HDRBUF_SIZE,
- DMA_TO_DEVICE, GFP_KERNEL);
- if (IS_ERR(rb))
- goto out_fail;
- req->rl_rdmabuf = rb;
- xdr_buf_init(&req->rl_hdrbuf, rb->rg_base, rdmab_length(rb));
-
size = r_xprt->rx_data.inline_rsize;
rb = rpcrdma_alloc_regbuf(size, DMA_TO_DEVICE, GFP_KERNEL);
if (IS_ERR(rb))
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 7e39faa..67e4386 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -537,29 +537,6 @@
}
}
-/* Allocate a fixed-size buffer in which to construct and send the
- * RPC-over-RDMA header for this request.
- */
-static bool
-rpcrdma_get_rdmabuf(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
- gfp_t flags)
-{
- size_t size = RPCRDMA_HDRBUF_SIZE;
- struct rpcrdma_regbuf *rb;
-
- if (req->rl_rdmabuf)
- return true;
-
- rb = rpcrdma_alloc_regbuf(size, DMA_TO_DEVICE, flags);
- if (IS_ERR(rb))
- return false;
-
- r_xprt->rx_stats.hardway_register_count += size;
- req->rl_rdmabuf = rb;
- xdr_buf_init(&req->rl_hdrbuf, rb->rg_base, rdmab_length(rb));
- return true;
-}
-
static bool
rpcrdma_get_sendbuf(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
size_t size, gfp_t flags)
@@ -641,8 +618,6 @@
if (RPC_IS_SWAPPER(task))
flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
- if (!rpcrdma_get_rdmabuf(r_xprt, req, flags))
- goto out_fail;
if (!rpcrdma_get_sendbuf(r_xprt, req, rqst->rq_callsize, flags))
goto out_fail;
if (!rpcrdma_get_recvbuf(r_xprt, req, rqst->rq_rcvsize, flags))
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 626fd30..6a7a5a2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1068,17 +1068,27 @@ struct rpcrdma_req *
rpcrdma_create_req(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_buffer *buffer = &r_xprt->rx_buf;
+ struct rpcrdma_regbuf *rb;
struct rpcrdma_req *req;
req = kzalloc(sizeof(*req), GFP_KERNEL);
if (req == NULL)
return ERR_PTR(-ENOMEM);
+ rb = rpcrdma_alloc_regbuf(RPCRDMA_HDRBUF_SIZE,
+ DMA_TO_DEVICE, GFP_KERNEL);
+ if (IS_ERR(rb)) {
+ kfree(req);
+ return ERR_PTR(-ENOMEM);
+ }
+ req->rl_rdmabuf = rb;
+ xdr_buf_init(&req->rl_hdrbuf, rb->rg_base, rdmab_length(rb));
+ req->rl_buffer = buffer;
+ INIT_LIST_HEAD(&req->rl_registered);
+
spin_lock(&buffer->rb_reqslock);
list_add(&req->rl_all, &buffer->rb_allreqs);
spin_unlock(&buffer->rb_reqslock);
- req->rl_buffer = &r_xprt->rx_buf;
- INIT_LIST_HEAD(&req->rl_registered);
return req;
}
On 02/28/2018 03:30 PM, Chuck Lever wrote:
> With FRWR, the client transport can perform memory registration and
> post a Send with just a single ib_post_send.
>
> This reduces contention between the send_request path and the Send
> Completion handlers, and reduces the overhead of registering a chunk
> that has multiple segments.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/fmr_ops.c | 11 ++++++++
> net/sunrpc/xprtrdma/frwr_ops.c | 51 +++++++++++++++++++++++++++------------
> net/sunrpc/xprtrdma/verbs.c | 3 +-
> net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
> 4 files changed, 49 insertions(+), 18 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index 629e539..5cc68a8 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -251,6 +251,16 @@ enum {
> return ERR_PTR(-EIO);
> }
>
> +/* Post Send WR containing the RPC Call message.
> + */
> +static int
> +fmr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
> +{
> + struct ib_send_wr *bad_wr;
> +
> + return ib_post_send(ia->ri_id->qp, &req->rl_sendctx->sc_wr, &bad_wr);
I wish there was a bad_wr null-check in ib_post_send() (or in the
infiniband drivers) so you don't have to declare a variable that's
never used again. Coordinating that might be more work than it's
worth, though.
Anna
> +}
> +
> /* Invalidate all memory regions that were registered for "req".
> *
> * Sleeps until it is safe for the host CPU to access the
> @@ -305,6 +315,7 @@ enum {
>
> const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
> .ro_map = fmr_op_map,
> + .ro_send = fmr_op_send,
> .ro_unmap_sync = fmr_op_unmap_sync,
> .ro_recover_mr = fmr_op_recover_mr,
> .ro_open = fmr_op_open,
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index e21781c..c5743a0 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -357,8 +357,7 @@
> struct rpcrdma_mr *mr;
> struct ib_mr *ibmr;
> struct ib_reg_wr *reg_wr;
> - struct ib_send_wr *bad_wr;
> - int rc, i, n;
> + int i, n;
> u8 key;
>
> mr = NULL;
> @@ -407,22 +406,12 @@
> ib_update_fast_reg_key(ibmr, ++key);
>
> reg_wr = &frwr->fr_regwr;
> - reg_wr->wr.next = NULL;
> - reg_wr->wr.opcode = IB_WR_REG_MR;
> - frwr->fr_cqe.done = frwr_wc_fastreg;
> - reg_wr->wr.wr_cqe = &frwr->fr_cqe;
> - reg_wr->wr.num_sge = 0;
> - reg_wr->wr.send_flags = 0;
> reg_wr->mr = ibmr;
> reg_wr->key = ibmr->rkey;
> reg_wr->access = writing ?
> IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
> IB_ACCESS_REMOTE_READ;
>
> - rc = ib_post_send(ia->ri_id->qp, &reg_wr->wr, &bad_wr);
> - if (rc)
> - goto out_senderr;
> -
> mr->mr_handle = ibmr->rkey;
> mr->mr_length = ibmr->length;
> mr->mr_offset = ibmr->iova;
> @@ -442,11 +431,40 @@
> frwr->fr_mr, n, mr->mr_nents);
> rpcrdma_mr_defer_recovery(mr);
> return ERR_PTR(-EIO);
> +}
>
> -out_senderr:
> - pr_err("rpcrdma: FRWR registration ib_post_send returned %i\n", rc);
> - rpcrdma_mr_defer_recovery(mr);
> - return ERR_PTR(-ENOTCONN);
> +/* Post Send WR containing the RPC Call message.
> + *
> + * For FRMR, chain any FastReg WRs to the Send WR. Only a
> + * single ib_post_send call is needed to register memory
> + * and then post the Send WR.
> + */
> +static int
> +frwr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
> +{
> + struct ib_send_wr *post_wr, *bad_wr;
> + struct rpcrdma_mr *mr;
> +
> + post_wr = &req->rl_sendctx->sc_wr;
> + list_for_each_entry(mr, &req->rl_registered, mr_list) {
> + struct rpcrdma_frwr *frwr;
> +
> + frwr = &mr->frwr;
> +
> + frwr->fr_cqe.done = frwr_wc_fastreg;
> + frwr->fr_regwr.wr.next = post_wr;
> + frwr->fr_regwr.wr.wr_cqe = &frwr->fr_cqe;
> + frwr->fr_regwr.wr.num_sge = 0;
> + frwr->fr_regwr.wr.opcode = IB_WR_REG_MR;
> + frwr->fr_regwr.wr.send_flags = 0;
> +
> + post_wr = &frwr->fr_regwr.wr;
> + }
> +
> + /* If ib_post_send fails, the next ->send_request for
> + * @req will queue these MWs for recovery.
> + */
> + return ib_post_send(ia->ri_id->qp, post_wr, &bad_wr);
> }
>
> /* Handle a remotely invalidated mr on the @mrs list
> @@ -561,6 +579,7 @@
>
> const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
> .ro_map = frwr_op_map,
> + .ro_send = frwr_op_send,
> .ro_reminv = frwr_op_reminv,
> .ro_unmap_sync = frwr_op_unmap_sync,
> .ro_recover_mr = frwr_op_recover_mr,
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index ab86724..626fd30 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1535,7 +1535,6 @@ struct rpcrdma_regbuf *
> struct rpcrdma_req *req)
> {
> struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
> - struct ib_send_wr *send_wr_fail;
> int rc;
>
> if (req->rl_reply) {
> @@ -1554,7 +1553,7 @@ struct rpcrdma_regbuf *
> --ep->rep_send_count;
> }
>
> - rc = ib_post_send(ia->ri_id->qp, send_wr, &send_wr_fail);
> + rc = ia->ri_ops->ro_send(ia, req);
> trace_xprtrdma_post_send(req, rc);
> if (rc)
> return -ENOTCONN;
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 29ea0b4..3d3b423 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -472,6 +472,8 @@ struct rpcrdma_memreg_ops {
> (*ro_map)(struct rpcrdma_xprt *,
> struct rpcrdma_mr_seg *, int, bool,
> struct rpcrdma_mr **);
> + int (*ro_send)(struct rpcrdma_ia *ia,
> + struct rpcrdma_req *req);
> void (*ro_reminv)(struct rpcrdma_rep *rep,
> struct list_head *mrs);
> void (*ro_unmap_sync)(struct rpcrdma_xprt *,
>
On Wed, Feb 28, 2018 at 04:51:11PM -0500, Anna Schumaker wrote:
>
>
> On 02/28/2018 03:30 PM, Chuck Lever wrote:
> > With FRWR, the client transport can perform memory registration and
> > post a Send with just a single ib_post_send.
> >
> > This reduces contention between the send_request path and the Send
> > Completion handlers, and reduces the overhead of registering a chunk
> > that has multiple segments.
> >
> > Signed-off-by: Chuck Lever <[email protected]>
> > net/sunrpc/xprtrdma/fmr_ops.c | 11 ++++++++
> > net/sunrpc/xprtrdma/frwr_ops.c | 51 +++++++++++++++++++++++++++------------
> > net/sunrpc/xprtrdma/verbs.c | 3 +-
> > net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
> > 4 files changed, 49 insertions(+), 18 deletions(-)
> >
> > diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> > index 629e539..5cc68a8 100644
> > +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> > @@ -251,6 +251,16 @@ enum {
> > return ERR_PTR(-EIO);
> > }
> >
> > +/* Post Send WR containing the RPC Call message.
> > + */
> > +static int
> > +fmr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
> > +{
> > + struct ib_send_wr *bad_wr;
> > +
> > + return ib_post_send(ia->ri_id->qp, &req->rl_sendctx->sc_wr, &bad_wr);
>
> I wish there was a bad_wr null-check in ib_post_send() (or in the
> infiniband drivers) so you don't have to declare a variable that's
> never used again. Coordinating that might be more work than it's
> worth, though.
It is a good point; I actually don't think we have any user of
bad_wr in the kernel.
Would prefer to just drop the parameter and add a new function call if
really, really needed.
Jason
> On Feb 28, 2018, at 5:59 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Wed, Feb 28, 2018 at 04:51:11PM -0500, Anna Schumaker wrote:
>>
>>
>> On 02/28/2018 03:30 PM, Chuck Lever wrote:
>>> With FRWR, the client transport can perform memory registration and
>>> post a Send with just a single ib_post_send.
>>>
>>> This reduces contention between the send_request path and the Send
>>> Completion handlers, and reduces the overhead of registering a chunk
>>> that has multiple segments.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> net/sunrpc/xprtrdma/fmr_ops.c | 11 ++++++++
>>> net/sunrpc/xprtrdma/frwr_ops.c | 51 +++++++++++++++++++++++++++------------
>>> net/sunrpc/xprtrdma/verbs.c | 3 +-
>>> net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
>>> 4 files changed, 49 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>>> index 629e539..5cc68a8 100644
>>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>>> @@ -251,6 +251,16 @@ enum {
>>> return ERR_PTR(-EIO);
>>> }
>>>
>>> +/* Post Send WR containing the RPC Call message.
>>> + */
>>> +static int
>>> +fmr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
>>> +{
>>> + struct ib_send_wr *bad_wr;
>>> +
>>> + return ib_post_send(ia->ri_id->qp, &req->rl_sendctx->sc_wr, &bad_wr);
>>
>> I wish there was a bad_wr null-check in ib_post_send() (or in the
>> infiniband drivers) so you don't have to declare a variable that's
>> never used again. Coordinating that might be more work than it's
>> worth, though.
>
> It is a good point; I actually don't think we have any user of
> bad_wr in the kernel.
Yes, frwr_op_unmap_sync() uses the 3rd argument, and I'm about to add
a call site for ib_post_recv which uses that argument as well.
> Would prefer to just drop the parameter and add a new function call if
> really, really needed.
You could do something like
ib_post_send_one(qp, wr);
for the common case; and likewise for ib_post_recv.
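A minimal sketch of such a helper (hypothetical; shown only to
illustrate the shape, not proposing this exact code):

static inline int
ib_post_send_one(struct ib_qp *qp, struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;

	return ib_post_send(qp, wr, &bad_wr);
}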
--
Chuck Lever