Here are patches to support server-side bi-directional RPC/RDMA
operation (to enable NFSv4.1 on RPC/RDMA transports). These still
need testing, but they are ready for initial review.
Also available in the "nfsd-rdma-for-4.5" topic branch of this git repo:
git://git.linux-nfs.org/projects/cel/cel-2.6.git
Or for browsing:
http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.5
---
Chuck Lever (8):
svcrdma: Do not send XDR roundup bytes for a write chunk
svcrdma: Define maximum number of backchannel requests
svcrdma: Add svc_rdma_get_context() API that is allowed to fail
svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls
svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies
xprtrdma: Add class for RDMA backwards direction transport
svcrdma: No need to count WRs in svc_rdma_send()
svcrdma: Remove svc_rdma_fastreg_mr::access_flags field
include/linux/sunrpc/svc_rdma.h | 12 +
include/linux/sunrpc/xprt.h | 1
net/sunrpc/xprt.c | 1
net/sunrpc/xprtrdma/rpc_rdma.c | 76 +++++++++
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 72 ++++++++-
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 74 +++++++++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 65 ++++++--
net/sunrpc/xprtrdma/transport.c | 243 ++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 6 +
9 files changed, 519 insertions(+), 31 deletions(-)
--
Chuck Lever
Minor optimization: when dealing with write chunk XDR roundup, do
not post a Write WR for the zero bytes in the pad. Simply update
the write segment in the RPC-over-RDMA header to reflect the extra
pad bytes.
The Reply chunk is also a write chunk, but the server does not use
send_write_chunks() to send the Reply chunk. That's OK in this case:
the server Upper Layer typically marshals the Reply chunk contents
in a single contiguous buffer, without a separate tail for the XDR
pad.
The comments and the variable naming refer to "chunks," but what is
really meant is "segments." The existing code sends only one
xdr_write_chunk per RPC reply.
The fix assumes this as well: when the XDR pad in the first write
chunk is reached, the Write list is assumed to be complete and
send_write_chunks() returns.
That will remain a valid assumption until the server Upper Layer can
support multiple bulk payload results per RPC.
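As background, XDR rounds every variable-length object up to a multiple
of four bytes, and the roundup bytes are always zero. A minimal sketch
of the pad computation (the kernel's xdr_padsize() helper is essentially
this):

    /* Sketch: number of zero bytes XDR appends after "len" bytes of
     * data so that the next XDR object starts on a 4-byte boundary.
     */
    static inline size_t xdr_pad_bytes(size_t len)
    {
            return (len & 3) ? 4 - (len & 3) : 0;
    }

So a segment carrying a 3-byte payload has one pad byte; with this
patch the pad is still reflected in the segment length reported in the
RPC-over-RDMA header, but no Write WR is posted for it.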
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 969a1ab..bad5eaa 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
arg_ch->rs_handle,
arg_ch->rs_offset,
write_len);
+
+ /* Do not send XDR pad bytes */
+ if (chunk_no && write_len < 4) {
+ chunk_no++;
+ break;
+ }
+
chunk_off = 0;
while (write_len) {
ret = send_write(xprt, rqstp,
Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set a limit.
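To make the arithmetic concrete, here is an illustrative sketch of the
queue sizing with the default limits (svcrdma_max_requests = 32 and the
backchannel allowance of 8 defined below); the names are shortened for
illustration only:

    /* Illustrative only: WR budgets after reserving a fixed
     * backchannel allowance when the QP is created.
     */
    #define FWD_REQUESTS    32      /* svcrdma_max_requests default */
    #define BC_REQUESTS     8       /* RPCRDMA_MAX_BC_REQUESTS */
    #define SQ_DEPTH_MULT   8       /* RPCRDMA_SQ_DEPTH_MULT */

    static unsigned int example_max_send_wr(void)
    {
            unsigned int max_requests = FWD_REQUESTS + BC_REQUESTS; /* 40 */
            unsigned int sq_depth = SQ_DEPTH_MULT * max_requests;   /* 320 */

            return sq_depth + BC_REQUESTS;                          /* 328 */
    }

    static unsigned int example_max_recv_wr(void)
    {
            return (FWD_REQUESTS + BC_REQUESTS) + BC_REQUESTS;      /* 48 */
    }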
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 5 +++++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 6 +++++-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f869807..478aa30 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -178,6 +178,11 @@ struct svcxprt_rdma {
#define RPCRDMA_SQ_DEPTH_MULT 8
#define RPCRDMA_MAX_REQUESTS 32
#define RPCRDMA_MAX_REQ_SIZE 4096
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+#define RPCRDMA_MAX_BC_REQUESTS 8
+#else
+#define RPCRDMA_MAX_BC_REQUESTS 0
+#endif
#define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b348b4a..01c7b36 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
(size_t)RPCSVC_MAXPAGES);
newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
RPCSVC_MAXPAGES);
+ /* XXX: what if HCA can't support enough WRs for bc operation? */
newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
- (size_t)svcrdma_max_requests);
+ (size_t)(svcrdma_max_requests +
+ RPCRDMA_MAX_BC_REQUESTS));
newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;
/*
@@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
qp_attr.event_handler = qp_event_handler;
qp_attr.qp_context = &newxprt->sc_xprt;
qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
+ qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
+ qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
To support backward direction calls, I'm going to add an
svc_rdma_get_context() call in the client RDMA transport.
Because it is called from ->buf_alloc(), it can't sleep waiting for
memory. So add an API that can get a server op_ctxt but won't sleep.
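A sketch of the expected caller pattern, assuming the GFP flags the
backchannel ->buf_alloc() path would pass; unlike svc_rdma_get_context(),
the caller must handle a NULL return:

    /* Sketch: allocate an op_ctxt from a context that must not
     * sleep. On failure the caller backs off instead of blocking.
     */
    static struct svc_rdma_op_ctxt *bc_alloc_ctxt(struct svcxprt_rdma *rdma)
    {
            struct svc_rdma_op_ctxt *ctxt;

            ctxt = svc_rdma_get_context_gfp(rdma, GFP_NOIO | __GFP_NOWARN);
            if (!ctxt)
                    return NULL;    /* tell the RPC layer to retry later */
            /* caller fills in ctxt->pages[] and ctxt->count */
            return ctxt;
    }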
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 2 ++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 28 +++++++++++++++++++++++-----
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 478aa30..0355067 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -222,6 +222,8 @@ extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
extern int svc_rdma_post_recv(struct svcxprt_rdma *);
extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
+extern struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *,
+ gfp_t);
extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
extern void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt);
extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 01c7b36..58ec362 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -153,17 +153,35 @@ static void svc_rdma_bc_free(struct svc_xprt *xprt)
}
#endif /* CONFIG_SUNRPC_BACKCHANNEL */
-struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+static void svc_rdma_init_context(struct svcxprt_rdma *xprt,
+ struct svc_rdma_op_ctxt *ctxt)
{
- struct svc_rdma_op_ctxt *ctxt;
-
- ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
- GFP_KERNEL | __GFP_NOFAIL);
ctxt->xprt = xprt;
INIT_LIST_HEAD(&ctxt->dto_q);
ctxt->count = 0;
ctxt->frmr = NULL;
atomic_inc(&xprt->sc_ctxt_used);
+}
+
+struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
+ gfp_t flags)
+{
+ struct svc_rdma_op_ctxt *ctxt;
+
+ ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
+ if (!ctxt)
+ return NULL;
+ svc_rdma_init_context(xprt, ctxt);
+ return ctxt;
+}
+
+struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+{
+ struct svc_rdma_op_ctxt *ctxt;
+
+ ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
+ GFP_KERNEL | __GFP_NOFAIL);
+ svc_rdma_init_context(xprt, ctxt);
return ctxt;
}
To support the NFSv4.1 backchannel on RDMA connections, add a
mechanism for sending a backwards-direction RPC/RDMA call on a
connection established by a client.
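In outline, the send path works as sketched below (condensed from
svc_rdma_bc_post_send() in this patch; DMA mapping and error handling
elided). The receive buffer is posted before the SEND so the client's
reply cannot arrive while no receive is available:

    static int bc_call_outline(struct svcxprt_rdma *rdma,
                               struct svc_rdma_op_ctxt *ctxt,
                               struct xdr_buf *sndbuf)
    {
            struct ib_send_wr send_wr;

            if (svc_rdma_post_recv(rdma))   /* catch the bc reply */
                    return -ENOTCONN;

            /* DMA-map sndbuf into ctxt->sge[0] (see the full patch) */

            memset(&send_wr, 0, sizeof(send_wr));
            send_wr.wr_id = (unsigned long)ctxt;
            send_wr.sg_list = ctxt->sge;
            send_wr.num_sge = 1;
            send_wr.opcode = IB_WR_SEND;
            send_wr.send_flags = IB_SEND_SIGNALED;
            return svc_rdma_send(rdma, &send_wr);
    }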
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 2 +
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 63 +++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 0355067..28d4e46 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -214,6 +214,8 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
extern int svc_rdma_sendto(struct svc_rqst *);
extern struct rpcrdma_read_chunk *
svc_rdma_get_read_chunk(struct rpcrdma_msg *);
+extern int svc_rdma_bc_post_send(struct svcxprt_rdma *,
+ struct svc_rdma_op_ctxt *, struct xdr_buf *);
/* svc_rdma_transport.c */
extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index bad5eaa..4fe11ea 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -648,3 +648,66 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
svc_rdma_put_context(ctxt, 0);
return ret;
}
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+/* Send a backwards direction RPC call.
+ *
+ * Caller holds the connection's mutex and has already marshaled the
+ * RPC/RDMA request. Before sending the request, this API also posts
+ * an extra receive buffer to catch the bc reply for this request.
+ */
+int svc_rdma_bc_post_send(struct svcxprt_rdma *rdma,
+ struct svc_rdma_op_ctxt *ctxt, struct xdr_buf *sndbuf)
+{
+ struct svc_rdma_req_map *vec;
+ struct ib_send_wr send_wr;
+ int ret;
+
+ vec = svc_rdma_get_req_map();
+ ret = map_xdr(rdma, sndbuf, vec);
+ if (ret)
+ goto out;
+
+ /* Post a recv buffer to handle reply for this request */
+ ret = svc_rdma_post_recv(rdma);
+ if (ret) {
+ pr_err("svcrdma: Failed to post bc receive buffer, err=%d. "
+ "Closing transport %p.\n", ret, rdma);
+ set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
+ ret = -ENOTCONN;
+ goto out;
+ }
+
+ ctxt->wr_op = IB_WR_SEND;
+ ctxt->direction = DMA_TO_DEVICE;
+ ctxt->sge[0].lkey = rdma->sc_dma_lkey;
+ ctxt->sge[0].length = sndbuf->len;
+ ctxt->sge[0].addr =
+ ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0,
+ sndbuf->len, DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(rdma->sc_cm_id->device, ctxt->sge[0].addr)) {
+ svc_rdma_unmap_dma(ctxt);
+ ret = -EIO;
+ goto out;
+ }
+ atomic_inc(&rdma->sc_dma_used);
+
+ memset(&send_wr, 0, sizeof send_wr);
+ send_wr.wr_id = (unsigned long)ctxt;
+ send_wr.sg_list = ctxt->sge;
+ send_wr.num_sge = 1;
+ send_wr.opcode = IB_WR_SEND;
+ send_wr.send_flags = IB_SEND_SIGNALED;
+
+ ret = svc_rdma_send(rdma, &send_wr);
+ if (ret) {
+ svc_rdma_unmap_dma(ctxt);
+ ret = -EIO;
+ goto out;
+ }
+out:
+ svc_rdma_put_req_map(vec);
+ pr_info("svcrdma: %s returns %d\n", __func__, ret);
+ return ret;
+}
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.
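For reference when reading the p[7] and p[8] checks below: a chunk-less
RPC/RDMA header is four 32-bit words plus three empty chunk lists, so
the embedded RPC message begins at word 7. A sketch of the layout being
tested:

    /* On-the-wire layout assumed by svc_rdma_is_backchannel_reply(),
     * as 32-bit words from the start of the received message:
     *
     *   p[0]  rm_xid         RPC/RDMA XID (copied from the RPC XID)
     *   p[1]  rm_vers        RPC/RDMA version
     *   p[2]  rm_credit      credit value
     *   p[3]  rm_type        must be rdma_msg
     *   p[4]  rm_chunks[0]   xdr_zero: empty Read list
     *   p[5]  rm_chunks[1]   xdr_zero: empty Write list
     *   p[6]  rm_chunks[2]   xdr_zero: empty Reply chunk
     *   p[7]  RPC XID        sanity: must match p[0]
     *   p[8]  RPC direction  0 = CALL, 1 = REPLY
     */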
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 76 +++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 60 ++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 4 ++
3 files changed, 140 insertions(+)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..fef0623 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -946,3 +946,79 @@ repost:
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
rpcrdma_recv_buffer_put(rep);
}
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+int
+rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+ struct xdr_buf *rcvbuf)
+{
+ struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+ struct kvec *dst, *src = &rcvbuf->head[0];
+ struct rpc_rqst *req;
+ unsigned long cwnd;
+ u32 credits;
+ size_t len;
+ __be32 xid;
+ __be32 *p;
+ int ret;
+
+ p = (__be32 *)src->iov_base;
+ len = src->iov_len;
+ xid = rmsgp->rm_xid;
+
+ pr_info("%s: xid=%08x, length=%zu\n",
+ __func__, be32_to_cpu(xid), len);
+ pr_info("%s: RPC/RDMA: %*ph\n",
+ __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+ pr_info("%s: RPC: %*ph\n",
+ __func__, (int)len, p);
+
+ ret = -EAGAIN;
+ if (src->iov_len < 24)
+ goto out_shortreply;
+
+ spin_lock_bh(&xprt->transport_lock);
+ req = xprt_lookup_rqst(xprt, xid);
+ if (!req)
+ goto out_notfound;
+
+ dst = &req->rq_private_buf.head[0];
+ memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
+ if (dst->iov_len < len)
+ goto out_unlock;
+ memcpy(dst->iov_base, p, len);
+
+ credits = be32_to_cpu(rmsgp->rm_credit);
+ if (credits == 0)
+ credits = 1; /* don't deadlock */
+ else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
+ credits = r_xprt->rx_buf.rb_bc_max_requests;
+
+ cwnd = xprt->cwnd;
+ xprt->cwnd = credits << RPC_CWNDSHIFT;
+ if (xprt->cwnd > cwnd)
+ xprt_release_rqst_cong(req->rq_task);
+
+ ret = 0;
+ xprt_complete_rqst(req->rq_task, rcvbuf->len);
+ rcvbuf->len = 0;
+
+out_unlock:
+ spin_unlock_bh(&xprt->transport_lock);
+out:
+ return ret;
+
+out_shortreply:
+ pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n",
+ xprt, src->iov_len);
+ goto out;
+
+out_notfound:
+ pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
+ xprt, be32_to_cpu(xid));
+
+ goto out_unlock;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index ff4f01e..2b762b5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -47,6 +47,7 @@
#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>
#include <linux/sunrpc/svc_rdma.h>
+#include "xprt_rdma.h"
#define RPCDBG_FACILITY RPCDBG_SVCXPRT
@@ -567,6 +568,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
return ret;
}
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* By convention, backchannel calls arrive via rdma_msg type
+ * messages, and never populate the chunk lists. This makes
+ * the RPC/RDMA header small and fixed in size, so it is
+ * straightforward to check the RPC header's direction field.
+ */
+static bool
+svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+{
+ __be32 *p = (__be32 *)rmsgp;
+
+ if (!xprt->xpt_bc_xprt)
+ return false;
+
+ if (rmsgp->rm_type != rdma_msg)
+ return false;
+ if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+ return false;
+ if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+ return false;
+ if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+ return false;
+
+ /* sanity */
+ if (p[7] != rmsgp->rm_xid)
+ return false;
+ /* call direction */
+ if (p[8] == cpu_to_be32(RPC_CALL))
+ return false;
+
+ return true;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
+
/*
* Set up the rqstp thread context to point to the RQ buffer. If
* necessary, pull additional data from the client with an RDMA_READ
@@ -632,6 +669,17 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
goto close_out;
}
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+ if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
+ ret = rpcrdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
+ &rqstp->rq_arg);
+ svc_rdma_put_context(ctxt, 0);
+ if (ret)
+ goto repost;
+ return ret;
+ }
+#endif
+
/* Read read-list data. */
ret = rdma_read_chunks(rdma_xprt, rmsgp, rqstp, ctxt);
if (ret > 0) {
@@ -668,4 +716,16 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
set_bit(XPT_CLOSE, &xprt->xpt_flags);
defer:
return 0;
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+repost:
+ ret = svc_rdma_post_recv(rdma_xprt);
+ if (ret) {
+ pr_info("svcrdma: could not post a receive buffer, err=%d."
+ "Closing transport %p.\n", ret, rdma_xprt);
+ set_bit(XPT_CLOSE, &rdma_xprt->sc_xprt.xpt_flags);
+ ret = -ENOTCONN;
+ }
+ return ret;
+#endif
}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ac7f8d4..9aeff2b 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -309,6 +309,8 @@ struct rpcrdma_buffer {
u32 rb_bc_srv_max_requests;
spinlock_t rb_reqslock; /* protect rb_allreqs */
struct list_head rb_allreqs;
+
+ u32 rb_bc_max_requests;
};
#define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)
@@ -511,6 +513,8 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *);
* RPC/RDMA protocol calls - xprtrdma/rpc_rdma.c
*/
int rpcrdma_marshal_req(struct rpc_rqst *);
+int rpcrdma_handle_bc_reply(struct rpc_xprt *, struct rpcrdma_msg *,
+ struct xdr_buf *);
/* RPC/RDMA module init - xprtrdma/transport.c
*/
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class for backwards direction
operation.
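For context, here is a hypothetical sketch of how an upper layer could
instantiate this class for an established svc connection through the
generic transport creation path (the actual call site is not part of
this patch):

    /* Hypothetical: create the RDMA backchannel rpc_xprt for an
     * existing server-side connection.
     */
    static struct rpc_xprt *bc_create_example(struct svc_xprt *bc_xprt,
                                              struct sockaddr *addr,
                                              size_t addrlen,
                                              struct net *net)
    {
            struct xprt_create args = {
                    .ident   = XPRT_TRANSPORT_BC_RDMA,
                    .net     = net,
                    .dstaddr = addr,
                    .addrlen = addrlen,
                    .bc_xprt = bc_xprt,
            };

            return xprt_create_transport(&args);
    }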
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/xprt.h | 1
net/sunrpc/xprt.c | 1
net/sunrpc/xprtrdma/svc_rdma_transport.c | 14 +-
net/sunrpc/xprtrdma/transport.c | 243 ++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 2
5 files changed, 256 insertions(+), 5 deletions(-)
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..7637ccd 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -85,6 +85,7 @@ struct rpc_rqst {
__u32 * rq_buffer; /* XDR encode buffer */
size_t rq_callsize,
rq_rcvsize;
+ void * rq_privdata; /* xprt-specific per-rqst data */
size_t rq_xmit_bytes_sent; /* total bytes sent */
size_t rq_reply_bytes_recvd; /* total reply bytes */
/* received */
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..37edea6 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1425,3 +1425,4 @@ void xprt_put(struct rpc_xprt *xprt)
if (atomic_dec_and_test(&xprt->count))
xprt_destroy(xprt);
}
+EXPORT_SYMBOL_GPL(xprt_put);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 58ec362..3768a7f 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1182,12 +1182,14 @@ static void __svc_rdma_free(struct work_struct *work)
{
struct svcxprt_rdma *rdma =
container_of(work, struct svcxprt_rdma, sc_work);
- dprintk("svcrdma: svc_rdma_free(%p)\n", rdma);
+ struct svc_xprt *xprt = &rdma->sc_xprt;
+
+ dprintk("svcrdma: %s(%p)\n", __func__, rdma);
/* We should only be called from kref_put */
- if (atomic_read(&rdma->sc_xprt.xpt_ref.refcount) != 0)
+ if (atomic_read(&xprt->xpt_ref.refcount) != 0)
pr_err("svcrdma: sc_xprt still in use? (%d)\n",
- atomic_read(&rdma->sc_xprt.xpt_ref.refcount));
+ atomic_read(&xprt->xpt_ref.refcount));
/*
* Destroy queued, but not processed read completions. Note
@@ -1222,6 +1224,12 @@ static void __svc_rdma_free(struct work_struct *work)
pr_err("svcrdma: dma still in use? (%d)\n",
atomic_read(&rdma->sc_dma_used));
+ /* Final put of backchannel client transport */
+ if (xprt->xpt_bc_xprt) {
+ xprt_put(xprt->xpt_bc_xprt);
+ xprt->xpt_bc_xprt = NULL;
+ }
+
/* De-allocate fastreg mr */
rdma_dealloc_frmr_q(rdma);
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..fda7488 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -51,6 +51,7 @@
#include <linux/slab.h>
#include <linux/seq_file.h>
#include <linux/sunrpc/addr.h>
+#include <linux/sunrpc/svc_rdma.h>
#include "xprt_rdma.h"
@@ -148,7 +149,10 @@ static struct ctl_table sunrpc_table[] = {
#define RPCRDMA_MAX_REEST_TO (30U * HZ)
#define RPCRDMA_IDLE_DISC_TO (5U * 60 * HZ)
-static struct rpc_xprt_ops xprt_rdma_procs; /* forward reference */
+static struct rpc_xprt_ops xprt_rdma_procs;
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+static struct rpc_xprt_ops xprt_rdma_bc_procs;
+#endif
static void
xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
@@ -499,7 +503,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
if (req == NULL)
return NULL;
- flags = GFP_NOIO | __GFP_NOWARN;
+ flags = RPCRDMA_DEF_GFP;
if (RPC_IS_SWAPPER(task))
flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
@@ -684,6 +688,199 @@ xprt_rdma_disable_swap(struct rpc_xprt *xprt)
{
}
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* Server-side transport endpoint wants a whole page for its send
+ * buffer. The client RPC code constructs the RPC header in this
+ * buffer before it invokes ->send_request.
+ */
+static void *
+xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
+{
+ struct rpc_rqst *rqst = task->tk_rqstp;
+ struct svc_rdma_op_ctxt *ctxt;
+ struct svcxprt_rdma *rdma;
+ struct svc_xprt *sxprt;
+ struct page *page;
+
+ if (size > PAGE_SIZE) {
+ WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
+ size);
+ return NULL;
+ }
+
+ page = alloc_page(RPCRDMA_DEF_GFP);
+ if (!page)
+ return NULL;
+
+ sxprt = rqst->rq_xprt->bc_xprt;
+ rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
+ ctxt = svc_rdma_get_context_gfp(rdma, RPCRDMA_DEF_GFP);
+ if (!ctxt) {
+ put_page(page);
+ return NULL;
+ }
+
+ rqst->rq_privdata = ctxt;
+ ctxt->pages[0] = page;
+ ctxt->count = 1;
+ return page_address(page);
+}
+
+static void
+xprt_rdma_bc_free(void *buffer)
+{
+ /* No-op: ctxt and page have already been freed. */
+}
+
+static int
+rpcrdma_bc_send_request(struct svcxprt_rdma *rdma, struct rpc_rqst *rqst)
+{
+ struct rpc_xprt *xprt = rqst->rq_xprt;
+ struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+ struct rpcrdma_msg *headerp = (struct rpcrdma_msg *)rqst->rq_buffer;
+ struct svc_rdma_op_ctxt *ctxt;
+ int rc;
+
+ /* Space in the send buffer for an RPC/RDMA header is reserved
+ * via xprt->tsh_size */
+ headerp->rm_xid = rqst->rq_xid;
+ headerp->rm_vers = rpcrdma_version;
+ headerp->rm_credit = cpu_to_be32(r_xprt->rx_buf.rb_bc_max_requests);
+ headerp->rm_type = rdma_msg;
+ headerp->rm_body.rm_chunks[0] = xdr_zero;
+ headerp->rm_body.rm_chunks[1] = xdr_zero;
+ headerp->rm_body.rm_chunks[2] = xdr_zero;
+
+ pr_info("%s: %*ph\n", __func__, 64, rqst->rq_buffer);
+
+ ctxt = (struct svc_rdma_op_ctxt *)rqst->rq_privdata;
+ rc = svc_rdma_bc_post_send(rdma, ctxt, &rqst->rq_snd_buf);
+ if (rc)
+ goto drop_connection;
+ return rc;
+
+drop_connection:
+ pr_info("Failed to send backwards request\n");
+ svc_rdma_put_context(ctxt, 1);
+ xprt_disconnect_done(xprt);
+ return -ENOTCONN;
+}
+
+/* Take an RPC request and sent it on the passive end of a
+ * transport connection.
+ */
+static int
+xprt_rdma_bc_send_request(struct rpc_task *task)
+{
+ struct rpc_rqst *rqst = task->tk_rqstp;
+ struct svc_xprt *sxprt = rqst->rq_xprt->bc_xprt;
+ struct svcxprt_rdma *rdma;
+ u32 len;
+
+ pr_info("%s: sending request with xid: %08x\n",
+ __func__, be32_to_cpu(rqst->rq_xid));
+
+ if (!mutex_trylock(&sxprt->xpt_mutex)) {
+ rpc_sleep_on(&sxprt->xpt_bc_pending, task, NULL);
+ if (!mutex_trylock(&sxprt->xpt_mutex))
+ return -EAGAIN;
+ rpc_wake_up_queued_task(&sxprt->xpt_bc_pending, task);
+ }
+
+ len = -ENOTCONN;
+ rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
+ if (!test_bit(XPT_DEAD, &sxprt->xpt_flags))
+ len = rpcrdma_bc_send_request(rdma, rqst);
+
+ mutex_unlock(&sxprt->xpt_mutex);
+
+ if (len < 0)
+ return len;
+ return 0;
+}
+
+static void
+xprt_rdma_bc_close(struct rpc_xprt *xprt)
+{
+ pr_info("RPC: %s: xprt %p\n", __func__, xprt);
+}
+
+static void
+xprt_rdma_bc_put(struct rpc_xprt *xprt)
+{
+ pr_info("RPC: %s: xprt %p\n", __func__, xprt);
+
+ xprt_free(xprt);
+ module_put(THIS_MODULE);
+}
+
+/* It shouldn't matter if the number of backchannel session slots
+ * doesn't match the number of RPC/RDMA credits. That just means
+ * one or the other will have extra slots that aren't used.
+ */
+static struct rpc_xprt *
+xprt_setup_rdma_bc(struct xprt_create *args)
+{
+ struct rpc_xprt *xprt;
+ struct rpcrdma_xprt *new_xprt;
+
+ if (args->addrlen > sizeof(xprt->addr)) {
+ dprintk("RPC: %s: address too large\n", __func__);
+ return ERR_PTR(-EBADF);
+ }
+
+ xprt = xprt_alloc(args->net, sizeof(*new_xprt),
+ RPCRDMA_MAX_BC_REQUESTS,
+ RPCRDMA_MAX_BC_REQUESTS);
+ if (xprt == NULL) {
+ dprintk("RPC: %s: couldn't allocate rpc_xprt\n",
+ __func__);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ xprt->timeout = &xprt_rdma_default_timeout;
+ xprt_set_bound(xprt);
+ xprt_set_connected(xprt);
+ xprt->bind_timeout = RPCRDMA_BIND_TO;
+ xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO;
+ xprt->idle_timeout = RPCRDMA_IDLE_DISC_TO;
+
+ xprt->prot = XPRT_TRANSPORT_BC_RDMA;
+ xprt->tsh_size = RPCRDMA_HDRLEN_MIN / sizeof(__be32);
+ xprt->ops = &xprt_rdma_bc_procs;
+
+ memcpy(&xprt->addr, args->dstaddr, args->addrlen);
+ xprt->addrlen = args->addrlen;
+ xprt_rdma_format_addresses(xprt, (struct sockaddr *)&xprt->addr);
+ xprt->resvport = 0;
+
+ xprt->max_payload = xprt_rdma_max_inline_read;
+
+ new_xprt = rpcx_to_rdmax(xprt);
+ new_xprt->rx_buf.rb_bc_max_requests = xprt->max_reqs;
+
+ xprt_get(xprt);
+ args->bc_xprt->xpt_bc_xprt = xprt;
+ xprt->bc_xprt = args->bc_xprt;
+
+ if (!try_module_get(THIS_MODULE))
+ goto out_fail;
+
+ /* Final put for backchannel xprt is in __svc_rdma_free */
+ xprt_get(xprt);
+ return xprt;
+
+out_fail:
+ xprt_rdma_free_addresses(xprt);
+ args->bc_xprt->xpt_bc_xprt = NULL;
+ xprt_put(xprt);
+ xprt_free(xprt);
+ return ERR_PTR(-EINVAL);
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
+
/*
* Plumbing for rpc transport switch and kernel module
*/
@@ -722,6 +919,32 @@ static struct xprt_class xprt_rdma = {
.setup = xprt_setup_rdma,
};
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+static struct rpc_xprt_ops xprt_rdma_bc_procs = {
+ .reserve_xprt = xprt_reserve_xprt_cong,
+ .release_xprt = xprt_release_xprt_cong,
+ .alloc_slot = xprt_alloc_slot,
+ .release_request = xprt_release_rqst_cong,
+ .buf_alloc = xprt_rdma_bc_allocate,
+ .buf_free = xprt_rdma_bc_free,
+ .send_request = xprt_rdma_bc_send_request,
+ .set_retrans_timeout = xprt_set_retrans_timeout_def,
+ .close = xprt_rdma_bc_close,
+ .destroy = xprt_rdma_bc_put,
+ .print_stats = xprt_rdma_print_stats
+};
+
+static struct xprt_class xprt_rdma_bc = {
+ .list = LIST_HEAD_INIT(xprt_rdma_bc.list),
+ .name = "rdma backchannel",
+ .owner = THIS_MODULE,
+ .ident = XPRT_TRANSPORT_BC_RDMA,
+ .setup = xprt_setup_rdma_bc,
+};
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
+
void xprt_rdma_cleanup(void)
{
int rc;
@@ -740,6 +963,13 @@ void xprt_rdma_cleanup(void)
rpcrdma_destroy_wq();
frwr_destroy_recovery_wq();
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+ rc = xprt_unregister_transport(&xprt_rdma_bc);
+ if (rc)
+ dprintk("RPC: %s: xprt_unregister(bc) returned %i\n",
+ __func__, rc);
+#endif
}
int xprt_rdma_init(void)
@@ -763,6 +993,15 @@ int xprt_rdma_init(void)
return rc;
}
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+ rc = xprt_register_transport(&xprt_rdma_bc);
+ if (rc) {
+ xprt_unregister_transport(&xprt_rdma);
+ frwr_destroy_recovery_wq();
+ return rc;
+ }
+#endif
+
dprintk("RPCRDMA Module Init, register RPC RDMA transport\n");
dprintk("Defaults:\n");
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 9aeff2b..485027e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -148,6 +148,8 @@ rdmab_to_msg(struct rpcrdma_regbuf *rb)
return (struct rpcrdma_msg *)rb->rg_base;
}
+#define RPCRDMA_DEF_GFP (GFP_NOIO | __GFP_NOWARN)
+
/*
* struct rpcrdma_rep -- this structure encapsulates state required to recv
* and complete a reply, asychronously. It needs several pieces of
Minor optimization: Instead of counting WRs in a chain, have callers
pass in the number of WRs they've prepared.
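As a sketch of the new convention (simplified from
rdma_read_chunk_frmr() below): a caller that links registration, read,
and invalidate WRs reports the chain length itself rather than having
svc_rdma_send() re-walk the list:

    static int post_read_chain(struct svcxprt_rdma *xprt,
                               struct ib_reg_wr *reg_wr,
                               struct ib_rdma_wr *read_wr,
                               struct ib_send_wr *inv_wr, bool need_inv)
    {
            int num_wrs = 2;                   /* REG + READ */

            reg_wr->wr.next = &read_wr->wr;
            if (need_inv) {
                    read_wr->wr.next = inv_wr; /* + LOCAL_INV */
                    num_wrs++;
            }
            return svc_rdma_send(xprt, &reg_wr->wr, num_wrs);
    }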
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 2 +-
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 9 ++++++---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 +++---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 17 ++++++-----------
4 files changed, 16 insertions(+), 18 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 28d4e46..243edf4 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -218,7 +218,7 @@ extern int svc_rdma_bc_post_send(struct svcxprt_rdma *,
struct svc_rdma_op_ctxt *, struct xdr_buf *);
/* svc_rdma_transport.c */
-extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
+extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *, int);
extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
enum rpcrdma_errcode);
extern int svc_rdma_post_recv(struct svcxprt_rdma *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 2b762b5..9480043 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -190,7 +190,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
read_wr.wr.sg_list = ctxt->sge;
read_wr.wr.num_sge = pages_needed;
- ret = svc_rdma_send(xprt, &read_wr.wr);
+ ret = svc_rdma_send(xprt, &read_wr.wr, 1);
if (ret) {
pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
@@ -227,7 +227,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
int nents = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
struct svc_rdma_op_ctxt *ctxt = svc_rdma_get_context(xprt);
struct svc_rdma_fastreg_mr *frmr = svc_rdma_get_frmr(xprt);
- int ret, read, pno, dma_nents, n;
+ int ret, read, pno, num_wrs, dma_nents, n;
u32 pg_off = *page_offset;
u32 pg_no = *page_no;
@@ -299,6 +299,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
ctxt->count = 1;
ctxt->read_hdr = head;
+ num_wrs = 2;
+
/* Prepare REG WR */
reg_wr.wr.opcode = IB_WR_REG_MR;
reg_wr.wr.wr_id = 0;
@@ -329,11 +331,12 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
inv_wr.opcode = IB_WR_LOCAL_INV;
inv_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE;
inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
+ num_wrs++;
}
ctxt->wr_op = read_wr.wr.opcode;
/* Post the chain */
- ret = svc_rdma_send(xprt, &reg_wr.wr);
+ ret = svc_rdma_send(xprt, &reg_wr.wr, num_wrs);
if (ret) {
pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 4fe11ea..97f18b5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -292,7 +292,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
/* Post It */
atomic_inc(&rdma_stat_write);
- if (svc_rdma_send(xprt, &write_wr.wr))
+ if (svc_rdma_send(xprt, &write_wr.wr, 1))
goto err;
return write_len - bc;
err:
@@ -557,7 +557,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;
- ret = svc_rdma_send(rdma, &send_wr);
+ ret = svc_rdma_send(rdma, &send_wr, 1);
if (ret)
goto err;
@@ -699,7 +699,7 @@ int svc_rdma_bc_post_send(struct svcxprt_rdma *rdma,
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;
- ret = svc_rdma_send(rdma, &send_wr);
+ ret = svc_rdma_send(rdma, &send_wr, 1);
if (ret) {
svc_rdma_unmap_dma(ctxt);
ret = -EIO;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 3768a7f..40c9c84 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1284,20 +1284,15 @@ static int svc_rdma_secure_port(struct svc_rqst *rqstp)
return 1;
}
-int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
+int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr,
+ int wr_count)
{
- struct ib_send_wr *bad_wr, *n_wr;
- int wr_count;
- int i;
- int ret;
+ struct ib_send_wr *bad_wr;
+ int i, ret;
if (test_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags))
return -ENOTCONN;
- wr_count = 1;
- for (n_wr = wr->next; n_wr; n_wr = n_wr->next)
- wr_count++;
-
/* If the SQ is full, wait until an SQ entry is available */
while (1) {
spin_lock_bh(&xprt->sc_lock);
@@ -1326,7 +1321,7 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
if (ret) {
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
atomic_sub(wr_count, &xprt->sc_sq_count);
- for (i = 0; i < wr_count; i ++)
+ for (i = 0; i < wr_count; i++)
svc_xprt_put(&xprt->sc_xprt);
dprintk("svcrdma: failed to post SQ WR rc=%d, "
"sc_sq_count=%d, sc_sq_depth=%d\n",
@@ -1384,7 +1379,7 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
err_wr.send_flags = IB_SEND_SIGNALED;
/* Post It */
- ret = svc_rdma_send(xprt, &err_wr);
+ ret = svc_rdma_send(xprt, &err_wr, 1);
if (ret) {
dprintk("svcrdma: Error %d posting send for protocol error\n",
ret);
Clean up: The access_flags field is not used outside of
rdma_read_chunk_frmr() and is always set to the same value.
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 1 -
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 +--
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 243edf4..eee2a0d 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -107,7 +107,6 @@ struct svc_rdma_fastreg_mr {
struct ib_mr *mr;
struct scatterlist *sg;
int sg_nents;
- unsigned long access_flags;
enum dma_data_direction direction;
struct list_head frmr_list;
};
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 9480043..8ab1ab5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -240,7 +240,6 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
read = min_t(int, (nents << PAGE_SHIFT) - *page_offset, rs_length);
frmr->direction = DMA_FROM_DEVICE;
- frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
frmr->sg_nents = nents;
for (pno = 0; pno < nents; pno++) {
@@ -308,7 +307,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
reg_wr.wr.num_sge = 0;
reg_wr.mr = frmr->mr;
reg_wr.key = frmr->mr->lkey;
- reg_wr.access = frmr->access_flags;
+ reg_wr.access = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
reg_wr.wr.next = &read_wr.wr;
/* Prepare RDMA_READ */
On 11/23/2015 5:20 PM, Chuck Lever wrote:
> To support the NFSv4.1 backchannel on RDMA connections, add a
> capability for receiving an RPC/RDMA reply on a connection
> established by a client.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/rpc_rdma.c | 76 +++++++++++++++++++++++++++++++
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 60 ++++++++++++++++++++++++
> net/sunrpc/xprtrdma/xprt_rdma.h | 4 ++
> 3 files changed, 140 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index c10d969..fef0623 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -946,3 +946,79 @@ repost:
> if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
> rpcrdma_recv_buffer_put(rep);
> }
> +
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +
> +int
> +rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
> + struct xdr_buf *rcvbuf)
> +{
> + struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
> + struct kvec *dst, *src = &rcvbuf->head[0];
> + struct rpc_rqst *req;
> + unsigned long cwnd;
> + u32 credits;
> + size_t len;
> + __be32 xid;
> + __be32 *p;
> + int ret;
> +
> + p = (__be32 *)src->iov_base;
> + len = src->iov_len;
> + xid = rmsgp->rm_xid;
> +
> + pr_info("%s: xid=%08x, length=%zu\n",
> + __func__, be32_to_cpu(xid), len);
> + pr_info("%s: RPC/RDMA: %*ph\n",
> + __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
> + pr_info("%s: RPC: %*ph\n",
> + __func__, (int)len, p);
> +
> + ret = -EAGAIN;
> + if (src->iov_len < 24)
> + goto out_shortreply;
> +
> + spin_lock_bh(&xprt->transport_lock);
> + req = xprt_lookup_rqst(xprt, xid);
> + if (!req)
> + goto out_notfound;
> +
> + dst = &req->rq_private_buf.head[0];
> + memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
> + if (dst->iov_len < len)
> + goto out_unlock;
> + memcpy(dst->iov_base, p, len);
> +
> + credits = be32_to_cpu(rmsgp->rm_credit);
> + if (credits == 0)
> + credits = 1; /* don't deadlock */
> + else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
> + credits = r_xprt->rx_buf.rb_bc_max_requests;
> +
> + cwnd = xprt->cwnd;
> + xprt->cwnd = credits << RPC_CWNDSHIFT;
> + if (xprt->cwnd > cwnd)
> + xprt_release_rqst_cong(req->rq_task);
> +
> + ret = 0;
> + xprt_complete_rqst(req->rq_task, rcvbuf->len);
> + rcvbuf->len = 0;
> +
> +out_unlock:
> + spin_unlock_bh(&xprt->transport_lock);
> +out:
> + return ret;
> +
> +out_shortreply:
> + pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n",
> + xprt, src->iov_len);
> + goto out;
> +
> +out_notfound:
> + pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
> + xprt, be32_to_cpu(xid));
> +
> + goto out_unlock;
> +}
> +
> +#endif /* CONFIG_SUNRPC_BACKCHANNEL */
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index ff4f01e..2b762b5 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -47,6 +47,7 @@
> #include <rdma/ib_verbs.h>
> #include <rdma/rdma_cm.h>
> #include <linux/sunrpc/svc_rdma.h>
> +#include "xprt_rdma.h"
>
> #define RPCDBG_FACILITY RPCDBG_SVCXPRT
>
> @@ -567,6 +568,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
> return ret;
> }
>
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +
> +/* By convention, backchannel calls arrive via rdma_msg type
> + * messages, and never populate the chunk lists. This makes
> + * the RPC/RDMA header small and fixed in size, so it is
> + * straightforward to check the RPC header's direction field.
> + */
> +static bool
> +svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
> +{
> + __be32 *p = (__be32 *)rmsgp;
> +
> + if (!xprt->xpt_bc_xprt)
> + return false;
> +
> + if (rmsgp->rm_type != rdma_msg)
> + return false;
> + if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
> + return false;
> + if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
> + return false;
> + if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
> + return false;
The above assertion is only true for the NFS behavior as spec'd
today (no chunk-bearing bulk data on existing backchannel NFS
protocol messages). That at least deserves a comment. Or, why
not simply ignore the chunks? They're not the receiver's problem.
> +
> + /* sanity */
> + if (p[7] != rmsgp->rm_xid)
> + return false;
> + /* call direction */
> + if (p[8] == cpu_to_be32(RPC_CALL))
> + return false;
> +
> + return true;
> +}
> +
> +#endif /* CONFIG_SUNRPC_BACKCHANNEL */
> +
> /*
> * Set up the rqstp thread context to point to the RQ buffer. If
> * necessary, pull additional data from the client with an RDMA_READ
> @@ -632,6 +669,17 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> goto close_out;
> }
>
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> + if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
> + ret = rpcrdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
> + &rqstp->rq_arg);
> + svc_rdma_put_context(ctxt, 0);
> + if (ret)
> + goto repost;
> + return ret;
> + }
> +#endif
> +
> /* Read read-list data. */
> ret = rdma_read_chunks(rdma_xprt, rmsgp, rqstp, ctxt);
> if (ret > 0) {
> @@ -668,4 +716,16 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> set_bit(XPT_CLOSE, &xprt->xpt_flags);
> defer:
> return 0;
> +
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +repost:
> + ret = svc_rdma_post_recv(rdma_xprt);
> + if (ret) {
> + pr_info("svcrdma: could not post a receive buffer, err=%d."
> + "Closing transport %p.\n", ret, rdma_xprt);
> + set_bit(XPT_CLOSE, &rdma_xprt->sc_xprt.xpt_flags);
> + ret = -ENOTCONN;
> + }
> + return ret;
> +#endif
> }
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index ac7f8d4..9aeff2b 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -309,6 +309,8 @@ struct rpcrdma_buffer {
> u32 rb_bc_srv_max_requests;
> spinlock_t rb_reqslock; /* protect rb_allreqs */
> struct list_head rb_allreqs;
> +
> + u32 rb_bc_max_requests;
> };
> #define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)
>
> @@ -511,6 +513,8 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *);
> * RPC/RDMA protocol calls - xprtrdma/rpc_rdma.c
> */
> int rpcrdma_marshal_req(struct rpc_rqst *);
> +int rpcrdma_handle_bc_reply(struct rpc_xprt *, struct rpcrdma_msg *,
> + struct xdr_buf *);
>
> /* RPC/RDMA module init - xprtrdma/transport.c
> */
>
On 11/23/2015 5:20 PM, Chuck Lever wrote:
> Extra resources for handling backchannel requests have to be
> pre-allocated when a transport instance is created. Set a limit.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc_rdma.h | 5 +++++
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 6 +++++-
> 2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index f869807..478aa30 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
> #define RPCRDMA_SQ_DEPTH_MULT 8
> #define RPCRDMA_MAX_REQUESTS 32
> #define RPCRDMA_MAX_REQ_SIZE 4096
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
Why is this a config option? Why wouldn't you always want
this? It's needed for any post-1990 NFS dialect.
> +#define RPCRDMA_MAX_BC_REQUESTS 8
Why a constant 8? The forward channel value is apparently
configurable, just a few lines down.
> +#else
> +#define RPCRDMA_MAX_BC_REQUESTS 0
> +#endif
>
> #define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index b348b4a..01c7b36 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> (size_t)RPCSVC_MAXPAGES);
> newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
> RPCSVC_MAXPAGES);
> + /* XXX: what if HCA can't support enough WRs for bc operation? */
> newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
> - (size_t)svcrdma_max_requests);
> + (size_t)(svcrdma_max_requests +
> + RPCRDMA_MAX_BC_REQUESTS));
> newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;
>
> /*
> @@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> qp_attr.event_handler = qp_event_handler;
> qp_attr.qp_context = &newxprt->sc_xprt;
> qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
> + qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS;
> qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
> + qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS;
> qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
> qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
> qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
>
On 11/23/2015 5:21 PM, Chuck Lever wrote:
> Clean up: The access_flags field is not used outside of
> rdma_read_chunk_frmr() and is always set to the same value.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc_rdma.h | 1 -
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 +--
> 2 files changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 243edf4..eee2a0d 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -107,7 +107,6 @@ struct svc_rdma_fastreg_mr {
> struct ib_mr *mr;
> struct scatterlist *sg;
> int sg_nents;
> - unsigned long access_flags;
> enum dma_data_direction direction;
> struct list_head frmr_list;
> };
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 9480043..8ab1ab5 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -240,7 +240,6 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> read = min_t(int, (nents << PAGE_SHIFT) - *page_offset, rs_length);
>
> frmr->direction = DMA_FROM_DEVICE;
> - frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
> frmr->sg_nents = nents;
>
> for (pno = 0; pno < nents; pno++) {
> @@ -308,7 +307,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> reg_wr.wr.num_sge = 0;
> reg_wr.mr = frmr->mr;
> reg_wr.key = frmr->mr->lkey;
> - reg_wr.access = frmr->access_flags;
> + reg_wr.access = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
Wait, the REMOTE_WRITE is there to support iWARP, but it isn't
needed for IB or RoCE. Shouldn't this be updated to peek at those
new attributes to decide, instead of remaining unconditional?
> reg_wr.wr.next = &read_wr.wr;
>
> /* Prepare RDMA_READ */
>
> On Nov 23, 2015, at 7:52 PM, Tom Talpey <[email protected]> wrote:
>
> On 11/23/2015 5:21 PM, Chuck Lever wrote:
>> Clean up: The access_flags field is not used outside of
>> rdma_read_chunk_frmr() and is always set to the same value.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/svc_rdma.h | 1 -
>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 +--
>> 2 files changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>> index 243edf4..eee2a0d 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -107,7 +107,6 @@ struct svc_rdma_fastreg_mr {
>> struct ib_mr *mr;
>> struct scatterlist *sg;
>> int sg_nents;
>> - unsigned long access_flags;
>> enum dma_data_direction direction;
>> struct list_head frmr_list;
>> };
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index 9480043..8ab1ab5 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -240,7 +240,6 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
>> read = min_t(int, (nents << PAGE_SHIFT) - *page_offset, rs_length);
>>
>> frmr->direction = DMA_FROM_DEVICE;
>> - frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
>> frmr->sg_nents = nents;
>>
>> for (pno = 0; pno < nents; pno++) {
>> @@ -308,7 +307,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
>> reg_wr.wr.num_sge = 0;
>> reg_wr.mr = frmr->mr;
>> reg_wr.key = frmr->mr->lkey;
>> - reg_wr.access = frmr->access_flags;
>> + reg_wr.access = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
>
> Wait, the REMOTE_WRITE is there to support iWARP, but it isn't
> needed for IB or RoCE. Shouldn't this be updated to peek at those
> new attributes to decide, instead of remaining unconditional?
That’s coming in another patch from Christoph.
>
>
>> reg_wr.wr.next = &read_wr.wr;
>>
>> /* Prepare RDMA_READ */
>>
--
Chuck Lever
On 11/23/2015 5:21 PM, Chuck Lever wrote:
> To support the server-side of an NFSv4.1 backchannel on RDMA
> connections, add a transport class for backwards direction
> operation.
So, what's special here is that it re-uses an existing forward
channel's connection? If not, it would seem unnecessary to
define a new type/transport semantic. Say this?
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/xprt.h | 1
> net/sunrpc/xprt.c | 1
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 14 +-
> net/sunrpc/xprtrdma/transport.c | 243 ++++++++++++++++++++++++++++++
> net/sunrpc/xprtrdma/xprt_rdma.h | 2
> 5 files changed, 256 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 69ef5b3..7637ccd 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -85,6 +85,7 @@ struct rpc_rqst {
> __u32 * rq_buffer; /* XDR encode buffer */
> size_t rq_callsize,
> rq_rcvsize;
> + void * rq_privdata; /* xprt-specific per-rqst data */
> size_t rq_xmit_bytes_sent; /* total bytes sent */
> size_t rq_reply_bytes_recvd; /* total reply bytes */
> /* received */
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index 2e98f4a..37edea6 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -1425,3 +1425,4 @@ void xprt_put(struct rpc_xprt *xprt)
> if (atomic_dec_and_test(&xprt->count))
> xprt_destroy(xprt);
> }
> +EXPORT_SYMBOL_GPL(xprt_put);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 58ec362..3768a7f 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -1182,12 +1182,14 @@ static void __svc_rdma_free(struct work_struct *work)
> {
> struct svcxprt_rdma *rdma =
> container_of(work, struct svcxprt_rdma, sc_work);
> - dprintk("svcrdma: svc_rdma_free(%p)\n", rdma);
> + struct svc_xprt *xprt = &rdma->sc_xprt;
> +
> + dprintk("svcrdma: %s(%p)\n", __func__, rdma);
>
> /* We should only be called from kref_put */
> - if (atomic_read(&rdma->sc_xprt.xpt_ref.refcount) != 0)
> + if (atomic_read(&xprt->xpt_ref.refcount) != 0)
> pr_err("svcrdma: sc_xprt still in use? (%d)\n",
> - atomic_read(&rdma->sc_xprt.xpt_ref.refcount));
> + atomic_read(&xprt->xpt_ref.refcount));
>
> /*
> * Destroy queued, but not processed read completions. Note
> @@ -1222,6 +1224,12 @@ static void __svc_rdma_free(struct work_struct *work)
> pr_err("svcrdma: dma still in use? (%d)\n",
> atomic_read(&rdma->sc_dma_used));
>
> + /* Final put of backchannel client transport */
> + if (xprt->xpt_bc_xprt) {
> + xprt_put(xprt->xpt_bc_xprt);
> + xprt->xpt_bc_xprt = NULL;
> + }
> +
> /* De-allocate fastreg mr */
> rdma_dealloc_frmr_q(rdma);
>
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 8c545f7..fda7488 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -51,6 +51,7 @@
> #include <linux/slab.h>
> #include <linux/seq_file.h>
> #include <linux/sunrpc/addr.h>
> +#include <linux/sunrpc/svc_rdma.h>
>
> #include "xprt_rdma.h"
>
> @@ -148,7 +149,10 @@ static struct ctl_table sunrpc_table[] = {
> #define RPCRDMA_MAX_REEST_TO (30U * HZ)
> #define RPCRDMA_IDLE_DISC_TO (5U * 60 * HZ)
>
> -static struct rpc_xprt_ops xprt_rdma_procs; /* forward reference */
> +static struct rpc_xprt_ops xprt_rdma_procs;
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +static struct rpc_xprt_ops xprt_rdma_bc_procs;
> +#endif
>
> static void
> xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
> @@ -499,7 +503,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
> if (req == NULL)
> return NULL;
>
> - flags = GFP_NOIO | __GFP_NOWARN;
> + flags = RPCRDMA_DEF_GFP;
> if (RPC_IS_SWAPPER(task))
> flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
>
> @@ -684,6 +688,199 @@ xprt_rdma_disable_swap(struct rpc_xprt *xprt)
> {
> }
>
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +
> +/* Server-side transport endpoint wants a whole page for its send
> + * buffer. The client RPC code constructs the RPC header in this
> + * buffer before it invokes ->send_request.
> + */
> +static void *
> +xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
> +{
> + struct rpc_rqst *rqst = task->tk_rqstp;
> + struct svc_rdma_op_ctxt *ctxt;
> + struct svcxprt_rdma *rdma;
> + struct svc_xprt *sxprt;
> + struct page *page;
> +
> + if (size > PAGE_SIZE) {
> + WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
> + size);
> + return NULL;
> + }
> +
> + page = alloc_page(RPCRDMA_DEF_GFP);
> + if (!page)
> + return NULL;
> +
> + sxprt = rqst->rq_xprt->bc_xprt;
> + rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
> + ctxt = svc_rdma_get_context_gfp(rdma, RPCRDMA_DEF_GFP);
> + if (!ctxt) {
> + put_page(page);
> + return NULL;
> + }
> +
> + rqst->rq_privdata = ctxt;
> + ctxt->pages[0] = page;
> + ctxt->count = 1;
> + return page_address(page);
> +}
> +
> +static void
> +xprt_rdma_bc_free(void *buffer)
> +{
> + /* No-op: ctxt and page have already been freed. */
> +}
> +
> +static int
> +rpcrdma_bc_send_request(struct svcxprt_rdma *rdma, struct rpc_rqst *rqst)
> +{
> + struct rpc_xprt *xprt = rqst->rq_xprt;
> + struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
> + struct rpcrdma_msg *headerp = (struct rpcrdma_msg *)rqst->rq_buffer;
> + struct svc_rdma_op_ctxt *ctxt;
> + int rc;
> +
> + /* Space in the send buffer for an RPC/RDMA header is reserved
> + * via xprt->tsh_size */
> + headerp->rm_xid = rqst->rq_xid;
> + headerp->rm_vers = rpcrdma_version;
> + headerp->rm_credit = cpu_to_be32(r_xprt->rx_buf.rb_bc_max_requests);
> + headerp->rm_type = rdma_msg;
> + headerp->rm_body.rm_chunks[0] = xdr_zero;
> + headerp->rm_body.rm_chunks[1] = xdr_zero;
> + headerp->rm_body.rm_chunks[2] = xdr_zero;
> +
> + pr_info("%s: %*ph\n", __func__, 64, rqst->rq_buffer);
> +
> + ctxt = (struct svc_rdma_op_ctxt *)rqst->rq_privdata;
> + rc = svc_rdma_bc_post_send(rdma, ctxt, &rqst->rq_snd_buf);
> + if (rc)
> + goto drop_connection;
> + return rc;
> +
> +drop_connection:
> + pr_info("Failed to send backwards request\n");
This is going to be a very strange message to see out of context.
What's a "backwards request"?
> + svc_rdma_put_context(ctxt, 1);
> + xprt_disconnect_done(xprt);
> + return -ENOTCONN;
> +}
> +
> +/* Take an RPC request and sent it on the passive end of a
> + * transport connection.
> + */
Typo: "sent" should be "send".
> +static int
> +xprt_rdma_bc_send_request(struct rpc_task *task)
> +{
> + struct rpc_rqst *rqst = task->tk_rqstp;
> + struct svc_xprt *sxprt = rqst->rq_xprt->bc_xprt;
> + struct svcxprt_rdma *rdma;
> + u32 len;
> +
> + pr_info("%s: sending request with xid: %08x\n",
> + __func__, be32_to_cpu(rqst->rq_xid));
> +
> + if (!mutex_trylock(&sxprt->xpt_mutex)) {
> + rpc_sleep_on(&sxprt->xpt_bc_pending, task, NULL);
> + if (!mutex_trylock(&sxprt->xpt_mutex))
> + return -EAGAIN;
> + rpc_wake_up_queued_task(&sxprt->xpt_bc_pending, task);
> + }
> +
> + len = -ENOTCONN;
> + rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
> + if (!test_bit(XPT_DEAD, &sxprt->xpt_flags))
> + len = rpcrdma_bc_send_request(rdma, rqst);
> +
> + mutex_unlock(&sxprt->xpt_mutex);
> +
> + if (len < 0)
> + return len;
> + return 0;
> +}
> +
> +static void
> +xprt_rdma_bc_close(struct rpc_xprt *xprt)
> +{
> + pr_info("RPC: %s: xprt %p\n", __func__, xprt);
> +}
> +
> +static void
> +xprt_rdma_bc_put(struct rpc_xprt *xprt)
> +{
> + pr_info("RPC: %s: xprt %p\n", __func__, xprt);
> +
> + xprt_free(xprt);
> + module_put(THIS_MODULE);
> +}
> +
> +/* It shouldn't matter if the number of backchannel session slots
> + * doesn't match the number of RPC/RDMA credits. That just means
> + * one or the other will have extra slots that aren't used.
> + */
> +static struct rpc_xprt *
> +xprt_setup_rdma_bc(struct xprt_create *args)
> +{
> + struct rpc_xprt *xprt;
> + struct rpcrdma_xprt *new_xprt;
> +
> + if (args->addrlen > sizeof(xprt->addr)) {
> + dprintk("RPC: %s: address too large\n", __func__);
> + return ERR_PTR(-EBADF);
> + }
> +
> + xprt = xprt_alloc(args->net, sizeof(*new_xprt),
> + RPCRDMA_MAX_BC_REQUESTS,
> + RPCRDMA_MAX_BC_REQUESTS);
> + if (xprt == NULL) {
> + dprintk("RPC: %s: couldn't allocate rpc_xprt\n",
> + __func__);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + xprt->timeout = &xprt_rdma_default_timeout;
> + xprt_set_bound(xprt);
> + xprt_set_connected(xprt);
> + xprt->bind_timeout = RPCRDMA_BIND_TO;
> + xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO;
> + xprt->idle_timeout = RPCRDMA_IDLE_DISC_TO;
> +
> + xprt->prot = XPRT_TRANSPORT_BC_RDMA;
> + xprt->tsh_size = RPCRDMA_HDRLEN_MIN / sizeof(__be32);
> + xprt->ops = &xprt_rdma_bc_procs;
> +
> + memcpy(&xprt->addr, args->dstaddr, args->addrlen);
> + xprt->addrlen = args->addrlen;
> + xprt_rdma_format_addresses(xprt, (struct sockaddr *)&xprt->addr);
> + xprt->resvport = 0;
> +
> + xprt->max_payload = xprt_rdma_max_inline_read;
> +
> + new_xprt = rpcx_to_rdmax(xprt);
> + new_xprt->rx_buf.rb_bc_max_requests = xprt->max_reqs;
> +
> + xprt_get(xprt);
> + args->bc_xprt->xpt_bc_xprt = xprt;
> + xprt->bc_xprt = args->bc_xprt;
> +
> + if (!try_module_get(THIS_MODULE))
> + goto out_fail;
> +
> + /* Final put for backchannel xprt is in __svc_rdma_free */
> + xprt_get(xprt);
> + return xprt;
> +
> +out_fail:
> + xprt_rdma_free_addresses(xprt);
> + args->bc_xprt->xpt_bc_xprt = NULL;
> + xprt_put(xprt);
> + xprt_free(xprt);
> + return ERR_PTR(-EINVAL);
> +}
> +
> +#endif /* CONFIG_SUNRPC_BACKCHANNEL */
> +
> /*
> * Plumbing for rpc transport switch and kernel module
> */
> @@ -722,6 +919,32 @@ static struct xprt_class xprt_rdma = {
> .setup = xprt_setup_rdma,
> };
>
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +
> +static struct rpc_xprt_ops xprt_rdma_bc_procs = {
> + .reserve_xprt = xprt_reserve_xprt_cong,
> + .release_xprt = xprt_release_xprt_cong,
> + .alloc_slot = xprt_alloc_slot,
> + .release_request = xprt_release_rqst_cong,
> + .buf_alloc = xprt_rdma_bc_allocate,
> + .buf_free = xprt_rdma_bc_free,
> + .send_request = xprt_rdma_bc_send_request,
> + .set_retrans_timeout = xprt_set_retrans_timeout_def,
> + .close = xprt_rdma_bc_close,
> + .destroy = xprt_rdma_bc_put,
> + .print_stats = xprt_rdma_print_stats
> +};
> +
> +static struct xprt_class xprt_rdma_bc = {
> + .list = LIST_HEAD_INIT(xprt_rdma_bc.list),
> + .name = "rdma backchannel",
> + .owner = THIS_MODULE,
> + .ident = XPRT_TRANSPORT_BC_RDMA,
> + .setup = xprt_setup_rdma_bc,
> +};
> +
> +#endif /* CONFIG_SUNRPC_BACKCHANNEL */
> +
> void xprt_rdma_cleanup(void)
> {
> int rc;
> @@ -740,6 +963,13 @@ void xprt_rdma_cleanup(void)
>
> rpcrdma_destroy_wq();
> frwr_destroy_recovery_wq();
> +
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> + rc = xprt_unregister_transport(&xprt_rdma_bc);
> + if (rc)
> + dprintk("RPC: %s: xprt_unregister(bc) returned %i\n",
> + __func__, rc);
> +#endif
> }
>
> int xprt_rdma_init(void)
> @@ -763,6 +993,15 @@ int xprt_rdma_init(void)
> return rc;
> }
>
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> + rc = xprt_register_transport(&xprt_rdma_bc);
> + if (rc) {
> + xprt_unregister_transport(&xprt_rdma);
> + frwr_destroy_recovery_wq();
> + return rc;
> + }
> +#endif
> +
> dprintk("RPCRDMA Module Init, register RPC RDMA transport\n");
>
> dprintk("Defaults:\n");
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 9aeff2b..485027e 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -148,6 +148,8 @@ rdmab_to_msg(struct rpcrdma_regbuf *rb)
> return (struct rpcrdma_msg *)rb->rg_base;
> }
>
> +#define RPCRDMA_DEF_GFP (GFP_NOIO | __GFP_NOWARN)
> +
> /*
> * struct rpcrdma_rep -- this structure encapsulates state required to recv
> * and complete a reply, asychronously. It needs several pieces of
>
>
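For context, a hedged sketch of how a server-side consumer might
create this backchannel transport, modeled on how the TCP backchannel
class is used today (the actual nfsd call site is not part of this
series, so client_addr, client_addrlen, and svc_xprt below are
illustrative names):

	/* Assumptions: client_addr/client_addrlen identify the client,
	 * and svc_xprt is the listening svc transport the backchannel
	 * rides on.
	 */
	struct xprt_create args = {
		.ident   = XPRT_TRANSPORT_BC_RDMA,
		.net     = net,
		.dstaddr = (struct sockaddr *)&client_addr,
		.addrlen = client_addrlen,
		.bc_xprt = svc_xprt,
	};
	struct rpc_xprt *xprt = xprt_create_transport(&args);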
> On Nov 23, 2015, at 7:39 PM, Tom Talpey <[email protected]> wrote:
>
> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>> Extra resources for handling backchannel requests have to be
>> pre-allocated when a transport instance is created. Set a limit.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/svc_rdma.h | 5 +++++
>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 6 +++++-
>> 2 files changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>> index f869807..478aa30 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
>> #define RPCRDMA_SQ_DEPTH_MULT 8
>> #define RPCRDMA_MAX_REQUESTS 32
>> #define RPCRDMA_MAX_REQ_SIZE 4096
>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>
> Why is this a config option? Why wouldn't you always want
> this? It's needed for any post-1990 NFS dialect.
I think some distros want to be able to compile out NFSv4.x
on small systems, and take all the backchannel cruft with it.
>> +#define RPCRDMA_MAX_BC_REQUESTS 8
>
> Why a constant 8? The forward channel value is apparently
> configurable, just a few lines down.
The client side backward direction credit limit, now
in 4.4, is already a constant.
The client side ULP uses a constant for the slot table
size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but
the server seems to just echo that number back to the
client.
I’d rather not add an admin knob for this. Why would it
be necessary?
--
Chuck Lever
On 11/23/2015 8:09 PM, Chuck Lever wrote:
>
>> On Nov 23, 2015, at 7:39 PM, Tom Talpey <[email protected]> wrote:
>>
>> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>>> Extra resources for handling backchannel requests have to be
>>> pre-allocated when a transport instance is created. Set a limit.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> include/linux/sunrpc/svc_rdma.h | 5 +++++
>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 6 +++++-
>>> 2 files changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>>> index f869807..478aa30 100644
>>> --- a/include/linux/sunrpc/svc_rdma.h
>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
>>> #define RPCRDMA_SQ_DEPTH_MULT 8
>>> #define RPCRDMA_MAX_REQUESTS 32
>>> #define RPCRDMA_MAX_REQ_SIZE 4096
>>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>>
>> Why is this a config option? Why wouldn't you always want
>> this? It's needed for any post-1990 NFS dialect.
>
> I think some distros want to be able to compile out NFSv4.x
> on small systems, and take all the backchannel cruft with it.
So shouldn't it follow the NFSv4.x config options then?
>
>
>>> +#define RPCRDMA_MAX_BC_REQUESTS 8
>>
>> Why a constant 8? The forward channel value is apparently
>> configurable, just a few lines down.
>
> The client side backward direction credit limit, now
> in 4.4, is already a constant.
>
> The client side ULP uses a constant for the slot table
> size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but
> the server seems to just echo that number back to the
> client.
>
> I’d rather not add an admin knob for this. Why would it
> be necessary?
Because no constant is ever correct. Why isn't it "1"? Do
you allow multiple credits? Why not that value?
For instance.
On Mon, Nov 23, 2015 at 8:19 PM, Tom Talpey <[email protected]> wrote:
> On 11/23/2015 8:09 PM, Chuck Lever wrote:
>>
>>
>>> On Nov 23, 2015, at 7:39 PM, Tom Talpey <[email protected]> wrote:
>>>
>>> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>>>>
>>>> Extra resources for handling backchannel requests have to be
>>>> pre-allocated when a transport instance is created. Set a limit.
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>> include/linux/sunrpc/svc_rdma.h | 5 +++++
>>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 6 +++++-
>>>> 2 files changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/sunrpc/svc_rdma.h
>>>> b/include/linux/sunrpc/svc_rdma.h
>>>> index f869807..478aa30 100644
>>>> --- a/include/linux/sunrpc/svc_rdma.h
>>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>>> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
>>>> #define RPCRDMA_SQ_DEPTH_MULT 8
>>>> #define RPCRDMA_MAX_REQUESTS 32
>>>> #define RPCRDMA_MAX_REQ_SIZE 4096
>>>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>>>
>>>
>>> Why is this a config option? Why wouldn't you always want
>>> this? It's needed for any post-1990 NFS dialect.
>>
>>
>> I think some distros want to be able to compile out NFSv4.x
>> on small systems, and take all the backchannel cruft with it.
>
>
> So shouldn't it follow the NFSv4.x config options then?
It does. Why the question?
> On Nov 23, 2015, at 8:19 PM, Tom Talpey <[email protected]> wrote:
>
> On 11/23/2015 8:09 PM, Chuck Lever wrote:
>>
>>> On Nov 23, 2015, at 7:39 PM, Tom Talpey <[email protected]> wrote:
>>>
>>> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>>>> Extra resources for handling backchannel requests have to be
>>>> pre-allocated when a transport instance is created. Set a limit.
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>> include/linux/sunrpc/svc_rdma.h | 5 +++++
>>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 6 +++++-
>>>> 2 files changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>>>> index f869807..478aa30 100644
>>>> --- a/include/linux/sunrpc/svc_rdma.h
>>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>>> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
>>>> #define RPCRDMA_SQ_DEPTH_MULT 8
>>>> #define RPCRDMA_MAX_REQUESTS 32
>>>> #define RPCRDMA_MAX_REQ_SIZE 4096
>>>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>>>
>>> Why is this a config option? Why wouldn't you always want
>>> this? It's needed for any post-1990 NFS dialect.
>>
>> I think some distros want to be able to compile out NFSv4.x
>> on small systems, and take all the backchannel cruft with it.
>
> So shouldn't it follow the NFSv4.x config options then?
Setting CONFIG_NFS_V4_1 sets CONFIG_SUNRPC_BACKCHANNEL.
Adding #ifdef CONFIG_NFS_V4_1 in net/sunrpc would be
a layering violation.
I see however that CONFIG_SUNRPC_BACKCHANNEL controls
only the client backchannel capability. Perhaps it is
out of place to use it to enable the server’s backchannel
capability.
>>>> +#define RPCRDMA_MAX_BC_REQUESTS 8
>>>
>>> Why a constant 8? The forward channel value is apparently
>>> configurable, just a few lines down.
>>
>> The client side backward direction credit limit, now
>> in 4.4, is already a constant.
>>
>> The client side ULP uses a constant for the slot table
>> size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but
>> the server seems to just echo that number back to the
>> client.
>>
>> I’d rather not add an admin knob for this. Why would it
>> be necessary?
>
> Because no constant is ever correct. Why isn't it "1"? Do
> you allow multiple credits? Why not that value?
>
> For instance.
There’s no justification for the forward channel credit
limit either.
The code in Linux assumes one session slot in the NFSv4.1
backchannel. When we get around to it, this can be made
more flexible.
It’s much easier to add flexibility and admin control later
than it is to take it away when the knob becomes useless or
badly designed. For now, 8 works, and it doesn’t have to be
permanent.
I could add a comment that says:

/* Arbitrary: support up to eight backward credits. */
--
Chuck Lever
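For reference, a sketch of how the svc_rdma.h constant might read with
that comment folded in (names taken from the quoted patch; the value
remains arbitrary until real credit negotiation is implemented):

	#if defined(CONFIG_SUNRPC_BACKCHANNEL)
	/* Arbitrary: support up to eight backward credits. */
	#define RPCRDMA_MAX_BC_REQUESTS	8
	#else
	#define RPCRDMA_MAX_BC_REQUESTS	0
	#endif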
On 11/23/2015 8:36 PM, Chuck Lever wrote:
>
>> On Nov 23, 2015, at 8:19 PM, Tom Talpey <[email protected]> wrote:
>>
>> On 11/23/2015 8:09 PM, Chuck Lever wrote:
>>>
>>>> On Nov 23, 2015, at 7:39 PM, Tom Talpey <[email protected]> wrote:
>>>>
>>>> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>>>>> Extra resources for handling backchannel requests have to be
>>>>> pre-allocated when a transport instance is created. Set a limit.
>>>>>
>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>> ---
>>>>> include/linux/sunrpc/svc_rdma.h | 5 +++++
>>>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 6 +++++-
>>>>> 2 files changed, 10 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>>>>> index f869807..478aa30 100644
>>>>> --- a/include/linux/sunrpc/svc_rdma.h
>>>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>>>> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
>>>>> #define RPCRDMA_SQ_DEPTH_MULT 8
>>>>> #define RPCRDMA_MAX_REQUESTS 32
>>>>> #define RPCRDMA_MAX_REQ_SIZE 4096
>>>>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>>>>
>>>> Why is this a config option? Why wouldn't you always want
>>>> this? It's needed for any post-1990 NFS dialect.
>>>
>>> I think some distros want to be able to compile out NFSv4.x
>>> on small systems, and take all the backchannel cruft with it.
>>
>> So shouldn't it follow the NFSv4.x config options then?
>
> Setting CONFIG_NFS_V4_1 sets CONFIG_SUNRPC_BACKCHANNEL.
> Adding #ifdef CONFIG_NFS_V4_1 in net/sunrpc would be
> a layering violation.
Ok, I guess. It seems a fairly small and abstruse detail to
surface as a config option. But if it's already there, sure,
use it.
>
> I see however that CONFIG_SUNRPC_BACKCHANNEL controls
> only the client backchannel capability. Perhaps it is
> out of place to use it to enable the server’s backchannel
> capability.
Ok, now it's even smaller and more abstruse. :-) To say
nothing of multiplying. ;-)
>
>
>>>>> +#define RPCRDMA_MAX_BC_REQUESTS 8
>>>>
>>>> Why a constant 8? The forward channel value is apparently
>>>> configurable, just a few lines down.
>>>
>>> The client side backward direction credit limit, now
>>> in 4.4, is already a constant.
>>>
>>> The client side ULP uses a constant for the slot table
>>> size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but
>>> the server seems to just echo that number back to the
>>> client.
>>>
>>> I’d rather not add an admin knob for this. Why would it
>>> be necessary?
>>
>> Because no constant is ever correct. Why isn't it "1"? Do
>> you allow multiple credits? Why not that value?
>>
>> For instance.
>
> There’s no justification for the forward channel credit
> limit either.
>
> The code in Linux assumes one session slot in the NFSv4.1
> backchannel. When we get around to it, this can be made
> more flexible.
Ok so again, why choose "8" here?
> On Nov 23, 2015, at 7:44 PM, Tom Talpey <[email protected]> wrote:
>
> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>> To support the NFSv4.1 backchannel on RDMA connections, add a
>> capability for receiving an RPC/RDMA reply on a connection
>> established by a client.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/rpc_rdma.c | 76 +++++++++++++++++++++++++++++++
>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 60 ++++++++++++++++++++++++
>> net/sunrpc/xprtrdma/xprt_rdma.h | 4 ++
>> 3 files changed, 140 insertions(+)
>>
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index c10d969..fef0623 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -946,3 +946,79 @@ repost:
>> if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
>> rpcrdma_recv_buffer_put(rep);
>> }
>> +
>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>> +
>> +int
>> +rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
>> + struct xdr_buf *rcvbuf)
>> +{
>> + struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
>> + struct kvec *dst, *src = &rcvbuf->head[0];
>> + struct rpc_rqst *req;
>> + unsigned long cwnd;
>> + u32 credits;
>> + size_t len;
>> + __be32 xid;
>> + __be32 *p;
>> + int ret;
>> +
>> + p = (__be32 *)src->iov_base;
>> + len = src->iov_len;
>> + xid = rmsgp->rm_xid;
>> +
>> + pr_info("%s: xid=%08x, length=%zu\n",
>> + __func__, be32_to_cpu(xid), len);
>> + pr_info("%s: RPC/RDMA: %*ph\n",
>> + __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
>> + pr_info("%s: RPC: %*ph\n",
>> + __func__, (int)len, p);
>> +
>> + ret = -EAGAIN;
>> + if (src->iov_len < 24)
>> + goto out_shortreply;
>> +
>> + spin_lock_bh(&xprt->transport_lock);
>> + req = xprt_lookup_rqst(xprt, xid);
>> + if (!req)
>> + goto out_notfound;
>> +
>> + dst = &req->rq_private_buf.head[0];
>> + memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
>> + if (dst->iov_len < len)
>> + goto out_unlock;
>> + memcpy(dst->iov_base, p, len);
>> +
>> + credits = be32_to_cpu(rmsgp->rm_credit);
>> + if (credits == 0)
>> + credits = 1; /* don't deadlock */
>> + else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
>> + credits = r_xprt->rx_buf.rb_bc_max_requests;
>> +
>> + cwnd = xprt->cwnd;
>> + xprt->cwnd = credits << RPC_CWNDSHIFT;
>> + if (xprt->cwnd > cwnd)
>> + xprt_release_rqst_cong(req->rq_task);
>> +
>> + ret = 0;
>> + xprt_complete_rqst(req->rq_task, rcvbuf->len);
>> + rcvbuf->len = 0;
>> +
>> +out_unlock:
>> + spin_unlock_bh(&xprt->transport_lock);
>> +out:
>> + return ret;
>> +
>> +out_shortreply:
>> + pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n",
>> + xprt, src->iov_len);
>> + goto out;
>> +
>> +out_notfound:
>> + pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
>> + xprt, be32_to_cpu(xid));
>> +
>> + goto out_unlock;
>> +}
>> +
>> +#endif /* CONFIG_SUNRPC_BACKCHANNEL */
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index ff4f01e..2b762b5 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -47,6 +47,7 @@
>> #include <rdma/ib_verbs.h>
>> #include <rdma/rdma_cm.h>
>> #include <linux/sunrpc/svc_rdma.h>
>> +#include "xprt_rdma.h"
>>
>> #define RPCDBG_FACILITY RPCDBG_SVCXPRT
>>
>> @@ -567,6 +568,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
>> return ret;
>> }
>>
>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>> +
>> +/* By convention, backchannel calls arrive via rdma_msg type
>> + * messages, and never populate the chunk lists. This makes
>> + * the RPC/RDMA header small and fixed in size, so it is
>> + * straightforward to check the RPC header's direction field.
>> + */
>> +static bool
>> +svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
>> +{
>> + __be32 *p = (__be32 *)rmsgp;
>> +
>> + if (!xprt->xpt_bc_xprt)
>> + return false;
>> +
>> + if (rmsgp->rm_type != rdma_msg)
>> + return false;
>> + if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
>> + return false;
>> + if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
>> + return false;
>> + if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
>> + return false;
>
> The above assertion is only true for the NFS behavior as spec'd
> today (no chunk-bearing bulk data on existing backchannel NFS
> protocol messages). That at least deserves a comment.
Not sure what you mean:
https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rpcrdma-bidirection/
says a chunk-less message is how a backward RPC reply goes over
RPC/RDMA. NFS is not in the picture here.
> Or, why
> not simply ignore the chunks? They're not the receiver's problem.
This check is done before the chunks are parsed. The point
is the receiver has to verify that the RPC-over-RDMA header
is the small, fixed-size kind _first_ before it can go look
at the CALLDIR field in the RPC header.
--
Chuck Lever
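For reference, a commented sketch of the fixed-size header layout that
check depends on; word offsets assume all three chunk lists are empty,
so the RPC header starts immediately after the RPCRDMA_HDRLEN_MIN
(seven 32-bit words) RPC-over-RDMA header:

	__be32 *p = (__be32 *)rmsgp;
	/*
	 * p[0]: rm_xid    - RPC/RDMA XID
	 * p[1]: rm_vers   - RPC/RDMA protocol version
	 * p[2]: rm_credit - credit value
	 * p[3]: rm_type   - must be rdma_msg
	 * p[4]: read list terminator (xdr_zero)
	 * p[5]: write list terminator (xdr_zero)
	 * p[6]: reply chunk terminator (xdr_zero)
	 * p[7]: XID of the RPC header (must equal rm_xid)
	 * p[8]: RPC CALLDIR - a backchannel reply is not RPC_CALL
	 */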
I'll have to think about whether I agree with that as a
protocol statement.
Chunks in a reply are there to account for the data that is
handled in the chunk of a request. So it kind of comes down
to whether RDMA is allowed (or used) on the backchannel. I
still think that is fundamentally an upper layer question,
not an RPC one.
On Mon, Nov 23, 2015 at 07:53:04PM -0500, Chuck Lever wrote:
> > Wait, the REMOTE_WRITE is there to support iWARP, but it isn't
> > needed for IB or RoCE. Shouldn't this be updated to peek at those
> > new attributes to decide, instead of remaining unconditional?
>
> That's coming in another patch from Christoph.
Can you drop this patch so that we have less conflicts with that one,
assuming this series goes in through the NFS tree, and the memory
registration changes go in through the RDMA tree?
> +struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
> + gfp_t flags)
> +{
> + struct svc_rdma_op_ctxt *ctxt;
> +
> + ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
> + if (!ctxt)
> + return NULL;
> + svc_rdma_init_context(xprt, ctxt);
> + return ctxt;
> +}
> +
> +struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
> +{
> + struct svc_rdma_op_ctxt *ctxt;
> +
> + ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
> + GFP_KERNEL | __GFP_NOFAIL);
> + svc_rdma_init_context(xprt, ctxt);
> return ctxt;
Sounds like you should have just added a gfp_t argument to
svc_rdma_get_context. And if we have any way to avoid the __GFP_NOFAIL
I'd really appreciate if we could give that a try.
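A minimal sketch of that suggestion, reusing the helpers quoted above
(this is not what the posted patch does, only the shape being
described):

	struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt,
						      gfp_t flags)
	{
		struct svc_rdma_op_ctxt *ctxt;

		/* Callers choose the policy: existing sites would pass
		 * GFP_KERNEL | __GFP_NOFAIL; the backchannel path could
		 * pass GFP_NOIO | __GFP_NOWARN and handle NULL.
		 */
		ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
		if (!ctxt)
			return NULL;
		svc_rdma_init_context(xprt, ctxt);
		return ctxt;
	}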
> On Nov 24, 2015, at 1:39 AM, Christoph Hellwig <[email protected]> wrote:
>
> On Mon, Nov 23, 2015 at 07:53:04PM -0500, Chuck Lever wrote:
>>> Wait, the REMOTE_WRITE is there to support iWARP, but it isn't
>>> needed for IB or RoCE. Shouldn't this be updated to peek at those
>>> new attributes to decide, instead of remaining unconditional?
>>
>> That's coming in another patch from Christoph.
>
> Can you drop this patch so that we have less conflicts with that one,
> assuming this series goes in through the NFS tree, and the memory
> registration changes go in through the RDMA tree?
Why don’t you fold my change into yours?
--
Chuck Lever
> On Nov 24, 2015, at 1:55 AM, Christoph Hellwig <[email protected]> wrote:
>
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
>> + gfp_t flags)
>> +{
>> + struct svc_rdma_op_ctxt *ctxt;
>> +
>> + ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
>> + if (!ctxt)
>> + return NULL;
>> + svc_rdma_init_context(xprt, ctxt);
>> + return ctxt;
>> +}
>> +
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
>> +{
>> + struct svc_rdma_op_ctxt *ctxt;
>> +
>> + ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
>> + GFP_KERNEL | __GFP_NOFAIL);
>> + svc_rdma_init_context(xprt, ctxt);
>> return ctxt;
>
> Sounds like you should have just added a gfp_t argument to
> svc_rdma_get_context.
There is only one (new) call site that needs it. I can simplify
this patch as Sagi suggested before, but it seems silly to
introduce the extra clutter of adding a gfp_t argument
everywhere.
> And if we have any way to avoid the __GFP_NOFAIL
> I'd really appreciate if we could give that a try.
I’m not introducing the flag here.
Changing all the svc_rdma_get_context() call sites to handle
allocation failure (when it is already highly unlikely) is
a lot of needless work, IMO, and not related to supporting
bi-directional RPC.
--
Chuck Lever
On Tue, Nov 24, 2015 at 09:08:21AM -0500, Chuck Lever wrote:
> Why don't you fold my change into yours?
It's already included. Well, sort of - I have removed use of the
field, but forgot to remove the definition. I will update it.
> On Nov 24, 2015, at 11:03 AM, Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Nov 24, 2015 at 09:08:21AM -0500, Chuck Lever wrote:
>> Why don't you fold my change into yours?
>
> It's already included. Well, sort of - I have removed use of the
> field, but forgot to remove the definition. I will update it.
Excellent, that works for me.
--
Chuck Lever
On Tue, Nov 24, 2015 at 09:24:51AM -0500, Chuck Lever wrote:
> There is only one (new) call site that needs it. I can simplify
> this patch as Sagi suggested before, but it seems silly to
> introduce the extra clutter of adding a gfp_t argument
> everywhere.
We a) generally try to pass the gfp_t around if we expect calling
contexts to change, and b) the changes to the 6 callers are probably
still smaller than this patch :)
> > And if we have any way to avoid the __GFP_NOFAIL
> > I'd really appreciate if we could give that a try.
>
> I'm not introducing the flag here.
>
> Changing all the svc_rdma_get_context() call sites to handle
> allocation failure (when it is already highly unlikely) is
> a lot of needless work, IMO, and not related to supporting
> bi-directional RPC.
Ok.
> On Nov 24, 2015, at 3:02 PM, Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Nov 24, 2015 at 09:24:51AM -0500, Chuck Lever wrote:
>> There is only one (new) call site that needs it. I can simplify
>> this patch as Sagi suggested before, but it seems silly to
>> introduce the extra clutter of adding a gfp_t argument
>> everywhere.
>
> We a) generally try to pass the gfp_t around if we expect calling
> contexts to change, and b) the changes to the 6 callers are probably
> still smaller than this patch :)
I’ll post a v2 early next week. It will be smaller and simpler.
>>> And if we have any way to avoid the __GFP_NOFAIL
>>> I'd really appreciate if we could give that a try.
>>
>> I'm not introducing the flag here.
>>
>> Changing all the svc_rdma_get_context() call sites to handle
>> allocation failure (when it is already highly unlikely) is
>> a lot of needless work, IMO, and not related to supporting
>> bi-directional RPC.
>
> Ok.
--
Chuck Lever
> On Nov 24, 2015, at 1:55 AM, Christoph Hellwig <[email protected]> wrote:
>
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
>> + gfp_t flags)
>> +{
>> + struct svc_rdma_op_ctxt *ctxt;
>> +
>> + ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
>> + if (!ctxt)
>> + return NULL;
>> + svc_rdma_init_context(xprt, ctxt);
>> + return ctxt;
>> +}
>> +
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
>> +{
>> + struct svc_rdma_op_ctxt *ctxt;
>> +
>> + ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
>> + GFP_KERNEL | __GFP_NOFAIL);
>> + svc_rdma_init_context(xprt, ctxt);
>> return ctxt;
>
> Sounds like you should have just added a gfp_t argument to
> svc_rdma_get_context. And if we have any way to avoid the __GFP_NOFAIL
> I'd really appreciate if we could give that a try.
Changed my mind on this.
struct svc_rdma_op_ctxt used to be smaller than a page, so these
allocations were not likely to fail. But since the maximum NFS
READ and WRITE payload for NFS/RDMA has been increased to 1MB,
struct svc_rdma_op_ctxt has grown to more than 6KB, thus it is
no longer an order 0 memory allocation.
Some ideas:
1. Pre-allocate these per connection in svc_rdma_accept().
There will never be more than sc_sq_depth of these. But that
could be a large number to allocate during connection
establishment.
2. Once allocated, cache them. If traffic doesn’t manage to
allocate sc_sq_depth of these over time, allocation can still
fail during a traffic burst in very low memory scenarios.
3. Use a mempool. This reserves a few of these which may never
be used. But allocation can still fail once the reserve is
consumed (same as 2).
4. Break out the sge and pages arrays into separate allocations
so the allocation requests are order 0.
Idea 1 seems like the most robust solution, and it would be fast:
svc_rdma_get_context() is a very common operation.
--
Chuck Lever
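A rough sketch of idea 1, pre-allocating at accept time and recycling
contexts through a free list. The sc_ctxts list, sc_ctxt_lock, and the
ctxt free_list member are hypothetical fields, not in the current
code:

	static int svc_rdma_prealloc_ctxts(struct svcxprt_rdma *xprt)
	{
		unsigned int i;

		/* No more than sc_sq_depth contexts are ever in flight */
		for (i = 0; i < xprt->sc_sq_depth; i++) {
			struct svc_rdma_op_ctxt *ctxt;

			ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
			if (!ctxt)
				return -ENOMEM;	/* caller tears down the xprt */
			list_add(&ctxt->free_list, &xprt->sc_ctxts);
		}
		return 0;
	}

svc_rdma_get_context() would then pop an entry off sc_ctxts under
sc_ctxt_lock instead of calling kmem_cache_alloc(), and
svc_rdma_put_context() would push it back, so the common case is fast
and free of allocation failures.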