2016-04-28 15:14:30

by Chuck Lever

Subject: [PATCH 00/10] NFS/RDMA server updates proposed for v4.7

This series includes Shirley's server-side IPv6 patch, a number of
minor fixes and clean-ups found during code review, and one
experimental patch.

The last patch in the series changes the server's send CQ and
receive CQ from soft IRQ to workqueue. There is a measurable
performance degradation documented in the patch description.
Maybe I've missed something? But for now this patch is for
review and discussion only.

Available in the "nfsd-rdma-for-4.7" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.7

---

Chuck Lever (9):
svcrdma: Do not add XDR padding to xdr_buf page vector
svcrdma: svc_rdma_put_context() is invoked twice in Send error path
svcrdma: Remove superfluous line from rdma_read_chunks()
svcrdma: Post Receives only for forward channel requests
svcrdma: Drain QP before freeing svcrdma_xprt
svcrdma: Eliminate code duplication in svc_rdma_recvfrom()
svcrdma: Generalize svc_rdma_xdr_decode_req()
svcrdma: Simplify the check for backward direction replies
svcrdma: Switch CQs from IB_POLL_SOFTIRQ to IB_POLL_WORKQUEUE

Shirley Ma (1):
svcrdma: Support IPv6 with NFS/RDMA


fs/nfsd/nfs3xdr.c | 2 -
include/linux/sunrpc/svc_rdma.h | 2 -
net/sunrpc/xprtrdma/svc_rdma_marshal.c | 31 +++++++++----
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 71 +++++++++---------------------
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 28 +++++-------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 43 ++++++++++++------
6 files changed, 85 insertions(+), 92 deletions(-)

--
Chuck Lever


2016-04-28 15:14:39

by Chuck Lever

Subject: [PATCH 01/10] svcrdma: Support IPv6 with NFS/RDMA

From: Shirley Ma <[email protected]>

Allow both IPv4 and IPv6 to bind the same port at the same time,
and restrict use of the IPv6 socket to IPv6 communication.

Changes from v1:
- Check rdma_set_afonly return value (suggested by Leon Romanovsky)

Changes from v2:
- Acked-by: Leon Romanovsky <[email protected]>

Signed-off-by: Shirley Ma <[email protected]>
Acked-by: Leon Romanovsky <[email protected]>
Reviewed-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 9066896..d2680b8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -789,7 +789,7 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
int ret;

dprintk("svcrdma: Creating RDMA socket\n");
- if (sa->sa_family != AF_INET) {
+ if ((sa->sa_family != AF_INET) && (sa->sa_family != AF_INET6)) {
dprintk("svcrdma: Address family %d is not supported.\n", sa->sa_family);
return ERR_PTR(-EAFNOSUPPORT);
}
@@ -805,6 +805,16 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
goto err0;
}

+ /* Allow both IPv4 and IPv6 sockets to bind a single port
+ * at the same time.
+ */
+#if IS_ENABLED(CONFIG_IPV6)
+ ret = rdma_set_afonly(listen_id, 1);
+ if (ret) {
+ dprintk("svcrdma: rdma_set_afonly failed = %d\n", ret);
+ goto err1;
+ }
+#endif
ret = rdma_bind_addr(listen_id, sa);
if (ret) {
dprintk("svcrdma: rdma_bind_addr failed = %d\n", ret);


2016-04-28 15:14:47

by Chuck Lever

Subject: [PATCH 02/10] svcrdma: Do not add XDR padding to xdr_buf page vector

An xdr_buf has a head, a vector of pages, and a tail. Each
RPC request is presented to the NFS server contained in an
xdr_buf.
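
For reference, here is a trimmed sketch of the xdr_buf fields this
patch is concerned with (the full definition lives in
include/linux/sunrpc/xdr.h; the inline comments are editorial glosses,
not the ones from that header):

struct xdr_buf {
	struct kvec	head[1];	/* RPC header and inline data */
	struct kvec	tail[1];	/* appended after the page data;
					 * where XDR pad bytes can land */

	struct page	**pages;	/* payload pages, e.g. NFS WRITE data */
	unsigned int	page_base;	/* offset of data in the first page */
	unsigned int	page_len;	/* length of the page data */

	unsigned int	buflen;		/* total capacity of the buffer */
	unsigned int	len;		/* length of the XDR-encoded data */
};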

The RDMA transport would like to supply the NFS server with only
the NFS WRITE payload bytes in the page vector. In some common
cases, that would allow the NFS server to swap those pages right
into the target file's page cache.

Have the transport's RDMA Read logic put XDR pad bytes in the tail
iovec, and not in the pages that hold the data payload.

The NFSv3 WRITE XDR decoder is finicky about the lengths involved,
so make sure it is looking in the correct places when computing
the total length of the incoming NFS WRITE request.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs3xdr.c | 2 +-
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2246454..c5eff5f 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -379,7 +379,7 @@ nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p,
*/
hdr = (void*)p - rqstp->rq_arg.head[0].iov_base;
dlen = rqstp->rq_arg.head[0].iov_len + rqstp->rq_arg.page_len
- - hdr;
+ + rqstp->rq_arg.tail[0].iov_len - hdr;
/*
* Round the length of the data which was specified up to
* the next multiple of XDR units and then compare that
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 3b24a64..234be9d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -488,7 +488,7 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
if (page_offset & 3) {
u32 pad = 4 - (page_offset & 3);

- head->arg.page_len += pad;
+ head->arg.tail[0].iov_len += pad;
head->arg.len += pad;
head->arg.buflen += pad;
page_offset += pad;


2016-04-28 15:14:55

by Chuck Lever

Subject: [PATCH 03/10] svcrdma: svc_rdma_put_context() is invoked twice in Send error path

Get a fresh op_ctxt in send_reply() instead of in svc_rdma_sendto().
This ensures that svc_rdma_put_context() is invoked only once if
send_reply() fails.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 28 +++++++++++++---------------
1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 4f1b1c4..54d53330 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -463,25 +463,21 @@ static int send_reply(struct svcxprt_rdma *rdma,
struct svc_rqst *rqstp,
struct page *page,
struct rpcrdma_msg *rdma_resp,
- struct svc_rdma_op_ctxt *ctxt,
struct svc_rdma_req_map *vec,
int byte_count)
{
+ struct svc_rdma_op_ctxt *ctxt;
struct ib_send_wr send_wr;
u32 xdr_off;
int sge_no;
int sge_bytes;
int page_no;
int pages;
- int ret;
-
- ret = svc_rdma_repost_recv(rdma, GFP_KERNEL);
- if (ret) {
- svc_rdma_put_context(ctxt, 0);
- return -ENOTCONN;
- }
+ int ret = -EIO;

/* Prepare the context */
+ ctxt = svc_rdma_get_context(rdma);
+ ctxt->direction = DMA_TO_DEVICE;
ctxt->pages[0] = page;
ctxt->count = 1;

@@ -565,8 +561,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
err:
svc_rdma_unmap_dma(ctxt);
svc_rdma_put_context(ctxt, 1);
- pr_err("svcrdma: failed to send reply, rc=%d\n", ret);
- return -EIO;
+ return ret;
}

void svc_rdma_prep_reply_hdr(struct svc_rqst *rqstp)
@@ -585,7 +580,6 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
int ret;
int inline_bytes;
struct page *res_page;
- struct svc_rdma_op_ctxt *ctxt;
struct svc_rdma_req_map *vec;

dprintk("svcrdma: sending response for rqstp=%p\n", rqstp);
@@ -598,8 +592,6 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
rp_ary = svc_rdma_get_reply_array(rdma_argp, wr_ary);

/* Build an req vec for the XDR */
- ctxt = svc_rdma_get_context(rdma);
- ctxt->direction = DMA_TO_DEVICE;
vec = svc_rdma_get_req_map(rdma);
ret = svc_rdma_map_xdr(rdma, &rqstp->rq_res, vec, wr_ary != NULL);
if (ret)
@@ -635,7 +627,12 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
inline_bytes -= ret;
}

- ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec,
+ /* Post a fresh Receive buffer _before_ sending the reply */
+ ret = svc_rdma_post_recv(rdma, GFP_KERNEL);
+ if (ret)
+ goto err1;
+
+ ret = send_reply(rdma, rqstp, res_page, rdma_resp, vec,
inline_bytes);
if (ret < 0)
goto err1;
@@ -648,7 +645,8 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
put_page(res_page);
err0:
svc_rdma_put_req_map(rdma, vec);
- svc_rdma_put_context(ctxt, 0);
+ pr_err("svcrdma: Could not send reply, err=%d. Closing transport.\n",
+ ret);
set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
return -ENOTCONN;
}


2016-04-28 15:15:03

by Chuck Lever

Subject: [PATCH 04/10] svcrdma: Remove superfluous line from rdma_read_chunks()

Clean up: svc_rdma_get_read_chunk() already returns a pointer
to the Read list. No need to set "ch" again to the value it
already contains.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 234be9d..12e7899 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -447,10 +447,8 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
head->arg.len = rqstp->rq_arg.len;
head->arg.buflen = rqstp->rq_arg.buflen;

- ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
- position = be32_to_cpu(ch->rc_position);
-
/* RDMA_NOMSG: RDMA READ data should land just after RDMA RECV data */
+ position = be32_to_cpu(ch->rc_position);
if (position == 0) {
head->arg.pages = &head->pages[0];
page_offset = head->byte_len;


2016-04-28 15:15:12

by Chuck Lever

Subject: [PATCH 05/10] svcrdma: Post Receives only for forward channel requests

Since backward direction support was added, the rq_depth was
increased to accommodate both forward and backward Receives.

But only forward Receives need to be posted after a connection
has been accepted. Receives for backward replies are posted as
needed by svc_rdma_bc_sendto().

This doesn't break anything, but it means some resources are
wasted.

Fixes: 03fe9931536f ('svcrdma: Define maximum number of ...')
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index d2680b8..02a112c 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1083,7 +1083,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;

/* Post receive buffers */
- for (i = 0; i < newxprt->sc_rq_depth; i++) {
+ for (i = 0; i < newxprt->sc_max_requests; i++) {
ret = svc_rdma_post_recv(newxprt, GFP_KERNEL);
if (ret) {
dprintk("svcrdma: failure posting receive buffers\n");


2016-04-28 15:15:20

by Chuck Lever

Subject: [PATCH 06/10] svcrdma: Drain QP before freeing svcrdma_xprt

If the server is forcing the connection disconnect, the QP has not
been moved to the Error state, meaning Receives are still posted.

Ensure Receives (and any other outstanding WRs) are flushed to free
up resources that can be freed during teardown of an svcrdma_xprt.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 02a112c..dd94401 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1180,6 +1180,9 @@ static void __svc_rdma_free(struct work_struct *work)

dprintk("svcrdma: %s(%p)\n", __func__, rdma);

+ if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))
+ ib_drain_qp(rdma->sc_qp);
+
/* We should only be called from kref_put */
if (atomic_read(&xprt->xpt_ref.refcount) != 0)
pr_err("svcrdma: sc_xprt still in use? (%d)\n",


2016-04-28 15:15:28

by Chuck Lever

Subject: [PATCH 07/10] svcrdma: Eliminate code duplication in svc_rdma_recvfrom()

Clean up.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 26 +++++---------------------
1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 12e7899..1b72f35 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -508,11 +508,10 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
return ret;
}

-static int rdma_read_complete(struct svc_rqst *rqstp,
- struct svc_rdma_op_ctxt *head)
+static void rdma_read_complete(struct svc_rqst *rqstp,
+ struct svc_rdma_op_ctxt *head)
{
int page_no;
- int ret;

/* Copy RPC pages */
for (page_no = 0; page_no < head->count; page_no++) {
@@ -548,23 +547,6 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
rqstp->rq_arg.tail[0] = head->arg.tail[0];
rqstp->rq_arg.len = head->arg.len;
rqstp->rq_arg.buflen = head->arg.buflen;
-
- /* Free the context */
- svc_rdma_put_context(head, 0);
-
- /* XXX: What should this be? */
- rqstp->rq_prot = IPPROTO_MAX;
- svc_xprt_copy_addrs(rqstp, rqstp->rq_xprt);
-
- ret = rqstp->rq_arg.head[0].iov_len
- + rqstp->rq_arg.page_len
- + rqstp->rq_arg.tail[0].iov_len;
- dprintk("svcrdma: deferred read ret=%d, rq_arg.len=%u, "
- "rq_arg.head[0].iov_base=%p, rq_arg.head[0].iov_len=%zu\n",
- ret, rqstp->rq_arg.len, rqstp->rq_arg.head[0].iov_base,
- rqstp->rq_arg.head[0].iov_len);
-
- return ret;
}

/* By convention, backchannel calls arrive via rdma_msg type
@@ -622,7 +604,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
dto_q);
list_del_init(&ctxt->dto_q);
spin_unlock_bh(&rdma_xprt->sc_rq_dto_lock);
- return rdma_read_complete(rqstp, ctxt);
+ rdma_read_complete(rqstp, ctxt);
+ goto complete;
} else if (!list_empty(&rdma_xprt->sc_rq_dto_q)) {
ctxt = list_entry(rdma_xprt->sc_rq_dto_q.next,
struct svc_rdma_op_ctxt,
@@ -680,6 +663,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
return 0;
}

+complete:
ret = rqstp->rq_arg.head[0].iov_len
+ rqstp->rq_arg.page_len
+ rqstp->rq_arg.tail[0].iov_len;


2016-04-28 15:15:36

by Chuck Lever

Subject: [PATCH 08/10] svcrdma: Generalize svc_rdma_xdr_decode_req()

Clean up: Pass in just the piece of the svc_rqst that is needed
here.

While we're in the area, add an informative documenting comment.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 2 +-
net/sunrpc/xprtrdma/svc_rdma_marshal.c | 31 +++++++++++++++++++++----------
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 2 +-
3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 3081339..d6917b8 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -199,7 +199,7 @@ extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
struct xdr_buf *rcvbuf);

/* svc_rdma_marshal.c */
-extern int svc_rdma_xdr_decode_req(struct rpcrdma_msg *, struct svc_rqst *);
+extern int svc_rdma_xdr_decode_req(struct xdr_buf *);
extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
struct rpcrdma_msg *,
enum rpcrdma_errcode, __be32 *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
index 765bca4..b3fd068 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
@@ -145,19 +145,31 @@ static __be32 *decode_reply_array(__be32 *va, __be32 *vaend)
return (__be32 *)&ary->wc_array[nchunks];
}

-int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
+/**
+ * svc_rdma_xdr_decode_req - Parse incoming RPC-over-RDMA header
+ * @rq_arg: buffer containing xprt header and possibly an RPC message
+ *
+ * On entry, xdr->head[0].iov_base points to first byte in the
+ * RPC-over-RDMA header.
+ *
+ * On successful exit, head[0] points to first byte in the RPC
+ * message, and the length of the RPC-over-RDMA header is returned.
+ */
+int svc_rdma_xdr_decode_req(struct xdr_buf *rq_arg)
{
+ struct rpcrdma_msg *rmsgp;
__be32 *va, *vaend;
unsigned int len;
u32 hdr_len;

/* Verify that there's enough bytes for header + something */
- if (rqstp->rq_arg.len <= RPCRDMA_HDRLEN_ERR) {
+ if (rq_arg->len <= RPCRDMA_HDRLEN_ERR) {
dprintk("svcrdma: header too short = %d\n",
- rqstp->rq_arg.len);
+ rq_arg->len);
return -EINVAL;
}

+ rmsgp = (struct rpcrdma_msg *)rq_arg->head[0].iov_base;
if (rmsgp->rm_vers != rpcrdma_version) {
dprintk("%s: bad version %u\n", __func__,
be32_to_cpu(rmsgp->rm_vers));
@@ -189,10 +201,10 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
be32_to_cpu(rmsgp->rm_body.rm_padded.rm_thresh);

va = &rmsgp->rm_body.rm_padded.rm_pempty[4];
- rqstp->rq_arg.head[0].iov_base = va;
+ rq_arg->head[0].iov_base = va;
len = (u32)((unsigned long)va - (unsigned long)rmsgp);
- rqstp->rq_arg.head[0].iov_len -= len;
- if (len > rqstp->rq_arg.len)
+ rq_arg->head[0].iov_len -= len;
+ if (len > rq_arg->len)
return -EINVAL;
return len;
default:
@@ -205,7 +217,7 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
* chunk list and a reply chunk list.
*/
va = &rmsgp->rm_body.rm_chunks[0];
- vaend = (__be32 *)((unsigned long)rmsgp + rqstp->rq_arg.len);
+ vaend = (__be32 *)((unsigned long)rmsgp + rq_arg->len);
va = decode_read_list(va, vaend);
if (!va) {
dprintk("svcrdma: failed to decode read list\n");
@@ -222,10 +234,9 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
return -EINVAL;
}

- rqstp->rq_arg.head[0].iov_base = va;
+ rq_arg->head[0].iov_base = va;
hdr_len = (unsigned long)va - (unsigned long)rmsgp;
- rqstp->rq_arg.head[0].iov_len -= hdr_len;
-
+ rq_arg->head[0].iov_len -= hdr_len;
return hdr_len;
}

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 1b72f35..c984b0a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -636,7 +636,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)

/* Decode the RDMA header. */
rmsgp = (struct rpcrdma_msg *)rqstp->rq_arg.head[0].iov_base;
- ret = svc_rdma_xdr_decode_req(rmsgp, rqstp);
+ ret = svc_rdma_xdr_decode_req(&rqstp->rq_arg);
if (ret < 0)
goto out_err;
if (ret == 0)


2016-04-28 15:15:44

by Chuck Lever

Subject: [PATCH 09/10] svcrdma: Simplify the check for backward direction replies

Clean up: rq_arg.head[0] already points to the RPC header. No need
to walk down the RPC-over-RDMA header again.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 31 ++++++++++---------------------
1 file changed, 10 insertions(+), 21 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index c984b0a..41adbe7 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -549,35 +549,24 @@ static void rdma_read_complete(struct svc_rqst *rqstp,
rqstp->rq_arg.buflen = head->arg.buflen;
}

-/* By convention, backchannel calls arrive via rdma_msg type
- * messages, and never populate the chunk lists. This makes
- * the RPC/RDMA header small and fixed in size, so it is
- * straightforward to check the RPC header's direction field.
+/* Currently, this implementation does not support chunks in the
+ * backward direction. Thus only RDMA_MSG bc replies are accepted.
*/
static bool
-svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+ struct xdr_buf *rq_arg)
{
- __be32 *p = (__be32 *)rmsgp;
+ __be32 *p = (__be32 *)rq_arg->head[0].iov_base;

+ /* Is there a backchannel transport associated with xprt? */
if (!xprt->xpt_bc_xprt)
return false;
-
+ /* Is there an RPC message in the inline buffer? */
if (rmsgp->rm_type != rdma_msg)
return false;
- if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
- return false;
- if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+ /* Is that message a Call or a Reply? */
+ if (p[1] != cpu_to_be32(RPC_REPLY))
return false;
- if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
- return false;
-
- /* sanity */
- if (p[7] != rmsgp->rm_xid)
- return false;
- /* call direction */
- if (p[8] == cpu_to_be32(RPC_CALL))
- return false;
-
return true;
}

@@ -643,7 +632,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
goto out_drop;
rqstp->rq_xprt_hlen = ret;

- if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
+ if (svc_rdma_is_backchannel_reply(xprt, rmsgp, &rqstp->rq_arg)) {
ret = svc_rdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
&rqstp->rq_arg);
svc_rdma_put_context(ctxt, 0);


2016-04-28 15:15:53

by Chuck Lever

Subject: [PATCH 10/10] svcrdma: Switch CQs from IB_POLL_SOFTIRQ to IB_POLL_WORKQUEUE

Spread NFSD completion handling across CPUs, and replace
BH-friendly spin locking with plain spin locks.

iozone -i0 -i1 -s128m -y1k -az -I -N

Microseconds/op Mode. Output is in microseconds per operation.

Before:
      KB  reclen   write rewrite    read  reread
  131072       1      51      51      43      43
  131072       2      53      52      42      43
  131072       4      53      52      43      43
  131072       8      55      54      44      44
  131072      16      62      59      49      47
  131072      32      72      69      53      53
  131072      64      92      87      66      66
  131072     128     144     130      94      93
  131072     256     225     216     146     145
  131072     512     485     474     251     251
  131072    1024     573     540     514     512
  131072    2048    1007     941     624     618
  131072    4096    1672    1699     976     969
  131072    8192    3179    3158    1660    1649
  131072   16384    5836    5659    3062    3041

After:
      KB  reclen   write rewrite    read  reread
  131072       1      54      54      43      43
  131072       2      55      55      43      43
  131072       4      56      57      44      45
  131072       8      59      58      45      45
  131072      16      64      62      47      47
  131072      32      76      74      54      54
  131072      64      96      91      67      66
  131072     128     148     133      97      97
  131072     256     229     227     148     147
  131072     512     488     445     252     255
  131072    1024     582     534     511     540
  131072    2048     998     988     614     620
  131072    4096    1685    1679     946     965
  131072    8192    3113    3048    1650    1644
  131072   16384    6010    5745    3046    3053

NFS READ is roughly the same, NFS WRITE is marginally worse.

Before:
GETATTR:
242 ops (0%)
avg bytes sent per op: 127
avg bytes received per op: 112
backlog wait: 0.000000
RTT: 0.041322
total execute time: 0.049587 (milliseconds)

After:
GETATTR:
242 ops (0%)
avg bytes sent per op: 127
avg bytes received per op: 112
backlog wait: 0.000000
RTT: 0.045455
total execute time: 0.053719 (milliseconds)

Small op latency increased by 4usec.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 6 +++---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 +++++++++++++-------------
2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 41adbe7..6159ae6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -586,13 +586,13 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)

dprintk("svcrdma: rqstp=%p\n", rqstp);

- spin_lock_bh(&rdma_xprt->sc_rq_dto_lock);
+ spin_lock(&rdma_xprt->sc_rq_dto_lock);
if (!list_empty(&rdma_xprt->sc_read_complete_q)) {
ctxt = list_entry(rdma_xprt->sc_read_complete_q.next,
struct svc_rdma_op_ctxt,
dto_q);
list_del_init(&ctxt->dto_q);
- spin_unlock_bh(&rdma_xprt->sc_rq_dto_lock);
+ spin_unlock(&rdma_xprt->sc_rq_dto_lock);
rdma_read_complete(rqstp, ctxt);
goto complete;
} else if (!list_empty(&rdma_xprt->sc_rq_dto_q)) {
@@ -605,7 +605,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
clear_bit(XPT_DATA, &xprt->xpt_flags);
ctxt = NULL;
}
- spin_unlock_bh(&rdma_xprt->sc_rq_dto_lock);
+ spin_unlock(&rdma_xprt->sc_rq_dto_lock);
if (!ctxt) {
/* This is the EAGAIN path. The svc_recv routine will
* return -EAGAIN, the nfsd thread will go to call into
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index dd94401..8850a5f 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -186,7 +186,7 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
{
struct svc_rdma_op_ctxt *ctxt = NULL;

- spin_lock_bh(&xprt->sc_ctxt_lock);
+ spin_lock(&xprt->sc_ctxt_lock);
xprt->sc_ctxt_used++;
if (list_empty(&xprt->sc_ctxts))
goto out_empty;
@@ -194,7 +194,7 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
ctxt = list_first_entry(&xprt->sc_ctxts,
struct svc_rdma_op_ctxt, free);
list_del_init(&ctxt->free);
- spin_unlock_bh(&xprt->sc_ctxt_lock);
+ spin_unlock(&xprt->sc_ctxt_lock);

out:
ctxt->count = 0;
@@ -205,15 +205,15 @@ out_empty:
/* Either pre-allocation missed the mark, or send
* queue accounting is broken.
*/
- spin_unlock_bh(&xprt->sc_ctxt_lock);
+ spin_unlock(&xprt->sc_ctxt_lock);

ctxt = alloc_ctxt(xprt, GFP_NOIO);
if (ctxt)
goto out;

- spin_lock_bh(&xprt->sc_ctxt_lock);
+ spin_lock(&xprt->sc_ctxt_lock);
xprt->sc_ctxt_used--;
- spin_unlock_bh(&xprt->sc_ctxt_lock);
+ spin_unlock(&xprt->sc_ctxt_lock);
WARN_ONCE(1, "svcrdma: empty RDMA ctxt list?\n");
return NULL;
}
@@ -248,10 +248,10 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
for (i = 0; i < ctxt->count; i++)
put_page(ctxt->pages[i]);

- spin_lock_bh(&xprt->sc_ctxt_lock);
+ spin_lock(&xprt->sc_ctxt_lock);
xprt->sc_ctxt_used--;
list_add(&ctxt->free, &xprt->sc_ctxts);
- spin_unlock_bh(&xprt->sc_ctxt_lock);
+ spin_unlock(&xprt->sc_ctxt_lock);
}

static void svc_rdma_destroy_ctxts(struct svcxprt_rdma *xprt)
@@ -897,14 +897,14 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
{
struct svc_rdma_fastreg_mr *frmr = NULL;

- spin_lock_bh(&rdma->sc_frmr_q_lock);
+ spin_lock(&rdma->sc_frmr_q_lock);
if (!list_empty(&rdma->sc_frmr_q)) {
frmr = list_entry(rdma->sc_frmr_q.next,
struct svc_rdma_fastreg_mr, frmr_list);
list_del_init(&frmr->frmr_list);
frmr->sg_nents = 0;
}
- spin_unlock_bh(&rdma->sc_frmr_q_lock);
+ spin_unlock(&rdma->sc_frmr_q_lock);
if (frmr)
return frmr;

@@ -918,10 +918,10 @@ void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
ib_dma_unmap_sg(rdma->sc_cm_id->device,
frmr->sg, frmr->sg_nents, frmr->direction);
atomic_dec(&rdma->sc_dma_used);
- spin_lock_bh(&rdma->sc_frmr_q_lock);
+ spin_lock(&rdma->sc_frmr_q_lock);
WARN_ON_ONCE(!list_empty(&frmr->frmr_list));
list_add(&frmr->frmr_list, &rdma->sc_frmr_q);
- spin_unlock_bh(&rdma->sc_frmr_q_lock);
+ spin_unlock(&rdma->sc_frmr_q_lock);
}
}

@@ -999,13 +999,13 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
goto errout;
}
newxprt->sc_sq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_sq_depth,
- 0, IB_POLL_SOFTIRQ);
+ 0, IB_POLL_WORKQUEUE);
if (IS_ERR(newxprt->sc_sq_cq)) {
dprintk("svcrdma: error creating SQ CQ for connect request\n");
goto errout;
}
newxprt->sc_rq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_rq_depth,
- 0, IB_POLL_SOFTIRQ);
+ 0, IB_POLL_WORKQUEUE);
if (IS_ERR(newxprt->sc_rq_cq)) {
dprintk("svcrdma: error creating RQ CQ for connect request\n");
goto errout;


2016-04-28 15:59:39

by Steve Wise

Subject: RE: [PATCH 10/10] svcrdma: Switch CQs from IB_POLL_SOFTIRQ to IB_POLL_WORKQUEUE



> -----Original Message-----
> From: [email protected] [mailto:linux-rdma-
> [email protected]] On Behalf Of Chuck Lever
> Sent: Thursday, April 28, 2016 10:16 AM
> To: [email protected]; [email protected]
> Subject: [PATCH 10/10] svcrdma: Switch CQs from IB_POLL_SOFTIRQ to
> IB_POLL_WORKQUEUE
>
> Spread NFSD completion handling across CPUs, and replace
> BH-friendly spin locking with plain spin locks.
>
> iozone -i0 -i1 -s128m -y1k -az -I -N
>
> Microseconds/op Mode. Output is in microseconds per operation.
>
> Before:
> KB reclen write rewrite read reread
> 131072 1 51 51 43 43
> 131072 2 53 52 42 43
> 131072 4 53 52 43 43
> 131072 8 55 54 44 44
> 131072 16 62 59 49 47
> 131072 32 72 69 53 53
> 131072 64 92 87 66 66
> 131072 128 144 130 94 93
> 131072 256 225 216 146 145
> 131072 512 485 474 251 251
> 131072 1024 573 540 514 512
> 131072 2048 1007 941 624 618
> 131072 4096 1672 1699 976 969
> 131072 8192 3179 3158 1660 1649
> 131072 16384 5836 5659 3062 3041
>
> After:
> KB reclen write rewrite read reread
> 131072 1 54 54 43 43
> 131072 2 55 55 43 43
> 131072 4 56 57 44 45
> 131072 8 59 58 45 45
> 131072 16 64 62 47 47
> 131072 32 76 74 54 54
> 131072 64 96 91 67 66
> 131072 128 148 133 97 97
> 131072 256 229 227 148 147
> 131072 512 488 445 252 255
> 131072 1024 582 534 511 540
> 131072 2048 998 988 614 620
> 131072 4096 1685 1679 946 965
> 131072 8192 3113 3048 1650 1644
> 131072 16384 6010 5745 3046 3053
>
> NFS READ is roughly the same, NFS WRITE is marginally worse.
>
> Before:
> GETATTR:
> 242 ops (0%)
> avg bytes sent per op: 127
> avg bytes received per op: 112
> backlog wait: 0.000000
> RTT: 0.041322
> total execute time: 0.049587 (milliseconds)
>
> After:
> GETATTR:
> 242 ops (0%)
> avg bytes sent per op: 127
> avg bytes received per op: 112
> backlog wait: 0.000000
> RTT: 0.045455
> total execute time: 0.053719 (milliseconds)
>
> Small op latency increased by 4usec.
>


Hey Chuck, in what scenario or under what type of load do you expect this change to help performance? I guess it would help as you scale out the number of clients and thus the number of CQs in use? Do you do any measurements along these lines?

Stevo




2016-04-28 16:15:22

by Chuck Lever

Subject: Re: [PATCH 10/10] svcrdma: Switch CQs from IB_POLL_SOFTIRQ to IB_POLL_WORKQUEUE


> On Apr 28, 2016, at 11:59 AM, Steve Wise <[email protected]> wrote:
>
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:linux-rdma-
>> [email protected]] On Behalf Of Chuck Lever
>> Sent: Thursday, April 28, 2016 10:16 AM
>> To: [email protected]; [email protected]
>> Subject: [PATCH 10/10] svcrdma: Switch CQs from IB_POLL_SOFTIRQ to
>> IB_POLL_WORKQUEUE
>>
>> Spread NFSD completion handling across CPUs, and replace
>> BH-friendly spin locking with plain spin locks.
>>
>> iozone -i0 -i1 -s128m -y1k -az -I -N
>>
>> Microseconds/op Mode. Output is in microseconds per operation.
>>
>> Before:
>> KB reclen write rewrite read reread
>> 131072 1 51 51 43 43
>> 131072 2 53 52 42 43
>> 131072 4 53 52 43 43
>> 131072 8 55 54 44 44
>> 131072 16 62 59 49 47
>> 131072 32 72 69 53 53
>> 131072 64 92 87 66 66
>> 131072 128 144 130 94 93
>> 131072 256 225 216 146 145
>> 131072 512 485 474 251 251
>> 131072 1024 573 540 514 512
>> 131072 2048 1007 941 624 618
>> 131072 4096 1672 1699 976 969
>> 131072 8192 3179 3158 1660 1649
>> 131072 16384 5836 5659 3062 3041
>>
>> After:
>> KB reclen write rewrite read reread
>> 131072 1 54 54 43 43
>> 131072 2 55 55 43 43
>> 131072 4 56 57 44 45
>> 131072 8 59 58 45 45
>> 131072 16 64 62 47 47
>> 131072 32 76 74 54 54
>> 131072 64 96 91 67 66
>> 131072 128 148 133 97 97
>> 131072 256 229 227 148 147
>> 131072 512 488 445 252 255
>> 131072 1024 582 534 511 540
>> 131072 2048 998 988 614 620
>> 131072 4096 1685 1679 946 965
>> 131072 8192 3113 3048 1650 1644
>> 131072 16384 6010 5745 3046 3053
>>
>> NFS READ is roughly the same, NFS WRITE is marginally worse.
>>
>> Before:
>> GETATTR:
>> 242 ops (0%)
>> avg bytes sent per op: 127
>> avg bytes received per op: 112
>> backlog wait: 0.000000
>> RTT: 0.041322
>> total execute time: 0.049587 (milliseconds)
>>
>> After:
>> GETATTR:
>> 242 ops (0%)
>> avg bytes sent per op: 127
>> avg bytes received per op: 112
>> backlog wait: 0.000000
>> RTT: 0.045455
>> total execute time: 0.053719 (milliseconds)
>>
>> Small op latency increased by 4usec.
>>
>
>
> Hey Chuck, in what scenario or under what type of load do you expect this change to help performance? I guess it would help as you scale out the number of clients and thus the number of CQs in use?

Allowing completions to run on any CPU should help if the
softIRQ thread is constrained to one CPU.

Flapping bottom halves fewer times for each incoming RPC
_should_ also be beneficial.

We are also interested in posting RDMA Read requests during
Receive completion processing. That would reduce the
latency of any request involving a Read chunk by removing
a heavyweight context switch.

I've also noticed that changing just the Receive CQ to use
workqueue has only negligible impact on performance (as
measured using the above tool).
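
As a concrete illustration, the "Receive CQ only" variant mentioned
above is just the two ib_alloc_cq() call sites that patch 10 touches
in svc_rdma_accept(), with the Send CQ left on soft IRQ (sketch only,
not what the posted patch does):

	newxprt->sc_sq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_sq_depth,
					0, IB_POLL_SOFTIRQ);
	if (IS_ERR(newxprt->sc_sq_cq))
		goto errout;

	newxprt->sc_rq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_rq_depth,
					0, IB_POLL_WORKQUEUE);
	if (IS_ERR(newxprt->sc_rq_cq))
		goto errout;

With only the Receive CQ moved, the BH-safe spin locks presumably have
to stay for any lock that is still taken from the remaining soft-IRQ
Send completion path.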


> Do you do any measurements along these lines?

I don't have the quantity of hardware needed for that kind
of analysis. You might have a few more clients in your
lab...

I think my basic questions are whether I've missed something,
whether the approach can be improved, whether I'm using the
correct metrics, and so on.


--
Chuck Lever