2014-05-06 17:46:22

by Steve Wise

Subject: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic

This patch series refactors the NFSRDMA server marshalling logic to
remove the intermediary map structures. It also fixes an existing bug
where the NFSRDMA server was not minding the device fast register page
list length limitations.

I've also made a git repo available with these patches on top of 3.15-rc4:

git://git.openfabrics.org/~swise/linux svcrdma-refactor

Changes since V1:

- fixed regression for devices that don't support FRMRs (see
rdma_read_chunk_lcl())

- split the patch up for closer review. However, I request it be squashed
before merging, as the series is not bisectable, and I think these changes
should all be a single commit anyway.

Please review, and test if you can.

Signed-off-by: Tom Tucker <[email protected]>
Signed-off-by: Steve Wise <[email protected]>

---

Tom Tucker (3):
svcrdma: Sendto changes
svcrdma: Recvfrom changes
svcrdma: Transport and header file changes


include/linux/sunrpc/svc_rdma.h | 3
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 633 ++++++++++++------------------
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 230 +----------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 62 ++-
4 files changed, 318 insertions(+), 610 deletions(-)

--

Steve / Tom


2014-05-06 17:46:38

by Steve Wise

Subject: [PATCH V2 RFC 3/3] svcrdma: Sendto changes

From: Tom Tucker <[email protected]>

Don't use fast-register MRs for the source of RDMA writes and sends.
Instead, use either a local DMA lkey or a dma_mr lkey based on what the
device supports.

Signed-off-by: Tom Tucker <[email protected]>
Signed-off-by: Steve Wise <[email protected]>
---

net/sunrpc/xprtrdma/svc_rdma_sendto.c | 230 +++------------------------------
1 files changed, 22 insertions(+), 208 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 7e024a5..49fd21a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -1,4 +1,5 @@
/*
+ * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
* Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
*
* This software is available to you under a choice of one of two
@@ -49,152 +50,6 @@

#define RPCDBG_FACILITY RPCDBG_SVCXPRT

-/* Encode an XDR as an array of IB SGE
- *
- * Assumptions:
- * - head[0] is physically contiguous.
- * - tail[0] is physically contiguous.
- * - pages[] is not physically or virtually contiguous and consists of
- * PAGE_SIZE elements.
- *
- * Output:
- * SGE[0] reserved for RCPRDMA header
- * SGE[1] data from xdr->head[]
- * SGE[2..sge_count-2] data from xdr->pages[]
- * SGE[sge_count-1] data from xdr->tail.
- *
- * The max SGE we need is the length of the XDR / pagesize + one for
- * head + one for tail + one for RPCRDMA header. Since RPCSVC_MAXPAGES
- * reserves a page for both the request and the reply header, and this
- * array is only concerned with the reply we are assured that we have
- * on extra page for the RPCRMDA header.
- */
-static int fast_reg_xdr(struct svcxprt_rdma *xprt,
- struct xdr_buf *xdr,
- struct svc_rdma_req_map *vec)
-{
- int sge_no;
- u32 sge_bytes;
- u32 page_bytes;
- u32 page_off;
- int page_no = 0;
- u8 *frva;
- struct svc_rdma_fastreg_mr *frmr;
-
- frmr = svc_rdma_get_frmr(xprt);
- if (IS_ERR(frmr))
- return -ENOMEM;
- vec->frmr = frmr;
-
- /* Skip the RPCRDMA header */
- sge_no = 1;
-
- /* Map the head. */
- frva = (void *)((unsigned long)(xdr->head[0].iov_base) & PAGE_MASK);
- vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
- vec->sge[sge_no].iov_len = xdr->head[0].iov_len;
- vec->count = 2;
- sge_no++;
-
- /* Map the XDR head */
- frmr->kva = frva;
- frmr->direction = DMA_TO_DEVICE;
- frmr->access_flags = 0;
- frmr->map_len = PAGE_SIZE;
- frmr->page_list_len = 1;
- page_off = (unsigned long)xdr->head[0].iov_base & ~PAGE_MASK;
- frmr->page_list->page_list[page_no] =
- ib_dma_map_page(xprt->sc_cm_id->device,
- virt_to_page(xdr->head[0].iov_base),
- page_off,
- PAGE_SIZE - page_off,
- DMA_TO_DEVICE);
- if (ib_dma_mapping_error(xprt->sc_cm_id->device,
- frmr->page_list->page_list[page_no]))
- goto fatal_err;
- atomic_inc(&xprt->sc_dma_used);
-
- /* Map the XDR page list */
- page_off = xdr->page_base;
- page_bytes = xdr->page_len + page_off;
- if (!page_bytes)
- goto encode_tail;
-
- /* Map the pages */
- vec->sge[sge_no].iov_base = frva + frmr->map_len + page_off;
- vec->sge[sge_no].iov_len = page_bytes;
- sge_no++;
- while (page_bytes) {
- struct page *page;
-
- page = xdr->pages[page_no++];
- sge_bytes = min_t(u32, page_bytes, (PAGE_SIZE - page_off));
- page_bytes -= sge_bytes;
-
- frmr->page_list->page_list[page_no] =
- ib_dma_map_page(xprt->sc_cm_id->device,
- page, page_off,
- sge_bytes, DMA_TO_DEVICE);
- if (ib_dma_mapping_error(xprt->sc_cm_id->device,
- frmr->page_list->page_list[page_no]))
- goto fatal_err;
-
- atomic_inc(&xprt->sc_dma_used);
- page_off = 0; /* reset for next time through loop */
- frmr->map_len += PAGE_SIZE;
- frmr->page_list_len++;
- }
- vec->count++;
-
- encode_tail:
- /* Map tail */
- if (0 == xdr->tail[0].iov_len)
- goto done;
-
- vec->count++;
- vec->sge[sge_no].iov_len = xdr->tail[0].iov_len;
-
- if (((unsigned long)xdr->tail[0].iov_base & PAGE_MASK) ==
- ((unsigned long)xdr->head[0].iov_base & PAGE_MASK)) {
- /*
- * If head and tail use the same page, we don't need
- * to map it again.
- */
- vec->sge[sge_no].iov_base = xdr->tail[0].iov_base;
- } else {
- void *va;
-
- /* Map another page for the tail */
- page_off = (unsigned long)xdr->tail[0].iov_base & ~PAGE_MASK;
- va = (void *)((unsigned long)xdr->tail[0].iov_base & PAGE_MASK);
- vec->sge[sge_no].iov_base = frva + frmr->map_len + page_off;
-
- frmr->page_list->page_list[page_no] =
- ib_dma_map_page(xprt->sc_cm_id->device, virt_to_page(va),
- page_off,
- PAGE_SIZE,
- DMA_TO_DEVICE);
- if (ib_dma_mapping_error(xprt->sc_cm_id->device,
- frmr->page_list->page_list[page_no]))
- goto fatal_err;
- atomic_inc(&xprt->sc_dma_used);
- frmr->map_len += PAGE_SIZE;
- frmr->page_list_len++;
- }
-
- done:
- if (svc_rdma_fastreg(xprt, frmr))
- goto fatal_err;
-
- return 0;
-
- fatal_err:
- printk("svcrdma: Error fast registering memory for xprt %p\n", xprt);
- vec->frmr = NULL;
- svc_rdma_put_frmr(xprt, frmr);
- return -EIO;
-}
-
static int map_xdr(struct svcxprt_rdma *xprt,
struct xdr_buf *xdr,
struct svc_rdma_req_map *vec)
@@ -208,9 +63,6 @@ static int map_xdr(struct svcxprt_rdma *xprt,
BUG_ON(xdr->len !=
(xdr->head[0].iov_len + xdr->page_len + xdr->tail[0].iov_len));

- if (xprt->sc_frmr_pg_list_len)
- return fast_reg_xdr(xprt, xdr, vec);
-
/* Skip the first sge, this is for the RPCRDMA header */
sge_no = 1;

@@ -282,8 +134,6 @@ static dma_addr_t dma_map_xdr(struct svcxprt_rdma *xprt,
}

/* Assumptions:
- * - We are using FRMR
- * - or -
* - The specified write_len can be represented in sc_max_sge * PAGE_SIZE
*/
static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
@@ -327,23 +177,16 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
sge_bytes = min_t(size_t,
bc, vec->sge[xdr_sge_no].iov_len-sge_off);
sge[sge_no].length = sge_bytes;
- if (!vec->frmr) {
- sge[sge_no].addr =
- dma_map_xdr(xprt, &rqstp->rq_res, xdr_off,
- sge_bytes, DMA_TO_DEVICE);
- xdr_off += sge_bytes;
- if (ib_dma_mapping_error(xprt->sc_cm_id->device,
- sge[sge_no].addr))
- goto err;
- atomic_inc(&xprt->sc_dma_used);
- sge[sge_no].lkey = xprt->sc_dma_lkey;
- } else {
- sge[sge_no].addr = (unsigned long)
- vec->sge[xdr_sge_no].iov_base + sge_off;
- sge[sge_no].lkey = vec->frmr->mr->lkey;
- }
+ sge[sge_no].addr =
+ dma_map_xdr(xprt, &rqstp->rq_res, xdr_off,
+ sge_bytes, DMA_TO_DEVICE);
+ xdr_off += sge_bytes;
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device,
+ sge[sge_no].addr))
+ goto err;
+ atomic_inc(&xprt->sc_dma_used);
+ sge[sge_no].lkey = xprt->sc_dma_lkey;
ctxt->count++;
- ctxt->frmr = vec->frmr;
sge_off = 0;
sge_no++;
xdr_sge_no++;
@@ -369,7 +212,6 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
return 0;
err:
svc_rdma_unmap_dma(ctxt);
- svc_rdma_put_frmr(xprt, vec->frmr);
svc_rdma_put_context(ctxt, 0);
/* Fatal error, close transport */
return -EIO;
@@ -397,10 +239,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
res_ary = (struct rpcrdma_write_array *)
&rdma_resp->rm_body.rm_chunks[1];

- if (vec->frmr)
- max_write = vec->frmr->map_len;
- else
- max_write = xprt->sc_max_sge * PAGE_SIZE;
+ max_write = xprt->sc_max_sge * PAGE_SIZE;

/* Write chunks start at the pagelist */
for (xdr_off = rqstp->rq_res.head[0].iov_len, chunk_no = 0;
@@ -472,10 +311,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
res_ary = (struct rpcrdma_write_array *)
&rdma_resp->rm_body.rm_chunks[2];

- if (vec->frmr)
- max_write = vec->frmr->map_len;
- else
- max_write = xprt->sc_max_sge * PAGE_SIZE;
+ max_write = xprt->sc_max_sge * PAGE_SIZE;

/* xdr offset starts at RPC message */
nchunks = ntohl(arg_ary->wc_nchunks);
@@ -545,7 +381,6 @@ static int send_reply(struct svcxprt_rdma *rdma,
int byte_count)
{
struct ib_send_wr send_wr;
- struct ib_send_wr inv_wr;
int sge_no;
int sge_bytes;
int page_no;
@@ -559,7 +394,6 @@ static int send_reply(struct svcxprt_rdma *rdma,
"svcrdma: could not post a receive buffer, err=%d."
"Closing transport %p.\n", ret, rdma);
set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
- svc_rdma_put_frmr(rdma, vec->frmr);
svc_rdma_put_context(ctxt, 0);
return -ENOTCONN;
}
@@ -567,11 +401,6 @@ static int send_reply(struct svcxprt_rdma *rdma,
/* Prepare the context */
ctxt->pages[0] = page;
ctxt->count = 1;
- ctxt->frmr = vec->frmr;
- if (vec->frmr)
- set_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
- else
- clear_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);

/* Prepare the SGE for the RPCRDMA Header */
ctxt->sge[0].lkey = rdma->sc_dma_lkey;
@@ -590,21 +419,15 @@ static int send_reply(struct svcxprt_rdma *rdma,
int xdr_off = 0;
sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
byte_count -= sge_bytes;
- if (!vec->frmr) {
- ctxt->sge[sge_no].addr =
- dma_map_xdr(rdma, &rqstp->rq_res, xdr_off,
- sge_bytes, DMA_TO_DEVICE);
- xdr_off += sge_bytes;
- if (ib_dma_mapping_error(rdma->sc_cm_id->device,
- ctxt->sge[sge_no].addr))
- goto err;
- atomic_inc(&rdma->sc_dma_used);
- ctxt->sge[sge_no].lkey = rdma->sc_dma_lkey;
- } else {
- ctxt->sge[sge_no].addr = (unsigned long)
- vec->sge[sge_no].iov_base;
- ctxt->sge[sge_no].lkey = vec->frmr->mr->lkey;
- }
+ ctxt->sge[sge_no].addr =
+ dma_map_xdr(rdma, &rqstp->rq_res, xdr_off,
+ sge_bytes, DMA_TO_DEVICE);
+ xdr_off += sge_bytes;
+ if (ib_dma_mapping_error(rdma->sc_cm_id->device,
+ ctxt->sge[sge_no].addr))
+ goto err;
+ atomic_inc(&rdma->sc_dma_used);
+ ctxt->sge[sge_no].lkey = rdma->sc_dma_lkey;
ctxt->sge[sge_no].length = sge_bytes;
}
BUG_ON(byte_count != 0);
@@ -627,6 +450,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
ctxt->sge[page_no+1].length = 0;
}
rqstp->rq_next_page = rqstp->rq_respages + 1;
+
BUG_ON(sge_no > rdma->sc_max_sge);
memset(&send_wr, 0, sizeof send_wr);
ctxt->wr_op = IB_WR_SEND;
@@ -635,15 +459,6 @@ static int send_reply(struct svcxprt_rdma *rdma,
send_wr.num_sge = sge_no;
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;
- if (vec->frmr) {
- /* Prepare INVALIDATE WR */
- memset(&inv_wr, 0, sizeof inv_wr);
- inv_wr.opcode = IB_WR_LOCAL_INV;
- inv_wr.send_flags = IB_SEND_SIGNALED;
- inv_wr.ex.invalidate_rkey =
- vec->frmr->mr->lkey;
- send_wr.next = &inv_wr;
- }

ret = svc_rdma_send(rdma, &send_wr);
if (ret)
@@ -653,7 +468,6 @@ static int send_reply(struct svcxprt_rdma *rdma,

err:
svc_rdma_unmap_dma(ctxt);
- svc_rdma_put_frmr(rdma, vec->frmr);
svc_rdma_put_context(ctxt, 1);
return -EIO;
}


2014-05-13 20:37:39

by Steve Wise

Subject: RE: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes



> -----Original Message-----
> From: Chuck Lever [mailto:[email protected]]
> Sent: Tuesday, May 13, 2014 1:22 PM
> To: Steve Wise
> Cc: J. Bruce Fields; Linux NFS Mailing List; [email protected]; Tom Tucker
> Subject: Re: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes
>
> Hi Steve-
>
> Some random review comments, see below.
>

<snip>

> > +static int rdma_read_chunks(struct svcxprt_rdma *xprt,
> > + struct rpcrdma_msg *rmsgp,
> > + struct svc_rqst *rqstp,
> > + struct svc_rdma_op_ctxt *head)
> > {
> > - struct ib_send_wr read_wr;
> > - struct ib_send_wr inv_wr;
> > - int err = 0;
> > - int ch_no;
> > - int ch_count;
> > - int byte_count;
> > - int sge_count;
> > - u64 sgl_offset;
> > + int page_no, ch_count, ret;
> > struct rpcrdma_read_chunk *ch;
> > - struct svc_rdma_op_ctxt *ctxt = NULL;
> > - struct svc_rdma_req_map *rpl_map;
> > - struct svc_rdma_req_map *chl_map;
> > + u32 page_offset, byte_count;
> > + u64 rs_offset;
> > + rdma_reader_fn reader;
> >
> > /* If no read list is present, return 0 */
> > ch = svc_rdma_get_read_chunk(rmsgp);
> > @@ -408,122 +368,53 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
> > if (ch_count > RPCSVC_MAXPAGES)
> > return -EINVAL;
>
> I'm not sure what this ^^^ ch_count check is doing, but maybe I haven't had
> enough caffeine this morning. Shouldn't this code be comparing the segment
> count, not the chunk count, with MAXPAGES?

Isn't ch_count really the number of chunks in the RPC? I'm not sure what "segment count"
you are talking about?

>
>
> >
> > - /* Allocate temporary reply and chunk maps */
> > - rpl_map = svc_rdma_get_req_map();
> > - chl_map = svc_rdma_get_req_map();
> > -
> > - if (!xprt->sc_frmr_pg_list_len)
> > - sge_count = map_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
> > - rpl_map, chl_map, ch_count,
> > - byte_count);
> > - else
> > - sge_count = fast_reg_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
> > - rpl_map, chl_map, ch_count,
> > - byte_count);
> > - if (sge_count < 0) {
> > - err = -EIO;
> > - goto out;
> > - }
> > -
> > - sgl_offset = 0;
> > - ch_no = 0;
> > + /* The request is completed when the RDMA_READs complete. The
> > + * head context keeps all the pages that comprise the
> > + * request.
> > + */
> > + head->arg.head[0] = rqstp->rq_arg.head[0];
> > + head->arg.tail[0] = rqstp->rq_arg.tail[0];
> > + head->arg.pages = &head->pages[head->count];
> > + head->hdr_count = head->count;
> > + head->arg.page_base = 0;
> > + head->arg.page_len = 0;
> > + head->arg.len = rqstp->rq_arg.len;
> > + head->arg.buflen = rqstp->rq_arg.buflen + PAGE_SIZE;
> >
> > + page_no = 0; page_offset = 0;
> > for (ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
> > - ch->rc_discrim != 0; ch++, ch_no++) {
> > - u64 rs_offset;
> > -next_sge:
> > - ctxt = svc_rdma_get_context(xprt);
> > - ctxt->direction = DMA_FROM_DEVICE;
> > - ctxt->frmr = hdr_ctxt->frmr;
> > - ctxt->read_hdr = NULL;
> > - clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
> > - clear_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
> > + ch->rc_discrim != 0; ch++) {
>
>
> As I understand it, this loop is iterating over the items in the RPC
> header's read chunk list.
>
> RFC 5667 section 4 says
>
> "The server MUST ignore any Read list for other NFS procedures, as well
> as additional Read list entries beyond the first in the list."
>

I interpret this to say the server should ignore an RDMA Write list, if present, when
processing an NFS WRITE (because the server emits RDMA Reads to the client in this case).
Similarly, it should ignore an RDMA Read list when processing an NFS READ (which generates
RDMA Writes from the server->client to fulfill the NFS READ).

> But rdma_read_chunks() still expects multiple chunks in the incoming list.

An RDMA Read list can have multiple chunks, yes?


> Should this code follow the RFC and process only the first chunk in
> each RPC header, or did I misunderstand the RFC text?
>
> I'm not sure how the RPC layer should enforce the requirement to ignore
> chunks for particular classes of NFS requests. But seems like valid
> NFS/RDMA requests will have only zero or one item in each chunk list.
>
>
> The ramification of this change:
>
> The Linux client sends an extra chunk which is a zero pad to satisfy XDR
> alignment. This results in an extra RDMA READ on the wire for a payload
> that is never more than 3 bytes of zeros.
>

This sounds like a logic bug in the server then. You see these IOs on the wire?

> Section 3.7 of RFC 5666 states:
>
> "If in turn, no data remains to be decoded from the inline portion, then
> the receiver MUST conclude that roundup is present, and therefore it
> advances the XDR decode position to that indicated by the next chunk (if
> any). In this way, roundup is passed without ever actually transferring
> additional XDR bytes."
>
> Later, it states:
>
> "Senders SHOULD therefore avoid encoding individual RDMA Read Chunks for
> roundup whenever possible."
>
> The Solaris NFS/RDMA server does not require this extra chunk, but Linux
> appears to need it. When I enable pad optimization on the Linux client,
> it works fine with the Solaris server. But the first NFS WRITE with an
> odd length against the Linux server causes connection loss and the client
> workload hangs.
>
> At some point I'd like to permanently enable this optimization on the
> Linux client. You can try this out with:
>
> sudo echo 1 > /proc/sys/sunrpc/rdma_pad_optimize
>
> The default contents of this file leave pad optimization disabled, for
> compatibility with the current Linux NFS/RDMA server.
>
> So rdma_read_chunks() needs to handle the implicit XDR length round-up.
> That's probably not hard.
>
> What I'm asking is:
>
> Do you agree the server should be changed to comply with section 4 of 5667?
>

I think it is complying and you are misinterpreting. But I could be misinterpreting it
too :)

But we should fix the pad thing.

> Should you change now in this patch, or do you want me to write a patch
> for it once the refactoring patch has been accepted?
>

I'd like to get this refactoring in and then continue improving it. Unfortunately, I
think you still see a bug running over mthca, eh? I haven't debugged that yet. I plan
to, and then I'll resubmit V3 (squashed).


>
>
> >
> > - /* Prepare READ WR */
> > - memset(&read_wr, 0, sizeof read_wr);
> > - read_wr.wr_id = (unsigned long)ctxt;
> > - read_wr.opcode = IB_WR_RDMA_READ;
> > - ctxt->wr_op = read_wr.opcode;
> > - read_wr.send_flags = IB_SEND_SIGNALED;
> > - read_wr.wr.rdma.rkey = ntohl(ch->rc_target.rs_handle);
> > xdr_decode_hyper((__be32 *)&ch->rc_target.rs_offset,
> > &rs_offset);
> > - read_wr.wr.rdma.remote_addr = rs_offset + sgl_offset;
> > - read_wr.sg_list = ctxt->sge;
> > - read_wr.num_sge =
> > - rdma_read_max_sge(xprt, chl_map->ch[ch_no].count);
> > - err = rdma_set_ctxt_sge(xprt, ctxt, hdr_ctxt->frmr,
> > - &rpl_map->sge[chl_map->ch[ch_no].start],
> > - &sgl_offset,
> > - read_wr.num_sge);
> > - if (err) {
> > - svc_rdma_unmap_dma(ctxt);
> > - svc_rdma_put_context(ctxt, 0);
> > - goto out;
> > - }
> > - if (((ch+1)->rc_discrim == 0) &&
> > - (read_wr.num_sge == chl_map->ch[ch_no].count)) {
> > - /*
> > - * Mark the last RDMA_READ with a bit to
> > - * indicate all RPC data has been fetched from
> > - * the client and the RPC needs to be enqueued.
> > - */
> > - set_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
> > - if (hdr_ctxt->frmr) {
> > - set_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
> > - /*
> > - * Invalidate the local MR used to map the data
> > - * sink.
> > - */
> > - if (xprt->sc_dev_caps &
> > - SVCRDMA_DEVCAP_READ_W_INV) {
> > - read_wr.opcode =
> > - IB_WR_RDMA_READ_WITH_INV;
> > - ctxt->wr_op = read_wr.opcode;
> > - read_wr.ex.invalidate_rkey =
> > - ctxt->frmr->mr->lkey;
> > - } else {
> > - /* Prepare INVALIDATE WR */
> > - memset(&inv_wr, 0, sizeof inv_wr);
> > - inv_wr.opcode = IB_WR_LOCAL_INV;
> > - inv_wr.send_flags = IB_SEND_SIGNALED;
> > - inv_wr.ex.invalidate_rkey =
> > - hdr_ctxt->frmr->mr->lkey;
> > - read_wr.next = &inv_wr;
> > - }
> > - }
> > - ctxt->read_hdr = hdr_ctxt;
> > - }
> > - /* Post the read */
> > - err = svc_rdma_send(xprt, &read_wr);
> > - if (err) {
> > - printk(KERN_ERR "svcrdma: Error %d posting RDMA_READ\n",
> > - err);
> > - set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> > - svc_rdma_unmap_dma(ctxt);
> > - svc_rdma_put_context(ctxt, 0);
> > - goto out;
> > + byte_count = ntohl(ch->rc_target.rs_length);
> > +
> > + /* Use FRMR if supported */
> > + if (xprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)
> > + reader = rdma_read_chunk_frmr;
> > + else
> > + reader = rdma_read_chunk_lcl;
>
> The reader function is invariant across iterations of the for() loop.
> In fact, it is invariant for all incoming requests on an xprt, isn't it?
>
> Can svc_rdma_create() plant that function pointer in the xprt?
>

Yea, that sounds good.
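
For illustration, here is a rough sketch of what planting the reader in the transport
might look like. The sc_reader field is an assumed addition to svcxprt_rdma (it is not
in the posted patch); rdma_reader_fn, rdma_read_chunk_frmr, rdma_read_chunk_lcl, and
SVCRDMA_DEVCAP_FAST_REG are the names already used in this series:

	/* Sketch only: choose the chunk reader once, at accept time,
	 * based on the device capabilities, instead of per chunk.
	 */
	static void svc_rdma_select_reader(struct svcxprt_rdma *newxprt)
	{
		if (newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)
			newxprt->sc_reader = rdma_read_chunk_frmr;	/* sc_reader is hypothetical */
		else
			newxprt->sc_reader = rdma_read_chunk_lcl;
	}

	/* rdma_read_chunks() would then just call xprt->sc_reader(...)
	 * with no per-iteration branch.
	 */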


Thanks for reviewing!

Steve.


2014-05-19 19:14:16

by Steve Wise

Subject: RE: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic



> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf
> Of Devesh Sharma
> Sent: Monday, May 19, 2014 2:07 PM
> To: Steve Wise; J. Bruce Fields
> Cc: [email protected]; [email protected]; [email protected]
> Subject: RE: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic
>
> While testing with the ocrdma driver I am seeing the server-side SQ fill up. The log
> follows; I have yet to identify why it is happening. Once this is reported, the client
> side crashes for some reason. My kdump is not working properly, so I am not able to
> analyze the situation fully.
>
> May 19 23:47:02 neo01-el64 kernel: svcrdma: RDMA_WRITE rmr=8008b12, to=45a2d790c,
> xdr_off=0, write_len=68, vec->sge=ffff88086cb4a0c8, vec->count=2
> May 19 23:47:02 neo01-el64 kernel: svcrdma: send_reply returns 0
> May 19 23:47:02 neo01-el64 kernel: svc: server ffff88086409a000 waiting for data (to =
> 3600000)
> May 19 23:47:02 neo01-el64 kernel: svc: transport ffff88087dfa2400 served by daemon
> ffff88086409a000
> May 19 23:47:02 neo01-el64 kernel: svc: server ffff88086409a000, pool 0, transport
> ffff88087dfa2400, inuse=18
> May 19 23:47:02 neo01-el64 kernel: svcrdma: rqstp=ffff88086409a000
> May 19 23:47:02 neo01-el64 kernel: svcrdma: processing ctxt=ffff880866754540 on
> xprt=ffff88087dfa2400, rqstp=ffff88086409a000, status=0
> May 19 23:47:02 neo01-el64 kernel: svcrdma: failed to post SQ WR rc=-22, sc_sq_count=0,
> sc_sq_depth=128
> May 19 23:47:02 neo01-el64 kernel: svcrdma: Error -22 posting RDMA_READ

Hey Devesh,

Looking at ocrdma_post_send(), -22 (-EINVAL) is returned when the QP is not in RTS. If the SQ
is full, -ENOMEM is returned. So I think the send error is a downstream error because the
connection got knocked down. You should try and figure out what kicked the QP out of RTS.
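
For anyone hitting something similar, a purely illustrative sketch of how the post
failure could be triaged at the svc_rdma_send() call site, using the error values
described above (read_wr and xprt are the variables from the existing RDMA_READ path):

	err = svc_rdma_send(xprt, &read_wr);
	if (err) {
		if (err == -ENOMEM)
			/* provider says the SQ really is full */
			pr_err("svcrdma: SQ overflow posting RDMA_READ\n");
		else
			/* e.g. -EINVAL from ocrdma: the QP has left RTS,
			 * so the connection is already gone */
			pr_err("svcrdma: post failed (%d), QP likely not in RTS\n", err);
		set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
	}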


Steve.


2014-05-14 18:21:25

by Chuck Lever

Subject: Re: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes


On May 14, 2014, at 2:11 PM, Steve Wise <[email protected]> wrote:

>
>>>
>>>> If you print ch_count, it is 2 for NFS WRITEs from a Linux client,
>>>> no matter how large the write payload is. Therefore I think the check
>>>> as it is written is not particularly useful.
>>>
>>> Why are there 2?
>>
>> The first chunk lists the pages the server is to read, and the second
>> chunk has the zero pad for XDR alignment.
>>
>> If pad optimization is enabled on the client, there is just 1 chunk in
>> the RPC's Read list.
>>
>
> So the code as it stands violates the RFC by sending 2 chunks?

The Linux client is non-compliant because it sends two chunks. A compliant server ignores the second chunk, so this is functionally harmless (but a waste of resources).

The Linux server is non-compliant because it _requires_ the XDR pad chunk. A compliant client (eg. Linux with pad optimization enabled) does not interoperate with it.

> And if I change the server
> to only consume 1, then everything unaligned will break?

If you change the server to consume only the first chunk, but do not also make it deal with XDR padding as recommended in chapter 3 of RFC 5666, then WRITEs with a length that is not a multiple of four will fail. SYMLINKs with a large payload will also have this problem.

--
Chuck Lever
[email protected]




2014-05-06 19:27:30

by J. Bruce Fields

Subject: Re: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic

On Tue, May 06, 2014 at 12:46:21PM -0500, Steve Wise wrote:
> This patch series refactors the NFSRDMA server marshalling logic to
> remove the intermediary map structures. It also fixes an existing bug
> where the NFSRDMA server was not minding the device fast register page
> list length limitations.
>
> I've also made a git repo available with these patches on top of 3.15-rc4:
>
> git://git.openfabrics.org/~swise/linux svcrdma-refactor
>
> Changes since V1:
>
> - fixed regression for devices that don't support FRMRs (see
> rdma_read_chunk_lcl())
>
> - split the patch up for closer review. However, I request it be squashed
> before merging, as the series is not bisectable, and I think these changes
> should all be a single commit anyway.

If it's not split up in a way that's bisectable, then yes, just don't
bother.

--b.

>
> Please review, and test if you can.
>
> Signed-off-by: Tom Tucker <[email protected]>
> Signed-off-by: Steve Wise <[email protected]>
>
> ---
>
> Tom Tucker (3):
> svcrdma: Sendto changes
> svcrdma: Recvfrom changes
> svcrdma: Transport and header file changes
>
>
> include/linux/sunrpc/svc_rdma.h | 3
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 633 ++++++++++++------------------
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 230 +----------
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 62 ++-
> 4 files changed, 318 insertions(+), 610 deletions(-)
>
> --
>
> Steve / Tom

2014-05-14 14:40:04

by Chuck Lever

Subject: Re: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes


On May 14, 2014, at 10:26 AM, Steve Wise <[email protected]> wrote:

> On 5/13/2014 4:44 PM, Chuck Lever wrote:
>>
>>> I'm not sure what "segment count" you are talking about?
>> AFAICT a chunk is a list of segments. Thus one chunk can reference
>> many pages, one per segment.
>
> Ok yes. rdma_read_chunk_frmr()/rdma_read_chunk_lcl() walk the segments creating rdma read work requests to pull the data for this chunk.
>
>> If you print ch_count, it is 2 for NFS WRITEs from a Linux client,
>> no matter how large the write payload is. Therefore I think the check
>> as it is written is not particularly useful.
>
> Why are there 2?

The first chunk lists the pages the server is to read, and the second
chunk has the zero pad for XDR alignment.

If pad optimization is enabled on the client, there is just 1 chunk in
the RPC's Read list.

>
>> The returned ch_count here is not used beyond that check, after the
>> refactoring patch is applied. The returned byte_count includes all
>> chunks in the passed-in chunk list, whereas I think it should count
>> only the first chunk in the list.
>
> So I'll investigate this more with the goal of only grabbing the first chunk.

[ . . . snipped . . . ]

>>>>
>>>> What I'm asking is:
>>>>
>>>> Do you agree the server should be changed to comply with section 4 of 5667?
>>>>
>>> I think it is complying and you are misinterpreting. But I could be misinterpreting it
>>> too :)
>>>
>>> But we should fix the pad thing.
>> That may be enough to support a client that does not send the
>> pad bytes in a separate chunk. However, I wonder how such a
>> server would behave with the current client.
>>
>> True, this isn't a regression, per se. The server is already
>> behaving incorrectly without the refactoring patch.
>>
>> However, Tom's refactoring rewrites all this logic anyway, and
>> it wouldn't be much of a stretch to make the server comply with
>> RFC 5667, at this point.
>
> Can we defer this until the refactor patch is in, then I'll work on a new patch for the pad stuff.

Yes, we can defer it.

A bug has been filed to track this issue:

https://bugzilla.linux-nfs.org/show_bug.cgi?id=246

--
Chuck Lever
[email protected]




2014-05-13 21:44:53

by Chuck Lever

Subject: Re: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes


On May 13, 2014, at 4:37 PM, Steve Wise <[email protected]> wrote:

>
>
>> -----Original Message-----
>> From: Chuck Lever [mailto:[email protected]]
>> Sent: Tuesday, May 13, 2014 1:22 PM
>> To: Steve Wise
>> Cc: J. Bruce Fields; Linux NFS Mailing List; [email protected]; Tom Tucker
>> Subject: Re: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes
>>
>> Hi Steve-
>>
>> Some random review comments, see below.
>>
>
> <snip>
>
>>> +static int rdma_read_chunks(struct svcxprt_rdma *xprt,
>>> + struct rpcrdma_msg *rmsgp,
>>> + struct svc_rqst *rqstp,
>>> + struct svc_rdma_op_ctxt *head)
>>> {
>>> - struct ib_send_wr read_wr;
>>> - struct ib_send_wr inv_wr;
>>> - int err = 0;
>>> - int ch_no;
>>> - int ch_count;
>>> - int byte_count;
>>> - int sge_count;
>>> - u64 sgl_offset;
>>> + int page_no, ch_count, ret;
>>> struct rpcrdma_read_chunk *ch;
>>> - struct svc_rdma_op_ctxt *ctxt = NULL;
>>> - struct svc_rdma_req_map *rpl_map;
>>> - struct svc_rdma_req_map *chl_map;
>>> + u32 page_offset, byte_count;
>>> + u64 rs_offset;
>>> + rdma_reader_fn reader;
>>>
>>> /* If no read list is present, return 0 */
>>> ch = svc_rdma_get_read_chunk(rmsgp);
>>> @@ -408,122 +368,53 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
>>> if (ch_count > RPCSVC_MAXPAGES)
>>> return -EINVAL;
>>
>> I'm not sure what this ^^^ ch_count check is doing, but maybe I haven't had
>> enough caffeine this morning. Shouldn't this code be comparing the segment
>> count, not the chunk count, with MAXPAGES?
>
> Isn't ch_count really the number of chunks in the RPC?

Sort of. Block comment in front of svc_rdma_rcl_chunk_counts() reads:

74 * Determine number of chunks and total bytes in chunk list. The chunk
75 * list has already been verified to fit within the RPCRDMA header.

So, called from rdma_read_chunks(), ch_count is the number of chunks in
the Read list.

As you point out below, NFS/RDMA passes either a Read or a Write list,
not both. So it amounts to the same thing as the total number of chunks
in the RPC.

> I'm not sure what "segment count" you are talking about?

AFAICT a chunk is a list of segments. Thus one chunk can reference
many pages, one per segment.

If you print ch_count, it is 2 for NFS WRITEs from a Linux client,
no matter how large the write payload is. Therefore I think the check
as it is written is not particularly useful.

The returned ch_count here is not used beyond that check, after the
refactoring patch is applied. The returned byte_count includes all
chunks in the passed-in chunk list, whereas I think it should count
only the first chunk in the list.
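
A minimal sketch of that suggestion, using only helpers and fields already present in
this file (svc_rdma_get_read_chunk(), rc_target.rs_length):

	/* Sketch: derive byte_count from the first Read list entry only,
	 * per RFC 5667 section 4, instead of summing the whole list.
	 */
	ch = svc_rdma_get_read_chunk(rmsgp);
	if (!ch)
		return 0;		/* no Read list present */
	byte_count = ntohl(ch->rc_target.rs_length);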

>
>>
>>
>>>
>>> - /* Allocate temporary reply and chunk maps */
>>> - rpl_map = svc_rdma_get_req_map();
>>> - chl_map = svc_rdma_get_req_map();
>>> -
>>> - if (!xprt->sc_frmr_pg_list_len)
>>> - sge_count = map_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
>>> - rpl_map, chl_map, ch_count,
>>> - byte_count);
>>> - else
>>> - sge_count = fast_reg_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
>>> - rpl_map, chl_map, ch_count,
>>> - byte_count);
>>> - if (sge_count < 0) {
>>> - err = -EIO;
>>> - goto out;
>>> - }
>>> -
>>> - sgl_offset = 0;
>>> - ch_no = 0;
>>> + /* The request is completed when the RDMA_READs complete. The
>>> + * head context keeps all the pages that comprise the
>>> + * request.
>>> + */
>>> + head->arg.head[0] = rqstp->rq_arg.head[0];
>>> + head->arg.tail[0] = rqstp->rq_arg.tail[0];
>>> + head->arg.pages = &head->pages[head->count];
>>> + head->hdr_count = head->count;
>>> + head->arg.page_base = 0;
>>> + head->arg.page_len = 0;
>>> + head->arg.len = rqstp->rq_arg.len;
>>> + head->arg.buflen = rqstp->rq_arg.buflen + PAGE_SIZE;
>>>
>>> + page_no = 0; page_offset = 0;
>>> for (ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
>>> - ch->rc_discrim != 0; ch++, ch_no++) {
>>> - u64 rs_offset;
>>> -next_sge:
>>> - ctxt = svc_rdma_get_context(xprt);
>>> - ctxt->direction = DMA_FROM_DEVICE;
>>> - ctxt->frmr = hdr_ctxt->frmr;
>>> - ctxt->read_hdr = NULL;
>>> - clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
>>> - clear_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
>>> + ch->rc_discrim != 0; ch++) {
>>
>>
>> As I understand it, this loop is iterating over the items in the RPC
>> header's read chunk list.
>>
>> RFC 5667 section 4 says
>>
>> "The server MUST ignore any Read list for other NFS procedures, as well
>> as additional Read list entries beyond the first in the list."
>>
>
> I interpret this to say the server should ignore an RDMA Write list, if present, when
> processing an NFS WRITE (because the server emits RDMA Reads to the client in this case).
> Similarly, it should ignore an RDMA Read list when processing an NFS READ (which generates
> RDMA Writes from the server->client to fulfill the NFS READ).

That's already stated explicitly in a later paragraph. So this text
is getting at something else, IMO. Here's the full context:

4. NFS Versions 2 and 3 Mapping

A single RDMA Write list entry MAY be posted by the client to receive
either the opaque file data from a READ request or the pathname from
a READLINK request. The server MUST ignore a Write list for any
other NFS procedure, as well as any Write list entries beyond the
first in the list.

Similarly, a single RDMA Read list entry MAY be posted by the client
to supply the opaque file data for a WRITE request or the pathname
for a SYMLINK request. The server MUST ignore any Read list for
other NFS procedures, as well as additional Read list entries beyond
the first in the list.

Because there are no NFS version 2 or 3 requests that transfer bulk
data in both directions, it is not necessary to post requests
containing both Write and Read lists. Any unneeded Read or Write
lists are ignored by the server.

If there is a Read list, it can contain one chunk for NFS WRITE and
SYMLINK operations, otherwise it should contain zero chunks. Anything
else MUST be ignored (i.e., the server mustn't process the additional
chunks, and mustn't throw an error).

Processing the second chunk from the client during NFS WRITE operations
is incorrect. The additional chunk should be ignored, and the first
chunk should be padded on the server to make things work right in the
upper layer (the server's NFS WRITE XDR decoder).

>
>> But rdma_read_chunks() still expects multiple chunks in the incoming list.
>
> An RDMA Read list can have multiple chunks, yes?

It can. But RFC 5667 suggests that particularly for NFS requests, only
a Read list of zero or one chunks is valid.

>
>> Should this code follow the RFC and process only the first chunk in
>> each RPC header, or did I misunderstand the RFC text?
>>
>> I'm not sure how the RPC layer should enforce the requirement to ignore
>> chunks for particular classes of NFS requests. But seems like valid
>> NFS/RDMA requests will have only zero or one item in each chunk list.
>>
>>
>> The ramification of this change:
>>
>> The Linux client sends an extra chunk which is a zero pad to satisfy XDR
>> alignment. This results in an extra RDMA READ on the wire for a payload
>> that is never more than 3 bytes of zeros.
>>
>
> This sounds like a logic bug in the server then.

Perhaps it's simply that this code was designed before RFC 5667 was
published. Certainly the client expects the server might work this way.

> You see these IOs on the wire?

I see extra iterations in the RDMA READ loop on the server, both before
and after the refactoring patch.

I haven't looked at the wire because ib_dump doesn't work with the mthca
provider.

>
>> Section 3.7 of RFC 5666 states:
>>
>> "If in turn, no data remains to be decoded from the inline portion, then
>> the receiver MUST conclude that roundup is present, and therefore it
>> advances the XDR decode position to that indicated by the next chunk (if
>> any). In this way, roundup is passed without ever actually transferring
>> additional XDR bytes."
>>
>> Later, it states:
>>
>> "Senders SHOULD therefore avoid encoding individual RDMA Read Chunks for
>> roundup whenever possible."
>>
>> The Solaris NFS/RDMA server does not require this extra chunk, but Linux
>> appears to need it. When I enable pad optimization on the Linux client,
>> it works fine with the Solaris server. But the first NFS WRITE with an
>> odd length against the Linux server causes connection loss and the client
>> workload hangs.
>>
>> At some point I'd like to permanently enable this optimization on the
>> Linux client. You can try this out with:
>>
>> sudo echo 1 > /proc/sys/sunrpc/rdma_pad_optimize
>>
>> The default contents of this file leave pad optimization disabled, for
>> compatibility with the current Linux NFS/RDMA server.
>>
>> So rdma_read_chunks() needs to handle the implicit XDR length round-up.
>> That's probably not hard.
>>
>> What I'm asking is:
>>
>> Do you agree the server should be changed to comply with section 4 of 5667?
>>
>
> I think it is complying and you are misinterpreting. But I could be misinterpreting it
> too :)
>
> But we should fix the pad thing.

That may be enough to support a client that does not send the
pad bytes in a separate chunk. However, I wonder how such a
server would behave with the current client.

True, this isn't a regression, per se. The server is already
behaving incorrectly without the refactoring patch.

However, Tom's refactoring rewrites all this logic anyway, and
it wouldn't be much of a stretch to make the server comply with
RFC 5667, at this point.

--
Chuck Lever
[email protected]




2014-05-20 05:42:26

by Devesh Sharma

Subject: RE: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic

Hi Steve,

> -----Original Message-----
> From: Steve Wise [mailto:[email protected]]
> Sent: Tuesday, May 20, 2014 12:44 AM
> To: Devesh Sharma; 'J. Bruce Fields'
> Cc: [email protected]; [email protected];
> [email protected]
> Subject: RE: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic
>
>
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Devesh Sharma
> > Sent: Monday, May 19, 2014 2:07 PM
> > To: Steve Wise; J. Bruce Fields
> > Cc: [email protected]; [email protected];
> > [email protected]
> > Subject: RE: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic
> >
> > While testing with the ocrdma driver I am seeing the server-side SQ fill up. The log
> > follows; I have yet to identify why it is happening. Once this is reported, the
> > client side crashes for some reason. My kdump is not working properly, so I am not
> > able to analyze the situation fully.
> >
> > May 19 23:47:02 neo01-el64 kernel: svcrdma: RDMA_WRITE rmr=8008b12,
> > to=45a2d790c, xdr_off=0, write_len=68, vec->sge=ffff88086cb4a0c8,
> > vec->count=2 May 19 23:47:02 neo01-el64 kernel: svcrdma: send_reply
> > returns 0 May 19 23:47:02 neo01-el64 kernel: svc: server
> > ffff88086409a000 waiting for data (to =
> > 3600000)
> > May 19 23:47:02 neo01-el64 kernel: svc: transport ffff88087dfa2400
> > served by daemon
> > ffff88086409a000
> > May 19 23:47:02 neo01-el64 kernel: svc: server ffff88086409a000, pool
> > 0, transport ffff88087dfa2400, inuse=18 May 19 23:47:02 neo01-el64
> > kernel: svcrdma: rqstp=ffff88086409a000 May 19 23:47:02 neo01-el64
> > kernel: svcrdma: processing ctxt=ffff880866754540 on
> > xprt=ffff88087dfa2400, rqstp=ffff88086409a000, status=0 May 19
> > 23:47:02 neo01-el64 kernel: svcrdma: failed to post SQ WR rc=-22,
> > sc_sq_count=0,
> > sc_sq_depth=128
> > May 19 23:47:02 neo01-el64 kernel: svcrdma: Error -22 posting
> > RDMA_READ
>
> Hey Devesh,
>
> Looking at ocrdma_post_send(), -22 (-EINVAL) is returned when the QP is not in
> RTS. If the SQ is full, -ENOMEM is returned. So I think the send error is a
> downstream error because the connection got knocked down. You should
> try and figure out what kicked the QP out of RTS.

Oh wow! I completely missed that; let me go through the logs once again and update you.

>
>
> Steve.


2014-05-14 18:11:37

by Steve Wise

Subject: RE: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes


> >
> >> If you print ch_count, it is 2 for NFS WRITEs from a Linux client,
> >> no matter how large the write payload is. Therefore I think the check
> >> as it is written is not particularly useful.
> >
> > Why are there 2?
>
> The first chunk lists the pages the server is to read, and the second
> chunk has the zero pad for XDR alignment.
>
> If pad optimization is enabled on the client, there is just 1 chunk in
> the RPC's Read list.
>

So the code as it stands violates the RFC by sending 2 chunks? And if I change the server
to only consume 1, then everything unaligned will break?




2014-05-06 21:13:07

by J. Bruce Fields

Subject: Re: [PATCH V2 RFC 1/3] svcrdma: Transport and header file changes

On Tue, May 06, 2014 at 04:02:41PM -0500, Steve Wise wrote:
> On 5/6/2014 2:21 PM, J. Bruce Fields wrote:
> >On Tue, May 06, 2014 at 12:46:27PM -0500, Steve Wise wrote:
> >>From: Tom Tucker <[email protected]>
> >>
> >>Change poll logic to grab up to 6 completions at a time.
> >>
> >>RDMA write and send completions no longer deal with fastreg objects.
> >>
> >>Set SVCRDMA_DEVCAP_FAST_REG and allocate a dma_mr based on the device
> >>capabilities.
> >>
> >>Signed-off-by: Tom Tucker <[email protected]>
> >>Signed-off-by: Steve Wise <[email protected]>
> >>---
> >>
> >> include/linux/sunrpc/svc_rdma.h | 3 -
> >> net/sunrpc/xprtrdma/svc_rdma_transport.c | 62 +++++++++++++++++-------------
> >> 2 files changed, 37 insertions(+), 28 deletions(-)
> >>
> >>diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> >>index 0b8e3e6..5cf99a0 100644
> >>--- a/include/linux/sunrpc/svc_rdma.h
> >>+++ b/include/linux/sunrpc/svc_rdma.h
> >>@@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
> >> struct list_head frmr_list;
> >> };
> >> struct svc_rdma_req_map {
> >>- struct svc_rdma_fastreg_mr *frmr;
> >> unsigned long count;
> >> union {
> >> struct kvec sge[RPCSVC_MAXPAGES];
> >> struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
> >>+ unsigned long lkey[RPCSVC_MAXPAGES];
> >> };
> >> };
> >>-#define RDMACTXT_F_FAST_UNREG 1
> >> #define RDMACTXT_F_LAST_CTXT 2
> >> #define SVCRDMA_DEVCAP_FAST_REG 1 /* fast mr registration */
> >>diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >>index 25688fa..2c5b201 100644
> >>--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >>+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >>@@ -1,4 +1,5 @@
> >> /*
> >>+ * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
> >> * Copyright (c) 2005-2007 Network Appliance, Inc. All rights reserved.
> >> *
> >> * This software is available to you under a choice of one of two
> >>@@ -160,7 +161,6 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
> >> schedule_timeout_uninterruptible(msecs_to_jiffies(500));
> >> }
> >> map->count = 0;
> >>- map->frmr = NULL;
> >> return map;
> >> }
> >>@@ -336,22 +336,21 @@ static void process_context(struct svcxprt_rdma *xprt,
> >> switch (ctxt->wr_op) {
> >> case IB_WR_SEND:
> >>- if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
> >>- svc_rdma_put_frmr(xprt, ctxt->frmr);
> >>+ BUG_ON(ctxt->frmr);
> >> svc_rdma_put_context(ctxt, 1);
> >> break;
> >> case IB_WR_RDMA_WRITE:
> >>+ BUG_ON(ctxt->frmr);
> >> svc_rdma_put_context(ctxt, 0);
> >> break;
> >> case IB_WR_RDMA_READ:
> >> case IB_WR_RDMA_READ_WITH_INV:
> >>+ svc_rdma_put_frmr(xprt, ctxt->frmr);
> >> if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
> >> struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
> >> BUG_ON(!read_hdr);
> >>- if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
> >>- svc_rdma_put_frmr(xprt, ctxt->frmr);
> >> spin_lock_bh(&xprt->sc_rq_dto_lock);
> >> set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
> >> list_add_tail(&read_hdr->dto_q,
> >>@@ -363,6 +362,7 @@ static void process_context(struct svcxprt_rdma *xprt,
> >> break;
> >> default:
> >>+ BUG_ON(1);
> >> printk(KERN_ERR "svcrdma: unexpected completion type, "
> >> "opcode=%d\n",
> >> ctxt->wr_op);
> >Note the printk's unreachable now. Should some of these BUG_ON()'s be
> >WARN_ON()'s?
>
> I'll remove the printk. And if any of the new BUG_ON()'s can be
> WARN_ON(), then I'll do that. But only if proceeding after a
> WARN_ON() results in a working server.

The other thing to keep in mind is what the consequences of the BUG
might be--e.g. if we BUG while holding an important lock then that lock
never gets dropped and the system can freeze pretty quickly--possibly
before we get any useful information to the system logs. On a quick
check that doesn't look like the case here, though.

--b.

2014-05-06 21:02:50

by Steve Wise

Subject: Re: [PATCH V2 RFC 1/3] svcrdma: Transport and header file changes

On 5/6/2014 2:21 PM, J. Bruce Fields wrote:
> On Tue, May 06, 2014 at 12:46:27PM -0500, Steve Wise wrote:
>> From: Tom Tucker <[email protected]>
>>
>> Change poll logic to grab up to 6 completions at a time.
>>
>> RDMA write and send completions no longer deal with fastreg objects.
>>
>> Set SVCRDMA_DEVCAP_FAST_REG and allocate a dma_mr based on the device
>> capabilities.
>>
>> Signed-off-by: Tom Tucker <[email protected]>
>> Signed-off-by: Steve Wise <[email protected]>
>> ---
>>
>> include/linux/sunrpc/svc_rdma.h | 3 -
>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 62 +++++++++++++++++-------------
>> 2 files changed, 37 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>> index 0b8e3e6..5cf99a0 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
>> struct list_head frmr_list;
>> };
>> struct svc_rdma_req_map {
>> - struct svc_rdma_fastreg_mr *frmr;
>> unsigned long count;
>> union {
>> struct kvec sge[RPCSVC_MAXPAGES];
>> struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
>> + unsigned long lkey[RPCSVC_MAXPAGES];
>> };
>> };
>> -#define RDMACTXT_F_FAST_UNREG 1
>> #define RDMACTXT_F_LAST_CTXT 2
>>
>> #define SVCRDMA_DEVCAP_FAST_REG 1 /* fast mr registration */
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index 25688fa..2c5b201 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -1,4 +1,5 @@
>> /*
>> + * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
>> * Copyright (c) 2005-2007 Network Appliance, Inc. All rights reserved.
>> *
>> * This software is available to you under a choice of one of two
>> @@ -160,7 +161,6 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
>> schedule_timeout_uninterruptible(msecs_to_jiffies(500));
>> }
>> map->count = 0;
>> - map->frmr = NULL;
>> return map;
>> }
>>
>> @@ -336,22 +336,21 @@ static void process_context(struct svcxprt_rdma *xprt,
>>
>> switch (ctxt->wr_op) {
>> case IB_WR_SEND:
>> - if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
>> - svc_rdma_put_frmr(xprt, ctxt->frmr);
>> + BUG_ON(ctxt->frmr);
>> svc_rdma_put_context(ctxt, 1);
>> break;
>>
>> case IB_WR_RDMA_WRITE:
>> + BUG_ON(ctxt->frmr);
>> svc_rdma_put_context(ctxt, 0);
>> break;
>>
>> case IB_WR_RDMA_READ:
>> case IB_WR_RDMA_READ_WITH_INV:
>> + svc_rdma_put_frmr(xprt, ctxt->frmr);
>> if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
>> struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
>> BUG_ON(!read_hdr);
>> - if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
>> - svc_rdma_put_frmr(xprt, ctxt->frmr);
>> spin_lock_bh(&xprt->sc_rq_dto_lock);
>> set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
>> list_add_tail(&read_hdr->dto_q,
>> @@ -363,6 +362,7 @@ static void process_context(struct svcxprt_rdma *xprt,
>> break;
>>
>> default:
>> + BUG_ON(1);
>> printk(KERN_ERR "svcrdma: unexpected completion type, "
>> "opcode=%d\n",
>> ctxt->wr_op);
> Note the printk's unreachable now. Should some of these BUG_ON()'s be
> WARN_ON()'s?

I'll remove the printk. And if any of the new BUG_ON()'s can be
WARN_ON(), then I'll do that. But only if proceeding after a WARN_ON()
results in a working server.
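
As a concrete example of a conversion that would still leave a working server
(illustrative only), the IB_WR_SEND case could warn and recover rather than panic:

	case IB_WR_SEND:
		/* A leftover frmr here indicates a bug, but the server can
		 * keep running if we return it to the pool and warn.
		 */
		if (WARN_ON(ctxt->frmr))
			svc_rdma_put_frmr(xprt, ctxt->frmr);
		svc_rdma_put_context(ctxt, 1);
		break;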

>> @@ -378,29 +378,42 @@ static void process_context(struct svcxprt_rdma *xprt,
>> static void sq_cq_reap(struct svcxprt_rdma *xprt)
>> {
>> struct svc_rdma_op_ctxt *ctxt = NULL;
>> - struct ib_wc wc;
>> + struct ib_wc wc_a[6];
>> + struct ib_wc *wc;
>> struct ib_cq *cq = xprt->sc_sq_cq;
>> int ret;
> May want to keep an eye on the stack usage here?

Ok. Perhaps I'll put the array in the svc_rdma_op_ctxt.


> --b.
>
>>
>> + memset(wc_a, 0, sizeof(wc_a));
>> +
>> if (!test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags))
>> return;
>>
>> ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
>> atomic_inc(&rdma_stat_sq_poll);
>> - while ((ret = ib_poll_cq(cq, 1, &wc)) > 0) {
>> - if (wc.status != IB_WC_SUCCESS)
>> - /* Close the transport */
>> - set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
>> + while ((ret = ib_poll_cq(cq, ARRAY_SIZE(wc_a), wc_a)) > 0) {
>> + int i;
>>
>> - /* Decrement used SQ WR count */
>> - atomic_dec(&xprt->sc_sq_count);
>> - wake_up(&xprt->sc_send_wait);
>> + for (i = 0; i < ret; i++) {
>> + wc = &wc_a[i];
>> + if (wc->status != IB_WC_SUCCESS) {
>> + dprintk("svcrdma: sq wc err status %d\n",
>> + wc->status);
>>
>> - ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
>> - if (ctxt)
>> - process_context(xprt, ctxt);
>> + /* Close the transport */
>> + set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
>> + }
>>
>> - svc_xprt_put(&xprt->sc_xprt);
>> + /* Decrement used SQ WR count */
>> + atomic_dec(&xprt->sc_sq_count);
>> + wake_up(&xprt->sc_send_wait);
>> +
>> + ctxt = (struct svc_rdma_op_ctxt *)
>> + (unsigned long)wc->wr_id;
>> + if (ctxt)
>> + process_context(xprt, ctxt);
>> +
>> + svc_xprt_put(&xprt->sc_xprt);
>> + }
>> }
>>
>> if (ctxt)
>> @@ -993,7 +1006,11 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>> need_dma_mr = 0;
>> break;
>> case RDMA_TRANSPORT_IB:
>> - if (!(devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
>> + if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
>> + need_dma_mr = 1;
>> + dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
>> + } else if (!(devattr.device_cap_flags &
>> + IB_DEVICE_LOCAL_DMA_LKEY)) {
>> need_dma_mr = 1;
>> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
>> } else
>> @@ -1190,14 +1207,7 @@ static int svc_rdma_has_wspace(struct svc_xprt *xprt)
>> container_of(xprt, struct svcxprt_rdma, sc_xprt);
>>
>> /*
>> - * If there are fewer SQ WR available than required to send a
>> - * simple response, return false.
>> - */
>> - if ((rdma->sc_sq_depth - atomic_read(&rdma->sc_sq_count) < 3))
>> - return 0;
>> -
>> - /*
>> - * ...or there are already waiters on the SQ,
>> + * If there are already waiters on the SQ,
>> * return false.
>> */
>> if (waitqueue_active(&rdma->sc_send_wait))
>>


2014-05-14 18:23:57

by Steve Wise

Subject: RE: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes


> >>
> >
> > So the code as it stands violates the RFC by sending 2 chunks?
>
> The Linux client is non-compliant because it sends two chunks. A compliant server
> ignores the second chunk, so this is functionally harmless (but a waste of resources).
>
> The Linux server is non-compliant because it _requires_ the XDR pad chunk. A compliant
> client (eg. Linux with pad optimization enabled) does not interoperate with it.
>
> > And if I change the server
> > to only consume 1, then everything unaligned will break?
>
> If you change the server to consume only the first chunk, but do not also make it deal
> with XDR padding as recommended in chapter 3 of RFC 5666, then WRITEs with a length that
> is not a multiple of four will fail. SYMLINKs with a large payload will also have this
> problem.
>

Thanks, Chuck, for clarifying this! I'm going to defer this issue and fix it with a
subsequent patch that processes only the first chunk and handles the pad correctly.

Steve.



2014-05-06 19:21:58

by J. Bruce Fields

Subject: Re: [PATCH V2 RFC 1/3] svcrdma: Transport and header file changes

On Tue, May 06, 2014 at 12:46:27PM -0500, Steve Wise wrote:
> From: Tom Tucker <[email protected]>
>
> Change poll logic to grab up to 6 completions at a time.
>
> RDMA write and send completions no longer deal with fastreg objects.
>
> Set SVCRDMA_DEVCAP_FAST_REG and allocate a dma_mr based on the device
> capabilities.
>
> Signed-off-by: Tom Tucker <[email protected]>
> Signed-off-by: Steve Wise <[email protected]>
> ---
>
> include/linux/sunrpc/svc_rdma.h | 3 -
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 62 +++++++++++++++++-------------
> 2 files changed, 37 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 0b8e3e6..5cf99a0 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
> struct list_head frmr_list;
> };
> struct svc_rdma_req_map {
> - struct svc_rdma_fastreg_mr *frmr;
> unsigned long count;
> union {
> struct kvec sge[RPCSVC_MAXPAGES];
> struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
> + unsigned long lkey[RPCSVC_MAXPAGES];
> };
> };
> -#define RDMACTXT_F_FAST_UNREG 1
> #define RDMACTXT_F_LAST_CTXT 2
>
> #define SVCRDMA_DEVCAP_FAST_REG 1 /* fast mr registration */
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 25688fa..2c5b201 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -1,4 +1,5 @@
> /*
> + * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
> * Copyright (c) 2005-2007 Network Appliance, Inc. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -160,7 +161,6 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
> schedule_timeout_uninterruptible(msecs_to_jiffies(500));
> }
> map->count = 0;
> - map->frmr = NULL;
> return map;
> }
>
> @@ -336,22 +336,21 @@ static void process_context(struct svcxprt_rdma *xprt,
>
> switch (ctxt->wr_op) {
> case IB_WR_SEND:
> - if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
> - svc_rdma_put_frmr(xprt, ctxt->frmr);
> + BUG_ON(ctxt->frmr);
> svc_rdma_put_context(ctxt, 1);
> break;
>
> case IB_WR_RDMA_WRITE:
> + BUG_ON(ctxt->frmr);
> svc_rdma_put_context(ctxt, 0);
> break;
>
> case IB_WR_RDMA_READ:
> case IB_WR_RDMA_READ_WITH_INV:
> + svc_rdma_put_frmr(xprt, ctxt->frmr);
> if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
> struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
> BUG_ON(!read_hdr);
> - if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
> - svc_rdma_put_frmr(xprt, ctxt->frmr);
> spin_lock_bh(&xprt->sc_rq_dto_lock);
> set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
> list_add_tail(&read_hdr->dto_q,
> @@ -363,6 +362,7 @@ static void process_context(struct svcxprt_rdma *xprt,
> break;
>
> default:
> + BUG_ON(1);
> printk(KERN_ERR "svcrdma: unexpected completion type, "
> "opcode=%d\n",
> ctxt->wr_op);

Note the printk's unreachable now. Should some of these BUG_ON()'s be
WARN_ON()'s?

> @@ -378,29 +378,42 @@ static void process_context(struct svcxprt_rdma *xprt,
> static void sq_cq_reap(struct svcxprt_rdma *xprt)
> {
> struct svc_rdma_op_ctxt *ctxt = NULL;
> - struct ib_wc wc;
> + struct ib_wc wc_a[6];
> + struct ib_wc *wc;
> struct ib_cq *cq = xprt->sc_sq_cq;
> int ret;

May want to keep an eye on the stack usage here?

--b.

>
> + memset(wc_a, 0, sizeof(wc_a));
> +
> if (!test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags))
> return;
>
> ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
> atomic_inc(&rdma_stat_sq_poll);
> - while ((ret = ib_poll_cq(cq, 1, &wc)) > 0) {
> - if (wc.status != IB_WC_SUCCESS)
> - /* Close the transport */
> - set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> + while ((ret = ib_poll_cq(cq, ARRAY_SIZE(wc_a), wc_a)) > 0) {
> + int i;
>
> - /* Decrement used SQ WR count */
> - atomic_dec(&xprt->sc_sq_count);
> - wake_up(&xprt->sc_send_wait);
> + for (i = 0; i < ret; i++) {
> + wc = &wc_a[i];
> + if (wc->status != IB_WC_SUCCESS) {
> + dprintk("svcrdma: sq wc err status %d\n",
> + wc->status);
>
> - ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
> - if (ctxt)
> - process_context(xprt, ctxt);
> + /* Close the transport */
> + set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> + }
>
> - svc_xprt_put(&xprt->sc_xprt);
> + /* Decrement used SQ WR count */
> + atomic_dec(&xprt->sc_sq_count);
> + wake_up(&xprt->sc_send_wait);
> +
> + ctxt = (struct svc_rdma_op_ctxt *)
> + (unsigned long)wc->wr_id;
> + if (ctxt)
> + process_context(xprt, ctxt);
> +
> + svc_xprt_put(&xprt->sc_xprt);
> + }
> }
>
> if (ctxt)
> @@ -993,7 +1006,11 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> need_dma_mr = 0;
> break;
> case RDMA_TRANSPORT_IB:
> - if (!(devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
> + if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
> + need_dma_mr = 1;
> + dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
> + } else if (!(devattr.device_cap_flags &
> + IB_DEVICE_LOCAL_DMA_LKEY)) {
> need_dma_mr = 1;
> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
> } else
> @@ -1190,14 +1207,7 @@ static int svc_rdma_has_wspace(struct svc_xprt *xprt)
> container_of(xprt, struct svcxprt_rdma, sc_xprt);
>
> /*
> - * If there are fewer SQ WR available than required to send a
> - * simple response, return false.
> - */
> - if ((rdma->sc_sq_depth - atomic_read(&rdma->sc_sq_count) < 3))
> - return 0;
> -
> - /*
> - * ...or there are already waiters on the SQ,
> + * If there are already waiters on the SQ,
> * return false.
> */
> if (waitqueue_active(&rdma->sc_send_wait))
>

2014-05-14 14:26:26

by Steve Wise

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes

On 5/13/2014 4:44 PM, Chuck Lever wrote:
>>>> +static int rdma_read_chunks(struct svcxprt_rdma *xprt,
>>>> + struct rpcrdma_msg *rmsgp,
>>>> + struct svc_rqst *rqstp,
>>>> + struct svc_rdma_op_ctxt *head)
>>>> {
>>>> - struct ib_send_wr read_wr;
>>>> - struct ib_send_wr inv_wr;
>>>> - int err = 0;
>>>> - int ch_no;
>>>> - int ch_count;
>>>> - int byte_count;
>>>> - int sge_count;
>>>> - u64 sgl_offset;
>>>> + int page_no, ch_count, ret;
>>>> struct rpcrdma_read_chunk *ch;
>>>> - struct svc_rdma_op_ctxt *ctxt = NULL;
>>>> - struct svc_rdma_req_map *rpl_map;
>>>> - struct svc_rdma_req_map *chl_map;
>>>> + u32 page_offset, byte_count;
>>>> + u64 rs_offset;
>>>> + rdma_reader_fn reader;
>>>>
>>>> /* If no read list is present, return 0 */
>>>> ch = svc_rdma_get_read_chunk(rmsgp);
>>>> @@ -408,122 +368,53 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
>>>> if (ch_count > RPCSVC_MAXPAGES)
>>>> return -EINVAL;
>>> I'm not sure what this ^^^ ch_count check is doing, but maybe I haven't had
>>> enough caffeine this morning. Shouldn't this code be comparing the segment
>>> count, not the chunk count, with MAXPAGES?
>> Isn't ch_count really the number of chunks in the RPC?
> Sort of. Block comment in front of svc_rdma_rcl_chunk_counts() reads:
>
> 74 * Determine number of chunks and total bytes in chunk list. The chunk
> 75 * list has already been verified to fit within the RPCRDMA header.
>
> So, called from rdma_read_chunks(), ch_count is the number of chunks in
> the Read list.
>
> As you point out below, NFS/RDMA passes either a Read or a Write list,
> not both. So it amounts to the same thing as the total number of chunks
> in the RPC.
>
>> I'm not sure what "segment count" you are talking about?
> AFAICT a chunk is a list of segments. Thus one chunk can reference
> many pages, one per segment.

Ok yes. rdma_read_chunk_frmr()/rdma_read_chunk_lcl() walk the segments
creating rdma read work requests to pull the data for this chunk.
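
For anyone following along, the list-entry layout (paraphrased from
include/linux/sunrpc/rpc_rdma.h; see the header for the authoritative
definition) is roughly:

	struct rpcrdma_segment {
		__be32 rs_handle;	/* rkey for the client's memory region */
		__be32 rs_length;	/* length of this segment in bytes */
		__be64 rs_offset;	/* tagged offset in that region */
	};

	struct rpcrdma_read_chunk {
		__be32 rc_discrim;	/* non-zero while entries remain */
		__be32 rc_position;	/* XDR stream offset of the chunk */
		struct rpcrdma_segment rc_target;
	};

Each read-list entry carries one segment, and (as I read RFC 5666) entries that
share an rc_position make up a single chunk, which is how one chunk can end up
referencing many pages.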

> If you print ch_count, it is 2 for NFS WRITEs from a Linux client,
> no matter how large the write payload is. Therefore I think the check
> as it is written is not particularly useful.

Why are there 2?

> The returned ch_count here is not used beyond that check, after the
> refactoring patch is applied. The returned byte_count includes all
> chunks in the passed-in chunk list, whereas I think it should count
> only the first chunk in the list.

So I'll investigate this more with the goal of only grabbing the first
chunk.

>>>
>>>> - /* Allocate temporary reply and chunk maps */
>>>> - rpl_map = svc_rdma_get_req_map();
>>>> - chl_map = svc_rdma_get_req_map();
>>>> -
>>>> - if (!xprt->sc_frmr_pg_list_len)
>>>> - sge_count = map_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
>>>> - rpl_map, chl_map, ch_count,
>>>> - byte_count);
>>>> - else
>>>> - sge_count = fast_reg_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
>>>> - rpl_map, chl_map, ch_count,
>>>> - byte_count);
>>>> - if (sge_count < 0) {
>>>> - err = -EIO;
>>>> - goto out;
>>>> - }
>>>> -
>>>> - sgl_offset = 0;
>>>> - ch_no = 0;
>>>> + /* The request is completed when the RDMA_READs complete. The
>>>> + * head context keeps all the pages that comprise the
>>>> + * request.
>>>> + */
>>>> + head->arg.head[0] = rqstp->rq_arg.head[0];
>>>> + head->arg.tail[0] = rqstp->rq_arg.tail[0];
>>>> + head->arg.pages = &head->pages[head->count];
>>>> + head->hdr_count = head->count;
>>>> + head->arg.page_base = 0;
>>>> + head->arg.page_len = 0;
>>>> + head->arg.len = rqstp->rq_arg.len;
>>>> + head->arg.buflen = rqstp->rq_arg.buflen + PAGE_SIZE;
>>>>
>>>> + page_no = 0; page_offset = 0;
>>>> for (ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
>>>> - ch->rc_discrim != 0; ch++, ch_no++) {
>>>> - u64 rs_offset;
>>>> -next_sge:
>>>> - ctxt = svc_rdma_get_context(xprt);
>>>> - ctxt->direction = DMA_FROM_DEVICE;
>>>> - ctxt->frmr = hdr_ctxt->frmr;
>>>> - ctxt->read_hdr = NULL;
>>>> - clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
>>>> - clear_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
>>>> + ch->rc_discrim != 0; ch++) {
>>>
>>> As I understand it, this loop is iterating over the items in the RPC
>>> header's read chunk list.
>>>
>>> RFC 5667 section 4 says
>>>
>>> "The server MUST ignore any Read list for other NFS procedures, as well
>>> as additional Read list entries beyond the first in the list."
>>>
>> I interpret this to say the server should ignore an RDMA Write list, if present, when
>> processing an NFS WRITE (because the server emits RDMA Reads to the client in this case).
>> Similarly, it should ignore an RDMA Read list when processing an NFS READ (which generates
>> RDMA Writes from the server->client to fulfill the NFS READ).
> That's already stated explicitly in a later paragraph. So this text
> is getting at something else, IMO. Here's the full context:
>
> 4. NFS Versions 2 and 3 Mapping
>
> A single RDMA Write list entry MAY be posted by the client to receive
> either the opaque file data from a READ request or the pathname from
> a READLINK request. The server MUST ignore a Write list for any
> other NFS procedure, as well as any Write list entries beyond the
> first in the list.
>
> Similarly, a single RDMA Read list entry MAY be posted by the client
> to supply the opaque file data for a WRITE request or the pathname
> for a SYMLINK request. The server MUST ignore any Read list for
> other NFS procedures, as well as additional Read list entries beyond
> the first in the list.
>
> Because there are no NFS version 2 or 3 requests that transfer bulk
> data in both directions, it is not necessary to post requests
> containing both Write and Read lists. Any unneeded Read or Write
> lists are ignored by the server.
>
> If there is a Read list, it can contain one chunk for NFS WRITE and
> SYMLINK operations, otherwise it should contain zero chunks. Anything
> else MUST be ignored (i.e., the server mustn't process the additional
> chunks, and mustn't throw an error).
>
> Processing the second chunk from the client during NFS WRITE operations
> is incorrect. The additional chunk should be ignored, and the first
> chunk should be padded on the server to make things work right in the
> upper layer (the server's NFS WRITE XDR decoder).

Ok. Padding is another issue, and I'd like to defer it until we get
this refactor patch merged.

>
>>> But rdma_read_chunks() still expects multiple chunks in the incoming list.
>> An RDMA Read list can have multiple chunks, yes?
> It can. But RFC 5667 suggests that particularly for NFS requests, only
> a Read list of zero or one chunks is valid.
>
>>> Should this code follow the RFC and process only the first chunk in
>>> each RPC header, or did I misunderstand the RFC text?
>>>
>>> I'm not sure how the RPC layer should enforce the requirement to ignore
>>> chunks for particular classes of NFS requests. But seems like valid
>>> NFS/RDMA requests will have only zero or one item in each chunk list.
>>>
>>>
>>> The ramification of this change:
>>>
>>> The Linux client sends an extra chunk which is a zero pad to satisfy XDR
>>> alignment. This results in an extra RDMA READ on the wire for a payload
>>> that is never more than 3 bytes of zeros.
>>>
>> This sounds like a logic bug in the server then.
> Perhaps it's simply that this code was designed before RFC 5667 was
> published. Certainly the client expects the server might work this way.

I don't think so. Just never implemented correctly.

>> You see these IOs on the wire?
> I see extra iterations in the RDMA READ loop on the server, both before
> and after the refactoring patch.
>
> I haven't looked at the wire because ib_dump doesn't work with the mthca
> provider.
>
>>> Section 3.7 of RFC 5666 states:
>>>
>>> "If in turn, no data remains to be decoded from the inline portion, then
>>> the receiver MUST conclude that roundup is present, and therefore it
>>> advances the XDR decode position to that indicated by the next chunk (if
>>> any). In this way, roundup is passed without ever actually transferring
>>> additional XDR bytes."
>>>
>>> Later, it states:
>>>
>>> "Senders SHOULD therefore avoid encoding individual RDMA Read Chunks for
>>> roundup whenever possible."
>>>
>>> The Solaris NFS/RDMA server does not require this extra chunk, but Linux
>>> appears to need it. When I enable pad optimization on the Linux client,
>>> it works fine with the Solaris server. But the first NFS WRITE with an
>>> odd length against the Linux server causes connection loss and the client
>>> workload hangs.
>>>
>>> At some point I'd like to permanently enable this optimization on the
>>> Linux client. You can try this out with:
>>>
>>> sudo echo 1 > /proc/sys/sunrpc/rdma_pad_optimize
>>>
>>> The default contents of this file leave pad optimization disabled, for
>>> compatibility with the current Linux NFS/RDMA server.
>>>
>>> So rdma_read_chunks() needs to handle the implicit XDR length round-up.
>>> That's probably not hard.
>>>
>>> What I'm asking is:
>>>
>>> Do you agree the server should be changed to comply with section 4 of 5667?
>>>
>> I think it is complying and you are misinterpreting. But I could be misinterpreting it
>> too :)
>>
>> But we should fix the pad thing.
> That may be enough to support a client that does not send the
> pad bytes in a separate chunk. However, I wonder how such a
> server would behave with the current client.
>
> True, this isn't a regression, per se. The server is already
> behaving incorrectly without the refactoring patch.
>
> However, Tom's refactoring rewrites all this logic anyway, and
> it wouldn't be much of a stretch to make the server comply with
> RFC 5667, at this point.

Can we defer this until the refactor patch is in? Then I'll work on a
new patch for the pad stuff.

Thanks,

Steve.

2014-05-19 19:07:13

by Devesh Sharma

[permalink] [raw]
Subject: RE: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic

While testing with the ocrdma driver I am seeing the server-side SQ become full. The log is below; I have yet to identify why it happens. Once this error is reported, the client side crashes for some reason.
My kdump is not working properly, so I am not able to analyze the crash in detail.

May 19 23:47:02 neo01-el64 kernel: svcrdma: RDMA_WRITE rmr=8008b12, to=45a2d790c, xdr_off=0, write_len=68, vec->sge=ffff88086cb4a0c8, vec->count=2
May 19 23:47:02 neo01-el64 kernel: svcrdma: send_reply returns 0
May 19 23:47:02 neo01-el64 kernel: svc: server ffff88086409a000 waiting for data (to = 3600000)
May 19 23:47:02 neo01-el64 kernel: svc: transport ffff88087dfa2400 served by daemon ffff88086409a000
May 19 23:47:02 neo01-el64 kernel: svc: server ffff88086409a000, pool 0, transport ffff88087dfa2400, inuse=18
May 19 23:47:02 neo01-el64 kernel: svcrdma: rqstp=ffff88086409a000
May 19 23:47:02 neo01-el64 kernel: svcrdma: processing ctxt=ffff880866754540 on xprt=ffff88087dfa2400, rqstp=ffff88086409a000, status=0
May 19 23:47:02 neo01-el64 kernel: svcrdma: failed to post SQ WR rc=-22, sc_sq_count=0, sc_sq_depth=128
May 19 23:47:02 neo01-el64 kernel: svcrdma: Error -22 posting RDMA_READ
May 19 23:47:02 neo01-el64 kernel: svc: got len=0
May 19 23:47:02 neo01-el64 kernel: svc: transport ffff88087dfa2400 served by daemon ffff88086e782000
May 19 23:47:02 neo01-el64 kernel: svc: transport ffff88087dfa2400 busy, not enqueued
May 19 23:47:02 neo01-el64 kernel: svc: server ffff88086409a000 waiting for data (to = 3600000)
May 19 23:47:02 neo01-el64 kernel: svc_recv: found XPT_CLOSE
May 19 23:47:02 neo01-el64 kernel: svc: svc_delete_xprt(ffff88087dfa2400)
May 19 23:47:02 neo01-el64 kernel: svc: svc_rdma_detach(ffff88087dfa2400)
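
Not part of the original report, just a debugging note: when the post fails like
this, ib_post_send() hands back a bad_wr pointing at the first work request the
provider rejected, so dumping it shows whether the FASTREG or the chained
RDMA_READ was the one refused. A sketch, reusing the existing svcrdma names:

	struct ib_send_wr *bad_wr;
	int ret;

	ret = ib_post_send(xprt->sc_qp, wr, &bad_wr);
	if (ret)
		pr_err("svcrdma: post failed %d, bad wr opcode=%d num_sge=%d\n",
		       ret, bad_wr->opcode, bad_wr->num_sge);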

> -----Original Message-----
> From: [email protected] [mailto:linux-rdma-
> [email protected]] On Behalf Of Steve Wise
> Sent: Wednesday, May 07, 2014 2:39 AM
> To: J. Bruce Fields
> Cc: [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic
>
> On 5/6/2014 2:27 PM, J. Bruce Fields wrote:
> > On Tue, May 06, 2014 at 12:46:21PM -0500, Steve Wise wrote:
> >> This patch series refactors the NFSRDMA server marshalling logic to
> >> remove the intermediary map structures. It also fixes an existing
> >> bug where the NFSRDMA server was not minding the device fast register
> >> page list length limitations.
> >>
> >> I've also made a git repo available with these patches on top of 3.15-rc4:
> >>
> >> git://git.openfabrics.org/~swise/linux svcrdma-refactor
> >>
> >> Changes since V1:
> >>
> >> - fixed regression for devices that don't support FRMRs (see
> >> rdma_read_chunk_lcl())
> >>
> >> - split patch up for closer review. However I request it be squashed
> >> before merging as they is not bisectable, and I think these changes
> >> should all be a single commit anyway.
> > If it's not split up in a way that's bisectable, then yes, just don't
> > bother.
>
> I didn't see a good way to split it up, have it bisectable, and not have all the
> big stuff in one patch. I think it's a little more reviewable in these 3 patches,
> but when I post V3, I'll put it back as an uber patch.
> Hopefully folks can have a look at these 3 patches ignoring the bisect
> issue. Having said that, the rdma read logic really is better
> reviewed by looking at the code after applying the patches. That's why I
> published a git branch.
>
> Thanks!
>
> Steve.
>

2014-05-06 17:46:32

by Steve Wise

[permalink] [raw]
Subject: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes

From: Tom Tucker <[email protected]>

Based on device support, RDMA read target sgls are fast-registered,
or composed using the local dma lkey or a dma_mr lkey. A given NFS
Write chunk list will be split into a set of rdma reads based on the
limitations of the device.

Signed-off-by: Tom Tucker <[email protected]>
Signed-off-by: Steve Wise <[email protected]>
---

net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 633 +++++++++++++------------------
1 files changed, 259 insertions(+), 374 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 8d904e4..1c4c285 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -1,4 +1,5 @@
/*
+ * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
* Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
*
* This software is available to you under a choice of one of two
@@ -69,7 +70,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,

/* Set up the XDR head */
rqstp->rq_arg.head[0].iov_base = page_address(page);
- rqstp->rq_arg.head[0].iov_len = min(byte_count, ctxt->sge[0].length);
+ rqstp->rq_arg.head[0].iov_len =
+ min_t(size_t, byte_count, ctxt->sge[0].length);
rqstp->rq_arg.len = byte_count;
rqstp->rq_arg.buflen = byte_count;

@@ -85,7 +87,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
page = ctxt->pages[sge_no];
put_page(rqstp->rq_pages[sge_no]);
rqstp->rq_pages[sge_no] = page;
- bc -= min(bc, ctxt->sge[sge_no].length);
+ bc -= min_t(u32, bc, ctxt->sge[sge_no].length);
rqstp->rq_arg.buflen += ctxt->sge[sge_no].length;
sge_no++;
}
@@ -113,291 +115,249 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
rqstp->rq_arg.tail[0].iov_len = 0;
}

-/* Encode a read-chunk-list as an array of IB SGE
- *
- * Assumptions:
- * - chunk[0]->position points to pages[0] at an offset of 0
- * - pages[] is not physically or virtually contiguous and consists of
- * PAGE_SIZE elements.
- *
- * Output:
- * - sge array pointing into pages[] array.
- * - chunk_sge array specifying sge index and count for each
- * chunk in the read list
- *
- */
-static int map_read_chunks(struct svcxprt_rdma *xprt,
- struct svc_rqst *rqstp,
- struct svc_rdma_op_ctxt *head,
- struct rpcrdma_msg *rmsgp,
- struct svc_rdma_req_map *rpl_map,
- struct svc_rdma_req_map *chl_map,
- int ch_count,
- int byte_count)
+static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
{
- int sge_no;
- int sge_bytes;
- int page_off;
- int page_no;
- int ch_bytes;
- int ch_no;
- struct rpcrdma_read_chunk *ch;
+ if (rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
+ RDMA_TRANSPORT_IWARP)
+ return 1;
+ else
+ return min_t(int, sge_count, xprt->sc_max_sge);
+}

- sge_no = 0;
- page_no = 0;
- page_off = 0;
- ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
- ch_no = 0;
- ch_bytes = ntohl(ch->rc_target.rs_length);
- head->arg.head[0] = rqstp->rq_arg.head[0];
- head->arg.tail[0] = rqstp->rq_arg.tail[0];
- head->arg.pages = &head->pages[head->count];
- head->hdr_count = head->count; /* save count of hdr pages */
- head->arg.page_base = 0;
- head->arg.page_len = ch_bytes;
- head->arg.len = rqstp->rq_arg.len + ch_bytes;
- head->arg.buflen = rqstp->rq_arg.buflen + ch_bytes;
- head->count++;
- chl_map->ch[0].start = 0;
- while (byte_count) {
- rpl_map->sge[sge_no].iov_base =
- page_address(rqstp->rq_arg.pages[page_no]) + page_off;
- sge_bytes = min_t(int, PAGE_SIZE-page_off, ch_bytes);
- rpl_map->sge[sge_no].iov_len = sge_bytes;
- /*
- * Don't bump head->count here because the same page
- * may be used by multiple SGE.
- */
- head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
- rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
+typedef int (*rdma_reader_fn)(struct svcxprt_rdma *xprt,
+ struct svc_rqst *rqstp,
+ struct svc_rdma_op_ctxt *head,
+ int *page_no,
+ u32 *page_offset,
+ u32 rs_handle,
+ u32 rs_length,
+ u64 rs_offset,
+ int last);
+
+/* Issue an RDMA_READ using the local lkey to map the data sink */
+static int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
+ struct svc_rqst *rqstp,
+ struct svc_rdma_op_ctxt *head,
+ int *page_no,
+ u32 *page_offset,
+ u32 rs_handle,
+ u32 rs_length,
+ u64 rs_offset,
+ int last)
+{
+ struct ib_send_wr read_wr;
+ int pages_needed = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
+ struct svc_rdma_op_ctxt *ctxt = svc_rdma_get_context(xprt);
+ int ret, read, pno;
+ u32 pg_off = *page_offset;
+ u32 pg_no = *page_no;
+
+ ctxt->direction = DMA_FROM_DEVICE;
+ ctxt->read_hdr = head;
+ pages_needed =
+ min_t(int, pages_needed, rdma_read_max_sge(xprt, pages_needed));
+ read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
+
+ for (pno = 0; pno < pages_needed; pno++) {
+ int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
+
+ head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
+ head->arg.page_len += len;
+ head->arg.len += len;
+ head->count++;
+ rqstp->rq_respages = &rqstp->rq_arg.pages[pg_no+1];
rqstp->rq_next_page = rqstp->rq_respages + 1;
+ ctxt->sge[pno].addr =
+ ib_dma_map_page(xprt->sc_cm_id->device,
+ head->arg.pages[pg_no], pg_off,
+ PAGE_SIZE - pg_off,
+ DMA_FROM_DEVICE);
+ ret = ib_dma_mapping_error(xprt->sc_cm_id->device,
+ ctxt->sge[pno].addr);
+ if (ret)
+ goto err;
+ atomic_inc(&xprt->sc_dma_used);

- byte_count -= sge_bytes;
- ch_bytes -= sge_bytes;
- sge_no++;
- /*
- * If all bytes for this chunk have been mapped to an
- * SGE, move to the next SGE
- */
- if (ch_bytes == 0) {
- chl_map->ch[ch_no].count =
- sge_no - chl_map->ch[ch_no].start;
- ch_no++;
- ch++;
- chl_map->ch[ch_no].start = sge_no;
- ch_bytes = ntohl(ch->rc_target.rs_length);
- /* If bytes remaining account for next chunk */
- if (byte_count) {
- head->arg.page_len += ch_bytes;
- head->arg.len += ch_bytes;
- head->arg.buflen += ch_bytes;
- }
- }
- /*
- * If this SGE consumed all of the page, move to the
- * next page
- */
- if ((sge_bytes + page_off) == PAGE_SIZE) {
- page_no++;
- page_off = 0;
- /*
- * If there are still bytes left to map, bump
- * the page count
- */
- if (byte_count)
- head->count++;
- } else
- page_off += sge_bytes;
+ /* The lkey here is either a local dma lkey or a dma_mr lkey */
+ ctxt->sge[pno].lkey = xprt->sc_dma_lkey;
+ ctxt->sge[pno].length = len;
+ ctxt->count++;
+
+ pg_no++;
+ pg_off = 0;
+ rs_length -= len;
+ }
+
+ if (last && rs_length == 0)
+ set_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
+ else
+ clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
+
+ memset(&read_wr, 0, sizeof(read_wr));
+ read_wr.wr_id = (unsigned long)ctxt;
+ read_wr.opcode = IB_WR_RDMA_READ;
+ ctxt->wr_op = read_wr.opcode;
+ read_wr.send_flags = IB_SEND_SIGNALED;
+ read_wr.wr.rdma.rkey = rs_handle;
+ read_wr.wr.rdma.remote_addr = rs_offset;
+ read_wr.sg_list = ctxt->sge;
+ read_wr.num_sge = pages_needed;
+
+ ret = svc_rdma_send(xprt, &read_wr);
+ if (ret) {
+ pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
+ set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+ goto err;
}
- BUG_ON(byte_count != 0);
- return sge_no;
+ *page_no = pg_no;
+ *page_offset = pg_off;
+ ret = read;
+ atomic_inc(&rdma_stat_read);
+ return ret;
+ err:
+ svc_rdma_unmap_dma(ctxt);
+ svc_rdma_put_context(ctxt, 0);
+ return ret;
}

-/* Map a read-chunk-list to an XDR and fast register the page-list.
- *
- * Assumptions:
- * - chunk[0] position points to pages[0] at an offset of 0
- * - pages[] will be made physically contiguous by creating a one-off memory
- * region using the fastreg verb.
- * - byte_count is # of bytes in read-chunk-list
- * - ch_count is # of chunks in read-chunk-list
- *
- * Output:
- * - sge array pointing into pages[] array.
- * - chunk_sge array specifying sge index and count for each
- * chunk in the read list
- */
-static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
+/* Issue an RDMA_READ using an FRMR to map the data sink */
+static int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
struct svc_rqst *rqstp,
struct svc_rdma_op_ctxt *head,
- struct rpcrdma_msg *rmsgp,
- struct svc_rdma_req_map *rpl_map,
- struct svc_rdma_req_map *chl_map,
- int ch_count,
- int byte_count)
+ int *page_no,
+ u32 *page_offset,
+ u32 rs_handle,
+ u32 rs_length,
+ u64 rs_offset,
+ int last)
{
- int page_no;
- int ch_no;
- u32 offset;
- struct rpcrdma_read_chunk *ch;
- struct svc_rdma_fastreg_mr *frmr;
- int ret = 0;
+ struct ib_send_wr read_wr;
+ struct ib_send_wr inv_wr;
+ struct ib_send_wr fastreg_wr;
+ u8 key;
+ int pages_needed = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
+ struct svc_rdma_op_ctxt *ctxt = svc_rdma_get_context(xprt);
+ struct svc_rdma_fastreg_mr *frmr = svc_rdma_get_frmr(xprt);
+ int ret, read, pno;
+ u32 pg_off = *page_offset;
+ u32 pg_no = *page_no;

- frmr = svc_rdma_get_frmr(xprt);
if (IS_ERR(frmr))
return -ENOMEM;

- head->frmr = frmr;
- head->arg.head[0] = rqstp->rq_arg.head[0];
- head->arg.tail[0] = rqstp->rq_arg.tail[0];
- head->arg.pages = &head->pages[head->count];
- head->hdr_count = head->count; /* save count of hdr pages */
- head->arg.page_base = 0;
- head->arg.page_len = byte_count;
- head->arg.len = rqstp->rq_arg.len + byte_count;
- head->arg.buflen = rqstp->rq_arg.buflen + byte_count;
+ ctxt->direction = DMA_FROM_DEVICE;
+ ctxt->frmr = frmr;
+ pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
+ read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);

- /* Fast register the page list */
- frmr->kva = page_address(rqstp->rq_arg.pages[0]);
+ frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
frmr->direction = DMA_FROM_DEVICE;
frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
- frmr->map_len = byte_count;
- frmr->page_list_len = PAGE_ALIGN(byte_count) >> PAGE_SHIFT;
- for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
- frmr->page_list->page_list[page_no] =
+ frmr->map_len = pages_needed << PAGE_SHIFT;
+ frmr->page_list_len = pages_needed;
+
+ for (pno = 0; pno < pages_needed; pno++) {
+ int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
+
+ head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
+ head->arg.page_len += len;
+ head->arg.len += len;
+ head->count++;
+ rqstp->rq_respages = &rqstp->rq_arg.pages[pg_no+1];
+ rqstp->rq_next_page = rqstp->rq_respages + 1;
+ frmr->page_list->page_list[pno] =
ib_dma_map_page(xprt->sc_cm_id->device,
- rqstp->rq_arg.pages[page_no], 0,
+ head->arg.pages[pg_no], 0,
PAGE_SIZE, DMA_FROM_DEVICE);
- if (ib_dma_mapping_error(xprt->sc_cm_id->device,
- frmr->page_list->page_list[page_no]))
- goto fatal_err;
+ ret = ib_dma_mapping_error(xprt->sc_cm_id->device,
+ frmr->page_list->page_list[pno]);
+ if (ret)
+ goto err;
atomic_inc(&xprt->sc_dma_used);
- head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
+ pg_no++;
+ pg_off = 0;
}
- head->count += page_no;

- /* rq_respages points one past arg pages */
- rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
- rqstp->rq_next_page = rqstp->rq_respages + 1;
+ if (last && read == rs_length)
+ set_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
+ else
+ clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);

- /* Create the reply and chunk maps */
- offset = 0;
- ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
- for (ch_no = 0; ch_no < ch_count; ch_no++) {
- int len = ntohl(ch->rc_target.rs_length);
- rpl_map->sge[ch_no].iov_base = frmr->kva + offset;
- rpl_map->sge[ch_no].iov_len = len;
- chl_map->ch[ch_no].count = 1;
- chl_map->ch[ch_no].start = ch_no;
- offset += len;
- ch++;
+ /* Bump the key */
+ key = (u8)(frmr->mr->lkey & 0x000000FF);
+ ib_update_fast_reg_key(frmr->mr, ++key);
+
+ ctxt->sge[0].addr = (unsigned long)frmr->kva;
+ ctxt->sge[0].lkey = frmr->mr->lkey;
+ ctxt->sge[0].length = read;
+ ctxt->count = 1;
+ ctxt->read_hdr = head;
+
+ /* Prepare FASTREG WR */
+ memset(&fastreg_wr, 0, sizeof(fastreg_wr));
+ fastreg_wr.opcode = IB_WR_FAST_REG_MR;
+ fastreg_wr.send_flags = IB_SEND_SIGNALED;
+ fastreg_wr.wr.fast_reg.iova_start = (unsigned long)frmr->kva;
+ fastreg_wr.wr.fast_reg.page_list = frmr->page_list;
+ fastreg_wr.wr.fast_reg.page_list_len = frmr->page_list_len;
+ fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
+ fastreg_wr.wr.fast_reg.length = frmr->map_len;
+ fastreg_wr.wr.fast_reg.access_flags = frmr->access_flags;
+ fastreg_wr.wr.fast_reg.rkey = frmr->mr->lkey;
+ fastreg_wr.next = &read_wr;
+
+ /* Prepare RDMA_READ */
+ memset(&read_wr, 0, sizeof(read_wr));
+ read_wr.send_flags = IB_SEND_SIGNALED;
+ read_wr.wr.rdma.rkey = rs_handle;
+ read_wr.wr.rdma.remote_addr = rs_offset;
+ read_wr.sg_list = ctxt->sge;
+ read_wr.num_sge = 1;
+ if (xprt->sc_dev_caps & SVCRDMA_DEVCAP_READ_W_INV) {
+ read_wr.opcode = IB_WR_RDMA_READ_WITH_INV;
+ read_wr.wr_id = (unsigned long)ctxt;
+ read_wr.ex.invalidate_rkey = ctxt->frmr->mr->lkey;
+ } else {
+ read_wr.opcode = IB_WR_RDMA_READ;
+ read_wr.next = &inv_wr;
+ /* Prepare invalidate */
+ memset(&inv_wr, 0, sizeof(inv_wr));
+ inv_wr.wr_id = (unsigned long)ctxt;
+ inv_wr.opcode = IB_WR_LOCAL_INV;
+ inv_wr.send_flags = IB_SEND_SIGNALED;
+ inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
}
-
- ret = svc_rdma_fastreg(xprt, frmr);
- if (ret)
- goto fatal_err;
-
- return ch_no;
-
- fatal_err:
- printk("svcrdma: error fast registering xdr for xprt %p", xprt);
- svc_rdma_put_frmr(xprt, frmr);
- return -EIO;
-}
-
-static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
- struct svc_rdma_op_ctxt *ctxt,
- struct svc_rdma_fastreg_mr *frmr,
- struct kvec *vec,
- u64 *sgl_offset,
- int count)
-{
- int i;
- unsigned long off;
-
- ctxt->count = count;
- ctxt->direction = DMA_FROM_DEVICE;
- for (i = 0; i < count; i++) {
- ctxt->sge[i].length = 0; /* in case map fails */
- if (!frmr) {
- BUG_ON(!virt_to_page(vec[i].iov_base));
- off = (unsigned long)vec[i].iov_base & ~PAGE_MASK;
- ctxt->sge[i].addr =
- ib_dma_map_page(xprt->sc_cm_id->device,
- virt_to_page(vec[i].iov_base),
- off,
- vec[i].iov_len,
- DMA_FROM_DEVICE);
- if (ib_dma_mapping_error(xprt->sc_cm_id->device,
- ctxt->sge[i].addr))
- return -EINVAL;
- ctxt->sge[i].lkey = xprt->sc_dma_lkey;
- atomic_inc(&xprt->sc_dma_used);
- } else {
- ctxt->sge[i].addr = (unsigned long)vec[i].iov_base;
- ctxt->sge[i].lkey = frmr->mr->lkey;
- }
- ctxt->sge[i].length = vec[i].iov_len;
- *sgl_offset = *sgl_offset + vec[i].iov_len;
+ ctxt->wr_op = read_wr.opcode;
+
+ /* Post the chain */
+ ret = svc_rdma_send(xprt, &fastreg_wr);
+ if (ret) {
+ pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
+ set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+ goto err;
}
- return 0;
-}
-
-static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
-{
- if ((rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
- RDMA_TRANSPORT_IWARP) &&
- sge_count > 1)
- return 1;
- else
- return min_t(int, sge_count, xprt->sc_max_sge);
+ *page_no = pg_no;
+ *page_offset = pg_off;
+ ret = read;
+ atomic_inc(&rdma_stat_read);
+ return ret;
+ err:
+ svc_rdma_unmap_dma(ctxt);
+ svc_rdma_put_context(ctxt, 0);
+ svc_rdma_put_frmr(xprt, frmr);
+ return ret;
}

-/*
- * Use RDMA_READ to read data from the advertised client buffer into the
- * XDR stream starting at rq_arg.head[0].iov_base.
- * Each chunk in the array
- * contains the following fields:
- * discrim - '1', This isn't used for data placement
- * position - The xdr stream offset (the same for every chunk)
- * handle - RMR for client memory region
- * length - data transfer length
- * offset - 64 bit tagged offset in remote memory region
- *
- * On our side, we need to read into a pagelist. The first page immediately
- * follows the RPC header.
- *
- * This function returns:
- * 0 - No error and no read-list found.
- *
- * 1 - Successful read-list processing. The data is not yet in
- * the pagelist and therefore the RPC request must be deferred. The
- * I/O completion will enqueue the transport again and
- * svc_rdma_recvfrom will complete the request.
- *
- * <0 - Error processing/posting read-list.
- *
- * NOTE: The ctxt must not be touched after the last WR has been posted
- * because the I/O completion processing may occur on another
- * processor and free / modify the context. Ne touche pas!
- */
-static int rdma_read_xdr(struct svcxprt_rdma *xprt,
- struct rpcrdma_msg *rmsgp,
- struct svc_rqst *rqstp,
- struct svc_rdma_op_ctxt *hdr_ctxt)
+static int rdma_read_chunks(struct svcxprt_rdma *xprt,
+ struct rpcrdma_msg *rmsgp,
+ struct svc_rqst *rqstp,
+ struct svc_rdma_op_ctxt *head)
{
- struct ib_send_wr read_wr;
- struct ib_send_wr inv_wr;
- int err = 0;
- int ch_no;
- int ch_count;
- int byte_count;
- int sge_count;
- u64 sgl_offset;
+ int page_no, ch_count, ret;
struct rpcrdma_read_chunk *ch;
- struct svc_rdma_op_ctxt *ctxt = NULL;
- struct svc_rdma_req_map *rpl_map;
- struct svc_rdma_req_map *chl_map;
+ u32 page_offset, byte_count;
+ u64 rs_offset;
+ rdma_reader_fn reader;

/* If no read list is present, return 0 */
ch = svc_rdma_get_read_chunk(rmsgp);
@@ -408,122 +368,53 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
if (ch_count > RPCSVC_MAXPAGES)
return -EINVAL;

- /* Allocate temporary reply and chunk maps */
- rpl_map = svc_rdma_get_req_map();
- chl_map = svc_rdma_get_req_map();
-
- if (!xprt->sc_frmr_pg_list_len)
- sge_count = map_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
- rpl_map, chl_map, ch_count,
- byte_count);
- else
- sge_count = fast_reg_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
- rpl_map, chl_map, ch_count,
- byte_count);
- if (sge_count < 0) {
- err = -EIO;
- goto out;
- }
-
- sgl_offset = 0;
- ch_no = 0;
+ /* The request is completed when the RDMA_READs complete. The
+ * head context keeps all the pages that comprise the
+ * request.
+ */
+ head->arg.head[0] = rqstp->rq_arg.head[0];
+ head->arg.tail[0] = rqstp->rq_arg.tail[0];
+ head->arg.pages = &head->pages[head->count];
+ head->hdr_count = head->count;
+ head->arg.page_base = 0;
+ head->arg.page_len = 0;
+ head->arg.len = rqstp->rq_arg.len;
+ head->arg.buflen = rqstp->rq_arg.buflen + PAGE_SIZE;

+ page_no = 0; page_offset = 0;
for (ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
- ch->rc_discrim != 0; ch++, ch_no++) {
- u64 rs_offset;
-next_sge:
- ctxt = svc_rdma_get_context(xprt);
- ctxt->direction = DMA_FROM_DEVICE;
- ctxt->frmr = hdr_ctxt->frmr;
- ctxt->read_hdr = NULL;
- clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
- clear_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
+ ch->rc_discrim != 0; ch++) {

- /* Prepare READ WR */
- memset(&read_wr, 0, sizeof read_wr);
- read_wr.wr_id = (unsigned long)ctxt;
- read_wr.opcode = IB_WR_RDMA_READ;
- ctxt->wr_op = read_wr.opcode;
- read_wr.send_flags = IB_SEND_SIGNALED;
- read_wr.wr.rdma.rkey = ntohl(ch->rc_target.rs_handle);
xdr_decode_hyper((__be32 *)&ch->rc_target.rs_offset,
&rs_offset);
- read_wr.wr.rdma.remote_addr = rs_offset + sgl_offset;
- read_wr.sg_list = ctxt->sge;
- read_wr.num_sge =
- rdma_read_max_sge(xprt, chl_map->ch[ch_no].count);
- err = rdma_set_ctxt_sge(xprt, ctxt, hdr_ctxt->frmr,
- &rpl_map->sge[chl_map->ch[ch_no].start],
- &sgl_offset,
- read_wr.num_sge);
- if (err) {
- svc_rdma_unmap_dma(ctxt);
- svc_rdma_put_context(ctxt, 0);
- goto out;
- }
- if (((ch+1)->rc_discrim == 0) &&
- (read_wr.num_sge == chl_map->ch[ch_no].count)) {
- /*
- * Mark the last RDMA_READ with a bit to
- * indicate all RPC data has been fetched from
- * the client and the RPC needs to be enqueued.
- */
- set_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
- if (hdr_ctxt->frmr) {
- set_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
- /*
- * Invalidate the local MR used to map the data
- * sink.
- */
- if (xprt->sc_dev_caps &
- SVCRDMA_DEVCAP_READ_W_INV) {
- read_wr.opcode =
- IB_WR_RDMA_READ_WITH_INV;
- ctxt->wr_op = read_wr.opcode;
- read_wr.ex.invalidate_rkey =
- ctxt->frmr->mr->lkey;
- } else {
- /* Prepare INVALIDATE WR */
- memset(&inv_wr, 0, sizeof inv_wr);
- inv_wr.opcode = IB_WR_LOCAL_INV;
- inv_wr.send_flags = IB_SEND_SIGNALED;
- inv_wr.ex.invalidate_rkey =
- hdr_ctxt->frmr->mr->lkey;
- read_wr.next = &inv_wr;
- }
- }
- ctxt->read_hdr = hdr_ctxt;
- }
- /* Post the read */
- err = svc_rdma_send(xprt, &read_wr);
- if (err) {
- printk(KERN_ERR "svcrdma: Error %d posting RDMA_READ\n",
- err);
- set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
- svc_rdma_unmap_dma(ctxt);
- svc_rdma_put_context(ctxt, 0);
- goto out;
+ byte_count = ntohl(ch->rc_target.rs_length);
+
+ /* Use FRMR if supported */
+ if (xprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)
+ reader = rdma_read_chunk_frmr;
+ else
+ reader = rdma_read_chunk_lcl;
+ while (byte_count > 0) {
+ ret = reader(xprt, rqstp, head,
+ &page_no, &page_offset,
+ ntohl(ch->rc_target.rs_handle),
+ byte_count, rs_offset,
+ ((ch+1)->rc_discrim == 0) /* last */
+ );
+ if (ret < 0)
+ goto err;
+ byte_count -= ret;
+ rs_offset += ret;
}
- atomic_inc(&rdma_stat_read);
-
- if (read_wr.num_sge < chl_map->ch[ch_no].count) {
- chl_map->ch[ch_no].count -= read_wr.num_sge;
- chl_map->ch[ch_no].start += read_wr.num_sge;
- goto next_sge;
- }
- sgl_offset = 0;
- err = 1;
}
-
- out:
- svc_rdma_put_req_map(rpl_map);
- svc_rdma_put_req_map(chl_map);
-
+ ret = 1;
+ err:
/* Detach arg pages. svc_recv will replenish them */
- for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
- rqstp->rq_pages[ch_no] = NULL;
+ for (page_no = 0;
+ &rqstp->rq_pages[page_no] < rqstp->rq_respages; page_no++)
+ rqstp->rq_pages[page_no] = NULL;

- return err;
+ return ret;
}

static int rdma_read_complete(struct svc_rqst *rqstp,
@@ -595,13 +486,9 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
struct svc_rdma_op_ctxt,
dto_q);
list_del_init(&ctxt->dto_q);
- }
- if (ctxt) {
spin_unlock_bh(&rdma_xprt->sc_rq_dto_lock);
return rdma_read_complete(rqstp, ctxt);
- }
-
- if (!list_empty(&rdma_xprt->sc_rq_dto_q)) {
+ } else if (!list_empty(&rdma_xprt->sc_rq_dto_q)) {
ctxt = list_entry(rdma_xprt->sc_rq_dto_q.next,
struct svc_rdma_op_ctxt,
dto_q);
@@ -621,7 +508,6 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
if (test_bit(XPT_CLOSE, &xprt->xpt_flags))
goto close_out;

- BUG_ON(ret);
goto out;
}
dprintk("svcrdma: processing ctxt=%p on xprt=%p, rqstp=%p, status=%d\n",
@@ -644,12 +530,11 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
}

/* Read read-list data. */
- ret = rdma_read_xdr(rdma_xprt, rmsgp, rqstp, ctxt);
+ ret = rdma_read_chunks(rdma_xprt, rmsgp, rqstp, ctxt);
if (ret > 0) {
/* read-list posted, defer until data received from client. */
goto defer;
- }
- if (ret < 0) {
+ } else if (ret < 0) {
/* Post of read-list failed, free context. */
svc_rdma_put_context(ctxt, 1);
return 0;


2014-05-13 18:22:15

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 2/3] svcrdma: Recvfrom changes

Hi Steve-

Some random review comments, see below.


On May 6, 2014, at 1:46 PM, Steve Wise <[email protected]> wrote:

> From: Tom Tucker <[email protected]>
>
> Based on device support, RDMA read target sgls are fast-registered,
> or composed using the local dma lkey or a dma_mr lkey. A given NFS
> Write chunk list will be split into a set of rdma reads based on the
> limitations of the device.
>
> Signed-off-by: Tom Tucker <[email protected]>
> Signed-off-by: Steve Wise <[email protected]>
> ---
>
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 633 +++++++++++++------------------
> 1 files changed, 259 insertions(+), 374 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 8d904e4..1c4c285 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -1,4 +1,5 @@
> /*
> + * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
> * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -69,7 +70,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>
> /* Set up the XDR head */
> rqstp->rq_arg.head[0].iov_base = page_address(page);
> - rqstp->rq_arg.head[0].iov_len = min(byte_count, ctxt->sge[0].length);
> + rqstp->rq_arg.head[0].iov_len =
> + min_t(size_t, byte_count, ctxt->sge[0].length);
> rqstp->rq_arg.len = byte_count;
> rqstp->rq_arg.buflen = byte_count;
>
> @@ -85,7 +87,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
> page = ctxt->pages[sge_no];
> put_page(rqstp->rq_pages[sge_no]);
> rqstp->rq_pages[sge_no] = page;
> - bc -= min(bc, ctxt->sge[sge_no].length);
> + bc -= min_t(u32, bc, ctxt->sge[sge_no].length);
> rqstp->rq_arg.buflen += ctxt->sge[sge_no].length;
> sge_no++;
> }
> @@ -113,291 +115,249 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
> rqstp->rq_arg.tail[0].iov_len = 0;
> }
>
> -/* Encode a read-chunk-list as an array of IB SGE
> - *
> - * Assumptions:
> - * - chunk[0]->position points to pages[0] at an offset of 0
> - * - pages[] is not physically or virtually contiguous and consists of
> - * PAGE_SIZE elements.
> - *
> - * Output:
> - * - sge array pointing into pages[] array.
> - * - chunk_sge array specifying sge index and count for each
> - * chunk in the read list
> - *
> - */
> -static int map_read_chunks(struct svcxprt_rdma *xprt,
> - struct svc_rqst *rqstp,
> - struct svc_rdma_op_ctxt *head,
> - struct rpcrdma_msg *rmsgp,
> - struct svc_rdma_req_map *rpl_map,
> - struct svc_rdma_req_map *chl_map,
> - int ch_count,
> - int byte_count)
> +static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
> {
> - int sge_no;
> - int sge_bytes;
> - int page_off;
> - int page_no;
> - int ch_bytes;
> - int ch_no;
> - struct rpcrdma_read_chunk *ch;
> + if (rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
> + RDMA_TRANSPORT_IWARP)
> + return 1;
> + else
> + return min_t(int, sge_count, xprt->sc_max_sge);
> +}
>
> - sge_no = 0;
> - page_no = 0;
> - page_off = 0;
> - ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
> - ch_no = 0;
> - ch_bytes = ntohl(ch->rc_target.rs_length);
> - head->arg.head[0] = rqstp->rq_arg.head[0];
> - head->arg.tail[0] = rqstp->rq_arg.tail[0];
> - head->arg.pages = &head->pages[head->count];
> - head->hdr_count = head->count; /* save count of hdr pages */
> - head->arg.page_base = 0;
> - head->arg.page_len = ch_bytes;
> - head->arg.len = rqstp->rq_arg.len + ch_bytes;
> - head->arg.buflen = rqstp->rq_arg.buflen + ch_bytes;
> - head->count++;
> - chl_map->ch[0].start = 0;
> - while (byte_count) {
> - rpl_map->sge[sge_no].iov_base =
> - page_address(rqstp->rq_arg.pages[page_no]) + page_off;
> - sge_bytes = min_t(int, PAGE_SIZE-page_off, ch_bytes);
> - rpl_map->sge[sge_no].iov_len = sge_bytes;
> - /*
> - * Don't bump head->count here because the same page
> - * may be used by multiple SGE.
> - */
> - head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
> - rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
> +typedef int (*rdma_reader_fn)(struct svcxprt_rdma *xprt,
> + struct svc_rqst *rqstp,
> + struct svc_rdma_op_ctxt *head,
> + int *page_no,
> + u32 *page_offset,
> + u32 rs_handle,
> + u32 rs_length,
> + u64 rs_offset,
> + int last);
> +
> +/* Issue an RDMA_READ using the local lkey to map the data sink */
> +static int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> + struct svc_rqst *rqstp,
> + struct svc_rdma_op_ctxt *head,
> + int *page_no,
> + u32 *page_offset,
> + u32 rs_handle,
> + u32 rs_length,
> + u64 rs_offset,
> + int last)
> +{
> + struct ib_send_wr read_wr;
> + int pages_needed = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
> + struct svc_rdma_op_ctxt *ctxt = svc_rdma_get_context(xprt);
> + int ret, read, pno;
> + u32 pg_off = *page_offset;
> + u32 pg_no = *page_no;
> +
> + ctxt->direction = DMA_FROM_DEVICE;
> + ctxt->read_hdr = head;
> + pages_needed =
> + min_t(int, pages_needed, rdma_read_max_sge(xprt, pages_needed));
> + read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> +
> + for (pno = 0; pno < pages_needed; pno++) {
> + int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
> +
> + head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
> + head->arg.page_len += len;
> + head->arg.len += len;
> + head->count++;
> + rqstp->rq_respages = &rqstp->rq_arg.pages[pg_no+1];
> rqstp->rq_next_page = rqstp->rq_respages + 1;
> + ctxt->sge[pno].addr =
> + ib_dma_map_page(xprt->sc_cm_id->device,
> + head->arg.pages[pg_no], pg_off,
> + PAGE_SIZE - pg_off,
> + DMA_FROM_DEVICE);
> + ret = ib_dma_mapping_error(xprt->sc_cm_id->device,
> + ctxt->sge[pno].addr);
> + if (ret)
> + goto err;
> + atomic_inc(&xprt->sc_dma_used);
>
> - byte_count -= sge_bytes;
> - ch_bytes -= sge_bytes;
> - sge_no++;
> - /*
> - * If all bytes for this chunk have been mapped to an
> - * SGE, move to the next SGE
> - */
> - if (ch_bytes == 0) {
> - chl_map->ch[ch_no].count =
> - sge_no - chl_map->ch[ch_no].start;
> - ch_no++;
> - ch++;
> - chl_map->ch[ch_no].start = sge_no;
> - ch_bytes = ntohl(ch->rc_target.rs_length);
> - /* If bytes remaining account for next chunk */
> - if (byte_count) {
> - head->arg.page_len += ch_bytes;
> - head->arg.len += ch_bytes;
> - head->arg.buflen += ch_bytes;
> - }
> - }
> - /*
> - * If this SGE consumed all of the page, move to the
> - * next page
> - */
> - if ((sge_bytes + page_off) == PAGE_SIZE) {
> - page_no++;
> - page_off = 0;
> - /*
> - * If there are still bytes left to map, bump
> - * the page count
> - */
> - if (byte_count)
> - head->count++;
> - } else
> - page_off += sge_bytes;
> + /* The lkey here is either a local dma lkey or a dma_mr lkey */
> + ctxt->sge[pno].lkey = xprt->sc_dma_lkey;
> + ctxt->sge[pno].length = len;
> + ctxt->count++;
> +
> + pg_no++;
> + pg_off = 0;
> + rs_length -= len;
> + }
> +
> + if (last && rs_length == 0)
> + set_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
> + else
> + clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
> +
> + memset(&read_wr, 0, sizeof(read_wr));
> + read_wr.wr_id = (unsigned long)ctxt;
> + read_wr.opcode = IB_WR_RDMA_READ;
> + ctxt->wr_op = read_wr.opcode;
> + read_wr.send_flags = IB_SEND_SIGNALED;
> + read_wr.wr.rdma.rkey = rs_handle;
> + read_wr.wr.rdma.remote_addr = rs_offset;
> + read_wr.sg_list = ctxt->sge;
> + read_wr.num_sge = pages_needed;
> +
> + ret = svc_rdma_send(xprt, &read_wr);
> + if (ret) {
> + pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
> + set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> + goto err;
> }
> - BUG_ON(byte_count != 0);
> - return sge_no;
> + *page_no = pg_no;
> + *page_offset = pg_off;
> + ret = read;
> + atomic_inc(&rdma_stat_read);
> + return ret;
> + err:
> + svc_rdma_unmap_dma(ctxt);
> + svc_rdma_put_context(ctxt, 0);
> + return ret;
> }
>
> -/* Map a read-chunk-list to an XDR and fast register the page-list.
> - *
> - * Assumptions:
> - * - chunk[0] position points to pages[0] at an offset of 0
> - * - pages[] will be made physically contiguous by creating a one-off memory
> - * region using the fastreg verb.
> - * - byte_count is # of bytes in read-chunk-list
> - * - ch_count is # of chunks in read-chunk-list
> - *
> - * Output:
> - * - sge array pointing into pages[] array.
> - * - chunk_sge array specifying sge index and count for each
> - * chunk in the read list
> - */
> -static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
> +/* Issue an RDMA_READ using an FRMR to map the data sink */
> +static int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> struct svc_rqst *rqstp,
> struct svc_rdma_op_ctxt *head,
> - struct rpcrdma_msg *rmsgp,
> - struct svc_rdma_req_map *rpl_map,
> - struct svc_rdma_req_map *chl_map,
> - int ch_count,
> - int byte_count)
> + int *page_no,
> + u32 *page_offset,
> + u32 rs_handle,
> + u32 rs_length,
> + u64 rs_offset,
> + int last)
> {
> - int page_no;
> - int ch_no;
> - u32 offset;
> - struct rpcrdma_read_chunk *ch;
> - struct svc_rdma_fastreg_mr *frmr;
> - int ret = 0;
> + struct ib_send_wr read_wr;
> + struct ib_send_wr inv_wr;
> + struct ib_send_wr fastreg_wr;
> + u8 key;
> + int pages_needed = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
> + struct svc_rdma_op_ctxt *ctxt = svc_rdma_get_context(xprt);
> + struct svc_rdma_fastreg_mr *frmr = svc_rdma_get_frmr(xprt);
> + int ret, read, pno;
> + u32 pg_off = *page_offset;
> + u32 pg_no = *page_no;
>
> - frmr = svc_rdma_get_frmr(xprt);
> if (IS_ERR(frmr))
> return -ENOMEM;
>
> - head->frmr = frmr;
> - head->arg.head[0] = rqstp->rq_arg.head[0];
> - head->arg.tail[0] = rqstp->rq_arg.tail[0];
> - head->arg.pages = &head->pages[head->count];
> - head->hdr_count = head->count; /* save count of hdr pages */
> - head->arg.page_base = 0;
> - head->arg.page_len = byte_count;
> - head->arg.len = rqstp->rq_arg.len + byte_count;
> - head->arg.buflen = rqstp->rq_arg.buflen + byte_count;
> + ctxt->direction = DMA_FROM_DEVICE;
> + ctxt->frmr = frmr;
> + pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
> + read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
>
> - /* Fast register the page list */
> - frmr->kva = page_address(rqstp->rq_arg.pages[0]);
> + frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
> frmr->direction = DMA_FROM_DEVICE;
> frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
> - frmr->map_len = byte_count;
> - frmr->page_list_len = PAGE_ALIGN(byte_count) >> PAGE_SHIFT;
> - for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
> - frmr->page_list->page_list[page_no] =
> + frmr->map_len = pages_needed << PAGE_SHIFT;
> + frmr->page_list_len = pages_needed;
> +
> + for (pno = 0; pno < pages_needed; pno++) {
> + int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
> +
> + head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
> + head->arg.page_len += len;
> + head->arg.len += len;
> + head->count++;
> + rqstp->rq_respages = &rqstp->rq_arg.pages[pg_no+1];
> + rqstp->rq_next_page = rqstp->rq_respages + 1;
> + frmr->page_list->page_list[pno] =
> ib_dma_map_page(xprt->sc_cm_id->device,
> - rqstp->rq_arg.pages[page_no], 0,
> + head->arg.pages[pg_no], 0,
> PAGE_SIZE, DMA_FROM_DEVICE);
> - if (ib_dma_mapping_error(xprt->sc_cm_id->device,
> - frmr->page_list->page_list[page_no]))
> - goto fatal_err;
> + ret = ib_dma_mapping_error(xprt->sc_cm_id->device,
> + frmr->page_list->page_list[pno]);
> + if (ret)
> + goto err;
> atomic_inc(&xprt->sc_dma_used);
> - head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
> + pg_no++;
> + pg_off = 0;
> }
> - head->count += page_no;
>
> - /* rq_respages points one past arg pages */
> - rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> - rqstp->rq_next_page = rqstp->rq_respages + 1;
> + if (last && read == rs_length)
> + set_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
> + else
> + clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
>
> - /* Create the reply and chunk maps */
> - offset = 0;
> - ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
> - for (ch_no = 0; ch_no < ch_count; ch_no++) {
> - int len = ntohl(ch->rc_target.rs_length);
> - rpl_map->sge[ch_no].iov_base = frmr->kva + offset;
> - rpl_map->sge[ch_no].iov_len = len;
> - chl_map->ch[ch_no].count = 1;
> - chl_map->ch[ch_no].start = ch_no;
> - offset += len;
> - ch++;
> + /* Bump the key */
> + key = (u8)(frmr->mr->lkey & 0x000000FF);
> + ib_update_fast_reg_key(frmr->mr, ++key);
> +
> + ctxt->sge[0].addr = (unsigned long)frmr->kva;
> + ctxt->sge[0].lkey = frmr->mr->lkey;
> + ctxt->sge[0].length = read;
> + ctxt->count = 1;
> + ctxt->read_hdr = head;
> +
> + /* Prepare FASTREG WR */
> + memset(&fastreg_wr, 0, sizeof(fastreg_wr));
> + fastreg_wr.opcode = IB_WR_FAST_REG_MR;
> + fastreg_wr.send_flags = IB_SEND_SIGNALED;
> + fastreg_wr.wr.fast_reg.iova_start = (unsigned long)frmr->kva;
> + fastreg_wr.wr.fast_reg.page_list = frmr->page_list;
> + fastreg_wr.wr.fast_reg.page_list_len = frmr->page_list_len;
> + fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
> + fastreg_wr.wr.fast_reg.length = frmr->map_len;
> + fastreg_wr.wr.fast_reg.access_flags = frmr->access_flags;
> + fastreg_wr.wr.fast_reg.rkey = frmr->mr->lkey;
> + fastreg_wr.next = &read_wr;
> +
> + /* Prepare RDMA_READ */
> + memset(&read_wr, 0, sizeof(read_wr));
> + read_wr.send_flags = IB_SEND_SIGNALED;
> + read_wr.wr.rdma.rkey = rs_handle;
> + read_wr.wr.rdma.remote_addr = rs_offset;
> + read_wr.sg_list = ctxt->sge;
> + read_wr.num_sge = 1;
> + if (xprt->sc_dev_caps & SVCRDMA_DEVCAP_READ_W_INV) {
> + read_wr.opcode = IB_WR_RDMA_READ_WITH_INV;
> + read_wr.wr_id = (unsigned long)ctxt;
> + read_wr.ex.invalidate_rkey = ctxt->frmr->mr->lkey;
> + } else {
> + read_wr.opcode = IB_WR_RDMA_READ;
> + read_wr.next = &inv_wr;
> + /* Prepare invalidate */
> + memset(&inv_wr, 0, sizeof(inv_wr));
> + inv_wr.wr_id = (unsigned long)ctxt;
> + inv_wr.opcode = IB_WR_LOCAL_INV;
> + inv_wr.send_flags = IB_SEND_SIGNALED;
> + inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
> }
> -
> - ret = svc_rdma_fastreg(xprt, frmr);
> - if (ret)
> - goto fatal_err;
> -
> - return ch_no;
> -
> - fatal_err:
> - printk("svcrdma: error fast registering xdr for xprt %p", xprt);
> - svc_rdma_put_frmr(xprt, frmr);
> - return -EIO;
> -}
> -
> -static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
> - struct svc_rdma_op_ctxt *ctxt,
> - struct svc_rdma_fastreg_mr *frmr,
> - struct kvec *vec,
> - u64 *sgl_offset,
> - int count)
> -{
> - int i;
> - unsigned long off;
> -
> - ctxt->count = count;
> - ctxt->direction = DMA_FROM_DEVICE;
> - for (i = 0; i < count; i++) {
> - ctxt->sge[i].length = 0; /* in case map fails */
> - if (!frmr) {
> - BUG_ON(!virt_to_page(vec[i].iov_base));
> - off = (unsigned long)vec[i].iov_base & ~PAGE_MASK;
> - ctxt->sge[i].addr =
> - ib_dma_map_page(xprt->sc_cm_id->device,
> - virt_to_page(vec[i].iov_base),
> - off,
> - vec[i].iov_len,
> - DMA_FROM_DEVICE);
> - if (ib_dma_mapping_error(xprt->sc_cm_id->device,
> - ctxt->sge[i].addr))
> - return -EINVAL;
> - ctxt->sge[i].lkey = xprt->sc_dma_lkey;
> - atomic_inc(&xprt->sc_dma_used);
> - } else {
> - ctxt->sge[i].addr = (unsigned long)vec[i].iov_base;
> - ctxt->sge[i].lkey = frmr->mr->lkey;
> - }
> - ctxt->sge[i].length = vec[i].iov_len;
> - *sgl_offset = *sgl_offset + vec[i].iov_len;
> + ctxt->wr_op = read_wr.opcode;
> +
> + /* Post the chain */
> + ret = svc_rdma_send(xprt, &fastreg_wr);
> + if (ret) {
> + pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
> + set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> + goto err;
> }
> - return 0;
> -}
> -
> -static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
> -{
> - if ((rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
> - RDMA_TRANSPORT_IWARP) &&
> - sge_count > 1)
> - return 1;
> - else
> - return min_t(int, sge_count, xprt->sc_max_sge);
> + *page_no = pg_no;
> + *page_offset = pg_off;
> + ret = read;
> + atomic_inc(&rdma_stat_read);
> + return ret;
> + err:
> + svc_rdma_unmap_dma(ctxt);
> + svc_rdma_put_context(ctxt, 0);
> + svc_rdma_put_frmr(xprt, frmr);
> + return ret;
> }
>
> -/*
> - * Use RDMA_READ to read data from the advertised client buffer into the
> - * XDR stream starting at rq_arg.head[0].iov_base.
> - * Each chunk in the array
> - * contains the following fields:
> - * discrim - '1', This isn't used for data placement
> - * position - The xdr stream offset (the same for every chunk)
> - * handle - RMR for client memory region
> - * length - data transfer length
> - * offset - 64 bit tagged offset in remote memory region
> - *
> - * On our side, we need to read into a pagelist. The first page immediately
> - * follows the RPC header.
> - *
> - * This function returns:
> - * 0 - No error and no read-list found.
> - *
> - * 1 - Successful read-list processing. The data is not yet in
> - * the pagelist and therefore the RPC request must be deferred. The
> - * I/O completion will enqueue the transport again and
> - * svc_rdma_recvfrom will complete the request.
> - *
> - * <0 - Error processing/posting read-list.
> - *
> - * NOTE: The ctxt must not be touched after the last WR has been posted
> - * because the I/O completion processing may occur on another
> - * processor and free / modify the context. Ne touche pas!
> - */
> -static int rdma_read_xdr(struct svcxprt_rdma *xprt,
> - struct rpcrdma_msg *rmsgp,
> - struct svc_rqst *rqstp,
> - struct svc_rdma_op_ctxt *hdr_ctxt)
> +static int rdma_read_chunks(struct svcxprt_rdma *xprt,
> + struct rpcrdma_msg *rmsgp,
> + struct svc_rqst *rqstp,
> + struct svc_rdma_op_ctxt *head)
> {
> - struct ib_send_wr read_wr;
> - struct ib_send_wr inv_wr;
> - int err = 0;
> - int ch_no;
> - int ch_count;
> - int byte_count;
> - int sge_count;
> - u64 sgl_offset;
> + int page_no, ch_count, ret;
> struct rpcrdma_read_chunk *ch;
> - struct svc_rdma_op_ctxt *ctxt = NULL;
> - struct svc_rdma_req_map *rpl_map;
> - struct svc_rdma_req_map *chl_map;
> + u32 page_offset, byte_count;
> + u64 rs_offset;
> + rdma_reader_fn reader;
>
> /* If no read list is present, return 0 */
> ch = svc_rdma_get_read_chunk(rmsgp);
> @@ -408,122 +368,53 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
> if (ch_count > RPCSVC_MAXPAGES)
> return -EINVAL;

I'm not sure what this ^^^ ch_count check is doing, but maybe I haven't had
enough caffeine this morning. Shouldn't this code be comparing the segment
count, not the chunk count, with MAXPAGES?


>
> - /* Allocate temporary reply and chunk maps */
> - rpl_map = svc_rdma_get_req_map();
> - chl_map = svc_rdma_get_req_map();
> -
> - if (!xprt->sc_frmr_pg_list_len)
> - sge_count = map_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
> - rpl_map, chl_map, ch_count,
> - byte_count);
> - else
> - sge_count = fast_reg_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
> - rpl_map, chl_map, ch_count,
> - byte_count);
> - if (sge_count < 0) {
> - err = -EIO;
> - goto out;
> - }
> -
> - sgl_offset = 0;
> - ch_no = 0;
> + /* The request is completed when the RDMA_READs complete. The
> + * head context keeps all the pages that comprise the
> + * request.
> + */
> + head->arg.head[0] = rqstp->rq_arg.head[0];
> + head->arg.tail[0] = rqstp->rq_arg.tail[0];
> + head->arg.pages = &head->pages[head->count];
> + head->hdr_count = head->count;
> + head->arg.page_base = 0;
> + head->arg.page_len = 0;
> + head->arg.len = rqstp->rq_arg.len;
> + head->arg.buflen = rqstp->rq_arg.buflen + PAGE_SIZE;
>
> + page_no = 0; page_offset = 0;
> for (ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
> - ch->rc_discrim != 0; ch++, ch_no++) {
> - u64 rs_offset;
> -next_sge:
> - ctxt = svc_rdma_get_context(xprt);
> - ctxt->direction = DMA_FROM_DEVICE;
> - ctxt->frmr = hdr_ctxt->frmr;
> - ctxt->read_hdr = NULL;
> - clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
> - clear_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
> + ch->rc_discrim != 0; ch++) {


As I understand it, this loop is iterating over the items in the RPC
header's read chunk list.

RFC 5667 section 4 says

"The server MUST ignore any Read list for other NFS procedures, as well
as additional Read list entries beyond the first in the list."

But rdma_read_chunks() still expects multiple chunks in the incoming list.
Should this code follow the RFC and process only the first chunk in
each RPC header, or did I misunderstand the RFC text?

I'm not sure how the RPC layer should enforce the requirement to ignore
chunks for particular classes of NFS requests. But it seems like valid
NFS/RDMA requests will have only zero or one item in each chunk list.
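
Something along these lines (sketch only, modeled on the new code quoted
above; error handling elided) is roughly what processing just the first
entry would look like:

    /* per RFC 5667 s4: handle only the first read list entry */
    ch = svc_rdma_get_read_chunk(rmsgp);
    if (!ch)
            return 0;

    xdr_decode_hyper((__be32 *)&ch->rc_target.rs_offset, &rs_offset);
    byte_count = ntohl(ch->rc_target.rs_length);

    while (byte_count > 0) {
            ret = reader(xprt, rqstp, head, &page_no, &page_offset,
                         ntohl(ch->rc_target.rs_handle),
                         byte_count, rs_offset, 1 /* last */);
            if (ret < 0)
                    goto err;
            byte_count -= ret;
            rs_offset += ret;
    }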


The ramification of this change:

The Linux client sends an extra chunk which is a zero pad to satisfy XDR
alignment. This results in an extra RDMA READ on the wire for a payload
that is never more than 3 bytes of zeros.

Section 3.7 of RFC 5666 states:

"If in turn, no data remains to be decoded from the inline portion, then
the receiver MUST conclude that roundup is present, and therefore it
advances the XDR decode position to that indicated by the next chunk (if
any). In this way, roundup is passed without ever actually transferring
additional XDR bytes."

Later, it states:

"Senders SHOULD therefore avoid encoding individual RDMA Read Chunks for
roundup whenever possible."

The Solaris NFS/RDMA server does not require this extra chunk, but Linux
appears to need it. When I enable pad optimization on the Linux client,
it works fine with the Solaris server. But the first NFS WRITE with an
odd length against the Linux server causes connection loss and the client
workload hangs.

At some point I'd like to permanently enable this optimization on the
Linux client. You can try this out with:

echo 1 | sudo tee /proc/sys/sunrpc/rdma_pad_optimize

The default contents of this file leave pad optimization disabled, for
compatibility with the current Linux NFS/RDMA server.

So rdma_read_chunks() needs to handle the implicit XDR length round-up.
That's probably not hard.
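
For instance, something like this after the read loop might be enough
(purely a sketch, untested; the pad bytes themselves would presumably
still need to be zero-filled in the receive page):

    /* sketch: account for XDR roundup when the client sends no pad chunk */
    if (page_offset & 3) {
            u32 pad = 4 - (page_offset & 3);

            head->arg.page_len += pad;
            head->arg.len += pad;
            head->arg.buflen += pad;
    }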

What I'm asking is:

Do you agree the server should be changed to comply with section 4 of
RFC 5667?

Should you change it now in this patch, or do you want me to write a patch
for it once the refactoring patch has been accepted?



>
> - /* Prepare READ WR */
> - memset(&read_wr, 0, sizeof read_wr);
> - read_wr.wr_id = (unsigned long)ctxt;
> - read_wr.opcode = IB_WR_RDMA_READ;
> - ctxt->wr_op = read_wr.opcode;
> - read_wr.send_flags = IB_SEND_SIGNALED;
> - read_wr.wr.rdma.rkey = ntohl(ch->rc_target.rs_handle);
> xdr_decode_hyper((__be32 *)&ch->rc_target.rs_offset,
> &rs_offset);
> - read_wr.wr.rdma.remote_addr = rs_offset + sgl_offset;
> - read_wr.sg_list = ctxt->sge;
> - read_wr.num_sge =
> - rdma_read_max_sge(xprt, chl_map->ch[ch_no].count);
> - err = rdma_set_ctxt_sge(xprt, ctxt, hdr_ctxt->frmr,
> - &rpl_map->sge[chl_map->ch[ch_no].start],
> - &sgl_offset,
> - read_wr.num_sge);
> - if (err) {
> - svc_rdma_unmap_dma(ctxt);
> - svc_rdma_put_context(ctxt, 0);
> - goto out;
> - }
> - if (((ch+1)->rc_discrim == 0) &&
> - (read_wr.num_sge == chl_map->ch[ch_no].count)) {
> - /*
> - * Mark the last RDMA_READ with a bit to
> - * indicate all RPC data has been fetched from
> - * the client and the RPC needs to be enqueued.
> - */
> - set_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
> - if (hdr_ctxt->frmr) {
> - set_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
> - /*
> - * Invalidate the local MR used to map the data
> - * sink.
> - */
> - if (xprt->sc_dev_caps &
> - SVCRDMA_DEVCAP_READ_W_INV) {
> - read_wr.opcode =
> - IB_WR_RDMA_READ_WITH_INV;
> - ctxt->wr_op = read_wr.opcode;
> - read_wr.ex.invalidate_rkey =
> - ctxt->frmr->mr->lkey;
> - } else {
> - /* Prepare INVALIDATE WR */
> - memset(&inv_wr, 0, sizeof inv_wr);
> - inv_wr.opcode = IB_WR_LOCAL_INV;
> - inv_wr.send_flags = IB_SEND_SIGNALED;
> - inv_wr.ex.invalidate_rkey =
> - hdr_ctxt->frmr->mr->lkey;
> - read_wr.next = &inv_wr;
> - }
> - }
> - ctxt->read_hdr = hdr_ctxt;
> - }
> - /* Post the read */
> - err = svc_rdma_send(xprt, &read_wr);
> - if (err) {
> - printk(KERN_ERR "svcrdma: Error %d posting RDMA_READ\n",
> - err);
> - set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> - svc_rdma_unmap_dma(ctxt);
> - svc_rdma_put_context(ctxt, 0);
> - goto out;
> + byte_count = ntohl(ch->rc_target.rs_length);
> +
> + /* Use FRMR if supported */
> + if (xprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)
> + reader = rdma_read_chunk_frmr;
> + else
> + reader = rdma_read_chunk_lcl;

The reader function is invariant across iterations of the for() loop.
In fact, it is invariant for all incoming requests on an xprt, isn't it?

Can svc_rdma_create() plant that function pointer in the xprt?
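
Something like this is what I have in mind (sketch only; the sc_reader
field is hypothetical), set once at the point where the device
capabilities are already being examined:

    /* at accept/setup time, once sc_dev_caps is known */
    if (newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)
            newxprt->sc_reader = rdma_read_chunk_frmr;
    else
            newxprt->sc_reader = rdma_read_chunk_lcl;

so that rdma_read_chunks() can simply call:

    ret = xprt->sc_reader(xprt, rqstp, head, &page_no, &page_offset,
                          ntohl(ch->rc_target.rs_handle),
                          byte_count, rs_offset,
                          ((ch + 1)->rc_discrim == 0) /* last */);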


> + while (byte_count > 0) {
> + ret = reader(xprt, rqstp, head,
> + &page_no, &page_offset,
> + ntohl(ch->rc_target.rs_handle),
> + byte_count, rs_offset,
> + ((ch+1)->rc_discrim == 0) /* last */
> + );
> + if (ret < 0)
> + goto err;
> + byte_count -= ret;
> + rs_offset += ret;
> }
> - atomic_inc(&rdma_stat_read);
> -
> - if (read_wr.num_sge < chl_map->ch[ch_no].count) {
> - chl_map->ch[ch_no].count -= read_wr.num_sge;
> - chl_map->ch[ch_no].start += read_wr.num_sge;
> - goto next_sge;
> - }
> - sgl_offset = 0;
> - err = 1;
> }
> -
> - out:
> - svc_rdma_put_req_map(rpl_map);
> - svc_rdma_put_req_map(chl_map);
> -
> + ret = 1;
> + err:
> /* Detach arg pages. svc_recv will replenish them */
> - for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
> - rqstp->rq_pages[ch_no] = NULL;
> + for (page_no = 0;
> + &rqstp->rq_pages[page_no] < rqstp->rq_respages; page_no++)
> + rqstp->rq_pages[page_no] = NULL;
>
> - return err;
> + return ret;
> }
>
> static int rdma_read_complete(struct svc_rqst *rqstp,
> @@ -595,13 +486,9 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> struct svc_rdma_op_ctxt,
> dto_q);
> list_del_init(&ctxt->dto_q);
> - }
> - if (ctxt) {
> spin_unlock_bh(&rdma_xprt->sc_rq_dto_lock);
> return rdma_read_complete(rqstp, ctxt);
> - }
> -
> - if (!list_empty(&rdma_xprt->sc_rq_dto_q)) {
> + } else if (!list_empty(&rdma_xprt->sc_rq_dto_q)) {
> ctxt = list_entry(rdma_xprt->sc_rq_dto_q.next,
> struct svc_rdma_op_ctxt,
> dto_q);
> @@ -621,7 +508,6 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> if (test_bit(XPT_CLOSE, &xprt->xpt_flags))
> goto close_out;
>
> - BUG_ON(ret);
> goto out;
> }
> dprintk("svcrdma: processing ctxt=%p on xprt=%p, rqstp=%p, status=%d\n",
> @@ -644,12 +530,11 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> }
>
> /* Read read-list data. */
> - ret = rdma_read_xdr(rdma_xprt, rmsgp, rqstp, ctxt);
> + ret = rdma_read_chunks(rdma_xprt, rmsgp, rqstp, ctxt);
> if (ret > 0) {
> /* read-list posted, defer until data received from client. */
> goto defer;
> - }
> - if (ret < 0) {
> + } else if (ret < 0) {
> /* Post of read-list failed, free context. */
> svc_rdma_put_context(ctxt, 1);
> return 0;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
[email protected]




2014-05-06 17:46:27

by Steve Wise

[permalink] [raw]
Subject: [PATCH V2 RFC 1/3] svcrdma: Transport and header file changes

From: Tom Tucker <[email protected]>

Change poll logic to grab up to 6 completions at a time.

RDMA write and send completions no longer deal with fastreg objects.

Set SVCRDMA_DEVCAP_FAST_REG and allocate a dma_mr based on the device
capabilities.

Signed-off-by: Tom Tucker <[email protected]>
Signed-off-by: Steve Wise <[email protected]>
---

include/linux/sunrpc/svc_rdma.h | 3 -
net/sunrpc/xprtrdma/svc_rdma_transport.c | 62 +++++++++++++++++-------------
2 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 0b8e3e6..5cf99a0 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
struct list_head frmr_list;
};
struct svc_rdma_req_map {
- struct svc_rdma_fastreg_mr *frmr;
unsigned long count;
union {
struct kvec sge[RPCSVC_MAXPAGES];
struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
+ unsigned long lkey[RPCSVC_MAXPAGES];
};
};
-#define RDMACTXT_F_FAST_UNREG 1
#define RDMACTXT_F_LAST_CTXT 2

#define SVCRDMA_DEVCAP_FAST_REG 1 /* fast mr registration */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 25688fa..2c5b201 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1,4 +1,5 @@
/*
+ * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
* Copyright (c) 2005-2007 Network Appliance, Inc. All rights reserved.
*
* This software is available to you under a choice of one of two
@@ -160,7 +161,6 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
schedule_timeout_uninterruptible(msecs_to_jiffies(500));
}
map->count = 0;
- map->frmr = NULL;
return map;
}

@@ -336,22 +336,21 @@ static void process_context(struct svcxprt_rdma *xprt,

switch (ctxt->wr_op) {
case IB_WR_SEND:
- if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
- svc_rdma_put_frmr(xprt, ctxt->frmr);
+ BUG_ON(ctxt->frmr);
svc_rdma_put_context(ctxt, 1);
break;

case IB_WR_RDMA_WRITE:
+ BUG_ON(ctxt->frmr);
svc_rdma_put_context(ctxt, 0);
break;

case IB_WR_RDMA_READ:
case IB_WR_RDMA_READ_WITH_INV:
+ svc_rdma_put_frmr(xprt, ctxt->frmr);
if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
BUG_ON(!read_hdr);
- if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
- svc_rdma_put_frmr(xprt, ctxt->frmr);
spin_lock_bh(&xprt->sc_rq_dto_lock);
set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
list_add_tail(&read_hdr->dto_q,
@@ -363,6 +362,7 @@ static void process_context(struct svcxprt_rdma *xprt,
break;

default:
+ BUG_ON(1);
printk(KERN_ERR "svcrdma: unexpected completion type, "
"opcode=%d\n",
ctxt->wr_op);
@@ -378,29 +378,42 @@ static void process_context(struct svcxprt_rdma *xprt,
static void sq_cq_reap(struct svcxprt_rdma *xprt)
{
struct svc_rdma_op_ctxt *ctxt = NULL;
- struct ib_wc wc;
+ struct ib_wc wc_a[6];
+ struct ib_wc *wc;
struct ib_cq *cq = xprt->sc_sq_cq;
int ret;

+ memset(wc_a, 0, sizeof(wc_a));
+
if (!test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags))
return;

ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
atomic_inc(&rdma_stat_sq_poll);
- while ((ret = ib_poll_cq(cq, 1, &wc)) > 0) {
- if (wc.status != IB_WC_SUCCESS)
- /* Close the transport */
- set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+ while ((ret = ib_poll_cq(cq, ARRAY_SIZE(wc_a), wc_a)) > 0) {
+ int i;

- /* Decrement used SQ WR count */
- atomic_dec(&xprt->sc_sq_count);
- wake_up(&xprt->sc_send_wait);
+ for (i = 0; i < ret; i++) {
+ wc = &wc_a[i];
+ if (wc->status != IB_WC_SUCCESS) {
+ dprintk("svcrdma: sq wc err status %d\n",
+ wc->status);

- ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
- if (ctxt)
- process_context(xprt, ctxt);
+ /* Close the transport */
+ set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+ }

- svc_xprt_put(&xprt->sc_xprt);
+ /* Decrement used SQ WR count */
+ atomic_dec(&xprt->sc_sq_count);
+ wake_up(&xprt->sc_send_wait);
+
+ ctxt = (struct svc_rdma_op_ctxt *)
+ (unsigned long)wc->wr_id;
+ if (ctxt)
+ process_context(xprt, ctxt);
+
+ svc_xprt_put(&xprt->sc_xprt);
+ }
}

if (ctxt)
@@ -993,7 +1006,11 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
need_dma_mr = 0;
break;
case RDMA_TRANSPORT_IB:
- if (!(devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
+ if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
+ need_dma_mr = 1;
+ dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
+ } else if (!(devattr.device_cap_flags &
+ IB_DEVICE_LOCAL_DMA_LKEY)) {
need_dma_mr = 1;
dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
} else
@@ -1190,14 +1207,7 @@ static int svc_rdma_has_wspace(struct svc_xprt *xprt)
container_of(xprt, struct svcxprt_rdma, sc_xprt);

/*
- * If there are fewer SQ WR available than required to send a
- * simple response, return false.
- */
- if ((rdma->sc_sq_depth - atomic_read(&rdma->sc_sq_count) < 3))
- return 0;
-
- /*
- * ...or there are already waiters on the SQ,
+ * If there are already waiters on the SQ,
* return false.
*/
if (waitqueue_active(&rdma->sc_send_wait))


2014-05-06 21:09:14

by Steve Wise

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic

On 5/6/2014 2:27 PM, J. Bruce Fields wrote:
> On Tue, May 06, 2014 at 12:46:21PM -0500, Steve Wise wrote:
>> This patch series refactors the NFSRDMA server marshalling logic to
>> remove the intermediary map structures. It also fixes an existing bug
>> where the NFSRDMA server was not minding the device fast register page
>> list length limitations.
>>
>> I've also made a git repo available with these patches on top of 3.15-rc4:
>>
>> git://git.openfabrics.org/~swise/linux svcrdma-refactor
>>
>> Changes since V1:
>>
>> - fixed regression for devices that don't support FRMRs (see
>> rdma_read_chunk_lcl())
>>
>> - split patch up for closer review. However I request it be squashed
>> before merging as they is not bisectable, and I think these changes
>> should all be a single commit anyway.
> If it's not split up in a way that's bisectable, then yes, just don't
> bother.

I didn't see a good way to split it up, have it bisectable, and not have
all the big stuff in one patch. I think it's a little more reviewable in
these 3 patches, but when I post V3, I'll put it back as an uber patch.
Hopefully folks can have a look at these 3 patches ignoring the bisect
issue. Having said that, the rdma read logic really is better reviewed
by looking at the code after applying the patches. That's why I published
a git branch.

Thanks!

Steve.