For 4.5, I'd like to address the send queue accounting and
invalidation/unmap ordering issues Jason brought up a couple of
months ago.
In preparation for Doug's final topic branch, Anna, I've rebased
these on Christoph's ib_device_attr branch, but there were no merge
conflicts or other changes needed. Could you begin preparing these
for linux-next and other final testing and review?
Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:
git://git.linux-nfs.org/projects/cel/cel-2.6.git
Or for browsing:
http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5
Changes since v2:
- Rebased on Christoph's ib_device_attr branch
Changes since v1:
- Rebased on v4.4-rc3
- Receive buffer safety margin patch dropped
- Backchannel pr_err and pr_info converted to dprintk
- Backchannel spin locks converted to work queue-safe locks
- Fixed premature release of backchannel request buffer
- NFSv4.1 callbacks tested with for-4.5 server
---
Chuck Lever (11):
xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
xprtrdma: xprt_rdma_free() must not release backchannel reqs
xprtrdma: Disable RPC/RDMA backchannel debugging messages
xprtrdma: Move struct ib_send_wr off the stack
xprtrdma: Introduce ro_unmap_sync method
xprtrdma: Add ro_unmap_sync method for FRWR
xprtrdma: Add ro_unmap_sync method for FMR
xprtrdma: Add ro_unmap_sync method for all-physical registration
SUNRPC: Introduce xprt_commit_rqst()
xprtrdma: Invalidate in the RPC reply handler
xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').
include/linux/sunrpc/xprt.h | 1
net/sunrpc/xprt.c | 14 +++
net/sunrpc/xprtrdma/backchannel.c | 22 ++---
net/sunrpc/xprtrdma/fmr_ops.c | 64 +++++++++++++
net/sunrpc/xprtrdma/frwr_ops.c | 175 +++++++++++++++++++++++++++++++-----
net/sunrpc/xprtrdma/physical_ops.c | 13 +++
net/sunrpc/xprtrdma/rpc_rdma.c | 14 +++
net/sunrpc/xprtrdma/transport.c | 3 +
net/sunrpc/xprtrdma/verbs.c | 13 +--
net/sunrpc/xprtrdma/xprt_rdma.h | 12 +-
10 files changed, 283 insertions(+), 48 deletions(-)
--
Chuck Lever
Clean up.
rb_lock critical sections added in rpcrdma_ep_post_extra_recv()
should have first been converted to use normal spin_lock now that
the reply handler is a work queue.
The backchannel set up code should use the appropriate helper
instead of open-coding a rb_recv_bufs list add.
Problem introduced by glib patch re-ordering on my part.
Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/backchannel.c | 6 +-----
net/sunrpc/xprtrdma/verbs.c | 7 +++----
2 files changed, 4 insertions(+), 9 deletions(-)
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 2dcb44f..11d2cfb 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -84,9 +84,7 @@ out_fail:
static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
unsigned int count)
{
- struct rpcrdma_buffer *buffers = &r_xprt->rx_buf;
struct rpcrdma_rep *rep;
- unsigned long flags;
int rc = 0;
while (count--) {
@@ -98,9 +96,7 @@ static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
break;
}
- spin_lock_irqsave(&buffers->rb_lock, flags);
- list_add(&rep->rr_list, &buffers->rb_recv_bufs);
- spin_unlock_irqrestore(&buffers->rb_lock, flags);
+ rpcrdma_recv_buffer_put(rep);
}
return rc;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 650034b..f23f3d6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1329,15 +1329,14 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_ep *ep = &r_xprt->rx_ep;
struct rpcrdma_rep *rep;
- unsigned long flags;
int rc;
while (count--) {
- spin_lock_irqsave(&buffers->rb_lock, flags);
+ spin_lock(&buffers->rb_lock);
if (list_empty(&buffers->rb_recv_bufs))
goto out_reqbuf;
rep = rpcrdma_buffer_get_rep_locked(buffers);
- spin_unlock_irqrestore(&buffers->rb_lock, flags);
+ spin_unlock(&buffers->rb_lock);
rc = rpcrdma_ep_post_recv(ia, ep, rep);
if (rc)
@@ -1347,7 +1346,7 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count)
return 0;
out_reqbuf:
- spin_unlock_irqrestore(&buffers->rb_lock, flags);
+ spin_unlock(&buffers->rb_lock);
pr_warn("%s: no extra receive buffers\n", __func__);
return -ENOMEM;
Clean up.
Fixes: 63cae47005af ('xprtrdma: Handle incoming backward direction')
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/backchannel.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 11d2cfb..cd31181 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -15,7 +15,7 @@
# define RPCDBG_FACILITY RPCDBG_TRANS
#endif
-#define RPCRDMA_BACKCHANNEL_DEBUG
+#undef RPCRDMA_BACKCHANNEL_DEBUG
static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst *rqst)
@@ -136,6 +136,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
__func__);
goto out_free;
}
+ dprintk("RPC: %s: new rqst %p\n", __func__, rqst);
rqst->rq_xprt = &r_xprt->rx_xprt;
INIT_LIST_HEAD(&rqst->rq_list);
@@ -216,12 +217,14 @@ int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
rpclen = rqst->rq_svec[0].iov_len;
+#ifdef RPCRDMA_BACKCHANNEL_DEBUG
pr_info("RPC: %s: rpclen %zd headerp 0x%p lkey 0x%x\n",
__func__, rpclen, headerp, rdmab_lkey(req->rl_rdmabuf));
pr_info("RPC: %s: RPC/RDMA: %*ph\n",
__func__, (int)RPCRDMA_HDRLEN_MIN, headerp);
pr_info("RPC: %s: RPC: %*ph\n",
__func__, (int)rpclen, rqst->rq_svec[0].iov_base);
+#endif
req->rl_send_iov[0].addr = rdmab_addr(req->rl_rdmabuf);
req->rl_send_iov[0].length = RPCRDMA_HDRLEN_MIN;
@@ -265,6 +268,9 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
{
struct rpc_xprt *xprt = rqst->rq_xprt;
+ dprintk("RPC: %s: freeing rqst %p (req %p)\n",
+ __func__, rqst, rpcr_to_rdmar(rqst));
+
smp_mb__before_atomic();
WARN_ON_ONCE(!test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state));
clear_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
@@ -329,9 +335,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst, rq_bc_pa_list);
list_del(&rqst->rq_bc_pa_list);
spin_unlock(&xprt->bc_pa_lock);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
- pr_info("RPC: %s: using rqst %p\n", __func__, rqst);
-#endif
+ dprintk("RPC: %s: using rqst %p\n", __func__, rqst);
/* Prepare rqst */
rqst->rq_reply_bytes_recvd = 0;
@@ -351,10 +355,8 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
* direction reply.
*/
req = rpcr_to_rdmar(rqst);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
- pr_info("RPC: %s: attaching rep %p to req %p\n",
+ dprintk("RPC: %s: attaching rep %p to req %p\n",
__func__, rep, req);
-#endif
req->rl_reply = rep;
/* Defeat the retransmit detection logic in send_request */
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.
Somehow this hunk got lost.
Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..740bddc 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -576,6 +576,9 @@ xprt_rdma_free(void *buffer)
rb = container_of(buffer, struct rpcrdma_regbuf, rg_base[0]);
req = rb->rg_owner;
+ if (req->rl_backchannel)
+ return;
+
r_xprt = container_of(req->rl_buffer, struct rpcrdma_xprt, rx_buf);
dprintk("RPC: %s: called on 0x%p\n", __func__, req->rl_reply);
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().
(This will be for chaining LOCAL_INV requests).
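A rough sketch of the idea, using illustrative names that are not part of this patch: once each WR lives in per-MR storage, WRs can be linked through wr.next and handed to the provider with a single verb call:

	/* sketch only: "first_wr" and "second_wr" are hypothetical
	 * pointers to per-MR ib_send_wr structures
	 */
	first_wr->next = second_wr;
	second_wr->next = NULL;
	rc = ib_post_send(ia->ri_id->qp, first_wr, &bad_wr);
	if (rc)
		pr_warn("%s: ib_post_send failed %i\n", __func__, rc);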
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 38 ++++++++++++++++++++------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
2 files changed, 22 insertions(+), 18 deletions(-)
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index ae2a241..660d0b6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -318,7 +318,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
- struct ib_reg_wr reg_wr;
+ struct ib_reg_wr *reg_wr;
struct ib_send_wr *bad_wr;
int rc, i, n, dma_nents;
u8 key;
@@ -335,6 +335,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
mr = frmr->fr_mr;
+ reg_wr = &frmr->fr_regwr;
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
@@ -380,19 +381,19 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
key = (u8)(mr->rkey & 0x000000FF);
ib_update_fast_reg_key(mr, ++key);
- reg_wr.wr.next = NULL;
- reg_wr.wr.opcode = IB_WR_REG_MR;
- reg_wr.wr.wr_id = (uintptr_t)mw;
- reg_wr.wr.num_sge = 0;
- reg_wr.wr.send_flags = 0;
- reg_wr.mr = mr;
- reg_wr.key = mr->rkey;
- reg_wr.access = writing ?
- IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
- IB_ACCESS_REMOTE_READ;
+ reg_wr->wr.next = NULL;
+ reg_wr->wr.opcode = IB_WR_REG_MR;
+ reg_wr->wr.wr_id = (uintptr_t)mw;
+ reg_wr->wr.num_sge = 0;
+ reg_wr->wr.send_flags = 0;
+ reg_wr->mr = mr;
+ reg_wr->key = mr->rkey;
+ reg_wr->access = writing ?
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+ IB_ACCESS_REMOTE_READ;
DECR_CQCOUNT(&r_xprt->rx_ep);
- rc = ib_post_send(ia->ri_id->qp, &reg_wr.wr, &bad_wr);
+ rc = ib_post_send(ia->ri_id->qp, &reg_wr->wr, &bad_wr);
if (rc)
goto out_senderr;
@@ -422,23 +423,24 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
- struct ib_send_wr invalidate_wr, *bad_wr;
+ struct ib_send_wr *invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
dprintk("RPC: %s: FRMR %p\n", __func__, mw);
seg1->rl_mw = NULL;
frmr->fr_state = FRMR_IS_INVALID;
+ invalidate_wr = &mw->r.frmr.fr_invwr;
- memset(&invalidate_wr, 0, sizeof(invalidate_wr));
- invalidate_wr.wr_id = (unsigned long)(void *)mw;
- invalidate_wr.opcode = IB_WR_LOCAL_INV;
- invalidate_wr.ex.invalidate_rkey = frmr->fr_mr->rkey;
+ memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+ invalidate_wr->wr_id = (uintptr_t)mw;
+ invalidate_wr->opcode = IB_WR_LOCAL_INV;
+ invalidate_wr->ex.invalidate_rkey = frmr->fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);
read_lock(&ia->ri_qplock);
- rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+ rc = ib_post_send(ia->ri_id->qp, invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 4197191..e60d817 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -206,6 +206,8 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct fr_work;
struct rpcrdma_xprt *fr_xprt;
+ struct ib_reg_wr fr_regwr;
+ struct ib_send_wr fr_invwr;
};
struct rpcrdma_fmr {
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchronously (the MR will be knocked down
and returned to the pool in the background).
To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.
The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.
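For reference only, the expected call site is the RPC/RDMA reply handler, which a later patch in this series adds:

	if (req->rl_nchunks)
		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);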
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index e60d817..d9f2f65 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -365,6 +365,8 @@ struct rpcrdma_xprt;
struct rpcrdma_memreg_ops {
int (*ro_map)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *, int, bool);
+ void (*ro_unmap_sync)(struct rpcrdma_xprt *,
+ struct rpcrdma_req *);
int (*ro_unmap)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *);
int (*ro_open)(struct rpcrdma_ia *,
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.
Note also, DMA unmapping is now done _after_ invalidation.
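The heart of the synchronous wait, distilled from the patch below (sketch only; error recovery omitted):

	/* only the last WR in the chain is signaled; send queue
	 * ordering means all earlier WRs have completed by then
	 */
	f->fr_invwr.send_flags = IB_SEND_SIGNALED;
	f->fr_waiter = true;
	init_completion(&f->fr_linv_done);

	rc = ib_post_send(ia->ri_id->qp, invalidate_wrs, &bad_wr);
	if (rc)
		pr_warn("%s: ib_post_send failed %i\n", __func__, rc);
	wait_for_completion(&f->fr_linv_done);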
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 137 ++++++++++++++++++++++++++++++++++++++-
net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
2 files changed, 135 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 660d0b6..5b9e41d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -244,12 +244,14 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
}
-/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs
+ * to be reset.
+ *
+ * WARNING: Only wr_id and status are reliable at this point
+ */
static void
-frwr_sendcompletion(struct ib_wc *wc)
+__frwr_sendcompletion_flush(struct ib_wc *wc, struct rpcrdma_mw *r)
{
- struct rpcrdma_mw *r;
-
if (likely(wc->status == IB_WC_SUCCESS))
return;
@@ -260,9 +262,23 @@ frwr_sendcompletion(struct ib_wc *wc)
else
pr_warn("RPC: %s: frmr %p error, status %s (%d)\n",
__func__, r, ib_wc_status_msg(wc->status), wc->status);
+
r->r.frmr.fr_state = FRMR_IS_STALE;
}
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+ struct rpcrdma_mw *r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+ struct rpcrdma_frmr *f = &r->r.frmr;
+
+ if (unlikely(wc->status != IB_WC_SUCCESS))
+ __frwr_sendcompletion_flush(wc, r);
+
+ if (f->fr_waiter)
+ complete(&f->fr_linv_done);
+}
+
static int
frwr_op_init(struct rpcrdma_xprt *r_xprt)
{
@@ -334,6 +350,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
+ frmr->fr_waiter = false;
mr = frmr->fr_mr;
reg_wr = &frmr->fr_regwr;
@@ -413,6 +430,117 @@ out_senderr:
return rc;
}
+static struct ib_send_wr *
+__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
+{
+ struct rpcrdma_mw *mw = seg->rl_mw;
+ struct rpcrdma_frmr *f = &mw->r.frmr;
+ struct ib_send_wr *invalidate_wr;
+
+ f->fr_waiter = false;
+ f->fr_state = FRMR_IS_INVALID;
+ invalidate_wr = &f->fr_invwr;
+
+ memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+ invalidate_wr->wr_id = (unsigned long)(void *)mw;
+ invalidate_wr->opcode = IB_WR_LOCAL_INV;
+ invalidate_wr->ex.invalidate_rkey = f->fr_mr->rkey;
+
+ return invalidate_wr;
+}
+
+static void
+__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+ int rc)
+{
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
+ struct rpcrdma_mw *mw = seg->rl_mw;
+ int nsegs = seg->mr_nsegs;
+
+ seg->rl_mw = NULL;
+
+ while (nsegs--)
+ rpcrdma_unmap_one(device, seg++);
+
+ if (!rc)
+ rpcrdma_put_mw(r_xprt, mw);
+ else
+ __frwr_queue_recovery(mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+ struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr;
+ struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+ struct rpcrdma_mr_seg *seg;
+ unsigned int i, nchunks;
+ struct rpcrdma_frmr *f;
+ int rc;
+
+ dprintk("RPC: %s: req %p\n", __func__, req);
+
+ /* ORDER: Invalidate all of the req's MRs first
+ *
+ * Chain the LOCAL_INV Work Requests and post them with
+ * a single ib_post_send() call.
+ */
+ invalidate_wrs = pos = prev = NULL;
+ seg = NULL;
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+
+ pos = __frwr_prepare_linv_wr(seg);
+
+ if (!invalidate_wrs)
+ invalidate_wrs = pos;
+ else
+ prev->next = pos;
+ prev = pos;
+
+ i += seg->mr_nsegs;
+ }
+ f = &seg->rl_mw->r.frmr;
+
+ /* Strong send queue ordering guarantees that when the
+ * last WR in the chain completes, all WRs in the chain
+ * are complete.
+ */
+ f->fr_invwr.send_flags = IB_SEND_SIGNALED;
+ f->fr_waiter = true;
+ init_completion(&f->fr_linv_done);
+ INIT_CQCOUNT(&r_xprt->rx_ep);
+
+ /* Transport disconnect drains the receive CQ before it
+ * replaces the QP. The RPC reply handler won't call us
+ * unless ri_id->qp is a valid pointer.
+ */
+ rc = ib_post_send(ia->ri_id->qp, invalidate_wrs, &bad_wr);
+ if (rc)
+ pr_warn("%s: ib_post_send failed %i\n", __func__, rc);
+
+ wait_for_completion(&f->fr_linv_done);
+
+ /* ORDER: Now DMA unmap all of the req's MRs, and return
+ * them to the free MW list.
+ */
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+
+ __frwr_dma_unmap(r_xprt, seg, rc);
+
+ i += seg->mr_nsegs;
+ seg->mr_nsegs = 0;
+ }
+
+ req->rl_nchunks = 0;
+}
+
/* Post a LOCAL_INV Work Request to prevent further remote access
* via RDMA READ or RDMA WRITE.
*/
@@ -472,6 +600,7 @@ frwr_op_destroy(struct rpcrdma_buffer *buf)
const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
+ .ro_unmap_sync = frwr_op_unmap_sync,
.ro_unmap = frwr_op_unmap,
.ro_open = frwr_op_open,
.ro_maxpages = frwr_op_maxpages,
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index d9f2f65..089a7db 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -206,6 +206,8 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct fr_work;
struct rpcrdma_xprt *fr_xprt;
+ bool fr_waiter;
+ struct completion fr_linv_done;
struct ib_reg_wr fr_regwr;
struct ib_send_wr fr_invwr;
};
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.
1. Gather all the MRs for the RPC request onto a list, and invoke
ib_unmap_fmr() once with that list. This reduces the number of
doorbells when there is more than one MR to invalidate (sketched below).
2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
This is critical after invalidating a Write chunk.
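The gist of item 1, as a sketch of what the patch below does:

	LIST_HEAD(unmap_list);

	/* gather every FMR for this RPC onto one list ... */
	list_add(&mw->r.fmr.fmr->list, &unmap_list);

	/* ... then knock them all down with a single verb call */
	rc = ib_unmap_fmr(&unmap_list);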
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 64 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f1e8daf..c14f3a4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -179,6 +179,69 @@ out_maperr:
return rc;
}
+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
+ struct rpcrdma_mw *mw = seg->rl_mw;
+ int nsegs = seg->mr_nsegs;
+
+ seg->rl_mw = NULL;
+
+ while (nsegs--)
+ rpcrdma_unmap_one(device, seg++);
+
+ rpcrdma_put_mw(r_xprt, mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+ struct rpcrdma_mr_seg *seg;
+ unsigned int i, nchunks;
+ struct rpcrdma_mw *mw;
+ LIST_HEAD(unmap_list);
+ int rc;
+
+ dprintk("RPC: %s: req %p\n", __func__, req);
+
+ /* ORDER: Invalidate all of the req's MRs first
+ *
+ * ib_unmap_fmr() is slow, so use a single call instead
+ * of one call per mapped MR.
+ */
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+ mw = seg->rl_mw;
+
+ list_add(&mw->r.fmr.fmr->list, &unmap_list);
+
+ i += seg->mr_nsegs;
+ }
+ rc = ib_unmap_fmr(&unmap_list);
+ if (rc)
+ pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+
+ /* ORDER: Now DMA unmap all of the req's MRs, and return
+ * them to the free MW list.
+ */
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+
+ __fmr_dma_unmap(r_xprt, seg);
+
+ i += seg->mr_nsegs;
+ seg->mr_nsegs = 0;
+ }
+
+ req->rl_nchunks = 0;
+}
+
/* Use the ib_unmap_fmr() verb to prevent further remote
* access via RDMA READ or RDMA WRITE.
*/
@@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+ .ro_unmap_sync = fmr_op_unmap_sync,
.ro_unmap = fmr_op_unmap,
.ro_open = fmr_op_open,
.ro_maxpages = fmr_op_maxpages,
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/physical_ops.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 617b76f..dbb302e 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -83,6 +83,18 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
return 1;
}
+/* DMA unmap all memory regions that were mapped for "req".
+ */
+static void
+physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
+ unsigned int i;
+
+ for (i = 0; req->rl_nchunks; --req->rl_nchunks)
+ rpcrdma_unmap_one(device, &req->rl_segments[i++]);
+}
+
static void
physical_op_destroy(struct rpcrdma_buffer *buf)
{
@@ -90,6 +102,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
+ .ro_unmap_sync = physical_op_unmap_sync,
.ro_unmap = physical_op_unmap,
.ro_open = physical_op_open,
.ro_maxpages = physical_op_maxpages,
I'm about to add code in the RPC/RDMA reply handler between the
xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
to execute outside of spinlock critical sections.
Add a hook to remove an rpc_rqst from the pending list once
the transport knows it's going to invoke xprt_complete_rqst().
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/xprt.h | 1 +
net/sunrpc/xprt.c | 14 ++++++++++++++
net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++++
3 files changed, 19 insertions(+)
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..ab6c3a5 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -366,6 +366,7 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
void xprt_write_space(struct rpc_xprt *xprt);
void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
struct rpc_rqst * xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
+void xprt_commit_rqst(struct rpc_task *task);
void xprt_complete_rqst(struct rpc_task *task, int copied);
void xprt_release_rqst_cong(struct rpc_task *task);
void xprt_disconnect_done(struct rpc_xprt *xprt);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..a5be4ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
}
/**
+ * xprt_commit_rqst - remove rqst from pending list early
+ * @task: RPC request to remove
+ *
+ * Caller holds transport lock.
+ */
+void xprt_commit_rqst(struct rpc_task *task)
+{
+ struct rpc_rqst *req = task->tk_rqstp;
+
+ list_del_init(&req->rq_list);
+}
+EXPORT_SYMBOL_GPL(xprt_commit_rqst);
+
+/**
* xprt_complete_rqst - called when reply processing is complete
* @task: RPC request that recently completed
* @copied: actual number of bytes received from the transport
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..0bc8c39 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
if (req->rl_reply)
goto out_duplicate;
+ xprt_commit_rqst(rqst->rq_task);
+ spin_unlock_bh(&xprt->transport_lock);
+
dprintk("RPC: %s: reply 0x%p completes request 0x%p\n"
" RPC request 0x%p xid 0x%08x\n",
__func__, rep, req, rqst,
@@ -894,6 +897,7 @@ badheader:
else if (credits > r_xprt->rx_buf.rb_max_requests)
credits = r_xprt->rx_buf.rb_max_requests;
+ spin_lock_bh(&xprt->transport_lock);
cwnd = xprt->cwnd;
xprt->cwnd = credits << RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.
The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.
This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each of those RPCs at the same time.
Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.
Now, the reply handler prevents RPC dispatch until invalidation is
complete. This ensures a new RPC is not dispatched until there are
enough send queue resources for it to proceed.
Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.
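In rough terms, the reply handler ordering becomes the following (sketch only; locking omitted, and "copied" stands for the reply byte count computed earlier in the handler):

	/* fence the server and reclaim Send Queue entries first ... */
	if (req->rl_nchunks)
		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);

	/* ... and only then wake the waiting RPC */
	xprt_complete_rqst(rqst->rq_task, copied);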
Reported-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 0bc8c39..3d00c5d 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -891,6 +891,16 @@ badheader:
break;
}
+ /* Invalidate and flush the data payloads before waking the
+ * waiting application. This guarantees the memory region is
+ * properly fenced from the server before the application
+ * accesses the data. It also ensures proper send flow
+ * control: waking the next RPC waits until this RPC has
+ * relinquished all its Send Queue entries.
+ */
+ if (req->rl_nchunks)
+ r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
+
credits = be32_to_cpu(headerp->rm_credit);
if (credits == 0)
credits = 1; /* don't deadlock */
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.
Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 6 ++----
net/sunrpc/xprtrdma/xprt_rdma.h | 6 ------
2 files changed, 2 insertions(+), 10 deletions(-)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f23f3d6..1867e3a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -608,10 +608,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
- if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS)
- ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS;
- else if (ep->rep_cqinit <= 2)
- ep->rep_cqinit = 0;
+ if (ep->rep_cqinit <= 2)
+ ep->rep_cqinit = 0; /* always signal? */
INIT_CQCOUNT(ep);
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 089a7db..ba3bc3f 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -87,12 +87,6 @@ struct rpcrdma_ep {
struct delayed_work rep_connect_worker;
};
-/*
- * Force a signaled SEND Work Request every so often,
- * in case the provider needs to do some housekeeping.
- */
-#define RPCRDMA_MAX_UNSIGNALED_SENDS (32)
-
#define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
#define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
Thanks, Chuck!
Everything looks okay to me, so I'll apply these patches and send them to Trond before the holiday.
On 12/14/2015 04:17 PM, Chuck Lever wrote:
> For 4.5, I'd like to address the send queue accounting and
> invalidation/unmap ordering issues Jason brought up a couple of
> months ago.
>
> In preparation for Doug's final topic branch, Anna, I've rebased
> these on Christoph's ib_device_attr branch, but there were no merge
> conflicts or other changes needed. Could you begin preparing these
> for linux-next and other final testing and review?
No merge conflicts is nice, and we might not need to worry about ordering the pull request.
Thanks,
Anna
>
> Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:
>
> git://git.linux-nfs.org/projects/cel/cel-2.6.git
>
> Or for browsing:
>
> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5
>
>
> Changes since v2:
> - Rebased on Christoph's ib_device_attr branch
>
>
> Changes since v1:
>
> - Rebased on v4.4-rc3
> - Receive buffer safety margin patch dropped
> - Backchannel pr_err and pr_info converted to dprintk
> - Backchannel spin locks converted to work queue-safe locks
> - Fixed premature release of backchannel request buffer
> - NFSv4.1 callbacks tested with for-4.5 server
>
> ---
>
> Chuck Lever (11):
> xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
> xprtrdma: xprt_rdma_free() must not release backchannel reqs
> xprtrdma: Disable RPC/RDMA backchannel debugging messages
> xprtrdma: Move struct ib_send_wr off the stack
> xprtrdma: Introduce ro_unmap_sync method
> xprtrdma: Add ro_unmap_sync method for FRWR
> xprtrdma: Add ro_unmap_sync method for FMR
> xprtrdma: Add ro_unmap_sync method for all-physical registration
> SUNRPC: Introduce xprt_commit_rqst()
> xprtrdma: Invalidate in the RPC reply handler
> xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').
>
>
> include/linux/sunrpc/xprt.h | 1
> net/sunrpc/xprt.c | 14 +++
> net/sunrpc/xprtrdma/backchannel.c | 22 ++---
> net/sunrpc/xprtrdma/fmr_ops.c | 64 +++++++++++++
> net/sunrpc/xprtrdma/frwr_ops.c | 175 +++++++++++++++++++++++++++++++-----
> net/sunrpc/xprtrdma/physical_ops.c | 13 +++
> net/sunrpc/xprtrdma/rpc_rdma.c | 14 +++
> net/sunrpc/xprtrdma/transport.c | 3 +
> net/sunrpc/xprtrdma/verbs.c | 13 +--
> net/sunrpc/xprtrdma/xprt_rdma.h | 12 +-
> 10 files changed, 283 insertions(+), 48 deletions(-)
>
> --
> Chuck Lever
>
Hi Chuck,
iozone passed on an ocrdma device. Link bounce fails to recover iozone
traffic; however, the failure is not related to this patch series. I am in
the process of finding the patch that broke it.
Tested-By: Devesh Sharma <[email protected]>
On Wed, Dec 16, 2015 at 1:07 AM, Anna Schumaker
<[email protected]> wrote:
> Thanks, Chuck!
>
> Everything looks okay to me, so I'll apply these patches and send them to Trond before the holiday.
>
> On 12/14/2015 04:17 PM, Chuck Lever wrote:
>> For 4.5, I'd like to address the send queue accounting and
>> invalidation/unmap ordering issues Jason brought up a couple of
>> months ago.
>>
>> In preparation for Doug's final topic branch, Anna, I've rebased
>> these on Christoph's ib_device_attr branch, but there were no merge
>> conflicts or other changes needed. Could you begin preparing these
>> for linux-next and other final testing and review?
>
> No merge conflicts is nice, and we might not need to worry about ordering the pull request.
>
> Thanks,
> Anna
>
>>
>> Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:
>>
>> git://git.linux-nfs.org/projects/cel/cel-2.6.git
>>
>> Or for browsing:
>>
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5
>>
>>
>> Changes since v2:
>> - Rebased on Christoph's ib_device_attr branch
>>
>>
>> Changes since v1:
>>
>> - Rebased on v4.4-rc3
>> - Receive buffer safety margin patch dropped
>> - Backchannel pr_err and pr_info converted to dprintk
>> - Backchannel spin locks converted to work queue-safe locks
>> - Fixed premature release of backchannel request buffer
>> - NFSv4.1 callbacks tested with for-4.5 server
>>
>> ---
>>
>> Chuck Lever (11):
>> xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
>> xprtrdma: xprt_rdma_free() must not release backchannel reqs
>> xprtrdma: Disable RPC/RDMA backchannel debugging messages
>> xprtrdma: Move struct ib_send_wr off the stack
>> xprtrdma: Introduce ro_unmap_sync method
>> xprtrdma: Add ro_unmap_sync method for FRWR
>> xprtrdma: Add ro_unmap_sync method for FMR
>> xprtrdma: Add ro_unmap_sync method for all-physical registration
>> SUNRPC: Introduce xprt_commit_rqst()
>> xprtrdma: Invalidate in the RPC reply handler
>> xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').
>>
>>
>> include/linux/sunrpc/xprt.h | 1
>> net/sunrpc/xprt.c | 14 +++
>> net/sunrpc/xprtrdma/backchannel.c | 22 ++---
>> net/sunrpc/xprtrdma/fmr_ops.c | 64 +++++++++++++
>> net/sunrpc/xprtrdma/frwr_ops.c | 175 +++++++++++++++++++++++++++++++-----
>> net/sunrpc/xprtrdma/physical_ops.c | 13 +++
>> net/sunrpc/xprtrdma/rpc_rdma.c | 14 +++
>> net/sunrpc/xprtrdma/transport.c | 3 +
>> net/sunrpc/xprtrdma/verbs.c | 13 +--
>> net/sunrpc/xprtrdma/xprt_rdma.h | 12 +-
>> 10 files changed, 283 insertions(+), 48 deletions(-)
>>
>> --
>> Chuck Lever
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Chuck,
Sorry for the last minute comment.
On 12/14/2015 04:19 PM, Chuck Lever wrote:
> I'm about to add code in the RPC/RDMA reply handler between the
> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
> to execute outside of spinlock critical sections.
>
> Add a hook to remove an rpc_rqst from the pending list once
> the transport knows it's going to invoke xprt_complete_rqst().
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/xprt.h | 1 +
> net/sunrpc/xprt.c | 14 ++++++++++++++
> net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++++
> 3 files changed, 19 insertions(+)
>
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 69ef5b3..ab6c3a5 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -366,6 +366,7 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
> void xprt_write_space(struct rpc_xprt *xprt);
> void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
> struct rpc_rqst * xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
> +void xprt_commit_rqst(struct rpc_task *task);
> void xprt_complete_rqst(struct rpc_task *task, int copied);
> void xprt_release_rqst_cong(struct rpc_task *task);
> void xprt_disconnect_done(struct rpc_xprt *xprt);
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index 2e98f4a..a5be4ab 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
> }
>
> /**
> + * xprt_commit_rqst - remove rqst from pending list early
> + * @task: RPC request to remove
> + *
> + * Caller holds transport lock.
> + */
> +void xprt_commit_rqst(struct rpc_task *task)
> +{
> + struct rpc_rqst *req = task->tk_rqstp;
> +
> + list_del_init(&req->rq_list);
> +}
> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
Can you move this function into the xprtrdma code, since it's not called outside of there?
Thanks,
Anna
> +
> +/**
> * xprt_complete_rqst - called when reply processing is complete
> * @task: RPC request that recently completed
> * @copied: actual number of bytes received from the transport
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index c10d969..0bc8c39 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
> if (req->rl_reply)
> goto out_duplicate;
>
> + xprt_commit_rqst(rqst->rq_task);
> + spin_unlock_bh(&xprt->transport_lock);
> +
> dprintk("RPC: %s: reply 0x%p completes request 0x%p\n"
> " RPC request 0x%p xid 0x%08x\n",
> __func__, rep, req, rqst,
> @@ -894,6 +897,7 @@ badheader:
> else if (credits > r_xprt->rx_buf.rb_max_requests)
> credits = r_xprt->rx_buf.rb_max_requests;
>
> + spin_lock_bh(&xprt->transport_lock);
> cwnd = xprt->cwnd;
> xprt->cwnd = credits << RPC_CWNDSHIFT;
> if (xprt->cwnd > cwnd)
>
> +static void
> +__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> + int rc)
> +{
> + struct ib_device *device = r_xprt->rx_ia.ri_device;
> + struct rpcrdma_mw *mw = seg->rl_mw;
> + int nsegs = seg->mr_nsegs;
> +
> + seg->rl_mw = NULL;
> +
> + while (nsegs--)
> + rpcrdma_unmap_one(device, seg++);
Chuck, shouldn't this be replaced with ib_dma_unmap_sg?
Sorry for the late comment (Didn't find enough time to properly
review this...)
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 4197191..e60d817 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -206,6 +206,8 @@ struct rpcrdma_frmr {
> enum rpcrdma_frmr_state fr_state;
> struct work_struct fr_work;
> struct rpcrdma_xprt *fr_xprt;
> + struct ib_reg_wr fr_regwr;
> + struct ib_send_wr fr_invwr;
Would it make sense to unionize these as they are guaranteed not to
execute together? Some people don't like this sort of savings.
> On Dec 16, 2015, at 9:00 AM, Sagi Grimberg <[email protected]> wrote:
>
>
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 4197191..e60d817 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -206,6 +206,8 @@ struct rpcrdma_frmr {
>> enum rpcrdma_frmr_state fr_state;
>> struct work_struct fr_work;
>> struct rpcrdma_xprt *fr_xprt;
>> + struct ib_reg_wr fr_regwr;
>> + struct ib_send_wr fr_invwr;
>
> Would it make sense to unionize these as they are guaranteed not to
> execute together? Some people don't like this sort of savings.
I dislike unions because they make the code that uses
them less readable. I can define macros to help that,
but sigh! OK.
--
Chuck Lever
> On Dec 16, 2015, at 8:57 AM, Sagi Grimberg <[email protected]> wrote:
>
>
>> +static void
>> +__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>> + int rc)
>> +{
>> + struct ib_device *device = r_xprt->rx_ia.ri_device;
>> + struct rpcrdma_mw *mw = seg->rl_mw;
>> + int nsegs = seg->mr_nsegs;
>> +
>> + seg->rl_mw = NULL;
>> +
>> + while (nsegs--)
>> + rpcrdma_unmap_one(device, seg++);
>
> Chuck, shouldn't this be replaced with ib_dma_unmap_sg?
Looks like this was left over from before the conversion
to use ib_dma_unmap_sg. I'll have a look.
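If so, __frwr_dma_unmap could presumably collapse to the same call
frwr_op_unmap already uses, something like (untested sketch):

	struct rpcrdma_frmr *f = &mw->r.frmr;

	ib_dma_unmap_sg(device, f->sg, f->sg_nents, seg->mr_dir);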
> Sorry for the late comment (Didn't find enough time to properly
> review this...)
--
Chuck Lever
On Wed, Dec 16, 2015 at 10:06:33AM -0500, Chuck Lever wrote:
> > Would it make sense to unionize these as they are guaranteed not to
> > execute together? Some people don't like this sort of savings.
>
> I dislike unions because they make the code that uses
> them less readable. I can define macros to help that,
> but sigh! OK.
Shouldn't be an issue with transparent unions these days:
union {
struct ib_reg_wr fr_regwr;
struct ib_send_wr fr_invwr;
};
Hi Anna-
> On Dec 16, 2015, at 8:48 AM, Anna Schumaker <[email protected]> wrote:
>
> Hi Chuck,
>
> Sorry for the last minute comment.
>
> On 12/14/2015 04:19 PM, Chuck Lever wrote:
>> I'm about to add code in the RPC/RDMA reply handler between the
>> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
>> to execute outside of spinlock critical sections.
>>
>> Add a hook to remove an rpc_rqst from the pending list once
>> the transport knows it's going to invoke xprt_complete_rqst().
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/xprt.h | 1 +
>> net/sunrpc/xprt.c | 14 ++++++++++++++
>> net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++++
>> 3 files changed, 19 insertions(+)
>>
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index 69ef5b3..ab6c3a5 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -366,6 +366,7 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
>> void xprt_write_space(struct rpc_xprt *xprt);
>> void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
>> struct rpc_rqst * xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>> +void xprt_commit_rqst(struct rpc_task *task);
>> void xprt_complete_rqst(struct rpc_task *task, int copied);
>> void xprt_release_rqst_cong(struct rpc_task *task);
>> void xprt_disconnect_done(struct rpc_xprt *xprt);
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index 2e98f4a..a5be4ab 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
>> }
>>
>> /**
>> + * xprt_commit_rqst - remove rqst from pending list early
>> + * @task: RPC request to remove
>> + *
>> + * Caller holds transport lock.
>> + */
>> +void xprt_commit_rqst(struct rpc_task *task)
>> +{
>> + struct rpc_rqst *req = task->tk_rqstp;
>> +
>> + list_del_init(&req->rq_list);
>> +}
>> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
>
> Can you move this function into the xprtrdma code, since it's not called outside of there?
I think that's a layering violation, and the idea is
to allow other transports to use this API eventually.
But I'll include this change in the next version of
the series.
> Thanks,
> Anna
>
>> +
>> +/**
>> * xprt_complete_rqst - called when reply processing is complete
>> * @task: RPC request that recently completed
>> * @copied: actual number of bytes received from the transport
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index c10d969..0bc8c39 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>> if (req->rl_reply)
>> goto out_duplicate;
>>
>> + xprt_commit_rqst(rqst->rq_task);
>> + spin_unlock_bh(&xprt->transport_lock);
>> +
>> dprintk("RPC: %s: reply 0x%p completes request 0x%p\n"
>> " RPC request 0x%p xid 0x%08x\n",
>> __func__, rep, req, rqst,
>> @@ -894,6 +897,7 @@ badheader:
>> else if (credits > r_xprt->rx_buf.rb_max_requests)
>> credits = r_xprt->rx_buf.rb_max_requests;
>>
>> + spin_lock_bh(&xprt->transport_lock);
>> cwnd = xprt->cwnd;
>> xprt->cwnd = credits << RPC_CWNDSHIFT;
>> if (xprt->cwnd > cwnd)
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Chuck Lever
> On Dec 16, 2015, at 10:11 AM, Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Dec 16, 2015 at 10:06:33AM -0500, Chuck Lever wrote:
>>> Would it make sense to unionize these as they are guaranteed not to
>>> execute together? Some people don't like this sort of savings.
>>
>> I dislike unions because they make the code that uses
>> them less readable. I can define macros to help that,
>> but sigh! OK.
>
> Shouldn't be an issue with transparent unions these days:
>
> union {
> struct ib_reg_wr fr_regwr;
> struct ib_send_wr fr_invwr;
> };
Right, but isn't that a gcc-ism that Al hates? If
everyone is OK with that construction, I will use it.
--
Chuck Lever
On Wed, Dec 16, 2015 at 10:13:31AM -0500, Chuck Lever wrote:
> > Shouldn't be an issue with transparent unions these days:
> >
> > union {
> > struct ib_reg_wr fr_regwr;
> > struct ib_send_wr fr_invwr;
> > };
>
> Right, but isn't that a gcc-ism that Al hates? If
> everyone is OK with that construction, I will use it.
It started out as a GNUism, but is now supported in C11. We use it
a lot all over the kernel.
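For what it's worth, the call sites stay as readable as before, since members of an anonymous union are accessed as if they were direct members of the containing struct, e.g.:

	reg_wr = &frmr->fr_regwr;
	invalidate_wr = &frmr->fr_invwr;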