For 4.5, I'd like to address the send queue accounting and
invalidation/unmap ordering issues Jason brought up a couple of
months ago. Here's a first shot at that.
Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:
git://git.linux-nfs.org/projects/cel/cel-2.6.git
Or for browsing:
http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5
---
Chuck Lever (9):
xprtrdma: Add a safety margin for receive buffers
xprtrdma: Move struct ib_send_wr off the stack
xprtrdma: Introduce ro_unmap_sync method
xprtrdma: Add ro_unmap_sync method for FRWR
xprtrdma: Add ro_unmap_sync method for FMR
xprtrdma: Add ro_unmap_sync method for all-physical registration
SUNRPC: Introduce xprt_commit_rqst()
xprtrdma: Invalidate in the RPC reply handler
xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').
include/linux/sunrpc/xprt.h | 1
net/sunrpc/xprt.c | 14 +++
net/sunrpc/xprtrdma/fmr_ops.c | 64 +++++++++++++
net/sunrpc/xprtrdma/frwr_ops.c | 175 +++++++++++++++++++++++++++++++-----
net/sunrpc/xprtrdma/physical_ops.c | 13 +++
net/sunrpc/xprtrdma/rpc_rdma.c | 21 ++++
net/sunrpc/xprtrdma/verbs.c | 12 +-
net/sunrpc/xprtrdma/xprt_rdma.h | 17 ++-
8 files changed, 284 insertions(+), 33 deletions(-)
--
Chuck Lever
Rarely, senders post a Send that is larger than the client's inline
threshold. That can be due to a bug, or the client and server may
not have communicated about their inline limits. RPC-over-RDMA
currently doesn't specify any particular limit on inline size, so
peers have to guess what it is.
It is fatal to the connection if the size of a Send is larger than
the client's receive buffer. The sender is likely to retry with the
same message size, so the workload is stuck at that point.
Follow Postel's robustness principle: Be conservative in what you
do, be liberal in what you accept from others. Increase the size of
client receive buffers by a safety margin, and add a warning when
the inline threshold is exceeded during receive.
Note the Linux server's receive buffers are already page-sized.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 7 +++++++
net/sunrpc/xprtrdma/verbs.c | 6 +++++-
net/sunrpc/xprtrdma/xprt_rdma.h | 5 +++++
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..a169252 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
int rdmalen, status;
unsigned long cwnd;
u32 credits;
+ RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
dprintk("RPC: %s: incoming rep %p\n", __func__, rep);
@@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
goto out_badstatus;
if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
goto out_shortreply;
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+ cdata = &r_xprt->rx_data;
+ if (rep->rr_len > cdata->inline_rsize)
+ pr_warn("RPC: %u byte reply exceeds inline threshold\n",
+ rep->rr_len);
+#endif
headerp = rdmab_to_msg(rep->rr_rdmabuf);
if (headerp->rm_vers != rpcrdma_version)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index eadd1655..e3f12e2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
if (rep == NULL)
goto out;
- rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
+ /* The actual size of our receive buffers is increased slightly
+ * to prevent small receive overruns from killing our connection.
+ */
+ rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
+ RPCRDMA_RECV_MARGIN,
GFP_KERNEL);
if (IS_ERR(rep->rr_rdmabuf)) {
rc = PTR_ERR(rep->rr_rdmabuf);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ac7f8d4..1b72ab1 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
#define RPCRDMA_INLINE_PAD_VALUE(rq)\
rpcx_to_rdmad(rq->rq_xprt).padding
+/* To help prevent spurious connection shutdown, allow senders to
+ * overrun our receive inline threshold by a small bit.
+ */
+#define RPCRDMA_RECV_MARGIN (128)
+
/*
* Statistics for RPCRDMA
*/
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structures off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit, registering or invalidating a set of
MRs with a single ib_post_send().
(This will be used for chaining LOCAL_INV requests.)
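For illustration only (not part of this patch), here is a rough sketch
of the kind of chaining the embedded WRs make possible; structure and
field names follow this series, and error handling is omitted:

/* Illustrative sketch: link the LOCAL_INV WR embedded in each
 * rpcrdma_frmr and ring the doorbell once for the whole chain.
 */
static int example_post_linv_chain(struct rpcrdma_ia *ia,
				   struct rpcrdma_mw **mws, int count)
{
	struct ib_send_wr *first = NULL, *prev = NULL, *bad_wr;
	int i;

	for (i = 0; i < count; i++) {
		struct ib_send_wr *wr = &mws[i]->r.frmr.fr_invwr;

		memset(wr, 0, sizeof(*wr));
		wr->wr_id = (uintptr_t)mws[i];
		wr->opcode = IB_WR_LOCAL_INV;
		wr->ex.invalidate_rkey = mws[i]->r.frmr.fr_mr->rkey;

		if (!first)
			first = wr;
		else
			prev->next = wr;
		prev = wr;
	}

	if (!first)
		return 0;

	/* Signal only the last WR; send queue ordering means its
	 * completion implies the rest of the chain has completed.
	 */
	prev->send_flags = IB_SEND_SIGNALED;
	return ib_post_send(ia->ri_id->qp, first, &bad_wr);
}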
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 38 ++++++++++++++++++++------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
2 files changed, 22 insertions(+), 18 deletions(-)
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 88cf9e7..31a4578 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -319,7 +319,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
- struct ib_reg_wr reg_wr;
+ struct ib_reg_wr *reg_wr;
struct ib_send_wr *bad_wr;
int rc, i, n, dma_nents;
u8 key;
@@ -336,6 +336,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
mr = frmr->fr_mr;
+ reg_wr = &frmr->fr_regwr;
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
@@ -381,19 +382,19 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
key = (u8)(mr->rkey & 0x000000FF);
ib_update_fast_reg_key(mr, ++key);
- reg_wr.wr.next = NULL;
- reg_wr.wr.opcode = IB_WR_REG_MR;
- reg_wr.wr.wr_id = (uintptr_t)mw;
- reg_wr.wr.num_sge = 0;
- reg_wr.wr.send_flags = 0;
- reg_wr.mr = mr;
- reg_wr.key = mr->rkey;
- reg_wr.access = writing ?
- IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
- IB_ACCESS_REMOTE_READ;
+ reg_wr->wr.next = NULL;
+ reg_wr->wr.opcode = IB_WR_REG_MR;
+ reg_wr->wr.wr_id = (uintptr_t)mw;
+ reg_wr->wr.num_sge = 0;
+ reg_wr->wr.send_flags = 0;
+ reg_wr->mr = mr;
+ reg_wr->key = mr->rkey;
+ reg_wr->access = writing ?
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+ IB_ACCESS_REMOTE_READ;
DECR_CQCOUNT(&r_xprt->rx_ep);
- rc = ib_post_send(ia->ri_id->qp, &reg_wr.wr, &bad_wr);
+ rc = ib_post_send(ia->ri_id->qp, &reg_wr->wr, &bad_wr);
if (rc)
goto out_senderr;
@@ -423,23 +424,24 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
- struct ib_send_wr invalidate_wr, *bad_wr;
+ struct ib_send_wr *invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
dprintk("RPC: %s: FRMR %p\n", __func__, mw);
seg1->rl_mw = NULL;
frmr->fr_state = FRMR_IS_INVALID;
+ invalidate_wr = &mw->r.frmr.fr_invwr;
- memset(&invalidate_wr, 0, sizeof(invalidate_wr));
- invalidate_wr.wr_id = (unsigned long)(void *)mw;
- invalidate_wr.opcode = IB_WR_LOCAL_INV;
- invalidate_wr.ex.invalidate_rkey = frmr->fr_mr->rkey;
+ memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+ invalidate_wr->wr_id = (uintptr_t)mw;
+ invalidate_wr->opcode = IB_WR_LOCAL_INV;
+ invalidate_wr->ex.invalidate_rkey = frmr->fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);
read_lock(&ia->ri_qplock);
- rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+ rc = ib_post_send(ia->ri_id->qp, invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 1b72ab1..2c34159 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -207,6 +207,8 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct fr_work;
struct rpcrdma_xprt *fr_xprt;
+ struct ib_reg_wr fr_regwr;
+ struct ib_send_wr fr_invwr;
};
struct rpcrdma_fmr {
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchronously (the MR will be knocked down
and returned to the pool in the background).
To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.
The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.
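For context, the call site this enables looks roughly like the
following; it is added later in this series by "xprtrdma: Invalidate
in the RPC reply handler":

	if (req->rl_nchunks)
		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);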
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 2c34159..bd2022e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -371,6 +371,8 @@ struct rpcrdma_xprt;
struct rpcrdma_memreg_ops {
int (*ro_map)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *, int, bool);
+ void (*ro_unmap_sync)(struct rpcrdma_xprt *,
+ struct rpcrdma_req *);
int (*ro_unmap)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *);
int (*ro_open)(struct rpcrdma_ia *,
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.
Note also, DMA unmapping is now done _after_ invalidation.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 137 ++++++++++++++++++++++++++++++++++++++-
net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
2 files changed, 135 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 31a4578..5d58008 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -245,12 +245,14 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
}
-/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs
+ * to be reset.
+ *
+ * WARNING: Only wr_id and status are reliable at this point
+ */
static void
-frwr_sendcompletion(struct ib_wc *wc)
+__frwr_sendcompletion_flush(struct ib_wc *wc, struct rpcrdma_mw *r)
{
- struct rpcrdma_mw *r;
-
if (likely(wc->status == IB_WC_SUCCESS))
return;
@@ -261,9 +263,23 @@ frwr_sendcompletion(struct ib_wc *wc)
else
pr_warn("RPC: %s: frmr %p error, status %s (%d)\n",
__func__, r, ib_wc_status_msg(wc->status), wc->status);
+
r->r.frmr.fr_state = FRMR_IS_STALE;
}
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+ struct rpcrdma_mw *r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+ struct rpcrdma_frmr *f = &r->r.frmr;
+
+ if (unlikely(wc->status != IB_WC_SUCCESS))
+ __frwr_sendcompletion_flush(wc, r);
+
+ if (f->fr_waiter)
+ complete(&f->fr_linv_done);
+}
+
static int
frwr_op_init(struct rpcrdma_xprt *r_xprt)
{
@@ -335,6 +351,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
+ frmr->fr_waiter = false;
mr = frmr->fr_mr;
reg_wr = &frmr->fr_regwr;
@@ -414,6 +431,117 @@ out_senderr:
return rc;
}
+static struct ib_send_wr *
+__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
+{
+ struct rpcrdma_mw *mw = seg->rl_mw;
+ struct rpcrdma_frmr *f = &mw->r.frmr;
+ struct ib_send_wr *invalidate_wr;
+
+ f->fr_waiter = false;
+ f->fr_state = FRMR_IS_INVALID;
+ invalidate_wr = &f->fr_invwr;
+
+ memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+ invalidate_wr->wr_id = (unsigned long)(void *)mw;
+ invalidate_wr->opcode = IB_WR_LOCAL_INV;
+ invalidate_wr->ex.invalidate_rkey = f->fr_mr->rkey;
+
+ return invalidate_wr;
+}
+
+static void
+__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+ int rc)
+{
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
+ struct rpcrdma_mw *mw = seg->rl_mw;
+ int nsegs = seg->mr_nsegs;
+
+ seg->rl_mw = NULL;
+
+ while (nsegs--)
+ rpcrdma_unmap_one(device, seg++);
+
+ if (!rc)
+ rpcrdma_put_mw(r_xprt, mw);
+ else
+ __frwr_queue_recovery(mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+ struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr;
+ struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+ struct rpcrdma_mr_seg *seg;
+ unsigned int i, nchunks;
+ struct rpcrdma_frmr *f;
+ int rc;
+
+ dprintk("RPC: %s: req %p\n", __func__, req);
+
+ /* ORDER: Invalidate all of the req's MRs first
+ *
+ * Chain the LOCAL_INV Work Requests and post them with
+ * a single ib_post_send() call.
+ */
+ invalidate_wrs = pos = prev = NULL;
+ seg = NULL;
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+
+ pos = __frwr_prepare_linv_wr(seg);
+
+ if (!invalidate_wrs)
+ invalidate_wrs = pos;
+ else
+ prev->next = pos;
+ prev = pos;
+
+ i += seg->mr_nsegs;
+ }
+ f = &seg->rl_mw->r.frmr;
+
+ /* Strong send queue ordering guarantees that when the
+ * last WR in the chain completes, all WRs in the chain
+ * are complete.
+ */
+ f->fr_invwr.send_flags = IB_SEND_SIGNALED;
+ f->fr_waiter = true;
+ init_completion(&f->fr_linv_done);
+ INIT_CQCOUNT(&r_xprt->rx_ep);
+
+ /* Transport disconnect drains the receive CQ before it
+ * replaces the QP. The RPC reply handler won't call us
+ * unless ri_id->qp is a valid pointer.
+ */
+ rc = ib_post_send(ia->ri_id->qp, invalidate_wrs, &bad_wr);
+ if (rc)
+ pr_warn("%s: ib_post_send failed %i\n", __func__, rc);
+
+ wait_for_completion(&f->fr_linv_done);
+
+ /* ORDER: Now DMA unmap all of the req's MRs, and return
+ * them to the free MW list.
+ */
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+
+ __frwr_dma_unmap(r_xprt, seg, rc);
+
+ i += seg->mr_nsegs;
+ seg->mr_nsegs = 0;
+ }
+
+ req->rl_nchunks = 0;
+}
+
/* Post a LOCAL_INV Work Request to prevent further remote access
* via RDMA READ or RDMA WRITE.
*/
@@ -473,6 +601,7 @@ frwr_op_destroy(struct rpcrdma_buffer *buf)
const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
+ .ro_unmap_sync = frwr_op_unmap_sync,
.ro_unmap = frwr_op_unmap,
.ro_open = frwr_op_open,
.ro_maxpages = frwr_op_maxpages,
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index bd2022e..45dd0b7 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -207,6 +207,8 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct fr_work;
struct rpcrdma_xprt *fr_xprt;
+ bool fr_waiter;
+ struct completion fr_linv_done;
struct ib_reg_wr fr_regwr;
struct ib_send_wr fr_invwr;
};
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.
1. Gather all the MRs for the RPC request onto a list, and invoke
ib_unmap_fmr() once with that list. This reduces the number of
doorbells when there is more than one MR to invalidate.
2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
This is critical after invalidating a Write chunk.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 64 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f1e8daf..c14f3a4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -179,6 +179,69 @@ out_maperr:
return rc;
}
+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
+ struct rpcrdma_mw *mw = seg->rl_mw;
+ int nsegs = seg->mr_nsegs;
+
+ seg->rl_mw = NULL;
+
+ while (nsegs--)
+ rpcrdma_unmap_one(device, seg++);
+
+ rpcrdma_put_mw(r_xprt, mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+ struct rpcrdma_mr_seg *seg;
+ unsigned int i, nchunks;
+ struct rpcrdma_mw *mw;
+ LIST_HEAD(unmap_list);
+ int rc;
+
+ dprintk("RPC: %s: req %p\n", __func__, req);
+
+ /* ORDER: Invalidate all of the req's MRs first
+ *
+ * ib_unmap_fmr() is slow, so use a single call instead
+ * of one call per mapped MR.
+ */
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+ mw = seg->rl_mw;
+
+ list_add(&mw->r.fmr.fmr->list, &unmap_list);
+
+ i += seg->mr_nsegs;
+ }
+ rc = ib_unmap_fmr(&unmap_list);
+ if (rc)
+ pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+
+ /* ORDER: Now DMA unmap all of the req's MRs, and return
+ * them to the free MW list.
+ */
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+
+ __fmr_dma_unmap(r_xprt, seg);
+
+ i += seg->mr_nsegs;
+ seg->mr_nsegs = 0;
+ }
+
+ req->rl_nchunks = 0;
+}
+
/* Use the ib_unmap_fmr() verb to prevent further remote
* access via RDMA READ or RDMA WRITE.
*/
@@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+ .ro_unmap_sync = fmr_op_unmap_sync,
.ro_unmap = fmr_op_unmap,
.ro_open = fmr_op_open,
.ro_maxpages = fmr_op_maxpages,
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/physical_ops.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 617b76f..dbb302e 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -83,6 +83,18 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
return 1;
}
+/* DMA unmap all memory regions that were mapped for "req".
+ */
+static void
+physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
+ unsigned int i;
+
+ for (i = 0; req->rl_nchunks; --req->rl_nchunks)
+ rpcrdma_unmap_one(device, &req->rl_segments[i++]);
+}
+
static void
physical_op_destroy(struct rpcrdma_buffer *buf)
{
@@ -90,6 +102,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
+ .ro_unmap_sync = physical_op_unmap_sync,
.ro_unmap = physical_op_unmap,
.ro_open = physical_op_open,
.ro_maxpages = physical_op_maxpages,
I'm about to add code in the RPC/RDMA reply handler, between the
xprt_lookup_rqst() and xprt_complete_rqst() call sites, that needs
to execute outside of spinlock critical sections.
Add a hook to remove an rpc_rqst from the pending list once
the transport knows it's going to invoke xprt_complete_rqst().
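A rough sketch of the reply-handler flow this hook enables
(simplified and illustrative only; the real code is in
rpcrdma_reply_handler):

static void example_reply_flow(struct rpc_xprt *xprt, __be32 xid,
			       int copied)
{
	struct rpc_rqst *rqst;

	spin_lock_bh(&xprt->transport_lock);
	rqst = xprt_lookup_rqst(xprt, xid);
	if (!rqst) {
		spin_unlock_bh(&xprt->transport_lock);
		return;
	}
	xprt_commit_rqst(rqst->rq_task);	/* off the pending list */
	spin_unlock_bh(&xprt->transport_lock);

	/* Work that must not run in a spinlock critical section,
	 * such as synchronous invalidation, goes here.
	 */

	spin_lock_bh(&xprt->transport_lock);
	xprt_complete_rqst(rqst->rq_task, copied);
	spin_unlock_bh(&xprt->transport_lock);
}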
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/xprt.h | 1 +
net/sunrpc/xprt.c | 14 ++++++++++++++
net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++++
3 files changed, 19 insertions(+)
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..ab6c3a5 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -366,6 +366,7 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
void xprt_write_space(struct rpc_xprt *xprt);
void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
struct rpc_rqst * xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
+void xprt_commit_rqst(struct rpc_task *task);
void xprt_complete_rqst(struct rpc_task *task, int copied);
void xprt_release_rqst_cong(struct rpc_task *task);
void xprt_disconnect_done(struct rpc_xprt *xprt);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..a5be4ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
}
/**
+ * xprt_commit_rqst - remove rqst from pending list early
+ * @task: RPC request to remove
+ *
+ * Caller holds transport lock.
+ */
+void xprt_commit_rqst(struct rpc_task *task)
+{
+ struct rpc_rqst *req = task->tk_rqstp;
+
+ list_del_init(&req->rq_list);
+}
+EXPORT_SYMBOL_GPL(xprt_commit_rqst);
+
+/**
* xprt_complete_rqst - called when reply processing is complete
* @task: RPC request that recently completed
* @copied: actual number of bytes received from the transport
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index a169252..d7b9156 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -811,6 +811,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
if (req->rl_reply)
goto out_duplicate;
+ xprt_commit_rqst(rqst->rq_task);
+ spin_unlock_bh(&xprt->transport_lock);
+
dprintk("RPC: %s: reply 0x%p completes request 0x%p\n"
" RPC request 0x%p xid 0x%08x\n",
__func__, rep, req, rqst,
@@ -901,6 +904,7 @@ badheader:
else if (credits > r_xprt->rx_buf.rb_max_requests)
credits = r_xprt->rx_buf.rb_max_requests;
+ spin_lock_bh(&xprt->transport_lock);
cwnd = xprt->cwnd;
xprt->cwnd = credits << RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.
The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.
This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each of those RPCs at the same time.
Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.
Now, the reply handler waits for invalidation to complete before
allowing the next RPC call to be dispatched, which guarantees there
are enough send queue resources for it to proceed.
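In rough terms, the accounting this relies on (illustrative only, and
assuming a single registered MR per RPC; more MRs mean more FASTREG
and LOCAL_INV WRs):

enum {
	/* Worst-case send queue consumption per outstanding RPC */
	EXAMPLE_WRS_PER_RPC = 1 /* FASTREG */ +
			      1 /* RDMA Send */ +
			      1 /* LOCAL_INV */,
};
/* With N credits, roughly N * EXAMPLE_WRS_PER_RPC send WQEs must be
 * available; waking the next RPC only after invalidation completes
 * keeps usage inside that budget.
 */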
Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.
Reported-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d7b9156..4ad72ca 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -898,6 +898,16 @@ badheader:
break;
}
+ /* Invalidate and flush the data payloads before waking the
+ * waiting application. This guarantees the memory region is
+ * properly fenced from the server before the application
+ * accesses the data. It also ensures proper send flow
+ * control: waking the next RPC waits until this RPC has
+ * relinquished all its Send Queue entries.
+ */
+ if (req->rl_nchunks)
+ r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
+
credits = be32_to_cpu(headerp->rm_credit);
if (credits == 0)
credits = 1; /* don't deadlock */
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.
Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.
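For reference, rep_cqinit is consumed roughly like this when a Send
is posted (paraphrased from rpcrdma_ep_post(); not part of this
patch):

	if (DECR_CQCOUNT(ep) > 0)
		send_wr.send_flags = 0;
	else {
		/* Provider must take a send completion every now and then */
		INIT_CQCOUNT(ep);
		send_wr.send_flags = IB_SEND_SIGNALED;
	}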
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 6 ++----
net/sunrpc/xprtrdma/xprt_rdma.h | 6 ------
2 files changed, 2 insertions(+), 10 deletions(-)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e3f12e2..e3baf05 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -616,10 +616,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
- if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS)
- ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS;
- else if (ep->rep_cqinit <= 2)
- ep->rep_cqinit = 0;
+ if (ep->rep_cqinit <= 2)
+ ep->rep_cqinit = 0; /* always signal? */
INIT_CQCOUNT(ep);
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 45dd0b7..c166aeb 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -88,12 +88,6 @@ struct rpcrdma_ep {
struct delayed_work rep_connect_worker;
};
-/*
- * Force a signaled SEND Work Request every so often,
- * in case the provider needs to do some housekeeping.
- */
-#define RPCRDMA_MAX_UNSIGNALED_SENDS (32)
-
#define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
#define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
On 11/23/2015 5:13 PM, Chuck Lever wrote:
> Rarely, senders post a Send that is larger than the client's inline
> threshold. That can be due to a bug, or the client and server may
> not have communicated about their inline limits. RPC-over-RDMA
> currently doesn't specify any particular limit on inline size, so
> peers have to guess what it is.
>
> It is fatal to the connection if the size of a Send is larger than
> the client's receive buffer. The sender is likely to retry with the
> same message size, so the workload is stuck at that point.
>
> Follow Postel's robustness principle: Be conservative in what you
> do, be liberal in what you accept from others. Increase the size of
> client receive buffers by a safety margin, and add a warning when
> the inline threshold is exceeded during receive.
Safety is good, but how do we know the chosen value is enough?
Isn't it better to fail the badly-composed request and be done
with it? Even if the stupid sender loops, which it will do
anyway.
On 11/23/2015 5:14 PM, Chuck Lever wrote:
> FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
> is a synchronous verb. However, some improvements can be made here.
I thought FMR support was about to be removed in the core.
On 11/23/2015 5:14 PM, Chuck Lever wrote:
> There is a window between the time the RPC reply handler wakes the
> waiting RPC task and when xprt_release() invokes ops->buf_free.
> During this time, memory regions containing the data payload may
> still be accessed by a broken or malicious server, but the RPC
> application has already been allowed access to the memory containing
> the RPC request's data payloads.
>
> The server should be fenced from client memory containing RPC data
> payloads _before_ the RPC application is allowed to continue.
No concern, just Hurray for implementing this fundamental
integrity requirement. It's even more important when krb5i
is in play, to avoid truly malicious injection-after-integrity.
> On Nov 23, 2015, at 7:55 PM, Tom Talpey <[email protected]> wrote:
>
> On 11/23/2015 5:13 PM, Chuck Lever wrote:
>> Rarely, senders post a Send that is larger than the client's inline
>> threshold. That can be due to a bug, or the client and server may
>> not have communicated about their inline limits. RPC-over-RDMA
>> currently doesn't specify any particular limit on inline size, so
>> peers have to guess what it is.
>>
>> It is fatal to the connection if the size of a Send is larger than
>> the client's receive buffer. The sender is likely to retry with the
>> same message size, so the workload is stuck at that point.
>>
>> Follow Postel's robustness principle: Be conservative in what you
>> do, be liberal in what you accept from others. Increase the size of
>> client receive buffers by a safety margin, and add a warning when
>> the inline threshold is exceeded during receive.
>
> Safety is good, but how do we know the chosen value is enough?
> Isn't it better to fail the badly-composed request and be done
> with it? Even if the stupid sender loops, which it will do
> anyway.
It’s good enough to compensate for the most common sender bug,
which is that the sender did not account for the 28 bytes of
the RPC-over-RDMA header when it built the send buffer. The
additional 100 byte margin is gravy.
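A quick illustration of that arithmetic (the EXAMPLE_* names are made
up; only RPCRDMA_HDRLEN_MIN and RPCRDMA_RECV_MARGIN are real):

enum {
	EXAMPLE_HDR_FIXED   = 4 * 4,	/* xid, vers, credit, proc */
	EXAMPLE_HDR_CHUNKS  = 3 * 4,	/* empty read/write/reply chunk lists */
	EXAMPLE_HDR_BYTES   = EXAMPLE_HDR_FIXED + EXAMPLE_HDR_CHUNKS,
					/* = 28 == RPCRDMA_HDRLEN_MIN */
	EXAMPLE_EXTRA_SLACK = 128 - EXAMPLE_HDR_BYTES,
					/* = 100 bytes of RPCRDMA_RECV_MARGIN left */
};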
The loop occurs because the server gets a completion error.
The client just sees a connection loss. There’s no way for it
to know it should fail the RPC, so it keeps trying.
Perhaps the server could remember that the reply failed, and
when the client retransmits, it can simply return that XID
with an RDMA_ERROR.
--
Chuck Lever
On 11/23/2015 8:16 PM, Chuck Lever wrote:
>
>> On Nov 23, 2015, at 7:55 PM, Tom Talpey <[email protected]> wrote:
>>
>> On 11/23/2015 5:13 PM, Chuck Lever wrote:
>>> Rarely, senders post a Send that is larger than the client's inline
>>> threshold. That can be due to a bug, or the client and server may
>>> not have communicated about their inline limits. RPC-over-RDMA
>>> currently doesn't specify any particular limit on inline size, so
>>> peers have to guess what it is.
>>>
>>> It is fatal to the connection if the size of a Send is larger than
>>> the client's receive buffer. The sender is likely to retry with the
>>> same message size, so the workload is stuck at that point.
>>>
>>> Follow Postel's robustness principle: Be conservative in what you
>>> do, be liberal in what you accept from others. Increase the size of
>>> client receive buffers by a safety margin, and add a warning when
>>> the inline threshold is exceeded during receive.
>>
>> Safety is good, but how do we know the chosen value is enough?
>> Isn't it better to fail the badly-composed request and be done
>> with it? Even if the stupid sender loops, which it will do
>> anyway.
>
> It’s good enough to compensate for the most common sender bug,
> which is that the sender did not account for the 28 bytes of
> the RPC-over-RDMA header when it built the send buffer. The
> additional 100 byte margin is gravy.
I think it's good to have sympathy and resilience to differing
designs on the other end of the wire, but I fail to have it for
stupid bugs. Unless this can take down the receiver, fail it fast.
MHO.
> On Nov 23, 2015, at 8:22 PM, Tom Talpey <[email protected]> wrote:
>
> On 11/23/2015 8:16 PM, Chuck Lever wrote:
>>
>>> On Nov 23, 2015, at 7:55 PM, Tom Talpey <[email protected]> wrote:
>>>
>>> On 11/23/2015 5:13 PM, Chuck Lever wrote:
>>>> Rarely, senders post a Send that is larger than the client's inline
>>>> threshold. That can be due to a bug, or the client and server may
>>>> not have communicated about their inline limits. RPC-over-RDMA
>>>> currently doesn't specify any particular limit on inline size, so
>>>> peers have to guess what it is.
>>>>
>>>> It is fatal to the connection if the size of a Send is larger than
>>>> the client's receive buffer. The sender is likely to retry with the
>>>> same message size, so the workload is stuck at that point.
>>>>
>>>> Follow Postel's robustness principle: Be conservative in what you
>>>> do, be liberal in what you accept from others. Increase the size of
>>>> client receive buffers by a safety margin, and add a warning when
>>>> the inline threshold is exceeded during receive.
>>>
>>> Safety is good, but how do we know the chosen value is enough?
>>> Isn't it better to fail the badly-composed request and be done
>>> with it? Even if the stupid sender loops, which it will do
>>> anyway.
>>
>> It’s good enough to compensate for the most common sender bug,
>> which is that the sender did not account for the 28 bytes of
>> the RPC-over-RDMA header when it built the send buffer. The
>> additional 100 byte margin is gravy.
>
> I think it's good to have sympathy and resilience to differing
> designs on the other end of the wire, but I fail to have it for
> stupid bugs. Unless this can take down the receiver, fail it fast.
>
> MHO.
See above: the client can’t tell why it’s failed.
Again, the Send on the server side fails with LOCAL_LEN_ERR
and the connection is terminated. The client sees only the
connection loss. There’s no distinction between this and a
cable pull or a server crash, where you do want the client
to retransmit.
I agree it’s a dumb server bug. But the idea is to preserve
the connection, since NFS retransmits are a hazard.
Just floating this idea here, this is v1. This one can be
dropped.
--
Chuck Lever
On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
> In the current xprtrdma implementation, some memreg strategies
> implement ro_unmap synchronously (the MR is knocked down before the
> method returns) and some asynchronously (the MR will be knocked down
> and returned to the pool in the background).
>
> To guarantee the MR is truly invalid before the RPC consumer is
> allowed to resume execution, we need an unmap method that is
> always synchronous, invoked from the RPC/RDMA reply handler.
>
> The new method unmaps all MRs for an RPC. The existing ro_unmap
> method unmaps only one MR at a time.
Do we really want to go down that road? It seems like we've decided
in general that while the protocol specs say MR must be unmapped before
proceeding with the data that is painful enough to ignore this
requirement. E.g. iser for example only does the local invalidate
just before reusing the MR.
I'd like to hear arguments for and against each method instead of
adding more magic to drivers to either optimize MR performance and
add clunky workarounds to make it even slower, and instead handle
the semantics we agreed upon in common code.
On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:
> On 11/23/2015 5:14 PM, Chuck Lever wrote:
> >FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
> >is a synchronous verb. However, some improvements can be made here.
>
> I thought FMR support was about to be removed in the core.
Seems like everyone would love to kill it, but no one dares to do
it just yet. Reasons to keep FMRs:
- mthca doesn't support FRs but hasn't been staged out
- RDS only supports FMRs for the IB transport (it does support FRs for
an entirely separate iWarp transport. I'd love to know why we can't
just use that iWarp transport for IB as well).
- mlx4 hardware might be slower with FRs than FMRs (Or mentioned this
in reply to the iSER remote invalidation series).
So at least for 4.5 we're unlikely to be able to get rid of it alone
due to the RDS issue. We'll then need performance numbers for mlx4,
and figure out how much we care about mthca.
On Mon, Nov 23, 2015 at 10:52:26PM -0800, Christoph Hellwig wrote:
>
> So at least for 4.5 we're unlikely to be able to get rid of it alone
> due to the RDS issue. We'll then need performance numbers for mlx4,
> and figure out how much we care about mthca.
mthca is unfortunately very popular in the used market, it is very
easy to get cards, build a cheap test cluster, etc. :(
Jason
On Mon, Nov 23, 2015 at 10:45:56PM -0800, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
> > In the current xprtrdma implementation, some memreg strategies
> > implement ro_unmap synchronously (the MR is knocked down before the
> > method returns) and some asynchronously (the MR will be knocked down
> > and returned to the pool in the background).
> >
> > To guarantee the MR is truly invalid before the RPC consumer is
> > allowed to resume execution, we need an unmap method that is
> > always synchronous, invoked from the RPC/RDMA reply handler.
> >
> > The new method unmaps all MRs for an RPC. The existing ro_unmap
> > method unmaps only one MR at a time.
>
> Do we really want to go down that road? It seems like we've decided
> in general that while the protocol specs say MR must be unmapped before
> proceeding with the data that is painful enough to ignore this
That is not my impression, I was thinking we keep finding that ULPs
are not implemented correctly. The various clean up exercises keep
exposing flaws.
The common code is intended to drive RDMA properly.
Async invalidating the rkey is fundamentally a security issue and
should be treated as such. The kernel never trades security for
performance without a user opt-in. This is the same logic we've used
for purging the global writable rkey stuff, even though that often had
a performance benefit.
> requirement. E.g. iser for example only does the local invalidate
> just before reusing the MR.
Ugh :(
> I'd like to hear arguments for and against each method instead of
> adding more magic to drivers to either optimize MR performance and
> add clunky workarounds to make it even slower, and instead handle
> the semantics we agreed upon in common code.
Common code should make it easy to do this right: an invalidate of the
MR ordered before the DMA unmap, which must complete before the buffer
is handed back to the caller, with easy support for Send With
Invalidate.
If the common code has an opt-in to make some of these steps run
async, and that gives performance, then fine, but the default should be
secure operation.
Jason
On 24/11/2015 08:45, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
>> In the current xprtrdma implementation, some memreg strategies
>> implement ro_unmap synchronously (the MR is knocked down before the
>> method returns) and some asynchronously (the MR will be knocked down
>> and returned to the pool in the background).
>>
>> To guarantee the MR is truly invalid before the RPC consumer is
>> allowed to resume execution, we need an unmap method that is
>> always synchronous, invoked from the RPC/RDMA reply handler.
>>
>> The new method unmaps all MRs for an RPC. The existing ro_unmap
>> method unmaps only one MR at a time.
>
> Do we really want to go down that road? It seems like we've decided
> in general that while the protocol specs say MR must be unmapped before
> proceeding with the data that is painful enough to ignore this
> requirement. E.g. iser for example only does the local invalidate
> just before reusing the MR.
It is painful, too painful. The entire value proposition of RDMA is
low latency, and waiting for the extra HW round-trip for a local
invalidation to complete is unacceptable; moreover it adds huge loads
of extra interrupts and cache-line pollution.
As I see it, if we don't wait for local-invalidate to complete before
unmap and IO completion (and no one does) then local invalidate before
re-use is only marginally worse. For iSER, remote invalidate solves this
(patches submitted!) and I'd say we should push for all the
storage standards to include remote invalidate. There is the question
of multi-rkey transactions, which is why I stated in the past that
arbitrary sg registration is important (which will be submitted soon
for ConnectX-4).
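(For readers less familiar with the verbs involved, a minimal sketch
of a responder's Send With Invalidate post; this is not from any of
these patches:)

static int example_send_with_invalidate(struct ib_qp *qp,
					struct ib_sge *sge,
					u32 peer_rkey)
{
	struct ib_send_wr wr, *bad_wr;

	memset(&wr, 0, sizeof(wr));
	wr.opcode = IB_WR_SEND_WITH_INV;
	wr.ex.invalidate_rkey = peer_rkey;	/* rkey the requester exposed */
	wr.sg_list = sge;
	wr.num_sge = 1;
	wr.send_flags = IB_SEND_SIGNALED;

	return ib_post_send(qp, &wr, &bad_wr);
}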
Waiting for local invalidate to complete would be a really big
sacrifice for our storage ULPs.
On 11/24/2015 1:52 AM, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:
>> On 11/23/2015 5:14 PM, Chuck Lever wrote:
>>> FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
>>> is a synchronous verb. However, some improvements can be made here.
>>
>> I thought FMR support was about to be removed in the core.
>
> Seems like everyone would love to kill it, but no one dares to do
> it just yet. Reasons to keep FMRs:
>
> - mthca doesn't support FRs but hasn't been staged out
> - RDS only supports FMRs for the IB transport (it does support FRs for
> an entirely separate iWarp transport. I'd love to know why we can't
> just use that iWarp transport for IB as well).
> - mlx4 hardware might be slower with FRs than FMRs (Or mentioned this
> in reply to the iSER remote invalidation series).
Ok, let me add then.
Reasons to kill FMRs:
- FMRs are UNSAFE. They protect on page boundaries, not bytes;
therefore a length error on an operation can corrupt data
outside the I/O buffer remotely. To say nothing of maliciousness.
- FMRs are SLOWER. The supposed performance improvement of FMR
comes from the "mempool" that is often placed in front of them.
These pools cache remotely registered regions, in the hope of
reusing them without having to invalidate in between. See the
first bullet for the ramifications of that.
- FMRs are SYNCHRONOUS. The FMR verbs make direct calls into the
verbs which block and/or do not return control to the caller
as promptly as work queue operations do. The processor spends
more time servicing the device, typically at raised irq.
- FMRs are PROPRIETARY. Only one vendor supports them, only one
of their NICs has no decent alternative, and that NIC is
functionally obsolete.
If storage stacks continue to support them, I personally feel it
is critical to carefully document the risks to customer data as
part of shipping the code.
Tom.
On 11/24/2015 5:59 AM, Sagi Grimberg wrote:
> As I see it, if we don't wait for local-invalidate to complete before
> unmap and IO completion (and no one does)
For the record, that is false. Windows quite strictly performs these
steps, and delivers millions of real 4K IOPS over SMB Direct.
> Waiting for local invalidate to complete would be a really big
> sacrifice for our storage ULPs.
Not waiting would also be a sacrifice, by compromising data integrity.
It's a tough tradeoff, but if you choose to go that way it will be
critical to be honest about the consequences to the data.
Tom.
> On Nov 24, 2015, at 5:59 AM, Sagi Grimberg <[email protected]> wrote:
>
>
>
> On 24/11/2015 08:45, Christoph Hellwig wrote:
>> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
>>> In the current xprtrdma implementation, some memreg strategies
>>> implement ro_unmap synchronously (the MR is knocked down before the
>>> method returns) and some asynchronously (the MR will be knocked down
>>> and returned to the pool in the background).
>>>
>>> To guarantee the MR is truly invalid before the RPC consumer is
>>> allowed to resume execution, we need an unmap method that is
>>> always synchronous, invoked from the RPC/RDMA reply handler.
>>>
>>> The new method unmaps all MRs for an RPC. The existing ro_unmap
>>> method unmaps only one MR at a time.
>>
>> Do we really want to go down that road? It seems like we've decided
>> in general that while the protocol specs say MR must be unmapped before
>> proceeding with the data that is painful enough to ignore this
>> requirement. E.g. iser for example only does the local invalidate
>> just before reusing the MR.
That leaves the MR exposed to the remote indefinitely. If
the MR is registered for remote write, that seems hazardous.
> It is painful, too painful. The entire value proposition of RDMA is
> low-latency and waiting for the extra HW round-trip for a local
> invalidation to complete is unacceptable; moreover, it adds a huge load
> of extra interrupts and cache-line pollution.
The killer is the extra context switches, I've found.
> As I see it, if we don't wait for local-invalidate to complete before
> unmap and IO completion (and no one does) then local invalidate before
> re-use is only marginally worse. For iSER, remote invalidate solves this (patches submitted!) and I'd say we should push for all the
> storage standards to include remote invalidate.
I agree: the right answer is to use remote invalidation,
and to ensure the order is always:
1. invalidate the MR
2. unmap the MR
3. wake up the consumer
And that is exactly my strategy for NFS/RDMA. I don't have
a choice: as Tom observed yesterday, krb5i is meaningless
unless the integrity of the data is guaranteed by fencing
the server before the client performs checksumming. I
expect the same is true for T10-PI.
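Condensed, that ordering in the reply handler looks like this (call
signatures simplified from the posted series; rl_nchunks is non-zero
when the RPC registered memory):

        /* 1 + 2. Fence the server: invalidate every MR registered for
         *        this RPC and wait for the LOCAL_INV completions; the
         *        DMA unmap happens inside ro_unmap_sync afterward. */
        if (req->rl_nchunks)
                r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);

        /* 3. Only then wake the RPC consumer, which may go on to verify
         *    a krb5i checksum or T10-PI over the received data. */
        spin_lock_bh(&xprt->transport_lock);
        xprt_complete_rqst(rqst->rq_task, status);
        spin_unlock_bh(&xprt->transport_lock);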
> There is the question
> of multi-rkey transactions, which is why I stated in the past that
> arbitrary sg registration is important (which will be submitted soon
> for ConnectX-4).
>
> Waiting for local invalidate to complete would be a really big
> sacrifice for our storage ULPs.
I've noticed only a marginal loss of performance on modern
hardware.
--
Chuck Lever
Hey Tom,
> On 11/24/2015 5:59 AM, Sagi Grimberg wrote:
>> As I see it, if we don't wait for local-invalidate to complete before
>> unmap and IO completion (and no one does)
>
> For the record, that is false. Windows quite strictly performs these
> steps,
I meant Linux ULPs. I'm not very familiar with Windows SMBD.
> and delivers millions of real 4K IOPS over SMB Direct.
That's very encouraging! Is this the client-side scaling?
From my experience, getting a storage client/initiator to scale
up to 1.5-3 MIOPs over a single HCA with a limited number of cores
is really a struggle for each interrupt and cacheline.
Still, the latency of each IO will see a noticeable increase.
>> Waiting for local invalidate to complete would be a really big
>> sacrifice for our storage ULPs.
>
> Not waiting would also be a sacrifice, by compromising data integrity.
> It's a tough tradeoff, but if you choose to go that way it will be
> critical to be honest about the consequences to the data.
That's true. I assume that the best compromise would be remote
invalidation, but some standards need enhancements, and it doesn't
solve the multi-rkey transaction case.
As storage devices are becoming faster we really should try to do our
best to keep up.
Hey Chuck,
>
>> It is painful, too painful. The entire value proposition of RDMA is
>> low-latency and waiting for the extra HW round-trip for a local
>> invalidation to complete is unacceptable; moreover, it adds a huge load
>> of extra interrupts and cache-line pollution.
>
> The killer is the extra context switches, I've found.
That too...
> I've noticed only a marginal loss of performance on modern
> hardware.
Would you mind sharing your observations?
> On Nov 24, 2015, at 9:44 AM, Sagi Grimberg <[email protected]> wrote:
>
> Hey Chuck,
>
>>
>>> It is painful, too painful. The entire value proposition of RDMA is
>>> low-latency and waiting for the extra HW round-trip for a local
>>> invalidation to complete is unacceptable; moreover, it adds a huge load
>>> of extra interrupts and cache-line pollution.
>>
>> The killer is the extra context switches, I've found.
>
> That too...
>
>> I've noticed only a marginal loss of performance on modern
>> hardware.
>
> Would you mind sharing your observations?
I'm testing with CX-3 Pro on FDR.
NFS READ and WRITE round trip latency, which includes the cost
of registration and now invalidation, is not noticeably longer.
dbench and fio results are marginally slower (in the neighborhood
of 5%).
For NFS, the cost of invalidation is probably not significant
compared to other bottlenecks in our stack (lock contention and
scheduling overhead are likely the largest contributors).
Notice that xprtrdma chains together all the LOCAL_INV WRs for
an RPC, and only signals the final one. Before, every LOCAL_INV
WR was signaled. So this patch actually reduces the send
completion rate.
The main benefit for NFS of waiting for invalidation to complete
is better send queue accounting. Even without the data integrity
issue, we have to ensure the WQEs consumed by invalidation
requests are released before dispatching another RPC. Otherwise
the send queue can be overrun.
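A stripped-down illustration of that chaining (not the actual
frwr_ops.c code; the xxx_ name is invented, and error handling and MR
bookkeeping are omitted):

#include <linux/string.h>
#include <rdma/ib_verbs.h>

static int xxx_post_local_inv_chain(struct ib_qp *qp, struct ib_mr **mrs,
                                    struct ib_send_wr *wrs, unsigned int count)
{
        struct ib_send_wr *bad_wr;
        unsigned int i;

        if (!count)
                return 0;

        for (i = 0; i < count; i++) {
                memset(&wrs[i], 0, sizeof(wrs[i]));
                wrs[i].opcode = IB_WR_LOCAL_INV;
                wrs[i].ex.invalidate_rkey = mrs[i]->rkey;
                wrs[i].next = (i + 1 < count) ? &wrs[i + 1] : NULL;
        }

        /* Only the final WR is signaled: send completions arrive in
         * order, so one completion covers the whole chain and tells us
         * that all "count" WQEs have been retired from the send queue. */
        wrs[count - 1].send_flags = IB_SEND_SIGNALED;

        return ib_post_send(qp, &wrs[0], &bad_wr);
}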
--
Chuck Lever
On Tue, Nov 24, 2015 at 09:39:33AM -0500, Chuck Lever wrote:
> >> Do we really want to go down that road? It seems like we've decided
> >> in general that while the protocol specs say MR must be unmapped before
> >> proceeding with the data that is painful enough to ignore this
> >> requirement. E.g. iser for example only does the local invalidate
> >> just before reusing the MR.
>
> That leaves the MR exposed to the remote indefinitely. If
> the MR is registered for remote write, that seems hazardous.
Agree.
If we have a performance knob it should be to post the invalidate,
then immediately dma unmap and start using the buffer, not to defer
the invalidate potentially indefinitely. Then it is just a race to be
exploited, not a long-lived window into random page cache memory.
Jason
Hi Chuck,
On 11/23/2015 05:14 PM, Chuck Lever wrote:
> I'm about to add code in the RPC/RDMA reply handler between the
> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
> to execute outside of spinlock critical sections.
>
> Add a hook to remove an rpc_rqst from the pending list once
> the transport knows it's going to invoke xprt_complete_rqst().
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/xprt.h | 1 +
> net/sunrpc/xprt.c | 14 ++++++++++++++
> net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++++
> 3 files changed, 19 insertions(+)
>
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 69ef5b3..ab6c3a5 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -366,6 +366,7 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
> void xprt_write_space(struct rpc_xprt *xprt);
> void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
> struct rpc_rqst * xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
> +void xprt_commit_rqst(struct rpc_task *task);
> void xprt_complete_rqst(struct rpc_task *task, int copied);
> void xprt_release_rqst_cong(struct rpc_task *task);
> void xprt_disconnect_done(struct rpc_xprt *xprt);
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index 2e98f4a..a5be4ab 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
> }
>
> /**
> + * xprt_commit_rqst - remove rqst from pending list early
> + * @task: RPC request to remove
Is xprt_commit_rqst() the right name for this function? Removing a request from a list isn't how I would expect a commit to work.
Anna
> + *
> + * Caller holds transport lock.
> + */
> +void xprt_commit_rqst(struct rpc_task *task)
> +{
> + struct rpc_rqst *req = task->tk_rqstp;
> +
> + list_del_init(&req->rq_list);
> +}
> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
> +
> +/**
> * xprt_complete_rqst - called when reply processing is complete
> * @task: RPC request that recently completed
> * @copied: actual number of bytes received from the transport
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index a169252..d7b9156 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -811,6 +811,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
> if (req->rl_reply)
> goto out_duplicate;
>
> + xprt_commit_rqst(rqst->rq_task);
> + spin_unlock_bh(&xprt->transport_lock);
> +
> dprintk("RPC: %s: reply 0x%p completes request 0x%p\n"
> " RPC request 0x%p xid 0x%08x\n",
> __func__, rep, req, rqst,
> @@ -901,6 +904,7 @@ badheader:
> else if (credits > r_xprt->rx_buf.rb_max_requests)
> credits = r_xprt->rx_buf.rb_max_requests;
>
> + spin_lock_bh(&xprt->transport_lock);
> cwnd = xprt->cwnd;
> xprt->cwnd = credits << RPC_CWNDSHIFT;
> if (xprt->cwnd > cwnd)
>
> On Nov 24, 2015, at 2:54 PM, Anna Schumaker <[email protected]> wrote:
>
> Hi Chuck,
>
> On 11/23/2015 05:14 PM, Chuck Lever wrote:
>> I'm about to add code in the RPC/RDMA reply handler between the
>> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
>> to execute outside of spinlock critical sections.
>>
>> Add a hook to remove an rpc_rqst from the pending list once
>> the transport knows it's going to invoke xprt_complete_rqst().
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/xprt.h | 1 +
>> net/sunrpc/xprt.c | 14 ++++++++++++++
>> net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++++
>> 3 files changed, 19 insertions(+)
>>
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index 69ef5b3..ab6c3a5 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -366,6 +366,7 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
>> void xprt_write_space(struct rpc_xprt *xprt);
>> void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
>> struct rpc_rqst * xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>> +void xprt_commit_rqst(struct rpc_task *task);
>> void xprt_complete_rqst(struct rpc_task *task, int copied);
>> void xprt_release_rqst_cong(struct rpc_task *task);
>> void xprt_disconnect_done(struct rpc_xprt *xprt);
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index 2e98f4a..a5be4ab 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
>> }
>>
>> /**
>> + * xprt_commit_rqst - remove rqst from pending list early
>> + * @task: RPC request to remove
>
> Is xprt_commit_rqst() the right name for this function? Removing a request from a list isn't how I would expect a commit to work.
“commit” means the request is committed: we have a parse-able
reply and will proceed to completion. The name does not reflect
the mechanism, but the policy.
Any suggestions on a different name?
> Anna
>
>> + *
>> + * Caller holds transport lock.
>> + */
>> +void xprt_commit_rqst(struct rpc_task *task)
>> +{
>> + struct rpc_rqst *req = task->tk_rqstp;
>> +
>> + list_del_init(&req->rq_list);
>> +}
>> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
>> +
>> +/**
>> * xprt_complete_rqst - called when reply processing is complete
>> * @task: RPC request that recently completed
>> * @copied: actual number of bytes received from the transport
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index a169252..d7b9156 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -811,6 +811,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>> if (req->rl_reply)
>> goto out_duplicate;
>>
>> + xprt_commit_rqst(rqst->rq_task);
>> + spin_unlock_bh(&xprt->transport_lock);
>> +
>> dprintk("RPC: %s: reply 0x%p completes request 0x%p\n"
>> " RPC request 0x%p xid 0x%08x\n",
>> __func__, rep, req, rqst,
>> @@ -901,6 +904,7 @@ badheader:
>> else if (credits > r_xprt->rx_buf.rb_max_requests)
>> credits = r_xprt->rx_buf.rb_max_requests;
>>
>> + spin_lock_bh(&xprt->transport_lock);
>> cwnd = xprt->cwnd;
>> xprt->cwnd = credits << RPC_CWNDSHIFT;
>> if (xprt->cwnd > cwnd)
>>
--
Chuck Lever
Hi Christoph,
On 11/23/2015 10:52 PM, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:
>> On 11/23/2015 5:14 PM, Chuck Lever wrote:
>>> FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
>>> is a synchronous verb. However, some improvements can be made here.
>>
>> I thought FMR support was about to be removed in the core.
>
> Seems like everyone would love to kill it, but no one dares to do
> it just yet. Reasons to keep FMRs:
>
> - mthca doesn't support FRs but hasn't been staged out
> - RDS only supports FMRs for the IB transport (it does support FRs for
> an entirely separate iWarp transport. I'd love to know why we can't
> just use that iWarp transport for IB as well).
> - mlx4 hardware might be slower with FRs than FMRs (Or mentioned this
> in reply to the iSER remote invalidation series).
>
> So at least for 4.5 we're unlikely to be able to get rid of it alone
> due to the RDS issue. We'll then need performance numbers for mlx4,
> and figure out how much we care about mthca.
>
As already indicated to Sagi [1], RDS IB FR support is work in
progress and I was hoping to get it ready for 4.5. There are a few
issues we found with one of the HCAs, and hence the progress has
slowed down. Looking at where we are, the 4.6 merge window seems
realistic for me to get RDS FR support in.
Now, on the iWARP transport itself, it has been a bit tough because
of the lack of hardware to test on. I have been requesting tests in
previous RDS patches and haven't seen any interest so far. If
this continues to be the trend, I might as well get rid of
RDS iWARP support after a couple of merge windows.
Regards,
Santosh
[1] http://www.spinics.net/lists/linux-nfs/msg53909.html
On Tue, Nov 24, 2015 at 01:54:02PM -0800, santosh shilimkar wrote:
> As already indicated to Sagi [1], RDS IB FR support is work in
> progress and I was hoping to get it ready for 4.5. There are a few
> issues we found with one of the HCAs, and hence the progress has
> slowed down. Looking at where we are, the 4.6 merge window seems
> realistic for me to get RDS FR support in.
Ok.
> Now, on the iWARP transport itself, it has been a bit tough because
> of the lack of hardware to test on. I have been requesting tests in
> previous RDS patches and haven't seen any interest so far. If
> this continues to be the trend, I might as well get rid of
> RDS iWARP support after a couple of merge windows.
I'd say drop the current iWarp transport if it's not testable. The
only real difference between IB and iWarp is the need to create
an MR for the RDMA READ sink, and we're much better off adding that into
the current IB transport if new iWarp users show up.
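For reference, the iWarp difference amounts to the access flags on the
READ sink buffer; something along these lines (illustrative only, not
RDS code, xxx_ name invented):

#include <rdma/ib_verbs.h>

/* iWarp RDMA READ responses arrive as tagged writes from the peer, so
 * the local sink buffer must allow REMOTE_WRITE; on IB the local DMA
 * lkey is sufficient. */
static int xxx_read_sink_access(struct ib_device *dev, u8 port)
{
        int access = IB_ACCESS_LOCAL_WRITE;

        if (rdma_protocol_iwarp(dev, port))
                access |= IB_ACCESS_REMOTE_WRITE;
        return access;
}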
On 11/25/2015 1:00 AM, Christoph Hellwig wrote:
> On Tue, Nov 24, 2015 at 01:54:02PM -0800, santosh shilimkar wrote:
>> As already indicated to Sagi [1], RDS IB FR support is work in
>> progress and I was hoping to get it ready for 4.5. There are a few
>> issues we found with one of the HCAs, and hence the progress has
>> slowed down. Looking at where we are, the 4.6 merge window seems
>> realistic for me to get RDS FR support in.
>
> Ok.
>
>> Now, on the iWARP transport itself, it has been a bit tough because
>> of the lack of hardware to test on. I have been requesting tests in
>> previous RDS patches and haven't seen any interest so far. If
>> this continues to be the trend, I might as well get rid of
>> RDS iWARP support after a couple of merge windows.
>
> I'd say drop the current iWarp transport if it's not testable. The
> only real difference between IB and iWarp is the need to create
> an MR for the RDMA READ sink, and we're much better off adding that into
> the current IB transport if new iWarp users show up.
Agree. I will probably do it along with RDS IB FR support in 4.6
Regards,
Santosh
On Wed, Nov 25, 2015 at 7:09 PM, santosh shilimkar
<[email protected]> wrote:
>>> As already indicated to Sagi [1], RDS IB FR support is work in
>>> progress and I was hoping to get it ready for 4.5.
This is really good news! Can you please elaborate a bit on the
design changes this move introduced in RDS?
On 11/25/2015 10:22 AM, Or Gerlitz wrote:
> On Wed, Nov 25, 2015 at 7:09 PM, santosh shilimkar
> <[email protected]> wrote:
>>>> As already indicated to Sagi [1], RDS IB FR support is work in
>>>> progress and I was hoping to get it ready for 4.5.
>
> This is really good news! Can you please elaborate a bit on the
> design changes this move introduced in RDS?
>
Yeah. It has been a bit of a pain point, since the need
was to keep the RDS design the same and retrofit the FR
support so that it can co-exist with the existing deployed
FMR code.
Leaving the details for the code review, but at a very high
level:
- Have to split the poll CQ handling so that
send + FR WR completions can be handled together.
The FR CQ handler and the reg/inv WR prep mark the MR state
(INVALID, VALID & STALE) as appropriate.
- Allocate 2X space on the WR and WC queues during queue setup.
- Manage the MR reg/inv based on the space available
in the FR WR ring (actually it is just a counter). This is
a bit tricky because RDS does MR operations via sendmsg()
as well as directly through socket APIs, so it needs
co-ordination.
I am hoping the above remains true when the code actually
makes it to the list, but that is how things stand as
of now.
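Purely as an illustration of the above (the real RDS FR code is not
posted yet, and every name here is invented), the state tracking and
WR-ring accounting could be as simple as:

#include <linux/atomic.h>
#include <linux/types.h>

enum xxx_frmr_state {
        XXX_FRMR_IS_INVALID,    /* rkey invalidated, MR free for reuse */
        XXX_FRMR_IS_VALID,      /* rkey registered and advertised */
        XXX_FRMR_IS_STALE,      /* QP flushed or errored, needs recovery */
};

struct xxx_fr_pool {
        atomic_t fr_wr_space;   /* free reg/inv slots on the send queue */
};

/* Claim one send-queue slot for a registration or LOCAL_INV WR; the
 * completion handler returns the slot with atomic_inc(). */
static bool xxx_fr_get_wr_slot(struct xxx_fr_pool *pool)
{
        return atomic_dec_if_positive(&pool->fr_wr_space) >= 0;
}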
Regards,
Santosh
On Wed, Nov 25, 2015 at 09:09:20AM -0800, santosh shilimkar wrote:
> >I'd say drop the current iWarp transport if it's not testable. The
> >only real difference between IB and iWarp is the need to create
> >an MR for the RDMA READ sink, and we're much better off adding that into
> >the current IB transport if new iWarp users show up.
>
> Agree. I will probably do it along with RDS IB FR support in 4.6
Great news!
Jason
On 25/11/2015 21:28, Jason Gunthorpe wrote:
> On Wed, Nov 25, 2015 at 09:09:20AM -0800, santosh shilimkar wrote:
>>> I'd say drop the current iWarp transport if it's not testable. The
>>> only real difference between IB and iWarp is the need to create
>>> an MR for the RDMA READ sink, and we're much better off adding that into
>>> the current IB transport if new iWarp users show up.
>>
>> Agree. I will probably do it along with RDS IB FR support in 4.6
>
> Great news!
Great news indeed!
Looking forward to seeing this.
> On Nov 24, 2015, at 2:12 AM, Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Nov 23, 2015 at 10:52:26PM -0800, Christoph Hellwig wrote:
>>
>> So at least for 4.5 we're unlikely to be able to get rid of it alone
>> due to the RDS issue. We'll then need performance numbers for mlx4,
>> and figure out how much we care about mthca.
>
> mthca is unfortunately very popular on the used market; it is very
> easy to get cards, build a cheap test cluster, etc. :(
Oracle recently announced Sonoma, which is a SPARC CPU with
an on-chip IB HCA. Oracle plans to publish an open-source
GPL device driver that enables this HCA in Linux for SPARC.
We’d eventually like to contribute it to the upstream
Linux kernel.
At the moment, the device and driver support only FMR. As
you might expect, Oracle needs it to work at least with
in-kernel RDS. Thus we hope to see the life of in-kernel
FMR extended for a bit to accommodate this new device.
--
Chuck Lever
On Tue, Dec 01, 2015 at 10:33:18AM -0500, Chuck Lever wrote:
> Oracle recently announced Sonoma, which is a SPARC CPU with
> an on-chip IB HCA. Oracle plans to publish an open-source
> GPL device driver that enables this HCA in Linux for SPARC.
> We'd eventually like to contribute it to the upstream
> Linux kernel.
>
> At the moment, the device and driver support only FMR. As
> you might expect, Oracle needs it to work at least with
> in-kernel RDS. Thus we hope to see the life of in-kernel
> FMR extended for a bit to accommodate this new device.
Oh, that's pretty sad news. I wonder what the engineers were smoking...