2017-09-05 17:00:14

by Chuck Lever

Subject: [PATCH RFC 0/5] xprtrdma Send completion batching

Hi Jason, Sagi-

As we discussed a few weeks ago, this patch series implements the
following:

- Send SGEs are now managed via lock-less, wait-free circular queues
- Send SGEs referring to page cache pages are DMA unmapped during
Send completion
- Send completions are batched to reduce interrupts, but still
provide a periodic heartbeat signal for SQ housekeeping
- The circular queue prevents Send Queue overflow

The purpose of this change is to address the issue Sagi reported
where the HCA continues to retry a delayed Send request _after_ RPC
completion, resulting in a DMA error.

Also available as the first five commits in this topic branch:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfs-rdma-for-4.15

---

Chuck Lever (5):
xprtrdma: Clean up SGE accounting in rpcrdma_prepare_msg_sges()
xprtrdma: Change return value of rpcrdma_prepare_send_sges()
xprtrdma: Add data structure to manage RDMA Send arguments
xprtrdma: Manage RDMA Send arguments via lock-free circular queue
xprtrdma: Remove atomic send completion counting


net/sunrpc/xprtrdma/backchannel.c | 6 +
net/sunrpc/xprtrdma/frwr_ops.c | 8 -
net/sunrpc/xprtrdma/rpc_rdma.c | 104 +++++++++++-------
net/sunrpc/xprtrdma/transport.c | 6 +
net/sunrpc/xprtrdma/verbs.c | 210 +++++++++++++++++++++++++++++++++++--
net/sunrpc/xprtrdma/xprt_rdma.h | 66 +++++++-----
6 files changed, 308 insertions(+), 92 deletions(-)

--
Chuck Lever


2017-09-05 17:00:23

by Chuck Lever

Subject: [PATCH RFC 1/5] xprtrdma: Clean up SGE accounting in rpcrdma_prepare_msg_sges()

rpcrdma_prepare_hdr_sge() sets num_sge to one, then
rpcrdma_prepare_msg_sges() sets num_sge again to the count of SGEs
it added, plus one for the header SGE just mapped in
rpcrdma_prepare_hdr_sge().

Let's just add the number we mapped, instead of setting num_sge
again. The behavior should be the same, but the code is slightly
clearer and less likely to break due to unrelated changes.

Similarly, simplify mapped SGE counting by just adding the running
count of SGEs before returning.
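
For example (illustrative counts only): if a Send carries the head
iovec plus two page cache pages and no tail iovec, sge_no is 3 when
control reaches the out: label. num_sge then becomes 1 (transport
header) + 3 = 4, and rl_mapped_sges becomes 3 - 1 = 2, counting only
the page SGEs that rpcrdma_unmap_sges() must unmap later.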

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 84584ca..7ec4fde 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -607,7 +607,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
sge[sge_no].length = len;
sge[sge_no].lkey = lkey;

- req->rl_mapped_sges++;
ppages++;
remaining -= len;
page_base = 0;
@@ -633,11 +632,11 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
goto out_mapping_err;
sge[sge_no].length = len;
sge[sge_no].lkey = lkey;
- req->rl_mapped_sges++;
}

out:
- req->rl_send_wr.num_sge = sge_no + 1;
+ req->rl_send_wr.num_sge += sge_no;
+ req->rl_mapped_sges += sge_no - 1;
return true;

out_mapping_overflow:


2017-09-05 17:00:48

by Chuck Lever

Subject: [PATCH RFC 4/5] xprtrdma: Manage RDMA Send arguments via lock-free circular queue

Hook rpcrdma_ep_post and the send completion handler up with the new
send ctxs, and remove Send buffer DMA unmapping from xprt_rdma_free.

Care must be taken if an error occurs before the Send is posted.
In this case, the send_ctx has to be popped back onto the circular
queue, since it will never be completed or flushed. This is done
before send_request's call allows another RPC Call to proceed.

Reported-by: Sagi Grimberg <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 40 +++++++++++++++------------------------
net/sunrpc/xprtrdma/transport.c | 1 -
net/sunrpc/xprtrdma/verbs.c | 28 ++++++++++++++++++++-------
net/sunrpc/xprtrdma/xprt_rdma.h | 6 +-----
4 files changed, 36 insertions(+), 39 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 63461bd..d074da2 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -517,20 +517,20 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
rpcrdma_prepare_hdr_sge(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
u32 len)
{
+ struct rpcrdma_sendctx *sc = req->rl_sendctx;
struct rpcrdma_regbuf *rb = req->rl_rdmabuf;
- struct ib_sge *sge = &req->rl_send_sge[0];
+ struct ib_sge *sge = sc->sc_sges;

- if (unlikely(!rpcrdma_regbuf_is_mapped(rb))) {
+ if (unlikely(!rpcrdma_regbuf_is_mapped(rb)))
if (!__rpcrdma_dma_map_regbuf(ia, rb))
goto out_regbuf;
- sge->addr = rdmab_addr(rb);
- sge->lkey = rdmab_lkey(rb);
- }
+ sge->addr = rdmab_addr(rb);
+ sge->lkey = rdmab_lkey(rb);
sge->length = len;

ib_dma_sync_single_for_device(rdmab_device(rb), sge->addr,
sge->length, DMA_TO_DEVICE);
- req->rl_send_wr.num_sge++;
+ sc->sc_wr.num_sge++;
return true;

out_regbuf:
@@ -545,13 +545,16 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
{
+ struct rpcrdma_sendctx *sc = req->rl_sendctx;
unsigned int sge_no, page_base, len, remaining;
struct rpcrdma_regbuf *rb = req->rl_sendbuf;
struct ib_device *device = ia->ri_device;
- struct ib_sge *sge = req->rl_send_sge;
+ struct ib_sge *sge = sc->sc_sges;
u32 lkey = ia->ri_pd->local_dma_lkey;
struct page *page, **ppages;

+ sc->sc_device = device;
+
/* The head iovec is straightforward, as it is already
* DMA-mapped. Sync the content that has changed.
*/
@@ -639,8 +642,8 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
}

out:
- req->rl_send_wr.num_sge += sge_no;
- req->rl_mapped_sges += sge_no - 1;
+ sc->sc_wr.num_sge += sge_no;
+ sc->sc_unmap_count += sge_no - 1;
return true;

out_regbuf:
@@ -671,8 +674,9 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_req *req, u32 hdrlen,
struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
{
- req->rl_send_wr.num_sge = 0;
- req->rl_mapped_sges = 0;
+ req->rl_sendctx = rpcrdma_sendctx_get_locked(&r_xprt->rx_buf);
+ if (!req->rl_sendctx)
+ return -ENOBUFS;

if (!rpcrdma_prepare_hdr_sge(&r_xprt->rx_ia, req, hdrlen))
return -EIO;
@@ -684,20 +688,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
return 0;
}

-void
-rpcrdma_unmap_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
-{
- struct ib_device *device = ia->ri_device;
- struct ib_sge *sge;
- int count;
-
- sge = &req->rl_send_sge[2];
- for (count = req->rl_mapped_sges; count--; sge++)
- ib_dma_unmap_page(device, sge->addr, sge->length,
- DMA_TO_DEVICE);
- req->rl_mapped_sges = 0;
-}
-
/**
* rpcrdma_unmap_send_sges - DMA-unmap Send buffers
* @sc: send_ctx containing SGEs to unmap
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 18cb8b4..4cca4a7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -688,7 +688,6 @@
rpcrdma_remove_req(&r_xprt->rx_buf, req);
if (!list_empty(&req->rl_registered))
ia->ri_ops->ro_unmap_safe(r_xprt, req, !RPC_IS_ASYNC(task));
- rpcrdma_unmap_sges(ia, req);
rpcrdma_buffer_put(req);
}

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 81d081d..75899ba 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -128,11 +128,17 @@
static void
rpcrdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
{
+ struct ib_cqe *cqe = wc->wr_cqe;
+ struct rpcrdma_sendctx *sc =
+ container_of(cqe, struct rpcrdma_sendctx, sc_cqe);
+
/* WARNING: Only wr_cqe and status are reliable at this point */
if (wc->status != IB_WC_SUCCESS && wc->status != IB_WC_WR_FLUSH_ERR)
pr_err("rpcrdma: Send: %s (%u/0x%x)\n",
ib_wc_status_msg(wc->status),
wc->status, wc->vendor_err);
+
+ rpcrdma_sendctx_put_locked(sc);
}

/* Perform basic sanity checking to avoid using garbage
@@ -1113,13 +1119,8 @@ struct rpcrdma_req *
spin_lock(&buffer->rb_reqslock);
list_add(&req->rl_all, &buffer->rb_allreqs);
spin_unlock(&buffer->rb_reqslock);
- req->rl_cqe.done = rpcrdma_wc_send;
req->rl_buffer = &r_xprt->rx_buf;
INIT_LIST_HEAD(&req->rl_registered);
- req->rl_send_wr.next = NULL;
- req->rl_send_wr.wr_cqe = &req->rl_cqe;
- req->rl_send_wr.sg_list = req->rl_send_sge;
- req->rl_send_wr.opcode = IB_WR_SEND;
return req;
}

@@ -1410,7 +1411,6 @@ struct rpcrdma_req *
struct rpcrdma_buffer *buffers = req->rl_buffer;
struct rpcrdma_rep *rep = req->rl_reply;

- req->rl_send_wr.num_sge = 0;
req->rl_reply = NULL;

spin_lock(&buffers->rb_lock);
@@ -1542,7 +1542,7 @@ struct rpcrdma_regbuf *
struct rpcrdma_ep *ep,
struct rpcrdma_req *req)
{
- struct ib_send_wr *send_wr = &req->rl_send_wr;
+ struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
struct ib_send_wr *send_wr_fail;
int rc;

@@ -1556,10 +1556,22 @@ struct rpcrdma_regbuf *
dprintk("RPC: %s: posting %d s/g entries\n",
__func__, send_wr->num_sge);

- rpcrdma_set_signaled(ep, send_wr);
+ /* Signal Send once every "rep_send_batch" WRs:
+ * - Mitigate interrupts due to Send completions
+ * - Batch up DMA unmapping send buffers
+ * - Allow verbs provider housekeeping on the Send Queue
+ */
+ if (ep->rep_send_count) {
+ send_wr->send_flags &= ~IB_SEND_SIGNALED;
+ --ep->rep_send_count;
+ } else {
+ send_wr->send_flags |= IB_SEND_SIGNALED;
+ ep->rep_send_count = ep->rep_send_batch;
+ }
rc = ib_post_send(ia->ri_id->qp, send_wr, &send_wr_fail);
if (rc)
goto out_postsend_err;
+
return 0;

out_postsend_err:
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 1967e7a..8f8e2dd 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -365,19 +365,16 @@ enum {
struct rpcrdma_req {
struct list_head rl_list;
__be32 rl_xid;
- unsigned int rl_mapped_sges;
unsigned int rl_connect_cookie;
struct rpcrdma_buffer *rl_buffer;
struct rpcrdma_rep *rl_reply;
struct xdr_stream rl_stream;
struct xdr_buf rl_hdrbuf;
- struct ib_send_wr rl_send_wr;
- struct ib_sge rl_send_sge[RPCRDMA_MAX_SEND_SGES];
+ struct rpcrdma_sendctx *rl_sendctx;
struct rpcrdma_regbuf *rl_rdmabuf; /* xprt header */
struct rpcrdma_regbuf *rl_sendbuf; /* rq_snd_buf */
struct rpcrdma_regbuf *rl_recvbuf; /* rq_rcv_buf */

- struct ib_cqe rl_cqe;
struct list_head rl_all;
bool rl_backchannel;

@@ -676,7 +673,6 @@ int rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_req *req, u32 hdrlen,
struct xdr_buf *xdr,
enum rpcrdma_chunktype rtype);
-void rpcrdma_unmap_sges(struct rpcrdma_ia *, struct rpcrdma_req *);
void rpcrdma_unmap_send_sges(struct rpcrdma_sendctx *sc);
int rpcrdma_marshal_req(struct rpcrdma_xprt *r_xprt, struct rpc_rqst *rqst);
void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *);


2017-09-05 17:00:43

by Chuck Lever

Subject: [PATCH RFC 3/5] xprtrdma: Add data structure to manage RDMA Send arguments

Recently Sagi Grimberg <[email protected]> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.

This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.

Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In this case, the HCA can
try to use memory that has been invalidated or DMA unmapped, and
the connection is lost.

Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.

After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated

In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.

The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.

The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).

The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that a sendctx that is still in use is never handed
out again (or, stated more conventionally, to detect when the queue
is empty).

When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a
provider Send queue overflow. In this case, the ULP is told to wait
and try again.

As a final touch, Jason Gunthorpe <[email protected]>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform the SGE unmapping for those
N Send operations during that single completion.
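
For illustration only (not part of this series): a minimal user-space
sketch of the single-producer/single-consumer index queue described
above. C11 atomics stand in for the kernel's READ_ONCE() and
smp_store_release(), a plain int array stands in for rb_sc_ctxs[], and
all names (sketch_queue, queue_get, queue_put) are invented for the
sketch.

#include <stdatomic.h>
#include <stdio.h>

#define QUEUE_SIZE 8			/* need not be a power of two */

struct sketch_queue {
	unsigned long head;		/* updated only by the consumer */
	_Atomic unsigned long tail;	/* updated only by the producer */
	unsigned long last;		/* QUEUE_SIZE - 1 */
	int items[QUEUE_SIZE];		/* stands in for rb_sc_ctxs[] */
};

/* Advance an index without using the modulus operator. */
static unsigned long queue_next(const struct sketch_queue *q, unsigned long i)
{
	return (i < q->last) ? i + 1 : 0;
}

/* Consumer side, analogous to rpcrdma_sendctx_get_locked(): fail when
 * next(head) == tail, i.e. every remaining entry is still in use.
 */
static int queue_get(struct sketch_queue *q, int *item)
{
	unsigned long next_head = queue_next(q, q->head);

	if (next_head == atomic_load_explicit(&q->tail, memory_order_acquire))
		return -1;		/* "queue empty": caller must wait */
	*item = q->items[next_head];
	q->head = next_head;
	return 0;
}

/* Producer side, analogous to rpcrdma_sendctx_put_locked(): walk forward
 * from the old tail through @upto, releasing each entry in turn (this is
 * where the batched DMA unmapping would go), then publish the new tail.
 */
static void queue_put(struct sketch_queue *q, unsigned long upto)
{
	unsigned long next_tail = atomic_load_explicit(&q->tail,
						       memory_order_relaxed);

	do {
		next_tail = queue_next(q, next_tail);
		/* release q->items[next_tail] here */
	} while (next_tail != upto);

	atomic_store_explicit(&q->tail, next_tail, memory_order_release);
}

int main(void)
{
	struct sketch_queue q = { .head = 0, .last = QUEUE_SIZE - 1 };
	int item, i;

	atomic_init(&q.tail, 0);
	for (i = 0; i < QUEUE_SIZE; i++)
		q.items[i] = i;

	if (queue_get(&q, &item) == 0)		/* hands out items[1] */
		printf("acquired entry %d\n", item);
	queue_put(&q, q.head);			/* "completion" releases it */
	if (queue_get(&q, &item) == 0)		/* hands out items[2] */
		printf("acquired entry %d\n", item);
	return 0;
}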

Reported-by: Sagi Grimberg <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 25 +++++
net/sunrpc/xprtrdma/transport.c | 5 +
net/sunrpc/xprtrdma/verbs.c | 178 +++++++++++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 33 +++++++
4 files changed, 239 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 71f43a2..63461bd 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -699,6 +699,31 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
}

/**
+ * rpcrdma_unmap_send_sges - DMA-unmap Send buffers
+ * @sc: send_ctx containing SGEs to unmap
+ *
+ * The first two SGEs are always left mapped. They contain the transport
+ * header and the inline buffer.
+ */
+void
+rpcrdma_unmap_send_sges(struct rpcrdma_sendctx *sc)
+{
+ struct ib_sge *sge;
+
+ dprintk("RPC: %s: unmapping %u sges for sc=%p\n",
+ __func__, sc->sc_unmap_count, sc);
+
+ sge = &sc->sc_sges[2];
+ while (sc->sc_unmap_count) {
+ ib_dma_unmap_page(sc->sc_device, sge->addr, sge->length,
+ DMA_TO_DEVICE);
+ ++sge;
+ --sc->sc_unmap_count;
+ }
+ sc->sc_wr.num_sge = 0;
+}
+
+/**
* rpcrdma_marshal_req - Marshal and send one RPC request
* @r_xprt: controlling transport
* @rqst: RPC request to be marshaled
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index b680591..18cb8b4 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -790,11 +790,12 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
r_xprt->rx_stats.failed_marshal_count,
r_xprt->rx_stats.bad_reply_count,
r_xprt->rx_stats.nomsg_call_count);
- seq_printf(seq, "%lu %lu %lu %lu\n",
+ seq_printf(seq, "%lu %lu %lu %lu %lu\n",
r_xprt->rx_stats.mrs_recovered,
r_xprt->rx_stats.mrs_orphaned,
r_xprt->rx_stats.mrs_allocated,
- r_xprt->rx_stats.local_inv_needed);
+ r_xprt->rx_stats.local_inv_needed,
+ r_xprt->rx_stats.empty_sendctx_q);
}

static int
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c78fb27..81d081d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -52,6 +52,8 @@
#include <linux/prefetch.h>
#include <linux/sunrpc/addr.h>
#include <linux/sunrpc/svc_rdma.h>
+
+#include <asm-generic/barrier.h>
#include <asm/bitops.h>

#include <rdma/ib_cm.h>
@@ -564,6 +566,9 @@
ep->rep_attr.cap.max_recv_sge);

/* set trigger for requesting send completion */
+ ep->rep_send_batch = min_t(unsigned int, RPCRDMA_MAX_SEND_BATCH,
+ cdata->max_requests >> 2);
+ ep->rep_send_count = ep->rep_send_batch;
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
if (ep->rep_cqinit <= 2)
ep->rep_cqinit = 0; /* always signal? */
@@ -846,6 +851,173 @@
ib_drain_qp(ia->ri_id->qp);
}

+/* Fixed-size circular FIFO queue. This implementation is wait-free and
+ * lock-free.
+ *
+ * Consumer is the code path that posts Sends. This path dequeues a
+ * sendctx for use by a Send operation. Multiple consumer threads are
+ * serialized by the RPC transport lock, which allows only one
+ * ->send_request at a time.
+ *
+ * Producer is the code path that handles Send completions. This path
+ * enqueues a sendctx that has been completed. Multiple producer threads
+ * are serialized by the ib_poll_cq() function.
+ */
+
+/* rpcrdma_sendctxs_destroy() assumes caller has already quiesced
+ * queue activity.
+ */
+static void rpcrdma_sendctxs_destroy(struct rpcrdma_buffer *buf)
+{
+ unsigned long i;
+
+ /* Remaining unsignaled Sends could still have mapped SGEs.
+ * Ensure all SGEs are unmapped before destroying the queue.
+ */
+ for (i = 0; i <= buf->rb_sc_last; i++) {
+ rpcrdma_unmap_send_sges(buf->rb_sc_ctxs[i]);
+ kfree(buf->rb_sc_ctxs[i]);
+ }
+
+ kfree(buf->rb_sc_ctxs);
+}
+
+static struct rpcrdma_sendctx *rpcrdma_sendctx_create(unsigned int max_sges)
+{
+ struct rpcrdma_sendctx *sc;
+
+ sc = kzalloc(sizeof(*sc) + max_sges * sizeof(struct ib_sge),
+ GFP_KERNEL);
+ if (!sc)
+ return NULL;
+
+ sc->sc_wr.next = NULL;
+ sc->sc_wr.wr_cqe = &sc->sc_cqe;
+ sc->sc_wr.sg_list = sc->sc_sges;
+ sc->sc_wr.opcode = IB_WR_SEND;
+ sc->sc_cqe.done = rpcrdma_wc_send;
+ return sc;
+}
+
+static int rpcrdma_sendctxs_create(struct rpcrdma_xprt *r_xprt)
+{
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+ unsigned long i;
+
+ /* Maximum number of concurrent outstanding Send WRs */
+ i = buf->rb_max_requests + RPCRDMA_MAX_BC_REQUESTS;
+ dprintk("RPC: %s: allocating %lu send_ctxs (batch size %u)\n",
+ __func__, i, r_xprt->rx_ep.rep_send_batch);
+ buf->rb_sc_ctxs = kcalloc(i, sizeof(struct rpcrdma_sendctx *),
+ GFP_KERNEL);
+ if (!buf->rb_sc_ctxs)
+ return -ENOMEM;
+
+ buf->rb_sc_last = i - 1;
+ for (i = 0; i <= buf->rb_sc_last; i++) {
+ struct rpcrdma_sendctx *sc;
+
+ sc = rpcrdma_sendctx_create(r_xprt->rx_ia.ri_max_send_sges);
+ if (!sc)
+ goto out_destroy;
+
+ sc->sc_buffers = buf;
+ buf->rb_sc_ctxs[i] = sc;
+ }
+
+ return 0;
+
+out_destroy:
+ rpcrdma_sendctxs_destroy(buf);
+ return -ENOMEM;
+}
+
+/* The sendctx queue is not guaranteed to have a size that is
+ * a power of two, thus the helpers in circ_buf.h cannot be used.
+ * The other option is to use modulus (%), which can be expensive.
+ */
+static unsigned long rpcrdma_sendctx_next(struct rpcrdma_buffer *buf,
+ unsigned long item)
+{
+ return likely(item < buf->rb_sc_last) ? item + 1 : 0;
+}
+
+/**
+ * rpcrdma_sendctx_get_locked - Acquire a send context
+ * @buf: transport buffers from which to acquire an unused context
+ *
+ * Returns pointer to a free send completion context; or NULL if
+ * the queue is empty.
+ *
+ * Usage: Called before Send WR is posted to acquire an
+ * SGE array.
+ *
+ * The caller serializes calls to this function (per rpcrdma_buffer),
+ * and provides an effective memory barrier that flushes the new value
+ * of rb_sc_head.
+ */
+struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf)
+{
+ struct rpcrdma_xprt *r_xprt;
+ struct rpcrdma_sendctx *sc;
+ unsigned long next_head;
+
+ next_head = rpcrdma_sendctx_next(buf, buf->rb_sc_head);
+
+ if (next_head == READ_ONCE(buf->rb_sc_tail))
+ goto out_emptyq;
+
+ /* ORDER: item must be accessed _before_ head is updated */
+ sc = buf->rb_sc_ctxs[next_head];
+
+ /* Releasing the lock in the caller acts as a memory
+ * barrier that flushes rb_sc_head.
+ */
+ buf->rb_sc_head = next_head;
+
+ return sc;
+
+out_emptyq:
+ /* The queue is "empty" if there have not been enough Send
+ * completions recently. This is a sign the Send Queue is
+ * backing up. Cause the caller to pause and try again.
+ */
+ dprintk("RPC: %s: empty sendctx queue\n", __func__);
+ r_xprt = container_of(buf, struct rpcrdma_xprt, rx_buf);
+ r_xprt->rx_stats.empty_sendctx_q++;
+ return NULL;
+}
+
+/**
+ * rpcrdma_sendctx_put_locked - Release a send context
+ * @sc: send context to release
+ *
+ * Usage: Called when Send completes to return a
+ * completed send context to the queue.
+ *
+ * The caller serializes calls to this function (per rpcrdma_buffer).
+ */
+void rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc)
+{
+ struct rpcrdma_buffer *buf = sc->sc_buffers;
+ unsigned long next_tail;
+
+ /* Unmap SGEs of previously completed but unsignaled
+ * Sends by walking up the queue until @sc is found.
+ */
+ next_tail = buf->rb_sc_tail;
+ do {
+ next_tail = rpcrdma_sendctx_next(buf, next_tail);
+
+ /* ORDER: item must be accessed _before_ tail is updated */
+ rpcrdma_unmap_send_sges(buf->rb_sc_ctxs[next_tail]);
+
+ } while (buf->rb_sc_ctxs[next_tail] != sc);
+
+ /* Paired with READ_ONCE */
+ smp_store_release(&buf->rb_sc_tail, next_tail);
+}
+
static void
rpcrdma_mr_recovery_worker(struct work_struct *work)
{
@@ -1041,6 +1213,10 @@ struct rpcrdma_rep *
list_add(&rep->rr_list, &buf->rb_recv_bufs);
}

+ rc = rpcrdma_sendctxs_create(r_xprt);
+ if (rc)
+ goto out;
+
return 0;
out:
rpcrdma_buffer_destroy(buf);
@@ -1117,6 +1293,8 @@ struct rpcrdma_rep *
cancel_delayed_work_sync(&buf->rb_recovery_worker);
cancel_delayed_work_sync(&buf->rb_refresh_worker);

+ rpcrdma_sendctxs_destroy(buf);
+
while (!list_empty(&buf->rb_recv_bufs)) {
struct rpcrdma_rep *rep;

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 93f3a36..1967e7a 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -93,6 +93,8 @@ enum {
*/

struct rpcrdma_ep {
+ unsigned int rep_send_count;
+ unsigned int rep_send_batch;
atomic_t rep_cqcount;
int rep_cqinit;
int rep_connected;
@@ -229,6 +231,20 @@ struct rpcrdma_rep {
struct ib_recv_wr rr_recv_wr;
};

+/* struct rpcrdma_sendctx - DMA mapped SGEs to unmap after Send completes
+ */
+struct rpcrdma_buffer;
+struct rpcrdma_sendctx {
+ struct ib_send_wr sc_wr;
+ struct ib_cqe sc_cqe;
+
+ struct rpcrdma_buffer *sc_buffers;
+
+ unsigned int sc_unmap_count;
+ struct ib_device *sc_device;
+ struct ib_sge sc_sges[];
+};
+
/*
* struct rpcrdma_mw - external memory region metadata
*
@@ -337,6 +353,14 @@ enum {
RPCRDMA_MAX_SEND_SGES = 1 + 1 + RPCRDMA_MAX_PAGE_SGES + 1,
};

+/* Limit the number of SGEs that can be unmapped during one
+ * Send completion. This caps the amount of work a single
+ * completion can do before returning to the provider.
+ */
+enum {
+ RPCRDMA_MAX_SEND_BATCH = 16,
+};
+
struct rpcrdma_buffer;
struct rpcrdma_req {
struct list_head rl_list;
@@ -400,6 +424,11 @@ struct rpcrdma_buffer {
struct list_head rb_mws;
struct list_head rb_all;

+ unsigned long rb_sc_head;
+ unsigned long rb_sc_tail;
+ unsigned long rb_sc_last;
+ struct rpcrdma_sendctx **rb_sc_ctxs;
+
spinlock_t rb_lock; /* protect buf lists */
int rb_send_count, rb_recv_count;
struct list_head rb_send_bufs;
@@ -455,6 +484,7 @@ struct rpcrdma_stats {
unsigned long mrs_recovered;
unsigned long mrs_orphaned;
unsigned long mrs_allocated;
+ unsigned long empty_sendctx_q;

/* accessed when receiving a reply */
unsigned long long total_rdma_reply;
@@ -556,6 +586,8 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
void rpcrdma_destroy_req(struct rpcrdma_req *);
int rpcrdma_buffer_create(struct rpcrdma_xprt *);
void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
+struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);
+void rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc);

static inline void
rpcrdma_insert_req(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req)
@@ -645,6 +677,7 @@ int rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
struct xdr_buf *xdr,
enum rpcrdma_chunktype rtype);
void rpcrdma_unmap_sges(struct rpcrdma_ia *, struct rpcrdma_req *);
+void rpcrdma_unmap_send_sges(struct rpcrdma_sendctx *sc);
int rpcrdma_marshal_req(struct rpcrdma_xprt *r_xprt, struct rpc_rqst *rqst);
void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *);
void rpcrdma_reply_handler(struct work_struct *work);


2017-09-05 17:00:32

by Chuck Lever

Subject: [PATCH RFC 2/5] xprtrdma: Change return value of rpcrdma_prepare_send_sges()

Clean up: Make rpcrdma_prepare_send_sges() return a negative errno
instead of a bool. Soon callers will want distinct treatments of
different types of failures.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/backchannel.c | 6 +++-
net/sunrpc/xprtrdma/rpc_rdma.c | 52 +++++++++++++++++++++++--------------
net/sunrpc/xprtrdma/xprt_rdma.h | 6 +++-
3 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index d31d0ac..90a0438 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -206,6 +206,7 @@ int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
__be32 *p;
+ int rc;

rpcrdma_set_xdrlen(&req->rl_hdrbuf, 0);
xdr_init_encode(&req->rl_stream, &req->rl_hdrbuf,
@@ -222,8 +223,9 @@ int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
*p++ = xdr_zero;
*p = xdr_zero;

- if (!rpcrdma_prepare_send_sges(&r_xprt->rx_ia, req, RPCRDMA_HDRLEN_MIN,
- &rqst->rq_snd_buf, rpcrdma_noch))
+ rc = rpcrdma_prepare_send_sges(r_xprt, req, RPCRDMA_HDRLEN_MIN,
+ &rqst->rq_snd_buf, rpcrdma_noch);
+ if (rc)
return -EIO;
return 0;
}
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 7ec4fde..71f43a2 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -522,7 +522,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,

if (unlikely(!rpcrdma_regbuf_is_mapped(rb))) {
if (!__rpcrdma_dma_map_regbuf(ia, rb))
- return false;
+ goto out_regbuf;
sge->addr = rdmab_addr(rb);
sge->lkey = rdmab_lkey(rb);
}
@@ -532,6 +532,10 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
sge->length, DMA_TO_DEVICE);
req->rl_send_wr.num_sge++;
return true;
+
+out_regbuf:
+ pr_err("rpcrdma: failed to DMA map a Send buffer\n");
+ return false;
}

/* Prepare the Send SGEs. The head and tail iovec, and each entry
@@ -552,7 +556,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
* DMA-mapped. Sync the content that has changed.
*/
if (!rpcrdma_dma_map_regbuf(ia, rb))
- return false;
+ goto out_regbuf;
sge_no = 1;
sge[sge_no].addr = rdmab_addr(rb);
sge[sge_no].length = xdr->head[0].iov_len;
@@ -639,6 +643,10 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
req->rl_mapped_sges += sge_no - 1;
return true;

+out_regbuf:
+ pr_err("rpcrdma: failed to DMA map a Send buffer\n");
+ return false;
+
out_mapping_overflow:
pr_err("rpcrdma: too many Send SGEs (%u)\n", sge_no);
return false;
@@ -648,26 +656,32 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
return false;
}

-bool
-rpcrdma_prepare_send_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
- u32 hdrlen, struct xdr_buf *xdr,
- enum rpcrdma_chunktype rtype)
+/**
+ * rpcrdma_prepare_send_sges - Construct SGEs for a Send WR
+ * @r_xprt: controlling transport
+ * @req: context of RPC Call being marshalled
+ * @hdrlen: size of transport header, in bytes
+ * @xdr: xdr_buf containing RPC Call
+ * @rtype: chunk type being encoded
+ *
+ * Returns 0 on success; otherwise a negative errno is returned.
+ */
+int
+rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
+ struct rpcrdma_req *req, u32 hdrlen,
+ struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
{
req->rl_send_wr.num_sge = 0;
req->rl_mapped_sges = 0;

- if (!rpcrdma_prepare_hdr_sge(ia, req, hdrlen))
- goto out_map;
+ if (!rpcrdma_prepare_hdr_sge(&r_xprt->rx_ia, req, hdrlen))
+ return -EIO;

if (rtype != rpcrdma_areadch)
- if (!rpcrdma_prepare_msg_sges(ia, req, xdr, rtype))
- goto out_map;
-
- return true;
+ if (!rpcrdma_prepare_msg_sges(&r_xprt->rx_ia, req, xdr, rtype))
+ return -EIO;

-out_map:
- pr_err("rpcrdma: failed to DMA map a Send buffer\n");
- return false;
+ return 0;
}

void
@@ -835,12 +849,10 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
transfertypes[rtype], transfertypes[wtype],
xdr_stream_pos(xdr));

- if (!rpcrdma_prepare_send_sges(&r_xprt->rx_ia, req,
- xdr_stream_pos(xdr),
- &rqst->rq_snd_buf, rtype)) {
- ret = -EIO;
+ ret = rpcrdma_prepare_send_sges(r_xprt, req, xdr_stream_pos(xdr),
+ &rqst->rq_snd_buf, rtype);
+ if (ret)
goto out_err;
- }
return 0;

out_err:
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 45dab24..93f3a36 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -640,8 +640,10 @@ enum rpcrdma_chunktype {
rpcrdma_replych
};

-bool rpcrdma_prepare_send_sges(struct rpcrdma_ia *, struct rpcrdma_req *,
- u32, struct xdr_buf *, enum rpcrdma_chunktype);
+int rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
+ struct rpcrdma_req *req, u32 hdrlen,
+ struct xdr_buf *xdr,
+ enum rpcrdma_chunktype rtype);
void rpcrdma_unmap_sges(struct rpcrdma_ia *, struct rpcrdma_req *);
int rpcrdma_marshal_req(struct rpcrdma_xprt *r_xprt, struct rpc_rqst *rqst);
void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *);


2017-09-05 17:00:58

by Chuck Lever

Subject: [PATCH RFC 5/5] xprtrdma: Remove atomic send completion counting

Verbs providers require that a posted send has to be signaled every
so often. This is because the providers perform send queue house-
keeping during the send completion.

However, send completion batching now guarantees send operations
are signaled often enough that managing a count of posted sends and
signaling every so often is no longer necessary for xprtrdma.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 8 --------
net/sunrpc/xprtrdma/verbs.c | 4 ----
net/sunrpc/xprtrdma/xprt_rdma.h | 21 ---------------------
3 files changed, 33 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 5a936a6..f366a05 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -419,7 +419,6 @@
IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
IB_ACCESS_REMOTE_READ;

- rpcrdma_set_signaled(&r_xprt->rx_ep, &reg_wr->wr);
rc = ib_post_send(ia->ri_id->qp, &reg_wr->wr, &bad_wr);
if (rc)
goto out_senderr;
@@ -507,12 +506,6 @@
f->fr_cqe.done = frwr_wc_localinv_wake;
reinit_completion(&f->fr_linv_done);

- /* Initialize CQ count, since there is always a signaled
- * WR being posted here. The new cqcount depends on how
- * many SQEs are about to be consumed.
- */
- rpcrdma_init_cqcount(&r_xprt->rx_ep, count);
-
/* Transport disconnect drains the receive CQ before it
* replaces the QP. The RPC reply handler won't call us
* unless ri_id->qp is a valid pointer.
@@ -545,7 +538,6 @@
/* Find and reset the MRs in the LOCAL_INV WRs that did not
* get posted.
*/
- rpcrdma_init_cqcount(&r_xprt->rx_ep, -count);
while (bad_wr) {
f = container_of(bad_wr, struct rpcrdma_frmr,
fr_invwr);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 75899ba..fb84418 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -575,10 +575,6 @@
ep->rep_send_batch = min_t(unsigned int, RPCRDMA_MAX_SEND_BATCH,
cdata->max_requests >> 2);
ep->rep_send_count = ep->rep_send_batch;
- ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
- if (ep->rep_cqinit <= 2)
- ep->rep_cqinit = 0; /* always signal? */
- rpcrdma_init_cqcount(ep, 0);
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 8f8e2dd..660fa0e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -95,8 +95,6 @@ enum {
struct rpcrdma_ep {
unsigned int rep_send_count;
unsigned int rep_send_batch;
- atomic_t rep_cqcount;
- int rep_cqinit;
int rep_connected;
struct ib_qp_init_attr rep_attr;
wait_queue_head_t rep_connect_wait;
@@ -106,25 +104,6 @@ struct rpcrdma_ep {
struct delayed_work rep_connect_worker;
};

-static inline void
-rpcrdma_init_cqcount(struct rpcrdma_ep *ep, int count)
-{
- atomic_set(&ep->rep_cqcount, ep->rep_cqinit - count);
-}
-
-/* To update send queue accounting, provider must take a
- * send completion every now and then.
- */
-static inline void
-rpcrdma_set_signaled(struct rpcrdma_ep *ep, struct ib_send_wr *send_wr)
-{
- send_wr->send_flags = 0;
- if (unlikely(atomic_sub_return(1, &ep->rep_cqcount) <= 0)) {
- rpcrdma_init_cqcount(ep, 0);
- send_wr->send_flags = IB_SEND_SIGNALED;
- }
-}
-
/* Pre-allocate extra Work Requests for handling backward receives
* and sends. This is a fixed value because the Work Queues are
* allocated when the forward channel is set up.


2017-09-05 20:06:14

by Jason Gunthorpe

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On Tue, Sep 05, 2017 at 01:00:10PM -0400, Chuck Lever wrote:

> - Send completions are batched to reduce interrupts, but still
> provide a periodic heartbeat signal for SQ housekeeping

I would scrub this commentary, it is very misleading.

The idea of a periodic completion does not match how verbs works at
all, it was an incomplete root cause analysis from a HCA that uses
different rules for freeing space in the SQ.

Instead, I would say this series creates strong SQ accounting and
properly guarantees the SQ can never overflow by only releasing SQEs
back into the pool when the HCA has confirmed they are completed via a
CQ.

Jason

2017-09-05 21:22:22

by Chuck Lever

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


> On Sep 5, 2017, at 4:06 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Sep 05, 2017 at 01:00:10PM -0400, Chuck Lever wrote:
>
>> - Send completions are batched to reduce interrupts, but still
>> provide a periodic heartbeat signal for SQ housekeeping
>
> I would scrub this commentary, it is very misleading.
>
> The idea of a periodic completion does not match how verbs works at
> all, it was an incomplete root cause analysis from a HCA that uses
> different rules for freeing space in the SQ.

I think it does bear mentioning that, given this diagnosis, it is
still safe to remove the ib_post_send counting mechanism in 5/5,
which has been in xprtrdma for as long as I can recall, and has
been effective (with a few minor adjustments) at preventing SQ
overflow.

I'm not able to test this change with every HCA the Linux kernel
currently supports, unfortunately. The best I can do is offer a
"proof of correctness" and hope that vendors will jump on this
and try it out.


> Instead, I would say this series creates strong SQ accounting and
> properly guarantees the SQ can never overflow by only releasing SQEs
> back into the pool when the HCA has confirmed they are completed via a
> CQ.

I will adjust the cover letter (and patch descriptions as necessary)
next time I post this series. Thanks for your suggestions and review.


--
Chuck Lever




2017-09-05 21:50:35

by Chuck Lever

Subject: Re: [PATCH RFC 4/5] xprtrdma: Manage RDMA Send arguments via lock-free circular queue


> On Sep 5, 2017, at 1:00 PM, Chuck Lever <[email protected]> wrote:
>
> Hook rpcrdma_ep_post and the send completion handler up with the new
> send ctxs, and remove Send buffer DMA unmapping from xprt_rdma_free.
>
> Care must be taken if an error occurs before the Send is posted.
> In this case, the send_ctx has to be popped back onto the circular
> queue, since it will never be completed or flushed. This is done
> before send_request's call allows another RPC Call to proceed.

This second paragraph is obsolete. Please ignore it.



--
Chuck Lever




2017-09-05 22:03:32

by Jason Gunthorpe

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On Tue, Sep 05, 2017 at 05:22:04PM -0400, Chuck Lever wrote:
>
> > On Sep 5, 2017, at 4:06 PM, Jason Gunthorpe <[email protected]> wrote:
> >
> > On Tue, Sep 05, 2017 at 01:00:10PM -0400, Chuck Lever wrote:
> >
> >> - Send completions are batched to reduce interrupts, but still
> >> provide a periodic heartbeat signal for SQ housekeeping
> >
> > I would scrub this commentary, it is very misleading.
> >
> > The idea of a periodic completion does not match how verbs works at
> > all, it was an incomplete root cause analysis from a HCA that uses
> > different rules for freeing space in the SQ.
>
> I think it does bear mentioning that, given this diagnosis, it is
> still safe to remove the ib_post_send counting mechanism in 5/5,
> which has been in xprtrdma for as long as I can recall, and has
> been effective (with a few minor adjustments) at preventing SQ
> overflow.

Sure, but let's just talk about it in the context of ensuring the SQ
does not go full, and not some nebulous idea of heartbeat.

The new approach directly prevents overflow, and the past failings in
NFS all boiled down to stuffing SQEs into a full SQ, as some NICs do
not 'empty' the SQ until the CQ is generated.

> I'm not able to test this change with every HCA the Linux kernel
> currently supports, unfortunately. The best I can do is offer a
> "proof of correctness" and hope that vendors will jump on this
> and try it out.

Assuming it is implemented properly in NFS, any HCA that fails with
this is broken by definition. An HCA must work correctly if the SQ is
full, and all but the last entry are unsignaled.

Jason

2017-09-06 01:36:43

by Tom Talpey

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On 9/5/2017 5:22 PM, Chuck Lever wrote:
> I think it does bear mentioning that, given this diagnosis, it is
> still safe to remove the ib_post_send counting mechanism in 5/5,
> which has been in xprtrdma for as long as I can recall, and has

2007. :-)

> been effective (with a few minor adjustments) at preventing SQ
> overflow.
>
> I'm not able to test this change with every HCA the Linux kernel
> currently supports, unfortunately. The best I can do is offer a
> "proof of correctness" and hope that vendors will jump on this
> and try it out.

Someday, a proper set of verbs test cases? Alas.

Tom.

2017-09-06 11:54:24

by Sagi Grimberg

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

> Hi Jason, Sagi-

Hey Chuck,

> As we discussed a few weeks ago, this patch series implements the
> following:
>
> - Send SGEs are now managed via lock-less, wait-free circular queues
> - Send SGEs referring to page cache pages are DMA unmapped during
> Send completion
> - Send completions are batched to reduce interrupts, but still
> provide a periodic heartbeat signal for SQ housekeeping
> - The circular queue prevents Send Queue overflow
>
> The purpose of this change is to address the issue Sagi reported
> where the HCA continues to retry a delayed Send request _after_ RPC
> completion, resulting in a DMA error.

Question, what happens in direct-io for example? Can a mapped buffer be
reclaimed/free'd before the send completion arrives?

2017-09-06 14:15:57

by Chuck Lever

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


> On Sep 6, 2017, at 7:54 AM, Sagi Grimberg <[email protected]> wrote:
>
>> Hi Jason, Sagi-
>
> Hey Chuck,
>
>> As we discussed a few weeks ago, this patch series implements the
>> following:
>> - Send SGEs are now managed via lock-less, wait-free circular queues
>> - Send SGEs referring to page cache pages are DMA unmapped during
>> Send completion
>> - Send completions are batched to reduce interrupts, but still
>> provide a periodic heartbeat signal for SQ housekeeping
>> - The circular queue prevents Send Queue overflow
>> The purpose of this change is to address the issue Sagi reported
>> where the HCA continues to retry a delayed Send request _after_ RPC
>> completion, resulting in a DMA error.
>
> Question, what happens in direct-io for example? Can a mapped buffer be
> reclaimed/free'd before the send completion arrives?

Good Q! RPC completion allows memory containing the arguments and
results to be re-used. IIRC our conclusion was that a retransmitted
Send could expose the wrong argument data on the wire in this case.

Buffer re-use implies that the RPC has completed. Either a matching
RPC Reply was received, or the RPC was terminated via a POSIX signal.

If the client has already received an RPC Reply for this transaction,
a previous transmission of the RPC Call has already executed on the
server, and this retransmission will be ignored. Its only purpose is
to generate an appropriate RDMA ACK.

A re-used buffer might be subsequently used for data that is sensitive,
and the retransmission will expose that data on the wire. To protect
against that, RPC can use a GSS flavor that protects confidentiality
of RPC arguments and results. This would also require RPC-over-RDMA
to use only RDMA Read to convey RPC Call messages. Send would be used
only to convey the chunk lists, never data.

Note that the buffers used to construct RPC Calls are always mapped
and Send uses the local DMA key to post them. These can also be
re-used immediately after RPC completion. The exposure risk there is
of RPC headers and non-data arguments.


--
Chuck Lever




2017-09-06 14:17:30

by Chuck Lever

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


> On Sep 5, 2017, at 6:03 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Sep 05, 2017 at 05:22:04PM -0400, Chuck Lever wrote:
>>
>>> On Sep 5, 2017, at 4:06 PM, Jason Gunthorpe <[email protected]> wrote:
>>>
>>> On Tue, Sep 05, 2017 at 01:00:10PM -0400, Chuck Lever wrote:
>>>
>>>> - Send completions are batched to reduce interrupts, but still
>>>> provide a periodic heartbeat signal for SQ housekeeping
>>>
>>> I would scrub this commentary, it is very misleading.
>>>
>>> The idea of a periodic completion does not match how verbs works at
>>> all, it was an incomplete root cause analysis from a HCA that uses
>>> different rules for freeing space in the SQ.
>>
>> I think it does bear mentioning that, given this diagnosis, it is
>> still safe to remove the ib_post_send counting mechanism in 5/5,
>> which has been in xprtrdma for as long as I can recall, and has
>> been effective (with a few minor adjustments) at preventing SQ
>> overflow.
>
> Sure, but let's just talk about it in the context of ensuring the SQ
> does not go full, and not some nebulous idea of heartbeat.
>
> The new approach directly prevents overflow, and the past failings in
> NFS all boiled down to stuffing SQEs into a full SQ, as some NICs do
> not 'empty' the SQ until the CQ is generated.

Understood. There are one or two code comments introduced by this
series that will need to be similarly adjusted.


>> I'm not able to test this change with every HCA the Linux kernel
>> currently supports, unfortunately. The best I can do is offer a
>> "proof of correctness" and hope that vendors will jump on this
>> and try it out.
>
> Assuming it is implemented properly in NFS, any HCA that fails with
> this is broken by definition. An HCA must work correctly if the SQ is
> full, and all but the last entry are unsignaled.
>
> Jason

--
Chuck Lever




2017-09-06 14:29:12

by Sagi Grimberg

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


>> Question, what happens in direct-io for example? Can a mapped buffer be
>> reclaimed/free'd before the send completion arrives?
>
> Good Q! RPC completion allows memory containing the arguments and
> results to be re-used. IIRC our conclusion was that a retransmitted
> Send could expose the wrong argument data on the wire in this case.
>
> Buffer re-use implies that the RPC has completed. Either a matching
> RPC Reply was received, or the RPC was terminated via a POSIX signal.
>
> If the client has already received an RPC Reply for this transaction,
> a previous transmission of the RPC Call has already executed on the
> server, and this retransmission will be ignored. Its only purpose is
> to generate an appropriate RDMA ACK.
>
> A re-used buffer might be subsequently used for data that is sensitive,
> and the retransmission will expose that data on the wire.

That was where I was going with this...

> To protect
> against that, RPC can use a GSS flavor that protects confidentiality
> of RPC arguments and results. This would also require RPC-over-RDMA
> to use only RDMA Read to convey RPC Call messages. Send would be used
> only to convey the chunk lists, never data.
>
> Note that the buffers used to construct RPC Calls are always mapped
> and Send uses the local DMA key to post them. These can also be
> re-used immediately after RPC completion. The exposure risk there is
> of RPC headers and non-data arguments.

I see, but how can the user know that it needs to use RPCSEC_GSS
otherwise nfs/rdma might compromise sensitive data? And is this
a valid constraint? (just asking, you're the expert)

2017-09-06 15:12:07

by Chuck Lever

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


> On Sep 6, 2017, at 10:29 AM, Sagi Grimberg <[email protected]> wrote:
>
>
>>> Question, what happens in direct-io for example? Can a mapped buffer be
>>> reclaimed/free'd before the send completion arrives?
>> Good Q! RPC completion allows memory containing the arguments and
>> results to be re-used. IIRC our conclusion was that a retransmitted
>> Send could expose the wrong argument data on the wire in this case.
>> Buffer re-use implies that the RPC has completed. Either a matching
>> RPC Reply was received, or the RPC was terminated via a POSIX signal.
>> If the client has already received an RPC Reply for this transaction,
>> a previous transmission of the RPC Call has already executed on the
>> server, and this retransmission will be ignored. Its only purpose is
>> to generate an appropriate RDMA ACK.
>> A re-used buffer might be subsequently used for data that is sensitive,
>> and the retransmission will expose that data on the wire.
>
> That was where I was going with this...
>
>> To protect
>> against that, RPC can use a GSS flavor that protects confidentiality
>> of RPC arguments and results. This would also require RPC-over-RDMA
>> to use only RDMA Read to convey RPC Call messages. Send would be used
>> only to convey the chunk lists, never data.
>> Note that the buffers used to construct RPC Calls are always mapped
>> and Send uses the local DMA key to post them. These can also be
>> re-used immediately after RPC completion. The exposure risk there is
>> of RPC headers and non-data arguments.
>
> I see, but how can the user know that it needs to use RPCSEC_GSS
> otherwise nfs/rdma might compromise sensitive data? And is this
> a valid constraint? (just asking, you're the expert)

sec=krb5p is used in cases where data on the wire must remain
confidential. Otherwise, sensitive or no, data on the wire goes
in the clear.

But an administrator might not expect that other sensitive data
on the client (not involved with NFS) can be placed on the wire
by the vagaries of memory allocation and hardware retransmission,
as exceptionally rare as that might be.

Memory in which Send data resides is donated to the device until
the Send completion fires: the ULP has no way to get it back in
the meantime. ULPs can invalidate memory used for RDMA Read at
any time, but Send memory is registered with the local DMA key
(as anything else is just as expensive as an RDMA data transfer).

The immediate solution is to never use Send to move file data
directly. It will always have to be copied into a buffer or
we use RDMA Read. These buffers contain only data that is
destined for the wire. Does that close the unwanted exposure
completely?

If the HCA can guarantee that all Sends complete quickly (either
successful, flush, or time out after a few seconds) then it could
be fair to make RPC completion also wait for Send completion.
Otherwise, a ^C on a file operation targeting an unreachable
server will hang indefinitely.


--
Chuck Lever




2017-09-06 15:23:21

by Sagi Grimberg

Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


>> I see, but how can the user know that it needs to use RPCSEC_GSS
>> otherwise nfs/rdma might compromise sensitive data? And is this
>> a valid constraint? (just asking, you're the expert)
>
> sec=krb5p is used in cases where data on the wire must remain
> confidential. Otherwise, sensitive or no, data on the wire goes
> in the clear.
>
> But an administrator might not expect that other sensitive data
> on the client (not involved with NFS) can be placed on the wire
> by the vagaries of memory allocation and hardware retransmission,
> as exceptionally rare as that might be.
>
> Memory in which Send data resides is donated to the device until
> the Send completion fires: the ULP has no way to get it back in
> the meantime. ULPs can invalidate memory used for RDMA Read at
> any time, but Send memory is registered with the local DMA key
> (as anything else is just as expensive as an RDMA data transfer).
>
> The immediate solution is to never use Send to move file data
> directly. It will always have to be copied into a buffer or
> we use RDMA Read. These buffers contain only data that is
> destined for the wire. Does that close the unwanted exposure
> completely?

It would, but is that a smaller sacrifice than signaling
send completions for small writes?

> If the HCA can guarantee that all Sends complete quickly (with
> success, a flush, or a timeout after a few seconds), then it could
> be fair to make RPC completion also wait for Send completion.
> Otherwise, a ^C on a file operation targeting an unreachable
> server will hang indefinitely.

You could set retry_count=0/1, which will fail after zero or one
send retries (a matter of seconds), but that would make the QP go to
error state, which is probably not what we want...
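
Concretely, the knob I mean is the retry_count field in the RDMA CM
connection parameters, set at connect time. A minimal kernel-side
sketch (the helper name is invented; this is not xprtrdma code):

#include <rdma/rdma_cm.h>

/* Sketch only: connect with at most one transport retry, so that a
 * Send to an unreachable peer fails and flushes within seconds
 * instead of being retried for a long time.
 */
static int connect_with_low_retry(struct rdma_cm_id *id)
{
        struct rdma_conn_param param = {
                .retry_count     = 1,   /* transport retries: 0 or 1 */
                .rnr_retry_count = 7,   /* RNR NAK retries, unrelated */
        };

        return rdma_connect(id, &param);
}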

2017-09-06 18:34:06

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


> On Sep 6, 2017, at 11:23 AM, Sagi Grimberg <[email protected]> wrote:
>
>
>>> I see, but how can the user know that it needs to use RPCSEC_GSS
>>> otherwise nfs/rdma might compromise sensitive data? And is this
>>> a valid constraint? (just asking, you're the expert)
>> sec=krb5p is used in cases where data on the wire must remain
>> confidential. Otherwise, sensitive or no, data on the wire goes
>> in the clear.
>> But an administrator might not expect that other sensitive data
>> on the client (not involved with NFS) can be placed on the wire
>> by the vagaries of memory allocation and hardware retransmission,
>> as exceptionally rare as that might be.
>> Memory in which Send data resides is donated to the device until
>> the Send completion fires: the ULP has no way to get it back in
>> the meantime. ULPs can invalidate memory used for RDMA Read at
>> any time, but Send memory is registered with the local DMA key
>> (as anything else is just as expensive as an RDMA data transfer).

>> The immediate solution is to never use Send to move file data
>> directly. It will always have to be copied into a buffer or
>> we use RDMA Read. These buffers contain only data that is
>> destined for the wire. Does that close the unwanted exposure
>> completely?
>
> It would, but is that a smaller sacrifice than signaling
> send completions for small writes?

Recall that if there's no file data, the transport will
utilize a persistently registered and DMA mapped buffer
that it owns in which to build the RPC Call message and
post the Send.

If there is file data, the same buffer is used, but the
memory containing the file data is DMA mapped and added
to the Send SGE list.

With sendctx, every 16th Send [*] is signaled whether it
is carrying extra SGEs that need to be unmapped, or not.
All other Sends are not signaled. This guarantees correct
Send Queue accounting for all workloads and registration
modes, using a minimum number of completions.

During each Send completion, the handler walks through
SGEs since the last completion, and unmaps them if needed.

If we choose never to do scatter-gather Send with file
data, then this last step is unneeded because then only
persistently registered and mapped buffers would be used
for sending RPC Call messages.

But note that either mechanism results in the same Send
completion rate.


[*] 16 is adjusted down to accommodate smaller Send
Queues as needed.
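
To make that concrete, here is a rough sketch of the mechanism. This
is not the code in the series; the structure and names are invented
for illustration:

#include <rdma/ib_verbs.h>

#define SIGNAL_EVERY 16                 /* adjusted down for small SQs */

/* One of these travels with each posted Send. */
struct send_ctx {
        struct ib_cqe           sc_cqe;         /* .done = send_done */
        struct ib_device        *sc_device;
        unsigned int            sc_unmap_count; /* page SGEs to unmap */
        struct ib_sge           sc_sges[16];
};

/* Signal only every SIGNAL_EVERY-th Send, so that Send Queue
 * accounting costs a minimum number of completion interrupts.
 * @since_signal is a per-connection counter.
 */
static int post_send(struct ib_qp *qp, unsigned int *since_signal,
                     struct send_ctx *sc, struct ib_send_wr *wr)
{
        struct ib_send_wr *bad_wr;

        wr->wr_cqe = &sc->sc_cqe;
        if (++(*since_signal) >= SIGNAL_EVERY) {
                wr->send_flags |= IB_SEND_SIGNALED;
                *since_signal = 0;
        }
        return ib_post_send(qp, wr, &bad_wr);
}

/* Send completion: unmap the page SGEs this context carried. The
 * real handler also walks every context consumed since the previous
 * signaled Send and releases them back to the circular queue.
 */
static void send_done(struct ib_cq *cq, struct ib_wc *wc)
{
        struct send_ctx *sc =
                container_of(wc->wr_cqe, struct send_ctx, sc_cqe);
        unsigned int i;

        /* sc_sges[0] is the persistently mapped header buffer */
        for (i = 1; i <= sc->sc_unmap_count; i++)
                ib_dma_unmap_page(sc->sc_device, sc->sc_sges[i].addr,
                                  sc->sc_sges[i].length, DMA_TO_DEVICE);
}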

>> If the HCA can guarantee that all Sends complete quickly (with
>> success, a flush, or a timeout after a few seconds), then it could
>> be fair to make RPC completion also wait for Send completion.
>> Otherwise, a ^C on a file operation targeting an unreachable
>> server will hang indefinitely.
>
> You could set retry_count=0/1, which will fail after zero or one
> send retries (a matter of seconds), but that would make the QP go to
> error state, which is probably not what we want...

I'm told that not letting the hardware try as hard as it
can to transmit a Send is asking for data corruption. Thus
the current setting is 6. That should cause a timeout in
less than a minute? It depends on the HCA, I guess.

Dropping the connection is desirable to force a full
reconnect (with address resolution) and to kick off
another Send. It is undesirable because it will also
interrupt all other outstanding RPCs on that connection.

As I see it, the options are to apply sendctx (this series),
and then:

A. Remove the post-v4.6 scatter-gather code, or

B. Force RPC completion to wait for Send completion, which
would allow the post-v4.6 scatter-gather code to work
safely. This would need some guarantee that Sends will
always complete in a short period.

For B, the signaling scheme would have to be: signal
non-data-bearing Send every so often, but signal all
data-bearing Sends. RPC completion would have to be able
to tell the difference and wait as needed. I can probably
handle this by adding a couple of atomic bits in struct
rpcrdma_req.
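
Roughly, option B could look like this. It is only a sketch: the
flag names and helpers are invented, not actual rpcrdma_req fields:

#include <linux/bitops.h>
#include <linux/wait_bit.h>
#include <linux/sched.h>

/* Two bits that would live in something like rpcrdma_req::rl_flags. */
enum {
        REQ_F_TX_DATA,          /* this Send carries file-data SGEs */
        REQ_F_TX_PENDING,       /* Send posted, completion not yet seen */
};

/* Before posting a data-bearing Send (always signaled): */
static void mark_send_pending(unsigned long *flags)
{
        set_bit(REQ_F_TX_DATA, flags);
        set_bit(REQ_F_TX_PENDING, flags);
}

/* From the Send completion handler: */
static void send_completed(unsigned long *flags)
{
        clear_bit_unlock(REQ_F_TX_PENDING, flags);
        smp_mb__after_atomic();
        wake_up_bit(flags, REQ_F_TX_PENDING);
}

/* At RPC completion: wait only if this request's Send carried data. */
static void wait_for_send_done(unsigned long *flags)
{
        if (test_bit(REQ_F_TX_DATA, flags))
                wait_on_bit(flags, REQ_F_TX_PENDING, TASK_UNINTERRUPTIBLE);
}

Whether that wait can safely be uninterruptible is exactly the
"Sends must complete in a short period" condition above.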

A. seems like the more straightforward approach.


--
Chuck Lever




2017-09-06 19:39:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On Wed, Sep 06, 2017 at 02:33:50PM -0400, Chuck Lever wrote:

> B. Force RPC completion to wait for Send completion, which
> would allow the post-v4.6 scatter-gather code to work
> safely. This would need some guarantee that Sends will
> always complete in a short period.

Why is waiting for the send completion so fundamentally different from
waiting for the remote RPC reply?

I would say that 99% of the time the send completion and RPC reply
completion will occur approximately concurrently.

e.g. it is quite likely the RPC reply SEND carries an embedded ack
for the requesting SEND.

Jason

2017-09-06 20:02:43

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


> On Sep 6, 2017, at 3:39 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Wed, Sep 06, 2017 at 02:33:50PM -0400, Chuck Lever wrote:
>
>> B. Force RPC completion to wait for Send completion, which
>> would allow the post-v4.6 scatter-gather code to work
>> safely. This would need some guarantee that Sends will
>> always complete in a short period.
>
> Why is waiting for the send completion so fundamentally different from
> waiting for the remote RPC reply?
>
> I would say that 99% of the time the send completion and RPC reply
> completion will occur approximately concurrently.
>
> e.g. it is quite likely the RPC reply SEND carries an embedded ack
> for the requesting SEND.

Depends on implementation. Average RTT on IB is 3-5 usecs.
Average RPC RTT is about an order of magnitude more. Typically
the Send is ACK'd more quickly than the RPC Reply can be sent.

But I get your point: the normal case isn't a problem.

The problematic case arises when the Send is not able to complete
because the NFS server is not reachable. User starts pounding on
^C, RPC can't complete because Send won't complete, control
doesn't return to user.

The Send Queue is blocked in that case anyway. That should result
in the sendctx queue becoming empty, so that RPCs can't even post
another Send, and we avoid the problem with newly issued RPCs.
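
Roughly, the backpressure looks like this. This is an illustrative
sketch of the circular queue only, not the code in verbs.c, and the
names are made up:

#include <linux/kernel.h>
#include <asm/barrier.h>

struct send_ctx;        /* as in the earlier sketch */

/* The Send path hands out free contexts by advancing sq_head; the
 * Send completion handler retires them by advancing sq_tail. When
 * sq_head catches up to sq_tail, every context is in flight.
 */
struct sendctx_queue {
        unsigned long   sq_head;        /* owned by the Send path */
        unsigned long   sq_tail;        /* owned by the completion path */
        unsigned long   sq_size;        /* power of two */
        struct send_ctx *sq_ctxs[];     /* array of contexts */
};

static struct send_ctx *sendctx_get(struct sendctx_queue *q)
{
        unsigned long next = (q->sq_head + 1) & (q->sq_size - 1);

        /* Pairs with smp_store_release() in sendctx_retire() */
        if (next == smp_load_acquire(&q->sq_tail))
                return NULL;    /* all contexts in flight: no more Sends */
        q->sq_head = next;
        return q->sq_ctxs[next];
}

/* Called from the Send completion handler after it has unmapped the
 * SGEs of every context up to and including slot @retired.
 */
static void sendctx_retire(struct sendctx_queue *q, unsigned long retired)
{
        smp_store_release(&q->sq_tail, retired);
}

If the Send Queue is blocked, sendctx_retire() never runs, so
sendctx_get() soon returns NULL and the caller backs off.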

We really do prefer the mechanics of the transport to be detached
from the ability of the RPC client and above to respond to user
signals, IMO.


--
Chuck Lever




2017-09-06 20:10:03

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On Wed, Sep 06, 2017 at 04:02:24PM -0400, Chuck Lever wrote:
>
> > On Sep 6, 2017, at 3:39 PM, Jason Gunthorpe <[email protected]> wrote:
> >
> > On Wed, Sep 06, 2017 at 02:33:50PM -0400, Chuck Lever wrote:
> >
> >> B. Force RPC completion to wait for Send completion, which
> >> would allow the post-v4.6 scatter-gather code to work
> >> safely. This would need some guarantee that Sends will
> >> always complete in a short period.
> >
> > Why is waiting for the send completion so fundamentally different from
> > waiting for the remote RPC reply?
> >
> > I would say that 99% of the time the send completion and RPC reply
> > completion will occur approximately concurrently.
> >
> > e.g. it is quite likely the RPC reply SEND carries an embedded ack
> > for the requesting SEND.
>
> Depends on implementation. Average RTT on IB is 3-5 usecs.
> Average RPC RTT is about an order of magnitude more. Typically
> the Send is ACK'd more quickly than the RPC Reply can be sent.
>
> But I get your point: the normal case isn't a problem.
>
> The problematic case arises when the Send is not able to complete
> because the NFS server is not reachable. User starts pounding on
> ^C, RPC can't complete because Send won't complete, control
> doesn't return to user.

Sure, but why is that so different from the NFS server not generating
a response?

I thought you already implemented a ctrl-c scheme that killed the QP?
Or was that just a discussion?

That is the only way to async terminate outstanding RPCs and clean
up. Killing the QP will allow the send to be 'completed'.

Having ctrl-c escalate to a QP teardown after a short timeout seems
reasonable. 99% of cases will not need the teardown since the send
will complete.

How does sockets based NFS handle this? Doesn't it zero copy from these
same buffers into SKBs? How does it cancel the SKBs before the NIC
transmits them?

Seems like exactly the same kind of problem to me..

Jason

2017-09-06 21:00:42

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching


> On Sep 6, 2017, at 4:09 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Wed, Sep 06, 2017 at 04:02:24PM -0400, Chuck Lever wrote:
>>
>>> On Sep 6, 2017, at 3:39 PM, Jason Gunthorpe <[email protected]> wrote:
>>>
>>> On Wed, Sep 06, 2017 at 02:33:50PM -0400, Chuck Lever wrote:
>>>
>>>> B. Force RPC completion to wait for Send completion, which
>>>> would allow the post-v4.6 scatter-gather code to work
>>>> safely. This would need some guarantee that Sends will
>>>> always complete in a short period.
>>>
>>> Why is waiting for the send completion so fundamentally different from
>>> waiting for the remote RPC reply?
>>>
>>> I would say that 99% of the time the send completion and RPC reply
>>> completion will occur approximately concurrently.
>>>
>>> e.g. it is quite likely the RPC reply SEND carries an embedded ack
>>> for the requesting SEND.
>>
>> Depends on implementation. Average RTT on IB is 3-5 usecs.
>> Average RPC RTT is about an order of magnitude more. Typically
>> the Send is ACK'd more quickly than the RPC Reply can be sent.
>>
>> But I get your point: the normal case isn't a problem.
>>
>> The problematic case arises when the Send is not able to complete
>> because the NFS server is not reachable. User starts pounding on
>> ^C, RPC can't complete because Send won't complete, control
>> doesn't return to user.
>
> Sure, but why is that so different from the NFS server not generating
> a response?
>
> I thought you already implemented a ctrl-c scheme that killed the QP?
> Or was that just a discussion?

No, we want to _avoid_ killing the QP if we can. A ctrl-C (or a
timer signal, say) on an otherwise healthy connection must not
perturb other outstanding RPCs, if possible.

What I implemented was a scheme to invalidate the memory of a
(POSIX) signaled RPC before it completes, in case the RPC Reply
hadn't yet arrived.

Currently, the only time the QP might be killed is if the server
attempts to RDMA Write an RPC Reply into one of these invalidated
memory regions. That case can't be avoided with the current RPC-
over-RDMA protocol.


> That is the only way to async terminate outstanding RPCs and clean
> up. Killing the QP will allow the send to be 'completed'.

It forces outstanding Sends to flush.

But as you explained it at the time, xprtrdma needs to wait
somehow for the QP to complete its transition to error state
before allowing RPCs to complete. Probably ib_drain_qp would be
enough.
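
i.e. something like this ordering (a sketch only; the helper name
is invented):

#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>

/* Sketch: break the connection, then wait for every posted WR to
 * flush before letting signaled RPCs complete.
 */
static void force_disconnect(struct rdma_cm_id *id, struct ib_qp *qp)
{
        rdma_disconnect(id);
        ib_drain_qp(qp);        /* returns once all SQ/RQ WRs complete */

        /* now it is safe to complete waiting RPCs and reconnect */
}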

And again, we want to preserve the connection if it is healthy.


> Having ctrl-c escalate to a QP teardown after a short timeout seems
> reasonable. 99% of cases will not need the teardown since the send
> will complete.

So I think we are partially there already. If an RPC timeout occurs
(which should be after a few minutes) then xprtrdma does disconnect,
which tears down the QP.

If a timer signal fires on an RPC waiting for a server that is
unreachable, the application won't see the signal until the RPC
times out. Maybe that's how it works now?

And, otherwise, a ^C on an app waiting for an unresponsive server
will not have immediate results. But again, I think that's how it
works now.


> How does sockets based NFS handle this? Doesn't it zero copy from these
> same buffers into SKBs? How does it cancel the SKBs before the NIC
> transmits them?
>
> Seems like exactly the same kind of problem to me..

TCP has keep-alive, where the sockets consumer is notified as soon
as the network layer determines the remote is unresponsive. The
connection is closed from underneath the consumer.

For RDMA, which has no keep-alive mechanism, we seem to be going
with waiting for the RPC to time out, then the consumer itself
breaks the connection.


--
Chuck Lever




2017-09-06 21:11:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On Wed, Sep 06, 2017 at 05:00:29PM -0400, Chuck Lever wrote:

> What I implemented was a scheme to invalidate the memory of a
> (POSIX) signaled RPC before it completes, in case the RPC Reply
> hadn't yet arrived.
>
> Currently, the only time the QP might be killed is if the server
> attempts to RDMA Write an RPC Reply into one of these invalidated
> memory regions. That case can't be avoided with the current RPC-
> over-RDMA protocol.

Okay..

> And again, we want to preserve the connection if it is healthy.

Well, if SENDs are not completing then it is not healthy. It is
analogous to what TCP keep-alive does.

> > How does sockets based NFS handle this? Doesn't it zero copy from these
> > same buffers into SKBs? How does it cancel the SKBs before the NIC
> > transmits them?
> >
> > Seems like exactly the same kind of problem to me..
>
> TCP has keep-alive, where the sockets consumer is notified as soon
> as the network layer determines the remote is unresponsive. The
> connection is closed from underneath the consumer.

keep-alive is something different: it pings the remote periodically and
detects dead connections even when there is no traffic.

RDMA detects dead connections via retries and timeouts, but only if
there is traffic.

My question was about how the same situation is handled in TCP. If it
does DMA directly from the source buffers by chaining them into SKBs,
then it has exactly the same problem: it cannot release the buffers
until the SKB is released from the TCP stack after NIC xmit. Like in
RDMA, this only happens once TCP has got the ack back.

Perhaps the answer is that TCP does not zero copy these buffers, or
TCP doesn't care about transmitting random memory, but it seems a
question worth asking.

Jason

2017-09-07 13:25:07

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On 9/6/2017 3:39 PM, Jason Gunthorpe wrote:
> On Wed, Sep 06, 2017 at 02:33:50PM -0400, Chuck Lever wrote:
>
>> B. Force RPC completion to wait for Send completion, which
>> would allow the post-v4.6 scatter-gather code to work
>> safely. This would need some guarantee that Sends will
>> always complete in a short period.
>
> Why is waiting for the send completion so fundamentally different from
> waiting for the remote RPC reply?
>
> I would say that 99% of the time the send completion and RPC reply
> completion will occur approximately concurrently.

Absolutely not. The RPC reply requires upper layer processing at
the server, which involves work requests, context switches, file
system and disk i/o operations, and plain old queuing. RPC replies
typically can be expected in milliseconds from traditional HDDs,
and hundreds of microseconds from SSDs. Storage class memory is
reducing these by two orders of magnitude, but it's still not in
the range of a network round trip.

> e.g. it is quite likely the RPC reply SEND carries an embedded ack
> for the requesting SEND.

That's not what we've seen, even on InfiniBand. On iWARP adapters,
I will note, the send completion is often generated when
the adapter buffers the outgoing message, and has almost nothing
to do with the remote transport ack.

Tom.

2017-09-07 15:08:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On Thu, Sep 07, 2017 at 09:17:16AM -0400, Tom Talpey wrote:

> >Why is waiting for the send completion so fundamentally different from
> >waiting for the remote RPC reply?
> >
> >I would say that 99% of the time the send completion and RPC reply
> >completion will occur approximately concurrently.
>
> Absolutely not. The RPC reply requires upper layer processing at
> the server, which involves work requests, context switches, file

I should have said '99% of the time the SEND will occur approximately
concurrently or sooner'

Jason

2017-09-07 16:16:03

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH RFC 0/5] xprtrdma Send completion batching

On 9/7/2017 11:08 AM, Jason Gunthorpe wrote:
> On Thu, Sep 07, 2017 at 09:17:16AM -0400, Tom Talpey wrote:
>
>>> Why is waiting for the send completion so fundamentally different from
>>> waiting for the remote RPC reply?
>>>
>>> I would say that 99% of the time the send completion and RPC reply
>>> completion will occur approximately concurrently.
>>
>> Absolutely not. The RPC reply requires upper layer processing at
>> the server, which involves work requests, context switches, file
>
> I should have said '99% of the time the SEND will occur approximately
> concurrently or sooner'

Ok, that I agree with. Unfortunately though, the code has to handle
either sequence, and sends do complete after replies in many cases.

One reason for that is the RNIC and network, but another is more
insidious. When sends and receives go to separate CQs, there's
basically no causality between the two, and they synchronize with
different MSI vectors (and therefore CPU cores), different locks,
and different upcall code paths. There's no way to sort that out
without serializing everything much too heavily.

Tom.