2021-03-31 19:37:28

by Chuck Lever III

Subject: [PATCH v1 0/8] xprtrdma Receive Queue fixes

I found a number of crashers and other problems in and around the
xprtrdma logic for managing the Receive Queue during connect and
disconnect events.

---

Chuck Lever (8):
xprtrdma: Avoid Receive Queue wrapping
xprtrdma: Do not post Receives after disconnect
xprtrdma: Put flushed Receives on free list instead of destroying them
xprtrdma: Improve locking around rpcrdma_rep destruction
xprtrdma: Improve commentary around rpcrdma_reps_unmap()
xprtrdma: Improve locking around rpcrdma_rep creation
xprtrdma: Fix cwnd update ordering
xprtrdma: Delete rpcrdma_recv_buffer_put()


net/sunrpc/xprtrdma/backchannel.c | 4 +-
net/sunrpc/xprtrdma/rpc_rdma.c | 7 +--
net/sunrpc/xprtrdma/verbs.c | 87 +++++++++++++++++++------------
net/sunrpc/xprtrdma/xprt_rdma.h | 4 +-
4 files changed, 64 insertions(+), 38 deletions(-)

--
Chuck Lever


2021-03-31 19:37:28

by Chuck Lever III

Subject: [PATCH v1 3/8] xprtrdma: Put flushed Receives on free list instead of destroying them

Defer destruction of an rpcrdma_rep until transport tear-down to
avoid races between Receive completion and rpcrdma_reps_unmap().

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1d88685badbe..92af272f9cc9 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -80,6 +80,8 @@ static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_sendctx *sc);
static int rpcrdma_reqs_setup(struct rpcrdma_xprt *r_xprt);
static void rpcrdma_reqs_reset(struct rpcrdma_xprt *r_xprt);
+static void rpcrdma_rep_put(struct rpcrdma_buffer *buf,
+ struct rpcrdma_rep *rep);
static void rpcrdma_rep_destroy(struct rpcrdma_rep *rep);
static void rpcrdma_reps_unmap(struct rpcrdma_xprt *r_xprt);
static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
@@ -205,7 +207,7 @@ static void rpcrdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)

out_flushed:
rpcrdma_flush_disconnect(r_xprt, wc);
- rpcrdma_rep_destroy(rep);
+ rpcrdma_rep_put(&r_xprt->rx_buf, rep);
}

static void rpcrdma_update_cm_private(struct rpcrdma_ep *ep,


2021-03-31 19:37:55

by Chuck Lever III

Subject: [PATCH v1 2/8] xprtrdma: Do not post Receives after disconnect

Currently the Receive completion handler refreshes the Receive Queue
whenever a successful Receive completion occurs.

On disconnect, xprtrdma drains the Receive Queue. The first few
Receive completions after a disconnect are typically successful,
until the first flushed Receive.

This means the Receive completion handler continues to post more
Receive WRs after the drain sentinel has been posted. The late-
posted Receives flush after the drain sentinel has completed,
leading to a crash later in rpcrdma_xprt_disconnect().

To prevent this crash, xprtrdma has to ensure that the Receive
handler stops posting Receives before ib_drain_rq() posts its
drain sentinel.

This patch is probably not sufficient to fully close that window,
but does significantly reduce the opportunity for a crash to
occur without incurring undue performance overhead.

Cc: [email protected] # v5.7
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ec912cf9c618..1d88685badbe 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1371,8 +1371,10 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpcrdma_ep *ep = r_xprt->rx_ep;
+ struct ib_qp_init_attr init_attr;
struct ib_recv_wr *wr, *bad_wr;
struct rpcrdma_rep *rep;
+ struct ib_qp_attr attr;
int needed, count, rc;

rc = 0;
@@ -1385,6 +1387,11 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
if (!temp)
needed += RPCRDMA_MAX_RECV_BATCH;

+ if (ib_query_qp(ep->re_id->qp, &attr, IB_QP_STATE, &init_attr))
+ goto out;
+ if (attr.qp_state == IB_QPS_ERR)
+ goto out;
+
/* fast path: all needed reps can be found on the free list */
wr = NULL;
while (needed) {


2021-03-31 19:38:32

by Chuck Lever III

Subject: [PATCH v1 4/8] xprtrdma: Improve locking around rpcrdma_rep destruction

Currently rpcrdma_reps_destroy() assumes that, at transport
tear-down, the content of the rb_free_reps list is the same as the
content of the rb_all_reps list. Although that is usually true, the
rb_all_reps list is the more reliable of the two because of the way
it is managed, and rpcrdma_reps_unmap() already walks it; both
functions should traverse the same "all" list.

Instead, ensure that all rpcrdma_reps are always destroyed whether
they are on the rep free list or not. Also ensure that rpcrdma_rep
destruction logic is protected by the rb_lock.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 31 ++++++++++++++++++++++++-------
1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 92af272f9cc9..fe250d554695 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1000,16 +1000,23 @@ struct rpcrdma_rep *rpcrdma_rep_create(struct rpcrdma_xprt *r_xprt,
return NULL;
}

-/* No locking needed here. This function is invoked only by the
- * Receive completion handler, or during transport shutdown.
- */
-static void rpcrdma_rep_destroy(struct rpcrdma_rep *rep)
+static void rpcrdma_rep_destroy_locked(struct rpcrdma_rep *rep)
{
- list_del(&rep->rr_all);
rpcrdma_regbuf_free(rep->rr_rdmabuf);
kfree(rep);
}

+static void rpcrdma_rep_destroy(struct rpcrdma_rep *rep)
+{
+ struct rpcrdma_buffer *buf = &rep->rr_rxprt->rx_buf;
+
+ spin_lock(&buf->rb_lock);
+ list_del(&rep->rr_all);
+ spin_unlock(&buf->rb_lock);
+
+ rpcrdma_rep_destroy_locked(rep);
+}
+
static struct rpcrdma_rep *rpcrdma_rep_get_locked(struct rpcrdma_buffer *buf)
{
struct llist_node *node;
@@ -1042,8 +1049,18 @@ static void rpcrdma_reps_destroy(struct rpcrdma_buffer *buf)
{
struct rpcrdma_rep *rep;

- while ((rep = rpcrdma_rep_get_locked(buf)) != NULL)
- rpcrdma_rep_destroy(rep);
+ spin_lock(&buf->rb_lock);
+ while ((rep = list_first_entry_or_null(&buf->rb_all_reps,
+ struct rpcrdma_rep,
+ rr_all)) != NULL) {
+ list_del(&rep->rr_all);
+ spin_unlock(&buf->rb_lock);
+
+ rpcrdma_rep_destroy_locked(rep);
+
+ spin_lock(&buf->rb_lock);
+ }
+ spin_unlock(&buf->rb_lock);
}

/**


2021-03-31 19:38:36

by Chuck Lever III

Subject: [PATCH v1 7/8] xprtrdma: Fix cwnd update ordering

After a reconnect, the reply handler is opening the cwnd (and thus
enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs()
can post enough Receive WRs to receive their replies. This causes
an RNR NAK, and the new connection is lost immediately.

The race is most clearly exposed when KASAN and disconnect injection
are enabled: KASAN slows down rpcrdma_rep_create() considerably,
which widens the race window.

Fixes: 2ae50ad68cd7 ("xprtrdma: Close window between waking RPC senders and posting Receives")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 3 ++-
net/sunrpc/xprtrdma/verbs.c | 10 +++++-----
net/sunrpc/xprtrdma/xprt_rdma.h | 2 +-
3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 292f066d006e..21ddd78a8c35 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1430,9 +1430,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
credits = 1; /* don't deadlock */
else if (credits > r_xprt->rx_ep->re_max_requests)
credits = r_xprt->rx_ep->re_max_requests;
+ rpcrdma_post_recvs(r_xprt, credits + (buf->rb_bc_srv_max_requests << 1),
+ false);
if (buf->rb_credits != credits)
rpcrdma_update_cwnd(r_xprt, credits);
- rpcrdma_post_recvs(r_xprt, false);

req = rpcr_to_rdmar(rqst);
if (unlikely(req->rl_reply))
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 2fde805edb64..82489f527ece 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -537,7 +537,7 @@ int rpcrdma_xprt_connect(struct rpcrdma_xprt *r_xprt)
* outstanding Receives.
*/
rpcrdma_ep_get(ep);
- rpcrdma_post_recvs(r_xprt, true);
+ rpcrdma_post_recvs(r_xprt, 1, true);

rc = rdma_connect(ep->re_id, &ep->re_remote_cma);
if (rc)
@@ -1388,10 +1388,11 @@ int rpcrdma_post_sends(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
/**
* rpcrdma_post_recvs - Refill the Receive Queue
* @r_xprt: controlling transport instance
- * @temp: mark Receive buffers to be deleted after use
+ * @needed: current credit grant
+ * @temp: mark Receive buffers to be deleted after one use
*
*/
-void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
+void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed, bool temp)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpcrdma_ep *ep = r_xprt->rx_ep;
@@ -1399,12 +1400,11 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
struct ib_recv_wr *wr, *bad_wr;
struct rpcrdma_rep *rep;
struct ib_qp_attr attr;
- int needed, count, rc;
+ int count, rc;

rc = 0;
count = 0;

- needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
if (likely(ep->re_receive_count > needed))
goto out;
needed -= ep->re_receive_count;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index fe3be985e239..28af11fbe643 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -461,7 +461,7 @@ int rpcrdma_xprt_connect(struct rpcrdma_xprt *r_xprt);
void rpcrdma_xprt_disconnect(struct rpcrdma_xprt *r_xprt);

int rpcrdma_post_sends(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req);
-void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
+void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed, bool temp);

/*
* Buffer calls - xprtrdma/verbs.c


2021-03-31 19:39:00

by Chuck Lever III

Subject: [PATCH v1 5/8] xprtrdma: Improve commentary around rpcrdma_reps_unmap()

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index fe250d554695..b68fc81a78fb 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1034,6 +1034,10 @@ static void rpcrdma_rep_put(struct rpcrdma_buffer *buf,
llist_add(&rep->rr_node, &buf->rb_free_reps);
}

+/* Caller must ensure the QP is quiescent (RQ is drained) before
+ * invoking this function, to guarantee rb_all_reps is not
+ * changing.
+ */
static void rpcrdma_reps_unmap(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
@@ -1041,7 +1045,7 @@ static void rpcrdma_reps_unmap(struct rpcrdma_xprt *r_xprt)

list_for_each_entry(rep, &buf->rb_all_reps, rr_all) {
rpcrdma_regbuf_dma_unmap(rep->rr_rdmabuf);
- rep->rr_temp = true;
+ rep->rr_temp = true; /* Mark this rep for destruction */
}
}



2021-03-31 19:39:08

by Chuck Lever III

Subject: [PATCH v1 8/8] xprtrdma: Delete rpcrdma_recv_buffer_put()

Clean up: The name recv_buffer_put() is a vestige of older code,
and the function is just a wrapper for the newer rpcrdma_rep_put().
In most of the existing call sites, a pointer to the owning
rpcrdma_buffer is already available.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/backchannel.c | 4 +++-
net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++--
net/sunrpc/xprtrdma/verbs.c | 24 ++++++++----------------
net/sunrpc/xprtrdma/xprt_rdma.h | 2 +-
4 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index a249837d6a55..1151efd09b27 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -155,9 +155,11 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
{
struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
+ struct rpcrdma_rep *rep = req->rl_reply;
struct rpc_xprt *xprt = rqst->rq_xprt;
+ struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);

- rpcrdma_recv_buffer_put(req->rl_reply);
+ rpcrdma_rep_put(&r_xprt->rx_buf, rep);
req->rl_reply = NULL;

spin_lock(&xprt->bc_pa_lock);
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 21ddd78a8c35..be4e888e78a3 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1437,7 +1437,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)

req = rpcr_to_rdmar(rqst);
if (unlikely(req->rl_reply))
- rpcrdma_recv_buffer_put(req->rl_reply);
+ rpcrdma_rep_put(buf, req->rl_reply);
req->rl_reply = rep;
rep->rr_rqst = rqst;

@@ -1465,5 +1465,5 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
trace_xprtrdma_reply_short_err(rep);

out:
- rpcrdma_recv_buffer_put(rep);
+ rpcrdma_rep_put(buf, rep);
}
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 82489f527ece..fd4cce149001 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -80,8 +80,6 @@ static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_sendctx *sc);
static int rpcrdma_reqs_setup(struct rpcrdma_xprt *r_xprt);
static void rpcrdma_reqs_reset(struct rpcrdma_xprt *r_xprt);
-static void rpcrdma_rep_put(struct rpcrdma_buffer *buf,
- struct rpcrdma_rep *rep);
static void rpcrdma_rep_destroy(struct rpcrdma_rep *rep);
static void rpcrdma_reps_unmap(struct rpcrdma_xprt *r_xprt);
static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
@@ -1029,8 +1027,13 @@ static struct rpcrdma_rep *rpcrdma_rep_get_locked(struct rpcrdma_buffer *buf)
return llist_entry(node, struct rpcrdma_rep, rr_node);
}

-static void rpcrdma_rep_put(struct rpcrdma_buffer *buf,
- struct rpcrdma_rep *rep)
+/**
+ * rpcrdma_rep_put - Release rpcrdma_rep back to free list
+ * @buf: buffer pool
+ * @rep: rep to release
+ *
+ */
+void rpcrdma_rep_put(struct rpcrdma_buffer *buf, struct rpcrdma_rep *rep)
{
llist_add(&rep->rr_node, &buf->rb_free_reps);
}
@@ -1245,17 +1248,6 @@ void rpcrdma_buffer_put(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req)
spin_unlock(&buffers->rb_lock);
}

-/**
- * rpcrdma_recv_buffer_put - Release rpcrdma_rep back to free list
- * @rep: rep to release
- *
- * Used after error conditions.
- */
-void rpcrdma_recv_buffer_put(struct rpcrdma_rep *rep)
-{
- rpcrdma_rep_put(&rep->rr_rxprt->rx_buf, rep);
-}
-
/* Returns a pointer to a rpcrdma_regbuf object, or NULL.
*
* xprtrdma uses a regbuf for posting an outgoing RDMA SEND, or for
@@ -1449,7 +1441,7 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed, bool temp)

rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr);
wr = wr->next;
- rpcrdma_recv_buffer_put(rep);
+ rpcrdma_rep_put(buf, rep);
--count;
}
}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 28af11fbe643..1b02d6bd86e9 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -480,7 +480,7 @@ void rpcrdma_mrs_refresh(struct rpcrdma_xprt *r_xprt);
struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
void rpcrdma_buffer_put(struct rpcrdma_buffer *buffers,
struct rpcrdma_req *req);
-void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
+void rpcrdma_rep_put(struct rpcrdma_buffer *buf, struct rpcrdma_rep *rep);

bool rpcrdma_regbuf_realloc(struct rpcrdma_regbuf *rb, size_t size,
gfp_t flags);


2021-03-31 19:39:19

by Chuck Lever III

Subject: [PATCH v1 6/8] xprtrdma: Improve locking around rpcrdma_rep creation

Serialize the use of the rb_all_reps list during rep creation and
destruction.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b68fc81a78fb..2fde805edb64 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -956,13 +956,11 @@ static void rpcrdma_reqs_reset(struct rpcrdma_xprt *r_xprt)
rpcrdma_req_reset(req);
}

-/* No locking needed here. This function is called only by the
- * Receive completion handler.
- */
static noinline
struct rpcrdma_rep *rpcrdma_rep_create(struct rpcrdma_xprt *r_xprt,
bool temp)
{
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpcrdma_rep *rep;

rep = kzalloc(sizeof(*rep), GFP_KERNEL);
@@ -989,7 +987,10 @@ struct rpcrdma_rep *rpcrdma_rep_create(struct rpcrdma_xprt *r_xprt,
rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
rep->rr_recv_wr.num_sge = 1;
rep->rr_temp = temp;
- list_add(&rep->rr_all, &r_xprt->rx_buf.rb_all_reps);
+
+ spin_lock(&buf->rb_lock);
+ list_add(&rep->rr_all, &buf->rb_all_reps);
+ spin_unlock(&buf->rb_lock);
return rep;

out_free_regbuf:


2021-03-31 20:02:12

by Tom Talpey

Subject: Re: [PATCH v1 2/8] xprtrdma: Do not post Receives after disconnect

On 3/31/2021 3:36 PM, Chuck Lever wrote:
> Currently the Receive completion handler refreshes the Receive Queue
> whenever a successful Receive completion occurs.
>
> On disconnect, xprtrdma drains the Receive Queue. The first few
> Receive completions after a disconnect are typically successful,
> until the first flushed Receive.
>
> This means the Receive completion handler continues to post more
> Receive WRs after the drain sentinel has been posted. The late-
> posted Receives flush after the drain sentinel has completed,
> leading to a crash later in rpcrdma_xprt_disconnect().
>
> To prevent this crash, xprtrdma has to ensure that the Receive
> handler stops posting Receives before ib_drain_rq() posts its
> drain sentinel.
>
> This patch is probably not sufficient to fully close that window,

"Probably" is not a word I'd like to use in a stable:cc...

> but does significantly reduce the opportunity for a crash to
> occur without incurring undue performance overhead.
>
> Cc: [email protected] # v5.7
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/verbs.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index ec912cf9c618..1d88685badbe 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1371,8 +1371,10 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
> {
> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> struct rpcrdma_ep *ep = r_xprt->rx_ep;
> + struct ib_qp_init_attr init_attr;
> struct ib_recv_wr *wr, *bad_wr;
> struct rpcrdma_rep *rep;
> + struct ib_qp_attr attr;
> int needed, count, rc;
>
> rc = 0;
> @@ -1385,6 +1387,11 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
> if (!temp)
> needed += RPCRDMA_MAX_RECV_BATCH;
>
> + if (ib_query_qp(ep->re_id->qp, &attr, IB_QP_STATE, &init_attr))
> + goto out;

This call isn't completely cheap.

> + if (attr.qp_state == IB_QPS_ERR)
> + goto out;

But the QP is free to disconnect or go to error right now. This approach
just reduces the timing hole. Is it not possible to mark the WRs as
being part of a batch, and allowing them to flush? You could borrow a
bit in the completion cookie, and check it when the CQE pops out. Maybe.
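
Untested, and rr_no_repost is an invented field, but roughly what I
mean: the completion cookie of each Receive is already the ib_cqe
embedded in the rpcrdma_rep, so a tag can live next to it and be
checked when the CQE pops out:

/* post side: tag Receives once teardown has begun */
rep->rr_no_repost = true;

/* completion side: honor the tag instead of querying the QP */
static void rpcrdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
{
        struct rpcrdma_rep *rep =
                container_of(wc->wr_cqe, struct rpcrdma_rep, rr_cqe);

        if (wc->status != IB_WC_SUCCESS || rep->rr_no_repost)
                goto out_flushed;       /* let it flush; do not repost */

        /* ... normal Receive processing, which may repost ... */
}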

> +
> /* fast path: all needed reps can be found on the free list */
> wr = NULL;
> while (needed) {

2021-03-31 20:04:53

by Tom Talpey

Subject: Re: [PATCH v1 3/8] xprtrdma: Put flushed Receives on free list instead of destroying them

On 3/31/2021 3:36 PM, Chuck Lever wrote:
> Defer destruction of an rpcrdma_rep until transport tear-down to
> avoid races between Receive completion and rpcrdma_reps_unmap().

This seems sketchy, but it's a good approach to destroy in one
place, and off the hot path. You might restate the description
in a positive way, not strictly to avoid a race.

Reviewed-by: Tom Talpey <[email protected]>

> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/verbs.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 1d88685badbe..92af272f9cc9 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -80,6 +80,8 @@ static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt,
> struct rpcrdma_sendctx *sc);
> static int rpcrdma_reqs_setup(struct rpcrdma_xprt *r_xprt);
> static void rpcrdma_reqs_reset(struct rpcrdma_xprt *r_xprt);
> +static void rpcrdma_rep_put(struct rpcrdma_buffer *buf,
> + struct rpcrdma_rep *rep);
> static void rpcrdma_rep_destroy(struct rpcrdma_rep *rep);
> static void rpcrdma_reps_unmap(struct rpcrdma_xprt *r_xprt);
> static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
> @@ -205,7 +207,7 @@ static void rpcrdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
>
> out_flushed:
> rpcrdma_flush_disconnect(r_xprt, wc);
> - rpcrdma_rep_destroy(rep);
> + rpcrdma_rep_put(&r_xprt->rx_buf, rep);
> }
>
> static void rpcrdma_update_cm_private(struct rpcrdma_ep *ep,

2021-03-31 20:33:57

by Chuck Lever III

Subject: Re: [PATCH v1 2/8] xprtrdma: Do not post Receives after disconnect

On Wed, Mar 31, 2021 at 4:01 PM Tom Talpey <[email protected]> wrote:
>
> On 3/31/2021 3:36 PM, Chuck Lever wrote:
> > Currently the Receive completion handler refreshes the Receive Queue
> > whenever a successful Receive completion occurs.
> >
> > On disconnect, xprtrdma drains the Receive Queue. The first few
> > Receive completions after a disconnect are typically successful,
> > until the first flushed Receive.
> >
> > This means the Receive completion handler continues to post more
> > Receive WRs after the drain sentinel has been posted. The late-
> > posted Receives flush after the drain sentinel has completed,
> > leading to a crash later in rpcrdma_xprt_disconnect().
> >
> > To prevent this crash, xprtrdma has to ensure that the Receive
> > handler stops posting Receives before ib_drain_rq() posts its
> > drain sentinel.
> >
> > This patch is probably not sufficient to fully close that window,
>
> "Probably" is not a word I'd like to use in a stable:cc...

Well, I could be easily convinced to remove the Cc: stable
for this one, since it's not a full fix. But this is a pretty pervasive
problem with disconnect handling.


> > but does significantly reduce the opportunity for a crash to
> > occur without incurring undue performance overhead.
> >
> > Cc: [email protected] # v5.7
> > Signed-off-by: Chuck Lever <[email protected]>
> > ---
> > net/sunrpc/xprtrdma/verbs.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> > index ec912cf9c618..1d88685badbe 100644
> > --- a/net/sunrpc/xprtrdma/verbs.c
> > +++ b/net/sunrpc/xprtrdma/verbs.c
> > @@ -1371,8 +1371,10 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
> > {
> > struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> > struct rpcrdma_ep *ep = r_xprt->rx_ep;
> > + struct ib_qp_init_attr init_attr;
> > struct ib_recv_wr *wr, *bad_wr;
> > struct rpcrdma_rep *rep;
> > + struct ib_qp_attr attr;
> > int needed, count, rc;
> >
> > rc = 0;
> > @@ -1385,6 +1387,11 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
> > if (!temp)
> > needed += RPCRDMA_MAX_RECV_BATCH;
> >
> > + if (ib_query_qp(ep->re_id->qp, &attr, IB_QP_STATE, &init_attr))
> > + goto out;
>
> This call isn't completely cheap.

True, but it's done only once every 7 Receive completions.

The other option is to use re_connect_status, and add some memory
barriers to ensure we get the latest value. That doesn't help us get the
race window closed any further, though.

> > + if (attr.qp_state == IB_QPS_ERR)
> > + goto out;
>
> But the QP is free to disconnect or go to error right now. This approach
> just reduces the timing hole.

Agreed 100%. I just couldn't think of a better approach. I'm definitely open
to better ideas.

> Is it not possible to mark the WRs as
> being part of a batch, and allowing them to flush? You could borrow a
> bit in the completion cookie, and check it when the CQE pops out. Maybe.

It's not an issue with batching, it's an issue with posting Receives from the
Receive completion handler. I'd think that any of the ULPs that post Receives
in their completion handler would have the same issue.

The purpose of the QP drain in rpcrdma_xprt_disconnect() is to ensure there
are no more WRs in flight so that the hardware resources can be safely
destroyed. If the Receive completion handler continues to post Receive WRs
after the drain sentinel has been posted, leaks and crashes become possible.
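
For anyone following along, the drain pattern boils down to roughly
this (paraphrased from the core's __ib_drain_rq(); error handling and
the IB_POLL_DIRECT case are omitted):

struct ib_drain_cqe {
        struct ib_cqe cqe;
        struct completion done;
};

static void ib_drain_qp_done(struct ib_cq *cq, struct ib_wc *wc)
{
        struct ib_drain_cqe *cqe =
                container_of(wc->wr_cqe, struct ib_drain_cqe, cqe);

        complete(&cqe->done);
}

static void __ib_drain_rq(struct ib_qp *qp)
{
        struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
        struct ib_drain_cqe rdrain;
        struct ib_recv_wr rwr = {};

        rwr.wr_cqe = &rdrain.cqe;
        rdrain.cqe.done = ib_drain_qp_done;
        init_completion(&rdrain.done);

        /* every WR on the QP now completes in error */
        ib_modify_qp(qp, &attr, IB_QP_STATE);

        /* the sentinel: its completion proves that all Receive WRs
         * posted before it have flushed
         */
        ib_post_recv(qp, &rwr, NULL);

        wait_for_completion(&rdrain.done);
}

A Receive posted after the sentinel is invisible to that
wait_for_completion(), which is exactly the hole described above.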


> > +
> > /* fast path: all needed reps can be found on the free list */
> > wr = NULL;
> > while (needed) {



--
When the world is being engulfed by a comet, go ahead and excrete
where you want to.

2021-03-31 21:26:47

by Tom Talpey

Subject: Re: [PATCH v1 2/8] xprtrdma: Do not post Receives after disconnect

On 3/31/2021 4:31 PM, Chuck Lever wrote:
> On Wed, Mar 31, 2021 at 4:01 PM Tom Talpey <[email protected]> wrote:
>>
>> On 3/31/2021 3:36 PM, Chuck Lever wrote:
>>> Currently the Receive completion handler refreshes the Receive Queue
>>> whenever a successful Receive completion occurs.
>>>
>>> On disconnect, xprtrdma drains the Receive Queue. The first few
>>> Receive completions after a disconnect are typically successful,
>>> until the first flushed Receive.
snip
>> Is it not possible to mark the WRs as
>> being part of a batch, and allowing them to flush? You could borrow a
>> bit in the completion cookie, and check it when the CQE pops out. Maybe.
>
> It's not an issue with batching, it's an issue with posting Receives from the
> Receive completion handler. I'd think that any of the ULPs that post Receives
> in their completion handler would have the same issue.
>
> The purpose of the QP drain in rpcrdma_xprt_disconnect() is to ensure there
> are no more WRs in flight so that the hardware resources can be safely
> destroyed. If the Receive completion handler continues to post Receive WRs
> after the drain sentinel has been posted, leaks and crashes become possible.

Well, why not do an atomic_set() of a flag just before posting the
sentinel, and check it with atomic_get() before any other RQ post?
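
Rough sketch; re_draining is an invented field, and atomic_read() is
the actual spelling of the "get":

/* disconnect path, before ib_drain_rq() posts its sentinel */
atomic_set(&ep->re_draining, 1);
ib_drain_rq(ep->re_id->qp);

/* rpcrdma_post_recvs(), before touching the RQ */
if (atomic_read(&ep->re_draining))
        goto out;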


Tom.

2021-04-01 17:43:32

by Chuck Lever

Subject: Re: [PATCH v1 2/8] xprtrdma: Do not post Receives after disconnect

On Wed, Mar 31, 2021 at 5:22 PM Tom Talpey <[email protected]> wrote:
>
> On 3/31/2021 4:31 PM, Chuck Lever wrote:
> > On Wed, Mar 31, 2021 at 4:01 PM Tom Talpey <[email protected]> wrote:
> >>
> >> On 3/31/2021 3:36 PM, Chuck Lever wrote:
> >>> Currently the Receive completion handler refreshes the Receive Queue
> >>> whenever a successful Receive completion occurs.
> >>>
> >>> On disconnect, xprtrdma drains the Receive Queue. The first few
> >>> Receive completions after a disconnect are typically successful,
> >>> until the first flushed Receive.
> snip
> >> Is it not possible to mark the WRs as
> >> being part of a batch, and allowing them to flush? You could borrow a
> >> bit in the completion cookie, and check it when the CQE pops out. Maybe.
> >
> > It's not an issue with batching, it's an issue with posting Receives from the
> > Receive completion handler. I'd think that any of the ULPs that post Receives
> > in their completion handler would have the same issue.
> >
> > The purpose of the QP drain in rpcrdma_xprt_disconnect() is to ensure there
> > are no more WRs in flight so that the hardware resources can be safely
> > destroyed. If the Receive completion handler continues to post Receive WRs
> > after the drain sentinel has been posted, leaks and crashes become possible.

> Well, why not do an atomic_set() of a flag just before posting the
> sentinel, and check it with atomic_get() before any other RQ post?

After a couple of private exchanges, Tom and I agree that doing an
ib_query_qp() in rpcrdma_post_recvs() has some drawbacks and
does not fully close the race window.

There is nothing that prevents rpcrdma_xprt_disconnect() from starting
the RQ drain while rpcrdma_post_recvs is still running. There needs to
be serialization between ib_drain_rq() and rpcrdma_post_recvs() so
that the drain sentinel is guaranteed to be the final Receive WR posted
on the RQ.

It would be great if the core ib_drain_rq() API itself could handle the
exclusion.
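
Strictly as a sketch of the kind of exclusion I mean, not what the
code does today (re_drain_sem is invented):

/* rpcrdma_post_recvs(): shared mode, so posters can run
 * concurrently with each other */
if (!down_read_trylock(&ep->re_drain_sem))
        return;         /* drain has started; do not post */
/* ... post Receive WRs ... */
up_read(&ep->re_drain_sem);

/* rpcrdma_xprt_disconnect(): exclusive mode, so the drain
 * sentinel is guaranteed to be the last Receive WR posted */
down_write(&ep->re_drain_sem);
ib_drain_rq(ep->re_id->qp);
up_write(&ep->re_drain_sem);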


--
"We cannot take our next breath without the exhale."
-- Ellen Scott Grable