2015-05-04 17:56:44

by Chuck Lever III

Subject: [PATCH v1 00/14] client NFS/RDMA patches for 4.2

I'd like these patches considered for merging upstream. This patch
series includes:

- JIT allocation of rpcrdma_mw structures
- Break-up of rb_lock
- Reduction of how many rpcrdma_mw structs are needed per transport

These are prerequisites for increasing the RPC slot count and
r/wsize on RPC/RDMA transports. The series also includes:

- An RPC/RDMA transport fault injector

This is useful for discovering regressions in the logic that handles
transport disconnection and recovery.

You can find these in my git repo in the "nfs-rdma-for-4.2" topic
branch. See:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=summary

Thanks in advance for patch review!

---

Chuck Lever (14):
xprtrdma: Transport fault injection
xprtrdma: Warn when there are orphaned IB objects
xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt
xprtrdma: Use ib_device pointer safely
xprtrdma: Introduce helpers for allocating MWs
xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
xprtrdma: Introduce an FRMR recovery workqueue
xprtrdma: Acquire MRs in rpcrdma_register_external()
xprtrdma: Remove unused LOCAL_INV recovery logic
xprtrdma: Remove ->ro_reset
xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
xprtrdma: Split rb_lock
xprtrdma: Stack relief in fmr_op_map()
xprtrdma: Reduce per-transport MR allocation


include/linux/sunrpc/xprtrdma.h | 3
net/sunrpc/Kconfig | 12 ++
net/sunrpc/xprtrdma/fmr_ops.c | 120 +++++++++++-------
net/sunrpc/xprtrdma/frwr_ops.c | 224 ++++++++++++++++++++++++---------
net/sunrpc/xprtrdma/physical_ops.c | 14 --
net/sunrpc/xprtrdma/rpc_rdma.c | 7 -
net/sunrpc/xprtrdma/transport.c | 52 +++++++-
net/sunrpc/xprtrdma/verbs.c | 241 +++++++++---------------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 38 ++++--
9 files changed, 387 insertions(+), 324 deletions(-)

--
Chuck Lever


2015-05-04 17:56:54

by Chuck Lever III

Subject: [PATCH v1 01/14] xprtrdma: Transport fault injection

It has been exceptionally useful to exercise the logic that handles
local immediate errors and RDMA connection loss. To enable
developers to test this regularly and repeatably, add logic to
simulate connection loss every so often.

Fault injection is disabled by default. It is enabled with

$ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_transport_fault

where "xxx" is the number of transport method calls to allow between
injected disconnects. A value of several thousand usually allows
reasonable forward progress while still causing frequent connection
drops.
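
For reviewers: the injection point added below boils down to a
per-transport countdown. In condensed form (this is a sketch of the
helper in the transport.c hunk, not a separate implementation):

        if (!xprt_rdma_inject_transport_fault)
                return;         /* fault injection disabled */
        if (atomic_dec_return(&r_xprt->rx_inject_count) == 0) {
                /* reload the countdown, then force a disconnect */
                atomic_set(&r_xprt->rx_inject_count,
                           xprt_rdma_inject_transport_fault);
                (void)rdma_disconnect(r_xprt->rx_ia.ri_id);
        }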

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/Kconfig | 12 ++++++++++++
net/sunrpc/xprtrdma/transport.c | 34 ++++++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
3 files changed, 47 insertions(+)

diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index 9068e72..329f82c 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -61,6 +61,18 @@ config SUNRPC_XPRT_RDMA_CLIENT

If unsure, say N.

+config SUNRPC_XPRT_RDMA_FAULT_INJECTION
+ bool "RPC over RDMA client fault injection"
+ depends on SUNRPC_XPRT_RDMA_CLIENT
+ default n
+ help
+ This option enables fault injection in the xprtrdma module.
+ Fault injection is disabled by default. It is enabled with:
+
+ $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_transport_fault
+
+ If unsure, say N.
+
config SUNRPC_XPRT_RDMA_SERVER
tristate "RPC over RDMA Server Support"
depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 54f23b1..fdcb2c7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -74,6 +74,7 @@ static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_inline_write_padding;
static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
int xprt_rdma_pad_optimize = 1;
+static unsigned int xprt_rdma_inject_transport_fault;

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)

@@ -135,6 +136,13 @@ static struct ctl_table xr_tunables_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "rdma_inject_transport_fault",
+ .data = &xprt_rdma_inject_transport_fault,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
{ },
};

@@ -246,6 +254,27 @@ xprt_rdma_connect_worker(struct work_struct *work)
xprt_clear_connecting(xprt);
}

+#if defined CONFIG_SUNRPC_XPRT_RDMA_FAULT_INJECTION
+static void
+xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
+{
+ if (!xprt_rdma_inject_transport_fault)
+ return;
+
+ if (atomic_dec_return(&r_xprt->rx_inject_count) == 0) {
+ atomic_set(&r_xprt->rx_inject_count,
+ xprt_rdma_inject_transport_fault);
+ pr_info("rpcrdma: injecting transport disconnect\n");
+ (void)rdma_disconnect(r_xprt->rx_ia.ri_id);
+ }
+}
+#else
+static void
+xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
+{
+}
+#endif
+
/*
* xprt_rdma_destroy
*
@@ -405,6 +434,8 @@ xprt_setup_rdma(struct xprt_create *args)
INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
xprt_rdma_connect_worker);

+ atomic_set(&new_xprt->rx_inject_count,
+ xprt_rdma_inject_transport_fault);
xprt_rdma_format_addresses(xprt);
xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
if (xprt->max_payload == 0)
@@ -515,6 +546,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
out:
dprintk("RPC: %s: size %zd, request 0x%p\n", __func__, size, req);
req->rl_connect_cookie = 0; /* our reserved value */
+ xprt_rdma_inject_disconnect(r_xprt);
return req->rl_sendbuf->rg_base;

out_rdmabuf:
@@ -589,6 +621,7 @@ xprt_rdma_free(void *buffer)
}

rpcrdma_buffer_put(req);
+ xprt_rdma_inject_disconnect(r_xprt);
}

/*
@@ -634,6 +667,7 @@ xprt_rdma_send_request(struct rpc_task *task)

rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
rqst->rq_bytes_sent = 0;
+ xprt_rdma_inject_disconnect(r_xprt);
return 0;

failed_marshal:
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 78e0b8b..08aee53 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -377,6 +377,7 @@ struct rpcrdma_xprt {
struct rpcrdma_create_data_internal rx_data;
struct delayed_work rx_connect_worker;
struct rpcrdma_stats rx_stats;
+ atomic_t rx_inject_count;
};

#define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)


2015-05-04 17:57:04

by Chuck Lever III

Subject: [PATCH v1 02/14] xprtrdma: Warn when there are orphaned IB objects

Print an error during transport destruction if ib_dealloc_pd()
fails. This is a sign that xprtrdma orphaned one or more RDMA API
objects at some point, which can pin lower layer kernel modules
and cause shutdown to hang.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4870d27..0cc4617 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -710,8 +710,8 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
}
if (ia->ri_pd != NULL && !IS_ERR(ia->ri_pd)) {
rc = ib_dealloc_pd(ia->ri_pd);
- dprintk("RPC: %s: ib_dealloc_pd returned %i\n",
- __func__, rc);
+ if (rc)
+ pr_warn("rpcrdma: ib_dealloc_pd status %i\n", rc);
}
}



2015-05-04 17:57:14

by Chuck Lever III

Subject: [PATCH v1 03/14] xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt

Clean up: Instead of carrying a pointer to the buffer pool and
the rpc_xprt, carry a pointer to the controlling rpcrdma_xprt.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++--
net/sunrpc/xprtrdma/transport.c | 7 ++-----
net/sunrpc/xprtrdma/verbs.c | 8 +++++---
net/sunrpc/xprtrdma/xprt_rdma.h | 3 +--
4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 2c53ea9..98a3b95 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -732,8 +732,8 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
struct rpcrdma_msg *headerp;
struct rpcrdma_req *req;
struct rpc_rqst *rqst;
- struct rpc_xprt *xprt = rep->rr_xprt;
- struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+ struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
+ struct rpc_xprt *xprt = &r_xprt->rx_xprt;
__be32 *iptr;
int rdmalen, status;
unsigned long cwnd;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index fdcb2c7..ed70551 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -650,12 +650,9 @@ xprt_rdma_send_request(struct rpc_task *task)

if (req->rl_reply == NULL) /* e.g. reconnection */
rpcrdma_recv_buffer_get(req);
-
- if (req->rl_reply) {
+ /* rpcrdma_recv_buffer_get may have set rl_reply, so check again */
+ if (req->rl_reply)
req->rl_reply->rr_func = rpcrdma_reply_handler;
- /* this need only be done once, but... */
- req->rl_reply->rr_xprt = xprt;
- }

/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 0cc4617..e1eb7c4 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -278,6 +278,7 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
{
struct rpcrdma_rep *rep =
(struct rpcrdma_rep *)(unsigned long)wc->wr_id;
+ struct rpcrdma_ia *ia;

/* WARNING: Only wr_id and status are reliable at this point */
if (wc->status != IB_WC_SUCCESS)
@@ -290,8 +291,9 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
dprintk("RPC: %s: rep %p opcode 'recv', length %u: success\n",
__func__, rep, wc->byte_len);

+ ia = &rep->rr_rxprt->rx_ia;
rep->rr_len = wc->byte_len;
- ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
+ ib_dma_sync_single_for_cpu(ia->ri_id->device,
rdmab_addr(rep->rr_rdmabuf),
rep->rr_len, DMA_FROM_DEVICE);
prefetch(rdmab_to_msg(rep->rr_rdmabuf));
@@ -1053,7 +1055,7 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
goto out_free;
}

- rep->rr_buffer = &r_xprt->rx_buf;
+ rep->rr_rxprt = r_xprt;
return rep;

out_free:
@@ -1423,7 +1425,7 @@ rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
void
rpcrdma_recv_buffer_put(struct rpcrdma_rep *rep)
{
- struct rpcrdma_buffer *buffers = rep->rr_buffer;
+ struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
unsigned long flags;

rep->rr_func = NULL;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 08aee53..143eb10 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -173,8 +173,7 @@ struct rpcrdma_buffer;

struct rpcrdma_rep {
unsigned int rr_len;
- struct rpcrdma_buffer *rr_buffer;
- struct rpc_xprt *rr_xprt;
+ struct rpcrdma_xprt *rr_rxprt;
void (*rr_func)(struct rpcrdma_rep *);
struct list_head rr_list;
struct rpcrdma_regbuf *rr_rdmabuf;


2015-05-04 17:57:23

by Chuck Lever III

Subject: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely

The connect worker can replace ri_id, but it guarantees that
ri_id->device does not change during the lifetime of a transport
instance.

Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
The cached copy can be used safely in code that does not serialize
with the connect worker.

Other code can use it to save an extra address generation (one
pointer dereference instead of two).

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 8 +----
net/sunrpc/xprtrdma/frwr_ops.c | 12 +++----
net/sunrpc/xprtrdma/physical_ops.c | 8 +----
net/sunrpc/xprtrdma/verbs.c | 61 +++++++++++++++++++-----------------
net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
5 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 302d4eb..0a96155 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -85,7 +85,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
int nsegs, bool writing)
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct ib_device *device = ia->ri_id->device;
+ struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
@@ -137,17 +137,13 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
- struct ib_device *device;
int rc, nsegs = seg->mr_nsegs;
LIST_HEAD(l);

list_add(&seg1->rl_mw->r.fmr->list, &l);
rc = ib_unmap_fmr(&l);
- read_lock(&ia->ri_qplock);
- device = ia->ri_id->device;
while (seg1->mr_nsegs--)
- rpcrdma_unmap_one(device, seg++);
- read_unlock(&ia->ri_qplock);
+ rpcrdma_unmap_one(ia->ri_device, seg++);
if (rc)
goto out_err;
return nsegs;
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index dff0481..66a85fa 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -137,7 +137,7 @@ static int
frwr_op_init(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
int i;
@@ -178,7 +178,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
int nsegs, bool writing)
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct ib_device *device = ia->ri_id->device;
+ struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
@@ -263,7 +263,6 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct ib_send_wr invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
- struct ib_device *device;

seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;

@@ -273,10 +272,9 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);

- read_lock(&ia->ri_qplock);
- device = ia->ri_id->device;
while (seg1->mr_nsegs--)
- rpcrdma_unmap_one(device, seg++);
+ rpcrdma_unmap_one(ia->ri_device, seg++);
+ read_lock(&ia->ri_qplock);
rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
@@ -304,7 +302,7 @@ static void
frwr_op_reset(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
struct rpcrdma_mw *r;
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index ba518af..da149e8 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -50,8 +50,7 @@ physical_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;

- rpcrdma_map_one(ia->ri_id->device, seg,
- rpcrdma_data_dir(writing));
+ rpcrdma_map_one(ia->ri_device, seg, rpcrdma_data_dir(writing));
seg->mr_rkey = ia->ri_bind_mem->rkey;
seg->mr_base = seg->mr_dma;
seg->mr_nsegs = 1;
@@ -65,10 +64,7 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;

- read_lock(&ia->ri_qplock);
- rpcrdma_unmap_one(ia->ri_id->device, seg);
- read_unlock(&ia->ri_qplock);
-
+ rpcrdma_unmap_one(ia->ri_device, seg);
return 1;
}

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e1eb7c4..ebcb0e2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -278,7 +278,6 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
{
struct rpcrdma_rep *rep =
(struct rpcrdma_rep *)(unsigned long)wc->wr_id;
- struct rpcrdma_ia *ia;

/* WARNING: Only wr_id and status are reliable at this point */
if (wc->status != IB_WC_SUCCESS)
@@ -291,9 +290,8 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
dprintk("RPC: %s: rep %p opcode 'recv', length %u: success\n",
__func__, rep, wc->byte_len);

- ia = &rep->rr_rxprt->rx_ia;
rep->rr_len = wc->byte_len;
- ib_dma_sync_single_for_cpu(ia->ri_id->device,
+ ib_dma_sync_single_for_cpu(rep->rr_device,
rdmab_addr(rep->rr_rdmabuf),
rep->rr_len, DMA_FROM_DEVICE);
prefetch(rdmab_to_msg(rep->rr_rdmabuf));
@@ -489,7 +487,7 @@ connected:

pr_info("rpcrdma: connection to %pIS:%u on %s, memreg '%s', %d credits, %d responders%s\n",
sap, rpc_get_port(sap),
- ia->ri_id->device->name,
+ ia->ri_device->name,
ia->ri_ops->ro_displayname,
xprt->rx_buf.rb_max_requests,
ird, ird < 4 && ird < tird / 2 ? " (low!)" : "");
@@ -590,8 +588,9 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
rc = PTR_ERR(ia->ri_id);
goto out1;
}
+ ia->ri_device = ia->ri_id->device;

- ia->ri_pd = ib_alloc_pd(ia->ri_id->device);
+ ia->ri_pd = ib_alloc_pd(ia->ri_device);
if (IS_ERR(ia->ri_pd)) {
rc = PTR_ERR(ia->ri_pd);
dprintk("RPC: %s: ib_alloc_pd() failed %i\n",
@@ -599,7 +598,7 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
goto out2;
}

- rc = ib_query_device(ia->ri_id->device, devattr);
+ rc = ib_query_device(ia->ri_device, devattr);
if (rc) {
dprintk("RPC: %s: ib_query_device failed %d\n",
__func__, rc);
@@ -608,7 +607,7 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)

if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
ia->ri_have_dma_lkey = 1;
- ia->ri_dma_lkey = ia->ri_id->device->local_dma_lkey;
+ ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
}

if (memreg == RPCRDMA_FRMR) {
@@ -623,7 +622,7 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
}
}
if (memreg == RPCRDMA_MTHCAFMR) {
- if (!ia->ri_id->device->alloc_fmr) {
+ if (!ia->ri_device->alloc_fmr) {
dprintk("RPC: %s: MTHCAFMR registration "
"not supported by HCA\n", __func__);
memreg = RPCRDMA_ALLPHYSICAL;
@@ -773,9 +772,9 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);

- sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
- rpcrdma_cq_async_error_upcall, ep,
- ep->rep_attr.cap.max_send_wr + 1, 0);
+ sendcq = ib_create_cq(ia->ri_device, rpcrdma_sendcq_upcall,
+ rpcrdma_cq_async_error_upcall, ep,
+ ep->rep_attr.cap.max_send_wr + 1, 0);
if (IS_ERR(sendcq)) {
rc = PTR_ERR(sendcq);
dprintk("RPC: %s: failed to create send CQ: %i\n",
@@ -790,9 +789,9 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
goto out2;
}

- recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
- rpcrdma_cq_async_error_upcall, ep,
- ep->rep_attr.cap.max_recv_wr + 1, 0);
+ recvcq = ib_create_cq(ia->ri_device, rpcrdma_recvcq_upcall,
+ rpcrdma_cq_async_error_upcall, ep,
+ ep->rep_attr.cap.max_recv_wr + 1, 0);
if (IS_ERR(recvcq)) {
rc = PTR_ERR(recvcq);
dprintk("RPC: %s: failed to create recv CQ: %i\n",
@@ -913,7 +912,7 @@ retry:
* More stuff I haven't thought of!
* Rrrgh!
*/
- if (ia->ri_id->device != id->device) {
+ if (ia->ri_device != id->device) {
printk("RPC: %s: can't reconnect on "
"different device!\n", __func__);
rdma_destroy_id(id);
@@ -1055,6 +1054,7 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
goto out_free;
}

+ rep->rr_device = ia->ri_device;
rep->rr_rxprt = r_xprt;
return rep;

@@ -1457,9 +1457,9 @@ rpcrdma_register_internal(struct rpcrdma_ia *ia, void *va, int len,
/*
* All memory passed here was kmalloc'ed, therefore phys-contiguous.
*/
- iov->addr = ib_dma_map_single(ia->ri_id->device,
+ iov->addr = ib_dma_map_single(ia->ri_device,
va, len, DMA_BIDIRECTIONAL);
- if (ib_dma_mapping_error(ia->ri_id->device, iov->addr))
+ if (ib_dma_mapping_error(ia->ri_device, iov->addr))
return -ENOMEM;

iov->length = len;
@@ -1503,8 +1503,8 @@ rpcrdma_deregister_internal(struct rpcrdma_ia *ia,
{
int rc;

- ib_dma_unmap_single(ia->ri_id->device,
- iov->addr, iov->length, DMA_BIDIRECTIONAL);
+ ib_dma_unmap_single(ia->ri_device,
+ iov->addr, iov->length, DMA_BIDIRECTIONAL);

if (NULL == mr)
return 0;
@@ -1597,15 +1597,18 @@ rpcrdma_ep_post(struct rpcrdma_ia *ia,
send_wr.num_sge = req->rl_niovs;
send_wr.opcode = IB_WR_SEND;
if (send_wr.num_sge == 4) /* no need to sync any pad (constant) */
- ib_dma_sync_single_for_device(ia->ri_id->device,
- req->rl_send_iov[3].addr, req->rl_send_iov[3].length,
- DMA_TO_DEVICE);
- ib_dma_sync_single_for_device(ia->ri_id->device,
- req->rl_send_iov[1].addr, req->rl_send_iov[1].length,
- DMA_TO_DEVICE);
- ib_dma_sync_single_for_device(ia->ri_id->device,
- req->rl_send_iov[0].addr, req->rl_send_iov[0].length,
- DMA_TO_DEVICE);
+ ib_dma_sync_single_for_device(ia->ri_device,
+ req->rl_send_iov[3].addr,
+ req->rl_send_iov[3].length,
+ DMA_TO_DEVICE);
+ ib_dma_sync_single_for_device(ia->ri_device,
+ req->rl_send_iov[1].addr,
+ req->rl_send_iov[1].length,
+ DMA_TO_DEVICE);
+ ib_dma_sync_single_for_device(ia->ri_device,
+ req->rl_send_iov[0].addr,
+ req->rl_send_iov[0].length,
+ DMA_TO_DEVICE);

if (DECR_CQCOUNT(ep) > 0)
send_wr.send_flags = 0;
@@ -1638,7 +1641,7 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
recv_wr.num_sge = 1;

- ib_dma_sync_single_for_cpu(ia->ri_id->device,
+ ib_dma_sync_single_for_cpu(ia->ri_device,
rdmab_addr(rep->rr_rdmabuf),
rdmab_length(rep->rr_rdmabuf),
DMA_BIDIRECTIONAL);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 143eb10..531ad33 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -62,6 +62,7 @@
struct rpcrdma_ia {
const struct rpcrdma_memreg_ops *ri_ops;
rwlock_t ri_qplock;
+ struct ib_device *ri_device;
struct rdma_cm_id *ri_id;
struct ib_pd *ri_pd;
struct ib_mr *ri_bind_mem;
@@ -173,6 +174,7 @@ struct rpcrdma_buffer;

struct rpcrdma_rep {
unsigned int rr_len;
+ struct ib_device *rr_device;
struct rpcrdma_xprt *rr_rxprt;
void (*rr_func)(struct rpcrdma_rep *);
struct list_head rr_list;


2015-05-04 17:57:33

by Chuck Lever III

Subject: [PATCH v1 05/14] xprtrdma: Introduce helpers for allocating MWs

We eventually want to handle allocating MWs one at a time, as
needed, instead of grabbing 64 and throwing them at each RPC in the
pipeline.

Add a helper for grabbing an MW off rb_mws, and a helper for
returning an MW to rb_mws. These will be used in a subsequent patch.
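
To illustrate the intended calling pattern (a sketch only; the real
callers are added by later patches in this series):

        struct rpcrdma_mw *mw;

        mw = rpcrdma_get_mw(r_xprt);    /* NULL if rb_mws is empty */
        if (!mw)
                return -ENOMEM;

        /* ... use the MW to register a chunk for this RPC ... */

        rpcrdma_put_mw(r_xprt, mw);     /* return it to rb_mws */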

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 31 +++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
2 files changed, 33 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ebcb0e2..c21329e 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1179,6 +1179,37 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
kfree(buf->rb_pool);
}

+struct rpcrdma_mw *
+rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
+{
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+ struct rpcrdma_mw *mw = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&buf->rb_lock, flags);
+ if (!list_empty(&buf->rb_mws)) {
+ mw = list_first_entry(&buf->rb_mws,
+ struct rpcrdma_mw, mw_list);
+ list_del_init(&mw->mw_list);
+ }
+ spin_unlock_irqrestore(&buf->rb_lock, flags);
+
+ if (!mw)
+ pr_err("RPC: %s: no MWs available\n", __func__);
+ return mw;
+}
+
+void
+rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
+{
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+ unsigned long flags;
+
+ spin_lock_irqsave(&buf->rb_lock, flags);
+ list_add_tail(&mw->mw_list, &buf->rb_mws);
+ spin_unlock_irqrestore(&buf->rb_lock, flags);
+}
+
/* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
* some req segments uninitialized.
*/
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 531ad33..7de424e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -415,6 +415,8 @@ int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_ep *,
int rpcrdma_buffer_create(struct rpcrdma_xprt *);
void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);

+struct rpcrdma_mw *rpcrdma_get_mw(struct rpcrdma_xprt *);
+void rpcrdma_put_mw(struct rpcrdma_xprt *, struct rpcrdma_mw *);
struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
void rpcrdma_buffer_put(struct rpcrdma_req *);
void rpcrdma_recv_buffer_get(struct rpcrdma_req *);


2015-05-04 17:57:43

by Chuck Lever III

Subject: [PATCH v1 06/14] xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()

Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because FMR mode can
transfer up to a 1MB payload using just a single ib_fmr.

Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Transport reset is now unneeded for FMR. Each FMR is recovered
synchronously when its RPC is retransmitted.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 64 +++++++++++++++++++++++++++++++----------
net/sunrpc/xprtrdma/verbs.c | 26 -----------------
2 files changed, 48 insertions(+), 42 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 0a96155..ad0055b 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -11,6 +11,21 @@
* can take tens of usecs to complete.
*/

+/* Normal operation
+ *
+ * A Memory Region is prepared for RDMA READ or WRITE using the
+ * ib_map_phys_fmr verb (fmr_op_map). When the RDMA operation is
+ * finished, the Memory Region is unmapped using the ib_unmap_fmr
+ * verb (fmr_op_unmap).
+ */
+
+/* Transport recovery
+ *
+ * After a transport reconnect, fmr_op_map re-uses the MR already
+ * allocated for the RPC, but generates a fresh rkey then maps the
+ * MR again. This process is synchronous.
+ */
+
#include "xprt_rdma.h"

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
@@ -77,6 +92,15 @@ out_fmr_err:
return rc;
}

+static int
+__fmr_unmap(struct rpcrdma_mw *r)
+{
+ LIST_HEAD(l);
+
+ list_add(&r->r.fmr->list, &l);
+ return ib_unmap_fmr(&l);
+}
+
/* Use the ib_map_phys_fmr() verb to register a memory region
* for remote access via RDMA READ or RDMA WRITE.
*/
@@ -88,9 +112,22 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
- struct rpcrdma_mw *mw = seg1->rl_mw;
u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
int len, pageoff, i, rc;
+ struct rpcrdma_mw *mw;
+
+ mw = seg1->rl_mw;
+ seg1->rl_mw = NULL;
+ if (!mw) {
+ mw = rpcrdma_get_mw(r_xprt);
+ if (!mw)
+ return -ENOMEM;
+ } else {
+ /* this is a retransmit; generate a fresh rkey */
+ rc = __fmr_unmap(mw);
+ if (rc)
+ return rc;
+ }

pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
@@ -114,6 +151,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (rc)
goto out_maperr;

+ seg1->rl_mw = mw;
seg1->mr_rkey = mw->r.fmr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
@@ -137,18 +175,24 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
+ struct rpcrdma_mw *mw = seg1->rl_mw;
int rc, nsegs = seg->mr_nsegs;
- LIST_HEAD(l);

- list_add(&seg1->rl_mw->r.fmr->list, &l);
- rc = ib_unmap_fmr(&l);
+ dprintk("RPC: %s: FMR %p\n", __func__, mw);
+
+ seg1->rl_mw = NULL;
while (seg1->mr_nsegs--)
rpcrdma_unmap_one(ia->ri_device, seg++);
+ rc = __fmr_unmap(mw);
if (rc)
goto out_err;
+ rpcrdma_put_mw(r_xprt, mw);
return nsegs;

out_err:
+ /* The FMR is abandoned, but remains in rb_all. fmr_op_destroy
+ * will attempt to release it when the transport is destroyed.
+ */
dprintk("RPC: %s: ib_unmap_fmr status %i\n", __func__, rc);
return nsegs;
}
@@ -161,18 +205,6 @@ out_err:
static void
fmr_op_reset(struct rpcrdma_xprt *r_xprt)
{
- struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- struct rpcrdma_mw *r;
- LIST_HEAD(list);
- int rc;
-
- list_for_each_entry(r, &buf->rb_all, mw_all)
- list_add(&r->r.fmr->list, &list);
-
- rc = ib_unmap_fmr(&list);
- if (rc)
- dprintk("RPC: %s: ib_unmap_fmr failed %i\n",
- __func__, rc);
}

static void
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c21329e..8a43c7ef 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1331,28 +1331,6 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
return NULL;
}

-static struct rpcrdma_req *
-rpcrdma_buffer_get_fmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
-{
- struct rpcrdma_mw *r;
- int i;
-
- i = RPCRDMA_MAX_SEGS - 1;
- while (!list_empty(&buf->rb_mws)) {
- r = list_entry(buf->rb_mws.next,
- struct rpcrdma_mw, mw_list);
- list_del(&r->mw_list);
- req->rl_segments[i].rl_mw = r;
- if (unlikely(i-- == 0))
- return req; /* Success */
- }
-
- /* Not enough entries on rb_mws for this req */
- rpcrdma_buffer_put_sendbuf(req, buf);
- rpcrdma_buffer_put_mrs(req, buf);
- return NULL;
-}
-
/*
* Get a set of request/reply buffers.
*
@@ -1394,9 +1372,6 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
case RPCRDMA_FRMR:
req = rpcrdma_buffer_get_frmrs(req, buffers, &stale);
break;
- case RPCRDMA_MTHCAFMR:
- req = rpcrdma_buffer_get_fmrs(req, buffers);
- break;
default:
break;
}
@@ -1421,7 +1396,6 @@ rpcrdma_buffer_put(struct rpcrdma_req *req)
rpcrdma_buffer_put_sendbuf(req, buffers);
switch (ia->ri_memreg_strategy) {
case RPCRDMA_FRMR:
- case RPCRDMA_MTHCAFMR:
rpcrdma_buffer_put_mrs(req, buffers);
break;
default:


2015-05-04 17:57:53

by Chuck Lever III

Subject: [PATCH v1 07/14] xprtrdma: Introduce an FRMR recovery workqueue

After a transport disconnect, FRMRs can be left in an undetermined
state. In particular, the MR's rkey is no good.

Currently, FRMRs are fixed up by the transport connect worker, but
that can race with ->ro_unmap if an RPC happens to exit while the
transport connect worker is running.

A better way of dealing with broken FRMRs is to detect them before
they are re-used by ->ro_map. Such FRMRs are either already invalid
or are owned by the sending RPC, and thus no race with ->ro_unmap
is possible.

Introduce a mechanism for handing broken FRMRs to a workqueue to be
reset in a context that is appropriate for allocating resources
(i.e., an ib_alloc_fast_reg_mr() API call).

This mechanism is not yet used, but will be in subsequent patches.
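
For context, the FRWR map path in a later patch of this series is
expected to use the queueing helper roughly like this (a sketch of
the frwr_op_map() hunk in "Acquire MRs in
rpcrdma_register_external()"):

        /* ->ro_map cannot sleep: recycle any MW that is not INVALID
         * via the recovery workqueue and take a usable one instead.
         */
        do {
                if (mw)
                        __frwr_queue_recovery(mw);
                mw = rpcrdma_get_mw(r_xprt);
                if (!mw)
                        return -ENOMEM;
        } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);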

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 71 ++++++++++++++++++++++++++++++++++++++-
net/sunrpc/xprtrdma/transport.c | 11 +++++-
net/sunrpc/xprtrdma/xprt_rdma.h | 5 +++
3 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 66a85fa..a06d9a3 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,74 @@
# define RPCDBG_FACILITY RPCDBG_TRANS
#endif

+static struct workqueue_struct *frwr_recovery_wq;
+
+#define FRWR_RECOVERY_WQ_FLAGS (WQ_UNBOUND | WQ_MEM_RECLAIM)
+
+int
+frwr_alloc_recovery_wq(void)
+{
+ frwr_recovery_wq = alloc_workqueue("frwr_recovery",
+ FRWR_RECOVERY_WQ_FLAGS, 0);
+ return !frwr_recovery_wq ? -ENOMEM : 0;
+}
+
+void
+frwr_destroy_recovery_wq(void)
+{
+ struct workqueue_struct *wq;
+
+ if (!frwr_recovery_wq)
+ return;
+
+ wq = frwr_recovery_wq;
+ frwr_recovery_wq = NULL;
+ destroy_workqueue(wq);
+}
+
+/* Deferred reset of a single FRMR. Generate a fresh rkey by
+ * replacing the MR.
+ *
+ * There's no recovery if this fails. The FRMR is abandoned, but
+ * remains in rb_all. It will be cleaned up when the transport is
+ * destroyed.
+ */
+static void
+__frwr_recovery_worker(struct work_struct *work)
+{
+ struct rpcrdma_mw *r = container_of(work, struct rpcrdma_mw,
+ r.frmr.fr_work);
+ struct rpcrdma_xprt *r_xprt = r->r.frmr.fr_xprt;
+ unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
+ struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+
+ if (ib_dereg_mr(r->r.frmr.fr_mr))
+ goto out_fail;
+
+ r->r.frmr.fr_mr = ib_alloc_fast_reg_mr(pd, depth);
+ if (IS_ERR(r->r.frmr.fr_mr))
+ goto out_fail;
+
+ dprintk("RPC: %s: recovered FRMR %p\n", __func__, r);
+ r->r.frmr.fr_state = FRMR_IS_INVALID;
+ rpcrdma_put_mw(r_xprt, r);
+ return;
+
+out_fail:
+ pr_warn("RPC: %s: FRMR %p unrecovered\n",
+ __func__, r);
+}
+
+/* A broken MR was discovered in a context that can't sleep.
+ * Defer recovery to the recovery worker.
+ */
+static void
+__frwr_queue_recovery(struct rpcrdma_mw *r)
+{
+ INIT_WORK(&r->r.frmr.fr_work, __frwr_recovery_worker);
+ queue_work(frwr_recovery_wq, &r->r.frmr.fr_work);
+}
+
static int
__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
unsigned int depth)
@@ -128,7 +196,7 @@ frwr_sendcompletion(struct ib_wc *wc)

/* WARNING: Only wr_id and status are reliable at this point */
r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
- dprintk("RPC: %s: frmr %p (stale), status %d\n",
+ pr_warn("RPC: %s: frmr %p flushed, status %d\n",
__func__, r, wc->status);
r->r.frmr.fr_state = FRMR_IS_STALE;
}
@@ -165,6 +233,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
list_add(&r->mw_list, &buf->rb_mws);
list_add(&r->mw_all, &buf->rb_all);
r->mw_sendcompletion = frwr_sendcompletion;
+ r->r.frmr.fr_xprt = r_xprt;
}

return 0;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index ed70551..f1fa6a7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -757,17 +757,24 @@ static void __exit xprt_rdma_cleanup(void)
if (rc)
dprintk("RPC: %s: xprt_unregister returned %i\n",
__func__, rc);
+
+ frwr_destroy_recovery_wq();
}

static int __init xprt_rdma_init(void)
{
int rc;

- rc = xprt_register_transport(&xprt_rdma);
-
+ rc = frwr_alloc_recovery_wq();
if (rc)
return rc;

+ rc = xprt_register_transport(&xprt_rdma);
+ if (rc) {
+ frwr_destroy_recovery_wq();
+ return rc;
+ }
+
dprintk("RPCRDMA Module Init, register RPC RDMA transport\n");

dprintk("Defaults:\n");
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 7de424e..98227d6 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -204,6 +204,8 @@ struct rpcrdma_frmr {
struct ib_fast_reg_page_list *fr_pgl;
struct ib_mr *fr_mr;
enum rpcrdma_frmr_state fr_state;
+ struct work_struct fr_work;
+ struct rpcrdma_xprt *fr_xprt;
};

struct rpcrdma_mw {
@@ -429,6 +431,9 @@ void rpcrdma_free_regbuf(struct rpcrdma_ia *,

unsigned int rpcrdma_max_segments(struct rpcrdma_xprt *);

+int frwr_alloc_recovery_wq(void);
+void frwr_destroy_recovery_wq(void);
+
/*
* Wrappers for chunk registration, shared by read/write chunk code.
*/


2015-05-04 17:58:02

by Chuck Lever III

Subject: [PATCH v1 08/14] xprtrdma: Acquire MRs in rpcrdma_register_external()

Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because most modern adapters
can transfer 100s of KBs of payload using just a single MR.

Instead, acquire MRs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
registration fails") is reverted: There is now a valid case where
registration can fail (with -ENOMEM) but the QP is still in RTS.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 120 ++++++++++++++++++++++++++++------------
net/sunrpc/xprtrdma/rpc_rdma.c | 3 -
net/sunrpc/xprtrdma/verbs.c | 21 -------
3 files changed, 86 insertions(+), 58 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index a06d9a3..6f93a89 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -11,6 +11,62 @@
* but most complex memory registration mode.
*/

+/* Normal operation
+ *
+ * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
+ * Work Request (frwr_op_map). When the RDMA operation is finished, this
+ * Memory Region is invalidated using a LOCAL_INV Work Request
+ * (frwr_op_unmap).
+ *
+ * Typically these Work Requests are not signaled, and neither are RDMA
+ * SEND Work Requests (with the exception of signaling occasionally to
+ * prevent provider work queue overflows). This greatly reduces HCA
+ * interrupt workload.
+ *
+ * As an optimization, frwr_op_unmap marks MRs INVALID before the
+ * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
+ * rb_mws immediately so that no work (like managing a linked list
+ * under a spinlock) is needed in the completion upcall.
+ *
+ * But this means that frwr_op_map() can occasionally encounter an MR
+ * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
+ * ordering prevents a subsequent FAST_REG WR from executing against
+ * that MR while it is still being invalidated.
+ */
+
+/* Transport recovery
+ *
+ * ->op_map and the transport connect worker cannot run at the same
+ * time, but ->op_unmap can fire while the transport connect worker
+ * is running. Thus MR recovery is handled in ->op_map, to guarantee
+ * that recovered MRs are owned by a sending RPC, and not one where
+ * ->op_unmap could fire at the same time transport reconnect is
+ * being done.
+ *
+ * When the underlying transport disconnects, MRs are left in one of
+ * three states:
+ *
+ * INVALID: The MR was not in use before the QP entered ERROR state.
+ * (Or, the LOCAL_INV WR has not completed or flushed yet).
+ *
+ * STALE: The MR was being registered or unregistered when the QP
+ * entered ERROR state, and the pending WR was flushed.
+ *
+ * VALID: The MR was registered before the QP entered ERROR state.
+ *
+ * When frwr_op_map encounters STALE and VALID MRs, they are recovered
+ * with ib_dereg_mr and then are re-initialized. Because MR recovery
+ * allocates fresh resources, it is deferred to a workqueue, and the
+ * recovered MRs are placed back on the rb_mws list when recovery is
+ * complete. frwr_op_map allocates another MR for the current RPC while
+ * the broken MR is reset.
+ *
+ * To ensure that frwr_op_map doesn't encounter an MR that is marked
+ * INVALID but that is about to be flushed due to a previous transport
+ * disconnect, the transport connect worker attempts to drain all
+ * pending send queue WRs before the transport is reconnected.
+ */
+
#include "xprt_rdma.h"

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
@@ -250,9 +306,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
- struct rpcrdma_mw *mw = seg1->rl_mw;
- struct rpcrdma_frmr *frmr = &mw->r.frmr;
- struct ib_mr *mr = frmr->fr_mr;
+ struct rpcrdma_mw *mw;
+ struct rpcrdma_frmr *frmr;
+ struct ib_mr *mr;
struct ib_send_wr fastreg_wr, *bad_wr;
u8 key;
int len, pageoff;
@@ -261,12 +317,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
u64 pa;
int page_no;

+ mw = seg1->rl_mw;
+ seg1->rl_mw = NULL;
+ do {
+ if (mw)
+ __frwr_queue_recovery(mw);
+ mw = rpcrdma_get_mw(r_xprt);
+ if (!mw)
+ return -ENOMEM;
+ } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
+ frmr = &mw->r.frmr;
+ frmr->fr_state = FRMR_IS_VALID;
+
pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
seg1->mr_len += pageoff;
len = -pageoff;
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
+
for (page_no = i = 0; i < nsegs;) {
rpcrdma_map_one(device, seg, direction);
pa = seg->mr_dma;
@@ -285,8 +354,6 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
dprintk("RPC: %s: Using frmr %p to map %d segments (%d bytes)\n",
__func__, mw, i, len);

- frmr->fr_state = FRMR_IS_VALID;
-
memset(&fastreg_wr, 0, sizeof(fastreg_wr));
fastreg_wr.wr_id = (unsigned long)(void *)mw;
fastreg_wr.opcode = IB_WR_FAST_REG_MR;
@@ -298,6 +365,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
fastreg_wr.wr.fast_reg.access_flags = writing ?
IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
IB_ACCESS_REMOTE_READ;
+ mr = frmr->fr_mr;
key = (u8)(mr->rkey & 0x000000FF);
ib_update_fast_reg_key(mr, ++key);
fastreg_wr.wr.fast_reg.rkey = mr->rkey;
@@ -307,6 +375,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (rc)
goto out_senderr;

+ seg1->rl_mw = mw;
seg1->mr_rkey = mr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
@@ -315,10 +384,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,

out_senderr:
dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
- ib_update_fast_reg_key(mr, --key);
- frmr->fr_state = FRMR_IS_INVALID;
while (i--)
rpcrdma_unmap_one(device, --seg);
+ __frwr_queue_recovery(mw);
return rc;
}

@@ -330,15 +398,19 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
{
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+ struct rpcrdma_mw *mw = seg1->rl_mw;
struct ib_send_wr invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;

- seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
+ dprintk("RPC: %s: FRMR %p\n", __func__, mw);
+
+ seg1->rl_mw = NULL;
+ mw->r.frmr.fr_state = FRMR_IS_INVALID;

memset(&invalidate_wr, 0, sizeof(invalidate_wr));
- invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
+ invalidate_wr.wr_id = (unsigned long)(void *)mw;
invalidate_wr.opcode = IB_WR_LOCAL_INV;
- invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
+ invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);

while (seg1->mr_nsegs--)
@@ -348,12 +420,13 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
+
+ rpcrdma_put_mw(r_xprt, mw);
return nsegs;

out_err:
- /* Force rpcrdma_buffer_get() to retry */
- seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
+ __frwr_queue_recovery(mw);
return nsegs;
}

@@ -370,29 +443,6 @@ out_err:
static void
frwr_op_reset(struct rpcrdma_xprt *r_xprt)
{
- struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- struct ib_device *device = r_xprt->rx_ia.ri_device;
- unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
- struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
- struct rpcrdma_mw *r;
- int rc;
-
- list_for_each_entry(r, &buf->rb_all, mw_all) {
- if (r->r.frmr.fr_state == FRMR_IS_INVALID)
- continue;
-
- __frwr_release(r);
- rc = __frwr_init(r, pd, device, depth);
- if (rc) {
- dprintk("RPC: %s: mw %p left %s\n",
- __func__, r,
- (r->r.frmr.fr_state == FRMR_IS_STALE ?
- "stale" : "valid"));
- continue;
- }
-
- r->r.frmr.fr_state = FRMR_IS_INVALID;
- }
}

static void
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 98a3b95..35ead0b 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -284,9 +284,6 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
return (unsigned char *)iptr - (unsigned char *)headerp;

out:
- if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
- return n;
-
for (pos = 0; nchunks--;)
pos += r_xprt->rx_ia.ri_ops->ro_unmap(r_xprt,
&req->rl_segments[pos]);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 8a43c7ef..5226161 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1343,12 +1343,11 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
struct rpcrdma_req *
rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
{
- struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
- struct list_head stale;
struct rpcrdma_req *req;
unsigned long flags;

spin_lock_irqsave(&buffers->rb_lock, flags);
+
if (buffers->rb_send_index == buffers->rb_max_requests) {
spin_unlock_irqrestore(&buffers->rb_lock, flags);
dprintk("RPC: %s: out of request buffers\n", __func__);
@@ -1367,17 +1366,7 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
}
buffers->rb_send_bufs[buffers->rb_send_index++] = NULL;

- INIT_LIST_HEAD(&stale);
- switch (ia->ri_memreg_strategy) {
- case RPCRDMA_FRMR:
- req = rpcrdma_buffer_get_frmrs(req, buffers, &stale);
- break;
- default:
- break;
- }
spin_unlock_irqrestore(&buffers->rb_lock, flags);
- if (!list_empty(&stale))
- rpcrdma_retry_flushed_linv(&stale, buffers);
return req;
}

@@ -1389,18 +1378,10 @@ void
rpcrdma_buffer_put(struct rpcrdma_req *req)
{
struct rpcrdma_buffer *buffers = req->rl_buffer;
- struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
unsigned long flags;

spin_lock_irqsave(&buffers->rb_lock, flags);
rpcrdma_buffer_put_sendbuf(req, buffers);
- switch (ia->ri_memreg_strategy) {
- case RPCRDMA_FRMR:
- rpcrdma_buffer_put_mrs(req, buffers);
- break;
- default:
- break;
- }
spin_unlock_irqrestore(&buffers->rb_lock, flags);
}



2015-05-04 17:58:12

by Chuck Lever III

Subject: [PATCH v1 09/14] xprtrdma: Remove unused LOCAL_INV recovery logic

Clean up: Remove functions no longer used to recover broken FRMRs.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 109 -------------------------------------------
1 file changed, 109 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5226161..5120a8e 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1210,33 +1210,6 @@ rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
spin_unlock_irqrestore(&buf->rb_lock, flags);
}

-/* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
- * some req segments uninitialized.
- */
-static void
-rpcrdma_buffer_put_mr(struct rpcrdma_mw **mw, struct rpcrdma_buffer *buf)
-{
- if (*mw) {
- list_add_tail(&(*mw)->mw_list, &buf->rb_mws);
- *mw = NULL;
- }
-}
-
-/* Cycle mw's back in reverse order, and "spin" them.
- * This delays and scrambles reuse as much as possible.
- */
-static void
-rpcrdma_buffer_put_mrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
-{
- struct rpcrdma_mr_seg *seg = req->rl_segments;
- struct rpcrdma_mr_seg *seg1 = seg;
- int i;
-
- for (i = 1, seg++; i < RPCRDMA_MAX_SEGS; seg++, i++)
- rpcrdma_buffer_put_mr(&seg->rl_mw, buf);
- rpcrdma_buffer_put_mr(&seg1->rl_mw, buf);
-}
-
static void
rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
{
@@ -1249,88 +1222,6 @@ rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
}
}

-/* rpcrdma_unmap_one() was already done during deregistration.
- * Redo only the ib_post_send().
- */
-static void
-rpcrdma_retry_local_inv(struct rpcrdma_mw *r, struct rpcrdma_ia *ia)
-{
- struct rpcrdma_xprt *r_xprt =
- container_of(ia, struct rpcrdma_xprt, rx_ia);
- struct ib_send_wr invalidate_wr, *bad_wr;
- int rc;
-
- dprintk("RPC: %s: FRMR %p is stale\n", __func__, r);
-
- /* When this FRMR is re-inserted into rb_mws, it is no longer stale */
- r->r.frmr.fr_state = FRMR_IS_INVALID;
-
- memset(&invalidate_wr, 0, sizeof(invalidate_wr));
- invalidate_wr.wr_id = (unsigned long)(void *)r;
- invalidate_wr.opcode = IB_WR_LOCAL_INV;
- invalidate_wr.ex.invalidate_rkey = r->r.frmr.fr_mr->rkey;
- DECR_CQCOUNT(&r_xprt->rx_ep);
-
- dprintk("RPC: %s: frmr %p invalidating rkey %08x\n",
- __func__, r, r->r.frmr.fr_mr->rkey);
-
- read_lock(&ia->ri_qplock);
- rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
- read_unlock(&ia->ri_qplock);
- if (rc) {
- /* Force rpcrdma_buffer_get() to retry */
- r->r.frmr.fr_state = FRMR_IS_STALE;
- dprintk("RPC: %s: ib_post_send failed, %i\n",
- __func__, rc);
- }
-}
-
-static void
-rpcrdma_retry_flushed_linv(struct list_head *stale,
- struct rpcrdma_buffer *buf)
-{
- struct rpcrdma_ia *ia = rdmab_to_ia(buf);
- struct list_head *pos;
- struct rpcrdma_mw *r;
- unsigned long flags;
-
- list_for_each(pos, stale) {
- r = list_entry(pos, struct rpcrdma_mw, mw_list);
- rpcrdma_retry_local_inv(r, ia);
- }
-
- spin_lock_irqsave(&buf->rb_lock, flags);
- list_splice_tail(stale, &buf->rb_mws);
- spin_unlock_irqrestore(&buf->rb_lock, flags);
-}
-
-static struct rpcrdma_req *
-rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
- struct list_head *stale)
-{
- struct rpcrdma_mw *r;
- int i;
-
- i = RPCRDMA_MAX_SEGS - 1;
- while (!list_empty(&buf->rb_mws)) {
- r = list_entry(buf->rb_mws.next,
- struct rpcrdma_mw, mw_list);
- list_del(&r->mw_list);
- if (r->r.frmr.fr_state == FRMR_IS_STALE) {
- list_add(&r->mw_list, stale);
- continue;
- }
- req->rl_segments[i].rl_mw = r;
- if (unlikely(i-- == 0))
- return req; /* Success */
- }
-
- /* Not enough entries on rb_mws for this req */
- rpcrdma_buffer_put_sendbuf(req, buf);
- rpcrdma_buffer_put_mrs(req, buf);
- return NULL;
-}
-
/*
* Get a set of request/reply buffers.
*


2015-05-04 17:58:21

by Chuck Lever III

Subject: [PATCH v1 10/14] xprtrdma: Remove ->ro_reset

An RPC can exit at any time. When it does so, xprt_rdma_free() is
called, and it calls ->ro_unmap().

If ->ro_reset() is running due to a transport disconnect, the two
methods can race while processing the same rpcrdma_mw. The results
are unpredictable.

Because of this, in previous patches I've replaced the ->ro_reset()
methods with a recovery workqueue. ->ro_reset() is no longer used
and can be removed.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 11 -----------
net/sunrpc/xprtrdma/frwr_ops.c | 16 ----------------
net/sunrpc/xprtrdma/physical_ops.c | 6 ------
net/sunrpc/xprtrdma/verbs.c | 2 --
net/sunrpc/xprtrdma/xprt_rdma.h | 1 -
5 files changed, 36 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index ad0055b..5dd77da 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -197,16 +197,6 @@ out_err:
return nsegs;
}

-/* After a disconnect, unmap all FMRs.
- *
- * This is invoked only in the transport connect worker in order
- * to serialize with rpcrdma_register_fmr_external().
- */
-static void
-fmr_op_reset(struct rpcrdma_xprt *r_xprt)
-{
-}
-
static void
fmr_op_destroy(struct rpcrdma_buffer *buf)
{
@@ -230,7 +220,6 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_open = fmr_op_open,
.ro_maxpages = fmr_op_maxpages,
.ro_init = fmr_op_init,
- .ro_reset = fmr_op_reset,
.ro_destroy = fmr_op_destroy,
.ro_displayname = "fmr",
};
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 6f93a89..3fb609a 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -430,21 +430,6 @@ out_err:
return nsegs;
}

-/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
- * an unusable state. Find FRMRs in this state and dereg / reg
- * each. FRMRs that are VALID and attached to an rpcrdma_req are
- * also torn down.
- *
- * This gives all in-use FRMRs a fresh rkey and leaves them INVALID.
- *
- * This is invoked only in the transport connect worker in order
- * to serialize with rpcrdma_register_frmr_external().
- */
-static void
-frwr_op_reset(struct rpcrdma_xprt *r_xprt)
-{
-}
-
static void
frwr_op_destroy(struct rpcrdma_buffer *buf)
{
@@ -464,7 +449,6 @@ const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_open = frwr_op_open,
.ro_maxpages = frwr_op_maxpages,
.ro_init = frwr_op_init,
- .ro_reset = frwr_op_reset,
.ro_destroy = frwr_op_destroy,
.ro_displayname = "frwr",
};
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index da149e8..41985d0 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -69,11 +69,6 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
}

static void
-physical_op_reset(struct rpcrdma_xprt *r_xprt)
-{
-}
-
-static void
physical_op_destroy(struct rpcrdma_buffer *buf)
{
}
@@ -84,7 +79,6 @@ const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_open = physical_op_open,
.ro_maxpages = physical_op_maxpages,
.ro_init = physical_op_init,
- .ro_reset = physical_op_reset,
.ro_destroy = physical_op_destroy,
.ro_displayname = "physical",
};
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5120a8e..eaf0b9d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -897,8 +897,6 @@ retry:
rpcrdma_flush_cqs(ep);

xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
- ia->ri_ops->ro_reset(xprt);
-
id = rpcrdma_create_id(xprt, ia,
(struct sockaddr *)&xprt->rx_data.addr);
if (IS_ERR(id)) {
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 98227d6..6a1e565 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -353,7 +353,6 @@ struct rpcrdma_memreg_ops {
struct rpcrdma_create_data_internal *);
size_t (*ro_maxpages)(struct rpcrdma_xprt *);
int (*ro_init)(struct rpcrdma_xprt *);
- void (*ro_reset)(struct rpcrdma_xprt *);
void (*ro_destroy)(struct rpcrdma_buffer *);
const char *ro_displayname;
};


2015-05-04 17:58:30

by Chuck Lever III

Subject: [PATCH v1 11/14] xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy

Clean up: This field is no longer used.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/xprtrdma.h | 3 ++-
net/sunrpc/xprtrdma/verbs.c | 3 ---
net/sunrpc/xprtrdma/xprt_rdma.h | 1 -
3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
index c984c85..b176130 100644
--- a/include/linux/sunrpc/xprtrdma.h
+++ b/include/linux/sunrpc/xprtrdma.h
@@ -56,7 +56,8 @@

#define RPCRDMA_INLINE_PAD_THRESH (512)/* payload threshold to pad (bytes) */

-/* memory registration strategies */
+/* Memory registration strategies, by number.
+ * This is part of a kernel / user space API. Do not remove. */
enum rpcrdma_memreg {
RPCRDMA_BOUNCEBUFFERS = 0,
RPCRDMA_REGISTER,
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index eaf0b9d..1f51547 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -671,9 +671,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
dprintk("RPC: %s: memory registration strategy is '%s'\n",
__func__, ia->ri_ops->ro_displayname);

- /* Else will do memory reg/dereg for each chunk */
- ia->ri_memreg_strategy = memreg;
-
rwlock_init(&ia->ri_qplock);
return 0;

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 6a1e565..5650c23 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -70,7 +70,6 @@ struct rpcrdma_ia {
int ri_have_dma_lkey;
struct completion ri_done;
int ri_async_rc;
- enum rpcrdma_memreg ri_memreg_strategy;
unsigned int ri_max_frmr_depth;
struct ib_device_attr ri_devattr;
struct ib_qp_attr ri_qp_attr;


2015-05-04 17:58:40

by Chuck Lever III

Subject: [PATCH v1 12/14] xprtrdma: Split rb_lock

/proc/lock_stat showed contention between rpcrdma_buffer_get/put
and the MR allocation functions during I/O intensive workloads.

Now that MRs are no longer allocated in rpcrdma_buffer_get(),
there's no reason the rb_mws list has to be managed using the
same lock as the send/receive buffers. Split that lock. The
new lock does not need to disable interrupts because buffer
get/put is never called in an interrupt context.

struct rpcrdma_buffer is re-arranged to ensure that rb_mwlock and
rb_mws are always in a different cacheline than rb_lock and the
buffer pointers.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 1 +
net/sunrpc/xprtrdma/frwr_ops.c | 1 +
net/sunrpc/xprtrdma/verbs.c | 10 ++++------
net/sunrpc/xprtrdma/xprt_rdma.h | 16 +++++++++-------
4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 5dd77da..52f9ad5 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -65,6 +65,7 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
struct rpcrdma_mw *r;
int i, rc;

+ spin_lock_init(&buf->rb_mwlock);
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 3fb609a..edc10ba 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -266,6 +266,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
int i;

+ spin_lock_init(&buf->rb_mwlock);
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1f51547..c5830cd 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1179,15 +1179,14 @@ rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpcrdma_mw *mw = NULL;
- unsigned long flags;

- spin_lock_irqsave(&buf->rb_lock, flags);
+ spin_lock(&buf->rb_mwlock);
if (!list_empty(&buf->rb_mws)) {
mw = list_first_entry(&buf->rb_mws,
struct rpcrdma_mw, mw_list);
list_del_init(&mw->mw_list);
}
- spin_unlock_irqrestore(&buf->rb_lock, flags);
+ spin_unlock(&buf->rb_mwlock);

if (!mw)
pr_err("RPC: %s: no MWs available\n", __func__);
@@ -1198,11 +1197,10 @@ void
rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- unsigned long flags;

- spin_lock_irqsave(&buf->rb_lock, flags);
+ spin_lock(&buf->rb_mwlock);
list_add_tail(&mw->mw_list, &buf->rb_mws);
- spin_unlock_irqrestore(&buf->rb_lock, flags);
+ spin_unlock(&buf->rb_mwlock);
}

static void
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 5650c23..ae31fc7 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -283,15 +283,17 @@ rpcr_to_rdmar(struct rpc_rqst *rqst)
* One of these is associated with a transport instance
*/
struct rpcrdma_buffer {
- spinlock_t rb_lock; /* protects indexes */
- u32 rb_max_requests;/* client max requests */
- struct list_head rb_mws; /* optional memory windows/fmrs/frmrs */
- struct list_head rb_all;
- int rb_send_index;
+ spinlock_t rb_mwlock; /* protect rb_mws list */
+ struct list_head rb_mws;
+ struct list_head rb_all;
+ char *rb_pool;
+
+ spinlock_t rb_lock; /* protect buf arrays */
+ u32 rb_max_requests;
+ int rb_send_index;
+ int rb_recv_index;
struct rpcrdma_req **rb_send_bufs;
- int rb_recv_index;
struct rpcrdma_rep **rb_recv_bufs;
- char *rb_pool;
};
#define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)



2015-05-04 17:58:49

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v1 13/14] xprtrdma: Stack relief in fmr_op_map()

fmr_op_map() declares a 64 element array of u64 in automatic
storage. This is 512 bytes (8 * 64) on the stack.

Instead, when FMR memory registration is in use, pre-allocate a
physaddr array for each rpcrdma_mw.

This is a pre-requisite for increasing the r/wsize maximum for
FMR on platforms with 4KB pages.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 32 ++++++++++++++++++++++----------
net/sunrpc/xprtrdma/xprt_rdma.h | 7 ++++++-
2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 52f9ad5..4a53ad5 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -72,13 +72,19 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
dprintk("RPC: %s: initializing %d FMRs\n", __func__, i);

+ rc = -ENOMEM;
while (i--) {
r = kzalloc(sizeof(*r), GFP_KERNEL);
if (!r)
- return -ENOMEM;
+ goto out;
+
+ r->r.fmr.physaddrs = kmalloc(RPCRDMA_MAX_FMR_SGES *
+ sizeof(u64), GFP_KERNEL);
+ if (!r->r.fmr.physaddrs)
+ goto out_free;

- r->r.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
- if (IS_ERR(r->r.fmr))
+ r->r.fmr.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
+ if (IS_ERR(r->r.fmr.fmr))
goto out_fmr_err;

list_add(&r->mw_list, &buf->rb_mws);
@@ -87,9 +93,12 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
return 0;

out_fmr_err:
- rc = PTR_ERR(r->r.fmr);
+ rc = PTR_ERR(r->r.fmr.fmr);
dprintk("RPC: %s: ib_alloc_fmr status %i\n", __func__, rc);
+ kfree(r->r.fmr.physaddrs);
+out_free:
kfree(r);
+out:
return rc;
}

@@ -98,7 +107,7 @@ __fmr_unmap(struct rpcrdma_mw *r)
{
LIST_HEAD(l);

- list_add(&r->r.fmr->list, &l);
+ list_add(&r->r.fmr.fmr->list, &l);
return ib_unmap_fmr(&l);
}

@@ -113,7 +122,6 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
- u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
int len, pageoff, i, rc;
struct rpcrdma_mw *mw;

@@ -138,7 +146,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
nsegs = RPCRDMA_MAX_FMR_SGES;
for (i = 0; i < nsegs;) {
rpcrdma_map_one(device, seg, direction);
- physaddrs[i] = seg->mr_dma;
+ mw->r.fmr.physaddrs[i] = seg->mr_dma;
len += seg->mr_len;
++seg;
++i;
@@ -148,12 +156,13 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
break;
}

- rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
+ rc = ib_map_phys_fmr(mw->r.fmr.fmr, mw->r.fmr.physaddrs,
+ i, seg1->mr_dma);
if (rc)
goto out_maperr;

seg1->rl_mw = mw;
- seg1->mr_rkey = mw->r.fmr->rkey;
+ seg1->mr_rkey = mw->r.fmr.fmr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
seg1->mr_len = len;
@@ -207,10 +216,13 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
while (!list_empty(&buf->rb_all)) {
r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
list_del(&r->mw_all);
- rc = ib_dealloc_fmr(r->r.fmr);
+ kfree(r->r.fmr.physaddrs);
+
+ rc = ib_dealloc_fmr(r->r.fmr.fmr);
if (rc)
dprintk("RPC: %s: ib_dealloc_fmr failed %i\n",
__func__, rc);
+
kfree(r);
}
}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ae31fc7..e176bae 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -207,9 +207,14 @@ struct rpcrdma_frmr {
struct rpcrdma_xprt *fr_xprt;
};

+struct rpcrdma_fmr {
+ struct ib_fmr *fmr;
+ u64 *physaddrs;
+};
+
struct rpcrdma_mw {
union {
- struct ib_fmr *fmr;
+ struct rpcrdma_fmr fmr;
struct rpcrdma_frmr frmr;
} r;
void (*mw_sendcompletion)(struct ib_wc *);


2015-05-04 17:58:59

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v1 14/14] xprtrmda: Reduce per-transport MR allocation

Reduce resource consumption per-transport to make way for increasing
the credit limit and maximum r/wsize. Pre-allocate fewer MRs.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 6 ++++--
net/sunrpc/xprtrdma/frwr_ops.c | 6 ++++--
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 4a53ad5..f1e8daf 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -69,8 +69,10 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);

- i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
- dprintk("RPC: %s: initializing %d FMRs\n", __func__, i);
+ i = max_t(int, RPCRDMA_MAX_DATA_SEGS / RPCRDMA_MAX_FMR_SGES, 1);
+ i += 2; /* head + tail */
+ i *= buf->rb_max_requests; /* one set for each RPC slot */
+ dprintk("RPC: %s: initializing %d FMRs\n", __func__, i);

rc = -ENOMEM;
while (i--) {
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index edc10ba..fc2d0c6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -270,8 +270,10 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);

- i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
- dprintk("RPC: %s: initializing %d FRMRs\n", __func__, i);
+ i = max_t(int, RPCRDMA_MAX_DATA_SEGS / depth, 1);
+ i += 2; /* head + tail */
+ i *= buf->rb_max_requests; /* one set for each RPC slot */
+ dprintk("RPC: %s: initializing %d FRMRs\n", __func__, i);

while (i--) {
struct rpcrdma_mw *r;
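
To put rough numbers on the new sizing, here is the arithmetic with
illustrative values; RPCRDMA_MAX_SEGS = 66, RPCRDMA_MAX_DATA_SEGS = 64,
RPCRDMA_MAX_FMR_SGES = 64 and rb_max_requests = 32 are assumptions for the
example, not values quoted from the headers:

/* Old pre-allocation, per transport:
 *	(rb_max_requests + 1) * RPCRDMA_MAX_SEGS
 *	= (32 + 1) * 66                       = 2178 MWs
 *
 * New pre-allocation, per transport:
 *	(max(RPCRDMA_MAX_DATA_SEGS / RPCRDMA_MAX_FMR_SGES, 1) + 2)
 *		* rb_max_requests
 *	= (max(64 / 64, 1) + 2) * 32 = 3 * 32 =   96 MWs
 *
 * Combined with the 512-byte physaddrs array added in 13/14, that is
 * roughly 2178 * 512 bytes (about 1.1 MB) of pre-allocated physaddr
 * arrays per FMR transport before this patch, versus 96 * 512 bytes
 * (48 KB) after it.
 */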


2015-05-05 13:49:57

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v1 01/14] xprtrdma: Transport fault injection

Hi Chuck,

Neat idea! Are servers able to handle client recovery without getting too confused?

Anna

On 05/04/2015 01:56 PM, Chuck Lever wrote:
> It has been exceptionally useful to exercise the logic that handles
> local immediate errors and RDMA connection loss. To enable
> developers to test this regularly and repeatably, add logic to
> simulate connection loss every so often.
>
> Fault injection is disabled by default. It is enabled with
>
> $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_transport_fault
>
> where "xxx" is a large positive number of transport method calls
> before a disconnect. A value of several thousand is usually a good
> number that allows reasonable forward progress while still causing a
> lot of connection drops.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/Kconfig | 12 ++++++++++++
> net/sunrpc/xprtrdma/transport.c | 34 ++++++++++++++++++++++++++++++++++
> net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
> 3 files changed, 47 insertions(+)
>
> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
> index 9068e72..329f82c 100644
> --- a/net/sunrpc/Kconfig
> +++ b/net/sunrpc/Kconfig
> @@ -61,6 +61,18 @@ config SUNRPC_XPRT_RDMA_CLIENT
>
> If unsure, say N.
>
> +config SUNRPC_XPRT_RDMA_FAULT_INJECTION
> + bool "RPC over RDMA client fault injection"
> + depends on SUNRPC_XPRT_RDMA_CLIENT
> + default N
> + help
> + This option enables fault injection in the xprtrdma module.
> + Fault injection is disabled by default. It is enabled with:
> +
> + $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_fault
> +
> + If unsure, say N.
> +
> config SUNRPC_XPRT_RDMA_SERVER
> tristate "RPC over RDMA Server Support"
> depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 54f23b1..fdcb2c7 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -74,6 +74,7 @@ static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
> static unsigned int xprt_rdma_inline_write_padding;
> static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
> int xprt_rdma_pad_optimize = 1;
> +static unsigned int xprt_rdma_inject_transport_fault;
>
> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>
> @@ -135,6 +136,13 @@ static struct ctl_table xr_tunables_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> + {
> + .procname = "rdma_inject_transport_fault",
> + .data = &xprt_rdma_inject_transport_fault,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> { },
> };
>
> @@ -246,6 +254,27 @@ xprt_rdma_connect_worker(struct work_struct *work)
> xprt_clear_connecting(xprt);
> }
>
> +#if defined CONFIG_SUNRPC_XPRT_RDMA_FAULT_INJECTION
> +static void
> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
> +{
> + if (!xprt_rdma_inject_transport_fault)
> + return;
> +
> + if (atomic_dec_return(&r_xprt->rx_inject_count) == 0) {
> + atomic_set(&r_xprt->rx_inject_count,
> + xprt_rdma_inject_transport_fault);
> + pr_info("rpcrdma: injecting transport disconnect\n");
> + (void)rdma_disconnect(r_xprt->rx_ia.ri_id);
> + }
> +}
> +#else
> +static void
> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
> +{
> +}
> +#endif
> +
> /*
> * xprt_rdma_destroy
> *
> @@ -405,6 +434,8 @@ xprt_setup_rdma(struct xprt_create *args)
> INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
> xprt_rdma_connect_worker);
>
> + atomic_set(&new_xprt->rx_inject_count,
> + xprt_rdma_inject_transport_fault);
> xprt_rdma_format_addresses(xprt);
> xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
> if (xprt->max_payload == 0)
> @@ -515,6 +546,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
> out:
> dprintk("RPC: %s: size %zd, request 0x%p\n", __func__, size, req);
> req->rl_connect_cookie = 0; /* our reserved value */
> + xprt_rdma_inject_disconnect(r_xprt);
> return req->rl_sendbuf->rg_base;
>
> out_rdmabuf:
> @@ -589,6 +621,7 @@ xprt_rdma_free(void *buffer)
> }
>
> rpcrdma_buffer_put(req);
> + xprt_rdma_inject_disconnect(r_xprt);
> }
>
> /*
> @@ -634,6 +667,7 @@ xprt_rdma_send_request(struct rpc_task *task)
>
> rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
> rqst->rq_bytes_sent = 0;
> + xprt_rdma_inject_disconnect(r_xprt);
> return 0;
>
> failed_marshal:
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 78e0b8b..08aee53 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -377,6 +377,7 @@ struct rpcrdma_xprt {
> struct rpcrdma_create_data_internal rx_data;
> struct delayed_work rx_connect_worker;
> struct rpcrdma_stats rx_stats;
> + atomic_t rx_inject_count;
> };
>
> #define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2015-05-05 13:52:50

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 01/14] xprtrdma: Transport fault injection


On May 5, 2015, at 9:49 AM, Anna Schumaker <[email protected]> wrote:

> Hi Chuck,
>
> Neat idea! Are servers able to handle client recovery without getting too confused?

So far I have encountered only issues on the client side. I think this
is because the client is the active part of re-establishing transport
connections. In addition, RPC/RDMA clients have a bunch of resources
that need to be reset after a transport disconnect.

I think this idea can be translated into something that can be done
in the generic layer (ie, xprt.c) if people think that would be of
benefit for testing TCP also.


> Anna
>
> On 05/04/2015 01:56 PM, Chuck Lever wrote:
>> It has been exceptionally useful to exercise the logic that handles
>> local immediate errors and RDMA connection loss. To enable
>> developers to test this regularly and repeatably, add logic to
>> simulate connection loss every so often.
>>
>> Fault injection is disabled by default. It is enabled with
>>
>> $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_transport_fault
>>
>> where "xxx" is a large positive number of transport method calls
>> before a disconnect. A value of several thousand is usually a good
>> number that allows reasonable forward progress while still causing a
>> lot of connection drops.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/Kconfig | 12 ++++++++++++
>> net/sunrpc/xprtrdma/transport.c | 34 ++++++++++++++++++++++++++++++++++
>> net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
>> 3 files changed, 47 insertions(+)
>>
>> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
>> index 9068e72..329f82c 100644
>> --- a/net/sunrpc/Kconfig
>> +++ b/net/sunrpc/Kconfig
>> @@ -61,6 +61,18 @@ config SUNRPC_XPRT_RDMA_CLIENT
>>
>> If unsure, say N.
>>
>> +config SUNRPC_XPRT_RDMA_FAULT_INJECTION
>> + bool "RPC over RDMA client fault injection"
>> + depends on SUNRPC_XPRT_RDMA_CLIENT
>> + default N
>> + help
>> + This option enables fault injection in the xprtrdma module.
>> + Fault injection is disabled by default. It is enabled with:
>> +
>> + $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_fault
>> +
>> + If unsure, say N.
>> +
>> config SUNRPC_XPRT_RDMA_SERVER
>> tristate "RPC over RDMA Server Support"
>> depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>> index 54f23b1..fdcb2c7 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -74,6 +74,7 @@ static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
>> static unsigned int xprt_rdma_inline_write_padding;
>> static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
>> int xprt_rdma_pad_optimize = 1;
>> +static unsigned int xprt_rdma_inject_transport_fault;
>>
>> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>>
>> @@ -135,6 +136,13 @@ static struct ctl_table xr_tunables_table[] = {
>> .mode = 0644,
>> .proc_handler = proc_dointvec,
>> },
>> + {
>> + .procname = "rdma_inject_transport_fault",
>> + .data = &xprt_rdma_inject_transport_fault,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec,
>> + },
>> { },
>> };
>>
>> @@ -246,6 +254,27 @@ xprt_rdma_connect_worker(struct work_struct *work)
>> xprt_clear_connecting(xprt);
>> }
>>
>> +#if defined CONFIG_SUNRPC_XPRT_RDMA_FAULT_INJECTION
>> +static void
>> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
>> +{
>> + if (!xprt_rdma_inject_transport_fault)
>> + return;
>> +
>> + if (atomic_dec_return(&r_xprt->rx_inject_count) == 0) {
>> + atomic_set(&r_xprt->rx_inject_count,
>> + xprt_rdma_inject_transport_fault);
>> + pr_info("rpcrdma: injecting transport disconnect\n");
>> + (void)rdma_disconnect(r_xprt->rx_ia.ri_id);
>> + }
>> +}
>> +#else
>> +static void
>> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
>> +{
>> +}
>> +#endif
>> +
>> /*
>> * xprt_rdma_destroy
>> *
>> @@ -405,6 +434,8 @@ xprt_setup_rdma(struct xprt_create *args)
>> INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
>> xprt_rdma_connect_worker);
>>
>> + atomic_set(&new_xprt->rx_inject_count,
>> + xprt_rdma_inject_transport_fault);
>> xprt_rdma_format_addresses(xprt);
>> xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
>> if (xprt->max_payload == 0)
>> @@ -515,6 +546,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
>> out:
>> dprintk("RPC: %s: size %zd, request 0x%p\n", __func__, size, req);
>> req->rl_connect_cookie = 0; /* our reserved value */
>> + xprt_rdma_inject_disconnect(r_xprt);
>> return req->rl_sendbuf->rg_base;
>>
>> out_rdmabuf:
>> @@ -589,6 +621,7 @@ xprt_rdma_free(void *buffer)
>> }
>>
>> rpcrdma_buffer_put(req);
>> + xprt_rdma_inject_disconnect(r_xprt);
>> }
>>
>> /*
>> @@ -634,6 +667,7 @@ xprt_rdma_send_request(struct rpc_task *task)
>>
>> rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
>> rqst->rq_bytes_sent = 0;
>> + xprt_rdma_inject_disconnect(r_xprt);
>> return 0;
>>
>> failed_marshal:
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 78e0b8b..08aee53 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -377,6 +377,7 @@ struct rpcrdma_xprt {
>> struct rpcrdma_create_data_internal rx_data;
>> struct delayed_work rx_connect_worker;
>> struct rpcrdma_stats rx_stats;
>> + atomic_t rx_inject_count;
>> };
>>
>> #define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
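
For illustration, a generic-layer version of this hook (as floated at the
top of this message) might look roughly like the sketch below. The method
name and call sites are assumptions, not anything posted in this series:

/* Hypothetical addition to struct rpc_xprt_ops in
 * include/linux/sunrpc/xprt.h:
 */
	void		(*inject_disconnect)(struct rpc_xprt *xprt);

/* Generic paths (for example xprt_release()) could then invoke it for
 * any transport that opts in, TCP included:
 */
static inline void
xprt_inject_disconnect(struct rpc_xprt *xprt)
{
	if (xprt->ops->inject_disconnect)
		xprt->ops->inject_disconnect(xprt);
}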




2015-05-05 15:16:32

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v1 01/14] xprtrdma: Transport fault injection

On 05/05/2015 11:15 AM, Chuck Lever wrote:
>
> On May 5, 2015, at 10:44 AM, Anna Schumaker <[email protected]> wrote:
>
>> On 05/05/2015 09:53 AM, Chuck Lever wrote:
>>>
>>> On May 5, 2015, at 9:49 AM, Anna Schumaker <[email protected]> wrote:
>>>
>>>> Hi Chuck,
>>>>
>>>> Neat idea! Are servers able to handle client recovery without getting too confused?
>>>
>>> So far I have encountered only issues on the client side. I think this
>>> is because the client is the active part of re-establishing transport
>>> connections. In addition, RPC/RDMA clients have a bunch of resources
>>> that need to be reset after a transport disconnect.
>>>
>>> I think this idea can be translated into something that can be done
>>> in the generic layer (ie, xprt.c) if people think that would be of
>>> benefit for testing TCP also.
>>
>> It might, and now is the time to discuss it before we're stuck maintaining multiple interfaces to the same thing.
>>
>> Another thought: can you move this under debugfs instead of proc? That's where the other kernel fault injection controls are, and it might give us a little more flexibility if we need to change the interface later.
>
> Something like /sys/kernel/debug/sunrpc/inject_transport_fault ?

That looks good to me! :)

Anna
>
>>
>> Anna
>>>
>>>
>>>> Anna
>>>>
>>>> On 05/04/2015 01:56 PM, Chuck Lever wrote:
>>>>> It has been exceptionally useful to exercise the logic that handles
>>>>> local immediate errors and RDMA connection loss. To enable
>>>>> developers to test this regularly and repeatably, add logic to
>>>>> simulate connection loss every so often.
>>>>>
>>>>> Fault injection is disabled by default. It is enabled with
>>>>>
>>>>> $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_transport_fault
>>>>>
>>>>> where "xxx" is a large positive number of transport method calls
>>>>> before a disconnect. A value of several thousand is usually a good
>>>>> number that allows reasonable forward progress while still causing a
>>>>> lot of connection drops.
>>>>>
>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>> ---
>>>>> net/sunrpc/Kconfig | 12 ++++++++++++
>>>>> net/sunrpc/xprtrdma/transport.c | 34 ++++++++++++++++++++++++++++++++++
>>>>> net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
>>>>> 3 files changed, 47 insertions(+)
>>>>>
>>>>> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
>>>>> index 9068e72..329f82c 100644
>>>>> --- a/net/sunrpc/Kconfig
>>>>> +++ b/net/sunrpc/Kconfig
>>>>> @@ -61,6 +61,18 @@ config SUNRPC_XPRT_RDMA_CLIENT
>>>>>
>>>>> If unsure, say N.
>>>>>
>>>>> +config SUNRPC_XPRT_RDMA_FAULT_INJECTION
>>>>> + bool "RPC over RDMA client fault injection"
>>>>> + depends on SUNRPC_XPRT_RDMA_CLIENT
>>>>> + default N
>>>>> + help
>>>>> + This option enables fault injection in the xprtrdma module.
>>>>> + Fault injection is disabled by default. It is enabled with:
>>>>> +
>>>>> + $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_fault
>>>>> +
>>>>> + If unsure, say N.
>>>>> +
>>>>> config SUNRPC_XPRT_RDMA_SERVER
>>>>> tristate "RPC over RDMA Server Support"
>>>>> depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
>>>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>>>> index 54f23b1..fdcb2c7 100644
>>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>>> @@ -74,6 +74,7 @@ static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
>>>>> static unsigned int xprt_rdma_inline_write_padding;
>>>>> static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
>>>>> int xprt_rdma_pad_optimize = 1;
>>>>> +static unsigned int xprt_rdma_inject_transport_fault;
>>>>>
>>>>> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>>>>>
>>>>> @@ -135,6 +136,13 @@ static struct ctl_table xr_tunables_table[] = {
>>>>> .mode = 0644,
>>>>> .proc_handler = proc_dointvec,
>>>>> },
>>>>> + {
>>>>> + .procname = "rdma_inject_transport_fault",
>>>>> + .data = &xprt_rdma_inject_transport_fault,
>>>>> + .maxlen = sizeof(unsigned int),
>>>>> + .mode = 0644,
>>>>> + .proc_handler = proc_dointvec,
>>>>> + },
>>>>> { },
>>>>> };
>>>>>
>>>>> @@ -246,6 +254,27 @@ xprt_rdma_connect_worker(struct work_struct *work)
>>>>> xprt_clear_connecting(xprt);
>>>>> }
>>>>>
>>>>> +#if defined CONFIG_SUNRPC_XPRT_RDMA_FAULT_INJECTION
>>>>> +static void
>>>>> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
>>>>> +{
>>>>> + if (!xprt_rdma_inject_transport_fault)
>>>>> + return;
>>>>> +
>>>>> + if (atomic_dec_return(&r_xprt->rx_inject_count) == 0) {
>>>>> + atomic_set(&r_xprt->rx_inject_count,
>>>>> + xprt_rdma_inject_transport_fault);
>>>>> + pr_info("rpcrdma: injecting transport disconnect\n");
>>>>> + (void)rdma_disconnect(r_xprt->rx_ia.ri_id);
>>>>> + }
>>>>> +}
>>>>> +#else
>>>>> +static void
>>>>> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
>>>>> +{
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>> /*
>>>>> * xprt_rdma_destroy
>>>>> *
>>>>> @@ -405,6 +434,8 @@ xprt_setup_rdma(struct xprt_create *args)
>>>>> INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
>>>>> xprt_rdma_connect_worker);
>>>>>
>>>>> + atomic_set(&new_xprt->rx_inject_count,
>>>>> + xprt_rdma_inject_transport_fault);
>>>>> xprt_rdma_format_addresses(xprt);
>>>>> xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
>>>>> if (xprt->max_payload == 0)
>>>>> @@ -515,6 +546,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
>>>>> out:
>>>>> dprintk("RPC: %s: size %zd, request 0x%p\n", __func__, size, req);
>>>>> req->rl_connect_cookie = 0; /* our reserved value */
>>>>> + xprt_rdma_inject_disconnect(r_xprt);
>>>>> return req->rl_sendbuf->rg_base;
>>>>>
>>>>> out_rdmabuf:
>>>>> @@ -589,6 +621,7 @@ xprt_rdma_free(void *buffer)
>>>>> }
>>>>>
>>>>> rpcrdma_buffer_put(req);
>>>>> + xprt_rdma_inject_disconnect(r_xprt);
>>>>> }
>>>>>
>>>>> /*
>>>>> @@ -634,6 +667,7 @@ xprt_rdma_send_request(struct rpc_task *task)
>>>>>
>>>>> rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
>>>>> rqst->rq_bytes_sent = 0;
>>>>> + xprt_rdma_inject_disconnect(r_xprt);
>>>>> return 0;
>>>>>
>>>>> failed_marshal:
>>>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>>> index 78e0b8b..08aee53 100644
>>>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>>> @@ -377,6 +377,7 @@ struct rpcrdma_xprt {
>>>>> struct rpcrdma_create_data_internal rx_data;
>>>>> struct delayed_work rx_connect_worker;
>>>>> struct rpcrdma_stats rx_stats;
>>>>> + atomic_t rx_inject_count;
>>>>> };
>>>>>
>>>>> #define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>> the body of a message to [email protected]
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>> --
>>> Chuck Lever
>>> chuck[dot]lever[at]oracle[dot]com
>>>
>>>
>>>
>>
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
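
For reference, a minimal sketch of what the debugfs knob agreed on above
might look like. The "sunrpc" directory and file name follow the path
proposed in this thread; the function itself is illustrative and error
handling is omitted:

#include <linux/debugfs.h>

static struct dentry *rpcrdma_debugfs_dir;

static void rpcrdma_debugfs_init(void)
{
	rpcrdma_debugfs_dir = debugfs_create_dir("sunrpc", NULL);

	/* Writing a count N to this file arms the injector: every Nth
	 * transport method call forces a disconnect.
	 */
	debugfs_create_u32("inject_transport_fault", 0644,
			   rpcrdma_debugfs_dir,
			   &xprt_rdma_inject_transport_fault);
}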


2015-05-05 15:15:06

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 01/14] xprtrdma: Transport fault injection


On May 5, 2015, at 10:44 AM, Anna Schumaker <[email protected]> wrote:

> On 05/05/2015 09:53 AM, Chuck Lever wrote:
>>
>> On May 5, 2015, at 9:49 AM, Anna Schumaker <[email protected]> wrote:
>>
>>> Hi Chuck,
>>>
>>> Neat idea! Are servers able to handle client recovery without getting too confused?
>>
>> So far I have encountered only issues on the client side. I think this
>> is because the client is the active part of re-establishing transport
>> connections. In addition, RPC/RDMA clients have a bunch of resources
>> that need to be reset after a transport disconnect.
>>
>> I think this idea can be translated into something that can be done
>> in the generic layer (ie, xprt.c) if people think that would be of
>> benefit for testing TCP also.
>
> It might, and now is the time to discuss it before we're stuck maintaining multiple interfaces to the same thing.
>
> Another thought: can you move this under debugfs instead of proc? That's where the other kernel fault injection controls are, and it might give us a little more flexibility if we need to change the interface later.

Something like /sys/kernel/debug/sunrpc/inject_transport_fault ?

>
> Anna
>>
>>
>>> Anna
>>>
>>> On 05/04/2015 01:56 PM, Chuck Lever wrote:
>>>> It has been exceptionally useful to exercise the logic that handles
>>>> local immediate errors and RDMA connection loss. To enable
>>>> developers to test this regularly and repeatably, add logic to
>>>> simulate connection loss every so often.
>>>>
>>>> Fault injection is disabled by default. It is enabled with
>>>>
>>>> $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_transport_fault
>>>>
>>>> where "xxx" is a large positive number of transport method calls
>>>> before a disconnect. A value of several thousand is usually a good
>>>> number that allows reasonable forward progress while still causing a
>>>> lot of connection drops.
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>> net/sunrpc/Kconfig | 12 ++++++++++++
>>>> net/sunrpc/xprtrdma/transport.c | 34 ++++++++++++++++++++++++++++++++++
>>>> net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
>>>> 3 files changed, 47 insertions(+)
>>>>
>>>> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
>>>> index 9068e72..329f82c 100644
>>>> --- a/net/sunrpc/Kconfig
>>>> +++ b/net/sunrpc/Kconfig
>>>> @@ -61,6 +61,18 @@ config SUNRPC_XPRT_RDMA_CLIENT
>>>>
>>>> If unsure, say N.
>>>>
>>>> +config SUNRPC_XPRT_RDMA_FAULT_INJECTION
>>>> + bool "RPC over RDMA client fault injection"
>>>> + depends on SUNRPC_XPRT_RDMA_CLIENT
>>>> + default N
>>>> + help
>>>> + This option enables fault injection in the xprtrdma module.
>>>> + Fault injection is disabled by default. It is enabled with:
>>>> +
>>>> + $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_fault
>>>> +
>>>> + If unsure, say N.
>>>> +
>>>> config SUNRPC_XPRT_RDMA_SERVER
>>>> tristate "RPC over RDMA Server Support"
>>>> depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
>>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>>> index 54f23b1..fdcb2c7 100644
>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>> @@ -74,6 +74,7 @@ static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
>>>> static unsigned int xprt_rdma_inline_write_padding;
>>>> static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
>>>> int xprt_rdma_pad_optimize = 1;
>>>> +static unsigned int xprt_rdma_inject_transport_fault;
>>>>
>>>> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>>>>
>>>> @@ -135,6 +136,13 @@ static struct ctl_table xr_tunables_table[] = {
>>>> .mode = 0644,
>>>> .proc_handler = proc_dointvec,
>>>> },
>>>> + {
>>>> + .procname = "rdma_inject_transport_fault",
>>>> + .data = &xprt_rdma_inject_transport_fault,
>>>> + .maxlen = sizeof(unsigned int),
>>>> + .mode = 0644,
>>>> + .proc_handler = proc_dointvec,
>>>> + },
>>>> { },
>>>> };
>>>>
>>>> @@ -246,6 +254,27 @@ xprt_rdma_connect_worker(struct work_struct *work)
>>>> xprt_clear_connecting(xprt);
>>>> }
>>>>
>>>> +#if defined CONFIG_SUNRPC_XPRT_RDMA_FAULT_INJECTION
>>>> +static void
>>>> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
>>>> +{
>>>> + if (!xprt_rdma_inject_transport_fault)
>>>> + return;
>>>> +
>>>> + if (atomic_dec_return(&r_xprt->rx_inject_count) == 0) {
>>>> + atomic_set(&r_xprt->rx_inject_count,
>>>> + xprt_rdma_inject_transport_fault);
>>>> + pr_info("rpcrdma: injecting transport disconnect\n");
>>>> + (void)rdma_disconnect(r_xprt->rx_ia.ri_id);
>>>> + }
>>>> +}
>>>> +#else
>>>> +static void
>>>> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
>>>> +{
>>>> +}
>>>> +#endif
>>>> +
>>>> /*
>>>> * xprt_rdma_destroy
>>>> *
>>>> @@ -405,6 +434,8 @@ xprt_setup_rdma(struct xprt_create *args)
>>>> INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
>>>> xprt_rdma_connect_worker);
>>>>
>>>> + atomic_set(&new_xprt->rx_inject_count,
>>>> + xprt_rdma_inject_transport_fault);
>>>> xprt_rdma_format_addresses(xprt);
>>>> xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
>>>> if (xprt->max_payload == 0)
>>>> @@ -515,6 +546,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
>>>> out:
>>>> dprintk("RPC: %s: size %zd, request 0x%p\n", __func__, size, req);
>>>> req->rl_connect_cookie = 0; /* our reserved value */
>>>> + xprt_rdma_inject_disconnect(r_xprt);
>>>> return req->rl_sendbuf->rg_base;
>>>>
>>>> out_rdmabuf:
>>>> @@ -589,6 +621,7 @@ xprt_rdma_free(void *buffer)
>>>> }
>>>>
>>>> rpcrdma_buffer_put(req);
>>>> + xprt_rdma_inject_disconnect(r_xprt);
>>>> }
>>>>
>>>> /*
>>>> @@ -634,6 +667,7 @@ xprt_rdma_send_request(struct rpc_task *task)
>>>>
>>>> rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
>>>> rqst->rq_bytes_sent = 0;
>>>> + xprt_rdma_inject_disconnect(r_xprt);
>>>> return 0;
>>>>
>>>> failed_marshal:
>>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> index 78e0b8b..08aee53 100644
>>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> @@ -377,6 +377,7 @@ struct rpcrdma_xprt {
>>>> struct rpcrdma_create_data_internal rx_data;
>>>> struct delayed_work rx_connect_worker;
>>>> struct rpcrdma_stats rx_stats;
>>>> + atomic_t rx_inject_count;
>>>> };
>>>>
>>>> #define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2015-05-05 15:10:01

by Steve Wise

[permalink] [raw]
Subject: Re: [PATCH v1 01/14] xprtrdma: Transport fault injection


On 5/4/2015 12:56 PM, Chuck Lever wrote:
> It has been exceptionally useful to exercise the logic that handles
> local immediate errors and RDMA connection loss. To enable
> developers to test this regularly and repeatably, add logic to
> simulate connection loss every so often.
>
> Fault injection is disabled by default. It is enabled with
>
> $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_transport_fault

Should this really be in debugfs?

> where "xxx" is a large positive number of transport method calls
> before a disconnect. A value of several thousand is usually a good
> number that allows reasonable forward progress while still causing a
> lot of connection drops.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/Kconfig | 12 ++++++++++++
> net/sunrpc/xprtrdma/transport.c | 34 ++++++++++++++++++++++++++++++++++
> net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
> 3 files changed, 47 insertions(+)
>
> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
> index 9068e72..329f82c 100644
> --- a/net/sunrpc/Kconfig
> +++ b/net/sunrpc/Kconfig
> @@ -61,6 +61,18 @@ config SUNRPC_XPRT_RDMA_CLIENT
>
> If unsure, say N.
>
> +config SUNRPC_XPRT_RDMA_FAULT_INJECTION
> + bool "RPC over RDMA client fault injection"
> + depends on SUNRPC_XPRT_RDMA_CLIENT
> + default N
> + help
> + This option enables fault injection in the xprtrdma module.
> + Fault injection is disabled by default. It is enabled with:
> +
> + $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_fault
> +
> + If unsure, say N.
> +
> config SUNRPC_XPRT_RDMA_SERVER
> tristate "RPC over RDMA Server Support"
> depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 54f23b1..fdcb2c7 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -74,6 +74,7 @@ static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
> static unsigned int xprt_rdma_inline_write_padding;
> static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
> int xprt_rdma_pad_optimize = 1;
> +static unsigned int xprt_rdma_inject_transport_fault;
>
> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>
> @@ -135,6 +136,13 @@ static struct ctl_table xr_tunables_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> + {
> + .procname = "rdma_inject_transport_fault",
> + .data = &xprt_rdma_inject_transport_fault,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> { },
> };
>
> @@ -246,6 +254,27 @@ xprt_rdma_connect_worker(struct work_struct *work)
> xprt_clear_connecting(xprt);
> }
>
> +#if defined CONFIG_SUNRPC_XPRT_RDMA_FAULT_INJECTION
> +static void
> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
> +{
> + if (!xprt_rdma_inject_transport_fault)
> + return;
> +
> + if (atomic_dec_return(&r_xprt->rx_inject_count) == 0) {
> + atomic_set(&r_xprt->rx_inject_count,
> + xprt_rdma_inject_transport_fault);
> + pr_info("rpcrdma: injecting transport disconnect\n");
> + (void)rdma_disconnect(r_xprt->rx_ia.ri_id);
> + }
> +}
> +#else
> +static void
> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
> +{
> +}
> +#endif
> +
> /*
> * xprt_rdma_destroy
> *
> @@ -405,6 +434,8 @@ xprt_setup_rdma(struct xprt_create *args)
> INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
> xprt_rdma_connect_worker);
>
> + atomic_set(&new_xprt->rx_inject_count,
> + xprt_rdma_inject_transport_fault);
> xprt_rdma_format_addresses(xprt);
> xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
> if (xprt->max_payload == 0)
> @@ -515,6 +546,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
> out:
> dprintk("RPC: %s: size %zd, request 0x%p\n", __func__, size, req);
> req->rl_connect_cookie = 0; /* our reserved value */
> + xprt_rdma_inject_disconnect(r_xprt);
> return req->rl_sendbuf->rg_base;
>
> out_rdmabuf:
> @@ -589,6 +621,7 @@ xprt_rdma_free(void *buffer)
> }
>
> rpcrdma_buffer_put(req);
> + xprt_rdma_inject_disconnect(r_xprt);
> }
>
> /*
> @@ -634,6 +667,7 @@ xprt_rdma_send_request(struct rpc_task *task)
>
> rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
> rqst->rq_bytes_sent = 0;
> + xprt_rdma_inject_disconnect(r_xprt);
> return 0;
>
> failed_marshal:
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 78e0b8b..08aee53 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -377,6 +377,7 @@ struct rpcrdma_xprt {
> struct rpcrdma_create_data_internal rx_data;
> struct delayed_work rx_connect_worker;
> struct rpcrdma_stats rx_stats;
> + atomic_t rx_inject_count;
> };
>
> #define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2015-05-05 14:44:21

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v1 01/14] xprtrdma: Transport fault injection

On 05/05/2015 09:53 AM, Chuck Lever wrote:
>
> On May 5, 2015, at 9:49 AM, Anna Schumaker <[email protected]> wrote:
>
>> Hi Chuck,
>>
>> Neat idea! Are servers able to handle client recovery without getting too confused?
>
> So far I have encountered only issues on the client side. I think this
> is because the client is the active part of re-establishing transport
> connections. In addition, RPC/RDMA clients have a bunch of resources
> that need to be reset after a transport disconnect.
>
> I think this idea can be translated into something that can be done
> in the generic layer (ie, xprt.c) if people think that would be of
> benefit for testing TCP also.

It might, and now is the time to discuss it before we're stuck maintaining multiple interfaces to the same thing.

Another thought: can you move this under debugfs instead of proc? That's where the other kernel fault injection controls are, and it might give us a little more flexibility if we need to change the interface later.

Anna
>
>
>> Anna
>>
>> On 05/04/2015 01:56 PM, Chuck Lever wrote:
>>> It has been exceptionally useful to exercise the logic that handles
>>> local immediate errors and RDMA connection loss. To enable
>>> developers to test this regularly and repeatably, add logic to
>>> simulate connection loss every so often.
>>>
>>> Fault injection is disabled by default. It is enabled with
>>>
>>> $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_transport_fault
>>>
>>> where "xxx" is a large positive number of transport method calls
>>> before a disconnect. A value of several thousand is usually a good
>>> number that allows reasonable forward progress while still causing a
>>> lot of connection drops.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> net/sunrpc/Kconfig | 12 ++++++++++++
>>> net/sunrpc/xprtrdma/transport.c | 34 ++++++++++++++++++++++++++++++++++
>>> net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
>>> 3 files changed, 47 insertions(+)
>>>
>>> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
>>> index 9068e72..329f82c 100644
>>> --- a/net/sunrpc/Kconfig
>>> +++ b/net/sunrpc/Kconfig
>>> @@ -61,6 +61,18 @@ config SUNRPC_XPRT_RDMA_CLIENT
>>>
>>> If unsure, say N.
>>>
>>> +config SUNRPC_XPRT_RDMA_FAULT_INJECTION
>>> + bool "RPC over RDMA client fault injection"
>>> + depends on SUNRPC_XPRT_RDMA_CLIENT
>>> + default N
>>> + help
>>> + This option enables fault injection in the xprtrdma module.
>>> + Fault injection is disabled by default. It is enabled with:
>>> +
>>> + $ sudo echo xxx > /proc/sys/sunrpc/rdma_inject_fault
>>> +
>>> + If unsure, say N.
>>> +
>>> config SUNRPC_XPRT_RDMA_SERVER
>>> tristate "RPC over RDMA Server Support"
>>> depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>> index 54f23b1..fdcb2c7 100644
>>> --- a/net/sunrpc/xprtrdma/transport.c
>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>> @@ -74,6 +74,7 @@ static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
>>> static unsigned int xprt_rdma_inline_write_padding;
>>> static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
>>> int xprt_rdma_pad_optimize = 1;
>>> +static unsigned int xprt_rdma_inject_transport_fault;
>>>
>>> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>>>
>>> @@ -135,6 +136,13 @@ static struct ctl_table xr_tunables_table[] = {
>>> .mode = 0644,
>>> .proc_handler = proc_dointvec,
>>> },
>>> + {
>>> + .procname = "rdma_inject_transport_fault",
>>> + .data = &xprt_rdma_inject_transport_fault,
>>> + .maxlen = sizeof(unsigned int),
>>> + .mode = 0644,
>>> + .proc_handler = proc_dointvec,
>>> + },
>>> { },
>>> };
>>>
>>> @@ -246,6 +254,27 @@ xprt_rdma_connect_worker(struct work_struct *work)
>>> xprt_clear_connecting(xprt);
>>> }
>>>
>>> +#if defined CONFIG_SUNRPC_XPRT_RDMA_FAULT_INJECTION
>>> +static void
>>> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
>>> +{
>>> + if (!xprt_rdma_inject_transport_fault)
>>> + return;
>>> +
>>> + if (atomic_dec_return(&r_xprt->rx_inject_count) == 0) {
>>> + atomic_set(&r_xprt->rx_inject_count,
>>> + xprt_rdma_inject_transport_fault);
>>> + pr_info("rpcrdma: injecting transport disconnect\n");
>>> + (void)rdma_disconnect(r_xprt->rx_ia.ri_id);
>>> + }
>>> +}
>>> +#else
>>> +static void
>>> +xprt_rdma_inject_disconnect(struct rpcrdma_xprt *r_xprt)
>>> +{
>>> +}
>>> +#endif
>>> +
>>> /*
>>> * xprt_rdma_destroy
>>> *
>>> @@ -405,6 +434,8 @@ xprt_setup_rdma(struct xprt_create *args)
>>> INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
>>> xprt_rdma_connect_worker);
>>>
>>> + atomic_set(&new_xprt->rx_inject_count,
>>> + xprt_rdma_inject_transport_fault);
>>> xprt_rdma_format_addresses(xprt);
>>> xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
>>> if (xprt->max_payload == 0)
>>> @@ -515,6 +546,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
>>> out:
>>> dprintk("RPC: %s: size %zd, request 0x%p\n", __func__, size, req);
>>> req->rl_connect_cookie = 0; /* our reserved value */
>>> + xprt_rdma_inject_disconnect(r_xprt);
>>> return req->rl_sendbuf->rg_base;
>>>
>>> out_rdmabuf:
>>> @@ -589,6 +621,7 @@ xprt_rdma_free(void *buffer)
>>> }
>>>
>>> rpcrdma_buffer_put(req);
>>> + xprt_rdma_inject_disconnect(r_xprt);
>>> }
>>>
>>> /*
>>> @@ -634,6 +667,7 @@ xprt_rdma_send_request(struct rpc_task *task)
>>>
>>> rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
>>> rqst->rq_bytes_sent = 0;
>>> + xprt_rdma_inject_disconnect(r_xprt);
>>> return 0;
>>>
>>> failed_marshal:
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index 78e0b8b..08aee53 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -377,6 +377,7 @@ struct rpcrdma_xprt {
>>> struct rpcrdma_create_data_internal rx_data;
>>> struct delayed_work rx_connect_worker;
>>> struct rpcrdma_stats rx_stats;
>>> + atomic_t rx_inject_count;
>>> };
>>>
>>> #define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>


2015-05-06 11:37:06

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] xprtrdma: Warn when there are orphaned IB objects

On Mon, May 4, 2015 at 11:27 PM, Chuck Lever <[email protected]> wrote:
>
> Print an error during transport destruction if ib_dealloc_pd()
> fails. This is a sign that xprtrdma orphaned one or more RDMA API
> objects at some point, which can pin lower layer kernel modules
> and cause shutdown to hang.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/verbs.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 4870d27..0cc4617 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -710,8 +710,8 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
> }
> if (ia->ri_pd != NULL && !IS_ERR(ia->ri_pd)) {
> rc = ib_dealloc_pd(ia->ri_pd);
> - dprintk("RPC: %s: ib_dealloc_pd returned %i\n",
> - __func__, rc);

Should we check for EBUSY explicitly? Anything other than that would be
an error inside the vendor-specific ib_dealloc_pd().

> + if (rc)
> + pr_warn("rpcrdma: ib_dealloc_pd status %i\n", rc);
> }
> }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html




--
-Regards
Devesh

2015-05-06 13:24:22

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] xprtrdma: Warn when there are orphaned IB objects

Hi Devesh-

On May 6, 2015, at 7:37 AM, Devesh Sharma <[email protected]> wrote:

> On Mon, May 4, 2015 at 11:27 PM, Chuck Lever <[email protected]> wrote:
>>
>> Print an error during transport destruction if ib_dealloc_pd()
>> fails. This is a sign that xprtrdma orphaned one or more RDMA API
>> objects at some point, which can pin lower layer kernel modules
>> and cause shutdown to hang.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/verbs.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 4870d27..0cc4617 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -710,8 +710,8 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
>> }
>> if (ia->ri_pd != NULL && !IS_ERR(ia->ri_pd)) {
>> rc = ib_dealloc_pd(ia->ri_pd);
>> - dprintk("RPC: %s: ib_dealloc_pd returned %i\n",
>> - __func__, rc);
>
> Should we check for EBUSY explicitly? other then this is an error in
> vendor specific ib_dealloc_pd()

Any error return means ib_dealloc_pd() has failed, right? Doesn't that
mean the PD is still allocated, and could cause problems later?


>> + if (rc)
>> + pr_warn("rpcrdma: ib_dealloc_pd status %i\n", rc);
>> }
>> }

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2015-05-06 14:05:40

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] xprtrdma: Warn when there are orphaned IB objects

On 5/6/2015 4:24 PM, Chuck Lever wrote:
> Hi Devesh-
>
> On May 6, 2015, at 7:37 AM, Devesh Sharma <[email protected]> wrote:
>
>> On Mon, May 4, 2015 at 11:27 PM, Chuck Lever <[email protected]> wrote:
>>>
>>> Print an error during transport destruction if ib_dealloc_pd()
>>> fails. This is a sign that xprtrdma orphaned one or more RDMA API
>>> objects at some point, which can pin lower layer kernel modules
>>> and cause shutdown to hang.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> net/sunrpc/xprtrdma/verbs.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 4870d27..0cc4617 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -710,8 +710,8 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
>>> }
>>> if (ia->ri_pd != NULL && !IS_ERR(ia->ri_pd)) {
>>> rc = ib_dealloc_pd(ia->ri_pd);
>>> - dprintk("RPC: %s: ib_dealloc_pd returned %i\n",
>>> - __func__, rc);
>>
>> Should we check for EBUSY explicitly? other then this is an error in
>> vendor specific ib_dealloc_pd()
>
> Any error return means ib_dealloc_pd() has failed, right? Doesn't that
> mean the PD is still allocated, and could cause problems later?

AFAICT, the only non-zero rc that ib_dealloc_pd should return is EBUSY.
So I don't see value in verifying it at all.

So, Looks Good

Reviewed-by: Sagi Grimberg <[email protected]>

2015-05-06 14:22:04

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] xprtrdma: Warn when there are orphaned IB objects

On Wed, May 6, 2015 at 6:54 PM, Chuck Lever <[email protected]> wrote:
> Hi Devesh-
>
> On May 6, 2015, at 7:37 AM, Devesh Sharma <[email protected]> wrote:
>
>> On Mon, May 4, 2015 at 11:27 PM, Chuck Lever <[email protected]> wrote:
>>>
>>> Print an error during transport destruction if ib_dealloc_pd()
>>> fails. This is a sign that xprtrdma orphaned one or more RDMA API
>>> objects at some point, which can pin lower layer kernel modules
>>> and cause shutdown to hang.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> net/sunrpc/xprtrdma/verbs.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 4870d27..0cc4617 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -710,8 +710,8 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
>>> }
>>> if (ia->ri_pd != NULL && !IS_ERR(ia->ri_pd)) {
>>> rc = ib_dealloc_pd(ia->ri_pd);
>>> - dprintk("RPC: %s: ib_dealloc_pd returned %i\n",
>>> - __func__, rc);
>>
>> Should we check for EBUSY explicitly? other then this is an error in
>> vendor specific ib_dealloc_pd()
>
> Any error return means ib_dealloc_pd() has failed, right? Doesn’t that
> mean the PD is still allocated, and could cause problems later?

Yes, you are correct; I was thinking ib_dealloc_pd() has a refcount
implemented in the core layer, so that if the PD is used by any resource
it will always fail with -EBUSY.
With the Emulex adapter it is possible for dealloc_pd to fail with ENOMEM or
EIO in cases where the device f/w is not responding, etc.; this situation
does not mean the PD is actually in use.

>
>
>>> + if (rc)
>>> + pr_warn("rpcrdma: ib_dealloc_pd status %i\n", rc);
>>> }
>>> }
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>



--
-Regards
Devesh

2015-05-06 16:48:20

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] xprtrdma: Warn when there are orphaned IB objects

On Wed, May 06, 2015 at 07:52:03PM +0530, Devesh Sharma wrote:
> >> Should we check for EBUSY explicitly? other then this is an error in
> >> vendor specific ib_dealloc_pd()
> >
> > Any error return means ib_dealloc_pd() has failed, right? Doesn’t that
> > mean the PD is still allocated, and could cause problems later?
>
> Yes, you are correct, I was thinking ib_dealloc_pd() has a refcount
> implemented in the core layer, thus if the PD is used by any resource,
> it will always fail with -EBUSY.

.. and it will not be freed, which indicates a serious bug in the
caller, so the caller should respond to the failure with a BUG_ON or
WARN_ON.

> .With emulex adapter it is possible to fail dealloc_pd with ENOMEM or
> EIO in cases where device f/w is not responding etc. this situation do
> not represent PD is actually in use.

This is a really bad idea. If the pd was freed and from the consumer's
perspective everything is sane then it should return success.

If the driver detects an internal failure, then it should move the
driver to a failed state (whatever that means, but at a minimum it
means the firmware state and driver state must be resync'd), and still
succeed the dealloc.

There is absolutely nothing the caller can do about a driver level
failure here, and it doesn't indicate a caller bug.

Returning ENOMEM for dealloc is what we'd call an insane API. You
can't have failable memory allocations in a dealloc path.

Jason
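
Concretely, the caller-side idiom being argued for here is only a small
variation on the hunk in 02/14 (which uses pr_warn()); a sketch, for
illustration:

	if (ia->ri_pd != NULL && !IS_ERR(ia->ri_pd)) {
		rc = ib_dealloc_pd(ia->ri_pd);
		/* Failure means some object still references the PD --
		 * a caller bug worth shouting about, since the PD (and
		 * the underlying module) is now pinned.
		 */
		WARN_ON(rc);
	}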

2015-05-07 07:53:16

by Devesh Sharma

[permalink] [raw]
Subject: RE: [PATCH v1 02/14] xprtrdma: Warn when there are orphaned IB objects

> -----Original Message-----
> From: Jason Gunthorpe [mailto:[email protected]]
> Sent: Wednesday, May 06, 2015 10:18 PM
> To: Devesh Sharma
> Cc: Chuck Lever; [email protected]; Linux NFS Mailing List
> Subject: Re: [PATCH v1 02/14] xprtrdma: Warn when there are orphaned IB
> objects
>
> On Wed, May 06, 2015 at 07:52:03PM +0530, Devesh Sharma wrote:
> > >> Should we check for EBUSY explicitly? other then this is an error
> > >> in vendor specific ib_dealloc_pd()
> > >
> > > Any error return means ib_dealloc_pd() has failed, right? Doesn’t
> > > that mean the PD is still allocated, and could cause problems later?
> >
> > Yes, you are correct, I was thinking ib_dealloc_pd() has a refcount
> > implemented in the core layer, thus if the PD is used by any resource,
> > it will always fail with -EBUSY.
>
> .. and it will not be freed, which indicates a serious bug in the caller,
> so the
> caller should respond to the failure with a BUG_ON or WARN_ON.

Yes, that’s what this patch is doing.

>
> > .With emulex adapter it is possible to fail dealloc_pd with ENOMEM or
> > EIO in cases where device f/w is not responding etc. this situation do
> > not represent PD is actually in use.
>
> This is a really bad idea. If the pd was freed and from the consumer's
> perspective everything is sane then it should return success.
>
> If the driver detects an internal failure, then it should move the driver
> to a
> failed state (whatever that means, but at a minimum it means the firmware
> state and driver state must be resync'd), and still succeed the dealloc.

Makes sense.

>
> There is absolutely nothing the caller can do about a driver level failure
> here,
> and it doesn't indicate a caller bug.
>
> Returning ENOMEM for dealloc is what we'd call an insane API. You can't
> have
> failable memory allocations in a dealloc path.

I will supply a fix in ocrdma.

Reviewed-by: Devesh Sharma <[email protected]>
>
> Jason

2015-05-07 09:38:09

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 03/14] xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt

On 5/4/2015 8:57 PM, Chuck Lever wrote:
> Clean up: Instead of carrying a pointer to the buffer pool and
> the rpc_xprt, carry a pointer to the controlling rpcrdma_xprt.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++--
> net/sunrpc/xprtrdma/transport.c | 7 ++-----
> net/sunrpc/xprtrdma/verbs.c | 8 +++++---
> net/sunrpc/xprtrdma/xprt_rdma.h | 3 +--
> 4 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index 2c53ea9..98a3b95 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -732,8 +732,8 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
> struct rpcrdma_msg *headerp;
> struct rpcrdma_req *req;
> struct rpc_rqst *rqst;
> - struct rpc_xprt *xprt = rep->rr_xprt;
> - struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
> + struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
> + struct rpc_xprt *xprt = &r_xprt->rx_xprt;
> __be32 *iptr;
> int rdmalen, status;
> unsigned long cwnd;
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index fdcb2c7..ed70551 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -650,12 +650,9 @@ xprt_rdma_send_request(struct rpc_task *task)
>
> if (req->rl_reply == NULL) /* e.g. reconnection */
> rpcrdma_recv_buffer_get(req);
> -
> - if (req->rl_reply) {
> + /* rpcrdma_recv_buffer_get may have set rl_reply, so check again */
> + if (req->rl_reply)
> req->rl_reply->rr_func = rpcrdma_reply_handler;
> - /* this need only be done once, but... */
> - req->rl_reply->rr_xprt = xprt;
> - }

Can't you just fold that into rpcrdma_recv_buffer_get() instead of
checking what it did?

Other than that,

Looks good,

Reviewed-by: Sagi Grimberg <[email protected]>
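
For illustration, folding that assignment into rpcrdma_recv_buffer_get()
might look roughly like this; the body of that function is not shown in
this thread, so the sketch below only marks where the assignment would go:

void
rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
{
	/* ... existing logic that attaches a rep to req->rl_reply ... */

	if (req->rl_reply)
		req->rl_reply->rr_func = rpcrdma_reply_handler;
}

/* which would let xprt_rdma_send_request() shrink to: */
	if (req->rl_reply == NULL)	/* e.g. reconnection */
		rpcrdma_recv_buffer_get(req);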

2015-05-07 10:00:40

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely

On 5/4/2015 8:57 PM, Chuck Lever wrote:
> The connect worker can replace ri_id, but prevents ri_id->device
> from changing during the lifetime of a transport instance.
>
> Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
> The cached copy can be used safely in code that does not serialize
> with the connect worker.
>
> Other code can use it to save an extra address generation (one
> pointer dereference instead of two).
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/fmr_ops.c | 8 +----
> net/sunrpc/xprtrdma/frwr_ops.c | 12 +++----
> net/sunrpc/xprtrdma/physical_ops.c | 8 +----
> net/sunrpc/xprtrdma/verbs.c | 61 +++++++++++++++++++-----------------
> net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
> 5 files changed, 43 insertions(+), 48 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index 302d4eb..0a96155 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -85,7 +85,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> int nsegs, bool writing)
> {
> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
> - struct ib_device *device = ia->ri_id->device;
> + struct ib_device *device = ia->ri_device;
> enum dma_data_direction direction = rpcrdma_data_dir(writing);
> struct rpcrdma_mr_seg *seg1 = seg;
> struct rpcrdma_mw *mw = seg1->rl_mw;
> @@ -137,17 +137,13 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
> {
> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
> struct rpcrdma_mr_seg *seg1 = seg;
> - struct ib_device *device;
> int rc, nsegs = seg->mr_nsegs;
> LIST_HEAD(l);
>
> list_add(&seg1->rl_mw->r.fmr->list, &l);
> rc = ib_unmap_fmr(&l);
> - read_lock(&ia->ri_qplock);
> - device = ia->ri_id->device;
> while (seg1->mr_nsegs--)
> - rpcrdma_unmap_one(device, seg++);
> - read_unlock(&ia->ri_qplock);
> + rpcrdma_unmap_one(ia->ri_device, seg++);

Umm, I'm wondering if this is guaranteed to be the same device as
ri_id->device?

Imagine you are working on a bond device where each slave belongs to
a different adapter. When the active port toggles, you will see a
ADDR_CHANGED event (that the current code does not handle...), what
you'd want to do is just reconnect and rdma_cm will resolve the new
address for you (via the backup slave). I suspect that in case this
flow is concurrent with the reconnects you may end up with a stale
device handle.

2015-05-07 10:15:40

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 06/14] xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()

On 5/4/2015 8:57 PM, Chuck Lever wrote:
> Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
> pool lock is expensive, and unnecessary because FMR mode can
> transfer up to a 1MB payload using just a single ib_fmr.
>
> Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
> return them to rb_mws immediately during deregistration.
>
> Transport reset is now unneeded for FMR. Each FMR is recovered
> synchronously when its RPC is retransmitted.

Is this worth a separate patch? I don't see why the two
changes are together. Is there a dependency I'm missing?

2015-05-07 10:16:05

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 05/14] xprtrdma: Introduce helpers for allocating MWs

On 5/4/2015 8:57 PM, Chuck Lever wrote:
> We eventually want to handle allocating MWs one at a time, as
> needed, instead of grabbing 64 and throwing them at each RPC in the
> pipeline.
>
> Add a helper for grabbing an MW off rb_mws, and a helper for
> returning an MW to rb_mws. These will be used in a subsequent patch.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/verbs.c | 31 +++++++++++++++++++++++++++++++
> net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
> 2 files changed, 33 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index ebcb0e2..c21329e 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1179,6 +1179,37 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
> kfree(buf->rb_pool);
> }
>
> +struct rpcrdma_mw *
> +rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
> +{
> + struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> + struct rpcrdma_mw *mw = NULL;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&buf->rb_lock, flags);
> + if (!list_empty(&buf->rb_mws)) {
> + mw = list_first_entry(&buf->rb_mws,
> + struct rpcrdma_mw, mw_list);
> + list_del_init(&mw->mw_list);
> + }
> + spin_unlock_irqrestore(&buf->rb_lock, flags);
> +
> + if (!mw)
> + pr_err("RPC: %s: no MWs available\n", __func__);
> + return mw;
> +}
> +
> +void
> +rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
> +{
> + struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&buf->rb_lock, flags);
> + list_add_tail(&mw->mw_list, &buf->rb_mws);
> + spin_unlock_irqrestore(&buf->rb_lock, flags);
> +}
> +
> /* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
> * some req segments uninitialized.
> */
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 531ad33..7de424e 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -415,6 +415,8 @@ int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_ep *,
> int rpcrdma_buffer_create(struct rpcrdma_xprt *);
> void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
>
> +struct rpcrdma_mw *rpcrdma_get_mw(struct rpcrdma_xprt *);
> +void rpcrdma_put_mw(struct rpcrdma_xprt *, struct rpcrdma_mw *);
> struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
> void rpcrdma_buffer_put(struct rpcrdma_req *);
> void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
>

Looks good,

Reviewed-by: Sagi Grimberg <[email protected]>

2015-05-07 10:30:59

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] xprtrdma: Acquire MRs in rpcrdma_register_external()

On 5/4/2015 8:57 PM, Chuck Lever wrote:
> Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
> pool lock is expensive, and unnecessary because most modern adapters
> can transfer 100s of KBs of payload using just a single MR.
>
> Instead, acquire MRs one-at-a-time as chunks are registered, and
> return them to rb_mws immediately during deregistration.
>
> Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
> registration fails") is reverted: There is now a valid case where
> registration can fail (with -ENOMEM) but the QP is still in RTS.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/frwr_ops.c | 120 ++++++++++++++++++++++++++++------------
> net/sunrpc/xprtrdma/rpc_rdma.c | 3 -
> net/sunrpc/xprtrdma/verbs.c | 21 -------
> 3 files changed, 86 insertions(+), 58 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index a06d9a3..6f93a89 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -11,6 +11,62 @@
> * but most complex memory registration mode.
> */
>
> +/* Normal operation
> + *
> + * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
> + * Work Request (frmr_op_map). When the RDMA operation is finished, this
> + * Memory Region is invalidated using a LOCAL_INV Work Request
> + * (frmr_op_unmap).
> + *
> + * Typically these Work Requests are not signaled, and neither are RDMA
> + * SEND Work Requests (with the exception of signaling occasionally to
> + * prevent provider work queue overflows). This greatly reduces HCA
> + * interrupt workload.
> + *
> + * As an optimization, frwr_op_unmap marks MRs INVALID before the
> + * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
> + * rb_mws immediately so that no work (like managing a linked list
> + * under a spinlock) is needed in the completion upcall.
> + *
> + * But this means that frwr_op_map() can occasionally encounter an MR
> + * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
> + * ordering prevents a subsequent FAST_REG WR from executing against
> + * that MR while it is still being invalidated.
> + */
> +
> +/* Transport recovery
> + *
> + * ->op_map and the transport connect worker cannot run at the same
> + * time, but ->op_unmap can fire while the transport connect worker
> + * is running. Thus MR recovery is handled in ->op_map, to guarantee
> + * that recovered MRs are owned by a sending RPC, and not one where
> + * ->op_unmap could fire at the same time transport reconnect is
> + * being done.
> + *
> + * When the underlying transport disconnects, MRs are left in one of
> + * three states:
> + *
> + * INVALID: The MR was not in use before the QP entered ERROR state.
> + * (Or, the LOCAL_INV WR has not completed or flushed yet).
> + *
> + * STALE: The MR was being registered or unregistered when the QP
> + * entered ERROR state, and the pending WR was flushed.
> + *
> + * VALID: The MR was registered before the QP entered ERROR state.
> + *
> + * When frwr_op_map encounters STALE and VALID MRs, they are recovered
> + * with ib_dereg_mr and then are re-initialized. Because MR recovery
> + * allocates fresh resources, it is deferred to a workqueue, and the
> + * recovered MRs are placed back on the rb_mws list when recovery is
> + * complete. frwr_op_map allocates another MR for the current RPC while
> + * the broken MR is reset.
> + *
> + * To ensure that frwr_op_map doesn't encounter an MR that is marked
> + * INVALID but that is about to be flushed due to a previous transport
> + * disconnect, the transport connect worker attempts to drain all
> + * pending send queue WRs before the transport is reconnected.
> + */
> +
> #include "xprt_rdma.h"
>
> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
> @@ -250,9 +306,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> struct ib_device *device = ia->ri_device;
> enum dma_data_direction direction = rpcrdma_data_dir(writing);
> struct rpcrdma_mr_seg *seg1 = seg;
> - struct rpcrdma_mw *mw = seg1->rl_mw;
> - struct rpcrdma_frmr *frmr = &mw->r.frmr;
> - struct ib_mr *mr = frmr->fr_mr;
> + struct rpcrdma_mw *mw;
> + struct rpcrdma_frmr *frmr;
> + struct ib_mr *mr;
> struct ib_send_wr fastreg_wr, *bad_wr;
> u8 key;
> int len, pageoff;
> @@ -261,12 +317,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> u64 pa;
> int page_no;
>
> + mw = seg1->rl_mw;
> + seg1->rl_mw = NULL;
> + do {
> + if (mw)
> + __frwr_queue_recovery(mw);
> + mw = rpcrdma_get_mw(r_xprt);
> + if (!mw)
> + return -ENOMEM;
> + } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
> + frmr = &mw->r.frmr;
> + frmr->fr_state = FRMR_IS_VALID;
> +
> pageoff = offset_in_page(seg1->mr_offset);
> seg1->mr_offset -= pageoff; /* start of page */
> seg1->mr_len += pageoff;
> len = -pageoff;
> if (nsegs > ia->ri_max_frmr_depth)
> nsegs = ia->ri_max_frmr_depth;
> +
> for (page_no = i = 0; i < nsegs;) {
> rpcrdma_map_one(device, seg, direction);
> pa = seg->mr_dma;
> @@ -285,8 +354,6 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> dprintk("RPC: %s: Using frmr %p to map %d segments (%d bytes)\n",
> __func__, mw, i, len);
>
> - frmr->fr_state = FRMR_IS_VALID;
> -
> memset(&fastreg_wr, 0, sizeof(fastreg_wr));
> fastreg_wr.wr_id = (unsigned long)(void *)mw;
> fastreg_wr.opcode = IB_WR_FAST_REG_MR;
> @@ -298,6 +365,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> fastreg_wr.wr.fast_reg.access_flags = writing ?
> IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
> IB_ACCESS_REMOTE_READ;
> + mr = frmr->fr_mr;
> key = (u8)(mr->rkey & 0x000000FF);
> ib_update_fast_reg_key(mr, ++key);
> fastreg_wr.wr.fast_reg.rkey = mr->rkey;
> @@ -307,6 +375,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> if (rc)
> goto out_senderr;
>
> + seg1->rl_mw = mw;
> seg1->mr_rkey = mr->rkey;
> seg1->mr_base = seg1->mr_dma + pageoff;
> seg1->mr_nsegs = i;
> @@ -315,10 +384,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>
> out_senderr:
> dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
> - ib_update_fast_reg_key(mr, --key);
> - frmr->fr_state = FRMR_IS_INVALID;
> while (i--)
> rpcrdma_unmap_one(device, --seg);
> + __frwr_queue_recovery(mw);
> return rc;
> }
>
> @@ -330,15 +398,19 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
> {
> struct rpcrdma_mr_seg *seg1 = seg;
> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
> + struct rpcrdma_mw *mw = seg1->rl_mw;
> struct ib_send_wr invalidate_wr, *bad_wr;
> int rc, nsegs = seg->mr_nsegs;
>
> - seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
> + dprintk("RPC: %s: FRMR %p\n", __func__, mw);
> +
> + seg1->rl_mw = NULL;
> + mw->r.frmr.fr_state = FRMR_IS_INVALID;
>
> memset(&invalidate_wr, 0, sizeof(invalidate_wr));
> - invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
> + invalidate_wr.wr_id = (unsigned long)(void *)mw;
> invalidate_wr.opcode = IB_WR_LOCAL_INV;
> - invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
> + invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
> DECR_CQCOUNT(&r_xprt->rx_ep);
>
> while (seg1->mr_nsegs--)
> @@ -348,12 +420,13 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
> read_unlock(&ia->ri_qplock);
> if (rc)
> goto out_err;
> +
> + rpcrdma_put_mw(r_xprt, mw);
> return nsegs;
>
> out_err:
> - /* Force rpcrdma_buffer_get() to retry */
> - seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
> dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
> + __frwr_queue_recovery(mw);
> return nsegs;
> }
>
> @@ -370,29 +443,6 @@ out_err:
> static void
> frwr_op_reset(struct rpcrdma_xprt *r_xprt)
> {
> - struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> - struct ib_device *device = r_xprt->rx_ia.ri_device;
> - unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
> - struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
> - struct rpcrdma_mw *r;
> - int rc;
> -
> - list_for_each_entry(r, &buf->rb_all, mw_all) {
> - if (r->r.frmr.fr_state == FRMR_IS_INVALID)
> - continue;
> -
> - __frwr_release(r);
> - rc = __frwr_init(r, pd, device, depth);
> - if (rc) {
> - dprintk("RPC: %s: mw %p left %s\n",
> - __func__, r,
> - (r->r.frmr.fr_state == FRMR_IS_STALE ?
> - "stale" : "valid"));
> - continue;
> - }
> -
> - r->r.frmr.fr_state = FRMR_IS_INVALID;
> - }
> }
>
> static void
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index 98a3b95..35ead0b 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -284,9 +284,6 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
> return (unsigned char *)iptr - (unsigned char *)headerp;
>
> out:
> - if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
> - return n;
> -
> for (pos = 0; nchunks--;)
> pos += r_xprt->rx_ia.ri_ops->ro_unmap(r_xprt,
> &req->rl_segments[pos]);
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 8a43c7ef..5226161 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1343,12 +1343,11 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
> struct rpcrdma_req *
> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
> {
> - struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
> - struct list_head stale;
> struct rpcrdma_req *req;
> unsigned long flags;
>
> spin_lock_irqsave(&buffers->rb_lock, flags);
> +
> if (buffers->rb_send_index == buffers->rb_max_requests) {
> spin_unlock_irqrestore(&buffers->rb_lock, flags);
> dprintk("RPC: %s: out of request buffers\n", __func__);
> @@ -1367,17 +1366,7 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
> }
> buffers->rb_send_bufs[buffers->rb_send_index++] = NULL;
>
> - INIT_LIST_HEAD(&stale);
> - switch (ia->ri_memreg_strategy) {
> - case RPCRDMA_FRMR:
> - req = rpcrdma_buffer_get_frmrs(req, buffers, &stale);
> - break;
> - default:
> - break;
> - }
> spin_unlock_irqrestore(&buffers->rb_lock, flags);
> - if (!list_empty(&stale))
> - rpcrdma_retry_flushed_linv(&stale, buffers);
> return req;
> }
>
> @@ -1389,18 +1378,10 @@ void
> rpcrdma_buffer_put(struct rpcrdma_req *req)
> {
> struct rpcrdma_buffer *buffers = req->rl_buffer;
> - struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
> unsigned long flags;
>
> spin_lock_irqsave(&buffers->rb_lock, flags);
> rpcrdma_buffer_put_sendbuf(req, buffers);
> - switch (ia->ri_memreg_strategy) {
> - case RPCRDMA_FRMR:
> - rpcrdma_buffer_put_mrs(req, buffers);
> - break;
> - default:
> - break;
> - }
> spin_unlock_irqrestore(&buffers->rb_lock, flags);
> }
>
>

Don't you need a call to flush_workqueue(frwr_recovery_wq) when you're
about to destroy the endpoint (and the buffers and the MRs...)?
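
A minimal sketch of the ordering being asked about (hypothetical; the real
teardown path may differ): drain the recovery workqueue before the MR pool
is released, so a deferred __frwr_recovery_worker cannot touch a freed
rpcrdma_mw.

    static void
    frwr_op_destroy(struct rpcrdma_buffer *buf)
    {
            /* Wait for any queued FRMR recovery work to complete first. */
            flush_workqueue(frwr_recovery_wq);

            /* ... existing release of the entries on buf->rb_all ... */
    }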

2015-05-07 10:35:29

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 09/14] xprtrdma: Remove unused LOCAL_INV recovery logic

On 5/4/2015 8:58 PM, Chuck Lever wrote:
> Clean up: Remove functions no longer used to recover broken FRMRs.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/verbs.c | 109 -------------------------------------------
> 1 file changed, 109 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 5226161..5120a8e 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1210,33 +1210,6 @@ rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
> spin_unlock_irqrestore(&buf->rb_lock, flags);
> }
>
> -/* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
> - * some req segments uninitialized.
> - */
> -static void
> -rpcrdma_buffer_put_mr(struct rpcrdma_mw **mw, struct rpcrdma_buffer *buf)
> -{
> - if (*mw) {
> - list_add_tail(&(*mw)->mw_list, &buf->rb_mws);
> - *mw = NULL;
> - }
> -}
> -
> -/* Cycle mw's back in reverse order, and "spin" them.
> - * This delays and scrambles reuse as much as possible.
> - */
> -static void
> -rpcrdma_buffer_put_mrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
> -{
> - struct rpcrdma_mr_seg *seg = req->rl_segments;
> - struct rpcrdma_mr_seg *seg1 = seg;
> - int i;
> -
> - for (i = 1, seg++; i < RPCRDMA_MAX_SEGS; seg++, i++)
> - rpcrdma_buffer_put_mr(&seg->rl_mw, buf);
> - rpcrdma_buffer_put_mr(&seg1->rl_mw, buf);
> -}
> -
> static void
> rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
> {
> @@ -1249,88 +1222,6 @@ rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
> }
> }
>
> -/* rpcrdma_unmap_one() was already done during deregistration.
> - * Redo only the ib_post_send().
> - */
> -static void
> -rpcrdma_retry_local_inv(struct rpcrdma_mw *r, struct rpcrdma_ia *ia)
> -{
> - struct rpcrdma_xprt *r_xprt =
> - container_of(ia, struct rpcrdma_xprt, rx_ia);
> - struct ib_send_wr invalidate_wr, *bad_wr;
> - int rc;
> -
> - dprintk("RPC: %s: FRMR %p is stale\n", __func__, r);
> -
> - /* When this FRMR is re-inserted into rb_mws, it is no longer stale */
> - r->r.frmr.fr_state = FRMR_IS_INVALID;
> -
> - memset(&invalidate_wr, 0, sizeof(invalidate_wr));
> - invalidate_wr.wr_id = (unsigned long)(void *)r;
> - invalidate_wr.opcode = IB_WR_LOCAL_INV;
> - invalidate_wr.ex.invalidate_rkey = r->r.frmr.fr_mr->rkey;
> - DECR_CQCOUNT(&r_xprt->rx_ep);
> -
> - dprintk("RPC: %s: frmr %p invalidating rkey %08x\n",
> - __func__, r, r->r.frmr.fr_mr->rkey);
> -
> - read_lock(&ia->ri_qplock);
> - rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
> - read_unlock(&ia->ri_qplock);
> - if (rc) {
> - /* Force rpcrdma_buffer_get() to retry */
> - r->r.frmr.fr_state = FRMR_IS_STALE;
> - dprintk("RPC: %s: ib_post_send failed, %i\n",
> - __func__, rc);
> - }
> -}
> -
> -static void
> -rpcrdma_retry_flushed_linv(struct list_head *stale,
> - struct rpcrdma_buffer *buf)
> -{
> - struct rpcrdma_ia *ia = rdmab_to_ia(buf);
> - struct list_head *pos;
> - struct rpcrdma_mw *r;
> - unsigned long flags;
> -
> - list_for_each(pos, stale) {
> - r = list_entry(pos, struct rpcrdma_mw, mw_list);
> - rpcrdma_retry_local_inv(r, ia);
> - }
> -
> - spin_lock_irqsave(&buf->rb_lock, flags);
> - list_splice_tail(stale, &buf->rb_mws);
> - spin_unlock_irqrestore(&buf->rb_lock, flags);
> -}
> -
> -static struct rpcrdma_req *
> -rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
> - struct list_head *stale)
> -{
> - struct rpcrdma_mw *r;
> - int i;
> -
> - i = RPCRDMA_MAX_SEGS - 1;
> - while (!list_empty(&buf->rb_mws)) {
> - r = list_entry(buf->rb_mws.next,
> - struct rpcrdma_mw, mw_list);
> - list_del(&r->mw_list);
> - if (r->r.frmr.fr_state == FRMR_IS_STALE) {
> - list_add(&r->mw_list, stale);
> - continue;
> - }
> - req->rl_segments[i].rl_mw = r;
> - if (unlikely(i-- == 0))
> - return req; /* Success */
> - }
> -
> - /* Not enough entries on rb_mws for this req */
> - rpcrdma_buffer_put_sendbuf(req, buf);
> - rpcrdma_buffer_put_mrs(req, buf);
> - return NULL;
> -}
> -
> /*
> * Get a set of request/reply buffers.
> *
>

Looks good,

Reviewed-by: Sagi Grimberg <[email protected]>


2015-05-07 10:36:22

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 10/14] xprtrdma: Remove ->ro_reset

On 5/4/2015 8:58 PM, Chuck Lever wrote:
> An RPC can exit at any time. When it does so, xprt_rdma_free() is
> called, and it calls ->op_unmap().
>
> If ->ro_reset() is running due to a transport disconnect, the two
> methods can race while processing the same rpcrdma_mw. The results
> are unpredictable.
>
> Because of this, in previous patches I've replaced the ->ro_reset()
> methods with a recovery workqueue. ->ro_reset() is no longer used
> and can be removed.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/fmr_ops.c | 11 -----------
> net/sunrpc/xprtrdma/frwr_ops.c | 16 ----------------
> net/sunrpc/xprtrdma/physical_ops.c | 6 ------
> net/sunrpc/xprtrdma/verbs.c | 2 --
> net/sunrpc/xprtrdma/xprt_rdma.h | 1 -
> 5 files changed, 36 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index ad0055b..5dd77da 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -197,16 +197,6 @@ out_err:
> return nsegs;
> }
>
> -/* After a disconnect, unmap all FMRs.
> - *
> - * This is invoked only in the transport connect worker in order
> - * to serialize with rpcrdma_register_fmr_external().
> - */
> -static void
> -fmr_op_reset(struct rpcrdma_xprt *r_xprt)
> -{
> -}
> -
> static void
> fmr_op_destroy(struct rpcrdma_buffer *buf)
> {
> @@ -230,7 +220,6 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
> .ro_open = fmr_op_open,
> .ro_maxpages = fmr_op_maxpages,
> .ro_init = fmr_op_init,
> - .ro_reset = fmr_op_reset,
> .ro_destroy = fmr_op_destroy,
> .ro_displayname = "fmr",
> };
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 6f93a89..3fb609a 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -430,21 +430,6 @@ out_err:
> return nsegs;
> }
>
> -/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
> - * an unusable state. Find FRMRs in this state and dereg / reg
> - * each. FRMRs that are VALID and attached to an rpcrdma_req are
> - * also torn down.
> - *
> - * This gives all in-use FRMRs a fresh rkey and leaves them INVALID.
> - *
> - * This is invoked only in the transport connect worker in order
> - * to serialize with rpcrdma_register_frmr_external().
> - */
> -static void
> -frwr_op_reset(struct rpcrdma_xprt *r_xprt)
> -{
> -}
> -
> static void
> frwr_op_destroy(struct rpcrdma_buffer *buf)
> {
> @@ -464,7 +449,6 @@ const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
> .ro_open = frwr_op_open,
> .ro_maxpages = frwr_op_maxpages,
> .ro_init = frwr_op_init,
> - .ro_reset = frwr_op_reset,
> .ro_destroy = frwr_op_destroy,
> .ro_displayname = "frwr",
> };
> diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
> index da149e8..41985d0 100644
> --- a/net/sunrpc/xprtrdma/physical_ops.c
> +++ b/net/sunrpc/xprtrdma/physical_ops.c
> @@ -69,11 +69,6 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
> }
>
> static void
> -physical_op_reset(struct rpcrdma_xprt *r_xprt)
> -{
> -}
> -
> -static void
> physical_op_destroy(struct rpcrdma_buffer *buf)
> {
> }
> @@ -84,7 +79,6 @@ const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
> .ro_open = physical_op_open,
> .ro_maxpages = physical_op_maxpages,
> .ro_init = physical_op_init,
> - .ro_reset = physical_op_reset,
> .ro_destroy = physical_op_destroy,
> .ro_displayname = "physical",
> };
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 5120a8e..eaf0b9d 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -897,8 +897,6 @@ retry:
> rpcrdma_flush_cqs(ep);
>
> xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
> - ia->ri_ops->ro_reset(xprt);
> -
> id = rpcrdma_create_id(xprt, ia,
> (struct sockaddr *)&xprt->rx_data.addr);
> if (IS_ERR(id)) {
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 98227d6..6a1e565 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -353,7 +353,6 @@ struct rpcrdma_memreg_ops {
> struct rpcrdma_create_data_internal *);
> size_t (*ro_maxpages)(struct rpcrdma_xprt *);
> int (*ro_init)(struct rpcrdma_xprt *);
> - void (*ro_reset)(struct rpcrdma_xprt *);
> void (*ro_destroy)(struct rpcrdma_buffer *);
> const char *ro_displayname;
> };
>

Looks good,

Reviewed-by: Sagi Grimberg <[email protected]>

2015-05-07 10:36:48

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 11/14] xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy

On 5/4/2015 8:58 PM, Chuck Lever wrote:
> Clean up: This field is no longer used.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/xprtrdma.h | 3 ++-
> net/sunrpc/xprtrdma/verbs.c | 3 ---
> net/sunrpc/xprtrdma/xprt_rdma.h | 1 -
> 3 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
> index c984c85..b176130 100644
> --- a/include/linux/sunrpc/xprtrdma.h
> +++ b/include/linux/sunrpc/xprtrdma.h
> @@ -56,7 +56,8 @@
>
> #define RPCRDMA_INLINE_PAD_THRESH (512)/* payload threshold to pad (bytes) */
>
> -/* memory registration strategies */
> +/* Memory registration strategies, by number.
> + * This is part of a kernel / user space API. Do not remove. */
> enum rpcrdma_memreg {
> RPCRDMA_BOUNCEBUFFERS = 0,
> RPCRDMA_REGISTER,
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index eaf0b9d..1f51547 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -671,9 +671,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
> dprintk("RPC: %s: memory registration strategy is '%s'\n",
> __func__, ia->ri_ops->ro_displayname);
>
> - /* Else will do memory reg/dereg for each chunk */
> - ia->ri_memreg_strategy = memreg;
> -
> rwlock_init(&ia->ri_qplock);
> return 0;
>
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 6a1e565..5650c23 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -70,7 +70,6 @@ struct rpcrdma_ia {
> int ri_have_dma_lkey;
> struct completion ri_done;
> int ri_async_rc;
> - enum rpcrdma_memreg ri_memreg_strategy;
> unsigned int ri_max_frmr_depth;
> struct ib_device_attr ri_devattr;
> struct ib_qp_attr ri_qp_attr;
>

Looks good,

Reviewed-by: Sagi Grimberg <[email protected]>

2015-05-07 10:37:30

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 12/14] xprtrdma: Split rb_lock

On 5/4/2015 8:58 PM, Chuck Lever wrote:
> /proc/lock_stat showed contention between rpcrdma_buffer_get/put
> and the MR allocation functions during I/O intensive workloads.
>
> Now that MRs are no longer allocated in rpcrdma_buffer_get(),
> there's no reason the rb_mws list has to be managed using the
> same lock as the send/receive buffers. Split that lock. The
> new lock does not need to disable interrupts because buffer
> get/put is never called in an interrupt context.
>
> struct rpcrdma_buffer is re-arranged to ensure rb_mwlock and
> rb_mws is always in a different cacheline than rb_lock and the
> buffer pointers.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/fmr_ops.c | 1 +
> net/sunrpc/xprtrdma/frwr_ops.c | 1 +
> net/sunrpc/xprtrdma/verbs.c | 10 ++++------
> net/sunrpc/xprtrdma/xprt_rdma.h | 16 +++++++++-------
> 4 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index 5dd77da..52f9ad5 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -65,6 +65,7 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
> struct rpcrdma_mw *r;
> int i, rc;
>
> + spin_lock_init(&buf->rb_mwlock);
> INIT_LIST_HEAD(&buf->rb_mws);
> INIT_LIST_HEAD(&buf->rb_all);
>
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 3fb609a..edc10ba 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -266,6 +266,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
> struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
> int i;
>
> + spin_lock_init(&buf->rb_mwlock);
> INIT_LIST_HEAD(&buf->rb_mws);
> INIT_LIST_HEAD(&buf->rb_all);
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 1f51547..c5830cd 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1179,15 +1179,14 @@ rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
> {
> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> struct rpcrdma_mw *mw = NULL;
> - unsigned long flags;
>
> - spin_lock_irqsave(&buf->rb_lock, flags);
> + spin_lock(&buf->rb_mwlock);
> if (!list_empty(&buf->rb_mws)) {
> mw = list_first_entry(&buf->rb_mws,
> struct rpcrdma_mw, mw_list);
> list_del_init(&mw->mw_list);
> }
> - spin_unlock_irqrestore(&buf->rb_lock, flags);
> + spin_unlock(&buf->rb_mwlock);
>
> if (!mw)
> pr_err("RPC: %s: no MWs available\n", __func__);
> @@ -1198,11 +1197,10 @@ void
> rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
> {
> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> - unsigned long flags;
>
> - spin_lock_irqsave(&buf->rb_lock, flags);
> + spin_lock(&buf->rb_mwlock);
> list_add_tail(&mw->mw_list, &buf->rb_mws);
> - spin_unlock_irqrestore(&buf->rb_lock, flags);
> + spin_unlock(&buf->rb_mwlock);
> }
>
> static void
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 5650c23..ae31fc7 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -283,15 +283,17 @@ rpcr_to_rdmar(struct rpc_rqst *rqst)
> * One of these is associated with a transport instance
> */
> struct rpcrdma_buffer {
> - spinlock_t rb_lock; /* protects indexes */
> - u32 rb_max_requests;/* client max requests */
> - struct list_head rb_mws; /* optional memory windows/fmrs/frmrs */
> - struct list_head rb_all;
> - int rb_send_index;
> + spinlock_t rb_mwlock; /* protect rb_mws list */
> + struct list_head rb_mws;
> + struct list_head rb_all;
> + char *rb_pool;
> +
> + spinlock_t rb_lock; /* protect buf arrays */
> + u32 rb_max_requests;
> + int rb_send_index;
> + int rb_recv_index;
> struct rpcrdma_req **rb_send_bufs;
> - int rb_recv_index;
> struct rpcrdma_rep **rb_recv_bufs;
> - char *rb_pool;
> };
> #define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)
>

Looks good,

Reviewed-by: Sagi Grimberg <[email protected]>

2015-05-07 10:37:48

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 07/14] xprtrdma: Introduce an FRMR recovery workqueue

Looks good

Reviewed-By: Devesh Sharma <[email protected]>

On Mon, May 4, 2015 at 11:27 PM, Chuck Lever <[email protected]> wrote:
> After a transport disconnect, FRMRs can be left in an undetermined
> state. In particular, the MR's rkey is no good.
>
> Currently, FRMRs are fixed up by the transport connect worker, but
> that can race with ->ro_unmap if an RPC happens to exit while the
> transport connect worker is running.
>
> A better way of dealing with broken FRMRs is to detect them before
> they are re-used by ->ro_map. Such FRMRs are either already invalid
> or are owned by the sending RPC, and thus no race with ->ro_unmap
> is possible.
>
> Introduce a mechanism for handing broken FRMRs to a workqueue to be
> reset in a context that is appropriate for allocating resources
> (ie. an ib_alloc_fast_reg_mr() API call).
>
> This mechanism is not yet used, but will be in subsequent patches.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/frwr_ops.c | 71 ++++++++++++++++++++++++++++++++++++++-
> net/sunrpc/xprtrdma/transport.c | 11 +++++-
> net/sunrpc/xprtrdma/xprt_rdma.h | 5 +++
> 3 files changed, 84 insertions(+), 3 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 66a85fa..a06d9a3 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -17,6 +17,74 @@
> # define RPCDBG_FACILITY RPCDBG_TRANS
> #endif
>
> +static struct workqueue_struct *frwr_recovery_wq;
> +
> +#define FRWR_RECOVERY_WQ_FLAGS (WQ_UNBOUND | WQ_MEM_RECLAIM)
> +
> +int
> +frwr_alloc_recovery_wq(void)
> +{
> + frwr_recovery_wq = alloc_workqueue("frwr_recovery",
> + FRWR_RECOVERY_WQ_FLAGS, 0);
> + return !frwr_recovery_wq ? -ENOMEM : 0;
> +}
> +
> +void
> +frwr_destroy_recovery_wq(void)
> +{
> + struct workqueue_struct *wq;
> +
> + if (!frwr_recovery_wq)
> + return;
> +
> + wq = frwr_recovery_wq;
> + frwr_recovery_wq = NULL;
> + destroy_workqueue(wq);
> +}
> +
> +/* Deferred reset of a single FRMR. Generate a fresh rkey by
> + * replacing the MR.
> + *
> + * There's no recovery if this fails. The FRMR is abandoned, but
> + * remains in rb_all. It will be cleaned up when the transport is
> + * destroyed.
> + */
> +static void
> +__frwr_recovery_worker(struct work_struct *work)
> +{
> + struct rpcrdma_mw *r = container_of(work, struct rpcrdma_mw,
> + r.frmr.fr_work);
> + struct rpcrdma_xprt *r_xprt = r->r.frmr.fr_xprt;
> + unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
> + struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
> +
> + if (ib_dereg_mr(r->r.frmr.fr_mr))
> + goto out_fail;
> +
> + r->r.frmr.fr_mr = ib_alloc_fast_reg_mr(pd, depth);
> + if (IS_ERR(r->r.frmr.fr_mr))
> + goto out_fail;
> +
> + dprintk("RPC: %s: recovered FRMR %p\n", __func__, r);
> + r->r.frmr.fr_state = FRMR_IS_INVALID;
> + rpcrdma_put_mw(r_xprt, r);
> + return;
> +
> +out_fail:
> + pr_warn("RPC: %s: FRMR %p unrecovered\n",
> + __func__, r);
> +}
> +
> +/* A broken MR was discovered in a context that can't sleep.
> + * Defer recovery to the recovery worker.
> + */
> +static void
> +__frwr_queue_recovery(struct rpcrdma_mw *r)
> +{
> + INIT_WORK(&r->r.frmr.fr_work, __frwr_recovery_worker);
> + queue_work(frwr_recovery_wq, &r->r.frmr.fr_work);
> +}
> +
> static int
> __frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
> unsigned int depth)
> @@ -128,7 +196,7 @@ frwr_sendcompletion(struct ib_wc *wc)
>
> /* WARNING: Only wr_id and status are reliable at this point */
> r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
> - dprintk("RPC: %s: frmr %p (stale), status %d\n",
> + pr_warn("RPC: %s: frmr %p flushed, status %d\n",
> __func__, r, wc->status);
> r->r.frmr.fr_state = FRMR_IS_STALE;
> }
> @@ -165,6 +233,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
> list_add(&r->mw_list, &buf->rb_mws);
> list_add(&r->mw_all, &buf->rb_all);
> r->mw_sendcompletion = frwr_sendcompletion;
> + r->r.frmr.fr_xprt = r_xprt;
> }
>
> return 0;
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index ed70551..f1fa6a7 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -757,17 +757,24 @@ static void __exit xprt_rdma_cleanup(void)
> if (rc)
> dprintk("RPC: %s: xprt_unregister returned %i\n",
> __func__, rc);
> +
> + frwr_destroy_recovery_wq();
> }
>
> static int __init xprt_rdma_init(void)
> {
> int rc;
>
> - rc = xprt_register_transport(&xprt_rdma);
> -
> + rc = frwr_alloc_recovery_wq();
> if (rc)
> return rc;
>
> + rc = xprt_register_transport(&xprt_rdma);
> + if (rc) {
> + frwr_destroy_recovery_wq();
> + return rc;
> + }
> +
> dprintk("RPCRDMA Module Init, register RPC RDMA transport\n");
>
> dprintk("Defaults:\n");
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 7de424e..98227d6 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -204,6 +204,8 @@ struct rpcrdma_frmr {
> struct ib_fast_reg_page_list *fr_pgl;
> struct ib_mr *fr_mr;
> enum rpcrdma_frmr_state fr_state;
> + struct work_struct fr_work;
> + struct rpcrdma_xprt *fr_xprt;
> };
>
> struct rpcrdma_mw {
> @@ -429,6 +431,9 @@ void rpcrdma_free_regbuf(struct rpcrdma_ia *,
>
> unsigned int rpcrdma_max_segments(struct rpcrdma_xprt *);
>
> +int frwr_alloc_recovery_wq(void);
> +void frwr_destroy_recovery_wq(void);
> +
> /*
> * Wrappers for chunk registration, shared by read/write chunk code.
> */
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
-Regards
Devesh

2015-05-07 10:50:10

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 13/14] xprtrdma: Stack relief in fmr_op_map()

On 5/4/2015 8:58 PM, Chuck Lever wrote:
> fmr_op_map() declares a 64 element array of u64 in automatic
> storage. This is 512 bytes (8 * 64) on the stack.
>
> Instead, when FMR memory registration is in use, pre-allocate a
> physaddr array for each rpcrdma_mw.
>
> This is a pre-requisite for increasing the r/wsize maximum for
> FMR on platforms with 4KB pages.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/fmr_ops.c | 32 ++++++++++++++++++++++----------
> net/sunrpc/xprtrdma/xprt_rdma.h | 7 ++++++-
> 2 files changed, 28 insertions(+), 11 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index 52f9ad5..4a53ad5 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -72,13 +72,19 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
> i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
> dprintk("RPC: %s: initializing %d FMRs\n", __func__, i);
>
> + rc = -ENOMEM;
> while (i--) {
> r = kzalloc(sizeof(*r), GFP_KERNEL);
> if (!r)
> - return -ENOMEM;
> + goto out;
> +
> + r->r.fmr.physaddrs = kmalloc(RPCRDMA_MAX_FMR_SGES *
> + sizeof(u64), GFP_KERNEL);
> + if (!r->r.fmr.physaddrs)
> + goto out_free;
>
> - r->r.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
> - if (IS_ERR(r->r.fmr))
> + r->r.fmr.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
> + if (IS_ERR(r->r.fmr.fmr))
> goto out_fmr_err;
>
> list_add(&r->mw_list, &buf->rb_mws);
> @@ -87,9 +93,12 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
> return 0;
>
> out_fmr_err:
> - rc = PTR_ERR(r->r.fmr);
> + rc = PTR_ERR(r->r.fmr.fmr);
> dprintk("RPC: %s: ib_alloc_fmr status %i\n", __func__, rc);
> + kfree(r->r.fmr.physaddrs);
> +out_free:
> kfree(r);
> +out:
> return rc;
> }
>
> @@ -98,7 +107,7 @@ __fmr_unmap(struct rpcrdma_mw *r)
> {
> LIST_HEAD(l);
>
> - list_add(&r->r.fmr->list, &l);
> + list_add(&r->r.fmr.fmr->list, &l);
> return ib_unmap_fmr(&l);
> }
>
> @@ -113,7 +122,6 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> struct ib_device *device = ia->ri_device;
> enum dma_data_direction direction = rpcrdma_data_dir(writing);
> struct rpcrdma_mr_seg *seg1 = seg;
> - u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
> int len, pageoff, i, rc;
> struct rpcrdma_mw *mw;
>
> @@ -138,7 +146,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> nsegs = RPCRDMA_MAX_FMR_SGES;
> for (i = 0; i < nsegs;) {
> rpcrdma_map_one(device, seg, direction);
> - physaddrs[i] = seg->mr_dma;
> + mw->r.fmr.physaddrs[i] = seg->mr_dma;
> len += seg->mr_len;
> ++seg;
> ++i;
> @@ -148,12 +156,13 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> break;
> }
>
> - rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
> + rc = ib_map_phys_fmr(mw->r.fmr.fmr, mw->r.fmr.physaddrs,
> + i, seg1->mr_dma);
> if (rc)
> goto out_maperr;
>
> seg1->rl_mw = mw;
> - seg1->mr_rkey = mw->r.fmr->rkey;
> + seg1->mr_rkey = mw->r.fmr.fmr->rkey;
> seg1->mr_base = seg1->mr_dma + pageoff;
> seg1->mr_nsegs = i;
> seg1->mr_len = len;
> @@ -207,10 +216,13 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
> while (!list_empty(&buf->rb_all)) {
> r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
> list_del(&r->mw_all);
> - rc = ib_dealloc_fmr(r->r.fmr);
> + kfree(r->r.fmr.physaddrs);
> +
> + rc = ib_dealloc_fmr(r->r.fmr.fmr);
> if (rc)
> dprintk("RPC: %s: ib_dealloc_fmr failed %i\n",
> __func__, rc);
> +
> kfree(r);
> }
> }
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index ae31fc7..e176bae 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -207,9 +207,14 @@ struct rpcrdma_frmr {
> struct rpcrdma_xprt *fr_xprt;
> };
>
> +struct rpcrdma_fmr {
> + struct ib_fmr *fmr;
> + u64 *physaddrs;
> +};
> +
> struct rpcrdma_mw {
> union {
> - struct ib_fmr *fmr;
> + struct rpcrdma_fmr fmr;
> struct rpcrdma_frmr frmr;
> } r;
> void (*mw_sendcompletion)(struct ib_wc *);
>

Looks good

Reviewed-by: Sagi Grimberg <[email protected]>

2015-05-07 11:00:45

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] xprtrmda: Reduce per-transport MR allocation

On 5/4/2015 8:58 PM, Chuck Lever wrote:
> Reduce resource consumption per-transport to make way for increasing
> the credit limit and maximum r/wsize. Pre-allocate fewer MRs.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/fmr_ops.c | 6 ++++--
> net/sunrpc/xprtrdma/frwr_ops.c | 6 ++++--
> 2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index 4a53ad5..f1e8daf 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -69,8 +69,10 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
> INIT_LIST_HEAD(&buf->rb_mws);
> INIT_LIST_HEAD(&buf->rb_all);
>
> - i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
> - dprintk("RPC: %s: initializing %d FMRs\n", __func__, i);
> + i = max_t(int, RPCRDMA_MAX_DATA_SEGS / RPCRDMA_MAX_FMR_SGES, 1);
> + i += 2; /* head + tail */
> + i *= buf->rb_max_requests; /* one set for each RPC slot */
> + dprintk("RPC: %s: initalizing %d FMRs\n", __func__, i);
>
> rc = -ENOMEM;
> while (i--) {
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index edc10ba..fc2d0c6 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -270,8 +270,10 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
> INIT_LIST_HEAD(&buf->rb_mws);
> INIT_LIST_HEAD(&buf->rb_all);
>
> - i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
> - dprintk("RPC: %s: initializing %d FRMRs\n", __func__, i);
> + i = max_t(int, RPCRDMA_MAX_DATA_SEGS / depth, 1);
> + i += 2; /* head + tail */
> + i *= buf->rb_max_requests; /* one set for each RPC slot */
> + dprintk("RPC: %s: initalizing %d FRMRs\n", __func__, i);
>
> while (i--) {
> struct rpcrdma_mw *r;
>

Looks good.

Reviewed-by: Sagi Grimberg <[email protected]>
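
As a rough worked example of the reduction (values here are purely
illustrative assumptions: RPCRDMA_MAX_DATA_SEGS = 64, RPCRDMA_MAX_FMR_SGES =
64, and rb_max_requests = 32), the new sizing gives (max(64 / 64, 1) + 2) *
32 = 96 FMRs per transport, whereas the old (rb_max_requests + 1) *
RPCRDMA_MAX_SEGS sizing would have allocated a couple of thousand.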

2015-05-07 13:25:30

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 03/14] xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt


On May 7, 2015, at 5:38 AM, Sagi Grimberg <[email protected]> wrote:

> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>> Clean up: Instead of carrying a pointer to the buffer pool and
>> the rpc_xprt, carry a pointer to the controlling rpcrdma_xprt.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++--
>> net/sunrpc/xprtrdma/transport.c | 7 ++-----
>> net/sunrpc/xprtrdma/verbs.c | 8 +++++---
>> net/sunrpc/xprtrdma/xprt_rdma.h | 3 +--
>> 4 files changed, 10 insertions(+), 12 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index 2c53ea9..98a3b95 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -732,8 +732,8 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>> struct rpcrdma_msg *headerp;
>> struct rpcrdma_req *req;
>> struct rpc_rqst *rqst;
>> - struct rpc_xprt *xprt = rep->rr_xprt;
>> - struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
>> + struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
>> + struct rpc_xprt *xprt = &r_xprt->rx_xprt;
>> __be32 *iptr;
>> int rdmalen, status;
>> unsigned long cwnd;
>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>> index fdcb2c7..ed70551 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -650,12 +650,9 @@ xprt_rdma_send_request(struct rpc_task *task)
>>
>> if (req->rl_reply == NULL) /* e.g. reconnection */
>> rpcrdma_recv_buffer_get(req);
>> -
>> - if (req->rl_reply) {
>> + /* rpcrdma_recv_buffer_get may have set rl_reply, so check again */
>> + if (req->rl_reply)
>> req->rl_reply->rr_func = rpcrdma_reply_handler;
>> - /* this need only be done once, but... */
>> - req->rl_reply->rr_xprt = xprt;
>> - }
>
> Can't you just fold that into rpcrdma_recv_buffer_get() instead of
> checking what it did?

rr_func is going away in an upcoming merge window.


> Other than that,
>
> Looks good,
>
> Reviewed-by: Sagi Grimberg <[email protected]>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2015-05-07 13:38:58

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely


On May 7, 2015, at 6:00 AM, Sagi Grimberg <[email protected]> wrote:

> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>> The connect worker can replace ri_id, but prevents ri_id->device
>> from changing during the lifetime of a transport instance.
>>
>> Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
>> The cached copy can be used safely in code that does not serialize
>> with the connect worker.
>>
>> Other code can use it to save an extra address generation (one
>> pointer dereference instead of two).
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/fmr_ops.c | 8 +----
>> net/sunrpc/xprtrdma/frwr_ops.c | 12 +++----
>> net/sunrpc/xprtrdma/physical_ops.c | 8 +----
>> net/sunrpc/xprtrdma/verbs.c | 61 +++++++++++++++++++-----------------
>> net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
>> 5 files changed, 43 insertions(+), 48 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 302d4eb..0a96155 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -85,7 +85,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>> int nsegs, bool writing)
>> {
>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> - struct ib_device *device = ia->ri_id->device;
>> + struct ib_device *device = ia->ri_device;
>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>> struct rpcrdma_mr_seg *seg1 = seg;
>> struct rpcrdma_mw *mw = seg1->rl_mw;
>> @@ -137,17 +137,13 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>> {
>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> struct rpcrdma_mr_seg *seg1 = seg;
>> - struct ib_device *device;
>> int rc, nsegs = seg->mr_nsegs;
>> LIST_HEAD(l);
>>
>> list_add(&seg1->rl_mw->r.fmr->list, &l);
>> rc = ib_unmap_fmr(&l);
>> - read_lock(&ia->ri_qplock);
>> - device = ia->ri_id->device;
>> while (seg1->mr_nsegs--)
>> - rpcrdma_unmap_one(device, seg++);
>> - read_unlock(&ia->ri_qplock);
>> + rpcrdma_unmap_one(ia->ri_device, seg++);
>
> Umm, I'm wondering if this is guaranteed to be the same device as
> ri_id->device?
>
> Imagine you are working on a bond device where each slave belongs to
> a different adapter. When the active port toggles, you will see a
> ADDR_CHANGED event (that the current code does not handle...), what
> you'd want to do is just reconnect and rdma_cm will resolve the new
> address for you (via the backup slave). I suspect that in case this
> flow is concurrent with the reconnects you may end up with a stale
> device handle.

I'm not sure what you mean by "stale": freed memory?

I'm looking at this code in rpcrdma_ep_connect():

916 if (ia->ri_id->device != id->device) {
917 printk("RPC: %s: can't reconnect on "
918 "different device!\n", __func__);
919 rdma_destroy_id(id);
920 rc = -ENETUNREACH;
921 goto out;
922 }

After reconnecting, if the ri_id has changed, the connect fails. Today,
xprtrdma does not support the device changing out from under it.

Note also that our receive completion upcall uses ri_id->device for
DMA map syncing. Would that also be a problem during a bond failover?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
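
For context, a minimal sketch of the completion-path DMA sync mentioned
above (field and buffer names here are illustrative, not lifted from the
patch): caching the ib_device on the rep lets the upcall sync the receive
buffer without dereferencing ri_id.

    /* Hypothetical receive completion handling */
    static void example_recv_done(struct rpcrdma_rep *rep, struct ib_wc *wc)
    {
            ib_dma_sync_single_for_cpu(rep->rr_device, rep->rr_dma_addr,
                                       wc->byte_len, DMA_FROM_DEVICE);
            /* ... hand the reply off to rpcrdma_reply_handler ... */
    }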




2015-05-07 13:56:50

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely

On 5/7/2015 4:39 PM, Chuck Lever wrote:
>
> On May 7, 2015, at 6:00 AM, Sagi Grimberg <[email protected]> wrote:
>
>> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>>> The connect worker can replace ri_id, but prevents ri_id->device
>>> from changing during the lifetime of a transport instance.
>>>
>>> Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
>>> The cached copy can be used safely in code that does not serialize
>>> with the connect worker.
>>>
>>> Other code can use it to save an extra address generation (one
>>> pointer dereference instead of two).
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> net/sunrpc/xprtrdma/fmr_ops.c | 8 +----
>>> net/sunrpc/xprtrdma/frwr_ops.c | 12 +++----
>>> net/sunrpc/xprtrdma/physical_ops.c | 8 +----
>>> net/sunrpc/xprtrdma/verbs.c | 61 +++++++++++++++++++-----------------
>>> net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
>>> 5 files changed, 43 insertions(+), 48 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>>> index 302d4eb..0a96155 100644
>>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>>> @@ -85,7 +85,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>> int nsegs, bool writing)
>>> {
>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>> - struct ib_device *device = ia->ri_id->device;
>>> + struct ib_device *device = ia->ri_device;
>>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>>> struct rpcrdma_mr_seg *seg1 = seg;
>>> struct rpcrdma_mw *mw = seg1->rl_mw;
>>> @@ -137,17 +137,13 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>>> {
>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>> struct rpcrdma_mr_seg *seg1 = seg;
>>> - struct ib_device *device;
>>> int rc, nsegs = seg->mr_nsegs;
>>> LIST_HEAD(l);
>>>
>>> list_add(&seg1->rl_mw->r.fmr->list, &l);
>>> rc = ib_unmap_fmr(&l);
>>> - read_lock(&ia->ri_qplock);
>>> - device = ia->ri_id->device;
>>> while (seg1->mr_nsegs--)
>>> - rpcrdma_unmap_one(device, seg++);
>>> - read_unlock(&ia->ri_qplock);
>>> + rpcrdma_unmap_one(ia->ri_device, seg++);
>>
>> Umm, I'm wondering if this is guaranteed to be the same device as
>> ri_id->device?
>>
>> Imagine you are working on a bond device where each slave belongs to
>> a different adapter. When the active port toggles, you will see a
>> ADDR_CHANGED event (that the current code does not handle...), what
>> you'd want to do is just reconnect and rdma_cm will resolve the new
>> address for you (via the backup slave). I suspect that in case this
>> flow is concurrent with the reconnects you may end up with a stale
>> device handle.
>
> I'm not sure what you mean by "stale": freed memory?
>
> I'm looking at this code in rpcrdma_ep_connect():
>
> 916 if (ia->ri_id->device != id->device) {
> 917 printk("RPC: %s: can't reconnect on "
> 918 "different device!\n", __func__);
> 919 rdma_destroy_id(id);
> 920 rc = -ENETUNREACH;
> 921 goto out;
> 922 }
>
> After reconnecting, if the ri_id has changed, the connect fails. Today,
> xprtrdma does not support the device changing out from under it.
>
> Note also that our receive completion upcall uses ri_id->device for
> DMA map syncing. Would that also be a problem during a bond failover?
>

I'm not talking about ri_id->device, this will be consistent. I'm
wondering about ia->ri_device, which might not have been updated yet.

Just asking, assuming your transport device can change between
consecutive reconnects (the new cm_id will contain another device), is
it safe to rely on ri_device being updated?

2015-05-07 14:12:09

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely


On May 7, 2015, at 9:56 AM, Sagi Grimberg <[email protected]> wrote:

> On 5/7/2015 4:39 PM, Chuck Lever wrote:
>>
>> On May 7, 2015, at 6:00 AM, Sagi Grimberg <[email protected]> wrote:
>>
>>> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>>>> The connect worker can replace ri_id, but prevents ri_id->device
>>>> from changing during the lifetime of a transport instance.
>>>>
>>>> Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
>>>> The cached copy can be used safely in code that does not serialize
>>>> with the connect worker.
>>>>
>>>> Other code can use it to save an extra address generation (one
>>>> pointer dereference instead of two).
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>> net/sunrpc/xprtrdma/fmr_ops.c | 8 +----
>>>> net/sunrpc/xprtrdma/frwr_ops.c | 12 +++----
>>>> net/sunrpc/xprtrdma/physical_ops.c | 8 +----
>>>> net/sunrpc/xprtrdma/verbs.c | 61 +++++++++++++++++++-----------------
>>>> net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
>>>> 5 files changed, 43 insertions(+), 48 deletions(-)
>>>>
>>>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>>>> index 302d4eb..0a96155 100644
>>>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>>>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>>>> @@ -85,7 +85,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>>> int nsegs, bool writing)
>>>> {
>>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>>> - struct ib_device *device = ia->ri_id->device;
>>>> + struct ib_device *device = ia->ri_device;
>>>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>>>> struct rpcrdma_mr_seg *seg1 = seg;
>>>> struct rpcrdma_mw *mw = seg1->rl_mw;
>>>> @@ -137,17 +137,13 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>>>> {
>>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>>> struct rpcrdma_mr_seg *seg1 = seg;
>>>> - struct ib_device *device;
>>>> int rc, nsegs = seg->mr_nsegs;
>>>> LIST_HEAD(l);
>>>>
>>>> list_add(&seg1->rl_mw->r.fmr->list, &l);
>>>> rc = ib_unmap_fmr(&l);
>>>> - read_lock(&ia->ri_qplock);
>>>> - device = ia->ri_id->device;
>>>> while (seg1->mr_nsegs--)
>>>> - rpcrdma_unmap_one(device, seg++);
>>>> - read_unlock(&ia->ri_qplock);
>>>> + rpcrdma_unmap_one(ia->ri_device, seg++);
>>>
>>> Umm, I'm wondering if this is guaranteed to be the same device as
>>> ri_id->device?
>>>
>>> Imagine you are working on a bond device where each slave belongs to
>>> a different adapter. When the active port toggles, you will see a
>>> ADDR_CHANGED event (that the current code does not handle...), what
>>> you'd want to do is just reconnect and rdma_cm will resolve the new
>>> address for you (via the backup slave). I suspect that in case this
>>> flow is concurrent with the reconnects you may end up with a stale
>>> device handle.
>>
>> I'm not sure what you mean by "stale": freed memory?
>>
>> I'm looking at this code in rpcrdma_ep_connect():
>>
>> 916 if (ia->ri_id->device != id->device) {
>> 917 printk("RPC: %s: can't reconnect on "
>> 918 "different device!\n", __func__);
>> 919 rdma_destroy_id(id);
>> 920 rc = -ENETUNREACH;
>> 921 goto out;
>> 922 }
>>
>> After reconnecting, if the ri_id has changed, the connect fails. Today,
>> xprtrdma does not support the device changing out from under it.
>>
>> Note also that our receive completion upcall uses ri_id->device for
>> DMA map syncing. Would that also be a problem during a bond failover?
>>
>
> I'm not talking about ri_id->device, this will be consistent. I'm
> wondering about ia->ri_device, which might not have been updated yet.

ia->ri_device is never updated. The only place it is set is in
rpcrdma_ia_open().

> Just asking, assuming your transport device can change between consecutive reconnects (the new cm_id will contain another device), is
> it safe to rely on ri_device being updated?

My reading of the above logic is that ia->ri_id->device is guaranteed to
be the same address during the lifetime of the transport instance. If it
changes during a reconnect, rpcrdma_ep_connect() will fail the connect.

In the case of a bonded device, why are the physical slave devices exposed
to consumers? It might be saner to construct a virtual ib_device in this
case that consumers can depend on.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2015-05-07 15:11:42

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely

On 5/7/2015 5:12 PM, Chuck Lever wrote:
>
> On May 7, 2015, at 9:56 AM, Sagi Grimberg <[email protected]> wrote:
>
>> On 5/7/2015 4:39 PM, Chuck Lever wrote:
>>>
>>> On May 7, 2015, at 6:00 AM, Sagi Grimberg <[email protected]> wrote:
>>>
>>>> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>>>>> The connect worker can replace ri_id, but prevents ri_id->device
>>>>> from changing during the lifetime of a transport instance.
>>>>>
>>>>> Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
>>>>> The cached copy can be used safely in code that does not serialize
>>>>> with the connect worker.
>>>>>
>>>>> Other code can use it to save an extra address generation (one
>>>>> pointer dereference instead of two).
>>>>>
>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>> ---
>>>>> net/sunrpc/xprtrdma/fmr_ops.c | 8 +----
>>>>> net/sunrpc/xprtrdma/frwr_ops.c | 12 +++----
>>>>> net/sunrpc/xprtrdma/physical_ops.c | 8 +----
>>>>> net/sunrpc/xprtrdma/verbs.c | 61 +++++++++++++++++++-----------------
>>>>> net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
>>>>> 5 files changed, 43 insertions(+), 48 deletions(-)
>>>>>
>>>>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>>>>> index 302d4eb..0a96155 100644
>>>>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>>>>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>>>>> @@ -85,7 +85,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>>>> int nsegs, bool writing)
>>>>> {
>>>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>>>> - struct ib_device *device = ia->ri_id->device;
>>>>> + struct ib_device *device = ia->ri_device;
>>>>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>>>>> struct rpcrdma_mr_seg *seg1 = seg;
>>>>> struct rpcrdma_mw *mw = seg1->rl_mw;
>>>>> @@ -137,17 +137,13 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>>>>> {
>>>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>>>> struct rpcrdma_mr_seg *seg1 = seg;
>>>>> - struct ib_device *device;
>>>>> int rc, nsegs = seg->mr_nsegs;
>>>>> LIST_HEAD(l);
>>>>>
>>>>> list_add(&seg1->rl_mw->r.fmr->list, &l);
>>>>> rc = ib_unmap_fmr(&l);
>>>>> - read_lock(&ia->ri_qplock);
>>>>> - device = ia->ri_id->device;
>>>>> while (seg1->mr_nsegs--)
>>>>> - rpcrdma_unmap_one(device, seg++);
>>>>> - read_unlock(&ia->ri_qplock);
>>>>> + rpcrdma_unmap_one(ia->ri_device, seg++);
>>>>
>>>> Umm, I'm wondering if this is guaranteed to be the same device as
>>>> ri_id->device?
>>>>
>>>> Imagine you are working on a bond device where each slave belongs to
>>>> a different adapter. When the active port toggles, you will see an
>>>> ADDR_CHANGED event (that the current code does not handle...); what
>>>> you'd want to do is just reconnect and rdma_cm will resolve the new
>>>> address for you (via the backup slave). I suspect that if this
>>>> flow is concurrent with the reconnects you may end up with a stale
>>>> device handle.
>>>
>>> I'm not sure what you mean by "stale": freed memory?
>>>
>>> I'm looking at this code in rpcrdma_ep_connect():
>>>
>>> 916 if (ia->ri_id->device != id->device) {
>>> 917 printk("RPC: %s: can't reconnect on "
>>> 918 "different device!\n", __func__);
>>> 919 rdma_destroy_id(id);
>>> 920 rc = -ENETUNREACH;
>>> 921 goto out;
>>> 922 }
>>>
>>> After reconnecting, if the ri_id has changed, the connect fails. Today,
>>> xprtrdma does not support the device changing out from under it.
>>>
>>> Note also that our receive completion upcall uses ri_id->device for
>>> DMA map syncing. Would that also be a problem during a bond failover?
>>>
>>
>> I'm not talking about ri_id->device, this will be consistent. I'm
>> wondering about ia->ri_device, which might not have been updated yet.
>
> ia->ri_device is never updated. The only place it is set is in
> rpcrdma_ia_open().

So you assume that each ri_id that you will recreate contains the
same device handle?

I think that for ADDR_CHANGE event when the slave belongs to another
device you will hit a mismatch. CC'ing Sean for more info...

>
>> Just asking, assuming your transport device can change between consecutive reconnects (the new cm_id will contain another device), is
>> it safe to rely on ri_device being updated?
>
> My reading of the above logic is that ia->ri_id->device is guaranteed to
> be the same address during the lifetime of the transport instance. If it
> changes during a reconnect, rpcrdma_ep_connect() will fail the connect.

It is the same address - the bond0 IP...

>
> In the case of a bonded device, why are the physical slave devices exposed
> to consumers?

You mean the ib_device handle? You need it to create PD/CQ/QP/MRs...
How else can you allocate the device resources without the device
handle?

rdma_cm simply gives you the device handle by the IP route. From there
you own the resources you create.

> It might be saner to construct a virtual ib_device in this
> case that consumers can depend on.

I'm not sure how a virtual ib_device could work - that goes
to the verbs themselves... Seems like a layering mismatch to me...

2015-05-08 15:24:09

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] xprtrdma: Acquire MRs in rpcrdma_register_external()

On Thu, May 7, 2015 at 4:01 PM, Sagi Grimberg <[email protected]> wrote:
> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>>
>> Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
>> pool lock is expensive, and unnecessary because most modern adapters
>> can transfer 100s of KBs of payload using just a single MR.
>>
>> Instead, acquire MRs one-at-a-time as chunks are registered, and
>> return them to rb_mws immediately during deregistration.
>>
>> Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
>> registration fails") is reverted: There is now a valid case where
>> registration can fail (with -ENOMEM) but the QP is still in RTS.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/frwr_ops.c | 120
>> ++++++++++++++++++++++++++++------------
>> net/sunrpc/xprtrdma/rpc_rdma.c | 3 -
>> net/sunrpc/xprtrdma/verbs.c | 21 -------
>> 3 files changed, 86 insertions(+), 58 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c
>> b/net/sunrpc/xprtrdma/frwr_ops.c
>> index a06d9a3..6f93a89 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -11,6 +11,62 @@
>> * but most complex memory registration mode.
>> */
>>
>> +/* Normal operation
>> + *
>> + * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
>> + * Work Request (frmr_op_map). When the RDMA operation is finished, this
>> + * Memory Region is invalidated using a LOCAL_INV Work Request
>> + * (frmr_op_unmap).
>> + *
>> + * Typically these Work Requests are not signaled, and neither are RDMA
>> + * SEND Work Requests (with the exception of signaling occasionally to
>> + * prevent provider work queue overflows). This greatly reduces HCA
>> + * interrupt workload.
>> + *
>> + * As an optimization, frwr_op_unmap marks MRs INVALID before the
>> + * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
>> + * rb_mws immediately so that no work (like managing a linked list
>> + * under a spinlock) is needed in the completion upcall.
>> + *
>> + * But this means that frwr_op_map() can occasionally encounter an MR
>> + * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
>> + * ordering prevents a subsequent FAST_REG WR from executing against
>> + * that MR while it is still being invalidated.
>> + */
>> +
>> +/* Transport recovery
>> + *
>> + * ->op_map and the transport connect worker cannot run at the same
>> + * time, but ->op_unmap can fire while the transport connect worker
>> + * is running. Thus MR recovery is handled in ->op_map, to guarantee
>> + * that recovered MRs are owned by a sending RPC, and not one where
>> + * ->op_unmap could fire at the same time transport reconnect is
>> + * being done.
>> + *
>> + * When the underlying transport disconnects, MRs are left in one of
>> + * three states:
>> + *
>> + * INVALID: The MR was not in use before the QP entered ERROR state.
>> + * (Or, the LOCAL_INV WR has not completed or flushed yet).
>> + *
>> + * STALE: The MR was being registered or unregistered when the QP
>> + * entered ERROR state, and the pending WR was flushed.
>> + *
>> + * VALID: The MR was registered before the QP entered ERROR state.
>> + *
>> + * When frwr_op_map encounters STALE and VALID MRs, they are recovered
>> + * with ib_dereg_mr and then are re-initialized. Because MR recovery
>> + * allocates fresh resources, it is deferred to a workqueue, and the
>> + * recovered MRs are placed back on the rb_mws list when recovery is
>> + * complete. frwr_op_map allocates another MR for the current RPC while
>> + * the broken MR is reset.
>> + *
>> + * To ensure that frwr_op_map doesn't encounter an MR that is marked
>> + * INVALID but that is about to be flushed due to a previous transport
>> + * disconnect, the transport connect worker attempts to drain all
>> + * pending send queue WRs before the transport is reconnected.
>> + */
>> +
>> #include "xprt_rdma.h"
>>
>> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>> @@ -250,9 +306,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> struct ib_device *device = ia->ri_device;
>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>> struct rpcrdma_mr_seg *seg1 = seg;
>> - struct rpcrdma_mw *mw = seg1->rl_mw;
>> - struct rpcrdma_frmr *frmr = &mw->r.frmr;
>> - struct ib_mr *mr = frmr->fr_mr;
>> + struct rpcrdma_mw *mw;
>> + struct rpcrdma_frmr *frmr;
>> + struct ib_mr *mr;
>> struct ib_send_wr fastreg_wr, *bad_wr;
>> u8 key;
>> int len, pageoff;
>> @@ -261,12 +317,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> u64 pa;
>> int page_no;
>>
>> + mw = seg1->rl_mw;
>> + seg1->rl_mw = NULL;
>> + do {
>> + if (mw)
>> + __frwr_queue_recovery(mw);
>> + mw = rpcrdma_get_mw(r_xprt);
>> + if (!mw)
>> + return -ENOMEM;
>> + } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
>> + frmr = &mw->r.frmr;
>> + frmr->fr_state = FRMR_IS_VALID;
>> +
>> pageoff = offset_in_page(seg1->mr_offset);
>> seg1->mr_offset -= pageoff; /* start of page */
>> seg1->mr_len += pageoff;
>> len = -pageoff;
>> if (nsegs > ia->ri_max_frmr_depth)
>> nsegs = ia->ri_max_frmr_depth;
>> +
>> for (page_no = i = 0; i < nsegs;) {
>> rpcrdma_map_one(device, seg, direction);
>> pa = seg->mr_dma;
>> @@ -285,8 +354,6 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> dprintk("RPC: %s: Using frmr %p to map %d segments (%d
>> bytes)\n",
>> __func__, mw, i, len);
>>
>> - frmr->fr_state = FRMR_IS_VALID;
>> -
>> memset(&fastreg_wr, 0, sizeof(fastreg_wr));
>> fastreg_wr.wr_id = (unsigned long)(void *)mw;
>> fastreg_wr.opcode = IB_WR_FAST_REG_MR;
>> @@ -298,6 +365,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> fastreg_wr.wr.fast_reg.access_flags = writing ?
>> IB_ACCESS_REMOTE_WRITE |
>> IB_ACCESS_LOCAL_WRITE :
>> IB_ACCESS_REMOTE_READ;
>> + mr = frmr->fr_mr;
>> key = (u8)(mr->rkey & 0x000000FF);
>> ib_update_fast_reg_key(mr, ++key);
>> fastreg_wr.wr.fast_reg.rkey = mr->rkey;
>> @@ -307,6 +375,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> if (rc)
>> goto out_senderr;
>>
>> + seg1->rl_mw = mw;
>> seg1->mr_rkey = mr->rkey;
>> seg1->mr_base = seg1->mr_dma + pageoff;
>> seg1->mr_nsegs = i;
>> @@ -315,10 +384,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>>
>> out_senderr:
>> dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
>> - ib_update_fast_reg_key(mr, --key);
>> - frmr->fr_state = FRMR_IS_INVALID;
>> while (i--)
>> rpcrdma_unmap_one(device, --seg);
>> + __frwr_queue_recovery(mw);
>> return rc;
>> }
>>
>> @@ -330,15 +398,19 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg)
>> {
>> struct rpcrdma_mr_seg *seg1 = seg;
>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> + struct rpcrdma_mw *mw = seg1->rl_mw;
>> struct ib_send_wr invalidate_wr, *bad_wr;
>> int rc, nsegs = seg->mr_nsegs;
>>
>> - seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
>> + dprintk("RPC: %s: FRMR %p\n", __func__, mw);
>> +
>> + seg1->rl_mw = NULL;
>> + mw->r.frmr.fr_state = FRMR_IS_INVALID;
>>
>> memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>> - invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
>> + invalidate_wr.wr_id = (unsigned long)(void *)mw;
>> invalidate_wr.opcode = IB_WR_LOCAL_INV;
>> - invalidate_wr.ex.invalidate_rkey =
>> seg1->rl_mw->r.frmr.fr_mr->rkey;
>> + invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
>> DECR_CQCOUNT(&r_xprt->rx_ep);
>>
>> while (seg1->mr_nsegs--)
>> @@ -348,12 +420,13 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg)
>> read_unlock(&ia->ri_qplock);
>> if (rc)
>> goto out_err;
>> +
>> + rpcrdma_put_mw(r_xprt, mw);
>> return nsegs;
>>
>> out_err:
>> - /* Force rpcrdma_buffer_get() to retry */
>> - seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
>> dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
>> + __frwr_queue_recovery(mw);
>> return nsegs;
>> }
>>
>> @@ -370,29 +443,6 @@ out_err:
>> static void
>> frwr_op_reset(struct rpcrdma_xprt *r_xprt)
>> {
>> - struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> - struct ib_device *device = r_xprt->rx_ia.ri_device;
>> - unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
>> - struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
>> - struct rpcrdma_mw *r;
>> - int rc;
>> -
>> - list_for_each_entry(r, &buf->rb_all, mw_all) {
>> - if (r->r.frmr.fr_state == FRMR_IS_INVALID)
>> - continue;
>> -
>> - __frwr_release(r);
>> - rc = __frwr_init(r, pd, device, depth);
>> - if (rc) {
>> - dprintk("RPC: %s: mw %p left %s\n",
>> - __func__, r,
>> - (r->r.frmr.fr_state == FRMR_IS_STALE ?
>> - "stale" : "valid"));
>> - continue;
>> - }
>> -
>> - r->r.frmr.fr_state = FRMR_IS_INVALID;
>> - }
>> }
>>
>> static void
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
>> b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index 98a3b95..35ead0b 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -284,9 +284,6 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct
>> xdr_buf *target,
>> return (unsigned char *)iptr - (unsigned char *)headerp;
>>
>> out:
>> - if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
>> - return n;
>> -
>> for (pos = 0; nchunks--;)
>> pos += r_xprt->rx_ia.ri_ops->ro_unmap(r_xprt,
>>
>> &req->rl_segments[pos]);
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 8a43c7ef..5226161 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -1343,12 +1343,11 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req,
>> struct rpcrdma_buffer *buf,
>> struct rpcrdma_req *
>> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>> {
>> - struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
>> - struct list_head stale;
>> struct rpcrdma_req *req;
>> unsigned long flags;
>>
>> spin_lock_irqsave(&buffers->rb_lock, flags);
>> +
>> if (buffers->rb_send_index == buffers->rb_max_requests) {
>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>> dprintk("RPC: %s: out of request buffers\n",
>> __func__);
>> @@ -1367,17 +1366,7 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>> }
>> buffers->rb_send_bufs[buffers->rb_send_index++] = NULL;
>>
>> - INIT_LIST_HEAD(&stale);
>> - switch (ia->ri_memreg_strategy) {
>> - case RPCRDMA_FRMR:
>> - req = rpcrdma_buffer_get_frmrs(req, buffers, &stale);
>> - break;
>> - default:
>> - break;
>> - }
>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>> - if (!list_empty(&stale))
>> - rpcrdma_retry_flushed_linv(&stale, buffers);
>> return req;
>> }
>>
>> @@ -1389,18 +1378,10 @@ void
>> rpcrdma_buffer_put(struct rpcrdma_req *req)
>> {
>> struct rpcrdma_buffer *buffers = req->rl_buffer;
>> - struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
>> unsigned long flags;
>>
>> spin_lock_irqsave(&buffers->rb_lock, flags);
>> rpcrdma_buffer_put_sendbuf(req, buffers);
>> - switch (ia->ri_memreg_strategy) {
>> - case RPCRDMA_FRMR:
>> - rpcrdma_buffer_put_mrs(req, buffers);
>> - break;
>> - default:
>> - break;
>> - }
>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>> }
>>
>>
>
> Don't you need a call to flush_workqueue(frwr_recovery_wq) when you're
> about to destroy the endpoint (and the buffers and the MRs...)?

I agree with Sagi here, in xprt_rdma_destroy() before calling
rpcrdma_destroy_buffer(), flush_workqueue and cancelling any pending
work seems required.
With the optimization, is it possible that, before a REG-FRMR has
completed, the server starts traffic on an rkey whose registration is
not yet complete? That would cause an access-error async event on the
client side, and the server would see flush errors.

>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
-Regards
Devesh

2015-05-08 15:31:21

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 09/14] xprtrdma: Remove unused LOCAL_INV recovery logic

Reviewed-by: Devesh Sharma <[email protected]>

On Thu, May 7, 2015 at 4:05 PM, Sagi Grimberg <[email protected]> wrote:
> On 5/4/2015 8:58 PM, Chuck Lever wrote:
>>
>> Clean up: Remove functions no longer used to recover broken FRMRs.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/verbs.c | 109
>> -------------------------------------------
>> 1 file changed, 109 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 5226161..5120a8e 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -1210,33 +1210,6 @@ rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mw *mw)
>> spin_unlock_irqrestore(&buf->rb_lock, flags);
>> }
>>
>> -/* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
>> - * some req segments uninitialized.
>> - */
>> -static void
>> -rpcrdma_buffer_put_mr(struct rpcrdma_mw **mw, struct rpcrdma_buffer *buf)
>> -{
>> - if (*mw) {
>> - list_add_tail(&(*mw)->mw_list, &buf->rb_mws);
>> - *mw = NULL;
>> - }
>> -}
>> -
>> -/* Cycle mw's back in reverse order, and "spin" them.
>> - * This delays and scrambles reuse as much as possible.
>> - */
>> -static void
>> -rpcrdma_buffer_put_mrs(struct rpcrdma_req *req, struct rpcrdma_buffer
>> *buf)
>> -{
>> - struct rpcrdma_mr_seg *seg = req->rl_segments;
>> - struct rpcrdma_mr_seg *seg1 = seg;
>> - int i;
>> -
>> - for (i = 1, seg++; i < RPCRDMA_MAX_SEGS; seg++, i++)
>> - rpcrdma_buffer_put_mr(&seg->rl_mw, buf);
>> - rpcrdma_buffer_put_mr(&seg1->rl_mw, buf);
>> -}
>> -
>> static void
>> rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req, struct
>> rpcrdma_buffer *buf)
>> {
>> @@ -1249,88 +1222,6 @@ rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req,
>> struct rpcrdma_buffer *buf)
>> }
>> }
>>
>> -/* rpcrdma_unmap_one() was already done during deregistration.
>> - * Redo only the ib_post_send().
>> - */
>> -static void
>> -rpcrdma_retry_local_inv(struct rpcrdma_mw *r, struct rpcrdma_ia *ia)
>> -{
>> - struct rpcrdma_xprt *r_xprt =
>> - container_of(ia, struct rpcrdma_xprt,
>> rx_ia);
>> - struct ib_send_wr invalidate_wr, *bad_wr;
>> - int rc;
>> -
>> - dprintk("RPC: %s: FRMR %p is stale\n", __func__, r);
>> -
>> - /* When this FRMR is re-inserted into rb_mws, it is no longer
>> stale */
>> - r->r.frmr.fr_state = FRMR_IS_INVALID;
>> -
>> - memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>> - invalidate_wr.wr_id = (unsigned long)(void *)r;
>> - invalidate_wr.opcode = IB_WR_LOCAL_INV;
>> - invalidate_wr.ex.invalidate_rkey = r->r.frmr.fr_mr->rkey;
>> - DECR_CQCOUNT(&r_xprt->rx_ep);
>> -
>> - dprintk("RPC: %s: frmr %p invalidating rkey %08x\n",
>> - __func__, r, r->r.frmr.fr_mr->rkey);
>> -
>> - read_lock(&ia->ri_qplock);
>> - rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
>> - read_unlock(&ia->ri_qplock);
>> - if (rc) {
>> - /* Force rpcrdma_buffer_get() to retry */
>> - r->r.frmr.fr_state = FRMR_IS_STALE;
>> - dprintk("RPC: %s: ib_post_send failed, %i\n",
>> - __func__, rc);
>> - }
>> -}
>> -
>> -static void
>> -rpcrdma_retry_flushed_linv(struct list_head *stale,
>> - struct rpcrdma_buffer *buf)
>> -{
>> - struct rpcrdma_ia *ia = rdmab_to_ia(buf);
>> - struct list_head *pos;
>> - struct rpcrdma_mw *r;
>> - unsigned long flags;
>> -
>> - list_for_each(pos, stale) {
>> - r = list_entry(pos, struct rpcrdma_mw, mw_list);
>> - rpcrdma_retry_local_inv(r, ia);
>> - }
>> -
>> - spin_lock_irqsave(&buf->rb_lock, flags);
>> - list_splice_tail(stale, &buf->rb_mws);
>> - spin_unlock_irqrestore(&buf->rb_lock, flags);
>> -}
>> -
>> -static struct rpcrdma_req *
>> -rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer
>> *buf,
>> - struct list_head *stale)
>> -{
>> - struct rpcrdma_mw *r;
>> - int i;
>> -
>> - i = RPCRDMA_MAX_SEGS - 1;
>> - while (!list_empty(&buf->rb_mws)) {
>> - r = list_entry(buf->rb_mws.next,
>> - struct rpcrdma_mw, mw_list);
>> - list_del(&r->mw_list);
>> - if (r->r.frmr.fr_state == FRMR_IS_STALE) {
>> - list_add(&r->mw_list, stale);
>> - continue;
>> - }
>> - req->rl_segments[i].rl_mw = r;
>> - if (unlikely(i-- == 0))
>> - return req; /* Success */
>> - }
>> -
>> - /* Not enough entries on rb_mws for this req */
>> - rpcrdma_buffer_put_sendbuf(req, buf);
>> - rpcrdma_buffer_put_mrs(req, buf);
>> - return NULL;
>> -}
>> -
>> /*
>> * Get a set of request/reply buffers.
>> *
>>
>
> Looks good,
>
> Reviewed-by: Sagi Grimberg <[email protected]>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
-Regards
Devesh

2015-05-08 15:33:25

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 10/14] xprtrdma: Remove ->ro_reset

Reviewed-by: Devesh Sharma <[email protected]>

On Thu, May 7, 2015 at 4:06 PM, Sagi Grimberg <[email protected]> wrote:
> On 5/4/2015 8:58 PM, Chuck Lever wrote:
>>
>> An RPC can exit at any time. When it does so, xprt_rdma_free() is
>> called, and it calls ->op_unmap().
>>
>> If ->ro_reset() is running due to a transport disconnect, the two
>> methods can race while processing the same rpcrdma_mw. The results
>> are unpredictable.
>>
>> Because of this, in previous patches I've replaced the ->ro_reset()
>> methods with a recovery workqueue. ->ro_reset() is no longer used
>> and can be removed.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/fmr_ops.c | 11 -----------
>> net/sunrpc/xprtrdma/frwr_ops.c | 16 ----------------
>> net/sunrpc/xprtrdma/physical_ops.c | 6 ------
>> net/sunrpc/xprtrdma/verbs.c | 2 --
>> net/sunrpc/xprtrdma/xprt_rdma.h | 1 -
>> 5 files changed, 36 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index ad0055b..5dd77da 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -197,16 +197,6 @@ out_err:
>> return nsegs;
>> }
>>
>> -/* After a disconnect, unmap all FMRs.
>> - *
>> - * This is invoked only in the transport connect worker in order
>> - * to serialize with rpcrdma_register_fmr_external().
>> - */
>> -static void
>> -fmr_op_reset(struct rpcrdma_xprt *r_xprt)
>> -{
>> -}
>> -
>> static void
>> fmr_op_destroy(struct rpcrdma_buffer *buf)
>> {
>> @@ -230,7 +220,6 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops
>> = {
>> .ro_open = fmr_op_open,
>> .ro_maxpages = fmr_op_maxpages,
>> .ro_init = fmr_op_init,
>> - .ro_reset = fmr_op_reset,
>> .ro_destroy = fmr_op_destroy,
>> .ro_displayname = "fmr",
>> };
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c
>> b/net/sunrpc/xprtrdma/frwr_ops.c
>> index 6f93a89..3fb609a 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -430,21 +430,6 @@ out_err:
>> return nsegs;
>> }
>>
>> -/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
>> - * an unusable state. Find FRMRs in this state and dereg / reg
>> - * each. FRMRs that are VALID and attached to an rpcrdma_req are
>> - * also torn down.
>> - *
>> - * This gives all in-use FRMRs a fresh rkey and leaves them INVALID.
>> - *
>> - * This is invoked only in the transport connect worker in order
>> - * to serialize with rpcrdma_register_frmr_external().
>> - */
>> -static void
>> -frwr_op_reset(struct rpcrdma_xprt *r_xprt)
>> -{
>> -}
>> -
>> static void
>> frwr_op_destroy(struct rpcrdma_buffer *buf)
>> {
>> @@ -464,7 +449,6 @@ const struct rpcrdma_memreg_ops
>> rpcrdma_frwr_memreg_ops = {
>> .ro_open = frwr_op_open,
>> .ro_maxpages = frwr_op_maxpages,
>> .ro_init = frwr_op_init,
>> - .ro_reset = frwr_op_reset,
>> .ro_destroy = frwr_op_destroy,
>> .ro_displayname = "frwr",
>> };
>> diff --git a/net/sunrpc/xprtrdma/physical_ops.c
>> b/net/sunrpc/xprtrdma/physical_ops.c
>> index da149e8..41985d0 100644
>> --- a/net/sunrpc/xprtrdma/physical_ops.c
>> +++ b/net/sunrpc/xprtrdma/physical_ops.c
>> @@ -69,11 +69,6 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg)
>> }
>>
>> static void
>> -physical_op_reset(struct rpcrdma_xprt *r_xprt)
>> -{
>> -}
>> -
>> -static void
>> physical_op_destroy(struct rpcrdma_buffer *buf)
>> {
>> }
>> @@ -84,7 +79,6 @@ const struct rpcrdma_memreg_ops
>> rpcrdma_physical_memreg_ops = {
>> .ro_open = physical_op_open,
>> .ro_maxpages = physical_op_maxpages,
>> .ro_init = physical_op_init,
>> - .ro_reset = physical_op_reset,
>> .ro_destroy = physical_op_destroy,
>> .ro_displayname = "physical",
>> };
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 5120a8e..eaf0b9d 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -897,8 +897,6 @@ retry:
>> rpcrdma_flush_cqs(ep);
>>
>> xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
>> - ia->ri_ops->ro_reset(xprt);
>> -
>> id = rpcrdma_create_id(xprt, ia,
>> (struct sockaddr *)&xprt->rx_data.addr);
>> if (IS_ERR(id)) {
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 98227d6..6a1e565 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -353,7 +353,6 @@ struct rpcrdma_memreg_ops {
>> struct rpcrdma_create_data_internal *);
>> size_t (*ro_maxpages)(struct rpcrdma_xprt *);
>> int (*ro_init)(struct rpcrdma_xprt *);
>> - void (*ro_reset)(struct rpcrdma_xprt *);
>> void (*ro_destroy)(struct rpcrdma_buffer *);
>> const char *ro_displayname;
>> };
>>
>
> Looks good,
>
> Reviewed-by: Sagi Grimberg <[email protected]>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
-Regards
Devesh

2015-05-08 15:34:47

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 11/14] xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy

Reviewed-by: Devesh Sharma <[email protected]>

On Thu, May 7, 2015 at 4:06 PM, Sagi Grimberg <[email protected]> wrote:
> On 5/4/2015 8:58 PM, Chuck Lever wrote:
>>
>> Clean up: This field is no longer used.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/xprtrdma.h | 3 ++-
>> net/sunrpc/xprtrdma/verbs.c | 3 ---
>> net/sunrpc/xprtrdma/xprt_rdma.h | 1 -
>> 3 files changed, 2 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/xprtrdma.h
>> b/include/linux/sunrpc/xprtrdma.h
>> index c984c85..b176130 100644
>> --- a/include/linux/sunrpc/xprtrdma.h
>> +++ b/include/linux/sunrpc/xprtrdma.h
>> @@ -56,7 +56,8 @@
>>
>> #define RPCRDMA_INLINE_PAD_THRESH (512)/* payload threshold to pad
>> (bytes) */
>>
>> -/* memory registration strategies */
>> +/* Memory registration strategies, by number.
>> + * This is part of a kernel / user space API. Do not remove. */
>> enum rpcrdma_memreg {
>> RPCRDMA_BOUNCEBUFFERS = 0,
>> RPCRDMA_REGISTER,
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index eaf0b9d..1f51547 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -671,9 +671,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct
>> sockaddr *addr, int memreg)
>> dprintk("RPC: %s: memory registration strategy is '%s'\n",
>> __func__, ia->ri_ops->ro_displayname);
>>
>> - /* Else will do memory reg/dereg for each chunk */
>> - ia->ri_memreg_strategy = memreg;
>> -
>> rwlock_init(&ia->ri_qplock);
>> return 0;
>>
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 6a1e565..5650c23 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -70,7 +70,6 @@ struct rpcrdma_ia {
>> int ri_have_dma_lkey;
>> struct completion ri_done;
>> int ri_async_rc;
>> - enum rpcrdma_memreg ri_memreg_strategy;
>> unsigned int ri_max_frmr_depth;
>> struct ib_device_attr ri_devattr;
>> struct ib_qp_attr ri_qp_attr;
>>
>
> Looks good,
>
> Reviewed-by: Sagi Grimberg <[email protected]>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
-Regards
Devesh

2015-05-08 15:36:59

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 13/14] xprtrdma: Stack relief in fmr_op_map()

Reviewed-by: Devesh Sharma <[email protected]>

On Thu, May 7, 2015 at 4:20 PM, Sagi Grimberg <[email protected]> wrote:
> On 5/4/2015 8:58 PM, Chuck Lever wrote:
>>
>> fmr_op_map() declares a 64 element array of u64 in automatic
>> storage. This is 512 bytes (8 * 64) on the stack.
>>
>> Instead, when FMR memory registration is in use, pre-allocate a
>> physaddr array for each rpcrdma_mw.
>>
>> This is a pre-requisite for increasing the r/wsize maximum for
>> FMR on platforms with 4KB pages.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/fmr_ops.c | 32 ++++++++++++++++++++++----------
>> net/sunrpc/xprtrdma/xprt_rdma.h | 7 ++++++-
>> 2 files changed, 28 insertions(+), 11 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 52f9ad5..4a53ad5 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -72,13 +72,19 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
>> i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
>> dprintk("RPC: %s: initializing %d FMRs\n", __func__, i);
>>
>> + rc = -ENOMEM;
>> while (i--) {
>> r = kzalloc(sizeof(*r), GFP_KERNEL);
>> if (!r)
>> - return -ENOMEM;
>> + goto out;
>> +
>> + r->r.fmr.physaddrs = kmalloc(RPCRDMA_MAX_FMR_SGES *
>> + sizeof(u64), GFP_KERNEL);
>> + if (!r->r.fmr.physaddrs)
>> + goto out_free;
>>
>> - r->r.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
>> - if (IS_ERR(r->r.fmr))
>> + r->r.fmr.fmr = ib_alloc_fmr(pd, mr_access_flags,
>> &fmr_attr);
>> + if (IS_ERR(r->r.fmr.fmr))
>> goto out_fmr_err;
>>
>> list_add(&r->mw_list, &buf->rb_mws);
>> @@ -87,9 +93,12 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
>> return 0;
>>
>> out_fmr_err:
>> - rc = PTR_ERR(r->r.fmr);
>> + rc = PTR_ERR(r->r.fmr.fmr);
>> dprintk("RPC: %s: ib_alloc_fmr status %i\n", __func__, rc);
>> + kfree(r->r.fmr.physaddrs);
>> +out_free:
>> kfree(r);
>> +out:
>> return rc;
>> }
>>
>> @@ -98,7 +107,7 @@ __fmr_unmap(struct rpcrdma_mw *r)
>> {
>> LIST_HEAD(l);
>>
>> - list_add(&r->r.fmr->list, &l);
>> + list_add(&r->r.fmr.fmr->list, &l);
>> return ib_unmap_fmr(&l);
>> }
>>
>> @@ -113,7 +122,6 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> struct ib_device *device = ia->ri_device;
>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>> struct rpcrdma_mr_seg *seg1 = seg;
>> - u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
>> int len, pageoff, i, rc;
>> struct rpcrdma_mw *mw;
>>
>> @@ -138,7 +146,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> nsegs = RPCRDMA_MAX_FMR_SGES;
>> for (i = 0; i < nsegs;) {
>> rpcrdma_map_one(device, seg, direction);
>> - physaddrs[i] = seg->mr_dma;
>> + mw->r.fmr.physaddrs[i] = seg->mr_dma;
>> len += seg->mr_len;
>> ++seg;
>> ++i;
>> @@ -148,12 +156,13 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> break;
>> }
>>
>> - rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
>> + rc = ib_map_phys_fmr(mw->r.fmr.fmr, mw->r.fmr.physaddrs,
>> + i, seg1->mr_dma);
>> if (rc)
>> goto out_maperr;
>>
>> seg1->rl_mw = mw;
>> - seg1->mr_rkey = mw->r.fmr->rkey;
>> + seg1->mr_rkey = mw->r.fmr.fmr->rkey;
>> seg1->mr_base = seg1->mr_dma + pageoff;
>> seg1->mr_nsegs = i;
>> seg1->mr_len = len;
>> @@ -207,10 +216,13 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
>> while (!list_empty(&buf->rb_all)) {
>> r = list_entry(buf->rb_all.next, struct rpcrdma_mw,
>> mw_all);
>> list_del(&r->mw_all);
>> - rc = ib_dealloc_fmr(r->r.fmr);
>> + kfree(r->r.fmr.physaddrs);
>> +
>> + rc = ib_dealloc_fmr(r->r.fmr.fmr);
>> if (rc)
>> dprintk("RPC: %s: ib_dealloc_fmr failed
>> %i\n",
>> __func__, rc);
>> +
>> kfree(r);
>> }
>> }
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index ae31fc7..e176bae 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -207,9 +207,14 @@ struct rpcrdma_frmr {
>> struct rpcrdma_xprt *fr_xprt;
>> };
>>
>> +struct rpcrdma_fmr {
>> + struct ib_fmr *fmr;
>> + u64 *physaddrs;
>> +};
>> +
>> struct rpcrdma_mw {
>> union {
>> - struct ib_fmr *fmr;
>> + struct rpcrdma_fmr fmr;
>> struct rpcrdma_frmr frmr;
>> } r;
>> void (*mw_sendcompletion)(struct ib_wc *);
>>
>
> Looks good
>
> Reviewed-by: Sagi Grimberg <[email protected]>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
-Regards
Devesh

2015-05-08 15:39:36

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] xprtrdma: Acquire MRs in rpcrdma_register_external()


On May 8, 2015, at 11:24 AM, Devesh Sharma <[email protected]> wrote:

> On Thu, May 7, 2015 at 4:01 PM, Sagi Grimberg <[email protected]> wrote:
>> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>>>
>>> Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
>>> pool lock is expensive, and unnecessary because most modern adapters
>>> can transfer 100s of KBs of payload using just a single MR.
>>>
>>> Instead, acquire MRs one-at-a-time as chunks are registered, and
>>> return them to rb_mws immediately during deregistration.
>>>
>>> Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
>>> registration fails") is reverted: There is now a valid case where
>>> registration can fail (with -ENOMEM) but the QP is still in RTS.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> net/sunrpc/xprtrdma/frwr_ops.c | 120
>>> ++++++++++++++++++++++++++++------------
>>> net/sunrpc/xprtrdma/rpc_rdma.c | 3 -
>>> net/sunrpc/xprtrdma/verbs.c | 21 -------
>>> 3 files changed, 86 insertions(+), 58 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c
>>> b/net/sunrpc/xprtrdma/frwr_ops.c
>>> index a06d9a3..6f93a89 100644
>>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>>> @@ -11,6 +11,62 @@
>>> * but most complex memory registration mode.
>>> */
>>>
>>> +/* Normal operation
>>> + *
>>> + * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
>>> + * Work Request (frmr_op_map). When the RDMA operation is finished, this
>>> + * Memory Region is invalidated using a LOCAL_INV Work Request
>>> + * (frmr_op_unmap).
>>> + *
>>> + * Typically these Work Requests are not signaled, and neither are RDMA
>>> + * SEND Work Requests (with the exception of signaling occasionally to
>>> + * prevent provider work queue overflows). This greatly reduces HCA
>>> + * interrupt workload.
>>> + *
>>> + * As an optimization, frwr_op_unmap marks MRs INVALID before the
>>> + * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
>>> + * rb_mws immediately so that no work (like managing a linked list
>>> + * under a spinlock) is needed in the completion upcall.
>>> + *
>>> + * But this means that frwr_op_map() can occasionally encounter an MR
>>> + * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
>>> + * ordering prevents a subsequent FAST_REG WR from executing against
>>> + * that MR while it is still being invalidated.
>>> + */
>>> +
>>> +/* Transport recovery
>>> + *
>>> + * ->op_map and the transport connect worker cannot run at the same
>>> + * time, but ->op_unmap can fire while the transport connect worker
>>> + * is running. Thus MR recovery is handled in ->op_map, to guarantee
>>> + * that recovered MRs are owned by a sending RPC, and not one where
>>> + * ->op_unmap could fire at the same time transport reconnect is
>>> + * being done.
>>> + *
>>> + * When the underlying transport disconnects, MRs are left in one of
>>> + * three states:
>>> + *
>>> + * INVALID: The MR was not in use before the QP entered ERROR state.
>>> + * (Or, the LOCAL_INV WR has not completed or flushed yet).
>>> + *
>>> + * STALE: The MR was being registered or unregistered when the QP
>>> + * entered ERROR state, and the pending WR was flushed.
>>> + *
>>> + * VALID: The MR was registered before the QP entered ERROR state.
>>> + *
>>> + * When frwr_op_map encounters STALE and VALID MRs, they are recovered
>>> + * with ib_dereg_mr and then are re-initialized. Because MR recovery
>>> + * allocates fresh resources, it is deferred to a workqueue, and the
>>> + * recovered MRs are placed back on the rb_mws list when recovery is
>>> + * complete. frwr_op_map allocates another MR for the current RPC while
>>> + * the broken MR is reset.
>>> + *
>>> + * To ensure that frwr_op_map doesn't encounter an MR that is marked
>>> + * INVALID but that is about to be flushed due to a previous transport
>>> + * disconnect, the transport connect worker attempts to drain all
>>> + * pending send queue WRs before the transport is reconnected.
>>> + */
>>> +
>>> #include "xprt_rdma.h"
>>>
>>> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>>> @@ -250,9 +306,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>>> rpcrdma_mr_seg *seg,
>>> struct ib_device *device = ia->ri_device;
>>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>>> struct rpcrdma_mr_seg *seg1 = seg;
>>> - struct rpcrdma_mw *mw = seg1->rl_mw;
>>> - struct rpcrdma_frmr *frmr = &mw->r.frmr;
>>> - struct ib_mr *mr = frmr->fr_mr;
>>> + struct rpcrdma_mw *mw;
>>> + struct rpcrdma_frmr *frmr;
>>> + struct ib_mr *mr;
>>> struct ib_send_wr fastreg_wr, *bad_wr;
>>> u8 key;
>>> int len, pageoff;
>>> @@ -261,12 +317,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>>> rpcrdma_mr_seg *seg,
>>> u64 pa;
>>> int page_no;
>>>
>>> + mw = seg1->rl_mw;
>>> + seg1->rl_mw = NULL;
>>> + do {
>>> + if (mw)
>>> + __frwr_queue_recovery(mw);
>>> + mw = rpcrdma_get_mw(r_xprt);
>>> + if (!mw)
>>> + return -ENOMEM;
>>> + } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
>>> + frmr = &mw->r.frmr;
>>> + frmr->fr_state = FRMR_IS_VALID;
>>> +
>>> pageoff = offset_in_page(seg1->mr_offset);
>>> seg1->mr_offset -= pageoff; /* start of page */
>>> seg1->mr_len += pageoff;
>>> len = -pageoff;
>>> if (nsegs > ia->ri_max_frmr_depth)
>>> nsegs = ia->ri_max_frmr_depth;
>>> +
>>> for (page_no = i = 0; i < nsegs;) {
>>> rpcrdma_map_one(device, seg, direction);
>>> pa = seg->mr_dma;
>>> @@ -285,8 +354,6 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>>> rpcrdma_mr_seg *seg,
>>> dprintk("RPC: %s: Using frmr %p to map %d segments (%d
>>> bytes)\n",
>>> __func__, mw, i, len);
>>>
>>> - frmr->fr_state = FRMR_IS_VALID;
>>> -
>>> memset(&fastreg_wr, 0, sizeof(fastreg_wr));
>>> fastreg_wr.wr_id = (unsigned long)(void *)mw;
>>> fastreg_wr.opcode = IB_WR_FAST_REG_MR;
>>> @@ -298,6 +365,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>>> rpcrdma_mr_seg *seg,
>>> fastreg_wr.wr.fast_reg.access_flags = writing ?
>>> IB_ACCESS_REMOTE_WRITE |
>>> IB_ACCESS_LOCAL_WRITE :
>>> IB_ACCESS_REMOTE_READ;
>>> + mr = frmr->fr_mr;
>>> key = (u8)(mr->rkey & 0x000000FF);
>>> ib_update_fast_reg_key(mr, ++key);
>>> fastreg_wr.wr.fast_reg.rkey = mr->rkey;
>>> @@ -307,6 +375,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>>> rpcrdma_mr_seg *seg,
>>> if (rc)
>>> goto out_senderr;
>>>
>>> + seg1->rl_mw = mw;
>>> seg1->mr_rkey = mr->rkey;
>>> seg1->mr_base = seg1->mr_dma + pageoff;
>>> seg1->mr_nsegs = i;
>>> @@ -315,10 +384,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>>> rpcrdma_mr_seg *seg,
>>>
>>> out_senderr:
>>> dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
>>> - ib_update_fast_reg_key(mr, --key);
>>> - frmr->fr_state = FRMR_IS_INVALID;
>>> while (i--)
>>> rpcrdma_unmap_one(device, --seg);
>>> + __frwr_queue_recovery(mw);
>>> return rc;
>>> }
>>>
>>> @@ -330,15 +398,19 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct
>>> rpcrdma_mr_seg *seg)
>>> {
>>> struct rpcrdma_mr_seg *seg1 = seg;
>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>> + struct rpcrdma_mw *mw = seg1->rl_mw;
>>> struct ib_send_wr invalidate_wr, *bad_wr;
>>> int rc, nsegs = seg->mr_nsegs;
>>>
>>> - seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
>>> + dprintk("RPC: %s: FRMR %p\n", __func__, mw);
>>> +
>>> + seg1->rl_mw = NULL;
>>> + mw->r.frmr.fr_state = FRMR_IS_INVALID;
>>>
>>> memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>>> - invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
>>> + invalidate_wr.wr_id = (unsigned long)(void *)mw;
>>> invalidate_wr.opcode = IB_WR_LOCAL_INV;
>>> - invalidate_wr.ex.invalidate_rkey =
>>> seg1->rl_mw->r.frmr.fr_mr->rkey;
>>> + invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
>>> DECR_CQCOUNT(&r_xprt->rx_ep);
>>>
>>> while (seg1->mr_nsegs--)
>>> @@ -348,12 +420,13 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct
>>> rpcrdma_mr_seg *seg)
>>> read_unlock(&ia->ri_qplock);
>>> if (rc)
>>> goto out_err;
>>> +
>>> + rpcrdma_put_mw(r_xprt, mw);
>>> return nsegs;
>>>
>>> out_err:
>>> - /* Force rpcrdma_buffer_get() to retry */
>>> - seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
>>> dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
>>> + __frwr_queue_recovery(mw);
>>> return nsegs;
>>> }
>>>
>>> @@ -370,29 +443,6 @@ out_err:
>>> static void
>>> frwr_op_reset(struct rpcrdma_xprt *r_xprt)
>>> {
>>> - struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>> - struct ib_device *device = r_xprt->rx_ia.ri_device;
>>> - unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
>>> - struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
>>> - struct rpcrdma_mw *r;
>>> - int rc;
>>> -
>>> - list_for_each_entry(r, &buf->rb_all, mw_all) {
>>> - if (r->r.frmr.fr_state == FRMR_IS_INVALID)
>>> - continue;
>>> -
>>> - __frwr_release(r);
>>> - rc = __frwr_init(r, pd, device, depth);
>>> - if (rc) {
>>> - dprintk("RPC: %s: mw %p left %s\n",
>>> - __func__, r,
>>> - (r->r.frmr.fr_state == FRMR_IS_STALE ?
>>> - "stale" : "valid"));
>>> - continue;
>>> - }
>>> -
>>> - r->r.frmr.fr_state = FRMR_IS_INVALID;
>>> - }
>>> }
>>>
>>> static void
>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
>>> b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> index 98a3b95..35ead0b 100644
>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> @@ -284,9 +284,6 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct
>>> xdr_buf *target,
>>> return (unsigned char *)iptr - (unsigned char *)headerp;
>>>
>>> out:
>>> - if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
>>> - return n;
>>> -
>>> for (pos = 0; nchunks--;)
>>> pos += r_xprt->rx_ia.ri_ops->ro_unmap(r_xprt,
>>>
>>> &req->rl_segments[pos]);
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 8a43c7ef..5226161 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -1343,12 +1343,11 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req,
>>> struct rpcrdma_buffer *buf,
>>> struct rpcrdma_req *
>>> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>>> {
>>> - struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
>>> - struct list_head stale;
>>> struct rpcrdma_req *req;
>>> unsigned long flags;
>>>
>>> spin_lock_irqsave(&buffers->rb_lock, flags);
>>> +
>>> if (buffers->rb_send_index == buffers->rb_max_requests) {
>>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>>> dprintk("RPC: %s: out of request buffers\n",
>>> __func__);
>>> @@ -1367,17 +1366,7 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>>> }
>>> buffers->rb_send_bufs[buffers->rb_send_index++] = NULL;
>>>
>>> - INIT_LIST_HEAD(&stale);
>>> - switch (ia->ri_memreg_strategy) {
>>> - case RPCRDMA_FRMR:
>>> - req = rpcrdma_buffer_get_frmrs(req, buffers, &stale);
>>> - break;
>>> - default:
>>> - break;
>>> - }
>>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>>> - if (!list_empty(&stale))
>>> - rpcrdma_retry_flushed_linv(&stale, buffers);
>>> return req;
>>> }
>>>
>>> @@ -1389,18 +1378,10 @@ void
>>> rpcrdma_buffer_put(struct rpcrdma_req *req)
>>> {
>>> struct rpcrdma_buffer *buffers = req->rl_buffer;
>>> - struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
>>> unsigned long flags;
>>>
>>> spin_lock_irqsave(&buffers->rb_lock, flags);
>>> rpcrdma_buffer_put_sendbuf(req, buffers);
>>> - switch (ia->ri_memreg_strategy) {
>>> - case RPCRDMA_FRMR:
>>> - rpcrdma_buffer_put_mrs(req, buffers);
>>> - break;
>>> - default:
>>> - break;
>>> - }
>>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>>> }
>>>
>>>
>>
>> Don't you need a call to flush_workqueue(frwr_recovery_wq) when you're
>> about to destroy the endpoint (and the buffers and the MRs...)?
>
> I agree with Sagi here, in xprt_rdma_destroy() before calling
> rpcrdma_destroy_buffer(), flush_workqueue and cancelling any pending
> work seems required.

The buffer list is destroyed only when all work has completed on the
transport (no RPCs are outstanding, and the upper layer is shutting
down). It's pretty unlikely that there will be ongoing recovery work
at this point.

That said, would it be enough to add a defensive call to flush_workqueue()
at the top of frwr_op_destroy() ?

> With the optimization, is it possible that, before a REG-FRMR has
> completed, the server starts traffic on an rkey whose registration is
> not yet complete? That would cause an access-error async event on the
> client side, and the server would see flush errors.

The server starts driving RDMA transfers only after it receives a
retransmitted RPC (with a fresh rkey). The client can't retransmit
an RPC until it has fresh "invalid" MRs to use.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2015-05-08 15:53:48

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] xprtrmda: Reduce per-transport MR allocation

Reviewed-by: Devesh Sharma <[email protected]>

On Thu, May 7, 2015 at 4:30 PM, Sagi Grimberg <[email protected]> wrote:
> On 5/4/2015 8:58 PM, Chuck Lever wrote:
>>
>> Reduce resource consumption per-transport to make way for increasing
>> the credit limit and maximum r/wsize. Pre-allocate fewer MRs.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/fmr_ops.c | 6 ++++--
>> net/sunrpc/xprtrdma/frwr_ops.c | 6 ++++--
>> 2 files changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 4a53ad5..f1e8daf 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -69,8 +69,10 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
>> INIT_LIST_HEAD(&buf->rb_mws);
>> INIT_LIST_HEAD(&buf->rb_all);
>>
>> - i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
>> - dprintk("RPC: %s: initializing %d FMRs\n", __func__, i);
>> + i = max_t(int, RPCRDMA_MAX_DATA_SEGS / RPCRDMA_MAX_FMR_SGES, 1);
>> + i += 2; /* head + tail */
>> + i *= buf->rb_max_requests; /* one set for each RPC slot */
>> + dprintk("RPC: %s: initalizing %d FMRs\n", __func__, i);
>>
>> rc = -ENOMEM;
>> while (i--) {
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c
>> b/net/sunrpc/xprtrdma/frwr_ops.c
>> index edc10ba..fc2d0c6 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -270,8 +270,10 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
>> INIT_LIST_HEAD(&buf->rb_mws);
>> INIT_LIST_HEAD(&buf->rb_all);
>>
>> - i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
>> - dprintk("RPC: %s: initializing %d FRMRs\n", __func__, i);
>> + i = max_t(int, RPCRDMA_MAX_DATA_SEGS / depth, 1);
>> + i += 2; /* head + tail */
>> + i *= buf->rb_max_requests; /* one set for each RPC slot */
>> + dprintk("RPC: %s: initalizing %d FRMRs\n", __func__, i);
>>
>> while (i--) {
>> struct rpcrdma_mw *r;
>>
>
> Looks good.
>
> Reviewed-by: Sagi Grimberg <[email protected]>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
-Regards
Devesh
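
To make the resource reduction concrete: assuming, purely for
illustration, that RPCRDMA_MAX_SEGS and RPCRDMA_MAX_DATA_SEGS are both
64, that RPCRDMA_MAX_FMR_SGES is 64, and that rb_max_requests is 32
(these values are not stated in this thread), the two formulas work out
roughly as follows:

	/* Old scheme: one full set of MRs per RPC slot, plus one spare set */
	i = (32 + 1) * 64;		/* (rb_max_requests + 1) * RPCRDMA_MAX_SEGS = 2112 MRs */

	/* New scheme: enough MRs to cover one maximum-size payload, plus
	 * head and tail chunks, for each RPC slot */
	i = max_t(int, 64 / 64, 1);	/* one MR covers the largest data payload */
	i += 2;				/* + head + tail = 3 MRs per slot */
	i *= 32;			/* * rb_max_requests = 96 MRs per transport */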

2015-05-10 10:17:43

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] xprtrdma: Acquire MRs in rpcrdma_register_external()

On 5/8/2015 6:40 PM, Chuck Lever wrote:

>>>
>>> Don't you need a call to flush_workqueue(frwr_recovery_wq) when you're
>>> about to destroy the endpoint (and the buffers and the MRs...)?
>>
>> I agree with Sagi here, in xprt_rdma_destroy() before calling
>> rpcrdma_destroy_buffer(), flush_workqueue and cancelling any pending
>> work seems required.
>
> The buffer list is destroyed only when all work has completed on the
> transport (no RPCs are outstanding, and the upper layer is shutting
> down). It's pretty unlikely that there will be ongoing recovery work
> at this point.

It may be that there aren't any outstanding RPCs, but it is possible
that those that finished queued frwr recovery work if the QP flushed
in-flight frwr's.

>
> That said, would it be enough to add a defensive call to flush_workqueue()
> at the top of frwr_op_destroy() ?

If at this point you can guarantee that no one will queue another frwr
work (i.e. all flush errors were consumed), then yes, I think
flush_workqueue() would do the job.

Sagi.
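
For the record, a minimal sketch of the defensive flush being agreed on
here, assuming the recovery workqueue introduced earlier in the series
is named frwr_recovery_wq (the rest of the teardown body is elided):

	static void
	frwr_op_destroy(struct rpcrdma_buffer *buf)
	{
		/* Wait for queued FRMR recovery work to finish before the
		 * MRs on buf->rb_all are released. This is safe only if no
		 * new recovery work can be queued once teardown starts. */
		flush_workqueue(frwr_recovery_wq);

		/* ... existing release of the MRs on buf->rb_all ... */
	}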

2015-05-11 15:22:46

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely


On May 7, 2015, at 11:11 AM, Sagi Grimberg <[email protected]> wrote:

> On 5/7/2015 5:12 PM, Chuck Lever wrote:
>>
>> On May 7, 2015, at 9:56 AM, Sagi Grimberg <[email protected]> wrote:
>>
>>> On 5/7/2015 4:39 PM, Chuck Lever wrote:
>>>>
>>>> On May 7, 2015, at 6:00 AM, Sagi Grimberg <[email protected]> wrote:
>>>>
>>>>> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>>>>>> The connect worker can replace ri_id, but prevents ri_id->device
>>>>>> from changing during the lifetime of a transport instance.
>>>>>>
>>>>>> Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
>>>>>> The cached copy can be used safely in code that does not serialize
>>>>>> with the connect worker.
>>>>>>
>>>>>> Other code can use it to save an extra address generation (one
>>>>>> pointer dereference instead of two).
>>>>>>
>>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>>> ---
>>>>>> net/sunrpc/xprtrdma/fmr_ops.c | 8 +----
>>>>>> net/sunrpc/xprtrdma/frwr_ops.c | 12 +++----
>>>>>> net/sunrpc/xprtrdma/physical_ops.c | 8 +----
>>>>>> net/sunrpc/xprtrdma/verbs.c | 61 +++++++++++++++++++-----------------
>>>>>> net/sunrpc/xprtrdma/xprt_rdma.h | 2 +
>>>>>> 5 files changed, 43 insertions(+), 48 deletions(-)
>>>>>>
>>>>>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>>>>>> index 302d4eb..0a96155 100644
>>>>>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>>>>>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>>>>>> @@ -85,7 +85,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>>>>> int nsegs, bool writing)
>>>>>> {
>>>>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>>>>> - struct ib_device *device = ia->ri_id->device;
>>>>>> + struct ib_device *device = ia->ri_device;
>>>>>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>>>>>> struct rpcrdma_mr_seg *seg1 = seg;
>>>>>> struct rpcrdma_mw *mw = seg1->rl_mw;
>>>>>> @@ -137,17 +137,13 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>>>>>> {
>>>>>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>>>>> struct rpcrdma_mr_seg *seg1 = seg;
>>>>>> - struct ib_device *device;
>>>>>> int rc, nsegs = seg->mr_nsegs;
>>>>>> LIST_HEAD(l);
>>>>>>
>>>>>> list_add(&seg1->rl_mw->r.fmr->list, &l);
>>>>>> rc = ib_unmap_fmr(&l);
>>>>>> - read_lock(&ia->ri_qplock);
>>>>>> - device = ia->ri_id->device;
>>>>>> while (seg1->mr_nsegs--)
>>>>>> - rpcrdma_unmap_one(device, seg++);
>>>>>> - read_unlock(&ia->ri_qplock);
>>>>>> + rpcrdma_unmap_one(ia->ri_device, seg++);
>>>>>
>>>>> Umm, I'm wondering if this is guaranteed to be the same device as
>>>>> ri_id->device?
>>>>>
>>>>> Imagine you are working on a bond device where each slave belongs to
>>>>> a different adapter. When the active port toggles, you will see an
>>>>> ADDR_CHANGED event (that the current code does not handle...); what
>>>>> you'd want to do is just reconnect and rdma_cm will resolve the new
>>>>> address for you (via the backup slave). I suspect that if this
>>>>> flow is concurrent with the reconnects you may end up with a stale
>>>>> device handle.
>>>>
>>>> I'm not sure what you mean by "stale": freed memory?
>>>>
>>>> I'm looking at this code in rpcrdma_ep_connect():
>>>>
>>>> 916 if (ia->ri_id->device != id->device) {
>>>> 917 printk("RPC: %s: can't reconnect on "
>>>> 918 "different device!\n", __func__);
>>>> 919 rdma_destroy_id(id);
>>>> 920 rc = -ENETUNREACH;
>>>> 921 goto out;
>>>> 922 }
>>>>
>>>> After reconnecting, if the ri_id has changed, the connect fails. Today,
>>>> xprtrdma does not support the device changing out from under it.
>>>>
>>>> Note also that our receive completion upcall uses ri_id->device for
>>>> DMA map syncing. Would that also be a problem during a bond failover?
>>>>
>>>
>>> I'm not talking about ri_id->device, this will be consistent. I'm
>>> wondering about ia->ri_device, which might not have been updated yet.
>>
>> ia->ri_device is never updated. The only place it is set is in
>> rpcrdma_ia_open().
>
> So you assume that each ri_id that you will recreate contains the
> same device handle?

This issue is still unresolved.

xprtrdma does not assume the device pointers are the same; it does
actually check for it. The connect worker is careful to create a new
ri_id first, then check to see that the new ri_id device pointer is
the same as the device pointer in the old (defunct) ri_id.

If the device pointer changes, the old ri_id is kept and the connect
operation fails (retried later).

That is how we guarantee that ia->ri_id->device never changes, and
thus the cached pointer in ia->ri_device never needs to be updated.
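
Condensing the two pieces involved, roughly (the assignment is the
caching done once in rpcrdma_ia_open(), and the check is the
rpcrdma_ep_connect() snippet quoted above, abbreviated here):

	/* rpcrdma_ia_open(): cache the device pointer exactly once */
	ia->ri_device = ia->ri_id->device;

	/* rpcrdma_ep_connect(): refuse a reconnect that lands on a
	 * different device, so the cached pointer can never go stale */
	if (ia->ri_id->device != id->device) {
		rdma_destroy_id(id);
		rc = -ENETUNREACH;	/* the connect is retried later */
		goto out;
	}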

> I think that for ADDR_CHANGE event when the slave belongs to another
> device you will hit a mismatch. CC'ing Sean for more info...

Yes, I'd like some confirmation that xprtrdma is not abusing the verbs
API. :-)

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2015-05-11 18:26:14

by Hefty, Sean

[permalink] [raw]
Subject: RE: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely

> > ia->ri_device is never updated. The only place it is set is in
> > rpcrdma_ia_open().
>
> So you assume that each ri_id that you will recreate contains the
> same device handle?
>
> I think that for ADDR_CHANGE event when the slave belongs to another
> device you will hit a mismatch. CC'ing Sean for more info...

I'm not familiar with the xprtrdma code. From the perspective of the rdma_cm, if a listen is associated with a specific IP address, then it will also be associated with a specific device. If an address change occurs, and the address moves to another device, then the app is essentially left with an unusable listen. Received connection requests will not find a matching listen and be dropped.

If the address moves ports on the same device, then I think this works out fine in the case where the app ignores the ADDR_CHANGE event.

> > It might be saner to construct a virtual ib_device in this
> > case that consumers can depend on.
>
> I'm not sure how a virtual ib_device could work - that goes
> to the verbs themselves... Seems like a layering mismatch to me...

I interpreted Chuck as asking for some lower-level way of handling this, so that all ULPs don't need to re-implement this. I agree that this would be hard (impossible?) without ULPs operating at a slightly higher level of abstraction. Maybe the rdma_cm could be extended for this purpose, for example, by exposing its own devices, which either map directly to a verbs device, or handle a virtual mapping.

- Sean

2015-05-11 18:57:25

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely


On May 11, 2015, at 2:26 PM, Hefty, Sean <[email protected]> wrote:

>>> ia->ri_device is never updated. The only place it is set is in
>>> rpcrdma_ia_open().
>>
>> So you assume that each ri_id that you will recreate contains the
>> same device handle?
>>
>> I think that for ADDR_CHANGE event when the slave belongs to another
>> device you will hit a mismatch. CC'ing Sean for more info...
>
> I'm not familiar with the xprtrdma code. From the perspective of the rdma_cm, if a listen is associated with a specific IP address, then it will also be associated with a specific device. If an address change occurs, and the address moves to another device, then the app is essentially left with an unusable listen. Received connection requests will not find a matching listen and be dropped.

Thanks Sean.

xprtrdma is the client-side (initiator), so it drives transport connects. The
server-side (target) does the listens. My proposed change is only on the client.

> If the address moves ports on the same device, then I think this works out fine in the case where the app ignores the ADDR_CHANGE event.

xprtrdma currently doesn't explicitly handle ADDR_CHANGE events (and neither
does the server-side, looks like).
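
(Editorial sketch: in user-space librdmacm terms, this is the event a
consumer would have to watch for.  The helper name poll_cm_event is
illustrative; nothing like it exists in xprtrdma.)

#include <stdio.h>
#include <rdma/rdma_cma.h>

/*
 * Sketch only: drain one cm event and note an address change.  A real
 * consumer would decide here whether to re-resolve and rebuild its
 * connection, or simply ignore the event as xprtrdma does today.
 */
static int poll_cm_event(struct rdma_event_channel *ec)
{
        struct rdma_cm_event *ev;

        if (rdma_get_cm_event(ec, &ev))
                return -1;

        if (ev->event == RDMA_CM_EVENT_ADDR_CHANGE)
                fprintf(stderr, "address moved; cached device handle may be stale\n");

        return rdma_ack_cm_event(ev);
}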

In the deployment scenarios I'm aware of, a single HCA with two ports is
used to form the bonded pair. I'm sure that's not the only way this can
be done, though.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2015-05-12 10:02:00

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] xprtrdma: Use ib_device pointer safely

On 5/11/2015 9:26 PM, Hefty, Sean wrote:
>>> ia->ri_device is never updated. The only place it is set is in
>>> rpcrdma_ia_open().
>>
>> So you assume that each ri_id that you will recreate contains the
>> same device handle?
>>
>> I think that for ADDR_CHANGE event when the slave belongs to another
>> device you will hit a mismatch. CC'ing Sean for more info...
>
> I'm not familiar with the xprtrdma code. From the perspective of the rdma_cm, if a listen is associated with a specific IP address, then it will also be associated with a specific device. If an address change occurs, and the address moves to another device, then the app is essentially left with an unusable listen. Received connection requests will not find a matching listen and be dropped.
>
> If the address moves ports on the same device, then I think this works out fine in the case where the app ignores the ADDR_CHANGE event.
>
>>> It might be saner to construct a virtual ib_device in this
>>> case that consumers can depend on.
>>
>> I'm not sure how a virtual ib_device can work - that goes
>> to the verbs themselves... Seems like a layering mismatch to me...
>
> I interpreted Chuck as asking for some lower-level way of handling this, so that all ULPs don't need to re-implement this.

It's not really a re-implementation; address resolution attaches the
ib device. The assumption that the same ib device will be attached for
the same address is wrong for bonding over more than one device.

It's just something that an rdma_cm user needs to be aware of if/when
trying to cache the device handle.

> I agree that this would be hard (impossible?) without ULPs operating at a slightly higher level of abstraction.
> Maybe the rdma_cm could be extended for this purpose, for example, by exposing its own devices, which either map directly to a verbs device, or handle a virtual mapping.

I guess it can be done, but it will require porting all the ULPs to
some sort of abstraction layer - something I'm not very enthusiastic
about...

Sagi.