2017-01-13 17:42:51

by Chuck Lever

Subject: [PATCH v1 0/5] Fix "support large inline thresholds"

I've received a number of reports that v4.9 commit 655fec6987be
("xprtrdma: Use gathered Send for large inline messages") causes
NFS/RDMA mounts to fail for devices that have a small max_sge.

This series addresses that problem.

A much smaller fix was provided initially. It worked for devices
with as few as five send SGEs. However, additional research has
shown that there is at least one in-tree device that supports only
three send SGEs.

The current series should enable NFS/RDMA again on those devices.


Available in the "nfs-rdma-for-4.10-rc" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

And for online review:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfs-rdma-for-4.10-rc

---

Chuck Lever (5):
xprtrdma: Fix Read chunk padding
xprtrdma: Per-connection pad optimization
xprtrdma: Disable pad optimization by default
xprtrdma: Reduce required number of send SGEs
xprtrdma: Shrink send SGEs array


net/sunrpc/xprtrdma/rpc_rdma.c | 59 ++++++++++++++++++++++++++-------------
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtrdma/verbs.c | 15 ++++++----
net/sunrpc/xprtrdma/xprt_rdma.h | 13 ++++++---
4 files changed, 58 insertions(+), 31 deletions(-)

--
Chuck Lever


2017-01-13 17:42:59

by Chuck Lever

Subject: [PATCH v1 1/5] xprtrdma: Fix Read chunk padding

When pad optimization is disabled, rpcrdma_convert_iovs still
does not add explicit XDR padding to a Read chunk.

Commit 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
incorrectly short-circuited the padding test that appears later in
rpcrdma_convert_iovs.

However, if this is indeed a Read chunk (and not a Position-Zero
Read chunk), the tail iovec _always_ contains the chunk's padding,
and never anything else.

So it's easy to skip the tail when pad optimization is enabled, and
to add the tail as a separate Read chunk segment when it is
disabled.

Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c52e0f2..a524d3c 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -226,8 +226,10 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
if (len && n == RPCRDMA_MAX_SEGS)
goto out_overflow;

- /* When encoding the read list, the tail is always sent inline */
- if (type == rpcrdma_readch)
+ /* When encoding a Read chunk, the tail iovec contains an
+ * XDR pad and may be omitted.
+ */
+ if (type == rpcrdma_readch && xprt_rdma_pad_optimize)
return n;

/* When encoding the Write list, some servers need to see an extra
@@ -238,10 +240,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
return n;

if (xdrbuf->tail[0].iov_len) {
- /* the rpcrdma protocol allows us to omit any trailing
- * xdr pad bytes, saving the server an RDMA operation. */
- if (xdrbuf->tail[0].iov_len < 4 && xprt_rdma_pad_optimize)
- return n;
n = rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, n);
if (n == RPCRDMA_MAX_SEGS)
goto out_overflow;
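
As an aside, the XDR round-up in question is easy to see in isolation.
The sketch below is illustrative only and is not code from this patch:
XDR rounds every variable-length object up to a 4-byte boundary, so a
Read chunk carrying, say, a 7-byte payload leaves a 1-byte pad in the
tail iovec.

/* Illustrative sketch, not from the patch: compute the XDR pad for a
 * given payload length. With pad optimization enabled the pad bytes
 * are omitted from the Read chunk; with it disabled they are conveyed
 * as a separate chunk segment taken from the tail iovec.
 */
#include <stdio.h>

static unsigned int xdr_roundup_pad(unsigned int len)
{
	return (4 - (len & 3)) & 3;	/* 0, 1, 2, or 3 pad bytes */
}

int main(void)
{
	printf("pad for a 7-byte payload: %u byte(s)\n", xdr_roundup_pad(7));
	return 0;
}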


2017-01-13 17:43:08

by Chuck Lever

Subject: [PATCH v1 2/5] xprtrdma: Per-connection pad optimization

Pad optimization is changed by echoing into
/proc/sys/sunrpc/rdma_pad_optimize. This is a global setting,
affecting all RPC-over-RDMA connections to all servers.

The marshaling code picks up that value and uses it for decisions
about how to construct each RPC-over-RDMA frame. Having it change
suddenly in mid-operation can result in unexpected failures. And
some of the servers a client mounts may need chunk padding, while
others don't.

So instead, copy the setting into each connection's rpcrdma_ia at
mount time, and use the copy, which can't change during the life of
the connection.

This also removes a hack: rpcrdma_convert_iovs was using the
remote-invalidation-expected flag to predict when it could leave
out Write chunk padding. That worked only because the Linux server
handles implicit XDR padding on Write chunks correctly, and only
Linux servers can set the connection's remote-invalidation-expected
flag.

It's more sensible to use the pad optimization setting instead.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 28 ++++++++++++++--------------
net/sunrpc/xprtrdma/verbs.c | 1 +
net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index a524d3c..4909758 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -186,9 +186,9 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
*/

static int
-rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
- enum rpcrdma_chunktype type, struct rpcrdma_mr_seg *seg,
- bool reminv_expected)
+rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
+ unsigned int pos, enum rpcrdma_chunktype type,
+ struct rpcrdma_mr_seg *seg)
{
int len, n, p, page_base;
struct page **ppages;
@@ -229,14 +229,15 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
/* When encoding a Read chunk, the tail iovec contains an
* XDR pad and may be omitted.
*/
- if (type == rpcrdma_readch && xprt_rdma_pad_optimize)
+ if (type == rpcrdma_readch && r_xprt->rx_ia.ri_implicit_padding)
return n;

- /* When encoding the Write list, some servers need to see an extra
- * segment for odd-length Write chunks. The upper layer provides
- * space in the tail iovec for this purpose.
+ /* When encoding a Write chunk, some servers need to see an
+ * extra segment for non-XDR-aligned Write chunks. The upper
+ * layer provides space in the tail iovec that may be used
+ * for this purpose.
*/
- if (type == rpcrdma_writech && reminv_expected)
+ if (type == rpcrdma_writech && r_xprt->rx_ia.ri_implicit_padding)
return n;

if (xdrbuf->tail[0].iov_len) {
@@ -291,7 +292,8 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
if (rtype == rpcrdma_areadch)
pos = 0;
seg = req->rl_segments;
- nsegs = rpcrdma_convert_iovs(&rqst->rq_snd_buf, pos, rtype, seg, false);
+ nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_snd_buf, pos,
+ rtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);

@@ -353,10 +355,9 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
}

seg = req->rl_segments;
- nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf,
+ nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf,
rqst->rq_rcv_buf.head[0].iov_len,
- wtype, seg,
- r_xprt->rx_ia.ri_reminv_expected);
+ wtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);

@@ -421,8 +422,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
}

seg = req->rl_segments;
- nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf, 0, wtype, seg,
- r_xprt->rx_ia.ri_reminv_expected);
+ nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf, 0, wtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 11d0774..890cb3a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -208,6 +208,7 @@

/* Default settings for RPC-over-RDMA Version One */
r_xprt->rx_ia.ri_reminv_expected = false;
+ r_xprt->rx_ia.ri_implicit_padding = xprt_rdma_pad_optimize;
rsize = RPCRDMA_V1_DEF_INLINE_SIZE;
wsize = RPCRDMA_V1_DEF_INLINE_SIZE;

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index e35efd4..f495df0c 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -75,6 +75,7 @@ struct rpcrdma_ia {
unsigned int ri_max_inline_write;
unsigned int ri_max_inline_read;
bool ri_reminv_expected;
+ bool ri_implicit_padding;
enum ib_mr_type ri_mrtype;
struct ib_qp_attr ri_qp_attr;
struct ib_qp_init_attr ri_qp_init_attr;
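
The design choice here, snapshot a global knob once per connection so
in-flight RPCs never see it change, can be modeled in a stand-alone
sketch. Everything below is illustrative; none of these names come from
the kernel, they only model the pattern the two hunks above implement.

/* Stand-alone model, illustrative only. */
#include <assert.h>
#include <stdbool.h>

static bool pad_optimize_sysctl = false;	/* models the /proc knob */

struct conn {
	bool implicit_padding;			/* per-connection snapshot */
};

static void conn_setup(struct conn *c)
{
	c->implicit_padding = pad_optimize_sysctl;	/* copied at connect time */
}

int main(void)
{
	struct conn c;

	conn_setup(&c);
	pad_optimize_sysctl = true;	/* admin flips the knob later... */
	assert(!c.implicit_padding);	/* ...the live connection is unaffected */
	return 0;
}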


2017-01-13 17:43:16

by Chuck Lever

Subject: [PATCH v1 3/5] xprtrdma: Disable pad optimization by default

Commit d5440e27d3e5 ("xprtrdma: Enable pad optimization") made the
Linux client omit XDR padding in normal Read and Write chunks so
that the client doesn't have to register and invalidate 3-byte
memory regions that contain no real data.

Unfortunately, my cheery 2014 assessment that this optimization "is
supported now by both Linux and Solaris servers" was premature.
We've found bugs in Solaris in this area since commit d5440e27d3e5
was merged (SYMLINK is the main culprit).

So for maximum interoperability, I'm disabling this optimization
again. If a CM private message is exchanged when connecting, the
client recognizes that the server is Linux, and enables the
optimization for that connection.

Until now the Solaris server bugs did not impact common operations,
and were thus largely unnoticed. Soon, less capable devices on Linux
NFS/RDMA clients will make use of Read chunks more often, and these
Solaris bugs will prevent interoperation in more cases.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtrdma/verbs.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 534c178..6990581 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -67,7 +67,7 @@
static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_inline_write_padding;
static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
- int xprt_rdma_pad_optimize = 1;
+ int xprt_rdma_pad_optimize = 0;

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 890cb3a..12e8242 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -216,6 +216,7 @@
pmsg->cp_magic == rpcrdma_cmp_magic &&
pmsg->cp_version == RPCRDMA_CMP_VERSION) {
r_xprt->rx_ia.ri_reminv_expected = true;
+ r_xprt->rx_ia.ri_implicit_padding = true;
rsize = rpcrdma_decode_buffer_size(pmsg->cp_send_size);
wsize = rpcrdma_decode_buffer_size(pmsg->cp_recv_size);
}


2017-01-13 17:43:27

by Chuck Lever

Subject: [PATCH v1 4/5] xprtrdma: Reduce required number of send SGEs

The MAX_SEND_SGES check introduced in commit 655fec6987be
("xprtrdma: Use gathered Send for large inline messages") fails
for devices that have a small max_sge.

Instead of checking for a large fixed maximum number of SGEs,
check for a small minimum number. RPC-over-RDMA will switch to
using a Read chunk if an xdr_buf has more pages than can fit
within the device's max_sge limit. This is better than failing
outright to mount the server.

This fix supports devices that have as few as three send SGEs
available.

Reported-by: Selvin Xavier <[email protected]>
Reported-by: Devesh Sharma <[email protected]>
Reported-by: Honggang Li <[email protected]>
Reported-by: Ram Amrani <[email protected]>
Fixes: 655fec6987be ("xprtrdma: Use gathered Send for large ...")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 23 ++++++++++++++++++++++-
net/sunrpc/xprtrdma/verbs.c | 13 +++++++------
net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 4909758..ab699f9 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -126,13 +126,34 @@ void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *r_xprt)
* plus the RPC call fit under the transport's inline limit. If the
* combined call message size exceeds that limit, the client must use
* the read chunk list for this operation.
+ *
+ * A read chunk is also required if sending the RPC call inline would
+ * exceed this device's max_sge limit.
*/
static bool rpcrdma_args_inline(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst *rqst)
{
+ struct xdr_buf *xdr = &rqst->rq_snd_buf;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+ unsigned int count, remaining, offset;
+
+ if (xdr->len > ia->ri_max_inline_write)
+ return false;
+
+ if (xdr->page_len) {
+ remaining = xdr->page_len;
+ offset = xdr->page_base & ~PAGE_MASK;
+ count = 0;
+ while (remaining) {
+ remaining -= min_t(unsigned int,
+ PAGE_SIZE - offset, remaining);
+ offset = 0;
+ if (++count > ia->ri_max_sgeno)
+ return false;
+ }
+ }

- return rqst->rq_snd_buf.len <= ia->ri_max_inline_write;
+ return true;
}

/* The client can't know how large the actual reply will be. Thus it
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 12e8242..5dcdd0b 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -488,18 +488,19 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
*/
int
rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
- struct rpcrdma_create_data_internal *cdata)
+ struct rpcrdma_create_data_internal *cdata)
{
struct rpcrdma_connect_private *pmsg = &ep->rep_cm_private;
+ unsigned int max_qp_wr, max_sge;
struct ib_cq *sendcq, *recvcq;
- unsigned int max_qp_wr;
int rc;

- if (ia->ri_device->attrs.max_sge < RPCRDMA_MAX_SEND_SGES) {
- dprintk("RPC: %s: insufficient sge's available\n",
- __func__);
+ max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);
+ if (max_sge < 3) {
+ pr_warn("rpcrdma: HCA provides only %d send SGEs\n", max_sge);
return -ENOMEM;
}
+ ia->ri_max_sgeno = max_sge - 3;

if (ia->ri_device->attrs.max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
dprintk("RPC: %s: insufficient wqe's available\n",
@@ -524,7 +525,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
ep->rep_attr.cap.max_recv_wr += 1; /* drain cqe */
- ep->rep_attr.cap.max_send_sge = RPCRDMA_MAX_SEND_SGES;
+ ep->rep_attr.cap.max_send_sge = max_sge;
ep->rep_attr.cap.max_recv_sge = 1;
ep->rep_attr.cap.max_inline_data = 0;
ep->rep_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index f495df0c..c134d0b 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -74,6 +74,7 @@ struct rpcrdma_ia {
unsigned int ri_max_frmr_depth;
unsigned int ri_max_inline_write;
unsigned int ri_max_inline_read;
+ unsigned int ri_max_sgeno;
bool ri_reminv_expected;
bool ri_implicit_padding;
enum ib_mr_type ri_mrtype;
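
To make the new page-counting test concrete, here is a stand-alone
rendition of the loop added to rpcrdma_args_inline above. It is
illustrative only, assumes 4 KiB pages, and the example values are
hypothetical.

/* A 6000-byte page list that starts 3000 bytes into its first page
 * touches three pages, so it would consume three send SGEs.
 */
#include <stdio.h>

#define PAGE_SZ 4096u

static unsigned int pages_needed(unsigned int page_len, unsigned int page_base)
{
	unsigned int remaining = page_len;
	unsigned int offset = page_base & (PAGE_SZ - 1);
	unsigned int count = 0;

	while (remaining) {
		unsigned int chunk = PAGE_SZ - offset;

		if (chunk > remaining)
			chunk = remaining;
		remaining -= chunk;
		offset = 0;
		count++;
	}
	return count;
}

int main(void)
{
	printf("SGEs needed: %u\n", pages_needed(6000, 3000));	/* prints 3 */
	return 0;
}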


2017-01-13 17:43:33

by Chuck Lever

Subject: [PATCH v1 5/5] xprtrdma: Shrink send SGEs array

We no longer need to accommodate an xdr_buf whose pages start at an
offset and cross extra page boundaries. If there are more partial or
whole pages to send than there are available SGEs, the marshaling
logic is now smart enough to use a Read chunk instead.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/xprt_rdma.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index c134d0b..8c32717 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -305,15 +305,18 @@ struct rpcrdma_mr_seg { /* chunk descriptors */
char *mr_offset; /* kva if no page, else offset */
};

-/* Reserve enough Send SGEs to send a maximum size inline request:
+/* The Send SGE array is provisioned to send a maximum size
+ * inline request:
* - RPC-over-RDMA header
* - xdr_buf head iovec
- * - RPCRDMA_MAX_INLINE bytes, possibly unaligned, in pages
+ * - RPCRDMA_MAX_INLINE bytes, in pages
* - xdr_buf tail iovec
+ *
+ * The actual number of array elements consumed by each RPC
+ * depends on the device's max_sge limit.
*/
enum {
- RPCRDMA_MAX_SEND_PAGES = PAGE_SIZE + RPCRDMA_MAX_INLINE - 1,
- RPCRDMA_MAX_PAGE_SGES = (RPCRDMA_MAX_SEND_PAGES >> PAGE_SHIFT) + 1,
+ RPCRDMA_MAX_PAGE_SGES = RPCRDMA_MAX_INLINE >> PAGE_SHIFT,
RPCRDMA_MAX_SEND_SGES = 1 + 1 + RPCRDMA_MAX_PAGE_SGES + 1,
};
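
A worked example of the before-and-after sizing, using values chosen
purely for illustration (4 KiB pages and an inline maximum of 4096
bytes; these are not the kernel's actual constants):

/* Illustrative arithmetic only; the EX_ names are hypothetical. */
#include <assert.h>

enum {
	EX_PAGE_SHIFT		= 12,		/* 4096-byte pages */
	EX_PAGE_SIZE		= 1 << EX_PAGE_SHIFT,
	EX_MAX_INLINE		= 4096,

	/* before this patch: allow an unaligned start plus a spill page */
	EX_OLD_SEND_PAGES	= EX_PAGE_SIZE + EX_MAX_INLINE - 1,
	EX_OLD_PAGE_SGES	= (EX_OLD_SEND_PAGES >> EX_PAGE_SHIFT) + 1,
	EX_OLD_SEND_SGES	= 1 + 1 + EX_OLD_PAGE_SGES + 1,	/* == 5 */

	/* after this patch: whole pages of inline payload only */
	EX_NEW_PAGE_SGES	= EX_MAX_INLINE >> EX_PAGE_SHIFT,
	EX_NEW_SEND_SGES	= 1 + 1 + EX_NEW_PAGE_SGES + 1,	/* == 4 */
};

int main(void)
{
	assert(EX_OLD_SEND_SGES == 5 && EX_NEW_SEND_SGES == 4);
	return 0;
}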



2017-01-13 18:01:40

by Parav Pandit

Subject: RE: [PATCH v1 4/5] xprtrdma: Reduce required number of send SGEs

Hi Chuck,

> -----Original Message-----
> From: [email protected] [mailto:linux-rdma-
> [email protected]] On Behalf Of Chuck Lever
> Sent: Friday, January 13, 2017 11:43 AM
> To: [email protected]; [email protected]
> Subject: [PATCH v1 4/5] xprtrdma: Reduce required number of send SGEs
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
> b/net/sunrpc/xprtrdma/rpc_rdma.c index 4909758..ab699f9 100644

> /* The client can't know how large the actual reply will be. Thus it diff --git
> a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index
> 12e8242..5dcdd0b 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -488,18 +488,19 @@ static void rpcrdma_destroy_id(struct rdma_cm_id
> *id)
> */
> int
> rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
> - struct rpcrdma_create_data_internal *cdata)
> + struct rpcrdma_create_data_internal *cdata)
> {
> struct rpcrdma_connect_private *pmsg = &ep->rep_cm_private;
> + unsigned int max_qp_wr, max_sge;
> struct ib_cq *sendcq, *recvcq;
> - unsigned int max_qp_wr;
> int rc;
>
> - if (ia->ri_device->attrs.max_sge < RPCRDMA_MAX_SEND_SGES) {
> - dprintk("RPC: %s: insufficient sge's available\n",
> - __func__);
> + max_sge = min(ia->ri_device->attrs.max_sge,
> RPCRDMA_MAX_SEND_SGES);
> + if (max_sge < 3) {
> + pr_warn("rpcrdma: HCA provides only %d send SGEs\n",
> max_sge);
> return -ENOMEM;
> }
> + ia->ri_max_sgeno = max_sge - 3;
>

I didn't notice this new ri_max_sgeno variable being used in this patch set. Did I miss it?
You also might want to rename it to ri_max_sge_num.
Regardless, for some device with 3 SGEs, ri_max_sgeno will become zero. Is that fine?
Did you want to check for a value of 5?
It would be good to have a define in a header file for this minimum of 3 required SGEs, such as RPCRDMA_MIN_REQ_RECV_SGE.


> if (ia->ri_device->attrs.max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
> dprintk("RPC: %s: insufficient wqe's available\n",
> @@ -524,7 +525,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id
> *id)
> ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
> ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
> ep->rep_attr.cap.max_recv_wr += 1; /* drain cqe */
> - ep->rep_attr.cap.max_send_sge = RPCRDMA_MAX_SEND_SGES;
> + ep->rep_attr.cap.max_send_sge = max_sge;
> ep->rep_attr.cap.max_recv_sge = 1;
> ep->rep_attr.cap.max_inline_data = 0;
> ep->rep_attr.sq_sig_type = IB_SIGNAL_REQ_WR; diff --git
> a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index f495df0c..c134d0b 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -74,6 +74,7 @@ struct rpcrdma_ia {
> unsigned int ri_max_frmr_depth;
> unsigned int ri_max_inline_write;
> unsigned int ri_max_inline_read;
> + unsigned int ri_max_sgeno;
> bool ri_reminv_expected;
> bool ri_implicit_padding;
> enum ib_mr_type ri_mrtype;

2017-01-13 18:31:03

by Chuck Lever

Subject: Re: [PATCH v1 4/5] xprtrdma: Reduce required number of send SGEs


> On Jan 13, 2017, at 1:01 PM, Parav Pandit <[email protected]> wrote:
>
> Hi Chuck,
>
>> -----Original Message-----
>> From: [email protected] [mailto:linux-rdma-
>> [email protected]] On Behalf Of Chuck Lever
>> Sent: Friday, January 13, 2017 11:43 AM
>> To: [email protected]; [email protected]
>> Subject: [PATCH v1 4/5] xprtrdma: Reduce required number of send SGEs
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
>> b/net/sunrpc/xprtrdma/rpc_rdma.c index 4909758..ab699f9 100644
>
>> /* The client can't know how large the actual reply will be. Thus it diff --git
>> a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index
>> 12e8242..5dcdd0b 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -488,18 +488,19 @@ static void rpcrdma_destroy_id(struct rdma_cm_id
>> *id)
>> */
>> int
>> rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>> - struct rpcrdma_create_data_internal *cdata)
>> + struct rpcrdma_create_data_internal *cdata)
>> {
>> struct rpcrdma_connect_private *pmsg = &ep->rep_cm_private;
>> + unsigned int max_qp_wr, max_sge;
>> struct ib_cq *sendcq, *recvcq;
>> - unsigned int max_qp_wr;
>> int rc;
>>
>> - if (ia->ri_device->attrs.max_sge < RPCRDMA_MAX_SEND_SGES) {
>> - dprintk("RPC: %s: insufficient sge's available\n",
>> - __func__);
>> + max_sge = min(ia->ri_device->attrs.max_sge,
>> RPCRDMA_MAX_SEND_SGES);
>> + if (max_sge < 3) {
>> + pr_warn("rpcrdma: HCA provides only %d send SGEs\n",
>> max_sge);
>> return -ENOMEM;
>> }
>> + ia->ri_max_sgeno = max_sge - 3;
>>
>
> I didn't notice this new ri_max_sgeno variable being used in this patch set. Did I miss it?

Yes, you snipped out the other hunk where it is used.


> You also might want to rename it to ri_max_sge_num.
> Regardless, for some device with 3 SGEs, ri_max_sgeno will become zero. Is that fine?

Yes, that's OK, and tested. Zero means that NFS WRITE and SYMLINK
payloads will always be sent in a Read chunk.


> Did you want to check for a value of 5?
> It would be good to have a define in a header file for this minimum of 3 required SGEs, such as RPCRDMA_MIN_REQ_RECV_SGE.

OK.


>
>
>> if (ia->ri_device->attrs.max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
>> dprintk("RPC: %s: insufficient wqe's available\n",
>> @@ -524,7 +525,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id
>> *id)
>> ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
>> ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
>> ep->rep_attr.cap.max_recv_wr += 1; /* drain cqe */
>> - ep->rep_attr.cap.max_send_sge = RPCRDMA_MAX_SEND_SGES;
>> + ep->rep_attr.cap.max_send_sge = max_sge;
>> ep->rep_attr.cap.max_recv_sge = 1;
>> ep->rep_attr.cap.max_inline_data = 0;
>> ep->rep_attr.sq_sig_type = IB_SIGNAL_REQ_WR; diff --git
>> a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index f495df0c..c134d0b 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -74,6 +74,7 @@ struct rpcrdma_ia {
>> unsigned int ri_max_frmr_depth;
>> unsigned int ri_max_inline_write;
>> unsigned int ri_max_inline_read;
>> + unsigned int ri_max_sgeno;
>> bool ri_reminv_expected;
>> bool ri_implicit_padding;
>> enum ib_mr_type ri_mrtype;
>>

--
Chuck Lever
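
For reference, the define Parav asks for might look something like the
sketch below. Parav suggests the name RPCRDMA_MIN_REQ_RECV_SGE; a
send-side name such as RPCRDMA_MIN_SEND_SGES could fit better, but
either way the name, value placement, and usage shown here are
hypothetical, not committed code.

/* Hypothetical sketch only: the three SGEs the transport always needs
 * cover the RPC-over-RDMA header, the xdr_buf head, and the xdr_buf
 * tail.
 */
enum {
	RPCRDMA_MIN_SEND_SGES = 3,
};

/* rpcrdma_ep_create() could then test against the named constant:
 *
 *	if (max_sge < RPCRDMA_MIN_SEND_SGES) { ... }
 *	ia->ri_max_sgeno = max_sge - RPCRDMA_MIN_SEND_SGES;
 */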




2017-01-13 19:14:49

by Parav Pandit

Subject: RE: [PATCH v1 4/5] xprtrdma: Reduce required number of send SGEs

> >> + ia->ri_max_sgeno = max_sge - 3;
> >>
> >
> > I didn't notice this new ri_max_sgeno variable being used in this patch set.
> Did I miss it?
>
> Yes, you snipped out the other hunk where it is used.

Right. I see it now.

> > You also might want to rename it to ri_max_sge_num.
> > Regardless, for some device with 3 SGEs, ri_max_sgeno will become zero. Is
> that fine?
>
> Yes, that's OK, and tested. Zero means that NFS WRITE and SYMLINK
> payloads will always be sent in a Read chunk.
>
Ok. Thanks.

> > It would be good to have a define in a header file for this minimum of 3
> required SGEs, such as RPCRDMA_MIN_REQ_RECV_SGE.
>
> OK.

2017-01-20 17:30:31

by Steve Wise

Subject: RE: [PATCH v1 0/5] Fix "support large inline thresholds"

>
> I've received a number of reports that v4.9 commit 655fec6987be
> ("xprtrdma: Use gathered Send for large inline messages") causes
> NFS/RDMA mounts to fail for devices that have a small max_sge.
>
> This series addresses that problem.
>
> A much smaller fix was provided initially. It worked for devices
> with as few as five send SGEs. However, additional research has
> shown that there is at least one in-tree device that supports only
> three send SGEs.
>
> The current series should enable NFS/RDMA again on those devices.
>
>
> Available in the "nfs-rdma-for-4.10-rc" topic branch of this git repo:
>
> git://git.linux-nfs.org/projects/cel/cel-2.6.git

Hey Chuck,

Tests ok on cxgb4.

Tested-by: Steve Wise <[email protected]>


2017-01-20 18:17:20

by Chuck Lever

Subject: Re: [PATCH v1 0/5] Fix "support large inline thresholds"


> On Jan 20, 2017, at 9:30 AM, Steve Wise <[email protected]> wrote:
>
>>
>> I've received a number of reports that v4.9 commit 655fec6987be
>> ("xprtrdma: Use gathered Send for large inline messages") causes
>> NFS/RDMA mounts to fail for devices that have a small max_sge.
>>
>> This series addresses that problem.
>>
>> A much smaller fix was provided initially. It worked for devices
>> with as few as five send SGEs. However, additional research has
>> shown that there is at least one in-tree device that supports only
>> three send SGEs.
>>
>> The current series should enable NFS/RDMA again on those devices.
>>
>>
>> Available in the "nfs-rdma-for-4.10-rc" topic branch of this git repo:
>>
>> git://git.linux-nfs.org/projects/cel/cel-2.6.git
>
> Hey Chuck,
>
> Tests ok on cxgb4.
>
> Tested-by: Steve Wise <[email protected]>

Thanks. I've also received confirmation that Ram's
device works now, and ocrdma's issue is cleared too.

I'll post a final version of this series on Monday.


--
Chuck Lever