2014-11-09 01:14:07

by Chuck Lever

Subject: [PATCH v2 00/10] NFS/RDMA patches for 3.19

Hi-

This is a set of fixes and clean-ups for xprtrdma. Also available
in the nfs-rdma-for-3.19 topic branch at

git://linux-nfs.org/projects/cel/cel-2.6.git

I've run cthon04, iozone, dbench, multi-process kernel build and
xfstests with NFSv3 and NFSv4.0 against a Linux and a Solaris
NFS server via RDMA, with all three memory registration modes
(FRMR, MTHCAFMR, and ALLPHYSICAL).

Changes since v1:

- Merged up to v3.18-rc3
- Dropped patches that enable NFSv4.1 support for now
- CQ completion vector spinning dropped for now
- Addressed Anna's review comments
- Several bug fixes

---

Chuck Lever (10):
xprtrdma: Return an errno from rpcrdma_register_external()
xprtrdma: Cap req_cqinit
xprtrdma: unmap all FMRs during transport disconnect
xprtrdma: Refactor tasklet scheduling
xprtrdma: Re-write rpcrdma_flush_cqs()
xprtrdma: Enable pad optimization
xprtrdma: Display async errors
SUNRPC: serialize iostats updates
NFS: SETCLIENTID XDR buffer sizes are incorrect
NFS: Clean up nfs4_init_callback()


fs/nfs/nfs4client.c             |   31 +++++------
fs/nfs/nfs4xdr.c                |   10 ++-
include/linux/sunrpc/metrics.h  |    3 +
net/sunrpc/stats.c              |   21 +++++--
net/sunrpc/xprtrdma/transport.c |    4 +
net/sunrpc/xprtrdma/verbs.c     |  114 ++++++++++++++++++++++++++++++++++-----
net/sunrpc/xprtrdma/xprt_rdma.h |    6 ++
7 files changed, 146 insertions(+), 43 deletions(-)

--
Chuck Lever


2014-11-26 16:25:35

by Anna Schumaker

Subject: Re: [PATCH v2 00/10] NFS/RDMA patches for 3.19

Hi Chuck,

I've applied these to two different branches - one for the RDMA changes and another for the generic client side changes. Both have been sent to Trond to be merged with 3.19.

Anna



2014-11-26 16:36:09

by Chuck Lever

Subject: Re: [PATCH v2 00/10] NFS/RDMA patches for 3.19


On Nov 26, 2014, at 11:25 AM, Anna Schumaker <[email protected]> wrote:

> Hi Chuck,
>
> I've applied these to two different branches - one for the RDMA changes and another for the generic client side changes. Both have been sent to Trond to be merged with 3.19.

Thanks Anna!


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2014-11-09 01:14:30

by Chuck Lever

Subject: [PATCH v2 03/10] xprtrdma: unmap all FMRs during transport disconnect

When using RPCRDMA_MTHCAFMR memory registration, after a few
transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to
return EINVAL because the provider has exhausted its map pool.

Make sure that all FMRs are unmapped during transport disconnect,
and that ->send_request remarshals them during an RPC retransmit.
This resets the transport's MRs to ensure that none are leaked
during a disconnect.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c |    2 +-
net/sunrpc/xprtrdma/verbs.c     |   42 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6a4615d..cfe9a81 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -599,7 +599,7 @@ xprt_rdma_send_request(struct rpc_task *task)

if (req->rl_niovs == 0)
rc = rpcrdma_marshal_req(rqst);
- else if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
+ else if (r_xprt->rx_ia.ri_memreg_strategy != RPCRDMA_ALLPHYSICAL)
rc = rpcrdma_marshal_chunks(rqst, 0);
if (rc < 0)
goto failed_marshal;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index af45cf3..3c88276 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -62,6 +62,7 @@
#endif

static void rpcrdma_reset_frmrs(struct rpcrdma_ia *);
+static void rpcrdma_reset_fmrs(struct rpcrdma_ia *);

/*
* internal functions
@@ -868,8 +869,19 @@ retry:
rpcrdma_ep_disconnect(ep, ia);
rpcrdma_flush_cqs(ep);

- if (ia->ri_memreg_strategy == RPCRDMA_FRMR)
+ switch (ia->ri_memreg_strategy) {
+ case RPCRDMA_FRMR:
rpcrdma_reset_frmrs(ia);
+ break;
+ case RPCRDMA_MTHCAFMR:
+ rpcrdma_reset_fmrs(ia);
+ break;
+ case RPCRDMA_ALLPHYSICAL:
+ break;
+ default:
+ rc = -EIO;
+ goto out;
+ }

xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
id = rpcrdma_create_id(xprt, ia,
@@ -1289,6 +1301,34 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
kfree(buf->rb_pool);
}

+/* After a disconnect, unmap all FMRs.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_fmr_external().
+ */
+static void
+rpcrdma_reset_fmrs(struct rpcrdma_ia *ia)
+{
+ struct rpcrdma_xprt *r_xprt =
+ container_of(ia, struct rpcrdma_xprt, rx_ia);
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+ struct list_head *pos;
+ struct rpcrdma_mw *r;
+ LIST_HEAD(l);
+ int rc;
+
+ list_for_each(pos, &buf->rb_all) {
+ r = list_entry(pos, struct rpcrdma_mw, mw_all);
+
+ INIT_LIST_HEAD(&l);
+ list_add(&r->r.fmr->list, &l);
+ rc = ib_unmap_fmr(&l);
+ if (rc)
+ dprintk("RPC: %s: ib_unmap_fmr failed %i\n",
+ __func__, rc);
+ }
+}
+
/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
* an unusable state. Find FRMRs in this state and dereg / reg
* each. FRMRs that are VALID and attached to an rpcrdma_req are


2014-11-09 01:15:03

by Chuck Lever

Subject: [PATCH v2 07/10] xprtrdma: Display async errors

An async error upcall is a hard error, and should be reported in
the system log.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 36 ++++++++++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e6ac964..5783c1a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -106,6 +106,32 @@ rpcrdma_run_tasklet(unsigned long data)

static DECLARE_TASKLET(rpcrdma_tasklet_g, rpcrdma_run_tasklet, 0UL);

+static const char * const async_event[] = {
+ "CQ error",
+ "QP fatal error",
+ "QP request error",
+ "QP access error",
+ "communication established",
+ "send queue drained",
+ "path migration successful",
+ "path mig error",
+ "device fatal error",
+ "port active",
+ "port error",
+ "LID change",
+ "P_key change",
+ "SM change",
+ "SRQ error",
+ "SRQ limit reached",
+ "last WQE reached",
+ "client reregister",
+ "GID change",
+};
+
+#define ASYNC_MSG(status) \
+ ((status) < ARRAY_SIZE(async_event) ? \
+ async_event[(status)] : "unknown async error")
+
static void
rpcrdma_schedule_tasklet(struct list_head *sched_list)
{
@@ -122,8 +148,9 @@ rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;

- dprintk("RPC: %s: QP error %X on device %s ep %p\n",
- __func__, event->event, event->device->name, context);
+ pr_err("RPC: %s: %s on device %s ep %p\n",
+ __func__, ASYNC_MSG(event->event),
+ event->device->name, context);
if (ep->rep_connected == 1) {
ep->rep_connected = -EIO;
ep->rep_func(ep);
@@ -136,8 +163,9 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;

- dprintk("RPC: %s: CQ error %X on device %s ep %p\n",
- __func__, event->event, event->device->name, context);
+ pr_err("RPC: %s: %s on device %s ep %p\n",
+ __func__, ASYNC_MSG(event->event),
+ event->device->name, context);
if (ep->rep_connected == 1) {
ep->rep_connected = -EIO;
ep->rep_func(ep);


2014-11-09 10:13:31

by Sagi Grimberg

Subject: Re: [PATCH v2 02/10] xprtrdma: Cap req_cqinit

On 11/9/2014 3:14 AM, Chuck Lever wrote:
> Recent work made FRMR registration and invalidation completions
> unsignaled. This greatly reduces the adapter interrupt rate.
>
> Every so often, however, a posted send Work Request is allowed to
> signal. Otherwise, the provider's Work Queue will wrap and the
> workload will hang.
>
> The number of Work Requests that are allowed to remain unsignaled is
> determined by the value of req_cqinit. Currently, this is set to the
> size of the send Work Queue divided by two, minus 1.
>
> For FRMR, the send Work Queue is the maximum number of concurrent
> RPCs (currently 32) times the maximum number of Work Requests an
> RPC might use (currently 7, though some adapters may need more).
>
> For mlx4, this is 224 entries. This leaves completion signaling
> disabled for 111 send Work Requests.
>
> Some providers hold back dispatching Work Requests until a CQE is
> generated. If completions are disabled, then no CQEs are generated
> for quite some time, and that can stall the Work Queue.
>
> I've seen this occur running xfstests generic/113 over NFSv4, where
> eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
> because the Work Queue has overflowed. The connection is dropped
> and re-established.

Hey Chuck,

As you know, I've seen this issue too...
Looking into this is definitely on my todo list.

Does this happen if you run a simple dd (single request-response inflight)?

Sagi.

2014-11-09 01:15:28

by Chuck Lever

Subject: [PATCH v2 10/10] NFS: Clean up nfs4_init_callback()

nfs4_init_callback() is never invoked for NFS versions other than 4.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfs/nfs4client.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index ffdb28d..5f4b818 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -241,28 +241,25 @@ void nfs4_free_client(struct nfs_client *clp)
*/
static int nfs4_init_callback(struct nfs_client *clp)
{
+ struct rpc_xprt *xprt;
int error;

- if (clp->rpc_ops->version == 4) {
- struct rpc_xprt *xprt;
+ xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);

- xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
-
- if (nfs4_has_session(clp)) {
- error = xprt_setup_backchannel(xprt,
- NFS41_BC_MIN_CALLBACKS);
- if (error < 0)
- return error;
- }
-
- error = nfs_callback_up(clp->cl_mvops->minor_version, xprt);
- if (error < 0) {
- dprintk("%s: failed to start callback. Error = %d\n",
- __func__, error);
+ if (nfs4_has_session(clp)) {
+ error = xprt_setup_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
+ if (error < 0)
return error;
- }
- __set_bit(NFS_CS_CALLBACK, &clp->cl_res_state);
}
+
+ error = nfs_callback_up(clp->cl_mvops->minor_version, xprt);
+ if (error < 0) {
+ dprintk("%s: failed to start callback. Error = %d\n",
+ __func__, error);
+ return error;
+ }
+ __set_bit(NFS_CS_CALLBACK, &clp->cl_res_state);
+
return 0;
}



2014-11-11 20:30:09

by Or Gerlitz

Subject: Re: [PATCH v2 07/10] xprtrdma: Display async errors

On Tue, Nov 11, 2014 at 8:49 PM, Sagi Grimberg <[email protected]> wrote:
> On 11/11/2014 6:52 PM, Chuck Lever wrote:
>>
>>
>>> On Nov 11, 2014, at 8:30 AM, Sagi Grimberg <[email protected]>
>>> wrote:
>>>
>>>> On 11/9/2014 3:15 AM, Chuck Lever wrote:
>>>> An async error upcall is a hard error, and should be reported in
>>>> the system log.
>>>>
>>>
>>> Could be useful to others... Any chance you put this in ib_core for all
>>> of us?
>>
>>
>> Eventually. We certainly wouldn't want copies of this array of strings
>> to appear many times in the kernel. That would be a waste of space.
>>
>> I have a similar patch that adds an array for CQ status codes, and
>> xprtrdma has a string array already for connection status. Are those
>> also interesting?
>>
>
> Yep, also RDMA_CM events. Would certainly help people avoid source
> code navigation to understand what is going on...

Oh yes, Chuck, it would be good if you could pick this up. As far as I
remember, most of the strings are already in the RDS code (net/rds) -
please refactor them from there into some IB core helpers. Thanks a lot!!

2014-11-10 14:36:06

by Anna Schumaker

Subject: Re: [PATCH v2 06/10] xprtrdma: Enable pad optimization

Hey Chuck,


On 11/08/2014 08:14 PM, Chuck Lever wrote:
> The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
> pad optimization was enabled. That bug was fixed by commit
> e560e3b510d2 ("svcrdma: Add zero padding if the client doesn't send
> it").

Do we need to worry about backwards compatibility with servers that don't have this patch?

Anna



2014-11-09 01:14:39

by Chuck Lever

Subject: [PATCH v2 04/10] xprtrdma: Refactor tasklet scheduling

Restore the separate function that schedules the reply handling
tasklet. I need to call it from two different paths.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 3c88276..478b2fd 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -107,6 +107,17 @@ rpcrdma_run_tasklet(unsigned long data)
static DECLARE_TASKLET(rpcrdma_tasklet_g, rpcrdma_run_tasklet, 0UL);

static void
+rpcrdma_schedule_tasklet(struct list_head *sched_list)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
+ list_splice_tail(sched_list, &rpcrdma_tasklets_g);
+ spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
+ tasklet_schedule(&rpcrdma_tasklet_g);
+}
+
+static void
rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;
@@ -244,7 +255,6 @@ rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
struct list_head sched_list;
struct ib_wc *wcs;
int budget, count, rc;
- unsigned long flags;

INIT_LIST_HEAD(&sched_list);
budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
@@ -262,10 +272,7 @@ rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
rc = 0;

out_schedule:
- spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
- list_splice_tail(&sched_list, &rpcrdma_tasklets_g);
- spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
- tasklet_schedule(&rpcrdma_tasklet_g);
+ rpcrdma_schedule_tasklet(&sched_list);
return rc;
}



2014-11-10 17:18:33

by Trond Myklebust

Subject: Re: [PATCH v2 09/10] NFS: SETCLIENTID XDR buffer sizes are incorrect

On Mon, Nov 10, 2014 at 10:22 AM, Anna Schumaker
<[email protected]> wrote:
> Hi Chuck,
>
> It looks like this patch and the next one are general client changes, so they might have to be submitted to Trond directly rather than going through my tree. Trond, what do you think?

Ditto for the iostat locking change (PATCH 08/10). None of those are
RDMA-related.

Should I just cherry-pick them from this series?

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]

2014-11-11 18:49:32

by Sagi Grimberg

Subject: Re: [PATCH v2 07/10] xprtrdma: Display async errors

On 11/11/2014 6:52 PM, Chuck Lever wrote:
>
>> On Nov 11, 2014, at 8:30 AM, Sagi Grimberg <[email protected]> wrote:
>>
>>> On 11/9/2014 3:15 AM, Chuck Lever wrote:
>>> An async error upcall is a hard error, and should be reported in
>>> the system log.
>>>
>>
>> Could be useful to others... Any chance you put this in ib_core for all
>> of us?
>
> Eventually. We certainly wouldn't want copies of this array of strings
> to appear many times in the kernel. That would be a waste of space.
>
> I have a similar patch that adds an array for CQ status codes, and
> xprtrdma has a string array already for connection status. Are those
> also interesting?
>

Yep, also RDMA_CM events. Would certainly help people avoid source
code navigation to understand what is going on...
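
A minimal sketch of the kind of shared helper being discussed here
(hypothetical: the names, placement, and exact string table are
illustrative, and no such ib_core API existed at the time of this
thread):

	static const char * const ib_events[] = {
		[IB_EVENT_CQ_ERR]	= "CQ error",
		[IB_EVENT_QP_FATAL]	= "QP fatal error",
		[IB_EVENT_QP_REQ_ERR]	= "QP request error",
		/* ... one entry per enum ib_event_type value ... */
		[IB_EVENT_GID_CHANGE]	= "GID change",
	};

	const char *ib_event_msg(enum ib_event_type event)
	{
		/* Fall back to a generic string for values the table
		 * does not cover, mirroring xprtrdma's ASYNC_MSG(). */
		return (event < ARRAY_SIZE(ib_events) && ib_events[event]) ?
			ib_events[event] : "unrecognized event";
	}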

2014-11-09 01:14:14

by Chuck Lever

Subject: [PATCH v2 01/10] xprtrdma: Return an errno from rpcrdma_register_external()

The RPC/RDMA send_request method and the chunk registration code
expect an errno from the registration function. This allows
the upper layers to distinguish between a recoverable failure
(for example, temporary memory exhaustion) and a hard failure
(for example, a bug in the registration logic).
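
A sketch of what this enables at a call site (hypothetical caller code;
-EIO comes from the default arm in the patch below, and other negative
errnos are passed through from the underlying registration call):

	rc = rpcrdma_register_external(seg, nsegs, writing, r_xprt);
	if (rc == -ENOMEM)
		return rc;	/* recoverable: the RPC can be retried */
	if (rc < 0)
		return rc;	/* hard failure, e.g. -EIO */
	/* success: rc is the number of segments registered */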

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 61c4129..6ea2942 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1918,10 +1918,10 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
break;

default:
- return -1;
+ return -EIO;
}
if (rc)
- return -1;
+ return rc;

return nsegs;
}


2014-11-11 16:52:48

by Chuck Lever

Subject: Re: [PATCH v2 07/10] xprtrdma: Display async errors


> On Nov 11, 2014, at 8:30 AM, Sagi Grimberg <[email protected]> wrote:
>
>> On 11/9/2014 3:15 AM, Chuck Lever wrote:
>> An async error upcall is a hard error, and should be reported in
>> the system log.
>>
>
> Could be useful to others... Any chance you put this in ib_core for all
> of us?

Eventually. We certainly wouldn't want copies of this array of strings
to appear many times in the kernel. That would be a waste of space.

I have a similar patch that adds an array for CQ status codes, and
xprtrdma has a string array already for connection status. Are those
also interesting?

2014-11-10 15:22:58

by Anna Schumaker

Subject: Re: [PATCH v2 09/10] NFS: SETCLIENTID XDR buffer sizes are incorrect

Hi Chuck,

It looks like this patch and the next one are general client changes, so they might have to be submitted to Trond directly rather than going through my tree. Trond, what do you think?

Anna



2014-11-09 01:14:47

by Chuck Lever

Subject: [PATCH v2 05/10] xprtrdma: Re-write rpcrdma_flush_cqs()

Currently rpcrdma_flush_cqs() attempts to avoid code duplication,
and simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall.

1. rpcrdma_flush_cqs() can run concurrently with provider upcalls.
Both flush_cqs() and the upcalls were invoking ib_poll_cq() in
different threads using the same wc buffers (ep->rep_recv_wcs
and ep->rep_send_wcs), added by commit 1c00dd077654 ("xprtrmda:
Reduce calls to ib_poll_cq() in completion handlers").

During transport disconnect processing, this sometimes resulted
in the same reply getting added to the rpcrdma_tasklets_g list
more than once, which corrupted the list.

2. The upcall functions drain only a limited number of CQEs,
thanks to the poll budget added by commit 8301a2c047cc
("xprtrdma: Limit work done by completion handler"). A single
invocation can therefore leave CQEs behind, so the CQs are never
fully flushed.

Fixes: a7bc211ac926 ("xprtrdma: On disconnect, don't ignore ... ")
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=276
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 478b2fd..e6ac964 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -317,8 +317,15 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
static void
rpcrdma_flush_cqs(struct rpcrdma_ep *ep)
{
- rpcrdma_recvcq_upcall(ep->rep_attr.recv_cq, ep);
- rpcrdma_sendcq_upcall(ep->rep_attr.send_cq, ep);
+ struct ib_wc wc;
+ LIST_HEAD(sched_list);
+
+ while (ib_poll_cq(ep->rep_attr.recv_cq, 1, &wc) > 0)
+ rpcrdma_recvcq_process_wc(&wc, &sched_list);
+ if (!list_empty(&sched_list))
+ rpcrdma_schedule_tasklet(&sched_list);
+ while (ib_poll_cq(ep->rep_attr.send_cq, 1, &wc) > 0)
+ rpcrdma_sendcq_process_wc(&wc);
}

#ifdef RPC_DEBUG


2014-11-09 01:14:23

by Chuck Lever

Subject: [PATCH v2 02/10] xprtrdma: Cap req_cqinit

Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.

Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.

The number of Work Requests that are allowed to remain unsignaled is
determined by the value of req_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.

For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).

For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.

Some providers hold back dispatching Work Requests until a CQE is
generated. If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.

I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.

Cap the rep_cqinit setting so completions are not left turned off
for too long.
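
To make the arithmetic concrete (a worked example using the defaults
cited above; RPCRDMA_MAX_UNSIGNALED_SENDS is the cap introduced below):

	/* send WQ depth:  32 concurrent RPCs * 7 WRs each = 224 entries
	 * old trigger:    224 / 2 - 1 = 111 unsignaled send WRs
	 * new trigger:    min(111, RPCRDMA_MAX_UNSIGNALED_SENDS) = 32
	 */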

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c     |    4 +++-
net/sunrpc/xprtrdma/xprt_rdma.h |    6 ++++++
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6ea2942..af45cf3 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -733,7 +733,9 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,

/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
- if (ep->rep_cqinit <= 2)
+ if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS)
+ ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS;
+ else if (ep->rep_cqinit <= 2)
ep->rep_cqinit = 0;
INIT_CQCOUNT(ep);
ep->rep_ia = ia;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ac7fc9a..b799041 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -97,6 +97,12 @@ struct rpcrdma_ep {
struct ib_wc rep_recv_wcs[RPCRDMA_POLLSIZE];
};

+/*
+ * Force a signaled SEND Work Request every so often,
+ * in case the provider needs to do some housekeeping.
+ */
+#define RPCRDMA_MAX_UNSIGNALED_SENDS (32)
+
#define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
#define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)



2014-11-09 01:14:55

by Chuck Lever

Subject: [PATCH v2 06/10] xprtrdma: Enable pad optimization

The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
pad optimization was enabled. That bug was fixed by commit
e560e3b510d2 ("svcrdma: Add zero padding if the client doesn't send
it").

We can now enable pad optimization on the client, which helps
performance and is supported now by both Linux and Solaris servers.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index cfe9a81..8ed2576 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -73,7 +73,7 @@ static unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_inline_write_padding;
static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
- int xprt_rdma_pad_optimize = 0;
+ int xprt_rdma_pad_optimize = 1;

#ifdef RPC_DEBUG



2014-11-09 01:15:20

by Chuck Lever

Subject: [PATCH v2 09/10] NFS: SETCLIENTID XDR buffer sizes are incorrect

Use the correct calculation of the maximum size of a clientaddr4
when encoding and decoding SETCLIENTID operations. clientaddr4 is
defined in section 2.2.10 of RFC3530bis-31.

The usage in encode_setclientid_maxsz is missing the 4-byte length
in both strings, but is otherwise correct. decode_setclientid_maxsz
simply asks for a page of receive buffer space, which is
unnecessarily large (more than 4KB).

Note that a SETCLIENTID reply is either clientid+verifier, or
clientaddr4, depending on the returned NFS status. It doesn't
hurt to allocate enough space for both.
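
For reference, the sizing rule being applied (a sketch; XDR_QUADLEN is
the existing helper from include/linux/sunrpc/xdr.h):

	/* A variable-length XDR opaque costs one 4-byte length word
	 * plus its data rounded up to a multiple of 4 bytes:
	 */
	#define XDR_QUADLEN(l)	(((l) + 3) >> 2)	/* bytes to XDR words */

	/* Thus each clientaddr4 string contributes:
	 *	r_netid: 1 + XDR_QUADLEN(RPCBIND_MAXNETIDLEN)
	 *	r_addr:  1 + XDR_QUADLEN(RPCBIND_MAXUADDRLEN)
	 * which is the "1 +" this patch adds to both macros.
	 */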

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfs/nfs4xdr.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 206c08a..f8afa67 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -141,13 +141,15 @@ static int nfs4_stat_to_errno(int);
XDR_QUADLEN(NFS4_VERIFIER_SIZE) + \
XDR_QUADLEN(NFS4_SETCLIENTID_NAMELEN) + \
1 /* sc_prog */ + \
- XDR_QUADLEN(RPCBIND_MAXNETIDLEN) + \
- XDR_QUADLEN(RPCBIND_MAXUADDRLEN) + \
+ 1 + XDR_QUADLEN(RPCBIND_MAXNETIDLEN) + \
+ 1 + XDR_QUADLEN(RPCBIND_MAXUADDRLEN) + \
1) /* sc_cb_ident */
#define decode_setclientid_maxsz \
(op_decode_hdr_maxsz + \
- 2 + \
- 1024) /* large value for CLID_INUSE */
+ 2 /* clientid */ + \
+ XDR_QUADLEN(NFS4_VERIFIER_SIZE) + \
+ 1 + XDR_QUADLEN(RPCBIND_MAXNETIDLEN) + \
+ 1 + XDR_QUADLEN(RPCBIND_MAXUADDRLEN))
#define encode_setclientid_confirm_maxsz \
(op_encode_hdr_maxsz + \
3 + (NFS4_VERIFIER_SIZE >> 2))


2014-11-09 01:15:11

by Chuck Lever

Subject: [PATCH v2 08/10] SUNRPC: serialize iostats updates

Occasionally mountstats reports a negative retransmission rate.
Ensure that two RPCs completing concurrently don't confuse the sums
in the transport's op_metrics array.

Since pNFS filelayout can invoke rpc_count_iostats() on another
transport from xprt_release(), we can't rely on simply holding the
transport_lock in xprt_release(). There's nothing for it but hard
serialization. One spin lock per RPC operation should make this as
painless as it can be.
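
To illustrate the race being closed (an interleaving sketch, not code
from this patch):

	/* Two RPCs for the same procedure complete concurrently:
	 *
	 *	CPU0: load  om_ops (N)		CPU1: load  om_ops (N)
	 *	CPU0: store N + 1		CPU1: store N + 1
	 *
	 * One increment is lost, sums in op_metrics drift, and derived
	 * values such as the retransmission rate can go negative.
	 * Holding om_lock across the whole update makes each RPC's
	 * contribution atomic with respect to other completions.
	 */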

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/metrics.h |    3 +++
net/sunrpc/stats.c             |   21 ++++++++++++++++-----
2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
index 1565bbe..eecb5a7 100644
--- a/include/linux/sunrpc/metrics.h
+++ b/include/linux/sunrpc/metrics.h
@@ -27,10 +27,13 @@

#include <linux/seq_file.h>
#include <linux/ktime.h>
+#include <linux/spinlock.h>

#define RPC_IOSTATS_VERS "1.0"

struct rpc_iostats {
+ spinlock_t om_lock;
+
/*
* These counters give an idea about how many request
* transmissions are required, on average, to complete that
diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index 5453049..9711a15 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -116,7 +116,15 @@ EXPORT_SYMBOL_GPL(svc_seq_show);
*/
struct rpc_iostats *rpc_alloc_iostats(struct rpc_clnt *clnt)
{
- return kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
+ struct rpc_iostats *stats;
+ int i;
+
+ stats = kcalloc(clnt->cl_maxproc, sizeof(*stats), GFP_KERNEL);
+ if (stats) {
+ for (i = 0; i < clnt->cl_maxproc; i++)
+ spin_lock_init(&stats[i].om_lock);
+ }
+ return stats;
}
EXPORT_SYMBOL_GPL(rpc_alloc_iostats);

@@ -135,20 +143,21 @@ EXPORT_SYMBOL_GPL(rpc_free_iostats);
* rpc_count_iostats - tally up per-task stats
* @task: completed rpc_task
* @stats: array of stat structures
- *
- * Relies on the caller for serialization.
*/
void rpc_count_iostats(const struct rpc_task *task, struct rpc_iostats *stats)
{
struct rpc_rqst *req = task->tk_rqstp;
struct rpc_iostats *op_metrics;
- ktime_t delta;
+ ktime_t delta, now;

if (!stats || !req)
return;

+ now = ktime_get();
op_metrics = &stats[task->tk_msg.rpc_proc->p_statidx];

+ spin_lock(&op_metrics->om_lock);
+
op_metrics->om_ops++;
op_metrics->om_ntrans += req->rq_ntrans;
op_metrics->om_timeouts += task->tk_timeouts;
@@ -161,8 +170,10 @@ void rpc_count_iostats(const struct rpc_task *task, struct rpc_iostats *stats)

op_metrics->om_rtt = ktime_add(op_metrics->om_rtt, req->rq_rtt);

- delta = ktime_sub(ktime_get(), task->tk_start);
+ delta = ktime_sub(now, task->tk_start);
op_metrics->om_execute = ktime_add(op_metrics->om_execute, delta);
+
+ spin_unlock(&op_metrics->om_lock);
}
EXPORT_SYMBOL_GPL(rpc_count_iostats);



2014-11-10 14:54:36

by Chuck Lever

Subject: Re: [PATCH v2 06/10] xprtrdma: Enable pad optimization


On Nov 10, 2014, at 8:36 AM, Anna Schumaker <[email protected]> wrote:

> Hey Chuck,
>
>
> On 11/08/2014 08:14 PM, Chuck Lever wrote:
>> The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
>> pad optimization was enabled. That bug was fixed by commit
>> e560e3b510d2 ("svcrdma: Add zero padding if the client doesn't send
>> it").
>
> Do we need to worry about backwards compatibility with servers that don't have this patch?

My impression is that we have a window where the server is assumed not
to work and thus is not enabled in distributions, and that therefore
changes like this are allowed. I could be wrong. Bruce, any guidance
on this?

In any event, if things break, they break immediately, and the fix is
simply to set this feature flag via /proc.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2014-11-11 14:30:43

by Sagi Grimberg

Subject: Re: [PATCH v2 07/10] xprtrdma: Display async errors

On 11/9/2014 3:15 AM, Chuck Lever wrote:
> An async error upcall is a hard error, and should be reported in
> the system log.
>

Could be useful to others... Any chance you put this in ib_core for all
of us?

Sagi.

2014-11-10 15:05:34

by J. Bruce Fields

Subject: Re: [PATCH v2 06/10] xprtrdma: Enable pad optimization

On Mon, Nov 10, 2014 at 08:54:27AM -0600, Chuck Lever wrote:
>
> On Nov 10, 2014, at 8:36 AM, Anna Schumaker <[email protected]> wrote:
>
> > Hey Chuck,
> >
> >
> > On 11/08/2014 08:14 PM, Chuck Lever wrote:
> >> The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
> >> pad optimization was enabled. That bug was fixed by commit
> >> e560e3b510d2 ("svcrdma: Add zero padding if the client doesn't send
> >> it").
> >
> > Do we need to worry about backwards compatibility with servers that don't have this patch?
>
> My impression is that we have a window where the server is assumed not
> to work and thus is not enabled in distributions, and that therefore
> changes like this are allowed. I could be wrong. Bruce, any guidance
> on this?
>
> In any event, if things break, they break immediately, and the fix is
> simply to set this feature flag via /proc.

I don't think there's any hard-and-fast rule here, but my impression is
that nfs/rdma still isn't widely used, even less so with Linux knfsd, so
on balance it's probably not worth much to humor older servers.

--b.


2014-11-09 21:43:51

by Chuck Lever

Subject: Re: [PATCH v2 02/10] xprtrdma: Cap req_cqinit


On Nov 9, 2014, at 5:13 AM, Sagi Grimberg <[email protected]> wrote:

> Hey Chuck,
>
> As you know, I've seen this issue too...
> Looking into this is definitely on my todo list.
>
> Does this happen if you run a simple dd (single request-response inflight)?

Hi Sagi-

I typically run dbench, iozone, and xfstests when preparing patches for
upstream. The generic/113 test I mention in the patch description is the
only test where I saw this issue. I expect single-thread won't drive
enough Work Queue activity to push the provider into WQ overflow.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com