Hi Anna-
Here is the first set of NFS/RDMA client patches for v4.15. These
close a window where an async RPC that never receives a Reply can
terminate but leave its memory invalidation to happen later.
I don't think there was much exposure here: hard async RPCs
typically just hang anyway, since they can't be interrupted.
Please consider these patches for v4.15.
---
Chuck Lever (3):
xprtrdma: Don't defer fencing an async RPC's chunks
xprtrdma: Use ro_unmap_sync in xprt_rdma_send_request
xprtrdma: Remove ro_unmap_safe
net/sunrpc/xprtrdma/fmr_ops.c | 19 -------------------
net/sunrpc/xprtrdma/frwr_ops.c | 19 -------------------
net/sunrpc/xprtrdma/transport.c | 5 +++--
net/sunrpc/xprtrdma/xprt_rdma.h | 2 --
4 files changed, 3 insertions(+), 42 deletions(-)
--
Chuck Lever
In current kernels, waiting in xprt_release appears to be safe. I
had erroneously believed that, for async RPCs, waiting of any kind
in xprt_release->xprt_rdma_free would result in deadlock. I've done
injection testing and consulted with Trond to confirm that waiting
in the RPC release path is safe.
For the very few cases where RPC resources haven't already been
released by the reply handler, it is safe to wait synchronously in
xprt_rdma_free for invalidation rather than deferring it to MR
recovery.
Note: When the QP is in the error state, posting a LocalInvalidate
should flush and mark the MR as bad. There is no way the remote HCA
can access that MR via a QP in the error state, so the memory it
protects is effectively already fenced and thus safe for the Upper
Layer to access. The next time the MR is used, it should be
recognized and cleaned up properly by frwr_op_map.
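
To make that concrete, here is a minimal user-space sketch of the
recovery idea; the states and names below are an illustrative model,
not the actual rpcrdma structures or the frwr_op_map code:

#include <stdio.h>

/* Hypothetical model of the MR lifecycle described above. */
enum mr_state { MR_INVALID, MR_VALID, MR_FLUSHED };

struct fake_mr {
	enum mr_state state;
};

/* A LocalInvalidate posted on an error-state QP flushes rather
 * than completing cleanly, leaving the MR marked bad.
 */
static void local_invalidate(struct fake_mr *mr, int qp_in_error)
{
	mr->state = qp_in_error ? MR_FLUSHED : MR_INVALID;
}

/* The next map recognizes a flushed MR and resets it before reuse. */
static void map_mr(struct fake_mr *mr)
{
	if (mr->state == MR_FLUSHED) {
		printf("resetting flushed MR before reuse\n");
		mr->state = MR_INVALID;
	}
	mr->state = MR_VALID;
}

int main(void)
{
	struct fake_mr mr = { .state = MR_VALID };

	local_invalidate(&mr, 1);	/* QP was in the error state */
	map_mr(&mr);			/* recovered on next use */
	return 0;
}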
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index c84e2b6..8cf5ccf 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -686,7 +686,7 @@
dprintk("RPC: %s: called on 0x%p\n", __func__, req->rl_reply);
if (!list_empty(&req->rl_registered))
- ia->ri_ops->ro_unmap_safe(r_xprt, req, !RPC_IS_ASYNC(task));
+ ia->ri_ops->ro_unmap_sync(r_xprt, &req->rl_registered);
rpcrdma_unmap_sges(ia, req);
rpcrdma_buffer_put(req);
}
Clean up: There are no remaining callers of this method.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 19 -------------------
net/sunrpc/xprtrdma/frwr_ops.c | 19 -------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 2 --
3 files changed, 40 deletions(-)
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 6c71513..30bf713 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -305,28 +305,9 @@ enum {
}
}
-/* Use a slow, safe mechanism to invalidate all memory regions
- * that were registered for "req".
- */
-static void
-fmr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
- bool sync)
-{
- struct rpcrdma_mw *mw;
-
- while (!list_empty(&req->rl_registered)) {
- mw = rpcrdma_pop_mw(&req->rl_registered);
- if (sync)
- fmr_op_recover_mr(mw);
- else
- rpcrdma_defer_mr_recovery(mw);
- }
-}
-
const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap_sync = fmr_op_unmap_sync,
- .ro_unmap_safe = fmr_op_unmap_safe,
.ro_recover_mr = fmr_op_recover_mr,
.ro_open = fmr_op_open,
.ro_maxpages = fmr_op_maxpages,
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index df062e0..3053fb0 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -558,28 +558,9 @@
goto unmap;
}
-/* Use a slow, safe mechanism to invalidate all memory regions
- * that were registered for "req".
- */
-static void
-frwr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
- bool sync)
-{
- struct rpcrdma_mw *mw;
-
- while (!list_empty(&req->rl_registered)) {
- mw = rpcrdma_pop_mw(&req->rl_registered);
- if (sync)
- frwr_op_recover_mr(mw);
- else
- rpcrdma_defer_mr_recovery(mw);
- }
-}
-
const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap_sync = frwr_op_unmap_sync,
- .ro_unmap_safe = frwr_op_unmap_safe,
.ro_recover_mr = frwr_op_recover_mr,
.ro_open = frwr_op_open,
.ro_maxpages = frwr_op_maxpages,
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index e26a97d..74e0174 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -473,8 +473,6 @@ struct rpcrdma_memreg_ops {
struct rpcrdma_mw **);
void (*ro_unmap_sync)(struct rpcrdma_xprt *,
struct list_head *);
- void (*ro_unmap_safe)(struct rpcrdma_xprt *,
- struct rpcrdma_req *, bool);
void (*ro_recover_mr)(struct rpcrdma_mw *);
int (*ro_open)(struct rpcrdma_ia *,
struct rpcrdma_ep *,
The "safe" version of ro_unmap is used here to avoid waiting
unnecessarily. However:
- It is safe to wait. After all, we have to wait anyway when using
FMR to register memory.
- This case is rare: it occurs only after a reconnect.
By switching this call site to ro_unmap_sync, the final use of
ro_unmap_safe is removed.
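
For reference, "waiting" here means the usual post-then-wait pattern:
post the invalidation, then block until its completion fires. Below is
a minimal user-space model of that pattern; the pthread machinery
stands in for the kernel's completion API, and none of these names are
the rpcrdma implementation:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int done;
};

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = 1;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Simulates the invalidation completing asynchronously. */
static void *invalidate_done(void *arg)
{
	sleep(1);
	complete(arg);
	return NULL;
}

int main(void)
{
	struct completion c = {
		PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
	};
	pthread_t t;

	/* "Post the LocalInvalidate", then wait synchronously. */
	pthread_create(&t, NULL, invalidate_done, &c);
	wait_for_completion(&c);
	pthread_join(&t, NULL);
	printf("invalidation complete; memory is fenced\n");
	return 0;
}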
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8cf5ccf..eb46d24 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -728,7 +728,8 @@
/* On retransmit, remove any previously registered chunks */
if (unlikely(!list_empty(&req->rl_registered)))
- r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
+ r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt,
+ &req->rl_registered);
rc = rpcrdma_marshal_req(r_xprt, rqst);
if (rc < 0)
Hi Chuck,
On 10/09/2017 12:03 PM, Chuck Lever wrote:
> The "safe" version of ro_unmap is used here to avoid waiting
> unnecessarily. However:
>
> - It is safe to wait. After all, we have to wait anyway when using
> FMR to register memory.
>
> - This case is rare: it occurs only after a reconnect.
I'm just double-checking that this really is safe. I'm seeing the hung task killer running more often after applying this patch:
[ 245.376879] INFO: task kworker/0:5:193 blocked for more than 120 seconds.
[ 245.379909] Not tainted 4.14.0-rc4-ANNA+ #10934
[ 245.382001] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 245.385926] kworker/0:5 D 0 193 2 0x80000080
[ 245.388148] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 245.389386] Call Trace:
[ 245.390070] __schedule+0x23d/0x850
[ 245.390673] schedule+0x46/0xa0
[ 245.391219] schedule_timeout+0x230/0x480
[ 245.391817] ? ib_send_mad+0x32d/0x430
[ 245.392387] ? __slab_alloc.isra.24+0x3e/0x60
[ 245.392988] wait_for_common+0xc9/0x190
[ 245.393574] ? wait_for_common+0xc9/0x190
[ 245.395290] ? wake_up_q+0x90/0x90
[ 245.396239] wait_for_completion+0x30/0x40
[ 245.397279] frwr_op_unmap_sync+0x139/0x2a0 [rpcrdma]
[ 245.398401] ? call_decode+0x830/0x830 [sunrpc]
[ 245.399435] xprt_rdma_send_request+0xdc/0xf0 [rpcrdma]
[ 245.400558] xprt_transmit+0x7f/0x370 [sunrpc]
[ 245.401500] ? call_decode+0x830/0x830 [sunrpc]
[ 245.402079] call_transmit+0x1c4/0x2a0 [sunrpc]
[ 245.402652] ? call_decode+0x830/0x830 [sunrpc]
[ 245.403194] __rpc_execute+0x92/0x430 [sunrpc]
[ 245.403737] ? rpc_wake_up+0x7e/0x90 [sunrpc]
[ 245.404264] rpc_async_schedule+0x25/0x30 [sunrpc]
[ 245.404796] process_one_work+0x1ed/0x430
[ 245.405276] worker_thread+0x45/0x3f0
[ 245.405730] kthread+0x134/0x150
[ 245.406144] ? process_one_work+0x430/0x430
[ 245.406627] ? kthread_create_on_node+0x80/0x80
[ 245.407114] ret_from_fork+0x25/0x30
I see this when running cthon tests over NFS v3 with kvm / softroce configured on my client and server. I'm willing to accept that the problem might be on my end, but I still want to check with you first :)
Thanks,
Anna
>
> By switching this call site to ro_unmap_sync, the final use of
> ro_unmap_safe is removed.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/transport.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 8cf5ccf..eb46d24 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -728,7 +728,8 @@
>
> /* On retransmit, remove any previously registered chunks */
> if (unlikely(!list_empty(&req->rl_registered)))
> - r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
> + r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt,
> + &req->rl_registered);
>
> rc = rpcrdma_marshal_req(r_xprt, rqst);
> if (rc < 0)
>
> On Oct 13, 2017, at 1:43 PM, Anna Schumaker <[email protected]> wrote:
>
> Hi Chuck,
>
> On 10/09/2017 12:03 PM, Chuck Lever wrote:
>> The "safe" version of ro_unmap is used here to avoid waiting
>> unnecessarily. However:
>>
>> - It is safe to wait. After all, we have to wait anyway when using
>> FMR to register memory.
>>
>> - This case is rare: it occurs only after a reconnect.
>
> I'm just double-checking that this really is safe. I'm seeing the hung task killer running more often after applying this patch:
>
> [ 245.376879] INFO: task kworker/0:5:193 blocked for more than 120 seconds.
> [ 245.379909] Not tainted 4.14.0-rc4-ANNA+ #10934
> [ 245.382001] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 245.385926] kworker/0:5 D 0 193 2 0x80000080
> [ 245.388148] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [ 245.389386] Call Trace:
> [ 245.390070] __schedule+0x23d/0x850
> [ 245.390673] schedule+0x46/0xa0
> [ 245.391219] schedule_timeout+0x230/0x480
> [ 245.391817] ? ib_send_mad+0x32d/0x430
> [ 245.392387] ? __slab_alloc.isra.24+0x3e/0x60
> [ 245.392988] wait_for_common+0xc9/0x190
> [ 245.393574] ? wait_for_common+0xc9/0x190
> [ 245.395290] ? wake_up_q+0x90/0x90
> [ 245.396239] wait_for_completion+0x30/0x40
> [ 245.397279] frwr_op_unmap_sync+0x139/0x2a0 [rpcrdma]
> [ 245.398401] ? call_decode+0x830/0x830 [sunrpc]
> [ 245.399435] xprt_rdma_send_request+0xdc/0xf0 [rpcrdma]
> [ 245.400558] xprt_transmit+0x7f/0x370 [sunrpc]
> [ 245.401500] ? call_decode+0x830/0x830 [sunrpc]
> [ 245.402079] call_transmit+0x1c4/0x2a0 [sunrpc]
> [ 245.402652] ? call_decode+0x830/0x830 [sunrpc]
> [ 245.403194] __rpc_execute+0x92/0x430 [sunrpc]
> [ 245.403737] ? rpc_wake_up+0x7e/0x90 [sunrpc]
> [ 245.404264] rpc_async_schedule+0x25/0x30 [sunrpc]
> [ 245.404796] process_one_work+0x1ed/0x430
> [ 245.405276] worker_thread+0x45/0x3f0
> [ 245.405730] kthread+0x134/0x150
> [ 245.406144] ? process_one_work+0x430/0x430
> [ 245.406627] ? kthread_create_on_node+0x80/0x80
> [ 245.407114] ret_from_fork+0x25/0x30
>
> I see this when running cthon tests over NFS v3 with kvm / softroce configured on my client and server. I'm willing to accept that the problem might be on my end, but I still want to check with you first :)
My guess is that this is because your RDMA layer is dropping the
connection frequently, so you're hitting this case a lot. If
LocalInvalidation is taking a long time, that's a bug in the
provider, I'd say.
But I'm just speculating. You'll need to look into this and figure
out what's going on. ftrace with the function_graph tracer should
give some impression of where the time is being spent.
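
One way to do that, assuming debugfs is mounted in the usual place
(and picking frwr_op_unmap_sync as the entry point, since that's
where your trace above shows the wait):

  # cd /sys/kernel/debug/tracing
  # echo frwr_op_unmap_sync > set_graph_function
  # echo function_graph > current_tracer
  # cat trace_pipe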
> Thanks,
> Anna
>
>>
>> By switching this call site to ro_unmap_sync, the final use of
>> ro_unmap_safe is removed.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/transport.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>> index 8cf5ccf..eb46d24 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -728,7 +728,8 @@
>>
>> /* On retransmit, remove any previously registered chunks */
>> if (unlikely(!list_empty(&req->rl_registered)))
>> - r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
>> + r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt,
>> + &req->rl_registered);
>>
>> rc = rpcrdma_marshal_req(r_xprt, rqst);
>> if (rc < 0)
>>
--
Chuck Lever
> On Oct 13, 2017, at 1:55 PM, Chuck Lever <[email protected]> wrote:
>
>>
>> On Oct 13, 2017, at 1:43 PM, Anna Schumaker <[email protected]> wrote:
>>
>> Hi Chuck,
>>
>> On 10/09/2017 12:03 PM, Chuck Lever wrote:
>>> The "safe" version of ro_unmap is used here to avoid waiting
>>> unnecessarily. However:
>>>
>>> - It is safe to wait. After all, we have to wait anyway when using
>>> FMR to register memory.
>>>
>>> - This case is rare: it occurs only after a reconnect.
>>
>> I'm just double-checking that this really is safe. I'm seeing the hung task killer running more often after applying this patch:
>>
>> [ 245.376879] INFO: task kworker/0:5:193 blocked for more than 120 seconds.
>> [ 245.379909] Not tainted 4.14.0-rc4-ANNA+ #10934
>> [ 245.382001] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 245.385926] kworker/0:5 D 0 193 2 0x80000080
>> [ 245.388148] Workqueue: rpciod rpc_async_schedule [sunrpc]
>> [ 245.389386] Call Trace:
>> [ 245.390070] __schedule+0x23d/0x850
>> [ 245.390673] schedule+0x46/0xa0
>> [ 245.391219] schedule_timeout+0x230/0x480
>> [ 245.391817] ? ib_send_mad+0x32d/0x430
>> [ 245.392387] ? __slab_alloc.isra.24+0x3e/0x60
>> [ 245.392988] wait_for_common+0xc9/0x190
>> [ 245.393574] ? wait_for_common+0xc9/0x190
>> [ 245.395290] ? wake_up_q+0x90/0x90
>> [ 245.396239] wait_for_completion+0x30/0x40
>> [ 245.397279] frwr_op_unmap_sync+0x139/0x2a0 [rpcrdma]
>> [ 245.398401] ? call_decode+0x830/0x830 [sunrpc]
>> [ 245.399435] xprt_rdma_send_request+0xdc/0xf0 [rpcrdma]
>> [ 245.400558] xprt_transmit+0x7f/0x370 [sunrpc]
>> [ 245.401500] ? call_decode+0x830/0x830 [sunrpc]
>> [ 245.402079] call_transmit+0x1c4/0x2a0 [sunrpc]
>> [ 245.402652] ? call_decode+0x830/0x830 [sunrpc]
>> [ 245.403194] __rpc_execute+0x92/0x430 [sunrpc]
>> [ 245.403737] ? rpc_wake_up+0x7e/0x90 [sunrpc]
>> [ 245.404264] rpc_async_schedule+0x25/0x30 [sunrpc]
>> [ 245.404796] process_one_work+0x1ed/0x430
>> [ 245.405276] worker_thread+0x45/0x3f0
>> [ 245.405730] kthread+0x134/0x150
>> [ 245.406144] ? process_one_work+0x430/0x430
>> [ 245.406627] ? kthread_create_on_node+0x80/0x80
>> [ 245.407114] ret_from_fork+0x25/0x30
>>
>> I see this when running cthon tests over NFS v3 with kvm / softroce configured on my client and server. I'm willing to accept that the problem might be on my end, but I still want to check with you first :)
>
> My guess is that this is because your RDMA layer is dropping the
> connection frequently, so you're hitting this case a lot. If
> LocalInvalidation is taking a long time, that's a bug in the
> provider, I'd say.
>
> But I'm just speculating. You'll need to look into this and figure
> out what's going on. ftrace with the function_graph tracer should
> give some impression of where the time is being spent.
Reproduced here. Trying to figure out why the client is not
reconnecting.
>> Thanks,
>> Anna
>>
>>>
>>> By switching this call site to ro_unmap_sync, the final use of
>>> ro_unmap_safe is removed.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> net/sunrpc/xprtrdma/transport.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>> index 8cf5ccf..eb46d24 100644
>>> --- a/net/sunrpc/xprtrdma/transport.c
>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>> @@ -728,7 +728,8 @@
>>>
>>> /* On retransmit, remove any previously registered chunks */
>>> if (unlikely(!list_empty(&req->rl_registered)))
>>> - r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
>>> + r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt,
>>> + &req->rl_registered);
>>>
>>> rc = rpcrdma_marshal_req(r_xprt, rqst);
>>> if (rc < 0)
>>>
>
> --
> Chuck Lever
--
Chuck Lever