2020-06-20 17:19:48

by Dan Aloni

Subject: [PATCH] xprtrdma: Wake up re_connect_wait on disconnect

rpcrdma_xprt_connect() runs in workqueue context, so when a connection attempt
does not succeed, something needs to wake it up. I observed this when the CM
callback received `RDMA_CM_EVENT_REJECTED` and `rpcrdma_xprt_connect()` slept
forever.

This continues the fix in commit 58bd6656f808 ('xprtrdma: Restore wake-up-all to
rpcrdma_cm_event_handler()').

Signed-off-by: Dan Aloni <[email protected]>
CC: Chuck Lever <[email protected]>
---

Notes:
Hi Chuck,

Maybe I missed something, as it is not clear to me how, without this patch,
re_connect_wait can otherwise be woken up in this situation. Please explain?

net/sunrpc/xprtrdma/verbs.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 2ae348377806..8bd76a47a91f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -289,6 +289,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
 		ep->re_connect_status = -ECONNABORTED;
 disconnected:
 		xprt_force_disconnect(xprt);
+		wake_up_all(&ep->re_connect_wait);
 		return rpcrdma_ep_destroy(ep);
 	default:
 		break;
--
2.25.4


2020-06-20 18:48:31

by Chuck Lever

Subject: Re: [PATCH] xprtrdma: Wake up re_connect_wait on disconnect

Hi Dan-

> On Jun 20, 2020, at 1:18 PM, Dan Aloni <[email protected]> wrote:
>
> rpcrdma_xprt_connect() runs in workqueue context, so when a connection attempt
> does not succeed, something needs to wake it up. I observed this when the CM
> callback received `RDMA_CM_EVENT_REJECTED` and `rpcrdma_xprt_connect()` slept
> forever.

Interesting. My development and testing generates plenty of REJECTED connection
requests, but I never saw this particular failure mode.


> This continues the fix in commit 58bd6656f808 ('xprtrdma: Restore wake-up-all to
> rpcrdma_cm_event_handler()').

The patch looks sensible. I'll pull it into my test harness.


> Signed-off-by: Dan Aloni <[email protected]>
> CC: Chuck Lever <[email protected]>
> ---
>
> Notes:
> Hi Chuck,
>
> Maybe I missed something, as it is not clear to me how, without this patch,
> re_connect_wait can otherwise be woken up in this situation. Please explain?
>
> net/sunrpc/xprtrdma/verbs.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 2ae348377806..8bd76a47a91f 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -289,6 +289,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
>  		ep->re_connect_status = -ECONNABORTED;
>  disconnected:
>  		xprt_force_disconnect(xprt);
> +		wake_up_all(&ep->re_connect_wait);
>  		return rpcrdma_ep_destroy(ep);
>  	default:
>  		break;
> --
> 2.25.4
>

--
Chuck Lever



2020-06-21 14:53:10

by Chuck Lever

Subject: Re: [PATCH] xprtrdma: Wake up re_connect_wait on disconnect

Hi Dan-

> On Jun 20, 2020, at 2:46 PM, Chuck Lever <[email protected]> wrote:
>
> Hi Dan-
>
>> On Jun 20, 2020, at 1:18 PM, Dan Aloni <[email protected]> wrote:
>>
>> rpcrdma_xprt_connect() runs in workqueue context, so when a connection attempt
>> does not succeed, something needs to wake it up. I observed this when the CM
>> callback received `RDMA_CM_EVENT_REJECTED` and `rpcrdma_xprt_connect()` slept
>> forever.
>
> Interesting. My development and testing generates plenty of REJECTED connection
> requests, but I never saw this particular failure mode.

Correction: My testing _used_ _to_ generate REJECTED events regularly. It does
not seem to any more, even after client crashes. So that explains why I haven't
seen this before.

I haven't reproduced the problem here, but the fix still looks proper to me,
and doesn't appear to introduce any regressions. I do have some issues with your
proposed patch, though.

The first paragraph of the patch description is incorrect. RDMA_CM_EVENT_DISCONNECTED
can occur only once a connection has been established. That guarantees there are no
waiters on re_connect_wait in that case. It's connect errors that need to wake up
the connect worker.
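
To make the wait/wake relationship concrete, here is a minimal userspace
sketch, assuming POSIX threads stand in for the kernel wait queue. It is not
the xprtrdma code: `struct ep_model`, `ep_model_event()`, `ep_model_wait()` and
`run_rejected_connect()` are hypothetical names, and the error value is chosen
only for illustration. The point it shows is that whoever records a failed
connect status must also wake the waiter, otherwise the waiter sleeps forever.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for the endpoint's connect state. */
struct ep_model {
	pthread_mutex_t lock;
	pthread_cond_t connect_wait;	/* models ep->re_connect_wait */
	int connect_status;		/* 0 = pending, otherwise the outcome */
};

static void ep_model_init(struct ep_model *ep)
{
	pthread_mutex_init(&ep->lock, NULL);
	pthread_cond_init(&ep->connect_wait, NULL);
	ep->connect_status = 0;
}

/* Models the event handler: record the outcome, then wake every waiter.
 * Dropping the broadcast reproduces the hang described in this thread. */
static void ep_model_event(struct ep_model *ep, int status)
{
	pthread_mutex_lock(&ep->lock);
	ep->connect_status = status;
	pthread_cond_broadcast(&ep->connect_wait);
	pthread_mutex_unlock(&ep->lock);
}

/* Models the connect worker: sleep until the status leaves "pending". */
static int ep_model_wait(struct ep_model *ep)
{
	int status;

	pthread_mutex_lock(&ep->lock);
	while (ep->connect_status == 0)
		pthread_cond_wait(&ep->connect_wait, &ep->lock);
	status = ep->connect_status;
	pthread_mutex_unlock(&ep->lock);
	return status;
}

/* Thread body that delivers a (simulated) connection rejection. */
static void *reject_thread(void *arg)
{
	ep_model_event(arg, -ECONNREFUSED);
	return NULL;
}

/* Runs one rejected connect attempt; returns the status the waiter saw. */
int run_rejected_connect(void)
{
	struct ep_model ep;
	pthread_t t;
	int rc, status;

	ep_model_init(&ep);
	rc = pthread_create(&t, NULL, reject_thread, &ep);
	assert(rc == 0);
	status = ep_model_wait(&ep);
	pthread_join(t, NULL);
	return status;
}
```

Calling `run_rejected_connect()` returns once the simulated rejection has been
delivered. The waiter re-checks the status under the lock, so it does not
matter whether the event arrives before or after the waiter goes to sleep.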


>> This continues the fix in commit 58bd6656f808 ('xprtrdma: Restore wake-up-all to
>> rpcrdma_cm_event_handler()').

IMO this paragraph needs to be replaced by:

Fixes: e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")


>> Signed-off-by: Dan Aloni <[email protected]>
>> CC: Chuck Lever <[email protected]>
>> ---
>>
>> Notes:
>> Hi Chuck,
>>
>> Maybe I missed something, as it is not clear to me how, without this patch,
>> re_connect_wait can otherwise be woken up in this situation. Please explain?
>>
>> net/sunrpc/xprtrdma/verbs.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 2ae348377806..8bd76a47a91f 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -289,6 +289,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
>>  		ep->re_connect_status = -ECONNABORTED;
>>  disconnected:
>>  		xprt_force_disconnect(xprt);
>> +		wake_up_all(&ep->re_connect_wait);
>>  		return rpcrdma_ep_destroy(ep);
>>  	default:
>>  		break;

This hunk does not apply on top of fixes I've already sent to Anna for 5.8-rc1.

So, if you don't object, I'll adjust your patch (this hunk and the description)
before sending it along to Anna.


--
Chuck Lever



2020-06-21 15:11:47

by Dan Aloni

Subject: Re: [PATCH] xprtrdma: Wake up re_connect_wait on disconnect

On Sun, Jun 21, 2020 at 10:49:53AM -0400, Chuck Lever wrote:
> >> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> >> index 2ae348377806..8bd76a47a91f 100644
> >> --- a/net/sunrpc/xprtrdma/verbs.c
> >> +++ b/net/sunrpc/xprtrdma/verbs.c
> >> @@ -289,6 +289,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> >>  		ep->re_connect_status = -ECONNABORTED;
> >>  disconnected:
> >>  		xprt_force_disconnect(xprt);
> >> +		wake_up_all(&ep->re_connect_wait);
> >>  		return rpcrdma_ep_destroy(ep);
> >>  	default:
> >>  		break;
>
> This hunk does not apply on top of fixes I've already sent to Anna for 5.8-rc1.
>
> So, if you don't object, I'll adjust your patch (this hunk and the description)
> before sending it along to Anna.

Sure, go ahead. Thanks for working on this!

--
Dan Aloni