2024-05-06 09:39:44

by Dan Aloni

[permalink] [raw]
Subject: [PATCH v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL

Under the scenario of IB device bonding, when bringing down one of the
ports, or all ports, we saw xprtrdma entering a non-recoverable state
where it is not even possible to complete the disconnect and shut it
down the mount, requiring a reboot. Following debug, we saw that
transport connect never ended after receiving the
RDMA_CM_EVENT_DEVICE_REMOVAL callback.

The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
connected, and ESTABLISHED may not have happened. So need to work with
each of these states accordingly.

Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
Cc: Sagi Grimberg <[email protected]>
Signed-off-by: Dan Aloni <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4f8d7efa469f..432557a553e7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
case RDMA_CM_EVENT_DEVICE_REMOVAL:
pr_info("rpcrdma: removing device %s for %pISpc\n",
ep->re_id->device->name, sap);
- fallthrough;
+ switch (xchg(&ep->re_connect_status, -ENODEV)) {
+ case 0: goto wake_connect_worker;
+ case 1: goto disconnected;
+ }
+ return 0;
case RDMA_CM_EVENT_ADDR_CHANGE:
ep->re_connect_status = -ENODEV;
goto disconnected;
--
2.39.3



2024-05-06 15:10:00

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL



On 06/05/2024 12:37, Dan Aloni wrote:
> Under the scenario of IB device bonding, when bringing down one of the
> ports, or all ports, we saw xprtrdma entering a non-recoverable state
> where it is not even possible to complete the disconnect and shut it
> down the mount, requiring a reboot. Following debug, we saw that
> transport connect never ended after receiving the
> RDMA_CM_EVENT_DEVICE_REMOVAL callback.
>
> The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> connected, and ESTABLISHED may not have happened. So need to work with
> each of these states accordingly.
>
> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Is this actually the offending commit ?

commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an
NFS mount")
is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting
that the
cm_id may not be ESTABLISHED (where we need to wake the connect waiter?

Question though, in DEVICE_REMOVAL the device is going away as soon as the
cm handler callback returns. Shouldn't nfs release all the device
resources (related to this
cm_id)? afaict it was changed in:
e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")

The patch itself looks reasonable (although I do think that the rdma
stack expects the
ulp to have the rdma resources released when the callback returns).

FWIW in nvme we avoided the problem altogether by registering an
ib_client that is
called on .remove() and its a separate context that doesn't have all the
intricacies with
rdma_cm...

2024-05-06 15:56:14

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL

On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote:
>
> On 06/05/2024 12:37, Dan Aloni wrote:
> > Under the scenario of IB device bonding, when bringing down one of the
> > ports, or all ports, we saw xprtrdma entering a non-recoverable state
> > where it is not even possible to complete the disconnect and shut it
> > down the mount, requiring a reboot. Following debug, we saw that
> > transport connect never ended after receiving the
> > RDMA_CM_EVENT_DEVICE_REMOVAL callback.
> >
> > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> > connected, and ESTABLISHED may not have happened. So need to work with
> > each of these states accordingly.
> >
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
>
> Is this actually the offending commit ?
>
> commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS
> mount")
> is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting that
> the
> cm_id may not be ESTABLISHED (where we need to wake the connect waiter?

I'd be OK with discussing possible culprits in the patch description
but leaving off a Fixes: tag for now.

It would be reasonable to demand that the proposed fix be applied
to each LTS kernel and tested individually to ensure there are no
side-effects or pre-requisites.


> Question though, in DEVICE_REMOVAL the device is going away as soon as the
> cm handler callback returns. Shouldn't nfs release all the device resources
> (related to this
> cm_id)? afaict it was changed in:
> e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")

In the case where a DEVICE_REMOVAL event fires and a connection
hasn't yet been established, my guess is the ep reference count will
go to zero when rpcrdma_ep_put() is called.


> The patch itself looks reasonable (although I do think that the rdma stack
> expects the
> ulp to have the rdma resources released when the callback returns).

Thanks for the review! Should we add:

Reviewed-by: Sagi Grimberg <[email protected]>


> FWIW in nvme we avoided the problem altogether by registering an ib_client
> that is
> called on .remove() and its a separate context that doesn't have all the
> intricacies with
> rdma_cm...

I looked at ib_client, years ago, and thought it would be a lot of
added complexity. With a code sample (NVMe host) maybe I can put
something together.


--
Chuck Lever

2024-05-06 19:10:22

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL



On 06/05/2024 18:55, Chuck Lever wrote:
> On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote:
>> On 06/05/2024 12:37, Dan Aloni wrote:
>>> Under the scenario of IB device bonding, when bringing down one of the
>>> ports, or all ports, we saw xprtrdma entering a non-recoverable state
>>> where it is not even possible to complete the disconnect and shut it
>>> down the mount, requiring a reboot. Following debug, we saw that
>>> transport connect never ended after receiving the
>>> RDMA_CM_EVENT_DEVICE_REMOVAL callback.
>>>
>>> The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
>>> connected, and ESTABLISHED may not have happened. So need to work with
>>> each of these states accordingly.
>>>
>>> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
>> Is this actually the offending commit ?
>>
>> commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS
>> mount")
>> is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting that
>> the
>> cm_id may not be ESTABLISHED (where we need to wake the connect waiter?
> I'd be OK with discussing possible culprits in the patch description
> but leaving off a Fixes: tag for now.
Agreed.
>
> It would be reasonable to demand that the proposed fix be applied
> to each LTS kernel and tested individually to ensure there are no
> side-effects or pre-requisites.
>
>
>> Question though, in DEVICE_REMOVAL the device is going away as soon as the
>> cm handler callback returns. Shouldn't nfs release all the device resources
>> (related to this
>> cm_id)? afaict it was changed in:
>> e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
> In the case where a DEVICE_REMOVAL event fires and a connection
> hasn't yet been established, my guess is the ep reference count will
> go to zero when rpcrdma_ep_put() is called.

Yes, I was actually referring to the case where the connection was
established.
It looks like rpcrdma_force_disconnect -> xprt_force_disconnect
schedules async
work to tear things down no?

>
>
>> The patch itself looks reasonable (although I do think that the rdma stack
>> expects the
>> ulp to have the rdma resources released when the callback returns).
> Thanks for the review! Should we add:
>
> Reviewed-by: Sagi Grimberg <[email protected]>

Yes.

>
>
>> FWIW in nvme we avoided the problem altogether by registering an ib_client
>> that is
>> called on .remove() and its a separate context that doesn't have all the
>> intricacies with
>> rdma_cm...
> I looked at ib_client, years ago, and thought it would be a lot of
> added complexity. With a code sample (NVMe host) maybe I can put
> something together.
>
>

The plus is that there is no need to handle the DEVICE_REMOVAL cm event,
which is
always nice...

2024-05-06 20:06:23

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL

On Mon, May 06, 2024 at 12:37:59PM +0300, Dan Aloni wrote:
> Under the scenario of IB device bonding, when bringing down one of the
> ports, or all ports, we saw xprtrdma entering a non-recoverable state
> where it is not even possible to complete the disconnect and shut it
> down the mount, requiring a reboot. Following debug, we saw that
> transport connect never ended after receiving the
> RDMA_CM_EVENT_DEVICE_REMOVAL callback.
>
> The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> connected, and ESTABLISHED may not have happened. So need to work with
> each of these states accordingly.
>
> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> Cc: Sagi Grimberg <[email protected]>
> Signed-off-by: Dan Aloni <[email protected]>
> ---
> net/sunrpc/xprtrdma/verbs.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 4f8d7efa469f..432557a553e7 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> case RDMA_CM_EVENT_DEVICE_REMOVAL:
> pr_info("rpcrdma: removing device %s for %pISpc\n",
> ep->re_id->device->name, sap);
> - fallthrough;
> + switch (xchg(&ep->re_connect_status, -ENODEV)) {
> + case 0: goto wake_connect_worker;
> + case 1: goto disconnected;
> + }
> + return 0;
> case RDMA_CM_EVENT_ADDR_CHANGE:
> ep->re_connect_status = -ENODEV;
> goto disconnected;
> --
> 2.39.3
>

Hi Anna,

Please apply this patch with:

Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chuck Lever <[email protected]>


--
Chuck Lever

2024-05-06 20:09:13

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL

On Mon, May 06, 2024 at 10:10:12PM +0300, Sagi Grimberg wrote:
>
> On 06/05/2024 18:55, Chuck Lever wrote:
> > On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote:
> > > Question though, in DEVICE_REMOVAL the device is going away as soon as the
> > > cm handler callback returns. Shouldn't nfs release all the device resources
> > > (related to this
> > > cm_id)? afaict it was changed in:
> > > e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
> > In the case where a DEVICE_REMOVAL event fires and a connection
> > hasn't yet been established, my guess is the ep reference count will
> > go to zero when rpcrdma_ep_put() is called.
>
> Yes, I was actually referring to the case where the connection was
> established.
> It looks like rpcrdma_force_disconnect -> xprt_force_disconnect schedules
> async
> work to tear things down no?

Yes because we can't rip out the hardware resources while the
RPC client is still using the transport. Converting to use an
ib_client might help by first triggering a disconnect, done
with proper serialization.


> > > FWIW in nvme we avoided the problem altogether by registering an ib_client
> > > that is
> > > called on .remove() and its a separate context that doesn't have all the
> > > intricacies with
> > > rdma_cm...
> > I looked at ib_client, years ago, and thought it would be a lot of
> > added complexity. With a code sample (NVMe host) maybe I can put
> > something together.
>
> The plus is that there is no need to handle the DEVICE_REMOVAL cm event,
> which is always nice...

I'll have a look, thanks for the suggestion.


--
Chuck Lever

2024-05-06 20:25:38

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL

On Mon, 2024-05-06 at 16:06 -0400, Chuck Lever wrote:
> On Mon, May 06, 2024 at 12:37:59PM +0300, Dan Aloni wrote:
> > Under the scenario of IB device bonding, when bringing down one of
> > the
> > ports, or all ports, we saw xprtrdma entering a non-recoverable
> > state
> > where it is not even possible to complete the disconnect and shut
> > it
> > down the mount, requiring a reboot. Following debug, we saw that
> > transport connect never ended after receiving the
> > RDMA_CM_EVENT_DEVICE_REMOVAL callback.
> >
> > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> > connected, and ESTABLISHED may not have happened. So need to work
> > with
> > each of these states accordingly.
> >
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep
> > after it is freed')
> > Cc: Sagi Grimberg <[email protected]>
> > Signed-off-by: Dan Aloni <[email protected]>
> > ---
> >  net/sunrpc/xprtrdma/verbs.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/sunrpc/xprtrdma/verbs.c
> > b/net/sunrpc/xprtrdma/verbs.c
> > index 4f8d7efa469f..432557a553e7 100644
> > --- a/net/sunrpc/xprtrdma/verbs.c
> > +++ b/net/sunrpc/xprtrdma/verbs.c
> > @@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id
> > *id, struct rdma_cm_event *event)
> >   case RDMA_CM_EVENT_DEVICE_REMOVAL:
> >   pr_info("rpcrdma: removing device %s for
> > %pISpc\n",
> >   ep->re_id->device->name, sap);
> > - fallthrough;
> > + switch (xchg(&ep->re_connect_status, -ENODEV)) {
> > + case 0: goto wake_connect_worker;
> > + case 1: goto disconnected;
> > + }
> > + return 0;
> >   case RDMA_CM_EVENT_ADDR_CHANGE:
> >   ep->re_connect_status = -ENODEV;
> >   goto disconnected;
> > --
> > 2.39.3
> >
>
> Hi Anna,
>
> Please apply this patch with:
>
> Reviewed-by: Sagi Grimberg <[email protected]>
> Reviewed-by: Chuck Lever <[email protected]>
>
>
Anna is back on leave for a few weeks, so I'll take it.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]