2024-05-05 12:49:18

by Dan Aloni

[permalink] [raw]
Subject: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
underflow.

The specific scenario that caused this to happen is IB device bonding,
when bringing down one of the ports, or all ports. The situation is not
just a print - it also causes a non-recoverable state it is not even
possible to complete the disconnect and shut it down the mount,
requiring a reboot, suggesting that tear-down is also incomplete in this
state.

The trivial fix seems to work as such - if we did not receive a
`RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
`rpcrdma_xprt_drain` kills the EP prematurely in from the context of
`rpcrdma_xprt_disconnect`.

Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Example crash:

rpcrdma: removing device mlx5_3 for 172.21.208.2:20049
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_>
isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx>
CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G W OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022
Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
RIP: 0010:refcount_warn_saturate+0xba/0x110
Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7
RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027
RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0
RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff
R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000
R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928
FS: 0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
rpcrdma_ep_put+0x42/0x50 [rpcrdma]
rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma]
xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma]
process_one_work+0x1e5/0x3c0
? rescuer_thread+0x3a0/0x3a0
worker_thread+0x50/0x3b0
? rescuer_thread+0x3a0/0x3a0
kthread+0xd6/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>

Signed-off-by: Dan Aloni <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 5 ++++-
net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4f8d7efa469f..19996515da94 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -250,6 +250,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
goto disconnected;
case RDMA_CM_EVENT_ESTABLISHED:
rpcrdma_ep_get(ep);
+ ep->re_connect_ref = true;
ep->re_connect_status = 1;
rpcrdma_update_cm_private(ep, &event->param.conn);
trace_xprtrdma_inline_thresh(ep);
@@ -272,7 +273,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
ep->re_connect_status = -ECONNABORTED;
disconnected:
rpcrdma_force_disconnect(ep);
- return rpcrdma_ep_put(ep);
+ if (ep->re_connect_ref)
+ return rpcrdma_ep_put(ep);
+ return 0;
default:
break;
}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index da409450dfc0..1553ef69a844 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -84,6 +84,7 @@ struct rpcrdma_ep {
unsigned int re_max_inline_recv;
int re_async_rc;
int re_connect_status;
+ bool re_connect_ref;
atomic_t re_receiving;
atomic_t re_force_disconnect;
struct ib_qp_init_attr re_attr;
--
2.39.3



2024-05-05 16:24:32

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> underflow.
>
> The specific scenario that caused this to happen is IB device bonding,
> when bringing down one of the ports, or all ports. The situation is not
> just a print - it also causes a non-recoverable state it is not even
> possible to complete the disconnect and shut it down the mount,
> requiring a reboot, suggesting that tear-down is also incomplete in this
> state.
>
> The trivial fix seems to work as such - if we did not receive a
> `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> `rpcrdma_xprt_disconnect`.
>
> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Hi Dan, thanks for the bug report!

In the future, note that the Fixes: tag needs to go down by the
Signed-off-by: tag. But I wonder if 2acc5 introduced this issue --
ep reference counting seems to have been added before this commit.

Second, I wonder if, when bonding interfaces, there's an opportunity
for the verbs consumer to take an additional transport reference.
Cc'ing linux-rdma for input on that issue. Can you show a brief
diagram of when the ep reference count changes when bonding?

Also, I note that the WARNING below happened on a RHEL9 kernel:

5.14.0-284.11.1.el9_2.x86_64

Can you confirm that this issue reproduces on v6.8 and newer ?


> Example crash:
>
> rpcrdma: removing device mlx5_3 for 172.21.208.2:20049
> ------------[ cut here ]------------
> refcount_t: underflow; use-after-free.
> WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_>
> isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx>
> CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G W OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
> Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022
> Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
> RIP: 0010:refcount_warn_saturate+0xba/0x110
> Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7
> RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027
> RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0
> RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff
> R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000
> R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928
> FS: 0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> rpcrdma_ep_put+0x42/0x50 [rpcrdma]
> rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma]
> xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma]
> process_one_work+0x1e5/0x3c0
> ? rescuer_thread+0x3a0/0x3a0
> worker_thread+0x50/0x3b0
> ? rescuer_thread+0x3a0/0x3a0
> kthread+0xd6/0x100
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x1f/0x30
> </TASK>
>
> Signed-off-by: Dan Aloni <[email protected]>
> ---
> net/sunrpc/xprtrdma/verbs.c | 5 ++++-
> net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 4f8d7efa469f..19996515da94 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -250,6 +250,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> goto disconnected;
> case RDMA_CM_EVENT_ESTABLISHED:
> rpcrdma_ep_get(ep);
> + ep->re_connect_ref = true;
> ep->re_connect_status = 1;
> rpcrdma_update_cm_private(ep, &event->param.conn);
> trace_xprtrdma_inline_thresh(ep);
> @@ -272,7 +273,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> ep->re_connect_status = -ECONNABORTED;
> disconnected:
> rpcrdma_force_disconnect(ep);
> - return rpcrdma_ep_put(ep);
> + if (ep->re_connect_ref)
> + return rpcrdma_ep_put(ep);
> + return 0;
> default:
> break;
> }
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index da409450dfc0..1553ef69a844 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -84,6 +84,7 @@ struct rpcrdma_ep {
> unsigned int re_max_inline_recv;
> int re_async_rc;
> int re_connect_status;
> + bool re_connect_ref;
> atomic_t re_receiving;
> atomic_t re_force_disconnect;
> struct ib_qp_init_attr re_attr;
> --
> 2.39.3
>

--
Chuck Lever

2024-05-05 18:36:38

by Dan Aloni

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

On 2024-05-05 10:35:46, Chuck Lever wrote:
> On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > underflow.
> >
> > The specific scenario that caused this to happen is IB device bonding,
> > when bringing down one of the ports, or all ports. The situation is not
> > just a print - it also causes a non-recoverable state it is not even
> > possible to complete the disconnect and shut it down the mount,
> > requiring a reboot, suggesting that tear-down is also incomplete in this
> > state.
> >
> > The trivial fix seems to work as such - if we did not receive a
> > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > `rpcrdma_xprt_disconnect`.
> >
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
>
> Hi Dan, thanks for the bug report!
>
[..]
> Second, I wonder if, when bonding interfaces, there's an opportunity
> for the verbs consumer to take an additional transport reference.
> Cc'ing linux-rdma for input on that issue. Can you show a brief
> diagram of when the ep reference count changes when bonding?

Not sure we need an additional reference here. I understand regarding rpcrdma
managing its internal EP refcount, that it is having three in total: one for
init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
and I found some more about it, please read on.

> Also, I note that the WARNING below happened on a RHEL9 kernel:
>
> 5.14.0-284.11.1.el9_2.x86_64
>
> Can you confirm that this issue reproduces on v6.8 and newer ?

Some context: I originally tested with a version that also has implemented a
timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
'culprit' in the negative decref, but there is the other issue and second order
effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
the unkillable transport, so we are not reaching teardown at all. With both of
these changes applied both problems are gone.

As for upstream version testing, with the elrepo build of 6.8.9 which matches
vanilla, I don't see the negative kref but I do see the other effect of
`XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
which does not release even after ports are back up online.

Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
on bonding, I'll post in a reply.

```
INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000
Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
Call Trace:
<TASK>
__schedule+0x21c/0x680
schedule+0x31/0xd0
schedule_timeout+0x148/0x160
? mutex_lock+0xe/0x30
? cma_process_remove+0x218/0x260 [rdma_cm]
? preempt_count_add+0x70/0xa0
__wait_for_common+0x90/0x1e0
? __pfx_schedule_timeout+0x10/0x10
cma_remove_one+0x5c/0xd0 [rdma_cm]
remove_client_context+0x88/0xd0 [ib_core]
disable_device+0x8a/0x160 [ib_core]
? _raw_spin_unlock+0x15/0x30
__ib_unregister_device+0x42/0xb0 [ib_core]
ib_unregister_device+0x22/0x30 [ib_core]
mlx5r_remove+0x36/0x60 [mlx5_ib]
auxiliary_bus_remove+0x18/0x30
device_release_driver_internal+0x193/0x200
bus_remove_device+0xbf/0x130
device_del+0x157/0x3e0
? devl_param_driverinit_value_get+0x29/0xa0
mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
mlx5_do_bond+0x2cb/0x390 [mlx5_core]
mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
process_one_work+0x174/0x340
worker_thread+0x27e/0x390
? __pfx_worker_thread+0x10/0x10
kthread+0xee/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2d/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
INFO: task fio:10584 blocked for more than 122 seconds.
Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
```

--
Dan Aloni

2024-05-05 18:38:40

by Dan Aloni

[permalink] [raw]
Subject: [PATCH] rpcrdma: decref EP only if ESTABLISHED and handle DEVICE_REMOVAL

Under the scenario of IB device bonding, when bringing down one of the
ports, or all ports, we saw xprtrdma entering a non-recoverable state
where it is not even possible to complete the disconnect and shut it
down the mount, requiring a reboot.

If a DEVICE_REMOVAL happened, it may be irrespective of whether the
CM_ID is connected, and ESTABLISHED may not have happened, so we need
to avoid a decref, plus make sure connect path is woken up.

Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
Signed-off-by: Dan Aloni <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 9 +++++++--
net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4f8d7efa469f..43d7d6604c30 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -244,12 +244,15 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
case RDMA_CM_EVENT_DEVICE_REMOVAL:
pr_info("rpcrdma: removing device %s for %pISpc\n",
ep->re_id->device->name, sap);
- fallthrough;
+ ep->re_connect_status = -ENODEV;
+ wake_up_all(&ep->re_connect_wait);
+ goto disconnected;
case RDMA_CM_EVENT_ADDR_CHANGE:
ep->re_connect_status = -ENODEV;
goto disconnected;
case RDMA_CM_EVENT_ESTABLISHED:
rpcrdma_ep_get(ep);
+ ep->re_connect_ref = true;
ep->re_connect_status = 1;
rpcrdma_update_cm_private(ep, &event->param.conn);
trace_xprtrdma_inline_thresh(ep);
@@ -272,7 +275,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
ep->re_connect_status = -ECONNABORTED;
disconnected:
rpcrdma_force_disconnect(ep);
- return rpcrdma_ep_put(ep);
+ if (ep->re_connect_ref)
+ return rpcrdma_ep_put(ep);
+ return 0;
default:
break;
}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index da409450dfc0..1553ef69a844 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -84,6 +84,7 @@ struct rpcrdma_ep {
unsigned int re_max_inline_recv;
int re_async_rc;
int re_connect_status;
+ bool re_connect_ref;
atomic_t re_receiving;
atomic_t re_force_disconnect;
struct ib_qp_init_attr re_attr;
--
2.39.3


2024-05-05 19:00:44

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> On 2024-05-05 10:35:46, Chuck Lever wrote:
> > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > > underflow.
> > >
> > > The specific scenario that caused this to happen is IB device bonding,
> > > when bringing down one of the ports, or all ports. The situation is not
> > > just a print - it also causes a non-recoverable state it is not even
> > > possible to complete the disconnect and shut it down the mount,
> > > requiring a reboot, suggesting that tear-down is also incomplete in this
> > > state.
> > >
> > > The trivial fix seems to work as such - if we did not receive a
> > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > > `rpcrdma_xprt_disconnect`.
> > >
> > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> >
> > Hi Dan, thanks for the bug report!
> >
> [..]
> > Second, I wonder if, when bonding interfaces, there's an opportunity
> > for the verbs consumer to take an additional transport reference.
> > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > diagram of when the ep reference count changes when bonding?
>
> Not sure we need an additional reference here.

We already have two mechanisms in play:

- the ep reference count
- the re_connect_status value

I would prefer not to add a boolean here. Seems like
re_connect_status could do that job if we need it to.

I seem to recall that when a device is removed, the verbs consumer
doesn't get a DISCONNECT first. Or does it? Having a sequence of
CM events that you see in the bonded case would help a lot.


> I understand regarding rpcrdma
> managing its internal EP refcount, that it is having three in total: one for
> init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
> so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
> and I found some more about it, please read on.
>
> > Also, I note that the WARNING below happened on a RHEL9 kernel:
> >
> > 5.14.0-284.11.1.el9_2.x86_64
> >
> > Can you confirm that this issue reproduces on v6.8 and newer ?
>
> Some context: I originally tested with a version that also has implemented a
> timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
> 'culprit' in the negative decref, but there is the other issue and second order
> effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
> the unkillable transport, so we are not reaching teardown at all. With both of
> these changes applied both problems are gone.
>
> As for upstream version testing, with the elrepo build of 6.8.9 which matches
> vanilla, I don't see the negative kref but I do see the other effect of
> `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
> which does not release even after ports are back up online.
>
> Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> on bonding, I'll post in a reply.

It looks like you are trying to fix several issues in one patch. So
I need you to separate these issues and the fixes, and let's focus
on the upstream kernel (v6.9-rc6) because there's nothing I can do
about the RHEL9 kernel, which is clearly a different source base
than the one I work on.

If we need a "wake_up_all(&ep->re_connect_wait);" during
DEVICE_REMOVAL, that should be a separate patch. And you need to
figure out if ADDR_CHANGE needs the same treatment: the v2 you just
sent changes the behavior of ADDR_CHANGE too but does not mention
whether it intended that change.

Without a reproducer or a detailed explanation of how the ep
reference count changes in step with CM events, I can't properly
assess your proposed fix.


> ```
> INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
> Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000
> Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
> Call Trace:
> <TASK>
> __schedule+0x21c/0x680
> schedule+0x31/0xd0
> schedule_timeout+0x148/0x160
> ? mutex_lock+0xe/0x30
> ? cma_process_remove+0x218/0x260 [rdma_cm]
> ? preempt_count_add+0x70/0xa0
> __wait_for_common+0x90/0x1e0
> ? __pfx_schedule_timeout+0x10/0x10
> cma_remove_one+0x5c/0xd0 [rdma_cm]
> remove_client_context+0x88/0xd0 [ib_core]
> disable_device+0x8a/0x160 [ib_core]
> ? _raw_spin_unlock+0x15/0x30
> __ib_unregister_device+0x42/0xb0 [ib_core]
> ib_unregister_device+0x22/0x30 [ib_core]
> mlx5r_remove+0x36/0x60 [mlx5_ib]
> auxiliary_bus_remove+0x18/0x30
> device_release_driver_internal+0x193/0x200
> bus_remove_device+0xbf/0x130
> device_del+0x157/0x3e0
> ? devl_param_driverinit_value_get+0x29/0xa0
> mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
> mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
> mlx5_do_bond+0x2cb/0x390 [mlx5_core]
> mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
> process_one_work+0x174/0x340
> worker_thread+0x27e/0x390
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xee/0x120
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x2d/0x50
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
> INFO: task fio:10584 blocked for more than 122 seconds.
> Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
> ```
>
> --
> Dan Aloni

--
Chuck Lever

2024-05-05 20:10:45

by Dan Aloni

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

On 2024-05-05 15:00:10, Chuck Lever wrote:
> On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> > On 2024-05-05 10:35:46, Chuck Lever wrote:
> > [..]
> > > Second, I wonder if, when bonding interfaces, there's an opportunity
> > > for the verbs consumer to take an additional transport reference.
> > > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > > diagram of when the ep reference count changes when bonding?
> >
> > Not sure we need an additional reference here.
>
> We already have two mechanisms in play:
>
> - the ep reference count
> - the re_connect_status value
>
> I would prefer not to add a boolean here. Seems like
> re_connect_status could do that job if we need it to.

That's possible. So before overrides e.g. `ep->re_connect_status =
-ENODEV` we would need to remember whether it was `1` and across the
call to `rpcrdma_force_disconnect()`, to do the putref() after it.
Anyway, the patch will be a big larger I guess.

>[..]
> > Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> > for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> > on bonding, I'll post in a reply.
>
> It looks like you are trying to fix several issues in one patch. So
> I need you to separate these issues and the fixes, and let's focus
> on the upstream kernel (v6.9-rc6) because there's nothing I can do
> about the RHEL9 kernel, which is clearly a different source base
> than the one I work on.
>
> If we need a "wake_up_all(&ep->re_connect_wait);" during
> DEVICE_REMOVAL, that should be a separate patch. And you need to
> figure out if ADDR_CHANGE needs the same treatment: the v2 you just
> sent changes the behavior of ADDR_CHANGE too but does not mention
> whether it intended that change.

Was it just the `rpcrdma_ep_put` change in this case? Will feel less
comfortable changing it without a repro of the `ADDR_CHANGE` case. Going
to isolate it.

> Without a reproducer or a detailed explanation of how the ep
> reference count changes in step with CM events, I can't properly
> assess your proposed fix.

With 6.8 I only see a single `RDMA_CM_EVENT_DEVICE_REMOVAL` event
arriving, driving a single `rpcrdma_ep_put` call to put the reference
back at 1, and it waits forever. Getting late here, I'll check on 6.9-rc
tomorrow.

Maybe for a repro, you can generate a single interface bond? I haven't
tried, but possibly it would occur on a single one too if the underlying
port is down.

--
Dan Aloni