2024-05-05 12:49:18

by Dan Aloni

[permalink] [raw]
Subject: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
underflow.

The specific scenario that caused this to happen is IB device bonding,
when bringing down one of the ports, or all ports. The situation is not
just a print - it also causes a non-recoverable state it is not even
possible to complete the disconnect and shut it down the mount,
requiring a reboot, suggesting that tear-down is also incomplete in this
state.

The trivial fix seems to work as such - if we did not receive a
`RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
`rpcrdma_xprt_drain` kills the EP prematurely in from the context of
`rpcrdma_xprt_disconnect`.

Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Example crash:

rpcrdma: removing device mlx5_3 for 172.21.208.2:20049
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_>
isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx>
CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G W OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022
Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
RIP: 0010:refcount_warn_saturate+0xba/0x110
Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7
RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027
RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0
RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff
R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000
R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928
FS: 0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
rpcrdma_ep_put+0x42/0x50 [rpcrdma]
rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma]
xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma]
process_one_work+0x1e5/0x3c0
? rescuer_thread+0x3a0/0x3a0
worker_thread+0x50/0x3b0
? rescuer_thread+0x3a0/0x3a0
kthread+0xd6/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>

Signed-off-by: Dan Aloni <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 5 ++++-
net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4f8d7efa469f..19996515da94 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -250,6 +250,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
goto disconnected;
case RDMA_CM_EVENT_ESTABLISHED:
rpcrdma_ep_get(ep);
+ ep->re_connect_ref = true;
ep->re_connect_status = 1;
rpcrdma_update_cm_private(ep, &event->param.conn);
trace_xprtrdma_inline_thresh(ep);
@@ -272,7 +273,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
ep->re_connect_status = -ECONNABORTED;
disconnected:
rpcrdma_force_disconnect(ep);
- return rpcrdma_ep_put(ep);
+ if (ep->re_connect_ref)
+ return rpcrdma_ep_put(ep);
+ return 0;
default:
break;
}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index da409450dfc0..1553ef69a844 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -84,6 +84,7 @@ struct rpcrdma_ep {
unsigned int re_max_inline_recv;
int re_async_rc;
int re_connect_status;
+ bool re_connect_ref;
atomic_t re_receiving;
atomic_t re_force_disconnect;
struct ib_qp_init_attr re_attr;
--
2.39.3



2024-05-05 16:24:32

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> underflow.
>
> The specific scenario that caused this to happen is IB device bonding,
> when bringing down one of the ports, or all ports. The situation is not
> just a print - it also causes a non-recoverable state it is not even
> possible to complete the disconnect and shut it down the mount,
> requiring a reboot, suggesting that tear-down is also incomplete in this
> state.
>
> The trivial fix seems to work as such - if we did not receive a
> `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> `rpcrdma_xprt_disconnect`.
>
> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Hi Dan, thanks for the bug report!

In the future, note that the Fixes: tag needs to go down by the
Signed-off-by: tag. But I wonder if 2acc5 introduced this issue --
ep reference counting seems to have been added before this commit.

Second, I wonder if, when bonding interfaces, there's an opportunity
for the verbs consumer to take an additional transport reference.
Cc'ing linux-rdma for input on that issue. Can you show a brief
diagram of when the ep reference count changes when bonding?

Also, I note that the WARNING below happened on a RHEL9 kernel:

5.14.0-284.11.1.el9_2.x86_64

Can you confirm that this issue reproduces on v6.8 and newer ?


> Example crash:
>
> rpcrdma: removing device mlx5_3 for 172.21.208.2:20049
> ------------[ cut here ]------------
> refcount_t: underflow; use-after-free.
> WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_>
> isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx>
> CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G W OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
> Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022
> Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
> RIP: 0010:refcount_warn_saturate+0xba/0x110
> Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7
> RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027
> RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0
> RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff
> R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000
> R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928
> FS: 0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> rpcrdma_ep_put+0x42/0x50 [rpcrdma]
> rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma]
> xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma]
> process_one_work+0x1e5/0x3c0
> ? rescuer_thread+0x3a0/0x3a0
> worker_thread+0x50/0x3b0
> ? rescuer_thread+0x3a0/0x3a0
> kthread+0xd6/0x100
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x1f/0x30
> </TASK>
>
> Signed-off-by: Dan Aloni <[email protected]>
> ---
> net/sunrpc/xprtrdma/verbs.c | 5 ++++-
> net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 4f8d7efa469f..19996515da94 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -250,6 +250,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> goto disconnected;
> case RDMA_CM_EVENT_ESTABLISHED:
> rpcrdma_ep_get(ep);
> + ep->re_connect_ref = true;
> ep->re_connect_status = 1;
> rpcrdma_update_cm_private(ep, &event->param.conn);
> trace_xprtrdma_inline_thresh(ep);
> @@ -272,7 +273,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> ep->re_connect_status = -ECONNABORTED;
> disconnected:
> rpcrdma_force_disconnect(ep);
> - return rpcrdma_ep_put(ep);
> + if (ep->re_connect_ref)
> + return rpcrdma_ep_put(ep);
> + return 0;
> default:
> break;
> }
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index da409450dfc0..1553ef69a844 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -84,6 +84,7 @@ struct rpcrdma_ep {
> unsigned int re_max_inline_recv;
> int re_async_rc;
> int re_connect_status;
> + bool re_connect_ref;
> atomic_t re_receiving;
> atomic_t re_force_disconnect;
> struct ib_qp_init_attr re_attr;
> --
> 2.39.3
>

--
Chuck Lever

2024-05-05 18:36:38

by Dan Aloni

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

On 2024-05-05 10:35:46, Chuck Lever wrote:
> On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > underflow.
> >
> > The specific scenario that caused this to happen is IB device bonding,
> > when bringing down one of the ports, or all ports. The situation is not
> > just a print - it also causes a non-recoverable state it is not even
> > possible to complete the disconnect and shut it down the mount,
> > requiring a reboot, suggesting that tear-down is also incomplete in this
> > state.
> >
> > The trivial fix seems to work as such - if we did not receive a
> > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > `rpcrdma_xprt_disconnect`.
> >
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
>
> Hi Dan, thanks for the bug report!
>
[..]
> Second, I wonder if, when bonding interfaces, there's an opportunity
> for the verbs consumer to take an additional transport reference.
> Cc'ing linux-rdma for input on that issue. Can you show a brief
> diagram of when the ep reference count changes when bonding?

Not sure we need an additional reference here. I understand regarding rpcrdma
managing its internal EP refcount, that it is having three in total: one for
init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
and I found some more about it, please read on.

> Also, I note that the WARNING below happened on a RHEL9 kernel:
>
> 5.14.0-284.11.1.el9_2.x86_64
>
> Can you confirm that this issue reproduces on v6.8 and newer ?

Some context: I originally tested with a version that also has implemented a
timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
'culprit' in the negative decref, but there is the other issue and second order
effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
the unkillable transport, so we are not reaching teardown at all. With both of
these changes applied both problems are gone.

As for upstream version testing, with the elrepo build of 6.8.9 which matches
vanilla, I don't see the negative kref but I do see the other effect of
`XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
which does not release even after ports are back up online.

Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
on bonding, I'll post in a reply.

```
INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000
Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
Call Trace:
<TASK>
__schedule+0x21c/0x680
schedule+0x31/0xd0
schedule_timeout+0x148/0x160
? mutex_lock+0xe/0x30
? cma_process_remove+0x218/0x260 [rdma_cm]
? preempt_count_add+0x70/0xa0
__wait_for_common+0x90/0x1e0
? __pfx_schedule_timeout+0x10/0x10
cma_remove_one+0x5c/0xd0 [rdma_cm]
remove_client_context+0x88/0xd0 [ib_core]
disable_device+0x8a/0x160 [ib_core]
? _raw_spin_unlock+0x15/0x30
__ib_unregister_device+0x42/0xb0 [ib_core]
ib_unregister_device+0x22/0x30 [ib_core]
mlx5r_remove+0x36/0x60 [mlx5_ib]
auxiliary_bus_remove+0x18/0x30
device_release_driver_internal+0x193/0x200
bus_remove_device+0xbf/0x130
device_del+0x157/0x3e0
? devl_param_driverinit_value_get+0x29/0xa0
mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
mlx5_do_bond+0x2cb/0x390 [mlx5_core]
mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
process_one_work+0x174/0x340
worker_thread+0x27e/0x390
? __pfx_worker_thread+0x10/0x10
kthread+0xee/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2d/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
INFO: task fio:10584 blocked for more than 122 seconds.
Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
```

--
Dan Aloni

2024-05-05 18:38:40

by Dan Aloni

[permalink] [raw]
Subject: [PATCH] rpcrdma: decref EP only if ESTABLISHED and handle DEVICE_REMOVAL

Under the scenario of IB device bonding, when bringing down one of the
ports, or all ports, we saw xprtrdma entering a non-recoverable state
where it is not even possible to complete the disconnect and shut it
down the mount, requiring a reboot.

If a DEVICE_REMOVAL happened, it may be irrespective of whether the
CM_ID is connected, and ESTABLISHED may not have happened, so we need
to avoid a decref, plus make sure connect path is woken up.

Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
Signed-off-by: Dan Aloni <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 9 +++++++--
net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4f8d7efa469f..43d7d6604c30 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -244,12 +244,15 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
case RDMA_CM_EVENT_DEVICE_REMOVAL:
pr_info("rpcrdma: removing device %s for %pISpc\n",
ep->re_id->device->name, sap);
- fallthrough;
+ ep->re_connect_status = -ENODEV;
+ wake_up_all(&ep->re_connect_wait);
+ goto disconnected;
case RDMA_CM_EVENT_ADDR_CHANGE:
ep->re_connect_status = -ENODEV;
goto disconnected;
case RDMA_CM_EVENT_ESTABLISHED:
rpcrdma_ep_get(ep);
+ ep->re_connect_ref = true;
ep->re_connect_status = 1;
rpcrdma_update_cm_private(ep, &event->param.conn);
trace_xprtrdma_inline_thresh(ep);
@@ -272,7 +275,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
ep->re_connect_status = -ECONNABORTED;
disconnected:
rpcrdma_force_disconnect(ep);
- return rpcrdma_ep_put(ep);
+ if (ep->re_connect_ref)
+ return rpcrdma_ep_put(ep);
+ return 0;
default:
break;
}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index da409450dfc0..1553ef69a844 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -84,6 +84,7 @@ struct rpcrdma_ep {
unsigned int re_max_inline_recv;
int re_async_rc;
int re_connect_status;
+ bool re_connect_ref;
atomic_t re_receiving;
atomic_t re_force_disconnect;
struct ib_qp_init_attr re_attr;
--
2.39.3


2024-05-05 19:00:44

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> On 2024-05-05 10:35:46, Chuck Lever wrote:
> > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > > underflow.
> > >
> > > The specific scenario that caused this to happen is IB device bonding,
> > > when bringing down one of the ports, or all ports. The situation is not
> > > just a print - it also causes a non-recoverable state it is not even
> > > possible to complete the disconnect and shut it down the mount,
> > > requiring a reboot, suggesting that tear-down is also incomplete in this
> > > state.
> > >
> > > The trivial fix seems to work as such - if we did not receive a
> > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > > `rpcrdma_xprt_disconnect`.
> > >
> > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> >
> > Hi Dan, thanks for the bug report!
> >
> [..]
> > Second, I wonder if, when bonding interfaces, there's an opportunity
> > for the verbs consumer to take an additional transport reference.
> > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > diagram of when the ep reference count changes when bonding?
>
> Not sure we need an additional reference here.

We already have two mechanisms in play:

- the ep reference count
- the re_connect_status value

I would prefer not to add a boolean here. Seems like
re_connect_status could do that job if we need it to.

I seem to recall that when a device is removed, the verbs consumer
doesn't get a DISCONNECT first. Or does it? Having a sequence of
CM events that you see in the bonded case would help a lot.


> I understand regarding rpcrdma
> managing its internal EP refcount, that it is having three in total: one for
> init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
> so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
> and I found some more about it, please read on.
>
> > Also, I note that the WARNING below happened on a RHEL9 kernel:
> >
> > 5.14.0-284.11.1.el9_2.x86_64
> >
> > Can you confirm that this issue reproduces on v6.8 and newer ?
>
> Some context: I originally tested with a version that also has implemented a
> timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
> 'culprit' in the negative decref, but there is the other issue and second order
> effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
> the unkillable transport, so we are not reaching teardown at all. With both of
> these changes applied both problems are gone.
>
> As for upstream version testing, with the elrepo build of 6.8.9 which matches
> vanilla, I don't see the negative kref but I do see the other effect of
> `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
> which does not release even after ports are back up online.
>
> Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> on bonding, I'll post in a reply.

It looks like you are trying to fix several issues in one patch. So
I need you to separate these issues and the fixes, and let's focus
on the upstream kernel (v6.9-rc6) because there's nothing I can do
about the RHEL9 kernel, which is clearly a different source base
than the one I work on.

If we need a "wake_up_all(&ep->re_connect_wait);" during
DEVICE_REMOVAL, that should be a separate patch. And you need to
figure out if ADDR_CHANGE needs the same treatment: the v2 you just
sent changes the behavior of ADDR_CHANGE too but does not mention
whether it intended that change.

Without a reproducer or a detailed explanation of how the ep
reference count changes in step with CM events, I can't properly
assess your proposed fix.


> ```
> INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
> Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000
> Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
> Call Trace:
> <TASK>
> __schedule+0x21c/0x680
> schedule+0x31/0xd0
> schedule_timeout+0x148/0x160
> ? mutex_lock+0xe/0x30
> ? cma_process_remove+0x218/0x260 [rdma_cm]
> ? preempt_count_add+0x70/0xa0
> __wait_for_common+0x90/0x1e0
> ? __pfx_schedule_timeout+0x10/0x10
> cma_remove_one+0x5c/0xd0 [rdma_cm]
> remove_client_context+0x88/0xd0 [ib_core]
> disable_device+0x8a/0x160 [ib_core]
> ? _raw_spin_unlock+0x15/0x30
> __ib_unregister_device+0x42/0xb0 [ib_core]
> ib_unregister_device+0x22/0x30 [ib_core]
> mlx5r_remove+0x36/0x60 [mlx5_ib]
> auxiliary_bus_remove+0x18/0x30
> device_release_driver_internal+0x193/0x200
> bus_remove_device+0xbf/0x130
> device_del+0x157/0x3e0
> ? devl_param_driverinit_value_get+0x29/0xa0
> mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
> mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
> mlx5_do_bond+0x2cb/0x390 [mlx5_core]
> mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
> process_one_work+0x174/0x340
> worker_thread+0x27e/0x390
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xee/0x120
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x2d/0x50
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
> INFO: task fio:10584 blocked for more than 122 seconds.
> Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
> ```
>
> --
> Dan Aloni

--
Chuck Lever

2024-05-05 20:10:45

by Dan Aloni

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

On 2024-05-05 15:00:10, Chuck Lever wrote:
> On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> > On 2024-05-05 10:35:46, Chuck Lever wrote:
> > [..]
> > > Second, I wonder if, when bonding interfaces, there's an opportunity
> > > for the verbs consumer to take an additional transport reference.
> > > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > > diagram of when the ep reference count changes when bonding?
> >
> > Not sure we need an additional reference here.
>
> We already have two mechanisms in play:
>
> - the ep reference count
> - the re_connect_status value
>
> I would prefer not to add a boolean here. Seems like
> re_connect_status could do that job if we need it to.

That's possible. So before overrides e.g. `ep->re_connect_status =
-ENODEV` we would need to remember whether it was `1` and across the
call to `rpcrdma_force_disconnect()`, to do the putref() after it.
Anyway, the patch will be a big larger I guess.

>[..]
> > Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> > for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> > on bonding, I'll post in a reply.
>
> It looks like you are trying to fix several issues in one patch. So
> I need you to separate these issues and the fixes, and let's focus
> on the upstream kernel (v6.9-rc6) because there's nothing I can do
> about the RHEL9 kernel, which is clearly a different source base
> than the one I work on.
>
> If we need a "wake_up_all(&ep->re_connect_wait);" during
> DEVICE_REMOVAL, that should be a separate patch. And you need to
> figure out if ADDR_CHANGE needs the same treatment: the v2 you just
> sent changes the behavior of ADDR_CHANGE too but does not mention
> whether it intended that change.

Was it just the `rpcrdma_ep_put` change in this case? Will feel less
comfortable changing it without a repro of the `ADDR_CHANGE` case. Going
to isolate it.

> Without a reproducer or a detailed explanation of how the ep
> reference count changes in step with CM events, I can't properly
> assess your proposed fix.

With 6.8 I only see a single `RDMA_CM_EVENT_DEVICE_REMOVAL` event
arriving, driving a single `rpcrdma_ep_put` call to put the reference
back at 1, and it waits forever. Getting late here, I'll check on 6.9-rc
tomorrow.

Maybe for a repro, you can generate a single interface bond? I haven't
tried, but possibly it would occur on a single one too if the underlying
port is down.

--
Dan Aloni

2024-05-12 17:52:10

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

Hi Dan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on trondmy-nfs/linux-next]
[also build test WARNING on linus/master v6.9-rc7 next-20240510]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Dan-Aloni/rpcrdma-don-t-decref-EP-if-a-ESTABLISHED-did-not-happen/20240505-205016
base: git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
patch link: https://lore.kernel.org/r/20240505124910.1877325-1-dan.aloni%40vastdata.com
patch subject: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen
config: parisc-randconfig-r081-20240512 (https://download.01.org/0day-ci/archive/20240513/[email protected]/config)
compiler: hppa-linux-gcc (GCC) 13.2.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

smatch warnings:
net/sunrpc/xprtrdma/verbs.c:254 rpcrdma_cm_event_handler() warn: inconsistent indenting

vim +254 net/sunrpc/xprtrdma/verbs.c

87cfb9a0c85ce4 Chuck Lever 2016-09-15 213
ae38288eb73c52 Chuck Lever 2018-10-01 214 /**
ae38288eb73c52 Chuck Lever 2018-10-01 215 * rpcrdma_cm_event_handler - Handle RDMA CM events
ae38288eb73c52 Chuck Lever 2018-10-01 216 * @id: rdma_cm_id on which an event has occurred
ae38288eb73c52 Chuck Lever 2018-10-01 217 * @event: details of the event
ae38288eb73c52 Chuck Lever 2018-10-01 218 *
ae38288eb73c52 Chuck Lever 2018-10-01 219 * Called with @id's mutex held. Returns 1 if caller should
ae38288eb73c52 Chuck Lever 2018-10-01 220 * destroy @id, otherwise 0.
ae38288eb73c52 Chuck Lever 2018-10-01 221 */
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 222 static int
ae38288eb73c52 Chuck Lever 2018-10-01 223 rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 224 {
745b734c9bb805 Chuck Lever 2020-02-21 225 struct sockaddr *sap = (struct sockaddr *)&id->route.addr.dst_addr;
e28ce90083f032 Chuck Lever 2020-02-21 226 struct rpcrdma_ep *ep = id->context;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 227
ae38288eb73c52 Chuck Lever 2018-10-01 228 might_sleep();
ae38288eb73c52 Chuck Lever 2018-10-01 229
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 230 switch (event->event) {
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 231 case RDMA_CM_EVENT_ADDR_RESOLVED:
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 232 case RDMA_CM_EVENT_ROUTE_RESOLVED:
93aa8e0a9de80e Chuck Lever 2020-02-21 233 ep->re_async_rc = 0;
93aa8e0a9de80e Chuck Lever 2020-02-21 234 complete(&ep->re_done);
316a616e788658 Chuck Lever 2018-10-01 235 return 0;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 236 case RDMA_CM_EVENT_ADDR_ERROR:
93aa8e0a9de80e Chuck Lever 2020-02-21 237 ep->re_async_rc = -EPROTO;
93aa8e0a9de80e Chuck Lever 2020-02-21 238 complete(&ep->re_done);
316a616e788658 Chuck Lever 2018-10-01 239 return 0;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 240 case RDMA_CM_EVENT_ROUTE_ERROR:
93aa8e0a9de80e Chuck Lever 2020-02-21 241 ep->re_async_rc = -ENETUNREACH;
93aa8e0a9de80e Chuck Lever 2020-02-21 242 complete(&ep->re_done);
316a616e788658 Chuck Lever 2018-10-01 243 return 0;
bebd031866caa4 Chuck Lever 2017-04-11 244 case RDMA_CM_EVENT_DEVICE_REMOVAL:
745b734c9bb805 Chuck Lever 2020-02-21 245 pr_info("rpcrdma: removing device %s for %pISpc\n",
745b734c9bb805 Chuck Lever 2020-02-21 246 ep->re_id->device->name, sap);
df561f6688fef7 Gustavo A. R. Silva 2020-08-23 247 fallthrough;
e28ce90083f032 Chuck Lever 2020-02-21 248 case RDMA_CM_EVENT_ADDR_CHANGE:
93aa8e0a9de80e Chuck Lever 2020-02-21 249 ep->re_connect_status = -ENODEV;
e28ce90083f032 Chuck Lever 2020-02-21 250 goto disconnected;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 251 case RDMA_CM_EVENT_ESTABLISHED:
2acc5cae292355 Chuck Lever 2020-06-15 252 rpcrdma_ep_get(ep);
c58f2c9a4198e3 Dan Aloni 2024-05-05 253 ep->re_connect_ref = true;
93aa8e0a9de80e Chuck Lever 2020-02-21 @254 ep->re_connect_status = 1;
745b734c9bb805 Chuck Lever 2020-02-21 255 rpcrdma_update_cm_private(ep, &event->param.conn);
745b734c9bb805 Chuck Lever 2020-02-21 256 trace_xprtrdma_inline_thresh(ep);
93aa8e0a9de80e Chuck Lever 2020-02-21 257 wake_up_all(&ep->re_connect_wait);
31e62d25b5b815 Chuck Lever 2018-10-01 258 break;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 259 case RDMA_CM_EVENT_CONNECT_ERROR:
93aa8e0a9de80e Chuck Lever 2020-02-21 260 ep->re_connect_status = -ENOTCONN;
af667527b0e349 Chuck Lever 2020-06-27 261 goto wake_connect_worker;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 262 case RDMA_CM_EVENT_UNREACHABLE:
93aa8e0a9de80e Chuck Lever 2020-02-21 263 ep->re_connect_status = -ENETUNREACH;
af667527b0e349 Chuck Lever 2020-06-27 264 goto wake_connect_worker;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 265 case RDMA_CM_EVENT_REJECTED:
93aa8e0a9de80e Chuck Lever 2020-02-21 266 ep->re_connect_status = -ECONNREFUSED;
0a90487bf7182c Chuck Lever 2017-02-08 267 if (event->status == IB_CM_REJ_STALE_CONN)
4cf44be6f1e86d Chuck Lever 2020-06-27 268 ep->re_connect_status = -ENOTCONN;
af667527b0e349 Chuck Lever 2020-06-27 269 wake_connect_worker:
af667527b0e349 Chuck Lever 2020-06-27 270 wake_up_all(&ep->re_connect_wait);
af667527b0e349 Chuck Lever 2020-06-27 271 return 0;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 272 case RDMA_CM_EVENT_DISCONNECTED:
93aa8e0a9de80e Chuck Lever 2020-02-21 273 ep->re_connect_status = -ECONNABORTED;
31e62d25b5b815 Chuck Lever 2018-10-01 274 disconnected:
c487eb7d8e4157 Chuck Lever 2020-06-15 275 rpcrdma_force_disconnect(ep);
c58f2c9a4198e3 Dan Aloni 2024-05-05 276 if (ep->re_connect_ref)
2acc5cae292355 Chuck Lever 2020-06-15 277 return rpcrdma_ep_put(ep);
c58f2c9a4198e3 Dan Aloni 2024-05-05 278 return 0;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 279 default:
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 280 break;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 281 }
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 282
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 283 return 0;
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 284 }
c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 285

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki