2020-06-12 13:14:35

by Niklas Schnelle

Subject: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Hello Parav, Hello Saeed,

our CI system for IBM Z Linux found a hang[0] in Linus' current tree when hot unplugging a ConnectX-4 Lx VF
from a z/VM guest; the regression was introduced during the merge window.
Sadly it didn't happen every time, which sent me down the wrong path for two full git bisects.

Anyway, I've now tracked this down to the following commit, which, when reverted,
fixes the issue:

41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")

Looking at the diff I'd say the likely culprit is that before
the commit the order of calls was:

mlx5_unregister_device(dev)
mlx5_drain_health_wq(dev)

But with the commit it becomes

mlx5_drain_health_wq(dev)
mlx5_unregister_device(dev)

So without really knowing anything about these functions, I would
guess that with the device still registered the drained
queue does not remain empty, as new entries keep being added.
Does that sound plausible to you?

Best regards,
Niklas Schnelle

[0] dmesg output:
[ 36.447442] mlx5_core 0000:00:00.0: poll_health:694:(pid 0): Fatal error 1 detected
[ 36.447450] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[0] 0xffffffff
[ 36.447453] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[1] 0xffffffff
[ 36.447455] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[2] 0xffffffff
[ 36.447458] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[3] 0xffffffff
[ 36.447461] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[4] 0xffffffff
[ 36.447463] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0): assert_exit_ptr 0xffffffff
[ 36.447467] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0): assert_callra 0xffffffff
[ 36.447471] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): fw_ver 65535.65535.65535
[ 36.447475] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0): hw_id 0xffffffff
[ 36.447478] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0): irisc_index 255
[ 36.447492] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): synd 0xff: unrecognized error
[ 36.447621] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): ext_synd 0xffff
[ 36.447624] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0): raw fw_ver 0xffffffff
[ 36.447885] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[ 36.447897] zpci: 0000:00:00.0: Event 0x303 reconfigured PCI function 0x514
[ 47.099220] mlx5_core 0000:00:00.0: poll_health:709:(pid 0): device's health compromised - reached miss count
[ 47.099228] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[0] 0xffffffff
[ 47.099231] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[1] 0xffffffff
[ 47.099234] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[2] 0xffffffff
[ 47.099236] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[3] 0xffffffff
[ 47.099239] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[4] 0xffffffff
[ 47.099241] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0): assert_exit_ptr 0xffffffff
[ 47.099245] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0): assert_callra 0xffffffff
[ 47.099249] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): fw_ver 65535.65535.65535
[ 47.099253] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0): hw_id 0xffffffff
[ 47.099256] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0): irisc_index 255
[ 47.099327] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): synd 0xff: unrecognized error
[ 47.099329] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): ext_synd 0xffff
[ 47.099330] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0): raw fw_ver 0xffffffff
[ 100.539106] mlx5_core 0000:00:00.0: wait_func:991:(pid 121): 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
[ 100.539118] infiniband mlx5_0: destroy_qp_common:2525:(pid 121): mlx5_ib: modify QP 0x00072c to RESET failed
[ 141.499325] mlx5_core 0000:00:00.0: wait_func:991:(pid 32): QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command resource
[ 161.978957] mlx5_core 0000:00:00.0: wait_func:991:(pid 121): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource


2020-06-12 22:06:19

by Saeed Mahameed

Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
> Hello Parav, Hello Saeed,
>
> our CI system for IBM Z Linux found a hang[0] when hot unplugging a
> ConnectX-4 Lx VF from a z/VM guest
> in Linus' current tree and added during the merge window.
> Sadly it didn't happen all the time which sent me on the wrong path
> for two full git bisects.
>
> Anyway, I've now tracked this down to the following commit which when
> reverted
> fixes the issue:
>
> 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")
>
> Looking at the diff I'd say the likely culprit is that before
> the commit the order of calls was:
>
> mlx5_unregister_device(dev)
> mlx5_drain_health_wq(dev)
>
> But with the commit it becomes
>
> mlx5_drain_health_wq(dev)
> mlx5_unregister_device(dev)
>
> So without really knowing anything about these functions I would
> guess that with the device still registered the drained
> queue does not remain empty as new entries are added.
> Does that sound plausible to you?
>

I don't think it is related; maybe this is similar to some issues
addressed recently by Shay's patches:

https://patchwork.ozlabs.org/project/netdev/patch/[email protected]/
https://patchwork.ozlabs.org/project/netdev/patch/[email protected]/

net/mlx5: drain health workqueue in case of driver load error
net/mlx5: Fix fatal error handling during device load

> Best regards,
> Niklas Schnelle
>
> [0] dmesg output:
> [ 36.447442] mlx5_core 0000:00:00.0: poll_health:694:(pid 0): Fatal
> error 1 detected
> [ 36.447450] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[0] 0xffffffff
> [ 36.447453] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[1] 0xffffffff
> [ 36.447455] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[2] 0xffffffff
> [ 36.447458] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[3] 0xffffffff
> [ 36.447461] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[4] 0xffffffff
> [ 36.447463] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0):
> assert_exit_ptr 0xffffffff
> [ 36.447467] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0):
> assert_callra 0xffffffff
> [ 36.447471] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0):
> fw_ver 65535.65535.65535
> [ 36.447475] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0):
> hw_id 0xffffffff
> [ 36.447478] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0):
> irisc_index 255
> [ 36.447492] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0):
> synd 0xff: unrecognized error
> [ 36.447621] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0):
> ext_synd 0xffff
> [ 36.447624] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0):
> raw fw_ver 0xffffffff
> [ 36.447885] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B,
> anc=0, erc=0, rsid=0
> [ 36.447897] zpci: 0000:00:00.0: Event 0x303 reconfigured PCI
> function 0x514
> [ 47.099220] mlx5_core 0000:00:00.0: poll_health:709:(pid 0):
> device's health compromised - reached miss count
> [ 47.099228] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[0] 0xffffffff
> [ 47.099231] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[1] 0xffffffff
> [ 47.099234] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[2] 0xffffffff
> [ 47.099236] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[3] 0xffffffff
> [ 47.099239] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[4] 0xffffffff
> [ 47.099241] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0):
> assert_exit_ptr 0xffffffff
> [ 47.099245] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0):
> assert_callra 0xffffffff
> [ 47.099249] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0):
> fw_ver 65535.65535.65535
> [ 47.099253] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0):
> hw_id 0xffffffff
> [ 47.099256] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0):
> irisc_index 255
> [ 47.099327] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0):
> synd 0xff: unrecognized error
> [ 47.099329] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0):
> ext_synd 0xffff
> [ 47.099330] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0):
> raw fw_ver 0xffffffff
> [ 100.539106] mlx5_core 0000:00:00.0: wait_func:991:(pid 121):
> 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
> [ 100.539118] infiniband mlx5_0: destroy_qp_common:2525:(pid 121):
> mlx5_ib: modify QP 0x00072c to RESET failed
> [ 141.499325] mlx5_core 0000:00:00.0: wait_func:991:(pid 32):
> QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command
> resource
> [ 161.978957] mlx5_core 0000:00:00.0: wait_func:991:(pid 121):
> DESTROY_QP(0x501) timeout. Will cause a leak of a command resource

Shay's patches were also meant to avoid such command timeouts.


2020-06-15 10:05:23

by Niklas Schnelle

Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Hello Saeed,

On 6/13/20 12:01 AM, Saeed Mahameed wrote:
> On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
>> Hello Parav, Hello Saeed,
>>
... snip ...
>>
>> So without really knowing anything about these functions I would
>> guess that with the device still registered the drained
>> queue does not remain empty as new entries are added.
>> Does that sound plausible to you?
>>
>
> I don't think it is related, maybe this is similar to some issues
> addressed lately by Shay's patches:
>
> https://patchwork.ozlabs.org/project/netdev/patch/[email protected]/
> https://patchwork.ozlabs.org/project/netdev/patch/[email protected]/
>
> net/mlx5: drain health workqueue in case of driver load error
> net/mlx5: Fix fatal error handling during device load

I agree with your similarity assessment, especially for the first commit.
These do not fix the issue, though: with mainline v5.8-rc1, which has
both, I'm still getting a hang over 50% of the time with the following
detach sequence on z/VM:

vmcp detach pcif <mlx_fid>; echo 1 > /proc/cio_settle

Since commit 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")
no longer reverts cleanly, I used the following diff to move the mlx5_drain_health_wq(dev)
call after mlx5_unload_one(dev, true), which performs the mlx5_unregister_device(dev).

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8b658908f044..63a196fd8e68 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1382,8 +1382,8 @@ static void remove_one(struct pci_dev *pdev)

devlink_reload_disable(devlink);
mlx5_crdump_disable(dev);
- mlx5_drain_health_wq(dev);
mlx5_unload_one(dev, true);
+ mlx5_drain_health_wq(dev);
mlx5_pci_close(dev);
mlx5_mdev_uninit(dev);
mlx5_devlink_free(devlink);


Note that this changed order also matches the call order in mlx5_pci_err_detected().
With that change I've now done over two dozen detachments, with varying time between
attach and detach so as to catch the driver at different stages of initialization.
All of them worked without a hitch.

Best regards,
Niklas Schnelle
>
>> Best regards,
>> Niklas Schnelle
>>
>> [0] dmesg output:
... snip ...
>
> Shay's patches also came to avoid such command timeouts.
>
>

2020-07-08 10:44:23

by Parav Pandit

Subject: RE: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Hi Niklas,

> From: Niklas Schnelle <[email protected]>
> Sent: Monday, June 15, 2020 3:32 PM
>
> Hello Saeed,
>
> On 6/13/20 12:01 AM, Saeed Mahameed wrote:
> > On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
> >> Hello Parav, Hello Saeed,
> >>
> ... snip ...
> >>
> >> So without really knowing anything about these functions I would
> >> guess that with the device still registered the drained queue does
> >> not remain empty as new entries are added.
> >> Does that sound plausible to you?
> >>
> >
> > I don't think it is related, maybe this is similar to some issues
> > addressed lately by Shay's patches:
> >
> >
> https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.23501
> > [email protected]/
> >
> https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.23501
> > [email protected]/
> >
> > net/mlx5: drain health workqueue in case of driver load error
> > net/mlx5: Fix fatal error handling during device load
>
> I agree with your similarity assessment especially for the first commit.
> These do not fix the issue though, with mainline v5.8-rc1 which has both I'm
> still getting a hang over 50% of the time with the following detach sequence
> on z/VM:
>
> vmcp detach pcif <mlx_fid>; echo 1 > /proc/cio_settle
>
> Since now the commit 41798df9bfca ("net/mlx5: Drain wq first during PCI
> device removal") no longer reverts cleanly I used the following diff to move
> the mlx5_drain_health_wq(dev) after the mlx5_unregister_devices(dev).
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 8b658908f044..63a196fd8e68 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -1382,8 +1382,8 @@ static void remove_one(struct pci_dev *pdev)
>
> devlink_reload_disable(devlink);
> mlx5_crdump_disable(dev);
> - mlx5_drain_health_wq(dev);
> mlx5_unload_one(dev, true);
> + mlx5_drain_health_wq(dev);
> mlx5_pci_close(dev);
> mlx5_mdev_uninit(dev);
> mlx5_devlink_free(devlink);
>
>
> Note that this changed order also matches the call order in
> mlx5_pci_err_detected().
> With that change I've now done over two dozen detachments with varying
> time between attach and detach to have the driver at different stages of
> initialization.
> With the change all worked without a hitch.
>
> Best regards,
> Niklas Schnelle
> >

Sorry for my late response.
Yes, this looks good, and I found the same in my analysis.
With the latest code, mlx5_pci_close() already does drain_health_wq(), so the additional call in remove_one() is redundant.
It should just be removed.
If you can verify the below hunk in your setup, it would be really helpful.
You still need patch 42ea9f1b5c6 in your tree.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8b658908f044..ebec2318dbc4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1382,7 +1382,6 @@ static void remove_one(struct pci_dev *pdev)

devlink_reload_disable(devlink);
mlx5_crdump_disable(dev);
- mlx5_drain_health_wq(dev);
mlx5_unload_one(dev, true);
mlx5_pci_close(dev);
mlx5_mdev_uninit(dev);

2020-07-08 11:48:08

by Niklas Schnelle

Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Hi Parav, Hi Shay,

On 7/8/20 12:43 PM, Parav Pandit wrote:
> Hi Niklas,
>
... snip ...
>>>
>
> Sorry for my late response.
> Yes, this looks good and I also found same in my analysis.
> With latest code mlx5_pci_close() already does drain_health_wq(), so the additional call in remove_one() is redundant.
> It should be just removed.
> If you can verify below hunk in your setup, it will be really helpful.
> You still need patch 42ea9f1b5c6 in your tree.
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 8b658908f044..ebec2318dbc4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -1382,7 +1382,6 @@ static void remove_one(struct pci_dev *pdev)
>
> devlink_reload_disable(devlink);
> mlx5_crdump_disable(dev);
> - mlx5_drain_health_wq(dev);
> mlx5_unload_one(dev, true);
> mlx5_pci_close(dev);
> mlx5_mdev_uninit(dev);
>

thanks for looking into this. Sadly it looks like this
doesn't quite work: I'm getting the below illegal dereference after
removing the mlx5_drain_health_wq(dev) from remove_one() on top of
v5.8-rc4.
I think this is similar to what happened when I tried recreating the
reordering on -rc3, which is why I contacted Shay, as I wasn't
able to get this to work again with the current code.
(Sorry if there are formatting issues with the below; I pulled this
out of an x3270.)

[   64.735421] Unable to handle kernel pointer dereference in virtual kernel address space
[   64.735470] Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6803
[   64.735476] Fault in home space mode while using kernel ASCE.
[   64.735484] AS:0000000088da0007 R3:0000000000000024
[   64.735533] Oops: 0038 ilc:3 [#1] PREEMPT SMP
[   64.735538] Modules linked in: rpcrdma sunrpc rdma_ucm rdma_cm iw_cm ib_cm configfs mlx5_ib ib_uverbs ib_core mlx5_core dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng ghash_s390 prng ctr aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha1_s390 vfio_ccw vfi
[   64.735421] Unable to handle kernel pointer dereference in virtual kernel address space
[   64.735558] CPU: 0 PID: 762 Comm: kworker/u128:3 Not tainted 5.8.0-rc4-dirty #2
[   64.735561] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
[   64.735618] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[   64.735623] Krnl PSW : 0704e00180000000 00000000876936c6 (notifier_call_chain+0x3e/0xe8)01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 01.01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
[   64.735702]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[   64.735704] Krnl GPRS: 00000000a38fc246 0000000080000001 00000000e814db00 0000000000000080
[   64.735706]            0000000000000001 ffffffffffffffff 0000000000000000 0000000000000001
[   64.735708]            0000000000000080 0000000000000001 ffffffffffffffff 6b6b6b6b6b6b6b6b
[   64.735710]            00000000e5ff0100 0000000000000000 000003e000973bf8 000003e000973ba0
[   64.735716] Krnl Code: 00000000876936ba: b9040083     lgr   %r8,%r3
[   64.735716]            00000000876936be: b9040074     lgr   %r7,%r4
[   64.735716]           #00000000876936c2: a7d80000     lhi   %r13,0
[   64.735716]           >00000000876936c6: e390b0080004 lg    %r9,8(%r11)
[   64.735716]            00000000876936cc: e320b0000004 lg    %r2,0(%r11)
[   64.735716]            00000000876936d2: c0e5ffffd773 brasl %r14,000000008768e5b8
[   64.735716]            00000000876936d8: ec280032007e cij   %r2,0,8,000000008769373c
[   64.735716]            00000000876936de: e310b0000004 lg    %r1,0(%r11)
[   64.735731] Call Trace:
[   64.735740]  [<00000000876936c6>] notifier_call_chain+0x3e/0xe8
[   64.735764]  [<000000008769392c>] __atomic_notifier_call_chain+0x9c/0x138
[   64.735766]  [<00000000876939f2>] atomic_notifier_call_chain+0x2a/0x38
[   64.735782]  [<000003ff801e99c4>] mlx5_enter_error_state+0xec/0x100 [mlx5_core]
[   64.735797]  [<000003ff801e9a0c>] mlx5_fw_fatal_reporter_err_work+0x34/0xb8 [mlx5_core]
[   64.735802]  [<000000008768784c>] process_one_work+0x27c/0x478
[   64.735805]  [<0000000087687aae>] worker_thread+0x66/0x368
[   64.735807]  [<0000000087690a96>] kthread+0x176/0x1a0
[   64.735811]  [<0000000088202820>] ret_from_fork+0x24/0x2c
[   64.735812] INFO: lockdep is turned off.
[   64.735814] Last Breaking-Event-Address:
[   64.735816]  [<0000000087693926>] __atomic_notifier_call_chain+0x96/0x138
[   64.735820] Kernel panic - not syncing: Fatal exception: panic_on_oops
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 8762103E

2020-07-08 15:46:43

by Parav Pandit

Subject: RE: [REGRESSION] mlx5: Driver remove during hot unplug is broken



> From: Niklas Schnelle <[email protected]>
> Sent: Wednesday, July 8, 2020 5:14 PM
> Hi Parav, Hi Shay,
>
> On 7/8/20 12:43 PM, Parav Pandit wrote:
> > Hi Niklas,
> >
> ... snip ...
> >>>
> >
> > Sorry for my late response.
> > Yes, this looks good and I also found same in my analysis.
> > With latest code mlx5_pci_close() already does drain_health_wq(), so the
> additional call in remove_one() is redundant.
> > It should be just removed.
> > If you can verify below hunk in your setup, it will be really helpful.
> > You still need patch 42ea9f1b5c6 in your tree.
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > index 8b658908f044..ebec2318dbc4 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > @@ -1382,7 +1382,6 @@ static void remove_one(struct pci_dev *pdev)
> >
> > devlink_reload_disable(devlink);
> > mlx5_crdump_disable(dev);
> > - mlx5_drain_health_wq(dev);
> > mlx5_unload_one(dev, true);
> > mlx5_pci_close(dev);
> > mlx5_mdev_uninit(dev);
> >
>
> thanks for looking into this. Sadly it looks like this doesn't quite work, I'm
> getting the bewlow illegal dereference after removing the
> mlx5_drain_health_wq(dev) from remove_one() on top of v5.8-rc4.
> I think this is similar to what happned when I tried recreating the reordering
> on -rc3 which is why I contacted Shay as I wasn't able to get this to work again
> with the current code.
> (Sorry if there are formatting issues with the below, plugged this out of a
> x3270).
>

It is likely because events_cleanup() freed, using kvfree(), the memory that the health recovery context is trying to access in the notifier chain.

While reviewing I see a few more errors, as below:
(a) mlx5_pci_err_detected() invokes mlx5_drain_health_wq() outside of intf_state_mutex.
(b) mlx5_enter_error_state() in commit b6e0b6bebe0 reads and updates the dev state outside of intf_state_mutex.
(c) due to the drain_health_wq() introduction in mlx5_pci_close() in commit 42ea9f1b5c625, health_wq is flushed twice.
(d) priv->events is freed with kvfree() but allocated using kzalloc().

This code certainly needs rework.

In the meantime, to avoid this regression, I believe the below hunk eliminates the error introduced in patch 41798df9bfc.
Will you please help test it?

Shay and I did a review of the below patch.
If it works, I will take it through Saeed's tree and internal reviews.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index ebec2318dbc4..529df5703737 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -785,11 +785,6 @@ static int mlx5_pci_init(struct mlx5_core_dev *dev, struct pci_dev *pdev,

static void mlx5_pci_close(struct mlx5_core_dev *dev)
{
- /* health work might still be active, and it needs pci bar in
- * order to know the NIC state. Therefore, drain the health WQ
- * before removing the pci bars
- */
- mlx5_drain_health_wq(dev);
iounmap(dev->iseg);
pci_clear_master(dev->pdev);
release_bar(dev->pdev);
@@ -1235,6 +1230,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
if (cleanup) {
mlx5_unregister_device(dev);
mlx5_devlink_unregister(priv_to_devlink(dev));
+ mlx5_drain_health_wq(dev);
} else {
mlx5_detach_device(dev);
}
@@ -1366,6 +1362,11 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
return 0;

err_load_one:
+ /* health work might still be active, and it needs pci bar in
+ * order to know the NIC state. Therefore, drain the health WQ
+ * before removing the pci bars
+ */
+ mlx5_drain_health_wq(dev);
mlx5_pci_close(dev);
pci_init_err:
mlx5_mdev_uninit(dev);




> 64.735421¨ Unable to handle kernel pointer dereference in virtual kernel
> address space 64.735470¨ Failing address: 6b6b6b6b6b6b6000 TEID:
> 6b6b6b6b6b6b6803 64.735476¨ Fault in home space mode while using kernel
> ASCE.
> 64.735484¨ AS:0000000088da0007 R3:0000000000000024 64.735533¨ Oops:
> 0038 ilc:3 Ý#1¨ PREEMPT SMP 64.735538¨ Modules linked in: rpcrdma sunrpc
> rdma_ucm rdma_cm iw_cm ib_cm configfs mlx5_ib ib_uverbs ib_core
> mlx5_core dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
> s390_trng ghash_s390 prng ctr aes_s390 des_s390 libdes sha3_512_s390
> sha3_256_s390 sha512_s390 sha1_s390 vfio_ccw vfi 64.735421¨ Unable to
> handle kernel pointer dereference in virtual kernel address space 64.735558¨
> CPU: 0 PID: 762 Comm: kworker/u128:3 Not tainted 5.8.0-rc4-dirty #2
> 64.735561¨ Hardware name: IBM 3906 M04 704 (z/VM 7.1.0) 64.735618¨
> Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work
> mlx5_core 64.735623¨ Krnl PSW : 0704e00180000000 00000000876936c6
> (notifier_call_chain+0x3e/0xe8)01: HCPGSP2629I The virtual machine is
> placed in CP mode due to a SIGP stop from CPU 01.01: HCPGSP2629I The
> virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
> 64.735702¨ R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> 64.735704¨ Krnl GPRS: 00000000a38fc246 0000000080000001
> 00000000e814db00 0000000000000080
> 64.735706¨ 0000000000000001 ffffffffffffffff 0000000000000000
> 0000000000000001
> 64.735708¨ 0000000000000080 0000000000000001 ffffffffffffffff
> 6b6b6b6b6b6b6b6b
> 64.735710¨ 00000000e5ff0100 0000000000000000 000003e000973bf8
> 000003e000973ba0
> 64.735716¨ Krnl Code: 00000000876936ba: b9040083 lgr %r8,%r3
> 64.735716¨ 00000000876936be: b9040074 lgr %r7,%r4
> 64.735716¨ #00000000876936c2: a7d80000 lhi %r13,0
> 64.735716¨ >00000000876936c6: e390b0080004 lg %r9,8(%r11)
> 64.735716¨ 00000000876936cc: e320b0000004 lg %r2,0(%r11)
> 64.735716¨ 00000000876936d2: c0e5ffffd773 brasl
> %r14,000000008768e5b8
> 64.735716¨ 00000000876936d8: ec280032007e cij
> %r2,0,8,000000008769373c
> 64.735716¨ 00000000876936de: e310b0000004 lg %r1,0(%r11)
> 64.735731¨ Call Trace:
> 64.735740¨ <00000000876936c6>¨ notifier_call_chain+0x3e/0xe8 64.735764¨
> <000000008769392c>¨ __atomic_notifier_call_chain+0x9c/0x138
> 64.735766¨ <00000000876939f2>¨ atomic_notifier_call_chain+0x2a/0x38
> 64.735782¨ <000003ff801e99c4>¨ mlx5_enter_error_state+0xec/0x100
> mlx5_core 64.735797¨ <000003ff801e9a0c>¨
> mlx5_fw_fatal_reporter_err_work+0x34/0xb8 mlx5_core¨ 64.735802¨
> <000000008768784c>¨ process_one_work+0x27c/0x478 64.735805¨
> <0000000087687aae>¨ worker_thread+0x66/0x368 64.735807¨
> <0000000087690a96>¨ kthread+0x176/0x1a0 64.735811¨
> <0000000088202820>¨ ret_from_fork+0x24/0x2c 64.735812¨ INFO: lockdep is
> turned off.
> 64.735814¨ Last Breaking-Event-Address:
> 64.735816¨ <0000000087693926>¨ __atomic_notifier_call_chain+0x96/0x138
> 64.735820¨ Kernel panic - not syncing: Fatal exception: panic_on_oops
> 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000
> 8762103E

2020-07-09 10:10:20

by Niklas Schnelle

Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken


On 7/8/20 5:44 PM, Parav Pandit wrote:
... snip ...
>>
>
> It is likely because events_cleanup() freed the memory using kvfree() that health recovery context is trying to access in notifier chain.
>
> While reviewing I see few more errors as below.
> (a) mlx5_pci_err_detected() invokes mlx5_drain_health_wq() outside of intf_state_mutex.
> (b) mlx5_enter_error_state() in commit b6e0b6bebe0 read and updates dev state outside of intf_state_mutex.
> (c) due to drain_health_wq() introduction in mlx5_pci_close() in commit 42ea9f1b5c625 health_wq is flushed twice.
> (d) priv->events freed is kvfree() but allocated using kzalloc().
>
> This code certainly needs rework.
>
> In meantime to avoid this regression, I believe below hunk eliminates error introduced in patch 41798df9bfc.
> Will you please help test it?
>
> Shay and I did the review of below patch.
> If it works I will get it through Saeed's tree and internal reviews.
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index ebec2318dbc4..529df5703737 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -785,11 +785,6 @@ static int mlx5_pci_init(struct mlx5_core_dev *dev, struct pci_dev *pdev,
>
> static void mlx5_pci_close(struct mlx5_core_dev *dev)
> {
> - /* health work might still be active, and it needs pci bar in
> - * order to know the NIC state. Therefore, drain the health WQ
> - * before removing the pci bars
> - */
> - mlx5_drain_health_wq(dev);
> iounmap(dev->iseg);
> pci_clear_master(dev->pdev);
> release_bar(dev->pdev);
> @@ -1235,6 +1230,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
> if (cleanup) {
> mlx5_unregister_device(dev);
> mlx5_devlink_unregister(priv_to_devlink(dev));
> + mlx5_drain_health_wq(dev);
I think with the above you can remove the mlx5_drain_health_wq(dev) in remove_one(),
as that calls mlx5_unload_one() with cleanup == true. Also, I wonder if it is a problem
that the order of mlx5_devlink_unregister(priv_to_devlink(dev)) and mlx5_unregister_device(dev)
is switched compared to the 5.7 code. That said, changing both seems to result in a deadlock,
though not the "leak of a command resource":

[ 48.082222] mlx5_core 0000:00:00.0: poll_health:702:(pid 0): Fatal error 1 detected
[ 48.082296] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[0] 0xffffffff
[ 48.082303] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[1] 0xffffffff
[ 48.082309] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[2] 0xffffffff
[ 48.082316] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[3] 0xffffffff
[ 48.082322] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[4] 0xffffffff
[ 48.082329] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): assert_exit_ptr 0xffffffff
[ 48.082336] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): assert_callra 0xffffffff
[ 48.082345] mlx5_core 0000:00:00.0: print_health_info:388:(pid 0): fw_ver 65535.65535.65535
[ 48.082353] mlx5_core 0000:00:00.0: print_health_info:389:(pid 0): hw_id 0xffffffff
[ 48.082360] mlx5_core 0000:00:00.0: print_health_info:390:(pid 0): irisc_index 255
[ 48.082374] mlx5_core 0000:00:00.0: print_health_info:391:(pid 0): synd 0xff: unrecognized error
[ 48.082390] mlx5_core 0000:00:00.0: print_health_info:393:(pid 0): ext_synd 0xffff
[ 48.082396] mlx5_core 0000:00:00.0: print_health_info:395:(pid 0): raw fw_ver 0xffffffff
[ 48.082824] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[ 48.082852] mlx5_core 0000:00:00.0: PME# disabled
[ 244.103844] INFO: task kworker/u128:1:63 blocked for more than 122 seconds.
[ 244.104076] Not tainted 5.8.0-rc4-dirty #2
[ 244.104081] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 244.104086] kworker/u128:1 D13264 63 2 0x00000000
[ 244.104158] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[ 244.104164] Call Trace:
[ 244.104174] [<0000000031cf1c46>] __schedule+0x2d6/0x5a8
[ 244.104180] [<0000000031cf1f72>] schedule+0x5a/0x130
[ 244.104186] [<0000000031cf2704>] schedule_preempt_disabled+0x2c/0x48
[ 244.104192] [<0000000031cf49c2>] __mutex_lock+0x372/0x960
[ 244.104198] [<0000000031cf4fe2>] mutex_lock_nested+0x32/0x40
[ 244.104219] [<000003ff801e4978>] mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[ 244.104238] [<000003ff801e4a0c>] mlx5_fw_fatal_reporter_err_work+0x34/0xb8 [mlx5_core]
[ 244.104249] [<000000003117f84c>] process_one_work+0x27c/0x478
[ 244.104255] [<000000003117faae>] worker_thread+0x66/0x368
[ 244.104275] [<0000000031188a96>] kthread+0x176/0x1a0
[ 244.104280] [<0000000031cfa820>] ret_from_fork+0x24/0x2c
[ 244.104285] 3 locks held by kworker/u128:1/63:
[ 244.104289] #0: 00000000d0392948 ((wq_completion)mlx5_health0000:00:00.0){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[ 244.104298] #1: 000003e00295fe18 ((work_completion)(&health->fatal_report_work)){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[ 244.104305] #2: 00000000d1b2cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[ 244.104328] INFO: task kmcheck:73 blocked for more than 122 seconds.
[ 244.104333] Not tainted 5.8.0-rc4-dirty #2
[ 244.104337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 244.104341] kmcheck D10888 73 2 0x00000000
[ 244.104350] Call Trace:
[ 244.104355] [<0000000031cf1c46>] __schedule+0x2d6/0x5a8
[ 244.104360] [<0000000031cf1f72>] schedule+0x5a/0x130
[ 244.104365] [<0000000031cf8182>] schedule_timeout+0xfa/0x138
[ 244.104370] [<0000000031cf37e8>] wait_for_completion+0xc0/0x110
[ 244.104374] [<0000000031180798>] __flush_work+0xd8/0x120
[ 244.104380] [<0000000031180f88>] __cancel_work_timer+0x150/0x208
[ 244.104398] [<000003ff801d9c3a>] mlx5_unload_one+0x52/0x148 [mlx5_core]
[ 244.104416] [<000003ff801d9dc2>] remove_one+0x52/0xd8 [mlx5_core]
[ 244.104422] [<0000000031982500>] pci_device_remove+0x40/0x90
[ 244.104429] [<00000000319f0012>] __device_release_driver+0x17a/0x220
[ 244.104433] [<00000000319f00f8>] device_release_driver+0x40/0x50
[ 244.104439] [<0000000031977dcc>] pci_stop_bus_device+0x94/0xc0
[ 244.104444] [<00000000319780c0>] pci_stop_and_remove_bus_device_locked+0x30/0x48
[ 244.104449] [<000000003114efd6>] __zpci_event_availability+0x12e/0x318
[ 244.104456] [<0000000031c566fc>] chsc_process_event_information.constprop.0+0x214/0x228
[ 244.104461] [<0000000031c5daa8>] crw_collect_info+0x280/0x350
[ 244.104466] [<0000000031188a96>] kthread+0x176/0x1a0
[ 244.104470] [<0000000031cfa820>] ret_from_fork+0x24/0x2c
[ 244.104475] 4 locks held by kmcheck/73:
[ 244.104479] #0: 0000000032205e48 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x250/0x350
[ 244.104486] #1: 00000000321e5668 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[ 244.104493] #2: 00000000d7fd1238 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x50
[ 244.104553] #3: 00000000d1b2cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_unload_one+0x38/0x148 [mlx5_core]



> } else {
> mlx5_detach_device(dev);
> }
> @@ -1366,6 +1362,11 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
> return 0;
>
> err_load_one:
> + /* health work might still be active, and it needs pci bar in
> + * order to know the NIC state. Therefore, drain the health WQ
> + * before removing the pci bars
> + */
> + mlx5_drain_health_wq(dev);
> mlx5_pci_close(dev);
> pci_init_err:
> mlx5_mdev_uninit(dev);
>
As is, the patch above fixes the dereference but results in the same completion
error as current 5.8-rc4

[ 103.257202] mlx5_core 0000:00:00.0: poll_health:702:(pid 930): Fatal error 1 detected
[ 103.257405] mlx5_core 0000:00:00.0: print_health_info:380:(pid 930): assert_var[0] 0xffffffff
...
[ 114.268505] mlx5_core 0000:00:00.0: poll_health:717:(pid 0): device's health compromised - reached miss count
[ 114.268524] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[0] 0xffffffff
...
[ 167.308576] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
[ 167.308799] infiniband mlx5_0: destroy_qp_common:2367:(pid 73): mlx5_ib: modify QP 0x000724 to RESET failed
[ 198.028576] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command resource
[ 228.748700] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource
[ 259.468867] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_Q_COUNTER(0x773) timeout. Will cause a leak of a command resource
[ 290.189284] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_CQ(0x401) timeout. Will cause a leak of a command resource
[ 320.909972] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_Q_COUNTER(0x773) timeout. Will cause a leak of a command resource
[ 351.628945] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_CQ(0x401) timeout. Will cause a leak of a command resource
>
>
>
... snip ...

2020-07-09 18:37:13

by Parav Pandit

[permalink] [raw]
Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

On 7/9/2020 3:36 PM, Niklas Schnelle wrote:
>
> On 7/8/20 5:44 PM, Parav Pandit wrote:
> ... snip ..
>>>
>>
>> It is likely because events_cleanup() freed, using kvfree(), the memory that the health recovery context is trying to access in the notifier chain.
>>
>> While reviewing I see few more errors as below.
>> (a) mlx5_pci_err_detected() invokes mlx5_drain_health_wq() outside of intf_state_mutex.
>> (b) mlx5_enter_error_state() in commit b6e0b6bebe0 read and updates dev state outside of intf_state_mutex.
>> (c) due to drain_health_wq() introduction in mlx5_pci_close() in commit 42ea9f1b5c625 health_wq is flushed twice.
>> (d) priv->events is freed using kvfree() but was allocated using kzalloc().
>>
>> This code certainly needs rework.
>>
>> In the meantime, to avoid this regression, I believe the below hunk eliminates the error introduced in commit 41798df9bfca.
>> Will you please help test it?
>>
>> Shay and I did the review of below patch.
>> If it works I will get it through Saeed's tree and internal reviews.
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> index ebec2318dbc4..529df5703737 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> @@ -785,11 +785,6 @@ static int mlx5_pci_init(struct mlx5_core_dev *dev, struct pci_dev *pdev,
>>
>> static void mlx5_pci_close(struct mlx5_core_dev *dev)
>> {
>> - /* health work might still be active, and it needs pci bar in
>> - * order to know the NIC state. Therefore, drain the health WQ
>> - * before removing the pci bars
>> - */
>> - mlx5_drain_health_wq(dev);
>> iounmap(dev->iseg);
>> pci_clear_master(dev->pdev);
>> release_bar(dev->pdev);
>> @@ -1235,6 +1230,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
>> if (cleanup) {
>> mlx5_unregister_device(dev);
>> mlx5_devlink_unregister(priv_to_devlink(dev));
>> + mlx5_drain_health_wq(dev);
> I think with the above you can remove the mlx5_drain_health_wq(dev) in remove_one()
> as that calls mlx5_unload_one() with cleanup == true.
Oh yes, I was supposed to remove it from there too.
That was the whole point of the change, but I missed it. :(
My bad that I didn't share the patch.
Please find the inline patch at the end of the email thread.

> Also I wonder if it is a problem that the order of
> mlx5_devlink_unregister(priv_to_devlink(dev)) and mlx5_unregister_device(dev)
> is switched compared to the 5.7 code. That said, changing both seems to
> result in a deadlock, though not the "leak of a command resource":
>
The current order in 5.8-rc4, unregister_device before devlink_unregister,
is the right one: the netdevice must be unregistered before unregistering
the devlink device.

Please do not change the order of mlx5_unregister_device() and
mlx5_devlink_unregister().


>>
> As is, the patch above fixes the dereference but results in the same completion
> error as current 5.8-rc4

The patch below should hopefully fix it. It fixes the bug introduced in
commit 41798df9bfca, and additionally fixes one small issue from commit
42ea9f1b5c62. Please remove all previous changes and apply the changes
below. Hopefully this resolves the issue.


From a260b2e6a6065a57c2fa621271483cd51d0a1abf Mon Sep 17 00:00:00 2001
From: Parav Pandit <[email protected]>
Date: Thu, 9 Jul 2020 20:57:13 +0300
Subject: [PATCH] net/mlx5: Drain health wq after unregistering device

When the health worker is disabled before unregistering the device, and a
firmware error is triggered while in the middle of unregistering the
device, it leads to very long command timeouts.

This happens because unregistering the device involves several firmware
commands while the health recovery worker is disabled, so the device state
is never updated to error.

cpu-0 cpu-1
----- -----
remove_one()
mlx5_drain_health_wq() poll_health()
mlx5_unregister_device() mlx5_trigger_health_work()
mlx5_cmd_exec() /* health work skipped */
/* timeout */

Hence, disable the health worker after unregistering the device and
before fully unloading it, as the health worker uses the device during
health recovery. Both contexts are serialized by intf_state_mutex.

Change-Id: I054df865726af42ea940300b439889e58d0956d1
Fixes: 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")
Fixes: 42ea9f1b5c62 ("net/mlx5: drain health workqueue in case of driver load error")
Reported-by: Niklas Schnelle <[email protected]>
Signed-off-by: Shay Drory <[email protected]>
Signed-off-by: Parav Pandit <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/main.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index e32d46c33701..100a5c31ee0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -785,11 +785,6 @@ static int mlx5_pci_init(struct mlx5_core_dev *dev, struct pci_dev *pdev,

static void mlx5_pci_close(struct mlx5_core_dev *dev)
{
- /* health work might still be active, and it needs pci bar in
- * order to know the NIC state. Therefore, drain the health WQ
- * before removing the pci bars
- */
- mlx5_drain_health_wq(dev);
iounmap(dev->iseg);
pci_clear_master(dev->pdev);
release_bar(dev->pdev);
@@ -1230,6 +1225,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
if (cleanup) {
mlx5_unregister_device(dev);
mlx5_devlink_unregister(priv_to_devlink(dev));
+ mlx5_drain_health_wq(dev);
} else {
mlx5_detach_device(dev);
}
@@ -1361,6 +1357,11 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
return 0;

err_load_one:
+ /* health work might still be active, and it needs pci bar in
+ * order to know the NIC state. Therefore, drain the health WQ
+ * before removing the pci bars
+ */
+ mlx5_drain_health_wq(dev);
mlx5_pci_close(dev);
pci_init_err:
mlx5_mdev_uninit(dev);
@@ -1377,7 +1378,6 @@ static void remove_one(struct pci_dev *pdev)

devlink_reload_disable(devlink);
mlx5_crdump_disable(dev);
- mlx5_drain_health_wq(dev);
mlx5_unload_one(dev, true);
mlx5_pci_close(dev);
mlx5_mdev_uninit(dev);
--
2.26.2

2020-07-10 08:38:32

by Niklas Schnelle

[permalink] [raw]
Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken



On 7/9/20 8:34 PM, Parav Pandit wrote:
> On 7/9/2020 3:36 PM, Niklas Schnelle wrote:
>>
>> On 7/8/20 5:44 PM, Parav Pandit wrote:
... snip ..

>
>>>
>> As is, the patch above fixes the dereference but results in the same completion
>> error as current 5.8-rc4
>
> Below patch should hopefully fix it. It fixes the bug introduced in
> commit 41798df9bfca.
> Additionally it fixes one small change of 42ea9f1b5c62.
> Please remove all previous changes and apply below changes.
> Hopefully this should resolve.
>
>
> From a260b2e6a6065a57c2fa621271483cd51d0a1abf Mon Sep 17 00:00:00 2001
> From: Parav Pandit <[email protected]>
> Date: Thu, 9 Jul 2020 20:57:13 +0300
> Subject: [PATCH] net/mlx5: Drain health wq after unregistering device
>
... snip ...

Hmm, we still seem to be missing something: this patch does not yet restore
the v5.7 behavior and still results in a hang on e.g. echo 1 > /proc/cio_settle
after a vmcp detach pcif <fid>. In particular, the PCI device is never removed.
I do agree, however, that your changes make sense, so maybe something else
broke the part that finally gives up on recovering the device.

By the way, you might want to turn off automatic line wrapping in your mail
client; the patch was corrupted.

See the dmesg output below for the run with your patch applied, and below
that for the output on v5.7.

[ 111.243608] mlx5_core 0000:00:00.0: poll_health:702:(pid 0): Fatal error 1 detected
[ 111.243851] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[0] 0xffffffff
[ 111.243858] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[1] 0xffffffff
[ 111.243864] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[2] 0xffffffff
[ 111.243871] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[3] 0xffffffff
[ 111.243878] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[4] 0xffffffff
[ 111.243885] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): assert_exit_ptr 0xffffffff
[ 111.243892] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): assert_callra 0xffffffff
[ 111.243902] mlx5_core 0000:00:00.0: print_health_info:388:(pid 0): fw_ver 65535.65535.65535
[ 111.243909] mlx5_core 0000:00:00.0: print_health_info:389:(pid 0): hw_id 0xffffffff
[ 111.243916] mlx5_core 0000:00:00.0: print_health_info:390:(pid 0): irisc_index 255
[ 111.243925] mlx5_core 0000:00:00.0: print_health_info:391:(pid 0): synd 0xff: unrecognized error
[ 111.243942] mlx5_core 0000:00:00.0: print_health_info:393:(pid 0): ext_synd 0xffff
[ 111.243949] mlx5_core 0000:00:00.0: print_health_info:395:(pid 0): raw fw_ver 0xffffffff
[ 111.244263] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[ 111.244291] mlx5_core 0000:00:00.0: PME# disabled
[ 244.092384] INFO: task kmcheck:73 blocked for more than 122 seconds.
[ 244.092626] Not tainted 5.8.0-rc4-dirty #2
[ 244.092631] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 244.092636] kmcheck D10888 73 2 0x00000000
[ 244.092648] Call Trace:
[ 244.092659] [<00000000088a1c46>] __schedule+0x2d6/0x5a8
[ 244.092664] [<00000000088a1f72>] schedule+0x5a/0x130
[ 244.092671] [<00000000088a8182>] schedule_timeout+0xfa/0x138
[ 244.092677] [<00000000088a37e8>] wait_for_completion+0xc0/0x110
[ 244.092685] [<0000000007d30798>] __flush_work+0xd8/0x120
[ 244.092691] [<0000000007d30f88>] __cancel_work_timer+0x150/0x208
[ 244.092750] [<000003ff801d5c4c>] mlx5_unload_one+0x64/0x148 [mlx5_core]
[ 244.092770] [<000003ff801d5dc2>] remove_one+0x52/0xd8 [mlx5_core]
[ 244.092778] [<0000000008532500>] pci_device_remove+0x40/0x90
[ 244.092787] [<00000000085a0012>] __device_release_driver+0x17a/0x220
[ 244.092794] [<00000000085a00f8>] device_release_driver+0x40/0x50
[ 244.092812] [<0000000008527dcc>] pci_stop_bus_device+0x94/0xc0
[ 244.092817] [<00000000085280c0>] pci_stop_and_remove_bus_device_locked+0x30/0x48
[ 244.092822] [<0000000007cfefd6>] __zpci_event_availability+0x12e/0x318
[ 244.092829] [<00000000088066fc>] chsc_process_event_information.constprop.0+0x214/0x228
[ 244.092834] [<000000000880daa8>] crw_collect_info+0x280/0x350
[ 244.092840] [<0000000007d38a96>] kthread+0x176/0x1a0
[ 244.092844] [<00000000088aa820>] ret_from_fork+0x24/0x2c
[ 244.092863] 4 locks held by kmcheck/73:
[ 244.092868] #0: 0000000008db5e48 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x250/0x350
[ 244.092878] #1: 0000000008d95668 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[ 244.092887] #2: 00000000de431238 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x50
[ 244.092896] #3: 00000000d152cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_unload_one+0x38/0x148 [mlx5_core]
[ 244.092922] INFO: task kworker/u128:2:346 blocked for more than 122 seconds.
[ 244.092932] Not tainted 5.8.0-rc4-dirty #2
[ 244.092936] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 244.092940] kworker/u128:2 D10848 346 2 0x02000000
[ 244.092968] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[ 244.092974] Call Trace:
[ 244.092979] [<00000000088a1c46>] __schedule+0x2d6/0x5a8
[ 244.092985] [<00000000088a1f72>] schedule+0x5a/0x130
[ 244.092990] [<00000000088a2704>] schedule_preempt_disabled+0x2c/0x48
[ 244.092996] [<00000000088a49c2>] __mutex_lock+0x372/0x960
[ 244.093002] [<00000000088a4fe2>] mutex_lock_nested+0x32/0x40
[ 244.093021] [<000003ff801e0978>] mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[ 244.093041] [<000003ff801e0a0c>] mlx5_fw_fatal_reporter_err_work+0x34/0xb8 [mlx5_core]
[ 244.093047] [<0000000007d2f84c>] process_one_work+0x27c/0x478
[ 244.093054] [<0000000007d2faae>] worker_thread+0x66/0x368
[ 244.093061] [<0000000007d38a96>] kthread+0x176/0x1a0
[ 244.093067] [<00000000088aa820>] ret_from_fork+0x24/0x2c
[ 244.093077] 3 locks held by kworker/u128:2/346:
[ 244.093081] #0: 00000000d230d948 ((wq_completion)mlx5_health0000:00:00.0){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[ 244.093089] #1: 000003e000783e18 ((work_completion)(&health->fatal_report_work)){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[ 244.093097] #2: 00000000d152cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[ 244.093136]
Showing all locks held in the system:
[ 244.093147] 1 lock held by khungtaskd/27:
[ 244.093153] #0: 0000000008d16990 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x50
[ 244.093164] 4 locks held by kmcheck/73:
[ 244.093170] #0: 0000000008db5e48 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x250/0x350
[ 244.093178] #1: 0000000008d95668 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[ 244.093187] #2: 00000000de431238 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x50
[ 244.093199] #3: 00000000d152cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_unload_one+0x38/0x148 [mlx5_core]
[ 244.093221] 3 locks held by kworker/u128:2/346:
[ 244.093225] #0: 00000000d230d948 ((wq_completion)mlx5_health0000:00:00.0){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[ 244.093233] #1: 000003e000783e18 ((work_completion)(&health->fatal_report_work)){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[ 244.093241] #2: 00000000d152cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_enter_error_state+0xa0/0x100 [mlx5_core]

[ 244.093274] =============================================

On v5.7, on the other hand, the recovery flow is aborted and the device is
released, so that the hard unplug can complete with the device fully removed.
(Note that the example below contains SMC messages not present in the one
above; you can ignore them.)

[78144.718600] mlx5_core 0001:00:00.0: poll_health:695:(pid 3222): Fatal error 1 detected
[78144.718608] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[0] 0xffffffff
[78144.718611] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[1] 0xffffffff
[78144.718613] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[2] 0xffffffff
[78144.718616] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[3] 0xffffffff
[78144.718619] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[4] 0xffffffff
[78144.718621] mlx5_core 0001:00:00.0: print_health_info:376:(pid 3222): assert_exit_ptr 0xffffffff
[78144.718625] mlx5_core 0001:00:00.0: print_health_info:378:(pid 3222): assert_callra 0xffffffff
[78144.718629] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[78144.718634] mlx5_core 0001:00:00.0: print_health_info:381:(pid 3222): fw_ver 65535.65535.65535
[78144.718638] mlx5_core 0001:00:00.0: print_health_info:382:(pid 3222): hw_id 0xffffffff
[78144.718642] mlx5_core 0001:00:00.0: print_health_info:383:(pid 3222): irisc_index 255
[78144.718652] mlx5_core 0001:00:00.0: print_health_info:384:(pid 3222): synd 0xff: unrecognized error
[78144.718654] mlx5_core 0001:00:00.0: print_health_info:386:(pid 3222): ext_synd 0xffff
[78144.718656] mlx5_core 0001:00:00.0: print_health_info:388:(pid 3222): raw fw_ver 0xffffffff
[78144.718906] smc: removing ib device mlx5_0
[78144.719016] mlx5_core 0001:00:00.0: mlx5_health_try_recover:308:(pid 3174): handling bad device here
[78144.719020] mlx5_core 0001:00:00.0: mlx5_error_sw_reset:225:(pid 3174): start
[78145.754943] mlx5_core 0001:00:00.0: NIC IFC still 7 after 1000ms.
[78145.754947] mlx5_core 0001:00:00.0: mlx5_error_sw_reset:258:(pid 3174): end
[78206.864928] mlx5_core 0001:00:00.0: mlx5_health_try_recover:313:(pid 3174): health recovery flow aborted, PCI reads still not working
[78206.864955] mlx5_core 0001:00:00.0: mlx5_unload_one:1252:(pid 121): mlx5_unload_one: interface is down, NOP
[78206.905176] pci 0001:00:00.0: Removing from iommu group 1
[78206.906828] sclp_cmd: configure PCI I/O adapter failed: cmd=0x001b0001 response=0x09f0
[78206.906873] pci_bus 0001:00: busn_res: [bus 00] is released