2021-07-15 00:57:53

by Dexuan Cui

Subject: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

Hi all,
I'm seeing a strange "no MSI-X interrupt" issue with the Mellanox NIC
driver on a physical Linux host [1] when I enable only some of the CPUs.

The physical host has 104 logical processors (2 sockets, each with 26
cores and Hyper-Threading enabled). By default, with all CPUs enabled, the
Mellanox driver works fine when Linux boots up.

If I limit Linux to 1, 2, 32, 64 or 96 processors with the kernel parameter
"maxcpus=X" or "nr_cpus=X", everything still works fine.

However, if the Linux host OS uses only 4, 8 or 16 processors, the
mlx5_core driver fails to load because it receives no interrupt when
creating an EQ (maxcpus=8 or 16), or the driver loads but reports a
timeout error when I try to bring the NIC up (maxcpus=4). The issue
reproduces 100% of the time.

For example, with "maxcpus=8", I get the timeout errors below when trying
to load mlx5_core:

# modprobe mlx5_core
[ 1475.716688] mlx5_core 0000:d8:00.0: firmware version: 16.25.8352
[ 1475.722742] mlx5_core 0000:d8:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 1475.991398] mlx5_core 0000:d8:00.0: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)

[ 1537.020001] mlx5_core 0000:d8:00.0: mlx5_cmd_eq_recover:245:(pid 1416): Recovered 1 EQEs on cmd_eq
[ 1537.028969] mlx5_core 0000:d8:00.0: wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]: CREATE_EQ(0x301) recovered after timeout
[ 1598.460003] mlx5_core 0000:d8:00.0: mlx5_cmd_eq_recover:245:(pid 1416): Recovered 1 EQEs on cmd_eq
[ 1598.468978] mlx5_core 0000:d8:00.0: wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]: CREATE_EQ(0x301) recovered after timeout
[ 1659.900010] mlx5_core 0000:d8:00.0: mlx5_cmd_eq_recover:245:(pid 1416): Recovered 1 EQEs on cmd_eq
[ 1659.908987] mlx5_core 0000:d8:00.0: wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]: CREATE_EQ(0x301) recovered after timeout
[ 1721.340006] mlx5_core 0000:d8:00.0: mlx5_cmd_eq_recover:245:(pid 1416): Recovered 1 EQEs on cmd_eq
[ 1721.348989] mlx5_core 0000:d8:00.0: wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]: CREATE_EQ(0x301) recovered after timeout

When this happens, the mlx5_core driver is stuck with the call trace
below, waiting for an interrupt:

# ps aux |grep modprobe
root 1416 0.0 0.0 11024 1472 ttyS0 D+ 08:08 0:00 modprobe mlx5_core
root 1480 0.0 0.0 6440 736 pts/0 S+ 08:15 0:00 grep --color=auto modprobe

# cat /proc/1416/stack
[<0>] cmd_exec+0x8a7/0x9b0 [mlx5_core]
[<0>] mlx5_cmd_exec+0x24/0x50 [mlx5_core]
[<0>] create_map_eq+0x2a6/0x380 [mlx5_core]
[<0>] mlx5_eq_table_create+0x504/0x710 [mlx5_core]
[<0>] mlx5_load+0x52/0x130 [mlx5_core]
[<0>] mlx5_init_one+0x1cc/0x250 [mlx5_core]
[<0>] probe_one+0x1d3/0x2a0 [mlx5_core]
[<0>] local_pci_probe+0x45/0x80
[<0>] pci_device_probe+0x10f/0x1c0
[<0>] really_probe+0x1c1/0x3b0
[<0>] __driver_probe_device+0x109/0x180
[<0>] driver_probe_device+0x23/0xa0
[<0>] __driver_attach+0xbd/0x160
[<0>] bus_for_each_dev+0x7c/0xc0
[<0>] driver_attach+0x1e/0x20
[<0>] bus_add_driver+0x152/0x1f0
[<0>] driver_register+0x74/0xd0
[<0>] __pci_register_driver+0x68/0x70
[<0>] init+0x6b/0x1000 [mlx5_core]
[<0>] do_one_initcall+0x46/0x1d0
[<0>] do_init_module+0x62/0x250
[<0>] load_module+0x2503/0x2730
[<0>] __do_sys_finit_module+0xbf/0x120
[<0>] __x64_sys_finit_module+0x1a/0x20
[<0>] do_syscall_64+0x38/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

To make the issue even weirder: when it happens (e.g. when Linux only
uses 8 processors), if I manually bring CPUs 8-31 online [2] and then
take them offline again [3], the Mellanox driver works fine!

This is an x86-64 host. Is it possible that IOMMU Interrupt Remapping
is not properly set up with maxcpus=4, 8 and 16?
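
A quick way to double-check whether interrupt remapping is actually active
(assuming an Intel IOMMU) is to grep the boot log; recent kernels print
something like "DMAR-IR: Enabled IRQ remapping in x2apic mode" when it is
enabled:

# dmesg | grep -i -e DMAR -e remap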

The above tests were done with the recent Linux v5.14-rc1 kernel. I also
tried Ubuntu 20.04's kernel "5.4.0-77-generic", and the Mellanox driver
exhibits exactly the same issue.

I dual-boot Linux and Windows on this physical machine, and Windows
doesn't have the issue when I let it use only 4, 8 or 16 processors.
So the issue appears to be specific to Linux.

Can someone please shed some light on this strange issue? I'm ready
to provide more logs if needed. Thanks!

PS: the physical machine has 4 NVMe controllers and 4 Broadcom NICs,
none of which are affected by maxcpus=4, 8 or 16.

[1] This is the 'lspci' output of the Mellanox NIC:
d8:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
        Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
        Flags: bus master, fast devsel, latency 0, IRQ 33, NUMA node 1
        Memory at f8000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at fbe00000 [disabled] [size=1M]
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [48] Vital Product Data
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1c0] Secondary PCI Express
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core
00: b3 15 17 10 46 05 10 00 00 00 00 02 08 00 00 00
10: 0c 00 00 f8 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 b3 15 80 00
30: 00 00 e0 fb 60 00 00 00 00 00 00 00 ff 01 00 00

[2] for i in `seq 8 31`; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done
[3] for i in `seq 8 31`; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done

Thanks,
-- Dexuan


2021-07-15 01:39:30

by Dexuan Cui

Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

> From: Dexuan Cui
> Sent: Wednesday, July 14, 2021 5:39 PM
> To: [email protected]; [email protected]
> Cc: Haiyang Zhang <[email protected]>; [email protected]
> Subject: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8
>
> [...]

(+ the linux-pci list)

It turns out that adding "intremap=off" can work around the issue!
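
For reference, on an Ubuntu/GRUB system the workaround amounts to something
like the following (the exact config file and regeneration command vary by
distro):

# grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX="maxcpus=8 intremap=off"
# update-grub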

The root cause is not clear yet. I don't know why Windows is unaffected.

Thanks,
Dexuan

2021-07-18 09:13:35

by Leon Romanovsky

Subject: Re: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

On Thu, Jul 15, 2021 at 01:11:55AM +0000, Dexuan Cui wrote:
> > From: Dexuan Cui
> > Sent: Wednesday, July 14, 2021 5:39 PM
> > To: [email protected]; [email protected]
> > Cc: Haiyang Zhang <[email protected]>; [email protected]
> > Subject: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8
> >
> > [...]
>
> (+ the linux-pci list)
>
> It turns out that adding "intremap=off" can work around the issue!
>
> The root cause is still not clear yet. I don't know why Windows is good here.

The card is stuck in the FW; maybe Saeed knows why. I tried your
scenario and it worked for me.

Thanks


2021-07-19 20:48:08

by Saeed Mahameed

Subject: Re: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

On Sun, 2021-07-18 at 12:12 +0300, Leon Romanovsky wrote:
> On Thu, Jul 15, 2021 at 01:11:55AM +0000, Dexuan Cui wrote:
> > > From: Dexuan Cui
> > > Sent: Wednesday, July 14, 2021 5:39 PM
> > > To: [email protected]; [email protected]
> > > Cc: Haiyang Zhang <[email protected]>;
> > > [email protected]
> > > Subject: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8
> > >
> > > [...]
> >
> > (+ the linux-pci list)
> >
> > It turns out that adding "intremap=off" can work around the issue!
> >
> > The root cause is still not clear yet. I don't know why Windows is
> > good here.
>
> The card is stuck in the FW, maybe Saeed knows why. I tried your
> scenario and it worked for me.
>
> Thanks

I don't think the FW is stuck, since we see the command completion after
the timeout; this means that the first interrupt from the device got lost.

"wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]:
CREATE_EQ(0x301) recovered after timeout"

The fact that this happens on both the 5.14 and 5.4 kernels, and that the
issue is worked around by bringing the CPUs online or by disabling
intremap, means that there is something wrong with the interrupt remapping
mechanism. Maybe the interrupt is being delivered to an offline CPU?
Is this a qemu/VM guest or a bare-metal host?

2021-07-19 21:03:59

by Dexuan Cui

Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

> From: Saeed Mahameed <[email protected]>
> Sent: Monday, July 19, 2021 1:18 PM
> > > ...
> > > It turns out that adding "intremap=off" can work around the issue!
> > >
> > > The root cause is still not clear yet. I don't know why Windows is
> > > good here.
> >
> > The card is stuck in the FW, maybe Saeed knows why. I tried your
> > scenario and it worked for me.
> >
> > Thanks
>
> I don't think the FW is stuck since we see the cmd completion after
> timeout, this means that the 1st interrupt from the device got lost.
>
> "wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]:
> CREATE_EQ(0x301) recovered after timeout"
>
> the fact that this happens on 5.14 and 5.4 kernels and the issue is
> worked around via bringing the cpus online, or disabling intremap,
> means that there is something wrong with the interrupt remapping
> mechanism, maybe the interrupt is being delivered on an offline cpu ?
> is this a qemu/VM guest or a bare metal host ?

Thanks for the replies!

This is a bare-metal x86-64 host with Intel CPUs. Yes, I believe the
issue is in the IOMMU Interrupt Remapping mechanism rather than in the
NIC driver. I just don't understand why bringing the CPUs online and
then offline works around the issue. I'm trying to dump the IOMMU IR
table entries to look for any errors.

Thanks,
Dexuan

2021-07-21 21:17:42

by Thomas Gleixner

Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

Dexuan,

On Mon, Jul 19 2021 at 20:33, Dexuan Cui wrote:
> This is a bare metal x86-64 host with Intel CPUs. Yes, I believe the
> issue is in the IOMMU Interrupt Remapping mechanism rather in the
> NIC driver. I just don't understand why bringing the CPUs online and
> offline can work around the issue. I'm trying to dump the IOMMU IR
> table entries to look for any error.

can you please enable GENERIC_IRQ_DEBUGFS and provide the output of

cat /sys/kernel/debug/irq/irqs/$THENICIRQS
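
A generic way to find $THENICIRQS -- assuming the NIC is still at
0000:d8:00.0 and has its MSI-X vectors allocated -- is to enumerate them
from sysfs and dump each one:

for i in /sys/bus/pci/devices/0000:d8:00.0/msi_irqs/*; do
    cat /sys/kernel/debug/irq/irqs/$(basename $i)
done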

Thanks,

tglx

2021-08-18 21:10:29

by Dexuan Cui

Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

> From: Thomas Gleixner <[email protected]>
> Sent: Wednesday, July 21, 2021 2:17 PM
> To: Dexuan Cui <[email protected]>; Saeed Mahameed
>
> On Mon, Jul 19 2021 at 20:33, Dexuan Cui wrote:
> > This is a bare metal x86-64 host with Intel CPUs. Yes, I believe the
> > issue is in the IOMMU Interrupt Remapping mechanism rather in the
> > NIC driver. I just don't understand why bringing the CPUs online and
> > offline can work around the issue. I'm trying to dump the IOMMU IR
> > table entries to look for any error.
>
> can you please enable GENERIC_IRQ_DEBUGFS and provide the output of
>
> cat /sys/kernel/debug/irq/irqs/$THENICIRQS
>
> Thanks,
>
> tglx

Sorry for the late response! I checked the debugfs file below, and the
output is exactly the same in the good and bad cases -- in both cases I use
maxcpus=8; the only difference in the good case is that I bring CPUs 8-31
online and then offline again:
for i in `seq 8 31`; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done
for i in `seq 8 31`; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done

# cat /sys/kernel/debug/irq/irqs/209
handler:  handle_edge_irq
device:   0000:d8:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x35409200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_SET
            IRQD_AFFINITY_ON_ACTIVATE
            IRQD_CAN_RESERVE
            IRQD_HANDLE_ENFORCE_IRQCTX
node:     1
affinity: 0-7
effectiv: 5
pending:
domain:  INTEL-IR-MSI-3-3
 hwirq:   0x6c00000
 chip:    IR-PCI-MSI
  flags:   0x30
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-3
     hwirq:   0x20000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0xd1
         chip:    APIC
          flags:   0x0
         Vector:    42
         Target:     5
         move_in_progress: 0
         is_managed:       0
         can_reserve:      1
         has_reserved:     0
         cleanup_pending:  0

Thanks,
Dexuan

2021-08-19 20:45:54

by Dexuan Cui

Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

> From: Dexuan Cui
> Sent: Wednesday, August 18, 2021 2:08 PM
>
> > From: Thomas Gleixner <[email protected]>
> > Sent: Wednesday, July 21, 2021 2:17 PM
> > To: Dexuan Cui <[email protected]>; Saeed Mahameed
> >
> > On Mon, Jul 19 2021 at 20:33, Dexuan Cui wrote:
> > > This is a bare metal x86-64 host with Intel CPUs. Yes, I believe the
> > > issue is in the IOMMU Interrupt Remapping mechanism rather in the
> > > NIC driver. I just don't understand why bringing the CPUs online and
> > > offline can work around the issue. I'm trying to dump the IOMMU IR
> > > table entries to look for any error.
> >
> > can you please enable GENERIC_IRQ_DEBUGFS and provide the output of
> >
> > cat /sys/kernel/debug/irq/irqs/$THENICIRQS
> >
> > Thanks,
> >
> > tglx
>
> Sorry for the late response! I checked the below sys file, and the output is
> exactly the same in the good/bad cases -- in both cases, I use maxcpus=8;
> the only difference in the good case is that I online and then offline CPU 8~31:
> for i in `seq 8 31`; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done
> for i in `seq 8 31`; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
>
> # cat /sys/kernel/debug/irq/irqs/209
> ...

I tried the kernel parameter "intremap=nosid,no_x2apic_optout,nopost" but
it didn't help. Only "intremap=off" can work around the no-interrupt issue.

When the no-interrupt issue happens, irq 209's effective_affinity_list is 5.
I modified modify_irte() to print irte->low and irte->high, and I also
printed the irte_index for irq 209; they all looked normal to me, and they
were exactly the same in the bad case and the good case. It looks like, with
"intremap=on maxcpus=8", MSI-X delivery to CPU 5 doesn't work for the NIC
device (MSI-X on CPU 5 works for other devices such as an NVMe controller),
and somehow onlining and then offlining CPUs 8-31 can "fix" the issue, which
is really weird.

Thanks,
Dexuan

2021-08-28 21:28:32

by Thomas Gleixner

Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

Dexuan,

On Sat, Aug 28 2021 at 01:53, Thomas Gleixner wrote:
> On Thu, Aug 19 2021 at 20:41, Dexuan Cui wrote:
>>> Sorry for the late response! I checked the below sys file, and the output is
>>> exactly the same in the good/bad cases -- in both cases, I use maxcpus=8;
>>> the only difference in the good case is that I online and then offline CPU 8~31:
>>> for i in `seq 8 31`; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done
>>> for i in `seq 8 31`; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
>>>
>>> # cat /sys/kernel/debug/irq/irqs/209

Yes, that looks correct.

>>
>> I tried the kernel parameter "intremap=nosid,no_x2apic_optout,nopost" but
>> it didn't help. Only "intremap=off" can work round the no interrupt issue.
>>
>> When the no interrupt issue happens, irq 209's effective_affinity_list is 5.
>> I modified modify_irte() to print the irte->low, irte->high, and I also printed
>> the irte_index for irq 209, and they were all normal to me, and they were
>> exactly the same in the bad case and the good case -- it looks like, with
>> "intremap=on maxcpus=8", MSI-X on CPU5 can't work for the NIC device
>> (MSI-X on CPU5 works for other devices like a NVMe controller) , and somehow
>> "onlining and then offlining CPU 8~31" can "fix" the issue, which is really weird.

Just for the record: maxcpus=N is a dangerous boot option, as it leaves
the CPUs that were not brought up in a state where they can be hit by MCE
broadcasts without being able to act on them. Which means you're
operating the system out of spec.

According to your debug output the interrupt in question belongs to the
INTEL-IR-3 interrupt domain, which means it hangs off IOMMU3, aka DMAR
unit 3.

To which DMAR/remap units are the other, unaffected devices connected?

Thanks,

tglx

2021-08-29 20:14:44

by Dexuan Cui

Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

> From: Thomas Gleixner <[email protected]>
> Sent: Saturday, August 28, 2021 1:44 PM
> >> I tried the kernel parameter "intremap=nosid,no_x2apic_optout,nopost"
> but
> >> it didn't help. Only "intremap=off" can work round the no interrupt issue.
> >>
> >> When the no interrupt issue happens, irq 209's effective_affinity_list is 5.
> >> I modified modify_irte() to print the irte->low, irte->high, and I also printed
> >> the irte_index for irq 209, and they were all normal to me, and they were
> >> exactly the same in the bad case and the good case -- it looks like, with
> >> "intremap=on maxcpus=8", MSI-X on CPU5 can't work for the NIC device
> >> (MSI-X on CPU5 works for other devices like a NVMe controller) , and
> somehow
> >> "onlining and then offlining CPU 8~31" can "fix" the issue, which is really
> weird.
>
> Just for the record: maxcpus=N is a dangerous boot option as it leaves
> the non brought up CPUs in a state where they can be hit by MCE
> broadcasting without being able to act on it. Which means you're
> operating the system out of spec.

I didn't know about this. Thanks for the reply!

> According to your debug output the interrupt in question belongs to the
> INTEL-IR-3 interrupt domain, which means it hangs of IOMMU3, aka DMAR
> unit 3.
>
> To which DMAR/remap unit are the other unaffected devices connected to?
>
> tglx

With maxcpus=8, the NIC receives no interrupt on CPU 5, but an NVMe
interrupt ("INTEL-IR-6") on that CPU works, and two "IOAT" interrupts
("INTEL-IR-7") also work.

Apart from the NIC, the only IRQs connected to the faulty IOMMU3 are
irq 33 and irq 34:

root@lsg-gen7-a:~# cat /sys/kernel/debug/irq/irqs/33
handler:  handle_fasteoi_irq
device:   (null)
status:   0x00004100
istate:   0x00000000
ddepth:   1
wdepth:   0
dstate:   0x3503a000
            IRQD_LEVEL
            IRQD_IRQ_DISABLED
            IRQD_IRQ_MASKED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_ON_ACTIVATE
            IRQD_CAN_RESERVE
            IRQD_HANDLE_ENFORCE_IRQCTX
node:     1
affinity: 0-103
effectiv: 0
pending:
domain:  IO-APIC-18
 hwirq:   0x0
 chip:    IR-IO-APIC
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-3
     hwirq:   0x0
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x21
         chip:    APIC
          flags:   0x0
         Vector:     0
         Target:     0
         move_in_progress: 0
         is_managed:       0
         can_reserve:      1
         has_reserved:     1
         cleanup_pending:  0

root@lsg-gen7-a:~# cat /sys/kernel/debug/irq/irqs/34
handler:  handle_edge_irq
device:   0000:d7:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x37408200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_ON_ACTIVATE
            IRQD_CAN_RESERVE
            IRQD_DEFAULT_TRIGGER_SET
            IRQD_HANDLE_ENFORCE_IRQCTX
node:     1
affinity: 0-7
effectiv: 1
pending:
domain:  INTEL-IR-MSI-3-3
 hwirq:   0x6b80000
 chip:    IR-PCI-MSI
  flags:   0x30
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-3
     hwirq:   0x10000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x22
         chip:    APIC
          flags:   0x0
         Vector:    34
         Target:     1
         move_in_progress: 0
         is_managed:       0
         can_reserve:      1
         has_reserved:     0
         cleanup_pending:  0

root@lsg-gen7-a:~# lspci |grep d7:00.0
d7:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 07)

irq 33 doesn't appear in /proc/interrupts at all.
irq 34 does appear in /proc/interrupts but also receives no interrupts.

So it looks like IOMMU3 is somehow not working at all
with maxcpus=8, and onlining and then offlining CPUs 8-31
can somehow "fix" it. :-) I'm not sure whether this is a
kernel issue or a firmware issue.

Thanks,
Dexuan