2022-10-31 12:49:08

by Bjorn Helgaas

Subject: [[email protected]: [Bug 216644] New: Host OS hangs when enabling VMD in UEFI setup]

Thanks, Adrian, for the bisection and detailed debugging!

----- Forwarded message from [email protected] -----

https://bugzilla.kernel.org/show_bug.cgi?id=216644

Summary: Host OS hangs when enabling VMD in UEFI setup
Kernel Version: 6.1-rc2
Regression: No

Created attachment 303108
--> https://bugzilla.kernel.org/attachment.cgi?id=303108&action=edit
OS Log (Serial Log)

With VMD enabled in BIOS setup, the host OS cannot boot successfully and reports
the following error messages:

[ 8.986310] vmd 0000:64:05.5: PCI host bridge to bus 10000:00
...
[ 9.674113] vmd 0000:64:05.5: Bound to PCI domain 10000
...
[ 33.592638] DMAR: VT-d detected Invalidation Queue Error: Reason f
[ 33.592640] DMAR: VT-d detected Invalidation Time-out Error: SID ffff
[ 33.599853] DMAR: VT-d detected Invalidation Completion Error: SID ffff
[ 33.607339] DMAR: QI HEAD: UNKNOWN qw0 = 0x0, qw1 = 0x0
[ 33.621143] DMAR: QI PRIOR: UNKNOWN qw0 = 0x0, qw1 = 0x0
[ 33.627366] DMAR: Invalidation Time-out Error (ITE) cleared


*** Hardware Info ***
Platform: Skylake-D (Purley) platform
VMD: 8086:201d
# lspci -s 0000:64:05.5 -nn
0000:64:05.5 RAID bus controller [0104]: Intel Corporation Volume Management Device NVMe RAID Controller [8086:201d] (rev 04)


*** Detail Info ***
`git bisect` points to the following offending patch (commit 6aab5622296b):

commit 6aab5622296b990024ee67dd7efa7d143e7558d0
Author: Nirmal Patel <[email protected]>
Date: Tue Nov 16 15:11:36 2021 -0700

PCI: vmd: Clean up domain before enumeration

During VT-d pass-through, the VMD driver occasionally fails to
enumerate underlying NVMe devices when repetitive reboots are
performed in the guest OS. The issue can be resolved by resetting
VMD root ports for proper enumeration and triggering secondary bus
reset which will also propagate reset through downstream bridges.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Nirmal Patel <[email protected]>
Signed-off-by: Lorenzo Pieralisi <[email protected]>
Reviewed-by: Jon Derrick <[email protected]>
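
For reference, the "secondary bus reset" the commit message refers to is
normally a pulse of the Secondary Bus Reset bit in the bridge control
register. A minimal sketch of that mechanism (illustration only, not the
driver's actual code; 'bridge' here stands for a VMD root port's pci_dev):

    u16 ctl;

    /* Assert the Secondary Bus Reset bit, wait briefly, then deassert it. */
    pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &ctl);
    pci_write_config_word(bridge, PCI_BRIDGE_CONTROL,
                          ctl | PCI_BRIDGE_CTL_BUS_RESET);
    msleep(2);          /* keep the reset asserted for a short time */
    pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, ctl);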


*** Debugging Info ***
1. Reverting 6aab5622296b on top of 6.1-rc2 fixes the issue.

2. Commenting out the call to vmd_domain_reset() also fixes the issue, so it
looks like the memset_io() in that function is what triggers the problem (the
register range it clears is noted after the excerpt):

static void vmd_domain_reset(struct vmd_dev *vmd)
{
    ...
    for (bus = 0; bus < max_buses; bus++) {
        for (dev = 0; dev < 32; dev++) {
            ...    /* per-function loop, header-type and bridge-class checks elided */

                memset_io(base + PCI_IO_BASE, 0,
                          PCI_ROM_ADDRESS1 - PCI_IO_BASE);
            }
        }
    }
}
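
For reference, the range cleared by that memset_io() spans the type-1 header
from PCI_IO_BASE (0x1c) up to, but not including, PCI_ROM_ADDRESS1 (0x38),
i.e. the bridge window registers and the capability pointer (offsets per
include/uapi/linux/pci_regs.h):

    PCI_IO_BASE             0x1c    /* I/O base/limit, secondary status */
    PCI_MEMORY_BASE         0x20    /* memory base/limit */
    PCI_PREF_MEMORY_BASE    0x24    /* prefetchable memory base/limit */
    PCI_PREF_BASE_UPPER32   0x28
    PCI_PREF_LIMIT_UPPER32  0x2c
    PCI_IO_BASE_UPPER16     0x30    /* I/O base/limit upper 16 bits */
    PCI_CAPABILITY_LIST     0x34    /* capability pointer */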

3. pci_reset_bus() returns -25 because 'slot' or 'bus->self' is NULL.
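
(For clarity, -25 is -ENOTTY, i.e. no applicable reset method. A hypothetical
caller, for illustration only, not code from the driver:)

    /* Hypothetical helper just to show the errno mapping: */
    static void try_bus_reset(struct pci_dev *pdev)
    {
        int ret = pci_reset_bus(pdev);

        if (ret == -ENOTTY)     /* -25: no slot and no bus->self to reset */
            pci_warn(pdev, "no applicable bus/slot reset method\n");
    }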

4. We have 4 disks attached to VMD:
# nvme list
Node          Generic       SN                Model                      Namespace  Usage                 Format        FW Rev
------------- ------------- ----------------- -------------------------- ---------- --------------------- ------------- --------
/dev/nvme3n1  /dev/ng3n1    222639A46A39      Micron_7450_MTFDKBA960TFR  1          11.48 GB / 960.20 GB  512 B + 0 B   E2MU111
/dev/nvme2n1  /dev/ng2n1    222639A46A30      Micron_7450_MTFDKBA960TFR  1           4.18 GB / 960.20 GB  512 B + 0 B   E2MU111
/dev/nvme1n1  /dev/ng1n1    BTLJ849201CE1P0I  SSDPELKX010T8L             1           1.00 TB /   1.00 TB  512 B + 0 B   VCV1LZ37
/dev/nvme0n1  /dev/ng0n1    BTLJ849201BS1P0I  SSDPELKX010T8L             1           1.00 TB /   1.00 TB  512 B + 0 B   VCV1LZ37

Any thoughts? Thanks for the help.

----- End forwarded message -----


2022-11-04 05:01:43

by Adrian Huang12

Subject: RE: [External] [[email protected]: [Bug 216644] New: Host OS hangs when enabling VMD in UEFI setup]

Hi Nirmal,

Thanks for the info about disabling interrupt remapping in BIOS setup.

The issue no longer appears after disabling interrupt remapping in BIOS setup; the OS log is attached.
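
Presumably the same thing can be checked from the kernel side with the
interrupt-remapping boot parameter instead of the BIOS toggle (assumption on
my part, not tested here):

    intremap=off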


-- Adrian

-----Original Message-----
From: Bjorn Helgaas <[email protected]>
Sent: Monday, October 31, 2022 7:39 PM
To: Nirmal Patel <[email protected]>
Cc: Jon Derrick <[email protected]>; Adrian Huang12 <[email protected]>; [email protected]; [email protected]
Subject: [External] [[email protected]: [Bug 216644] New: Host OS hangs when enabling VMD in UEFI setup]



Attachments:
sol-disabled-intr-remapping.log (127.63 kB)