This patchset fixes a VT-d hard lockup reported when an ATS-capable
endpoint device, connected to the system via a PCIe switch as in the
following topology, is surprise unplugged.
+-[0000:15]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
|           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
|           +-00.2  Intel Corporation Ice Lake RAS
|           +-00.4  Intel Corporation Device 0b23
|           \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0
|                                           |        NVIDIA Corporation Device 2324
|                                           +-01.0-[19]----00.0
|                                                    Mellanox Technologies MT2910 Family [ConnectX-7]
The user brought endpoint device 19:00.0's link down by flapping the link
control register of its hotplug-capable slot 17:01.0. In response to the
resulting DLLSC event, pciehp_ist() unloads the device driver and powers
the device off. While the driver is unloading, an IOMMU devTLB flush
request is issued to the link-down device; the long completion/timeout
waiting in interrupt context causes a continuous hard-lockup warning and
a system hang (see the full splat in patch [2] below).
Make a quick fix by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device that has been marked
pci_channel_io_perm_failure and is then powered off in
  pciehp_ist()
    pciehp_handle_presence_or_link_change()
      pciehp_disable_slot()
        remove_board()
          pciehp_unconfigure_device()
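The check relies on pci_dev_is_disconnected(), which patch [1] makes
public for use outside the PCI core. As defined in drivers/pci/pci.h, it
is simply a test of the device's error_state:

	static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
	{
		return dev->error_state == pci_channel_io_perm_failure;
	}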
Safe (orderly) removal doesn't trigger this issue, and the fix works for
all surprise-removal unplug operations.
This patchset was tested by [email protected] on stable v6.7-rc4.
change log:
v4:
- move the PCI device state checking after ATS per Baolu's suggestion.
v3:
- fix commit description typo.
v2:
- revise commit[1] description part according to Lukas' suggestion.
- revise commit[2] description to clarify the issue's impact.
v1:
- https://lore.kernel.org/lkml/[email protected]/T/
Thanks,
Ethan
Ethan Zhao (2):
PCI: make pci_dev_is_disconnected() helper public for other drivers
iommu/vt-d: don't issue devTLB flush request when device is
disconnected
drivers/iommu/intel/pasid.c | 21 ++++++++++++++++++++-
drivers/pci/pci.h | 5 -----
include/linux/pci.h | 5 +++++
3 files changed, 25 insertions(+), 6 deletions(-)
--
2.31.1
For endpoint devices connected to the system via hotplug-capable ports,
users can request a warm reset of the device by flapping the device's link
through the slot's Link Control register. As part of the pciehp_ist() DLLSC
interrupt sequence response, pciehp unloads the device driver and then
powers the device off, causing an IOMMU devTLB flush request to be sent
to the device and a long completion/timeout waiting in interrupt context.
That causes the following continuous hard-lockup warning and system hang:
[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629] intel_iommu_release_device+0x1f/0x30
[ 4223.822629] iommu_release_device+0x33/0x60
[ 4223.822629] iommu_bus_notifier+0x7f/0x90
[ 4223.822630] blocking_notifier_call_chain+0x60/0x90
[ 4223.822630] device_del+0x2e5/0x420
[ 4223.822630] pci_remove_bus_device+0x70/0x110
[ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631] pciehp_disable_slot+0x6b/0x100
[ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631] pciehp_ist+0x176/0x180
[ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632] irq_thread_fn+0x19/0x50
[ 4223.822632] irq_thread+0x104/0x190
[ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633] kthread+0x114/0x130
[ 4223.822633] ? __kthread_cancel_work+0x40/0x40
[ 4223.822633] ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634] <NMI>
[ 4223.822635] dump_stack+0x6d/0x88
[ 4223.822635] panic+0x101/0x2d0
[ 4223.822635] ? ret_from_fork+0x11/0x30
[ 4223.822635] nmi_panic.cold.14+0xc/0xc
[ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636] __perf_event_overflow+0x4f/0xf0
[ 4223.822636] handle_pmi_common+0x1ef/0x290
[ 4223.822636] ? __set_pte_vaddr+0x28/0x40
[ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637] ? __native_set_fixmap+0x24/0x30
[ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638] perf_event_nmi_handler+0x24/0x40
[ 4223.822638] nmi_handle+0x4d/0xf0
[ 4223.822638] default_do_nmi+0x49/0x100
[ 4223.822638] exc_nmi+0x134/0x180
[ 4223.822639] end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] </NMI>
[ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643] intel_iommu_release_device+0x1f/0x30
[ 4223.822643] iommu_release_device+0x33/0x60
[ 4223.822643] iommu_bus_notifier+0x7f/0x90
[ 4223.822644] blocking_notifier_call_chain+0x60/0x90
[ 4223.822644] device_del+0x2e5/0x420
[ 4223.822644] pci_remove_bus_device+0x70/0x110
[ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644] pciehp_disable_slot+0x6b/0x100
[ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645] pciehp_ist+0x176/0x180
[ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645] irq_thread_fn+0x19/0x50
[ 4223.822646] irq_thread+0x104/0x190
[ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646] kthread+0x114/0x130
[ 4223.822647] ? __kthread_cancel_work+0x40/0x40
[ 4223.822647] ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)
Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device that has been marked
pci_channel_io_perm_failure and is then powered off in
  pciehp_ist()
    pciehp_handle_presence_or_link_change()
      pciehp_disable_slot()
        remove_board()
          pciehp_unconfigure_device()
For safe (orderly) removal, the link is still alive when the IOMMU
releases the device and issues the devTLB invalidate request, so that
path doesn't trigger the issue. This patch works for all kinds of
surprise-removal unplug operations.
Tested-by: Haorong Ye <[email protected]>
Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/iommu/intel/pasid.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..7dbee9931eb6 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
if (!info || !info->ats_enabled)
return;
+ if (pci_dev_is_disconnected(to_pci_dev(dev)))
+ return;
+
sid = info->bus << 8 | info->devfn;
qdep = info->ats_qdep;
pfsid = info->pfsid;
--
2.31.1
On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
> For endpoint devices connected to the system via hotplug-capable ports,
> users can request a warm reset of the device by flapping the device's link
> through the slot's Link Control register. As part of the pciehp_ist() DLLSC
> interrupt sequence response, pciehp unloads the device driver and then
> powers the device off, causing an IOMMU devTLB flush request to be sent
> to the device and a long completion/timeout waiting in interrupt context.
I think the problem is in the "waiting in interrupt context".
Can you change qi_submit_sync() to *sleep* until the queue is done?
Instead of busy-waiting in atomic context?
Is the hardware capable of sending an interrupt once the queue is done?
If it is not capable, would it be viable to poll with exponential backoff
and sleep in-between polling once the polling delay increases beyond, say,
10 usec?
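Something along these lines might work (untested sketch; qi, desc_status,
QI_DONE and QI_ABORT are existing names in the VT-d driver, the function
itself is illustrative and must only be called from a context that may
sleep):

	static int qi_wait_done_backoff(struct q_inval *qi, int wait_index)
	{
		unsigned int delay_us = 1;

		while (READ_ONCE(qi->desc_status[wait_index]) != QI_DONE) {
			if (qi->desc_status[wait_index] == QI_ABORT)
				return -EAGAIN;
			if (delay_us < 10)
				udelay(delay_us);	/* cheap busy-wait */
			else
				usleep_range(delay_us, 2 * delay_us); /* sleep between polls */
			delay_us = min(2 * delay_us, 1024U); /* exponential backoff, capped */
		}
		return 0;
	}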
Again, the proposed patch is not a proper solution. It will paper over
the issue most of the time but every once in a while someone will still
get a hard lockup splat and it will then be more difficult to reproduce
and fix if the proposed patch is accepted.
> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
> OE kernel version xxxx
I don't see any reason to hide the kernel version.
This isn't Intel Confidential information.
> [ 4223.822628] Call Trace:
> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
__dmar_remove_one_dev_info() was removed by db75c9573b08 in v6.0
one and a half years ago, so the stack trace appears to be from
an older kernel version.
Thanks,
Lukas
On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
> > For endpoint devices connected to the system via hotplug-capable ports,
> > users can request a warm reset of the device by flapping the device's link
> > through the slot's Link Control register. As part of the pciehp_ist() DLLSC
> > interrupt sequence response, pciehp unloads the device driver and then
> > powers the device off, causing an IOMMU devTLB flush request to be sent
> > to the device and a long completion/timeout waiting in interrupt context.
>
> I think the problem is in the "waiting in interrupt context".
I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
capability to reduce the Invalidate Completion Timeout to a sane value?
Could you check whether that's supported?
Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
but that's not even a "must", it's a "should". So devices are free to
take even longer. We have to cut off at *some* point.
Thanks,
Lukas
On 12/21/2023 6:39 PM, Lukas Wunner wrote:
> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>> For endpoint devices connected to the system via hotplug-capable ports,
>> users can request a warm reset of the device by flapping the device's link
>> through the slot's Link Control register. As part of the pciehp_ist() DLLSC
>> interrupt sequence response, pciehp unloads the device driver and then
>> powers the device off, causing an IOMMU devTLB flush request to be sent
>> to the device and a long completion/timeout waiting in interrupt context.
> I think the problem is in the "waiting in interrupt context".
>
> Can you change qi_submit_sync() to *sleep* until the queue is done?
> Instead of busy-waiting in atomic context?
If you read that function carefully, you wouldn't suggest sleeping
there... it is 'sync'ed.
>
> Is the hardware capable of sending an interrupt once the queue is done?
> If it is not capable, would it be viable to poll with exponential backoff
> and sleep in-between polling once the polling delay increases beyond, say,
> 10 usec?
I don't know whether polling plus sleeping for the completion of a
meaningless devTLB invalidation request blindly sent to a
removed/powered-down/link-down device makes sense or not.
But according to PCIe spec 6.1, sec 10.3.1:
"Software ensures no invalidations are issued to a Function when its
ATS capability is disabled. "
>
> Again, the proposed patch is not a proper solution. It will paper over
> the issue most of the time but every once in a while someone will still
> get a hard lockup splat and it will then be more difficult to reproduce
> and fix if the proposed patch is accepted.
Could you point out why it is not proper? Is there any other window in
which the hard lockup could still happen in the ATS-capable device
surprise-removal case if we check the connection state first?
Please help elaborate.
>
>
>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>> OE kernel version xxxx
> I don't see any reason to hide the kernel version.
> This isn't Intel Confidential information.
>
Yes, this is a stack trace from an old kernel, but the customer also tried
the latest 6.7-rc4 (doesn't work) and the patched 6.7-rc4 (fixed).
Thanks,
Ethan
>> [ 4223.822628] Call Trace:
>> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
> __dmar_remove_one_dev_info() was removed by db75c9573b08 in v6.0
> one and a half years ago, so the stack trace appears to be from
> an older kernel version.
>
> Thanks,
>
> Lukas
On 12/21/2023 7:01 PM, Lukas Wunner wrote:
> On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
>> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>>> For endpoint devices connected to the system via hotplug-capable ports,
>>> users can request a warm reset of the device by flapping the device's link
>>> through the slot's Link Control register. As part of the pciehp_ist() DLLSC
>>> interrupt sequence response, pciehp unloads the device driver and then
>>> powers the device off, causing an IOMMU devTLB flush request to be sent
>>> to the device and a long completion/timeout waiting in interrupt context.
>> I think the problem is in the "waiting in interrupt context".
> I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
> capability to reduce the Invalidate Completion Timeout to a sane value?
> Could you check whether that's supported?
Per my understanding it is not about the Intel VT-d's capability; it is
about the third-party PCIe switch. Such switches are not aware of ATS
transactions at all, so if the endpoint device below a downstream port is
removed/powered off/link-down, the switch cannot feed back a
fault/completion/timeout to the upstream IOMMU for the broken ATS
transaction, while a root port can (verified).
>
> Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
1 min (60 sec)+50%
> but that's not even a "must", it's a "should". So devices are free to
It could happen if we blindly wait here, so we should avoid such a case.
Thanks,
Ethan
> take even longer. We have to cut off at *some* point.
>
> Thanks,
>
> Lukas
On 12/21/2023 7:01 PM, Lukas Wunner wrote:
> On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
>> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>>> For endpoint devices connected to the system via hotplug-capable ports,
>>> users can request a warm reset of the device by flapping the device's link
>>> through the slot's Link Control register. As part of the pciehp_ist() DLLSC
>>> interrupt sequence response, pciehp unloads the device driver and then
>>> powers the device off, causing an IOMMU devTLB flush request to be sent
>>> to the device and a long completion/timeout waiting in interrupt context.
>> I think the problem is in the "waiting in interrupt context".
> I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
> capability to reduce the Invalidate Completion Timeout to a sane value?
> Could you check whether that's supported?
>
> Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
> but that's not even a "must", it's a "should". So devices are free to
> take even longer. We have to cut off at *some* point.
I really "expected" there is interrrupt signal to iommu hardware when
the PCIe swtich downstream device 'gone', or some internal polling
/heartbeating the endpoint device for ATS breaking,
but so far seems there are only hotplug interrupts to downstream
control.
...
How do we define the point, some msec, at which software times out and
breaks out of the waiting loop? Or do we poll whether the target is gone?
Thanks,
Ethan
>
> Thanks,
>
> Lukas
On Fri, Dec 22, 2023 at 09:56:39AM +0800, Ethan Zhao wrote:
> I don't know whether polling plus sleeping for the completion of a
> meaningless devTLB invalidation request blindly sent to a
> removed/powered-down/link-down device makes sense or not.
If you have a way to get to the struct pci_dev * which you're waiting for
in qi_submit_sync() then I guess you could check for its presence and bail
out if it's gone, instead of issuing a cpu_relax().
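Roughly like this (untested; assumes a struct pci_dev *pdev for the
target can be plumbed down into qi_submit_sync(), which it doesn't have
today):

	while (qi->desc_status[wait_index] != QI_DONE) {
		if (pdev && pci_dev_is_disconnected(pdev))
			return -ENODEV;	/* device gone, completion will never arrive */
		cpu_relax();
	}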
> > Again, the proposed patch is not a proper solution. It will paper over
> > the issue most of the time but every once in a while someone will still
> > get a hard lockup splat and it will then be more difficult to reproduce
> > and fix if the proposed patch is accepted.
>
> Could you point out why it is not proper? Is there any other window in
> which the hard lockup could still happen in the ATS-capable device
> surprise-removal case if we check the connection state first?
> Please help elaborate.
Even though user space may have initiated orderly removal via sysfs,
the device may be yanked from the slot (surprise removed) while the
orderly removal is happening.
> Yes, this is a stack trace from an old kernel, but the customer also tried
> the latest 6.7-rc4
If you could provide a stacktrace for a contemporary kernel,
I think that would be preferred.
> (doesn't work) and the patched 6.7-rc4 (fixed).
Why is it fixed in v6.7-rc4? Is the present patch thus unnecessary?
> > Finally, it is common to adhere to terms
> > used in the PCIe Base Spec in commit messages, so "ATC Invalidate Request"
> > might be preferable to "devTLB flush request".
>
> ATS Invalidate Request ? devTLB flush request has the same meaning,
>
> I thought all iommu/PCIe guys could understand.
I'm just pointing out the preferred way to write commit messages
in the PCI subsystem (as I've perceived it over the years) so that
you can reduce the number of iterations you have to go through
due to maintainer feedback. I'm just trying to be helpful.
> How do we define the point, some msec, at which software times out and
> breaks out of the waiting loop? Or do we poll whether the target is gone?
I'd say adhere to the 1 min + 50% number provided in the spec.
If you know the device is gone before that then you can break out
of the loop in qi_submit_sync() of course.
The question is, does the Intel IOMMU have a timeout at all for
Invalidate Requests? I guess we don't really know that because
in the stack trace you've provided, the watchdog stops the machine
before a timeout occurs. So it's at least 12 sec. Or there's
no timeout at all.
If the Intel IOMMU doesn't enforce a timeout, you should probably amend
qi_submit_sync() to break out of the loop once the 1 min + 50% limit
is exceeded. And you need to amend the function to sleep instead of
polling in interrupt context.
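For instance (sketch only, 90 sec being the spec's 1 min + 50%):

	unsigned long timeout = jiffies + msecs_to_jiffies(90 * 1000);

	while (qi->desc_status[wait_index] != QI_DONE) {
		if (time_after(jiffies, timeout))
			return -ETIMEDOUT;	/* cut off at 1 min + 50% */
		cpu_relax();
	}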
Can you check with hardware engineers whether there's a timeout?
Thanks,
Lukas
On 12/22/2023 4:14 PM, Lukas Wunner wrote:
> On Fri, Dec 22, 2023 at 09:56:39AM +0800, Ethan Zhao wrote:
>> I don't know whether polling plus sleeping for the completion of a
>> meaningless devTLB invalidation request blindly sent to a
>> removed/powered-down/link-down device makes sense or not.
> If you have a way to get to the struct pci_dev * which you're waiting for
> in qi_submit_sync() then I guess you could check for its presence and bail
> out if it's gone, instead of issuing a cpu_relax().
One option to bail out of the loop.
>
>>> Again, the proposed patch is not a proper solution. It will paper over
>>> the issue most of the time but every once in a while someone will still
>>> get a hard lockup splat and it will then be more difficult to reproduce
>>> and fix if the proposed patch is accepted.
>> Could you point out why it is not proper? Is there any other window in
>> which the hard lockup could still happen in the ATS-capable device
>> surprise-removal case if we check the connection state first?
>> Please help elaborate.
> Even though user space may have initiated orderly removal via sysfs,
> the device may be yanked from the slot (surprise removed) while the
> orderly removal is happening.
Yes, just after the wait descriptor is submitted and before waiting in
the loop. The rare but worst case.
>
>
>> Yes, this is a stack trace from an old kernel, but the customer also tried
>> the latest 6.7-rc4
> If you could provide a stacktrace for a contemporary kernel,
> I think that would be preferred.
The customer tried, but they didn't provide me the latest trace.
>
>
>> (doesn't work) and the patched 6.7-rc4 (fixed).
> Why is it fixed in v6.7-rc4? Is the present patch thus unnecessary?
It is not fixed in v6.7-rc4 itself; with this patch applied, they said
the unplug works.
>
>>> Finally, it is common to adhere to terms
>>> used in the PCIe Base Spec in commit messages, so "ATC Invalidate Request"
>>> might be preferable to "devTLB flush request".
>> ATS Invalidate Request ? devTLB flush request has the same meaning,
>>
>> I thought all iommu/PCIe guys could understand.
> I'm just pointing out the preferred way to write commit messages
> in the PCI subsystem (as I've perceived it over the years) so that
> you can reduce the number of iterations you have to go through
> due to maintainer feedback. I'm just trying to be helpful.
>
Understand.
>> How do we define the point, some msec, at which software times out and
>> breaks out of the waiting loop? Or do we poll whether the target is gone?
> I'd say adhere to the 1 min + 50% number provided in the spec.
>
> If you know the device is gone before that then you can break out
> of the loop in qi_submit_sync() of course.
I am trying to find a way to break out of the loop in qi_submit_sync()
by checking the device state there, but that doesn't look clean in this
low-level IOMMU code, and some interfaces would need to be modified. It
would cost many more hours to make this rare case work. To be complete:
1. check the PCI device state in the loop
2. modify the invalidation descriptor status in the
   pciehp_ist()->intel_iommu_release_device() call (see the sketch below).
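For step 2, something very rough (untested; QI_LENGTH, QI_IN_USE and
QI_ABORT are existing driver names, and locking against the submitter is
glossed over):

	static void qi_abort_pending(struct intel_iommu *iommu)
	{
		struct q_inval *qi = iommu->qi;
		unsigned long flags;
		int i;

		raw_spin_lock_irqsave(&qi->q_lock, flags);
		for (i = 0; i < QI_LENGTH; i++)
			if (qi->desc_status[i] == QI_IN_USE)
				qi->desc_status[i] = QI_ABORT; /* let the poller bail out */
		raw_spin_unlock_irqrestore(&qi->q_lock, flags);
	}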
>
> The question is, does the Intel IOMMU have a timeout at all for
> Invalidate Requests? I guess we don't really know that because
> in the stack trace you've provided, the watchdog stops the machine
> before a timeout occurs. So it's at least 12 sec. Or there's
> no timeout at all.
The call trace wouldn't tell us whether there really is a 1 min + 50%
timeout or not; even if there is, it is meaningless.
> If the Intel IOMMU doesn't enforce a timeout, you should probably amend
> qi_submit_sync() to break out of the loop once the 1 min + 50% limit
> is exceeded. And you need to amend the function to sleep instead of
> polling in interrupt context.
Too many paths call this function, and revising it to be non-synchronous
would impact too many things.
>
> Can you check with hardware engineers whether there's a timeout?
Combined with third-party PCIe switch chips?
Thanks,
Ethan
>
> Thanks,
>
> Lukas
>