This patchset is used to fix vt-d hard lockup reported when surpprise
unplug ATS capable endpoint device connects to system via PCIe switch
as following topology.
+-[0000:15]-+-00.0 Intel Corporation Ice Lake Memory Map/VT-d
| +-00.1 Intel Corporation Ice Lake Mesh 2 PCIe
| +-00.2 Intel Corporation Ice Lake RAS
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0
NVIDIA Corporation Device 2324
| +-01.0-[19]----00.0
Mellanox Technologies MT2910 Family [ConnectX-7]
User brought endpoint device 19:00.0's link down by flapping it's hotplug
capable slot 17:01.0 link control register, as sequence DLLSC response,
pciehp_ist() will unload device driver and power it off, durning device
driver is unloading an iommu devTlb flush request issued to that link
down device, thus a long time completion/timeout waiting in interrupt
context causes continuous hard lockup warnning and system hang.
[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629] intel_iommu_release_device+0x1f/0x30
[ 4223.822629] iommu_release_device+0x33/0x60
[ 4223.822629] iommu_bus_notifier+0x7f/0x90
[ 4223.822630] blocking_notifier_call_chain+0x60/0x90
[ 4223.822630] device_del+0x2e5/0x420
[ 4223.822630] pci_remove_bus_device+0x70/0x110
[ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631] pciehp_disable_slot+0x6b/0x100
[ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631] pciehp_ist+0x176/0x180
[ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632] irq_thread_fn+0x19/0x50
[ 4223.822632] irq_thread+0x104/0x190
[ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633] kthread+0x114/0x130
[ 4223.822633] ? __kthread_cancel_work+0x40/0x40
[ 4223.822633] ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634] <NMI>
[ 4223.822635] dump_stack+0x6d/0x88
[ 4223.822635] panic+0x101/0x2d0
[ 4223.822635] ? ret_from_fork+0x11/0x30
[ 4223.822635] nmi_panic.cold.14+0xc/0xc
[ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636] __perf_event_overflow+0x4f/0xf0
[ 4223.822636] handle_pmi_common+0x1ef/0x290
[ 4223.822636] ? __set_pte_vaddr+0x28/0x40
[ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637] ? __native_set_fixmap+0x24/0x30
[ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638] perf_event_nmi_handler+0x24/0x40
[ 4223.822638] nmi_handle+0x4d/0xf0
[ 4223.822638] default_do_nmi+0x49/0x100
[ 4223.822638] exc_nmi+0x134/0x180
[ 4223.822639] end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] </NMI>
[ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643] intel_iommu_release_device+0x1f/0x30
[ 4223.822643] iommu_release_device+0x33/0x60
[ 4223.822643] iommu_bus_notifier+0x7f/0x90
[ 4223.822644] blocking_notifier_call_chain+0x60/0x90
[ 4223.822644] device_del+0x2e5/0x420
[ 4223.822644] pci_remove_bus_device+0x70/0x110
[ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644] pciehp_disable_slot+0x6b/0x100
[ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645] pciehp_ist+0x176/0x180
[ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645] irq_thread_fn+0x19/0x50
[ 4223.822646] irq_thread+0x104/0x190
[ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646] kthread+0x114/0x130
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)
Make a quick fix by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
request to link down device that is set to pci_channel_io_perm_failure and
then powered off in
pciehp_ist()
pciehp_handle_presence_or_link_change()
pciehp_disable_slot()
remove_board()
pciehp_unconfigure_device()
safe_removal unplug doesn't trigger such issue.
and this fix works for all supprise_removal unplug operation.
patchset [patch 1&2] was tested by [email protected] on stable-6.7rc4.
patch[3&4] only passed compiling on stable-6.7rc4, not test yet.
change log:
v6:
- add two patches to break out devTLB invalidation if device is gone.
v5:
- add a patch try to fix the rare case (supprise-remove a device in
safe_removal process). not work because supprise_removal handling coun't
re-enter when another safe_removal is in-process.
v4:
- move the PCI device state checking after ATS per Baolu's suggestion.
v3:
- fix commit description typo.
v2:
- revise commit[1] description part according to Lukas' suggestion.
- revise commit[2] description to clarify the issue's impact.
v1:
- https://lore.kernel.org/lkml/[email protected]/T/
Thanks,
Ethan
Ethan Zhao (4):
PCI: make pci_dev_is_disconnected() helper public for other drivers
iommu/vt-d: don's issue devTLB flush request when device is
disconnected
iommu/vt-d: add flush_target_dev member to struct intel_iommu and pass
device info to all needed functions
iommu/vt-d: break out devTLB invalidation if target device is gone
drivers/iommu/intel/dmar.c | 7 +++++++
drivers/iommu/intel/iommu.c | 1 +
drivers/iommu/intel/iommu.h | 2 ++
drivers/iommu/intel/pasid.c | 4 ++++
drivers/iommu/intel/svm.c | 1 +
drivers/pci/pci.h | 5 -----
include/linux/pci.h | 5 +++++
7 files changed, 20 insertions(+), 5 deletions(-)
--
2.31.1
move pci_dev_is_disconnected() from driver/pci/pci.h to public
include/linux/pci.h for other driver's reference.
no function change.
Tested-by: Haorong Ye <[email protected]>
Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/pci/pci.h | 5 -----
include/linux/pci.h | 5 +++++
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..75fa2084492f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -366,11 +366,6 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
return 0;
}
-static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
-{
- return dev->error_state == pci_channel_io_perm_failure;
-}
-
/* pci_dev priv_flags */
#define PCI_DEV_ADDED 0
#define PCI_DPC_RECOVERED 1
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 60ca768bc867..869f2ec97a84 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2503,6 +2503,11 @@ static inline struct pci_dev *pcie_find_root_port(struct pci_dev *dev)
return NULL;
}
+static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
+{
+ return dev->error_state == pci_channel_io_perm_failure;
+}
+
void pci_request_acs(void);
bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
--
2.31.1
For those endpoint devices connect to system via hotplug capable ports,
users could request a warm reset to the device by flapping device's link
through setting the slot's link control register, as pciehpt_ist() DLLSC
interrupt sequence response, pciehp will unload the device driver and
then power it off. thus cause an IOMMU devTLB flush request for device to
be sent and a long time completion/timeout waiting in interrupt context.
That would cause following continuous hard lockup warning and system hang
[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629] intel_iommu_release_device+0x1f/0x30
[ 4223.822629] iommu_release_device+0x33/0x60
[ 4223.822629] iommu_bus_notifier+0x7f/0x90
[ 4223.822630] blocking_notifier_call_chain+0x60/0x90
[ 4223.822630] device_del+0x2e5/0x420
[ 4223.822630] pci_remove_bus_device+0x70/0x110
[ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631] pciehp_disable_slot+0x6b/0x100
[ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631] pciehp_ist+0x176/0x180
[ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632] irq_thread_fn+0x19/0x50
[ 4223.822632] irq_thread+0x104/0x190
[ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633] kthread+0x114/0x130
[ 4223.822633] ? __kthread_cancel_work+0x40/0x40
[ 4223.822633] ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634] <NMI>
[ 4223.822635] dump_stack+0x6d/0x88
[ 4223.822635] panic+0x101/0x2d0
[ 4223.822635] ? ret_from_fork+0x11/0x30
[ 4223.822635] nmi_panic.cold.14+0xc/0xc
[ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636] __perf_event_overflow+0x4f/0xf0
[ 4223.822636] handle_pmi_common+0x1ef/0x290
[ 4223.822636] ? __set_pte_vaddr+0x28/0x40
[ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637] ? __native_set_fixmap+0x24/0x30
[ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638] perf_event_nmi_handler+0x24/0x40
[ 4223.822638] nmi_handle+0x4d/0xf0
[ 4223.822638] default_do_nmi+0x49/0x100
[ 4223.822638] exc_nmi+0x134/0x180
[ 4223.822639] end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] </NMI>
[ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643] intel_iommu_release_device+0x1f/0x30
[ 4223.822643] iommu_release_device+0x33/0x60
[ 4223.822643] iommu_bus_notifier+0x7f/0x90
[ 4223.822644] blocking_notifier_call_chain+0x60/0x90
[ 4223.822644] device_del+0x2e5/0x420
[ 4223.822644] pci_remove_bus_device+0x70/0x110
[ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644] pciehp_disable_slot+0x6b/0x100
[ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645] pciehp_ist+0x176/0x180
[ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645] irq_thread_fn+0x19/0x50
[ 4223.822646] irq_thread+0x104/0x190
[ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646] kthread+0x114/0x130
[ 4223.822647] ? __kthread_cancel_work+0x40/0x40
[ 4223.822647] ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)
Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
request to link down device that is set to pci_channel_io_perm_failure and
then powered off in
pciehp_ist()
pciehp_handle_presence_or_link_change()
pciehp_disable_slot()
remove_board()
pciehp_unconfigure_device()
For SAVE_REMOVAL unplug, link is alive when iommu releases devcie and
issues devTLB invalidate request, wouldn't trigger such issue.
This patch works for all links of SURPPRISE_REMOVAL unplug operations.
Tested-by: Haorong Ye <[email protected]>
Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/iommu/intel/pasid.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..7dbee9931eb6 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
if (!info || !info->ats_enabled)
return;
+ if (pci_dev_is_disconnected(to_pci_dev(dev)))
+ return;
+
sid = info->bus << 8 | info->devfn;
qdep = info->ats_qdep;
pfsid = info->pfsid;
--
2.31.1
As iommu is a pointer member of device_domain_info, so can't play
trick like container_of() to get the info and device instance for
qi_submit_sync() low level function usage, add a flush_target_dev
member to struct inte_iommu and pass dev info to all devTLB flush
functions.
Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/iommu/intel/iommu.c | 1 +
drivers/iommu/intel/iommu.h | 2 ++
drivers/iommu/intel/pasid.c | 1 +
drivers/iommu/intel/svm.c | 1 +
4 files changed, 5 insertions(+)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 897159dba47d..c3724f1d86dc 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1461,6 +1461,7 @@ static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
sid = info->bus << 8 | info->devfn;
qdep = info->ats_qdep;
+ info->iommu->flush_target_dev = info->dev;
qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
qdep, addr, mask);
quirk_extra_dev_tlb_flush(info, addr, mask, IOMMU_NO_PASID, qdep);
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index ce030c5b5772..6c7e67f0bf28 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -731,6 +731,8 @@ struct intel_iommu {
void *perf_statistic;
struct iommu_pmu *pmu;
+
+ struct device *flush_target_dev; /* the target device devTLB to be flushed. */
};
/* PCI domain-device relationship */
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 7dbee9931eb6..a08bdbec90eb 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -488,6 +488,7 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
qdep = info->ats_qdep;
pfsid = info->pfsid;
+ info->iommu->flush_target_dev = info->dev;
/*
* When PASID 0 is used, it indicates RID2PASID(DMA request w/o PASID),
* devTLB flush w/o PASID should be used. For non-zero PASID under
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index ac12f76c1212..d42a99801cdf 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -181,6 +181,7 @@ static void __flush_svm_range_dev(struct intel_svm *svm,
qi_flush_piotlb(sdev->iommu, sdev->did, svm->pasid, address, pages, ih);
if (info->ats_enabled) {
+ info->iommu->flush_target_dev = info->dev;
qi_flush_dev_iotlb_pasid(sdev->iommu, sdev->sid, info->pfsid,
svm->pasid, sdev->qdep, address,
order_base_2(pages));
--
2.31.1
to fix the rare case, the in-process safe_removal unpluged device could
be supprise_removed anytime, thus check the target device state if it
is gone, don't wait for the completion/timeout anymore. it might cause
hard lockup or system hang
Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/iommu/intel/dmar.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 23cb80d62a9a..7a273ee80c49 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1423,6 +1423,13 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);
while (qi->desc_status[wait_index] != QI_DONE) {
+ /*
+ * if the devTLB invalidation target device is gone, don't wait
+ * anymore, it might take up to 1min+50%, causes system hang.
+ */
+ if (type == QI_DIOTLB_TYPE && iommu->flush_target_dev)
+ if (!pci_device_is_present(to_pci_dev(iommu->flush_target_dev)))
+ break;
/*
* We will leave the interrupts disabled, to prevent interrupt
* context to queue another cmd while a cmd is already submitted
--
2.31.1
On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
> if (!info || !info->ats_enabled)
> return;
>
> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
> + return;
> +
> sid = info->bus << 8 | info->devfn;
> qdep = info->ats_qdep;
> pfsid = info->pfsid;
Do you even need this or is patch [4/4] sufficient?
Is there a benefit to the hunk above on top of patch [4/4]?
Thanks,
Lukas
On Sun, Dec 24, 2023 at 12:06:57AM -0500, Ethan Zhao wrote:
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -1423,6 +1423,13 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
> writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);
>
> while (qi->desc_status[wait_index] != QI_DONE) {
> + /*
> + * if the devTLB invalidation target device is gone, don't wait
> + * anymore, it might take up to 1min+50%, causes system hang.
> + */
> + if (type == QI_DIOTLB_TYPE && iommu->flush_target_dev)
> + if (!pci_device_is_present(to_pci_dev(iommu->flush_target_dev)))
> + break;
As a general approach, this is much better now.
Please combine the nested if-clauses into one.
Please amend the code comment with a spec reference, i.e.
"(see Implementation Note in PCIe r6.1 sec 10.3.1)"
so that readers of the code know where the magic number "1min+50%"
is coming from.
Is flush_target_dev guaranteed to always be a pci_dev?
I'll let iommu maintainers comment on whether storing a flush_target_dev
pointer is the right approach. (May store a back pointer from
struct intel_iommu to struct device_domain_info?)
Maybe move the "to_pci_dev(iommu->flush_target_dev)" lookup outside the
loop to avoid doing this over and over again?
I think we still have a problem here if the device is not removed
but simply takes a long time to respond to Invalidate Requests
(as it is permitted to do per the Implementation Note). We'll
busy-wait for the completion and potentially run into the watchdog's
time limit again. So I think you or someone else in your org should
add OKRs to refactor the code so that it sleeps in-between polling
for Invalidate Completions (instead of busy-waiting with interrupts
disabled).
Thanks,
Lukas
On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register, as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.
s/don's/don't/ (in subject)
s/pciehpt_ist/pciehp_ist/
IIUC you are referring to a specific PCIe transaction, so unless
there's another spec that defines "devTLB flush request", please use
the actual PCIe transaction name ("ATS Invalidate Request") as Lukas
suggested.
There's no point in using an informal name that we assume "all
iommu/PCIe guys could understand." It's better to use a term that
anybody can find by searching the spec.
> That would cause following continuous hard lockup warning and system hang
>
> [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
> [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
> [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
> OE kernel version xxxx
> [ 4223.822623] Hardware name: vendorname xxxx 666-106,
> BIOS 01.01.02.03.01 05/15/2023
> [ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
> [ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
> 0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
> [ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
> [ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
> [ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
> [ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
> [ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
> [ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
> [ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
> knlGS:0000000000000000
> [ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
> [ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 4223.822628] PKRU: 55555554
> [ 4223.822628] Call Trace:
> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
> [ 4223.822629] intel_iommu_release_device+0x1f/0x30
> [ 4223.822629] iommu_release_device+0x33/0x60
> [ 4223.822629] iommu_bus_notifier+0x7f/0x90
> [ 4223.822630] blocking_notifier_call_chain+0x60/0x90
> [ 4223.822630] device_del+0x2e5/0x420
> [ 4223.822630] pci_remove_bus_device+0x70/0x110
> [ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
> [ 4223.822631] pciehp_disable_slot+0x6b/0x100
> [ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
> [ 4223.822631] pciehp_ist+0x176/0x180
> [ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
> [ 4223.822632] irq_thread_fn+0x19/0x50
> [ 4223.822632] irq_thread+0x104/0x190
> [ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
> [ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
> [ 4223.822633] kthread+0x114/0x130
> [ 4223.822633] ? __kthread_cancel_work+0x40/0x40
> [ 4223.822633] ret_from_fork+0x1f/0x30
> [ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
> [ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
> OE kernel version xxxx
> [ 4223.822634] Hardware name: vendorname xxxx 666-106,
> BIOS 01.01.02.03.01 05/15/2023
> [ 4223.822634] Call Trace:
> [ 4223.822634] <NMI>
> [ 4223.822635] dump_stack+0x6d/0x88
> [ 4223.822635] panic+0x101/0x2d0
> [ 4223.822635] ? ret_from_fork+0x11/0x30
> [ 4223.822635] nmi_panic.cold.14+0xc/0xc
> [ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
> [ 4223.822636] __perf_event_overflow+0x4f/0xf0
> [ 4223.822636] handle_pmi_common+0x1ef/0x290
> [ 4223.822636] ? __set_pte_vaddr+0x28/0x40
> [ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
> [ 4223.822637] ? __native_set_fixmap+0x24/0x30
> [ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
> [ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
> [ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
> [ 4223.822638] perf_event_nmi_handler+0x24/0x40
> [ 4223.822638] nmi_handle+0x4d/0xf0
> [ 4223.822638] default_do_nmi+0x49/0x100
> [ 4223.822638] exc_nmi+0x134/0x180
> [ 4223.822639] end_repeat_nmi+0x16/0x67
> [ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
> [ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
> 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
> [ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
> [ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
> [ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
> [ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
> [ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
> [ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
> [ 4223.822641] ? qi_submit_sync+0x2c0/0x490
> [ 4223.822642] ? qi_submit_sync+0x2c0/0x490
> [ 4223.822642] </NMI>
> [ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
> [ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
> [ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
> [ 4223.822643] intel_iommu_release_device+0x1f/0x30
> [ 4223.822643] iommu_release_device+0x33/0x60
> [ 4223.822643] iommu_bus_notifier+0x7f/0x90
> [ 4223.822644] blocking_notifier_call_chain+0x60/0x90
> [ 4223.822644] device_del+0x2e5/0x420
> [ 4223.822644] pci_remove_bus_device+0x70/0x110
> [ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
> [ 4223.822644] pciehp_disable_slot+0x6b/0x100
> [ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
> [ 4223.822645] pciehp_ist+0x176/0x180
> [ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
> [ 4223.822645] irq_thread_fn+0x19/0x50
> [ 4223.822646] irq_thread+0x104/0x190
> [ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
> [ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
> [ 4223.822646] kthread+0x114/0x130
> [ 4223.822647] ? __kthread_cancel_work+0x40/0x40
> [ 4223.822647] ret_from_fork+0x1f/0x30
> [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
> range: 0xffffffff80000000-0xffffffffbfffffff)
The timestamps don't help understand the problem, so you could remove
them so they aren't a distraction.
> Fix it by checking the device's error_state in
> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
> request to link down device that is set to pci_channel_io_perm_failure and
> then powered off in
A pci_dev_is_disconnected() is racy in this context, so this by itself
doesn't look like a complete "fix".
> pciehp_ist()
> pciehp_handle_presence_or_link_change()
> pciehp_disable_slot()
> remove_board()
> pciehp_unconfigure_device()
There are some interesting steps missing here between
pciehp_unconfigure_device() and devtlb_invalidation_with_pasid().
devtlb_invalidation_with_pasid() is Intel-specific. ATS Invalidate
Requests are not Intel-specific, so all IOMMU drivers must have to
deal with the case of an ATS Invalidate Request where we never receive
a corresponding ATS Invalidate Completion. Do other IOMMUs like AMD
and ARM have a similar issue?
> For SAVE_REMOVAL unplug, link is alive when iommu releases devcie and
> issues devTLB invalidate request, wouldn't trigger such issue.
>
> This patch works for all links of SURPPRISE_REMOVAL unplug operations.
s/devcie/device/
Writing "SAVE_REMOVAL" and "SURPPRISE_REMOVAL" in all caps with an
underscore makes them look like identifiers. But neither appears in
the kernel source. Write them as normal English words, e.g., "save
removal" instead (though I suspect you mean "safe removal"?).
s/surpprise/surprise/
It's not completely obvious that a fix that works for the safe removal
case also works for the surprise removal case. Can you briefly
explain why it does?
> Tested-by: Haorong Ye <[email protected]>
> Signed-off-by: Ethan Zhao <[email protected]>
> ---
> drivers/iommu/intel/pasid.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 74e8e4c17e81..7dbee9931eb6 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
> if (!info || !info->ats_enabled)
> return;
>
> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
> + return;
> +
> sid = info->bus << 8 | info->devfn;
> qdep = info->ats_qdep;
> pfsid = info->pfsid;
This goes on to call qi_submit_sync(), which contains a restart: loop.
I don't know the programming model there, but it looks possible that
qi_submit_sync() and qi_check_fault() might not handle the case of an
unreachable device correctly. There should be a way to exit that
restart: loop in cases where the device doesn't respond at all.
Bjorn
On 12/24/2023 6:32 PM, Lukas Wunner wrote:
> On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
>> if (!info || !info->ats_enabled)
>> return;
>>
>> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
>> + return;
>> +
>> sid = info->bus << 8 | info->devfn;
>> qdep = info->ats_qdep;
>> pfsid = info->pfsid;
> Do you even need this or is patch [4/4] sufficient?
> Is there a benefit to the hunk above on top of patch [4/4]?
Need this, I don't want to access config space here, check the
flag here is enough, but patch [4/] needs to know if the device
is gone by reading device vendor info.
Thanks,
Ethan
> Thanks,
>
> Lukas
On 12/24/2023 6:47 PM, Lukas Wunner wrote:
> On Sun, Dec 24, 2023 at 12:06:57AM -0500, Ethan Zhao wrote:
>> --- a/drivers/iommu/intel/dmar.c
>> +++ b/drivers/iommu/intel/dmar.c
>> @@ -1423,6 +1423,13 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
>> writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);
>>
>> while (qi->desc_status[wait_index] != QI_DONE) {
>> + /*
>> + * if the devTLB invalidation target device is gone, don't wait
>> + * anymore, it might take up to 1min+50%, causes system hang.
>> + */
>> + if (type == QI_DIOTLB_TYPE && iommu->flush_target_dev)
>> + if (!pci_device_is_present(to_pci_dev(iommu->flush_target_dev)))
>> + break;
> As a general approach, this is much better now.
>
> Please combine the nested if-clauses into one.
That would be harder to read ?
> Please amend the code comment with a spec reference, i.e.
> "(see Implementation Note in PCIe r6.1 sec 10.3.1)"
> so that readers of the code know where the magic number "1min+50%"
> is coming from.
Yup.
>
> Is flush_target_dev guaranteed to always be a pci_dev?
yes, as Baolu said, only PCI and ATS capable device supports
devTLB invalidation operation, this is checked by its caller path.
>
> I'll let iommu maintainers comment on whether storing a flush_target_dev
> pointer is the right approach. (May store a back pointer from
> struct intel_iommu to struct device_domain_info?)
One of them, wonder which one is better, but device_domain_info
is still per device...seems no good to back it there.
>
> Maybe move the "to_pci_dev(iommu->flush_target_dev)" lookup outside the
> loop to avoid doing this over and over again?
hmm. that is a macro renam of container_of(), exactly, doesn't matter.
right ?
>
> I think we still have a problem here if the device is not removed
> but simply takes a long time to respond to Invalidate Requests
> (as it is permitted to do per the Implementation Note). We'll
> busy-wait for the completion and potentially run into the watchdog's
> time limit again. So I think you or someone else in your org should
> add OKRs to refactor the code so that it sleeps in-between polling
refactor code would be long story, so far still a quick fix for the issue.
and I think developers have other justifiction or conern about the
non-sync version, once again, thanks for your comment.
regards,
Ethan
> for Invalidate Completions (instead of busy-waiting with interrupts
> disabled).
>
> Thanks,
>
> Lukas
On 12/25/2023 6:43 AM, Bjorn Helgaas wrote:
> On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> s/don's/don't/ (in subject)
> s/pciehpt_ist/pciehp_ist/
>
> IIUC you are referring to a specific PCIe transaction, so unless
> there's another spec that defines "devTLB flush request", please use
> the actual PCIe transaction name ("ATS Invalidate Request") as Lukas
> suggested.
>
> There's no point in using an informal name that we assume "all
> iommu/PCIe guys could understand." It's better to use a term that
> anybody can find by searching the spec.
agree. will revise them together.
Thanks,
Ethan
>
>> That would cause following continuous hard lockup warning and system hang
>>
>> [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
>> [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
>> [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>> OE kernel version xxxx
>> [ 4223.822623] Hardware name: vendorname xxxx 666-106,
>> BIOS 01.01.02.03.01 05/15/2023
>> [ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
>> [ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
>> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
>> 0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>> [ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>> [ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
>> [ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
>> [ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
>> [ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
>> [ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
>> [ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
>> knlGS:0000000000000000
>> [ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
>> [ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [ 4223.822628] PKRU: 55555554
>> [ 4223.822628] Call Trace:
>> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
>> [ 4223.822629] intel_iommu_release_device+0x1f/0x30
>> [ 4223.822629] iommu_release_device+0x33/0x60
>> [ 4223.822629] iommu_bus_notifier+0x7f/0x90
>> [ 4223.822630] blocking_notifier_call_chain+0x60/0x90
>> [ 4223.822630] device_del+0x2e5/0x420
>> [ 4223.822630] pci_remove_bus_device+0x70/0x110
>> [ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
>> [ 4223.822631] pciehp_disable_slot+0x6b/0x100
>> [ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
>> [ 4223.822631] pciehp_ist+0x176/0x180
>> [ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
>> [ 4223.822632] irq_thread_fn+0x19/0x50
>> [ 4223.822632] irq_thread+0x104/0x190
>> [ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
>> [ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
>> [ 4223.822633] kthread+0x114/0x130
>> [ 4223.822633] ? __kthread_cancel_work+0x40/0x40
>> [ 4223.822633] ret_from_fork+0x1f/0x30
>> [ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
>> [ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>> OE kernel version xxxx
>> [ 4223.822634] Hardware name: vendorname xxxx 666-106,
>> BIOS 01.01.02.03.01 05/15/2023
>> [ 4223.822634] Call Trace:
>> [ 4223.822634] <NMI>
>> [ 4223.822635] dump_stack+0x6d/0x88
>> [ 4223.822635] panic+0x101/0x2d0
>> [ 4223.822635] ? ret_from_fork+0x11/0x30
>> [ 4223.822635] nmi_panic.cold.14+0xc/0xc
>> [ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
>> [ 4223.822636] __perf_event_overflow+0x4f/0xf0
>> [ 4223.822636] handle_pmi_common+0x1ef/0x290
>> [ 4223.822636] ? __set_pte_vaddr+0x28/0x40
>> [ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
>> [ 4223.822637] ? __native_set_fixmap+0x24/0x30
>> [ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
>> [ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
>> [ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
>> [ 4223.822638] perf_event_nmi_handler+0x24/0x40
>> [ 4223.822638] nmi_handle+0x4d/0xf0
>> [ 4223.822638] default_do_nmi+0x49/0x100
>> [ 4223.822638] exc_nmi+0x134/0x180
>> [ 4223.822639] end_repeat_nmi+0x16/0x67
>> [ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
>> [ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
>> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
>> 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>> [ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>> [ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
>> [ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
>> [ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
>> [ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
>> [ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
>> [ 4223.822641] ? qi_submit_sync+0x2c0/0x490
>> [ 4223.822642] ? qi_submit_sync+0x2c0/0x490
>> [ 4223.822642] </NMI>
>> [ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
>> [ 4223.822643] intel_iommu_release_device+0x1f/0x30
>> [ 4223.822643] iommu_release_device+0x33/0x60
>> [ 4223.822643] iommu_bus_notifier+0x7f/0x90
>> [ 4223.822644] blocking_notifier_call_chain+0x60/0x90
>> [ 4223.822644] device_del+0x2e5/0x420
>> [ 4223.822644] pci_remove_bus_device+0x70/0x110
>> [ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
>> [ 4223.822644] pciehp_disable_slot+0x6b/0x100
>> [ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
>> [ 4223.822645] pciehp_ist+0x176/0x180
>> [ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
>> [ 4223.822645] irq_thread_fn+0x19/0x50
>> [ 4223.822646] irq_thread+0x104/0x190
>> [ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
>> [ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
>> [ 4223.822646] kthread+0x114/0x130
>> [ 4223.822647] ? __kthread_cancel_work+0x40/0x40
>> [ 4223.822647] ret_from_fork+0x1f/0x30
>> [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
>> range: 0xffffffff80000000-0xffffffffbfffffff)
> The timestamps don't help understand the problem, so you could remove
> them so they aren't a distraction.
>
>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
> A pci_dev_is_disconnected() is racy in this context, so this by itself
> doesn't look like a complete "fix".
>
>> pciehp_ist()
>> pciehp_handle_presence_or_link_change()
>> pciehp_disable_slot()
>> remove_board()
>> pciehp_unconfigure_device()
> There are some interesting steps missing here between
> pciehp_unconfigure_device() and devtlb_invalidation_with_pasid().
>
> devtlb_invalidation_with_pasid() is Intel-specific. ATS Invalidate
> Requests are not Intel-specific, so all IOMMU drivers must have to
> deal with the case of an ATS Invalidate Request where we never receive
> a corresponding ATS Invalidate Completion. Do other IOMMUs like AMD
> and ARM have a similar issue?
>
>> For SAVE_REMOVAL unplug, link is alive when iommu releases devcie and
>> issues devTLB invalidate request, wouldn't trigger such issue.
>>
>> This patch works for all links of SURPPRISE_REMOVAL unplug operations.
> s/devcie/device/
>
> Writing "SAVE_REMOVAL" and "SURPPRISE_REMOVAL" in all caps with an
> underscore makes them look like identifiers. But neither appears in
> the kernel source. Write them as normal English words, e.g., "save
> removal" instead (though I suspect you mean "safe removal"?).
>
> s/surpprise/surprise/
>
> It's not completely obvious that a fix that works for the safe removal
> case also works for the surprise removal case. Can you briefly
> explain why it does?
>
>> Tested-by: Haorong Ye <[email protected]>
>> Signed-off-by: Ethan Zhao <[email protected]>
>> ---
>> drivers/iommu/intel/pasid.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>> index 74e8e4c17e81..7dbee9931eb6 100644
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
>> if (!info || !info->ats_enabled)
>> return;
>>
>> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
>> + return;
>> +
>> sid = info->bus << 8 | info->devfn;
>> qdep = info->ats_qdep;
>> pfsid = info->pfsid;
> This goes on to call qi_submit_sync(), which contains a restart: loop.
> I don't know the programming model there, but it looks possible that
> qi_submit_sync() and qi_check_fault() might not handle the case of an
> unreachable device correctly. There should be a way to exit that
> restart: loop in cases where the device doesn't respond at all.
>
> Bjorn
On 12/25/2023 6:43 AM, Bjorn Helgaas wrote:
> On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> s/don's/don't/ (in subject)
> s/pciehpt_ist/pciehp_ist/
>
> IIUC you are referring to a specific PCIe transaction, so unless
> there's another spec that defines "devTLB flush request", please use
> the actual PCIe transaction name ("ATS Invalidate Request") as Lukas
> suggested.
>
> There's no point in using an informal name that we assume "all
> iommu/PCIe guys could understand." It's better to use a term that
> anybody can find by searching the spec.
>
>> That would cause following continuous hard lockup warning and system hang
>>
>> [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
>> [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
>> [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>> OE kernel version xxxx
>> [ 4223.822623] Hardware name: vendorname xxxx 666-106,
>> BIOS 01.01.02.03.01 05/15/2023
>> [ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
>> [ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
>> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
>> 0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>> [ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>> [ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
>> [ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
>> [ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
>> [ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
>> [ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
>> [ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
>> knlGS:0000000000000000
>> [ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
>> [ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [ 4223.822628] PKRU: 55555554
>> [ 4223.822628] Call Trace:
>> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
>> [ 4223.822629] intel_iommu_release_device+0x1f/0x30
>> [ 4223.822629] iommu_release_device+0x33/0x60
>> [ 4223.822629] iommu_bus_notifier+0x7f/0x90
>> [ 4223.822630] blocking_notifier_call_chain+0x60/0x90
>> [ 4223.822630] device_del+0x2e5/0x420
>> [ 4223.822630] pci_remove_bus_device+0x70/0x110
>> [ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
>> [ 4223.822631] pciehp_disable_slot+0x6b/0x100
>> [ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
>> [ 4223.822631] pciehp_ist+0x176/0x180
>> [ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
>> [ 4223.822632] irq_thread_fn+0x19/0x50
>> [ 4223.822632] irq_thread+0x104/0x190
>> [ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
>> [ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
>> [ 4223.822633] kthread+0x114/0x130
>> [ 4223.822633] ? __kthread_cancel_work+0x40/0x40
>> [ 4223.822633] ret_from_fork+0x1f/0x30
>> [ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
>> [ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>> OE kernel version xxxx
>> [ 4223.822634] Hardware name: vendorname xxxx 666-106,
>> BIOS 01.01.02.03.01 05/15/2023
>> [ 4223.822634] Call Trace:
>> [ 4223.822634] <NMI>
>> [ 4223.822635] dump_stack+0x6d/0x88
>> [ 4223.822635] panic+0x101/0x2d0
>> [ 4223.822635] ? ret_from_fork+0x11/0x30
>> [ 4223.822635] nmi_panic.cold.14+0xc/0xc
>> [ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
>> [ 4223.822636] __perf_event_overflow+0x4f/0xf0
>> [ 4223.822636] handle_pmi_common+0x1ef/0x290
>> [ 4223.822636] ? __set_pte_vaddr+0x28/0x40
>> [ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
>> [ 4223.822637] ? __native_set_fixmap+0x24/0x30
>> [ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
>> [ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
>> [ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
>> [ 4223.822638] perf_event_nmi_handler+0x24/0x40
>> [ 4223.822638] nmi_handle+0x4d/0xf0
>> [ 4223.822638] default_do_nmi+0x49/0x100
>> [ 4223.822638] exc_nmi+0x134/0x180
>> [ 4223.822639] end_repeat_nmi+0x16/0x67
>> [ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
>> [ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
>> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
>> 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>> [ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>> [ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
>> [ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
>> [ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
>> [ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
>> [ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
>> [ 4223.822641] ? qi_submit_sync+0x2c0/0x490
>> [ 4223.822642] ? qi_submit_sync+0x2c0/0x490
>> [ 4223.822642] </NMI>
>> [ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
>> [ 4223.822643] intel_iommu_release_device+0x1f/0x30
>> [ 4223.822643] iommu_release_device+0x33/0x60
>> [ 4223.822643] iommu_bus_notifier+0x7f/0x90
>> [ 4223.822644] blocking_notifier_call_chain+0x60/0x90
>> [ 4223.822644] device_del+0x2e5/0x420
>> [ 4223.822644] pci_remove_bus_device+0x70/0x110
>> [ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
>> [ 4223.822644] pciehp_disable_slot+0x6b/0x100
>> [ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
>> [ 4223.822645] pciehp_ist+0x176/0x180
>> [ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
>> [ 4223.822645] irq_thread_fn+0x19/0x50
>> [ 4223.822646] irq_thread+0x104/0x190
>> [ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
>> [ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
>> [ 4223.822646] kthread+0x114/0x130
>> [ 4223.822647] ? __kthread_cancel_work+0x40/0x40
>> [ 4223.822647] ret_from_fork+0x1f/0x30
>> [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
>> range: 0xffffffff80000000-0xffffffffbfffffff)
> The timestamps don't help understand the problem, so you could remove
> them so they aren't a distraction.
Lukas said he see the qi_submit_sync takes up to 12 seconds to trigger the
watchdog.
>
>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
> A pci_dev_is_disconnected() is racy in this context, so this by itself
> doesn't look like a complete "fix".
A quick workaround.
>> pciehp_ist()
>> pciehp_handle_presence_or_link_change()
>> pciehp_disable_slot()
>> remove_board()
>> pciehp_unconfigure_device()
> There are some interesting steps missing here between
> pciehp_unconfigure_device() and devtlb_invalidation_with_pasid().
>
> devtlb_invalidation_with_pasid() is Intel-specific. ATS Invalidate
> Requests are not Intel-specific, so all IOMMU drivers must have to
> deal with the case of an ATS Invalidate Request where we never receive
> a corresponding ATS Invalidate Completion. Do other IOMMUs like AMD
> and ARM have a similar issue?
So far fix it in Intel vt-d specific path.
>> For SAVE_REMOVAL unplug, link is alive when iommu releases devcie and
>> issues devTLB invalidate request, wouldn't trigger such issue.
>>
>> This patch works for all links of SURPPRISE_REMOVAL unplug operations.
> s/devcie/device/
got it.
>
> Writing "SAVE_REMOVAL" and "SURPPRISE_REMOVAL" in all caps with an
> underscore makes them look like identifiers. But neither appears in
> the kernel source. Write them as normal English words, e.g., "save
> removal" instead (though I suspect you mean "safe removal"?).
>
> s/surpprise/surprise/
>
> It's not completely obvious that a fix that works for the safe removal
> case also works for the surprise removal case. Can you briefly
> explain why it does?
As I explained to baolu.
For safe_removal, device wouldn't be removed till the whole software
handling process done, so without this fix, it wouldn't trigger the lockup
issue, and in safe_removal path, device state isn't set to
pci_channel_io_perm_failure in pciehp_unconfigure_device() by checking
'presence', patch calling this pci_dev_is_disconnected() will return false
there, wouldn't break the function. so it works.
For suprise_removal, device state is set to pci_channel_io_perm_failure in
pciehp_unconfigure_device(), means device already be in power-off/link-down
/removed state, callpci_dev_is_disconnected() hrere will return true to
break
the function not to send ATS invalidation request anymore, thus avoid the
further long time waiting trigger the hard lockup.
Do I make it clear enough ?
>> Tested-by: Haorong Ye <[email protected]>
>> Signed-off-by: Ethan Zhao <[email protected]>
>> ---
>> drivers/iommu/intel/pasid.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>> index 74e8e4c17e81..7dbee9931eb6 100644
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
>> if (!info || !info->ats_enabled)
>> return;
>>
>> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
>> + return;
>> +
>> sid = info->bus << 8 | info->devfn;
>> qdep = info->ats_qdep;
>> pfsid = info->pfsid;
> This goes on to call qi_submit_sync(), which contains a restart: loop.
> I don't know the programming model there, but it looks possible that
> qi_submit_sync() and qi_check_fault() might not handle the case of an
> unreachable device correctly. There should be a way to exit that
> restart: loop in cases where the device doesn't respond at all.
Yes, fix it in patch[4/4] to break it out when device is gone.
Thanks,
Ethan
>
> Bjorn
On 12/24/2023 6:32 PM, Lukas Wunner wrote:
> On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
>> if (!info || !info->ats_enabled)
>> return;
>>
>> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
>> + return;
>> +
>> sid = info->bus << 8 | info->devfn;
>> qdep = info->ats_qdep;
>> pfsid = info->pfsid;
> Do you even need this or is patch [4/4] sufficient?
> Is there a benefit to the hunk above on top of patch [4/4]?
this is enough for purely surprise_removal or safe_removal,
it is better to not send "ATS invalidation request" than sent,
then check device state later.
Thanks,
Ethan
> Thanks,
>
> Lukas
>
On Mon, Dec 25, 2023 at 09:46:26AM +0800, Ethan Zhao wrote:
> On 12/25/2023 6:43 AM, Bjorn Helgaas wrote:
> > On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
> > > ...
> > > [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
> > > [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
> > > [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
> > > ...
> > > [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
> > > range: 0xffffffff80000000-0xffffffffbfffffff)
> >
> > The timestamps don't help understand the problem, so you could remove
> > them so they aren't a distraction.
>
> Lukas said he see the qi_submit_sync takes up to 12 seconds to trigger the
> watchdog.
OK, so the timestamps told us how long the watchdog tolerates. I
don't know how useful that is. I suspect that's not a fixed interval
(probably differs by watchdog and possibly user preference).
> > > Fix it by checking the device's error_state in
> > > devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
> > > request to link down device that is set to pci_channel_io_perm_failure and
> > > then powered off in
> >
> > A pci_dev_is_disconnected() is racy in this context, so this by itself
> > doesn't look like a complete "fix".
>
> A quick workaround.
Call it a "quick workaround" then, not a "fix". I'm personally not
usually interested in quick workarounds that are not actually fixes,
but the IOMMU folks would be the ones to decide.
Maybe this is more of an optimization? If patch 4/4 is a real fix (in
the sense that if you disable the watchdog, you would get correct
results after a long timeout), maybe you could reorder the patches so
4/4 comes first, and this one becomes an optimization on top of it? I
haven't worked though the whole path, so I don't know exactly how
these patches work.
> > > pciehp_ist()
> > > pciehp_handle_presence_or_link_change()
> > > pciehp_disable_slot()
> > > remove_board()
> > > pciehp_unconfigure_device()
> > There are some interesting steps missing here between
> > pciehp_unconfigure_device() and devtlb_invalidation_with_pasid().
> >
> > devtlb_invalidation_with_pasid() is Intel-specific. ATS Invalidate
> > Requests are not Intel-specific, so all IOMMU drivers must have to
> > deal with the case of an ATS Invalidate Request where we never receive
> > a corresponding ATS Invalidate Completion. Do other IOMMUs like AMD
> > and ARM have a similar issue?
>
> So far fix it in Intel vt-d specific path.
If the other IOMMU drivers are vulnerable, I guess they would like to
fix this at the same time and in a similar way if possible.
> > > For SAVE_REMOVAL unplug, link is alive when iommu releases devcie and
> > > issues devTLB invalidate request, wouldn't trigger such issue.
> > >
> > > This patch works for all links of SURPPRISE_REMOVAL unplug operations.
> > It's not completely obvious that a fix that works for the safe removal
> > case also works for the surprise removal case. Can you briefly
> > explain why it does?
>
> As I explained to baolu.
>
> For safe_removal, device wouldn't be removed till the whole software
> handling process done, so without this fix, it wouldn't trigger the lockup
> issue, and in safe_removal path, device state isn't set to
> pci_channel_io_perm_failure in pciehp_unconfigure_device() by checking
> 'presence', patch calling this pci_dev_is_disconnected() will return false
> there, wouldn't break the function. so it works.
>
> For suprise_removal, device state is set to pci_channel_io_perm_failure in
> pciehp_unconfigure_device(), means device already be in power-off/link-down
> /removed state, callpci_dev_is_disconnected() hrere will return true to
> break
>
> the function not to send ATS invalidation request anymore, thus avoid the
> further long time waiting trigger the hard lockup.
s/safe_removal/safe removal/ (they are not a single word)
s/suprise_removal/surprise removal/ (misspelled, also not a single word)
> Do I make it clear enough ?
Needs to be in the commit log, of course.
> > > Tested-by: Haorong Ye <[email protected]>
> > > Signed-off-by: Ethan Zhao <[email protected]>
> > > ---
> > > drivers/iommu/intel/pasid.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> > > index 74e8e4c17e81..7dbee9931eb6 100644
> > > --- a/drivers/iommu/intel/pasid.c
> > > +++ b/drivers/iommu/intel/pasid.c
> > > @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
> > > if (!info || !info->ats_enabled)
> > > return;
> > > + if (pci_dev_is_disconnected(to_pci_dev(dev)))
> > > + return;
> > > +
> > > sid = info->bus << 8 | info->devfn;
> > > qdep = info->ats_qdep;
> > > pfsid = info->pfsid;
> >
> > This goes on to call qi_submit_sync(), which contains a restart: loop.
> > I don't know the programming model there, but it looks possible that
> > qi_submit_sync() and qi_check_fault() might not handle the case of an
> > unreachable device correctly. There should be a way to exit that
> > restart: loop in cases where the device doesn't respond at all.
>
> Yes, fix it in patch[4/4] to break it out when device is gone.
On 12/25/2023 10:21 AM, Bjorn Helgaas wrote:
> On Mon, Dec 25, 2023 at 09:46:26AM +0800, Ethan Zhao wrote:
>> On 12/25/2023 6:43 AM, Bjorn Helgaas wrote:
>>> On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
>>>> ...
>>>> [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
>>>> [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
>>>> [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
>>>> ...
>>>> [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
>>>> range: 0xffffffff80000000-0xffffffffbfffffff)
>>> The timestamps don't help understand the problem, so you could remove
>>> them so they aren't a distraction.
>> Lukas said he see the qi_submit_sync takes up to 12 seconds to trigger the
>> watchdog.
> OK, so the timestamps told us how long the watchdog tolerates. I
> don't know how useful that is. I suspect that's not a fixed interval
> (probably differs by watchdog and possibly user preference).
>
>>>> Fix it by checking the device's error_state in
>>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>>>> request to link down device that is set to pci_channel_io_perm_failure and
>>>> then powered off in
>>> A pci_dev_is_disconnected() is racy in this context, so this by itself
>>> doesn't look like a complete "fix".
>> A quick workaround.
> Call it a "quick workaround" then, not a "fix". I'm personally not
> usually interested in quick workarounds that are not actually fixes,
> but the IOMMU folks would be the ones to decide.
>
> Maybe this is more of an optimization? If patch 4/4 is a real fix (in
> the sense that if you disable the watchdog, you would get correct
> results after a long timeout), maybe you could reorder the patches so
> 4/4 comes first, and this one becomes an optimization on top of it? I
Make sense, will reorder them.
> haven't worked though the whole path, so I don't know exactly how
> these patches work.
>
>>>> pciehp_ist()
>>>> pciehp_handle_presence_or_link_change()
>>>> pciehp_disable_slot()
>>>> remove_board()
>>>> pciehp_unconfigure_device()
>>> There are some interesting steps missing here between
>>> pciehp_unconfigure_device() and devtlb_invalidation_with_pasid().
>>>
>>> devtlb_invalidation_with_pasid() is Intel-specific. ATS Invalidate
>>> Requests are not Intel-specific, so all IOMMU drivers must have to
>>> deal with the case of an ATS Invalidate Request where we never receive
>>> a corresponding ATS Invalidate Completion. Do other IOMMUs like AMD
>>> and ARM have a similar issue?
>> So far fix it in Intel vt-d specific path.
> If the other IOMMU drivers are vulnerable, I guess they would like to
> fix this at the same time and in a similar way if possible.
>
>>>> For SAVE_REMOVAL unplug, link is alive when iommu releases devcie and
>>>> issues devTLB invalidate request, wouldn't trigger such issue.
>>>>
>>>> This patch works for all links of SURPPRISE_REMOVAL unplug operations.
>>> It's not completely obvious that a fix that works for the safe removal
>>> case also works for the surprise removal case. Can you briefly
>>> explain why it does?
>> As I explained to baolu.
>>
>> For safe_removal, device wouldn't be removed till the whole software
>> handling process done, so without this fix, it wouldn't trigger the lockup
>> issue, and in safe_removal path, device state isn't set to
>> pci_channel_io_perm_failure in pciehp_unconfigure_device() by checking
>> 'presence', patch calling this pci_dev_is_disconnected() will return false
>> there, wouldn't break the function. so it works.
>>
>> For suprise_removal, device state is set to pci_channel_io_perm_failure in
>> pciehp_unconfigure_device(), means device already be in power-off/link-down
>> /removed state, callpci_dev_is_disconnected() hrere will return true to
>> break
>>
>> the function not to send ATS invalidation request anymore, thus avoid the
>> further long time waiting trigger the hard lockup.
> s/safe_removal/safe removal/ (they are not a single word)
> s/suprise_removal/surprise removal/ (misspelled, also not a single word)
>
>> Do I make it clear enough ?
> Needs to be in the commit log, of course.
Okay, append to the commit log.
Thanks,
Ethan
>>>> Tested-by: Haorong Ye <[email protected]>
>>>> Signed-off-by: Ethan Zhao <[email protected]>
>>>> ---
>>>> drivers/iommu/intel/pasid.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>>>> index 74e8e4c17e81..7dbee9931eb6 100644
>>>> --- a/drivers/iommu/intel/pasid.c
>>>> +++ b/drivers/iommu/intel/pasid.c
>>>> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
>>>> if (!info || !info->ats_enabled)
>>>> return;
>>>> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
>>>> + return;
>>>> +
>>>> sid = info->bus << 8 | info->devfn;
>>>> qdep = info->ats_qdep;
>>>> pfsid = info->pfsid;
>>> This goes on to call qi_submit_sync(), which contains a restart: loop.
>>> I don't know the programming model there, but it looks possible that
>>> qi_submit_sync() and qi_check_fault() might not handle the case of an
>>> unreachable device correctly. There should be a way to exit that
>>> restart: loop in cases where the device doesn't respond at all.
>> Yes, fix it in patch[4/4] to break it out when device is gone.
On 12/24/2023 6:47 PM, Lukas Wunner wrote:
> On Sun, Dec 24, 2023 at 12:06:57AM -0500, Ethan Zhao wrote:
>> --- a/drivers/iommu/intel/dmar.c
>> +++ b/drivers/iommu/intel/dmar.c
>> @@ -1423,6 +1423,13 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
>> writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);
>>
>> while (qi->desc_status[wait_index] != QI_DONE) {
>> + /*
>> + * if the devTLB invalidation target device is gone, don't wait
>> + * anymore, it might take up to 1min+50%, causes system hang.
>> + */
>> + if (type == QI_DIOTLB_TYPE && iommu->flush_target_dev)
>> + if (!pci_device_is_present(to_pci_dev(iommu->flush_target_dev)))
>> + break;
> As a general approach, this is much better now.
>
> Please combine the nested if-clauses into one.
>
> Please amend the code comment with a spec reference, i.e.
> "(see Implementation Note in PCIe r6.1 sec 10.3.1)"
> so that readers of the code know where the magic number "1min+50%"
> is coming from.
>
> Is flush_target_dev guaranteed to always be a pci_dev?
>
> I'll let iommu maintainers comment on whether storing a flush_target_dev
> pointer is the right approach. (May store a back pointer from
> struct intel_iommu to struct device_domain_info?)
>
> Maybe move the "to_pci_dev(iommu->flush_target_dev)" lookup outside the
> loop to avoid doing this over and over again?
>
> I think we still have a problem here if the device is not removed
> but simply takes a long time to respond to Invalidate Requests
> (as it is permitted to do per the Implementation Note). We'll
If the hardware implenmentation didn't extend the PCIe spec, that
is possible and horrible case for current synchromous queue model
for ATS invalidation. but to wipe the concern and quote info not public
here, perhaps not proper for me.
Thanks,
Ethan
> busy-wait for the completion and potentially run into the watchdog's
> time limit again. So I think you or someone else in your org should
> add OKRs to refactor the code so that it sleeps in-between polling
> for Invalidate Completions (instead of busy-waiting with interrupts
> disabled).
>
> Thanks,
>
> Lukas
>
On 12/25/2023 6:43 AM, Bjorn Helgaas wrote:
> On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> s/don's/don't/ (in subject)
> s/pciehpt_ist/pciehp_ist/
>
> IIUC you are referring to a specific PCIe transaction, so unless
> there's another spec that defines "devTLB flush request", please use
> the actual PCIe transaction name ("ATS Invalidate Request") as Lukas
> suggested.
>
> There's no point in using an informal name that we assume "all
> iommu/PCIe guys could understand." It's better to use a term that
> anybody can find by searching the spec.
>
>> That would cause following continuous hard lockup warning and system hang
>>
>> [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
>> [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
>> [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>> OE kernel version xxxx
>> [ 4223.822623] Hardware name: vendorname xxxx 666-106,
>> BIOS 01.01.02.03.01 05/15/2023
>> [ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
>> [ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
>> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
>> 0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>> [ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>> [ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
>> [ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
>> [ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
>> [ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
>> [ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
>> [ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
>> knlGS:0000000000000000
>> [ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
>> [ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [ 4223.822628] PKRU: 55555554
>> [ 4223.822628] Call Trace:
>> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
>> [ 4223.822629] intel_iommu_release_device+0x1f/0x30
>> [ 4223.822629] iommu_release_device+0x33/0x60
>> [ 4223.822629] iommu_bus_notifier+0x7f/0x90
>> [ 4223.822630] blocking_notifier_call_chain+0x60/0x90
>> [ 4223.822630] device_del+0x2e5/0x420
>> [ 4223.822630] pci_remove_bus_device+0x70/0x110
>> [ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
>> [ 4223.822631] pciehp_disable_slot+0x6b/0x100
>> [ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
>> [ 4223.822631] pciehp_ist+0x176/0x180
>> [ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
>> [ 4223.822632] irq_thread_fn+0x19/0x50
>> [ 4223.822632] irq_thread+0x104/0x190
>> [ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
>> [ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
>> [ 4223.822633] kthread+0x114/0x130
>> [ 4223.822633] ? __kthread_cancel_work+0x40/0x40
>> [ 4223.822633] ret_from_fork+0x1f/0x30
>> [ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
>> [ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>> OE kernel version xxxx
>> [ 4223.822634] Hardware name: vendorname xxxx 666-106,
>> BIOS 01.01.02.03.01 05/15/2023
>> [ 4223.822634] Call Trace:
>> [ 4223.822634] <NMI>
>> [ 4223.822635] dump_stack+0x6d/0x88
>> [ 4223.822635] panic+0x101/0x2d0
>> [ 4223.822635] ? ret_from_fork+0x11/0x30
>> [ 4223.822635] nmi_panic.cold.14+0xc/0xc
>> [ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
>> [ 4223.822636] __perf_event_overflow+0x4f/0xf0
>> [ 4223.822636] handle_pmi_common+0x1ef/0x290
>> [ 4223.822636] ? __set_pte_vaddr+0x28/0x40
>> [ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
>> [ 4223.822637] ? __native_set_fixmap+0x24/0x30
>> [ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
>> [ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
>> [ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
>> [ 4223.822638] perf_event_nmi_handler+0x24/0x40
>> [ 4223.822638] nmi_handle+0x4d/0xf0
>> [ 4223.822638] default_do_nmi+0x49/0x100
>> [ 4223.822638] exc_nmi+0x134/0x180
>> [ 4223.822639] end_repeat_nmi+0x16/0x67
>> [ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
>> [ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
>> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
>> 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>> [ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>> [ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
>> [ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
>> [ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
>> [ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
>> [ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
>> [ 4223.822641] ? qi_submit_sync+0x2c0/0x490
>> [ 4223.822642] ? qi_submit_sync+0x2c0/0x490
>> [ 4223.822642] </NMI>
>> [ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
>> [ 4223.822643] intel_iommu_release_device+0x1f/0x30
>> [ 4223.822643] iommu_release_device+0x33/0x60
>> [ 4223.822643] iommu_bus_notifier+0x7f/0x90
>> [ 4223.822644] blocking_notifier_call_chain+0x60/0x90
>> [ 4223.822644] device_del+0x2e5/0x420
>> [ 4223.822644] pci_remove_bus_device+0x70/0x110
>> [ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
>> [ 4223.822644] pciehp_disable_slot+0x6b/0x100
>> [ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
>> [ 4223.822645] pciehp_ist+0x176/0x180
>> [ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
>> [ 4223.822645] irq_thread_fn+0x19/0x50
>> [ 4223.822646] irq_thread+0x104/0x190
>> [ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
>> [ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
>> [ 4223.822646] kthread+0x114/0x130
>> [ 4223.822647] ? __kthread_cancel_work+0x40/0x40
>> [ 4223.822647] ret_from_fork+0x1f/0x30
>> [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
>> range: 0xffffffff80000000-0xffffffffbfffffff)
> The timestamps don't help understand the problem, so you could remove
> them so they aren't a distraction.
>
>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
> A pci_dev_is_disconnected() is racy in this context, so this by itself
> doesn't look like a complete "fix".
>
>> pciehp_ist()
>> pciehp_handle_presence_or_link_change()
>> pciehp_disable_slot()
>> remove_board()
>> pciehp_unconfigure_device()
> There are some interesting steps missing here between
> pciehp_unconfigure_device() and devtlb_invalidation_with_pasid().
>
> devtlb_invalidation_with_pasid() is Intel-specific. ATS Invalidate
> Requests are not Intel-specific, so all IOMMU drivers must have to
> deal with the case of an ATS Invalidate Request where we never receive
> a corresponding ATS Invalidate Completion. Do other IOMMUs like AMD
> and ARM have a similar issue?
>
>> For SAVE_REMOVAL unplug, link is alive when iommu releases devcie and
>> issues devTLB invalidate request, wouldn't trigger such issue.
>>
>> This patch works for all links of SURPPRISE_REMOVAL unplug operations.
> s/devcie/device/
>
> Writing "SAVE_REMOVAL" and "SURPPRISE_REMOVAL" in all caps with an
> underscore makes them look like identifiers. But neither appears in
> the kernel source. Write them as normal English words, e.g., "save
> removal" instead (though I suspect you mean "safe removal"?).
>
> s/surpprise/surprise/
>
> It's not completely obvious that a fix that works for the safe removal
> case also works for the surprise removal case. Can you briefly
> explain why it does?
>
>> Tested-by: Haorong Ye <[email protected]>
>> Signed-off-by: Ethan Zhao <[email protected]>
>> ---
>> drivers/iommu/intel/pasid.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>> index 74e8e4c17e81..7dbee9931eb6 100644
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
>> if (!info || !info->ats_enabled)
>> return;
>>
>> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
>> + return;
>> +
>> sid = info->bus << 8 | info->devfn;
>> qdep = info->ats_qdep;
>> pfsid = info->pfsid;
> This goes on to call qi_submit_sync(), which contains a restart: loop.
> I don't know the programming model there, but it looks possible that
> qi_submit_sync() and qi_check_fault() might not handle the case of an
> unreachable device correctly. There should be a way to exit that
> restart: loop in cases where the device doesn't respond at all.
Current sychronous model isn't good to handle such case, so does
the CPU. the vt-d hardware is integrated, if it is just broken, no response
at all, it will block all devices I/O attached to that iommu, then bring
down
the whole system. except individual iommu and its device tree could be
hotplug capable. asynchornouse programming module will work for it.
my undestanding.
Thanks,
Ethan
>
> Bjorn
>
On 12/25/2023 5:12 PM, Ethan Zhao wrote:
>
> On 12/25/2023 6:43 AM, Bjorn Helgaas wrote:
>> On Sun, Dec 24, 2023 at 12:06:55AM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's
>>> link
>>> through setting the slot's link control register, as pciehpt_ist()
>>> DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for
>>> device to
>>> be sent and a long time completion/timeout waiting in interrupt
>>> context.
>> s/don's/don't/ (in subject)
>> s/pciehpt_ist/pciehp_ist/
>>
>> IIUC you are referring to a specific PCIe transaction, so unless
>> there's another spec that defines "devTLB flush request", please use
>> the actual PCIe transaction name ("ATS Invalidate Request") as Lukas
>> suggested.
>>
>> There's no point in using an informal name that we assume "all
>> iommu/PCIe guys could understand." It's better to use a term that
>> anybody can find by searching the spec.
>>
>>> That would cause following continuous hard lockup warning and system
>>> hang
>>>
>>> [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
>>> [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not
>>> present
>>> [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
>>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded
>>> Tainted: G S
>>> OE kernel version xxxx
>>> [ 4223.822623] Hardware name: vendorname xxxx 666-106,
>>> BIOS 01.01.02.03.01 05/15/2023
>>> [ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
>>> [ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f
>>> 95 c1 48 8b
>>> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34
>>> <40> f6 c6 1
>>> 0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>>> [ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>>> [ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX:
>>> 0000000000000005
>>> [ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI:
>>> ffff9f38401a8340
>>> [ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09:
>>> 0000000000000000
>>> [ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12:
>>> ffff9f384005e200
>>> [ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15:
>>> 0000000000000004
>>> [ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
>>> knlGS:0000000000000000
>>> [ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4:
>>> 0000000000770ee0
>>> [ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000
>>> [ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
>>> 0000000000000400
>>> [ 4223.822628] PKRU: 55555554
>>> [ 4223.822628] Call Trace:
>>> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
>>> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
>>> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
>>> [ 4223.822629] intel_iommu_release_device+0x1f/0x30
>>> [ 4223.822629] iommu_release_device+0x33/0x60
>>> [ 4223.822629] iommu_bus_notifier+0x7f/0x90
>>> [ 4223.822630] blocking_notifier_call_chain+0x60/0x90
>>> [ 4223.822630] device_del+0x2e5/0x420
>>> [ 4223.822630] pci_remove_bus_device+0x70/0x110
>>> [ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
>>> [ 4223.822631] pciehp_disable_slot+0x6b/0x100
>>> [ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
>>> [ 4223.822631] pciehp_ist+0x176/0x180
>>> [ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
>>> [ 4223.822632] irq_thread_fn+0x19/0x50
>>> [ 4223.822632] irq_thread+0x104/0x190
>>> [ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
>>> [ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
>>> [ 4223.822633] kthread+0x114/0x130
>>> [ 4223.822633] ? __kthread_cancel_work+0x40/0x40
>>> [ 4223.822633] ret_from_fork+0x1f/0x30
>>> [ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
>>> [ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded
>>> Tainted: G S
>>> OE kernel version xxxx
>>> [ 4223.822634] Hardware name: vendorname xxxx 666-106,
>>> BIOS 01.01.02.03.01 05/15/2023
>>> [ 4223.822634] Call Trace:
>>> [ 4223.822634] <NMI>
>>> [ 4223.822635] dump_stack+0x6d/0x88
>>> [ 4223.822635] panic+0x101/0x2d0
>>> [ 4223.822635] ? ret_from_fork+0x11/0x30
>>> [ 4223.822635] nmi_panic.cold.14+0xc/0xc
>>> [ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
>>> [ 4223.822636] __perf_event_overflow+0x4f/0xf0
>>> [ 4223.822636] handle_pmi_common+0x1ef/0x290
>>> [ 4223.822636] ? __set_pte_vaddr+0x28/0x40
>>> [ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
>>> [ 4223.822637] ? __native_set_fixmap+0x24/0x30
>>> [ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
>>> [ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
>>> [ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
>>> [ 4223.822638] perf_event_nmi_handler+0x24/0x40
>>> [ 4223.822638] nmi_handle+0x4d/0xf0
>>> [ 4223.822638] default_do_nmi+0x49/0x100
>>> [ 4223.822638] exc_nmi+0x134/0x180
>>> [ 4223.822639] end_repeat_nmi+0x16/0x67
>>> [ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
>>> [ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f
>>> 95 c1 48 8b
>>> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34
>>> <40> f6 c6 10
>>> 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>>> [ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>>> [ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX:
>>> 0000000000000005
>>> [ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI:
>>> ffff9f38401a8340
>>> [ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09:
>>> 0000000000000000
>>> [ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12:
>>> ffff9f384005e200
>>> [ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15:
>>> 0000000000000004
>>> [ 4223.822641] ? qi_submit_sync+0x2c0/0x490
>>> [ 4223.822642] ? qi_submit_sync+0x2c0/0x490
>>> [ 4223.822642] </NMI>
>>> [ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
>>> [ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
>>> [ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
>>> [ 4223.822643] intel_iommu_release_device+0x1f/0x30
>>> [ 4223.822643] iommu_release_device+0x33/0x60
>>> [ 4223.822643] iommu_bus_notifier+0x7f/0x90
>>> [ 4223.822644] blocking_notifier_call_chain+0x60/0x90
>>> [ 4223.822644] device_del+0x2e5/0x420
>>> [ 4223.822644] pci_remove_bus_device+0x70/0x110
>>> [ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
>>> [ 4223.822644] pciehp_disable_slot+0x6b/0x100
>>> [ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
>>> [ 4223.822645] pciehp_ist+0x176/0x180
>>> [ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
>>> [ 4223.822645] irq_thread_fn+0x19/0x50
>>> [ 4223.822646] irq_thread+0x104/0x190
>>> [ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
>>> [ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
>>> [ 4223.822646] kthread+0x114/0x130
>>> [ 4223.822647] ? __kthread_cancel_work+0x40/0x40
>>> [ 4223.822647] ret_from_fork+0x1f/0x30
>>> [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000
>>> (relocation
>>> range: 0xffffffff80000000-0xffffffffbfffffff)
>> The timestamps don't help understand the problem, so you could remove
>> them so they aren't a distraction.
>>
>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB
>>> flush
>>> request to link down device that is set to
>>> pci_channel_io_perm_failure and
>>> then powered off in
>> A pci_dev_is_disconnected() is racy in this context, so this by itself
>> doesn't look like a complete "fix".
>>
>>> pciehp_ist()
>>> pciehp_handle_presence_or_link_change()
>>> pciehp_disable_slot()
>>> remove_board()
>>> pciehp_unconfigure_device()
>> There are some interesting steps missing here between
>> pciehp_unconfigure_device() and devtlb_invalidation_with_pasid().
>>
>> devtlb_invalidation_with_pasid() is Intel-specific. ATS Invalidate
>> Requests are not Intel-specific, so all IOMMU drivers must have to
>> deal with the case of an ATS Invalidate Request where we never receive
>> a corresponding ATS Invalidate Completion. Do other IOMMUs like AMD
>> and ARM have a similar issue?
>>
>>> For SAVE_REMOVAL unplug, link is alive when iommu releases devcie and
>>> issues devTLB invalidate request, wouldn't trigger such issue.
>>>
>>> This patch works for all links of SURPPRISE_REMOVAL unplug operations.
>> s/devcie/device/
>>
>> Writing "SAVE_REMOVAL" and "SURPPRISE_REMOVAL" in all caps with an
>> underscore makes them look like identifiers. But neither appears in
>> the kernel source. Write them as normal English words, e.g., "save
>> removal" instead (though I suspect you mean "safe removal"?).
>>
>> s/surpprise/surprise/
>>
>> It's not completely obvious that a fix that works for the safe removal
>> case also works for the surprise removal case. Can you briefly
>> explain why it does?
>>
>>> Tested-by: Haorong Ye <[email protected]>
>>> Signed-off-by: Ethan Zhao <[email protected]>
>>> ---
>>> drivers/iommu/intel/pasid.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>>> index 74e8e4c17e81..7dbee9931eb6 100644
>>> --- a/drivers/iommu/intel/pasid.c
>>> +++ b/drivers/iommu/intel/pasid.c
>>> @@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct
>>> intel_iommu *iommu,
>>> if (!info || !info->ats_enabled)
>>> return;
>>> + if (pci_dev_is_disconnected(to_pci_dev(dev)))
>>> + return;
>>> +
>>> sid = info->bus << 8 | info->devfn;
>>> qdep = info->ats_qdep;
>>> pfsid = info->pfsid;
>> This goes on to call qi_submit_sync(), which contains a restart: loop.
>> I don't know the programming model there, but it looks possible that
>> qi_submit_sync() and qi_check_fault() might not handle the case of an
>> unreachable device correctly. There should be a way to exit that
>> restart: loop in cases where the device doesn't respond at all.
>
> Current sychronous model isn't good to handle such case, so does
>
> the CPU. the vt-d hardware is integrated, if it is just broken, no
> response
>
> at all, it will block all devices I/O attached to that iommu, then
> bring down
>
> the whole system. except individual iommu and its device tree could be
>
> hotplug capable. asynchornouse programming module will work for it.
>
> my undestanding.
>
>
Add another patch in v8, try to break the timeout invalidation loop if the
endpoint device just no response, but not gone (present). thanks for your
tip about the no response case.
Thanks,
Ethan
> Thanks,
>
> Ethan
>
>>
>> Bjorn
>>
>