This patch set fixes a VT-d hard lockup reported when an ATS-capable
endpoint device connects to the system via a PCIe switch sitting below
a root port, in the following topology.
+-[0000:15]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
|           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
|           +-00.2  Intel Corporation Ice Lake RAS
|           +-00.4  Intel Corporation Device 0b23
|           \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0  NVIDIA Corporation Device 2324
|                                           +-01.0-[19]----00.0  Mellanox Technologies MT2910 Family [ConnectX-7]
The user brought endpoint device 19:00.0's link down by toggling the
link control register of its hotplug-capable slot 17:01.0. In response
to the resulting DLLSC interrupt, pciehp_ist() unloads the device
driver and powers the device off. While the device driver is being
unloaded, an iommu devTLB flush request is issued to the link-down
device, and the long completion/timeout wait in interrupt context
causes a continuous hard lockup warning and a system hang.
[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629] intel_iommu_release_device+0x1f/0x30
[ 4223.822629] iommu_release_device+0x33/0x60
[ 4223.822629] iommu_bus_notifier+0x7f/0x90
[ 4223.822630] blocking_notifier_call_chain+0x60/0x90
[ 4223.822630] device_del+0x2e5/0x420
[ 4223.822630] pci_remove_bus_device+0x70/0x110
[ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631] pciehp_disable_slot+0x6b/0x100
[ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631] pciehp_ist+0x176/0x180
[ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632] irq_thread_fn+0x19/0x50
[ 4223.822632] irq_thread+0x104/0x190
[ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633] kthread+0x114/0x130
[ 4223.822633] ? __kthread_cancel_work+0x40/0x40
[ 4223.822633] ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634] <NMI>
[ 4223.822635] dump_stack+0x6d/0x88
[ 4223.822635] panic+0x101/0x2d0
[ 4223.822635] ? ret_from_fork+0x11/0x30
[ 4223.822635] nmi_panic.cold.14+0xc/0xc
[ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636] __perf_event_overflow+0x4f/0xf0
[ 4223.822636] handle_pmi_common+0x1ef/0x290
[ 4223.822636] ? __set_pte_vaddr+0x28/0x40
[ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637] ? __native_set_fixmap+0x24/0x30
[ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638] perf_event_nmi_handler+0x24/0x40
[ 4223.822638] nmi_handle+0x4d/0xf0
[ 4223.822638] default_do_nmi+0x49/0x100
[ 4223.822638] exc_nmi+0x134/0x180
[ 4223.822639] end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] </NMI>
[ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643] intel_iommu_release_device+0x1f/0x30
[ 4223.822643] iommu_release_device+0x33/0x60
[ 4223.822643] iommu_bus_notifier+0x7f/0x90
[ 4223.822644] blocking_notifier_call_chain+0x60/0x90
[ 4223.822644] device_del+0x2e5/0x420
[ 4223.822644] pci_remove_bus_device+0x70/0x110
[ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644] pciehp_disable_slot+0x6b/0x100
[ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645] pciehp_ist+0x176/0x180
[ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645] irq_thread_fn+0x19/0x50
[ 4223.822646] irq_thread+0x104/0x190
[ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646] kthread+0x114/0x130
[ 4223.822647] ? __kthread_cancel_work+0x40/0x40
[ 4223.822647] ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)
Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device that has been set to
pci_channel_io_perm_failure and is then powered off in

  pciehp_ist()
    pciehp_handle_presence_or_link_change()
      pciehp_disable_slot()
        remove_board()
          pciehp_unconfigure_device()
This patchset was tested by [email protected] on stable-6.7-rc4, and
is sent out for further comments.
Ethan Zhao (2):
PCI: make pci_dev_is_disconnected() helper public for other drivers
  iommu/vt-d: don't issue devTLB flush request when device is
disconnected
drivers/iommu/intel/pasid.c | 21 ++++++++++++++++++++-
drivers/pci/pci.h | 5 -----
include/linux/pci.h | 5 +++++
3 files changed, 25 insertions(+), 6 deletions(-)
--
2.31.1
Move pci_dev_is_disconnected() from drivers/pci/pci.h to the public
include/linux/pci.h so that other drivers can reference it.

No functional change.
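As a minimal illustration (not part of this patch; the wrapper function
is made up for the example), an outside driver could use the helper
like this to skip talking to a device that pciehp has already marked as
gone:

#include <linux/pci.h>

/* Sketch: true if it is still worth sending requests to the device. */
static bool example_device_reachable(struct device *dev)
{
	if (dev_is_pci(dev) && pci_dev_is_disconnected(to_pci_dev(dev)))
		return false;	/* non-posted requests would only time out */

	return true;
}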
Tested-by: Haorong Ye <[email protected]>
Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/pci/pci.h | 5 -----
include/linux/pci.h | 5 +++++
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..75fa2084492f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -366,11 +366,6 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
return 0;
}
-static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
-{
- return dev->error_state == pci_channel_io_perm_failure;
-}
-
/* pci_dev priv_flags */
#define PCI_DEV_ADDED 0
#define PCI_DPC_RECOVERED 1
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 60ca768bc867..869f2ec97a84 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2503,6 +2503,11 @@ static inline struct pci_dev *pcie_find_root_port(struct pci_dev *dev)
return NULL;
}
+static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
+{
+ return dev->error_state == pci_channel_io_perm_failure;
+}
+
void pci_request_acs(void);
bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
--
2.31.1
For those endpoint devices that connect to the system via
hotplug-capable ports, users could request a warm reset of the device
by flapping the device's link through the slot's link control register.
In response to the DLLSC interrupt sequence, pciehp_ist() will unload
the device driver and then power the device off, causing an IOMMU
devTLB flush request to be sent to the device and a long
completion/timeout wait in interrupt context. That results in the
following continuous hard lockup warning and system hang:
[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629] intel_iommu_release_device+0x1f/0x30
[ 4223.822629] iommu_release_device+0x33/0x60
[ 4223.822629] iommu_bus_notifier+0x7f/0x90
[ 4223.822630] blocking_notifier_call_chain+0x60/0x90
[ 4223.822630] device_del+0x2e5/0x420
[ 4223.822630] pci_remove_bus_device+0x70/0x110
[ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631] pciehp_disable_slot+0x6b/0x100
[ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631] pciehp_ist+0x176/0x180
[ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632] irq_thread_fn+0x19/0x50
[ 4223.822632] irq_thread+0x104/0x190
[ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633] kthread+0x114/0x130
[ 4223.822633] ? __kthread_cancel_work+0x40/0x40
[ 4223.822633] ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634] <NMI>
[ 4223.822635] dump_stack+0x6d/0x88
[ 4223.822635] panic+0x101/0x2d0
[ 4223.822635] ? ret_from_fork+0x11/0x30
[ 4223.822635] nmi_panic.cold.14+0xc/0xc
[ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636] __perf_event_overflow+0x4f/0xf0
[ 4223.822636] handle_pmi_common+0x1ef/0x290
[ 4223.822636] ? __set_pte_vaddr+0x28/0x40
[ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637] ? __native_set_fixmap+0x24/0x30
[ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638] perf_event_nmi_handler+0x24/0x40
[ 4223.822638] nmi_handle+0x4d/0xf0
[ 4223.822638] default_do_nmi+0x49/0x100
[ 4223.822638] exc_nmi+0x134/0x180
[ 4223.822639] end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] </NMI>
[ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643] intel_iommu_release_device+0x1f/0x30
[ 4223.822643] iommu_release_device+0x33/0x60
[ 4223.822643] iommu_bus_notifier+0x7f/0x90
[ 4223.822644] blocking_notifier_call_chain+0x60/0x90
[ 4223.822644] device_del+0x2e5/0x420
[ 4223.822644] pci_remove_bus_device+0x70/0x110
[ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644] pciehp_disable_slot+0x6b/0x100
[ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645] pciehp_ist+0x176/0x180
[ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645] irq_thread_fn+0x19/0x50
[ 4223.822646] irq_thread+0x104/0x190
[ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646] kthread+0x114/0x130
[ 4223.822647] ? __kthread_cancel_work+0x40/0x40
[ 4223.822647] ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)
Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device that has been set to
pci_channel_io_perm_failure and is then powered off in

  pciehp_ist()
    pciehp_handle_presence_or_link_change()
      pciehp_disable_slot()
        remove_board()
          pciehp_unconfigure_device()
Tested-by: Haorong Ye <[email protected]>
Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/iommu/intel/pasid.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..8557b6dee22f 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -476,6 +476,23 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
{
struct device_domain_info *info;
u16 sid, qdep, pfsid;
+ struct pci_dev *pdev;
+
+	if (!dev_is_pci(dev))
+		return;
+	pdev = to_pci_dev(dev);
+
+	/*
+	 * If the endpoint's link was brought down by a user's config space
+	 * write to its hotplug-capable slot's link control register, then
+	 * pciehp_ist(), responding to the DLLSC interrupt, sets the device's
+	 * error_state to pci_channel_io_perm_failure. Check that state here
+	 * to avoid issuing a meaningless devTLB flush request, which could
+	 * cause a lockup warning or deadlock given the long wait in IRQ context.
+	 */
+
+ if (pci_dev_is_disconnected(pdev))
+ return;
info = dev_iommu_priv_get(dev);
if (!info || !info->ats_enabled)
@@ -495,6 +512,8 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
else
qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);
+
+ return;
}
void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
--
2.31.1
On Tue, Dec 12, 2023 at 10:46:36PM -0500, Ethan Zhao wrote:
> move pci_dev_is_disconnected() from driver/pci/pci.h to public
> include/linux/pci.h for other driver's reference.
> no function change.
That's merely a prose description of the code. A reader can already
see from the code what it's doing. You need to explain the *reason*
for the change instead. E.g.: "Make pci_dev_is_disconnected() public
so that it can be called from $DRIVER to speed up hot removal
handling which may otherwise take seconds because of $REASONS."
Thanks,
Lukas
On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register,
Well, users could just *unplug* the device, right? Why is it relevant
that thay could fiddle with registers in config space?
> as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.
A completion timeout should be on the order of usecs or msecs, why does it
cause a hard lockup? The dmesg excerpt you've provided shows a 12 *second*
delay between hot removal and watchdog reaction.
> Fix it by checking the device's error_state in
> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
> request to link down device that is set to pci_channel_io_perm_failure and
> then powered off in
This doesn't seem to be a proper fix. It will work most of the time
but not always. A user might bring down the slot via sysfs, then yank
the card from the slot just when the iommu flush occurs such that the
pci_dev_is_disconnected(pdev) check returns false but the card is
physically gone immediately afterwards. In other words, you've shrunk
the time window during which the issue may occur, but haven't eliminated
it completely.
Thanks,
Lukas
On 13/12/2023 10:44 am, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register,
>
> Well, users could just *unplug* the device, right? Why is it relevant
> that thay could fiddle with registers in config space?
>
>
>> as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
>
> A completion timeout should be on the order of usecs or msecs, why does it
> cause a hard lockup? The dmesg excerpt you've provided shows a 12 *second*
> delay between hot removal and watchdog reaction.
The PCIe spec only requires an endpoint to respond to an ATS invalidate
within a rather hilarious 90 seconds, so it's primarily a question of
how patient the root complex and bridges in between are prepared to be.
>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
>
> This doesn't seem to be a proper fix. It will work most of the time
> but not always. A user might bring down the slot via sysfs, then yank
> the card from the slot just when the iommu flush occurs such that the
> pci_dev_is_disconnected(pdev) check returns false but the card is
> physically gone immediately afterwards. In other words, you've shrunk
> the time window during which the issue may occur, but haven't eliminated
> it completely.
Yeah, I think we have a subtle but fundamental issue here in that the
iommu_release_device() callback is hooked to BUS_NOTIFY_REMOVED_DEVICE,
so in general probably shouldn't be assuming it's safe to do anything
with the device itself *after* it's already been removed from its bus -
this step is primarily about cleaning up any of the IOMMU's own state
relating to the given device.
I think if we want to ensure ATCs are invalidated on hot-unplug we need
an additional pre-removal notifier to take care of that, and that step
would then want to distinguish between an orderly removal where cleaning
up is somewhat meaningful, and a surprise removal where it definitely isn't.
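Roughly what I have in mind, purely as an illustrative sketch (the ATC
flush helper and the surprise-removal test are hypothetical; only the
notifier plumbing is existing API):

#include <linux/device.h>
#include <linux/notifier.h>

static int iommu_pre_removal_notifier(struct notifier_block *nb,
				      unsigned long action, void *data)
{
	struct device *dev = data;

	/*
	 * BUS_NOTIFY_DEL_DEVICE fires before the device is torn off its
	 * bus, so an orderly removal can still reach the endpoint here;
	 * on surprise removal there is no point even trying.
	 */
	if (action == BUS_NOTIFY_DEL_DEVICE &&
	    !device_gone_by_surprise(dev))	/* hypothetical check */
		iommu_flush_dev_atc(dev);	/* hypothetical ATC flush */

	return NOTIFY_DONE;
}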
Thanks,
Robin.
On 2023/12/13 11:46, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register, as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off.
Is it possible for pciehp to disable ATS on the device before unloading
the driver? Or should the device follow some specific steps to warm
reset the device?
What happens if the IOMMU issues a device TLB invalidation after link down but
before pci_dev_is_disconnected() returns true?
Best regards,
baolu
On 12/13/2023 6:49 PM, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:36PM -0500, Ethan Zhao wrote:
>> move pci_dev_is_disconnected() from driver/pci/pci.h to public
>> include/linux/pci.h for other driver's reference.
>> no function change.
> That's merely a prose description of the code. A reader can already
> see from the code what it's doing. You need to explain the *reason*
> for the change instead. E.g.: "Make pci_dev_is_disconnected() public
> so that it can be called from $DRIVER to speed up hot removal
> handling which may otherwise take seconds because of $REASONS."
Yup, that's why I made it public. Then how about:

"Make pci_dev_is_disconnected() public so that it can be called from
the Intel vt-d driver to check the device's hotplug removal state when
issuing a devTLB flush request."
Thanks,
Ethan
>
> Thanks,
>
> Lukas
On 12/13/2023 6:44 PM, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register,
> Well, users could just *unplug* the device, right? Why is it relevant
> that thay could fiddle with registers in config space?
>
Yes, if the device and its slot are hotplug capable, users could just
'unplug' the device.

But in the reported case, users tried to do a warm reset with a tool
command like:

  mlxfwreset -d <busid> -y reset

Under the hood it accesses configuration space, just as:

  setpci -s 0000:17:01.0 0x78.L=0x21050010

Well, we can't tell users not to fiddle with PCIe config space
registers like that.
>> as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> A completion timeout should be on the order of usecs or msecs, why does it
> cause a hard lockup? The dmesg excerpt you've provided shows a 12 *second*
> delay between hot removal and watchdog reaction.
>
In my understanding, the devTLB flush request sent to an ATS-capable
device is a non-posted request. If the ATS transaction is broken by an
endpoint link-down or power-off event, the timeout can take up to 90
seconds (1 minute +50%/-0%); see the "Invalidate Completion Timeout"
implementation note in chapter 10.3.1 (Invalidate Request) of the PCIe
spec 6.1:

"
IMPLEMENTATION NOTE:
INVALIDATE COMPLETION TIMEOUT

Devices should respond to Invalidate Requests within 1 minute (+50%
-0%). Having a bounded time permits an ATPT to implement Invalidate
Completion Timeouts and reuse the associated ITag values. ATPT designs
are implementation specific. As such, Invalidate Completion Timeouts
and their associated error handling are outside the scope of this
specification.
"
>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
> This doesn't seem to be a proper fix. It will work most of the time
> but not always. A user might bring down the slot via sysfs, then yank
> the card from the slot just when the iommu flush occurs such that the
> pci_dev_is_disconnected(pdev) check returns false but the card is
> physically gone immediately afterwards. In other words, you've shrunk
> the time window during which the issue may occur, but haven't eliminated
> it completely.
If you mean disabling the slot via sysfs, that's safe removal, right?
That path issues the devTLB invalidation first and powers the device
off later, so it wouldn't trigger the hard lockup even though
pci_dev_is_disconnected() returns false. This fix works for such a case.
Thanks,
Ethan
>
> Thanks,
>
> Lukas
On 12/13/2023 7:59 PM, Baolu Lu wrote:
> On 2023/12/13 11:46, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off.
>
> Is it possible for pciehp to disable ATS on the device before unloading
> the driver? Or should the device follow some specific steps to warm
> reset the device?
>
In this case the link goes down first, and then pciehp_ist() learns
about it from the DLLSC interrupt. I don't think it makes sense to
disable the device's ATS here, but pciehp could flag the device as
ATS-disabled; anyway, "disconnected" is enough to let VT-d-like
software know the device state.
> What happens if IOMMU issues device TLB invalidation after link down but
> before pci_dev_is_disconnected() returns true?
It seems that wouldn't happen in the hotplug cases, whether safe
removal or surprise removal.
Thanks,
Ethan
>
> Best regards,
> baolu
On 12/13/2023 7:54 PM, Robin Murphy wrote:
> On 13/12/2023 10:44 am, Lukas Wunner wrote:
>> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's
>>> link
>>> through setting the slot's link control register,
>>
>> Well, users could just *unplug* the device, right? Why is it relevant
>> that thay could fiddle with registers in config space?
>>
>>
>>> as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for
>>> device to
>>> be sent and a long time completion/timeout waiting in interrupt
>>> context.
>>
>> A completion timeout should be on the order of usecs or msecs, why
>> does it
>> cause a hard lockup? The dmesg excerpt you've provided shows a 12
>> *second*
>> delay between hot removal and watchdog reaction.
>
> The PCIe spec only requires an endpoint to respond to an ATS
> invalidate within a rather hilarious 90 seconds, so it's primarily a
> question of how patient the root complex and bridges in between are
> prepared to be.
The reported issue only reproduces when the endpoint device connects to
the system via a PCIe switch (one that only has the read tracking
feature); those switches seem unaware of ATS transactions, while the
root port is aware of ATS when the ATS transaction is broken
(invalidation sent, but link down, device removed, etc.). But I didn't
find any public spec about that.
>
>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB
>>> flush
>>> request to link down device that is set to
>>> pci_channel_io_perm_failure and
>>> then powered off in
>>
>> This doesn't seem to be a proper fix. It will work most of the time
>> but not always. A user might bring down the slot via sysfs, then yank
>> the card from the slot just when the iommu flush occurs such that the
>> pci_dev_is_disconnected(pdev) check returns false but the card is
>> physically gone immediately afterwards. In other words, you've shrunk
>> the time window during which the issue may occur, but haven't eliminated
>> it completely.
>
> Yeah, I think we have a subtle but fundamental issue here in that the
> iommu_release_device() callback is hooked to
> BUS_NOTIFY_REMOVED_DEVICE, so in general probably shouldn't be
> assuming it's safe to do anything with the device itself *after* it's
> already been removed from its bus - this step is primarily about
> cleaning up any of the IOMMU's own state relating to the given device.
>
> I think if we want to ensure ATCs are invalidated on hot-unplug we
> need an additional pre-removal notifier to take care of that, and that
> step would then want to distinguish between an orderly removal where
> cleaning up is somewhat meaningful, and a surprise removal where it
> definitely isn't.
So, at least, we should check the device state before issuing the
devTLB invalidation.
Thanks,
Ethan
>
>
> Thanks,
> Robin.
On 12/14/2023 10:16 AM, Ethan Zhao wrote:
>
> On 12/13/2023 6:44 PM, Lukas Wunner wrote:
>> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's
>>> link
>>> through setting the slot's link control register,
>> Well, users could just *unplug* the device, right? Why is it relevant
>> that thay could fiddle with registers in config space?
>>
> Yes, if the device and it's slot are hotplug capable, users could just
>
> 'unplug' the device.
>
> But this case reported, users try to do a warm reset with a tool
>
> command like:
>
> mlxfwreset -d <busid> -y reset
>
> Actually, it will access configuration space just as
>
> setpci -s 0000:17:01.0 0x78.L=0x21050010
>
> Well, we couldn't say don't fiddle PCIe config space registers like
>
> that.
>
>>> as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for
>>> device to
>>> be sent and a long time completion/timeout waiting in interrupt
>>> context.
>> A completion timeout should be on the order of usecs or msecs, why
>> does it
>> cause a hard lockup? The dmesg excerpt you've provided shows a 12
>> *second*
>> delay between hot removal and watchdog reaction.
>>
> In my understanding, the devTLB flush request sent to ATS capable devcie
>
> is non-posted request, if the ATS transaction is broken by endpoint link
>
> -down, power-off event, the timeout will take up to 60 seconds+-30,
>
> see "Invalidate Completion Timeout " part of
>
> chapter 10.3.1 Invalidate Request
>
> In PCIe spec 6.1
>
> "
>
> IMPLEMENTATION NOTE:
>
> INVALIDATE COMPLETION TIMEOUT
>
> Devices should respond to Invalidate Requests within 1 minute (+50%
> -0%).Having a bounded time
>
> permits an ATPT to implement Invalidate Completion Timeouts and reuse
> the associated ITag values.
>
> ATPT designs are implementation specific. As such, Invalidate
> Completion Timeouts and their
>
> associated error handling are outside the scope of this specification
>
> "
>
>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB
>>> flush
>>> request to link down device that is set to
>>> pci_channel_io_perm_failure and
>>> then powered off in
>> This doesn't seem to be a proper fix. It will work most of the time
>> but not always. A user might bring down the slot via sysfs, then yank
>> the card from the slot just when the iommu flush occurs such that the
>> pci_dev_is_disconnected(pdev) check returns false but the card is
>> physically gone immediately afterwards. In other words, you've shrunk
>> the time window during which the issue may occur, but haven't eliminated
>> it completely.
>
> If you mean disable the slot via sysfs, that's SAFE_REMOVAL, right ?
>
> that would issse devTLB invalidation first, power off device later, it
>
> wouldn't trigger the hard lockup, though the
>
> pci_dev_is_disconnected() return false. this fix works such case.
Could you help point out whether there are any other windows to close?
Thanks,
Ethan
>
>
> Thanks,
>
> Ethan
>
>
>
>>
>> Thanks,
>>
>> Lukas
On 12/14/2023 10:26 AM, Ethan Zhao wrote:
>
> On 12/13/2023 7:59 PM, Baolu Lu wrote:
>> On 2023/12/13 11:46, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's
>>> link
>>> through setting the slot's link control register, as pciehpt_ist()
>>> DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off.
>>
>> Is it possible for pciehp to disable ATS on the device before unloading
>> the driver? Or should the device follow some specific steps to warm
>> reset the device?
>>
> In this case, link down first, then pciehp_ist() got DLLSC interrupt
> to know
>
> that, I don't think it makes sense to disable device ATS here, but it
> could
>
> flag the device is ATS disabled, well, "disconnected" is enough to let
>
> vt-d like software knows the device state.
>
>
For hot "unplug" cases:
1. safe_removal
Users request unplug the device via sysfs or press the attention button,
Then pciehp will response to unconfig device/unload device driver, power
if off, and devcie is ready to remove. in this case, devTLB invalidate
request is sent before device link to be brought down or device power
to be turned off. so it doesn't trigger the hard lockup.
2. supprise_removal
Users remove the devece directly or bring the device link down/turn off
device power first by setting pci config space, link-down/not-present/
power-off are all handled by pciehp the same way "supprise_removal", in
such case, pciehp_ist() will flag the device as "disconnected" first, then
unconfig the devcie, unload driver, iommu release device(issing devTLB
flush)
delete device. so we checking the device state could work such cases.
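A heavily simplified paraphrase of the surprise-removal path in
pciehp_unconfigure_device() (drivers/pci/hotplug/pciehp_pci.c), where
pci_dev_set_disconnected() is the PCI core's internal helper:

/*
 * Sketch only: when the card is no longer present, every device below
 * the slot is marked pci_channel_io_perm_failure before it is
 * unconfigured, which is what pci_dev_is_disconnected() observes later.
 */
static void unconfigure_sketch(struct pci_bus *parent, bool presence)
{
	if (!presence)
		pci_walk_bus(parent, pci_dev_set_disconnected, NULL);

	/* ... then pci_stop_and_remove_bus_device() on each child ... */
}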
But I am still thinking about whether there are other windows.
Thanks,
Ethan
>> What happens if IOMMU issues device TLB invalidation after link down but
>> before pci_dev_is_disconnected() returns true?
>
> Seems it wouldn't happen with hotplug cases, safe_removal or
>
> supprise_removal.
>
>
>
> Thanks,
>
> Ethan
>
>>
>> Best regards,
>> baolu
On 2023/12/15 9:03, Ethan Zhao wrote:
>
> 2. supprise_removal
>
> Users remove the devece directly or bring the device link down/turn off
>
> device power first by setting pci config space, link-down/not-present/
>
> power-off are all handled by pciehp the same way "supprise_removal", in
>
> such case, pciehp_ist() will flag the device as "disconnected" first,
> then
>
> unconfig the devcie, unload driver, iommu release device(issing devTLB
> flush)
>
> delete device. so we checking the device state could work such cases.
If so, then it is fine for the iommu driver. As Robin said, if the
device needs more cleanup, the iommu core should register the right
callback with the driver core and handle it before the device goes
away.

Disabling PCI features seems to be a reasonable device cleanup. This
gives us another reason to move ATS enabling/disabling out of the iommu
subsystem. Once this is done, the device driver will enable ATS during
its probe and disable it during its release. There will be no such
workaround in the iommu driver anymore.
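A rough sketch of that split, purely illustrative (pci_enable_ats() and
pci_disable_ats() are the existing API; the driver callbacks and the
page-shift choice are made up for the example):

#include <linux/pci.h>
#include <linux/pci-ats.h>

static int example_probe(struct pci_dev *pdev,
			 const struct pci_device_id *id)
{
	/* The endpoint driver, not the iommu driver, opts in to ATS. */
	if (pci_enable_ats(pdev, PAGE_SHIFT))
		dev_info(&pdev->dev, "continuing without ATS\n");

	return 0;
}

static void example_remove(struct pci_dev *pdev)
{
	/* Orderly release: ATS is torn down before the device goes away. */
	pci_disable_ats(pdev);
}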
Best regards,
baolu
On 12/15/2023 9:34 AM, Baolu Lu wrote:
> On 2023/12/15 9:03, Ethan Zhao wrote:
>>
>> 2. supprise_removal
>>
>> Users remove the devece directly or bring the device link down/turn
>> off
>>
>> device power first by setting pci config space, link-down/not-present/
>>
>> power-off are all handled by pciehp the same way
>> "supprise_removal", in
>>
>> such case, pciehp_ist() will flag the device as "disconnected"
>> first, then
>>
>> unconfig the devcie, unload driver, iommu release device(issing
>> devTLB flush)
>>
>> delete device. so we checking the device state could work such cases.
>
> If so, then it is fine for the iommu driver. As Robin said, if the
> device needs more cleanup, the iommu core should register a right
> callback to the driver core and handle it before the device goes away.
>
> Disabling PCI features seems to be a reasonable device cleanup. This
> gives us another reason to move ATS enabling/disabling out from the
For surprise removal, the device was already removed or powered off by
the time the iommu device-release notification arrives, so invoking a
cleanup callback to disable ATS on a not-present device etc. seems
pointless to me. Perhaps I misunderstand?
Thanks,
Ethan
> iommu subsystem. Once this is done, the device driver will enable ATS
> during its probe and disable it during its release. There will be no
> such workaround in the iommu driver anymore.
>
> Best regards,
> baolu
On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
> additional pre-removal notifier to take care of that, and that step would
> then want to distinguish between an orderly removal where cleaning up is
> somewhat meaningful, and a surprise removal where it definitely isn't.
Even if a user starts the process for orderly removal, the device may be
surprise-removed *during* that process. So we cannot assume that the
device is actually accessible if orderly removal has been initiated.
If the form factor supports surprise removal, the device may be gone
at any time.
Thanks,
Lukas
On Thu, Dec 14, 2023 at 08:58:49AM +0800, Ethan Zhao wrote:
> On 12/13/2023 6:49 PM, Lukas Wunner wrote:
> > On Tue, Dec 12, 2023 at 10:46:36PM -0500, Ethan Zhao wrote:
> > > move pci_dev_is_disconnected() from driver/pci/pci.h to public
> > > include/linux/pci.h for other driver's reference.
> > > no function change.
> >
> > That's merely a prose description of the code. A reader can already
> > see from the code what it's doing. You need to explain the *reason*
> > for the change instead. E.g.: "Make pci_dev_is_disconnected() public
> > so that it can be called from $DRIVER to speed up hot removal
> > handling which may otherwise take seconds because of $REASONS."
>
> Yup, why I made it public. then how about
>
> "Make pci_dev_is_disconnected() public so that it can be called from
> Intel vt-d driver to check the device's hotplug removal state when
> issue devTLB flush request."
Much better.
You may optionally want to point out the location of the file in the
source tree because not everyone may be familiar where to find the
"Intel vt-d driver". Also, not every reader may know where issuing
of devTLB flush requests occurs, so it might make sense to name the
function where that happens. Finally, it is common to adhere to terms
used in the PCIe Base Spec in commit messages, so "ATC Invalidate Request"
might be preferable to "devTLB flush request".
Thanks,
Lukas
On 2023-12-21 10:42 am, Lukas Wunner wrote:
> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>> additional pre-removal notifier to take care of that, and that step would
>> then want to distinguish between an orderly removal where cleaning up is
>> somewhat meaningful, and a surprise removal where it definitely isn't.
>
> Even if a user starts the process for orderly removal, the device may be
> surprise-removed *during* that process. So we cannot assume that the
> device is actually accessible if orderly removal has been initiated.
> If the form factor supports surprise removal, the device may be gone
> at any time.
Sure, whatever we do there's always going to be some unavoidable
time-of-check-to-time-of-use race window so we can never guarantee that
sending a request to the device will succeed. I was just making the
point that if we *have* already detected a surprise removal, then
cleaning up its leftover driver model state should still generate a
BUS_NOTIFY_REMOVE_DEVICE call, but in that case we can know there's no
point trying to send any requests to the device that's already gone.
Thanks,
Robin.
On Thu, Dec 21, 2023 at 11:01:56AM +0000, Robin Murphy wrote:
> On 2023-12-21 10:42 am, Lukas Wunner wrote:
> > On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
> > > I think if we want to ensure ATCs are invalidated on hot-unplug we need an
> > > additional pre-removal notifier to take care of that, and that step would
> > > then want to distinguish between an orderly removal where cleaning up is
> > > somewhat meaningful, and a surprise removal where it definitely isn't.
> >
> > Even if a user starts the process for orderly removal, the device may be
> > surprise-removed *during* that process. So we cannot assume that the
> > device is actually accessible if orderly removal has been initiated.
> > If the form factor supports surprise removal, the device may be gone
> > at any time.
>
> Sure, whatever we do there's always going to be some unavoidable
> time-of-check-to-time-of-use race window so we can never guarantee that
> sending a request to the device will succeed. I was just making the point
> that if we *have* already detected a surprise removal, then cleaning up its
> leftover driver model state should still generate a BUS_NOTIFY_REMOVE_DEVICE
> call, but in that case we can know there's no point trying to send any
> requests to the device that's already gone.
Right, using pci_dev_is_disconnected() as a *speedup* when we
definitely know the device is gone, that's perfectly fine.
So in that sense the proposed patch is acceptable *after* this
series has been extended to make sure hard lockups can *never*
occur on unplug.
Thanks,
Lukas
On 12/21/2023 6:51 PM, Lukas Wunner wrote:
> On Thu, Dec 14, 2023 at 08:58:49AM +0800, Ethan Zhao wrote:
>> On 12/13/2023 6:49 PM, Lukas Wunner wrote:
>>> On Tue, Dec 12, 2023 at 10:46:36PM -0500, Ethan Zhao wrote:
>>>> move pci_dev_is_disconnected() from driver/pci/pci.h to public
>>>> include/linux/pci.h for other driver's reference.
>>>> no function change.
>>> That's merely a prose description of the code. A reader can already
>>> see from the code what it's doing. You need to explain the *reason*
>>> for the change instead. E.g.: "Make pci_dev_is_disconnected() public
>>> so that it can be called from $DRIVER to speed up hot removal
>>> handling which may otherwise take seconds because of $REASONS."
>> Yup, why I made it public. then how about
>>
>> "Make pci_dev_is_disconnected() public so that it can be called from
>> Intel vt-d driver to check the device's hotplug removal state when
>> issue devTLB flush request."
> Much better.
>
> You may optionally want to point out the location of the file in the
> source tree because not everyone may be familiar where to find the
> "Intel vt-d driver". Also, not every reader may know where issuing
> of devTLB flush requests occurs, so it might make sense to name the
> function where that happens. Finally, it is common to adhere to terms
> used in the PCIe Base Spec in commit messages, so "ATC Invalidate Request"
> might be preferable to "devTLB flush request".
"ATS Invalidate Request"? "devTLB flush request" has the same meaning;
I thought all iommu/PCIe folks would understand it.
Thanks,
Ethan
>
> Thanks,
>
> Lukas
On 12/21/2023 6:42 PM, Lukas Wunner wrote:
> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>> additional pre-removal notifier to take care of that, and that step would
>> then want to distinguish between an orderly removal where cleaning up is
>> somewhat meaningful, and a surprise removal where it definitely isn't.
> Even if a user starts the process for orderly removal, the device may be
> surprise-removed *during* that process. So we cannot assume that the
> device is actually accessible if orderly removal has been initiated.
> If the form factor supports surprise removal, the device may be gone
There is no hardware lock to prevent the user from powering off or
removing a surprise-removal capable device after the device connection
state has been checked but before the ATS invalidation request is
issued, so a request with no target can still be sent.
Thanks,
Ethan
> at any time.
>
> Thanks,
>
> Lukas
>