2023-12-28 17:02:25

by Ethan Zhao

[permalink] [raw]
Subject: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device

This patchset is used to fix vt-d hard lockup reported when surprise
unplug ATS capable endpoint device connects to system via PCIe switch
as following topology.

+-[0000:15]-+-00.0 Intel Corporation Ice Lake Memory Map/VT-d
| +-00.1 Intel Corporation Ice Lake Mesh 2 PCIe
| +-00.2 Intel Corporation Ice Lake RAS
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0
NVIDIA Corporation Device 2324
| +-01.0-[19]----00.0
Mellanox Technologies MT2910 Family [ConnectX-7]

User brought endpoint device 19:00.0's link down by flapping it's hotplug
capable slot 17:01.0 link control register, as sequence DLLSC response,
pciehp_ist() will unload device driver and power it off, durning device
driver is unloading an iommu device-TLB invalidation (Intel VT-d spec, or
'ATS Invalidation' in PCIe spec) request issued to that link down device,
thus a long time completion/timeout waiting in interrupt context causes
continuous hard lockup warnning and system hang.

Other detail, see every patch commit log.

patch [3&4] were tested by [email protected] on stable v6.7-rc4.
patch [1-5] passed compiling on stable v6.7-rc6.


change log:
v10:
- refactor qi_submit_sync() and its callers to get pci_dev instance, as
Kevin pointed out add target_flush_dev to iommu is not right.
v9:
- unify all spelling of ATS Invalidation adhere to PCIe spec per Bjorn's
suggestion.
v8:
- add a patch to break the loop for timeout device-TLB invalidation, as
Bjorn said there is possibility device just no response but not gone.
v7:
- reorder patches and revise commit log per Bjorn's guide.
- other code and commit log revise per Lukas' suggestion.
- rebased to stable v6.7-rc6.
v6:
- add two patches to break out device-TLB invalidation if device is gone.
v5:
- add a patch try to fix the rare case (surprise remove a device in
safe removal process). not work because surprise removal handling can't
re-enter when another safe removal is in process.
v4:
- move the PCI device state checking after ATS per Baolu's suggestion.
v3:
- fix commit description typo.
v2:
- revise commit[1] description part according to Lukas' suggestion.
- revise commit[2] description to clarify the issue's impact.
v1:
- https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@
linux.intel.com/T/


Thanks,
Ethan


Ethan Zhao (5):
iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor
callers
iommu/vt-d: break out ATS Invalidation if target device is gone
PCI: make pci_dev_is_disconnected() helper public for other drivers
iommu/vt-d: don't issue ATS Invalidation request when device is
disconnected
iommu/vt-d: don't loop for timeout ATS Invalidation request forever

drivers/iommu/intel/dmar.c | 55 ++++++++++++++++++++++-------
drivers/iommu/intel/iommu.c | 26 ++++----------
drivers/iommu/intel/iommu.h | 17 +++++----
drivers/iommu/intel/irq_remapping.c | 2 +-
drivers/iommu/intel/pasid.c | 13 +++----
drivers/iommu/intel/svm.c | 13 ++++---
drivers/pci/pci.h | 5 ---
include/linux/pci.h | 5 +++
8 files changed, 74 insertions(+), 62 deletions(-)

--
2.31.1



2023-12-28 17:02:41

by Ethan Zhao

[permalink] [raw]
Subject: [RFC PATCH v10 1/5] iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor callers

Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/iommu/intel/dmar.c | 45 +++++++++++++++++++++--------
drivers/iommu/intel/iommu.c | 26 +++++------------
drivers/iommu/intel/iommu.h | 17 +++++------
drivers/iommu/intel/irq_remapping.c | 2 +-
drivers/iommu/intel/pasid.c | 11 ++-----
drivers/iommu/intel/svm.c | 13 ++++-----
6 files changed, 58 insertions(+), 56 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 23cb80d62a9a..3d661f2b7946 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1344,7 +1344,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
* can be part of the submission but it will not be polled for completion.
*/
int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
- unsigned int count, unsigned long options)
+ unsigned int count, unsigned long options, struct pci_dev *pdev)
{
struct q_inval *qi = iommu->qi;
s64 devtlb_start_ktime = 0;
@@ -1476,7 +1476,7 @@ void qi_global_iec(struct intel_iommu *iommu)
desc.qw3 = 0;

/* should never fail */
- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, NULL);
}

void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm,
@@ -1490,7 +1490,7 @@ void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm,
desc.qw2 = 0;
desc.qw3 = 0;

- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, NULL);
}

void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
@@ -1514,14 +1514,25 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
desc.qw2 = 0;
desc.qw3 = 0;

- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, NULL);
}

-void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
- u16 qdep, u64 addr, unsigned mask)
+void qi_flush_dev_iotlb(struct intel_iommu *iommu,
+ struct device_domain_info *info, u64 addr, unsigned mask)
{
+ struct pci_dev *pdev = NULL;
+ u16 sid, qdep, pfsid;
struct qi_desc desc;

+ if (!info || !info->ats_enabled)
+ return;
+
+ sid = info->bus << 8 | info->devfn;
+ qdep = info->ats_qdep;
+ pfsid = info->pfsid;
+
+ if (dev_is_pci(info->dev))
+ pdev = to_pci_dev(info->dev);
/*
* VT-d spec, section 4.3:
*
@@ -1545,7 +1556,7 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
desc.qw2 = 0;
desc.qw3 = 0;

- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, pdev);
}

/* PASID-based IOTLB invalidation */
@@ -1586,16 +1597,26 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
QI_EIOTLB_AM(mask);
}

- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, NULL);
}

/* PASID-based device IOTLB Invalidate */
-void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
- u32 pasid, u16 qdep, u64 addr, unsigned int size_order)
+void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu,
+ struct device_domain_info *info, u64 addr, u32 pasid,
+ unsigned int size_order)
{
unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
+ struct pci_dev *pdev = NULL;
+ u16 sid, qdep, pfsid;
+
+ if (!info || !dev_is_pci(info->dev))
+ return;
+ pdev = to_pci_dev(info->dev);

+ sid = info->bus << 8 | info->devfn;
+ qdep = info->ats_qdep;
+ pfsid = info->pfsid;
/*
* VT-d spec, section 4.3:
*
@@ -1639,7 +1660,7 @@ void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
desc.qw1 |= QI_DEV_EIOTLB_SIZE;
}

- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, pdev);
}

void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did,
@@ -1649,7 +1670,7 @@ void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did,

desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) |
QI_PC_GRAN(granu) | QI_PC_TYPE;
- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, NULL);
}

/*
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 897159dba47d..4dfbb493cd85 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1454,16 +1454,11 @@ static void iommu_disable_pci_caps(struct device_domain_info *info)
static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
u64 addr, unsigned int mask)
{
- u16 sid, qdep;
-
if (!info || !info->ats_enabled)
return;

- sid = info->bus << 8 | info->devfn;
- qdep = info->ats_qdep;
- qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
- qdep, addr, mask);
- quirk_extra_dev_tlb_flush(info, addr, mask, IOMMU_NO_PASID, qdep);
+ qi_flush_dev_iotlb(info->iommu, info, addr, mask);
+ quirk_extra_dev_tlb_flush(info, addr, IOMMU_NO_PASID, mask);
}

static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
@@ -1486,11 +1481,7 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
if (!info->ats_enabled)
continue;

- qi_flush_dev_iotlb_pasid(info->iommu,
- PCI_DEVID(info->bus, info->devfn),
- info->pfsid, dev_pasid->pasid,
- info->ats_qdep, addr,
- mask);
+ qi_flush_dev_iotlb_pasid(info->iommu, info, addr, dev_pasid->pasid, mask);
}
spin_unlock_irqrestore(&domain->lock, flags);
}
@@ -5183,9 +5174,8 @@ static void __init check_tylersburg_isoch(void)
*
* As a reminder, #6 will *NEED* this quirk as we enable nested translation.
*/
-void quirk_extra_dev_tlb_flush(struct device_domain_info *info,
- unsigned long address, unsigned long mask,
- u32 pasid, u16 qdep)
+void quirk_extra_dev_tlb_flush(struct device_domain_info *info, u32 pasid,
+ unsigned long address, unsigned long mask)
{
u16 sid;

@@ -5194,11 +5184,9 @@ void quirk_extra_dev_tlb_flush(struct device_domain_info *info,

sid = PCI_DEVID(info->bus, info->devfn);
if (pasid == IOMMU_NO_PASID) {
- qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
- qdep, address, mask);
+ qi_flush_dev_iotlb(info->iommu, info, address, mask);
} else {
- qi_flush_dev_iotlb_pasid(info->iommu, sid, info->pfsid,
- pasid, qdep, address, mask);
+ qi_flush_dev_iotlb_pasid(info->iommu, info, address, pasid, mask);
}
}

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index ce030c5b5772..c19d93fd72e9 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -865,23 +865,22 @@ void qi_flush_context(struct intel_iommu *iommu, u16 did,
u16 sid, u8 fm, u64 type);
void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
unsigned int size_order, u64 type);
-void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
- u16 qdep, u64 addr, unsigned mask);
+void qi_flush_dev_iotlb(struct intel_iommu *iommu,
+ struct device_domain_info *info, u64 addr, unsigned mask);

void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
unsigned long npages, bool ih);

-void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
- u32 pasid, u16 qdep, u64 addr,
- unsigned int size_order);
-void quirk_extra_dev_tlb_flush(struct device_domain_info *info,
- unsigned long address, unsigned long pages,
- u32 pasid, u16 qdep);
+void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu,
+ struct device_domain_info *info, u64 addr, u32 pasid,
+ unsigned int size_order);
+void quirk_extra_dev_tlb_flush(struct device_domain_info *info, u32 pasid,
+ unsigned long address, unsigned long mask);
void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu,
u32 pasid);

int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
- unsigned int count, unsigned long options);
+ unsigned int count, unsigned long options, struct pci_dev *pdev);
/*
* Options used in qi_submit_sync:
* QI_OPT_WAIT_DRAIN - Wait for PRQ drain completion, spec 6.5.2.8.
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 29b9e55dcf26..f834afa3672d 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -153,7 +153,7 @@ static int qi_flush_iec(struct intel_iommu *iommu, int index, int mask)
desc.qw2 = 0;
desc.qw3 = 0;

- return qi_submit_sync(iommu, &desc, 1, 0);
+ return qi_submit_sync(iommu, &desc, 1, 0, NULL);
}

static int modify_irte(struct irq_2_iommu *irq_iommu,
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..715943531091 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -467,7 +467,7 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
desc.qw2 = 0;
desc.qw3 = 0;

- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, NULL);
}

static void
@@ -475,16 +475,11 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
struct device *dev, u32 pasid)
{
struct device_domain_info *info;
- u16 sid, qdep, pfsid;

info = dev_iommu_priv_get(dev);
if (!info || !info->ats_enabled)
return;

- sid = info->bus << 8 | info->devfn;
- qdep = info->ats_qdep;
- pfsid = info->pfsid;
-
/*
* When PASID 0 is used, it indicates RID2PASID(DMA request w/o PASID),
* devTLB flush w/o PASID should be used. For non-zero PASID under
@@ -492,9 +487,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
* efficient to flush devTLB specific to the PASID.
*/
if (pasid == IOMMU_NO_PASID)
- qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
+ qi_flush_dev_iotlb(iommu, info, 0, 64 - VTD_PAGE_SHIFT);
else
- qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);
+ qi_flush_dev_iotlb_pasid(iommu, info, 0, pasid, 64 - VTD_PAGE_SHIFT);
}

void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index ac12f76c1212..02719b96e6df 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -181,11 +181,10 @@ static void __flush_svm_range_dev(struct intel_svm *svm,

qi_flush_piotlb(sdev->iommu, sdev->did, svm->pasid, address, pages, ih);
if (info->ats_enabled) {
- qi_flush_dev_iotlb_pasid(sdev->iommu, sdev->sid, info->pfsid,
- svm->pasid, sdev->qdep, address,
+ qi_flush_dev_iotlb_pasid(sdev->iommu, info, address, svm->pasid,
+ order_base_2(pages));
+ quirk_extra_dev_tlb_flush(info, svm->pasid, address,
order_base_2(pages));
- quirk_extra_dev_tlb_flush(info, address, order_base_2(pages),
- svm->pasid, sdev->qdep);
}
}

@@ -543,7 +542,7 @@ void intel_drain_pasid_prq(struct device *dev, u32 pasid)
QI_DEV_IOTLB_PFSID(info->pfsid);
qi_retry:
reinit_completion(&iommu->prq_complete);
- qi_submit_sync(iommu, desc, 3, QI_OPT_WAIT_DRAIN);
+ qi_submit_sync(iommu, desc, 3, QI_OPT_WAIT_DRAIN, NULL);
if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO) {
wait_for_completion(&iommu->prq_complete);
goto qi_retry;
@@ -646,7 +645,7 @@ static void handle_bad_prq_event(struct intel_iommu *iommu,
desc.qw3 = 0;
}

- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, NULL);
}

static irqreturn_t prq_event_thread(int irq, void *d)
@@ -811,7 +810,7 @@ int intel_svm_page_response(struct device *dev,
ktime_to_ns(ktime_get()) - prm->private_data[0]);
}

- qi_submit_sync(iommu, &desc, 1, 0);
+ qi_submit_sync(iommu, &desc, 1, 0, NULL);
}
out:
return ret;
--
2.31.1


2023-12-28 17:06:01

by Ethan Zhao

[permalink] [raw]
Subject: [RFC PATCH v10 2/5] iommu/vt-d: break out ATS Invalidation if target device is gone

For those endpoint devices connect to system via hotplug capable ports,
users could request a warm reset to the device by flapping device's link
through setting the slot's link control register, as pciehp_ist() DLLSC
interrupt sequence response, pciehp will unload the device driver and
then power it off. thus cause an IOMMU device-TLB invalidation (Intel
VT-d spec, or ATS Invalidation in PCIe spec r6.1) request for device to
be sent and a long time completion/timeout waiting in interrupt context.

That would cause following continuous hard lockup warning and system hang

[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629] intel_iommu_release_device+0x1f/0x30
[ 4223.822629] iommu_release_device+0x33/0x60
[ 4223.822629] iommu_bus_notifier+0x7f/0x90
[ 4223.822630] blocking_notifier_call_chain+0x60/0x90
[ 4223.822630] device_del+0x2e5/0x420
[ 4223.822630] pci_remove_bus_device+0x70/0x110
[ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631] pciehp_disable_slot+0x6b/0x100
[ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631] pciehp_ist+0x176/0x180
[ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632] irq_thread_fn+0x19/0x50
[ 4223.822632] irq_thread+0x104/0x190
[ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633] kthread+0x114/0x130
[ 4223.822633] ? __kthread_cancel_work+0x40/0x40
[ 4223.822633] ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
OE kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634] <NMI>
[ 4223.822635] dump_stack+0x6d/0x88
[ 4223.822635] panic+0x101/0x2d0
[ 4223.822635] ? ret_from_fork+0x11/0x30
[ 4223.822635] nmi_panic.cold.14+0xc/0xc
[ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636] __perf_event_overflow+0x4f/0xf0
[ 4223.822636] handle_pmi_common+0x1ef/0x290
[ 4223.822636] ? __set_pte_vaddr+0x28/0x40
[ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637] ? __native_set_fixmap+0x24/0x30
[ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638] perf_event_nmi_handler+0x24/0x40
[ 4223.822638] nmi_handle+0x4d/0xf0
[ 4223.822638] default_do_nmi+0x49/0x100
[ 4223.822638] exc_nmi+0x134/0x180
[ 4223.822639] end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] ? qi_submit_sync+0x2c0/0x490
[ 4223.822642] </NMI>
[ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643] intel_iommu_release_device+0x1f/0x30
[ 4223.822643] iommu_release_device+0x33/0x60
[ 4223.822643] iommu_bus_notifier+0x7f/0x90
[ 4223.822644] blocking_notifier_call_chain+0x60/0x90
[ 4223.822644] device_del+0x2e5/0x420
[ 4223.822644] pci_remove_bus_device+0x70/0x110
[ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644] pciehp_disable_slot+0x6b/0x100
[ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645] pciehp_ist+0x176/0x180
[ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645] irq_thread_fn+0x19/0x50
[ 4223.822646] irq_thread+0x104/0x190
[ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646] kthread+0x114/0x130
[ 4223.822647] ? __kthread_cancel_work+0x40/0x40
[ 4223.822647] ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)

Furthermore even an in-process safe removal unplugged device could be
surprise removed anytime, thus need to check the ATS Invalidation target
device state to see if it is gone, and don't wait for the completion/
timeout blindly, thus avoid the up to 1min+50% (see Implementation Note
in PCIe spec r6.1 sec 10.3.1) waiting and cause hard lockup or system
hang.

Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/iommu/intel/dmar.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 3d661f2b7946..0a8d628a42ee 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1423,6 +1423,14 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);

while (qi->desc_status[wait_index] != QI_DONE) {
+ /*
+ * if the device-TLB invalidation target device is gone, don't
+ * wait anymore, it might take up to 1min+50%, causes system
+ * hang. (see Implementation Note in PCIe spec r6.1 sec 10.3.1)
+ */
+ if ((type == QI_DIOTLB_TYPE || type == QI_DEIOTLB_TYPE) && pdev)
+ if (!pci_device_is_present(pdev))
+ break;
/*
* We will leave the interrupts disabled, to prevent interrupt
* context to queue another cmd while a cmd is already submitted
--
2.31.1


2023-12-29 08:19:09

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device

> From: Ethan Zhao <[email protected]>
> Sent: Friday, December 29, 2023 1:02 AM
>
> change log:
> v10:
> - refactor qi_submit_sync() and its callers to get pci_dev instance, as
> Kevin pointed out add target_flush_dev to iommu is not right.

let's not rush for new versions when there are still opens unclosed in
previous one (and considering most related folks are in vacation).

2023-12-29 09:29:06

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device


On 12/29/2023 4:18 PM, Tian, Kevin wrote:
>> From: Ethan Zhao <[email protected]>
>> Sent: Friday, December 29, 2023 1:02 AM
>>
>> change log:
>> v10:
>> - refactor qi_submit_sync() and its callers to get pci_dev instance, as
>> Kevin pointed out add target_flush_dev to iommu is not right.
> let's not rush for new versions when there are still opens unclosed in
> previous one (and considering most related folks are in vacation).

Just have some hours these days to make it in well shape.

Okay, let's think, and wait for new comments.


Thanks,

Ethan


2024-01-09 01:24:25

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device


On 12/29/2023 1:02 AM, Ethan Zhao wrote:
> This patchset is used to fix vt-d hard lockup reported when surprise
> unplug ATS capable endpoint device connects to system via PCIe switch
> as following topology.
>
> +-[0000:15]-+-00.0 Intel Corporation Ice Lake Memory Map/VT-d
> | +-00.1 Intel Corporation Ice Lake Mesh 2 PCIe
> | +-00.2 Intel Corporation Ice Lake RAS
> | +-00.4 Intel Corporation Device 0b23
> | \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0
> NVIDIA Corporation Device 2324
> | +-01.0-[19]----00.0
> Mellanox Technologies MT2910 Family [ConnectX-7]
>
> User brought endpoint device 19:00.0's link down by flapping it's hotplug
> capable slot 17:01.0 link control register, as sequence DLLSC response,
> pciehp_ist() will unload device driver and power it off, durning device
> driver is unloading an iommu device-TLB invalidation (Intel VT-d spec, or
> 'ATS Invalidation' in PCIe spec) request issued to that link down device,
> thus a long time completion/timeout waiting in interrupt context causes
> continuous hard lockup warnning and system hang.
>
> Other detail, see every patch commit log.
>
> patch [3&4] were tested by [email protected] on stable v6.7-rc4.
> patch [1-5] passed compiling on stable v6.7-rc6.
>
>
> change log:
> v10:
> - refactor qi_submit_sync() and its callers to get pci_dev instance, as
> Kevin pointed out add target_flush_dev to iommu is not right.
> v9:
> - unify all spelling of ATS Invalidation adhere to PCIe spec per Bjorn's
> suggestion.
> v8:
> - add a patch to break the loop for timeout device-TLB invalidation, as
> Bjorn said there is possibility device just no response but not gone.
> v7:
> - reorder patches and revise commit log per Bjorn's guide.
> - other code and commit log revise per Lukas' suggestion.
> - rebased to stable v6.7-rc6.
> v6:
> - add two patches to break out device-TLB invalidation if device is gone.
> v5:
> - add a patch try to fix the rare case (surprise remove a device in
> safe removal process). not work because surprise removal handling can't
> re-enter when another safe removal is in process.
> v4:
> - move the PCI device state checking after ATS per Baolu's suggestion.
> v3:
> - fix commit description typo.
> v2:
> - revise commit[1] description part according to Lukas' suggestion.
> - revise commit[2] description to clarify the issue's impact.
> v1:
> - https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@
> linux.intel.com/T/
>
>
> Thanks,
> Ethan
>
>
> Ethan Zhao (5):
> iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor
> callers
> iommu/vt-d: break out ATS Invalidation if target device is gone
> PCI: make pci_dev_is_disconnected() helper public for other drivers
> iommu/vt-d: don't issue ATS Invalidation request when device is
> disconnected
> iommu/vt-d: don't loop for timeout ATS Invalidation request forever
>
> drivers/iommu/intel/dmar.c | 55 ++++++++++++++++++++++-------
> drivers/iommu/intel/iommu.c | 26 ++++----------
> drivers/iommu/intel/iommu.h | 17 +++++----
> drivers/iommu/intel/irq_remapping.c | 2 +-
> drivers/iommu/intel/pasid.c | 13 +++----
> drivers/iommu/intel/svm.c | 13 ++++---
> drivers/pci/pci.h | 5 ---
> include/linux/pci.h | 5 +++
> 8 files changed, 74 insertions(+), 62 deletions(-)

Any new comment for this patchset ?


Thanks,

Ethan


2024-01-10 05:04:46

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v10 1/5] iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor callers

On 12/29/23 1:02 AM, Ethan Zhao wrote:
> Signed-off-by: Ethan Zhao<[email protected]>

Please don't leave the message body empty. You should describe why do
you want to add the change in this patch.

> ---
> drivers/iommu/intel/dmar.c | 45 +++++++++++++++++++++--------
> drivers/iommu/intel/iommu.c | 26 +++++------------
> drivers/iommu/intel/iommu.h | 17 +++++------
> drivers/iommu/intel/irq_remapping.c | 2 +-
> drivers/iommu/intel/pasid.c | 11 ++-----
> drivers/iommu/intel/svm.c | 13 ++++-----
> 6 files changed, 58 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> index 23cb80d62a9a..3d661f2b7946 100644
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -1344,7 +1344,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
> * can be part of the submission but it will not be polled for completion.
> */
> int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
> - unsigned int count, unsigned long options)
> + unsigned int count, unsigned long options, struct pci_dev *pdev)

How about adding a bit in options parameter to tell whether the @pdev is
valid?

Best regards,
baolu

2024-01-10 05:22:43

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v10 2/5] iommu/vt-d: break out ATS Invalidation if target device is gone

On 12/29/23 1:02 AM, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register, as pciehp_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU device-TLB invalidation (Intel
> VT-d spec, or ATS Invalidation in PCIe spec r6.1) request for device to
> be sent and a long time completion/timeout waiting in interrupt context.
>
> That would cause following continuous hard lockup warning and system hang
>
> [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
> [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
> [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
> OE kernel version xxxx
> [ 4223.822623] Hardware name: vendorname xxxx 666-106,
> BIOS 01.01.02.03.01 05/15/2023
> [ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
> [ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
> 0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
> [ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
> [ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
> [ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
> [ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
> [ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
> [ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
> [ 4223.822626] FS: 0000000000000000(0000) GS:ffffa237ae400000(0000)
> knlGS:0000000000000000
> [ 4223.822627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
> [ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 4223.822628] PKRU: 55555554
> [ 4223.822628] Call Trace:
> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
> [ 4223.822629] intel_iommu_release_device+0x1f/0x30
> [ 4223.822629] iommu_release_device+0x33/0x60
> [ 4223.822629] iommu_bus_notifier+0x7f/0x90
> [ 4223.822630] blocking_notifier_call_chain+0x60/0x90
> [ 4223.822630] device_del+0x2e5/0x420
> [ 4223.822630] pci_remove_bus_device+0x70/0x110
> [ 4223.822630] pciehp_unconfigure_device+0x7c/0x130
> [ 4223.822631] pciehp_disable_slot+0x6b/0x100
> [ 4223.822631] pciehp_handle_presence_or_link_change+0xd8/0x320
> [ 4223.822631] pciehp_ist+0x176/0x180
> [ 4223.822631] ? irq_finalize_oneshot.part.50+0x110/0x110
> [ 4223.822632] irq_thread_fn+0x19/0x50
> [ 4223.822632] irq_thread+0x104/0x190
> [ 4223.822632] ? irq_forced_thread_fn+0x90/0x90
> [ 4223.822632] ? irq_thread_check_affinity+0xe0/0xe0
> [ 4223.822633] kthread+0x114/0x130
> [ 4223.822633] ? __kthread_cancel_work+0x40/0x40
> [ 4223.822633] ret_from_fork+0x1f/0x30
> [ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
> [ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
> OE kernel version xxxx
> [ 4223.822634] Hardware name: vendorname xxxx 666-106,
> BIOS 01.01.02.03.01 05/15/2023
> [ 4223.822634] Call Trace:
> [ 4223.822634] <NMI>
> [ 4223.822635] dump_stack+0x6d/0x88
> [ 4223.822635] panic+0x101/0x2d0
> [ 4223.822635] ? ret_from_fork+0x11/0x30
> [ 4223.822635] nmi_panic.cold.14+0xc/0xc
> [ 4223.822636] watchdog_overflow_callback.cold.8+0x6d/0x81
> [ 4223.822636] __perf_event_overflow+0x4f/0xf0
> [ 4223.822636] handle_pmi_common+0x1ef/0x290
> [ 4223.822636] ? __set_pte_vaddr+0x28/0x40
> [ 4223.822637] ? flush_tlb_one_kernel+0xa/0x20
> [ 4223.822637] ? __native_set_fixmap+0x24/0x30
> [ 4223.822637] ? ghes_copy_tofrom_phys+0x70/0x100
> [ 4223.822637] ? __ghes_peek_estatus.isra.16+0x49/0xa0
> [ 4223.822637] intel_pmu_handle_irq+0xba/0x2b0
> [ 4223.822638] perf_event_nmi_handler+0x24/0x40
> [ 4223.822638] nmi_handle+0x4d/0xf0
> [ 4223.822638] default_do_nmi+0x49/0x100
> [ 4223.822638] exc_nmi+0x134/0x180
> [ 4223.822639] end_repeat_nmi+0x16/0x67
> [ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
> [ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
> 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
> 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
> [ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
> [ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
> [ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
> [ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
> [ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
> [ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
> [ 4223.822641] ? qi_submit_sync+0x2c0/0x490
> [ 4223.822642] ? qi_submit_sync+0x2c0/0x490
> [ 4223.822642] </NMI>
> [ 4223.822642] qi_flush_dev_iotlb+0xb1/0xd0
> [ 4223.822642] __dmar_remove_one_dev_info+0x224/0x250
> [ 4223.822643] dmar_remove_one_dev_info+0x3e/0x50
> [ 4223.822643] intel_iommu_release_device+0x1f/0x30
> [ 4223.822643] iommu_release_device+0x33/0x60
> [ 4223.822643] iommu_bus_notifier+0x7f/0x90
> [ 4223.822644] blocking_notifier_call_chain+0x60/0x90
> [ 4223.822644] device_del+0x2e5/0x420
> [ 4223.822644] pci_remove_bus_device+0x70/0x110
> [ 4223.822644] pciehp_unconfigure_device+0x7c/0x130
> [ 4223.822644] pciehp_disable_slot+0x6b/0x100
> [ 4223.822645] pciehp_handle_presence_or_link_change+0xd8/0x320
> [ 4223.822645] pciehp_ist+0x176/0x180
> [ 4223.822645] ? irq_finalize_oneshot.part.50+0x110/0x110
> [ 4223.822645] irq_thread_fn+0x19/0x50
> [ 4223.822646] irq_thread+0x104/0x190
> [ 4223.822646] ? irq_forced_thread_fn+0x90/0x90
> [ 4223.822646] ? irq_thread_check_affinity+0xe0/0xe0
> [ 4223.822646] kthread+0x114/0x130
> [ 4223.822647] ? __kthread_cancel_work+0x40/0x40
> [ 4223.822647] ret_from_fork+0x1f/0x30
> [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
> range: 0xffffffff80000000-0xffffffffbfffffff)
>
> Furthermore even an in-process safe removal unplugged device could be
> surprise removed anytime, thus need to check the ATS Invalidation target
> device state to see if it is gone, and don't wait for the completion/
> timeout blindly, thus avoid the up to 1min+50% (see Implementation Note
> in PCIe spec r6.1 sec 10.3.1) waiting and cause hard lockup or system
> hang.
>
> Signed-off-by: Ethan Zhao <[email protected]>
> ---
> drivers/iommu/intel/dmar.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> index 3d661f2b7946..0a8d628a42ee 100644
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -1423,6 +1423,14 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
> writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);
>
> while (qi->desc_status[wait_index] != QI_DONE) {
> + /*
> + * if the device-TLB invalidation target device is gone, don't
> + * wait anymore, it might take up to 1min+50%, causes system
> + * hang. (see Implementation Note in PCIe spec r6.1 sec 10.3.1)
> + */
> + if ((type == QI_DIOTLB_TYPE || type == QI_DEIOTLB_TYPE) && pdev)
> + if (!pci_device_is_present(pdev))
> + break;
> /*
> * We will leave the interrupts disabled, to prevent interrupt
> * context to queue another cmd while a cmd is already submitted

How about handing this in qi_check_fault() when it detects an ITE error?

qi_check_fault() should returns -ETIMEDOUT instead of -EAGAIN, if

- qi_submit_sync() is called for a device TLB invalidation request
(indicated by pdev is valid);
- device is not present.

Best regards,
baolu

2024-01-10 07:52:12

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 1/5] iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor callers


On 1/10/2024 12:59 PM, Baolu Lu wrote:
> On 12/29/23 1:02 AM, Ethan Zhao wrote:
>> Signed-off-by: Ethan Zhao<[email protected]>
>
> Please don't leave the message body empty. You should describe why do
> you want to add the change in this patch.
Seems the description part was lost, will append next version.
>
>> ---
>>   drivers/iommu/intel/dmar.c          | 45 +++++++++++++++++++++--------
>>   drivers/iommu/intel/iommu.c         | 26 +++++------------
>>   drivers/iommu/intel/iommu.h         | 17 +++++------
>>   drivers/iommu/intel/irq_remapping.c |  2 +-
>>   drivers/iommu/intel/pasid.c         | 11 ++-----
>>   drivers/iommu/intel/svm.c           | 13 ++++-----
>>   6 files changed, 58 insertions(+), 56 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>> index 23cb80d62a9a..3d661f2b7946 100644
>> --- a/drivers/iommu/intel/dmar.c
>> +++ b/drivers/iommu/intel/dmar.c
>> @@ -1344,7 +1344,7 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index)
>>    * can be part of the submission but it will not be polled for
>> completion.
>>    */
>>   int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
>> -           unsigned int count, unsigned long options)
>> +           unsigned int count, unsigned long options, struct pci_dev
>> *pdev)
>
> How about adding a bit in options parameter to tell whether the @pdev is
> valid?

well, checking the option bit or checking pdev == NULL, use one parameter

to describe another one is common function defination method if one

parameter couldn't self-describe.

This case, we always check pdev(one checking), and if we check option
bit first, then have

to check pdev again (one or two checking).  I prefer checking pdev only.


Thanks,

Ethan

>
> Best regards,
> baolu

2024-01-10 08:31:08

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 2/5] iommu/vt-d: break out ATS Invalidation if target device is gone


On 1/10/2024 1:17 PM, Baolu Lu wrote:
> On 12/29/23 1:02 AM, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehp_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU device-TLB invalidation (Intel
>> VT-d spec, or ATS Invalidation in PCIe spec r6.1) request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
>>
>> That would cause following continuous hard lockup warning and system
>> hang
>>
>> [ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
>> [ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not
>> present
>> [ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded
>> Tainted: G S
>>           OE    kernel version xxxx
>> [ 4223.822623] Hardware name: vendorname xxxx 666-106,
>> BIOS 01.01.02.03.01 05/15/2023
>> [ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
>> [ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f
>> 95 c1 48 8b
>>   57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34
>> <40> f6 c6 1
>> 0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>> [ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>> [ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX:
>> 0000000000000005
>> [ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI:
>> ffff9f38401a8340
>> [ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09:
>> 0000000000000000
>> [ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12:
>> ffff9f384005e200
>> [ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15:
>> 0000000000000004
>> [ 4223.822626] FS:  0000000000000000(0000) GS:ffffa237ae400000(0000)
>> knlGS:0000000000000000
>> [ 4223.822627] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4:
>> 0000000000770ee0
>> [ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
>> 0000000000000400
>> [ 4223.822628] PKRU: 55555554
>> [ 4223.822628] Call Trace:
>> [ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
>> [ 4223.822629]  intel_iommu_release_device+0x1f/0x30
>> [ 4223.822629]  iommu_release_device+0x33/0x60
>> [ 4223.822629]  iommu_bus_notifier+0x7f/0x90
>> [ 4223.822630]  blocking_notifier_call_chain+0x60/0x90
>> [ 4223.822630]  device_del+0x2e5/0x420
>> [ 4223.822630]  pci_remove_bus_device+0x70/0x110
>> [ 4223.822630]  pciehp_unconfigure_device+0x7c/0x130
>> [ 4223.822631]  pciehp_disable_slot+0x6b/0x100
>> [ 4223.822631]  pciehp_handle_presence_or_link_change+0xd8/0x320
>> [ 4223.822631]  pciehp_ist+0x176/0x180
>> [ 4223.822631]  ? irq_finalize_oneshot.part.50+0x110/0x110
>> [ 4223.822632]  irq_thread_fn+0x19/0x50
>> [ 4223.822632]  irq_thread+0x104/0x190
>> [ 4223.822632]  ? irq_forced_thread_fn+0x90/0x90
>> [ 4223.822632]  ? irq_thread_check_affinity+0xe0/0xe0
>> [ 4223.822633]  kthread+0x114/0x130
>> [ 4223.822633]  ? __kthread_cancel_work+0x40/0x40
>> [ 4223.822633]  ret_from_fork+0x1f/0x30
>> [ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
>> [ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded
>> Tainted: G S
>>           OE     kernel version xxxx
>> [ 4223.822634] Hardware name: vendorname xxxx 666-106,
>> BIOS 01.01.02.03.01 05/15/2023
>> [ 4223.822634] Call Trace:
>> [ 4223.822634]  <NMI>
>> [ 4223.822635]  dump_stack+0x6d/0x88
>> [ 4223.822635]  panic+0x101/0x2d0
>> [ 4223.822635]  ? ret_from_fork+0x11/0x30
>> [ 4223.822635]  nmi_panic.cold.14+0xc/0xc
>> [ 4223.822636]  watchdog_overflow_callback.cold.8+0x6d/0x81
>> [ 4223.822636]  __perf_event_overflow+0x4f/0xf0
>> [ 4223.822636]  handle_pmi_common+0x1ef/0x290
>> [ 4223.822636]  ? __set_pte_vaddr+0x28/0x40
>> [ 4223.822637]  ? flush_tlb_one_kernel+0xa/0x20
>> [ 4223.822637]  ? __native_set_fixmap+0x24/0x30
>> [ 4223.822637]  ? ghes_copy_tofrom_phys+0x70/0x100
>> [ 4223.822637]  ? __ghes_peek_estatus.isra.16+0x49/0xa0
>> [ 4223.822637]  intel_pmu_handle_irq+0xba/0x2b0
>> [ 4223.822638]  perf_event_nmi_handler+0x24/0x40
>> [ 4223.822638]  nmi_handle+0x4d/0xf0
>> [ 4223.822638]  default_do_nmi+0x49/0x100
>> [ 4223.822638]  exc_nmi+0x134/0x180
>> [ 4223.822639]  end_repeat_nmi+0x16/0x67
>> [ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
>> [ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f
>> 95 c1 48 8b
>>   57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34
>> <40> f6 c6 10
>>   74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
>> [ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
>> [ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX:
>> 0000000000000005
>> [ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI:
>> ffff9f38401a8340
>> [ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09:
>> 0000000000000000
>> [ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12:
>> ffff9f384005e200
>> [ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15:
>> 0000000000000004
>> [ 4223.822641]  ? qi_submit_sync+0x2c0/0x490
>> [ 4223.822642]  ? qi_submit_sync+0x2c0/0x490
>> [ 4223.822642]  </NMI>
>> [ 4223.822642]  qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822642]  __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822643]  dmar_remove_one_dev_info+0x3e/0x50
>> [ 4223.822643]  intel_iommu_release_device+0x1f/0x30
>> [ 4223.822643]  iommu_release_device+0x33/0x60
>> [ 4223.822643]  iommu_bus_notifier+0x7f/0x90
>> [ 4223.822644]  blocking_notifier_call_chain+0x60/0x90
>> [ 4223.822644]  device_del+0x2e5/0x420
>> [ 4223.822644]  pci_remove_bus_device+0x70/0x110
>> [ 4223.822644]  pciehp_unconfigure_device+0x7c/0x130
>> [ 4223.822644]  pciehp_disable_slot+0x6b/0x100
>> [ 4223.822645]  pciehp_handle_presence_or_link_change+0xd8/0x320
>> [ 4223.822645]  pciehp_ist+0x176/0x180
>> [ 4223.822645]  ? irq_finalize_oneshot.part.50+0x110/0x110
>> [ 4223.822645]  irq_thread_fn+0x19/0x50
>> [ 4223.822646]  irq_thread+0x104/0x190
>> [ 4223.822646]  ? irq_forced_thread_fn+0x90/0x90
>> [ 4223.822646]  ? irq_thread_check_affinity+0xe0/0xe0
>> [ 4223.822646]  kthread+0x114/0x130
>> [ 4223.822647]  ? __kthread_cancel_work+0x40/0x40
>> [ 4223.822647]  ret_from_fork+0x1f/0x30
>> [ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000
>> (relocation
>> range: 0xffffffff80000000-0xffffffffbfffffff)
>>
>> Furthermore even an in-process safe removal unplugged device could be
>> surprise removed anytime, thus need to check the ATS Invalidation target
>> device state to see if it is gone, and don't wait for the completion/
>> timeout blindly, thus avoid the up to 1min+50% (see Implementation Note
>> in PCIe spec r6.1 sec 10.3.1) waiting and cause hard lockup or system
>> hang.
>>
>> Signed-off-by: Ethan Zhao <[email protected]>
>> ---
>>   drivers/iommu/intel/dmar.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>> index 3d661f2b7946..0a8d628a42ee 100644
>> --- a/drivers/iommu/intel/dmar.c
>> +++ b/drivers/iommu/intel/dmar.c
>> @@ -1423,6 +1423,14 @@ int qi_submit_sync(struct intel_iommu *iommu,
>> struct qi_desc *desc,
>>       writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);
>>         while (qi->desc_status[wait_index] != QI_DONE) {
>> +        /*
>> +         * if the device-TLB invalidation target device is gone, don't
>> +         * wait anymore, it might take up to 1min+50%, causes system
>> +         * hang. (see Implementation Note in PCIe spec r6.1 sec 10.3.1)
>> +         */
>> +        if ((type == QI_DIOTLB_TYPE || type == QI_DEIOTLB_TYPE) &&
>> pdev)
>> +            if (!pci_device_is_present(pdev))
>> +                    break;
>>           /*
>>            * We will leave the interrupts disabled, to prevent interrupt
>>            * context to queue another cmd while a cmd is already
>> submitted
>
> How about handing this in qi_check_fault() when it detects an ITE error?

fold into qi_check_fault() looks reasonable, no response from endpoint

device is a kind of fault. my concern there is no real ITE there (it didn't

wait for enough time), but it predicts there would be a timeout, that

is weird if we describe the fact, -ENOTCONN would be more precise

(device is not conneted)  well -ETIMEDOUT could simplify the caller error

handling, the side effect is we have to add pdev parameter to
qi_check_fault()

too. then no need to check invalidition type of QI_IOTLB_TYPE &

QI_EIOTLB_TYPE in qi_check_fault() ? , seems we could save another

patch then, I am still not be convinced :), on the wall, not incline

to which side.

pros

- qi_submit_sync() could be simpler in error handling.

- qi_check_fault() does the right thing it should do.

- save another patch to break the loop.

cons

- more parameters to qi_check_fault()

- lost one opportunity to break loop while retry, but will bail out in
next try.



Thanks,

Ethan

>
> qi_check_fault() should returns -ETIMEDOUT instead of -EAGAIN, if
>
> - qi_submit_sync() is called for a device TLB invalidation request
>   (indicated by pdev is valid);
> - device is not present.
>
> Best regards,
> baolu

2024-01-15 07:59:04

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device


On 12/29/2023 1:02 AM, Ethan Zhao wrote:
> This patchset is used to fix vt-d hard lockup reported when surprise
> unplug ATS capable endpoint device connects to system via PCIe switch
> as following topology.
>
> +-[0000:15]-+-00.0 Intel Corporation Ice Lake Memory Map/VT-d
> | +-00.1 Intel Corporation Ice Lake Mesh 2 PCIe
> | +-00.2 Intel Corporation Ice Lake RAS
> | +-00.4 Intel Corporation Device 0b23
> | \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0
> NVIDIA Corporation Device 2324
> | +-01.0-[19]----00.0
> Mellanox Technologies MT2910 Family [ConnectX-7]
>
> User brought endpoint device 19:00.0's link down by flapping it's hotplug
> capable slot 17:01.0 link control register, as sequence DLLSC response,
> pciehp_ist() will unload device driver and power it off, durning device
> driver is unloading an iommu device-TLB invalidation (Intel VT-d spec, or
> 'ATS Invalidation' in PCIe spec) request issued to that link down device,
> thus a long time completion/timeout waiting in interrupt context causes
> continuous hard lockup warnning and system hang.
>
> Other detail, see every patch commit log.
>
> patch [3&4] were tested by [email protected] on stable v6.7-rc4.
> patch [1-5] passed compiling on stable v6.7-rc6.
>
>
> change log:
> v10:
> - refactor qi_submit_sync() and its callers to get pci_dev instance, as
> Kevin pointed out add target_flush_dev to iommu is not right.
> v9:
> - unify all spelling of ATS Invalidation adhere to PCIe spec per Bjorn's
> suggestion.
> v8:
> - add a patch to break the loop for timeout device-TLB invalidation, as
> Bjorn said there is possibility device just no response but not gone.
> v7:
> - reorder patches and revise commit log per Bjorn's guide.
> - other code and commit log revise per Lukas' suggestion.
> - rebased to stable v6.7-rc6.
> v6:
> - add two patches to break out device-TLB invalidation if device is gone.
> v5:
> - add a patch try to fix the rare case (surprise remove a device in
> safe removal process). not work because surprise removal handling can't
> re-enter when another safe removal is in process.
> v4:
> - move the PCI device state checking after ATS per Baolu's suggestion.
> v3:
> - fix commit description typo.
> v2:
> - revise commit[1] description part according to Lukas' suggestion.
> - revise commit[2] description to clarify the issue's impact.
> v1:
> - https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@
> linux.intel.com/T/
>
>
> Thanks,
> Ethan
>
>
> Ethan Zhao (5):
> iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor
> callers
> iommu/vt-d: break out ATS Invalidation if target device is gone
> PCI: make pci_dev_is_disconnected() helper public for other drivers
> iommu/vt-d: don't issue ATS Invalidation request when device is
> disconnected
> iommu/vt-d: don't loop for timeout ATS Invalidation request forever
>
> drivers/iommu/intel/dmar.c | 55 ++++++++++++++++++++++-------
> drivers/iommu/intel/iommu.c | 26 ++++----------
> drivers/iommu/intel/iommu.h | 17 +++++----
> drivers/iommu/intel/irq_remapping.c | 2 +-
> drivers/iommu/intel/pasid.c | 13 +++----
> drivers/iommu/intel/svm.c | 13 ++++---
> drivers/pci/pci.h | 5 ---
> include/linux/pci.h | 5 +++
> 8 files changed, 74 insertions(+), 62 deletions(-)

How aobut refactor the qi_submit_sync() and qi_check_fault() like

following, combination of patch

[2] iommu/vt-d: break out ATS Invalidation if target device is gone

[5] iommu/vt-d: don't loop for timeout ATS Invalidation request forever

but sending them in seperated patches seems better ? each of them

handling different case.

- fold additional errors/fault/exception handling into qi_check_fault()

- the detetion of target device presence use to handle surprise removal

 or device died /no response.

- the ITE part use to break out the timeout ATS invalidation request,

  use to handle the case response time of device is too long.

- if passed invalid target_pdev, means this is ATS invalidation request.

- no error handling change in qi_submit_sync().


Please comment.


--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1267,16 +1267,28 @@ static void qi_dump_fault(struct intel_iommu
*iommu, u32 fault)
               (unsigned long long)desc->qw1);
 }

-static int qi_check_fault(struct intel_iommu *iommu, int index, int
wait_index)
+static int qi_check_fault(struct intel_iommu *iommu, int index, int
wait_index,
+                  pci_dev *target_pdev)
 {
        u32 fault;
        int head, tail;
+       u64 iqe_err, ice_sid;
        struct q_inval *qi = iommu->qi;
        int shift = qi_shift(iommu);

        if (qi->desc_status[wait_index] == QI_ABORT)
                return -EAGAIN;

+       /*
+        * If the ATS invalidation target device is gone this moment
(surprise
+        * removed, died, no response) don't try this request again. this
+        * request will not get valid result anymore. but the request was
+        * already submitted to hardware and we predict to get a ITE in
+        * followed batch of request, if so, it will get handled then.
+        */
+       if (target_pdev && !pci_device_is_present(target_pdev))
+               return -EINVAL;
+
        fault = readl(iommu->reg + DMAR_FSTS_REG);
        if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
                qi_dump_fault(iommu, fault);
@@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu
*iommu, int index, int wait_index)
                tail = readl(iommu->reg + DMAR_IQT_REG);
                tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;

+               /*
+                * SID field is valid only when the ITE field is Set in
FSTS_REG
+                * see Intel VT-d spec r4.1, section 11.4.9.9
+                */
+               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
+               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
+
                writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
                pr_info("Invalidation Time-out Error (ITE) cleared\n");

@@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu
*iommu, int index, int wait_index)
                        head = (head - 2 + QI_LENGTH) % QI_LENGTH;
                } while (head != tail);

+               /*
+                * If got ITE, we need to check if the sid of ITE is the
same as
+                * current ATS invalidation target device, if yes, don't
try this
+                * request anymore, the target device has a response
time beyound
+                * expected. 0 value of ice_sid means old device, no
ice_sid value.
+                */
+               if (target_pdev && ice_sid && ice_sid ==
+                   pci_dev_id(pci_physfn(target_pdev))
+                               return -ETIMEDOUT;
+
                if (qi->desc_status[wait_index] == QI_ABORT)
                        return -EAGAIN;
        }
@@ -1344,7 +1373,7 @@ static int qi_check_fault(struct intel_iommu
*iommu, int index, int wait_index)
  * can be part of the submission but it will not be polled for completion.
  */
 int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
-                  unsigned int count, unsigned long options)
+                  unsigned int count, unsigned long options, pci_dev
*target_pdev)
 {
        struct q_inval *qi = iommu->qi;
        s64 devtlb_start_ktime = 0;
@@ -1430,7 +1459,7 @@ int qi_submit_sync(struct intel_iommu *iommu,
struct qi_desc *desc,
                 * a deadlock where the interrupt context can wait
indefinitely
                 * for free slots in the queue.
                 */
-               rc = qi_check_fault(iommu, index, wait_index);
+               rc = qi_check_fault(iommu, index, wait_index, target_pdev);
                if (rc)
                        break;


Thanks,

Ethan

>

2024-01-17 04:40:07

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device

On 2024/1/15 15:58, Ethan Zhao wrote:
> -static int qi_check_fault(struct intel_iommu *iommu, int index, int
> wait_index)
> +static int qi_check_fault(struct intel_iommu *iommu, int index, int
> wait_index,
> +                  pci_dev *target_pdev)
>  {
>         u32 fault;
>         int head, tail;
> +       u64 iqe_err, ice_sid;
>         struct q_inval *qi = iommu->qi;
>         int shift = qi_shift(iommu);
>
>         if (qi->desc_status[wait_index] == QI_ABORT)
>                 return -EAGAIN;
>
> +       /*
> +        * If the ATS invalidation target device is gone this moment
> (surprise
> +        * removed, died, no response) don't try this request again. this
> +        * request will not get valid result anymore. but the request was
> +        * already submitted to hardware and we predict to get a ITE in
> +        * followed batch of request, if so, it will get handled then.
> +        */

We can't leave the ITE triggered by this request for the next one, which
has no context about why this happened. Perhaps move below code down to
the segment that handles ITEs.

Another concern is about qi_dump_fault(), which pr_err's the fault
message as long as the register is set. Some faults are predictable,
such as cache invalidation for surprise-removed devices. Unconditionally
reporting errors with pr_err() may lead the user to believe that a more
serious hardware error has occurred. Probably we can refine this part of
the code as well.

Others look sane to me.

> +       if (target_pdev && !pci_device_is_present(target_pdev))
> +               return -EINVAL;
> +
>         fault = readl(iommu->reg + DMAR_FSTS_REG);
>         if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>                 qi_dump_fault(iommu, fault);
> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu
> *iommu, int index, int wait_index)
>                 tail = readl(iommu->reg + DMAR_IQT_REG);
>                 tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>
> +               /*
> +                * SID field is valid only when the ITE field is Set in
> FSTS_REG
> +                * see Intel VT-d spec r4.1, section 11.4.9.9
> +                */
> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
> +
>                 writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>                 pr_info("Invalidation Time-out Error (ITE) cleared\n");
>
> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu
> *iommu, int index, int wait_index)
>                         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>                 } while (head != tail);
>
> +               /*
> +                * If got ITE, we need to check if the sid of ITE is the
> same as
> +                * current ATS invalidation target device, if yes, don't
> try this
> +                * request anymore, the target device has a response
> time beyound
> +                * expected. 0 value of ice_sid means old device, no
> ice_sid value.
> +                */
> +               if (target_pdev && ice_sid && ice_sid ==
> +                   pci_dev_id(pci_physfn(target_pdev))
> +                               return -ETIMEDOUT;
> +
>                 if (qi->desc_status[wait_index] == QI_ABORT)
>                         return -EAGAIN;
>         }

Best regards,
baolu

2024-01-17 09:00:53

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device


On 1/17/2024 1:38 PM, Ethan Zhao wrote:
>
> On 1/17/2024 1:26 PM, Ethan Zhao wrote:
>>
>> On 1/17/2024 11:24 AM, Baolu Lu wrote:
>>> On 2024/1/15 15:58, Ethan Zhao wrote:
>>>> -static int qi_check_fault(struct intel_iommu *iommu, int index,
>>>> int wait_index)
>>>> +static int qi_check_fault(struct intel_iommu *iommu, int index,
>>>> int wait_index,
>>>> +                  pci_dev *target_pdev)
>>>>   {
>>>>          u32 fault;
>>>>          int head, tail;
>>>> +       u64 iqe_err, ice_sid;
>>>>          struct q_inval *qi = iommu->qi;
>>>>          int shift = qi_shift(iommu);
>>>>
>>>>          if (qi->desc_status[wait_index] == QI_ABORT)
>>>>                  return -EAGAIN;
>>>>
>>>> +       /*
>>>> +        * If the ATS invalidation target device is gone this
>>>> moment (surprise
>>>> +        * removed, died, no response) don't try this request
>>>> again. this
>>>> +        * request will not get valid result anymore. but the
>>>> request was
>>>> +        * already submitted to hardware and we predict to get a
>>>> ITE in
>>>> +        * followed batch of request, if so, it will get handled then.
>>>> +        */
>>>
>>> We can't leave the ITE triggered by this request for the next one,
>>> which
>>> has no context about why this happened. Perhaps move below code down to
>>> the segment that handles ITEs.
>>
>> Here, the invalidation request has been issued to hardware but target
>> device
>>
>> gone, we can't loop and wait for the ITE for this request to happen,
>> and we
>>
>> bail out here because we hold lock_irqsave lock , the ITE still could
>> happen
>>
>> with later batch request in the future,  though it is not triggered
>> by that request,
>>
>> but it could still be cleaned/handled. move it to the fault() segment
>> ?,there means
>
> That moment, the ITE was triggered by previous requests, they are not
> in current
>
> context, also shouldn't be retried, they have response time over
> expected.
>
> but triggered ITE fault blocks this patch request, we should retry
> this batch request.
>
> we just clean the fault and retry it.  nothing missed.
>
>
> Thanks,
>
> Ethan
>
>>
>> ITE already happened, no need to check target presence anymore.
>>
>> did I miss something about the context lost ?
>>
>>>
>>> Another concern is about qi_dump_fault(), which pr_err's the fault
>>> message as long as the register is set. Some faults are predictable,
>>> such as cache invalidation for surprise-removed devices.
>>> Unconditionally
>>> reporting errors with pr_err() may lead the user to believe that a more
>>> serious hardware error has occurred. Probably we can refine this
>>> part of
>>> the code as well.
>>
>> Agree, may refine them in seperated series ?
>>
>> loop and always retry IQE, ICE don't make sense per my
>> understanding.  if
>>
>> IQE happened retry it will always reproduce the fault, because
>> request is the same.
>>
>> we could fix them together in other patches.
>>
>>
>> Thanks,
>>
>> Ethan
>>
>>>
>>> Others look sane to me.
>>>
>>>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>>>> +               return -EINVAL;
>>>> +
>>>>          fault = readl(iommu->reg + DMAR_FSTS_REG);
>>>>          if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>>>>                  qi_dump_fault(iommu, fault);
>>>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu
>>>> *iommu, int index, int wait_index)
>>>>                  tail = readl(iommu->reg + DMAR_IQT_REG);
>>>>                  tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>>>
>>>> +               /*
>>>> +                * SID field is valid only when the ITE field is
>>>> Set in FSTS_REG
>>>> +                * see Intel VT-d spec r4.1, section 11.4.9.9
>>>> +                */
>>>> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>>>> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
>>>> +
>>>>                  writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>>>                  pr_info("Invalidation Time-out Error (ITE)
>>>> cleared\n");
>>>>
>>>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu
>>>> *iommu, int index, int wait_index)
>>>>                          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>                  } while (head != tail);
>>>>
>>>> +               /*
>>>> +                * If got ITE, we need to check if the sid of ITE
>>>> is the same as
>>>> +                * current ATS invalidation target device, if yes,
>>>> don't try this
>>>> +                * request anymore, the target device has a
>>>> response time beyound
>>>> +                * expected. 0 value of ice_sid means old device,
>>>> no ice_sid value.
>>>> +                */
>>>> +               if (target_pdev && ice_sid && ice_sid ==
>>>> +                   pci_dev_id(pci_physfn(target_pdev))
>>>> +                               return -ETIMEDOUT;
>>>> +
>>>>                  if (qi->desc_status[wait_index] == QI_ABORT)
>>>>                          return -EAGAIN;
>>>>          }
>>>
>>> Best regards,
>>> baolu
>>
>
new proposed qi_check_fault() change as following:

- remove the lines of

  if (qi->desc_status[wait_index] == QI_ABORT)

     return -EAGAIN;

  reason see the comment in the code.

- others, the same as last proposed code.


--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1267,15 +1267,33 @@ static void qi_dump_fault(struct intel_iommu
*iommu, u32 fault)
               (unsigned long long)desc->qw1);
 }

-static int qi_check_fault(struct intel_iommu *iommu, int index, int
wait_index)
+static int qi_check_fault(struct intel_iommu *iommu, int index, int
wait_index,
+                  pci_dev *target_pdev)
 {
        u32 fault;
        int head, tail;
+       u64 iqe_err, ice_sid;
        struct q_inval *qi = iommu->qi;
        int shift = qi_shift(iommu);
+       /*
+        * If the inv_wait_dsc got QI_ABORT here, means an ITE happens,
retry
+        * this batch of invalidation request without clearing the ITE fault
+        * will be trapped into dead loop. remov this line.
+        * see Intel VT-d spec r4.1 sec 6.5.2.10, 6.5.2.8
+        *
+        * if (qi->desc_status[wait_index] == QI_ABORT)
+        *      return -EAGAIN;
+        */

-       if (qi->desc_status[wait_index] == QI_ABORT)
-               return -EAGAIN;
+       /*
+        * If the ATS invalidation target device is gone this moment
(surprise
+        * removed, died, no response) don't try this request again. this
+        * request will not get valid result anymore. but the request was
+        * already submitted to hardware and we predict to get a ITE in
+        * followed batch of request, if so, it will get handled then.
+        */
+       if (target_pdev && !pci_device_is_present(target_pdev))
+               return -EINVAL;

        fault = readl(iommu->reg + DMAR_FSTS_REG);
        if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
@@ -1315,6 +1333,13 @@ static int qi_check_fault(struct intel_iommu
*iommu, int index, int wait_index)
                tail = readl(iommu->reg + DMAR_IQT_REG);
                tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;

+               /*
+                * SID field is valid only when the ITE field is Set in
FSTS_REG
+                * see Intel VT-d spec r4.1, section 11.4.9.9
+                */
+               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
+               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
+
                writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
                pr_info("Invalidation Time-out Error (ITE) cleared\n");

@@ -1324,6 +1349,16 @@ static int qi_check_fault(struct intel_iommu
*iommu, int index, int wait_index)
                        head = (head - 2 + QI_LENGTH) % QI_LENGTH;
                } while (head != tail);

+               /*
+                * If got ITE, we need to check if the sid of ITE is the
same as
+                * current ATS invalidation target device, if yes, don't
try this
+                * request anymore, the target device has a response
time beyound
+                * expected. 0 value of ice_sid means old device, no
ice_sid value.
+                */
+               if (target_pdev && ice_sid && ice_sid ==
+                   pci_dev_id(pci_physfn(target_pdev))
+                       return -ETIMEDOUT;
+
                if (qi->desc_status[wait_index] == QI_ABORT)
                        return -EAGAIN;
        }
@@ -1344,7 +1379,7 @@ static int qi_check_fault(struct intel_iommu
*iommu, int index, int wait_index)
  * can be part of the submission but it will not be polled for completion.
  */
 int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
-                  unsigned int count, unsigned long options)
+                  unsigned int count, unsigned long options, pci_dev
*target_pdev)
 {
        struct q_inval *qi = iommu->qi;
        s64 devtlb_start_ktime = 0;
@@ -1430,7 +1465,7 @@ int qi_submit_sync(struct intel_iommu *iommu,
struct qi_desc *desc,
                 * a deadlock where the interrupt context can wait
indefinitely
                 * for free slots in the queue.
                 */
-               rc = qi_check_fault(iommu, index, wait_index);
+               rc = qi_check_fault(iommu, index, wait_index, target_pdev);
                if (rc)
                        break;

Thanks,

Ethan


2024-01-17 10:26:08

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device


On 1/17/2024 1:26 PM, Ethan Zhao wrote:
>
> On 1/17/2024 11:24 AM, Baolu Lu wrote:
>> On 2024/1/15 15:58, Ethan Zhao wrote:
>>> -static int qi_check_fault(struct intel_iommu *iommu, int index, int
>>> wait_index)
>>> +static int qi_check_fault(struct intel_iommu *iommu, int index, int
>>> wait_index,
>>> +                  pci_dev *target_pdev)
>>>   {
>>>          u32 fault;
>>>          int head, tail;
>>> +       u64 iqe_err, ice_sid;
>>>          struct q_inval *qi = iommu->qi;
>>>          int shift = qi_shift(iommu);
>>>
>>>          if (qi->desc_status[wait_index] == QI_ABORT)
>>>                  return -EAGAIN;
>>>
>>> +       /*
>>> +        * If the ATS invalidation target device is gone this moment
>>> (surprise
>>> +        * removed, died, no response) don't try this request again.
>>> this
>>> +        * request will not get valid result anymore. but the
>>> request was
>>> +        * already submitted to hardware and we predict to get a ITE in
>>> +        * followed batch of request, if so, it will get handled then.
>>> +        */
>>
>> We can't leave the ITE triggered by this request for the next one, which
>> has no context about why this happened. Perhaps move below code down to
>> the segment that handles ITEs.
>
> Here, the invalidation request has been issued to hardware but target
> device
>
> gone, we can't loop and wait for the ITE for this request to happen,
> and we
>
> bail out here because we hold lock_irqsave lock , the ITE still could
> happen
>
> with later batch request in the future,  though it is not triggered by
> that request,
>
> but it could still be cleaned/handled. move it to the fault() segment
> ?,there means

That moment, the ITE was triggered by previous requests, they are not in
current

context, also shouldn't be retried, they have response time over expected.

but triggered ITE fault blocks this patch request, we should retry this
batch request.

we just clean the fault and retry it.  nothing missed.


Thanks,

Ethan

>
> ITE already happened, no need to check target presence anymore.
>
> did I miss something about the context lost ?
>
>>
>> Another concern is about qi_dump_fault(), which pr_err's the fault
>> message as long as the register is set. Some faults are predictable,
>> such as cache invalidation for surprise-removed devices. Unconditionally
>> reporting errors with pr_err() may lead the user to believe that a more
>> serious hardware error has occurred. Probably we can refine this part of
>> the code as well.
>
> Agree, may refine them in seperated series ?
>
> loop and always retry IQE, ICE don't make sense per my understanding.  if
>
> IQE happened retry it will always reproduce the fault, because request
> is the same.
>
> we could fix them together in other patches.
>
>
> Thanks,
>
> Ethan
>
>>
>> Others look sane to me.
>>
>>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>>> +               return -EINVAL;
>>> +
>>>          fault = readl(iommu->reg + DMAR_FSTS_REG);
>>>          if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>>>                  qi_dump_fault(iommu, fault);
>>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu
>>> *iommu, int index, int wait_index)
>>>                  tail = readl(iommu->reg + DMAR_IQT_REG);
>>>                  tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>>
>>> +               /*
>>> +                * SID field is valid only when the ITE field is Set
>>> in FSTS_REG
>>> +                * see Intel VT-d spec r4.1, section 11.4.9.9
>>> +                */
>>> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>>> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
>>> +
>>>                  writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>>                  pr_info("Invalidation Time-out Error (ITE)
>>> cleared\n");
>>>
>>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu
>>> *iommu, int index, int wait_index)
>>>                          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>                  } while (head != tail);
>>>
>>> +               /*
>>> +                * If got ITE, we need to check if the sid of ITE is
>>> the same as
>>> +                * current ATS invalidation target device, if yes,
>>> don't try this
>>> +                * request anymore, the target device has a response
>>> time beyound
>>> +                * expected. 0 value of ice_sid means old device, no
>>> ice_sid value.
>>> +                */
>>> +               if (target_pdev && ice_sid && ice_sid ==
>>> +                   pci_dev_id(pci_physfn(target_pdev))
>>> +                               return -ETIMEDOUT;
>>> +
>>>                  if (qi->desc_status[wait_index] == QI_ABORT)
>>>                          return -EAGAIN;
>>>          }
>>
>> Best regards,
>> baolu
>

2024-01-17 10:26:51

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device


On 1/17/2024 11:24 AM, Baolu Lu wrote:
> On 2024/1/15 15:58, Ethan Zhao wrote:
>> -static int qi_check_fault(struct intel_iommu *iommu, int index, int
>> wait_index)
>> +static int qi_check_fault(struct intel_iommu *iommu, int index, int
>> wait_index,
>> +                  pci_dev *target_pdev)
>>   {
>>          u32 fault;
>>          int head, tail;
>> +       u64 iqe_err, ice_sid;
>>          struct q_inval *qi = iommu->qi;
>>          int shift = qi_shift(iommu);
>>
>>          if (qi->desc_status[wait_index] == QI_ABORT)
>>                  return -EAGAIN;
>>
>> +       /*
>> +        * If the ATS invalidation target device is gone this moment
>> (surprise
>> +        * removed, died, no response) don't try this request again.
>> this
>> +        * request will not get valid result anymore. but the request
>> was
>> +        * already submitted to hardware and we predict to get a ITE in
>> +        * followed batch of request, if so, it will get handled then.
>> +        */
>
> We can't leave the ITE triggered by this request for the next one, which
> has no context about why this happened. Perhaps move below code down to
> the segment that handles ITEs.

Here, the invalidation request has been issued to hardware but target device

gone, we can't loop and wait for the ITE for this request to happen, and we

bail out here because we hold lock_irqsave lock , the ITE still could happen

with later batch request in the future,  though it is not triggered by
that request,

but it could still be cleaned/handled. move it to the fault() segment
?,there means

ITE already happened, no need to check target presence anymore.

did I miss something about the context lost ?

>
> Another concern is about qi_dump_fault(), which pr_err's the fault
> message as long as the register is set. Some faults are predictable,
> such as cache invalidation for surprise-removed devices. Unconditionally
> reporting errors with pr_err() may lead the user to believe that a more
> serious hardware error has occurred. Probably we can refine this part of
> the code as well.

Agree, may refine them in seperated series ?

loop and always retry IQE, ICE don't make sense per my understanding.  if

IQE happened retry it will always reproduce the fault, because request
is the same.

we could fix them together in other patches.


Thanks,

Ethan

>
> Others look sane to me.
>
>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>> +               return -EINVAL;
>> +
>>          fault = readl(iommu->reg + DMAR_FSTS_REG);
>>          if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>>                  qi_dump_fault(iommu, fault);
>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index)
>>                  tail = readl(iommu->reg + DMAR_IQT_REG);
>>                  tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>
>> +               /*
>> +                * SID field is valid only when the ITE field is Set
>> in FSTS_REG
>> +                * see Intel VT-d spec r4.1, section 11.4.9.9
>> +                */
>> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
>> +
>>                  writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>                  pr_info("Invalidation Time-out Error (ITE) cleared\n");
>>
>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index)
>>                          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>                  } while (head != tail);
>>
>> +               /*
>> +                * If got ITE, we need to check if the sid of ITE is
>> the same as
>> +                * current ATS invalidation target device, if yes,
>> don't try this
>> +                * request anymore, the target device has a response
>> time beyound
>> +                * expected. 0 value of ice_sid means old device, no
>> ice_sid value.
>> +                */
>> +               if (target_pdev && ice_sid && ice_sid ==
>> +                   pci_dev_id(pci_physfn(target_pdev))
>> +                               return -ETIMEDOUT;
>> +
>>                  if (qi->desc_status[wait_index] == QI_ABORT)
>>                          return -EAGAIN;
>>          }
>
> Best regards,
> baolu

2024-01-18 03:20:12

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device

On 1/17/24 5:00 PM, Ethan Zhao wrote:
> +       /*
> +        * If the ATS invalidation target device is gone this moment
> (surprise
> +        * removed, died, no response) don't try this request again. this
> +        * request will not get valid result anymore. but the request was
> +        * already submitted to hardware and we predict to get a ITE in
> +        * followed batch of request, if so, it will get handled then.
> +        */
> +       if (target_pdev && !pci_device_is_present(target_pdev))
> +               return -EINVAL;

Again, we should not ignore the error triggered by the current request.
Do not leave it to the next one. The WAIT descriptor is a fence. Handle
everything within its boundary.

Best regards,
baolu

2024-01-18 06:06:09

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device


On 1/18/2024 8:46 AM, Baolu Lu wrote:
> On 1/17/24 5:00 PM, Ethan Zhao wrote:
>> +       /*
>> +        * If the ATS invalidation target device is gone this moment
>> (surprise
>> +        * removed, died, no response) don't try this request again.
>> this
>> +        * request will not get valid result anymore. but the request
>> was
>> +        * already submitted to hardware and we predict to get a ITE in
>> +        * followed batch of request, if so, it will get handled then.
>> +        */
>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>> +               return -EINVAL;
>
> Again, we should not ignore the error triggered by the current request.
> Do not leave it to the next one. The WAIT descriptor is a fence. Handle
> everything within its boundary.

 We didn't set fence bit to every ATS invalidation wait descriptor,

only the intel_drain_pasid_prq() queue a drain page requests with FN

sit, but that is not called in hotplug removal path.


Thanks,

Ethan



>
> Best regards,
> baolu

2024-01-18 07:21:44

by Ethan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device


On 1/18/2024 10:26 AM, Ethan Zhao wrote:
>
> On 1/18/2024 8:46 AM, Baolu Lu wrote:
>> On 1/17/24 5:00 PM, Ethan Zhao wrote:
>>> +       /*
>>> +        * If the ATS invalidation target device is gone this moment
>>> (surprise
>>> +        * removed, died, no response) don't try this request again.
>>> this
>>> +        * request will not get valid result anymore. but the
>>> request was
>>> +        * already submitted to hardware and we predict to get a ITE in
>>> +        * followed batch of request, if so, it will get handled then.
>>> +        */
>>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>>> +               return -EINVAL;
>>
>> Again, we should not ignore the error triggered by the current request.
>> Do not leave it to the next one. The WAIT descriptor is a fence. Handle
>> everything within its boundary.
>
>  We didn't set fence bit to every ATS invalidation wait descriptor,
>
> only the intel_drain_pasid_prq() queue a drain page requests with FN
>
> sit, but that is not called in hotplug removal path.

So, in fact so far, it doesn't act as a fence except the
intel_drain_pasid_prq() ,

and it never handle everthing within its border.


Thanks,

Ethan

>
>
> Thanks,
>
> Ethan
>
>
>
>>
>> Best regards,
>> baolu
>