Because surprise removal could happen anytime, e.g. a user could request
safe removal of an endpoint device (EP) via sysfs while its link is brought
down for a surprise removal concurrently, such aggressive cases would cause
an ATS invalidation request to be issued to a non-existent target device,
followed by an endless loop retrying that request after the ITE fault is
triggered in interrupt context. This patch optimizes the ITE handling by
checking the target device's presence state to avoid blindly retrying the
timed-out request, thus avoiding a hard lockup or system hang.
Signed-off-by: Ethan Zhao <[email protected]>
---
drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 814134e9aa5a..2e214b43725c 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index,
{
u32 fault;
int head, tail;
+ u64 iqe_err, ite_sid;
struct q_inval *qi = iommu->qi;
int shift = qi_shift(iommu);
@@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index,
tail = readl(iommu->reg + DMAR_IQT_REG);
tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
+ /*
+ * SID field is valid only when the ITE field is Set in FSTS_REG
+ * see Intel VT-d spec r4.1, section 11.4.9.9
+ */
+ iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
+ ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
+
writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
pr_info("Invalidation Time-out Error (ITE) cleared\n");
@@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index,
head = (head - 2 + QI_LENGTH) % QI_LENGTH;
} while (head != tail);
+ /*
+ * If got ITE, we need to check if the sid of ITE is the same as
+ * current ATS invalidation target device, if yes, don't try this
+ * request anymore if the target device isn't present.
+ * 0 value of ite_sid means old VT-d device, no ite_sid value.
+ */
+ if (pdev && ite_sid && !pci_device_is_present(pdev) &&
+ ite_sid == pci_dev_id(pci_physfn(pdev)))
+ return -ETIMEDOUT;
+
if (qi->desc_status[wait_index] == QI_ABORT)
return -EAGAIN;
}
--
2.31.1
> From: Ethan Zhao <[email protected]>
> Sent: Monday, January 29, 2024 11:49 AM
>
> Because surprise removal could happen anytime, e.g. user could request safe
> removal to EP(endpoint device) via sysfs and brings its link down to do
> surprise removal cocurrently. such aggressive cases would cause ATS
> invalidation request issued to non-existence target device, then deadly
> loop to retry that request after ITE fault triggered in interrupt context.
> this patch aims to optimize the ITE handling by checking the target device
> presence state to avoid retrying the timeout request blindly, thus avoid
> hard lockup or system hang.
>
> Signed-off-by: Ethan Zhao <[email protected]>
> ---
> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> index 814134e9aa5a..2e214b43725c 100644
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu
> *iommu, int index, int wait_index,
> {
> u32 fault;
> int head, tail;
> + u64 iqe_err, ite_sid;
> struct q_inval *qi = iommu->qi;
> int shift = qi_shift(iommu);
>
> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu
> *iommu, int index, int wait_index,
> tail = readl(iommu->reg + DMAR_IQT_REG);
> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>
> + /*
> + * SID field is valid only when the ITE field is Set in FSTS_REG
> + * see Intel VT-d spec r4.1, section 11.4.9.9
> + */
> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
> +
> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
> pr_info("Invalidation Time-out Error (ITE) cleared\n");
>
> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu
> *iommu, int index, int wait_index,
> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
> } while (head != tail);
>
> + /*
> + * If got ITE, we need to check if the sid of ITE is the same as
> + * current ATS invalidation target device, if yes, don't try this
> + * request anymore if the target device isn't present.
> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
> + */
> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
> + ite_sid == pci_dev_id(pci_physfn(pdev)))
> + return -ETIMEDOUT;
> +
Since the hardware already reports the source id that led to the timeout, can't we
just find the pci_dev from the reported ite_sid? This is a slow path (either
due to the device being in a bad state or removed), hence it's not necessary to add
more intelligence to pass the pci_dev in, which would leave only a partial fix that
can be backported.

It's also more future-proof: say one day the driver allows batching invalidation
requests for multiple devices, then there is no need to pass in a list of devices.

Then it's easier to backport a full fix.
On 2024/1/29 17:06, Tian, Kevin wrote:
>> From: Ethan Zhao <[email protected]>
>> Sent: Monday, January 29, 2024 11:49 AM
>>
>> Because surprise removal could happen anytime, e.g. user could request safe
>> removal to EP(endpoint device) via sysfs and brings its link down to do
>> surprise removal cocurrently. such aggressive cases would cause ATS
>> invalidation request issued to non-existence target device, then deadly
>> loop to retry that request after ITE fault triggered in interrupt context.
>> this patch aims to optimize the ITE handling by checking the target device
>> presence state to avoid retrying the timeout request blindly, thus avoid
>> hard lockup or system hang.
>>
>> Signed-off-by: Ethan Zhao <[email protected]>
>> ---
>> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>> index 814134e9aa5a..2e214b43725c 100644
>> --- a/drivers/iommu/intel/dmar.c
>> +++ b/drivers/iommu/intel/dmar.c
>> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index,
>> {
>> u32 fault;
>> int head, tail;
>> + u64 iqe_err, ite_sid;
>> struct q_inval *qi = iommu->qi;
>> int shift = qi_shift(iommu);
>>
>> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index,
>> tail = readl(iommu->reg + DMAR_IQT_REG);
>> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>
>> + /*
>> + * SID field is valid only when the ITE field is Set in FSTS_REG
>> + * see Intel VT-d spec r4.1, section 11.4.9.9
>> + */
>> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
>> +
>> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>> pr_info("Invalidation Time-out Error (ITE) cleared\n");
>>
>> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index,
>> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>> } while (head != tail);
>>
>> + /*
>> + * If got ITE, we need to check if the sid of ITE is the same as
>> + * current ATS invalidation target device, if yes, don't try this
>> + * request anymore if the target device isn't present.
>> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
>> + */
>> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
>> + ite_sid == pci_dev_id(pci_physfn(pdev)))
>> + return -ETIMEDOUT;
>> +
>
> since the hardware already reports source id leading to timeout, can't we
> just find the pci_dev according to reported ite_sid? this is a slow path (either
> due to device in bad state or removed) hence it's not necessary to add more
> intelligence to pass the pci_dev in, leading to only a partial fix can be backported.
>
> It's also more future-proof, say if one day the driver allows batching invalidation
> requests for multiple devices then no need to pass in a list of devices.
>
> Then it's easier to backport a full fix.
You may consider pci_get_domain_bus_and_slot() or
pci_find_bus()/pci_get_slot(). But I doubt whether the pci_dev is still tracked
in the bus or in any kind of dev list in the device hot-removal case, so Ethan
may need to check.
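
For illustration, a minimal lookup sketch along these lines; it is an
assumption-laden sketch rather than anything posted in this series (it presumes
iommu->segment identifies the PCI segment of the reporting DMAR unit, and the
helper name qi_ite_device_gone() is made up):

/* Sketch only: resolve the SID reported in DMAR_IQER_REG to a pci_dev. */
static bool qi_ite_device_gone(struct intel_iommu *iommu, u16 ite_sid)
{
        struct pci_dev *pdev;
        bool gone;

        if (!ite_sid)           /* older VT-d hardware reports no SID */
                return false;

        pdev = pci_get_domain_bus_and_slot(iommu->segment,
                                           PCI_BUS_NUM(ite_sid),
                                           ite_sid & 0xff);
        if (!pdev)              /* no longer tracked on the bus */
                return true;

        gone = !pci_device_is_present(pdev);
        pci_dev_put(pdev);
        return gone;
}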
Regards,
Yi Liu
On 2024/1/29 11:49, Ethan Zhao wrote:
> Because surprise removal could happen anytime, e.g. user could request safe
> removal to EP(endpoint device) via sysfs and brings its link down to do
> surprise removal cocurrently. such aggressive cases would cause ATS
> invalidation request issued to non-existence target device, then deadly
> loop to retry that request after ITE fault triggered in interrupt context.
> this patch aims to optimize the ITE handling by checking the target device
> presence state to avoid retrying the timeout request blindly, thus avoid
> hard lockup or system hang.
>
Should a Fixes tag be included here?
> Signed-off-by: Ethan Zhao <[email protected]>
> ---
> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> index 814134e9aa5a..2e214b43725c 100644
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index,
> {
> u32 fault;
> int head, tail;
> + u64 iqe_err, ite_sid;
> struct q_inval *qi = iommu->qi;
> int shift = qi_shift(iommu);
>
> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index,
> tail = readl(iommu->reg + DMAR_IQT_REG);
> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>
> + /*
> + * SID field is valid only when the ITE field is Set in FSTS_REG
> + * see Intel VT-d spec r4.1, section 11.4.9.9
> + */
> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
> +
> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
> pr_info("Invalidation Time-out Error (ITE) cleared\n");
>
> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index,
> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
> } while (head != tail);
>
> + /*
> + * If got ITE, we need to check if the sid of ITE is the same as
> + * current ATS invalidation target device, if yes, don't try this
> + * request anymore if the target device isn't present.
> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
> + */
> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
> + ite_sid == pci_dev_id(pci_physfn(pdev)))
> + return -ETIMEDOUT;
> +
> if (qi->desc_status[wait_index] == QI_ABORT)
> return -EAGAIN;
> }
--
Regards,
Yi Liu
On 2024/1/29 17:06, Tian, Kevin wrote:
>> From: Ethan Zhao<[email protected]>
>> Sent: Monday, January 29, 2024 11:49 AM
>>
>> Because surprise removal could happen anytime, e.g. user could request safe
>> removal to EP(endpoint device) via sysfs and brings its link down to do
>> surprise removal cocurrently. such aggressive cases would cause ATS
>> invalidation request issued to non-existence target device, then deadly
>> loop to retry that request after ITE fault triggered in interrupt context.
>> this patch aims to optimize the ITE handling by checking the target device
>> presence state to avoid retrying the timeout request blindly, thus avoid
>> hard lockup or system hang.
>>
>> Signed-off-by: Ethan Zhao<[email protected]>
>> ---
>> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>> index 814134e9aa5a..2e214b43725c 100644
>> --- a/drivers/iommu/intel/dmar.c
>> +++ b/drivers/iommu/intel/dmar.c
>> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index,
>> {
>> u32 fault;
>> int head, tail;
>> + u64 iqe_err, ite_sid;
>> struct q_inval *qi = iommu->qi;
>> int shift = qi_shift(iommu);
>>
>> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index,
>> tail = readl(iommu->reg + DMAR_IQT_REG);
>> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>
>> + /*
>> + * SID field is valid only when the ITE field is Set in FSTS_REG
>> + * see Intel VT-d spec r4.1, section 11.4.9.9
>> + */
>> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
>> +
>> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>> pr_info("Invalidation Time-out Error (ITE) cleared\n");
>>
>> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu
>> *iommu, int index, int wait_index,
>> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>> } while (head != tail);
>>
>> + /*
>> + * If got ITE, we need to check if the sid of ITE is the same as
>> + * current ATS invalidation target device, if yes, don't try this
>> + * request anymore if the target device isn't present.
>> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
>> + */
>> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
>> + ite_sid == pci_dev_id(pci_physfn(pdev)))
>> + return -ETIMEDOUT;
>> +
> since the hardware already reports source id leading to timeout, can't we
> just find the pci_dev according to reported ite_sid? this is a slow path (either
> due to device in bad state or removed) hence it's not necessary to add more
> intelligence to pass the pci_dev in, leading to only a partial fix can be backported.
>
> It's also more future-proof, say if one day the driver allows batching invalidation
> requests for multiple devices then no need to pass in a list of devices.
I thought about this solution before but gave up in the end due to
the locking issue.

A batch of qi requests must be handled inside the spinlock critical region
to enforce that only one batch of requests is submitted at a time.
Searching for the pci_dev inside this locked region might result in nested
locking issues, and I haven't found a good solution for this yet.

Unless someone can bring up a better solution, perhaps we have to live
in a world where only a single device-TLB invalidation request per batch
can be submitted to the queue.
Best regards,
baolu
> From: Baolu Lu <[email protected]>
> Sent: Monday, January 29, 2024 10:49 PM
>
> On 2024/1/29 17:06, Tian, Kevin wrote:
> >> From: Ethan Zhao<[email protected]>
> >> Sent: Monday, January 29, 2024 11:49 AM
> >>
> >> Because surprise removal could happen anytime, e.g. user could request
> safe
> >> removal to EP(endpoint device) via sysfs and brings its link down to do
> >> surprise removal cocurrently. such aggressive cases would cause ATS
> >> invalidation request issued to non-existence target device, then deadly
> >> loop to retry that request after ITE fault triggered in interrupt context.
> >> this patch aims to optimize the ITE handling by checking the target device
> >> presence state to avoid retrying the timeout request blindly, thus avoid
> >> hard lockup or system hang.
> >>
> >> Signed-off-by: Ethan Zhao<[email protected]>
> >> ---
> >> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
> >> 1 file changed, 18 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> >> index 814134e9aa5a..2e214b43725c 100644
> >> --- a/drivers/iommu/intel/dmar.c
> >> +++ b/drivers/iommu/intel/dmar.c
> >> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu
> >> *iommu, int index, int wait_index,
> >> {
> >> u32 fault;
> >> int head, tail;
> >> + u64 iqe_err, ite_sid;
> >> struct q_inval *qi = iommu->qi;
> >> int shift = qi_shift(iommu);
> >>
> >> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu
> >> *iommu, int index, int wait_index,
> >> tail = readl(iommu->reg + DMAR_IQT_REG);
> >> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
> >>
> >> + /*
> >> + * SID field is valid only when the ITE field is Set in FSTS_REG
> >> + * see Intel VT-d spec r4.1, section 11.4.9.9
> >> + */
> >> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
> >> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
> >> +
> >> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
> >> pr_info("Invalidation Time-out Error (ITE) cleared\n");
> >>
> >> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu
> >> *iommu, int index, int wait_index,
> >> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
> >> } while (head != tail);
> >>
> >> + /*
> >> + * If got ITE, we need to check if the sid of ITE is the same as
> >> + * current ATS invalidation target device, if yes, don't try this
> >> + * request anymore if the target device isn't present.
> >> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
> >> + */
> >> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
> >> + ite_sid == pci_dev_id(pci_physfn(pdev)))
> >> + return -ETIMEDOUT;
> >> +
> > since the hardware already reports source id leading to timeout, can't we
> > just find the pci_dev according to reported ite_sid? this is a slow path
> (either
> > due to device in bad state or removed) hence it's not necessary to add
> more
> > intelligence to pass the pci_dev in, leading to only a partial fix can be
> backported.
> >
> > It's also more future-proof, say if one day the driver allows batching
> invalidation
> > requests for multiple devices then no need to pass in a list of devices.
>
> I have ever thought about this solution and gave up in the end due to
> the locking issue.
>
> A batch of qi requests must be handled in the spin lock critical region
> to enforce that only one batch of requests is submitted at a time.
> Searching pci_dev in this locking region might result in nested locking
> issues, and I haven't found a good solution for this yet.
>
Under the spinlock you just get the sid; searching for the pci_dev can be
done outside of the critical region. Anyway, the handling of -EAGAIN already
happens after spin_unlock().
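
A rough control-flow fragment of how that split could look inside
qi_submit_sync(); the extra ite_sid out-parameter on qi_check_fault() and the
qi_ite_device_gone() helper sketched earlier in the thread are illustrative
assumptions, not the posted patch:

        unsigned long flags;
        u16 ite_sid = 0;
        int rc;

        raw_spin_lock_irqsave(&qi->q_lock, flags);
        /* ... queue the wait descriptor and poll for completion, as today ... */
        rc = qi_check_fault(iommu, index, wait_index, &ite_sid); /* only records the SID */
        /* ... */
        raw_spin_unlock_irqrestore(&qi->q_lock, flags);

        /* slow path, outside the critical region: give up instead of retrying */
        if (rc && ite_sid && qi_ite_device_gone(iommu, ite_sid))
                return -ETIMEDOUT;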
> From: Ethan Zhao <[email protected]>
> Sent: Tuesday, January 30, 2024 1:13 PM
>
> On 1/29/2024 5:21 PM, Yi Liu wrote:
> > On 2024/1/29 17:06, Tian, Kevin wrote:
> >>> From: Ethan Zhao <[email protected]>
> >>> Sent: Monday, January 29, 2024 11:49 AM
> >>>
> >>> Because surprise removal could happen anytime, e.g. user could
> >>> request safe
> >>> removal to EP(endpoint device) via sysfs and brings its link down to do
> >>> surprise removal cocurrently. such aggressive cases would cause ATS
> >>> invalidation request issued to non-existence target device, then deadly
> >>> loop to retry that request after ITE fault triggered in interrupt
> >>> context.
> >>> this patch aims to optimize the ITE handling by checking the target
> >>> device
> >>> presence state to avoid retrying the timeout request blindly, thus
> >>> avoid
> >>> hard lockup or system hang.
> >>>
> >>> Signed-off-by: Ethan Zhao <[email protected]>
> >>> ---
> >>> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
> >>> 1 file changed, 18 insertions(+)
> >>>
> >>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> >>> index 814134e9aa5a..2e214b43725c 100644
> >>> --- a/drivers/iommu/intel/dmar.c
> >>> +++ b/drivers/iommu/intel/dmar.c
> >>> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu
> >>> *iommu, int index, int wait_index,
> >>> {
> >>> u32 fault;
> >>> int head, tail;
> >>> + u64 iqe_err, ite_sid;
> >>> struct q_inval *qi = iommu->qi;
> >>> int shift = qi_shift(iommu);
> >>>
> >>> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu
> >>> *iommu, int index, int wait_index,
> >>> tail = readl(iommu->reg + DMAR_IQT_REG);
> >>> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
> >>>
> >>> + /*
> >>> + * SID field is valid only when the ITE field is Set in
> >>> FSTS_REG
> >>> + * see Intel VT-d spec r4.1, section 11.4.9.9
> >>> + */
> >>> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
> >>> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
> >>> +
> >>> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
> >>> pr_info("Invalidation Time-out Error (ITE) cleared\n");
> >>>
> >>> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu
> >>> *iommu, int index, int wait_index,
> >>> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
> >>> } while (head != tail);
> >>>
> >>> + /*
> >>> + * If got ITE, we need to check if the sid of ITE is the
> >>> same as
> >>> + * current ATS invalidation target device, if yes, don't
> >>> try this
> >>> + * request anymore if the target device isn't present.
> >>> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
> >>> + */
> >>> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
> >>> + ite_sid == pci_dev_id(pci_physfn(pdev)))
> >>> + return -ETIMEDOUT;
> >>> +
> >>
> >> since the hardware already reports source id leading to timeout,
> >> can't we
> >> just find the pci_dev according to reported ite_sid? this is a slow
> >> path (either
> >> due to device in bad state or removed) hence it's not necessary to
> >> add more
> >> intelligence to pass the pci_dev in, leading to only a partial fix
> >> can be backported.
> >>
> >> It's also more future-proof, say if one day the driver allows
> >> batching invalidation
> >> requests for multiple devices then no need to pass in a list of devices.
> >>
> >> Then it's easier to backport a full fix.
> >
> > May consider pci_get_domain_bus_and_slot() or
> > pci_find_bus()/pci_get_slot(). But I doubt if the pci_dev is still
> > tracked
> > in the bus or a kind of dev list in the device hot removal case. So Ethan
> > may need to check.
>
> Perhaps it is too late to call pci_find_bus() or
> pci_get_domain_bus_and_slot() to get the
>
> device instance from this notifier registered as
> BUS_NOTIFY_REMOVED_DEVICE
>
> action. if the device is still there in bus list, *must* be a bug of
> device subsystem as
>
> *removed* device.
>
Ethan, it looks like your reply is not formatted well. Can you fix your mail
client so that replies are wrapped the way your commit messages are?

Here we need to consider two situations.

One is that the device is not bound to a driver, or is bound to a driver
which doesn't do active work on the device when it's removed. In
that case one may observe the timeout situation only in the removal
path, as the stack dump in your patch02 shows.

patch02 can fix that case by checking whether the device is present
and skipping the invalidation requests, so the logic being discussed
here doesn't matter.
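
As a purely hypothetical illustration of that idea (not the actual patch02
hunk; the helper name is made up and the struct device_domain_info fields are
used as I recall them):

static void example_flush_dev_iotlb(struct device_domain_info *info,
                                    u64 addr, unsigned int mask)
{
        if (!info->ats_enabled)
                return;

        /* device already removed or link down: nothing to flush */
        if (!pci_device_is_present(to_pci_dev(info->dev)))
                return;

        qi_flush_dev_iotlb(info->iommu, PCI_DEVID(info->bus, info->devfn),
                           info->pfsid, info->ats_qdep, addr, mask);
}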
The 2nd situation is more tricky. The device might be bound to
a driver which is doing active work on the device with in-flight
ATS invalidation requests. In this case the in-flight requests must be
aborted before the driver can be detached from the removed device.
Conceptually a device is removed from the bus only after its driver is
detached.

From this angle you can still find the pci_dev from the bus when handling
the timeout error for in-flight invalidation requests. But I'm not a PCI
person, so let's wait for input from Bjorn and Lukas.
On 1/29/2024 5:21 PM, Yi Liu wrote:
> On 2024/1/29 17:06, Tian, Kevin wrote:
>>> From: Ethan Zhao <[email protected]>
>>> Sent: Monday, January 29, 2024 11:49 AM
>>>
>>> Because surprise removal could happen anytime, e.g. user could
>>> request safe
>>> removal to EP(endpoint device) via sysfs and brings its link down to do
>>> surprise removal cocurrently. such aggressive cases would cause ATS
>>> invalidation request issued to non-existence target device, then deadly
>>> loop to retry that request after ITE fault triggered in interrupt
>>> context.
>>> this patch aims to optimize the ITE handling by checking the target
>>> device
>>> presence state to avoid retrying the timeout request blindly, thus
>>> avoid
>>> hard lockup or system hang.
>>>
>>> Signed-off-by: Ethan Zhao <[email protected]>
>>> ---
>>> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
>>> 1 file changed, 18 insertions(+)
>>>
>>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>>> index 814134e9aa5a..2e214b43725c 100644
>>> --- a/drivers/iommu/intel/dmar.c
>>> +++ b/drivers/iommu/intel/dmar.c
>>> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu
>>> *iommu, int index, int wait_index,
>>> {
>>> u32 fault;
>>> int head, tail;
>>> + u64 iqe_err, ite_sid;
>>> struct q_inval *qi = iommu->qi;
>>> int shift = qi_shift(iommu);
>>>
>>> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu
>>> *iommu, int index, int wait_index,
>>> tail = readl(iommu->reg + DMAR_IQT_REG);
>>> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>>
>>> + /*
>>> + * SID field is valid only when the ITE field is Set in
>>> FSTS_REG
>>> + * see Intel VT-d spec r4.1, section 11.4.9.9
>>> + */
>>> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>>> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
>>> +
>>> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>> pr_info("Invalidation Time-out Error (ITE) cleared\n");
>>>
>>> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu
>>> *iommu, int index, int wait_index,
>>> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>> } while (head != tail);
>>>
>>> + /*
>>> + * If got ITE, we need to check if the sid of ITE is the
>>> same as
>>> + * current ATS invalidation target device, if yes, don't
>>> try this
>>> + * request anymore if the target device isn't present.
>>> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
>>> + */
>>> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
>>> + ite_sid == pci_dev_id(pci_physfn(pdev)))
>>> + return -ETIMEDOUT;
>>> +
>>
>> since the hardware already reports source id leading to timeout,
>> can't we
>> just find the pci_dev according to reported ite_sid? this is a slow
>> path (either
>> due to device in bad state or removed) hence it's not necessary to
>> add more
>> intelligence to pass the pci_dev in, leading to only a partial fix
>> can be backported.
>>
>> It's also more future-proof, say if one day the driver allows
>> batching invalidation
>> requests for multiple devices then no need to pass in a list of devices.
>>
>> Then it's easier to backport a full fix.
>
> May consider pci_get_domain_bus_and_slot() or
> pci_find_bus()/pci_get_slot(). But I doubt if the pci_dev is still
> tracked
> in the bus or a kind of dev list in the device hot removal case. So Ethan
> may need to check.
Perhaps it is too late to call pci_find_bus() or
pci_get_domain_bus_and_slot() to get the device instance from this notifier
registered for the BUS_NOTIFY_REMOVED_DEVICE action. If the device is still
there in the bus list, that *must* be a bug of the device subsystem, since
the device was already *removed*.

But if we called iommu_release_device() in iommu_bus_notifier() for the
BUS_NOTIFY_DEL_DEVICE action, there would be an opportunity to get the
device instance; that change needs more evaluation of its side effects.

Furthermore, the iommu never crosses the domain number, per the context
table definition in the VT-d spec; that doesn't mean the domain number in a
system will never be non-zero, per my understanding.
Thanks,
Ethan
>
> Regards,
> Yi Liu
>
On 1/29/2024 10:48 PM, Baolu Lu wrote:
> On 2024/1/29 17:06, Tian, Kevin wrote:
>>> From: Ethan Zhao<[email protected]>
>>> Sent: Monday, January 29, 2024 11:49 AM
>>>
>>> Because surprise removal could happen anytime, e.g. user could
>>> request safe
>>> removal to EP(endpoint device) via sysfs and brings its link down to do
>>> surprise removal cocurrently. such aggressive cases would cause ATS
>>> invalidation request issued to non-existence target device, then deadly
>>> loop to retry that request after ITE fault triggered in interrupt
>>> context.
>>> this patch aims to optimize the ITE handling by checking the target
>>> device
>>> presence state to avoid retrying the timeout request blindly, thus
>>> avoid
>>> hard lockup or system hang.
>>>
>>> Signed-off-by: Ethan Zhao<[email protected]>
>>> ---
>>> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
>>> 1 file changed, 18 insertions(+)
>>>
>>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>>> index 814134e9aa5a..2e214b43725c 100644
>>> --- a/drivers/iommu/intel/dmar.c
>>> +++ b/drivers/iommu/intel/dmar.c
>>> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu
>>> *iommu, int index, int wait_index,
>>> {
>>> u32 fault;
>>> int head, tail;
>>> + u64 iqe_err, ite_sid;
>>> struct q_inval *qi = iommu->qi;
>>> int shift = qi_shift(iommu);
>>>
>>> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu
>>> *iommu, int index, int wait_index,
>>> tail = readl(iommu->reg + DMAR_IQT_REG);
>>> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>>
>>> + /*
>>> + * SID field is valid only when the ITE field is Set in
>>> FSTS_REG
>>> + * see Intel VT-d spec r4.1, section 11.4.9.9
>>> + */
>>> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>>> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
>>> +
>>> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>> pr_info("Invalidation Time-out Error (ITE) cleared\n");
>>>
>>> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu
>>> *iommu, int index, int wait_index,
>>> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>> } while (head != tail);
>>>
>>> + /*
>>> + * If got ITE, we need to check if the sid of ITE is the
>>> same as
>>> + * current ATS invalidation target device, if yes, don't
>>> try this
>>> + * request anymore if the target device isn't present.
>>> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
>>> + */
>>> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
>>> + ite_sid == pci_dev_id(pci_physfn(pdev)))
>>> + return -ETIMEDOUT;
>>> +
>> since the hardware already reports source id leading to timeout,
>> can't we
>> just find the pci_dev according to reported ite_sid? this is a slow
>> path (either
>> due to device in bad state or removed) hence it's not necessary to
>> add more
>> intelligence to pass the pci_dev in, leading to only a partial fix
>> can be backported.
>>
>> It's also more future-proof, say if one day the driver allows
>> batching invalidation
>> requests for multiple devices then no need to pass in a list of devices.
>
> I have ever thought about this solution and gave up in the end due to
> the locking issue.
>
> A batch of qi requests must be handled in the spin lock critical region
> to enforce that only one batch of requests is submitted at a time.
> Searching pci_dev in this locking region might result in nested locking
> issues, and I haven't found a good solution for this yet.
>
You said the async-interrupt model is a bad idea, but how bad is it? I wonder
whether the hardware and the VT-d spec definition could support it perfectly
or not. At least it would never get in trouble balancing the timeout against
waking up the watchdog.

Yes, the VT-d DMAR driver wasn't built around an async-interrupt model from
the beginning...
Thanks,
Ethan
> Unless someone can bring up a better solution, perhaps we have to live
> in a world where only single device TLB invalidation request in a batch
> could be submitted to the queue.
>
> Best regards,
> baolu
On 1/30/2024 2:22 PM, Tian, Kevin wrote:
>> From: Ethan Zhao <[email protected]>
>> Sent: Tuesday, January 30, 2024 1:13 PM
>>
>> On 1/29/2024 5:21 PM, Yi Liu wrote:
>>> On 2024/1/29 17:06, Tian, Kevin wrote:
>>>>> From: Ethan Zhao <[email protected]>
>>>>> Sent: Monday, January 29, 2024 11:49 AM
>>>>>
>>>>> Because surprise removal could happen anytime, e.g. user could
>>>>> request safe
>>>>> removal to EP(endpoint device) via sysfs and brings its link down to do
>>>>> surprise removal cocurrently. such aggressive cases would cause ATS
>>>>> invalidation request issued to non-existence target device, then deadly
>>>>> loop to retry that request after ITE fault triggered in interrupt
>>>>> context.
>>>>> this patch aims to optimize the ITE handling by checking the target
>>>>> device
>>>>> presence state to avoid retrying the timeout request blindly, thus
>>>>> avoid
>>>>> hard lockup or system hang.
>>>>>
>>>>> Signed-off-by: Ethan Zhao <[email protected]>
>>>>> ---
>>>>> drivers/iommu/intel/dmar.c | 18 ++++++++++++++++++
>>>>> 1 file changed, 18 insertions(+)
>>>>>
>>>>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
>>>>> index 814134e9aa5a..2e214b43725c 100644
>>>>> --- a/drivers/iommu/intel/dmar.c
>>>>> +++ b/drivers/iommu/intel/dmar.c
>>>>> @@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu
>>>>> *iommu, int index, int wait_index,
>>>>> {
>>>>> u32 fault;
>>>>> int head, tail;
>>>>> + u64 iqe_err, ite_sid;
>>>>> struct q_inval *qi = iommu->qi;
>>>>> int shift = qi_shift(iommu);
>>>>>
>>>>> @@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu
>>>>> *iommu, int index, int wait_index,
>>>>> tail = readl(iommu->reg + DMAR_IQT_REG);
>>>>> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>>>>
>>>>> + /*
>>>>> + * SID field is valid only when the ITE field is Set in
>>>>> FSTS_REG
>>>>> + * see Intel VT-d spec r4.1, section 11.4.9.9
>>>>> + */
>>>>> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>>>>> + ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
>>>>> +
>>>>> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>>>> pr_info("Invalidation Time-out Error (ITE) cleared\n");
>>>>>
>>>>> @@ -1325,6 +1333,16 @@ static int qi_check_fault(struct intel_iommu
>>>>> *iommu, int index, int wait_index,
>>>>> head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>> } while (head != tail);
>>>>>
>>>>> + /*
>>>>> + * If got ITE, we need to check if the sid of ITE is the
>>>>> same as
>>>>> + * current ATS invalidation target device, if yes, don't
>>>>> try this
>>>>> + * request anymore if the target device isn't present.
>>>>> + * 0 value of ite_sid means old VT-d device, no ite_sid value.
>>>>> + */
>>>>> + if (pdev && ite_sid && !pci_device_is_present(pdev) &&
>>>>> + ite_sid == pci_dev_id(pci_physfn(pdev)))
>>>>> + return -ETIMEDOUT;
>>>>> +
>>>> since the hardware already reports source id leading to timeout,
>>>> can't we
>>>> just find the pci_dev according to reported ite_sid? this is a slow
>>>> path (either
>>>> due to device in bad state or removed) hence it's not necessary to
>>>> add more
>>>> intelligence to pass the pci_dev in, leading to only a partial fix
>>>> can be backported.
>>>>
>>>> It's also more future-proof, say if one day the driver allows
>>>> batching invalidation
>>>> requests for multiple devices then no need to pass in a list of devices.
>>>>
>>>> Then it's easier to backport a full fix.
>>> May consider pci_get_domain_bus_and_slot() or
>>> pci_find_bus()/pci_get_slot(). But I doubt if the pci_dev is still
>>> tracked
>>> in the bus or a kind of dev list in the device hot removal case. So Ethan
>>> may need to check.
>> Perhaps it is too late to call pci_find_bus() or
>> pci_get_domain_bus_and_slot() to get the
>>
>> device instance from this notifier registered as
>> BUS_NOTIFY_REMOVED_DEVICE
>>
>> action. if the device is still there in bus list, *must* be a bug of
>> device subsystem as
>>
>> *removed* device.
>>
> Ethan, looks your reply is not formatted well. Can you fix your mail
> client like how you write the commit msg?
Sorry for the mail format.
> Here we need consider two situations.
>
> One is that the device is not bound to a driver or bound to a driver
> which doesn't do active work to the device when it's removed. In
> that case one may observe the timeout situation only in the removal
> path as the stack dump in your patch02 shows.
When iommu_bus_notifier() gets called for the hotplug removal cases to
flush the devTLB (ATS invalidation), the driver has already been unloaded,
whether it is safe removal or surprise removal, so in theory no active
driver is working there.
pciehp_ist()
  pciehp_disable_slot()
    remove_board()
      pciehp_unconfigure_device()
        pci_stop_and_remove_bus_device()
          pci_stop_bus_device()   ---> here the driver is unloaded
          pci_remove_bus_device() ---> here qi_flush_dev_iotlb() gets called
>
> patch02 can fix that case by checking whether the device is present
> to skip sending the invalidation requests. So the logic being discussed
> here doesn't matter.
>
> The 2nd situation is more tricky. The device might be bound to
> a driver which is doing active work to the device with in-fly
> ATS invalidation requests. In this case in-fly requests must be aborted
> before the driver can be detached from the removed device. Conceptually
> a device is removed from the bus only after its driver is detached.
Some tricky situations:

1. The ATS invalidation request is issued from the device driver; while it is
being handled, the device is removed. At this moment the device instance
still exists in the bus list, so yes, if we search for it by BDF, we can
get it.

2. The ATS invalidation request is issued from iommu_bus_notifier() for the
surprise removal case; as shown in the call trace above, the device was
already removed from the bus list, so searching for it by BDF returns NULL.

3. The ATS invalidation request is issued from iommu_bus_notifier() for safe
removal; while it is being handled, the device is removed or the link goes
down. As in #2, the device was already removed from the bus list, so
searching for it by BDF returns NULL.

...

So, searching for the device by BDF only works when the ATS invalidation
request comes from the device driver.
Thanks,
Ethan
>
> From this angle you can still find the pci_dev from the bus when handling
> timeout error for in-fly invalidation requests. But I'm not a PCI person
> so let's wait for the inputs from Bjorn and Lukas.
>
>
> From: Ethan Zhao <[email protected]>
> Sent: Tuesday, January 30, 2024 4:16 PM
>
> On 1/30/2024 2:22 PM, Tian, Kevin wrote:
> > Here we need consider two situations.
> >
> > One is that the device is not bound to a driver or bound to a driver
> > which doesn't do active work to the device when it's removed. In
> > that case one may observe the timeout situation only in the removal
> > path as the stack dump in your patch02 shows.
>
> When iommu_bus_notifier() got called for hotplug removal cases to
> flush devTLB (ATS invalidation), driver was already unloaded.
> whatever safe removal or surprise removal. so in theory no active
> driver working there.
>
> pciehp_ist()
> pciehp_disable_slot()
> remove_board()
> pciehp_unconfigure_device()
> pci_stop_and_remove_bus_device()
> pci_stop_bus_device()--->here unload driver
> pci_remove_bus_device()->here qi_flush_dev_iotlb() got called.
yes, so patch02 can fix this case.
>
> >
> > patch02 can fix that case by checking whether the device is present
> > to skip sending the invalidation requests. So the logic being discussed
> > here doesn't matter.
> >
> > The 2nd situation is more tricky. The device might be bound to
> > a driver which is doing active work to the device with in-fly
> > ATS invalidation requests. In this case in-fly requests must be aborted
> > before the driver can be detached from the removed device. Conceptually
> > a device is removed from the bus only after its driver is detached.
>
> Some tricky situations:
>
> 1. The ATS invalidation request is issued from driver driver, while it is
> in handling, device is removed. this momment, the device instance still
> exists in the bus list. yes, if searching it by BDF, could get it.
it's searchable between the point where the device is removed and the
point where the driver is unloaded:
CPU0                                CPU1
(Driver is active)                  (pciehp handler)
qi_submit_sync()                    pciehp_ist()
  ...                                 ...
  loop for completion() {             pciehp_unconfigure_device()
    ...                                 pci_dev_set_disconnected()
    if (ITE) {                          ...
      //find pci_dev from sid           pci_remove_bus_device()
      if (pci_dev_is_connected())         device_del()
        break;                              bus_remove_device()
    }                                         device_remove_driver()
    ..                                        //wait for driver unload
  }
  ..
  return;
                                    BUS_NOTIFY_REMOVED_DEVICE;
                                    list_del(&dev->bus_list);
(I didn’t draw the full calling stack on the right hand side)
>
> 2. The ATS invalidation request is issued from iommu_bus_notifier()
> for surprise removal reason, as shown in above calltrace, device was
> already removed from bus list. if searching it by BDF, return NULL.
>
> 3. The ATS invlidation request is issued from iommu_bus_notifier()
> for safe removal, when is in handling, device is removed or link
> is down. also as #2, device was already removed from bus list.
> if searching it by BDF. got NULL.
> ...
>
> so, searching device by BDF, only works for the ATS invalidation
> request is from device driver.
>
Anything related to the bus notifier has been fixed by patch02.

The remaining logic is really for fixing the racing invalidation from the
device driver.
> From: Ethan Zhao <[email protected]>
> Sent: Tuesday, January 30, 2024 5:13 PM
>
> On 1/30/2024 4:43 PM, Tian, Kevin wrote:
> >> From: Ethan Zhao <[email protected]>
> >> Sent: Tuesday, January 30, 2024 4:16 PM
> >>
> >> On 1/30/2024 2:22 PM, Tian, Kevin wrote:
> >>> Here we need consider two situations.
> >>>
> >>> One is that the device is not bound to a driver or bound to a driver
> >>> which doesn't do active work to the device when it's removed. In
> >>> that case one may observe the timeout situation only in the removal
> >>> path as the stack dump in your patch02 shows.
> >> When iommu_bus_notifier() got called for hotplug removal cases to
> >> flush devTLB (ATS invalidation), driver was already unloaded.
> >> whatever safe removal or surprise removal. so in theory no active
> >> driver working there.
> >>
> >> pciehp_ist()
> >> pciehp_disable_slot()
> >> remove_board()
> >> pciehp_unconfigure_device()
> >> pci_stop_and_remove_bus_device()
> >> pci_stop_bus_device()--->here unload driver
> >> pci_remove_bus_device()->here qi_flush_dev_iotlb() got called.
> > yes, so patch02 can fix this case.
> >
> >>> patch02 can fix that case by checking whether the device is present
> >>> to skip sending the invalidation requests. So the logic being discussed
> >>> here doesn't matter.
> >>>
> >>> The 2nd situation is more tricky. The device might be bound to
> >>> a driver which is doing active work to the device with in-fly
> >>> ATS invalidation requests. In this case in-fly requests must be aborted
> >>> before the driver can be detached from the removed device.
> Conceptually
> >>> a device is removed from the bus only after its driver is detached.
> >> Some tricky situations:
> >>
> >> 1. The ATS invalidation request is issued from driver driver, while it is
> >> in handling, device is removed. this momment, the device instance still
> >> exists in the bus list. yes, if searching it by BDF, could get it.
> > it's searchable between the point where the device is removed and the
> > point where the driver is unloaded:
> >
> > CPU0 CPU1
> > (Driver is active) (pciehp handler)
> > qi_submit_sync() pciehp_ist()
> > ... ...
> > loop for completion() { pciehp_unconfigure_device()
> > ... pci_dev_set_disconnected()
> > if (ITE) { ...
> > //find pci_dev from sid pci_remove_bus_device()
> > if (pci_dev_is_connected()) device_del()
> > break; bus_remove_device()
> > } device_remove_driver()
>
> If the device was hot plugin or re-scanned, the device has a PCI_DEV_ADDED
> flag,
In this case, is pci_dev_is_disconnected() true or false?

How is this patch supposed to work with it?
> if so the driver unloading work isn't defered to the tail of device_del(), it
> is unloaded before pci_remove_bus_device()->device_del(), in pci_stop_dev
>
> pci_stop_bus_device()
> pci_stop_dev()
> {
> if (pci_dev_is_added(dev)) {
> device_release_driver(&dev->dev);
> }
No matter where the driver unload is requested, it needs to wait for the
in-flight request on CPU0 to be aborted.
>
> So the interval the device is searchable, only applied to those devices
> not hot plugged, or never be scanned.
>
And in the worst case, even if the pci_dev is not searchable, isn't that
already an indicator that the device is absent, so qi_submit_sync() should
just exit upon ITE?
On 1/30/2024 4:43 PM, Tian, Kevin wrote:
>> From: Ethan Zhao <[email protected]>
>> Sent: Tuesday, January 30, 2024 4:16 PM
>>
>> On 1/30/2024 2:22 PM, Tian, Kevin wrote:
>>> Here we need consider two situations.
>>>
>>> One is that the device is not bound to a driver or bound to a driver
>>> which doesn't do active work to the device when it's removed. In
>>> that case one may observe the timeout situation only in the removal
>>> path as the stack dump in your patch02 shows.
>> When iommu_bus_notifier() got called for hotplug removal cases to
>> flush devTLB (ATS invalidation), driver was already unloaded.
>> whatever safe removal or surprise removal. so in theory no active
>> driver working there.
>>
>> pciehp_ist()
>> pciehp_disable_slot()
>> remove_board()
>> pciehp_unconfigure_device()
>> pci_stop_and_remove_bus_device()
>> pci_stop_bus_device()--->here unload driver
>> pci_remove_bus_device()->here qi_flush_dev_iotlb() got called.
> yes, so patch02 can fix this case.
>
>>> patch02 can fix that case by checking whether the device is present
>>> to skip sending the invalidation requests. So the logic being discussed
>>> here doesn't matter.
>>>
>>> The 2nd situation is more tricky. The device might be bound to
>>> a driver which is doing active work to the device with in-fly
>>> ATS invalidation requests. In this case in-fly requests must be aborted
>>> before the driver can be detached from the removed device. Conceptually
>>> a device is removed from the bus only after its driver is detached.
>> Some tricky situations:
>>
>> 1. The ATS invalidation request is issued from driver driver, while it is
>> in handling, device is removed. this momment, the device instance still
>> exists in the bus list. yes, if searching it by BDF, could get it.
> it's searchable between the point where the device is removed and the
> point where the driver is unloaded:
>
> CPU0 CPU1
> (Driver is active) (pciehp handler)
> qi_submit_sync() pciehp_ist()
> ... ...
> loop for completion() { pciehp_unconfigure_device()
> ... pci_dev_set_disconnected()
> if (ITE) { ...
> //find pci_dev from sid pci_remove_bus_device()
> if (pci_dev_is_connected()) device_del()
> break; bus_remove_device()
> } device_remove_driver()
If the device was hot-plugged or re-scanned, it has the PCI_DEV_ADDED flag;
if so, the driver unloading work isn't deferred to the tail of device_del(),
it is unloaded before pci_remove_bus_device()->device_del(), in pci_stop_dev():

pci_stop_bus_device()
  pci_stop_dev()
  {
        if (pci_dev_is_added(dev)) {
                device_release_driver(&dev->dev);
        }

So the interval in which the device is searchable only applies to those
devices that were not hot-plugged, or were never re-scanned.
Thanks,
Ethan
> .. //wait for driver unload
> }
> ..
> return;
>
> BUS_NOTIFY_REMOVED_DEVICE;
> list_del(&dev->bus_list);
>
> (I didn’t draw the full calling stack on the right hand side)
>
>> 2. The ATS invalidation request is issued from iommu_bus_notifier()
>> for surprise removal reason, as shown in above calltrace, device was
>> already removed from bus list. if searching it by BDF, return NULL.
>>
>> 3. The ATS invlidation request is issued from iommu_bus_notifier()
>> for safe removal, when is in handling, device is removed or link
>> is down. also as #2, device was already removed from bus list.
>> if searching it by BDF. got NULL.
>> ...
>>
>> so, searching device by BDF, only works for the ATS invalidation
>> request is from device driver.
>>
> anything related to bus notifier has been fixed by patch02.
>
> the remaining logic is really for fixing the race invalidation from
> device driver.
On Tue, Jan 30, 2024 at 04:15:33PM +0800, Ethan Zhao wrote:
> Some tricky situations:
>
> 1. The ATS invalidation request is issued from driver driver, while it is
> in handling, device is removed. this momment, the device instance still
> exists in the bus list. yes, if searching it by BDF, could get it.
>
> 2. The ATS invalidation request is issued from iommu_bus_notifier()
> for surprise removal reason, as shown in above calltrace, device was
> already removed from bus list. if searching it by BDF, return NULL.
>
> 3. The ATS invlidation request is issued from iommu_bus_notifier()
> for safe removal, when is in handling, device is removed or link
> is down. also as #2, device was already removed from bus list.
> if searching it by BDF. got NULL.
> ...
>
> so, searching device by BDF, only works for the ATS invalidation
> request is from device driver.
In the good path, where the hot removal is expected and this is about
coordinating, the IOMMU driver should do an orderly shutdown of the
ATS mechanism:
1 Write to PCI config space to disable the ATS
2 Make the IOMMU respond to ATS requests with UR and set it to BLOCKED
3 Issue a flush of the ATC
4 Wait for all outstanding ATC flushes to complete
If it is a bad/surprise path where the device is already gone then:
1 should automatically not do anything, possibly timing out
2 must succeed
3 should time out
4 should "complete" in that the ATC flushes are all timed out
IMHO all you need to do is not crash/lockup while processing the ATC
timeouts. If this is a surprise path then the ATC timeout might
already have happened before the iommu driver's remove notifier event happens.

If the driver needs to translate from the IOMMU device table index
into a struct device, it is probably best to do that inside the driver;
e.g. ARM maintains an rbtree in the iommu dev data (see
arm_smmu_insert_master()).
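
For concreteness, a rough sketch of that ordered teardown in VT-d terms.
pci_disable_ats(), intel_pasid_tear_down_entry() and qi_flush_dev_iotlb()
exist in the tree, but the wrapper below and its placement are assumptions,
not code from this series:

/* Sketch: orderly ATS shutdown for a device that is going away. */
static void example_ats_orderly_shutdown(struct device_domain_info *info)
{
        struct intel_iommu *iommu = info->iommu;
        struct pci_dev *pdev = to_pci_dev(info->dev);

        /* 1. stop the endpoint from issuing new translation requests */
        pci_disable_ats(pdev);

        /* 2. block translations at the IOMMU so further requests get UR */
        intel_pasid_tear_down_entry(iommu, info->dev, IOMMU_NO_PASID, false);

        /* 3 + 4. flush the device's ATC and wait for completion (or a bounded timeout) */
        qi_flush_dev_iotlb(iommu, PCI_DEVID(info->bus, info->devfn),
                           info->pfsid, info->ats_qdep, 0, MAX_AGAW_PFN_WIDTH);
}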
Jason
On 1/30/2024 5:24 PM, Tian, Kevin wrote:
>> From: Ethan Zhao <[email protected]>
>> Sent: Tuesday, January 30, 2024 5:13 PM
>>
>> On 1/30/2024 4:43 PM, Tian, Kevin wrote:
>>>> From: Ethan Zhao <[email protected]>
>>>> Sent: Tuesday, January 30, 2024 4:16 PM
>>>>
>>>> On 1/30/2024 2:22 PM, Tian, Kevin wrote:
>>>>> Here we need consider two situations.
>>>>>
>>>>> One is that the device is not bound to a driver or bound to a driver
>>>>> which doesn't do active work to the device when it's removed. In
>>>>> that case one may observe the timeout situation only in the removal
>>>>> path as the stack dump in your patch02 shows.
>>>> When iommu_bus_notifier() got called for hotplug removal cases to
>>>> flush devTLB (ATS invalidation), driver was already unloaded.
>>>> whatever safe removal or surprise removal. so in theory no active
>>>> driver working there.
>>>>
>>>> pciehp_ist()
>>>> pciehp_disable_slot()
>>>> remove_board()
>>>> pciehp_unconfigure_device()
>>>> pci_stop_and_remove_bus_device()
>>>> pci_stop_bus_device()--->here unload driver
>>>> pci_remove_bus_device()->here qi_flush_dev_iotlb() got called.
>>> yes, so patch02 can fix this case.
>>>
>>>>> patch02 can fix that case by checking whether the device is present
>>>>> to skip sending the invalidation requests. So the logic being discussed
>>>>> here doesn't matter.
>>>>>
>>>>> The 2nd situation is more tricky. The device might be bound to
>>>>> a driver which is doing active work to the device with in-fly
>>>>> ATS invalidation requests. In this case in-fly requests must be aborted
>>>>> before the driver can be detached from the removed device.
>> Conceptually
>>>>> a device is removed from the bus only after its driver is detached.
>>>> Some tricky situations:
>>>>
>>>> 1. The ATS invalidation request is issued from driver driver, while it is
>>>> in handling, device is removed. this momment, the device instance still
>>>> exists in the bus list. yes, if searching it by BDF, could get it.
>>> it's searchable between the point where the device is removed and the
>>> point where the driver is unloaded:
>>>
>>> CPU0 CPU1
>>> (Driver is active) (pciehp handler)
>>> qi_submit_sync() pciehp_ist()
>>> ... ...
>>> loop for completion() { pciehp_unconfigure_device()
>>> ... pci_dev_set_disconnected()
>>> if (ITE) { ...
>>> //find pci_dev from sid pci_remove_bus_device()
>>> if (pci_dev_is_connected()) device_del()
>>> break; bus_remove_device()
>>> } device_remove_driver()
>> If the device was hot plugin or re-scanned, the device has a PCI_DEV_ADDED
>> flag,
> in this case is pci_dev_is_disconnected() true or false?
>
> how is this patch supposed to work with it?
pci_dev_is_disconnected() is true for safe removal and false for surprise
removal, but it is not called in this patch; it is used in patch[2/5] and
explained in its commit log. This patch uses pci_device_is_present() to
check whether the device is present or not: that helper consults
pci_dev_is_disconnected() and then checks presence by reading the PCI
Vendor ID from configuration space (a specific protocol in the PCIe spec).
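
For reference, roughly what pci_device_is_present() does, paraphrased from
memory (details may differ from the current drivers/pci sources):

bool pci_device_is_present(struct pci_dev *pdev)
{
        u32 v;

        /* a device already marked disconnected is reported as absent */
        if (pci_dev_is_disconnected(pdev))
                return false;

        /* otherwise probe for it by reading the Vendor ID from config space */
        return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
}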
>
>> if so the driver unloading work isn't defered to the tail of device_del(), it
>> is unloaded before pci_remove_bus_device()->device_del(), in pci_stop_dev
>>
>> pci_stop_bus_device()
>> pci_stop_dev()
>> {
>> if (pci_dev_is_added(dev)) {
>> device_release_driver(&dev->dev);
>> }
> no matter where driver unload is requested, it needs to wait for aborting
> in-fly request on CPU0.
Yes, the driver unloading process has a complex synchronization mechanism
in __device_release_driver() to do that.
>
>> So the interval the device is searchable, only applied to those devices
>> not hot plugged, or never be scanned.
>>
> and in the worst case even if pci_dev is not searchable, isn't it already
> an indicator that the device is absent then qi_submit_sync() should
> just exit upon ITE?
Hmmm, if the pci_dev is not searchable, that only means the pci_dev instance
is no longer in the bus list or device list; it does not mean the device is
disconnected or not present at that moment. :)
Thanks,
Ethan
On 2024/1/31 0:29, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2024 at 04:15:33PM +0800, Ethan Zhao wrote:
>> Some tricky situations:
>>
>> 1. The ATS invalidation request is issued from driver driver, while it is
>> in handling, device is removed. this momment, the device instance still
>> exists in the bus list. yes, if searching it by BDF, could get it.
>>
>> 2. The ATS invalidation request is issued from iommu_bus_notifier()
>> for surprise removal reason, as shown in above calltrace, device was
>> already removed from bus list. if searching it by BDF, return NULL.
>>
>> 3. The ATS invlidation request is issued from iommu_bus_notifier()
>> for safe removal, when is in handling, device is removed or link
>> is down. also as #2, device was already removed from bus list.
>> if searching it by BDF. got NULL.
>> ...
>>
>> so, searching device by BDF, only works for the ATS invalidation
>> request is from device driver.
> In the good path, where the hot removal is expected and this is about
> coordinating, the IOMMU driver should do an orderly shutdown of the
> ATS mechanism:
>
> 1 Write to PCI config space to disable the ATS
> 2 Make the IOMMU respond to ATS requests with UR and set it to BLOCKED
> 3 Issue a flush of the ATC
> 4 Wait for all outstanding ATC flushes to complete
>
> If it is a bad/surprise path where the device is already gone then:
>
> 1 should automatically not do anything, possibly timing out
> 2 must succeed
> 3 should time out
> 4 should "complete" in that the ATC flushes are all timed out
>
> IMHO all you need to do is not crash/lockup while processing the ATC
> timeouts. If this is a surprise path then the ATC timeout might
> already happened before the iommu driver remove notifier event happens.
>
> If the driver needs to translate from the IOMMU device table index
> into a struct device it is probably best to do that inside the driver.
>
> eg ARM maintains a rbtree in the iommu dev data. (see
> arm_smmu_insert_master)
An rbtree of IOMMU device data for the VT-d driver would be beneficial.
It would also benefit other fault handling paths, such as the I/O page
fault handling path, which currently still relies on the PCI subsystem
to convert a RID value into a pci_dev structure.

Given that such an rbtree would be helpful to multiple individual
drivers that handle PCI devices, it seems valuable to implement it in
the core?
Best regards,
baolu
On Wed, Jan 31, 2024 at 02:21:20PM +0800, Baolu Lu wrote:
> An rbtree for IOMMU device data for the VT-d driver would be beneficial.
> It also benefits other paths of fault handling, such as the I/O page
> fault handling path, where it currently still relies on the PCI
> subsystem to convert a RID value into a pci_device structure.
>
> Given that such an rbtree would be helpful for multiple individual
> drivers that handle PCI devices, it seems valuable to implement it in
> the core?
rbtree is already supposed to be a re-usable library.

There is already good helper support in rbtree to make things easy to
implement. I see the ARM driver hasn't used those helpers yet; it should
look something like this:
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f58e77b99d476b..ebf86c6a8787c4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1673,26 +1673,37 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
return 0;
}
+static int arm_smmu_streams_cmp_key(const void *lhs, const struct rb_node *rhs)
+{
+ struct arm_smmu_stream *stream_rhs =
+ rb_entry(rhs, struct arm_smmu_stream, node);
+ const u32 *sid_lhs = lhs;
+
+ if (*sid_lhs < stream_rhs->id)
+ return -1;
+ if (*sid_lhs > stream_rhs->id)
+ return 1;
+ return 0;
+}
+
+static int arm_smmu_streams_cmp_node(struct rb_node *lhs,
+ const struct rb_node *rhs)
+{
+ return arm_smmu_streams_cmp_key(
+ &rb_entry(lhs, struct arm_smmu_stream, node)->id, rhs);
+}
+
static struct arm_smmu_master *
arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
{
struct rb_node *node;
- struct arm_smmu_stream *stream;
lockdep_assert_held(&smmu->streams_mutex);
- node = smmu->streams.rb_node;
- while (node) {
- stream = rb_entry(node, struct arm_smmu_stream, node);
- if (stream->id < sid)
- node = node->rb_right;
- else if (stream->id > sid)
- node = node->rb_left;
- else
- return stream->master;
- }
-
- return NULL;
+ node = rb_find(&sid, &smmu->streams, arm_smmu_streams_cmp_key);
+ if (!node)
+ return NULL;
+ return rb_entry(node, struct arm_smmu_stream, node)->master;
}
/* IRQ and event handlers */
@@ -3324,8 +3335,6 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
{
int i;
int ret = 0;
- struct arm_smmu_stream *new_stream, *cur_stream;
- struct rb_node **new_node, *parent_node = NULL;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
master->streams = kcalloc(fwspec->num_ids, sizeof(*master->streams),
@@ -3336,6 +3345,7 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
mutex_lock(&smmu->streams_mutex);
for (i = 0; i < fwspec->num_ids; i++) {
+ struct arm_smmu_stream *new_stream;
u32 sid = fwspec->ids[i];
new_stream = &master->streams[i];
@@ -3347,28 +3357,13 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
break;
/* Insert into SID tree */
- new_node = &(smmu->streams.rb_node);
- while (*new_node) {
- cur_stream = rb_entry(*new_node, struct arm_smmu_stream,
- node);
- parent_node = *new_node;
- if (cur_stream->id > new_stream->id) {
- new_node = &((*new_node)->rb_left);
- } else if (cur_stream->id < new_stream->id) {
- new_node = &((*new_node)->rb_right);
- } else {
- dev_warn(master->dev,
- "stream %u already in tree\n",
- cur_stream->id);
- ret = -EINVAL;
- break;
- }
- }
- if (ret)
+ if (rb_find_add(&new_stream->node, &smmu->streams,
+ arm_smmu_streams_cmp_node)) {
+ dev_warn(master->dev, "stream %u already in tree\n",
+ sid);
+ ret = -EINVAL;
break;
-
- rb_link_node(&new_stream->node, parent_node, new_node);
- rb_insert_color(&new_stream->node, &smmu->streams);
+ }
}
if (ret) {
On 2/2/24 3:34 AM, Jason Gunthorpe wrote:
> On Wed, Jan 31, 2024 at 02:21:20PM +0800, Baolu Lu wrote:
>> An rbtree for IOMMU device data for the VT-d driver would be beneficial.
>> It also benefits other paths of fault handling, such as the I/O page
>> fault handling path, where it currently still relies on the PCI
>> subsystem to convert a RID value into a pci_device structure.
>>
>> Given that such an rbtree would be helpful for multiple individual
>> drivers that handle PCI devices, it seems valuable to implement it in
>> the core?
> rbtree is already supposed to be a re-usable library.
>
> There is already good helper support in rbtree to make things easy to
> implement. I see arm hasn't used them yet, it should look something
> like this:
I have posted a similar implementation for the vt-d driver here:
https://lore.kernel.org/linux-iommu/[email protected]/
Based on this implementation, only patches 1 and 2 are required. The
last patch could be like below (code compiled but not tested, comments
not changed yet):
diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index f9b63c2875f7..30a659a4d3ed 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1272,6 +1272,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
{
u32 fault;
int head, tail;
+ u64 iqe_err, ite_sid;
struct q_inval *qi = iommu->qi;
int shift = qi_shift(iommu);
@@ -1316,6 +1317,13 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
tail = readl(iommu->reg + DMAR_IQT_REG);
tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
+ /*
+ * SID field is valid only when the ITE field is Set in FSTS_REG
+ * see Intel VT-d spec r4.1, section 11.4.9.9
+ */
+ iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
+ ite_sid = DMAR_IQER_REG_ITESID(iqe_err);
+
writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
pr_info("Invalidation Time-out Error (ITE) cleared\n");
@@ -1325,6 +1333,19 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
head = (head - 2 + QI_LENGTH) % QI_LENGTH;
} while (head != tail);
+ /*
+ * If got ITE, we need to check if the sid of ITE is the same as
+ * current ATS invalidation target device, if yes, don't try this
+ * request anymore if the target device isn't present.
+ * 0 value of ite_sid means old VT-d device, no ite_sid value.
+ */
+ if (ite_sid) {
+ struct device *dev = device_rbtree_find(iommu, ite_sid);
+
+ if (!dev || !pci_device_is_present(to_pci_dev(dev)))
+ return -ETIMEDOUT;
+ }
+
if (qi->desc_status[wait_index] == QI_ABORT)
return -EAGAIN;
}
Best regards,
baolu