2013-08-21 07:16:25

by Takao Indoh

[permalink] [raw]
Subject: [PATCH] intel-iommu: Quiesce devices before disabling IOMMU

This patch quiesces devices before disabling IOMMU on boot to stop
ongoing DMA. In intel_iommu_init(), check context entries and if there
is entry whose present bit is set then reset corresponding device.

When IOMMU is already enabled on boot, it is disabled and new DMAR table
is created and then re-enabled in intel_iommu_init(). This causes DMAR
faults if there are in-flight DMAs.

This causes problem on kdump. Devices are working in first kernel, and
after switching to second kernel and initializing IOMMU, many DMAR faults
occur and it causes problems like driver error or PCI SERR, at last
kdump fails. This patch fixes this problem.

Signed-off-by: Takao Indoh <[email protected]>


NOTE:
To reset devices this patch uses bus reset interface introduced by
following commits in PCI "next" branch.

64e8674fbe6bc848333a9b7e19f8cc019dde9eab
5c32b35b004f5ef70dcf62bbc42b8bed1e50b471
2e35afaefe64946caaecfacaf7fb568e46185e88
608c388122c72e1bf11ba8113434eb3d0c40c32d
77cb985ad4acbe66a92ead1bb826deffa47dd33f
090a3c5322e900f468b3205b76d0837003ad57b2
a6cbaadea0af9b4aa6eee2882f2aa761ab91a4f8
de0c548c33429cc78fd47a3c190c6d00b0e4e441
1b95ce8fc9c12fdb60047f2f9950f29e76e7c66d
---
drivers/iommu/intel-iommu.c | 55 ++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index eec0d3e..efb98eb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
.notifier_call = device_notifier,
};

+/* Reset PCI devices if its entry exists in DMAR table */
+static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
+{
+ u64 addr;
+ struct root_entry *root;
+ struct context_entry *context;
+ int bus, devfn;
+ struct pci_dev *dev;
+
+ addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
+ if (!addr)
+ return;
+
+ /*
+ * In the case of kdump, ioremap is needed because root-entry table
+ * exists in first kernel's memory area which is not mapped in second
+ * kernel
+ */
+ root = (struct root_entry *)ioremap(addr, PAGE_SIZE);
+ if (!root)
+ return;
+
+ for (bus = 0; bus < ROOT_ENTRY_NR; bus++) {
+ if (!root_present(&root[bus]))
+ continue;
+
+ context = (struct context_entry *)ioremap(
+ root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
+ if (!context)
+ continue;
+
+ for (devfn = 0; devfn < ROOT_ENTRY_NR; devfn++) {
+ if (!context_present(&context[devfn]))
+ continue;
+
+ dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
+ if (!dev)
+ continue;
+
+ if (!pci_reset_bus(dev->bus)) /* go to next bus */
+ break;
+ else /* Try per-function reset */
+ pci_reset_function(dev);
+
+ }
+ iounmap(context);
+ }
+ iounmap(root);
+}
+
int __init intel_iommu_init(void)
{
int ret = 0;
@@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void)
continue;

iommu = drhd->iommu;
- if (iommu->gcmd & DMA_GCMD_TE)
+ if (iommu->gcmd & DMA_GCMD_TE) {
+ if (reset_devices)
+ iommu_reset_devices(iommu, drhd->segment);
iommu_disable_translation(iommu);
+ }
}

if (dmar_dev_scope_init() < 0) {
--
1.7.1


2013-09-08 11:48:53

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH] intel-iommu: Quiesce devices before disabling IOMMU

Hi,
Patch is great and works on my HP-z420.
There are several small concerns, please see inline comments.

On 08/21/13 at 04:15pm, Takao Indoh wrote:
> This patch quiesces devices before disabling IOMMU on boot to stop
> ongoing DMA. In intel_iommu_init(), check context entries and if there
> is entry whose present bit is set then reset corresponding device.
>
> When IOMMU is already enabled on boot, it is disabled and new DMAR table
> is created and then re-enabled in intel_iommu_init(). This causes DMAR
> faults if there are in-flight DMAs.
>
> This causes problem on kdump. Devices are working in first kernel, and
> after switching to second kernel and initializing IOMMU, many DMAR faults
> occur and it causes problems like driver error or PCI SERR, at last
> kdump fails. This patch fixes this problem.
>
> Signed-off-by: Takao Indoh <[email protected]>
>
>
> NOTE:
> To reset devices this patch uses bus reset interface introduced by
> following commits in PCI "next" branch.
>
> 64e8674fbe6bc848333a9b7e19f8cc019dde9eab
> 5c32b35b004f5ef70dcf62bbc42b8bed1e50b471
> 2e35afaefe64946caaecfacaf7fb568e46185e88
> 608c388122c72e1bf11ba8113434eb3d0c40c32d
> 77cb985ad4acbe66a92ead1bb826deffa47dd33f
> 090a3c5322e900f468b3205b76d0837003ad57b2
> a6cbaadea0af9b4aa6eee2882f2aa761ab91a4f8
> de0c548c33429cc78fd47a3c190c6d00b0e4e441
> 1b95ce8fc9c12fdb60047f2f9950f29e76e7c66d
> ---
> drivers/iommu/intel-iommu.c | 55 ++++++++++++++++++++++++++++++++++++++++++-
> 1 files changed, 54 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index eec0d3e..efb98eb 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
> .notifier_call = device_notifier,
> };
>
> +/* Reset PCI devices if its entry exists in DMAR table */
> +static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
> +{
> + u64 addr;
> + struct root_entry *root;
> + struct context_entry *context;
> + int bus, devfn;
> + struct pci_dev *dev;
> +
> + addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
> + if (!addr)
> + return;
> +
> + /*
> + * In the case of kdump, ioremap is needed because root-entry table
> + * exists in first kernel's memory area which is not mapped in second
> + * kernel
> + */
> + root = (struct root_entry *)ioremap(addr, PAGE_SIZE);
> + if (!root)
> + return;
> +
> + for (bus = 0; bus < ROOT_ENTRY_NR; bus++) {
> + if (!root_present(&root[bus]))
> + continue;
> +
> + context = (struct context_entry *)ioremap(
> + root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
> + if (!context)
> + continue;
> +
> + for (devfn = 0; devfn < ROOT_ENTRY_NR; devfn++) {
For context_entry table, the context_entry has the same size as
root_entry currently, so it's also correct to use ROOT_ENTRY_NR. But for
future extention and removing confusion, can we define a new MACRO on
calculating the size of context_entry table, e.g CONTEXT_ENTRY_NR?

> + if (!context_present(&context[devfn]))
> + continue;
> +
> + dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
> + if (!dev)
> + continue;
> +
> + if (!pci_reset_bus(dev->bus)) /* go to next bus */

Here, we have got the specific dev, why don't we just call
pci_reset_function? If call pci_reset_bus here, will it repeat resetting
devices on the same bus many times?

> + break;
> + else /* Try per-function reset */
> + pci_reset_function(dev);
> +
> + }
> + iounmap(context);
> + }
> + iounmap(root);
> +}
> +
> int __init intel_iommu_init(void)
> {
> int ret = 0;
> @@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void)
> continue;
>
> iommu = drhd->iommu;
> - if (iommu->gcmd & DMA_GCMD_TE)
> + if (iommu->gcmd & DMA_GCMD_TE) {
> + if (reset_devices)
> + iommu_reset_devices(iommu, drhd->segment);

I remember the double reset issue vivek concerned in the old patch. Here
the iommu reset is done at the very beginning of pci_iommu_init, it's
after the bus subsystem init, it means here only iommu reset, am I
right?

If yes, I think this patch is clear and logic is simple.

BTW, what's the status of Alex's PCI patchset which this patch depends
on?

Baoquan
Thanks

> iommu_disable_translation(iommu);
> + }
> }
>
> if (dmar_dev_scope_init() < 0) {
> --
> 1.7.1
>
>
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec

2013-09-09 04:29:17

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH] intel-iommu: Quiesce devices before disabling IOMMU

(2013/09/08 20:47), Baoquan He wrote:
> Hi,
> Patch is great and works on my HP-z420.

Thank you for your test!

> There are several small concerns, please see inline comments.
>
> On 08/21/13 at 04:15pm, Takao Indoh wrote:
>> This patch quiesces devices before disabling IOMMU on boot to stop
>> ongoing DMA. In intel_iommu_init(), check context entries and if there
>> is entry whose present bit is set then reset corresponding device.
>>
>> When IOMMU is already enabled on boot, it is disabled and new DMAR table
>> is created and then re-enabled in intel_iommu_init(). This causes DMAR
>> faults if there are in-flight DMAs.
>>
>> This causes problem on kdump. Devices are working in first kernel, and
>> after switching to second kernel and initializing IOMMU, many DMAR faults
>> occur and it causes problems like driver error or PCI SERR, at last
>> kdump fails. This patch fixes this problem.
>>
>> Signed-off-by: Takao Indoh <[email protected]>
>>
>>
>> NOTE:
>> To reset devices this patch uses bus reset interface introduced by
>> following commits in PCI "next" branch.
>>
>> 64e8674fbe6bc848333a9b7e19f8cc019dde9eab
>> 5c32b35b004f5ef70dcf62bbc42b8bed1e50b471
>> 2e35afaefe64946caaecfacaf7fb568e46185e88
>> 608c388122c72e1bf11ba8113434eb3d0c40c32d
>> 77cb985ad4acbe66a92ead1bb826deffa47dd33f
>> 090a3c5322e900f468b3205b76d0837003ad57b2
>> a6cbaadea0af9b4aa6eee2882f2aa761ab91a4f8
>> de0c548c33429cc78fd47a3c190c6d00b0e4e441
>> 1b95ce8fc9c12fdb60047f2f9950f29e76e7c66d
>> ---
>> drivers/iommu/intel-iommu.c | 55 ++++++++++++++++++++++++++++++++++++++++++-
>> 1 files changed, 54 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index eec0d3e..efb98eb 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
>> .notifier_call = device_notifier,
>> };
>>
>> +/* Reset PCI devices if its entry exists in DMAR table */
>> +static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
>> +{
>> + u64 addr;
>> + struct root_entry *root;
>> + struct context_entry *context;
>> + int bus, devfn;
>> + struct pci_dev *dev;
>> +
>> + addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
>> + if (!addr)
>> + return;
>> +
>> + /*
>> + * In the case of kdump, ioremap is needed because root-entry table
>> + * exists in first kernel's memory area which is not mapped in second
>> + * kernel
>> + */
>> + root = (struct root_entry *)ioremap(addr, PAGE_SIZE);
>> + if (!root)
>> + return;
>> +
>> + for (bus = 0; bus < ROOT_ENTRY_NR; bus++) {
>> + if (!root_present(&root[bus]))
>> + continue;
>> +
>> + context = (struct context_entry *)ioremap(
>> + root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
>> + if (!context)
>> + continue;
>> +
>> + for (devfn = 0; devfn < ROOT_ENTRY_NR; devfn++) {
> For context_entry table, the context_entry has the same size as
> root_entry currently, so it's also correct to use ROOT_ENTRY_NR. But for
> future extention and removing confusion, can we define a new MACRO on
> calculating the size of context_entry table, e.g CONTEXT_ENTRY_NR?

That makes sense, will do in next version.


>
>> + if (!context_present(&context[devfn]))
>> + continue;
>> +
>> + dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
>> + if (!dev)
>> + continue;
>> +
>> + if (!pci_reset_bus(dev->bus)) /* go to next bus */
>
> Here, we have got the specific dev, why don't we just call
> pci_reset_function? If call pci_reset_bus here, will it repeat resetting
> devices on the same bus many times?

pci_reset_bus() can reset all devices on the same bus at the same time.
I think it is better than calling pci_reset_function() many times for
each device/function.

When pci_reset_bus() returns 0 (reset succeeded), we can immediately go
out of devfn loop by "break" and go to next bus loop.

>
>> + break;
>> + else /* Try per-function reset */
>> + pci_reset_function(dev);
>> +
>> + }
>> + iounmap(context);
>> + }
>> + iounmap(root);
>> +}
>> +
>> int __init intel_iommu_init(void)
>> {
>> int ret = 0;
>> @@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void)
>> continue;
>>
>> iommu = drhd->iommu;
>> - if (iommu->gcmd & DMA_GCMD_TE)
>> + if (iommu->gcmd & DMA_GCMD_TE) {
>> + if (reset_devices)
>> + iommu_reset_devices(iommu, drhd->segment);
>
> I remember the double reset issue vivek concerned in the old patch. Here
> the iommu reset is done at the very beginning of pci_iommu_init, it's
> after the bus subsystem init, it means here only iommu reset, am I
> right?

Yes, exactly.

>
> If yes, I think this patch is clear and logic is simple.
>
> BTW, what's the status of Alex's PCI patchset which this patch depends
> on?

It is merged to Bjorn's tree, will be in v3.12.

Thanks,
Takao Indoh


>
> Baoquan
> Thanks
>
>> iommu_disable_translation(iommu);
>> + }
>> }
>>
>> if (dmar_dev_scope_init() < 0) {
>> --
>> 1.7.1
>>
>>
>>
>> _______________________________________________
>> kexec mailing list
>> [email protected]
>> http://lists.infradead.org/mailman/listinfo/kexec
>
>

2013-09-09 04:46:33

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH] intel-iommu: Quiesce devices before disabling IOMMU

(2013/09/09 13:28), Takao Indoh wrote:
>> BTW, what's the status of Alex's PCI patchset which this patch depends
>> on?
>
> It is merged to Bjorn's tree, will be in v3.12.

This was wrong. Alex's patchset is already in Linus tree, v3.11.

Thanks,
Takao Indoh

2013-09-09 09:07:58

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] intel-iommu: Quiesce devices before disabling IOMMU

On Wed, 2013-08-21 at 16:15 +0900, Takao Indoh wrote:
>
> This causes problem on kdump. Devices are working in first kernel, and
> after switching to second kernel and initializing IOMMU, many DMAR faults
> occur and it causes problems like driver error or PCI SERR, at last
> kdump fails. This patch fixes this problem.

I'm not sure I'd call this a fix.

If the driver is so broken that it cannot get the device working again
after a fault, surely the driver needs to be fixed?

If the system is suffering an IRQ storm because device doesn't give up
after the first few faults, then we should switch off the fault
*reporting* for that device so that its faults get ignored (until it
next actually sets up a DMA mapping, or something).

For the IOMMU code to reset individual devices, just because they still
have an active DMA mapping even if they're not *doing* DMA, seems wrong.
You'll even end up resetting devices just because they have an RMRR,
won't you? (Although I wouldn't lose any sleep over that, I suppose. In
fact it might be a *feature*... :)

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation


Attachments:
smime.p7s (5.61 kB)

2013-09-10 05:43:52

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH] intel-iommu: Quiesce devices before disabling IOMMU

(2013/09/09 18:07), David Woodhouse wrote:
> On Wed, 2013-08-21 at 16:15 +0900, Takao Indoh wrote:
>>
>> This causes problem on kdump. Devices are working in first kernel, and
>> after switching to second kernel and initializing IOMMU, many DMAR faults
>> occur and it causes problems like driver error or PCI SERR, at last
>> kdump fails. This patch fixes this problem.
>
> I'm not sure I'd call this a fix.
>
> If the driver is so broken that it cannot get the device working again
> after a fault, surely the driver needs to be fixed?

Yes,this problem may be solved by fixing driver. Actually megaraid sas
driver is recently fixed for this problem. (See commit 6431f5d7)

But I think root cause of this problem is initializing IOMMU while DMA
is still working, and I want to solve the root cause rather than
handling it in each driver, otherwise we have to fix driver each time we
find this kind of problem.

>
> If the system is suffering an IRQ storm because device doesn't give up
> after the first few faults, then we should switch off the fault
> *reporting* for that device so that its faults get ignored (until it
> next actually sets up a DMA mapping, or something).

In such a case, yeah limiting messages is enough.

>
> For the IOMMU code to reset individual devices, just because they still
> have an active DMA mapping even if they're not *doing* DMA, seems wrong.
> You'll even end up resetting devices just because they have an RMRR,
> won't you? (Although I wouldn't lose any sleep over that, I suppose. In
> fact it might be a *feature*... :)

Right, current code is resetting devices which *may* be doing DMA. The
ideal way is finding devices which are actually doing DMA and reset only
them but I don't know how we can do this, though I think current code
is sufficient.

Thanks,
Takao Indoh

2013-09-18 11:29:17

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] intel-iommu: Quiesce devices before disabling IOMMU

On Tue, 2013-09-10 at 14:43 +0900, Takao Indoh wrote:
> (2013/09/09 18:07), David Woodhouse wrote:
> > If the driver is so broken that it cannot get the device working again
> > after a fault, surely the driver needs to be fixed?
>
> Yes,this problem may be solved by fixing driver. Actually megaraid sas
> driver is recently fixed for this problem. (See commit 6431f5d7)
>
> But I think root cause of this problem is initializing IOMMU while DMA
> is still working, and I want to solve the root cause rather than
> handling it in each driver, otherwise we have to fix driver each time we
> find this kind of problem.

But if the driver is broken and cannot actually recover from hardware
issues, the driver needs to be fixed *anyway*. We shouldn't be papering
over the problem.

> > For the IOMMU code to reset individual devices, just because they still
> > have an active DMA mapping even if they're not *doing* DMA, seems wrong.
>
> Right, current code is resetting devices which *may* be doing DMA. The
> ideal way is finding devices which are actually doing DMA and reset only
> them but I don't know how we can do this, though I think current code
> is sufficient.

No, that's not the ideal way either. Their DMA will be blocked, and
they'll stop (or at least we'll stop getting an interrupt and reporting
their DMA faults, if the hardware *is* so broken that it keeps trying
over and over again). The new driver will come up and reset the device,
and all will be well.

Do not paper over driver bugs. You are just *encouraging* brokenness.

We need to fix the 'fault storm' issue, by setting the FPD bit in the
context-entry for offending devices when appropriate, and then clearing
it again when appropriate too. But for the IOMMU code to go out and
trigger a PCI reset of random devices and buses is ABSOLUTELY WRONG.

Do Not Do This.

--
dwmw2


Attachments:
smime.p7s (5.61 kB)