2013-03-01 17:18:24

by Neil Horman

[permalink] [raw]
Subject: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem.

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: [email protected]
---
drivers/iommu/intel_irq_remapping.c | 20 ++++++++++++++++++++
include/linux/pci_ids.h | 2 ++
2 files changed, 22 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index f3b8f23..9bfb6c2 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1113,3 +1113,23 @@ struct irq_remap_ops intel_irq_remap_ops = {
.msi_setup_irq = intel_msi_setup_irq,
.setup_hpet_msi = intel_setup_hpet_msi,
};
+
+
+static void intel_remapping_check(struct pci_dev *dev)
+{
+ u8 revision;
+
+ pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
+
+ if ((revision == 0x13) && irq_remapping_enabled) {
+ pr_warn("WARNING WARNING WARNING WARNING WARNING WARNING\n"
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an errata making that\n"
+ "feature unstable. Please reboot with nointremap\n"
+ "added to the kernel command line and contact\n"
+ "your BIOS vendor for an update");
+ }
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);
+
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 31717bd..54027a6 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -2732,6 +2732,8 @@
#define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2
#define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3
#define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340
+#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403
+#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406
#define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429
#define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a
#define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b
--
1.7.11.7


2013-03-01 18:20:37

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Fri, Mar 1, 2013 at 9:17 AM, Neil Horman <[email protected]> wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem.
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: [email protected]
> ---
> drivers/iommu/intel_irq_remapping.c | 20 ++++++++++++++++++++
> include/linux/pci_ids.h | 2 ++
> 2 files changed, 22 insertions(+)
>
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index f3b8f23..9bfb6c2 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -1113,3 +1113,23 @@ struct irq_remap_ops intel_irq_remap_ops = {
> .msi_setup_irq = intel_msi_setup_irq,
> .setup_hpet_msi = intel_setup_hpet_msi,
> };
> +
> +
> +static void intel_remapping_check(struct pci_dev *dev)
> +{
> + u8 revision;
> +
> + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> +
> + if ((revision == 0x13) && irq_remapping_enabled) {
> + pr_warn("WARNING WARNING WARNING WARNING WARNING WARNING\n"
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an errata making that\n"
> + "feature unstable. Please reboot with nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");
> + }
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);

only for x86 platform?
If so, you can check that in arch/x86/kernel/early-quirks.c::early_quirks()
and set one flag and later print warning and skip there if someone
need to enable intr-remap.
So users will not need to reboot the system...

Thanks

Yinghai

> +
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 31717bd..54027a6 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -2732,6 +2732,8 @@
> #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2
> #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3
> #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340
> +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403
> +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406
> #define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429
> #define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a
> #define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-03-01 19:30:13

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Fri, Mar 01, 2013 at 10:20:35AM -0800, Yinghai Lu wrote:
> On Fri, Mar 1, 2013 at 9:17 AM, Neil Horman <[email protected]> wrote:
> > A few years back intel published a spec update:
> > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> >
> > For the 5520 and 5500 chipsets which contained an errata (specificially errata
> > 53), which noted that these chipsets can't properly do interrupt remapping, and
> > as a result the recommend that interrupt remapping be disabled in bios. While
> > many vendors have a bios update to do exactly that, not all do, and of course
> > not all users update their bios to a level that corrects the problem. As a
> > result, occasionally interrupts can arrive at a cpu even after affinity for that
> > interrupt has be moved, leading to lost or spurrious interrupts (usually
> > characterized by the message:
> > kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
> >
> > There have been several incidents recently of people seeing this error, and
> > investigation has shown that they have system for which their BIOS level is such
> > that this feature was not properly turned off. As such, it would be good to
> > give them a reminder that their systems are vulnurable to this problem.
> >
> > Signed-off-by: Neil Horman <[email protected]>
> > CC: Prarit Bhargava <[email protected]>
> > CC: Don Zickus <[email protected]>
> > CC: Don Dutile <[email protected]>
> > CC: Bjorn Helgaas <[email protected]>
> > CC: Asit Mallick <[email protected]>
> > CC: [email protected]
> > ---
> > drivers/iommu/intel_irq_remapping.c | 20 ++++++++++++++++++++
> > include/linux/pci_ids.h | 2 ++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> > index f3b8f23..9bfb6c2 100644
> > --- a/drivers/iommu/intel_irq_remapping.c
> > +++ b/drivers/iommu/intel_irq_remapping.c
> > @@ -1113,3 +1113,23 @@ struct irq_remap_ops intel_irq_remap_ops = {
> > .msi_setup_irq = intel_msi_setup_irq,
> > .setup_hpet_msi = intel_setup_hpet_msi,
> > };
> > +
> > +
> > +static void intel_remapping_check(struct pci_dev *dev)
> > +{
> > + u8 revision;
> > +
> > + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> > +
> > + if ((revision == 0x13) && irq_remapping_enabled) {
> > + pr_warn("WARNING WARNING WARNING WARNING WARNING WARNING\n"
> > + "This system BIOS has enabled interrupt remapping\n"
> > + "on a chipset that contains an errata making that\n"
> > + "feature unstable. Please reboot with nointremap\n"
> > + "added to the kernel command line and contact\n"
> > + "your BIOS vendor for an update");
> > + }
> > +}
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);
>
> only for x86 platform?
> If so, you can check that in arch/x86/kernel/early-quirks.c::early_quirks()
> and set one flag and later print warning and skip there if someone
> need to enable intr-remap.
> So users will not need to reboot the system...
>
> Thanks
>
I was under the impression that BIOS might have enabled irq remapping prior to
the OS booting, at which point a reboot was requried anyway.

Neil

2013-03-02 02:28:20

by Jiang Liu

[permalink] [raw]
Subject: Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 03/02/2013 02:20 AM, Yinghai Lu wrote:
> On Fri, Mar 1, 2013 at 9:17 AM, Neil Horman <[email protected]> wrote:
>> A few years back intel published a spec update:
>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>
>> For the 5520 and 5500 chipsets which contained an errata (specificially errata
>> 53), which noted that these chipsets can't properly do interrupt remapping, and
>> as a result the recommend that interrupt remapping be disabled in bios. While
>> many vendors have a bios update to do exactly that, not all do, and of course
>> not all users update their bios to a level that corrects the problem. As a
>> result, occasionally interrupts can arrive at a cpu even after affinity for that
>> interrupt has be moved, leading to lost or spurrious interrupts (usually
>> characterized by the message:
>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>
>> There have been several incidents recently of people seeing this error, and
>> investigation has shown that they have system for which their BIOS level is such
>> that this feature was not properly turned off. As such, it would be good to
>> give them a reminder that their systems are vulnurable to this problem.
>>
>> Signed-off-by: Neil Horman <[email protected]>
>> CC: Prarit Bhargava <[email protected]>
>> CC: Don Zickus <[email protected]>
>> CC: Don Dutile <[email protected]>
>> CC: Bjorn Helgaas <[email protected]>
>> CC: Asit Mallick <[email protected]>
>> CC: [email protected]
>> ---
>> drivers/iommu/intel_irq_remapping.c | 20 ++++++++++++++++++++
>> include/linux/pci_ids.h | 2 ++
>> 2 files changed, 22 insertions(+)
>>
>> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
>> index f3b8f23..9bfb6c2 100644
>> --- a/drivers/iommu/intel_irq_remapping.c
>> +++ b/drivers/iommu/intel_irq_remapping.c
>> @@ -1113,3 +1113,23 @@ struct irq_remap_ops intel_irq_remap_ops = {
>> .msi_setup_irq = intel_msi_setup_irq,
>> .setup_hpet_msi = intel_setup_hpet_msi,
>> };
>> +
>> +
>> +static void intel_remapping_check(struct pci_dev *dev)
>> +{
>> + u8 revision;
>> +
>> + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
>> +
>> + if ((revision == 0x13) && irq_remapping_enabled) {
>> + pr_warn("WARNING WARNING WARNING WARNING WARNING WARNING\n"
>> + "This system BIOS has enabled interrupt remapping\n"
>> + "on a chipset that contains an errata making that\n"
>> + "feature unstable. Please reboot with nointremap\n"
>> + "added to the kernel command line and contact\n"
>> + "your BIOS vendor for an update");
>> + }
>> +}
>> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
>> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);
>
> only for x86 platform?
> If so, you can check that in arch/x86/kernel/early-quirks.c::early_quirks()
> and set one flag and later print warning and skip there if someone
> need to enable intr-remap.
> So users will not need to reboot the system...
We have just struggled with this issue when doing kvm restarting stress tests,
and finally found it's a chipset errata. Thanks for fix it.
And I think 5520/5500 is for x86 only, so could move it to x86 arch subdirectory.

Regards!
Gerry

>
> Thanks
>
> Yinghai
>
>> +
>> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
>> index 31717bd..54027a6 100644
>> --- a/include/linux/pci_ids.h
>> +++ b/include/linux/pci_ids.h
>> @@ -2732,6 +2732,8 @@
>> #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2
>> #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3
>> #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340
>> +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403
>> +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406
>> #define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429
>> #define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a
>> #define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b
>> --
>> 1.7.11.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2013-03-02 15:59:46

by Andreas Mohr

[permalink] [raw]
Subject: Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

Hi,

> if ((revision == 0x13) && irq_remapping_enabled) {
> + pr_warn("WARNING WARNING WARNING WARNING WARNING
> WARNING\n"
> + "This system BIOS has enabled interrupt
> remapping\n"
> + "on a chipset that contains an errata making
> that\n"
> + "feature unstable. Please reboot with
> nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");
> + }

Forgive me, but ISTR that there's a special BIOS firmware quirk bug annotating
logger warning message mechanism (have I managed to hit all keywords yet? ;)
in the kernel which might be useful in this case.


OK, found something (but I don't think it was the mechanism
that ISTR - perhaps it got modernized?):


include/linux/printk.h:

/*
* FW_BUG
* Add this to a message where you are sure the firmware is buggy or
* behaves
* really stupid or out of spec. Be aware that the responsible BIOS
* developer
* should be able to fix this issue or at least get a concrete idea of
* the
* problem by reading your message without the need of looking at the
* kernel
* code.
*
* Use it for definite and high priority BIOS bugs.
*
* FW_WARN
* Use it for not that clear (e.g. could the kernel messed up things
* already?)
* and medium priority BIOS bugs.
*
* FW_INFO
* Use this one if you want to tell the user or vendor about something
* suspicious, but generally harmless related to the firmware.
*
* Use it for information or very low priority BIOS bugs.
*/


So perhaps it is appropriate to be used here.


HTH,

Andreas Mohr

2013-03-02 16:21:50

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 03/01/2013 12:17 PM, Neil Horman wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem.
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: [email protected]
> ---
> drivers/iommu/intel_irq_remapping.c | 20 ++++++++++++++++++++
> include/linux/pci_ids.h | 2 ++
> 2 files changed, 22 insertions(+)
>
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index f3b8f23..9bfb6c2 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -1113,3 +1113,23 @@ struct irq_remap_ops intel_irq_remap_ops = {
> .msi_setup_irq = intel_msi_setup_irq,
> .setup_hpet_msi = intel_setup_hpet_msi,
> };
> +
> +
> +static void intel_remapping_check(struct pci_dev *dev)
> +{
> + u8 revision;
> +
> + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> +
> + if ((revision == 0x13) && irq_remapping_enabled) {
> + pr_warn("WARNING WARNING WARNING WARNING WARNING WARNING\n"
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an errata making that\n"
> + "feature unstable. Please reboot with nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");

Make this one line? Might be too long but I believe the preferred policy is now
to keep the output on one line so that it is easy to find in the kernel source.

Also, IMO, remove the WARNING WARNING stuff.

You also should probably use HW_ERR here too.

P.

> + }
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);
> +
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 31717bd..54027a6 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -2732,6 +2732,8 @@
> #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2
> #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3
> #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340
> +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403
> +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406
> #define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429
> #define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a
> #define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b

2013-03-02 20:14:08

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Sat, Mar 02, 2013 at 11:21:29AM -0500, Prarit Bhargava wrote:
> On 03/01/2013 12:17 PM, Neil Horman wrote:
> > A few years back intel published a spec update:
> > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> >
> > For the 5520 and 5500 chipsets which contained an errata (specificially errata
> > 53), which noted that these chipsets can't properly do interrupt remapping, and
> > as a result the recommend that interrupt remapping be disabled in bios. While
> > many vendors have a bios update to do exactly that, not all do, and of course
> > not all users update their bios to a level that corrects the problem. As a
> > result, occasionally interrupts can arrive at a cpu even after affinity for that
> > interrupt has be moved, leading to lost or spurrious interrupts (usually
> > characterized by the message:
> > kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
> >
> > There have been several incidents recently of people seeing this error, and
> > investigation has shown that they have system for which their BIOS level is such
> > that this feature was not properly turned off. As such, it would be good to
> > give them a reminder that their systems are vulnurable to this problem.
> >
> > Signed-off-by: Neil Horman <[email protected]>
> > CC: Prarit Bhargava <[email protected]>
> > CC: Don Zickus <[email protected]>
> > CC: Don Dutile <[email protected]>
> > CC: Bjorn Helgaas <[email protected]>
> > CC: Asit Mallick <[email protected]>
> > CC: [email protected]
> > ---
> > drivers/iommu/intel_irq_remapping.c | 20 ++++++++++++++++++++
> > include/linux/pci_ids.h | 2 ++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> > index f3b8f23..9bfb6c2 100644
> > --- a/drivers/iommu/intel_irq_remapping.c
> > +++ b/drivers/iommu/intel_irq_remapping.c
> > @@ -1113,3 +1113,23 @@ struct irq_remap_ops intel_irq_remap_ops = {
> > .msi_setup_irq = intel_msi_setup_irq,
> > .setup_hpet_msi = intel_setup_hpet_msi,
> > };
> > +
> > +
> > +static void intel_remapping_check(struct pci_dev *dev)
> > +{
> > + u8 revision;
> > +
> > + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> > +
> > + if ((revision == 0x13) && irq_remapping_enabled) {
> > + pr_warn("WARNING WARNING WARNING WARNING WARNING WARNING\n"
> > + "This system BIOS has enabled interrupt remapping\n"
> > + "on a chipset that contains an errata making that\n"
> > + "feature unstable. Please reboot with nointremap\n"
> > + "added to the kernel command line and contact\n"
> > + "your BIOS vendor for an update");
>
> Make this one line? Might be too long but I believe the preferred policy is now
> to keep the output on one line so that it is easy to find in the kernel source.
>
> Also, IMO, remove the WARNING WARNING stuff.
>
> You also should probably use HW_ERR here too.
>
> P.
>
I can do that, I'll post version 2 on monday.
Neil

2013-03-04 13:24:46

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 03/02/2013 10:59 AM, Andreas Mohr wrote:
> Hi,
>
>> if ((revision == 0x13)&& irq_remapping_enabled) {
>> + pr_warn("WARNING WARNING WARNING WARNING WARNING
>> WARNING\n"
>> + "This system BIOS has enabled interrupt
>> remapping\n"
>> + "on a chipset that contains an errata making
>> that\n"
>> + "feature unstable. Please reboot with
>> nointremap\n"
>> + "added to the kernel command line and contact\n"
>> + "your BIOS vendor for an update");
>> + }
>
> Forgive me, but ISTR that there's a special BIOS firmware quirk bug annotating
> logger warning message mechanism (have I managed to hit all keywords yet? ;)
> in the kernel which might be useful in this case.
>
>
> OK, found something (but I don't think it was the mechanism
> that ISTR - perhaps it got modernized?):
>
>
> include/linux/printk.h:
>
> /*
> * FW_BUG
> * Add this to a message where you are sure the firmware is buggy or
> * behaves
> * really stupid or out of spec. Be aware that the responsible BIOS
> * developer
> * should be able to fix this issue or at least get a concrete idea of
> * the
> * problem by reading your message without the need of looking at the
> * kernel
> * code.
> *
> * Use it for definite and high priority BIOS bugs.
> *
> * FW_WARN
> * Use it for not that clear (e.g. could the kernel messed up things
> * already?)
> * and medium priority BIOS bugs.
> *
> * FW_INFO
> * Use this one if you want to tell the user or vendor about something
> * suspicious, but generally harmless related to the firmware.
> *
> * Use it for information or very low priority BIOS bugs.
> */
>

It is not a firmware/BIOS bug. Prarit's comment to annotate it as
a HW_ERR is more accurate. A software patch is being tested now
to see if it can do set-affinity in a manner that avoids this race
and enables IR to stay on for all these systems. It requires
more testing to ensure the logic is valid. This patch was
recommended as a necessary short-term fix, and to highlight to
others this possible state -- which Gerry mentioned he had.

Note: the race condition is a very small window, and could not
be duplicated on upstream kernel on known-failing-hw b/c
the interrupt rate dropped by 80%.

The errata's recommendation (to disable IR in the BIOS) makes
the option of a workaround impossible, since it will hide the
IR hw (no DMAR table for it). The BIOS 'update' may be avoided,
and corrected, if testing yields positive results, and after
posting, others don't see a flaw in the logic that was missed
by other reviewers.

>
> So perhaps it is appropriate to be used here.
>
>
> HTH,
>
> Andreas Mohr
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-03-04 19:04:45

by Neil Horman

[permalink] [raw]
Subject: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem.

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: [email protected]

---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.
---
arch/x86/kernel/quirks.c | 18 ++++++++++++++++++
include/linux/pci_ids.h | 2 ++
2 files changed, 20 insertions(+)

diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 26ee48a..a718ea2 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -5,6 +5,7 @@
#include <linux/irq.h>

#include <asm/hpet.h>
+#include "../../../drivers/iommu/irq_remapping.h"

#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)

@@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
quirk_amd_nb_node);

#endif
+
+static void intel_remapping_check(struct pci_dev *dev)
+{
+ u8 revision;
+
+ pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
+
+ if ((revision == 0x13) && irq_remapping_enabled) {
+ pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an errata making that\n"
+ "feature unstable. Please reboot with nointremap\n"
+ "added to the kernel command line and contact\n"
+ "your BIOS vendor for an update");
+ }
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 31717bd..54027a6 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -2732,6 +2732,8 @@
#define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2
#define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3
#define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340
+#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403
+#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406
#define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429
#define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a
#define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b
--
1.7.11.7

2013-03-09 20:49:55

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem.
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: [email protected]
>
Ping, anyone want to Ack/Nack this?
Neil

2013-03-09 22:21:02

by Myron Stowe

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman <[email protected]> wrote:
> On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote:
>> A few years back intel published a spec update:
>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>
>> For the 5520 and 5500 chipsets which contained an errata (specificially errata
>> 53), which noted that these chipsets can't properly do interrupt remapping, and
>> as a result the recommend that interrupt remapping be disabled in bios. While
>> many vendors have a bios update to do exactly that, not all do, and of course
>> not all users update their bios to a level that corrects the problem. As a
>> result, occasionally interrupts can arrive at a cpu even after affinity for that
>> interrupt has be moved, leading to lost or spurrious interrupts (usually
>> characterized by the message:
>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>
>> There have been several incidents recently of people seeing this error, and
>> investigation has shown that they have system for which their BIOS level is such
>> that this feature was not properly turned off. As such, it would be good to
>> give them a reminder that their systems are vulnurable to this problem.
>>
>> Signed-off-by: Neil Horman <[email protected]>
>> CC: Prarit Bhargava <[email protected]>
>> CC: Don Zickus <[email protected]>
>> CC: Don Dutile <[email protected]>
>> CC: Bjorn Helgaas <[email protected]>
>> CC: Asit Mallick <[email protected]>
>> CC: [email protected]
>>
> Ping, anyone want to Ack/Nack this?

Don's comment earlier seems to imply that this is a short term fix and
that a more long term fix may be coming soon. If that is the case
wouldn't we want to wait for the long term fix and just pull that in?

Myron

> Neil
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-03-10 01:11:58

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 03/04/2013 08:24 AM, Don Dutile wrote:
> On 03/02/2013 10:59 AM, Andreas Mohr wrote:
>> Hi,
>>
>>> if ((revision == 0x13)&& irq_remapping_enabled) {
>>> + pr_warn("WARNING WARNING WARNING WARNING WARNING
>>> WARNING\n"
>>> + "This system BIOS has enabled interrupt
>>> remapping\n"
>>> + "on a chipset that contains an errata making
>>> that\n"
>>> + "feature unstable. Please reboot with
>>> nointremap\n"
>>> + "added to the kernel command line and contact\n"
>>> + "your BIOS vendor for an update");
>>> + }
>>
>> Forgive me, but ISTR that there's a special BIOS firmware quirk bug annotating
>> logger warning message mechanism (have I managed to hit all keywords yet? ;)
>> in the kernel which might be useful in this case.
>>
>>
>> OK, found something (but I don't think it was the mechanism
>> that ISTR - perhaps it got modernized?):
>>
>>
>> include/linux/printk.h:
>>
>> /*
>> * FW_BUG
>> * Add this to a message where you are sure the firmware is buggy or
>> * behaves
>> * really stupid or out of spec. Be aware that the responsible BIOS
>> * developer
>> * should be able to fix this issue or at least get a concrete idea of
>> * the
>> * problem by reading your message without the need of looking at the
>> * kernel
>> * code.
>> *
>> * Use it for definite and high priority BIOS bugs.
>> *
>> * FW_WARN
>> * Use it for not that clear (e.g. could the kernel messed up things
>> * already?)
>> * and medium priority BIOS bugs.
>> *
>> * FW_INFO
>> * Use this one if you want to tell the user or vendor about something
>> * suspicious, but generally harmless related to the firmware.
>> *
>> * Use it for information or very low priority BIOS bugs.
>> */
>>
>
> It is not a firmware/BIOS bug.

Correct. This is a hardware bug that *may be* resolved through a BIOS update.
But there is no guarantee that a BIOS update is available. Labelling it a FW
bug would be a mistake.

Prarit's comment to annotate it as
> a HW_ERR is more accurate. A software patch is being tested now
> to see if it can do set-affinity in a manner that avoids this race
> and enables IR to stay on for all these systems. It requires
> more testing to ensure the logic is valid. This patch was
> recommended as a necessary short-term fix, and to highlight to
> others this possible state -- which Gerry mentioned he had.

Yup -- as mstowe asked ... should we even consider this patch then, or should we
wait for the possible real fix?

Having said that ... I'm nervous about playing around with the set-affinity path
for this HW problem. We're basically changing good/reliable code for broken-ass
hardware. :/ That doesn't seem a like a good choice to me.

I can understand if we all feel that the code is broken, or it can be made
better -- but to change it because of bad HW just doesn't seem like the right
thing to do.

IMO.

P.

2013-03-11 01:32:17

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 03/09/2013 05:20 PM, Myron Stowe wrote:
> On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman<[email protected]> wrote:
>> On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote:
>>> A few years back intel published a spec update:
>>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>>
>>> For the 5520 and 5500 chipsets which contained an errata (specificially errata
>>> 53), which noted that these chipsets can't properly do interrupt remapping, and
>>> as a result the recommend that interrupt remapping be disabled in bios. While
>>> many vendors have a bios update to do exactly that, not all do, and of course
>>> not all users update their bios to a level that corrects the problem. As a
>>> result, occasionally interrupts can arrive at a cpu even after affinity for that
>>> interrupt has be moved, leading to lost or spurrious interrupts (usually
>>> characterized by the message:
>>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>>
>>> There have been several incidents recently of people seeing this error, and
>>> investigation has shown that they have system for which their BIOS level is such
>>> that this feature was not properly turned off. As such, it would be good to
>>> give them a reminder that their systems are vulnurable to this problem.
>>>
>>> Signed-off-by: Neil Horman<[email protected]>
>>> CC: Prarit Bhargava<[email protected]>
>>> CC: Don Zickus<[email protected]>
>>> CC: Don Dutile<[email protected]>
>>> CC: Bjorn Helgaas<[email protected]>
>>> CC: Asit Mallick<[email protected]>
>>> CC: [email protected]
>>>
>> Ping, anyone want to Ack/Nack this?
>
> Don's comment earlier seems to imply that this is a short term fix and
> that a more long term fix may be coming soon. If that is the case
> wouldn't we want to wait for the long term fix and just pull that in?
>
> Myron
>
At the time of Neil's postings, multiple changes were being considered,
and we didn't know how long it would take to verify any one change.
Thus, Neil's patch was proposed to identify a known problem that was
being seen on multiple systems, and it was proposed so further
system issues wouldn't go mis-diagnosed.

We are still testing a minor change, and test results are positive so far.
After we're sure of its logic as it's results, it can be posted and
this patch can be removed if it is taken before we are certain.

As Prarit stated, we should be sure of changes in this area before
throwing this patch away for an option that could yield other failures.

- Don
>> Neil
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-03-11 11:25:13

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Sat, Mar 09, 2013 at 03:20:57PM -0700, Myron Stowe wrote:
> On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman <[email protected]> wrote:
> > On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote:
> >> A few years back intel published a spec update:
> >> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> >>
> >> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> >> 53), which noted that these chipsets can't properly do interrupt remapping, and
> >> as a result the recommend that interrupt remapping be disabled in bios. While
> >> many vendors have a bios update to do exactly that, not all do, and of course
> >> not all users update their bios to a level that corrects the problem. As a
> >> result, occasionally interrupts can arrive at a cpu even after affinity for that
> >> interrupt has be moved, leading to lost or spurrious interrupts (usually
> >> characterized by the message:
> >> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
> >>
> >> There have been several incidents recently of people seeing this error, and
> >> investigation has shown that they have system for which their BIOS level is such
> >> that this feature was not properly turned off. As such, it would be good to
> >> give them a reminder that their systems are vulnurable to this problem.
> >>
> >> Signed-off-by: Neil Horman <[email protected]>
> >> CC: Prarit Bhargava <[email protected]>
> >> CC: Don Zickus <[email protected]>
> >> CC: Don Dutile <[email protected]>
> >> CC: Bjorn Helgaas <[email protected]>
> >> CC: Asit Mallick <[email protected]>
> >> CC: [email protected]
> >>
> > Ping, anyone want to Ack/Nack this?
>
> Don's comment earlier seems to imply that this is a short term fix and
> that a more long term fix may be coming soon. If that is the case
> wouldn't we want to wait for the long term fix and just pull that in?
>
> Myron
>
As Don and Prarit have mentioned, an alternate change is being worked on and
tested that may work around this issue, but we're not yet sure that it will, and
we're not sure of the time frame for this fix. Normally I would agree, that it
would be easier just to wait for the long term fix, but as Prarit noted, since
this hardware is in fact broken, I would rather do a both approach. Its fine if
this gets reverted tomorrow with a longer term fix as far as I'm concerned, its
just caused enough problems already that I'd like to see it in place until the
better solution arrives.
Neil

2013-03-11 12:17:52

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 03/11/2013 07:25 AM, Neil Horman wrote:
> On Sat, Mar 09, 2013 at 03:20:57PM -0700, Myron Stowe wrote:
>> On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman <[email protected]> wrote:
>>> On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote:
>>>> A few years back intel published a spec update:
>>>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>>>
>>>> For the 5520 and 5500 chipsets which contained an errata (specificially errata
>>>> 53), which noted that these chipsets can't properly do interrupt remapping, and
>>>> as a result the recommend that interrupt remapping be disabled in bios. While
>>>> many vendors have a bios update to do exactly that, not all do, and of course
>>>> not all users update their bios to a level that corrects the problem. As a
>>>> result, occasionally interrupts can arrive at a cpu even after affinity for that
>>>> interrupt has be moved, leading to lost or spurrious interrupts (usually
>>>> characterized by the message:
>>>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>>>
>>>> There have been several incidents recently of people seeing this error, and
>>>> investigation has shown that they have system for which their BIOS level is such
>>>> that this feature was not properly turned off. As such, it would be good to
>>>> give them a reminder that their systems are vulnurable to this problem.
>>>>
>>>> Signed-off-by: Neil Horman <[email protected]>
>>>> CC: Prarit Bhargava <[email protected]>
>>>> CC: Don Zickus <[email protected]>
>>>> CC: Don Dutile <[email protected]>
>>>> CC: Bjorn Helgaas <[email protected]>
>>>> CC: Asit Mallick <[email protected]>
>>>> CC: [email protected]
>>>>
>>> Ping, anyone want to Ack/Nack this?
>>
>> Don's comment earlier seems to imply that this is a short term fix and
>> that a more long term fix may be coming soon. If that is the case
>> wouldn't we want to wait for the long term fix and just pull that in?
>>
>> Myron
>>
> As Don and Prarit have mentioned, an alternate change is being worked on and
> tested that may work around this issue, but we're not yet sure that it will, and
> we're not sure of the time frame for this fix. Normally I would agree, that it
> would be easier just to wait for the long term fix, but as Prarit noted, since
> this hardware is in fact broken, I would rather do a both approach. Its fine if
> this gets reverted tomorrow with a longer term fix as far as I'm concerned, its
> just caused enough problems already that I'd like to see it in place until the
> better solution arrives.

I agree with Neil on this. While vendors are supposed to fix their BIOSes,
experience has shown that not all vendors will fix their BIOSes for a problem
like this.

Ack this quirk.

P.

> Neil
>
>

2013-04-03 23:53:49

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

[+cc David and iommu list, Yinghai, Jiang]

On Mon, Mar 4, 2013 at 12:04 PM, Neil Horman <[email protected]> wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem.
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: [email protected]
>
> ---
>
> Change notes:
>
> v2)
>
> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> chipset series is x86 only. I decided however to keep the quirk as a regular
> quirk, not an early_quirk. Early quirks have no way currently to determine if
> BIOS has properly disabled the feature in the iommu, at least not without
> significant hacking, and since its quite possible this will be a short lived
> quirk, should Don Z's workaround code prove successful (and it looks like it may
> well), I don't think that necessecary.
>
> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> string, I opted to leave the newlines in place however, as I really couldnt
> find a way to keep the text on a single line is still legible from a code
> perspective. I think theres enough language in there that using cscope on just
> about any substring however will turn it up, and again, this may be a short
> lived quirk.
> ---
> arch/x86/kernel/quirks.c | 18 ++++++++++++++++++
> include/linux/pci_ids.h | 2 ++
> 2 files changed, 20 insertions(+)
>
> diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
> index 26ee48a..a718ea2 100644
> --- a/arch/x86/kernel/quirks.c
> +++ b/arch/x86/kernel/quirks.c
> @@ -5,6 +5,7 @@
> #include <linux/irq.h>
>
> #include <asm/hpet.h>
> +#include "../../../drivers/iommu/irq_remapping.h"
>
> #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
>
> @@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
> quirk_amd_nb_node);
>
> #endif
> +
> +static void intel_remapping_check(struct pci_dev *dev)
> +{
> + u8 revision;
> +
> + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> +
> + if ((revision == 0x13) && irq_remapping_enabled) {
> + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an errata making that\n"
> + "feature unstable. Please reboot with nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");
> + }
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);

This started as an IOMMU change, and I'm not an expert in that area,
so I added David and the IOMMU list. I'd rather have him deal with
this than me.

Is this something we can just *fix* in the kernel, e.g., by turning
off interrupt remapping ourselves, or does it have to be done before
the OS boots?

> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 31717bd..54027a6 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -2732,6 +2732,8 @@
> #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2
> #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3
> #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340
> +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403
> +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406

For constants only used in one place, we just use the bare constant
(0x3403) in the quirk rather than editing pci_ids.h (see comment at
the top of that file).

> #define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429
> #define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a
> #define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-04-04 11:17:34

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Wed, Apr 03, 2013 at 05:53:27PM -0600, Bjorn Helgaas wrote:
> [+cc David and iommu list, Yinghai, Jiang]
>
> On Mon, Mar 4, 2013 at 12:04 PM, Neil Horman <[email protected]> wrote:
> > A few years back intel published a spec update:
> > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> >
> > For the 5520 and 5500 chipsets which contained an errata (specificially errata
> > 53), which noted that these chipsets can't properly do interrupt remapping, and
> > as a result the recommend that interrupt remapping be disabled in bios. While
> > many vendors have a bios update to do exactly that, not all do, and of course
> > not all users update their bios to a level that corrects the problem. As a
> > result, occasionally interrupts can arrive at a cpu even after affinity for that
> > interrupt has be moved, leading to lost or spurrious interrupts (usually
> > characterized by the message:
> > kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
> >
> > There have been several incidents recently of people seeing this error, and
> > investigation has shown that they have system for which their BIOS level is such
> > that this feature was not properly turned off. As such, it would be good to
> > give them a reminder that their systems are vulnurable to this problem.
> >
> > Signed-off-by: Neil Horman <[email protected]>
> > CC: Prarit Bhargava <[email protected]>
> > CC: Don Zickus <[email protected]>
> > CC: Don Dutile <[email protected]>
> > CC: Bjorn Helgaas <[email protected]>
> > CC: Asit Mallick <[email protected]>
> > CC: [email protected]
> >
> > ---
> >
> > Change notes:
> >
> > v2)
> >
> > * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> > chipset series is x86 only. I decided however to keep the quirk as a regular
> > quirk, not an early_quirk. Early quirks have no way currently to determine if
> > BIOS has properly disabled the feature in the iommu, at least not without
> > significant hacking, and since its quite possible this will be a short lived
> > quirk, should Don Z's workaround code prove successful (and it looks like it may
> > well), I don't think that necessecary.
> >
> > * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> > string, I opted to leave the newlines in place however, as I really couldnt
> > find a way to keep the text on a single line is still legible from a code
> > perspective. I think theres enough language in there that using cscope on just
> > about any substring however will turn it up, and again, this may be a short
> > lived quirk.
> > ---
> > arch/x86/kernel/quirks.c | 18 ++++++++++++++++++
> > include/linux/pci_ids.h | 2 ++
> > 2 files changed, 20 insertions(+)
> >
> > diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
> > index 26ee48a..a718ea2 100644
> > --- a/arch/x86/kernel/quirks.c
> > +++ b/arch/x86/kernel/quirks.c
> > @@ -5,6 +5,7 @@
> > #include <linux/irq.h>
> >
> > #include <asm/hpet.h>
> > +#include "../../../drivers/iommu/irq_remapping.h"
> >
> > #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
> >
> > @@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
> > quirk_amd_nb_node);
> >
> > #endif
> > +
> > +static void intel_remapping_check(struct pci_dev *dev)
> > +{
> > + u8 revision;
> > +
> > + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> > +
> > + if ((revision == 0x13) && irq_remapping_enabled) {
> > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
> > + "on a chipset that contains an errata making that\n"
> > + "feature unstable. Please reboot with nointremap\n"
> > + "added to the kernel command line and contact\n"
> > + "your BIOS vendor for an update");
> > + }
> > +}
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);
>
> This started as an IOMMU change, and I'm not an expert in that area,
> so I added David and the IOMMU list. I'd rather have him deal with
> this than me.
>
This was never an iommu change, other than the fact that interrupt mapping and
the iommu are tightly coupled.

> Is this something we can just *fix* in the kernel, e.g., by turning
> off interrupt remapping ourselves, or does it have to be done before
> the OS boots?
>
Possibly, but it didn't seem safe to me. As it currently stands, pci quirks are
executed after the apic is configured, which means its possible the problem
might trigger before the quirk has a chance to execute (even the early quirk).
There was an exchange about this earlier in the thread IIRC.

> > diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> > index 31717bd..54027a6 100644
> > --- a/include/linux/pci_ids.h
> > +++ b/include/linux/pci_ids.h
> > @@ -2732,6 +2732,8 @@
> > #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2
> > #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3
> > #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340
> > +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403
> > +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406
>
> For constants only used in one place, we just use the bare constant
> (0x3403) in the quirk rather than editing pci_ids.h (see comment at
> the top of that file).
>
Ok, I'll repost with this removed.

Thanks & Regards
Neil

2013-04-04 13:55:20

by Neil Horman

[permalink] [raw]
Subject: [PATCH v3] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem.

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: David Woodhouse <[email protected]>
CC: [email protected]
---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.

v3)

* removed defines from pci_ids.h, and used direct id values as per request from
Bjorn.
---
arch/x86/kernel/quirks.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 26ee48a..7c99675 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -5,6 +5,7 @@
#include <linux/irq.h>

#include <asm/hpet.h>
+#include "../../../drivers/iommu/irq_remapping.h"

#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)

@@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
quirk_amd_nb_node);

#endif
+
+static void intel_remapping_check(struct pci_dev *dev)
+{
+ u8 revision;
+
+ pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
+
+ if ((revision == 0x13) && irq_remapping_enabled) {
+ pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an errata making that\n"
+ "feature unstable. Please reboot with nointremap\n"
+ "added to the kernel command line and contact\n"
+ "your BIOS vendor for an update");
+ }
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3406, intel_remapping_check);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3403, intel_remapping_check);
--
1.8.1.4

2013-04-04 14:27:54

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote:
> );
> > +
> > + if ((revision == 0x13) && irq_remapping_enabled) {
> > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
> > + "on a chipset that contains an errata making that\n"
> > + "feature unstable. Please reboot with nointremap\n"
> > + "added to the kernel command line and contact\n"
> > + "your BIOS vendor for an update");

This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'.

--
dwmw2


Attachments:
smime.p7s (6.03 kB)

2013-04-04 14:50:44

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote:
> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote:
> > );
> > > +
> > > + if ((revision == 0x13) && irq_remapping_enabled) {
> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
> > > + "on a chipset that contains an errata making that\n"
> > > + "feature unstable. Please reboot with nointremap\n"
> > > + "added to the kernel command line and contact\n"
> > > + "your BIOS vendor for an update");
>
> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'.
>
Ok, copy that. I'll repost shortly

> --
> dwmw2

2013-04-04 14:57:29

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 4, 2013 at 8:50 AM, Neil Horman <[email protected]> wrote:
> On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote:
>> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote:
>> > );
>> > > +
>> > > + if ((revision == 0x13) && irq_remapping_enabled) {
>> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
>> > > + "on a chipset that contains an errata making that\n"
>> > > + "feature unstable. Please reboot with nointremap\n"
>> > > + "added to the kernel command line and contact\n"
>> > > + "your BIOS vendor for an update");
>>
>> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'.
>>
> Ok, copy that. I'll repost shortly

When you do, please include URLs for any problem reports or bugzillas you have.

I assume Windows "just works" in this situation?

2013-04-04 15:08:30

by Neil Horman

[permalink] [raw]
Subject: [PATCH v4] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem.

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: David Woodhouse <[email protected]>
CC: [email protected]
---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.

v3)

* Removed defines from pci_ids.h, and used direct id values as per request from
Bjorn.

v4)

* Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
Woodhouse
---
arch/x86/kernel/quirks.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 26ee48a..eb0785d 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -5,6 +5,7 @@
#include <linux/irq.h>

#include <asm/hpet.h>
+#include "../../../drivers/iommu/irq_remapping.h"

#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)

@@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
quirk_amd_nb_node);

#endif
+
+static void intel_remapping_check(struct pci_dev *dev)
+{
+ u8 revision;
+
+ pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
+
+ WARN_TAINT(((revision == 0x13) && irq_remapping_enabled),
+ TAINT_FIRMWARE_WORKAROUND,
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an erratum making that\n"
+ "feature unstable. Please reboot with nointremap\n"
+ "added to the kernel command line and contact\n"
+ "your BIOS vendor for an update");
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3406, intel_remapping_check);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3403, intel_remapping_check);
--
1.8.1.4

2013-04-04 15:39:27

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 04, 2013 at 08:57:06AM -0600, Bjorn Helgaas wrote:
> On Thu, Apr 4, 2013 at 8:50 AM, Neil Horman <[email protected]> wrote:
> > On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote:
> >> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote:
> >> > );
> >> > > +
> >> > > + if ((revision == 0x13) && irq_remapping_enabled) {
> >> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
> >> > > + "on a chipset that contains an errata making that\n"
> >> > > + "feature unstable. Please reboot with nointremap\n"
> >> > > + "added to the kernel command line and contact\n"
> >> > > + "your BIOS vendor for an update");
> >>
> >> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'.
> >>
> > Ok, copy that. I'll repost shortly
>
> When you do, please include URLs for any problem reports or bugzillas you have.
>
Well, those are going to be vendor specific, so I'm not sure we can really do
that, at least not in any meaningful way.

> I assume Windows "just works" in this situation?
No more or less than linux does in this case. The Intel provided errata
indicates that the only acceptable workaround is to disable remapping in the
BIOS, so I would presume that if a windows system has a BIOS that doesn't
implement this fix, its just as exposed as we are.
Neil

2013-04-04 16:16:57

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v4] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 4, 2013 at 8:08 AM, Neil Horman <[email protected]> wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem.
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: [email protected]
> ---
>
> Change notes:
>
> v2)
>
> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> chipset series is x86 only. I decided however to keep the quirk as a regular
> quirk, not an early_quirk. Early quirks have no way currently to determine if
> BIOS has properly disabled the feature in the iommu, at least not without
> significant hacking, and since its quite possible this will be a short lived
> quirk, should Don Z's workaround code prove successful (and it looks like it may
> well), I don't think that necessecary.
>
> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> string, I opted to leave the newlines in place however, as I really couldnt
> find a way to keep the text on a single line is still legible from a code
> perspective. I think theres enough language in there that using cscope on just
> about any substring however will turn it up, and again, this may be a short
> lived quirk.
>
> v3)
>
> * Removed defines from pci_ids.h, and used direct id values as per request from
> Bjorn.
>
> v4)
>
> * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
> Woodhouse
> ---
> arch/x86/kernel/quirks.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
> index 26ee48a..eb0785d 100644
> --- a/arch/x86/kernel/quirks.c
> +++ b/arch/x86/kernel/quirks.c
> @@ -5,6 +5,7 @@
> #include <linux/irq.h>
>
> #include <asm/hpet.h>
> +#include "../../../drivers/iommu/irq_remapping.h"
>
> #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
>
> @@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
> quirk_amd_nb_node);
>
> #endif
> +
> +static void intel_remapping_check(struct pci_dev *dev)
> +{
> + u8 revision;
> +
> + pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> +
> + WARN_TAINT(((revision == 0x13) && irq_remapping_enabled),
> + TAINT_FIRMWARE_WORKAROUND,
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an erratum making that\n"
> + "feature unstable. Please reboot with nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3406, intel_remapping_check);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3403, intel_remapping_check);

No, you only address my one request, move to arch/x86 directory.

You need to move the quirk to early_quirk to append nointremap to
avoid extra rebooting.

Thanks

Yinghai

2013-04-04 17:14:55

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 4, 2013 at 9:39 AM, Neil Horman <[email protected]> wrote:
> On Thu, Apr 04, 2013 at 08:57:06AM -0600, Bjorn Helgaas wrote:
>> On Thu, Apr 4, 2013 at 8:50 AM, Neil Horman <[email protected]> wrote:
>> > On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote:
>> >> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote:
>> >> > );
>> >> > > +
>> >> > > + if ((revision == 0x13) && irq_remapping_enabled) {
>> >> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
>> >> > > + "on a chipset that contains an errata making that\n"
>> >> > > + "feature unstable. Please reboot with nointremap\n"
>> >> > > + "added to the kernel command line and contact\n"
>> >> > > + "your BIOS vendor for an update");
>> >>
>> >> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'.
>> >>
>> > Ok, copy that. I'll repost shortly
>>
>> When you do, please include URLs for any problem reports or bugzillas you have.
>>
> Well, those are going to be vendor specific, so I'm not sure we can really do
> that, at least not in any meaningful way.

Sorry, I don't understand your point. It's useful to know who
reported it (e.g., for future testers) and what happened and what
bugzillas it solved. Of course it applies only to machines with this
chipset and certain BIOS revisions.

>> I assume Windows "just works" in this situation?
> No more or less than linux does in this case. The Intel provided errata
> indicates that the only acceptable workaround is to disable remapping in the
> BIOS, so I would presume that if a windows system has a BIOS that doesn't
> implement this fix, its just as exposed as we are.

It sounds like the effect of this bug is that on Linux, devices may
not work at all because of lost interrupts. Either Windows must never
enable remapping (so it never sees the bug), or it must be designed in
a way that tolerates the problem. I can't believe these machines
shipped with Windows certification if devices didn't work correctly.

Either way, I don't understand why we can't make the quirk just fix
this. Booting with "nointremap" only sets disable_irq_remap, which is
only used by irq_remapping_supported(). Early quirks are run before
irq_remapping_supported () is ever called, so an early quirk ought to
be just as effective as the command line option. Here's the relevant
call tree I see:

start_kernel
setup_arch
parse_early_param
early_quirks
rest_init
...
<first use of disable_irq_remap>

The x86 setup_arch() does call generic_apic_probe(), but as far as I
can tell, none of the APIC .probe() methods reference
disable_irq_remap, so that doesn't look like a problem.

Bjorn

2013-04-04 17:27:42

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH v4] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 04/04/2013 12:16 PM, Yinghai Lu wrote:
> On Thu, Apr 4, 2013 at 8:08 AM, Neil Horman<[email protected]> wrote:
>> A few years back intel published a spec update:
>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>
>> For the 5520 and 5500 chipsets which contained an errata (specificially errata
>> 53), which noted that these chipsets can't properly do interrupt remapping, and
>> as a result the recommend that interrupt remapping be disabled in bios. While
>> many vendors have a bios update to do exactly that, not all do, and of course
>> not all users update their bios to a level that corrects the problem. As a
>> result, occasionally interrupts can arrive at a cpu even after affinity for that
>> interrupt has be moved, leading to lost or spurrious interrupts (usually
>> characterized by the message:
>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>
>> There have been several incidents recently of people seeing this error, and
>> investigation has shown that they have system for which their BIOS level is such
>> that this feature was not properly turned off. As such, it would be good to
>> give them a reminder that their systems are vulnurable to this problem.
>>
>> Signed-off-by: Neil Horman<[email protected]>
>> CC: Prarit Bhargava<[email protected]>
>> CC: Don Zickus<[email protected]>
>> CC: Don Dutile<[email protected]>
>> CC: Bjorn Helgaas<[email protected]>
>> CC: Asit Mallick<[email protected]>
>> CC: David Woodhouse<[email protected]>
>> CC: [email protected]
>> ---
>>
>> Change notes:
>>
>> v2)
>>
>> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
>> chipset series is x86 only. I decided however to keep the quirk as a regular
>> quirk, not an early_quirk. Early quirks have no way currently to determine if
>> BIOS has properly disabled the feature in the iommu, at least not without
>> significant hacking, and since its quite possible this will be a short lived
>> quirk, should Don Z's workaround code prove successful (and it looks like it may
>> well), I don't think that necessecary.
>>
>> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
>> string, I opted to leave the newlines in place however, as I really couldnt
>> find a way to keep the text on a single line is still legible from a code
>> perspective. I think theres enough language in there that using cscope on just
>> about any substring however will turn it up, and again, this may be a short
>> lived quirk.
>>
>> v3)
>>
>> * Removed defines from pci_ids.h, and used direct id values as per request from
>> Bjorn.
>>
>> v4)
>>
>> * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
>> Woodhouse
>> ---
>> arch/x86/kernel/quirks.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
>> index 26ee48a..eb0785d 100644
>> --- a/arch/x86/kernel/quirks.c
>> +++ b/arch/x86/kernel/quirks.c
>> @@ -5,6 +5,7 @@
>> #include<linux/irq.h>
>>
>> #include<asm/hpet.h>
>> +#include "../../../drivers/iommu/irq_remapping.h"
>>
>> #if defined(CONFIG_X86_IO_APIC)&& defined(CONFIG_SMP)&& defined(CONFIG_PCI)
>>
>> @@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
>> quirk_amd_nb_node);
>>
>> #endif
>> +
>> +static void intel_remapping_check(struct pci_dev *dev)
>> +{
>> + u8 revision;
>> +
>> + pci_read_config_byte(dev, PCI_REVISION_ID,&revision);
>> +
>> + WARN_TAINT(((revision == 0x13)&& irq_remapping_enabled),
>> + TAINT_FIRMWARE_WORKAROUND,
>> + "This system BIOS has enabled interrupt remapping\n"
>> + "on a chipset that contains an erratum making that\n"
>> + "feature unstable. Please reboot with nointremap\n"
>> + "added to the kernel command line and contact\n"
>> + "your BIOS vendor for an update");
>> +}
>> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3406, intel_remapping_check);
>> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3403, intel_remapping_check);
>
> No, you only address my one request, move to arch/x86 directory.
>
> You need to move the quirk to early_quirk to append nointremap to
> avoid extra rebooting.
>
> Thanks
>
> Yinghai
The pci-dev's of all the (minimally, root, 5500-chipset) pci-dev's are known/scanned/created by that time?

2013-04-04 17:40:10

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v4] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 4, 2013 at 10:27 AM, Don Dutile <[email protected]> wrote:
>> You need to move the quirk to early_quirk to append nointremap to
>> avoid extra rebooting.
>>
> The pci-dev's of all the (minimally, root, 5500-chipset) pci-dev's are
> known/scanned/created by that time?

in arch/x86/kernel/early-quirk.c

and on top of
https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/commit/?h=for-x86-early-quirk-usb&id=de38757e964cfee20e6da1977572a2191d7f4aa0

You could add one entry in early_qrk[].

Some one already try to use that path to disable x2apic on some thinkpad.

So it should work on nointrremap too.

Thanks

Yinghai

2013-04-04 17:51:47

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 04, 2013 at 11:14:30AM -0600, Bjorn Helgaas wrote:
> On Thu, Apr 4, 2013 at 9:39 AM, Neil Horman <[email protected]> wrote:
> > On Thu, Apr 04, 2013 at 08:57:06AM -0600, Bjorn Helgaas wrote:
> >> On Thu, Apr 4, 2013 at 8:50 AM, Neil Horman <[email protected]> wrote:
> >> > On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote:
> >> >> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote:
> >> >> > );
> >> >> > > +
> >> >> > > + if ((revision == 0x13) && irq_remapping_enabled) {
> >> >> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
> >> >> > > + "on a chipset that contains an errata making that\n"
> >> >> > > + "feature unstable. Please reboot with nointremap\n"
> >> >> > > + "added to the kernel command line and contact\n"
> >> >> > > + "your BIOS vendor for an update");
> >> >>
> >> >> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'.
> >> >>
> >> > Ok, copy that. I'll repost shortly
> >>
> >> When you do, please include URLs for any problem reports or bugzillas you have.
> >>
> > Well, those are going to be vendor specific, so I'm not sure we can really do
> > that, at least not in any meaningful way.
>
> Sorry, I don't understand your point. It's useful to know who
> reported it (e.g., for future testers) and what happened and what
> bugzillas it solved. Of course it applies only to machines with this
> chipset and certain BIOS revisions.
>
Oh, you want the bug report that I'm fixing this against? Sure, I can do that.
I thought you wanted me to include a url in the WARN_TAINT, with which user
could report occurances of this bug. Yeah, the bug that this is reported in is:
https://bugzilla.redhat.com/show_bug.cgi?id=887006

Its standing in for about a dozen or so variants of this issue we've seen

> >> I assume Windows "just works" in this situation?
> > No more or less than linux does in this case. The Intel provided errata
> > indicates that the only acceptable workaround is to disable remapping in the
> > BIOS, so I would presume that if a windows system has a BIOS that doesn't
> > implement this fix, its just as exposed as we are.
>
> It sounds like the effect of this bug is that on Linux, devices may
> not work at all because of lost interrupts. Either Windows must never
> enable remapping (so it never sees the bug), or it must be designed in
> a way that tolerates the problem. I can't believe these machines
> shipped with Windows certification if devices didn't work correctly.
>
I don't know what to tell you here. We explicitly asked intel about this, and
there was an effort to recode the interrupt migration code to properly manage
this situation, and intels response was "No, just disable it in BIOS", so we're
telling people to disable it in BIOS. You'd have to ask somebody at Microsoft
what Windows did or did not do about this problem.

> Either way, I don't understand why we can't make the quirk just fix
> this. Booting with "nointremap" only sets disable_irq_remap, which is
> only used by irq_remapping_supported(). Early quirks are run before
> irq_remapping_supported () is ever called, so an early quirk ought to
> be just as effective as the command line option. Here's the relevant
> call tree I see:
>
> start_kernel
> setup_arch
> parse_early_param
> early_quirks
> rest_init
> ...
> <first use of disable_irq_remap>
>
> The x86 setup_arch() does call generic_apic_probe(), but as far as I
> can tell, none of the APIC .probe() methods reference
> disable_irq_remap, so that doesn't look like a problem.
>
Well, I can give it a try, but I'm sure something went wrong last time I did.
Regardless, theres also the security issue to consider here - namely that
disabling irq remapping opens up users of virt to a possible security bug
(potential irq injection). Some users may wish to live with the remapping
error, given that error typically leads to devices that need to be
restarted/reset to start working again, rather than live with the security hole.
I rather like the warning, that gives users a choice, but I'll spin up a version
that just disables it if you would rather.

Neil

2013-04-04 18:41:52

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 4, 2013 at 11:51 AM, Neil Horman <[email protected]> wrote:

> Oh, you want the bug report that I'm fixing this against? Sure, I can do that.
> I thought you wanted me to include a url in the WARN_TAINT, with which user
> could report occurances of this bug. Yeah, the bug that this is reported in is:
> https://bugzilla.redhat.com/show_bug.cgi?id=887006
>
> Its standing in for about a dozen or so variants of this issue we've seen

Exactly -- I'm just hoping for something in the changelog. BTW, this
particular bugzilla is not public.

> Regardless, theres also the security issue to consider here - namely that
> disabling irq remapping opens up users of virt to a possible security bug
> (potential irq injection). Some users may wish to live with the remapping
> error, given that error typically leads to devices that need to be
> restarted/reset to start working again, rather than live with the security hole.
> I rather like the warning, that gives users a choice, but I'll spin up a version
> that just disables it if you would rather.

I don't believe users will want to make a choice like that or even be
sophisticated enough to do it, at least not based on something in
dmesg. I'm pretty sure I'm not :)

The only supportable thing I can imagine doing would be:

- Disable interrupt remapping if this chipset defect is present, so
devices work reliably (they don't need whatever restart/reset you
referred to above).
- Disable virt functionality when interrupt remapping is disabled to
avoid the security problem (I don't know the details of this.)
- Add a command-line option to enable interrupt remapping (I think
"intremap=on" is currently parsed too early, but maybe this could be
reworked so the option could override the quirk disable).
- Add release notes saying "boot with 'intremap=on' if you want the
virt functionality and can accept unreliable devices."

That way the default behavior is safe and reliable (though perhaps
lacking some functionality), and you have told the user a way to get
safe and unreliable operation if he's willing to accept that. At
least, that's what I think I would want if I were in RH's shoes.

Bjorn

2013-04-04 20:02:36

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 04, 2013 at 12:41:27PM -0600, Bjorn Helgaas wrote:
> On Thu, Apr 4, 2013 at 11:51 AM, Neil Horman <[email protected]> wrote:
>
> > Oh, you want the bug report that I'm fixing this against? Sure, I can do that.
> > I thought you wanted me to include a url in the WARN_TAINT, with which user
> > could report occurances of this bug. Yeah, the bug that this is reported in is:
> > https://bugzilla.redhat.com/show_bug.cgi?id=887006
> >
> > Its standing in for about a dozen or so variants of this issue we've seen
>
> Exactly -- I'm just hoping for something in the changelog. BTW, this
> particular bugzilla is not public.
>
Ok, that I can do, I'll get the bz fixed up to be public in a bit.


> > Regardless, theres also the security issue to consider here - namely that
> > disabling irq remapping opens up users of virt to a possible security bug
> > (potential irq injection). Some users may wish to live with the remapping
> > error, given that error typically leads to devices that need to be
> > restarted/reset to start working again, rather than live with the security hole.
> > I rather like the warning, that gives users a choice, but I'll spin up a version
> > that just disables it if you would rather.
>
> I don't believe users will want to make a choice like that or even be
> sophisticated enough to do it, at least not based on something in
> dmesg. I'm pretty sure I'm not :)
>
> The only supportable thing I can imagine doing would be:
>
> - Disable interrupt remapping if this chipset defect is present, so
> devices work reliably (they don't need whatever restart/reset you
> referred to above).
> - Disable virt functionality when interrupt remapping is disabled to
> avoid the security problem (I don't know the details of this.)
> - Add a command-line option to enable interrupt remapping (I think
> "intremap=on" is currently parsed too early, but maybe this could be
> reworked so the option could override the quirk disable).
> - Add release notes saying "boot with 'intremap=on' if you want the
> virt functionality and can accept unreliable devices."
>
> That way the default behavior is safe and reliable (though perhaps
> lacking some functionality), and you have told the user a way to get
> safe and unreliable operation if he's willing to accept that. At
> least, that's what I think I would want if I were in RH's shoes.
>
Theres a long argument behind this, and I'm with you. At the least I don't see
a problem with disabling it upstream, at least not a policy-oriented reason.
That said, having looked back at my notes, I do now recall a technical
impediment behind disabling interrupt remapping. If we were to do this in an
early quirk, we would need to determine the status of the interrupt remapping
capability flag in the iommu in question. Looking at the interrupt remapping
code, it appears that the capability flag isn't directly contained in the pci
config space, but rather in a memory mapped address range who's base address is
parsed out of an acpi table. Since we check the irq_remapping_enabled flag in
the current quirk (which is set after the early quirks run), to do this in an
early quirk, we would need to somehow parse that base address register out of
the available acpi tables ourselves, and I'm not at all sure how to do that. Do
you know if theres some available parsing mechanism that early in boot? If so,
I can probably make this happen.

Neil

> Bjorn
>

2013-04-04 20:05:05

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v4] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 04, 2013 at 10:40:07AM -0700, Yinghai Lu wrote:
> On Thu, Apr 4, 2013 at 10:27 AM, Don Dutile <[email protected]> wrote:
> >> You need to move the quirk to early_quirk to append nointremap to
> >> avoid extra rebooting.
> >>
> > The pci-dev's of all the (minimally, root, 5500-chipset) pci-dev's are
> > known/scanned/created by that time?
>
> in arch/x86/kernel/early-quirk.c
>
> and on top of
> https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/commit/?h=for-x86-early-quirk-usb&id=de38757e964cfee20e6da1977572a2191d7f4aa0
>
> You could add one entry in early_qrk[].
>
> Some one already try to use that path to disable x2apic on some thinkpad.
>
> So it should work on nointrremap too.
>
See my last email to Bjorn. Doing this in early-quirks in such a way that we
can detect an iommu that has interrupt remapping enabled (so we don't just
unilaterally print this quirk all the time) requires that we be able to parse
acpi tables very early in the boot. If you know of how to do that, I can make
this happen. If not, I suppose another alternative would be to have the early
quirk set a flag that tells us this is a bogus chip, and if we try to enable irq
remapping with that flag set, we should fail, and report the error at that time,
but I'm not sure I like that solution.

Neil

> Thanks
>
> Yinghai
>

2013-04-04 20:33:54

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v4] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 4, 2013 at 2:04 PM, Neil Horman <[email protected]> wrote:
> On Thu, Apr 04, 2013 at 10:40:07AM -0700, Yinghai Lu wrote:
>> On Thu, Apr 4, 2013 at 10:27 AM, Don Dutile <[email protected]> wrote:
>> >> You need to move the quirk to early_quirk to append nointremap to
>> >> avoid extra rebooting.
>> >>
>> > The pci-dev's of all the (minimally, root, 5500-chipset) pci-dev's are
>> > known/scanned/created by that time?
>>
>> in arch/x86/kernel/early-quirk.c
>>
>> and on top of
>> https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/commit/?h=for-x86-early-quirk-usb&id=de38757e964cfee20e6da1977572a2191d7f4aa0
>>
>> You could add one entry in early_qrk[].
>>
>> Some one already try to use that path to disable x2apic on some thinkpad.
>>
>> So it should work on nointrremap too.
>>
> See my last email to Bjorn. Doing this in early-quirks in such a way that we
> can detect an iommu that has interrupt remapping enabled (so we don't just
> unilaterally print this quirk all the time) requires that we be able to parse
> acpi tables very early in the boot. If you know of how to do that, I can make
> this happen. If not, I suppose another alternative would be to have the early
> quirk set a flag that tells us this is a bogus chip, and if we try to enable irq
> remapping with that flag set, we should fail, and report the error at that time,
> but I'm not sure I like that solution.

I like that solution :) It seems very simple -- you don't have to
parse any tables or anything.

Bjorn

2013-04-04 21:11:59

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v4] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 4, 2013 at 1:33 PM, Bjorn Helgaas <[email protected]> wrote:
>> See my last email to Bjorn. Doing this in early-quirks in such a way that we
>> can detect an iommu that has interrupt remapping enabled (so we don't just
>> unilaterally print this quirk all the time) requires that we be able to parse
>> acpi tables very early in the boot. If you know of how to do that, I can make
>> this happen. If not, I suppose another alternative would be to have the early
>> quirk set a flag that tells us this is a bogus chip, and if we try to enable irq
>> remapping with that flag set, we should fail, and report the error at that time,
>> but I'm not sure I like that solution.
>
> I like that solution :) It seems very simple -- you don't have to
> parse any tables or anything.

You are right, we don't need to parse any acpi tables.

just add one quirk in early-quirk.c to set
disable_irq_remap = 1;

Thanks

Yinghai

2013-04-05 00:24:40

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v4] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 04, 2013 at 02:11:54PM -0700, Yinghai Lu wrote:
> On Thu, Apr 4, 2013 at 1:33 PM, Bjorn Helgaas <[email protected]> wrote:
> >> See my last email to Bjorn. Doing this in early-quirks in such a way that we
> >> can detect an iommu that has interrupt remapping enabled (so we don't just
> >> unilaterally print this quirk all the time) requires that we be able to parse
> >> acpi tables very early in the boot. If you know of how to do that, I can make
> >> this happen. If not, I suppose another alternative would be to have the early
> >> quirk set a flag that tells us this is a bogus chip, and if we try to enable irq
> >> remapping with that flag set, we should fail, and report the error at that time,
> >> but I'm not sure I like that solution.
> >
> > I like that solution :) It seems very simple -- you don't have to
> > parse any tables or anything.
>
> You are right, we don't need to parse any acpi tables.
>
> just add one quirk in early-quirk.c to set
> disable_irq_remap = 1;
>
Well, I can't just do that. We need to issue a warning to the user as well, and
to do so conditionally (we don't want to warn users who have prorperly updated
BIOSes), I would need to know if irq remapping is actually on or not, which
would require parsing ACPI tables

But, as noted above, I can just set a flag, and defer the printing of the
warning until later in the boot process, when we know that information already.
Bjorn seems on board with that idea, so I'll spin up a patch for it in the AM.

Thanks!
Neil

> Thanks
>
> Yinghai
>

2013-04-05 19:26:21

by Neil Horman

[permalink] [raw]
Subject: [PATCH v5] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem.

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: David Woodhouse <[email protected]>
CC: [email protected]
---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.

v3)

* Removed defines from pci_ids.h, and used direct id values as per request from
Bjorn.

v4)

* Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
Woodhouse

v5)

* Moved check to an early quirk, and flagged the broken chip, so we could
reasonably disable irq remapping during bootup.
---
arch/x86/kernel/early-quirks.c | 25 +++++++++++++++++++++++++
arch/x86/kernel/quirks.c | 2 ++
drivers/iommu/irq_remapping.c | 12 ++++++++++++
drivers/iommu/irq_remapping.h | 1 +
4 files changed, 40 insertions(+)

diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 3755ef4..bfa3139 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
}
#endif

+#ifdef CONFIG_IRQ_REMAP
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+ u8 revision;
+
+ revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
+
+ /*
+ * Revision 0x13 of this chipset supports irq remapping
+ * but has an erratum that breaks its behavior, flag it as such
+ */
+ if (revision == 0x13)
+ irq_remap_broken = 1;
+
+}
+#else
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+}
+#endif
+
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+ { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
{}
};

diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 26ee48a..2136844 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -5,6 +5,7 @@
#include <linux/irq.h>

#include <asm/hpet.h>
+#include "../../../drivers/iommu/irq_remapping.h"

#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)

@@ -567,3 +568,4 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
quirk_amd_nb_node);

#endif
+
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index d56f8c1..2b56e92 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -19,6 +19,7 @@
int irq_remapping_enabled;

int disable_irq_remap;
+int irq_remap_broken;
int disable_sourceid_checking;
int no_x2apic_optout;

@@ -216,6 +217,17 @@ int irq_remapping_supported(void)
if (disable_irq_remap)
return 0;

+ if (irq_remap_broken) {
+ WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an erratum making that\n"
+ "feature unstable. Please reboot with nointremap\n"
+ "added to the kernel command line and contact\n"
+ "your BIOS vendor for an update");
+ disable_irq_remap = 1;
+ return 0;
+ }
+
if (!remap_ops || !remap_ops->supported)
return 0;

diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index ecb6376..d7537e4 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -32,6 +32,7 @@ struct pci_dev;
struct msi_msg;

extern int disable_irq_remap;
+extern int irq_remap_broken;
extern int disable_sourceid_checking;
extern int no_x2apic_optout;
extern int irq_remapping_enabled;
--
1.8.1.4

2013-04-05 19:29:58

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v5] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Fri, Apr 05, 2013 at 03:25:54PM -0400, Neil Horman wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem.
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: [email protected]
> ---
>
> Change notes:
>
> v2)
>
> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> chipset series is x86 only. I decided however to keep the quirk as a regular
> quirk, not an early_quirk. Early quirks have no way currently to determine if
> BIOS has properly disabled the feature in the iommu, at least not without
> significant hacking, and since its quite possible this will be a short lived
> quirk, should Don Z's workaround code prove successful (and it looks like it may
> well), I don't think that necessecary.
>
> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> string, I opted to leave the newlines in place however, as I really couldnt
> find a way to keep the text on a single line is still legible from a code
> perspective. I think theres enough language in there that using cscope on just
> about any substring however will turn it up, and again, this may be a short
> lived quirk.
>
> v3)
>
> * Removed defines from pci_ids.h, and used direct id values as per request from
> Bjorn.
>
> v4)
>
> * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
> Woodhouse
>
> v5)
>
> * Moved check to an early quirk, and flagged the broken chip, so we could
> reasonably disable irq remapping during bootup.


Self NAK, sorry, I've got extra water in the quirks file left over from my move.

2013-04-05 19:31:49

by Neil Horman

[permalink] [raw]
Subject: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem.

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: David Woodhouse <[email protected]>
CC: [email protected]
---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.

v3)

* Removed defines from pci_ids.h, and used direct id values as per request from
Bjorn.

v4)

* Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
Woodhouse

v5)

* Moved check to an early quirk, and flagged the broken chip, so we could
reasonably disable irq remapping during bootup.

v6)
* Clean up of stupid extra thrash in quirks.c
---
arch/x86/kernel/early-quirks.c | 25 +++++++++++++++++++++++++
drivers/iommu/irq_remapping.c | 12 ++++++++++++
drivers/iommu/irq_remapping.h | 1 +
3 files changed, 38 insertions(+)

diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 3755ef4..bfa3139 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
}
#endif

+#ifdef CONFIG_IRQ_REMAP
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+ u8 revision;
+
+ revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
+
+ /*
+ * Revision 0x13 of this chipset supports irq remapping
+ * but has an erratum that breaks its behavior, flag it as such
+ */
+ if (revision == 0x13)
+ irq_remap_broken = 1;
+
+}
+#else
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+}
+#endif
+
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+ { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
{}
};

diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index d56f8c1..2b56e92 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -19,6 +19,7 @@
int irq_remapping_enabled;

int disable_irq_remap;
+int irq_remap_broken;
int disable_sourceid_checking;
int no_x2apic_optout;

@@ -216,6 +217,17 @@ int irq_remapping_supported(void)
if (disable_irq_remap)
return 0;

+ if (irq_remap_broken) {
+ WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an erratum making that\n"
+ "feature unstable. Please reboot with nointremap\n"
+ "added to the kernel command line and contact\n"
+ "your BIOS vendor for an update");
+ disable_irq_remap = 1;
+ return 0;
+ }
+
if (!remap_ops || !remap_ops->supported)
return 0;

diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index ecb6376..d7537e4 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -32,6 +32,7 @@ struct pci_dev;
struct msi_msg;

extern int disable_irq_remap;
+extern int irq_remap_broken;
extern int disable_sourceid_checking;
extern int no_x2apic_optout;
extern int irq_remapping_enabled;
--
1.8.1.4

2013-04-05 23:37:34

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Fri, Apr 5, 2013 at 12:31 PM, Neil Horman <[email protected]> wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem.
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: [email protected]
> ---
>
> Change notes:
>
> v2)
>
> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> chipset series is x86 only. I decided however to keep the quirk as a regular
> quirk, not an early_quirk. Early quirks have no way currently to determine if
> BIOS has properly disabled the feature in the iommu, at least not without
> significant hacking, and since its quite possible this will be a short lived
> quirk, should Don Z's workaround code prove successful (and it looks like it may
> well), I don't think that necessecary.
>
> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> string, I opted to leave the newlines in place however, as I really couldnt
> find a way to keep the text on a single line is still legible from a code
> perspective. I think theres enough language in there that using cscope on just
> about any substring however will turn it up, and again, this may be a short
> lived quirk.
>
> v3)
>
> * Removed defines from pci_ids.h, and used direct id values as per request from
> Bjorn.
>
> v4)
>
> * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
> Woodhouse
>
> v5)
>
> * Moved check to an early quirk, and flagged the broken chip, so we could
> reasonably disable irq remapping during bootup.
>
> v6)
> * Clean up of stupid extra thrash in quirks.c
> ---
> arch/x86/kernel/early-quirks.c | 25 +++++++++++++++++++++++++
> drivers/iommu/irq_remapping.c | 12 ++++++++++++
> drivers/iommu/irq_remapping.h | 1 +
> 3 files changed, 38 insertions(+)
>
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 3755ef4..bfa3139 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
> }
> #endif
>
> +#ifdef CONFIG_IRQ_REMAP
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> + u8 revision;
> +
> + revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
> +
> + /*
> + * Revision 0x13 of this chipset supports irq remapping
> + * but has an erratum that breaks its behavior, flag it as such
> + */
> + if (revision == 0x13)
> + irq_remap_broken = 1;
> +
> +}
> +#else
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> +}
> +#endif
> +
> #define QFLAG_APPLY_ONCE 0x1
> #define QFLAG_APPLIED 0x2
> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
> @@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> {}
> };
>
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index d56f8c1..2b56e92 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -19,6 +19,7 @@
> int irq_remapping_enabled;
>
> int disable_irq_remap;
> +int irq_remap_broken;
> int disable_sourceid_checking;
> int no_x2apic_optout;
>
> @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
> if (disable_irq_remap)
> return 0;
>
> + if (irq_remap_broken) {
> + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an erratum making that\n"
> + "feature unstable. Please reboot with nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");

What do you mean "This system BIOS has enabled interrupt remapping" ?
BIOS have interrupt pre-enabled or BIOS just provide DMAR table ?

Why do you need "Please reboot with nointremap" ?

Thanks

Yinghai

2013-04-06 01:26:16

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

I'm sorry. Forgot to change the wording of the error for the new model that I'm following here. Although the message is mostly right as bios is responsible for setting and clearing the IRQ remapping feature bit in the chips capabilities register.

I'll fix and repost Monday

Neil

Yinghai Lu <[email protected]> wrote:

>On Fri, Apr 5, 2013 at 12:31 PM, Neil Horman <[email protected]> wrote:
>> A few years back intel published a spec update:
>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>
>> For the 5520 and 5500 chipsets which contained an errata (specificially errata
>> 53), which noted that these chipsets can't properly do interrupt remapping, and
>> as a result the recommend that interrupt remapping be disabled in bios. While
>> many vendors have a bios update to do exactly that, not all do, and of course
>> not all users update their bios to a level that corrects the problem. As a
>> result, occasionally interrupts can arrive at a cpu even after affinity for that
>> interrupt has be moved, leading to lost or spurrious interrupts (usually
>> characterized by the message:
>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>
>> There have been several incidents recently of people seeing this error, and
>> investigation has shown that they have system for which their BIOS level is such
>> that this feature was not properly turned off. As such, it would be good to
>> give them a reminder that their systems are vulnurable to this problem.
>>
>> Signed-off-by: Neil Horman <[email protected]>
>> CC: Prarit Bhargava <[email protected]>
>> CC: Don Zickus <[email protected]>
>> CC: Don Dutile <[email protected]>
>> CC: Bjorn Helgaas <[email protected]>
>> CC: Asit Mallick <[email protected]>
>> CC: David Woodhouse <[email protected]>
>> CC: [email protected]
>> ---
>>
>> Change notes:
>>
>> v2)
>>
>> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
>> chipset series is x86 only. I decided however to keep the quirk as a regular
>> quirk, not an early_quirk. Early quirks have no way currently to determine if
>> BIOS has properly disabled the feature in the iommu, at least not without
>> significant hacking, and since its quite possible this will be a short lived
>> quirk, should Don Z's workaround code prove successful (and it looks like it may
>> well), I don't think that necessecary.
>>
>> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
>> string, I opted to leave the newlines in place however, as I really couldnt
>> find a way to keep the text on a single line is still legible from a code
>> perspective. I think theres enough language in there that using cscope on just
>> about any substring however will turn it up, and again, this may be a short
>> lived quirk.
>>
>> v3)
>>
>> * Removed defines from pci_ids.h, and used direct id values as per request from
>> Bjorn.
>>
>> v4)
>>
>> * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
>> Woodhouse
>>
>> v5)
>>
>> * Moved check to an early quirk, and flagged the broken chip, so we could
>> reasonably disable irq remapping during bootup.
>>
>> v6)
>> * Clean up of stupid extra thrash in quirks.c
>> ---
>> arch/x86/kernel/early-quirks.c | 25 +++++++++++++++++++++++++
>> drivers/iommu/irq_remapping.c | 12 ++++++++++++
>> drivers/iommu/irq_remapping.h | 1 +
>> 3 files changed, 38 insertions(+)
>>
>> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
>> index 3755ef4..bfa3139 100644
>> --- a/arch/x86/kernel/early-quirks.c
>> +++ b/arch/x86/kernel/early-quirks.c
>> @@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
>> }
>> #endif
>>
>> +#ifdef CONFIG_IRQ_REMAP
>> +static void __init intel_remapping_check(int num, int slot, int func)
>> +{
>> + u8 revision;
>> +
>> + revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
>> +
>> + /*
>> + * Revision 0x13 of this chipset supports irq remapping
>> + * but has an erratum that breaks its behavior, flag it as such
>> + */
>> + if (revision == 0x13)
>> + irq_remap_broken = 1;
>> +
>> +}
>> +#else
>> +static void __init intel_remapping_check(int num, int slot, int func)
>> +{
>> +}
>> +#endif
>> +
>> #define QFLAG_APPLY_ONCE 0x1
>> #define QFLAG_APPLIED 0x2
>> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
>> @@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
>> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
>> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
>> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
>> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
>> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
>> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
>> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
>> {}
>> };
>>
>> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
>> index d56f8c1..2b56e92 100644
>> --- a/drivers/iommu/irq_remapping.c
>> +++ b/drivers/iommu/irq_remapping.c
>> @@ -19,6 +19,7 @@
>> int irq_remapping_enabled;
>>
>> int disable_irq_remap;
>> +int irq_remap_broken;
>> int disable_sourceid_checking;
>> int no_x2apic_optout;
>>
>> @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
>> if (disable_irq_remap)
>> return 0;
>>
>> + if (irq_remap_broken) {
>> + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
>> + "This system BIOS has enabled interrupt remapping\n"
>> + "on a chipset that contains an erratum making that\n"
>> + "feature unstable. Please reboot with nointremap\n"
>> + "added to the kernel command line and contact\n"
>> + "your BIOS vendor for an update");
>
>What do you mean "This system BIOS has enabled interrupt remapping" ?
>BIOS have interrupt pre-enabled or BIOS just provide DMAR table ?
>
>Why do you need "Please reboot with nointremap" ?
>
>Thanks
>
>Yinghai
>
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-04-06 01:56:14

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Fri, Apr 5, 2013 at 1:31 PM, Neil Horman <[email protected]> wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem.

I'd still like to mention the bugzilla URL in the changelog
(https://bugzilla.redhat.com/show_bug.cgi?id=887006) if it can be made
public.

> ...

> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 3755ef4..bfa3139 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
> }
> #endif
>
> +#ifdef CONFIG_IRQ_REMAP
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> + u8 revision;
> +
> + revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
> +
> + /*
> + * Revision 0x13 of this chipset supports irq remapping
> + * but has an erratum that breaks its behavior, flag it as such
> + */
> + if (revision == 0x13)
> + irq_remap_broken = 1;
> +
> +}
> +#else
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> +}
> +#endif
> +
> #define QFLAG_APPLY_ONCE 0x1
> #define QFLAG_APPLIED 0x2
> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
> @@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> {}
> };
>
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index d56f8c1..2b56e92 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -19,6 +19,7 @@
> int irq_remapping_enabled;
>
> int disable_irq_remap;
> +int irq_remap_broken;
> int disable_sourceid_checking;
> int no_x2apic_optout;
>
> @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
> if (disable_irq_remap)
> return 0;
>
> + if (irq_remap_broken) {
> + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,

This looks like a typo (s/TAIN/TAINT/).

> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an erratum making that\n"
> + "feature unstable. Please reboot with nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");

I suspect your updated message won't mention "nointremap", but if it
does, Documentation/kernel-parameters.txt says that option is
deprecated and "intremap=off" should be used instead.

> + disable_irq_remap = 1;

Tell me if I have this correct:

Before this patch, we had interrupt remapping enabled and
virtualization enabled. This is safe, but devices might need resets
to deal with lost or spurious interrupts.

After this patch, these same machines will by default have interrupt
remapping disabled and virtualization enabled. The lost or spurious
interrupt problem should be gone, but we now have the IRQ injection
security bug.

If that's really the change we're making, I'm not comfortable applying
this patch. But I don't know the details of the IRQ injection
problem, so maybe my understanding of the implications is wrong.

> + return 0;
> + }
> +
> if (!remap_ops || !remap_ops->supported)
> return 0;
>
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index ecb6376..d7537e4 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -32,6 +32,7 @@ struct pci_dev;
> struct msi_msg;
>
> extern int disable_irq_remap;
> +extern int irq_remap_broken;
> extern int disable_sourceid_checking;
> extern int no_x2apic_optout;
> extern int irq_remapping_enabled;
> --
> 1.8.1.4
>

2013-04-06 02:32:21

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Fri, Apr 5, 2013 at 6:25 PM, Neil Horman <[email protected]> wrote:
> I'm sorry. Forgot to change the wording of the error for the new model that I'm following here. Although the message is mostly right as bios is responsible for setting and clearing the IRQ remapping feature bit in the chips capabilities register.
>
> I'll fix and repost Monday

>>> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
>>> index d56f8c1..2b56e92 100644
>>> --- a/drivers/iommu/irq_remapping.c
>>> +++ b/drivers/iommu/irq_remapping.c
>>> @@ -19,6 +19,7 @@
>>> int irq_remapping_enabled;
>>>
>>> int disable_irq_remap;
>>> +int irq_remap_broken;
>>> int disable_sourceid_checking;
>>> int no_x2apic_optout;
>>>
>>> @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
>>> if (disable_irq_remap)
>>> return 0;
>>>
>>> + if (irq_remap_broken) {
>>> + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
>>> + "This system BIOS has enabled interrupt remapping\n"
>>> + "on a chipset that contains an erratum making that\n"
>>> + "feature unstable. Please reboot with nointremap\n"
>>> + "added to the kernel command line and contact\n"
>>> + "your BIOS vendor for an update");

Also please put those warning code in to
drivers/iommu/intel_irq_remapping.c::intel_irq_remapping_supported()

It does not belong to drivers/iommu/irq_remapping.c.

Thanks

Yinghai

2013-04-08 15:29:59

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 04/05/2013 09:55 PM, Bjorn Helgaas wrote:
> On Fri, Apr 5, 2013 at 1:31 PM, Neil Horman<[email protected]> wrote:
>> A few years back intel published a spec update:
>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>
>> For the 5520 and 5500 chipsets which contained an errata (specificially errata
>> 53), which noted that these chipsets can't properly do interrupt remapping, and
>> as a result the recommend that interrupt remapping be disabled in bios. While
>> many vendors have a bios update to do exactly that, not all do, and of course
>> not all users update their bios to a level that corrects the problem. As a
>> result, occasionally interrupts can arrive at a cpu even after affinity for that
>> interrupt has be moved, leading to lost or spurrious interrupts (usually
>> characterized by the message:
>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>
>> There have been several incidents recently of people seeing this error, and
>> investigation has shown that they have system for which their BIOS level is such
>> that this feature was not properly turned off. As such, it would be good to
>> give them a reminder that their systems are vulnurable to this problem.
>
> I'd still like to mention the bugzilla URL in the changelog
> (https://bugzilla.redhat.com/show_bug.cgi?id=887006) if it can be made
> public.
>
>> ...
>
>> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
>> index 3755ef4..bfa3139 100644
>> --- a/arch/x86/kernel/early-quirks.c
>> +++ b/arch/x86/kernel/early-quirks.c
>> @@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
>> }
>> #endif
>>
>> +#ifdef CONFIG_IRQ_REMAP
>> +static void __init intel_remapping_check(int num, int slot, int func)
>> +{
>> + u8 revision;
>> +
>> + revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
>> +
>> + /*
>> + * Revision 0x13 of this chipset supports irq remapping
>> + * but has an erratum that breaks its behavior, flag it as such
>> + */
>> + if (revision == 0x13)
>> + irq_remap_broken = 1;
>> +
>> +}
>> +#else
>> +static void __init intel_remapping_check(int num, int slot, int func)
>> +{
>> +}
>> +#endif
>> +
>> #define QFLAG_APPLY_ONCE 0x1
>> #define QFLAG_APPLIED 0x2
>> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
>> @@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
>> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
>> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
>> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
>> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
>> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
>> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
>> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
>> {}
>> };
>>
>> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
>> index d56f8c1..2b56e92 100644
>> --- a/drivers/iommu/irq_remapping.c
>> +++ b/drivers/iommu/irq_remapping.c
>> @@ -19,6 +19,7 @@
>> int irq_remapping_enabled;
>>
>> int disable_irq_remap;
>> +int irq_remap_broken;
>> int disable_sourceid_checking;
>> int no_x2apic_optout;
>>
>> @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
>> if (disable_irq_remap)
>> return 0;
>>
>> + if (irq_remap_broken) {
>> + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
>
> This looks like a typo (s/TAIN/TAINT/).
>
>> + "This system BIOS has enabled interrupt remapping\n"
>> + "on a chipset that contains an erratum making that\n"
>> + "feature unstable. Please reboot with nointremap\n"
>> + "added to the kernel command line and contact\n"
>> + "your BIOS vendor for an update");
>
> I suspect your updated message won't mention "nointremap", but if it
> does, Documentation/kernel-parameters.txt says that option is
> deprecated and "intremap=off" should be used instead.
>
>> + disable_irq_remap = 1;
>
> Tell me if I have this correct:
>
> Before this patch, we had interrupt remapping enabled and
> virtualization enabled. This is safe, but devices might need resets
> to deal with lost or spurious interrupts.
>
Bigger then that -- system reboots are often necessary, and for virtualization,
that means not just the lost of the device, but all guests running on that host.

> After this patch, these same machines will by default have interrupt
> remapping disabled and virtualization enabled. The lost or spurious
> interrupt problem should be gone, but we now have the IRQ injection
> security bug.
>
IRQ injection security bug *if* device-assignment of a PCI(e) device
to a KVM guest is done. To do so, requires kvm to be loaded with
a parameter to allow device-assignment w/o intr-remapping (b/c certain chipsets
didn't have intr-remap support complete until this past summer).
So, a sysadmin would have to consciously enable this security vulnerability,
and is only a vulnerability if (a) the guest is not well known/behaved or
(b) the assigned device goes-bonkers/breaks.
This vulnerability has been known and in existence since the beginning of
device-assignment; intr-remap is the way to isolate it.
The end result on this (rev of this) chip set is the equivalent of running
device-assignment on a (2009 era) Q35 chipset -- a VT-d1 (IOMMU-only,
no-intr-remap) capable chipset.

> If that's really the change we're making, I'm not comfortable applying
> this patch. But I don't know the details of the IRQ injection
> problem, so maybe my understanding of the implications is wrong.
>
>> + return 0;
>> + }
>> +
>> if (!remap_ops || !remap_ops->supported)
>> return 0;
>>
>> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
>> index ecb6376..d7537e4 100644
>> --- a/drivers/iommu/irq_remapping.h
>> +++ b/drivers/iommu/irq_remapping.h
>> @@ -32,6 +32,7 @@ struct pci_dev;
>> struct msi_msg;
>>
>> extern int disable_irq_remap;
>> +extern int irq_remap_broken;
>> extern int disable_sourceid_checking;
>> extern int no_x2apic_optout;
>> extern int irq_remapping_enabled;
>> --
>> 1.8.1.4
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-04-08 17:17:46

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

[+cc Joerg, Konrad]

On Mon, Apr 8, 2013 at 9:29 AM, Don Dutile <[email protected]> wrote:
> On 04/05/2013 09:55 PM, Bjorn Helgaas wrote:
>>
>> On Fri, Apr 5, 2013 at 1:31 PM, Neil Horman<[email protected]> wrote:
>>>
>>> A few years back intel published a spec update:
>>>
>>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>>>
>>> For the 5520 and 5500 chipsets which contained an errata (specificially
>>> errata
>>> 53), which noted that these chipsets can't properly do interrupt
>>> remapping, and
>>> as a result the recommend that interrupt remapping be disabled in bios.
>>> While
>>> many vendors have a bios update to do exactly that, not all do, and of
>>> course
>>> not all users update their bios to a level that corrects the problem. As
>>> a
>>> result, occasionally interrupts can arrive at a cpu even after affinity
>>> for that
>>> interrupt has be moved, leading to lost or spurrious interrupts (usually
>>> characterized by the message:
>>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>>>
>>> There have been several incidents recently of people seeing this error,
>>> and
>>> investigation has shown that they have system for which their BIOS level
>>> is such
>>> that this feature was not properly turned off. As such, it would be good
>>> to
>>> give them a reminder that their systems are vulnurable to this problem.
>>
>>
>> I'd still like to mention the bugzilla URL in the changelog
>> (https://bugzilla.redhat.com/show_bug.cgi?id=887006) if it can be made
>> public.
>>
>>> ...
>>
>>
>>> diff --git a/arch/x86/kernel/early-quirks.c
>>> b/arch/x86/kernel/early-quirks.c
>>> index 3755ef4..bfa3139 100644
>>> --- a/arch/x86/kernel/early-quirks.c
>>> +++ b/arch/x86/kernel/early-quirks.c
>>> @@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot,
>>> int func)
>>> }
>>> #endif
>>>
>>> +#ifdef CONFIG_IRQ_REMAP
>>> +static void __init intel_remapping_check(int num, int slot, int func)
>>> +{
>>> + u8 revision;
>>> +
>>> + revision = pci_read_config_byte(num, slot, func ,
>>> PCI_REVISION_ID);
>>> +
>>> + /*
>>> + * Revision 0x13 of this chipset supports irq remapping
>>> + * but has an erratum that breaks its behavior, flag it as such
>>> + */
>>> + if (revision == 0x13)
>>> + irq_remap_broken = 1;
>>> +
>>> +}
>>> +#else
>>> +static void __init intel_remapping_check(int num, int slot, int func)
>>> +{
>>> +}
>>> +#endif
>>> +
>>> #define QFLAG_APPLY_ONCE 0x1
>>> #define QFLAG_APPLIED 0x2
>>> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
>>> @@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
>>> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
>>> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
>>> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
>>> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
>>> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
>>> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
>>> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
>>> {}
>>> };
>>>
>>> diff --git a/drivers/iommu/irq_remapping.c
>>> b/drivers/iommu/irq_remapping.c
>>> index d56f8c1..2b56e92 100644
>>> --- a/drivers/iommu/irq_remapping.c
>>> +++ b/drivers/iommu/irq_remapping.c
>>> @@ -19,6 +19,7 @@
>>> int irq_remapping_enabled;
>>>
>>> int disable_irq_remap;
>>> +int irq_remap_broken;
>>> int disable_sourceid_checking;
>>> int no_x2apic_optout;
>>>
>>> @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
>>> if (disable_irq_remap)
>>> return 0;
>>>
>>> + if (irq_remap_broken) {
>>> + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
>>
>>
>> This looks like a typo (s/TAIN/TAINT/).
>>
>>> + "This system BIOS has enabled interrupt
>>> remapping\n"
>>> + "on a chipset that contains an erratum making
>>> that\n"
>>> + "feature unstable. Please reboot with
>>> nointremap\n"
>>> + "added to the kernel command line and
>>> contact\n"
>>> + "your BIOS vendor for an update");
>>
>>
>> I suspect your updated message won't mention "nointremap", but if it
>> does, Documentation/kernel-parameters.txt says that option is
>> deprecated and "intremap=off" should be used instead.
>>
>>> + disable_irq_remap = 1;
>>
>>AMD
>> Tell me if I have this correct:
>>
>> Before this patch, we had interrupt remapping enabled and
>> virtualization enabled. This is safe, but devices might need resets
>> to deal with lost or spurious interrupts.
>>
> Bigger then that -- system reboots are often necessary, and for
> virtualization,
> that means not just the lost of the device, but all guests running on that
> host.
>
>
>> After this patch, these same machines will by default have interrupt
>> remapping disabled and virtualization enabled. The lost or spurious
>> interrupt problem should be gone, but we now have the IRQ injection
>> security bug.
>>
> IRQ injection security bug *if* device-assignment of a PCI(e) device
> to a KVM guest is done. To do so, requires kvm to be loaded with
> a parameter to allow device-assignment w/o intr-remapping (b/c certain
> chipsets
> didn't have intr-remap support complete until this past summer).
> So, a sysadmin would have to consciously enable this security vulnerability,
> and is only a vulnerability if (a) the guest is not well known/behaved or
> (b) the assigned device goes-bonkers/breaks.
> This vulnerability has been known and in existence since the beginning of
> device-assignment; intr-remap is the way to isolate it.
> The end result on this (rev of this) chip set is the equivalent of running
> device-assignment on a (2009 era) Q35 chipset -- a VT-d1 (IOMMU-only,
> no-intr-remap) capable chipset.

Thanks for the details, Don. It makes sense to me to disable
intr-remap on this chipset and handle it like an older machine that's
not capable of intr-remap at all. The IRQ injection issue should be
no worse than on those older machines.

I don't care whether the "if (irq_remap_broken)" test is in
irq_remapping.c or intel_irq_remapping.c. The quirk itself, where we
actually look at config space, is clearly Intel-specific, but there
could easily be similar AMD quirks that could also set
irq_remap_broken. In that case, it would make sense to have the test
in the common code.

Other than the fact that the quirk looks at PCI config space to find
the revision, this really isn't a PCI patch, so I hope somebody else
will take care of this. From MAINTAINERS, it looks like nobody else
wants irq_remapping.c either :) I cc'd Joerg and Konrad, who have
made many of the recent changes.

Bjorn

2013-04-08 17:43:15

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Mon, Apr 08, 2013 at 11:17:23AM -0600, Bjorn Helgaas wrote:
> [+cc Joerg, Konrad]
>
> On Mon, Apr 8, 2013 at 9:29 AM, Don Dutile <[email protected]> wrote:
> > On 04/05/2013 09:55 PM, Bjorn Helgaas wrote:
> >>
> >> On Fri, Apr 5, 2013 at 1:31 PM, Neil Horman<[email protected]> wrote:
> >>>
> >>> A few years back intel published a spec update:
> >>>
> >>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> >>>
> >>> For the 5520 and 5500 chipsets which contained an errata (specificially
> >>> errata
> >>> 53), which noted that these chipsets can't properly do interrupt
> >>> remapping, and
> >>> as a result the recommend that interrupt remapping be disabled in bios.
> >>> While
> >>> many vendors have a bios update to do exactly that, not all do, and of
> >>> course
> >>> not all users update their bios to a level that corrects the problem. As
> >>> a
> >>> result, occasionally interrupts can arrive at a cpu even after affinity
> >>> for that
> >>> interrupt has be moved, leading to lost or spurrious interrupts (usually
> >>> characterized by the message:
> >>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
> >>>
> >>> There have been several incidents recently of people seeing this error,
> >>> and
> >>> investigation has shown that they have system for which their BIOS level
> >>> is such
> >>> that this feature was not properly turned off. As such, it would be good
> >>> to
> >>> give them a reminder that their systems are vulnurable to this problem.
> >>
> >>
> >> I'd still like to mention the bugzilla URL in the changelog
> >> (https://bugzilla.redhat.com/show_bug.cgi?id=887006) if it can be made
> >> public.
> >>
> >>> ...
> >>
> >>
> >>> diff --git a/arch/x86/kernel/early-quirks.c
> >>> b/arch/x86/kernel/early-quirks.c
> >>> index 3755ef4..bfa3139 100644
> >>> --- a/arch/x86/kernel/early-quirks.c
> >>> +++ b/arch/x86/kernel/early-quirks.c
> >>> @@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot,
> >>> int func)
> >>> }
> >>> #endif
> >>>
> >>> +#ifdef CONFIG_IRQ_REMAP
> >>> +static void __init intel_remapping_check(int num, int slot, int func)
> >>> +{
> >>> + u8 revision;
> >>> +
> >>> + revision = pci_read_config_byte(num, slot, func ,
> >>> PCI_REVISION_ID);
> >>> +
> >>> + /*
> >>> + * Revision 0x13 of this chipset supports irq remapping
> >>> + * but has an erratum that breaks its behavior, flag it as such
> >>> + */
> >>> + if (revision == 0x13)
> >>> + irq_remap_broken = 1;
> >>> +
> >>> +}
> >>> +#else
> >>> +static void __init intel_remapping_check(int num, int slot, int func)
> >>> +{
> >>> +}
> >>> +#endif
> >>> +
> >>> #define QFLAG_APPLY_ONCE 0x1
> >>> #define QFLAG_APPLIED 0x2
> >>> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
> >>> @@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
> >>> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
> >>> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
> >>> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
> >>> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
> >>> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> >>> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
> >>> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> >>> {}
> >>> };
> >>>
> >>> diff --git a/drivers/iommu/irq_remapping.c
> >>> b/drivers/iommu/irq_remapping.c
> >>> index d56f8c1..2b56e92 100644
> >>> --- a/drivers/iommu/irq_remapping.c
> >>> +++ b/drivers/iommu/irq_remapping.c
> >>> @@ -19,6 +19,7 @@
> >>> int irq_remapping_enabled;
> >>>
> >>> int disable_irq_remap;
> >>> +int irq_remap_broken;
> >>> int disable_sourceid_checking;
> >>> int no_x2apic_optout;
> >>>
> >>> @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
> >>> if (disable_irq_remap)
> >>> return 0;
> >>>
> >>> + if (irq_remap_broken) {
> >>> + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
> >>
> >>
> >> This looks like a typo (s/TAIN/TAINT/).
> >>
> >>> + "This system BIOS has enabled interrupt
> >>> remapping\n"
> >>> + "on a chipset that contains an erratum making
> >>> that\n"
> >>> + "feature unstable. Please reboot with
> >>> nointremap\n"
> >>> + "added to the kernel command line and
> >>> contact\n"
> >>> + "your BIOS vendor for an update");
> >>
> >>
> >> I suspect your updated message won't mention "nointremap", but if it
> >> does, Documentation/kernel-parameters.txt says that option is
> >> deprecated and "intremap=off" should be used instead.
> >>
> >>> + disable_irq_remap = 1;
> >>
> >>AMD
> >> Tell me if I have this correct:
> >>
> >> Before this patch, we had interrupt remapping enabled and
> >> virtualization enabled. This is safe, but devices might need resets
> >> to deal with lost or spurious interrupts.
> >>
> > Bigger then that -- system reboots are often necessary, and for
> > virtualization,
> > that means not just the lost of the device, but all guests running on that
> > host.
> >
> >
> >> After this patch, these same machines will by default have interrupt
> >> remapping disabled and virtualization enabled. The lost or spurious
> >> interrupt problem should be gone, but we now have the IRQ injection
> >> security bug.
> >>
> > IRQ injection security bug *if* device-assignment of a PCI(e) device
> > to a KVM guest is done. To do so, requires kvm to be loaded with
> > a parameter to allow device-assignment w/o intr-remapping (b/c certain
> > chipsets
> > didn't have intr-remap support complete until this past summer).
> > So, a sysadmin would have to consciously enable this security vulnerability,
> > and is only a vulnerability if (a) the guest is not well known/behaved or
> > (b) the assigned device goes-bonkers/breaks.
> > This vulnerability has been known and in existence since the beginning of
> > device-assignment; intr-remap is the way to isolate it.
> > The end result on this (rev of this) chip set is the equivalent of running
> > device-assignment on a (2009 era) Q35 chipset -- a VT-d1 (IOMMU-only,
> > no-intr-remap) capable chipset.
>
> Thanks for the details, Don. It makes sense to me to disable
> intr-remap on this chipset and handle it like an older machine that's
> not capable of intr-remap at all. The IRQ injection issue should be
> no worse than on those older machines.
>
> I don't care whether the "if (irq_remap_broken)" test is in
> irq_remapping.c or intel_irq_remapping.c. The quirk itself, where we
> actually look at config space, is clearly Intel-specific, but there
> could easily be similar AMD quirks that could also set
> irq_remap_broken. In that case, it would make sense to have the test
> in the common code.
>
I've moved it to intel specific code for the time being, since it currently is
intel specific, its an easy move to put it in a common location if other vendors
have a need for it.

I'm currently waiting on aproval to make the bz public, so that its inclusion in
the changelog is more than just an irritation when following the link results in
a 403 error. As soon as thats square, I'll post this again, CC-ing Jeorg and
Konrad.

Thanks
Neil

2013-04-09 10:08:07

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v6] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Mon, Apr 08, 2013 at 01:42:57PM -0400, Neil Horman wrote:
> On Mon, Apr 08, 2013 at 11:17:23AM -0600, Bjorn Helgaas wrote:

> > I don't care whether the "if (irq_remap_broken)" test is in
> > irq_remapping.c or intel_irq_remapping.c. The quirk itself, where we
> > actually look at config space, is clearly Intel-specific, but there
> > could easily be similar AMD quirks that could also set
> > irq_remap_broken. In that case, it would make sense to have the test
> > in the common code.
> >
> I've moved it to intel specific code for the time being, since it currently is
> intel specific, its an easy move to put it in a common location if other vendors
> have a need for it.
>
> I'm currently waiting on aproval to make the bz public, so that its inclusion in
> the changelog is more than just an irritation when following the link results in
> a 403 error. As soon as thats square, I'll post this again, CC-ing Jeorg and
> Konrad.

Thanks Neil. As long as this quirk is intel specific it should be in
the intel-code.


Joerg

2013-04-15 11:19:16

by Neil Horman

[permalink] [raw]
Subject: [PATCH v7] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem. For
details of those that reported the problem, please see:
https://bugzilla.redhat.com/show_bug.cgi?id=887006

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: David Woodhouse <[email protected]>
CC: [email protected]
CC: Joerg Roedel <[email protected]>
CC: Konrad Rzeszutek Wilk <[email protected]>
---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.

v3)

* Removed defines from pci_ids.h, and used direct id values as per request from
Bjorn.

v4)

* Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
Woodhouse

v5)

* Moved check to an early quirk, and flagged the broken chip, so we could
reasonably disable irq remapping during bootup.

v6)

* Clean up of stupid extra thrash in quirks.c

v7)

* Move broken check to intel_irq_remapping.c
* Fixed another typo
* Finally made the reference bugzilla public
---
arch/x86/kernel/early-quirks.c | 25 +++++++++++++++++++++++++
drivers/iommu/intel_irq_remapping.c | 10 ++++++++++
drivers/iommu/irq_remapping.c | 12 ++++++++++++
drivers/iommu/irq_remapping.h | 1 +
4 files changed, 48 insertions(+)

diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 3755ef4..bfa3139 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
}
#endif

+#ifdef CONFIG_IRQ_REMAP
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+ u8 revision;
+
+ revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
+
+ /*
+ * Revision 0x13 of this chipset supports irq remapping
+ * but has an erratum that breaks its behavior, flag it as such
+ */
+ if (revision == 0x13)
+ irq_remap_broken = 1;
+
+}
+#else
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+}
+#endif
+
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+ { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
{}
};

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index f3b8f23..5b19b2d 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -524,6 +524,16 @@ static int __init intel_irq_remapping_supported(void)

if (disable_irq_remap)
return 0;
+ if (irq_remap_broken) {
+ WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an erratum making that\n"
+ "feature unstable. To maintain system stability\n"
+ "interrupt remapping is being disabled. Please\n"
+ "contact your BIOS vendor for an update\n");
+ disable_irq_remap = 1;
+ return 0;
+ }

if (!dmar_ir_support())
return 0;
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index d56f8c1..2b56e92 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -19,6 +19,7 @@
int irq_remapping_enabled;

int disable_irq_remap;
+int irq_remap_broken;
int disable_sourceid_checking;
int no_x2apic_optout;

@@ -216,6 +217,17 @@ int irq_remapping_supported(void)
if (disable_irq_remap)
return 0;

+ if (irq_remap_broken) {
+ WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an erratum making that\n"
+ "feature unstable. Please reboot with nointremap\n"
+ "added to the kernel command line and contact\n"
+ "your BIOS vendor for an update");
+ disable_irq_remap = 1;
+ return 0;
+ }
+
if (!remap_ops || !remap_ops->supported)
return 0;

diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index ecb6376..d7537e4 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -32,6 +32,7 @@ struct pci_dev;
struct msi_msg;

extern int disable_irq_remap;
+extern int irq_remap_broken;
extern int disable_sourceid_checking;
extern int no_x2apic_optout;
extern int irq_remapping_enabled;
--
1.8.1.4

2013-04-15 15:31:01

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v7] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Mon, Apr 15, 2013 at 5:18 AM, Neil Horman <[email protected]> wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem. For
> details of those that reported the problem, please see:
> https://bugzilla.redhat.com/show_bug.cgi?id=887006
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: [email protected]
> CC: Joerg Roedel <[email protected]>
> CC: Konrad Rzeszutek Wilk <[email protected]>
> ---
>
> Change notes:
>
> v2)
>
> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> chipset series is x86 only. I decided however to keep the quirk as a regular
> quirk, not an early_quirk. Early quirks have no way currently to determine if
> BIOS has properly disabled the feature in the iommu, at least not without
> significant hacking, and since its quite possible this will be a short lived
> quirk, should Don Z's workaround code prove successful (and it looks like it may
> well), I don't think that necessecary.
>
> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> string, I opted to leave the newlines in place however, as I really couldnt
> find a way to keep the text on a single line is still legible from a code
> perspective. I think theres enough language in there that using cscope on just
> about any substring however will turn it up, and again, this may be a short
> lived quirk.
>
> v3)
>
> * Removed defines from pci_ids.h, and used direct id values as per request from
> Bjorn.
>
> v4)
>
> * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
> Woodhouse
>
> v5)
>
> * Moved check to an early quirk, and flagged the broken chip, so we could
> reasonably disable irq remapping during bootup.
>
> v6)
>
> * Clean up of stupid extra thrash in quirks.c
>
> v7)
>
> * Move broken check to intel_irq_remapping.c
> * Fixed another typo
> * Finally made the reference bugzilla public
> ---
> arch/x86/kernel/early-quirks.c | 25 +++++++++++++++++++++++++
> drivers/iommu/intel_irq_remapping.c | 10 ++++++++++
> drivers/iommu/irq_remapping.c | 12 ++++++++++++
> drivers/iommu/irq_remapping.h | 1 +
> 4 files changed, 48 insertions(+)
>
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 3755ef4..bfa3139 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
> }
> #endif
>
> +#ifdef CONFIG_IRQ_REMAP
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> + u8 revision;
> +
> + revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
> +
> + /*
> + * Revision 0x13 of this chipset supports irq remapping
> + * but has an erratum that breaks its behavior, flag it as such
> + */
> + if (revision == 0x13)
> + irq_remap_broken = 1;
> +
> +}
> +#else
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> +}
> +#endif
> +
> #define QFLAG_APPLY_ONCE 0x1
> #define QFLAG_APPLIED 0x2
> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
> @@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> {}
> };
>
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index f3b8f23..5b19b2d 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -524,6 +524,16 @@ static int __init intel_irq_remapping_supported(void)
>
> if (disable_irq_remap)
> return 0;
> + if (irq_remap_broken) {
> + WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an erratum making that\n"
> + "feature unstable. To maintain system stability\n"
> + "interrupt remapping is being disabled. Please\n"
> + "contact your BIOS vendor for an update\n");
> + disable_irq_remap = 1;
> + return 0;
> + }
>
> if (!dmar_ir_support())
> return 0;
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index d56f8c1..2b56e92 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -19,6 +19,7 @@
> int irq_remapping_enabled;
>
> int disable_irq_remap;
> +int irq_remap_broken;
> int disable_sourceid_checking;
> int no_x2apic_optout;
>
> @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
> if (disable_irq_remap)
> return 0;
>
> + if (irq_remap_broken) {
> + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an erratum making that\n"
> + "feature unstable. Please reboot with nointremap\n"
> + "added to the kernel command line and contact\n"
> + "your BIOS vendor for an update");
> + disable_irq_remap = 1;
> + return 0;
> + }
> +

I think you probably intended to drop the change to this file, since
you moved it to intel_irq_remapping.c. If you did intend to keep it,
s/TAIN_/TAINT_/.

> if (!remap_ops || !remap_ops->supported)
> return 0;
>
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index ecb6376..d7537e4 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -32,6 +32,7 @@ struct pci_dev;
> struct msi_msg;
>
> extern int disable_irq_remap;
> +extern int irq_remap_broken;
> extern int disable_sourceid_checking;
> extern int no_x2apic_optout;
> extern int irq_remapping_enabled;
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-04-15 16:28:21

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v7] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Mon, Apr 15, 2013 at 09:30:36AM -0600, Bjorn Helgaas wrote:
> On Mon, Apr 15, 2013 at 5:18 AM, Neil Horman <[email protected]> wrote:
> > A few years back intel published a spec update:
> > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> >
> > For the 5520 and 5500 chipsets which contained an errata (specificially errata
> > 53), which noted that these chipsets can't properly do interrupt remapping, and
> > as a result the recommend that interrupt remapping be disabled in bios. While
> > many vendors have a bios update to do exactly that, not all do, and of course
> > not all users update their bios to a level that corrects the problem. As a
> > result, occasionally interrupts can arrive at a cpu even after affinity for that
> > interrupt has be moved, leading to lost or spurrious interrupts (usually
> > characterized by the message:
> > kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
> >
> > There have been several incidents recently of people seeing this error, and
> > investigation has shown that they have system for which their BIOS level is such
> > that this feature was not properly turned off. As such, it would be good to
> > give them a reminder that their systems are vulnurable to this problem. For
> > details of those that reported the problem, please see:
> > https://bugzilla.redhat.com/show_bug.cgi?id=887006
> >
> > Signed-off-by: Neil Horman <[email protected]>
> > CC: Prarit Bhargava <[email protected]>
> > CC: Don Zickus <[email protected]>
> > CC: Don Dutile <[email protected]>
> > CC: Bjorn Helgaas <[email protected]>
> > CC: Asit Mallick <[email protected]>
> > CC: David Woodhouse <[email protected]>
> > CC: [email protected]
> > CC: Joerg Roedel <[email protected]>
> > CC: Konrad Rzeszutek Wilk <[email protected]>
> > ---
> >
> > Change notes:
> >
> > v2)
> >
> > * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> > chipset series is x86 only. I decided however to keep the quirk as a regular
> > quirk, not an early_quirk. Early quirks have no way currently to determine if
> > BIOS has properly disabled the feature in the iommu, at least not without
> > significant hacking, and since its quite possible this will be a short lived
> > quirk, should Don Z's workaround code prove successful (and it looks like it may
> > well), I don't think that necessecary.
> >
> > * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> > string, I opted to leave the newlines in place however, as I really couldnt
> > find a way to keep the text on a single line is still legible from a code
> > perspective. I think theres enough language in there that using cscope on just
> > about any substring however will turn it up, and again, this may be a short
> > lived quirk.
> >
> > v3)
> >
> > * Removed defines from pci_ids.h, and used direct id values as per request from
> > Bjorn.
> >
> > v4)
> >
> > * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
> > Woodhouse
> >
> > v5)
> >
> > * Moved check to an early quirk, and flagged the broken chip, so we could
> > reasonably disable irq remapping during bootup.
> >
> > v6)
> >
> > * Clean up of stupid extra thrash in quirks.c
> >
> > v7)
> >
> > * Move broken check to intel_irq_remapping.c
> > * Fixed another typo
> > * Finally made the reference bugzilla public
> > ---
> > arch/x86/kernel/early-quirks.c | 25 +++++++++++++++++++++++++
> > drivers/iommu/intel_irq_remapping.c | 10 ++++++++++
> > drivers/iommu/irq_remapping.c | 12 ++++++++++++
> > drivers/iommu/irq_remapping.h | 1 +
> > 4 files changed, 48 insertions(+)
> >
> > diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> > index 3755ef4..bfa3139 100644
> > --- a/arch/x86/kernel/early-quirks.c
> > +++ b/arch/x86/kernel/early-quirks.c
> > @@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
> > }
> > #endif
> >
> > +#ifdef CONFIG_IRQ_REMAP
> > +static void __init intel_remapping_check(int num, int slot, int func)
> > +{
> > + u8 revision;
> > +
> > + revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
> > +
> > + /*
> > + * Revision 0x13 of this chipset supports irq remapping
> > + * but has an erratum that breaks its behavior, flag it as such
> > + */
> > + if (revision == 0x13)
> > + irq_remap_broken = 1;
> > +
> > +}
> > +#else
> > +static void __init intel_remapping_check(int num, int slot, int func)
> > +{
> > +}
> > +#endif
> > +
> > #define QFLAG_APPLY_ONCE 0x1
> > #define QFLAG_APPLIED 0x2
> > #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
> > @@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
> > PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
> > { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
> > PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
> > + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
> > + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> > + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
> > + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> > {}
> > };
> >
> > diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> > index f3b8f23..5b19b2d 100644
> > --- a/drivers/iommu/intel_irq_remapping.c
> > +++ b/drivers/iommu/intel_irq_remapping.c
> > @@ -524,6 +524,16 @@ static int __init intel_irq_remapping_supported(void)
> >
> > if (disable_irq_remap)
> > return 0;
> > + if (irq_remap_broken) {
> > + WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
> > + "This system BIOS has enabled interrupt remapping\n"
> > + "on a chipset that contains an erratum making that\n"
> > + "feature unstable. To maintain system stability\n"
> > + "interrupt remapping is being disabled. Please\n"
> > + "contact your BIOS vendor for an update\n");
> > + disable_irq_remap = 1;
> > + return 0;
> > + }
> >
> > if (!dmar_ir_support())
> > return 0;
> > diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> > index d56f8c1..2b56e92 100644
> > --- a/drivers/iommu/irq_remapping.c
> > +++ b/drivers/iommu/irq_remapping.c
> > @@ -19,6 +19,7 @@
> > int irq_remapping_enabled;
> >
> > int disable_irq_remap;
> > +int irq_remap_broken;
> > int disable_sourceid_checking;
> > int no_x2apic_optout;
> >
> > @@ -216,6 +217,17 @@ int irq_remapping_supported(void)
> > if (disable_irq_remap)
> > return 0;
> >
> > + if (irq_remap_broken) {
> > + WARN_TAINT(1, TAIN_FIRMWARE_WORKAROUND,
> > + "This system BIOS has enabled interrupt remapping\n"
> > + "on a chipset that contains an erratum making that\n"
> > + "feature unstable. Please reboot with nointremap\n"
> > + "added to the kernel command line and contact\n"
> > + "your BIOS vendor for an update");
> > + disable_irq_remap = 1;
> > + return 0;
> > + }
> > +
>
> I think you probably intended to drop the change to this file, since
> you moved it to intel_irq_remapping.c. If you did intend to keep it,
> s/TAIN_/TAINT_/.
Damnit, yes, sorry. Not sure how that snuck back in there. I know I had
removed it. Perhaps I forgot to commit that part of the change.

Neil

2013-04-15 16:28:20

by Neil Horman

[permalink] [raw]
Subject: [PATCH v8] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem. For
details of those that reported the problem, please see:
https://bugzilla.redhat.com/show_bug.cgi?id=887006

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: David Woodhouse <[email protected]>
CC: [email protected]
CC: Joerg Roedel <[email protected]>
CC: Konrad Rzeszutek Wilk <[email protected]>
---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.

v3)

* Removed defines from pci_ids.h, and used direct id values as per request from
Bjorn.

v4)

* Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
Woodhouse

v5)

* Moved check to an early quirk, and flagged the broken chip, so we could
reasonably disable irq remapping during bootup.

v6)

* Clean up of stupid extra thrash in quirks.c

v7)

* Move broken check to intel_irq_remapping.c
* Fixed another typo
* Finally made the reference bugzilla public

v8)

* Removed extraneous code from irq_remapping_enabled
---
arch/x86/kernel/early-quirks.c | 25 +++++++++++++++++++++++++
drivers/iommu/intel_irq_remapping.c | 10 ++++++++++
drivers/iommu/irq_remapping.c | 1 +
drivers/iommu/irq_remapping.h | 1 +
4 files changed, 37 insertions(+)

diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 3755ef4..bfa3139 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -192,6 +192,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
}
#endif

+#ifdef CONFIG_IRQ_REMAP
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+ u8 revision;
+
+ revision = pci_read_config_byte(num, slot, func , PCI_REVISION_ID);
+
+ /*
+ * Revision 0x13 of this chipset supports irq remapping
+ * but has an erratum that breaks its behavior, flag it as such
+ */
+ if (revision == 0x13)
+ irq_remap_broken = 1;
+
+}
+#else
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+}
+#endif
+
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -221,6 +242,10 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+ { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
{}
};

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index f3b8f23..5b19b2d 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -524,6 +524,16 @@ static int __init intel_irq_remapping_supported(void)

if (disable_irq_remap)
return 0;
+ if (irq_remap_broken) {
+ WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an erratum making that\n"
+ "feature unstable. To maintain system stability\n"
+ "interrupt remapping is being disabled. Please\n"
+ "contact your BIOS vendor for an update\n");
+ disable_irq_remap = 1;
+ return 0;
+ }

if (!dmar_ir_support())
return 0;
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index d56f8c1..04d975f 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -19,6 +19,7 @@
int irq_remapping_enabled;

int disable_irq_remap;
+int irq_remap_broken;
int disable_sourceid_checking;
int no_x2apic_optout;

diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index ecb6376..d7537e4 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -32,6 +32,7 @@ struct pci_dev;
struct msi_msg;

extern int disable_irq_remap;
+extern int irq_remap_broken;
extern int disable_sourceid_checking;
extern int no_x2apic_optout;
extern int irq_remapping_enabled;
--
1.8.1.4

2013-04-15 22:41:43

by Neil Horman

[permalink] [raw]
Subject: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem. For
details of those that reported the problem, please see:
https://bugzilla.redhat.com/show_bug.cgi?id=887006

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: David Woodhouse <[email protected]>
CC: [email protected]
CC: Joerg Roedel <[email protected]>
CC: Konrad Rzeszutek Wilk <[email protected]>
CC: Arkadiusz Miśkiewicz <[email protected]>
---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.

v3)

* Removed defines from pci_ids.h, and used direct id values as per request from
Bjorn.

v4)

* Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
Woodhouse

v5)

* Moved check to an early quirk, and flagged the broken chip, so we could
reasonably disable irq remapping during bootup.

v6)

* Clean up of stupid extra thrash in quirks.c

v7)

* Move broken check to intel_irq_remapping.c
* Fixed another typo
* Finally made the reference bugzilla public

v8)

* Removed extraneous code from irq_remapping_enabled

v9)

* Fix stupid build break from rushing to shuffle simmilar header files about
Thanks to Arkadiusz Miśkiewicz for pointing it out
---
arch/x86/kernel/early-quirks.c | 26 ++++++++++++++++++++++++++
drivers/iommu/intel_irq_remapping.c | 10 ++++++++++
drivers/iommu/irq_remapping.c | 1 +
drivers/iommu/irq_remapping.h | 2 ++
4 files changed, 39 insertions(+)

diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 3755ef4..ef4ac6c 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -18,6 +18,7 @@
#include <asm/apic.h>
#include <asm/iommu.h>
#include <asm/gart.h>
+#include "../drivers/iommu/irq_remapping.h"

static void __init fix_hypertransport_config(int num, int slot, int func)
{
@@ -192,6 +193,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
}
#endif

+#ifdef CONFIG_IRQ_REMAP
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+ u8 revision;
+
+ revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
+
+ /*
+ * Revision 0x13 of this chipset supports irq remapping
+ * but has an erratum that breaks its behavior, flag it as such
+ */
+ if (revision == 0x13)
+ irq_remap_broken = 1;
+
+}
+#else
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+}
+#endif
+
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -221,6 +243,10 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+ { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
{}
};

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index f3b8f23..5b19b2d 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -524,6 +524,16 @@ static int __init intel_irq_remapping_supported(void)

if (disable_irq_remap)
return 0;
+ if (irq_remap_broken) {
+ WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an erratum making that\n"
+ "feature unstable. To maintain system stability\n"
+ "interrupt remapping is being disabled. Please\n"
+ "contact your BIOS vendor for an update\n");
+ disable_irq_remap = 1;
+ return 0;
+ }

if (!dmar_ir_support())
return 0;
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index d56f8c1..04d975f 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -19,6 +19,7 @@
int irq_remapping_enabled;

int disable_irq_remap;
+int irq_remap_broken;
int disable_sourceid_checking;
int no_x2apic_optout;

diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index ecb6376..90c4dae 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -32,6 +32,7 @@ struct pci_dev;
struct msi_msg;

extern int disable_irq_remap;
+extern int irq_remap_broken;
extern int disable_sourceid_checking;
extern int no_x2apic_optout;
extern int irq_remapping_enabled;
@@ -89,6 +90,7 @@ extern struct irq_remap_ops amd_iommu_irq_ops;

#define irq_remapping_enabled 0
#define disable_irq_remap 1
+#define irq_remap_broken 0

#endif /* CONFIG_IRQ_REMAP */

--
1.8.1.4

2013-04-15 23:02:58

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Mon, Apr 15, 2013 at 3:41 PM, Neil Horman <[email protected]> wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem. For
> details of those that reported the problem, please see:
> https://bugzilla.redhat.com/show_bug.cgi?id=887006
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: [email protected]
> CC: Joerg Roedel <[email protected]>
> CC: Konrad Rzeszutek Wilk <[email protected]>
> CC: Arkadiusz Miśkiewicz <[email protected]>
> ---
>
> Change notes:
>
> v2)
>
> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> chipset series is x86 only. I decided however to keep the quirk as a regular
> quirk, not an early_quirk. Early quirks have no way currently to determine if
> BIOS has properly disabled the feature in the iommu, at least not without
> significant hacking, and since its quite possible this will be a short lived
> quirk, should Don Z's workaround code prove successful (and it looks like it may
> well), I don't think that necessecary.
>
> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> string, I opted to leave the newlines in place however, as I really couldnt
> find a way to keep the text on a single line is still legible from a code
> perspective. I think theres enough language in there that using cscope on just
> about any substring however will turn it up, and again, this may be a short
> lived quirk.
>
> v3)
>
> * Removed defines from pci_ids.h, and used direct id values as per request from
> Bjorn.
>
> v4)
>
> * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
> Woodhouse
>
> v5)
>
> * Moved check to an early quirk, and flagged the broken chip, so we could
> reasonably disable irq remapping during bootup.
>
> v6)
>
> * Clean up of stupid extra thrash in quirks.c
>
> v7)
>
> * Move broken check to intel_irq_remapping.c
> * Fixed another typo
> * Finally made the reference bugzilla public
>
> v8)
>
> * Removed extraneous code from irq_remapping_enabled
>
> v9)
>
> * Fix stupid build break from rushing to shuffle simmilar header files about
> Thanks to Arkadiusz Miśkiewicz for pointing it out
> ---
> arch/x86/kernel/early-quirks.c | 26 ++++++++++++++++++++++++++
> drivers/iommu/intel_irq_remapping.c | 10 ++++++++++
> drivers/iommu/irq_remapping.c | 1 +
> drivers/iommu/irq_remapping.h | 2 ++
> 4 files changed, 39 insertions(+)
>
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 3755ef4..ef4ac6c 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -18,6 +18,7 @@
> #include <asm/apic.h>
> #include <asm/iommu.h>
> #include <asm/gart.h>
> +#include "../drivers/iommu/irq_remapping.h"

looks ugly.

>
> static void __init fix_hypertransport_config(int num, int slot, int func)
> {
> @@ -192,6 +193,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
> }
> #endif
>
> +#ifdef CONFIG_IRQ_REMAP
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> + u8 revision;
> +
> + revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
> +
> + /*
> + * Revision 0x13 of this chipset supports irq remapping
> + * but has an erratum that breaks its behavior, flag it as such
> + */
> + if (revision == 0x13)
> + irq_remap_broken = 1;

change to more specific like:

intel_55xx_rev13_found?

> +
> +}
> +#else
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> +}
> +#endif
> +
> #define QFLAG_APPLY_ONCE 0x1
> #define QFLAG_APPLIED 0x2
> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
> @@ -221,6 +243,10 @@ static struct chipset early_qrk[] __initdata = {
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> {}
> };
>
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index f3b8f23..5b19b2d 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -524,6 +524,16 @@ static int __init intel_irq_remapping_supported(void)
>
> if (disable_irq_remap)
> return 0;
> + if (irq_remap_broken) {
> + WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an erratum making that\n"
> + "feature unstable. To maintain system stability\n"
> + "interrupt remapping is being disabled. Please\n"
> + "contact your BIOS vendor for an update\n");
> + disable_irq_remap = 1;
> + return 0;
> + }
>
> if (!dmar_ir_support())
> return 0;
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index d56f8c1..04d975f 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -19,6 +19,7 @@
> int irq_remapping_enabled;
>
> int disable_irq_remap;
> +int irq_remap_broken;
> int disable_sourceid_checking;
> int no_x2apic_optout;
>
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index ecb6376..90c4dae 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -32,6 +32,7 @@ struct pci_dev;
> struct msi_msg;
>
> extern int disable_irq_remap;
> +extern int irq_remap_broken;
> extern int disable_sourceid_checking;
> extern int no_x2apic_optout;
> extern int irq_remapping_enabled;
> @@ -89,6 +90,7 @@ extern struct irq_remap_ops amd_iommu_irq_ops;
>
> #define irq_remapping_enabled 0
> #define disable_irq_remap 1
> +#define irq_remap_broken 0

this one is needed

>
> #endif /* CONFIG_IRQ_REMAP */
>
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-04-16 00:44:13

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Mon, Apr 15, 2013 at 04:02:56PM -0700, Yinghai Lu wrote:
> On Mon, Apr 15, 2013 at 3:41 PM, Neil Horman <[email protected]> wrote:
> > A few years back intel published a spec update:
> > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> >
><snip>
> > diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> > index 3755ef4..ef4ac6c 100644
> > --- a/arch/x86/kernel/early-quirks.c
> > +++ b/arch/x86/kernel/early-quirks.c
> > @@ -18,6 +18,7 @@
> > #include <asm/apic.h>
> > #include <asm/iommu.h>
> > #include <asm/gart.h>
> > +#include "../drivers/iommu/irq_remapping.h"
>
> looks ugly.
>
Yes, but I think its acceptible given that it makes sense to me to define the
irq_remap_broken flag in the common driver code. We can certainly move the
header around, but I'd much rather do that in a separate patch.

> >
> > static void __init fix_hypertransport_config(int num, int slot, int func)
> > {
> > @@ -192,6 +193,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
> > }
> > #endif
> >
> > +#ifdef CONFIG_IRQ_REMAP
> > +static void __init intel_remapping_check(int num, int slot, int func)
> > +{
> > + u8 revision;
> > +
> > + revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
> > +
> > + /*
> > + * Revision 0x13 of this chipset supports irq remapping
> > + * but has an erratum that breaks its behavior, flag it as such
> > + */
> > + if (revision == 0x13)
> > + irq_remap_broken = 1;
>
> change to more specific like:
>
> intel_55xx_rev13_found?
>
No. This was discussed previously, and the consensus was that we
can use a generic name, should other chips have simmilarly broken functionality.

><snip>
> >
> > int disable_irq_remap;
> > +int irq_remap_broken;
> > int disable_sourceid_checking;
> > int no_x2apic_optout;
> >
> > diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> > index ecb6376..90c4dae 100644
> > --- a/drivers/iommu/irq_remapping.h
> > +++ b/drivers/iommu/irq_remapping.h
> > @@ -32,6 +32,7 @@ struct pci_dev;
> > struct msi_msg;
> >
> > extern int disable_irq_remap;
> > +extern int irq_remap_broken;
> > extern int disable_sourceid_checking;
> > extern int no_x2apic_optout;
> > extern int irq_remapping_enabled;
> > @@ -89,6 +90,7 @@ extern struct irq_remap_ops amd_iommu_irq_ops;
> >
> > #define irq_remapping_enabled 0
> > #define disable_irq_remap 1
> > +#define irq_remap_broken 0
>
> this one is needed
>
Um, yes? I think you mean to say its not needed, since all the users of this
check are only in code thats compiled conditionally with CONFIG_IRQ_REMAP.
You're correct, but I like to have it there for completness, should that change
in the future.

Neil

2013-04-16 06:20:08

by Arkadiusz Miskiewicz

[permalink] [raw]
Subject: Re: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Tuesday 16 of April 2013, Neil Horman wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chi
> pset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially
> errata 53), which noted that these chipsets can't properly do interrupt
> remapping, and as a result the recommend that interrupt remapping be
> disabled in bios. While many vendors have a bios update to do exactly
> that, not all do, and of course not all users update their bios to a level
> that corrects the problem. As a result, occasionally interrupts can
> arrive at a cpu even after affinity for that interrupt has be moved,
> leading to lost or spurrious interrupts (usually characterized by the
> message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is
> such that this feature was not properly turned off. As such, it would be
> good to give them a reminder that their systems are vulnurable to this
> problem. For details of those that reported the problem, please see:
> https://bugzilla.redhat.com/show_bug.cgi?id=887006

Tested-by: Arkadiusz Miśkiewicz <[email protected]>

(on top of 3.7.10 kernel)

Stack trace looks useless for me in this case but according changelog this was
already discussed.

[ 0.137512] Freeing SMP alternatives: 20k freed
[ 0.143539] ACPI: Core revision 20120913
[ 0.156067] dmar: Host address width 40
[ 0.160440] dmar: DRHD base: 0x000000fe710000 flags: 0x1
[ 0.166467] dmar: IOMMU 0: reg_base_addr fe710000 ver 1:0 cap
c90780106f0462 ecap f020f7
[ 0.175618] dmar: RMRR base: 0x0000008f62f000 end: 0x0000008f631fff
[ 0.182705] dmar: RMRR base: 0x0000008f61a000 end: 0x0000008f61afff
[ 0.189792] dmar: RMRR base: 0x0000008f617000 end: 0x0000008f617fff
[ 0.196871] dmar: RMRR base: 0x0000008f614000 end: 0x0000008f614fff
[ 0.203960] dmar: RMRR base: 0x0000008f611000 end: 0x0000008f611fff
[ 0.211047] dmar: RMRR base: 0x0000008f60e000 end: 0x0000008f60efff
[ 0.218135] dmar: RMRR base: 0x0000008f60b000 end: 0x0000008f60bfff
[ 0.225222] dmar: RMRR base: 0x0000008f608000 end: 0x0000008f608fff
[ 0.232309] dmar: RMRR base: 0x0000008f605000 end: 0x0000008f605fff
[ 0.239388] dmar: ATSR flags: 0x0
[ 0.243273] ------------[ cut here ]------------
[ 0.248515] WARNING: at
/home/users/arekm/rpm/BUILD/kernel-3.7.10/linux-3.7/drivers/iommu/intel_irq_remapping.c:518
intel_irq_remapping_supported+0x37/0x7
a()
[ 0.264358] Hardware name: S5500WB
[ 0.268238] This system BIOS has enabled interrupt remapping
on a chipset that contains an erratum making that
feature unstable. To maintain system stability
interrupt remapping is being disabled. Please
contact your BIOS vendor for an update
[ 0.298811] Modules linked in:
[ 0.302373] Pid: 1, comm: swapper/0 xid: #0 Not tainted 3.7.10-6 #1
[ 0.309453] Call Trace:
[ 0.312270] [<ffffffff8105182a>] warn_slowpath_common+0x7a/0xb0
[ 0.319061] [<ffffffff810518ba>] warn_slowpath_fmt_taint+0x3a/0x40
[ 0.326143] [<ffffffff818e8371>] intel_irq_remapping_supported+0x37/0x7a
[ 0.333810] [<ffffffff813b7226>] irq_remapping_supported+0x26/0x30
[ 0.340893] [<ffffffff818bd1be>] enable_IR+0x9/0x3e
[ 0.346521] [<ffffffff818bd558>] enable_IR_x2apic+0xa0/0x1e3
[ 0.353024] [<ffffffff814cfdc4>] ? set_cpu_sibling_map+0x415/0x435
[ 0.360108] [<ffffffff818bf47a>] default_setup_apic_routing+0x12/0x6b
[ 0.367483] [<ffffffff818bb30b>] native_smp_prepare_cpus+0x2e7/0x336
[ 0.374761] [<ffffffff818abcd5>] kernel_init_freeable+0x89/0x1c4
[ 0.381652] [<ffffffff814b9380>] ? rest_init+0x70/0x70
[ 0.387570] [<ffffffff814b9389>] kernel_init+0x9/0x100
[ 0.393489] [<ffffffff814e8d7c>] ret_from_fork+0x7c/0xb0
[ 0.399602] [<ffffffff814b9380>] ? rest_init+0x70/0x70
[ 0.405524] ---[ end trace bf40f410b44b3726 ]---
[ 0.410830] Switched APIC routing to physical flat.
[ 0.416872] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.456602] smpboot: CPU0: Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
(fam: 06, model: 1a, stepping: 04)
[ 0.573650] Performance Events: PEBS fmt1+, 16-deep LBR, Nehalem events,
Intel PMU driver.
[ 0.583265] perf_event_intel: CPU erratum AAJ80 worked around


--
Arkadiusz Miśkiewicz, arekm / maven.pl

2013-04-16 10:25:01

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Mon, Apr 15, 2013 at 06:41:17PM -0400, Neil Horman wrote:
> +#ifdef CONFIG_IRQ_REMAP
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> + u8 revision;
> +
> + revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
> +
> + /*
> + * Revision 0x13 of this chipset supports irq remapping
> + * but has an erratum that breaks its behavior, flag it as such
> + */
> + if (revision == 0x13)
> + irq_remap_broken = 1;
> +
> +}
> +#else

Any reason why you don't check this in the Intel IOMMU init code? You
would safe the ifdefs and you don't have to include
irq-remapping-internal header files somewhere else in the tree.


Joerg

2013-04-16 13:08:24

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Tue, Apr 16, 2013 at 12:24:54PM +0200, Joerg Roedel wrote:
> On Mon, Apr 15, 2013 at 06:41:17PM -0400, Neil Horman wrote:
> > +#ifdef CONFIG_IRQ_REMAP
> > +static void __init intel_remapping_check(int num, int slot, int func)
> > +{
> > + u8 revision;
> > +
> > + revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
> > +
> > + /*
> > + * Revision 0x13 of this chipset supports irq remapping
> > + * but has an erratum that breaks its behavior, flag it as such
> > + */
> > + if (revision == 0x13)
> > + irq_remap_broken = 1;
> > +
> > +}
> > +#else
>
> Any reason why you don't check this in the Intel IOMMU init code? You
> would safe the ifdefs and you don't have to include
> irq-remapping-internal header files somewhere else in the tree.
>
>
> Joerg
>
Mostly because we've spent so much time early in this thread talking about where
the quirk should go, that after this last revision, it didn't even occur to me
that, using this new approach, we don't even need a quirk anymore. That makes
way more sense to me though, I'll revise the patch again :(.

Neil

2013-04-16 13:36:07

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Tue, Apr 16, 2013 at 12:24:54PM +0200, Joerg Roedel wrote:
> On Mon, Apr 15, 2013 at 06:41:17PM -0400, Neil Horman wrote:
> > +#ifdef CONFIG_IRQ_REMAP
> > +static void __init intel_remapping_check(int num, int slot, int func)
> > +{
> > + u8 revision;
> > +
> > + revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
> > +
> > + /*
> > + * Revision 0x13 of this chipset supports irq remapping
> > + * but has an erratum that breaks its behavior, flag it as such
> > + */
> > + if (revision == 0x13)
> > + irq_remap_broken = 1;
> > +
> > +}
> > +#else
>
> Any reason why you don't check this in the Intel IOMMU init code? You
> would safe the ifdefs and you don't have to include
> irq-remapping-internal header files somewhere else in the tree.
>
>
> Joerg
>
>
>
Actually, hold on that last note, the intel iommu init code doesn't seem to
create any direct relationship between the set of iommu's and the pci_dev's that
implement them. In the intel_irq_remapping_supported path I can loop over each
dmar_dhrd_unit, and interrogate each of the devices on its **devices list to see
if the device/vendor and revision ids match, but looking at the dhrd parsing
code, I'm not sure the iommu pci_dev is always going to be on that list. That
seems like its going to be pretty ugly in and of itself. Do you have a
suggested way to identify the pci_dev of the device we need in that path without
having to simply iterate over every device in that scope?

Neil

2013-04-16 16:37:14

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Tue, Apr 16, 2013 at 09:35:56AM -0400, Neil Horman wrote:
> Actually, hold on that last note, the intel iommu init code doesn't seem to
> create any direct relationship between the set of iommu's and the pci_dev's that
> implement them. In the intel_irq_remapping_supported path I can loop over each
> dmar_dhrd_unit, and interrogate each of the devices on its **devices list to see
> if the device/vendor and revision ids match, but looking at the dhrd parsing
> code, I'm not sure the iommu pci_dev is always going to be on that list. That
> seems like its going to be pretty ugly in and of itself. Do you have a
> suggested way to identify the pci_dev of the device we need in that path without
> having to simply iterate over every device in that scope?

Hmkay, looks like this is a non-trivial problem. Here is what I suggest:
Keep the early-quirk as in your current patch. But add a function to
drivers/iommu/irq_remapping.c to disable irq-remapping and export that
function via the header-file arch/x86/include/asm/irq_remapping.h. Use
that function in the quirk instead of setting the disable-flag directly.
This way you don't have to include any private header file from iommu
code.


Joerg

2013-04-16 17:26:15

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v9] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Tue, Apr 16, 2013 at 06:37:05PM +0200, Joerg Roedel wrote:
> On Tue, Apr 16, 2013 at 09:35:56AM -0400, Neil Horman wrote:
> > Actually, hold on that last note, the intel iommu init code doesn't seem to
> > create any direct relationship between the set of iommu's and the pci_dev's that
> > implement them. In the intel_irq_remapping_supported path I can loop over each
> > dmar_dhrd_unit, and interrogate each of the devices on its **devices list to see
> > if the device/vendor and revision ids match, but looking at the dhrd parsing
> > code, I'm not sure the iommu pci_dev is always going to be on that list. That
> > seems like its going to be pretty ugly in and of itself. Do you have a
> > suggested way to identify the pci_dev of the device we need in that path without
> > having to simply iterate over every device in that scope?
>
> Hmkay, looks like this is a non-trivial problem. Here is what I suggest:
> Keep the early-quirk as in your current patch. But add a function to
> drivers/iommu/irq_remapping.c to disable irq-remapping and export that
> function via the header-file arch/x86/include/asm/irq_remapping.h. Use
> that function in the quirk instead of setting the disable-flag directly.
> This way you don't have to include any private header file from iommu
> code.
>
Ok, that seems reasonable. I'll have a new patch in a day or so.

Thanks!
Neil

2013-04-16 20:38:51

by Neil Horman

[permalink] [raw]
Subject: [PATCH v10] irq: add quirk for broken interrupt remapping on 55XX chipsets

A few years back intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

For the 5520 and 5500 chipsets which contained an errata (specificially errata
53), which noted that these chipsets can't properly do interrupt remapping, and
as a result the recommend that interrupt remapping be disabled in bios. While
many vendors have a bios update to do exactly that, not all do, and of course
not all users update their bios to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a cpu even after affinity for that
interrupt has be moved, leading to lost or spurrious interrupts (usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have system for which their BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnurable to this problem. For
details of those that reported the problem, please see:
https://bugzilla.redhat.com/show_bug.cgi?id=887006

Signed-off-by: Neil Horman <[email protected]>
CC: Prarit Bhargava <[email protected]>
CC: Don Zickus <[email protected]>
CC: Don Dutile <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: Asit Mallick <[email protected]>
CC: David Woodhouse <[email protected]>
CC: [email protected]
CC: Joerg Roedel <[email protected]>
CC: Konrad Rzeszutek Wilk <[email protected]>
CC: Arkadiusz Miśkiewicz <[email protected]>
---

Change notes:

v2)

* Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
chipset series is x86 only. I decided however to keep the quirk as a regular
quirk, not an early_quirk. Early quirks have no way currently to determine if
BIOS has properly disabled the feature in the iommu, at least not without
significant hacking, and since its quite possible this will be a short lived
quirk, should Don Z's workaround code prove successful (and it looks like it may
well), I don't think that necessecary.

* Removed the WARNING banner from the quirk, and added the HW_ERR token to the
string, I opted to leave the newlines in place however, as I really couldnt
find a way to keep the text on a single line is still legible from a code
perspective. I think theres enough language in there that using cscope on just
about any substring however will turn it up, and again, this may be a short
lived quirk.

v3)

* Removed defines from pci_ids.h, and used direct id values as per request from
Bjorn.

v4)

* Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
Woodhouse

v5)

* Moved check to an early quirk, and flagged the broken chip, so we could
reasonably disable irq remapping during bootup.

v6)

* Clean up of stupid extra thrash in quirks.c

v7)

* Move broken check to intel_irq_remapping.c
* Fixed another typo
* Finally made the reference bugzilla public

v8)

* Removed extraneous code from irq_remapping_enabled

v9)

* Fix stupid build break from rushing to shuffle simmilar header files about
Thanks to Arkadiusz Miśkiewicz for pointing it out

v10)

* Rewrite to hide the irq_remap_broken variable so we don't need to pull in a
private header file
---
arch/x86/include/asm/irq_remapping.h | 1 +
arch/x86/kernel/early-quirks.c | 26 ++++++++++++++++++++++++++
drivers/iommu/intel_irq_remapping.c | 10 ++++++++++
drivers/iommu/irq_remapping.c | 6 ++++++
drivers/iommu/irq_remapping.h | 2 ++
5 files changed, 45 insertions(+)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 95fd352..d740cb4 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -28,6 +28,7 @@

extern void setup_irq_remapping_ops(void);
extern int irq_remapping_supported(void);
+extern void set_irq_remapping_broken(void);
extern int irq_remapping_prepare(void);
extern int irq_remapping_enable(void);
extern void irq_remapping_disable(void);
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 3755ef4..589092d 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -18,6 +18,7 @@
#include <asm/apic.h>
#include <asm/iommu.h>
#include <asm/gart.h>
+#include <asm/irq_remapping.h>

static void __init fix_hypertransport_config(int num, int slot, int func)
{
@@ -192,6 +193,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
}
#endif

+#ifdef CONFIG_IRQ_REMAP
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+ u8 revision;
+
+ revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
+
+ /*
+ * Revision 0x13 of this chipset supports irq remapping
+ * but has an erratum that breaks its behavior, flag it as such
+ */
+ if (revision == 0x13)
+ set_irq_remapping_broken();
+
+}
+#else
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+}
+#endif
+
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -221,6 +243,10 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+ { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
{}
};

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index f3b8f23..5b19b2d 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -524,6 +524,16 @@ static int __init intel_irq_remapping_supported(void)

if (disable_irq_remap)
return 0;
+ if (irq_remap_broken) {
+ WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
+ "This system BIOS has enabled interrupt remapping\n"
+ "on a chipset that contains an erratum making that\n"
+ "feature unstable. To maintain system stability\n"
+ "interrupt remapping is being disabled. Please\n"
+ "contact your BIOS vendor for an update\n");
+ disable_irq_remap = 1;
+ return 0;
+ }

if (!dmar_ir_support())
return 0;
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index d56f8c1..3c11043 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -19,6 +19,7 @@
int irq_remapping_enabled;

int disable_irq_remap;
+int irq_remap_broken;
int disable_sourceid_checking;
int no_x2apic_optout;

@@ -211,6 +212,11 @@ void __init setup_irq_remapping_ops(void)
#endif
}

+void set_irq_remapping_broken(void)
+{
+ irq_remap_broken = 1;
+}
+
int irq_remapping_supported(void)
{
if (disable_irq_remap)
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index ecb6376..90c4dae 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -32,6 +32,7 @@ struct pci_dev;
struct msi_msg;

extern int disable_irq_remap;
+extern int irq_remap_broken;
extern int disable_sourceid_checking;
extern int no_x2apic_optout;
extern int irq_remapping_enabled;
@@ -89,6 +90,7 @@ extern struct irq_remap_ops amd_iommu_irq_ops;

#define irq_remapping_enabled 0
#define disable_irq_remap 1
+#define irq_remap_broken 0

#endif /* CONFIG_IRQ_REMAP */

--
1.8.1.4

2013-04-16 22:08:32

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH v10] irq: add quirk for broken interrupt remapping on 55XX chipsets

On 04/16/2013 04:38 PM, Neil Horman wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem. For
> details of those that reported the problem, please see:
> https://bugzilla.redhat.com/show_bug.cgi?id=887006
>
> Signed-off-by: Neil Horman<[email protected]>
> CC: Prarit Bhargava<[email protected]>
> CC: Don Zickus<[email protected]>
> CC: Don Dutile<[email protected]>
> CC: Bjorn Helgaas<[email protected]>
> CC: Asit Mallick<[email protected]>
> CC: David Woodhouse<[email protected]>
> CC: [email protected]
> CC: Joerg Roedel<[email protected]>
> CC: Konrad Rzeszutek Wilk<[email protected]>
> CC: Arkadiusz Miśkiewicz<[email protected]>
> ---
>
> Change notes:
>
> v2)
>
> * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX
> chipset series is x86 only. I decided however to keep the quirk as a regular
> quirk, not an early_quirk. Early quirks have no way currently to determine if
> BIOS has properly disabled the feature in the iommu, at least not without
> significant hacking, and since its quite possible this will be a short lived
> quirk, should Don Z's workaround code prove successful (and it looks like it may
> well), I don't think that necessecary.
>
> * Removed the WARNING banner from the quirk, and added the HW_ERR token to the
> string, I opted to leave the newlines in place however, as I really couldnt
> find a way to keep the text on a single line is still legible from a code
> perspective. I think theres enough language in there that using cscope on just
> about any substring however will turn it up, and again, this may be a short
> lived quirk.
>
> v3)
>
> * Removed defines from pci_ids.h, and used direct id values as per request from
> Bjorn.
>
> v4)
>
> * Converted pr_warn to WARN_TAINT(TAINT_FIRMWARE_WORKAROUND) as per David
> Woodhouse
>
> v5)
>
> * Moved check to an early quirk, and flagged the broken chip, so we could
> reasonably disable irq remapping during bootup.
>
> v6)
>
> * Clean up of stupid extra thrash in quirks.c
>
> v7)
>
> * Move broken check to intel_irq_remapping.c
> * Fixed another typo
> * Finally made the reference bugzilla public
>
> v8)
>
> * Removed extraneous code from irq_remapping_enabled
>
> v9)
>
> * Fix stupid build break from rushing to shuffle simmilar header files about
> Thanks to Arkadiusz Miśkiewicz for pointing it out
>
> v10)
>
> * Rewrite to hide the irq_remap_broken variable so we don't need to pull in a
> private header file
/me waiting for v11...
/me runs! ;-)

Neil,
Looks good, and kudos on your tenacity through all the revs.
- Don(Dutile)

> ---
> arch/x86/include/asm/irq_remapping.h | 1 +
> arch/x86/kernel/early-quirks.c | 26 ++++++++++++++++++++++++++
> drivers/iommu/intel_irq_remapping.c | 10 ++++++++++
> drivers/iommu/irq_remapping.c | 6 ++++++
> drivers/iommu/irq_remapping.h | 2 ++
> 5 files changed, 45 insertions(+)
>
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index 95fd352..d740cb4 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -28,6 +28,7 @@
>
> extern void setup_irq_remapping_ops(void);
> extern int irq_remapping_supported(void);
> +extern void set_irq_remapping_broken(void);
> extern int irq_remapping_prepare(void);
> extern int irq_remapping_enable(void);
> extern void irq_remapping_disable(void);
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 3755ef4..589092d 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -18,6 +18,7 @@
> #include<asm/apic.h>
> #include<asm/iommu.h>
> #include<asm/gart.h>
> +#include<asm/irq_remapping.h>
>
> static void __init fix_hypertransport_config(int num, int slot, int func)
> {
> @@ -192,6 +193,27 @@ static void __init ati_bugs_contd(int num, int slot, int func)
> }
> #endif
>
> +#ifdef CONFIG_IRQ_REMAP
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> + u8 revision;
> +
> + revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
> +
> + /*
> + * Revision 0x13 of this chipset supports irq remapping
> + * but has an erratum that breaks its behavior, flag it as such
> + */
> + if (revision == 0x13)
> + set_irq_remapping_broken();
> +
> +}
> +#else
> +static void __init intel_remapping_check(int num, int slot, int func)
> +{
> +}
> +#endif
> +
> #define QFLAG_APPLY_ONCE 0x1
> #define QFLAG_APPLIED 0x2
> #define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
> @@ -221,6 +243,10 @@ static struct chipset early_qrk[] __initdata = {
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
> { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
> PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
> + { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> + { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
> + PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
> {}
> };
>
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index f3b8f23..5b19b2d 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -524,6 +524,16 @@ static int __init intel_irq_remapping_supported(void)
>
> if (disable_irq_remap)
> return 0;
> + if (irq_remap_broken) {
> + WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
> + "This system BIOS has enabled interrupt remapping\n"
> + "on a chipset that contains an erratum making that\n"
> + "feature unstable. To maintain system stability\n"
> + "interrupt remapping is being disabled. Please\n"
> + "contact your BIOS vendor for an update\n");
> + disable_irq_remap = 1;
> + return 0;
> + }
>
> if (!dmar_ir_support())
> return 0;
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index d56f8c1..3c11043 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -19,6 +19,7 @@
> int irq_remapping_enabled;
>
> int disable_irq_remap;
> +int irq_remap_broken;
> int disable_sourceid_checking;
> int no_x2apic_optout;
>
> @@ -211,6 +212,11 @@ void __init setup_irq_remapping_ops(void)
> #endif
> }
>
> +void set_irq_remapping_broken(void)
> +{
> + irq_remap_broken = 1;
> +}
> +
> int irq_remapping_supported(void)
> {
> if (disable_irq_remap)
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index ecb6376..90c4dae 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -32,6 +32,7 @@ struct pci_dev;
> struct msi_msg;
>
> extern int disable_irq_remap;
> +extern int irq_remap_broken;
> extern int disable_sourceid_checking;
> extern int no_x2apic_optout;
> extern int irq_remapping_enabled;
> @@ -89,6 +90,7 @@ extern struct irq_remap_ops amd_iommu_irq_ops;
>
> #define irq_remapping_enabled 0
> #define disable_irq_remap 1
> +#define irq_remap_broken 0
>
> #endif /* CONFIG_IRQ_REMAP */
>

2013-04-18 15:02:46

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v10] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Tue, Apr 16, 2013 at 04:38:32PM -0400, Neil Horman wrote:
> A few years back intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
>
> For the 5520 and 5500 chipsets which contained an errata (specificially errata
> 53), which noted that these chipsets can't properly do interrupt remapping, and
> as a result the recommend that interrupt remapping be disabled in bios. While
> many vendors have a bios update to do exactly that, not all do, and of course
> not all users update their bios to a level that corrects the problem. As a
> result, occasionally interrupts can arrive at a cpu even after affinity for that
> interrupt has be moved, leading to lost or spurrious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error, and
> investigation has shown that they have system for which their BIOS level is such
> that this feature was not properly turned off. As such, it would be good to
> give them a reminder that their systems are vulnurable to this problem. For
> details of those that reported the problem, please see:
> https://bugzilla.redhat.com/show_bug.cgi?id=887006
>
> Signed-off-by: Neil Horman <[email protected]>
> CC: Prarit Bhargava <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Don Dutile <[email protected]>
> CC: Bjorn Helgaas <[email protected]>
> CC: Asit Mallick <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: [email protected]
> CC: Joerg Roedel <[email protected]>
> CC: Konrad Rzeszutek Wilk <[email protected]>
> CC: Arkadiusz Miśkiewicz <[email protected]>

Applied with some small changes to my x86/vt-d branch, thanks Neil.

2013-04-18 17:00:51

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH v10] irq: add quirk for broken interrupt remapping on 55XX chipsets

On Thu, Apr 18, 2013 at 05:02:39PM +0200, Joerg Roedel wrote:
> On Tue, Apr 16, 2013 at 04:38:32PM -0400, Neil Horman wrote:
> > A few years back intel published a spec update:
> > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> >
> > For the 5520 and 5500 chipsets which contained an errata (specificially errata
> > 53), which noted that these chipsets can't properly do interrupt remapping, and
> > as a result the recommend that interrupt remapping be disabled in bios. While
> > many vendors have a bios update to do exactly that, not all do, and of course
> > not all users update their bios to a level that corrects the problem. As a
> > result, occasionally interrupts can arrive at a cpu even after affinity for that
> > interrupt has be moved, leading to lost or spurrious interrupts (usually
> > characterized by the message:
> > kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
> >
> > There have been several incidents recently of people seeing this error, and
> > investigation has shown that they have system for which their BIOS level is such
> > that this feature was not properly turned off. As such, it would be good to
> > give them a reminder that their systems are vulnurable to this problem. For
> > details of those that reported the problem, please see:
> > https://bugzilla.redhat.com/show_bug.cgi?id=887006
> >
> > Signed-off-by: Neil Horman <[email protected]>
> > CC: Prarit Bhargava <[email protected]>
> > CC: Don Zickus <[email protected]>
> > CC: Don Dutile <[email protected]>
> > CC: Bjorn Helgaas <[email protected]>
> > CC: Asit Mallick <[email protected]>
> > CC: David Woodhouse <[email protected]>
> > CC: [email protected]
> > CC: Joerg Roedel <[email protected]>
> > CC: Konrad Rzeszutek Wilk <[email protected]>
> > CC: Arkadiusz Miśkiewicz <[email protected]>
>
> Applied with some small changes to my x86/vt-d branch, thanks Neil.
>
>
>
Thanks!
Neil