2010-12-01 06:56:05

by Suresh Siddha

Subject: [patch 1/4] vt-d: quirk for masking vtd spec errors to platform error handling logic

On platforms with the Intel 7500 chipset, there were reports of system
hangs/NMIs during kexec/kdump when interrupt-remapping is enabled.

During kdump, there is a window in which devices may still be using the old
kernel's interrupt information while the kdump kernel is coming up. This can
cause vt-d faults, as the interrupt configuration from the old kernel maps to
null IRTE entries in the new kernel. (Without interrupt-remapping enabled,
we still have the same issue, but in that case we only see benign spurious
interrupts hitting the new kernel.)

Based on platform config settings, these platforms seem to generate an NMI/SMI
when a vt-d fault happens, and there were reports that the resulting SMI causes
the system to hang.

Fix it by masking vt-d spec defined errors to platform error reporting logic.
VT-d spec related errors are already handled by the VT-d OS code, so no need to
report the same erorr through other channels.

Signed-off-by: Suresh Siddha <[email protected]>
Cc: [email protected] [v2.6.32+]
---
drivers/pci/quirks.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

Index: tip/drivers/pci/quirks.c
===================================================================
--- tip.orig/drivers/pci/quirks.c
+++ tip/drivers/pci/quirks.c
@@ -2764,6 +2764,26 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_RI
DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_VENDOR_ID_RICOH, PCI_DEVICE_ID_RICOH_R5C832, ricoh_mmc_fixup_r5c832);
#endif /*CONFIG_MMC_RICOH_MMC*/

+#if defined(CONFIG_DMAR) || defined(CONFIG_INTR_REMAP)
+/*
+ * This is a quirk for masking vt-d spec defined errors to platform error
+ * handling logic. With out this, platforms seem to generate NMI/SMI (based
+ * on the RAS config settings of the platform) when a vt-d fault happens and
+ * there were reports that the resulting SMI causes system to hang.
+ *
+ * VT-d spec related errors are already handled by the VT-d OS code, so no
+ * need to report the same erorr through other channels.
+ */
+static void vtd_mask_spec_errors(struct pci_dev *dev)
+{
+ u32 word;
+
+ pci_read_config_dword(dev, 0x1AC, &word);
+ pci_write_config_dword(dev, 0x1AC, word | (1 << 31));
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x342e, vtd_mask_spec_errors);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3c28, vtd_mask_spec_errors);
+#endif

static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f,
struct pci_fixup *end)
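
The quirk above is a plain read-modify-write of one config-space dword. As a minimal user-space sketch of that pattern (not kernel code), the block below replaces the real PCI accessors with hypothetical `mock_*` helpers that operate on a flat buffer. The `mock_pci_dev` type, the helper names, and the macro names are assumptions made for illustration; only the offset 0x1AC and bit 31 come from the patch. The mask is written as `1u << 31` here so the shift does not touch the sign bit of a signed int.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CFG_SPACE_SIZE      4096        /* PCIe extended config space, in bytes */
#define VTUNCERRMSK_REG     0x1ac       /* register offset used by the patch */
#define VTD_MSK_SPEC_ERRORS (1u << 31)  /* bit 31, as an unsigned constant */

/* Hypothetical stand-in for a device: just a flat config-space buffer. */
struct mock_pci_dev {
	uint8_t cfg[CFG_SPACE_SIZE];
};

/* Read one aligned dword from the mock config space. */
static void mock_read_config_dword(const struct mock_pci_dev *dev,
				   unsigned int off, uint32_t *val)
{
	assert(off % 4 == 0 && off + 4 <= CFG_SPACE_SIZE);
	memcpy(val, &dev->cfg[off], sizeof(*val));
}

/* Write one aligned dword to the mock config space. */
static void mock_write_config_dword(struct mock_pci_dev *dev,
				    unsigned int off, uint32_t val)
{
	assert(off % 4 == 0 && off + 4 <= CFG_SPACE_SIZE);
	memcpy(&dev->cfg[off], &val, sizeof(val));
}

/* Same read-modify-write shape as vtd_mask_spec_errors() in the patch:
 * set the mask bit and leave every other bit of the register untouched. */
static void mock_vtd_mask_spec_errors(struct mock_pci_dev *dev)
{
	uint32_t word;

	mock_read_config_dword(dev, VTUNCERRMSK_REG, &word);
	mock_write_config_dword(dev, VTUNCERRMSK_REG, word | VTD_MSK_SPEC_ERRORS);
}
```

Calling `mock_vtd_mask_spec_errors()` on a device whose register already holds other bits keeps those bits and only adds bit 31, which is why the quirk reads the register first instead of writing a constant.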


2010-12-01 07:28:55

by Chris Wright

Subject: Re: [patch 1/4] vt-d: quirk for masking vtd spec errors to platform error handling logic

* Suresh Siddha ([email protected]) wrote:
> On platforms with Intel 7500 chipset, there were some reports of system
> hang/NMI's during kexec/kdump in the presence of interrupt-remapping enabled.
>
> During kdump, there is a window where the devices might be still using old
> kernel's interrupt information, while the kdump kernel is coming up. This can
> cause vt-d faults as the interrupt configuration from the old kernel map to
> null IRTE entries in the new kernel etc. (with out interrupt-remapping enabled,
> we still have the same issue but in this case we will see benign spurious
> interrupt hit the new kernel).
>
> Based on platform config settings, these platforms seem to generate NMI/SMI
> when a vt-d fault happens and there were reports that the resulting SMI causes
> the system to hang.
>
> Fix it by masking vt-d spec defined errors to platform error reporting logic.
> VT-d spec related errors are already handled by the VT-d OS code, so need to
> report the same erorr through other channels.
>
> Signed-off-by: Suresh Siddha <[email protected]>

Acked-by: Chris Wright <[email protected]>

2010-12-06 17:34:35

by Jesse Barnes

Subject: Re: [patch 1/4] vt-d: quirk for masking vtd spec errors to platform error handling logic

On Tue, 30 Nov 2010 22:22:26 -0800
Suresh Siddha <[email protected]> wrote:

> On platforms with Intel 7500 chipset, there were some reports of system
> hang/NMI's during kexec/kdump in the presence of interrupt-remapping enabled.
>
> During kdump, there is a window where the devices might be still using old
> kernel's interrupt information, while the kdump kernel is coming up. This can
> cause vt-d faults as the interrupt configuration from the old kernel map to
> null IRTE entries in the new kernel etc. (with out interrupt-remapping enabled,
> we still have the same issue but in this case we will see benign spurious
> interrupt hit the new kernel).
>
> Based on platform config settings, these platforms seem to generate NMI/SMI
> when a vt-d fault happens and there were reports that the resulting SMI causes
> the system to hang.
>
> Fix it by masking vt-d spec defined errors to platform error reporting logic.
> VT-d spec related errors are already handled by the VT-d OS code, so need to
> report the same erorr through other channels.
>
> Signed-off-by: Suresh Siddha <[email protected]>
> Cc: [email protected] [v2.6.32+]
> ---
> drivers/pci/quirks.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> Index: tip/drivers/pci/quirks.c
> ===================================================================
> --- tip.orig/drivers/pci/quirks.c
> +++ tip/drivers/pci/quirks.c
> @@ -2764,6 +2764,26 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_RI
> DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_VENDOR_ID_RICOH, PCI_DEVICE_ID_RICOH_R5C832, ricoh_mmc_fixup_r5c832);
> #endif /*CONFIG_MMC_RICOH_MMC*/
>
> +#if defined(CONFIG_DMAR) || defined(CONFIG_INTR_REMAP)
> +/*
> + * This is a quirk for masking vt-d spec defined errors to platform error
> + * handling logic. With out this, platforms seem to generate NMI/SMI (based
> + * on the RAS config settings of the platform) when a vt-d fault happens and
> + * there were reports that the resulting SMI causes system to hang.
> + *
> + * VT-d spec related errors are already handled by the VT-d OS code, so no
> + * need to report the same erorr through other channels.
> + */
> +static void vtd_mask_spec_errors(struct pci_dev *dev)
> +{
> + u32 word;
> +
> + pci_read_config_dword(dev, 0x1AC, &word);
> + pci_write_config_dword(dev, 0x1AC, word | (1 << 31));
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x342e, vtd_mask_spec_errors);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3c28, vtd_mask_spec_errors);
> +#endif
>
> static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f,
> struct pci_fixup *end)

Can we make these registers and bits a bit more self-documenting (i.e.
#defines for both, maybe along with other useful bit definitions for
this reg)? Also, "error" is misspelled as "erorr" above. :)

--
Jesse Barnes, Intel Open Source Technology Center

2010-12-06 20:26:24

by Suresh Siddha

Subject: Re: [patch 1/4] vt-d: quirk for masking vtd spec errors to platform error handling logic

On Mon, 2010-12-06 at 09:27 -0800, Jesse Barnes wrote:
> Can we make these registers and bits a bit more self-documenting (i.e.
> #defines for both, maybe along with other useful bit definitions for
> this reg)? Also, "error" is misspelled as "erorr" above. :)

Thanks for the review. Appended the updated patch. I haven't used
#defines for the pci-id's, as the first one (IOH) is used by several
chipsets and the second one is not named yet.

---

From: Suresh Siddha <[email protected]>
Subject: vt-d: quirk for masking vtd spec errors to platform error handling logic

On platforms with the Intel 7500 chipset, there were reports of system
hangs/NMIs during kexec/kdump when interrupt-remapping is enabled.

During kdump, there is a window in which devices may still be using the old
kernel's interrupt information while the kdump kernel is coming up. This can
cause vt-d faults, as the interrupt configuration from the old kernel maps to
null IRTE entries in the new kernel. (Without interrupt-remapping enabled,
we still have the same issue, but in that case we only see benign spurious
interrupts hitting the new kernel.)

Based on platform config settings, these platforms seem to generate an NMI/SMI
when a vt-d fault happens, and there were reports that the resulting SMI causes
the system to hang.

Fix it by masking vt-d spec defined errors to platform error reporting logic.
VT-d spec related errors are already handled by the VT-d OS code, so no need to
report the same error through other channels.

Signed-off-by: Suresh Siddha <[email protected]>
Cc: [email protected] [v2.6.32+]
---
drivers/pci/quirks.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

Index: tip/drivers/pci/quirks.c
===================================================================
--- tip.orig/drivers/pci/quirks.c
+++ tip/drivers/pci/quirks.c
@@ -2764,6 +2764,29 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_RI
DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_VENDOR_ID_RICOH, PCI_DEVICE_ID_RICOH_R5C832, ricoh_mmc_fixup_r5c832);
#endif /*CONFIG_MMC_RICOH_MMC*/

+#if defined(CONFIG_DMAR) || defined(CONFIG_INTR_REMAP)
+#define VTUNCERRMSK_REG 0x1ac
+#define VTD_MSK_SPEC_ERRORS (1 << 31)
+/*
+ * This is a quirk for masking vt-d spec defined errors to platform error
+ * handling logic. With out this, platforms using Intel 7500, 5500 chipsets
+ * (and the derivative chipsets like X58 etc) seem to generate NMI/SMI (based
+ * on the RAS config settings of the platform) when a vt-d fault happens.
+ * The resulting SMI caused the system to hang.
+ *
+ * VT-d spec related errors are already handled by the VT-d OS code, so no
+ * need to report the same error through other channels.
+ */
+static void vtd_mask_spec_errors(struct pci_dev *dev)
+{
+ u32 word;
+
+ pci_read_config_dword(dev, VTUNCERRMSK_REG, &word);
+ pci_write_config_dword(dev, VTUNCERRMSK_REG, word | VTD_MSK_SPEC_ERRORS);
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x342e, vtd_mask_spec_errors);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3c28, vtd_mask_spec_errors);
+#endif

static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f,
struct pci_fixup *end)
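
The two DECLARE_PCI_FIXUP_EARLY lines register the quirk against specific vendor/device IDs, and a loop such as pci_do_fixups() walks a table of those registrations. The following is a simplified sketch of that table-and-dispatch idea in plain C, not the kernel's implementation: the `toy_*` types and functions are hypothetical, the real kernel places entries in a linker section via the macro, and wildcard matching (PCI_ANY_ID) is left out. The device IDs 0x342e and 0x3c28 are the ones in the patch; 0x8086 is Intel's PCI vendor ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_INTEL 0x8086	/* Intel's PCI vendor ID */

/* Hypothetical, stripped-down device: IDs plus a counter for the demo. */
struct toy_dev {
	uint16_t vendor;
	uint16_t device;
	int quirk_hits;		/* how many times a quirk hook ran */
};

/* One fixup registration: which device it applies to and what to run. */
struct toy_fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(struct toy_dev *dev);
};

/* Stand-in for vtd_mask_spec_errors(); it only records that it ran. */
static void toy_mask_spec_errors(struct toy_dev *dev)
{
	dev->quirk_hits++;
}

/* Explicit table in place of the linker-section entries the macro creates. */
static const struct toy_fixup toy_early_fixups[] = {
	{ VENDOR_INTEL, 0x342e, toy_mask_spec_errors },
	{ VENDOR_INTEL, 0x3c28, toy_mask_spec_errors },
};

/* Run every hook in [f, end) whose vendor and device IDs match the device. */
static void toy_do_fixups(struct toy_dev *dev, const struct toy_fixup *f,
			  const struct toy_fixup *end)
{
	for (; f < end; f++) {
		assert(f->hook != NULL);
		if (f->vendor == dev->vendor && f->device == dev->device)
			f->hook(dev);
	}
}
```

With this shape, a device with ID 0x342e gets the hook exactly once and any other Intel device is skipped, which is why the patch can list raw device IDs without touching unrelated hardware.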

2010-12-06 20:45:14

by Jesse Barnes

Subject: Re: [patch 1/4] vt-d: quirk for masking vtd spec errors to platform error handling logic

On Mon, 06 Dec 2010 12:26:30 -0800
Suresh Siddha <[email protected]> wrote:

> On Mon, 2010-12-06 at 09:27 -0800, Jesse Barnes wrote:
> > Can we make these registers and bits a bit more self-documenting (i.e.
> > #defines for both, maybe along with other useful bit definitions for
> > this reg)? Also, "error" is misspelled as "erorr" above. :)
>
> Thanks for the review. Appended the updated patch. I haven't used
> #defines for the pci-id's, as the first one (IOH) is used by several
> chipsets and the second one is not named yet.

Is there a bug # that should be referenced in the commit log? Any
tested-bys to add?

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center

2010-12-06 21:02:07

by Suresh Siddha

Subject: Re: [patch 1/4] vt-d: quirk for masking vtd spec errors to platform error handling logic

On Mon, 2010-12-06 at 12:44 -0800, Jesse Barnes wrote:
> On Mon, 06 Dec 2010 12:26:30 -0800
> Suresh Siddha <[email protected]> wrote:
>
> > On Mon, 2010-12-06 at 09:27 -0800, Jesse Barnes wrote:
> > > Can we make these registers and bits a bit more self-documenting (i.e.
> > > #defines for both, maybe along with other useful bit definitions for
> > > this reg)? Also, "error" is misspelled as "erorr" above. :)
> >
> > Thanks for the review. Appended the updated patch. I haven't used
> > #defines for the pci-id's, as the first one (IOH) is used by several
> > chipsets and the second one is not named yet.
>
> Is there a bug # that should be referenced in the commit log? Any
> tested-bys to add?

There is no kernel.org bug #, but there are multiple bugs filed with
different OSVs, hence I didn't mention a bug # in the commit log.

Please add:

Reported-by: Max Asbock <[email protected]>
Reported-and-tested-by: Takao Indoh <[email protected]>
Acked-by: Chris Wright <[email protected]>
Acked-by: Kenji Kaneshige <[email protected]>

thanks,
suresh

2010-12-06 23:01:41

by Max Asbock

Subject: Re: [patch 1/4] vt-d: quirk for masking vtd spec errors to platform error handling logic

On Mon, 2010-12-06 at 13:02 -0800, Suresh Siddha wrote:
> On Mon, 2010-12-06 at 12:44 -0800, Jesse Barnes wrote:
> > On Mon, 06 Dec 2010 12:26:30 -0800
> > Suresh Siddha <[email protected]> wrote:
> >
> > > On Mon, 2010-12-06 at 09:27 -0800, Jesse Barnes wrote:
> > > > Can we make these registers and bits a bit more self-documenting (i.e.
> > > > #defines for both, maybe along with other useful bit definitions for
> > > > this reg)? Also, "error" is misspelled as "erorr" above. :)
> > >
> > > Thanks for the review. Appended the updated patch. I haven't used
> > > #defines for the pci-id's, as the first one (IOH) is used by several
> > > chipsets and the second one is not named yet.
> >
> > Is there a bug # that should be referenced in the commit log? Any
> > tested-bys to add?
>
> There is no kernel.org bug# but there are multiple bugs with different
> OSV's. And hence didn't care to mention to the bug #
>
> Please add:
>
> Reported-by: Max Asbock <[email protected]>
> Reported-and-tested-by: Takao Indoh <[email protected]>
> Acked-by: Chris Wright <[email protected]>
> Acked-by: Kenji Kaneshige <[email protected]>
>

I tested the patches on a system with a Tylersburg chipset. I applied the
patches to the 2.6.37-rc4 kernel and tested kdump. I still see the
VT-d errors, but they no longer cause NMIs. It works as expected.

- Max
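
The behavior reported above (VT-d faults still logged, but no NMIs) can be pictured with a toy model. The sketch below is an illustration of what the thread describes, not of the chipset's actual logic: every name in it is hypothetical, the interrupt remapping table is reduced to an array of "present" flags, and the mask is a single boolean standing in for the bit the quirk sets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_IRT_ENTRIES 8	/* arbitrary table size for the model */

/* Hypothetical model of the pieces the thread talks about. */
struct toy_iommu {
	bool irte_present[TOY_IRT_ENTRIES]; /* entries set up by the running kernel */
	bool spec_errors_masked;            /* stands in for the quirk's mask bit */
	unsigned int faults_logged;         /* faults the VT-d OS code would see */
	unsigned int platform_events;       /* events that could become NMI/SMI */
};

/*
 * Model one interrupt request that uses remapping index `index`.
 * Returns true if the request hits a present entry and is delivered.
 * A miss (e.g. a stale index left over from the old kernel) is always
 * logged as a fault, but is forwarded to the platform error logic only
 * while the mask is clear.
 */
static bool toy_remap_interrupt(struct toy_iommu *iommu, size_t index)
{
	assert(index < TOY_IRT_ENTRIES);

	if (iommu->irte_present[index])
		return true;

	iommu->faults_logged++;
	if (!iommu->spec_errors_masked)
		iommu->platform_events++;
	return false;
}
```

In this model a stale index still increments `faults_logged` after the mask is set, while `platform_events` stops growing, which is consistent with the test result reported in this message.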

2010-12-14 01:16:22

by Suresh Siddha

Subject: [tip:x86/urgent] x86, vt-d: Quirk for masking vtd spec errors to platform error handling logic

Commit-ID: 254e42006c893f45bca48f313536fcba12206418
Gitweb: http://git.kernel.org/tip/254e42006c893f45bca48f313536fcba12206418
Author: Suresh Siddha <[email protected]>
AuthorDate: Mon, 6 Dec 2010 12:26:30 -0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 13 Dec 2010 16:51:51 -0800

x86, vt-d: Quirk for masking vtd spec errors to platform error handling logic

On platforms with the Intel 7500 chipset, there were reports of system
hangs/NMIs during kexec/kdump when interrupt-remapping is enabled.

During kdump, there is a window in which devices may still be using the old
kernel's interrupt information while the kdump kernel is coming up. This can
cause vt-d faults, as the interrupt configuration from the old kernel maps to
null IRTE entries in the new kernel. (Without interrupt-remapping enabled,
we still have the same issue, but in that case we only see benign spurious
interrupts hitting the new kernel.)

Based on platform config settings, these platforms seem to generate an NMI/SMI
when a vt-d fault happens, and there were reports that the resulting SMI causes
the system to hang.

Fix it by masking vt-d spec defined errors to platform error reporting logic.
VT-d spec related errors are already handled by the VT-d OS code, so no need to
report the same error through other channels.

Signed-off-by: Suresh Siddha <[email protected]>
LKML-Reference: <[email protected]>
Cc: [email protected] [v2.6.32+]
Reported-by: Max Asbock <[email protected]>
Reported-and-tested-by: Takao Indoh <[email protected]>
Acked-by: Chris Wright <[email protected]>
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
drivers/pci/quirks.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6f9350c..36191ed 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2764,6 +2764,29 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_RICOH, PCI_DEVICE_ID_RICOH_R5C832, ricoh_m
DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_VENDOR_ID_RICOH, PCI_DEVICE_ID_RICOH_R5C832, ricoh_mmc_fixup_r5c832);
#endif /*CONFIG_MMC_RICOH_MMC*/

+#if defined(CONFIG_DMAR) || defined(CONFIG_INTR_REMAP)
+#define VTUNCERRMSK_REG 0x1ac
+#define VTD_MSK_SPEC_ERRORS (1 << 31)
+/*
+ * This is a quirk for masking vt-d spec defined errors to platform error
+ * handling logic. With out this, platforms using Intel 7500, 5500 chipsets
+ * (and the derivative chipsets like X58 etc) seem to generate NMI/SMI (based
+ * on the RAS config settings of the platform) when a vt-d fault happens.
+ * The resulting SMI caused the system to hang.
+ *
+ * VT-d spec related errors are already handled by the VT-d OS code, so no
+ * need to report the same error through other channels.
+ */
+static void vtd_mask_spec_errors(struct pci_dev *dev)
+{
+ u32 word;
+
+ pci_read_config_dword(dev, VTUNCERRMSK_REG, &word);
+ pci_write_config_dword(dev, VTUNCERRMSK_REG, word | VTD_MSK_SPEC_ERRORS);
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x342e, vtd_mask_spec_errors);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x3c28, vtd_mask_spec_errors);
+#endif

static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f,
struct pci_fixup *end)