2008-08-03 18:02:28

by James Bottomley

[permalink] [raw]
Subject: [PATCH 1/2] pci: add misrouted interrupt error handling

We're getting a lot of storage drivers blamed for interrupt misrouting
issues. This patch provides a standard way of reporting the problem
... and, if possible, correcting it.

Signed-off-by: James Bottomley <[email protected]>
---
drivers/pci/Makefile | 3 +-
drivers/pci/irq.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 7 +++++
3 files changed, 69 insertions(+), 1 deletions(-)
create mode 100644 drivers/pci/irq.c

diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 7d63f8c..19dacb8 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -3,7 +3,8 @@
#

obj-y += access.o bus.o probe.o remove.o pci.o quirks.o slot.o \
- pci-driver.o search.o pci-sysfs.o rom.o setup-res.o
+ pci-driver.o search.o pci-sysfs.o rom.o setup-res.o \
+ irq.o
obj-$(CONFIG_PROC_FS) += proc.o

# Build PCI Express stuff if needed
diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
new file mode 100644
index 0000000..6441dfa
--- /dev/null
+++ b/drivers/pci/irq.c
@@ -0,0 +1,60 @@
+/*
+ * PCI IRQ failure handing code
+ *
+ * Copyright (c) 2008 James Bottomley <[email protected]>
+ */
+
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/pci.h>
+
+static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
+{
+ struct pci_dev *parent = to_pci_dev(pdev->dev.parent);
+
+ dev_printk(KERN_ERR, &pdev->dev,
+ "Potentially misrouted IRQ (Bridge %s %04x:%04x)\n",
+ parent->dev.bus_id, parent->vendor, parent->device);
+ dev_printk(KERN_ERR, &pdev->dev, "%s\n", reason);
+ dev_printk(KERN_ERR, &pdev->dev, "Please report to [email protected]\n");
+ WARN_ON(1);
+}
+
+/**
+ * pci_lost_interrupt - reports a lost PCI interrupt
+ * @pdev: device whose interrupt is lost
+ *
+ * The primary function of this routine is to report a lost interrupt
+ * in a standard way which users can recognise (instead of blaming the
+ * driver).
+ *
+ * Returns:
+ * a suggestion for fixing it (although the driver is not required to
+ * act on this).
+ */
+enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *pdev)
+{
+ if (pdev->msi_enabled || pdev->msix_enabled) {
+ enum pci_lost_interrupt_reason ret;
+
+ if (pdev->msix_enabled) {
+ pci_note_irq_problem(pdev, "MSIX routing failure");
+ ret = PCI_LOST_IRQ_DISABLE_MSIX;
+ } else {
+ pci_note_irq_problem(pdev, "MSI routing failure");
+ ret = PCI_LOST_IRQ_DISABLE_MSI;
+ }
+ return ret;
+ }
+#ifdef CONFIG_ACPI
+ if (!(acpi_disabled || acpi_noirq)) {
+ pci_note_irq_problem(pdev, "Potential ACPI misrouting please reboot with acpi=noirq");
+ /* currently no way to fix acpi on the fly */
+ return PCI_LOST_IRQ_DISABLE_ACPI;
+ }
+#endif
+ pci_note_irq_problem(pdev, "unknown cause (not MSI or ACPI)");
+ return PCI_LOST_IRQ_NO_INFORMATION;
+}
+EXPORT_SYMBOL(pci_lost_interrupt);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 825be38..121698a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -539,6 +539,13 @@ struct pci_dev __deprecated *pci_find_slot(unsigned int bus,
unsigned int devfn);
#endif /* CONFIG_PCI_LEGACY */

+enum pci_lost_interrupt_reason {
+ PCI_LOST_IRQ_NO_INFORMATION = 0,
+ PCI_LOST_IRQ_DISABLE_MSI,
+ PCI_LOST_IRQ_DISABLE_MSIX,
+ PCI_LOST_IRQ_DISABLE_ACPI,
+};
+enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *dev);
int pci_find_capability(struct pci_dev *dev, int cap);
int pci_find_next_capability(struct pci_dev *dev, u8 pos, int cap);
int pci_find_ext_capability(struct pci_dev *dev, int cap);
--
1.5.6.3



2008-08-04 02:52:08

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Sun, Aug 03, 2008 at 01:02:12PM -0500, James Bottomley wrote:
> +static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
> +{
> + struct pci_dev *parent = to_pci_dev(pdev->dev.parent);
> +
> + dev_printk(KERN_ERR, &pdev->dev,
> + "Potentially misrouted IRQ (Bridge %s %04x:%04x)\n",
> + parent->dev.bus_id, parent->vendor, parent->device);
> + dev_printk(KERN_ERR, &pdev->dev, "%s\n", reason);
> + dev_printk(KERN_ERR, &pdev->dev, "Please report to [email protected]\n");
> + WARN_ON(1);
> +}

Will the dev_printk() strings be included in the kerneloops report? And
what if there is no parent of the device? Consider device 00:02.0 on my
laptop:

00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 03)
Subsystem: Fujitsu Limited. Device 13fe
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 16

> +enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *pdev)
> +{
> + if (pdev->msi_enabled || pdev->msix_enabled) {
> + enum pci_lost_interrupt_reason ret;
> +
> + if (pdev->msix_enabled) {
> + pci_note_irq_problem(pdev, "MSIX routing failure");
> + ret = PCI_LOST_IRQ_DISABLE_MSIX;
> + } else {
> + pci_note_irq_problem(pdev, "MSI routing failure");
> + ret = PCI_LOST_IRQ_DISABLE_MSI;
> + }
> + return ret;
> + }

Couldn't this be written more concisely as:

if (pdev->msix_enabled) {
pci_note_irq_problem(pdev, "MSIX routing failure");
return PCI_LOST_IRQ_DISABLE_MSIX;
}
if (pdev->msi_enabled) {
pci_note_irq_problem(pdev, "MSI routing failure");
return PCI_LOST_IRQ_DISABLE_MSI;
}

> +#ifdef CONFIG_ACPI
> + if (!(acpi_disabled || acpi_noirq)) {
> + pci_note_irq_problem(pdev, "Potential ACPI misrouting please reboot with acpi=noirq");
> + /* currently no way to fix acpi on the fly */
> + return PCI_LOST_IRQ_DISABLE_ACPI;
> + }
> +#endif
> + pci_note_irq_problem(pdev, "unknown cause (not MSI or ACPI)");
> + return PCI_LOST_IRQ_NO_INFORMATION;
> +}
> +EXPORT_SYMBOL(pci_lost_interrupt);

2008-08-04 03:47:27

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Sun, 2008-08-03 at 20:51 -0600, Matthew Wilcox wrote:
> On Sun, Aug 03, 2008 at 01:02:12PM -0500, James Bottomley wrote:
> > +static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
> > +{
> > + struct pci_dev *parent = to_pci_dev(pdev->dev.parent);
> > +
> > + dev_printk(KERN_ERR, &pdev->dev,
> > + "Potentially misrouted IRQ (Bridge %s %04x:%04x)\n",
> > + parent->dev.bus_id, parent->vendor, parent->device);
> > + dev_printk(KERN_ERR, &pdev->dev, "%s\n", reason);
> > + dev_printk(KERN_ERR, &pdev->dev, "Please report to [email protected]\n");
> > + WARN_ON(1);
> > +}
>
> Will the dev_printk() strings be included in the kerneloops report? And
> what if there is no parent of the device? Consider device 00:02.0 on my
> laptop:

No, but some of them take the prior strings (depending on the
implementation).

> 00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 03)
> Subsystem: Fujitsu Limited. Device 13fe
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 16

There is always a parent device ... it might be the PCI root device, in
which case the information will be blank, but there is always one.

> > +enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *pdev)
> > +{
> > + if (pdev->msi_enabled || pdev->msix_enabled) {
> > + enum pci_lost_interrupt_reason ret;
> > +
> > + if (pdev->msix_enabled) {
> > + pci_note_irq_problem(pdev, "MSIX routing failure");
> > + ret = PCI_LOST_IRQ_DISABLE_MSIX;
> > + } else {
> > + pci_note_irq_problem(pdev, "MSI routing failure");
> > + ret = PCI_LOST_IRQ_DISABLE_MSI;
> > + }
> > + return ret;
> > + }
>
> Couldn't this be written more concisely as:
>
> if (pdev->msix_enabled) {
> pci_note_irq_problem(pdev, "MSIX routing failure");
> return PCI_LOST_IRQ_DISABLE_MSIX;
> }
> if (pdev->msi_enabled) {
> pci_note_irq_problem(pdev, "MSI routing failure");
> return PCI_LOST_IRQ_DISABLE_MSI;
> }

The idea was to separate the cases in case something extra needs be
done. I think it's pretty much identical as far as the compiler
optimises, and therefore probably not worth worrying about much.

James

2008-08-04 04:31:21

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Sun, Aug 03, 2008 at 01:02:12PM -0500, James Bottomley wrote:
> We're getting a lot of storage drivers blamed for interrupt misrouting
> issues. This patch provides a standard way of reporting the problem
> ... and, if possible, correcting it.

James,
Can you add a paragraph to Documentation/pci.txt about the usage
of the new API?

thanks,
grant

>
> Signed-off-by: James Bottomley <[email protected]>
> ---
> drivers/pci/Makefile | 3 +-
> drivers/pci/irq.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/pci.h | 7 +++++
> 3 files changed, 69 insertions(+), 1 deletions(-)
> create mode 100644 drivers/pci/irq.c
>
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 7d63f8c..19dacb8 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -3,7 +3,8 @@
> #
>
> obj-y += access.o bus.o probe.o remove.o pci.o quirks.o slot.o \
> - pci-driver.o search.o pci-sysfs.o rom.o setup-res.o
> + pci-driver.o search.o pci-sysfs.o rom.o setup-res.o \
> + irq.o
> obj-$(CONFIG_PROC_FS) += proc.o
>
> # Build PCI Express stuff if needed
> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
> new file mode 100644
> index 0000000..6441dfa
> --- /dev/null
> +++ b/drivers/pci/irq.c
> @@ -0,0 +1,60 @@
> +/*
> + * PCI IRQ failure handing code
> + *
> + * Copyright (c) 2008 James Bottomley <[email protected]>
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/pci.h>
> +
> +static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
> +{
> + struct pci_dev *parent = to_pci_dev(pdev->dev.parent);
> +
> + dev_printk(KERN_ERR, &pdev->dev,
> + "Potentially misrouted IRQ (Bridge %s %04x:%04x)\n",
> + parent->dev.bus_id, parent->vendor, parent->device);
> + dev_printk(KERN_ERR, &pdev->dev, "%s\n", reason);
> + dev_printk(KERN_ERR, &pdev->dev, "Please report to [email protected]\n");
> + WARN_ON(1);
> +}
> +
> +/**
> + * pci_lost_interrupt - reports a lost PCI interrupt
> + * @pdev: device whose interrupt is lost
> + *
> + * The primary function of this routine is to report a lost interrupt
> + * in a standard way which users can recognise (instead of blaming the
> + * driver).
> + *
> + * Returns:
> + * a suggestion for fixing it (although the driver is not required to
> + * act on this).
> + */
> +enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *pdev)
> +{
> + if (pdev->msi_enabled || pdev->msix_enabled) {
> + enum pci_lost_interrupt_reason ret;
> +
> + if (pdev->msix_enabled) {
> + pci_note_irq_problem(pdev, "MSIX routing failure");
> + ret = PCI_LOST_IRQ_DISABLE_MSIX;
> + } else {
> + pci_note_irq_problem(pdev, "MSI routing failure");
> + ret = PCI_LOST_IRQ_DISABLE_MSI;
> + }
> + return ret;
> + }
> +#ifdef CONFIG_ACPI
> + if (!(acpi_disabled || acpi_noirq)) {
> + pci_note_irq_problem(pdev, "Potential ACPI misrouting please reboot with acpi=noirq");
> + /* currently no way to fix acpi on the fly */
> + return PCI_LOST_IRQ_DISABLE_ACPI;
> + }
> +#endif
> + pci_note_irq_problem(pdev, "unknown cause (not MSI or ACPI)");
> + return PCI_LOST_IRQ_NO_INFORMATION;
> +}
> +EXPORT_SYMBOL(pci_lost_interrupt);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 825be38..121698a 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -539,6 +539,13 @@ struct pci_dev __deprecated *pci_find_slot(unsigned int bus,
> unsigned int devfn);
> #endif /* CONFIG_PCI_LEGACY */
>
> +enum pci_lost_interrupt_reason {
> + PCI_LOST_IRQ_NO_INFORMATION = 0,
> + PCI_LOST_IRQ_DISABLE_MSI,
> + PCI_LOST_IRQ_DISABLE_MSIX,
> + PCI_LOST_IRQ_DISABLE_ACPI,
> +};
> +enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *dev);
> int pci_find_capability(struct pci_dev *dev, int cap);
> int pci_find_next_capability(struct pci_dev *dev, u8 pos, int cap);
> int pci_find_ext_capability(struct pci_dev *dev, int cap);
> --
> 1.5.6.3
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-04 13:32:11

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Sun, 2008-08-03 at 22:30 -0600, Grant Grundler wrote:
> On Sun, Aug 03, 2008 at 01:02:12PM -0500, James Bottomley wrote:
> > We're getting a lot of storage drivers blamed for interrupt misrouting
> > issues. This patch provides a standard way of reporting the problem
> > ... and, if possible, correcting it.
>
> James,
> Can you add a paragraph to Documentation/pci.txt about the usage
> of the new API?

When there actually is a new API, sure.

James

2008-08-04 20:45:15

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Sunday 03 August 2008 12:02:12 pm James Bottomley wrote:
> +static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
> +{
> + struct pci_dev *parent = to_pci_dev(pdev->dev.parent);
> +
> + dev_printk(KERN_ERR, &pdev->dev,
> + "Potentially misrouted IRQ (Bridge %s %04x:%04x)\n",
> + parent->dev.bus_id, parent->vendor, parent->device);

Do you prefer "dev_printk(KERN_ERR, ...)" over "dev_err(...)"? Easier
to grep for the former, maybe? If so, should we deprecate "dev_err()"
and friends? When I converted most of the PCI core to use dev_printk(),
(80ccba1186d48f ...) I used dev_err(), but I don't really care one way
or the other.

Maybe use pci_name(parent)?

I tried to standardize the PCI core on "[%04x/%04x]" for vendor/device ID.

Bjorn

2008-08-04 22:05:01

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Mon, Aug 04, 2008 at 02:43:20PM -0600, Bjorn Helgaas wrote:
> I tried to standardize the PCI core on "[%04x/%04x]" for vendor/device ID.

That's unfortunately different from lspci -nn:

00:00.0 Host bridge [0600]: Intel Corporation Mobile PM965/GM965/GL960
Memory Controller Hub [8086:2a00] (rev 03)

Sorry, I should have reviewed your patches ;-(

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-08-04 22:21:47

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Monday 04 August 2008 03:35:35 pm Matthew Wilcox wrote:
> On Mon, Aug 04, 2008 at 02:43:20PM -0600, Bjorn Helgaas wrote:
> > I tried to standardize the PCI core on "[%04x/%04x]" for vendor/device ID.
>
> That's unfortunately different from lspci -nn:
>
> 00:00.0 Host bridge [0600]: Intel Corporation Mobile PM965/GM965/GL960
> Memory Controller Hub [8086:2a00] (rev 03)
>
> Sorry, I should have reviewed your patches ;-(

I can change those. I just copied what seemed to be the most
prevalent. But lspci is harder to change, so we should probably
follow that.

2008-08-05 07:16:52

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Mon, 2008-08-04 at 14:43 -0600, Bjorn Helgaas wrote:
> On Sunday 03 August 2008 12:02:12 pm James Bottomley wrote:
> > +static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
> > +{
> > + struct pci_dev *parent = to_pci_dev(pdev->dev.parent);
> > +
> > + dev_printk(KERN_ERR, &pdev->dev,
> > + "Potentially misrouted IRQ (Bridge %s %04x:%04x)\n",
> > + parent->dev.bus_id, parent->vendor, parent->device);
>
> Do you prefer "dev_printk(KERN_ERR, ...)" over "dev_err(...)"? Easier
> to grep for the former, maybe? If so, should we deprecate "dev_err()"
> and friends? When I converted most of the PCI core to use dev_printk(),
> (80ccba1186d48f ...) I used dev_err(), but I don't really care one way
> or the other.
>
> Maybe use pci_name(parent)?
>
> I tried to standardize the PCI core on "[%04x/%04x]" for vendor/device ID.

To be honest I'm not really interested too much in the various API
preferences ... they can be fixed up later by the people who care.

What I am interested in is whether we can get this interface to give
sufficient information to blacklist the offending motherboard if we can
identify it as the source of the problem. Since the usual culprit is
the bridge, that's why I'm doing parent vendor/device pairs. However,
better suggestions for the information that should be displayed would be
gratefully received.

James

2008-08-05 15:43:23

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Monday 04 August 2008 06:02:27 pm James Bottomley wrote:
> On Mon, 2008-08-04 at 14:43 -0600, Bjorn Helgaas wrote:
> > On Sunday 03 August 2008 12:02:12 pm James Bottomley wrote:
> > > +static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
> > > +{
> > > + struct pci_dev *parent = to_pci_dev(pdev->dev.parent);
> > > +
> > > + dev_printk(KERN_ERR, &pdev->dev,
> > > + "Potentially misrouted IRQ (Bridge %s %04x:%04x)\n",
> > > + parent->dev.bus_id, parent->vendor, parent->device);
> >
> > Do you prefer "dev_printk(KERN_ERR, ...)" over "dev_err(...)"? Easier
> > to grep for the former, maybe? If so, should we deprecate "dev_err()"
> > and friends? When I converted most of the PCI core to use dev_printk(),
> > (80ccba1186d48f ...) I used dev_err(), but I don't really care one way
> > or the other.
> >
> > Maybe use pci_name(parent)?
> >
> > I tried to standardize the PCI core on "[%04x/%04x]" for vendor/device ID.
>
> To be honest I'm not really interested too much in the various API
> preferences ... they can be fixed up later by the people who care.

I'm happy to fix it up later if you prefer. I only mentioned it
because doing it later adds churn and risk of breakage.

Bjorn

2008-08-05 17:03:24

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Sunday, August 03, 2008 11:02 am James Bottomley wrote:
> +/**
> + * pci_lost_interrupt - reports a lost PCI interrupt
> + * @pdev: device whose interrupt is lost
> + *
> + * The primary function of this routine is to report a lost interrupt
> + * in a standard way which users can recognise (instead of blaming the
> + * driver).
> + *
> + * Returns:
> + * a suggestion for fixing it (although the driver is not required to
> + * act on this).
> + */
> +enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *pdev)
> +{
> + if (pdev->msi_enabled || pdev->msix_enabled) {
> + enum pci_lost_interrupt_reason ret;
> +
> + if (pdev->msix_enabled) {
> + pci_note_irq_problem(pdev, "MSIX routing failure");
> + ret = PCI_LOST_IRQ_DISABLE_MSIX;
> + } else {
> + pci_note_irq_problem(pdev, "MSI routing failure");
> + ret = PCI_LOST_IRQ_DISABLE_MSI;
> + }
> + return ret;
> + }
> +#ifdef CONFIG_ACPI
> + if (!(acpi_disabled || acpi_noirq)) {
> + pci_note_irq_problem(pdev, "Potential ACPI misrouting please reboot with
> acpi=noirq"); + /* currently no way to fix acpi on the fly */
> + return PCI_LOST_IRQ_DISABLE_ACPI;
> + }
> +#endif
> + pci_note_irq_problem(pdev, "unknown cause (not MSI or ACPI)");
> + return PCI_LOST_IRQ_NO_INFORMATION;
> +}

This seems to be a function that just returns what type of IRQ you're using or
how it's routed and it isn't necessarily "lost interrupt" specific.

This information is clearly useful to drivers both for informational purposes
and for debugging problems, so on that front it looks good. However, I think
it should probably be called pci_interrupt_type(struct pci_dev *dev) or
something instead, and just return an enum of either MSIX, MSI, or LINE
(though I'm open to better names). From that, the driver can infer what
might be going wrong, though in the case of a LINE interrupt, all you can
really do is report that there's probably a platform IRQ routing problem.

Jesse

2008-08-05 20:44:27

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Tue, 2008-08-05 at 10:03 -0700, Jesse Barnes wrote:
> On Sunday, August 03, 2008 11:02 am James Bottomley wrote:
> > +/**
> > + * pci_lost_interrupt - reports a lost PCI interrupt
> > + * @pdev: device whose interrupt is lost
> > + *
> > + * The primary function of this routine is to report a lost interrupt
> > + * in a standard way which users can recognise (instead of blaming the
> > + * driver).
> > + *
> > + * Returns:
> > + * a suggestion for fixing it (although the driver is not required to
> > + * act on this).
> > + */
> > +enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *pdev)
> > +{
> > + if (pdev->msi_enabled || pdev->msix_enabled) {
> > + enum pci_lost_interrupt_reason ret;
> > +
> > + if (pdev->msix_enabled) {
> > + pci_note_irq_problem(pdev, "MSIX routing failure");
> > + ret = PCI_LOST_IRQ_DISABLE_MSIX;
> > + } else {
> > + pci_note_irq_problem(pdev, "MSI routing failure");
> > + ret = PCI_LOST_IRQ_DISABLE_MSI;
> > + }
> > + return ret;
> > + }
> > +#ifdef CONFIG_ACPI
> > + if (!(acpi_disabled || acpi_noirq)) {
> > + pci_note_irq_problem(pdev, "Potential ACPI misrouting please reboot with
> > acpi=noirq"); + /* currently no way to fix acpi on the fly */
> > + return PCI_LOST_IRQ_DISABLE_ACPI;
> > + }
> > +#endif
> > + pci_note_irq_problem(pdev, "unknown cause (not MSI or ACPI)");
> > + return PCI_LOST_IRQ_NO_INFORMATION;
> > +}
>
> This seems to be a function that just returns what type of IRQ you're using or
> how it's routed and it isn't necessarily "lost interrupt" specific.

So perhaps this routine should only note but not advise? The drivers
can then just call pci_interrupt_type to see if they can do anything.

> This information is clearly useful to drivers both for informational purposes
> and for debugging problems, so on that front it looks good. However, I think
> it should probably be called pci_interrupt_type(struct pci_dev *dev) or
> something instead, and just return an enum of either MSIX, MSI, or LINE
> (though I'm open to better names). From that, the driver can infer what
> might be going wrong, though in the case of a LINE interrupt, all you can
> really do is report that there's probably a platform IRQ routing problem.

The only thing that this can't do is ACPI ... but on the other hand once
the IRQ routing is set by ACPI I'm not sure the driver can do anything
to fix it.

So ... depending on how the feedback goes, I'll plan to reroll this as a
note but not advise function.

James


2008-08-05 20:54:18

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Tuesday, August 05, 2008 1:44 pm James Bottomley wrote:
> On Tue, 2008-08-05 at 10:03 -0700, Jesse Barnes wrote:
> > This seems to be a function that just returns what type of IRQ you're
> > using or how it's routed and it isn't necessarily "lost interrupt"
> > specific.
>
> So perhaps this routine should only note but not advise? The drivers
> can then just call pci_interrupt_type to see if they can do anything.

If it's just a pci_irq_type function then it probably shouldn't print
anything, and leave that to the caller, since it might be used for other
purposes too (e.g. a driver load printk or something). In the lost interrupt
case you already have to disable MSI or MSIX in the Fusion driver, so you may
as well put the printk there, right? I guess I'm saying it should neither
note nor advise, just return the IRQ type.

> > This information is clearly useful to drivers both for informational
> > purposes and for debugging problems, so on that front it looks good.
> > However, I think it should probably be called pci_interrupt_type(struct
> > pci_dev *dev) or something instead, and just return an enum of either
> > MSIX, MSI, or LINE (though I'm open to better names). From that, the
> > driver can infer what might be going wrong, though in the case of a LINE
> > interrupt, all you can really do is report that there's probably a
> > platform IRQ routing problem.
>
> The only thing that this can't do is ACPI ... but on the other hand once
> the IRQ routing is set by ACPI I'm not sure the driver can do anything
> to fix it.

Yeah, ACPI may or may not be the problem, all you'll really know is that
you've got a line interrupt that you failed to get... The driver won't be
able to do much in that case aside from complain loudly.

Thanks,
Jesse

2008-08-05 20:57:38

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Tue, 2008-08-05 at 13:53 -0700, Jesse Barnes wrote:
> On Tuesday, August 05, 2008 1:44 pm James Bottomley wrote:
> > On Tue, 2008-08-05 at 10:03 -0700, Jesse Barnes wrote:
> > > This seems to be a function that just returns what type of IRQ you're
> > > using or how it's routed and it isn't necessarily "lost interrupt"
> > > specific.
> >
> > So perhaps this routine should only note but not advise? The drivers
> > can then just call pci_interrupt_type to see if they can do anything.
>
> If it's just a pci_irq_type function then it probably shouldn't print
> anything, and leave that to the caller, since it might be used for other
> purposes too (e.g. a driver load printk or something). In the lost interrupt
> case you already have to disable MSI or MSIX in the Fusion driver, so you may
> as well put the printk there, right? I guess I'm saying it should neither
> note nor advise, just return the IRQ type.

Well, no; the object was to have the layer that knew (PCI) print
information which could be used to identify the problem. Likely what
the driver will say is something like "MSI isn't working and it's not my
fault". What I want is for PCI to print something that may be helpful
to people trying to diagnose the problem. Driver writers aren't going
to get that right.

> > > This information is clearly useful to drivers both for informational
> > > purposes and for debugging problems, so on that front it looks good.
> > > However, I think it should probably be called pci_interrupt_type(struct
> > > pci_dev *dev) or something instead, and just return an enum of either
> > > MSIX, MSI, or LINE (though I'm open to better names). From that, the
> > > driver can infer what might be going wrong, though in the case of a LINE
> > > interrupt, all you can really do is report that there's probably a
> > > platform IRQ routing problem.
> >
> > The only thing that this can't do is ACPI ... but on the other hand once
> > the IRQ routing is set by ACPI I'm not sure the driver can do anything
> > to fix it.
>
> Yeah, ACPI may or may not be the problem, all you'll really know is that
> you've got a line interrupt that you failed to get... The driver won't be
> able to do much in that case aside from complain loudly.

Yes ... it's just that when line interrupts fail (especially if they
worked previously) it's usually ACPI to blame.

James

2008-08-05 21:23:18

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Tuesday, August 05, 2008 1:56 pm James Bottomley wrote:
> On Tue, 2008-08-05 at 13:53 -0700, Jesse Barnes wrote:
> > On Tuesday, August 05, 2008 1:44 pm James Bottomley wrote:
> > > On Tue, 2008-08-05 at 10:03 -0700, Jesse Barnes wrote:
> > > > This seems to be a function that just returns what type of IRQ you're
> > > > using or how it's routed and it isn't necessarily "lost interrupt"
> > > > specific.
> > >
> > > So perhaps this routine should only note but not advise? The drivers
> > > can then just call pci_interrupt_type to see if they can do anything.
> >
> > If it's just a pci_irq_type function then it probably shouldn't print
> > anything, and leave that to the caller, since it might be used for other
> > purposes too (e.g. a driver load printk or something). In the lost
> > interrupt case you already have to disable MSI or MSIX in the Fusion
> > driver, so you may as well put the printk there, right? I guess I'm
> > saying it should neither note nor advise, just return the IRQ type.
>
> Well, no; the object was to have the layer that knew (PCI) print
> information which could be used to identify the problem. Likely what
> the driver will say is something like "MSI isn't working and it's not my
> fault". What I want is for PCI to print something that may be helpful
> to people trying to diagnose the problem. Driver writers aren't going
> to get that right.

Yeah, I understand what you're trying to do, and given that there's only one
user of this function (your fusion patch), I don't have a strong preference.

However, the function is really just telling the driver what type of IRQ a
given PCI device currently has; it's not really a "lost interrupt" function
at all. So to me, it makes sense to just generalize it into a pci_irq_type
function and let drivers do what they will with it. It looks like this is
the real lost interrupt detection:

+ /* May fail becuase of IRQ misrouting */
+ rc = mpt_get_manufacturing_pg_0(ioc);
+ if (rc) {
+ ...
+ printk(MYIOC_s_ERR_FMT "Cannot recover IRQ routing\n",
+ ioc->name);
+ return -1;
+ }

not the pci_lost_interrupt() function. So it could just as easily be written
as such:

+ /* May fail becuase of IRQ misrouting */
+ rc = mpt_get_manufacturing_pg_0(ioc);
+ if (rc) {
+ /* Lost an IRQ, see what type... */
+ int irq_type = pci_irq_type(dev);
+
+ if (type == PCI_IRQ_MSI) {
+ /* Lost an MSI interrupt, try re-config w/o MSI */
+ free_irq(ioc->pci_irq, ioc);
+ ioc->msi_enable = 0;
+ pci_disable_msi(ioc->pcidev);
+ goto retry;
+ }
+ /* This platform is just broken... */
+ printk(MYIOC_s_ERR_FMT "Cannot recover from %s IRQ
+ routing error\n", pci_irq_type_name(type), ioc->name);
+ return -1;
+ }

or somesuch. That seems just as simple for driver writers as your initial
patch, and the function is named in accordance with what it actually does,
rather than what it's used for...

Jesse

2008-08-05 21:59:47

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Tue, 2008-08-05 at 14:15 -0700, Jesse Barnes wrote:
> On Tuesday, August 05, 2008 1:56 pm James Bottomley wrote:
> > On Tue, 2008-08-05 at 13:53 -0700, Jesse Barnes wrote:
> > > On Tuesday, August 05, 2008 1:44 pm James Bottomley wrote:
> > > > On Tue, 2008-08-05 at 10:03 -0700, Jesse Barnes wrote:
> > > > > This seems to be a function that just returns what type of IRQ you're
> > > > > using or how it's routed and it isn't necessarily "lost interrupt"
> > > > > specific.
> > > >
> > > > So perhaps this routine should only note but not advise? The drivers
> > > > can then just call pci_interrupt_type to see if they can do anything.
> > >
> > > If it's just a pci_irq_type function then it probably shouldn't print
> > > anything, and leave that to the caller, since it might be used for other
> > > purposes too (e.g. a driver load printk or something). In the lost
> > > interrupt case you already have to disable MSI or MSIX in the Fusion
> > > driver, so you may as well put the printk there, right? I guess I'm
> > > saying it should neither note nor advise, just return the IRQ type.
> >
> > Well, no; the object was to have the layer that knew (PCI) print
> > information which could be used to identify the problem. Likely what
> > the driver will say is something like "MSI isn't working and it's not my
> > fault". What I want is for PCI to print something that may be helpful
> > to people trying to diagnose the problem. Driver writers aren't going
> > to get that right.
>
> Yeah, I understand what you're trying to do, and given that there's only one
> user of this function (your fusion patch), I don't have a strong preference.
>
> However, the function is really just telling the driver what type of IRQ a
> given PCI device currently has; it's not really a "lost interrupt" function
> at all. So to me, it makes sense to just generalize it into a pci_irq_type
> function and let drivers do what they will with it. It looks like this is
> the real lost interrupt detection:
>
> + /* May fail becuase of IRQ misrouting */
> + rc = mpt_get_manufacturing_pg_0(ioc);
> + if (rc) {
> + ...
> + printk(MYIOC_s_ERR_FMT "Cannot recover IRQ routing\n",
> + ioc->name);
> + return -1;
> + }
>
> not the pci_lost_interrupt() function. So it could just as easily be written
> as such:
>
> + /* May fail becuase of IRQ misrouting */
> + rc = mpt_get_manufacturing_pg_0(ioc);
> + if (rc) {
> + /* Lost an IRQ, see what type... */
> + int irq_type = pci_irq_type(dev);
> +
> + if (type == PCI_IRQ_MSI) {
> + /* Lost an MSI interrupt, try re-config w/o MSI */
> + free_irq(ioc->pci_irq, ioc);
> + ioc->msi_enable = 0;
> + pci_disable_msi(ioc->pcidev);
> + goto retry;
> + }
> + /* This platform is just broken... */
> + printk(MYIOC_s_ERR_FMT "Cannot recover from %s IRQ
> + routing error\n", pci_irq_type_name(type), ioc->name);
> + return -1;
> + }
>
> or somesuch. That seems just as simple for driver writers as your initial
> patch, and the function is named in accordance with what it actually does,
> rather than what it's used for...

It could, but if the bridge is the culprit (as it usually is for MSI
problems), this print won't help identify it.

Therefore, rather than give driver writers a recipe for "print this and
this and go to the bridge and print this", I'd rather have a single PCI
callback that prints all the (hopefully) relevant information that will
allow either fixing or blacklisting.

James

2008-08-07 16:03:41

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Tuesday, August 5, 2008 2:54 pm James Bottomley wrote:
> > or somesuch. That seems just as simple for driver writers as your
> > initial patch, and the function is named in accordance with what it
> > actually does, rather than what it's used for...
>
> It could, but if the bridge is the culprit (as it usually is for MSI
> problems), this print won't help identify it.
>
> Therefore, rather than give driver writers a recipe for "print this and
> this and go to the bridge and print this", I'd rather have a single PCI
> callback that prints all the (hopefully) relevant information that will
> allow either fixing or blacklisting.

So in addition to the IRQ type check we need to dump some device topology
information... yeah that makes sense. I wonder if the driver core should
provide something like this. Greg? In the meantime we can definitely add
the IRQ type function.

Thanks,
Jesse

2008-08-07 17:22:56

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Thu, Aug 07, 2008 at 09:03:22AM -0700, Jesse Barnes wrote:
> On Tuesday, August 5, 2008 2:54 pm James Bottomley wrote:
> > > or somesuch. That seems just as simple for driver writers as your
> > > initial patch, and the function is named in accordance with what it
> > > actually does, rather than what it's used for...
> >
> > It could, but if the bridge is the culprit (as it usually is for MSI
> > problems), this print won't help identify it.
> >
> > Therefore, rather than give driver writers a recipe for "print this and
> > this and go to the bridge and print this", I'd rather have a single PCI
> > callback that prints all the (hopefully) relevant information that will
> > allow either fixing or blacklisting.
>
> So in addition to the IRQ type check we need to dump some device topology
> information... yeah that makes sense. I wonder if the driver core should
> provide something like this. Greg?

What kind of topology do you need that is not already provided?

thanks,

greg k-h

2008-08-07 17:37:21

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Thu, 2008-08-07 at 10:20 -0700, Greg KH wrote:
> On Thu, Aug 07, 2008 at 09:03:22AM -0700, Jesse Barnes wrote:
> > On Tuesday, August 5, 2008 2:54 pm James Bottomley wrote:
> > > > or somesuch. That seems just as simple for driver writers as your
> > > > initial patch, and the function is named in accordance with what it
> > > > actually does, rather than what it's used for...
> > >
> > > It could, but if the bridge is the culprit (as it usually is for MSI
> > > problems), this print won't help identify it.
> > >
> > > Therefore, rather than give driver writers a recipe for "print this and
> > > this and go to the bridge and print this", I'd rather have a single PCI
> > > callback that prints all the (hopefully) relevant information that will
> > > allow either fixing or blacklisting.
> >
> > So in addition to the IRQ type check we need to dump some device topology
> > information... yeah that makes sense. I wonder if the driver core should
> > provide something like this. Greg?
>
> What kind of topology do you need that is not already provided?

We really need to print how the device interrupt is routed. But,
unfortunately, we need to identify the devices in the layout (bridges
and the like by say device/vendor numbers) so we know what they are ...
this is the bit that the generic device model can't do because it's
embedded in the bus specific part.

James

2008-10-23 21:56:19

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/2] pci: add misrouted interrupt error handling

On Sunday, August 3, 2008 11:02 am James Bottomley wrote:
> We're getting a lot of storage drivers blamed for interrupt misrouting
> issues. This patch provides a standard way of reporting the problem
> ... and, if possible, correcting it.
>
> Signed-off-by: James Bottomley <[email protected]>

Ok, per our discussion at the PCI BoF I applied this, so now you can apply the
fusion and other bits to help debug issues.

Thanks,
Jesse