The existing code always suggests trying the pci=biosirq kernel parameter, but
this option is only recognized when CONFIG_PCI_BIOS is set, which in turn
depends on CONFIG_X86_32, so it is never appropriate on x86_64.
kernel-parameters.txt confirms that pci=biosirq is intended to be supported
only on X86-32.
The new version tries to be more helpful by recommending changes to ACPI
settings if appropriate, and only mentioning pci=biosirq (and the manual
pirq= option) for kernels that support it. Additionally, it encourages
the user to file bug reports so faulty firmware can be identified and
potentially handled via known quirks in a future kernel release.
ACPI is relevant to these warnings because it will significantly change
the path taken through the PCI discovery (and later interrupt routing)
code. Booting with acpi=noirq should be highly unusual and likely
indicates an attempt to work around faulty motherboard firmware, so I
added a new log message in pci_acpi_init() for this case, with yet
another recommendation to file a bug report.
Signed-off-by: Brent Spillner <[email protected]>
---
Changes in v2:
- Tried to make the code more legible by reducing use of #ifdef (only
used where required to guard reference to acpi_noirq)
- The tradeoff is there's now an idiosyncratic use of do {...} while (0),
but that lets me early-out from the acpi_noirq case without more #ifdefs
or duplicated if statements.
- Included a warning for acpi_noirq in pci_acpi_init() per maintainer suggestion
- Encourage user to file bug reports in all warning messages
arch/x86/pci/acpi.c | 4 +++-
arch/x86/pci/irq.c | 22 +++++++++++++++++++---
2 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index 052f1d78a562..12f894d345a9 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -401,8 +401,10 @@ int __init pci_acpi_init(void)
{
struct pci_dev *dev = NULL;
- if (acpi_noirq)
+ if (acpi_noirq) {
+ printk(KERN_WARNING "PCI: ACPI IRQ routing disabled; please submit a bug report if this was required to work around firmware defects\n");
return -ENODEV;
+ }
printk(KERN_INFO "PCI: Using ACPI for IRQ routing\n");
acpi_irq_penalty_init();
diff --git a/arch/x86/pci/irq.c b/arch/x86/pci/irq.c
index 97b63e35e152..393b036e773b 100644
--- a/arch/x86/pci/irq.c
+++ b/arch/x86/pci/irq.c
@@ -1519,10 +1519,26 @@ static int pirq_enable_irq(struct pci_dev *dev)
} else
msg = "; probably buggy MP table";
#endif
- } else if (pci_probe & PCI_BIOS_IRQ_SCAN)
+ } else if (pci_probe & PCI_BIOS_IRQ_SCAN) {
msg = "";
- else
- msg = "; please try using pci=biosirq";
+ } else {
+ do { /* just one iteration; allows break to minimize code duplication */
+#ifdef CONFIG_ACPI
+ if (acpi_noirq) {
+ msg = "; consider removing acpi=noirq, and file a bug report if that does not help";
+ break; /* out of remainder of one-iteration do {} loop */
+ }
+#endif
+ if (IS_ENABLED(CONFIG_PCI_BIOS))
+ /* pci=biosirq and pirq= are valid only on x86_32, and should never be necessary */
+ msg = "; try using pci=biosirq or manual pirq=, and file a bug report for this device";
+ else if (!IS_ENABLED(CONFIG_ACPI))
+ /* ACPI will change code path through PCI subsystem, and is worth trying */
+ msg = "; try enabling ACPI if feasible, and file a bug report for this device";
+ else
+ msg = "; please file a bug report for failure to discover device IRQ via ACPI";
+ } while (0);
+ }
/*
* With IDE legacy devices the IRQ lookup failure is not
--
2.35.1
On Tue, Feb 01, 2022 at 12:40:29AM +0000, Brent Spillner wrote:
> The existing code always suggests trying the pci=biosirq kernel parameter, but
> this option is only recognized when CONFIG_PCI_BIOS is set, which in turn
> depends on CONFIG_X86_32, so it is never appropriate on x86_64.
> kernel-parameters.txt confirms that pci=biosirq is intended to be supported
> only on X86-32.
>
> The new version tries to be more helpful by recommending changes to ACPI
> settings if appropriate, and only mentioning pci=biosirq (and the manual
> pirq= option) for kernels that support it. Additionally, it encourages
> the user to file bug reports so faulty firmware can be identified and
> potentially handled via known quirks in a future kernel release.
>
> ACPI is relevant to these warnings because it will significantly change
> the path taken through the PCI discovery (and later interrupt routing)
> code. Booting with acpi=noirq should be highly unusual and likely
> indicates an attempt to work around faulty motherboard firmware, so I
> added a new log message in pci_acpi_init() for this case, with yet
> another recommendation to file a bug report.
>
> Signed-off-by: Brent Spillner <[email protected]>
IIUC pirq_enable_irq() is only used for non-ACPI, non-DT, non-Xen,
non-Intel MID systems, so this is a real legacy path.
I don't think it's really worth cluttering an error case in a path
that should be rarely used in the first place.
Are you seeing a problem where you're getting the wrong error message
today? Can we just fix that problem instead so no kernel parameter is
needed in the first place?
> ---
> Changes in v2:
> - Tried to make the code more legible by reducing use of #ifdef (only
> used where required to guard reference to acpi_noirq)
> - The tradeoff is there's now an idiosyncratic use of do {...} while (0),
> but that lets me early-out from the acpi_noirq case without more #ifdefs
> or duplicated if statements.
> - Included a warning for acpi_noirq in pci_acpi_init() per maintainer suggestion
> - Encourage user to file bug reports in all warning messages
>
> arch/x86/pci/acpi.c | 4 +++-
> arch/x86/pci/irq.c | 22 +++++++++++++++++++---
> 2 files changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> index 052f1d78a562..12f894d345a9 100644
> --- a/arch/x86/pci/acpi.c
> +++ b/arch/x86/pci/acpi.c
> @@ -401,8 +401,10 @@ int __init pci_acpi_init(void)
> {
> struct pci_dev *dev = NULL;
>
> - if (acpi_noirq)
> + if (acpi_noirq) {
> + printk(KERN_WARNING "PCI: ACPI IRQ routing disabled; please submit a bug report if this was required to work around firmware defects\n");
> return -ENODEV;
> + }
>
> printk(KERN_INFO "PCI: Using ACPI for IRQ routing\n");
> acpi_irq_penalty_init();
> diff --git a/arch/x86/pci/irq.c b/arch/x86/pci/irq.c
> index 97b63e35e152..393b036e773b 100644
> --- a/arch/x86/pci/irq.c
> +++ b/arch/x86/pci/irq.c
> @@ -1519,10 +1519,26 @@ static int pirq_enable_irq(struct pci_dev *dev)
> } else
> msg = "; probably buggy MP table";
> #endif
> - } else if (pci_probe & PCI_BIOS_IRQ_SCAN)
> + } else if (pci_probe & PCI_BIOS_IRQ_SCAN) {
> msg = "";
> - else
> - msg = "; please try using pci=biosirq";
> + } else {
> + do { /* just one iteration; allows break to minimize code duplication */
> +#ifdef CONFIG_ACPI
> + if (acpi_noirq) {
> + msg = "; consider removing acpi=noirq, and file a bug report if that does not help";
> + break; /* out of remainder of one-iteration do {} loop */
> + }
> +#endif
> + if (IS_ENABLED(CONFIG_PCI_BIOS))
> + /* pci=biosirq and pirq= are valid only on x86_32, and should never be necessary */
> + msg = "; try using pci=biosirq or manual pirq=, and file a bug report for this device";
> + else if (!IS_ENABLED(CONFIG_ACPI))
> + /* ACPI will change code path through PCI subsystem, and is worth trying */
> + msg = "; try enabling ACPI if feasible, and file a bug report for this device";
> + else
> + msg = "; please file a bug report for failure to discover device IRQ via ACPI";
> + } while (0);
> + }
>
> /*
> * With IDE legacy devices the IRQ lookup failure is not
> --
> 2.35.1
>
On Wed, Feb 02, 2022 at 10:13:31PM +0000, Brent Spillner wrote:
> On Wed, Feb 2, 2022 at 6:13 PM Bjorn Helgaas <[email protected]> wrote:
> > IIUC pirq_enable_irq() is only used for non-ACPI, non-DT, non-Xen,
> > non-Intel MID systems, so this is a real legacy path.
> >
> > I don't think it's really worth cluttering an error case in a path
> > that should be rarely used in the first place.
>
> I figured if people are getting this message, then they either have
> broken hardware or are debugging something, and if the message is
> trying to be useful then it can't hurt to mention other things that
> might help. If the pirq path has become buggy or unsupportable in
> modern kernels then it probably ought to be removed altogether; if it
> still works, however rarely it might be needed these days, then it's
> perhaps worth mentioning to those who might occasionally have a use
> for it.
>
> > Are you seeing a problem where you're getting the wrong error message
> > today? Can we just fix that problem instead so no kernel parameter is
> > needed in the first place?
>
> I was trying to isolate intermittent ACPI errors and tried booting
> with acpi=noirq, as seemingly the closest modern equivalent to the
> acpi=ht that had solved, or at least half-split, similar issues for me
> in the past. With noirq, a different PCIe device stopped working (MPT
> Fusion driver not picking up any responses to doorbell interrupts),
If your system has ACPI, I think "pci=biosirq" and "acpi=noirq" are at
best distractions from finding the real problem.
> and while reviewing dmesg I noticed that the PCI error messages were
> suggesting a kernel option that wasn't appropriate for my x86_64
> architecture. In an ideal world with no hardware or driver problems
> these log messages should never even happen, but in the real world of
> troubleshooting and debugging I think they can be useful, and if
> they're going to be reported they might as well be correct.
>
> Obviously, this is a very minor bug, affecting only logs rather than
> behavior, and I'm sure there are more pressing things to worry about.
> On the other hand, it also seems like a very easy and low-risk fix
> that leaves the kernel in a slightly better state for future users and
> developers. At any rate, the current state of the PCI code (a)
> generates a message that recommends specific kernel parameters, (b)
> does so even on builds for which those parameters are inappropriate,
> (c) doesn't say anything to encourage bug reports, and (d) doesn't
> warn about the risks of noirq, which could cause other problems to be
> misattributed. So even though the alternate messages I drafted may not
> be perfect, and might need to be tuned in the future based on patterns
> in whatever bug reports come in, I'm still confident that they're an
> improvement (and I'm open to further suggestions).
This path has to do with ancient x86 and BIOS history, which I know
very little about. If I were going to do something about these
messages, here's what I would do. Maybe it's too aggressive; I dunno.
diff --git a/arch/x86/pci/irq.c b/arch/x86/pci/irq.c
index 97b63e35e152..553c9f00c0e0 100644
--- a/arch/x86/pci/irq.c
+++ b/arch/x86/pci/irq.c
@@ -1473,8 +1473,6 @@ static int pirq_enable_irq(struct pci_dev *dev)
pci_read_config_byte(dev, PCI_INTERRUPT_PIN, &pin);
if (pin && !pcibios_lookup_irq(dev, 1)) {
- char *msg = "";
-
if (!io_apic_assign_pci_irqs && dev->irq)
return 0;
@@ -1516,13 +1514,9 @@ static int pirq_enable_irq(struct pci_dev *dev)
dev_info(&dev->dev, "PCI->APIC IRQ transform: "
"INT %c -> IRQ %d\n", 'A' + pin - 1, irq);
return 0;
- } else
- msg = "; probably buggy MP table";
+ }
#endif
- } else if (pci_probe & PCI_BIOS_IRQ_SCAN)
- msg = "";
- else
- msg = "; please try using pci=biosirq";
+ }
/*
* With IDE legacy devices the IRQ lookup failure is not
@@ -1532,8 +1526,8 @@ static int pirq_enable_irq(struct pci_dev *dev)
!(dev->class & 0x5))
return 0;
- dev_warn(&dev->dev, "can't find IRQ for PCI INT %c%s\n",
- 'A' + pin - 1, msg);
+ dev_warn(&dev->dev, "can't find IRQ for PCI INT %c\n",
+ 'A' + pin - 1);
}
return 0;
}
On Wed, Feb 2, 2022 at 6:13 PM Bjorn Helgaas <[email protected]> wrote:
> IIUC pirq_enable_irq() is only used for non-ACPI, non-DT, non-Xen,
> non-Intel MID systems, so this is a real legacy path.
>
> I don't think it's really worth cluttering an error case in a path
> that should be rarely used in the first place.
I figured if people are getting this message, then they either have
broken hardware or are debugging something, and if the message is
trying to be useful then it can't hurt to mention other things that
might help. If the pirq path has become buggy or unsupportable in
modern kernels then it probably ought to be removed altogether; if it
still works, however rarely it might be needed these days, then it's
perhaps worth mentioning to those who might occasionally have a use
for it.
> Are you seeing a problem where you're getting the wrong error message
> today? Can we just fix that problem instead so no kernel parameter is
> needed in the first place?
I was trying to isolate intermittent ACPI errors and tried booting
with acpi=noirq, as seemingly the closest modern equivalent to the
acpi=ht that had solved, or at least half-split, similar issues for me
in the past. With noirq, a different PCIe device stopped working (MPT
Fusion driver not picking up any responses to doorbell interrupts),
and while reviewing dmesg I noticed that the PCI error messages were
suggesting a kernel option that wasn't appropriate for my x86_64
architecture. In an ideal world with no hardware or driver problems
these log messages should never even happen, but in the real world of
troubleshooting and debugging I think they can be useful, and if
they're going to be reported they might as well be correct.
Obviously, this is a very minor bug, affecting only logs rather than
behavior, and I'm sure there are more pressing things to worry about.
On the other hand, it also seems like a very easy and low-risk fix
that leaves the kernel in a slightly better state for future users and
developers. At any rate, the current state of the PCI code (a)
generates a message that recommends specific kernel parameters, (b)
does so even on builds for which those parameters are inappropriate,
(c) doesn't say anything to encourage bug reports, and (d) doesn't
warn about the risks of noirq, which could cause other problems to be
misattributed. So even though the alternate messages I drafted may not
be perfect, and might need to be tuned in the future based on patterns
in whatever bug reports come in, I'm still confident that they're an
improvement (and I'm open to further suggestions).
On Thu, 3 Feb 2022, Brent Spillner wrote:
> > If your system has ACPI, I think "pci=biosirq" and "acpi=noirq" are at
> > best distractions from finding the real problem.
>
> ...except when the cause is indeed buggy ACPI firmware, which is
> presumably the only reason these options exist in the first place.
The former is really for us missing a PIRQ router or for a BIOS missing a
$PIR table. I have been posting patches recently to add support for some
of such systems identified. Those are all pre-ACPI, possibly long before,
i.e. early to late 1990s (486 and P5/P54C Pentium and similar systems),
and I think the option was added before we had ACPI support too. I have a
couple of pre-ACPI x86 systems too, which are quirky to say the least even
though they were the quality systems of the time from reputable vendors.
FWIW,
Maciej
On Wed, Feb 2, 2022 at 10:42 PM Bjorn Helgaas <[email protected]> wrote:
> If your system has ACPI, I think "pci=biosirq" and "acpi=noirq" are at
> best distractions from finding the real problem.
...except when the cause is indeed buggy ACPI firmware, which is
presumably the only reason these options exist in the first place.
> This path has to do with ancient x86 and BIOS history, which I know
> very little about. If I were going to do something about these
> messages, here's what I would do. Maybe it's too aggressive; I dunno.
Well, it wouldn't require future maintenance when supported command
line options change, which is how we got the stale warnings in the
current code. I'm just not sure who it helps: the vast majority of
users with no IRQ discovery problems never see these messages so they
have no reason to care how concise they are, and they won't be getting
recommendations to try risky kernel parameters for no good reason.
Those who do hit a problem get fewer hints about how to proceed; some
might file a bug report if they can't figure it out, but others will
probably prematurely assume that their hardware "isn't supported" and
give up. Those who dig through the code and kernel_parameters.txt to
find these alternatives even without the hints are probably the ones
who would have written the best bug reports, but they don't get any
specific encouragement to do so, and might never report anything when
they do find a workaround. And I have to wonder how many PCI IRQ bug
reports would get a first response like "Hmm, that should never
happen--- trying booting with pci=biosirq and report whether that
changes anything." Even when it doesn't help, as it often won't, you
have both a data point and encouragement to file the bug report for
further investigation.
Again, this is a minor issue, and I'm not emotionally attached to any
particular solution. Your approach solves the immediate problem of
inappropriate recommendations, and if the code looked like that
already I wouldn't have proposed this patch in the first place. I just
think that if the goal is to get useful bug reports, then providing a
little bit of advice (similar to, but updated from, the current
biosirq suggestion) is more constructive than going for the tersest
possible printk, and I'm sure that log messages will be seen by more
of the potentially affected users than equivalent comments in the code
would be.