2004-10-15 08:37:27

by Jim Paris

Subject: PCI IRQ problems: "nobody cared!"

I'm having some strange PCI IRQ problems on my new laptop (Panasonic
Toughbook CF-M34UTVZKM) under 2.6.8-1-686 (Debian). I'm at a loss to
figure out their source, other than the fact that Toughbooks seem to
have a particularly crappy BIOS.

The errors are something like this (taken from default.txt, link below):
irq 9: nobody cared!
[<c010841a>] __report_bad_irq+0x2a/0x90
[<c0108510>] note_interrupt+0x70/0xb0
[<c01087f0>] do_IRQ+0x120/0x130
[<c0106a20>] common_interrupt+0x18/0x20
[<c01200fe>] __do_softirq+0x2e/0x80
[<c01b3b60>] acpi_irq+0x0/0x16
[<c0120177>] do_softirq+0x27/0x30
[<c01087cb>] do_IRQ+0xfb/0x130
[<c0106a20>] common_interrupt+0x18/0x20
[<c02124a2>] pci_conf1_write+0x92/0xf0
[<c01b3e86>] acpi_os_write_pci_configuration+0x69/0x76
...

It seems that once some particular piece of PCI hardware gets
initialized, it causes a flood of unexpected interrupts. The kernel
then disables IRQ 9, which basically breaks most of my devices because
that's the one they all share.
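
For anyone following along, here is a rough user-space sketch of the
bookkeeping that I believe produces the "nobody cared" message. It is
not kernel code: the struct and function names are made up for
illustration, and the two thresholds are what I recall from 2.6-era
note_interrupt(), so treat them as assumptions and check your own tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed thresholds, recalled from 2.6-era kernels; verify locally. */
#define IRQ_WINDOW      100000  /* interrupts per evaluation window */
#define UNHANDLED_LIMIT  99900  /* more unhandled than this => stuck */

/* Hypothetical stand-in for the per-IRQ counters. */
struct irq_stats {
    unsigned int count;      /* interrupts seen in the current window */
    unsigned int unhandled;  /* how many of those no handler claimed */
    bool disabled;           /* set once the line has been shut off */
};

/*
 * Record one interrupt. `handled` is true if some handler claimed it.
 * Returns true if this call is the one that disables the line.
 */
static bool note_interrupt_model(struct irq_stats *s, bool handled)
{
    assert(s != NULL);

    if (s->disabled)
        return false;
    if (!handled)
        s->unhandled++;
    if (++s->count < IRQ_WINDOW)
        return false;

    /* Window is full: decide, then start a fresh window. */
    bool stuck = s->unhandled > UNHANDLED_LIMIT;
    s->count = 0;
    s->unhandled = 0;
    if (stuck)
        s->disabled = true;   /* "Disabling IRQ #9" */
    return stuck;
}
```

Under this model, a line only gets disabled when nearly every interrupt
in a window goes unclaimed, which matches a device screaming on IRQ 9
with no driver answering for it.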

I captured the following boots for different command lines. The ACPI
and non-ACPI cases die at different points, but with the same result.

root=/dev/hda1 ro console=ttyS0,115200n8
https://jim.sh/svn/jim/devl/toughbook/log/default.txt

root=/dev/hda1 ro console=ttyS0,115200n8 acpi=off
https://jim.sh/svn/jim/devl/toughbook/log/acpioff.txt

root=/dev/hda1 ro console=ttyS0,115200n8 acpi=off pci=usepirqmask
https://jim.sh/svn/jim/devl/toughbook/log/usepirqmask.txt

lspci, lspci -vxxxn, and /proc/interrupts:
https://jim.sh/svn/jim/devl/toughbook/log/lspci.txt

Could someone who knows more than me about PCI IRQs take a quick look
at those dumps and tell me if there's anything obvious that I'm
missing, or some way to work around the problem?

-jim


2004-10-15 14:22:30

by Alan

Subject: Re: PCI IRQ problems: "nobody cared!"

On Fri, 2004-10-15 at 09:37, Jim Paris wrote:
> Could someone who knows more than me about PCI IRQs take a quick look
> at those dumps and tell me if there's anything obvious that I'm
> missing, or some way to work around the problem?

I posted a patch to poll when we find IRQs have gone astray. It needs
redoing versus Ingo's new 2.6.9 IRQ code but should apply cleanly to
2.6.8. You can then boot with "irqpoll" as a boot option and the box
should survive.
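
The idea can be sketched in user space roughly as follows. This is a
simplified model, not the patch itself: the table, the fake device and
every name in it are hypothetical, and the real code has locking and
per-IRQ action chains that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_IRQS_MODEL 16  /* hypothetical table size for the model */

/* A handler returns true if its device actually had work pending. */
typedef bool (*handler_fn)(void *dev);

struct irq_slot {
    handler_fn handler;   /* NULL if nothing is registered on this IRQ */
    void *dev;
};

/* Hypothetical device used only to exercise the model. */
struct fake_dev {
    bool pending;
};

static bool fake_handler(void *dev)
{
    struct fake_dev *d = dev;

    if (!d->pending)
        return false;     /* not ours */
    d->pending = false;   /* service the device, releasing the line */
    return true;
}

/*
 * Run the handler registered on `irq`. If nobody claims the interrupt,
 * poll the handlers registered on every other IRQ in case the
 * interrupt arrived on the wrong line. Returns true if any handler
 * claimed it.
 */
static bool dispatch_with_poll(struct irq_slot table[NR_IRQS_MODEL], int irq)
{
    assert(irq >= 0 && irq < NR_IRQS_MODEL);

    if (table[irq].handler && table[irq].handler(table[irq].dev))
        return true;

    bool claimed = false;
    for (int i = 0; i < NR_IRQS_MODEL; i++) {
        if (i == irq || !table[i].handler)
            continue;
        if (table[i].handler(table[i].dev))
            claimed = true;
    }
    return claimed;
}
```

In this model a device whose driver registered on IRQ 11 but which
signals on IRQ 9 still gets serviced, at the cost of calling every
handler whenever an interrupt goes unclaimed.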

Alan

2004-10-15 18:51:47

by Jim Paris

Subject: Re: PCI IRQ problems: "nobody cared!"

> I posted a patch to poll when we find IRQs have gone astray. It needs
> redoing versus Ingo's new 2.6.9 IRQ code but should apply cleanly to
> 2.6.8. You can then boot with "irqpoll" as a boot option and the box
> should survive.

You rock! irqpoll works like a charm. I get the same error in the
same place, but now all of my devices still work. I don't see any
obvious performance impact (although I haven't tested it much).

I applied this irqpoll patch:
http://groups.google.com/groups?selm=2BunT-6Be-15%40gated-at.bofh.it
and then some minor fixes (see below).

The log for this boot is at
https://jim.sh/svn/jim/devl/toughbook/log/irqpoll.txt
in case anyone is interested.

-jim


diff -urN ac/arch/i386/kernel/irq.c jim/arch/i386/kernel/irq.c
--- ac/arch/i386/kernel/irq.c 2004-10-15 13:18:46.000000000 -0400
+++ jim/arch/i386/kernel/irq.c 2004-10-15 13:18:26.000000000 -0400
@@ -391,11 +391,11 @@
 {
 if((irqfixup == 2 && irq == 0) || action_ret == IRQ_NONE)
 {
+ int ok;
 #ifdef CONFIG_4KSTACKS
 u32 *isp;
 union irq_ctx * curctx;
 union irq_ctx * irqctx;
- int ok;

 curctx = (union irq_ctx *) current_thread_info();
 irqctx = hardirq_ctx[smp_processor_id()];
@@ -435,7 +435,7 @@
 #else
 spin_unlock(&desc->lock);

- ok = misrouted_irq(irq, desc, regs);
+ ok = misrouted_irq(irq, regs);

 spin_lock(&desc->lock);
 #endif

2004-10-16 05:55:53

by Brown, Len

Subject: Re: PCI IRQ problems: "nobody cared!"

On Fri, 2004-10-15 at 04:37, Jim Paris wrote:
> I'm having some strange PCI IRQ problems on my new laptop (Panasonic
> Toughbook CF-M34UTVZKM) under 2.6.8-1-686 (Debian). I'm at a loss to
> figure out their source, other than the fact that Toughbooks seem to
> have a particularly crappy BIOS.

Jim,

Before doing anything else, please verify that you're running the latest
BIOS for this box.

I don't think this is an interrupt misrouting problem -- I think the
problem is that a device is pulling on a shared interrupt before
we get its driver loaded.

You might find that using "noirqdebug" gets you through the boot
sequence and that after the drivers are loaded the system runs
normally. This may be preferable to the irqpoll workaround.
Of course both are workarounds and neither actually helps us
identify the root cause.

Note that PNPBIOS shouldn't run on an ACPI-enabled system -- probably no
harm, but use a CONFIG_PNPBIOS=n kernel to verify there is no change in
your ACPI enabled result.

ACPI: PCI Interrupt Link [LNKA] (IRQs *9)
ACPI: PCI Interrupt Link [LNKB] (IRQs *9)
ACPI: PCI Interrupt Link [LNKC] (IRQs 9) *0, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs *9)
ACPI: PCI Interrupt Link [LNKE] (IRQs *9)
ACPI: PCI Interrupt Link [LNKF] (IRQs *9)
ACPI: PCI Interrupt Link [LNKG] (IRQs 9) *0, disabled.
ACPI: PCI Interrupt Link [LNKH] (IRQs 9) *0, disabled.

There isn't any opportunity for "mis-routing" IRQs here, IRQ 9 is the
only choice the BIOS gives us. It would be interesting to look at your
BIOS setup to see if there are some parameters you can use to allow the
BIOS to give us more freedom. Sometimes these BIOS setup options allow
selecting broken configurations, so start by restoring the system to its
default configuration if there is a setup option to do that.
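
If you want to check link lines like the ones above across several
boot logs, a small parser along these lines might help. It is only a
sketch: it assumes the line format is exactly what appears in this
dmesg output (link name in square brackets, candidate IRQs in
parentheses, '*' marking the one currently selected), and the struct
and function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINK_IRQS 16  /* arbitrary cap for this sketch */

/* Hypothetical record for one "PCI Interrupt Link" line. */
struct link_info {
    char name[8];             /* e.g. "LNKA" */
    int irqs[MAX_LINK_IRQS];  /* IRQs the BIOS offers for this link */
    int nirqs;
    int active;               /* IRQ marked '*' in the list, or -1 */
};

/* Parse one dmesg line; returns false if it is not a link line. */
static bool parse_link_line(const char *line, struct link_info *out)
{
    assert(line != NULL && out != NULL);

    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    if (!lb || !rb)
        return false;

    size_t len = (size_t)(rb - lb - 1);
    if (len == 0 || len >= sizeof out->name)
        return false;
    memcpy(out->name, lb + 1, len);
    out->name[len] = '\0';

    const char *p = strstr(rb, "(IRQs");
    if (!p)
        return false;
    p += strlen("(IRQs");

    out->nirqs = 0;
    out->active = -1;
    while (*p && *p != ')') {
        bool star = false;
        char *end;

        while (*p == ' ')
            p++;
        if (*p == '*') {      /* '*' marks the currently routed IRQ */
            star = true;
            p++;
        }
        long v = strtol(p, &end, 10);
        if (end == p)
            break;            /* no more numbers in the list */
        if (out->nirqs < MAX_LINK_IRQS)
            out->irqs[out->nirqs++] = (int)v;
        if (star)
            out->active = (int)v;
        p = end;
    }
    return out->nirqs > 0;
}
```

A link whose list has a single entry, as every link does here, leaves
the kernel no alternative to route to.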

ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 9
ACPI: PCI interrupt 0000:00:02.0[A] -> GSI 9 (level, low) -> IRQ 9
ACPI: PCI interrupt 0000:00:1d.0[A] -> GSI 9 (level, low) -> IRQ 9
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 9
ACPI: PCI interrupt 0000:00:1d.1[B] -> GSI 9 (level, low) -> IRQ 9
irq 9: nobody cared!
...
handlers:
[<c01b3b60>] (acpi_irq+0x0/0x16)
Disabling IRQ #9
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 9
ACPI: PCI interrupt 0000:00:1f.1[A] -> GSI 9 (level, low) -> IRQ 9

It appears that a device is pulling on its interrupt line and so as soon
as we enable its link we get clobbered. acpi_irq() is the only handler
on the IRQ at that point and it gets clobbered.
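
To illustrate why one misbehaving device is enough, here is a toy
model of a shared, level-triggered line. It is a sketch under
simplifying assumptions (the line is just the OR of every device's
request, and a driver always quiets its own device); all names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device sitting on a shared level-triggered line. */
struct shared_dev {
    bool asserting;   /* device is currently pulling on the line */
    bool has_driver;  /* a handler for it is registered on the IRQ */
};

/* The line stays active as long as any device is asserting it. */
static bool line_asserted(const struct shared_dev *devs, size_t n)
{
    assert(devs != NULL);
    for (size_t i = 0; i < n; i++)
        if (devs[i].asserting)
            return true;
    return false;
}

/*
 * One pass through the registered handlers. Only devices that have a
 * driver can be quieted. Returns true if any handler found work.
 */
static bool service_once(struct shared_dev *devs, size_t n)
{
    bool handled = false;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].has_driver && devs[i].asserting) {
            devs[i].asserting = false;
            handled = true;
        }
    }
    return handled;
}

/*
 * Keep taking interrupts until the line goes quiet or `limit`
 * unclaimed interrupts have been seen. Returns the unclaimed count.
 */
static unsigned int storm_length(struct shared_dev *devs, size_t n,
                                 unsigned int limit)
{
    unsigned int unhandled = 0;

    while (line_asserted(devs, n) && unhandled < limit) {
        if (!service_once(devs, n))
            unhandled++;
    }
    return unhandled;
}
```

With every device covered by a driver the count is zero; with a single
asserting device that has no driver yet, it runs straight to the
limit, which is the flood the kernel then gives up on.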

One thing that might help is if you try Bjorn's patch
to delay enabling the PCI Interrupt Links until the
actual drivers request that their interrupt be enabled:
http://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/remove-unconditional-pci-acpi-irq-routing.patch

Also, it would be a good idea to identify the device at the root cause.
As a start, go to your BIOS SETUP
and disable all devices that it allows you to disable.
Looks like you have a couple of NICs behind a cardbus bridge.
If they are physically removable, then take them out.

If you supply the output from lspci -vv
and acpidmp, then we can find out exactly
what devices are attached to which interrupt link
and that will probably tell us which device is being bad.
acpidmp is in /usr/sbin/, or it can be had from pmtools here:
http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/

The failure in the acpioff case is a clue I think.
Several devices get enabled on IRQ9 with no issues,
but the system dies when yenta gets enabled,
so perhaps the devices behind that bridge are at fault.

They appear to be NICs, you might try pulling out the cable
and disabling the Radio to see if it allows us to
get through boot and get those drivers loaded.

cheers,
-Len


2004-10-17 00:12:45

by Jim Paris

Subject: Re: PCI IRQ problems: "nobody cared!"

Hi Len,

Thanks for the tips. I tried your suggestions (a bit out of order).
Here are the results.

> Before doing anything else, please verify that you're running the latest
> BIOS for this box.

The BIOS and EC are both the latest version available.

> You might find that using "noirqdebug" gets you through the boot
> sequence and that after the drivers are loaded the system runs
> normally. This may be preferable to the irqpoll workaround.
> Of course both are workarounds and neither actually helps us
> identify the root cause.

"noirqdebug" results in a freeze where it would previously have given
the "irq 9: nobody cared!" error.

> Note that PNPBIOS shouldn't run on an ACPI-enabled system -- probably no
> harm, but use a CONFIG_PNPBIOS=n kernel to verify there is no change in
> your ACPI enabled result.

That was on the Debian kernel. On my own 2.6.8.1 build, I have
CONFIG_PNP=n, and the result (without irqpoll) is the same.
My full config:
https://jim.sh/svn/jim/devl/toughbook/log/current.config

> The failure in the acpioff case is a clue I think. Several devices
> get enabled on IRQ9 with no issues, but the system dies when yenta
> gets enabled, so perhaps the devices behind that bridge are at
> fault.

It doesn't seem specific to a particular device.

With my 2.6.8.1 and acpi=off, it dies when uhci_hcd is loaded, before yenta.
https://jim.sh/svn/jim/devl/toughbook/log/uhcidie.txt
(fyi, I've also added some ram since the last logs I posted)

In order to get Debian installed with their kernel, I needed to have
IDE working, as well as one of either wireless net, wired net, or usb.
I tried a ton of configurations, until I finally found that the only
thing that worked was using acpi=off, and then only loading modules
8139too for net, and generic for IDE. If I used piix for IDE, it
would cause the same irq9 problem on load and break the other devices
(but the IDE would still work).

[ I should have mentioned and tested PIIX earlier. See the end of
this mail for what I found out. ]

> It would be interesting to look at your
> BIOS setup to see if there are some parameters you can use to allow the
> BIOS to give us more freedom.
..
> As a start, go to your BIOS SETUP
> and disable all devices that it allows you to disable.

BIOS lets me enable or disable the following:
serial ports, touch screen, parallel port,
touch pad, LAN, modem, wireless LAN.
For the devices on the first line, I can also choose their port/IRQ
manually, but the BIOS (Phoenix) doesn't give me any more control than
that. The motherboard has six DIP switches, but they are undocumented.

Setting all of them to "disable" doesn't change anything for acpi=on
or acpi=off.

> Also, it would be a good idea to identify the device at the root cause.
> Looks like you have a couple of NICs behind a cardbus bridge.
> If they are physically removable, then take them out.

I physically removed both the NIC/modem combo and the wireless card.
Both the acpi=on and acpi=off cases failed the same way as when the
cards are present:
https://jim.sh/svn/jim/devl/toughbook/log/removed.txt
https://jim.sh/svn/jim/devl/toughbook/log/removed-acpioff.txt

> If you supply the output from lspci -vv
> and acpidmp, then we can find out exactly
> what devices are attached to which interrupt link
> and that will probably tell us which device is being bad.

Sure. With acpi and irqpoll:
https://jim.sh/svn/jim/devl/toughbook/log/lspcivv.txt
https://jim.sh/svn/jim/devl/toughbook/log/acpidmp.txt

> One thing that might help is if you try Bjorn's patch
> to delay enabling the PCI Interrupt Links until the
> actual drivers request that their interrupt be enabled:
> http://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/remove-unconditional-pci-acpi-irq-routing.patch

First, 2.6.9-final has the same problems:
https://jim.sh/svn/jim/devl/toughbook/log/269.txt

2.6.9-final-routeirq (with Bjorn's patches) breaks at IDE initialization:
https://jim.sh/svn/jim/devl/toughbook/log/269bjorn.txt

But! 2.6.9-final-routeirq, _without_ CONFIG_BLK_DEV_PIIX (just
CONFIG_BLK_DEV_GENERIC), works!
https://jim.sh/svn/jim/devl/toughbook/log/nopiix.txt

And 2.6.9-final-routeirq with neither ACPI nor PIIX also works:
https://jim.sh/svn/jim/devl/toughbook/log/noacpi-nopiix.txt

For completeness, 2.6.9-final-routeirq with pci=routeirq and no PIIX breaks:
https://jim.sh/svn/jim/devl/toughbook/log/routeirq-nopiix.txt

So it appears that the problem is very much related to the IDE
controller. That's a little bit surprising, because IDE was the only
thing that consistently worked when everything else broke.

-jim