2009-10-02 07:33:58

by Nick Piggin

[permalink] [raw]
Subject: Patch "USB: Work around BIOS bugs by quiescing USB controllers earlier" causes MCEs

Hi,

Your patch db8be50c4307dac2b37305fc59c8dc0f978d09ea is causing my
ia64 Altix system to die with an MCE in early boot.

Good dmesg looks like this:
ACPI: PCI Root Bridge [P001] (0001:00)
pci 0001:00:01.0: reg 10 io port: [0x1000-0x10ff]
pci 0001:00:01.0: reg 14 64bit mmio: [0x700000-0x703fff]
pci 0001:00:01.0: reg 1c 64bit mmio: [0x710000-0x71ffff]
pci 0001:00:01.0: reg 30 32bit mmio pref: [0x800000-0x8fffff]
pci 0001:00:01.0: supports D1 D2
pci 0001:00:02.0: reg 10 64bit mmio: [0x720000-0x72ffff]
pci 0001:00:02.0: PME# supported from D3hot
pci 0001:00:02.0: PME# disabled
pci 0001:00:02.1: reg 10 64bit mmio: [0x730000-0x73ffff]
pci 0001:00:02.1: PME# supported from D3hot
pci 0001:00:02.1: PME# disabled
pci 0001:00:03.0: reg 10 64bit mmio pref: [0x900000-0x9fffff]
pci 0001:01:01.0: reg 10 32bit mmio: [0xa00000-0xa00fff]
pci 0001:01:01.0: supports D1 D2

Bad (which is less verbose because it came from a console output):
ACPI: PCI Root Bridge [P001] (0001:00)
pci 0001:00:02.0: PME# supported from D3hot
pci 0001:00:02.0: PME# disabled
pci 0001:00:02.1: PME# supported from D3hot
pci 0001:00:02.1: PME# disabled
Entered OS MCA handler. PSP=20000000fff21120 cpu=2 monarch=1
<6>cpu 2, MCA inconsistent r12 and r13, original stack not modified
<6>Entered OS MCA handler. PSP=20000000fff21120 cpu=1 monarch=0
<6>cpu 1, MCA inconsistent r12 and r13, original stack not modified
<6>Entered OS MCA handler. PSP=20000000fff21120 cpu=0 monarch=0
<6>cpu 0, MCA inconsistent r12 and r13, original stack not modified
<6>Entered OS MCA handler. PSP=20000000fff21120 cpu=3 monarch=0
<6>cpu 3, MCA inconsistent r12 and r13, original stack not modified
<6>All OS MCA slaves have reached rendezvous
mlogbuf_finish: printing switched to urgent mode, MCA/INIT might be dodgy or fail.
Delaying for 5 seconds...

Relevant pci devices:
0001:00:03.0 PCI bridge: IBM PCI-X to PCI-X Bridge (rev 03)
0001:01:01.0 USB Controller: NEC Corporation USB (rev 43)
0001:01:01.1 USB Controller: NEC Corporation USB (rev 43)
0001:01:01.2 USB Controller: NEC Corporation USB 2.0 (rev 04)


Let me know any more info I can provide.


2009-10-02 10:26:41

by Mikael Pettersson

[permalink] [raw]
Subject: Re: Patch "USB: Work around BIOS bugs by quiescing USB controllers earlier" causes MCEs

Nick Piggin writes:
> Hi,
>
> Your patch db8be50c4307dac2b37305fc59c8dc0f978d09ea is causing my
> ia64 Altix system to die with an MCE in early boot.

The same commit has been confirmed by two people on the ARM list
to cause boot failures on two different Intel XScale IOP machines.
The machines have serial consoles, but only show

Uncompressing Linux... done. Booting the kernel.

before they hang.

2009-10-02 19:28:39

by Mikael Pettersson

[permalink] [raw]
Subject: Re: Patch "USB: Work around BIOS bugs by quiescing USB controllers earlier" causes MCEs

Mikael Pettersson writes:
> Nick Piggin writes:
> > Hi,
> >
> > Your patch db8be50c4307dac2b37305fc59c8dc0f978d09ea is causing my
> > ia64 Altix system to die with an MCE in early boot.
>
> The same commit has been confirmed by two people on the ARM list
> to cause boot failures on two different Intel XScale IOP machines.
> The machines have serial consoles, but only show
>
> Uncompressing Linux... done. Booting the kernel.
>
> before they hang.

I've just investigated this on one of my ARM boxes that this commit kills.

The commit changed quirk_usb_early_handoff to be a FIXUP_HEADER, which
caused it to be invoked during the early stages of the platform's PCI
init (arch/arm/kernel/bios32.c). quirk_usb_handoff_uhci() gets a bogus
I/O base address, passes that down to uhci_reset_hc(), causing a kernel
page fault in the first "outw(UHCI_USBCMD_HCRESET, base + UHCI_USBCMD);",
causing the kernel to oops.

(All this occurs before the serial console works, so I had to add a
platform-specific puts() and lots of tracing statements.)

Changing this quirk back to a FIXUP_FINAL allows the platform's PCI
init to complete. Later on the generic pci_init() calls the quirk,
which now gets the correct I/O base address, and the outw()s in
uhci_reset_hc() don't fail.

2009-10-06 04:44:40

by Nick Piggin

[permalink] [raw]
Subject: Re: Patch "USB: Work around BIOS bugs by quiescing USB controllers earlier" causes MCEs

On Fri, Oct 02, 2009 at 09:28:32PM +0200, Mikael Pettersson wrote:
> Mikael Pettersson writes:
> > Nick Piggin writes:
> > > Hi,
> > >
> > > Your patch db8be50c4307dac2b37305fc59c8dc0f978d09ea is causing my
> > > ia64 Altix system to die with an MCE in early boot.
> >
> > The same commit has been confirmed by two people on the ARM list
> > to cause boot failures on two different Intel XScale IOP machines.
> > The machines have serial consoles, but only show
> >
> > Uncompressing Linux... done. Booting the kernel.
> >
> > before they hang.
>
> I've just investigated this on one of my ARM boxes that this commit kills.
>
> The commit changed quirk_usb_early_handoff to be a FIXUP_HEADER, which
> caused it to be invoked during the early stages of the platform's PCI
> init (arch/arm/kernel/bios32.c). quirk_usb_handoff_uhci() gets a bogus
> I/O base address, passes that down to uhci_reset_hc(), causing a kernel
> page fault in the first "outw(UHCI_USBCMD_HCRESET, base + UHCI_USBCMD);",
> causing the kernel to oops.
>
> (All this occurs before the serial console works, so I had to add a
> platform-specific puts() and lots of tracing statements.)
>
> Changing this quirk back to a FIXUP_FINAL allows the platform's PCI
> init to complete. Later on the generic pci_init() calls the quirk,
> which now gets the correct I/O base address, and the outw()s in
> uhci_reset_hc() don't fail.

Thanks for this, I guess we await David's response.

2009-10-06 04:54:27

by David Woodhouse

[permalink] [raw]
Subject: Re: Patch "USB: Work around BIOS bugs by quiescing USB controllers earlier" causes MCEs

On Tue, 2009-10-06 at 06:44 +0200, Nick Piggin wrote:
> > Changing this quirk back to a FIXUP_FINAL allows the platform's PCI
> > init to complete. Later on the generic pci_init() calls the quirk,
> > which now gets the correct I/O base address, and the outw()s in
> > uhci_reset_hc() don't fail.
>
> Thanks for this, I guess we await David's response.

The problem is that FIXUP_FINAL is too late. If the USB controllers are
still active at the time the IOMMU is initialised, then DMA for their
'legacy keyboard/mouse emulation' will go AWOL -- because the BIOS
authors are too incompetent to tell the OS about it correctly.

And then the whole system goes down, locked up in SMM mode because the
BIOS authors are too incompetent to write code which can _cope_ with the
DMA going AWOL. Or do any testing.

One option is to add a new quirk somewhere in the middle, or to invoke
the USB fixups manually rather than by the quirk mechanism.

Another might be to move the initialisation of the IOMMU to later in the
boot, so it happens just after the final PCI fixups.

But I distinctly remember a conversation, probably with hpa, in which he
(or whoever) said that he wants to kill the USB controllers even
_earlier_ in the boot process, because of other problems. So I was kind
of hoping that whoever it was would pipe up. And then we'd take that
option.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-10-11 22:51:53

by Mikael Pettersson

[permalink] [raw]
Subject: Re: Patch "USB: Work around BIOS bugs by quiescing USB controllers earlier" causes MCEs

David Woodhouse writes:
> On Tue, 2009-10-06 at 06:44 +0200, Nick Piggin wrote:
> > > Changing this quirk back to a FIXUP_FINAL allows the platform's PCI
> > > init to complete. Later on the generic pci_init() calls the quirk,
> > > which now gets the correct I/O base address, and the outw()s in
> > > uhci_reset_hc() don't fail.
> >
> > Thanks for this, I guess we await David's response.
>
> The problem is that FIXUP_FINAL is too late. If the USB controllers are
> still active at the time the IOMMU is initialised, then DMA for their
> 'legacy keyboard/mouse emulation' will go AWOL -- because the BIOS
> authors are too incompetent to tell the OS about it correctly.
>
> And then the whole system goes down, locked up in SMM mode because the
> BIOS authors are too incompetent to write code which can _cope_ with the
> DMA going AWOL. Or do any testing.
>
> One option is to add a new quirk somewhere in the middle, or to invoke
> the USB fixups manually rather than by the quirk mechanism.
>
> Another might be to move the initialisation of the IOMMU to later in the
> boot, so it happens just after the final PCI fixups.
>
> But I distinctly remember a conversation, probably with hpa, in which he
> (or whoever) said that he wants to kill the USB controllers even
> _earlier_ in the boot process, because of other problems. So I was kind
> of hoping that whoever it was would pipe up. And then we'd take that
> option.

I've done a more detailed analysis of this patch now.

On my arm/iop/n2100 box which crashes when quirk_usb_early_handoff()
is a FIXUP_HEADER, the concrete reason is that a bogus base address is
returned by pci_resource_start() and passed on to outw() which oopses.
When the quirk is a FIXUP_FINAL the base address is correct and outw() works.

So I hacked the kernel to trace all reads and writes to any resource's
".start" field to see how the base address correction was done.

On the n2100 with the quirk as a FIXUP_FINAL, this happens on the 1st UHCI device:
1. drivers/pci/probe.c:__pci_read_base() sets start to 0xfce0 based primarily
on pci_read_config_word()
2. arch/arm/kernel/bios32.c:pdev_fixup_device_resources() runs but leaves it alone
3. kernel/resource.c:allocate_resource() inserts the resource which immediately
resets it from the root resource (0x90000000), and after taking care of
alignment and sibling resources it becomes 0x90000800 and never changes again.
4. quirk_usb_handoff_uhci() sees base 0x90000800 and the outw() works.

The 2nd UHCI device goes through the exact same steps, start is initially
0xfce0 but gets changed to 0x90000820 before quirk_usb_handoff_uhci().

When the quirk is a FIXUP_HEADER, this happens:
1. drivers/pci/probe.c:__pci_read_base() sets start to 0xfce0.
2. quirk_usb_handoff_uhci() sees pio_enabled() [from pci_read_config_word()],
gets base = 0xfce0, passes that to outw(), which dies.

On a different arm box (arm/mach-ixp4xx/) which this patch doesn't kill,
the same bogus start address 0xfce0 is detected by __pci_read_base().
But on that box !pio_enabled() is true so quirk_usb_handoff_uhci() bails
and doesn't try to do any I/O accesses. Later on this UHCI's base is
changed to 0x1000 by allocate_resource().

Finally I tested on an old PentiumIII PC. On that one __pci_read_base()
assigned 0xb400 as the start address, that address never changes afterwards,
and the quirk runs without oopsing regardless of whether it's HEADER or FINAL.

So the basic problem with the patch is that it moves PCI resource using code
to before PCI resources have been fixed up, and that may work on x86 PCs but
doesn't work in general.

2009-10-12 17:34:48

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Patch "USB: Work around BIOS bugs by quiescing USB controllers earlier" causes MCEs

On Monday 05 October 2009 10:44:01 pm Nick Piggin wrote:
> On Fri, Oct 02, 2009 at 09:28:32PM +0200, Mikael Pettersson wrote:
> > Mikael Pettersson writes:
> > > Nick Piggin writes:
> > > > Hi,
> > > >
> > > > Your patch db8be50c4307dac2b37305fc59c8dc0f978d09ea is causing my
> > > > ia64 Altix system to die with an MCE in early boot.
> > >
> > > The same commit has been confirmed by two people on the ARM list
> > > to cause boot failures on two different Intel XScale IOP machines.
> > > The machines have serial consoles, but only show
> > >
> > > Uncompressing Linux... done. Booting the kernel.
> > >
> > > before they hang.
> >
> > I've just investigated this on one of my ARM boxes that this commit kills.
> >
> > The commit changed quirk_usb_early_handoff to be a FIXUP_HEADER, which
> > caused it to be invoked during the early stages of the platform's PCI
> > init (arch/arm/kernel/bios32.c). quirk_usb_handoff_uhci() gets a bogus
> > I/O base address, passes that down to uhci_reset_hc(), causing a kernel
> > page fault in the first "outw(UHCI_USBCMD_HCRESET, base + UHCI_USBCMD);",
> > causing the kernel to oops.
> >
> > (All this occurs before the serial console works, so I had to add a
> > platform-specific puts() and lots of tracing statements.)
> >
> > Changing this quirk back to a FIXUP_FINAL allows the platform's PCI
> > init to complete. Later on the generic pci_init() calls the quirk,
> > which now gets the correct I/O base address, and the outw()s in
> > uhci_reset_hc() don't fail.
>
> Thanks for this, I guess we await David's response.

The problem seen by Andrew on ia64 is that FIXUP_HEADER happens between
device discovery and the PCI fixups, and in this interval, the struct
pci_dev contains PCI bus addresses, not CPU (host) addresses. Often
the PCI bus address and the CPU address are the same, but on machines
where they differ, we can't access PCI BARs in this interval.

I don't know about ARM, but on ia64, we do have enough information to
avoid this problem by always putting the CPU addresses in the pci_dev,
i.e., by doing the PCI fixups immediately at device discovery-time.

I think this is the best solution, because it removes the restriction
that FIXUP_HEADER can't access PCI BARs on certain machines.

Bjorn