2005-11-27 11:35:17

by Michael Buesch

[permalink] [raw]
Subject: Latest GIT: USB ehci_hcd broken (spinlock corruption)

Hi,

Latest GIT code oopses in the USB driver with a spinlock corruption:

hub 3-0:1.0: 2 ports detected
usb 1-1: device descriptor read/64, error -71
usb 1-1: new full speed USB device using ohci_hcd and address 3
usb 1-1: device descriptor read/64, error -71
usb 1-1: device descriptor read/64, error -71
PCI: Enabling device 0001:10:1b.2 (0004 -> 0006)
ehci_hcd 0001:10:1b.2: EHCI Host Controller
BUG: spinlock bad magic on CPU#0, modprobe/2001
lock: cfe6ce64, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
Call Trace:
[DF1CDC40] [C000988C] show_stack+0x5c/0x1a0 (unreliable)
[DF1CDC70] [C012D408] spin_bug+0xa8/0xd0
[DF1CDC90] [C012D608] _raw_spin_lock+0x88/0x180
[DF1CDCA0] [C02B5E10] _spin_lock_irqsave+0x30/0x50
[DF1CDCC0] [E2587F58] ehci_hub_control+0x38/0x750 [ehci_hcd]
[DF1CDCF0] [E25886DC] ehci_port_power+0x6c/0xa0 [ehci_hcd]
[DF1CDD10] [E25887F4] ehci_pci_reinit+0xe4/0x320 [ehci_hcd]
[DF1CDD50] [E258A3D0] ehci_pci_reset+0xe0/0x5f0 [ehci_hcd]
[DF1CDD80] [E25ACD38] usb_add_hcd+0x68/0x7b0 [usbcore]
[DF1CDDD0] [E25B77CC] usb_hcd_pci_probe+0x19c/0x2d0 [usbcore]
[DF1CDE00] [C0133164] pci_device_probe+0x84/0xb0
[DF1CDE20] [C018FA84] driver_probe_device+0x64/0x110
[DF1CDE40] [C018FC94] __driver_attach+0x94/0xb0
[DF1CDE60] [C018EC9C] bus_for_each_dev+0x5c/0xa0
[DF1CDE90] [C018F7E4] driver_attach+0x24/0x40
[DF1CDEA0] [C018F230] bus_add_driver+0x90/0x180
[DF1CDEC0] [C01900CC] driver_register+0x6c/0x80
[DF1CDEF0] [C0132C60] __pci_register_driver+0xb0/0x100
[DF1CDF10] [E1035030] ehci_hcd_pci_init+0x30/0x90 [ehci_hcd]
[DF1CDF20] [C0048AD8] sys_init_module+0x148/0x3b0
[DF1CDF40] [C000E1AC] ret_from_syscall+0x0/0x44
--- Exception: c01 at 0xff7391c
LR = 0x10003a00
BUG: spinlock lockup on CPU#0, modprobe/2001, cfe6ce64
Call Trace:
[DF1CDC60] [C000988C] show_stack+0x5c/0x1a0 (unreliable)
[DF1CDC90] [C012D6A8] _raw_spin_lock+0x128/0x180
[DF1CDCA0] [C02B5E10] _spin_lock_irqsave+0x30/0x50
[DF1CDCC0] [E2587F58] ehci_hub_control+0x38/0x750 [ehci_hcd]
[DF1CDCF0] [E25886DC] ehci_port_power+0x6c/0xa0 [ehci_hcd]
[DF1CDD10] [E25887F4] ehci_pci_reinit+0xe4/0x320 [ehci_hcd]
[DF1CDD50] [E258A3D0] ehci_pci_reset+0xe0/0x5f0 [ehci_hcd]
[DF1CDD80] [E25ACD38] usb_add_hcd+0x68/0x7b0 [usbcore]
[DF1CDDD0] [E25B77CC] usb_hcd_pci_probe+0x19c/0x2d0 [usbcore]
[DF1CDE00] [C0133164] pci_device_probe+0x84/0xb0
[DF1CDE20] [C018FA84] driver_probe_device+0x64/0x110
[DF1CDE40] [C018FC94] __driver_attach+0x94/0xb0
[There is more, but netconsole cut off here]

This is a PowerPC machine (PowerBook G4).
The .config is attached.
2.6.15-rc2-g1778d55e (Nov 23 2005) does not crash.

--
Greetings Michael.


Attachments:
(No filename) (0.00 B)
(No filename) (189.00 B)
Download all attachments

2005-11-27 21:30:56

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Latest GIT: USB ehci_hcd broken (spinlock corruption)

On Sun, 2005-11-27 at 12:34 +0100, Michael Buesch wrote:
> Hi,
>
> Latest GIT code oopses in the USB driver with a spinlock corruption:

(backtrace below)

Looks bad. I went through the code, and I discovered that indeed,
usb_add_hcd() calls driver->reset() before anything else, that is,
before giving the low level driver a chance to initialize it's local
data structure, including the spinlock. On EHCI, at least, reset will
dive deep into the guts of the driver tying among others to acquire the
lock.

That is very bad. It's yet another example of broken "mid-layer" design.
I can't say enough how bad this whole mid-layer hcd stuff is but you
guys don't beleive it.

So what you should do to fix the immediate problem is to have a separate
"init" callback to the low level driver to initialize it's private data
structure before anything else is called. A kind of "constructor". Call
that from usb_add_hcd() before you call reset().

In the long run, the whole hcd layer junk should probably be flipped
upside down though. It's the low level driver that should be in control,
and the hcd layer should act as a "library" used by the hcd driver,
instead of the opposite.

That is, the HCD driver gets probed, gets control first, calls something
to allocate the hcd data structure, gets a chance to initialize it, then
calls usb_add_hcd() etc...

The whole thing is done backward currently.

> hub 3-0:1.0: 2 ports detected
> usb 1-1: device descriptor read/64, error -71
> usb 1-1: new full speed USB device using ohci_hcd and address
> usb 1-1: device descriptor read/64, error -71
> usb 1-1: device descriptor read/64, error -71
> PCI: Enabling device 0001:10:1b.2 (0004 -> 0006)
> ehci_hcd 0001:10:1b.2: EHCI Host Controller
> BUG: spinlock bad magic on CPU#0, modprobe/2001
> lock: cfe6ce64, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
> Call Trace:
> [DF1CDC40] [C000988C] show_stack+0x5c/0x1a0 (unreliable)
> [DF1CDC70] [C012D408] spin_bug+0xa8/0xd0
> [DF1CDC90] [C012D608] _raw_spin_lock+0x88/0x180
> [DF1CDCA0] [C02B5E10] _spin_lock_irqsave+0x30/0x50
> [DF1CDCC0] [E2587F58] ehci_hub_control+0x38/0x750 [ehci_hcd]
> [DF1CDCF0] [E25886DC] ehci_port_power+0x6c/0xa0 [ehci_hcd]
> [DF1CDD10] [E25887F4] ehci_pci_reinit+0xe4/0x320 [ehci_hcd]
> [DF1CDD50] [E258A3D0] ehci_pci_reset+0xe0/0x5f0 [ehci_hcd]
> [DF1CDD80] [E25ACD38] usb_add_hcd+0x68/0x7b0 [usbcore]
> [DF1CDDD0] [E25B77CC] usb_hcd_pci_probe+0x19c/0x2d0 [usbcore]
> [DF1CDE00] [C0133164] pci_device_probe+0x84/0xb0
> [DF1CDE20] [C018FA84] driver_probe_device+0x64/0x110
> [DF1CDE40] [C018FC94] __driver_attach+0x94/0xb0
> [DF1CDE60] [C018EC9C] bus_for_each_dev+0x5c/0xa0
> [DF1CDE90] [C018F7E4] driver_attach+0x24/0x40
> [DF1CDEA0] [C018F230] bus_add_driver+0x90/0x180
> [DF1CDEC0] [C01900CC] driver_register+0x6c/0x80
> [DF1CDEF0] [C0132C60] __pci_register_driver+0xb0/0x100
> [DF1CDF10] [E1035030] ehci_hcd_pci_init+0x30/0x90 [ehci_hcd]
> [DF1CDF20] [C0048AD8] sys_init_module+0x148/0x3b0
> [DF1CDF40] [C000E1AC] ret_from_syscall+0x0/0x44
> --- Exception: c01 at 0xff7391c
> LR = 0x10003a00
> BUG: spinlock lockup on CPU#0, modprobe/2001, cfe6ce64
> Call Trace:
> [DF1CDC60] [C000988C] show_stack+0x5c/0x1a0 (unreliable)
> [DF1CDC90] [C012D6A8] _raw_spin_lock+0x128/0x180
> [DF1CDCA0] [C02B5E10] _spin_lock_irqsave+0x30/0x50
> [DF1CDCC0] [E2587F58] ehci_hub_control+0x38/0x750 [ehci_hcd]
> [DF1CDCF0] [E25886DC] ehci_port_power+0x6c/0xa0 [ehci_hcd]
> [DF1CDD10] [E25887F4] ehci_pci_reinit+0xe4/0x320 [ehci_hcd]
> [DF1CDD50] [E258A3D0] ehci_pci_reset+0xe0/0x5f0 [ehci_hcd]
> [DF1CDD80] [E25ACD38] usb_add_hcd+0x68/0x7b0 [usbcore]
> [DF1CDDD0] [E25B77CC] usb_hcd_pci_probe+0x19c/0x2d0 [usbcore]
> [DF1CDE00] [C0133164] pci_device_probe+0x84/0xb0
> [DF1CDE20] [C018FA84] driver_probe_device+0x64/0x110
> [DF1CDE40] [C018FC94] __driver_attach+0x94/0xb0
> [There is more, but netconsole cut off here]
>
> This is a PowerPC machine (PowerBook G4).
> The .config is attached.
> 2.6.15-rc2-g1778d55e (Nov 23 2005) does not crash.
>

2005-11-27 22:06:07

by Alan Stern

[permalink] [raw]
Subject: Re: Latest GIT: USB ehci_hcd broken (spinlock corruption)

On Mon, 28 Nov 2005, Benjamin Herrenschmidt wrote:

> On Sun, 2005-11-27 at 12:34 +0100, Michael Buesch wrote:
> > Hi,
> >
> > Latest GIT code oopses in the USB driver with a spinlock corruption:
>
> (backtrace below)
>
> Looks bad. I went through the code, and I discovered that indeed,
> usb_add_hcd() calls driver->reset() before anything else, that is,
> before giving the low level driver a chance to initialize it's local
> data structure, including the spinlock. On EHCI, at least, reset will
> dive deep into the guts of the driver tying among others to acquire the
> lock.
>
> That is very bad. It's yet another example of broken "mid-layer" design.
> I can't say enough how bad this whole mid-layer hcd stuff is but you
> guys don't beleive it.

Stop complaining. We've known about this bug for over a month.
It's even mentioned in the OSDL bugzilla:
<http://bugzilla.kernel.org/show_bug.cgi?id=5433>, comment #2. The only
question is why Dave hasn't fixed it yet...

> So what you should do to fix the immediate problem is to have a separate
> "init" callback to the low level driver to initialize it's private data
> structure before anything else is called. A kind of "constructor". Call
> that from usb_add_hcd() before you call reset().

I suggested back in October that the "reset" method should be renamed
"init". It only gets called once, during device initialization, so the
"reset" name is a misnomer. It should do an initialize _and_ a reset.

> In the long run, the whole hcd layer junk should probably be flipped
> upside down though. It's the low level driver that should be in control,
> and the hcd layer should act as a "library" used by the hcd driver,
> instead of the opposite.
>
> That is, the HCD driver gets probed, gets control first, calls something
> to allocate the hcd data structure, gets a chance to initialize it, then
> calls usb_add_hcd() etc...
>
> The whole thing is done backward currently.

Nonsense. It's done just as you described. You just have to think of the
reset method as an initialization method.

For now, you can workaround the bug by adding:

spin_lock_init(&ehci_lock);

as the first line of drivers/usb/host/ehci-pci.c:ehci_pci_reset().

Alan Stern

2005-11-28 00:59:30

by Andrew Morton

[permalink] [raw]
Subject: Re: Latest GIT: USB ehci_hcd broken (spinlock corruption)

Alan Stern <[email protected]> wrote:
>
> For now, you can workaround the bug by adding:
>
> spin_lock_init(&ehci_lock);
>
> as the first line of drivers/usb/host/ehci-pci.c:ehci_pci_reset().

Yup.

--- devel/drivers/usb/host/ehci-pci.c~gregkh-usb-usb-pm-09-fix 2005-11-25 19:03:59.000000000 -0800
+++ devel-akpm/drivers/usb/host/ehci-pci.c 2005-11-25 19:03:59.000000000 -0800
@@ -129,6 +129,8 @@ static int ehci_pci_reset(struct usb_hcd
u32 temp;
int retval;

+ spin_lock_init(&ehci->lock);
+
ehci->caps = hcd->regs;
ehci->regs = hcd->regs + HC_LENGTH(readl(&ehci->caps->hc_capbase));
dbg_hcs_params(ehci, "reset");
_

2005-11-28 16:40:49

by David Brownell

[permalink] [raw]
Subject: Re: Latest GIT: USB ehci_hcd broken (spinlock corruption)


> Looks bad. I went through the code, and I discovered that indeed,
> usb_add_hcd() calls driver->reset() before anything else, that is,
> before giving the low level driver a chance to initialize it's local
> data structure, including the spinlock. On EHCI, at least, reset will
> dive deep into the guts of the driver tying among others to acquire the
> lock.

The last step of the misnamed ehci "reset" routine calls ehci_init()
to set up the HCD data structures ... which is well after the comment
saying "below this point it's all PCI-specific", and in particular it's
after the call to ehci_pci_reinit() which implicitly requires that init
to have been done already. I'll be testing the attached patch, which
moves that data structure init up above that relevant line.

I think I see why I've not been seeing this problem though. The
machines I can use for testing lately don't have per-port power
switching with EHCI ... so this particular codepath doesn't trigger.


> That is very bad. It's yet another example of broken "mid-layer" design.
> I can't say enough how bad this whole mid-layer hcd stuff is but you
> guys don't beleive it.

You _could_ get down off that hobby-horse, Ben. In this case the issue
has nothing to do with how the hcd glue works, or doesn't. :)

We've been slowly refactoring things to make them better. Given that
it needs to be done while not breaking *lots of stuff* it goes slowly.
And the fact that we've been having to fill in some significant missing
pieces while we do that has also made it go slowly.


> So what you should do to fix the immediate problem is to have a separate
> "init" callback to the low level driver to initialize it's private data
> structure before anything else is called. A kind of "constructor". Call
> that from usb_add_hcd() before you call reset().

As of 2.6.15 I think the "reset" semantics have been entirely re-whacked;
we may well change the "reset" method name to match. Not all the code
or comments have caught up to those changes yet. The PCI devices are now
always using the "early usb handoff" logic, but the non-PCI devices need
some updates...

But there's a related point, that usb_add_hcd() sequences things in a
pretty strange order. For one thing, the root hub isn't created at
the time that misnamed reset/init/setup routine is called ... which
means that one-time root hub customization must be done elsewhere, in
some routine that's not for one-time setup. Of course I have a patch
for that; but that calls for a bit more testing.

- Dave



Attachments:
(No filename) (2.48 kB)
e.patch (1.60 kB)
Download all attachments

2005-11-28 22:07:21

by Andrew Morton

[permalink] [raw]
Subject: Re: Latest GIT: USB ehci_hcd broken (spinlock corruption)

David Brownell <[email protected]> wrote:
>
> Rename the EHCI "reset" routine so it better matches what it does (setup);
> and move the one-time data structure setup earlier, before doing anything
> that implicitly relies on it having been completed already.
>
> --- g26.orig/drivers/usb/host/ehci-pci.c 2005-11-12 10:38:26.000000000 -0800
> +++ g26/drivers/usb/host/ehci-pci.c 2005-11-28 08:22:59.000000000 -0800

This fixes the hang on my test box. I don't know if USB still works though..

2005-11-29 06:28:46

by David Brownell

[permalink] [raw]
Subject: Re: Latest GIT: USB ehci_hcd broken (spinlock corruption)

On Monday 28 November 2005 2:07 pm, Andrew Morton wrote:
> David Brownell <[email protected]> wrote:
> >
> > Rename the EHCI "reset" routine so it better matches what it does (setup);
> > and move the one-time data structure setup earlier, before doing anything
> > that implicitly relies on it having been completed already.
> >
> > --- g26.orig/drivers/usb/host/ehci-pci.c 2005-11-12 10:38:26.000000000 -0800
> > +++ g26/drivers/usb/host/ehci-pci.c 2005-11-28 08:22:59.000000000 -0800
>
> This fixes the hang on my test box. I don't know if USB still works though..

Worked for me ... this should get merged to Linus' tree ASAP. (Whoops, this
fix missed RC3.)

- Dave