2008-10-15 23:43:11

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Oops in UHCI when encountering "host controller process error"

I'm trying to get UHCI working in a Xen dom0. This is essentially akin
to making it work with an iommu, as physical memory pages are not
contiguous, and their kernel-visible addresses are not directly usable
as DMA addresses. I'm not too surprised that I'm seeing driver errors
(though e1000 and mpt fusion work fine), so the fact that I'm getting
this error probably isn't a reflection on the UHCI driver.

The problem I'm seeing is this:

xen_create_contiguous_region: vstart=ffff880073ff0000 order=0 addr_bits=20
uhci_hcd 0000:00:1d.0: -> ret ffff880073ff0000 dma 79b6c000
uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
uhci_hcd 0000:00:1d.0: host controller halted, very bad!
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff803acb56>] uhci_scan_schedule+0xa8/0x85f
PGD 0
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
CPU 0
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.27-tip #233
RIP: e030:[<ffffffff803acb56>] [<ffffffff803acb56>] uhci_scan_schedule+0xa8/0x85f
RSP: e02b:ffffffff80657da8 EFLAGS: 00010006
RAX: fffffffffffffff0 RBX: ffff8800738921e0 RCX: ffff880073892158
RDX: ffff880073892158 RSI: 0000000000000000 RDI: ffff880073892158
RBP: ffffffff80657e18 R08: ffffffffffffffff R09: 0000000000008f00
R10: ffff8800738921e0 R11: 0000000000000246 R12: fffffffffffffff0
R13: 0000000000000000 R14: ffff880073892158 R15: ffff880073892000
FS: 0000000000000000(0000) GS:ffffffff805adf40(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff805b2000, task ffffffff8056e3a0)
Stack:
ffffffff80657db8 ffff8800738921a8 ffffffff80657e08 ffffffff80243df5
ffffffff80657dd8 ffffffff803253c3 ffff880073892158 ffffffff80328d89
ffff8800738921e0 ffff8800738921e0 ffff880073892158 0000000000000000
Call Trace:
<IRQ> <0> [<ffffffff80243df5>] ? __mod_timer+0xb8/0xca
[<ffffffff803253c3>] ? __const_udelay+0x44/0x46
[<ffffffff80328d89>] ? _raw_spin_lock+0x68/0x10b
[<ffffffff803aef89>] uhci_irq+0x13f/0x158
[<ffffffff8039744a>] usb_hcd_irq+0x42/0x90
[<ffffffff80251a7e>] ? __update_sched_clock+0x1e/0x93
[<ffffffff8026164a>] handle_IRQ_event+0x2e/0x65
[<ffffffff80262b35>] handle_level_irq+0x91/0xe2
[<ffffffff80214a6d>] handle_irq+0x27/0x36
[<ffffffff80365f05>] xen_evtchn_do_upcall+0x198/0x1be
[<ffffffff8045f8be>] xen_do_hypervisor_callback+0x1e/0x30
<EOI> <0> [<ffffffff802093aa>] ? _stext+0x3aa/0x1000
[<ffffffff802093aa>] ? _stext+0x3aa/0x1000
[<ffffffff8020e5e0>] ? xen_safe_halt+0x10/0x1a
[<ffffffff8020be00>] ? xen_idle+0x34/0x48
[<ffffffff80211188>] ? cpu_idle+0x51/0x92
[<ffffffff8044f018>] ? rest_init+0x5c/0x5e
[<ffffffff805dbd64>] ? start_kernel+0x409/0x414
[<ffffffff805db2ba>] ? x86_64_start_reservations+0xa5/0xa9
[<ffffffff805df532>] ? xen_start_kernel+0x96f/0x981
Code: c8 00 00 00 4c 89 75 c0 41 89 86 d4 00 00 00 48 8b 55 c0 48 8b 42 28 48 8b 40 10 48 83 e8 10 49 89 86 80 00 00 00 e9 e0 06 00 00 <49> 8b 44 24 10 48 83 e8 10 49 89 86 80 00 00 00 41 83 7c 24 74


I'm not too surprised its getting hardware errors, and I wouldn't assume
its a USB-level bug at this point (though if its misusing the DMA API,
it could be a driver bug; I think I saw an iommu-related bug go past,
which could be a clue).

But the crash as a result of the "host controller process error" does
look like a UHCI driver bug.

The RIP corresponds to:
0xffffffff803acb56 is in uhci_scan_schedule
(/home/jeremy/hg/xen/paravirt/linux/drivers/usb/host/uhci-q.c:1740).

1740 uhci->next_qh = list_entry(qh->node.next,
1741 struct uhci_qh, node);


If you have any hints as to what's causing the host controller process
error and how I might go about debugging it, that would be very useful.

Thanks,
J


2008-10-16 14:03:45

by Alan Stern

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

On Wed, 15 Oct 2008, Jeremy Fitzhardinge wrote:

> I'm trying to get UHCI working in a Xen dom0. This is essentially akin
> to making it work with an iommu, as physical memory pages are not
> contiguous, and their kernel-visible addresses are not directly usable
> as DMA addresses. I'm not too surprised that I'm seeing driver errors
> (though e1000 and mpt fusion work fine), so the fact that I'm getting
> this error probably isn't a reflection on the UHCI driver.

uhci-hcd uses dma_allocate_coherent() and dma_pool_create() with
dma_pool_alloc(). If either of these returned an area of memory that
crossed a physical page boundary then there might be trouble -- but
there probably would already be trouble in non-virtualized systems too!

> The problem I'm seeing is this:
>
> xen_create_contiguous_region: vstart=ffff880073ff0000 order=0 addr_bits=20
> uhci_hcd 0000:00:1d.0: -> ret ffff880073ff0000 dma 79b6c000
> uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
> uhci_hcd 0000:00:1d.0: host controller halted, very bad!
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff803acb56>] uhci_scan_schedule+0xa8/0x85f
> PGD 0
> Thread overran stack, or stack corrupted

That last line sounds bad in and of itself.

> Call Trace:
> <IRQ> <0> [<ffffffff80243df5>] ? __mod_timer+0xb8/0xca
> [<ffffffff803253c3>] ? __const_udelay+0x44/0x46
> [<ffffffff80328d89>] ? _raw_spin_lock+0x68/0x10b
> [<ffffffff803aef89>] uhci_irq+0x13f/0x158
> [<ffffffff8039744a>] usb_hcd_irq+0x42/0x90

> I'm not too surprised its getting hardware errors, and I wouldn't assume
> its a USB-level bug at this point (though if its misusing the DMA API,
> it could be a driver bug; I think I saw an iommu-related bug go past,
> which could be a clue).
>
> But the crash as a result of the "host controller process error" does
> look like a UHCI driver bug.

Yes; it shouldn't happen.

> The RIP corresponds to:
> 0xffffffff803acb56 is in uhci_scan_schedule
> (/home/jeremy/hg/xen/paravirt/linux/drivers/usb/host/uhci-q.c:1740).
>
> 1740 uhci->next_qh = list_entry(qh->node.next,
> 1741 struct uhci_qh, node);

Does this mean that qh is NULL? I don't have a 64-bit system so I
can't tell just where in the instruction stream the fault occurred.
Maybe you can add one or two debugging printks in there to figure out
exactly what's going wrong.

> If you have any hints as to what's causing the host controller process
> error and how I might go about debugging it, that would be very useful.

You should start by loading uhci-hcd with the debug=2 parameter setting
(you'll have to enable CONFIG_USB_DEBUG). Then when an HC process
error occurs, the driver will dump its internal data structures to the
system log.

Alan Stern

2008-10-16 22:02:26

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Alan Stern wrote:
> uhci-hcd uses dma_allocate_coherent() and dma_pool_create() with
> dma_pool_alloc(). If either of these returned an area of memory that
> crossed a physical page boundary then there might be trouble -- but
> there probably would already be trouble in non-virtualized systems too!
>

Hm, that should be OK then. No chance something is simply using __pa()
on an address rather than using the proper dma address?

>> The problem I'm seeing is this:
>>
>> xen_create_contiguous_region: vstart=ffff880073ff0000 order=0 addr_bits=20
>> uhci_hcd 0000:00:1d.0: -> ret ffff880073ff0000 dma 79b6c000
>> uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
>> uhci_hcd 0000:00:1d.0: host controller halted, very bad!
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>> IP: [<ffffffff803acb56>] uhci_scan_schedule+0xa8/0x85f
>> PGD 0
>> Thread overran stack, or stack corrupted
>>
>
> That last line sounds bad in and of itself.
>

Hm, could well be spurious.

>> The RIP corresponds to:
>> 0xffffffff803acb56 is in uhci_scan_schedule
>> (/home/jeremy/hg/xen/paravirt/linux/drivers/usb/host/uhci-q.c:1740).
>>
>> 1740 uhci->next_qh = list_entry(qh->node.next,
>> 1741 struct uhci_qh, node);
>>
>
> Does this mean that qh is NULL? I don't have a 64-bit system so I
> can't tell just where in the instruction stream the fault occurred.
> Maybe you can add one or two debugging printks in there to figure out
> exactly what's going wrong.
>

Yes, it must be qh which is NULL. uhci is the only other dereference
there, and must be non-NULL to get to that point.

>> If you have any hints as to what's causing the host controller process
>> error and how I might go about debugging it, that would be very useful.
>>
>
> You should start by loading uhci-hcd with the debug=2 parameter setting
> (you'll have to enable CONFIG_USB_DEBUG). Then when an HC process
> error occurs, the driver will dump its internal data structures to the
> system log.
>

OK, with uhci-hcd.debug=2 on the kernel command line I still get an
oops, but in a different function. I guess the qh list is corrupt
either way?

ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 23 (level, low) -> IRQ 31
xen: PCI device 0000:00:1d.7 pin 4 -> irq 31
ehci_hcd 0000:00:1d.7: setting latency timer to 64
ehci_hcd 0000:00:1d.7: EHCI Host Controller
/home/jeremy/hg/xen/paravirt/linux/drivers/usb/core/inode.c: creating file 'devices'
/home/jeremy/hg/xen/paravirt/linux/drivers/usb/core/inode.c: creating file '001'
ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:1d.7: reset hcs_params 0x103206 dbg=1 cc=3 pcc=2 ordered !ppc ports=6
ehci_hcd 0000:00:1d.7: reset hcc_params 6871 thresh 7 uframes 1024 64 bit addr
ehci_hcd 0000:00:1d.7: xen_alloc_coherent: size=4096 handlep=ffff880075371498 gfp=24
xen_create_contiguous_region: vstart=ffff880073f2b000 order=0 addr_bits=20
ehci_hcd 0000:00:1d.7: -> ret ffff880073f2b000 dma 7dda2000
ehci_hcd 0000:00:1d.7: xen_alloc_coherent: size=4096 handlep=ffff880075371458 gfp=24
xen_create_contiguous_region: vstart=ffff880073f2c000 order=0 addr_bits=20
ehci_hcd 0000:00:1d.7: -> ret ffff880073f2c000 dma 79830000
ehci_hcd 0000:00:1d.7: xen_alloc_coherent: size=4096 handlep=ffff880073f7adb0 gfp=4
xen_create_contiguous_region: vstart=ffff880073f2d000 order=0 addr_bits=20
ehci_hcd 0000:00:1d.7: -> ret ffff880073f2d000 dma 7982f000
ehci_hcd 0000:00:1d.7: reset command 080002 (park)=0 ithresh=8 period=1024 Reset HALT
ehci_hcd 0000:00:1d.7: debug port 1
ehci_hcd 0000:00:1d.7: cache line size of 128 is not supported
ehci_hcd 0000:00:1d.7: supports USB remote wakeup
ehci_hcd 0000:00:1d.7: irq 31, io mem 0xdff00000
ehci_hcd 0000:00:1d.7: reset command 080002 (park)=0 ithresh=8 period=1024 Reset HALT
ehci_hcd 0000:00:1d.7: init command 010001 (park)=0 ithresh=1 period=1024 RUN
ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: default language 0x0409
usb usb1: uevent
usb usb1: usb_probe_device
usb usb1: configuration #1 chosen from 1 choice
usb usb1: adding 1-0:1.0 (config #1, interface 0)
usb 1-0:1.0: uevent
hub 1-0:1.0: usb_probe_interface
hub 1-0:1.0: usb_probe_interface - got id
hub 1-0:1.0: USB hub found
ehci_hcd 0000:00:1d.7: xen_alloc_coherent: size=4096 handlep=ffff880075371318 gfp=24
xen_create_contiguous_region: vstart=ffff88007538e000 order=0 addr_bits=20
ehci_hcd 0000:00:1d.7: -> ret ffff88007538e000 dma 7982e000
hub 1-0:1.0: 6 ports detected
hub 1-0:1.0: standalone hub
hub 1-0:1.0: no power switching (usb 1.0)
hub 1-0:1.0: individual port over-current protection
hub 1-0:1.0: power on to power good time: 20ms
hub 1-0:1.0: local power source is good
hub 1-0:1.0: trying to enable port power on non-switchable hub
ehci_hcd 0000:00:1d.7: GetStatus port 3 status 001803 POWER sig=j CSC CONNECT
hub 1-0:1.0: port 3: status 0501 change 0001
hub 1-0:1.0: state 7 ports 6 chg 0008 evt 0000
/home/jeremy/hg/xen/paravirt/linux/drivers/usb/core/inode.c: creating file '001'
hub 1-0:1.0: port 3, status 0501, change 0000, 480 Mb/s
usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: EHCI Host Controller
usb usb1: Manufacturer: Linux 2.6.27-tip ehci_hcd
usb usb1: SerialNumber: 0000:00:1d.7
USB Universal Host Controller Interface driver v3.0
xen: enabling pci device 0000:00:1d.0 pin 1
xen: registering gsi 16 triggering 0 polarity 1
xen_allocate_pirq: returning irq 29 for gsi 16
xen: --> irq=29
xen_set_ioapic_routing: irq 29 gsi 16 vector 152 ioapic 0 pin 16 triggering 0 polarity 1
uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 29
xen: PCI device 0000:00:1d.0 pin 1 -> irq 29
uhci_hcd 0000:00:1d.0: setting latency timer to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
/home/jeremy/hg/xen/paravirt/linux/drivers/usb/core/inode.c: creating file '002'
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.0: detected 2 ports
uhci_hcd 0000:00:1d.0: uhci_check_and_reset_hc: cmd = 0x0000
uhci_hcd 0000:00:1d.0: Performing full reset
uhci_hcd 0000:00:1d.0: irq 29, io base 0x0000bce0
uhci_hcd 0000:00:1d.0: xen_alloc_coherent: size=4096 handlep=ffff880073f7bdf8 gfp=4
xen_create_contiguous_region: vstart=ffff880073fde000 order=0 addr_bits=20
uhci_hcd 0000:00:1d.0: -> ret ffff880073fde000 dma 7dfcd000
uhci_hcd 0000:00:1d.0: xen_alloc_coherent: size=4096 handlep=ffff880075371158 gfp=24
xen_create_contiguous_region: vstart=ffff880073fdf000 order=0 addr_bits=20
uhci_hcd 0000:00:1d.0: -> ret ffff880073fdf000 dma 79b7d000
uhci_hcd 0000:00:1d.0: xen_alloc_coherent: size=4096 handlep=ffff880075371118 gfp=24
xen_create_contiguous_region: vstart=ffff880073f74000 order=0 addr_bits=20
uhci_hcd 0000:00:1d.0: -> ret ffff880073f74000 dma 79b7c000
uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
usb usb2: default language 0x0409
uhci_hcd 0000:00:1d.0: host controller halted, very bad!
BUG: unable to handle kernel NULL pointer dereference<7>usb usb2: uevent
at 0000000000000020
IP: [<ffffffff803b0d29>] uhci_show_qh+0x228/0x59d
PGD 0
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
CPU 0
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.27-tip #236
RIP: e030:[<ffffffff803b0d29>] [<ffffffff803b0d29>] uhci_show_qh+0x228/0x59d
RSP: e02b:ffffffff80661cc8 EFLAGS: 00010003
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
RDX: ffff880073f74000 RSI: 0000000000000000 RDI: ffffffff8051c332
RBP: ffffffff80661d58 R08: 00000000ffffffff R09: 0000000000000000
R10: ffff8800f3fc8450 R11: 0000000000000010 R12: ffff880073fc84af
R13: ffff880073f7bd58 R14: ffff880073f74000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff805b6f40(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 0000000000201000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff805bc000, task ffffffff805773a0)
Stack:
0000000000000000 0000000000000000 0000000000000000 0000000047c39500
ffffffff80661d58 00007baf00000004 ffff880073fc8451 ffff880073f74000
ffff880073f74030 ffffffff8051c332 ffffffff8051c332 ffffffff8051c332
Call Trace:
<IRQ> <0> [<ffffffff803b16cf>] uhci_sprint_schedule+0x631/0x77a
[<ffffffff803b427c>] uhci_irq+0x11a/0x18a
[<ffffffff8039822b>] usb_hcd_irq+0x42/0x90
[<ffffffff80251a7e>] ? __update_sched_clock+0x1e/0x93
[<ffffffff8026164a>] handle_IRQ_event+0x2e/0x65
[<ffffffff80262b35>] handle_level_irq+0x91/0xe2
[<ffffffff80214a6d>] handle_irq+0x27/0x36
[<ffffffff80365f05>] xen_evtchn_do_upcall+0x198/0x1be
[<ffffffff80464e5e>] xen_do_hypervisor_callback+0x1e/0x30
<EOI> <0> [<ffffffff802093aa>] ? _stext+0x3aa/0x1000
[<ffffffff802093aa>] ? _stext+0x3aa/0x1000
[<ffffffff8020e5e0>] ? xen_safe_halt+0x10/0x1a
[<ffffffff8020be00>] ? xen_idle+0x34/0x48
[<ffffffff80211188>] ? cpu_idle+0x51/0x92
[<ffffffff80454598>] ? rest_init+0x5c/0x5e
[<ffffffff805e5d64>] ? start_kernel+0x409/0x414
[<ffffffff805e52ba>] ? x86_64_start_reservations+0xa5/0xa9
[<ffffffff805e9532>] ? xen_start_kernel+0x96f/0x981
Code: 48 98 49 01 c4 49 3b 4d 70 0f 85 30 03 00 00 8b 55 9c 49 8b 7d 20 03 55 a0 31 c9 44 29 e2 4c 89 e6 e8 e1 fa ff ff e9 0f 03 00 00 <48> 8b 40 20 3b 58 f8 74 20 4c 89 e7 48 c7 c1 32 c3 51 80 8b 55
RIP [<ffffffff803b0d29>] uhci_show_qh+0x228/0x59d
RSP <ffffffff80661cc8>
CR2: 0000000000000020
Kernel panic - not syncing: Fatal exception in interrupt

Thanks,
J


2008-10-16 23:34:53

by Alan Stern

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

On Thu, 16 Oct 2008, Jeremy Fitzhardinge wrote:

> Alan Stern wrote:
> > uhci-hcd uses dma_allocate_coherent() and dma_pool_create() with
> > dma_pool_alloc(). If either of these returned an area of memory that
> > crossed a physical page boundary then there might be trouble -- but
> > there probably would already be trouble in non-virtualized systems too!
> >
>
> Hm, that should be OK then. No chance something is simply using __pa()
> on an address rather than using the proper dma address?

No, no chance.

> >> The RIP corresponds to:
> >> 0xffffffff803acb56 is in uhci_scan_schedule
> >> (/home/jeremy/hg/xen/paravirt/linux/drivers/usb/host/uhci-q.c:1740).
> >>
> >> 1740 uhci->next_qh = list_entry(qh->node.next,
> >> 1741 struct uhci_qh, node);
> >>
> >
> > Does this mean that qh is NULL? I don't have a 64-bit system so I
> > can't tell just where in the instruction stream the fault occurred.
> > Maybe you can add one or two debugging printks in there to figure out
> > exactly what's going wrong.
> >
>
> Yes, it must be qh which is NULL. uhci is the only other dereference
> there, and must be non-NULL to get to that point.

And at that point qh must be equal to uhci->next_qh. There are only
about five places where uhci->next_qh is assigned to; you could test
each of them for NULL.

> OK, with uhci-hcd.debug=2 on the kernel command line I still get an
> oops, but in a different function. I guess the qh list is corrupt
> either way?

It sure looks that way.

> uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
> usb usb2: default language 0x0409
> uhci_hcd 0000:00:1d.0: host controller halted, very bad!
> BUG: unable to handle kernel NULL pointer dereference<7>usb usb2: uevent
> at 0000000000000020
> IP: [<ffffffff803b0d29>] uhci_show_qh+0x228/0x59d

I guess you also better add code to uhci_sprint_schedule() in
uhci-debug.c to check each assignment to qh (there are only two) for
NULL, and jump directly to the next iteration of the "for i" loop when
you see it.

Maybe make this change first, since it will be easier and it might give
a good idea of where to look in uhci-q.c.

By the way, would there be any problem caused by the fact that the
hardware can only use 32-bit DMA addresses?

Alan Stern

2008-10-17 01:00:55

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Alan Stern wrote:
> And at that point qh must be equal to uhci->next_qh. There are only
> about five places where uhci->next_qh is assigned to; you could test
> each of them for NULL.
>

OK.

>> OK, with uhci-hcd.debug=2 on the kernel command line I still get an
>> oops, but in a different function. I guess the qh list is corrupt
>> either way?
>>
>
> It sure looks that way.
>
>
>> uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
>> usb usb2: default language 0x0409
>> uhci_hcd 0000:00:1d.0: host controller halted, very bad!
>> BUG: unable to handle kernel NULL pointer dereference<7>usb usb2: uevent
>> at 0000000000000020
>> IP: [<ffffffff803b0d29>] uhci_show_qh+0x228/0x59d
>>
>
> I guess you also better add code to uhci_sprint_schedule() in
> uhci-debug.c to check each assignment to qh (there are only two) for
> NULL, and jump directly to the next iteration of the "for i" loop when
> you see it.
>
Looks like this is the relevent detail: "uhci->skelqh[1]->node.next is
NULL" for all the queues. Haven't looked into it yet.

hub 4-0:1.0: 2 ports detected
hub 4-0:1.0: standalone hub
hub 4-0:1.0: no power switching (usb 1.0)
hub 4-0:1.0: individual port over-current protection
hub 4-0:1.0: power on to power good time: 2ms
hub 4-0:1.0: local power source is good
hub 4-0:1.0: trying to enable port power on non-switchable hub
/home/jeremy/hg/xen/paravirt/linux/drivers/usb/core/inode.c: creating file '001'
usb usb4: New USB device found, idVendor=1d6b, idProduct=0001
usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb4: Product: UHCI Host Controller
usb usb4: Manufacturer: Linux 2.6.27-tip uhci_hcd
usb usb4: SerialNumber: 0000:00:1d.2
PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
[<ffffffff80235786>] ? try_to_wake_up+0x198/0x1aa
[<ffffffff802357a5>] ? default_wake_function+0xd/0xf
[<ffffffff80225832>] ? pvclock_clocksource_read+0x42/0x7e
[<ffffffff80464744>] ? _spin_unlock_irqrestore+0x1b/0x20
[<ffffffff80225832>] ? pvclock_clocksource_read+0x42/0x7e
[<ffffffff8020ea9b>] ? xen_clocksource_read+0x21/0x23
[<ffffffff8024ef4c>] ? run_posix_cpu_timers+0x27/0x73d
[<ffffffff80225832>] ? pvclock_clocksource_read+0x42/0x7e
[<ffffffff80225832>] ? pvclock_clocksource_read+0x42/0x7e
[<ffffffff803b1d4f>] uhci_scan_schedule+0xb7/0x9a3
[<ffffffff8020eb38>] ? xen_vcpuop_set_next_event+0x52/0x66
[<ffffffff80328d89>] ? _raw_spin_lock+0x68/0x10b
[<ffffffff803b2f93>] uhci_hub_status_data+0x38/0x179
[<ffffffff80398b1c>] usb_hcd_poll_rh_status+0x49/0x14f
[<ffffffff80399261>] ? rh_timer_func+0x0/0xb
[<ffffffff8039926a>] rh_timer_func+0x9/0xb
[<ffffffff8024363b>] run_timer_softirq+0x175/0x1ef
[<ffffffff8023f746>] __do_softirq+0x8a/0x150
serio: i8042 KBD port at 0x60,0x64 irq 1
mice: PS/2 mouse device common for all mice
[<ffffffff8021374c>] call_softirq+0x1c/0x28
[<ffffffff80214885>] do_softirq+0x44/0x8b
[<ffffffff8023f44a>] irq_exit+0x3f/0x82
[<ffffffff80365f0a>] xen_evtchn_do_upcall+0x19d/0x1be
[<ffffffff80464f2e>] xen_do_hypervisor_callback+0x1e/0x30
<EOI> [<ffffffff8020922a>] ? _stext+0x22a/0x1000
[<ffffffff8020922a>] ? _stext+0x22a/0x1000
[<ffffffff8020e584>] ? xen_restore_fl+0x39/0x3b
[<ffffffff8023b421>] ? vprintk+0x2e5/0x312
[<ffffffff8039a313>] ? urb_destroy+0x0/0x28
[<ffffffff80320efd>] ? kref_put+0x43/0x4f
[<ffffffff8039a1a8>] ? usb_free_urb+0x15/0x17
[<ffffffff80461fb1>] ? printk+0x67/0x6e
[<ffffffff8039b242>] ? usb_control_msg+0xe0/0xf1
[<ffffffff80396066>] ? hub_probe+0x3c4/0xa3a
[<ffffffff8039c1f0>] ? usb_match_one_id+0x2b/0x88
[<ffffffff8039c76f>] ? usb_probe_interface+0x13c/0x16a
[<ffffffff8038e29b>] ? driver_probe_device+0xc0/0x16e
[<ffffffff8038e3c2>] ? __device_attach+0x0/0xb
[<ffffffff8038e3cb>] ? __device_attach+0x9/0xb
[<ffffffff8038d83b>] ? bus_for_each_drv+0x51/0x88
[<ffffffff8038e455>] ? device_attach+0x64/0x7b
[<ffffffff8038d679>] ? bus_attach_device+0x28/0x59
[<ffffffff8038c523>] ? device_add+0x3a9/0x54f
[<ffffffff80290cea>] ? kfree+0xed/0xfa
[<ffffffff8039bf52>] ? usb_set_configuration+0x4c7/0x521
[<ffffffff803a30fe>] ? generic_probe+0x55/0x9b
[<ffffffff8039c801>] ? usb_probe_device+0x64/0x69
[<ffffffff8038e29b>] ? driver_probe_device+0xc0/0x16e
[<ffffffff8038e3c2>] ? __device_attach+0x0/0xb
[<ffffffff8038e3cb>] ? __device_attach+0x9/0xb
[<ffffffff8038d83b>] ? bus_for_each_drv+0x51/0x88
[<ffffffff8038e455>] ? device_attach+0x64/0x7b
[<ffffffff8038d679>] ? bus_attach_device+0x28/0x59
[<ffffffff8038c523>] ? device_add+0x3a9/0x54f
[<ffffffff8039580d>] ? usb_new_device+0x5b/0x16b
[<ffffffff80328d89>] ? _raw_spin_lock+0x68/0x10b
[<ffffffff80397377>] ? hub_thread+0xc9b/0x127c
[<ffffffff8020bb19>] ? xen_leave_lazy+0x15/0x17
[<ffffffff8024d69b>] ? autoremove_wake_function+0x0/0x38
[<ffffffff803966dc>] ? hub_thread+0x0/0x127c
[<ffffffff8024d2ee>] ? kthread+0x49/0x76
[<ffffffff80213339>] ? child_rip+0xa/0x11
[<ffffffff8021264d>] ? retint_restore_args+0x5/0x20
[<ffffffff8021332f>] ? child_rip+0x0/0x11
---[ end trace 7c092a45d81b69fc ]---
------------[ cut here ]------------
WARNING: at /home/jeremy/hg/xen/paravirt/linux/drivers/usb/host/uhci-q.c:1744 uhci_scan_schedule+0xb7/0x9a3()
uhci->skelqh[1]->node.next is NULL
Modules linked in:
Pid: 125, comm: khubd Tainted: G W 2.6.27-tip #241
Call Trace:
<IRQ> [<ffffffff8023a6d5>] warn_slowpath+0xae/0xd7
[<ffffffff80235786>] ? try_to_wake_up+0x198/0x1aa
[<ffffffff802357a5>] ? default_wake_function+0xd/0xf
[<ffffffff80225832>] ? pvclock_clocksource_read+0x42/0x7e
[<ffffffff80464744>] ? _spin_unlock_irqrestore+0x1b/0x20
[<ffffffff80225832>] ? pvclock_clocksource_read+0x42/0x7e
[<ffffffff8020ea9b>] ? xen_clocksource_read+0x21/0x23
[<ffffffff8024ef4c>] ? run_posix_cpu_timers+0x27/0x73d
[<ffffffff80225832>] ? pvclock_clocksource_read+0x42/0x7e
[<ffffffff80225832>] ? pvclock_clocksource_read+0x42/0x7e
[<ffffffff803b1d4f>] uhci_scan_schedule+0xb7/0x9a3
[<ffffffff8020eb38>] ? xen_vcpuop_set_next_event+0x52/0x66
[<ffffffff80328d89>] ? _raw_spin_lock+0x68/0x10b
[<ffffffff803b2f93>] uhci_hub_status_data+0x38/0x179
[<ffffffff80398b1c>] usb_hcd_poll_rh_status+0x49/0x14f
[<ffffffff80399261>] ? rh_timer_func+0x0/0xb
[<ffffffff8039926a>] rh_timer_func+0x9/0xb
[<ffffffff8024363b>] run_timer_softirq+0x175/0x1ef
[<ffffffff8023f746>] __do_softirq+0x8a/0x150
[<ffffffff8021374c>] call_softirq+0x1c/0x28
[<ffffffff80214885>] do_softirq+0x44/0x8b
[<ffffffff8023f44a>] irq_exit+0x3f/0x82
[<ffffffff80365f0a>] xen_evtchn_do_upcall+0x19d/0x1be
[<ffffffff80464f2e>] xen_do_hypervisor_callback+0x1e/0x30
<EOI> [<ffffffff8020922a>] ? _stext+0x22a/0x1000
[<ffffffff8020922a>] ? _stext+0x22a/0x1000
[<ffffffff8020e584>] ? xen_restore_fl+0x39/0x3b
[<ffffffff8023b421>] ? vprintk+0x2e5/0x312
[<ffffffff8039a313>] ? urb_destroy+0x0/0x28
[<ffffffff80320efd>] ? kref_put+0x43/0x4f
[<ffffffff8039a1a8>] ? usb_free_urb+0x15/0x17
[<ffffffff80461fb1>] ? printk+0x67/0x6e
[<ffffffff8039b242>] ? usb_control_msg+0xe0/0xf1
[<ffffffff80396066>] ? hub_probe+0x3c4/0xa3a
[<ffffffff8039c1f0>] ? usb_match_one_id+0x2b/0x88
[<ffffffff8039c76f>] ? usb_probe_interface+0x13c/0x16a
[<ffffffff8038e29b>] ? driver_probe_device+0xc0/0x16e
[<ffffffff8038e3c2>] ? __device_attach+0x0/0xb
[<ffffffff8038e3cb>] ? __device_attach+0x9/0xb
[<ffffffff8038d83b>] ? bus_for_each_drv+0x51/0x88
[<ffffffff8038e455>] ? device_attach+0x64/0x7b
[<ffffffff8038d679>] ? bus_attach_device+0x28/0x59
[<ffffffff8038c523>] ? device_add+0x3a9/0x54f
[<ffffffff80290cea>] ? kfree+0xed/0xfa
[<ffffffff8039bf52>] ? usb_set_configuration+0x4c7/0x521
[<ffffffff803a30fe>] ? generic_probe+0x55/0x9b
[<ffffffff8039c801>] ? usb_probe_device+0x64/0x69
[<ffffffff8038e29b>] ? driver_probe_device+0xc0/0x16e
[<ffffffff8038e3c2>] ? __device_attach+0x0/0xb
[<ffffffff8038e3cb>] ? __device_attach+0x9/0xb
[<ffffffff8038d83b>] ? bus_for_each_drv+0x51/0x88
[<ffffffff8038e455>] ? device_attach+0x64/0x7b
[<ffffffff8038d679>] ? bus_attach_device+0x28/0x59
[<ffffffff8038c523>] ? device_add+0x3a9/0x54f
[<ffffffff8039580d>] ? usb_new_device+0x5b/0x16b
[<ffffffff80328d89>] ? _raw_spin_lock+0x68/0x10b
[<ffffffff80397377>] ? hub_thread+0xc9b/0x127c
[<ffffffff8020bb19>] ? xen_leave_lazy+0x15/0x17
[<ffffffff8024d69b>] ? autoremove_wake_function+0x0/0x38
[<ffffffff803966dc>] ? hub_thread+0x0/0x127c
[<ffffffff8024d2ee>] ? kthread+0x49/0x76
[<ffffffff80213339>] ? child_rip+0xa/0x11
[<ffffffff8021264d>] ? retint_restore_args+0x5/0x20
[<ffffffff8021332f>] ? child_rip+0x0/0x11


> Maybe make this change first, since it will be easier and it might give
> a good idea of where to look in uhci-q.c.
>
> By the way, would there be any problem caused by the fact that the
> hardware can only use 32-bit DMA addresses?
>

No. I'm explicitly asking for memory under 4G (and this machine only
has 2G anyway).

J

2008-10-17 14:58:39

by Alan Stern

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

On Thu, 16 Oct 2008, Jeremy Fitzhardinge wrote:

> Looks like this is the relevent detail: "uhci->skelqh[1]->node.next is
> NULL" for all the queues. Haven't looked into it yet.

> ------------[ cut here ]------------
> WARNING: at /home/jeremy/hg/xen/paravirt/linux/drivers/usb/host/uhci-q.c:1744 uhci_scan_schedule+0xb7/0x9a3()
> uhci->skelqh[1]->node.next is NULL

Odd. Is skelqh[1] the only entry whose node.next is NULL? Not
skelqh[2] or others?

The node is initialized during uhci_start() in uhci-hcd.c -- there's a
loop which initializes all the uhci->skelqh[i] entries by calling
uhci_alloc_qh(). You should check inside that loop to see whether
uhci->skelqh[i]->node.next is NULL to begin with. If it is then
something about the DMA-coherent memory is funny; if not then the
linked list pointer gets corrupted somewhere.

Alan Stern

2008-10-17 15:20:57

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Alan Stern wrote:
>> ------------[ cut here ]------------
>> WARNING: at /home/jeremy/hg/xen/paravirt/linux/drivers/usb/host/uhci-q.c:1744 uhci_scan_schedule+0xb7/0x9a3()
>> uhci->skelqh[1]->node.next is NULL
>>
>
> Odd. Is skelqh[1] the only entry whose node.next is NULL? Not
> skelqh[2] or others?
>

No, the others too, but all the WARN_ON backtraces got very verbose and
I had to rush home at the end of the day.

> The node is initialized during uhci_start() in uhci-hcd.c -- there's a
> loop which initializes all the uhci->skelqh[i] entries by calling
> uhci_alloc_qh(). You should check inside that loop to see whether
> uhci->skelqh[i]->node.next is NULL to begin with. If it is then
> something about the DMA-coherent memory is funny; if not then the
> linked list pointer gets corrupted somewhere.
>

OK, will check it out.

J

2008-10-21 20:02:21

by Alan Stern

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

On Thu, 16 Oct 2008, Jeremy Fitzhardinge wrote:

> Looks like this is the relevent detail: "uhci->skelqh[1]->node.next is
> NULL" for all the queues. Haven't looked into it yet.

Any news?

Alan Stern

2008-10-21 20:41:37

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Alan Stern wrote:
> On Thu, 16 Oct 2008, Jeremy Fitzhardinge wrote:
>
>
>> Looks like this is the relevent detail: "uhci->skelqh[1]->node.next is
>> NULL" for all the queues. Haven't looked into it yet.
>>
>
> Any news?
>

They're being initialized properly (which is what I'd expect, since the
driver works fine when booting native). They're being trashed at some
later point, likely when handing the hardware error. But I haven't had
a chance to pick through it in any more detail than that.

J

2008-12-09 00:05:01

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Alan Stern wrote:
> On Thu, 16 Oct 2008, Jeremy Fitzhardinge wrote:
>
>
>> Looks like this is the relevent detail: "uhci->skelqh[1]->node.next is
>> NULL" for all the queues. Haven't looked into it yet.
>>
>
> Any news?

The problem went away for a while, but then came back. I still have no
idea why, but I'm back to debugging it.

The most strange thing I'm seeing is this:

uhci_hcd 0000:00:1d.0: irq 29, io base 0x0000bce0
uhci_alloc_td uhci ffff88002e1a3d58 td ffff88002e105000 dma_handle 4ce2f000
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106000 handle=4ce36000
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106080 handle=4ce36080
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106000 handle=4ce36000
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106001 handle=4ce36001 <<<
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107000 handle=7e546000
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107080 handle=7e546080
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107000 handle=7e546000
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107001 handle=7e546001 <<<
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108000 handle=7e22d000
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108080 handle=7e22d080
uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108000 handle=7e22d000
usb usb1: configuration #1 chosen from 1 choice
uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
uhci_hcd 0000:00:1d.0: host controller halted, very bad!
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC


For some reason dma_pool_alloc() is returning unaligned, overlapping
memory chunks. That that point everything else is no surprise...

So I'm trying to figure out how the dma pool stuff is malfunctioning,
and whether anything we've done is causing it.

J

2008-12-09 01:04:54

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Jeremy Fitzhardinge wrote:
> Alan Stern wrote:
>> On Thu, 16 Oct 2008, Jeremy Fitzhardinge wrote:
>>
>>
>>> Looks like this is the relevent detail: "uhci->skelqh[1]->node.next
>>> is NULL" for all the queues. Haven't looked into it yet.
>>>
>>
>> Any news?
>
> The problem went away for a while, but then came back. I still have
> no idea why, but I'm back to debugging it.
>
> The most strange thing I'm seeing is this:
>
> uhci_hcd 0000:00:1d.0: irq 29, io base 0x0000bce0
> uhci_alloc_td uhci ffff88002e1a3d58 td ffff88002e105000 dma_handle
> 4ce2f000
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106000 handle=4ce36000
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106080 handle=4ce36080
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106000 handle=4ce36000
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106001
> handle=4ce36001 <<<
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107000 handle=7e546000
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107080 handle=7e546080
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107000 handle=7e546000
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107001
> handle=7e546001 <<<
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108000 handle=7e22d000
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108080 handle=7e22d080
> uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108000 handle=7e22d000
> usb usb1: configuration #1 chosen from 1 choice
> uhci_hcd 0000:00:1d.0: host controller process error, something bad
> happened!
> uhci_hcd 0000:00:1d.0: host controller halted, very bad!
> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
>
>
> For some reason dma_pool_alloc() is returning unaligned, overlapping
> memory chunks. That that point everything else is no surprise...
>
> So I'm trying to figure out how the dma pool stuff is malfunctioning,
> and whether anything we've done is causing it.

On other runs the allocations turn out OK:

uhci_alloc_td: uhci ffff88002e631b08 td ffff88002e621000 dma_handle 4ce2f000
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622000 handle=4ce36000
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622080 handle=4ce36080
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622100 handle=4ce36100
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622180 handle=4ce36180
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622200 handle=4ce36200
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622280 handle=4ce36280
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622300 handle=4ce36300
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622380 handle=4ce36380
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622400 handle=4ce36400
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622000 handle=4ce36000
uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622000 handle=4ce36000


but the controller crashes and the structures seem to have been corrupted:

uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
uhci_hcd 0000:00:1d.0: host controller halted, very bad!
[lots of WARN_ONs I added about NULL qh->queue.nexts omitted ]
Root-hub state: running FSBR: 0
HC status
usbcmd = 00c0 Maxp64 CF
usbstat = 0020 HCHalted
usbint = 000f
usbfrnum = (0)000
flbaseadd = 7dd2b000
sof = 40
stat1 = 0080
stat2 = 0080
Most recent frame: 0 (0) Last ISO frame: 0 (0)
Periodic load table
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Total: 0, #INT: 0, #ISO: 0
Frame List
Skeleton QHs
- skel_unlink_qh
[ffff88002e622000] Skel QH link (4ce36002) element (00000000)
Element is NULL (bug?)
queue is empty
[ffff88002e621000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PI
last QH not linked to next skeleton!
- skel_iso_qh
[ffff88002e622080] CTL QH link (00000000) element (00000000)
Element is NULL (bug?)
qh->queue.next == NULL
last QH not linked to next skeleton!
- skel_int128_qh
[ffff88002e622100] CTL QH link (4ce36002) element (00000000)
Element is NULL (bug?)
qh->queue.next == NULL
- skel_int64_qh
[ffff88002e622180] CTL QH link (4ce36002) element (00000000)
Element is NULL (bug?)
qh->queue.next == NULL
- skel_int32_qh
[ffff88002e622200] CTL QH link (4ce36002) element (00000000)
Element is NULL (bug?)
qh->queue.next == NULL
- skel_int16_qh
[ffff88002e622280] CTL QH link (4ce36002) element (00000000)
Element is NULL (bug?)
qh->queue.next == NULL
- skel_int8_qh
[ffff88002e622300] CTL QH link (4ce36002) element (00000000)
Element is NULL (bug?)
qh->queue.next == NULL
- skel_int4_qh
[ffff88002e622380] Skel QH link (4ce36002) element (00000001)
queue is empty
- skel_int2_qh
[ffff88002e622400] Skel QH link (4ce36002) element (00000001)
queue is empty
- skel_async_qh
[ffff88002e622000] Skel QH link (4ce36002) element (00000000)
Element is NULL (bug?)
queue is empty
[ffff88002e621000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PI
last QH not linked to next skeleton!
- skel_term_qh
[ffff88002e622000] Skel QH link (4ce36002) element (00000000)
Element is NULL (bug?)
queue is empty
[ffff88002e621000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PI


Any clues about what this means? Also, where's the best place to dump
all the structures before kicking off the hardware to make sure they're
correct from the outset?

Thanks,
J

2008-12-09 02:13:47

by Alan Stern

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

On Mon, 8 Dec 2008, Jeremy Fitzhardinge wrote:

> Jeremy Fitzhardinge wrote:
> > Alan Stern wrote:
> >> On Thu, 16 Oct 2008, Jeremy Fitzhardinge wrote:
> >>
> >>
> >>> Looks like this is the relevent detail: "uhci->skelqh[1]->node.next
> >>> is NULL" for all the queues. Haven't looked into it yet.
> >>>
> >>
> >> Any news?
> >
> > The problem went away for a while, but then came back. I still have
> > no idea why, but I'm back to debugging it.
> >
> > The most strange thing I'm seeing is this:
> >
> > uhci_hcd 0000:00:1d.0: irq 29, io base 0x0000bce0
> > uhci_alloc_td uhci ffff88002e1a3d58 td ffff88002e105000 dma_handle
> > 4ce2f000
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106000 handle=4ce36000
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106080 handle=4ce36080
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106000 handle=4ce36000
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e106001
> > handle=4ce36001 <<<
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107000 handle=7e546000
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107080 handle=7e546080
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107000 handle=7e546000
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e107001
> > handle=7e546001 <<<
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108000 handle=7e22d000
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108080 handle=7e22d080
> > uhci_alloc_qh: uhci=ffff88002e1a3d58 qh=ffff88002e108000 handle=7e22d000
> > usb usb1: configuration #1 chosen from 1 choice
> > uhci_hcd 0000:00:1d.0: host controller process error, something bad
> > happened!
> > uhci_hcd 0000:00:1d.0: host controller halted, very bad!
> > general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> >
> >
> > For some reason dma_pool_alloc() is returning unaligned, overlapping
> > memory chunks. That that point everything else is no surprise...

Truly weird.

> > So I'm trying to figure out how the dma pool stuff is malfunctioning,
> > and whether anything we've done is causing it.
>
> On other runs the allocations turn out OK:
>
> uhci_alloc_td: uhci ffff88002e631b08 td ffff88002e621000 dma_handle 4ce2f000
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622000 handle=4ce36000
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622080 handle=4ce36080
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622100 handle=4ce36100
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622180 handle=4ce36180
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622200 handle=4ce36200
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622280 handle=4ce36280
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622300 handle=4ce36300
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622380 handle=4ce36380
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622400 handle=4ce36400
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622000 handle=4ce36000
> uhci_alloc_qh: uhci=ffff88002e631b08 qh=ffff88002e622000 handle=4ce36000

The last two are the same, and they are the same as the second one.
Does that mean you're getting duplicated allocations?

> but the controller crashes and the structures seem to have been corrupted:
>
> uhci_hcd 0000:00:1d.0: host controller process error, something bad happened!
> uhci_hcd 0000:00:1d.0: host controller halted, very bad!
> [lots of WARN_ONs I added about NULL qh->queue.nexts omitted ]
> Root-hub state: running FSBR: 0
> HC status
> usbcmd = 00c0 Maxp64 CF
> usbstat = 0020 HCHalted
> usbint = 000f
> usbfrnum = (0)000
> flbaseadd = 7dd2b000
> sof = 40
> stat1 = 0080
> stat2 = 0080
> Most recent frame: 0 (0) Last ISO frame: 0 (0)
> Periodic load table
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> Total: 0, #INT: 0, #ISO: 0
> Frame List
> Skeleton QHs
> - skel_unlink_qh
> [ffff88002e622000] Skel QH link (4ce36002) element (00000000)
> Element is NULL (bug?)
> queue is empty
> [ffff88002e621000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PI
> last QH not linked to next skeleton!
> - skel_iso_qh
> [ffff88002e622080] CTL QH link (00000000) element (00000000)
> Element is NULL (bug?)
> qh->queue.next == NULL
> last QH not linked to next skeleton!
> - skel_int128_qh
> [ffff88002e622100] CTL QH link (4ce36002) element (00000000)
> Element is NULL (bug?)
> qh->queue.next == NULL
> - skel_int64_qh
> [ffff88002e622180] CTL QH link (4ce36002) element (00000000)
> Element is NULL (bug?)
> qh->queue.next == NULL
> - skel_int32_qh
> [ffff88002e622200] CTL QH link (4ce36002) element (00000000)
> Element is NULL (bug?)
> qh->queue.next == NULL
> - skel_int16_qh
> [ffff88002e622280] CTL QH link (4ce36002) element (00000000)
> Element is NULL (bug?)
> qh->queue.next == NULL
> - skel_int8_qh
> [ffff88002e622300] CTL QH link (4ce36002) element (00000000)
> Element is NULL (bug?)
> qh->queue.next == NULL
> - skel_int4_qh
> [ffff88002e622380] Skel QH link (4ce36002) element (00000001)
> queue is empty
> - skel_int2_qh
> [ffff88002e622400] Skel QH link (4ce36002) element (00000001)
> queue is empty
> - skel_async_qh
> [ffff88002e622000] Skel QH link (4ce36002) element (00000000)
> Element is NULL (bug?)
> queue is empty
> [ffff88002e621000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PI
> last QH not linked to next skeleton!
> - skel_term_qh
> [ffff88002e622000] Skel QH link (4ce36002) element (00000000)
> Element is NULL (bug?)
> queue is empty
> [ffff88002e621000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PI

skel_term_qh is definitely _not_ supposed to be equal to skel_async_qh
or to skel_unlink_qh. The allocation is still messed up.

> Any clues about what this means? Also, where's the best place to dump
> all the structures before kicking off the hardware to make sure they're
> correct from the outset?

The hardware starts running right at the end of uhci_start(), in the
call to start_rh(). However according to those debugging lines you
added to uhci_alloc_qh(), the DMA pool allocations are all messed up.
Those initial allocations all occur within uhci_start(), in the

for (i = 0; i < UHCI_NUM_SKELQH; ++i)

loop. It sure looks like the DMA pool memory management needs
attention.

Alan Stern

2008-12-09 18:25:01

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Alan Stern wrote:
> The hardware starts running right at the end of uhci_start(), in the
> call to start_rh(). However according to those debugging lines you
> added to uhci_alloc_qh(), the DMA pool allocations are all messed up.
> Those initial allocations all occur within uhci_start(), in the
>
> for (i = 0; i < UHCI_NUM_SKELQH; ++i)
>
> loop. It sure looks like the DMA pool memory management needs
> attention.
>
>

Yep, there's something very odd going on in there. By contrast, does
this look OK? The allocations are fine, but I'm wondering if the
skeleton QH stuff is all correct; it seems to work OK in this state.

uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4000 handle=7e212000
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4080 handle=7e212080
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4100 handle=7e212100
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4180 handle=7e212180
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4200 handle=7e212200
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4280 handle=7e212280
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4300 handle=7e212300
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4380 handle=7e212380
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4400 handle=7e212400
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4480 handle=7e212480
uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4500 handle=7e212500
Root-hub state: reset FSBR: 0
HC status
usbcmd = 0000 Maxp32
usbstat = 0020 HCHalted
usbint = 0000
usbfrnum = (0)000
flbaseadd = 7ffd5000
sof = 40
stat1 = 0080
stat2 = 0080
Most recent frame: 0 (0) Last ISO frame: 0 (0)
Periodic load table
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Total: 0, #INT: 0, #ISO: 0
Frame List
Skeleton QHs
- skel_unlink_qh
[ffff88002e5f4000] Skel QH link (00000001) element (00000001)
queue is empty
- skel_iso_qh
[ffff88002e5f4080] Skel QH link (00000001) element (00000001)
queue is empty
- skel_int128_qh
[ffff88002e5f4100] Skel QH link (7e212482) element (00000001)
queue is empty
- skel_int64_qh
[ffff88002e5f4180] Skel QH link (7e212482) element (00000001)
queue is empty
- skel_int32_qh
[ffff88002e5f4200] Skel QH link (7e212482) element (00000001)
queue is empty
- skel_int16_qh
[ffff88002e5f4280] Skel QH link (7e212482) element (00000001)
queue is empty
- skel_int8_qh
[ffff88002e5f4300] Skel QH link (7e212482) element (00000001)
queue is empty
- skel_int4_qh
[ffff88002e5f4380] Skel QH link (7e212482) element (00000001)
queue is empty
- skel_int2_qh
[ffff88002e5f4400] Skel QH link (7e212482) element (00000001)
queue is empty
- skel_async_qh
[ffff88002e5f4480] Skel QH link (00000001) element (7e215000)
queue is empty
[ffff88002e5f3000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PID=69(IN) (buf=00000000)
- skel_term_qh
[ffff88002e5f4500] Skel QH link (7e212502) element (7e215000)
queue is empty


Thanks,
J

2008-12-09 18:43:18

by Alan Stern

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

On Tue, 9 Dec 2008, Jeremy Fitzhardinge wrote:

> Yep, there's something very odd going on in there. By contrast, does
> this look OK? The allocations are fine, but I'm wondering if the
> skeleton QH stuff is all correct; it seems to work OK in this state.
>
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4000 handle=7e212000
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4080 handle=7e212080
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4100 handle=7e212100
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4180 handle=7e212180
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4200 handle=7e212200
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4280 handle=7e212280
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4300 handle=7e212300
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4380 handle=7e212380
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4400 handle=7e212400
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4480 handle=7e212480
> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4500 handle=7e212500
> Root-hub state: reset FSBR: 0
> HC status
> usbcmd = 0000 Maxp32
> usbstat = 0020 HCHalted
> usbint = 0000
> usbfrnum = (0)000
> flbaseadd = 7ffd5000
> sof = 40
> stat1 = 0080
> stat2 = 0080
> Most recent frame: 0 (0) Last ISO frame: 0 (0)
> Periodic load table
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> Total: 0, #INT: 0, #ISO: 0
> Frame List
> Skeleton QHs
> - skel_unlink_qh
> [ffff88002e5f4000] Skel QH link (00000001) element (00000001)
> queue is empty
> - skel_iso_qh
> [ffff88002e5f4080] Skel QH link (00000001) element (00000001)
> queue is empty
> - skel_int128_qh
> [ffff88002e5f4100] Skel QH link (7e212482) element (00000001)
> queue is empty
> - skel_int64_qh
> [ffff88002e5f4180] Skel QH link (7e212482) element (00000001)
> queue is empty
> - skel_int32_qh
> [ffff88002e5f4200] Skel QH link (7e212482) element (00000001)
> queue is empty
> - skel_int16_qh
> [ffff88002e5f4280] Skel QH link (7e212482) element (00000001)
> queue is empty
> - skel_int8_qh
> [ffff88002e5f4300] Skel QH link (7e212482) element (00000001)
> queue is empty
> - skel_int4_qh
> [ffff88002e5f4380] Skel QH link (7e212482) element (00000001)
> queue is empty
> - skel_int2_qh
> [ffff88002e5f4400] Skel QH link (7e212482) element (00000001)
> queue is empty
> - skel_async_qh
> [ffff88002e5f4480] Skel QH link (00000001) element (7e215000)
> queue is empty
> [ffff88002e5f3000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PID=69(IN) (buf=00000000)
> - skel_term_qh
> [ffff88002e5f4500] Skel QH link (7e212502) element (7e215000)
> queue is empty

Yes, that looks normal for a controller with no activity.

Alan Stern

2008-12-09 22:49:45

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Alan Stern wrote:
> On Tue, 9 Dec 2008, Jeremy Fitzhardinge wrote:
>
>
>> Yep, there's something very odd going on in there. By contrast, does
>> this look OK? The allocations are fine, but I'm wondering if the
>> skeleton QH stuff is all correct; it seems to work OK in this state.
>>
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4000 handle=7e212000
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4080 handle=7e212080
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4100 handle=7e212100
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4180 handle=7e212180
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4200 handle=7e212200
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4280 handle=7e212280
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4300 handle=7e212300
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4380 handle=7e212380
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4400 handle=7e212400
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4480 handle=7e212480
>> uhci_alloc_qh: uhci=ffff88002e5de3d0 qh=ffff88002e5f4500 handle=7e212500
>> Root-hub state: reset FSBR: 0
>> HC status
>> usbcmd = 0000 Maxp32
>> usbstat = 0020 HCHalted
>> usbint = 0000
>> usbfrnum = (0)000
>> flbaseadd = 7ffd5000
>> sof = 40
>> stat1 = 0080
>> stat2 = 0080
>> Most recent frame: 0 (0) Last ISO frame: 0 (0)
>> Periodic load table
>> 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0
>> Total: 0, #INT: 0, #ISO: 0
>> Frame List
>> Skeleton QHs
>> - skel_unlink_qh
>> [ffff88002e5f4000] Skel QH link (00000001) element (00000001)
>> queue is empty
>> - skel_iso_qh
>> [ffff88002e5f4080] Skel QH link (00000001) element (00000001)
>> queue is empty
>> - skel_int128_qh
>> [ffff88002e5f4100] Skel QH link (7e212482) element (00000001)
>> queue is empty
>> - skel_int64_qh
>> [ffff88002e5f4180] Skel QH link (7e212482) element (00000001)
>> queue is empty
>> - skel_int32_qh
>> [ffff88002e5f4200] Skel QH link (7e212482) element (00000001)
>> queue is empty
>> - skel_int16_qh
>> [ffff88002e5f4280] Skel QH link (7e212482) element (00000001)
>> queue is empty
>> - skel_int8_qh
>> [ffff88002e5f4300] Skel QH link (7e212482) element (00000001)
>> queue is empty
>> - skel_int4_qh
>> [ffff88002e5f4380] Skel QH link (7e212482) element (00000001)
>> queue is empty
>> - skel_int2_qh
>> [ffff88002e5f4400] Skel QH link (7e212482) element (00000001)
>> queue is empty
>> - skel_async_qh
>> [ffff88002e5f4480] Skel QH link (00000001) element (7e215000)
>> queue is empty
>> [ffff88002e5f3000] link (00000001) e0 Length=0 MaxLen=7ff DT0 EndPt=0 Dev=7f, PID=69(IN) (buf=00000000)
>> - skel_term_qh
>> [ffff88002e5f4500] Skel QH link (7e212502) element (7e215000)
>> queue is empty
>>
>
> Yes, that looks normal for a controller with no activity.
>

OK. If the controller is idle and the kernel wants to submit it some
new work, where does that happen? Can I stick a uhci_sprint_schedule()
there to see the state of the structures before it starts to work on them?


Thanks,
J

2008-12-10 15:10:34

by Alan Stern

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

On Tue, 9 Dec 2008, Jeremy Fitzhardinge wrote:

> OK. If the controller is idle and the kernel wants to submit it some
> new work, where does that happen? Can I stick a uhci_sprint_schedule()
> there to see the state of the structures before it starts to work on them?

The entry point is uhci_urb_enqueue(). You can do whatever you like
there, but bear in mind that it may be called in interrupt context.
uhci_sprint_schedule() should be fine.

However I don't really see the point. If the DMA pool allocations are
so badly messed up that the initial setup is wrong, why bother to
search any further?

Alan Stern

2008-12-10 19:15:27

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Oops in UHCI when encountering "host controller process error"

Alan Stern wrote:
> The entry point is uhci_urb_enqueue(). You can do whatever you like
> there, but bear in mind that it may be called in interrupt context.
> uhci_sprint_schedule() should be fine.
>
> However I don't really see the point. If the DMA pool allocations are
> so badly messed up that the initial setup is wrong, why bother to
> search any further?

I was trying to use it as a way of gauging where the damage was coming from.

But its moot now; Ian Campbell tracked it down to a missing TLB flush
which explained all the weird non-deterministic symptoms we were
seeing. It all looks pretty good now.

J