2008-03-12 09:30:59

by Lennert Buytenhek

[permalink] [raw]
Subject: ehci watchdog-related splatter in 2.6.25-rc4

Seems like EHCI IRQ firing while ehci_iaa_watchdog() is running.

This is on the integrated EHCI controller of an ARM SoC, the workload
is pumping a lot of data through an USB-ethernet dongle.

Chris (CC) is seeing similar oopsen on x86.


Unable to handle kernel NULL pointer dereference at virtual address 00000060
pgd = ef524000
[00000060] *pgd=2ce44031, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] PREEMPT
Modules linked in:
CPU: 0 Not tainted (2.6.25-rc4 #347)
PC is at end_unlink_async+0x1c/0x13c
LR is at end_unlink_async+0x1c/0x13c
pc : [<c0284440>] lr : [<c0284440>] psr: 60000093
sp : ecdafe08 ip : ecdafe08 fp : ecdafe24
r10: efe72800 r9 : ecdae000 r8 : 00000021
r7 : 00000010 r6 : 00000000 r5 : efe728c0 r4 : 00000000
r3 : 00000000 r2 : ecdae000 r1 : efe72800 r0 : 00000000
Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
Control: 0005317f Table: 2f524000 DAC: 00000015
Process yumdownloader (pid: 1159, stack limit = 0xecdae268)
Stack: (0xecdafe08 to 0xecdb0000)
fe00: 000080a9 efe728c0 00000000 00000010 ecdafe4c ecdafe28
fe20: c0288284 c0284434 efe72800 00000000 00000000 00000010 00000103 4015e200
fe40: ecdafe64 ecdafe50 c02764e8 c0288210 1dcd6500 efe75860 ecdafe84 ecdafe68
fe60: c0065f64 c02764bc ecdae000 c0401b90 00000010 efe75860 ecdafea4 ecdafe88
fe80: c0067db8 c0065f30 00000010 c0401b90 ecdaffb0 00000104 ecdafec4 ecdafea8
fea0: c0022048 c0067cf8 ffd3a1cc ffffffff fe400200 0000000f ecdaff24 ecdafec8
fec0: c0022668 c0022010 00000000 ffd3a180 ffd3a1cc ecdae010 efe728c0 60000013
fee0: ecdae000 c02880ec 00000000 ecdae000 4015e200 ecdaff24 ecdafef0 ecdaff10
ff00: c0284484 c0288174 60000013 ffffffff 00000102 c041f2c0 ecdaff54 ecdaff28
ff20: c00489e0 c02880fc ecdaff28 ecdaff28 ecdaff54 00000005 c041f0e4 00000009
ff40: 00000001 00000000 ecdaff74 ecdaff58 c0043f78 c004887c ecdae000 ecdae000
ff60: c0401b90 00000000 ecdaff8c ecdaff78 c00443b8 c0043f28 00000000 00000010
ff80: ecdaffac ecdaff90 c002204c c0044378 00000000 ffffffff fe400200 0000000f
ffa0: 00000000 ecdaffb0 c0022890 c0022010 403b8760 401612a8 000251eb 00000002
ffc0: 403b8760 403b88f0 00000000 403b1c60 00000002 40167e6c 4015e200 4016c1d8
ffe0: 403b1c60 bedbfab8 4016b82c 400a13b4 20000010 ffffffff feedfeed feedfeed
Backtrace:
[<c0284424>] (end_unlink_async+0x0/0x13c) from [<c0288284>] (ehci_irq+0x84/0x1b0)
r7:00000010 r6:00000000 r5:efe728c0 r4:000080a9
[<c0288200>] (ehci_irq+0x0/0x1b0) from [<c02764e8>] (usb_hcd_irq+0x3c/0x8c)
[<c02764ac>] (usb_hcd_irq+0x0/0x8c) from [<c0065f64>] (handle_IRQ_event+0x44/0x84)
r4:efe75860
[<c0065f20>] (handle_IRQ_event+0x0/0x84) from [<c0067db8>] (handle_level_irq+0xd0/0x158)
r7:efe75860 r6:00000010 r5:c0401b90 r4:ecdae000
[<c0067ce8>] (handle_level_irq+0x0/0x158) from [<c0022048>] (__exception_text_start+0x48/0x64)
r7:00000104 r6:ecdaffb0 r5:c0401b90 r4:00000010
[<c0022000>] (__exception_text_start+0x0/0x64) from [<c0022668>] (__irq_svc+0x48/0x88)
Exception stack(0xecdafec8 to 0xecdaff10)
fec0: 00000000 ffd3a180 ffd3a1cc ecdae010 efe728c0 60000013
fee0: ecdae000 c02880ec 00000000 ecdae000 4015e200 ecdaff24 ecdafef0 ecdaff10
ff00: c0284484 c0288174 60000013 ffffffff
r6:0000000f r5:fe400200 r4:ffffffff
[<c02880ec>] (ehci_iaa_watchdog+0x0/0xb4) from [<c00489e0>] (run_timer_softirq+0x174/0x220)
r5:c041f2c0 r4:00000102
[<c004886c>] (run_timer_softirq+0x0/0x220) from [<c0043f78>] (__do_softirq+0x60/0xd4)
r8:00000000 r7:00000001 r6:00000009 r5:c041f0e4 r4:00000005
[<c0043f18>] (__do_softirq+0x0/0xd4) from [<c00443b8>] (irq_exit+0x50/0x9c)
r6:00000000 r5:c0401b90 r4:ecdae000
[<c0044368>] (irq_exit+0x0/0x9c) from [<c002204c>] (__exception_text_start+0x4c/0x64)
r4:00000010
[<c0022000>] (__exception_text_start+0x0/0x64) from [<c0022890>] (__irq_usr+0x50/0xa0)
Exception stack(0xecdaffb0 to 0xecdafff8)
ffa0: 403b8760 401612a8 000251eb 00000002
ffc0: 403b8760 403b88f0 00000000 403b1c60 00000002 40167e6c 4015e200 4016c1d8
ffe0: 403b1c60 bedbfab8 4016b82c 400a13b4 20000010 ffffffff
r6:0000000f r5:fe400200 r4:ffffffff
Code: e5904014 e1a05000 e2800090 ebf71090 (e5942060)
Kernel panic - not syncing: Fatal exception in interrupt


2008-03-12 09:39:00

by David Brownell

[permalink] [raw]
Subject: Re: ehci watchdog-related splatter in 2.6.25-rc4

On Wednesday 12 March 2008, Lennert Buytenhek wrote:
> Seems like EHCI IRQ firing while ehci_iaa_watchdog() is running.

Does RC5 resolve this?

2008-03-12 10:11:43

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: ehci watchdog-related splatter in 2.6.25-rc4

On Wed, Mar 12, 2008 at 01:38:48AM -0800, David Brownell wrote:

> > Seems like EHCI IRQ firing while ehci_iaa_watchdog() is running.
>
> Does RC5 resolve this?

Nope.

e82cc1288fa57857c6af8c57f3d07096d4bcd9d9 (post -rc5) looks like it
might fix it -- I'll re-test with that applied as well.


[root@mv78100 tmp]# yumdownloader --source kde-i18n
Enabling f8-source repository
Enabling f8-updates-source repository
kde-i18n-3.5.9-4.fc8.src. 58% |============== | 149 MB 00:22 ETA
pgd = ef6c0000
[00000060] *pgd=2f41a031, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] PREEMPT
Modules linked in:
CPU: 0 Not tainted (2.6.25-rc5 #349)
PC is at end_unlink_async+0x1c/0x13c
LR is at end_unlink_async+0x1c/0x13c
pc : [<c0284520>] lr : [<c0284520>] psr: 60000093
sp : ef529e08 ip : ef529e08 fp : ef529e24
r10: efe72800 r9 : ef528000 r8 : 00000021
r7 : 00000010 r6 : 00000000 r5 : efe728c0 r4 : 00000000
r3 : 00000000 r2 : ef528000 r1 : efe72800 r0 : 00000000
Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
Control: 0005317f Table: 2f6c0000 DAC: 00000015
Process yumdownloader (pid: 1027, stack limit = 0xef528268)
Stack: (0xef529e08 to 0xef52a000)
9e00: 000080a9 efe728c0 00000000 00000010 ef529e4c ef529e28
9e20: c0288364 c0284514 efe72800 00000000 00000000 00000010 00000103 4015e200
9e40: ef529e64 ef529e50 c02765c8 c02882f0 055d4a80 efe75860 ef529e84 ef529e68
9e60: c0065fa4 c027659c ef528000 c0401b90 00000010 efe75860 ef529ea4 ef529e88
9e80: c0067df8 c0065f70 00000010 c0401b90 ef529fb0 00000104 ef529ec4 ef529ea8
9ea0: c0022048 c0067d38 ffd3a1cc ffffffff fe400200 0000000f ef529f24 ef529ec8
9ec0: c0022668 c0022010 00000000 ffd3a180 ffd3a1cc ef528010 efe728c0 60000013
9ee0: ef528000 c02881cc 00000000 ef528000 4015e200 ef529f24 ef529ef0 ef529f10
9f00: c0284564 c0288254 60000013 ffffffff 00000102 c041f2c0 ef529f54 ef529f28
9f20: c0048a00 c02881dc ef529f28 ef529f28 ef529f54 00000005 c041f0e4 00000009
9f40: 00000001 00000000 ef529f74 ef529f58 c0043f98 c004889c ef528000 ef528000
9f60: c0401b90 00000000 ef529f8c ef529f78 c00443d8 c0043f48 00000000 00000010
9f80: ef529fac ef529f90 c002204c c0044398 00000000 ffffffff fe400200 0000000f
9fa0: 00000000 ef529fb0 c0022890 c0022010 00000004 0179c000 40168734 0000a534
9fc0: 01560968 000111c0 015609b8 00000000 0138d870 40c4fb3d 4015e200 000111c0
9fe0: 01560000 bee148d0 012832e0 400a3ac4 20000010 ffffffff 00000000 00000000
Backtrace:
[<c0284504>] (end_unlink_async+0x0/0x13c) from [<c0288364>] (ehci_irq+0x84/0x1b
r7:00000010 r6:00000000 r5:efe728c0 r4:000080a9
[<c02882e0>] (ehci_irq+0x0/0x1b0) from [<c02765c8>] (usb_hcd_irq+0x3c/0x8c)
[<c027658c>] (usb_hcd_irq+0x0/0x8c) from [<c0065fa4>] (handle_IRQ_event+0x44/0x
r4:efe75860
[<c0065f60>] (handle_IRQ_event+0x0/0x84) from [<c0067df8>] (handle_level_irq+0x
r7:efe75860 r6:00000010 r5:c0401b90 r4:ef528000
[<c0067d28>] (handle_level_irq+0x0/0x158) from [<c0022048>] (__exception_text_s
r7:00000104 r6:ef529fb0 r5:c0401b90 r4:00000010
[<c0022000>] (__exception_text_start+0x0/0x64) from [<c0022668>] (__irq_svc+0x4
Exception stack(0xef529ec8 to 0xef529f10)
9ec0: 00000000 ffd3a180 ffd3a1cc ef528010 efe728c0 60000013
9ee0: ef528000 c02881cc 00000000 ef528000 4015e200 ef529f24 ef529ef0 ef529f10
9f00: c0284564 c0288254 60000013 ffffffff
r6:0000000f r5:fe400200 r4:ffffffff
[<c02881cc>] (ehci_iaa_watchdog+0x0/0xb4) from [<c0048a00>] (run_timer_softirq+
r5:c041f2c0 r4:00000102
[<c004888c>] (run_timer_softirq+0x0/0x220) from [<c0043f98>] (__do_softirq+0x60
r8:00000000 r7:00000001 r6:00000009 r5:c041f0e4 r4:00000005
[<c0043f38>] (__do_softirq+0x0/0xd4) from [<c00443d8>] (irq_exit+0x50/0x9c)
r6:00000000 r5:c0401b90 r4:ef528000
[<c0044388>] (irq_exit+0x0/0x9c) from [<c002204c>] (__exception_text_start+0x4c
r4:00000010
[<c0022000>] (__exception_text_start+0x0/0x64) from [<c0022890>] (__irq_usr+0x5
Exception stack(0xef529fb0 to 0xef529ff8)
9fa0: 00000004 0179c000 40168734 0000a534
9fc0: 01560968 000111c0 015609b8 00000000 0138d870 40c4fb3d 4015e200 000111c0
9fe0: 01560000 bee148d0 012832e0 400a3ac4 20000010 ffffffff
r6:0000000f r5:fe400200 r4:ffffffff
Code: e5904014 e1a05000 e2800090 ebf71060 (e5942060)
Kernel panic - not syncing: Fatal exception in interrupt

2008-03-12 10:13:44

by David Brownell

[permalink] [raw]
Subject: Re: ehci watchdog-related splatter in 2.6.25-rc4

On Wednesday 12 March 2008, Lennert Buytenhek wrote:
>
> > Does RC5 resolve this?
>
> Nope.
>
> e82cc1288fa57857c6af8c57f3d07096d4bcd9d9 (post -rc5) looks like it
> might fix it -- I'll re-test with that applied as well.

Sorry, I forgot that made it just *after* RC5. That's the patch
I was interested in.

- Dave

2008-03-12 10:47:22

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: ehci watchdog-related splatter in 2.6.25-rc4

On Wed, Mar 12, 2008 at 11:11:31AM +0100, Lennert Buytenhek wrote:

> > > Seems like EHCI IRQ firing while ehci_iaa_watchdog() is running.
> >
> > Does RC5 resolve this?
>
> Nope.
>
> e82cc1288fa57857c6af8c57f3d07096d4bcd9d9 (post -rc5) looks like it
> might fix it -- I'll re-test with that applied as well.

It seems to be looking somewhat better so far with -rc5 plus
e82cc1288fa57857c6af8c57f3d07096d4bcd9d9 -- I haven't run into the
oops yet.

2008-03-16 22:33:17

by Chris Wedgwood

[permalink] [raw]
Subject: USB oops with mainline (post 2.6.25-rc5)

On Wed, Mar 12, 2008 at 02:13:23AM -0800, David Brownell wrote:

> Sorry, I forgot that made it just *after* RC5. That's the patch I
> was interested in.

That's not enough to prevent the oops here. Using mainline from about
an hour ago I get a crash inside ub_end_rq. No full oops (yet) but
pic at http://stupidest.org/tmp/2.6.25-rc5-cw3-usboops1.png if that
helps. Should I back out all the USB changes of late and see if that
helps?

I got that a second or so after inserting a USB storage device.

2008-03-16 22:43:52

by David Brownell

[permalink] [raw]
Subject: Re: USB oops with mainline (post 2.6.25-rc5)

On Sunday 16 March 2008, Chris Wedgwood wrote:
> On Wed, Mar 12, 2008 at 02:13:23AM -0800, David Brownell wrote:
>
> > Sorry, I forgot that made it just *after* RC5. That's the patch I
> > was interested in.
>
> That's not enough to prevent the oops here. Using mainline from about
> an hour ago I get a crash inside ub_end_rq. No full oops (yet) but
> pic at http://stupidest.org/tmp/2.6.25-rc5-cw3-usboops1.png if that

Are you sure it's the same problem?


> helps. Should I back out all the USB changes of late and see if that
> helps?
>
> I got that a second or so after inserting a USB storage device.

Does that happen with usb-storage too, or just "ub"? ISTR hearing
about similar issues specific to "ub".

- Dave


2008-03-16 23:03:33

by Chris Wedgwood

[permalink] [raw]
Subject: Re: USB oops with mainline (post 2.6.25-rc5)

(cc' list trimmed)

On Sun, Mar 16, 2008 at 02:43:37PM -0800, David Brownell wrote:

> Are you sure it's the same problem?

Indeed it's not, the earlier problem seems to have gone away. I just
saw the oops and read ub as urb and didn't think past the fact it
crashed earlier. Sorry about that.

> Does that happen with usb-storage too, or just "ub"? ISTR hearing
> about similar issues specific to "ub".

It's ub specific and reproducable. With usb-storage everything works
as expected.

I'll post the UB oops in a few minutes.

2008-03-17 01:04:32

by Pete Zaitcev

[permalink] [raw]
Subject: Re: USB oops with mainline (post 2.6.25-rc5)

On Sun, 16 Mar 2008 16:02:56 -0700, Chris Wedgwood <[email protected]> wrote:

> > Does that happen with usb-storage too, or just "ub"? ISTR hearing
> > about similar issues specific to "ub".
>
> It's ub specific and reproducable. With usb-storage everything works
> as expected.
>
> I'll post the UB oops in a few minutes.

It's the BUG() which the new s/g API change added.

I don't know exactly what caused it. It's possible that block layer
is buggy by itself. Or perhaps hald sens a packet command and the
conversion to the new API is not complete. In the old API drivers
had to adjust the request, but in the new one the right way seems
to pass the byte count into the __blk_end_request. I don't have
a specific guidance regarding it, and the conversion of ub was
hurried apparently.

Try this, please:

--- a/drivers/block/ub.c
+++ b/drivers/block/ub.c
@@ -359,7 +360,8 @@ static void ub_cmd_build_block(struct ub_dev *sc, struct ub_lun *lun,
static void ub_cmd_build_packet(struct ub_dev *sc, struct ub_lun *lun,
struct ub_scsi_cmd *cmd, struct ub_request *urq);
static void ub_rw_cmd_done(struct ub_dev *sc, struct ub_scsi_cmd *cmd);
-static void ub_end_rq(struct request *rq, unsigned int status);
+static void ub_end_rq(struct request *rq, unsigned int status,
+ unsigned int act_len, unsigned int length);
static int ub_rw_cmd_retry(struct ub_dev *sc, struct ub_lun *lun,
struct ub_request *urq, struct ub_scsi_cmd *cmd);
static int ub_submit_scsi(struct ub_dev *sc, struct ub_scsi_cmd *cmd);
@@ -642,13 +644,13 @@ static int ub_request_fn_1(struct ub_lun *lun, struct request *rq)

if (atomic_read(&sc->poison)) {
blkdev_dequeue_request(rq);
- ub_end_rq(rq, DID_NO_CONNECT << 16);
+ ub_end_rq(rq, DID_NO_CONNECT << 16, 0, 1);
return 0;
}

if (lun->changed && !blk_pc_request(rq)) {
blkdev_dequeue_request(rq);
- ub_end_rq(rq, SAM_STAT_CHECK_CONDITION);
+ ub_end_rq(rq, SAM_STAT_CHECK_CONDITION, 0, 1);
return 0;
}

@@ -701,7 +703,7 @@ static int ub_request_fn_1(struct ub_lun *lun, struct request *rq)

drop:
ub_put_cmd(lun, cmd);
- ub_end_rq(rq, DID_ERROR << 16);
+ ub_end_rq(rq, DID_ERROR << 16, 0, 3);
return 0;
}

@@ -775,12 +777,30 @@ static void ub_rw_cmd_done(struct ub_dev *sc, struct ub_scsi_cmd *cmd)

if (cmd->error == 0) {
if (blk_pc_request(rq)) {
+#if 0
+/*
+ * This was required for cdrecord to work before the Tomo's __blk_end_request.
+ * Now, doing so causes __blk_end_request to end with a nonzero.
+ */
if (cmd->act_len >= rq->data_len)
rq->data_len = 0;
else
rq->data_len -= cmd->act_len;
+#endif
+ scsi_status = 0;
+ } else {
+ if (cmd->act_len != rq->nr_sectors * 512) {
+ /*
+ * XXX Not sure if this is worth retrying.
+ * Maybe it's some kind of end-of-device?
+ */
+ if (ub_rw_cmd_retry(sc, lun, urq, cmd) == 0)
+ return;
+ scsi_status = SAM_STAT_CHECK_CONDITION;
+ } else {
+ scsi_status = 0;
+ }
}
- scsi_status = 0;
} else {
if (blk_pc_request(rq)) {
/* UB_SENSE_SIZE is smaller than SCSI_SENSE_BUFFERSIZE */
@@ -802,13 +822,15 @@ static void ub_rw_cmd_done(struct ub_dev *sc, struct ub_scsi_cmd *cmd)
urq->rq = NULL;

ub_put_cmd(lun, cmd);
- ub_end_rq(rq, scsi_status);
+ ub_end_rq(rq, scsi_status, cmd->act_len, cmd->len);
blk_start_queue(lun->disk->queue);
}

-static void ub_end_rq(struct request *rq, unsigned int scsi_status)
+static void ub_end_rq(struct request *rq, unsigned int scsi_status,
+ unsigned int act_len, unsigned int length)
{
int error;
+ long rqlen;

if (scsi_status == 0) {
error = 0;
@@ -816,8 +838,13 @@ static void ub_end_rq(struct request *rq, unsigned int scsi_status)
error = -EIO;
rq->errors = scsi_status;
}
- if (__blk_end_request(rq, error, blk_rq_bytes(rq)))
- BUG();
+ rqlen = blk_rq_bytes(rq); /* in case __blk_end_request recalculates */
+ if (__blk_end_request(rq, error, act_len)) {
+ printk(KERN_WARNING DRV_NAME ": "
+ "__blk_end_request blew, cmd %s len %u rqlen %ld act %u\n",
+ blk_pc_request(rq)? "pc": "fs",
+ length, rqlen, act_len);
+ }
}

static int ub_rw_cmd_retry(struct ub_dev *sc, struct ub_lun *lun,

If ub prints anything, I'd like to see the unedited dmesg.

-- Pete