2003-06-24 22:05:27

by David Dillow

[permalink] [raw]
Subject: 2.4.21: kernel BUG at ide-iops.c:1262!

I've found a completely repeatable BUG/panic in 2.4.21 (final). The
kernel BUGS and then panics in an interrupt while beginning to burn a
DVD+R. I get the following in the logs, then the BUG/panic decoded at
the end of this message:

scsi : aborting command due to timeout : pid 390, scsi0, channel 0, id
0, lun 0
Test Unit Ready 00 00 00 00 00
scsi : aborting command due to timeout : pid 390, scsi0, channel 0, id
0, lun 0
Test Unit Ready 00 00 00 00 00
SCSI host 0 abort (pid 390) timed out - resetting
SCSI bus is being reset for host 0 channel 0.

Hardware is a dual PIII, 512MB memory, VIA vt82c686b IDE controller,
Sony DRU-500A DVD+/-RW burner, and Maxwell 4x DVD+R media. I have not
been able to reproduce with any other media, and the firmware on the
500A is the latest.

I was able to reproduce this with 2.4.21-rc5-ac1, but have not had media
available to try other kernels. I do have a few left, so I can help test
theories and solutions.

I was using growifs from dvd+rw-tools versions 5.6.4.4.4 and 5.9.4.4.4.
It will happen under either version.

Dave




ksymoops 2.4.5 on i686 2.4.21. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.21/ (default)
-m /boot/System.map-2.4.21 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

kernel BUG at ide-iops.c:1262!
invalid operand: 0000
CPU: 1
EIP: 0010:[<c01b5551>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00013082
eax: c01d8ee0 ebx: 00000000 ecx: 00000036 edx: 00000002
esi: c15d7c00 edi: c032c450 ebp: c032c254 esp: dfff5e84
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=dfff5000)
Stack: 00000001 00003086 00003086 c15d7c00 c1597c60 00003086 00000000
c15d7c00
00000000 c15e64c0 c01b5767 c032c450 00000000 c01da110 c032c450
c01d427d
c15d7c00 00000002 00000000 c15d7c00 00003286 00000020 00000001
c01d3675
Call Trace: [<c01b5767>] [<c01da110>] [<c01d427d>] [<c01d3675>]
[<e0825fbc>]
[<c01d35f0>] [<c0125c2e>] [<c0121b54>] [<c01219f7>] [<c01217b9>]
[<c0109399>]
[<c01053e0>] [<c010bed8>] [<c01053e0>] [<c010540c>] [<c01054a2>]
[<c011c5a5>]
[<c011c810>]
Code: 0f 0b ee 04 61 ef 25 c0 80 bf 09 01 00 00 20 74 0c 8b 74 24


>>EIP; c01b5551 <do_reset1+31/230> <=====

>>eax; c01d8ee0 <idescsi_pc_intr+0/360>
>>esi; c15d7c00 <_end+1298ac8/204cef28>
>>edi; c032c450 <ide_hwifs+670/2c88>
>>ebp; c032c254 <ide_hwifs+474/2c88>
>>esp; dfff5e84 <_end+1fcb6d4c/204cef28>

Trace; c01b5767 <ide_do_reset+17/20>
Trace; c01da110 <idescsi_reset+20/40>
Trace; c01d427d <scsi_reset+11d/370>
Trace; c01d3675 <scsi_old_times_out+85/160>
Trace; e0825fbc <[usb-uhci]rh_init_int_timer+5c/70>
Trace; c01d35f0 <scsi_old_times_out+0/160>
Trace; c0125c2e <run_timer_list+12e/1b7>
Trace; c0121b54 <bh_action+54/80>
Trace; c01219f7 <tasklet_hi_action+67/a0>
Trace; c01217b9 <do_softirq+d9/e0>
Trace; c0109399 <do_IRQ+e9/f0>
Trace; c01053e0 <default_idle+0/50>
Trace; c010bed8 <call_do_IRQ+5/d>
Trace; c01053e0 <default_idle+0/50>
Trace; c010540c <default_idle+2c/50>
Trace; c01054a2 <cpu_idle+52/70>
Trace; c011c5a5 <call_console_drivers+65/120>
Trace; c011c810 <printk+140/180>

Code; c01b5551 <do_reset1+31/230>
00000000 <_EIP>:
Code; c01b5551 <do_reset1+31/230> <=====
0: 0f 0b ud2a <=====
Code; c01b5553 <do_reset1+33/230>
2: ee out %al,(%dx)
Code; c01b5554 <do_reset1+34/230>
3: 04 61 add $0x61,%al
Code; c01b5556 <do_reset1+36/230>
5: ef out %eax,(%dx)
Code; c01b5557 <do_reset1+37/230>
6: 25 c0 80 bf 09 and $0x9bf80c0,%eax
Code; c01b555c <do_reset1+3c/230>
b: 01 00 add %eax,(%eax)
Code; c01b555e <do_reset1+3e/230>
d: 00 20 add %ah,(%eax)
Code; c01b5560 <do_reset1+40/230>
f: 74 0c je 1d <_EIP+0x1d>
Code; c01b5562 <do_reset1+42/230>
11: 8b 74 24 00 mov 0x0(%esp,1),%esi

<0>Kernel panic: Aiee, killing interrupt handler!

1 warning issued. Results may not be reliable.



2003-06-25 00:34:24

by Scott McDermott

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

David Dillow on Tue 24/06 18:19 -0400:
> I've found a completely repeatable BUG/panic in 2.4.21
> (final). The kernel BUGS and then panics in an interrupt
> while beginning to burn a DVD+R. I get the following in
> the logs, then the BUG/panic decoded at the end of this
> message:

There were recent threads about this, and a Bugzilla bug 829
I think. Try killing magicdev.

2003-06-25 03:40:45

by David Dillow

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

On Tue, 2003-06-24 at 20:48, Scott McDermott wrote:
> David Dillow on Tue 24/06 18:19 -0400:
> > I've found a completely repeatable BUG/panic in 2.4.21
> > (final). The kernel BUGS and then panics in an interrupt
> > while beginning to burn a DVD+R. I get the following in
> > the logs, then the BUG/panic decoded at the end of this
> > message:
>
> There were recent threads about this, and a Bugzilla bug 829
> I think.

Doh! Missed those, and my quick search didn't hit. Found one after your
message, though.

> Try killing magicdev.

I thought I had, but upon checking, it was still kicking. Killing it
allows the burn to complete uninterrupted.

I'll test patches to kill the kernel BUG, but I'm happy to be able to
burn without a reboot again.

Dave

2003-06-25 07:28:31

by Roland Mas

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

Scott McDermott, 2003-06-24 20:48:28 -0400 :

> There were recent threads about this, and a Bugzilla bug 829 I
> think. Try killing magicdev.

Okay, I'll ask a LKML-newbie question. Bugzilla says "Linux 2.5
kernel bugs only at this time"... Does that mean this bug won't be
fixed in 2.4, or just that the fix will be written for 2.5 and then
backported to 2.4?

Roland.
--
Roland Mas

Au royaume des aveugles, les borgnes n'ont qu'un oeil.

2003-06-25 11:51:31

by Alan

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!


Its a known problem. There are two approaches. Either rewrite the ide
scsi reset code to use ide_abort infrastructure or switch ide-scsi back
to the old abort/reset code not new_eh and use SCSI_RESET_PUNT so that
the recovery is all handled by the ide layer.

For 2.4 the latter may be the best approach, for 2.5 it has to use
new_eh

2003-06-25 21:06:05

by Adrian Bunk

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

On Wed, Jun 25, 2003 at 09:42:38AM +0200, Roland Mas wrote:
> Scott McDermott, 2003-06-24 20:48:28 -0400 :
>
> > There were recent threads about this, and a Bugzilla bug 829 I
> > think. Try killing magicdev.
>
> Okay, I'll ask a LKML-newbie question. Bugzilla says "Linux 2.5
> kernel bugs only at this time"... Does that mean this bug won't be
> fixed in 2.4, or just that the fix will be written for 2.5 and then
> backported to 2.4?

The Bugzilla tracks only 2.5 bugs.

Whether it will be fixed in 2.4 or a fix will be backported to 2.4 is
beyond the scope of the Bugzilla (but of course Bugzilla does not
affect how kernel developers write and submit fixes).

> Roland.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2003-06-25 21:24:19

by Scott McDermott

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

Adrian Bunk on Wed 25/06 23:20 +0200:
> > > There were recent threads about this, and a Bugzilla
> > > bug 829 I think. Try killing magicdev.
> >
> > Okay, I'll ask a LKML-newbie question. Bugzilla says
> > "Linux 2.5 kernel bugs only at this time"... Does that
> > mean this bug won't be fixed in 2.4, or just that the
> > fix will be written for 2.5 and then backported to 2.4?
>
> The Bugzilla tracks only 2.5 bugs.
>
> Whether it will be fixed in 2.4 or a fix will be
> backported to 2.4 is beyond the scope of the Bugzilla (but
> of course Bugzilla does not affect how kernel developers
> write and submit fixes).

it tracks 2.5 but I think the bug is the same; it appears
the relevant code has changed but I think the root issue
probably is the same (ide-scsi reset error handling causing
kernel to crash). The bug submitter for the original bug
said he gets the same behavior on 2.4, but he didn't yet
respond as to whether killing magicdev fixes his problem.
Maybe they're different problems, but they appear related.

2003-06-25 22:33:20

by Scott McDermott

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

Alan Cox on Wed 25/06 13:00 +0100:
> Its a known problem. There are two approaches. Either
> rewrite the ide scsi reset code to use ide_abort
> infrastructure or switch ide-scsi back to the old
> abort/reset code not new_eh and use SCSI_RESET_PUNT so
> that the recovery is all handled by the ide layer.
>
> For 2.4 the latter may be the best approach, for 2.5 it
> has to use new_eh

is this what

o First crack at fixing the ide reset oopses (me)

tries to fix?

The CDRW devices that have problems with Test Unit Ready
during finalization (like the GCC-4240N) are broken and this
won't fix that problem, but the fix you're talking about
will stop the kernel from crashing when it happens, do I
have that right?

2003-06-25 22:38:42

by Alan

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

On Mer, 2003-06-25 at 23:47, Scott McDermott wrote:
> is this what
>
> o First crack at fixing the ide reset oopses (me)
>
> tries to fix?
>
> The CDRW devices that have problems with Test Unit Ready
> during finalization (like the GCC-4240N) are broken and this
> won't fix that problem, but the fix you're talking about
> will stop the kernel from crashing when it happens, do I
> have that right?

The TUA thing is a bit different but the general problem of
reset being mishandled is part of it and needs fixing anyway

2003-06-26 03:03:37

by David Dillow

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

On Wed, 2003-06-25 at 12:08, Alan Cox wrote:
> On Mer, 2003-06-25 at 04:54, David Dillow wrote:
>
> > I thought I had, but upon checking, it was still kicking. Killing it
> > allows the burn to complete uninterrupted.
> >
> > I'll test patches to kill the kernel BUG, but I'm happy to be able to
> > burn without a reboot again.
>
> I've put a test fix into -ac3

Well, it no longer BUGS. Progress. :)

Unfortunately, it still oopses, then panics. Everything but the kernel
is the same as before. Forgot to get ksym/module info for ksymoops
before testing, can get that info if need be.

ksymoops 2.4.5 on i686 2.4.21. Options used
-V (default)
-K (specified)
-L (specified)
-o /lib/modules/2.4.21-ac3/ (specified)
-m /boot/System.map-2.4.21-ac3 (specified)

No modules in ksyms, skipping objects
cpu: 0, clocks: 1339302, slice: 446434
cpu: 1, clocks: 1339302, slice: 446434
Unable to handle kernel NULL pointer dereference at virtual address
00000000
00000000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<00000000>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010096
eax: c029c480 ebx: c034b7fc ecx: 00000033 edx: 00000002
esi: c15de400 edi: 00000000 ebp: c15e3440 esp: c02cded4
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c02cd000)
Stack: c01de899 c034b7fc c026be65 00000000 c01d891d c15de400 00000002
00000000
c15de400 00000292 00000000 00000001 c01d7d15 c15de400 00000002
000003ac
00000000 c15de46c c01d7c90 c0126f1e c15de400 c15de46c c15de46c
00000001
Call Trace: [<c01de899>] [<c01d891d>] [<c01d7d15>] [<c01d7c90>]
[<c0126f1e>]
[<c0122e24>] [<c0122cc2>] [<c0122a89>] [<c0109289>] [<c01053e0>]
[<c0105000>]
[<c010bde8>] [<c01053e0>] [<c0105000>] [<c010540c>] [<c0105474>]
Code: Bad EIP value.


>>EIP; 00000000 Before first symbol

>>eax; c029c480 <idescsi_driver+0/6c>
>>ebx; c034b7fc <ide_hwifs+67c/2cd8>
>>esi; c15de400 <END_OF_CODE+127f9c8/????>
>>ebp; c15e3440 <END_OF_CODE+1284a08/????>
>>esp; c02cded4 <init_task_union+1ed4/2000>

Trace; c01de899 <idescsi_reset+29/80>
Trace; c01d891d <scsi_reset+11d/370>
Trace; c01d7d15 <scsi_old_times_out+85/160>
Trace; c01d7c90 <scsi_old_times_out+0/160>
Trace; c0126f1e <run_timer_list+12e/1b7>
Trace; c0122e24 <bh_action+54/80>
Trace; c0122cc2 <tasklet_hi_action+62/a0>
Trace; c0122a89 <do_softirq+d9/e0>
Trace; c0109289 <do_IRQ+e9/f0>
Trace; c01053e0 <default_idle+0/50>
Trace; c0105000 <_stext+0/0>
Trace; c010bde8 <call_do_IRQ+5/d>
Trace; c01053e0 <default_idle+0/50>
Trace; c0105000 <_stext+0/0>
Trace; c010540c <default_idle+2c/50>
Trace; c0105474 <cpu_idle+24/30>

<0>Kernel panic: Aiee, killing interrupt handler!


2003-06-26 21:23:02

by Alan

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

On Iau, 2003-06-26 at 04:17, David Dillow wrote:
> Trace; c01de899 <idescsi_reset+29/80>
> Trace; c01d891d <scsi_reset+11d/370>

Interesting trace.

Lets try a little sanity check first then. Replace that whole
idescsi_reset handler with "return SCSI_RESET_SNOOZE"

(I think its snooze)


2003-06-27 02:45:22

by David Dillow

[permalink] [raw]
Subject: Re: 2.4.21: kernel BUG at ide-iops.c:1262!

On Thu, 2003-06-26 at 17:34, Alan Cox wrote:
> On Iau, 2003-06-26 at 04:17, David Dillow wrote:
> > Trace; c01de899 <idescsi_reset+29/80>
> > Trace; c01d891d <scsi_reset+11d/370>
>
> Interesting trace.
>
> Lets try a little sanity check first then. Replace that whole
> idescsi_reset handler with "return SCSI_RESET_SNOOZE"

With that change, I've burned two disks back to back with no oopsen.

All I get in the logs are these (expected) messages:
scsi : aborting command due to timeout : pid 522, scsi0, channel 0, id
0, lun 0
scsi : aborting command due to timeout : pid 522, scsi0, channel 0, id
0, lun 0
SCSI host 0 abort (pid 522) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
SCSI host 0 channel 0 reset (pid 522) timed out - trying harder
SCSI bus is being reset for host 0 channel 0.
scsi : aborting command due to timeout : pid 144056, scsi0, channel 0,
id 0, lu
scsi : aborting command due to timeout : pid 144056, scsi0, channel 0,
id 0, lu
SCSI host 0 abort (pid 144056) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
SCSI host 0 channel 0 reset (pid 144056) timed out - trying harder
SCSI bus is being reset for host 0 channel 0.

Don't know if you broke anything else, but I can burn with magicdev
running and it not lock up the system now.

I've got four pieces of this media left, so if you want me to test the
next iteration of the patch, give me a yell.

Thanks,
Dave