2002-01-03 12:06:28

by Ralf Oehler

Subject: kernel 2.4.17 crashes on SCSI-errors

Hi, List

I have just tried the new kernel 2.4.17, hoping that
the SCSI system is now usable again.
But NO! It crashed immediately, like the last few kernels before it.

In the meantime I'm really getting into problems with
our product, because I expect SuSE to launch their next
release soon with an unstable "stable" kernel.

Isn't anybody recognizing that this bug is serious?
3.5" MO drives report blank sectors as "SCSI-Hardware-Error".
This kind of sense code also appears for errors that
are much more common than blank sectors.
Any flaw in a SCSI disk will crash the kernel.
Please don't rely on modern hardware being so perfect that
errors will never occur. If it were, you could just as well remove
the complete error-handling code.
That would at least prevent the crashes...


Here is a simple procedure to reliably trigger the BUG:

1) I compiled the SCSI-stuff as modules.
2) I put an erased MO-Medium in a MO-SCSI-drive.
3) I connected the drive to the computer.
4) I typed "modprobe sd_mod"
5) Crash! Serial console said:

Welcome to SuSE Linux 7.3 (i386) - Kernel 2.4.17 (ttyS0).

tick login: invalid operand: 0000
CPU: 0
EIP: 0010:[<d0851735>] Not tainted
EFLAGS: 00010082
eax: 00000042 ebx: ce3dc070 ecx: c0224080 edx: 0000270d
esi: c009e018 edi: 00000018 ebp: c009e000 esp: c0237dd4
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c0237000)
Stack: d0867340 00000093 cf95b9ac cfb6de00 c0237e2c 00000000 66656400 00000006
cfb6de10 00000002 00000003 00000282 41000031 c0220002 ce434a00 d0851346
cfb6de00 ce468ecc 00000293 ce434ab8 ce434a00 cf4f416c 00000092 d083466a
Call Trace: [<d0867340>] [<d0851346>] [<d083466a>] [<d0834df8>] [<d083baaf>]
[<d084e880>] [<d083b10e>] [<d083b2b3>] [<d083b318>] [<d083b7a0>] [<d084cce8>]
[<d08351f7>] [<d0835099>] [<c01176a2>] [<c01175d9>] [<c01173ca>] [<c0107f8d>]
[<c0105150>] [<c0105150>] [<c0105173>] [<c01051d7>] [<c0105000>] [<c0105027>]

Code: 0f 0b 83 c4 08 83 3e 00 74 13 8b 06 05 00 00 00 40 89 46 0c
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing



Again I offer my time and my hardware for testing purposes.
I cannot fix the bug in the kernel myself, but I can test patches
and provide resulting stack traces.

Regards,
Ralf

-----------------------------------------------------------------
| Ralf Oehler
| GDI - Gesellschaft fuer Digitale Informationstechnik mbH
|
| E-Mail: [email protected]
| Tel.: +49 6182-9271-23
| Fax.: +49 6182-25035
| Mail: GDI, Bensbruchstraße 11, D-63533 Mainhausen
| HTTP: http://www.GDImbH.com
-----------------------------------------------------------------

time is a funny concept


2002-01-03 12:10:28

by Jens Axboe

Subject: Re: kernel 2.4.17 crashes on SCSI-errors

On Thu, Jan 03 2002, [email protected] wrote:
> tick login: invalid operand: 0000
> CPU: 0
> EIP: 0010:[<d0851735>] Not tainted
> EFLAGS: 00010082
> eax: 00000042 ebx: ce3dc070 ecx: c0224080 edx: 0000270d
> esi: c009e018 edi: 00000018 ebp: c009e000 esp: c0237dd4
> ds: 0018 es: 0018 ss: 0018
> Process swapper (pid: 0, stackpage=c0237000)
> Stack: d0867340 00000093 cf95b9ac cfb6de00 c0237e2c 00000000 66656400 00000006
> cfb6de10 00000002 00000003 00000282 41000031 c0220002 ce434a00 d0851346
> cfb6de00 ce468ecc 00000293 ce434ab8 ce434a00 cf4f416c 00000092 d083466a
> Call Trace: [<d0867340>] [<d0851346>] [<d083466a>] [<d0834df8>] [<d083baaf>]
> [<d084e880>] [<d083b10e>] [<d083b2b3>] [<d083b318>] [<d083b7a0>] [<d084cce8>]
> [<d08351f7>] [<d0835099>] [<c01176a2>] [<c01175d9>] [<c01173ca>] [<c0107f8d>]
> [<c0105150>] [<c0105150>] [<c0105173>] [<c01051d7>] [<c0105000>] [<c0105027>]
>
> Code: 0f 0b 83 c4 08 83 3e 00 74 13 8b 06 05 00 00 00 40 89 46 0c
> <0>Kernel panic: Aiee, killing interrupt handler!
> In interrupt handler - not syncing

Please ksymoops this oops.

--
Jens Axboe

2002-01-03 12:33:51

by Alan

Subject: Re: kernel 2.4.17 crashes on SCSI-errors

> Isn't anybody recognizing that this bug is serious?

My 2.4.9-ac kernel tree here seems to be behaving.

> 1) I compiled the SCSI-stuff as modules.
> 2) I put an erased MO-Medium in a MO-SCSI-drive.
[erased and formatted, I assume?]
> 3) I connected the drive to the computer.
> 4) I typed "modprobe sd_mod"
> 5) Crash! Serial console said:
>
> tick login: invalid operand: 0000

BUG trap. Turn on verbose bug reporting, and also run the oops you then
get through ksymoops so that it's actually readable by others. List which
SCSI controller you use, too.

The RH tree I'm running backed out a couple of SCSI error handling changes
because we saw strange deadlocks. I don't think those are in Marcelo's tree,
because I never had time to work out why they had to be reverted.

Alan

2002-01-03 12:35:21

by Jens Axboe

Subject: Re: kernel 2.4.17 crashes on SCSI-errors


seeing an older post on linux-scsi, you might want to retry your test
with the aic7xxx nseg bug fixed. this is against 2.4.17, haven't checked
if it's applied in 2.4.18-pre1 -- if not, Marcelo please apply.

--- drivers/scsi/aic7xxx/aic7xxx_linux.c~	Thu Jan  3 13:32:33 2002
+++ drivers/scsi/aic7xxx/aic7xxx_linux.c	Thu Jan  3 13:33:00 2002
@@ -1703,6 +1703,7 @@
 				       cmd->request_buffer,
 				       cmd->request_bufflen,
 				       scsi_to_pci_dma_dir(cmd->sc_data_direction));
+			scb->sg_count = 0;
 			scb->sg_count = ahc_linux_map_seg(ahc, scb,
 							  sg, addr,
 							  cmd->request_bufflen);
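
[The added assignment can only matter if the mapping helper reads scb->sg_count before its return value overwrites the field. What follows is a minimal user-space sketch of that situation, not the driver's code: it assumes a helper that refuses to add a segment when the count already looks full. Every name here (toy_scb, toy_map_seg, toy_setup_single, TOY_MAX_SEGS) is hypothetical.]

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SEGS 4   /* hypothetical capacity, not the driver's value */

/* Hypothetical stand-in for a reusable per-command control block. */
struct toy_scb {
    int sg_count;        /* segments recorded for the current command */
};

/*
 * Hypothetical mapping helper: returns the number of segments it
 * consumed (1), or -1 if the block already looks full.
 * It reads sg_count BEFORE the caller overwrites it with the result.
 */
static int toy_map_seg(const struct toy_scb *scb)
{
    assert(scb != NULL);
    if (scb->sg_count + 1 > TOY_MAX_SEGS)
        return -1;       /* a stale count makes an empty block look full */
    return 1;
}

/* Map a single-buffer command, optionally clearing the old count first. */
static int toy_setup_single(struct toy_scb *scb, int reset_first)
{
    if (reset_first)
        scb->sg_count = 0;
    scb->sg_count = toy_map_seg(scb);
    return scb->sg_count;
}
```

[In this sketch a control block reused with a leftover count of TOY_MAX_SEGS fails the capacity check unless the count is cleared first, which is the kind of effect a "redundant-looking" reset would have.]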


--
Jens Axboe

2002-01-03 13:52:29

by Ralf Oehler

Subject: Re: kernel 2.4.17 crashes on SCSI-errors



Ksymoops was not possible, because after rebooting the
memory/module-layout had changed. (Or is there a trick
I don't know?) To get a usable stack chain I patched
the kernel with SGI's kdb and reproduced the crash.

I'm using the aic7xxx controller. (I can reproduce the crash
with both the old and the new one.)

Alan Cox reports that the -ac kernels behave. This
makes me believe that the bug crept into the Linus
kernel at 2.4.10, where heavy block layer changes
happened that were not applied to Alan's kernel.
linus-2.4.0 behaves, too.



Here is
1) register dump
2) stack chain
3) last few lines dumped from syslog buffer


Is there anything more I can do?

Regards,
Ralf





Welcome to SuSE Linux 7.3 (i386) - Kernel 2.4.17-Dbg (ttyS0).

tick login:
Entering kdb (current=0xc0290000, pid 0) Oops: invalid operand
due to oops @ 0xd0853729
eax = 0x00000046 ebx = 0xce550070 ecx = 0xc027de00 edx = 0x0000276d
esi = 0xc009e018 edi = 0x00000018 esp = 0xc0291d94 eip = 0xd0853729
ebp = 0xc0291dd8 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010002
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc0291d60


kdb> bt
EBP EIP Function(args)
0xc0291dd8 0xd0853729 [aic7xxx]ahc_linux_run_device_queue+0x39d (0xcfb6be00, 0xce62ed1c)
aic7xxx .text 0xd0852060 0xd085338c 0xd0853c90
0xc0291dfc 0xd0853356 [aic7xxx]ahc_linux_queue+0x172 (0xce5f6a00, 0xd0834de0)
aic7xxx .text 0xd0852060 0xd08531e4 0xd085338c
0xc0291e20 0xd0834674 [scsi_mod]scsi_dispatch_cmd+0x1a4 (0xce5f6a00, 0xce5f6a00)
scsi_mod .text 0xd0834060 0xd08344d0 0xd083481c
0xc0291e50 0xd083bc7d [scsi_mod]scsi_request_fn+0x2bd (0xcf9f77b4)
scsi_mod .text 0xd0834060 0xd083b9c0 0xd083bcb4
0xc0291e6c 0xd083b2d6 [scsi_mod]scsi_queue_next_request+0x46 (0xcf9f77b4, 0xce5f6a00)
scsi_mod .text 0xd0834060 0xd083b290 0xd083b39c
0xc0291e88 0xd083b489 [scsi_mod]__scsi_end_request+0xed (0xce5f6a00, 0x0, 0x0, 0x1, 0x1)
scsi_mod .text 0xd0834060 0xd083b39c 0xd083b4d4
0xc0291ea4 0xd083b4ec [scsi_mod]scsi_end_request+0x18 (0xce5f6a00, 0x0, 0x2)
scsi_mod .text 0xd0834060 0xd083b4d4 0xd083b4f0
0xc0291ee0 0xd083b96b [scsi_mod]scsi_io_completion+0x3ab (0xce5f6a00, 0x0, 0x1)
scsi_mod .text 0xd0834060 0xd083b5c0 0xd083b978
0xc0291f10 0xd084ecec [sd_mod]rw_intr+0x1e8 (0xce5f6a00)
sd_mod .text 0xd084e060 0xd084eb04 0xd084ecf8
0xc0291f28 0xd0835214 [scsi_mod]scsi_finish_command+0xdc (0xce5f6a00)
scsi_mod .text 0xd0834060 0xd0835138 0xd0835220
0xc0291f3c 0xd083508a [scsi_mod]scsi_bottom_half_handler+0x1f2
scsi_mod .text 0xd0834060 0xd0834e98 0xd08350ac
0xc0291f44 0xc0117dd0 bh_action+0x1c (0x8)
kernel .text 0xc0100000 0xc0117db4 0xc0117df8
0xc0291f5c 0xc0117ce9 tasklet_hi_action+0x59 (0xc02a85c0)
kernel .text 0xc0100000 0xc0117c90 0xc0117d10
0xc0291f78 0xc0117aac do_softirq+0x4c
kernel .text 0xc0100000 0xc0117a60 0xc0117b00
0xc0291f90 0xc01083ed do_IRQ+0xa1 (0xc0290000, 0xc14e4000, 0xc14e4270, 0xc0105170, 0xffffe000)
kernel .text 0xc0100000 0xc010834c 0xc0108400
0xc0291fcc 0xc01f33b8 call_do_IRQ+0x5
kernel .rodata 0xc01f1b00 0xc01f33b3 0xc01f33c0
0xc0105207 cpu_idle+0x3f
kernel .text 0xc0100000 0xc01051c8 0xc010521c
0xc0291fe8 0xc010502a stext+0x2a
kernel .text 0xc0100000 0xc0105000 0xc0105030
0xc0291ff8 0xc0292931 start_kernel+0x101
kernel .text.init 0xc0292000 0xc0292830 0xc0292938


From log_buf[]:
<4>SCSI device sda: 1273011 1024-byte hdwr sectors (1304 MB).
<4>sda: Write Protect is off.
<6> /dev/scsi/host0/bus0/target1/lun0:SCSI disk error :
host 0 channel 0 id 1 lun 0 return code = 8000002.
<4>Info fld=0x0, Current sd08:00: sense key Blank Check.
<4> I/O error: dev 08:00, sector 0.
<4>Incorrect number of segments after building list.
<4>kernel BUG at /usr/src/linux-SuSE73-2.4.17-Dbg/include/asm/pci.h:147!


Regards,
Ralf Oehler


2002-01-03 19:33:50

by Keith Owens

Subject: Re: kernel 2.4.17 crashes on SCSI-errors

On Thu, 03 Jan 2002 14:39:02 +0100 (MET),
[email protected] wrote:
>Ksymoops was not possible, because after rebooting the
>memory/module-layout had changed. (Or is there a trick
>I don't know?)

/var/log/ksymoops. man insmod, look for ksymoops assistance.

2002-01-03 21:33:35

by Andrew Morton

Subject: Re: kernel 2.4.17 crashes on SCSI-errors

Alan Cox wrote:
>
>
> BUG trap. Turn on verbose bug reporting,

Boy, was that ever a dumb idea. Rod. Back. Pain.

-

2002-01-04 08:33:27

by Ralf Oehler

Subject: Re: kernel 2.4.17 crashes on SCSI-errors


On 03-Jan-2002 Keith Owens wrote:
> On Thu, 03 Jan 2002 14:39:02 +0100 (MET),
> [email protected] wrote:
>>Ksymoops was not possible, because after rebooting the
>>memory/module-layout had changed. (Or is there a trick
>>I don't know?)
>
> /var/log/ksymoops. man insmod, look for ksymoops assistance.
>

Thanks a lot, I'll try it for the next crash.
But for now, I think, the output of the SGI debugger I sent
to the list shows the same.

kernel BUG at /usr/src/linux-2.4.17-Dbg/include/asm/pci.h:147!
from [aic7xxx]ahc_linux_run_device_queue+0x39d



Is there anything more I can do?
Regards,
Ralf




2002-01-04 08:46:37

by Jens Axboe

Subject: Re: kernel 2.4.17 crashes on SCSI-errors

On Fri, Jan 04 2002, [email protected] wrote:
>
> On 03-Jan-2002 Keith Owens wrote:
> > On Thu, 03 Jan 2002 14:39:02 +0100 (MET),
> > [email protected] wrote:
> >>Ksymoops was not possible, because after rebooting the
> >>memory/module-layout had changed. (Or is there a trick
> >>I don't know?)
> >
> > /var/log/ksymoops. man insmod, look for ksymoops assistance.
> >
>
> Thanks a lot, I'll try it for the next crash.
> But for now, I think, the output of the SGI debugger I sent
> to the list shows the same.
>
> kernel BUG at /usr/src/linux-2.4.17-Dbg/include/asm/pci.h:147!
> from [aic7xxx]ahc_linux_run_device_queue+0x39d

aic7xxx is calling pci_map_sg on either an uninitialized scatterlist, or
maybe just specifying too many segments. try adding a printk to print
'i' before the BUG() at line 147 in include/asm-i386/pci.h
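
[A minimal user-space sketch of the kind of per-entry sanity check being discussed. It assumes each scatterlist entry must carry exactly one of a virtual address or a page reference; the struct and function names (toy_sg, toy_first_bad_entry) are hypothetical and are not copied from the kernel header.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified scatterlist entry (not the kernel's struct). */
struct toy_sg {
    void *address;   /* virtual address of the buffer, or NULL */
    void *page;      /* page reference, or NULL */
};

/*
 * Return the index of the first entry that does not have exactly one
 * of address/page set, or -1 if all nents entries pass the check.
 */
static int toy_first_bad_entry(const struct toy_sg *sg, int nents)
{
    int i;

    assert(sg != NULL);
    for (i = 0; i < nents; i++) {
        int has_addr = sg[i].address != NULL;
        int has_page = sg[i].page != NULL;

        if (has_addr == has_page)   /* both set, or neither set */
            return i;
    }
    return -1;
}
```

[If the caller claims more entries than were actually filled in, the function returns the index of the first empty entry, which is the value a debug print of 'i' would show at the point of failure.]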

--
Jens Axboe

2002-01-04 09:44:32

by Jens Axboe

Subject: Re: kernel 2.4.17 crashes on SCSI-errors

On Thu, Jan 03 2002, Andrew Morton wrote:
> Alan Cox wrote:
> >
> >
> > BUG trap. Turn on verbose bug reporting,
>
> Boy, was that ever a dumb idea. Rod. Back. Pain.

Couldn't agree more...

--
Jens Axboe

2002-01-04 09:49:32

by Ralf Oehler

Subject: Re: kernel 2.4.17 crashes on SCSI-errors


On 04-Jan-2002 Jens Axboe wrote:
>> kernel BUG at /usr/src/linux-2.4.17-Dbg/include/asm/pci.h:147!
>> from [aic7xxx]ahc_linux_run_device_queue+0x39d
>
> aic7xxx is calling pci_map_sg on either an uninitialized scatterlist, or
> maybe just specifying too many segments. try adding a printk to print
> 'i' before the BUG() at line 147 in include/asm-i386/pci.h

Line 147 now reads: {printk("nents=%d, i=%d\n",nents,i); BUG();}
and syslog buf yields:

<4>Incorrect number of segments after building list.
<4>nents=3, i=1.
<4>kernel BUG at /usr/src/linux-2.4.17-Dbg/include/asm/pci.h:147!


Regards,
Ralf
