2001-03-24 20:07:38

by Pete Toscano

Subject: Constant Crash in scsi_eh_0

Hello,

I'm currently running 2.4.3-pre4. (I tried 2.4.3-pre6, but it wouldn't
boot. I'm about to try -pre7.) This seemed worse with 2.4.2, but it's
still a problem.

My system's about as stable as Crispin Glover after a week-long meth
binge. =8]

I'm running an SMP system (dual P3 600s) with 640M of RAM. It's using a
Tyan Tiger 133 motherboard (VIA Apollo Pro 133a chipset). It's got 2
ATA66 IDE drives, an IDE CD-ROM, and a SCSI CD burner (connected to an
Adaptec 2940 adaptor). I also have a few USB devices: mouse, rio500,
and Sandisk SDDR-31 compact flash reader. The burner and CF reader use
SCSI. These are the only two SCSI-related devices on my system (that
I'm aware of, at least).

In every crash that I remember, I was not using either the CF reader or
the burner. The usb-storage and scsi_mod modules were loaded by the
hotplug driver (version 2001_02_28), but none of these were in use.

I do have KDB compiled into the kernel, so I was able to get some
debugging info captured via a serial console. Unfortunately, I don't
know what would be useful and what's not that useful. Anyway, I've
attached the log. If there's other information that'd be good to have,
please let me know and I'll try to get it *sigh* the next time my
machine crashes. As far as I can tell, something bad happens in
scsi_eh_0 every time.

Also attached is my config file.

Please let me know if there's anything I can do to try to fix the
problem. I'm not averse to trying experimental patches.

pete


Attachments:
(No filename) (0.00 B)
(No filename) (232.00 B)

2001-03-24 23:54:01

by Keith Owens

Subject: Re: Constant Crash in scsi_eh_0

On Sat, 24 Mar 2001 15:06:23 -0500,
Pete Toscano <[email protected]> wrote:
>[0]kdb> btp 862
> EBP EIP Function(args)
>0xe2bdbf6c 0xc011526a schedule+0x41e (0xe2ce0960, 0xe2bda000)
>0xe2bdbf9c 0xc0107bb8 __down_interruptible+0x94
>0xe2bdbfac 0xc0107c96 __down_failed_interruptible+0xa (0x100, 0xe2c9dd14, 0xe2c9dd6c, 0xe2bdbfd8, 0x0)
> 0xeaf94d7f [scsi_mod].text.lock+0x1fb
>0xe2bdbfec 0xeaf90281 [scsi_mod]scsi_error_handler+0x101
> 0xc0107547 kernel_thread+0x23

scsi_error_handler has tried to get a lock and somebody else has
already got it and is not letting go. It is not clear from the source
of scsi_error_handler which lock is the problem.

objdump -S --start-address=0xeaf90180 --end-address=0xeaf902f0 vmlinux

will disassemble the scsi_error_handler routine. The object code will
probably mean something to the scsi maintainers.
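
The start address in that command follows from the backtrace: the frame
is reported as scsi_error_handler+0x101 at EIP 0xeaf90281, so the
routine begins at 0xeaf90281 - 0x101. A minimal Python sketch of that
arithmetic follows. The helper names are hypothetical, and the 0x170
byte length is simply the span used in the command above, not something
derived from the symbol table.

```python
def function_start(eip: int, offset: int) -> int:
    """Start address of a routine, given a frame's EIP and its symbol+offset."""
    return eip - offset


def objdump_range(eip: int, offset: int, length: int) -> str:
    """Build the objdump address-range options for the routine containing eip.

    `length` is an assumed size for the routine in bytes; take the real
    value from the symbol table when it is available.
    """
    start = function_start(eip, offset)
    return f"--start-address={start:#x} --end-address={start + length:#x}"


# Frame from the backtrace: [scsi_mod]scsi_error_handler+0x101 at 0xeaf90281
print(hex(function_start(0xeaf90281, 0x101)))
print(objdump_range(0xeaf90281, 0x101, 0x170))
```

The same subtraction works for any other frame in a kdb backtrace when
you want to disassemble the routine around it.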

The trick is to find out which routine is holding the lock. It could
be an active routine or it could be caused by code that failed to
release a lock when it should. To check for active routines, in kdb

set BTSECT=0
bta

That will do a backtrace on every process, without the section lines.
Look for any other process with scsi code in its backtrace; any such
process is suspect.
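
If the bta output is captured over the serial console, the search for
scsi frames can be scripted. The sketch below is only a rough aid and
rests on an assumption about the log format: that each per-process
trace starts with a header line like "Stack traceback for pid N".
Adjust the HEADER pattern if your kdb version prints something else.
The function name and the sample text are made up for illustration; the
sample reuses two frames from the backtrace quoted above.

```python
import re

# Assumption: each per-process trace begins with a header line of this form.
HEADER = re.compile(r"Stack traceback for pid (\d+)")


def pids_with_scsi_frames(log_text, needle="scsi", exclude=()):
    """Return pids whose backtrace has a frame mentioning `needle`.

    Pids listed in `exclude` (for example the scsi_eh_0 thread itself)
    are skipped.
    """
    suspects = []
    pid = None
    for line in log_text.splitlines():
        match = HEADER.search(line)
        if match:
            # A new per-process section starts here.
            pid = int(match.group(1))
            continue
        if pid is None or pid in exclude or pid in suspects:
            continue
        if needle in line.lower():
            suspects.append(pid)
    return suspects


# Hypothetical sample in the assumed format, not real captured output.
sample = """\
Stack traceback for pid 1
0xc1234f6c 0xc011526a schedule+0x41e
Stack traceback for pid 862
0xe2bdbf6c 0xc011526a schedule+0x41e
0xe2bdbfec 0xeaf90281 [scsi_mod]scsi_error_handler+0x101
"""

print(pids_with_scsi_frames(sample))
print(pids_with_scsi_frames(sample, exclude=(862,)))
```

Passing the pid of scsi_eh_0 in `exclude` leaves only the other
processes that have scsi code on their stacks, which are the ones worth
reporting to the scsi maintainers.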

kdb can help diagnose the problem but the fix will have to come from
the scsi maintainers.