2001-02-13 16:14:53

by Ralf Oehler

[permalink] [raw]
Subject: Kernel panic: Ththththaats all folks

Dear mailing-list,

Currently I'm working on a device-virtualizing robotic driver for HP jukeboxes
(magneto-optical media) an a WORM filesystem for linux.

Both work really nice now, but it seems there are some weak spots in the SCSI
subsystem which I cannot work around.
The worst situation I encounter sometimes (reproducible within about 48 hours)
always ends up in a kernel panic with the following call chain.

My system:
Pentium-II, 350 MHz, 64MB
Adaptec AHA-2940A (-> aic7xxx.o)
kernel: 2.4.0
patch: SGI-debugger (kdb-v1.7-2.4.0)

Some more details:
My jukebox consists of a picker device (sg.o) and some MO-drives (sd_mod.o)
The sg.o attaches to all these devices, while sd attaches to the drives only.
My jukebox driver registers as scsi-disk-device, moves a medium to a free drive
(via sg), opens sd for the drive and further maps
( see ll_rw_blk.c: generic_make_request() ) the requests to sd.
For testing I run more copy-processes to mounted virtual jukebox-disks
than there are physical drives. So the picker-thread alternatingly moves the
involved media between their storage-slots the drives while the copy-processes
do stop-and-go.
It works well, just until SCSI timeouts occur...

The kernel-panic occurs, independently of the presence of tagged queuing, some time
(seconds or minutes) after the occurence of disconnected timeouts for the
SCSI-READ/WRITE commands to the MO-drives (by sd_mod.o).
I see some messages about bus-resets in the syslog, then the kernel stops.

Can anybody help ...?



----------------------------------------------------------
Master Resource Control: runlevel 3 has been reached
....and some hours later....
Kernel panic: Ththththaats all folks. Too dangerous to continue.
In interrupt handler - not syncing
SysRq: Emergency Sync

Entering kdb (current=0xc3884000, pid 177) Panic: invalid operand
due to panic @ 0xc0110064
eax = 0x0000001b ebx = 0xc3885bbc ecx = 0xc0292c48 edx = 0x00000001
esi = 0xc11fb264 edi = 0xc3884000 esp = 0xc3885b78 eip = 0xc0110064
ebp = 0xc3885ba8 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010202
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc3885b44
kdb> bt
EBP EIP Function(args)
0xc3885ba8 0xc0110064 schedule+0x388
kernel .text 0xc0100000 0xc010fcdc 0xc0110070
0xc012e33a __wait_on_buffer+0x6a (0xc11fb264)
kernel .text 0xc0100000 0xc012e2d0 0xc012e360
0xc3885bf0 0xc012e3e3 sync_buffers+0x83 (0x302, 0x1)
kernel .text 0xc0100000 0xc012e360 0xc012e500
0xc3885c04 0xc012e552 fsync_dev+0x2a (0x302)
kernel .text 0xc0100000 0xc012e528 0xc012e55c
0xc3885c1c 0xc0195f7a go_sync+0x112 (0xc11f8a00, 0x0)
kernel .text 0xc0100000 0xc0195e68 0xc0195f90
0xc3885c34 0xc0195fd8 do_emergency_sync+0x48
kernel .text 0xc0100000 0xc0195f90 0xc0196040
0xc3885c40 0xc0111ffe panic+0xde (0xc491b4a0)
kernel .text 0xc0100000 0xc0111f20 0xc0112000
0xc3885c58 0xc490f939 [scsi_mod]dump_stats+0x79 (0xc2d59aac, 0x1, 0x0, 0x20, 0xc491b600)
scsi_mod .text 0xc4908060 0xc490f8c0 0xc490f93c
0xc3885c8c 0xc4910e9e [scsi_mod]scsi_init_io_vc+0x12a (0xc2d59a00)
scsi_mod .text 0xc4908060 0xc4910d74 0xc4910ec8
0xc3885cb8 0xc490f78c [scsi_mod]scsi_request_fn+0x238 (0xc3a0e49c)
scsi_mod .text 0xc4908060 0xc490f554 0xc490f864
0xc3885cd4 0xc490eebb [scsi_mod]scsi_queue_next_request+0x3f (0xc3a0e49c, 0xc2d59a00)
more>
scsi_mod .text 0xc4908060 0xc490ee7c 0xc490ef84
0xc3885cf0 0xc490f063 [scsi_mod]__scsi_end_request+0xdf (0xc2d59a00, 0x0, 0x0, 0x1, 0x1)
scsi_mod .text 0xc4908060 0xc490ef84 0xc490f0ac
0xc3885d0c 0xc490f0c4 [scsi_mod]scsi_end_request_Rac26c809+0x18 (0xc2d59a00, 0x0, 0x2)
scsi_mod .text 0xc4908060 0xc490f0ac 0xc490f0c8
0xc3885d3c 0xc490f504 [scsi_mod]scsi_io_completion_R2f9973d3+0x38c (0xc2d59a00, 0x0, 0x1)
scsi_mod .text 0xc4908060 0xc490f178 0xc490f510
0xc3885d6c 0xc4964d0a [sd_mod]rw_intr+0x1ee (0xc2d59a00)
sd_mod .text 0xc4964060 0xc4964b1c 0xc4964d14
0xc3885d98 0xc490e636 [scsi_mod]scsi_old_done+0x596 (0xc2d59a00)
scsi_mod .text 0xc4908060 0xc490e0a0 0xc490e644
0xc3885da8 0xc49423d0 [aic7xxx]aic7xxx_done_cmds_complete+0x2c (0xc3a48078)
aic7xxx .text 0xc4940060 0xc49423a4 0xc49423e0
0xc3885dc0 0xc494ca56 [aic7xxx]do_aic7xxx_isr+0x72 (0xb, 0xc3a48078, 0xc3885e0c)
aic7xxx .text 0xc4940060 0xc494c9e4 0xc494ca70
0xc3885de0 0xc010a2e6 handle_IRQ_event+0x2e (0xb, 0xc3885e0c, 0xc390db04)
kernel .text 0xc0100000 0xc010a2b8 0xc010a314
0xc3885e04 0xc010a456 do_IRQ+0x6a (0x32, 0xc383c978, 0xc383c000, 0xc383c000, 0x32)
kernel .text 0xc0100000 0xc010a3ec 0xc010a498
0xc0108e64 ret_from_intr
kernel .text 0xc0100000 0xc0108e64 0xc0108e84
Interrupt registers:
eax = 0x00001000 ebx = 0x00000032 ecx = 0xc383c978 edx = 0xc383c000
more>
esi = 0xc383c000 edi = 0x00000032 esp = 0xc3885e40 eip = 0xc018ea12
ebp = 0xc3885e40 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00000246
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffff0b &regs = 0xc3885e0c
0xc018ea12 con_write_room+0x1a (0xc383c000)
kernel .text 0xc0100000 0xc018e9f8 0xc018ea18
0xc3885ea8 0xc01819f1 opost_block+0x15 (0xc383c000, 0x8052638, 0x32, 0x32, 0xc383c000)
kernel .text 0xc0100000 0xc01819dc 0xc0181b54
0xc3885ef8 0xc018391c write_chan+0x158 (0xc383c000, 0xc38ce404, 0x8052638, 0x32)
kernel .text 0xc0100000 0xc01837c4 0xc01839ac
0xc017f7ea tty_write+0x14a (0xc38ce404, 0x8052638, 0x32, 0xc38ce424)
kernel .text 0xc0100000 0xc017f6a0 0xc017f858
0xc3885f9c 0xc012d915 do_readv_writev+0x1a1 (0x0, 0xc38ce404, 0xbfffef64, 0x1)
kernel .text 0xc0100000 0xc012d774 0xc012d984
0xc3885fbc 0xc012da1d sys_writev+0x41 (0x1, 0xbfffef64, 0x6, 0x6, 0x1)
kernel .text 0xc0100000 0xc012d9dc 0xc012da34
0xc0108dbb system_call+0x33
kernel .text 0xc0100000 0xc0108d88 0xc0108dc0
kdb>



Regards,

Ralf Oehler


*********************************************************************
* *
* Ralf Oehler GDI *
* *
* E-Mail: [email protected] Gesellschaft fuer Digitale *
* Tel.: +49 6182-9271-23 Informationstechnik mbH *
* Fax.: +49 6182-25035 Bensbruchstrasse 11 *
* 63533 Mainhausen *
* *
*********************************************************************


2001-02-13 18:46:34

by Chiaki

[permalink] [raw]
Subject: Re: Kernel panic: Ththththaats all folks

> My system:
> Pentium-II, 350 MHz, 64MB
> Adaptec AHA-2940A (-> aic7xxx.o)
> kernel: 2.4.0
> patch: SGI-debugger (kdb-v1.7-2.4.0)
>
> Some more details:
> My jukebox consists of a picker device (sg.o) and some MO-drives (sd_mod.o)
> The sg.o attaches to all these devices, while sd attaches to the drives only.
> My jukebox driver registers as scsi-disk-device, moves a medium to a free drive
> (via sg), opens sd for the drive and further maps
> ( see ll_rw_blk.c: generic_make_request() ) the requests to sd.
> For testing I run more copy-processes to mounted virtual jukebox-disks
> than there are physical drives. So the picker-thread alternatingly moves the
> involved media between their storage-slots the drives while the copy-processes
> do stop-and-go.
> It works well, just until SCSI timeouts occur...
>
> The kernel-panic occurs, independently of the presence of tagged queuing, some time
> (seconds or minutes) after the occurence of disconnected timeouts for the
> SCSI-READ/WRITE commands to the MO-drives (by sd_mod.o).
> I see some messages about bus-resets in the syslog, then the kernel stops.
>
> Can anybody help ...?

I used to see kernel panic when I used a 7 CD changer
(Nakamichi MBR-7) if I mount two CDs and
tried to access them simultaneously.
It was during 2.2.xx, and 2.3.yy development.
I can't recall what happened during 2.0.zz.

My problem with just 2 CDs mounting seems to have been
solved in the latest 2.4.1 kernel and the driver.
I use tmscsim (DC390) driver.
Admittedly, just two CDs in a seven CD changer drive
is not a heavy workload in comparison to yours.
So the bug is still lurking somewhere but is now harder
to spot by casual users.

I vaguely recall one thing.
During the earlier 2.3.1xx or even during 2.2.yy development and
debugging efforts.
some timeout values in the driver and SCSI subsystem
were tweaked to see if the change would affect
the behavior of the drive. I remember
the symptom or the frequencey of the kenrel hung
were lessened.
The hung never disappered completely, though
in my setting. (However, at certain point in
time during 2.3.1xx
development, using BusLogic Flashpoint
card and the buslogic driver
somehow eliminated the kernel panic I saw with DC390
card with tmscsim driver. So I figured that a very subtle
interaction of timeout values and such were responsible
for the problem. I tried to see if I could find the
difference of timeout values between tmscsim and buslogic
driver, but didn't follow through due to lack of time then.
My guess is that it must have been the timeout value setting
as well as the use of new and old scsi error handler code back then.)

It might be that you need to see if there are
hard-coded timeout values in aic7xxx.o and
see if they could be LENGTHENED without breaking
scsi compatibility with other devices: but you don't
have other devices on the same scsi chain, do you?

When these timeout occurs and resets are invoked,
the old scsi error handler can get into a big trouble
sometimes, and it has been a constant irritation for
scsi users with esoteric devices such as yours,
slow-starting scanners, or CD changers like mine.

It seems that old SCSI standards never mentioned
anything about device startup time or
dead time during transition of the media into the
drive, or for that matter, the time it takes DAT drives
to rewind the tape and so the
application and/or driver writer needs to
take care of these long silent/dead time without causing
timeout (and yet not miss the real time out from other
faster devices.)

BTW, I was pleasantly surprised to see the nicely formatted
dump of registers and such. It seems that kdb is really
useful for the type of analysis at hand!

Happy Hacking,

Chiaki Ishikawa