-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi!

Anders Karlsson wrote:
> On Wed, 2003-06-11 at 21:48, Marcelo Tosatti wrote:
>
>>On Sat, 7 Jun 2003, Andreas Haumer wrote:
>
> [snip]
>
>>>I had this system running under heavy load for about 24 hours
>>>without problems. I then stopped the stress testing, and had
>>>several system freezes since then.
>>>
>>>With system freeze I mean:
>>>
>>>*) machine doesn't answer to ping, no reaction to console
>>> keyboard, no message on the console screen, no message
>>> in logfile, no oops, no noticeable system activity
>
>
> I have this problem without actually stressing the machine too hard. The
> average load on my Thinkpad over a weekend would perhaps be 0.05, yet I
> can have several hard hangs where there seems to be no trace of a hang
> at all in logfiles.
>
I have to admit that "system freeze" is a quite unspecific
symptom. It could have a zillion of different reasons.

In my case I'm currently chasing SCSI errors which I think
could have something to do with it (besides, it's _not_ an Adaptec
controller, but a LSI 53c1030 with Fusion MPT driver... :-)

In my server logs I sometimes see SCSI timeouts like this:

[...]
scsi : aborting command due to timeout : pid 1148093, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 00 0f af 00 00 10 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=dfca8e00)
IOs outstanding = 3
mptscsih: ioc0: Issue of TaskMgmt Successful!
SCSI host 0 abort (pid 1148093) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=dfca8e00)
IOs outstanding = 4
SCSI Error Report =-=-= (0:0:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 2A 00 00 3C 4D 78 00 00 02 00 - "WRITE(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:1:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 00 00 0F AF 00 00 10 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:2:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 00 4E 0A 37 00 00 08 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:3:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 03 B0 08 6F 00 00 08 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
[...]

There are 4 hot swap SCSI disks in the server, and all of them
eventually report those timeouts (so it's not specific to a single
disk)
I already replaced cabling, tried a different hot swap (SCA)
cage, and I'm now trying to replace the disks one by one to
eventually find the culprit.

There are two problems with this approach:

1.) After each change I have to wait several hours up to two
days for a SCSI timeout to occur as I can not reproduce
the problem at will.

2.) I'm not _sure_ if those SCSI timeouts are related to the server
freeze symptoms I see. It's just an assumption.
IMHO it could work as follows: SCSI timeouts occure somtimes.
The driver then aborts the command and resets the SCSI bus
to get it into a sane state again. But what if the bus reset
doesn't work as expected and the bus remains unusable for a
while? Could this bring the whole system into this "freeze"
state (the system is still running, but everything waits for
the SCSI bus to recover)? Could this explain the symptom of
those big delays of ICMP ping answer messages I saw?

So the most precious resource for chasing this problem is time,
and this is also the resource which I don't have available as
much as I'd like to... :-(

>
>>Maybe the NMI oopser helps?
>
>
> Marcelo, where can I get hold of this and would there be documentation
> included with it for how to install/use it?
>
Look at /usr/src/linux/Documentation/nmi_watchdog.txt

Regards,

- - andreas

- --
Andreas Haumer | mailto:[email protected]
*x Software + Systeme | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+6El7xJmyeGcXPhERAqykAKCumORTm/lDofkrg52FX33rOfgC/ACeNxR7
l9/znrbi0lZoR/zw+LTdNhI=
=W7Gt
-----END PGP SIGNATURE-----