2003-07-29 07:40:45

by Ville Herva

[permalink] [raw]
Subject: 2.4.21-jam1, aic7xxx-6.2.36: solid hangs, crashes on boot

After about a year of stable operation, a server begun acting up. First it
begun hanging up solid during the nightly oracle backup (that had run
successfully for a year), the I got some aic7xxx-related crashes on boot.

Initially, the box ran 2.4.20pre7 kernel with aic7xxx version 6.4.8. When
the hangs started happening, I upgraded to 2.4.21-jam1 (basically 2.4.21
vanilla + -aa patch + some minor stuff) that includes aic7xxx version 6.2.36.
It did not help.

I enabled kmsgdump and nmi watchdog, but when the box hangs, it hangs solid:
no ctrl-alt-del, no caps lock led, no alt-sysrq-b, no kmsgdump, nmi watchdog
doesn't trigger. Only the cursor on the console blinks, but no messages from
the kernel appear. (Apart from "spurious 8259A interrupt: IRQ7." that
always happens sometime after boot on this box, but way before the hang.)

After upgrading to 2.4.21-jam1, the box usually boots up fine, but twice
I've gotten the crash below.

After a reset, the box boots up fine (fsck goes through, no corruption), but
it seems little more prone to lock up soon after a hang-and-reset. I made it
hang three times on a row by just compiling kernel, but then on fourth boot,
no amount of IO abuse got it on its knees -- until after a week it hung
during the nightly backup.

Any ideas on how to to debug this kind of hang? Does it sound kernel/driver
or hw related? Are the two crashes related to the hang? Is the hang related
to aic7xxx?

One more detail: with the 2.4.20pre7/aic7xxx-6.2.8 kernel, I got "Panic:
HOST_MSG_LOOP with invalid SCB 0" crashes every now and then. Justin Gibbs
said: "it looks like memory mapped I/O simply does not work reliably on this
board", and recommended forcing programmed I/O (by undefining MMAPIO from
aic7xxx_osm_pci.c). That seemed to cure the problems -- until now.

linux-2.2.18pre18 + aic7xxx-5.1.31 was rock stable on this box.


-- v --

[email protected]


Hardware:
---------------------------------------------------------------------------
Intel 815EEA2LU (i815 Chipset)
Celeron 1.3GHz (Tualatin)
Adaptec AHA-2940 / AIC-7871
- Disk (rootfs) SEAGATE Model: ST19171W Rev: 0024
- Tape Drive HP Model: C1537A Rev: L708
30GB IDE disk (scratch)
---------------------------------------------------------------------------



Dump of crash on boot:
---------------------------------------------------------------------------
<4> 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4>Pending list:
<4> 4 SCB_CONTROL[0x74] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 0 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4>Kernel Free SCB list: 8 7 6 9 1 2 5
<4>DevQ(0:0:0): 0 waiting
<4>DevQ(0:2:0): 0 waiting
<4>
<4><<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
<4>scsi0:0:0:0: Device is active, asserting ATN
<4>Recovery code sleeping
<4>Recovery code awake
<4>Timer Expired
<4>aic7xxx_abort returns 0x2003
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x28 0x0 0x0 0xbe 0x12 0x96 0x0 0x0 0x18 0x0
<4>scsi0:0:0:0: Command not found
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x0 0x0 0x0 0x0 0x0 0x0
<4>scsi0: At time of recovery, card was not paused
<4>>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
<4>scsi0: Dumping Card State in Command phase, at SEQADDR 0x36
<4>Card was paused
<4>ACCUM = 0x80, SINDEX = 0xac, DINDEX = 0xc0, ARG_2 = 0x0
<4>HCNT = 0x0 SCBPTR = 0x0
<4>SCSISIGI[0x96] ERROR[0x0] SCSIBUSL[0x80] LASTPHASE[0x80]
<4>SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10]
<4>SEQ_FLAGS[0x0] SSTAT0[0x5] SSTAT1[0x3] SSTAT2[0x0]
<4>SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xac] SXFRCTL0[0x88]
<4>DFCNTRL[0x6] DFSTATUS[0x48]
<4>STACK: 0x0 0x166 0x196 0x35
<4>SCB count = 10
<4>Kernel NEXTQSCB = 8
<4>Card NEXTQSCB = 4
<4>QINFIFO entries: 4 3
<4>Waiting Queue entries:
<4>Disconnected Queue entries: 1:0
<4>QOUTFIFO entries:
<4>Sequencer Free SCB List: 3 2 6 4 7 5 8 9 10 11 12 13 14 15
<4>Sequencer SCB Info:
<4> 0 SCB_CONTROL[0x0] SCB_SCSIID[0x0] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 1 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 2 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 3 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 4 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 5 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 6 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 7 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4>Pending list:
<4> 3 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 4 SCB_CONTROL[0x74] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 0 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4>Kernel Free SCB list: 7 6 9 1 2 5
<4>DevQ(0:0:0): 0 waiting
<4>DevQ(0:2:0): 0 waiting
<4>
<4><<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
<4>scsi0:0:0:0: Cmd aborted from QINFIFO
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x2a 0x0 0x0 0x0 0x0 0x3f 0x0 0x0 0x10 0x0
<4>scsi0:0:0:0: Command not found
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x0 0x0 0x0 0x0 0x0 0x0
<4>scsi0: At time of recovery, card was not paused
<4>>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
<4>scsi0: Dumping Card State in Command phase, at SEQADDR 0x1a3
<4>Card was paused
<4>ACCUM = 0x80, SINDEX = 0xa2, DINDEX = 0xc0, ARG_2 = 0x0
<4>HCNT = 0x0 SCBPTR = 0x0
<4>SCSISIGI[0x96] ERROR[0x0] SCSIBUSL[0x0] LASTPHASE[0x80]
<4>SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10]
<4>SEQ_FLAGS[0x0] SSTAT0[0x5] SSTAT1[0x3] SSTAT2[0x0]
<4>SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xac] SXFRCTL0[0x80]
<4>DFCNTRL[0x34] DFSTATUS[0x68]
<4>STACK: 0xaf 0x0 0x166 0x196
<4>SCB count = 10
<4>Kernel NEXTQSCB = 3
<4>Card NEXTQSCB = 8
<4>QINFIFO entries: 8 4
<4>Waiting Queue entries:
<4>Disconnected Queue entries: 1:0
<4>QOUTFIFO entries:
<4>Sequencer Free SCB List: 3 2 6 4 7 5 8 9 10 11 12 13 14 15
<4>Sequencer SCB Info:
<4> 0 SCB_CONTROL[0x0] SCB_SCSIID[0x0] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 1 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 2 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 3 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 4 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 5 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 6 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 7 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4>Pending list:
<4> 4 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 8 SCB_CONTROL[0x74] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 0 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4>Kernel Free SCB list: 7 6 9 1 2 5
<4>DevQ(0:0:0): 0 waiting
<4>DevQ(0:2:0): 0 waiting
<4>
<4><<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
<4>scsi0:0:0:0: Cmd aborted from QINFIFO
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x2a 0x0 0x0 0x0 0x4 0x2f 0x0 0x0 0x10 0x0
<4>scsi0:0:0:0: Command not found
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x0 0x0 0x0 0x0 0x0 0x0
<4>scsi0: At time of recovery, card was not paused
<4>>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
<4>scsi0: Dumping Card State in Command phase, at SEQADDR 0x1a4
<4>Card was paused
<4>ACCUM = 0x80, SINDEX = 0xa9, DINDEX = 0xc0, ARG_2 = 0x0
<4>HCNT = 0x0 SCBPTR = 0x0
<4>SCSISIGI[0x96] ERROR[0x0] SCSIBUSL[0x0] LASTPHASE[0x80]
<4>SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10]
<4>SEQ_FLAGS[0x0] SSTAT0[0x5] SSTAT1[0x3] SSTAT2[0x0]
<4>SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xac] SXFRCTL0[0x80]
<4>DFCNTRL[0x34] DFSTATUS[0x48]
<4>STACK: 0xb0 0x0 0x166 0x196
<4>SCB count = 10
<4>Kernel NEXTQSCB = 4
<4>Card NEXTQSCB = 3
<4>QINFIFO entries: 3 8
<4>Waiting Queue entries:
<4>Disconnected Queue entries: 1:0
<4>QOUTFIFO entries:
<4>Sequencer Free SCB List: 3 2 6 4 7 5 8 9 10 11 12 13 14 15
<4>Sequencer SCB Info:
<4> 0 SCB_CONTROL[0x0] SCB_SCSIID[0x0] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 1 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 2 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 3 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 4 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 5 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 6 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 7 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4>Pending list:
<4> 8 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 3 SCB_CONTROL[0x74] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 0 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4>Kernel Free SCB list: 7 6 9 1 2 5
<4>DevQ(0:0:0): 0 waiting
<4>DevQ(0:2:0): 0 waiting
<4>
<4><<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
<4>scsi0:0:0:0: Cmd aborted from QINFIFO
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x2a 0x0 0x0 0x4 0x0 0x5f 0x0 0x0 0x10 0x0
<4>scsi0:0:0:0: Command not found
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x0 0x0 0x0 0x0 0x0 0x0
<4>scsi0: At time of recovery, card was not paused
<4>>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
<4>scsi0: Dumping Card State in Command phase, at SEQADDR 0x16f
<4>Card was paused
<4>ACCUM = 0x80, SINDEX = 0xac, DINDEX = 0xc0, ARG_2 = 0x0
<4>HCNT = 0x0 SCBPTR = 0x0
<4>SCSISIGI[0x96] ERROR[0x0] SCSIBUSL[0x80] LASTPHASE[0x80]
<4>SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10]
<4>SEQ_FLAGS[0x0] SSTAT0[0x5] SSTAT1[0x3] SSTAT2[0x0]
<4>SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xac] SXFRCTL0[0x88]
<4>DFCNTRL[0x6] DFSTATUS[0x48]
<4>STACK: 0x35 0x0 0x166 0x196
<4>SCB count = 10
<4>Kernel NEXTQSCB = 8
<4>Card NEXTQSCB = 4
<4>QINFIFO entries: 4 3
<4>Waiting Queue entries:
<4>Disconnected Queue entries: 1:0
<4>QOUTFIFO entries:
<4>Sequencer Free SCB List: 3 2 6 4 7 5 8 9 10 11 12 13 14 15
<4>Sequencer SCB Info:
<4> 0 SCB_CONTROL[0x0] SCB_SCSIID[0x0] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 1 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 2 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 3 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 4 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 5 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 6 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 7 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4>Pending list:
<4> 3 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 4 SCB_CONTROL[0x74] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 0 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4>Kernel Free SCB list: 7 6 9 1 2 5
<4>DevQ(0:0:0): 0 waiting
<4>DevQ(0:2:0): 0 waiting
<4>
<4><<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
<4>scsi0:0:0:0: Cmd aborted from QINFIFO
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x2a 0x0 0x0 0x4 0x3 0xcf 0x0 0x0 0x8 0x0
<4>scsi0:0:0:0: Command not found
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue an ABORT message
<4>CDB: 0x0 0x0 0x0 0x0 0x0 0x0
<4>scsi0: At time of recovery, card was not paused
<4>>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
<4>scsi0: Dumping Card State in Command phase, at SEQADDR 0xa5
<4>Card was paused
<4>ACCUM = 0x80, SINDEX = 0xac, DINDEX = 0xc0, ARG_2 = 0x0
<4>HCNT = 0x0 SCBPTR = 0x0
<4>SCSISIGI[0x96] ERROR[0x0] SCSIBUSL[0x80] LASTPHASE[0x80]
<4>SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x0] SEQCTL[0x10]
<4>SEQ_FLAGS[0x0] SSTAT0[0x5] SSTAT1[0x3] SSTAT2[0x0]
<4>SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xac] SXFRCTL0[0x88]
<4>DFCNTRL[0x6] DFSTATUS[0x48]
<4>STACK: 0x0 0x166 0x196 0x35
<4>SCB count = 10
<4>Kernel NEXTQSCB = 3
<4>Card NEXTQSCB = 8
<4>QINFIFO entries: 8 4
<4>Waiting Queue entries:
<4>Disconnected Queue entries: 1:0
<4>QOUTFIFO entries:
<4>Sequencer Free SCB List: 3 2 6 4 7 5 8 9 10 11 12 13 14 15
<4>Sequencer SCB Info:
<4> 0 SCB_CONTROL[0x0] SCB_SCSIID[0x0] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 1 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 2 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 3 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 4 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 5 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 6 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 7 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4>Pending list:
<4> 4 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 8 SCB_CONTROL[0x74] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4> 0 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4>Kernel Free SCB list: 7 6 9 1 2 5
<4>DevQ(0:0:0): 0 waiting
<4>DevQ(0:2:0): 0 waiting
<4>
<4><<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
<4>scsi0:0:0:0: Cmd aborted from QINFIFO
<4>aic7xxx_abort returns 0x2002
<4>scsi0:0:0:0: Attempting to queue a TARGET RESET message
<4>CDB: 0x28 0x0 0x0 0xbe 0x12 0x5e 0x0 0x0 0x20 0x0
<4>aic7xxx_dev_reset returns 0x2003
<4>Recovery SCB completes
<4>Recovery SCB completes
<4>scsi0:A:0:0: ahc_intr - referenced scb not valid during seqint 0x71 scb(0)
<4>>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
<4>scsi0: Dumping Card State in Message-in phase, at SEQADDR 0x1bc
<4>Card was paused
<4>ACCUM = 0xc0, SINDEX = 0x71, DINDEX = 0x8c, ARG_2 = 0x0
<4>HCNT = 0x0 SCBPTR = 0x0
<4>SCSISIGI[0xe6] ERROR[0x0] SCSIBUSL[0x0] LASTPHASE[0xe0]
<4>SCSISEQ[0x12] SBLKCTL[0x2] SCSIRATE[0x42] SEQCTL[0x10]
<4>SEQ_FLAGS[0x40] SSTAT0[0x2] SSTAT1[0x3] SSTAT2[0x10]
<4>SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xac] SXFRCTL0[0x88]
<4>DFCNTRL[0x0] DFSTATUS[0x29]
<4>STACK: 0xff 0x0 0x166 0x18e
<4>SCB count = 10
<4>Kernel NEXTQSCB = 0
<4>Card NEXTQSCB = 193
<4>QINFIFO entries:
<4>Waiting Queue entries:
<4>Disconnected Queue entries:
<4>QOUTFIFO entries:
<4>Sequencer Free SCB List: 1 3 2 6 4 7 5 8 9 10 11 12 13 14 15
<4>Sequencer SCB Info:
<4> 0 SCB_CONTROL[0x80] SCB_SCSIID[0x0] SCB_LUN[0x0] SCB_TAG[0x0]
<4> 1 SCB_CONTROL[0x0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 2 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 3 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 4 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 5 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 6 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 7 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0xff]
<4> 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4> 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
<4>Pending list:
<4> 8 SCB_CONTROL[0x70] SCB_SCSIID[0x7] SCB_LUN[0x0]
<4>Kernel Free SCB list: 3 4 7 6 9 1 2 5
<4>DevQ(0:0:0): 0 waiting
<4>DevQ(0:2:0): 0 waiting
<4>
<4><<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
<0>Kernel panic: for safety
<0>In interrupt handler - not syncing
<4> <0>Dumping messages in 0 seconds : last chance for Alt-SysRq...
---------------------------------------------------------------------------




/proc/scsi/aic7xxx>cat 0
---------------------------------------------------------------------------
Adaptec AIC7xxx driver version: 6.2.36
Adaptec 2940 SCSI adapter
aic7870: Wide Channel A, SCSI Id=7, 16/253 SCBs
Allocated SCBs: 10, SG List Length: 102

Serial EEPROM:
0x0238 0x0218 0x0238 0x0238 0x0238 0x0238 0x0238 0x0238
0x0238 0x0238 0x0238 0x0238 0x0238 0x0238 0x0238 0x0238
0x0096 0x005c 0x2807 0xff10 0xffff 0xffff 0xffff 0xffff
0xffff 0xffff 0xffff 0xffff 0xffff 0xffff 0x00ff 0x4c5e

Target 0 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Goal: 20.000MB/s transfers (10.000MHz, offset 8, 16bit)
Curr: 20.000MB/s transfers (10.000MHz, offset 8, 16bit)
Channel A Target 0 Lun 0 Settings
Commands Queued 14441
Commands Active 0
Command Openings 8
Max Tagged Openings 8
Device Queue Frozen Count 0
Target 1 Negotiation Settings
User: 10.000MB/s transfers (10.000MHz, offset 127)
Target 2 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Goal: 10.000MB/s transfers (10.000MHz, offset 15)
Curr: 10.000MB/s transfers (10.000MHz, offset 15)
Channel A Target 2 Lun 0 Settings
Commands Queued 1
Commands Active 0
Command Openings 1
Max Tagged Openings 0
Device Queue Frozen Count 0
Target 3 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 4 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 5 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 6 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 7 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 8 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 9 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 10 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 11 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 12 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 13 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 14 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
Target 15 Negotiation Settings
User: 20.000MB/s transfers (10.000MHz, offset 127, 16bit)
---------------------------------------------------------------------------


2003-07-30 07:13:44

by Ville Herva

[permalink] [raw]
Subject: 2.4.22pre8 hangs too (Re: 2.4.21-jam1, aic7xxx-6.2.36: solid hangs)

On Tue, Jul 29, 2003 at 10:39:48AM +0300, you [Ville Herva] wrote:
> After about a year of stable operation, a server begun acting up. First it
> begun hanging up solid during the nightly oracle backup (that had run
> successfully for a year), the I got some aic7xxx-related crashes on boot.
>
> Initially, the box ran 2.4.20pre7 kernel with aic7xxx version 6.4.8. When
> the hangs started happening, I upgraded to 2.4.21-jam1 (basically 2.4.21
> vanilla + -aa patch + some minor stuff) that includes aic7xxx version 6.2.36.
> It did not help.
>
> I enabled kmsgdump and nmi watchdog, but when the box hangs, it hangs solid:
> no ctrl-alt-del, no caps lock led, no alt-sysrq-b, no kmsgdump, nmi watchdog
> doesn't trigger. Only the cursor on the console blinks, but no messages from
> the kernel appear. (Apart from "spurious 8259A interrupt: IRQ7." that
> always happens sometime after boot on this box, but way before the hang.)

Herbert P?tzl indicted that he'd had similar lockups with fairly similar hw
up until 2.4.22pre6. He suggested I should try 2.4.22pre8.

2.4.22pre8 locked up the same way in about 10 hours.

> Any ideas on how to to debug this kind of hang?

The question still stands; how do I debug this?

> Does it sound kernel/driver or hw related? Are the two crashes related to
> the hang? Is the hang related to aic7xxx?

Any ideas?



-- v --

[email protected]

2003-07-30 14:55:45

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1, aic7xxx-6.2.36: solid hangs)



On Wed, 30 Jul 2003, Ville Herva wrote:

> On Tue, Jul 29, 2003 at 10:39:48AM +0300, you [Ville Herva] wrote:
> > After about a year of stable operation, a server begun acting up. First it
> > begun hanging up solid during the nightly oracle backup (that had run
> > successfully for a year), the I got some aic7xxx-related crashes on boot.
> >
> > Initially, the box ran 2.4.20pre7 kernel with aic7xxx version 6.4.8. When
> > the hangs started happening, I upgraded to 2.4.21-jam1 (basically 2.4.21
> > vanilla + -aa patch + some minor stuff) that includes aic7xxx version 6.2.36.
> > It did not help.
> >
> > I enabled kmsgdump and nmi watchdog, but when the box hangs, it hangs solid:
> > no ctrl-alt-del, no caps lock led, no alt-sysrq-b, no kmsgdump, nmi watchdog
> > doesn't trigger. Only the cursor on the console blinks, but no messages from
> > the kernel appear. (Apart from "spurious 8259A interrupt: IRQ7." that
> > always happens sometime after boot on this box, but way before the hang.)
>
> Herbert P?tzl indicted that he'd had similar lockups with fairly similar hw
> up until 2.4.22pre6. He suggested I should try 2.4.22pre8.
>
> 2.4.22pre8 locked up the same way in about 10 hours.
>
> > Any ideas on how to to debug this kind of hang?
>
> The question still stands; how do I debug this?
>
> > Does it sound kernel/driver or hw related? Are the two crashes related to
> > the hang? Is the hang related to aic7xxx?
>
> Any ideas?

Ville,

Mind trying 2.4.22-pre8 without MMAPIO defined in the SCSI driver?

Justin, is this problem known to other boards or.. ?

2003-07-30 18:10:14

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1, aic7xxx-6.2.36: solid hangs)

On Wed, Jul 30, 2003 at 11:50:50AM -0300, you [Marcelo Tosatti] wrote:
>
> > Any ideas?
>
> Ville,
>
> Mind trying 2.4.22-pre8 without MMAPIO defined in the SCSI driver?

2.4.20pre7 (aic7xxx 6.2.8) that I initially saw the lockups with was
compiled with MMAPIO undefined. 2.4.21-jam1 (aic7xxx 6.2.36) and 2.4.22pre8
(aic7xxx 6.2.36) had it defined (the default). All of the three locked up
the same way. Hence, I think it's unlikely MMAPIO is the culprit.

However, I just realized that all of those kernel were compiled with fairly
dubious gcc, version 2.96-85. I just compiled otherwise identically
configured 2.4.21-jam1 with gcc-3.2.1-2. It'll take some time to tell
whether this cures it. This is my main suspect now.

> Justin, is this problem known to other boards or.. ?

The lockups may be completely unrelated to aic7xxx and the crashes on boot
that I posted kernel logs of. I don't know.


-- v --

[email protected]

2003-08-08 12:55:13

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1, aic7xxx-6.2.36: solid hangs)

On Wed, Jul 30, 2003 at 09:10:03PM +0300, you [Ville Herva] wrote:
>
> However, I just realized that all of those kernel were compiled with fairly
> dubious gcc, version 2.96-85. I just compiled otherwise identically
> configured 2.4.21-jam1 with gcc-3.2.1-2. It'll take some time to tell
> whether this cures it. This is my main suspect now.

Ok, the kernel compiled with gcc version 3.2.1 20021207 (Red Hat Linux 8.0
3.2.1-2) has now been up for more than a week. It seems stable, but I'm not
sure yet.

Which brings me to the question: which gcc version is considered most stable
for compiling 2.4.x these days?

README says:
"Make sure you have gcc 2.95.3 available. gcc 2.91.66 (egcs-1.1.2) may
also work but is not as safe, and *gcc 2.7.2.3 is no longer supported*"

And Documentation/Changes says:

"You may use gcc 3.0.x instead if you wish, although it may cause problems.
Later versions of gcc have not received much testing for Linux kernel
compilation, and there are almost certainly bugs (mainly, but not
exclusively, in the kernel) that will need to be fixed in order to use these
compilers."

and

"The Red Hat gcc 2.96 compiler subtree can also be used to build this tree.
You should ensure you use gcc-2.96-74 or later. gcc-2.96-54 will not build
the kernel correctly."

This seems to suggest 2.96-85 would be more stable than gcc-3.2.1-2. Is this
the case?

> > Justin, is this problem known to other boards or.. ?
>
> The lockups may be completely unrelated to aic7xxx and the crashes on boot
> that I posted kernel logs of. I don't know.

I guess I'll poll Justin when/if the aic7xxx crashes reappear. The hang was
probably not related to aic7xxx. Sorry for the false accusation.



-- v --

[email protected]

2003-08-09 20:19:59

by Adrian Bunk

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1, aic7xxx-6.2.36: solid hangs)

On Fri, Aug 08, 2003 at 03:55:02PM +0300, Ville Herva wrote:
>...
> Ok, the kernel compiled with gcc version 3.2.1 20021207 (Red Hat Linux 8.0
> 3.2.1-2) has now been up for more than a week. It seems stable, but I'm not
> sure yet.
>
> Which brings me to the question: which gcc version is considered most stable
> for compiling 2.4.x these days?
>...
> This seems to suggest 2.96-85 would be more stable than gcc-3.2.1-2. Is this
> the case?
>...

2.95.3 and the (unofficial) 2.96 are the best compilers for 2.4 .

In most cases 3.2.1 will give you a working kernel, but if you need
maximum stablity don't use gcc 3.x for compiling kernel 2.4 .

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2003-08-09 22:16:15

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1, aic7xxx-6.2.36: solid hangs)

On Sat, Aug 09, 2003 at 10:19:51PM +0200, you [Adrian Bunk] wrote:
> On Fri, Aug 08, 2003 at 03:55:02PM +0300, Ville Herva wrote:
> >
> > Which brings me to the question: which gcc version is considered most stable
> > for compiling 2.4.x these days?
> >...
> > This seems to suggest 2.96-85 would be more stable than gcc-3.2.1-2. Is this
> > the case?
> >...
>
> 2.95.3 and the (unofficial) 2.96 are the best compilers for 2.4 .
>
> In most cases 3.2.1 will give you a working kernel, but if you need
> maximum stablity don't use gcc 3.x for compiling kernel 2.4 .

I'm surely aiming for stability, yeah ;).

2.96-85 produces a kernel that hangs (though it's not proven it's gcc's
fault) -- the one compiled with gcc-3.2.1-2 hasn't hung yet. I guess I
should at least use the latest errata version if I go with 2.96...


-- v --

[email protected]

2003-08-27 06:43:21

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, Jul 30, 2003 at 09:10:03PM +0300, you [Ville Herva] wrote:
>
> However, I just realized that all of those kernel were compiled with fairly
> dubious gcc, version 2.96-85. I just compiled otherwise identically
> configured 2.4.21-jam1 with gcc-3.2.1-2. It'll take some time to tell
> whether this cures it. This is my main suspect now.

I celebrated too early.

The kernel compiled with gcc 3.2.1 20021207 (Red Hat Linux 8.0 3.2.1-2) hung
too, it just happened to take a little longer.

Short summary:

- The hangs are solid:
- nothing in the log, nothing on the screen
- no ctrl-alt-del, numlock
- no sysrq-s, sysrq-u, sysrq-b
- nmi watchdog doesn't trigger
- The hangs mostly happen when the nightly oracle backup dump is in
progress
- the oracle database is on an ide disk, oracle app and the dump
destination are on an scsi disk (Adaptec 2940, SEAGATE ST19171W)
- HW: Intel 815EEA2LU mobo, i815, Celeron Tualatin 1.3GHz. Adaptec 2940,
9GB Seagate, HP C1537A tapedrive (not used), IBM-DTLA-305030 ide disk.
- The aic7xxx driver has been acting up in past: crashes on boot and
sometimes at runtime too. I don't know if this is at all related to the
lock ups.
- Kernels tried: 2.4.22-pre8/gcc-2.96-85, 2.4.21-jam1/2.4.21-jam1,
2.4.21-jam1/gcc-3.2.1-2, 2.4.20pre7 -- all hang.

Perhaps this is related to the "Race condition in 2.4 tasklet handling
(cli() broken?)" problem TeJun Huh and Stephan von Krawczynski have been
discussing?

Any ideas?


-- v --

[email protected]

2003-08-27 07:03:54

by Stephan von Krawczynski

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, 27 Aug 2003 09:43:02 +0300
Ville Herva <[email protected]> wrote:

> On Wed, Jul 30, 2003 at 09:10:03PM +0300, you [Ville Herva] wrote:
> [...]
> - HW: Intel 815EEA2LU mobo, i815, Celeron Tualatin 1.3GHz. Adaptec 2940,
> 9GB Seagate, HP C1537A tapedrive (not used), IBM-DTLA-305030 ide disk.
> - The aic7xxx driver has been acting up in past: crashes on boot and
> sometimes at runtime too. I don't know if this is at all related to the
> lock ups.
> - Kernels tried: 2.4.22-pre8/gcc-2.96-85, 2.4.21-jam1/2.4.21-jam1,
> 2.4.21-jam1/gcc-3.2.1-2, 2.4.20pre7 -- all hang.
>
> Perhaps this is related to the "Race condition in 2.4 tasklet handling
> (cli() broken?)" problem TeJun Huh and Stephan von Krawczynski have been
> discussing?

This is no SMP box, is it? If it is no SMP is it probably unrelated.

Regards,
Stephan

2003-08-27 07:13:19

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, Aug 27, 2003 at 09:03:51AM +0200, you [Stephan von Krawczynski] wrote:
> > On Wed, Jul 30, 2003 at 09:10:03PM +0300, you [Ville Herva] wrote:
> > [...]
> > - HW: Intel 815EEA2LU mobo, i815, Celeron Tualatin 1.3GHz. Adaptec 2940,
> > 9GB Seagate, HP C1537A tapedrive (not used), IBM-DTLA-305030 ide disk.
> > - The aic7xxx driver has been acting up in past: crashes on boot and
> > sometimes at runtime too. I don't know if this is at all related to the
> > lock ups.
> > - Kernels tried: 2.4.22-pre8/gcc-2.96-85, 2.4.21-jam1/2.4.21-jam1,
> > 2.4.21-jam1/gcc-3.2.1-2, 2.4.20pre7 -- all hang.

Forgot to mention: all fs's are ext2.

> > Perhaps this is related to the "Race condition in 2.4 tasklet handling
> > (cli() broken?)" problem TeJun Huh and Stephan von Krawczynski have been
> > discussing?
>
> This is no SMP box, is it? If it is no SMP is it probably unrelated.

Yes, no SMP.


-- v --

[email protected]

2003-08-27 07:21:43

by Stephan von Krawczynski

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, 27 Aug 2003 10:12:59 +0300
Ville Herva <[email protected]> wrote:

> > > Perhaps this is related to the "Race condition in 2.4 tasklet handling
> > > (cli() broken?)" problem TeJun Huh and Stephan von Krawczynski have been
> > > discussing?
> >
> > This is no SMP box, is it? If it is no SMP is it probably unrelated.
>
> Yes, no SMP.

Sorry, then you have to look for another explanation.
Did you already try to exchange everything but the harddisks ?

Regards,
Stephan

2003-08-27 07:38:10

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, Aug 27, 2003 at 09:21:39AM +0200, you [Stephan von Krawczynski] wrote:
>
> Sorry, then you have to look for another explanation.

Yep, but I don't have any reasonable suspects.

> Did you already try to exchange everything but the harddisks ?

No. Do you suspect faulty hardware?

Apart from perhaps Adaptec 2940 (Adaptecs always give me trouble), I
believe the hw is pretty solid. It had no problems with 2.2 kernels. Based
on my experience, the i815 chipset is not that shaky (unlike the Via dung),
and I would expect the Intel motherboard to be on the better side as well.

I can't completely rule faulty hw out, though.

Exchanging hw will be quite difficult, as the hangs take as much as three
weeks to trigger (sometimes they happen withing a day after reboot), the box
is a production server, and I don't have much spare hardware atm.

What I had hoped for is to be able to get some information on where it hangs.
But sysrq and nmi watchdog don't cut it...


-- v --

[email protected]

2003-08-27 09:30:31

by Stephan von Krawczynski

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, 27 Aug 2003 10:37:58 +0300
Ville Herva <[email protected]> wrote:

> > Did you already try to exchange everything but the harddisks ?
>
> No. Do you suspect faulty hardware?
> [...]
> What I had hoped for is to be able to get some information on where it hangs.
> But sysrq and nmi watchdog don't cut it...

Hm, did you try a serial console? On my side this was a big step forward.
If you experience complete hangs it may be something around hanging interrupts.
Did you play with apic/acpi etc. to try different interrupt handling? What does
your /proc/interrupts look like compared between 2.2 and 2.4 ?

Regards,
Stephan

2003-08-27 10:13:48

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, Aug 27, 2003 at 11:30:27AM +0200, you [Stephan von Krawczynski] wrote:
>
> Hm, did you try a serial console? On my side this was a big step forward.

Do you mean in your case nothing shown on monitor (I've disabled monitor
blanking, so that is not it), sysrq key didn't work, nmi watchdog didn't
trigger but you were still able to get output from serial console? An oops?

Or, did you use kdb/kgdb in addition to serial console?

> If you experience complete hangs it may be something around hanging
> interrupts.

Probably, yes.

> Did you play with apic/acpi etc. to try different interrupt handling?

ACPI has never been enabled. I enabled local APIC when I enabled nmi
watchdog, so I've tried it on and off.

> What does your /proc/interrupts look like compared between 2.2 and 2.4 ?

I don't have 2.2 output at hand, but the 2.4.21-jam1 output doesn't seem too
suspicious:

cat /proc/interrupts
CPU0
0: 1675428 XT-PIC timer
1: 3 XT-PIC keyboard
2: 0 XT-PIC cascade
4: 19625 XT-PIC serial
9: 25447 XT-PIC aic7xxx
11: 25203 XT-PIC eth0
12: 0 XT-PIC PS/2 Mouse
14: 178082 XT-PIC ide0
NMI: 16763
LOC: 1675326
ERR: 0



-- v --

[email protected]

2003-08-27 10:56:40

by Stephan von Krawczynski

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, 27 Aug 2003 13:13:13 +0300
Ville Herva <[email protected]> wrote:

> On Wed, Aug 27, 2003 at 11:30:27AM +0200, you [Stephan von Krawczynski]
> wrote:
> >
> > Hm, did you try a serial console? On my side this was a big step forward.
>
> Do you mean in your case nothing shown on monitor (I've disabled monitor
> blanking, so that is not it), sysrq key didn't work, nmi watchdog didn't
> trigger but you were still able to get output from serial console? An oops?

I often have X setups, so console output gets _somewhere_ in the background.

> Or, did you use kdb/kgdb in addition to serial console?

No.

> > What does your /proc/interrupts look like compared between 2.2 and 2.4 ?
>
> I don't have 2.2 output at hand, but the 2.4.21-jam1 output doesn't seem too
> suspicious:

You're right, it looks pretty clean and simple. Possibly the only thing I would
try is moving aic away from int 9 to int 10 or so. Int 9 sometimes interferes
with VGA int routing on broken boxes. But that is unlikely (though simple to
test).

Regards,
Stephan

2003-08-27 11:04:43

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, Aug 27, 2003 at 12:56:33PM +0200, you [Stephan von Krawczynski] wrote:
>
> > Or, did you use kdb/kgdb in addition to serial console?
>
> No.

Ok.

I might give a debugger a shot anyway when I find the time.

> You're right, it looks pretty clean and simple. Possibly the only thing I would
> try is moving aic away from int 9 to int 10 or so. Int 9 sometimes interferes
> with VGA int routing on broken boxes. But that is unlikely (though simple to
> test).

I don't think vga interferes with anything: I never run X on the box, and
even the text console remains quiescent as nothing is logged.

Better test would perhaps be to get rid of Adaptec 2940 altogether and move
the rootfs on an ide disk. But that's not exactly convenient either...


-- v --

[email protected]

2003-08-27 11:31:01

by Stephan von Krawczynski

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, 27 Aug 2003 14:04:17 +0300
Ville Herva <[email protected]> wrote:

> > You're right, it looks pretty clean and simple. Possibly the only thing I
> > would try is moving aic away from int 9 to int 10 or so. Int 9 sometimes
> > interferes with VGA int routing on broken boxes. But that is unlikely
> > (though simple to test).
>
> I don't think vga interferes with anything: I never run X on the box, and
> even the text console remains quiescent as nothing is logged.

The thing I ran into once was not really an intensive use of VGA and its ints
but rather some weird glitches in the boards' int logic that sometimes drove
the software drivers crazy (was network back then).

Regards,
Stephan

2003-08-28 01:11:44

by Tejun Huh

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, Aug 27, 2003 at 10:37:58AM +0300, Ville Herva wrote:
> On Wed, Aug 27, 2003 at 09:21:39AM +0200, you [Stephan von Krawczynski] wrote:
> >
> > Sorry, then you have to look for another explanation.
>
> Yep, but I don't have any reasonable suspects.
>
> > Did you already try to exchange everything but the harddisks ?
>
> No. Do you suspect faulty hardware?
>
> Apart from perhaps Adaptec 2940 (Adaptecs always give me trouble), I
> believe the hw is pretty solid. It had no problems with 2.2 kernels. Based
> on my experience, the i815 chipset is not that shaky (unlike the Via dung),
> and I would expect the Intel motherboard to be on the better side as well.
>
> I can't completely rule faulty hw out, though.
>
> Exchanging hw will be quite difficult, as the hangs take as much as three
> weeks to trigger (sometimes they happen withing a day after reboot), the box
> is a production server, and I don't have much spare hardware atm.
>
> What I had hoped for is to be able to get some information on where it hangs.
> But sysrq and nmi watchdog don't cut it...
>

Hello Ville. Hello Stephan. :-)

Your problem sounds very simlar to the problem we were suffering.
The problem was a spinlock deadlock inside drivers/char/random.c which
is used by tcp to generate random initial sequence number. The bug
fix was checked into 2.4 tree on 28th July after the release of pre8
at 14th July.

[email protected], 2003-07-24 14:21:29-03:00, [email protected]
Changed EXTRAVERSION to -pre8
TAG: v2.4.22-pre8

[email protected], 2003-07-28 17:25:49-07:00, [email protected]
[RANDOM]: Fix SMP deadlock in __check_and_rekey().

This problem can happen on UP machine if the kernel is compiled with
CONFIG_SMP. Because the offending routine is called only every five
minutes and it should receive a SYN packet while it's connecting, it
occurs rarely, but it happens when it happens.

Please try 2.4.22.

P.S. This bug is a real headache. We had many servers deployed and
they all randomly locked up about every two or four weeks. I believe
people should be warned about this one.

--
tejun

2003-08-28 05:58:54

by Tejun Huh

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Thu, Aug 28, 2003 at 08:40:00AM +0300, Ville Herva wrote:
> On Thu, Aug 28, 2003 at 10:13:41AM +0900, you [TeJun Huh] wrote:
> >
> > Your problem sounds very simlar to the problem we were suffering.
> > The problem was a spinlock deadlock inside drivers/char/random.c which
> > is used by tcp to generate random initial sequence number. The bug
> > fix was checked into 2.4 tree on 28th July after the release of pre8
> > at 14th July.
>
> Uhh, I tried 2.4.22pre8 a while ago (I think it was Herbert P?tzl's
> suggestion), and it locked up too. Shame that the fix didn't make it in
> it...
>
> I'll give .22-final a spin.
>
> > This problem can happen on UP machine if the kernel is compiled with
> > CONFIG_SMP.
>
> This is UP box and the kernel is _not_ compiled with CONFIG_SMP.

Then, it should be a different problem. That deadlock wouldn't occur
with UP kernel.

> > Because the offending routine is called only every five
> > minutes and it should receive a SYN packet while it's connecting, it
> > occurs rarely, but it happens when it happens.
>
> In my case, the lock up seems clearly related to disk io: it usually happens
> during the nightly oracle backup dump, and at some point it kept happening
> while compiling kernel. (It's random, I can no longer reproduce it by just
> compiling a kernel.)
>
> Do you still think it could be the same one?

No, I don't think so anymore. I think trying kdb/kgdb would be
better.

> > Please try 2.4.22.
> >
> > P.S. This bug is a real headache. We had many servers deployed and
> > they all randomly locked up about every two or four weeks. I believe
> > people should be warned about this one.
>
> What's really strange is that the box kept running with 2.4.20pre7 for
> almost a year without problems (with the same oracle dump jub in nightly
> cron), and then suddenly begun acting up on my the first day of my summer
> vacatnion...

Good luck. :-)

--
tejun

2003-08-28 05:46:53

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Thu, Aug 28, 2003 at 10:13:41AM +0900, you [TeJun Huh] wrote:
>
> Your problem sounds very simlar to the problem we were suffering.
> The problem was a spinlock deadlock inside drivers/char/random.c which
> is used by tcp to generate random initial sequence number. The bug
> fix was checked into 2.4 tree on 28th July after the release of pre8
> at 14th July.

Uhh, I tried 2.4.22pre8 a while ago (I think it was Herbert P?tzl's
suggestion), and it locked up too. Shame that the fix didn't make it in
it...

I'll give .22-final a spin.

> This problem can happen on UP machine if the kernel is compiled with
> CONFIG_SMP.

This is UP box and the kernel is _not_ compiled with CONFIG_SMP.

> Because the offending routine is called only every five
> minutes and it should receive a SYN packet while it's connecting, it
> occurs rarely, but it happens when it happens.

In my case, the lock up seems clearly related to disk io: it usually happens
during the nightly oracle backup dump, and at some point it kept happening
while compiling kernel. (It's random, I can no longer reproduce it by just
compiling a kernel.)

Do you still think it could be the same one?

> Please try 2.4.22.
>
> P.S. This bug is a real headache. We had many servers deployed and
> they all randomly locked up about every two or four weeks. I believe
> people should be warned about this one.

What's really strange is that the box kept running with 2.4.20pre7 for
almost a year without problems (with the same oracle dump jub in nightly
cron), and then suddenly begun acting up on my the first day of my summer
vacation...


-- v --

[email protected]

2003-08-28 16:38:20

by Ingo Oeser

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Wed, Aug 27, 2003 at 01:30:55PM +0200, Stephan von Krawczynski wrote:
> On Wed, 27 Aug 2003 14:04:17 +0300
> Ville Herva <[email protected]> wrote:
> > I don't think vga interferes with anything: I never run X on the box, and
> > even the text console remains quiescent as nothing is logged.
>
> The thing I ran into once was not really an intensive use of VGA and its ints
> but rather some weird glitches in the boards' int logic that sometimes drove
> the software drivers crazy (was network back then).

I have seen this too, with some DSP board.

But heavy (disk) IO and misterious crashes sound like power problems,
doesn't it?

Regards

Ingo Oeser

2003-08-28 19:09:50

by Ville Herva

[permalink] [raw]
Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Thu, Aug 28, 2003 at 11:26:30AM +0200, you [Ingo Oeser] wrote:
>
> But heavy (disk) IO and misterious crashes sound like power problems,
> doesn't it?

Hmm. It doesn't crash, it locks up solid. (Well the aic7xxx driver sometimes
crashes (spits a huge log of errors, rather), but I'm still not sure if
that's related.)

The box only has two disks, 1.3GHz Celeron (~30W), and other lighter power
consumers. Not exactly a power hungry config. I'm not sure about the power
supply - I think it's a 250W one - I'll have to check.

Accoring to sensors, the voltages do not fluctuate much. Also, the
temperatures are moderate (34.0?C system, 41.0?C CPU).

Power problems are surely possible, but don't exactly sound like promising
lead to me.


-- v --

[email protected]