2001-12-20 13:23:44

by T. A.

Subject: Consistant complete deadlock with kernel 2.4.16 on an Abit VP6 with dual 1 Gig CPUs and an ICP GDT RAID card

Hi all,

    I recently set up my spiffy new SMP system. The system consists of:

Abit VP6 motherboard
Dual 1 Gig Pentium III CPUs
128MB memory (for now)
EIDE boot drive on the VIA EIDE port for most of the system
RAID 5 setup using an Ultra2 ICP RAID controller (gdth driver)

    After setting up the new RAID I decided to run a burn-in test on it with
the following script:

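# Burn-in loop: delete the old copy, rename the tree, copy it back. Runs forever.
while true
do
    rm -rv linux-2.4.16.old                  # remove the previous pass's copy
    mv -v linux-2.4.16 linux-2.4.16.old      # current tree becomes the old copy
    cp -av linux-2.4.16.old linux-2.4.16     # recreate the tree from the old copy
done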

    It didn't take very long. After 15 minutes or so the entire system
deadlocked, and very badly at that. Not even the magic SysRq key
combinations worked anymore. I've spent the past few days trying to debug
the problem, and this is what I've found so far.
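
When a lockup leaves nothing on screen, it can help to know which pass the test died in. The following is a minimal sketch of a logging variant of the loop above, not the script that was actually run. The function name burn_in_pass and the log path in the usage comment are made up for illustration, and it assumes the log lives on a disk other than the one under test.

```shell
# burn_in_pass TREE LOG N: run one delete/rename/copy cycle on TREE and
# append a timestamped line to LOG once the cycle has completed.
burn_in_pass() {
    tree=$1
    log=$2
    n=$3
    rm -rf "$tree.old"                       # drop the previous pass's copy
    mv "$tree" "$tree.old" || return 1       # current tree becomes the old copy
    cp -a "$tree.old" "$tree" || return 1    # recreate the tree from the old copy
    echo "pass $n done at $(date '+%Y-%m-%d %H:%M:%S')" >> "$log"
    sync                                     # flush so the entry survives a hard hang
}

# Usage (not run here; paths are examples only):
#   n=0
#   while burn_in_pass linux-2.4.16 /root/burnin.log "$n"; do n=$((n + 1)); done
```

After a reboot, the last line of the log gives the last pass that finished and roughly when the machine stopped.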

The freezing appears to happen only when running my burn-in test on the RAID
drive. (There may have been one instance of it happening on the IDE drive while
compiling a kernel, but I did a file operation on the RAID drive
shortly before that freeze, so I can't say for sure. The EIDE burn-in
test, by contrast, can run for quite a long while.)

The freezing only happens with an SMP kernel on dual CPUs. (I tried out a
non-SMP kernel. No lockup.)
It doesn't appear to be filesystem related. (Tried both ext3 and ext2.)
It doesn't appear to be a hardware issue. Ran a burn-in test with FreeBSD.
(Worked beautifully.)
Also tried moving the RAID controller to different PCI slots and
disabling the built-in Highpoint HPT370 EIDE pseudo-RAID controller so that
the card didn't share an IRQ.
Tried different BIOS settings. No change.
Tried passing "noapic" to the kernel. The deadlock remained.
Upgraded the BIOS. (Deadlock on 2 different BIOSes.)
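
For the IRQ-sharing check, one way to confirm the result is to look at /proc/interrupts after boot. This is a rough sketch, assuming the usual layout in which devices sharing a line are listed on one row separated by ", ". The function name shared_irqs is invented here and the device names in the comment are only examples.

```shell
# shared_irqs [FILE]: print the numbered IRQ rows that list more than one
# device, e.g. " 10:  100  200  IO-APIC-level  gdth, eth0".
# FILE defaults to /proc/interrupts.
shared_irqs() {
    awk '/^ *[0-9]+:/ && /, / { print }' "${1:-/proc/interrupts}"
}

# Usage (not run here):
#   shared_irqs | grep gdth    # no output suggests gdth has its IRQ to itself
```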

    As best I can tell, there appears to be a problem with the ICP RAID
controller driver (gdth) in an SMP system, or at least in an SMP system
running on this motherboard. Does anybody else have an ICP RAID controller
with a RAID 5 setup running successfully in an SMP system? If so, are you on
an Abit VP6? Anyone know of any 2.4.16 kernel bug that could be doing this?
If so, is there a fix or a workaround?

Anyone know how I could debug the cause of this problem? Machine
deadlocks. Not even an Oops so I'm short on ideas on how to track the
problem down. Please help. 8-( My new SMP system sucks on Linux. 8-(


2001-12-20 14:11:01

by Keith Owens

Subject: Re: Consistant complete deadlock with kernel 2.4.16 on an Abit VP6 with dual 1 Gig CPUs and an ICP GDT RAID card

On Wed, 19 Dec 2001 16:22:47 -0500,
"T. A." <[email protected]> wrote:
> Anyone know how I could debug the cause of this problem? Machine
>deadlocks. Not even an Oops so I'm short on ideas on how to track the
>problem down. Please help. 8-( My new SMP system sucks on Linux. 8-(

Compile for SMP and boot with nmi_watchdog=1. If the problem is
hardware that will not help. If the problem is a software loop in
kernel space (much more likely) then the nmi watchdog will trip after 5
seconds.
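
Before relying on the watchdog, it may be worth confirming that NMIs are actually being delivered. This is a minimal sketch, assuming the usual /proc/interrupts layout with an "NMI:" row of per-CPU counters. The function name nmi_total is made up for illustration. If the total does not grow between two samples, the watchdog is probably not armed and will not fire on a lockup.

```shell
# nmi_total [FILE]: print the sum of the per-CPU counters on the "NMI:" row.
# FILE defaults to /proc/interrupts. Non-numeric trailing fields are ignored.
nmi_total() {
    awk '$1 == "NMI:" {
             s = 0
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) s += $i
             print s
         }' "${1:-/proc/interrupts}"
}

# Usage (not run here):
#   a=$(nmi_total); sleep 10; b=$(nmi_total)
#   [ "$b" -gt "$a" ] && echo "NMI count is rising"
```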

You might also find the kernel debugger to be useful,
ftp://oss.sgi.com/projects/kdb/download/ix86. See Documentation/kdb
for man pages. Using the pause key on a PC keyboard or control-A on a
serial console will drop into kdb, unless the kernel has stopped
processing interrupts, in which case the nmi watchdog should trip and
drop you into kdb.

For all low level debugging, I strongly recommend a serial console so
you can capture the output on another system, see
Documentation/serial-console.txt.
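
As a rough illustration of how the pieces fit together, the sketch below prints a set of boot parameters that send console output to both the screen and a serial port and enable the NMI watchdog. The helper name debug_cmdline, the port, the baud rate and the capture commands in the comments are assumptions for the example, not settings taken from this thread.

```shell
# debug_cmdline [UNIT] [BAUD]: print kernel boot parameters that mirror
# console output to serial port ttyS<UNIT> and enable the NMI watchdog.
debug_cmdline() {
    unit=${1:-0}
    baud=${2:-9600}
    # n8 = no parity, 8 data bits. With several console= entries, the last
    # one listed is the one /dev/console refers to.
    echo "console=tty0 console=ttyS${unit},${baud}n8 nmi_watchdog=1"
}

# On the capturing machine (assumed: Linux with GNU stty and a null-modem
# cable on its first serial port), something like:
#   stty -F /dev/ttyS0 9600 cs8 -parenb raw
#   cat /dev/ttyS0 | tee lockup-console.log
```

The printed string would go on the kernel command line of the machine under test, for example in the boot loader's append line.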

2001-12-20 15:00:29

by Alex Scheele

Subject: RE: Consistant complete deadlock with kernel 2.4.16 on an Abit VP6 with dual 1 Gig CPUs and an ICP GDT RAID card

Hi,

I am having exactly the same problem! One of my servers locks up once in a
while. Reading your post, I thought it might be the same problem, and it
indeed is.

The server consists of:

Supermicro 370DLE motherboard
Dual 1 GHz Pentium III CPUs
1.5 Gig memory (highmem enabled)
2 IDE software RAID-0s and a SCSI Linux disk.

My system also hangs after a short while when running such a script.
It also hangs in the same way every now and then on its own (mostly after
1 to 2 weeks of uptime). The HDD LED stays lit constantly when the system
locks up. SysRq doesn't respond when it locks up either.

I have not been able to try without SMP support yet, although I will
(hopefully today) test it. I have been having this problem for some time now
and have not solved it yet. I will also try nmi_watchdog=1 and kdb. If I get
some more information I will mail it.


--
Alex ([email protected])




2001-12-20 15:36:46

by Dennis Schoen

Subject: Re: Consistant complete deadlock with kernel 2.4.16 on an Abit VP6 with dual 1 Gig CPUs and an ICP GDT RAID card


I'm having the same problems on my:

Dual PIII 650 MHz
300 MB RAM
3 SCSI disks -> on one SCSI controller
1 SCSI burner and 1 SCSI CD-ROM -> on one SCSI controller

The system just hangs completely after a while, with no magic SysRq
response.

Dennis

2001-12-21 17:27:01

by T. A.

Subject: Re: Consistant complete deadlock with kernel 2.4.16 on an Abit VP6 with dual 1 Gig CPUs and an ICP GDT RAID card

    Did the two debugging steps listed below. Tried nmi_watchdog=1 and
also the kdb debugger. Unfortunately neither debugging feature gave me any
results. When the machine locks up, the lockup is total. The nmi_watchdog
does not produce any output (after the lockup) and the kdb debugger fails to
come up when pressing the pause key (after the lockup). I suppose this
indicates a hardware issue. However, as I said before, my burn-in test under
FreeBSD was successful. No machine lockup. And the last test (under
FreeBSD) burned in for two days. I also tried removing a processor. Just
as with the non-SMP kernel, the lockup did not occur. Tried single
processor mode with each processor. Both were successful. Also gave
2.2.17-rc2 a try, just in case. Same result. As it stands, it looks like I'm
still in the same boat. 8-( Anything else I can do to debug this issue?

    By the way, are people successfully using the Abit VP6 motherboard
under Linux? Successfully using the ICP GDT7528RN RAID controller under
Linux in an SMP system? Successfully using an ICP GDT7528RN on an Abit VP6
motherboard?

----- Original Message -----
From: "Keith Owens" <[email protected]>
To: "T. A." <[email protected]>
Cc: "Linux Kernel Mailing List" <[email protected]>
Sent: Thursday, December 20, 2001 9:10 AM
Subject: Re: Consistant complete deadlock with kernel 2.4.16 on an Abit VP6
with dual 1 Gig CPUs and an ICP GDT RAID card


> On Wed, 19 Dec 2001 16:22:47 -0500,
> "T. A." <[email protected]> wrote:
> > Anyone know how I could debug the cause of this problem? Machine
> >deadlocks. Not even an Oops so I'm short on ideas on how to track the
> >problem down. Please help. 8-( My new SMP system sucks on Linux. 8-(
>
> Compile for SMP and boot with nmi_watchdog=1. If the problem is
> hardware that will not help. If the problem is a software loop in
> kernel space (much more likely) then the nmi watchdog will trip after 5
> seconds.
>
> You might also find the kernel debugger to be useful,
> ftp://oss.sgi.com/projects/kdb/download/ix86. See Documentation/kdb
> for man pages. Using the pause key on a PC keyboard or control-A on a
> serial console will drop into kdb, unless the kernel has stopped
> processing interrupts, in which case the nmi watchdog should trip and
> drop you into kdb.
>
> For all low level debugging, I strongly recommend a serial console so
> you can capture the output on another system, see
> Documentation/serial-console.txt.
>
>

2001-12-23 23:30:18

by Alex Scheele

Subject: RE: Consistant complete deadlock with kernel 2.4.16 on an Abit VP6 with dual 1 Gig CPUs and an ICP GDT RAID card

Hi,

I had the same problem. Unfortunately I have not been able to debug it yet.
The problem is that on my machine it takes a lot longer to lock up, and I
don't have the machine local at this time. I tried it on a dual Celeron 450
which I do have local, but there I cannot get it to lock up.

Maybe you should try debugging (kdb and maybe SysRq?) on a serial console. I
don't know if you already tried that, though.
