LinuxLists.cc - 2.4.18 crashdump

2002-07-30 23:52:04

Subject: 2.4.18 crashdump - P4/3ware/NFS

I recently built a new home file server based around a P4 1.8GHz CPU and a
3ware 6410 RAID controller along with a 2-port Intel 10/100 NIC. The 6410
has 4x120GB IDE drives on it for a 360GB RAID5 array. I have been
exporting a large partition (ext3) via NFS to a couple of other machines
to move some things around (the destination in this case is another
3ware-based system with dual P2-400s and 4x80GB in RAID5).

Several times now the p4 system has either frozen (usually with a
crashdump), or just mysteriously rebooted while moving large quantities of
data over the NFS link.

I'm at a loss to explain this unless it starts to point to hardware
problems (bad CPU? Memory? motherboard?). I thought for a bit it was due
to setiathome running in the background (one screendump mentioned seti), so
I disabled it. The system then worked fine while I copied 45GB across, but
froze and croaked when I started another copy session after the first one
completed.

Note, I also export a lot of this system over Samba and haven't had any
problems there, but I'm not pushing the samba links very hard (face it,
MP3s being streamed to winamp ain't hard work...)

I don't have a text log of the dump, but I did take a digital picture of
it at http://www.roberthayden.com/crashdump.jpg

If anyone has ANY thoughts here, they would be appreciated. I hate to
start randomly replacing parts without an idea where to look.

Thanks all.

Robert

System info:
P4 1.8
512MB
SCSI0 - Symbios U160 SCSI with a DD3 Tape drive
SCSI1 - 3ware 6410 RAID w/ 4x120GB IDE (ibm) drives
Ethernet - Dual Intel 10/100
IDE CDrom (channel 1 master)
Distribution Base: Redhat 7.3, recompiled 2.4.18 kernel

[~] # df
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda3 8262068 960484 6881888 13% /
/dev/sda1 101089 11384 84486 12% /boot
/dev/sda5 2071384 34404 2036980 2% /home
/dev/sda6 1035660 34204 1001456 4% /var/log
/dev/sda7 344199916 217349096 126850820 64% /fileserver

[~] # cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 1
model name : Intel(R) Pentium(R) 4 CPU 1.80GHz
stepping : 2
cpu MHz : 1800.110
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3591.37

[~] # uname -a
Linux farnsworth 2.4.18 #3 Fri Jul 19 23:52:53 CDT 2002 i686 unknown

2002-08-05 09:07:15

by Adrian Bunk

Subject: Re: 2.4.18 crashdump - P4/3ware/NFS

On Tue, 30 Jul 2002, Robert A. Hayden wrote:

>...
> Several times now the p4 system has either frozen (usually with a
> crashdump), or just mysteriously rebooted while moving large quantities of
> data over the NFS link.
>
> I'm at a loss to explain this unless it starts to point to hardware
> problems (bad CPU? Memory? motherboard?). I thought for a bit it was due
>...
> I don't have a text log of the dump, but I did take a digital picture of
> it at http://www.roberthayden.com/crashdump.jpg
>
> If anyone has ANY thoughts here, they would be appreciated. I hate to
> start randomly replacing parts without an idea where to look.

- Does this problem still exist in 2.4.19?
- Are there temperature problems inside your machine?
- Does your power supply give your system enough energy?
- You can test your memory using memtest86 [1].
- Is there anything in the logfiles after the machine "mysteriously
rebooted"?
- If none of the above helps, could you type the information of the
screenshot (best the information if it happens using 2.4.19) to a file
and run it through ksymoops?

> Thanks all.
>
> Robert
>...

cu
Adrian

[1] http://www.memtest86.com/

--

You only think this is a free country. Like the US the UK spends a lot of
time explaining its a free country because its a police state.
Alan Cox

2002-08-05 12:41:57

by Robert A. Hayden

Subject: Re: 2.4.18 crashdump - P4/3ware/NFS

On Mon, 5 Aug 2002, Adrian Bunk wrote:

> - Does this problem still exist in 2.4.19?
> - Are there temperature problems inside your machine?
> - Does your power supply give your system enough energy?
> - You can test your memory using memtest86 [1].
> - Is there anything in the logfiles after the machine "mysteriously
> rebooted"?
> - If none of the above helps, could you type the information of the
> screenshot (best the information if it happens using 2.4.19) to a file
> and run it through ksymoops?

Adrian et al,

I thought it might have been heat-related (4 120GB drives get very hot) so
I added two high-volume fans to the case. While it did lower the
motherboard temp about 8 degrees F, the system still froze up doing both
NFS and Samba xfers.

The power supply is a 400W unit, so it should be able to handle 4 hard
drives and a P4 ok.

There was never anything in any of the logfiles to indicate much of
anything.

When 2.4.19 came out Friday, I noticed in the changelog several references
to both the 3ware drivers as well as the SiS chipset, which my motherboard
uses (ECS). Since upgrading to 2.4.19 the system has been rock-solid
steady. I've been stress-testing it for about 72 hours now, copying ten
2GB files across to /dev/null over NFS in a rotation, as well as moving
about 100GB of production data over NFS, without so much as a hiccup.
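
For anyone who wants to run a similar read-to-/dev/null rotation, here is a
minimal Python sketch. It is not the script used for the test above; the
function names (drain, rotate) and the chunk size are assumptions made up
for illustration, and the file list is whatever you have on the NFS mount.

```python
import itertools
import os

CHUNK = 1024 * 1024  # read 1 MiB at a time (arbitrary choice)


def drain(path, sink_path=os.devnull):
    """Copy one file to the sink and return the number of bytes moved."""
    total = 0
    with open(path, "rb") as src, open(sink_path, "wb") as sink:
        while True:
            block = src.read(CHUNK)
            if not block:
                break
            sink.write(block)
            total += len(block)
    return total


def rotate(paths, passes):
    """Drain each file in turn, cycling through the list `passes` times."""
    moved = 0
    turns = passes * len(paths)
    for path in itertools.islice(itertools.cycle(paths), turns):
        moved += drain(path)
    return moved
```

Calling rotate() with the list of test files on the NFS mount and a large
pass count keeps the link busy; the return value is the total bytes read,
which is handy for a rough throughput figure.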

The result: I think whatever problem there was has been fixed. I'm
guessing that it was SiS-related, as I've been using 3ware controllers for
a couple years with never so much as even a hiccup. I am going to
continue running my stress-test over the rest of the week to be sure, but
so far so good.

Nice to see a kernel change fix the problem. I was already pricing out a
replacement motherboard.

2002-08-05 21:43:33

by Roland Kuhn

Subject: Re: 2.4.18 crashdump - P4/3ware/NFS

Hi!

Please excuse my jumping in, but I have a very similar problem: when
copying data in both directions between the 3ware RAID and the 3C996B-T
GE-card (tg3 driver) the machine will lock up after some seconds. It
always happens in the same spot: a BUG at line 1557 of tg3.c (v0.99). I
have also tried the bcm5700 driver (I know, it's unreadable), and this one
died under exactly the same condition: an interrupt arrives signalling a
completed transmission, but none was queued.

Upgrading to 2.4.19 did not help. Nor did upgrading the PS to 550W.

This happens on a dozen dual Athlon MP (Tyan Tiger v1.02) machines; the
fs on the RAID is reiserfs. Both directions of data flow (about 10MB/s
each) have to be present to trigger this.
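
To make the failure condition concrete, here is a toy Python model of the
invariant that seems to be violated: a transmit-complete event should only
ever arrive while at least one frame is outstanding. This is not tg3 or
bcm5700 code; the class and method names are hypothetical and the real
drivers track far more state.

```python
class TxRing:
    """Toy transmit ring: counts frames queued but not yet completed."""

    def __init__(self, size):
        self.size = size
        self.pending = 0

    def queue(self):
        """Hand one frame to the (imaginary) hardware."""
        if self.pending == self.size:
            raise BufferError("ring full")
        self.pending += 1

    def complete(self):
        """Handle a transmit-complete interrupt."""
        # A completion with nothing outstanding is the situation
        # described above: the driver has no frame to account for it.
        if self.pending == 0:
            raise RuntimeError("TX completion with no frame queued")
        self.pending -= 1
```

In this model a spurious or duplicated completion raises immediately, which
is roughly the role the BUG check plays in the driver.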

On Mon, 5 Aug 2002, Adrian Bunk wrote:

> On Tue, 30 Jul 2002, Robert A. Hayden wrote:
>
> >...
> > Several times now the p4 system has either frozen (usually with a
> > crashdump), or just mysteriously rebooted while moving large quantities of
> > data over the NFS link.
> >
> > I'm at a loss to explain this unless it starts to point to hardware
> > problems (bad CPU? Memory? motherboard?). I thought for a bit it was due
> >...
> > I don't have a text log of the dump, but I did take a digital picture of
> > it at http://www.roberthayden.com/crashdump.jpg
> >
> > If anyone has ANY thoughts here, they would be appreciated. I hate to
> > start randomly replacing parts without an idea where to look.
>
> - Does this problem still exist in 2.4.19?
> - Are there temperature problems inside your machine?
> - Does your power supply give your system enough energy?
> - You can test your memory using memtest86 [1].
> - Is there anything in the logfiles after the machine "mysteriously
> rebooted"?
> - If none of the above helps, could you type the information of the
> screenshot (best the information if it happens using 2.4.19) to a file
> and run it through ksymoops?
>
> > Thanks all.
> >
> > Robert
> >...
>
> cu
> Adrian
>
> [1] http://www.memtest86.com/
>
>

Ciao,
Roland

+---------------------------+-------------------------+
| TU Muenchen | |
| Physik-Department E18 | Raum 3558 |
| James-Franck-Str. | Telefon 089/289-12592 |
| 85747 Garching | |
+---------------------------+-------------------------+