2002-04-01 06:17:14

by Jason Czerak

Subject: ECC memory and SMP lockups on Gateway 6400 server

Hello.

I'll start with the ECC memory problem....

Recently got the go-ahead to upgrade the Gateway Win2K server to a Linux
box to replace our old webserver. It's a 6400 server: 2 PIII-733s, 704
megs of registered ECC RAM. NT ran fine on this box, not a hitch.

I was going to install SuSE 7.3 on it and ran into some slowness problems.
Once the kernel was booted and the install programs were running, things
slowed to a crawl. After an hour of messing around, I started to pull
memory and CPUs out. It turns out the 512 meg ECC DIMM was the cause of
the slowness. No error messages, nothing. It looks like the ECC
was doing its thing, but it kept CPU usage at 100% all the time.
Is there a kernel switch I can flip to make it play nice with broken
ECC RAM? Or is this RAM just worthless?


Now the real issue. I searched all over Google and this mailing list to
see if I'm the only one, and it looks like I am. In fact, I've seen nothing
but praise for this box with Linux running on it.


With a non-SMP kernel this machine runs perfectly, not a hitch. When
booted with an SMP kernel, once the load hits 1.4 the machine locks up
solid and I have to hit the reset button. I turned off the MP 1.4 spec
setting, all power management is turned off, and any goodies in the BIOS
are off. I even upgraded the BIOS to the latest version. No luck. It has
been suggested that I try a 2.2 kernel, but since the root partition is
reiserfs and namesys.com has been down all weekend, I haven't been able to
get a 2.2 kernel running with reiserfs support. I'll try 2.2.20 as soon as
I can download the reiserfs patch.

I have attached the dmesg output for this box. I'll try any suggestions,
patches, or anything else.


Also, I would like to note that the second CPU is the exact same stepping
and speed, but it was installed well after the machine was purchased.


--
Jason Czerak


Attachments:
moby.boot.msg (11.70 kB)

2002-04-01 08:24:38

by Alan

Subject: Re: ECC memory and SMP lockups on Gateway 6400 server

> slowed to a crawl. After an hour of messing around, I started to pull
> memory and CPUs out. It turns out the 512 meg ECC DIMM was the cause of
> the slowness. No error messages, nothing. It looks like the ECC
> was doing its thing, but it kept CPU usage at 100% all the time.
> Is there a kernel switch I can flip to make it play nice with broken
> ECC RAM? Or is this RAM just worthless?

Unless you loaded the extra ECC modules, Linux really has no awareness of the
ECC at all. The more likely cause, and the one I would check first, is that
the MTRR ranges are wrong or the BIOS did not set up the memory correctly. It
could be continuous ECC faults (e.g. if the kernel puts something critical in
an iffy spot in the DIMM and NT didn't), but that sounds dubious.

Alan
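
For anyone following the MTRR suggestion above, here is a minimal Python sketch that takes the text of /proc/mtrr and reports which parts of installed RAM are not covered by a write-back range. The function names, the sample register lines, and the 704 MB figure plugged in at the bottom are illustrative assumptions, not output from this machine. The check is also simplified: it ignores overlapping uncachable ranges and the fixed-range MTRRs below 1 MB.

```python
import re

MB = 1 << 20
UNIT = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
TYPES = ("write-back", "write-combining", "write-through",
         "write-protect", "uncachable")

# Matches the start of a /proc/mtrr line, e.g.
#   reg00: base=0x00000000 (   0MB), size= 512MB: write-back, count=1
# The memory type is looked up separately because its position in the
# line differs between kernel versions.
LINE_RE = re.compile(
    r"reg\d+:\s+base=0x(?P<base>[0-9a-fA-F]+)\s+\(\s*\d+MB\),\s+"
    r"size=\s*(?P<size>\d+)(?P<unit>[KMG])B"
)


def parse_mtrr(text):
    """Return a list of (base_bytes, size_bytes, memory_type) tuples."""
    ranges = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        mem_type = next((t for t in TYPES if t in line), "unknown")
        base = int(m.group("base"), 16)
        size = int(m.group("size")) * UNIT[m.group("unit")]
        ranges.append((base, size, mem_type))
    return ranges


def uncovered_write_back(ranges, ram_bytes):
    """Return [start, end) gaps below ram_bytes with no write-back MTRR."""
    wb = sorted((b, b + s) for b, s, t in ranges if t == "write-back")
    gaps, cursor = [], 0
    for start, end in wb:
        if start > cursor:
            gaps.append((cursor, min(start, ram_bytes)))
        cursor = max(cursor, end)
        if cursor >= ram_bytes:
            break
    if cursor < ram_bytes:
        gaps.append((cursor, ram_bytes))
    return [(s, e) for s, e in gaps if s < e]


# Hypothetical register contents: only the first 640 MB are write-back.
SAMPLE = """\
reg00: base=0x00000000 (   0MB), size= 512MB: write-back, count=1
reg01: base=0x20000000 ( 512MB), size= 128MB: write-back, count=1
"""

for start, end in uncovered_write_back(parse_mtrr(SAMPLE), 704 * MB):
    print(f"not write-back: {start // MB} MB to {end // MB} MB")
```

On a live system the text would come from open("/proc/mtrr").read(). RAM that falls into a reported gap is accessed uncached, which could produce exactly this kind of "everything crawls, no errors logged" symptom.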

2002-04-01 09:45:06

by Manfred Spraul

Subject: Re: ECC memory and SMP lockups on Gateway 6400 server

> Recently got the go-ahead to upgrade the Gateway Win2K server to a Linux
> box to replace our old webserver. It's a 6400 server: 2 PIII-733s, 704
> megs of registered ECC RAM. NT ran fine on this box, not a hitch.
>
Could you check /proc/interrupts? Is one number extremely high?

And try to boot with "mem=690M". My SiS boards become extremely slow if
I don't limit the memory. I guess the e820 map is wrong, and one of the
pages is actually power management registers/NVS.

--
Manfred
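
The /proc/interrupts check suggested above can also be done mechanically. The following is a rough Python sketch, not a standard tool: it sums each IRQ line across CPUs and flags any device IRQ whose total exceeds a multiple of the timer count. The function names and the factor of 10 are assumptions chosen for illustration, and the synthetic "storm" row at the bottom is made up to show what a hit looks like.

```python
def parse_interrupts(text):
    """Map each row label ("0", "25", "NMI", ...) to (total_count, description)."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    rows = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:       # rows like ERR: carry a single count
            if not field.isdigit():
                break
            counts.append(int(field))
        rows[label.strip()] = (sum(counts), " ".join(fields[len(counts):]))
    return rows


def suspicious_irqs(rows, factor=10):
    """Return device IRQs whose total is more than factor times the timer total."""
    timer_total = rows["0"][0]
    return [
        (label, total, desc)
        for label, (total, desc) in rows.items()
        if label.isdigit() and label != "0" and total > factor * timer_total
    ]


# Hypothetical example of an interrupt storm on IRQ 20.
STORM = """\
CPU0 CPU1
0: 1000 1000 IO-APIC-edge timer
20: 900000 900000 IO-APIC-level eth0
NMI: 0 0
ERR: 0
"""

print(suspicious_irqs(parse_interrupts(STORM)))
```

Comparing against the timer row is only a heuristic: the timer ticks at a fixed rate, so a device line that dwarfs it usually points at a stuck or misrouted interrupt.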

2002-04-01 22:26:03

by Jason Czerak

Subject: Re: ECC memory and SMP lockups on Gateway 6400 server

On Mon, 2002-04-01 at 04:44, Manfred Spraul wrote:
> > Recently got the go-ahead to upgrade the Gateway Win2K server to a Linux
> > box to replace our old webserver. It's a 6400 server: 2 PIII-733s, 704
> > megs of registered ECC RAM. NT ran fine on this box, not a hitch.
> >
> Could you check /proc/interrupts? Is one number extremely high?
>
> And try to boot with "mem=690M". My SiS boards become extremely slow if
> I don't limit the memory. I guess the e820 map is wrong, and one of the
> pages is actually power management registers/NVS.
>

The OEM 128 and 64 meg sticks of ECC that came with the machine work
fine. I got an aftermarket ECC stick that NT likes but Linux doesn't.


Standard boot with no extra kernel switches:
it's very slow and CPU load is 1.0

Moby:/proc # cat interrupts
           CPU0       CPU1
  0:      17736      20742    IO-APIC-edge  timer
  1:         62         70    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          0          2    IO-APIC-edge  rtc
 20:        434        453   IO-APIC-level  eth0
 24:         15         15   IO-APIC-level  sym53c8xx
 25:       1073       1108   IO-APIC-level  sym53c8xx
NMI:          0          0
LOC:      38293      38388
ERR:          0
MIS:          0




Booted with "kernel-2.4.18 mem=704M" at the LILO prompt and it's still slow.

Kernel command line: auto BOOT_IMAGE=Linux-2.4.18 ro root=802
BOOT_FILE=/boot/kernel-2.4.18 mem=704M

is what dmesg caught.

Moby:/proc # cat interrupts
           CPU0       CPU1
  0:      15540      18511    IO-APIC-edge  timer
  1:         16         24    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          1    IO-APIC-edge  rtc
 20:        151        162   IO-APIC-level  eth0
 24:         15         15   IO-APIC-level  sym53c8xx
 25:       3647       3656   IO-APIC-level  sym53c8xx
NMI:          0          0
LOC:      33966      33907
ERR:          0
MIS:          0



I finally got the machine stable with SMP enabled in the kernel (it had to
be a heat issue, with the mobo shutting things down).

I'm about to compile and use http://www.anime.net/~goemon/linux-ecc/
to see if I can figure out what exactly is happening.
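
As a rough sanity check on the two /proc/interrupts listings in this message, the counts can be turned into per-second rates. The sketch below assumes HZ=100, the timer tick rate of a stock 2.4 kernel on i386, so the summed IRQ 0 count divided by 100 approximates seconds since boot; the LOC counts in both listings are close to the timer totals, which is consistent with that assumption. The helper names are made up for illustration.

```python
HZ = 100  # assumption: timer ticks per second on a stock 2.4 i386 kernel


def uptime_seconds(timer_counts):
    """Approximate seconds since boot from the per-CPU IRQ 0 counts."""
    return sum(timer_counts) / HZ


def rate_per_second(irq_counts, timer_counts):
    """Average interrupts per second for one IRQ line since boot."""
    return sum(irq_counts) / uptime_seconds(timer_counts)


# Counts copied from the two listings above (CPU0, CPU1).
default_boot = {"timer": (17736, 20742), "eth0": (434, 453), "scsi": (1073, 1108)}
mem_704_boot = {"timer": (15540, 18511), "eth0": (151, 162), "scsi": (3647, 3656)}

for name, boot in (("default", default_boot), ("mem=704M", mem_704_boot)):
    up = uptime_seconds(boot["timer"])
    eth = rate_per_second(boot["eth0"], boot["timer"])
    scsi = rate_per_second(boot["scsi"], boot["timer"])
    print(f"{name}: up {up:.0f} s, eth0 {eth:.1f}/s, sym53c8xx {scsi:.1f}/s")
```

Under that assumption both boots had been up for roughly six minutes, and the busiest device line (sym53c8xx on IRQ 25) averages on the order of 6 and 21 interrupts per second respectively. Neither figure looks like an interrupt storm, so the slowness probably has to be explained some other way.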



2002-04-01 22:51:59

by Ed Vance

Subject: RE: ECC memory and SMP lockups on Gateway 6400 server

Jason Czerak wrote:
>
> The OEM 128 and 64 meg stick of ECC that came with the machine
> works fine. I got an aftermarket ECC stick that NT likes but
> linux doesn't.

Reminded me of the time I got burned by a swap-meet bargain memory vendor. The
timing config info in the DIMM's SPD EEPROM was very optimistic compared to
the datasheet spec for the actual memory chips used.

----------------------------------------------------------------
Ed Vance [email protected]
Macrolink, Inc. 1500 N. Kellogg Dr Anaheim, CA 92807
----------------------------------------------------------------