2001-03-17 16:43:46

by kees

Subject: [OT] how to catch HW fault

Hi,
I'm getting mad because of random freezes of my system: Linux 2.2.19pre7
on an MSI 694D with dual PIII (677MHz) and 128 MB, no OC. I tried to isolate
the problem by replacing cards (S3 video, 3com 59X, ES1373 and
AIC7xxx), but that didn't solve anything. Even in runlevel 1 with only a
video card the freeze happens. It is a total lockup: no SysRq, no ping from
the network, nothing in the logs. A freeze may happen 4 times in an hour or
once in 2 weeks. I have the same mobo and PIIIs at home without the slightest
problems. Does anyone know of suitable diagnostics to track this down?
regards
Kees




2001-03-17 18:26:24

by Aaron Lunansky

Subject: Re: [OT] how to catch HW fault

Sounds like the only thing you haven't swapped out of your machine is the
ram/cpu.

It could very well be your ram (I don't suspect the cpu). If you can, try a
different stick of ram.






2001-03-17 19:05:11

by John Jasen

Subject: Re: [OT] how to catch HW fault

On Sat, 17 Mar 2001, Aaron Lunansky wrote:

> It could very well be your ram (I don't suspect the cpu). If you can, try a
> different stick of ram.

I've found that a good way to provoke memory faults is to recompile
the kernel with make -j16 and, on a second virtual console, run something
like dd if=/dev/hda of=/dev/null bs=2048k

Either the kernel compile will fail with a sig11, or the dd will fail and
lock the system, in my experience.

I've used this method, crudely, to chase down memory problems in systems
using 256-512MB ram.

YMMV.

--
-- John E. Jasen ([email protected])
-- In theory, theory and practise are the same. In practise, they aren't.

2001-03-17 20:23:18

by Ville Herva

Subject: Re: [OT] how to catch HW fault

On Sat, Mar 17, 2001 at 01:22:46PM -0500, you [Aaron Lunansky] claimed:
> Sounds like the only thing you haven't swapped out of your machine is the
> ram/cpu.
>
> It could very well be your ram (I don't suspect the cpu). If you can, try a
> different stick of ram.

Or try memtest86 (http://reality.sgi.com/cbrady_denver/memtest86/); it's a
very good memory tester, and my first choice when I suspect a hardware fault.


-- v --

[email protected]

2001-03-18 22:23:47

by kees

Subject: Re: [OT] how to catch HW fault

Hi,

I also ran memtest86 for 24 hours and that didn't give a clue. If bad
RAM were really involved, I'd have expected to see things like
failing fscks, failing kernel compiles and such. But none of that happens:
the system runs perfectly as long as it doesn't freeze (lock up).

So yes, only the CPUs and the mobo are in question. What I was looking
for was a tool like memtest86, but for motherboards...

regards

Kees


On Sat, 17 Mar 2001, Aaron Lunansky wrote:

> Sounds like the only thing you haven't swapped out of your machine is the
> ram/cpu.
>
> It could very well be your ram (I don't suspect the cpu). If you can, try a
> different stick of ram.
>
>

2001-03-19 10:36:36

by Ville Herva

Subject: Re: [OT] how to catch HW fault

On Sun, Mar 18, 2001 at 09:11:46PM +0100, you [kees] claimed:
> Hi,
>
> I also ran memtest86 for 24 hours and that didn't give a clue. If bad
> RAM were really involved, I'd have expected to see things like
> failing fscks, failing kernel compiles and such. But none of that happens:
> the system runs perfectly as long as it doesn't freeze (lock up).
>
> So yes, only the CPUs and the mobo are in question. What I was looking
> for was a tool like memtest86, but for motherboards...

You really cannot say that bad memory is involved ONLY when fscks fail and
kernel compiles fail. No way.

Think about it: the failing bit can well be in a place that the kernel never
touches, and gcc usually does not touch. Moreover, the bit flip usually does
not happen every time; you have to stress the memory for hours and sometimes
use a specific bit pattern to trigger the problem.

I had one machine that compiled the kernel just fine and ran pretty smoothly
overall, but experienced weird hiccups like dying apps, a failing Oracle
install etc. Not too many though, so I was attributing them to buggy software.
Then I tried to take a large backup. Bzip failed (internal error) one third
of the time, and once produced a different result. I quickly hacked up a
user-space memory tester, and sure enough it reported an error after five
hours. (The machine was already in production, so I couldn't just take it
down and launch memtest86.) I verified the problem with memtest86 during the
next night, and applied the marvellous badmem patch to the kernel. After
marking the problematic place unusable, all problems disappeared. I just lost
2MB out of 256.

What I learned is that spotting memory errors can be difficult, and the
symptoms can be stealthy.
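
To illustrate the kind of user-space test described above, here is a minimal
sketch in C. It is not memburn.c: the function names check_pattern and
burn_pass and the choice of patterns are assumptions made up for this
example. It only shows the basic idea of writing a bit pattern over a buffer
and reading it back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: fill buf with one pattern, read it back, and
 * return the number of words that did not match. The volatile qualifier
 * keeps the compiler from optimizing the read-back away. */
size_t check_pattern(volatile uint32_t *buf, size_t words, uint32_t pattern)
{
    size_t i, errors = 0;

    assert(buf != NULL);
    for (i = 0; i < words; i++)
        buf[i] = pattern;
    for (i = 0; i < words; i++) {
        uint32_t got = buf[i];
        if (got != pattern) {
            fprintf(stderr, "word %lu: wrote %08lx, read %08lx\n",
                    (unsigned long)i, (unsigned long)pattern,
                    (unsigned long)got);
            errors++;
        }
    }
    return errors;
}

/* Hypothetical helper: one pass over a few fixed patterns plus a
 * "walking one" (a single set bit moved through all 32 positions).
 * Returns the total number of mismatches seen in this pass. */
size_t burn_pass(volatile uint32_t *buf, size_t words)
{
    static const uint32_t fixed[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };
    size_t k, errors = 0;
    unsigned bit;

    for (k = 0; k < sizeof fixed / sizeof fixed[0]; k++)
        errors += check_pattern(buf, words, fixed[k]);
    for (bit = 0; bit < 32; bit++)
        errors += check_pattern(buf, words, (uint32_t)1 << bit);
    return errors;
}
```

A caller would allocate a large buffer with malloc() and run burn_pass() on it
in a loop for hours, stopping when it returns non-zero. Keep in mind the
limits of this approach: a user-space program only sees the pages the kernel
happens to give it, the reported positions are offsets into the buffer and not
physical addresses, and a small buffer may be served from the CPU cache
instead of from RAM. A standalone tester like memtest86 has none of these
restrictions.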


-- v --

[email protected]

2001-03-19 11:45:34

by Ville Herva

Subject: Re: [OT] how to catch HW fault

On Mon, Mar 19, 2001 at 12:35:19PM +0200, you [Ville Herva] claimed:
> I quickly hacked up a user-space memory tester, and sure enough it
> reported an error after five

If anyone is interested in the said hack (some already mailed me that they
are), I made it available at

http://v.iki.fi/~vherva/memburn.c

Disclaimer: it's just a quick hack, please use memtest86 if possible.
Memburn does have one found memory error under its belt, though ;).


-- v --

[email protected]