2001-07-08 17:33:33

by Pavel Machek

Subject: Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))

Hi!

> > Great, glad to hear it. Who (if anyone) is still attempting to unravel
> > the puzzle of the Via southbridge bug? You, Andy, should try and get in
> > touch with them and help debug this thing, if you're up to it.
>
> The IWILL problem seems unrelated. It's the board that, more than any other,
> people report as failing totally when streaming memory copies using movntq
> instructions.
>
> The Athlon optimised kernel places pretty much the absolute maximum load
> possible on the memory bus. Several people have reported that machines that
> are otherwise stable on the bios fast options require the proper conservative
> settings to be stable with the Athlon optimisations

Do we need a patch to memtest to use 3dnow?
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


2001-07-08 17:40:54

by Alan

Subject: Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))

> > possible on the memory bus. Several people have reported that machines that
> > are otherwise stable on the bios fast options require the proper conservative
> > settings to be stable with the Athlon optimisations
>
> Do we need a patch to memtest to use 3dnow?

Possibly yes. Although memtest86 really tries to test for on-chip, not
bus-related, problems.

2001-07-10 01:51:36

by Rob Landley

Subject: Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))

On Sunday 08 July 2001 13:37, Alan Cox wrote:
> > > possible on the memory bus. Several people have reported that machines
> > > that are otherwise stable on the bios fast options require the proper
> > > conservative settings to be stable with the Athlon optimisations
> >
> > Do we need a patch to memtest to use 3dnow?
>
> Possibly yes. Although memtest86 really tries to test for on-chip, not
> bus-related, problems.

What else tends to fail on the motherboard that might be easy to test for?
Processor overheating? (When the thermometer circuitry's there, anyway.)
Something to do with DMA? (Would DMA to/from a common card like VGA catch
chipset-side DMA problems?) There was an SMP exception thing floating by
recently, is that common and testable?

I know there's a lot of funky peripheral combinations that behave strangely,
but without opening that can of worms what kind of common problems on the
motherboard itself might be easy to test for in a "run this overnight and see
if it finds a problem with your hardware" sort of way?

Rob

(P.S. What kind of CPU load is most likely to send a processor into overheat?
(Other than "a tight loop", thanks. I mean what kind of instructions?)
This is going to be CPU specific, isn't it? Or would a general instruction
mix that doesn't call halt be enough? It would need to keep the FPU busy
too, wouldn't it? And maybe handle interrupts. Hmmm...)

I wonder... The torture test Tom's Hardware guide uses for processor
overheating is GCC compiling the Linux kernel. (That's what caught the
Pentium III 1.13 gigahertz instability when nothing else would.) I wonder,
maybe if a stripped down subset of a known version of GCC and a known version
of the kernel were running from a ramdisk... It USED to fit in 8 megs with
no swap, might still fit in 32 with a decent chunk of kernel source. Throw
the compile in a loop, add in a processor temperature detector daemon to kill
the test and HLT the system if the temperature went too high...

I wonder what bits of the kernel GCC actually needs to run these days?
(System V inter-process communication? sysctl support? Hmmm... Would
2.4.anything be a stable enough base for this yet, or should it be 2.2.19?
Is 2.4 still psychotic with less swap space than ram (I.E. no swap space at
all)?)

Off to play...

Still Rob.

2001-07-10 09:17:56

by Ville Herva

Subject: Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))

On Mon, Jul 09, 2001 at 12:48:59PM -0400, you [Rob Landley] claimed:
>
> (P.S. What kind of CPU load is most likely to send a processor into overheat?
> (Other than "a tight loop", thanks. I mean what kind of instructions?)
> This is going to be CPU specific, isn't it? Or would a general instruction
> mix that doesn't call halt be enough? It would need to keep the FPU busy
> too, wouldn't it? And maybe handle interrupts. Hmmm...)

See Robert Redelmeier's cpuburn:

http://users.ev1.net/~redelm/

It is coded in assembly specifically to heat the CPU as much as possible. See
the README for details, but it seems that floating point operations are
tougher than integers and MMX can be even harder (depending on CPU model, of
course). Not sure what kind of role SSE, SSE2, 3dNow! play these days.
Perhaps Alan knows?

> I wonder... The torture test Tom's Hardware guide uses for processor
> overheating is GCC compiling the Linux kernel.

That shouldn't really be that good a test. During compilation, CPU spends a
_lot_ of time waiting for the memory and even for the disk io. For maximum
heat, you really want a tight loop of instructions, that sits firmly in L1
cache.

The gcc compile is a good test for many other things - it uses a lot of
memory with complex pointer references (tests memory, and bit errors in
pointers are likely to sig11 rather than produce subtle errors in output),
stresses chipset somewhat (memory throughput), and cpu somewhat. But to test
CPU overheating and nothing else, cpuburn should be a lot better. (Even
seti@home is better as it uses FPU). Just run them and observe the sensor
readings. Cpuburn gets several degrees higher.

> the compile in a loop, add in a processor temperature detector daemon to kill
> the test and HLT the system if the temperature went too high...

Cpuburn exits when CPU miscalculates something (sign of overheat).

I'm not sure if cpuburn is included in cerberus these days (istr it is), but
a nice test set for memory, cpu, disk etc to run overnight or over a weekend
to catch most of the hw faults would definitely be nice.


-- v --

[email protected]

2001-07-10 23:54:08

by Adam Sampson

Subject: Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))

Ville Herva <[email protected]> writes:

> It is coded in assembly specifically to heat the CPU as much as possible. See
> the README for details, but it seems that floating point operations are
> tougher than integers and MMX can be even harder (depending on CPU model, of
> course). Not sure what kind of role SSE, SSE2, 3dNow! play these days.
> Perhaps Alan knows?

I would have thought this would be a nice problem for a genetic
algorithm to solve---start with random blocks of data, execute them
repeatedly for a period of time (restarting upon CPU traps), and
"breed" those that cause the greatest temperature increase. Any bored
research students out there?
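
The breeding loop described above can be sketched roughly as follows. This is
only an illustration of the selection/crossover/mutation cycle: it does not
execute the blocks as machine code or handle CPU traps, and the fitness
function is a stand-in. In a real experiment `fitness` would be a hypothetical
routine that runs a block for a fixed period and returns the measured
temperature rise. All names and parameters here are assumptions.

```python
import random


def evolve(fitness, genome_len=64, pop_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Breed random byte blocks toward higher fitness.

    Returns (best_block, best_score). `fitness` is any callable that maps a
    bytes object to a number; higher is "hotter".
    """
    rng = random.Random(seed)
    # Start with random blocks of data.
    pop = [bytes(rng.randrange(256) for _ in range(genome_len))
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]        # keep the hotter half
        children = [ranked[0]]                  # elitism: best survives as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)  # single-point crossover
            child = bytearray(a[:cut] + b[cut:])
            for i in range(genome_len):         # random point mutations
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(256)
            children.append(bytes(child))
        pop = children
    best = max(pop, key=fitness)
    return best, fitness(best)


def toy_fitness(block):
    """Placeholder score: counts one arbitrary byte value.

    A real run would replace this with a temperature measurement.
    """
    return sum(1 for b in block if b == 0x90)
```

Because the best block of each generation is carried over unchanged, the best
score never drops from one generation to the next, which matters when each
fitness evaluation is as slow as waiting for a CPU to warm up.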

--
Adam Sampson <[email protected]> <URL:http://azz.us-lot.org/>

2001-07-11 00:30:55

by Rob Landley

Subject: Hardware testing [was Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))]

On Tuesday 10 July 2001 05:17, Ville Herva wrote:
> On Mon, Jul 09, 2001 at 12:48:59PM -0400, you [Rob Landley] claimed:
> > (P.S. What kind of CPU load is most likely to send a processor into
> > overheat? (Other than "a tight loop", thanks. I mean what kind of
> > instructions?) This is going to be CPU specific, isn't it? Or would a
> > general instruction mix that doesn't call halt be enough? It would need
> > to keep the FPU busy too, wouldn't it? And maybe handle interrupts.
> > Hmmm...)
>
> See Robert Redelmeier's cpuburn:
>
> http://users.ev1.net/~redelm/

Cool. If nothing else, this is a much better starting point for further work
than starting from scratch...

> It is coded in assembly specifically to heat the CPU as much as possible. See
> the README for details, but it seems that floating point operations are
> tougher than integers and MMX can be even harder (depending on CPU model,
> of course). Not sure what kind of role SSE, SSE2, 3dNow! play these days.
> Perhaps Alan knows?

There are at least three separate things that need testing here. memtest86
tests whether your memory is OK. CPUburn seems to do a good job testing
processor heat (not that I'm running it on my laptop, which doesn't seem to
have a thermal readout thingy anyway...)

The third thing (which started this thread) was memory bus. The new 3DNow
optimizations drove a memory bus into failure, and that IS processor
specific...

> The gcc compile is a good test for many other things - it uses a lot of
> memory with complex pointer references (tests memory, and bit errors in
> pointers are likely to sig11 rather than produce subtle errors in output),
> stresses chipset somewhat (memory throughput), and cpu somewhat. But to
> test CPU overheating and nothing else, cpuburn should be a lot better.
> (Even seti@home is better as it uses FPU). Just run them and observe the
> sensor readings. Cpuburn gets several degrees higher.

The downside of a test like gcc is that it does test many things, meaning
when it fails you still don't know why.

memtest86 is great because it ONLY tests memory. CPUburn is similarly
specific. A memory bus buster would be a good tool to add to the mix. (DMA
is another common problem, but the more I look into it, the more it seems to
be dependent on whatever peripheral you're talking to, which is more
complication than I'm looking to bite off...)

The downside of memtest86 is that your system can pass it and still have an
obvious problem (for example, overclocking stresses both memory bus AND
heat...)

It might be possible to put all three testers into a menu where you could
switch on and off what you wanted to test, and run them overnight. That way,
if you are testing for three things (perhaps alternating tests every few
minutes?), and you get it to fail, you can switch some off to get more
specific tests to narrow down the problem...

> > the compile in a loop, add in a processor temperature detector daemon to
> > kill the test and HLT the system if the temperature went too high...
>
> Cpuburn exits when CPU miscalculates something (sign of overheat).
>
> I'm not sure if cpuburn is included in cerberus these days (istr it is),
> but a nice test set for memory, cpu, disk etc to run overnight or over a
> weekend to catch most of the hw faults would definitely be nice.

I've heard of cerberus but thought it was just a disk test suite... One more
thing to download and look into... (If the tests in it can be switched
on/off, maybe this is what I'm looking for...)

Rob

2001-07-11 04:19:08

by Albert D. Cahalan

Subject: Re: Hardware testing [was Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))]

Rob Landley writes:

> The third thing (which started this thread) was memory bus. The new 3DNow
> optimizations drove a memory bus into failure, and that IS processor
> specific...
...
> memtest86 is great because it ONLY tests memory. CPUburn is similarly
> specific. A memory bus buster would be a good tool to add to the mix. (DMA
> is another common problem, but the more I look into it, the more it seems to
> be dependent on whatever peripheral you're talking to, which is more
> complication than I'm looking to bite off...)

DMA could be done in a sane manner. Let drivers register a function
to exercise DMA. When you want to test, tell all registered drivers
to start wild excessive DMA. Use a timer to stop this, because you
might end up pretty well locked out of your system while the bus is
busy moving test data.

2001-07-11 08:32:36

by Ville Herva

Subject: Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))

On Tue, Jul 10, 2001 at 10:24:21PM +0100, you [Adam Sampson] claimed:
> Ville Herva <[email protected]> writes:
>
> > It is coded in assembly specifically to heat the CPU as much as possible. See
> > the README for details, but it seems that floating point operations are
> > tougher than integers and MMX can be even harder (depending on CPU model, of
> > course). Not sure what kind of role SSE, SSE2, 3dNow! play these days.
> > Perhaps Alan knows?
>
> I would have thought this would be a nice problem for a genetic
> algorithm to solve---start with random blocks of data, execute them
> repeatedly for a period of time (restarting upon CPU traps), and
> "breed" those that cause the greatest temperature increase. Any bored
> research students out there?

I'm sure getting an Intel or AMD engineer to comment on this would be far
more fertile. After all, engineers developed a computer in just 50 years,
but it took millions of years for evolution to come up with something like a
human being... [1]


-- v --

[email protected]

[1] Now, of course someone will insist that it was in fact God who created
man... Perhaps someone ought to go to the desert and wait for an
enlightenment on the Right Instruction Sequence.

Ob-;), no offense intended.


2001-07-11 08:43:46

by Ville Herva

Subject: Re: Hardware testing [was Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))]

On Tue, Jul 10, 2001 at 11:28:25AM -0400, you [Rob Landley] claimed:
>
> The downside of a test like gcc is that it does test many things, meaning
> when it fails you still don't know why.

True.

> memtest86 is great because it ONLY tests memory.

Yes, and because it also accurately tells you which memory location is bad.
(This can't be easily done from user space, I gather). You can use this
information to work around the memory problem with the BadRam patch from Rick
Van Rein.

> CPUburn is similarly specific. A memory bus buster would be a good tool
> to add to the mix. (DMA is another common problem, but the more I look
> into it, the more it seems to be dependent on whatever peripheral you're
> talking to, which is more complication than I'm looking to bite off...)

True.

> It might be possible to put all three testers into a menu where you could
> switch on and off what you wanted to test, and run them overnight. That way,
> if you are testing for three things (perhaps alternating tests every few
> minutes?), and you get it to fail, you can switch some off to get more
> specific tests to narrow down the problem...

Actually lilo is just about enough for such a menu system...

Something like

image = /boot/memtest86
    label = memtest86
image = /boot/vmlinux
    label = cpuburn
    root = /dev/hda2
    append = "init=/usr/local/bin/burnP6"
    read-only
image = /boot/vmlinux
    label = testDMA
    root = /dev/hda2
    append = "init=/usr/local/bin/testDMA"
    read-only

It would take some scripting to alternate the tests automatically, but
perhaps it could be done.
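
For the testers that run under Linux, the alternation could be scripted along
these lines. This is a sketch under assumptions: it cannot rotate memtest86
in, since that boots on its own, and the command paths you would pass in (for
example `/usr/local/bin/burnP6`, or the `testDMA` program imagined above) are
taken from the lilo example rather than from any real packaging. It relies on
a tester exiting when it detects a fault, as cpuburn does.

```python
import itertools
import subprocess


def build_schedule(tests, slice_s, total_s):
    """Return [(start_second, test_name), ...] cycling round-robin."""
    if not tests or slice_s <= 0:
        raise ValueError("need at least one test and a positive slice")
    names = itertools.cycle(tests)
    schedule = []
    start = 0
    while start < total_s:
        schedule.append((start, next(names)))
        start += slice_s
    return schedule


def run_schedule(schedule, commands, slice_s):
    """Run each scheduled tester for one time slice.

    `commands` maps test name -> argv list. A tester that is still running
    when its slice ends is killed and counted as a pass. A tester that
    exits early is assumed to have found a fault; its name and exit status
    are returned so you know which test to rerun on its own.
    """
    for _start, name in schedule:
        try:
            result = subprocess.run(commands[name], timeout=slice_s)
        except subprocess.TimeoutExpired:
            continue  # survived the whole slice
        return name, result.returncode
    return None  # every slice ran to completion
```

For instance, `build_schedule(["cpuburn", "testDMA"], 300, 900)` gives three
five-minute slots: cpuburn, testDMA, cpuburn.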

> I've heard of cerberus but thought it was just a disk test suite... One more
> thing to download and look into... (If the tests in it can be switched
> on/off, maybe this is what I'm looking for...)

AFAIK it's a pretty complete test suite VA uses (used?) for testing their
hw. I'm not sure, though.


-- v --

[email protected]

2001-07-11 09:04:36

by Eyal Lebedinsky

Subject: Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))

Ville Herva wrote:
>
> On Mon, Jul 09, 2001 at 12:48:59PM -0400, you [Rob Landley] claimed:
> >
> > (P.S. What kind of CPU load is most likely to send a processor into overheat?
> > (Other than "a tight loop", thanks. I mean what kind of instructions?)
> > This is going to be CPU specific, isn't it? Or would a general instruction
> > mix that doesn't call halt be enough? It would need to keep the FPU busy
> > too, wouldn't it? And maybe handle interrupts. Hmmm...)
>
> See Robert Redelmeier's cpuburn:
>
> http://users.ev1.net/~redelm/

I took this program for a spin and I noted the reported CPU temp
went up by 12 °C (43 -> 55).

However, more interesting, the +5V line dropped from 4.82 to 4.72.
This is on a Gigabyte GA-7ZX with an Athlon/1200 and 2x128MB.

Some mobos may actually have their voltages pushed outside accepted
levels and cause a failure, which is actually not related to the
temperature. And you do not need to run the test for a long time,
the drop is immediate and stable.
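
To put those two readings in perspective, here is the arithmetic as a small
sketch. The +/-5% band is the usual ATX tolerance for the +5V rail; that
figure is an assumption brought in for illustration and is not stated in the
post, and on-board sensor readings can themselves be off.

```python
NOMINAL_V = 5.0
TOLERANCE = 0.05  # +/-5%: assumed ATX tolerance for the +5V rail


def deviation(measured_v, nominal_v=NOMINAL_V):
    """Fractional deviation from nominal, e.g. -0.036 for 3.6% low."""
    return (measured_v - nominal_v) / nominal_v


def within_tolerance(measured_v, nominal_v=NOMINAL_V, tol=TOLERANCE):
    """True if the reading lies inside the +/-tol band around nominal."""
    return abs(deviation(measured_v, nominal_v)) <= tol


# The two readings reported above.
for label, volts in (("idle", 4.82), ("under cpuburn", 4.72)):
    verdict = "within" if within_tolerance(volts) else "outside"
    print(f"{label}: {volts:.2f} V is {deviation(volts):+.1%}, "
          f"{verdict} +/-5%")
```

On those numbers the rail sits 3.6% low before the test and 5.6% low during
it, so if the sensor is accurate the loaded reading falls just outside the
assumed band.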

I can only imagine what will happen if some game pushes the CPU to
the limit while running a hot video card hard, as I expect some
highly optimized graphics drivers might do. May cause some
interesting crashes.

Anyone up to enhancing the program to stress the video memory at the
same time?


In other words, this is a good stress test for the whole mobo design
and setup, not just the CPU/HSF combo.

--
Eyal Lebedinsky ([email protected]) <http://samba.anu.edu.au/eyal/>

2001-07-11 09:14:07

by Vojtech Pavlik

Subject: Re: Hardware testing [was Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))]

On Tue, Jul 10, 2001 at 11:28:25AM -0400, Rob Landley wrote:
> On Tuesday 10 July 2001 05:17, Ville Herva wrote:
> > On Mon, Jul 09, 2001 at 12:48:59PM -0400, you [Rob Landley] claimed:
> > > (P.S. What kind of CPU load is most likely to send a processor into
> > > overheat? (Other than "a tight loop", thanks. I mean what kind of
> > > instructions?) This is going to be CPU specific, isn't it? Or would a
> > > general instruction mix that doesn't call halt be enough? It would need
> > > to keep the FPU busy too, wouldn't it? And maybe handle interrupts.
> > > Hmmm...)
> >
> > See Robert Redelmeier's cpuburn:
> >
> > http://users.ev1.net/~redelm/
>
> Cool. If nothing else, this is a much better starting point for further work
> than starting from scratch...
>
> > It is coded in assembly specifically to heat the CPU as much as possible. See
> > the README for details, but it seems that floating point operations are
> > tougher than integers and MMX can be even harder (depending on CPU model,
> > of course). Not sure what kind of role SSE, SSE2, 3dNow! play these days.
> > Perhaps Alan knows?
>
> There are at least three separate things that need testing here. memtest86
> tests whether your memory is OK. CPUburn seems to do a good job testing
> processor heat (not that I'm running it on my laptop, which doesn't seem to
> have a thermal readout thingy anyway...)
>
> The third thing (which started this thread) was memory bus. The new 3DNow
> optimizations drove a memory bus into failure, and that IS processor
> specific...

Don't forget the L1/L2/L3 caches. I once had a mainboard with a faulty
L2 cache chip ('twas a K6-3 CPU, plus a FIC VA-503+ mainboard). No memory
or CPU test found the failure, yet kernel compilation was still crashing
after 6-8 hours.

I modified the 'memtest.c' little proggy (not the big memtest86, just a
little utility that runs under Linux) to use patterns and a test size
that exercise the L1 and then the L2, and the error showed up after ten
seconds of running the test.
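
The modified program itself is not shown here, so the following is only a
guess at the general shape of such a test: write a pattern over a buffer no
larger than the cache level of interest, read it back, and report mismatches.
The sizes are hypothetical placeholders, and an interpreted language gives no
real control over what stays in cache, so treat this as an illustration of
the pattern/size logic rather than a usable cache tester. A real one would be
a tight C or assembly loop.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits

# Hypothetical working-set sizes; choose values below your CPU's actual
# L1 and L2 sizes so the buffer can stay resident in that level.
WORKING_SETS = {"L1-sized": 16 * 1024, "L2-sized": 128 * 1024}


def pattern_test(size_bytes, passes=1, patterns=PATTERNS):
    """Fill a buffer with each pattern, read it back, list mismatches.

    Returns a list of (offset, expected_byte, actual_byte) tuples; an
    empty list means every byte read back as written.
    """
    buf = bytearray(size_bytes)
    errors = []
    for _ in range(passes):
        for pattern in patterns:
            expected = bytes([pattern]) * size_bytes
            buf[:] = expected                  # write pass
            if buf != expected:                # quick whole-buffer compare
                errors.extend((i, pattern, got)
                              for i, got in enumerate(buf)
                              if got != pattern)
    return errors
```

Running many passes over a small buffer, instead of one pass over all of RAM,
keeps the traffic in the cache rather than in main memory.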

--
Vojtech Pavlik
SuSE Labs

2001-07-12 00:07:53

by Rob Landley

Subject: Re: Hardware testing [was Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))]

On Wednesday 11 July 2001 05:11, Vojtech Pavlik wrote:

> Don't forget the L1/L2/L3 caches. I once had a mainboard with a faulty
> L2 cache chip ('twas a K6-3 CPU, plus a FIC VA-503+ mainboard). No memory
> or CPU test found the failure, yet kernel compilation was still crashing
> after 6-8 hours.
>
> I modified the 'memtest.c' little proggy (not the big memtest86, just a
> little utility that runs under Linux) to use patterns and a test size
> that exercise the L1 and then the L2, and the error showed up after ten
> seconds of running the test.

I don't suppose you still have that lying around somewhere? :)

Rob

2001-07-12 06:58:15

by Ville Herva

Subject: Re: Hardware testing [was Re: VIA Southbridge bug (Was: Crash on boot (2.4.5))]

On Wed, Jul 11, 2001 at 11:05:19AM -0400, you [Rob Landley] claimed:
> On Wednesday 11 July 2001 05:11, Vojtech Pavlik wrote:
>
> > I modified the 'memtest.c' little proggy (not the big memtest86, just a
> > little utility that runs under Linux) to use patterns and a test size
> > that exercise the L1 and then the L2, and the error showed up after ten
> > seconds of running the test.
>
> I don't suppose you still have that lying around somewhere? :)

I'm not sure if it's any good, but I have one at

http://v.iki.fi/~vherva/memburn.c

(It did find one bad memory case a while ago...)


-- v --

[email protected]