2002-03-11 20:23:51

by James Washer

[permalink] [raw]
Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"


Tom,

If I send you a patch to do a bit of debugging, would you be able to run
it? Basically, my plan is just to call a die() like routine for trap=6, so
that we get a good stack frame (much like an oops).. From there, we should
be able to figure out WHY the program is getting an illegal op..

My guesses at this point in time are.... bad file io( i.e. the executable
is corrupt) , or bad mem... (check to see if the same phys page is being
used, for example)

- jim

"Tom Epperly" <[email protected]>@vger.kernel.org on 03/11/2002 08:07:05 AM

Sent by: [email protected]


To: Alan Cox <[email protected]>
cc: [email protected]
Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal
instructions"



Alan Cox wrote:

>>Do you agree that this is likely to be a kernel problem? Is upgrading
>>the kernel my best course of action?
>>
>
>Almost every other report I have ever seen that looked like that one has
always
>turned out to be hardware related. The randomness in paticular tends to be
>a pointer to thinks like cache faults.
>
>You do have ECC main memory which is good.
>
>What other hardware is in the machine ?
>
>
To recap from an earlier email, my nightly build & regression tests (a
roughly 2 hour process involving Sun's JDK, GNU make, gcc, g++, g77 &
Bourne shell scripts) has been failing intermittently on a dual-Xeon
system usually with an "Illegal intruction" signal. I've tried removing
the sound card and disabling the X11 server to avoid loading the nVidia
kernel mod. The intermittent failures disappear when I run non-SMP. I've
tried swapping the processors on the motherboard, and both processors
appear to work fine individually. Most of the boxes I've run on have >=
512MB ECC RAM. I've run Dell's hardware diagnostics (especially the
memory ones) twice. The diagnostics don't seem to have SMP tests where
both CPUs are being stressed.

FYI when I upgraded to the 2.4.18-1smp kernel, the failure rate went
from 20% to 100%. I have tried running the nightly build & regression on
roughly 6 different dual processors Pentium III or better machines
(cylcing it over and over), and they all have intermittent failures of
one kind or another. All these machines are made by Dell, but they
provide some evidence that it is not a hardware problem.

Tom