2002-02-21 17:36:08

by Tom Epperly

Subject: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions" during builds

I am getting intermittent "Illegal instruction" errors during builds
of the software I am developing, and it appears to be kernel
related. I have been investigating this problem for several weeks, and
I have exhausted all the means of investigation known to me (detailed
below). There is evidence to suggest it is not a RAM problem or a
random hardware problem (see below). I can "solve" the problem by
running the non-SMP kernel and ignoring the second processor, but this
is not a particularly satisfying solution. I am wondering if someone
can suggest some additional things I can do to understand and fix this
problem. I would appreciate it if you could CC me on replies.

EVIDENCE OF THE PROBLEM
=======================

Here is an excerpt from the make log to show the effects of the problem:

make[2]: Entering directory `/home/epperly/tmp/nightly_qc/cronjobs/tom-linux-gcc2.96/babel/doc/talks'
rm -rf .libs _libs
rm -f *.lo
make[2]: Leaving directory `/home/epperly/tmp/nightly_qc/cronjobs/tom-linux-gcc2.96/babel/doc/talks'
Making clean in papers
make[1]: *** [clean-recursive] Illegal instruction
make[1]: Leaving directory `/home/epperly/tmp/nightly_qc/cronjobs/tom-linux-gcc2.96/babel/doc'
make: *** [clean-recursive] Error 1
****** make clean failed ******

Sometimes, the build runs to completion and succeeds. When it fails, it
fails in a different spot each time. It doesn't always list "Illegal
instruction" as the error. Here is another error message I've seen:

make[3]: *** [installcheck-local] Error 132
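
A note on reading that status: by shell convention, an exit status above
128 usually means the command was killed by signal (status - 128), so
"Error 132" is most likely the same SIGILL reported by a different route.
A minimal Python sketch of the decoding follows. The helper name
describe_exit_status is made up for illustration, and the signal
numbering assumes Linux.

```python
import signal

def describe_exit_status(status):
    """Explain a shell-style exit status (hypothetical helper, not part of make)."""
    if status > 128:
        signum = status - 128
        try:
            # Map the number back to a symbolic name where the platform knows one.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {status}"

print(describe_exit_status(132))
```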

I could show you a lot more examples, but they don't seem to indicate more
than the examples I've shown here. The package we build can be downloaded
here: http://www.llnl.gov/casc/components/docs/babel-0.6.3.tar.gz
The build uses the autotools suite, Sun's JDK 1.3.1_02, gcc, g++, g77, and
Python.

On one occasion, random processes started dying on the machine. I had to
reboot to recover.

The failure rate of our nightly build is between 20% and 40%. This figure
excludes failures that we can trace back to our own coding mistakes. The
nightly builds run configure, build, and regression testing in sequence.

MACHINE DETAILS
===============

HARDWARE    Dell Precision Workstation 530
PROCESSORS  Dual Intel Xeon Processors 1500MHz
RAM         512MB ECC RAM
O/S         RH 7.2 (upgraded from 7.1) running RH's 2.4.9-21 SMP

$ /sbin/lsmod
Module                  Size  Used by
nfsd                   71232   8  (autoclean)
autofs                 11556   1  (autoclean)
nfs                    79840   3  (autoclean)
lockd                  53184   1  (autoclean) [nfsd nfs]
sunrpc                 64816   1  (autoclean) [nfsd nfs lockd]
3c59x                  26504   1
usb-uhci               21668   0  (unused)
usbcore                51808   1  [usb-uhci]
aic7xxx               114624   6
sd_mod                 11900   6
scsi_mod               98584   2  [aic7xxx sd_mod]

An identical machine has the same intermittent problems that my box does.

WHAT I HAVE ALREADY TRIED
=========================

1. Upgraded from an earlier kernel to RH 2.4.9-21 SMP

The new kernel didn't change anything.

2. Ran Dell's memory checker on the RAM for an hour. It checked out
fine.

The fact that another machine next door has the exact same problems
suggests that it isn't a random hardware problem unless they both came
from the same bad batch.

3. Opened my case to vent additional heat.

Someone suggested that the CPUs might be overheating and that opening
the case might solve the problem. It didn't solve the problem. My
machine is in a well air conditioned room, and it didn't seem
excessively hot when I opened the case. I haven't overclocked the
machine or anything like that.

4. Disabled the X11 server and rebooted to avoid loading the nVIDIA kernel module.

This may have lowered the frequency of problems, but it did not
eliminate the problem. Our nightly build still failed roughly
20% of the time with "Illegal instruction" errors or other
unexplainable failures.

5. Ran the non-SMP 2.4.9-21 kernel, effectively turning off the second
processor (X11 still disabled)

This seems to have "solved" the problem. I've run over 22 nightly
builds, two at a time, on the system without a single failure.
Running with just one processor is better than running an unstable
two processor system, but it seems like I should be able to figure
out how to have a stable two processor system.
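
As a back-of-the-envelope check on how strong that evidence is, one can
ask how likely 22 clean builds in a row would be if the failure rate were
still 20% to 40%. The Python sketch below assumes, as a simplification,
that builds fail independently of each other (builds run two at a time
may not be). The function name is illustrative only.

```python
def chance_of_clean_streak(failure_rate, builds):
    """Probability that `builds` independent runs all pass at the given per-run failure rate."""
    return (1.0 - failure_rate) ** builds

for rate in (0.20, 0.40):
    p = chance_of_clean_streak(rate, 22)
    print(f"failure rate {rate:.0%}: P(22 clean builds) = {p:.6f}")
```

Under that assumption, even the lower 20% rate gives well under a 1%
chance of 22 consecutive successes, so the clean streak is unlikely to be
luck.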

I have not tried compiling my own kernel because I don't have root on this
machine yet. I work in an environment where they try to centralize machine
administration, so I need special permission to get root. There is also a
desire to stick with generic RH software components. Going through the
process above is part of what I've done to justify getting root, so I can
try installing a more recent kernel.

Do you agree that this is likely to be a kernel problem? Is upgrading
the kernel my best course of action?

Here is what I get when running the non-SMP kernel.
$ uname -a
Linux tux06.llnl.gov 2.4.9-21 #1 Thu Jan 17 14:16:30 EST 2002 i686 unknown
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 0
model name : Intel(R) Xeon(TM) CPU 1500MHz
stepping : 10
cpu MHz : 1495.463
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 2981.88
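
As a rough cross-check that a given kernel exposes the expected number of
CPUs, /proc/cpuinfo text like the above can be parsed and the "processor"
stanzas counted. This is only a sketch in Python: parse_cpuinfo is a
hypothetical helper, and SAMPLE is an abbreviated copy of the output
shown above rather than a live read of /proc/cpuinfo.

```python
SAMPLE = """processor : 0
vendor_id : GenuineIntel
cpu family : 15
model name : Intel(R) Xeon(TM) CPU 1500MHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
"""

def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per 'processor' stanza."""
    cpus = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # blank separator lines carry no key/value pair
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpus.append({})  # each 'processor' line starts a new CPU entry
        if cpus:
            cpus[-1][key] = value
    return cpus

cpus = parse_cpuinfo(SAMPLE)
print(len(cpus), "CPU(s) visible;", cpus[0]["model name"])
```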

Thanks in advance,

Tom

--
------------------------------------------------------------------------
Tom Epperly
Center for Applied Scientific Computing Phone: 925-424-3159
Lawrence Livermore National Laboratory Fax: 925-424-2477
L-661, P.O. Box 808, Livermore, CA 94551 Email: [email protected]
------------------------------------------------------------------------


2002-02-21 18:05:57

by Alan

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

> Do you agree that this is likely to be a kernel problem? Is upgrading
> the kernel my best course of action?

Almost every other report I have ever seen that looked like that one has always
turned out to be hardware related. The randomness in particular tends to be
a pointer to things like cache faults.

You do have ECC main memory which is good.

What other hardware is in the machine ?

2002-02-21 18:27:20

by Tom Epperly

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

On Thu, 21 Feb 2002, Alan Cox wrote:

> Almost every other report I have ever seen that looked like that one has
> always turned out to be hardware related. The randomness in particular
> tends to be a pointer to things like cache faults.

Is there any way to test this?

> You do have ECC main memory which is good.
>
> What other hardware is in the machine ?

1. SCSI harddisk plugged into the motherboard
2. Floppy drive plugged into the motherboard
3. CD-ROM and a never used Zip disk drive plugged into IDE2 on the
motherboard
4. Sound Blaster Live! sound card plugged into a PCI slot.
5. nVidia Corp NV15 GL (Quadro2) plugged into the AGP slot.

No USB ports are in use. The video output, mouse and keyboard are plugged
into a KVM switch. There is also a CAT-5 cable attached to the network
jack. Otherwise, there are no other external connections outside of the
power cable.

By running without the X11 server, I hoped to remove the nVidia board as a
source of trouble.

$ /sbin/lspci
00:00.0 Host bridge: Intel Corporation 82850 860 (Wombat) Chipset Host Bridge (MCH) (rev 04)
00:01.0 PCI bridge: Intel Corporation 82850 850 (Tehama) Chipset AGP Bridge (rev 04)
00:02.0 PCI bridge: Intel Corporation 82860 860 (Wombat) Chipset AGP Bridge (rev 04)
00:1e.0 PCI bridge: Intel Corporation 82801BAM PCI (rev 04)
00:1f.0 ISA bridge: Intel Corporation 82801BA ISA Bridge (ICH2) (rev 04)
00:1f.1 IDE interface: Intel Corporation 82801BA IDE U100 (rev 04)
00:1f.2 USB Controller: Intel Corporation 82801BA(M) USB (Hub A) (rev 04)
00:1f.3 SMBus: Intel Corporation 82801BA(M) SMBus (rev 04)
00:1f.4 USB Controller: Intel Corporation 82801BA(M) USB (Hub B) (rev 04)
00:1f.5 Multimedia audio controller: Intel Corporation 82801BA(M) AC'97 Audio (rev 04)
01:00.0 VGA compatible controller: nVidia Corporation NV15 GL (Quadro2) (rev a4)
02:1f.0 PCI bridge: Intel Corporation 82806AA PCI64 Hub PCI Bridge (rev 03)
03:00.0 PIC: Intel Corporation 82806AA PCI64 Hub Advanced Programmable Interrupt Controller (rev 01)
03:0e.0 SCSI storage controller: Adaptec 7892P (rev 02)
04:0b.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 78)
04:0c.0 FireWire (IEEE 1394): Texas Instruments: Unknown device 8020
04:0d.0 Multimedia audio controller: Creative Labs SB Live! EMU10000 (rev 08)
04:0d.1 Input device controller: Creative Labs SB Live! (rev 08)

Tom


2002-02-21 18:35:23

by Arjan van de Ven

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

In article <[email protected]> you wrote:
> 5. nVidia Corp NV15 GL (Quadro2) plugged into the AGP slot.

> By running without the X11 server, I hoped to remove the nVidia board as a
> source of trouble.

did you ever install the NVidia driver ?

2002-02-21 19:06:10

by Richard B. Johnson

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

On Thu, 21 Feb 2002, Tom Epperly wrote:

> On Thu, 21 Feb 2002, Alan Cox wrote:
>
> > Almost every other report I have ever seen that looked like that one has
> > always turned out to be hardware related. The randomness in particular
> > tends to be a pointer to things like cache faults.
>

> 5. nVidia Corp NV15 GL (Quadro2) plugged into the AGP slot.
^^^^^^^^^^^^^^^^^^^^^_________ Bingo!

Just for kicks, grab some other screen card.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (797.90 BogoMips).

111,111,111 * 111,111,111 = 12,345,678,987,654,321

2002-02-21 19:23:11

by Alan

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

> > tends to be a pointer to things like cache faults.
> Is there any way to test this?

Well, in theory the CPU should report it with a machine check (which we
would report). Sometimes it doesn't.

> By running without the X11 server, I hoped to remove the nVidia board as a
> source of trouble.

If you booted and never ran X or hand loaded the nvidia module you did that

> 00:00.0 Host bridge: Intel Corporation 82850 860 (Wombat) Chipset Host Bridge (MCH) (rev 04)

when you see the nicknames it makes you wonder if intel know that wombat is
"waste of money brains and time" in business speak 8)

Nothing else in the hardware really stands out. You can avoid the SB Live!
for testing, I guess, but I wouldn't have expected it to be a problem.

Possibly one hardware thing to try (depending on who and how the box is
maintained) is swapping the cpus over and seeing if it then works single
cpu..

2002-02-21 19:24:21

by Tom Epperly

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

On Thu, 21 Feb 2002 [email protected] wrote:

> In article <[email protected]> you wrote:
> > 5. nVidia Corp NV15 GL (Quadro2) plugged into the AGP slot.
>
> > By running without the X11 server, I hoped to remove the nVidia board as a
> > source of trouble.
>
> did you ever install the NVidia driver ?

The NVidia drivers (kernel module and X11) are installed, but I have
rebooted since disabling the X11 server. /sbin/lsmod does not list the
NVdriver in the running system. Will the kernel load NVdriver if the X11
server is never started after a reboot? /etc/modules.conf has this line

alias char-major-195 NVdriver

Tom


2002-02-21 19:28:21

by Alan

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

> NVdriver in the running system. Will the kernel load NVdriver if the X11
> server is never started after a reboot? /etc/modules.conf has this line

Shouldn't do

> alias char-major-195 NVdriver

Change to

alias char-major-195 off

and it should be fine

2002-02-21 20:23:17

by jjs

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

I really doubt the nvidia is the problem
unless it's actually bad hardware doing
strange things to the bus.

FWIW I've torture tested linux with voodoo,
nvidia and radeon cards, and interestingly
enough the only problems I've seen are
with the radeon.

However to satisfy all demands you could
change the modules.conf line to read

alias char-major-195 off

I predict that will change nothing...

Joe

Tom Epperly wrote:

>On Thu, 21 Feb 2002 [email protected] wrote:
>
>>In article <[email protected]> you wrote:
>>
>>>5. nVidia Corp NV15 GL (Quadro2) plugged into the AGP slot.
>>>
>>>By running without the X11 server, I hoped to remove the nVidia board as a
>>>source of trouble.
>>>
>>did you ever install the NVidia driver ?
>>
>
>The NVidia drivers (kernel module and X11) are installed, but I have
>rebooted since disabling the X11 server. /sbin/lsmod does not list the
>NVdriver in the running system. Will the kernel load NVdriver if the X11
>server is never started after a reboot? /etc/modules.conf has this line
>
>alias char-major-195 NVdriver
>
>Tom


2002-02-22 22:45:51

by Tom Epperly

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

On Thu, 21 Feb 2002, Alan Cox wrote:

> Possibly one hardware thing to try (depending on who and how the box is
> maintained) is swapping the cpus over and seeing if it then works single
> cpu..

I tried swapping the cpus (i.e. swapping chips), and it still runs fine
non-SMP.

Tom


2002-03-11 16:07:39

by Tom Epperly

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

Alan Cox wrote:

>>Do you agree that this is likely to be a kernel problem? Is upgrading
>>the kernel my best course of action?
>>
>
>Almost every other report I have ever seen that looked like that one has always
>turned out to be hardware related. The randomness in particular tends to be
>a pointer to things like cache faults.
>
>You do have ECC main memory which is good.
>
>What other hardware is in the machine ?
>
>
To recap from an earlier email, my nightly build & regression tests (a
roughly 2 hour process involving Sun's JDK, GNU make, gcc, g++, g77 &
Bourne shell scripts) have been failing intermittently on a dual-Xeon
system, usually with an "Illegal instruction" signal. I've tried removing
the sound card and disabling the X11 server to avoid loading the nVidia
kernel module. The intermittent failures disappear when I run non-SMP. I've
tried swapping the processors on the motherboard, and both processors
appear to work fine individually. Most of the boxes I've run on have >=
512MB ECC RAM. I've run Dell's hardware diagnostics (especially the
memory ones) twice. The diagnostics don't seem to have SMP tests where
both CPUs are being stressed.
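
Since the vendor diagnostics apparently do not load both CPUs at once,
one crude substitute is to run the same deterministic computation in
several processes at the same time and check that every copy finishes
cleanly with the same answer. The Python sketch below is an illustration
under assumptions, not a validated hardware test: the worker count, the
round count, and the function names are made up here, and whether the
jobs actually land on different CPUs is up to the scheduler.

```python
import subprocess
import sys

# Deterministic CPU-bound work: repeatedly hash a buffer, then print the final digest.
WORKER = (
    "import hashlib\n"
    "d = b'seed'\n"
    "for _ in range({rounds}):\n"
    "    d = hashlib.sha256(d).digest()\n"
    "print(d.hex())\n"
)

def run_parallel_checksums(workers=2, rounds=200_000):
    """Start `workers` identical jobs at once and collect (returncode, digest) from each."""
    code = WORKER.format(rounds=rounds)
    procs = [
        subprocess.Popen([sys.executable, "-c", code],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(workers)
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        results.append((proc.returncode, out.strip()))
    return results

def consistent(results):
    """True when every job exited with status 0 and all digests agree."""
    all_clean = all(rc == 0 for rc, _ in results)
    one_answer = len({digest for _, digest in results}) == 1
    return all_clean and one_answer

if __name__ == "__main__":
    results = run_parallel_checksums()
    print("consistent" if consistent(results) else f"MISMATCH: {results}")
```

A mismatch or a nonzero return code would not say whether hardware or
kernel is at fault, only that identical work did not produce identical
results when run concurrently.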

FYI when I upgraded to the 2.4.18-1smp kernel, the failure rate went
from 20% to 100%. I have tried running the nightly build & regression on
roughly 6 different dual-processor machines, Pentium III or better
(cycling it over and over), and they all have intermittent failures of
one kind or another. All these machines are made by Dell, but they
provide some evidence that it is not a hardware problem.

Tom

2002-03-11 16:53:21

by Alan

Subject: Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"

> FYI when I upgraded to the 2.4.18-1smp kernel, the failure rate went
> from 20% to 100%. I have tried running the nightly build & regression on
> roughly 6 different dual-processor machines, Pentium III or better
> (cycling it over and over), and they all have intermittent failures of
> one kind or another. All these machines are made by Dell, but they
> provide some evidence that it is not a hardware problem.

It does sound like either a bad compiler (eg gcc 3) which is unlikely but
possible or your workload triggers something important and unusual which is
buggy. Do you know which bit of the workload is the trigger - is it practical
to find out ?
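
One way to narrow down the trigger, sketched here in Python under
assumptions, is to run each stage of the workload separately many times
and tally how each run ended. The stage list is a placeholder (the real
configure, build, and test commands are not given in this thread), and
hunt and describe are hypothetical names. A negative return code from
subprocess means the child was killed by that signal number.

```python
import signal
import subprocess
import sys

def describe(returncode):
    """Turn a subprocess return code into a short label."""
    if returncode < 0:
        try:
            return "signal " + signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return "ok" if returncode == 0 else f"exit {returncode}"

def hunt(stages, repeats):
    """Run each named stage `repeats` times and count the outcomes per stage."""
    tally = {}
    for name, cmd in stages:
        counts = tally.setdefault(name, {})
        for _ in range(repeats):
            rc = subprocess.run(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            ).returncode
            outcome = describe(rc)
            counts[outcome] = counts.get(outcome, 0) + 1
    return tally

if __name__ == "__main__":
    # Placeholder stage; substitute the real configure/build/test commands.
    stages = [("noop", [sys.executable, "-c", "pass"])]
    print(hunt(stages, repeats=3))
```

A stage whose tally shows signals while the others stay clean would be
the first candidate to split further.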

Alan