2002-12-06 14:48:01

by Greg Boyce

Subject: Dazed and Confused

Folks,

I have an issue that I've been trying to track down for some time, and I
was hoping that someone might be able to provide me with a definitive
answer.

I work in a company with a large number of Linux machines deployed all
around the country, and in some of the machines we've been seeing the
following error:

Uhhuh. NMI received. Dazed and confused, but trying to continue
You probably have a hardware problem with your RAM chips

Now, from what I've read about this error, it is caused by the memory
detecting a parity error in the actual RAM chips, and reporting it to the
OS. However, some of the people within the company who handle the
replacement of hardware are convinced that it might be something simpler
in some of the cases. Perhaps the RAM chip isn't fully seated, or the
machine just needs a reboot.

Due to the number of machines and their locations, running memtest86 on
them isn't exactly feasible.

Is there anything besides failing hardware that could be the cause of this
error? Also, how serious is this error? Some of the machines reporting
this error have had problems with programs crashing, while others seem to
run fine.

Any input or resources you could point me at would be appreciated.

Greg Boyce


2002-12-06 15:14:47

by Richard B. Johnson

Subject: Re: Dazed and Confused

On Fri, 6 Dec 2002, Greg Boyce wrote:

> Folks,
>
> I have an issue that I've been trying to track down for some time, and I
> was hoping that someone might be able to provide me with a definitive
> awnser.
>
> I work in a company with a large number of Linux machine deployed all
> around the country, and in some of the machines we've been seeing the
> following error:
>
> Uhhuh. NMI received. Dazed and confused, but trying to continue
> You probably have a hardware problem with your RAM chips
>

Hardware (read HARDWARE) generates an NMI when something BAD happens.
Linux didn't do it and Linux can't do anything about it. It just
reports that something bad happened (like a RAM parity error).

FYI Linux never "just needs to be re-booted". That response from
the computer maintenance department was implanted by the Redmond
group so they wouldn't have to fix their defective operating system(s).

Bad RAM, improperly socketed RAM, bad fans, bad power supplies,
bad heat-sinks, bad feature-cards, all kinds of bad components
can cause the NMI. Anything that will interfere with reading what
was written to RAM, including bad address-timing caused by the
box getting way too hot or the power supplies having noise or
sagging voltages, will cause this error.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2002-12-06 15:22:28

by Alan

Subject: Re: Dazed and Confused

On Fri, 2002-12-06 at 14:55, Greg Boyce wrote:
> I work in a company with a large number of Linux machine deployed all
> around the country, and in some of the machines we've been seeing the
> following error:
>
> Uhhuh. NMI received. Dazed and confused, but trying to continue
> You probably have a hardware problem with your RAM chips

There are several causes of an NMI, depending on the system. Hardware
failure is one; some systems raise it for things like PCI errors; and on
a few boxes (notably old 486s) you see them on power management events.
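
To make the "several causes" point concrete, here is a minimal Python
sketch of how an NMI is usually classified on a PC. It assumes the legacy
PC/AT convention in which the status byte read from I/O port 0x61 carries
a parity-error bit and an I/O-check bit; chipsets differ, and the helper
name classify_nmi is hypothetical, not kernel code.

```python
# Sketch only: decode a PC/AT "system control port B" (port 0x61) status
# byte. Bit meanings follow the legacy PC/AT convention and are an
# assumption here; real chipsets may assign them differently.

PARITY_ERROR = 0x80  # bit 7: memory parity / system error reported
IO_CHECK = 0x40      # bit 6: I/O channel check, e.g. raised by a card


def classify_nmi(reason):
    """Return a list of likely causes for an NMI reason byte."""
    causes = []
    if reason & PARITY_ERROR:
        causes.append("memory parity error")
    if reason & IO_CHECK:
        causes.append("I/O check error")
    if not causes:
        # Neither bit set: the board raised the NMI for its own reasons.
        causes.append("unknown source (e.g. power management)")
    return causes
```

With these assumptions, classify_nmi(0x80) returns
["memory parity error"], which corresponds to the "RAM chips" message,
while classify_nmi(0x00) falls through to the unknown case.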

> Due to the number of machines and their locations, running memtest86 on
> them isn't exactly feasible.

Then buy better ram ;)

> Is there anything besides failing hardware that could be the cause of this
> error? Also, how serious is this error? Some of the machines reporting
> this error have had problems with programs crashing, while others seem to
> run fine.

Take a sample set of machines which have been crashing and run memtest86
on a couple. That should tell you if it is RAM. From a sample you can
then figure out how to handle the rest (things that come to mind if
memtest86 fails on the test machines include replacing the RAM in a few
more, then taking the old RAM back to test).
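
One way to choose that sample is to count NMI reports per host in
whatever logs are already being collected. The sketch below is a rough
illustration only: the syslog line layout, the host names in the usage
note and the helper names are assumptions, so adjust the pattern to the
real log format.

```python
import re
from collections import Counter

# Assumed syslog layout: "Dec  6 14:48:01 hostname kernel: message".
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)
NMI_MARKER = "NMI received"


def count_nmi_by_host(lines):
    """Count NMI reports per host from an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and NMI_MARKER in match.group("msg"):
            counts[match.group("host")] += 1
    return counts


def pick_sample(counts, size=2):
    """Pick the hosts with the most NMI reports as memtest86 candidates."""
    return [host for host, _ in counts.most_common(size)]
```

Feeding it the collected log lines and calling pick_sample(counts, 2)
gives the two noisiest hosts, which are reasonable first candidates for
a memtest86 run.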


2002-12-06 16:40:00

by Greg Boyce

Subject: Re: Dazed and Confused

On 6 Dec 2002, Alan Cox wrote:

> On Fri, 2002-12-06 at 14:55, Greg Boyce wrote:
> > I work in a company with a large number of Linux machine deployed all
> > around the country, and in some of the machines we've been seeing the
> > following error:
> >
> > Uhhuh. NMI received. Dazed and confused, but trying to continue
> > You probably have a hardware problem with your RAM chips
>
> There are several causes of an NMI depending on the system - hardware
> failures is one, some systems do it for things like PCI errors, a few
> boxes you see them on power management events (notably old 486's)
>
> > Due to the number of machines and their locations, running memtest86 on
> > them isn't exactly feasible.
>
> Then buy better ram ;)

We have a large number of machines, but only a very small number of
machine types. The OS images installed are identical, and the BIOSes
should be identical within each machine type.

Since the number of machines reporting this error is pretty small, I
think it's unlikely to be power management, or anything like that.

> > Is there anything besides failing hardware that could be the cause of this
> > error? Also, how serious is this error? Some of the machines reporting
> > this error have had problems with programs crashing, while others seem to
> > run fine.
>
> Take a sample set of machines which have been crashing and run memtest86
> on a couple. That should tell you if it is RAM. From a sample you can
> then figure out how to handle the rest (things that come to mind if
> memtest86 fails on the test machines include replacing the ram in a few
> more then taking the old ram back to test)

I'll mention it to the people who handle the replacement of hardware, but
from the sounds of this and Dick's e-mail, it's most likely hardware of
some sort or possibly overheating. They can decide if they want to try to
figure out which component is causing the problem, or if they'd prefer to
just replace the faulty machines completely and worry about tracking the
component later. We have plenty of spares in the warehouse.

Thanks for the help,

Greg

2002-12-06 17:00:36

by Greg Boyce

Subject: Re: Dazed and Confused

On Fri, 6 Dec 2002, Greg Boyce wrote:

> On 6 Dec 2002, Alan Cox wrote:
> > Take a sample set of machines which have been crashing and run memtest86
> > on a couple. That should tell you if it is RAM. From a sample you can
> > then figure out how to handle the rest (things that come to mind if
> > memtest86 fails on the test machines include replacing the ram in a few
> > more then taking the old ram back to test)
>
> I'll mention it to the people who handle the replacement of hardware, but
> from the sounds of this and Dick's e-mail, it's most likely hardware of
> some sort or possibly overheating. They can decide if they want to try to
> figure out which component is causing the problem, or if they'd prefer to
> just replace the faulty machines completely and worry about tracking the
> component later. We have plenty of spares in the warehouse.

Actually, this does leave one question still: How serious is the problem?
How much would you trust a machine reporting these errors? Most of the
machines are just performing DNS and web service (although with a pretty
high load). The processes on the machine are CPU and memory
intensive, but there is no critical data stored on most of the machines.

Are the machines likely to give us problems with crashing and data
corruption, or would it be safe to ignore the problem unless we started
noticing odd behavior?

Greg

2002-12-06 21:42:53

by Alan

Subject: Re: Dazed and Confused

On Fri, 2002-12-06 at 17:08, Greg Boyce wrote:
> Are the machines likely to give us problems with crashing and data
> corruption, or would it be safe to ignore the problem unless we started
> noticing odd behavior?

You've already noticed odd behaviour - crashing, NMI's ...

2002-12-07 13:43:05

by Krzysztof Halasa

Subject: Re: Dazed and Confused

Greg Boyce <[email protected]> writes:

> Are the machines likely to give us problems with crashing and data
> corruption, or would it be safe to ignore the problem unless we started
> noticing odd behavior?

First of all, only RAM with parity bits (or ECC) can generate such NMI
(the motherboard must support this as well, of course).
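
As a small illustration of why plain parity can only report an error and
not repair it, here is a minimal Python sketch of an even-parity check
over one byte. The byte values are arbitrary examples and the helper
names are hypothetical.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the count of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte, stored_parity):
    """True if the byte still agrees with the parity bit stored beside it."""
    return parity_bit(byte) == stored_parity


data = 0b10110010
stored = parity_bit(data)        # computed when the byte is written

one_flip = data ^ 0b00000100     # one bit flips in RAM
two_flips = data ^ 0b00000110    # two bits flip

single_detected = not check(one_flip, stored)   # mismatch, error is noticed
double_detected = not check(two_flips, stored)  # parity matches again, missed
```

The single flip is noticed but nothing says which bit changed, so the
hardware can only raise an alarm; the double flip slips through
unnoticed.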

Most motherboards can be configured in ECC mode, in which they correct
1-bit errors. 2-bit errors are reported and not corrected, but the
probability of such an error is nearly zero in normal conditions (unless
your hardware is defective, of course).
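
The "correct one bit, report two" behaviour can be shown on a toy scale.
The following Python sketch uses an extended Hamming code over 4 data
bits; it is an illustration under that assumption, not the code any
particular memory controller implements (real ECC DIMMs protect 64-bit
words), and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Returns a list of bits: index 0 is the overall parity bit, indexes
    1..7 are the classic Hamming(7,4) positions.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the set bit positions
    overall_odd = sum(code) % 2 == 1
    if overall_odd:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even total parity but a non-zero syndrome: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any single bit of encode(0b1011) still decodes to
("corrected", 0b1011), while flipping any two bits yields
("uncorrectable", None), which is the case where the hardware has
nothing left to do but report the error.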

CPU caches do ECC as well, and possibly can generate NMI requests. However,
they use static RAM (as opposed to dynamic) and bit errors should not
happen there.
--
Krzysztof Halasa
Network Administrator

2002-12-08 00:52:34

by Alan

Subject: Re: Dazed and Confused

On Fri, 2002-12-06 at 23:33, Krzysztof Halasa wrote:
> CPU caches do ECC as well, and possibly can generate NMI requests. However,
> they use static RAM (as opposed to dynamic) and bit errors should not
> happen there.

It appears that on Pentium Pro and later, CPU cache errors raise a
machine check (MCA/MCE) rather than an NMI from the core logic.
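
To see whether a given box advertises machine-check support at all, one
can look for the "mce" and "mca" feature flags that x86 Linux lists in
/proc/cpuinfo. A minimal Python sketch, assuming the usual "flags : ..."
line layout; the helper name is hypothetical.

```python
def machine_check_flags(cpuinfo_text):
    """Report whether the first CPU advertises the 'mce' and 'mca' flags."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags = set(value.split())
            return {name: name in flags for name in ("mce", "mca")}
    # No flags line found (non-x86 or unexpected format).
    return {"mce": False, "mca": False}
```

On a running machine this would be fed the contents of /proc/cpuinfo,
for example machine_check_flags(open("/proc/cpuinfo").read()).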

2002-12-08 11:53:24

by Gilad Ben-Yossef

Subject: Re: Dazed and Confused

On Fri, 2002-12-06 at 16:55, Greg Boyce wrote:

> I have an issue that I've been trying to track down for some time, and I
> was hoping that someone might be able to provide me with a definitive
> awnser.
>
> I work in a company with a large number of Linux machine deployed all
> around the country, and in some of the machines we've been seeing the
> following error:
>
> Uhhuh. NMI received. Dazed and confused, but trying to continue
> You probably have a hardware problem with your RAM chips

I have had the exact same error happen a while back on a 2.2.x kernel.
It did not seem to hurt anything but it made the QA dept. go bonkers so
I've spent some time chasing it down and found out what caused it back
then - perhaps the same, or similar, applies to your setup as well:

The machines in question were Intel ISP1100 1U servers, and for various
unimportant reasons I had built the kernel they were running without
APM support. Now, these machines have three small unmarked buttons on
their front: one is the power button, one is the reset button and one
is a suspend button.

What I found out was that whenever anyone pressed the "suspend" button
(usually because they meant to press the power or reset buttons and
missed) the error in question was logged. It seems that APM suspend is
implemented (at least on those machines) as an NMI, and if you compiled
the kernel sans APM support the NMI handling code simply did not grok
that specific NMI and thus reported said error, which was otherwise
harmless.

Hope this helps,
Gilad.

--
Gilad Ben-Yossef <[email protected]>
http://benyossef.com
"Denial really is a river in Egypt."

2002-12-08 15:15:28

by Greg Boyce

Subject: Re: Dazed and Confused

On Sun, 2002-12-08 at 06:59, Gilad Ben-Yossef wrote:
> On Fri, 2002-12-06 at 16:55, Greg Boyce wrote:
>
> > I have an issue that I've been trying to track down for some time, and I
> > was hoping that someone might be able to provide me with a definitive
> > awnser.
> >
> > I work in a company with a large number of Linux machine deployed all
> > around the country, and in some of the machines we've been seeing the
> > following error:
> >
> > Uhhuh. NMI received. Dazed and confused, but trying to continue
> > You probably have a hardware problem with your RAM chips
>
> I have had the exact same error happen a while back on a 2.2.x kernel.
> It did not seem to hurt anything but it made the QA dept. go bonkers so
> I've spent some time chasing it down and found out what caused it back
> then - perhaps the same, or similar, applies to your setup as well:
>
> The machines in question were Intel ISP1100 1U servers and for various
> non important reasons I have built the kernel which they were running
> without APM support. Now these machines have 3 small non marked buttons
> on their front - one is the power button, one is the reset button and
> one was a suspend button.
>
> What I found out was that whenever anyone pressed the "suspend" button
> (usually because they meant to press the power or reset buttons and
> missed) the error in questions was logged. It seems that APM suspend is
> implemented (at least on those machines) as an NMI, and if you compiled
> the kernel sans APM support the NMI handling code simply did not grok
> that specific NMI and thus reported said error, which was otherwise
> harmless.

Most of these machines do have Intel motherboards. I don't recall
seeing suspend buttons, but I'll take a look. Thanks!

--
Greg

2002-12-08 22:53:33

by Toon van der Pas

Subject: Re: Dazed and Confused

On Sun, Dec 08, 2002 at 10:22:38AM -0500, Gregory Boyce wrote:
> On Sun, 2002-12-08 at 06:59, Gilad Ben-Yossef wrote:
> > On Fri, 2002-12-06 at 16:55, Greg Boyce wrote:
> >
> > > I have an issue that I've been trying to track down for some time, and I
> > > was hoping that someone might be able to provide me with a definitive
> > > awnser.
> > >
> > > I work in a company with a large number of Linux machine deployed all
> > > around the country, and in some of the machines we've been seeing the
> > > following error:
> > >
> > > Uhhuh. NMI received. Dazed and confused, but trying to continue
> > > You probably have a hardware problem with your RAM chips
> >
> > I have had the exact same error happen a while back on a 2.2.x kernel.
> > It did not seem to hurt anything but it made the QA dept. go bonkers so
> > I've spent some time chasing it down and found out what caused it back
> > then - perhaps the same, or similar, applies to your setup as well:
> >
> > The machines in question were Intel ISP1100 1U servers and for various
> > non important reasons I have built the kernel which they were running
> > without APM support. Now these machines have 3 small non marked buttons
> > on their front - one is the power button, one is the reset button and
> > one was a suspend button.
> >
> > What I found out was that whenever anyone pressed the "suspend" button
> > (usually because they meant to press the power or reset buttons and
> > missed) the error in questions was logged. It seems that APM suspend is
> > implemented (at least on those machines) as an NMI, and if you compiled
> > the kernel sans APM support the NMI handling code simply did not grok
> > that specific NMI and thus reported said error, which was otherwise
> > harmless.
>
> Most of these machines do have Intel motherboards. I don't recall
> seeing suspend buttons, but I'll take a look. Thanks!

Just another datapoint:

I administer a Digital Prioris server (Pentium II, 512MB memory),
which logs one (yes, exactly one) "Uhhuh. NMI received. Etc.." message
at boot time. After booting it _never_ logs this message again.
The machine is rock stable; it currently has an uptime of 19 months
and is humming along nicely. It runs a pristine 2.2.19 kernel, with
the following patches applied:

raw-2.2.18.FULL.diff
linux-2.2.19-reiserfs-3.5.32-patch
lvm_0.9.1_beta7

It looks like something is producing exactly one NMI during the boot
process. It doesn't seem to be a hardware problem.
Could it be something SMP-related? (the machine runs with one CPU, but
the motherboard can accommodate a second CPU and the BIOS supports that)
Or some power management thing, like was suggested previously in this thread?

Regards,
Toon.
--
 /"\                         | "I never much liked Macs.
 \ /  ASCII RIBBON CAMPAIGN  |  All the interesting stuff is hidden away."
  X   AGAINST HTML MAIL      |    -- Linus Torvalds (at the Geek Cruise)
 / \

2002-12-09 19:11:32

by H. Peter Anvin

Subject: Re: Dazed and Confused

Followup to: <Pine.LNX.4.42.0212061133330.7770-100000@egg>
By author: Greg Boyce <[email protected]>
In newsgroup: linux.dev.kernel
>
> > Then buy better ram ;)
>
> We have a large number of a very small number of machine types. The
> OS images installed are identical, and the bioses should be identical
> between each individual machine types.
>
> Since the number of machines reporting this error are pretty small, I
> think it's unlikely to be power management, or anything like that.
>

What you have, then, is a set of defective machines.

>
> I'll mention it to the people who handle the replacement of hardware, but
> from the sounds of this and Dick's e-mail, it's most likely hardware of
> some sort or possibly overheating. They can decide if they want to try to
> figure out which component is causing the problem, or if they'd prefer to
> just replace the faulty machines completely and worry about tracking the
> component later. We have plenty of spares in the warehouse.
>

Indeed. This is the way to do it.

FWIW, a lot of PC vendors do extremely limited testing on each
machine and effectively use warranty service as a burn-in test.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-12-09 19:12:25

by H. Peter Anvin

Subject: Re: Dazed and Confused

Followup to: <Pine.LNX.4.42.0212061202230.7770-100000@egg>
By author: Greg Boyce <[email protected]>
In newsgroup: linux.dev.kernel
>
> Actually, this does leave one question still: How serious is the problem?
> How much would you trust a machine reporting these errors? Most of the
> machines are just performing DNS and web service (although with a pretty
> high load). The processes on the machine are are cpu and memory
> intensive, but there is no critical data stored on most of the machines.
>
> Are the machines likely to give us problems with crashing and data
> corruption, or would it be safe to ignore the problem unless we started
> noticing odd behavior?
>

The fact that you're seeing the error means data corruption has
already occurred.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-12-10 00:36:21

by Bill Davidsen

Subject: Re: Dazed and Confused

On 9 Dec 2002, H. Peter Anvin wrote:

> The fact that you're seeing the error means data corruption has
> already occurred.

Maybe. The fact that the error is noticed indicates that the memory has at
least parity capability. However, current memory that is "by 72" wide
instead of "by 64" can also do EDAC, in which case all one-bit errors will
be corrected. Some motherboards can be configured to NMI on parity, even
if corrected. I had one, back when PPro was hot stuff.
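
The jump from "by 64" to "by 72" matches the usual arithmetic for a
single-error-correcting, double-error-detecting code: 64 data bits need
8 check bits. A minimal Python sketch of that calculation, assuming a
Hamming-style SECDED code (the function name is hypothetical):

```python
def secded_check_bits(data_bits):
    """Check bits needed to correct one bit error and detect two."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1, so that
    # every bit position (and "no error") gets a distinct syndrome.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall-parity bit adds double-error detection.
    return r + 1


total_width = 64 + secded_check_bits(64)
```

Here secded_check_bits(64) comes out at 8, giving a total width of 72
bits per word.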

So it's just possible that the data is fine in spite of the NMI. That
said, I'd start looking for a hardware problem regardless: even if it's
currently working, it's not working *right*.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.