2007-09-26 02:25:25

by AndrewL733

Subject: NMI error and Intel S5000PSL Motherboards

We have about 100 servers based on Intel S5000PSL-SATA motherboards.
They have been running for anywhere between 1 and 10 months. For the
past few months, after updating them all to the 2.6.20.15 kernel
(because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
errors. For example:

Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
Aug 29 09:02:10 master kernel: Do you have a strange power saving mode
enabled?
Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue

Sometimes these errors cause a total system freeze. Most of the time the
systems keep running.

We have determined these errors come most frequently on machines that
have an Intel PCI-e Quad Port Gigabit Adapter. On machines that HAVE
these cards (it doesn't matter what slot they are in), the NMI errors
can occur as frequently as every 3-5 minutes. On machines that do NOT
have these Quad Port Adapters, the NMI errors occur about once per month
on average. (we have tried the "in-kernel" e1000 drivers, as well as
Intel's latest - 7.6.5).
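
One way to put numbers on how often each machine is hit is to count the "unknown reason" lines per host in a collected syslog file. This is only a sketch: the function name count_nmi_by_host is made up for illustration, and it assumes the classic syslog layout shown above, where the hostname is the fourth whitespace-separated field.

```shell
# Count "NMI received for unknown reason" lines per host in a syslog file.
# Assumes lines look like: "Aug 29 09:02:10 master kernel: Uhhuh. NMI ..."
# (hostname in field 4). count_nmi_by_host is an illustrative name.
count_nmi_by_host() {
    awk '/NMI received for unknown reason/ { n[$4]++ }
         END { for (h in n) print h, n[h] }' "$1" | sort
}
```

Called as "count_nmi_by_host /var/log/messages" (or whatever file the kernel messages are logged to), it prints one "host count" line per machine.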

We have also determined (through a chance discovery) that running
“scanpci” can 100 percent reliably reproduce the NMI error on any
machine that has the Quad Port NICS. Our various motherboards have
different Intel BIOS versions – some have Rev 70, others 74, 79 or 81.
They all exhibit the same behavior regardless of BIOS version.

We have reproduced this problem with:

Mandriva 2008 RC2 (2.6.22 kernel)
Mandriva 2007 with custom 2.6.20.15 kernel
Mandriva 2007 with custom 2.6.19.8 kernel
Ubuntu “Feisty” with 2.6.20 kernel
Fedora Core 7 with 2.6.22 kernel

The problem does NOT occur with any distribution running a 2.6.18 kernel
or lower. I.E., CentOS or SUSE 10 and also Mandriva 2007 with included
2.6.17 kernel or custom-compiled 2.6.18 kernel.

We have been in contact with Intel. Their high level tech support people
have basically said,

“the errors we have logged so far are pointing to a kernel issue and
not a hardware problem. If we [Intel] can confirm this, it will be
up to the kernel developer or OS system manufacturer to debug those
ones, as we do not perform Operating system support.”

In other words, Intel seems to be blaming the problem we are seeing on
something introduced starting with the 2.6.19 kernel. We are not looking
to blame anybody. We are only looking for a solution.

Does anybody have an idea what could be going on here, as well as what
the solution may be? Going back to 2.6.18 or lower is not an option.



2007-09-26 02:59:59

by Randy Dunlap

Subject: Re: NMI error and Intel S5000PSL Motherboards

On Wed, 26 Sep 2007 02:12:34 -0800 AndrewL733 wrote:

> We have about 100 servers based on Intel S5000PSL-SATA motherboards.

product info (for others):
http://support.intel.com/support/motherboards/server/s5000psl/index.htm

> They have been running for anywhere between 1 and 10 months. For the
> past few months, after updating them all to the 2.6.20.15 kernel
> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
> errors. For example:
>
> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode
> enabled?
> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
>
> Sometimes these errors cause a total system freeze. Most of the time the
> systems keep running.
>
> We have determined these errors come most frequently on machines that
> have an Intel PCI-e Quad Port Gigabit Adapter. On machines that HAVE
> these cards (it doesn't matter what slot they are in), the NMI errors
> can occur as frequently as every 3-5 minutes. On machines that do NOT
> have these Quad Port Adapters, the NMI errors occur about once per month
> on average. (we have tried the "in-kernel" e1000 drivers, as well as
> Intel's latest - 7.6.5).
>
> We have also determined (through a chance discovery) that running
> “scanpci” can 100 percent reliably reproduce the NMI error on any
> machine that has the Quad Port NICS. Our various motherboards have
> different Intel BIOS versions – some have Rev 70, others 74, 79 or 81.
> They all exhibit the same behavior regardless of BIOS version.
>
> We have reproduced this problem with:
>
> Mandriva 2008 RC2 (2.6.22 kernel)
> Mandriva 2007 with custom 2.6.20.15 kernel
> Mandriva 2007 with custom 2.6.19.8 kernel
> Ubuntu “Feisty” with 2.6.20 kernel
> Fedora Core 7 with 2.6.22 kernel
>
> The problem does NOT occur with any distribution running a 2.6.18 kernel
> or lower. I.E., CentOS or SUSE 10 and also Mandriva 2007 with included
> 2.6.17 kernel or custom-compiled 2.6.18 kernel.
>
> We have been in contact with Intel. Their high level tech support people
> have basically said,
>
> “the errors we have logged so far are pointing to a kernel issue and
> not a hardware problem. If we [Intel] can confirm this, it will be
> up to the kernel developer or OS system manufacturer to debug those
> ones, as we do not perform Operating system support.”
>
> In other words, Intel seems to be blaming the problem we are seeing on
> something introduced starting with the 2.6.19 kernel. We are not looking
> to blame anybody. We are only looking for a solution.
>
> Does anybody have an idea what could be going on here, as well as what
> the solution may be? Going back to 2.6.18 or lower is not an option.


Please provide some basic info, like:

- how much RAM
- what CPUs (be precise: use 'cat /proc/cpuinfo')
- output of 'lspci -v'
- what kind(s) of SATA drives
- are you using 32-bit or 64-bit kernel(s)

Can you test kernels from kernel.org (i.e., not vendor kernels,
no other [unknown] patches applied to them)?

Does tracing 'scanpci' produce any helpful information?
# strace -o scanpci.trace scanpci
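
The items requested above can be gathered into a single file for posting. The following is a rough sketch, not a standard tool: collect_info and the default file name nmi-report.txt are made up for illustration, and drive models are not covered. The "uname -a" line shows whether the kernel is 32-bit or 64-bit.

```shell
# Gather the details requested above into one report file.
# collect_info is an illustrative name; each source is skipped with a
# note if it is missing on the machine.
collect_info() {
    out=${1:-nmi-report.txt}
    {
        echo "== kernel and architecture (uname -a) =="
        uname -a
        echo "== RAM (/proc/meminfo) =="
        if [ -r /proc/meminfo ]; then grep MemTotal /proc/meminfo; else echo "not available"; fi
        echo "== CPUs (/proc/cpuinfo) =="
        if [ -r /proc/cpuinfo ]; then cat /proc/cpuinfo; else echo "not available"; fi
        echo "== PCI devices (lspci -v) =="
        if command -v lspci >/dev/null 2>&1; then lspci -v; else echo "lspci not found"; fi
    } > "$out" 2>&1
}
```

Run it as root if you want lspci to show the full capability lists, e.g. "collect_info /tmp/s5000psl-report.txt".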


---
~Randy
Phaedrus says that Quality is about caring.

2007-09-26 04:58:29

by Randy Dunlap

Subject: Re: NMI error and Intel S5000PSL Motherboards

On Wed, 26 Sep 2007 02:12:34 -0800 AndrewL733 wrote:

> We have about 100 servers based on Intel S5000PSL-SATA motherboards.
> They have been running for anywhere between 1 and 10 months. For the
> past few months, after updating them all to the 2.6.20.15 kernel
> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
> errors. For example:
>
> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode
> enabled?
> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
>
> Sometimes these errors cause a total system freeze. Most of the time the
> systems keep running.
>
> We have determined these errors come most frequently on machines that
> have an Intel PCI-e Quad Port Gigabit Adapter. On machines that HAVE
> these cards (it doesn't matter what slot they are in), the NMI errors
> can occur as frequently as every 3-5 minutes. On machines that do NOT
> have these Quad Port Adapters, the NMI errors occur about once per month
> on average. (we have tried the "in-kernel" e1000 drivers, as well as
> Intel's latest - 7.6.5).
>
> We have also determined (through a chance discovery) that running
> “scanpci” can 100 percent reliably reproduce the NMI error on any
> machine that has the Quad Port NICS. Our various motherboards have
> different Intel BIOS versions – some have Rev 70, others 74, 79 or 81.
> They all exhibit the same behavior regardless of BIOS version.
>
> We have reproduced this problem with:
>
> Mandriva 2008 RC2 (2.6.22 kernel)
> Mandriva 2007 with custom 2.6.20.15 kernel
> Mandriva 2007 with custom 2.6.19.8 kernel
> Ubuntu “Feisty” with 2.6.20 kernel
> Fedora Core 7 with 2.6.22 kernel
>
> The problem does NOT occur with any distribution running a 2.6.18 kernel
> or lower. I.E., CentOS or SUSE 10 and also Mandriva 2007 with included
> 2.6.17 kernel or custom-compiled 2.6.18 kernel.
>
> We have been in contact with Intel. Their high level tech support people
> have basically said,
>
> “the errors we have logged so far are pointing to a kernel issue and
> not a hardware problem. If we [Intel] can confirm this, it will be
> up to the kernel developer or OS system manufacturer to debug those
> ones, as we do not perform Operating system support.”
>
> In other words, Intel seems to be blaming the problem we are seeing on
> something introduced starting with the 2.6.19 kernel. We are not looking
> to blame anybody. We are only looking for a solution.
>
> Does anybody have an idea what could be going on here, as well as what
> the solution may be? Going back to 2.6.18 or lower is not an option.

Answer #2: if a kernel change was responsible for this problem,
the direct way to find that change is to clone the kernel 'git' tree
and then use git bisect to find the culprit. If you are certain
that 2.6.18 is good and 2.6.19 is bad, then use those git tree tags
instead of the ones that are used in the example at:
http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html

git wiki is here: http://git.or.cz/
and git docs are here: http://www.kernel.org/pub/software/scm/git/docs/
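
As a rough sketch of what that looks like with the kernel tags v2.6.18 (good) and v2.6.19 (bad): the helper name start_bisect is made up for illustration, and the first argument is assumed to be the directory holding your clone of the kernel git tree.

```shell
# Start a bisection between a known-good and a known-bad kernel tag.
# start_bisect is an illustrative helper; $1 is the path to a clone of
# the kernel git tree, $2/$3 default to the tags discussed above.
start_bisect() {
    tree=$1
    good=${2:-v2.6.18}
    bad=${3:-v2.6.19}
    (
        cd "$tree" || exit 1
        git bisect start
        git bisect bad "$bad"
        git bisect good "$good"
    )
}

# After each step: build and boot the kernel that git checked out, run
# scanpci, then record the result from inside the tree with one of
#   git bisect good     (no NMI message)
#   git bisect bad      (NMI message appeared)
# and finish with "git bisect reset" once git names the first bad commit.
```

Each good/bad answer roughly halves the remaining range of commits, so even a full release cycle is usually narrowed down in a dozen or so test boots.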

If you want to use this tool, say so and I think that we (the royal
"we") will try to work you thru it.

---
~Randy
Phaedrus says that Quality is about caring.

2007-09-26 11:10:54

by Alan

Subject: Re: NMI error and Intel S5000PSL Motherboards

> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode
> enabled?

What would be useful is to know under what situations that board can
raise NMI 30.
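
For reference, the two digits in that message are hexadecimal and are, as far as I know, the value the x86 NMI handler of that era read from system control port 0x61. The handler only recognises bit 7 (memory parity error / SERR) and bit 6 (I/O channel check). The sketch below applies that assumed decoding; decode_nmi_reason is an illustrative name, not a kernel interface.

```shell
# Decode the hex "reason" byte from an "NMI received for unknown reason"
# message, assuming it is the value read from port 0x61 and that only
# bits 7 and 6 identify an NMI source. decode_nmi_reason is illustrative.
decode_nmi_reason() {
    reason=$(( 0x$1 ))
    if [ $(( reason & 0x80 )) -ne 0 ]; then
        echo "bit 7 set: memory parity error / SERR"
    fi
    if [ $(( reason & 0x40 )) -ne 0 ]; then
        echo "bit 6 set: I/O channel check (IOCHK)"
    fi
    if [ $(( reason & 0xc0 )) -eq 0 ]; then
        echo "unknown: neither bit 7 nor bit 6 is set"
    fi
}
```

Under that reading, "decode_nmi_reason 30" reports the unknown case: 0x30 has neither source bit set, so the message says the NMI did not come from either legacy source, not where it did come from.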

> In other words, Intel seems to be blaming the problem we are seeing on
> something introduced starting with the 2.6.19 kernel. We are not looking
> to blame anybody. We are only looking for a solution.

The first thing to find out is in which kernel the behaviour was
introduced. It might also be worth disabling MSI in case Intel screwed
the board up somewhat.

> Does anybody have an idea what could be going on here, as well as what
> the solution may be? Going back to 2.6.18 or lower is not an option.

See if 2.6.20.* with the 2.6.18 driver compiles and how that behaves.
Also see if pci=nomsi and/or pci=nommconf make a difference.
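
Both options go on the kernel command line in the boot loader entry, one test boot each. After rebooting it is worth confirming that the option actually reached the kernel before drawing conclusions from a scanpci run. A small sketch, with pci_boot_options as an illustrative name:

```shell
# Print any pci= options present on a kernel command line.
# pci_boot_options is an illustrative name; with no argument it reads
# the running kernel's command line from /proc/cmdline.
pci_boot_options() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep '^pci=' \
        || echo "no pci= options set"
}
```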

Alan

2007-09-27 00:03:43

by Randy Dunlap

Subject: Re: NMI error and Intel S5000PSL Motherboards

On Wed, 26 Sep 2007 19:48:14 -0400 Jim Paris wrote:

> Hello,
>
> > We have about 100 servers based on Intel S5000PSL-SATA motherboards.
> > They have been running for anywhere between 1 and 10 months. For the
> > past few months, after updating them all to the 2.6.20.15 kernel
> > (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
> > errors. For example:
> >
> > Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> > Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?
> > Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
>
> I'm also working with Andrew and Samson. It seems that the cause of
> the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
> defaults to y.
>
> With CONFIG_PCIEAER=n, scanpci works fine with no errors. This is the
> workaround that they'll likely use for now.

Glad that you found it.
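
For anyone checking their own machines, a quick way to see how a given kernel was built is to look for the option in its config file. This is a sketch only: pcieaer_state is an illustrative name, and where the config file lives (the .config in the build tree, or something like /boot/config-<version> on many distributions) is an assumption about your setup.

```shell
# Report how CONFIG_PCIEAER is set in a kernel config file.
# pcieaer_state is an illustrative name; $1 is the path to the config.
pcieaer_state() {
    cfg=$1
    if grep -q '^CONFIG_PCIEAER=y' "$cfg"; then
        echo "enabled"
    elif grep -q '^# CONFIG_PCIEAER is not set' "$cfg"; then
        echo "disabled"
    else
        echo "not present"
    fi
}
```

"not present" is what you would expect from a 2.6.18 or older config, where the option did not exist yet.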

> With CONFIG_PCIEAER=y, scanpci always triggers the NMI error. The
> option aerdriver.forceload=1 has no effect.

The 'forceload' option only forces the driver to load even when the
ACPI hardware initialization routine fails.

It would be nice to be able to disable PCIEAER at boot time though.
Shouldn't be difficult.


> The related dmesg output at boot is:
>
> Evaluate _OSC Set fails. Status = 0x0005
> Evaluate _OSC Set fails. Status = 0x0005
> aer_init: AER service init fails - Run ACPI _OSC fails
> aer: probe of 0000:00:02.0:pcie01 failed with error 2
> aer_init: AER service init fails - No ACPI _OSC support
> aer: probe of 0000:00:03.0:pcie01 failed with error 1
> Evaluate _OSC Set fails. Status = 0x0005
> Evaluate _OSC Set fails. Status = 0x0005
> aer_init: AER service init fails - Run ACPI _OSC fails
> aer: probe of 0000:00:04.0:pcie01 failed with error 2
> Evaluate _OSC Set fails. Status = 0x0005
> Evaluate _OSC Set fails. Status = 0x0005
> aer_init: AER service init fails - Run ACPI _OSC fails
> aer: probe of 0000:00:05.0:pcie01 failed with error 2
> Evaluate _OSC Set fails. Status = 0x0005
> Evaluate _OSC Set fails. Status = 0x0005
> aer_init: AER service init fails - Run ACPI _OSC fails
> aer: probe of 0000:00:06.0:pcie01 failed with error 2
> aer_init: AER service init fails - No ACPI _OSC support
> aer: probe of 0000:00:07.0:pcie01 failed with error 1
>
> Full dmesg, lspci, and ACPI DSDT are available here:
> http://jim.sh/~jim/tmp/nmi/
>
> -jim
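
To compare machines, the failing AER probes can be pulled out of a dmesg dump as "device error-code" pairs. A minimal sketch, assuming the message format quoted above; aer_probe_failures is an illustrative name:

```shell
# Read dmesg text on stdin and print "<PCI address> <error code>" for
# every failed AER probe. aer_probe_failures is an illustrative name;
# the pattern assumes the "aer: probe of ... failed with error N" format.
aer_probe_failures() {
    sed -n 's/^.*aer: probe of \([0-9a-f:.]*\):pcie[0-9]* failed with error \([0-9]*\).*$/\1 \2/p'
}
```

Used as "dmesg | aer_probe_failures". On the output quoted above it would list ports 02.0, 04.0, 05.0 and 06.0 with error 2 (the "Run ACPI _OSC fails" case) and 03.0 and 07.0 with error 1 ("No ACPI _OSC support").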


---
~Randy
Phaedrus says that Quality is about caring.

2007-09-28 15:22:17

by AndrewL733

Subject: Re: NMI error and Intel S5000PSL Motherboards

Randy Dunlap wrote:
> On Wed, 26 Sep 2007 19:48:14 -0400 Jim Paris wrote:
>
>
>> Hello,
>>
>>
>>> We have about 100 servers based on Intel S5000PSL-SATA motherboards.
>>> They have been running for anywhere between 1 and 10 months. For the
>>> past few months, after updating them all to the 2.6.20.15 kernel
>>> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
>>> errors. For example:
>>>
>>> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
>>> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?
>>> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
>>>
>> I'm also working with Andrew and Samson. It seems that the cause of
>> the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
>> defaults to y.
>>
>> With CONFIG_PCIEAER=n, scanpci works fine with no errors. This is the
>> workaround that they'll likely use for now.
>>
>
> Glad that you found it.
>
>
>> With CONFIG_PCIEAER=y, scanpci always triggers the NMI error. The
>> option aerdriver.forceload=1 has no effect.
>>

So, looking for some closure here, what do we think is the "root cause"?
Is it:

1) a defect with Intel's S5000PSL motherboards that is exposed by an
otherwise fine new (since 2.6.19) Linux kernel feature? (in which case
we and others should probably press Intel to recognize they have a
problem, seeing as they only "officially support" distributions running
on 2.6.16 or below so maybe they don't even know about this issue).

2) a problem with PCIEAER? And maybe "CONFIG_PCIEAER=y" should NOT be
the default setting? (in which case the kernel maybe needs fixing)

3) just a bad interaction between a good motherboard and a good Linux
feature that don't play well together? (in which case this is a kernel
"feature" that anybody compiling a kernel to run on the Intel S5000PSL
motherboard should know not to enable -- maybe a note is warranted so
that when configuring the kernel, people with S5000PSL motherboards
might not make the same mistake???).

>
> The 'forceload' option only forces the driver to load even when the
> ACPI hardware initialization routine fails.
>
> It would be nice to be able to disable PCIEAER at boot time though.
> Shouldn't be difficult.
>
>
>
>> The related dmesg output at boot is:
>>
>> Evaluate _OSC Set fails. Status = 0x0005
>> Evaluate _OSC Set fails. Status = 0x0005
>> aer_init: AER service init fails - Run ACPI _OSC fails
>> aer: probe of 0000:00:02.0:pcie01 failed with error 2
>> aer_init: AER service init fails - No ACPI _OSC support
>> aer: probe of 0000:00:03.0:pcie01 failed with error 1
>> Evaluate _OSC Set fails. Status = 0x0005
>> Evaluate _OSC Set fails. Status = 0x0005
>> aer_init: AER service init fails - Run ACPI _OSC fails
>> aer: probe of 0000:00:04.0:pcie01 failed with error 2
>> Evaluate _OSC Set fails. Status = 0x0005
>> Evaluate _OSC Set fails. Status = 0x0005
>> aer_init: AER service init fails - Run ACPI _OSC fails
>> aer: probe of 0000:00:05.0:pcie01 failed with error 2
>> Evaluate _OSC Set fails. Status = 0x0005
>> Evaluate _OSC Set fails. Status = 0x0005
>> aer_init: AER service init fails - Run ACPI _OSC fails
>> aer: probe of 0000:00:06.0:pcie01 failed with error 2
>> aer_init: AER service init fails - No ACPI _OSC support
>> aer: probe of 0000:00:07.0:pcie01 failed with error 1
>>
>> Full dmesg, lspci, and ACPI DSDT are available here:
>> http://jim.sh/~jim/tmp/nmi/
>>
>> -jim
>>
>
>
> ---
> ~Randy
> Phaedrus says that Quality is about caring.
>

2007-09-28 15:24:18

by AndrewL733

Subject: Re: NMI error and Intel S5000PSL Motherboards

Randy Dunlap wrote:
> On Wed, 26 Sep 2007 19:48:14 -0400 Jim Paris wrote:
>
>
>> Hello,
>>
>>
>>> We have about 100 servers based on Intel S5000PSL-SATA motherboards.
>>> They have been running for anywhere between 1 and 10 months. For the
>>> past few months, after updating them all to the 2.6.20.15 kernel
>>> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
>>> errors. For example:
>>>
>>> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
>>> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?
>>> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
>>>
>> I'm also working with Andrew and Samson. It seems that the cause of
>> the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
>> defaults to y.
>>
>> With CONFIG_PCIEAER=n, scanpci works fine with no errors. This is the
>> workaround that they'll likely use for now.
>>
>
> Glad that you found it.
>
>
>> With CONFIG_PCIEAER=y, scanpci always triggers the NMI error. The
>> option aerdriver.forceload=1 has no effect.
>>
>
> The 'forceload' option only forces the driver to load even when the
> ACPI hardware initialization routine fails.
>
> It would be nice to be able to disable PCIEAER at boot time though.
> Shouldn't be difficult.
>
>
So, looking for some closure here, what do we think is the "root cause"?
Is it:

1) a defect with Intel's S5000PSL motherboards that is exposed by an
otherwise fine new (since 2.6.19) Linux kernel feature? (in which case
we and others should probably press Intel to recognize they have a
problem, seeing as they only "officially support" distributions running
on 2.6.16 or below so maybe they don't even know about this issue).

2) a problem with PCIEAER? And maybe "CONFIG_PCIEAER=y" should NOT be
the default setting? (in which case the kernel maybe needs fixing)

3) just a bad interaction between a good motherboard and a good Linux
feature that don't play well together? (in which case this is a kernel
"feature" that anybody compiling a kernel to run on the Intel S5000PSL
motherboard should know not to enable -- maybe a note is warranted so
that when configuring the kernel, people with S5000PSL motherboards
might not make the same mistake???).



>
>> The related dmesg output at boot is:
>>
>> Evaluate _OSC Set fails. Status = 0x0005
>> Evaluate _OSC Set fails. Status = 0x0005
>> aer_init: AER service init fails - Run ACPI _OSC fails
>> aer: probe of 0000:00:02.0:pcie01 failed with error 2
>> aer_init: AER service init fails - No ACPI _OSC support
>> aer: probe of 0000:00:03.0:pcie01 failed with error 1
>> Evaluate _OSC Set fails. Status = 0x0005
>> Evaluate _OSC Set fails. Status = 0x0005
>> aer_init: AER service init fails - Run ACPI _OSC fails
>> aer: probe of 0000:00:04.0:pcie01 failed with error 2
>> Evaluate _OSC Set fails. Status = 0x0005
>> Evaluate _OSC Set fails. Status = 0x0005
>> aer_init: AER service init fails - Run ACPI _OSC fails
>> aer: probe of 0000:00:05.0:pcie01 failed with error 2
>> Evaluate _OSC Set fails. Status = 0x0005
>> Evaluate _OSC Set fails. Status = 0x0005
>> aer_init: AER service init fails - Run ACPI _OSC fails
>> aer: probe of 0000:00:06.0:pcie01 failed with error 2
>> aer_init: AER service init fails - No ACPI _OSC support
>> aer: probe of 0000:00:07.0:pcie01 failed with error 1
>>
>> Full dmesg, lspci, and ACPI DSDT are available here:
>> http://jim.sh/~jim/tmp/nmi/
>>
>> -jim
>>
>
>
> ---
> ~Randy
> Phaedrus says that Quality is about caring.
>

2007-10-01 04:09:38

by AndrewL733

Subject: Repost: NMI error and Intel S5000PSL Motherboards

This is a slightly edited repost of a note sent on Friday September 28,
as we haven't heard back from anyone yet. (I know it was the weekend!)
Sorry to post again but this issue caused great problems for us and I
want to be sure we're choosing a decent solution.

Perhaps one of the people who so helpfully commented on this issue
earlier last week can now give their opinion on what should be
concluded from our discovery that "CONFIG_PCIEAER=y" -- introduced in
the 2.6.19 kernel and set as the default -- leads to NMI errors on the
Intel S5000PSL motherboard.

I'm told Intel people were closely involved in the development of this
PCIEAER feature -- so it seems even weirder that it causes problems for
this Intel motherboard. But we have confirmed the problem with multiple
Linux distributions.

We are hoping to get some insights into the real cause. Please see below
where I outlined what seem to be the 3 possibilities.
> Randy Dunlap wrote:
>> On Wed, 26 Sep 2007 19:48:14 -0400 Jim Paris wrote:
>>
>>
>>> Hello,
>>>
>>>
>>>> We have about 100 servers based on Intel S5000PSL-SATA
>>>> motherboards. They have been running for anywhere between 1 and 10
>>>> months. For the past few months, after updating them all to the
>>>> 2.6.20.15 kernel (because of a bug in the 2.6.18 kernel), we are
>>>> seeing some strange NMI errors. For example:
>>>>
>>>> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown
>>>> reason 30.
>>>> Aug 29 09:02:10 master kernel: Do you have a strange power saving
>>>> mode enabled?
>>>> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to
>>>> continue
>>>>
>>> I'm also working with Andrew and Samson. It seems that the cause of
>>> the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
>>> defaults to y.
>>>
>>> With CONFIG_PCIEAER=n, scanpci works fine with no errors. This is the
>>> workaround that they'll likely use for now.
>>>
>>
>> Glad that you found it.
>>
>>
>>> With CONFIG_PCIEAER=y, scanpci always triggers the NMI error. The
>>> option aerdriver.forceload=1 has no effect.
>>>
Although running "scanpci" provoked the NMI errors 100 percent on
demand, the NMI errors would also occur randomly every few weeks on a
given system without doing anything special. I don't want anybody to
think we are just trying to prevent a problem from occurring because we
like running "scanpci". "Scanpci" just turned out to be a reliable way
to reproduce an otherwise random problem.
>>
>> The 'forceload' option only forces the driver to load even when the
>> ACPI hardware initialization routine fails.
>>
>> It would be nice to be able to disable PCIEAER at boot time though.
>> Shouldn't be difficult.
>>
>>


So, looking for some closure here, what do you think is the "root
cause"? Is it:

1) a defect with Intel's S5000PSL motherboards that is not seen when
running 2.6.18 and earlier kernels but that is exposed by this feature
added in 2.6.19? In which case, shouldn't we work to get Intel to
investigate?

2) a problem with the PCIEAER feature? And maybe "CONFIG_PCIEAER=y"
should NOT be the default setting?

3) just a bad interaction between a good motherboard and a good Linux
feature that don't play well together? (in which case isn't this a
"feature" that anybody compiling a kernel to run on the Intel S5000PSL
motherboard should know not to enable?)

And in general, is it a bad idea to set "CONFIG_PCIEAER" to "no"? Or is
it something that we can really live without?




Andrew