2011-02-01 16:27:09

by George Spelvin

Subject: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

Since upgrading to -rc2 (-rc3 is compiling right now), I've been getting
complaints at irregular intervals. This didn't used to happen with 2.6.37.

It's an old crappy 1.6 GHz P4 (HP Pavilion) with an ASUS P4B266LA
motherboard and a 2001 Award BIOS.

00:00.0 Host bridge [0600]: Intel Corporation 82845 845 [Brookdale] Chipset Host Bridge [8086:1a30] (rev 04)
00:01.0 PCI bridge [0604]: Intel Corporation 82845 845 [Brookdale] Chipset AGP Bridge [8086:1a31] (rev 04)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev 05)
00:1f.0 ISA bridge [0601]: Intel Corporation 82801BA ISA Bridge (LPC) [8086:2440] (rev 05)
00:1f.1 IDE interface [0101]: Intel Corporation 82801BA IDE U100 Controller [8086:244b] (rev 05)
00:1f.2 USB Controller [0c03]: Intel Corporation 82801BA/BAM USB Controller #1 [8086:2442] (rev 05)
00:1f.3 SMBus [0c05]: Intel Corporation 82801BA/BAM SMBus Controller [8086:2443] (rev 05)
00:1f.4 USB Controller [0c03]: Intel Corporation 82801BA/BAM USB Controller #1 [8086:2444] (rev 05)
00:1f.5 Multimedia audio controller [0401]: Intel Corporation 82801BA/BAM AC'97 Audio Controller [8086:2445] (rev 05)
01:00.0 VGA compatible controller [0300]: ATI Technologies Inc Radeon RV250 If [Radeon 9000] [1002:4966] (rev 01)
01:00.1 Display controller [0380]: ATI Technologies Inc Radeon RV250 [Radeon 9000] (Secondary) [1002:496e] (rev 01)
02:08.0 Ethernet controller [0200]: Intel Corporation 82801BA/BAM/CA/CAM Ethernet Controller [8086:2449] (rev 03)
02:09.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB12LV26 IEEE-1394 Controller (Link) [104c:8020]


Should I bisect this, or does someone know what might be happening?

Thank you!


Jan 30 13:13:25 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 13:13:25 kernel: Do you have a strange power saving mode enabled?
Jan 30 13:13:25 kernel: Dazed and confused, but trying to continue
Jan 30 17:51:10 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 17:51:10 kernel: Do you have a strange power saving mode enabled?
Jan 30 17:51:10 kernel: Dazed and confused, but trying to continue
Jan 30 18:05:11 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 30 18:05:11 kernel: Do you have a strange power saving mode enabled?
Jan 30 18:05:11 kernel: Dazed and confused, but trying to continue
Jan 30 18:19:16 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 30 18:19:16 kernel: Do you have a strange power saving mode enabled?
Jan 30 18:19:16 kernel: Dazed and confused, but trying to continue
Jan 30 18:33:33 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 18:33:33 kernel: Do you have a strange power saving mode enabled?
Jan 30 18:33:33 kernel: Dazed and confused, but trying to continue
Jan 30 18:48:23 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 18:48:23 kernel: Do you have a strange power saving mode enabled?
Jan 30 18:48:23 kernel: Dazed and confused, but trying to continue
Jan 30 21:39:58 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 21:39:58 kernel: Do you have a strange power saving mode enabled?
Jan 30 21:39:58 kernel: Dazed and confused, but trying to continue
Jan 30 22:01:46 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 22:01:46 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:01:46 kernel: Dazed and confused, but trying to continue
Jan 30 22:03:13 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 22:03:13 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:03:13 kernel: Dazed and confused, but trying to continue
Jan 30 22:04:38 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 30 22:04:38 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:04:38 kernel: Dazed and confused, but trying to continue
Jan 30 22:06:03 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 22:06:03 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:06:03 kernel: Dazed and confused, but trying to continue
Jan 30 22:07:23 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 30 22:07:23 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:07:23 kernel: Dazed and confused, but trying to continue
Jan 31 01:00:28 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 31 01:00:28 kernel: Do you have a strange power saving mode enabled?
Jan 31 01:00:28 kernel: Dazed and confused, but trying to continue
Jan 31 03:00:02 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 31 03:00:02 kernel: Do you have a strange power saving mode enabled?
Jan 31 03:00:02 kernel: Dazed and confused, but trying to continue
Jan 31 06:27:52 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 31 06:27:52 kernel: Do you have a strange power saving mode enabled?
Jan 31 06:27:52 kernel: Dazed and confused, but trying to continue
Jan 31 07:36:54 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 31 07:36:54 kernel: Do you have a strange power saving mode enabled?
Jan 31 07:36:54 kernel: Dazed and confused, but trying to continue
Jan 31 10:08:08 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 31 10:08:08 kernel: Do you have a strange power saving mode enabled?
Jan 31 10:08:08 kernel: Dazed and confused, but trying to continue
Jan 31 16:42:02 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 31 16:42:02 kernel: Do you have a strange power saving mode enabled?
Jan 31 16:42:02 kernel: Dazed and confused, but trying to continue
Jan 31 20:05:21 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 31 20:05:21 kernel: Do you have a strange power saving mode enabled?
Jan 31 20:05:21 kernel: Dazed and confused, but trying to continue
Feb 1 01:00:19 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 01:00:19 kernel: Do you have a strange power saving mode enabled?
Feb 1 01:00:19 kernel: Dazed and confused, but trying to continue
Feb 1 01:36:42 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 01:36:42 kernel: Do you have a strange power saving mode enabled?
Feb 1 01:36:42 kernel: Dazed and confused, but trying to continue
Feb 1 02:01:04 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 02:01:04 kernel: Do you have a strange power saving mode enabled?
Feb 1 02:01:04 kernel: Dazed and confused, but trying to continue
Feb 1 05:58:05 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 05:58:05 kernel: Do you have a strange power saving mode enabled?
Feb 1 05:58:05 kernel: Dazed and confused, but trying to continue
Feb 1 06:28:18 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 06:28:18 kernel: Do you have a strange power saving mode enabled?
Feb 1 06:28:18 kernel: Dazed and confused, but trying to continue
Feb 1 08:59:18 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 08:59:18 kernel: Do you have a strange power saving mode enabled?
Feb 1 08:59:18 kernel: Dazed and confused, but trying to continue
Feb 1 11:04:43 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb 1 11:04:43 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:04:43 kernel: Dazed and confused, but trying to continue
Feb 1 11:05:47 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb 1 11:05:47 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:05:47 kernel: Dazed and confused, but trying to continue
Feb 1 11:06:48 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb 1 11:06:48 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:06:48 kernel: Dazed and confused, but trying to continue
Feb 1 11:07:50 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 11:07:50 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:07:50 kernel: Dazed and confused, but trying to continue
Feb 1 11:08:52 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb 1 11:08:52 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:08:52 kernel: Dazed and confused, but trying to continue
Feb 1 11:09:54 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 11:09:54 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:09:54 kernel: Dazed and confused, but trying to continue
Feb 1 11:10:56 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb 1 11:10:56 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:10:56 kernel: Dazed and confused, but trying to continue
Feb 1 11:11:58 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 11:11:58 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:11:58 kernel: Dazed and confused, but trying to continue
Feb 1 11:13:00 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 11:13:00 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:13:00 kernel: Dazed and confused, but trying to continue
Feb 1 11:14:01 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb 1 11:14:01 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:14:01 kernel: Dazed and confused, but trying to continue
Feb 1 11:15:04 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 11:15:04 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:15:04 kernel: Dazed and confused, but trying to continue
Feb 1 11:16:05 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb 1 11:16:05 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:16:05 kernel: Dazed and confused, but trying to continue
Feb 1 11:17:07 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 11:17:07 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:17:07 kernel: Dazed and confused, but trying to continue
Feb 1 11:18:33 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb 1 11:18:33 kernel: Do you have a strange power saving mode enabled?
Feb 1 11:18:33 kernel: Dazed and confused, but trying to continue
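
A note on the hex codes in the log above: as far as I know, the value printed after "unknown reason" is the byte the kernel reads from the legacy NMI status port (0x61), and only its top two bits identify an NMI source. The sketch below, in plain user-space C, shows why 2d and 3d both count as "unknown" and differ only in bit 4, which on PC-compatible chipsets is normally the free-running refresh toggle. The macro and function names are illustrative assumptions, not copied from the kernel tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Status bits of the NMI status/control port (0x61); names are illustrative. */
#define NMI_REASON_SERR   0x80u  /* bit 7: PCI SERR# / memory parity error */
#define NMI_REASON_IOCHK  0x40u  /* bit 6: I/O channel check */
#define NMI_REASON_MASK   (NMI_REASON_SERR | NMI_REASON_IOCHK)

/* True when neither status bit is set, i.e. the chipset names no NMI source. */
static bool nmi_reason_is_unknown(uint8_t reason)
{
	return (reason & NMI_REASON_MASK) == 0;
}

/* Bits that differ between two logged reason bytes. */
static uint8_t nmi_reason_diff(uint8_t a, uint8_t b)
{
	return (uint8_t)(a ^ b);
}

/* Checks the three codes that appear in this thread. */
static void nmi_reason_selftest(void)
{
	assert(nmi_reason_is_unknown(0x2d));
	assert(nmi_reason_is_unknown(0x3d));
	assert(nmi_reason_is_unknown(0x3c));
	assert(nmi_reason_diff(0x2d, 0x3d) == 0x10); /* only bit 4 differs */
	assert(nmi_reason_diff(0x3c, 0x3d) == 0x01); /* only bit 0 differs */
}
```

If this reading is right, the alternation between 2d and 3d carries no diagnostic information; what matters is that neither source bit is set.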


2011-02-01 17:52:25

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On 02/01/2011 07:27 PM, George Spelvin wrote:
> Since upgrading to -rc2 (-rc3 is compiling right now), I've been getting
> complaints at irregular intervals. This didn't used to happen with 2.6.37.
>
...
> Should I bisect this, or does someone know what might be happening?
>
> Thank you!
>

I fear it's a known issue at the moment; we're trying to resolve it. One
option is to disable the NMI watchdog (nmi_watchdog=0 boot option).

But if you're willing to help debug the problem, would you mind
trying the patch below? Note that the patch is ugly at the moment and must
*not* be run on a non-P4 system (I have only compile-tested it, so no
guarantees at all; I've also CC'ed a couple of people).

Cyrill

---
arch/x86/kernel/cpu/perf_event.c | 12 +++++++++++-
arch/x86/kernel/cpu/perf_event_p4.c | 8 +++++++-
2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.git/arch/x86/kernel/cpu/perf_event.c
=====================================================================
--- linux-2.6.git.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6.git/arch/x86/kernel/cpu/perf_event.c
@@ -1075,7 +1075,17 @@ static void x86_pmu_start(struct perf_ev

 	cpuc->events[idx] = event;
 	__set_bit(idx, cpuc->active_mask);
-	__set_bit(idx, cpuc->running);
+	if (1) {
+		/* running mask is shared across a core */
+		int leader_cpu;
+		struct cpu_hw_events *leader_cpuc;
+
+		leader_cpu = cpumask_first(__get_cpu_var(cpu_sibling_map));
+		leader_cpuc = &per_cpu(cpu_hw_events, leader_cpu);
+
+		__set_bit(idx, leader_cpuc->running);
+	} else
+		__set_bit(idx, cpuc->running);
 	x86_pmu.enable(event);
 	perf_event_update_userpage(event);
 }
Index: linux-2.6.git/arch/x86/kernel/cpu/perf_event_p4.c
=====================================================================
--- linux-2.6.git.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6.git/arch/x86/kernel/cpu/perf_event_p4.c
@@ -907,8 +907,14 @@ static int p4_pmu_handle_irq(struct pt_r
 		int overflow;

 		if (!test_bit(idx, cpuc->active_mask)) {
+			int leader_cpu;
+			struct cpu_hw_events *leader_cpuc;
+
+			leader_cpu = cpumask_first(__get_cpu_var(cpu_sibling_map));
+			leader_cpuc = &per_cpu(cpu_hw_events, leader_cpu);
+
 			/* catch in-flight IRQs */
-			if (__test_and_clear_bit(idx, cpuc->running))
+			if (__test_and_clear_bit(idx, leader_cpuc->running))
 				handled++;
 			continue;
 		}
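
To make the intent of this experimental patch easier to follow, here is a simplified user-space model of the two bitmasks it touches. As I read the code, stopping a counter clears its active bit but leaves its running bit set, so that an NMI still in flight can be claimed once by the handler instead of being reported as unknown. The patch moves the running mask to the first sibling of the core, apparently on the assumption that the late NMI may be delivered to the other hyperthread. Here struct pmu_state and all four functions are hypothetical stand-ins; in the kernel the masks live in struct cpu_hw_events and are updated with __set_bit() and __test_and_clear_bit().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct cpu_hw_events. */
struct pmu_state {
	unsigned long active_mask; /* counters currently enabled */
	unsigned long running;     /* counters started and not yet claimed */
};

static void counter_start(struct pmu_state *s, int idx)
{
	s->active_mask |= 1UL << idx;
	s->running |= 1UL << idx;
}

static void counter_stop(struct pmu_state *s, int idx)
{
	/* Only the active bit is dropped; running stays set for a late NMI. */
	s->active_mask &= ~(1UL << idx);
}

/* Returns true if a late NMI for counter idx can be claimed as handled. */
static bool claim_inflight_nmi(struct pmu_state *s, int idx)
{
	unsigned long bit = 1UL << idx;

	if (s->active_mask & bit)
		return false; /* active counters take the normal overflow path */
	if (s->running & bit) {
		s->running &= ~bit; /* claim one late NMI, then forget it */
		return true;
	}
	return false;
}

/* With per-thread masks, only the thread that started the counter can claim. */
static void pmu_state_selftest(void)
{
	struct pmu_state thread0 = { 0, 0 }, thread1 = { 0, 0 };

	counter_start(&thread0, 3);
	counter_stop(&thread0, 3);

	assert(!claim_inflight_nmi(&thread1, 3)); /* sibling sees nothing */
	assert(claim_inflight_nmi(&thread0, 3));  /* claimed exactly once */
	assert(!claim_inflight_nmi(&thread0, 3));
}
```

In this model, passing the same struct pmu_state to both siblings corresponds to what the patch does with leader_cpuc.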

2011-02-01 18:41:40

by Don Zickus

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On Tue, Feb 01, 2011 at 08:52:19PM +0300, Cyrill Gorcunov wrote:
> On 02/01/2011 07:27 PM, George Spelvin wrote:
> > Since upgrading to -rc2 (-rc3 is compiling right now), I've been getting
> > complaints at irregular intervals. This didn't used to happen with 2.6.37.
> >
> ...
> > Should I bisect this, or does someone know what might be happening?
> >
> > Thank you!
> >
>
> I fear it's known issue at moment, we're trying to resolve it. There is
> an option -- to disable nmi_watchdog (nmi_watchdog=0 boot option).
>
> But if you have a will or would like to help debug the problem -- mind to
> try the patch below? Note the patch is ugly at moment and must *not* be
> running on non-P4 system (and I only compile-tested it so no guarantees
> at all, and I've CC'ed a couple of people as well)

Unfortunately, I have not had success with patch below on my system. :-(

Cheers,
Don

2011-02-01 18:44:21

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On 02/01/2011 09:41 PM, Don Zickus wrote:
...
>
> Unfortunately, I have not had success with patch below on my system. :-(
>
> Cheers,
> Don

You mean it didn't help?

--
Cyrill

2011-02-01 18:51:42

by Don Zickus

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On Tue, Feb 01, 2011 at 09:44:15PM +0300, Cyrill Gorcunov wrote:
> On 02/01/2011 09:41 PM, Don Zickus wrote:
> ...
> >
> > Unfortunately, I have not had success with patch below on my system. :-(
> >
> > Cheers,
> > Don
>
> You mean it didn't help?

Not that I noticed, no.

Cheers,
Don

2011-02-01 20:01:00

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On 02/01/2011 09:51 PM, Don Zickus wrote:
...
>>
>> You mean it didn't help?
>
> Not that I noticed no.
>
> Cheers,
> Don

Thanks a lot for testing, Don! I'll check what else I can do.

--
Cyrill

2011-02-02 02:36:12

by George Spelvin

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

> But if you have a will or would like to help debug the problem -- mind to
> try the patch below? Note the patch is ugly at moment and must *not* be
> running on non-P4 system (and I only compile-tested it so no guarantees
> at all, and I've CC'ed a couple of people as well)

Promising... After 32 minutes of uptime, no NMI complaints so far.

I'll let it run overnight and see what happens.

Thank you very much!

2011-02-02 04:18:24

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On 2/2/11, George Spelvin <[email protected]> wrote:
>> But if you have a will or would like to help debug the problem -- mind to
>> try the patch below? Note the patch is ugly at moment and must *not* be
>> running on non-P4 system (and I only compile-tested it so no guarantees
>> at all, and I've CC'ed a couple of people as well)
>
> Promising... After 32 minute of uptime, no NMI complaints so far.
>
> I'll let it run overnight and see what happens.
>

Great, thanks. The patch didn't help Don, though, i.e. there is still
an issue which needs to be resolved as well.

2011-02-14 13:41:19

by Preeti Khurana

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

I am getting an issue similar to the one reported in https://lkml.org/lkml/2011/2/10/187

Can someone tell me whether this is the same issue? I am seeing the problem on an Intel Xeon.

Thanks
Preeti

2011-02-16 01:58:07

by Dave Airlie

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On Wed, Feb 2, 2011 at 2:18 PM, Cyrill Gorcunov <[email protected]> wrote:
> On 2/2/11, George Spelvin <[email protected]> wrote:
>>> But if you have a will or would like to help debug the problem -- mind to
>>> try the patch below? Note the patch is ugly at moment and must *not* be
>>> running on non-P4 system (and I only compile-tested it so no guarantees
>>> at all, and I've CC'ed a couple of people as well)
>>
>> Promising... After 32 minute of uptime, no NMI complaints so far.
>>
>> I'll let it run overnight and see what happens.
>>
>
> Great, thanks. Though the patch didn't help for Don, ie there is still
> an issue which needs to be resolved as well.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

Ping on this problem, still seeing

Uhhuh. NMI received for unknown reason 3c on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

on my Pentium-D system here with latest Linus head.

It's sometimes 3c, sometimes 3d. I'm going to bisect and push for
reverts if nobody has any clue yet about how to fix this.

Dave.

2011-02-16 04:19:05

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On 2/16/11, Dave Airlie <[email protected]> wrote:
> On Wed, Feb 2, 2011 at 2:18 PM, Cyrill Gorcunov <[email protected]> wrote:
>> On 2/2/11, George Spelvin <[email protected]> wrote:
>>>> But if you have a will or would like to help debug the problem -- mind
>>>> to
>>>> try the patch below? Note the patch is ugly at moment and must *not* be
>>>> running on non-P4 system (and I only compile-tested it so no guarantees
>>>> at all, and I've CC'ed a couple of people as well)
>>>
>>> Promising... After 32 minute of uptime, no NMI complaints so far.
>>>
>>> I'll let it run overnight and see what happens.
>>>
>>
>> Great, thanks. Though the patch didn't help for Don, ie there is still
>> an issue which needs to be resolved as well.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>
> Ping on this problem, still seeing
>
> Uhhuh. NMI received for unknown reason 3c on CPU 0.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue
>
> on my Pentium-D system here with latest Linus head.
>
> its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> reverts if nobody still has any clue about how to fix this.
>
> Dave.
>

We are still trying to resolve it, but without success so far. There is no
easy way to revert it. One option might be to disable perf on
P4 for a while. If this is acceptable, I'll cook up such a patch and send
it to Ingo. Hm?

2011-02-16 08:38:03

by Ingo Molnar

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.


* Cyrill Gorcunov <[email protected]> wrote:

> On 2/16/11, Dave Airlie <[email protected]> wrote:
> > On Wed, Feb 2, 2011 at 2:18 PM, Cyrill Gorcunov <[email protected]> wrote:
> >> On 2/2/11, George Spelvin <[email protected]> wrote:
> >>>> But if you have a will or would like to help debug the problem -- mind
> >>>> to
> >>>> try the patch below? Note the patch is ugly at moment and must *not* be
> >>>> running on non-P4 system (and I only compile-tested it so no guarantees
> >>>> at all, and I've CC'ed a couple of people as well)
> >>>
> >>> Promising... After 32 minute of uptime, no NMI complaints so far.
> >>>
> >>> I'll let it run overnight and see what happens.
> >>>
> >>
> >> Great, thanks. Though the patch didn't help for Don, ie there is still
> >> an issue which needs to be resolved as well.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at http://www.tux.org/lkml/
> >>
> >
> > Ping on this problem, still seeing
> >
> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
> > Do you have a strange power saving mode enabled?
> > Dazed and confused, but trying to continue
> >
> > on my Pentium-D system here with latest Linus head.
> >
> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> > reverts if nobody still has any clue about how to fix this.
> >
> > Dave.
> >
>
> We still trying to resolve it but without success yet. There is no
> easy way to revert it. One of the option might be to disable perf on
> p4 for a while. If this is acceptable -- i'll cook such patch and send
> it to Ingo. Hm?

That's not really acceptable - need to fix it or revert it to the last working
state. Which commit broke it?

Thanks,

Ingo

2011-02-16 08:49:53

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On Wed, Feb 16, 2011 at 11:37 AM, Ingo Molnar <[email protected]> wrote:
...
>> >>
>> >
>> > Ping on this problem, still seeing
>> >
>> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
>> > Do you have a strange power saving mode enabled?
>> > Dazed and confused, but trying to continue
>> >
>> > on my Pentium-D system here with latest Linus head.
>> >
>> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
>> > reverts if nobody still has any clue about how to fix this.
>> >
>> > Dave.
>> >
>>
>> We still trying to resolve it but without success yet. There is no
>> easy way to revert it. One of the option might be to disable perf on
>> p4 for a while. If this is acceptable -- i'll cook such patch and send
>> it to Ingo. Hm?
>
> That's not really acceptable - need to fix it or revert it to the last working
> state. Which commit broke it?
>
> Thanks,
>
>        Ingo
>

I can't tell you the commit ID after which the unknown NMIs started
happening (I'm away from my git tree at the moment), but even then this
commit should not be reverted, since the problem is in the P4 code, not in
the rest of the perf system.

I have two patches here (attached) and would really appreciate having
them tested on an HT machine together with the kgdb boot-up tests enabled.
Dave, could you please?


Attachments:
perf-x86-p4-extra-nmi (1.02 kB)
perf-x86-p4-unflagged-nmi (1.98 kB)

2011-02-16 08:56:23

by Ingo Molnar

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.


* Cyrill Gorcunov <[email protected]> wrote:

> On Wed, Feb 16, 2011 at 11:37 AM, Ingo Molnar <[email protected]> wrote:
> ...
> >> >>
> >> >
> >> > Ping on this problem, still seeing
> >> >
> >> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
> >> > Do you have a strange power saving mode enabled?
> >> > Dazed and confused, but trying to continue
> >> >
> >> > on my Pentium-D system here with latest Linus head.
> >> >
> >> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> >> > reverts if nobody still has any clue about how to fix this.
> >> >
> >> > Dave.
> >> >
> >>
> >> We still trying to resolve it but without success yet. There is no
> >> easy way to revert it. One of the option might be to disable perf on
> >> p4 for a while. If this is acceptable -- i'll cook such patch and send
> >> it to Ingo. Hm?
> >
> > That's not really acceptable - need to fix it or revert it to the last working
> > state. Which commit broke it?
> >
> > Thanks,
> >
> >        Ingo
> >
>
> I can't say you the commit id after which unknown-nmi start happening
> (i'm out of git tree
> at moment) but even then this commit should not be reverted since the
> problem is in
> p4 code not in the rest of perf system.
>
> I have two patches here (attached) and would really appreciate of
> their testing on HT machine
> together with kgdb bootup tests enabled. Dave could you please?

Could these patches fix Dave's non-kgdb problem? Dave isn't using kgdb but is
probably using perf, which triggers NMIs? Dave, can you confirm that?

And it's a spurious NMI message, not actual lockup or other misbehavior, right?

Thanks,

Ingo

2011-02-16 09:33:37

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On 2/16/11, Ingo Molnar <[email protected]> wrote:
>
> * Cyrill Gorcunov <[email protected]> wrote:
>
>> On Wed, Feb 16, 2011 at 11:37 AM, Ingo Molnar <[email protected]> wrote:
>> ...
>> >> >>
>> >> >
>> >> > Ping on this problem, still seeing
>> >> >
>> >> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
>> >> > Do you have a strange power saving mode enabled?
>> >> > Dazed and confused, but trying to continue
>> >> >
>> >> > on my Pentium-D system here with latest Linus head.
>> >> >
>> >> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
>> >> > reverts if nobody still has any clue about how to fix this.
>> >> >
>> >> > Dave.
>> >> >
>> >>
>> >> We still trying to resolve it but without success yet. There is no
>> >> easy way to revert it. One of the option might be to disable perf on
>> >> p4 for a while. If this is acceptable -- i'll cook such patch and send
>> >> it to Ingo. Hm?
>> >
>> > That's not really acceptable - need to fix it or revert it to the last
>> > working
>> > state. Which commit broke it?
>> >
>> > Thanks,
>> >
>> > Ingo
>> >
>>
>> I can't say you the commit id after which unknown-nmi start happening
>> (i'm out of git tree
>> at moment) but even then this commit should not be reverted since the
>> problem is in
>> p4 code not in the rest of perf system.
>>
>> I have two patches here (attached) and would really appreciate of
>> their testing on HT machine
>> together with kgdb bootup tests enabled. Dave could you please?
>
> Could these patches fix Dave's non-kgdb problem? Dave isnt using kgdb but is
> probably using perf which triggers NMIs? Dave, can you confirm that?
>
> And it's a spurious NMI message, not actual lockup or other misbehavior,
> right?
>
> Thanks,
>
> Ingo
>
For the non-kgdb case the 'unflagged nmi fix' patch should be enough. I've
tested it on a non-HT machine myself. Without it there is no lockup,
only a message about an unknown NMI.

For an HT machine with kgdb things get harder; Don reported a lockup on
boot. The second patch might help, but I can't test it (here I need help
with testing).

2011-02-16 10:10:12

by Ingo Molnar

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.


* Cyrill Gorcunov <[email protected]> wrote:

> On 2/16/11, Ingo Molnar <[email protected]> wrote:
> >
> > * Cyrill Gorcunov <[email protected]> wrote:
> >
> >> On Wed, Feb 16, 2011 at 11:37 AM, Ingo Molnar <[email protected]> wrote:
> >> ...
> >> >> >>
> >> >> >
> >> >> > Ping on this problem, still seeing
> >> >> >
> >> >> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
> >> >> > Do you have a strange power saving mode enabled?
> >> >> > Dazed and confused, but trying to continue
> >> >> >
> >> >> > on my Pentium-D system here with latest Linus head.
> >> >> >
> >> >> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> >> >> > reverts if nobody still has any clue about how to fix this.
> >> >> >
> >> >> > Dave.
> >> >> >
> >> >>
> >> >> We still trying to resolve it but without success yet. There is no
> >> >> easy way to revert it. One of the option might be to disable perf on
> >> >> p4 for a while. If this is acceptable -- i'll cook such patch and send
> >> >> it to Ingo. Hm?
> >> >
> >> > That's not really acceptable - need to fix it or revert it to the last
> >> > working
> >> > state. Which commit broke it?
> >> >
> >> > Thanks,
> >> >
> >> > Ingo
> >> >
> >>
> >> I can't say you the commit id after which unknown-nmi start happening
> >> (i'm out of git tree
> >> at moment) but even then this commit should not be reverted since the
> >> problem is in
> >> p4 code not in the rest of perf system.
> >>
> >> I have two patches here (attached) and would really appreciate of
> >> their testing on HT machine
> >> together with kgdb bootup tests enabled. Dave could you please?
> >
> > Could these patches fix Dave's non-kgdb problem? Dave isnt using kgdb but is
> > probably using perf which triggers NMIs? Dave, can you confirm that?
> >
> > And it's a spurious NMI message, not actual lockup or other misbehavior,
> > right?
> >
> > Thanks,
> >
> > Ingo
> >
>
> For nonkgdb case 'unflagged nmi fix' patch should be enough. i've
> tested it on non-ht machine by self. without it there is no lockup
> but only a message about unknown nmi.

Ok, please submit it ASAP then - that ought to address the regression. Please Cc:
Dave to the patch.

Thanks,

Ingo

2011-02-16 11:08:08

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On Wed, Feb 16, 2011 at 1:09 PM, Ingo Molnar <[email protected]> wrote:
...
>>
>> For nonkgdb case 'unflagged nmi fix' patch should be enough. i've
>> tested it on non-ht machine by self. without it there is no lockup
>> but only a message about unknown nmi.
>
> Ok, please submit it ASAP then - that ought to address the regression. Please Cc:
> Dave to the patch.
>
> Thanks,
>
>        Ingo
>

Ingo, both patches are already in the thread (attached to the previous
mail, since I only have web access at the moment). Just to be sure, the
'unflagged' NMI patch is attached again; this is the fix which should be
applied to -tip/master.


Attachments:
perf-x86-p4-unflagged-nmi (2.02 kB)

2011-02-16 11:34:40

by Cyrill Gorcunov

Subject: [tip:perf/urgent] perf, x86: P4 PMU: Fix spurious NMI messages

Commit-ID: 7d44ec193d95416d1342cdd86392a1eeb7461186
Gitweb: http://git.kernel.org/tip/7d44ec193d95416d1342cdd86392a1eeb7461186
Author: Cyrill Gorcunov <[email protected]>
AuthorDate: Wed, 16 Feb 2011 14:08:02 +0300
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Feb 2011 12:26:12 +0100

perf, x86: P4 PMU: Fix spurious NMI messages

Several people have reported spurious unknown NMI
messages on some P4 CPUs.

This patch fixes it by checking for an overflow (negative
counter values) directly, instead of relying on the
P4_CCCR_OVF bit.

Reported-by: George Spelvin <[email protected]>
Reported-by: Meelis Roos <[email protected]>
Reported-by: Don Zickus <[email protected]>
Reported-by: Dave Airlie <[email protected]>
Signed-off-by: Cyrill Gorcunov <[email protected]>
Cc: Lin Ming <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/perf_event_p4.h | 1 +
arch/x86/kernel/cpu/perf_event_p4.c | 11 ++++++++---
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/perf_event_p4.h b/arch/x86/include/asm/perf_event_p4.h
index e2f6a99..cc29086 100644
--- a/arch/x86/include/asm/perf_event_p4.h
+++ b/arch/x86/include/asm/perf_event_p4.h
@@ -22,6 +22,7 @@

 #define ARCH_P4_CNTRVAL_BITS (40)
 #define ARCH_P4_CNTRVAL_MASK ((1ULL << ARCH_P4_CNTRVAL_BITS) - 1)
+#define ARCH_P4_UNFLAGGED_BIT ((1ULL) << (ARCH_P4_CNTRVAL_BITS - 1))

 #define P4_ESCR_EVENT_MASK 0x7e000000U
 #define P4_ESCR_EVENT_SHIFT 25
diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c
index f7a0993..ff751a9 100644
--- a/arch/x86/kernel/cpu/perf_event_p4.c
+++ b/arch/x86/kernel/cpu/perf_event_p4.c
@@ -770,9 +770,14 @@ static inline int p4_pmu_clear_cccr_ovf(struct hw_perf_event *hwc)
return 1;
}

- /* it might be unflagged overflow */
- rdmsrl(hwc->event_base + hwc->idx, v);
- if (!(v & ARCH_P4_CNTRVAL_MASK))
+ /*
+ * In some circumstances the overflow might issue an NMI but did
+ * not set P4_CCCR_OVF bit. Because a counter holds a negative value
+ * we simply check for high bit being set, if it's cleared it means
+ * the counter has reached zero value and continued counting before
+ * real NMI signal was received:
+ */
+ if (!(v & ARCH_P4_UNFLAGGED_BIT))
return 1;

return 0;
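
To see why testing the high bit catches overflows that the old all-zero test missed, here is a minimal standalone sketch. It reuses the three macros from the patch; the helper names (p4_counter_start, old_unflagged_test, new_unflagged_test) are hypothetical and exist only for illustration. It assumes the counter is programmed with the negated sampling period truncated to 40 bits, and that the value being tested is the raw counter contents.

```c
#include <stdint.h>

#define ARCH_P4_CNTRVAL_BITS  (40)
#define ARCH_P4_CNTRVAL_MASK  ((1ULL << ARCH_P4_CNTRVAL_BITS) - 1)
#define ARCH_P4_UNFLAGGED_BIT ((1ULL) << (ARCH_P4_CNTRVAL_BITS - 1))

/*
 * Hypothetical helper: the value a 40-bit counter starts from when it is
 * armed to overflow after 'period' events (assumption: -period, truncated).
 */
static inline uint64_t p4_counter_start(uint64_t period)
{
	return (0ULL - period) & ARCH_P4_CNTRVAL_MASK;
}

/* Old test: overflow is assumed only if the counter reads exactly zero. */
static inline int old_unflagged_test(uint64_t v)
{
	return !(v & ARCH_P4_CNTRVAL_MASK);
}

/*
 * New test: while the counter is still "negative" its top bit (bit 39) is
 * set. Once it wraps through zero the bit is clear, even if a few more
 * events were counted before the NMI was delivered.
 */
static inline int new_unflagged_test(uint64_t v)
{
	return !(v & ARCH_P4_UNFLAGGED_BIT);
}
```

With a period of 1000 the counter starts at 2^40 - 1000, which has bit 39 set, so neither test reports an overflow. If the NMI handler then reads 5 (the counter wrapped and counted five more events), old_unflagged_test() returns 0 and the NMI goes unclaimed, while new_unflagged_test() returns 1.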

2011-02-16 11:57:08

by George Spelvin

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

> Ping on this problem, still seeing
>
> Uhhuh. NMI received for unknown reason 3c on CPU 0.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue
>
> on my Pentium-D system here with latest Linus head.
>
> its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> reverts if nobody still has any clue about how to fix this.

The second patch (not the one you quote) fixed it for me. Almost 8 days
of uptime and no log spam.

It's appended below for your convenience. Are you using this
unsuccessfully?


From: Cyrill Gorcunov <[email protected]>
Subject: [PATCH] perf, x86: P4 PMU -- Fix unflagged overflows test

A couple of people have reported an unknown NMI issue on p4 pmu.
This patch should fix it.

Reported-by: George Spelvin <[email protected]>
Reported-by: Meelis Roos <[email protected]>
Reported-by: Don Zickus <[email protected]>
Signed-off-by: Cyrill Gorcunov <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Lin Ming <[email protected]>
CC: Don Zickus <[email protected]>
CC: Peter Zijlstra <[email protected]>
---
arch/x86/include/asm/perf_event_p4.h | 1 +
arch/x86/kernel/cpu/perf_event_p4.c | 11 ++++++++---
2 files changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
===================================================================
--- linux-2.6.tip.orig/arch/x86/include/asm/perf_event_p4.h
+++ linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
@@ -22,6 +22,7 @@

#define ARCH_P4_CNTRVAL_BITS (40)
#define ARCH_P4_CNTRVAL_MASK ((1ULL << ARCH_P4_CNTRVAL_BITS) - 1)
+#define ARCH_P4_UNFLAGGED_BIT ((1ULL) << (ARCH_P4_CNTRVAL_BITS - 1))

#define P4_ESCR_EVENT_MASK 0x7e000000U
#define P4_ESCR_EVENT_SHIFT 25
Index: linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
===================================================================
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
@@ -770,9 +770,14 @@ static inline int p4_pmu_clear_cccr_ovf(
return 1;
}

- /* it might be unflagged overflow */
- rdmsrl(hwc->event_base + hwc->idx, v);
- if (!(v & ARCH_P4_CNTRVAL_MASK))
+ /*
+ * at some circumstances the overflow might issue NMI but did
+ * not set P4_CCCR_OVF bit so since a counter holds a negative value
+ * we simply check for high bit being set, if it's cleared it means
+ * the counter has reached zero value and continued counting before
+ * real NMI signal was received
+ */
+ if (!(v & ARCH_P4_UNFLAGGED_BIT))
return 1;

return 0;

2011-02-17 00:20:08

by Underwood, Ryan

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

Preeti Khurana <Preeti.Khurana <at> guavus.com> writes:

>
> I am getting the similar issue as reported
> in https://lkml.org/lkml/2011/2/10/187
>
> Can someone tell me if the same issue because I am getting the
> problem on Intel Xeon..
>

I am seeing exactly the same problem (on 2.6.35 as Preeti reported originally)
on some Xeon servers but only with recently shipped BIOS revisions. The OS is
CentOS 5.5.

In my case, the system sometimes hangs with no message at all, sometimes with
an NMI message immediately before hanging, and sometimes with a long backtrace
originating at cpu_idle(). The NMI reason code varies, but in my observation
it is usually 21 or 31.

The problem seems to be triggered by accessing a PCI card (via MMIO) because
until accessing the PCI card, the system will run forever with no problems.

Other servers of exactly the same model (Intel SR2500) but with an older BIOS
revision work fine (the working BIOS is dated 3/14/2008, the non-working one
3/9/2010). All software is identical in these cases.

Also, in one instance, kernel v2.6.18 is used on these servers with the
3/14/2008 BIOS revision without a problem. The rest of the software is again
the same (except for kernel and drivers).

It seems to be a problem with newer kernels combined with the newer Intel BIOS.
I have not tried an older kernel on the newer BIOS yet.

I have not tried the following patches yet which seem to both be for spurious
NMI messages, not accompanied by system lockups:

https://lkml.org/lkml/2011/2/16/106
https://lkml.org/lkml/2011/2/1/286

Both nmi_watchdog=0 and pcie_aspm=off options do not solve the problem.

I am not subscribed so please Cc me.

2011-02-17 02:56:06

by Dave Airlie

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

>
> It's appended below for your convenience. Are you using this
> unsuccessfully?

This patch quoted below fixes it for me.

No more spurious NMIs on my P4.

Tested-by: Dave Airlie <[email protected]>

>
>
> From: Cyrill Gorcunov <[email protected]>
> Subject: [PATCH] perf, x86: P4 PMU -- Fix unflagged overflows test
>
> A couple of people have reported an unknown NMI issue on p4 pmu.
> This patch should fix it.
>
> Reported-by: George Spelvin <[email protected]>
> Reported-by: Meelis Roos <[email protected]>
> Reported-by: Don Zickus <[email protected]>
> Signed-off-by: Cyrill Gorcunov <[email protected]>
> CC: Ingo Molnar <[email protected]>
> CC: Lin Ming <[email protected]>
> CC: Don Zickus <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> ---
>  arch/x86/include/asm/perf_event_p4.h |    1 +
>  arch/x86/kernel/cpu/perf_event_p4.c  |   11 ++++++++---
>  2 files changed, 9 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
> ===================================================================
> --- linux-2.6.tip.orig/arch/x86/include/asm/perf_event_p4.h
> +++ linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
> @@ -22,6 +22,7 @@
>
>  #define ARCH_P4_CNTRVAL_BITS   (40)
>  #define ARCH_P4_CNTRVAL_MASK   ((1ULL << ARCH_P4_CNTRVAL_BITS) - 1)
> +#define ARCH_P4_UNFLAGGED_BIT  ((1ULL) << (ARCH_P4_CNTRVAL_BITS - 1))
>
>  #define P4_ESCR_EVENT_MASK     0x7e000000U
>  #define P4_ESCR_EVENT_SHIFT    25
> Index: linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
> ===================================================================
> --- linux-2.6.tip.orig/arch/x86/kernel/cpu/perf_event_p4.c
> +++ linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
> @@ -770,9 +770,14 @@ static inline int p4_pmu_clear_cccr_ovf(
>                return 1;
>        }
>
> -       /* it might be unflagged overflow */
> -       rdmsrl(hwc->event_base + hwc->idx, v);
> -       if (!(v & ARCH_P4_CNTRVAL_MASK))
> +       /*
> +        * at some circumstances the overflow might issue NMI but did
> +        * not set P4_CCCR_OVF bit so since a counter holds a negative value
> +        * we simply check for high bit being set, if it's cleared it means
> +        * the counter has reached zero value and continued counting before
> +        * real NMI signal was received
> +        */
> +       if (!(v & ARCH_P4_UNFLAGGED_BIT))
>                return 1;
>
>        return 0;
>

2011-02-17 07:49:02

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On Thu, Feb 17, 2011 at 5:56 AM, Dave Airlie <[email protected]> wrote:
>>
>> It's appended below for your convenience. Are you using this
>> unsuccessfully?
>
> This patch quoted below fixes it for me.
>
> No more spurious NMIs on my P4.
>
> Tested-by: Dave Airlie <[email protected]>
>

Thanks Dave! Ingo has merged it into the -urgent branch, so it should
reach mainline soon.

2011-02-17 07:59:47

by Cyrill Gorcunov

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On Thu, Feb 17, 2011 at 3:17 AM, Ryan Underwood
<[email protected]> wrote:
> Preeti Khurana <Preeti.Khurana <at> guavus.com> writes:
>
>>
>> I am getting the similar issue as reported
>> in https://lkml.org/lkml/2011/2/10/187
>>
>> Can someone tell me if the same issue because I am getting the
>> problem on Intel Xeon..
>>
>
> I am seeing exactly the same problem (on 2.6.35 as Preeti reported originally)
> on some Xeon servers but only with recently shipped BIOS revisions. The OS is
> CentOS 5.5.
>
...
> I have not tried the following patches yet which seem to both be for spurious
> NMI messages, not accompanied by system lockups:
>
> https://lkml.org/lkml/2011/2/16/106
> https://lkml.org/lkml/2011/2/1/286
>
> Both nmi_watchdog=0 and pcie_aspm=off options do not solve the problem.
>
> I am not subscribed so please Cc me.
>

Since nmi_watchdog=0 didn't help, I believe it's a different issue, unrelated
to the 'perf' patches you mentioned. Help from the RCU people is probably
needed here.

2011-02-18 16:16:37

by Paul E. McKenney

Subject: Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

On Thu, Feb 17, 2011 at 10:59:43AM +0300, Cyrill Gorcunov wrote:
> On Thu, Feb 17, 2011 at 3:17 AM, Ryan Underwood
> <[email protected]> wrote:
> > Preeti Khurana <Preeti.Khurana <at> guavus.com> writes:
> >
> >>
> >> I am getting the similar issue as reported
> >> in https://lkml.org/lkml/2011/2/10/187
> >>
> >> Can someone tell me if the same issue because I am getting the
> >> problem on Intel Xeon..
> >>
> >
> > I am seeing exactly the same problem (on 2.6.35 as Preeti reported originally)
> > on some Xeon servers but only with recently shipped BIOS revisions. The OS is
> > CentOS 5.5.
> >
> ...
> > I have not tried the following patches yet which seem to both be for spurious
> > NMI messages, not accompanied by system lockups:
> >
> > https://lkml.org/lkml/2011/2/16/106
> > https://lkml.org/lkml/2011/2/1/286
> >
> > Both nmi_watchdog=0 and pcie_aspm=off options do not solve the problem.
> >
> > I am not subscribed so please Cc me.

Given 2.6.35, has anyone tried applying the following patch?

https://patchwork.kernel.org/patch/23985/

It turned out to resolve an otherwise mysterious RCU CPU stall warning
for someone running 2.6.36, IIRC.

Thanx, Paul

2011-02-18 20:44:48

by Underwood, Ryan

Subject: RE: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

>
> Given 2.6.35, has anyone tried applying the following patch?
>
> https://patchwork.kernel.org/patch/23985/
>
> It turned out to resolve an otherwise mysterious RCU CPU stall warning
> for someone running 2.6.36, IIRC.
>

Now I've tried 2.6.38-rc5 which already includes that patch, and the
same problems remain. It also includes the following patch that Preeti
seems to have had some success with on 2.6.35, so my problem must really
be elsewhere:
https://lkml.org/lkml/2011/1/6/131

Since a previous BIOS version is known to work I may end up having to
do some BIOS-bisecting today...

2011-02-21 07:01:30

by Preeti Khurana

Subject: RE: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

> >
> > Given 2.6.35, has anyone tried applying the following patch?
> >
> > https://patchwork.kernel.org/patch/23985/
> >
> > It turned out to resolve an otherwise mysterious RCU CPU stall warning
> > for someone running 2.6.36, IIRC.
> >
>
This fix is already in 2.6.35, so that doesn't seem to be the issue.

> Now I've tried 2.6.38-rc5 which already includes that patch, and the same
> problems remain. It also includes the following patch that Preeti seems to
> have had some success with on 2.6.35, so my problem must really be
> elsewhere:
> https://lkml.org/lkml/2011/1/6/131
>
> Since a previous BIOS version is known to work I may end up having to do
> some BIOS-bisecting today...

Ryan,
I can't say that this patch (https://lkml.org/lkml/2011/1/6/131) worked for
me, since I cannot reproduce the problem reliably and am now not seeing it
even under the original kernel without the patch. I am still wondering what
triggers it.

2011-02-21 16:45:22

by Underwood, Ryan

Subject: RE: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.

> > Since a previous BIOS version is known to work I may end up having to do
> > some BIOS-bisecting today...
>
> Ryan,
> Cant say that this patch (https://lkml.org/lkml/2011/1/6/131) worked
> for me since I am not able to reproduce the problem quite reliably and now
> not getting the problem even under the original kernel without this patch.
> Just wondering what triggers this problem.

I found that even after downgrading to the same BIOS as the working systems,
the problem on the newer SR2500 systems remains! There must have been a
recent change in the hardware causing this, or some arcane BIOS setting
that I am overlooking. I still need to rule out our PCI hardware as the
source of the problem, since PCI parity errors seem to be a common source
of NMIs, but I thought there would be a standard error code in that case...
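
On the question of a "standard error code": as a rough sketch, the reason byte printed in these messages can be decoded as below. This assumes the byte is the NMI status read from port 0x61 on PC-compatible hardware, where bit 7 reports a system error (SERR#, historically memory parity) and bit 6 reports an I/O channel check. The enum and function names are hypothetical and are not the kernel's own.

```c
#include <stdint.h>

#define NMI_REASON_SERR  0x80  /* bit 7: system error / memory parity (assumed layout) */
#define NMI_REASON_IOCHK 0x40  /* bit 6: I/O channel check (assumed layout) */

/* Hypothetical classification, for illustration only. */
enum nmi_source {
	NMI_SRC_UNKNOWN,  /* neither hardware error bit is set */
	NMI_SRC_SERR,
	NMI_SRC_IOCHK,
};

static inline enum nmi_source classify_nmi_reason(uint8_t reason)
{
	if (reason & NMI_REASON_SERR)
		return NMI_SRC_SERR;
	if (reason & NMI_REASON_IOCHK)
		return NMI_SRC_IOCHK;
	/* Nothing claimed the NMI: this is the "unknown reason" case. */
	return NMI_SRC_UNKNOWN;
}
```

Under that assumption, every code reported in this thread (2d, 3c, 3d, 21, 31) has both bits clear and so lands in the unknown case, whereas an error signalled through SERR# would be expected to show up with bit 7 set, for example a value of 0x80 or above.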