Currently, a NMI handler for NMI watchdog may falsely handle any NMI
signaled for different purpose if CondChgd bit in
MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
This commit deals with the issue simply by ignoring CondChgd bit.
Here is explanation in detail.
On x86 NMI watchdog uses performance monitoring feature to
periodically signal NMI each time performance counter gets overflowed.
intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner
of a given NMI by looking at overflow status bits in
MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
handles the given NMI as its own NMI.
The problem is that intel_pmu_handle_irq() doesn't distinguish
CondChgd bit from other bits. Unlike the other status bits, CondChgd
bit doesn't represent overflow status for performance counters. Thus,
CondChgd bit cannot be thought of as a mark indicating a given NMI is
NMI watchdog's.
I noticed this behavior on systems with Ivy Bridge processors: Intel
Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
in the beginning at boot. (then the CondChgd bit is cleared by next
wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.)
On the other hand, on older processors such as Nehalem, CondChgd bit
is not set in the beginning at boot.
I'm not sure about exact behavior of CondChgd bit, in particular when
this bit is set. Although I read Intel System Programmer's Manual to
figure out but I have yet completed that. At least, I think ignoring
CondChgd bit should be enough for NMI watchdog perspective.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..07846d7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1382,6 +1382,15 @@ again:
intel_pmu_lbr_read();
/*
+ * CondChgd bit 63 doesn't mean any overflow status. Ignore
+ * and clear the bit.
+ */
+ if (__test_and_clear_bit(63, (unsigned long *)&status)) {
+ if (!status)
+ goto done;
+ }
+
+ /*
* PEBS overflow sets bit 62 in the global status register
*/
if (__test_and_clear_bit(62, (unsigned long *)&status)) {
On Wed, Jun 11, 2014 at 04:30:28PM +0900, HATAYAMA Daisuke wrote:
> Currently, a NMI handler for NMI watchdog may falsely handle any NMI
> signaled for different purpose if CondChgd bit in
> MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
>
> This commit deals with the issue simply by ignoring CondChgd bit.
>
> Here is explanation in detail.
>
> On x86 NMI watchdog uses performance monitoring feature to
> periodically signal NMI each time performance counter gets overflowed.
>
> intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
> handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner
> of a given NMI by looking at overflow status bits in
> MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
> handles the given NMI as its own NMI.
>
> The problem is that intel_pmu_handle_irq() doesn't distinguish
> CondChgd bit from other bits. Unlike the other status bits, CondChgd
> bit doesn't represent overflow status for performance counters. Thus,
> CondChgd bit cannot be thought of as a mark indicating a given NMI is
> NMI watchdog's.
So what was the problem? It ate another NMI?
> I noticed this behavior on systems with Ivy Bridge processors: Intel
> Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
> CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
> in the beginning at boot. (then the CondChgd bit is cleared by next
> wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.)
>
> On the other hand, on older processors such as Nehalem, CondChgd bit
> is not set in the beginning at boot.
>
> I'm not sure about exact behavior of CondChgd bit, in particular when
> this bit is set. Although I read Intel System Programmer's Manual to
> figure out but I have yet completed that. At least, I think ignoring
> CondChgd bit should be enough for NMI watchdog perspective.
So yes, the SDM lists the bit as existing but never once mentions it
outside of that, and its been doing that at least back to 2008.
Ooh, I found it:
"The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
indicate changes to the state of performance monitoring hardware (see
Figure 18-29)."
Which is of course completely useless, not to mention inconsistent with
the later CondChgd name.
HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
like having that set on boot?
On Wed, Jun 11, 2014 at 10:54:48AM +0200, Peter Zijlstra wrote:
> > I'm not sure about exact behavior of CondChgd bit, in particular when
> > this bit is set. Although I read Intel System Programmer's Manual to
> > figure out but I have yet completed that. At least, I think ignoring
> > CondChgd bit should be enough for NMI watchdog perspective.
>
> So yes, the SDM lists the bit as existing but never once mentions it
> outside of that, and its been doing that at least back to 2008.
>
> Ooh, I found it:
>
> "The IA32_PERF_GLOBAL_STATUS MSR also provides a ‘sticky bit’ to
> indicate changes to the state of performance monitoring hardware (see
> Figure 18-29)."
>
> Which is of course completely useless, not to mention inconsistent with
> the later CondChgd name.
>
> HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
> like having that set on boot?
Matt found in the MSR listing for GLOBAL_STATUS:
63 CondChg: status bits of this register has changed. If CPUID.0AH: EAX[7:0] > 0
Which brings us to a grand total of 3 different names for this bit.
If it indeed does what it says on the tin, set every time the status
changes its like the most useless bit ever and I wonder why people
bothered to spend silicon on it.
In any case, the proposed patch seems fine, just needs a better
changelog.
On Wed, 11 Jun, at 01:54:13PM, Peter Zijlstra wrote:
>
> Matt found in the MSR listing for GLOBAL_STATUS:
>
> 63 CondChg: status bits of this register has changed. If CPUID.0AH: EAX[7:0] > 0
>
> Which brings us to a grand total of 3 different names for this bit.
Right, I'm flinging emails around to get this fixed in the SDM, so that
the answer to,
Q: "How many kernel developers does it take to find the definition of
this bit?"
is less than 4.
--
Matt Fleming, Intel Open Source Technology Center
From: Peter Zijlstra <[email protected]>
Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Date: Wed, 11 Jun 2014 10:54:48 +0200
> On Wed, Jun 11, 2014 at 04:30:28PM +0900, HATAYAMA Daisuke wrote:
>> Currently, a NMI handler for NMI watchdog may falsely handle any NMI
>> signaled for different purpose if CondChgd bit in
>> MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
>>
>> This commit deals with the issue simply by ignoring CondChgd bit.
>>
>> Here is explanation in detail.
>>
>> On x86 NMI watchdog uses performance monitoring feature to
>> periodically signal NMI each time performance counter gets overflowed.
>>
>> intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
>> handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner
>> of a given NMI by looking at overflow status bits in
>> MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
>> handles the given NMI as its own NMI.
>>
>> The problem is that intel_pmu_handle_irq() doesn't distinguish
>> CondChgd bit from other bits. Unlike the other status bits, CondChgd
>> bit doesn't represent overflow status for performance counters. Thus,
>> CondChgd bit cannot be thought of as a mark indicating a given NMI is
>> NMI watchdog's.
>
> So what was the problem? It ate another NMI?
>
Yes.
The handler for NMI watchdog is called in NMI_LOCAL. This is done in
earlier timing than NMI_UNKNOWN, and if some of the handlers in
NMI_LOCAL handles at least one NMI, then handlers in NMI_UNKNOWN are
not called. Like this:
handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
if (handled) {
/*
* There are cases when a NMI handler handles multiple
* events in the current NMI. One of these events may
* be queued for in the next NMI. Because the event is
* already handled, the next NMI will result in an unknown
* NMI. Instead lets flag this for a potential NMI to
* swallow.
*/
if (handled > 1)
__this_cpu_write(swallow_nmi, true);
return;
}
For example, we use unknown NMI to make system panic by enabling
kernel.unkown_nmi_panic. Without this fix, unknown NMI is robbed by
NMI handler for NMI watchdog.
>> I noticed this behavior on systems with Ivy Bridge processors: Intel
>> Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
>> CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
>> in the beginning at boot. (then the CondChgd bit is cleared by next
>> wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.)
>>
>> On the other hand, on older processors such as Nehalem, CondChgd bit
>> is not set in the beginning at boot.
>>
>> I'm not sure about exact behavior of CondChgd bit, in particular when
>> this bit is set. Although I read Intel System Programmer's Manual to
>> figure out but I have yet completed that. At least, I think ignoring
>> CondChgd bit should be enough for NMI watchdog perspective.
>
> So yes, the SDM lists the bit as existing but never once mentions it
> outside of that, and its been doing that at least back to 2008.
>
> Ooh, I found it:
>
> "The IA32_PERF_GLOBAL_STATUS MSR also provides a ?sticky bit? to
> indicate changes to the state of performance monitoring hardware (see
> Figure 18-29)."
>
> Which is of course completely useless, not to mention inconsistent with
> the later CondChgd name.
>
> HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
> like having that set on boot?
--
Thanks.
HATAYAMA, Daisuke
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
From: Peter Zijlstra <[email protected]>
Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Date: Wed, 11 Jun 2014 13:54:13 +0200
> On Wed, Jun 11, 2014 at 10:54:48AM +0200, Peter Zijlstra wrote:
>> > I'm not sure about exact behavior of CondChgd bit, in particular when
>> > this bit is set. Although I read Intel System Programmer's Manual to
>> > figure out but I have yet completed that. At least, I think ignoring
>> > CondChgd bit should be enough for NMI watchdog perspective.
>>
>> So yes, the SDM lists the bit as existing but never once mentions it
>> outside of that, and its been doing that at least back to 2008.
>>
>> Ooh, I found it:
>>
>> "The IA32_PERF_GLOBAL_STATUS MSR also provides a ?sticky bit? to
>> indicate changes to the state of performance monitoring hardware (see
>> Figure 18-29)."
>>
>> Which is of course completely useless, not to mention inconsistent with
>> the later CondChgd name.
>>
>> HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
>> like having that set on boot?
>
> Matt found in the MSR listing for GLOBAL_STATUS:
>
> 63 CondChg: status bits of this register has changed. If CPUID.0AH: EAX[7:0] > 0
>
> Which brings us to a grand total of 3 different names for this bit.
>
> If it indeed does what it says on the tin, set every time the status
> changes its like the most useless bit ever and I wonder why people
> bothered to spend silicon on it.
>
Yes, I didn't mention in patch description this, but I reached the
same conclusion. The description confuses me because the desciption
and the behaviour of CondChg bit I see on the actual system is not
coincide.
Also, I checked cpuid on the system with Neharlem processor where I
have never seen CondChg bit is set.
[root@localhost ~]# ./cpuid -r
CPU 0:
0x00000000 0x00: eax=0x0000000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
0x00000001 0x00: eax=0x000206e6 ebx=0x40200800 ecx=0x00bce3bd edx=0xbfebfbff
<snip>
0x0000000a 0x00: eax=0x07300403 ebx=0x00000044 ecx=0x00000000 edx=0x00000603
^^^^^^^^^^^^^^
So, cpuid tells that CondChg bit is supported on this processor, too.
> In any case, the proposed patch seems fine, just needs a better
> changelog.
>
I see.
I'll write that the problem is that any NMI could be robbed by NMI
watchdog explicitly. Now only patch title says this explicitly. This
is your first comment.
About CondChgd bit, I cannot write more than I see on actual
system. If it's necessary to describe more about CondChgd bit, it
would be appreciated if someone tell me more information about it.
Thanks.
HATAYAMA, Daisuke
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On Thu, Jun 12, 2014 at 04:00:11PM +0900, HATAYAMA Daisuke wrote:
> Also, I checked cpuid on the system with Neharlem processor where I
> have never seen CondChg bit is set.
>
> [root@localhost ~]# ./cpuid -r
> CPU 0:
> 0x00000000 0x00: eax=0x0000000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
> 0x00000001 0x00: eax=0x000206e6 ebx=0x40200800 ecx=0x00bce3bd edx=0xbfebfbff
> <snip>
> 0x0000000a 0x00: eax=0x07300403 ebx=0x00000044 ecx=0x00000000 edx=0x00000603
> ^^^^^^^^^^^^^^
> So, cpuid tells that CondChg bit is supported on this processor, too.
Yeah, I can't remember ever seeing that bit on nhm/wsm either. Weird
stuff that.
> > In any case, the proposed patch seems fine, just needs a better
> > changelog.
> >
>
> I see.
>
> I'll write that the problem is that any NMI could be robbed by NMI
> watchdog explicitly. Now only patch title says this explicitly. This
> is your first comment.
Yeah, since that is the actual problem, its good to be clear on that.
> About CondChgd bit, I cannot write more than I see on actual
> system. If it's necessary to describe more about CondChgd bit, it
> would be appreciated if someone tell me more information about it.
I think we've found all 2 sentences the SDM has about that and unless
someone from Intel is going to come and explain why they wasted precious
silicon on this I suppose it will remain a mystery. No need to update on
that.
On Thu, Jun 12, 2014 at 09:37:16AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 12, 2014 at 04:00:11PM +0900, HATAYAMA Daisuke wrote:
> > Also, I checked cpuid on the system with Neharlem processor where I
> > have never seen CondChg bit is set.
> >
> > [root@localhost ~]# ./cpuid -r
> > CPU 0:
> > 0x00000000 0x00: eax=0x0000000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
> > 0x00000001 0x00: eax=0x000206e6 ebx=0x40200800 ecx=0x00bce3bd edx=0xbfebfbff
> > <snip>
> > 0x0000000a 0x00: eax=0x07300403 ebx=0x00000044 ecx=0x00000000 edx=0x00000603
> > ^^^^^^^^^^^^^^
> > So, cpuid tells that CondChg bit is supported on this processor, too.
>
> Yeah, I can't remember ever seeing that bit on nhm/wsm either. Weird
> stuff that.
>
> > > In any case, the proposed patch seems fine, just needs a better
> > > changelog.
> > >
> >
> > I see.
> >
> > I'll write that the problem is that any NMI could be robbed by NMI
> > watchdog explicitly. Now only patch title says this explicitly. This
> > is your first comment.
>
> Yeah, since that is the actual problem, its good to be clear on that.
>
> > About CondChgd bit, I cannot write more than I see on actual
> > system. If it's necessary to describe more about CondChgd bit, it
> > would be appreciated if someone tell me more information about it.
>
> I think we've found all 2 sentences the SDM has about that and unless
> someone from Intel is going to come and explain why they wasted precious
> silicon on this I suppose it will remain a mystery. No need to update on
> that.
Just to add to the mix, we (Red Hat) has a customer with the same
problem. I told them to fight it out with Intel to figure out why that
bit is non-zero at boot. Partly because I didn't feel like send a patch
upstream and feel the wrath of Peter Z. descend upon me. :-)
So if this patch is acceptable, I would ack it as it fixes our customer's
problem too.
Cheers,
Don
On Thu, Jun 12, 2014 at 09:37:16AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 12, 2014 at 04:00:11PM +0900, HATAYAMA Daisuke wrote:
> > Also, I checked cpuid on the system with Neharlem processor where I
> > have never seen CondChg bit is set.
> >
> > [root@localhost ~]# ./cpuid -r
> > CPU 0:
> > 0x00000000 0x00: eax=0x0000000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
> > 0x00000001 0x00: eax=0x000206e6 ebx=0x40200800 ecx=0x00bce3bd edx=0xbfebfbff
> > <snip>
> > 0x0000000a 0x00: eax=0x07300403 ebx=0x00000044 ecx=0x00000000 edx=0x00000603
> > ^^^^^^^^^^^^^^
> > So, cpuid tells that CondChg bit is supported on this processor, too.
>
> Yeah, I can't remember ever seeing that bit on nhm/wsm either. Weird
> stuff that.
Just to add before I forget, this problem has been around for awhile as it
was explained to me. The reason why it was never reported is because (in
our customer case), the nmi_watchdog clears the register after about 10
seconds after boot. Most machines do not tend to send external NMIs the
first 10 seconds after booting. Our customer saw it because he happened
to purposely press his external NMI button to trigger a panic with the
nmi_watchdog disabled and the watchdog happened to be disabled because we
were debugging a kdump problem.
Cheers,
Don
>
> > > In any case, the proposed patch seems fine, just needs a better
> > > changelog.
> > >
> >
> > I see.
> >
> > I'll write that the problem is that any NMI could be robbed by NMI
> > watchdog explicitly. Now only patch title says this explicitly. This
> > is your first comment.
>
> Yeah, since that is the actual problem, its good to be clear on that.
>
> > About CondChgd bit, I cannot write more than I see on actual
> > system. If it's necessary to describe more about CondChgd bit, it
> > would be appreciated if someone tell me more information about it.
>
> I think we've found all 2 sentences the SDM has about that and unless
> someone from Intel is going to come and explain why they wasted precious
> silicon on this I suppose it will remain a mystery. No need to update on
> that.