2018-06-03 18:24:59

by David Arcari

[permalink] [raw]
Subject: [PATCH] perf/x86: read the FREEZE_WHILE_SMM bit during boot

On some systems pressing the external NMI button is now failing to inject
an NMI 5-10% of the time. This causes confusion for a user that expects
the NMI to dump the system.

Commit 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI")
does not read the firmware setting of the FREEZE_WHILE_SMM bit and will
always clear it when the PMU is initialized. As a result the performance
counters will always run and that greatly expands the race in which
external NMI will not be processed if a local NMI is already being
processed.

One option is to change default_do_nmi(). The code snippet below shows the
relevant portion of a patch that resolves the issue, but it is problematic
from a performance perspective and was dismissed.

-345,7 +345,17 @@ static void default_do_nmi(struct pt_regs *regs)
*/
if (handled > 1)
__this_cpu_write(swallow_nmi, true);
- return;
+
+ /*
+ * Unfortunately, there is a race condition which can
+ * result in a missing an external NMI. Typically, an
+ * external NMI is processed on cpu 0. Therefore, on
+ * cpu 0 check for an external NMI before returning.
+ */
+ if (smp_processor_id() ||
+ (x86_platform.get_nmi_reason() & NMI_REASON_MASK) == 0) {
+ return;
+ }
}

Ultimately, the issue can be resolved by storing the default firmware
setting of FREEZE_WHILE_SMM before initializing the PMU.

Fixes: 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI")

Signed-off-by: David Arcari <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Donald Zickus <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Jerry Hoemann <[email protected]>
---
arch/x86/events/intel/core.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 707b2a9..fce98df 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3325,6 +3325,18 @@ static void flip_smm_bit(void *data)
}
}

+static int read_smm_bit(void)
+{
+ u64 val;
+
+ if (!rdmsrl_safe(MSR_IA32_DEBUGCTLMSR, &val)) {
+ if (val & DEBUGCTLMSR_FREEZE_IN_SMM)
+ return 1;
+ }
+
+ return 0;
+}
+
static void intel_pmu_cpu_starting(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
@@ -4423,6 +4435,8 @@ __init int intel_pmu_init(void)
pr_cont("full-width counters, ");
}

+ x86_pmu.attr_freeze_on_smi = read_smm_bit();
+
kfree(to_free);
return 0;
}
--
1.8.3.1



2018-06-04 08:24:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/x86: read the FREEZE_WHILE_SMM bit during boot

On Sun, Jun 03, 2018 at 02:23:43PM -0400, David Arcari wrote:
> On some systems pressing the external NMI button is now failing to inject
> an NMI 5-10% of the time. This causes confusion for a user that expects
> the NMI to dump the system.
>
> Commit 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI")
> does not read the firmware setting of the FREEZE_WHILE_SMM bit and will
> always clear it when the PMU is initialized. As a result the performance
> counters will always run and that greatly expands the race in which
> external NMI will not be processed if a local NMI is already being
> processed.
>
> One option is to change default_do_nmi(). The code snippet below shows the
> relevant portion of a patch that resolves the issue, but it is problematic
> from a performance perspective and was dismissed.
>
> -345,7 +345,17 @@ static void default_do_nmi(struct pt_regs *regs)
> */
> if (handled > 1)
> __this_cpu_write(swallow_nmi, true);
> - return;
> +
> + /*
> + * Unfortunately, there is a race condition which can
> + * result in a missing an external NMI. Typically, an
> + * external NMI is processed on cpu 0. Therefore, on
> + * cpu 0 check for an external NMI before returning.
> + */
> + if (smp_processor_id() ||
> + (x86_platform.get_nmi_reason() & NMI_REASON_MASK) == 0) {
> + return;
> + }
> }
>
> Ultimately, the issue can be resolved by storing the default firmware
> setting of FREEZE_WHILE_SMM before initializing the PMU.

I'm sorry, I know it's Monday morning, but what?! I really don't
understand anything you write there.

Maybe if you explain the race and how your proposed fix closes it things
will make sense. The above refers to too many things not here.

2018-06-04 14:13:32

by David Arcari

[permalink] [raw]
Subject: Re: [PATCH] perf/x86: read the FREEZE_WHILE_SMM bit during boot

On 06/04/2018 04:24 AM, Peter Zijlstra wrote:
> On Sun, Jun 03, 2018 at 02:23:43PM -0400, David Arcari wrote:
>> On some systems pressing the external NMI button is now failing to inject
>> an NMI 5-10% of the time. This causes confusion for a user that expects
>> the NMI to dump the system.
>>
>> Commit 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI")
>> does not read the firmware setting of the FREEZE_WHILE_SMM bit and will
>> always clear it when the PMU is initialized. As a result the performance
>> counters will always run and that greatly expands the race in which
>> external NMI will not be processed if a local NMI is already being
>> processed.
>>
>> One option is to change default_do_nmi(). The code snippet below shows the
>> relevant portion of a patch that resolves the issue, but it is problematic
>> from a performance perspective and was dismissed.
>>
>> -345,7 +345,17 @@ static void default_do_nmi(struct pt_regs *regs)
>> */
>> if (handled > 1)
>> __this_cpu_write(swallow_nmi, true);
>> - return;
>> +
>> + /*
>> + * Unfortunately, there is a race condition which can
>> + * result in a missing an external NMI. Typically, an
>> + * external NMI is processed on cpu 0. Therefore, on
>> + * cpu 0 check for an external NMI before returning.
>> + */
>> + if (smp_processor_id() ||
>> + (x86_platform.get_nmi_reason() & NMI_REASON_MASK) == 0) {
>> + return;
>> + }
>> }
>>
>> Ultimately, the issue can be resolved by storing the default firmware
>> setting of FREEZE_WHILE_SMM before initializing the PMU.
>
> I'm sorry, I know it's Monday morning, but what?! I really don't
> understand anything you write there.
>
> Maybe if you explain the race and how your proposed fix closes it things
> will make sense. The above refers to too many things not here.
>

Sorry.

default_do_nmi() will process both perf events (local interrupts) as well as
external interrupts (such as the NMI button). The handler is coded such that
if a local interrupt occurs, no check is made for an external interrupt.
Therefore, if the two interrupts occur simultaneously, the external interrupt
is lost.

The code above, which was ultimately discounted, attempts to avoid this
scenario with as little performance impact as possible by reading the register
without the spinlock for cpu 0 only (currently only cpu 0 can handle an
external NMI, I verified this on my system by testing the NMI button with
cpu 0 offline).

The code above is problematic for a number of reasons not the least of which
is performance. Furthermore, I don't see a less intrusive solution wrt
do_default_nmi().

Upstream 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI")
appears to have made it relatively easy to hit this race condition. On some
systems, this commit has resulted in a change to the default firmware setting
of DEBUGCTLMSR_FREEZE_IN_SMM_BIT (it is now cleared by the OS by default).

With this bit cleared, the following situation occurs:

1) external NMI - due to io check
2) long duration SMI (counters do not freeze)
3) NMI handler runs and misattributes interrupt to perf event

Ultimately, my solution was to restore the previous behavior by reading and
storing the firmware setting of the bit rather than to always clear it.

2018-06-11 18:03:35

by David Arcari

[permalink] [raw]
Subject: Re: [PATCH] perf/x86: read the FREEZE_WHILE_SMM bit during boot

On 06/04/2018 10:12 AM, David Arcari wrote:
> On 06/04/2018 04:24 AM, Peter Zijlstra wrote:
>> On Sun, Jun 03, 2018 at 02:23:43PM -0400, David Arcari wrote:
>>> On some systems pressing the external NMI button is now failing to inject
>>> an NMI 5-10% of the time. This causes confusion for a user that expects
>>> the NMI to dump the system.
>>>
>>> Commit 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI")
>>> does not read the firmware setting of the FREEZE_WHILE_SMM bit and will
>>> always clear it when the PMU is initialized. As a result the performance
>>> counters will always run and that greatly expands the race in which
>>> external NMI will not be processed if a local NMI is already being
>>> processed.
>>>
>>> One option is to change default_do_nmi(). The code snippet below shows the
>>> relevant portion of a patch that resolves the issue, but it is problematic
>>> from a performance perspective and was dismissed.
>>>
>>> -345,7 +345,17 @@ static void default_do_nmi(struct pt_regs *regs)
>>> */
>>> if (handled > 1)
>>> __this_cpu_write(swallow_nmi, true);
>>> - return;
>>> +
>>> + /*
>>> + * Unfortunately, there is a race condition which can
>>> + * result in a missing an external NMI. Typically, an
>>> + * external NMI is processed on cpu 0. Therefore, on
>>> + * cpu 0 check for an external NMI before returning.
>>> + */
>>> + if (smp_processor_id() ||
>>> + (x86_platform.get_nmi_reason() & NMI_REASON_MASK) == 0) {
>>> + return;
>>> + }
>>> }
>>>
>>> Ultimately, the issue can be resolved by storing the default firmware
>>> setting of FREEZE_WHILE_SMM before initializing the PMU.
>>
>> I'm sorry, I know it's Monday morning, but what?! I really don't
>> understand anything you write there.
>>
>> Maybe if you explain the race and how your proposed fix closes it things
>> will make sense. The above refers to too many things not here.
>>
>
> Sorry.
>
> default_do_nmi() will process both perf events (local interrupts) as well as
> external interrupts (such as the NMI button). The handler is coded such that
> if a local interrupt occurs, no check is made for an external interrupt.
> Therefore, if the two interrupts occur simultaneously, the external interrupt
> is lost.
>
> The code above, which was ultimately discounted, attempts to avoid this
> scenario with as little performance impact as possible by reading the register
> without the spinlock for cpu 0 only (currently only cpu 0 can handle an
> external NMI, I verified this on my system by testing the NMI button with
> cpu 0 offline).
>
> The code above is problematic for a number of reasons not the least of which
> is performance. Furthermore, I don't see a less intrusive solution wrt
> do_default_nmi().
>
> Upstream 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI")
> appears to have made it relatively easy to hit this race condition. On some
> systems, this commit has resulted in a change to the default firmware setting
> of DEBUGCTLMSR_FREEZE_IN_SMM_BIT (it is now cleared by the OS by default).
>
> With this bit cleared, the following situation occurs:
>
> 1) external NMI - due to io check
> 2) long duration SMI (counters do not freeze)
> 3) NMI handler runs and misattributes interrupt to perf event
>
> Ultimately, my solution was to restore the previous behavior by reading and
> storing the firmware setting of the bit rather than to always clear it.
>

Hi Peter,

Have you had a chance to take a look at this?

Thanks,

-Dave

2018-06-12 16:58:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/x86: read the FREEZE_WHILE_SMM bit during boot

On Mon, Jun 04, 2018 at 10:12:20AM -0400, David Arcari wrote:
> default_do_nmi() will process both perf events (local interrupts) as well as
> external interrupts (such as the NMI button). The handler is coded such that
> if a local interrupt occurs, no check is made for an external interrupt.
> Therefore, if the two interrupts occur simultaneously, the external interrupt
> is lost.

ACK, NMI handling on x86 is less than ideal.

> The code above, which was ultimately discounted, attempts to avoid this
> scenario with as little performance impact as possible by reading the register
> without the spinlock for cpu 0 only (currently only cpu 0 can handle an
> external NMI, I verified this on my system by testing the NMI button with
> cpu 0 offline).
>
> The code above is problematic for a number of reasons not the least of which
> is performance. Furthermore, I don't see a less intrusive solution wrt
> do_default_nmi().

Right, because reading the register itself is dog slow IIRC.

> Upstream 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI")
> appears to have made it relatively easy to hit this race condition. On some
> systems, this commit has resulted in a change to the default firmware setting
> of DEBUGCTLMSR_FREEZE_IN_SMM_BIT (it is now cleared by the OS by default).
>
> With this bit cleared, the following situation occurs:
>
> 1) external NMI - due to io check
> 2) long duration SMI (counters do not freeze)
> 3) NMI handler runs and misattributes interrupt to perf event

I think 3 is wrong, because 2 will in fact have triggered a PMI (due to
long running) so 3 will observe a PMI and claim the NMI. No
misattribution what so ever.

Because if this wasn't the case, flipping FREEZE_IN_SMM wouldn't have
made a difference.

> Ultimately, my solution was to restore the previous behavior by reading and
> storing the firmware setting of the bit rather than to always clear it.

Ah, urgh.. what a mess. So the OS setting the bit to a known and
consistent value is 'good' IMO. The firmware magically frobbing things
is 'bad'.

Now, explain to me why an IO-check results in an external NMI, and why
there are long running SMI handlers around? Why can't the IO error not
be propagated through the regular device interrupt/state? Why are long
running SMIs required at all, ever? Why doesn't the OS handler whatever
it is the SMM does?

Are you not solving the wrong problem here?

2018-06-18 19:16:33

by David Arcari

[permalink] [raw]
Subject: Re: [PATCH] perf/x86: read the FREEZE_WHILE_SMM bit during boot

On 06/12/2018 12:56 PM, Peter Zijlstra wrote:

>
>> Ultimately, my solution was to restore the previous behavior by reading and
>> storing the firmware setting of the bit rather than to always clear it.
>
> Ah, urgh.. what a mess. So the OS setting the bit to a known and
> consistent value is 'good' IMO. The firmware magically frobbing things
> is 'bad'.

I had actually considered changing the code to enable the FREEZE_WHILE_SMM by
default, but decided against this approach as I was concerned that setting the
bit on a system where it is initially cleared by the firmware could also have
negative side effects.

>
> Now, explain to me why an IO-check results in an external NMI, and why
> there are long running SMI handlers around? Why can't the IO error not
> be propagated through the regular device interrupt/state? Why are long
> running SMIs required at all, ever? Why doesn't the OS handler whatever
> it is the SMM does?
>
> Are you not solving the wrong problem here?
>

I didn't think so.

1) This functionality was working reasonably well before this commit was
introduced into the OS.

2) As discussed, the problem cannot be addressed in the NMI handler.

3) The work around that I proposed is quite unobtrusive. You are correct in
this is not an actual "fix" since the problem will be present if the user
decides to change the setting of FREEZE_WHILE_SMM via sysfs, but at least
external NMIs are functional by default.

IIUC, you are proposing a complete rewrite of the external NMI infrastructure
along with modification to system firmware. I also believe this would be
somewhat problematic as an external NMI would not function when interrupts are
disabled.

Is there an alternate solution that could provide relief in the short term? I
think that what I have proposed accomplishes this, but perhaps there is a better
more palatable alternative.