This patch series addresses issues with increased NMI latency in newer
AMD processors that can result in unknown NMI messages when PMC counters
are active.
The following fixes are included in this series:
- Resolve a race condition when disabling an overflowed PMC counter,
specifically when updating the PMC counter with a new value.
- Resolve handling of multiple active PMC counter overflows in the perf
NMI handler and when to report that the NMI is not related to a PMC.
If everything looks ok, I'll re-submit without the RFC tag. Definitely
looks cleaner with the AMD specific callback functions.
---
Changes from v1 (based on feedback from Peter Z):
- Created an AMD specific disable_all callback function to handle the
disabling of the counters and resolve the race condition
- Created an AMD specific handle_irq callback function that invokes the
common x86_pmu_handle_irq() function and then performs the NMI latency
mitigation.
- Take into account the possibility of non-perf NMI sources when applying
the mitigation.
This patch series is based off of the perf/core branch of tip:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
Commit c978b9460fe1 ("Merge tag 'perf-core-for-mingo-5.1-20190225' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core")
On AMD processors, the detection of an overflowed counter in the NMI
handler relies on the current value of the counter. So, for example, to
check for overflow on a 48 bit counter, bit 47 is checked to see if it
is 1 (not overflowed) or 0 (overflowed).
There is currently a race condition present when disabling and then
updating the PMC. Increased NMI latency in newer AMD processors makes this
race condition more pronounced. If the counter value has overflowed, it is
possible to update the PMC value before the NMI handler can run. The
updated PMC value is not an overflowed value, so when the perf NMI handler
does run, it will not find an overflowed counter. This may appear as an
unknown NMI resulting in either a panic or a series of messages, depending
on how the kernel is configured.
To eliminate this race condition, the PMC value must be checked after
disabling the counter. Add an AMD function, amd_pmu_disable_all(), to be
used in place of the common x86_pmu_disable_all() that, after disabling
the counters, will wait for the NMI handler to reset any active and
enabled overflowed counter.
Cc: <[email protected]> # 4.14.x-
Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/events/amd/core.c | 85 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 82 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 7d2d7c801dba..732402dcffdc 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -3,6 +3,7 @@
#include <linux/types.h>
#include <linux/init.h>
#include <linux/slab.h>
+#include <linux/delay.h>
#include <asm/apicdef.h>
#include "../perf_event.h"
@@ -429,6 +430,84 @@ static void amd_pmu_cpu_dead(int cpu)
}
}
+/*
+ * When a PMC counter overflows, an NMI is used to process the event and
+ * reset the counter. NMI latency can result in the counter being updated
+ * before the NMI can run, which can result in what appear to be spurious
+ * NMIs. This function is intended to wait for the NMI to run and reset
+ * the counter to avoid possible unhandled NMI messages.
+ */
+#define OVERFLOW_WAIT_COUNT 50
+static void amd_pmu_wait_on_overflow(int idx)
+{
+ unsigned int i;
+ u64 counter;
+
+ /*
+ * Wait for the counter to be reset if it has overflowed. This loop
+ * should exit very, very quickly, but just in case, don't wait
+ * forever...
+ */
+ for (i = 0; i < OVERFLOW_WAIT_COUNT; i++) {
+ rdmsrl(x86_pmu_event_addr(idx), counter);
+ if (counter & (1ULL << (x86_pmu.cntval_bits - 1)))
+ break;
+
+ /* Might be in IRQ context, so can't sleep */
+ udelay(1);
+ }
+}
+
+void amd_pmu_disable_all(void)
+{
+ unsigned long overflow_check[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ int idx;
+
+ bitmap_zero(overflow_check, X86_PMC_IDX_MAX);
+
+ for (idx = 0; idx < x86_pmu.num_counters; idx++) {
+ u64 val;
+
+ if (!test_bit(idx, cpuc->active_mask))
+ continue;
+
+ rdmsrl(x86_pmu_config_addr(idx), val);
+ if (!(val & ARCH_PERFMON_EVENTSEL_ENABLE))
+ continue;
+
+ val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
+ wrmsrl(x86_pmu_config_addr(idx), val);
+
+ /*
+ * If the interrupt is enabled, this counter must be checked
+ * for an overflow condition to avoid possibly changing the
+ * counter value before the NMI handler runs.
+ */
+ if (val & ARCH_PERFMON_EVENTSEL_INT)
+ __set_bit(idx, overflow_check);
+ }
+
+ /*
+ * This shouldn't be called from NMI context, but add a safeguard here
+ * to return, since if we're in NMI context we can't wait for an NMI
+ * to reset an overflowed counter value.
+ */
+ if (in_nmi())
+ return;
+
+ /*
+ * Check each counter for overflow and wait for it to be reset by the
+ * NMI if it has overflowed.
+ */
+ for (idx = 0; idx < x86_pmu.num_counters; idx++) {
+ if (!test_bit(idx, overflow_check))
+ continue;
+
+ amd_pmu_wait_on_overflow(idx);
+ }
+}
+
static struct event_constraint *
amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
struct perf_event *event)
@@ -622,7 +701,7 @@ static ssize_t amd_event_sysfs_show(char *page, u64 config)
static __initconst const struct x86_pmu amd_pmu = {
.name = "AMD",
.handle_irq = x86_pmu_handle_irq,
- .disable_all = x86_pmu_disable_all,
+ .disable_all = amd_pmu_disable_all,
.enable_all = x86_pmu_enable_all,
.enable = x86_pmu_enable_event,
.disable = x86_pmu_disable_event,
@@ -732,7 +811,7 @@ void amd_pmu_enable_virt(void)
cpuc->perf_ctr_virt_mask = 0;
/* Reload all events */
- x86_pmu_disable_all();
+ amd_pmu_disable_all();
x86_pmu_enable_all(0);
}
EXPORT_SYMBOL_GPL(amd_pmu_enable_virt);
@@ -750,7 +829,7 @@ void amd_pmu_disable_virt(void)
cpuc->perf_ctr_virt_mask = AMD64_EVENTSEL_HOSTONLY;
/* Reload all events */
- x86_pmu_disable_all();
+ amd_pmu_disable_all();
x86_pmu_enable_all(0);
}
EXPORT_SYMBOL_GPL(amd_pmu_disable_virt);
On AMD processors, the detection of an overflowed PMC counter in the NMI
handler relies on the current value of the PMC. So, for example, to check
for overflow on a 48-bit counter, bit 47 is checked to see if it is 1 (not
overflowed) or 0 (overflowed).
When the perf NMI handler executes it does not know in advance which PMC
counters have overflowed. As such, the NMI handler will process all active
PMC counters that have overflowed. NMI latency in newer AMD processors can
result in multiple overflowed PMC counters being processed in one NMI and
then a subsequent NMI, that does not appear to be a back-to-back NMI, not
finding any PMC counters that have overflowed. This may appear to be an
unhandled NMI resulting in either a panic or a series of messages,
depending on how the kernel was configured.
To mitigate this issue, add an AMD handle_irq callback function,
amd_pmu_handle_irq(), that will invoke the common x86_pmu_handle_irq()
function and upon return perform some additional processing that will
indicate if the NMI has been handled or would have been handled had an
earlier NMI not handled the overflowed PMC. Using a per-CPU variable, a
minimum value of the number of active PMCs or 2 will be set whenever a
PMC is active. This is used to indicate the possible number of NMIs that
can still occur. The value of 2 is used for when an NMI does not arrive
at the LAPIC in time to be collapsed into an already pending NMI. Each
time the function is called without having handled an overflowed counter,
the per-CPU value is checked. If the value is non-zero, it is decremented
and the NMI indicates that it handled the NMI. If the value is zero, then
the NMI indicates that it did not handle the NMI.
This issue is different from a previous issue related to perf NMIs that
was fixed in commit:
63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
The difference here is that the NMI latency can contribute to what appear
to be spurious NMIs during the handling of PMC counter overflow while the
counter is active as opposed to when the counter is being disabled.
Cc: <[email protected]> # 4.14.x-
Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/events/amd/core.c | 58 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 732402dcffdc..3bc7a92de8af 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -5,9 +5,12 @@
#include <linux/slab.h>
#include <linux/delay.h>
#include <asm/apicdef.h>
+#include <asm/nmi.h>
#include "../perf_event.h"
+static DEFINE_PER_CPU(unsigned int, perf_nmi_counter);
+
static __initconst const u64 amd_hw_cache_event_ids
[PERF_COUNT_HW_CACHE_MAX]
[PERF_COUNT_HW_CACHE_OP_MAX]
@@ -508,6 +511,59 @@ void amd_pmu_disable_all(void)
}
}
+/*
+ * Because of NMI latency, if multiple PMC counters are active or other sources
+ * of NMIs are received, the perf NMI handler can handle one or more overflowed
+ * PMC counters outside of the NMI associated with the PMC overflow. If the NMI
+ * doesn't arrive at the LAPIC in time to become a pending NMI, then the kernel
+ * back-to-back NMI support won't be active. This PMC handler needs to take into
+ * account that this can occur, otherwise this could result in unknown NMI
+ * messages being issued. Examples of this is PMC overflow while in the NMI
+ * handler when multiple PMCs are active or PMC overflow while handling some
+ * other source of an NMI.
+ *
+ * Attempt to mitigate this by using the number of active PMCs to determine
+ * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
+ * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of the
+ * number of active PMCs or 2. The value of 2 is used in case an NMI does not
+ * arrive at the LAPIC in time to be collapsed into an already pending NMI.
+ */
+static int amd_pmu_handle_irq(struct pt_regs *regs)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ int active, handled;
+
+ active = __bitmap_weight(cpuc->active_mask, X86_PMC_IDX_MAX);
+ handled = x86_pmu_handle_irq(regs);
+
+ /*
+ * If no counters are active, reset perf_nmi_counter and return
+ * NMI_DONE
+ */
+ if (!active) {
+ this_cpu_write(perf_nmi_counter, 0);
+ return NMI_DONE;
+ }
+
+ /*
+ * If a counter was handled, record the number of possible remaining
+ * NMIs that can occur.
+ */
+ if (handled) {
+ this_cpu_write(perf_nmi_counter,
+ min_t(unsigned int, 2, active));
+
+ return handled;
+ }
+
+ if (!this_cpu_read(perf_nmi_counter))
+ return NMI_DONE;
+
+ this_cpu_dec(perf_nmi_counter);
+
+ return NMI_HANDLED;
+}
+
static struct event_constraint *
amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
struct perf_event *event)
@@ -700,7 +756,7 @@ static ssize_t amd_event_sysfs_show(char *page, u64 config)
static __initconst const struct x86_pmu amd_pmu = {
.name = "AMD",
- .handle_irq = x86_pmu_handle_irq,
+ .handle_irq = amd_pmu_handle_irq,
.disable_all = amd_pmu_disable_all,
.enable_all = x86_pmu_enable_all,
.enable = x86_pmu_enable_event,
On Fri, Mar 15, 2019 at 08:40:53PM +0000, Lendacky, Thomas wrote:
> +void amd_pmu_disable_all(void)
> +{
> + unsigned long overflow_check[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + int idx;
> +
> + bitmap_zero(overflow_check, X86_PMC_IDX_MAX);
> +
> + for (idx = 0; idx < x86_pmu.num_counters; idx++) {
> + u64 val;
> +
> + if (!test_bit(idx, cpuc->active_mask))
> + continue;
> +
> + rdmsrl(x86_pmu_config_addr(idx), val);
> + if (!(val & ARCH_PERFMON_EVENTSEL_ENABLE))
> + continue;
> +
> + val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
> + wrmsrl(x86_pmu_config_addr(idx), val);
> +
> + /*
> + * If the interrupt is enabled, this counter must be checked
> + * for an overflow condition to avoid possibly changing the
> + * counter value before the NMI handler runs.
> + */
> + if (val & ARCH_PERFMON_EVENTSEL_INT)
> + __set_bit(idx, overflow_check);
> + }
I think you can ditch overflow_check and directly call
x86_pmu_disable_all() here.
> +
> + /*
> + * This shouldn't be called from NMI context, but add a safeguard here
> + * to return, since if we're in NMI context we can't wait for an NMI
> + * to reset an overflowed counter value.
> + */
> + if (in_nmi())
> + return;
> +
> + /*
> + * Check each counter for overflow and wait for it to be reset by the
> + * NMI if it has overflowed.
> + */
> + for (idx = 0; idx < x86_pmu.num_counters; idx++) {
> + if (!test_bit(idx, overflow_check))
And simply iterate cpuc->active_mask again here.
> + continue;
> +
> + amd_pmu_wait_on_overflow(idx);
> + }
> +}
Because, per x86_pmu_hw_config() we _always_ have EVENTSEL_INT set, even
for !sampling events -- such that we can deal with overflow, and we
should 'always' have EVENTSEL_ENABLE set when 'active', see
x86_pmu_enable_all().
On Fri, Mar 15, 2019 at 08:41:00PM +0000, Lendacky, Thomas wrote:
> This issue is different from a previous issue related to perf NMIs that
> was fixed in commit:
> 63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
>
> The difference here is that the NMI latency can contribute to what appear
> to be spurious NMIs during the handling of PMC counter overflow while the
> counter is active as opposed to when the counter is being disabled.
But could we not somehow merge these two cases? The cause is similar
IIUC. The PMI gets triggered, but then:
- a previous PMI handler handled our overflow, or
- our event gets disabled.
and when our PMI triggers it finds nothing to do.
> +/*
> + * Because of NMI latency, if multiple PMC counters are active or other sources
> + * of NMIs are received, the perf NMI handler can handle one or more overflowed
> + * PMC counters outside of the NMI associated with the PMC overflow. If the NMI
> + * doesn't arrive at the LAPIC in time to become a pending NMI, then the kernel
> + * back-to-back NMI support won't be active. This PMC handler needs to take into
> + * account that this can occur, otherwise this could result in unknown NMI
> + * messages being issued. Examples of this is PMC overflow while in the NMI
> + * handler when multiple PMCs are active or PMC overflow while handling some
> + * other source of an NMI.
> + *
> + * Attempt to mitigate this by using the number of active PMCs to determine
> + * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
> + * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of the
> + * number of active PMCs or 2. The value of 2 is used in case an NMI does not
> + * arrive at the LAPIC in time to be collapsed into an already pending NMI.
> + */
> +static int amd_pmu_handle_irq(struct pt_regs *regs)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + int active, handled;
> +
> + active = __bitmap_weight(cpuc->active_mask, X86_PMC_IDX_MAX);
> + handled = x86_pmu_handle_irq(regs);
> +
> + /*
> + * If no counters are active, reset perf_nmi_counter and return
> + * NMI_DONE
> + */
> + if (!active) {
> + this_cpu_write(perf_nmi_counter, 0);
> + return NMI_DONE;
> + }
This will actively render 63e6be6d98e1 void I think. Because that can
return !0 while !active -- that's rather the whole point of it.
> + /*
> + * If a counter was handled, record the number of possible remaining
> + * NMIs that can occur.
> + */
> + if (handled) {
> + this_cpu_write(perf_nmi_counter,
> + min_t(unsigned int, 2, active));
> +
> + return handled;
> + }
> +
> + if (!this_cpu_read(perf_nmi_counter))
> + return NMI_DONE;
> +
> + this_cpu_dec(perf_nmi_counter);
> +
> + return NMI_HANDLED;
> +}
On 3/18/19 4:37 AM, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 08:40:53PM +0000, Lendacky, Thomas wrote:
>> +void amd_pmu_disable_all(void)
>> +{
>> + unsigned long overflow_check[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
>> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> + int idx;
>> +
>> + bitmap_zero(overflow_check, X86_PMC_IDX_MAX);
>> +
>> + for (idx = 0; idx < x86_pmu.num_counters; idx++) {
>> + u64 val;
>> +
>> + if (!test_bit(idx, cpuc->active_mask))
>> + continue;
>> +
>> + rdmsrl(x86_pmu_config_addr(idx), val);
>> + if (!(val & ARCH_PERFMON_EVENTSEL_ENABLE))
>> + continue;
>> +
>> + val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
>> + wrmsrl(x86_pmu_config_addr(idx), val);
>> +
>> + /*
>> + * If the interrupt is enabled, this counter must be checked
>> + * for an overflow condition to avoid possibly changing the
>> + * counter value before the NMI handler runs.
>> + */
>> + if (val & ARCH_PERFMON_EVENTSEL_INT)
>> + __set_bit(idx, overflow_check);
>> + }
>
> I think you can ditch overflow_check and directly call
> x86_pmu_disable_all() here.
>
>> +
>> + /*
>> + * This shouldn't be called from NMI context, but add a safeguard here
>> + * to return, since if we're in NMI context we can't wait for an NMI
>> + * to reset an overflowed counter value.
>> + */
>> + if (in_nmi())
>> + return;
>> +
>> + /*
>> + * Check each counter for overflow and wait for it to be reset by the
>> + * NMI if it has overflowed.
>> + */
>> + for (idx = 0; idx < x86_pmu.num_counters; idx++) {
>> + if (!test_bit(idx, overflow_check))
>
> And simply iterate cpuc->active_mask again here.
>
>> + continue;
>> +
>> + amd_pmu_wait_on_overflow(idx);
>> + }
>> +}
>
> Because, per x86_pmu_hw_config() we _always_ have EVENTSEL_INT set, even
> for !sampling events -- such that we can deal with overflow, and we
> should 'always' have EVENTSEL_ENABLE set when 'active', see
> x86_pmu_enable_all().
That would make it much simpler. I will have to rely on those assumptions,
though, which I'm fine with.
I'm traveling this week so it might be a few days before I can get the
next version out.
Thanks,
Tom
>
>
On 3/18/19 5:40 AM, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 08:41:00PM +0000, Lendacky, Thomas wrote:
>> This issue is different from a previous issue related to perf NMIs that
>> was fixed in commit:
>> 63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
>>
>> The difference here is that the NMI latency can contribute to what appear
>> to be spurious NMIs during the handling of PMC counter overflow while the
>> counter is active as opposed to when the counter is being disabled.
>
> But could we not somehow merge these two cases? The cause is similar
> IIUC. The PMI gets triggered, but then:
>
> - a previous PMI handler handled our overflow, or
> - our event gets disabled.
>
> and when our PMI triggers it finds nothing to do.
Yes, let me look into what we can do here.
>
>> +/*
>> + * Because of NMI latency, if multiple PMC counters are active or other sources
>> + * of NMIs are received, the perf NMI handler can handle one or more overflowed
>> + * PMC counters outside of the NMI associated with the PMC overflow. If the NMI
>> + * doesn't arrive at the LAPIC in time to become a pending NMI, then the kernel
>> + * back-to-back NMI support won't be active. This PMC handler needs to take into
>> + * account that this can occur, otherwise this could result in unknown NMI
>> + * messages being issued. Examples of this is PMC overflow while in the NMI
>> + * handler when multiple PMCs are active or PMC overflow while handling some
>> + * other source of an NMI.
>> + *
>> + * Attempt to mitigate this by using the number of active PMCs to determine
>> + * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
>> + * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of the
>> + * number of active PMCs or 2. The value of 2 is used in case an NMI does not
>> + * arrive at the LAPIC in time to be collapsed into an already pending NMI.
>> + */
>> +static int amd_pmu_handle_irq(struct pt_regs *regs)
>> +{
>> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> + int active, handled;
>> +
>> + active = __bitmap_weight(cpuc->active_mask, X86_PMC_IDX_MAX);
>> + handled = x86_pmu_handle_irq(regs);
>> +
>> + /*
>> + * If no counters are active, reset perf_nmi_counter and return
>> + * NMI_DONE
>> + */
>> + if (!active) {
>> + this_cpu_write(perf_nmi_counter, 0);
>> + return NMI_DONE;
>> + }
>
> This will actively render 63e6be6d98e1 void I think. Because that can
> return !0 while !active -- that's rather the whole point of it.
Yup, I didn't take that into account when I changed that. Let me take a
closer look at all this along with your suggestion from the other email.
I'm traveling this week, so it might be a few days.
Thanks,
Tom
>
>> + /*
>> + * If a counter was handled, record the number of possible remaining
>> + * NMIs that can occur.
>> + */
>> + if (handled) {
>> + this_cpu_write(perf_nmi_counter,
>> + min_t(unsigned int, 2, active));
>> +
>> + return handled;
>> + }
>> +
>> + if (!this_cpu_read(perf_nmi_counter))
>> + return NMI_DONE;
>> +
>> + this_cpu_dec(perf_nmi_counter);
>> +
>> + return NMI_HANDLED;
>> +}
>
>