This patch series addresses issues with increased NMI latency in newer
AMD processors that can result in unknown NMI messages when PMC counters
are active.
The following fixes are included in this series:
- Resolve a race condition when disabling an overflowed PMC counter,
specifically when updating the PMC counter with a new value.
- Resolve handling of multiple active PMC counter overflows in the perf
NMI handler and when to report that the NMI is not related to a PMC.
I'm looking for feedback on the direction used in these two patches. In
the first patch, the reset check loop typically ran only a single iteration
and very rarely ran 3 or 4 times. I also looked at an alternative to
the first patch where I set a per-CPU, per-PMC flag that can be checked
in the NMI handler when it is found that the PMC hasn't overflowed. That
would mean the sample would be lost, though (on average I was seeing about
a 0.20% sample loss that way).
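For reference, that alternative would have looked roughly like the below
(illustrative sketch only, not an actual patch; the names are made up):

static DEFINE_PER_CPU(unsigned long, perf_pmc_overflow_pending);

/* Called when disabling/updating a PMC whose current value shows an overflow */
static void amd_pmu_note_pending_overflow(int idx, u64 counter)
{
	/* High bit clear (e.g. bit 47 of a 48-bit counter) means overflowed */
	if (!(counter & (1ULL << (x86_pmu.cntval_bits - 1))))
		__set_bit(idx, this_cpu_ptr(&perf_pmc_overflow_pending));
}

/* Called from the NMI handler when a PMC does not show an overflow */
static bool amd_pmu_claim_pending_overflow(int idx)
{
	/* If the flag was set, the NMI is ours, but the sample is lost */
	return __test_and_clear_bit(idx, this_cpu_ptr(&perf_pmc_overflow_pending));
}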
---
This patch series is based off of the perf/core branch of tip:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
Commit c978b9460fe1 ("Merge tag 'perf-core-for-mingo-5.1-20190225' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core")
On AMD processors, the detection of an overflowed counter in the NMI
handler relies on the current value of the counter. So, for example, to
check for overflow on a 48 bit counter, bit 47 is checked to see if it
is 1 (not overflowed) or 0 (overflowed).
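In code, that check amounts to testing the top implemented counter bit,
e.g. (illustrative fragment only):

	u64 counter;
	bool overflowed;

	rdmsrl(x86_pmu_event_addr(idx), counter);
	/* For a 48-bit counter, cntval_bits is 48, so this tests bit 47 */
	overflowed = !(counter & (1ULL << (x86_pmu.cntval_bits - 1)));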
There is currently a race condition present when disabling and then
updating the PMC. Increased NMI latency in newer AMD processors makes this
race condition more pronounced. If the counter value has overflowed, it is
possible to update the PMC value before the NMI handler can run. The
updated PMC value is not an overflowed value, so when the perf NMI handler
does run, it will not find an overflowed counter. This may appear as an
unknown NMI resulting in either a panic or a series of messages, depending
on how the kernel is configured.
To eliminate this race condition, the PMC value must be checked after
disabling the counter in x86_pmu_disable_all(), and, if overflowed, must
wait for the NMI handler to reset the value before continuing. Add a new,
optional, callable function that can be used to test for and resolve this
condition.
Cc: <[email protected]> # 4.14.x-
Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/events/amd/core.c | 39 +++++++++++++++++++++++++++++++++++++++
arch/x86/events/core.c | 3 +++
arch/x86/events/perf_event.h | 1 +
3 files changed, 43 insertions(+)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 7d2d7c801dba..d989640fa87d 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -3,6 +3,7 @@
#include <linux/types.h>
#include <linux/init.h>
#include <linux/slab.h>
+#include <linux/delay.h>
#include <asm/apicdef.h>
#include "../perf_event.h"
@@ -429,6 +430,43 @@ static void amd_pmu_cpu_dead(int cpu)
}
}
+#define OVERFLOW_WAIT_COUNT 50
+
+static void amd_pmu_wait_on_overflow(int idx, u64 config)
+{
+ unsigned int i;
+ u64 counter;
+
+ /*
+ * We shouldn't be calling this from NMI context, but add a safeguard
+ * here to return, since if we're in NMI context we can't wait for an
+ * NMI to reset an overflowed counter value.
+ */
+ if (in_nmi())
+ return;
+
+ /*
+ * If the interrupt isn't enabled then we won't get the NMI that will
+ * reset the overflow condition, so return.
+ */
+ if (!(config & ARCH_PERFMON_EVENTSEL_INT))
+ return;
+
+ /*
+ * Wait for the counter to be reset if it has overflowed. This loop
+ * should exit very, very quickly, but just in case, don't wait
+ * forever...
+ */
+ for (i = 0; i < OVERFLOW_WAIT_COUNT; i++) {
+ rdmsrl(x86_pmu_event_addr(idx), counter);
+ if (counter & (1ULL << (x86_pmu.cntval_bits - 1)))
+ break;
+
+ /* Might be in IRQ context, so can't sleep */
+ udelay(1);
+ }
+}
+
static struct event_constraint *
amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
struct perf_event *event)
@@ -650,6 +688,7 @@ static __initconst const struct x86_pmu amd_pmu = {
.cpu_dead = amd_pmu_cpu_dead,
.amd_nb_constraints = 1,
+ .wait_on_overflow = amd_pmu_wait_on_overflow,
};
static int __init amd_core_pmu_init(void)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b684f0294f35..f1d2f70000cd 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -606,6 +606,9 @@ void x86_pmu_disable_all(void)
continue;
val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
wrmsrl(x86_pmu_config_addr(idx), val);
+
+ if (x86_pmu.wait_on_overflow)
+ x86_pmu.wait_on_overflow(idx, val);
}
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7e75f474b076..a37490a26a09 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -636,6 +636,7 @@ struct x86_pmu {
* AMD bits
*/
unsigned int amd_nb_constraints : 1;
+ void (*wait_on_overflow)(int idx, u64 config);
/*
* Extra registers for events
On AMD processors, the detection of an overflowed PMC counter in the NMI
handler relies on the current value of the PMC. So, for example, to check
for overflow on a 48-bit counter, bit 47 is checked to see if it is 1 (not
overflowed) or 0 (overflowed).
When the perf NMI handler executes it does not know in advance which PMC
counters have overflowed. As such, the NMI handler will process all active
PMC counters that have overflowed. NMI latency in newer AMD processors can
result in multiple overflowed PMC counters being processed in a single NMI,
with a subsequent NMI (one that does not appear to be a back-to-back NMI)
then finding no overflowed PMC counters. This may appear to be an
unhandled NMI, resulting in either a panic or a series of messages,
depending on how the kernel is configured.
To mitigate this issue, a new, optional, callable function is introduced
that is called before returning from x86_pmu_handle_irq(). The AMD perf
support will use this function to indicate if the NMI has been handled or
would have been handled had an earlier NMI not handled the overflowed PMC.
When any PMC overflow has been handled and more than one PMC is active, a
per-CPU variable is set to the lesser of one less than the number of active
PMCs or 2. This indicates the possible number of NMIs that can still occur.
The value of 2 is used in case an NMI does not arrive at the APIC in time
to be collapsed into an already pending NMI. Each time the function is
called without having handled an overflowed counter, the per-CPU value is
checked. If the value is non-zero, it is decremented and the handler reports
that the NMI was handled. If the value is zero, the handler reports that the
NMI was not handled.
Cc: <[email protected]> # 4.14.x-
Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/events/amd/core.c | 43 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/events/core.c | 6 ++++++
arch/x86/events/perf_event.h | 2 ++
3 files changed, 51 insertions(+)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index d989640fa87d..30457b511e6d 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -5,9 +5,12 @@
#include <linux/slab.h>
#include <linux/delay.h>
#include <asm/apicdef.h>
+#include <asm/nmi.h>
#include "../perf_event.h"
+static DEFINE_PER_CPU(unsigned int, perf_nmi_counter);
+
static __initconst const u64 amd_hw_cache_event_ids
[PERF_COUNT_HW_CACHE_MAX]
[PERF_COUNT_HW_CACHE_OP_MAX]
@@ -467,6 +470,45 @@ static void amd_pmu_wait_on_overflow(int idx, u64 config)
}
}
+/*
+ * Because of NMI latency, if multiple PMC counters are active we need to take
+ * into account that multiple PMC overflows can generate multiple NMIs but be
+ * handled by a single invocation of the NMI handler (think PMC overflow while
+ * in the NMI handler). This could result in subsequent unknown NMI messages
+ * being issued.
+ *
+ * Attempt to mitigate this by using the number of active PMCs to determine
+ * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
+ * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of one
+ * less than the number of active PMCs or 2. The value of 2 is used in case the
+ * NMI does not arrive at the APIC in time to be collapsed into an already
+ * pending NMI.
+ */
+static int amd_pmu_mitigate_nmi_latency(unsigned int active, int handled)
+{
+ /* If multiple counters are not active return original handled count */
+ if (active <= 1)
+ return handled;
+
+ /*
+ * If a counter was handled, record the number of possible remaining
+ * NMIs that can occur.
+ */
+ if (handled) {
+ this_cpu_write(perf_nmi_counter,
+ min_t(unsigned int, 2, active - 1));
+
+ return handled;
+ }
+
+ if (!this_cpu_read(perf_nmi_counter))
+ return NMI_DONE;
+
+ this_cpu_dec(perf_nmi_counter);
+
+ return NMI_HANDLED;
+}
+
static struct event_constraint *
amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
struct perf_event *event)
@@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
.amd_nb_constraints = 1,
.wait_on_overflow = amd_pmu_wait_on_overflow,
+ .mitigate_nmi_latency = amd_pmu_mitigate_nmi_latency,
};
static int __init amd_core_pmu_init(void)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f1d2f70000cd..a59c3fcbae6a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1434,6 +1434,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
struct perf_sample_data data;
struct cpu_hw_events *cpuc;
struct perf_event *event;
+ unsigned int active = 0;
int idx, handled = 0;
u64 val;
@@ -1461,6 +1462,8 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
continue;
}
+ active++;
+
event = cpuc->events[idx];
val = x86_perf_event_update(event);
@@ -1483,6 +1486,9 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
if (handled)
inc_irq_stat(apic_perf_irqs);
+ if (x86_pmu.mitigate_nmi_latency)
+ handled = x86_pmu.mitigate_nmi_latency(active, handled);
+
return handled;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a37490a26a09..619214bda92e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -637,6 +637,8 @@ struct x86_pmu {
*/
unsigned int amd_nb_constraints : 1;
void (*wait_on_overflow)(int idx, u64 config);
+ int (*mitigate_nmi_latency)(unsigned int active,
+ int handled);
/*
* Extra registers for events
On Mon, Mar 11, 2019 at 04:48:44PM +0000, Lendacky, Thomas wrote:
> On AMD processors, the detection of an overflowed counter in the NMI
> handler relies on the current value of the counter. So, for example, to
> check for overflow on a 48 bit counter, bit 47 is checked to see if it
> is 1 (not overflowed) or 0 (overflowed).
>
> There is currently a race condition present when disabling and then
> updating the PMC. Increased NMI latency in newer AMD processors makes this
> race condition more pronounced.
Increased NMI latency also makes the results less useful :/ What amount
of skid are we talking about, and is there anything AMD is going to do
about this?
> If the counter value has overflowed, it is
> possible to update the PMC value before the NMI handler can run.
Arguably the WRMSR should sync against the PMI. That is the behaviour
one would expect.
Isn't that something you can fix in ucode? And could you very please
tell the hardware people this is disgusting?
> The updated PMC value is not an overflowed value, so when the perf NMI
> handler does run, it will not find an overflowed counter. This may
> appear as an unknown NMI resulting in either a panic or a series of
> messages, depending on how the kernel is configured.
>
> To eliminate this race condition, the PMC value must be checked after
> disabling the counter in x86_pmu_disable_all(), and, if overflowed, must
> wait for the NMI handler to reset the value before continuing. Add a new,
> optional, callable function that can be used to test for and resolve this
> condition.
>
> Cc: <[email protected]> # 4.14.x-
> Signed-off-by: Tom Lendacky <[email protected]>
> +static void amd_pmu_wait_on_overflow(int idx, u64 config)
> +{
> + unsigned int i;
> + u64 counter;
> +
> + /*
> + * We shouldn't be calling this from NMI context, but add a safeguard
> + * here to return, since if we're in NMI context we can't wait for an
> + * NMI to reset an overflowed counter value.
> + */
> + if (in_nmi())
> + return;
> +
> + /*
> + * If the interrupt isn't enabled then we won't get the NMI that will
> + * reset the overflow condition, so return.
> + */
> + if (!(config & ARCH_PERFMON_EVENTSEL_INT))
> + return;
> +
> + /*
> + * Wait for the counter to be reset if it has overflowed. This loop
> + * should exit very, very quickly, but just in case, don't wait
> + * forever...
> + */
> + for (i = 0; i < OVERFLOW_WAIT_COUNT; i++) {
> + rdmsrl(x86_pmu_event_addr(idx), counter);
> + if (counter & (1ULL << (x86_pmu.cntval_bits - 1)))
> + break;
> +
> + /* Might be in IRQ context, so can't sleep */
> + udelay(1);
> + }
> +}
Argh.. that's horrible, as I'm sure you fully appreciate :/
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index b684f0294f35..f1d2f70000cd 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -606,6 +606,9 @@ void x86_pmu_disable_all(void)
> continue;
> val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
> wrmsrl(x86_pmu_config_addr(idx), val);
> +
> + if (x86_pmu.wait_on_overflow)
> + x86_pmu.wait_on_overflow(idx, val);
> }
> }
One alternative is adding amd_pmu_disable_all() to amd/core.c and using
that. Then you can also change the loop to do the wait after all the
WRMSRs, if that helps with latency.
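Something like this, perhaps (completely untested sketch, reusing the wait
helper from your first patch instead of the new callback):

static void amd_pmu_disable_all(void)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
	u64 config;
	int idx;

	/* Disable all the counters first, as the generic code does */
	x86_pmu_disable_all();

	/*
	 * Only after all the WRMSRs are done, wait for any overflowed
	 * counters to be reset by the (still pending) NMI.
	 */
	for (idx = 0; idx < x86_pmu.num_counters; idx++) {
		if (!test_bit(idx, cpuc->active_mask))
			continue;

		rdmsrl(x86_pmu_config_addr(idx), config);
		amd_pmu_wait_on_overflow(idx, config);
	}
}

with .disable_all = amd_pmu_disable_all wired up in amd_pmu instead of the
.wait_on_overflow hook.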
On Mon, Mar 11, 2019 at 04:48:51PM +0000, Lendacky, Thomas wrote:
> @@ -467,6 +470,45 @@ static void amd_pmu_wait_on_overflow(int idx, u64 config)
> }
> }
>
> +/*
> + * Because of NMI latency, if multiple PMC counters are active we need to take
> + * into account that multiple PMC overflows can generate multiple NMIs but be
> + * handled by a single invocation of the NMI handler (think PMC overflow while
> + * in the NMI handler). This could result in subsequent unknown NMI messages
> + * being issued.
> + *
> + * Attempt to mitigate this by using the number of active PMCs to determine
> + * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
> + * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of one
> + * less than the number of active PMCs or 2. The value of 2 is used in case the
> + * NMI does not arrive at the APIC in time to be collapsed into an already
> + * pending NMI.
LAPIC I really do hope?!
> + */
> +static int amd_pmu_mitigate_nmi_latency(unsigned int active, int handled)
> +{
> + /* If multiple counters are not active return original handled count */
> + if (active <= 1)
> + return handled;
Should we not reset perf_nmi_counter in this case?
> +
> + /*
> + * If a counter was handled, record the number of possible remaining
> + * NMIs that can occur.
> + */
> + if (handled) {
> + this_cpu_write(perf_nmi_counter,
> + min_t(unsigned int, 2, active - 1));
> +
> + return handled;
> + }
> +
> + if (!this_cpu_read(perf_nmi_counter))
> + return NMI_DONE;
> +
> + this_cpu_dec(perf_nmi_counter);
> +
> + return NMI_HANDLED;
> +}
> +
> static struct event_constraint *
> amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> struct perf_event *event)
> @@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
>
> .amd_nb_constraints = 1,
> .wait_on_overflow = amd_pmu_wait_on_overflow,
> + .mitigate_nmi_latency = amd_pmu_mitigate_nmi_latency,
> };
Again, you could just do amd_pmu_handle_irq() and avoid an extra
callback.
Anyway, we already had code to deal with spurious NMIs from AMD; see
commit:
63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
And that looks to be doing something very much the same. Why then do you
still need this on top?
On 3/15/19 5:50 AM, Peter Zijlstra wrote:
> On Mon, Mar 11, 2019 at 04:48:44PM +0000, Lendacky, Thomas wrote:
>> On AMD processors, the detection of an overflowed counter in the NMI
>> handler relies on the current value of the counter. So, for example, to
>> check for overflow on a 48 bit counter, bit 47 is checked to see if it
>> is 1 (not overflowed) or 0 (overflowed).
>>
>> There is currently a race condition present when disabling and then
>> updating the PMC. Increased NMI latency in newer AMD processors makes this
>> race condition more pronounced.
>
> Increased NMI latency also makes the results less useful :/ What amount
> of skid are we talking about, and is there anything AMD is going to do
> about this?
I haven't looked into the amount of skid, but, yes, the hardware team is
looking at this.
>
>> If the counter value has overflowed, it is
>> possible to update the PMC value before the NMI handler can run.
>
> Arguably the WRMSR should sync against the PMI. That is the behaviour
> one would expect.
>
> Isn't that something you can fix in ucode? And could you very please
> tell the hardware people this is disgusting?
Currently, there's nothing they've found that can be done in ucode for
this. But, yes, they know it's a problem and they're looking at what they
can do.
>
>> The updated PMC value is not an overflowed value, so when the perf NMI
>> handler does run, it will not find an overflowed counter. This may
>> appear as an unknown NMI resulting in either a panic or a series of
>> messages, depending on how the kernel is configured.
>>
>> To eliminate this race condition, the PMC value must be checked after
>> disabling the counter in x86_pmu_disable_all(), and, if overflowed, must
>> wait for the NMI handler to reset the value before continuing. Add a new,
>> optional, callable function that can be used to test for and resolve this
>> condition.
>>
>> Cc: <[email protected]> # 4.14.x-
>> Signed-off-by: Tom Lendacky <[email protected]>
>
>> +static void amd_pmu_wait_on_overflow(int idx, u64 config)
>> +{
>> + unsigned int i;
>> + u64 counter;
>> +
>> + /*
>> + * We shouldn't be calling this from NMI context, but add a safeguard
>> + * here to return, since if we're in NMI context we can't wait for an
>> + * NMI to reset an overflowed counter value.
>> + */
>> + if (in_nmi())
>> + return;
>> +
>> + /*
>> + * If the interrupt isn't enabled then we won't get the NMI that will
>> + * reset the overflow condition, so return.
>> + */
>> + if (!(config & ARCH_PERFMON_EVENTSEL_INT))
>> + return;
>> +
>> + /*
>> + * Wait for the counter to be reset if it has overflowed. This loop
>> + * should exit very, very quickly, but just in case, don't wait
>> + * forever...
>> + */
>> + for (i = 0; i < OVERFLOW_WAIT_COUNT; i++) {
>> + rdmsrl(x86_pmu_event_addr(idx), counter);
>> + if (counter & (1ULL << (x86_pmu.cntval_bits - 1)))
>> + break;
>> +
>> + /* Might be in IRQ context, so can't sleep */
>> + udelay(1);
>> + }
>> +}
>
> Argh.. that's horrible, as I'm sure you fully appreciate :/
Yeah...
>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index b684f0294f35..f1d2f70000cd 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -606,6 +606,9 @@ void x86_pmu_disable_all(void)
>> continue;
>> val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
>> wrmsrl(x86_pmu_config_addr(idx), val);
>> +
>> + if (x86_pmu.wait_on_overflow)
>> + x86_pmu.wait_on_overflow(idx, val);
>> }
>> }
>
> One alternative is adding amd_pmu_disable_all() to amd/core.c and using
> that. Then you can also change the loop to do the wait after all the
> WRMSRs, if that helps with latency.
I thought about that for both this and the next patch. But since it would
be duplicating all the code I went with the added callbacks. It might be
worth it for this patch to have an AMD disable_all callback since it's
not a lot of code to duplicate compared to handle_irq and I like the idea
of doing the overflow check after all the WRMSRs.
Thanks for the review, Peter.
Tom
>
On 3/15/19 7:03 AM, Peter Zijlstra wrote:
> On Mon, Mar 11, 2019 at 04:48:51PM +0000, Lendacky, Thomas wrote:
>> @@ -467,6 +470,45 @@ static void amd_pmu_wait_on_overflow(int idx, u64 config)
>> }
>> }
>>
>> +/*
>> + * Because of NMI latency, if multiple PMC counters are active we need to take
>> + * into account that multiple PMC overflows can generate multiple NMIs but be
>> + * handled by a single invocation of the NMI handler (think PMC overflow while
>> + * in the NMI handler). This could result in subsequent unknown NMI messages
>> + * being issued.
>> + *
>> + * Attempt to mitigate this by using the number of active PMCs to determine
>> + * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
>> + * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of one
>> + * less than the number of active PMCs or 2. The value of 2 is used in case the
>> + * NMI does not arrive at the APIC in time to be collapsed into an already
>> + * pending NMI.
>
> LAPIC I really do hope?!
Yes, LAPIC.
>
>> + */
>> +static int amd_pmu_mitigate_nmi_latency(unsigned int active, int handled)
>> +{
>> + /* If multiple counters are not active return original handled count */
>> + if (active <= 1)
>> + return handled;
>
> Should we not reset perf_nmi_counter in this case?
Yes, we should. I'll make that change.
>
>> +
>> + /*
>> + * If a counter was handled, record the number of possible remaining
>> + * NMIs that can occur.
>> + */
>> + if (handled) {
>> + this_cpu_write(perf_nmi_counter,
>> + min_t(unsigned int, 2, active - 1));
>> +
>> + return handled;
>> + }
>> +
>> + if (!this_cpu_read(perf_nmi_counter))
>> + return NMI_DONE;
>> +
>> + this_cpu_dec(perf_nmi_counter);
>> +
>> + return NMI_HANDLED;
>> +}
>> +
>> static struct event_constraint *
>> amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>> struct perf_event *event)
>> @@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
>>
>> .amd_nb_constraints = 1,
>> .wait_on_overflow = amd_pmu_wait_on_overflow,
>> + .mitigate_nmi_latency = amd_pmu_mitigate_nmi_latency,
>> };
>
> Again, you could just do amd_pmu_handle_irq() and avoid an extra
> callback.
This is where there would be a bunch of code duplication where I thought
adding the callback at the end would be better. But if it's best to add
an AMD handle_irq callback I can do that. I'm easy, let me know if you'd
prefer that.
>
> Anyway, we already had code to deal with spurious NMIs from AMD; see
> commit:
>
> 63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
>
> And that looks to be doing something very much the same. Why then do you
> still need this on top?
This can happen while perf is handling normal counter overflow as opposed
to covering the disabling of the counter case. When multiple counters
overflow at roughly the same time, but the NMI doesn't arrive in time to
get collapsed into a pending NMI, the back-to-back support in
do_default_nmi() doesn't kick in.
Hmmm... I wonder if the wait on overflow in the disable_all() function
would eliminate the need for 63e6be6d98e1. That would take more testing
on some older hardware to verify. That's something I can look into
separately from this series.
Thanks,
Tom
>
>
On Fri, Mar 15, 2019 at 02:44:32PM +0000, Lendacky, Thomas wrote:
> >> @@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
> >>
> >> .amd_nb_constraints = 1,
> >> .wait_on_overflow = amd_pmu_wait_on_overflow,
> >> + .mitigate_nmi_latency = amd_pmu_mitigate_nmi_latency,
> >> };
> >
> > Again, you could just do amd_pmu_handle_irq() and avoid an extra
> > callback.
>
> This is where there would be a bunch of code duplication where I thought
> adding the callback at the end would be better. But if it's best to add
> an AMD handle_irq callback I can do that. I'm easy, let me know if you'd
> prefer that.
Hmm, the thing that avoids you directly using x86_pmu_handle_irq() is
that added active count, but is that not the same as the POPCNT of
cpuc->active_mask?
Is the latency of POPCNT so bad that we need to avoid it?
That is, I was thinking of something like:
int amd_pmu_handle_irq(struct pt_regs *regs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
int active = hweight_long(cpuc->active_mask);
int handled = x86_pmu_handle_irq(regs);
+ if (active <= 1) {
this_cpu_write(perf_nmi_counter, 0);
+ return handled;
}
+
+ /*
+ * If a counter was handled, record the number of possible remaining
+ * NMIs that can occur.
+ */
+ if (handled) {
+ this_cpu_write(perf_nmi_counter,
+ min_t(unsigned int, 2, active - 1));
+
+ return handled;
+ }
+
+ if (!this_cpu_read(perf_nmi_counter))
+ return NMI_DONE;
+
+ this_cpu_dec(perf_nmi_counter);
+
+ return NMI_HANDLED;
}
> > Anyway, we already had code to deal with spurious NMIs from AMD; see
> > commit:
> >
> > 63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
> >
> > And that looks to be doing something very much the same. Why then do you
> > still need this on top?
>
> This can happen while perf is handling normal counter overflow as opposed
> to covering the disabling of the counter case. When multiple counters
> overflow at roughly the same time, but the NMI doesn't arrive in time to
> get collapsed into a pending NMI, the back-to-back support in
> do_default_nmi() doesn't kick in.
>
> Hmmm... I wonder if the wait on overflow in the disable_all() function
> would eliminate the need for 63e6be6d98e1. That would take more testing
> on some older hardware to verify. That's something I can look into
> separately from this series.
Yes please, or at least better document the reason for their separate
existence. It's all turning into a bit of magic it seems.
On 3/15/19 10:11 AM, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 02:44:32PM +0000, Lendacky, Thomas wrote:
>
>>>> @@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
>>>>
>>>> .amd_nb_constraints = 1,
>>>> .wait_on_overflow = amd_pmu_wait_on_overflow,
>>>> + .mitigate_nmi_latency = amd_pmu_mitigate_nmi_latency,
>>>> };
>>>
>>> Again, you could just do amd_pmu_handle_irq() and avoid an extra
>>> callback.
>>
>> This is where there would be a bunch of code duplication where I thought
>> adding the callback at the end would be better. But if it's best to add
>> an AMD handle_irq callback I can do that. I'm easy, let me know if you'd
>> prefer that.
>
> Hmm, the thing that avoids you directly using x86_pmu_handle_irq() is
> that added active count, but is that not the same as the POPCNT of
> cpuc->active_mask?
>
> Is the latency of POPCNT so bad that we need to avoid it?
>
> That is, I was thinking of something like:
>
> int amd_pmu_handle_irq(struct pt_regs *regs)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> int active = hweight_long(cpuc->active_mask);
> int handled = x86_pmu_handle_irq(regs);
Yup, I had a total brain lapse there of just calling x86_pmu_handle_irq()
from the new routine.
>
> + if (active <= 1) {
> this_cpu_write(perf_nmi_counter, 0);
> + return handled;
> }
> +
> + /*
> + * If a counter was handled, record the number of possible remaining
> + * NMIs that can occur.
> + */
> + if (handled) {
> + this_cpu_write(perf_nmi_counter,
> + min_t(unsigned int, 2, active - 1));
> +
> + return handled;
> + }
> +
> + if (!this_cpu_read(perf_nmi_counter))
> + return NMI_DONE;
> +
> + this_cpu_dec(perf_nmi_counter);
> +
> + return NMI_HANDLED;
> }
>
>>> Anyway, we already had code to deal with spurious NMIs from AMD; see
>>> commit:
>>>
>>> 63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
>>>
>>> And that looks to be doing something very much the same. Why then do you
>>> still need this on top?
>>
>> This can happen while perf is handling normal counter overflow as opposed
>> to covering the disabling of the counter case. When multiple counters
>> overflow at roughly the same time, but the NMI doesn't arrive in time to
>> get collapsed into a pending NMI, the back-to-back support in
>> do_default_nmi() doesn't kick in.
>>
>> Hmmm... I wonder if the wait on overflow in the disable_all() function
>> would eliminate the need for 63e6be6d98e1. That would take more testing
>> on some older hardware to verify. That's something I can look into
>> separately from this series.
>
> Yes please, or at least better document the reason for their separate
> existence. It's all turning into a bit of magic it seems.
Ok, I'll update the commit message with a bit more info and add to the
comment of the new AMD handle_irq function.
Thanks,
Tom
>
On 3/15/19 10:49 AM, Tom Lendacky wrote:
> On 3/15/19 10:11 AM, Peter Zijlstra wrote:
>> On Fri, Mar 15, 2019 at 02:44:32PM +0000, Lendacky, Thomas wrote:
>>
>>>>> @@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
>>>>> .amd_nb_constraints = 1,
>>>>> .wait_on_overflow = amd_pmu_wait_on_overflow,
>>>>> + .mitigate_nmi_latency = amd_pmu_mitigate_nmi_latency,
>>>>> };
>>>>
>>>> Again, you could just do amd_pmu_handle_irq() and avoid an extra
>>>> callback.
>>>
>>> This is where there would be a bunch of code duplication where I thought
>>> adding the callback at the end would be better. But if it's best to add
>>> an AMD handle_irq callback I can do that. I'm easy, let me know if you'd
>>> prefer that.
>>
>> Hmm, the thing that avoids you directly using x86_pmu_handle_irq() is
>> that added active count, but is that not the same as the POPCNT of
>> cpuc->active_mask?
>>
>> Is the latency of POPCNT so bad that we need to avoid it?
>>
>> That is, I was thinking of something like:
>>
>> int amd_pmu_handle_irq(struct pt_regs *regs)
>> {
>> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> int active = hweight_long(cpuc->active_mask);
>> int handled = x86_pmu_handle_irq(regs);
>
> Yup, I had a total brain lapse there of just calling x86_pmu_handle_irq()
> from the new routine.
>
>>
>> + if (active <= 1) {
And I wasn't taking into account other sources of NMIs triggering the
running of the handler while perf is running. I was only thinking in terms
of NMIs coming from the PMCs. So this really needs to be a !active check
and the setting of the perf_nmi_counter below needs to be the min of 2 or
active.
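So something more like this (untested tweak of your sketch):

	if (!active) {
		this_cpu_write(perf_nmi_counter, 0);
		return handled;
	}

	/*
	 * If a counter was handled, record the number of possible remaining
	 * NMIs that can occur.
	 */
	if (handled) {
		this_cpu_write(perf_nmi_counter,
			       min_t(unsigned int, 2, active));
		return handled;
	}

	if (!this_cpu_read(perf_nmi_counter))
		return NMI_DONE;

	this_cpu_dec(perf_nmi_counter);

	return NMI_HANDLED;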
Thanks,
Tom
>> this_cpu_write(perf_nmi_counter, 0);
>> + return handled;
>> }
>> +
>> + /*
>> + * If a counter was handled, record the number of possible
>> remaining
>> + * NMIs that can occur.
>> + */
>> + if (handled) {
>> + this_cpu_write(perf_nmi_counter,
>> + min_t(unsigned int, 2, active - 1));
>> +
>> + return handled;
>> + }
>> +
>> + if (!this_cpu_read(perf_nmi_counter))
>> + return NMI_DONE;
>> +
>> + this_cpu_dec(perf_nmi_counter);
>> +
>> + return NMI_HANDLED;
>> }
>>
>>>> Anyway, we already had code to deal with spurious NMIs from AMD; see
>>>> commit:
>>>>
>>>> 63e6be6d98e1 ("perf, x86: Catch spurious interrupts after
>>>> disabling counters")
>>>>
>>>> And that looks to be doing something very much the same. Why then do you
>>>> still need this on top?
>>>
>>> This can happen while perf is handling normal counter overflow as opposed
>>> to covering the disabling of the counter case. When multiple counters
>>> overflow at roughly the same time, but the NMI doesn't arrive in time to
>>> get collapsed into a pending NMI, the back-to-back support in
>>> do_default_nmi() doesn't kick in.
>>>
>>> Hmmm... I wonder if the wait on overflow in the disable_all() function
>>> would eliminate the need for 63e6be6d98e1. That would take more testing
>>> on some older hardware to verify. That's something I can look into
>>> separately from this series.
>>
>> Yes please, or at least better document the reason for their separate
>> existence. It's all turning into a bit of magic it seems.
>
> Ok, I'll update the commit message with a bit more info and add to the
> comment of the new AMD handle_irq function.
>
> Thanks,
> Tom
>
>>
On Fri, Mar 15, 2019 at 01:03:11PM +0100, Peter Zijlstra wrote:
> Anyway, we already had code to deal with spurious NMIs from AMD; see
> commit:
>
> 63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
>
> And that looks to be doing something very much the same. Why then do you
> still need this on top?
And I think I've spotted a bug there; consider the case where only PMC3
has an active event left; the interrupt would then consume the running
state for PMC0-2, not leaving it for the spurious interrupt that might
come after it.
Similarly, if there's nothing running anymore, a single spurious
interrupt will clear the entire state.
It effectively is a single state, not a per-pmc one.
Something like the below would cure that... would that help with
something? Or were we going to get rid of this entirely with your
patches...
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..a8b5535f7888 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1432,6 +1432,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
struct cpu_hw_events *cpuc;
struct perf_event *event;
int idx, handled = 0;
+ int ridx = -1;
u64 val;
cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1453,8 +1454,8 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
* might still deliver spurious interrupts still
* in flight. Catch them:
*/
- if (__test_and_clear_bit(idx, cpuc->running))
- handled++;
+ if (test_bit(idx, cpuc->running))
+ ridx = idx;
continue;
}
@@ -1477,6 +1478,11 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
x86_pmu_stop(event, 0);
}
+ if (!handled && ridx >= 0) {
+ __clear_bit(ridx, cpuc->running);
+ handled++;
+ }
+
if (handled)
inc_irq_stat(apic_perf_irqs);