So this is just a warning, and I've reported it before, but the
perf_fuzzer triggers this fairly regularly on my Haswell system.
It looks like fixed counter 0 (retired instructions) being set to
0000fffffffffffe occasionally causes an irq loop storm and gets stuck
until the PMU state is cleared.
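Decoding the value: with 48-bit counters, 0000fffffffffffe is two counts
short of wrapping, i.e. the counter was effectively programmed with -2.
A trivial user-space check of the arithmetic (plain C, nothing
kernel-specific):
        #include <stdio.h>
        #include <stdint.h>
        int main(void)
        {
                uint64_t count = 0x0000fffffffffffeULL; /* fixed-PMC0 value below */
                uint64_t wrap  = 1ULL << 48;            /* 48-bit counter width */
                /* prints "events until overflow: 2" */
                printf("events until overflow: %llu\n",
                       (unsigned long long)(wrap - count));
                return 0;
        }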
[ 8224.179407] ------------[ cut here ]------------
[ 8224.184368] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/perf_event_intel.c:1602 intel_pmu_handle_irq+0x2bc/0x450()
[ 8224.195686] perfevents: irq loop stuck!
[ 8224.199835] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp kvm crct10dif_pclmul snd_hda_codec_hdmi cr$
[ 8224.273188] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.0.0+ #136
[ 8224.281012] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[ 8224.288988] ffffffff81a093a8 ffff88011ea05b50 ffffffff816d5639 0000000000000000
[ 8224.297058] ffff88011ea05ba0 ffff88011ea05b90 ffffffff81072e4a ffff880119b87090
[ 8224.306453] 0000000000000064 ffff88011ea0bd80 ffff8800c38c7800 0000000000000040
[ 8224.315894] Call Trace:
[ 8224.319886] <NMI> [<ffffffff816d5639>] dump_stack+0x45/0x57
[ 8224.327439] [<ffffffff81072e4a>] warn_slowpath_common+0x8a/0xc0
[ 8224.335238] [<ffffffff81072ec6>] warn_slowpath_fmt+0x46/0x50
[ 8224.342812] [<ffffffff8103399c>] intel_pmu_handle_irq+0x2bc/0x450
[ 8224.350830] [<ffffffff81164215>] ? perf_event_task_tick+0x1d5/0x2b0
[ 8224.359025] [<ffffffff8102a40b>] perf_event_nmi_handler+0x2b/0x50
[ 8224.367069] [<ffffffff810185c0>] nmi_handle+0xa0/0x150
[ 8224.374067] [<ffffffff81018525>] ? nmi_handle+0x5/0x150
[ 8224.381169] [<ffffffff8101886a>] default_do_nmi+0x4a/0x140
[ 8224.388487] [<ffffffff810189f8>] do_nmi+0x98/0xe0
[ 8224.394886] [<ffffffff816df8af>] end_repeat_nmi+0x1e/0x2e
[ 8224.402082] [<ffffffff8105bffa>] ? native_write_msr_safe+0xa/0x10
[ 8224.409958] [<ffffffff8105bffa>] ? native_write_msr_safe+0xa/0x10
[ 8224.417785] [<ffffffff8105bffa>] ? native_write_msr_safe+0xa/0x10
[ 8224.425536] <<EOE>> <IRQ> [<ffffffff8103356e>] intel_pmu_enable_event+0x20e/0x230
[ 8224.435001] [<ffffffff8102b721>] x86_pmu_start+0x81/0x120
[ 8224.442007] [<ffffffff8102bf25>] x86_pmu_enable+0x295/0x310
[ 8224.449150] [<ffffffff811621aa>] perf_pmu_enable+0x2a/0x30
[ 8224.456173] [<ffffffff81029fe8>] x86_pmu_commit_txn+0x78/0xa0
[ 8224.463561] [<ffffffff81427f6b>] ? debug_object_activate+0x14b/0x1e0
[ 8224.471591] [<ffffffff810bdde9>] ? __lock_acquire.isra.31+0x3b9/0x1000
[ 8224.479715] [<ffffffff81427f6b>] ? debug_object_activate+0x14b/0x1e0
[ 8224.487676] [<ffffffff810bcba4>] ? __lock_is_held+0x54/0x80
[ 8224.494794] [<ffffffff81162d50>] ? event_sched_in.isra.85+0x190/0x330
[ 8224.502894] [<ffffffff811630a8>] group_sched_in+0x1b8/0x1e0
[ 8224.510082] [<ffffffff8101df1a>] ? native_sched_clock+0x2a/0x90
[ 8224.517646] [<ffffffff811637bc>] __perf_event_enable+0x25c/0x2a0
[ 8224.525297] [<ffffffff810f3e69>] ? tick_nohz_irq_exit+0x29/0x30
[ 8224.532776] [<ffffffff8115ef30>] ? cpu_clock_event_start+0x40/0x40
[ 8224.540595] [<ffffffff8115ef80>] remote_function+0x50/0x60
[ 8224.547668] [<ffffffff810f8cd1>] flush_smp_call_function_queue+0x81/0x180
[ 8224.556126] [<ffffffff810f9763>] generic_smp_call_function_single_interrupt+0x13/0x60
[ 8224.565749] [<ffffffff8104b5c8>] smp_trace_call_function_single_interrupt+0x38/0xc0
[ 8224.575132] [<ffffffff816de9be>] trace_call_function_single_interrupt+0x6e/0x80
[ 8224.584168] <EOI> [<ffffffff8156e8b5>] ? cpuidle_enter_state+0x65/0x160
[ 8224.592646] [<ffffffff8156e8a1>] ? cpuidle_enter_state+0x51/0x160
[ 8224.600408] [<ffffffff8156e9e7>] cpuidle_enter+0x17/0x20
[ 8224.607285] [<ffffffff810b6e39>] cpu_startup_entry+0x399/0x440
[ 8224.614712] [<ffffffff816c9ddb>] rest_init+0xbb/0xd0
[ 8224.621283] [<ffffffff81d46f74>] start_kernel+0x44e/0x45b
[ 8224.628212] [<ffffffff81d46120>] ? early_idt_handlers+0x120/0x120
[ 8224.635980] [<ffffffff81d464d7>] x86_64_start_reservations+0x2a/0x2c
[ 8224.643999] [<ffffffff81d46614>] x86_64_start_kernel+0x13b/0x14a
[ 8224.651691] ---[ end trace 5679ca0875946dba ]---
[ 8224.657764]
[ 8224.660447] CPU#0: ctrl: 0000000000000000
[ 8224.666448] CPU#0: status: 0000000000000000
[ 8224.672414] CPU#0: overflow: 0000000000000000
[ 8224.678373] CPU#0: fixed: 00000000000000b9
[ 8224.684338] CPU#0: pebs: 0000000000000000
[ 8224.690322] CPU#0: debugctl: 0000000000000000
[ 8224.696253] CPU#0: active: 0000000300000000
[ 8224.702232] CPU#0: gen-PMC0 ctrl: 00000000108001c2
[ 8224.708823] CPU#0: gen-PMC0 count: 0000000000000001
[ 8224.715371] CPU#0: gen-PMC0 left: 0000ffffffffffff
[ 8224.721914] CPU#0: gen-PMC1 ctrl: 00000000001382d0
[ 8224.728402] CPU#0: gen-PMC1 count: 0000000000ff3792
[ 8224.734849] CPU#0: gen-PMC1 left: 0000ffffff012119
[ 8224.741271] CPU#0: gen-PMC2 ctrl: 00000003ff96764b
[ 8224.747600] CPU#0: gen-PMC2 count: 0000000000000001
[ 8224.753884] CPU#0: gen-PMC2 left: 0000ffffffffffff
[ 8224.760100] CPU#0: gen-PMC3 ctrl: 0000000000131acd
[ 8224.766320] CPU#0: gen-PMC3 count: 000000000000288a
[ 8224.772585] CPU#0: gen-PMC3 left: 0000ffffffffffff
[ 8224.778769] CPU#0: fixed-PMC0 count: 0000fffffffffffe
[ 8224.784973] CPU#0: fixed-PMC1 count: 0000fffda1877268
[ 8224.791196] CPU#0: fixed-PMC2 count: 0000000000f222eb
[ 8224.797450] perf_event_intel: clearing PMU state on CPU#0
[ 8224.804063] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 624.630 msecs
[ 8224.813789] perf interrupt took too long (4880480 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 8224.825400] perf interrupt took too long (4842512 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[ 8225.007411] perf interrupt took too long (4804688 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[ 8225.244832] perf interrupt took too long (4767159 > 20000), lowering kernel.perf_event_max_sample_rate to 6250
[ 8225.295535] perf interrupt took too long (4729923 > 38461), lowering kernel.perf_event_max_sample_rate to 3250
[ 8225.747314] perf interrupt took too long (4692980 > 71428), lowering kernel.perf_event_max_sample_rate to 1750
...
[202220.612133] perf_event_intel: clearing PMU state on CPU#6
[202220.665533] perf_event_intel: clearing PMU state on CPU#6
[202220.674658] perf_event_intel: clearing PMU state on CPU#6
[202220.734200] perf_event_intel: clearing PMU state on CPU#6
[202220.966526] perf_event_intel: clearing PMU state on CPU#6
[202220.976874] perf_event_intel: clearing PMU state on CPU#6
[202222.950479] perf_event_intel: clearing PMU state on CPU#6
[202223.183925] perf_event_intel: clearing PMU state on CPU#6
[212335.610297] perf_event_intel: clearing PMU state on CPU#2
[212335.922425] perf_event_intel: clearing PMU state on CPU#2
[212336.014458] perf_event_intel: clearing PMU state on CPU#2
[212336.258568] perf_event_intel: clearing PMU state on CPU#2
[226783.599841] perf_event_intel: clearing PMU state on CPU#0
[269662.113390] perf_event_intel: clearing PMU state on CPU#4
[401640.647353] perf_event_intel: clearing PMU state on CPU#3
[409463.976440] perf_event_intel: clearing PMU state on CPU#6
[409464.172527] perf_event_intel: clearing PMU state on CPU#6
[409464.692722] perf_event_intel: clearing PMU state on CPU#6
[422018.413829] perf_event_intel: clearing PMU state on CPU#1
[422018.458279] perf_event_intel: clearing PMU state on CPU#1
[422018.496802] perf_event_intel: clearing PMU state on CPU#1
[422018.510058] perf_event_intel: clearing PMU state on CPU#1
[422018.523668] perf_event_intel: clearing PMU state on CPU#1
[422018.549935] perf_event_intel: clearing PMU state on CPU#1
[422019.274748] perf_event_intel: clearing PMU state on CPU#1
* Vince Weaver <[email protected]> wrote:
> So this is just a warning, and I've reported it before, but the
> perf_fuzzer triggers this fairly regularly on my Haswell system.
>
> It looks like fixed counter 0 (retired instructions) being set to
> 0000fffffffffffe occasionally causes an irq loop storm and gets
> stuck until the PMU state is cleared.
So 0000fffffffffffe corresponds to 2 events left until overflow,
right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
allow these super short periods.
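For context, x86_perf_event_set_period() is where that hook gets
consulted when a counter is (re)armed; roughly, in simplified form (a
sketch, not the exact upstream code):
        if (left > x86_pmu.max_period)
                left = x86_pmu.max_period;
        if (x86_pmu.limit_period)
                left = x86_pmu.limit_period(event, left);
        /* 'left' is then negated and written to the counter, so without a
         * limit_period callback a 2-event period goes straight to hardware */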
Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
one similar to bdw_limit_period()? Something like the patch below?
Totally untested and such. I picked 128 because of Broadwell, but
lower values might work as well. You could try to increase it to 3 and
upwards and see which one stops triggering stuck NMI loops?
Thanks,
Ingo
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 960e85de13fb..26b13ea8299c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2479,6 +2479,15 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
return c;
}
+/*
+ * Really short periods might create infinite PMC NMI loops on Haswell,
+ * so limit them to 128. There's no official erratum for this AFAIK.
+ */
+static unsigned int hsw_limit_period(struct perf_event *event, unsigned int left)
+{
+ return max(left, 128U);
+}
+
/*
* Broadwell:
@@ -2495,7 +2504,7 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
* Therefore the effective (average) period matches the requested period,
* despite coarser hardware granularity.
*/
-static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
+static unsigned int bdw_limit_period(struct perf_event *event, unsigned left)
{
if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
X86_CONFIG(.event=0xc0, .umask=0x01)) {
@@ -3265,6 +3274,7 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
x86_pmu.cpu_events = hsw_events_attrs;
+ x86_pmu.limit_period = hsw_limit_period;
x86_pmu.lbr_double_abort = true;
pr_cont("Haswell events, ");
break;
On Fri, 1 May 2015, Ingo Molnar wrote:
>
> * Vince Weaver <[email protected]> wrote:
>
> > So this is just a warning, and I've reported it before, but the
> > perf_fuzzer triggers this fairly regularly on my Haswell system.
> >
> > It looks like fixed counter 0 (retired instructions) being set to
> > 0000fffffffffffe occasionally causes an irq loop storm and gets
> > stuck until the PMU state is cleared.
>
> So 0000fffffffffffe corresponds to 2 events left until overflow,
> right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
> allow these super short periods.
>
> Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
> one similar to bdw_limit_period()? Something like the patch below?
I spent the morning trying to get a reproducer for this. It turns out to
be complex. It seems in addition to fixed counter 0 being set to -2, at
least one other non-fixed counter must be about to overflow.
For example, in this case gen-PMC2 is also poised to overflow at the same
time.
CPU#0: gen-PMC2 ctrl: 00000003ff96764b
CPU#0: gen-PMC2 count: 0000000000000001
CPU#0: gen-PMC2 left: 0000ffffffffffff
...
[ 2408.612442] CPU#0: fixed-PMC0 count: 0000fffffffffffe
It's not always PMC2, but in the warnings there is at least one other
gen-PMC about to overflow at the exact same time as the fixed one.
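Roughly the kind of setup involved, sketched with the standard
perf_event_open() interface (the exact events and periods the fuzzer
picks are unknown, so the choices below are guesses):
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <sys/syscall.h>
        #include <linux/perf_event.h>
        static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                       int cpu, int group_fd, unsigned long flags)
        {
                return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
        }
        int main(void)
        {
                struct perf_event_attr attr;
                volatile long i;
                int fd1, fd2;
                /* retired instructions with a tiny period: typically scheduled on
                 * fixed counter 0, so it gets programmed very close to overflow */
                memset(&attr, 0, sizeof(attr));
                attr.size = sizeof(attr);
                attr.type = PERF_TYPE_HARDWARE;
                attr.config = PERF_COUNT_HW_INSTRUCTIONS;
                attr.sample_period = 2;
                fd1 = sys_perf_event_open(&attr, 0, -1, -1, 0);
                /* a second sampling event on a general-purpose counter, also with
                 * a very short period so it overflows at about the same time */
                attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
                attr.sample_period = 2;
                fd2 = sys_perf_event_open(&attr, 0, -1, -1, 0);
                if (fd1 < 0 || fd2 < 0)
                        perror("perf_event_open");
                for (i = 0; i < 100000000; i++)         /* let the counters tick */
                        ;
                close(fd1);
                close(fd2);
                return 0;
        }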
Vince
On Fri, 1 May 2015, Ingo Molnar wrote:
> So 0000fffffffffffe corresponds to 2 events left until overflow,
> right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
> allow these super short periods.
>
> Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
> one similar to bdw_limit_period()? Something like the patch below?
>
> Totally untested and such. I picked 128 because of Broadwell, but
> lower values might work as well. You could try to increase it to 3 and
> upwards and see which one stops triggering stuck NMI loops?
I spent a lot of time trying to come up with a test case that triggered
this more reliably but failed.
It definitely is an issue with PMC0 being -2 causing the PMC0 bit in the
status register to get stuck and not clear. Often there is also a PEBS
event active at the same time, but that might be coincidence.
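That also matches the repeated "clearing PMU state" messages: the handler
keeps seeing the same status bit set, and after enough iterations it
gives up and resets the PMU. Paraphrased from intel_pmu_handle_irq()
(from memory, so a sketch rather than the exact code):
        again:
                intel_pmu_ack_status(status);   /* write GLOBAL_OVF_CTRL */
                if (++loops > 100) {
                        WARN(1, "perfevents: irq loop stuck!\n");
                        perf_event_print_debug();       /* the register dump above */
                        intel_pmu_reset();      /* "clearing PMU state on CPU#N" */
                        goto done;
                }
                /* ... process the overflowed counters ... */
                status = intel_pmu_get_status();
                if (status)
                        goto again;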
With your patch applied I can't trigger the issue. I haven't tried
narrowing down the exact value yet.
Vince
* Vince Weaver <[email protected]> wrote:
> On Fri, 1 May 2015, Ingo Molnar wrote:
>
> > So 0000fffffffffffe corresponds to 2 events left until overflow,
> > right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
> > allow these super short periods.
> >
> > Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
> > one similar to bdw_limit_period()? Something like the patch below?
> >
> > Totally untested and such. I picked 128 because of Broadwell, but
> > lower values might work as well. You could try to increase it to 3 and
> > upwards and see which one stops triggering stuck NMI loops?
>
> I spent a lot of time trying to come up with a test case that
> triggered this more reliably but failed.
>
> It definitely is an issue with PMC0 being -2 causing the PMC0 bit in
> the status register to get stuck and not clear. Often there is also
> a PEBS event active at the same time, but that might be coincidence.
>
> With your patch applied I can't trigger the issue. I haven't tried
> narrowing down the exact value yet.
So how about I change it from 128U to 2U and apply it upstream?
I.e. use the minimal threshold that we have observed to cause
problems. That way, should it ever trigger under different
circumstances, we'll eventually trigger it again or hear about it.
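Presumably just the one-line change in the new helper (untested, just
spelling it out):
-	return max(left, 128U);
+	return max(left, 2U);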
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
>
> * Vince Weaver <[email protected]> wrote:
>
> > So this is just a warning, and I've reported it before, but the
> > perf_fuzzer triggers this fairly regularly on my Haswell system.
> >
> > It looks like fixed counter 0 (retired instructions) being set to
> > 0000fffffffffffe occasionally causes an irq loop storm and gets
> > stuck until the PMU state is cleared.
>
> So 0000fffffffffffe corresponds to 2 events left until overflow,
> right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
> allow these super short periods.
>
> Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
> one similar to bdw_limit_period()? Something like the patch below?
>
> Totally untested and such. I picked 128 because of Broadwell, but
> lower values might work as well. You could try to increase it to 3 and
> upwards and see which one stops triggering stuck NMI loops?
>
> Thanks,
>
> Ingo
>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> ---
> arch/x86/kernel/cpu/perf_event_intel.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 960e85de13fb..26b13ea8299c 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -2479,6 +2479,15 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>
> return c;
> }
> +/*
> + * Really short periods might create infinite PMC NMI loops on Haswell,
> + * so limit them to 128. There's no official erratum for this AFAIK.
> + */
> +static unsigned int hsw_limit_period(struct perf_event *event, unsigned int left)
> +{
> + return max(left, 128U);
> +}
> +
>
> /*
> * Broadwell:
> @@ -2495,7 +2504,7 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> * Therefore the effective (average) period matches the requested period,
> * despite coarser hardware granularity.
> */
> -static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
> +static unsigned int bdw_limit_period(struct perf_event *event, unsigned left)
> {
> if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
> X86_CONFIG(.event=0xc0, .umask=0x01)) {
> @@ -3265,6 +3274,7 @@ __init int intel_pmu_init(void)
> x86_pmu.hw_config = hsw_hw_config;
> x86_pmu.get_event_constraints = hsw_get_event_constraints;
> x86_pmu.cpu_events = hsw_events_attrs;
> + x86_pmu.limit_period = hsw_limit_period;
> x86_pmu.lbr_double_abort = true;
> pr_cont("Haswell events, ");
> break;
Also, I'd apply the quirk not just to Haswell, but to Nehalem, Westmere
and Ivy Bridge as well; I have seen it as early as on a Nehalem
prototype box.
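I.e., presumably setting the same callback (perhaps renamed to something
less model-specific) in the corresponding intel_pmu_init() model
branches, along the lines of:
        /* in the Nehalem/Westmere/Ivy Bridge branches of intel_pmu_init() */
        x86_pmu.limit_period = hsw_limit_period;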
Thanks,
Ingo
On Fri, 8 May 2015, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Vince Weaver <[email protected]> wrote:
> >
> > > So this is just a warning, and I've reported it before, but the
> > > perf_fuzzer triggers this fairly regularly on my Haswell system.
> > >
> > > It looks like fixed counter 0 (retired instructions) being set to
> > > 0000fffffffffffe occasionally causes an irq loop storm and gets
> > > stuck until the PMU state is cleared.
> >
> > So 0000fffffffffffe corresponds to 2 events left until overflow,
> > right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
> > allow these super short periods.
> >
> > Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
> > one similar to bdw_limit_period()? Something like the patch below?
> >
> > Totally untested and such. I picked 128 because of Broadwell, but
> > lower values might work as well. You could try to increase it to 3 and
> > upwards and see which one stops triggering stuck NMI loops?
> >
> > Thanks,
> >
> > Ingo
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> >
> > ---
> > arch/x86/kernel/cpu/perf_event_intel.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> > index 960e85de13fb..26b13ea8299c 100644
> > --- a/arch/x86/kernel/cpu/perf_event_intel.c
> > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > @@ -2479,6 +2479,15 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> >
> > return c;
> > }
> > +/*
> > + * Really short periods might create infinite PMC NMI loops on Haswell,
> > + * so limit them to 128. There's no official erratum for this AFAIK.
> > + */
> > +static unsigned int hsw_limit_period(struct perf_event *event, unsigned int left)
> > +{
> > + return max(left, 128U);
> > +}
> > +
> >
> > /*
> > * Broadwell:
> > @@ -2495,7 +2504,7 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> > * Therefore the effective (average) period matches the requested period,
> > * despite coarser hardware granularity.
> > */
> > -static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
> > +static unsigned int bdw_limit_period(struct perf_event *event, unsigned left)
> > {
> > if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
> > X86_CONFIG(.event=0xc0, .umask=0x01)) {
> > @@ -3265,6 +3274,7 @@ __init int intel_pmu_init(void)
> > x86_pmu.hw_config = hsw_hw_config;
> > x86_pmu.get_event_constraints = hsw_get_event_constraints;
> > x86_pmu.cpu_events = hsw_events_attrs;
> > + x86_pmu.limit_period = hsw_limit_period;
> > x86_pmu.lbr_double_abort = true;
> > pr_cont("Haswell events, ");
> > break;
>
> Also, I'd apply the quirk not just to Haswell, but to Nehalem, Westmere
> and Ivy Bridge as well; I have seen it as early as on a Nehalem
> prototype box.
So at the suggestion of Andi Kleen I did some tests to see if this was
related to Haswell erratum HSD143:
        Fixed-Function Performance Counter May Over Count Instructions
        Retired by 32 When Intel Hyper-Threading Technology is Enabled
and indeed the problem seemed to go away if I disabled hyperthreading.
However, a patch implementing the Intel-suggested workaround for that
erratum (programming the FIXED_CTR_CTRL_MSR only after the
GLOBAL_CTRL_MSR is set) did not fix the issue once I re-enabled
hyperthreading on the machine.
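The workaround amounts to reordering the MSR writes on the enable path,
roughly like the sketch below (simplified, not the actual patch;
'fixed_ctrl' is just an illustrative variable for the computed fixed
counter control bits):
        /* HSD143 workaround attempt: enable the global control first ... */
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, x86_pmu.intel_ctrl);
        /* ... and only then program the fixed-counter controls, instead of
         * writing FIXED_CTR_CTRL before GLOBAL_CTRL as is done today */
        wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, fixed_ctrl);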
Vince