So this does seem to have gotten worse with 4.2-rc0. Four warnings are
triggered more or less immediately after starting fuzzing.
(#1)
WARNING: CPU: 0 PID: 32434 at arch/x86/kernel/cpu/perf_event.c:1219 x86_pmu_start+0xaa/0x100()
if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
[ 224.158535] [<ffffffff8102a1ea>] x86_pmu_start+0xaa/0x100
[ 224.164401] [<ffffffff81159f39>] perf_ioctl+0x429/0x470
[ 224.170101] [<ffffffff811f199a>] do_vfs_ioctl+0x2ba/0x4a0
[ 224.176012] [<ffffffff811fef14>] ? mntput+0x24/0x40
[ 224.181343] [<ffffffff811df3ea>] ? __fput+0x17a/0x1e0
[ 224.186858] [<ffffffff811f1bf9>] SyS_ioctl+0x79/0x90
[ 224.192275] [<ffffffff816a64b2>] entry_SYSCALL_64_fastpath+0x16/0x7a
(#2)
[ 584.352324] WARNING: CPU: 2 PID: 18924 at arch/x86/kernel/cpu/perf_event_intel_ds.c:1198 intel_pmu_drain_pebs_nhm+0x283/0x2e0()
(this one was described in a previous e-mail)
(#3)
[ 1210.501445] WARNING: CPU: 6 PID: 0 at arch/x86/kernel/cpu/perf_event.c:1307 x86_pmu_stop+0xb8/0xd0()
WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
[ 1210.653328] [<ffffffff810291b8>] x86_pmu_stop+0xb8/0xd0
[ 1210.659466] [<ffffffff81029216>] x86_pmu_del+0x46/0x140
[ 1210.665607] [<ffffffff81156244>] event_sched_out.isra.80+0x124/0x320
[ 1210.672924] [<ffffffff81156553>] ? __perf_event_disable+0x53/0x130
[ 1210.680097] [<ffffffff81156488>] group_sched_out+0x48/0xc0
[ 1210.686530] [<ffffffff8101d2a8>] ? native_sched_clock+0x28/0x90
[ 1210.693445] [<ffffffff811565fe>] __perf_event_disable+0xfe/0x130
[ 1210.700408] [<ffffffff811532c0>] ? cpu_clock_event_start+0x40/0x40
[ 1210.707550] [<ffffffff81153302>] remote_function+0x42/0x50
[ 1210.713942] [<ffffffff810f0dab>] flush_smp_call_function_queue+0x7b/0x170
[ 1210.721762] [<ffffffff810f1793>] generic_smp_call_function_single_interrupt+0x13/0x60
[ 1210.730679] [<ffffffff81049228>] smp_trace_call_function_single_interrupt+0x38/0xb0
[ 1210.739467] [<ffffffff816a78db>] trace_call_function_single_interrupt+0x6b/0x70
[ 1210.747820] <EOI> [<ffffffff81543f24>] ? cpuidle_enter_state+0xf4/0x220
[ 1210.755614] [<ffffffff81543f00>] ? cpuidle_enter_state+0xd0/0x220
[ 1210.762708] [<ffffffff81544087>] cpuidle_enter+0x17/0x20
[ 1210.768966] [<ffffffff810b05db>] call_cpuidle+0x3b/0x70
[ 1210.775135] [<ffffffff81544063>] ? cpuidle_select+0x13/0x20
[ 1210.781637] [<ffffffff810b0855>] cpu_startup_entry+0x245/0x310
[ 1210.788411] [<ffffffff81049a40>] start_secondary+0x120/0x130
(#4)
[ 1638.702578] WARNING: CPU: 2 PID: 25543 at arch/x86/kernel/cpu/perf_event_intel_lbr.c:349 intel_pmu_lbr_disable+0xbc/0xd0()
WARN_ON_ONCE(cpuc->lbr_users < 0);
[ 1638.857921] [<ffffffff8102e56c>] intel_pmu_lbr_disable+0xbc/0xd0
[ 1638.864874] [<ffffffff81031812>] intel_pmu_disable_event+0xc2/0x130
[ 1638.872125] [<ffffffff81030a49>] ? __intel_pmu_disable_all+0x49/0x70
[ 1638.879455] [<ffffffff81029175>] x86_pmu_stop+0x75/0xd0
[ 1638.885560] [<ffffffff81159f1a>] perf_ioctl+0x40a/0x470
[ 1638.891650] [<ffffffff811f199a>] do_vfs_ioctl+0x2ba/0x4a0
[ 1638.897945] [<ffffffff811dd9ac>] ? vfs_read+0x11c/0x130
[ 1638.904069] [<ffffffff811f27bd>] ? poll_select_set_timeout+0x4d/0x70
[ 1638.911417] [<ffffffff811f1bf9>] SyS_ioctl+0x79/0x90
[ 1638.917295] [<ffffffff816a64b2>] entry_SYSCALL_64_fastpath+0x16/0x7a
On Thu, Jul 02, 2015 at 11:42:43AM -0400, Vince Weaver wrote:
> (#4)
>
> [ 1638.702578] WARNING: CPU: 2 PID: 25543 at arch/x86/kernel/cpu/perf_event_intel_lbr.c:349 intel_pmu_lbr_disable+0xbc/0xd0()
>
> WARN_ON_ONCE(cpuc->lbr_users < 0);
>
> [ 1638.857921] [<ffffffff8102e56c>] intel_pmu_lbr_disable+0xbc/0xd0
> [ 1638.864874] [<ffffffff81031812>] intel_pmu_disable_event+0xc2/0x130
> [ 1638.872125] [<ffffffff81030a49>] ? __intel_pmu_disable_all+0x49/0x70
> [ 1638.879455] [<ffffffff81029175>] x86_pmu_stop+0x75/0xd0
> [ 1638.885560] [<ffffffff81159f1a>] perf_ioctl+0x40a/0x470
> [ 1638.891650] [<ffffffff811f199a>] do_vfs_ioctl+0x2ba/0x4a0
> [ 1638.897945] [<ffffffff811dd9ac>] ? vfs_read+0x11c/0x130
> [ 1638.904069] [<ffffffff811f27bd>] ? poll_select_set_timeout+0x4d/0x70
> [ 1638.911417] [<ffffffff811f1bf9>] SyS_ioctl+0x79/0x90
> [ 1638.917295] [<ffffffff816a64b2>] entry_SYSCALL_64_fastpath+0x16/0x7a
This might maybe help:
http://lkml.kernel.org/r/[email protected]
On Thu, 2 Jul 2015, Peter Zijlstra wrote:
> On Thu, Jul 02, 2015 at 11:42:43AM -0400, Vince Weaver wrote:
> > (#4)
> >
> > [ 1638.702578] WARNING: CPU: 2 PID: 25543 at arch/x86/kernel/cpu/perf_event_intel_lbr.c:349 intel_pmu_lbr_disable+0xbc/0xd0()
> >
> > WARN_ON_ONCE(cpuc->lbr_users < 0);
> >
> > [ 1638.857921] [<ffffffff8102e56c>] intel_pmu_lbr_disable+0xbc/0xd0
> > [ 1638.864874] [<ffffffff81031812>] intel_pmu_disable_event+0xc2/0x130
> > [ 1638.872125] [<ffffffff81030a49>] ? __intel_pmu_disable_all+0x49/0x70
> > [ 1638.879455] [<ffffffff81029175>] x86_pmu_stop+0x75/0xd0
> > [ 1638.885560] [<ffffffff81159f1a>] perf_ioctl+0x40a/0x470
> > [ 1638.891650] [<ffffffff811f199a>] do_vfs_ioctl+0x2ba/0x4a0
> > [ 1638.897945] [<ffffffff811dd9ac>] ? vfs_read+0x11c/0x130
> > [ 1638.904069] [<ffffffff811f27bd>] ? poll_select_set_timeout+0x4d/0x70
> > [ 1638.911417] [<ffffffff811f1bf9>] SyS_ioctl+0x79/0x90
> > [ 1638.917295] [<ffffffff816a64b2>] entry_SYSCALL_64_fastpath+0x16/0x7a
>
> This might maybe help:
>
> http://lkml.kernel.org/r/[email protected]
ah, OK, I missed that one somehow. It would also explain the
[ 3971.156004] Uhhuh. NMI received for unknown reason 21 on CPU 2.
[ 3971.162812] Do you have a strange power saving mode enabled?
[ 3971.169363] Dazed and confused, but trying to continue
messages I am getting.
I'm also still getting a lot of
perfevents: irq loop stuck!
messages, I thought the workaround for that had gone in for 4.2 but I
guess not.
Vince
* Vince Weaver <[email protected]> wrote:
> I'm also still getting a lot of
> perfevents: irq loop stuck!
> messages, I thought the workaround for that had gone in for 4.2 but I
> guess not.
Hm, so I was waiting for your feedback regarding the precise period cutoff to use,
and I guess that's where the patch got lost.
Does the value of 2 below work for you?
Also I bet we'd need the workaround on a lot more CPU models as well, I sometimes
see that warning on an early Nehalem prototype, model 26 (Nehalem-EP).
So my guess is that everything Nehalem and later is affected, i.e. NHM, WSM, SNB,
IVB and HSW:
case 30: /* 45nm Nehalem */
case 26: /* 45nm Nehalem-EP */
case 46: /* 45nm Nehalem-EX */
case 37: /* 32nm Westmere */
case 44: /* 32nm Westmere-EP */
case 47: /* 32nm Westmere-EX */
case 42: /* 32nm SandyBridge */
case 45: /* 32nm SandyBridge-E/EN/EP */
case 58: /* 22nm IvyBridge */
case 62: /* 22nm IvyBridge-EP/EX */
case 60: /* 22nm Haswell Core */
case 63: /* 22nm Haswell Server */
case 69: /* 22nm Haswell ULT */
case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */
Has anyone ever seen that warning on Broadwell and later Intel CPUs?
Thanks,
Ingo
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 960e85de13fb..26b13ea8299c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2479,6 +2479,15 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
return c;
}
+/*
+ * Really short periods might create infinite PMC NMI loops on Haswell,
+ * so don't allow a period of 1. There's no official erratum for this AFAIK.
+ */
+static unsigned int hsw_limit_period(struct perf_event *event, unsigned int left)
+{
+ return max(left, 2U);
+}
+
/*
* Broadwell:
@@ -2495,7 +2504,7 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
* Therefore the effective (average) period matches the requested period,
* despite coarser hardware granularity.
*/
-static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
+static unsigned int bdw_limit_period(struct perf_event *event, unsigned left)
{
if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
X86_CONFIG(.event=0xc0, .umask=0x01)) {
@@ -3265,6 +3274,7 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
x86_pmu.cpu_events = hsw_events_attrs;
+ x86_pmu.limit_period = hsw_limit_period;
x86_pmu.lbr_double_abort = true;
pr_cont("Haswell events, ");
break;
On Thu, Jul 2, 2015 at 7:59 PM, Vince Weaver <[email protected]> wrote:
>
> On Thu, 2 Jul 2015, Peter Zijlstra wrote:
>
> > On Thu, Jul 02, 2015 at 11:42:43AM -0400, Vince Weaver wrote:
> > > (#4)
> > >
> > > [ 1638.702578] WARNING: CPU: 2 PID: 25543 at arch/x86/kernel/cpu/perf_event_intel_lbr.c:349 intel_pmu_lbr_disable+0xbc/0xd0()
> > >
> > > WARN_ON_ONCE(cpuc->lbr_users < 0);
> > >
> > > [ 1638.857921] [<ffffffff8102e56c>] intel_pmu_lbr_disable+0xbc/0xd0
> > > [ 1638.864874] [<ffffffff81031812>] intel_pmu_disable_event+0xc2/0x130
> > > [ 1638.872125] [<ffffffff81030a49>] ? __intel_pmu_disable_all+0x49/0x70
> > > [ 1638.879455] [<ffffffff81029175>] x86_pmu_stop+0x75/0xd0
> > > [ 1638.885560] [<ffffffff81159f1a>] perf_ioctl+0x40a/0x470
> > > [ 1638.891650] [<ffffffff811f199a>] do_vfs_ioctl+0x2ba/0x4a0
> > > [ 1638.897945] [<ffffffff811dd9ac>] ? vfs_read+0x11c/0x130
> > > [ 1638.904069] [<ffffffff811f27bd>] ? poll_select_set_timeout+0x4d/0x70
> > > [ 1638.911417] [<ffffffff811f1bf9>] SyS_ioctl+0x79/0x90
> > > [ 1638.917295] [<ffffffff816a64b2>] entry_SYSCALL_64_fastpath+0x16/0x7a
> >
> > This might maybe help:
> >
> > http://lkml.kernel.org/r/[email protected]
>
> ah, OK, I missed that one somehow. It would also explain the
>
> [ 3971.156004] Uhhuh. NMI received for unknown reason 21 on CPU 2.
> [ 3971.162812] Do you have a strange power saving mode enabled?
> [ 3971.169363] Dazed and confused, but trying to continue
>
> messages I am getting.
>
>
> I'm also still getting a lot of
> perfevents: irq loop stuck!
Last I saw this was on WSM. I also suspect this may not happen if the
event is only measured at the user level.
> messages, I thought the workaround for that had gone in for 4.2 but I
> guess not.
>
> Vince
On Thu, 2 Jul 2015, Vince Weaver wrote:
> On Thu, 2 Jul 2015, Peter Zijlstra wrote:
>
> > On Thu, Jul 02, 2015 at 11:42:43AM -0400, Vince Weaver wrote:
> > > (#4)
> > >
> > > [ 1638.702578] WARNING: CPU: 2 PID: 25543 at arch/x86/kernel/cpu/perf_event_intel_lbr.c:349 intel_pmu_lbr_disable+0xbc/0xd0()
> > >
> > > WARN_ON_ONCE(cpuc->lbr_users < 0);
> > >
> > > [ 1638.857921] [<ffffffff8102e56c>] intel_pmu_lbr_disable+0xbc/0xd0
> > > [ 1638.864874] [<ffffffff81031812>] intel_pmu_disable_event+0xc2/0x130
> > > [ 1638.872125] [<ffffffff81030a49>] ? __intel_pmu_disable_all+0x49/0x70
> > > [ 1638.879455] [<ffffffff81029175>] x86_pmu_stop+0x75/0xd0
> > > [ 1638.885560] [<ffffffff81159f1a>] perf_ioctl+0x40a/0x470
> > > [ 1638.891650] [<ffffffff811f199a>] do_vfs_ioctl+0x2ba/0x4a0
> > > [ 1638.897945] [<ffffffff811dd9ac>] ? vfs_read+0x11c/0x130
> > > [ 1638.904069] [<ffffffff811f27bd>] ? poll_select_set_timeout+0x4d/0x70
> > > [ 1638.911417] [<ffffffff811f1bf9>] SyS_ioctl+0x79/0x90
> > > [ 1638.917295] [<ffffffff816a64b2>] entry_SYSCALL_64_fastpath+0x16/0x7a
> >
> > This might maybe help:
> >
> > http://lkml.kernel.org/r/[email protected]
I applied that patch and while the lbr_disable warning hasn't re-appeared
I am still getting a lot of NMI dazed and confused messages.
Though that might be normal, I think I've always gotten NMI dazed and
confused messages (although a bit more frequently on the core2 box than
the haswell one).
Vince
On Thu, 2 Jul 2015, Ingo Molnar wrote:
>
> * Vince Weaver <[email protected]> wrote:
>
> > I'm also still getting a lot of
> > perfevents: irq loop stuck!
> > messages, I thought the workaround for that had gone in for 4.2 but I
> > guess not.
>
> Hm, so I was waiting for your feedback regarding the precise period cutoff to use,
> and I guess that's where the patch got lost.
>
> Does the value of 2 below work for you?
I still get the messages when using 2. I'll see if I can narrow down the
cutoff. I know the messages definitely go away when using 128.
Vince
On Fri, 3 Jul 2015, Vince Weaver wrote:
> On Thu, 2 Jul 2015, Ingo Molnar wrote:
>
> >
> > * Vince Weaver <[email protected]> wrote:
> >
> > > I'm also still getting a lot of
> > > perfevents: irq loop stuck!
> > > messages, I thought the workaround for that had gone in for 4.2 but I
> > > guess not.
> >
> > Hm, so I was waiting for your feedback regarding the precise period cutoff to use,
> > and I guess that's where the patch got lost.
> >
> > Does the value of 2 below work for you?
>
> I still get the messages when using 2. I'll see if I can narrow down the
> cutoff. I know the messages definitely go away when using 128.
>From what I can tell, a value of 2 will still generate "irq loop stuck",
but 3 appears to be safe, but I'm still fuzzing to be sure. Previously I
ran a kernel with 5 for a while and that also did not generate any
messages.
Vince