From: Kan Liang <[email protected]>
This patch fixes a PEBS warning found in my stress test. The following
perf command can easily trigger the PEBS warning or spurious NMI errors
on Skylake/Broadwell/Haswell platforms:
  sudo perf record \
      -e 'cpu/umask=0x04,event=0xc4/pp,cycles,branches,ref-cycles,cache-misses,cache-references' \
      --call-graph fp -b -c1000 -a
The NMI watchdog must also be enabled.
In this case, there are more events than counters, so perf has to
multiplex them. perf_mux_hrtimer_handler() calls perf_pmu_disable(),
schedules out the old events, rotates the context, schedules in the new
events, and finally calls perf_pmu_enable(). If the old events include
a precise event, MSR_IA32_PEBS_ENABLE should be cleared by
perf_pmu_disable() and should stay 0 until perf_pmu_enable() is called,
and then only be set again if the new events include a precise event.
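For reference, the rotation path looks roughly like this (a simplified
sketch of the kernel/events/core.c flow, with arguments and details
elided, not the exact code):

	/* from perf_mux_hrtimer_handler() -> perf_rotate_context() */
	perf_pmu_disable(pmu);     /* PEBS_ENABLE should be cleared here   */
	ctx_sched_out(...);        /* old (possibly precise) events go out */
	rotate_ctx(...);
	perf_event_sched_in(...);  /* new events come in                   */
	perf_pmu_enable(pmu);      /* PEBS_ENABLE set only for precise events */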
However, there is a corner case which can restore PEBS_ENABLE to a
stale value during this window. In perf_pmu_disable(), GLOBAL_CTRL is
cleared to stop counter overflows and further PMIs. But a PMI from an
earlier overflow may already be pending and cannot be stopped, so even
with GLOBAL_CTRL cleared the kernel can still take a PMI. At the end of
the PMI handler, __intel_pmu_enable_all() is called, which restores the
stale values if the old events have not been scheduled out yet.
Once the stale PEBS value has been written, it cannot be corrected if
the new events are all non-precise, because cpuc->pebs_enabled is then
0 and x86_pmu.enable_all ignores the MSR_IA32_PEBS_ENABLE setting. As a
result, a subsequent NMI with the stale PEBS_ENABLE triggers the PEBS
warning.
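To illustrate, the enable path looks roughly like this (a condensed
sketch of the arch/x86/events/intel code of this era, not verbatim):

	static void __intel_pmu_enable_all(int added, bool pmi)
	{
		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

		/* runs unconditionally at the end of the PMI handler */
		intel_pmu_pebs_enable_all();
		...
	}

	void intel_pmu_pebs_enable_all(void)
	{
		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

		/*
		 * With only non-precise events scheduled in, pebs_enabled
		 * is 0, the write is skipped, and a stale value left in
		 * MSR_IA32_PEBS_ENABLE is never cleaned up.
		 */
		if (cpuc->pebs_enabled)
			wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
	}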
This patch checks cpuc->enabled in the PMI handler and only restores
the registers when the PMU is active.
Here is the dump:
[ 113.452176] Call Trace:
[ 113.452178] <NMI> [<ffffffff813c3a2e>] dump_stack+0x63/0x85
[ 113.452188] [<ffffffff810a46f2>] warn_slowpath_common+0x82/0xc0
[ 113.452190] [<ffffffff810a483a>] warn_slowpath_null+0x1a/0x20
[ 113.452193] [<ffffffff8100fe2e>]
intel_pmu_drain_pebs_nhm+0x2be/0x320
[ 113.452197] [<ffffffff8100caa9>] intel_pmu_handle_irq+0x279/0x460
[ 113.452204] [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
[ 113.452208] [<ffffffff811f290d>] ? vunmap_page_range+0x20d/0x330
[ 113.452211] [<ffffffff811f2f11>] ?
unmap_kernel_range_noflush+0x11/0x20
[ 113.452216] [<ffffffff8148379f>] ? ghes_copy_tofrom_phys+0x10f/0x2a0
[ 113.452218] [<ffffffff814839c8>] ? ghes_read_estatus+0x98/0x170
[ 113.452224] [<ffffffff81005a7d>] perf_event_nmi_handler+0x2d/0x50
[ 113.452230] [<ffffffff810310b9>] nmi_handle+0x69/0x120
[ 113.452233] [<ffffffff810316f6>] default_do_nmi+0xe6/0x100
[ 113.452236] [<ffffffff810317f2>] do_nmi+0xe2/0x130
[ 113.452240] [<ffffffff817aea71>] end_repeat_nmi+0x1a/0x1e
[ 113.452243] [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
[ 113.452246] [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
[ 113.452249] [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
[ 113.452250] <<EOE>> <IRQ> [<ffffffff81006df8>] ?
x86_perf_event_set_period+0xd8/0x180
[ 113.452255] [<ffffffff81006eec>] x86_pmu_start+0x4c/0x100
[ 113.452258] [<ffffffff8100722d>] x86_pmu_enable+0x28d/0x300
[ 113.452263] [<ffffffff811994d7>] perf_pmu_enable.part.81+0x7/0x10
[ 113.452267] [<ffffffff8119cb70>]
perf_mux_hrtimer_handler+0x200/0x280
[ 113.452270] [<ffffffff8119c970>] ?
__perf_install_in_context+0xc0/0xc0
[ 113.452273] [<ffffffff8110f92d>] __hrtimer_run_queues+0xfd/0x280
[ 113.452276] [<ffffffff811100d8>] hrtimer_interrupt+0xa8/0x190
[ 113.452278] [<ffffffff81199080>] ?
__perf_read_group_add.part.61+0x1a0/0x1a0
[ 113.452283] [<ffffffff81051bd8>]
local_apic_timer_interrupt+0x38/0x60
[ 113.452286] [<ffffffff817af01d>] smp_apic_timer_interrupt+0x3d/0x50
[ 113.452290] [<ffffffff817ad15c>] apic_timer_interrupt+0x8c/0xa0
[ 113.452291] <EOI> [<ffffffff81199080>] ?
__perf_read_group_add.part.61+0x1a0/0x1a0
[ 113.452298] [<ffffffff81123de5>] ?
smp_call_function_single+0xd5/0x130
[ 113.452300] [<ffffffff81123ddb>] ?
smp_call_function_single+0xcb/0x130
[ 113.452303] [<ffffffff81199080>] ?
__perf_read_group_add.part.61+0x1a0/0x1a0
[ 113.452306] [<ffffffff8119765a>] event_function_call+0x10a/0x120
[ 113.452308] [<ffffffff8119c660>] ? ctx_resched+0x90/0x90
[ 113.452311] [<ffffffff811971e0>] ? cpu_clock_event_read+0x30/0x30
[ 113.452313] [<ffffffff811976d0>] ? _perf_event_disable+0x60/0x60
[ 113.452315] [<ffffffff8119772b>] _perf_event_enable+0x5b/0x70
[ 113.452318] [<ffffffff81197388>] perf_event_for_each_child+0x38/0xa0
[ 113.452320] [<ffffffff811976d0>] ? _perf_event_disable+0x60/0x60
[ 113.452322] [<ffffffff811a0ffd>] perf_ioctl+0x12d/0x3c0
[ 113.452326] [<ffffffff8134d855>] ? selinux_file_ioctl+0x95/0x1e0
[ 113.452330] [<ffffffff8124a3a1>] do_vfs_ioctl+0xa1/0x5a0
[ 113.452334] [<ffffffff81036d29>] ? sched_clock+0x9/0x10
[ 113.452336] [<ffffffff8124a919>] SyS_ioctl+0x79/0x90
[ 113.452338] [<ffffffff817ac4b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[ 113.452340] ---[ end trace aef202839fe9a71d ]---
[ 113.452611] Uhhuh. NMI received for unknown reason 2d on CPU 2.
[ 113.453880] Do you have a strange power saving mode enabled?
Signed-off-by: Kan Liang <[email protected]>
---
Changes since V1:
- Update the corner-case description in the changelog
- Drop the approach of always updating the MSR when disabling a PEBS
  event
arch/x86/events/intel/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a7ec685..6e95da0 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1929,7 +1929,9 @@ again:
goto again;
done:
- __intel_pmu_enable_all(0, true);
+ /* Only restore pmu when it's active. */
+ if (cpuc->enabled != 0)
+ __intel_pmu_enable_all(0, true);
/*
* Only unmask the NMI after the overflow counters
* have been reset. This avoids spurious NMIs on
--
2.4.3
On Wed, Mar 02, 2016 at 05:03:53PM -0500, [email protected] wrote:
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a7ec685..6e95da0 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -1929,7 +1929,9 @@ again:
> goto again;
>
> done:
> - __intel_pmu_enable_all(0, true);
> + /* Only restore pmu when it's active. */
> + if (cpuc->enabled != 0)
> + __intel_pmu_enable_all(0, true);
> /*
> * Only unmask the NMI after the overflow counters
> * have been reset. This avoids spurious NMIs on
Very good catch! That might actually explain some of the weirdness
I've seen from the fuzzer but which was too sporadic to really pin down
yet.
So that got 'introduced' by:
3fb2b8ddcc6a ("perf, x86, Do not user perf_disable from NMI context")
which has a shit Changelog for not explaining the actual race, whoever
let me merge that.. :-)
I suspect the race is the NMI hitting between incrementing the
pmu->pmu_disable_count and setting the cpuc state in x86_pmu_disable().
A nested perf_pmu_disable() call would then see a !0 disable_count and
not call x86_pmu_disable() and not initialize the cpuc state.
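For reference, perf_pmu_disable() is roughly this (a sketch of the
kernel/events/core.c code of this era):

	void perf_pmu_disable(struct pmu *pmu)
	{
		int *count = this_cpu_ptr(pmu->pmu_disable_count);

		/*
		 * Only the outermost call reaches the arch callback; a
		 * nested call sees a non-zero count and never gets to
		 * x86_pmu_disable(), so cpuc->enabled stays untouched.
		 */
		if (!(*count)++)
			pmu->pmu_disable(pmu);
	}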
Now, your fix relies on a nested __intel_pmu_disable_all() not
changing state, which I think is true. The intel_bts_disable_local()
call does change state, but a nested call should not matter.
This very much needs a comment though.
There is also the fun issue of the PMI hitting after cpuc->enabled = 0
but before x86_pmu.disable_all(). I suppose this too is harmless, but
again, it very much needs a comment.
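For context, the window in question is roughly this (a condensed
sketch of x86_pmu_disable(), not verbatim):

	static void x86_pmu_disable(struct pmu *pmu)
	{
		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

		if (!cpuc->enabled)
			return;

		cpuc->n_added = 0;
		cpuc->enabled = 0;
		barrier();

		/*
		 * A PMI landing here already sees cpuc->enabled == 0, and
		 * with the proposed check would skip __intel_pmu_enable_all()
		 * even though disable_all() has not run yet.
		 */
		x86_pmu.disable_all();
	}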
Looking through the PMI handlers, the KNC driver suffers the same
problem (not to mention however many !x86 drivers copied this;
arm-xscale for example seems affected, as does the MIPS driver,
assuming they have actual NMIs).
So could you respin with comments at:

 - __intel_pmu_disable_all(), describing the requirement that
   consecutive calls should work and why.

 - x86_pmu_disable(), about the PMI landing after enabled = 0.

And fix the KNC driver too.