From: kan.liang@intel.com
To: peterz@infradead.org, linux-kernel@vger.kernel.org
Cc: ak@linux.intel.com, eranian@google.com, alexander.shishkin@linux.intel.com, vincent.weaver@maine.edu, Kan Liang
Subject: [PATCH V2 1/1] perf, x86: fix pebs warning by only restoring the active pmu in pmi
Date: Wed, 2 Mar 2016 17:03:53 -0500
Message-Id: <1456956233-3148-1-git-send-email-kan.liang@intel.com>

From: Kan Liang

This patch fixes a PEBS warning found in my stress test. The following
perf command can easily trigger the PEBS warning or a spurious NMI error
on Skylake/Broadwell/Haswell platforms:

  sudo perf record -e 'cpu/umask=0x04,event=0xc4/pp,cycles,branches,ref-cycles,cache-misses,cache-references' --call-graph fp -b -c1000 -a

The NMI watchdog must also be enabled.

In this case there are more events than counters, so perf has to
multiplex. perf_mux_hrtimer_handler calls perf_pmu_disable, schedules
out the old events, rotates the context, schedules in the new events,
and finally calls perf_pmu_enable. If the old events include a precise
event, MSR_IA32_PEBS_ENABLE is cleared by perf_pmu_disable and should
stay 0 until perf_pmu_enable is called for a new precise event. However,
there is a corner case in which a stale value can be restored into
PEBS_ENABLE during this window.
In perf_pmu_disable, GLOBAL_CTRL is set to 0 to stop overflows and
subsequent PMIs. But a PMI from an earlier overflow may still be
pending and cannot be stopped, so even with GLOBAL_CTRL cleared the
kernel can still take a PMI. At the end of the PMI handler,
__intel_pmu_enable_all is called, which restores the stale values if
the old events have not yet been scheduled out. Once the stale PEBS
value is written, it cannot be corrected if the new events are
non-precise, because pebs_enabled is then 0 and x86_pmu.enable_all
ignores the MSR_IA32_PEBS_ENABLE setting. As a result, the next NMI
arrives with a stale PEBS_ENABLE and triggers the PEBS warning.

This patch checks cpuc->enabled in the PMI handler and only restores
the registers when the PMU is active.

Here is the dump:

[  113.452176] Call Trace:
[  113.452178]  [] dump_stack+0x63/0x85
[  113.452188]  [] warn_slowpath_common+0x82/0xc0
[  113.452190]  [] warn_slowpath_null+0x1a/0x20
[  113.452193]  [] intel_pmu_drain_pebs_nhm+0x2be/0x320
[  113.452197]  [] intel_pmu_handle_irq+0x279/0x460
[  113.452204]  [] ? native_write_msr_safe+0x6/0x40
[  113.452208]  [] ? vunmap_page_range+0x20d/0x330
[  113.452211]  [] ? unmap_kernel_range_noflush+0x11/0x20
[  113.452216]  [] ? ghes_copy_tofrom_phys+0x10f/0x2a0
[  113.452218]  [] ? ghes_read_estatus+0x98/0x170
[  113.452224]  [] perf_event_nmi_handler+0x2d/0x50
[  113.452230]  [] nmi_handle+0x69/0x120
[  113.452233]  [] default_do_nmi+0xe6/0x100
[  113.452236]  [] do_nmi+0xe2/0x130
[  113.452240]  [] end_repeat_nmi+0x1a/0x1e
[  113.452243]  [] ? native_write_msr_safe+0x6/0x40
[  113.452246]  [] ? native_write_msr_safe+0x6/0x40
[  113.452249]  [] ? native_write_msr_safe+0x6/0x40
[  113.452250] <>  [] ? x86_perf_event_set_period+0xd8/0x180
[  113.452255]  [] x86_pmu_start+0x4c/0x100
[  113.452258]  [] x86_pmu_enable+0x28d/0x300
[  113.452263]  [] perf_pmu_enable.part.81+0x7/0x10
[  113.452267]  [] perf_mux_hrtimer_handler+0x200/0x280
[  113.452270]  [] ? __perf_install_in_context+0xc0/0xc0
[  113.452273]  [] __hrtimer_run_queues+0xfd/0x280
[  113.452276]  [] hrtimer_interrupt+0xa8/0x190
[  113.452278]  [] ? __perf_read_group_add.part.61+0x1a0/0x1a0
[  113.452283]  [] local_apic_timer_interrupt+0x38/0x60
[  113.452286]  [] smp_apic_timer_interrupt+0x3d/0x50
[  113.452290]  [] apic_timer_interrupt+0x8c/0xa0
[  113.452291]  [] ? __perf_read_group_add.part.61+0x1a0/0x1a0
[  113.452298]  [] ? smp_call_function_single+0xd5/0x130
[  113.452300]  [] ? smp_call_function_single+0xcb/0x130
[  113.452303]  [] ? __perf_read_group_add.part.61+0x1a0/0x1a0
[  113.452306]  [] event_function_call+0x10a/0x120
[  113.452308]  [] ? ctx_resched+0x90/0x90
[  113.452311]  [] ? cpu_clock_event_read+0x30/0x30
[  113.452313]  [] ? _perf_event_disable+0x60/0x60
[  113.452315]  [] _perf_event_enable+0x5b/0x70
[  113.452318]  [] perf_event_for_each_child+0x38/0xa0
[  113.452320]  [] ? _perf_event_disable+0x60/0x60
[  113.452322]  [] perf_ioctl+0x12d/0x3c0
[  113.452326]  [] ? selinux_file_ioctl+0x95/0x1e0
[  113.452330]  [] do_vfs_ioctl+0xa1/0x5a0
[  113.452334]  [] ? sched_clock+0x9/0x10
[  113.452336]  [] SyS_ioctl+0x79/0x90
[  113.452338]  [] entry_SYSCALL_64_fastpath+0x1a/0xa4
[  113.452340] ---[ end trace aef202839fe9a71d ]---
[  113.452611] Uhhuh. NMI received for unknown reason 2d on CPU 2.
[  113.453880] Do you have a strange power saving mode enabled?

Signed-off-by: Kan Liang
---
Changes since V1:
 - Update the corner-case description in the changelog
 - Drop the approach of always updating the MSR when disabling a PEBS
   event

 arch/x86/events/intel/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a7ec685..6e95da0 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1929,7 +1929,9 @@ again:
 		goto again;
 
 done:
-	__intel_pmu_enable_all(0, true);
+	/* Only restore pmu when it's active. */
+	if (cpuc->enabled != 0)
+		__intel_pmu_enable_all(0, true);
 	/*
 	 * Only unmask the NMI after the overflow counters
 	 * have been reset. This avoids spurious NMIs on
-- 
2.4.3