Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754457AbcK1T1X (ORCPT ); Mon, 28 Nov 2016 14:27:23 -0500 Received: from mga07.intel.com ([134.134.136.100]:46597 "EHLO mga07.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752456AbcK1T1P (ORCPT ); Mon, 28 Nov 2016 14:27:15 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.31,564,1473145200"; d="scan'208";a="1091652069" From: kan.liang@intel.com To: peterz@infradead.org, mingo@redhat.com, linux-kernel@vger.kernel.org Cc: eranian@google.com, alexander.shishkin@linux.intel.com, ak@linux.intel.com, lukasz.odzioba@intel.com, Kan Liang Subject: [PATCH] perf/x86: fix event counter update issue Date: Mon, 28 Nov 2016 11:26:46 -0800 Message-Id: <1480361206-1702-1-git-send-email-kan.liang@intel.com> X-Mailer: git-send-email 2.4.3 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6998 Lines: 189 From: Kan Liang Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update") A bug can be easily reproduced by perf stat loop program on KNL/SLM. The loop program is just "for (i=0;iprev_count. - in PMI/not in PMI: the event update happens in PMI handler or not. Current code to calculate delta: delta = (new << shift) - (prev << shift); delta >>= shift; Case A: Not in PMI. new > prev. But delta is negative. That's the failure case of Test 2. delta is s64 type. new and prev are u64 type. If the new is big enough, after doing left shift and sub, the bit 64 of the result may still be 1. After converting to s64, the sign flag will be set. Since delta is s64 type, arithmetic right shift is applied, which copy the sign flag into empty bit positions on the upper end of delta. It can be fixed by adding the max count value. Here is the real data for test2 on KNL. new = aea96e1927 prev = ffffff0000000001 delta = aea96e1927000000 - 1000000 = aea96e1926000000 aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta Case B: In PMI. new > prev. delta is positive. That's the failure case of Test 3. The PMI is triggered by overflow. But there is latency between overflow and PMI handler. So new has small amount. Current calculation lose the max count value. Case C: In PMI. new < prev. delta is negative. The PMU counter may be start from a big value. E.g. the fixed period is small. It can be fixed by adding the max count value. There is still a case which delta could be wrong. The case is that event update just happens in PMI and overflow gap. It's not in PMI, new > prev, and delta is positive. Current calculation may lose the max count value. But this case rarely happens. So the rare case doesn't handle here. Reported-by: Lukasz Odzioba Signed-off-by: Kan Liang Tested-by: Lukasz Odzioba --- arch/x86/events/core.c | 23 +++++++++++++++++++---- arch/x86/events/intel/core.c | 4 ++-- arch/x86/events/intel/p4.c | 2 +- arch/x86/events/perf_event.h | 2 +- 4 files changed, 23 insertions(+), 8 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 6c3b0ef..c351ac4 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs * Can only be executed on the CPU where the event is active. * Returns the delta events processed. */ -u64 x86_perf_event_update(struct perf_event *event) +u64 x86_perf_event_update(struct perf_event *event, bool pmi) { struct hw_perf_event *hwc = &event->hw; int shift = 64 - x86_pmu.cntval_bits; @@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event) delta = (new_raw_count << shift) - (prev_raw_count << shift); delta >>= shift; + /* + * Correct delta for special cases caused by the late PMI handle. + * Case1: PMU counter may be start from a big value. + * In PMI, new < prev. delta is negative. + * Case2: new is big enough which impact the sign flag. + * The delta will be negative after arithmetic right shift. + * Case3: In PMI, new > prev. + * The new - prev lose the max count value. + * + * There may be event update in PMI and overflow gap, + * but it rarely happen. The rare case doesn't handle here. + */ + if (((delta > 0) && pmi) || (delta < 0)) + delta += x86_pmu.cntval_mask + 1; + local64_add(delta, &event->count); local64_sub(delta, &hwc->period_left); @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags) * Drain the remaining delta count out of a event * that we are disabling: */ - x86_perf_event_update(event); + x86_perf_event_update(event, (flags == 0)); hwc->state |= PERF_HES_UPTODATE; } } @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs) event = cpuc->events[idx]; - val = x86_perf_event_update(event); + val = x86_perf_event_update(event, true); if (val & (1ULL << (x86_pmu.cntval_bits - 1))) continue; @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events); static inline void x86_pmu_read(struct perf_event *event) { - x86_perf_event_update(event); + x86_perf_event_update(event, false); } /* diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index a74a2db..69d65e6 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void) for (i = 0; i < 4; i++) { event = cpuc->events[i]; if (event) - x86_perf_event_update(event); + x86_perf_event_update(event, false); } for (i = 0; i < 4; i++) { @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event) */ int intel_pmu_save_and_restart(struct perf_event *event) { - x86_perf_event_update(event); + x86_perf_event_update(event, true); /* * For a checkpointed counter always reset back to 0. This * avoids a situation where the counter overflows, aborts the diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c index eb05335..8117de8 100644 --- a/arch/x86/events/intel/p4.c +++ b/arch/x86/events/intel/p4.c @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs) /* it might be unflagged overflow */ overflow = p4_pmu_clear_cccr_ovf(hwc); - val = x86_perf_event_update(event); + val = x86_perf_event_update(event, true); if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1)))) continue; diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index c6b25ac..09c9db2 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs [PERF_COUNT_HW_CACHE_OP_MAX] [PERF_COUNT_HW_CACHE_RESULT_MAX]; -u64 x86_perf_event_update(struct perf_event *event); +u64 x86_perf_event_update(struct perf_event *event, bool pmi); static inline unsigned int x86_pmu_config_addr(int index) { -- 2.4.3