From: Stephane Eranian
Date: Mon, 28 Nov 2016 11:41:12 -0800
Subject: Re: [PATCH] perf/x86: fix event counter update issue
To: "Liang, Kan"
Cc: Peter Zijlstra, "mingo@redhat.com", LKML, Alexander Shishkin, "ak@linux.intel.com", "Odzioba, Lukasz"
In-Reply-To: <1480361206-1702-1-git-send-email-kan.liang@intel.com>
References: <1480361206-1702-1-git-send-email-kan.liang@intel.com>

On Mon, Nov 28, 2016 at 11:26 AM, Kan Liang <kan.liang@intel.com> wrote:
>
> From: Kan Liang <kan.liang@intel.com>
>
> Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update")
>
> The bug can easily be reproduced by running perf stat on a loop program
> on KNL/SLM. The loop program is just "for (i = 0; i < loops; i++);",
> where "loops" is set to a different value in each test.
>
> Test 1:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 60000000000
> 451739576564,,cycles,313694125185,100.00
> Test 2:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 100000000000
> 18446743727079256064,,cycles,313694125185,100.00
> Test 3:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 200000000000
> 406240041973,,cycles,313694125185,100.00
>
> Only test 1 has a correct result. The result of test 2 is too big, and
> that of test 3 is too small.
>
> The bug happens more frequently on KNL/SLM due to their narrower
> counters, but it should be a generic issue on all platforms.
>
> The issue is caused by mistakenly calculating the counter increment
> (delta) in x86_perf_event_update() in some cases.
>
> Here, all the possible failure cases are listed.
> Terms:
> - new: the current PMU counter value, as read via rdpmcl().
> - prev: the previous counter value, as stored in &hwc->prev_count.
> - in PMI/not in PMI: whether the event update happens in the PMI
>   handler or not.
>
> Current code to calculate delta:
>   delta = (new << shift) - (prev << shift);
>   delta >>= shift;
>
> Case A: Not in PMI. new > prev, but delta is negative.
> This is the failure case of test 2.
> delta is of type s64, while new and prev are u64. If new is big
> enough, bit 63 of (new << shift) - (prev << shift) may still be set,
> so after the conversion to s64 the sign bit is set. Since delta is
> s64, the right shift is then arithmetic, which copies the sign bit
> into the vacated upper bit positions of delta.
> It can be fixed by adding the max count value.
>
> Here is the real data of test 2 on KNL:
> new = aea96e1927
> prev = ffffff0000000001

How can new be so large here? I mean, between prev and new the counter
wrapped around, and it took that many occurrences (of cycles) to handle
the overflow?

>
> delta = aea96e1927000000 - 1000000 = aea96e1926000000
> aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
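
To make Case A concrete, here is a minimal user-space sketch (an
illustration only, not kernel code) that replays the arithmetic with the
numbers above, assuming KNL's 40-bit counters, i.e. shift = 64 - 40 = 24:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* values from the test 2 failure above, 40-bit counters */
	uint64_t new  = 0xaea96e1927ULL;
	uint64_t prev = 0xffffff0000000001ULL;
	int shift = 64 - 40;		/* 64 - x86_pmu.cntval_bits */

	int64_t delta = (new << shift) - (prev << shift);
	delta >>= shift;		/* arithmetic shift: sign-extends */

	/* prints delta = 0xffffffaea96e1926, i.e. a negative delta */
	printf("delta = 0x%llx\n", (unsigned long long)delta);
	return 0;
}

Bit 63 of the shifted difference is set, so the arithmetic right shift
sign-extends it; added into event->count and read back as a u64, such a
negative delta shows up as a huge value like the one in test 2.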
> Case B: In PMI. new > prev. delta is positive.
> This is the failure case of test 3.
> The PMI is triggered by an overflow, but there is latency between the
> overflow and the PMI handler, so the counter has only counted a small
> amount past zero by the time it is read. The current calculation then
> loses the max count value.
>
> Case C: In PMI. new < prev. delta is negative.
> The PMU counter may start from a big value, e.g. when the fixed
> period is small.
> It can be fixed by adding the max count value.

I am not sure I understand why the PMI should matter here. What matters
is prev vs. current and the wrap-around. Please explain. Thanks.

>
> There is one remaining case in which delta could still be wrong: when
> the event update happens in the gap between the overflow and the PMI,
> i.e. not in the PMI, with new > prev and a positive delta. The current
> calculation may lose the max count value there as well, but this case
> rarely happens, so it is not handled here.
>
> Reported-by: Lukasz Odzioba
> Signed-off-by: Kan Liang
> Tested-by: Lukasz Odzioba
> ---
>  arch/x86/events/core.c       | 23 +++++++++++++++++++----
>  arch/x86/events/intel/core.c |  4 ++--
>  arch/x86/events/intel/p4.c   |  2 +-
>  arch/x86/events/perf_event.h |  2 +-
>  4 files changed, 23 insertions(+), 8 deletions(-)
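
To see Case B and the proposed correction at work, here is a sketch on a
40-bit counter (prev is reused from the real data above; new is a
made-up value that depends on the PMI latency):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* the counter started at 1 (period 2^40 - 1), wrapped at 2^40,
	 * and the PMI handler reads a small value shortly after */
	uint64_t prev = 0xffffff0000000001ULL;	/* sign-extended -(2^40 - 1) */
	uint64_t new  = 0x123ULL;		/* hypothetical in-PMI read */
	uint64_t cntval_mask = (1ULL << 40) - 1;
	int shift = 64 - 40;
	int pmi = 1;

	int64_t delta = (new << shift) - (prev << shift);
	delta >>= shift;			/* 0x122: one full wrap lost */

	if (((delta > 0) && pmi) || (delta < 0))
		delta += cntval_mask + 1;	/* restore the 2^40 wrap */

	/* prints delta = 0x10000000122 */
	printf("delta = 0x%llx\n", (unsigned long long)delta);
	return 0;
}

The uncorrected delta, 0x122, has lost exactly one full wrap (2^40);
because the read happens in the PMI and delta is positive, adding
cntval_mask + 1 recovers the true count.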
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 6c3b0ef..c351ac4 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
>   * Can only be executed on the CPU where the event is active.
>   * Returns the delta events processed.
>   */
> -u64 x86_perf_event_update(struct perf_event *event)
> +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  	int shift = 64 - x86_pmu.cntval_bits;
> @@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
>  	delta = (new_raw_count << shift) - (prev_raw_count << shift);
>  	delta >>= shift;
>
> +	/*
> +	 * Correct delta for special cases caused by late PMI handling.
> +	 * Case 1: The PMU counter may start from a big value.
> +	 *	   In the PMI, new < prev and delta is negative.
> +	 * Case 2: new is so big that it affects the sign bit.
> +	 *	   delta will be negative after the arithmetic right shift.
> +	 * Case 3: In the PMI, new > prev.
> +	 *	   new - prev loses the max count value.
> +	 *
> +	 * An event update may also happen in the gap between overflow
> +	 * and PMI, but that rarely happens and is not handled here.
> +	 */
> +	if (((delta > 0) && pmi) || (delta < 0))
> +		delta += x86_pmu.cntval_mask + 1;
> +
>  	local64_add(delta, &event->count);
>  	local64_sub(delta, &hwc->period_left);
>
> @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
>  		 * Drain the remaining delta count out of a event
>  		 * that we are disabling:
>  		 */
> -		x86_perf_event_update(event);
> +		x86_perf_event_update(event, (flags == 0));
>  		hwc->state |= PERF_HES_UPTODATE;
>  	}
>  }
> @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
>
>  		event = cpuc->events[idx];
>
> -		val = x86_perf_event_update(event);
> +		val = x86_perf_event_update(event, true);
>  		if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
>  			continue;
>
> @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
>
>  static inline void x86_pmu_read(struct perf_event *event)
>  {
> -	x86_perf_event_update(event);
> +	x86_perf_event_update(event, false);
>  }
>
>  /*
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2db..69d65e6 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
>  	for (i = 0; i < 4; i++) {
>  		event = cpuc->events[i];
>  		if (event)
> -			x86_perf_event_update(event);
> +			x86_perf_event_update(event, false);
>  	}
>
>  	for (i = 0; i < 4; i++) {
> @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
>   */
>  int intel_pmu_save_and_restart(struct perf_event *event)
>  {
> -	x86_perf_event_update(event);
> +	x86_perf_event_update(event, true);
>  	/*
>  	 * For a checkpointed counter always reset back to 0. This
>  	 * avoids a situation where the counter overflows, aborts the
> diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
> index eb05335..8117de8 100644
> --- a/arch/x86/events/intel/p4.c
> +++ b/arch/x86/events/intel/p4.c
> @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
>  		/* it might be unflagged overflow */
>  		overflow = p4_pmu_clear_cccr_ovf(hwc);
>
> -		val = x86_perf_event_update(event);
> +		val = x86_perf_event_update(event, true);
>  		if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
>  			continue;
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index c6b25ac..09c9db2 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
>  				[PERF_COUNT_HW_CACHE_OP_MAX]
>  				[PERF_COUNT_HW_CACHE_RESULT_MAX];
>
> -u64 x86_perf_event_update(struct perf_event *event);
> +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
>
>  static inline unsigned int x86_pmu_config_addr(int index)
>  {
> --
> 2.4.3
>
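
For completeness, Case C can be modeled the same way. The numbers below
are made up: they assume a period close to half the 40-bit counter
range, which is what makes the in-PMI delta come out negative:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* period = 2^39 - 1, so the counter starts at 0x8000000001;
	 * it wraps at 2^40 and the PMI handler reads 0x5 */
	uint64_t prev = 0xffffff8000000001ULL;	/* sign-extended -(2^39 - 1) */
	uint64_t new  = 0x5ULL;			/* hypothetical in-PMI read */
	uint64_t cntval_mask = (1ULL << 40) - 1;
	int shift = 64 - 40;
	int pmi = 1;

	int64_t delta = (new << shift) - (prev << shift);
	delta >>= shift;		/* 0xffffff8000000004: negative */

	if (((delta > 0) && pmi) || (delta < 0))
		delta += cntval_mask + 1;	/* -> 0x8000000004 */

	/* true count = (2^39 - 1) + 5 = 0x8000000004 */
	printf("delta = 0x%llx\n", (unsigned long long)delta);
	return 0;
}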