From: Stephane Eranian
Date: Mon, 28 Nov 2016 12:18:39 -0800
Subject: Re: [PATCH] perf/x86: fix event counter update issue
To: "Liang, Kan"
Cc: Peter Zijlstra, "mingo@redhat.com", LKML, Alexander Shishkin, "ak@linux.intel.com", "Odzioba, Lukasz"

On Mon, Nov 28, 2016 at 11:59 AM, Liang, Kan wrote:
>
>> >
>> > Here, all the possible failure cases are listed.
>> > Terms:
>> > - new: the current PMU counter value, read via rdpmcl().
>> > - prev: the previous counter value, stored in &hwc->prev_count.
>> > - in PMI/not in PMI: whether the event update happens in the PMI handler.
>> > The current code calculates delta as:
>> >         delta = (new << shift) - (prev << shift);
>> >         delta >>= shift;
>> >
>> > Case A: Not in PMI. new > prev, but delta is negative.
>> > That's the failure case of Test 2.
>> > delta is of type s64; new and prev are u64. If new is big enough, the
>> > top bit of the result may still be 1 after the left shifts and the
>> > subtraction.
>> > Converted to s64, that is the sign bit. Since delta is s64, the right
>> > shift is arithmetic, which copies the sign bit into the vacated upper
>> > bits of delta.
>> > It can be fixed by adding the max count value.
>> >
>> > Here is the real data for Test 2 on KNL:
>> > new = aea96e1927
>> > prev = ffffff0000000001
>>
>> How can new be so large here?
>> I mean, between prev and new, the counter wrapped around, and it took
>> that many occurrences (of cycles) to handle the overflow?
>
> There is no overflow in this case.
> The counter is 40 bits wide. The highest value that can be counted
> without overflow is 0xfffffffffe.
>
>> >
>> > delta = aea96e1927000000 - 1000000 = aea96e1926000000
>> > aea96e1926000000 >> 24 = ffffffaea96e1926  <-- negative delta
>> >
>> > Case B: In PMI. new > prev. delta is positive.
>> > That's the failure case of Test 3.
>> > The PMI is triggered by an overflow, but there is latency between the
>> > overflow and the PMI handler, so new is only a small value.
>> > The current calculation loses the max count value.
>> >
>> > Case C: In PMI. new < prev. delta is negative.
>> > The PMU counter may start from a big value, e.g. when the fixed
>> > period is small.
>> > It can be fixed by adding the max count value.
>> >
>> I am not sure I understand why the PMI should matter here. What matters
>> is prev vs. current and the wrap-around.
>> Please explain.
>> Thanks.
>
> Right, the PMI shouldn't matter here.
> We should add the max count value if delta is negative, no matter
> whether it's in the PMI.
>
That sounds right, but then why do you have that boolean (bool pmi) in
your patch?
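
Just to convince myself about the Case A arithmetic, I reproduced it in
a small standalone userspace sketch (my own approximation, not the
kernel code; it assumes the 40-bit KNL counter width, reuses the Test 2
values quoted above, and renames new to new_raw):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int shift = 64 - 40;                    /* assumed: KNL 40-bit counters */
        uint64_t cntval_mask = (1ULL << 40) - 1;
        uint64_t prev = 0xffffff0000000001ULL;  /* Test 2 values from above */
        uint64_t new_raw = 0xaea96e1927ULL;
        int64_t delta;

        /* same computation as the current x86_perf_event_update() */
        delta = (new_raw << shift) - (prev << shift);
        delta >>= shift;        /* arithmetic shift: the sign bit is copied down */
        printf("current: %llx\n", (long long)delta);

        if (delta < 0)
                delta += cntval_mask + 1;       /* correction for the negative case */
        printf("fixed:   %llx\n", (long long)delta);

        return 0;
}

It prints ffffffaea96e1926 for the current computation and aea96e1926
with the correction, which matches your numbers.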
> To fix this issue, I once had a list which included all the possible
> cases. "non-PMI, new < prev, delta is negative" is one of them, but it
> rarely happens, so I removed it and then forgot to update the
> description of Case C.
>
> Thanks,
> Kan
>
>>
>> >
>> > There is still a case in which delta could be wrong: when the event
>> > update happens in the gap between the overflow and the PMI.
>> > That is: not in the PMI, new > prev, and delta is positive. The
>> > current calculation may lose the max count value.
>> > But this case rarely happens, so it is not handled here.
>> >
>> > Reported-by: Lukasz Odzioba
>> > Signed-off-by: Kan Liang
>> > Tested-by: Lukasz Odzioba
>> > ---
>> >  arch/x86/events/core.c       | 23 +++++++++++++++++++----
>> >  arch/x86/events/intel/core.c |  4 ++--
>> >  arch/x86/events/intel/p4.c   |  2 +-
>> >  arch/x86/events/perf_event.h |  2 +-
>> >  4 files changed, 23 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> > index 6c3b0ef..c351ac4 100644
>> > --- a/arch/x86/events/core.c
>> > +++ b/arch/x86/events/core.c
>> > @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
>> >   * Can only be executed on the CPU where the event is active.
>> >   * Returns the delta events processed.
>> >   */
>> > -u64 x86_perf_event_update(struct perf_event *event)
>> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
>> >  {
>> >  	struct hw_perf_event *hwc = &event->hw;
>> >  	int shift = 64 - x86_pmu.cntval_bits;
>> > @@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
>> >  	delta = (new_raw_count << shift) - (prev_raw_count << shift);
>> >  	delta >>= shift;
>> >
>> > +	/*
>> > +	 * Correct delta for special cases caused by a late PMI handler.
>> > +	 * Case 1: the PMU counter may start from a big value.
>> > +	 *	   In the PMI, new < prev and delta is negative.
>> > +	 * Case 2: new is big enough to set the sign bit; delta will
>> > +	 *	   be negative after the arithmetic right shift.
>> > +	 * Case 3: in the PMI, new > prev;
>> > +	 *	   new - prev loses the max count value.
>> > +	 *
>> > +	 * An event update may also happen in the gap between the
>> > +	 * overflow and the PMI, but that rare case is not handled here.
>> > +	 */
>> > +	if (((delta > 0) && pmi) || (delta < 0))
>> > +		delta += x86_pmu.cntval_mask + 1;
>> > +
>> >  	local64_add(delta, &event->count);
>> >  	local64_sub(delta, &hwc->period_left);
>> >
>> > @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
>> >  		 * Drain the remaining delta count out of a event
>> >  		 * that we are disabling:
>> >  		 */
>> > -		x86_perf_event_update(event);
>> > +		x86_perf_event_update(event, (flags == 0));
>> >  		hwc->state |= PERF_HES_UPTODATE;
>> >  	}
>> >  }
>> > @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
>> >
>> >  		event = cpuc->events[idx];
>> >
>> > -		val = x86_perf_event_update(event);
>> > +		val = x86_perf_event_update(event, true);
>> >  		if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
>> >  			continue;
>> >
>> > @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
>> >
>> >  static inline void x86_pmu_read(struct perf_event *event)
>> >  {
>> > -	x86_perf_event_update(event);
>> > +	x86_perf_event_update(event, false);
>> >  }
>> >
>> >  /*
>> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> > index a74a2db..69d65e6 100644
>> > --- a/arch/x86/events/intel/core.c
>> > +++ b/arch/x86/events/intel/core.c
>> > @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
>> >  	for (i = 0; i < 4; i++) {
>> >  		event = cpuc->events[i];
>> >  		if (event)
>> > -			x86_perf_event_update(event);
>> > +			x86_perf_event_update(event, false);
>> >  	}
>> >
>> >  	for (i = 0; i < 4; i++) {
>> > @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
>> >   */
>> >  int intel_pmu_save_and_restart(struct perf_event *event)
>> >  {
>> > -	x86_perf_event_update(event);
>> > +	x86_perf_event_update(event, true);
>> >  	/*
>> >  	 * For a checkpointed counter always reset back to 0. This
>> >  	 * avoids a situation where the counter overflows, aborts the
>> > diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
>> > index eb05335..8117de8 100644
>> > --- a/arch/x86/events/intel/p4.c
>> > +++ b/arch/x86/events/intel/p4.c
>> > @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
>> >  		/* it might be unflagged overflow */
>> >  		overflow = p4_pmu_clear_cccr_ovf(hwc);
>> >
>> > -		val = x86_perf_event_update(event);
>> > +		val = x86_perf_event_update(event, true);
>> >  		if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
>> >  			continue;
>> >
>> > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
>> > index c6b25ac..09c9db2 100644
>> > --- a/arch/x86/events/perf_event.h
>> > +++ b/arch/x86/events/perf_event.h
>> > @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
>> >  				[PERF_COUNT_HW_CACHE_OP_MAX]
>> >  				[PERF_COUNT_HW_CACHE_RESULT_MAX];
>> >
>> > -u64 x86_perf_event_update(struct perf_event *event);
>> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
>> >
>> >  static inline unsigned int x86_pmu_config_addr(int index)
>> >  {
>> > --
>> > 2.4.3
>> >
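
For completeness, here is a second standalone sketch (again my own
userspace approximation, assuming the 40-bit counter width; the Case B
input values are made up) that runs both failure modes through the
patched check, which is exactly where the pmi boolean makes a
difference:

#include <stdio.h>
#include <stdint.h>

#define CNTVAL_BITS 40                          /* assumed: KNL counter width */
#define CNTVAL_MASK ((1ULL << CNTVAL_BITS) - 1)

/* mimics the patched delta computation in x86_perf_event_update() */
static int64_t patched_delta(uint64_t prev, uint64_t new_raw, int pmi)
{
        int shift = 64 - CNTVAL_BITS;
        int64_t delta = (new_raw << shift) - (prev << shift);

        delta >>= shift;                        /* arithmetic right shift */
        if (((delta > 0) && pmi) || (delta < 0))
                delta += CNTVAL_MASK + 1;       /* the patch's correction */
        return delta;
}

int main(void)
{
        /* Case A: not in PMI, large new (Test 2 data), delta negative */
        printf("case A: %llx\n", (long long)
               patched_delta(0xffffff0000000001ULL, 0xaea96e1927ULL, 0));

        /* Case B: in PMI after a wrap, new is small, delta positive;
           only the pmi flag restores the lost 2^40 here */
        printf("case B: %llx\n", (long long)
               patched_delta(0xffffff0000000001ULL, 0x5ULL, 1));

        return 0;
}

Case A prints aea96e1926; Case B prints 10000000004, i.e. 2^40 + 4
rather than 4, which is the wrap that a plain delta < 0 check alone
would miss.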