From: Stephane Eranian
Date: Mon, 28 Nov 2016 11:41:12 -0800
Subject: Re: [PATCH] perf/x86: fix event counter update issue
To: "Liang, Kan"
Cc: Peter Zijlstra, "mingo@redhat.com", LKML, Alexander Shishkin, "ak@linux.intel.com", "Odzioba, Lukasz"
In-Reply-To: <1480361206-1702-1-git-send-email-kan.liang@intel.com>
References: <1480361206-1702-1-git-send-email-kan.liang@intel.com>

On Mon, Nov 28, 2016 at 11:26 AM, Kan Liang <kan.liang@intel.com> wrote:
>
> From: Kan Liang <kan.liang@intel.com>
>
> Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update")
>
> The bug can easily be reproduced by running perf stat on a loop program
> on KNL/SLM. The loop program is just "for (i = 0; i < loops; i++);",
> where "loops" is set to a different value in each test.
>
> Test 1:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 60000000000
> 451739576564,,cycles,313694125185,100.00
> Test 2:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 100000000000
> 18446743727079256064,,cycles,313694125185,100.00
> Test 3:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 200000000000
> 406240041973,,cycles,313694125185,100.00
>
> Only test 1 has a correct result. The result of test 2 is too big, and
> that of test 3 is too small.
>
> The bug happens more frequently on KNL/SLM due to their narrower
> counters, but it should be a generic issue on all platforms.
>
> The issue is caused by mistakenly calculating the counter increment
> (delta) in x86_perf_event_update() in some cases.
>
> Here, all the possible failure cases are listed.
> Terms:
> - new: the current PMU counter value, as read via rdpmcl().
> - prev: the previous counter value, as stored in &hwc->prev_count.
> - in PMI/not in PMI: whether the event update happens in the PMI
>   handler or not.
>
> Current code to calculate delta:
>   delta = (new << shift) - (prev << shift);
>   delta >>= shift;
>
> Case A: Not in PMI. new > prev, but delta is negative.
> This is the failure case of test 2.
> delta is of type s64, while new and prev are u64. If new is big
> enough, bit 63 of (new << shift) - (prev << shift) may still be set,
> so after the conversion to s64 the sign bit is set. Since delta is
> s64, the right shift is then arithmetic, which copies the sign bit
> into the vacated upper bit positions of delta.
> It can be fixed by adding the max count value.
>
> Here is the real data of test 2 on KNL:
> new = aea96e1927
> prev = ffffff0000000001

How can new be so large here? I mean, between prev and new the counter
wrapped around, and it took that many occurrences (of cycles) to handle
the overflow?

>
> delta = aea96e1927000000 - 1000000 = aea96e1926000000
> aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
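
To make Case A concrete, here is a minimal user-space sketch (an
illustration only, not kernel code) that replays the arithmetic with the
numbers above, assuming KNL's 40-bit counters, i.e. shift = 64 - 40 = 24:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* values from the test 2 failure above, 40-bit counters */
	uint64_t new  = 0xaea96e1927ULL;
	uint64_t prev = 0xffffff0000000001ULL;
	int shift = 64 - 40;		/* 64 - x86_pmu.cntval_bits */

	int64_t delta = (new << shift) - (prev << shift);
	delta >>= shift;		/* arithmetic shift: sign-extends */

	/* prints delta = 0xffffffaea96e1926, i.e. a negative delta */
	printf("delta = 0x%llx\n", (unsigned long long)delta);
	return 0;
}

Bit 63 of the shifted difference is set, so the arithmetic right shift
sign-extends it; added into event->count and read back as a u64, such a
negative delta shows up as a huge value like the one in test 2.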
> Case B: In PMI. new > prev. delta is positive.
> This is the failure case of test 3.
> The PMI is triggered by an overflow, but there is latency between the
> overflow and the PMI handler, so the counter has only counted a small
> amount past zero by the time it is read. The current calculation then
> loses the max count value.
>
> Case C: In PMI. new < prev. delta is negative.
> The PMU counter may start from a big value, e.g. when the fixed
> period is small.
> It can be fixed by adding the max count value.

I am not sure I understand why the PMI should matter here. What matters
is prev vs. current and the wrap-around. Please explain. Thanks.

>
> There is one remaining case in which delta could still be wrong: when
> the event update happens in the gap between the overflow and the PMI,
> i.e. not in the PMI, with new > prev and a positive delta. The current
> calculation may lose the max count value there as well, but this case
> rarely happens, so it is not handled here.
>
> Reported-by: Lukasz Odzioba
> Signed-off-by: Kan Liang
> Tested-by: Lukasz Odzioba
> ---
>  arch/x86/events/core.c       | 23 +++++++++++++++++++----
>  arch/x86/events/intel/core.c |  4 ++--
>  arch/x86/events/intel/p4.c   |  2 +-
>  arch/x86/events/perf_event.h |  2 +-
>  4 files changed, 23 insertions(+), 8 deletions(-)
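
To see Case B and the proposed correction at work, here is a sketch on a
40-bit counter (prev is reused from the real data above; new is a
made-up value that depends on the PMI latency):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* the counter started at 1 (period 2^40 - 1), wrapped at 2^40,
	 * and the PMI handler reads a small value shortly after */
	uint64_t prev = 0xffffff0000000001ULL;	/* sign-extended -(2^40 - 1) */
	uint64_t new  = 0x123ULL;		/* hypothetical in-PMI read */
	uint64_t cntval_mask = (1ULL << 40) - 1;
	int shift = 64 - 40;
	int pmi = 1;

	int64_t delta = (new << shift) - (prev << shift);
	delta >>= shift;			/* 0x122: one full wrap lost */

	if (((delta > 0) && pmi) || (delta < 0))
		delta += cntval_mask + 1;	/* restore the 2^40 wrap */

	/* prints delta = 0x10000000122 */
	printf("delta = 0x%llx\n", (unsigned long long)delta);
	return 0;
}

The uncorrected delta, 0x122, has lost exactly one full wrap (2^40);
because the read happens in the PMI and delta is positive, adding
cntval_mask + 1 recovers the true count.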
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 6c3b0ef..c351ac4 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
>   * Can only be executed on the CPU where the event is active.
>   * Returns the delta events processed.
>   */
> -u64 x86_perf_event_update(struct perf_event *event)
> +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  	int shift = 64 - x86_pmu.cntval_bits;
> @@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
>  	delta = (new_raw_count << shift) - (prev_raw_count << shift);
>  	delta >>= shift;
>
> +	/*
> +	 * Correct delta for special cases caused by late PMI handling.
> +	 * Case 1: The PMU counter may start from a big value.
> +	 *	   In the PMI, new < prev and delta is negative.
> +	 * Case 2: new is so big that it affects the sign bit.
> +	 *	   delta will be negative after the arithmetic right shift.
> +	 * Case 3: In the PMI, new > prev.
> +	 *	   new - prev loses the max count value.
> +	 *
> +	 * An event update may also happen in the gap between overflow
> +	 * and PMI, but that rarely happens and is not handled here.
> +	 */
> +	if (((delta > 0) && pmi) || (delta < 0))
> +		delta += x86_pmu.cntval_mask + 1;
> +
>  	local64_add(delta, &event->count);
>  	local64_sub(delta, &hwc->period_left);
>
> @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
>  		 * Drain the remaining delta count out of a event
>  		 * that we are disabling:
>  		 */
> -		x86_perf_event_update(event);
> +		x86_perf_event_update(event, (flags == 0));
>  		hwc->state |= PERF_HES_UPTODATE;
>  	}
>  }
> @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
>
>  		event = cpuc->events[idx];
>
> -		val = x86_perf_event_update(event);
> +		val = x86_perf_event_update(event, true);
>  		if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
>  			continue;
>
> @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
>
>  static inline void x86_pmu_read(struct perf_event *event)
>  {
> -	x86_perf_event_update(event);
> +	x86_perf_event_update(event, false);
>  }
>
>  /*
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2db..69d65e6 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
>  	for (i = 0; i < 4; i++) {
>  		event = cpuc->events[i];
>  		if (event)
> -			x86_perf_event_update(event);
> +			x86_perf_event_update(event, false);
>  	}
>
>  	for (i = 0; i < 4; i++) {
> @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
>   */
>  int intel_pmu_save_and_restart(struct perf_event *event)
>  {
> -	x86_perf_event_update(event);
> +	x86_perf_event_update(event, true);
>  	/*
>  	 * For a checkpointed counter always reset back to 0. This
>  	 * avoids a situation where the counter overflows, aborts the
> diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
> index eb05335..8117de8 100644
> --- a/arch/x86/events/intel/p4.c
> +++ b/arch/x86/events/intel/p4.c
> @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
>  		/* it might be unflagged overflow */
>  		overflow = p4_pmu_clear_cccr_ovf(hwc);
>
> -		val = x86_perf_event_update(event);
> +		val = x86_perf_event_update(event, true);
>  		if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
>  			continue;
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index c6b25ac..09c9db2 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
>  				[PERF_COUNT_HW_CACHE_OP_MAX]
>  				[PERF_COUNT_HW_CACHE_RESULT_MAX];
>
> -u64 x86_perf_event_update(struct perf_event *event);
> +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
>
>  static inline unsigned int x86_pmu_config_addr(int index)
>  {
> --
> 2.4.3
>
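
For completeness, Case C can be modeled the same way. The numbers below
are made up: they assume a period close to half the 40-bit counter
range, which is what makes the in-PMI delta come out negative:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* period = 2^39 - 1, so the counter starts at 0x8000000001;
	 * it wraps at 2^40 and the PMI handler reads 0x5 */
	uint64_t prev = 0xffffff8000000001ULL;	/* sign-extended -(2^39 - 1) */
	uint64_t new  = 0x5ULL;			/* hypothetical in-PMI read */
	uint64_t cntval_mask = (1ULL << 40) - 1;
	int shift = 64 - 40;
	int pmi = 1;

	int64_t delta = (new << shift) - (prev << shift);
	delta >>= shift;		/* 0xffffff8000000004: negative */

	if (((delta > 0) && pmi) || (delta < 0))
		delta += cntval_mask + 1;	/* -> 0x8000000004 */

	/* true count = (2^39 - 1) + 5 = 0x8000000004 */
	printf("delta = 0x%llx\n", (unsigned long long)delta);
	return 0;
}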