From: Kan Liang <[email protected]>
Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update")
The bug can easily be reproduced by running perf stat on a simple loop
program on KNL/SLM. The loop program is just "for (i = 0; i < loops; i++);",
where "loops" is set to a different value for each test.
Test 1:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 60000000000
451739576564,,cycles,313694125185,100.00
Test 2:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 100000000000
18446743727079256064,,cycles,313694125185,100.00
Test 3:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 200000000000
406240041973,,cycles,313694125185,100.00
Only Test 1 produces a correct result. The result of Test 2 is far too
big, and the result of Test 3 is too small.
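For reference, a minimal sketch of such a loop program (the argument
handling and the volatile qualifier are assumptions, not part of the
original reproducer):

/* loop.c - reproducer sketch; build with: gcc -O0 -o loop loop.c */
#include <stdlib.h>

int main(int argc, char **argv)
{
        /* volatile keeps the empty loop from being optimized away */
        volatile unsigned long long i;
        unsigned long long loops = (argc > 1) ? strtoull(argv[1], NULL, 0) : 0;

        for (i = 0; i < loops; i++)
                ;
        return 0;
}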
The bug shows up more frequently on KNL/SLM because of their narrower
counters, but it should be a generic issue on all platforms.
The issue is caused by x86_perf_event_update() miscalculating the
counter increment (delta) in some cases.
All the possible failure cases are listed below.
Terms:
- new: the current PMU counter value, read via rdpmcl().
- prev: the previous counter value, stored in hwc->prev_count.
- in PMI/not in PMI: whether the event update happens in the PMI handler.
Current code to calculate delta (the double shift reduces the
subtraction modulo the counter width, so a wrapped counter still yields
a small positive increment as long as the real increment stays below
half the counter range):
delta = (new << shift) - (prev << shift);
delta >>= shift;
Case A: Not in PMI. new > prev. But delta is negative.
That's the failure case of Test 2.
delta has type s64, while new and prev are u64. If new is big enough,
the most significant bit of the result may still be set after the left
shift and subtraction. The value is then interpreted as a negative s64,
and because delta is s64 the subsequent right shift is arithmetic, which
copies the sign bit into the vacated upper bit positions of delta.
It can be fixed by adding the max count value.
Here is the real data for Test 2 on KNL (40-bit counters, shift = 24):
new  = 0xaea96e1927
prev = 0xffffff0000000001
delta = 0xaea96e1927000000 - 0x1000000 = 0xaea96e1926000000
0xaea96e1926000000 >> 24 = 0xffffffaea96e1926   <- negative delta
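For illustration, the same arithmetic as a standalone userspace sketch
(40-bit counters, so shift = 24; the signed conversion and right shift
behave here as the kernel code assumes, i.e. two's complement with an
arithmetic shift):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int shift = 64 - 40;                    /* 40-bit counter width */
        uint64_t new  = 0xaea96e1927ULL;
        uint64_t prev = 0xffffff0000000001ULL;
        int64_t delta = (new << shift) - (prev << shift);

        delta >>= shift;        /* arithmetic shift copies the sign bit down */
        printf("delta = %#llx\n", (unsigned long long)delta);  /* 0xffffffaea96e1926 */
        return 0;
}

Adding x86_pmu.cntval_mask + 1 (0x10000000000 for a 40-bit counter) to
that negative delta gives 0xaea96e1926, the actual increment.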
Case B: In PMI. new > prev. delta is positive.
That's the failure case of Test 3.
The PMI is triggered by an overflow, but there is latency between the
overflow and the PMI handler, so by the time the counter is read, new
has wrapped around to a small value. The current calculation loses one
full max count value.
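A worked example of Case B, with the counter width taken as 40 bits and
the new value assumed purely for illustration:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int shift = 64 - 40;
        uint64_t mask = (1ULL << 40) - 1;       /* x86_pmu.cntval_mask */
        uint64_t prev = 0xffffff0000000001ULL;  /* -left, with left = 0xffffffffff */
        uint64_t new  = 0x1804;                 /* read in the PMI, after the wrap */
        int64_t delta = (new << shift) - (prev << shift);

        delta >>= shift;
        printf("raw delta       = %#llx\n", (unsigned long long)delta);              /* 0x1803 */
        printf("corrected delta = %#llx\n", (unsigned long long)(delta + mask + 1)); /* 0x10000001803 */
        return 0;
}

The raw delta is only the increment modulo 2^40; since the counter
wrapped exactly once, adding cntval_mask + 1 restores the true count.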
Case C: In PMI. new < prev. delta is negative.
The PMU counter may start from a big value, e.g. when the fixed period
is small.
It can be fixed by adding the max count value.
There is still one case in which delta can be wrong: an event update
that happens in the gap between the overflow and the PMI. It is not in
the PMI, new > prev, and delta is positive, so the current calculation
may lose the max count value. That case happens rarely, though, so it
is not handled here.
Reported-by: Lukasz Odzioba <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Tested-by: Lukasz Odzioba <[email protected]>
---
arch/x86/events/core.c | 23 +++++++++++++++++++----
arch/x86/events/intel/core.c | 4 ++--
arch/x86/events/intel/p4.c | 2 +-
arch/x86/events/perf_event.h | 2 +-
4 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6c3b0ef..c351ac4 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
* Can only be executed on the CPU where the event is active.
* Returns the delta events processed.
*/
-u64 x86_perf_event_update(struct perf_event *event)
+u64 x86_perf_event_update(struct perf_event *event, bool pmi)
{
struct hw_perf_event *hwc = &event->hw;
int shift = 64 - x86_pmu.cntval_bits;
@@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
delta = (new_raw_count << shift) - (prev_raw_count << shift);
delta >>= shift;
+ /*
+ * Correct delta for the special cases caused by late PMI handling.
+ * Case1: The PMU counter may start from a big value.
+ * In the PMI, new < prev and delta is negative.
+ * Case2: new is big enough to set the sign bit.
+ * delta becomes negative after the arithmetic right shift.
+ * Case3: In the PMI, new > prev.
+ * new - prev loses the max count value.
+ *
+ * An event update may also happen in the gap between overflow
+ * and the PMI, but that rarely happens and is not handled here.
+ */
+ if (((delta > 0) && pmi) || (delta < 0))
+ delta += x86_pmu.cntval_mask + 1;
+
local64_add(delta, &event->count);
local64_sub(delta, &hwc->period_left);
@@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
* Drain the remaining delta count out of a event
* that we are disabling:
*/
- x86_perf_event_update(event);
+ x86_perf_event_update(event, (flags == 0));
hwc->state |= PERF_HES_UPTODATE;
}
}
@@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
event = cpuc->events[idx];
- val = x86_perf_event_update(event);
+ val = x86_perf_event_update(event, true);
if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
continue;
@@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
static inline void x86_pmu_read(struct perf_event *event)
{
- x86_perf_event_update(event);
+ x86_perf_event_update(event, false);
}
/*
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a74a2db..69d65e6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
for (i = 0; i < 4; i++) {
event = cpuc->events[i];
if (event)
- x86_perf_event_update(event);
+ x86_perf_event_update(event, false);
}
for (i = 0; i < 4; i++) {
@@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
*/
int intel_pmu_save_and_restart(struct perf_event *event)
{
- x86_perf_event_update(event);
+ x86_perf_event_update(event, true);
/*
* For a checkpointed counter always reset back to 0. This
* avoids a situation where the counter overflows, aborts the
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index eb05335..8117de8 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
/* it might be unflagged overflow */
overflow = p4_pmu_clear_cccr_ovf(hwc);
- val = x86_perf_event_update(event);
+ val = x86_perf_event_update(event, true);
if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
continue;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index c6b25ac..09c9db2 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
[PERF_COUNT_HW_CACHE_OP_MAX]
[PERF_COUNT_HW_CACHE_RESULT_MAX];
-u64 x86_perf_event_update(struct perf_event *event);
+u64 x86_perf_event_update(struct perf_event *event, bool pmi);
static inline unsigned int x86_pmu_config_addr(int index)
{
--
2.4.3
On Mon, Nov 28, 2016 at 11:26 AM, <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update")
>
> A bug can be easily reproduced by perf stat loop program on KNL/SLM.
> The loop program is just "for (i=0;i<loops;i++);"
> "loops" is set to different value during the test.
> Test 1:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 60000000000
> 451739576564,,cycles,313694125185,100.00
> Test 2:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 100000000000
> 18446743727079256064,,cycles,313694125185,100.00
> Test 3:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 200000000000
> 406240041973,,cycles,313694125185,100.00
> Only test 1 has correct result.
> The result for test 2 is too big. Test 3 is too small.
>
> The bug happens more frequently on KNL/SLM due to narrower counters,
> but should be generic issue on all platforms.
>
> The issues are caused by mistakenly calculating the counter
> increment (delta) in x86_perf_event_update for some cases.
>
> Here, all the possible failure cases are listed.
> Terms:
> - new: current PMU counter value which read from rdpmcl.
> - prev: previous counter value which is stored in &hwc->prev_count.
> - in PMI/not in PMI: the event update happens in PMI handler or not.
> Current code to calculate delta:
> delta = (new << shift) - (prev << shift);
> delta >>= shift;
>
> Case A: Not in PMI. new > prev. But delta is negative.
> That's the failure case of Test 2.
> delta is s64 type. new and prev are u64 type. If the new is big
> enough, after doing left shift and sub, the bit 64 of the result may
> still be 1.
> After converting to s64, the sign flag will be set. Since delta is
> s64 type, arithmetic right shift is applied, which copy the sign flag
> into empty bit positions on the upper end of delta.
> It can be fixed by adding the max count value.
>
> Here is the real data for test2 on KNL.
> new = aea96e1927
> prev = ffffff0000000001
How can new be so large here?
I mean between prev and new, the counter wrapped around. And it took
that many occurrences (of cycles) to handle the overflow?
>
> delta = aea96e1927000000 - 1000000 = aea96e1926000000
> aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
>
> Case B: In PMI. new > prev. delta is positive.
> That's the failure case of Test 3.
> The PMI is triggered by overflow. But there is latency between
> overflow and PMI handler. So new has small amount.
> Current calculation lose the max count value.
>
> Case C: In PMI. new < prev. delta is negative.
> The PMU counter may be start from a big value. E.g. the fixed period
> is small.
> It can be fixed by adding the max count value.
>
I am not sure I understand why PMI should matter here. What matters is
prev vs. current and the wrap-around.
Please explain.
Thanks.
> >
> > Here, all the possible failure cases are listed.
> > Terms:
> > - new: current PMU counter value which read from rdpmcl.
> > - prev: previous counter value which is stored in &hwc->prev_count.
> > - in PMI/not in PMI: the event update happens in PMI handler or not.
> > Current code to calculate delta:
> > delta = (new << shift) - (prev << shift);
> > delta >>= shift;
> >
> > Case A: Not in PMI. new > prev. But delta is negative.
> > That's the failure case of Test 2.
> > delta is s64 type. new and prev are u64 type. If the new is big
> > enough, after doing left shift and sub, the bit 64 of the result may
> > still be 1.
> > After converting to s64, the sign flag will be set. Since delta is
> > s64 type, arithmetic right shift is applied, which copy the sign flag
> > into empty bit positions on the upper end of delta.
> > It can be fixed by adding the max count value.
> >
> > Here is the real data for test2 on KNL.
> > new = aea96e1927
> > prev = ffffff0000000001
>
>
> How can new be so large here?
> I mean between prev and new, the counter wrapped around. And it took
> that many occurrences (of cycles) to handle the overflow?
There is no overflow in this case.
The counter is 40 bits wide. The highest value which can be counted
without overflow is 0xfffffffffe.
>
> >
> > delta = aea96e1927000000 - 1000000 = aea96e1926000000
> > aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
> >
> > Case B: In PMI. new > prev. delta is positive.
> > That's the failure case of Test 3.
> > The PMI is triggered by overflow. But there is latency between
> > overflow and PMI handler. So new has small amount.
> > Current calculation lose the max count value.
> >
> > Case C: In PMI. new < prev. delta is negative.
> > The PMU counter may be start from a big value. E.g. the fixed period
> > is small.
> > It can be fixed by adding the max count value.
> >
> I am not sure I understand why PMI should matter here. What matters is
> prev vs. current and the wrap-around.
> Please explain.
> Thanks.
Right, the PMI shouldn't matter here.
We should add the max count value if delta is negative, no matter
whether it's in the PMI.
To fix this issue, I once had a list which included all the possible
cases. "non-PMI, new < prev, delta is negative" was one of them, but it
rarely happens, so I removed it and forgot to update the description of
Case C.
Thanks,
Kan
On Mon, Nov 28, 2016 at 11:59 AM, Liang, Kan <[email protected]> wrote:
>
>
>> >
>> > Here, all the possible failure cases are listed.
>> > Terms:
>> > - new: current PMU counter value which read from rdpmcl.
>> > - prev: previous counter value which is stored in &hwc->prev_count.
>> > - in PMI/not in PMI: the event update happens in PMI handler or not.
>> > Current code to calculate delta:
>> > delta = (new << shift) - (prev << shift);
>> > delta >>= shift;
>> >
>> > Case A: Not in PMI. new > prev. But delta is negative.
>> > That's the failure case of Test 2.
>> > delta is s64 type. new and prev are u64 type. If the new is big
>> > enough, after doing left shift and sub, the bit 64 of the result may
>> > still be 1.
>> > After converting to s64, the sign flag will be set. Since delta is
>> > s64 type, arithmetic right shift is applied, which copy the sign flag
>> > into empty bit positions on the upper end of delta.
>> > It can be fixed by adding the max count value.
>> >
>> > Here is the real data for test2 on KNL.
>> > new = aea96e1927
>> > prev = ffffff0000000001
>>
>>
>> How can new be so large here?
>> I mean between prev and new, the counter wrapped around. And it took
>> that many occurrences (of cycles) to handle the overflow?
>
> There is no overflow in this case.
> The counter is 40bit width. The highest value which can be count without
> overflow is 0xfffffffffe
>
>>
>> >
>> > delta = aea96e1927000000 - 1000000 = aea96e1926000000
>> > aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
>> >
>> > Case B: In PMI. new > prev. delta is positive.
>> > That's the failure case of Test 3.
>> > The PMI is triggered by overflow. But there is latency between
>> > overflow and PMI handler. So new has small amount.
>> > Current calculation lose the max count value.
>> >
>> > Case C: In PMI. new < prev. delta is negative.
>> > The PMU counter may be start from a big value. E.g. the fixed period
>> > is small.
>> > It can be fixed by adding the max count value.
>> >
>> I am not sure I understand why PMI should matter here. What matters is
>> prev vs. current and the wrap-around.
>> Please explain.
>> Thanks.
>
> Right, PMI shouldn't matter here.
> We should add max count value if delta is negative, no matter if it’s in PMI.
>
That sounds right, but then why do you have that boolean (bool pmi) in
your patch?
> >> > Case B: In PMI. new > prev. delta is positive.
> >> > That's the failure case of Test 3.
> >> > The PMI is triggered by overflow. But there is latency between
> >> > overflow and PMI handler. So new has small amount.
> >> > Current calculation lose the max count value.
> >> >
> >> > Case C: In PMI. new < prev. delta is negative.
> >> > The PMU counter may be start from a big value. E.g. the fixed period
> >> > is small.
> >> > It can be fixed by adding the max count value.
> >> >
> >> I am not sure I understand why PMI should matter here. What matters
> >> is prev vs. current and the wrap-around.
> >> Please explain.
> >> Thanks.
> >
> > Right, PMI shouldn't matter here.
> > We should add max count value if delta is negative, no matter if it’s in
> PMI.
> >
> That sounds right, but then why do you have that boolean (bool pmi) in
> your patch?
For Case B: we need to add the max count value if new > prev in the PMI.
Thanks,
Kan
So caveat that I'm ill and cannot think much..
On Mon, Nov 28, 2016 at 11:26:46AM -0800, [email protected] wrote:
> Here, all the possible failure cases are listed.
> Terms:
> - new: current PMU counter value which read from rdpmcl.
> - prev: previous counter value which is stored in &hwc->prev_count.
> - in PMI/not in PMI: the event update happens in PMI handler or not.
> Current code to calculate delta:
> delta = (new << shift) - (prev << shift);
> delta >>= shift;
>
> Case A: Not in PMI. new > prev. But delta is negative.
> That's the failure case of Test 2.
> delta is s64 type. new and prev are u64 type. If the new is big
> enough, after doing left shift and sub, the bit 64 of the result may
> still be 1.
> After converting to s64, the sign flag will be set. Since delta is
> s64 type, arithmetic right shift is applied, which copy the sign flag
> into empty bit positions on the upper end of delta.
> It can be fixed by adding the max count value.
>
> Here is the real data for test2 on KNL.
> new = aea96e1927
> prev = ffffff0000000001
> delta = aea96e1927000000 - 1000000 = aea96e1926000000
> aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
How can this happen? IIRC the thing increments, we program a negative
value, and when it passes 0 we generate a PMI.
And note that we _ALWAYS_ set the IN bits, even for !sampling events.
Also note we set max_period to (1<<31) - 1, so we should never exceed 31
bits.
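(For context, a simplified userspace model of that programming step;
this is not the actual x86_perf_event_set_period() code, and the 40-bit
width and the period value are assumptions for illustration:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int cntval_bits = 40;                           /* assumed counter width */
        uint64_t cntval_mask = (1ULL << cntval_bits) - 1;
        uint64_t left = 0xffffffffffULL;                /* example period */
        uint64_t prev = (uint64_t)(-(int64_t)left);     /* what hwc->prev_count holds */

        printf("prev_count = %#llx, programmed value = %#llx\n",
               (unsigned long long)prev,
               (unsigned long long)(prev & cntval_mask));
        return 0;
}

This prints prev_count = 0xffffff0000000001 and a programmed value of 1,
which matches the prev value in the data above: the counter starts at 1
and overflows after 0xffffffffff increments.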
> Case B: In PMI. new > prev. delta is positive.
> That's the failure case of Test 3.
> The PMI is triggered by overflow. But there is latency between
> overflow and PMI handler. So new has small amount.
> Current calculation lose the max count value.
That doesn't make sense, per the 31bit limit.
> Case C: In PMI. new < prev. delta is negative.
> The PMU counter may be start from a big value. E.g. the fixed period
> is small.
> It can be fixed by adding the max count value.
Doesn't make sense, how can this happen?
> So caveat that I'm ill and cannot think much..
>
> On Mon, Nov 28, 2016 at 11:26:46AM -0800, [email protected] wrote:
>
> > Here, all the possible failure cases are listed.
> > Terms:
> > - new: current PMU counter value which read from rdpmcl.
> > - prev: previous counter value which is stored in &hwc->prev_count.
> > - in PMI/not in PMI: the event update happens in PMI handler or not.
>
> > Current code to calculate delta:
> > delta = (new << shift) - (prev << shift);
> > delta >>= shift;
> >
> > Case A: Not in PMI. new > prev. But delta is negative.
> > That's the failure case of Test 2.
> > delta is s64 type. new and prev are u64 type. If the new is big
> > enough, after doing left shift and sub, the bit 64 of the result may
> > still be 1.
> > After converting to s64, the sign flag will be set. Since delta is
> > s64 type, arithmetic right shift is applied, which copy the sign flag
> > into empty bit positions on the upper end of delta.
> > It can be fixed by adding the max count value.
> >
> > Here is the real data for test2 on KNL.
> > new = aea96e1927
> > prev = ffffff0000000001
> > delta = aea96e1927000000 - 1000000 = aea96e1926000000
> > aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
>
> How can this happen? IIRC the thing increments, we program a negative
> value, and when it passes 0 we generate a PMI.
>
> And note that we _ALWAYS_ set the IN bits, even for !sampling events.
> Also note we set max_period to (1<<31) - 1, so we should never exceed 31
> bits.
>
The max_period is 0xffffffffff.
The limit was broken by this patch:
069e0c3c4058 ("perf/x86/intel: Support full width counting")
https://patchwork.kernel.org/patch/2784191/
/* Support full width counters using alternative MSR range */
if (x86_pmu.intel_cap.full_width_write) {
x86_pmu.max_period = x86_pmu.cntval_mask;
x86_pmu.perfctr = MSR_IA32_PMC0;
pr_cont("full-width counters, ");
}
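(Concretely, with 40-bit counters assumed as above: max_period equal to
cntval_mask means the counter is started at 1, so the increment between
two updates can approach the full 2^40 range, which no longer fits in
the signed 40-bit value that the shift-based delta can represent. A
small sketch:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t cntval_mask = (1ULL << 40) - 1;
        uint64_t max_period  = cntval_mask;             /* after 069e0c3c4058 */
        uint64_t start = (uint64_t)(-(int64_t)max_period) & cntval_mask;

        printf("counter start value   = %#llx\n", (unsigned long long)start);  /* 1 */
        printf("largest increment     = %#llx\n", (unsigned long long)max_period);
        printf("signed 40-bit maximum = %#llx\n", (unsigned long long)((1ULL << 39) - 1));
        return 0;
}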
On Tue, Nov 29, 2016 at 02:46:14PM +0000, Liang, Kan wrote:
> > And note that we _ALWAYS_ set the IN bits, even for !sampling events.
> > Also note we set max_period to (1<<31) - 1, so we should never exceed 31
> > bits.
> >
>
> The max_period is 0xffffffffff.
>
> The limit was broken by this patch:
> 069e0c3c4058 ("perf/x86/intel: Support full width counting")
> https://patchwork.kernel.org/patch/2784191/
>
> /* Support full width counters using alternative MSR range */
> if (x86_pmu.intel_cap.full_width_write) {
> x86_pmu.max_period = x86_pmu.cntval_mask;
> x86_pmu.perfctr = MSR_IA32_PMC0;
> pr_cont("full-width counters, ");
> }
>
Wth do KNL/SLM have full_width_write set if they have short counters? I
thought the whole point of the full_width thing was that in that case the
counters were actually 64 bits wide.
> Wth do KNL/SLM have full_width_write set if they have short counters?
Yes, the full_width_write is set on KNL/SLM.
> I thought the whole point of the full_width thing was that in that case the
> counters were actually 64 bits wide.
Even on big cores, the counter width is 48 bits.
AFAIK, no Intel platform has a 64-bit wide counter.
Thanks,
Kan
On Tue, Nov 29, 2016 at 05:06:36PM +0000, Liang, Kan wrote:
> > Wth do KNL/SLM have full_width_write set if they have short counters?
>
> Yes, the full_width_write is set on KNL/SLM.
>
> > I thought the whole point of the full_width thing was that in that case the
> > counters were actually 64 bits wide.
>
> Even on big cores, the counter width is 48 bits.
> AFAIK, no Intel platform has a 64-bit wide counter.
Then yes, that patch is very broken.
On Tue, Nov 29, 2016 at 1:25 AM, Peter Zijlstra <[email protected]> wrote:
>
>
> So caveat that I'm ill and cannot think much..
>
> On Mon, Nov 28, 2016 at 11:26:46AM -0800, [email protected] wrote:
>
> > Here, all the possible failure cases are listed.
> > Terms:
> > - new: current PMU counter value which read from rdpmcl.
> > - prev: previous counter value which is stored in &hwc->prev_count.
> > - in PMI/not in PMI: the event update happens in PMI handler or not.
>
> > Current code to calculate delta:
> > delta = (new << shift) - (prev << shift);
> > delta >>= shift;
> >
> > Case A: Not in PMI. new > prev. But delta is negative.
> > That's the failure case of Test 2.
> > delta is s64 type. new and prev are u64 type. If the new is big
> > enough, after doing left shift and sub, the bit 64 of the result may
> > still be 1.
> > After converting to s64, the sign flag will be set. Since delta is
> > s64 type, arithmetic right shift is applied, which copy the sign flag
> > into empty bit positions on the upper end of delta.
> > It can be fixed by adding the max count value.
> >
> > Here is the real data for test2 on KNL.
> > new = aea96e1927
> > prev = ffffff0000000001
> > delta = aea96e1927000000 - 1000000 = aea96e1926000000
> > aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
>
> How can this happen? IIRC the thing increments, we program a negative
> value, and when it passes 0 we generate a PMI.
>
Yeah, that's the part I don't quite understand, thus my initial
question. What you describe is what is supposed to happen regardless of
the counter width.
>
> And note that we _ALWAYS_ set the IN bits, even for !sampling events.
> Also note we set max_period to (1<<31) - 1, so we should never exceed 31
> bits.
>
Max period is limited by the number of bits the kernel can write to an MSR.
Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.
>
>
> > Case B: In PMI. new > prev. delta is positive.
> > That's the failure case of Test 3.
> > The PMI is triggered by overflow. But there is latency between
> > overflow and PMI handler. So new has small amount.
> > Current calculation lose the max count value.
>
> That doesn't make sense, per the 31bit limit.
>
>
> > Case C: In PMI. new < prev. delta is negative.
> > The PMU counter may be start from a big value. E.g. the fixed period
> > is small.
> > It can be fixed by adding the max count value.
>
> Doesn't make sense, how can this happen?
The only issue I could think of is the case where the counter is
reprogrammed to a different value, which could be much lower or higher
than the last saved value, for instance in frequency mode. Otherwise
each counter increments monotonically from the period. Counting events
are also programmed that way.
On Tue, Nov 29, 2016 at 09:20:10AM -0800, Stephane Eranian wrote:
> Max period is limited by the number of bits the kernel can write to an MSR.
> Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.
No, I think it sets it to 48 now, which is the problem. It should be 1
bit less than the total width.
So something like so.
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a74a2dbc0180..cb8522290e6a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
/* Support full width counters using alternative MSR range */
if (x86_pmu.intel_cap.full_width_write) {
- x86_pmu.max_period = x86_pmu.cntval_mask;
+ x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
x86_pmu.perfctr = MSR_IA32_PMC0;
pr_cont("full-width counters, ");
}
On Tue, Nov 29, 2016 at 9:30 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Nov 29, 2016 at 09:20:10AM -0800, Stephane Eranian wrote:
>> Max period is limited by the number of bits the kernel can write to an MSR.
>> Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.
>
> No, I think it sets it to 48 now, which is the problem. It should be 1
> bit less than the total width.
Ah, yes!
That would make it consistent with the other Intel PMU settings I see
in intel/core.c, such as intel_core_pmu.max_period = (1 << 31) - 1,
which is 0x7fffffff, whereas now I see a max_period of
0x0000ffffffffffff instead of 0x7fffffffffff. So I think Peter's
patch is required no matter what.
On Tue, Nov 29, 2016 at 06:30:55PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2016 at 09:20:10AM -0800, Stephane Eranian wrote:
> > Max period is limited by the number of bits the kernel can write to an MSR.
> > Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.
>
> No, I think it sets it to 48 now, which is the problem. It should be 1
> bit less than the total width.
>
> So something like so.
That looks good. Kan, can you test it?
-Andi
> On Tue, Nov 29, 2016 at 09:20:10AM -0800, Stephane Eranian wrote:
> > Max period is limited by the number of bits the kernel can write to an
> MSR.
> > Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.
>
> No, I think it sets it to 48 now, which is the problem. It should be 1 bit less
> than the total width.
>
> So something like so.
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2dbc0180..cb8522290e6a 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
>
> /* Support full width counters using alternative MSR range */
> if (x86_pmu.intel_cap.full_width_write) {
> - x86_pmu.max_period = x86_pmu.cntval_mask;
> + x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
> x86_pmu.perfctr = MSR_IA32_PMC0;
> pr_cont("full-width counters, ");
> }
It doesn't work.
perf stat -x, -C1 -e cycles -- sudo taskset 0x2 ./loop 100000000000
18446743727217821696,,cycles,313837854019,100.00
delta 0xffffff8000001803 new 0x1804 prev 0xffffff8000000001
I guess we need at least x86_pmu.cntval_mask >> 2 to prevent
the sign flag set.
I'm testing it now.
Also, no matter how it's fixed, I think we'd better add a WARN_ONCE
for delta.
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6c3b0ef..2ce8299 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -100,6 +100,9 @@ u64 x86_perf_event_update(struct perf_event *event)
delta = (new_raw_count << shift) - (prev_raw_count << shift);
delta >>= shift;
+ WARN_ONCE((delta < 0), "counter increment must be positive. delta 0x%llx new 0x%llx prev 0x%llx\n",
+ delta, new_raw_count, prev_raw_count);
+
local64_add(delta, &event->count);
local64_sub(delta, &hwc->period_left);
Thanks,
Kan
On Tuesday, November 29, 2016 6:20 PM Stephane Eranian wrote:
>> On Tue, Nov 29, 2016 at 1:25 AM, Peter Zijlstra <[email protected]> wrote:
>> How can this happen? IIRC the thing increments, we program a negative
>> value, and when it passes 0 we generate a PMI.
>>
> Yeah, that's the part I don't quite understand thus my initial
> question. What you describe
> is what is supposed to happen regardless of the counter width.
I don't see any interrupts flying around:
# cat /proc/interrupts | grep PMI
PMI: 10 14 12 18 Performance monitoring interrupts
(test case here)
# cat /proc/interrupts | grep PMI
PMI: 10 14 12 18 Performance monitoring interrupts
What I also don't understand is why this is only seen in per-cpu or
system-wide mode.
PMIs seem to be programmed, because if we wait long enough the counter
will eventually overflow, which is what we saw in one of the test cases
from the commit message.
In the failing case (system-wide), x86_perf_event_update is called only
once, at the end of the workload:
259.154530: x86_perf_event_set_period: x86_perf_event_set_period left = ffffffffff, ret = 0
550.079623: x86_pmu_stop: x86_pmu_stop -> need update
550.079625: x86_perf_event_update: x86_perf_event_update config = 11003c, period = ffff880276212000
550.079627: x86_perf_event_update: x86_perf_event_update sample_period = ffffffffff, last_period = ffffffffff, period_left = ffffffffff
550.079629: x86_perf_event_update: x86_perf_event_update(1) prev = ffffff0000000001, new = 8bba934b9e, delta = ffffff8bba934b9d, shift=24
In the good case (per-pid), x86_perf_event_update is called every ~3 seconds.
568.448083: x86_perf_event_set_period: x86_perf_event_set_period left = ffffffffff, ret = 0
568.949105: x86_pmu_stop: x86_pmu_stop -> need update
568.949106: x86_perf_event_update: x86_perf_event_update config = 11003c, period = ffff880276212000
568.949108: x86_perf_event_update: x86_perf_event_update sample_period = ffffffffff, last_period = ffffffffff, period_left = ffffffffff
568.949110: x86_perf_event_update: x86_perf_event_update(1) prev = ffffff0000000001, new = 3d9525cc, delta = 3d9525cb, shift=24
568.949135: x86_perf_event_set_period: x86_perf_event_set_period left = ffc26ada34, ret = 0
570.679697: x86_pmu_stop: x86_pmu_stop -> need update
570.679699: x86_perf_event_update: x86_perf_event_update config = 11003c, period = ffff880276212000
570.679700: x86_perf_event_update: x86_perf_event_update sample_period = ffffffffff, last_period = ffffffffff, period_left = ffc26ada34
570.679701: x86_perf_event_update: x86_perf_event_update(1) prev = ffffff003d9525cc, new = 112601ddf, delta = d4caf813, shift=24
570.679723: x86_perf_event_set_period: x86_perf_event_set_period left = feed9fe221, ret = 0
(... ~115 similar calls here)
859.431686: x86_pmu_stop: x86_pmu_stop -> need update
859.431687: x86_perf_event_update: x86_perf_event_update config = 11003c, period = ffff880276209000
859.431688: x86_perf_event_update: x86_perf_event_update sample_period = ffffffffff, last_period = ffffffffff, period_left = 7453fc7d0d
859.431689: x86_perf_event_update: x86_perf_event_update(1) prev = ffffff8bac0382f3, new = 8bbaa4a147, delta = ea11e54, shift=24
Thanks,
Lukas
On Tue, Nov 29, 2016 at 07:07:25PM +0000, Liang, Kan wrote:
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index a74a2dbc0180..cb8522290e6a 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
> >
> > /* Support full width counters using alternative MSR range */
> > if (x86_pmu.intel_cap.full_width_write) {
> > - x86_pmu.max_period = x86_pmu.cntval_mask;
> > + x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
> > x86_pmu.perfctr = MSR_IA32_PMC0;
> > pr_cont("full-width counters, ");
> > }
>
> It doesn't work.
> perf stat -x, -C1 -e cycles -- sudo taskset 0x2 ./loop 100000000000
> 18446743727217821696,,cycles,313837854019,100.00
>
> delta 0xffffff8000001803 new 0x1804 prev 0xffffff8000000001
>
> I guess we need at least x86_pmu.cntval_mask >> 2 to prevent
> the sign flag set.
Possible delta should be u64, as we know the counter cannot decrement.
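(As a quick standalone check of that, using the values reported above
and assuming 40-bit counters; with an unsigned delta the right shift is
logical, so no sign extension happens:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int shift = 64 - 40;
        uint64_t prev = 0xffffff8000000001ULL;  /* -left, with left = 0x7fffffffff */
        uint64_t new  = 0x1804;                 /* value read in the PMI */
        uint64_t delta = (new << shift) - (prev << shift);

        delta >>= shift;                        /* logical shift, no sign extension */
        printf("delta = %#llx\n", (unsigned long long)delta);  /* 0x8000001803 */
        return 0;
}

0x8000001803 is the full 0x7fffffffff period plus the 0x1804 counted
after the wrap, which is what the hardware actually did; with an s64
delta the same computation sign-extends to the 0xffffff8000001803
reported above.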
> On Tue, Nov 29, 2016 at 07:07:25PM +0000, Liang, Kan wrote:
> > > diff --git a/arch/x86/events/intel/core.c
> > > b/arch/x86/events/intel/core.c index a74a2dbc0180..cb8522290e6a
> > > 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
> > >
> > > /* Support full width counters using alternative MSR range */
> > > if (x86_pmu.intel_cap.full_width_write) {
> > > - x86_pmu.max_period = x86_pmu.cntval_mask;
> > > + x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
> > > x86_pmu.perfctr = MSR_IA32_PMC0;
> > > pr_cont("full-width counters, ");
> > > }
> >
> > It doesn't work.
> > perf stat -x, -C1 -e cycles -- sudo taskset 0x2 ./loop 100000000000
> > 18446743727217821696,,cycles,313837854019,100.00
> >
> > delta 0xffffff8000001803 new 0x1804 prev 0xffffff8000000001
> >
> > I guess we need at least x86_pmu.cntval_mask >> 2 to prevent the sign
> > flag set.
>
> Possible delta should be u64, as we know the counter cannot decrement.
Yes, the patch as below fixes the issue on my SLM.
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6c3b0ef..abd97e8 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -69,7 +69,7 @@ u64 x86_perf_event_update(struct perf_event *event)
int shift = 64 - x86_pmu.cntval_bits;
u64 prev_raw_count, new_raw_count;
int idx = hwc->idx;
- s64 delta;
+ u64 delta;
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a74a2db..cb85222 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
/* Support full width counters using alternative MSR range */
if (x86_pmu.intel_cap.full_width_write) {
- x86_pmu.max_period = x86_pmu.cntval_mask;
+ x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
x86_pmu.perfctr = MSR_IA32_PMC0;
pr_cont("full-width counters, ");
}
On Tue, Nov 29, 2016 at 12:33 PM, Liang, Kan <[email protected]> wrote:
>> Possible delta should be u64, as we know the counter cannot decrement.
>
> Yes, the patch as below fixes the issue on my SLM.
Yes, now that makes sense! This s64 has been there since the beginning....
Thanks for fixing this.
On Tuesday, November 29, 2016 9:33 PM, Liang, Kan wrote:
> Yes, the patch as below fixes the issue on my SLM.
It works for me as well.
Can we still have it in 4.9?
Thanks,
Lukas
On Fri, Dec 02, 2016 at 12:58:17PM +0000, Odzioba, Lukasz wrote:
> On Tuesday, November 29, 2016 9:33 PM, Liang, Kan wrote:
> > Yes, the patch as below fixes the issue on my SLM.
>
> It works for me as well.
> Can we still have it in 4.9?
I'll certainly try. I've queued it as per the below.
---
Subject: perf/x86: Fix full width counter, counter overflow
Date: Tue, 29 Nov 2016 20:33:28 +0000
Lukasz reported that perf stat counters overflow handling is broken on KNL/SLM.
Both these parts have full_width_write set, and that does indeed have
a problem. In order to deal with counter wrap, we must sample the
counter at at least half the counter period (see also the sampling
theorem) such that we can unambiguously reconstruct the count.
However commit:
069e0c3c4058 ("perf/x86/intel: Support full width counting")
sets the sampling interval to the full period, not half.
Fixing that exposes another issue, in that we must not sign extend the
delta value when we shift it right; the counter cannot have
decremented after all.
With both these issues fixed, counter overflow functions correctly
again.
Cc: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: [email protected]
Reported-by: Lukasz Odzioba <[email protected]>
Tested-by: "Liang, Kan" <[email protected]>
Tested-by: "Odzioba, Lukasz" <[email protected]>
Fixes: 069e0c3c4058 ("perf/x86/intel: Support full width counting")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/core.c | 2 +-
arch/x86/events/intel/core.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -69,7 +69,7 @@ u64 x86_perf_event_update(struct perf_ev
int shift = 64 - x86_pmu.cntval_bits;
u64 prev_raw_count, new_raw_count;
int idx = hwc->idx;
- s64 delta;
+ u64 delta;
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
/* Support full width counters using alternative MSR range */
if (x86_pmu.intel_cap.full_width_write) {
- x86_pmu.max_period = x86_pmu.cntval_mask;
+ x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
x86_pmu.perfctr = MSR_IA32_PMC0;
pr_cont("full-width counters, ");
}
On Monday, December 5, 2016 11:25 AM, Peter Zijlstra wrote:
> I'll certainly, try. I've queued it as per the below.
Great, thank you!
Lukas