This short patch series fixes several issues related to PEBS on Haswell
and later processors.
- The HT corruption bug (HSD29) is fixed on Broadwell, so there is no need
to mark events 0xd0-0xd3 as exclusive there. This was fixed for non-PEBS
mode but not for PEBS mode.
- On Haswell and later, there is a bug with the GLOBAL_OVF_STATUS bits
reporting counter overflows for PEBS events. The SDM says they should not
be set, but I ran into cases where bits 0-3 were set, causing double
processing of the samples and samples missing the EXACT tag. See the
changelogs for details.
Tested on Haswell, Broadwell, Skylake.
Patches are relative to tip.git.
Stephane Eranian (3):
perf/x86/intel: add definition for PT PMI bit
perf/x86/pebs: add workaround for broken OVFL status on HSW
perf/x86/pebs: add proper PEBS constraints for Broadwell
arch/x86/events/intel/core.c | 12 +++++++++++-
arch/x86/events/intel/ds.c | 24 ++++++++++++++++++++++++
arch/x86/include/asm/perf_event.h | 1 +
3 files changed, 36 insertions(+), 1 deletion(-)
--
2.5.0
This patch adds a definition for GLOBAL_OVFL_STATUS bit 55
which is used with the Processor Trace (PT) feature.
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/include/asm/perf_event.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7bcb861..5a2ed3e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -165,6 +165,7 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_ASIF BIT_ULL(60)
#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
+#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
/*
* IBS cpuid feature detection
--
2.5.0
This patch fixes an issue with the GLOBAL_OVERFLOW_STATUS bits
on Haswell, Broadwell and Skylake processors when using PEBS.
The SDM stipulates that when the PEBS interrupt threshold is crossed, an
interrupt is posted and the kernel is interrupted. The kernel will find
GLOBAL_OVF_STATUS bit 62 set indicating there are PEBS records
to drain. But the bits corresponding to the actual counters should
NOT be set. The kernel follows the SDM and assumes that all PEBS
events are processed in the drain_pebs() callback. The kernel then
checks for remaining overflows on any other (non-PEBS) events and
processes these in the for_each_bit_set(&status) loop.
As it turns out, under certain conditions on HSW and later processors,
on PEBS buffer interrupt, bit 62 is set but the counter bits may be set
as well. In that case, the kernel drains PEBS and generates SAMPLES with
the EXACT tag, then it processes the counter bits, and generates
normal (non-EXACT) SAMPLES.
I ran into this problem by trying to understand why on HSW sampling
on a PEBS event was sometimes returning SAMPLES without the EXACT tag.
This should not happen on user level code because HSW has the
eventing_ip which always points to the instruction that caused the
event.
The workaround in this patch simply ensures that the bits for
the counters used for PEBS events are cleared after the PEBS
buffer has been drained. With this fix 100% of the PEBS
samples on my user code report the EXACT tag.
Before:
$ perf record -e cpu/event=0xd0,umask=0x81/upp ./multichase
$ perf report -D | fgrep SAMPLES
PERF_RECORD_SAMPLE(IP, 0x2): 11775/11775: 0x406de5 period: 73469 addr: 0 exact=Y
\--- EXACT tag is missing
After:
$ perf record -e cpu/event=0xd0,umask=0x81/upp ./multichase
$ perf report -D | fgrep SAMPLES
PERF_RECORD_SAMPLE(IP, 0x4002): 11775/11775: 0x406de5 period: 73469 addr: 0 exact=Y
\--- EXACT tag is set
The problem tends to appear more often when multiple PEBS events are used.
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/events/intel/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a7ec685..bdb77ed 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1884,6 +1884,16 @@ again:
if (__test_and_clear_bit(62, (unsigned long *)&status)) {
handled++;
x86_pmu.drain_pebs(regs);
+ /*
+ * There are cases where, even though, the PEBS ovfl bit is set in
+ * GLOBAL_OVF_STATUS, the PEBS events may also have their overflow bits
+ * set for their counters. We must clear them here because they have
+ * been processed as exact samples in the drain_pebs() routine. They
+ * must not be processed again in the for_each_bit_set() loop for
+ * regular samples below.
+ */
+ status &= ~cpuc->pebs_enabled;
+ status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
}
/*
--
2.5.0
This patch adds a Broadwell specific PEBS event constraint table.
Broadwell has a fix for the HT corruption bug erratum HSD29 on Haswell.
Therefore, there is no need to mark events 0xd0, 0xd1, 0xd2, 0xd3 as
requiring the exclusive mode across both sibling HT threads. This holds
true for regular counting and sampling (see core.c) and PEBS (ds.c) which
we fix in this patch.
In doing so, we relax event scheduling for these events, they can
now be programmed on any 4 counters without impacting what is measured
on the sibling thread.
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index bdb77ed..a70d150 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3591,7 +3591,7 @@ __init int intel_pmu_init(void)
intel_pmu_lbr_init_hsw();
x86_pmu.event_constraints = intel_bdw_event_constraints;
- x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
+ x86_pmu.pebs_constraints = intel_bdw_pebs_event_constraints;
x86_pmu.extra_regs = intel_snbep_extra_regs;
x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
x86_pmu.pebs_prec_dist = true;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index c8a243d..c6a2f45 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -722,6 +722,30 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
EVENT_CONSTRAINT_END
};
+struct event_constraint intel_bdw_pebs_event_constraints[] = {
+ INTEL_FLAGS_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PRECDIST */
+ INTEL_PLD_CONSTRAINT(0x01cd, 0xf), /* MEM_TRANS_RETIRED.* */
+ /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
+ INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+ /* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+ INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_STORES */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x42d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_STORES */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd2, 0xf), /* MEM_LOAD_UOPS_L3_HIT_RETIRED.* */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd3, 0xf), /* MEM_LOAD_UOPS_L3_MISS_RETIRED.* */
+ /* Allow all events as PEBS with no flags */
+ INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
+ EVENT_CONSTRAINT_END
+};
+
+
struct event_constraint intel_skl_pebs_event_constraints[] = {
INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2), /* INST_RETIRED.PREC_DIST */
/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
--
2.5.0
> + /*
> + * There are cases where, even though, the PEBS ovfl bit is set in
> + * GLOBAL_OVF_STATUS, the PEBS events may also have their overflow bits
> + * set for their counters. We must clear them here because they have
> + * been processed as exact samples in the drain_pebs() routine. They
> + * must not be processed again in the for_each_bit_set() loop for
> + * regular samples below.
> + */
> + status &= ~cpuc->pebs_enabled;
> + status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
If you just clear the bits here they will not be acked and stay around
forever in GLOBAL_STATUS, which causes other problems.
You need a separate ack_status variable that contains all bits and is always
acked.
-Andi
--
[email protected] -- Speaking for myself only
On Thu, Mar 3, 2016 at 1:43 PM, Andi Kleen <[email protected]> wrote:
>
> > + /*
> > + * There are cases where, even though, the PEBS ovfl bit is set in
> > + * GLOBAL_OVF_STATUS, the PEBS events may also have their overflow bits
> > + * set for their counters. We must clear them here because they have
> > + * been processed as exact samples in the drain_pebs() routine. They
> > + * must not be processed again in the for_each_bit_set() loop for
> > + * regular samples below.
> > + */
> > + status &= ~cpuc->pebs_enabled;
> > + status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
>
> If you just clear the bits here they will not be acked and stay around
> forever in GLOBAL_STATUS, which causes other problems.
>
> You need a separate ack_status variable that contains all bits and is always
> acked.
I understand that. You mean I need a variable that has all the bits that
were set, to call intel_pmu_ack_status().
But if you look at the code, and where I made the change, there is no more
intel_pmu_ack_status() BEFORE you read the status again via
intel_pmu_get_status().
So why would I need to keep another variable around?
>
>
>
> -Andi
> --
> [email protected] -- Speaking for myself only
On Thu, Mar 03, 2016 at 03:40:49PM -0800, Stephane Eranian wrote:
> On Thu, Mar 3, 2016 at 1:43 PM, Andi Kleen <[email protected]> wrote:
> >
> > > + /*
> > > + * There are cases where, even though, the PEBS ovfl bit is set in
> > > + * GLOBAL_OVF_STATUS, the PEBS events may also have their overflow bits
> > > + * set for their counters. We must clear them here because they have
> > > + * been processed as exact samples in the drain_pebs() routine. They
> > > + * must not be processed again in the for_each_bit_set() loop for
> > > + * regular samples below.
> > > + */
> > > + status &= ~cpuc->pebs_enabled;
> > > + status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
> >
> > If you just clear the bits here they will not be acked and stay around
> > forever in GLOBAL_STATUS, which causes other problems.
> >
> > You need a separate ack_status variable that contains all bits and is always
> > acked.
>
> I understand that. You mean I need a variable that has all the bits that
> were set, to call intel_pmu_ack_status().
>
> But if you look at the code, and where I made the change, there is no more
> intel_pmu_ack_status() BEFORE you read the status again via
> intel_pmu_get_status().
>
> So why would I need to keep another variable around?
I suspect Andi is having something along:
lkml.kernel.org/r/[email protected]
applied to his tree.
On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
> I suspect Andi is having something along:
>
> lkml.kernel.org/r/[email protected]
>
> applied to his tree.
OK, I munged a bunch of patches together, please have a hard look at the
end result found in:
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
>
> > I suspect Andi is having something along:
> >
> > lkml.kernel.org/r/[email protected]
> >
> > applied to his tree.
>
> OK, I munged a bunch of patches together, please have a hard look at the
> end result found in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
>
>
I needed the declaration below to compile
jirka
---
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 0e37370..097e4c1 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -862,6 +862,8 @@ extern struct event_constraint intel_hsw_pebs_event_constraints[];
extern struct event_constraint intel_skl_pebs_event_constraints[];
+extern struct event_constraint intel_bdw_pebs_event_constraints[];
+
struct event_constraint *intel_pebs_constraints(struct perf_event *event);
void intel_pmu_pebs_enable(struct perf_event *event);
On Mon, Mar 07, 2016 at 07:27:31PM +0100, Jiri Olsa wrote:
> On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
> > On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
> >
> > > I suspect Andi is having something along:
> > >
> > > lkml.kernel.org/r/[email protected]
> > >
> > > applied to his tree.
> >
> > OK, I munged a bunch of patches together, please have a hard look at the
> > end result found in:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
> >
> >
>
> I needed declaration below to compile
Yes indeed, (because Monday) I only compiled part of the patch series
and then the build bot's initial emails all reported build success.
I have since fixed this (and gotten a fail email from the build bot).
In any case, lots of fail, but it should be fixed if you've pulled from
that git branch in the last 5 hours or so.
Commit-ID: 5690ae28e472d25e330ad0c637a5cea3fc39fb32
Gitweb: http://git.kernel.org/tip/5690ae28e472d25e330ad0c637a5cea3fc39fb32
Author: Stephane Eranian <[email protected]>
AuthorDate: Thu, 3 Mar 2016 20:50:40 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 8 Mar 2016 12:18:34 +0100
perf/x86/intel: Add definition for PT PMI bit
This patch adds a definition for GLOBAL_OVFL_STATUS bit 55
which is used with the Processor Trace (PT) feature.
Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/perf_event.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7bcb861..5a2ed3e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -165,6 +165,7 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_ASIF BIT_ULL(60)
#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
+#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
/*
* IBS cpuid feature detection
Commit-ID: 8077eca079a212f26419c57226f28696b7100683
Gitweb: http://git.kernel.org/tip/8077eca079a212f26419c57226f28696b7100683
Author: Stephane Eranian <[email protected]>
AuthorDate: Thu, 3 Mar 2016 20:50:41 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 8 Mar 2016 12:18:35 +0100
perf/x86/pebs: Add workaround for broken OVFL status on HSW+
This patch fixes an issue with the GLOBAL_OVERFLOW_STATUS bits on
Haswell, Broadwell and Skylake processors when using PEBS.
The SDM stipulates that when the PEBS interrupt threshold is crossed,
an interrupt is posted and the kernel is interrupted. The kernel will
find GLOBAL_OVF_STATUS bit 62 set indicating there are PEBS records to
drain. But the bits corresponding to the actual counters should NOT be
set. The kernel follows the SDM and assumes that all PEBS events are
processed in the drain_pebs() callback. The kernel then checks for
remaining overflows on any other (non-PEBS) events and processes these
in the for_each_bit_set(&status) loop.
As it turns out, under certain conditions on HSW and later processors,
on PEBS buffer interrupt, bit 62 is set but the counter bits may be
set as well. In that case, the kernel drains PEBS and generates
SAMPLES with the EXACT tag, then it processes the counter bits, and
generates normal (non-EXACT) SAMPLES.
I ran into this problem by trying to understand why on HSW sampling on
a PEBS event was sometimes returning SAMPLES without the EXACT tag.
This should not happen on user level code because HSW has the
eventing_ip which always points to the instruction that caused the
event.
The workaround in this patch simply ensures that the bits for the
counters used for PEBS events are cleared after the PEBS buffer has
been drained. With this fix 100% of the PEBS samples on my user code
report the EXACT tag.
Before:
$ perf record -e cpu/event=0xd0,umask=0x81/upp ./multichase
$ perf report -D | fgrep SAMPLES
PERF_RECORD_SAMPLE(IP, 0x2): 11775/11775: 0x406de5 period: 73469 addr: 0 exact=Y
\--- EXACT tag is missing
After:
$ perf record -e cpu/event=0xd0,umask=0x81/upp ./multichase
$ perf report -D | fgrep SAMPLES
PERF_RECORD_SAMPLE(IP, 0x4002): 11775/11775: 0x406de5 period: 73469 addr: 0 exact=Y
\--- EXACT tag is set
The problem tends to appear more often when multiple PEBS events are used.
Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/intel/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b3f6349..6567c62 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1892,6 +1892,16 @@ again:
if (__test_and_clear_bit(62, (unsigned long *)&status)) {
handled++;
x86_pmu.drain_pebs(regs);
+ /*
+ * There are cases where, even though, the PEBS ovfl bit is set
+ * in GLOBAL_OVF_STATUS, the PEBS events may also have their
+ * overflow bits set for their counters. We must clear them
+ * here because they have been processed as exact samples in
+ * the drain_pebs() routine. They must not be processed again
+ * in the for_each_bit_set() loop for regular samples below.
+ */
+ status &= ~cpuc->pebs_enabled;
+ status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
}
/*
Commit-ID: b3e6246336a4a329644418a1c66e2c6bed44ef81
Gitweb: http://git.kernel.org/tip/b3e6246336a4a329644418a1c66e2c6bed44ef81
Author: Stephane Eranian <[email protected]>
AuthorDate: Thu, 3 Mar 2016 20:50:42 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 8 Mar 2016 12:19:12 +0100
perf/x86/pebs: Add proper PEBS constraints for Broadwell
This patch adds a Broadwell specific PEBS event constraint table.
Broadwell has a fix for the HT corruption bug erratum HSD29 on
Haswell. Therefore, there is no need to mark events 0xd0, 0xd1, 0xd2,
0xd3 as requiring the exclusive mode across both sibling HT threads.
This holds true for regular counting and sampling (see core.c) and
PEBS (ds.c) which we fix in this patch.
In doing so, we relax event scheduling for these events, they can now
be programmed on any 4 counters without impacting what is measured on
the sibling thread.
Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 24 ++++++++++++++++++++++++
arch/x86/events/perf_event.h | 2 ++
3 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 6567c62..edac81c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3602,7 +3602,7 @@ __init int intel_pmu_init(void)
intel_pmu_lbr_init_hsw();
x86_pmu.event_constraints = intel_bdw_event_constraints;
- x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
+ x86_pmu.pebs_constraints = intel_bdw_pebs_event_constraints;
x86_pmu.extra_regs = intel_snbep_extra_regs;
x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
x86_pmu.pebs_prec_dist = true;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 22ece02..a99a8cb 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -722,6 +722,30 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
EVENT_CONSTRAINT_END
};
+struct event_constraint intel_bdw_pebs_event_constraints[] = {
+ INTEL_FLAGS_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PRECDIST */
+ INTEL_PLD_CONSTRAINT(0x01cd, 0xf), /* MEM_TRANS_RETIRED.* */
+ /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
+ INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+ /* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+ INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_STORES */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x42d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_STORES */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd2, 0xf), /* MEM_LOAD_UOPS_L3_HIT_RETIRED.* */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd3, 0xf), /* MEM_LOAD_UOPS_L3_MISS_RETIRED.* */
+ /* Allow all events as PEBS with no flags */
+ INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
+ EVENT_CONSTRAINT_END
+};
+
+
struct event_constraint intel_skl_pebs_event_constraints[] = {
INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2), /* INST_RETIRED.PREC_DIST */
/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1ab6279..24e259e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -861,6 +861,8 @@ extern struct event_constraint intel_ivb_pebs_event_constraints[];
extern struct event_constraint intel_hsw_pebs_event_constraints[];
+extern struct event_constraint intel_bdw_pebs_event_constraints[];
+
extern struct event_constraint intel_skl_pebs_event_constraints[];
struct event_constraint *intel_pebs_constraints(struct perf_event *event);
hi,
On Mon, Mar 7, 2016 at 12:25 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Mar 07, 2016 at 07:27:31PM +0100, Jiri Olsa wrote:
> > On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
> > > On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
> > >
> > > > I suspect Andi is having something along:
> > > >
> > > > lkml.kernel.org/r/[email protected]
> > > >
> > > > applied to his tree.
> > >
> > > OK, I munged a bunch of patches together, please have a hard look at the
> > > end result found in:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
> > >
I ran this kernel on Haswell. Even with Andi's fixes the problem I identified is
still there, so my patch is still needed.
>
> > >
> >
> > I needed declaration below to compile
>
> Yes indeed, (because Monday) I only compiled part of the patch series
> and then the build bot initial emails were all builds success.
>
> I have since fixed this (and gotten a fail email from the build bot).
>
> In any case, lots of fail, but it should be fixed if you've pulled from
> that git branch in the last 5 hours or so.
On Tue, Mar 08, 2016 at 12:59:23PM -0800, Stephane Eranian wrote:
> hi,
>
> On Mon, Mar 7, 2016 at 12:25 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Mar 07, 2016 at 07:27:31PM +0100, Jiri Olsa wrote:
> > > On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
> > > > On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
> > > >
> > > > > I suspect Andi is having something along:
> > > > >
> > > > > lkml.kernel.org/r/[email protected]
> > > > >
> > > > > applied to his tree.
> > > >
> > > > OK, I munged a bunch of patches together, please have a hard look at the
> > > > end result found in:
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
> > > >
>
> I ran this kernel on Haswell. Even with Andi's fixes the problem I identified is
> still there, so my patch is still needed.
Right, your patch should be included in that kernel, or did I make a
royal mess of things?
I put Andi's late status ack on top of your patch.
Also note, Ingo merged most of those patches today, all except the top
3, because Andi wanted to double check something.
Hi,
On Tue, Mar 8, 2016 at 1:07 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 08, 2016 at 12:59:23PM -0800, Stephane Eranian wrote:
>> hi,
>>
>> On Mon, Mar 7, 2016 at 12:25 PM, Peter Zijlstra <[email protected]> wrote:
>> >
>> > On Mon, Mar 07, 2016 at 07:27:31PM +0100, Jiri Olsa wrote:
>> > > On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
>> > > > On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
>> > > >
>> > > > > I suspect Andi is having something along:
>> > > > >
>> > > > > lkml.kernel.org/r/[email protected]
>> > > > >
>> > > > > applied to his tree.
>> > > >
>> > > > OK, I munged a bunch of patches together, please have a hard look at the
>> > > > end result found in:
>> > > >
>> > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
>> > > >
>>
>> I ran this kernel on Haswell. Even with Andi's fixes the problem I identified is
>> still there, so my patch is still needed.
>
> Right, your patch should be included in that kernel, or did I make a
> royal mess of things?
>
No, it is as expected for the OVF PMI fix.
> I put Andi's late status ack on top of your patch.
>
> Also note, Ingo merged most of those patches today, all except the top
> 3, because Andi wanted to double check something.
Ok. I will try on SKL today.
On Tue, Mar 8, 2016 at 1:13 PM, Stephane Eranian <[email protected]> wrote:
> Hi,
>
> On Tue, Mar 8, 2016 at 1:07 PM, Peter Zijlstra <[email protected]> wrote:
>> On Tue, Mar 08, 2016 at 12:59:23PM -0800, Stephane Eranian wrote:
>>> hi,
>>>
>>> On Mon, Mar 7, 2016 at 12:25 PM, Peter Zijlstra <[email protected]> wrote:
>>> >
>>> > On Mon, Mar 07, 2016 at 07:27:31PM +0100, Jiri Olsa wrote:
>>> > > On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
>>> > > > On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
>>> > > >
>>> > > > > I suspect Andi is having something along:
>>> > > > >
>>> > > > > lkml.kernel.org/r/[email protected]
>>> > > > >
>>> > > > > applied to his tree.
>>> > > >
>>> > > > OK, I munged a bunch of patches together, please have a hard look at the
>>> > > > end result found in:
>>> > > >
>>> > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
>>> > > >
>>>
>>> I ran this kernel on Haswell. Even with Andi's fixes the problem I identified is
>>> still there, so my patch is still needed.
>>
>> Right, your patch should be included in that kernel, or did I make a
>> royal mess of things?
>>
> No, it is as expected for the OVF PMI fix.
>
>> I put Andi's late status ack on top of your patch.
>>
Ok, I ran into a problem on Broadwell with your branch with Andi's
patches. I see
a problem which had disappeared since SandyBridge:
[11551.128422] ------------[ cut here ]------------
[11551.128435] WARNING: CPU: 3 PID: 12114 at arch/x86/events/intel/core.c:1868 intel_pmu_handle_irq+0x2da/0x4b0()
[11551.128437] perfevents: irq loop stuck!
[11551.128469] <NMI> [<ffffffff81663975>] dump_stack+0x4d/0x63
[11551.128479] [<ffffffff810b8657>] warn_slowpath_common+0x97/0xe0
[11551.128482] [<ffffffff810b8756>] warn_slowpath_fmt+0x46/0x50
[11551.128486] [<ffffffff8100b6ca>] intel_pmu_handle_irq+0x2da/0x4b0
[11551.128491] [<ffffffff81004569>] perf_event_nmi_handler+0x39/0x60
[11551.128494] [<ffffffff8107be61>] nmi_handle+0x61/0x110
[11551.128497] [<ffffffff8107c684>] default_do_nmi+0x44/0x110
[11551.128500] [<ffffffff8107c827>] do_nmi+0xd7/0x140
[11551.128504] [<ffffffff8166e127>] end_repeat_nmi+0x1a/0x1e
[11551.128507] [<ffffffff81009dd6>] ? native_write_msr+0x6/0x30
[11551.128510] [<ffffffff81009dd6>] ? native_write_msr+0x6/0x30
[11551.128514] [<ffffffff81009dd6>] ? native_write_msr+0x6/0x30
[11551.128515] <<EOE>> [<ffffffff8100b385>] ? intel_pmu_enable_event+0x215/0x230
[11551.128520] [<ffffffff81005a0d>] x86_pmu_start+0x8d/0x120
[11551.128523] [<ffffffff810061db>] x86_pmu_enable+0x27b/0x2f0
[11551.128527] [<ffffffff8118d63d>] perf_pmu_enable+0x1d/0x30
[11551.128530] [<ffffffff81191bca>] ctx_resched+0x5a/0x70
[11551.128532] [<ffffffff81191d8c>] __perf_event_enable+0x1ac/0x210
[11551.128537] [<ffffffff81188f81>] event_function+0xa1/0x170
[11551.128540] [<ffffffff811899b0>] ? perf_duration_warn+0x70/0x70
[11551.128543] [<ffffffff811899f7>] remote_function+0x47/0x60
[11551.128547] [<ffffffff8112a178>] generic_exec_single+0xa8/0xb0
[11551.128550] [<ffffffff811899b0>] ? perf_duration_warn+0x70/0x70
[11551.128553] [<ffffffff811899b0>] ? perf_duration_warn+0x70/0x70
[11551.128555] [<ffffffff8112a298>] smp_call_function_single+0xa8/0x100
[11551.128559] [<ffffffff8118aec4>] event_function_call+0x84/0x100
[11551.128561] [<ffffffff81191be0>] ? ctx_resched+0x70/0x70
[11551.128564] [<ffffffff81191be0>] ? ctx_resched+0x70/0x70
[11551.128566] [<ffffffff81188ee0>] ? perf_ctx_lock+0x30/0x30
[11551.128570] [<ffffffff8118b050>] _perf_event_enable+0x60/0x80
[11551.128572] [<ffffffff8118fc61>] perf_ioctl+0x271/0x3e0
The infinite loop in the irq handler!
But here it seems there is a race with a perf_event ioctl(), likely
one resetting the period.
I am not using the perf tool here, just running a self-monitoring task.
>> Also note, Ingo merged most of those patches today, all except the top
>> 3, because Andi wanted to double check something.
>
On Tue, Mar 8, 2016 at 9:34 PM, Stephane Eranian <[email protected]> wrote:
> On Tue, Mar 8, 2016 at 1:13 PM, Stephane Eranian <[email protected]> wrote:
>> Hi,
>>
>> On Tue, Mar 8, 2016 at 1:07 PM, Peter Zijlstra <[email protected]> wrote:
>>> On Tue, Mar 08, 2016 at 12:59:23PM -0800, Stephane Eranian wrote:
>>>> hi,
>>>>
>>>> On Mon, Mar 7, 2016 at 12:25 PM, Peter Zijlstra <[email protected]> wrote:
>>>> >
>>>> > On Mon, Mar 07, 2016 at 07:27:31PM +0100, Jiri Olsa wrote:
>>>> > > On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
>>>> > > > On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
>>>> > > >
>>>> > > > > I suspect Andi is having something along:
>>>> > > > >
>>>> > > > > lkml.kernel.org/r/[email protected]
>>>> > > > >
>>>> > > > > applied to his tree.
>>>> > > >
>>>> > > > OK, I munged a bunch of patches together, please have a hard look at the
>>>> > > > end result found in:
>>>> > > >
>>>> > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
>>>> > > >
>>>>
>>>> I ran this kernel on Haswell. Even with Andi's fixes the problem I identified is
>>>> still there, so my patch is still needed.
>>>
>>> Right, your patch should be included in that kernel, or did I make a
>>> royal mess of things?
>>>
>> No, it is as expected for the OVF PMI fix.
>>
>>> I put Andi's late status ack on top of your patch.
>>>
> Ok, I ran into a problem on Broadwell with your branch with Andi's
> patches. I see
Sorry, this is with tip.git and not your branch. Will try with it too.
> a problem which had disappeared since SandyBridge:
>
> [11551.128422] ------------[ cut here ]------------
> [11551.128435] WARNING: CPU: 3 PID: 12114 at
> arch/x86/events/intel/core.c:1868 intel_pmu_handle_irq+0x2da/0x4b0()
> [11551.128437] perfevents: irq loop stuck!
> [11551.128469] <NMI> [<ffffffff81663975>] dump_stack+0x4d/0x63
> [11551.128479] [<ffffffff810b8657>] warn_slowpath_common+0x97/0xe0
> [11551.128482] [<ffffffff810b8756>] warn_slowpath_fmt+0x46/0x50
> [11551.128486] [<ffffffff8100b6ca>] intel_pmu_handle_irq+0x2da/0x4b0
> [11551.128491] [<ffffffff81004569>] perf_event_nmi_handler+0x39/0x60
> [11551.128494] [<ffffffff8107be61>] nmi_handle+0x61/0x110
> [11551.128497] [<ffffffff8107c684>] default_do_nmi+0x44/0x110
> [11551.128500] [<ffffffff8107c827>] do_nmi+0xd7/0x140
> [11551.128504] [<ffffffff8166e127>] end_repeat_nmi+0x1a/0x1e
> [11551.128507] [<ffffffff81009dd6>] ? native_write_msr+0x6/0x30
> [11551.128510] [<ffffffff81009dd6>] ? native_write_msr+0x6/0x30
> [11551.128514] [<ffffffff81009dd6>] ? native_write_msr+0x6/0x30
> [11551.128515] <<EOE>> [<ffffffff8100b385>] ?
> intel_pmu_enable_event+0x215/0x230
> [11551.128520] [<ffffffff81005a0d>] x86_pmu_start+0x8d/0x120
> [11551.128523] [<ffffffff810061db>] x86_pmu_enable+0x27b/0x2f0
> [11551.128527] [<ffffffff8118d63d>] perf_pmu_enable+0x1d/0x30
> [11551.128530] [<ffffffff81191bca>] ctx_resched+0x5a/0x70
> [11551.128532] [<ffffffff81191d8c>] __perf_event_enable+0x1ac/0x210
> [11551.128537] [<ffffffff81188f81>] event_function+0xa1/0x170
> [11551.128540] [<ffffffff811899b0>] ? perf_duration_warn+0x70/0x70
> [11551.128543] [<ffffffff811899f7>] remote_function+0x47/0x60
> [11551.128547] [<ffffffff8112a178>] generic_exec_single+0xa8/0xb0
> [11551.128550] [<ffffffff811899b0>] ? perf_duration_warn+0x70/0x70
> [11551.128553] [<ffffffff811899b0>] ? perf_duration_warn+0x70/0x70
> [11551.128555] [<ffffffff8112a298>] smp_call_function_single+0xa8/0x100
> [11551.128559] [<ffffffff8118aec4>] event_function_call+0x84/0x100
> [11551.128561] [<ffffffff81191be0>] ? ctx_resched+0x70/0x70
> [11551.128564] [<ffffffff81191be0>] ? ctx_resched+0x70/0x70
> [11551.128566] [<ffffffff81188ee0>] ? perf_ctx_lock+0x30/0x30
> [11551.128570] [<ffffffff8118b050>] _perf_event_enable+0x60/0x80
> [11551.128572] [<ffffffff8118fc61>] perf_ioctl+0x271/0x3e0
>
> The infinite loop in the irq handler!
>
> But here it seems there is a race with a perf_events ioctl() to likely
> reset the period.
> I am not using the perf tool here just running a self-monitoring task.
>
>
>>> Also note, Ingo merged most of those patches today, all except the top
>>> 3, because Andi wanted to double check something.
>>
On Tue, Mar 8, 2016 at 9:44 PM, Stephane Eranian <[email protected]> wrote:
> On Tue, Mar 8, 2016 at 9:34 PM, Stephane Eranian <[email protected]> wrote:
>> On Tue, Mar 8, 2016 at 1:13 PM, Stephane Eranian <[email protected]> wrote:
>>> Hi,
>>>
>>> On Tue, Mar 8, 2016 at 1:07 PM, Peter Zijlstra <[email protected]> wrote:
>>>> On Tue, Mar 08, 2016 at 12:59:23PM -0800, Stephane Eranian wrote:
>>>>> hi,
>>>>>
>>>>> On Mon, Mar 7, 2016 at 12:25 PM, Peter Zijlstra <[email protected]> wrote:
>>>>> >
>>>>> > On Mon, Mar 07, 2016 at 07:27:31PM +0100, Jiri Olsa wrote:
>>>>> > > On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
>>>>> > > > On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
>>>>> > > >
>>>>> > > > > I suspect Andi is having something along:
>>>>> > > > >
>>>>> > > > > lkml.kernel.org/r/[email protected]
>>>>> > > > >
>>>>> > > > > applied to his tree.
>>>>> > > >
>>>>> > > > OK, I munged a bunch of patches together, please have a hard look at the
>>>>> > > > end result found in:
>>>>> > > >
>>>>> > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
>>>>> > > >
>>>>>
>>>>> I ran this kernel on Haswell. Even with Andi's fixes the problem I identified is
>>>>> still there, so my patch is still needed.
>>>>
>>>> Right, your patch should be included in that kernel, or did I make a
>>>> royal mess of things?
>>>>
>>> No, it is as expected for the OVF PMI fix.
>>>
>>>> I put Andi's late status ack on top of your patch.
>>>>
>> Ok, I ran into a problem on Broadwell with your branch with Andi's
>> patches. I see
>
> Sorry this is with tip.git and not your branch. Will try with it too.
With your queue.tip perf/core branch, I run into another problem.
I am monitoring with 2 PEBS events and I have the NMI watchdog enabled.

I see non-EXACT PEBS records again, despite my change (which is in).
I tracked it down to the following condition after the test of bit 62:

[31137.273061] CPU71 status=0x200000001 orig_status=0x200000001 bit62=0

The IRQ handler is called because the fixed counter for the NMI watchdog
has overflowed, and it sees this in bit 33. But it also sees that one of
the PEBS events has overflowed, yet bit 62 is not set. Therefore both
overflows are treated as regular overflows, and drain_pebs() is not
called, generating a non-EXACT record for the PEBS counter (counter 0).
So something is still wrong, and this is on Broadwell.

First, I don't understand why the OVF bit for counter 0 is set. It
should not be, according to the specs, because the counter is in PEBS
mode. There must be a race there, so we have to handle it by relying on
cpuc->pebs_enabled. I will try that. We likely also need to force OVF
bit 62 to 1 so we can ack it at the end, in case it gets set.
On Wed, Mar 09, 2016 at 09:40:07AM -0800, Stephane Eranian wrote:
> With your queue.tip perf/core branch, I run into another problem.
> I am monitoring with 2 PEBS events and I have the NMI watchdog enabled.
>
> I see non-EXACT PEBS records again, despite my change (which is in).
> I tracked it down to the following issue after the testing of bit 62:
>
> [31137.273061] CPU71 status=0x200000001 orig_status=0x200000001 bit62=0
>
> The IRQ handler is called because the fixed counter for the NMI has overflowed
> and it sees this in bit 33, but it also sees that one of the PEBS
> events has also
> overflowed, yet bit 62 is not set. Therefore both overflows are
> treated as regular
> and the drain_pebs() is not called generating a non-EXACT record for the PEBS
> counter (counter 0). So something is wrong still and this is on Broadwell.
>
> First, I don't understand why the OVF bit for counter 0 is set. It
> should not according
> to specs because the counter is in PEBS mode. There must be a race there. So we
> have to handle it by relying on cpuc->pebs_enabled. I will try that.
> We likely also
> need to force OVF bit 62 to 1 so we can ack it in the end (and in case
> it gets set).
How about we make the clear of pebs_enabled unconditional?
---
arch/x86/events/intel/core.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 68fa55b4d42e..dc9579665425 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1883,6 +1883,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
status &= ~(GLOBAL_STATUS_COND_CHG |
GLOBAL_STATUS_ASIF |
GLOBAL_STATUS_LBRS_FROZEN);
+ /*
+ * There are cases where, even though, the PEBS ovfl bit is set
+ * in GLOBAL_OVF_STATUS, the PEBS events may also have their
+ * overflow bits set for their counters. We must clear them
+ * here because they have been processed as exact samples in
+ * the drain_pebs() routine. They must not be processed again
+ * in the for_each_bit_set() loop for regular samples below.
+ */
+ status &= ~cpuc->pebs_enabled;
+
if (!status)
goto done;
@@ -1892,16 +1902,6 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
if (__test_and_clear_bit(62, (unsigned long *)&status)) {
handled++;
x86_pmu.drain_pebs(regs);
- /*
- * There are cases where, even though, the PEBS ovfl bit is set
- * in GLOBAL_OVF_STATUS, the PEBS events may also have their
- * overflow bits set for their counters. We must clear them
- * here because they have been processed as exact samples in
- * the drain_pebs() routine. They must not be processed again
- * in the for_each_bit_set() loop for regular samples below.
- */
- status &= ~cpuc->pebs_enabled;
- status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
}
/*
On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
>
> > I suspect Andi is having something along:
> >
> > lkml.kernel.org/r/[email protected]
> >
> > applied to his tree.
>
> OK, I munged a bunch of patches together, please have a hard look at the
> end result found in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
So while running tip/master I just triggered the "irq loop stuck" warn
on my IVB, with status==1, and PMC0 didn't look weird at all.
I've not seen that for a fair while. But it does show it still happens
on not quite antique hardware.
On Thu, Mar 10, 2016 at 5:53 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Mar 07, 2016 at 01:18:40PM +0100, Peter Zijlstra wrote:
>> On Mon, Mar 07, 2016 at 11:24:13AM +0100, Peter Zijlstra wrote:
>>
>> > I suspect Andi is having something along:
>> >
>> > lkml.kernel.org/r/[email protected]
>> >
>> > applied to his tree.
>>
>> OK, I munged a bunch of patches together, please have a hard look at the
>> end result found in:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
>
> So while running tip/master I just triggered the "irq loop stuck" warn
> on my IVB, with status==1, and PMC0 didn't look weird at all.
>
> I've not seen that for a fair while. But it does show it still happens
> on not quite antique hardware.
Yes, as I said, tip.git is missing something; with Andi's 3 missing
patches I do not get that. But with them, I was getting the problem I
described earlier, so I just posted a patch to handle that particular
case where bit 62 is not set yet the PEBS counters show an overflow.
Test it.
Thanks.
Just spotted this again, ping?
On Thu, Mar 10, 2016 at 11:42:36AM +0100, Peter Zijlstra wrote:
> On Wed, Mar 09, 2016 at 09:40:07AM -0800, Stephane Eranian wrote:
> > With your queue.tip perf/core branch, I run into another problem.
> > I am monitoring with 2 PEBS events and I have the NMI watchdog enabled.
> >
> > I see non-EXACT PEBS records again, despite my change (which is in).
> > I tracked it down to the following issue after the testing of bit 62:
> >
> > [31137.273061] CPU71 status=0x200000001 orig_status=0x200000001 bit62=0
> >
> > The IRQ handler is called because the fixed counter for the NMI has overflowed
> > and it sees this in bit 33, but it also sees that one of the PEBS
> > events has also
> > overflowed, yet bit 62 is not set. Therefore both overflows are
> > treated as regular
> > and the drain_pebs() is not called generating a non-EXACT record for the PEBS
> > counter (counter 0). So something is wrong still and this is on Broadwell.
> >
> > First, I don't understand why the OVF bit for counter 0 is set. It
> > should not according
> > to specs because the counter is in PEBS mode. There must be a race there. So we
> > have to handle it by relying on cpuc->pebs_enabled. I will try that.
> > We likely also
> > need to force OVF bit 62 to 1 so we can ack it in the end (and in case
> > it gets set).
>
> How about we make the clear of pebs_enabled unconditional?
>
> ---
> arch/x86/events/intel/core.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 68fa55b4d42e..dc9579665425 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -1883,6 +1883,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
> status &= ~(GLOBAL_STATUS_COND_CHG |
> GLOBAL_STATUS_ASIF |
> GLOBAL_STATUS_LBRS_FROZEN);
> + /*
> + * There are cases where, even though, the PEBS ovfl bit is set
> + * in GLOBAL_OVF_STATUS, the PEBS events may also have their
> + * overflow bits set for their counters. We must clear them
> + * here because they have been processed as exact samples in
> + * the drain_pebs() routine. They must not be processed again
> + * in the for_each_bit_set() loop for regular samples below.
> + */
> + status &= ~cpuc->pebs_enabled;
> +
> if (!status)
> goto done;
>
> @@ -1892,16 +1902,6 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
> if (__test_and_clear_bit(62, (unsigned long *)&status)) {
> handled++;
> x86_pmu.drain_pebs(regs);
> - /*
> - * There are cases where, even though, the PEBS ovfl bit is set
> - * in GLOBAL_OVF_STATUS, the PEBS events may also have their
> - * overflow bits set for their counters. We must clear them
> - * here because they have been processed as exact samples in
> - * the drain_pebs() routine. They must not be processed again
> - * in the for_each_bit_set() loop for regular samples below.
> - */
> - status &= ~cpuc->pebs_enabled;
> - status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
> }
>
> /*
On Wed, Dec 14, 2016 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
>
> Just spotted this again, ping?
>
Ok, on what processor, running what command, so I can try to reproduce?
> On Thu, Mar 10, 2016 at 11:42:36AM +0100, Peter Zijlstra wrote:
>> On Wed, Mar 09, 2016 at 09:40:07AM -0800, Stephane Eranian wrote:
>> > With your queue.tip perf/core branch, I run into another problem.
>> > I am monitoring with 2 PEBS events and I have the NMI watchdog enabled.
>> >
>> > I see non-EXACT PEBS records again, despite my change (which is in).
>> > I tracked it down to the following issue after the testing of bit 62:
>> >
>> > [31137.273061] CPU71 status=0x200000001 orig_status=0x200000001 bit62=0
>> >
>> > The IRQ handler is called because the fixed counter for the NMI has overflowed
>> > and it sees this in bit 33, but it also sees that one of the PEBS
>> > events has also
>> > overflowed, yet bit 62 is not set. Therefore both overflows are
>> > treated as regular
>> > and the drain_pebs() is not called generating a non-EXACT record for the PEBS
>> > counter (counter 0). So something is wrong still and this is on Broadwell.
>> >
>> > First, I don't understand why the OVF bit for counter 0 is set. It
>> > should not according
>> > to specs because the counter is in PEBS mode. There must be a race there. So we
>> > have to handle it by relying on cpuc->pebs_enabled. I will try that.
>> > We likely also
>> > need to force OVF bit 62 to 1 so we can ack it in the end (and in case
>> > it gets set).
>>
>> How about we make the clear of pebs_enabled unconditional?
>>
>> ---
>> arch/x86/events/intel/core.c | 20 ++++++++++----------
>> 1 file changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 68fa55b4d42e..dc9579665425 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -1883,6 +1883,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
>> status &= ~(GLOBAL_STATUS_COND_CHG |
>> GLOBAL_STATUS_ASIF |
>> GLOBAL_STATUS_LBRS_FROZEN);
>> + /*
>> + * There are cases where, even though, the PEBS ovfl bit is set
>> + * in GLOBAL_OVF_STATUS, the PEBS events may also have their
>> + * overflow bits set for their counters. We must clear them
>> + * here because they have been processed as exact samples in
>> + * the drain_pebs() routine. They must not be processed again
>> + * in the for_each_bit_set() loop for regular samples below.
>> + */
>> + status &= ~cpuc->pebs_enabled;
>> +
>> if (!status)
>> goto done;
>>
>> @@ -1892,16 +1902,6 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
>> if (__test_and_clear_bit(62, (unsigned long *)&status)) {
>> handled++;
>> x86_pmu.drain_pebs(regs);
>> - /*
>> - * There are cases where, even though, the PEBS ovfl bit is set
>> - * in GLOBAL_OVF_STATUS, the PEBS events may also have their
>> - * overflow bits set for their counters. We must clear them
>> - * here because they have been processed as exact samples in
>> - * the drain_pebs() routine. They must not be processed again
>> - * in the for_each_bit_set() loop for regular samples below.
>> - */
>> - status &= ~cpuc->pebs_enabled;
>> - status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
>> }
>>
>> /*
On Wed, Dec 14, 2016 at 11:26:49PM -0800, Stephane Eranian wrote:
> On Wed, Dec 14, 2016 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Just spotted this again, ping?
> >
> Ok, on what processor running what command, so I can try and reproduce?
For me it's snb_x (model 45) and Peter's ivb-ep (model 62).
The log below is after several hours of fuzzer testing. I'll try again with the change.
jirka
---
[14404.947844] perfevents: irq loop stuck!
[14404.952560] ------------[ cut here ]------------
[14404.957720] WARNING: CPU: 0 PID: 0 at arch/x86/events/intel/core.c:2093 intel_pmu_handle_irq+0x2f8/0x4c0
[14404.968305] Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_devintf crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel pcspkr ipmi_ssif tpm_tis i2c_i801 tpm_tis_core ipmi_si tpm i2c_smbus ipmi_msghandler cdc_ether usbnet mii shpchp ioatdma wmi lpc_ich xfs libcrc32c mgag200 drm_kms_helper ttm drm igb ptp crc32c_intel pps_core dca i2c_algo_bit megaraid_sas fjes
[14405.019901] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc8+ #51
[14405.026985] Hardware name: IBM System x3650 M4 : -[7915E2G]-/00Y7683, BIOS -[VVE124AUS-1.30]- 11/21/2012
[14405.037568] ffff880277a05b08 ffffffff81463243 ffff880277a05b58 0000000000000000
[14405.046601] ffff880277a05b48 ffffffff810b698b 0000082d81133a1d 0000000000000064
[14405.055634] ffff880277a0a380 ffff880276208800 0000000000000040 ffff880277a0a580
[14405.064665] Call Trace:
[14405.067394] <NMI> [<ffffffff81463243>] dump_stack+0x86/0xc3
[14405.073807] [<ffffffff810b698b>] __warn+0xcb/0xf0
[14405.079156] [<ffffffff810b6a0f>] warn_slowpath_fmt+0x5f/0x80
[14405.085569] [<ffffffff810b69b5>] ? warn_slowpath_fmt+0x5/0x80
[14405.092081] [<ffffffff8100d448>] intel_pmu_handle_irq+0x2f8/0x4c0
[14405.098971] [<ffffffff810060dc>] ? perf_event_nmi_handler+0x2c/0x50
[14405.106065] [<ffffffff8100d150>] ? intel_pmu_save_and_restart+0x50/0x50
[14405.113547] [<ffffffff810609f0>] ? nmi_raise_cpu_backtrace+0x20/0x20
[14405.120737] [<ffffffff811a2555>] ? ftrace_ops_test.isra.23+0x65/0xa0
[14405.127917] [<ffffffff8147a2ee>] ? bsearch+0x5e/0x90
[14405.133556] [<ffffffff811a0f50>] ? __add_hash_entry+0x50/0x50
[14405.140066] [<ffffffff8147a2ee>] ? bsearch+0x5e/0x90
[14405.145704] [<ffffffff811a0f50>] ? __add_hash_entry+0x50/0x50
[14405.152214] [<ffffffff810609f0>] ? nmi_raise_cpu_backtrace+0x20/0x20
[14405.159403] [<ffffffff810609f0>] ? nmi_raise_cpu_backtrace+0x20/0x20
[14405.166594] [<ffffffff81133a1d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[14405.173979] [<ffffffff811a31be>] ? ftrace_ops_list_func+0xce/0x1d0
[14405.180974] [<ffffffff818e2957>] ? ftrace_call+0x5/0x34
[14405.186904] [<ffffffff818e2957>] ? ftrace_call+0x5/0x34
[14405.192824] [<ffffffff81129f00>] ? printk_nmi_enter+0x20/0x20
[14405.199337] [<ffffffff8100d155>] ? intel_pmu_handle_irq+0x5/0x4c0
[14405.206235] [<ffffffff810060b5>] ? perf_event_nmi_handler+0x5/0x50
[14405.213231] [<ffffffff810060dc>] perf_event_nmi_handler+0x2c/0x50
[14405.220121] [<ffffffff8103983d>] nmi_handle+0xbd/0x2e0
[14405.225954] [<ffffffff81039785>] ? nmi_handle+0x5/0x2e0
[14405.231875] [<ffffffff81039785>] ? nmi_handle+0x5/0x2e0
[14405.237804] [<ffffffff81039f13>] default_do_nmi+0x53/0x100
[14405.244025] [<ffffffff8103a0df>] do_nmi+0x11f/0x170
[14405.249557] [<ffffffff818e26f1>] end_repeat_nmi+0x1a/0x1e
[14405.255680] [<ffffffff8106fd16>] ? native_write_msr+0x6/0x30
[14405.262093] [<ffffffff8106fd16>] ? native_write_msr+0x6/0x30
[14405.268507] [<ffffffff8106fd16>] ? native_write_msr+0x6/0x30
[14405.274914] <EOE> <IRQ> [<ffffffff810117e4>] ? intel_pmu_pebs_enable_all+0x34/0x40
[14405.283656] [<ffffffff8100cc43>] __intel_pmu_enable_all.constprop.17+0x23/0xa0
[14405.291815] [<ffffffff8100ccd0>] intel_pmu_enable_all+0x10/0x20
[14405.298520] [<ffffffff81007836>] x86_pmu_enable+0x256/0x2e0
[14405.304836] [<ffffffff811e32e7>] perf_pmu_enable.part.86+0x7/0x10
[14405.311736] [<ffffffff811e73be>] perf_mux_hrtimer_handler+0x22e/0x2c0
[14405.319014] [<ffffffff811418bb>] __hrtimer_run_queues+0xfb/0x510
[14405.325808] [<ffffffff811e7190>] ? ctx_resched+0x90/0x90
[14405.331834] [<ffffffff811424bd>] hrtimer_interrupt+0x9d/0x1a0
[14405.338343] [<ffffffff8105d958>] local_apic_timer_interrupt+0x38/0x60
[14405.345629] [<ffffffff818e3b3b>] smp_trace_apic_timer_interrupt+0x5b/0x25f
[14405.353402] [<ffffffff818e2ce6>] trace_apic_timer_interrupt+0x96/0xa0
[14405.360689] <EOI> [<ffffffff8172f414>] ? cpuidle_enter_state+0x124/0x380
[14405.368354] [<ffffffff8172f410>] ? cpuidle_enter_state+0x120/0x380
[14405.375349] [<ffffffff8172f6a7>] cpuidle_enter+0x17/0x20
[14405.381375] [<ffffffff81109013>] call_cpuidle+0x23/0x40
[14405.387303] [<ffffffff811092a0>] cpu_startup_entry+0x160/0x250
[14405.393910] [<ffffffff818d0725>] rest_init+0x135/0x140
[14405.399743] [<ffffffff81facff9>] start_kernel+0x45e/0x47f
[14405.405866] [<ffffffff81fac120>] ? early_idt_handler_array+0x120/0x120
[14405.413250] [<ffffffff81fac2d6>] x86_64_start_reservations+0x2a/0x2c
[14405.420432] [<ffffffff81fac424>] x86_64_start_kernel+0x14c/0x16f
[14405.427224] ---[ end trace 62b08c15aaa2825d ]---
[14405.432378]
[14405.434043] CPU#0: ctrl: 0000000000000000
[14405.439099] CPU#0: status: 0000000000000008
[14405.444157] CPU#0: overflow: 0000000000000000
[14405.449214] CPU#0: fixed: 00000000000000b0
[14405.454271] CPU#0: pebs: 0000000000000000
[14405.459326] CPU#0: debugctl: 0000000000000000
[14405.464383] CPU#0: active: 000000020000000f
[14405.469431] CPU#0: gen-PMC0 ctrl: 0000000001d301b1
[14405.475069] CPU#0: gen-PMC0 count: 0000800090b1c37e
[14405.480706] CPU#0: gen-PMC0 left: 00007fff6fb96d3a
[14405.486344] CPU#0: gen-PMC1 ctrl: 00000000baf733b1
[14405.491981] CPU#0: gen-PMC1 count: 0000800000000009
[14405.497618] CPU#0: gen-PMC1 left: 00007ffffffffff7
[14405.503256] CPU#0: gen-PMC2 ctrl: 0000000000530020
[14405.508894] CPU#0: gen-PMC2 count: 00008000000000e8
[14405.514534] CPU#0: gen-PMC2 left: 00007fffffffff18
[14405.520172] CPU#0: gen-PMC3 ctrl: 00000000004200c0
[14405.525809] CPU#0: gen-PMC3 count: 0000fffffffffffe
[14405.531446] CPU#0: gen-PMC3 left: 0000000000000002
[14405.537085] CPU#0: fixed-PMC0 count: 000080000010c91d
[14405.542722] CPU#0: fixed-PMC1 count: 0000fffc1b31bacf
[14405.548360] CPU#0: fixed-PMC2 count: 000080000318bf99
[14405.554000] core: clearing PMU state on CPU#0
[14405.559598] core: clearing PMU state on CPU#0
On Wed, Dec 14, 2016 at 11:52 PM, Jiri Olsa <[email protected]> wrote:
> On Wed, Dec 14, 2016 at 11:26:49PM -0800, Stephane Eranian wrote:
>> On Wed, Dec 14, 2016 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
>> >
>> > Just spotted this again, ping?
>> >
>> Ok, on what processor running what command, so I can try and reproduce?
>
> for me it's snb_x (model 45) and peter's ivb-ep model 62
>
> after several hours of fuzzer test, log below.. I'll try again with the change
>
Ok, but the problem with the fuzzer is that you have no idea whether
you were using PEBS or non-PEBS, one or multiple events, so it becomes
hard to reproduce.
> jirka
>
>
> ---
> [14404.947844] perfevents: irq loop stuck!
> [14404.952560] ------------[ cut here ]------------
> [14404.957720] WARNING: CPU: 0 PID: 0 at arch/x86/events/intel/core.c:2093 intel_pmu_handle_irq+0x2f8/0x4c0
> [14404.968305] Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_devintf crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel pcspkr ipmi_ssif tpm_tis i2c_i801 tpm_tis_core ipmi_si tpm i2c_smbus ipmi_msghandler cdc_ether usbnet mii shpchp ioatdma wmi lpc_ich xfs libcrc32c mgag200 drm_kms_helper ttm drm igb ptp crc32c_intel pps_core dca i2c_algo_bit megaraid_sas fjes
> [14405.019901] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc8+ #51
> [14405.026985] Hardware name: IBM System x3650 M4 : -[7915E2G]-/00Y7683, BIOS -[VVE124AUS-1.30]- 11/21/2012
> [14405.037568] ffff880277a05b08 ffffffff81463243 ffff880277a05b58 0000000000000000
> [14405.046601] ffff880277a05b48 ffffffff810b698b 0000082d81133a1d 0000000000000064
> [14405.055634] ffff880277a0a380 ffff880276208800 0000000000000040 ffff880277a0a580
> [14405.064665] Call Trace:
> [14405.067394] <NMI> [<ffffffff81463243>] dump_stack+0x86/0xc3
> [14405.073807] [<ffffffff810b698b>] __warn+0xcb/0xf0
> [14405.079156] [<ffffffff810b6a0f>] warn_slowpath_fmt+0x5f/0x80
> [14405.085569] [<ffffffff810b69b5>] ? warn_slowpath_fmt+0x5/0x80
> [14405.092081] [<ffffffff8100d448>] intel_pmu_handle_irq+0x2f8/0x4c0
> [14405.098971] [<ffffffff810060dc>] ? perf_event_nmi_handler+0x2c/0x50
> [14405.106065] [<ffffffff8100d150>] ? intel_pmu_save_and_restart+0x50/0x50
> [14405.113547] [<ffffffff810609f0>] ? nmi_raise_cpu_backtrace+0x20/0x20
> [14405.120737] [<ffffffff811a2555>] ? ftrace_ops_test.isra.23+0x65/0xa0
> [14405.127917] [<ffffffff8147a2ee>] ? bsearch+0x5e/0x90
> [14405.133556] [<ffffffff811a0f50>] ? __add_hash_entry+0x50/0x50
> [14405.140066] [<ffffffff8147a2ee>] ? bsearch+0x5e/0x90
> [14405.145704] [<ffffffff811a0f50>] ? __add_hash_entry+0x50/0x50
> [14405.152214] [<ffffffff810609f0>] ? nmi_raise_cpu_backtrace+0x20/0x20
> [14405.159403] [<ffffffff810609f0>] ? nmi_raise_cpu_backtrace+0x20/0x20
> [14405.166594] [<ffffffff81133a1d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
> [14405.173979] [<ffffffff811a31be>] ? ftrace_ops_list_func+0xce/0x1d0
> [14405.180974] [<ffffffff818e2957>] ? ftrace_call+0x5/0x34
> [14405.186904] [<ffffffff818e2957>] ? ftrace_call+0x5/0x34
> [14405.192824] [<ffffffff81129f00>] ? printk_nmi_enter+0x20/0x20
> [14405.199337] [<ffffffff8100d155>] ? intel_pmu_handle_irq+0x5/0x4c0
> [14405.206235] [<ffffffff810060b5>] ? perf_event_nmi_handler+0x5/0x50
> [14405.213231] [<ffffffff810060dc>] perf_event_nmi_handler+0x2c/0x50
> [14405.220121] [<ffffffff8103983d>] nmi_handle+0xbd/0x2e0
> [14405.225954] [<ffffffff81039785>] ? nmi_handle+0x5/0x2e0
> [14405.231875] [<ffffffff81039785>] ? nmi_handle+0x5/0x2e0
> [14405.237804] [<ffffffff81039f13>] default_do_nmi+0x53/0x100
> [14405.244025] [<ffffffff8103a0df>] do_nmi+0x11f/0x170
> [14405.249557] [<ffffffff818e26f1>] end_repeat_nmi+0x1a/0x1e
> [14405.255680] [<ffffffff8106fd16>] ? native_write_msr+0x6/0x30
> [14405.262093] [<ffffffff8106fd16>] ? native_write_msr+0x6/0x30
> [14405.268507] [<ffffffff8106fd16>] ? native_write_msr+0x6/0x30
> [14405.274914] <EOE> <IRQ> [<ffffffff810117e4>] ? intel_pmu_pebs_enable_all+0x34/0x40
> [14405.283656] [<ffffffff8100cc43>] __intel_pmu_enable_all.constprop.17+0x23/0xa0
> [14405.291815] [<ffffffff8100ccd0>] intel_pmu_enable_all+0x10/0x20
> [14405.298520] [<ffffffff81007836>] x86_pmu_enable+0x256/0x2e0
> [14405.304836] [<ffffffff811e32e7>] perf_pmu_enable.part.86+0x7/0x10
> [14405.311736] [<ffffffff811e73be>] perf_mux_hrtimer_handler+0x22e/0x2c0
> [14405.319014] [<ffffffff811418bb>] __hrtimer_run_queues+0xfb/0x510
> [14405.325808] [<ffffffff811e7190>] ? ctx_resched+0x90/0x90
> [14405.331834] [<ffffffff811424bd>] hrtimer_interrupt+0x9d/0x1a0
> [14405.338343] [<ffffffff8105d958>] local_apic_timer_interrupt+0x38/0x60
> [14405.345629] [<ffffffff818e3b3b>] smp_trace_apic_timer_interrupt+0x5b/0x25f
> [14405.353402] [<ffffffff818e2ce6>] trace_apic_timer_interrupt+0x96/0xa0
> [14405.360689] <EOI> [<ffffffff8172f414>] ? cpuidle_enter_state+0x124/0x380
> [14405.368354] [<ffffffff8172f410>] ? cpuidle_enter_state+0x120/0x380
> [14405.375349] [<ffffffff8172f6a7>] cpuidle_enter+0x17/0x20
> [14405.381375] [<ffffffff81109013>] call_cpuidle+0x23/0x40
> [14405.387303] [<ffffffff811092a0>] cpu_startup_entry+0x160/0x250
> [14405.393910] [<ffffffff818d0725>] rest_init+0x135/0x140
> [14405.399743] [<ffffffff81facff9>] start_kernel+0x45e/0x47f
> [14405.405866] [<ffffffff81fac120>] ? early_idt_handler_array+0x120/0x120
> [14405.413250] [<ffffffff81fac2d6>] x86_64_start_reservations+0x2a/0x2c
> [14405.420432] [<ffffffff81fac424>] x86_64_start_kernel+0x14c/0x16f
> [14405.427224] ---[ end trace 62b08c15aaa2825d ]---
> [14405.432378]
> [14405.434043] CPU#0: ctrl: 0000000000000000
> [14405.439099] CPU#0: status: 0000000000000008
> [14405.444157] CPU#0: overflow: 0000000000000000
> [14405.449214] CPU#0: fixed: 00000000000000b0
> [14405.454271] CPU#0: pebs: 0000000000000000
> [14405.459326] CPU#0: debugctl: 0000000000000000
> [14405.464383] CPU#0: active: 000000020000000f
> [14405.469431] CPU#0: gen-PMC0 ctrl: 0000000001d301b1
> [14405.475069] CPU#0: gen-PMC0 count: 0000800090b1c37e
> [14405.480706] CPU#0: gen-PMC0 left: 00007fff6fb96d3a
> [14405.486344] CPU#0: gen-PMC1 ctrl: 00000000baf733b1
> [14405.491981] CPU#0: gen-PMC1 count: 0000800000000009
> [14405.497618] CPU#0: gen-PMC1 left: 00007ffffffffff7
> [14405.503256] CPU#0: gen-PMC2 ctrl: 0000000000530020
> [14405.508894] CPU#0: gen-PMC2 count: 00008000000000e8
> [14405.514534] CPU#0: gen-PMC2 left: 00007fffffffff18
> [14405.520172] CPU#0: gen-PMC3 ctrl: 00000000004200c0
> [14405.525809] CPU#0: gen-PMC3 count: 0000fffffffffffe
> [14405.531446] CPU#0: gen-PMC3 left: 0000000000000002
> [14405.537085] CPU#0: fixed-PMC0 count: 000080000010c91d
> [14405.542722] CPU#0: fixed-PMC1 count: 0000fffc1b31bacf
> [14405.548360] CPU#0: fixed-PMC2 count: 000080000318bf99
> [14405.554000] core: clearing PMU state on CPU#0
> [14405.559598] core: clearing PMU state on CPU#0
On Wed, Dec 14, 2016 at 11:26:49PM -0800, Stephane Eranian wrote:
> On Wed, Dec 14, 2016 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Just spotted this again, ping?
> >
> Ok, on what processor running what command, so I can try and reproduce?
For me it's more of a correctness issue; I've not actually spotted a
problem as such.
But every time I read this code it makes me wonder.
Supposing the hardware sets the CTRL overflow flags but has not
generated the PEBS record yet (or not enough records to reach the PEBS
buffer threshold), we still don't want to process these events as if
they were !PEBS.
That is, we _never_ want to handle the counters in pebs_enabled as
regular overflows, hence the unconditional masking of those bits.
Hardware _should_ not set them in the first place, but clearly it does
sometimes.
> >> How about we make the clear of pebs_enabled unconditional?
> >>
> >> ---
> >> arch/x86/events/intel/core.c | 20 ++++++++++----------
> >> 1 file changed, 10 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> >> index 68fa55b4d42e..dc9579665425 100644
> >> --- a/arch/x86/events/intel/core.c
> >> +++ b/arch/x86/events/intel/core.c
> >> @@ -1883,6 +1883,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
> >> status &= ~(GLOBAL_STATUS_COND_CHG |
> >> GLOBAL_STATUS_ASIF |
> >> GLOBAL_STATUS_LBRS_FROZEN);
> >> + /*
> >> + * There are cases where, even though, the PEBS ovfl bit is set
> >> + * in GLOBAL_OVF_STATUS, the PEBS events may also have their
> >> + * overflow bits set for their counters. We must clear them
> >> + * here because they have been processed as exact samples in
> >> + * the drain_pebs() routine. They must not be processed again
> >> + * in the for_each_bit_set() loop for regular samples below.
> >> + */
> >> + status &= ~cpuc->pebs_enabled;
> >> +
> >> if (!status)
> >> goto done;
> >>
> >> @@ -1892,16 +1902,6 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
> >> if (__test_and_clear_bit(62, (unsigned long *)&status)) {
> >> handled++;
> >> x86_pmu.drain_pebs(regs);
> >> - /*
> >> - * There are cases where, even though, the PEBS ovfl bit is set
> >> - * in GLOBAL_OVF_STATUS, the PEBS events may also have their
> >> - * overflow bits set for their counters. We must clear them
> >> - * here because they have been processed as exact samples in
> >> - * the drain_pebs() routine. They must not be processed again
> >> - * in the for_each_bit_set() loop for regular samples below.
> >> - */
> >> - status &= ~cpuc->pebs_enabled;
> >> - status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
> >> }
> >>
> >> /*
On Thu, Dec 15, 2016 at 12:42 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Dec 14, 2016 at 11:26:49PM -0800, Stephane Eranian wrote:
>> On Wed, Dec 14, 2016 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
>> >
>> > Just spotted this again, ping?
>> >
>> Ok, on what processor running what command, so I can try and reproduce?
>
> For me its more of a correctness issue, i've not actually spotted a
> problem as such.
>
> But every time I read this code it makes me wonder.
>
> Supposing that the hardware sets the CTRL overflow flags but hasn't
> generated the PEBS record yet (or not enough records to reach the PEBS
> buffer threshold) we still don't want to process these events as if they
> were !PEBS.
>
I am suspicious about the case where you have multiple PEBS events and
they do not quite fire at the same time but close enough that you may have
PEBS in-flight by the time you enter handle_irq.
Last night I ran a simple test on SKL using tip.git:
$ perf record -e
cpu/event=0xd0,umask=0x81/upp,cpu/event=0xc0,umask=1/upp,cpu/event=0xd0,umask=0x81/upp
multichase; perf report -D | fgrep SAMPLE | grep -v 'IP, 0x4' | grep
-v events
Basically, I am looking for samples missing the EXACT tag, i.e., samples
processed as regular events when I only have PEBS events. Over 8h, I got
about 3 or 4 such samples. So there is still a condition where we see
the overflow as regular and not PEBS. We need to examine that code
again, looking for a possible race with a PEBS record in flight and the
PEBS overflow bits not set yet.
> That is, we _never_ want to look at pebs_enabled, hence unconditional
> masking of those bits.
>
> Hardware _should_ not set them in the first place, but clearly it does
> sometimes.
>
>> >> How about we make the clear of pebs_enabled unconditional?
>> >>
>> >> ---
>> >> arch/x86/events/intel/core.c | 20 ++++++++++----------
>> >> 1 file changed, 10 insertions(+), 10 deletions(-)
>> >>
>> >> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> >> index 68fa55b4d42e..dc9579665425 100644
>> >> --- a/arch/x86/events/intel/core.c
>> >> +++ b/arch/x86/events/intel/core.c
>> >> @@ -1883,6 +1883,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
>> >> status &= ~(GLOBAL_STATUS_COND_CHG |
>> >> GLOBAL_STATUS_ASIF |
>> >> GLOBAL_STATUS_LBRS_FROZEN);
>> >> + /*
>> >> + * There are cases where, even though, the PEBS ovfl bit is set
>> >> + * in GLOBAL_OVF_STATUS, the PEBS events may also have their
>> >> + * overflow bits set for their counters. We must clear them
>> >> + * here because they have been processed as exact samples in
>> >> + * the drain_pebs() routine. They must not be processed again
>> >> + * in the for_each_bit_set() loop for regular samples below.
>> >> + */
>> >> + status &= ~cpuc->pebs_enabled;
>> >> +
>> >> if (!status)
>> >> goto done;
>> >>
>> >> @@ -1892,16 +1902,6 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
>> >> if (__test_and_clear_bit(62, (unsigned long *)&status)) {
>> >> handled++;
>> >> x86_pmu.drain_pebs(regs);
>> >> - /*
>> >> - * There are cases where, even though, the PEBS ovfl bit is set
>> >> - * in GLOBAL_OVF_STATUS, the PEBS events may also have their
>> >> - * overflow bits set for their counters. We must clear them
>> >> - * here because they have been processed as exact samples in
>> >> - * the drain_pebs() routine. They must not be processed again
>> >> - * in the for_each_bit_set() loop for regular samples below.
>> >> - */
>> >> - status &= ~cpuc->pebs_enabled;
>> >> - status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
>> >> }
>> >>
>> >> /*
On Thu, Dec 15, 2016 at 08:59:56AM -0800, Stephane Eranian wrote:
> On Thu, Dec 15, 2016 at 12:42 AM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Dec 14, 2016 at 11:26:49PM -0800, Stephane Eranian wrote:
> >> On Wed, Dec 14, 2016 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
> >> >
> >> > Just spotted this again, ping?
> >> >
> >> Ok, on what processor running what command, so I can try and reproduce?
> >
> > For me its more of a correctness issue, i've not actually spotted a
> > problem as such.
> >
> > But every time I read this code it makes me wonder.
> >
> > Supposing that the hardware sets the CTRL overflow flags but hasn't
> > generated the PEBS record yet (or not enough records to reach the PEBS
> > buffer threshold) we still don't want to process these events as if they
> > were !PEBS.
> >
> I am suspicious about the case where you have multiple PEBS events and
> they do not quite fire at the same time but close enough that you may have
> PEBS in-flight by the time you enter handle_irq.
>
> Last night I ran a simple test on SKL using tip.git:
>
> $ perf record --e
> cpu/event=0xd0,umask=0x81/upp,cpu/event=0xc0,umask=1/upp,cpu/event=0xd0,umask=0x81/upp
> multichase; perf report -D | fgrep SAMPLE | grep -v 'IP, 0x4' | grep
> -v events
>
> Basically, looking for samples missing the EXACT tag, i.e., samples
> processed a regular event when I only have PEBS events. Over 8h, I got
> about 3 or 4 such samples. So there is still a condition where we see
> the overflow as regular and not PEBS. So we need to examine that code
> again looking for possible race with PEBS in flight and not having the
> PEBS overflow bits yet.
Isn't that exactly the case I was talking about? And wouldn't it be
avoided by the proposed patch?
So semantically the counter overflows and then arms PEBS to record a
record on the next event once it's armed (and this can be multiple events
after the overflow, since arming takes a while too).
Now, if the chip manages to raise the regular overflow bit during that
time, you get exactly what is described.
Meaning we should unconditionally clear the pebs_enabled bits.
On Thu, Dec 15, 2016 at 9:10 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Dec 15, 2016 at 08:59:56AM -0800, Stephane Eranian wrote:
>> On Thu, Dec 15, 2016 at 12:42 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Dec 14, 2016 at 11:26:49PM -0800, Stephane Eranian wrote:
>> >> On Wed, Dec 14, 2016 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
>> >> >
>> >> > Just spotted this again, ping?
>> >> >
>> >> Ok, on what processor running what command, so I can try and reproduce?
>> >
>> > For me its more of a correctness issue, i've not actually spotted a
>> > problem as such.
>> >
>> > But every time I read this code it makes me wonder.
>> >
>> > Supposing that the hardware sets the CTRL overflow flags but hasn't
>> > generated the PEBS record yet (or not enough records to reach the PEBS
>> > buffer threshold) we still don't want to process these events as if they
>> > were !PEBS.
>> >
>> I am suspicious about the case where you have multiple PEBS events and
>> they do not quite fire at the same time but close enough that you may have
>> PEBS in-flight by the time you enter handle_irq.
>>
>> Last night I ran a simple test on SKL using tip.git:
>>
>> $ perf record --e
>> cpu/event=0xd0,umask=0x81/upp,cpu/event=0xc0,umask=1/upp,cpu/event=0xd0,umask=0x81/upp
>> multichase; perf report -D | fgrep SAMPLE | grep -v 'IP, 0x4' | grep
>> -v events
>>
>> Basically, looking for samples missing the EXACT tag, i.e., samples
>> processed a regular event when I only have PEBS events. Over 8h, I got
>> about 3 or 4 such samples. So there is still a condition where we see
>> the overflow as regular and not PEBS. So we need to examine that code
>> again looking for possible race with PEBS in flight and not having the
>> PEBS overflow bits yet.
>
> Isn't that exactly the case I was talking about? and would be avoided by
> the proposed patch?
>
>
> So semantically the counter overflows and then arms PEBS to record a
> record on the next event once its armed (and this can be multiple events
> after the overflow, since arming takes a while too).
>
>
> Now, if the chip manages to raise the regular overflow bit during that
> time, you get exactly what is described.
>
> meaning we should unconditionally clear the pebs_enabled.
If we unconditionally clear the pebs_enabled counters from status, then
we guarantee these counters will never be processed as regular
overflowed counters. I am okay with this. I am testing it right now.
But the part I still don't get is why the status bitmask shows some
PEBS counters set when the counters are explicitly programmed with
their PMI bit cleared. I need to check whether somehow the PMI bit
gets set.
Hi,
Some more updates on this problem. I believe I understand how this can
happen based on my observations and what the SDM describes.
The problem appears only if you are sampling on multiple PEBS events.
The SDM says:
"When a performance counter is configured for PEBS, overflow condition
in the counter generates a performance-monitoring interrupt signaling
a PEBS event. On a PEBS event, the processor stores data records into
the buffer area (see Section 18.15.5), clears the counter overflow
status, and sets the "OvfBuffer" bit in IA32_PERF_GLOBAL_STATUS."
PEBS is a two-step mechanism. When the PEBS-enabled counter overflows,
PEBS arms and the next qualifying instruction that retires is
captured. On the overflow the GLOBAL_STATUS bit is set. It remains set
until the PEBS record is finalized.
But there can be a long delay between the overflow and when a
qualifying instruction is actually captured. This is especially true
when you run PEBS on rare events, or when you only measure in kernel
or user mode and you transition to the other priv level right after
the overflow. Therefore there is a race condition in there, and this
is what we are observing.
When sampling on multiple PEBS events, one event can generate the PMI
when its PEBS record is finalized, but by the time we reach the
handle_irq() handler, another PEBS event has overflowed whose PEBS
record is not yet finalized, and therefore we observe its bit set in
GLOBAL_STATUS.
It is not clear to me why PEBS does not clear the overflow bit
immediately after the overflow at the beginning of the PEBS arming
code.
Given these observations, I believe that unconditionally clearing the
PEBS_ENABLED bits from GLOBAL_STATUS is the solution. Overflows of the
PEBS-enabled counters should always be processed via bit 62 (OvfBuffer)
and never via their individual counter overflow bits.
I have tried that last night and the problem goes away when measuring
multiple PEBS events simultaneously.
On Fri, Dec 16, 2016 at 12:38 AM, Stephane Eranian <[email protected]> wrote:
> On Thu, Dec 15, 2016 at 9:10 AM, Peter Zijlstra <[email protected]> wrote:
>> On Thu, Dec 15, 2016 at 08:59:56AM -0800, Stephane Eranian wrote:
>>> On Thu, Dec 15, 2016 at 12:42 AM, Peter Zijlstra <[email protected]> wrote:
>>> > On Wed, Dec 14, 2016 at 11:26:49PM -0800, Stephane Eranian wrote:
>>> >> On Wed, Dec 14, 2016 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
>>> >> >
>>> >> > Just spotted this again, ping?
>>> >> >
>>> >> Ok, on what processor running what command, so I can try and reproduce?
>>> >
>>> > For me its more of a correctness issue, i've not actually spotted a
>>> > problem as such.
>>> >
>>> > But every time I read this code it makes me wonder.
>>> >
>>> > Supposing that the hardware sets the CTRL overflow flags but hasn't
>>> > generated the PEBS record yet (or not enough records to reach the PEBS
>>> > buffer threshold) we still don't want to process these events as if they
>>> > were !PEBS.
>>> >
>>> I am suspicious about the case where you have multiple PEBS events and
>>> they do not quite fire at the same time but close enough that you may have
>>> PEBS in-flight by the time you enter handle_irq.
>>>
>>> Last night I ran a simple test on SKL using tip.git:
>>>
>>> $ perf record --e
>>> cpu/event=0xd0,umask=0x81/upp,cpu/event=0xc0,umask=1/upp,cpu/event=0xd0,umask=0x81/upp
>>> multichase; perf report -D | fgrep SAMPLE | grep -v 'IP, 0x4' | grep
>>> -v events
>>>
>>> Basically, looking for samples missing the EXACT tag, i.e., samples
>>> processed a regular event when I only have PEBS events. Over 8h, I got
>>> about 3 or 4 such samples. So there is still a condition where we see
>>> the overflow as regular and not PEBS. So we need to examine that code
>>> again looking for possible race with PEBS in flight and not having the
>>> PEBS overflow bits yet.
>>
>> Isn't that exactly the case I was talking about? and would be avoided by
>> the proposed patch?
>>
>>
>> So semantically the counter overflows and then arms PEBS to record a
>> record on the next event once its armed (and this can be multiple events
>> after the overflow, since arming takes a while too).
>>
>>
>> Now, if the chip manages to raise the regular overflow bit during that
>> time, you get exactly what is described.
>>
>> meaning we should unconditionally clear the pebs_enabled.
>
> If we unconditionally clear the pebs_enabled counter from status, then
> we guarantee
> these counters will never be processed as regular overflowed counters.
> I am okay with this.
> I am testing it right now.
>
> But the part I still don't get is why is the status bitmask showing
> some pebs counters set when
> the counter are explicitly program with their PMI bit cleared. I need
> to check whether somehow
> the PMI bit gets set.