LinuxLists.cc - Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes

2014-01-01 18:38:07

Subject: Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes

On 12/26/13, 8:30 AM, Frederic Weisbecker wrote:
> On Thu, Dec 26, 2013 at 10:24:03AM -0500, David Ahern wrote:
>> On 12/26/13, 10:14 AM, Frederic Weisbecker wrote:
>>>> I was carrying that patch while working on perf-kvm-stat-live last
>>>> Fall. It does not solve the problem for live commands, so ended up
>>>> dropping it and going with local (to the command) hacks. I still
>>>> think for live commands getting a perf_clock timestamp at the start
>>>> of a round and using that as the flush time will work best.

For perf-kvm-stat-live using perf_clock value at the start of the round
as the flush time works beautifully:

https://github.com/dsahern/linux/commit/ba8b7b63d5dbdc95aedbbafa670c2232e0cc07a2

Never once failed with "Warning: Timestamp below last timeslice flush"
error.

David

>>>
>>> Ok, but how would you fetch this perf clock timestamp, with an explicit read?
>>
>> One option is exporting perf_clock to userspace.
>>
>> Preferably a built-in option, https://lkml.org/lkml/2013/12/10/781,
>> but I can go the module route if I have to:
>> https://github.com/dsahern/linux/blob/perf-full-monty/README.ahern
>
> I see, that's one of the very verbose threads I've seen lately ;)
>

2014-01-03 22:07:52

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes

On Wed, Jan 01, 2014 at 11:37:55AM -0700, David Ahern wrote:
> On 12/26/13, 8:30 AM, Frederic Weisbecker wrote:
> >On Thu, Dec 26, 2013 at 10:24:03AM -0500, David Ahern wrote:
> >>On 12/26/13, 10:14 AM, Frederic Weisbecker wrote:
> >>>>I was carrying that patch while working on perf-kvm-stat-live last
> >>>>Fall. It does not solve the problem for live commands, so ended up
> >>>>dropping it and going with local (to the command) hacks. I still
> >>>>think for live commands getting a perf_clock timestamp at the start
> >>>>of a round and using that as the flush time will work best.
>
> For perf-kvm-stat-live using perf_clock value at the start of the
> round as the flush time works beautifully:
>
> https://github.com/dsahern/linux/commit/ba8b7b63d5dbdc95aedbbafa670c2232e0cc07a2
>
> Never once failed with "Warning: Timestamp below last timeslice
> flush" error.

I'm not sure I understand why we need that. Why doesn't it work by simply flushing
events prior to the earliest timestamp among every CPUs last event?

I can see one remaining issue when an event interrupts another in a CPU. If the
interrupt happens after perf_prepare_sample() -> perf_clock() and perf_output_begin(),
we may have locally non-monotonic timestamps in a CPU buffer.

That could be solved with a heuristic similar to yours: flush events prior a few millisecs
before the barrier since interrupt are supposed to be short. Or we could move the perf_clock()
event snapshot to perf_output_sample() to make sure that the event space is reserved before
we get the timestamp, thus the interrupting events having superior timestamps are guaranteed
to be past the interrupted event in the stream.

2014-01-03 22:45:42

by David Ahern

[permalink] [raw]

Subject: Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes

On 1/3/14, 3:07 PM, Frederic Weisbecker wrote:
> On Wed, Jan 01, 2014 at 11:37:55AM -0700, David Ahern wrote:
>> On 12/26/13, 8:30 AM, Frederic Weisbecker wrote:
>>> On Thu, Dec 26, 2013 at 10:24:03AM -0500, David Ahern wrote:
>>>> On 12/26/13, 10:14 AM, Frederic Weisbecker wrote:
>>>>>> I was carrying that patch while working on perf-kvm-stat-live last
>>>>>> Fall. It does not solve the problem for live commands, so ended up
>>>>>> dropping it and going with local (to the command) hacks. I still
>>>>>> think for live commands getting a perf_clock timestamp at the start
>>>>>> of a round and using that as the flush time will work best.
>>
>> For perf-kvm-stat-live using perf_clock value at the start of the
>> round as the flush time works beautifully:
>>
>> https://github.com/dsahern/linux/commit/ba8b7b63d5dbdc95aedbbafa670c2232e0cc07a2
>>
>> Never once failed with "Warning: Timestamp below last timeslice
>> flush" error.
>
> I'm not sure I understand why we need that. Why doesn't it work by simply flushing
> events prior to the earliest timestamp among every CPUs last event?

Here's one scenario. Consider N-mmaps:

|----- t_flush
v
0 -----|---x------------------------
1 -----|----|------------------------
... |
N -----|-------ssss-|-----------------

t_start t_1 ... t_N

You start a round at some time -- t_start. By starting a round it means
you go to mmap 0 and check for events, then mmap 1, ..., mmap N. It
takes a finite amount of time to move from one mmap to another.

Assume there are no events on mmap 0, 1, ... N-1 but samples are
generated in mmap N. In the time it takes to move forward from 0 to N, a
sample can be generated for mmap 0 and written to the buffer - the 'x'
above. It now contains a timestamp < than samples on any other mmap and
out pops the flush error.

perf-kvm can have over 650,000 events per second and those tend to come
in bunches on a single mmap. So even if you go for a "max of the min
times across mmaps" it is often wrong.

The non-perf_clock logic in perf-kvm uses the min time across all mmaps
and even it occasionally fails with the flush error.

David

>
> I can see one remaining issue when an event interrupts another in a CPU. If the
> interrupt happens after perf_prepare_sample() -> perf_clock() and perf_output_begin(),
> we may have locally non-monotonic timestamps in a CPU buffer.
>
> That could be solved with a heuristic similar to yours: flush events prior a few millisecs
> before the barrier since interrupt are supposed to be short. Or we could move the perf_clock()
> event snapshot to perf_output_sample() to make sure that the event space is reserved before
> we get the timestamp, thus the interrupting events having superior timestamps are guaranteed
> to be past the interrupted event in the stream.
>

2014-01-04 15:05:52

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes

On Fri, Jan 03, 2014 at 03:45:36PM -0700, David Ahern wrote:
> On 1/3/14, 3:07 PM, Frederic Weisbecker wrote:
> >I'm not sure I understand why we need that. Why doesn't it work by simply flushing
> >events prior to the earliest timestamp among every CPUs last event?
>
> Here's one scenario. Consider N-mmaps:
>
> |----- t_flush
> v
> 0 -----|---x------------------------
> 1 -----|----|------------------------
> ... |
> N -----|-------ssss-|-----------------
>
> t_start t_1 ... t_N
>
> You start a round at some time -- t_start. By starting a round it
> means you go to mmap 0 and check for events, then mmap 1, ..., mmap
> N. It takes a finite amount of time to move from one mmap to
> another.
>
> Assume there are no events on mmap 0, 1, ... N-1 but samples are
> generated in mmap N. In the time it takes to move forward from 0 to
> N, a sample can be generated for mmap 0 and written to the buffer -
> the 'x' above. It now contains a timestamp < than samples on any
> other mmap and out pops the flush error.

Lets reformulate as following. I'm copy-pasting the example in session.c
but adapting it to your scenario.

* ============ PASS n =================
* CPU 0 | CPU 1
* |
* - | 2
* - | 3
* - | 4 <--- max recorded
*
* ============ PASS n + 1 ==============
* CPU 0 | CPU 1
* |
* 1 | 5
* - | 6
* - | 7 <---- max recorded
*
* Flush every events below timestamp 4
*

So in the first round, CPU 0 has no event by the time we read it. Then while
we read the events from CPU 1 (val 2,3,4), a new one comes in concurrently
in CPU 0 (val 1, which matches 's' in your example ). We missed it due to the
linear mmap read on all buffers so we'll get it on the second round.

We find it out in the second round, CPU 1 has also new events. At this time we know that
if CPU 0 had events up to timestamp 4, we should have seen all of them because
we read CPU 0 buffer in PASS n + 1 after we read CPU 1 buffer on PASS n.

Of course that's what happens in a perfect world with the assumption that ordering
is all correct, that events write/commit doesn't take too much time to complete,
that perf_clock() is globally monotonic (and it's not IIUC). But a little heuristical
correction on the timestamp barrier should prevent from issues with that.

So this is how the code behaves currently and it should handle a case like above.
Now there is still the problem of:

1) local timestamps not moving forward (could it happen when events happen in storm,
when they overflow multiple times in once for example, and clock is not granular
enough?)
Anyway this should be solved with the patch that takes the earliest last event on all
CPU buffer instead of the maximum of a round as a barrier.

2) local timestamps not monotonic due to interrupting events. This could be fixed
in the kernel with moving perf_clock() snapshot in perf_output_sample().

2014-01-08 21:48:46

by David Ahern

[permalink] [raw]

Subject: Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes

On 1/4/14, 8:05 AM, Frederic Weisbecker wrote:
> On Fri, Jan 03, 2014 at 03:45:36PM -0700, David Ahern wrote:
>> On 1/3/14, 3:07 PM, Frederic Weisbecker wrote:
>>> I'm not sure I understand why we need that. Why doesn't it work by simply flushing
>>> events prior to the earliest timestamp among every CPUs last event?
>>
>> Here's one scenario. Consider N-mmaps:
>>
>> |----- t_flush
>> v
>> 0 -----|---x------------------------
>> 1 -----|----|------------------------
>> ... |
>> N -----|-------ssss-|-----------------
>>
>> t_start t_1 ... t_N
>>
>> You start a round at some time -- t_start. By starting a round it
>> means you go to mmap 0 and check for events, then mmap 1, ..., mmap
>> N. It takes a finite amount of time to move from one mmap to
>> another.
>>
>> Assume there are no events on mmap 0, 1, ... N-1 but samples are
>> generated in mmap N. In the time it takes to move forward from 0 to
>> N, a sample can be generated for mmap 0 and written to the buffer -
>> the 'x' above. It now contains a timestamp < than samples on any
>> other mmap and out pops the flush error.
>
> Lets reformulate as following. I'm copy-pasting the example in session.c
> but adapting it to your scenario.
>
> * ============ PASS n =================
> * CPU 0 | CPU 1
> * |
> * - | 2
> * - | 3
> * - | 4 <--- max recorded
> *
> * ============ PASS n + 1 ==============
> * CPU 0 | CPU 1
> * |
> * 1 | 5
> * - | 6
> * - | 7 <---- max recorded
> *
> * Flush every events below timestamp 4
> *
>
> So in the first round, CPU 0 has no event by the time we read it. Then while
> we read the events from CPU 1 (val 2,3,4), a new one comes in concurrently
> in CPU 0 (val 1, which matches 's' in your example ). We missed it due to the
> linear mmap read on all buffers so we'll get it on the second round.
>
> We find it out in the second round, CPU 1 has also new events. At this time we know that
> if CPU 0 had events up to timestamp 4, we should have seen all of them because
> we read CPU 0 buffer in PASS n + 1 after we read CPU 1 buffer on PASS n.
>
> Of course that's what happens in a perfect world with the assumption that ordering
> is all correct, that events write/commit doesn't take too much time to complete,
> that perf_clock() is globally monotonic (and it's not IIUC). But a little heuristical
> correction on the timestamp barrier should prevent from issues with that.
>
> So this is how the code behaves currently and it should handle a case like above.

The existing code does not work. Your unstable tsc patch did not work. I
have not tried Joseph's patch. Are you proposing that one or do you have
something else in mind?

> Now there is still the problem of:
>
> 1) local timestamps not moving forward (could it happen when events happen in storm,
> when they overflow multiple times in once for example, and clock is not granular
> enough?)

Even at 650k events/sec I am not seeing this problem.

> Anyway this should be solved with the patch that takes the earliest last event on all
> CPU buffer instead of the maximum of a round as a barrier.
>
> 2) local timestamps not monotonic due to interrupting events. This could be fixed
> in the kernel with moving perf_clock() snapshot in perf_output_sample().
>

For perf-kvm the events are all tracepoints, so there should not be a
problem of overlap due to interruption.

David

2014-01-09 15:20:08

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes

On Wed, Jan 08, 2014 at 02:48:37PM -0700, David Ahern wrote:
> The existing code does not work. Your unstable tsc patch did not
> work. I have not tried Joseph's patch. Are you proposing that one or
> do you have something else in mind?

I think we should integrate Joseph's patch (or mine, or some mixup, I mean
they do about the same IIRC) as it solves known and understood bugs in any case.

Then we need to check what is the real issue in your case.

>
> >Now there is still the problem of:
> >
> >1) local timestamps not moving forward (could it happen when events happen in storm,
> >when they overflow multiple times in once for example, and clock is not granular
> >enough?)
>
> Even at 650k events/sec I am not seeing this problem.

Yeah it happens mostly when a single event, supposed to overflow on period of 1, trigger
with a higher period. This is the case of sched stat runtime tracepoints for example
because it is a weighted tracepoint (see perf_count). So it demux into gazillions of
events all having very close timestamps. But normal tracepoints shouldn't have this problem.

>
> >Anyway this should be solved with the patch that takes the earliest last event on all
> >CPU buffer instead of the maximum of a round as a barrier.
> >
> >2) local timestamps not monotonic due to interrupting events. This could be fixed
> >in the kernel with moving perf_clock() snapshot in perf_output_sample().
> >
>
> For perf-kvm the events are all tracepoints, so there should not be
> a problem of overlap due to interruption.

Nope, I'm curious what kind of issue happens with kvm events. Could you send me a perf.data
that has this ordering problem?

Thanks.

2014-01-12 15:47:06

by David Ahern

[permalink] [raw]

Subject: Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps in Passes

On 1/9/14, 8:19 AM, Frederic Weisbecker wrote:
> On Wed, Jan 08, 2014 at 02:48:37PM -0700, David Ahern wrote:
>> The existing code does not work. Your unstable tsc patch did not
>> work. I have not tried Joseph's patch. Are you proposing that one or
>> do you have something else in mind?
>
> I think we should integrate Joseph's patch (or mine, or some mixup, I mean
> they do about the same IIRC) as it solves known and understood bugs in any case.
>
> Then we need to check what is the real issue in your case.
>
>>
>>> Now there is still the problem of:
>>>
>>> 1) local timestamps not moving forward (could it happen when events happen in storm,
>>> when they overflow multiple times in once for example, and clock is not granular
>>> enough?)
>>
>> Even at 650k events/sec I am not seeing this problem.
>
> Yeah it happens mostly when a single event, supposed to overflow on period of 1, trigger
> with a higher period. This is the case of sched stat runtime tracepoints for example
> because it is a weighted tracepoint (see perf_count). So it demux into gazillions of
> events all having very close timestamps. But normal tracepoints shouldn't have this problem.
>
>>
>>> Anyway this should be solved with the patch that takes the earliest last event on all
>>> CPU buffer instead of the maximum of a round as a barrier.
>>>
>>> 2) local timestamps not monotonic due to interrupting events. This could be fixed
>>> in the kernel with moving perf_clock() snapshot in perf_output_sample().
>>>
>>
>> For perf-kvm the events are all tracepoints, so there should not be
>> a problem of overlap due to interruption.
>
> Nope, I'm curious what kind of issue happens with kvm events. Could you send me a perf.data
> that has this ordering problem?

There's no data file in this case; it's perf kvm stat live. For this use
case the flush code in session.c causes the 'timestamp below last flush
error' really quick. That's what led me down the road of tweaking the
flush time in builtin-kvm.c. With the existing tweaks perf-kvm-stat-live
runs for a lot longer between timeslice failures. With my perf_clock
based flush time I have yet to the 'Timestamp below last timeslice
flush' error.

David