2012-10-16 10:13:45

by Stephane Eranian

[permalink] [raw]
Subject: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

Hi,

There are many situations where we want to correlate events happening at
the user level with samples recorded in the perf_event kernel sampling buffer.
For instance, we might want to correlate the call to a function or creation of
a file with samples. Similarly, when we want to monitor a JVM with jitted code,
we need to be able to correlate jitted code mappings with perf event samples
for symbolization.

Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
That causes each PERF_RECORD_SAMPLE to include a timestamp
generated by calling the local_clock() -> sched_clock_cpu() function.

To make correlating user vs. kernel samples easy, we would need
access to that sched_clock() functionality. However, none of the
existing clock calls permit this at this point. They all return
timestamps that do not use the same source and/or offset as
sched_clock.

I believe a similar issue exists with the ftrace subsystem.

The problem needs to be addressed in a portable manner. Solutions
based on having user-level code read the TSC to reconstruct
sched_clock() don't seem appropriate to me.

One possibility to address this limitation would be to extend
clock_gettime() with a new clockid, e.g., CLOCK_PERF.

However, I understand that sched_clock_cpu() provides ordering guarantees only
when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
But we already have to deal with this problem when merging samples
obtained from different CPU sampling buffers in per-thread mode. So
this is not necessarily a showstopper.

Alternatives could be to use uprobes, but that's less practical to set up.

Anyone with better ideas?


2012-10-16 17:24:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
> Hi,
>
> There are many situations where we want to correlate events happening at
> the user level with samples recorded in the perf_event kernel sampling buffer.
> For instance, we might want to correlate the call to a function or creation of
> a file with samples. Similarly, when we want to monitor a JVM with jitted code,
> we need to be able to correlate jitted code mappings with perf event samples
> for symbolization.
>
> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
> That causes each PERF_RECORD_SAMPLE to include a timestamp
> generated by calling the local_clock() -> sched_clock_cpu() function.
>
> To make correlating user vs. kernel samples easy, we would need to
> access that sched_clock() functionality. However, none of the existing
> clock calls permit this at this point. They all return timestamps which are
> not using the same source and/or offset as sched_clock.
>
> I believe a similar issue exists with the ftrace subsystem.
>
> The problem needs to be addressed in a portable manner. Solutions
> based on reading TSC for the user level to reconstruct sched_clock()
> don't seem appropriate to me.
>
> One possibility to address this limitation would be to extend clock_gettime()
> with a new clock time, e.g., CLOCK_PERF.
>
> However, I understand that sched_clock_cpu() provides ordering guarantees only
> when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
> But we already have to deal with this problem when merging samples obtained
> from different CPU sampling buffer in per-thread mode. So this is not
> necessarily
> a showstopper.
>
> Alternatives could be to use uprobes but that's less practical to setup.
>
> Anyone with better ideas?

You forgot to CC the time people ;-)

I've no problem with adding CLOCK_PERF (or another/better name).

Thomas, John?

2012-10-18 19:33:36

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Tue, Oct 16, 2012 at 7:23 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>> Hi,
>>
>> There are many situations where we want to correlate events happening at
>> the user level with samples recorded in the perf_event kernel sampling buffer.
>> For instance, we might want to correlate the call to a function or creation of
>> a file with samples. Similarly, when we want to monitor a JVM with jitted code,
>> we need to be able to correlate jitted code mappings with perf event samples
>> for symbolization.
>>
>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>
>> To make correlating user vs. kernel samples easy, we would need to
>> access that sched_clock() functionality. However, none of the existing
>> clock calls permit this at this point. They all return timestamps which are
>> not using the same source and/or offset as sched_clock.
>>
>> I believe a similar issue exists with the ftrace subsystem.
>>
>> The problem needs to be addressed in a portable manner. Solutions
>> based on reading TSC for the user level to reconstruct sched_clock()
>> don't seem appropriate to me.
>>
>> One possibility to address this limitation would be to extend clock_gettime()
>> with a new clock time, e.g., CLOCK_PERF.
>>
>> However, I understand that sched_clock_cpu() provides ordering guarantees only
>> when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
>> But we already have to deal with this problem when merging samples obtained
>> from different CPU sampling buffer in per-thread mode. So this is not
>> necessarily
>> a showstopper.
>>
>> Alternatives could be to use uprobes but that's less practical to setup.
>>
>> Anyone with better ideas?
>
> You forgot to CC the time people ;-)
>
I did not know where they were.

> I've no problem with adding CLOCK_PERF (or another/better name).
>
Ok, good.

> Thomas, John?
>
Any comment?

2012-11-10 02:05:12

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>> Hi,
>>
>> There are many situations where we want to correlate events happening at
>> the user level with samples recorded in the perf_event kernel sampling buffer.
>> For instance, we might want to correlate the call to a function or creation of
>> a file with samples. Similarly, when we want to monitor a JVM with jitted code,
>> we need to be able to correlate jitted code mappings with perf event samples
>> for symbolization.
>>
>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>
>> To make correlating user vs. kernel samples easy, we would need to
>> access that sched_clock() functionality. However, none of the existing
>> clock calls permit this at this point. They all return timestamps which are
>> not using the same source and/or offset as sched_clock.
>>
>> I believe a similar issue exists with the ftrace subsystem.
>>
>> The problem needs to be addressed in a portable manner. Solutions
>> based on reading TSC for the user level to reconstruct sched_clock()
>> don't seem appropriate to me.
>>
>> One possibility to address this limitation would be to extend clock_gettime()
>> with a new clock time, e.g., CLOCK_PERF.
>>
>> However, I understand that sched_clock_cpu() provides ordering guarantees only
>> when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
>> But we already have to deal with this problem when merging samples obtained
>> from different CPU sampling buffer in per-thread mode. So this is not
>> necessarily
>> a showstopper.
>>
>> Alternatives could be to use uprobes but that's less practical to setup.
>>
>> Anyone with better ideas?
> You forgot to CC the time people ;-)
>
> I've no problem with adding CLOCK_PERF (or another/better name).
Hrm. I'm not excited about exporting that sort of internal kernel
detail to userland.

The behavior and expectations of sched_clock() have changed over the
years, so I'm not sure it's wise to export it, since we'd have to
preserve its behavior from then on.

Also I worry that it will be abused in the same way that direct TSC
access is, where the seemingly better performance, compared with the
more careful/correct CLOCK_MONOTONIC, would cause developers to write
fragile userland code that will break when moved from one machine to
the next.

I'd probably rather have perf output timestamps to userland using sane
clocks (CLOCK_MONOTONIC) than try to introduce a new time domain to
userland. But I could probably be convinced I'm wrong.

thanks
-john

2012-11-11 20:32:47

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <[email protected]> wrote:
> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>
>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>
>>> Hi,
>>>
>>> There are many situations where we want to correlate events happening at
>>> the user level with samples recorded in the perf_event kernel sampling
>>> buffer.
>>> For instance, we might want to correlate the call to a function or
>>> creation of
>>> a file with samples. Similarly, when we want to monitor a JVM with jitted
>>> code,
>>> we need to be able to correlate jitted code mappings with perf event
>>> samples
>>> for symbolization.
>>>
>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>>
>>> To make correlating user vs. kernel samples easy, we would need to
>>> access that sched_clock() functionality. However, none of the existing
>>> clock calls permit this at this point. They all return timestamps which
>>> are
>>> not using the same source and/or offset as sched_clock.
>>>
>>> I believe a similar issue exists with the ftrace subsystem.
>>>
>>> The problem needs to be addressed in a portable manner. Solutions
>>> based on reading TSC for the user level to reconstruct sched_clock()
>>> don't seem appropriate to me.
>>>
>>> One possibility to address this limitation would be to extend
>>> clock_gettime()
>>> with a new clock time, e.g., CLOCK_PERF.
>>>
>>> However, I understand that sched_clock_cpu() provides ordering guarantees
>>> only
>>> when invoked on the same CPU repeatedly, i.e., it's not globally
>>> synchronized.
>>> But we already have to deal with this problem when merging samples
>>> obtained
>>> from different CPU sampling buffer in per-thread mode. So this is not
>>> necessarily
>>> a showstopper.
>>>
>>> Alternatives could be to use uprobes but that's less practical to setup.
>>>
>>> Anyone with better ideas?
>>
>> You forgot to CC the time people ;-)
>>
>> I've no problem with adding CLOCK_PERF (or another/better name).
>
> Hrm. I'm not excited about exporting that sort of internal kernel details to
> userland.
>
> The behavior and expectations from sched_clock() has changed over the years,
> so I'm not sure its wise to export it, since we'd have to preserve its
> behavior from then on.
>
It's not just about exposing sched_clock(). We need to expose a time
source that is exactly equivalent to what perf_events uses internally.
If sched_clock() changes, then the perf_event clock will change too,
and so would that new time source for clock_gettime(). As long as
everything remains consistent, we are good.

> Also I worry that it will be abused in the same way that direct TSC access
> is, where the seemingly better performance from the more careful/correct
> CLOCK_MONOTONIC would cause developers to write fragile userland code that
> will break when moved from one machine to the next.
>
The only goal for this new time source is correlating user-level
samples with kernel-level samples, i.e., application-level events with
a PMU counter overflow, for instance. Anybody trying anything else
would be on their own.

clock_gettime(CLOCK_PERF): guaranteed to return the same time source
as that used by the perf_event subsystem to timestamp samples when
PERF_SAMPLE_TIME is requested in attr->sample_type.
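Given that guarantee, the offline correlation step becomes a plain merge by timestamp. A minimal sketch (the helper name is invented; both streams are assumed sorted by timestamp):

```c
#include <stddef.h>
#include <stdint.h>

/* Offline correlation sketch: given kernel samples and a user-level
 * event, all stamped from the same time source, find the index of the
 * first kernel sample at or after the event. Binary search, O(log n). */
static size_t first_sample_at_or_after(const uint64_t *samples, size_t n,
                                       uint64_t event_ts)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (samples[mid] < event_ts)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;  /* == n if every sample predates the event */
}
```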

> I'd probably rather perf output timestamps to userland using sane clocks
> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
> userland. But I probably could be convinced I'm wrong.
>
Can you get CLOCK_MONOTONIC efficiently, and in ALL circumstances,
without grabbing any locks? Because it would need to run from NMI
context.
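To illustrate the concern (a simplified model, not the actual timekeeping code): a seqcount-style CLOCK_MONOTONIC read spins until the writer finishes its update, which is exactly what code running in NMI context must never do:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simplified seqcount reader. The updater bumps seq to an odd value
 * before writing and back to even afterwards. If an NMI interrupts
 * the updater on the same CPU, a reader inside the NMI handler would
 * spin on the odd value forever -- hence perf's use of the lockless
 * sched_clock path for its timestamps. */
struct mono_state {
    atomic_uint seq;    /* odd while an update is in progress */
    uint64_t base_ns;   /* last computed monotonic time */
};

static uint64_t read_mono_sketch(struct mono_state *st)
{
    unsigned int s;
    uint64_t ns;

    do {
        while ((s = atomic_load(&st->seq)) & 1u)
            ;           /* waiting on the writer: unsafe from NMI */
        ns = st->base_ns;
    } while (atomic_load(&st->seq) != s);  /* retry if writer raced us */
    return ns;
}
```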

2012-11-12 18:54:10

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 11/11/2012 12:32 PM, Stephane Eranian wrote:
> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <[email protected]> wrote:
>> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>> Hi,
>>>>
>>>> There are many situations where we want to correlate events happening at
>>>> the user level with samples recorded in the perf_event kernel sampling
>>>> buffer.
>>>> For instance, we might want to correlate the call to a function or
>>>> creation of
>>>> a file with samples. Similarly, when we want to monitor a JVM with jitted
>>>> code,
>>>> we need to be able to correlate jitted code mappings with perf event
>>>> samples
>>>> for symbolization.
>>>>
>>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>>>
>>>> To make correlating user vs. kernel samples easy, we would need to
>>>> access that sched_clock() functionality. However, none of the existing
>>>> clock calls permit this at this point. They all return timestamps which
>>>> are
>>>> not using the same source and/or offset as sched_clock.
>>>>
>>>> I believe a similar issue exists with the ftrace subsystem.
>>>>
>>>> The problem needs to be addressed in a portable manner. Solutions
>>>> based on reading TSC for the user level to reconstruct sched_clock()
>>>> don't seem appropriate to me.
>>>>
>>>> One possibility to address this limitation would be to extend
>>>> clock_gettime()
>>>> with a new clock time, e.g., CLOCK_PERF.
>>>>
>>>> However, I understand that sched_clock_cpu() provides ordering guarantees
>>>> only
>>>> when invoked on the same CPU repeatedly, i.e., it's not globally
>>>> synchronized.
>>>> But we already have to deal with this problem when merging samples
>>>> obtained
>>>> from different CPU sampling buffer in per-thread mode. So this is not
>>>> necessarily
>>>> a showstopper.
>>>>
>>>> Alternatives could be to use uprobes but that's less practical to setup.
>>>>
>>>> Anyone with better ideas?
>>> You forgot to CC the time people ;-)
>>>
>>> I've no problem with adding CLOCK_PERF (or another/better name).
>> Hrm. I'm not excited about exporting that sort of internal kernel details to
>> userland.
>>
>> The behavior and expectations from sched_clock() has changed over the years,
>> so I'm not sure its wise to export it, since we'd have to preserve its
>> behavior from then on.
>>
> It's not about just exposing sched_clock(). We need to expose a time source
> that is exactly equivalent to what perf_event uses internally. If sched_clock()
> changes, then perf_event clock will change too and so would that new time
> source for clock_gettime(). As long as everything remains consistent, we are
> good.

Sure, but I'm just hesitant to expose that sort of internal detail. If
we change it later, it's not just perf_events, but any other
applications that have come to depend on the particular behavior we
expose. We can claim "that was never promised", but it still leads to
a bad situation.

>> Also I worry that it will be abused in the same way that direct TSC access
>> is, where the seemingly better performance from the more careful/correct
>> CLOCK_MONOTONIC would cause developers to write fragile userland code that
>> will break when moved from one machine to the next.
>>
> The only goal for this new time source is for correlating user-level
> samples with
> kernel level samples, i.e., application level events with a PMU counter overflow
> for instance. Anybody trying anything else would be on their own.
>
> clock_gettime(CLOCK_PERF): guarantee to return the same time source as
> that used by the perf_event subsystem to timestamp samples when
> PERF_SAMPLE_TIME is requested in attr->sample_type.

I'm not familiar enough with perf's interfaces, but if you are going
to bind this clockid so tightly to perf, could you maybe export a perf
timestamp from one of perf's interfaces rather than through the more
generic clock_gettime() interface?


>
>> I'd probably rather perf output timestamps to userland using sane clocks
>> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
>> userland. But I probably could be convinced I'm wrong.
>>
> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
> grabbing any locks because that would need to run from NMI context?
No, of course not; that's why we have sched_clock. But I'm suggesting
we consider changing what perf exports (maybe via
interpolation/translation) to be CLOCK_MONOTONIC-ish.


I'm not strongly objecting here, I just want to make sure other
alternatives are explored before we start giving applications another
interface, dependent on internal kernel behavior, to hang themselves
with. :)

thanks
-john

2012-11-12 20:55:03

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Mon, Nov 12, 2012 at 7:53 PM, John Stultz <[email protected]> wrote:
> On 11/11/2012 12:32 PM, Stephane Eranian wrote:
>>
>> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <[email protected]>
>> wrote:
>>>
>>> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>>>
>>>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> There are many situations where we want to correlate events happening
>>>>> at
>>>>> the user level with samples recorded in the perf_event kernel sampling
>>>>> buffer.
>>>>> For instance, we might want to correlate the call to a function or
>>>>> creation of
>>>>> a file with samples. Similarly, when we want to monitor a JVM with
>>>>> jitted
>>>>> code,
>>>>> we need to be able to correlate jitted code mappings with perf event
>>>>> samples
>>>>> for symbolization.
>>>>>
>>>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>>>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>>>>
>>>>> To make correlating user vs. kernel samples easy, we would need to
>>>>> access that sched_clock() functionality. However, none of the existing
>>>>> clock calls permit this at this point. They all return timestamps which
>>>>> are
>>>>> not using the same source and/or offset as sched_clock.
>>>>>
>>>>> I believe a similar issue exists with the ftrace subsystem.
>>>>>
>>>>> The problem needs to be addressed in a portable manner. Solutions
>>>>> based on reading TSC for the user level to reconstruct sched_clock()
>>>>> don't seem appropriate to me.
>>>>>
>>>>> One possibility to address this limitation would be to extend
>>>>> clock_gettime()
>>>>> with a new clock time, e.g., CLOCK_PERF.
>>>>>
>>>>> However, I understand that sched_clock_cpu() provides ordering
>>>>> guarantees
>>>>> only
>>>>> when invoked on the same CPU repeatedly, i.e., it's not globally
>>>>> synchronized.
>>>>> But we already have to deal with this problem when merging samples
>>>>> obtained
>>>>> from different CPU sampling buffer in per-thread mode. So this is not
>>>>> necessarily
>>>>> a showstopper.
>>>>>
>>>>> Alternatives could be to use uprobes but that's less practical to
>>>>> setup.
>>>>>
>>>>> Anyone with better ideas?
>>>>
>>>> You forgot to CC the time people ;-)
>>>>
>>>> I've no problem with adding CLOCK_PERF (or another/better name).
>>>
>>> Hrm. I'm not excited about exporting that sort of internal kernel details
>>> to
>>> userland.
>>>
>>> The behavior and expectations from sched_clock() has changed over the
>>> years,
>>> so I'm not sure its wise to export it, since we'd have to preserve its
>>> behavior from then on.
>>>
>> It's not about just exposing sched_clock(). We need to expose a time
>> source
>> that is exactly equivalent to what perf_event uses internally. If
>> sched_clock()
>> changes, then perf_event clock will change too and so would that new time
>> source for clock_gettime(). As long as everything remains consistent, we
>> are
>> good.
>
>
> Sure, but I'm just hesitant to expose that sort of internal detail. If we
> change it later, its not just perf_events, but any other applications that
> have come to depend on the particular behavior we expose. We can claim
> "that was never promised" but it still leads to a bad situation.
>
>
>>> Also I worry that it will be abused in the same way that direct TSC
>>> access
>>> is, where the seemingly better performance from the more careful/correct
>>> CLOCK_MONOTONIC would cause developers to write fragile userland code
>>> that
>>> will break when moved from one machine to the next.
>>>
>> The only goal for this new time source is for correlating user-level
>> samples with
>> kernel level samples, i.e., application level events with a PMU counter
>> overflow
>> for instance. Anybody trying anything else would be on their own.
>>
>> clock_gettime(CLOCK_PERF): guarantee to return the same time source as
>> that used by the perf_event subsystem to timestamp samples when
>> PERF_SAMPLE_TIME is requested in attr->sample_type.
>
>
> I'm not familiar enough with perf's interfaces, but if you are going to make
> this clockid bound so tightly with perf, could you maybe export a perf
> timestamp from one of perf's interfaces rather then using the more generic
> clock_gettime() interface?
>
Yeah, I considered that as well, but it is more complicated. The only
syscall we could extend for perf_events is ioctl(), and that one
requires that an event be created so we obtain a file descriptor for
the ioctl() call. So we'd have to pretend to program a dummy event
just for the purpose of obtaining a timestamp. We could do that, but
that's not so nice. But more amenable to the

Keep in mind that clock_gettime() would be used by programs which are
not self-monitoring but may be monitored externally by a tool such as
perf. We just need them to emit their events with a timestamp that can
be correlated offline with those of perf_events.
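As a sketch of what such a monitored program might emit (the record layout here is entirely hypothetical, not an existing perf format):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* A monitored application (e.g. a JIT) appends fixed-size records
 * stamped in the perf time domain; an offline tool later merges them
 * with PERF_RECORD_SAMPLE entries by timestamp. */
struct user_event {
    uint64_t timestamp;  /* from the proposed clock_gettime(CLOCK_PERF) */
    uint64_t addr;       /* e.g. start of a jitted code region */
    uint32_t size;       /* region length in bytes */
    char     name[44];   /* symbol name, NUL-terminated */
};

static void emit_event(FILE *log, uint64_t ts, uint64_t addr,
                       uint32_t size, const char *name)
{
    /* designated init zero-fills the rest of the struct */
    struct user_event ev = { .timestamp = ts, .addr = addr, .size = size };

    strncpy(ev.name, name, sizeof(ev.name) - 1);
    fwrite(&ev, sizeof(ev), 1, log);
}
```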

>
>
>>
>>> I'd probably rather perf output timestamps to userland using sane clocks
>>> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
>>> userland. But I probably could be convinced I'm wrong.
>>>
>> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
>> grabbing any locks because that would need to run from NMI context?
>
> No, of course why we have sched_clock. But I'm suggesting we consider
> changing what perf exports (via maybe interpolation/translation) to be
> CLOCK_MONOTONIC-ish.
>
Explain to me the key difference between monotonic and what
sched_clock() is returning today. Does this have to do with the global
monotonic vs. the cpu-wide monotonic?

>
> I'm not strongly objecting here, I just want to make sure other alternatives
> are explored before we start giving applications another internal kernel
> behavior dependent interface to hang themselves with. :)
>
> thanks
> -john
>

2012-11-12 22:40:24

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 11/12/2012 12:54 PM, Stephane Eranian wrote:
> On Mon, Nov 12, 2012 at 7:53 PM, John Stultz <[email protected]> wrote:
>> On 11/11/2012 12:32 PM, Stephane Eranian wrote:
>>> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <[email protected]>
>>> wrote:
>>>> Also I worry that it will be abused in the same way that direct TSC
>>>> access
>>>> is, where the seemingly better performance from the more careful/correct
>>>> CLOCK_MONOTONIC would cause developers to write fragile userland code
>>>> that
>>>> will break when moved from one machine to the next.
>>>>
>>> The only goal for this new time source is for correlating user-level
>>> samples with
>>> kernel level samples, i.e., application level events with a PMU counter
>>> overflow
>>> for instance. Anybody trying anything else would be on their own.
>>>
>>> clock_gettime(CLOCK_PERF): guarantee to return the same time source as
>>> that used by the perf_event subsystem to timestamp samples when
>>> PERF_SAMPLE_TIME is requested in attr->sample_type.
>>
>> I'm not familiar enough with perf's interfaces, but if you are going to make
>> this clockid bound so tightly with perf, could you maybe export a perf
>> timestamp from one of perf's interfaces rather then using the more generic
>> clock_gettime() interface?
>>
> Yeah, I considered that as well. But it is more complicated. The only syscall
> we could extend for perf_events is ioctl(). But that one requires that an
> event be created so we obtain a file descriptor for the ioctl() call
> So we'd have to
> pretend programming a dummy event just for the purpose of obtained a timestamp.
> We could do that but that's not so nice. But more amenable to the

Sorry, you trailed off. Did you want to finish that thought? (I do
that all the time. :)

> Keep in mind that the clock_gettime() would be used by programs which are not
> self-monitoring but may be monitored externally by a tool such as perf. We just
> need to them to emit their events with a timestamp that can be
> correlated offline
> with those of perf_events.

Again, forgive me for not really knowing much about perf here, but
could you have perf log an event when clock_gettime() was called,
possibly recording the returned value, so you could correlate that
data yourself?


>>>> I'd probably rather perf output timestamps to userland using sane clocks
>>>> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
>>>> userland. But I probably could be convinced I'm wrong.
>>>>
>>> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
>>> grabbing any locks because that would need to run from NMI context?
>> No, of course why we have sched_clock. But I'm suggesting we consider
>> changing what perf exports (via maybe interpolation/translation) to be
>> CLOCK_MONOTONIC-ish.
>>
> Explain to me the key difference between monotonic and what sched_clock()
> is returning today? Does this have to do with the global monotonic vs.
> the cpu-wide
> monotonic?

So CLOCK_MONOTONIC is the number of NTP-corrected (for accuracy)
seconds + nsecs that the machine has been up (so it doesn't include
time in suspend). It's promised to be globally monotonic across cpus.

In my understanding, sched_clock's definition has changed over time.
It used to be a fast but possibly incorrect nanoseconds-since-boot
value; with suspend and other events it could reset or overflow, and
its users (then only the scheduler) were expected to deal with that.
It also wasn't guaranteed to be consistent across cpus, so it was
limited to calculating approximate time intervals on a single cpu.

However, with CFS (and Peter or Ingo could probably hop in and clarify
further) I believe it started to require some cross-cpu consistency,
and reset events would cause problems with the scheduler, so
additional layers have been added to try to enforce these additional
requirements.

I suspect they aren't that far off, except that calibration frequency
errors go uncorrected with sched_clock. But I was thinking you could
get periodic timestamps in perf that correlated CLOCK_MONOTONIC with
sched_clock, and then allow the kernel to interpolate the sched_clock
times out to something pretty close to CLOCK_MONOTONIC. That way perf
wouldn't leak the sched_clock time domain to userland.
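The interpolation idea could look roughly like this (an offset-only translation between periodic sync points; the names are illustrative, and frequency error between sync points stays uncorrected, as noted above):

```c
#include <stdint.h>

/* Periodic paired readings taken at the same instant: sched_clock
 * and CLOCK_MONOTONIC. Between sync points, the two clocks are
 * assumed to advance at nearly the same rate. */
struct clock_sync {
    uint64_t sched_ns;  /* sched_clock at the sync point */
    uint64_t mono_ns;   /* CLOCK_MONOTONIC at the same instant */
};

/* Map a raw sched_clock timestamp to an approximate CLOCK_MONOTONIC
 * value using the most recent sync point. */
static uint64_t sched_to_mono(const struct clock_sync *sync,
                              uint64_t sched_ts)
{
    return sync->mono_ns + (sched_ts - sync->sched_ns);
}
```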

Again, sorry for being a pain here. CLOCK_PERF would be an easy
solution, but I just want to make sure it's really the best one long
term.

thanks
-john


2012-11-13 20:59:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Fri, 2012-11-09 at 18:04 -0800, John Stultz wrote:
> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:

> > I've no problem with adding CLOCK_PERF (or another/better name).
> Hrm. I'm not excited about exporting that sort of internal kernel
> details to userland.
>
> The behavior and expectations from sched_clock() has changed over the
> years, so I'm not sure its wise to export it, since we'd have to
> preserve its behavior from then on.
>
> Also I worry that it will be abused in the same way that direct TSC
> access is, where the seemingly better performance from the more
> careful/correct CLOCK_MONOTONIC would cause developers to write fragile
> userland code that will break when moved from one machine to the next.
>
> I'd probably rather perf output timestamps to userland using sane clocks
> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
> userland. But I probably could be convinced I'm wrong.

I'm surprised that perf has its own clock anyway. But I would like to
export the tracing clocks. We have three (well four) of them:

trace_clock_local() which is defined to be a very fast clock but may not
be synced with other cpus (basically, it just calls sched_clock).

trace_clock() which is not totally serialized, but also not totally off
(between local and global). This uses local_clock() which is the same
thing that perf_clock() uses.

trace_clock_global() which is a monotonic clock across CPUs. It's much
slower than the above, but works well when you require synced
timestamps.

There's also trace_clock_counter(), which isn't even a clock :-) It's
just an incremental atomic counter that goes up every time it's
called. This is the most synced clock, but it is absolutely
meaningless for timestamps. It's just a way to show ordered events.
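A userspace model of those trace_clock_counter() semantics (a sketch, not the kernel code): the returned value is only an ordering, never a time:

```c
#include <stdatomic.h>
#include <stdint.h>

/* "Counter clock" sketch: every call atomically increments a shared
 * counter, so callers across threads/CPUs get distinct, totally
 * ordered values with no relation to wall time. */
static atomic_uint_fast64_t trace_counter;

static uint64_t trace_clock_counter_sketch(void)
{
    /* atomic_fetch_add returns the old value; +1 yields this
     * call's unique, strictly increasing tick */
    return atomic_fetch_add(&trace_counter, 1) + 1;
}
```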

-- Steve

2012-11-14 22:26:35

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 11/13/2012 12:58 PM, Steven Rostedt wrote:
> On Fri, 2012-11-09 at 18:04 -0800, John Stultz wrote:
>> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>> I've no problem with adding CLOCK_PERF (or another/better name).
>> Hrm. I'm not excited about exporting that sort of internal kernel
>> details to userland.
>>
>> The behavior and expectations from sched_clock() has changed over the
>> years, so I'm not sure its wise to export it, since we'd have to
>> preserve its behavior from then on.
>>
>> Also I worry that it will be abused in the same way that direct TSC
>> access is, where the seemingly better performance from the more
>> careful/correct CLOCK_MONOTONIC would cause developers to write fragile
>> userland code that will break when moved from one machine to the next.
>>
>> I'd probably rather perf output timestamps to userland using sane clocks
>> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
>> userland. But I probably could be convinced I'm wrong.
> I'm surprised that perf has its own clock anyway. But I would like to
> export the tracing clocks. We have three (well four) of them:
>
> trace_clock_local() which is defined to be a very fast clock but may not
> be synced with other cpus (basically, it just calls sched_clock).
>
> trace_clock() which is not totally serialized, but also not totally off
> (between local and global). This uses local_clock() which is the same
> thing that perf_clock() uses.
>
> trace_clock_global() which is a monotonic clock across CPUs. It's much
> slower than the above, but works well when you require synced
> timestamps.
>
> There's also trace_clock_counter() which isn't even a clock :-) It's
> just a incremental atomic counter that goes up every time it's called.
> This is the most synced clock, but is absolutely meaningless for
> timestamps. It's just a way to show ordered events.

Oof. This is getting uglier. I'd really prefer not to expose all these
different internal clocks out to userland. Especially via clock_gettime().

thanks
-john

2012-11-14 23:30:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Wed, 2012-11-14 at 14:26 -0800, John Stultz wrote:

> > There's also trace_clock_counter() which isn't even a clock :-) It's
> > just a incremental atomic counter that goes up every time it's called.
> > This is the most synced clock, but is absolutely meaningless for
> > timestamps. It's just a way to show ordered events.
>
> Oof. This is getting uglier. I'd really prefer not to expose all these
> different internal clocks out userland. Especially via clock_gettime().

Actually, I would be happy to just expose them to modules, as things
like hwlat_detect could use them.

-- Steve

2013-03-14 15:34:06

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Mon, Feb 25, 2013 at 3:10 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2013-02-22 at 22:04 -0800, John Stultz wrote:
>> On 02/20/2013 02:29 AM, Peter Zijlstra wrote:
>> > On Tue, 2013-02-19 at 10:25 -0800, John Stultz wrote:
>> >> So describe how the perf time domain is different then
>> >> CLOCK_MONOTONIC_RAW.
>> > The primary difference is that the trace/sched/perf time domain is not
>> > strictly monotonic, it is only locally monotonic -- that is two time
>> > stamps taken on the same cpu are guaranteed to be monotonic.
>>
>> So how would a clock_gettime(CLOCK_PERF,...) interface help you figure
>> out which cpu you got your timestamp from?
>
> I'm not sure we want to expose it that far.. The reason people want
> this clock exposed is to be able to do logging on the same time-line so
> we can correlate events from both sources (kernel and user-space).
>
> In case of parallel execution we cannot guarantee order and reading
> logs/reconstructing events things require a bit of human intelligence.
>
>> > Furthermore, to make it useful, there's an actual bound on the inter-cpu
>> > drift (implemented by limiting the drift to CLOCK_MONOTONIC).
>>
>> So this sounds like you're already sort of interpolating to
>> CLOCK_MONOTONIC, or am I just misunderstanding you?
>
> That's right, although there's modes where the TSC is guaranteed stable
> where we don't do this (it avoids some expensive bits), so we can not
> rely on this.
>
>> > Additionally -- to increase use -- we also added a monotonic sync point
>> > when cpu A queries time of cpu B.
>>
>> Not sure I'm following this bit. But I'll have to go look at the code
>> on Monday.
>
> It will basically pull the 'slowest' cpu forward so that for that
> 'event' we can say the two time-lines have a common point.
>
>> Right, and this I understand. We can play a little fast and loose
>> with the rules for in-kernel uses, given the variety of hardware and the
>> fact that performance is more critical than perfect accuracy. Since
>> we're in-kernel we also have more information than userland does about
>> what cpu we're running on, so we can get away with only
>> locally-monotonic timestamps.
>>
>> But I want to be careful if we're exporting this out to userland that
>> its both useful and that there's an actual specification for how
>> CLOCK_PERF behaves, applications can rely upon not changing in the future.
>
> Well, the timestamps themselves are already exposed to userspace
> through the ftrace and perf data logs. All people want is to add
> secondary data stream in the same time-line.
>
I agree with Peter on this. The timestamps are already visible.
All we need is the ability to generate them for another user-level
data stream.

2013-03-14 19:57:07

by Pawel Moll

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Thu, 2013-03-14 at 15:34 +0000, Stephane Eranian wrote:
> > Well, the timestamps themselves are already exposed to userspace
> > through the ftrace and perf data logs. All people want is to add
> > secondary data stream in the same time-line.
> >
> I agree with Peter on this. The timestamps are already visible.
> All we need is the ability to generate them for another user-level
> data stream.

Ok, how about the code below? I must say I have some doubts about the
resolution, as there seems to be no generic way of figuring it out for
the sched_clock (the arch/arm/kernel/sched_clock.c is actually
calculating it, but then just prints it out and nothing more).

And, to summarize, we went through 3 ideas:

1. ioctl() - http://article.gmane.org/gmane.linux.kernel/1433933
2. syscall - http://article.gmane.org/gmane.linux.kernel/1437057
3. POSIX clock - below

John also suggested that maybe the perf could use CLOCK_MONOTONIC_RAW
instead of local/sched_clock().

How about a final decision?

Regards

Pawel

8<--------------------
From c986492d38156f1fc25ab3182f0a494bb13389ce Mon Sep 17 00:00:00 2001
From: Pawel Moll <[email protected]>
Date: Thu, 14 Mar 2013 19:49:09 +0000
Subject: [PATCH] perf: POSIX CLOCK_PERF to report current time value

To correlate user-space events with the perf events stream
a current (as in: "what time(stamp) is it now?") time value
must be made available.

This patch adds a POSIX clock returning the perf_clock()
value and accessible from userspace:

#include <time.h>

struct timespec ts;

clock_gettime(CLOCK_PERF, &ts);

Signed-off-by: Pawel Moll <[email protected]>
---
 include/uapi/linux/time.h |  1 +
 kernel/events/core.c      | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h
index 0d3c0ed..cea16b0 100644
--- a/include/uapi/linux/time.h
+++ b/include/uapi/linux/time.h
@@ -54,6 +54,7 @@ struct itimerval {
 #define CLOCK_BOOTTIME			7
 #define CLOCK_REALTIME_ALARM		8
 #define CLOCK_BOOTTIME_ALARM		9
+#define CLOCK_PERF			10
 
 /*
  * The IDs of various hardware clocks:
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0cd865..81ca459 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -37,6 +37,7 @@
 #include <linux/ftrace_event.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/mm_types.h>
+#include <linux/posix-timers.h>
 
 #include "internal.h"
 
@@ -209,6 +210,19 @@ static inline u64 perf_clock(void)
 	return local_clock();
 }
 
+static int perf_posix_clock_getres(const clockid_t which_clock,
+		struct timespec *tp)
+{
+	*tp = ns_to_timespec(TICK_NSEC);
+	return 0;
+}
+
+static int perf_posix_clock_get(clockid_t which_clock, struct timespec *tp)
+{
+	*tp = ns_to_timespec(perf_clock());
+	return 0;
+}
+
 static inline struct perf_cpu_context *
 __get_cpu_context(struct perf_event_context *ctx)
 {
@@ -7391,6 +7405,10 @@ perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
 
 void __init perf_event_init(void)
 {
+	struct k_clock perf_posix_clock = {
+		.clock_getres	= perf_posix_clock_getres,
+		.clock_get	= perf_posix_clock_get,
+	};
 	int ret;
 
 	idr_init(&pmu_idr);
@@ -7407,6 +7425,8 @@ void __init perf_event_init(void)
 	ret = init_hw_breakpoint();
 	WARN(ret, "hw_breakpoint initialization failed with: %d", ret);
 
+	posix_timers_register_clock(CLOCK_PERF, &perf_posix_clock);
+
 	/* do not patch jump label more than once per second */
 	jump_label_rate_limit(&perf_sched_events, HZ);
 
--
1.7.10.4



2013-03-31 16:24:00

by David Ahern

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 3/14/13 1:57 PM, Pawel Moll wrote:
> Ok, how about the code below? I must say I have some doubts about the
> resolution, as there seem to be no generic way of figuring it out for
> the sched_clock (the arch/arm/kernel/sched_clock.c is actually
> calculating it, but than just prints it out and nothing more).
>
> And, to summarize, we went through 3 ideas:
>
> 1. ioctl() - http://article.gmane.org/gmane.linux.kernel/1433933
> 2. syscall - http://article.gmane.org/gmane.linux.kernel/1437057
> 3. POSIX clock - below
>
> John also suggested that maybe the perf could use CLOCK_MONOTONIC_RAW
> instead of local/sched_clock().

Any chance a decision can be reached in time for 3.10? Seems like the
simplest option is the perf event based ioctl.

Converting/correlating perf_clock timestamps to time-of-day is a feature
I have been trying to get into perf for over 2 years. This is a big
piece needed for that goal -- along with the xtime tracepoints:
https://lkml.org/lkml/2013/3/19/433

David

2013-04-01 18:29:57

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 03/31/2013 09:23 AM, David Ahern wrote:
> On 3/14/13 1:57 PM, Pawel Moll wrote:
>> Ok, how about the code below? I must say I have some doubts about the
>> resolution, as there seem to be no generic way of figuring it out for
>> the sched_clock (the arch/arm/kernel/sched_clock.c is actually
>> calculating it, but than just prints it out and nothing more).
>>
>> And, to summarize, we went through 3 ideas:
>>
>> 1. ioctl() - http://article.gmane.org/gmane.linux.kernel/1433933
>> 2. syscall - http://article.gmane.org/gmane.linux.kernel/1437057
>> 3. POSIX clock - below
>>
>> John also suggested that maybe the perf could use CLOCK_MONOTONIC_RAW
>> instead of local/sched_clock().
>
> Any chance a decision can be reached in time for 3.10? Seems like the
> simplest option is the perf event based ioctl.

I'm still not sold on the CLOCK_PERF posix clock. The semantics are
still too hand-wavy and implementation specific.

While I'd prefer perf to export some existing semi-sane time domain
(using interpolation if necessary), I realize the hardware constraints
and performance optimizations make this unlikely (though I'm
disappointed I've not seen any attempt or proof point that it won't work).

Thus if we must expose this kernel detail to userland, I think we should
be careful about how publicly we expose such an interface, as it has the
potential for misuse and eventual user-land breakage.

So while having a perf specific ioctl is still exposing what I expect
will be non-static kernel internal behavior to userland, it at least
exposes it in a less generic fashion, which is preferable to me.



The next point of conflict is likely whether the ioctl method will be
sufficient given performance concerns. Something I'd be interested in
hearing about from the folks pushing this. Right now it seems any method
is preferable to not having an interface - but I want to make sure
that's really true.

For example, if the ioctl interface is really too slow, it's likely folks
will end up using periodic perf ioctl samples and interpolating using
normal vdso clock_gettime() timestamps.

If that is acceptable, then why not invert the solution and just have
perf inject periodic CLOCK_MONOTONIC timestamps into the log, then
have perf report fast, but less-accurate sched_clock deltas from that
CLOCK_MONOTONIC boundary.
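Either way, the user-space arithmetic is the same: given one (perf_clock, CLOCK_MONOTONIC) anchor pair, a later perf timestamp is converted by applying the raw nanosecond delta. A minimal sketch (where the anchor pair comes from, an injected record or a hypothetical ioctl, is exactly the open question):

```c
#include <stdint.h>

/* Map a perf_clock timestamp onto the CLOCK_MONOTONIC time-line using a
 * single anchor pair captured at a sync point. All values are in
 * nanoseconds; the error grows with the drift accumulated between the
 * anchor and the converted sample, so the anchor has to be refreshed
 * periodically. */
static uint64_t perf_to_mono(uint64_t perf_ts,
			     uint64_t perf_anchor, uint64_t mono_anchor)
{
	return mono_anchor + (perf_ts - perf_anchor);
}
```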

Another alternative that might be a reasonable compromise: have perf
register a dynamic posix clock id, which would be a driver specific,
less public interface. That would provide the initial method to access
the perf time domain. Then when it came time to optimize further,
someone would have to sort out the difficulties of creating a vdso
method for accessing dynamic posix clocks. It wouldn't be easy, but it
wouldn't be impossible to do.


> Converting/correlating perf_clock timestamps to time-of-day is a
> feature I have been trying to get into perf for over 2 years. This is
> a big piece needed for that goal -- along with the xtime tracepoints:
> https://lkml.org/lkml/2013/3/19/433

I sympathize with how long this process can take. Having maintainers
disagree without resolution can be a tar-pit. That said, it's only been a
few months since this has had proper visibility, and the discussion has
paused for months at a time. Despite how long and slow this probably
feels, the idea of maintaining a bad interface for the next decade seems
much longer. ;) So don't get discouraged yet.

thanks
-john

2013-04-01 22:29:20

by David Ahern

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 4/1/13 12:29 PM, John Stultz wrote:
>> Any chance a decision can be reached in time for 3.10? Seems like the
>> simplest option is the perf event based ioctl.
>
> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
> still too hand-wavy and implementation specific.
>
> While I'd prefer perf to export some existing semi-sane time domain
> (using interpolation if necessary), I realize the hardware constraints
> and performance optimizations make this unlikely (though I'm
> disappointed I've not seen any attempt or proof point that it won't work).
>
> Thus if we must expose this kernel detail to userland, I think we should
> be careful about how publicly we expose such an interface, as it has the
> potential for misuse and eventual user-land breakage.

But perf_clock timestamps are already exposed to userland. This new API
-- be it a posix clock or an ioctl -- just allows retrieval of a
timestamp outside of a generated event.

>
> So while having a perf specific ioctl is still exposing what I expect
> will be non-static kernel internal behavior to userland, it at least it
> exposes it in a less generic fashion, which is preferable to me.
>
>
>
> The next point of conflict is likely if the ioctl method will be
> sufficient given performance concerns. Something I'd be interested in
> hearing about from the folks pushing this. Right now it seems any method
> is preferable then not having an interface - but I want to make sure
> that's really true.
>
> For example, if the ioctl interface is really too slow, its likely folks
> will end up using periodic perf ioctl samples and interpolating using
> normal vdso clock_gettime() timestamps.

The performance/speed depends on how often it is called. I have no idea
what Stephane's use case is, but for me it is to correlate perf_clock
timestamps to timeofday. In my perf-based daemon that tracks process
schedulings, I update the correlation every 5-10 minutes.

>
> If that is acceptable, then why not invert the solution and just have
> perf injecting periodic CLOCK_MONOTONIC timestamps into the log, then
> have perf report fast, but less-accurate sched_clock deltas from that
> CLOCK_MONOTONIC boundary.

Something similar to that approach has been discussed as well, i.e., add
a realtime clock event and have it injected into the stream, e.g.,
https://lkml.org/lkml/2011/2/27/158

But there are cons to this approach -- e.g., you need that first event
generated that tells you the realtime to perf_clock correlation, and you
don't want to have to scan an unknown length of events looking for the
first one to get the correlation only to back up and process the events.

And an ioctl to generate that first event was shot down as well...
https://lkml.org/lkml/2011/3/1/174
https://lkml.org/lkml/2011/3/2/186

David

>
> Another alternative that might be a reasonable compromise: have perf
> register a dynamic posix clock id, which would be a driver specific,
> less public interface. That would provide the initial method to access
> the perf time domain. Then when it came time to optimize further,
> someone would have to sort out the difficulties of creating a vdso
> method for accessing dynamic posix clocks. It wouldn't be easy, but it
> wouldn't be impossible to do.
>
>
>> Converting/correlating perf_clock timestamps to time-of-day is a
>> feature I have been trying to get into perf for over 2 years. This is
>> a big piece needed for that goal -- along with the xtime tracepoints:
>> https://lkml.org/lkml/2013/3/19/433
>
> I sympathize with how long this process can take. Having maintainers
> disagree without resolution can be a tar-pit. That said, its only been a
> few months where this has had proper visibility, and the discussion has
> paused for months at a time. Despite how long and slow this probably
> feels, the idea of maintaining a bad interface for the next decade seems
> much longer. ;) So don't get discouraged yet.
>
> thanks
> -john

2013-04-01 23:12:48

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 04/01/2013 03:29 PM, David Ahern wrote:
> On 4/1/13 12:29 PM, John Stultz wrote:
>>> Any chance a decision can be reached in time for 3.10? Seems like the
>>> simplest option is the perf event based ioctl.
>>
>> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
>> still too hand-wavy and implementation specific.
>>
>> While I'd prefer perf to export some existing semi-sane time domain
>> (using interpolation if necessary), I realize the hardware constraints
>> and performance optimizations make this unlikely (though I'm
>> disappointed I've not seen any attempt or proof point that it won't
>> work).
>>
>> Thus if we must expose this kernel detail to userland, I think we should
>> be careful about how publicly we expose such an interface, as it has the
>> potential for misuse and eventual user-land breakage.
>
> But perf_clock timestamps are already exposed to userland. This new
> API -- be it a posix clock or an ioctl -- just allows retrieval of a
> timestamp outside of a generated event.

Although perf_clock timestamps are not exposed to applications in a way
they can use for their own purposes, no? Just as timestamp data
correlated with other perf data.

My big concern here is that if applications can retrieve these
timestamps for their own use, folks will see CLOCK_PERF as a cheaper
alternative to CLOCK_MONOTONIC, and then end up getting bitten when the
CLOCK_PERF semantics change. So until someone hammers out exactly
the behavior CLOCK_PERF should have forever going forward, I'd rather
not add it as a generic clockid.

If we're going to have to expose the perf timestamps to userland, then
I'd prefer we do it in a less public way, where it's clearly tied to the
perf interface and not as a generic clockid. (And either the ioctl or
dynamic posix clock id would be a way to go there).


>
>>
>> So while having a perf specific ioctl is still exposing what I expect
>> will be non-static kernel internal behavior to userland, it at least it
>> exposes it in a less generic fashion, which is preferable to me.
>>
>>
>>
>> The next point of conflict is likely if the ioctl method will be
>> sufficient given performance concerns. Something I'd be interested in
>> hearing about from the folks pushing this. Right now it seems any method
>> is preferable then not having an interface - but I want to make sure
>> that's really true.
>>
>> For example, if the ioctl interface is really too slow, its likely folks
>> will end up using periodic perf ioctl samples and interpolating using
>> normal vdso clock_gettime() timestamps.
>
> The performance/speed depends on how often is called. I have no idea
> what Stephane's use case is but for me it is to correlate perf_clock
> timestamps to timeofday. In my perf-based daemon that tracks process
> schedulings, I update the correlation every 5-10 minutes.

So that sounds like the ioctl approach would have no penalty from a
performance perspective.


>
>>
>> If that is acceptable, then why not invert the solution and just have
>> perf injecting periodic CLOCK_MONOTONIC timestamps into the log, then
>> have perf report fast, but less-accurate sched_clock deltas from that
>> CLOCK_MONOTONIC boundary.
>
> Something similar to that approach has been discussed as well. i.e,
> add a realtime clock event and have it injected into the stream e.g.,
> https://lkml.org/lkml/2011/2/27/158
>
> But there are cons to this approach -- e.g, you need that first event
> generated that tells you realtime to perf_clock correlation and you
> don't want to have to scan an unknown length of events looking for the
> first one to get the correlation only to backup and process the events.
>
> And an ioctl to generate that first event was shot down as well...
> https://lkml.org/lkml/2011/3/1/174
> https://lkml.org/lkml/2011/3/2/186

Hrm.

So from my quick read of that thread, what Thomas seems to be getting at
there is that the periodic CLOCK_REALTIME injection isn't valuable
without the tracepoints on timekeeping changes. But once those are
there, the periodic timestamp injection would seemingly provide _most_
of what you need.

The missing bit is the desire to inject arbitrary timestamp "fences"
into the log, which Peter and Thomas apparently don't like, but actually
sounds useful to me.

But maybe there is a way to do this without adding an ioctl?

For instance, and this is just spitballing here, if you had a tracepoint
for clock_gettime() which returned the clockid and value, you could
create these fences just by requesting the time from userland. The VDSO
clock_gettime() would avoid the syscall, so you'd have to actually call
the syscall directly from userland. But this would have the added
benefit of not slowing down normal userspace that doesn't want to cause
these fences in the log.
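The user-space half of this brainstorm is simple enough to show: calling the raw syscall instead of the libc wrapper guarantees a kernel entry, since the ordinary clock_gettime() may be satisfied entirely in user space by the vDSO. A sketch (the tracepoint itself is hypothetical; this only demonstrates forcing the syscall path):

```c
#define _GNU_SOURCE
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Read a clock via the real syscall, bypassing the vDSO fast path, so
 * any kernel-side instrumentation of clock_gettime() would actually
 * fire for this call. */
static int kernel_clock_gettime(clockid_t id, struct timespec *ts)
{
	return syscall(SYS_clock_gettime, id, ts);
}
```

Normal callers would keep using the vDSO path and only pay the syscall cost when they deliberately want to drop a fence into the log.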

I realize these half-baked "brainstorming" ideas are probably a bit
unwelcome after you've spent quite a bit of time trying different
approaches without any decisive resolution from maintainers. But maybe
Thomas and Peter can chime in here and maybe help to clarify their
objections?

thanks
-john

2013-04-02 07:54:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Mon, 2013-04-01 at 11:29 -0700, John Stultz wrote:
> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
> still too hand-wavy and implementation specific.

How about we define the semantics as: match whatever comes out of perf
(and preferably ftrace by default) stuff?

Since that stuff is already exposed to userspace, doesn't it make sense
to have a user accessible time source that generates the same time-line
so that people can create logs that can be properly interleaved?

2013-04-02 16:05:56

by Pawel Moll

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Tue, 2013-04-02 at 08:54 +0100, Peter Zijlstra wrote:
> On Mon, 2013-04-01 at 11:29 -0700, John Stultz wrote:
> > I'm still not sold on the CLOCK_PERF posix clock. The semantics are
> > still too hand-wavy and implementation specific.
>
> How about we define the semantics as: match whatever comes out of perf
> (and preferably ftrace by default) stuff?

My thought exactly. Maybe if we defined it as "CLOCK_TRACE" and had an
equivalent "trace_clock()" function used by both perf (instead of
perf_clock()) and ftrace, the semantics would become clearer? This clock
could then be described as "source of timestamps used by the Linux trace
infrastructure, in particular by ftrace and perf".

Paweł

2013-04-02 16:19:19

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 04/02/2013 12:54 AM, Peter Zijlstra wrote:
> On Mon, 2013-04-01 at 11:29 -0700, John Stultz wrote:
>> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
>> still too hand-wavy and implementation specific.
> How about we define the semantics as: match whatever comes out of perf
> (and preferably ftrace by default) stuff?

That's not a sane interface. We've already been bitten by semantic
changes in sched_clock affecting in-kernel users. How are we going to
handle this with userland in the future? What happens when applications
depend on "what comes out of perf" on one system and that ends up being
different on another? "Oh, it's just broken, the application shouldn't be
using that."

I'm sort of amazed that folks are so careful and hesitant to add an
ioctl to inject a timestamp fence into perf, but then so cavalier about
adding an ill-defined clockid as a generic interface.


> Since that stuff is already exposed to userspace, doesn't it make sense
> to have a user accessible time source that generates the same time-line
> so that people can create logs that can be properly interleaved?

It's exposed to userspace as timestamps correlated with specific data,
not timestamps for any purpose. We export kernel function addresses via
WARN_ON messages to dmesg; it doesn't mean we might as well allow
userland to jump to and execute those addresses. ;)

I still think exposing the perf clock to userland is a bad idea, and
would much rather the kernel provide timestamp data in the logs
themselves to make the logs useful. But if we're going to have to do
this via a clockid, I'm going to want it to be done via a dynamic posix
clockid, so it's clear it's tightly tied to perf and not considered a
generic interface (and I can clearly point folks having problems to the
perf maintainers ;).


thanks
-john

2013-04-02 16:34:54

by Pawel Moll

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
> I still think exposing the perf clock to userland is a bad idea, and
> would much rather the kernel provide timestamp data in the logs
> themselves to make the logs useful. But if we're going to have to do
> this via a clockid, I'm going to want it to be done via a dynamic posix
> clockid, so its clear its tightly tied with perf and not considered a
> generic interface (and I can clearly point folks having problems to the
> perf maintainers ;).

Hm. 15 mins ago I didn't know about the existence of dynamic posix
clocks at all ;-)

I feel that the idea of opening a magic character device to obtain a
magic number to be used with clock_gettime() to get the timestamp may
not be popular, but maybe (just a thought) we could somehow use the file
descriptor obtained from sys_perf_open() itself? How different it would
be from the ioctl(*_GET_TIME) I'm not sure, but I'll try to research
the idea. Counts as number 4 (5?) on my list ;-)

Paweł

2013-04-03 09:17:16

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Tue, Apr 2, 2013 at 12:29 AM, David Ahern <[email protected]> wrote:
> On 4/1/13 12:29 PM, John Stultz wrote:
>>>
>>> Any chance a decision can be reached in time for 3.10? Seems like the
>>> simplest option is the perf event based ioctl.
>>
>>
>> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
>> still too hand-wavy and implementation specific.
>>
>> While I'd prefer perf to export some existing semi-sane time domain
>> (using interpolation if necessary), I realize the hardware constraints
>> and performance optimizations make this unlikely (though I'm
>> disappointed I've not seen any attempt or proof point that it won't work).
>>
>> Thus if we must expose this kernel detail to userland, I think we should
>> be careful about how publicly we expose such an interface, as it has the
>> potential for misuse and eventual user-land breakage.
>
>
> But perf_clock timestamps are already exposed to userland. This new API --
> be it a posix clock or an ioctl -- just allows retrieval of a timestamp
> outside of a generated event.
>
Agreed.
>
>>
>> So while having a perf specific ioctl is still exposing what I expect
>> will be non-static kernel internal behavior to userland, it at least it
>> exposes it in a less generic fashion, which is preferable to me.
>>
>>
>>
>> The next point of conflict is likely if the ioctl method will be
>> sufficient given performance concerns. Something I'd be interested in
>> hearing about from the folks pushing this. Right now it seems any method
>> is preferable then not having an interface - but I want to make sure
>> that's really true.
>>
>> For example, if the ioctl interface is really too slow, its likely folks
>> will end up using periodic perf ioctl samples and interpolating using
>> normal vdso clock_gettime() timestamps.
>
I haven't done any specific testing with either approach yet. The goal is
to use this perf timestamp to correlate user-level events to hardware
events recorded by the kernel. I would assume there would be situations
where those user events could be on the critical path, and thus the
timestamp operation would have to be as efficient as possible. The vdso
approach would be ideal.

>
> The performance/speed depends on how often is called. I have no idea what
> Stephane's use case is but for me it is to correlate perf_clock timestamps
> to timeofday. In my perf-based daemon that tracks process schedulings, I
> update the correlation every 5-10 minutes.
>
I was more thinking along the lines of runtime environments like Java,
where a JIT compiler is invoked frequently and you need to correlate
samples in the native code with the Java source. For that, the JIT
compiler has to emit mapping tables, which have to be timestamped as
address ranges may be re-used.
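The address-range reuse is why the mapping tables need timestamps at all: a sample must be resolved against the mapping that was live when the sample was taken, not just any mapping covering the address. A hypothetical record layout and lookup (the names are illustrative, not an existing format):

```c
#include <stddef.h>
#include <stdint.h>

/* One JIT code mapping; 'timestamp' must be on the same time-line as
 * the kernel's PERF_SAMPLE_TIME values for the comparison to work. */
struct jit_map {
	uint64_t timestamp;	/* when this mapping became live, in ns */
	uint64_t addr;		/* start of the emitted code range */
	uint64_t size;
	const char *name;	/* source-level symbol */
};

/* Records arrive in time order, so the right mapping for a sample is
 * the latest record not newer than the sample whose address range
 * covers the sampled address; an older mapping of a re-used range is
 * correctly shadowed by the newer one. */
static const struct jit_map *resolve(const struct jit_map *maps, size_t n,
				     uint64_t sample_ts, uint64_t sample_addr)
{
	const struct jit_map *best = NULL;

	for (size_t i = 0; i < n && maps[i].timestamp <= sample_ts; i++)
		if (sample_addr >= maps[i].addr &&
		    sample_addr - maps[i].addr < maps[i].size)
			best = &maps[i];
	return best;
}
```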

>
>>
>> If that is acceptable, then why not invert the solution and just have
>> perf injecting periodic CLOCK_MONOTONIC timestamps into the log, then
>> have perf report fast, but less-accurate sched_clock deltas from that
>> CLOCK_MONOTONIC boundary.
>
>
> Something similar to that approach has been discussed as well. i.e, add a
> realtime clock event and have it injected into the stream e.g.,
> https://lkml.org/lkml/2011/2/27/158
>
> But there are cons to this approach -- e.g, you need that first event
> generated that tells you realtime to perf_clock correlation and you don't
> want to have to scan an unknown length of events looking for the first one
> to get the correlation only to backup and process the events.
>
> And an ioctl to generate that first event was shot down as well...
> https://lkml.org/lkml/2011/3/1/174
> https://lkml.org/lkml/2011/3/2/186
>
> David
>
>
>>
>> Another alternative that might be a reasonable compromise: have perf
>> register a dynamic posix clock id, which would be a driver specific,
>> less public interface. That would provide the initial method to access
>> the perf time domain. Then when it came time to optimize further,
>> someone would have to sort out the difficulties of creating a vdso
>> method for accessing dynamic posix clocks. It wouldn't be easy, but it
>> wouldn't be impossible to do.
>>
>>
>>> Converting/correlating perf_clock timestamps to time-of-day is a
>>> feature I have been trying to get into perf for over 2 years. This is
>>> a big piece needed for that goal -- along with the xtime tracepoints:
>>> https://lkml.org/lkml/2013/3/19/433
>>
>>
>> I sympathize with how long this process can take. Having maintainers
>> disagree without resolution can be a tar-pit. That said, its only been a
>> few months where this has had proper visibility, and the discussion has
>> paused for months at a time. Despite how long and slow this probably
>> feels, the idea of maintaining a bad interface for the next decade seems
>> much longer. ;) So don't get discouraged yet.
>>
>> thanks
>> -john
>
>

2013-04-03 13:55:22

by David Ahern

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 4/3/13 3:17 AM, Stephane Eranian wrote:
> I haven't done any specific testing with either approach yet. The goal is to
> use this perf timestamp to correlate user level events to hardware
> events recorded
> by the kernel. I would assume there would be situations where those user events
> could be on the critical path, and thus the timestamp operation would have to be
> as efficient as possible. The vdso approach would be ideal.
>
>>
>> The performance/speed depends on how often it is called. I have no idea what
>> Stephane's use case is but for me it is to correlate perf_clock timestamps
>> to timeofday. In my perf-based daemon that tracks process schedulings, I
>> update the correlation every 5-10 minutes.
>>
> I was more thinking along the lines of runtime environments like Java where
> a JIT compiler is invoked frequently and you need to correlate samples in the
> native code with Java source. For that, the JIT compiler has to emit mapping
> tables which have to be timestamped as address ranges may be re-used.

What's the advantage of changing apps -- like the JIT compiler -- to
emit perf-based timestamps versus having perf emit existing timestamps?
I.e., the monotonic and realtime clocks already have vdso mappings for
userspace with well-known performance characteristics. Why not have perf
convert its perf_clock timestamps into monotonic or realtime when
dumping events?

David

2013-04-03 14:00:10

by Stephane Eranian

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Wed, Apr 3, 2013 at 3:55 PM, David Ahern <[email protected]> wrote:
> On 4/3/13 3:17 AM, Stephane Eranian wrote:
>>
>> I haven't done any specific testing with either approach yet. The goal is
>> to
>> use this perf timestamp to correlate user level events to hardware
>> events recorded
>> by the kernel. I would assume there would be situations where those user
>> events
>> could be on the critical path, and thus the timestamp operation would have
>> to be
>> as efficient as possible. The vdso approach would be ideal.
>>
>>>
>>> The performance/speed depends on how often it is called. I have no idea what
>>> Stephane's use case is but for me it is to correlate perf_clock
>>> timestamps
>>> to timeofday. In my perf-based daemon that tracks process schedulings, I
>>> update the correlation every 5-10 minutes.
>>>
>> I was more thinking along the lines of runtime environments like Java
>> where
>> a JIT compiler is invoked frequently and you need to correlate samples in
>> the
>> native code with Java source. For that, the JIT compiler has to emit
>> mapping
>> tables which have to be timestamped as address ranges may be re-used.
>
>
> What's the advantage of changing apps -- like the JIT compiler -- to emit
> perf based timestamps versus having perf emit existing timestamps? ie.,
> monotonic and realtime clocks already have vdso mappings for userspace with
> well known performance characteristics. Why not have perf convert its
> perf_clock timestamps into monotonic or realtime when dumping events?
>
Can monotonic timestamps be obtained from NMI context in the kernel?

2013-04-03 14:14:26

by David Ahern

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 4/3/13 8:00 AM, Stephane Eranian wrote:
>> What's the advantage of changing apps -- like the JIT compiler -- to emit
>> perf based timestamps versus having perf emit existing timestamps? ie.,
>> monotonic and realtime clocks already have vdso mappings for userspace with
>> well known performance characteristics. Why not have perf convert its
>> perf_clock timestamps into monotonic or realtime when dumping events?
>>
> Can monotonic timestamps be obtained from NMI context in the kernel?

I don't understand the context of the question.

I am not suggesting perf_clock be changed. I am working on correlating
existing perf_clock timestamps to clocks typically used by apps
(REALTIME and time-of-day but also applies to MONOTONIC).

You are wanting the reverse -- have apps emit perf_clock timestamps. I
was just wondering what the advantage of this approach is.

David

2013-04-03 14:22:50

by Stephane Eranian

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Wed, Apr 3, 2013 at 4:14 PM, David Ahern <[email protected]> wrote:
> On 4/3/13 8:00 AM, Stephane Eranian wrote:
>>>
>>> What's the advantage of changing apps -- like the JIT compiler -- to emit
>>> perf based timestamps versus having perf emit existing timestamps? ie.,
>>> monotonic and realtime clocks already have vdso mappings for userspace
>>> with
>>> well known performance characteristics. Why not have perf convert its
>>> perf_clock timestamps into monotonic or realtime when dumping events?
>>>
>> Can monotonic timestamps be obtained from NMI context in the kernel?
>
>
> I don't understand the context of the question.
>
> I am not suggesting perf_clock be changed. I am working on correlating
> existing perf_clock timestamps to clocks typically used by apps (REALTIME
> and time-of-day but also applies to MONOTONIC).
>
But for that, you'd need to expose to users the correlation between
the two clocks.
And now you'd have fixed two clock source definitions, not just one.

> You are wanting the reverse -- have apps emit perf_clock timestamps. I was
> just wondering what is the advantage of this approach?
>
Well, that's how I interpreted your question ;-<

If you could have perf_clock use monotonic, then we would not be having
this discussion. The correlation would be trivial.

2013-04-03 17:19:25

by Pawel Moll

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
> But if we're going to have to do
> this via a clockid, I'm going to want it to be done via a dynamic posix
> clockid, so its clear its tightly tied with perf and not considered a
> generic interface (and I can clearly point folks having problems to the
> perf maintainers ;).

Ok, so how about the code below?

There are two distinct parts of the "solution":

1. The dynamic posix clock, as you suggested. Then one can get the perf
timestamp by doing:

clock_fd = open("/dev/perf-clock", O_RDONLY);
clock_gettime(FD_TO_CLOCKID(clock_fd), &ts)

2. A sort-of-hack in the get_posix_clock() function making it possible
to do the same using the perf event file descriptor, e.g.:

fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
clock_gettime(FD_TO_CLOCKID(fd), &ts)

Any (either strong or not) opinions?

Pawel

8<--------------
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e47ee46..b2127e3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -52,6 +52,7 @@ struct perf_guest_info_callbacks {
#include <linux/atomic.h>
#include <linux/sysfs.h>
#include <linux/perf_regs.h>
+#include <linux/posix-clock.h>
#include <asm/local.h>

struct perf_callchain_entry {
@@ -845,4 +846,6 @@ _name##_show(struct device *dev, \
\
static struct device_attribute format_attr_##_name = __ATTR_RO(_name)

+struct posix_clock *perf_get_posix_clock(struct file *fp);
+
#endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0cd865..534cb43 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7446,6 +7446,49 @@ unlock:
}
device_initcall(perf_event_sysfs_init);

+static int perf_posix_clock_getres(struct posix_clock *pc, struct timespec *tp)
+{
+ *tp = ns_to_timespec(TICK_NSEC);
+ return 0;
+}
+
+static int perf_posix_clock_gettime(struct posix_clock *pc, struct timespec *tp)
+{
+ *tp = ns_to_timespec(perf_clock());
+ return 0;
+}
+
+static const struct posix_clock_operations perf_posix_clock_ops = {
+ .clock_getres = perf_posix_clock_getres,
+ .clock_gettime = perf_posix_clock_gettime,
+};
+
+static struct posix_clock perf_posix_clock;
+
+struct posix_clock *perf_get_posix_clock(struct file *fp)
+{
+ if (!fp || fp->f_op != &perf_fops)
+ return NULL;
+
+ down_read(&perf_posix_clock.rwsem);
+
+ return &perf_posix_clock;
+}
+
+static int __init perf_posix_clock_init(void)
+{
+ dev_t devt;
+ int ret;
+
+ ret = alloc_chrdev_region(&devt, 0, 1, "perf-clock");
+ if (ret)
+ return ret;
+
+ perf_posix_clock.ops = perf_posix_clock_ops;
+ return posix_clock_register(&perf_posix_clock, devt);
+}
+device_initcall(perf_posix_clock_init);
+
#ifdef CONFIG_CGROUP_PERF
static struct cgroup_subsys_state *perf_cgroup_css_alloc(struct cgroup *cont)
{
diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c
index ce033c7..e2a40a5 100644
--- a/kernel/time/posix-clock.c
+++ b/kernel/time/posix-clock.c
@@ -20,6 +20,7 @@
#include <linux/device.h>
#include <linux/export.h>
#include <linux/file.h>
+#include <linux/perf_event.h>
#include <linux/posix-clock.h>
#include <linux/slab.h>
#include <linux/syscalls.h>
@@ -249,16 +250,21 @@ struct posix_clock_desc {
static int get_clock_desc(const clockid_t id, struct posix_clock_desc *cd)
{
struct file *fp = fget(CLOCKID_TO_FD(id));
+ struct posix_clock *perf_clk = NULL;
int err = -EINVAL;

if (!fp)
return err;

- if (fp->f_op->open != posix_clock_open || !fp->private_data)
+#if defined(CONFIG_PERF_EVENTS)
+ perf_clk = perf_get_posix_clock(fp);
+#endif
+ if ((fp->f_op->open != posix_clock_open || !fp->private_data) &&
+ !perf_clk)
goto out;

cd->fp = fp;
- cd->clk = get_posix_clock(fp);
+ cd->clk = perf_clk ? perf_clk : get_posix_clock(fp);

err = cd->clk ? 0 : -ENODEV;
out:


2013-04-03 17:29:40

by John Stultz

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 04/03/2013 10:19 AM, Pawel Moll wrote:
> On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
>> But if we're going to have to do
>> this via a clockid, I'm going to want it to be done via a dynamic posix
>> clockid, so its clear its tightly tied with perf and not considered a
>> generic interface (and I can clearly point folks having problems to the
>> perf maintainers ;).
> Ok, so how about the code below?
>
> There are two distinct parts of the "solution":
>
> 1. The dynamic posix clock, as you suggested. Then one can get the perf
> timestamp by doing:
>
> clock_fd = open("/dev/perf-clock", O_RDONLY);
> clock_gettime(FD_TO_CLOCKID(clock_fd), &ts)
>
> 2. A sort-of-hack in the get_posix_clock() function making it possible
> to do the same using the perf event file descriptor, eg.:
>
> fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
> clock_gettime(FD_TO_CLOCKID(fd), &ts)

#2 makes my nose wrinkle. Forgive me for being somewhat ignorant on the
perf interfaces, but why is the second portion necessary or beneficial?

thanks
-john

2013-04-03 17:35:10

by Pawel Moll

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Wed, 2013-04-03 at 18:29 +0100, John Stultz wrote:
> On 04/03/2013 10:19 AM, Pawel Moll wrote:
> > On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
> >> But if we're going to have to do
> >> this via a clockid, I'm going to want it to be done via a dynamic posix
> >> clockid, so its clear its tightly tied with perf and not considered a
> >> generic interface (and I can clearly point folks having problems to the
> >> perf maintainers ;).
> > Ok, so how about the code below?
> >
> > There are two distinct parts of the "solution":
> >
> > 1. The dynamic posix clock, as you suggested. Then one can get the perf
> > timestamp by doing:
> >
> > clock_fd = open("/dev/perf-clock", O_RDONLY);
> > clock_gettime(FD_TO_CLOCKID(clock_fd), &ts)
> >
> > 2. A sort-of-hack in the get_posix_clock() function making it possible
> > to do the same using the perf event file descriptor, eg.:
> >
> > fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
> > clock_gettime(FD_TO_CLOCKID(fd), &ts)
>
> #2 makes my nose wrinkle.

To make myself clear: I consider the code as it is a hack.

> Forgive me for being somewhat ignorant on the
> perf interfaces, but why is the second portion necessary or beneficial?

My thinking: the perf syscall returns a file descriptor already, so it
would make sense to re-use it in the clock_gettime() call instead of
jumping through hoops to open a character device file, which may not
exist at all (e.g. no udev) or may be placed or named in a random way
(e.g. some local udev rule).

I'm open to different opinions :-)

Pawel

2013-04-03 17:51:05

by John Stultz

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 04/03/2013 10:35 AM, Pawel Moll wrote:
> On Wed, 2013-04-03 at 18:29 +0100, John Stultz wrote:
>> On 04/03/2013 10:19 AM, Pawel Moll wrote:
>>> On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
>>>> But if we're going to have to do
>>>> this via a clockid, I'm going to want it to be done via a dynamic posix
>>>> clockid, so its clear its tightly tied with perf and not considered a
>>>> generic interface (and I can clearly point folks having problems to the
>>>> perf maintainers ;).
>>> Ok, so how about the code below?
>>>
>>> There are two distinct parts of the "solution":
>>>
>>> 1. The dynamic posix clock, as you suggested. Then one can get the perf
>>> timestamp by doing:
>>>
>>> clock_fd = open("/dev/perf-clock", O_RDONLY);
>>> clock_gettime(FD_TO_CLOCKID(clock_fd), &ts)
>>>
>>> 2. A sort-of-hack in the get_posix_clock() function making it possible
>>> to do the same using the perf event file descriptor, eg.:
>>>
>>> fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
>>> clock_gettime(FD_TO_CLOCKID(fd), &ts)
>> #2 makes my nose wrinkle.
> To make myself clear: I consider the code as it is a hack.
>
>> Forgive me for being somewhat ignorant on the
>> perf interfaces, but why is the second portion necessary or beneficial?
> My thinking: the perf syscall returns a file descriptor already, so it
> would make sense to re-use it in the clock_gettime() call instead of
> jumping through loops to open a character device file, which may not
> exist at all (eg. no udev) or may be placed or named in a random way
> (eg. some local udev rule).
>
> I'm open for different opinions :-)

Cc'ing Richard for his thoughts here.


I get the reasoning around reusing the fd we already have, but is the
possibility of a dynamic chardev pathname really a big concern?

I'm guessing the private_data on the perf file is already used?

Maybe we can extend the dynamic posix clock code to work on more than
just the chardev? Although I worry about multiplexing too much
functionality on the file.

thanks
-john

2013-04-03 17:57:33

by John Stultz

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 04/03/2013 07:22 AM, Stephane Eranian wrote:
> On Wed, Apr 3, 2013 at 4:14 PM, David Ahern <[email protected]> wrote:
>> On 4/3/13 8:00 AM, Stephane Eranian wrote:
>>>> Why not have perf convert its
>>>> perf_clock timestamps into monotonic or realtime when dumping events?

So this is exactly what I've been wondering through all this.

Perf can keep track of events using its own time domain (which is
understandably required due to performance and locking issues), but when
exporting those timestamps to userland, could it not do the same (likely
imperfect) conversion to existing userland time domains (like
CLOCK_MONOTONIC)?


>>> Can monotonic timestamps be obtained from NMI context in the kernel?
>>
>> I don't understand the context of the question.
>>
>> I am not suggesting perf_clock be changed. I am working on correlating
>> existing perf_clock timestamps to clocks typically used by apps (REALTIME
>> and time-of-day but also applies to MONOTONIC).
>>
> But for that, you'd need to expose to users the correlation between
> the two clocks.
> And now you'd fixed two clock sources definitions not just one.

I'm not sure I follow this. If perf exported data came with
CLOCK_MONOTONIC timestamps, no correlation would need to be exposed.
perf would just have to incur the extra overhead of doing the conversion on
export.


>> You are wanting the reverse -- have apps emit perf_clock timestamps. I was
>> just wondering what is the advantage of this approach?
>>
> Well, that's how I interpreted your question ;-<
>
> If you could have perf_clock use monotonic then we would not have this
> discussion.
> The correlation would be trivial.

I think the suggestion is not to have perf_clock use CLOCK_MONOTONIC,
but to have the perf interfaces export CLOCK_MONOTONIC.

thanks
-john

2013-04-04 07:37:14

by Richard Cochran

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Wed, Apr 03, 2013 at 10:50:57AM -0700, John Stultz wrote:
>
> I get the reasoning around reusing the fd we already have, but is
> the possibility of a dynamic chardev pathname really a big concern?

I have been following this thread, and, not knowing very much about
perf, I would think that the userland can easily open a second file
(the dynamic posix clock chardev) in order to get these timestamps.

> Maybe can we extend the dynamic posix clock code to work on more
> then just the chardev? Although I worry about multiplexing too much
> functionality on the file.

I don't yet see a need for that, but if we do, then it should work in
a generic way, and not as a list of special cases, like we saw in the
patch.

Thanks,
Richard

2013-04-04 08:12:19

by Stephane Eranian

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Wed, Apr 3, 2013 at 7:57 PM, John Stultz <[email protected]> wrote:
> On 04/03/2013 07:22 AM, Stephane Eranian wrote:
>>
>> On Wed, Apr 3, 2013 at 4:14 PM, David Ahern <[email protected]> wrote:
>>>
>>> On 4/3/13 8:00 AM, Stephane Eranian wrote:
>>>>>
>>>>> Why not have perf convert its
>>>>> perf_clock timestamps into monotonic or realtime when dumping events?
>
>
> So this is exactly what I've been wondering through all this.
>
> Perf can keep track of events using its own time domain (which is
> understandably required due to performance and locking issues), but when
> exporting those timestamps to userland, could it not do the same (likely
> imperfect) conversion to existing userland time domains (like
> CLOCK_MONOTONIC)?
>
>
>
>>>> Can monotonic timestamps be obtained from NMI context in the kernel?
>>>
>>>
>>> I don't understand the context of the question.
>>>
>>> I am not suggesting perf_clock be changed. I am working on correlating
>>> existing perf_clock timestamps to clocks typically used by apps (REALTIME
>>> and time-of-day but also applies to MONOTONIC).
>>>
>> But for that, you'd need to expose to users the correlation between
>> the two clocks.
>> And now you'd fixed two clock sources definitions not just one.
>
>
> I'm not sure I follow this. If perf exported data came with CLOCK_MONOTONIC
> timestamps, no correlation would need to be exposed. perf would just have
> to do the extra overhead of doing the conversion on export.
>
There is no explicit export operation in perf. You record a sample when
the counter overflows and generates an NMI interrupt. In the NMI interrupt
handler, the sample record is written to the sampling buffer. That is when
the timestamp is generated. The sampling buffer is directly accessible to
users via mmap(). The perf tool just dumps the raw sampling buffer into
a file; no sample record is modified or even looked at. The processing
of the samples is done offline (via perf report) and could be done on
another machine. In other words, the perf.data file is self-contained.

Are you suggesting that the perf tool or kernel could expose a constant
correlation factor between the perf timestamp and MONOTONIC, and that
this constant could be recorded by the perf tool in the perf.data file
and used later on by the perf report command?



>
>
>>> You are wanting the reverse -- have apps emit perf_clock timestamps. I
>>> was
>>> just wondering what is the advantage of this approach?
>>>
>> Well, that's how I interpreted your question ;-<
>>
>> If you could have perf_clock use monotonic then we would not have this
>> discussion.
>> The correlation would be trivial.
>
>
> I think the suggestion is not to have the perf_clock use CLOCK_MONOTONIC,
> but the perf interfaces export CLOCK_MONOTONIC.
>
> thanks
> -john
>

2013-04-04 16:29:30

by Pawel Moll

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Wed, 2013-04-03 at 18:50 +0100, John Stultz wrote:
> I get the reasoning around reusing the fd we already have, but is the
> possibility of a dynamic chardev pathname really a big concern?

Well, on my particular development system I have no udev, so I had to
do "mknod" manually. The perf syscall works out of the box. Of course
one could say it's my problem...

> I'm guessing the private_data on the perf file is already used?

Of course.

> Maybe can we extend the dynamic posix clock code to work on more then
> just the chardev?

The idea I'm following now is to make the dynamic clock framework even
more generic, so there could be a clock associated with an arbitrary
struct file * (the perf syscall is getting one with
anon_inode_getfile()). I don't know how to get this done yet, but I'll
give it a try and report.

Paweł

2013-04-04 16:33:28

by Pawel Moll

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Thu, 2013-04-04 at 08:37 +0100, Richard Cochran wrote:
> > I get the reasoning around reusing the fd we already have, but is
> > the possibility of a dynamic chardev pathname really a big concern?
>
> I have been following this thread, and, not knowing very much about
> perf, I would think that the userland can easily open a second file
> (the dynamic posix clock chardev) in order to get these time stamps.

Sure it can - I've tested it. It's just a bit cumbersome in my opinion
(there is nothing else perf-related in /dev). I can agree to disagree if
you think otherwise :-)

> > Maybe can we extend the dynamic posix clock code to work on more
> > then just the chardev? Although I worry about multiplexing too much
> > functionality on the file.
>
> I don't yet see a need for that, but if we do, then it should work in
> a generic way, and not as a list of special cases, like we saw in the
> patch.

By all means - and in an even more generic way than it is now (why
character devices and not any other file?). I'll give it a try.

Paweł


2013-04-04 22:26:45

by John Stultz

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 04/04/2013 01:12 AM, Stephane Eranian wrote:
> On Wed, Apr 3, 2013 at 7:57 PM, John Stultz <[email protected]> wrote:
>> I'm not sure I follow this. If perf exported data came with CLOCK_MONOTONIC
>> timestamps, no correlation would need to be exposed. perf would just have
>> to do the extra overhead of doing the conversion on export.
> There is no explicit export operation in perf. You record a sample when
> the counter overflows and generates an NMI interrupt. In the NMI interrupt
> handler, the sample record is written to the sampling buffer. That is when
> the timestamp is generated. The sampling buffer is directly accessible to
> users via mmap(). The perf tool just dumps the raw sampling buffer into
> a file, no sample record is modified or even looked at. The processing
> of the samples is done offline (via perf report) and could be done on
> another machine. In other words, the perf.data file is self-contained.
Ah. Ok, I didn't realize perf's buffers were directly mmapped. I was
thinking perf could do the translation not at NMI time but when the
buffer was later read by the application. That helps explain some of
the constraints.

thanks
-john

2013-04-05 18:16:59

by Pawel Moll

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Thu, 2013-04-04 at 17:29 +0100, Pawel Moll wrote:
> > Maybe can we extend the dynamic posix clock code to work on more then
> > just the chardev?
>
> The idea I'm following now is to make the dynamic clock framework even
> more generic, so there could be a clock associated with an arbitrary
> struct file * (the perf syscall is getting one with
> anon_inode_getfile()). I don't know how to get this done yet, but I'll
> give it a try and report.

Ok, so how about the code below? Disclaimer: this is just a proposal.
I'm not sure how welcome an extra field in struct file would be, but
this makes the clocks ultimately flexible - one can "attach" the clock
to any arbitrary struct file. Alternatively we could mark a "clocked"
file with a special flag in f_mode and have some kind of lookup.

Also, I can't stop thinking that posix-clock.c shouldn't actually do
anything about the character device... The PTP core (as the model of
using a character device seems to me just one of the possible choices)
could do this on its own and have simple open/release attaching/detaching
the clock. This would remove a lot of "generic dev" code in
posix-clock.c and all the optional cdev methods in struct posix_clock.
It's just a thought, though...

And a couple of questions for Richard... Isn't the kref_put() in
posix_clock_unregister() a bug? I'm not 100% sure, but it looks like a
simple register->unregister sequence was making the ref count == -1, so
delete_clock() won't be called. And was there any particular reason that
the ops in struct posix_clock are *not* a pointer? This makes static
clock declaration a bit cumbersome (I'm not a C language lawyer, but gcc
doesn't let me simply do .ops = other_static_struct_with_ops).

Regards

Pawel

8<-------------------------------------------
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..4090500 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -804,6 +804,9 @@ struct file {
#ifdef CONFIG_DEBUG_WRITECOUNT
unsigned long f_mnt_write_state;
#endif
+
+ /* for clock_gettime(FD_TO_CLOCKID(fd)) and friends */
+ struct posix_clock *posix_clock;
};

struct file_handle {
diff --git a/include/linux/posix-clock.h b/include/linux/posix-clock.h
index 34c4498..85df2c5 100644
--- a/include/linux/posix-clock.h
+++ b/include/linux/posix-clock.h
@@ -123,6 +123,10 @@ struct posix_clock {
void (*release)(struct posix_clock *clk);
};

+void posix_clock_init(struct posix_clock *clk);
+void posix_clock_attach(struct posix_clock *clk, struct file *fp);
+void posix_clock_detach(struct file *fp);
+
/**
* posix_clock_register() - register a new clock
* @clk: Pointer to the clock. Caller must provide 'ops' and 'release'
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0cd865..0b70ad1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -34,6 +34,7 @@
#include <linux/anon_inodes.h>
#include <linux/kernel_stat.h>
#include <linux/perf_event.h>
+#include <linux/posix-clock.h>
#include <linux/ftrace_event.h>
#include <linux/hw_breakpoint.h>
#include <linux/mm_types.h>
@@ -627,6 +628,25 @@ perf_cgroup_mark_enabled(struct perf_event *event,
}
#endif

+static int perf_posix_clock_getres(struct posix_clock *pc, struct timespec *tp)
+{
+ *tp = ns_to_timespec(TICK_NSEC);
+ return 0;
+}
+
+static int perf_posix_clock_gettime(struct posix_clock *pc, struct timespec *tp)
+{
+ *tp = ns_to_timespec(perf_clock());
+ return 0;
+}
+
+static struct posix_clock perf_posix_clock = {
+ .ops = (struct posix_clock_operations) {
+ .clock_getres = perf_posix_clock_getres,
+ .clock_gettime = perf_posix_clock_gettime,
+ },
+};
+
void perf_pmu_disable(struct pmu *pmu)
{
int *count = this_cpu_ptr(pmu->pmu_disable_count);
@@ -2992,6 +3012,7 @@ static void put_event(struct perf_event *event)

static int perf_release(struct inode *inode, struct file *file)
{
+ posix_clock_detach(file);
put_event(file->private_data);
return 0;
}
@@ -6671,6 +6692,7 @@ SYSCALL_DEFINE5(perf_event_open,
* perf_group_detach().
*/
fdput(group);
+ posix_clock_attach(&perf_posix_clock, event_file);
fd_install(event_fd, event_file);
return event_fd;

@@ -7416,6 +7438,8 @@ void __init perf_event_init(void)
*/
BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
!= 1024);
+
+ posix_clock_init(&perf_posix_clock);
}

static int __init perf_event_sysfs_init(void)
diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c
index ce033c7..525fa44 100644
--- a/kernel/time/posix-clock.c
+++ b/kernel/time/posix-clock.c
@@ -25,14 +25,44 @@
#include <linux/syscalls.h>
#include <linux/uaccess.h>

-static void delete_clock(struct kref *kref);
+void posix_clock_init(struct posix_clock *clk)
+{
+ kref_init(&clk->kref);
+ init_rwsem(&clk->rwsem);
+}
+EXPORT_SYMBOL_GPL(posix_clock_init);
+
+void posix_clock_attach(struct posix_clock *clk, struct file *fp)
+{
+ kref_get(&clk->kref);
+ fp->posix_clock = clk;
+}
+EXPORT_SYMBOL_GPL(posix_clock_attach);
+
+static void delete_clock(struct kref *kref)
+{
+ struct posix_clock *clk = container_of(kref, struct posix_clock, kref);
+
+ if (clk->release)
+ clk->release(clk);
+}
+
+void posix_clock_detach(struct file *fp)
+{
+ kref_put(&fp->posix_clock->kref, delete_clock);
+ fp->posix_clock = NULL;
+}
+EXPORT_SYMBOL_GPL(posix_clock_detach);

/*
* Returns NULL if the posix_clock instance attached to 'fp' is old and stale.
*/
static struct posix_clock *get_posix_clock(struct file *fp)
{
- struct posix_clock *clk = fp->private_data;
+ struct posix_clock *clk = fp->posix_clock;
+
+ if (!clk)
+ return NULL;

down_read(&clk->rwsem);

@@ -167,10 +197,8 @@ static int posix_clock_open(struct inode *inode, struct file *fp)
else
err = 0;

- if (!err) {
- kref_get(&clk->kref);
- fp->private_data = clk;
- }
+ if (!err)
+ posix_clock_attach(clk, fp);
out:
up_read(&clk->rwsem);
return err;
@@ -178,15 +206,13 @@ out:

static int posix_clock_release(struct inode *inode, struct file *fp)
{
- struct posix_clock *clk = fp->private_data;
+ struct posix_clock *clk = fp->posix_clock;
int err = 0;

if (clk->ops.release)
err = clk->ops.release(clk);

- kref_put(&clk->kref, delete_clock);
-
- fp->private_data = NULL;
+ posix_clock_detach(fp);

return err;
}
@@ -210,8 +236,7 @@ int posix_clock_register(struct posix_clock *clk, dev_t devid)
{
int err;

- kref_init(&clk->kref);
- init_rwsem(&clk->rwsem);
+ posix_clock_init(clk);

cdev_init(&clk->cdev, &posix_clock_file_operations);
clk->cdev.owner = clk->ops.owner;
@@ -221,14 +246,6 @@ int posix_clock_register(struct posix_clock *clk, dev_t devid)
}
EXPORT_SYMBOL_GPL(posix_clock_register);

-static void delete_clock(struct kref *kref)
-{
- struct posix_clock *clk = container_of(kref, struct posix_clock, kref);
-
- if (clk->release)
- clk->release(clk);
-}
-
void posix_clock_unregister(struct posix_clock *clk)
{
cdev_del(&clk->cdev);
@@ -249,22 +266,19 @@ struct posix_clock_desc {
static int get_clock_desc(const clockid_t id, struct posix_clock_desc *cd)
{
struct file *fp = fget(CLOCKID_TO_FD(id));
- int err = -EINVAL;

if (!fp)
- return err;
-
- if (fp->f_op->open != posix_clock_open || !fp->private_data)
- goto out;
+ return -EINVAL;

cd->fp = fp;
cd->clk = get_posix_clock(fp);

- err = cd->clk ? 0 : -ENODEV;
-out:
- if (err)
+ if (!cd->clk) {
fput(fp);
- return err;
+ return -ENODEV;
+ }
+
+ return 0;
}

static void put_clock_desc(struct posix_clock_desc *cd)



2013-04-06 11:05:25

by Richard Cochran

Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Fri, Apr 05, 2013 at 07:16:53PM +0100, Pawel Moll wrote:

> Ok, so how about the code below? Disclaimer: this is just a proposal.
> I'm not sure how welcomed would be an extra field in struct file, but
> this makes the clocks ultimately flexible - one can "attach" the clock
> to any arbitrary struct file. Alternatively we could mark a "clocked"
> file with a special flag in f_mode and have some kind of lookup.

Only a tiny minority of file instances will want to be clocks.
Therefore I think adding the extra field will be a hard sell.

The flag idea sounds harmless, but how do you perform the lookup?

> Also, I can't stop thinking that the posix-clock.c shouldn't actually do
> anything about the character device... The PTP core (as the model of
> using character device seems to me just one of possible choices) could
> do this on its own and have simple open/release attaching/detaching the
> clock. This would remove a lot of "generic dev" code in the
> posix-clock.c and all the optional cdev methods in struct posix_clock.
> It's just a thought, though...

Right, the chardev could be pushed into the PHC layer. The original
idea of chardev clocks did have precedents, though, like hpet and rtc.

> And a couple of questions to Richard... Isn't the kref_put() in
> posix_clock_unregister() a bug? I'm not 100% but it looks like a simple
> register->unregister sequence was making the ref count == -1, so the
> delete_clock() won't be called.

Well,

posix_clock_register() -> kref_init() ->
atomic_set(&kref->refcount, 1);

So refcount is now 1 ...

posix_clock_unregister() -> kref_put() -> kref_sub(count=1) ->
atomic_sub_and_test((int) count, &kref->refcount)

and refcount is now 0. Can't see how you would get -1 here.

> And was there any particular reason that the ops in struct
> posix_clock are *not* a pointer?

One less run time indirection maybe? I don't really remember why or
how we arrived at this. The whole PHC review took a year, with
something like fifteen revisions.

Thanks,
Richard

2013-04-08 17:58:22

by Pawel Moll

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Sat, 2013-04-06 at 12:05 +0100, Richard Cochran wrote:
> On Fri, Apr 05, 2013 at 07:16:53PM +0100, Pawel Moll wrote:
> > Ok, so how about the code below? Disclaimer: this is just a proposal.
> > I'm not sure how welcomed would be an extra field in struct file, but
> > this makes the clocks ultimately flexible - one can "attach" the clock
> > to any arbitrary struct file. Alternatively we could mark a "clocked"
> > file with a special flag in f_mode and have some kind of lookup.
>
> Only a tiny minority of file instances will want to be clocks.
> Therefore I think adding the extra field will be a hard sell.
>
> The flag idea sounds harmless, but how do you perform the lookup?

Hash table. I'll get some code typed and post it tomorrow.

> > Also, I can't stop thinking that the posix-clock.c shouldn't actually do
> > anything about the character device... The PTP core (as the model of
> > using character device seems to me just one of possible choices) could
> > do this on its own and have simple open/release attaching/detaching the
> > clock. This would remove a lot of "generic dev" code in the
> > posix-clock.c and all the optional cdev methods in struct posix_clock.
> > It's just a thought, though...
>
> Right, the chardev could be pushed into the PHC layer. The original
> idea of chardev clocks did have precedents, though, like hpet and rtc.

I'm not arguing about the use of cdev for PTP clocks, it's perfectly
fine with me. I'm just not convinced that the "more generic" clock layer
should enforce cdevs and nothing more. But I think we're more-or-less
talking the same language here, so I'll simply create a patch and send
it as RFC.

> > And a couple of questions to Richard... Isn't the kref_put() in
> > posix_clock_unregister() a bug? I'm not 100% but it looks like a simple
> > register->unregister sequence was making the ref count == -1, so the
> > delete_clock() won't be called.
>
> Well,
>
> posix_clock_register() -> kref_init() ->
> atomic_set(&kref->refcount, 1);
>
> So refcount is now 1 ...
>
> posix_clock_unregister() -> kref_put() -> kref_sub(count=1) ->
> atomic_sub_and_test((int) count, &kref->refcount)
>
> and refcount is now 0. Can't see how you would get -1 here.

Eh. For some reason I was convinced that kref_init() sets the counter to
0 not 1. My bad.

> > And was there any particular reason that the ops in struct
> > posix_clock are *not* a pointer?
>
> One less run time indirection maybe? I don't really remember why or
> how we arrived at this. The whole PHC review took a year, with
> something like fifteen revisions.

Ok. As most of the *_ops seem to be referenced via pointers (including
file ops, which are pretty heavily used ;-) and this makes it much
easier to define static clocks, I'll propose a change in a separate
patch.

Now, before I spend time doing all this, a question to John, Peter,
Stephane and the rest of the public - would a solution providing such
userspace interface:

fd = sys_perf_open()
timestamp = clock_gettime(FD_TO_CLOCKID(fd), &ts)

be acceptable to all?

Paweł

2013-04-08 19:06:02

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On 04/08/2013 10:58 AM, Pawel Moll wrote:
> Now, before I spend time doing all this, a question to John, Peter,
> Stephane and the rest of the public - would a solution providing such
> userspace interface:
>
> fd = sys_perf_open()
> timestamp = clock_gettime(FD_TO_CLOCKID(fd), &ts)
>
> be acceptable to all?

So thinking this through further, I'm worried we may _not_ be able to
eventually enable this to be a vdso as I had earlier hoped. Mostly
because I'm not sure how the fd -> file -> clock lookup could be done in
userland (any ideas?).

So this makes this approach, long term, mostly equivalent to the ioctl
method from a performance perspective. And it makes the dynamic posix
clockid somewhat less of a middle-ground compromise between the ioctl
and generic constant clockid approaches.

So while I'm not opposed to the sort of extension proposed above, I want
to make sure introducing the new approach is worth the effort when
compared with just adding an ioctl.

thanks
-john

2013-04-09 05:03:05

by Richard Cochran

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Mon, Apr 08, 2013 at 12:05:52PM -0700, John Stultz wrote:
>
> So thinking this through further, I'm worried we may _not_ be able
> to eventually enable this to be a vdso as I had earlier hoped.
> Mostly because I'm not sure how the fd -> file -> clock lookup could
> be done in userland (any ideas?).

How about a new clock operation, clock_install_vdso(), that lets the
process arrange for one dynamic clock to be reflected in its vdso
page?

Thanks,
Richard

2013-06-26 16:49:57

by David Ahern

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

With all the perf ioctl extensions tossed out the past day or so I
wanted to revive this request. Still need a solution to the problem of
correlating perf_clock to other clocks ...

On 2/1/13 7:18 AM, Pawel Moll wrote:
> Hello,
>
> I'd like to revive the topic...
>
> On Tue, 2012-10-16 at 18:23 +0100, Peter Zijlstra wrote:
>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>> Hi,
>>>
>>> There are many situations where we want to correlate events happening at
>>> the user level with samples recorded in the perf_event kernel sampling buffer.
>>> For instance, we might want to correlate the call to a function or creation of
>>> a file with samples. Similarly, when we want to monitor a JVM with jitted code,
>>> we need to be able to correlate jitted code mappings with perf event samples
>>> for symbolization.
>>>
>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>>
>>> To make correlating user vs. kernel samples easy, we would need to
>>> access that sched_clock() functionality. However, none of the existing
>>> clock calls permit this at this point. They all return timestamps which are
>>> not using the same source and/or offset as sched_clock.
>>>
>>> I believe a similar issue exists with the ftrace subsystem.
>>>
>>> The problem needs to be addressed in a portable manner. Solutions
>>> based on reading TSC for the user level to reconstruct sched_clock()
>>> don't seem appropriate to me.
>>>
>>> One possibility to address this limitation would be to extend clock_gettime()
>>> with a new clock time, e.g., CLOCK_PERF.
>>>
>>> However, I understand that sched_clock_cpu() provides ordering guarantees only
>>> when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
>>> But we already have to deal with this problem when merging samples obtained
>>> from different CPU sampling buffers in per-thread mode. So this is not
>>> necessarily a showstopper.
>>>
>>> Alternatives could be to use uprobes but that's less practical to set up.
>>>
>>> Anyone with better ideas?
>>
>> You forgot to CC the time people ;-)
>>
>> I've no problem with adding CLOCK_PERF (or another/better name).
>>
>> Thomas, John?
>
> I've just faced the same issue - correlating an event in userspace with
> data from the perf stream, but to my mind what I want to get is a value
> returned by perf_clock() _in the current "session" context_.
>
> Stephane didn't like the idea of opening a "fake" perf descriptor in
> order to get the timestamp, but surely one must have the "session"
> already running to be interested in such data in the first place? So I
> think the ioctl() idea is not out of place here... How about the simple
> change below?
>
> Regards
>
> Pawel
>
> 8<---
> From 2ad51a27fbf64bf98cee190efc3fbd7002819692 Mon Sep 17 00:00:00 2001
> From: Pawel Moll <[email protected]>
> Date: Fri, 1 Feb 2013 14:03:56 +0000
> Subject: [PATCH] perf: Add ioctl to return current time value
>
> To correlate user space events with the perf events stream
> a current (as in: "what time(stamp) is it now?") time value
> must be made available.
>
> This patch adds a perf ioctl that makes this possible.
>
> Signed-off-by: Pawel Moll <[email protected]>
> ---
> include/uapi/linux/perf_event.h | 1 +
> kernel/events/core.c | 8 ++++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 4f63c05..b745fb0 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -316,6 +316,7 @@ struct perf_event_attr {
> #define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
> #define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
> #define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
> +#define PERF_EVENT_IOC_GET_TIME _IOR('$', 7, __u64)
>
> enum perf_event_ioc_flags {
> PERF_IOC_FLAG_GROUP = 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 301079d..4202b1c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3298,6 +3298,14 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> case PERF_EVENT_IOC_SET_FILTER:
> return perf_event_set_filter(event, (void __user *)arg);
>
> + case PERF_EVENT_IOC_GET_TIME:
> + {
> + u64 time = perf_clock();
> + if (copy_to_user((void __user *)arg, &time, sizeof(time)))
> + return -EFAULT;
> + return 0;
> + }
> +
> default:
> return -ENOTTY;
> }
>

2013-07-15 10:45:05

by Pawel Moll

[permalink] [raw]
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

On Wed, 2013-06-26 at 17:49 +0100, David Ahern wrote:
> With all the perf ioctl extensions tossed out the past day or so I
> wanted to revive this request. Still need a solution to the problem of
> correlating perf_clock to other clocks ...

And I second. We've been trying to squeeze the solution into the posix
clock framework (and vdso) but it didn't get anywhere, really. I've
spoken to John last week and although there is one more potential
"solution" (and the quotes are meaningful ;-), it seems that the
perf-specific ioctl would work for all interested individuals for now.

Paweł