2009-04-03 01:46:51

by Corey Ashford

[permalink] [raw]
Subject: perf_counter: request for three more sample data options

Currently, perf_counter has the ability to record the following on event
counter overflow:

Instruction Pointer
Call chain
Group counter values
Thread id

To give perf_counter similar capabilities to perfmon2's default sampling
module, I'd like the following additional sample data to be added.

Time stamp
CPU number
Thread Group Id

I'd suggest the following

enum perf_counter_record_format {
PERF_RECORD_IP = 1U << 0,
PERF_RECORD_TID = 1U << 1,
PERF_RECORD_TGID = 1U << 2,
- PERF_RECORD_GROUP = 1U << 2,
+ PERF_RECORD_GROUP = 1U << 3,
- PERF_RECORD_CALLCHAIN = 1U << 3,
+ PERF_RECORD_CALLCHAIN = 1U << 4,
+ PERF_RECORD_CPU_ID = 1U << 5,
+ PERF_RECORD_TIMESTAMP = 1U << 6,
};

And of course the obvious changes to perf_event_type.

I would expect that CPU ID would be 32 bits, and the timestamp to be the
64-bit current time. TGID is the same size as TID.

I am guessing the only difficult thing here would be obtaining the
current time from an IRQ, especially NMI handler. Is this difficult?


--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]


2009-04-03 07:00:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_counter: request for three more sample data options

On Thu, 2009-04-02 at 18:46 -0700, Corey Ashford wrote:
> Currently, perf_counter has the ability to record the following on event
> counter overflow:
>
> Instruction Pointer
> Call chain
> Group counter values
> Thread id
>
> To give perf_counter similar capabilities to perfmon2's default sampling
> module, I'd like the following additional sample data to be added.
>
> Time stamp

Rather hard actually, to provide a decent timestamp from NMI context.

> CPU number

Could do I guess.

> Thread Group Id

As in the process id? PERF_RECORD_TID already provides that.

> I'd suggest the following
>
> enum perf_counter_record_format {
> PERF_RECORD_IP = 1U << 0,
> PERF_RECORD_TID = 1U << 1,
> PERF_RECORD_TGID = 1U << 2,
> - PERF_RECORD_GROUP = 1U << 2,
> + PERF_RECORD_GROUP = 1U << 3,
> - PERF_RECORD_CALLCHAIN = 1U << 3,
> + PERF_RECORD_CALLCHAIN = 1U << 4,
> + PERF_RECORD_CPU_ID = 1U << 5,
> + PERF_RECORD_TIMESTAMP = 1U << 6,
> };
>
> And of course the obvious changes to perf_event_type.
>
> I would expect that CPU ID would be 32 bits, and the timestamp to be the
> 64-bit current time. TGID is the same size as TID.

Right, so PREF_RECORD_TID provides:

{ u32 pid, tid; }

PERF_RECORD_TIMESTAMP would provide something like:

{ u64 time; }

and per our u64 alignment rule, PERF_RECORD_CPU would provide

{ u64 cpuid; }

unless you can think of anything else to stuff in there?

> I am guessing the only difficult thing here would be obtaining the
> current time from an IRQ, especially NMI handler. Is this difficult?

Yes, quite :-) I'll have to see what we can do there -- we could do a
best effort thing with little to no guarantees I think.

2009-04-03 07:25:50

by Corey Ashford

[permalink] [raw]
Subject: Re: perf_counter: request for three more sample data options

Thank you for your reply, Peter.

Peter Zijlstra wrote:
> On Thu, 2009-04-02 at 18:46 -0700, Corey Ashford wrote:
>> Currently, perf_counter has the ability to record the following on event
>> counter overflow:
>>
>> Instruction Pointer
>> Call chain
>> Group counter values
>> Thread id
>>
>> To give perf_counter similar capabilities to perfmon2's default sampling
>> module, I'd like the following additional sample data to be added.
>>
>> Time stamp
>
> Rather hard actually, to provide a decent timestamp from NMI context.
>
>> CPU number
>
> Could do I guess.
>
>> Thread Group Id
>
> As in the process id? PERF_RECORD_TID already provides that.
>
>> I'd suggest the following
>>
>> enum perf_counter_record_format {
>> PERF_RECORD_IP = 1U << 0,
>> PERF_RECORD_TID = 1U << 1,
>> PERF_RECORD_TGID = 1U << 2,
>> - PERF_RECORD_GROUP = 1U << 2,
>> + PERF_RECORD_GROUP = 1U << 3,
>> - PERF_RECORD_CALLCHAIN = 1U << 3,
>> + PERF_RECORD_CALLCHAIN = 1U << 4,
>> + PERF_RECORD_CPU_ID = 1U << 5,
>> + PERF_RECORD_TIMESTAMP = 1U << 6,
>> };
>>
>> And of course the obvious changes to perf_event_type.
>>
>> I would expect that CPU ID would be 32 bits, and the timestamp to be the
>> 64-bit current time. TGID is the same size as TID.
>
> Right, so PREF_RECORD_TID provides:
>
> { u32 pid, tid; }

Ah, I didn't know that. Ok, that's only two things I want then :)

>
> PERF_RECORD_TIMESTAMP would provide something like:
>
> { u64 time; }

Yep.

>
> and per our u64 alignment rule, PERF_RECORD_CPU would provide
>
> { u64 cpuid; }
>
> unless you can think of anything else to stuff in there?

We could leave the upper 32-bits reserved for now. Perhaps someone
later will come up with some nice info to put there.

>
>> I am guessing the only difficult thing here would be obtaining the
>> current time from an IRQ, especially NMI handler. Is this difficult?
>
> Yes, quite :-) I'll have to see what we can do there -- we could do a
> best effort thing with little to no guarantees I think.
>

Best effort would be fine, I think. I would assume that means that
99.9% of the time, you'll get a correct timestamp, and the rest are
rubbish? Or would there be a way to detect when you're not able to give
a correct timestamp and in that case replace the timestamp field with a
special sentinel, like all hex f's?

Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-04-03 07:50:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_counter: request for three more sample data options

On Fri, 2009-04-03 at 00:25 -0700, Corey Ashford wrote:

> >> I am guessing the only difficult thing here would be obtaining the
> >> current time from an IRQ, especially NMI handler. Is this difficult?
> >
> > Yes, quite :-) I'll have to see what we can do there -- we could do a
> > best effort thing with little to no guarantees I think.
> >
>
> Best effort would be fine, I think. I would assume that means that
> 99.9% of the time, you'll get a correct timestamp, and the rest are
> rubbish? Or would there be a way to detect when you're not able to give
> a correct timestamp and in that case replace the timestamp field with a
> special sentinel, like all hex f's?

What I was thinking of was re-using some of the cpu_clock()
infrastructure. That provides us with a jiffy based GTOD sample,
cpu_clock() then uses TSC and a few filters to compute a current
timestamp.

I was thinking about cutting back those filters and thus trusting the
TSC more -- which on x86 can do any random odd thing. So provided the
TSC is not doing funny the results will be ok-ish.

This does mean however, that its not possible to know when its gone bad.

Also, cpu_clock() can only provide monotonicity per-cpu, if a value read
on one cpu is compared to a value read on another cpu, there can be a
drift of at most 1-2 jiffies.

Anyway, I'll prod some at this and see how much of cpu_clock() we can
get working in NMI context -- currently it just bails and returns the
last value computed.

The question to Paul is, does the powerpc sched_clock() call work in NMI
-- or hard irq disable -- context?

2009-04-03 08:51:29

by Paul Mackerras

[permalink] [raw]
Subject: Re: perf_counter: request for three more sample data options

Peter Zijlstra writes:

> What I was thinking of was re-using some of the cpu_clock()
> infrastructure. That provides us with a jiffy based GTOD sample,
> cpu_clock() then uses TSC and a few filters to compute a current
> timestamp.
>
> I was thinking about cutting back those filters and thus trusting the
> TSC more -- which on x86 can do any random odd thing. So provided the
> TSC is not doing funny the results will be ok-ish.
>
> This does mean however, that its not possible to know when its gone bad.

I would expect that perfmon would be just reading the TSC and
recording that. If you can read the TSC and do some correction then
we're ahead. :)

> The question to Paul is, does the powerpc sched_clock() call work in NMI
> -- or hard irq disable -- context?

Yes - timekeeping is one area where us powerpc guys can be smug. :)
We have a per-core, 64-bit timebase register which counts at a
constant frequency and is synchronized across all cores. So
sched_clock works in any context on powerpc - all it does is read the
timebase and do some simple integer arithmetic on it.

Paul.

2009-04-03 16:33:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: perf_counter: request for three more sample data options


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2009-04-03 at 00:25 -0700, Corey Ashford wrote:
>
> > >> I am guessing the only difficult thing here would be obtaining the
> > >> current time from an IRQ, especially NMI handler. Is this difficult?
> > >
> > > Yes, quite :-) I'll have to see what we can do there -- we could do a
> > > best effort thing with little to no guarantees I think.
> > >
> >
> > Best effort would be fine, I think. I would assume that means
> > that 99.9% of the time, you'll get a correct timestamp, and the
> > rest are rubbish? Or would there be a way to detect when you're
> > not able to give a correct timestamp and in that case replace
> > the timestamp field with a special sentinel, like all hex f's?
>
> What I was thinking of was re-using some of the cpu_clock()
> infrastructure. That provides us with a jiffy based GTOD sample,
> cpu_clock() then uses TSC and a few filters to compute a current
> timestamp.
>
> I was thinking about cutting back those filters and thus trusting
> the TSC more -- which on x86 can do any random odd thing. So
> provided the TSC is not doing funny the results will be ok-ish.
>
> This does mean however, that its not possible to know when its
> gone bad.

Note that on latest mainline and on Nehalem CPUs that filter is
being cut back already. So there's an opt-in mechanism to trust
sched_clock() some more.

> Also, cpu_clock() can only provide monotonicity per-cpu, if a
> value read on one cpu is compared to a value read on another cpu,
> there can be a drift of at most 1-2 jiffies.

That should be a good start i think. If it causes any measurable
jitter then the performance monitoring community is probably going
to be the first one to notice! ;-) So there's good synergy IMO.

Ingo

Subject: Re: perf_counter: request for three more sample data options

On 03.04.09 19:51:11, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > What I was thinking of was re-using some of the cpu_clock()
> > infrastructure. That provides us with a jiffy based GTOD sample,
> > cpu_clock() then uses TSC and a few filters to compute a current
> > timestamp.
> >
> > I was thinking about cutting back those filters and thus trusting the
> > TSC more -- which on x86 can do any random odd thing. So provided the
> > TSC is not doing funny the results will be ok-ish.
> >
> > This does mean however, that its not possible to know when its gone bad.
>
> I would expect that perfmon would be just reading the TSC and
> recording that. If you can read the TSC and do some correction then
> we're ahead. :)
>
> > The question to Paul is, does the powerpc sched_clock() call work in NMI
> > -- or hard irq disable -- context?
>
> Yes - timekeeping is one area where us powerpc guys can be smug. :)
> We have a per-core, 64-bit timebase register which counts at a
> constant frequency and is synchronized across all cores. So
> sched_clock works in any context on powerpc - all it does is read the
> timebase and do some simple integer arithmetic on it.

Ftrace is using ring_buffer_time_stamp() that finally uses
sched_clock(). But I am not sure if the time is correct when calling
from an NMI handler.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2009-04-03 16:41:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: perf_counter: request for three more sample data options


* Robert Richter <[email protected]> wrote:

> On 03.04.09 19:51:11, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> >
> > > What I was thinking of was re-using some of the cpu_clock()
> > > infrastructure. That provides us with a jiffy based GTOD sample,
> > > cpu_clock() then uses TSC and a few filters to compute a current
> > > timestamp.
> > >
> > > I was thinking about cutting back those filters and thus trusting the
> > > TSC more -- which on x86 can do any random odd thing. So provided the
> > > TSC is not doing funny the results will be ok-ish.
> > >
> > > This does mean however, that its not possible to know when its gone bad.
> >
> > I would expect that perfmon would be just reading the TSC and
> > recording that. If you can read the TSC and do some correction then
> > we're ahead. :)
> >
> > > The question to Paul is, does the powerpc sched_clock() call work in NMI
> > > -- or hard irq disable -- context?
> >
> > Yes - timekeeping is one area where us powerpc guys can be smug.
> > :) We have a per-core, 64-bit timebase register which counts at
> > a constant frequency and is synchronized across all cores. So
> > sched_clock works in any context on powerpc - all it does is
> > read the timebase and do some simple integer arithmetic on it.
>
> Ftrace is using ring_buffer_time_stamp() that finally uses
> sched_clock(). But I am not sure if the time is correct when
> calling from an NMI handler.

Yeah, that's a bit icky. Right now we have the following
accelerator:

u64 sched_clock_cpu(int cpu)
{
u64 now, clock, this_clock, remote_clock;
struct sched_clock_data *scd;

if (sched_clock_stable)
return sched_clock();

which works rather well on CPUs that set sched_clock_stable. Do you
think we could set it on Barcelona?

in the non-stable case we chicken out:

/*
* Normally this is not called in NMI context - but if it is,
* trying to do any locking here is totally lethal.
*/
if (unlikely(in_nmi()))
return scd->clock;

as we'd have to take a spinlock which isnt safe from NMI context.

Ingo

2009-04-03 16:58:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_counter: request for three more sample data options

On Fri, 2009-04-03 at 18:41 +0200, Ingo Molnar wrote:
> * Robert Richter <[email protected]> wrote:
>
> > On 03.04.09 19:51:11, Paul Mackerras wrote:
> > > Peter Zijlstra writes:
> > >
> > > > What I was thinking of was re-using some of the cpu_clock()
> > > > infrastructure. That provides us with a jiffy based GTOD sample,
> > > > cpu_clock() then uses TSC and a few filters to compute a current
> > > > timestamp.
> > > >
> > > > I was thinking about cutting back those filters and thus trusting the
> > > > TSC more -- which on x86 can do any random odd thing. So provided the
> > > > TSC is not doing funny the results will be ok-ish.
> > > >
> > > > This does mean however, that its not possible to know when its gone bad.
> > >
> > > I would expect that perfmon would be just reading the TSC and
> > > recording that. If you can read the TSC and do some correction then
> > > we're ahead. :)
> > >
> > > > The question to Paul is, does the powerpc sched_clock() call work in NMI
> > > > -- or hard irq disable -- context?
> > >
> > > Yes - timekeeping is one area where us powerpc guys can be smug.
> > > :) We have a per-core, 64-bit timebase register which counts at
> > > a constant frequency and is synchronized across all cores. So
> > > sched_clock works in any context on powerpc - all it does is
> > > read the timebase and do some simple integer arithmetic on it.
> >
> > Ftrace is using ring_buffer_time_stamp() that finally uses
> > sched_clock(). But I am not sure if the time is correct when
> > calling from an NMI handler.
>
> Yeah, that's a bit icky. Right now we have the following
> accelerator:
>
> u64 sched_clock_cpu(int cpu)
> {
> u64 now, clock, this_clock, remote_clock;
> struct sched_clock_data *scd;
>
> if (sched_clock_stable)
> return sched_clock();
>
> which works rather well on CPUs that set sched_clock_stable. Do you
> think we could set it on Barcelona?

I think you should couple it to the tsc clocksource detection thingy. On
all systems the tsc is good enough to use as clocksource, we can
short-circuit.

> in the non-stable case we chicken out:
>
> /*
> * Normally this is not called in NMI context - but if it is,
> * trying to do any locking here is totally lethal.
> */
> if (unlikely(in_nmi()))
> return scd->clock;
>
> as we'd have to take a spinlock which isnt safe from NMI context.

Right, I've been looking at doing cpu_clock() differently, but since its
all 64-bit we'd either need to introduce atomic64 into the code, or redo
it in the perf counter code.

So for now I've stuck with a plain sched_clock() timestamp.

2009-04-03 17:06:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: perf_counter: request for three more sample data options


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2009-04-03 at 18:41 +0200, Ingo Molnar wrote:
> > * Robert Richter <[email protected]> wrote:
> >
> > > On 03.04.09 19:51:11, Paul Mackerras wrote:
> > > > Peter Zijlstra writes:
> > > >
> > > > > What I was thinking of was re-using some of the cpu_clock()
> > > > > infrastructure. That provides us with a jiffy based GTOD sample,
> > > > > cpu_clock() then uses TSC and a few filters to compute a current
> > > > > timestamp.
> > > > >
> > > > > I was thinking about cutting back those filters and thus trusting the
> > > > > TSC more -- which on x86 can do any random odd thing. So provided the
> > > > > TSC is not doing funny the results will be ok-ish.
> > > > >
> > > > > This does mean however, that its not possible to know when its gone bad.
> > > >
> > > > I would expect that perfmon would be just reading the TSC and
> > > > recording that. If you can read the TSC and do some correction then
> > > > we're ahead. :)
> > > >
> > > > > The question to Paul is, does the powerpc sched_clock() call work in NMI
> > > > > -- or hard irq disable -- context?
> > > >
> > > > Yes - timekeeping is one area where us powerpc guys can be smug.
> > > > :) We have a per-core, 64-bit timebase register which counts at
> > > > a constant frequency and is synchronized across all cores. So
> > > > sched_clock works in any context on powerpc - all it does is
> > > > read the timebase and do some simple integer arithmetic on it.
> > >
> > > Ftrace is using ring_buffer_time_stamp() that finally uses
> > > sched_clock(). But I am not sure if the time is correct when
> > > calling from an NMI handler.
> >
> > Yeah, that's a bit icky. Right now we have the following
> > accelerator:
> >
> > u64 sched_clock_cpu(int cpu)
> > {
> > u64 now, clock, this_clock, remote_clock;
> > struct sched_clock_data *scd;
> >
> > if (sched_clock_stable)
> > return sched_clock();
> >
> > which works rather well on CPUs that set sched_clock_stable. Do you
> > think we could set it on Barcelona?
>
> I think you should couple it to the tsc clocksource detection
> thingy. On all systems the tsc is good enough to use as
> clocksource, we can short-circuit.

No principal objections, if it works.

Ingo