2009-03-30 17:11:51

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 13/15] perf_counter: provide generic callchain bits

Provide the generic callchain support bits. If hw_event->callchain is
set the arch specific perf_callchain() function is called upon to
provide a perf_callchain_entry structure filled with the current
callchain.

If it does so, it is added to the overflow output event.

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/perf_counter.h | 13 ++++++++++++-
kernel/perf_counter.c | 27 +++++++++++++++++++++++++++
2 files changed, 39 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -140,8 +140,9 @@ struct perf_counter_hw_event {
include_tid : 1, /* include the tid */
mmap : 1, /* include mmap data */
munmap : 1, /* include munmap data */
+ callchain : 1, /* add callchain data */

- __reserved_1 : 52;
+ __reserved_1 : 51;

__u32 extra_config_len;
__u32 __reserved_4;
@@ -219,6 +220,7 @@ enum perf_event_type {
PERF_EVENT_OVERFLOW = 1UL << 31,
__PERF_EVENT_IP = 1UL << 30,
__PERF_EVENT_TID = 1UL << 29,
+ __PERF_EVENT_CALLCHAIN = 1UL << 28,
};

#ifdef __KERNEL__
@@ -504,6 +506,15 @@ extern void perf_counter_mmap(unsigned l
extern void perf_counter_munmap(unsigned long addr, unsigned long len,
unsigned long pgoff, struct file *file);

+#define MAX_STACK_DEPTH 255
+
+struct perf_callchain_entry {
+ u64 nr;
+ u64 ip[MAX_STACK_DEPTH];
+};
+
+extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);
+
#else
static inline void
perf_counter_task_sched_in(struct task_struct *task, int cpu) { }
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1654,6 +1654,17 @@ void perf_counter_do_pending(void)
}

/*
+ * Callchain support -- arch specific
+ */
+
+struct perf_callchain_entry *
+__attribute__((weak))
+perf_callchain(struct pt_regs *regs)
+{
+ return NULL;
+}
+
+/*
* Output
*/

@@ -1764,6 +1775,8 @@ static void perf_output_simple(struct pe
struct {
u32 pid, tid;
} tid_entry;
+ struct perf_callchain_entry *callchain = NULL;
+ int callchain_size = 0;

header.type = PERF_EVENT_OVERFLOW;
header.size = sizeof(header);
@@ -1781,6 +1794,17 @@ static void perf_output_simple(struct pe
header.size += sizeof(tid_entry);
}

+ if (counter->hw_event.callchain) {
+ callchain = perf_callchain(regs);
+
+ if (callchain) {
+ callchain_size = (1 + callchain->nr) * sizeof(u64);
+
+ header.type |= __PERF_EVENT_CALLCHAIN;
+ header.size += callchain_size;
+ }
+ }
+
ret = perf_output_begin(&handle, counter, header.size, nmi);
if (ret)
return;
@@ -1791,6 +1815,9 @@ static void perf_output_simple(struct pe
if (counter->hw_event.include_tid)
perf_output_put(&handle, tid_entry);

+ if (callchain)
+ perf_output_copy(&handle, callchain, callchain_size);
+
perf_output_end(&handle);
}


--


2009-03-31 06:13:20

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Peter Zijlstra writes:

> include_tid : 1, /* include the tid */
> mmap : 1, /* include mmap data */
> munmap : 1, /* include munmap data */
> + callchain : 1, /* add callchain data */

Interesting, I would have put callchain (and include_tid, also) in
hw_event.record_type rather than as individual 1-bit fields. The
present arrangement where some selection of what goes into the ring
buffer is in record_type and some is in individual bits seems a bit
awkward. Plus, with the current arrangement I can't get both the IP
and the values of the other group members, which I might reasonable
want.

I think either we make record_type bit-significant, or we define
individual bits in hw_event for recording the IP and other group
members.

There are a couple of other things I want to be able to record on an
event - we have registers on powerpc that give information about the
event that caused the interrupt, and it would be nice to be able to
record them. (These registers include instruction and data addresses
associated with the event; the instruction address can be further on
from where the interrupt was taken because of out-of-order instruction
execution and because interrupts might be hard-disabled at the point
where the interrupt becomes pending.)

Those registers would need bits in record_type or in the hw_event to
indicate that we want them recorded.

Paul.

2009-03-31 06:38:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

On Tue, 2009-03-31 at 17:12 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > include_tid : 1, /* include the tid */
> > mmap : 1, /* include mmap data */
> > munmap : 1, /* include munmap data */
> > + callchain : 1, /* add callchain data */
>
> Interesting, I would have put callchain (and include_tid, also) in
> hw_event.record_type rather than as individual 1-bit fields. The
> present arrangement where some selection of what goes into the ring
> buffer is in record_type and some is in individual bits seems a bit
> awkward. Plus, with the current arrangement I can't get both the IP
> and the values of the other group members, which I might reasonable
> want.
>
> I think either we make record_type bit-significant, or we define
> individual bits in hw_event for recording the IP and other group
> members.
>
> There are a couple of other things I want to be able to record on an
> event - we have registers on powerpc that give information about the
> event that caused the interrupt, and it would be nice to be able to
> record them. (These registers include instruction and data addresses
> associated with the event; the instruction address can be further on
> from where the interrupt was taken because of out-of-order instruction
> execution and because interrupts might be hard-disabled at the point
> where the interrupt becomes pending.)
>
> Those registers would need bits in record_type or in the hw_event to
> indicate that we want them recorded.

Sure, I'm fine with moving them to record_type and making that a bit
field. I've also considered merging the group and 'simple' output to
enable what you say.


2009-03-31 07:25:23

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Peter Zijlstra wrote:
> On Tue, 2009-03-31 at 17:12 +1100, Paul Mackerras wrote:
>> Peter Zijlstra writes:
>>
>>> include_tid : 1, /* include the tid */
>>> mmap : 1, /* include mmap data */
>>> munmap : 1, /* include munmap data */
>>> + callchain : 1, /* add callchain data */
>> Interesting, I would have put callchain (and include_tid, also) in
>> hw_event.record_type rather than as individual 1-bit fields. The
>> present arrangement where some selection of what goes into the ring
>> buffer is in record_type and some is in individual bits seems a bit
>> awkward. Plus, with the current arrangement I can't get both the IP
>> and the values of the other group members, which I might reasonable
>> want.
>>
>> I think either we make record_type bit-significant, or we define
>> individual bits in hw_event for recording the IP and other group
>> members.
>>
>> There are a couple of other things I want to be able to record on an
>> event - we have registers on powerpc that give information about the
>> event that caused the interrupt, and it would be nice to be able to
>> record them. (These registers include instruction and data addresses
>> associated with the event; the instruction address can be further on
>> from where the interrupt was taken because of out-of-order instruction
>> execution and because interrupts might be hard-disabled at the point
>> where the interrupt becomes pending.)
>>
>> Those registers would need bits in record_type or in the hw_event to
>> indicate that we want them recorded.
>
> Sure, I'm fine with moving them to record_type and making that a bit
> field. I've also considered merging the group and 'simple' output to
> enable what you say.

Originally, the record_type field was used to determine what would be
read if you read from the fd. However, now it seems to be what you get
from the mmap buffer (only), that there is no longer a way to read the
record stream from the fd anymore. Is this true? If so, I like this
idea because it simplifies the ABI: there is one way to read the current
value of counters and another way to read sample records (via mmap),
both of these operations use a single fd.

If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE
now? and PERF_COUNTER_GROUP? One simplification would be that reading
the fd of a group leader would always read up all of the counters in the
group (along with their enabled and running times if those bits are
set), and that reading a single counter's fd would yield only that
counter's values and times (if enabled). In effect, these two values,
PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary
at all. Other bits would be used to determine what's in the mmap'd samples.

- Corey

2009-03-31 08:43:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

On Tue, 2009-03-31 at 00:24 -0700, Corey Ashford wrote:
> Peter Zijlstra wrote:
> > On Tue, 2009-03-31 at 17:12 +1100, Paul Mackerras wrote:
> >> Peter Zijlstra writes:
> >>
> >>> include_tid : 1, /* include the tid */
> >>> mmap : 1, /* include mmap data */
> >>> munmap : 1, /* include munmap data */
> >>> + callchain : 1, /* add callchain data */
> >> Interesting, I would have put callchain (and include_tid, also) in
> >> hw_event.record_type rather than as individual 1-bit fields. The
> >> present arrangement where some selection of what goes into the ring
> >> buffer is in record_type and some is in individual bits seems a bit
> >> awkward. Plus, with the current arrangement I can't get both the IP
> >> and the values of the other group members, which I might reasonable
> >> want.
> >>
> >> I think either we make record_type bit-significant, or we define
> >> individual bits in hw_event for recording the IP and other group
> >> members.
> >>
> >> There are a couple of other things I want to be able to record on an
> >> event - we have registers on powerpc that give information about the
> >> event that caused the interrupt, and it would be nice to be able to
> >> record them. (These registers include instruction and data addresses
> >> associated with the event; the instruction address can be further on
> >> from where the interrupt was taken because of out-of-order instruction
> >> execution and because interrupts might be hard-disabled at the point
> >> where the interrupt becomes pending.)
> >>
> >> Those registers would need bits in record_type or in the hw_event to
> >> indicate that we want them recorded.
> >
> > Sure, I'm fine with moving them to record_type and making that a bit
> > field. I've also considered merging the group and 'simple' output to
> > enable what you say.
>
> Originally, the record_type field was used to determine what would be
> read if you read from the fd. However, now it seems to be what you get
> from the mmap buffer (only), that there is no longer a way to read the
> record stream from the fd anymore. Is this true?

Yes, read() will return the current value of the counter.
mmap() is used for all async data streams.

> If so, I like this
> idea because it simplifies the ABI: there is one way to read the current
> value of counters and another way to read sample records (via mmap),
> both of these operations use a single fd.

Correct, glad you like it ;-)

> If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE
> now? and PERF_COUNTER_GROUP?

Legacy mainly. I've been working towards unifying those. As Paul also
suggested, group can be yet another output option.

> One simplification would be that reading
> the fd of a group leader would always read up all of the counters in the
> group (along with their enabled and running times if those bits are
> set), and that reading a single counter's fd would yield only that
> counter's values and times (if enabled). In effect, these two values,
> PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary
> at all. Other bits would be used to determine what's in the mmap'd samples.

Quite so, that sounds sensible, Paul?

Would you be overly bothered by the read() output also having that
{type,size} header we use for the mmap() data?

2009-03-31 09:10:02

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Peter Zijlstra writes:

> > One simplification would be that reading
> > the fd of a group leader would always read up all of the counters in the
> > group (along with their enabled and running times if those bits are
> > set), and that reading a single counter's fd would yield only that
> > counter's values and times (if enabled). In effect, these two values,
> > PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary
> > at all. Other bits would be used to determine what's in the mmap'd samples.
>
> Quite so, that sounds sensible, Paul?

I think the readout of the other group members (for a group leader)
should be enabled by a bit in hw_event.read_format, say
PERF_FORMAT_GROUP.

There is a slight complication - we would want to read all the
counters in the group atomically, or as close to atomically as we
could get, and we don't have any way to do that at the moment.

> Would you be overly bothered by the read() output also having that
> {type,size} header we use for the mmap() data?

How about a PERF_FORMAT_HEADERS bit in hw_event.read_format to say that
we want to get the headers in the read() output?

With those two extra bits we can fulfill these desires without adding
overhead for programs that don't want the extra stuff, and without
breaking compatibility.

Paul.

2009-03-31 09:13:06

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Corey Ashford writes:

> If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE
> now? and PERF_COUNTER_GROUP? One simplification would be that reading
> the fd of a group leader would always read up all of the counters in the
> group (along with their enabled and running times if those bits are
> set), and that reading a single counter's fd would yield only that
> counter's values and times (if enabled). In effect, these two values,
> PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary
> at all. Other bits would be used to determine what's in the mmap'd samples.

Now that events are only read through mmap, it's quite simple -
hw_event.read_format controls what read() gives you, and
hw_event.record_type controls what gets put into the pages that you
get with mmap.

Currently read_format and record_type don't use the same set of bit
definitions. Maybe they should? I don't have a strong feeling about
it, but that might be a nice simplification.

Paul.

2009-03-31 14:00:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

On Tue, 2009-03-31 at 20:12 +1100, Paul Mackerras wrote:
> Corey Ashford writes:
>
> > If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE
> > now? and PERF_COUNTER_GROUP? One simplification would be that reading
> > the fd of a group leader would always read up all of the counters in the
> > group (along with their enabled and running times if those bits are
> > set), and that reading a single counter's fd would yield only that
> > counter's values and times (if enabled). In effect, these two values,
> > PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary
> > at all. Other bits would be used to determine what's in the mmap'd samples.
>
> Now that events are only read through mmap, it's quite simple -
> hw_event.read_format controls what read() gives you, and
> hw_event.record_type controls what gets put into the pages that you
> get with mmap.
>
> Currently read_format and record_type don't use the same set of bit
> definitions. Maybe they should? I don't have a strong feeling about
> it, but that might be a nice simplification.

Something like the below?

That still has the record and read things separate, but as one unified
overflow output.

I reduced record and read fields to 32bits, as the output type only has
32bits.

diff did a wonderful job making the patch unreadable :/

---
Subject:
From: Peter Zijlstra <[email protected]>
Date: Tue Mar 31 15:13:36 CEST 2009

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/perf_counter.h | 37 +++++++---------
kernel/perf_counter.c | 99 ++++++++++++++++---------------------------
2 files changed, 57 insertions(+), 79 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -73,15 +73,6 @@ enum sw_event_ids {
PERF_SW_EVENTS_MAX = 7,
};

-/*
- * IRQ-notification data record type:
- */
-enum perf_counter_record_type {
- PERF_RECORD_SIMPLE = 0,
- PERF_RECORD_IRQ = 1,
- PERF_RECORD_GROUP = 2,
-};
-
#define __PERF_COUNTER_MASK(name) \
(((1ULL << PERF_COUNTER_##name##_BITS) - 1) << \
PERF_COUNTER_##name##_SHIFT)
@@ -103,6 +94,17 @@ enum perf_counter_record_type {
#define PERF_COUNTER_EVENT_MASK __PERF_COUNTER_MASK(EVENT)

/*
+ * Bits that can be set in hw_event.record_type to request information
+ * in the overflow packets.
+ */
+enum perf_counter_record_format {
+ PERF_RECORD_IP = 1U << 0,
+ PERF_RECORD_TID = 1U << 1,
+ PERF_RECORD_GROUP = 1U << 2,
+ PERF_RECORD_CALLCHAIN = 1U << 3,
+};
+
+/*
* Bits that can be set in hw_event.read_format to request that
* reads on the counter should return the indicated quantities,
* in increasing order of bit value, after the counter value.
@@ -125,8 +127,8 @@ struct perf_counter_hw_event {
__u64 config;

__u64 irq_period;
- __u64 record_type;
- __u64 read_format;
+ __u32 record_type;
+ __u32 read_format;

__u64 disabled : 1, /* off by default */
nmi : 1, /* NMI sampling */
@@ -137,12 +139,10 @@ struct perf_counter_hw_event {
exclude_kernel : 1, /* ditto kernel */
exclude_hv : 1, /* ditto hypervisor */
exclude_idle : 1, /* don't count when idle */
- include_tid : 1, /* include the tid */
mmap : 1, /* include mmap data */
munmap : 1, /* include munmap data */
- callchain : 1, /* add callchain data */

- __reserved_1 : 51;
+ __reserved_1 : 53;

__u32 extra_config_len;
__u32 __reserved_4;
@@ -212,15 +212,14 @@ struct perf_event_header {

enum perf_event_type {

- PERF_EVENT_GROUP = 1,
-
- PERF_EVENT_MMAP = 2,
- PERF_EVENT_MUNMAP = 3,
+ PERF_EVENT_MMAP = 1,
+ PERF_EVENT_MUNMAP = 2,

PERF_EVENT_OVERFLOW = 1UL << 31,
__PERF_EVENT_IP = 1UL << 30,
__PERF_EVENT_TID = 1UL << 29,
- __PERF_EVENT_CALLCHAIN = 1UL << 28,
+ __PERF_EVENT_GROUP = 1UL << 28,
+ __PERF_EVENT_CALLCHAIN = 1UL << 27,
};

#ifdef __KERNEL__
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1765,27 +1765,34 @@ static void perf_output_end(struct perf_
rcu_read_unlock();
}

-static void perf_output_simple(struct perf_counter *counter,
- int nmi, struct pt_regs *regs)
+void perf_counter_output(struct perf_counter *counter,
+ int nmi, struct pt_regs *regs)
{
int ret;
+ u64 record_type = counter->hw_event.record_type;
struct perf_output_handle handle;
struct perf_event_header header;
u64 ip;
struct {
u32 pid, tid;
} tid_entry;
+ struct {
+ u64 event;
+ u64 counter;
+ } group_entry;
struct perf_callchain_entry *callchain = NULL;
int callchain_size = 0;

header.type = PERF_EVENT_OVERFLOW;
header.size = sizeof(header);

- ip = instruction_pointer(regs);
- header.type |= __PERF_EVENT_IP;
- header.size += sizeof(ip);
+ if (record_type & PERF_RECORD_IP) {
+ ip = instruction_pointer(regs);
+ header.type |= __PERF_EVENT_IP;
+ header.size += sizeof(ip);
+ }

- if (counter->hw_event.include_tid) {
+ if (record_type & PERF_RECORD_TID) {
/* namespace issues */
tid_entry.pid = current->group_leader->pid;
tid_entry.tid = current->pid;
@@ -1794,7 +1801,13 @@ static void perf_output_simple(struct pe
header.size += sizeof(tid_entry);
}

- if (counter->hw_event.callchain) {
+ if (record_type & PERF_RECORD_GROUP) {
+ header.type |= __PERF_EVENT_GROUP;
+ header.size += sizeof(u64) +
+ counter->nr_siblings * sizeof(group_entry);
+ }
+
+ if (record_type & PERF_RECORD_CALLCHAIN) {
callchain = perf_callchain(regs);

if (callchain) {
@@ -1810,69 +1823,35 @@ static void perf_output_simple(struct pe
return;

perf_output_put(&handle, header);
- perf_output_put(&handle, ip);

- if (counter->hw_event.include_tid)
- perf_output_put(&handle, tid_entry);
+ if (record_type & PERF_RECORD_IP)
+ perf_output_put(&handle, ip);

- if (callchain)
- perf_output_copy(&handle, callchain, callchain_size);
-
- perf_output_end(&handle);
-}
-
-static void perf_output_group(struct perf_counter *counter, int nmi)
-{
- struct perf_output_handle handle;
- struct perf_event_header header;
- struct perf_counter *leader, *sub;
- unsigned int size;
- struct {
- u64 event;
- u64 counter;
- } entry;
- int ret;
-
- size = sizeof(header) + counter->nr_siblings * sizeof(entry);
+ if (record_type & PERF_RECORD_TID)
+ perf_output_put(&handle, tid_entry);

- ret = perf_output_begin(&handle, counter, size, nmi);
- if (ret)
- return;
+ if (record_type & PERF_RECORD_GROUP) {
+ struct perf_counter *leader, *sub;
+ u64 nr = counter->nr_siblings;

- header.type = PERF_EVENT_GROUP;
- header.size = size;
+ perf_output_put(&handle, nr);

- perf_output_put(&handle, header);
+ leader = counter->group_leader;
+ list_for_each_entry(sub, &leader->sibling_list, list_entry) {
+ if (sub != counter)
+ sub->hw_ops->read(sub);

- leader = counter->group_leader;
- list_for_each_entry(sub, &leader->sibling_list, list_entry) {
- if (sub != counter)
- sub->hw_ops->read(sub);
+ group_entry.event = sub->hw_event.config;
+ group_entry.counter = atomic64_read(&sub->count);

- entry.event = sub->hw_event.config;
- entry.counter = atomic64_read(&sub->count);
-
- perf_output_put(&handle, entry);
+ perf_output_put(&handle, group_entry);
+ }
}

- perf_output_end(&handle);
-}
-
-void perf_counter_output(struct perf_counter *counter,
- int nmi, struct pt_regs *regs)
-{
- switch (counter->hw_event.record_type) {
- case PERF_RECORD_SIMPLE:
- return;
-
- case PERF_RECORD_IRQ:
- perf_output_simple(counter, nmi, regs);
- break;
+ if (callchain)
+ perf_output_copy(&handle, callchain, callchain_size);

- case PERF_RECORD_GROUP:
- perf_output_group(counter, nmi);
- break;
- }
+ perf_output_end(&handle);
}

/*

2009-03-31 17:12:21

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Peter Zijlstra wrote:
> On Tue, 2009-03-31 at 20:12 +1100, Paul Mackerras wrote:
>> Corey Ashford writes:
>>
>>> If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE
>>> now? and PERF_COUNTER_GROUP? One simplification would be that reading
>>> the fd of a group leader would always read up all of the counters in the
>>> group (along with their enabled and running times if those bits are
>>> set), and that reading a single counter's fd would yield only that
>>> counter's values and times (if enabled). In effect, these two values,
>>> PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary
>>> at all. Other bits would be used to determine what's in the mmap'd samples.
>> Now that events are only read through mmap, it's quite simple -
>> hw_event.read_format controls what read() gives you, and
>> hw_event.record_type controls what gets put into the pages that you
>> get with mmap.
>>
>> Currently read_format and record_type don't use the same set of bit
>> definitions. Maybe they should? I don't have a strong feeling about
>> it, but that might be a nice simplification.
>
> Something like the below?
>
> That still has the record and read things separate, but as one unified
> overflow output.
>
> I reduced record and read fields to 32bits, as the output type only has
> 32bits.
>
> diff did a wonderful job making the patch unreadable :/
>

This looks great! I like the idea of two bit vectors to describe what
you get when you read and the format of the mmap'd data. And I like the
ability not to include the header on each record if the user space
program knows the exact format of the sample records.

Paul is right, too, that having a read_format bit to determine whether
reading the leader gives all of the counters or just the leader is a
nice addition. I can't think of when I'd use it, but it's a good
capability.

- Corey

2009-04-01 03:49:14

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Peter Zijlstra writes:

> That still has the record and read things separate, but as one unified
> overflow output.

I take it PERF_EVENT_OVERFLOW refers to counter overflow, not ring
buffer overflow? That had me confused for a bit, so more explicit
naming, or at least some comments, would be good.

> /*
> + * Bits that can be set in hw_event.record_type to request information
> + * in the overflow packets.
> + */
> +enum perf_counter_record_format {
> + PERF_RECORD_IP = 1U << 0,
> + PERF_RECORD_TID = 1U << 1,
> + PERF_RECORD_GROUP = 1U << 2,
> + PERF_RECORD_CALLCHAIN = 1U << 3,
> +};
[snip]
> enum perf_event_type {
>
> - PERF_EVENT_GROUP = 1,
> -
> - PERF_EVENT_MMAP = 2,
> - PERF_EVENT_MUNMAP = 3,
> + PERF_EVENT_MMAP = 1,
> + PERF_EVENT_MUNMAP = 2,
>
> PERF_EVENT_OVERFLOW = 1UL << 31,
> __PERF_EVENT_IP = 1UL << 30,
> __PERF_EVENT_TID = 1UL << 29,
> - __PERF_EVENT_CALLCHAIN = 1UL << 28,
> + __PERF_EVENT_GROUP = 1UL << 28,
> + __PERF_EVENT_CALLCHAIN = 1UL << 27,

Could we use the same value (and even the same name) for
PERF_RECORD_IP/__PERF_EVENT_IP, PERF_RECORD_TID/__PERF_EVENT_TID,
etc.?

Also, I haven't looked at the callchain stuff much, but does the
callchain info contain a recognizable end delimiter? At present the
callchain comes last, but if we add more output elements they'll
presumably go after it, so working out where the callchain ends may
become tricky if we're not careful.

Also, let's add PERF_RECORD/PERF_EVENT bits for:

* EVENT_INSTR_ADDR
* EVENT_DATA_ADDR
* EVENT_INSTR_FLAGS
* EVENT_CPU_FLAGS (so we can distinguish hypervisor/kernel/user)

We would have to call into arch code to get the values for these.

Paul.

2009-04-01 07:58:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

On Wed, 2009-04-01 at 14:48 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > That still has the record and read things separate, but as one unified
> > overflow output.
>
> I take it PERF_EVENT_OVERFLOW refers to counter overflow, not ring
> buffer overflow? That had me confused for a bit, so more explicit
> naming, or at least some comments, would be good.

Ah, yes, I see how that can confuse. PERF_EVENT_COUNTER_OVERFLOW then?

I was thinking about doing splice() support and that could also generate
actual event overflow events ;-)

> > /*
> > + * Bits that can be set in hw_event.record_type to request information
> > + * in the overflow packets.
> > + */
> > +enum perf_counter_record_format {
> > + PERF_RECORD_IP = 1U << 0,
> > + PERF_RECORD_TID = 1U << 1,
> > + PERF_RECORD_GROUP = 1U << 2,
> > + PERF_RECORD_CALLCHAIN = 1U << 3,
> > +};
> [snip]
> > enum perf_event_type {
> >
> > - PERF_EVENT_GROUP = 1,
> > -
> > - PERF_EVENT_MMAP = 2,
> > - PERF_EVENT_MUNMAP = 3,
> > + PERF_EVENT_MMAP = 1,
> > + PERF_EVENT_MUNMAP = 2,
> >
> > PERF_EVENT_OVERFLOW = 1UL << 31,
> > __PERF_EVENT_IP = 1UL << 30,
> > __PERF_EVENT_TID = 1UL << 29,
> > - __PERF_EVENT_CALLCHAIN = 1UL << 28,
> > + __PERF_EVENT_GROUP = 1UL << 28,
> > + __PERF_EVENT_CALLCHAIN = 1UL << 27,
>
> Could we use the same value (and even the same name) for
> PERF_RECORD_IP/__PERF_EVENT_IP, PERF_RECORD_TID/__PERF_EVENT_TID,
> etc.?

I suppose we could.

> Also, I haven't looked at the callchain stuff much, but does the
> callchain info contain a recognizable end delimiter? At present the
> callchain comes last, but if we add more output elements they'll
> presumably go after it, so working out where the callchain ends may
> become tricky if we're not careful.

It writes:

struct callchain_event {
u64 nr
u64 ips[nr]
};

So the first number tells how many ips are in the callchain, which
should be good enough to figure out where it ends.

> Also, let's add PERF_RECORD/PERF_EVENT bits for:
>
> * EVENT_INSTR_ADDR

I'm failing to come up with what this could be..

> * EVENT_DATA_ADDR

This would be the data address operated upon? Like what address caused
the fault/cache-miss, etc?

> * EVENT_INSTR_FLAGS

Again not quite sure what this would be.

> * EVENT_CPU_FLAGS (so we can distinguish hypervisor/kernel/user)

Currently we can based on address, an IP < 0 is kernel and > 0 is
userspace, but yeah, I see how this makes life easier.

> We would have to call into arch code to get the values for these.

I suppose all these things can be gleaned from pt_regs, so that should
be doable.

2009-04-01 09:32:44

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Peter Zijlstra writes:

> Ah, yes, I see how that can confuse. PERF_EVENT_COUNTER_OVERFLOW then?

Sounds reasonable.

> > Also, let's add PERF_RECORD/PERF_EVENT bits for:
> >
> > * EVENT_INSTR_ADDR
>
> I'm failing to come up with what this could be..

So, you have lots of instructions in flight in the processor, and one
of them causes an event that increments a counter and causes it to
overflow, so an interrupt request is generated. Even if the interrupt
is taken "immediately", it can still happen that the set of
instructions the processor decides to complete before taking the
interrupt includes some instructions after the instruction that caused
the counter to overflow, and of course if interrupts are (hard-)
disabled at the time of the overflow, the interrupt will happen
later. That means that the IP from the pt_regs is not generally a
reliable indication of which instruction made the counter overflow.

On POWER processors we have a register which gives us a much more
reliable indication of which instruction caused the counter overflow,
at least in those cases where the event can be attributed to a
specific instruction. This EVENT_INSTR_ADDR bit would ask for that
register to be sampled and recorded.

> > * EVENT_DATA_ADDR
>
> This would be the data address operated upon? Like what address caused
> the fault/cache-miss, etc?

That's right. POWER processors have a register that records that
where possible.

> > * EVENT_INSTR_FLAGS
>
> Again not quite sure what this would be.

POWER processors have a register that records information about the
instruction that caused the counter overflow, such as did it have a
data address associated with it, did it cause a dcache miss, etc.

> > * EVENT_CPU_FLAGS (so we can distinguish hypervisor/kernel/user)
>
> Currently we can based on address, an IP < 0 is kernel and > 0 is
> userspace, but yeah, I see how this makes life easier.

We can't distinguish hypervisor addresses that way, and on some
architectures (including x86_32 with a 4G/4G split) we can't
distinguish kernel/user just by the address. I was thinking the cpu
flags would also include things like interrupt enable state, FPU
enable state, etc.

> > We would have to call into arch code to get the values for these.
>
> I suppose all these things can be gleaned from pt_regs, so that should
> be doable.

Hmmm, 3 of the 4 would require SPR (= what x86 calls MSR) reads.
Maybe it's best to have a block of arch-specific bits that can be
defined per-arch and implemented in arch code.

Paul.

2009-04-01 10:00:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits


* Paul Mackerras <[email protected]> wrote:

> Peter Zijlstra writes:
>
> > Ah, yes, I see how that can confuse. PERF_EVENT_COUNTER_OVERFLOW then?
>
> Sounds reasonable.
>
> > > Also, let's add PERF_RECORD/PERF_EVENT bits for:
> > >
> > > * EVENT_INSTR_ADDR
> >
> > I'm failing to come up with what this could be..
>
> So, you have lots of instructions in flight in the processor, and
> one of them causes an event that increments a counter and causes
> it to overflow, so an interrupt request is generated. Even if the
> interrupt is taken "immediately", it can still happen that the set
> of instructions the processor decides to complete before taking
> the interrupt includes some instructions after the instruction
> that caused the counter to overflow, and of course if interrupts
> are (hard-) disabled at the time of the overflow, the interrupt
> will happen later. That means that the IP from the pt_regs is not
> generally a reliable indication of which instruction made the
> counter overflow.
>
> On POWER processors we have a register which gives us a much more
> reliable indication of which instruction caused the counter
> overflow, at least in those cases where the event can be
> attributed to a specific instruction. This EVENT_INSTR_ADDR bit
> would ask for that register to be sampled and recorded.

So it's a bit like PEBS and IBS on the x86, right?

In theory one could simply override the sampled ptregs->ip with this
more precise register value. The instruction where the IRQ hit is
probably meaningless, if more precise information is available. But
we can have both too i guess.

The data address extension definitely makes sense - it can be used
to for a profile view along the data symbol dimension, instead of
the usual function symbol dimension.

CPU flags makes sense too - irqs-off can help the annotation of
source code sections where the profiler sees that irqs were
disabled.

It seems here we gradually descend into arch-specific CPU state
technicalities and it's not immediately obvious where to draw the
line.

Call-chain and data address abstractions are clear. CPU flags is
less clear: we could perhaps split off the irq state and the
privilege level information - that is present on all CPUs.

The rest should probably be opaque and not generalized.

_Perhaps_, to stem the inevitable list of such small details, it
might make sense to have a record type with signal frame qualities -
which would include most of this info. That would mix well with the
planned feature of signal generation anyway, right?

I.e. we could extend the lowlevel sigcontext signal frame generation
code in arch/x86/kernel/signal.c (and its powerpc equivalent) to
generate a signal frame but output it into the mmap buffer, not into
the userspace stack - and we would not actually execute a signal in
that context.

[ of course, when the counter is configured to generate a signal
that is done too. The code would be dual purpose. ]

So user-space would get a fully signal frame compatible record - and
we'd not have to create a per arch ABI for this because we'd piggy
back to the signal frame format.

We could add SA_NOFPU support for fast-track integer-registers-only
frames, etc.

Hm?

Ingo

2009-04-01 10:19:44

by Peter Zijlstra

[permalink] [raw]
Subject: [tip:perfcounters/core] perf_counter: provide generic callchain bits

Commit-ID: 1254256f65ce3973462dc153e02b4f3a7e314e20
Gitweb: http://git.kernel.org/tip/1254256f65ce3973462dc153e02b4f3a7e314e20
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 30 Mar 2009 19:07:14 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 1 Apr 2009 11:33:38 +0200

perf_counter: provide generic callchain bits

Provide the generic callchain support bits. If hw_event->callchain is
set the arch specific perf_callchain() function is called upon to
provide a perf_callchain_entry structure filled with the current
callchain.

If it does so, it is added to the overflow output event.

Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Paul Mackerras <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
include/linux/perf_counter.h | 13 ++++++++++++-
kernel/perf_counter.c | 27 +++++++++++++++++++++++++++
2 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index edf5bfb..43083af 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -140,8 +140,9 @@ struct perf_counter_hw_event {
include_tid : 1, /* include the tid */
mmap : 1, /* include mmap data */
munmap : 1, /* include munmap data */
+ callchain : 1, /* add callchain data */

- __reserved_1 : 52;
+ __reserved_1 : 51;

__u32 extra_config_len;
__u32 __reserved_4;
@@ -219,6 +220,7 @@ enum perf_event_type {
PERF_EVENT_OVERFLOW = 1UL << 31,
__PERF_EVENT_IP = 1UL << 30,
__PERF_EVENT_TID = 1UL << 29,
+ __PERF_EVENT_CALLCHAIN = 1UL << 28,
};

#ifdef __KERNEL__
@@ -504,6 +506,15 @@ extern void perf_counter_mmap(unsigned long addr, unsigned long len,
extern void perf_counter_munmap(unsigned long addr, unsigned long len,
unsigned long pgoff, struct file *file);

+#define MAX_STACK_DEPTH 255
+
+struct perf_callchain_entry {
+ u64 nr;
+ u64 ip[MAX_STACK_DEPTH];
+};
+
+extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);
+
#else
static inline void
perf_counter_task_sched_in(struct task_struct *task, int cpu) { }
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index d93e9dd..860cdc2 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1654,6 +1654,17 @@ void perf_counter_do_pending(void)
}

/*
+ * Callchain support -- arch specific
+ */
+
+struct perf_callchain_entry *
+__attribute__((weak))
+perf_callchain(struct pt_regs *regs)
+{
+ return NULL;
+}
+
+/*
* Output
*/

@@ -1764,6 +1775,8 @@ static void perf_output_simple(struct perf_counter *counter,
struct {
u32 pid, tid;
} tid_entry;
+ struct perf_callchain_entry *callchain = NULL;
+ int callchain_size = 0;

header.type = PERF_EVENT_OVERFLOW;
header.size = sizeof(header);
@@ -1781,6 +1794,17 @@ static void perf_output_simple(struct perf_counter *counter,
header.size += sizeof(tid_entry);
}

+ if (counter->hw_event.callchain) {
+ callchain = perf_callchain(regs);
+
+ if (callchain) {
+ callchain_size = (1 + callchain->nr) * sizeof(u64);
+
+ header.type |= __PERF_EVENT_CALLCHAIN;
+ header.size += callchain_size;
+ }
+ }
+
ret = perf_output_begin(&handle, counter, header.size, nmi);
if (ret)
return;
@@ -1791,6 +1815,9 @@ static void perf_output_simple(struct perf_counter *counter,
if (counter->hw_event.include_tid)
perf_output_put(&handle, tid_entry);

+ if (callchain)
+ perf_output_copy(&handle, callchain, callchain_size);
+
perf_output_end(&handle);
}

2009-04-01 11:53:54

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Ingo Molnar writes:

> So it's a bit like PEBS and IBS on the x86, right?

Well, my hazy impression was that one or both of those wrote samples
into a ring buffer in hardware, which would be different...

> In theory one could simply override the sampled ptregs->ip with this
> more precise register value. The instruction where the IRQ hit is
> probably meaningless, if more precise information is available. But
> we can have both too i guess.
>
> The data address extension definitely makes sense - it can be used
> to for a profile view along the data symbol dimension, instead of
> the usual function symbol dimension.
>
> CPU flags makes sense too - irqs-off can help the annotation of
> source code sections where the profiler sees that irqs were
> disabled.
>
> It seems here we gradually descend into arch-specific CPU state
> technicalities and it's not immediately obvious where to draw the
> line.

I hoped that event instruction address, event data address and cpu
flags, at least, would be sufficiently abstract to consider having as
generic things, though of course how you get to them is
arch-specific. The main use of event flags is to know whether or not
the event instruction/data address values are valid. Instead of
recording the event flags we could just not put the event instr/data
address records in the ring buffer if the values aren't valid
(e.g. the event data address won't be valid if the instruction doesn't
access memory).

> Call-chain and data address abstractions are clear. CPU flags is
> less clear: we could perhaps split off the irq state and the
> privilege level information - that is present on all CPUs.

In a machine-independent format, you mean? That would be a good idea.

> The rest should probably be opaque and not generalized.

And can be provided on a per-arch basis using special raw event code
values.

> _Perhaps_, to stem the inevitable list of such small details, it
> might make sense to have a record type with signal frame qualities -
> which would include most of this info. That would mix well with the
> planned feature of signal generation anyway, right?

Hmmm, so record the entire architected state of the machine, and
effectively put an entire ucontext_t in the ring buffer? That's
certainly possible - what would we use it for?

> I.e. we could extend the lowlevel sigcontext signal frame generation
> code in arch/x86/kernel/signal.c (and its powerpc equivalent) to
> generate a signal frame but output it into the mmap buffer, not into
> the userspace stack - and we would not actually execute a signal in
> that context.
>
> [ of course, when the counter is configured to generate a signal
> that is done too. The code would be dual purpose. ]
>
> So user-space would get a fully signal frame compatible record - and
> we'd not have to create a per arch ABI for this because we'd piggy
> back to the signal frame format.
>
> We could add SA_NOFPU support for fast-track integer-registers-only
> frames, etc.
>
> Hm?

I think the motivation for having the stop-and-signal behaviour is not
so much to get at the entire register set, as to have a way to profile
application-dependent things, i.e. you'd record or histogram something
derived from memory contents at the time the signal was delivered.
Or you might want to do a stack trace but use unwind information to
get the parameters to each function (which on powerpc could now be in
any register or saved onto the stack, once you start going up the call
chain any distance), which is more than we want to do in the kernel.

Paul.

2009-04-01 23:30:33

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Peter Zijlstra wrote:
> On Wed, 2009-04-01 at 14:48 +1100, Paul Mackerras wrote:
>> Peter Zijlstra writes:
>>
>>> That still has the record and read things separate, but as one unified
>>> overflow output.
>> I take it PERF_EVENT_OVERFLOW refers to counter overflow, not ring
>> buffer overflow? That had me confused for a bit, so more explicit
>> naming, or at least some comments, would be good.
>
> Ah, yes, I see how that can confuse. PERF_EVENT_COUNTER_OVERFLOW then?
>
> I was thinking about doing splice() support and that could also generate
> actual event overflow events ;-)
>
>>> /*
>>> + * Bits that can be set in hw_event.record_type to request information
>>> + * in the overflow packets.
>>> + */
>>> +enum perf_counter_record_format {
>>> + PERF_RECORD_IP = 1U << 0,
>>> + PERF_RECORD_TID = 1U << 1,
>>> + PERF_RECORD_GROUP = 1U << 2,
>>> + PERF_RECORD_CALLCHAIN = 1U << 3,
>>> +};
>> [snip]
>>> enum perf_event_type {
>>>
>>> - PERF_EVENT_GROUP = 1,
>>> -
>>> - PERF_EVENT_MMAP = 2,
>>> - PERF_EVENT_MUNMAP = 3,
>>> + PERF_EVENT_MMAP = 1,
>>> + PERF_EVENT_MUNMAP = 2,
>>>
>>> PERF_EVENT_OVERFLOW = 1UL << 31,
>>> __PERF_EVENT_IP = 1UL << 30,
>>> __PERF_EVENT_TID = 1UL << 29,
>>> - __PERF_EVENT_CALLCHAIN = 1UL << 28,
>>> + __PERF_EVENT_GROUP = 1UL << 28,
>>> + __PERF_EVENT_CALLCHAIN = 1UL << 27,
>> Could we use the same value (and even the same name) for
>> PERF_RECORD_IP/__PERF_EVENT_IP, PERF_RECORD_TID/__PERF_EVENT_TID,
>> etc.?
>
> I suppose we could.
>
>> Also, I haven't looked at the callchain stuff much, but does the
>> callchain info contain a recognizable end delimiter? At present the
>> callchain comes last, but if we add more output elements they'll
>> presumably go after it, so working out where the callchain ends may
>> become tricky if we're not careful.
>
> It writes:
>
> struct callchain_event {
> u64 nr
> u64 ips[nr]
> };

Looking at the current version of perf_counter.h in the -tip tree, this
struct definition is not visible to userspace (it's in the #ifdef KERNEL
section).

- Corey

2009-04-02 06:43:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

On Wed, 2009-04-01 at 16:25 -0700, Corey Ashford wrote:

> > struct callchain_event {
> > u64 nr
> > u64 ips[nr]
> > };
>
> Looking at the current version of perf_counter.h in the -tip tree, this
> struct definition is not visible to userspace (it's in the #ifdef KERNEL
> section).

Correct. Its been on my todo list to write a meta-struct definition
along with the perf_event_type enum that details the data layout of the
various types, but not include any actual structure other than the
header in the userspace bits.

Also, related to a comment Paul made on callchain markers, I wanted to
change it so that we have ip markers in the chain that separate the
kernel and user bits.

I was thinking of ending the kernel trace with -1ULL and ending the user
chain with 0ULL. Presumably we need one more to end hypervisor chains in
case anybody manages to generate those ;-).

This chain however will not change the struct, it will merely add 2
extra entries.

2009-04-02 07:40:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

On Thu, 2009-04-02 at 08:43 +0200, Peter Zijlstra wrote:
> On Wed, 2009-04-01 at 16:25 -0700, Corey Ashford wrote:
>
> > > struct callchain_event {
> > > u64 nr
> > > u64 ips[nr]
> > > };
> >
> > Looking at the current version of perf_counter.h in the -tip tree, this
> > struct definition is not visible to userspace (it's in the #ifdef KERNEL
> > section).
>
> Correct. Its been on my todo list to write a meta-struct definition
> along with the perf_event_type enum that details the data layout of the
> various types, but not include any actual structure other than the
> header in the userspace bits.
>
> Also, related to a comment Paul made on callchain markers, I wanted to
> change it so that we have ip markers in the chain that separate the
> kernel and user bits.
>
> I was thinking of ending the kernel trace with -1ULL and ending the user
> chain with 0ULL. Presumably we need one more to end hypervisor chains in
> case anybody manages to generate those ;-).
>
> This chain however will not change the struct, it will merely add 2
> extra entries.

Hmm, another idea:

struct perf_callchain_entry {
__u32 total, hv, kernel, user;
__u64 ips[total];
};

2009-04-02 09:10:35

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

Peter Zijlstra writes:

> Hmm, another idea:
>
> struct perf_callchain_entry {
> __u32 total, hv, kernel, user;

That should work - and they could be u16's in fact.

> __u64 ips[total];
> };

Paul.

2009-04-02 09:16:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] perf_counter: provide generic callchain bits

On Thu, 2009-04-02 at 20:10 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > Hmm, another idea:
> >
> > struct perf_callchain_entry {
> > __u32 total, hv, kernel, user;
>
> That should work - and they could be u16's in fact.

Gah, I just send it out with u32. But yes, will change.