2009-03-21 04:53:48

by Paul Mackerras

[permalink] [raw]
Subject: [PATCH] perfcounters: fix type/event_id layout on big-endian systems

Impact: build fix for powerpc

Commit db3a944aca35ae61 ("perf_counter: revamp syscall input ABI")
expanded the hw_event.type field into a union of structs containing
bitfields. In particular it introduced a type field and a raw_type
field, with the intention that the 1-bit raw_type field should
overlay the most-significant bit of the 8-bit type field, and in fact
perf_counter_alloc() now assumes that (or at least, assumes that
raw_type doesn't overlay any of the bits that are 1 in the values of
PERF_TYPE_{HARDWARE,SOFTWARE,TRACEPOINT}).

Unfortunately this is not true on big-endian systems such as PowerPC,
where bitfields are laid out from left to right, i.e. from most
significant bit to least significant. This means that setting
hw_event.type = PERF_TYPE_SOFTWARE will set hw_event.raw_type to 1.

This fixes it by making the layout depend on whether or not
__BIG_ENDIAN_BITFIELD is defined. It's a bit ugly, but that's what
we get for using bitfields in a user/kernel ABI.

Also, that commit didn't fix up some places in arch/powerpc/kernel/
perf_counter.c where hw_event.raw and hw_event.event_id were used.
This fixes them too.

Signed-off-by: Paul Mackerras <[email protected]>
---
This is in the master branch of my perfcounters.git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulus/perfcounters.git master

arch/powerpc/kernel/perf_counter.c | 9 +++++----
include/linux/perf_counter.h | 12 ++++++++++++
2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
index 830ca9c..6413d9c 100644
--- a/arch/powerpc/kernel/perf_counter.c
+++ b/arch/powerpc/kernel/perf_counter.c
@@ -602,12 +602,13 @@ hw_perf_counter_init(struct perf_counter *counter)
return NULL;
if ((s64)counter->hw_event.irq_period < 0)
return NULL;
- ev = counter->hw_event.event_id;
- if (!counter->hw_event.raw) {
- if (ev >= ppmu->n_generic ||
- ppmu->generic_events[ev] == 0)
+ if (!counter->hw_event.raw_type) {
+ ev = counter->hw_event.event_id;
+ if (ev >= ppmu->n_generic || ppmu->generic_events[ev] == 0)
return NULL;
ev = ppmu->generic_events[ev];
+ } else {
+ ev = counter->hw_event.raw_event_id;
}
counter->hw.config_base = ev;
counter->hw.idx = 0;
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index a4b76c0..98f5990 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -15,6 +15,7 @@

#include <linux/types.h>
#include <linux/ioctl.h>
+#include <asm/byteorder.h>

/*
* User-space ABI bits:
@@ -86,6 +87,7 @@ enum perf_counter_record_type {
*/
struct perf_counter_hw_event {
union {
+#ifndef __BIG_ENDIAN_BITFIELD
struct {
__u64 event_id : 56,
type : 8;
@@ -94,6 +96,16 @@ struct perf_counter_hw_event {
__u64 raw_event_id : 63,
raw_type : 1;
};
+#else
+ struct {
+ __u64 type : 8,
+ event_id : 56;
+ };
+ struct {
+ __u64 raw_type : 1,
+ raw_event_id : 63;
+ };
+#endif /* __BIT_ENDIAN_BITFIELD */
__u64 event_config;
};

--
1.5.6.3


2009-03-21 09:43:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perfcounters: fix type/event_id layout on big-endian systems


* Paul Mackerras <[email protected]> wrote:

> struct perf_counter_hw_event {
> union {
> +#ifndef __BIG_ENDIAN_BITFIELD
> struct {
> __u64 event_id : 56,
> type : 8;
> @@ -94,6 +96,16 @@ struct perf_counter_hw_event {
> __u64 raw_event_id : 63,
> raw_type : 1;
> };
> +#else
> + struct {
> + __u64 type : 8,
> + event_id : 56;
> + };
> + struct {
> + __u64 raw_type : 1,
> + raw_event_id : 63;
> + };
> +#endif /* __BIT_ENDIAN_BITFIELD */

hm, this ifdef really looks ugly. How about just changing event_id
to 64 bits and having a separate u32 type field? The size impact is
minimal, the cleanliness win is significant :-)

Ingo

2009-03-21 09:53:03

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perfcounters: fix type/event_id layout on big-endian systems

Ingo Molnar writes:

> hm, this ifdef really looks ugly. How about just changing event_id
> to 64 bits and having a separate u32 type field? The size impact is
> minimal, the cleanliness win is significant :-)

We wanted to get a complete and unique identifier for the event into
64 bits so that we could put it into the ring buffer for
PERF_RECORD_GROUP and have it take up only one 8-byte slot and yet
identify uniquely which counter's value follows it. I don't know that
that is absolutely necessary but it sounds like a nice property.

We could easily go back to a 1-bit raw field and have the type be
either a 64-bit raw value or an 8-bit type plus 32 or 56-bit
event_id.

Or we could keep the current layout but use explicit shifts and masks
rather than bitfields.

I don't know the current C rules concerning unions very well, but I
have the impression that writing to one member of a union and reading
another is undefined behaviour, which is another strike against the
current code if true...

Anyway, the point is that the current code doesn't compile on powerpc
and wouldn't work properly even if it did, so we need to do something.

Paul.

2009-03-21 09:57:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perfcounters: fix type/event_id layout on big-endian systems


* Paul Mackerras <[email protected]> wrote:

> Ingo Molnar writes:
>
> > hm, this ifdef really looks ugly. How about just changing event_id
> > to 64 bits and having a separate u32 type field? The size impact is
> > minimal, the cleanliness win is significant :-)
>
> We wanted to get a complete and unique identifier for the event
> into 64 bits so that we could put it into the ring buffer for
> PERF_RECORD_GROUP and have it take up only one 8-byte slot and yet
> identify uniquely which counter's value follows it. I don't know
> that that is absolutely necessary but it sounds like a nice
> property.
>
> We could easily go back to a 1-bit raw field and have the type be
> either a 64-bit raw value or an 8-bit type plus 32 or 56-bit
> event_id.
>
> Or we could keep the current layout but use explicit shifts and
> masks rather than bitfields.
>
> I don't know the current C rules concerning unions very well, but
> I have the impression that writing to one member of a union and
> reading another is undefined behaviour, which is another strike
> against the current code if true...
>
> Anyway, the point is that the current code doesn't compile on
> powerpc and wouldn't work properly even if it did, so we need to
> do something.

yeah - i pulled your fix, thanks Paul.

Maybe the best option is to get rid of the bitfields and use masks
...

Ingo

2009-03-21 10:24:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perfcounters: fix type/event_id layout on big-endian systems

On Sat, 2009-03-21 at 15:53 +1100, Paul Mackerras wrote:
> Impact: build fix for powerpc
>
> Commit db3a944aca35ae61 ("perf_counter: revamp syscall input ABI")
> expanded the hw_event.type field into a union of structs containing
> bitfields. In particular it introduced a type field and a raw_type
> field, with the intention that the 1-bit raw_type field should
> overlay the most-significant bit of the 8-bit type field, and in fact
> perf_counter_alloc() now assumes that (or at least, assumes that
> raw_type doesn't overlay any of the bits that are 1 in the values of
> PERF_TYPE_{HARDWARE,SOFTWARE,TRACEPOINT}).
>
> Unfortunately this is not true on big-endian systems such as PowerPC,
> where bitfields are laid out from left to right, i.e. from most
> significant bit to least significant. This means that setting
> hw_event.type = PERF_TYPE_SOFTWARE will set hw_event.raw_type to 1.
>
> This fixes it by making the layout depend on whether or not
> __BIG_ENDIAN_BITFIELD is defined. It's a bit ugly, but that's what
> we get for using bitfields in a user/kernel ABI.

Hmm, does userspace know about __BIG_ENDIAN_BITFIELD too? If not, we've
got ourselves a problem here.

Sorry about missing those powerpc bits, I guess I should dust-off the
cross-compiler.

> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> index a4b76c0..98f5990 100644
> --- a/include/linux/perf_counter.h
> +++ b/include/linux/perf_counter.h
> @@ -15,6 +15,7 @@
>
> #include <linux/types.h>
> #include <linux/ioctl.h>
> +#include <asm/byteorder.h>
>
> /*
> * User-space ABI bits:
> @@ -86,6 +87,7 @@ enum perf_counter_record_type {
> */
> struct perf_counter_hw_event {
> union {
> +#ifndef __BIG_ENDIAN_BITFIELD
> struct {
> __u64 event_id : 56,
> type : 8;
> @@ -94,6 +96,16 @@ struct perf_counter_hw_event {
> __u64 raw_event_id : 63,
> raw_type : 1;
> };
> +#else
> + struct {
> + __u64 type : 8,
> + event_id : 56;
> + };
> + struct {
> + __u64 raw_type : 1,
> + raw_event_id : 63;
> + };
> +#endif /* __BIT_ENDIAN_BITFIELD */
> __u64 event_config;
> };
>

2009-03-21 11:41:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perfcounters: fix type/event_id layout on big-endian systems

On Sat, 2009-03-21 at 10:56 +0100, Ingo Molnar wrote:

> Maybe the best option is to get rid of the bitfields and use masks

Yeah, I think that'll be sanest..

2009-03-21 12:08:39

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perfcounters: fix type/event_id layout on big-endian systems

Peter Zijlstra writes:

> Hmm, does userspace know about __BIG_ENDIAN_BITFIELD too? If not, we've
> got ourselves a problem here.

Yes, it does (I did test the patch), but I think going to explicit
shifting and masking will be better.

Paul.

2009-03-21 22:04:46

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH] perf_counter: remove the event config bitfields

On Sat, 2009-03-21 at 12:41 +0100, Peter Zijlstra wrote:
> On Sat, 2009-03-21 at 10:56 +0100, Ingo Molnar wrote:
>
> > Maybe the best option is to get rid of the bitfields and use masks
>
> Yeah, I think that'll be sanest..

Paul, I noticed you assign the raw config without mask, is that ok?

---
Subject: perf_counter: remove the event config bitfields

Since the bitfields turned into a bit of a mess, remove them and rely on
good old masks.

Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
index 6413d9c..d056515 100644
--- a/arch/powerpc/kernel/perf_counter.c
+++ b/arch/powerpc/kernel/perf_counter.c
@@ -602,13 +602,13 @@ hw_perf_counter_init(struct perf_counter *counter)
return NULL;
if ((s64)counter->hw_event.irq_period < 0)
return NULL;
- if (!counter->hw_event.raw_type) {
- ev = counter->hw_event.event_id;
+ if (!perf_event_raw(&counter->hw_event)) {
+ ev = perf_event_id(&counter->hw_event);
if (ev >= ppmu->n_generic || ppmu->generic_events[ev] == 0)
return NULL;
ev = ppmu->generic_events[ev];
} else {
- ev = counter->hw_event.raw_event_id;
+ ev = perf_event_config(&counter->hw_event);
}
counter->hw.config_base = ev;
counter->hw.idx = 0;
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
index 902282d..3f95b0c 100644
--- a/arch/x86/kernel/cpu/perf_counter.c
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -217,15 +217,15 @@ static int __hw_perf_counter_init(struct perf_counter *counter)
/*
* Raw event type provide the config in the event structure
*/
- if (hw_event->raw_type) {
- hwc->config |= pmc_ops->raw_event(hw_event->raw_event_id);
+ if (perf_event_raw(hw_event)) {
+ hwc->config |= pmc_ops->raw_event(perf_event_config(hw_event));
} else {
- if (hw_event->event_id >= pmc_ops->max_events)
+ if (perf_event_id(hw_event) >= pmc_ops->max_events)
return -EINVAL;
/*
* The generic map:
*/
- hwc->config |= pmc_ops->event_map(hw_event->event_id);
+ hwc->config |= pmc_ops->event_map(perf_event_id(hw_event));
}
counter->wakeup_pending = 0;

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 98f5990..4d0d787 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -82,32 +82,22 @@ enum perf_counter_record_type {
PERF_RECORD_GROUP = 2,
};

+#define PERF_COUNTER_RAW_MASK 0x8000000000000000ULL
+#define PERF_COUNTER_CONFIG_MASK 0x7FFFFFFFFFFFFFFFULL
+#define PERF_COUNTER_TYPE_MASK 0x7F00000000000000ULL
+#define PERF_COUNTER_EVENT_MASK 0x00FFFFFFFFFFFFFFULL
+
/*
* Hardware event to monitor via a performance monitoring counter:
*/
struct perf_counter_hw_event {
- union {
-#ifndef __BIG_ENDIAN_BITFIELD
- struct {
- __u64 event_id : 56,
- type : 8;
- };
- struct {
- __u64 raw_event_id : 63,
- raw_type : 1;
- };
-#else
- struct {
- __u64 type : 8,
- event_id : 56;
- };
- struct {
- __u64 raw_type : 1,
- raw_event_id : 63;
- };
-#endif /* __BIT_ENDIAN_BITFIELD */
- __u64 event_config;
- };
+ /*
+ * The MSB of the config word signifies if the rest contains cpu
+ * specific (raw) counter configuration data, if unset, the next
+ * 7 bits are an event type and the rest of the bits are the event
+ * identifier.
+ */
+ __u64 config;

__u64 irq_period;
__u64 record_type;
@@ -157,6 +147,26 @@ struct perf_counter_hw_event {

struct task_struct;

+static inline u64 perf_event_raw(struct perf_counter_hw_event *hw_event)
+{
+ return hw_event->config & PERF_COUNTER_RAW_MASK;
+}
+
+static inline u64 perf_event_config(struct perf_counter_hw_event *hw_event)
+{
+ return hw_event->config & PERF_COUNTER_CONFIG_MASK;
+}
+
+static inline u64 perf_event_type(struct perf_counter_hw_event *hw_event)
+{
+ return (hw_event->config & PERF_COUNTER_TYPE_MASK) >> 56;
+}
+
+static inline u64 perf_event_id(struct perf_counter_hw_event *hw_event)
+{
+ return hw_event->config & PERF_COUNTER_EVENT_MASK;
+}
+
/**
* struct hw_perf_counter - performance counter hardware details:
*/
@@ -336,8 +346,8 @@ extern void perf_counter_output(struct perf_counter *counter,
*/
static inline int is_software_counter(struct perf_counter *counter)
{
- return !counter->hw_event.raw_type &&
- counter->hw_event.type != PERF_TYPE_HARDWARE;
+ return !perf_event_raw(&counter->hw_event) &&
+ perf_event_type(&counter->hw_event) != PERF_TYPE_HARDWARE;
}

extern void perf_swcounter_event(u32, u64, int, struct pt_regs *);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index f054b8c..bbd538a 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1379,7 +1379,7 @@ static void perf_counter_handle_group(struct perf_counter *counter)
list_for_each_entry(sub, &leader->sibling_list, list_entry) {
if (sub != counter)
sub->hw_ops->read(sub);
- perf_counter_store_irq(counter, sub->hw_event.event_config);
+ perf_counter_store_irq(counter, sub->hw_event.config);
perf_counter_store_irq(counter, atomic64_read(&sub->count));
}
}
@@ -1489,13 +1489,13 @@ static int perf_swcounter_match(struct perf_counter *counter,
if (counter->state != PERF_COUNTER_STATE_ACTIVE)
return 0;

- if (counter->hw_event.raw_type)
+ if (perf_event_raw(&counter->hw_event))
return 0;

- if (counter->hw_event.type != type)
+ if (perf_event_type(&counter->hw_event) != type)
return 0;

- if (counter->hw_event.event_id != event)
+ if (perf_event_id(&counter->hw_event) != event)
return 0;

if (counter->hw_event.exclude_user && user_mode(regs))
@@ -1763,7 +1763,7 @@ static void tp_perf_counter_destroy(struct perf_counter *counter)
static const struct hw_perf_counter_ops *
tp_perf_counter_init(struct perf_counter *counter)
{
- int event_id = counter->hw_event.event_id;
+ int event_id = perf_event_id(&counter->hw_event);
int ret;

ret = ftrace_profile_enable(event_id);
@@ -1797,7 +1797,7 @@ sw_perf_counter_init(struct perf_counter *counter)
* to be kernel events, and page faults are never hypervisor
* events.
*/
- switch (counter->hw_event.event_id) {
+ switch (perf_event_id(&counter->hw_event)) {
case PERF_COUNT_CPU_CLOCK:
hw_ops = &perf_ops_cpu_clock;

@@ -1882,9 +1882,12 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,

hw_ops = NULL;

- if (hw_event->raw_type)
+ if (perf_event_raw(hw_event)) {
hw_ops = hw_perf_counter_init(counter);
- else switch (hw_event->type) {
+ goto done;
+ }
+
+ switch (perf_event_type(hw_event)) {
case PERF_TYPE_HARDWARE:
hw_ops = hw_perf_counter_init(counter);
break;
@@ -1902,6 +1905,7 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,
kfree(counter);
return NULL;
}
+done:
counter->hw_ops = hw_ops;

return counter;

2009-03-21 23:50:43

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: remove the event config bitfields

Peter Zijlstra writes:

> Paul, I noticed you assign the raw config without mask, is that ok?

Yep. That doesn't get assigned directly to a hardware register; it
undergoes further processing first, which just pulls out the bits it
needs. (It's not used directly because the hardware has a couple of
registers that configure all the counters at once, rather than a
configuration register per counter.)

Paul.