2011-04-21 17:41:18

by Arnaldo Carvalho de Melo

Subject: [GIT PULL 0/1] perf/urgent Fix missing support for config1/config2

Hi Ingo,

Please consider pulling from:

git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 perf/urgent

Regards,

- Arnaldo

Andi Kleen (1):
perf tools: Add missing user space support for config1/config2

tools/perf/Documentation/perf-list.txt | 11 +++++++++++
tools/perf/util/parse-events.c | 18 +++++++++++++++++-
2 files changed, 28 insertions(+), 1 deletions(-)


2011-04-21 17:41:12

by Arnaldo Carvalho de Melo

Subject: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

From: Andi Kleen <[email protected]>

The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
user space bits were not. This made it impossible to set the extra mask
and actually do the OFFCORE profiling

This patch fixes this. It adds a new syntax ':' to raw events to specify
additional event masks. I also added support for setting config2, even
though that is not needed currently.

[Note: the original version used ',' as the separator -- but that conflicted
with event lists, so now it's ':']
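
For illustration, with this syntax a command like

	perf stat -e r1b7:20ff -a sleep 1

roughly amounts to (a sketch of the resulting attribute, not code from this
patch):

	struct perf_event_attr attr = {
		.type    = PERF_TYPE_RAW,
		.config  = 0x1b7,	/* raw event code, taken from "r1b7"    */
		.config1 = 0x20ff,	/* first extra parameter after ':'      */
		/* a second ':'-separated value would be stored in .config2   */
	};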

Acked-by: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Lin Ming <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/Documentation/perf-list.txt | 11 +++++++++++
tools/perf/util/parse-events.c | 18 +++++++++++++++++-
2 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
index 7a527f7..f19f1e5 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -61,6 +61,17 @@ raw encoding of 0x1A8 can be used:
You should refer to the processor specific documentation for getting these
details. Some of them are referenced in the SEE ALSO section below.

+Some raw events -- like the Intel OFFCORE events -- support additional
+parameters. These can be appended after a ':'.
+
+For example on a multi socket Intel Nehalem:
+
+ perf stat -e r1b7:20ff -a sleep 1
+
+Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
+that measures any access to DRAM on another socket. Up to two parameters can
+be specified with an additional ':'.
+
OPTIONS
-------

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 952b4ae..fe9d079 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -688,9 +688,25 @@ parse_raw_event(const char **strp, struct perf_event_attr *attr)
 		return EVT_FAILED;
 	n = hex2u64(str + 1, &config);
 	if (n > 0) {
-		*strp = str + n + 1;
+		str += n + 1;
 		attr->type = PERF_TYPE_RAW;
 		attr->config = config;
+		if (*str == ':') {
+			str++;
+			n = hex2u64(str, &config);
+			if (n == 0)
+				return EVT_FAILED;
+			attr->config1 = config;
+			str += n;
+			if (*str == ':') {
+				str++;
+				n = hex2u64(str, &config);
+				if (n == 0)
+					return EVT_FAILED;
+				attr->config2 = config;
+			}
+		}
+		*strp = str;
 		return EVT_HANDLED;
 	}
 	return EVT_FAILED;
--
1.6.2.5

2011-04-22 06:34:47

by Ingo Molnar

Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Arnaldo Carvalho de Melo <[email protected]> wrote:

> From: Andi Kleen <[email protected]>
>
> The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
> user space bits were not. This made it impossible to set the extra mask
> and actually do the OFFCORE profiling
>
> This patch fixes this. It adds a new syntax ':' to raw events to specify
> additional event masks. I also added support for setting config2, even
> though that is not needed currently.
>
> [Note: the original version back in time used , -- but that actually
> conflicted with event lists, so now it's :]
>
> Acked-by: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> Cc: Lin Ming <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
> ---
> tools/perf/Documentation/perf-list.txt | 11 +++++++++++
> tools/perf/util/parse-events.c | 18 +++++++++++++++++-
> 2 files changed, 28 insertions(+), 1 deletions(-)
>
> diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
> index 7a527f7..f19f1e5 100644
> --- a/tools/perf/Documentation/perf-list.txt
> +++ b/tools/perf/Documentation/perf-list.txt
> @@ -61,6 +61,17 @@ raw encoding of 0x1A8 can be used:
> You should refer to the processor specific documentation for getting these
> details. Some of them are referenced in the SEE ALSO section below.
>
> +Some raw events -- like the Intel OFFCORE events -- support additional
> +parameters. These can be appended after a ':'.
> +
> +For example on a multi socket Intel Nehalem:
> +
> + perf stat -e r1b7:20ff -a sleep 1
> +
> +Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
> +that measures any access to DRAM on another socket. Upto two parameters can
> +be specified with additional ':'

This needs to be a *lot* more user friendly. Users do not want to type in
stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
era really.

Unless there's proper generalized and human usable support i'm leaning towards
turning off the offcore user-space accessible raw bits for now, and use them
only kernel-internally, for the cache events.

Thanks,

Ingo

2011-04-22 08:06:32

by Ingo Molnar

Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Ingo Molnar <[email protected]> wrote:

> This needs to be a *lot* more user friendly. Users do not want to type in
> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
> era really.
>
> Unless there's proper generalized and human usable support i'm leaning
> towards turning off the offcore user-space accessible raw bits for now, and
> use them only kernel-internally, for the cache events.

I'm about to push out the patch attached below - it lays out the arguments in
detail. I don't think we have time to fix this properly for .39 - but memory
profiling could be a nice feature for v2.6.40.

Thanks,

Ingo

--------------------->
>From b52c55c6a25e4515b5e075a989ff346fc251ed09 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 22 Apr 2011 08:44:38 +0200
Subject: [PATCH] x86, perf event: Turn off unstructured raw event access to offcore registers

Andi Kleen pointed out that the Intel offcore support patches were merged
without user-space tool support to the functionality:

|
| The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
| user space bits were not. This made it impossible to set the extra mask
| and actually do the OFFCORE profiling
|

Andi submitted a preliminary patch for user-space support, as an
extension to perf's raw event syntax:

|
| Some raw events -- like the Intel OFFCORE events -- support additional
| parameters. These can be appended after a ':'.
|
| For example on a multi socket Intel Nehalem:
|
| perf stat -e r1b7:20ff -a sleep 1
|
| Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
| that measures any access to DRAM on another socket.
|

But this kind of usability is absolutely unacceptable - users should not
be expected to type in magic, CPU and model specific incantations to get
access to useful hardware functionality.

The proper solution is to expose useful offcore functionality via
generalized events - that way users do not have to care which specific
CPU model they are using, they can use the conceptual event and not some
model specific quirky hexa number.

We already have such generalization in place for CPU cache events,
and it's all very extensible.

"Offcore" events measure general DRAM access patterns along various
parameters. They are particularly useful in NUMA systems.

We want to support them via generalized DRAM events: either as the
fourth level of cache (after the last-level cache), or as a separate
generalization category.

That way user-space support would be very obvious, memory access
profiling could be done via self-explanatory commands like:

perf record -e dram ./myapp
perf record -e dram-remote ./myapp

... to measure DRAM accesses or more expensive cross-node NUMA DRAM
accesses.

These generalized events would work on all CPUs and architectures that
have comparable PMU features.

( Note, these are just examples: actual implementation could have more
sophistication and more parameters - as long as they center around
similarly simple use cases. )

Now we do not want to revert *all* of the current offcore bits, as they
are still somewhat useful for generic last-level-cache events, implemented
in this commit:

e994d7d23a0b: perf: Fix LLC-* events on Intel Nehalem/Westmere

But we definitely do not yet want to expose the unstructured raw events
to user-space, until better generalization and usability is implemented
for these hardware event features.

( Note: after generalization has been implemented raw offcore events can be
supported as well: there can always be an odd event that is marginally
useful but not useful enough to generalize. DRAM profiling is definitely
*not* such a category so generalization must be done first. )

Furthermore, PERF_TYPE_RAW access to these registers was not intended
to go upstream without proper support - it was a side-effect of the above
e994d7d23a0b commit, not mentioned in the changelog.

As v2.6.39 is nearing release we go for the simplest approach: disable
the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
kernel and becomes an ABI.

Once proper structure is implemented for these hardware events and users
are offered usable solutions we can revisit this issue.

Reported-by: Andi Kleen <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index eed3673a..632e5dc 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -586,8 +586,12 @@ static int x86_setup_perfctr(struct perf_event *event)
 			return -EOPNOTSUPP;
 	}
 
+	/*
+	 * Do not allow config1 (extended registers) to propagate,
+	 * there's no sane user-space generalization yet:
+	 */
 	if (attr->type == PERF_TYPE_RAW)
-		return x86_pmu_extra_regs(event->attr.config, event);
+		return 0;
 
 	if (attr->type == PERF_TYPE_HW_CACHE)
 		return set_ext_hw_attr(hwc, event);

2011-04-22 08:47:45

by Stephane Eranian

Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> This needs to be a *lot* more user friendly. Users do not want to type in
>> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
>> era really.
>>
>> Unless there's proper generalized and human usable support i'm leaning
>> towards turning off the offcore user-space accessible raw bits for now, and
>> use them only kernel-internally, for the cache events.
>
Generic cache events are a myth. They are not usable. I keep getting questions
from users because nobody knows what they are actually counting, thus nobody
knows how to interpret the counts. You cannot really hide the micro-architecture
if you want to make any sensible measurements.

I agree that perf's usability is poor when you have to pass hex
values for events.
But that's why I have a user-level library to map event strings to
event codes for perf.
Arun Sharma posted a patch a while ago to connect this library with perf; so far
it seems to have been ignored:

  perf stat -e offcore_response_0:dmd_data_rd foo


> I'm about to push out the patch attached below - it lays out the arguments in
> detail. I don't think we have time to fix this properly for .39 - but memory
> profiling could be a nice feature for v2.6.40.
>
You will not be able to do any reasonable memory profiling using offcore
response events. Don't expect a profile to point to the loads that miss. If
you're lucky it will point to the use instruction.


> --------------------->
> From b52c55c6a25e4515b5e075a989ff346fc251ed09 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <[email protected]>
> Date: Fri, 22 Apr 2011 08:44:38 +0200
> Subject: [PATCH] x86, perf event: Turn off unstructured raw event access to offcore registers
>
> Andi Kleen pointed out that the Intel offcore support patches were merged
> without user-space tool support to the functionality:
>
>  |
>  | The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
>  | user space bits were not. This made it impossible to set the extra mask
>  | and actually do the OFFCORE profiling
>  |
>
> Andi submitted a preliminary patch for user-space support, as an
> extension to perf's raw event syntax:
>
>  |
>  | Some raw events -- like the Intel OFFCORE events -- support additional
>  | parameters. These can be appended after a ':'.
>  |
>  | For example on a multi socket Intel Nehalem:
>  |
>  |    perf stat -e r1b7:20ff -a sleep 1
>  |
>  | Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
>  | that measures any access to DRAM on another socket.
>  |
>
> But this kind of usability is absolutely unacceptable - users should not
> be expected to type in magic, CPU and model specific incantations to get
> access to useful hardware functionality.
>
> The proper solution is to expose useful offcore functionality via
> generalized events - that way users do not have to care which specific
> CPU model they are using, they can use the conceptual event and not some
> model specific quirky hexa number.
>
> We already have such generalization in place for CPU cache events,
> and it's all very extensible.
>
> "Offcore" events measure general DRAM access patters along various
> parameters. They are particularly useful in NUMA systems.
>
> We want to support them via generalized DRAM events: either as the
> fourth level of cache (after the last-level cache), or as a separate
> generalization category.
>
> That way user-space support would be very obvious, memory access
> profiling could be done via self-explanatory commands like:
>
>  perf record -e dram ./myapp
>  perf record -e dram-remote ./myapp
>
> ... to measure DRAM accesses or more expensive cross-node NUMA DRAM
> accesses.
>
> These generalized events would work on all CPUs and architectures that
> have comparable PMU features.
>
> ( Note, these are just examples: actual implementation could have more
>  sophistication and more parameter - as long as they center around
>  similarly simple usecases. )
>
> Now we do not want to revert *all* of the current offcore bits, as they
> are still somewhat useful for generic last-level-cache events, implemented
> in this commit:
>
>  e994d7d23a0b: perf: Fix LLC-* events on Intel Nehalem/Westmere
>
> But we definitely do not yet want to expose the unstructured raw events
> to user-space, until better generalization and usability is implemented
> for these hardware event features.
>
> ( Note: after generalization has been implemented raw offcore events can be
>  supported as well: there can always be an odd event that is marginally
>  useful but not useful enough to generalize. DRAM profiling is definitely
>  *not* such a category so generalization must be done first. )
>
> Furthermore, PERF_TYPE_RAW access to these registers was not intended
> to go upstream without proper support - it was a side-effect of the above
> e994d7d23a0b commit, not mentioned in the changelog.
>
> As v2.6.39 is nearing release we go for the simplest approach: disable
> the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
> kernel and becomes an ABI.
>
> Once proper structure is implemented for these hardware events and users
> are offered usable solutions we can revisit this issue.
>
> Reported-by: Andi Kleen <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
>  arch/x86/kernel/cpu/perf_event.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index eed3673a..632e5dc 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -586,8 +586,12 @@ static int x86_setup_perfctr(struct perf_event *event)
>                        return -EOPNOTSUPP;
>        }
>
> +       /*
> +        * Do not allow config1 (extended registers) to propagate,
> +        * there's no sane user-space generalization yet:
> +        */
>        if (attr->type == PERF_TYPE_RAW)
> -               return x86_pmu_extra_regs(event->attr.config, event);
> +               return 0;
>
>        if (attr->type == PERF_TYPE_HW_CACHE)
>                return set_ext_hw_attr(hwc, event);
>

2011-04-22 09:23:43

by Ingo Molnar

Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Stephane Eranian <[email protected]> wrote:

> On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> >> This needs to be a *lot* more user friendly. Users do not want to type in
> >> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
> >> era really.
> >>
> >> Unless there's proper generalized and human usable support i'm leaning
> >> towards turning off the offcore user-space accessible raw bits for now, and
> >> use them only kernel-internally, for the cache events.
>
> Generic cache events are a myth. They are not usable. I keep getting
> questions from users because nobody knows what they are actually counting,
> thus nobody knows how to interpret the counts. You cannot really hide the
> micro-architecture if you want to make any sensible measurements.

Well:

aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
Time: 0.125
Time: 0.136
Time: 0.180
Time: 0.103
Time: 0.097
Time: 0.125
Time: 0.104
Time: 0.125
Time: 0.114
Time: 0.158

Performance counter stats for './hackbench 10' (10 runs):

     2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
       843,957,634 L1-dcache-loads            ( +-   1.295% )
       130,007,361 L1-dcache-load-misses      ( +-   3.281% )
         6,328,938 LLC-misses                 ( +-   3.969% )

        0.146160287  seconds time elapsed   ( +-   5.851% )

It's certainly useful if you want to get ballpark figures about cache behavior
of an app and want to do comparisons.

There are inconsistencies in our generic cache events - but that's not really a
reason to obscure their usage behind nonsensical microarchitecture-specific
details.

But i'm definitely in favor of making these generalized events more consistent
across different CPU types. Can you list examples of inconsistencies that we
should resolve? (and which you possibly consider impossible to resolve, right?)

Thanks,

Ingo

2011-04-22 09:41:11

by Stephane Eranian

Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, Apr 22, 2011 at 11:23 AM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> On Fri, Apr 22, 2011 at 10:06 AM, Ingo Molnar <[email protected]> wrote:
>> >
>> > * Ingo Molnar <[email protected]> wrote:
>> >
>> >> This needs to be a *lot* more user friendly. Users do not want to type in
>> >> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
>> >> era really.
>> >>
>> >> Unless there's proper generalized and human usable support i'm leaning
>> >> towards turning off the offcore user-space accessible raw bits for now, and
>> >> use them only kernel-internally, for the cache events.
>>
>> Generic cache events are a myth. They are not usable. I keep getting
>> questions from users because nobody knows what they are actually counting,
>> thus nobody knows how to interpret the counts. You cannot really hide the
>> micro-architecture if you want to make any sensible measurements.
>
> Well:
>
>  aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
>  Time: 0.125
>  Time: 0.136
>  Time: 0.180
>  Time: 0.103
>  Time: 0.097
>  Time: 0.125
>  Time: 0.104
>  Time: 0.125
>  Time: 0.114
>  Time: 0.158
>
>  Performance counter stats for './hackbench 10' (10 runs):
>
>     2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
>       843,957,634 L1-dcache-loads            ( +-   1.295% )
>       130,007,361 L1-dcache-load-misses      ( +-   3.281% )
>         6,328,938 LLC-misses                 ( +-   3.969% )
>
>        0.146160287  seconds time elapsed   ( +-   5.851% )
>
> It's certainly useful if you want to get ballpark figures about cache behavior
> of an app and want to do comparisons.
>
What can you conclude from the above counts?
Are they good or bad? If they are bad, how do you go about fixing the app?

> There are inconsistencies in our generic cache events - but that's not really a
> reason to obcure their usage behind nonsensical microarchitecture-specific
> details.
>
The actual events are a reflection of the micro-architecture. They indirectly
describe how it works. It is not clear to me that you can really improve your
app without some exposure to the micro-architecture.

So if you want to have generic events, I am fine with this, but you should not
block access to actual events pretending they are useless. Some people are
certainly interested in using them and learning about the micro-architecture
of their processor.


> But i'm definitely in favor of making these generalized events more consistent
> across different CPU types. Can you list examples of inconsistencies that we
> should resolve? (and which you possibly consider impossible to resolve, right?)
>
To make generic events more uniform across processors, one would have to have
precise definitions as to what they are supposed to count. Once you have that,
we may have a better chance at finding consistent mappings for each processor.
I have not yet seen such definitions.

2011-04-22 10:52:36

by Ingo Molnar

Subject: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Stephane Eranian <[email protected]> wrote:

> >> Generic cache events are a myth. They are not usable. [...]
> >
> > Well:
> >
> >  aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
> >  Time: 0.125
> >  Time: 0.136
> >  Time: 0.180
> >  Time: 0.103
> >  Time: 0.097
> >  Time: 0.125
> >  Time: 0.104
> >  Time: 0.125
> >  Time: 0.114
> >  Time: 0.158
> >
> >  Performance counter stats for './hackbench 10' (10 runs):
> >
> >     2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
> >       843,957,634 L1-dcache-loads            ( +-   1.295% )
> >       130,007,361 L1-dcache-load-misses      ( +-   3.281% )
> >         6,328,938 LLC-misses                 ( +-   3.969% )
> >
> >        0.146160287  seconds time elapsed   ( +-   5.851% )
> >
> > It's certainly useful if you want to get ballpark figures about cache behavior
> > of an app and want to do comparisons.
> >
> What can you conclude from the above counts?
> Are they good or bad? If they are bad, how do you go about fixing the app?

So let me give you a simplified example.

Say i'm a developer and i have an app with such code:

#define THOUSAND 1000

static char array[THOUSAND][THOUSAND];

int init_array(void)
{
        int i, j;

        for (i = 0; i < THOUSAND; i++) {
                for (j = 0; j < THOUSAND; j++) {
                        array[j][i]++;
                }
        }

        return 0;
}

Pretty common stuff, right?

Using the generalized cache events i can run:

$ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array

Performance counter stats for './array' (10 runs):

        6,719,130 cycles:u                   ( +-   0.662% )
        5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
        1,037,032 l1-dcache-loads:u          ( +-   0.009% )
        1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )

       0.003802098  seconds time elapsed   ( +-  13.395% )

I consider that this is 'bad', because for almost every dcache-load there's a
dcache-miss - a 99% L1 cache miss rate!

Then i think a bit, notice something, apply this performance optimization:

diff --git a/array.c b/array.c
index 4758d9a..d3f7037 100644
--- a/array.c
+++ b/array.c
@@ -9,7 +9,7 @@ int init_array(void)
 
        for (i = 0; i < THOUSAND; i++) {
                for (j = 0; j < THOUSAND; j++) {
-                       array[j][i]++;
+                       array[i][j]++;
                }
        }
 
I re-run perf-stat:

$ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array

Performance counter stats for './array' (10 runs):

        2,395,407 cycles:u                   ( +-   0.365% )
        5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
        1,035,731 l1-dcache-loads:u          ( +-   0.006% )
            3,955 l1-dcache-load-misses:u    ( +-   4.872% )

       0.001806438  seconds time elapsed   ( +-   3.831% )

And i'm happy that indeed the l1-dcache misses are now super-low and that the
app got much faster as well - the cycle count is a third of what it was before
the optimization!

Note that:

- I got absolute numbers in the right ballpark figure: i got a million loads as
expected (the array has 1 million elements), and 1 million cache-misses in
the 'bad' case.

- I did not care which specific Intel CPU model this was running on

- I did not care about *any* microarchitectural details - i only knew it's a
reasonably modern CPU with caching

- I did not care how i could get access to L1 load and miss events. The events
were named obviously and it just worked.

So no, kernel driven generalization and sane tooling is not at all a 'myth'
today, really.

So this is the general direction in which we want to move on. If you know about
problems with existing generalization definitions then lets *fix* them, not
pretend that generalizations and sane workflows are impossible ...

Thanks,

Ingo

2011-04-22 12:05:00

by Stephane Eranian

Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, Apr 22, 2011 at 12:52 PM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> >> Generic cache events are a myth. They are not usable. [...]
>> >
>> > Well:
>> >
>> >  aldebaran:~> perf stat --repeat 10 -e instructions -e L1-dcache-loads -e L1-dcache-load-misses -e LLC-misses ./hackbench 10
>> >  Time: 0.125
>> >  Time: 0.136
>> >  Time: 0.180
>> >  Time: 0.103
>> >  Time: 0.097
>> >  Time: 0.125
>> >  Time: 0.104
>> >  Time: 0.125
>> >  Time: 0.114
>> >  Time: 0.158
>> >
>> >  Performance counter stats for './hackbench 10' (10 runs):
>> >
>> >     2,102,556,398 instructions             #      0.000 IPC     ( +-   1.179% )
>> >       843,957,634 L1-dcache-loads            ( +-   1.295% )
>> >       130,007,361 L1-dcache-load-misses      ( +-   3.281% )
>> >         6,328,938 LLC-misses                 ( +-   3.969% )
>> >
>> >        0.146160287  seconds time elapsed   ( +-   5.851% )
>> >
>> > It's certainly useful if you want to get ballpark figures about cache behavior
>> > of an app and want to do comparisons.
>> >
>> What can you conclude from the above counts?
>> Are they good or bad? If they are bad, how do you go about fixing the app?
>
> So let me give you a simplified example.
>
> Say i'm a developer and i have an app with such code:
>
> #define THOUSAND 1000
>
> static char array[THOUSAND][THOUSAND];
>
> int init_array(void)
> {
>        int i, j;
>
>        for (i = 0; i < THOUSAND; i++) {
>                for (j = 0; j < THOUSAND; j++) {
>                        array[j][i]++;
>                }
>        }
>
>        return 0;
> }
>
> Pretty common stuff, right?
>
> Using the generalized cache events i can run:
>
>  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>
>  Performance counter stats for './array' (10 runs):
>
>         6,719,130 cycles:u                   ( +-   0.662% )
>         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
>
>        0.003802098  seconds time elapsed   ( +-  13.395% )
>
> I consider that this is 'bad', because for almost every dcache-load there's a
> dcache-miss - a 99% L1 cache miss rate!
>
> Then i think a bit, notice something, apply this performance optimization:
>
I don't think this example is really representative of the kind of problems
people face, it is just too small and obvious. So I would not generalize on it.

If you are happy with generalized cache events then, as I said, I am fine with
it. But the API should ALWAYS allow users access to raw events when they
need finer grain analysis.

> diff --git a/array.c b/array.c
> index 4758d9a..d3f7037 100644
> --- a/array.c
> +++ b/array.c
> @@ -9,7 +9,7 @@ int init_array(void)
>
>        for (i = 0; i < THOUSAND; i++) {
>                for (j = 0; j < THOUSAND; j++) {
> -                       array[j][i]++;
> +                       array[i][j]++;
>                }
>        }
>
> I re-run perf-stat:
>
>  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>
>  Performance counter stats for './array' (10 runs):
>
>         2,395,407 cycles:u                   ( +-   0.365% )
>         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
>         1,035,731 l1-dcache-loads:u          ( +-   0.006% )
>             3,955 l1-dcache-load-misses:u    ( +-   4.872% )
>
>  - I got absolute numbers in the right ballpark figure: i got a million loads as
>   expected (the array has 1 million elements), and 1 million cache-misses in
>   the 'bad' case.
>
>  - I did not care which specific Intel CPU model this was running on
>
>  - I did not care about *any* microarchitectural details - i only knew it's a
>   reasonably modern CPU with caching
>
>  - I did not care how i could get access to L1 load and miss events. The events
>   were named obviously and it just worked.
>
> So no, kernel driven generalization and sane tooling is not at all a 'myth'
> today, really.
>
> So this is the general direction in which we want to move on. If you know about
> problems with existing generalization definitions then lets *fix* them, not
> pretend that generalizations and sane workflows are impossible ...
>
Again, to fix them, you need to give us definitions for what you expect those
events to count. Otherwise we cannot make forward progress.

Let me give just one simple example: cycles

What is your definition of the generic cycle event?

There are various flavors:
- count halted, unhalted cycles?
- impacted by frequency scaling?

LLC-misses:
- what is considered the LLC?
- does it include code, data or both?
- does it include demand, hw prefetch?
- is it to local or remote DRAM?

Once you have clear and precise definition, then we can look at the actual
events and figure out a mapping.

2011-04-22 13:19:12

by Ingo Molnar

Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Stephane Eranian <[email protected]> wrote:

> > Say i'm a developer and i have an app with such code:
> >
> > #define THOUSAND 1000
> >
> > static char array[THOUSAND][THOUSAND];
> >
> > int init_array(void)
> > {
> >        int i, j;
> >
> >        for (i = 0; i < THOUSAND; i++) {
> >                for (j = 0; j < THOUSAND; j++) {
> >                        array[j][i]++;
> >                }
> >        }
> >
> >        return 0;
> > }
> >
> > Pretty common stuff, right?
> >
> > Using the generalized cache events i can run:
> >
> >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> >  Performance counter stats for './array' (10 runs):
> >
> >         6,719,130 cycles:u                   ( +-   0.662% )
> >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> >
> >        0.003802098  seconds time elapsed   ( +-  13.395% )
> >
> > I consider that this is 'bad', because for almost every dcache-load there's a
> > dcache-miss - a 99% L1 cache miss rate!
> >
> > Then i think a bit, notice something, apply this performance optimization:
>
> I don't think this example is really representative of the kind of problems
> people face, it is just too small and obvious. [...]

Well, the overwhelming majority of performance problems are 'small and obvious'
- once a tool roughly pinpoints their existence and location!

And you have not offered a counter example either so you have not really
demonstrated what you consider a 'real' example and why you consider
generalized cache events inadequate.

> [...] So I would not generalize on it.

To the contrary, it demonstrates the most fundamental concept of cache
profiling: looking at the hits/misses ratios and identifying hotspots.

That concept can be applied pretty nicely to all sorts of applications.

Interestingly, the exact hardware event doesn't even *matter* for most problems, as
long as it *correlates* with the conceptual entity we want to measure.

So what we need are hardware events that correlate with:

- loads done
- stores done
- load misses suffered
- store misses suffered
- branches done
- branches missed
- instructions executed

It is the *ratio* that matters in most cases: before-change versus
after-change, hits versus misses, etc.
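
Concretely, with the numbers from the array example quoted in this mail:

  l1-dcache-load-misses / l1-dcache-loads = 1,003,604 / 1,037,032 ~= 0.97

before the fix, and about 3,955 / 1,035,731 ~= 0.004 after it - that ratio
alone tells the story, independently of the exact event encodings.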

Yes, there will be imprecisions, CPU quirks, limitations and speculation
effects - but as long as we keep our eyes on the ball, generalizations are
useful for solving practical problems.

> If you are happy with generalized cache events then, as I said, I am fine
> with it. But the API should ALWAYS allow users access to raw events when they
> need finer grain analysis.

Well, that's a pretty far cry from calling it a 'myth' :-)

So my point is (outlined in detail in the commit changelog) that we need sane
generalized remote DRAM events *first* - before we think about exposing the
'rest' of the offcore PMU as raw events.

> > diff --git a/array.c b/array.c
> > index 4758d9a..d3f7037 100644
> > --- a/array.c
> > +++ b/array.c
> > @@ -9,7 +9,7 @@ int init_array(void)
> >
> >        for (i = 0; i < THOUSAND; i++) {
> >                for (j = 0; j < THOUSAND; j++) {
> > -                       array[j][i]++;
> > +                       array[i][j]++;
> >                }
> >        }
> >
> > I re-run perf-stat:
> >
> >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> >  Performance counter stats for './array' (10 runs):
> >
> >         2,395,407 cycles:u                   ( +-   0.365% )
> >         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
> >         1,035,731 l1-dcache-loads:u          ( +-   0.006% )
> >             3,955 l1-dcache-load-misses:u    ( +-   4.872% )
> >
> >  - I got absolute numbers in the right ballpark figure: i got a million loads as
> >    expected (the array has 1 million elements), and 1 million cache-misses in
> >    the 'bad' case.
> >
> >  - I did not care which specific Intel CPU model this was running on
> >
> >  - I did not care about *any* microarchitectural details - i only knew it's a
> >    reasonably modern CPU with caching
> >
> >  - I did not care how i could get access to L1 load and miss events. The events
> >    were named obviously and it just worked.
> >
> > So no, kernel driven generalization and sane tooling is not at all a 'myth'
> > today, really.
> >
> > So this is the general direction in which we want to move on. If you know about
> > problems with existing generalization definitions then lets *fix* them, not
> > pretend that generalizations and sane workflows are impossible ...
>
> Again, to fix them, you need to give us definitions for what you expect those
> events to count. Otherwise we cannot make forward progress.

No, we do not 'need' to give exact definitions. This whole topic is more
analogous to physics than to mathematics. See my description above about how
ratios and high-level structure matter more than absolute values and
definitions.

Yes, where we can, 'loads' and 'stores' should correspond to the number of
loads a program flow does, which you get if you look at the assembly code.
'Instructions' should correspond to the number of instructions executed.

If the CPU cannot do it, it's not a huge deal in practice - we will cope and
hopefully it will all be fixed in future CPU versions.

That having said, most CPUs i have access to get the fundamentals right, so
it's not like we have huge problems in practice. Key CPU statistics are
available.

> Let me give just one simple example: cycles
>
> What your definition for the generic cycle event?
>
> There are various flavors:
> - count halted, unhalted cycles?

Again i think you are getting lost in too much detail.

For typical developers halted versus unhalted is mostly an uninteresting
distinction, as people tend to just type 'perf record ./myapp', which is per
workload profiling so it excludes idle time. So it would give the same result
to them regardless of whether it's halted or unhalted cycles.

( This simple example already shows the idiocy of the hardware names, calling
cycles events "CPU_CLK_UNHALTED.REF". In most cases the developer does *not*
care about those distinctions so the defaults should not be complicated with
them. )

> - impacted by frequency scaling?

The best default for developers is a frequency scaling invariant result - i.e.
one that is not against a reference clock but against the real CPU clock.

( Even that one will not be completely invariant due to the frequency-scaling
dependent cost of misses and bus ops, etc. )

But profiling against a reference frequency makes sense as well, especially for
system-wide profiling - this is the hardware equivalent of the cpu-clock /
elapsed time metric. We could implement the cpu-clock using reference cycles
events for example.

> LLC-misses:
> - what considered the LLC?

The last level cache is whichever cache sits before DRAM.

> - does it include code, data or both?

Both if possible as they tend to be unified caches anyway.

> - does it include demand, hw prefetch?

Do you mean for the LLC-prefetch events? What would be your suggestion, which
is the most useful metric? Prefetches are not directly done by program logic so
this is borderline. We wanted to include them for completeness - and the metric
should probably include 'all activities that program flow has not caused
directly and which may be sucking up system resources' - i.e. including hw
prefetch.

> - it is to local or remote dram?

The current definitions should include both.

Measuring remote DRAM accesses is of course useful - that is the original point
of this thread. It should be done as an additional layer - basically local RAM
is yet another cache level - but we can take other generalized approaches as
well, if they make more sense.

Thanks,

Ingo

2011-04-22 16:23:43

by Andi Kleen

Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, Apr 22, 2011 at 08:34:29AM +0200, Ingo Molnar wrote:
> This needs to be a *lot* more user friendly. Users do not want to type in
> stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
> era really.

I agree that the raw events are quite user unfriendly.

Unfortunately they are the way of life in perf -- unlike oprofile -- currently
if you want any CPU specific events like this.

Really to make sense out of all this you need per CPU full event lists.

I have an own wrapper to make it more user friendly, but its functionality
should arguably migrate into perf.

I did a patch to add a mapping file some time ago, but it likely
needs some improvements before it can be merged (aka not .39), like
auto selecting a suitable mapping file and backtranslating raw
mappings on output.

BTW the new perf lat code needs the raw events config1 specification
internally, so this is needed in some form anyway.

Short of that, the extended raw events are the best we can get short term, I
think. So I would prefer to have it for .39 to make this feature
usable at all.

I attached the old mapping file patch for your reference.
I also put up a few mapping files for Intel CPUs at
ftp://ftp.kernel.org/pub/linux/kernel/people/ak/pmu/*

e.g. to use it with Nehalem offcore events and this patch, today you would
use:

wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/pmu/nhm-ep.map
perf --map-file nhm-ep.map top -e offcore_response_0.any_data.local_cache_dram
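
( For reference, the patch below reads the map file line by line: '#' starts a
  comment, and each remaining line is the symbolic name followed by the string
  it expands to, separated by a single space or tab. So an nhm-ep.map entry
  would look roughly like this - the raw encoding on the right is made up
  purely for illustration:

	# name you type after -e   ->   expansion fed to the event parser
	offcore_response_0.any_data.local_cache_dram r1b7:4711
)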

-Andi

commit 37323c19ceb57101cc2160059c567ee14055b7c8
Author: Andi Kleen <[email protected]>
Date: Mon Nov 8 04:52:18 2010 +0100

mapping file support

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index a91f9f9..63bdbbb 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -120,6 +120,9 @@ Do not update the builid cache. This saves some overhead in situations
where the information in the perf.data file (which includes buildids)
is sufficient.

+--map-events=file
+Use file as event mapping file.
+
SEE ALSO
--------
linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 4b3a2d4..4f20af3 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -53,6 +53,9 @@ comma-sperated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2
In per-thread mode, this option is ignored. The -a option is still necessary
to activate system-wide monitoring. Default is to count on all CPUs.

+--map-events=file
+Use file as event mapping file.
+
EXAMPLES
--------

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 93bd2ff..6fdf892 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -794,6 +794,9 @@ const struct option record_options[] = {
 	OPT_CALLBACK('e', "event", NULL, "event",
 		     "event selector. use 'perf list' to list available events",
 		     parse_events),
+	OPT_CALLBACK(0, "map-events", NULL, "map-events",
+		     "specify mapping file for events",
+		     map_events),
 	OPT_CALLBACK(0, "filter", NULL, "filter",
 		     "event filter", parse_filter),
 	OPT_INTEGER('p', "pid", &target_pid,
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index a6b4d44..f21f307 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -525,6 +525,9 @@ static const struct option options[] = {
 	OPT_CALLBACK('e', "event", NULL, "event",
 		     "event selector. use 'perf list' to list available events",
 		     parse_events),
+	OPT_CALLBACK(0, "map-events", NULL, "map-events",
+		     "specify mapping file for events",
+		     map_events),
 	OPT_BOOLEAN('i', "no-inherit", &no_inherit,
 		    "child tasks do not inherit counters"),
 	OPT_INTEGER('p', "pid", &target_pid,
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4af5bd5..2cc7b3d 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -83,6 +83,14 @@ static const char *sw_event_names[] = {
 	"emulation-faults",
 };
 
+struct mapping {
+	const char *str;
+	const char *res;
+};
+
+static int mapping_max;
+static struct mapping *mappings;
+
 #define MAX_ALIASES 8
 
 static const char *hw_cache[][MAX_ALIASES] = {
@@ -731,12 +739,28 @@ parse_event_modifier(const char **strp, struct perf_event_attr *attr)
 	return 0;
 }
 
+static int cmp_mapping(const void *a, const void *b)
+{
+	const struct mapping *am = a;
+	const struct mapping *bm = b;
+	return strcmp(am->str, bm->str);
+}
+
+static const char *
+get_event_mapping(const char *str)
+{
+	struct mapping key = { .str = str };
+	struct mapping *r = bsearch(&key, mappings, mapping_max,
+				    sizeof(struct mapping), cmp_mapping);
+	return r ? r->res : NULL;
+}
+
 /*
  * Each event can have multiple symbolic names.
  * Symbolic names are (almost) exactly matched.
  */
 static enum event_result
-parse_event_symbols(const char **str, struct perf_event_attr *attr)
+do_parse_event_symbols(const char **str, struct perf_event_attr *attr)
 {
 	enum event_result ret;
 
@@ -774,6 +798,15 @@ modifier:
 	return ret;
 }
 
+static enum event_result
+parse_event_symbols(const char **str, struct perf_event_attr *attr)
+{
+	const char *map = get_event_mapping(*str);
+	if (map)
+		*str = map;
+	return do_parse_event_symbols(str, attr);
+}
+
 static int store_event_type(const char *orgname)
 {
 	char filename[PATH_MAX], *c;
@@ -963,3 +996,54 @@ void print_events(void)
 
 	exit(129);
 }
+
+int map_events(const struct option *opt __used, const char *str,
+	       int unset __used)
+{
+	FILE *f;
+	char *line = NULL;
+	size_t linelen = 0;
+	char *p;
+	int lineno = 0;
+	static int mapping_size;
+	struct mapping *map;
+
+	f = fopen(str, "r");
+	if (!f) {
+		pr_err("Cannot open event map file");
+		return -1;
+	}
+	while (getline(&line, &linelen, f) > 0) {
+		lineno++;
+		p = strpbrk(line, "\n#");
+		if (p)
+			*p = 0;
+		p = line + strspn(line, " \t");
+		if (*p == 0)
+			continue;
+		if (mapping_max >= mapping_size) {
+			if (!mapping_size)
+				mapping_size = 2048;
+			mapping_size *= 2;
+			mappings = realloc(mappings,
+					   mapping_size * sizeof(struct mapping));
+			if (!mappings) {
+				pr_err("Out of memory\n");
+				exit(ENOMEM);
+			}
+		}
+		map = &mappings[mapping_max++];
+		map->str = strsep(&p, " \t");
+		map->res = strsep(&p, " \t");
+		if (!map->str || !map->res) {
+			fprintf(stderr, "%s:%d: Invalid line in map file\n",
+				str, lineno);
+		}
+		line = NULL;
+		linelen = 0;
+	}
+	fclose(f);
+	qsort(mappings, mapping_max, sizeof(struct mapping),
+	      cmp_mapping);
+	return 0;
+}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index fc4ab3f..1d6df9c 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -33,5 +33,6 @@ extern void print_events(void);
extern char debugfs_path[];
extern int valid_debugfs_mount(const char *debugfs);

+extern int map_events(const struct option *opt, const char *str, int unset);

#endif /* __PERF_PARSE_EVENTS_H */

2011-04-22 16:52:25

by Andi Kleen

Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

> Once you have clear and precise definition, then we can look at the actual
> events and figure out a mapping.

It's unclear this can be even done. Linux runs on a wide variety of micro
architectures, with all kinds of cache architectures.

Micro architectures are so different. I suspect a "generic" definition would
need to be so vague as to be useless.

This in general seems to be the problem of the current cache events.

Overall for any interesting analysis you need to go CPU specific.
Abstracted performance analysis is a contradiction in terms.

-Andi

--
[email protected] -- Speaking for myself only

2011-04-22 16:59:19

by Arun Sharma

Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote:
>
> Using the generalized cache events i can run:
>
> $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>
> Performance counter stats for './array' (10 runs):
>
> 6,719,130 cycles:u ( +- 0.662% )
> 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% )
> 1,037,032 l1-dcache-loads:u ( +- 0.009% )
> 1,003,604 l1-dcache-load-misses:u ( +- 0.003% )
>
> 0.003802098 seconds time elapsed ( +- 13.395% )
>
> I consider that this is 'bad', because for almost every dcache-load there's a
> dcache-miss - a 99% L1 cache miss rate!

One could argue that all you need is cycles and instructions. If there is an
expensive load, you'll see that the load instruction takes many cycles and
you can infer that it's a cache miss.

Questions app developers typically ask me:

* If I fix all my top 5 L3 misses how much faster will my app go?
* Am I bottlenecked on memory bandwidth?
* I have 4 L3 misses every 1000 instructions and 15 branch mispredicts per
1000 instructions. Which one should I focus on?

It's hard to answer some of these without access to all events.
While your approach of having generic events for commonly used counters
might be useful for some use cases, I don't see why exposing all vendor
defined events is harmful.

A clear statement on the last point would be helpful.

-Arun

2011-04-22 17:01:39

by Andi Kleen

Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

> One could argue that all you need is cycles and instructions. If there is an
> expensive load, you'll see that the load instruction takes many cycles and
> you can infer that it's a cache miss.

That only works when you can actually recognize which of the last load
instructions caused the problem. Skid on out-of-order CPUs often makes it very
hard to identify which actual load caused the long stall (or whether they all
stalled).

There's a way around it of course using advanced events, but not with
cycles.

> I don't see why exposing all vendor defined events is harmful

Simple: without vendor events you cannot answer a lot of questions.

-Andi

--
[email protected] -- Speaking for myself only

2011-04-22 19:54:57

by Ingo Molnar

Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Andi Kleen <[email protected]> wrote:

> On Fri, Apr 22, 2011 at 08:34:29AM +0200, Ingo Molnar wrote:
> > This needs to be a *lot* more user friendly. Users do not want to type in
> > stupid hexa magic numbers to get profiling. We have moved beyond the oprofile
> > era really.
>
> I agree that the raw events are quite user unfriendly.
>
> Unfortunately they are the way of life in perf -- unlike oprofile --
> currently if you want any CPU specific events like this.

Not sure where you take that blanket statement from, but no, raw events are not
really the 'way of life' - judging by the various user feedback we get they
come up pretty rarely.

The thing is, most people just use the default 'perf record' and that's it -
they do not even care about a *single* event - they just want to profile their
code somehow.

Then the second most popular event category are the generalized events, the
ones you can see in perf list output:

cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]

cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
minor-faults [Software event]
major-faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
alignment-faults [Software event]
emulation-faults [Software event]

L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-prefetches [Hardware cache event]
L1-icache-prefetch-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-stores [Hardware cache event]
LLC-store-misses [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-prefetch-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-prefetches [Hardware cache event]
dTLB-prefetch-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
branch-load-misses [Hardware cache event]

These are useful but are used less frequently.

Then come tracepoint based events - and as a distant last, come raw events.
Yes, raw events are useful occasionally, just like modifying applications
via a hexa editor is useful occasionally. If done often, we'd better abstract it
out.

> Really to make sense out of all this you need per CPU full event lists.

To make sense out of what? You are making very sweeping yet vague statements.

> I have an own wrapper to make it more user friendly, but its functionality
> should arguably migrate into perf.

Uhm, no - your patch seems to reintroduce oprofile's horrible events files. We
really learned from that mistake and do not want to step back ...

Please see the detailed mails i wrote in this thread: what we want is to extend
and improve existing generalizations of events. The useful bits of the offcore
PMU fit nicely into that scheme.

Thanks,

Ingo

2011-04-22 19:57:40

by Ingo Molnar

Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Andi Kleen <[email protected]> wrote:

> > Once you have clear and precise definition, then we can look at the actual
> > events and figure out a mapping.
>
> It's unclear this can be even done. Linux runs on a wide variety of micro
> architectures, with all kinds of cache architectures.
>
> Micro architectures are so different. I suspect a "generic" definition would
> need to be so vague as to be useless.

Not really. I gave a very specific example which solved a common and real
problem, using L1-loads and L1-load-misses events.

> This in general seems to be the problem of the current cache events.
>
> Overall for any interesting analysis you need to go CPU specific. Abstracted
> performance analysis is a contradiction in terms.

Nothing of what i did in that example was CPU or microarchitecture specific.

Really, you are making this more complex than it really is. Just check the
cache profiling example i gave; it works just fine today.

Thanks,

Ingo

2011-04-22 20:31:05

by Ingo Molnar

Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* [email protected] <[email protected]> wrote:

> On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote:
> >
> > Using the generalized cache events i can run:
> >
> > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> > Performance counter stats for './array' (10 runs):
> >
> > 6,719,130 cycles:u ( +- 0.662% )
> > 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% )
> > 1,037,032 l1-dcache-loads:u ( +- 0.009% )
> > 1,003,604 l1-dcache-load-misses:u ( +- 0.003% )
> >
> > 0.003802098 seconds time elapsed ( +- 13.395% )
> >
> > I consider that this is 'bad', because for almost every dcache-load there's a
> > dcache-miss - a 99% L1 cache miss rate!
>
> One could argue that all you need is cycles and instructions. [...]

Yes, and note that with instructions events we even have skid-less PEBS
profiling, so the precise instruction can be seen.

> [...] If there is an expensive load, you'll see that the load instruction
> takes many cycles and you can infer that it's a cache miss.
>
> Questions app developers typically ask me:
>
> * If I fix all my top 5 L3 misses how much faster will my app go?

This has come up: we could add a 'stalled/idle-cycles' generic event - i.e.
cycles spent without performing useful work in the pipelines. (Resource-stall
events on Intel CPUs.)

Then you would profile L3 misses (there's a generic event for that), plus
stalls, and the answer to your question would be the percentage of hits you get
in the stalled-cycles profile, multiplied by the stalled-cycles/cycles ratio.
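
( To make that concrete with made-up numbers, purely for illustration: if
  stalled cycles are 30% of all cycles, and 40% of the stalled-cycles samples
  land on your top 5 L3 miss sites, then fixing those misses can buy you at
  most roughly 0.40 * 0.30 = 12% of wall time. )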

> * Am I bottlenecked on memory bandwidth?

This would be a variant of the measurement above: say the top 90% of L3 misses
profile-correlated with stalled-cycles, relative to total-cycles. If you get
'90% of L3 misses cause a 1% wall-time slowdown' then you are not memory
bottlenecked. If the answer is '35% slowdown' then you are memory bottlenecked.

> * I have 4 L3 misses every 1000 instructions and 15 branch mispredicts per
> 1000 instructions. Which one should I focus on?

AFAICS this would be another variant of stalled-cycles measurements: you create
a stalled-cycles profile and check whether the top hits are branches or memory
loads.

> It's hard to answer some of these without access to all events.

I'm curious, how would you measure these properties - do you have some
different events in mind?

> While your approach of having generic events for commonly used counters might
> be useful for some use cases, I don't see why exposing all vendor defined
> events is harmful.
>
> A clear statement on the last point would be helpful.

Well, the thing is, i think users are helped most if we add useful, high-level
PMU features and not just an opaque raw event pass-through engine. The
problem with low-level raw ABIs is that the tool space fragments into a zillion
small hacks and there's no good concentration of know-how. I'd like the art of
performance measurement to be generalized out, as well as it can be.

We had this discussion in the big perf-counters flamewars 2+ years ago, where
one side wanted raw events, while we wanted intelligent kernel-side
abstractions and generalizations. I think the abstraction and generalization
angle worked out very well in practice - but we are having this discussion
again and again :-)

As i stated it in my prior mails, i'm not against raw events as a rare
exception channel - that increases utility. I'm against what was attempted
here: an extension to raw events as the *primary* channel for DRAM measurement
features. That is just sloppy and *reduces* utility.

I'm very simple-minded: when i see reduced utility i become sad :)

Thanks,

Ingo

2011-04-22 20:31:58

by Stephane Eranian

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
> > > Say i'm a developer and i have an app with such code:
> > >
> > > #define THOUSAND 1000
> > >
> > > static char array[THOUSAND][THOUSAND];
> > >
> > > int init_array(void)
> > > {
> > >        int i, j;
> > >
> > >        for (i = 0; i < THOUSAND; i++) {
> > >                for (j = 0; j < THOUSAND; j++) {
> > >                        array[j][i]++;
> > >                }
> > >        }
> > >
> > >        return 0;
> > > }
> > >
> > > Pretty common stuff, right?
> > >
> > > Using the generalized cache events i can run:
> > >
> > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > >
> > >  Performance counter stats for './array' (10 runs):
> > >
> > >         6,719,130 cycles:u                   ( +-   0.662% )
> > >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> > >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> > >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> > >
> > >        0.003802098  seconds time elapsed   ( +-  13.395% )
> > >
> > > I consider that this is 'bad', because for almost every dcache-load there's a
> > > dcache-miss - a 99% L1 cache miss rate!
> > >
> > > Then i think a bit, notice something, apply this performance optimization:
> >
> > I don't think this example is really representative of the kind of problems
> > people face, it is just too small and obvious. [...]
>
> Well, the overwhelming majority of performance problems are 'small and obvious'

Problems are not simple. Most serious applications these days are huge, hundreds
of MB of text, if not GB.

In your artificial example, you knew the answer before you started the
measurement.

Most of the time, applications are assembled out of hundreds of libraries, so no
single developer knows all the code. Thus, the performance analyst is faced with
a black box most of the time.

Let's go back to your example.
Performance counter stats for './array' (10 runs):

         6,719,130 cycles:u                   ( +-   0.662% )
         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )

Looking at this I don't think you can pinpoint which function has a problem and
whether or not there is a real problem. You need to evaluate the penalty. Once
you know that you can estimate any potential gain from fixing the code. Arun
pointed that out rightfully in his answer. How do you know the penalty if you
don't decompose some more?

Just because you see cache misses does not mean they are harmful. They could be
generated by the HW prefetchers and be harmless to your program. Does
l1-dcache-load-misses include HW prefetchers? Suppose it does on Intel Nehalem.
Do you guarantee it does on AMD Shanghai as well? If not, then how do I compare?


As I said, if you are only interested in generic events, then that's fine with
me, as long as you don't prevent others from accessing raw events.

As people have pointed out, I don't see why raw events would be harmful to
users. If you don't want to learn about them, just stick with the generic
events.


> - once a tool roughly pinpoints their existence and location!
>
> And you have not offered a counter example either so you have not really
> demonstrated what you consider a 'real' example and why you consider
> generalized cache events inadequate.
>
> > [...] So I would not generalize on it.
>
> To the contrary, it demonstrates the most fundamental concept of cache
> profiling: looking at the hits/misses ratios and identifying hotspots.
>
> That concept can be applied pretty nicely to all sorts of applications.
>
> Interestingly, the exact hardware event doesn't even *matter* for most problems, as
> long as it *correlates* with the conceptual entity we want to measure.
>
> So what we need are hardware events that correlate with:
>
>  - loads done
>  - stores done
>  - load misses suffered
>  - store misses suffered
>  - branches done
>  - branches missed
>  - instructions executed
>
> It is the *ratio* that matters in most cases: before-change versus
> after-change, hits versus misses, etc.
>
> Yes, there will be imprecisions, CPU quirks, limitations and speculation
> effects - but as long as we keep our eyes on the ball, generalizations are
> useful for solving practical problems.
>
> > If you are happy with generalized cache events then, as I said, I am fine
> > with it. But the API should ALWAYS allow users access to raw events when they
> > need finer grain analysis.
>
> Well, that's a pretty far cry from calling it a 'myth' :-)
>
> So my point is (outlined in detail in the common changelog) that we need sane
> generalized remote DRAM events *first* - before we think about exposing the
> 'rest' of the offcore-PMU as raw events.
>
> > > diff --git a/array.c b/array.c
> > > index 4758d9a..d3f7037 100644
> > > --- a/array.c
> > > +++ b/array.c
> > > @@ -9,7 +9,7 @@ int init_array(void)
> > >
> > >        for (i = 0; i < THOUSAND; i++) {
> > >                for (j = 0; j < THOUSAND; j++) {
> > > -                       array[j][i]++;
> > > +                       array[i][j]++;
> > >                }
> > >        }
> > >
> > > I re-run perf-stat:
> > >
> > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > >
> > >  Performance counter stats for './array' (10 runs):
> > >
> > >         2,395,407 cycles:u                   ( +-   0.365% )
> > >         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
> > >         1,035,731 l1-dcache-loads:u          ( +-   0.006% )
> > >             3,955 l1-dcache-load-misses:u    ( +-   4.872% )
> > >
> > >  - I got absolute numbers in the right ballpark figure: i got a million loads as
> > >   expected (the array has 1 million elements), and 1 million cache-misses in
> > >   the 'bad' case.
> > >
> > >  - I did not care which specific Intel CPU model this was running on
> > >
> > >  - I did not care about *any* microarchitectural details - i only knew it's a
> > >   reasonably modern CPU with caching
> > >
> > >  - I did not care how i could get access to L1 load and miss events. The events
> > >   were named obviously and it just worked.
> > >
> > > So no, kernel driven generalization and sane tooling is not at all a 'myth'
> > > today, really.
> > >
> > > So this is the general direction in which we want to move on. If you know about
> > > problems with existing generalization definitions then lets *fix* them, not
> > > pretend that generalizations and sane workflows are impossible ...
> >
> > Again, to fix them, you need to give us definitions for what you expect those
> > events to count. Otherwise we cannot make forward progress.
>
> No, we do not 'need' to give exact definitions. This whole topic is more
> analogous to physics than to mathematics. See my description above about how
> ratios and high level structure matters more than absolute values and
> definitions.
>
> Yes, if we can then 'loads' and 'stores' should correspond to the number of
> loads a program flow does, which you get if you look at the assembly code.
> 'Instructions' should correspond to the number of instructions executed.
>
> If the CPU cannot do it it's not a huge deal in practice - we will cope and
> hopefully it will all be fixed in future CPU versions.
>
> That having said, most CPUs i have access to get the fundamentals right, so
> it's not like we have huge problems in practice. Key CPU statistics are
> available.
>
> > Let me give just one simple example: cycles
> >
> > What your definition for the generic cycle event?
> >
> > There are various flavors:
> >   - count halted, unhalted cycles?
>
> Again i think you are getting lost in too much detail.
>
> For typical developers halted versus unhalted is mostly an uninteresting
> distinction, as people tend to just type 'perf record ./myapp', which is per
> workload profiling so it excludes idle time. So it would give the same result
> to them regardless of whether it's halted or unhalted cycles.
>
> ( This simple example already shows the idiocy of the hardware names, calling
>  cycles events "CPU_CLK_UNHALTED.REF". In most cases the developer does *not*
>  care about those distinctions so the defaults should not be complicated with
>  them. )
>
> >   - impacted by frequency scaling?
>
> The best default for developers is a frequency scaling invariant result - i.e.
> one that is not against a reference clock but against the real CPU clock.
>
> ( Even that one will not be completely invariant due to the frequency-scaling
>  dependent cost of misses and bus ops, etc. )
>
> But profiling against a reference frequency makes sense as well, especially for
> system-wide profiling - this is the hardware equivalent of the cpu-clock /
> elapsed time metric. We could implement the cpu-clock using reference cycles
> events for example.
>
> > LLC-misses:
> >   - what considered the LLC?
>
> The last level cache is whichever cache sits before DRAM.
>
> >   - does it include code, data or both?
>
> Both if possible as they tend to be unified caches anyway.
>
> >   - does it include demand, hw prefetch?
>
> Do you mean for the LLC-prefetch events? What would be your suggestion, which
> is the most useful metric? Prefetches are not directly done by program logic so
> this is borderline. We wanted to include them for completeness - and the metric
> should probably include 'all activities that program flow has not caused
> directly and which may be sucking up system resources' - i.e. including hw
> prefetch.
>
> >   - it is to local or remote dram?
>
> The current definitions should include both.
>
> Measuring remote DRAM accesss is of course useful - that is the original point
> of this thread. It should be done as an additional layer, basically local RAM
> is yet another cache level - but we can take other generalized approach as
> well, if they make more sense.
>
> Thanks,
>
>        Ingo

2011-04-22 20:32:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Ingo Molnar <[email protected]> wrote:

> * [email protected] <[email protected]> wrote:
>
> > On Fri, Apr 22, 2011 at 12:52:11PM +0200, Ingo Molnar wrote:
> > >
> > > Using the generalized cache events i can run:
> > >
> > > $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > >
> > > Performance counter stats for './array' (10 runs):
> > >
> > > 6,719,130 cycles:u ( +- 0.662% )
> > > 5,084,792 instructions:u # 0.757 IPC ( +- 0.000% )
> > > 1,037,032 l1-dcache-loads:u ( +- 0.009% )
> > > 1,003,604 l1-dcache-load-misses:u ( +- 0.003% )
> > >
> > > 0.003802098 seconds time elapsed ( +- 13.395% )
> > >
> > > I consider that this is 'bad', because for almost every dcache-load there's a
> > > dcache-miss - a 99% L1 cache miss rate!
> >
> > One could argue that all you need is cycles and instructions. [...]
>
> Yes, and note that with instructions events we even have skid-less PEBS
> profiling so seeing the precise .
- location of instructions is possible.

[ An email gremlin ate that part of the sentence. ]

Thanks,

Ingo

2011-04-22 20:48:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Stephane Eranian <[email protected]> wrote:

> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * Stephane Eranian <[email protected]> wrote:
> >
> > > > Say i'm a developer and i have an app with such code:
> > > >
> > > > #define THOUSAND 1000
> > > >
> > > > static char array[THOUSAND][THOUSAND];
> > > >
> > > > int init_array(void)
> > > > {
> > > >         int i, j;
> > > >
> > > >         for (i = 0; i < THOUSAND; i++) {
> > > >                 for (j = 0; j < THOUSAND; j++) {
> > > >                         array[j][i]++;
> > > >                 }
> > > >         }
> > > >
> > > >         return 0;
> > > > }
> > > >
> > > > Pretty common stuff, right?
> > > >
> > > > Using the generalized cache events i can run:
> > > >
> > > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> > > >
> > > >  Performance counter stats for './array' (10 runs):
> > > >
> > > >          6,719,130 cycles:u                   ( +-   0.662% )
> > > >          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> > > >          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> > > >          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> > > >
> > > >         0.003802098  seconds time elapsed   ( +-  13.395% )
> > > >
> > > > I consider that this is 'bad', because for almost every dcache-load there's a
> > > > dcache-miss - a 99% L1 cache miss rate!
> > > >
> > > > Then i think a bit, notice something, apply this performance optimization:
> > >
> > > I don't think this example is really representative of the kind of problems
> > > people face, it is just too small and obvious. [...]
> >
> > Well, the overwhelming majority of performance problems are 'small and obvious'
>
> Problems are not simple. Most serious applications these days are huge,
> hundreds of MB of text, if not GB.
>
> In your artificial example, you knew the answer before you started the
> measurement.
>
> Most of the time, applications are assembled out of hundreds of libraries, so
> no single developer knows all the code. Thus, the performance analyst is
> faced with a black box most of the time.

I isolated out an example and assumed that you'd agree that identifying hot
spots is trivial with generic cache events.

My assumption was wrong so let me show you how trivial it really is.

Here's an example with *two* problematic functions (but it could have hundreds,
it does not matter):

-------------------------------->
#define THOUSAND 1000

static char array1[THOUSAND][THOUSAND];

static char array2[THOUSAND][THOUSAND];

void func1(void)
{
        int i, j;

        for (i = 0; i < THOUSAND; i++)
                for (j = 0; j < THOUSAND; j++)
                        array1[i][j]++;
}

void func2(void)
{
        int i, j;

        for (i = 0; i < THOUSAND; i++)
                for (j = 0; j < THOUSAND; j++)
                        array2[j][i]++;
}

int main(void)
{
        for (;;) {
                func1();
                func2();
        }

        return 0;
}
<--------------------------------

We do not know which one has the cache-misses problem, func1() or func2(), it's
all a black box, right?

Using generic cache events you simply type this:

$ perf top -e l1-dcache-load-misses -e l1-dcache-loads

And you get such output:

PerfTop: 1923 irqs/sec kernel: 0.0% exact: 0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u], (all, 16 CPUs)
-------------------------------------------------------------------------------------------------------

weight samples pcnt funct DSO
______ _______ _____ _____ ______________________

1.9 6184 98.8% func2 /home/mingo/opt/array2
0.0 69 1.1% func1 /home/mingo/opt/array2

It has pinpointed the problem in func2 *very* precisely.

Obviously this can be used to analyze larger apps as well, with thousands of
functions, to pinpoint cachemiss problems in specific functions.

Thanks,

Ingo

2011-04-22 21:04:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Stephane Eranian <[email protected]> wrote:

> Let's go back to your example.
> Performance counter stats for './array' (10 runs):
>
>          6,719,130 cycles:u                   ( +-   0.662% )
>          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
>
> Looking at this I don't think you can pinpoint which function has a problem
> [...]

In my previous mail i showed how to pinpoint specific functions. You bring up
an interesting question, cost/benefit analysis:

> [...] and whether or not there is a real problem. You need to evaluate the
> penalty. Once you know that you can estimate any potential gain from fixing
> the code. Arun pointed that out rightfully in his answer. How do you know the
> penalty if you don't decompose some more?

We can measure that even with today's tooling - which doesn't do cost/benefit
analysis out of the box. In my previous example i showed the cachemisses profile:

weight samples pcnt funct DSO
______ _______ _____ _____ ______________________

1.9 6184 98.8% func2 /home/mingo/opt/array2
0.0 69 1.1% func1 /home/mingo/opt/array2

and here's the cycles profile:

samples pcnt funct DSO
_______ _____ _____ ______________________

2555.00 67.4% func2 /home/mingo/opt/array2
1220.00 32.2% func1 /home/mingo/opt/array2

So, given that there were no other big miss sources:

$ perf stat -a -e branch-misses:u -e l1-dcache-load-misses:u -e l1-dcache-store-misses:u -e l1-icache-load-misses:u sleep 1

Performance counter stats for 'sleep 1':

70,674 branch-misses:u
347,992,027 l1-dcache-load-misses:u
1,779 l1-dcache-store-misses:u
8,007 l1-icache-load-misses:u

1.000982021 seconds time elapsed

I can tell you that by fixing the cache-misses in that function, the code will
be roughly 33% faster.
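
( The arithmetic behind that estimate, using the two profiles above: func2 has
  ~67% of the cycles and essentially all of the misses, while func1 - which
  does the same amount of work without the miss problem - has ~32%. If fixing
  the misses makes func2 about as cheap as func1, the total cost drops from
  roughly 67 + 32 = 99 units to 32 + 32 = 64 units, i.e. by about a third. )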

So i fixed the bug. Before the fix, 100 iterations of func1+func2 took ~300 msecs:

$ perf stat -e cpu-clock --repeat 10 ./array2

Performance counter stats for './array2' (10 runs):

298.405074 cpu-clock ( +- 1.823% )

After the fix it took 190 msecs:

$ perf stat -e cpu-clock --repeat 10 ./array2

Performance counter stats for './array2' (10 runs):

189.409569 cpu-clock ( +- 0.019% )

0.190007596 seconds time elapsed ( +- 0.025% )

That is 63% of the original runtime - 37% faster. And no, i first did the
calculation, then did the measurement of the optimized code.

Now it would be nice to automate such analysis some more within perf - but i
think i have established the principle well enough that we can use generic
cache events for such measurements.

Also, we could certainly add more generic events - a stalled-cycles event would
certainly be useful for example, to collect all (or at least most) 'harmful
delays' the execution flow can experience. Want to take a stab at that patch?
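
( To sketch the shape of such an addition - hypothetical only, not a worked-out
  patch; the name below is made up and each PMU driver would still have to map
  it to whatever 'resource stall' event its hardware provides: )

enum perf_hw_id {
        PERF_COUNT_HW_CPU_CYCLES                = 0,
        PERF_COUNT_HW_INSTRUCTIONS              = 1,
        PERF_COUNT_HW_CACHE_REFERENCES          = 2,
        PERF_COUNT_HW_CACHE_MISSES              = 3,
        PERF_COUNT_HW_BRANCH_INSTRUCTIONS       = 4,
        PERF_COUNT_HW_BRANCH_MISSES             = 5,
        PERF_COUNT_HW_BUS_CYCLES                = 6,

        /* hypothetical: cycles in which the pipelines did no useful work */
        PERF_COUNT_HW_STALLED_CYCLES            = 7,

        PERF_COUNT_HW_MAX,                      /* non-ABI */
};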

Thanks,

Ingo

2011-04-22 21:34:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, 2011-04-22 at 10:06 +0200, Ingo Molnar wrote:
>
> I'm about to push out the patch attached below - it lays out the arguments in
> detail. I don't think we have time to fix this properly for .39 - but memory
> profiling could be a nice feature for v2.6.40.

Does something like the below provide enough generic infrastructure to
allow the raw offcore bits again?

The below needs filling out for !x86 (which I filled out with
unsupported events) and x86 needs the offcore bits fixed to auto select
between the two offcore events.


Not-signed-off-by: /me
---
arch/arm/kernel/perf_event_v6.c | 28 ++++++
arch/arm/kernel/perf_event_v7.c | 28 ++++++
arch/arm/kernel/perf_event_xscale.c | 14 +++
arch/mips/kernel/perf_event_mipsxx.c | 28 ++++++
arch/powerpc/kernel/e500-pmu.c | 5 +
arch/powerpc/kernel/mpc7450-pmu.c | 5 +
arch/powerpc/kernel/power4-pmu.c | 5 +
arch/powerpc/kernel/power5+-pmu.c | 5 +
arch/powerpc/kernel/power5-pmu.c | 5 +
arch/powerpc/kernel/power6-pmu.c | 5 +
arch/powerpc/kernel/power7-pmu.c | 5 +
arch/powerpc/kernel/ppc970-pmu.c | 5 +
arch/sh/kernel/cpu/sh4/perf_event.c | 15 +++
arch/sh/kernel/cpu/sh4a/perf_event.c | 15 +++
arch/sparc/kernel/perf_event.c | 42 ++++++++
arch/x86/kernel/cpu/perf_event_amd.c | 14 +++
arch/x86/kernel/cpu/perf_event_intel.c | 167 +++++++++++++++++++++++++-------
arch/x86/kernel/cpu/perf_event_p4.c | 14 +++
include/linux/perf_event.h | 3 +-
19 files changed, 373 insertions(+), 35 deletions(-)

diff --git a/arch/arm/kernel/perf_event_v6.c b/arch/arm/kernel/perf_event_v6.c
index f1e8dd9..02178da 100644
--- a/arch/arm/kernel/perf_event_v6.c
+++ b/arch/arm/kernel/perf_event_v6.c
@@ -173,6 +173,20 @@ static const unsigned armv6_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

enum armv6mpcore_perf_types {
@@ -310,6 +324,20 @@ static const unsigned armv6mpcore_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

static inline unsigned long
diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c
index 4960686..79ffc83 100644
--- a/arch/arm/kernel/perf_event_v7.c
+++ b/arch/arm/kernel/perf_event_v7.c
@@ -255,6 +255,20 @@ static const unsigned armv7_a8_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

/*
@@ -371,6 +385,20 @@ static const unsigned armv7_a9_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

/*
diff --git a/arch/arm/kernel/perf_event_xscale.c b/arch/arm/kernel/perf_event_xscale.c
index 39affbe..7ed1a55 100644
--- a/arch/arm/kernel/perf_event_xscale.c
+++ b/arch/arm/kernel/perf_event_xscale.c
@@ -144,6 +144,20 @@ static const unsigned xscale_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

#define XSCALE_PMU_ENABLE 0x001
diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
index 75266ff..e5ad09a 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -377,6 +377,20 @@ static const struct mips_perf_event mipsxxcore_cache_map
[C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+},
};

/* 74K core has completely different cache event map. */
@@ -480,6 +494,20 @@ static const struct mips_perf_event mipsxx74Kcore_cache_map
[C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+},
};

#ifdef CONFIG_MIPS_MT_SMP
diff --git a/arch/powerpc/kernel/e500-pmu.c b/arch/powerpc/kernel/e500-pmu.c
index b150b51..cb2e294 100644
--- a/arch/powerpc/kernel/e500-pmu.c
+++ b/arch/powerpc/kernel/e500-pmu.c
@@ -75,6 +75,11 @@ static int e500_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static int num_events = 128;
diff --git a/arch/powerpc/kernel/mpc7450-pmu.c b/arch/powerpc/kernel/mpc7450-pmu.c
index 2cc5e03..845a584 100644
--- a/arch/powerpc/kernel/mpc7450-pmu.c
+++ b/arch/powerpc/kernel/mpc7450-pmu.c
@@ -388,6 +388,11 @@ static int mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

struct power_pmu mpc7450_pmu = {
diff --git a/arch/powerpc/kernel/power4-pmu.c b/arch/powerpc/kernel/power4-pmu.c
index ead8b3c..e9dbc2d 100644
--- a/arch/powerpc/kernel/power4-pmu.c
+++ b/arch/powerpc/kernel/power4-pmu.c
@@ -587,6 +587,11 @@ static int power4_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power4_pmu = {
diff --git a/arch/powerpc/kernel/power5+-pmu.c b/arch/powerpc/kernel/power5+-pmu.c
index eca0ac5..f58a2bd 100644
--- a/arch/powerpc/kernel/power5+-pmu.c
+++ b/arch/powerpc/kernel/power5+-pmu.c
@@ -653,6 +653,11 @@ static int power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power5p_pmu = {
diff --git a/arch/powerpc/kernel/power5-pmu.c b/arch/powerpc/kernel/power5-pmu.c
index d5ff0f6..b1acab6 100644
--- a/arch/powerpc/kernel/power5-pmu.c
+++ b/arch/powerpc/kernel/power5-pmu.c
@@ -595,6 +595,11 @@ static int power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power5_pmu = {
diff --git a/arch/powerpc/kernel/power6-pmu.c b/arch/powerpc/kernel/power6-pmu.c
index 3160392..b24a3a2 100644
--- a/arch/powerpc/kernel/power6-pmu.c
+++ b/arch/powerpc/kernel/power6-pmu.c
@@ -516,6 +516,11 @@ static int power6_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power6_pmu = {
diff --git a/arch/powerpc/kernel/power7-pmu.c b/arch/powerpc/kernel/power7-pmu.c
index 593740f..6d9dccb 100644
--- a/arch/powerpc/kernel/power7-pmu.c
+++ b/arch/powerpc/kernel/power7-pmu.c
@@ -342,6 +342,11 @@ static int power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power7_pmu = {
diff --git a/arch/powerpc/kernel/ppc970-pmu.c b/arch/powerpc/kernel/ppc970-pmu.c
index 9a6e093..b121de9 100644
--- a/arch/powerpc/kernel/ppc970-pmu.c
+++ b/arch/powerpc/kernel/ppc970-pmu.c
@@ -467,6 +467,11 @@ static int ppc970_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu ppc970_pmu = {
diff --git a/arch/sh/kernel/cpu/sh4/perf_event.c b/arch/sh/kernel/cpu/sh4/perf_event.c
index 748955d..fa4f724 100644
--- a/arch/sh/kernel/cpu/sh4/perf_event.c
+++ b/arch/sh/kernel/cpu/sh4/perf_event.c
@@ -180,6 +180,21 @@ static const int sh7750_cache_events
[ C(RESULT_MISS) ] = -1,
},
},
+
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static int sh7750_event_map(int event)
diff --git a/arch/sh/kernel/cpu/sh4a/perf_event.c b/arch/sh/kernel/cpu/sh4a/perf_event.c
index 17e6beb..84a2c39 100644
--- a/arch/sh/kernel/cpu/sh4a/perf_event.c
+++ b/arch/sh/kernel/cpu/sh4a/perf_event.c
@@ -205,6 +205,21 @@ static const int sh4a_cache_events
[ C(RESULT_MISS) ] = -1,
},
},
+
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static int sh4a_event_map(int event)
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index ee8426e..d890e0f 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -245,6 +245,20 @@ static const cache_map_t ultra3_cache_map = {
[ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED },
+ [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+},
};

static const struct sparc_pmu ultra3_pmu = {
@@ -360,6 +374,20 @@ static const cache_map_t niagara1_cache_map = {
[ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED },
+ [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+},
};

static const struct sparc_pmu niagara1_pmu = {
@@ -472,6 +500,20 @@ static const cache_map_t niagara2_cache_map = {
[ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED },
+ [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+},
};

static const struct sparc_pmu niagara2_pmu = {
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index cf4e369..01c7dd3 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -89,6 +89,20 @@ static __initconst const u64 amd_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = 0xb8e9, /* CPU Request to Memory, l+r */
+ [ C(RESULT_MISS) ] = 0x98e9, /* CPU Request to Memory, r */
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 43fa20b..225efa0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -184,26 +184,23 @@ static __initconst const u64 snb_hw_cache_event_ids
},
},
[ C(LL ) ] = {
- /*
- * TBD: Need Off-core Response Performance Monitoring support
- */
[ C(OP_READ) ] = {
- /* OFFCORE_RESPONSE_0.ANY_DATA.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.ANY_DATA.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
[ C(OP_WRITE) ] = {
- /* OFFCORE_RESPONSE_0.ANY_RFO.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.ANY_RFO.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.ANY_RFO.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.ANY_RFO.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
[ C(OP_PREFETCH) ] = {
- /* OFFCORE_RESPONSE_0.PREFETCH.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.PREFETCH.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.PREFETCH.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.PREFETCH.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
},
[ C(DTLB) ] = {
@@ -248,6 +245,20 @@ static __initconst const u64 snb_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ },
};

static __initconst const u64 westmere_hw_cache_event_ids
@@ -285,26 +296,26 @@ static __initconst const u64 westmere_hw_cache_event_ids
},
[ C(LL ) ] = {
[ C(OP_READ) ] = {
- /* OFFCORE_RESPONSE_0.ANY_DATA.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.ANY_DATA.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
/*
* Use RFO, not WRITEBACK, because a write miss would typically occur
* on RFO.
*/
[ C(OP_WRITE) ] = {
- /* OFFCORE_RESPONSE_1.ANY_RFO.LOCAL_CACHE */
- [ C(RESULT_ACCESS) ] = 0x01bb,
- /* OFFCORE_RESPONSE_0.ANY_RFO.ANY_LLC_MISS */
+ /* OFFCORE_RESPONSE.ANY_RFO.LOCAL_CACHE */
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ /* OFFCORE_RESPONSE.ANY_RFO.ANY_LLC_MISS */
[ C(RESULT_MISS) ] = 0x01b7,
},
[ C(OP_PREFETCH) ] = {
- /* OFFCORE_RESPONSE_0.PREFETCH.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.PREFETCH.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.PREFETCH.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.PREFETCH.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
},
[ C(DTLB) ] = {
@@ -349,19 +360,51 @@ static __initconst const u64 westmere_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ },
};

/*
* OFFCORE_RESPONSE MSR bits (subset), See IA32 SDM Vol 3 30.6.1.3
*/

-#define DMND_DATA_RD (1 << 0)
-#define DMND_RFO (1 << 1)
-#define DMND_WB (1 << 3)
-#define PF_DATA_RD (1 << 4)
-#define PF_DATA_RFO (1 << 5)
-#define RESP_UNCORE_HIT (1 << 8)
-#define RESP_MISS (0xf600) /* non uncore hit */
+#define DMND_DATA_RD (1 << 0)
+#define DMND_RFO (1 << 1)
+#define DMND_IFETCH (1 << 2)
+#define DMND_WB (1 << 3)
+#define PF_DATA_RD (1 << 4)
+#define PF_DATA_RFO (1 << 5)
+#define PF_IFETCH (1 << 6)
+#define OFFCORE_OTHER (1 << 7)
+#define UNCORE_HIT (1 << 8)
+#define OTHER_CORE_HIT_SNP (1 << 9)
+#define OTHER_CORE_HITM (1 << 10)
+ /* reserved */
+#define REMOTE_CACHE_FWD (1 << 12)
+#define REMOTE_DRAM (1 << 13)
+#define LOCAL_DRAM (1 << 14)
+#define NON_DRAM (1 << 15)
+
+#define ALL_DRAM (REMOTE_DRAM|LOCAL_DRAM)
+
+#define DMND_READ (DMND_DATA_RD)
+#define DMND_WRITE (DMND_RFO|DMND_WB)
+#define DMND_PREFETCH (PF_DATA_RD|PF_DATA_RFO)
+
+#define L3_HIT (UNCORE_HIT|OTHER_CORE_HIT_SNP|OTHER_CORE_HITM)
+#define L3_MISS (NON_DRAM|ALL_DRAM|REMOTE_CACHE_FWD)

static __initconst const u64 nehalem_hw_cache_extra_regs
[PERF_COUNT_HW_CACHE_MAX]
@@ -370,18 +413,32 @@ static __initconst const u64 nehalem_hw_cache_extra_regs
{
[ C(LL ) ] = {
[ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = DMND_DATA_RD|RESP_UNCORE_HIT,
- [ C(RESULT_MISS) ] = DMND_DATA_RD|RESP_MISS,
+ [ C(RESULT_ACCESS) ] = DMND_READ|L3_HIT,
+ [ C(RESULT_MISS) ] = DMND_READ|L3_MISS,
},
[ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = DMND_RFO|DMND_WB|RESP_UNCORE_HIT,
- [ C(RESULT_MISS) ] = DMND_RFO|DMND_WB|RESP_MISS,
+ [ C(RESULT_ACCESS) ] = DMND_WRITE|L3_HIT,
+ [ C(RESULT_MISS) ] = DMND_WRITE|L3_MISS,
},
[ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = PF_DATA_RD|PF_DATA_RFO|RESP_UNCORE_HIT,
- [ C(RESULT_MISS) ] = PF_DATA_RD|PF_DATA_RFO|RESP_MISS,
+ [ C(RESULT_ACCESS) ] = DMND_PREFETCH|L3_HIT,
+ [ C(RESULT_MISS) ] = DMND_PREFETCH|L3_MISS,
},
}
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = DMND_READ|ALL_DRAM,
+ [ C(RESULT_MISS) ] = DMND_READ|REMOTE_DRAM,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = DMND_WRITE|ALL_DRAM,
+ [ C(RESULT_MISS) ] = DMND_WRITE|REMOTE_DRAM,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = DMND_PREFETCH|ALL_DRAM,
+ [ C(RESULT_MISS) ] = DMND_PREFETCH|REMOTE_DRAM,
+ },
+ },
};

static __initconst const u64 nehalem_hw_cache_event_ids
@@ -483,6 +540,20 @@ static __initconst const u64 nehalem_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ },
};

static __initconst const u64 core2_hw_cache_event_ids
@@ -574,6 +645,20 @@ static __initconst const u64 core2_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static __initconst const u64 atom_hw_cache_event_ids
@@ -665,6 +750,20 @@ static __initconst const u64 atom_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static void intel_pmu_disable_all(void)
diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c
index 74507c1..e802c7e 100644
--- a/arch/x86/kernel/cpu/perf_event_p4.c
+++ b/arch/x86/kernel/cpu/perf_event_p4.c
@@ -554,6 +554,20 @@ static __initconst const u64 p4_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static u64 p4_general_events[PERF_COUNT_HW_MAX] = {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ee9f1e7..df4a841 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -59,7 +59,7 @@ enum perf_hw_id {
/*
* Generalized hardware cache events:
*
- * { L1-D, L1-I, LLC, ITLB, DTLB, BPU } x
+ * { L1-D, L1-I, LLC, ITLB, DTLB, BPU, NODE } x
* { read, write, prefetch } x
* { accesses, misses }
*/
@@ -70,6 +70,7 @@ enum perf_hw_cache_id {
PERF_COUNT_HW_CACHE_DTLB = 3,
PERF_COUNT_HW_CACHE_ITLB = 4,
PERF_COUNT_HW_CACHE_BPU = 5,
+ PERF_COUNT_HW_CACHE_NODE = 6,

PERF_COUNT_HW_CACHE_MAX, /* non-ABI */
};
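
( For illustration only - a minimal user-space sketch of how the generalized
  NODE events could be requested once the PERF_COUNT_HW_CACHE_NODE id from the
  hunk above exists; it relies on the existing hw-cache config encoding
  id | (op << 8) | (result << 16), and counts node read accesses for the
  calling task: )

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
        struct perf_event_attr attr;
        long long count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size   = sizeof(attr);
        attr.type   = PERF_TYPE_HW_CACHE;
        /* node-read-accesses: id | (op << 8) | (result << 16) */
        attr.config = PERF_COUNT_HW_CACHE_NODE |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16);
        attr.disabled       = 1;
        attr.exclude_kernel = 1;

        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... run the workload of interest here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("node reads: %lld\n", count);

        close(fd);
        return 0;
}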

2011-04-22 21:52:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote:
> The below needs filling out for !x86 (which I filled out with
> unsupported events) and x86 needs the offcore bits fixed to auto select
> between the two offcore events.

Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table.

2011-04-22 22:17:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, 2011-04-22 at 23:54 +0200, Peter Zijlstra wrote:
> On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote:
> > The below needs filling out for !x86 (which I filled out with
> > unsupported events) and x86 needs the offcore bits fixed to auto select
> > between the two offcore events.
>
> Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table.

/*
* Sandy Bridge MSR_OFFCORE_RESPONSE bits;
* See IA32 SDM Vol 3B 30.8.5
*/

#define SNB_DMND_DATA_RD (1 << 0)
#define SNB_DMND_RFO (1 << 1)
#define SNB_DMND_IFETCH (1 << 2)
#define SNB_DMND_WB (1 << 3)
#define SNB_PF_DATA_RD (1 << 4)
#define SNB_PF_DATA_RFO (1 << 5)
#define SNB_PF_IFETCH (1 << 6)
#define SNB_PF_LLC_DATA_RD (1 << 7)
#define SNB_PF_LLC_RFO (1 << 8)
#define SNB_PF_LLC_IFETCH (1 << 9)
#define SNB_BUS_LOCKS (1 << 10)
#define SNB_STRM_ST (1 << 11)
/* hole */
#define SNB_OFFCORE_OTHER (1 << 15)
#define SNB_COMMON (1 << 16)
#define SNB_NO_SUPP (1 << 17)
#define SNB_LLC_HITM (1 << 18)
#define SNB_LLC_HITE (1 << 19)
#define SNB_LLC_HITS (1 << 20)
#define SNB_LLC_HITF (1 << 21)
/* hole */
#define SNB_SNP_NONE (1ULL << 31)
#define SNB_SNP_NOT_NEEDED (1ULL << 32)
#define SNB_SNP_MISS (1ULL << 33)
#define SNB_SNP_NO_FWD (1ULL << 34)
#define SNB_SNP_FWD (1ULL << 35)
#define SNB_HITM (1ULL << 36)
#define SNB_NON_DRAM (1ULL << 37)

#define SNB_DMND_READ (SNB_DMND_DATA_RD)
#define SNB_DMND_WRITE (SNB_DMND_RFO|SNB_DMND_WB|SNB_STRM_ST)
#define SNB_DMND_PREFETCH (SNB_PF_DATA_RD|SNB_PF_DATA_RFO)

Is what I came up with, but I'm stumped on how to construct:

#define SNB_L3_HIT
#define SNB_L3_MISS

#define SNB_ALL_DRAM
#define SNB_REMOTE_DRAM

Anybody got clue?

2011-04-22 22:55:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, 2011-04-22 at 23:54 +0200, Peter Zijlstra wrote:
> On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote:
> > The below needs filling out for !x86 (which I filled out with
> > unsupported events) and x86 needs the offcore bits fixed to auto select
> > between the two offcore events.
>
> Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table.

Also, NHM offcore bits were wrong... it implemented _ACCESS as _HIT and
counted OTHER_CORE_HIT* as MISS even though it's clearly documented as an
L3 hit.

Current scribblings below..

---
arch/arm/kernel/perf_event_v6.c | 28 ++++
arch/arm/kernel/perf_event_v7.c | 28 ++++
arch/arm/kernel/perf_event_xscale.c | 14 ++
arch/mips/kernel/perf_event_mipsxx.c | 28 ++++
arch/powerpc/kernel/e500-pmu.c | 5 +
arch/powerpc/kernel/mpc7450-pmu.c | 5 +
arch/powerpc/kernel/power4-pmu.c | 5 +
arch/powerpc/kernel/power5+-pmu.c | 5 +
arch/powerpc/kernel/power5-pmu.c | 5 +
arch/powerpc/kernel/power6-pmu.c | 5 +
arch/powerpc/kernel/power7-pmu.c | 5 +
arch/powerpc/kernel/ppc970-pmu.c | 5 +
arch/sh/kernel/cpu/sh4/perf_event.c | 15 ++
arch/sh/kernel/cpu/sh4a/perf_event.c | 15 ++
arch/sparc/kernel/perf_event.c | 42 ++++++
arch/x86/kernel/cpu/perf_event_amd.c | 14 ++
arch/x86/kernel/cpu/perf_event_intel.c | 253 +++++++++++++++++++++++++++-----
arch/x86/kernel/cpu/perf_event_p4.c | 14 ++
include/linux/perf_event.h | 3 +-
19 files changed, 458 insertions(+), 36 deletions(-)

diff --git a/arch/arm/kernel/perf_event_v6.c b/arch/arm/kernel/perf_event_v6.c
index f1e8dd9..02178da 100644
--- a/arch/arm/kernel/perf_event_v6.c
+++ b/arch/arm/kernel/perf_event_v6.c
@@ -173,6 +173,20 @@ static const unsigned armv6_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

enum armv6mpcore_perf_types {
@@ -310,6 +324,20 @@ static const unsigned armv6mpcore_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

static inline unsigned long
diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c
index 4960686..79ffc83 100644
--- a/arch/arm/kernel/perf_event_v7.c
+++ b/arch/arm/kernel/perf_event_v7.c
@@ -255,6 +255,20 @@ static const unsigned armv7_a8_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

/*
@@ -371,6 +385,20 @@ static const unsigned armv7_a9_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

/*
diff --git a/arch/arm/kernel/perf_event_xscale.c b/arch/arm/kernel/perf_event_xscale.c
index 39affbe..7ed1a55 100644
--- a/arch/arm/kernel/perf_event_xscale.c
+++ b/arch/arm/kernel/perf_event_xscale.c
@@ -144,6 +144,20 @@ static const unsigned xscale_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
[C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
},
},
+ [C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = CACHE_OP_UNSUPPORTED,
+ [C(RESULT_MISS)] = CACHE_OP_UNSUPPORTED,
+ },
+ },
};

#define XSCALE_PMU_ENABLE 0x001
diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
index 75266ff..e5ad09a 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -377,6 +377,20 @@ static const struct mips_perf_event mipsxxcore_cache_map
[C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+},
};

/* 74K core has completely different cache event map. */
@@ -480,6 +494,20 @@ static const struct mips_perf_event mipsxx74Kcore_cache_map
[C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+ [C(OP_WRITE)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+ [C(OP_PREFETCH)] = {
+ [C(RESULT_ACCESS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ [C(RESULT_MISS)] = { UNSUPPORTED_PERF_EVENT_ID },
+ },
+},
};

#ifdef CONFIG_MIPS_MT_SMP
diff --git a/arch/powerpc/kernel/e500-pmu.c b/arch/powerpc/kernel/e500-pmu.c
index b150b51..cb2e294 100644
--- a/arch/powerpc/kernel/e500-pmu.c
+++ b/arch/powerpc/kernel/e500-pmu.c
@@ -75,6 +75,11 @@ static int e500_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static int num_events = 128;
diff --git a/arch/powerpc/kernel/mpc7450-pmu.c b/arch/powerpc/kernel/mpc7450-pmu.c
index 2cc5e03..845a584 100644
--- a/arch/powerpc/kernel/mpc7450-pmu.c
+++ b/arch/powerpc/kernel/mpc7450-pmu.c
@@ -388,6 +388,11 @@ static int mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

struct power_pmu mpc7450_pmu = {
diff --git a/arch/powerpc/kernel/power4-pmu.c b/arch/powerpc/kernel/power4-pmu.c
index ead8b3c..e9dbc2d 100644
--- a/arch/powerpc/kernel/power4-pmu.c
+++ b/arch/powerpc/kernel/power4-pmu.c
@@ -587,6 +587,11 @@ static int power4_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power4_pmu = {
diff --git a/arch/powerpc/kernel/power5+-pmu.c b/arch/powerpc/kernel/power5+-pmu.c
index eca0ac5..f58a2bd 100644
--- a/arch/powerpc/kernel/power5+-pmu.c
+++ b/arch/powerpc/kernel/power5+-pmu.c
@@ -653,6 +653,11 @@ static int power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power5p_pmu = {
diff --git a/arch/powerpc/kernel/power5-pmu.c b/arch/powerpc/kernel/power5-pmu.c
index d5ff0f6..b1acab6 100644
--- a/arch/powerpc/kernel/power5-pmu.c
+++ b/arch/powerpc/kernel/power5-pmu.c
@@ -595,6 +595,11 @@ static int power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power5_pmu = {
diff --git a/arch/powerpc/kernel/power6-pmu.c b/arch/powerpc/kernel/power6-pmu.c
index 3160392..b24a3a2 100644
--- a/arch/powerpc/kernel/power6-pmu.c
+++ b/arch/powerpc/kernel/power6-pmu.c
@@ -516,6 +516,11 @@ static int power6_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power6_pmu = {
diff --git a/arch/powerpc/kernel/power7-pmu.c b/arch/powerpc/kernel/power7-pmu.c
index 593740f..6d9dccb 100644
--- a/arch/powerpc/kernel/power7-pmu.c
+++ b/arch/powerpc/kernel/power7-pmu.c
@@ -342,6 +342,11 @@ static int power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu power7_pmu = {
diff --git a/arch/powerpc/kernel/ppc970-pmu.c b/arch/powerpc/kernel/ppc970-pmu.c
index 9a6e093..b121de9 100644
--- a/arch/powerpc/kernel/ppc970-pmu.c
+++ b/arch/powerpc/kernel/ppc970-pmu.c
@@ -467,6 +467,11 @@ static int ppc970_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(OP_WRITE)] = { -1, -1 },
[C(OP_PREFETCH)] = { -1, -1 },
},
+ [C(NODE)] = { /* RESULT_ACCESS RESULT_MISS */
+ [C(OP_READ)] = { -1, -1 },
+ [C(OP_WRITE)] = { -1, -1 },
+ [C(OP_PREFETCH)] = { -1, -1 },
+ },
};

static struct power_pmu ppc970_pmu = {
diff --git a/arch/sh/kernel/cpu/sh4/perf_event.c b/arch/sh/kernel/cpu/sh4/perf_event.c
index 748955d..fa4f724 100644
--- a/arch/sh/kernel/cpu/sh4/perf_event.c
+++ b/arch/sh/kernel/cpu/sh4/perf_event.c
@@ -180,6 +180,21 @@ static const int sh7750_cache_events
[ C(RESULT_MISS) ] = -1,
},
},
+
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static int sh7750_event_map(int event)
diff --git a/arch/sh/kernel/cpu/sh4a/perf_event.c b/arch/sh/kernel/cpu/sh4a/perf_event.c
index 17e6beb..84a2c39 100644
--- a/arch/sh/kernel/cpu/sh4a/perf_event.c
+++ b/arch/sh/kernel/cpu/sh4a/perf_event.c
@@ -205,6 +205,21 @@ static const int sh4a_cache_events
[ C(RESULT_MISS) ] = -1,
},
},
+
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static int sh4a_event_map(int event)
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index ee8426e..d890e0f 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -245,6 +245,20 @@ static const cache_map_t ultra3_cache_map = {
[ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED },
+ [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+},
};

static const struct sparc_pmu ultra3_pmu = {
@@ -360,6 +374,20 @@ static const cache_map_t niagara1_cache_map = {
[ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED },
+ [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+},
};

static const struct sparc_pmu niagara1_pmu = {
@@ -472,6 +500,20 @@ static const cache_map_t niagara2_cache_map = {
[ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
},
},
+[C(NODE)] = {
+ [C(OP_READ)] = {
+ [C(RESULT_ACCESS)] = { CACHE_OP_UNSUPPORTED },
+ [C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = { CACHE_OP_UNSUPPORTED },
+ [ C(RESULT_MISS) ] = { CACHE_OP_UNSUPPORTED },
+ },
+},
};

static const struct sparc_pmu niagara2_pmu = {
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index cf4e369..01c7dd3 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -89,6 +89,20 @@ static __initconst const u64 amd_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = 0xb8e9, /* CPU Request to Memory, l+r */
+ [ C(RESULT_MISS) ] = 0x98e9, /* CPU Request to Memory, r */
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 43fa20b..fe4e8b1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -150,6 +150,86 @@ static u64 intel_pmu_event_map(int hw_event)
return intel_perfmon_event_map[hw_event];
}

+/*
+ * Sandy Bridge MSR_OFFCORE_RESPONSE bits;
+ * See IA32 SDM Vol 3B 30.8.5
+ */
+
+#define SNB_DMND_DATA_RD (1 << 0)
+#define SNB_DMND_RFO (1 << 1)
+#define SNB_DMND_IFETCH (1 << 2)
+#define SNB_DMND_WB (1 << 3)
+#define SNB_PF_DATA_RD (1 << 4)
+#define SNB_PF_DATA_RFO (1 << 5)
+#define SNB_PF_IFETCH (1 << 6)
+#define SNB_PF_LLC_DATA_RD (1 << 7)
+#define SNB_PF_LLC_RFO (1 << 8)
+#define SNB_PF_LLC_IFETCH (1 << 9)
+#define SNB_BUS_LOCKS (1 << 10)
+#define SNB_STRM_ST (1 << 11)
+ /* hole */
+#define SNB_OFFCORE_OTHER (1 << 15)
+#define SNB_COMMON (1 << 16)
+#define SNB_NO_SUPP (1 << 17)
+#define SNB_LLC_HITM (1 << 18)
+#define SNB_LLC_HITE (1 << 19)
+#define SNB_LLC_HITS (1 << 20)
+#define SNB_LLC_HITF (1 << 21)
+ /* hole */
+#define SNB_SNP_NONE (1 << 31)
+#define SNB_SNP_NOT_NEEDED (1 << 32)
+#define SNB_SNP_MISS (1 << 33)
+#define SNB_SNP_NO_FWD (1 << 34)
+#define SNB_SNP_FWD (1 << 35)
+#define SNB_HITM (1 << 36)
+#define SNB_NON_DRAM (1 << 37)
+
+#define SNB_DMND_READ (SNB_DMND_DATA_RD)
+#define SNB_DMND_WRITE (SNB_DMND_RFO|SNB_DMND_WB|SNB_STRM_ST)
+#define SNB_DMND_PREFETCH (SNB_PF_DATA_RD|SNB_PF_DATA_RFO)
+
+#define SNB_L3_HIT ()
+#define SNB_L3_MISS ()
+#define SNB_L3_ACCESS (SNB_L3_HIT|SNB_L3_MISS)
+
+#define SNB_ALL_DRAM ()
+#define SNB_REMOTE_DRAM ()
+
+static __initconst const u64 snb_hw_cache_extra_regs
+ [PERF_COUNT_HW_CACHE_MAX]
+ [PERF_COUNT_HW_CACHE_OP_MAX]
+ [PERF_COUNT_HW_CACHE_RESULT_MAX] =
+{
+ [ C(LL ) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = SNB_DMND_READ|SNB_L3_ACCESS,
+ [ C(RESULT_MISS) ] = SNB_DMND_READ|SNB_L3_MISS,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = SNB_DMND_WRITE|SNB_L3_ACCESS,
+ [ C(RESULT_MISS) ] = SNB_DMND_WRITE|SNB_L3_MISS,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = SNB_DMND_PREFETCH|SNB_L3_ACCESS,
+ [ C(RESULT_MISS) ] = SNB_DMND_PREFETCH|SNB_L3_MISS,
+ },
+ }
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = SNB_DMND_READ|SNB_ALL_DRAM,
+ [ C(RESULT_MISS) ] = SNB_DMND_READ|SNB_REMOTE_DRAM,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = SNB_DMND_WRITE|SNB_ALL_DRAM,
+ [ C(RESULT_MISS) ] = SNB_DMND_WRITE|SNB_REMOTE_DRAM,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = SNB_DMND_PREFETCH|SNB_ALL_DRAM,
+ [ C(RESULT_MISS) ] = SNB_DMND_PREFETCH|SNB_REMOTE_DRAM,
+ },
+ },
+};
+
static __initconst const u64 snb_hw_cache_event_ids
[PERF_COUNT_HW_CACHE_MAX]
[PERF_COUNT_HW_CACHE_OP_MAX]
@@ -184,26 +264,23 @@ static __initconst const u64 snb_hw_cache_event_ids
},
},
[ C(LL ) ] = {
- /*
- * TBD: Need Off-core Response Performance Monitoring support
- */
[ C(OP_READ) ] = {
- /* OFFCORE_RESPONSE_0.ANY_DATA.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.ANY_DATA.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
[ C(OP_WRITE) ] = {
- /* OFFCORE_RESPONSE_0.ANY_RFO.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.ANY_RFO.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.ANY_RFO.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.ANY_RFO.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
[ C(OP_PREFETCH) ] = {
- /* OFFCORE_RESPONSE_0.PREFETCH.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.PREFETCH.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.PREFETCH.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.PREFETCH.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
},
[ C(DTLB) ] = {
@@ -248,6 +325,20 @@ static __initconst const u64 snb_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ },
};

static __initconst const u64 westmere_hw_cache_event_ids
@@ -285,26 +376,26 @@ static __initconst const u64 westmere_hw_cache_event_ids
},
[ C(LL ) ] = {
[ C(OP_READ) ] = {
- /* OFFCORE_RESPONSE_0.ANY_DATA.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.ANY_DATA.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
/*
* Use RFO, not WRITEBACK, because a write miss would typically occur
* on RFO.
*/
[ C(OP_WRITE) ] = {
- /* OFFCORE_RESPONSE_1.ANY_RFO.LOCAL_CACHE */
- [ C(RESULT_ACCESS) ] = 0x01bb,
- /* OFFCORE_RESPONSE_0.ANY_RFO.ANY_LLC_MISS */
+ /* OFFCORE_RESPONSE.ANY_RFO.LOCAL_CACHE */
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ /* OFFCORE_RESPONSE.ANY_RFO.ANY_LLC_MISS */
[ C(RESULT_MISS) ] = 0x01b7,
},
[ C(OP_PREFETCH) ] = {
- /* OFFCORE_RESPONSE_0.PREFETCH.LOCAL_CACHE */
+ /* OFFCORE_RESPONSE.PREFETCH.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
- /* OFFCORE_RESPONSE_1.PREFETCH.ANY_LLC_MISS */
- [ C(RESULT_MISS) ] = 0x01bb,
+ /* OFFCORE_RESPONSE.PREFETCH.ANY_LLC_MISS */
+ [ C(RESULT_MISS) ] = 0x01b7,
},
},
[ C(DTLB) ] = {
@@ -349,19 +440,53 @@ static __initconst const u64 westmere_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ },
};

/*
- * OFFCORE_RESPONSE MSR bits (subset), See IA32 SDM Vol 3 30.6.1.3
+ * Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
+ * See IA32 SDM Vol 3B 30.6.1.3
*/

-#define DMND_DATA_RD (1 << 0)
-#define DMND_RFO (1 << 1)
-#define DMND_WB (1 << 3)
-#define PF_DATA_RD (1 << 4)
-#define PF_DATA_RFO (1 << 5)
-#define RESP_UNCORE_HIT (1 << 8)
-#define RESP_MISS (0xf600) /* non uncore hit */
+#define NHM_DMND_DATA_RD (1 << 0)
+#define NHM_DMND_RFO (1 << 1)
+#define NHM_DMND_IFETCH (1 << 2)
+#define NHM_DMND_WB (1 << 3)
+#define NHM_PF_DATA_RD (1 << 4)
+#define NHM_PF_DATA_RFO (1 << 5)
+#define NHM_PF_IFETCH (1 << 6)
+#define NHM_OFFCORE_OTHER (1 << 7)
+#define NHM_UNCORE_HIT (1 << 8)
+#define NHM_OTHER_CORE_HIT_SNP (1 << 9)
+#define NHM_OTHER_CORE_HITM (1 << 10)
+ /* reserved */
+#define NHM_REMOTE_CACHE_FWD (1 << 12)
+#define NHM_REMOTE_DRAM (1 << 13)
+#define NHM_LOCAL_DRAM (1 << 14)
+#define NHM_NON_DRAM (1 << 15)
+
+#define NHM_ALL_DRAM (NHM_REMOTE_DRAM|NHM_LOCAL_DRAM)
+
+#define NHM_DMND_READ (NHM_DMND_DATA_RD)
+#define NHM_DMND_WRITE (NHM_DMND_RFO|NHM_DMND_WB)
+#define NHM_DMND_PREFETCH (NHM_PF_DATA_RD|NHM_PF_DATA_RFO)
+
+#define NHM_L3_HIT (NHM_UNCORE_HIT|NHM_OTHER_CORE_HIT_SNP|NHM_OTHER_CORE_HITM)
+#define NHM_L3_MISS (NHM_NON_DRAM|NHM_ALL_DRAM|NHM_REMOTE_CACHE_FWD)
+#define NHM_L3_ACCESS (NHM_L3_HIT|NHM_L3_MISS)

static __initconst const u64 nehalem_hw_cache_extra_regs
[PERF_COUNT_HW_CACHE_MAX]
@@ -370,18 +495,32 @@ static __initconst const u64 nehalem_hw_cache_extra_regs
{
[ C(LL ) ] = {
[ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = DMND_DATA_RD|RESP_UNCORE_HIT,
- [ C(RESULT_MISS) ] = DMND_DATA_RD|RESP_MISS,
+ [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
+ [ C(RESULT_MISS) ] = NHM_DMND_READ|NHM_L3_MISS,
},
[ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = DMND_RFO|DMND_WB|RESP_UNCORE_HIT,
- [ C(RESULT_MISS) ] = DMND_RFO|DMND_WB|RESP_MISS,
+ [ C(RESULT_ACCESS) ] = NHM_DMND_WRITE|NHM_L3_ACCESS,
+ [ C(RESULT_MISS) ] = NHM_DMND_WRITE|NHM_L3_MISS,
},
[ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = PF_DATA_RD|PF_DATA_RFO|RESP_UNCORE_HIT,
- [ C(RESULT_MISS) ] = PF_DATA_RD|PF_DATA_RFO|RESP_MISS,
+ [ C(RESULT_ACCESS) ] = NHM_DMND_PREFETCH|NHM_L3_ACCESS,
+ [ C(RESULT_MISS) ] = NHM_DMND_PREFETCH|NHM_L3_MISS,
},
}
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_ALL_DRAM,
+ [ C(RESULT_MISS) ] = NHM_DMND_READ|NHM_REMOTE_DRAM,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = NHM_DMND_WRITE|NHM_ALL_DRAM,
+ [ C(RESULT_MISS) ] = NHM_DMND_WRITE|NHM_REMOTE_DRAM,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = NHM_DMND_PREFETCH|NHM_ALL_DRAM,
+ [ C(RESULT_MISS) ] = NHM_DMND_PREFETCH|NHM_REMOTE_DRAM,
+ },
+ },
};

static __initconst const u64 nehalem_hw_cache_event_ids
@@ -483,6 +622,20 @@ static __initconst const u64 nehalem_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7, /* OFFCORE_RESP */
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = 0x01b7,
+ [ C(RESULT_MISS) ] = 0x01b7,
+ },
+ },
};

static __initconst const u64 core2_hw_cache_event_ids
@@ -574,6 +727,20 @@ static __initconst const u64 core2_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static __initconst const u64 atom_hw_cache_event_ids
@@ -665,6 +832,20 @@ static __initconst const u64 atom_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static void intel_pmu_disable_all(void)
@@ -1444,6 +1625,8 @@ static __init int intel_pmu_init(void)
case 42: /* SandyBridge */
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
+ memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs,
+ sizeof(hw_cache_extra_regs));

intel_pmu_lbr_init_nhm();

diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c
index 74507c1..e802c7e 100644
--- a/arch/x86/kernel/cpu/perf_event_p4.c
+++ b/arch/x86/kernel/cpu/perf_event_p4.c
@@ -554,6 +554,20 @@ static __initconst const u64 p4_hw_cache_event_ids
[ C(RESULT_MISS) ] = -1,
},
},
+ [ C(NODE) ] = {
+ [ C(OP_READ) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_WRITE) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ [ C(OP_PREFETCH) ] = {
+ [ C(RESULT_ACCESS) ] = -1,
+ [ C(RESULT_MISS) ] = -1,
+ },
+ },
};

static u64 p4_general_events[PERF_COUNT_HW_MAX] = {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ee9f1e7..df4a841 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -59,7 +59,7 @@ enum perf_hw_id {
/*
* Generalized hardware cache events:
*
- * { L1-D, L1-I, LLC, ITLB, DTLB, BPU } x
+ * { L1-D, L1-I, LLC, ITLB, DTLB, BPU, NODE } x
* { read, write, prefetch } x
* { accesses, misses }
*/
@@ -70,6 +70,7 @@ enum perf_hw_cache_id {
PERF_COUNT_HW_CACHE_DTLB = 3,
PERF_COUNT_HW_CACHE_ITLB = 4,
PERF_COUNT_HW_CACHE_BPU = 5,
+ PERF_COUNT_HW_CACHE_NODE = 6,

PERF_COUNT_HW_CACHE_MAX, /* non-ABI */
};
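
For reference, a minimal user-space sketch (illustration only, not part of the patch) of how the new NODE generalized cache event would be encoded with the existing PERF_TYPE_HW_CACHE config layout. The HW_CACHE_CONFIG / NODE_LOADS / NODE_LOAD_MISSES names are made up for this example; the access/miss meanings follow the Nehalem mapping above (any DRAM vs. remote DRAM):

#include <linux/perf_event.h>

/* Generalized cache events are encoded as: cache_id | (op << 8) | (result << 16) */
#define HW_CACHE_CONFIG(cache, op, result) \
	((__u64)(cache) | ((__u64)(op) << 8) | ((__u64)(result) << 16))

/* "node-loads": reads satisfied from DRAM, local or remote */
#define NODE_LOADS						\
	HW_CACHE_CONFIG(PERF_COUNT_HW_CACHE_NODE,		\
			PERF_COUNT_HW_CACHE_OP_READ,		\
			PERF_COUNT_HW_CACHE_RESULT_ACCESS)

/* "node-load-misses": reads that had to go to remote DRAM */
#define NODE_LOAD_MISSES					\
	HW_CACHE_CONFIG(PERF_COUNT_HW_CACHE_NODE,		\
			PERF_COUNT_HW_CACHE_OP_READ,		\
			PERF_COUNT_HW_CACHE_RESULT_MISS)

These values would go into a perf_event_attr with attr.type = PERF_TYPE_HW_CACHE and attr.config = NODE_LOADS (or NODE_LOAD_MISSES).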

2011-04-22 23:55:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

>
> #define SNB_PF_LLC_DATA_RD (1 << 7)
> #define SNB_PF_LLC_RFO (1 << 8)
> #define SNB_PF_LLC_IFETCH (1 << 9)
> #define SNB_BUS_LOCKS (1 << 10)
> #define SNB_STRM_ST (1 << 11)
> /* hole */
> #define SNB_OFFCORE_OTHER (1 << 15)
> #define SNB_COMMON (1 << 16)
> #define SNB_NO_SUPP (1 << 17)
> #define SNB_LLC_HITM (1 << 18)
> #define SNB_LLC_HITE (1 << 19)
> #define SNB_LLC_HITS (1 << 20)
> #define SNB_LLC_HITF (1 << 21)
> /* hole */
> #define SNB_SNP_NONE (1 << 31)
> #define SNB_SNP_NOT_NEEDED (1 << 32)
> #define SNB_SNP_MISS (1 << 33)
> #define SNB_SNP_NO_FWD (1 << 34)
> #define SNB_SNP_FWD (1 << 35)
> #define SNB_HITM (1 << 36)
> #define SNB_NON_DRAM (1 << 37)
>
> #define SNB_DMND_READ (SNB_DMND_DATA_RD)
> #define SNB_DMND_WRITE (SNB_DMND_RFO|SNB_DMND_WB|SNB_STRM_ST)
> #define SNB_DMND_PREFETCH (SNB_PF_DATA_RD|SNB_PF_DATA_RFO)
>
> Is what I came up with, but I'm stumped on how to construct:
>
> #define SNB_L3_HIT

All the LLC hits together.

Or it can be done with the PEBS memory latency event (like Lin-Ming's patch) or
with mem_load_uops_retired (but then only for loads)

> #define SNB_L3_MISS

Don't set any of the LLC bits


>
> #define SNB_ALL_DRAM

Just don't set NON_DRAM


> #define SNB_REMOTE_DRAM

The current client SNBs that those tables are for don't have remote
DRAM.
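
Read literally, the first answer suggests something like the sketch below, reusing the SNB_LLC_* bit names from the patch earlier in the thread (response-type bits 18-21). The remaining three masks are left undefined here on purpose: how to build them is exactly what the follow-ups below dispute.

#define SNB_L3_HIT	(SNB_LLC_HITM|SNB_LLC_HITE|SNB_LLC_HITS|SNB_LLC_HITF)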

-Andi

--
[email protected] -- Speaking for myself only

2011-04-23 00:00:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Sat, Apr 23, 2011 at 12:57:42AM +0200, Peter Zijlstra wrote:
> On Fri, 2011-04-22 at 23:54 +0200, Peter Zijlstra wrote:
> > On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote:
> > > The below needs filling out for !x86 (which I filled out with
> > > unsupported events) and x86 needs the offcore bits fixed to auto select
> > > between the two offcore events.
> >
> > Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table.
>
> Also, NHM offcore bits were wrong... it implemented _ACCESS as _HIT and

What is ACCESS if not a HIT?

> counted OTHER_CORE_HIT* as MISS even though its clearly documented as an
> L3 hit.

When the other core owns the cache line it has to be fetched from there.
That's not a LLC hit.

-Andi

2011-04-23 00:04:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

> > Yes, and note that with instructions events we even have skid-less PEBS
> > profiling so seeing the precise .
> - location of instructions is possible.

It was better when it was eaten. PEBS does not actually eliminate
skid, unfortunately. The interrupt still occurs later, so the
instruction location is off.

PEBS merely gives you more information.

-Andi

2011-04-23 07:50:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, 2011-04-22 at 16:54 -0700, Andi Kleen wrote:
> >
> > #define SNB_PF_LLC_DATA_RD (1 << 7)
> > #define SNB_PF_LLC_RFO (1 << 8)
> > #define SNB_PF_LLC_IFETCH (1 << 9)
> > #define SNB_BUS_LOCKS (1 << 10)
> > #define SNB_STRM_ST (1 << 11)
> > /* hole */
> > #define SNB_OFFCORE_OTHER (1 << 15)
> > #define SNB_COMMON (1 << 16)
> > #define SNB_NO_SUPP (1 << 17)
> > #define SNB_LLC_HITM (1 << 18)
> > #define SNB_LLC_HITE (1 << 19)
> > #define SNB_LLC_HITS (1 << 20)
> > #define SNB_LLC_HITF (1 << 21)
> > /* hole */
> > #define SNB_SNP_NONE (1 << 31)
> > #define SNB_SNP_NOT_NEEDED (1 << 32)
> > #define SNB_SNP_MISS (1 << 33)
> > #define SNB_SNP_NO_FWD (1 << 34)
> > #define SNB_SNP_FWD (1 << 35)
> > #define SNB_HITM (1 << 36)
> > #define SNB_NON_DRAM (1 << 37)
> >
> > #define SNB_DMND_READ (SNB_DMND_DATA_RD)
> > #define SNB_DMND_WRITE (SNB_DMND_RFO|SNB_DMND_WB|SNB_STRM_ST)
> > #define SNB_DMND_PREFETCH (SNB_PF_DATA_RD|SNB_PF_DATA_RFO)
> >
> > Is what I came up with, but I'm stumped on how to construct:
> >
> > #define SNB_L3_HIT
>
> All the LLC hits together.

Bits 18-21 ?

> Or it can be done with the PEBS memory latency event (like Lin-Ming's patch) or
> with mem_load_uops_retired (but then only for loads)
>
> > #define SNB_L3_MISS
>
> Don't set any of the LLC bits

So a 0 for the response type field? That's not valid. You have to set
some bit between 16-37.

>
> >
> > #define SNB_ALL_DRAM
>
> Just don't set NON_DRAM

So bits 17-21|31-36 for the response type field?

That seems wrong as that would include what we previously defined to be
L3_HIT, which never makes it to DRAM.

> > #define SNB_REMOTE_DRAM
>
> The current client SNBs for which those tables are don't have remote
> DRAM.

So what you're telling us is that simply because Intel hasn't shipped a
multi-socket SNB system yet they either:

1) omitted a few bits from that table,
2) have a completely different offcore response msr just for kicks?

Feh!

2011-04-23 07:50:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, 2011-04-22 at 17:00 -0700, Andi Kleen wrote:
> On Sat, Apr 23, 2011 at 12:57:42AM +0200, Peter Zijlstra wrote:
> > On Fri, 2011-04-22 at 23:54 +0200, Peter Zijlstra wrote:
> > > On Fri, 2011-04-22 at 23:37 +0200, Peter Zijlstra wrote:
> > > > The below needs filling out for !x86 (which I filled out with
> > > > unsupported events) and x86 needs the offcore bits fixed to auto select
> > > > between the two offcore events.
> > >
> > > Urgh, so SNB has different MSR_OFFCORE_RESPONSE bits and needs another table.
> >
> > Also, NHM offcore bits were wrong... it implemented _ACCESS as _HIT and
>
> What is ACCESS if not a HIT?

An ACCESS is all requests for data that come in, after which you either
HIT or MISS, in which case you have to ask someone else down the line.

> > counted OTHER_CORE_HIT* as MISS even though its clearly documented as an
> > L3 hit.
>
> When the other core owns the cache line it has to be fetched from there.
> That's not a LLC hit.

Then _why_ are they described in 30.6.1.3, table 30-15, as:

OTHER_CORE_HIT_SNP 9 (R/W). L3 Hit: ....
OTHER_CORE_HITM 10 (R/W). L3 Hit: ...

2011-04-23 07:50:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
> > > Yes, and note that with instructions events we even have skid-less PEBS
> > > profiling so seeing the precise .
> > - location of instructions is possible.
>
> It was better when it was eaten. PEBS does not actually eliminated
> skid unfortunately. The interrupt still occurs later, so the
> instruction location is off.
>
> PEBS merely gives you more information.

You're so skilled at not actually saying anything useful. Are you
perchance referring to the fact that the IP reported in the PEBS data is
exactly _one_ instruction off? Something that is demonstrated to be
fixable?

Or are you defining skid differently and not telling us your definition?

What are you saying?

2011-04-23 08:03:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Andi Kleen <[email protected]> wrote:

> > > Yes, and note that with instructions events we even have skid-less PEBS
> > > profiling so seeing the precise .
> > - location of instructions is possible.
>
> It was better when it was eaten. PEBS does not actually eliminated
> skid unfortunately. The interrupt still occurs later, so the
> instruction location is off.
>
> PEBS merely gives you more information.

Have you actually tried perf's PEBS support feature? Try:

perf record -e instructions:pp ./myapp

(the ':pp' postfix stands for 'precise' and activates PEBS+LBR tricks.)
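
(For reference, the modifier maps onto the precise_ip field of struct perf_event_attr -- each 'p' bumps it by one. A sketch of what the values mean, per the perf_event.h ABI comments:

	attr.precise_ip = 0;	/* SAMPLE_IP can have arbitrary skid   */
	attr.precise_ip = 1;	/* SAMPLE_IP must have constant skid   */
	attr.precise_ip = 2;	/* SAMPLE_IP requested to have 0 skid  */
	attr.precise_ip = 3;	/* SAMPLE_IP must have 0 skid          */
)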

Look at the perf report --tui annotated assembly output (or check 'perf
annotate' directly) and see how precise and skid-less the hits are. Works
pretty well on Nehalem.

Here's a cache-bound loop with skid (profiled with '-e instructions'):

: 0000000000400390 <main>:
0.00 : 400390: 31 c0 xor %eax,%eax
0.00 : 400392: eb 22 jmp 4003b6 <main+0x26>
12.08 : 400394: fe 84 10 50 08 60 00 incb 0x600850(%rax,%rdx,1)
87.92 : 40039b: 48 81 c2 10 27 00 00 add $0x2710,%rdx
0.00 : 4003a2: 48 81 fa 00 e1 f5 05 cmp $0x5f5e100,%rdx
0.00 : 4003a9: 75 e9 jne 400394 <main+0x4>
0.00 : 4003ab: 48 ff c0 inc %rax
0.00 : 4003ae: 48 3d 10 27 00 00 cmp $0x2710,%rax
0.00 : 4003b4: 74 04 je 4003ba <main+0x2a>
0.00 : 4003b6: 31 d2 xor %edx,%edx
0.00 : 4003b8: eb da jmp 400394 <main+0x4>
0.00 : 4003ba: 31 c0 xor %eax,%eax

Those 'ADD' instruction hits are bogus: 99% of the cost in this function is in
the INCB, but the PMU NMI often skids to the next (few) instructions.

Profiled with "-e instructions:pp" we get:

: 0000000000400390 <main>:
0.00 : 400390: 31 c0 xor %eax,%eax
0.00 : 400392: eb 22 jmp 4003b6 <main+0x26>
85.33 : 400394: fe 84 10 50 08 60 00 incb 0x600850(%rax,%rdx,1)
0.00 : 40039b: 48 81 c2 10 27 00 00 add $0x2710,%rdx
14.67 : 4003a2: 48 81 fa 00 e1 f5 05 cmp $0x5f5e100,%rdx
0.00 : 4003a9: 75 e9 jne 400394 <main+0x4>
0.00 : 4003ab: 48 ff c0 inc %rax
0.00 : 4003ae: 48 3d 10 27 00 00 cmp $0x2710,%rax
0.00 : 4003b4: 74 04 je 4003ba <main+0x2a>
0.00 : 4003b6: 31 d2 xor %edx,%edx
0.00 : 4003b8: eb da jmp 400394 <main+0x4>
0.00 : 4003ba: 31 c0 xor %eax,%eax

The INCB has the most hits as expected - but we also learn that there's
something about the CMP.

Thanks,

Ingo

2011-04-23 08:14:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2011-04-22 at 10:06 +0200, Ingo Molnar wrote:
> >
> > I'm about to push out the patch attached below - it lays out the arguments in
> > detail. I don't think we have time to fix this properly for .39 - but memory
> > profiling could be a nice feature for v2.6.40.
>
> Does something like the below provide enough generic infrastructure to
> allow the raw offcore bits again?

Yeah, this looks like a pretty good start - this is roughly the approach i
outlined to Stephane and Andi, generic cache events extended with one more
'node' level.

Andi, Stephane, if you'd like to see the Intel offcore bits supported in 2.6.40
(or 2.6.41) please help out Peter with review, testing, tools/perf/
integration, etc.

Thanks,

Ingo

2011-04-23 12:06:57

by Stephane Eranian

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
>> > > Yes, and note that with instructions events we even have skid-less PEBS
>> > > profiling so seeing the precise .
>> >                                   - location of instructions is possible.
>>
>> It was better when it was eaten. PEBS does not actually eliminated
>> skid unfortunately. The interrupt still occurs later, so the
>> instruction location is off.
>>
>> PEBS merely gives you more information.
>
> You're so skilled at not actually saying anything useful. Are you
> perchance referring to the fact that the IP reported in the PEBS data is
> exactly _one_ instruction off? Something that is demonstrated to be
> fixable?
>
> Or are you defining skid differently and not telling us your definition?
>

PEBS is guaranteed to return an IP that is just after AN instruction that
caused the event. However, that instruction is NOT the one at the end
of your period. Let's take an example with INST_RETIRED, period=100000.
Then, the IP you get is NOT after the 100,000th retired instruction. It's an
instruction that is N cycles after that one. There is internal skid due to the
way PEBS is implemented.

That is what Andi is referring to. The issue causes bias and thus impacts
the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST
event. PREC_DIST = precise distribution. It tries to correct for this skid
on INST_RETIRED with PEBS (look at Vol 3B).

2011-04-23 12:13:18

by Stephane Eranian

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, Apr 22, 2011 at 10:47 PM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <[email protected]> wrote:
>> >
>> > * Stephane Eranian <[email protected]> wrote:
>> >
>> > > > Say i'm a developer and i have an app with such code:
>> > > >
>> > > > #define THOUSAND 1000
>> > > >
>> > > > static char array[THOUSAND][THOUSAND];
>> > > >
>> > > > int init_array(void)
>> > > > {
>> > > >        int i, j;
>> > > >
>> > > >        for (i = 0; i < THOUSAND; i++) {
>> > > >                for (j = 0; j < THOUSAND; j++) {
>> > > >                        array[j][i]++;
>> > > >                }
>> > > >        }
>> > > >
>> > > >        return 0;
>> > > > }
>> > > >
>> > > > Pretty common stuff, right?
>> > > >
>> > > > Using the generalized cache events i can run:
>> > > >
>> > > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
>> > > >
>> > > >  Performance counter stats for './array' (10 runs):
>> > > >
>> > > >         6,719,130 cycles:u                   ( +-   0.662% )
>> > > >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>> > > >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>> > > >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
>> > > >
>> > > >        0.003802098  seconds time elapsed   ( +-  13.395% )
>> > > >
>> > > > I consider that this is 'bad', because for almost every dcache-load there's a
>> > > > dcache-miss - a 99% L1 cache miss rate!
>> > > >
>> > > > Then i think a bit, notice something, apply this performance optimization:
>> > >
>> > > I don't think this example is really representative of the kind of problems
>> > > people face, it is just too small and obvious. [...]
>> >
>> > Well, the overwhelming majority of performance problems are 'small and obvious'
>>
>> Problems are not simple. Most serious applications these days are huge,
>> hundreds of MB of text, if not GB.
>>
>> In your artificial example, you knew the answer before you started the
>> measurement.
>>
>> Most of the time, applications are assembled out of hundreds of libraries, so
>> no single developers knows all the code. Thus, the performance analyst is
>> faced with a black box most of the time.
>
> I isolated out an example and assumed that you'd agree that identifying hot
> spots is trivial with generic cache events.
>
> My assumption was wrong so let me show you how trivial it really is.
>
> Here's an example with *two* problematic functions (but it could have hundreds,
> it does not matter):
>
> -------------------------------->
> #define THOUSAND 1000
>
> static char array1[THOUSAND][THOUSAND];
>
> static char array2[THOUSAND][THOUSAND];
>
> void func1(void)
> {
>        int i, j;
>
>        for (i = 0; i < THOUSAND; i++)
>                for (j = 0; j < THOUSAND; j++)
>                        array1[i][j]++;
> }
>
> void func2(void)
> {
>        int i, j;
>
>        for (i = 0; i < THOUSAND; i++)
>                for (j = 0; j < THOUSAND; j++)
>                        array2[j][i]++;
> }
>
> int main(void)
> {
>        for (;;) {
>                func1();
>                func2();
>        }
>
>        return 0;
> }
> <--------------------------------
>
> We do not know which one has the cache-misses problem, func1() or func2(), it's
> all a black box, right?
>
> Using generic cache events you simply type this:
>
>  $ perf top -e l1-dcache-load-misses -e l1-dcache-loads
>
> And you get such output:
>
>   PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u],  (all, 16 CPUs)
> -------------------------------------------------------------------------------------------------------
>
>   weight    samples  pcnt funct DSO
>   ______    _______ _____ _____ ______________________
>
>      1.9       6184 98.8% func2 /home/mingo/opt/array2
>      0.0         69  1.1% func1 /home/mingo/opt/array2
>
> It has pinpointed the problem in func2 *very* precisely.
>
> Obviously this can be used to analyze larger apps as well, with thousands of
> functions, to pinpoint cachemiss problems in specific functions.
>
No, it does not.

As I said before, your example is just too trivial to be representative. You
keep thinking that what you see in the profile pinpoints exactly the instruction
or even the function where the problem always occurs. This is not always
the case. There is skid, and it can be very big; the IP you get may not even
be in the same function where the load was issued.

You cannot generalize based on this example.

2011-04-23 12:27:38

by Stephane Eranian

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, Apr 22, 2011 at 11:03 PM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> Let's go back to your example.
>> Performance counter stats for './array' (10 runs):
>>
>>          6,719,130 cycles:u                   ( +-   0.662% )
>>          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
>>          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
>>          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
>>
>> Looking at this I don't think you can pinpoint which function has a problem
>> [...]
>
> In my previous mail i showed how to pinpoint specific functions. You bring up
> an interesting question, cost/benefit analysis:
>
>> [...] and whether or not there is a real problem. You need to evaluate the
>> penalty. Once you know that you can estimate any potential gain from fixing
>> the code. Arun pointed that out rightfully in his answer. How do you know the
>> penalty if you don't decompose some more?
>
> We can measure that even with today's tooling - which doesnt do cost/benefit
> analysis out of box. In my previous example i showed the cachemisses profile:
>
>   weight    samples  pcnt funct DSO
>   ______    _______ _____ _____ ______________________
>
>      1.9       6184 98.8% func2 /home/mingo/opt/array2
>      0.0         69  1.1% func1 /home/mingo/opt/array2
>
> and here's the cycles profile:
>
>             samples  pcnt funct DSO
>             _______ _____ _____ ______________________
>
>             2555.00 67.4% func2 /home/mingo/opt/array2
>             1220.00 32.2% func1 /home/mingo/opt/array2
>
> So, given that there was no other big miss sources:
>
>  $ perf stat -a -e branch-misses:u -e l1-dcache-load-misses:u -e l1-dcache-store-misses:u -e l1-icache-load-misses:u sleep 1
>
>  Performance counter stats for 'sleep 1':
>
>            70,674 branch-misses:u
>       347,992,027 l1-dcache-load-misses:u
>             1,779 l1-dcache-store-misses:u
>             8,007 l1-icache-load-misses:u
>
>        1.000982021  seconds time elapsed
>
> I can tell you that by fixing the cache-misses in that function, the code will
> be roughly 33% faster.
>
33% based on what? l1d-load-misses? The fact that in the same program you have a
problematic function, func1(), and its fixed counter-part func2() +
you know both do
the same thing? How often do you think this happens in real life?

Now, imagine you don't have func2(). Tell me how much of an impact (cycles)
you think func1() is having on the overall execution of a program, especially
if it is far more complex than your toy example above?

Your arguments would carry more weight if you were to derive them from real
life applications.


> So i fixed the bug, and before it 100 iterations of func1+func2 took 300 msecs:
>
>  $ perf stat -e cpu-clock --repeat 10 ./array2
>
>  Performance counter stats for './array2' (10 runs):
>
>        298.405074 cpu-clock                  ( +-   1.823% )
>
> After the fix it took 190 msecs:
>
>  $ perf stat -e cpu-clock --repeat 10 ./array2
>
>  Performance counter stats for './array2' (10 runs):
>
>        189.409569 cpu-clock                  ( +-   0.019% )
>
>        0.190007596  seconds time elapsed   ( +-   0.025% )
>
> Which is 63% of the original speed - 37% faster. And no, i first did the
> calculation, then did the measurement of the optimized code.
>
> Now it would be nice to automate such analysis some more within perf - but i
> think i have established the principle well enough that we can use generic
> cache events for such measurements.
>
> Also, we could certainly add more generic events - a stalled-cycles event would
> certainly be useful for example, to collect all (or at least most) 'harmful
> delays' the execution flow can experience. Want to take a stab at that patch?
>
> Thanks,
>
>        Ingo
>

2011-04-23 12:37:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Stephane Eranian <[email protected]> wrote:

> On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
> >> > > Yes, and note that with instructions events we even have skid-less PEBS
> >> > > profiling so seeing the precise .
> >> >                                   - location of instructions is possible.
> >>
> >> It was better when it was eaten. PEBS does not actually eliminated
> >> skid unfortunately. The interrupt still occurs later, so the
> >> instruction location is off.
> >>
> >> PEBS merely gives you more information.
> >
> > You're so skilled at not actually saying anything useful. Are you
> > perchance referring to the fact that the IP reported in the PEBS data is
> > exactly _one_ instruction off? Something that is demonstrated to be
> > fixable?
> >
> > Or are you defining skid differently and not telling us your definition?
> >
>
> PEBS is guaranteed to return an IP that is just after AN instruction that
> caused the event. However, that instruction is NOT the one at the end of your
> period. Let's take an example with INST_RETIRED, period=100000. Then, the IP
> you get is NOT after the 100,000th retired instruction. It's an instruction
> that is N cycles after that one. There is internal skid due to the way PEBS
> is implemented.

You are really misapplying the common-sense definition of 'skid'.

Skid refers to the instruction causing a profiler hit being mis-identified.
Google 'x86 pmu skid' and read the third entry: your own prior posting ;-)

What you are referring to here is not really classic skid but a small, mostly
constant skew in the period length with some very small amount of variability.
It's thus mostly immaterial - at most a second or third order effect with
typical frequencies of sampling.

Thanks,

Ingo

2011-04-23 12:49:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Stephane Eranian <[email protected]> wrote:

> On Fri, Apr 22, 2011 at 10:47 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * Stephane Eranian <[email protected]> wrote:
> >
> >> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar <[email protected]> wrote:
> >> >
> >> > * Stephane Eranian <[email protected]> wrote:
> >> >
> >> > > > Say i'm a developer and i have an app with such code:
> >> > > >
> >> > > > #define THOUSAND 1000
> >> > > >
> >> > > > static char array[THOUSAND][THOUSAND];
> >> > > >
> >> > > > int init_array(void)
> >> > > > {
> >> > > >        int i, j;
> >> > > >
> >> > > >        for (i = 0; i < THOUSAND; i++) {
> >> > > >                for (j = 0; j < THOUSAND; j++) {
> >> > > >                        array[j][i]++;
> >> > > >                }
> >> > > >        }
> >> > > >
> >> > > >        return 0;
> >> > > > }
> >> > > >
> >> > > > Pretty common stuff, right?
> >> > > >
> >> > > > Using the generalized cache events i can run:
> >> > > >
> >> > > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >> > > >
> >> > > >  Performance counter stats for './array' (10 runs):
> >> > > >
> >> > > >         6,719,130 cycles:u                   ( +-   0.662% )
> >> > > >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> >> > > >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> >> > > >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> >> > > >
> >> > > >        0.003802098  seconds time elapsed   ( +-  13.395% )
> >> > > >
> >> > > > I consider that this is 'bad', because for almost every dcache-load there's a
> >> > > > dcache-miss - a 99% L1 cache miss rate!
> >> > > >
> >> > > > Then i think a bit, notice something, apply this performance optimization:
> >> > >
> >> > > I don't think this example is really representative of the kind of problems
> >> > > people face, it is just too small and obvious. [...]
> >> >
> >> > Well, the overwhelming majority of performance problems are 'small and obvious'
> >>
> >> Problems are not simple. Most serious applications these days are huge,
> >> hundreds of MB of text, if not GB.
> >>
> >> In your artificial example, you knew the answer before you started the
> >> measurement.
> >>
> >> Most of the time, applications are assembled out of hundreds of libraries, so
> >> no single developers knows all the code. Thus, the performance analyst is
> >> faced with a black box most of the time.
> >
> > I isolated out an example and assumed that you'd agree that identifying hot
> > spots is trivial with generic cache events.
> >
> > My assumption was wrong so let me show you how trivial it really is.
> >
> > Here's an example with *two* problematic functions (but it could have hundreds,
> > it does not matter):
> >
> > -------------------------------->
> > #define THOUSAND 1000
> >
> > static char array1[THOUSAND][THOUSAND];
> >
> > static char array2[THOUSAND][THOUSAND];
> >
> > void func1(void)
> > {
> >        int i, j;
> >
> >        for (i = 0; i < THOUSAND; i++)
> >                for (j = 0; j < THOUSAND; j++)
> >                        array1[i][j]++;
> > }
> >
> > void func2(void)
> > {
> >        int i, j;
> >
> >        for (i = 0; i < THOUSAND; i++)
> >                for (j = 0; j < THOUSAND; j++)
> >                        array2[j][i]++;
> > }
> >
> > int main(void)
> > {
> >        for (;;) {
> >                func1();
> >                func2();
> >        }
> >
> >        return 0;
> > }
> > <--------------------------------
> >
> > We do not know which one has the cache-misses problem, func1() or func2(), it's
> > all a black box, right?
> >
> > Using generic cache events you simply type this:
> >
> > ?$ perf top -e l1-dcache-load-misses -e l1-dcache-loads
> >
> > And you get such output:
> >
> >   PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u],  (all, 16 CPUs)
> > -------------------------------------------------------------------------------------------------------
> >
> >   weight    samples  pcnt funct DSO
> >   ______    _______ _____ _____ ______________________
> >
> >      1.9       6184 98.8% func2 /home/mingo/opt/array2
> >      0.0         69  1.1% func1 /home/mingo/opt/array2
> >
> > It has pinpointed the problem in func2 *very* precisely.
> >
> > Obviously this can be used to analyze larger apps as well, with thousands
> > of functions, to pinpoint cachemiss problems in specific functions.
>
> No, it does not.

The thing is, you will need to come up with more convincing and concrete
arguments than a blanket, unsupported "No, it does not" claim.

I *just showed* you an example which you claimed just two mails ago is
impossible to analyze. I showed an example with two functions and claimed that the
same thing works with 3 or more functions as well: perf top will happily
display the ones with the highest cachemiss ratio, regardless of how many there
are.

> As I said before, your example is just to trivial to be representative. You
> keep thinking that what you see in the profile pinpoints exactly the
> instruction or even the function where the problem always occurs. This is not
> always the case. There is skid, and it can be very big, the IP you get may
> not even be in the same function where the load was issued.

So now you claim a narrow special case (most of the hot-spot overhead skidding
out of a function) as a counter-proof?

Sometimes skid causes problems - in practice it rarely does, and i do a lot of
profiling.

Also, i'd expect PEBS to be extended in the future to more and more events -
including cachemiss events. That will solve this kind of skidding in a pretty
natural way.

Also, let's analyze your narrow special case: if a function is indeed
"invisible" to profiling because most overhead skids out of it then there's
little you can do with raw events to begin with ...

You really need to specifically demonstrate how raw events help your example.

Thanks,

Ingo

2011-04-23 13:16:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Sat, 2011-04-23 at 14:06 +0200, Stephane Eranian wrote:
> On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
> >> > > Yes, and note that with instructions events we even have skid-less PEBS
> >> > > profiling so seeing the precise .
> >> > - location of instructions is possible.
> >>
> >> It was better when it was eaten. PEBS does not actually eliminated
> >> skid unfortunately. The interrupt still occurs later, so the
> >> instruction location is off.
> >>
> >> PEBS merely gives you more information.
> >
> > You're so skilled at not actually saying anything useful. Are you
> > perchance referring to the fact that the IP reported in the PEBS data is
> > exactly _one_ instruction off? Something that is demonstrated to be
> > fixable?
> >
> > Or are you defining skid differently and not telling us your definition?
> >
>
> PEBS is guaranteed to return an IP that is just after AN instruction that
> caused the event. However, that instruction is NOT the one at the end
> of your period. Let's take an example with INST_RETIRED, period=100000.
> Then, the IP you get is NOT after the 100,000th retired instruction. It's an
> instruction that is N cycles after that one. There is internal skid due to the
> way PEBS is implemented.
>
> That is what Andi is referring to. The issue causes bias and thus impacts
> the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST
> event. PREC_DIST=precise distribution. It tries to correct for the skid
> on this event on INST_RETIRED with PEBS (look at Vol3b).

Sure, but who cares? So your period isn't exactly what you specified,
but the effective period will have an average and a fairly small stdev
(assuming the initial period is much larger than the relatively few
cycles it takes to arm the PEBS assist), therefore you still get a
fairly uniform spread.

I don't much get the obsession with precision here, it's all a statistics
game anyway.

And while you keep saying the examples are too trivial and Andi keeps
spouting vague non-statements, neither of you actually provides anything
sensible to the discussion.

So stop f*cking whining and start talking sense, or stop talking altogether.

I mean, you were in the room where Intel presented their research on
event correlations based on pathological micro-benches. That clearly
shows that exact event definitions simply don't matter.

Similarly, all this precision wanking isn't _that_ important; the big
fish clearly stand out. It's only when you start shaving off the last few
cycles that all that really comes in handy; before that it's mostly: ooh,
thinking is hard, let's go shopping.

2011-04-23 20:14:45

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES


* Ingo Molnar <[email protected]> wrote:

> > [...] If there is an expensive load, you'll see that the load instruction
> > takes many cycles and you can infer that it's a cache miss.
> >
> > Questions app developers typically ask me:
> >
> > * If I fix all my top 5 L3 misses how much faster will my app go?
>
> This has come up: we could add a 'stalled/idle-cycles' generic event - i.e.
> cycles spent without performing useful work in the pipelines. (Resource-stall
> events on Intel CPUs.)

How about something like the patch below?

Ingo
---
Subject: perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
From: Ingo Molnar <[email protected]>

The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
cycles the CPU does nothing useful, because it is stalled on a
cache-miss or some other condition.

Note: this is still incomplete and will only work on Intel Nehalem
CPUs for now; the intel_perfmon_event_map[] needs to be
properly split between the major models.

Also update 'perf stat' to print:

611,527 cycles
400,553 instructions # ( 0.7 instructions per cycle )
77,809 stalled-cycles # ( 12.7% of all cycles )

0.000610987 seconds time elapsed

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 2 ++
include/linux/perf_event.h | 1 +
tools/perf/builtin-stat.c | 11 +++++++++--
tools/perf/util/parse-events.c | 1 +
tools/perf/util/python.c | 1 +
5 files changed, 14 insertions(+), 2 deletions(-)

Index: linux/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux/arch/x86/kernel/cpu/perf_event_intel.c
@@ -34,6 +34,8 @@ static const u64 intel_perfmon_event_map
[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4,
[PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5,
[PERF_COUNT_HW_BUS_CYCLES] = 0x013c,
+ [PERF_COUNT_HW_STALLED_CYCLES] = 0xffa2, /* 0xff: All reasons, 0xa2: Resource stalls */
+
};

static struct event_constraint intel_core_event_constraints[] =
Index: linux/include/linux/perf_event.h
===================================================================
--- linux.orig/include/linux/perf_event.h
+++ linux/include/linux/perf_event.h
@@ -52,6 +52,7 @@ enum perf_hw_id {
PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4,
PERF_COUNT_HW_BRANCH_MISSES = 5,
PERF_COUNT_HW_BUS_CYCLES = 6,
+ PERF_COUNT_HW_STALLED_CYCLES = 7,

PERF_COUNT_HW_MAX, /* non-ABI */
};
Index: linux/tools/perf/builtin-stat.c
===================================================================
--- linux.orig/tools/perf/builtin-stat.c
+++ linux/tools/perf/builtin-stat.c
@@ -442,7 +442,7 @@ static void abs_printout(int cpu, struct
if (total)
ratio = avg / total;

- fprintf(stderr, " # %10.3f IPC ", ratio);
+ fprintf(stderr, " # ( %3.1f instructions per cycle )", ratio);
} else if (perf_evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES) &&
runtime_branches_stats[cpu].n != 0) {
total = avg_stats(&runtime_branches_stats[cpu]);
@@ -450,7 +450,7 @@ static void abs_printout(int cpu, struct
if (total)
ratio = avg * 100 / total;

- fprintf(stderr, " # %10.3f %% ", ratio);
+ fprintf(stderr, " # %10.3f %%", ratio);

} else if (runtime_nsecs_stats[cpu].n != 0) {
total = avg_stats(&runtime_nsecs_stats[cpu]);
@@ -459,6 +459,13 @@ static void abs_printout(int cpu, struct
ratio = 1000.0 * avg / total;

fprintf(stderr, " # %10.3f M/sec", ratio);
+ } else if (perf_evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES)) {
+ total = avg_stats(&runtime_cycles_stats[cpu]);
+
+ if (total)
+ ratio = avg / total * 100.0;
+
+ fprintf(stderr, " # (%5.1f%% of all cycles )", ratio);
}
}

Index: linux/tools/perf/util/parse-events.c
===================================================================
--- linux.orig/tools/perf/util/parse-events.c
+++ linux/tools/perf/util/parse-events.c
@@ -38,6 +38,7 @@ static struct event_symbol event_symbols
{ CHW(BRANCH_INSTRUCTIONS), "branch-instructions", "branches" },
{ CHW(BRANCH_MISSES), "branch-misses", "" },
{ CHW(BUS_CYCLES), "bus-cycles", "" },
+ { CHW(STALLED_CYCLES), "stalled-cycles", "" },

{ CSW(CPU_CLOCK), "cpu-clock", "" },
{ CSW(TASK_CLOCK), "task-clock", "" },
Index: linux/tools/perf/util/python.c
===================================================================
--- linux.orig/tools/perf/util/python.c
+++ linux/tools/perf/util/python.c
@@ -798,6 +798,7 @@ static struct {
{ "COUNT_HW_BRANCH_INSTRUCTIONS", PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
{ "COUNT_HW_BRANCH_MISSES", PERF_COUNT_HW_BRANCH_MISSES },
{ "COUNT_HW_BUS_CYCLES", PERF_COUNT_HW_BUS_CYCLES },
+ { "COUNT_HW_STALLED_CYCLES", PERF_COUNT_HW_STALLED_CYCLES },
{ "COUNT_HW_CACHE_L1D", PERF_COUNT_HW_CACHE_L1D },
{ "COUNT_HW_CACHE_L1I", PERF_COUNT_HW_CACHE_L1I },
{ "COUNT_HW_CACHE_LL", PERF_COUNT_HW_CACHE_LL },
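
For completeness, here is a hypothetical user-space sketch (illustration only, not part of the patch): it opens the ordinary cycles counter plus the proposed stalled-cycles event via perf_event_open(2) and prints the same percentage the perf stat hunk above computes. The raw config value 7 stands in for PERF_COUNT_HW_STALLED_CYCLES, since installed headers would not have the new symbol yet; the rest is plain counting-mode boilerplate.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(unsigned long long config)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type     = PERF_TYPE_HARDWARE;
	attr.size     = sizeof(attr);
	attr.config   = config;
	attr.disabled = 1;

	/* pid = 0 (this task), cpu = -1 (any cpu), no group, no flags */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	unsigned long long cycles = 0, stalled = 0;
	/* 7 == the PERF_COUNT_HW_STALLED_CYCLES slot added by the patch above */
	int cycles_fd  = open_counter(PERF_COUNT_HW_CPU_CYCLES);
	int stalled_fd = open_counter(7);
	volatile int i;

	if (cycles_fd < 0 || stalled_fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(cycles_fd, PERF_EVENT_IOC_ENABLE, 0);
	ioctl(stalled_fd, PERF_EVENT_IOC_ENABLE, 0);

	for (i = 0; i < 100000000; i++)
		;			/* stand-in for the real workload */

	ioctl(cycles_fd, PERF_EVENT_IOC_DISABLE, 0);
	ioctl(stalled_fd, PERF_EVENT_IOC_DISABLE, 0);

	read(cycles_fd, &cycles, sizeof(cycles));
	read(stalled_fd, &stalled, sizeof(stalled));

	printf("%llu cycles, %llu stalled-cycles # ( %.1f%% of all cycles )\n",
	       cycles, stalled, cycles ? 100.0 * stalled / cycles : 0.0);
	return 0;
}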

2011-04-24 02:15:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

>
> That is what Andi is referring to. The issue causes bias and thus impacts
> the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST
> event. PREC_DIST=precise distribution. It tries to correct for the skid
> on this event on INST_RETIRED with PEBS (look at Vol3b).

Unfortunately even PDIST doesn't completely fix the problem;
it only makes it somewhat better. Also it's only statistical,
so you won't get a guaranteed answer for every sample.

-Andi

2011-04-24 02:19:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

> You're so skilled at not actually saying anything useful. Are you
> perchance referring to the fact that the IP reported in the PEBS data is
> exactly _one_ instruction off? Something that is demonstrated to be
> fixable?

It's one instruction off the instruction that was retired when the PEBS
interrupt was ready, but not one instruction off the instruction
that caused the event. There's still skid in triggering the interrupt.

The main good thing about PEBS is that you can get some information
about the state of the instruction, just not the EIP.
For example with the memory latency event you can actually get
the address and memory cache state (as Lin Ming's patchkit implements)

-Andi

2011-04-24 06:21:49

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES

On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote:
>
> The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
> cycles the CPU does nothing useful, because it is stalled on a
> cache-miss or some other condition.

Conceptually looks fine. I'd prefer a more precise name such as:
PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
retirement stalls).

In the example below:

==> foo.c <==

foo()
{
}


bar()
{
}

==> test.c <==
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define FNV_PRIME_32 16777619
#define FNV_OFFSET_32 2166136261U
uint32_t hash1(const char *s)
{
uint32_t hash = FNV_OFFSET_32, i;
for(i = 0; i < 4; i++)
{
hash = hash ^ (s[i]); // xor next byte into the bottom of the hash
hash = hash * FNV_PRIME_32; // Multiply by prime number found to work well
}
return hash;
}

#define FNV_PRIME_WEAK_32 100
#define FNV_OFFSET_WEAK_32 200

uint32_t hash2(const char *s)
{
uint32_t hash = FNV_OFFSET_WEAK_32, i;
for(i = 0; i < 4; i++)
{
hash = hash ^ (s[i]); // xor next byte into the bottom of the hash
hash = hash * FNV_PRIME_WEAK_32; // Multiply by prime number found to work well
}
return hash;
}

int main()
{
int r = random();

while(1) {
r++;
#ifdef HARD
if (hash1((const char *) &r) & 0x500)
#else
if (hash2((const char *) &r) & 0x500)
#endif
foo();
else
bar();
}
}
==> Makefile <==

all:
gcc -O2 test.c foo.c -UHARD -o test.easy
gcc -O2 test.c foo.c -DHARD -o test.hard


# perf stat -e cycles,instructions ./test.hard
^C
Performance counter stats for './test.hard':

3,742,855,848 cycles
4,179,896,309 instructions # 1.117 IPC

1.754804730 seconds time elapsed


# perf stat -e cycles,instructions ./test.easy
^C
Performance counter stats for './test.easy':

3,932,887,528 cycles
8,994,416,316 instructions # 2.287 IPC

1.843832827 seconds time elapsed

i.e. fixing the branch mispredicts could result in
a nearly 2x speedup for the program.

Looking at:

# perf stat -e cycles,instructions,branch-misses,cache-misses,RESOURCE_STALLS:ANY ./test.hard
^C
Performance counter stats for './test.hard':

3,413,713,048 cycles (scaled from 69.93%)
3,772,611,782 instructions # 1.105 IPC (scaled from 80.01%)
51,113,340 branch-misses (scaled from 80.01%)
12,370 cache-misses (scaled from 80.02%)
26,656,983 RESOURCE_STALLS:ANY (scaled from 69.99%)

1.626595305 seconds time elapsed

it's hard to spot the opportunity. On the other hand:

# ./analyze.py
Percent idle: 27%
Retirement Stalls: 82%
Backend Stalls: 0%
Frontend Stalls: 62%
Instruction Starvation: 62%
icache stalls: 0%

does give me a signal about where to look. The script below is
a quick and dirty hack. I haven't really validated it with
many workloads. I'm posting it here anyway hoping that it'd
result in better kernel support for these types of analyses.

Even if we cover this with various generic PERF_COUNT_*STALL events,
we'll still have a need for other events:

* Things that give info about instruction mixes.

Ratio of {loads, stores, floating point, branches, conditional branches}
to total instructions.

* Activity related to micro architecture specific caches

People using -funroll-loops may have a significant performance opportunity.
But it's hard to spot bottlenecks in the instruction decoder.

* Monitoring traffic on Hypertransport/QPI links

Like you observe, most people will not look at these events, so
focusing on getting the common events right makes sense. But I
still like access to all events (either via a mapping file or
a library such as libpfm4). Hiding them in "perf list" sounds
like a reasonable way of keeping complexity out.

-Arun

PS: branch-misses:pp was spot on for the example above.

#!/usr/bin/env python

from optparse import OptionParser
from itertools import izip, chain, repeat
from subprocess import Popen, PIPE
import re, struct

def grouper(n, iterable, padvalue=None):
"grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)

counter_re = re.compile('\s+(?P<count>\d+)\s+(?P<event>\S+)')
def sample(events):
cmd = 'perf stat --no-big-num -a'
ncounters = 4
groups = grouper(ncounters, events)
for g in groups:
# filter padding
g = [ e for e in g if e ]
cmd += ' -e ' + ','.join(g)
cmd += ' -- sleep ' + str(options.time)
process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
out, err = process.communicate()
ret = process.poll()
    if ret: raise Exception("Perf failed: " + err)
ret = {}
for line in err.split('\n'):
m = counter_re.match(line)
if not m: continue
ret[m.group('event')] = long(m.group('count'))
return ret

def measure_cycles():
# disable C-states
f = open("/dev/cpu_dma_latency", "wb")
f.write(struct.pack("i", 0))
f.flush()
saved = options.time
options.time = 1 # one sec is sufficient to measure clock
cycles = sample(["cycles"])['cycles']
cycles /= options.time
f.close()
options.time = saved
return cycles

if __name__ == '__main__':
parser = OptionParser()
parser.add_option("-t", "--time", dest="time", default=1,
help="How long to sample events")
parser.add_option("-q", "--quiet",
action="store_false", dest="verbose", default=True,
help="don't print status messages to stdout")

(options, args) = parser.parse_args()
cycles_per_sec = measure_cycles()
c = sample(["cycles", "instructions", "UOPS_ISSUED:ANY:c=1", "UOPS_ISSUED:ANY:c=1:t=1",
"RESOURCE_STALLS:ANY", "UOPS_RETIRED:ANY:c=1:t=1",
"UOPS_EXECUTED:PORT015:t=1", "UOPS_EXECUTED:PORT234_CORE",
"UOPS_ISSUED:ANY:t=1", "UOPS_ISSUED:FUSED:t=1", "UOPS_RETIRED:ANY:t=1",
"L1I:CYCLES_STALLED"])
cycles = c["cycles"] * 1.0
cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
icache_stall_cycles = c["L1I:CYCLES_STALLED"]

# Cycle stall accounting
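# Rough meaning of the derived numbers: cycles with no uops issued approximate
# frontend stalls, cycles with no uops retired approximate retirement stalls,
# RESOURCE_STALLS:ANY approximates backend stalls, and "instruction starvation"
# is issue stalls not explained by the backend.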
print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)

# Wasted work
uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
uops_retired = c["UOPS_RETIRED:ANY:t=1"]
uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]

print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)

2011-04-25 17:25:31

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


sorry for the late reply on this thread, it happened inconveniently over
the long weekend.


On Fri, 22 Apr 2011, Ingo Molnar wrote:

> But this kind of usability is absolutely unacceptable - users should not
> be expected to type in magic, CPU and model specific incantations to get
> access to useful hardware functionality.

That's why people use libpfm4. or PAPI. And they do.

Current PAPI snapshots support offcore response on recent git kernels.
With full names, no hex values, thanks to libpfm4.

All the world is not perf.

> The proper solution is to expose useful offcore functionality via
> generalized events - that way users do not have to care which specific
> CPU model they are using, they can use the conceptual event and not some
> model specific quirky hexa number.

No no no no.

Blocking access to raw events is the wrong idea. If anything, the whole
"generic events" thing in the kernel should be ditched. Wrong events are
used at times (see AMD branch events a few releases back, now Nehalem
cache events). This all belongs in userspace, as was pointed out at the
start. The kernel has no business telling users which perf events are
interesting, or limiting them! What is this, windows?

If you do block access to any raw events, we're going to have to start
recommending people ditch perf_events and start patching the kernel with
perfctr again. We already do for P4/netburst users, as Pentium 4 support
is currently hosed due to NMI event conflicts.

Also with perfctr it's much easier to get low-latency access to the
counters. See:
http://web.eecs.utk.edu/~vweaver1/projects/papi-cost/

Vince

2011-04-25 17:38:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES


* Arun Sharma <[email protected]> wrote:

> On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote:
> >
> > The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
> > cycles the CPU does nothing useful, because it is stalled on a
> > cache-miss or some other condition.
>
> Conceptually looks fine. I'd prefer a more precise name such as:
> PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
> retirement stalls).

Ok.

Your script:

> # ./analyze.py
> Percent idle: 27%
> Retirement Stalls: 82%
> Backend Stalls: 0%
> Frontend Stalls: 62%
> Instruction Starvation: 62%
> icache stalls: 0%
>
> does give me a signal about where to look. The script below is
> a quick and dirty hack. I haven't really validated it with
> many workloads. I'm posting it here anyway hoping that it'd
> result in better kernel support for these types of analyses.

Is pretty useful IMO.

The frontend/backend characterisation is pretty generic - most modern CPUs
share that and have similar events.

So we could try to generalize these and get most of the statistics your script
outputs.

> Even if we cover this with various generic PERF_COUNT_*STALL events,
> we'll still have a need for other events:
>
> * Things that give info about instruction mixes.
>
> Ratio of {loads, stores, floating point, branches, conditional branches}
> to total instructions.

We have this at least partially covered, but yeah, we stopped short of covering
all instruction types so complete ratios cannot be built yet.

> * Activity related to micro architecture specific caches
>
> People using -funroll-loops may have a significant performance opportunity.
> But it's hard to spot bottlenecks in the instruction decoder.
>
> * Monitoring traffic on Hypertransport/QPI links

Cross-node accesses ought to be covered by Peter's RFC patch. In terms of
isolating cross-CPU cache accesses i suspect we could do that too if it really
matters to analysis in practice.

Basically the way to go about it is via testcases like the one you wrote - they demonstrate
the utility of a given type of event - and that justifies generalization as
well.

> Like you observe, most people will not look at these events, so
> focusing on getting the common events right makes sense. But I
> still like access to all events (either via a mapping file or
> a library such as libpfm4). Hiding them in "perf list" sounds
> like a reasonable way of keeping complexity out.

Yes. We have access to raw events for relatively obscure (or too CPU dependent)
events - but what we do not want to do is to extend that space without adding
*any* generic event in essence. If something like offcore or uncore PMU support
is useful enough to be in the kernel, then it should also be useful enough to
gain generic events.

> PS: branch-misses:pp was spot on for the example above.

heh :-)

Thanks,

Ingo

2011-04-25 17:41:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Andi Kleen <[email protected]> wrote:

> > You're so skilled at not actually saying anything useful. Are you
> > perchance referring to the fact that the IP reported in the PEBS data is
> > exactly _one_ instruction off? Something that is demonstrated to be
> > fixable?
>
> It's one instruction off the instruction that was retired when the PEBS
> interrupt was ready, but not one instruction off the instruction that caused
> the event. There's still skid in triggering the interrupt.

Peter answered this in the other mail:

|
| Sure, but who cares? So your period isn't exactly what you specified, but
| the effective period will have an average and a fairly small stdev (assuming
| the initial period is much larger than the relatively few cycles it takes to
| arm the PEBS assist), therefore you still get a fairly uniform spread.
|

... and the resulting low level of noise in the average period length is what
matters. The instruction itself will still be one of the hotspot instructions,
statistically.

Thanks,

Ingo

2011-04-25 17:55:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Vince Weaver <[email protected]> wrote:

> [...] The kernel has no business telling users which perf events are
> interesting, or limiting them! [...]

The policy is very simple and common-sense: if a given piece of PMU
functionality is useful enough to be exposed via a raw interface, then
it must be useful enough to be generalized as well.

> [...] What is this, windows?

FYI, this is how the Linux kernel has operated from day 1 on: we support
hardware features to abstract useful highlevel functionality out of it.
I would not expect this to change anytime soon.

Thanks,

Ingo

2011-04-25 18:00:13

by Dehao Chen

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Tue, Apr 26, 2011 at 1:41 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > You're so skilled at not actually saying anything useful. Are you
> > > perchance referring to the fact that the IP reported in the PEBS data is
> > > exactly _one_ instruction off? Something that is demonstrated to be
> > > fixable?
> >
> > It's one instruction off the instruction that was retired when the PEBS
> > interrupt was ready, but not one instruction off the instruction that caused
> > the event. There's still skid in triggering the interrupt.
>
> Peter answered this in the other mail:
>
> |
> | Sure, but who cares? So your period isn't exactly what you specified, but
> | the effective period will have an average and a fairly small stdev (assuming
> | the initial period is much larger than the relatively few cycles it takes to
> | arm the PEBS assist), therefore you still get a fairly uniform spread.
> |
>
> ... and the resulting low level of noise in the average period length is what
> matters. The instruction itself will still be one of the hotspot instructions,
> statistically.

Not true. This skid will lead to some aggregation and shadow effects on
certain instructions. To make things worse, these effects are deterministic
and cannot be removed either by sampling multiple times or by averaging
among instructions within a basic block. As a result, some actual hot spots
are not sampled at all. You can simply try to collect a basic-block-level
CPI, and you'll get a very misleading profile.

Dehao

>
> Thanks,
>
>        Ingo

2011-04-25 18:05:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Dehao Chen <[email protected]> wrote:

> > ... and the resulting low level of noise in the average period length is
> > what matters. The instruction itself will still be one of the hotspot
> > instructions, statistically.
>
> Not true. This skid will lead to some aggregation and shadow effects on
> certain instructions. To make things worse, these effects are deterministic
> and cannot be removed either by sampling multiple times or by averaging
> among instructions within a basic block. As a result, some actual hot spots
> are not sampled at all. You can simply try to collect a basic-block-level
> CPI, and you'll get a very misleading profile.

This certainly does not match the results i'm seeing on real applications,
using "-e instructions:pp" PEBS+LBR profiling. How do you explain that? Also,
can you demonstrate your claim with a real example?

Thanks,

Ingo

2011-04-25 18:39:31

by Stephane Eranian

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Mon, Apr 25, 2011 at 8:05 PM, Ingo Molnar <[email protected]> wrote:
>
> * Dehao Chen <[email protected]> wrote:
>
>> > ... and the resulting low level of noise in the average period length is
>> > what matters. The instruction itself will still be one of the hotspot
>> > instructions, statistically.
>>
>> Not true. This skid will lead to some aggregation and shadow effects on some
>> certain instructions. To make things worse, these effects are deterministic
>> and cannot be removed by either sampling for multiple times or by averaging
>> among instructions within a basic block. As a result, some actual "hot spot"
>> are not sampled at all. You can simply try to collect a basic block level
>> CPI, and you'll get a very misleading profile.
>
> This certainly does not match the results i'm seeing on real applications,
> using "-e instructions:pp" PEBS+LBR profiling. How do you explain that? Also,
> can you demonstrate your claim with a real example?
>

LBR removes the off-by-1 IP problem, it does not remove the shadow effect, i.e.,
that blind spot of N cycles caused by the PEBS arming mechanism.

2011-04-25 18:49:00

by Stephane Eranian

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Sat, Apr 23, 2011 at 3:16 PM, Peter Zijlstra <[email protected]> wrote:
> On Sat, 2011-04-23 at 14:06 +0200, Stephane Eranian wrote:
>> On Sat, Apr 23, 2011 at 9:50 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Fri, 2011-04-22 at 17:03 -0700, Andi Kleen wrote:
>> >> > > Yes, and note that with instructions events we even have skid-less PEBS
>> >> > > profiling so seeing the precise .
>> >> >                                   - location of instructions is possible.
>> >>
>> >> It was better when it was eaten. PEBS does not actually eliminated
>> >> skid unfortunately. The interrupt still occurs later, so the
>> >> instruction location is off.
>> >>
>> >> PEBS merely gives you more information.
>> >
>> > You're so skilled at not actually saying anything useful. Are you
>> > perchance referring to the fact that the IP reported in the PEBS data is
>> > exactly _one_ instruction off? Something that is demonstrated to be
>> > fixable?
>> >
>> > Or are you defining skid differently and not telling us your definition?
>> >
>>
>> PEBS is guaranteed to return an IP that is just after AN instruction that
>> caused the event. However, that instruction is NOT the one at the end
>> of your period. Let's take an example with INST_RETIRED, period=100000.
>> Then, the IP you get is NOT after the 100,000th retired instruction. It's an
>> instruction that is N cycles after that one. There is internal skid due to the
>> way PEBS is implemented.
>>
>> That is what Andi is referring to. The issue causes bias and thus impacts
>> the quality of the samples. On SNB, there is a new INST_RETIRED:PREC_DIST
>> event. PREC_DIST=precise distribution. It tries to correct for the skid
>> on this event on INST_RETIRED with PEBS (look at Vol3b).
>
> Sure, but who cares? So your period isn't exactly what you specified,
> but the effective period will have an average and a fairly small stdev
> (assuming the initial period is much larger than the relatively few
> cycles it takes to arm the PEBS assist), therefore you still get a
> fairly uniform spread.
>
> I don't much get the obsession with precision here, its all a statistics
> game anyway.
>

The particular example I am thinking about came from compiler people
I work with who would like to use PEBS to do statistical basic block profiling.
They do care about correctness of the profile. Otherwise, it may cause wrong
attribution of "hotness" of basic blocks and mislead the compiler when it tries
to reorder blocks on the critical path. Compiler people can validate a
statistical profile because they have a reference profile obtained via
instrumentation of each basic block.

> And while you keep saying the examples are too trivial and Andi keeps
> sprouting vague non-statements, neither of you actually provide anything
> sensible to the discussion.
>
> So stop f*cking whining and start talking sense or stop talking all
> together.
>
> I mean, you were in the room where Intel presented their research on
> event correlations based on pathological micro-benches. That clearly
> shows that exact event definitions simply don't matter.
>
Yes, and I don't get the same reading of the presentation. He never mentioned
generic events. He never even used them, I mean the Intel generic events.
Instead he used very focused Atom-specific events.

2011-04-25 19:40:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

> Sure, but who cares? So your period isn't exactly what you specified,
> but the effective period will have an average and a fairly small stdev
> (assuming the initial period is much larger than the relatively few
> cycles it takes to arm the PEBS assist), therefore you still get a
> fairly uniform spread.

The skid is not uniform and not necessarily random unfortunately,
and difficult to correct in a standard way.

> I don't much get the obsession with precision here, its all a statistics
> game anyway.

If you want to make your code faster it's often important to figure
out what exactly is slow.

One example of this we had recently in the kernel:

A function accesses three global objects. Scalability tanks when the test is
run with more CPUs. Now the profile hit is near the three memory accesses. Which
one is the one that is actually bouncing cache lines?

The CPU executes them all in parallel so it's hard to tell. It's
all in the out of order reordering window.
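
A stripped-down illustration of the kind of function I mean (the names and the
read-mostly vs. written split are made up; it only shows why the RIP alone does
not tell you which access hurts):

#include <pthread.h>
#include <stdio.h>

/*
 * Three global objects touched by the hot function. Only hits.val is
 * written by every CPU and therefore bounces its cache line; config and
 * limit are read-mostly. All three accesses sit within a few instructions
 * of each other, so a RIP-based sample cannot say which one is expensive.
 */
struct { long val; char pad[56]; } config = { 42 };
struct { long val; char pad[56]; } limit  = { 100000000 };
struct { long val; char pad[56]; } hits;

static void *worker(void *arg)
{
        long i;

        for (i = 0; i < limit.val; i++) {                    /* read-mostly */
                if ((i ^ config.val) & 1)                    /* read-mostly */
                        __sync_fetch_and_add(&hits.val, 1);  /* bounces     */
        }
        return NULL;
}

int main(void)
{
        pthread_t t[4];
        int i;

        for (i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, worker, NULL);
        for (i = 0; i < 4; i++)
                pthread_join(t[i], NULL);
        printf("%ld\n", hits.val);
        return 0;
}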

PEBS (e.g. the memory latency event) can give you some information about
which memory access is to blame with the right events, but it's not
using the RIP.

The generic events won't help with that, because they're still RIP
based, which is not accurate.

> Similarly all this precision wanking isn't _that_ important, the big
> fish clearly stand out, its only when you start shaving off the last few
> cycles that all that really comes in handy, before that its mostly: ooh
> thinking is hard, lets go shopping.

I wish it was that easy.

In the example above it's about scaling or not scaling, which is
definitely not the last cycle, but more a life-and-death
"is the workload feasible on this machine or not" question.

-Andi
--
[email protected] -- Speaking for myself only

2011-04-25 19:46:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Stephane Eranian <[email protected]> wrote:

> > This certainly does not match the results i'm seeing on real applications,
> > using "-e instructions:pp" PEBS+LBR profiling. How do you explain that?
> > Also, can you demonstrate your claim with a real example?
>
> LBR removes the off-by-1 IP problem, it does not remove the shadow effect,
> i.e., that blind spot of N cycles caused by the PEBS arming mechanism.

I really think you are grasping at straws here - unless you are able to
demonstrate clear problems - which you have failed to do so far. The pure act
of profiling probably disturbs a typical workload statistically more than a
few cycles skew of the period.

I could imagine artifacts with really short periods and artificially short and
dominant hotpaths - but in those cases the skew does not matter much in
practice: a short and dominant hotpath is pinpointed very easily ...

So i really think it's a non-issue in practice - but you can certainly prove me
wrong by demonstrating whatever problems you suspect.

Thanks,

Ingo

2011-04-25 19:56:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Andi Kleen <[email protected]> wrote:

> One example of this we had recently in the kernel:
>
> function accesses three global objects. Scalability tanks when the test is
> run with more CPUs. Now the hit is near the three memory accesses. Which one
> is the one that is actually bouncing cache lines?

that's not an example - you are still only giving vague, untestable,
unverifiable references. You need to give us something specific and
reproducible - preferably a testcase.

Peter and I are doing lots of scalability work in the core kernel and for most
problems i've met it was enough if we knew the function name - the scalability
problem is typically very obvious from that point on - and an annotated profile
makes it even more obvious.

I've never met a situation like the one you describe, where it was not possible to
disambiguate a real SMP bounce - and i've been fixing SMP bounces in the kernel
for over ten years.

So you really will have to back up your point with an accurate, reproducible
testcase - vague statements like the ones you are making i do not accept at
face value, sorry.

Thanks,

Ingo

2011-04-25 21:46:48

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Mon, 25 Apr 2011, Ingo Molnar wrote:

>
> * Vince Weaver <[email protected]> wrote:
>
> > [...] The kernel has no business telling users which perf events are
> > interesting, or limiting them! [...]
>
> The policy is very simple and common-sense: if a given piece of PMU
> functionality is useful enough to be exposed via a raw interface, then
> it must be useful enough to be generalized as well.

what does that even mean? How do you "generalize" a functionality like
writing a value to an auxiliary MSR register?

The PAPI tool was using the perf_events interface in the 2.6.39-git
kernels to collect offcore response results by properly setting the
config1 register on Nehalem and Westmere machines.

Now it has been disabled for unclear reasons.

Could you at least have some sort of relevant errno value set in this
case? It's a real pain in userspace code to try to sort out the
perf_event return values to find out if a feature is supported,
unsupported (lack of hardware), unsupported (not implemented yet),
unsupported (disabled due to whim of kernel developer), unsupported
(because you have some sort of configuration conflict).

> > [...] What is this, windows?
>
> FYI, this is how the Linux kernel has operated from day 1 on: we support
> hardware features to abstract useful highlevel functionality out of it.
> I would not expect this to change anytime soon.

I started using Linux because it actually let me use my hardware without
interfering with what I was trying to do. Not because it disabled access
to the hardware due to some perceived lack of generalization in an extra
unnecessary software translation layer.

Vince
[email protected]

2011-04-25 22:13:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

> The PAPI tool was using the perf_events interface in the 2.6.39-git
> kernels to collect offcore response results by properly setting the
> config1 register on Nehalem and Westmere machines.

I already had some users for this functionality too. Offcore
events are quite useful for various analyses: basically every time
you have a memory performance problem -- especially a NUMA
problem -- they can help you a lot in tracking it down.

They answer questions like "who accesses memory on another node?"

As far as I'm concerned b52c55c6a25e4515b5e075a989ff346fc251ed09
is a bad feature regression.

>
> Now it has been disabled for unclear reasons.

Also unfortunately only partial. Previously you could at least
write the MSR from user space through /dev/cpu/*/msr, but now the kernel
randomly rewrites it if anyone else uses cache events.

Right now I have some frontend scripts which are doing this,
but it's really quite nasty.

It's very sad we have to go through this.

-Andi

2011-04-26 07:24:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Andi Kleen <[email protected]> wrote:

> > Now it has been disabled for unclear reasons.
>
> Also unfortunately only partial. Previously you could at least write the MSR
> from user space through /dev/cpu/*/msr, but now the kernel randomly rewrites
> it if anyone else uses cache events.

Ugh, that's an unbelievable hack - if you hack an active PMU by writing to it
via /dev/cpu/*/msr and it breaks you really get to keep the pieces. There's a
reason why those devices are root only - it's as if you wrote to a filesystem
that is already mounted!

If your user-space twiddling scripts go bad who knows what state the CPU gets
into and you might be reporting bogus bugs. I think writing to those msrs
directly should probably taint the kernel: i'll prepare a patch for that.

> It's very sad we have to go through this.

Not really, it took Peter 10 minutes to come up with an RFC patch to extend the
cache events in a meaningful way - and that was actually more useful to users
than all prior offcore patches combined. So the kernel already won from this
episode.

We are not at all interested in hiding PMU functionality and keeping it
unstructured, and just passing through some opaque raw ABI to user-space.

Thanks,

Ingo

2011-04-26 07:39:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Vince Weaver <[email protected]> wrote:

> On Mon, 25 Apr 2011, Ingo Molnar wrote:
>
> >
> > * Vince Weaver <[email protected]> wrote:
> >
> > > [...] The kernel has no business telling users which perf events are
> > > interesting, or limiting them! [...]
> >
> > The policy is very simple and common-sense: if a given piece of PMU
> > functionality is useful enough to be exposed via a raw interface, then
> > it must be useful enough to be generalized as well.
>
> what does that even mean? How do you "generalize" a functionality like
> writing a value to an auxiliary MSR register?

Here are a few examples:

- the pure act of task switching sometimes involves writing to MSRs. How is it
generalized? The concept of 'processes/threads' is offered to user-space and
thus this functionality is generalized - the raw MSRs are not just passed
through to user-space.

- a wide range of VMX (virtualization) functionality on Intel CPUs operates via
writing special values to specific MSR registers. How is it 'generalized'? A
meaningful, structured ABI is provided to user-space in form of the KVM
device and associated semantics. The raw MSRs are not just passed through to
user-space.

- the ability of CPUs to change frequency is offered via writing special
values to special MSRs. How is this generalized? The cpufreq subsystem
offers a frequency/cpu API and associated abstractions - the raw MSRs are
not just passed through to user-space.

- in the context of perf events we generalize the concept of an 'event' and
we abstract out common, CPU model neutral CPU hardware concepts like
'cycles', 'instructions', 'branches' and a simplified cache hierarchy - and
offer those events as generic events to user-space. We do not just pass the
raw MSRs through to user-space.

- [ etc. - a lot of useful CPU functionality is MSR driven, the PMU is nothing
special there. ]

The kernel development process is in essence an abstraction engine, and if you
expect something else you'll probably be facing a lot of frustrating episodes
in the future as well where others try to abstract out meaningful
generalizations.

Thanks,

Ingo

2011-04-26 09:26:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Mon, 2011-04-25 at 13:12 -0400, Vince Weaver wrote:
> On Fri, 22 Apr 2011, Ingo Molnar wrote:
>
> > But this kind of usability is absolutely unacceptable - users should not
> > be expected to type in magic, CPU and model specific incantations to get
> > access to useful hardware functionality.
>
> That's why people use libpfm4. or PAPI. And they do.

And how is typing in hex numbers different from typing in model specific
event names? All the same to me, you still need to understand your micro
architecture very thoroughly and read the SDMs.

PAPI actually has 'generalized' events, but I guess you're going to tell
me nobody uses those since they're not useful.

> Current PAPI snapshots support offcore response on recent git kernels.
> With full names, no hex values, thanks to libpfm4.
>
> All the world is not perf.

I know, all the world is interested in investing tons of time learning
about their one architecture and extracting the last few percent of
performance.

And that is fine for those few people who can afford it, but generally
optimizing for a single specific platform isn't cost effective.

It looks like you're all so stuck in your HPC/lowlevel way of things
you're not even realizing there's much more to be gained by providing
easy and useful tools to the general public, stuff that works similarly
across architectures.

> > The proper solution is to expose useful offcore functionality via
> > generalized events - that way users do not have to care which specific
> > CPU model they are using, they can use the conceptual event and not some
> > model specific quirky hexa number.
>
> No no no no.
>
> Blocking access to raw events is the wrong idea. If anything, the whole
> "generic events" thing in the kernel should be ditched. Wrong events are
> used at times (see AMD branch events a few releases back, now Nehalem
> cache events). This all belongs in userspace, as was pointed out at the
> start. The kernel has no business telling users which perf events are
> interesting, or limiting them! What is this, windows?

The kernel has no place scheduling pmcs either I expect, or scheduling
tasks for that matter.

We all know you don't believe in upgrading kernels or in kernels very
much at all.

> If you do block access to any raw events, we're going to have to start
> recommending people ditch perf_events and start patching the kernel with
> perfctr again. We already do for P4/netburst users, as Pentium 4 support
> is currently hosed due to NMI event conflicts.

Very constructive attitude, instead of helping you simply subvert and
route around, thanks man!

You could of course a) simply disable the NMI watchdog, or b) improve
the space-heater (aka. P4) PMU implementation to use alternative
encodings -- from what I understood the problem with P4 is that there's
multiple ways to encode the same event and currently if you take one it
doesn't try others.

> Also with perfctr it's much easier to get low-latency access to the
> counters. See:
> http://web.eecs.utk.edu/~vweaver1/projects/papi-cost/

And why is that? Is it the lack of userspace rdpmc? That should be
possible with perf; powerpc actually does that already. Various people
mentioned wanting to make this work on x86, but I've yet to see a patch.
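
For reference, the low-latency read we're talking about boils down to a single
rdpmc instruction, something like the sketch below. The hard part isn't the read
itself, it's the plumbing: user space needs to know which hardware counter index
its event landed on and CR4.PCE has to be set, and that is exactly what perf
does not expose on x86 yet.

#include <stdint.h>

/* Read hardware performance counter <counter> directly from user space. */
static inline uint64_t read_pmc(uint32_t counter)
{
        uint32_t lo, hi;

        __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((uint64_t)hi << 32) | lo;
}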

2011-04-26 09:26:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Fri, 2011-04-22 at 09:51 -0700, Andi Kleen wrote:
>
> Micro architectures are so different. I suspect a "generic" definition would
> need to be so vague as to be useless.
>
> This in general seems to be the problem of the current cache events.
>
> Overall for any interesting analysis you need to go CPU specific.
> Abstracted performance analysis is a contradiction in terms.

It might help if you'd talk to your own research department before
making statements like that, they make you look silly.

Intel research has shown that you don't actually need exact definitions;
it fell out as a side effect of applying machine learning principles to
provide machine-aided optimization (i.e. clippy-style guides for vtune).

They create simple micro-kernels (not our kind of kernels, but more like
the excellent example Arun provided) that trigger a pathological case
and a perfect counter-case and run it over _all_ possible events and do
correlation analysis.

The explicit example given was branch misses on an atom, and they found
(to nobody's great surprise) BR_INST_RETIRED.MISPRED to be the best
correlating event. But that's not the important part.

The important part is that all it needs is a strong correlation, and it
could even be a combination of events, it would just make the analysis a
bit more complex.

Anyway, given a sufficiently large set of these pathological cases, you
can train a neural net for your target hardware and then reverse the
situation, run it over an unknown program and have it create suggestions
-> yay clippy!

So given a set of pathological cases and hardware with decent PMU
coverage you can train this thing and get useful results. Exact event
definitions be damned -- it doesn't care.

http://sites.google.com/site/fhpm2010/program/baugh_fhpm2010.pptx?attredirects=0

2011-04-26 09:26:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES

On Sat, 2011-04-23 at 23:16 -0700, Arun Sharma wrote:
> Conceptually looks fine. I'd prefer a more precise name such as:
> PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
> retirement stalls).

Very nice example!! This is the stuff we want people to do, but instead
of focusing on the raw event aspect, put in a little more effort and see
what it takes to make it work across the board.

None of the things you mention are very specific to Intel, afaik those
concepts you listed: Retirement, Frontend (instruction
decode/uop-issue), Backend (uop execution), I-cache (instruction fetch)
map to pretty much all hardware I know (PMU coverage of these aspects
aside).

So in fact you propose these concepts, and that is the kind of feedback
perf wants and needs.

The thing that set off this whole discussion is that most people don't
seem to believe in concepts and stick to their very narrow HPC "every
last cycle matters, therefore we need absolute events" mentality.

That too is a form of vendor lock-in: once you're so dependent on a
particular platform, the cost of switching increases dramatically.
Furthermore, very few people are actually interested in it.

That is not to say we should not enable those people, but the current
state of affairs seems to be that some people are only interested in
enabling that and simply don't care (and don't want to care) about cross
platform performance analysis and useful abstractions.

We'd very much like the cost of entry for supporting lowlevel capabilities
to be the addition of high level concepts, as a means for the greater
public to use them.

2011-04-26 09:49:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Mon, 2011-04-25 at 17:46 -0400, Vince Weaver wrote:
> > The policy is very simple and common-sense: if a given piece of PMU
> > functionality is useful enough to be exposed via a raw interface, then
> > it must be useful enough to be generalized as well.
>
> what does that even mean? How do you "generalize" a functionality like
> writing a value to an auxiliary MSR register?

Come-on Vince, I know you're smarter than that!

The external register is simply an extension of the configuration space:
instead of the normal evsel msr you get evsel:offcore pairs. After that
it's simply a matter of scheduling them right.

It simply adds more events to the PMU (in a rather sad way, it would
have been so much nicer if Intel had simply extended the evsel MSR for
every PMC, they could have also used that for the load-latency thing
etc.)

Now, these extra events offered are L3 and NUMA events; the 'common'
interesting set is mostly covered by Andi's LLC mods and my NODE
extension, and after that there are mostly details left in offcore.

So the writing of an extra MSR is totally irrelevant; it's the extra
events that matter.
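
In ABI terms the evsel:offcore pair simply shows up as a second config field on
the event, roughly like the sketch below -- the 0x01b7/0x20ff values are just the
OFFCORE_RESPONSE example from the perf-list documentation earlier in this thread:

#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int open_offcore_counter(void)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size    = sizeof(attr);
        attr.type    = PERF_TYPE_RAW;
        attr.config  = 0x01b7;    /* evsel: OFFCORE_RESPONSE_0             */
        attr.config1 = 0x20ff;    /* extra MSR: request/response type mask */

        return sys_perf_event_open(&attr, 0, -1, -1, 0);  /* current task, any CPU */
}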

2011-04-26 14:00:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES


* Arun Sharma <[email protected]> wrote:

> On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote:
> >
> > The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
> > cycles the CPU does nothing useful, because it is stalled on a
> > cache-miss or some other condition.
>
> Conceptually looks fine. I'd prefer a more precise name such as:
> PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
> retirement stalls).

How about this naming convention:

PERF_COUNT_HW_STALLED_CYCLES # execution
PERF_COUNT_HW_STALLED_CYCLES_FRONTEND # frontend
PERF_COUNT_HW_STALLED_CYCLES_ICACHE_MISS # icache

So STALLED_CYCLES would be the most general metric, the one that shows the real
impact to the application. The other events would then help disambiguate this
metric some more.

Below is the updated patch - this version makes the backend stalls event
properly per model. (with the Nehalem table filled in.)

What do you think?

Thanks,

Ingo

--------------------->
Subject: perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
From: Ingo Molnar <[email protected]>
Date: Sun Apr 24 08:18:31 CEST 2011

The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
cycles the CPU does nothing useful, because it is stalled on a
cache-miss or some other condition.

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 42 +++++++++++++++++++++++++--------
include/linux/perf_event.h | 1
tools/perf/util/parse-events.c | 1
tools/perf/util/python.c | 1
4 files changed, 36 insertions(+), 9 deletions(-)

Index: linux/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux/arch/x86/kernel/cpu/perf_event_intel.c
@@ -36,6 +36,23 @@ static const u64 intel_perfmon_event_map
[PERF_COUNT_HW_BUS_CYCLES] = 0x013c,
};

+/*
+ * Other generic events, Nehalem:
+ */
+static const u64 intel_nhm_event_map[] =
+{
+ /* Arch-perfmon events: */
+ [PERF_COUNT_HW_CPU_CYCLES] = 0x003c,
+ [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
+ [PERF_COUNT_HW_CACHE_REFERENCES] = 0x4f2e,
+ [PERF_COUNT_HW_CACHE_MISSES] = 0x412e,
+ [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4,
+ [PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5,
+ [PERF_COUNT_HW_BUS_CYCLES] = 0x013c,
+
+ [PERF_COUNT_HW_STALLED_CYCLES] = 0xffa2, /* 0xff: All reasons, 0xa2: Resource stalls */
+};
+
static struct event_constraint intel_core_event_constraints[] =
{
INTEL_EVENT_CONSTRAINT(0x11, 0x2), /* FP_ASSIST */
@@ -150,6 +167,12 @@ static u64 intel_pmu_event_map(int hw_ev
return intel_perfmon_event_map[hw_event];
}

+static u64 intel_pmu_nhm_event_map(int hw_event)
+{
+ return intel_nhm_event_map[hw_event];
+}
+
+
static __initconst const u64 snb_hw_cache_event_ids
[PERF_COUNT_HW_CACHE_MAX]
[PERF_COUNT_HW_CACHE_OP_MAX]
@@ -1400,18 +1423,19 @@ static __init int intel_pmu_init(void)
case 26: /* 45 nm nehalem, "Bloomfield" */
case 30: /* 45 nm nehalem, "Lynnfield" */
case 46: /* 45 nm nehalem-ex, "Beckton" */
- memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
- sizeof(hw_cache_event_ids));
- memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
- sizeof(hw_cache_extra_regs));
+ memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids, sizeof(hw_cache_event_ids));
+ memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));

intel_pmu_lbr_init_nhm();

- x86_pmu.event_constraints = intel_nehalem_event_constraints;
- x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
- x86_pmu.percore_constraints = intel_nehalem_percore_constraints;
- x86_pmu.enable_all = intel_pmu_nhm_enable_all;
- x86_pmu.extra_regs = intel_nehalem_extra_regs;
+ x86_pmu.event_constraints = intel_nehalem_event_constraints;
+ x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
+ x86_pmu.percore_constraints = intel_nehalem_percore_constraints;
+ x86_pmu.enable_all = intel_pmu_nhm_enable_all;
+ x86_pmu.extra_regs = intel_nehalem_extra_regs;
+ x86_pmu.event_map = intel_pmu_nhm_event_map;
+ x86_pmu.max_events = ARRAY_SIZE(intel_nhm_event_map);
+
pr_cont("Nehalem events, ");
break;

Index: linux/include/linux/perf_event.h
===================================================================
--- linux.orig/include/linux/perf_event.h
+++ linux/include/linux/perf_event.h
@@ -52,6 +52,7 @@ enum perf_hw_id {
PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4,
PERF_COUNT_HW_BRANCH_MISSES = 5,
PERF_COUNT_HW_BUS_CYCLES = 6,
+ PERF_COUNT_HW_STALLED_CYCLES = 7,

PERF_COUNT_HW_MAX, /* non-ABI */
};
Index: linux/tools/perf/util/parse-events.c
===================================================================
--- linux.orig/tools/perf/util/parse-events.c
+++ linux/tools/perf/util/parse-events.c
@@ -38,6 +38,7 @@ static struct event_symbol event_symbols
{ CHW(BRANCH_INSTRUCTIONS), "branch-instructions", "branches" },
{ CHW(BRANCH_MISSES), "branch-misses", "" },
{ CHW(BUS_CYCLES), "bus-cycles", "" },
+ { CHW(STALLED_CYCLES), "stalled-cycles", "" },

{ CSW(CPU_CLOCK), "cpu-clock", "" },
{ CSW(TASK_CLOCK), "task-clock", "" },
Index: linux/tools/perf/util/python.c
===================================================================
--- linux.orig/tools/perf/util/python.c
+++ linux/tools/perf/util/python.c
@@ -798,6 +798,7 @@ static struct {
{ "COUNT_HW_BRANCH_INSTRUCTIONS", PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
{ "COUNT_HW_BRANCH_MISSES", PERF_COUNT_HW_BRANCH_MISSES },
{ "COUNT_HW_BUS_CYCLES", PERF_COUNT_HW_BUS_CYCLES },
+ { "COUNT_HW_STALLED_CYCLES", PERF_COUNT_HW_STALLED_CYCLES },
{ "COUNT_HW_CACHE_L1D", PERF_COUNT_HW_CACHE_L1D },
{ "COUNT_HW_CACHE_L1I", PERF_COUNT_HW_CACHE_L1I },
{ "COUNT_HW_CACHE_LL", PERF_COUNT_HW_CACHE_LL },

2011-04-26 20:33:46

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Tue, 26 Apr 2011, Peter Zijlstra wrote:

> > That's why people use libpfm4. or PAPI. And they do.
>
> And how is typing in hex numbers different from typing in model specific
> event names?

Really... quick, tell me what event 0x53cf28 corresponds to on a core2.

Now if I said L2_IFETCH:BOTH_CORES you know several things about what it
is.

Plus, you can do a quick search in the Intel Arch manual and find more
info. With the hex value you have to do some shifting and masking by hand
before looking up.

An even worse problem:

Quick... tell me what actual hardware event L1-dcache-loads corresponds
to on an L2. Can you tell without digging through kernel source code?

Does that event include prefetches? Does it include speculative events?
Does it count page walks? Does it overcount by an amount equal to the
number of hardware interrupts? If I use the equivalent event on an
AMD64, will all the same hold?

> PAPI actually has 'generalized' events, but I guess you're going to tell
> me nobody uses those since they're not useful.

Of course people use them. But we don't _force_ people to use them. We
don't disable access to raw events. Although alarmingly it seems like the
kernel is going to start to, possibly meaning even our users can't use our
'generalized' events if for example they incorporate OFFCORE_RESPONSE.

Another issue: if a problem is found with one of the PAPI events, they
can update and recompile and run out of their own account at will.

If there's a problem with a kernel generalized event, you have to
reinstall a kernel. Something many users can't do. For example, your
Nehalem cache fixes will be in 2.6.39. How long until that appears in a
stock distro? How long until that appears in an RHEL release?

> > All the world is not perf.
>
> I know, all the world is interested in investing tons of time learning
> about their one architecture and extract the last few percent of
> performance.

There are people out there who have been using perf counters on UNIX/Linux
machines for decades. They know what events they want to measure. They
are not interested in having the kernel tell them they can't do it.

> I looks like you're all so stuck in your HPC/lowlevel way of things
> you're not even realizing there's much more to be gained by providing
> easy and useful tools to the general public, stuff that works similarly
> across architectures.

We're not saying people can't use perf. Who knows, maybe PAPI will go
away because perf is so good. It's just silly to block out access to RAW
events on the argument that "it's too hard". Again, are we Microsoft
here?

> Very constructive attitude, instead of helping you simply subvert and
> route around, thanks man!

I spent a lot of time trying to fix P4 support back in the 2.6.35 days.
I only have so much time to spend on this stuff.

When people complain about p4 support, I direct them to Cyrill et al. I
can't force them to become kernel developers. Usually they want immediate
results, which they can get with perfctr.

People want offcore response. People want uncore access. People want raw
event access. I can tell them "send a patch to the kernel, it'll
languish in obscurity for years and maybe in 2.6.4x you'll see it". Or
they can have support today with an outside patch. Which do you think
they choose?

> And why is that? is that the lack of userspace rdpmc? That should be
> possible with perf, powerpc actually does that already. Various people
> mentioned wanting to make this work on x86 but I've yet to see a patch.

We at the PAPI project welcome any patches you'd care to contribute to our
project too, to make things better. It goes both ways you know.

Vince

2011-04-26 20:51:57

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Tue, 26 Apr 2011, Ingo Molnar wrote:

> The kernel development process is in essence an abstraction engine, and if you
> expect something else you'll probably be facing a lot of frustrating episodes
> in the future as well where others try to abstract out meaningful
> generalizations.

yes, but you are taking abstraction to the extreme.

A filesystem abstracts out the access to raw disk... but under Linux we
still allow raw access to /dev/sda

TCP/IP abstracts out the access to the network... but under Linux we still
allow creating raw packets.

It is fine to have some sort of high-level abstraction of perf events for
those who don't have PhDs in computer architecture. Fine. But don't get
in the way of people who know what they are doing.

Vince

2011-04-26 21:19:23

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On 04/27/2011 12:33 AM, Vince Weaver wrote:
...
>
> I spent a lot of time trying to fix P4 support back in the 2.6.35 days.
> I only have so much time to spend on this stuff.
>
> When people complain about p4 support, I direct them to Cyrill et al. I
> can't force them to become kernel developers. Usually they want immediate
> results, which they can get with perfctr.
>

Vince, I've not read the whole thread so I have no idea what it is all about, but
if you have some p4 machines and some will to help -- mind testing the patch below?
It should fix the nmi-watchdog vs. cycles conflict. It's an utterly raw RFC (and I
know there is a nit I should update) but it still might be interesting to see the
results. Untested.
--
perf, x86: P4 PMU -- Introduce alternate events v3

Alternate events are used to increase perf subsystem counter usage.
In general the idea is to find an "alternate" event (if there is
one) which counts the same quantity as the former event but uses a
different counter, allowing it to run simultaneously with the original
event.

Signed-off-by: Cyrill Gorcunov <[email protected]>
---
arch/x86/include/asm/perf_event_p4.h | 6 ++
arch/x86/kernel/cpu/perf_event_p4.c | 74 ++++++++++++++++++++++++++++++++++-
2 files changed, 78 insertions(+), 2 deletions(-)

Index: linux-2.6.git/arch/x86/include/asm/perf_event_p4.h
=====================================================================
--- linux-2.6.git.orig/arch/x86/include/asm/perf_event_p4.h
+++ linux-2.6.git/arch/x86/include/asm/perf_event_p4.h
@@ -36,6 +36,10 @@
#define P4_ESCR_T1_OS 0x00000002U
#define P4_ESCR_T1_USR 0x00000001U

+#define P4_ESCR_USR_MASK \
+ (P4_ESCR_T0_OS | P4_ESCR_T0_USR | \
+ P4_ESCR_T1_OS | P4_ESCR_T1_USR)
+
#define P4_ESCR_EVENT(v) ((v) << P4_ESCR_EVENT_SHIFT)
#define P4_ESCR_EMASK(v) ((v) << P4_ESCR_EVENTMASK_SHIFT)
#define P4_ESCR_TAG(v) ((v) << P4_ESCR_TAG_SHIFT)
@@ -839,5 +843,7 @@ enum P4_PEBS_METRIC {
* 31: reserved (HT thread)
*/

+#define P4_INVALID_CONFIG (u64)~0
+
#endif /* PERF_EVENT_P4_H */

Index: linux-2.6.git/arch/x86/kernel/cpu/perf_event_p4.c
=====================================================================
--- linux-2.6.git.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6.git/arch/x86/kernel/cpu/perf_event_p4.c
@@ -609,6 +609,31 @@ static u64 p4_general_events[PERF_COUNT_
p4_config_pack_cccr(P4_CCCR_EDGE | P4_CCCR_COMPARE),
};

+/*
+ * Alternate events allow us to find substitution for an event if
+ * it's already borrowed, so they may be considered as event aliases.
+ */
+struct p4_alt_event {
+ unsigned int event;
+ u64 config;
+} p4_alternate_events[]= {
+ {
+ .event = P4_EVENT_GLOBAL_POWER_EVENTS,
+ .config =
+ p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_EXECUTION_EVENT) |
+ P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, NBOGUS0) |
+ P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, NBOGUS1) |
+ P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, NBOGUS2) |
+ P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, NBOGUS3) |
+ P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, BOGUS0) |
+ P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, BOGUS1) |
+ P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, BOGUS2) |
+ P4_ESCR_EMASK_BIT(P4_EVENT_EXECUTION_EVENT, BOGUS3)) |
+ p4_config_pack_cccr(P4_CCCR_THRESHOLD(15) | P4_CCCR_COMPLEMENT |
+ P4_CCCR_COMPARE),
+ },
+};
+
static struct p4_event_bind *p4_config_get_bind(u64 config)
{
unsigned int evnt = p4_config_unpack_event(config);
@@ -620,6 +645,18 @@ static struct p4_event_bind *p4_config_g
return bind;
}

+static u64 p4_find_alternate_config(unsigned int evnt)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(p4_alternate_events); i++) {
+ if (evnt == p4_alternate_events[i].event)
+ return p4_alternate_events[i].config;
+ }
+
+ return P4_INVALID_CONFIG;
+}
+
static u64 p4_pmu_event_map(int hw_event)
{
struct p4_event_bind *bind;
@@ -1133,8 +1170,41 @@ static int p4_pmu_schedule_events(struct
}

cntr_idx = p4_next_cntr(thread, used_mask, bind);
- if (cntr_idx == -1 || test_bit(escr_idx, escr_mask))
- goto done;
+ if (cntr_idx == -1 || test_bit(escr_idx, escr_mask)) {
+
+ /*
+ * So the former event already accepted to run
+ * and the only way to success here is to use
+ * an alternate event.
+ */
+ const u64 usr_mask = p4_config_pack_escr(P4_ESCR_USR_MASK);
+ u64 alt_config;
+ unsigned int event;
+
+ event = p4_config_unpack_event(hwc->config);
+ alt_config = p4_find_alternate_config(event);
+
+ if (alt_config == P4_INVALID_CONFIG)
+ goto done;
+
+ bind = p4_config_get_bind(alt_config);
+ escr_idx = p4_get_escr_idx(bind->escr_msr[thread]);
+ if (unlikely(escr_idx == -1))
+ goto done;
+
+ cntr_idx = p4_next_cntr(thread, used_mask, bind);
+ if (cntr_idx == -1 || test_bit(escr_idx, escr_mask))
+ goto done;
+
+ /*
+ * This is a destructive operation we're going
+ * to make. We substitute the former config with
+ * alternate one to continue tracking it after.
+ * Be careful and don't kill the custom bits
+ * in the former config.
+ */
+ hwc->config = (hwc->config & usr_mask) | alt_config;
+ }

p4_pmu_swap_config_ts(hwc, cpu);
if (assign)

2011-04-26 21:25:48

by Don Zickus

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Wed, Apr 27, 2011 at 01:19:07AM +0400, Cyrill Gorcunov wrote:
> Vince, I've not read the whole thread so I have no idea what it is all about, but
> if you have some p4 machines and some will to help -- mind testing the patch below?
> It should fix the nmi-watchdog vs. cycles conflict. It's an utterly raw RFC (and I
> know there is a nit I should update) but it still might be interesting to see the
> results. Untested.
> --
> perf, x86: P4 PMU -- Introduce alternate events v3

Unfortunately it just panic'd for me when I ran

perf record grep -r don /

Thoughts?

Cheers,
Don

redfish.lab.bos.redhat.com login: BUG: unable to handle kernel NULL
pointer dereference at 0000000000000008
IP: [<ffffffff8101ff60>] p4_pmu_schedule_events+0xb0/0x4c0
PGD 2c603067 PUD 2d617067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 2
Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log
uinput ppdev e1000 parport_pc parport sg dcdbas pcspkr snd_intel8x0
snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm sn]

Pid: 1734, comm: grep Not tainted 2.6.39-rc3usb3-latest+ #339 Dell Inc.
Precision WorkStation 470 /0P7996
RIP: 0010:[<ffffffff8101ff60>] [<ffffffff8101ff60>]
p4_pmu_schedule_events+0xb0/0x4c0
RSP: 0018:ffff88003fb03b18 EFLAGS: 00010016
RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88003c30de00 RSI: 0000000000000004 RDI: 000000000000000f
RBP: ffff88003fb03bb8 R08: 0000000000000001 R09: 0000000000000001
R10: 000000000000006d R11: ffff88003acb4ae8 R12: ffff88002d490c00
R13: ffff88003fb03b78 R14: 0000000000000001 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff88003fb00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000002d728000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process grep (pid: 1734, threadinfo ffff88002d648000, task
ffff88003acb4240)
Stack:
ffff880000000014 ffff88003acb4b10 b00002030803c000 0000000000000003
0000000200000001 ffff88003fb03bc8 0000000100000002 ffff88003fb03bcc
0000000181a24ee0 ffff88003fb0cd48 0000000000000008 0000000000000000
Call Trace:
<IRQ>
[<ffffffff8101b9e1>] ? x86_pmu_add+0xb1/0x170
[<ffffffff8101b8bf>] x86_pmu_commit_txn+0x5f/0xb0
[<ffffffff810ff0c4>] ? perf_event_update_userpage+0xa4/0xe0
[<ffffffff810ff020>] ? perf_output_end+0x60/0x60
[<ffffffff81100dca>] group_sched_in+0x8a/0x160
[<ffffffff8110100b>] ctx_sched_in+0x16b/0x1d0
[<ffffffff811017ce>] perf_event_task_tick+0x1de/0x260
[<ffffffff8104fc1e>] scheduler_tick+0xde/0x2b0
[<ffffffff81096e20>] ? tick_nohz_handler+0x100/0x100

2011-04-26 21:33:54

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On 04/27/2011 01:25 AM, Don Zickus wrote:
> On Wed, Apr 27, 2011 at 01:19:07AM +0400, Cyrill Gorcunov wrote:
>> Vince, I've not read the whole thread so I have no idea what it is all about, but
>> if you have some p4 machines and some will to help -- mind testing the patch below?
>> It should fix the nmi-watchdog vs. cycles conflict. It's an utterly raw RFC (and I
>> know there is a nit I should update) but it still might be interesting to see the
>> results. Untested.
>> --
>> perf, x86: P4 PMU -- Introduce alternate events v3
>
> Unfortunately it just panic'd for me when I ran
>
> perf record grep -r don /
>
> Thoughts?
>
> Cheers,
> Don
>
> redfish.lab.bos.redhat.com login: BUG: unable to handle kernel NULL
> pointer dereference at 0000000000000008
> IP: [<ffffffff8101ff60>] p4_pmu_schedule_events+0xb0/0x4c0
> PGD 2c603067 PUD 2d617067 PMD 0
> Oops: 0000 [#1] SMP
> [ ... register dump and call trace snipped, full oops above ... ]
>

Ouch, I bet p4_config_get_bind returned NULL and here we are. Weird,
seems I've missed something. Don, I'll continue tomorrow, ok? (kinda
sleep already).
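
For reference, a minimal sketch of the failure mode being guessed at above; the
stub types and the lookup below are illustrative assumptions, not the actual
arch/x86/kernel/cpu/perf_event_p4.c code or the eventual fix:

#include <stddef.h>
#include <errno.h>

/*
 * Illustration only: if a raw/alternate config no longer maps to a bind
 * entry, the lookup returns NULL and the scheduling path has to bail out
 * instead of dereferencing it. Stub types stand in for the kernel bits.
 */
struct p4_event_bind { unsigned int escr_msr[2]; };

static struct p4_event_bind *p4_config_get_bind(unsigned long long config)
{
	(void)config;
	return NULL;		/* pretend this config has no bind entry */
}

static int p4_try_schedule(unsigned long long config)
{
	struct p4_event_bind *bind = p4_config_get_bind(config);

	if (!bind)		/* unknown event: reject it instead of oopsing */
		return -ENOENT;

	/* ... otherwise go on to assign an ESCR/CCCR pair ... */
	return 0;
}

int main(void)
{
	return p4_try_schedule(0x1234) == -ENOENT ? 0 : 1;
}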

--
Cyrill

2011-04-27 06:44:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Vince Weaver <[email protected]> wrote:

> On Tue, 26 Apr 2011, Peter Zijlstra wrote:
>
> > > That's why people use libpfm4. or PAPI. And they do.
> >
> > And how is typing in hex numbers different from typing in model specific
> > event names?
>
> Reall... quick, tell me what event 0x53cf28 corresponds to on a core2.
>
> Now if I said L2_IFETCH:BOTH_CORES you know several things about what it is.

Erm, that assumes you already know that magic incantation. Most of the users
who want to do measurements and profiling do not know that. So there's little
difference between:

- someone shows them the 0x53cf28 magic code
- someone shows them the L2_IFETCH:BOTH_CORES magic symbol
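
For what it's worth, both incantations name the same PERFEVTSEL fields. A
minimal decode of such a raw value, assuming the usual architectural
event-select/umask/flag bit layout (check the SDM for the exact model), could
look like this:

#include <stdio.h>
#include <inttypes.h>

/*
 * Decode a perfmon-style raw event value into its PERFEVTSEL fields.
 * Assumed layout: bits 0-7 event select, 8-15 unit mask, 16 USR, 17 OS,
 * 18 edge, 20 APIC int, 22 enable.
 */
int main(void)
{
	uint64_t raw = 0x53cf28;	/* the "magic code" from above */

	printf("event select 0x%02" PRIx64 ", unit mask 0x%02" PRIx64 "\n",
	       raw & 0xff, (raw >> 8) & 0xff);
	printf("usr=%u os=%u edge=%u int=%u en=%u\n",
	       (unsigned)((raw >> 16) & 1), (unsigned)((raw >> 17) & 1),
	       (unsigned)((raw >> 18) & 1), (unsigned)((raw >> 20) & 1),
	       (unsigned)((raw >> 22) & 1));
	return 0;
}

For 0x53cf28 that prints event select 0x28 with unit mask 0xcf, presumably the
L2_IFETCH both-cores/MESI encoding referred to above.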

So while hexa values have like 10% utility, the stupid, vendor-specific
event names you are pushing here have like 15% utility.

In perf we are aiming for 100% utility, where someone who knows something about
CPUs and can type 'cycles', 'instructions' or 'branches' will get the obvious
result.

This is not a difficult usability concept really.

Thanks,

Ingo

2011-04-27 06:53:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Vince Weaver <[email protected]> wrote:

> On Tue, 26 Apr 2011, Ingo Molnar wrote:
>
> > The kernel development process is in essence an abstraction engine, and if
> > you expect something else you'll probably be facing a lot of frustrating
> > episodes in the future as well where others try to abstract out meaningful
> > generalizations.
>
> yes, but you are taking abstraction to the extreme.

Firstly, that claim is a far cry from your original claim:

' How do you "generalize" a functionality like writing a value to an auxiliary
MSR register? '

... so i guess you conceded the point at least partially, without actually
openly and honestly conceding the point?

Secondly, you are still quite wrong even with your revised opinion. Being able
to type '-e cycles' and '-e instructions' in perf and get ... cycles and
instructions counts/events, and the kernel helping that kind of approach is not
'abstraction to the extreme', it's called 'common sense'.

The fact that perfmon and oprofile work via magic vendor-specific event string
incantations is one of the many design failures of those projects - not a
virtue.

Thanks,

Ingo

2011-04-27 11:12:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES


* Arun Sharma <[email protected]> wrote:

> cycles = c["cycles"] * 1.0
> cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
> cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
> backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
> icache_stall_cycles = c["L1I:CYCLES_STALLED"]
>
> # Cycle stall accounting
> print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
> print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
> print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
> print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
> print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
> print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)
>
> # Wasted work
> uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
> uops_retired = c["UOPS_RETIRED:ANY:t=1"]
> uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]
>
> print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
> print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)

Just an update: i started working on generalizing these events.

As a first step i'd like to introduce stall statistics in default 'perf stat'
output, then as a second step offer more detailed modi of analysis (like your
script).

As for the first, 'overview' step, i'd like to use one or two numbers only, to
give people a general ballpark figure about how good the CPU is performing for
a given workload.

Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good,
primary "stall" indicator? This is similar to the "cycles-uops_executed" value
in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE
based): it counts cycles when there's no execution at all - not even
speculative one.

This would cover a wide variety of 'stall' reasons: external latency, or
stalling on lack of parallelism in the incoming instruction stream and most
other stall reasons. So it would measure everything that moves the CPU away
from 100% utilization.

Secondly, the 'speculative waste' proportion is probably pretty well captured
via branch-misprediction counts - those are the primary source of filling the
pipeline with useless work.

So in the most high-level view we could already print useful information via
the introduction of a single new generic event:

PERF_COUNT_HW_CPU_CYCLES_BUSY

and 'idle cycles' are "cycles-busy_cycles".
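
To make the arithmetic concrete, here is a minimal sketch of how the ratios in
the output below can be derived from the raw counts; the function is purely
illustrative and not the perf stat code:

#include <stdio.h>

/*
 * Illustrative only: derive the 'overview' ratios from a cycles count, a
 * busy-cycles count (the proposed PERF_COUNT_HW_CPU_CYCLES_BUSY) and an
 * instruction count taken over the same run.
 */
static void print_overview(unsigned long long cycles,
			   unsigned long long busy,
			   unsigned long long instructions)
{
	unsigned long long stalled = cycles > busy ? cycles - busy : 0;

	printf("%6.2f%% of all cycles are idle\n", 100.0 * stalled / cycles);
	printf("%6.2f insns per cycle\n", (double)instructions / cycles);
	printf("%6.2f stalled cycles per insn\n", (double)stalled / instructions);
}

int main(void)
{
	/* counts taken from the './branches 20 1' run below */
	print_overview(31470886ULL, 31470886ULL - 9825068ULL, 27868090ULL);
	return 0;
}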

I have implemented preliminary support for this, and this is how the new
'overview' output currently looks. Firstly here's the output from a "bad"
testcase (lots of branch-misses):

$ perf stat ./branches 20 1

Performance counter stats for './branches 20 1' (10 runs):

9.829903 task-clock # 0.972 CPUs utilized ( +- 0.07% )
0 context-switches # 0.000 M/sec ( +- 0.00% )
0 CPU-migrations # 0.000 M/sec ( +- 0.00% )
111 page-faults # 0.011 M/sec ( +- 0.09% )
31,470,886 cycles # 3.202 GHz ( +- 0.06% )
9,825,068 stalled-cycles # 31.22% of all cycles are idle ( +- 13.89% )
27,868,090 instructions # 0.89 insns per cycle
# 0.35 stalled cycles per insn ( +- 0.02% )
4,313,661 branches # 438.830 M/sec ( +- 0.02% )
1,068,668 branch-misses # 24.77% of all branches ( +- 0.01% )

0.010117520 seconds time elapsed ( +- 0.14% )

The two important values are the "31.22% of all cycles are idle" and the
"24.77% of all branches" missed values - both are high and indicative of
trouble.

The fixed testcase shows:

Performance counter stats for './branches 20 0' (100 runs):

4.417987 task-clock # 0.948 CPUs utilized ( +- 0.10% )
0 context-switches # 0.000 M/sec ( +- 0.00% )
0 CPU-migrations # 0.000 M/sec ( +- 0.00% )
111 page-faults # 0.025 M/sec ( +- 0.02% )
14,135,368 cycles # 3.200 GHz ( +- 0.10% )
1,939,275 stalled-cycles # 13.72% of all cycles are idle ( +- 4.99% )
27,846,610 instructions # 1.97 insns per cycle
# 0.07 stalled cycles per insn ( +- 0.00% )
4,309,228 branches # 975.383 M/sec ( +- 0.00% )
3,992 branch-misses # 0.09% of all branches ( +- 0.26% )

0.004660164 seconds time elapsed ( +- 0.15% )

Both stall values are much lower and the instructions per cycle value doubled.

Here's another testcase, one that fills the pipeline near-perfectly:

$ perf stat ./fill_1b

Performance counter stats for './fill_1b':

1874.601174 task-clock # 0.998 CPUs utilized
1 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
107 page-faults # 0.000 M/sec
6,009,321,149 cycles # 3.206 GHz
212,795,827 stalled-cycles # 3.54% of all cycles are idle
18,007,646,574 instructions # 3.00 insns per cycle
# 0.01 stalled cycles per insn
1,001,527,311 branches # 534.262 M/sec
16,988 branch-misses # 0.00% of all branches

1.878558311 seconds time elapsed

Here too both counts are very low.

The next step is to provide the tools to further analyze why the CPU is not
utilized perfectly. I have implemented some preliminary code for that too,
using generic cache events:

$ perf stat --repeat 10 --detailed ./array-bad

Performance counter stats for './array-bad' (10 runs):

50.552646 task-clock # 0.992 CPUs utilized ( +- 0.04% )
0 context-switches # 0.000 M/sec ( +- 0.00% )
0 CPU-migrations # 0.000 M/sec ( +- 0.00% )
1,877 page-faults # 0.037 M/sec ( +- 0.01% )
142,802,193 cycles # 2.825 GHz ( +- 0.18% ) (22.55%)
88,622,411 stalled-cycles # 62.06% of all cycles are idle ( +- 0.22% ) (34.97%)
45,381,755 instructions # 0.32 insns per cycle
# 1.95 stalled cycles per insn ( +- 0.11% ) (46.94%)
7,725,207 branches # 152.815 M/sec ( +- 0.05% ) (58.44%)
29,788 branch-misses # 0.39% of all branches ( +- 1.06% ) (69.46%)
8,421,969 L1-dcache-loads # 166.598 M/sec ( +- 0.37% ) (70.06%)
7,868,389 L1-dcache-load-misses # 93.43% of all L1-dcache hits ( +- 0.13% ) (58.28%)
4,553,490 LLC-loads # 90.074 M/sec ( +- 0.31% ) (44.49%)
1,764,277 LLC-load-misses # 34.900 M/sec ( +- 0.21% ) (9.98%)

0.050973462 seconds time elapsed ( +- 0.05% )

The --detailed flag is what activates wider counting. The "93.43% of all
L1-dcache hits" is a giveaway indicator that this particular workload is
primarily data-access limited and that much of it escapes into RAM as well.

Is this the direction you'd like to see perf stat move into? Any comments,
suggestions?

Thanks,

Ingo

2011-04-27 14:47:05

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES

On Wed, Apr 27, 2011 at 4:11 AM, Ingo Molnar <[email protected]> wrote:
> As for the first, 'overview' step, i'd like to use one or two numbers only, to
> give people a general ballpark figure about how good the CPU is performing for
> a given workload.
>
> Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good,
> primary "stall" indicator? This is similar to the "cycles-uops_executed" value
> in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE
> based): it counts cycles when there's no execution at all - not even
> speculative one.

If we're going to pick one stall indicator, why not pick cycles where
no uops are retiring?

cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]

In the presence of C-states and some halted cycles, I found that I
couldn't measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts
halted cycles too and could be greater than (unhalted) cycles.

The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED
condition. I believe this is caused by what AMD calls sideband stack
optimizer and Intel calls dedicated stack manager (i.e. UOPS executed
outside the main pipeline). A recursive fibonacci(30) is a good test
case for reproducing this.

>
> > Is this the direction you'd like to see perf stat move into? Any comments,
> suggestions?
>

Looks like a step in the right direction. Thanks.

-Arun

2011-04-27 15:48:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES


* Arun Sharma <[email protected]> wrote:

> On Wed, Apr 27, 2011 at 4:11 AM, Ingo Molnar <[email protected]> wrote:
> > As for the first, 'overview' step, i'd like to use one or two numbers only, to
> > give people a general ballpark figure about how good the CPU is performing for
> > a given workload.
> >
> > Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good,
> > primary "stall" indicator? This is similar to the "cycles-uops_executed" value
> > in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE
> > based): it counts cycles when there's no execution at all - not even
> > speculative one.
>
> If we're going to pick one stall indicator, [...]

Well, one stall indicator for the 'general overview' stage, plus branch misses.

Other stages can also have all sorts of details, including various subsets of
stall reasons. (and stalls of different units of the CPU)

We'll see how far it can be pushed.

> [...] why not pick cycles where no uops are retiring?
>
> cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
>
> In the presence of C-states and some halted cycles, I found that I couldn't
> measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts halted cycles too
> and could be greater than (unhalted) cycles.

Agreed, good point.

You are right that it is more robust to pick 'the CPU was busy on our behalf'
metric instead of a 'CPU is idle' metric, because that way 'HLT' as a special
type of idling around does not have to be identified.

HLT is not an issue for the default 'perf stat' behavior (because it only
measures task execution, never the idle thread or other tasks not involved with
the workload), but for per CPU and system-wide (--all) it matters.

I'll flip it around.

> The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED
> condition. I believe this is caused by what AMD calls sideband stack
> optimizer and Intel calls dedicated stack manager (i.e. UOPS executed outside
> the main pipeline). A recursive fibonacci(30) is a good test case for
> reproducing this.

So the PORT015+234 sum is not precise? The definition seems to be rather firm:

Counts number of Uops executed that were issued on port 2, 3, or 4.
Counts number of Uops executed that were issued on port 0, 1, or 5.

Wouldnt that include all uops?

> > Is this the direction you'd like to see perf stat move into? Any
> > comments, suggestions?
>
> Looks like a step in the right direction. Thanks.

Ok, great - will keep you updated. I doubt the defaults can ever beat truly
expert use of PMU events: there will always be fine details that a generic
approach will miss. But i'd be happy if we got 70% of the way ...

Thanks,

Ingo

2011-04-27 16:27:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES


* Ingo Molnar <[email protected]> wrote:

> > [...] why not pick cycles where no uops are retiring?
> >
> > cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
> >
> > In the presence of C-states and some halted cycles, I found that I couldn't
> > measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts halted cycles too
> > and could be greater than (unhalted) cycles.
>
> Agreed, good point.
>
> You are right that it is more robust to pick 'the CPU was busy on our behalf'
> metric instead of a 'CPU is idle' metric, because that way 'HLT' as a special
> type of idling around does not have to be identified.

Sidenote, there's one advantage of the idle event: it's more meaningful to
profile idle cycles - and it's easy to ignore the HLT loop in the profile
output (we already do).

That way we get a 'hidden overhead' profile: a profile of frequently executed
code which executes in the CPU in a suboptimal way.

So we should probably offer both events.

Thanks,

Ingo

2011-04-27 19:03:14

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES

On Wed, Apr 27, 2011 at 8:48 AM, Ingo Molnar <[email protected]> wrote:

>
>> The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED
>> condition. I believe this is caused by what AMD calls sideband stack
>> optimizer and Intel calls dedicated stack manager (i.e. UOPS executed outside
>> the main pipeline). A recursive fibonacci(30) is a good test case for
>> reproducing this.
>
> So the PORT015+234 sum is not precise? The definition seems to be rather firm:
>
> Counts number of Uops executed that were issued on port 2, 3, or 4.
> Counts number of Uops executed that were issued on port 0, 1, or 5.
>

There is some work done outside of the main out of order engine for
power optimization reasons:

Described as dedicated stack engine here:
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf

However, I don't seem to be able to reproduce this behavior using a
micro benchmark right now:

# cat foo.s
.text
.global main
main:
1:                      # tight infinite loop, all traffic goes through the stack
push %rax
push %rbx
push %rcx
push %rdx
pop %rax                # pops are balanced against the pushes, so the stack
pop %rbx                # pointer never drifts; the values merely get shuffled
pop %rcx                # between registers, and nothing ever reads them
pop %rdx
jmp 1b

Performance counter stats for './foo':

7,755,881,073 UOPS_ISSUED:ANY:t=1 (scaled from 79.98%)
10,569,957,988 UOPS_RETIRED:ANY:t=1 (scaled from 79.96%)
9,155,400,383 UOPS_EXECUTED:PORT234_CORE (scaled from 80.02%)
2,594,206,312 UOPS_EXECUTED:PORT015:t=1 (scaled from 80.02%)

Perhaps I was thinking of UOPS_ISSUED < UOPS_RETIRED.

In general, UOPS_RETIRED (or instruction retirement in general) is the
"source of truth" in an otherwise crazy world and might be more
interesting as a generalized event that works on multiple
architectures.

-Arun

2011-04-27 19:05:26

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES

On Wed, Apr 27, 2011 at 9:27 AM, Ingo Molnar <[email protected]> wrote:

>
> That way we get a 'hidden overhead' profile: a profile of frequently executed
> code which executes in the CPU in a suboptimal way.
>
> So we should probably offer both events.
>

Yes - certainly.

-Arun

2011-04-28 22:10:50

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Wed, 27 Apr 2011, Ingo Molnar wrote:

>
> Erm, that assumes you already know that magic incantation. Most of the users
> who want to do measurements and profiling do not know that. So there's little
> difference between:
>
> - someone shows them the 0x53cf28 magic code
> - someone shows them the L2_IFETCH:BOTH_CORES magic symbol
>
> So while hexa values have like 10% utility, the stupid, vendor-specific
> event names you are pushing here have like 15% utility.
>
> In perf we are aiming for 100% utility, where someone who knows something about
> CPUs and can type 'cycles', 'instructions' or 'branches' will get the obvious
> result.
>
> This is not a difficult usability concept really.

yes, and this functionality belongs in the perf tool itself (or some other
user tool, like libpfm4, or PAPI). Not in the kernel.

How much larger are you willing to make the kernel to hold your
generalized events? PAPI has at least 128 that people have found useful
enough to add over the years. There are probably more.

I notice the kernel doesn't have any FP or SSE/Vector counts yet. Or
uops. Or hw-interrupt counts. Fused multiply-add?
How about GPU counters (PAPI is starting to support these)? Network
counters? Infiniband?

You're being lazy and pushing "perf" functionality into the kernel. It
belongs in userspace.

It's not the kernel's job to make things easy for users. Its job is to
make things possible, and get out of the way.

It's already bad enough that your generalized events can change from
kernel version to kernel version without warning. By being in the kernel,
aren't they a stable ABI that can't be changed?

Vince

2011-04-28 22:16:42

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

On Wed, 27 Apr 2011, Ingo Molnar wrote:

> Secondly, you are still quite wrong even with your revised opinion. Being able
> to type '-e cycles' and '-e instructions' in perf and get ... cycles and
> instructions counts/events, and the kernel helping that kind of approach is not
> 'abstraction to the extreme', it's called 'common sense'.

by your logic I should be able to delete a file by saying
echo "delete /tmp/tempfile" > /dev/sdc1
because using unlink() is too low of an abstraction and confusing to the
user.

> The fact that perfmon and oprofile works via magic vendor-specific event string
> incantations is one of the many design failures of those projects - not a
> virtue.

Well we disagree. I think one of perf_events' biggest failings (among
many) is that these generalized event definitions are shoved into the
kernel. At least it bloats the kernel in an option commonly turned on by
vendors. At worst it gives users a false sense of security in thinking
these counters are A) portable across architectures and B) actually
measure what they say they do.

I know it is fun to reinvent the wheel, but you ignored decades of
experience in dealing with perf-counters when you ran off and invented
perf_events. It will bite you eventually.

Vince

2011-04-28 23:30:48

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

Vince,

On Thu, 28 Apr 2011, Vince Weaver wrote:

> On Wed, 27 Apr 2011, Ingo Molnar wrote:
>
> > Secondly, you are still quite wrong even with your revised opinion. Being able
> > to type '-e cycles' and '-e instructions' in perf and get ... cycles and
> > instructions counts/events, and the kernel helping that kind of approach is not
> > 'abstraction to the extreme', it's called 'common sense'.
>
> by your logic I should be able to delete a file by saying
> echo "delete /tmp/tempfile" > /dev/sdc1
> because using unlink() is too low of an abstraction and confusing to the
> user.

Your definition of 'common sense' seems to be rather backwards.

> > The fact that perfmon and oprofile works via magic vendor-specific event string
> > incantations is one of the many design failures of those projects - not a
> > virtue.
>
> Well we disagree. I think one of perf_events' biggest failings (among
> many) is that these generalized event definitions are shoved into the

Put the failings on the table if you think there are any real
ones.

The generalized event definitions are debatable, but Ingo's argument
that they fulfil the common sense level is definitely a strong enough
one to keep them.

The problem at hand which ignited this flame war is definitely
borderline and I don't agree with Ingo that it should not be made
available right now in raw form. That's a hardware enablement
feature which can be useful even if tools/perf has no support for it
and we have no generalized event for it. Those are two different
stories. perf has always allowed the use of raw events and I don't see a
reason why we should not do that in this case if it enables a subset
of the perf userbase to make use of it.

> kernel. At least it bloats the kernel in an option commonly turned on by

Well compared to the back then proposed perfmon kernel bloat, that's
really nothing you should whine about.

> vendors. At worst it gives users a false sense of security in thinking
> these counters are A) portable across architectures and B) actually
> measure what they say they do.

Again, in the common sense approach they actually do what they
say.

For real experts like you there are still the raw events to get the
real thing which is meaningful for those who understand what 'cycles'
and 'instructions' really mean. Cough, cough....

> I know it is fun to reinvent the wheel, but you ignored decades of
> experience in dealing with perf-counters when you ran off and invented
> perf_events. It will bite you eventually.

Stop this whining already. I thoroughly reviewed the outcome of
"decades of experience" and I still shudder when I get reminded of
that exercise.

Yes, we invented perf_events because the proposed perfmon kernel
patches were an outright horror: a cobbled-together experience dump
along with a nice bunch of unfixable security holes, locking
issues and permission problems, plus a completely nonobvious userspace
interface. In short, a complete design failure.

So perf_events was not a reinvention of the wheel. It was a sane
design decision to make performance counters available _AND_ useful
for a broad audience and a broad range of use cases.

If the only substantial complaint about perf you can bring up is the
detail of generalized events, then we can agree that we disagree and
stop wasting electrons right now.

Thanks,

tglx

2011-04-29 02:28:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

> I know it is fun to reinvent the wheel, but you ignored decades of
> experience in dealing with perf-counters when you ran off and invented
> perf_events. It will bite you eventually.

s/eventually//

A good example of that is that perf events counted completely bogus LLC
generalized cache events for several releases (before the offcore patches
went in).

And BTW they have now completely changed again with Peter's changes, counting
something quite different.

-Andi

2011-04-29 19:32:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2


* Vince Weaver <[email protected]> wrote:

> On Wed, 27 Apr 2011, Ingo Molnar wrote:
>
> > Secondly, you are still quite wrong even with your revised opinion. Being able
> > to type '-e cycles' and '-e instructions' in perf and get ... cycles and
> > instructions counts/events, and the kernel helping that kind of approach is not
> > 'abstraction to the extreme', it's called 'common sense'.
>
> by your logic I should be able to delete a file by saying
>
> echo "delete /tmp/tempfile" > /dev/sdc1

> because using unlink() is too low of an abstraction and confusing to the
> user.

Erm, unlink() does not pass magic hexa constants to the disk controller.

unlink() is a high level interface that works across a vast range of disk
controllers, disks, network mounted filesystems, in-RAM filesystems, in-ROM
filesystems, clustered filesystems and other mediums.

Just like that we can tell perf to count 'cycles', 'branches' or
'branch-misses' - all of these are relatively high level concepts (in the scope
of CPUs) that work across a vast range of CPU types and models.
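
As an aside, this distinction maps directly onto the syscall interface; a
minimal sketch (illustrative, not code from this thread) of asking for a
generalized event versus a raw, model-specific one:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/*
 * Open a counter on the current task, any CPU. Error handling trimmed;
 * the raw config value is just the example from earlier in the thread.
 */
static int open_counter(__u32 type, __u64 config)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = config;

	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	int generic = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
	int raw     = open_counter(PERF_TYPE_RAW, 0x53cf28); /* model specific */

	printf("generic fd=%d, raw fd=%d\n", generic, raw);
	return 0;
}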

Similarly, for offcore we want to introduce the concept of 'node local' versus
'remote' memory - perhaps with some events for inter-CPU traffic as well -
because that probably covers most of the NUMA related memory profiling needs.

Raw events are to perf what ioctls are to the VFS: small details nobody felt
worth generalizing. My point in this discussion is that we do not offer new
filesystems that support *only* ioctl calls ... Is this simple concept so hard
to understand?

Thanks,

Ingo