2009-06-22 20:53:46

by Brice Goglin

Subject: [perf] howto switch from pfmon

Hello,

I am trying to play with perfcounters in current git (actually in latest
mmotm). I'd like to reproduce what I previously did with pfmon, but I
couldn't so far.

Something like
pfmon --follow-exec 'foobar' -e
CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_0,CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_1
-- <shell script>
gives the number of memory accesses to dram node #0 and #1 for all
processes whose name matches 'foobar'.

So there are several questions here:
1) is it possible to specify counter names like the above or do we have
to use raw counter numbers? I tried raw numbers from [1] without
success. How am I supposed to find and specify these raw numbers?
2) how do we specify "subevents"?
3) is there anything similar to --follow-exec, or --follow-pthreads for
getting separated outputs for each thread?

I guess there are still a lot of things on the TODOlist but I'd like to
understand a bit more where things are going. Sorry I didn't read all
the archives about this, there are way too many of them recently :)

thanks,
Brice

[1]
https://aiya.ms.mff.cuni.cz/svn/rip/trunk/doc/devel/native_events_barcelona.txt


2009-06-23 12:12:31

by Andi Kleen

Subject: Re: [perf] howto switch from pfmon

Brice Goglin <[email protected]> writes:

> Hello,
>
> I am trying to play with perfcounters in current git (actually in latest
> mmotm). I'd like to reproduce what I previously did with pfmon, but I
> couldn't so far.
>
> Something like
> pfmon --follow-exec 'foobar' -e
> CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_0,CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_1
> -- <shell script>
> gives the number of memory accesses to dram node #0 and #1 for all
> processes whose name matches 'foobar'.

My understanding based on recent emails on the topic is that the
perfctr gods decreed you are not to do any of this because they cannot
think of a use case for it, therefore none exist.

-Andi



--
[email protected] -- Speaking for myself only.

2009-06-23 12:23:22

by Peter Zijlstra

Subject: Re: [perf] howto switch from pfmon

On Tue, 2009-06-23 at 14:12 +0200, Andi Kleen wrote:
> Brice Goglin <[email protected]> writes:
>
> > Hello,
> >
> > I am trying to play with perfcounters in current git (actually in latest
> > mmotm). I'd like to reproduce what I previously did with pfmon, but I
> > couldn't so far.
> >
> > Something like
> > pfmon --follow-exec 'foobar' -e
> > CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_0,CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_1
> > -- <shell script>
> > gives the number of memory accesses to dram node #0 and #1 for all
> > processes whose name matches 'foobar'.
>
> My understanding based on recent emails on the topic is that the
> perfctr gods decreed you are not to do any of this because they cannot
> think of a use case for it, therefore none exist.

I wouldn't put it like that.

But we haven't gotten around to implementing uncore pmu stuff --
assuming that is what was meant.

What would be accurate is to say that we think uncore is a lot less
interesting than a lot of other pmu features.

2009-06-23 13:15:07

by Ingo Molnar

Subject: Re: [perf] howto switch from pfmon

* Brice Goglin <[email protected]> wrote:

> Hello,
>
> I am trying to play with perfcounters in current git (actually in
> latest mmotm). I'd like to reproduce what I previously did with
> pfmon, but I couldn't so far.
>
> Something like
> pfmon --follow-exec 'foobar' -e
> CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_0,CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_1
> -- <shell script>
> gives the number of memory accesses to dram node #0 and #1 for all
> processes whose name matches 'foobar'.
>
> So there are several questions here:
> 1) is it possible to specify counter names like the above or do we have
> to use raw counter numbers? I tried raw numbers from [1] without
> success. How am I supposed to find and specify these raw numbers?
> 2) how do we specify "subevents"?
> 3) is there anything similar to --follow-exec, or --follow-pthreads for
> getting separated outputs for each thread?
>
> I guess there are still a lot of things on the TODOlist but I'd
> like to understand a bit more where things are going. Sorry I
> didn't read all the archives about this, there are way too many of
> them recently :)

Yeah, there's indeed still a lot on the TODO list :-)

CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE is a Barcelona hardware event,
so if you know that it maps to raw ID 0x100000e0 then you can always
extend the events that 'perf' knows about via raw events:

$ perf stat -e cycles -e instructions -e r1000ffe0 ./hackbench 10
Time: 0.186

Performance counter stats for './hackbench 10':

4381248335 cycles
1964394846 instructions # 0.448 IPC
838 raw 0x1000ffe0

0.215382037 seconds time elapsed.

That 'r1000ffe0' is the raw event. You can also do a profile with
such events:

perf record -f -e r1000ffe0 ./hackbench 10

and look at it via 'perf report'.

Figuring out raw codes is certainly avoidable, we could probably
integrate all the oprofile (and PAPI) event names into perf too,
from the /usr/share/oprofile/ event lists perhaps - for easier
migration for those who got used to those event names. It also gives
a wider set of events - which is useful if you got used to any
specific name.

The Barcelona events are listed in section 3.14 of "BIOS
and Kernel Developer's Guide for AMD Family 10h Processors", that's
where all the projects take these symbols from. If you want to
contribute then creating such tables for 'perf', for model-specific
events would certainly be useful.

[ Note, there's no need to specify any --follow-* flags as that is
implicit in 'perf'. (and you'll probably also notice that perf
stat is a lot faster at following fast-forking or
context-switching workloads than is pfmon, because it's not ptrace
based.) ]

And please let us know if you see any weirdness/difficulty while
using 'perf' or if you just notice some quirky thing in the tool.

Thanks,

Ingo

2009-06-23 13:22:33

by Peter Zijlstra

Subject: Re: [perf] howto switch from pfmon

On Tue, 2009-06-23 at 15:14 +0200, Ingo Molnar wrote:
> * Brice Goglin <[email protected]> wrote:
>
> > Hello,
> >
> > I am trying to play with perfcounters in current git (actually in
> > latest mmotm). I'd like to reproduce what I previously did with
> > pfmon, but I couldn't so far.
> >
> > Something like
> > pfmon --follow-exec 'foobar' -e
> > CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_0,CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_1
> > -- <shell script>
> > gives the number of memory accesses to dram node #0 and #1 for all
> > processes whose name matches 'foobar'.
> >
> > So there are several questions here:
> > 1) is it possible to specify counter names like the above or do we have
> > to use raw counter numbers? I tried raw numbers from [1] without
> > success. How am I supposed to find and specify these raw numbers?
> > 2) how do we specify "subevents"?
> > 3) is there anything similar to --follow-exec, or --follow-pthreads for
> > getting separated outputs for each thread?
> >
> > I guess there are still a lot of things on the TODOlist but I'd
> > like to understand a bit more where things are going. Sorry I
> > didn't read all the archives about this, there are way too many of
> > them recently :)
>
> Yeah, there's indeed still a lot on the TODO list :-)
>
> CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE is a Barcelona hardware event,
> so if you know that it maps to raw ID 0x100000e0 then you can always
> extend the events that 'perf' knows about via raw events:
>
> $ perf stat -e cycles -e instructions -e r1000ffe0 ./hackbench 10
> Time: 0.186
>
> Performance counter stats for './hackbench 10':
>
> 4381248335 cycles
> 1964394846 instructions # 0.448 IPC
> 838 raw 0x1000ffe0
>
> 0.215382037 seconds time elapsed.

Just to clarify, the event code is 1E0h, and Ingo used a FFh unit mask.
These are combined using the arch masks below:

#define K7_EVNTSEL_EVENT_MASK 0x7000000FFULL
#define K7_EVNTSEL_UNIT_MASK 0x00000FF00ULL

to form the raw event code used: 0x1000ffe0

2009-06-23 13:25:24

by Ingo Molnar

Subject: Re: [perf] howto switch from pfmon


* Ingo Molnar <[email protected]> wrote:

> > I guess there are still a lot of things on the TODOlist but I'd
> > like to understand a bit more where things are going. Sorry I
> > didn't read all the archives about this, there are way too many
> > of them recently :)
>
> Yeah, there's indeed still a lot on the TODO list :-)
>
> CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE is a Barcelona hardware event,
> so if you know that it maps to raw ID 0x100000e0 then you can
> always extend the events that 'perf' knows about via raw events:
>
> $ perf stat -e cycles -e instructions -e r1000ffe0 ./hackbench 10

Note, beyond using raw events, if you are interested in profiling
out 'locality badness' of your app, you are probably quite well
served with the default metrics on Barcelona as well:

$ perf stat ~/hackbench 10
Time: 0.205

Performance counter stats for '/home/mingo/hackbench 10':

2187.328436 task-clock-msecs # 3.315 CPUs
54554 context-switches # 0.025 M/sec
1160 CPU-migrations # 0.001 M/sec
17755 page-faults # 0.008 M/sec
4995437535 cycles # 2283.808 M/sec
2150881875 instructions # 0.431 IPC
644099534 cache-references # 294.469 M/sec
8516562 cache-misses # 3.894 M/sec

0.659895237 seconds time elapsed.

The cache-misses event is sufficiently well-represented to be
meaningful to profile based on it. Raw DRAM access stats can be
useful too - but DRAM sits much further down the memory hierarchy,
and your app can already suffer from flip-flopping its working set
without hitting the DRAM channels too hard.

So perhaps 'cache-misses' is a good first-level approximation metric
to measure and profile along. You can get a good
(last-level-)cache-misses profile using the auto-freq counters:

perf record -e cache-misses -F 10000 ./your-app

The '-F 10000' tells the kernel to do 10 KHz sampling of your-app,
regardless of how frequent cache-misses are. The tools (perf report)
will take the weight of events into account, so it's all
well-normalized between the functions.

So you don't need to specify the 'sampling interval' by hand to get a
sufficient number of samples, you just specify a sampling frequency
- and the perfcounters subsystem takes care of the rest.

Also, your system won't over-sample nor under-sample if your workload
idles around occasionally.

Ingo

2009-06-23 13:39:05

by Ingo Molnar

Subject: Re: [perf] howto switch from pfmon


* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2009-06-23 at 15:14 +0200, Ingo Molnar wrote:
> > * Brice Goglin <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I am trying to play with perfcounters in current git (actually in
> > > latest mmotm). I'd like to reproduce what I previously did with
> > > pfmon, but I couldn't so far.
> > >
> > > Something like
> > > pfmon --follow-exec 'foobar' -e
> > > CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_0,CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_1
> > > -- <shell script>
> > > gives the number of memory accesses to dram node #0 and #1 for all
> > > processes whose name matches 'foobar'.
> > >
> > > So there are several questions here:
> > > 1) is it possible to specify counter names like the above or do we have
> > > to use raw counter numbers? I tried raw numbers from [1] without
> > > success. How am I supposed to find and specify these raw numbers?
> > > 2) how do we specify "subevents"?
> > > 3) is there anything similar to --follow-exec, or --follow-pthreads for
> > > getting separated outputs for each thread?
> > >
> > > I guess there are still a lot of things on the TODOlist but I'd
> > > like to understand a bit more where things are going. Sorry I
> > > didn't read all the archives about this, there are way too many of
> > > them recently :)
> >
> > Yeah, there's indeed still a lot on the TODO list :-)
> >
> > CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE is a Barcelona hardware event,
> > so if you know that it maps to raw ID 0x100000e0 then you can always
> > extend the events that 'perf' knows about via raw events:
> >
> > $ perf stat -e cycles -e instructions -e r1000ffe0 ./hackbench 10
> > Time: 0.186
> >
> > Performance counter stats for './hackbench 10':
> >
> > 4381248335 cycles
> > 1964394846 instructions # 0.448 IPC
> > 838 raw 0x1000ffe0
> >
> > 0.215382037 seconds time elapsed.
>
> Just to clarify, the event code is 1E0h, and Ingo used a FFh unit mask.
> These are combined using the arch masks below:
>
> #define K7_EVNTSEL_EVENT_MASK 0x7000000FFULL
> #define K7_EVNTSEL_UNIT_MASK 0x00000FF00ULL
>
> to form the raw event code used: 0x1000ffe0

Yes. The individual node mappings are 01, 02 .. 80 - ff is 'all 8
nodes'.
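
As a rough illustration of that node mapping, the per-node raw events can be generated with a small shell helper. This is only a sketch: node_events is a made-up name, not a perf feature, and the base value 0x1000000e0 assumes event select 1E0h with its top nibble placed at bit 32, which is what the corrected raw code in this thread uses.

```shell
#!/bin/sh
# Sketch: print one '-e rXXXX' option per target NUMA node.
# node_events is a hypothetical helper, not part of perf.
# Assumption: base event 0x1000000e0 (event select 1E0h), one unit-mask
# bit per target node, with the unit mask living in bits 15:8.
node_events() {
    node=0
    sep=""
    while [ "$node" -lt "$1" ]; do
        # unit mask bit for this node: 0x01, 0x02, ... 0x80
        printf '%s-e r%x' "$sep" $(( 0x1000000e0 | ((1 << $node) << 8) ))
        sep=" "
        node=$(( $node + 1 ))
    done
    printf '\n'
}

# Options for a 4-node box
node_events 4
```

The output can be pasted after 'perf stat' to count accesses to each node separately.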

Ingo

2009-06-23 13:48:02

by Ingo Molnar

Subject: Re: [perf] howto switch from pfmon


* Ingo Molnar <[email protected]> wrote:

> $ perf stat -e cycles -e instructions -e r1000ffe0 ./hackbench 10
> Time: 0.186

Correction: that should be r10000ffe0.
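
For reference, a minimal sketch of how the corrected code falls out of the event select and unit mask, using the two K7_EVNTSEL masks quoted earlier in the thread. The helper name amd_raw_event is invented for illustration, and the bit placement (event bits 11:8 moved up to bit 32 and above) is an assumption based on the 1E0h / FFh explanation given above.

```shell
#!/bin/sh
# Sketch: combine an AMD event select and a unit mask into a perf raw code.
# amd_raw_event is a hypothetical helper, not part of perf.
EVENT_MASK=0x7000000ff   # K7_EVNTSEL_EVENT_MASK as quoted in the thread
UNIT_MASK=0x00000ff00    # K7_EVNTSEL_UNIT_MASK as quoted in the thread

amd_raw_event() {
    event=$(( $1 ))   # event select, e.g. 0x1e0
    unit=$(( $2 ))    # unit mask, e.g. 0xff
    # low 8 bits stay in place, bits 11:8 move up by 24 (to bit 32 and above)
    sel=$(( ($event & 0xff) | (($event & 0xf00) << 24) ))
    printf 'r%x\n' $(( ($sel & $EVENT_MASK) | (($unit << 8) & $UNIT_MASK) ))
}

amd_raw_event 0x1e0 0xff   # prints r10000ffe0
```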

Ingo

2009-06-23 13:57:36

by Ingo Molnar

Subject: Re: [perf] howto switch from pfmon


* Andi Kleen <[email protected]> wrote:

> Brice Goglin <[email protected]> writes:
>
> > Hello,
> >
> > I am trying to play with perfcounters in current git (actually in latest
> > mmotm). I'd like to reproduce what I previously did with pfmon, but I
> > couldn't so far.
> >
> > Something like
> > pfmon --follow-exec 'foobar' -e
> > CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_0,CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_1
> > -- <shell script>
> > gives the number of memory accesses to dram node #0 and #1 for all
> > processes whose name matches 'foobar'.
>
> My understanding based on recent emails on the topic is that the
> perfctr gods decreed you are not to do any of this because they
> cannot think of a use case for it, therefore none exist.

You are working for Intel, right?

Is the trolling of AMD related threads now an officially sanctioned
activity by Intel, or do you do it out of personal motivation, in
your free time? I'd really like to know, because what you do here is
quite unprofessional and quite a distraction.

[ Btw., 'perfctr' is the name of another project, the one you wanted
to attack here is called 'perfcounters'. ]

Ingo

2009-06-23 13:59:53

by Brice Goglin

Subject: Re: [perf] howto switch from pfmon

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>> $ perf stat -e cycles -e instructions -e r1000ffe0 ./hackbench 10
>> Time: 0.186
>>
>
> Correction: that should be r10000ffe0.
>

Oh thanks a lot, it seems to work now!

One strange thing I noticed: sometimes perf reports that there were some
accesses to target numa nodes 4-7 while my box only has 4 numa nodes:
If I request counters only for the non-existing target numa nodes (4-7,
with -e r1000010e0 -e r1000020e0 -e r1000040e0 -e r1000080e0), I always
get 4 zeros.
But if I mix some counters from the existing nodes (0-3) with some
counters from non-existing nodes (4-7), the non-existing ones report
some small but non-zero values.
Does it ring any bell?

Brice

2009-06-23 14:20:55

by Brice Goglin

Subject: Re: [perf] howto switch from pfmon

Ingo Molnar wrote:
> You can also do a profile with such events:
>
> perf record -f -e r1000ffe0 ./hackbench 10
>
> and look at it via 'perf report'.
>

I am not sure what the perf.data profile file contains but 'perf report'
only shows percentages. Is there a way to get a 'perf stat'-like output
from 'perf report'? Or maybe just have a -f option in 'perf stat' to
send the output into a file (with the PID in the name).

By the way, there's a typo in the description in
tools/perf/Documentation/perf-report.txt, you want s/via perf report/via
perf record/

> [ Note, there's no need to specify any --follow-* flags as that is
> implicit in 'perf'. (and you'll probably also notice that perf
> stat is a lot faster at following fast-forking or
> context-switching workloads than is pfmon, because it's not ptrace
> based.) ]
>

What about threads? I didn't find any way to get per-thread counters.

Ideally, I'd like to be able to see no perf-related output on
stdout/stderr at runtime, and later have a look at per-thread counters
like 'perf stat' does at runtime.

thanks,
Brice

2009-06-23 14:36:18

by Ingo Molnar

Subject: Re: [perf] howto switch from pfmon


* Brice Goglin <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Ingo Molnar <[email protected]> wrote:
> >
> >
> >> $ perf stat -e cycles -e instructions -e r1000ffe0 ./hackbench 10
> >> Time: 0.186
> >>
> >
> > Correction: that should be r10000ffe0.
>
> Oh thanks a lot, it seems to work now!

btw., it might make sense to expose NUMA imbalance via generic
enumeration. Right now we have:

PERF_COUNT_HW_CPU_CYCLES = 0,
PERF_COUNT_HW_INSTRUCTIONS = 1,
PERF_COUNT_HW_CACHE_REFERENCES = 2,
PERF_COUNT_HW_CACHE_MISSES = 3,
PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4,
PERF_COUNT_HW_BRANCH_MISSES = 5,
PERF_COUNT_HW_BUS_CYCLES = 6,

plus we have cache stats:

* Generalized hardware cache counters:
*
* { L1-D, L1-I, LLC, ITLB, DTLB, BPU } x
* { read, write, prefetch } x
* { accesses, misses }
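
To make the "type x op x result" product concrete, here is a small sketch that packs one combination into a single config value. The numeric IDs and the packing (type | op << 8 | result << 16) are an assumption taken from how the later perf_event_open interface documents its cache events; they may not match this tree exactly, and cache_config is a made-up helper name.

```shell
#!/bin/sh
# Sketch: encode a generalized cache event as one config value.
# cache_config is a hypothetical helper; the IDs and the packing
# (type | op << 8 | result << 16) are assumed, not taken from this thread.
cache_config() {
    case $1 in
        L1-D) id=0 ;; L1-I) id=1 ;; LLC) id=2 ;;
        DTLB) id=3 ;; ITLB) id=4 ;; BPU) id=5 ;;
        *) return 1 ;;
    esac
    case $2 in
        read) op=0 ;; write) op=1 ;; prefetch) op=2 ;;
        *) return 1 ;;
    esac
    case $3 in
        accesses) res=0 ;; misses) res=1 ;;
        *) return 1 ;;
    esac
    printf '0x%x\n' $(( $id | ($op << 8) | ($res << 16) ))
}

# Last-level cache read misses
cache_config LLC read misses
```

With 6 cache types, 3 operations and 2 results this gives 36 possible combinations, not all of which every CPU can count.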

NUMA is here to stay, and expressing local versus remote access
stats seems useful. We could add two generic counters:

PERF_COUNT_HW_RAM_LOCAL = 7,
PERF_COUNT_HW_RAM_REMOTE = 8,

And map them properly on all CPUs that support such stats. They'd be
accessible via '-e ram-local-refs' and '-e ram-remote-refs' type of
event symbols.

What is your typical usage pattern of this counter? What (general)
kind of app do you profile with it and how do you make use of the
specific node masks?

Would a local/all-remote distinction be enough, or do you need to
make a distinction between the individual nodes to get the best
insight into the workload?

> One strange thing I noticed: sometimes perf reports that there
> were some accesses to target numa nodes 4-7 while my box only has
> 4 numa nodes: If I request counters only for the non-existing
> target numa nodes (4-7, with -e r1000010e0 -e r1000020e0 -e
> r1000040e0 -e r1000080e0), I always get 4 zeros.
>
> But if I mix some counters from the existing nodes (0-3) with
> some counters from non-existing nodes (4-7), the non-existing ones
> report some small but non-zero values. Does it ring any bell?

I can see that too. I have a similar system (4 nodes), and if i use
the stats for nodes 4-7 (non-existent) i get:

phoenix:~> perf stat -e r1000010e0 -e r1000020e0 -e r1000040e0 -e r1000080e0 --repeat 10 ./hackbench 30
Time: 0.490
Time: 0.435
Time: 0.492
Time: 0.569
Time: 0.491
Time: 0.498
Time: 0.549
Time: 0.530
Time: 0.543
Time: 0.482

Performance counter stats for './hackbench 30' (10 runs):

0 raw 0x1000010e0 ( +- 0.000% )
0 raw 0x1000020e0 ( +- 0.000% )
0 raw 0x1000040e0 ( +- 0.000% )
0 raw 0x1000080e0 ( +- 0.000% )

0.610303953 seconds time elapsed.

( Note the --repeat option - that way you can repeat workloads and
observe their statistical properties. )

If i try the first 4 nodes i get:

phoenix:~> perf stat -e r1000001e0 -e r1000002e0 -e r1000004e0 -e r1000008e0 --repeat 10 ./hackbench 30
Time: 0.403
Time: 0.431
Time: 0.406
Time: 0.421
Time: 0.461
Time: 0.423
Time: 0.495
Time: 0.462
Time: 0.434
Time: 0.459

Performance counter stats for './hackbench 30' (10 runs):

52255370 raw 0x1000001e0 ( +- 5.510% )
46052950 raw 0x1000002e0 ( +- 8.067% )
45966395 raw 0x1000004e0 ( +- 10.341% )
63240044 raw 0x1000008e0 ( +- 11.707% )

0.530894007 seconds time elapsed.

Quite noisy across runs - which is expected on NUMA, as the memory
allocations are not really deterministic and some are more NUMA-friendly
than others. This box has all relevant NUMA options enabled:

CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_ACPI_NUMA=y

But if i 'mix' counters, i too get weird stats:

phoenix:~> perf stat -e r1000020e0 -e r1000040e0 -e r1000080e0 -e r10000ffe0 --repeat 10 ./hackbench 30
Time: 0.432
Time: 0.446
Time: 0.428
Time: 0.472
Time: 0.443
Time: 0.454
Time: 0.398
Time: 0.438
Time: 0.403
Time: 0.463

Performance counter stats for './hackbench 30' (10 runs):

2355436 raw 0x1000020e0 ( +- 8.989% )
0 raw 0x1000040e0 ( +- 0.000% )
0 raw 0x1000080e0 ( +- 0.000% )
204768941 raw 0x10000ffe0 ( +- 0.788% )

0.528447241 seconds time elapsed.

That 2355436 count for node 5 should have been zero.
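
Reading the node number off a raw code by eye is error-prone, so here is a small sketch that decodes one. decode_raw is an invented helper name; it assumes the layout used throughout this thread (unit mask in bits 15:8, one bit per target node, event select split between bits 7:0 and the bits from 32 upward).

```shell
#!/bin/sh
# Sketch: split a raw code such as r1000020e0 into event select,
# unit mask and the target nodes selected by the unit mask.
# decode_raw is a hypothetical helper, not part of perf.
decode_raw() {
    hex=${1#r}                      # accept the code with or without 'r'
    raw=$(( 0x$hex ))
    event=$(( ($raw & 0xff) | (($raw >> 24) & 0xf00) ))
    unit=$(( ($raw >> 8) & 0xff ))
    nodes=""
    node=0
    while [ "$node" -lt 8 ]; do
        if [ $(( ($unit >> $node) & 1 )) -eq 1 ]; then
            nodes="${nodes:+$nodes }$node"
        fi
        node=$(( $node + 1 ))
    done
    printf 'event=%x umask=%02x nodes=%s\n' "$event" "$unit" "$nodes"
}

decode_raw r1000020e0   # unit mask 0x20 selects node 5
```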

Ingo

2009-06-23 14:52:19

by Ingo Molnar

Subject: Re: [perf] howto switch from pfmon


* Brice Goglin <[email protected]> wrote:

> Ingo Molnar wrote:
> > You can also do a profile with such events:
> >
> > perf record -f -e r1000ffe0 ./hackbench 10
> >
> > and look at it via 'perf report'.
> >
>
> I am not sure what the perf.data profile file contains but 'perf
> report' only shows percentages. Is there a way to get a 'perf
> stat'-like output from 'perf report'? Or maybe just have a -f
> option in 'perf stat' to send the output into a file (with the PID
> in the name).

It's not yet possible but it's a very good feature request.

> By the way, there's a typo in the description in
> tools/perf/Documentation/perf-report.txt, you want s/via perf
> report/via perf record/

thanks, fixed and pushed out. You can generally find the latest
'perf' stuff at:

http://people.redhat.com/mingo/tip.git/README

> > [ Note, there's no need to specify any --follow-* flags as that is
> > implicit in 'perf'. (and you'll probably also notice that perf
> > stat is a lot faster at following fast-forking or
> > context-switching workloads than is pfmon, because it's not ptrace
> > based.) ]
>
> What about threads? I didn't find any way to get per-thread
> counters.
>
> Ideally, I'd like to be able to see no perf-related output on
> stdout/stderr at runtime, and later have a look at per-thread
> counters like 'perf stat' does at runtime.

That's not possible yet either, but makes a lot of sense.

How many threads does your workload typically run, and how do you
get their stats displayed?

Per thread info is currently available in the profile output:

perf report --sort comm,pid,symbol

But it would be nice to either extend perf report with a --stat
option:

perf report --stat

or to extend perf stat to take an input file via -i:

perf stat -i perf.data

Ingo

2009-06-23 15:21:42

by Brice Goglin

Subject: Re: [perf] howto switch from pfmon

Ingo Molnar wrote:
> btw., it might make sense to expose NUMA imbalance via generic
> enumeration. Right now we have:
>
> PERF_COUNT_HW_CPU_CYCLES = 0,
> PERF_COUNT_HW_INSTRUCTIONS = 1,
> PERF_COUNT_HW_CACHE_REFERENCES = 2,
> PERF_COUNT_HW_CACHE_MISSES = 3,
> PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4,
> PERF_COUNT_HW_BRANCH_MISSES = 5,
> PERF_COUNT_HW_BUS_CYCLES = 6,
>
> plus we have cache stats:
>
> * Generalized hardware cache counters:
> *
> * { L1-D, L1-I, LLC, ITLB, DTLB, BPU } x
> * { read, write, prefetch } x
> * { accesses, misses }
>

By the way, is there a way to know which cache was actually used when we
request cache references/misses? Always the largest/top one by default?

> NUMA is here to stay, and expressing local versus remote access
> stats seems useful. We could add two generic counters:
>
> PERF_COUNT_HW_RAM_LOCAL = 7,
> PERF_COUNT_HW_RAM_REMOTE = 8,
>
> And map them properly on all CPUs that support such stats. They'd be
> accessible via '-e ram-local-refs' and '-e ram-remote-refs' type of
> event symbols.
>
> What is your typical usage pattern of this counter? What (general)
> kind of app do you profile with it and how do you make use of the
> specific node masks?
>
> Would a local/all-remote distinction be enough, or do you need to
> make a distinction between the individual nodes to get the best
> insight into the workload?
>

People here work on OpenMP runtime systems where you try to keep threads
and data together. So in the end, what's important is to maximize the
overall local/remote access ratio. But during development, it may be useful
to have a distinction between individual nodes so as to understand
what's going on. That said, we still have raw numbers when we really
need that many details, and I don't know if it'd be easy for you to add
a generic counter with a sort of node-number attribute.


(including part of your other email here since it's relevant)

> How many threads does your workload typically run, and how do you
> get their stats displayed?
>

In the aforementioned OpenMP stuff, we use pfmon to get the local/remote
numa memory access ratio of each thread. In this specific case, we bind
one thread per core (even with an O(1) scheduler, people tend to avoid
launching hundreds of threads on current machines). pfmon gives us
something similar to the output of 'perf stat' in a file whose filename
contains process and thread IDs. We apply our own custom script to
convert these many pfmon output files into a single summary saying for
each thread, its thread ID, its core binding, its individual numa node
access numbers and percentages, and if they were local or remote (with
the Barcelona counters we were talking about, you need to check where
you were running before you know if accesses to node X are actually
local or remote accesses).
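
The per-thread summary step can be sketched roughly as follows. This is not our actual script, only an illustration: local_ratio is a made-up name, the thread's home node is passed in by hand (in practice it comes from the core binding), and the counts in the example are invented.

```shell
#!/bin/sh
# Sketch: given the node a thread ran on and its per-node access counts
# (node 0 first), report local and remote accesses and the local share.
# local_ratio is a hypothetical helper; the numbers below are made up.
local_ratio() {
    home=$1
    shift
    node=0
    local_cnt=0
    total=0
    for count in "$@"; do
        total=$(( $total + $count ))
        if [ "$node" -eq "$home" ]; then
            local_cnt=$count        # accesses to the thread's own node
        fi
        node=$(( $node + 1 ))
    done
    if [ "$total" -eq 0 ]; then
        echo "local=0 remote=0 local_pct=0"
        return 0
    fi
    echo "local=$local_cnt remote=$(( $total - $local_cnt )) local_pct=$(( $local_cnt * 100 / $total ))"
}

# Thread bound to a core on node 1, counts for nodes 0..3
local_ratio 1 100 700 150 50
```

The percentage is truncated to an integer, which is good enough for a quick per-thread overview.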

thanks,
Brice

2009-06-23 15:32:40

by Jaswinder Singh Rajput

Subject: Re: [perf] howto switch from pfmon

On Tue, 2009-06-23 at 16:51 +0200, Ingo Molnar wrote:
> Per thread info is currently available in the profile output:
>
> perf report --sort comm,pid,symbol
>
> But it would be nice to either extend perf report with a --stat
> option:
>
> perf report --stat
>
> or to extend perf stat to take an input file via -i:
>
> perf stat -i perf.data
>

I prefer 'perf report --stat' as it is already handling file.

--
JSR

2009-06-29 19:29:30

by Ingo Molnar

Subject: Re: [perf] howto switch from pfmon


* Brice Goglin <[email protected]> wrote:

> > How many threads does your workload typically run, and how do
> > you get their stats displayed?
>
> In the aforementioned OpenMP stuff, we use pfmon to get the
> local/remote numa memory access ratio of each thread. In this
> specific case, we bind one thread per core (even with an O(1)
> scheduler, people tend to avoid launching hundreds of threads on
> current machines). pfmon gives us something similar to the output
> of 'perf stat' in a file whose filename contains process and
> thread IDs. We apply our own custom script to convert these many
> pfmon output files into a single summary saying for each thread,
> its thread ID, its core binding, its individual numa node access
> numbers and percentages, and if they were local or remote (with
> the Barcelona counters we were talking about, you need to check
> where you were running before you know if accesses to node X are
> actually local or remote accesses).

Update: based on your feedback the latest perfcounters tree includes
the following new perf record features:

-s, --stat per thread counts
-n, --no-samples don't sample

--stat instructs the kernel to gather precise per task/thread stats
and emits those counts to the data file. Via --no-samples one can do
non-profiling runs - i.e. only statistics collection.

The 'perf stat' pretty printing side is not fully implemented yet -
right now you can only see these stats if you look for
PERF_EVENT_READ counts in the raw event log:

perf report -D | grep PERF_EVENT_READ

But the biggest piece, the kernel and perf record side is there
already. What kind of output would you prefer? Maybe you'd like to
take a stab at implementing the perf report side?

Ingo