2011-05-13 15:15:13

by Wim Heirman

[permalink] [raw]
Subject: perf-stat per thread results

Hi all,

I'm using perf-stat to look at hardware performance counters for a
parallel program. Is there a way to get counter values for each thread
individually, rather than aggregated for the whole process? I know I
can attach to a specific thread using --tid=, but due to the time
required to find the tid and attach/detach this isn't accurate for
short-running programs.
Or, alternatively, can I use perf record --stat and get an exact count
for each performance counter?

Thanks,
Wim Heirman
Ghent University, Belgium


2011-05-13 15:30:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: perf-stat per thread results


* Wim Heirman <[email protected]> wrote:

> Hi all,
>
> I'm using perf-stat to look at hardware performance counters for a
> parallel program. Is there a way to get counter values for each thread
> individually, rather than aggregated for the whole process? [...]

Not at the moment, but it would be a useful feature.

> [...] I know I can attach to a specific thread using --tid=, but due to the
> time required to find the tid and attach/detach this isn't accurate for
> short-running programs. Or, alternatively, can I use perf record --stat and
> get an exact count for each performance counter?

Yes perf record --stat should work. 'perf report -T --stdio' is supposed to
print this, but it has regressed i think.

Arnaldo, any ideas?

Thanks,

Ingo

2011-05-13 15:35:38

by David Ahern

[permalink] [raw]
Subject: Re: perf-stat per thread results



On 05/13/11 09:14, Wim Heirman wrote:
> Hi all,
>
> I'm using perf-stat to look at hardware performance counters for a
> parallel program. Is there a way to get counter values for each thread
> individually, rather than aggregated for the whole process? I know I
> can attach to a specific thread using --tid=, but due to the time
> required to find the tid and attach/detach this isn't accurate for
> short-running programs.

perf-stat requires changes to dump counters per thread; it currently
sums all threads into a single value.

> Or, alternatively, can I use perf record --stat and get an exact count
> for each performance counter?

perf-record does not read values from hardware counters.

David



>
> Thanks,
> Wim Heirman
> Ghent University, Belgium
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2011-05-13 15:47:38

by David Ahern

[permalink] [raw]
Subject: Re: perf-stat per thread results

On 05/13/11 09:30, Ingo Molnar wrote:

> Yes perf record --stat should work. 'perf report -T --stdio' is supposed to
> print this, but it has regressed i think.

There is a known regression in 2.6.39; perf-record and perf-top cannot
profile all threads in a process. They fail with:

Fatal: failed to mmap with 22 (Invalid argument)

Curiously, perf-stat does work - or at least does not fail with the mmap
error.

David


>
> Arnaldo, any ideas?
>
> Thanks,
>
> Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2011-05-13 15:44:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: perf-stat per thread results


* David Ahern <[email protected]> wrote:

> On 05/13/11 09:14, Wim Heirman wrote:
> > Hi all,
> >
> > I'm using perf-stat to look at hardware performance counters for a
> > parallel program. Is there a way to get counter values for each thread
> > individually, rather than aggregated for the whole process? I know I
> > can attach to a specific thread using --tid=, but due to the time
> > required to find the tid and attach/detach this isn't accurate for
> > short-running programs.
>
> perf-stat requires changes to dump counters per thread; it currently
> sums all threads into a single value.
>
> > Or, alternatively, can I use perf record --stat and get an exact count
> > for each performance counter?
>
> perf-record does not read values from hardware counters.

It's supposed to do that if --stat is specified, and it used to work - see this
commit:

8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters

and the output there:

# PID TID cache-misses cache-references
4658 4659 495581 3238779
4658 4662 498246 3236823
4658 4663 499531 3243162

which appears to be roughly what Wim is asking for, AFAICT.

But this seems to have regressed meanwhile.

Thanks,

Ingo

2011-05-13 20:11:53

by Juri Lelli

[permalink] [raw]
Subject: Re: perf-stat per thread results

Hi,

On 05/13/2011 05:14 PM, Wim Heirman wrote:
> Hi all,
>
> I'm using perf-stat to look at hardware performance counters for a
> parallel program. Is there a way to get counter values for each thread
> individually, rather than aggregated for the whole process? I know I
> can attach to a specific thread using --tid=, but due to the time
> required to find the tid and attach/detach this isn't accurate for
> short-running programs.
> Or, alternatively, can I use perf record --stat and get an exact count
> for each performance counter?
>

What about using PAPI library (http://icl.cs.utk.edu/papi/)?
It can be built on top of the perf subsystem and allows to get counters
values on a pre-thread basis.
I used it for some experiments and works very well.

Cheers,
Juri

--
Juri Lelli
Via G. Moruzzi 1, 56124 Pisa (PI), Italy
Scuola Superiore Sant'Anna
TeCIP, ReTiS Lab

Web Site: http://retis.sssup.it/~jlelli | Skype: jurile2712
-------------------------------------------------------------
Il male minore non esiste: ? sempre il preannuncio di un male peggiore.
(Sylos Labini)

2011-05-13 20:33:20

by Wim Heirman

[permalink] [raw]
Subject: Re: perf-stat per thread results

Hi,

> It's supposed to do that if --stat is specified, and it used to work - see this
> commit:
>
> ?8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters
>
> and the output there:
>
> ? ? # ?PID ? TID ?cache-misses ?cache-references
> ? ? ? 4658 ?4659 ? ? ? ?495581 ? ? ? ? ? 3238779
> ? ? ? 4658 ?4662 ? ? ? ?498246 ? ? ? ? ? 3236823
> ? ? ? 4658 ?4663 ? ? ? ?499531 ? ? ? ? ? 3243162
>
> which appears to be roughly what Wim is asking for, AFAICT.

Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
it works, although if I use --pid rather than the -- <command> variant
the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
first thread is missing in both use cases, and I get one column per
processor (which in itself is fine).

Regards,
Wim

2011-05-13 20:46:01

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: perf-stat per thread results

Em Fri, May 13, 2011 at 10:32:58PM +0200, Wim Heirman escreveu:
> > It's supposed to do that if --stat is specified, and it used to work - see this
> > commit:

> > ?8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters

> > and the output there:
> >
> > ? ? # ?PID ? TID ?cache-misses ?cache-references
> > ? ? ? 4658 ?4659 ? ? ? ?495581 ? ? ? ? ? 3238779
> > ? ? ? 4658 ?4662 ? ? ? ?498246 ? ? ? ? ? 3236823
> > ? ? ? 4658 ?4663 ? ? ? ?499531 ? ? ? ? ? 3243162

> > which appears to be roughly what Wim is asking for, AFAICT.

> Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
> it works, although if I use --pid rather than the -- <command> variant
> the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
> first thread is missing in both use cases, and I get one column per
> processor (which in itself is fine).

Can you try after applying the patches in this message:

http://marc.info/?l=linux-kernel&m=130385067430510&w=2

and report your results?

If it fixes the problems you're experiencing, please provide a:

Tested-by: Wim Heirman <[email protected]>

So that I can add when sending them to Ingo.

Thanks,

- Arnaldo

2011-05-13 23:02:36

by David Ahern

[permalink] [raw]
Subject: Re: perf-stat per thread results



On 05/13/11 14:32, Wim Heirman wrote:
> Hi,
>
>> It's supposed to do that if --stat is specified, and it used to work - see this
>> commit:
>>
>> 8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters
>>
>> and the output there:
>>
>> # PID TID cache-misses cache-references
>> 4658 4659 495581 3238779
>> 4658 4662 498246 3236823
>> 4658 4663 499531 3243162
>>
>> which appears to be roughly what Wim is asking for, AFAICT.
>
> Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
> it works, although if I use --pid rather than the -- <command> variant
> the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
> first thread is missing in both use cases, and I get one column per
> processor (which in itself is fine).
>
> Regards,
> Wim

Hmm.... my mileage varies using latest kernel
(446cc6345d3de6571bdd0840f48aca441488a28d)

$ /tmp/build-perf/perf record --stat -fo /tmp/perf.data -p $(pidof rsyslogd)
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.007 MB /tmp/perf.data (~308 samples) ]

$ /tmp/build-perf/perf report -T -i /tmp/perf.data
# Events: 6 cycles
#
# Overhead Command Shared Object Symbol
# ........ ........ ................. ..........................
#
97.61% rsyslogd libc-2.13.so [.] __libc_disable_asynccancel
2.39% rsyslogd [kernel.kallsyms] [k] native_write_msr_safe


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
# PID TID


ie., I do not get the counter values. Specifying the counter with -e
(e.g., -e branch-misses) does not help -- still no counter output.

David

2011-05-14 12:46:11

by Wim Heirman

[permalink] [raw]
Subject: Re: perf-stat per thread results

On 13 May 2011 22:45, Arnaldo Carvalho de Melo <[email protected]> wrote:
> Em Fri, May 13, 2011 at 10:32:58PM +0200, Wim Heirman escreveu:
>> > It's supposed to do that if --stat is specified, and it used to work - see this
>> > commit:
>
>> > ?8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters
>
>> > and the output there:
>> >
>> > ? ? # ?PID ? TID ?cache-misses ?cache-references
>> > ? ? ? 4658 ?4659 ? ? ? ?495581 ? ? ? ? ? 3238779
>> > ? ? ? 4658 ?4662 ? ? ? ?498246 ? ? ? ? ? 3236823
>> > ? ? ? 4658 ?4663 ? ? ? ?499531 ? ? ? ? ? 3243162
>
>> > which appears to be roughly what Wim is asking for, AFAICT.
>
>> Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
>> it works, although if I use --pid rather than the -- <command> variant
>> the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
>> first thread is missing in both use cases, and I get one column per
>> processor (which in itself is fine).
>
> Can you try after applying the patches in this message:
>
> http://marc.info/?l=linux-kernel&m=130385067430510&w=2
>
> and report your results?

Sorry, no improvement.

$ ./perf record -e cycles --stat -- ./fft -p4 -m24 && ./perf report
--thread | tail
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.198 MB perf.data (~52331 samples) ]
# PID TID cpu-clock
954 958 8067423322
954 957 6761317556
954 956 6006327147

$ ls /proc/$(pidof fft)/task
954 956 957 958

2011-05-14 12:49:50

by Wim Heirman

[permalink] [raw]
Subject: Re: perf-stat per thread results

On 14 May 2011 01:02, David Ahern <[email protected]> wrote:
>
>
> On 05/13/11 14:32, Wim Heirman wrote:
>> Hi,
>>
>>> It's supposed to do that if --stat is specified, and it used to work - see this
>>> commit:
>>>
>>> ?8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters
>>>
>>> and the output there:
>>>
>>> ? ? # ?PID ? TID ?cache-misses ?cache-references
>>> ? ? ? 4658 ?4659 ? ? ? ?495581 ? ? ? ? ? 3238779
>>> ? ? ? 4658 ?4662 ? ? ? ?498246 ? ? ? ? ? 3236823
>>> ? ? ? 4658 ?4663 ? ? ? ?499531 ? ? ? ? ? 3243162
>>>
>>> which appears to be roughly what Wim is asking for, AFAICT.
>>
>> Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
>> it works, although if I use --pid rather than the -- <command> variant
>> the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
>> first thread is missing in both use cases, and I get one column per
>> processor (which in itself is fine).
>>
>> Regards,
>> Wim
>
> Hmm.... my mileage varies using latest kernel
> (446cc6345d3de6571bdd0840f48aca441488a28d)
>
> $ /tmp/build-perf/perf record --stat -fo /tmp/perf.data -p $(pidof rsyslogd)
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.007 MB /tmp/perf.data (~308 samples) ]
>
> $ /tmp/build-perf/perf report -T -i /tmp/perf.data
> # Events: 6 ?cycles
> #
> # Overhead ? Command ? ? ?Shared Object ? ? ? ? ? ? ? ? ? ? ?Symbol
> # ........ ?........ ?................. ?..........................
> #
> ? ?97.61% ?rsyslogd ?libc-2.13.so ? ? ? [.] __libc_disable_asynccancel
> ? ? 2.39% ?rsyslogd ?[kernel.kallsyms] ?[k] native_write_msr_safe
>
>
> #
> # (For a higher level overview, try: perf report --sort comm,dso)
> #
> # PID ?TID
>
>
> ie., I do not get the counter values. Specifying the counter with -e
> (e.g., -e branch-misses) does not help -- still no counter output.

Is rsyslogd multithreaded? (Or at least, do the non-main threads
execute any work during your perf-record measurement) If not, then
what you see is consistent with what I'm getting, i.e. everything but
the main thread is reported.

Regards,
Wim

2011-05-14 17:48:29

by David Ahern

[permalink] [raw]
Subject: Re: perf-stat per thread results

On 05/14/11 06:49, Wim Heirman wrote:
>> Hmm.... my mileage varies using latest kernel
>> (446cc6345d3de6571bdd0840f48aca441488a28d)
>>
>> $ /tmp/build-perf/perf record --stat -fo /tmp/perf.data -p $(pidof rsyslogd)
>> ^C[ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.007 MB /tmp/perf.data (~308 samples) ]
>>
>> $ /tmp/build-perf/perf report -T -i /tmp/perf.data
>> # Events: 6 cycles
>> #
>> # Overhead Command Shared Object Symbol
>> # ........ ........ ................. ..........................
>> #
>> 97.61% rsyslogd libc-2.13.so [.] __libc_disable_asynccancel
>> 2.39% rsyslogd [kernel.kallsyms] [k] native_write_msr_safe
>>
>>
>> #
>> # (For a higher level overview, try: perf report --sort comm,dso)
>> #
>> # PID TID
>>
>>
>> ie., I do not get the counter values. Specifying the counter with -e
>> (e.g., -e branch-misses) does not help -- still no counter output.
>
> Is rsyslogd multithreaded? (Or at least, do the non-main threads
> execute any work during your perf-record measurement) If not, then
> what you see is consistent with what I'm getting, i.e. everything but
> the main thread is reported.

It is multithreaded, but my point is that I do not get counter output at
the end -- the PID/TID table is empty. I do not get counters for single
threaded processes nor for commands run by perf record -- e.g.,
/tmp/build-perf/perf record --stat -e instructions -fo /tmp/perf.data --
sleep 1

David


>
> Regards,
> Wim

2011-05-14 18:53:20

by Wim Heirman

[permalink] [raw]
Subject: Re: perf-stat per thread results

2011/5/14 David Ahern <[email protected]>:
> On 05/14/11 06:49, Wim Heirman wrote:
>>> Hmm.... my mileage varies using latest kernel
>>> (446cc6345d3de6571bdd0840f48aca441488a28d)
>>>
>>> $ /tmp/build-perf/perf record --stat -fo /tmp/perf.data -p $(pidof rsyslogd)
>>> ^C[ perf record: Woken up 1 times to write data ]
>>> [ perf record: Captured and wrote 0.007 MB /tmp/perf.data (~308 samples) ]
>>>
>>> $ /tmp/build-perf/perf report -T -i /tmp/perf.data
>>> # Events: 6 ?cycles
>>> #
>>> # Overhead ? Command ? ? ?Shared Object ? ? ? ? ? ? ? ? ? ? ?Symbol
>>> # ........ ?........ ?................. ?..........................
>>> #
>>> ? ?97.61% ?rsyslogd ?libc-2.13.so ? ? ? [.] __libc_disable_asynccancel
>>> ? ? 2.39% ?rsyslogd ?[kernel.kallsyms] ?[k] native_write_msr_safe
>>>
>>>
>>> #
>>> # (For a higher level overview, try: perf report --sort comm,dso)
>>> #
>>> # PID ?TID
>>>
>>>
>>> ie., I do not get the counter values. Specifying the counter with -e
>>> (e.g., -e branch-misses) does not help -- still no counter output.
>>
>> Is rsyslogd multithreaded? (Or at least, do the non-main threads
>> execute any work during your perf-record measurement) If not, then
>> what you see is consistent with what I'm getting, i.e. everything but
>> the main thread is reported.
>
> It is multithreaded, but my point is that I do not get counter output at
> the end -- the PID/TID table is empty. I do not get counters for single
> threaded processes nor for commands run by perf record -- e.g.,
> /tmp/build-perf/perf record --stat -e instructions -fo /tmp/perf.data --
> sleep 1

My guess was that none of the threads got scheduled while you did your
perf-record run (rsyslogd usually isn't exactly very CPU intensive).
And the main thread isn't ever reported, at least that's the bug I'm
seeing. Can you try with a compute-intensive, multi-threaded program?

Wim.

2011-05-14 18:57:07

by David Ahern

[permalink] [raw]
Subject: Re: perf-stat per thread results



On 05/14/11 12:52, Wim Heirman wrote:
> 2011/5/14 David Ahern <[email protected]>:
>> On 05/14/11 06:49, Wim Heirman wrote:
>>>> Hmm.... my mileage varies using latest kernel
>>>> (446cc6345d3de6571bdd0840f48aca441488a28d)
>>>>
>>>> $ /tmp/build-perf/perf record --stat -fo /tmp/perf.data -p $(pidof rsyslogd)
>>>> ^C[ perf record: Woken up 1 times to write data ]
>>>> [ perf record: Captured and wrote 0.007 MB /tmp/perf.data (~308 samples) ]
>>>>
>>>> $ /tmp/build-perf/perf report -T -i /tmp/perf.data
>>>> # Events: 6 cycles
>>>> #
>>>> # Overhead Command Shared Object Symbol
>>>> # ........ ........ ................. ..........................
>>>> #
>>>> 97.61% rsyslogd libc-2.13.so [.] __libc_disable_asynccancel
>>>> 2.39% rsyslogd [kernel.kallsyms] [k] native_write_msr_safe
>>>>
>>>>
>>>> #
>>>> # (For a higher level overview, try: perf report --sort comm,dso)
>>>> #
>>>> # PID TID
>>>>
>>>>
>>>> ie., I do not get the counter values. Specifying the counter with -e
>>>> (e.g., -e branch-misses) does not help -- still no counter output.
>>>
>>> Is rsyslogd multithreaded? (Or at least, do the non-main threads
>>> execute any work during your perf-record measurement) If not, then
>>> what you see is consistent with what I'm getting, i.e. everything but
>>> the main thread is reported.
>>
>> It is multithreaded, but my point is that I do not get counter output at
>> the end -- the PID/TID table is empty. I do not get counters for single
>> threaded processes nor for commands run by perf record -- e.g.,
>> /tmp/build-perf/perf record --stat -e instructions -fo /tmp/perf.data --
>> sleep 1
>
> My guess was that none of the threads got scheduled while you did your
> perf-record run (rsyslogd usually isn't exactly very CPU intensive).
> And the main thread isn't ever reported, at least that's the bug I'm
> seeing. Can you try with a compute-intensive, multi-threaded program?
>
> Wim.

The reports show data was collected. I have done a number of simple
examples -- all of which execute at least 1 instruction, but the
counters are not displayed (which they should be per the commit
changelog commit).

David

2011-05-14 20:24:51

by Wim Heirman

[permalink] [raw]
Subject: Re: perf-stat per thread results

On 14 May 2011 14:45, Wim Heirman <[email protected]> wrote:
> On 13 May 2011 22:45, Arnaldo Carvalho de Melo <[email protected]> wrote:
>> Em Fri, May 13, 2011 at 10:32:58PM +0200, Wim Heirman escreveu:
>>> > It's supposed to do that if --stat is specified, and it used to work - see this
>>> > commit:
>>
>>> > ?8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters
>>
>>> > and the output there:
>>> >
>>> > ? ? # ?PID ? TID ?cache-misses ?cache-references
>>> > ? ? ? 4658 ?4659 ? ? ? ?495581 ? ? ? ? ? 3238779
>>> > ? ? ? 4658 ?4662 ? ? ? ?498246 ? ? ? ? ? 3236823
>>> > ? ? ? 4658 ?4663 ? ? ? ?499531 ? ? ? ? ? 3243162
>>
>>> > which appears to be roughly what Wim is asking for, AFAICT.
>>
>>> Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
>>> it works, although if I use --pid rather than the -- <command> variant
>>> the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
>>> first thread is missing in both use cases, and I get one column per
>>> processor (which in itself is fine).
>>
>> Can you try after applying the patches in this message:
>>
>> http://marc.info/?l=linux-kernel&m=130385067430510&w=2
>>
>> and report your results?
>
> Sorry, no improvement.
>
> $ ./perf record -e cycles --stat -- ./fft -p4 -m24 && ./perf report
> --thread | tail
> [ perf record: Woken up 5 times to write data ]
> [ perf record: Captured and wrote 1.198 MB perf.data (~52331 samples) ]
> # PID ?TID ? cpu-clock
> ?954 ?958 ?8067423322
> ?954 ?957 ?6761317556
> ?954 ?956 ?6006327147
>
> $ ls /proc/$(pidof fft)/task
> 954 ?956 ?957 ?958

Looks like perf-report --thread is reading PERF_RECORD_READ events
from perf.data. But these are only emitted by the kernel for child
threads: in kernel/events/core.c, the only call to
perf_event_read_event() is in sync_child_event().
Should perf-record then be adapted to do something like perf-stat does
and use __perf_evsel__read to read the parent counter's final values
and add them to perf.data ? That way perf-report can subtract all
children from the final value and get the main thread's counter
values.

Regards,
Wim