Hi Arnaldo, Jiri,
A few weeks ago, you had asked if I had more requests for the perf tool.
I have put together the following list to improve the usability of the
perf tool, at least for our usage. Nothing is very big, just small
improvements here and there.
1/ perf stat interval printing
Today, the timestamp printed via perf stat -I is relative to the
start of the measurement. It would be beneficial to also support a
mode where it uses a time source that can be synchronized with other
traces or profiles, for instance gettimeofday() or
clock_gettime(CLOCK_MONOTONIC).
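For reference, the kind of trace we want to line up with typically
stamps its records with something along these lines (minimal C sketch,
nothing perf-specific, just to show the clock source we mean):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;

        /* the clock we would like perf stat -I to be able to print */
        clock_gettime(CLOCK_MONOTONIC, &ts);
        printf("%ld.%09ld some-event\n", (long)ts.tv_sec, (long)ts.tv_nsec);
        return 0;
    }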
2/ perf report event grouping
If you do:
$ perf record -e '{cycles,instructions,branches}' ....
$ perf report
It will show the 3 profiles together, which is VERY useful. However,
the output is confusing because it is hard to tell which % column
corresponds to which event. I know it is the cmdline order, but it
would be good to have a header on the columns naming the events,
instead of guessing. A few times I had to resort to perf report
--header-only to figure out the event order. I discovered the 'i' key
in the function profile, but it is still hard to find the events,
especially if you passed many of them.
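Something along these lines would be enough (mock-up, the column names
and numbers are made up):

    # Overhead_cycles  Overhead_instructions  Overhead_branches  Symbol
           42.10%                31.50%              12.30%      [.] triad
            7.80%                 9.20%               4.10%      [.] main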
3/ annotate output of loops
Percent│401f00: xor %eax,%eax
│401f02: test %edi,%edi
│401f04: ↓ jle 401f2b <triad+0x2b>
│401f06: nopw %cs:0x0(%rax,%rax,1)
34.20 │401f1┌─→ movsd (%rcx,%rax,8),%xmm1
14.60 │401f1│: mulsd %xmm0,%xmm1
33.24 │401f1│: addsd (%rdx,%rax,8),%xmm1
9.98 │401f1│: movsd %xmm1,(%rsi,%rax,8)
0.10 │401f2│: add $0x1,%rax
0.03 │401f2├── cmp %eax,%edi
7.84 │401f2└──↑ jg 401f10 <triad+0x10>
│401f2b: mov $0x18,%eax
│401f30: ← retq
The loop arrows cut through the code addresses. That is annoying!
4/ sorting and event groups
If I do:
$ perf record -e '{cycles,instructions}'
$ perf report
It will sort the samples based on the first event (the leader) of the
group. Yet here all events are sampling events, so you could just as
well sort by the second event. But I don't think perf report supports
a sort order on multiple events, even though both are from the same
category: syms (or ip).
Right now, I would have to collect another profile:
$ perf record -e '{instructions,cycles}'
$ perf report
5/ cgroups
Today, to measure multiple events in the same cgroup, you need to do:
$ perf stat -e cycles,branches,instructions -G foo,foo,foo .....
You need to specify the cgroup N times for N events. It would be
good to support a mode where you'd have to specify each cgroup only once:
$ perf stat -e cycles,branches,instructions --cgroup-all foo,bar
This would measure cycles, branches, and instructions for both cgroups
foo and bar.
6/ perf script ip vs. callchain
I already submitted this request separately. It is about
providing a way to print the callchain separately from the ip in
perf script. Right now they are lumped together, which is not always
what you want. Also, the callchain is currently printed as a
multi-line output, which is hard to post-process. perf script should
stick to one line per sample, at least when symbolization is off. We
have examples of that with brstack.
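For instance, one sample per line with the callchain as a single extra
field would be easy to post-process (purely illustrative format, the
addresses and the field name are made up):

    noploop  2317 [003] 1234.567890: 1000003 cycles:  401f10 callchain=401f10;401ea0;4003f2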
I may have more requests but I wanted to start with these for now.
Thanks for your efforts.
On Mon, Sep 03, 2018 at 07:45:48PM -0700, Stephane Eranian wrote:
> Hi Arnaldo, Jiri,
>
> A few weeks ago, you had asked if I had more requests for the perf tool.
I have one long standing one; that is IP based data structure
annotation.
When we get an exact IP (using PEBS) and were sampling a data related
event (say L1 misses), we can get the data type from the instruction
itself; that is, through DWARF. We _know_ what type (structure::member)
is read/written to.
I would love to get that in a pahole style output.
Better yet, when you measure both hits and misses, you can get a
structure usage overview, and see what lines are used lots and what
members inside that line are rarely used. Ideal information for data
structure layout optimization.
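Something like this, pahole-style (struct picked at random, offsets and
numbers entirely made up):

    struct task_struct {
            volatile long  state;   /*     0     8 */  /* 31.0% loads   2.1% stores */
            void          *stack;   /*     8     8 */  /*  0.2% loads   0.0% stores */
            ...
            /* --- cacheline 1 boundary (64 bytes) --- */
            unsigned int   flags;   /*    64     4 */  /* 12.4% loads   8.7% stores */
    };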
1000x more useful than that c2c crap.
Can we please get that?
Em Tue, Sep 04, 2018 at 09:10:49AM +0200, Peter Zijlstra escreveu:
> On Mon, Sep 03, 2018 at 07:45:48PM -0700, Stephane Eranian wrote:
> > A few weeks ago, you had asked if I had more requests for the perf tool.
> I have one long standing one; that is IP based data structure
> annotation.
> When we get an exact IP (using PEBS) and were sampling a data related
> event (say L1 misses), we can get the data type from the instruction
> itself; that is, through DWARF. We _know_ what type (structure::member)
> is read/written to.
> I would love to get that in a pahole style output.
> Better yet, when you measure both hits and misses, you can get a
> structure usage overview, and see what lines are used lots and what
> members inside that line are rarely used. Ideal information for data
> structure layout optimization.
> 1000x more useful than that c2c crap.
> Can we please get that?
So, use 'c2c record' to get the samples:
[root@jouet ~]# perf c2c record
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 5.152 MB perf.data (4555 samples) ]
Events collected:
[root@jouet ~]# perf evlist -v
cpu/mem-loads,ldlat=30/P: type: 4, size: 112, config: 0x1cd, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, mmap_data: 1, sample_id_all: 1, mmap2: 1, comm_exec: 1, { bp_addr, config1 }: 0x1f
cpu/mem-stores/P: type: 4, size: 112, config: 0x82d0, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR, read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3, sample_id_all: 1
Then we'll get an 'annotate --hits' option (just cooked up, will
polish) that will show the name of the function and info about it
globally, i.e. what annotate already produces; we may get this in CSV
for better post-processing consumption:
[root@jouet ~]# perf annotate --hits kmem_cache_alloc
Samples: 20 of event 'cpu/mem-loads,ldlat=30/P', 4000 Hz, Event count (approx.): 875, [percent: local period]
kmem_cache_alloc() /usr/lib/debug/lib/modules/4.17.17-100.fc27.x86_64/vmlinux
4.91 15: mov gfp_allowed_mask,%ebx
2.51 51: mov (%r15),%r8
17.14 54: mov %gs:0x8(%r8),%rdx
6.51 61: cmpq $0x0,0x10(%r8)
17.14 66: mov (%r8),%r14
6.29 78: mov 0x20(%r15),%ebx
5.71 7c: mov (%r15),%rdi
29.49 85: xor 0x138(%r15),%rbx
2.86 9d: lea (%rdi),%rsi
3.43 d7: pop %rbx
2.29 dc: pop %r12
1.71 ed: testb $0x4,0xb(%rbp)
[root@jouet ~]#
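The CSV form would be something like this, with the instruction field
quoted (exact fields and format not settled, just an idea):

    kmem_cache_alloc,85,29.49,"xor 0x138(%r15),%rbx"
    kmem_cache_alloc,54,17.14,"mov %gs:0x8(%r8),%rdx"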
Then I need to get the DW_AT_location stuff parsed in pahole, so
that with those offsets (second column, ending with ':') and hits
(first column; there it's the local period, but we can ask for some
specific metric [1]), I'll be able to figure out what DW_TAG_variable
or DW_TAG_formal_parameter is living there at that time, and get the
offset from the decoded instruction: for that xor, the 0x138 offset
from the type %r15 points to, at offset 85 into kmem_cache_alloc,
right?
In a first milestone we'd have something like:
perf annotate --hits function | pahole --annotate -C task_struct
perf annotate --hits | pahole --annotate
Would show all structs with hits, for all functions with hits.
Other options would show which struct has more hits, etc.
- Arnaldo
[1]
[root@jouet ~]# perf annotate -h local
Usage: perf annotate [<options>]
--percent-type <local-period>
Set percent type local/global-period/hits
[root@jouet ~]#
- Arnaldo
On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> So, use 'c2c record' to get the samples:
IIRC that uses numa events and is completely useless.
On Tue, Sep 04, 2018 at 03:53:25PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> > So, use 'c2c record' to get the samples:
>
> IIRC that uses numa events and is completely useless.
I guess perf record on any other event would work
in Arnaldo's workflow
jirka
Em Tue, Sep 04, 2018 at 03:58:35PM +0200, Jiri Olsa escreveu:
> On Tue, Sep 04, 2018 at 03:53:25PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> > > So, use 'c2c record' to get the samples:
> > IIRC that uses numa events and is completely useless.
> I guess perf record on any other event would work
> in Arnaldo's workflow
Right. I should've avoided useless events ;-)
- Arnaldo
On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> Then I need to get the DW_AT_location stuff parsed in pahole, so
> that with those offsets (second column, ending with :) with hits (first
> column, there its local period, but we can ask for some specific metric
> [1]), I'll be able to figure out what DW_TAG_variable or
> DW_TAG_formal_parameter is living there at that time, get the offset
> from the decoded instruction, say that xor, 0x138 offset from the type
> for %r15 at that offset (85) from kmem_cache_alloc, right?
I'm not sure how the DWARF location stuff works; it could be it already
includes the offset and decoding the instruction is not needed.
But yes, that's the basic idea; get DWARF to tell you what variable is
used at a certain IP.
> In a first milestone we'd have something like:
>
> perf annotate --hits function | pahole --annotate -C task_struct
>
> perf annotate --hits | pahole --annotate
Not sure keeping it two proglets makes sense, but whatever :-)
The alternative I suppose is making perf do the IP->struct::member
mapping and feed that to pahole, which then only uses it to annotate
the output.
Or, munge the entirety of pahole into perf..
Em Tue, Sep 04, 2018 at 04:17:24PM +0200, Peter Zijlstra escreveu:
> On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> > Then I need to get the DW_AT_location stuff parsed in pahole, so
> > that with those offsets (second column, ending with :) with hits (first
> > column, there its local period, but we can ask for some specific metric
> > [1]), I'll be able to figure out what DW_TAG_variable or
> > DW_TAG_formal_parameter is living there at that time, get the offset
> > from the decoded instruction, say that xor, 0x138 offset from the type
> > for %r15 at that offset (85) from kmem_cache_alloc, right?
>
> I'm not sure how the DWARF location stuff works; it could be it already
> includes the offset and decoding the instruction is not needed.
>
> But yes, that's the basic idea; get DWARF to tell you what variable is
> used at a certain IP.
>
> > In a first milestone we'd have something like:
> >
> > perf annotate --hits function | pahole --annotate -C task_struct
> >
> > perf annotate --hits | pahole --annotate
>
> Not sure keeping it two proglets makes sense, but whatever :-)
This is just a start, trying to take advantage of existing codebases.
> The alternative I suppose is making perf do the IP->struct::member
> mapping and feed that to pahole, which then only uses it to annotate
> the output.
So, what I'm trying to do now is to make perf get the samples associated
with functions/offsets + decoded instructions.
Pahole, which already touches the DWARF info, will just use
DW_AT_location. Look at its description, from
https://blog.tartanllama.xyz/writing-a-linux-debugger-variables/:
-------------------
Simple location descriptions describe the location of one contiguous
piece (usually all) of an object. A simple location description may
describe a location in addressable memory, or in a register, or the lack
of a location (with or without a known value).
Example:
DW_OP_fbreg -32
A variable which is entirely stored -32 bytes from the stack
frame base.
Composite location descriptions describe an object in terms of pieces,
each of which may be contained in part of a register or stored in a
memory location unrelated to other pieces.
Example:
DW_OP_reg3 DW_OP_piece 4 DW_OP_reg10 DW_OP_piece 2
A variable whose first four bytes reside in register 3 and
whose next two bytes reside in register 10.
Location lists describe objects which have a limited lifetime or change
location during their lifetime.
Example:
<loclist with 3 entries follows>
[ 0]<lowpc=0x2e00><highpc=0x2e19>DW_OP_reg0
[ 1]<lowpc=0x2e19><highpc=0x2e3f>DW_OP_reg3
[ 2]<lowpc=0x2ec4><highpc=0x2ec7>DW_OP_reg2
A variable whose location moves between registers depending
on the current value of the program counter
-------------------
So I have a list of DW_TAG_formal_parameter (function parameters) and
DW_TAG_variable DIEs, plus the above location lists/descriptions
stating in what registers and over what IP ranges the variables live.
In the DW_TAG_{formal_parameter,variable} I have DW_AT_type, which
points to the type of that variable. Couple that with the offset taken
from the decoded instruction we get from 'perf annotate --hits' and we
should have all we need, no?
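Something like this on the pahole side, I guess. Completely untested
sketch using libdw; the helper name is made up and error handling is
omitted. Given the scope DIE (the DW_TAG_subprogram for, say,
kmem_cache_alloc), the sample's pc and a DWARF register number (15 for
%r15 on x86_64), find the parameter/variable living in that register at
that pc and fetch its type DIE:

    #include <dwarf.h>
    #include <elfutils/libdw.h>
    #include <stdbool.h>

    static bool var_in_reg_at_pc(Dwarf_Die *scope, Dwarf_Addr pc, int dwreg,
                                 Dwarf_Die *var, Dwarf_Die *type)
    {
        Dwarf_Die die;

        /* real code would also recurse into DW_TAG_lexical_block children */
        if (dwarf_child(scope, &die) != 0)
            return false;

        do {
            Dwarf_Attribute loc, at_type;
            Dwarf_Op *expr;
            size_t exprlen;
            int tag = dwarf_tag(&die);

            if (tag != DW_TAG_formal_parameter && tag != DW_TAG_variable)
                continue;
            if (dwarf_attr(&die, DW_AT_location, &loc) == NULL)
                continue;
            /* pick the location expression covering this pc */
            if (dwarf_getlocation_addr(&loc, pc, &expr, &exprlen, 1) <= 0)
                continue;
            /* "lives in register dwreg" == a single DW_OP_regN/DW_OP_regx */
            if (exprlen != 1)
                continue;
            if ((expr[0].atom >= DW_OP_reg0 && expr[0].atom <= DW_OP_reg31 &&
                 (int)(expr[0].atom - DW_OP_reg0) == dwreg) ||
                (expr[0].atom == DW_OP_regx && (int)expr[0].number == dwreg)) {
                    *var = die;
                    if (dwarf_attr(&die, DW_AT_type, &at_type) != NULL)
                        dwarf_formref_die(&at_type, type);
                    return true;
            }
        } while (dwarf_siblingof(&die, &die) == 0);

        return false;
    }

With the type DIE in hand, the 0x138 from the decoded instruction should
be enough to land on the struct member.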
Then pahole can have all this painted on structs (like 'perf annotate')
for the whole workload, or for specific callchains, etc.
> Or, munge the entirety of pahole into perf..
That may be interesting at some point, yes.
- Arnaldo
Arnaldo,
On Tue, Sep 4, 2018 at 6:42 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> Em Tue, Sep 04, 2018 at 09:10:49AM +0200, Peter Zijlstra escreveu:
> > On Mon, Sep 03, 2018 at 07:45:48PM -0700, Stephane Eranian wrote:
> > > A few weeks ago, you had asked if I had more requests for the perf tool.
>
> > I have one long standing one; that is IP based data structure
> > annotation.
>
> > When we get an exact IP (using PEBS) and were sampling a data related
> > event (say L1 misses), we can get the data type from the instruction
> > itself; that is, through DWARF. We _know_ what type (structure::member)
> > is read/written to.
>
I have been asking the compiler people for this for a long time!
I don't think it is there. I'd like each load/store to be annotated
with a data type + offset within the type. It would allow data type
profiling. This would not be bulletproof, though, because of the
accessor function problem:
void incr(int *v) { (*v)++; }
struct foo { int a; int b; } bar;
incr(&bar.a);
Here the load/store in incr() would see an int pointer, not an int
inside struct foo at offset 0, which is what we want. There are
concerns about the volume of data this would generate. But my argument
is that this is just debug info; it does not make the stripped binary
any bigger.
> > I would love to get that in a pahole style output.
>
Yes, me too!
> > Better yet, when you measure both hits and misses, you can get a
> > structure usage overview, and see what lines are used lots and what
> > members inside that line are rarely used. Ideal information for data
> > structure layout optimization.
>
> > 1000x more useful than that c2c crap.
>
c2c is about something else: more about NUMA issues and false sharing.
> > Can we please get that?
>
> So, use 'c2c record' to get the samples:
>
> [root@jouet ~]# perf c2c record
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 5.152 MB perf.data (4555 samples) ]
>
> Events collected:
>
> [root@jouet ~]# perf evlist -v
> cpu/mem-loads,ldlat=30/P: type: 4, size: 112, config: 0x1cd, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, mmap_data: 1, sample_id_all: 1, mmap2: 1, comm_exec: 1, { bp_addr, config1 }: 0x1f
> cpu/mem-stores/P: type: 4, size: 112, config: 0x82d0, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR, read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3, sample_id_all: 1
>
> Then we'll get an 'annotate --hits' option (just cooked up, will
> polish) that will show the name of the function and info about it
> globally, i.e. what annotate already produces; we may get this in CSV
> for better post-processing consumption:
>
> [root@jouet ~]# perf annotate --hits kmem_cache_alloc
> Samples: 20 of event 'cpu/mem-loads,ldlat=30/P', 4000 Hz, Event count (approx.): 875, [percent: local period]
> kmem_cache_alloc() /usr/lib/debug/lib/modules/4.17.17-100.fc27.x86_64/vmlinux
> 4.91 15: mov gfp_allowed_mask,%ebx
> 2.51 51: mov (%r15),%r8
> 17.14 54: mov %gs:0x8(%r8),%rdx
> 6.51 61: cmpq $0x0,0x10(%r8)
> 17.14 66: mov (%r8),%r14
> 6.29 78: mov 0x20(%r15),%ebx
> 5.71 7c: mov (%r15),%rdi
> 29.49 85: xor 0x138(%r15),%rbx
> 2.86 9d: lea (%rdi),%rsi
> 3.43 d7: pop %rbx
> 2.29 dc: pop %r12
> 1.71 ed: testb $0x4,0xb(%rbp)
> [root@jouet ~]#
>
How does this relate to what Peter was asking? It says nothing about
data types. What I'd like is a true data type profiler showing you the
most accessed data types, and then an annotate mode showing you which
fields inside the types are mostly read or written, with their sizes
and alignment. The goal is to improve the layout based on the accesses,
to minimize the number of cachelines moved.
You need DLA (data linear address) sampling on all loads and stores and
then type annotation. As I said, I have prototyped this for
self-sampling programs, but not in the perf tool. It is harder there
because you need type information and heap information.
I think DWARF is one way to go, assuming it is extended to support the
right kind of load/store annotations. Another way is to track
allocations and correlate them to data types.
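Something like this is what I am after (hypothetical output, all types,
members, offsets and numbers made up):

    # Overhead  Data Type           Member        Offset  Size
    # ........  ..................  ............  ......  ....
       23.40%   struct sk_buff      len              112     4
       11.20%   struct sk_buff      data             200     8
        9.80%   struct task_struct  se.vruntime      168     8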
> Then I need to get the DW_AT_location stuff parsed in pahole, so
> that with those offsets (second column, ending with :) with hits (first
> column, there its local period, but we can ask for some specific metric
> [1]), I'll be able to figure out what DW_TAG_variable or
> DW_TAG_formal_parameter is living there at that time, get the offset
> from the decoded instruction, say that xor, 0x138 offset from the type
> for %r15 at that offset (85) from kmem_cache_alloc, right?
>
> In a first milestone we'd have something like:
>
> perf annotate --hits function | pahole --annotate -C task_struct
>
> perf annotate --hits | pahole --annotate
>
I don't want to combine tools. I'd like this to be built into perf.
> Would show all structs with hits, for all functions with hits.
>
> Other options would show which struct has more hits, etc.
>
> - Arnaldo
>
> [1]
>
> [root@jouet ~]# perf annotate -h local
>
> Usage: perf annotate [<options>]
>
> --percent-type <local-period>
> Set percent type local/global-period/hits
>
> [root@jouet ~]#
>
> - Arnaldo
On Tue, Sep 04, 2018 at 08:50:07AM -0700, Stephane Eranian wrote:
> > > When we get an exact IP (using PEBS) and were sampling a data related
> > > event (say L1 misses), we can get the data type from the instruction
> > > itself; that is, through DWARF. We _know_ what type (structure::member)
> > > is read/written to.
> >
> I have been asking the compiler people for this for a long time!
> I don't think it is there. I'd like each load/store to be annotated
> with a data type + offset within the type. It would allow data type
> profiling. This would not be bulletproof, though, because of the
> accessor function problem:
> void incr(int *v) { (*v)++; }
> struct foo { int a; int b; } bar;
> incr(&bar.a);
Cute, yes. Also, array accesses are tricky.
But I think even with those caveats it would be _very_ useful.
> There are concerns about the volume of data this would generate. But
> my argument is that this is just debug info; it does not make the
> stripped binary any bigger.
Right; the alternative is that we build an asm interpreter and follow
the data types throughout the function, because DWARF can tell us about
the types at a number of places, like function call arguments etc..
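In cartoon form, something like this (toy sketch, not hooked up to a
real disassembler or to DWARF; register numbers follow the x86_64 DWARF
numbering, e.g. 5 == %rdi):

    #include <stdio.h>

    #define NR_REGS 16

    /* what we currently know about the value held in each register */
    struct reg_state {
        const char *type;   /* e.g. "struct task_struct *", NULL if unknown */
    };

    /* toy "instructions": a reg-to-reg move or a load with a displacement */
    struct insn {
        int is_load;
        int dst, src;
        long disp;
    };

    int main(void)
    {
        struct reg_state regs[NR_REGS] = { { 0 } };
        struct insn code[] = {
            { .is_load = 0, .dst = 3, .src = 5 },                /* mov %rdi,%rbx        */
            { .is_load = 1, .dst = 0, .src = 3, .disp = 0x138 }, /* mov 0x138(%rbx),%rax */
        };
        unsigned int i;

        /* seeded from DWARF: 1st argument is a struct task_struct * in %rdi */
        regs[5].type = "struct task_struct *";

        for (i = 0; i < sizeof(code) / sizeof(code[0]); i++) {
            if (!code[i].is_load)
                regs[code[i].dst] = regs[code[i].src]; /* type follows the value */
            else if (regs[code[i].src].type)
                printf("load attributed to %s + 0x%lx\n",
                       regs[code[i].src].type, code[i].disp);
            else
                printf("load from untyped register %d\n", code[i].src);
        }
        return 0;
    }

Real code would obviously have to deal with a lot more than mov, spills
to the stack, etc.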
That is, of course, a terrible lot of work :/