LinuxLists.cc - [RFC PATCH] perf report: add sort by file lines

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

(2011/03/29 18:32), Lin Ming wrote:
> Hi, all
>
> This patch adds sort by file lines using dwarf debug info.
>
> In order to add perf tool support for my load latency patches,
> I asked a question about data variable symbols.
> http://marc.info/?l=linux-kernel&m=129736540309559&w=2
>
> Peter suggested to reverse map the reported IP (PEBS + fixup)
> to a data access using dwarf info.
> So I wrote this patch to see if the direction is right.

Good work! :)
Hmm, this seems to require my series of patches which introduce
debuginfo object, which wraps strongly libdw dependent Dwarf/Dwfl
objects and allows us to export debuginfo object even if no libdw
support.
I'll post the series soon!

>
> On Fri, Feb 11, 2011 at 3:17 AM, Peter Zijlstra <[email protected]> wrote:
>> Another way is to reverse map the reported IP (PEBS + fixup) to a data
>> access using the dwarf info. That would also work for dynamically
>> allocated data structures.
>>
>> (clearly you'd loose variable things like which entry in an array, but
>> you should still be able to identify the structure members)
>>
>
> $ ./perf report --stdio -k ~/vmlinux -s comm,dso,symbol,line
>
> # Overhead Command Shared Object Symbol Line
> # ........ ........... .................. ............... ..............................................
>
> 0.99% cc1 [kernel.kallsyms] [k] check_bytes /opt/linux-2.6/mm/slub.c:511
> 0.84% fixdep [kernel.kallsyms] [k] check_bytes /opt/linux-2.6/mm/slub.c:510
> 0.79% fixdep [kernel.kallsyms] [k] check_bytes /opt/linux-2.6/mm/slub.c:511
> 0.74% cc1 [kernel.kallsyms] [k] check_bytes /opt/linux-2.6/mm/slub.c:513
> 0.71% fixdep [kernel.kallsyms] [k] check_bytes /opt/linux-2.6/mm/slub.c:513
> 0.71% cc1 [kernel.kallsyms] [k] page_fault /opt/linux-2.6/arch/x86/kernel/entry_64.S:1336
> 0.69% cc1 cc1 [.] 0x5ec3a3 0x5ec3a3
> 0.67% cc1 [kernel.kallsyms] [k] clear_page_c /opt/linux-2.6/arch/x86/lib/clear_page_64.S:12
>
>
> Signed-off-by: Lin Ming <[email protected]>
> ---
> tools/perf/util/hist.c | 71 ++++++++++++++++++++++++++++++++++++++-------
> tools/perf/util/hist.h | 4 ++
> tools/perf/util/sort.c | 44 ++++++++++++++++++++++++++--
> tools/perf/util/sort.h | 3 ++
> tools/perf/util/symbol.c | 4 ++
> tools/perf/util/symbol.h | 2 +
> 6 files changed, 114 insertions(+), 14 deletions(-)
>
> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
> index 627a02e..c1e95e4 100644
> --- a/tools/perf/util/hist.c
> +++ b/tools/perf/util/hist.c
> @@ -17,6 +17,38 @@ struct callchain_param callchain_param = {
> .min_percent = 0.5
> };
>
> +static Dwarf_Line *hists__dwarf_line(struct map *map, u64 rip)
> +{
> + Dwarf_Die cudie;
> + Dwarf_Line *line = NULL;
> + u64 ip;
> +
> + if (!map || !map->dso || !map->dso->dwarf)
> + return NULL;
> +
> + ip = map->unmap_ip(map, rip);
> + if (dwarf_addrdie(map->dso->dwarf, (Dwarf_Addr)ip, &cudie))
> + line = dwarf_getsrc_die(&cudie, (Dwarf_Addr)ip);
> +
> + return line;
> +}
> +
> +int hists__line(Dwarf_Line *line, char *buf, int len)
> +{
> + int ret;
> + const char *file;
> + int lineno;
> +
> + if (!line || !buf)
> + return 0;
> +
> + file = dwarf_linesrc(line, NULL, NULL);
> + dwarf_lineno(line, &lineno);
> + ret = snprintf(buf, len, "%s:%d", file, lineno);
> +
> + return ret;
> +}
> +
> u16 hists__col_len(struct hists *self, enum hist_column col)
> {
> return self->col_len[col];
> @@ -44,21 +76,25 @@ static void hists__reset_col_len(struct hists *self)
> hists__set_col_len(self, col, 0);
> }
>
> +static void hists__set_unresolved_col_len(struct hists *self, enum hist_column col)
> +{
> + const unsigned int unresolved_col_width = BITS_PER_LONG / 4 + 2;
> +
> + if (hists__col_len(self, col) < unresolved_col_width &&
> + !symbol_conf.col_width_list_str && !symbol_conf.field_sep &&
> + !symbol_conf.dso_list)
> + hists__set_col_len(self, col,
> + unresolved_col_width);
> +}
> +
> static void hists__calc_col_len(struct hists *self, struct hist_entry *h)
> {
> u16 len;
>
> if (h->ms.sym)
> - hists__new_col_len(self, HISTC_SYMBOL, h->ms.sym->namelen);
> - else {
> - const unsigned int unresolved_col_width = BITS_PER_LONG / 4;
> -
> - if (hists__col_len(self, HISTC_DSO) < unresolved_col_width &&
> - !symbol_conf.col_width_list_str && !symbol_conf.field_sep &&
> - !symbol_conf.dso_list)
> - hists__set_col_len(self, HISTC_DSO,
> - unresolved_col_width);
> - }
> + hists__new_col_len(self, HISTC_SYMBOL, h->ms.sym->namelen + 4);
> + else
> + hists__set_unresolved_col_len(self, HISTC_DSO);
>
> len = thread__comm_len(h->thread);
> if (hists__new_col_len(self, HISTC_COMM, len))
> @@ -68,6 +104,15 @@ static void hists__calc_col_len(struct hists *self, struct hist_entry *h)
> len = dso__name_len(h->ms.map->dso);
> hists__new_col_len(self, HISTC_DSO, len);
> }
> +
> + if (!h->line)
> + hists__set_unresolved_col_len(self, HISTC_LINE);
> + else {
> + char tmp[BUFSIZ];
> +
> + len = hists__line(h->line, tmp, BUFSIZ);
> + hists__new_col_len(self, HISTC_LINE, len);
> + }
> }
>
> static void hist_entry__add_cpumode_period(struct hist_entry *self,
> @@ -103,8 +148,11 @@ static struct hist_entry *hist_entry__new(struct hist_entry *template)
> if (self != NULL) {
> *self = *template;
> self->nr_events = 1;
> - if (self->ms.map)
> + if (self->ms.map) {
> self->ms.map->referenced = true;
> +
> + self->line = hists__dwarf_line(self->ms.map, self->ip);
> + }
> if (symbol_conf.use_callchain)
> callchain_init(self->callchain);
> }
> @@ -142,6 +190,7 @@ struct hist_entry *__hists__add_entry(struct hists *self,
> },
> .cpu = al->cpu,
> .ip = al->addr,
> + .line = hists__dwarf_line(al->map, al->addr),
> .level = al->level,
> .period = period,
> .parent = sym_parent,
> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> index 3beb97c..07e0f04 100644
> --- a/tools/perf/util/hist.h
> +++ b/tools/perf/util/hist.h
> @@ -3,6 +3,7 @@
>
> #include <linux/types.h>
> #include "callchain.h"
> +#include <elfutils/libdw.h>
>
> extern struct callchain_param callchain_param;
>
> @@ -39,6 +40,7 @@ enum hist_column {
> HISTC_COMM,
> HISTC_PARENT,
> HISTC_CPU,
> + HISTC_LINE,
> HISTC_NR_COLS, /* Last entry */
> };
>
> @@ -85,6 +87,8 @@ u16 hists__col_len(struct hists *self, enum hist_column col);
> void hists__set_col_len(struct hists *self, enum hist_column col, u16 len);
> bool hists__new_col_len(struct hists *self, enum hist_column col, u16 len);
>
> +int hists__line(Dwarf_Line *line, char *buf, int len);
> +
> struct perf_evlist;
>
> #ifdef NO_NEWT_SUPPORT
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index f44fa54..cfbdb6c 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -27,6 +27,9 @@ static int hist_entry__parent_snprintf(struct hist_entry *self, char *bf,
> size_t size, unsigned int width);
> static int hist_entry__cpu_snprintf(struct hist_entry *self, char *bf,
> size_t size, unsigned int width);
> +static int hist_entry__line_snprintf(struct hist_entry *self, char *bf,
> + size_t size, unsigned int width);
> +
>
> struct sort_entry sort_thread = {
> .se_header = "Command: Pid",
> @@ -71,6 +74,13 @@ struct sort_entry sort_cpu = {
> .se_width_idx = HISTC_CPU,
> };
>
> +struct sort_entry sort_line = {
> + .se_header = "Line",
> + .se_cmp = sort__line_cmp,
> + .se_snprintf = hist_entry__line_snprintf,
> + .se_width_idx = HISTC_LINE,
> +};
> +
> struct sort_dimension {
> const char *name;
> struct sort_entry *entry;
> @@ -84,6 +94,7 @@ static struct sort_dimension sort_dimensions[] = {
> { .name = "symbol", .entry = &sort_sym, },
> { .name = "parent", .entry = &sort_parent, },
> { .name = "cpu", .entry = &sort_cpu, },
> + { .name = "line", .entry = &sort_line, },
> };
>
> int64_t cmp_null(void *l, void *r)
> @@ -190,7 +201,7 @@ sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
> }
>
> static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
> - size_t size, unsigned int width __used)
> + size_t size, unsigned int width)
> {
> size_t ret = 0;
>
> @@ -202,11 +213,11 @@ static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
>
> ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", self->level);
> if (self->ms.sym)
> - ret += repsep_snprintf(bf + ret, size - ret, "%s",
> + ret += repsep_snprintf(bf + ret, size - ret, "%-*s", width - 4,
> self->ms.sym->name);
> else
> ret += repsep_snprintf(bf + ret, size - ret, "%-#*llx",
> - BITS_PER_LONG / 4, self->ip);
> + width - 4, self->ip);
>
> return ret;
> }
> @@ -266,6 +277,31 @@ static int hist_entry__cpu_snprintf(struct hist_entry *self, char *bf,
> return repsep_snprintf(bf, size, "%-*d", width, self->cpu);
> }
>
> +
> +/* --sort line */
> +
> +int64_t
> +sort__line_cmp(struct hist_entry *left, struct hist_entry *right)
> +{
> + if (!left->line || !right->line)
> + return (int64_t)(left->ip - right->ip);
> +
> + return (unsigned long)left->line - (unsigned long)right->line;
> +}
> +
> +static int hist_entry__line_snprintf(struct hist_entry *self, char *bf,
> + size_t size, unsigned int width)
> +{
> + char buf[BUFSIZ];
> +
> + if (!self->line)
> + return repsep_snprintf(bf, size, "%-#*llx", width, self->ip);
> +
> + hists__line(self->line, buf, BUFSIZ);
> +
> + return repsep_snprintf(bf, size, "%-*s", width, buf);
> +}
> +
> int sort_dimension__add(const char *tok)
> {
> unsigned int i;
> @@ -307,6 +343,8 @@ int sort_dimension__add(const char *tok)
> sort__first_dimension = SORT_PARENT;
> else if (!strcmp(sd->name, "cpu"))
> sort__first_dimension = SORT_CPU;
> + else if (!strcmp(sd->name, "line"))
> + sort__first_dimension = SORT_LINE;
> }
>
> list_add_tail(&sd->entry->list, &hist_entry__sort_list);
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index 0b91053..d2a4424 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -53,6 +53,7 @@ struct hist_entry {
> u64 period_guest_us;
> struct map_symbol ms;
> struct thread *thread;
> + Dwarf_Line *line;
> u64 ip;
> s32 cpu;
> u32 nr_events;
> @@ -80,6 +81,7 @@ enum sort_type {
> SORT_SYM,
> SORT_PARENT,
> SORT_CPU,
> + SORT_LINE,
> };
>
> /*
> @@ -116,6 +118,7 @@ extern int64_t sort__dso_cmp(struct hist_entry *, struct hist_entry *);
> extern int64_t sort__sym_cmp(struct hist_entry *, struct hist_entry *);
> extern int64_t sort__parent_cmp(struct hist_entry *, struct hist_entry *);
> int64_t sort__cpu_cmp(struct hist_entry *left, struct hist_entry *right);
> +int64_t sort__line_cmp(struct hist_entry *left, struct hist_entry *right);
> extern size_t sort__parent_print(FILE *, struct hist_entry *, unsigned int);
> extern int sort_dimension__add(const char *);
> void sort_entry__setup_elide(struct sort_entry *self, struct strlist *list,
> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index 17df793..ddaf396 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -240,6 +240,8 @@ void dso__delete(struct dso *self)
> free((char *)self->short_name);
> if (self->lname_alloc)
> free(self->long_name);
> + if (self->dwarf)
> + dwarf_end(self->dwarf);
> free(self);
> }
>
> @@ -1052,6 +1054,8 @@ static int dso__load_sym(struct dso *self, struct map *map, const char *name,
> int nr = 0;
> size_t opdidx = 0;
>
> + self->dwarf = dwarf_begin(fd, DWARF_C_READ);
> +
> elf = elf_begin(fd, PERF_ELF_C_READ_MMAP, NULL);
> if (elf == NULL) {
> pr_debug("%s: cannot read %s ELF file.\n", __func__, name);
> diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> index 713b0b4..e07b907 100644
> --- a/tools/perf/util/symbol.h
> +++ b/tools/perf/util/symbol.h
> @@ -8,6 +8,7 @@
> #include <linux/list.h>
> #include <linux/rbtree.h>
> #include <stdio.h>
> +#include <elfutils/libdw.h>
>
> #ifdef HAVE_CPLUS_DEMANGLE
> extern char *cplus_demangle(const char *, int);
> @@ -135,6 +136,7 @@ struct dso {
> struct list_head node;
> struct rb_root symbols[MAP__NR_TYPES];
> struct rb_root symbol_names[MAP__NR_TYPES];
> + Dwarf *dwarf;
> enum dso_kernel_type kernel;
> u8 adjust_symbols:1;
> u8 has_build_id:1;
>
>

--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2011-03-29 09:54:45

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Tue, 2011-03-29 at 17:32 +0800, Lin Ming wrote:
>
> Peter suggested to reverse map the reported IP (PEBS + fixup)
> to a data access using dwarf info.
> So I wrote this patch to see if the direction is right.

I'm not sure this is quite the same thing, I'm not arguing this might
not be useful, but this is not about data access.

Suppose you have a line like:

foo->bar->fubar = tmp->blah;

There's 3 indirections there, a line number doesn't even get you close
to knowing what data access triggered the event.

struct bar {
int poekoe[5];
int fubar;
};

struct foo {
long poekoe[3];
struct bar *bar;
};

struct tmp {
long poekoe[4];
int blah;
};

void foo(struct foo *foo, struct tmp *tmp)
{
foo->bar->fubar = tmp->blah;
}

Which gives (somewhat simplified):

foo:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq -8(%rbp), %rax /* load foo arg from stack */
movq 24(%rax), %rax /* load foo->bar */
movq -16(%rbp), %rdx /* load tmp arg from stack */
movl 32(%rdx), %edx /* load tmp->blah */
movl %edx, 20(%rax) /* store bar->fubar */
leave
ret
.cfi_endproc

where I annotated the various moves with C comments.

Now depending on what exact IP you get using PEBS+fixup you could using
DWARF bits generate similar deductions from the code as I did in those
comments and thus know exactly what data member was accessed and how
(read/write).

With that data you could then borrow some pahole code and annotate the
various data structures to illustrate read/write distributions, which
can then be used as input for data-reorder.

2011-03-29 16:45:57

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Tue, 2011-03-29 at 17:54 +0800, Peter Zijlstra wrote:
> On Tue, 2011-03-29 at 17:32 +0800, Lin Ming wrote:
> >
> > Peter suggested to reverse map the reported IP (PEBS + fixup)
> > to a data access using dwarf info.
> > So I wrote this patch to see if the direction is right.
>
> I'm not sure this is quite the same thing, I'm not arguing this might
> not be useful, but this is not about data access.
>
> Suppose you have a line like:
>
> foo->bar->fubar = tmp->blah;
>
> There's 3 indirections there, a line number doesn't even get you close
> to knowing what data access triggered the event.
>
> struct bar {
> int poekoe[5];
> int fubar;
> };
>
> struct foo {
> long poekoe[3];
> struct bar *bar;
> };
>
> struct tmp {
> long poekoe[4];
> int blah;
> };
>
> void foo(struct foo *foo, struct tmp *tmp)
> {
> foo->bar->fubar = tmp->blah;
> }
>
> Which gives (somewhat simplified):
>
> foo:
> .cfi_startproc
> pushq %rbp
> .cfi_def_cfa_offset 16
> movq %rsp, %rbp
> .cfi_offset 6, -16
> .cfi_def_cfa_register 6
> movq %rdi, -8(%rbp)
> movq %rsi, -16(%rbp)
> movq -8(%rbp), %rax /* load foo arg from stack */
> movq 24(%rax), %rax /* load foo->bar */
> movq -16(%rbp), %rdx /* load tmp arg from stack */
> movl 32(%rdx), %edx /* load tmp->blah */
> movl %edx, 20(%rax) /* store bar->fubar */
> leave
> ret
> .cfi_endproc

I need to have a close look at how dwarf cfi thing works.

>
> where I annotated the various moves with C comments.
>
> Now depending on what exact IP you get using PEBS+fixup you could using
> DWARF bits generate similar deductions from the code as I did in those
> comments and thus know exactly what data member was accessed and how
> (read/write).

Is it an unwind of the call frame stack to find out what data member was
accessed?
How to know the access type(read or write)?

>
> With that data you could then borrow some pahole code and annotate the
> various data structures to illustrate read/write distributions, which
> can then be used as input for data-reorder.

Could you explain a bit more about this?

Thanks,
Lin Ming

2011-03-29 17:00:57

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Wed, 2011-03-30 at 00:45 +0800, Lin Ming wrote:
> On Tue, 2011-03-29 at 17:54 +0800, Peter Zijlstra wrote:
> > On Tue, 2011-03-29 at 17:32 +0800, Lin Ming wrote:
> > >
> > > Peter suggested to reverse map the reported IP (PEBS + fixup)
> > > to a data access using dwarf info.
> > > So I wrote this patch to see if the direction is right.
> >
> > I'm not sure this is quite the same thing, I'm not arguing this might
> > not be useful, but this is not about data access.
> >
> > Suppose you have a line like:
> >
> > foo->bar->fubar = tmp->blah;
> >
> > There's 3 indirections there, a line number doesn't even get you close
> > to knowing what data access triggered the event.
> >
> > struct bar {
> > int poekoe[5];
> > int fubar;
> > };
> >
> > struct foo {
> > long poekoe[3];
> > struct bar *bar;
> > };
> >
> > struct tmp {
> > long poekoe[4];
> > int blah;
> > };
> >
> > void foo(struct foo *foo, struct tmp *tmp)
> > {
> > foo->bar->fubar = tmp->blah;
> > }
> >
> > Which gives (somewhat simplified):
> >
> > foo:
> > .cfi_startproc
> > pushq %rbp
> > .cfi_def_cfa_offset 16
> > movq %rsp, %rbp
> > .cfi_offset 6, -16
> > .cfi_def_cfa_register 6
> > movq %rdi, -8(%rbp)
> > movq %rsi, -16(%rbp)
> > movq -8(%rbp), %rax /* load foo arg from stack */
> > movq 24(%rax), %rax /* load foo->bar */
> > movq -16(%rbp), %rdx /* load tmp arg from stack */
> > movl 32(%rdx), %edx /* load tmp->blah */
> > movl %edx, 20(%rax) /* store bar->fubar */
> > leave
> > ret
> > .cfi_endproc
>
> I need to have a close look at how dwarf cfi thing works.

You'll need more than CFI.

> >
> > where I annotated the various moves with C comments.
> >
> > Now depending on what exact IP you get using PEBS+fixup you could using
> > DWARF bits generate similar deductions from the code as I did in those
> > comments and thus know exactly what data member was accessed and how
> > (read/write).
>
> Is it an unwind of the call frame stack to find out what data member was
> accessed?

No need to unwind stacks, DWARF should have information on function
local stack. It should be able to tell you the type of things like -8(%
rbp).

> How to know the access type(read or write)?

instruction decode, see if the memory operand is a source or target.

> >
> > With that data you could then borrow some pahole code and annotate the
> > various data structures to illustrate read/write distributions, which
> > can then be used as input for data-reorder.
>
> Could you explain a bit more about this?

$ pahole tmp.o
struct tmp {
long int poekoe[4]; /* 0 32 */
int blah; /* 32 4 */

/* size: 40, cachelines: 1 */
/* padding: 4 */
/* last cacheline: 40 bytes */
}; /* definitions: 1 */

struct foo {
long int poekoe[3]; /* 0 24 */
struct bar * bar; /* 24 8 */

/* size: 32, cachelines: 1 */
/* last cacheline: 32 bytes */
}; /* definitions: 1 */

struct bar {
int poekoe[5]; /* 0 20 */
int fubar; /* 20 4 */

/* size: 24, cachelines: 1 */
/* last cacheline: 24 bytes */
}; /* definitions: 1 */

You could provide similar structure printout from the report function,
and add read/write statistics.

2011-03-29 17:04:34

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Tue, 2011-03-29 at 19:03 +0200, Peter Zijlstra wrote:
> > Is it an unwind of the call frame stack to find out what data member was
> > accessed?
>
> No need to unwind stacks, DWARF should have information on function
> local stack. It should be able to tell you the type of things like -8(%
> rbp).

That is, we don't even have a life stack to unwind, so we simply cannot
unwind at all, but the debug information should be able to tell you the
type of whatever would live at a certain stack position if we were to
have a stack.

2011-03-29 17:06:43

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Tue, 2011-03-29 at 19:06 +0200, Peter Zijlstra wrote:
> On Tue, 2011-03-29 at 19:03 +0200, Peter Zijlstra wrote:
> > > Is it an unwind of the call frame stack to find out what data member was
> > > accessed?
> >
> > No need to unwind stacks, DWARF should have information on function
> > local stack. It should be able to tell you the type of things like -8(%
> > rbp).
>
> That is, we don't even have a life stack to unwind, so we simply cannot
> unwind at all, but the debug information should be able to tell you the
> type of whatever would live at a certain stack position if we were to
> have a stack.

Furthermore, I've now completely exhausted my knowledge of debuginfo and
hope Masami and Arnaldo will help with further details ;-)

2011-03-29 17:46:09

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

Em Tue, Mar 29, 2011 at 07:08:53PM +0200, Peter Zijlstra escreveu:
> On Tue, 2011-03-29 at 19:06 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-03-29 at 19:03 +0200, Peter Zijlstra wrote:
> > > > Is it an unwind of the call frame stack to find out what data member was
> > > > accessed?

> > > No need to unwind stacks, DWARF should have information on function
> > > local stack. It should be able to tell you the type of things like -8(%
> > > rbp).

> > That is, we don't even have a life stack to unwind, so we simply cannot
> > unwind at all, but the debug information should be able to tell you the
> > type of whatever would live at a certain stack position if we were to
> > have a stack.

> Furthermore, I've now completely exhausted my knowledge of debuginfo and
> hope Masami and Arnaldo will help with further details ;-)

:-)

I think the place to look is 'perf probe', look for the way one
variable is translated to an expression passed to the kprobe_tracer, I
think you'll need the reverse operation.

Look at tools/perf/util/probe-finder.c, function find_variable() and
convert_variable_location().

Understanding how those functions use the dwarf_ prefixed routines from
libdw should be a good start.

- Arnaldo

2011-03-30 01:04:16

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

(2011/03/30 2:45), Arnaldo Carvalho de Melo wrote:
> Em Tue, Mar 29, 2011 at 07:08:53PM +0200, Peter Zijlstra escreveu:
>> On Tue, 2011-03-29 at 19:06 +0200, Peter Zijlstra wrote:
>>> On Tue, 2011-03-29 at 19:03 +0200, Peter Zijlstra wrote:
>>>>> Is it an unwind of the call frame stack to find out what data member was
>>>>> accessed?
>
>>>> No need to unwind stacks, DWARF should have information on function
>>>> local stack. It should be able to tell you the type of things like -8(%
>>>> rbp).
>
>>> That is, we don't even have a life stack to unwind, so we simply cannot
>>> unwind at all, but the debug information should be able to tell you the
>>> type of whatever would live at a certain stack position if we were to
>>> have a stack.
>
>> Furthermore, I've now completely exhausted my knowledge of debuginfo and
>> hope Masami and Arnaldo will help with further details ;-)
>
> :-)
>
> I think the place to look is 'perf probe', look for the way one
> variable is translated to an expression passed to the kprobe_tracer, I
> think you'll need the reverse operation.
>
> Look at tools/perf/util/probe-finder.c, function find_variable() and
> convert_variable_location().

Right, perf probe is available for looking up a variable at
an address, however, since it just depends on the dwarf,
we can just know the mapping from (actual) variables to
locations (registers/stacks etc.)

It seems that Peter requires more than that, his request is
getting a map from (temporarily used) register to a specific
member of a data structure pointed by a pointer. But dwarf is
not enough for that.

Peter gave us a good example of that.

> void foo(struct foo *foo, struct tmp *tmp)
> {
> foo->bar->fubar = tmp->blah;
> }
>
> foo:
> .cfi_startproc
> pushq %rbp
> .cfi_def_cfa_offset 16
> movq %rsp, %rbp
> .cfi_offset 6, -16
> .cfi_def_cfa_register 6
> movq %rdi, -8(%rbp)
> movq %rsi, -16(%rbp)
> movq -8(%rbp), %rax /* load foo arg from stack */
> movq 24(%rax), %rax /* load foo->bar */
> movq -16(%rbp), %rdx /* load tmp arg from stack */
> movl 32(%rdx), %edx /* load tmp->blah */
> movl %edx, 20(%rax) /* store bar->fubar */ <<==(1)
> leave
> ret
> .cfi_endproc

At (1), dwarf tells us the location of 'foo' is -8(%rbp) and 'tmp' is
-16(%rbp), but doesn't know to what 20(%rax) is mapped, because
foo->bar->fubar is not a real variable.

So that what we need is a kind of the reverse compiler which generates
intermediate code (a sequence of register assignments) from
instruction code. That's not impossible task, but just hard and fun. :)
For that purpose, we'll need an instruction decoder and an evaluator
which allows us to trace the sequence of address dereferences.

Anyway, I'd recommend that we should start with just showing
the corresponding "source line" of the address. It may be
enough for some cases.

Thank you,

--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2011-03-30 02:19:25

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

Em Wed, Mar 30, 2011 at 10:04:10AM +0900, Masami Hiramatsu escreveu:
> So that what we need is a kind of the reverse compiler which generates
> intermediate code (a sequence of register assignments) from
> instruction code. That's not impossible task, but just hard and fun. :)
> For that purpose, we'll need an instruction decoder and an evaluator
> which allows us to trace the sequence of address dereferences.
>
> Anyway, I'd recommend that we should start with just showing
> the corresponding "source line" of the address. It may be
> enough for some cases.

Right, but being able to map events to structure fields is an interestig
goal, data annotation instead of code annotation, and if it is fun and
hard, even better! :-)

- Arnaldo

2011-03-31 06:58:01

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Wed, 2011-03-30 at 09:04 +0800, Masami Hiramatsu wrote:
> (2011/03/30 2:45), Arnaldo Carvalho de Melo wrote:
> > Em Tue, Mar 29, 2011 at 07:08:53PM +0200, Peter Zijlstra escreveu:
> >> On Tue, 2011-03-29 at 19:06 +0200, Peter Zijlstra wrote:
> >>> On Tue, 2011-03-29 at 19:03 +0200, Peter Zijlstra wrote:
> >>>>> Is it an unwind of the call frame stack to find out what data member was
> >>>>> accessed?
> >
> >>>> No need to unwind stacks, DWARF should have information on function
> >>>> local stack. It should be able to tell you the type of things like -8(%
> >>>> rbp).
> >
> >>> That is, we don't even have a life stack to unwind, so we simply cannot
> >>> unwind at all, but the debug information should be able to tell you the
> >>> type of whatever would live at a certain stack position if we were to
> >>> have a stack.
> >
> >> Furthermore, I've now completely exhausted my knowledge of debuginfo and
> >> hope Masami and Arnaldo will help with further details ;-)
> >
> > :-)
> >
> > I think the place to look is 'perf probe', look for the way one
> > variable is translated to an expression passed to the kprobe_tracer, I
> > think you'll need the reverse operation.
> >
> > Look at tools/perf/util/probe-finder.c, function find_variable() and
> > convert_variable_location().
>
> Right, perf probe is available for looking up a variable at
> an address, however, since it just depends on the dwarf,
> we can just know the mapping from (actual) variables to
> locations (registers/stacks etc.)
>
> It seems that Peter requires more than that, his request is
> getting a map from (temporarily used) register to a specific
> member of a data structure pointed by a pointer. But dwarf is
> not enough for that.
>
> Peter gave us a good example of that.
>
> > void foo(struct foo *foo, struct tmp *tmp)
> > {
> > foo->bar->fubar = tmp->blah;
> > }
> >
> > foo:
> > .cfi_startproc
> > pushq %rbp
> > .cfi_def_cfa_offset 16
> > movq %rsp, %rbp
> > .cfi_offset 6, -16
> > .cfi_def_cfa_register 6
> > movq %rdi, -8(%rbp)
> > movq %rsi, -16(%rbp)
> > movq -8(%rbp), %rax /* load foo arg from stack */
> > movq 24(%rax), %rax /* load foo->bar */
> > movq -16(%rbp), %rdx /* load tmp arg from stack */
> > movl 32(%rdx), %edx /* load tmp->blah */
> > movl %edx, 20(%rax) /* store bar->fubar */ <<==(1)
> > leave
> > ret
> > .cfi_endproc
>
> At (1), dwarf tells us the location of 'foo' is -8(%rbp) and 'tmp' is
> -16(%rbp), but doesn't know to what 20(%rax) is mapped, because
> foo->bar->fubar is not a real variable.
>
> So that what we need is a kind of the reverse compiler which generates
> intermediate code (a sequence of register assignments) from
> instruction code. That's not impossible task, but just hard and fun. :)
> For that purpose, we'll need an instruction decoder and an evaluator
> which allows us to trace the sequence of address dereferences.

I may borrow code from objdump to decode instruction.
But not sure about the instruction evaluator, any hints?

Thanks,
Lin Ming

>
> Anyway, I'd recommend that we should start with just showing
> the corresponding "source line" of the address. It may be
> enough for some cases.
>
> Thank you,
>
>

2011-03-31 08:46:26

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Wed, 2011-03-30 at 09:04 +0800, Masami Hiramatsu wrote:
> (2011/03/30 2:45), Arnaldo Carvalho de Melo wrote:
> > Em Tue, Mar 29, 2011 at 07:08:53PM +0200, Peter Zijlstra escreveu:
> >> On Tue, 2011-03-29 at 19:06 +0200, Peter Zijlstra wrote:
> >>> On Tue, 2011-03-29 at 19:03 +0200, Peter Zijlstra wrote:
> >>>>> Is it an unwind of the call frame stack to find out what data member was
> >>>>> accessed?
> >
> >>>> No need to unwind stacks, DWARF should have information on function
> >>>> local stack. It should be able to tell you the type of things like -8(%
> >>>> rbp).
> >
> >>> That is, we don't even have a life stack to unwind, so we simply cannot
> >>> unwind at all, but the debug information should be able to tell you the
> >>> type of whatever would live at a certain stack position if we were to
> >>> have a stack.
> >
> >> Furthermore, I've now completely exhausted my knowledge of debuginfo and
> >> hope Masami and Arnaldo will help with further details ;-)
> >
> > :-)
> >
> > I think the place to look is 'perf probe', look for the way one
> > variable is translated to an expression passed to the kprobe_tracer, I
> > think you'll need the reverse operation.
> >
> > Look at tools/perf/util/probe-finder.c, function find_variable() and
> > convert_variable_location().
>
> Right, perf probe is available for looking up a variable at
> an address, however, since it just depends on the dwarf,
> we can just know the mapping from (actual) variables to
> locations (registers/stacks etc.)
>
> It seems that Peter requires more than that, his request is
> getting a map from (temporarily used) register to a specific
> member of a data structure pointed by a pointer. But dwarf is
> not enough for that.
>
> Peter gave us a good example of that.
>
> > void foo(struct foo *foo, struct tmp *tmp)
> > {
> > foo->bar->fubar = tmp->blah;
> > }
> >
> > foo:
> > .cfi_startproc
> > pushq %rbp
> > .cfi_def_cfa_offset 16
> > movq %rsp, %rbp
> > .cfi_offset 6, -16
> > .cfi_def_cfa_register 6
> > movq %rdi, -8(%rbp)
> > movq %rsi, -16(%rbp)
> > movq -8(%rbp), %rax /* load foo arg from stack */
> > movq 24(%rax), %rax /* load foo->bar */
> > movq -16(%rbp), %rdx /* load tmp arg from stack */
> > movl 32(%rdx), %edx /* load tmp->blah */
> > movl %edx, 20(%rax) /* store bar->fubar */ <<==(1)
> > leave
> > ret
> > .cfi_endproc
>
> At (1), dwarf tells us the location of 'foo' is -8(%rbp) and 'tmp' is
> -16(%rbp), but doesn't know to what 20(%rax) is mapped, because
> foo->bar->fubar is not a real variable.

I am considering if it is possible to do "instruction unwind" to get a
map from (temporarily used) register to a specific member of a data
structure pointed by a pointer.

4004a0: movq -8(%rbp), %rax /* load foo arg from stack */
4004a4: movq 24(%rax), %rax /* load foo->bar */
4004a8: movq -16(%rbp), %rdx /* load tmp arg from stack */
4004ac: movl 32(%rdx), %edx /* load tmp->blah */
4004af: movl %edx, 20(%rax) /* store bar->fubar */

foo: -8(%rbp)
tmp: -16(%rbp)

Assume we are now at ip 4004af, from the instruction decoder, we know
it's a store operation, and we want to find out what %rax is.

1. unwind to 4004ac
Ignore this, because it does not touch %rax

2. unwind to 4004a8
Ignore this, because it does not touch %rax

3. unwind to 4004a4
20(%rax) => 20(24(%rax)), continue to unwind because we still
have no idea what %rax is

4. unwind to 4004a0
20(24(%rax)) => 20(24(-8(%rbp))), stop unwind, because we now know
-8(%rbp) is foo.

So the original 20(%rax) is replace as 20(24(-8(%rbp))), and it means
foo->bar->fubar

Does this make sense?

Thanks,
Lin Ming

>
> So that what we need is a kind of the reverse compiler which generates
> intermediate code (a sequence of register assignments) from
> instruction code. That's not impossible task, but just hard and fun. :)
> For that purpose, we'll need an instruction decoder and an evaluator
> which allows us to trace the sequence of address dereferences.
>
> Anyway, I'd recommend that we should start with just showing
> the corresponding "source line" of the address. It may be
> enough for some cases.
>
> Thank you,
>
>

2011-03-31 13:55:59

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

Em Thu, Mar 31, 2011 at 04:45:55PM +0800, Lin Ming escreveu:
> On Wed, 2011-03-30 at 09:04 +0800, Masami Hiramatsu wrote:
> > > foo:
> > > .cfi_startproc
> > > pushq %rbp
> > > .cfi_def_cfa_offset 16
> > > movq %rsp, %rbp
> > > .cfi_offset 6, -16
> > > .cfi_def_cfa_register 6
> > > movq %rdi, -8(%rbp)
> > > movq %rsi, -16(%rbp)
> > > movq -8(%rbp), %rax /* load foo arg from stack */
> > > movq 24(%rax), %rax /* load foo->bar */
> > > movq -16(%rbp), %rdx /* load tmp arg from stack */
> > > movl 32(%rdx), %edx /* load tmp->blah */
> > > movl %edx, 20(%rax) /* store bar->fubar */ <<==(1)
> > > leave
> > > ret
> > > .cfi_endproc
> >
> > At (1), dwarf tells us the location of 'foo' is -8(%rbp) and 'tmp' is
> > -16(%rbp), but doesn't know to what 20(%rax) is mapped, because
> > foo->bar->fubar is not a real variable.
>
> I am considering if it is possible to do "instruction unwind" to get a
> map from (temporarily used) register to a specific member of a data
> structure pointed by a pointer.
>
> 4004a0: movq -8(%rbp), %rax /* load foo arg from stack */
> 4004a4: movq 24(%rax), %rax /* load foo->bar */
> 4004a8: movq -16(%rbp), %rdx /* load tmp arg from stack */
> 4004ac: movl 32(%rdx), %edx /* load tmp->blah */
> 4004af: movl %edx, 20(%rax) /* store bar->fubar */
>
> foo: -8(%rbp)
> tmp: -16(%rbp)
>
> Assume we are now at ip 4004af, from the instruction decoder, we know
> it's a store operation, and we want to find out what %rax is.
>
> 1. unwind to 4004ac
> Ignore this, because it does not touch %rax
>
> 2. unwind to 4004a8
> Ignore this, because it does not touch %rax
>
> 3. unwind to 4004a4
> 20(%rax) => 20(24(%rax)), continue to unwind because we still
> have no idea what %rax is
>
> 4. unwind to 4004a0
> 20(24(%rax)) => 20(24(-8(%rbp))), stop unwind, because we now know
> -8(%rbp) is foo.
>
> So the original 20(%rax) is replace as 20(24(-8(%rbp))), and it means
> foo->bar->fubar
>
> Does this make sense?

I think it does, so we do just like with annotation, but parsing objdump
-S output, is that what you're planning?

After we have the members we can do data annotation, in much the same
way we do with code annotation, i.e. augmenting pahole output.

- Arnaldo

2011-03-31 14:02:27

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Thu, 2011-03-31 at 16:45 +0800, Lin Ming wrote:
> I am considering if it is possible to do "instruction unwind" to get a
> map from (temporarily used) register to a specific member of a data
> structure pointed by a pointer.
>
> 4004a0: movq -8(%rbp), %rax /* load foo arg from stack
> */
> 4004a4: movq 24(%rax), %rax /* load foo->bar */
> 4004a8: movq -16(%rbp), %rdx /* load tmp arg from stack
> */
> 4004ac: movl 32(%rdx), %edx /* load tmp->blah */
> 4004af: movl %edx, 20(%rax) /* store bar->fubar */
>
> foo: -8(%rbp)
> tmp: -16(%rbp)
>
> Assume we are now at ip 4004af, from the instruction decoder, we know
> it's a store operation, and we want to find out what %rax is.
>
> 1. unwind to 4004ac
> Ignore this, because it does not touch %rax
>
> 2. unwind to 4004a8
> Ignore this, because it does not touch %rax
>
> 3. unwind to 4004a4
> 20(%rax) => 20(24(%rax)), continue to unwind because we still
> have no idea what %rax is
>
> 4. unwind to 4004a0
> 20(24(%rax)) => 20(24(-8(%rbp))), stop unwind, because we now know
> -8(%rbp) is foo.
>
> So the original 20(%rax) is replace as 20(24(-8(%rbp))), and it means
> foo->bar->fubar
>
> Does this make sense?

Yes and no, the problem is that you cannot unwind an x86 instruction
stream. Therefore its easier to start at the beginning of a function
where DWARF should be able to tell you everything you need and then do a
single fwd scan to propagate the information until you reach the point
of interest.

2011-03-31 14:19:38

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Thu, 2011-03-31 at 21:46 +0800, Arnaldo Carvalho de Melo wrote:
> Em Thu, Mar 31, 2011 at 04:45:55PM +0800, Lin Ming escreveu:
> > On Wed, 2011-03-30 at 09:04 +0800, Masami Hiramatsu wrote:
> > > > foo:
> > > > .cfi_startproc
> > > > pushq %rbp
> > > > .cfi_def_cfa_offset 16
> > > > movq %rsp, %rbp
> > > > .cfi_offset 6, -16
> > > > .cfi_def_cfa_register 6
> > > > movq %rdi, -8(%rbp)
> > > > movq %rsi, -16(%rbp)
> > > > movq -8(%rbp), %rax /* load foo arg from stack */
> > > > movq 24(%rax), %rax /* load foo->bar */
> > > > movq -16(%rbp), %rdx /* load tmp arg from stack */
> > > > movl 32(%rdx), %edx /* load tmp->blah */
> > > > movl %edx, 20(%rax) /* store bar->fubar */ <<==(1)
> > > > leave
> > > > ret
> > > > .cfi_endproc
> > >
> > > At (1), dwarf tells us the location of 'foo' is -8(%rbp) and 'tmp' is
> > > -16(%rbp), but doesn't know to what 20(%rax) is mapped, because
> > > foo->bar->fubar is not a real variable.
> >
> > I am considering if it is possible to do "instruction unwind" to get a
> > map from (temporarily used) register to a specific member of a data
> > structure pointed by a pointer.
> >
> > 4004a0: movq -8(%rbp), %rax /* load foo arg from stack */
> > 4004a4: movq 24(%rax), %rax /* load foo->bar */
> > 4004a8: movq -16(%rbp), %rdx /* load tmp arg from stack */
> > 4004ac: movl 32(%rdx), %edx /* load tmp->blah */
> > 4004af: movl %edx, 20(%rax) /* store bar->fubar */
> >
> > foo: -8(%rbp)
> > tmp: -16(%rbp)
> >
> > Assume we are now at ip 4004af, from the instruction decoder, we know
> > it's a store operation, and we want to find out what %rax is.
> >
> > 1. unwind to 4004ac
> > Ignore this, because it does not touch %rax
> >
> > 2. unwind to 4004a8
> > Ignore this, because it does not touch %rax
> >
> > 3. unwind to 4004a4
> > 20(%rax) => 20(24(%rax)), continue to unwind because we still
> > have no idea what %rax is
> >
> > 4. unwind to 4004a0
> > 20(24(%rax)) => 20(24(-8(%rbp))), stop unwind, because we now know
> > -8(%rbp) is foo.
> >
> > So the original 20(%rax) is replace as 20(24(-8(%rbp))), and it means
> > foo->bar->fubar
> >
> > Does this make sense?
>
> I think it does, so we do just like with annotation, but parsing objdump
> -S output, is that what you're planning?

I mean to use libopcodes to decode the instructions.
But to parse objdump -S output maybe a better and simpler idea.

>
> After we have the members we can do data annotation, in much the same
> way we do with code annotation, i.e. augmenting pahole output.

What is data annotation? Is it to list the percentage of access for each
member of the structure? For example,

struct bar {
int poekoe[5]; //10%
int fubar; //20%
};

Lin Ming

>
> - Arnaldo

2011-03-31 14:34:17

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Thu, 2011-03-31 at 22:01 +0800, Peter Zijlstra wrote:
> On Thu, 2011-03-31 at 16:45 +0800, Lin Ming wrote:
> > I am considering if it is possible to do "instruction unwind" to get a
> > map from (temporarily used) register to a specific member of a data
> > structure pointed by a pointer.
> >
> > 4004a0: movq -8(%rbp), %rax /* load foo arg from stack
> > */
> > 4004a4: movq 24(%rax), %rax /* load foo->bar */
> > 4004a8: movq -16(%rbp), %rdx /* load tmp arg from stack
> > */
> > 4004ac: movl 32(%rdx), %edx /* load tmp->blah */
> > 4004af: movl %edx, 20(%rax) /* store bar->fubar */
> >
> > foo: -8(%rbp)
> > tmp: -16(%rbp)
> >
> > Assume we are now at ip 4004af, from the instruction decoder, we know
> > it's a store operation, and we want to find out what %rax is.
> >
> > 1. unwind to 4004ac
> > Ignore this, because it does not touch %rax
> >
> > 2. unwind to 4004a8
> > Ignore this, because it does not touch %rax
> >
> > 3. unwind to 4004a4
> > 20(%rax) => 20(24(%rax)), continue to unwind because we still
> > have no idea what %rax is
> >
> > 4. unwind to 4004a0
> > 20(24(%rax)) => 20(24(-8(%rbp))), stop unwind, because we now know
> > -8(%rbp) is foo.
> >
> > So the original 20(%rax) is replace as 20(24(-8(%rbp))), and it means
> > foo->bar->fubar
> >
> > Does this make sense?
>
> Yes and no, the problem is that you cannot unwind an x86 instruction
> stream. Therefore its easier to start at the beginning of a function
> where DWARF should be able to tell you everything you need and then do a
> single fwd scan to propagate the information until you reach the point
> of interest.

I'm afraid that fwd scan may not work, because of branch instruction.

void foo(struct foo *foo, struct tmp *tmp, int flag)
{
if (flag)
foo->bar->fubar = tmp->blah;
else
tmp->blah = foo->bar->fubar;
}

===>

void foo(struct foo *foo, struct tmp *tmp, int flag)
{
400494: 55 push %rbp
400495: 48 89 e5 mov %rsp,%rbp
400498: 48 89 7d f8 mov %rdi,-0x8(%rbp)
40049c: 48 89 75 f0 mov %rsi,-0x10(%rbp)
4004a0: 89 55 ec mov %edx,-0x14(%rbp)
if (flag)
4004a3: 83 7d ec 00 cmpl $0x0,-0x14(%rbp)
4004a7: 74 14 je 4004bd <foo+0x29>
foo->bar->fubar = tmp->blah;
4004a9: 48 8b 45 f8 mov -0x8(%rbp),%rax
4004ad: 48 8b 40 18 mov 0x18(%rax),%rax
4004b1: 48 8b 55 f0 mov -0x10(%rbp),%rdx
4004b5: 8b 52 20 mov 0x20(%rdx),%edx
4004b8: 89 50 14 mov %edx,0x14(%rax)
4004bb: eb 12 jmp 4004cf <foo+0x3b>
else
tmp->blah = foo->bar->fubar;
4004bd: 48 8b 45 f8 mov -0x8(%rbp),%rax
4004c1: 48 8b 40 18 mov 0x18(%rax),%rax
4004c5: 8b 50 14 mov 0x14(%rax),%edx
4004c8: 48 8b 45 f0 mov -0x10(%rbp),%rax
4004cc: 89 50 20 mov %edx,0x20(%rax)
}
4004cf: c9 leaveq
4004d0: c3 retq

Assume we are at ip 4004c5, the fwd scan from the beginning of
function(400494) to 4004c5 will not get what we want about %rax.

Lin Ming

2011-03-31 14:52:08

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Thu, 2011-03-31 at 22:34 +0800, Lin Ming wrote:
> On Thu, 2011-03-31 at 22:01 +0800, Peter Zijlstra wrote:
> > On Thu, 2011-03-31 at 16:45 +0800, Lin Ming wrote:
> > > I am considering if it is possible to do "instruction unwind" to get a
> > > map from (temporarily used) register to a specific member of a data
> > > structure pointed by a pointer.
> > >
> > > 4004a0: movq -8(%rbp), %rax /* load foo arg from stack
> > > */
> > > 4004a4: movq 24(%rax), %rax /* load foo->bar */
> > > 4004a8: movq -16(%rbp), %rdx /* load tmp arg from stack
> > > */
> > > 4004ac: movl 32(%rdx), %edx /* load tmp->blah */
> > > 4004af: movl %edx, 20(%rax) /* store bar->fubar */
> > >
> > > foo: -8(%rbp)
> > > tmp: -16(%rbp)
> > >
> > > Assume we are now at ip 4004af, from the instruction decoder, we know
> > > it's a store operation, and we want to find out what %rax is.
> > >
> > > 1. unwind to 4004ac
> > > Ignore this, because it does not touch %rax
> > >
> > > 2. unwind to 4004a8
> > > Ignore this, because it does not touch %rax
> > >
> > > 3. unwind to 4004a4
> > > 20(%rax) => 20(24(%rax)), continue to unwind because we still
> > > have no idea what %rax is
> > >
> > > 4. unwind to 4004a0
> > > 20(24(%rax)) => 20(24(-8(%rbp))), stop unwind, because we now know
> > > -8(%rbp) is foo.
> > >
> > > So the original 20(%rax) is replace as 20(24(-8(%rbp))), and it means
> > > foo->bar->fubar
> > >
> > > Does this make sense?
> >
> > Yes and no, the problem is that you cannot unwind an x86 instruction
> > stream. Therefore its easier to start at the beginning of a function
> > where DWARF should be able to tell you everything you need and then do a
> > single fwd scan to propagate the information until you reach the point
> > of interest.
>
> I'm afraid that fwd scan may not work, because of branch instruction.
>
> void foo(struct foo *foo, struct tmp *tmp, int flag)
> {
> if (flag)
> foo->bar->fubar = tmp->blah;
> else
> tmp->blah = foo->bar->fubar;
> }
>
> ===>
>
> void foo(struct foo *foo, struct tmp *tmp, int flag)
> {
> 400494: 55 push %rbp
> 400495: 48 89 e5 mov %rsp,%rbp
> 400498: 48 89 7d f8 mov %rdi,-0x8(%rbp)
> 40049c: 48 89 75 f0 mov %rsi,-0x10(%rbp)
> 4004a0: 89 55 ec mov %edx,-0x14(%rbp)
> if (flag)
> 4004a3: 83 7d ec 00 cmpl $0x0,-0x14(%rbp)
> 4004a7: 74 14 je 4004bd <foo+0x29>
> foo->bar->fubar = tmp->blah;
> 4004a9: 48 8b 45 f8 mov -0x8(%rbp),%rax
> 4004ad: 48 8b 40 18 mov 0x18(%rax),%rax
> 4004b1: 48 8b 55 f0 mov -0x10(%rbp),%rdx
> 4004b5: 8b 52 20 mov 0x20(%rdx),%edx
> 4004b8: 89 50 14 mov %edx,0x14(%rax)
> 4004bb: eb 12 jmp 4004cf <foo+0x3b>
> else
> tmp->blah = foo->bar->fubar;
> 4004bd: 48 8b 45 f8 mov -0x8(%rbp),%rax
> 4004c1: 48 8b 40 18 mov 0x18(%rax),%rax
> 4004c5: 8b 50 14 mov 0x14(%rax),%edx
> 4004c8: 48 8b 45 f0 mov -0x10(%rbp),%rax
> 4004cc: 89 50 20 mov %edx,0x20(%rax)
> }
> 4004cf: c9 leaveq
> 4004d0: c3 retq
>
> Assume we are at ip 4004c5, the fwd scan from the beginning of
> function(400494) to 4004c5 will not get what we want about %rax.

In contrast, we can scan from 4004c5 toward to the beginning of the
function to get the info about 0x14(%rax)

We already know foo is -0x8(%rbp)

Scan 4004c1: 0x14(%rax) -> 0x14(0x18(%rax))
Scan 4004bd: 0x14(0x18(%rax)) -> 0x14(0x18(-0x8(%rbp))), stop scan
because we already know -0x8(%rbp) is foo.

And with other dwarf info, we finally know 0x14(%rax) at ip 4004c5 means
foo->bar->fubar.

Lin Ming

>
> Lin Ming

2011-03-31 15:35:53

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

Em Thu, Mar 31, 2011 at 10:19:36PM +0800, Lin Ming escreveu:
> On Thu, 2011-03-31 at 21:46 +0800, Arnaldo Carvalho de Melo wrote:
> > I think it does, so we do just like with annotation, but parsing objdump
> > -S output, is that what you're planning?
>
> I mean to use libopcodes to decode the instructions.
> But to parse objdump -S output maybe a better and simpler idea.

When we started working on code annotation I thought about using a
library, but after a while, no, not really, better to parse a tool
output like Ingo did in the first usable version of perf annotate.

objdump output (and pahole, for data annotation) is very much set in
stone after all those years, so we can just use it.

> > After we have the members we can do data annotation, in much the same
> > way we do with code annotation, i.e. augmenting pahole output.

> What is data annotation? Is it to list the percentage of access for each
> member of the structure? For example,

> struct bar {
> int poekoe[5]; //10%
> int fubar; //20%
> };

Exactly. Using different colors for percentage ranges, etc. Like code
annotation right now.

Done in perf report (static) zoomable, i.e. choosing a DSO and getting
the hits originated from functions in it, or in a thread, or in both,
like we have today for code annotation.

And in 'perf top --tui' live data annotation. With the same set of
zooming operations. So that one can see patterns of data structure
access as a workload runs.

- Arnaldo

2011-03-31 16:26:34

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Thu, 2011-03-31 at 22:34 +0800, Lin Ming wrote:
> On Thu, 2011-03-31 at 22:01 +0800, Peter Zijlstra wrote:
> > On Thu, 2011-03-31 at 16:45 +0800, Lin Ming wrote:
> > > I am considering if it is possible to do "instruction unwind" to get a
> > > map from (temporarily used) register to a specific member of a data
> > > structure pointed by a pointer.
> > >
> > > 4004a0: movq -8(%rbp), %rax /* load foo arg from stack
> > > */
> > > 4004a4: movq 24(%rax), %rax /* load foo->bar */
> > > 4004a8: movq -16(%rbp), %rdx /* load tmp arg from stack
> > > */
> > > 4004ac: movl 32(%rdx), %edx /* load tmp->blah */
> > > 4004af: movl %edx, 20(%rax) /* store bar->fubar */
> > >
> > > foo: -8(%rbp)
> > > tmp: -16(%rbp)
> > >
> > > Assume we are now at ip 4004af, from the instruction decoder, we know
> > > it's a store operation, and we want to find out what %rax is.
> > >
> > > 1. unwind to 4004ac
> > > Ignore this, because it does not touch %rax
> > >
> > > 2. unwind to 4004a8
> > > Ignore this, because it does not touch %rax
> > >
> > > 3. unwind to 4004a4
> > > 20(%rax) => 20(24(%rax)), continue to unwind because we still
> > > have no idea what %rax is
> > >
> > > 4. unwind to 4004a0
> > > 20(24(%rax)) => 20(24(-8(%rbp))), stop unwind, because we now know
> > > -8(%rbp) is foo.
> > >
> > > So the original 20(%rax) is replace as 20(24(-8(%rbp))), and it means
> > > foo->bar->fubar
> > >
> > > Does this make sense?
> >
> > Yes and no, the problem is that you cannot unwind an x86 instruction
> > stream. Therefore its easier to start at the beginning of a function
> > where DWARF should be able to tell you everything you need and then do a
> > single fwd scan to propagate the information until you reach the point
> > of interest.
>
> I'm afraid that fwd scan may not work, because of branch instruction.
>
> void foo(struct foo *foo, struct tmp *tmp, int flag)
> {
> if (flag)
> foo->bar->fubar = tmp->blah;
> else
> tmp->blah = foo->bar->fubar;
> }
>
> ===>
>
> void foo(struct foo *foo, struct tmp *tmp, int flag)
> {
> 400494: 55 push %rbp
> 400495: 48 89 e5 mov %rsp,%rbp
> 400498: 48 89 7d f8 mov %rdi,-0x8(%rbp)
> 40049c: 48 89 75 f0 mov %rsi,-0x10(%rbp)
> 4004a0: 89 55 ec mov %edx,-0x14(%rbp)
> if (flag)
> 4004a3: 83 7d ec 00 cmpl $0x0,-0x14(%rbp)
> 4004a7: 74 14 je 4004bd <foo+0x29>
> foo->bar->fubar = tmp->blah;
> 4004a9: 48 8b 45 f8 mov -0x8(%rbp),%rax
> 4004ad: 48 8b 40 18 mov 0x18(%rax),%rax
> 4004b1: 48 8b 55 f0 mov -0x10(%rbp),%rdx
> 4004b5: 8b 52 20 mov 0x20(%rdx),%edx
> 4004b8: 89 50 14 mov %edx,0x14(%rax)
> 4004bb: eb 12 jmp 4004cf <foo+0x3b>
> else
> tmp->blah = foo->bar->fubar;
> 4004bd: 48 8b 45 f8 mov -0x8(%rbp),%rax
> 4004c1: 48 8b 40 18 mov 0x18(%rax),%rax
> 4004c5: 8b 50 14 mov 0x14(%rax),%edx
> 4004c8: 48 8b 45 f0 mov -0x10(%rbp),%rax
> 4004cc: 89 50 20 mov %edx,0x20(%rax)
> }
> 4004cf: c9 leaveq
> 4004d0: c3 retq
>
> Assume we are at ip 4004c5, the fwd scan from the beginning of
> function(400494) to 4004c5 will not get what we want about %rax.

Conversely backwards scans can get confused if there's more places to
come from (intercal ftw!).

That said, your example above should not get confused about %rax if it
knows about the jumps, simply clone your context on any jump instruction
and follow both branches. That would then give you:

400494 -> 4004a7 -> 4004bb -> 4004cf
-> 4004bd

You could even first build the basic block tree and only follow those
branches that end up covering the region IP is in.

2011-03-31 16:29:59

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Thu, 2011-03-31 at 18:28 +0200, Peter Zijlstra wrote:
>
> You could even first build the basic block tree and only follow those
> branches that end up covering the region IP is in.

s/tree/directed-graph/ clearly the basic blocks don't form a tree and
can contain cycles.

2011-04-01 10:44:44

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

(2011/04/01 1:28), Peter Zijlstra wrote:
> On Thu, 2011-03-31 at 22:34 +0800, Lin Ming wrote:
>> I'm afraid that fwd scan may not work, because of branch instruction.
>>
>> void foo(struct foo *foo, struct tmp *tmp, int flag)
>> {
>> if (flag)
>> foo->bar->fubar = tmp->blah;
>> else
>> tmp->blah = foo->bar->fubar;
>> }
>>
>> ===>
>>
>> void foo(struct foo *foo, struct tmp *tmp, int flag)
>> {
>> 400494: 55 push %rbp
>> 400495: 48 89 e5 mov %rsp,%rbp
>> 400498: 48 89 7d f8 mov %rdi,-0x8(%rbp)
>> 40049c: 48 89 75 f0 mov %rsi,-0x10(%rbp)
>> 4004a0: 89 55 ec mov %edx,-0x14(%rbp)
>> if (flag)
>> 4004a3: 83 7d ec 00 cmpl $0x0,-0x14(%rbp)
>> 4004a7: 74 14 je 4004bd <foo+0x29>
>> foo->bar->fubar = tmp->blah;
>> 4004a9: 48 8b 45 f8 mov -0x8(%rbp),%rax
>> 4004ad: 48 8b 40 18 mov 0x18(%rax),%rax
>> 4004b1: 48 8b 55 f0 mov -0x10(%rbp),%rdx
>> 4004b5: 8b 52 20 mov 0x20(%rdx),%edx
>> 4004b8: 89 50 14 mov %edx,0x14(%rax)
>> 4004bb: eb 12 jmp 4004cf <foo+0x3b>
>> else
>> tmp->blah = foo->bar->fubar;
>> 4004bd: 48 8b 45 f8 mov -0x8(%rbp),%rax
>> 4004c1: 48 8b 40 18 mov 0x18(%rax),%rax
>> 4004c5: 8b 50 14 mov 0x14(%rax),%edx
>> 4004c8: 48 8b 45 f0 mov -0x10(%rbp),%rax
>> 4004cc: 89 50 20 mov %edx,0x20(%rax)
>> }
>> 4004cf: c9 leaveq
>> 4004d0: c3 retq
>>
>> Assume we are at ip 4004c5, the fwd scan from the beginning of
>> function(400494) to 4004c5 will not get what we want about %rax.
>
> Conversely backwards scans can get confused if there's more places to
> come from (intercal ftw!).
>
> That said, your example above should not get confused about %rax if it
> knows about the jumps, simply clone your context on any jump instruction
> and follow both branches. That would then give you:
>
> 400494 -> 4004a7 -> 4004bb -> 4004cf
> -> 4004bd

Hm, I think that is the easiest case (like as kprobes does in the kernel)
E.g. an indirect jump makes it hard to find where it jumps to.

> You could even first build the basic block tree and only follow those
> branches that end up covering the region IP is in.

Ah, that's reasonable :)

Thank you,

--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2011-04-01 10:48:23

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

(2011/03/31 15:57), Lin Ming wrote:
>> So that what we need is a kind of the reverse compiler which generates
>> intermediate code (a sequence of register assignments) from
>> instruction code. That's not impossible task, but just hard and fun. :)
>> For that purpose, we'll need an instruction decoder and an evaluator
>> which allows us to trace the sequence of address dereferences.
>
> I may borrow code from objdump to decode instruction.
> But not sure about the instruction evaluator, any hints?

I think using objdump is a fast path to make it.
But since we already have a decoder(insn.c) and an evaluator like code (in kvm)
inside the kernel, we'd better reuse it in user space tools too :)

Thank you,

--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2011-04-01 11:05:38

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Fri, 2011-04-01 at 19:44 +0900, Masami Hiramatsu wrote:
> E.g. an indirect jump makes it hard to find where it jumps to.

Yes indirect jumps are 'interesting', is there anything in the debug
info that will help us out with the possible target sites?

Also what generates indirect jumps, switch() stmts? Indirect function
call that get optimized might also be, but hopefully DWARF would tell us
about that and allow us to know the state right after the jump.

2011-04-01 13:02:14

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Fri, 2011-04-01 at 00:32 +0800, Peter Zijlstra wrote:
> On Thu, 2011-03-31 at 18:28 +0200, Peter Zijlstra wrote:
> >
> > You could even first build the basic block tree and only follow those
> > branches that end up covering the region IP is in.
>
> s/tree/directed-graph/ clearly the basic blocks don't form a tree and
> can contain cycles.

There may be multiple different paths from the beginning of the function
to the point we are interested in.

Any idea how to handle this case?

2011-04-01 13:22:11

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Fri, 2011-04-01 at 19:05 +0800, Peter Zijlstra wrote:
> On Fri, 2011-04-01 at 19:44 +0900, Masami Hiramatsu wrote:
> > E.g. an indirect jump makes it hard to find where it jumps to.
>
> Yes indirect jumps are 'interesting', is there anything in the debug
> info that will help us out with the possible target sites?
>
> Also what generates indirect jumps, switch() stmts? Indirect function
> call that get optimized might also be, but hopefully DWARF would tell us
> about that and allow us to know the state right after the jump.

I'm not sure.

But I'm thinking another way to trace the register assignment by LBR
records.

LBR will introduce overhead, but it can tell us all the branches,
including the indirect jump.

2011-04-01 13:48:48

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Fri, 2011-04-01 at 21:02 +0800, Lin Ming wrote:
> On Fri, 2011-04-01 at 00:32 +0800, Peter Zijlstra wrote:
> > On Thu, 2011-03-31 at 18:28 +0200, Peter Zijlstra wrote:
> > >
> > > You could even first build the basic block tree and only follow those
> > > branches that end up covering the region IP is in.
> >
> > s/tree/directed-graph/ clearly the basic blocks don't form a tree and
> > can contain cycles.
>
> There may be multiple different paths from the beginning of the function
> to the point we are interested in.
>
> Any idea how to handle this case?

Any one path should be sufficient I think, C doesn't really have dynamic
typing so whatever path leads you to where you need to be ought to be
type invariant.

2011-04-01 13:49:44

[permalink] [raw]

Subject: Re: [RFC PATCH] perf report: add sort by file lines

On Fri, 2011-04-01 at 21:22 +0800, Lin Ming wrote:
> On Fri, 2011-04-01 at 19:05 +0800, Peter Zijlstra wrote:
> > On Fri, 2011-04-01 at 19:44 +0900, Masami Hiramatsu wrote:
> > > E.g. an indirect jump makes it hard to find where it jumps to.
> >
> > Yes indirect jumps are 'interesting', is there anything in the debug
> > info that will help us out with the possible target sites?
> >
> > Also what generates indirect jumps, switch() stmts? Indirect function
> > call that get optimized might also be, but hopefully DWARF would tell us
> > about that and allow us to know the state right after the jump.
>
> I'm not sure.
>
> But I'm thinking another way to trace the register assignment by LBR
> records.
>
> LBR will introduce overhead, but it can tell us all the branches,
> including the indirect jump.

Yeah, but it would be nice to see how far we can get without using extra
information, we can always try and complement stuff later.

2011-04-01 13:57:38