Hello,
I noticed a performance problem in the srcline/srcfile processing during
perf report when it's using an external addr2line process. I guess it's
also helpful even if it uses the libbfd to get the srcline info.
Also note that it's mostly from large (static) binaries, but smaller
binaries should also benefit from the fix if they have a lot of samples.
The first 5 patches are general fixes and updates. The latter 4 patches
implemented the actual speed-up.
Let's test it with the perf tools itself. Build a static binary like below.
$ cd tools/perf
$ make NO_JVMTI=1 LDFLAGS=-static
Then run the perf test workload.
$ ./perf record -- ./perf test -w noploop
And run the perf report with srcline sort key like this.
$ ./perf report -n -s srcline --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 4K of event 'cycles:u'
# Event count (approx.): 3572938596
#
# Overhead Samples Source:Line
# ........ ............ ............
#
99.94% 4010 noploop.c:26
0.03% 14 ??:0
0.03% 1 perf.c:330
0.00% 1 wcscpy.o:0
The problem is that it runs the addr2line when it processes each sample.
But as you can see many samples can have same result. IOW, if the samples
have same address, we don't need to run the addr2line each time.
So I changed the sort_key->cmp() to compare the addresses only and moved
the addr2line from sort_key->collapse() so that they can be run after
merging the samples with the same address.
With the change, I can get a huge speed-up in processing srcline info
while they generate the same output.
Before:
$ ./perf stat -- ./perf report -s srcline > /dev/null
Performance counter stats for './perf report -s srcline':
15,397.13 msec task-clock:u # 0.993 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
3,810 page-faults:u # 247.449 /sec
54,516,351,820 cycles:u # 3.541 GHz
31,494,118,293 instructions:u # 0.58 insn per cycle
8,577,271,187 branches:u # 557.069 M/sec
1,216,165,520 branch-misses:u # 14.18% of all branches
15.505066606 seconds time elapsed
15.094122000 seconds user
0.396962000 seconds sys
After:
$ ./perf stat -- ./perf report -s srcline > /dev/null
Performance counter stats for './perf report -s srcline':
105.66 msec task-clock:u # 0.994 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
3,275 page-faults:u # 30.995 K/sec
185,063,407 cycles:u # 1.751 GHz
142,470,215 instructions:u # 0.77 insn per cycle
34,584,038 branches:u # 327.311 M/sec
3,226,005 branch-misses:u # 9.33% of all branches
0.106270464 seconds time elapsed
0.074254000 seconds user
0.032871000 seconds sys
The code is available at 'perf/srcline-v1' branch in
git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
Thanks,
Namhyung
Namhyung Kim (9):
perf srcline: Do not return NULL for srcline
perf report: Ignore SIGPIPE for srcline
perf symbol: Add filename__has_section()
perf srcline: Skip srcline if .debug_line is missing
perf srcline: Conditionally suppress addr2line warnings
perf hist: Add perf_hpp_fmt->init() callback
perf hist: Improve srcline sort key performance
perf hist: Improve srcfile sort key performance
perf hist: Improve srcline_{from,to} sort key performance
tools/perf/builtin-report.c | 1 +
tools/perf/util/hist.c | 10 +--
tools/perf/util/hist.h | 1 +
tools/perf/util/sort.c | 129 ++++++++++++++++++++++++++++---
tools/perf/util/sort.h | 1 +
tools/perf/util/srcline.c | 20 +++--
tools/perf/util/symbol-elf.c | 28 +++++++
tools/perf/util/symbol-minimal.c | 5 ++
tools/perf/util/symbol.h | 1 +
9 files changed, 176 insertions(+), 20 deletions(-)
base-commit: 818448e9cf92e5c6b3c10320372eefcbe4174e4f
--
2.39.0.314.g84b9a713c41-goog
The sort_entry->cmp() will be called for eventy sample data to find a
matching entry. When it has 'srcline' sort key, that means it needs to
call addr2line or libbfd everytime.
This is not optimal because many samples will have same address and it
just can call addr2line once. So postpone the actual srcline check to
the sort_entry->collpase() and compare addresses in ->cmp().
Also it needs to add ->init() callback to make sure it has srcline info.
If a sample has a unique data, chances are the entry can be sorted out
by other (previous) keys and callbacks in sort_srcline never called.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/sort.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index f6333b3dca35..913045c5b2b2 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -373,6 +373,18 @@ char *hist_entry__srcline(struct hist_entry *he)
static int64_t
sort__srcline_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ int64_t ret;
+
+ ret = _sort__addr_cmp(left->ip, right->ip);
+ if (ret)
+ return ret;
+
+ return sort__dso_cmp(left, right);
+}
+
+static int64_t
+sort__srcline_collapse(struct hist_entry *left, struct hist_entry *right)
{
if (!left->srcline)
left->srcline = hist_entry__srcline(left);
@@ -382,18 +394,31 @@ sort__srcline_cmp(struct hist_entry *left, struct hist_entry *right)
return strcmp(right->srcline, left->srcline);
}
-static int hist_entry__srcline_snprintf(struct hist_entry *he, char *bf,
- size_t size, unsigned int width)
+static int64_t
+sort__srcline_sort(struct hist_entry *left, struct hist_entry *right)
+{
+ return sort__srcline_collapse(left, right);
+}
+
+static void
+sort__srcline_init(struct hist_entry *he)
{
if (!he->srcline)
he->srcline = hist_entry__srcline(he);
+}
+static int hist_entry__srcline_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
return repsep_snprintf(bf, size, "%-.*s", width, he->srcline);
}
struct sort_entry sort_srcline = {
.se_header = "Source:Line",
.se_cmp = sort__srcline_cmp,
+ .se_collapse = sort__srcline_collapse,
+ .se_sort = sort__srcline_sort,
+ .se_init = sort__srcline_init,
.se_snprintf = hist_entry__srcline_snprintf,
.se_width_idx = HISTC_SRCLINE,
};
--
2.39.0.314.g84b9a713c41-goog
Likewise, modify ->cmp() callback to compare sample address and map
address. And add ->collapse() and ->sort() to check the actual
srcfile string. Also add ->init() to make sure it has the srcfile.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/sort.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index c290539dcf43..e188f74698dd 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -432,6 +432,12 @@ static char *addr_map_symbol__srcline(struct addr_map_symbol *ams)
static int64_t
sort__srcline_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return left->branch_info->from.addr - right->branch_info->from.addr;
+}
+
+static int64_t
+sort__srcline_from_collapse(struct hist_entry *left, struct hist_entry *right)
{
if (!left->branch_info->srcline_from)
left->branch_info->srcline_from = addr_map_symbol__srcline(&left->branch_info->from);
@@ -442,6 +448,18 @@ sort__srcline_from_cmp(struct hist_entry *left, struct hist_entry *right)
return strcmp(right->branch_info->srcline_from, left->branch_info->srcline_from);
}
+static int64_t
+sort__srcline_from_sort(struct hist_entry *left, struct hist_entry *right)
+{
+ return sort__srcline_from_collapse(left, right);
+}
+
+static void sort__srcline_from_init(struct hist_entry *he)
+{
+ if (!he->branch_info->srcline_from)
+ he->branch_info->srcline_from = addr_map_symbol__srcline(&he->branch_info->from);
+}
+
static int hist_entry__srcline_from_snprintf(struct hist_entry *he, char *bf,
size_t size, unsigned int width)
{
@@ -451,6 +469,9 @@ static int hist_entry__srcline_from_snprintf(struct hist_entry *he, char *bf,
struct sort_entry sort_srcline_from = {
.se_header = "From Source:Line",
.se_cmp = sort__srcline_from_cmp,
+ .se_collapse = sort__srcline_from_collapse,
+ .se_sort = sort__srcline_from_sort,
+ .se_init = sort__srcline_from_init,
.se_snprintf = hist_entry__srcline_from_snprintf,
.se_width_idx = HISTC_SRCLINE_FROM,
};
@@ -459,6 +480,12 @@ struct sort_entry sort_srcline_from = {
static int64_t
sort__srcline_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return left->branch_info->to.addr - right->branch_info->to.addr;
+}
+
+static int64_t
+sort__srcline_to_collapse(struct hist_entry *left, struct hist_entry *right)
{
if (!left->branch_info->srcline_to)
left->branch_info->srcline_to = addr_map_symbol__srcline(&left->branch_info->to);
@@ -469,6 +496,18 @@ sort__srcline_to_cmp(struct hist_entry *left, struct hist_entry *right)
return strcmp(right->branch_info->srcline_to, left->branch_info->srcline_to);
}
+static int64_t
+sort__srcline_to_sort(struct hist_entry *left, struct hist_entry *right)
+{
+ return sort__srcline_to_collapse(left, right);
+}
+
+static void sort__srcline_to_init(struct hist_entry *he)
+{
+ if (!he->branch_info->srcline_to)
+ he->branch_info->srcline_to = addr_map_symbol__srcline(&he->branch_info->to);
+}
+
static int hist_entry__srcline_to_snprintf(struct hist_entry *he, char *bf,
size_t size, unsigned int width)
{
@@ -478,6 +517,9 @@ static int hist_entry__srcline_to_snprintf(struct hist_entry *he, char *bf,
struct sort_entry sort_srcline_to = {
.se_header = "To Source:Line",
.se_cmp = sort__srcline_to_cmp,
+ .se_collapse = sort__srcline_to_collapse,
+ .se_sort = sort__srcline_to_sort,
+ .se_init = sort__srcline_to_init,
.se_snprintf = hist_entry__srcline_to_snprintf,
.se_width_idx = HISTC_SRCLINE_TO,
};
--
2.39.0.314.g84b9a713c41-goog
The code assumes non-NULL srcline value always, let's return the usual
SRCLINE_UNKNOWN ("??:0") string instead.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/srcline.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/srcline.c b/tools/perf/util/srcline.c
index af468e3bb6fa..5319efb16a5a 100644
--- a/tools/perf/util/srcline.c
+++ b/tools/perf/util/srcline.c
@@ -716,7 +716,7 @@ char *__get_srcline(struct dso *dso, u64 addr, struct symbol *sym,
if (!show_addr)
return (show_sym && sym) ?
- strndup(sym->name, sym->namelen) : NULL;
+ strndup(sym->name, sym->namelen) : SRCLINE_UNKNOWN;
if (sym) {
if (asprintf(&srcline, "%s+%" PRIu64, show_sym ? sym->name : "",
--
2.39.0.314.g84b9a713c41-goog
On Thu, Dec 15, 2022 at 11:28 AM Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> I noticed a performance problem in the srcline/srcfile processing during
> perf report when it's using an external addr2line process. I guess it's
> also helpful even if it uses the libbfd to get the srcline info.
>
> Also note that it's mostly from large (static) binaries, but smaller
> binaries should also benefit from the fix if they have a lot of samples.
>
> The first 5 patches are general fixes and updates. The latter 4 patches
> implemented the actual speed-up.
>
> Let's test it with the perf tools itself. Build a static binary like below.
>
> $ cd tools/perf
> $ make NO_JVMTI=1 LDFLAGS=-static
>
> Then run the perf test workload.
>
> $ ./perf record -- ./perf test -w noploop
>
> And run the perf report with srcline sort key like this.
>
> $ ./perf report -n -s srcline --stdio
> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 4K of event 'cycles:u'
> # Event count (approx.): 3572938596
> #
> # Overhead Samples Source:Line
> # ........ ............ ............
> #
> 99.94% 4010 noploop.c:26
> 0.03% 14 ??:0
> 0.03% 1 perf.c:330
> 0.00% 1 wcscpy.o:0
>
> The problem is that it runs the addr2line when it processes each sample.
> But as you can see many samples can have same result. IOW, if the samples
> have same address, we don't need to run the addr2line each time.
>
> So I changed the sort_key->cmp() to compare the addresses only and moved
> the addr2line from sort_key->collapse() so that they can be run after
> merging the samples with the same address.
>
> With the change, I can get a huge speed-up in processing srcline info
> while they generate the same output.
>
> Before:
>
> $ ./perf stat -- ./perf report -s srcline > /dev/null
>
> Performance counter stats for './perf report -s srcline':
>
> 15,397.13 msec task-clock:u # 0.993 CPUs utilized
> 0 context-switches:u # 0.000 /sec
> 0 cpu-migrations:u # 0.000 /sec
> 3,810 page-faults:u # 247.449 /sec
> 54,516,351,820 cycles:u # 3.541 GHz
> 31,494,118,293 instructions:u # 0.58 insn per cycle
> 8,577,271,187 branches:u # 557.069 M/sec
> 1,216,165,520 branch-misses:u # 14.18% of all branches
>
> 15.505066606 seconds time elapsed
>
> 15.094122000 seconds user
> 0.396962000 seconds sys
>
> After:
>
> $ ./perf stat -- ./perf report -s srcline > /dev/null
>
> Performance counter stats for './perf report -s srcline':
>
> 105.66 msec task-clock:u # 0.994 CPUs utilized
> 0 context-switches:u # 0.000 /sec
> 0 cpu-migrations:u # 0.000 /sec
> 3,275 page-faults:u # 30.995 K/sec
> 185,063,407 cycles:u # 1.751 GHz
> 142,470,215 instructions:u # 0.77 insn per cycle
> 34,584,038 branches:u # 327.311 M/sec
> 3,226,005 branch-misses:u # 9.33% of all branches
>
> 0.106270464 seconds time elapsed
>
> 0.074254000 seconds user
> 0.032871000 seconds sys
>
> The code is available at 'perf/srcline-v1' branch in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Thanks,
> Namhyung
>
>
> Namhyung Kim (9):
> perf srcline: Do not return NULL for srcline
> perf report: Ignore SIGPIPE for srcline
> perf symbol: Add filename__has_section()
> perf srcline: Skip srcline if .debug_line is missing
> perf srcline: Conditionally suppress addr2line warnings
> perf hist: Add perf_hpp_fmt->init() callback
> perf hist: Improve srcline sort key performance
> perf hist: Improve srcfile sort key performance
> perf hist: Improve srcline_{from,to} sort key performance
Nice 145x performance win!
Series: Acked-by: Ian Rogers <[email protected]>
Thanks,
Ian
> tools/perf/builtin-report.c | 1 +
> tools/perf/util/hist.c | 10 +--
> tools/perf/util/hist.h | 1 +
> tools/perf/util/sort.c | 129 ++++++++++++++++++++++++++++---
> tools/perf/util/sort.h | 1 +
> tools/perf/util/srcline.c | 20 +++--
> tools/perf/util/symbol-elf.c | 28 +++++++
> tools/perf/util/symbol-minimal.c | 5 ++
> tools/perf/util/symbol.h | 1 +
> 9 files changed, 176 insertions(+), 20 deletions(-)
>
>
> base-commit: 818448e9cf92e5c6b3c10320372eefcbe4174e4f
> --
> 2.39.0.314.g84b9a713c41-goog
>