2013-09-10 08:24:23

by Namhyung Kim

[permalink] [raw]
Subject: [PATCH 0/3] perf tools: Fix scalability problem on callchain merging

Hello,

Linus reported that perf report was stuck in after "processing time
ordered events". It turned out that merging/collapsing the callchain
takes most of time to look up a matching callchain. Since it did a
linear search for the lookup, it took more than 95% of time during the
processing huge file that contains callchains.

This patchset tries to fix the problem by

1. converting children list to a rbtree
2. add UI progress window to inform user

With this patch, processing time of 400MB perf.data file went down
from 380s to 20s.

Thanks,
Namhyung


Namhyung Kim (3):
perf callchain: Convert children list to rbtree
perf ui/progress: Add new helper functions for progress bar
perf tools: Show progress on histogram collapsing

tools/perf/builtin-annotate.c | 2 +-
tools/perf/builtin-diff.c | 2 +-
tools/perf/builtin-report.c | 10 ++-
tools/perf/builtin-top.c | 4 +-
tools/perf/tests/hists_link.c | 2 +-
tools/perf/ui/progress.c | 18 +++++
tools/perf/ui/progress.h | 10 +++
tools/perf/util/callchain.c | 151 ++++++++++++++++++++++++++++++++----------
tools/perf/util/callchain.h | 5 +-
tools/perf/util/hist.c | 5 +-
tools/perf/util/hist.h | 3 +-
tools/perf/util/session.c | 22 +++---
12 files changed, 173 insertions(+), 61 deletions(-)

--
1.7.11.7


2013-09-10 08:24:29

by Namhyung Kim

[permalink] [raw]
Subject: [PATCH 2/3] perf ui/progress: Add new helper functions for progress bar

From: Namhyung Kim <[email protected]>

Introduce ui_progress_setup() and ui_progress__advance() to separate
out from the session logic. It'll be used by other places in the
upcoming patch.

Cc: Frederic Weisbecker <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/ui/progress.c | 18 ++++++++++++++++++
tools/perf/ui/progress.h | 10 ++++++++++
tools/perf/util/session.c | 22 +++++++++-------------
3 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/tools/perf/ui/progress.c b/tools/perf/ui/progress.c
index 3ec695607a4d..b5c22c6b1d93 100644
--- a/tools/perf/ui/progress.c
+++ b/tools/perf/ui/progress.c
@@ -19,6 +19,24 @@ void ui_progress__update(u64 curr, u64 total, const char *title)
return progress_fns->update(curr, total, title);
}

+void ui_progress__setup(struct perf_progress *p, u64 total)
+{
+ p->curr = 0;
+ p->next = p->unit = total / 16;
+ p->total = total;
+
+}
+
+void ui_progress__advance(struct perf_progress *p, u64 adv, const char *title)
+{
+ p->curr += adv;
+
+ if (p->curr >= p->next) {
+ p->next += p->unit;
+ ui_progress__update(p->curr, p->total, title);
+ }
+}
+
void ui_progress__finish(void)
{
if (progress_fns->finish)
diff --git a/tools/perf/ui/progress.h b/tools/perf/ui/progress.h
index 257cc224f9cf..3bd23e825953 100644
--- a/tools/perf/ui/progress.h
+++ b/tools/perf/ui/progress.h
@@ -8,10 +8,20 @@ struct ui_progress {
void (*finish)(void);
};

+struct perf_progress {
+ u64 curr;
+ u64 next;
+ u64 unit;
+ u64 total;
+};
+
extern struct ui_progress *progress_fns;

void ui_progress__init(void);

+void ui_progress__setup(struct perf_progress *p, u64 total);
+void ui_progress__advance(struct perf_progress *p, u64 adv, const char *title);
+
void ui_progress__update(u64 curr, u64 total, const char *title);
void ui_progress__finish(void);

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 1fc0c628683e..fe8f8e7a7d02 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -503,12 +503,14 @@ static int flush_sample_queue(struct perf_session *s,
struct perf_sample sample;
u64 limit = os->next_flush;
u64 last_ts = os->last_sample ? os->last_sample->timestamp : 0ULL;
- unsigned idx = 0, progress_next = os->nr_samples / 16;
+ struct perf_progress prog;
int ret;

if (!tool->ordered_samples || !limit)
return 0;

+ ui_progress__setup(&prog, os->nr_samples);
+
list_for_each_entry_safe(iter, tmp, head, list) {
if (iter->timestamp > limit)
break;
@@ -526,11 +528,8 @@ static int flush_sample_queue(struct perf_session *s,
os->last_flush = iter->timestamp;
list_del(&iter->list);
list_add(&iter->list, &os->sample_cache);
- if (++idx >= progress_next) {
- progress_next += os->nr_samples / 16;
- ui_progress__update(idx, os->nr_samples,
- "Processing time ordered events...");
- }
+
+ ui_progress__advance(&prog, 1, "Processing time ordered events...");
}

if (list_empty(head)) {
@@ -1271,12 +1270,13 @@ int __perf_session__process_events(struct perf_session *session,
u64 data_offset, u64 data_size,
u64 file_size, struct perf_tool *tool)
{
- u64 head, page_offset, file_offset, file_pos, progress_next;
+ u64 head, page_offset, file_offset, file_pos;
int err, mmap_prot, mmap_flags, map_idx = 0;
size_t mmap_size;
char *buf, *mmaps[NUM_MMAPS];
union perf_event *event;
uint32_t size;
+ struct perf_progress prog;

perf_tool__fill_defaults(tool);

@@ -1287,7 +1287,7 @@ int __perf_session__process_events(struct perf_session *session,
if (data_offset + data_size < file_size)
file_size = data_offset + data_size;

- progress_next = file_size / 16;
+ ui_progress__setup(&prog, file_size);

mmap_size = MMAP_SIZE;
if (mmap_size > file_size)
@@ -1342,11 +1342,7 @@ more:
head += size;
file_pos += size;

- if (file_pos >= progress_next) {
- progress_next += file_size / 16;
- ui_progress__update(file_pos, file_size,
- "Processing events...");
- }
+ ui_progress__advance(&prog, size, "Processing events...");

if (file_pos < file_size)
goto more;
--
1.7.11.7

2013-09-10 08:26:22

by Namhyung Kim

[permalink] [raw]
Subject: [PATCH 3/3] perf tools: Show progress on histogram collapsing

From: Namhyung Kim <[email protected]>

It can take quite amount of time so add progress bar UI to inform user.

Cc: Frederic Weisbecker <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-annotate.c | 2 +-
tools/perf/builtin-diff.c | 2 +-
tools/perf/builtin-report.c | 10 +++++++++-
tools/perf/builtin-top.c | 4 ++--
tools/perf/tests/hists_link.c | 2 +-
tools/perf/util/hist.c | 5 ++++-
tools/perf/util/hist.h | 3 ++-
7 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
index f988d380c52f..8984db37de6f 100644
--- a/tools/perf/builtin-annotate.c
+++ b/tools/perf/builtin-annotate.c
@@ -231,7 +231,7 @@ static int __cmd_annotate(struct perf_annotate *ann)

if (nr_samples > 0) {
total_nr_samples += nr_samples;
- hists__collapse_resort(hists);
+ hists__collapse_resort(hists, NULL);
hists__output_resort(hists);

if (symbol_conf.event_group &&
diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c
index f28799e94f2a..f4ab7d210105 100644
--- a/tools/perf/builtin-diff.c
+++ b/tools/perf/builtin-diff.c
@@ -367,7 +367,7 @@ static void perf_evlist__collapse_resort(struct perf_evlist *evlist)
list_for_each_entry(evsel, &evlist->entries, node) {
struct hists *hists = &evsel->hists;

- hists__collapse_resort(hists);
+ hists__collapse_resort(hists, NULL);
}
}

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 9725aa375414..e88231cd9655 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -488,6 +488,7 @@ static int __cmd_report(struct perf_report *rep)
struct map *kernel_map;
struct kmap *kernel_kmap;
const char *help = "For a higher level overview, try: perf report --sort comm,dso";
+ struct perf_progress prog;

signal(SIGINT, sig_handler);

@@ -549,13 +550,19 @@ static int __cmd_report(struct perf_report *rep)
}

nr_samples = 0;
+ list_for_each_entry(pos, &session->evlist->entries, node)
+ nr_samples += pos->hists.nr_entries;
+
+ ui_progress__setup(&prog, nr_samples);
+
+ nr_samples = 0;
list_for_each_entry(pos, &session->evlist->entries, node) {
struct hists *hists = &pos->hists;

if (pos->idx == 0)
hists->symbol_filter_str = rep->symbol_filter_str;

- hists__collapse_resort(hists);
+ hists__collapse_resort(hists, &prog);
nr_samples += hists->stats.nr_events[PERF_RECORD_SAMPLE];

/* Non-group events are considered as leader */
@@ -567,6 +574,7 @@ static int __cmd_report(struct perf_report *rep)
hists__link(leader_hists, hists);
}
}
+ ui_progress__finish();

if (nr_samples == 0) {
ui__error("The %s file has no samples!\n", session->filename);
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 212214162bb2..cabd027bd507 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -287,7 +287,7 @@ static void perf_top__print_sym_table(struct perf_top *top)
return;
}

- hists__collapse_resort(&top->sym_evsel->hists);
+ hists__collapse_resort(&top->sym_evsel->hists, NULL);
hists__output_resort(&top->sym_evsel->hists);
hists__decay_entries(&top->sym_evsel->hists,
top->hide_user_symbols,
@@ -553,7 +553,7 @@ static void perf_top__sort_new_samples(void *arg)
if (t->evlist->selected != NULL)
t->sym_evsel = t->evlist->selected;

- hists__collapse_resort(&t->sym_evsel->hists);
+ hists__collapse_resort(&t->sym_evsel->hists, NULL);
hists__output_resort(&t->sym_evsel->hists);
hists__decay_entries(&t->sym_evsel->hists,
t->hide_user_symbols,
diff --git a/tools/perf/tests/hists_link.c b/tools/perf/tests/hists_link.c
index 4228ffc0d968..3b5faaa5c8cf 100644
--- a/tools/perf/tests/hists_link.c
+++ b/tools/perf/tests/hists_link.c
@@ -465,7 +465,7 @@ int test__hists_link(void)
goto out;

list_for_each_entry(evsel, &evlist->entries, node) {
- hists__collapse_resort(&evsel->hists);
+ hists__collapse_resort(&evsel->hists, NULL);

if (verbose > 2)
print_hists(&evsel->hists);
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 46a0d35a05e1..1f5767f97dce 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -395,6 +395,7 @@ static struct hist_entry *add_hist_entry(struct hists *hists,
if (!he)
return NULL;

+ hists->nr_entries++;
rb_link_node(&he->rb_node_in, parent, p);
rb_insert_color(&he->rb_node_in, hists->entries_in);
out:
@@ -598,7 +599,7 @@ static void hists__apply_filters(struct hists *hists, struct hist_entry *he)
hists__filter_entry_by_symbol(hists, he);
}

-void hists__collapse_resort(struct hists *hists)
+void hists__collapse_resort(struct hists *hists, struct perf_progress *prog)
{
struct rb_root *root;
struct rb_node *next;
@@ -623,6 +624,8 @@ void hists__collapse_resort(struct hists *hists)
*/
hists__apply_filters(hists, n);
}
+ if (prog)
+ ui_progress__advance(prog, 1, "Merging related events...");
}
}

diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 1329b6b6ffe6..6e27c0e9d31a 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -5,6 +5,7 @@
#include <pthread.h>
#include "callchain.h"
#include "header.h"
+#include "ui/progress.h"

extern struct callchain_param callchain_param;

@@ -104,7 +105,7 @@ struct hist_entry *__hists__add_mem_entry(struct hists *self,
u64 weight);

void hists__output_resort(struct hists *self);
-void hists__collapse_resort(struct hists *self);
+void hists__collapse_resort(struct hists *self, struct perf_progress *prog);

void hists__decay_entries(struct hists *hists, bool zap_user, bool zap_kernel);
void hists__output_recalc_col_len(struct hists *hists, int max_rows);
--
1.7.11.7

2013-09-10 08:26:50

by Namhyung Kim

[permalink] [raw]
Subject: [PATCH 1/3] perf callchain: Convert children list to rbtree

From: Namhyung Kim <[email protected]>

Current collapse stage has a scalability problem which can be
reproduced easily with parallel kernel build. This is because it
needs to traverse every children of callchain linearly during the
collapse/merge stage. Convert it to rbtree reduced the overhead
significantly.

On my 400MB perf.data file which recorded with make -j32 kernel build:

$ time perf --no-pager report --stdio > /dev/null

before:
real 6m22.073s
user 6m18.683s
sys 0m0.706s

after:
real 0m20.780s
user 0m19.962s
sys 0m0.689s

During the perf report the overhead on append_chain_children went down
from 96.69% to 18.16%:

- 18.16% perf perf [.] append_chain_children
- append_chain_children
- 77.48% append_chain_children
+ 69.79% merge_chain_branch
- 22.96% append_chain_children
+ 67.44% merge_chain_branch
+ 30.15% append_chain_children
+ 2.41% callchain_append
+ 7.25% callchain_append
+ 12.26% callchain_append
+ 10.22% merge_chain_branch
+ 11.58% perf perf [.] dso__find_symbol
+ 8.02% perf perf [.] sort__comm_cmp
+ 5.48% perf libc-2.17.so [.] malloc_consolidate

Reported-by: Linus Torvalds <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/callchain.c | 151 +++++++++++++++++++++++++++++++++-----------
tools/perf/util/callchain.h | 5 +-
2 files changed, 116 insertions(+), 40 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 482f68081cd8..4ac385515c2f 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -21,12 +21,6 @@

__thread struct callchain_cursor callchain_cursor;

-#define chain_for_each_child(child, parent) \
- list_for_each_entry(child, &parent->children, siblings)
-
-#define chain_for_each_child_safe(child, next, parent) \
- list_for_each_entry_safe(child, next, &parent->children, siblings)
-
static void
rb_insert_callchain(struct rb_root *root, struct callchain_node *chain,
enum chain_mode mode)
@@ -71,10 +65,17 @@ static void
__sort_chain_flat(struct rb_root *rb_root, struct callchain_node *node,
u64 min_hit)
{
+ struct rb_node *n;
struct callchain_node *child;

- chain_for_each_child(child, node)
+ n = rb_first(&node->rb_root_in);
+ while (n) {
+ child = rb_entry(n, struct callchain_node, rb_node);
+ n = rb_next(n);
+ rb_erase(&child->rb_node, &node->rb_root_in);
+
__sort_chain_flat(rb_root, child, min_hit);
+ }

if (node->hit && node->hit >= min_hit)
rb_insert_callchain(rb_root, node, CHAIN_FLAT);
@@ -94,11 +95,17 @@ sort_chain_flat(struct rb_root *rb_root, struct callchain_root *root,
static void __sort_chain_graph_abs(struct callchain_node *node,
u64 min_hit)
{
+ struct rb_node *n;
struct callchain_node *child;

node->rb_root = RB_ROOT;
+ n = rb_first(&node->rb_root_in);
+
+ while (n) {
+ child = rb_entry(n, struct callchain_node, rb_node);
+ n = rb_next(n);
+ rb_erase(&child->rb_node, &node->rb_root_in);

- chain_for_each_child(child, node) {
__sort_chain_graph_abs(child, min_hit);
if (callchain_cumul_hits(child) >= min_hit)
rb_insert_callchain(&node->rb_root, child,
@@ -117,13 +124,19 @@ sort_chain_graph_abs(struct rb_root *rb_root, struct callchain_root *chain_root,
static void __sort_chain_graph_rel(struct callchain_node *node,
double min_percent)
{
+ struct rb_node *n;
struct callchain_node *child;
u64 min_hit;

node->rb_root = RB_ROOT;
min_hit = ceil(node->children_hit * min_percent);

- chain_for_each_child(child, node) {
+ n = rb_first(&node->rb_root_in);
+ while (n) {
+ child = rb_entry(n, struct callchain_node, rb_node);
+ n = rb_next(n);
+ rb_erase(&child->rb_node, &node->rb_root_in);
+
__sort_chain_graph_rel(child, min_percent);
if (callchain_cumul_hits(child) >= min_hit)
rb_insert_callchain(&node->rb_root, child,
@@ -173,19 +186,22 @@ create_child(struct callchain_node *parent, bool inherit_children)
return NULL;
}
new->parent = parent;
- INIT_LIST_HEAD(&new->children);
INIT_LIST_HEAD(&new->val);

if (inherit_children) {
- struct callchain_node *next;
+ struct rb_node *n;
+ struct callchain_node *child;

- list_splice(&parent->children, &new->children);
- INIT_LIST_HEAD(&parent->children);
+ new->rb_root_in = parent->rb_root_in;
+ parent->rb_root_in = RB_ROOT;

- chain_for_each_child(next, new)
- next->parent = new;
+ n = rb_first(&new->rb_root_in);
+ while (n) {
+ child = rb_entry(n, struct callchain_node, rb_node);
+ child->parent = new;
+ n = rb_next(n);
+ }
}
- list_add_tail(&new->siblings, &parent->children);

return new;
}
@@ -223,7 +239,7 @@ fill_node(struct callchain_node *node, struct callchain_cursor *cursor)
}
}

-static void
+static struct callchain_node *
add_child(struct callchain_node *parent,
struct callchain_cursor *cursor,
u64 period)
@@ -235,6 +251,19 @@ add_child(struct callchain_node *parent,

new->children_hit = 0;
new->hit = period;
+ return new;
+}
+
+static s64 match_chain(struct callchain_cursor_node *node,
+ struct callchain_list *cnode)
+{
+ struct symbol *sym = node->sym;
+
+ if (cnode->ms.sym && sym &&
+ callchain_param.key == CCKEY_FUNCTION)
+ return cnode->ms.sym->start - sym->start;
+ else
+ return cnode->ip - node->ip;
}

/*
@@ -254,6 +283,12 @@ split_add_child(struct callchain_node *parent,

/* split */
new = create_child(parent, true);
+ /* move existing children under new child */
+ new->rb_root_in = parent->rb_root_in;
+ parent->rb_root_in = RB_ROOT;
+ /* make it the first child */
+ rb_link_node(&new->rb_node, NULL, &parent->rb_root_in.rb_node);
+ rb_insert_color(&new->rb_node, &parent->rb_root_in);

/* split the callchain and move a part to the new child */
old_tail = parent->val.prev;
@@ -272,9 +307,33 @@ split_add_child(struct callchain_node *parent,

/* create a new child for the new branch if any */
if (idx_total < cursor->nr) {
+ struct callchain_node *first;
+ struct callchain_list *cnode;
+ struct callchain_cursor_node *node;
+ struct rb_node *p, **pp;
+
parent->hit = 0;
- add_child(parent, cursor, period);
parent->children_hit += period;
+
+ node = callchain_cursor_current(cursor);
+ new = add_child(parent, cursor, period);
+
+ /*
+ * This is second child since we moved parent's children
+ * to new (first) child above.
+ */
+ p = parent->rb_root_in.rb_node;
+ first = rb_entry(p, struct callchain_node, rb_node);
+ cnode = list_first_entry(&first->val, struct callchain_list,
+ list);
+
+ if (match_chain(node, cnode) < 0)
+ pp = &p->rb_left;
+ else
+ pp = &p->rb_right;
+
+ rb_link_node(&new->rb_node, p, pp);
+ rb_insert_color(&new->rb_node, &parent->rb_root_in);
} else {
parent->hit = period;
}
@@ -291,16 +350,39 @@ append_chain_children(struct callchain_node *root,
u64 period)
{
struct callchain_node *rnode;
+ struct callchain_cursor_node *node;
+ struct rb_node **p = &root->rb_root_in.rb_node;
+ struct rb_node *parent = NULL;
+
+ node = callchain_cursor_current(cursor);
+ if (!node)
+ return;

/* lookup in childrens */
- chain_for_each_child(rnode, root) {
- unsigned int ret = append_chain(rnode, cursor, period);
+ while (*p) {
+ s64 ret;
+ struct callchain_list *cnode;

- if (!ret)
+ parent = *p;
+ rnode = rb_entry(parent, struct callchain_node, rb_node);
+ cnode = list_first_entry(&rnode->val, struct callchain_list,
+ list);
+
+ /* just check first entry */
+ ret = match_chain(node, cnode);
+ if (ret < 0)
+ p = &parent->rb_left;
+ else if (ret > 0)
+ p = &parent->rb_right;
+ else {
+ append_chain(rnode, cursor, period);
goto inc_children_hit;
+ }
}
/* nothing in children, add to the current node */
- add_child(root, cursor, period);
+ rnode = add_child(root, cursor, period);
+ rb_link_node(&rnode->rb_node, parent, p);
+ rb_insert_color(&rnode->rb_node, &root->rb_root_in);

inc_children_hit:
root->children_hit += period;
@@ -325,28 +407,20 @@ append_chain(struct callchain_node *root,
*/
list_for_each_entry(cnode, &root->val, list) {
struct callchain_cursor_node *node;
- struct symbol *sym;

node = callchain_cursor_current(cursor);
if (!node)
break;

- sym = node->sym;
-
- if (cnode->ms.sym && sym &&
- callchain_param.key == CCKEY_FUNCTION) {
- if (cnode->ms.sym->start != sym->start)
- break;
- } else if (cnode->ip != node->ip)
+ if (match_chain(node, cnode) != 0)
break;

- if (!found)
- found = true;
+ found = true;

callchain_cursor_advance(cursor);
}

- /* matches not, relay on the parent */
+ /* matches not, relay no the parent */
if (!found) {
cursor->curr = curr_snap;
cursor->pos = start;
@@ -395,8 +469,9 @@ merge_chain_branch(struct callchain_cursor *cursor,
struct callchain_node *dst, struct callchain_node *src)
{
struct callchain_cursor_node **old_last = cursor->last;
- struct callchain_node *child, *next_child;
+ struct callchain_node *child;
struct callchain_list *list, *next_list;
+ struct rb_node *n;
int old_pos = cursor->nr;
int err = 0;

@@ -412,12 +487,16 @@ merge_chain_branch(struct callchain_cursor *cursor,
append_chain_children(dst, cursor, src->hit);
}

- chain_for_each_child_safe(child, next_child, src) {
+ n = rb_first(&src->rb_root_in);
+ while (n) {
+ child = container_of(n, struct callchain_node, rb_node);
+ n = rb_next(n);
+ rb_erase(&child->rb_node, &src->rb_root_in);
+
err = merge_chain_branch(cursor, dst, child);
if (err)
break;

- list_del(&child->siblings);
free(child);
}

diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 2b585bc308cf..1b22b6269213 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -21,10 +21,9 @@ enum chain_order {

struct callchain_node {
struct callchain_node *parent;
- struct list_head siblings;
- struct list_head children;
struct list_head val;
struct rb_node rb_node; /* to sort nodes in an rbtree */
+ struct rb_root rb_root_in; /* sorted tree of children */
struct rb_root rb_root; /* sorted tree of children */
unsigned int val_nr;
u64 hit;
@@ -86,8 +85,6 @@ extern __thread struct callchain_cursor callchain_cursor;

static inline void callchain_init(struct callchain_root *root)
{
- INIT_LIST_HEAD(&root->node.siblings);
- INIT_LIST_HEAD(&root->node.children);
INIT_LIST_HEAD(&root->node.val);

root->node.parent = NULL;
--
1.7.11.7

2013-09-10 09:02:05

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf callchain: Convert children list to rbtree

On Tue, 10 Sep 2013 17:24:16 +0900, Namhyung Kim wrote:
>
> - /* matches not, relay on the parent */
> + /* matches not, relay no the parent */

Argh.. please ignore this hunk.

Thanks,
Namhyung

2013-09-10 09:52:51

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf callchain: Convert children list to rbtree

On Tue, Sep 10, 2013 at 05:24:16PM +0900, Namhyung Kim wrote:
> From: Namhyung Kim <[email protected]>
>
> Current collapse stage has a scalability problem which can be
> reproduced easily with parallel kernel build. This is because it
> needs to traverse every children of callchain linearly during the
> collapse/merge stage. Convert it to rbtree reduced the overhead
> significantly.
>
> On my 400MB perf.data file which recorded with make -j32 kernel build:


nice!!!

tried on 2.6 GB data file from kernel make -j64 and got
report speed up from 'never' to 2m52.756s ;-)

jirka

2013-09-10 10:25:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf callchain: Convert children list to rbtree


* Jiri Olsa <[email protected]> wrote:

> On Tue, Sep 10, 2013 at 05:24:16PM +0900, Namhyung Kim wrote:
> > From: Namhyung Kim <[email protected]>
> >
> > Current collapse stage has a scalability problem which can be
> > reproduced easily with parallel kernel build. This is because it
> > needs to traverse every children of callchain linearly during the
> > collapse/merge stage. Convert it to rbtree reduced the overhead
> > significantly.
> >
> > On my 400MB perf.data file which recorded with make -j32 kernel build:
>
>
> nice!!!

Nice indeed!

> tried on 2.6 GB data file from kernel make -j64 and got report speed up
> from 'never' to 2m52.756s ;-)

It's still rather long though, unacceptable for everyday usage :-/

Frederic thought that we could reduce minimize collapsing to begin with.

Frederic, could you outline that in more detail please?

Thanks,

Ingo

2013-09-10 11:34:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf callchain: Convert children list to rbtree

On Tue, Sep 10, 2013 at 12:25:54PM +0200, Ingo Molnar wrote:
>
> * Jiri Olsa <[email protected]> wrote:
>
> > On Tue, Sep 10, 2013 at 05:24:16PM +0900, Namhyung Kim wrote:
> > > From: Namhyung Kim <[email protected]>
> > >
> > > Current collapse stage has a scalability problem which can be
> > > reproduced easily with parallel kernel build. This is because it
> > > needs to traverse every children of callchain linearly during the
> > > collapse/merge stage. Convert it to rbtree reduced the overhead
> > > significantly.
> > >
> > > On my 400MB perf.data file which recorded with make -j32 kernel build:
> >
> >
> > nice!!!
>
> Nice indeed!
>
> > tried on 2.6 GB data file from kernel make -j64 and got report speed up
> > from 'never' to 2m52.756s ;-)
>
> It's still rather long though, unacceptable for everyday usage :-/
>
> Frederic thought that we could reduce minimize collapsing to begin with.
>
> Frederic, could you outline that in more detail please?

Yeah. Currently when we sort by comm, hists are first sorted by tid. Then
in the end of the record, the hists are compared and those that have the
same comm are collapsed in one.

So what I'm trying to do now is to gather those hists from the very beginning,
which should remove the need for collapsing.

>
> Thanks,
>
> Ingo

2013-09-10 11:58:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 0/3] perf tools: Fix scalability problem on callchain merging

On Tue, Sep 10, 2013 at 05:24:15PM +0900, Namhyung Kim wrote:
> Hello,
>
> Linus reported that perf report was stuck in after "processing time
> ordered events". It turned out that merging/collapsing the callchain
> takes most of time to look up a matching callchain. Since it did a
> linear search for the lookup, it took more than 95% of time during the
> processing huge file that contains callchains.
>
> This patchset tries to fix the problem by
>
> 1. converting children list to a rbtree
> 2. add UI progress window to inform user
>
> With this patch, processing time of 400MB perf.data file went down
> from 380s to 20s.

Even if we get rid of hists collapsing, this may still bring value on
performance. One way to test is to sort by pid or by sym and compare
before and after the patchset. pid/sym shouldn't make use of collapses.

Thanks.

>
> Thanks,
> Namhyung
>
>
> Namhyung Kim (3):
> perf callchain: Convert children list to rbtree
> perf ui/progress: Add new helper functions for progress bar
> perf tools: Show progress on histogram collapsing
>
> tools/perf/builtin-annotate.c | 2 +-
> tools/perf/builtin-diff.c | 2 +-
> tools/perf/builtin-report.c | 10 ++-
> tools/perf/builtin-top.c | 4 +-
> tools/perf/tests/hists_link.c | 2 +-
> tools/perf/ui/progress.c | 18 +++++
> tools/perf/ui/progress.h | 10 +++
> tools/perf/util/callchain.c | 151 ++++++++++++++++++++++++++++++++----------
> tools/perf/util/callchain.h | 5 +-
> tools/perf/util/hist.c | 5 +-
> tools/perf/util/hist.h | 3 +-
> tools/perf/util/session.c | 22 +++---
> 12 files changed, 173 insertions(+), 61 deletions(-)
>
> --
> 1.7.11.7
>

2013-09-22 11:16:20

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf callchain: Convert children list to rbtree

On Tue, Sep 10, 2013 at 05:24:16PM +0900, Namhyung Kim wrote:
> From: Namhyung Kim <[email protected]>

SNIP

> index 2b585bc308cf..1b22b6269213 100644
> --- a/tools/perf/util/callchain.h
> +++ b/tools/perf/util/callchain.h
> @@ -21,10 +21,9 @@ enum chain_order {
>
> struct callchain_node {
> struct callchain_node *parent;
> - struct list_head siblings;
> - struct list_head children;
> struct list_head val;
> struct rb_node rb_node; /* to sort nodes in an rbtree */
> + struct rb_root rb_root_in; /* sorted tree of children */
> struct rb_root rb_root; /* sorted tree of children */
> unsigned int val_nr;
> u64 hit;
> @@ -86,8 +85,6 @@ extern __thread struct callchain_cursor callchain_cursor;
>
> static inline void callchain_init(struct callchain_root *root)
> {
> - INIT_LIST_HEAD(&root->node.siblings);
> - INIT_LIST_HEAD(&root->node.children);
> INIT_LIST_HEAD(&root->node.val);

don't we miss rb_root_in init here, like:

root->node.rb_root_in = RB_ROOT;

jirka

2013-09-23 09:29:19

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf callchain: Convert children list to rbtree

Hi,

On Sun, 22 Sep 2013 13:15:59 +0200, Jiri Olsa wrote:
> On Tue, Sep 10, 2013 at 05:24:16PM +0900, Namhyung Kim wrote:
>> From: Namhyung Kim <[email protected]>
>
> SNIP
>
>> index 2b585bc308cf..1b22b6269213 100644
>> --- a/tools/perf/util/callchain.h
>> +++ b/tools/perf/util/callchain.h
>> @@ -21,10 +21,9 @@ enum chain_order {
>>
>> struct callchain_node {
>> struct callchain_node *parent;
>> - struct list_head siblings;
>> - struct list_head children;
>> struct list_head val;
>> struct rb_node rb_node; /* to sort nodes in an rbtree */
>> + struct rb_root rb_root_in; /* sorted tree of children */
>> struct rb_root rb_root; /* sorted tree of children */
>> unsigned int val_nr;
>> u64 hit;
>> @@ -86,8 +85,6 @@ extern __thread struct callchain_cursor callchain_cursor;
>>
>> static inline void callchain_init(struct callchain_root *root)
>> {
>> - INIT_LIST_HEAD(&root->node.siblings);
>> - INIT_LIST_HEAD(&root->node.children);
>> INIT_LIST_HEAD(&root->node.val);
>
> don't we miss rb_root_in init here, like:
>
> root->node.rb_root_in = RB_ROOT;

The struct callchain_root in hist entry initialized to zero so
technically it doesn't need to be set. But I think it's better to have
explicit initializer.

Thanks,
Namhyung

2013-09-23 09:37:04

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf callchain: Convert children list to rbtree

On Mon, Sep 23, 2013 at 06:29:15PM +0900, Namhyung Kim wrote:
> Hi,
>
> On Sun, 22 Sep 2013 13:15:59 +0200, Jiri Olsa wrote:
> > On Tue, Sep 10, 2013 at 05:24:16PM +0900, Namhyung Kim wrote:
> >> From: Namhyung Kim <[email protected]>
> >
> > SNIP
> >
> >> index 2b585bc308cf..1b22b6269213 100644
> >> --- a/tools/perf/util/callchain.h
> >> +++ b/tools/perf/util/callchain.h
> >> @@ -21,10 +21,9 @@ enum chain_order {
> >>
> >> struct callchain_node {
> >> struct callchain_node *parent;
> >> - struct list_head siblings;
> >> - struct list_head children;
> >> struct list_head val;
> >> struct rb_node rb_node; /* to sort nodes in an rbtree */
> >> + struct rb_root rb_root_in; /* sorted tree of children */
> >> struct rb_root rb_root; /* sorted tree of children */
> >> unsigned int val_nr;
> >> u64 hit;
> >> @@ -86,8 +85,6 @@ extern __thread struct callchain_cursor callchain_cursor;
> >>
> >> static inline void callchain_init(struct callchain_root *root)
> >> {
> >> - INIT_LIST_HEAD(&root->node.siblings);
> >> - INIT_LIST_HEAD(&root->node.children);
> >> INIT_LIST_HEAD(&root->node.val);
> >
> > don't we miss rb_root_in init here, like:
> >
> > root->node.rb_root_in = RB_ROOT;
>
> The struct callchain_root in hist entry initialized to zero so
> technically it doesn't need to be set. But I think it's better to have
> explicit initializer.

yep, I was writing some tests for this when I noticed that,
having callchain_root defined on stack

jirka