2015-12-10 07:54:04

by Namhyung Kim

Subject: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

Hello,

This patchset is an attempt to support multi-threading in perf top.
In fact, perf top already runs on two threads - a worker thread and a
display thread. However, processing all samples with a single thread
on a large machine can have scalability problems.

This patchset extends it to have multiple worker threads to process
samples concurrently. Users can control the number of threads using
the --num-thread option. There is also a collector thread for passing
hist entries from the worker threads to the display thread.
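
For example, to run with 4 reader threads (an illustrative
invocation; the option itself is added in patch 15):

  $ perf top --num-thread 4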

This basically follows the same concept as my previous work on perf
report multi-thread support [1]. I decided to work on perf top first,
since it requires smaller changes. If this work finishes with a good
result, I'll apply it to perf report as well, and continue to work on
it. So please test (especially on large machines) and give feedback. :)

The first 6 patches are fixes and cleanups which can be applied
separately. The rest implement multi-thread support and improve it.

You can get it from the 'perf/top-threaded-v1' branch in my tree:

git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Any comments are welcome!
Thanks,
Namhyung


[1] https://lkml.org/lkml/2015/10/2/16


Namhyung Kim (16):
perf top: Delete half-processed hist entries when exit
perf top: Fix and cleanup perf_top__record_precise_ip()
perf top: Factor out warnings about kernel addresses and symbols
perf top: Factor out warnings in perf_top__record_precise_ip()
perf top: Show warning messages in the display thread
perf top: Get rid of access to hists->lock in perf_top__record_precise_ip()
perf hists: Pass hists struct to hist_entry_iter struct
perf tools: Export a couple of hist functions
perf tools: Update hist entry's hists pointer
perf hist: Add events_stats__add() and hists__add_stats()
perf top: Implement basic parallel processing
perf tools: Reduce lock contention when processing events
perf top: Protect the seen list using mutex
perf top: Separate struct perf_top_stats
perf top: Add --num-thread option
perf tools: Skip dso front cache for multi-threaded lookup

tools/perf/builtin-report.c | 1 +
tools/perf/builtin-top.c | 491 +++++++++++++++++++++++++++++---------
tools/perf/tests/hists_cumulate.c | 1 +
tools/perf/tests/hists_filter.c | 1 +
tools/perf/tests/hists_output.c | 1 +
tools/perf/util/event.c | 7 +-
tools/perf/util/hist.c | 98 ++++++--
tools/perf/util/hist.h | 12 +
tools/perf/util/machine.c | 19 +-
tools/perf/util/symbol.c | 3 +-
tools/perf/util/symbol.h | 3 +-
tools/perf/util/top.c | 18 +-
tools/perf/util/top.h | 14 +-
13 files changed, 514 insertions(+), 155 deletions(-)

--
2.6.2


2015-12-10 07:54:08

by Namhyung Kim

Subject: [PATCH/RFC 01/16] perf top: Delete half-processed hist entries when exit

After sample processing is done, hist entries are in both
hists->entries and hists->entries_in (or hists->entries_collapsed),
so I guess perf report does not leak entries on hists.

But for perf top, it's possible to have half-processed entries which
are only in hists->entries_in. Eventually they will go to
hists->entries and get freed, but they cannot be deleted by the
current hists__delete_entries(). This patch adds a
hists__delete_all_entries() function to delete those entries.
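
For reference, entries move roughly like this during processing (a
sketch based on the hists code, not an exact call graph):

  hists->entries_in             # filled while samples are processed
   -> hists->entries_collapsed  # after hists__collapse_resort()
   -> hists->entries            # after hists__output_resort(); freed
                                # by hists__delete_entries()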

Cc: Masami Hiramatsu <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/hist.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 565ea3549894..56e97f5af598 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -270,6 +270,8 @@ static void hists__delete_entry(struct hists *hists, struct hist_entry *he)

if (sort__need_collapse)
rb_erase(&he->rb_node_in, &hists->entries_collapsed);
+ else
+ rb_erase(&he->rb_node_in, hists->entries_in);

--hists->nr_entries;
if (!he->filtered)
@@ -1567,11 +1569,33 @@ static int hists_evsel__init(struct perf_evsel *evsel)
return 0;
}

+static void hists__delete_remaining_entries(struct rb_root *root)
+{
+ struct rb_node *node;
+ struct hist_entry *he;
+
+ while (!RB_EMPTY_ROOT(root)) {
+ node = rb_first(root);
+ rb_erase(node, root);
+
+ he = rb_entry(node, struct hist_entry, rb_node_in);
+ hist_entry__delete(he);
+ }
+}
+
+static void hists__delete_all_entries(struct hists *hists)
+{
+ hists__delete_entries(hists);
+ hists__delete_remaining_entries(&hists->entries_in_array[0]);
+ hists__delete_remaining_entries(&hists->entries_in_array[1]);
+ hists__delete_remaining_entries(&hists->entries_collapsed);
+}
+
static void hists_evsel__exit(struct perf_evsel *evsel)
{
struct hists *hists = evsel__hists(evsel);

- hists__delete_entries(hists);
+ hists__delete_all_entries(hists);
}

/*
--
2.6.2

2015-12-10 07:54:16

by Namhyung Kim

Subject: [PATCH/RFC 02/16] perf top: Fix and cleanup perf_top__record_precise_ip()

First, it has a ui__has_annotation() check that duplicates the
'sort__has_sym' and 'use_browser' checks. In fact, the
ui__has_annotation() check should be removed since annotation is
needed in --stdio mode as well. And top->sym_filter_entry is used
only for stdio mode, so make that clear in the condition too.

Also, the check can be simplified by setting 'sym' before it, and the
check for 'he' can be omitted since the caller
(hist_iter__top_callback) only calls this function when iter->he is
not NULL.

Second, the caller converts the 'ip' argument using map__unmap_ip()
before the call and this function then reverts it using
map__map_ip(). This is pointless and confusing since only one of them
checks the 'map'.

Finally, touch he->hists->lock only if there's an error.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 52 ++++++++++++++++++++----------------------------
1 file changed, 22 insertions(+), 30 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 7e2e72e6d9d1..7cd9bb69f5a6 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -175,42 +175,40 @@ static void perf_top__record_precise_ip(struct perf_top *top,
int counter, u64 ip)
{
struct annotation *notes;
- struct symbol *sym;
+ struct symbol *sym = he->ms.sym;
int err = 0;

- if (he == NULL || he->ms.sym == NULL ||
- ((top->sym_filter_entry == NULL ||
- top->sym_filter_entry->ms.sym != he->ms.sym) && use_browser != 1))
+ if (sym == NULL || (use_browser == 0 &&
+ (top->sym_filter_entry == NULL ||
+ top->sym_filter_entry->ms.sym != sym)))
return;

- sym = he->ms.sym;
notes = symbol__annotation(sym);

if (pthread_mutex_trylock(&notes->lock))
return;

- ip = he->ms.map->map_ip(he->ms.map, ip);
-
- if (ui__has_annotation())
- err = hist_entry__inc_addr_samples(he, counter, ip);
+ err = hist_entry__inc_addr_samples(he, counter, ip);

pthread_mutex_unlock(&notes->lock);

- /*
- * This function is now called with he->hists->lock held.
- * Release it before going to sleep.
- */
- pthread_mutex_unlock(&he->hists->lock);
+ if (unlikely(err)) {
+ /*
+ * This function is now called with hists->lock held.
+ * Release it before going to sleep.
+ */
+ pthread_mutex_unlock(&he->hists->lock);
+
+ if (err == -ERANGE && !he->ms.map->erange_warned)
+ ui__warn_map_erange(he->ms.map, sym, ip);
+ else if (err == -ENOMEM) {
+ pr_err("Not enough memory for annotating '%s' symbol!\n",
+ sym->name);
+ sleep(1);
+ }

- if (err == -ERANGE && !he->ms.map->erange_warned)
- ui__warn_map_erange(he->ms.map, sym, ip);
- else if (err == -ENOMEM) {
- pr_err("Not enough memory for annotating '%s' symbol!\n",
- sym->name);
- sleep(1);
+ pthread_mutex_lock(&he->hists->lock);
}
-
- pthread_mutex_lock(&he->hists->lock);
}

static void perf_top__show_details(struct perf_top *top)
@@ -687,14 +685,8 @@ static int hist_iter__top_callback(struct hist_entry_iter *iter,
struct hist_entry *he = iter->he;
struct perf_evsel *evsel = iter->evsel;

- if (sort__has_sym && single) {
- u64 ip = al->addr;
-
- if (al->map)
- ip = al->map->unmap_ip(al->map, ip);
-
- perf_top__record_precise_ip(top, he, evsel->idx, ip);
- }
+ if (sort__has_sym && single)
+ perf_top__record_precise_ip(top, he, evsel->idx, al->addr);

hist__account_cycles(iter->sample->branch_stack, al, iter->sample,
!(top->record_opts.branch_stack & PERF_SAMPLE_BRANCH_ANY));
--
2.6.2

2015-12-10 07:54:10

by Namhyung Kim

Subject: [PATCH/RFC 03/16] perf top: Factor out warnings about kernel addresses and symbols

Factor out warning messages into separate functions. These will be
called in the display thread later.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 95 ++++++++++++++++++++++++++----------------------
1 file changed, 51 insertions(+), 44 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 7cd9bb69f5a6..e6166ef8fd1a 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -170,6 +170,53 @@ static void ui__warn_map_erange(struct map *map, struct symbol *sym, u64 ip)
map->erange_warned = true;
}

+static void ui__warn_kptr_restrict(struct perf_top *top, struct addr_location *al)
+{
+ if (!top->kptr_restrict_warned) {
+ ui__warning(
+"Kernel address maps (/proc/{kallsyms,modules}) are restricted.\n\n"
+"Check /proc/sys/kernel/kptr_restrict.\n\n"
+"Kernel%s samples will not be resolved.\n",
+ al->map && !RB_EMPTY_ROOT(&al->map->dso->symbols[MAP__FUNCTION]) ?
+ " modules" : "");
+ if (use_browser <= 0)
+ sleep(5);
+ top->kptr_restrict_warned = true;
+ }
+}
+
+static void ui__warn_vmlinux(struct perf_top *top, struct addr_location *al)
+{
+ const char *msg = "Kernel samples will not be resolved.\n";
+ /*
+ * As we do lazy loading of symtabs we only will know if the
+ * specified vmlinux file is invalid when we actually have a
+ * hit in kernel space and then try to load it. So if we get
+ * here and there are _no_ symbols in the DSO backing the
+ * kernel map, bail out.
+ *
+ * We may never get here, for instance, if we use -K/
+ * --hide-kernel-symbols, even if the user specifies an
+ * invalid --vmlinux ;-)
+ */
+ if (!top->kptr_restrict_warned && !top->vmlinux_warned &&
+ RB_EMPTY_ROOT(&al->map->dso->symbols[MAP__FUNCTION])) {
+ if (symbol_conf.vmlinux_name) {
+ char serr[256];
+ dso__strerror_load(al->map->dso, serr, sizeof(serr));
+ ui__warning("The %s file can't be used: %s\n%s",
+ symbol_conf.vmlinux_name, serr, msg);
+ } else {
+ ui__warning("A vmlinux file was not found.\n%s",
+ msg);
+ }
+
+ if (use_browser <= 0)
+ sleep(5);
+ top->vmlinux_warned = true;
+ }
+}
+
static void perf_top__record_precise_ip(struct perf_top *top,
struct hist_entry *he,
int counter, u64 ip)
@@ -729,51 +776,11 @@ static void perf_event__process_sample(struct perf_tool *tool,
if (perf_event__preprocess_sample(event, machine, &al, sample) < 0)
return;

- if (!top->kptr_restrict_warned &&
- symbol_conf.kptr_restrict &&
- al.cpumode == PERF_RECORD_MISC_KERNEL) {
- ui__warning(
-"Kernel address maps (/proc/{kallsyms,modules}) are restricted.\n\n"
-"Check /proc/sys/kernel/kptr_restrict.\n\n"
-"Kernel%s samples will not be resolved.\n",
- al.map && !RB_EMPTY_ROOT(&al.map->dso->symbols[MAP__FUNCTION]) ?
- " modules" : "");
- if (use_browser <= 0)
- sleep(5);
- top->kptr_restrict_warned = true;
- }
-
- if (al.sym == NULL) {
- const char *msg = "Kernel samples will not be resolved.\n";
- /*
- * As we do lazy loading of symtabs we only will know if the
- * specified vmlinux file is invalid when we actually have a
- * hit in kernel space and then try to load it. So if we get
- * here and there are _no_ symbols in the DSO backing the
- * kernel map, bail out.
- *
- * We may never get here, for instance, if we use -K/
- * --hide-kernel-symbols, even if the user specifies an
- * invalid --vmlinux ;-)
- */
- if (!top->kptr_restrict_warned && !top->vmlinux_warned &&
- al.map == machine->vmlinux_maps[MAP__FUNCTION] &&
- RB_EMPTY_ROOT(&al.map->dso->symbols[MAP__FUNCTION])) {
- if (symbol_conf.vmlinux_name) {
- char serr[256];
- dso__strerror_load(al.map->dso, serr, sizeof(serr));
- ui__warning("The %s file can't be used: %s\n%s",
- symbol_conf.vmlinux_name, serr, msg);
- } else {
- ui__warning("A vmlinux file was not found.\n%s",
- msg);
- }
+ if (symbol_conf.kptr_restrict && al.cpumode == PERF_RECORD_MISC_KERNEL)
+ ui__warn_kptr_restrict(top, &al);

- if (use_browser <= 0)
- sleep(5);
- top->vmlinux_warned = true;
- }
- }
+ if (al.sym == NULL && al.map == machine->vmlinux_maps[MAP__FUNCTION])
+ ui__warn_vmlinux(top, &al);

if (al.sym == NULL || !al.sym->ignore) {
struct hists *hists = evsel__hists(evsel);
--
2.6.2

2015-12-10 07:54:06

by Namhyung Kim

Subject: [PATCH/RFC 04/16] perf top: Factor out warnings in perf_top__record_precise_ip()

Currently it warns about two error cases during annotation update.
One is ERANGE, which already has a separate function. Fix that
function to be consistent with the others by checking erange_warned
inside the function and passing 'al' as the argument.

The other case is ENOMEM; make it a separate function as well.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 87 +++++++++++++++++++++++++++++-------------------
tools/perf/util/top.h | 1 +
2 files changed, 53 insertions(+), 35 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index e6166ef8fd1a..7a237719037a 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -143,31 +143,50 @@ static void __zero_source_counters(struct hist_entry *he)
symbol__annotate_zero_histograms(sym);
}

-static void ui__warn_map_erange(struct map *map, struct symbol *sym, u64 ip)
+static void ui__warn_map_erange(struct perf_top *top __maybe_unused,
+ struct addr_location *al)
{
- struct utsname uts;
- int err = uname(&uts);
-
- ui__warning("Out of bounds address found:\n\n"
- "Addr: %" PRIx64 "\n"
- "DSO: %s %c\n"
- "Map: %" PRIx64 "-%" PRIx64 "\n"
- "Symbol: %" PRIx64 "-%" PRIx64 " %c %s\n"
- "Arch: %s\n"
- "Kernel: %s\n"
- "Tools: %s\n\n"
- "Not all samples will be on the annotation output.\n\n"
- "Please report to [email protected]\n",
- ip, map->dso->long_name, dso__symtab_origin(map->dso),
- map->start, map->end, sym->start, sym->end,
- sym->binding == STB_GLOBAL ? 'g' :
- sym->binding == STB_LOCAL ? 'l' : 'w', sym->name,
- err ? "[unknown]" : uts.machine,
- err ? "[unknown]" : uts.release, perf_version_string);
- if (use_browser <= 0)
- sleep(5);
-
- map->erange_warned = true;
+ struct map *map = al->map;
+ struct symbol *sym = al->sym;
+ u64 ip = al->addr;
+
+ if (!map->erange_warned) {
+ struct utsname uts;
+ int err = uname(&uts);
+
+ ui__warning("Out of bounds address found:\n\n"
+ "Addr: %" PRIx64 "\n"
+ "DSO: %s %c\n"
+ "Map: %" PRIx64 "-%" PRIx64 "\n"
+ "Symbol: %" PRIx64 "-%" PRIx64 " %c %s\n"
+ "Arch: %s\n"
+ "Kernel: %s\n"
+ "Tools: %s\n\n"
+ "Not all samples will be on the annotation output.\n\n"
+ "Please report to [email protected]\n",
+ ip, map->dso->long_name, dso__symtab_origin(map->dso),
+ map->start, map->end, sym->start, sym->end,
+ sym->binding == STB_GLOBAL ? 'g' :
+ sym->binding == STB_LOCAL ? 'l' : 'w', sym->name,
+ err ? "[unknown]" : uts.machine,
+ err ? "[unknown]" : uts.release, perf_version_string);
+ if (use_browser <= 0)
+ sleep(5);
+
+ map->erange_warned = true;
+ }
+}
+
+static void ui__warn_enomem(struct perf_top *top, struct addr_location *al)
+{
+ if (!top->enomem_warned) {
+ ui__warning("Not enough memory for annotating '%s' symbol!\n",
+ al->sym->name);
+
+ if (use_browser <= 0)
+ sleep(5);
+ top->enomem_warned = true;
+ }
}

static void ui__warn_kptr_restrict(struct perf_top *top, struct addr_location *al)
@@ -219,10 +238,11 @@ static void ui__warn_vmlinux(struct perf_top *top, struct addr_location *al)

static void perf_top__record_precise_ip(struct perf_top *top,
struct hist_entry *he,
- int counter, u64 ip)
+ struct addr_location *al,
+ int counter)
{
struct annotation *notes;
- struct symbol *sym = he->ms.sym;
+ struct symbol *sym = al->sym;
int err = 0;

if (sym == NULL || (use_browser == 0 &&
@@ -235,7 +255,7 @@ static void perf_top__record_precise_ip(struct perf_top *top,
if (pthread_mutex_trylock(&notes->lock))
return;

- err = hist_entry__inc_addr_samples(he, counter, ip);
+ err = hist_entry__inc_addr_samples(he, counter, al->addr);

pthread_mutex_unlock(&notes->lock);

@@ -246,13 +266,10 @@ static void perf_top__record_precise_ip(struct perf_top *top,
*/
pthread_mutex_unlock(&he->hists->lock);

- if (err == -ERANGE && !he->ms.map->erange_warned)
- ui__warn_map_erange(he->ms.map, sym, ip);
- else if (err == -ENOMEM) {
- pr_err("Not enough memory for annotating '%s' symbol!\n",
- sym->name);
- sleep(1);
- }
+ if (err == -ERANGE)
+ ui__warn_map_erange(top, al);
+ else if (err == -ENOMEM)
+ ui__warn_enomem(top, al);

pthread_mutex_lock(&he->hists->lock);
}
@@ -733,7 +750,7 @@ static int hist_iter__top_callback(struct hist_entry_iter *iter,
struct perf_evsel *evsel = iter->evsel;

if (sort__has_sym && single)
- perf_top__record_precise_ip(top, he, evsel->idx, al->addr);
+ perf_top__record_precise_ip(top, he, al, evsel->idx);

hist__account_cycles(iter->sample->branch_stack, al, iter->sample,
!(top->record_opts.branch_stack & PERF_SAMPLE_BRANCH_ANY));
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index f92c37abb0a8..c56a00cff5b4 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -29,6 +29,7 @@ struct perf_top {
bool use_tui, use_stdio;
bool kptr_restrict_warned;
bool vmlinux_warned;
+ bool enomem_warned;
bool dump_symtab;
struct hist_entry *sym_filter_entry;
struct perf_evsel *sym_evsel;
--
2.6.2

2015-12-10 07:54:12

by Namhyung Kim

Subject: [PATCH/RFC 05/16] perf top: Show warning messages in the display thread

Currently perf top shows a warning dialog while processing samples,
which interrupts the processing. It'd be better if reader threads
just saved the warning information and let the display thread show
the message. Warnings now carry a priority so that higher priority
messages don't get overwritten by lower priority ones.
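
Roughly, the resulting handshake looks like this (a sketch of the API
added below, omitting the locking details):

  /* reader thread: just record the warning */
  perf_top__request_warning(top, &al, WARN_ENOMEM);

  /* display thread: show the highest-priority pending warning */
  perf_top__show_warning(top);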

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 224 ++++++++++++++++++++++++++++++++++-------------
1 file changed, 162 insertions(+), 62 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 7a237719037a..aafcf27c437e 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -143,68 +143,90 @@ static void __zero_source_counters(struct hist_entry *he)
symbol__annotate_zero_histograms(sym);
}

-static void ui__warn_map_erange(struct perf_top *top __maybe_unused,
- struct addr_location *al)
+static int perf_top__check_map_erange(struct perf_top *top __maybe_unused,
+ struct addr_location *al)
+{
+ return al->map->erange_warned;
+}
+
+static void perf_top__warn_map_erange(struct perf_top *top __maybe_unused,
+ struct addr_location *al)
{
struct map *map = al->map;
struct symbol *sym = al->sym;
u64 ip = al->addr;
+ struct utsname uts;
+ int err = uname(&uts);

- if (!map->erange_warned) {
- struct utsname uts;
- int err = uname(&uts);
-
- ui__warning("Out of bounds address found:\n\n"
- "Addr: %" PRIx64 "\n"
- "DSO: %s %c\n"
- "Map: %" PRIx64 "-%" PRIx64 "\n"
- "Symbol: %" PRIx64 "-%" PRIx64 " %c %s\n"
- "Arch: %s\n"
- "Kernel: %s\n"
- "Tools: %s\n\n"
- "Not all samples will be on the annotation output.\n\n"
- "Please report to [email protected]\n",
- ip, map->dso->long_name, dso__symtab_origin(map->dso),
- map->start, map->end, sym->start, sym->end,
- sym->binding == STB_GLOBAL ? 'g' :
- sym->binding == STB_LOCAL ? 'l' : 'w', sym->name,
- err ? "[unknown]" : uts.machine,
- err ? "[unknown]" : uts.release, perf_version_string);
- if (use_browser <= 0)
- sleep(5);
-
- map->erange_warned = true;
+ if (!map || !sym) {
+ ui__warning("Out of bounds address found\n");
+ return;
}
+
+ ui__warning("Out of bounds address found:\n\n"
+ "Addr: %" PRIx64 "\n"
+ "DSO: %s %c\n"
+ "Map: %" PRIx64 "-%" PRIx64 "\n"
+ "Symbol: %" PRIx64 "-%" PRIx64 " %c %s\n"
+ "Arch: %s\n"
+ "Kernel: %s\n"
+ "Tools: %s\n\n"
+ "Not all samples will be on the annotation output.\n\n"
+ "Please report to [email protected]\n",
+ ip, map->dso->long_name, dso__symtab_origin(map->dso),
+ map->start, map->end, sym->start, sym->end,
+ sym->binding == STB_GLOBAL ? 'g' :
+ sym->binding == STB_LOCAL ? 'l' : 'w', sym->name,
+ err ? "[unknown]" : uts.machine,
+ err ? "[unknown]" : uts.release, perf_version_string);
+
+ map->erange_warned = true;
}

-static void ui__warn_enomem(struct perf_top *top, struct addr_location *al)
+static int perf_top__check_enomem(struct perf_top *top,
+ struct addr_location *al __maybe_unused)
{
- if (!top->enomem_warned) {
- ui__warning("Not enough memory for annotating '%s' symbol!\n",
- al->sym->name);
+ return top->enomem_warned;
+}
+static void perf_top__warn_enomem(struct perf_top *top,
+ struct addr_location *al)
+{
+ ui__warning("Not enough memory for annotating '%s' symbol!\n",
+ al->sym ? al->sym->name : "unknown");

- if (use_browser <= 0)
- sleep(5);
- top->enomem_warned = true;
- }
+ top->enomem_warned = true;
+}
+
+static int perf_top__check_kptr_restrict(struct perf_top *top,
+ struct addr_location *al __maybe_unused)
+{
+ return top->kptr_restrict_warned;
}

-static void ui__warn_kptr_restrict(struct perf_top *top, struct addr_location *al)
+static void perf_top__warn_kptr_restrict(struct perf_top *top,
+ struct addr_location *al)
{
- if (!top->kptr_restrict_warned) {
- ui__warning(
+ ui__warning(
"Kernel address maps (/proc/{kallsyms,modules}) are restricted.\n\n"
"Check /proc/sys/kernel/kptr_restrict.\n\n"
"Kernel%s samples will not be resolved.\n",
- al->map && !RB_EMPTY_ROOT(&al->map->dso->symbols[MAP__FUNCTION]) ?
- " modules" : "");
- if (use_browser <= 0)
- sleep(5);
- top->kptr_restrict_warned = true;
- }
+ al->map && !RB_EMPTY_ROOT(&al->map->dso->symbols[MAP__FUNCTION]) ?
+ " modules" : "");
+
+ top->kptr_restrict_warned = true;
+}
+
+static int perf_top__check_vmlinux(struct perf_top *top,
+ struct addr_location *al)
+{
+ if (!RB_EMPTY_ROOT(&al->map->dso->symbols[MAP__FUNCTION]))
+ return true;
+
+ return top->kptr_restrict_warned || top->vmlinux_warned;
}

-static void ui__warn_vmlinux(struct perf_top *top, struct addr_location *al)
+static void perf_top__warn_vmlinux(struct perf_top *top,
+ struct addr_location *al)
{
const char *msg = "Kernel samples will not be resolved.\n";
/*
@@ -218,22 +240,96 @@ static void ui__warn_vmlinux(struct perf_top *top, struct addr_location *al)
* --hide-kernel-symbols, even if the user specifies an
* invalid --vmlinux ;-)
*/
- if (!top->kptr_restrict_warned && !top->vmlinux_warned &&
- RB_EMPTY_ROOT(&al->map->dso->symbols[MAP__FUNCTION])) {
- if (symbol_conf.vmlinux_name) {
- char serr[256];
- dso__strerror_load(al->map->dso, serr, sizeof(serr));
- ui__warning("The %s file can't be used: %s\n%s",
- symbol_conf.vmlinux_name, serr, msg);
- } else {
- ui__warning("A vmlinux file was not found.\n%s",
- msg);
- }
+ if (symbol_conf.vmlinux_name) {
+ char serr[256];
+ dso__strerror_load(al->map->dso, serr, sizeof(serr));
+ ui__warning("The %s file can't be used: %s\n%s",
+ symbol_conf.vmlinux_name, serr, msg);
+ } else {
+ ui__warning("A vmlinux file was not found.\n%s", msg);
+ }
+ top->vmlinux_warned = true;
+}
+
+static struct addr_location warn_al;
+static pthread_mutex_t warn_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+static enum warn_request {
+ WARN_NONE,
+ WARN_ENOMEM,
+ WARN_ERANGE,
+ WARN_VMLINUX,
+ WARN_KPTR_RESTRICT,
+ WARN_MAX,
+} wreq = WARN_NONE;
+
+static struct perf_top__warning {
+ int (*check)(struct perf_top *top, struct addr_location *al);
+ void (*warn)(struct perf_top *top, struct addr_location *al);
+} warnings[] = {
+ { NULL, NULL },
+
+#define W(item) { perf_top__check_ ## item, perf_top__warn_ ## item }
+
+ W(enomem),
+ W(map_erange),
+ W(vmlinux),
+ W(kptr_restrict),
+
+#undef W
+};
+
+static void perf_top__request_warning(struct perf_top *top,
+ struct addr_location *al,
+ enum warn_request req)
+{
+ pthread_mutex_lock(&warn_mutex);
+
+ if (req > wreq && !warnings[req].check(top, al)) {
+ map__zput(warn_al.map);
+
+ wreq = req;
+ warn_al = *al;

- if (use_browser <= 0)
- sleep(5);
- top->vmlinux_warned = true;
+ map__get(warn_al.map);
}
+
+ pthread_mutex_unlock(&warn_mutex);
+}
+
+static void perf_top__show_warning(struct perf_top *top)
+{
+ enum warn_request req;
+ struct addr_location al;
+
+ pthread_mutex_lock(&warn_mutex);
+
+retry:
+ BUG_ON(wreq < 0 || wreq >= WARN_MAX);
+
+ req = wreq;
+ al = warn_al;
+ warn_al.map = NULL;
+
+ pthread_mutex_unlock(&warn_mutex);
+
+ if (req == WARN_NONE)
+ return;
+
+ warnings[req].warn(top, &al);
+ map__put(al.map);
+
+ if (use_browser <= 0)
+ sleep(5);
+
+ pthread_mutex_lock(&warn_mutex);
+
+ if (req < wreq)
+ goto retry;
+
+ wreq = WARN_NONE;
+
+ pthread_mutex_unlock(&warn_mutex);
}

static void perf_top__record_precise_ip(struct perf_top *top,
@@ -267,9 +363,9 @@ static void perf_top__record_precise_ip(struct perf_top *top,
pthread_mutex_unlock(&he->hists->lock);

if (err == -ERANGE)
- ui__warn_map_erange(top, al);
+ perf_top__request_warning(top, al, WARN_ERANGE);
else if (err == -ENOMEM)
- ui__warn_enomem(top, al);
+ perf_top__request_warning(top, al, WARN_ENOMEM);

pthread_mutex_lock(&he->hists->lock);
}
@@ -358,6 +454,8 @@ static void perf_top__print_sym_table(struct perf_top *top)
putchar('\n');
hists__fprintf(hists, false, top->print_entries - printed, win_width,
top->min_percent, stdout);
+
+ perf_top__show_warning(top);
}

static void prompt_integer(int *target, const char *msg)
@@ -624,6 +722,8 @@ static void perf_top__sort_new_samples(void *arg)

hists__collapse_resort(hists, NULL);
hists__output_resort(hists, NULL);
+
+ perf_top__show_warning(t);
}

static void *display_thread_tui(void *arg)
@@ -794,10 +894,10 @@ static void perf_event__process_sample(struct perf_tool *tool,
return;

if (symbol_conf.kptr_restrict && al.cpumode == PERF_RECORD_MISC_KERNEL)
- ui__warn_kptr_restrict(top, &al);
+ perf_top__request_warning(top, &al, WARN_KPTR_RESTRICT);

if (al.sym == NULL && al.map == machine->vmlinux_maps[MAP__FUNCTION])
- ui__warn_vmlinux(top, &al);
+ perf_top__request_warning(top, &al, WARN_VMLINUX);

if (al.sym == NULL || !al.sym->ignore) {
struct hists *hists = evsel__hists(evsel);
--
2.6.2

2015-12-10 07:54:14

by Namhyung Kim

Subject: [PATCH/RFC 06/16] perf top: Get rid of access to hists->lock in perf_top__record_precise_ip()

The hists->lock was released there before going to sleep, but this
function doesn't go to sleep anymore since warnings are now shown in
the display thread.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 18 ++++--------------
1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index aafcf27c437e..24a5434e8a0b 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -355,20 +355,10 @@ static void perf_top__record_precise_ip(struct perf_top *top,

pthread_mutex_unlock(&notes->lock);

- if (unlikely(err)) {
- /*
- * This function is now called with hists->lock held.
- * Release it before going to sleep.
- */
- pthread_mutex_unlock(&he->hists->lock);
-
- if (err == -ERANGE)
- perf_top__request_warning(top, al, WARN_ERANGE);
- else if (err == -ENOMEM)
- perf_top__request_warning(top, al, WARN_ENOMEM);
-
- pthread_mutex_lock(&he->hists->lock);
- }
+ if (err == -ERANGE)
+ perf_top__request_warning(top, al, WARN_ERANGE);
+ else if (err == -ENOMEM)
+ perf_top__request_warning(top, al, WARN_ENOMEM);
}

static void perf_top__show_details(struct perf_top *top)
--
2.6.2

2015-12-10 07:54:42

by Namhyung Kim

Subject: [PATCH/RFC 07/16] perf hists: Pass hists struct to hist_entry_iter struct

This is a preparation for multi-thread support in perf tools. When
multi-thread is enabled, each thread will have its own hists during
sample processing.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-report.c | 1 +
tools/perf/builtin-top.c | 1 +
tools/perf/tests/hists_cumulate.c | 1 +
tools/perf/tests/hists_filter.c | 1 +
tools/perf/tests/hists_output.c | 1 +
tools/perf/util/hist.c | 22 ++++++++--------------
tools/perf/util/hist.h | 1 +
7 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index af5db885ea9c..c681064f2e18 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -144,6 +144,7 @@ static int process_sample_event(struct perf_tool *tool,
struct addr_location al;
struct hist_entry_iter iter = {
.evsel = evsel,
+ .hists = evsel__hists(evsel),
.sample = sample,
.hide_unresolved = symbol_conf.hide_unresolved,
.add_entry_cb = hist_iter__report_callback,
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 24a5434e8a0b..b62665ce5ea6 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -893,6 +893,7 @@ static void perf_event__process_sample(struct perf_tool *tool,
struct hists *hists = evsel__hists(evsel);
struct hist_entry_iter iter = {
.evsel = evsel,
+ .hists = evsel__hists(evsel),
.sample = sample,
.add_entry_cb = hist_iter__top_callback,
};
diff --git a/tools/perf/tests/hists_cumulate.c b/tools/perf/tests/hists_cumulate.c
index 8292948bc5f9..fd0d5d955612 100644
--- a/tools/perf/tests/hists_cumulate.c
+++ b/tools/perf/tests/hists_cumulate.c
@@ -88,6 +88,7 @@ static int add_hist_entries(struct hists *hists, struct machine *machine)
};
struct hist_entry_iter iter = {
.evsel = evsel,
+ .hists = evsel__hists(evsel),
.sample = &sample,
.hide_unresolved = false,
};
diff --git a/tools/perf/tests/hists_filter.c b/tools/perf/tests/hists_filter.c
index ccb5b4921f25..c57ed8755e34 100644
--- a/tools/perf/tests/hists_filter.c
+++ b/tools/perf/tests/hists_filter.c
@@ -65,6 +65,7 @@ static int add_hist_entries(struct perf_evlist *evlist,
};
struct hist_entry_iter iter = {
.evsel = evsel,
+ .hists = evsel__hists(evsel),
.sample = &sample,
.ops = &hist_iter_normal,
.hide_unresolved = false,
diff --git a/tools/perf/tests/hists_output.c b/tools/perf/tests/hists_output.c
index 248beec1d917..86e9d374cd7f 100644
--- a/tools/perf/tests/hists_output.c
+++ b/tools/perf/tests/hists_output.c
@@ -58,6 +58,7 @@ static int add_hist_entries(struct hists *hists, struct machine *machine)
};
struct hist_entry_iter iter = {
.evsel = evsel,
+ .hists = evsel__hists(evsel),
.sample = &sample,
.ops = &hist_iter_normal,
.hide_unresolved = false,
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 56e97f5af598..d534ed76164b 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -525,7 +525,7 @@ iter_add_single_mem_entry(struct hist_entry_iter *iter, struct addr_location *al
{
u64 cost;
struct mem_info *mi = iter->priv;
- struct hists *hists = evsel__hists(iter->evsel);
+ struct hists *hists = iter->hists;
struct hist_entry *he;

if (mi == NULL)
@@ -555,8 +555,7 @@ static int
iter_finish_mem_entry(struct hist_entry_iter *iter,
struct addr_location *al __maybe_unused)
{
- struct perf_evsel *evsel = iter->evsel;
- struct hists *hists = evsel__hists(evsel);
+ struct hists *hists = iter->hists;
struct hist_entry *he = iter->he;
int err = -EINVAL;

@@ -628,8 +627,7 @@ static int
iter_add_next_branch_entry(struct hist_entry_iter *iter, struct addr_location *al)
{
struct branch_info *bi;
- struct perf_evsel *evsel = iter->evsel;
- struct hists *hists = evsel__hists(evsel);
+ struct hists *hists = iter->hists;
struct hist_entry *he = NULL;
int i = iter->curr;
int err = 0;
@@ -677,11 +675,10 @@ iter_prepare_normal_entry(struct hist_entry_iter *iter __maybe_unused,
static int
iter_add_single_normal_entry(struct hist_entry_iter *iter, struct addr_location *al)
{
- struct perf_evsel *evsel = iter->evsel;
struct perf_sample *sample = iter->sample;
struct hist_entry *he;

- he = __hists__add_entry(evsel__hists(evsel), al, iter->parent, NULL, NULL,
+ he = __hists__add_entry(iter->hists, al, iter->parent, NULL, NULL,
sample->period, sample->weight,
sample->transaction, true);
if (he == NULL)
@@ -696,7 +693,6 @@ iter_finish_normal_entry(struct hist_entry_iter *iter,
struct addr_location *al __maybe_unused)
{
struct hist_entry *he = iter->he;
- struct perf_evsel *evsel = iter->evsel;
struct perf_sample *sample = iter->sample;

if (he == NULL)
@@ -704,7 +700,7 @@ iter_finish_normal_entry(struct hist_entry_iter *iter,

iter->he = NULL;

- hists__inc_nr_samples(evsel__hists(evsel), he->filtered);
+ hists__inc_nr_samples(iter->hists, he->filtered);

return hist_entry__append_callchain(he, sample);
}
@@ -736,8 +732,7 @@ static int
iter_add_single_cumulative_entry(struct hist_entry_iter *iter,
struct addr_location *al)
{
- struct perf_evsel *evsel = iter->evsel;
- struct hists *hists = evsel__hists(evsel);
+ struct hists *hists = iter->hists;
struct perf_sample *sample = iter->sample;
struct hist_entry **he_cache = iter->priv;
struct hist_entry *he;
@@ -782,12 +777,11 @@ static int
iter_add_next_cumulative_entry(struct hist_entry_iter *iter,
struct addr_location *al)
{
- struct perf_evsel *evsel = iter->evsel;
struct perf_sample *sample = iter->sample;
struct hist_entry **he_cache = iter->priv;
struct hist_entry *he;
struct hist_entry he_tmp = {
- .hists = evsel__hists(evsel),
+ .hists = iter->hists,
.cpu = al->cpu,
.thread = al->thread,
.comm = thread__comm(al->thread),
@@ -817,7 +811,7 @@ iter_add_next_cumulative_entry(struct hist_entry_iter *iter,
}
}

- he = __hists__add_entry(evsel__hists(evsel), al, iter->parent, NULL, NULL,
+ he = __hists__add_entry(iter->hists, al, iter->parent, NULL, NULL,
sample->period, sample->weight,
sample->transaction, false);
if (he == NULL)
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index a48a2078d288..47ae7e6b1de8 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -93,6 +93,7 @@ struct hist_entry_iter {
bool hide_unresolved;
int max_stack;

+ struct hists *hists;
struct perf_evsel *evsel;
struct perf_sample *sample;
struct hist_entry *he;
--
2.6.2

2015-12-10 07:54:18

by Namhyung Kim

Subject: [PATCH/RFC 08/16] perf tools: Export a couple of hist functions

These are necessary for multi-threaded sample processing:

- hists__get_rotate_entries_in()
- hists__collapse_insert_entry()
- __hists__init()

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/hist.c | 19 ++++++++++++-------
tools/perf/util/hist.h | 5 +++++
2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index d534ed76164b..ea4f3ad978b0 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -972,9 +972,8 @@ void hist_entry__delete(struct hist_entry *he)
* collapse the histogram
*/

-static bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
- struct rb_root *root,
- struct hist_entry *he)
+bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
+ struct rb_root *root, struct hist_entry *he)
{
struct rb_node **p = &root->rb_node;
struct rb_node *parent = NULL;
@@ -1014,7 +1013,7 @@ static bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
return true;
}

-static struct rb_root *hists__get_rotate_entries_in(struct hists *hists)
+struct rb_root *hists__get_rotate_entries_in(struct hists *hists)
{
struct rb_root *root;

@@ -1549,10 +1548,8 @@ int perf_hist_config(const char *var, const char *value)
return 0;
}

-static int hists_evsel__init(struct perf_evsel *evsel)
+int __hists__init(struct hists *hists)
{
- struct hists *hists = evsel__hists(evsel);
-
memset(hists, 0, sizeof(*hists));
hists->entries_in_array[0] = hists->entries_in_array[1] = RB_ROOT;
hists->entries_in = &hists->entries_in_array[0];
@@ -1592,6 +1589,14 @@ static void hists_evsel__exit(struct perf_evsel *evsel)
hists__delete_all_entries(hists);
}

+static int hists_evsel__init(struct perf_evsel *evsel)
+{
+ struct hists *hists = evsel__hists(evsel);
+
+ __hists__init(hists);
+ return 0;
+}
+
/*
* XXX We probably need a hists_evsel__exit() to free the hist_entries
* stored in the rbtree...
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 47ae7e6b1de8..670720ef8acd 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -185,6 +185,11 @@ static inline struct hists *evsel__hists(struct perf_evsel *evsel)
}

int hists__init(void);
+int __hists__init(struct hists *hists);
+
+struct rb_root *hists__get_rotate_entries_in(struct hists *hists);
+bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
+ struct rb_root *root, struct hist_entry *he);

struct perf_hpp {
char *buf;
--
2.6.2

2015-12-10 07:54:24

by Namhyung Kim

Subject: [PATCH/RFC 09/16] perf tools: Update hist entry's hists pointer

When samples are processed by multiple threads, each sample is
gathered on its thread's local hist tree and then merged into the
real hist tree. But the hist_entry->hists pointer was not updated, so
it could refer to the wrong hists, resulting in missing output.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/hist.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index ea4f3ad978b0..a12e5022fe04 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -1008,6 +1008,14 @@ bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
}
hists->nr_entries++;

+ /*
+ * If a hist entry is processed in multi-threaded environment,
+ * it points to a dummy local hists which was used only for
+ * intermediate processing. So update it to a real one so that
+ * it can find the correct info later.
+ */
+ he->hists = hists;
+
rb_link_node(&he->rb_node_in, parent, p);
rb_insert_color(&he->rb_node_in, root);
return true;
--
2.6.2

2015-12-10 07:54:21

by Namhyung Kim

Subject: [PATCH/RFC 10/16] perf hist: Add events_stats__add() and hists__add_stats()

These two functions update event and hists stats. They'll be used by
multiple threads to merge their local stats in a later patch.
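
For instance, the intended use is something like this (a sketch; the
real call site appears in the parallel-processing patch, and
'local_hists' is a placeholder name):

  /* fold a worker's local stats into the evsel's hists */
  hists__add_stats(evsel__hists(evsel), &local_hists);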

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/hist.c | 23 +++++++++++++++++++++++
tools/perf/util/hist.h | 6 ++++++
2 files changed, 29 insertions(+)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index a12e5022fe04..08396a7fea23 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -1347,6 +1347,29 @@ void events_stats__inc(struct events_stats *stats, u32 type)
++stats->nr_events[type];
}

+void events_stats__add(struct events_stats *dst, struct events_stats *src)
+{
+ int i;
+
+#define ADD(_field) dst->_field += src->_field
+
+ ADD(total_period);
+ ADD(total_non_filtered_period);
+ ADD(total_lost);
+ ADD(total_invalid_chains);
+ ADD(nr_non_filtered_samples);
+ ADD(nr_lost_warned);
+ ADD(nr_unknown_events);
+ ADD(nr_invalid_chains);
+ ADD(nr_unknown_id);
+ ADD(nr_unprocessable_samples);
+
+ for (i = 0; i < PERF_RECORD_HEADER_MAX; i++)
+ ADD(nr_events[i]);
+
+#undef ADD
+}
+
void hists__inc_nr_events(struct hists *hists, u32 type)
{
events_stats__inc(&hists->stats, type);
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 670720ef8acd..725afce73738 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -141,8 +141,14 @@ void hists__inc_stats(struct hists *hists, struct hist_entry *h);
void hists__inc_nr_events(struct hists *hists, u32 type);
void hists__inc_nr_samples(struct hists *hists, bool filtered);
void events_stats__inc(struct events_stats *stats, u32 type);
+void events_stats__add(struct events_stats *dst, struct events_stats *src);
size_t events_stats__fprintf(struct events_stats *stats, FILE *fp);

+static inline void hists__add_stats(struct hists *dst, struct hists *src)
+{
+ events_stats__add(&dst->stats, &src->stats);
+}
+
size_t hists__fprintf(struct hists *hists, bool show_header, int max_rows,
int max_cols, float min_pcnt, FILE *fp);
size_t perf_evlist__fprintf_nr_events(struct perf_evlist *evlist, FILE *fp);
--
2.6.2

2015-12-10 07:54:33

by Namhyung Kim

Subject: [PATCH/RFC 11/16] perf top: Implement basic parallel processing

This patch changes perf top to process event samples with multiple
threads. For now, each mmap is read and processed into its own hists
by a dedicated reader thread, in parallel. A single collector thread
then gathers the hist entries and moves them to the evsel's hists
tree. As usual, a single UI thread displays them.
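
Roughly, the thread layout looks like this (an illustrative sketch):

  mmap[i] -> reader thread i -> per-mmap local hists
  local hists -> collector thread -> evsel__hists(evsel)
  evsel hists -> UI thread (display)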

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 172 ++++++++++++++++++++++++++++++++++++++---------
1 file changed, 141 insertions(+), 31 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index b62665ce5ea6..a9b7461be4f0 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -831,6 +831,57 @@ static int symbol_filter(struct map *map, struct symbol *sym)
return 0;
}

+struct collector_arg {
+ struct perf_top *top;
+ struct hists *hists;
+};
+
+static void collect_hists(struct perf_top *top, struct hists *hists)
+{
+ int i, k;
+ struct perf_evsel *evsel;
+
+ for (i = 0, k = 0; i < top->evlist->nr_mmaps; i++) {
+ evlist__for_each(top->evlist, evsel) {
+ struct hists *src_hists = &hists[k++];
+ struct hists *dst_hists = evsel__hists(evsel);
+ struct hist_entry *he;
+ struct rb_root *root;
+ struct rb_node *next;
+
+ root = hists__get_rotate_entries_in(src_hists);
+ next = rb_first(root);
+
+ while (next) {
+ if (session_done())
+ return;
+ he = rb_entry(next, struct hist_entry, rb_node_in);
+ next = rb_next(next);
+
+ rb_erase(&he->rb_node_in, root);
+
+ pthread_mutex_lock(&dst_hists->lock);
+ hists__collapse_insert_entry(dst_hists,
+ dst_hists->entries_in, he);
+ pthread_mutex_unlock(&dst_hists->lock);
+ }
+ hists__add_stats(dst_hists, src_hists);
+ }
+ }
+}
+
+static void *collect_worker(void *arg)
+{
+ struct collector_arg *carg = arg;
+
+ while (!done) {
+ collect_hists(carg->top, carg->hists);
+ poll(NULL, 0, 100);
+ }
+
+ return NULL;
+}
+
static int hist_iter__top_callback(struct hist_entry_iter *iter,
struct addr_location *al, bool single,
void *arg)
@@ -847,13 +898,19 @@ static int hist_iter__top_callback(struct hist_entry_iter *iter,
return 0;
}

-static void perf_event__process_sample(struct perf_tool *tool,
+struct reader_arg {
+ int idx;
+ struct perf_top *top;
+ struct hists *hists;
+};
+
+static void perf_event__process_sample(struct reader_arg *rarg,
const union perf_event *event,
struct perf_evsel *evsel,
struct perf_sample *sample,
struct machine *machine)
{
- struct perf_top *top = container_of(tool, struct perf_top, tool);
+ struct perf_top *top = rarg->top;
struct addr_location al;
int err;

@@ -890,10 +947,10 @@ static void perf_event__process_sample(struct perf_tool *tool,
perf_top__request_warning(top, &al, WARN_VMLINUX);

if (al.sym == NULL || !al.sym->ignore) {
- struct hists *hists = evsel__hists(evsel);
+ struct hists* hists = &rarg->hists[evsel->idx];
struct hist_entry_iter iter = {
.evsel = evsel,
- .hists = evsel__hists(evsel),
+ .hists = hists,
.sample = sample,
.add_entry_cb = hist_iter__top_callback,
};
@@ -915,13 +972,15 @@ static void perf_event__process_sample(struct perf_tool *tool,
addr_location__put(&al);
}

-static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
+static void perf_top__mmap_read(struct reader_arg *rarg)
{
struct perf_sample sample;
struct perf_evsel *evsel;
+ struct perf_top *top = rarg->top;
struct perf_session *session = top->session;
union perf_event *event;
struct machine *machine;
+ int idx = rarg->idx;
u8 origin;
int ret;

@@ -974,10 +1033,11 @@ static void perf_top__mmap_read_idx(struct perf_top *top, int idx)


if (event->header.type == PERF_RECORD_SAMPLE) {
- perf_event__process_sample(&top->tool, event, evsel,
+ perf_event__process_sample(rarg, event, evsel,
&sample, machine);
} else if (event->header.type < PERF_RECORD_MAX) {
- hists__inc_nr_events(evsel__hists(evsel), event->header.type);
+ hists__inc_nr_events(&rarg->hists[evsel->idx],
+ event->header.type);
machine__process_event(machine, event, &sample);
} else
++session->evlist->stats.nr_unknown_events;
@@ -986,12 +1046,30 @@ static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
}
}

-static void perf_top__mmap_read(struct perf_top *top)
+static void *mmap_read_worker(void *arg)
{
- int i;
+ struct reader_arg *rarg = arg;
+ struct perf_top *top = rarg->top;
+
+ if (top->realtime_prio) {
+ struct sched_param param;
+
+ param.sched_priority = top->realtime_prio;
+ if (sched_setscheduler(0, SCHED_FIFO, &param)) {
+ ui__error("Could not set realtime priority.\n");
+ return NULL;
+ }
+ }
+
+ while (!done) {
+ u64 hits = top->samples;

- for (i = 0; i < top->evlist->nr_mmaps; i++)
- perf_top__mmap_read_idx(top, i);
+ perf_top__mmap_read(rarg);
+
+ if (hits == top->samples)
+ perf_evlist__poll(top->evlist, 100);
+ }
+ return NULL;
}

static int perf_top__start_counters(struct perf_top *top)
@@ -1052,8 +1130,14 @@ static int perf_top__setup_sample_type(struct perf_top *top __maybe_unused)
static int __cmd_top(struct perf_top *top)
{
struct record_opts *opts = &top->record_opts;
- pthread_t thread;
+ pthread_t *readers = NULL;
+ pthread_t collector = (pthread_t) 0;
+ pthread_t ui_thread = (pthread_t) 0;
+ struct hists *hists = NULL;
+ struct reader_arg *rargs = NULL;
+ struct collector_arg carg;
int ret;
+ int i;

top->session = perf_session__new(NULL, false, NULL);
if (top->session == NULL)
@@ -1104,37 +1188,63 @@ static int __cmd_top(struct perf_top *top)
/* Wait for a minimal set of events before starting the snapshot */
perf_evlist__poll(top->evlist, 100);

- perf_top__mmap_read(top);
-
ret = -1;
- if (pthread_create(&thread, NULL, (use_browser > 0 ? display_thread_tui :
- display_thread), top)) {
- ui__error("Could not create display thread.\n");
+ readers = calloc(sizeof(pthread_t), top->evlist->nr_mmaps);
+ if (readers == NULL)
goto out_delete;
- }

- if (top->realtime_prio) {
- struct sched_param param;
+ rargs = calloc(sizeof(*rargs), top->evlist->nr_mmaps);
+ if (rargs == NULL)
+ goto out_free;

- param.sched_priority = top->realtime_prio;
- if (sched_setscheduler(0, SCHED_FIFO, &param)) {
- ui__error("Could not set realtime priority.\n");
- goto out_join;
- }
+ hists = calloc(sizeof(*hists), top->evlist->nr_mmaps * top->evlist->nr_entries);
+ if (hists == NULL)
+ goto out_free;
+
+ for (i = 0; i < top->evlist->nr_mmaps * top->evlist->nr_entries; i++)
+ __hists__init(&hists[i]);
+
+ for (i = 0; i < top->evlist->nr_mmaps; i++) {
+ struct reader_arg *rarg = &rargs[i];
+
+ rarg->idx = i;
+ rarg->top = top;
+ rarg->hists = &hists[i * top->evlist->nr_entries];
+
+ perf_top__mmap_read(rarg);
}
+ collect_hists(top, hists);

- while (!done) {
- u64 hits = top->samples;
+ for (i = 0; i < top->evlist->nr_mmaps; i++) {
+ if (pthread_create(&readers[i], NULL, mmap_read_worker, &rargs[i]))
+ goto out_join;
+ }

- perf_top__mmap_read(top);
+ carg.top = top;
+ carg.hists = hists;
+ if (pthread_create(&collector, NULL, collect_worker, &carg))
+ goto out_join;

- if (hits == top->samples)
- ret = perf_evlist__poll(top->evlist, 100);
+ if (pthread_create(&ui_thread, NULL, (use_browser > 0 ? display_thread_tui :
+ display_thread), top)) {
+ ui__error("Could not create display thread.\n");
+ goto out_join;
}

ret = 0;
+
out_join:
- pthread_join(thread, NULL);
+ pthread_join(ui_thread, NULL);
+ pthread_join(collector, NULL);
+ for (i = 0; i < top->evlist->nr_mmaps; i++) {
+ pthread_join(readers[i], NULL);
+ }
+
+out_free:
+ free(hists);
+ free(rargs);
+ free(readers);
+
out_delete:
perf_session__delete(top->session);
top->session = NULL;
--
2.6.2

2015-12-10 07:54:27

by Namhyung Kim

Subject: [PATCH/RFC 12/16] perf tools: Reduce lock contention when processing events

When multi-thread is enabled, the machine->threads_lock is contended
as all worker threads try to grab the writer lock via
machine__findnew_thread(). Usually the thread they're looking for is
already in the tree, though, so the reader lock is enough.

Thus try machine__find_thread() first, and fall back to the 'findnew'
API only when that fails. This improves performance.
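
The resulting pattern at each call site is (a sketch, mirroring the
hunks below):

  /* fast path: the lookup only needs the reader lock */
  thread = machine__find_thread(machine, pid, tid);
  if (thread == NULL) {
          /* slow path: may create the thread, takes the writer lock */
          thread = machine__findnew_thread(machine, pid, tid);
  }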

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/event.c | 7 +++++--
tools/perf/util/machine.c | 19 ++++++++++++++++---
2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 8b10621b415c..cefd43f8ec66 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -993,10 +993,13 @@ int perf_event__preprocess_sample(const union perf_event *event,
struct perf_sample *sample)
{
u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
- struct thread *thread = machine__findnew_thread(machine, sample->pid,
- sample->tid);
+ struct thread *thread = machine__find_thread(machine, sample->pid,
+ sample->tid);

if (thread == NULL)
+ thread = machine__findnew_thread(machine, sample->pid,
+ sample->tid);
+ if (thread == NULL)
return -1;

dump_printf(" ... thread: %s:%d\n", thread__comm_str(thread), thread->tid);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index f5882b8c8db9..6d7349501f26 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -467,7 +467,7 @@ struct comm *machine__thread_exec_comm(struct machine *machine,
int machine__process_comm_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample)
{
- struct thread *thread = machine__findnew_thread(machine,
+ struct thread *thread = machine__find_thread(machine,
event->comm.pid,
event->comm.tid);
bool exec = event->header.misc & PERF_RECORD_MISC_COMM_EXEC;
@@ -479,6 +479,10 @@ int machine__process_comm_event(struct machine *machine, union perf_event *event
if (dump_trace)
perf_event__fprintf_comm(event, stdout);

+ if (thread == NULL) {
+ thread = machine__findnew_thread(machine, event->comm.pid,
+ event->comm.tid);
+ }
if (thread == NULL ||
__thread__set_comm(thread, event->comm.comm, sample->time, exec)) {
dump_printf("problem processing PERF_RECORD_COMM, skipping event.\n");
@@ -1314,8 +1318,13 @@ int machine__process_mmap2_event(struct machine *machine,
return 0;
}

- thread = machine__findnew_thread(machine, event->mmap2.pid,
+ thread = machine__find_thread(machine, event->mmap2.pid,
event->mmap2.tid);
+ if (thread == NULL) {
+ thread = machine__findnew_thread(machine,
+ event->mmap2.pid,
+ event->mmap2.tid);
+ }
if (thread == NULL)
goto out_problem;

@@ -1368,8 +1377,12 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
return 0;
}

- thread = machine__findnew_thread(machine, event->mmap.pid,
+ thread = machine__find_thread(machine, event->mmap.pid,
event->mmap.tid);
+ if (thread == NULL) {
+ thread = machine__findnew_thread(machine, event->mmap.pid,
+ event->mmap.tid);
+ }
if (thread == NULL)
goto out_problem;

--
2.6.2

2015-12-10 07:54:30

by Namhyung Kim

Subject: [PATCH/RFC 13/16] perf top: Protect the seen list using mutex

When perf doesn't find a machine, it records the pid in the 'seen'
list. With multi-thread enabled, that list should be protected by a
mutex.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index a9b7461be4f0..5987986b5203 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -916,7 +916,9 @@ static void perf_event__process_sample(struct reader_arg *rarg,

if (!machine && perf_guest) {
static struct intlist *seen;
+ static pthread_mutex_t seen_lock = PTHREAD_MUTEX_INITIALIZER;

+ pthread_mutex_lock(&seen_lock);
if (!seen)
seen = intlist__new(NULL);

@@ -925,6 +927,7 @@ static void perf_event__process_sample(struct reader_arg *rarg,
sample->pid);
intlist__add(seen, sample->pid);
}
+ pthread_mutex_unlock(&seen_lock);
return;
}

--
2.6.2

2015-12-10 07:54:36

by Namhyung Kim

Subject: [PATCH/RFC 14/16] perf top: Separate struct perf_top_stats

perf top maintains various stats regarding samples. Separate those
stats out so that they can be updated concurrently.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 36 ++++++++++++++++++++++++++++--------
tools/perf/util/top.c | 18 ++++++++----------
tools/perf/util/top.h | 12 ++++++++----
3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 5987986b5203..f3ab46b234b6 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -902,8 +902,26 @@ struct reader_arg {
int idx;
struct perf_top *top;
struct hists *hists;
+ struct perf_top_stats stats;
};

+static void perf_top_stats__add(struct perf_top_stats *dst,
+ struct perf_top_stats *src)
+{
+ static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
+
+ pthread_mutex_lock(&stats_lock);
+
+ dst->samples += src->samples;
+ dst->exact_samples += src->exact_samples;
+ dst->kernel_samples += src->kernel_samples;
+ dst->us_samples += src->us_samples;
+ dst->guest_kernel_samples += src->guest_kernel_samples;
+ dst->guest_us_samples += src->guest_us_samples;
+
+ pthread_mutex_unlock(&stats_lock);
+}
+
static void perf_event__process_sample(struct reader_arg *rarg,
const union perf_event *event,
struct perf_evsel *evsel,
@@ -938,7 +956,7 @@ static void perf_event__process_sample(struct reader_arg *rarg,
}

if (event->header.misc & PERF_RECORD_MISC_EXACT_IP)
- top->exact_samples++;
+ rarg->stats.exact_samples++;

if (perf_event__preprocess_sample(event, machine, &al, sample) < 0)
return;
@@ -1000,28 +1018,28 @@ static void perf_top__mmap_read(struct reader_arg *rarg)
origin = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;

if (event->header.type == PERF_RECORD_SAMPLE)
- ++top->samples;
+ ++rarg->stats.samples;

switch (origin) {
case PERF_RECORD_MISC_USER:
- ++top->us_samples;
+ ++rarg->stats.us_samples;
if (top->hide_user_symbols)
goto next_event;
machine = &session->machines.host;
break;
case PERF_RECORD_MISC_KERNEL:
- ++top->kernel_samples;
+ ++rarg->stats.kernel_samples;
if (top->hide_kernel_symbols)
goto next_event;
machine = &session->machines.host;
break;
case PERF_RECORD_MISC_GUEST_KERNEL:
- ++top->guest_kernel_samples;
+ ++rarg->stats.guest_kernel_samples;
machine = perf_session__find_machine(session,
sample.pid);
break;
case PERF_RECORD_MISC_GUEST_USER:
- ++top->guest_us_samples;
+ ++rarg->stats.guest_us_samples;
/*
* TODO: we don't process guest user from host side
* except simple counting.
@@ -1065,12 +1083,14 @@ static void *mmap_read_worker(void *arg)
}

while (!done) {
- u64 hits = top->samples;
+ u64 hits = rarg->stats.samples;

perf_top__mmap_read(rarg);

- if (hits == top->samples)
+ if (hits == rarg->stats.samples)
perf_evlist__poll(top->evlist, 100);
+ else
+ perf_top_stats__add(&top->stats, &rarg->stats);
}
return NULL;
}
diff --git a/tools/perf/util/top.c b/tools/perf/util/top.c
index 8e517def925b..95d6bba1a2a0 100644
--- a/tools/perf/util/top.c
+++ b/tools/perf/util/top.c
@@ -30,10 +30,10 @@ size_t perf_top__header_snprintf(struct perf_top *top, char *bf, size_t size)
struct target *target = &opts->target;
size_t ret = 0;

- if (top->samples) {
- samples_per_sec = top->samples / top->delay_secs;
- ksamples_per_sec = top->kernel_samples / top->delay_secs;
- esamples_percent = (100.0 * top->exact_samples) / top->samples;
+ if (top->stats.samples) {
+ samples_per_sec = top->stats.samples / top->delay_secs;
+ ksamples_per_sec = top->stats.kernel_samples / top->delay_secs;
+ esamples_percent = (100.0 * top->stats.exact_samples) / top->stats.samples;
} else {
samples_per_sec = ksamples_per_sec = esamples_percent = 0.0;
}
@@ -49,9 +49,9 @@ size_t perf_top__header_snprintf(struct perf_top *top, char *bf, size_t size)
" exact: %4.1f%% [", samples_per_sec,
ksamples_percent, esamples_percent);
} else {
- float us_samples_per_sec = top->us_samples / top->delay_secs;
- float guest_kernel_samples_per_sec = top->guest_kernel_samples / top->delay_secs;
- float guest_us_samples_per_sec = top->guest_us_samples / top->delay_secs;
+ float us_samples_per_sec = top->stats.us_samples / top->delay_secs;
+ float guest_kernel_samples_per_sec = top->stats.guest_kernel_samples / top->delay_secs;
+ float guest_us_samples_per_sec = top->stats.guest_us_samples / top->delay_secs;

ret = SNPRINTF(bf, size,
" PerfTop:%8.0f irqs/sec kernel:%4.1f%% us:%4.1f%%"
@@ -111,7 +111,5 @@ size_t perf_top__header_snprintf(struct perf_top *top, char *bf, size_t size)

void perf_top__reset_sample_counters(struct perf_top *top)
{
- top->samples = top->us_samples = top->kernel_samples =
- top->exact_samples = top->guest_kernel_samples =
- top->guest_us_samples = 0;
+ memset(&top->stats, 0, sizeof(top->stats));
}
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index c56a00cff5b4..55eb5aebae59 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -11,18 +11,22 @@ struct perf_evlist;
struct perf_evsel;
struct perf_session;

+struct perf_top_stats {
+ u64 samples;
+ u64 exact_samples;
+ u64 kernel_samples, us_samples;
+ u64 guest_kernel_samples, guest_us_samples;
+};
+
struct perf_top {
struct perf_tool tool;
struct perf_evlist *evlist;
struct record_opts record_opts;
+ struct perf_top_stats stats;
/*
* Symbols will be added here in perf_event__process_sample and will
* get out after decayed.
*/
- u64 samples;
- u64 kernel_samples, us_samples;
- u64 exact_samples;
- u64 guest_us_samples, guest_kernel_samples;
int print_entries, count_filter, delay_secs;
int max_stack;
bool hide_kernel_symbols, hide_user_symbols, zero;
--
2.6.2

2015-12-10 07:54:39

by Namhyung Kim

[permalink] [raw]
Subject: [PATCH/RFC 15/16] perf top: Add --num-thread option

The --num-thread option sets the number of reader threads. The
default value is 0, which is converted to 1/4 of the number of mmaps.
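
As a rough standalone sketch of how the mmap buffers then get split
across the reader threads (example numbers only; the patch below
computes the same split from top->evlist->nr_mmaps and
top->nr_threads):

#include <stdio.h>

int main(void)
{
	int nr_mmaps = 10, nr_threads = 4;	/* example values */
	int nr_idx = nr_mmaps / nr_threads;	/* base share per thread */
	int rem = nr_mmaps % nr_threads;	/* leftover buffers */
	int i, idx = 0;

	for (i = 0; i < nr_threads; i++) {
		/* the first 'rem' threads take one extra buffer */
		int nr = nr_idx + (rem-- > 0 ? 1 : 0);

		printf("thread %d reads mmaps [%d..%d)\n", i, idx, idx + nr);
		idx += nr;
	}
	return 0;
}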

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 49 +++++++++++++++++++++++++++++++++++-------------
tools/perf/util/top.h | 1 +
2 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index f3ab46b234b6..fc9715b046b3 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -838,10 +838,10 @@ struct collector_arg {

static void collect_hists(struct perf_top *top, struct hists *hists)
{
- int i, k;
+ unsigned int i, k;
struct perf_evsel *evsel;

- for (i = 0, k = 0; i < top->evlist->nr_mmaps; i++) {
+ for (i = 0, k = 0; i < top->nr_threads; i++) {
evlist__for_each(top->evlist, evsel) {
struct hists *src_hists = &hists[k++];
struct hists *dst_hists = evsel__hists(evsel);
@@ -900,6 +900,7 @@ static int hist_iter__top_callback(struct hist_entry_iter *iter,

struct reader_arg {
int idx;
+ int nr_idx;
struct perf_top *top;
struct hists *hists;
struct perf_top_stats stats;
@@ -993,7 +994,7 @@ static void perf_event__process_sample(struct reader_arg *rarg,
addr_location__put(&al);
}

-static void perf_top__mmap_read(struct reader_arg *rarg)
+static void perf_top__mmap_read_idx(struct reader_arg *rarg, int idx)
{
struct perf_sample sample;
struct perf_evsel *evsel;
@@ -1001,7 +1002,6 @@ static void perf_top__mmap_read(struct reader_arg *rarg)
struct perf_session *session = top->session;
union perf_event *event;
struct machine *machine;
- int idx = rarg->idx;
u8 origin;
int ret;

@@ -1067,6 +1067,14 @@ static void perf_top__mmap_read(struct reader_arg *rarg)
}
}

+static void perf_top__mmap_read(struct reader_arg *rarg)
+{
+ int i;
+
+ for (i = 0; i < rarg->nr_idx; i++)
+ perf_top__mmap_read_idx(rarg, rarg->idx + i);
+}
+
static void *mmap_read_worker(void *arg)
{
struct reader_arg *rarg = arg;
@@ -1160,7 +1168,8 @@ static int __cmd_top(struct perf_top *top)
struct reader_arg *rargs = NULL;
struct collector_arg carg;
int ret;
- int i;
+ unsigned int i;
+ int idx, nr_idx, rem;

top->session = perf_session__new(NULL, false, NULL);
if (top->session == NULL)
@@ -1211,34 +1220,47 @@ static int __cmd_top(struct perf_top *top)
/* Wait for a minimal set of events before starting the snapshot */
perf_evlist__poll(top->evlist, 100);

+ if (top->nr_threads == 0)
+ top->nr_threads = top->evlist->nr_mmaps / 4 ?: 1;
+ if ((int)top->nr_threads > top->evlist->nr_mmaps)
+ top->nr_threads = top->evlist->nr_mmaps;
+
+ nr_idx = top->evlist->nr_mmaps / top->nr_threads;
+ rem = top->evlist->nr_mmaps % top->nr_threads;
+
ret = -1;
- readers = calloc(sizeof(pthread_t), top->evlist->nr_mmaps);
+ readers = calloc(sizeof(pthread_t), top->nr_threads);
if (readers == NULL)
goto out_delete;

- rargs = calloc(sizeof(*rargs), top->evlist->nr_mmaps);
+ rargs = calloc(sizeof(*rargs), top->nr_threads);
if (rargs == NULL)
goto out_free;

- hists = calloc(sizeof(*hists), top->evlist->nr_mmaps * top->evlist->nr_entries);
+ hists = calloc(sizeof(*hists), top->nr_threads * top->evlist->nr_entries);
if (hists == NULL)
goto out_free;

- for (i = 0; i < top->evlist->nr_mmaps * top->evlist->nr_entries; i++)
+ for (i = 0; i < top->nr_threads * top->evlist->nr_entries; i++)
__hists__init(&hists[i]);

- for (i = 0; i < top->evlist->nr_mmaps; i++) {
+ for (i = 0, idx = 0; i < top->nr_threads; i++) {
struct reader_arg *rarg = &rargs[i];

- rarg->idx = i;
rarg->top = top;
rarg->hists = &hists[i * top->evlist->nr_entries];

+ rarg->idx = idx;
+ rarg->nr_idx = nr_idx;
+ if (rem-- > 0)
+ rarg->nr_idx++;
+ idx += rarg->nr_idx;
+
perf_top__mmap_read(rarg);
}
collect_hists(top, hists);

- for (i = 0; i < top->evlist->nr_mmaps; i++) {
+ for (i = 0; i < top->nr_threads; i++) {
if (pthread_create(&readers[i], NULL, mmap_read_worker, &rargs[i]))
goto out_join;
}
@@ -1259,7 +1281,7 @@ static int __cmd_top(struct perf_top *top)
out_join:
pthread_join(ui_thread, NULL);
pthread_join(collector, NULL);
- for (i = 0; i < top->evlist->nr_mmaps; i++) {
+ for (i = 0; i < top->nr_threads; i++) {
pthread_join(readers[i], NULL);
}

@@ -1458,6 +1480,7 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
OPT_CALLBACK('j', "branch-filter", &opts->branch_stack,
"branch filter mask", "branch stack filter modes",
parse_branch_stack),
+ OPT_UINTEGER(0, "num-thread", &top.nr_threads, "number of thread to run"),
OPT_END()
};
const char * const top_usage[] = {
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index 55eb5aebae59..916ba36b0ac0 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -43,6 +43,7 @@ struct perf_top {
int sym_pcnt_filter;
const char *sym_filter;
float min_percent;
+ unsigned int nr_threads;
};

#define CONSOLE_CLEAR ""
--
2.6.2

2015-12-10 07:54:41

by Namhyung Kim

[permalink] [raw]
Subject: [PATCH/RFC 16/16] perf tools: Skip dso front cache for multi-threaded lookup

Currently the dso maintains a front cache for faster symbol lookup,
but access to it needs to be synchronized when multiple threads are
used. Also, it doesn't help much if the data file has callchains,
since the whole callchain is walked for each sample, so a
single-entry cache is almost meaningless.

So skip the cache if multi-thread is enabled.
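
A minimal standalone sketch of the pattern (the struct and function
names are illustrative stand-ins for perf's dso__find_symbol(), not
the real code):

#include <stdbool.h>
#include <stddef.h>

struct symbol { const char *name; };

static struct symbol ksym = { "example_symbol" };

/* stand-in for the uncached rb-tree walk (symbols__find) */
static struct symbol *slow_lookup(unsigned long long addr)
{
	return addr ? &ksym : NULL;
}

static bool multi_thread;	/* mirrors symbol_conf.multi_thread */

static struct {
	unsigned long long addr;
	struct symbol *sym;
} last;				/* the single-entry front cache */

struct symbol *find_symbol(unsigned long long addr)
{
	if (multi_thread)	/* unsynchronized cache would race, skip it */
		return slow_lookup(addr);

	if (last.addr != addr) {	/* refill the cache on a miss */
		last.addr = addr;
		last.sym = slow_lookup(addr);
	}
	return last.sym;
}

int main(void)
{
	multi_thread = true;
	return find_symbol(0xffffffff81000000ull) ? 0 : 1;
}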

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-top.c | 2 ++
tools/perf/util/symbol.c | 3 +++
tools/perf/util/symbol.h | 3 ++-
3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index fc9715b046b3..d69069d57f8c 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1225,6 +1225,8 @@ static int __cmd_top(struct perf_top *top)
if ((int)top->nr_threads > top->evlist->nr_mmaps)
top->nr_threads = top->evlist->nr_mmaps;

+ symbol_conf.multi_thread = (top->nr_threads > 1);
+
nr_idx = top->evlist->nr_mmaps / top->nr_threads;
rem = top->evlist->nr_mmaps % top->nr_threads;

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index d51abd2e7865..d4a966004fa0 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -454,6 +454,9 @@ void dso__reset_find_symbol_cache(struct dso *dso)
struct symbol *dso__find_symbol(struct dso *dso,
enum map_type type, u64 addr)
{
+ if (symbol_conf.multi_thread)
+ return symbols__find(&dso->symbols[type], addr);
+
if (dso->last_find_result[type].addr != addr) {
dso->last_find_result[type].addr = addr;
dso->last_find_result[type].symbol = symbols__find(&dso->symbols[type], addr);
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 857f707ac12b..68c198f64e1d 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -109,7 +109,8 @@ struct symbol_conf {
branch_callstack,
has_filter,
show_ref_callgraph,
- hide_unresolved;
+ hide_unresolved,
+ multi_thread;
const char *vmlinux_name,
*kallsyms_name,
*source_prefix,
--
2.6.2

2015-12-10 08:01:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)


* Namhyung Kim <[email protected]> wrote:

> Hello,
>
> This patchset if an attempt to support multi-threading in perf top.
> In fact, perf top already run on two threads - a worker thread and a
> display thread. However processing all samples with a single thread
> in a large machine can have scalability problems.
>
> This patchset extends it to have multiple worker threads to process
> samples concurrently. Users can control the number of threads using
> --num-thread option. And there's a collector thread for passing hist
> entries from worker threads to the display thread.

Could you please make the number of threads default to the number of CPUs?

Since perf top is doing one perf event per CPU anyway, that's a pretty natural
model.

( I think 'perf record' should use per CPU threads as well to receive events, to
address the 'IO overload' problems with -g recording on larger CPU counts. )

Thanks,

Ingo

2015-12-10 08:49:52

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On December 10, 2015 5:01:18 PM GMT+09:00, Ingo Molnar <[email protected]> wrote:
>
>* Namhyung Kim <[email protected]> wrote:
>
>> Hello,
>>
>> This patchset if an attempt to support multi-threading in perf top.
>> In fact, perf top already run on two threads - a worker thread and a
>> display thread. However processing all samples with a single thread
>> in a large machine can have scalability problems.
>>
>> This patchset extends it to have multiple worker threads to process
>> samples concurrently. Users can control the number of threads using
>> --num-thread option. And there's a collector thread for passing
>hist
>> entries from worker threads to the display thread.
>
>Could you please make the number of threads default to the number of
>CPUs?
>
>Since perf top is doing one perf event per CPU anyway, that's a pretty
>natural
>model.
>
>( I think 'perf record' should use per CPU threads as well to receive
>events, to
>address the 'IO overload' problems with -g recording on larger CPU
>counts. )

Hi Ingo,

IIRC David said that a thread per cpu seems too much,
especially on a large system (like ~1024 cpus). I have no idea
what the reasonable default on such a system is, so I chose 1/4
of the mmap buffers (i.e. cpus for most cases). But I think I should
take non-system-wide mode into account too.

Thanks
Namhyung

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Subject: RE: [PATCH/RFC 01/16] perf top: Delete half-processed hist entries when exit

>From: Namhyung Kim [mailto:[email protected]]
>
>After sample processing is done, hist entries are in both of
>hists->entries and hists->entries_in (or hists->entries_collapsed).
>So I guess perf report does not have leaks on hists.
>
>But for perf top, it's possible to have half-processed entries which
>are only in hists->entries_in. Eventually they will go to the
>hists->entries and get freed but they cannot be deleted by current
>hists__delete_entries(). This patch adds hists__delete_all_entries
>function to delete those entries.
>

This is tested under the refcnt debugger and I've reviewed it.

Acked-by: Masami Hiramatsu <[email protected]>
Tested-by: Masami Hiramatsu <[email protected]>


Thanks!

>Cc: Masami Hiramatsu <[email protected]>
>Signed-off-by: Namhyung Kim <[email protected]>
>---
> tools/perf/util/hist.c | 26 +++++++++++++++++++++++++-
> 1 file changed, 25 insertions(+), 1 deletion(-)
>
>diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
>index 565ea3549894..56e97f5af598 100644
>--- a/tools/perf/util/hist.c
>+++ b/tools/perf/util/hist.c
>@@ -270,6 +270,8 @@ static void hists__delete_entry(struct hists *hists, struct hist_entry *he)
>
> if (sort__need_collapse)
> rb_erase(&he->rb_node_in, &hists->entries_collapsed);
>+ else
>+ rb_erase(&he->rb_node_in, hists->entries_in);
>
> --hists->nr_entries;
> if (!he->filtered)
>@@ -1567,11 +1569,33 @@ static int hists_evsel__init(struct perf_evsel *evsel)
> return 0;
> }
>
>+static void hists__delete_remaining_entries(struct rb_root *root)
>+{
>+ struct rb_node *node;
>+ struct hist_entry *he;
>+
>+ while (!RB_EMPTY_ROOT(root)) {
>+ node = rb_first(root);
>+ rb_erase(node, root);
>+
>+ he = rb_entry(node, struct hist_entry, rb_node_in);
>+ hist_entry__delete(he);
>+ }
>+}
>+
>+static void hists__delete_all_entries(struct hists *hists)
>+{
>+ hists__delete_entries(hists);
>+ hists__delete_remaining_entries(&hists->entries_in_array[0]);
>+ hists__delete_remaining_entries(&hists->entries_in_array[1]);
>+ hists__delete_remaining_entries(&hists->entries_collapsed);
>+}
>+
> static void hists_evsel__exit(struct perf_evsel *evsel)
> {
> struct hists *hists = evsel__hists(evsel);
>
>- hists__delete_entries(hists);
>+ hists__delete_all_entries(hists);
> }
>
> /*
>--
>2.6.2


2015-12-10 18:57:53

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH/RFC 01/16] perf top: Delete half-processed hist entries when exit

Em Thu, Dec 10, 2015 at 09:55:44AM +0000, 平松雅巳 / HIRAMATU,MASAMI escreveu:
> >From: Namhyung Kim [mailto:[email protected]]
> >
> >After sample processing is done, hist entries are in both of
> >hists->entries and hists->entries_in (or hists->entries_collapsed).
> >So I guess perf report does not have leaks on hists.
> >
> >But for perf top, it's possible to have half-processed entries which
> >are only in hists->entries_in. Eventually they will go to the
> >hists->entries and get freed but they cannot be deleted by current
> >hists__delete_entries(). This patch adds hists__delete_all_entries
> >function to delete those entries.
> >
>
> This is tested under the refcnt debugger and I've reviewed it.
>
> Acked-by: Masami Hiramatsu <[email protected]>
> Tested-by: Masami Hiramatsu <[email protected]>

Thanks, applied.

2015-12-10 19:04:29

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH/RFC 02/16] perf top: Fix and cleanup perf_top__record_precise_ip()

Em Thu, Dec 10, 2015 at 04:53:21PM +0900, Namhyung Kim escreveu:
> At first, it has duplicate ui__has_annotation() and 'sort__has_sym'
> and 'use_browser' check. In fact, the ui__has_annotation() should be
> removed as it needs to annotate on --stdio as well. And the
> top->sym_filter_entry is used only for stdio mode, so make it clear on
> the condition too.

So this is doing more than one thing and changes decisions about
non-trivial operations; can you please break it down into smaller
patches?

> Also the check will be simplifed if it sets sym before the check. And
> omit check for 'he' since its caller (hist_iter__top_callback) only
> gets called when iter->he is not NULL.
>
> Secondly, it converts the 'ip' argument using map__unmap_ip() before
> the call and then reverts it using map__map_ip(). This seems
> meaningless and strange since only one of them checks the 'map'.

For instance: the following change has value and should be on a separate
patch.

> Finally, access the he->hists->lock only if there's an error.
>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> tools/perf/builtin-top.c | 52 ++++++++++++++++++++----------------------------
> 1 file changed, 22 insertions(+), 30 deletions(-)
>
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 7e2e72e6d9d1..7cd9bb69f5a6 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -175,42 +175,40 @@ static void perf_top__record_precise_ip(struct perf_top *top,
> int counter, u64 ip)
> {
> struct annotation *notes;
> - struct symbol *sym;
> + struct symbol *sym = he->ms.sym;
> int err = 0;
>
> - if (he == NULL || he->ms.sym == NULL ||
> - ((top->sym_filter_entry == NULL ||
> - top->sym_filter_entry->ms.sym != he->ms.sym) && use_browser != 1))
> + if (sym == NULL || (use_browser == 0 &&
> + (top->sym_filter_entry == NULL ||
> + top->sym_filter_entry->ms.sym != sym)))
> return;
>
> - sym = he->ms.sym;
> notes = symbol__annotation(sym);
>
> if (pthread_mutex_trylock(&notes->lock))
> return;
>
> - ip = he->ms.map->map_ip(he->ms.map, ip);
> -
> - if (ui__has_annotation())
> - err = hist_entry__inc_addr_samples(he, counter, ip);
> + err = hist_entry__inc_addr_samples(he, counter, ip);
>
> pthread_mutex_unlock(&notes->lock);
>
> - /*
> - * This function is now called with he->hists->lock held.
> - * Release it before going to sleep.
> - */
> - pthread_mutex_unlock(&he->hists->lock);
> + if (unlikely(err)) {
> + /*
> + * This function is now called with hists->lock held.
> + * Release it before going to sleep.
> + */
> + pthread_mutex_unlock(&he->hists->lock);
> +
> + if (err == -ERANGE && !he->ms.map->erange_warned)
> + ui__warn_map_erange(he->ms.map, sym, ip);
> + else if (err == -ENOMEM) {
> + pr_err("Not enough memory for annotating '%s' symbol!\n",
> + sym->name);
> + sleep(1);
> + }
>
> - if (err == -ERANGE && !he->ms.map->erange_warned)
> - ui__warn_map_erange(he->ms.map, sym, ip);
> - else if (err == -ENOMEM) {
> - pr_err("Not enough memory for annotating '%s' symbol!\n",
> - sym->name);
> - sleep(1);
> + pthread_mutex_lock(&he->hists->lock);
> }
> -
> - pthread_mutex_lock(&he->hists->lock);
> }
>
> static void perf_top__show_details(struct perf_top *top)
> @@ -687,14 +685,8 @@ static int hist_iter__top_callback(struct hist_entry_iter *iter,
> struct hist_entry *he = iter->he;
> struct perf_evsel *evsel = iter->evsel;
>
> - if (sort__has_sym && single) {
> - u64 ip = al->addr;
> -
> - if (al->map)
> - ip = al->map->unmap_ip(al->map, ip);
> -
> - perf_top__record_precise_ip(top, he, evsel->idx, ip);
> - }
> + if (sort__has_sym && single)
> + perf_top__record_precise_ip(top, he, evsel->idx, al->addr);
>
> hist__account_cycles(iter->sample->branch_stack, al, iter->sample,
> !(top->record_opts.branch_stack & PERF_SAMPLE_BRANCH_ANY));
> --
> 2.6.2

2015-12-10 19:07:58

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH/RFC 03/16] perf top: Factor out warnings about kernel addresses and symbols

Em Thu, Dec 10, 2015 at 04:53:22PM +0900, Namhyung Kim escreveu:
> Factor out warning messages into separate functions. These will be
> called in the display thread later.
>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> tools/perf/builtin-top.c | 95 ++++++++++++++++++++++++++----------------------
> 1 file changed, 51 insertions(+), 44 deletions(-)

Without applying patch 2, will check if it happens without this patch as well



[root@ssdandy ~]# echo 2 > /proc/sys/kernel/kptr_restrict
[root@ssdandy ~]# perf top
perf: Segmentation fault
-------- backtrace --------
perf[0x538b3b]
/lib64/libc.so.6(+0x35650)[0x7f401c036650]
perf[0x43af66]
perf(cmd_top+0xedf)[0x43cdbf]
perf[0x47b7c3]
perf(main+0x617)[0x4222b7]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f401c022af5]
perf[0x4223c9]
[0x0]
[root@ssdandy ~]#

2015-12-11 02:27:45

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH/RFC 02/16] perf top: Fix and cleanup perf_top__record_precise_ip()

Hi Arnaldo,

On Thu, Dec 10, 2015 at 04:04:17PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Dec 10, 2015 at 04:53:21PM +0900, Namhyung Kim escreveu:
> > At first, it has duplicate ui__has_annotation() and 'sort__has_sym'
> > and 'use_browser' check. In fact, the ui__has_annotation() should be
> > removed as it needs to annotate on --stdio as well. And the
> > top->sym_filter_entry is used only for stdio mode, so make it clear on
> > the condition too.
>
> So this is doing more than one thing and changes decisions about
> non-trivial operations, can you please break it down into smaller
> patches?

Sure.

>
> > Also the check will be simplifed if it sets sym before the check. And
> > omit check for 'he' since its caller (hist_iter__top_callback) only
> > gets called when iter->he is not NULL.
> >
> > Secondly, it converts the 'ip' argument using map__unmap_ip() before
> > the call and then reverts it using map__map_ip(). This seems
> > meaningless and strange since only one of them checks the 'map'.
>
> For isntance: the following change has value and should be on a separate
> patch.

Will do!

Thanks,
Namhyung


>
> > Finally, access the he->hists->lock only if there's an error.
> >
> > Signed-off-by: Namhyung Kim <[email protected]>
> > ---

2015-12-11 08:11:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)


* Namhyung Kim <[email protected]> wrote:

> IIRC David said that thread per cpu seems too much especially on a large system
> (like ~1024 cpu). [...]

Too much in what fashion? For recording I think it's the fastest, most natural
model - anything else will create cache line bounces.

For perf report, I suspect you are right, it would depend on the actual possible
parallelism - which with time would improve I suspect.

Thanks,

Ingo

2015-12-11 15:01:36

by David Ahern

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On 12/11/15 1:11 AM, Ingo Molnar wrote:
>
> * Namhyung Kim <[email protected]> wrote:
>
>> IIRC David said that thread per cpu seems too much especially on a large system
>> (like ~1024 cpu). [...]
>
> Too much in what fashion? For recording I think it's the fastest, most natural
> model - anything else will create cache line bounces.

The intrusiveness of perf on the system under observation. I understand
there are a lot of factors that go into it.

2015-12-13 23:15:31

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH/RFC 07/16] perf hists: Pass hists struct to hist_entry_iter struct

On Thu, Dec 10, 2015 at 04:53:26PM +0900, Namhyung Kim wrote:
> This is a preparation for multi-thread support in perf tools. When
> multi-thread is enable, each thread will have its own hists during the
> sample processing.
>
> Signed-off-by: Namhyung Kim <[email protected]>

cool, I actually just need this one ;-)

Acked-by: Jiri Olsa <[email protected]>

thanks,
jirka

2015-12-13 23:17:14

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH/RFC 08/16] perf tools: Export a couple of hist functions

On Thu, Dec 10, 2015 at 04:53:27PM +0900, Namhyung Kim wrote:
> These are necessary for multi threaded sample processing:
>
> - hists__get__get_rotate_entries_in()
> - hists__collapse_insert_entry()
> - __hists__init()
>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> tools/perf/util/hist.c | 19 ++++++++++++-------
> tools/perf/util/hist.h | 5 +++++
> 2 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
> index d534ed76164b..ea4f3ad978b0 100644
> --- a/tools/perf/util/hist.c
> +++ b/tools/perf/util/hist.c
> @@ -972,9 +972,8 @@ void hist_entry__delete(struct hist_entry *he)
> * collapse the histogram
> */
>
> -static bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
> - struct rb_root *root,
> - struct hist_entry *he)
> +bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
> + struct rb_root *root, struct hist_entry *he)
> {
> struct rb_node **p = &root->rb_node;
> struct rb_node *parent = NULL;
> @@ -1014,7 +1013,7 @@ static bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
> return true;
> }
>
> -static struct rb_root *hists__get_rotate_entries_in(struct hists *hists)
> +struct rb_root *hists__get_rotate_entries_in(struct hists *hists)
> {
> struct rb_root *root;
>
> @@ -1549,10 +1548,8 @@ int perf_hist_config(const char *var, const char *value)
> return 0;
> }
>
> -static int hists_evsel__init(struct perf_evsel *evsel)
> +int __hists__init(struct hists *hists)

and this one also ;-)

Acked-by: Jiri Olsa <[email protected]>


thanks,
jirka

2015-12-13 23:23:44

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH/RFC 09/16] perf tools: Update hist entry's hists pointer

On Thu, Dec 10, 2015 at 04:53:28PM +0900, Namhyung Kim wrote:
> When sample is processed using multi-thread, each sample is gathered on
> each thread's hist tree and then merged into the real hist tree. But
> hist_entry->hists pointer was not updated so it could refer wrong hists
> resulted in missing outputs.
>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> tools/perf/util/hist.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
> index ea4f3ad978b0..a12e5022fe04 100644
> --- a/tools/perf/util/hist.c
> +++ b/tools/perf/util/hist.c
> @@ -1008,6 +1008,14 @@ bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,

there's 'struct hists *hists __maybe_unused' in hists__collapse_insert_entry def..

jirka

> }
> hists->nr_entries++;
>
> + /*
> + * If a hist entry is processed in multi-threaded environment,
> + * it points to a dummy local hists which was used only for
> + * intermediate processing. So update it to a real one so that
> + * it can find the correct info later.
> + */
> + he->hists = hists;
> +
> rb_link_node(&he->rb_node_in, parent, p);
> rb_insert_color(&he->rb_node_in, root);
> return true;
> --
> 2.6.2
>

2015-12-13 23:28:23

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH/RFC 09/16] perf tools: Update hist entry's hists pointer

On Mon, Dec 14, 2015 at 12:23:30AM +0100, Jiri Olsa wrote:
> On Thu, Dec 10, 2015 at 04:53:28PM +0900, Namhyung Kim wrote:
> > When sample is processed using multi-thread, each sample is gathered on
> > each thread's hist tree and then merged into the real hist tree. But
> > hist_entry->hists pointer was not updated so it could refer wrong hists
> > resulted in missing outputs.
> >
> > Signed-off-by: Namhyung Kim <[email protected]>
> > ---
> > tools/perf/util/hist.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
> > index ea4f3ad978b0..a12e5022fe04 100644
> > --- a/tools/perf/util/hist.c
> > +++ b/tools/perf/util/hist.c
> > @@ -1008,6 +1008,14 @@ bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
>
> there's 'struct hists *hists __maybe_unused' in hists__collapse_insert_entry def..

also hists__collapse_insert_entry is global which seems unnecessary

jirka

>
> jirka
>
> > }
> > hists->nr_entries++;
> >
> > + /*
> > + * If a hist entry is processed in multi-threaded environment,
> > + * it points to a dummy local hists which was used only for
> > + * intermediate processing. So update it to a real one so that
> > + * it can find the correct info later.
> > + */
> > + he->hists = hists;
> > +
> > rb_link_node(&he->rb_node_in, parent, p);
> > rb_insert_color(&he->rb_node_in, root);
> > return true;
> > --
> > 2.6.2
> >

2015-12-14 01:13:05

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

Hi David and Ingo,

On Fri, Dec 11, 2015 at 08:01:31AM -0700, David Ahern wrote:
> On 12/11/15 1:11 AM, Ingo Molnar wrote:
> >
> >* Namhyung Kim <[email protected]> wrote:
> >
> >>IIRC David said that thread per cpu seems too much especially on a large system
> >>(like ~1024 cpu). [...]
> >
> >Too much in what fashion? For recording I think it's the fastest, most natural
> >model - anything else will create cache line bounces.
>
> The intrusiveness of perf on the system under observation. I understand
> there are a lot of factors that go into it.

So what would be the sane default? Do you have any other suggestion?

If there's no opinion, I'll follow Ingo's..

Thanks,
Namhyung

2015-12-14 01:44:49

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH/RFC 03/16] perf top: Factor out warnings about kernel addresses and symbols

Hi Arnaldo,

On Thu, Dec 10, 2015 at 04:07:48PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Dec 10, 2015 at 04:53:22PM +0900, Namhyung Kim escreveu:
> > Factor out warning messages into separate functions. These will be
> > called in the display thread later.
> >
> > Signed-off-by: Namhyung Kim <[email protected]>
> > ---
> > tools/perf/builtin-top.c | 95 ++++++++++++++++++++++++++----------------------
> > 1 file changed, 51 insertions(+), 44 deletions(-)
>
> Without applying patch 2, will check if it happens without this patch as well
>
>
>
> [root@ssdandy ~]# echo 2 > /proc/sys/kernel/kptr_restrict
> [root@ssdandy ~]# perf top
> perf: Segmentation fault
> -------- backtrace --------
> perf[0x538b3b]
> /lib64/libc.so.6(+0x35650)[0x7f401c036650]
> perf[0x43af66]
> perf(cmd_top+0xedf)[0x43cdbf]
> perf[0x47b7c3]
> perf(main+0x617)[0x4222b7]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f401c022af5]
> perf[0x4223c9]
> [0x0]
> [root@ssdandy ~]#

This patch changed the order in the condition, so if kptr_restrict is
on, the kernel map is NULL and checking al->map->dso caused the segfault.
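
Roughly, it is the classic NULL-check ordering problem; an
illustrative standalone sketch (not the actual perf code):

#include <stdio.h>
#include <stddef.h>

struct dso { const char *name; };
struct map { struct dso *dso; };
struct addr_location { struct map *map; };

int main(void)
{
	struct addr_location al = { .map = NULL };	/* kptr_restrict: no kernel map */

	/* checking al.map->dso before al.map would crash here */
	if (al.map != NULL && al.map->dso != NULL)
		printf("dso: %s\n", al.map->dso->name);
	else
		printf("no kernel map, warn about kptr_restrict\n");
	return 0;
}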

Will fix.

Thanks,
Namhyung

2015-12-14 01:46:05

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH/RFC 07/16] perf hists: Pass hists struct to hist_entry_iter struct

Hi Jiri,

On Mon, Dec 14, 2015 at 12:15:20AM +0100, Jiri Olsa wrote:
> On Thu, Dec 10, 2015 at 04:53:26PM +0900, Namhyung Kim wrote:
> > This is a preparation for multi-thread support in perf tools. When
> > multi-thread is enable, each thread will have its own hists during the
> > sample processing.
> >
> > Signed-off-by: Namhyung Kim <[email protected]>
>
> cool, I actually just need need this one ;-)
>
> Acked-by: Jiri Olsa <[email protected]>

Thanks for your review!
Namhyung

2015-12-14 01:51:46

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH/RFC 09/16] perf tools: Update hist entry's hists pointer

On Mon, Dec 14, 2015 at 12:28:12AM +0100, Jiri Olsa wrote:
> On Mon, Dec 14, 2015 at 12:23:30AM +0100, Jiri Olsa wrote:
> > On Thu, Dec 10, 2015 at 04:53:28PM +0900, Namhyung Kim wrote:
> > > When sample is processed using multi-thread, each sample is gathered on
> > > each thread's hist tree and then merged into the real hist tree. But
> > > hist_entry->hists pointer was not updated so it could refer wrong hists
> > > resulted in missing outputs.
> > >
> > > Signed-off-by: Namhyung Kim <[email protected]>
> > > ---
> > > tools/perf/util/hist.c | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
> > > index ea4f3ad978b0..a12e5022fe04 100644
> > > --- a/tools/perf/util/hist.c
> > > +++ b/tools/perf/util/hist.c
> > > @@ -1008,6 +1008,14 @@ bool hists__collapse_insert_entry(struct hists *hists __maybe_unused,
> >
> > there's 'struct hists *hists __maybe_unused' in hists__collapse_insert_entry def..

OK, will remove.

>
> also hists__collapse_insert_entry is global which seems unnecessary

What do you mean? The hists__collapse_insert_entry() will be used by
collect_worker thread.

Thanks,
Namhyung


>
> jirka
>
> >
> > jirka
> >
> > > }
> > > hists->nr_entries++;
> > >
> > > + /*
> > > + * If a hist entry is processed in multi-threaded environment,
> > > + * it points to a dummy local hists which was used only for
> > > + * intermediate processing. So update it to a real one so that
> > > + * it can find the correct info later.
> > > + */
> > > + he->hists = hists;
> > > +
> > > rb_link_node(&he->rb_node_in, parent, p);
> > > rb_insert_color(&he->rb_node_in, root);
> > > return true;
> > > --
> > > 2.6.2
> > >

2015-12-14 02:03:17

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH/RFC 03/16] perf top: Factor out warnings about kernel addresses and symbols

On Mon, Dec 14, 2015 at 10:44 AM, Namhyung Kim <[email protected]> wrote:
> Hi Arnaldo,
>
> On Thu, Dec 10, 2015 at 04:07:48PM -0300, Arnaldo Carvalho de Melo wrote:
>> Em Thu, Dec 10, 2015 at 04:53:22PM +0900, Namhyung Kim escreveu:
>> > Factor out warning messages into separate functions. These will be
>> > called in the display thread later.
>> >
>> > Signed-off-by: Namhyung Kim <[email protected]>
>> > ---
>> > tools/perf/builtin-top.c | 95 ++++++++++++++++++++++++++----------------------
>> > 1 file changed, 51 insertions(+), 44 deletions(-)
>>
>> Without applying patch 2, will check if it happens without this patch as well
>>
>>
>>
>> [root@ssdandy ~]# echo 2 > /proc/sys/kernel/kptr_restrict
>> [root@ssdandy ~]# perf top
>> perf: Segmentation fault
>> -------- backtrace --------
>> perf[0x538b3b]
>> /lib64/libc.so.6(+0x35650)[0x7f401c036650]
>> perf[0x43af66]
>> perf(cmd_top+0xedf)[0x43cdbf]
>> perf[0x47b7c3]
>> perf(main+0x617)[0x4222b7]
>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f401c022af5]
>> perf[0x4223c9]
>> [0x0]
>> [root@ssdandy ~]#
>
> This patch changed the order in the condition, so if kptr_restrict is
> on, kernel map is NULL and checking al->map->dso caused the segfault.

Oh, it's not this patch. This patch just moves the condition as-is,
but I did the change in patch 5/16. Will fix there.

Thanks,
Namhyung

Subject: [tip:perf/core] perf top: Delete half-processed hist entries when exit

Commit-ID: 61fa0e94ca6ab62db5e095a5528150bf9962196d
Gitweb: http://git.kernel.org/tip/61fa0e94ca6ab62db5e095a5528150bf9962196d
Author: Namhyung Kim <[email protected]>
AuthorDate: Thu, 10 Dec 2015 16:53:20 +0900
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitDate: Thu, 10 Dec 2015 15:56:58 -0300

perf top: Delete half-processed hist entries when exit

After sample processing is done, hist entries are in both of
hists->entries and hists->entries_in (or hists->entries_collapsed). So
I guess perf report does not have leaks on hists.

But for perf top, it's possible to have half-processed entries which are
only in hists->entries_in. Eventually they will go to the
hists->entries and get freed but they cannot be deleted by current
hists__delete_entries(). This patch adds hists__delete_all_entries
function to delete those entries.

Signed-off-by: Namhyung Kim <[email protected]>
Tested-and-Acked-by: Masami Hiramatsu <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/util/hist.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 565ea35..56e97f5 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -270,6 +270,8 @@ static void hists__delete_entry(struct hists *hists, struct hist_entry *he)

if (sort__need_collapse)
rb_erase(&he->rb_node_in, &hists->entries_collapsed);
+ else
+ rb_erase(&he->rb_node_in, hists->entries_in);

--hists->nr_entries;
if (!he->filtered)
@@ -1567,11 +1569,33 @@ static int hists_evsel__init(struct perf_evsel *evsel)
return 0;
}

+static void hists__delete_remaining_entries(struct rb_root *root)
+{
+ struct rb_node *node;
+ struct hist_entry *he;
+
+ while (!RB_EMPTY_ROOT(root)) {
+ node = rb_first(root);
+ rb_erase(node, root);
+
+ he = rb_entry(node, struct hist_entry, rb_node_in);
+ hist_entry__delete(he);
+ }
+}
+
+static void hists__delete_all_entries(struct hists *hists)
+{
+ hists__delete_entries(hists);
+ hists__delete_remaining_entries(&hists->entries_in_array[0]);
+ hists__delete_remaining_entries(&hists->entries_in_array[1]);
+ hists__delete_remaining_entries(&hists->entries_collapsed);
+}
+
static void hists_evsel__exit(struct perf_evsel *evsel)
{
struct hists *hists = evsel__hists(evsel);

- hists__delete_entries(hists);
+ hists__delete_all_entries(hists);
}

/*

2015-12-14 08:43:15

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH/RFC 12/16] perf tools: Reduce lock contention when processing events

On Thu, Dec 10, 2015 at 04:53:31PM +0900, Namhyung Kim wrote:
> When multi-thread is enabled, the machine->threads_lock is contended
> as all worker threads try to grab the writer lock using
> machine__findnew_thread(). Usually, the thread they're looking for is
> in the tree, so they only need the reader lock.
> 
> Thus try machine__find_thread() first, and then fall back to the
> 'findnew' API. This will improve the performance.
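
As a minimal standalone sketch, the described pattern looks roughly
like this (the toy list and function names stand in for perf's
rb-tree under machine->threads_lock):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct thread { int tid; struct thread *next; };

static struct thread *threads;		/* toy list standing in for perf's rb-tree */
static pthread_rwlock_t threads_lock = PTHREAD_RWLOCK_INITIALIZER;

static struct thread *__find_thread(int tid)	/* caller must hold threads_lock */
{
	struct thread *th;

	for (th = threads; th; th = th->next)
		if (th->tid == tid)
			return th;
	return NULL;
}

struct thread *findnew_thread(int tid)
{
	struct thread *th;

	pthread_rwlock_rdlock(&threads_lock);	/* fast path: reader lock only */
	th = __find_thread(tid);
	pthread_rwlock_unlock(&threads_lock);
	if (th)
		return th;

	pthread_rwlock_wrlock(&threads_lock);	/* slow path: writer lock */
	th = __find_thread(tid);	/* re-check: someone may have added it */
	if (!th) {
		th = calloc(1, sizeof(*th));
		if (th) {
			th->tid = tid;
			th->next = threads;
			threads = th;
		}
	}
	pthread_rwlock_unlock(&threads_lock);
	return th;
}

int main(void)
{
	printf("tid 42 -> %p\n", (void *)findnew_thread(42));
	return 0;
}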

found one other place, but I guess you chose those
based on profiling the contention?

parent lookup in machine__process_fork_event

jirka

2015-12-14 09:23:49

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH/RFC 11/16] perf top: Implement basic parallel processing

On Thu, Dec 10, 2015 at 04:53:30PM +0900, Namhyung Kim wrote:

SNIP

>
> - perf_top__mmap_read(top);
> -
> ret = -1;
> - if (pthread_create(&thread, NULL, (use_browser > 0 ? display_thread_tui :
> - display_thread), top)) {
> - ui__error("Could not create display thread.\n");
> + readers = calloc(sizeof(pthread_t), top->evlist->nr_mmaps);
> + if (readers == NULL)
> goto out_delete;
> - }
>
> - if (top->realtime_prio) {
> - struct sched_param param;
> + rargs = calloc(sizeof(*rargs), top->evlist->nr_mmaps);
> + if (rargs == NULL)
> + goto out_free;
>
> - param.sched_priority = top->realtime_prio;
> - if (sched_setscheduler(0, SCHED_FIFO, &param)) {
> - ui__error("Could not set realtime priority.\n");
> - goto out_join;
> - }
> + hists = calloc(sizeof(*hists), top->evlist->nr_mmaps * top->evlist->nr_entries);
> + if (hists == NULL)
> + goto out_free;
> +
> + for (i = 0; i < top->evlist->nr_mmaps * top->evlist->nr_entries; i++)
> + __hists__init(&hists[i]);
> +
> + for (i = 0; i < top->evlist->nr_mmaps; i++) {
> + struct reader_arg *rarg = &rargs[i];
> +
> + rarg->idx = i;
> + rarg->top = top;
> + rarg->hists = &hists[i * top->evlist->nr_entries];
> +
> + perf_top__mmap_read(rarg);
> }
> + collect_hists(top, hists);

hum, what's the reason for calling perf_top__mmap_read
and collect_hists in here?

also calling collect_hists on hists array pointer seems wrong

jirka

2015-12-14 09:26:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On Fri, Dec 11, 2015 at 08:01:31AM -0700, David Ahern wrote:
> On 12/11/15 1:11 AM, Ingo Molnar wrote:
> >
> >* Namhyung Kim <[email protected]> wrote:
> >
> >>IIRC David said that thread per cpu seems too much especially on a large system
> >>(like ~1024 cpu). [...]
> >
> >Too much in what fashion? For recording I think it's the fastest, most natural
> >model - anything else will create cache line bounces.
>
> The intrusiveness of perf on the system under observation. I understand
> there are a lot of factors that go into it.

So I can see some of that, if every cpu has its own thread then every
cpu will occasionally schedule that thread. Whereas if there were less,
you'd not have that.

Still, I think it makes sense to implement it, we need the multi-file
option anyway. Once we have that, we can also implement a per-node
option, which should be a fairly simple hybrid of the two approaches.

The thing is, perf-record is really struggling on big machines.

And in an unrelated note, I absolutely detest --buildid being the
default, it makes perf-record blow chunks.

2015-12-14 09:35:24

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH/RFC 11/16] perf top: Implement basic parallel processing

On Mon, Dec 14, 2015 at 10:23:43AM +0100, Jiri Olsa wrote:
> On Thu, Dec 10, 2015 at 04:53:30PM +0900, Namhyung Kim wrote:
>
> SNIP
>
> >
> > - perf_top__mmap_read(top);
> > -
> > ret = -1;
> > - if (pthread_create(&thread, NULL, (use_browser > 0 ? display_thread_tui :
> > - display_thread), top)) {
> > - ui__error("Could not create display thread.\n");
> > + readers = calloc(sizeof(pthread_t), top->evlist->nr_mmaps);
> > + if (readers == NULL)
> > goto out_delete;
> > - }
> >
> > - if (top->realtime_prio) {
> > - struct sched_param param;
> > + rargs = calloc(sizeof(*rargs), top->evlist->nr_mmaps);
> > + if (rargs == NULL)
> > + goto out_free;
> >
> > - param.sched_priority = top->realtime_prio;
> > - if (sched_setscheduler(0, SCHED_FIFO, &param)) {
> > - ui__error("Could not set realtime priority.\n");
> > - goto out_join;
> > - }
> > + hists = calloc(sizeof(*hists), top->evlist->nr_mmaps * top->evlist->nr_entries);
> > + if (hists == NULL)
> > + goto out_free;
> > +
> > + for (i = 0; i < top->evlist->nr_mmaps * top->evlist->nr_entries; i++)
> > + __hists__init(&hists[i]);
> > +
> > + for (i = 0; i < top->evlist->nr_mmaps; i++) {
> > + struct reader_arg *rarg = &rargs[i];
> > +
> > + rarg->idx = i;
> > + rarg->top = top;
> > + rarg->hists = &hists[i * top->evlist->nr_entries];
> > +
> > + perf_top__mmap_read(rarg);
> > }
> > + collect_hists(top, hists);
>
> hum, what's the reason for calling perf_top__mmap_read
> and collect_hists in here?

nah it's for initial data so ui thread has something to display right? ;-)

>
> also calling collect_hists on hists array pointer seems wrong

still 'collect_hists(&hists[0]);' would seem more review friendly

thanks,
jirka

2015-12-14 09:38:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)


* Peter Zijlstra <[email protected]> wrote:

> On Fri, Dec 11, 2015 at 08:01:31AM -0700, David Ahern wrote:
> > On 12/11/15 1:11 AM, Ingo Molnar wrote:
> > >
> > >* Namhyung Kim <[email protected]> wrote:
> > >
> > >>IIRC David said that thread per cpu seems too much especially on a large system
> > >>(like ~1024 cpu). [...]
> > >
> > >Too much in what fashion? For recording I think it's the fastest, most natural
> > >model - anything else will create cache line bounces.
> >
> > The intrusiveness of perf on the system under observation. I understand
> > there are a lot of factors that go into it.
>
> So I can see some of that, if every cpu has its own thread then every
> cpu will occasionally schedule that thread. Whereas if there were less,
> you'd not have that.
>
> Still, I think it makes sense to implement it, we need the multi-file
> option anyway. Once we have that, we can also implement a per-node
> option, which should be a fairly simple hybrid of the two approaches.
>
> The thing is, perf-record is really struggling on big machines.
>
> And in an unrelated note, I absolutely detest --buildid being the
> default, it makes perf-record blow chunks.

So I'd absolutely _love_ to split up the singular perf.data into a hierarchy of
files in a .perf directory, with a structure like this (4-core system):

.perf/cmdline
.perf/features
.perf/evlist
.perf/ring_buffers/cpu0/raw.trace
.perf/ring_buffers/cpu1/raw.trace
.perf/ring_buffers/cpu2/raw.trace
.perf/ring_buffers/cpu3/raw.trace
...

I.e. the current single file format of perf.data would be split up into individual
files. Each CPU would get its own trace file output - any sorting and ordering
would be done afterwards. 'perf record' itself would never by default have to do
any of that, it's a pure recording session.

'perf archive' would still create a single file to make transport between machines
easy.

perf.data.old would be replaced by a .perf.old directory or so.

Debugging would be easier too I think, as there's no complex perf data format
anymore, it's all in individual (typically text, or binary dump) files in the
.perf directory.

This would solve all the scalability problems - and would make the format more
extensible and generally more accessible as well.

What do you think?

Thanks,

Ingo

2015-12-14 14:46:34

by David Ahern

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On 12/14/15 2:26 AM, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 08:01:31AM -0700, David Ahern wrote:
>> On 12/11/15 1:11 AM, Ingo Molnar wrote:
>>>
>>> * Namhyung Kim <[email protected]> wrote:
>>>
>>>> IIRC David said that thread per cpu seems too much especially on a large system
>>>> (like ~1024 cpu). [...]
>>>
>>> Too much in what fashion? For recording I think it's the fastest, most natural
>>> model - anything else will create cache line bounces.
>>
>> The intrusiveness of perf on the system under observation. I understand
>> there are a lot of factors that go into it.
>
> So I can see some of that, if every cpu has its own thread then every
> cpu will occasionally schedule that thread. Whereas if there were less,
> you'd not have that.
>
> Still, I think it makes sense to implement it, we need the multi-file
> option anyway. Once we have that, we can also implement a per-node
> option, which should be a fairly simple hybrid of the two approaches.
>
> The thing is, perf-record is really struggling on big machines.

I've gone from the 1024-cpu sparc systems earlier this year down to
small PPC and Rangeley-based switches. For both ends of the scale (and
in between) I constantly struggle with the options to manage memory, cpu
and disk consumption.

There definitely need to be options (e.g., multi-threaded on/off). For
the threading options I get the appeal of 1 thread per cpu, but other
options make sense as well -- 1 thread per core, 1 per NUMA node. perf
has the CPU topology so it should not be too difficult.

If you have 1-thread per cpu that means you are pinning the threads to
the cpu? That brings in additional permissions problems.

2015-12-14 14:55:37

by David Ahern

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On 12/14/15 2:38 AM, Ingo Molnar wrote:
>
>> And in an unrelated note, I absolutely detest --buildid being the
>> default, it makes perf-record blow chunks.

Yes, a .debug directory that gets bloated fast and being a dot-directory
is off the primary radar. I forget about it and too often forget to add
the option to disable it.

>
> So I'd absolutely _love_ to split up the singular perf.data into a hierarchy of
> files in a .perf directory, with a structure like this (4-core system):
>
> .perf/cmdline
> .perf/features
> .perf/evlist
> .perf/ring_buffers/cpu0/raw.trace
> .perf/ring_buffers/cpu1/raw.trace
> .perf/ring_buffers/cpu2/raw.trace
> .perf/ring_buffers/cpu3/raw.trace
> ...

On a related note, why a .perf directory?

>
> I.e. the current single file format of perf.data would be split up into individual
> files. Each CPU would get its own trace file output - any sorting and ordering
> would be done afterwards. 'perf record' itself would never by default have to do
> any of that, it's a pure recording session.
>
> 'perf archive' would still create a single file to make transport between machines
> easy.
>
> perf.data.old would be replaced by a .perf.old directory or so.
>
> Debugging would be easier too I think, as there's no complex perf data format
> anymore, it's all in individual (typically text, or binary dump) files in the
> .perf directory.
>
> This would solve all the scalability problems - and would make the format more
> extensible and generally more accessible as well.
>
> What do you think?

Big change to user experience.

I realize perf-archive has been around since I started using perf in
mid-2010, but I for one never use it. I suspect it is not widely used
(definitely not in the circles I have been involved and helped with
perf), so suddenly requiring it is a change in user experience.

The only 2 files on the system I pull off the box are kallsyms and
perf.data. Most of the systems where I use perf have limited symbols and
there is nothing in .debug I need to pull off the box.

2015-12-14 16:26:21

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

Hi Peter,

On Mon, Dec 14, 2015 at 10:26:13AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 08:01:31AM -0700, David Ahern wrote:
> > On 12/11/15 1:11 AM, Ingo Molnar wrote:
> > >
> > >* Namhyung Kim <[email protected]> wrote:
> > >
> > >>IIRC David said that thread per cpu seems too much especially on a large system
> > >>(like ~1024 cpu). [...]
> > >
> > >Too much in what fashion? For recording I think it's the fastest, most natural
> > >model - anything else will create cache line bounces.
> >
> > The intrusiveness of perf on the system under observation. I understand
> > there are a lot of factors that go into it.
>
> So I can see some of that, if every cpu has its own thread then every
> cpu will occasionally schedule that thread. Whereas if there were less,
> you'd not have that.
>
> Still, I think it makes sense to implement it, we need the multi-file
> option anyway. Once we have that, we can also implement a per-node
> option, which should be a fairly simple hybrid of the two approaches.
>
> The thing is, perf-record is really struggling on big machines.

Yes, but perf-record and perf-top are different. perf-record merely
saves the data into a file, while perf-top reads events and processes
them at the same time without a file. So we should choose a different
default IMHO.

I want to focus on perf-top for now, once it's in a good shape, I'll
work on perf record/report too.


>
> And in an unrelated note, I absolutely detest --buildid being the
> default, it makes perf-record blow chunks.

Maybe we can add a config option?

Thanks,
Namhyung

2015-12-14 16:27:03

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

Em Mon, Dec 14, 2015 at 07:55:32AM -0700, David Ahern escreveu:
> On 12/14/15 2:38 AM, Ingo Molnar wrote:
> >
> >>And in an unrelated note, I absolutely detest --buildid being the
> >>default, it makes perf-record blow chunks.
>
> Yes, a .debug directory that gets bloated fast and being a dot-directory is
> off the primary radar. I forget about it and too often forget to add the
> option to disable it.

Multiple things here: .debug/ should be size limited, and buildid
processing doesn't necessarily need to store things in that cache. I bet
what PeterZ is complaining about is the reprocessing of events at the
end of the session, to find the PERF_RECORD_MMAP* events and then
read the build-ids and insert them into the perf.data file header.

All this can be disabled by default; the downside is that when samples
get resolved to something random, because the binary used in the session
was replaced by some other, we will not be able to notice.

This would be solved by inserting the buildid (20-some bytes) into the
PERF_RECORD_MMAP record, but that remains to be done...

> >
> >So I'd absolutely _love_ to split up the singular perf.data into a hierarchy of
> >files in a .perf directory, with a structure like this (4-core system):
> >
> > .perf/cmdline
> > .perf/features
> > .perf/evlist
> > .perf/ring_buffers/cpu0/raw.trace
> > .perf/ring_buffers/cpu1/raw.trace
> > .perf/ring_buffers/cpu2/raw.trace
> > .perf/ring_buffers/cpu3/raw.trace
> > ...
>
> On a related note why a .perf directory?
>
> >
> >I.e. the current single file format of perf.data would be split up into individual
> >files. Each CPU would get its own trace file output - any sorting and ordering
> >would be done afterwards. 'perf record' itself would never by default have to do
> >any of that, it's a pure recording session.
> >
> >'perf archive' would still create a single file to make transport between machines
> >easy.
> >
> >perf.data.old would be replaced by a .perf.old directory or so.
> >
> >Debugging would be easier too I think, as there's no complex perf data format
> >anymore, it's all in individual (typically text, or binary dump) files in the
> >.perf directory.
> >
> >This would solve all the scalability problems - and would make the format more
> >extensible and generally more accessible as well.
> >
> >What do you think?
>
> Big change to user experience.
>
> I realize perf-archive has been around since I started using perf in
> mid-2010, but I for one never use it. I suspect it is not widely used
> (definitely not in the circles I have been involved and helped with perf),
> so suddenly requiring it is a change in user experience.
>
> The only 2 files on the system I pull off the box are kallsyms and
> perf.data. Most of the systems where I use perf have limited symbols and
> there is nothing in .debug I need to pull off the box.

Well, we don't have to use perf archive for that: doing what Ingo
suggests and, on top of it, putting it into a perf.data file that is in
fact a cpio or tarball would keep the single file while introducing the
split-up bits for later processing.

- Arnaldo

2015-12-14 16:39:13

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

Hi Ingo,

On Mon, Dec 14, 2015 at 10:38:41AM +0100, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Dec 11, 2015 at 08:01:31AM -0700, David Ahern wrote:
> > > On 12/11/15 1:11 AM, Ingo Molnar wrote:
> > > >
> > > >* Namhyung Kim <[email protected]> wrote:
> > > >
> > > >>IIRC David said that thread per cpu seems too much especially on a large system
> > > >>(like ~1024 cpu). [...]
> > > >
> > > >Too much in what fashion? For recording I think it's the fastest, most natural
> > > >model - anything else will create cache line bounces.
> > >
> > > The intrusiveness of perf on the system under observation. I understand
> > > there are a lot of factors that go into it.
> >
> > So I can see some of that, if every cpu has its own thread then every
> > cpu will occasionally schedule that thread. Whereas if there were less,
> > you'd not have that.
> >
> > Still, I think it makes sense to implement it, we need the multi-file
> > option anyway. Once we have that, we can also implement a per-node
> > option, which should be a fairly simple hybrid of the two approaches.
> >
> > The thing is, perf-record is really struggling on big machines.
> >
> > And in an unrelated note, I absolutely detest --buildid being the
> > default, it makes perf-record blow chunks.
>
> So I'd absolutely _love_ to split up the singular perf.data into a hierarchy of
> files in a .perf directory, with a structure like this (4-core system):
>
> .perf/cmdline
> .perf/features
> .perf/evlist
> .perf/ring_buffers/cpu0/raw.trace
> .perf/ring_buffers/cpu1/raw.trace
> .perf/ring_buffers/cpu2/raw.trace
> .perf/ring_buffers/cpu3/raw.trace
> ...
>
> I.e. the current single file format of perf.data would be split up into individual
> files. Each CPU would get its own trace file output - any sorting and ordering
> would be done afterwards. 'perf record' itself would never by default have to do
> any of that, it's a pure recording session.
>
> 'perf archive' would still create a single file to make transport between machines
> easy.
>
> perf.data.old would be replaced by a .perf.old directory or so.
>
> Debugging would be easier too I think, as there's no complex perf data format
> anymore, it's all in individual (typically text, or binary dump) files in the
> .perf directory.
>
> This would solve all the scalability problems - and would make the format more
> extensible and generally more accessible as well.
>
> What do you think?

It requires many changes, but basically I also like the split-up since
it's easier to deal with. IIRC there was an opinion (Andi?) regarding
single-file vs multi-file. File access will be better for a single
file, so I changed my earlier implementation to use an indexed single
data file instead of multiple files.

But anyway, I'll work on perf-top first :)

Thanks,
Namhyung

2015-12-14 16:41:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On Mon, Dec 14, 2015 at 01:26:55PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Dec 14, 2015 at 07:55:32AM -0700, David Ahern escreveu:
> > On 12/14/15 2:38 AM, Ingo Molnar wrote:
> > >
> > >>And in an unrelated note, I absolutely detest --buildid being the
> > >>default, it makes perf-record blow chunks.

> Multiple things here, .debug/ should be size limited, and buildid
> processing doesn't need necessarily to store things in that cache, I bet
> what PeterZ is complaining about is the reprocessing of events at the
> end of the session, to find out about the PERF_RECORD_MMAP* events to
> then read the build-ids and insert then into the perf.data file header.
>

Yeah, it's the reprocessing that is taking forever. On my moderately
sized system with 40 CPUs, the reprocessing takes about as long as
the actual workload, which is tedious.

Once I figured out what was happening, Jiri was quick to point out I
should be using -B.

2015-12-14 16:44:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On Tue, Dec 15, 2015 at 01:25:35AM +0900, Namhyung Kim wrote:
> I want to focus on perf-top for now, once it's in a good shape, I'll
> work on perf record/report too.

Fair enough and thanks!

>
> >
> > And in an unrelated note, I absolutely detest --buildid being the
> > default, it makes perf-record blow chunks.
>
> Maybe we can add a config option?

That would be good, until I get a new machine and have forgotten all
about this again :-)

2015-12-14 16:56:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On Tue, Dec 15, 2015 at 01:38:30AM +0900, Namhyung Kim wrote:
> It requires many changes, but basically I also like the split-up since
> it's easier to deal with. IIRC there was an opinion (Andi?) regarding
> single-file vs. multi-file: file access is better with a single file,
> so I changed my earlier implementation to use an indexed single data
> file instead of multiple files.

The page-cache has a lock per inode, so by having all CPUs populate the
one file you get contention on that.

Also, I suppose you'll have to arbitrate ranges in that file for each
cpu to make it work, that too could get you some contention.

Having a file per cpu avoids all that.
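
Roughly, the file-per-cpu scheme looks like this; a minimal sketch
with hypothetical helper names (directories assumed to already
exist), not actual perf code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* One trace file per cpu, as in the .perf layout discussed above:
 * writes from different CPUs never serialize on a shared inode
 * lock, and no ranges within a single file need arbitration. */
static int open_percpu_trace(int cpu)
{
	char path[64];

	snprintf(path, sizeof(path),
		 ".perf/ring_buffers/cpu%d/raw.trace", cpu);
	return open(path, O_CREAT | O_WRONLY | O_APPEND, 0644);
}

/* Each per-cpu reader thread then drains its ring buffer with
 * plain write()s to its private file. */
static void drain_chunk(int fd, const void *buf, size_t len)
{
	if (write(fd, buf, len) < 0)
		perror("write");
}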

2015-12-14 17:07:39

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

Hi David,

On Mon, Dec 14, 2015 at 07:46:28AM -0700, David Ahern wrote:
> On 12/14/15 2:26 AM, Peter Zijlstra wrote:
> >On Fri, Dec 11, 2015 at 08:01:31AM -0700, David Ahern wrote:
> >>On 12/11/15 1:11 AM, Ingo Molnar wrote:
> >>>
> >>>* Namhyung Kim <[email protected]> wrote:
> >>>
> >>>>IIRC David said that thread per cpu seems too much especially on a large system
> >>>>(like ~1024 cpu). [...]
> >>>
> >>>Too much in what fashion? For recording I think it's the fastest, most natural
> >>>model - anything else will create cache line bounces.
> >>
> >>The intrusiveness of perf on the system under observation. I understand
> >>there are a lot of factors that go into it.
> >
> >So I can see some of that, if every cpu has its own thread then every
> >cpu will occasionally schedule that thread. Whereas if there were less,
> >you'd not have that.
> >
> >Still, I think it makes sense to implement it, we need the multi-file
> >option anyway. Once we have that, we can also implement a per-node
> >option, which should be a fairly simple hybrid of the two approaches.
> >
> >The thing is, perf-record is really struggling on big machines.
>
> I've gone from the 1024-cpu sparc systems earlier this year down to small
> PPC and Rangeley-based switches. For both ends of the scale (and in between)
> I constantly struggle with the options to manage memory, cpu and disk
> consumption.
>
> There definitely need to be options (e.g., multi-threaded on/off). For the
> threading options I get the appeal of 1 thread per cpu, but other options
> make sense as well -- 1 thread per core, 1 per NUMA node. perf has the CPU
> topology, so it should not be too difficult.

I think we can use the --num-thread option to control multi-threading: 1
to disable it, anything else to enable it. In the current implementation,
using 1 thread still uses the same logic, so 1 reader + 1 collector
thread will be created as well as 1 display thread. I'm not sure it'd
be better to special-case 1 thread to use a different code path.

Anyway, I think it'd be nice to have per-core, per-socket and per-node
options; per-core would be a good default then.
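
Grouping CPUs by core could be done from the sysfs topology files; a
sketch only (error handling trimmed, not the actual perf topology
helpers):

#include <stdio.h>

/* A cpu's core is identified by the (physical_package_id, core_id)
 * pair; CPUs sharing that pair would share one reader thread. */
static int read_topology_id(int cpu, const char *what)
{
	char path[128];
	FILE *f;
	int id = -1;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, what);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%d", &id) != 1)
			id = -1;
		fclose(f);
	}
	return id;
}

/* usage: read_topology_id(cpu, "core_id") and
 *        read_topology_id(cpu, "physical_package_id") */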

>
> If you have 1-thread per cpu that means you are pinning the threads to the
> cpu? That brings in additional permissions problems.

Did you mean setting sched affinity? It doesn't seem to be a
privileged operation when a process does it for its own threads.
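
For example, pinning the calling thread needs no extra capability; a
minimal sketch:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a cpu. Acting on pthread_self() needs
 * no CAP_SYS_NICE (unlike changing another process's affinity),
 * though the cpu must be in the thread's allowed cpuset. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}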

Thanks,
Namhyung

2015-12-14 17:12:26

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

On Mon, Dec 14, 2015 at 05:56:14PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 15, 2015 at 01:38:30AM +0900, Namhyung Kim wrote:
> > It requires many changes, but basically I also like the split-up since
> > it's easier to deal with. IIRC there was an opinion (Andi?) regarding
> > single-file vs. multi-file: file access is better with a single file,
> > so I changed my earlier implementation to use an indexed single data
> > file instead of multiple files.
>
> The page-cache has a lock per inode, so by having all CPUs populate the
> one file you get contention on that.
>
> Also, I suppose you'll have to arbitrate ranges in that file for each
> cpu to make it work, that too could get you some contention.
>
> Having a file per cpu avoids all that.

Right. Now I recall that it was about *report* (not record) time
access to a single file vs. multiple files.

At record time we should use a file per cpu. In my earlier work I
combined them into one indexed file at post-processing time.
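
To illustrate the idea only (a hypothetical layout, not the real
perf.data format), the index could be as simple as one record per
merged per-cpu region:

#include <stdint.h>

/* Hypothetical per-cpu index entry written at post-processing time:
 * records where each cpu's events landed in the combined file, so
 * report can still walk one cpu's data sequentially. */
struct cpu_data_index {
	uint64_t offset;	/* start of this cpu's region */
	uint64_t size;		/* length of the region */
	uint32_t cpu;		/* source cpu */
	uint32_t reserved;
};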

Thanks for clarification!
Namhyung

2015-12-14 17:52:48

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

Em Mon, Dec 14, 2015 at 05:41:30PM +0100, Peter Zijlstra escreveu:
> On Mon, Dec 14, 2015 at 01:26:55PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Dec 14, 2015 at 07:55:32AM -0700, David Ahern escreveu:
> > > On 12/14/15 2:38 AM, Ingo Molnar wrote:
> > > >
> > > >>And in an unrelated note, I absolutely detest --buildid being the
> > > >>default, it makes perf-record blow chunks.
>
> > Multiple things here, .debug/ should be size limited, and buildid
> > processing doesn't necessarily need to store things in that cache, I bet
> > what PeterZ is complaining about is the reprocessing of events at the
> > end of the session, to find out about the PERF_RECORD_MMAP* events to
> > then read the build-ids and insert them into the perf.data file header.
> >
>
> Yeah, it's the reprocessing that is taking forever. On my moderately
> sized system with 40 CPUs, the reprocessing takes about as long as
> the actual workload, which is tedious.

Right, I thought about using some dummy event for tracking mmaps, which
I think is a technique used by the Intel PT code (well, there it's used
for sched_switches); will check...
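
A sketch of such a side-band-only event, using existing perf_event
ABI fields (illustrative setup by hand, not the actual tools code):

#include <linux/perf_event.h>
#include <string.h>

/* A software dummy counter generates no samples of its own but
 * still delivers side-band records (PERF_RECORD_MMAP*, COMM,
 * fork/exit), so mmaps can be tracked as they happen instead of
 * being reprocessed at the end of the session. */
static void init_dummy_tracking_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_DUMMY;
	attr->mmap = 1;		/* executable mmap records */
	attr->mmap2 = 1;	/* richer mmap records, where supported */
	attr->comm = 1;		/* task comm changes */
	attr->task = 1;		/* fork/exit records */
}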

> Once I figured out what was happening Jiri was quick to point out I
> should be using -B.

Right, we have to have sane behaviour by default, damn long/high freq
workloads, duh ;-)

- Arnaldo

2015-12-14 17:54:52

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCHSET 00/16] perf top: Add multi-thread support (v1)

Em Tue, Dec 15, 2015 at 02:06:54AM +0900, Namhyung Kim escreveu:
> On Mon, Dec 14, 2015 at 07:46:28AM -0700, David Ahern wrote:
> > If you have 1-thread per cpu that means you are pinning the threads to the
> > cpu? That brings in additional permissions problems.
>
> Did you mean setting sched affinity? It doesn't seem to be a
> privileged operation when a process does it for its own threads.

Right, and we should do it (create per cpu rb consuming threads) just
for the cpus in the workload affinity mask.

- Arnaldo

2015-12-15 02:04:15

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH/RFC 12/16] perf tools: Reduce lock contention when processing events

Hi Jiri,

On Mon, Dec 14, 2015 at 09:43:04AM +0100, Jiri Olsa wrote:
> On Thu, Dec 10, 2015 at 04:53:31PM +0900, Namhyung Kim wrote:
> > When multi-thread is enabled, the machine->threads_lock is contended
> > as all worker threads try to grab the writer lock via
> > machine__findnew_thread(). Usually, the thread they're looking for is
> > already in the tree, though, so they only need the reader lock.
> >
> > Thus try machine__find_thread() first, and then fall back to the
> > 'findnew' API. This will improve performance.
>
> found one other place, but I guess you chose those
> based on profiling the contention?

I only profiled my earlier perf report patchset and found that
machine__findnew_thread() in perf_event__preprocess_sample() was
contended. But perf-report serializes meta events and only processes
sample events in parallel.

As perf-top processes all events in parallel, I thought I should use
the same technique in other places.
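
The pattern, sketched with a pthread rwlock (hypothetical types and
helpers, not the actual perf code): take the reader lock for the
common hit case, and only fall back to the writer path on a miss.

#include <pthread.h>
#include <sys/types.h>

struct thread;				/* opaque for this sketch */
struct machine {
	pthread_rwlock_t threads_lock;
	/* ... rb-tree of threads ... */
};

struct thread *__find_thread(struct machine *m, pid_t tid);
struct thread *__create_thread(struct machine *m, pid_t tid);

struct thread *lookup_thread(struct machine *m, pid_t tid)
{
	struct thread *th;

	/* fast path: read-only tree walk under the reader lock */
	pthread_rwlock_rdlock(&m->threads_lock);
	th = __find_thread(m, tid);
	pthread_rwlock_unlock(&m->threads_lock);
	if (th)
		return th;

	/* miss: take the writer lock, but re-check first since
	 * another worker may have inserted it in the meantime */
	pthread_rwlock_wrlock(&m->threads_lock);
	th = __find_thread(m, tid);
	if (!th)
		th = __create_thread(m, tid);
	pthread_rwlock_unlock(&m->threads_lock);
	return th;
}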

>
> parent lookup in machine__process_fork_event

I'll add that too.

Thanks,
Namhyung

2015-12-15 02:09:08

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH/RFC 11/16] perf top: Implement basic parallel processing

On Mon, Dec 14, 2015 at 10:35:19AM +0100, Jiri Olsa wrote:
> On Mon, Dec 14, 2015 at 10:23:43AM +0100, Jiri Olsa wrote:
> > On Thu, Dec 10, 2015 at 04:53:30PM +0900, Namhyung Kim wrote:
> >
> > SNIP
> >
> > >
> > > - perf_top__mmap_read(top);
> > > -
> > > ret = -1;
> > > - if (pthread_create(&thread, NULL, (use_browser > 0 ? display_thread_tui :
> > > - display_thread), top)) {
> > > - ui__error("Could not create display thread.\n");
> > > + readers = calloc(sizeof(pthread_t), top->evlist->nr_mmaps);
> > > + if (readers == NULL)
> > > goto out_delete;
> > > - }
> > >
> > > - if (top->realtime_prio) {
> > > - struct sched_param param;
> > > + rargs = calloc(sizeof(*rargs), top->evlist->nr_mmaps);
> > > + if (rargs == NULL)
> > > + goto out_free;
> > >
> > > - param.sched_priority = top->realtime_prio;
> > > - if (sched_setscheduler(0, SCHED_FIFO, &param)) {
> > > - ui__error("Could not set realtime priority.\n");
> > > - goto out_join;
> > > - }
> > > + hists = calloc(sizeof(*hists), top->evlist->nr_mmaps * top->evlist->nr_entries);
> > > + if (hists == NULL)
> > > + goto out_free;
> > > +
> > > + for (i = 0; i < top->evlist->nr_mmaps * top->evlist->nr_entries; i++)
> > > + __hists__init(&hists[i]);
> > > +
> > > + for (i = 0; i < top->evlist->nr_mmaps; i++) {
> > > + struct reader_arg *rarg = &rargs[i];
> > > +
> > > + rarg->idx = i;
> > > + rarg->top = top;
> > > + rarg->hists = &hists[i * top->evlist->nr_entries];
> > > +
> > > + perf_top__mmap_read(rarg);
> > > }
> > > + collect_hists(top, hists);
> >
> > hum, what's the reason for calling perf_top__mmap_read
> > and collect_hists in here?
>
> nah it's for initial data so ui thread has something to display right? ;-)

Right, the logic was already there. Without it, the wait for the
initial update (after 2 seconds) feels too long to users.


>
> >
> > also calling collect_hists on hists array pointer seems wrong
>
> still 'collect_hists(&hists[0]);' would seem more review friendly

Will change!

Thanks,
Namhyung