LinuxLists.cc - [PATCH RFC V3 0/5] perf top optimization

2017-09-25 20:23:44

Subject: [PATCH RFC V3 0/5] perf top optimization

From: Kan Liang <[email protected]>

The patch series intends to fix the severe performance issue in
Knights Landing/Mill, when monitoring in heavy load system.
perf top costs a few minutes to show the result, which is
unacceptable.
With the patch series applied, the latency will reduces to
several seconds.

machine__synthesize_threads and perf_top__mmap_read costs most of
the perf top time (> 99%).
Patch 1-4 do the optimization for machine__synthesize_threads.
Patch 5 does the optimization for perf_top__mmap_read.

Optimization for machine__synthesize_threads
- Multithreading the whole process.
- The threads number is set to the max online CPU# by default.
User can change the threads number through the new option.
- Introduces hashtable for machine threads to reduce the lock
contention.
- The optimization can also benefit other platforms and other
perf tools, like perf record. But this patch series doesn't
do the optimization for other tools. It can be done later
separately.
- With this optimization applied, there is a 1.56x speedup in
Knights Mill with heavy workload.

Optimization for perf_top__mmap_read
- switch back to overwrite mode
For non overwrite mode, it tries to read everything in the ring buffer
and does not check the messup. Once there are lots of samples delivered
shortly, the processing time could be very long.
Considering the real time requirement for perf top, it should switch
back to overwrite mode.
- With this optimization applied, there is a huge 53.8x speedup in
Knights Mill with heavy workload.
- With this optimization applied, the latency of perf_top__mmap_read is
less than the default perf top fresh time (2s) in Knights Mill with
heavy workload.

The source code is also available at
https://github.com/kliang2/perf.git perf_top_opt

Here are perf top latency test result on Knights Mill and Skylake server

The heavy workload is to compile Linux kernel as below
"sudo nice make -j$(grep -c '^processor' /proc/cpuinfo)"
Then, "sudo perf top"

The latency period is the time between perf top launched and the first
profiling result shown.

- Latency on Knights Mill (272 CPUs)

Original(s) With patch(s) Speedup
272.68 16.48 16.54x

- Latency on Skylake server (192 CPUs)

Original(s) With patch(s) Speedup
12.28 2.96 4.15x

Changes since V2:
- patches 1 and 2 for V2 for hashtable and scandir have been merged.
- patches 3, 4 and 7 for V2 are droped. Because the optimization code
doesn't touch those codes. The protection is not needed. (Arnaldo & jirka)
- Using mutex wrappers for multithread only lock. (jirka)
- Move struct synthesize_threads_arg to event.c (jirka)

Changes since V1:
- Patch 1: machine threads and hashtable related renaming (Arnaldo)
- Patch 6: use a smaller locked section for comm_str__put
add a locked wrapper for comm_str__findnew (Arnaldo)

Kan Liang (5):
perf tools: lock to protect namespaces and comm list
perf tools: lock to protect comm_str rb tree
perf top: implement multithreading for perf_event__synthesize_threads
perf top: add option to set the number of thread for event synthesize
perf top: switch back to overwrite mode

tools/perf/builtin-kvm.c | 3 +-
tools/perf/builtin-record.c | 2 +-
tools/perf/builtin-top.c | 15 +++-
tools/perf/builtin-trace.c | 2 +-
tools/perf/tests/mmap-thread-lookup.c | 2 +-
tools/perf/util/comm.c | 18 +++-
tools/perf/util/event.c | 157 +++++++++++++++++++++++++++-------
tools/perf/util/event.h | 3 +-
tools/perf/util/evlist.c | 5 +-
tools/perf/util/machine.c | 8 +-
tools/perf/util/machine.h | 9 +-
tools/perf/util/thread.c | 53 ++++++++++--
tools/perf/util/thread.h | 3 +
tools/perf/util/top.h | 1 +
14 files changed, 231 insertions(+), 50 deletions(-)

--
2.5.5

2017-09-25 20:23:52

by Liang, Kan

[permalink] [raw]

Subject: [PATCH RFC V3 4/5] perf top: add option to set the number of thread for event synthesize

From: Kan Liang <[email protected]>

Using UINT_MAX to indicate the default thread#, which is the max number
of online CPU.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-top.c | 11 ++++++++---
tools/perf/util/event.c | 5 ++++-
tools/perf/util/top.h | 1 +
3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index bc31b93..477a869 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -958,14 +958,16 @@ static int __cmd_top(struct perf_top *top)
if (perf_session__register_idle_thread(top->session) < 0)
goto out_delete;

- perf_set_multithreaded();
+ if (top->nr_threads_synthesize > 1)
+ perf_set_multithreaded();

machine__synthesize_threads(&top->session->machines.host, &opts->target,
top->evlist->threads, false,
opts->proc_map_timeout,
- (unsigned int)sysconf(_SC_NPROCESSORS_ONLN));
+ top->nr_threads_synthesize);

- perf_set_singlethreaded();
+ if (top->nr_threads_synthesize > 1)
+ perf_set_singlethreaded();

if (perf_hpp_list.socket) {
ret = perf_env__read_cpu_topology_map(&perf_env);
@@ -1118,6 +1120,7 @@ int cmd_top(int argc, const char **argv)
},
.max_stack = sysctl_perf_event_max_stack,
.sym_pcnt_filter = 5,
+ .nr_threads_synthesize = UINT_MAX,
};
struct record_opts *opts = &top.record_opts;
struct target *target = &opts->target;
@@ -1227,6 +1230,8 @@ int cmd_top(int argc, const char **argv)
OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
"Show entries in a hierarchy"),
OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
+ OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
+ "number of thread to run event synthesize"),
OPT_END()
};
const char * const top_usage[] = {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index c81a0b6..7095f40 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -790,7 +790,10 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
if (n < 0)
return err;

- thread_nr = nr_threads_synthesize;
+ if (nr_threads_synthesize == UINT_MAX)
+ thread_nr = sysconf(_SC_NPROCESSORS_ONLN);
+ else
+ thread_nr = nr_threads_synthesize;
if (thread_nr <= 0)
thread_nr = 1;
if (thread_nr > n)
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index 9bdfb78..f4296e1 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -37,6 +37,7 @@ struct perf_top {
int sym_pcnt_filter;
const char *sym_filter;
float min_percent;
+ unsigned int nr_threads_synthesize;
};

#define CONSOLE_CLEAR "[H[2J"
--
2.5.5

2017-09-25 20:23:50

by Liang, Kan

[permalink] [raw]

Subject: [PATCH RFC V3 3/5] perf top: implement multithreading for perf_event__synthesize_threads

From: Kan Liang <[email protected]>

The proc files which is sorted with alphabetical order are evenly
assigned to several synthesize threads to be processed in parallel.

For perf top, the threads number hard code to online CPU number. The
following patch will introduce an option to set it.
For other perf tools, the thread number is 1. Because the process
function is not ready for multithreading, e.g.
process_synthesized_event.
This patch series only support event synthesize multithreading for perf
top. For other tools, it can be done separately later.

With multithread applied, the total processing time can get up to 1.56x
speedup on Knights Mill for perf top.

For specific single event processing, the processing time could increase
because of the lock contention. So proc_map_timeout may need to be
increased. Otherwise some proc maps will be truncated.
Based on my test, increasing the proc_map_timeout has small impact
on the total processing time. The total processing time still get 1.49x
speedup on Knights Mill after increasing the proc_map_timeout.
The patch itself doesn't increase the proc_map_timeout.

Doesn't need to implement multithreading for per task monitoring,
perf_event__synthesize_thread_map. It doesn't have performance issue.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-kvm.c | 3 +-
tools/perf/builtin-record.c | 2 +-
tools/perf/builtin-top.c | 8 +-
tools/perf/builtin-trace.c | 2 +-
tools/perf/tests/mmap-thread-lookup.c | 2 +-
tools/perf/util/event.c | 154 +++++++++++++++++++++++++++-------
tools/perf/util/event.h | 3 +-
tools/perf/util/machine.c | 8 +-
tools/perf/util/machine.h | 9 +-
9 files changed, 149 insertions(+), 42 deletions(-)

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index c747a1a..721f4f9 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -1441,7 +1441,8 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
perf_session__set_id_hdr_size(kvm->session);
ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
machine__synthesize_threads(&kvm->session->machines.host, &kvm->opts.target,
- kvm->evlist->threads, false, kvm->opts.proc_map_timeout);
+ kvm->evlist->threads, false,
+ kvm->opts.proc_map_timeout, 1);
err = kvm_live_open_events(kvm);
if (err)
goto out;
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9b379f3..234fdf4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -863,7 +863,7 @@ static int record__synthesize(struct record *rec, bool tail)

err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
process_synthesized_event, opts->sample_address,
- opts->proc_map_timeout);
+ opts->proc_map_timeout, 1);
out:
return err;
}
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index ee954bd..bc31b93 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -958,8 +958,14 @@ static int __cmd_top(struct perf_top *top)
if (perf_session__register_idle_thread(top->session) < 0)
goto out_delete;

+ perf_set_multithreaded();
+
machine__synthesize_threads(&top->session->machines.host, &opts->target,
- top->evlist->threads, false, opts->proc_map_timeout);
+ top->evlist->threads, false,
+ opts->proc_map_timeout,
+ (unsigned int)sysconf(_SC_NPROCESSORS_ONLN));
+
+ perf_set_singlethreaded();

if (perf_hpp_list.socket) {
ret = perf_env__read_cpu_topology_map(&perf_env);
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 967bd35..afef6fe 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1131,7 +1131,7 @@ static int trace__symbols_init(struct trace *trace, struct perf_evlist *evlist)

err = __machine__synthesize_threads(trace->host, &trace->tool, &trace->opts.target,
evlist->threads, trace__tool_process, false,
- trace->opts.proc_map_timeout);
+ trace->opts.proc_map_timeout, 1);
if (err)
symbol__exit();

diff --git a/tools/perf/tests/mmap-thread-lookup.c b/tools/perf/tests/mmap-thread-lookup.c
index f94a419..2a0068a 100644
--- a/tools/perf/tests/mmap-thread-lookup.c
+++ b/tools/perf/tests/mmap-thread-lookup.c
@@ -131,7 +131,7 @@ static int synth_all(struct machine *machine)
{
return perf_event__synthesize_threads(NULL,
perf_event__process,
- machine, 0, 500);
+ machine, 0, 500, 1);
}

static int synth_process(struct machine *machine)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 10366b8..c81a0b6 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -678,23 +678,21 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
return err;
}

-int perf_event__synthesize_threads(struct perf_tool *tool,
- perf_event__handler_t process,
- struct machine *machine,
- bool mmap_data,
- unsigned int proc_map_timeout)
+static int __perf_event__synthesize_threads(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ bool mmap_data,
+ unsigned int proc_map_timeout,
+ struct dirent **dirent,
+ int start,
+ int num)
{
union perf_event *comm_event, *mmap_event, *fork_event;
union perf_event *namespaces_event;
- char proc_path[PATH_MAX];
- struct dirent **dirent;
int err = -1;
char *end;
pid_t pid;
- int n, i;
-
- if (machine__is_default_guest(machine))
- return 0;
+ int i;

comm_event = malloc(sizeof(comm_event->comm) + machine->id_hdr_size);
if (comm_event == NULL)
@@ -714,34 +712,25 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
if (namespaces_event == NULL)
goto out_free_fork;

- snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
- n = scandir(proc_path, &dirent, 0, alphasort);
-
- if (n < 0)
- goto out_free_namespaces;
-
- for (i = 0; i < n; i++) {
+ for (i = start; i < start + num; i++) {
if (!isdigit(dirent[i]->d_name[0]))
continue;

pid = (pid_t)strtol(dirent[i]->d_name, &end, 10);
/* only interested in proper numerical dirents */
- if (!*end) {
- /*
- * We may race with exiting thread, so don't stop just because
- * one thread couldn't be synthesized.
- */
- __event__synthesize_thread(comm_event, mmap_event, fork_event,
- namespaces_event, pid, 1, process,
- tool, machine, mmap_data,
- proc_map_timeout);
- }
- free(dirent[i]);
+ if (*end)
+ continue;
+ /*
+ * We may race with exiting thread, so don't stop just because
+ * one thread couldn't be synthesized.
+ */
+ __event__synthesize_thread(comm_event, mmap_event, fork_event,
+ namespaces_event, pid, 1, process,
+ tool, machine, mmap_data,
+ proc_map_timeout);
}
- free(dirent);
err = 0;

-out_free_namespaces:
free(namespaces_event);
out_free_fork:
free(fork_event);
@@ -753,6 +742,109 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
return err;
}

+struct synthesize_threads_arg {
+ struct perf_tool *tool;
+ perf_event__handler_t process;
+ struct machine *machine;
+ bool mmap_data;
+ unsigned int proc_map_timeout;
+ struct dirent **dirent;
+ int num;
+ int start;
+};
+
+static void *synthesize_threads_worker(void *arg)
+{
+ struct synthesize_threads_arg *args = arg;
+
+ __perf_event__synthesize_threads(args->tool, args->process,
+ args->machine, args->mmap_data,
+ args->proc_map_timeout, args->dirent,
+ args->start, args->num);
+ return NULL;
+}
+
+int perf_event__synthesize_threads(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ bool mmap_data,
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize)
+{
+ struct synthesize_threads_arg *args = NULL;
+ pthread_t *synthesize_threads = NULL;
+ char proc_path[PATH_MAX];
+ struct dirent **dirent;
+ int num_per_thread;
+ int m, n, i, j;
+ int thread_nr;
+ int base = 0;
+ int err = -1;
+
+
+ if (machine__is_default_guest(machine))
+ return 0;
+
+ snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
+ n = scandir(proc_path, &dirent, 0, alphasort);
+ if (n < 0)
+ return err;
+
+ thread_nr = nr_threads_synthesize;
+ if (thread_nr <= 0)
+ thread_nr = 1;
+ if (thread_nr > n)
+ thread_nr = n;
+
+ synthesize_threads = calloc(sizeof(pthread_t), thread_nr);
+ if (synthesize_threads == NULL)
+ goto free_dirent;
+
+ args = calloc(sizeof(*args), thread_nr);
+ if (args == NULL)
+ goto free_threads;
+
+ num_per_thread = n / thread_nr;
+ m = n % thread_nr;
+ for (i = 0; i < thread_nr; i++) {
+ args[i].tool = tool;
+ args[i].process = process;
+ args[i].machine = machine;
+ args[i].mmap_data = mmap_data;
+ args[i].proc_map_timeout = proc_map_timeout;
+ args[i].dirent = dirent;
+ }
+ for (i = 0; i < m; i++) {
+ args[i].num = num_per_thread + 1;
+ args[i].start = i * args[i].num;
+ }
+ if (i != 0)
+ base = args[i-1].start + args[i-1].num;
+ for (j = i; j < thread_nr; j++) {
+ args[j].num = num_per_thread;
+ args[j].start = base + (j - i) * args[i].num;
+ }
+
+ for (i = 0; i < thread_nr; i++) {
+ if (pthread_create(&synthesize_threads[i], NULL,
+ synthesize_threads_worker, &args[i]))
+ goto out_join;
+ }
+ err = 0;
+out_join:
+ for (i = 0; i < thread_nr; i++)
+ pthread_join(synthesize_threads[i], NULL);
+ free(args);
+free_threads:
+ free(synthesize_threads);
+free_dirent:
+ for (i = 0; i < n; i++)
+ free(dirent[i]);
+ free(dirent);
+
+ return err;
+}
+
struct process_symbol_args {
const char *name;
u64 start;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ee7bcc8..d6cbb0a 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -680,7 +680,8 @@ int perf_event__synthesize_cpu_map(struct perf_tool *tool,
int perf_event__synthesize_threads(struct perf_tool *tool,
perf_event__handler_t process,
struct machine *machine, bool mmap_data,
- unsigned int proc_map_timeout);
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize);
int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
perf_event__handler_t process,
struct machine *machine);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 585b4a3..7c3aa47 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2218,12 +2218,16 @@ int machines__for_each_thread(struct machines *machines,
int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
struct target *target, struct thread_map *threads,
perf_event__handler_t process, bool data_mmap,
- unsigned int proc_map_timeout)
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize)
{
if (target__has_task(target))
return perf_event__synthesize_thread_map(tool, threads, process, machine, data_mmap, proc_map_timeout);
else if (target__has_cpu(target))
- return perf_event__synthesize_threads(tool, process, machine, data_mmap, proc_map_timeout);
+ return perf_event__synthesize_threads(tool, process,
+ machine, data_mmap,
+ proc_map_timeout,
+ nr_threads_synthesize);
/* command specified */
return 0;
}
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index b1cd516..c6a299e 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -257,15 +257,18 @@ int machines__for_each_thread(struct machines *machines,
int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
struct target *target, struct thread_map *threads,
perf_event__handler_t process, bool data_mmap,
- unsigned int proc_map_timeout);
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize);
static inline
int machine__synthesize_threads(struct machine *machine, struct target *target,
struct thread_map *threads, bool data_mmap,
- unsigned int proc_map_timeout)
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize)
{
return __machine__synthesize_threads(machine, NULL, target, threads,
perf_event__process, data_mmap,
- proc_map_timeout);
+ proc_map_timeout,
+ nr_threads_synthesize);
}

pid_t machine__get_current_tid(struct machine *machine, int cpu);
--
2.5.5

2017-09-25 20:23:49

by Liang, Kan

[permalink] [raw]

Subject: [PATCH RFC V3 5/5] perf top: switch back to overwrite mode

From: Kan Liang <[email protected]>

perf_top__mmap_read has severe performance issue in
Knights Landing/Mill, when monitoring in heavy load system. It costs
several minutes to finish, which is unacceptable.

perf top was overwrite mode. But it is changed to non overwrite mode
since commit 93fc64f14472 ("perf top: Switch to non overwrite mode").
For non overwrite mode, it tries to read everything in the ring buffer
and does not check the messup. Once there are lots of samples delivered
shortly, the processing time could be very long.
Knights Landing/Mill as a manycore processor contains a large number of
small cores. Because of the huge core number, it will generated lots of
samples in a heavy load system. Also, since the huge sample#, the mmap
writer probably bite the tail and mess up the samples.

Switching to overwrite mode, which dropping the unsure mmap entries,
significantly speeds up the whole progress.
Considering the real time requirement for perf top, it should switch
back to overwrite mode.

Only warning once if the messup is detected.
Providing some hints to users.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-top.c | 2 +-
tools/perf/util/evlist.c | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 477a869..4b4af34 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -902,7 +902,7 @@ static int perf_top__start_counters(struct perf_top *top)
}
}

- if (perf_evlist__mmap(evlist, opts->mmap_pages, false) < 0) {
+ if (perf_evlist__mmap(evlist, opts->mmap_pages, true) < 0) {
ui__error("Failed to mmap with %d (%s)\n",
errno, str_error_r(errno, msg, sizeof(msg)));
goto out_err;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 6a0d7ff..ef0b6b1 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -723,7 +723,10 @@ perf_mmap__read(struct perf_mmap *md, bool check_messup, u64 start,
* In either case, truncate and restart at 'end'.
*/
if (diff > md->mask / 2 || diff < 0) {
- fprintf(stderr, "WARNING: failed to keep up with mmap data.\n");
+ WARN_ONCE(1, "WARNING: failed to keep up with mmap data.\n"
+ "Please try increasing the period (-c) or\n"
+ "decreasing the freq (-F) or\n"
+ "limiting the number of CPUs");

/*
* 'end' points to a known good entry, start there.
--
2.5.5

2017-09-25 20:24:31

by Liang, Kan

[permalink] [raw]

Subject: [PATCH RFC V3 2/5] perf tools: lock to protect comm_str rb tree

From: Kan Liang <[email protected]>

Add comm_str_lock to protect comm_str rb tree.

The lock is only needed for multithreaded code, so using mutex wrappers
provided by perf tool.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/comm.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/comm.c b/tools/perf/util/comm.c
index 7bc981b..756a9c1 100644
--- a/tools/perf/util/comm.c
+++ b/tools/perf/util/comm.c
@@ -5,6 +5,7 @@
#include <stdio.h>
#include <string.h>
#include <linux/refcount.h>
+#include "rwsem.h"

struct comm_str {
char *str;
@@ -14,6 +15,7 @@ struct comm_str {

/* Should perhaps be moved to struct machine */
static struct rb_root comm_str_root;
+static struct rw_semaphore comm_str_lock = {.lock = PTHREAD_RWLOCK_INITIALIZER,};

static struct comm_str *comm_str__get(struct comm_str *cs)
{
@@ -25,7 +27,9 @@ static struct comm_str *comm_str__get(struct comm_str *cs)
static void comm_str__put(struct comm_str *cs)
{
if (cs && refcount_dec_and_test(&cs->refcnt)) {
+ down_write(&comm_str_lock);
rb_erase(&cs->rb_node, &comm_str_root);
+ up_write(&comm_str_lock);
zfree(&cs->str);
free(cs);
}
@@ -50,7 +54,8 @@ static struct comm_str *comm_str__alloc(const char *str)
return cs;
}

-static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
+static
+struct comm_str *__comm_str__findnew(const char *str, struct rb_root *root)
{
struct rb_node **p = &root->rb_node;
struct rb_node *parent = NULL;
@@ -81,6 +86,17 @@ static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
return new;
}

+static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
+{
+ struct comm_str *cs;
+
+ down_write(&comm_str_lock);
+ cs = __comm_str__findnew(str, root);
+ up_write(&comm_str_lock);
+
+ return cs;
+}
+
struct comm *comm__new(const char *str, u64 timestamp, bool exec)
{
struct comm *comm = zalloc(sizeof(*comm));
--
2.5.5

2017-09-25 20:24:46

by Liang, Kan

[permalink] [raw]

Subject: [PATCH RFC V3 1/5] perf tools: lock to protect namespaces and comm list

From: Kan Liang <[email protected]>

Add two locks to protect namespaces_list and comm_list.

The lock is only needed for multithreaded code, so using mutex wrappers
provided by perf tool.

Not all the comm_list/namespaces_list accessing are protected, e.g.
thread__exec_comm. Because the multithread code for perf top event
synthesizing does not touch them. They don't need a lock.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/thread.c | 53 +++++++++++++++++++++++++++++++++++++++++++-----
tools/perf/util/thread.h | 3 +++
2 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index c09bdb5..bf73117 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -45,6 +45,8 @@ struct thread *thread__new(pid_t pid, pid_t tid)
thread->cpu = -1;
INIT_LIST_HEAD(&thread->namespaces_list);
INIT_LIST_HEAD(&thread->comm_list);
+ init_rwsem(&thread->namespaces_lock);
+ init_rwsem(&thread->comm_lock);

comm_str = malloc(32);
if (!comm_str)
@@ -83,18 +85,26 @@ void thread__delete(struct thread *thread)
map_groups__put(thread->mg);
thread->mg = NULL;
}
+ down_write(&thread->namespaces_lock);
list_for_each_entry_safe(namespaces, tmp_namespaces,
&thread->namespaces_list, list) {
list_del(&namespaces->list);
namespaces__free(namespaces);
}
+ up_write(&thread->namespaces_lock);
+
+ down_write(&thread->comm_lock);
list_for_each_entry_safe(comm, tmp_comm, &thread->comm_list, list) {
list_del(&comm->list);
comm__free(comm);
}
+ up_write(&thread->comm_lock);
+
unwind__finish_access(thread);
nsinfo__zput(thread->nsinfo);

+ exit_rwsem(&thread->namespaces_lock);
+ exit_rwsem(&thread->comm_lock);
free(thread);
}

@@ -125,8 +135,8 @@ struct namespaces *thread__namespaces(const struct thread *thread)
return list_first_entry(&thread->namespaces_list, struct namespaces, list);
}

-int thread__set_namespaces(struct thread *thread, u64 timestamp,
- struct namespaces_event *event)
+static int __thread__set_namespaces(struct thread *thread, u64 timestamp,
+ struct namespaces_event *event)
{
struct namespaces *new, *curr = thread__namespaces(thread);

@@ -149,6 +159,17 @@ int thread__set_namespaces(struct thread *thread, u64 timestamp,
return 0;
}

+int thread__set_namespaces(struct thread *thread, u64 timestamp,
+ struct namespaces_event *event)
+{
+ int ret;
+
+ down_write(&thread->namespaces_lock);
+ ret = __thread__set_namespaces(thread, timestamp, event);
+ up_write(&thread->namespaces_lock);
+ return ret;
+}
+
struct comm *thread__comm(const struct thread *thread)
{
if (list_empty(&thread->comm_list))
@@ -170,8 +191,8 @@ struct comm *thread__exec_comm(const struct thread *thread)
return last;
}

-int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
- bool exec)
+static int ____thread__set_comm(struct thread *thread, const char *str,
+ u64 timestamp, bool exec)
{
struct comm *new, *curr = thread__comm(thread);

@@ -195,6 +216,17 @@ int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
return 0;
}

+int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
+ bool exec)
+{
+ int ret;
+
+ down_write(&thread->comm_lock);
+ ret = ____thread__set_comm(thread, str, timestamp, exec);
+ up_write(&thread->comm_lock);
+ return ret;
+}
+
int thread__set_comm_from_proc(struct thread *thread)
{
char path[64];
@@ -212,7 +244,7 @@ int thread__set_comm_from_proc(struct thread *thread)
return err;
}

-const char *thread__comm_str(const struct thread *thread)
+static const char *__thread__comm_str(const struct thread *thread)
{
const struct comm *comm = thread__comm(thread);

@@ -222,6 +254,17 @@ const char *thread__comm_str(const struct thread *thread)
return comm__str(comm);
}

+const char *thread__comm_str(const struct thread *thread)
+{
+ const char *str;
+
+ down_read((struct rw_semaphore *)&thread->comm_lock);
+ str = __thread__comm_str(thread);
+ up_read((struct rw_semaphore *)&thread->comm_lock);
+
+ return str;
+}
+
/* CHECKME: it should probably better return the max comm len from its comm list */
int thread__comm_len(struct thread *thread)
{
diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index cb1a5dd..10555d6 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -9,6 +9,7 @@
#include "symbol.h"
#include <strlist.h>
#include <intlist.h>
+#include "rwsem.h"

struct thread_stack;
struct unwind_libunwind_ops;
@@ -29,7 +30,9 @@ struct thread {
int comm_len;
bool dead; /* if set thread has exited */
struct list_head namespaces_list;
+ struct rw_semaphore namespaces_lock;
struct list_head comm_list;
+ struct rw_semaphore comm_lock;
u64 db_id;

void *priv;
--
2.5.5

2017-09-25 20:36:09

by Andi Kleen

[permalink] [raw]

Subject: Re: [PATCH RFC V3 5/5] perf top: switch back to overwrite mode

On Mon, Sep 25, 2017 at 01:23:08PM -0700, [email protected] wrote:
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 6a0d7ff..ef0b6b1 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -723,7 +723,10 @@ perf_mmap__read(struct perf_mmap *md, bool check_messup, u64 start,
> * In either case, truncate and restart at 'end'.
> */
> if (diff > md->mask / 2 || diff < 0) {
> - fprintf(stderr, "WARNING: failed to keep up with mmap data.\n");
> + WARN_ONCE(1, "WARNING: failed to keep up with mmap data.\n"
> + "Please try increasing the period (-c) or\n"
> + "decreasing the freq (-F) or\n"
> + "limiting the number of CPUs");

should suggest -C too

"limiting the number of CPUs (-C)");

-Andi

2017-09-27 11:44:57

by Jiri Olsa

[permalink] [raw]

Subject: Re: [PATCH RFC V3 3/5] perf top: implement multithreading for perf_event__synthesize_threads

On Mon, Sep 25, 2017 at 01:23:06PM -0700, [email protected] wrote:

SNIP

> +
> +int perf_event__synthesize_threads(struct perf_tool *tool,
> + perf_event__handler_t process,
> + struct machine *machine,
> + bool mmap_data,
> + unsigned int proc_map_timeout,
> + unsigned int nr_threads_synthesize)
> +{
> + struct synthesize_threads_arg *args = NULL;
> + pthread_t *synthesize_threads = NULL;
> + char proc_path[PATH_MAX];
> + struct dirent **dirent;
> + int num_per_thread;
> + int m, n, i, j;
> + int thread_nr;
> + int base = 0;
> + int err = -1;
> +
> +
> + if (machine__is_default_guest(machine))
> + return 0;
> +
> + snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
> + n = scandir(proc_path, &dirent, 0, alphasort);
> + if (n < 0)
> + return err;
> +
> + thread_nr = nr_threads_synthesize;
> + if (thread_nr <= 0)
> + thread_nr = 1;

given the number of callers with 'nr_threads_synthesize == 1',
could you just make it the special case in here, like:

if (thread_nr == 1)
return __perf_event__synthesize_threads

looks like all the args are ready at this time

thanks,
jirka

2017-09-27 12:00:41

by Jiri Olsa

[permalink] [raw]

Subject: Re: [PATCH RFC V3 5/5] perf top: switch back to overwrite mode

On Mon, Sep 25, 2017 at 01:23:08PM -0700, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> perf_top__mmap_read has severe performance issue in
> Knights Landing/Mill, when monitoring in heavy load system. It costs
> several minutes to finish, which is unacceptable.
>
> perf top was overwrite mode. But it is changed to non overwrite mode
> since commit 93fc64f14472 ("perf top: Switch to non overwrite mode").
> For non overwrite mode, it tries to read everything in the ring buffer
> and does not check the messup. Once there are lots of samples delivered
> shortly, the processing time could be very long.
> Knights Landing/Mill as a manycore processor contains a large number of
> small cores. Because of the huge core number, it will generated lots of
> samples in a heavy load system. Also, since the huge sample#, the mmap
> writer probably bite the tail and mess up the samples.
>
> Switching to overwrite mode, which dropping the unsure mmap entries,
> significantly speeds up the whole progress.
> Considering the real time requirement for perf top, it should switch
> back to overwrite mode.
>
> Only warning once if the messup is detected.
> Providing some hints to users.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/builtin-top.c | 2 +-
> tools/perf/util/evlist.c | 5 ++++-
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 477a869..4b4af34 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -902,7 +902,7 @@ static int perf_top__start_counters(struct perf_top *top)
> }
> }
>
> - if (perf_evlist__mmap(evlist, opts->mmap_pages, false) < 0) {
> + if (perf_evlist__mmap(evlist, opts->mmap_pages, true) < 0) {

hum, I thought that it's not as simple as using 'true' in here,
because of the issue explained in here:
9ecda41acb97 perf/core: Add ::write_backward attribute to perf event

jirka