2017-09-29 14:48:35

by Liang, Kan

[permalink] [raw]
Subject: [PATCH RFC V4 0/6] perf top optimization

From: Kan Liang <[email protected]>

The patch series intends to fix the severe performance issue in
Knights Landing/Mill, when monitoring in heavy load system.
perf top costs a few minutes to show the result, which is
unacceptable.
With the patch series applied, the latency will reduces to
several seconds.

machine__synthesize_threads and perf_top__mmap_read costs most of
the perf top time (> 99%).
Patch 1-4 do the optimization for machine__synthesize_threads.
Patch 5-6 does the optimization for perf_top__mmap_read.

Optimization for machine__synthesize_threads
- Multithreading the whole process.
- The threads number is set to the max online CPU# by default.
User can change the threads number through the new option.
- Introduces hashtable for machine threads to reduce the lock
contention.
- The optimization can also benefit other platforms and other
perf tools, like perf record. But this patch series doesn't
do the optimization for other tools. It can be done later
separately.
- With this optimization applied, there is a 1.56x speedup in
Knights Mill with heavy workload.

Optimization for perf_top__mmap_read
- switch to backward overwrite mode
For non overwrite mode, it tries to read everything in the ring buffer
and does not check the messup. Once there are lots of samples delivered
shortly, the processing time could be very long.
Considering the real time requirement for perf top, it should switch
to backward overwrite mode.
- With this optimization applied, there is a 8.98x speedup in
Knights Mill with heavy workload.
- However, the latency of perf_top__mmap_read is still higher than the
default perf top fresh time (2s) in Knights Mill with heavy workload.
A check is introduced to give some hints to reduce the overhead.

The source code is also available at
https://github.com/kliang2/perf.git perf_top_opt

Here are perf top latency test result on Knights Mill and Skylake server

The heavy workload is to compile Linux kernel as below
"sudo nice make -j$(grep -c '^processor' /proc/cpuinfo)"
Then, "sudo perf top"

The latency period is the time between perf top launched and the first
profiling result shown.

- Latency on Knights Mill (272 CPUs)

Original(s) With patch(s) Speedup
272.68 40.89 6.67x

- Latency on Skylake server (192 CPUs)

Original(s) With patch(s) Speedup
12.28 2.05 5.99x

Changes since V3:
- Switch to backward overwrite mode (jirka)
The backward mode can avoid some problems found in forward,
but the performance drops which compared with V3 for Knights Mill.
Introduce a new patch to check the latency of perf_top__mmap_read.
- Handle the thread_nr = 1 specially (jirka)

Changes since V2:
- patches 1 and 2 for V2 for hashtable and scandir have been merged.
- patches 3, 4 and 7 for V2 are droped. Because the optimization code
doesn't touch those codes. The protection is not needed. (Arnaldo & jirka)
- Using mutex wrappers for multithread only lock. (jirka)
- Move struct synthesize_threads_arg to event.c (jirka)

Changes since V1:
- Patch 1: machine threads and hashtable related renaming (Arnaldo)
- Patch 6: use a smaller locked section for comm_str__put
add a locked wrapper for comm_str__findnew (Arnaldo)

Kan Liang (6):
perf tools: lock to protect namespaces and comm list
perf tools: lock to protect comm_str rb tree
perf top: implement multithreading for perf_event__synthesize_threads
perf top: add option to set the number of thread for event synthesize
perf top: switch to backward overwrite mode
perf top: check the cost of perf_top__mmap_read

tools/perf/Documentation/perf-top.txt | 3 +
tools/perf/builtin-kvm.c | 3 +-
tools/perf/builtin-record.c | 2 +-
tools/perf/builtin-top.c | 43 ++++++---
tools/perf/builtin-trace.c | 2 +-
tools/perf/tests/mmap-thread-lookup.c | 2 +-
tools/perf/ui/browsers/hists.c | 12 ++-
tools/perf/util/comm.c | 18 +++-
tools/perf/util/event.c | 163 +++++++++++++++++++++++++++-------
tools/perf/util/event.h | 3 +-
tools/perf/util/machine.c | 8 +-
tools/perf/util/machine.h | 9 +-
tools/perf/util/thread.c | 53 +++++++++--
tools/perf/util/thread.h | 3 +
tools/perf/util/top.h | 1 +
15 files changed, 262 insertions(+), 63 deletions(-)

--
2.5.5


2017-09-29 14:48:39

by Liang, Kan

[permalink] [raw]
Subject: [PATCH RFC V4 6/6] perf top: check the cost of perf_top__mmap_read

From: Kan Liang <[email protected]>

The latency of perf_top__mmap_read should be lower than refresh time.
If not, give some hints to reduce the overhead.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-top.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 03090d0..b5143c9 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -863,11 +863,20 @@ static void perf_top__mmap_read_idx(struct perf_top *top, int idx)

static void perf_top__mmap_read(struct perf_top *top)
{
+ unsigned long long s, e;
int i;

perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_DATA_PENDING);
+ s = rdclock();
for (i = 0; i < top->evlist->nr_mmaps; i++)
perf_top__mmap_read_idx(top, i);
+ e = rdclock();
+ if ((e - s) > (unsigned long long)(top->delay_secs * NSEC_PER_SEC)) {
+ ui__warning("Too slow to read ring buffer.\n"
+ "Please try increasing the period (-c) or\n"
+ "decreasing the freq (-F) or\n"
+ "limiting the number of CPUs (-C)\n");
+ }
perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_EMPTY);
perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_RUNNING);
}
--
2.5.5

2017-09-29 14:48:55

by Liang, Kan

[permalink] [raw]
Subject: [PATCH RFC V4 3/6] perf top: implement multithreading for perf_event__synthesize_threads

From: Kan Liang <[email protected]>

The proc files which is sorted with alphabetical order are evenly
assigned to several synthesize threads to be processed in parallel.

For perf top, the threads number hard code to online CPU number. The
following patch will introduce an option to set it.
For other perf tools, the thread number is 1. Because the process
function is not ready for multithreading, e.g.
process_synthesized_event.
This patch series only support event synthesize multithreading for perf
top. For other tools, it can be done separately later.

With multithread applied, the total processing time can get up to 1.56x
speedup on Knights Mill for perf top.

For specific single event processing, the processing time could increase
because of the lock contention. So proc_map_timeout may need to be
increased. Otherwise some proc maps will be truncated.
Based on my test, increasing the proc_map_timeout has small impact
on the total processing time. The total processing time still get 1.49x
speedup on Knights Mill after increasing the proc_map_timeout.
The patch itself doesn't increase the proc_map_timeout.

Doesn't need to implement multithreading for per task monitoring,
perf_event__synthesize_thread_map. It doesn't have performance issue.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-kvm.c | 3 +-
tools/perf/builtin-record.c | 2 +-
tools/perf/builtin-top.c | 8 +-
tools/perf/builtin-trace.c | 2 +-
tools/perf/tests/mmap-thread-lookup.c | 2 +-
tools/perf/util/event.c | 160 +++++++++++++++++++++++++++-------
tools/perf/util/event.h | 3 +-
tools/perf/util/machine.c | 8 +-
tools/perf/util/machine.h | 9 +-
9 files changed, 155 insertions(+), 42 deletions(-)

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index c747a1a..721f4f9 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -1441,7 +1441,8 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
perf_session__set_id_hdr_size(kvm->session);
ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
machine__synthesize_threads(&kvm->session->machines.host, &kvm->opts.target,
- kvm->evlist->threads, false, kvm->opts.proc_map_timeout);
+ kvm->evlist->threads, false,
+ kvm->opts.proc_map_timeout, 1);
err = kvm_live_open_events(kvm);
if (err)
goto out;
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9b379f3..234fdf4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -863,7 +863,7 @@ static int record__synthesize(struct record *rec, bool tail)

err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
process_synthesized_event, opts->sample_address,
- opts->proc_map_timeout);
+ opts->proc_map_timeout, 1);
out:
return err;
}
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index ee954bd..bc31b93 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -958,8 +958,14 @@ static int __cmd_top(struct perf_top *top)
if (perf_session__register_idle_thread(top->session) < 0)
goto out_delete;

+ perf_set_multithreaded();
+
machine__synthesize_threads(&top->session->machines.host, &opts->target,
- top->evlist->threads, false, opts->proc_map_timeout);
+ top->evlist->threads, false,
+ opts->proc_map_timeout,
+ (unsigned int)sysconf(_SC_NPROCESSORS_ONLN));
+
+ perf_set_singlethreaded();

if (perf_hpp_list.socket) {
ret = perf_env__read_cpu_topology_map(&perf_env);
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 967bd35..afef6fe 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1131,7 +1131,7 @@ static int trace__symbols_init(struct trace *trace, struct perf_evlist *evlist)

err = __machine__synthesize_threads(trace->host, &trace->tool, &trace->opts.target,
evlist->threads, trace__tool_process, false,
- trace->opts.proc_map_timeout);
+ trace->opts.proc_map_timeout, 1);
if (err)
symbol__exit();

diff --git a/tools/perf/tests/mmap-thread-lookup.c b/tools/perf/tests/mmap-thread-lookup.c
index f94a419..2a0068a 100644
--- a/tools/perf/tests/mmap-thread-lookup.c
+++ b/tools/perf/tests/mmap-thread-lookup.c
@@ -131,7 +131,7 @@ static int synth_all(struct machine *machine)
{
return perf_event__synthesize_threads(NULL,
perf_event__process,
- machine, 0, 500);
+ machine, 0, 500, 1);
}

static int synth_process(struct machine *machine)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 10366b8..0e678dd 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -678,23 +678,21 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
return err;
}

-int perf_event__synthesize_threads(struct perf_tool *tool,
- perf_event__handler_t process,
- struct machine *machine,
- bool mmap_data,
- unsigned int proc_map_timeout)
+static int __perf_event__synthesize_threads(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ bool mmap_data,
+ unsigned int proc_map_timeout,
+ struct dirent **dirent,
+ int start,
+ int num)
{
union perf_event *comm_event, *mmap_event, *fork_event;
union perf_event *namespaces_event;
- char proc_path[PATH_MAX];
- struct dirent **dirent;
int err = -1;
char *end;
pid_t pid;
- int n, i;
-
- if (machine__is_default_guest(machine))
- return 0;
+ int i;

comm_event = malloc(sizeof(comm_event->comm) + machine->id_hdr_size);
if (comm_event == NULL)
@@ -714,34 +712,25 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
if (namespaces_event == NULL)
goto out_free_fork;

- snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
- n = scandir(proc_path, &dirent, 0, alphasort);
-
- if (n < 0)
- goto out_free_namespaces;
-
- for (i = 0; i < n; i++) {
+ for (i = start; i < start + num; i++) {
if (!isdigit(dirent[i]->d_name[0]))
continue;

pid = (pid_t)strtol(dirent[i]->d_name, &end, 10);
/* only interested in proper numerical dirents */
- if (!*end) {
- /*
- * We may race with exiting thread, so don't stop just because
- * one thread couldn't be synthesized.
- */
- __event__synthesize_thread(comm_event, mmap_event, fork_event,
- namespaces_event, pid, 1, process,
- tool, machine, mmap_data,
- proc_map_timeout);
- }
- free(dirent[i]);
+ if (*end)
+ continue;
+ /*
+ * We may race with exiting thread, so don't stop just because
+ * one thread couldn't be synthesized.
+ */
+ __event__synthesize_thread(comm_event, mmap_event, fork_event,
+ namespaces_event, pid, 1, process,
+ tool, machine, mmap_data,
+ proc_map_timeout);
}
- free(dirent);
err = 0;

-out_free_namespaces:
free(namespaces_event);
out_free_fork:
free(fork_event);
@@ -753,6 +742,115 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
return err;
}

+struct synthesize_threads_arg {
+ struct perf_tool *tool;
+ perf_event__handler_t process;
+ struct machine *machine;
+ bool mmap_data;
+ unsigned int proc_map_timeout;
+ struct dirent **dirent;
+ int num;
+ int start;
+};
+
+static void *synthesize_threads_worker(void *arg)
+{
+ struct synthesize_threads_arg *args = arg;
+
+ __perf_event__synthesize_threads(args->tool, args->process,
+ args->machine, args->mmap_data,
+ args->proc_map_timeout, args->dirent,
+ args->start, args->num);
+ return NULL;
+}
+
+int perf_event__synthesize_threads(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ bool mmap_data,
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize)
+{
+ struct synthesize_threads_arg *args = NULL;
+ pthread_t *synthesize_threads = NULL;
+ char proc_path[PATH_MAX];
+ struct dirent **dirent;
+ int num_per_thread;
+ int m, n, i, j;
+ int thread_nr;
+ int base = 0;
+ int err = -1;
+
+
+ if (machine__is_default_guest(machine))
+ return 0;
+
+ snprintf(proc_path, sizeof(proc_path), "%s/proc", machine->root_dir);
+ n = scandir(proc_path, &dirent, 0, alphasort);
+ if (n < 0)
+ return err;
+
+ thread_nr = nr_threads_synthesize;
+
+ if (thread_nr <= 1) {
+ err = __perf_event__synthesize_threads(tool, process,
+ machine, mmap_data,
+ proc_map_timeout,
+ dirent, base, n);
+ goto free_dirent;
+ }
+ if (thread_nr > n)
+ thread_nr = n;
+
+ synthesize_threads = calloc(sizeof(pthread_t), thread_nr);
+ if (synthesize_threads == NULL)
+ goto free_dirent;
+
+ args = calloc(sizeof(*args), thread_nr);
+ if (args == NULL)
+ goto free_threads;
+
+ num_per_thread = n / thread_nr;
+ m = n % thread_nr;
+ for (i = 0; i < thread_nr; i++) {
+ args[i].tool = tool;
+ args[i].process = process;
+ args[i].machine = machine;
+ args[i].mmap_data = mmap_data;
+ args[i].proc_map_timeout = proc_map_timeout;
+ args[i].dirent = dirent;
+ }
+ for (i = 0; i < m; i++) {
+ args[i].num = num_per_thread + 1;
+ args[i].start = i * args[i].num;
+ }
+ if (i != 0)
+ base = args[i-1].start + args[i-1].num;
+ for (j = i; j < thread_nr; j++) {
+ args[j].num = num_per_thread;
+ args[j].start = base + (j - i) * args[i].num;
+ }
+
+ for (i = 0; i < thread_nr; i++) {
+ if (pthread_create(&synthesize_threads[i], NULL,
+ synthesize_threads_worker, &args[i]))
+ goto out_join;
+ }
+ err = 0;
+out_join:
+ for (i = 0; i < thread_nr; i++)
+ pthread_join(synthesize_threads[i], NULL);
+ free(args);
+free_threads:
+ free(synthesize_threads);
+free_dirent:
+ for (i = 0; i < n; i++)
+ free(dirent[i]);
+ free(dirent);
+
+ return err;
+}
+
struct process_symbol_args {
const char *name;
u64 start;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ee7bcc8..d6cbb0a 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -680,7 +680,8 @@ int perf_event__synthesize_cpu_map(struct perf_tool *tool,
int perf_event__synthesize_threads(struct perf_tool *tool,
perf_event__handler_t process,
struct machine *machine, bool mmap_data,
- unsigned int proc_map_timeout);
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize);
int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
perf_event__handler_t process,
struct machine *machine);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 585b4a3..7c3aa47 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2218,12 +2218,16 @@ int machines__for_each_thread(struct machines *machines,
int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
struct target *target, struct thread_map *threads,
perf_event__handler_t process, bool data_mmap,
- unsigned int proc_map_timeout)
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize)
{
if (target__has_task(target))
return perf_event__synthesize_thread_map(tool, threads, process, machine, data_mmap, proc_map_timeout);
else if (target__has_cpu(target))
- return perf_event__synthesize_threads(tool, process, machine, data_mmap, proc_map_timeout);
+ return perf_event__synthesize_threads(tool, process,
+ machine, data_mmap,
+ proc_map_timeout,
+ nr_threads_synthesize);
/* command specified */
return 0;
}
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index b1cd516..c6a299e 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -257,15 +257,18 @@ int machines__for_each_thread(struct machines *machines,
int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
struct target *target, struct thread_map *threads,
perf_event__handler_t process, bool data_mmap,
- unsigned int proc_map_timeout);
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize);
static inline
int machine__synthesize_threads(struct machine *machine, struct target *target,
struct thread_map *threads, bool data_mmap,
- unsigned int proc_map_timeout)
+ unsigned int proc_map_timeout,
+ unsigned int nr_threads_synthesize)
{
return __machine__synthesize_threads(machine, NULL, target, threads,
perf_event__process, data_mmap,
- proc_map_timeout);
+ proc_map_timeout,
+ nr_threads_synthesize);
}

pid_t machine__get_current_tid(struct machine *machine, int cpu);
--
2.5.5

2017-09-29 14:48:54

by Liang, Kan

[permalink] [raw]
Subject: [PATCH RFC V4 4/6] perf top: add option to set the number of thread for event synthesize

From: Kan Liang <[email protected]>

Using UINT_MAX to indicate the default thread#, which is the max number
of online CPU.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-top.txt | 3 +++
tools/perf/builtin-top.c | 11 ++++++++---
tools/perf/util/event.c | 5 ++++-
tools/perf/util/top.h | 1 +
4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index d864ea6..c3cc9a3 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -240,6 +240,9 @@ Default is to monitor all CPUS.
--force::
Don't do ownership validation.

+--num-thread-synthesize::
+ The number of threads to run event synthesize.
+ By default, the number of threads equals to the online CPU number.

INTERACTIVE PROMPTING KEYS
--------------------------
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index bc31b93..477a869 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -958,14 +958,16 @@ static int __cmd_top(struct perf_top *top)
if (perf_session__register_idle_thread(top->session) < 0)
goto out_delete;

- perf_set_multithreaded();
+ if (top->nr_threads_synthesize > 1)
+ perf_set_multithreaded();

machine__synthesize_threads(&top->session->machines.host, &opts->target,
top->evlist->threads, false,
opts->proc_map_timeout,
- (unsigned int)sysconf(_SC_NPROCESSORS_ONLN));
+ top->nr_threads_synthesize);

- perf_set_singlethreaded();
+ if (top->nr_threads_synthesize > 1)
+ perf_set_singlethreaded();

if (perf_hpp_list.socket) {
ret = perf_env__read_cpu_topology_map(&perf_env);
@@ -1118,6 +1120,7 @@ int cmd_top(int argc, const char **argv)
},
.max_stack = sysctl_perf_event_max_stack,
.sym_pcnt_filter = 5,
+ .nr_threads_synthesize = UINT_MAX,
};
struct record_opts *opts = &top.record_opts;
struct target *target = &opts->target;
@@ -1227,6 +1230,8 @@ int cmd_top(int argc, const char **argv)
OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
"Show entries in a hierarchy"),
OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
+ OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
+ "number of thread to run event synthesize"),
OPT_END()
};
const char * const top_usage[] = {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 0e678dd..47eff47 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -790,7 +790,10 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
if (n < 0)
return err;

- thread_nr = nr_threads_synthesize;
+ if (nr_threads_synthesize == UINT_MAX)
+ thread_nr = sysconf(_SC_NPROCESSORS_ONLN);
+ else
+ thread_nr = nr_threads_synthesize;

if (thread_nr <= 1) {
err = __perf_event__synthesize_threads(tool, process,
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index 9bdfb78..f4296e1 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -37,6 +37,7 @@ struct perf_top {
int sym_pcnt_filter;
const char *sym_filter;
float min_percent;
+ unsigned int nr_threads_synthesize;
};

#define CONSOLE_CLEAR ""
--
2.5.5

2017-09-29 14:48:53

by Liang, Kan

[permalink] [raw]
Subject: [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode

From: Kan Liang <[email protected]>

perf_top__mmap_read has severe performance issue in
Knights Landing/Mill, when monitoring in heavy load system. It costs
several minutes to finish, which is unacceptable.

perf top was overwrite mode. But it is changed to non overwrite mode
since commit 93fc64f14472 ("perf top: Switch to non overwrite mode").
For non overwrite mode, it tries to read everything in the ring buffer
and does not check the messup. Once there are lots of samples delivered
shortly, the processing time could be very long.
Knights Landing/Mill as a manycore processor contains a large number of
small cores. Because of the huge core number, it will generated lots of
samples in a heavy load system. Also, since the huge sample#, the mmap
writer probably bite the tail and mess up the samples.

Also, to avoid the problems which is described in
commit 9ecda41acb97 ("perf/core: Add ::write_backward attribute to perf
event"), switch to backward overwrite mode.
Pausing the ring-buffer during perf_top__mmap_read to ensure the
ring-buffer is stable.
There would be some records lost in backward overwrite mode. Removing
the lost events checking.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-top.c | 21 +++++++++------------
tools/perf/ui/browsers/hists.c | 12 +++++++++---
2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 477a869..03090d0 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -276,16 +276,6 @@ static void perf_top__print_sym_table(struct perf_top *top)

printf("%-*.*s\n", win_width, win_width, graph_dotted_line);

- if (hists->stats.nr_lost_warned !=
- hists->stats.nr_events[PERF_RECORD_LOST]) {
- hists->stats.nr_lost_warned =
- hists->stats.nr_events[PERF_RECORD_LOST];
- color_fprintf(stdout, PERF_COLOR_RED,
- "WARNING: LOST %d chunks, Check IO/CPU overload",
- hists->stats.nr_lost_warned);
- ++printed;
- }
-
if (top->sym_filter_entry) {
perf_top__show_details(top);
return;
@@ -802,6 +792,7 @@ static void perf_event__process_sample(struct perf_tool *tool,

static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
{
+ struct perf_mmap *md = &top->evlist->backward_mmap[idx];
struct perf_sample sample;
struct perf_evsel *evsel;
struct perf_session *session = top->session;
@@ -809,7 +800,9 @@ static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
struct machine *machine;
int ret;

- while ((event = perf_evlist__mmap_read(top->evlist, idx)) != NULL) {
+ perf_mmap__read_catchup(md);
+
+ while ((event = perf_mmap__read_backward(md)) != NULL) {
ret = perf_evlist__parse_sample(top->evlist, event, &sample);
if (ret) {
pr_err("Can't parse sample, err = %d\n", ret);
@@ -872,8 +865,11 @@ static void perf_top__mmap_read(struct perf_top *top)
{
int i;

+ perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_DATA_PENDING);
for (i = 0; i < top->evlist->nr_mmaps; i++)
perf_top__mmap_read_idx(top, i);
+ perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_EMPTY);
+ perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_RUNNING);
}

static int perf_top__start_counters(struct perf_top *top)
@@ -902,7 +898,7 @@ static int perf_top__start_counters(struct perf_top *top)
}
}

- if (perf_evlist__mmap(evlist, opts->mmap_pages, false) < 0) {
+ if (perf_evlist__mmap(evlist, opts->mmap_pages, opts->overwrite) < 0) {
ui__error("Failed to mmap with %d (%s)\n",
errno, str_error_r(errno, msg, sizeof(msg)));
goto out_err;
@@ -1117,6 +1113,7 @@ int cmd_top(int argc, const char **argv)
.uses_mmap = true,
},
.proc_map_timeout = 500,
+ .overwrite = 1,
},
.max_stack = sysctl_perf_event_max_stack,
.sym_pcnt_filter = 5,
diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index 13dfb0a..7419c05 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -712,8 +712,13 @@ int hist_browser__run(struct hist_browser *browser, const char *help)
nr_entries = hist_browser__nr_entries(browser);
ui_browser__update_nr_entries(&browser->b, nr_entries);

- if (browser->hists->stats.nr_lost_warned !=
- browser->hists->stats.nr_events[PERF_RECORD_LOST]) {
+ /*
+ * Don't print lost events warning for perf top,
+ * because it is backward overwrite mode.
+ * Perf top is the only tool which has hbt timer.
+ */
+ if ((browser->hists->stats.nr_lost_warned !=
+ browser->hists->stats.nr_events[PERF_RECORD_LOST]) && !hbt) {
browser->hists->stats.nr_lost_warned =
browser->hists->stats.nr_events[PERF_RECORD_LOST];
ui_browser__warn_lost_events(&browser->b);
@@ -3358,7 +3363,8 @@ static int perf_evsel_menu__run(struct perf_evsel_menu *menu,
case K_TIMER:
hbt->timer(hbt->arg);

- if (!menu->lost_events_warned && menu->lost_events) {
+ if (!menu->lost_events_warned && menu->lost_events &&
+ !hbt) {
ui_browser__warn_lost_events(&menu->b);
menu->lost_events_warned = true;
}
--
2.5.5

2017-09-29 14:49:38

by Liang, Kan

[permalink] [raw]
Subject: [PATCH RFC V4 1/6] perf tools: lock to protect namespaces and comm list

From: Kan Liang <[email protected]>

Add two locks to protect namespaces_list and comm_list.

The lock is only needed for multithreaded code, so using mutex wrappers
provided by perf tool.

Not all the comm_list/namespaces_list accessing are protected, e.g.
thread__exec_comm. Because the multithread code for perf top event
synthesizing does not touch them. They don't need a lock.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/thread.c | 53 +++++++++++++++++++++++++++++++++++++++++++-----
tools/perf/util/thread.h | 3 +++
2 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index c09bdb5..bf73117 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -45,6 +45,8 @@ struct thread *thread__new(pid_t pid, pid_t tid)
thread->cpu = -1;
INIT_LIST_HEAD(&thread->namespaces_list);
INIT_LIST_HEAD(&thread->comm_list);
+ init_rwsem(&thread->namespaces_lock);
+ init_rwsem(&thread->comm_lock);

comm_str = malloc(32);
if (!comm_str)
@@ -83,18 +85,26 @@ void thread__delete(struct thread *thread)
map_groups__put(thread->mg);
thread->mg = NULL;
}
+ down_write(&thread->namespaces_lock);
list_for_each_entry_safe(namespaces, tmp_namespaces,
&thread->namespaces_list, list) {
list_del(&namespaces->list);
namespaces__free(namespaces);
}
+ up_write(&thread->namespaces_lock);
+
+ down_write(&thread->comm_lock);
list_for_each_entry_safe(comm, tmp_comm, &thread->comm_list, list) {
list_del(&comm->list);
comm__free(comm);
}
+ up_write(&thread->comm_lock);
+
unwind__finish_access(thread);
nsinfo__zput(thread->nsinfo);

+ exit_rwsem(&thread->namespaces_lock);
+ exit_rwsem(&thread->comm_lock);
free(thread);
}

@@ -125,8 +135,8 @@ struct namespaces *thread__namespaces(const struct thread *thread)
return list_first_entry(&thread->namespaces_list, struct namespaces, list);
}

-int thread__set_namespaces(struct thread *thread, u64 timestamp,
- struct namespaces_event *event)
+static int __thread__set_namespaces(struct thread *thread, u64 timestamp,
+ struct namespaces_event *event)
{
struct namespaces *new, *curr = thread__namespaces(thread);

@@ -149,6 +159,17 @@ int thread__set_namespaces(struct thread *thread, u64 timestamp,
return 0;
}

+int thread__set_namespaces(struct thread *thread, u64 timestamp,
+ struct namespaces_event *event)
+{
+ int ret;
+
+ down_write(&thread->namespaces_lock);
+ ret = __thread__set_namespaces(thread, timestamp, event);
+ up_write(&thread->namespaces_lock);
+ return ret;
+}
+
struct comm *thread__comm(const struct thread *thread)
{
if (list_empty(&thread->comm_list))
@@ -170,8 +191,8 @@ struct comm *thread__exec_comm(const struct thread *thread)
return last;
}

-int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
- bool exec)
+static int ____thread__set_comm(struct thread *thread, const char *str,
+ u64 timestamp, bool exec)
{
struct comm *new, *curr = thread__comm(thread);

@@ -195,6 +216,17 @@ int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
return 0;
}

+int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
+ bool exec)
+{
+ int ret;
+
+ down_write(&thread->comm_lock);
+ ret = ____thread__set_comm(thread, str, timestamp, exec);
+ up_write(&thread->comm_lock);
+ return ret;
+}
+
int thread__set_comm_from_proc(struct thread *thread)
{
char path[64];
@@ -212,7 +244,7 @@ int thread__set_comm_from_proc(struct thread *thread)
return err;
}

-const char *thread__comm_str(const struct thread *thread)
+static const char *__thread__comm_str(const struct thread *thread)
{
const struct comm *comm = thread__comm(thread);

@@ -222,6 +254,17 @@ const char *thread__comm_str(const struct thread *thread)
return comm__str(comm);
}

+const char *thread__comm_str(const struct thread *thread)
+{
+ const char *str;
+
+ down_read((struct rw_semaphore *)&thread->comm_lock);
+ str = __thread__comm_str(thread);
+ up_read((struct rw_semaphore *)&thread->comm_lock);
+
+ return str;
+}
+
/* CHECKME: it should probably better return the max comm len from its comm list */
int thread__comm_len(struct thread *thread)
{
diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index cb1a5dd..10555d6 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -9,6 +9,7 @@
#include "symbol.h"
#include <strlist.h>
#include <intlist.h>
+#include "rwsem.h"

struct thread_stack;
struct unwind_libunwind_ops;
@@ -29,7 +30,9 @@ struct thread {
int comm_len;
bool dead; /* if set thread has exited */
struct list_head namespaces_list;
+ struct rw_semaphore namespaces_lock;
struct list_head comm_list;
+ struct rw_semaphore comm_lock;
u64 db_id;

void *priv;
--
2.5.5

2017-09-29 14:49:37

by Liang, Kan

[permalink] [raw]
Subject: [PATCH RFC V4 2/6] perf tools: lock to protect comm_str rb tree

From: Kan Liang <[email protected]>

Add comm_str_lock to protect comm_str rb tree.

The lock is only needed for multithreaded code, so using mutex wrappers
provided by perf tool.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/comm.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/comm.c b/tools/perf/util/comm.c
index 7bc981b..756a9c1 100644
--- a/tools/perf/util/comm.c
+++ b/tools/perf/util/comm.c
@@ -5,6 +5,7 @@
#include <stdio.h>
#include <string.h>
#include <linux/refcount.h>
+#include "rwsem.h"

struct comm_str {
char *str;
@@ -14,6 +15,7 @@ struct comm_str {

/* Should perhaps be moved to struct machine */
static struct rb_root comm_str_root;
+static struct rw_semaphore comm_str_lock = {.lock = PTHREAD_RWLOCK_INITIALIZER,};

static struct comm_str *comm_str__get(struct comm_str *cs)
{
@@ -25,7 +27,9 @@ static struct comm_str *comm_str__get(struct comm_str *cs)
static void comm_str__put(struct comm_str *cs)
{
if (cs && refcount_dec_and_test(&cs->refcnt)) {
+ down_write(&comm_str_lock);
rb_erase(&cs->rb_node, &comm_str_root);
+ up_write(&comm_str_lock);
zfree(&cs->str);
free(cs);
}
@@ -50,7 +54,8 @@ static struct comm_str *comm_str__alloc(const char *str)
return cs;
}

-static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
+static
+struct comm_str *__comm_str__findnew(const char *str, struct rb_root *root)
{
struct rb_node **p = &root->rb_node;
struct rb_node *parent = NULL;
@@ -81,6 +86,17 @@ static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
return new;
}

+static struct comm_str *comm_str__findnew(const char *str, struct rb_root *root)
+{
+ struct comm_str *cs;
+
+ down_write(&comm_str_lock);
+ cs = __comm_str__findnew(str, root);
+ up_write(&comm_str_lock);
+
+ return cs;
+}
+
struct comm *comm__new(const char *str, u64 timestamp, bool exec)
{
struct comm *comm = zalloc(sizeof(*comm));
--
2.5.5

2017-09-30 19:48:52

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH RFC V4 5/6] perf top: switch to backward overwrite mode

On Fri, Sep 29, 2017 at 07:47:56AM -0700, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> perf_top__mmap_read has severe performance issue in
> Knights Landing/Mill, when monitoring in heavy load system. It costs
> several minutes to finish, which is unacceptable.
>
> perf top was overwrite mode. But it is changed to non overwrite mode
> since commit 93fc64f14472 ("perf top: Switch to non overwrite mode").
> For non overwrite mode, it tries to read everything in the ring buffer
> and does not check the messup. Once there are lots of samples delivered
> shortly, the processing time could be very long.
> Knights Landing/Mill as a manycore processor contains a large number of
> small cores. Because of the huge core number, it will generated lots of
> samples in a heavy load system. Also, since the huge sample#, the mmap
> writer probably bite the tail and mess up the samples.
>
> Also, to avoid the problems which is described in
> commit 9ecda41acb97 ("perf/core: Add ::write_backward attribute to perf
> event"), switch to backward overwrite mode.
> Pausing the ring-buffer during perf_top__mmap_read to ensure the
> ring-buffer is stable.
> There would be some records lost in backward overwrite mode. Removing
> the lost events checking.

I'm getting perf top hogging the cpu completely with this change

jirka