2019-11-07 18:20:09

by Andi Kleen

Subject: Optimize perf stat for large number of events/cpus

[v5: Address review feedback. Split patches.]

This patch kit optimizes perf stat for a large number of events
on systems with many CPUs and PMUs.

Profiling shows that most of the overhead comes from IPIs to
all the target CPUs. We can optimize this by using sched_setaffinity
to set the affinity to a target CPU once and then doing
the perf operation for all events on that CPU. This requires
some restructuring, but cuts the setup time quite a bit.
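
As a rough illustration of the batching idea (not the code from this
patch kit), the sketch below pins the calling thread to each allowed CPU
in turn and runs a callback there, so all per-CPU work happens locally.
The names for_each_allowed_cpu, per_cpu_fn and count_cpu are made up for
this example; only sched_getaffinity/sched_setaffinity and the CPU_*
macros are real glibc/Linux interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical callback: does all work for one CPU while pinned to it. */
typedef void (*per_cpu_fn)(int cpu, void *arg);

/*
 * Visit every CPU in the caller's current affinity mask, pin the calling
 * thread to it and run fn there. Returns the number of CPUs visited, or
 * -1 if the current mask cannot be read. The original mask is restored
 * before returning.
 */
int for_each_allowed_cpu(per_cpu_fn fn, void *arg)
{
	cpu_set_t orig, one;
	int cpu, visited = 0;

	assert(fn != NULL);
	if (sched_getaffinity(0, sizeof(orig), &orig) < 0)
		return -1;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &orig))
			continue;
		CPU_ZERO(&one);
		CPU_SET(cpu, &one);
		if (sched_setaffinity(0, sizeof(one), &one) < 0)
			continue;	/* e.g. CPU went offline: skip it */
		fn(cpu, arg);		/* all events for this CPU go here */
		visited++;
	}
	sched_setaffinity(0, sizeof(orig), &orig);
	return visited;
}

/* Example callback that only counts how often it was called. */
void count_cpu(int cpu, void *arg)
{
	(void)cpu;
	(*(int *)arg)++;
}
```

The point of the shape is that the affinity switch happens once per CPU
rather than once per event, which is where the saving comes from.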

In theory we could go further by parallelizing these setups
too, but that would be much more complicated and for now just batching it
per CPU seems to be sufficient. At some point with many more cores
parallelization or a better bulk perf setup API might be needed though.

In addition perf does a lot of redundant /sys accesses with
many PMUs, which can also be expensive. This is also optimized.

On a large test case (>700 events with many weak groups) on a 94 CPU
system I go from

real 0m8.607s
user 0m0.550s
sys 0m8.041s

to

real 0m3.269s
user 0m0.760s
sys 0m1.694s

so shaving ~6 seconds of system time, at slightly more cost
in perf stat itself. On a 4 socket system the savings
are more dramatic:

real 0m15.641s
user 0m0.873s
sys 0m14.729s

to

real 0m4.493s
user 0m1.578s
sys 0m2.444s

so 11s difference in the user visible set up time.

Also available in

git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/stat-scale-8

v1: Initial post.
v2: Rebase. Fix some minor issues.
v3: Rebase. Address review feedback. Fix one minor issue
v4: Modified based on review feedback. Now it maintains
all_cpus per evlist. There is still a need for cpu_index iteration
to get the correct index for indexing the file descriptors.
Fix bug with unsorted cpu maps, now they are always sorted.
Some cleanups and refactoring.
v5: Split patches. Redo loop iteration again. Fix cpu map
merging for uncore. Remove duplicates from cpumaps. Add unit
tests.

-Andi


2019-11-07 18:20:12

by Andi Kleen

Subject: [PATCH v5 10/13] perf stat: Use affinity for opening events

From: Andi Kleen <[email protected]>

Restructure the event opening in perf stat to cycle through
the events by CPU after setting affinity to that CPU.
This eliminates IPI overhead in the perf API.

We have to loop through the CPUs in the outer builtin-stat
code instead of leaving that to low level functions.

This changes the weak group fallback strategy slightly.
Since we cannot easily undo the opens for other CPUs,
move the weak group retry to a separate loop.
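
To make the two-pass strategy concrete, here is a toy model of it. This
is only a sketch under simplifying assumptions: there is a single weak
group containing every event, the "kernel" is faked by toy_open(), and
all names (toy_event, toy_open, toy_open_all) are invented for the
example rather than taken from perf.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   4
#define NR_EVENTS 3

/* Toy stand-in for an event; this is not perf's struct evsel. */
struct toy_event {
	bool in_group;       /* still opened as a member of the weak group */
	bool reset_group;    /* group open failed somewhere: redo standalone */
	bool open[NR_CPUS];  /* per-CPU "file descriptor" state */
};

/* Pretend kernel: group opens fail on fail_cpu, standalone opens succeed. */
static bool toy_open(struct toy_event *ev, int cpu, int fail_cpu)
{
	if (ev->in_group && cpu == fail_cpu)
		return false;
	ev->open[cpu] = true;
	return true;
}

/* Open all events on all CPUs. Returns true if a second pass was needed. */
bool toy_open_all(struct toy_event *evs, int fail_cpu)
{
	bool second_pass = false;
	bool ok;
	int cpu, i, j;

	/* Pass 1: CPU-major order. On a group failure only mark, don't undo. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		for (i = 0; i < NR_EVENTS; i++) {
			if (evs[i].reset_group)
				continue;
			if (toy_open(&evs[i], cpu, fail_cpu))
				continue;
			for (j = 0; j < NR_EVENTS; j++) {
				evs[j].in_group = false;
				evs[j].reset_group = true;
			}
			second_pass = true;
		}
	}
	if (!second_pass)
		return false;

	/* Pass 2: per CPU, close stale group opens, then reopen standalone. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		for (i = 0; i < NR_EVENTS; i++)
			if (evs[i].reset_group)
				evs[i].open[cpu] = false;
		for (i = 0; i < NR_EVENTS; i++) {
			if (!evs[i].reset_group)
				continue;
			ok = toy_open(&evs[i], cpu, fail_cpu);
			assert(ok);
			(void)ok;
		}
	}
	return true;
}
```

Deferring the close to the second pass means each close also runs while
the thread is on the matching CPU, instead of reaching across CPUs from
the middle of the first pass.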

Before with a large test case with 94 CPUs:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
42.75 4.050910 67 60046 110 perf_event_open

After:

26.86 0.944396 16 58069 110 perf_event_open

(the number changes slightly because the weak group retries
work differently and the test case relies on weak groups)

Signed-off-by: Andi Kleen <[email protected]>

---

v2: Use new iterator macros.
Fix bug that caused unnecessary retry for errored events.
Add extra assert to check assumption that cpumaps are always subsets
v3:
Use new iterator macros
Factored out code movement for error handling.
v4:
Update iterator macros again
Fix minor bug with errored events
---
tools/perf/builtin-record.c | 2 +-
tools/perf/builtin-stat.c | 118 ++++++++++++++++++++++++++++++------
tools/perf/util/evlist.c | 6 +-
tools/perf/util/evlist.h | 3 +-
tools/perf/util/evsel.h | 2 +
tools/perf/util/stat.c | 5 +-
tools/perf/util/stat.h | 3 +-
7 files changed, 113 insertions(+), 26 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 2fb83aabbef5..9f8a9393ce4a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -776,7 +776,7 @@ static int record__open(struct record *rec)
if ((errno == EINVAL || errno == EBADF) &&
pos->leader != pos &&
pos->weak_group) {
- pos = perf_evlist__reset_weak_group(evlist, pos);
+ pos = perf_evlist__reset_weak_group(evlist, pos, true);
goto try_again;
}
rc = -errno;
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 1a586009e5a7..7f9ec41d8f62 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -65,6 +65,7 @@
#include "util/target.h"
#include "util/time-utils.h"
#include "util/top.h"
+#include "util/affinity.h"
#include "asm/bug.h"

#include <linux/time64.h>
@@ -440,6 +441,7 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
ui__warning("%s event is not supported by the kernel.\n",
perf_evsel__name(counter));
counter->supported = false;
+ counter->errored = true;

if ((counter->leader != counter) ||
!(counter->leader->core.nr_members > 1))
@@ -484,6 +486,9 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
int status = 0;
const bool forks = (argc > 0);
bool is_pipe = STAT_RECORD ? perf_stat.data.is_pipe : false;
+ struct affinity affinity;
+ int i, cpu;
+ bool second_pass = false;

if (interval) {
ts.tv_sec = interval / USEC_PER_MSEC;
@@ -508,30 +513,105 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
if (group)
perf_evlist__set_leader(evsel_list);

- evlist__for_each_entry(evsel_list, counter) {
+ if (affinity__setup(&affinity) < 0)
+ return -1;
+
+ evlist__for_each_cpu (evsel_list, i, cpu) {
+ affinity__set(&affinity, cpu);
+
+ evlist__for_each_entry(evsel_list, counter) {
+ if (evsel__cpu_iter_skip(counter, cpu))
+ continue;
+ if (counter->reset_group || counter->errored)
+ continue;
try_again:
- if (create_perf_stat_counter(counter, &stat_config, &target) < 0) {
-
- /* Weak group failed. Reset the group. */
- if ((errno == EINVAL || errno == EBADF) &&
- counter->leader != counter &&
- counter->weak_group) {
- counter = perf_evlist__reset_weak_group(evsel_list, counter);
- goto try_again;
+ if (create_perf_stat_counter(counter, &stat_config, &target,
+ counter->cpu_iter - 1) < 0) {
+
+ /*
+ * Weak group failed. We cannot just undo this here
+ * because earlier CPUs might be in group mode, and the kernel
+ * doesn't support mixing group and non group reads. Defer
+ * it to later.
+ * Don't close here because we're in the wrong affinity.
+ */
+ if ((errno == EINVAL || errno == EBADF) &&
+ counter->leader != counter &&
+ counter->weak_group) {
+ perf_evlist__reset_weak_group(evsel_list, counter, false);
+ assert(counter->reset_group);
+ second_pass = true;
+ continue;
+ }
+
+ switch (stat_handle_error(counter)) {
+ case COUNTER_FATAL:
+ return -1;
+ case COUNTER_RETRY:
+ goto try_again;
+ case COUNTER_SKIP:
+ continue;
+ default:
+ break;
+ }
+
}
+ counter->supported = true;
+ }
+ }

- switch (stat_handle_error(counter)) {
- case COUNTER_FATAL:
- return -1;
- case COUNTER_RETRY:
- goto try_again;
- case COUNTER_SKIP:
- continue;
- default:
- break;
+ if (second_pass) {
+ /*
+ * Now redo all the weak group after closing them,
+ * and also close errored counters.
+ */
+
+ evlist__cpu_iter_start(evsel_list);
+ evlist__for_each_cpu (evsel_list, i, cpu) {
+ affinity__set(&affinity, cpu);
+ /* First close errored or weak retry */
+ evlist__for_each_entry(evsel_list, counter) {
+ if (!counter->reset_group && !counter->errored)
+ continue;
+ if (evsel__cpu_iter_skip_no_inc(counter, cpu))
+ continue;
+ perf_evsel__close_cpu(&counter->core, counter->cpu_iter);
}
+ /* Now reopen weak */
+ evlist__for_each_entry(evsel_list, counter) {
+ if (!counter->reset_group && !counter->errored)
+ continue;
+ if (evsel__cpu_iter_skip(counter, cpu))
+ continue;
+ if (!counter->reset_group)
+ continue;
+try_again_reset:
+ pr_debug2("reopening weak %s\n", perf_evsel__name(counter));
+ if (create_perf_stat_counter(counter, &stat_config, &target,
+ counter->cpu_iter - 1) < 0) {
+
+ switch (stat_handle_error(counter)) {
+ case COUNTER_FATAL:
+ return -1;
+ case COUNTER_RETRY:
+ goto try_again_reset;
+ case COUNTER_SKIP:
+ continue;
+ default:
+ break;
+ }
+ }
+ counter->supported = true;
+ }
+ }
+ }
+ affinity__cleanup(&affinity);
+
+ evlist__for_each_entry(evsel_list, counter) {
+ if (!counter->supported) {
+ perf_evsel__free_fd(&counter->core);
+ continue;
}
- counter->supported = true;

l = strlen(counter->unit);
if (l > stat_config.unit_width)
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 0dcea66329e2..33080f79b977 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1632,7 +1632,8 @@ void perf_evlist__force_leader(struct evlist *evlist)
}

struct evsel *perf_evlist__reset_weak_group(struct evlist *evsel_list,
- struct evsel *evsel)
+ struct evsel *evsel,
+ bool close)
{
struct evsel *c2, *leader;
bool is_open = true;
@@ -1649,10 +1650,11 @@ struct evsel *perf_evlist__reset_weak_group(struct evlist *evsel_list,
if (c2 == evsel)
is_open = false;
if (c2->leader == leader) {
- if (is_open)
+ if (is_open && close)
perf_evsel__close(&c2->core);
c2->leader = c2;
c2->core.nr_members = 0;
+ c2->reset_group = true;
}
}
return leader;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 12606efc1f7c..ad77091d1e1e 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -355,5 +355,6 @@ bool perf_evlist__exclude_kernel(struct evlist *evlist);
void perf_evlist__force_leader(struct evlist *evlist);

struct evsel *perf_evlist__reset_weak_group(struct evlist *evlist,
- struct evsel *evsel);
+ struct evsel *evsel,
+ bool close);
#endif /* __PERF_EVLIST_H */
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 54513d70c109..ca82a93960cd 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -94,6 +94,8 @@ struct evsel {
struct evsel *metric_leader;
bool collect_stat;
bool weak_group;
+ bool reset_group;
+ bool errored;
bool percore;
int cpu_iter;
const char *pmu_name;
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 36dc95032e4c..3aebe732e886 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -463,7 +463,8 @@ size_t perf_event__fprintf_stat_config(union perf_event *event, FILE *fp)

int create_perf_stat_counter(struct evsel *evsel,
struct perf_stat_config *config,
- struct target *target)
+ struct target *target,
+ int cpu)
{
struct perf_event_attr *attr = &evsel->core.attr;
struct evsel *leader = evsel->leader;
@@ -517,7 +518,7 @@ int create_perf_stat_counter(struct evsel *evsel,
}

if (target__has_cpu(target) && !target__has_per_thread(target))
- return perf_evsel__open_per_cpu(evsel, evsel__cpus(evsel), -1);
+ return perf_evsel__open_per_cpu(evsel, evsel__cpus(evsel), cpu);

return perf_evsel__open_per_thread(evsel, evsel->core.threads);
}
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 081c4a5113c6..4c9a7b68c3e7 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -213,7 +213,8 @@ size_t perf_event__fprintf_stat_config(union perf_event *event, FILE *fp);

int create_perf_stat_counter(struct evsel *evsel,
struct perf_stat_config *config,
- struct target *target);
+ struct target *target,
+ int cpu);
void
perf_evlist__print_counters(struct evlist *evlist,
struct perf_stat_config *config,
--
2.23.0

2019-11-07 18:20:31

by Andi Kleen

Subject: [PATCH v5 01/13] perf pmu: Use file system cache to optimize sysfs access

From: Andi Kleen <[email protected]>

pmu.c does a lot of redundant /sys accesses while parsing aliases
and probing for PMUs. On large systems with a lot of PMUs this
can get expensive (>2s):

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
27.25 1.227847 8 160888 16976 openat
26.42 1.190481 7 164224 164077 stat

Add a cache to remember if specific file names exist or don't
exist, which eliminates most of this overhead.

Also convert some stat() calls to the slightly cheaper access().
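
For readers who want to see the effect of such a cache in isolation, here
is a small standalone sketch. It is not the fncache.c added by this patch:
it uses a fixed array instead of a hash table, and cached_exists,
real_lookups, DEMO_SLOTS and DEMO_NAME_LEN are names invented for the
example. Like the patch, it caches negative results too.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

#define DEMO_SLOTS    64
#define DEMO_NAME_LEN 256

struct demo_slot {
	bool used;
	bool res;                  /* cached result, positive or negative */
	char name[DEMO_NAME_LEN];
};

static struct demo_slot demo_cache[DEMO_SLOTS];
int real_lookups;                  /* how often access() was really called */

bool cached_exists(const char *name)
{
	size_t len;
	bool res;
	int i;

	assert(name != NULL);
	for (i = 0; i < DEMO_SLOTS; i++) {
		if (demo_cache[i].used && !strcmp(demo_cache[i].name, name))
			return demo_cache[i].res;
	}

	real_lookups++;
	res = access(name, R_OK) == 0;

	/* Remember the answer if the name fits and a slot is free. */
	len = strlen(name);
	for (i = 0; len < DEMO_NAME_LEN && i < DEMO_SLOTS; i++) {
		if (demo_cache[i].used)
			continue;
		memcpy(demo_cache[i].name, name, len + 1);
		demo_cache[i].res = res;
		demo_cache[i].used = true;
		break;
	}
	return res;
}
```

As with the patch, nothing is ever evicted, so this only makes sense when
the set of names is bounded in some other way (here: a fixed number of
sysfs paths per PMU).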

Resulting in:

0.18 0.004166 2 1851 305 open
0.08 0.001970 2 829 622 access

Signed-off-by: Andi Kleen <[email protected]>

---

v2: Use single lookup function as API (Jiri)
---
tools/perf/util/Build | 1 +
tools/perf/util/fncache.c | 63 +++++++++++++++++++++++++++++++++++++++
tools/perf/util/fncache.h | 7 +++++
tools/perf/util/pmu.c | 34 +++++++--------------
tools/perf/util/srccode.c | 9 +-----
5 files changed, 83 insertions(+), 31 deletions(-)
create mode 100644 tools/perf/util/fncache.c
create mode 100644 tools/perf/util/fncache.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 39814b1806a6..2c1504fe924c 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -48,6 +48,7 @@ perf-y += header.o
perf-y += callchain.o
perf-y += values.o
perf-y += debug.o
+perf-y += fncache.o
perf-y += machine.o
perf-y += map.o
perf-y += pstack.o
diff --git a/tools/perf/util/fncache.c b/tools/perf/util/fncache.c
new file mode 100644
index 000000000000..5afcd7edbe7a
--- /dev/null
+++ b/tools/perf/util/fncache.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Manage a cache of file names' existence */
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <linux/list.h>
+#include "fncache.h"
+
+struct fncache {
+ struct hlist_node nd;
+ bool res;
+ char name[];
+};
+
+#define FNHSIZE 61
+
+static struct hlist_head fncache_hash[FNHSIZE];
+
+unsigned shash(const unsigned char *s)
+{
+ unsigned h = 0;
+ while (*s)
+ h = 65599 * h + *s++;
+ return h ^ (h >> 16);
+}
+
+static bool lookup_fncache(const char *name, bool *res)
+{
+ int h = shash((const unsigned char *)name) % FNHSIZE;
+ struct fncache *n;
+
+ hlist_for_each_entry (n, &fncache_hash[h], nd) {
+ if (!strcmp(n->name, name)) {
+ *res = n->res;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void update_fncache(const char *name, bool res)
+{
+ struct fncache *n = malloc(sizeof(struct fncache) + strlen(name) + 1);
+ int h = shash((const unsigned char *)name) % FNHSIZE;
+
+ if (!n)
+ return;
+ strcpy(n->name, name);
+ n->res = res;
+ hlist_add_head(&n->nd, &fncache_hash[h]);
+}
+
+/* No LRU, only use when bounded in some other way. */
+bool file_available(const char *name)
+{
+ bool res;
+
+ if (lookup_fncache(name, &res))
+ return res;
+ res = access(name, R_OK) == 0;
+ update_fncache(name, res);
+ return res;
+}
diff --git a/tools/perf/util/fncache.h b/tools/perf/util/fncache.h
new file mode 100644
index 000000000000..fe020beaefb1
--- /dev/null
+++ b/tools/perf/util/fncache.h
@@ -0,0 +1,7 @@
+#ifndef _FCACHE_H
+#define _FCACHE_H 1
+
+unsigned shash(const unsigned char *s);
+bool file_available(const char *name);
+
+#endif
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index adbe97e941dd..81357cc3d59a 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -24,6 +24,7 @@
#include "pmu-events/pmu-events.h"
#include "string2.h"
#include "strbuf.h"
+#include "fncache.h"

struct perf_pmu_format {
char *name;
@@ -82,7 +83,6 @@ int perf_pmu__format_parse(char *dir, struct list_head *head)
*/
static int pmu_format(const char *name, struct list_head *format)
{
- struct stat st;
char path[PATH_MAX];
const char *sysfs = sysfs__mountpoint();

@@ -92,8 +92,8 @@ static int pmu_format(const char *name, struct list_head *format)
snprintf(path, PATH_MAX,
"%s" EVENT_SOURCE_DEVICE_PATH "%s/format", sysfs, name);

- if (stat(path, &st) < 0)
- return 0; /* no error if format does not exist */
+ if (!file_available(path))
+ return 0;

if (perf_pmu__format_parse(path, format))
return -1;
@@ -475,7 +475,6 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
*/
static int pmu_aliases(const char *name, struct list_head *head)
{
- struct stat st;
char path[PATH_MAX];
const char *sysfs = sysfs__mountpoint();

@@ -485,8 +484,8 @@ static int pmu_aliases(const char *name, struct list_head *head)
snprintf(path, PATH_MAX,
"%s/bus/event_source/devices/%s/events", sysfs, name);

- if (stat(path, &st) < 0)
- return 0; /* no error if 'events' does not exist */
+ if (!file_available(path))
+ return 0;

if (pmu_aliases_parse(path, head))
return -1;
@@ -525,7 +524,6 @@ static int pmu_alias_terms(struct perf_pmu_alias *alias,
*/
static int pmu_type(const char *name, __u32 *type)
{
- struct stat st;
char path[PATH_MAX];
FILE *file;
int ret = 0;
@@ -537,7 +535,7 @@ static int pmu_type(const char *name, __u32 *type)
snprintf(path, PATH_MAX,
"%s" EVENT_SOURCE_DEVICE_PATH "%s/type", sysfs, name);

- if (stat(path, &st) < 0)
+ if (access(path, R_OK) < 0)
return -1;

file = fopen(path, "r");
@@ -628,14 +626,11 @@ static struct perf_cpu_map *pmu_cpumask(const char *name)
static bool pmu_is_uncore(const char *name)
{
char path[PATH_MAX];
- struct perf_cpu_map *cpus;
- const char *sysfs = sysfs__mountpoint();
+ const char *sysfs;

+ sysfs = sysfs__mountpoint();
snprintf(path, PATH_MAX, CPUS_TEMPLATE_UNCORE, sysfs, name);
- cpus = __pmu_cpumask(path);
- perf_cpu_map__put(cpus);
-
- return !!cpus;
+ return file_available(path);
}

/*
@@ -645,7 +640,6 @@ static bool pmu_is_uncore(const char *name)
*/
static int is_arm_pmu_core(const char *name)
{
- struct stat st;
char path[PATH_MAX];
const char *sysfs = sysfs__mountpoint();

@@ -655,10 +649,7 @@ static int is_arm_pmu_core(const char *name)
/* Look for cpu sysfs (specific to arm) */
scnprintf(path, PATH_MAX, "%s/bus/event_source/devices/%s/cpus",
sysfs, name);
- if (stat(path, &st) == 0)
- return 1;
-
- return 0;
+ return file_available(path);
}

static char *perf_pmu__getcpuid(struct perf_pmu *pmu)
@@ -1528,7 +1519,6 @@ bool pmu_have_event(const char *pname, const char *name)

static FILE *perf_pmu__open_file(struct perf_pmu *pmu, const char *name)
{
- struct stat st;
char path[PATH_MAX];
const char *sysfs;

@@ -1538,10 +1528,8 @@ static FILE *perf_pmu__open_file(struct perf_pmu *pmu, const char *name)

snprintf(path, PATH_MAX,
"%s" EVENT_SOURCE_DEVICE_PATH "%s/%s", sysfs, pmu->name, name);
-
- if (stat(path, &st) < 0)
+ if (!file_available(path))
return NULL;
-
return fopen(path, "r");
}

diff --git a/tools/perf/util/srccode.c b/tools/perf/util/srccode.c
index d84ed8b6caaa..c29edaaca863 100644
--- a/tools/perf/util/srccode.c
+++ b/tools/perf/util/srccode.c
@@ -16,6 +16,7 @@
#include "srccode.h"
#include "debug.h"
#include <internal/lib.h> // page_size
+#include "fncache.h"

#define MAXSRCCACHE (32*1024*1024)
#define MAXSRCFILES 64
@@ -36,14 +37,6 @@ static LIST_HEAD(srcfile_list);
static long map_total_sz;
static int num_srcfiles;

-static unsigned shash(unsigned char *s)
-{
- unsigned h = 0;
- while (*s)
- h = 65599 * h + *s++;
- return h ^ (h >> 16);
-}
-
static int countlines(char *map, int maplen)
{
int numl;
--
2.23.0

2019-11-07 18:21:13

by Andi Kleen

Subject: [PATCH v5 07/13] perf stat: Use affinity for closing file descriptors

From: Andi Kleen <[email protected]>

Closing a perf fd can also trigger an IPI to the target CPU.
Use the same affinity technique for closing that we already use for
reading/enabling events, to reduce the CPU transitions.

Before on a large test case with 94 CPUs:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
32.56 3.085463 50 61483 close

After:

10.54 0.735704 11 61485 close

Signed-off-by: Andi Kleen <[email protected]>

---

v2: Use new iterator macros
v3: Use new iterator macros
Add missing affinity__cleanup
v4:
Update iterators again
---
tools/perf/util/evlist.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index dae6e846b2f8..0dcea66329e2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -18,6 +18,7 @@
#include "debug.h"
#include "units.h"
#include <internal/lib.h> // page_size
+#include "affinity.h"
#include "../perf.h"
#include "asm/bug.h"
#include "bpf-event.h"
@@ -1169,9 +1170,31 @@ void perf_evlist__set_selected(struct evlist *evlist,
void evlist__close(struct evlist *evlist)
{
struct evsel *evsel;
+ struct affinity affinity;
+ int cpu, i;

- evlist__for_each_entry_reverse(evlist, evsel)
- evsel__close(evsel);
+ if (!evlist->core.cpus) {
+ evlist__for_each_entry_reverse(evlist, evsel)
+ evsel__close(evsel);
+ return;
+ }
+
+ if (affinity__setup(&affinity) < 0)
+ return;
+ evlist__for_each_cpu (evlist, i, cpu) {
+ affinity__set(&affinity, cpu);
+
+ evlist__for_each_entry_reverse(evlist, evsel) {
+ if (evsel__cpu_iter_skip(evsel, cpu))
+ continue;
+ perf_evsel__close_cpu(&evsel->core, evsel->cpu_iter - 1);
+ }
+ }
+ affinity__cleanup(&affinity);
+ evlist__for_each_entry_reverse(evlist, evsel) {
+ perf_evsel__free_fd(&evsel->core);
+ perf_evsel__free_id(&evsel->core);
+ }
}

static int perf_evlist__create_syswide_maps(struct evlist *evlist)
--
2.23.0

2019-11-07 18:21:53

by Andi Kleen

Subject: [PATCH v5 05/13] perf evsel: Add iterator to iterate over events ordered by CPU

From: Andi Kleen <[email protected]>

Add some common code that is needed to iterate over all events
in CPU order. Used in follow-on patches.
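
The iteration scheme can be modelled outside of perf roughly as follows.
This is a sketch with invented names (toy_evsel, toy_cpu_iter_skip,
toy_walk), assuming every event's CPU list is sorted, free of duplicates
and a subset of the global list. Each event keeps its own cursor because
its index space differs from the global one.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an event with its own CPU list. */
struct toy_evsel {
	const int *cpus;  /* sorted, duplicate-free, subset of all_cpus */
	int nr;           /* number of entries in cpus[] */
	int cpu_iter;     /* next unvisited index into cpus[] */
};

/* Returns true if the event has nothing to do on cpu, else advances. */
bool toy_cpu_iter_skip(struct toy_evsel *ev, int cpu)
{
	if (ev->cpu_iter >= ev->nr)
		return true;
	if (ev->cpus[ev->cpu_iter] != cpu)
		return true;
	ev->cpu_iter++;
	return false;
}

/* Walk all CPUs in order; returns how many (event, cpu) pairs were hit. */
int toy_walk(struct toy_evsel *evs, int nr_evs,
	     const int *all_cpus, int nr_cpus)
{
	int c, i, touched = 0;

	assert(nr_evs >= 0 && nr_cpus >= 0);
	for (i = 0; i < nr_evs; i++)
		evs[i].cpu_iter = 0;

	for (c = 0; c < nr_cpus; c++) {
		for (i = 0; i < nr_evs; i++) {
			if (toy_cpu_iter_skip(&evs[i], all_cpus[c]))
				continue;
			/* evs[i].cpu_iter - 1 is this event's own index */
			touched++;
		}
	}
	return touched;
}
```

With an unsorted per-event list the cursor would get stuck behind an
entry that never matches, which is why the scheme depends on ordered
cpumaps.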

Signed-off-by: Andi Kleen <[email protected]>

---

v2: Add cpumap__for_each_cpu macro to factor out some common code
v3: Drop cpumap__for_each_cpu macro again, replace with evlist__for_each_cpu
Add new evlist__for_each_cpu
Don't compute cpus nr in cpu_index iterator init, just use all_cpus
v4:
Remove __next, move into skip
Add _no_inc
Move initialization into iterator macro
Rename cpu_index to cpu_iter
---
tools/perf/util/cpumap.h | 1 +
tools/perf/util/evlist.c | 32 ++++++++++++++++++++++++++++++++
tools/perf/util/evlist.h | 8 ++++++++
tools/perf/util/evsel.h | 1 +
4 files changed, 42 insertions(+)

diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index 2553bef1279d..dbc1d7e949ed 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -60,4 +60,5 @@ int cpu_map__build_map(struct perf_cpu_map *cpus, struct perf_cpu_map **res,

int cpu_map__cpu(struct perf_cpu_map *cpus, int idx);
bool cpu_map__has(struct perf_cpu_map *cpus, int cpu);
+
#endif /* __PERF_CPUMAP_H */
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index fdce590d2278..dae6e846b2f8 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -342,6 +342,38 @@ static int perf_evlist__nr_threads(struct evlist *evlist,
return perf_thread_map__nr(evlist->core.threads);
}

+void evlist__cpu_iter_start(struct evlist *evlist)
+{
+ struct evsel *pos;
+
+ /*
+ * Reset the per evsel cpu_iter. This is needed because
+ * each evsel's cpumap may have a different index space,
+ * and some operations need the index to modify
+ * the FD xyarray (e.g. open, close)
+ */
+ evlist__for_each_entry(evlist, pos)
+ pos->cpu_iter = 0;
+}
+
+bool evsel__cpu_iter_skip_no_inc(struct evsel *ev, int cpu)
+{
+ if (ev->cpu_iter >= ev->core.cpus->nr)
+ return true;
+ if (cpu >= 0 && ev->core.cpus->map[ev->cpu_iter] != cpu)
+ return true;
+ return false;
+}
+
+bool evsel__cpu_iter_skip(struct evsel *ev, int cpu)
+{
+ if (!evsel__cpu_iter_skip_no_inc(ev, cpu)) {
+ ev->cpu_iter++;
+ return false;
+ }
+ return true;
+}
+
void evlist__disable(struct evlist *evlist)
{
struct evsel *pos;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 13051409fd22..12606efc1f7c 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -333,9 +333,17 @@ void perf_evlist__to_front(struct evlist *evlist,
#define evlist__for_each_entry_safe(evlist, tmp, evsel) \
__evlist__for_each_entry_safe(&(evlist)->core.entries, tmp, evsel)

+#define evlist__for_each_cpu(evlist, index, cpu) \
+ evlist__cpu_iter_start(evlist); \
+ perf_cpu_map__for_each_cpu (cpu, index, (evlist)->core.all_cpus)
+
void perf_evlist__set_tracking_event(struct evlist *evlist,
struct evsel *tracking_evsel);

+void evlist__cpu_iter_start(struct evlist *evlist);
+bool evsel__cpu_iter_skip(struct evsel *ev, int cpu);
+bool evsel__cpu_iter_skip_no_inc(struct evsel *ev, int cpu);
+
struct evsel *
perf_evlist__find_evsel_by_str(struct evlist *evlist, const char *str);

diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ddc5ee6f6592..b10d5ba21966 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -95,6 +95,7 @@ struct evsel {
bool collect_stat;
bool weak_group;
bool percore;
+ int cpu_iter;
const char *pmu_name;
struct {
perf_evsel__sb_cb_t *cb;
--
2.23.0

2019-11-07 18:22:01

by Andi Kleen

Subject: [PATCH v5 03/13] perf cpumap: Maintain cpumaps ordered and without dups

From: Andi Kleen <[email protected]>

Sort the map and remove duplicates in cpu_map__trim_new().

Needed for a follow-on change.
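
The sort-then-compact step can be tried standalone with the sketch below.
The function names sort_unique and cmp_int_safe are made up for this
example. The comparator here uses comparisons instead of subtraction;
subtraction is fine for CPU numbers, which are small and non-negative,
but can overflow for arbitrary int values.

```c
#include <assert.h>
#include <stdlib.h>

/* Three-way compare without the overflow risk of "a - b". */
static int cmp_int_safe(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}

/* Sort v[0..n) in place and drop duplicates. Returns the new length. */
int sort_unique(int *v, int n)
{
	int i, j = 0;

	assert(n >= 0);
	if (n > 1)
		qsort(v, n, sizeof(int), cmp_int_safe);
	for (i = 0; i < n; i++) {
		/* After sorting, duplicates are adjacent: keep the first. */
		if (i == 0 || v[i] != v[i - 1])
			v[j++] = v[i];
	}
	return j;
}
```

The in-place compaction is safe because the write index j never runs
ahead of the read index i.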

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/lib/cpumap.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/tools/perf/lib/cpumap.c b/tools/perf/lib/cpumap.c
index 2ca1fafa620d..d81656b4635e 100644
--- a/tools/perf/lib/cpumap.c
+++ b/tools/perf/lib/cpumap.c
@@ -68,14 +68,28 @@ static struct perf_cpu_map *cpu_map__default_new(void)
return cpus;
}

+static int cmp_int(const void *a, const void *b)
+{
+ return *(const int *)a - *(const int*)b;
+}
+
static struct perf_cpu_map *cpu_map__trim_new(int nr_cpus, int *tmp_cpus)
{
size_t payload_size = nr_cpus * sizeof(int);
struct perf_cpu_map *cpus = malloc(sizeof(*cpus) + payload_size);
+ int i, j;

if (cpus != NULL) {
- cpus->nr = nr_cpus;
memcpy(cpus->map, tmp_cpus, payload_size);
+ qsort(cpus->map, nr_cpus, sizeof(int), cmp_int);
+ /* Remove dups */
+ j = 0;
+ for (i = 0; i < nr_cpus; i++) {
+ if (i == 0 || cpus->map[i] != cpus->map[i - 1])
+ cpus->map[j++] = cpus->map[i];
+ }
+ cpus->nr = j;
+ assert(j <= nr_cpus);
refcount_set(&cpus->refcnt, 1);
}

--
2.23.0

2019-11-11 13:31:56

by Jiri Olsa

Subject: Re: [PATCH v5 10/13] perf stat: Use affinity for opening events

On Thu, Nov 07, 2019 at 10:16:43AM -0800, Andi Kleen wrote:

SNIP

> + * Don't close here because we're in the wrong affinity.
> + */
> + if ((errno == EINVAL || errno == EBADF) &&
> + counter->leader != counter &&
> + counter->weak_group) {
> + perf_evlist__reset_weak_group(evsel_list, counter, false);
> + assert(counter->reset_group);
> + second_pass = true;
> + continue;
> + }
> +
> + switch (stat_handle_error(counter)) {
> + case COUNTER_FATAL:
> + return -1;
> + case COUNTER_RETRY:
> + goto try_again;
> + case COUNTER_SKIP:
> + continue;
> + default:
> + break;
> + }
> +
> }
> + counter->supported = true;
> + }
> + }
>
> - switch (stat_handle_error(counter)) {
> - case COUNTER_FATAL:
> - return -1;
> - case COUNTER_RETRY:
> - goto try_again;
> - case COUNTER_SKIP:
> - continue;
> - default:
> - break;
> + if (second_pass) {
> + /*
> + * Now redo all the weak group after closing them,
> + * and also close errored counters.
> + */
> +
> + evlist__cpu_iter_start(evsel_list);
> + evlist__for_each_cpu (evsel_list, i, cpu) {

no need for evlist__cpu_iter_start call in here?

jirka

> + affinity__set(&affinity, cpu);
> + /* First close errored or weak retry */
> + evlist__for_each_entry(evsel_list, counter) {
> + if (!counter->reset_group && !counter->errored)
> + continue;
> + if (evsel__cpu_iter_skip_no_inc(counter, cpu))
> + continue;
> + perf_evsel__close_cpu(&counter->core, counter->cpu_iter);
> }
> + /* Now reopen weak */
> + evlist__for_each_entry(evsel_list, counter) {
> + if (!counter->reset_group && !counter->errored)

SNIP

2019-11-11 13:33:39

by Jiri Olsa

Subject: Re: [PATCH v5 10/13] perf stat: Use affinity for opening events

On Thu, Nov 07, 2019 at 10:16:43AM -0800, Andi Kleen wrote:

SNIP

> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -1632,7 +1632,8 @@ void perf_evlist__force_leader(struct evlist *evlist)
> }
>
> struct evsel *perf_evlist__reset_weak_group(struct evlist *evsel_list,
> - struct evsel *evsel)
> + struct evsel *evsel,
> + bool close)
> {
> struct evsel *c2, *leader;
> bool is_open = true;
> @@ -1649,10 +1650,11 @@ struct evsel *perf_evlist__reset_weak_group(struct evlist *evsel_list,
> if (c2 == evsel)
> is_open = false;
> if (c2->leader == leader) {
> - if (is_open)
> + if (is_open && close)
> perf_evsel__close(&c2->core);
> c2->leader = c2;
> c2->core.nr_members = 0;
> + c2->reset_group = true;

so it's only set to true and stays.. please explain the logic
in comment.. together with errored

thanks,
jirka

2019-11-11 13:34:25

by Jiri Olsa

Subject: Re: [PATCH v5 10/13] perf stat: Use affinity for opening events

On Thu, Nov 07, 2019 at 10:16:43AM -0800, Andi Kleen wrote:

SNIP

> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 2fb83aabbef5..9f8a9393ce4a 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -776,7 +776,7 @@ static int record__open(struct record *rec)
> if ((errno == EINVAL || errno == EBADF) &&
> pos->leader != pos &&
> pos->weak_group) {
> - pos = perf_evlist__reset_weak_group(evlist, pos);
> + pos = perf_evlist__reset_weak_group(evlist, pos, true);
> goto try_again;
> }
> rc = -errno;
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 1a586009e5a7..7f9ec41d8f62 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -65,6 +65,7 @@
> #include "util/target.h"
> #include "util/time-utils.h"
> #include "util/top.h"
> +#include "util/affinity.h"
> #include "asm/bug.h"
>
> #include <linux/time64.h>
> @@ -440,6 +441,7 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
> ui__warning("%s event is not supported by the kernel.\n",
> perf_evsel__name(counter));
> counter->supported = false;
> + counter->errored = true;

how is errored different from supported?
why can't you use it?

jirka

2019-11-11 13:34:54

by Jiri Olsa

Subject: Re: [PATCH v5 07/13] perf stat: Use affinity for closing file descriptors

On Thu, Nov 07, 2019 at 10:16:40AM -0800, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> Closing a perf fd can also trigger an IPI to the target CPU.
> Use the same affinity technique as we use for reading/enabling events
> to closing to optimize the CPU transitions.
>
> Before on a large test case with 94 CPUs:
>
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 32.56 3.085463 50 61483 close
>
> After:
>
> 10.54 0.735704 11 61485 close
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
>
> v2: Use new iterator macros
> v3: Use new iterator macros
> Add missing affinity__cleanup
> v4:
> Update iterators again
> ---
> tools/perf/util/evlist.c | 27 +++++++++++++++++++++++++--
> 1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index dae6e846b2f8..0dcea66329e2 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -18,6 +18,7 @@
> #include "debug.h"
> #include "units.h"
> #include <internal/lib.h> // page_size
> +#include "affinity.h"
> #include "../perf.h"
> #include "asm/bug.h"
> #include "bpf-event.h"
> @@ -1169,9 +1170,31 @@ void perf_evlist__set_selected(struct evlist *evlist,
> void evlist__close(struct evlist *evlist)
> {
> struct evsel *evsel;
> + struct affinity affinity;
> + int cpu, i;
>
> - evlist__for_each_entry_reverse(evlist, evsel)
> - evsel__close(evsel);
> + if (!evlist->core.cpus) {

should this be evlist->all_cpus?

jirka

> + evlist__for_each_entry_reverse(evlist, evsel)
> + evsel__close(evsel);
> + return;
> + }
> +
> + if (affinity__setup(&affinity) < 0)
> + return;
> + evlist__for_each_cpu (evlist, i, cpu) {
> + affinity__set(&affinity, cpu);
> +
> + evlist__for_each_entry_reverse(evlist, evsel) {
> + if (evsel__cpu_iter_skip(evsel, cpu))
> + continue;
> + perf_evsel__close_cpu(&evsel->core, evsel->cpu_iter - 1);
> + }
> + }
> + affinity__cleanup(&affinity);
> + evlist__for_each_entry_reverse(evlist, evsel) {
> + perf_evsel__free_fd(&evsel->core);
> + perf_evsel__free_id(&evsel->core);
> + }
> }
>
> static int perf_evlist__create_syswide_maps(struct evlist *evlist)
> --
> 2.23.0
>

2019-11-11 16:57:28

by Andi Kleen

Subject: Re: [PATCH v5 07/13] perf stat: Use affinity for closing file descriptors

On Mon, Nov 11, 2019 at 02:30:52PM +0100, Jiri Olsa wrote:
> On Thu, Nov 07, 2019 at 10:16:40AM -0800, Andi Kleen wrote:
> > From: Andi Kleen <[email protected]>
> >
> > Closing a perf fd can also trigger an IPI to the target CPU.
> > Use the same affinity technique as we use for reading/enabling events
> > to closing to optimize the CPU transitions.
> >
> > Before on a large test case with 94 CPUs:
> >
> > % time seconds usecs/call calls errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> > 32.56 3.085463 50 61483 close
> >
> > After:
> >
> > 10.54 0.735704 11 61485 close
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> >
> > ---
> >
> > v2: Use new iterator macros
> > v3: Use new iterator macros
> > Add missing affinity__cleanup
> > v4:
> > Update iterators again
> > ---
> > tools/perf/util/evlist.c | 27 +++++++++++++++++++++++++--
> > 1 file changed, 25 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index dae6e846b2f8..0dcea66329e2 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -18,6 +18,7 @@
> > #include "debug.h"
> > #include "units.h"
> > #include <internal/lib.h> // page_size
> > +#include "affinity.h"
> > #include "../perf.h"
> > #include "asm/bug.h"
> > #include "bpf-event.h"
> > @@ -1169,9 +1170,31 @@ void perf_evlist__set_selected(struct evlist *evlist,
> > void evlist__close(struct evlist *evlist)
> > {
> > struct evsel *evsel;
> > + struct affinity affinity;
> > + int cpu, i;
> >
> > - evlist__for_each_entry_reverse(evlist, evsel)
> > - evsel__close(evsel);
> > + if (!evlist->core.cpus) {
>
> should this be evlist->all_cpus?

This detects perf record essentially. I had some problems with perf record
in early testing, so I just disabled it, since I was just focussing
on stat. all_cpus would be set for perf record too.

-Andi

2019-11-11 17:06:00

by Andi Kleen

Subject: Re: [PATCH v5 10/13] perf stat: Use affinity for opening events

> > #include <linux/time64.h>
> > @@ -440,6 +441,7 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
> > ui__warning("%s event is not supported by the kernel.\n",
> > perf_evsel__name(counter));
> > counter->supported = false;
> > + counter->errored = true;
>
> how is errored different from supported?
> why can't you use it?

errored means that the event is still partially open, while supported means it is
closed. While I guess it could be combined it seems cleaner to keep them
separate.

-Andi