2019-11-16 05:55:22

by Andi Kleen

Subject: Optimize perf stat for large number of events/cpus

[v7: Address review feedback. Fix python script problem
reported by 0day. Drop merged patches.]

This patch kit optimizes perf stat for a large number of events
on systems with many CPUs and PMUs.

Some profiling shows that most of the overhead is doing IPIs to
all the target CPUs. We can optimize this by using sched_setaffinity
to set the affinity to a target CPU once and then doing
the perf operations for all events on that CPU. This requires
some restructuring, but cuts the setup time quite a bit.
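
Roughly, the restructured loops in the patches below look like this
(a simplified sketch; the real code also tracks a per-event index
into the file descriptor arrays):

	struct affinity affinity;
	int i, cpu;

	if (affinity__setup(&affinity) < 0)	/* saves the original affinity mask */
		return;
	evlist__for_each_cpu (evlist, i, cpu) {
		affinity__set(&affinity, cpu);	/* one sched_setaffinity() per CPU */
		/* open/enable/disable/close all events for this CPU here;
		 * the syscalls now run CPU-locally instead of sending IPIs.
		 */
	}
	affinity__cleanup(&affinity);		/* restore the original affinity mask */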

In theory we could go further by parallelizing these setups
too, but that would be much more complicated, and for now batching
per CPU seems to be sufficient. At some point, with many more cores,
parallelization or a better bulk perf setup API might be needed though.

In addition, perf does a lot of redundant /sys accesses when there
are many PMUs, which can also be expensive. This is also optimized.

On a large test case (>700 events with many weak groups) on a 94-CPU
system I go from

real 0m8.607s
user 0m0.550s
sys 0m8.041s

to

real 0m3.269s
user 0m0.760s
sys 0m1.694s

so shaving ~6 seconds off the system time, at slightly more cost
in perf stat itself. On a 4-socket system the savings
are more dramatic:

real 0m15.641s
user 0m0.873s
sys 0m14.729s

to

real 0m4.493s
user 0m1.578s
sys 0m2.444s

so an ~11s difference in the user-visible setup time.

Also available in

git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/stat-scale-10

v1: Initial post.
v2: Rebase. Fix some minor issues.
v3: Rebase. Address review feedback. Fix one minor issue.
v4: Modified based on review feedback. Now it maintains
all_cpus per evlist. There is still a need for cpu_index iteration
to get the correct index for indexing the file descriptors.
Fix bug with unsorted cpu maps; now they are always sorted.
Some cleanups and refactoring.
v5: Split patches. Redo loop iteration again. Fix cpu map
merging for uncore. Remove duplicates from cpumaps. Add unit
tests.
v6: Address review feedback. Fix some bugs. Add more comments.
Merge one invalid patch split.
v7: Address review feedback. Fix python scripting (thanks 0day)
Minor updates.

-Andi


2019-11-16 05:55:25

by Andi Kleen

Subject: [PATCH v7 01/12] perf pmu: Use file system cache to optimize sysfs access

From: Andi Kleen <[email protected]>

pmu.c does a lot of redundant /sys accesses while parsing aliases
and probing for PMUs. On large systems with a lot of PMUs this
can get expensive (>2s):

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
27.25 1.227847 8 160888 16976 openat
26.42 1.190481 7 164224 164077 stat

Add a cache that remembers whether specific file names exist or not,
which eliminates most of this overhead.

Also replace some stat() calls with the slightly cheaper access().

Resulting in:

0.18 0.004166 2 1851 305 open
0.08 0.001970 2 829 622 access

Signed-off-by: Andi Kleen <[email protected]>

---

v2: Use single lookup function as API (Jiri)
---
tools/perf/util/Build | 1 +
tools/perf/util/fncache.c | 63 +++++++++++++++++++++++++++++++++++++++
tools/perf/util/fncache.h | 7 +++++
tools/perf/util/pmu.c | 34 +++++++--------------
tools/perf/util/srccode.c | 9 +-----
5 files changed, 83 insertions(+), 31 deletions(-)
create mode 100644 tools/perf/util/fncache.c
create mode 100644 tools/perf/util/fncache.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index b8e05a147b2b..aab05e2c01a5 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -49,6 +49,7 @@ perf-y += header.o
perf-y += callchain.o
perf-y += values.o
perf-y += debug.o
+perf-y += fncache.o
perf-y += machine.o
perf-y += map.o
perf-y += pstack.o
diff --git a/tools/perf/util/fncache.c b/tools/perf/util/fncache.c
new file mode 100644
index 000000000000..5afcd7edbe7a
--- /dev/null
+++ b/tools/perf/util/fncache.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Manage a cache of file names' existence */
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <linux/list.h>
+#include "fncache.h"
+
+struct fncache {
+ struct hlist_node nd;
+ bool res;
+ char name[];
+};
+
+#define FNHSIZE 61
+
+static struct hlist_head fncache_hash[FNHSIZE];
+
+unsigned shash(const unsigned char *s)
+{
+ unsigned h = 0;
+ while (*s)
+ h = 65599 * h + *s++;
+ return h ^ (h >> 16);
+}
+
+static bool lookup_fncache(const char *name, bool *res)
+{
+ int h = shash((const unsigned char *)name) % FNHSIZE;
+ struct fncache *n;
+
+ hlist_for_each_entry (n, &fncache_hash[h], nd) {
+ if (!strcmp(n->name, name)) {
+ *res = n->res;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void update_fncache(const char *name, bool res)
+{
+ struct fncache *n = malloc(sizeof(struct fncache) + strlen(name) + 1);
+ int h = shash((const unsigned char *)name) % FNHSIZE;
+
+ if (!n)
+ return;
+ strcpy(n->name, name);
+ n->res = res;
+ hlist_add_head(&n->nd, &fncache_hash[h]);
+}
+
+/* No LRU, only use when bounded in some other way. */
+bool file_available(const char *name)
+{
+ bool res;
+
+ if (lookup_fncache(name, &res))
+ return res;
+ res = access(name, R_OK) == 0;
+ update_fncache(name, res);
+ return res;
+}
diff --git a/tools/perf/util/fncache.h b/tools/perf/util/fncache.h
new file mode 100644
index 000000000000..fe020beaefb1
--- /dev/null
+++ b/tools/perf/util/fncache.h
@@ -0,0 +1,7 @@
+#ifndef _FCACHE_H
+#define _FCACHE_H 1
+
+unsigned shash(const unsigned char *s);
+bool file_available(const char *name);
+
+#endif
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index db1e57113f4b..65780494a290 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -24,6 +24,7 @@
#include "pmu-events/pmu-events.h"
#include "string2.h"
#include "strbuf.h"
+#include "fncache.h"

struct perf_pmu_format {
char *name;
@@ -82,7 +83,6 @@ int perf_pmu__format_parse(char *dir, struct list_head *head)
*/
static int pmu_format(const char *name, struct list_head *format)
{
- struct stat st;
char path[PATH_MAX];
const char *sysfs = sysfs__mountpoint();

@@ -92,8 +92,8 @@ static int pmu_format(const char *name, struct list_head *format)
snprintf(path, PATH_MAX,
"%s" EVENT_SOURCE_DEVICE_PATH "%s/format", sysfs, name);

- if (stat(path, &st) < 0)
- return 0; /* no error if format does not exist */
+ if (!file_available(path))
+ return 0;

if (perf_pmu__format_parse(path, format))
return -1;
@@ -475,7 +475,6 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
*/
static int pmu_aliases(const char *name, struct list_head *head)
{
- struct stat st;
char path[PATH_MAX];
const char *sysfs = sysfs__mountpoint();

@@ -485,8 +484,8 @@ static int pmu_aliases(const char *name, struct list_head *head)
snprintf(path, PATH_MAX,
"%s/bus/event_source/devices/%s/events", sysfs, name);

- if (stat(path, &st) < 0)
- return 0; /* no error if 'events' does not exist */
+ if (!file_available(path))
+ return 0;

if (pmu_aliases_parse(path, head))
return -1;
@@ -525,7 +524,6 @@ static int pmu_alias_terms(struct perf_pmu_alias *alias,
*/
static int pmu_type(const char *name, __u32 *type)
{
- struct stat st;
char path[PATH_MAX];
FILE *file;
int ret = 0;
@@ -537,7 +535,7 @@ static int pmu_type(const char *name, __u32 *type)
snprintf(path, PATH_MAX,
"%s" EVENT_SOURCE_DEVICE_PATH "%s/type", sysfs, name);

- if (stat(path, &st) < 0)
+ if (access(path, R_OK) < 0)
return -1;

file = fopen(path, "r");
@@ -628,14 +626,11 @@ static struct perf_cpu_map *pmu_cpumask(const char *name)
static bool pmu_is_uncore(const char *name)
{
char path[PATH_MAX];
- struct perf_cpu_map *cpus;
- const char *sysfs = sysfs__mountpoint();
+ const char *sysfs;

+ sysfs = sysfs__mountpoint();
snprintf(path, PATH_MAX, CPUS_TEMPLATE_UNCORE, sysfs, name);
- cpus = __pmu_cpumask(path);
- perf_cpu_map__put(cpus);
-
- return !!cpus;
+ return file_available(path);
}

/*
@@ -645,7 +640,6 @@ static bool pmu_is_uncore(const char *name)
*/
static int is_arm_pmu_core(const char *name)
{
- struct stat st;
char path[PATH_MAX];
const char *sysfs = sysfs__mountpoint();

@@ -655,10 +649,7 @@ static int is_arm_pmu_core(const char *name)
/* Look for cpu sysfs (specific to arm) */
scnprintf(path, PATH_MAX, "%s/bus/event_source/devices/%s/cpus",
sysfs, name);
- if (stat(path, &st) == 0)
- return 1;
-
- return 0;
+ return file_available(path);
}

static char *perf_pmu__getcpuid(struct perf_pmu *pmu)
@@ -1534,7 +1525,6 @@ bool pmu_have_event(const char *pname, const char *name)

static FILE *perf_pmu__open_file(struct perf_pmu *pmu, const char *name)
{
- struct stat st;
char path[PATH_MAX];
const char *sysfs;

@@ -1544,10 +1534,8 @@ static FILE *perf_pmu__open_file(struct perf_pmu *pmu, const char *name)

snprintf(path, PATH_MAX,
"%s" EVENT_SOURCE_DEVICE_PATH "%s/%s", sysfs, pmu->name, name);
-
- if (stat(path, &st) < 0)
+ if (!file_available(path))
return NULL;
-
return fopen(path, "r");
}

diff --git a/tools/perf/util/srccode.c b/tools/perf/util/srccode.c
index d84ed8b6caaa..c29edaaca863 100644
--- a/tools/perf/util/srccode.c
+++ b/tools/perf/util/srccode.c
@@ -16,6 +16,7 @@
#include "srccode.h"
#include "debug.h"
#include <internal/lib.h> // page_size
+#include "fncache.h"

#define MAXSRCCACHE (32*1024*1024)
#define MAXSRCFILES 64
@@ -36,14 +37,6 @@ static LIST_HEAD(srcfile_list);
static long map_total_sz;
static int num_srcfiles;

-static unsigned shash(unsigned char *s)
-{
- unsigned h = 0;
- while (*s)
- h = 65599 * h + *s++;
- return h ^ (h >> 16);
-}
-
static int countlines(char *map, int maplen)
{
int numl;
--
2.23.0

2019-11-16 05:55:32

by Andi Kleen

Subject: [PATCH v7 12/12] perf stat: Use affinity for enabling/disabling events

From: Andi Kleen <[email protected]>

Restructure event enabling/disabling to use affinity, which
minimizes the number of IPIs needed.

Before, on a large test case with 94 CPUs:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.65 1.899986 22 84812 660 ioctl

After:

39.21 0.930451 10 84796 644 ioctl

Signed-off-by: Andi Kleen <[email protected]>

---

v2: Use new iterator macros
v3: Use new iterator macros
v4: Update iterators again
---
tools/perf/util/evlist.c | 40 +++++++++++++++++++++++++++++++++++++---
1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 17960e4d3a45..682a1f37d244 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -378,11 +378,28 @@ bool evsel__cpu_iter_skip(struct evsel *ev, int cpu)
void evlist__disable(struct evlist *evlist)
{
struct evsel *pos;
+ struct affinity affinity;
+ int cpu, i;
+
+ if (affinity__setup(&affinity) < 0)
+ return;
+
+ evlist__for_each_cpu (evlist, i, cpu) {
+ affinity__set(&affinity, cpu);

+ evlist__for_each_entry(evlist, pos) {
+ if (evsel__cpu_iter_skip(pos, cpu))
+ continue;
+ if (pos->disabled || !perf_evsel__is_group_leader(pos) || !pos->core.fd)
+ continue;
+ evsel__disable_cpu(pos, pos->cpu_iter - 1);
+ }
+ }
+ affinity__cleanup(&affinity);
evlist__for_each_entry(evlist, pos) {
- if (pos->disabled || !perf_evsel__is_group_leader(pos) || !pos->core.fd)
+ if (!perf_evsel__is_group_leader(pos) || !pos->core.fd)
continue;
- evsel__disable(pos);
+ pos->disabled = true;
}

evlist->enabled = false;
@@ -391,11 +408,28 @@ void evlist__disable(struct evlist *evlist)
void evlist__enable(struct evlist *evlist)
{
struct evsel *pos;
+ struct affinity affinity;
+ int cpu, i;

+ if (affinity__setup(&affinity) < 0)
+ return;
+
+ evlist__for_each_cpu (evlist, i, cpu) {
+ affinity__set(&affinity, cpu);
+
+ evlist__for_each_entry(evlist, pos) {
+ if (evsel__cpu_iter_skip(pos, cpu))
+ continue;
+ if (!perf_evsel__is_group_leader(pos) || !pos->core.fd)
+ continue;
+ evsel__enable_cpu(pos, pos->cpu_iter - 1);
+ }
+ }
+ affinity__cleanup(&affinity);
evlist__for_each_entry(evlist, pos) {
if (!perf_evsel__is_group_leader(pos) || !pos->core.fd)
continue;
- evsel__enable(pos);
+ pos->disabled = false;
}

evlist->enabled = true;
--
2.23.0

2019-11-16 05:56:30

by Andi Kleen

Subject: [PATCH v7 04/12] perf evlist: Maintain evlist->all_cpus

From: Andi Kleen <[email protected]>

Maintain a cpumap in the evlist that is the union of all the cpus
of the events.

This needs a cpumap merge operation, which is added together
with tests.

Signed-off-by: Andi Kleen <[email protected]>

v2:
Add tests for cpu map merge
Fix handling of duplicates
Rename _update to _merge
Factor out sorting.
Fix handling of NULL maps in merge
v3:
Add comments and empty lines to _merge
---
tools/perf/lib/cpumap.c | 57 ++++++++++++++++++++++++
tools/perf/lib/evlist.c | 1 +
tools/perf/lib/include/internal/evlist.h | 1 +
tools/perf/lib/include/perf/cpumap.h | 2 +
tools/perf/tests/builtin-test.c | 5 +++
tools/perf/tests/cpumap.c | 16 +++++++
tools/perf/tests/tests.h | 1 +
7 files changed, 83 insertions(+)

diff --git a/tools/perf/lib/cpumap.c b/tools/perf/lib/cpumap.c
index d81656b4635e..f93f4e703e4c 100644
--- a/tools/perf/lib/cpumap.c
+++ b/tools/perf/lib/cpumap.c
@@ -286,3 +286,60 @@ int perf_cpu_map__max(struct perf_cpu_map *map)

return max;
}
+
+/*
+ * Merge two cpumaps
+ *
+ * orig either gets freed and replaced with a new map, or reused
+ * with no reference count change (similar to "realloc")
+ * other has its reference count increased.
+ */
+
+struct perf_cpu_map *perf_cpu_map__merge(struct perf_cpu_map *orig,
+ struct perf_cpu_map *other)
+{
+ int *tmp_cpus;
+ int tmp_len;
+ int i, j, k;
+ struct perf_cpu_map *merged;
+
+ if (!orig && !other)
+ return NULL;
+ if (!orig) {
+ perf_cpu_map__get(other);
+ return other;
+ }
+ if (!other)
+ return orig;
+ if (orig->nr == other->nr &&
+ !memcmp(orig->map, other->map, orig->nr * sizeof(int)))
+ return orig;
+
+ tmp_len = orig->nr + other->nr;
+ tmp_cpus = malloc(tmp_len * sizeof(int));
+ if (!tmp_cpus)
+ return NULL;
+
+ /* Standard merge algorithm from wikipedia */
+ i = j = k = 0;
+ while (i < orig->nr && j < other->nr) {
+ if (orig->map[i] <= other->map[j]) {
+ if (orig->map[i] == other->map[j])
+ j++;
+ tmp_cpus[k++] = orig->map[i++];
+ } else
+ tmp_cpus[k++] = other->map[j++];
+ }
+
+ while (i < orig->nr)
+ tmp_cpus[k++] = orig->map[i++];
+
+ while (j < other->nr)
+ tmp_cpus[k++] = other->map[j++];
+ assert(k <= tmp_len);
+
+ merged = cpu_map__trim_new(k, tmp_cpus);
+ free(tmp_cpus);
+ perf_cpu_map__put(orig);
+ return merged;
+}
diff --git a/tools/perf/lib/evlist.c b/tools/perf/lib/evlist.c
index 205ddbb80bc1..ae9e65aa2491 100644
--- a/tools/perf/lib/evlist.c
+++ b/tools/perf/lib/evlist.c
@@ -54,6 +54,7 @@ static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,

perf_thread_map__put(evsel->threads);
evsel->threads = perf_thread_map__get(evlist->threads);
+ evlist->all_cpus = perf_cpu_map__merge(evlist->all_cpus, evsel->cpus);
}

static void perf_evlist__propagate_maps(struct perf_evlist *evlist)
diff --git a/tools/perf/lib/include/internal/evlist.h b/tools/perf/lib/include/internal/evlist.h
index a2fbccf1922f..74dc8c3f0b66 100644
--- a/tools/perf/lib/include/internal/evlist.h
+++ b/tools/perf/lib/include/internal/evlist.h
@@ -18,6 +18,7 @@ struct perf_evlist {
int nr_entries;
bool has_user_cpus;
struct perf_cpu_map *cpus;
+ struct perf_cpu_map *all_cpus;
struct perf_thread_map *threads;
int nr_mmaps;
size_t mmap_len;
diff --git a/tools/perf/lib/include/perf/cpumap.h b/tools/perf/lib/include/perf/cpumap.h
index ac9aa497f84a..6a17ad730cbc 100644
--- a/tools/perf/lib/include/perf/cpumap.h
+++ b/tools/perf/lib/include/perf/cpumap.h
@@ -12,6 +12,8 @@ LIBPERF_API struct perf_cpu_map *perf_cpu_map__dummy_new(void);
LIBPERF_API struct perf_cpu_map *perf_cpu_map__new(const char *cpu_list);
LIBPERF_API struct perf_cpu_map *perf_cpu_map__read(FILE *file);
LIBPERF_API struct perf_cpu_map *perf_cpu_map__get(struct perf_cpu_map *map);
+LIBPERF_API struct perf_cpu_map *perf_cpu_map__merge(struct perf_cpu_map *orig,
+ struct perf_cpu_map *other);
LIBPERF_API void perf_cpu_map__put(struct perf_cpu_map *map);
LIBPERF_API int perf_cpu_map__cpu(const struct perf_cpu_map *cpus, int idx);
LIBPERF_API int perf_cpu_map__nr(const struct perf_cpu_map *cpus);
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 8b286e9b7549..5fa37cf7f283 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -259,6 +259,11 @@ static struct test generic_tests[] = {
.desc = "Print cpu map",
.func = test__cpu_map_print,
},
+ {
+ .desc = "Merge cpu map",
+ .func = test__cpu_map_merge,
+ },
+
{
.desc = "Probe SDT events",
.func = test__sdt_event,
diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index 8a0d236202b0..4ac56741ac5f 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -120,3 +120,19 @@ int test__cpu_map_print(struct test *test __maybe_unused, int subtest __maybe_un
TEST_ASSERT_VAL("failed to convert map", cpu_map_print("1-10,12-20,22-30,32-40"));
return 0;
}
+
+int test__cpu_map_merge(struct test *test __maybe_unused, int subtest __maybe_unused)
+{
+ struct perf_cpu_map *a = perf_cpu_map__new("4,2,1");
+ struct perf_cpu_map *b = perf_cpu_map__new("4,5,7");
+ struct perf_cpu_map *c = perf_cpu_map__merge(a, b);
+ char buf[100];
+
+ TEST_ASSERT_VAL("failed to merge map: bad nr", c->nr == 5);
+ cpu_map__snprint(c, buf, sizeof(buf));
+ TEST_ASSERT_VAL("failed to merge map: bad result", !strcmp(buf, "1-2,4-5,7"));
+ perf_cpu_map__put(a);
+ perf_cpu_map__put(b);
+ perf_cpu_map__put(c);
+ return 0;
+}
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 9837b6e93023..44b184fd869f 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -98,6 +98,7 @@ int test__event_update(struct test *test, int subtest);
int test__event_times(struct test *test, int subtest);
int test__backward_ring_buffer(struct test *test, int subtest);
int test__cpu_map_print(struct test *test, int subtest);
+int test__cpu_map_merge(struct test *test, int subtest);
int test__sdt_event(struct test *test, int subtest);
int test__is_printable_array(struct test *test, int subtest);
int test__bitmap_print(struct test *test, int subtest);
--
2.23.0

2019-11-16 05:57:23

by Andi Kleen

Subject: [PATCH v7 03/12] perf cpumap: Maintain cpumaps ordered and without dups

From: Andi Kleen <[email protected]>

Enforce this in _trim().

This is needed for a follow-on change.

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/lib/cpumap.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/tools/perf/lib/cpumap.c b/tools/perf/lib/cpumap.c
index 2ca1fafa620d..d81656b4635e 100644
--- a/tools/perf/lib/cpumap.c
+++ b/tools/perf/lib/cpumap.c
@@ -68,14 +68,28 @@ static struct perf_cpu_map *cpu_map__default_new(void)
return cpus;
}

+static int cmp_int(const void *a, const void *b)
+{
+ return *(const int *)a - *(const int*)b;
+}
+
static struct perf_cpu_map *cpu_map__trim_new(int nr_cpus, int *tmp_cpus)
{
size_t payload_size = nr_cpus * sizeof(int);
struct perf_cpu_map *cpus = malloc(sizeof(*cpus) + payload_size);
+ int i, j;

if (cpus != NULL) {
- cpus->nr = nr_cpus;
memcpy(cpus->map, tmp_cpus, payload_size);
+ qsort(cpus->map, nr_cpus, sizeof(int), cmp_int);
+ /* Remove dups */
+ j = 0;
+ for (i = 0; i < nr_cpus; i++) {
+ if (i == 0 || cpus->map[i] != cpus->map[i - 1])
+ cpus->map[j++] = cpus->map[i];
+ }
+ cpus->nr = j;
+ assert(j <= nr_cpus);
refcount_set(&cpus->refcnt, 1);
}

--
2.23.0

2019-11-16 05:57:48

by Andi Kleen

Subject: [PATCH v7 02/12] perf affinity: Add infrastructure to save/restore affinity

From: Andi Kleen <[email protected]>

The kernel perf subsystem has to send an IPI to the target CPU for many
operations. On systems with many CPUs, and when managing many events, the
overhead can be dominated by lots of IPIs.

An alternative is to set up the CPU affinity of the perf tool to the
target CPU, set up all the events for that CPU, and then move on to the
next CPU.

Add some affinity management infrastructure to enable such a model.
It is used in follow-on patches.

Signed-off-by: Andi Kleen <[email protected]>

---

v2: Use linux/bitmap.h functions.
v3: Add affinity.c to the python-ext-sources to fix the python
interface. Thanks 0day!
---
tools/perf/util/Build | 1 +
tools/perf/util/affinity.c | 72 ++++++++++++++++++++++++++++++
tools/perf/util/affinity.h | 15 +++++++
tools/perf/util/python-ext-sources | 1 +
4 files changed, 89 insertions(+)
create mode 100644 tools/perf/util/affinity.c
create mode 100644 tools/perf/util/affinity.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index aab05e2c01a5..07da6c790b63 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -77,6 +77,7 @@ perf-y += sort.o
perf-y += hist.o
perf-y += util.o
perf-y += cpumap.o
+perf-y += affinity.o
perf-y += cputopo.o
perf-y += cgroup.o
perf-y += target.o
diff --git a/tools/perf/util/affinity.c b/tools/perf/util/affinity.c
new file mode 100644
index 000000000000..e197b0416f56
--- /dev/null
+++ b/tools/perf/util/affinity.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Manage affinity to optimize IPIs inside the kernel perf API. */
+#define _GNU_SOURCE 1
+#include <sched.h>
+#include <stdlib.h>
+#include <linux/bitmap.h>
+#include "perf.h"
+#include "cpumap.h"
+#include "affinity.h"
+
+static int get_cpu_set_size(void)
+{
+ int sz = cpu__max_cpu() + 8 - 1;
+ /*
+ * sched_getaffinity doesn't like masks smaller than the kernel.
+ * Hopefully that's big enough.
+ */
+ if (sz < 4096)
+ sz = 4096;
+ return sz/8;
+}
+
+int affinity__setup(struct affinity *a)
+{
+ int cpu_set_size = get_cpu_set_size();
+
+ a->orig_cpus = bitmap_alloc(cpu_set_size*8);
+ if (!a->orig_cpus)
+ return -1;
+ sched_getaffinity(0, cpu_set_size, (cpu_set_t *)a->orig_cpus);
+ a->sched_cpus = bitmap_alloc(cpu_set_size*8);
+ if (!a->sched_cpus) {
+ free(a->orig_cpus);
+ return -1;
+ }
+ bitmap_zero((unsigned long *)a->sched_cpus, cpu_set_size);
+ a->changed = false;
+ return 0;
+}
+
+/*
+ * perf_event_open does an IPI internally to the target CPU.
+ * It is more efficient to change perf's affinity to the target
+ * CPU and then set up all events on that CPU, so we amortize
+ * CPU communication.
+ */
+void affinity__set(struct affinity *a, int cpu)
+{
+ int cpu_set_size = get_cpu_set_size();
+
+ if (cpu == -1)
+ return;
+ a->changed = true;
+ set_bit(cpu, a->sched_cpus);
+ /*
+ * We ignore errors because affinity is just an optimization.
+ * This could happen for example with isolated CPUs or cpusets.
+ * In this case the IPIs inside the kernel's perf API still work.
+ */
+ sched_setaffinity(0, cpu_set_size, (cpu_set_t *)a->sched_cpus);
+ clear_bit(cpu, a->sched_cpus);
+}
+
+void affinity__cleanup(struct affinity *a)
+{
+ int cpu_set_size = get_cpu_set_size();
+
+ if (a->changed)
+ sched_setaffinity(0, cpu_set_size, (cpu_set_t *)a->orig_cpus);
+ free(a->sched_cpus);
+ free(a->orig_cpus);
+}
diff --git a/tools/perf/util/affinity.h b/tools/perf/util/affinity.h
new file mode 100644
index 000000000000..008e2c3995b9
--- /dev/null
+++ b/tools/perf/util/affinity.h
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef AFFINITY_H
+#define AFFINITY_H 1
+
+struct affinity {
+ unsigned long *orig_cpus;
+ unsigned long *sched_cpus;
+ bool changed;
+};
+
+void affinity__cleanup(struct affinity *a);
+void affinity__set(struct affinity *a, int cpu);
+int affinity__setup(struct affinity *a);
+
+#endif
diff --git a/tools/perf/util/python-ext-sources b/tools/perf/util/python-ext-sources
index 9af183860fbd..e7279ea6043a 100644
--- a/tools/perf/util/python-ext-sources
+++ b/tools/perf/util/python-ext-sources
@@ -33,3 +33,4 @@ util/trace-event.c
util/string.c
util/symbol_fprintf.c
util/units.c
+util/affinity.c
--
2.23.0

2019-11-16 05:57:56

by Andi Kleen

Subject: [PATCH v7 05/12] perf evsel: Add iterator to iterate over events ordered by CPU

From: Andi Kleen <[email protected]>

Add some common code that is needed to iterate over all events
in CPU order. It is used in follow-on patches.
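
For illustration, the intended usage (as in the evlist enable/disable
patch in this series) is roughly:

	evlist__for_each_cpu (evlist, i, cpu) {
		evlist__for_each_entry(evlist, pos) {
			if (evsel__cpu_iter_skip(pos, cpu))
				continue;
			/* the skip check above advanced pos->cpu_iter, so
			 * cpu_iter - 1 is this evsel's index for this CPU
			 */
			evsel__enable_cpu(pos, pos->cpu_iter - 1);
		}
	}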

Signed-off-by: Andi Kleen <[email protected]>

---

v2: Add cpumap__for_each_cpu macro to factor out some common code
v3: Drop cpumap__for_each_cpu macro again, replace with evlist__for_each_cpu
Add new evlist__for_each_cpu
Don't compute cpus nr in cpu_index iterator init, just use all_cpus
v4:
Remove __next, move into skip
Add _no_inc
Move initialization into iterator macro
Rename cpu_index to cpu_iter
---
tools/perf/util/cpumap.h | 1 +
tools/perf/util/evlist.c | 32 ++++++++++++++++++++++++++++++++
tools/perf/util/evlist.h | 8 ++++++++
tools/perf/util/evsel.h | 1 +
4 files changed, 42 insertions(+)

diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index 57943f3685f8..3a442f021468 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -63,4 +63,5 @@ int cpu_map__build_map(struct perf_cpu_map *cpus, struct perf_cpu_map **res,

int cpu_map__cpu(struct perf_cpu_map *cpus, int idx);
bool cpu_map__has(struct perf_cpu_map *cpus, int cpu);
+
#endif /* __PERF_CPUMAP_H */
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index fdce590d2278..dae6e846b2f8 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -342,6 +342,38 @@ static int perf_evlist__nr_threads(struct evlist *evlist,
return perf_thread_map__nr(evlist->core.threads);
}

+void evlist__cpu_iter_start(struct evlist *evlist)
+{
+ struct evsel *pos;
+
+ /*
+ * Reset the per evsel cpu_iter. This is needed because
+ * each evsel's cpumap may have a different index space,
+ * and some operations need the index to modify
+ * the FD xyarray (e.g. open, close)
+ */
+ evlist__for_each_entry(evlist, pos)
+ pos->cpu_iter = 0;
+}
+
+bool evsel__cpu_iter_skip_no_inc(struct evsel *ev, int cpu)
+{
+ if (ev->cpu_iter >= ev->core.cpus->nr)
+ return true;
+ if (cpu >= 0 && ev->core.cpus->map[ev->cpu_iter] != cpu)
+ return true;
+ return false;
+}
+
+bool evsel__cpu_iter_skip(struct evsel *ev, int cpu)
+{
+ if (!evsel__cpu_iter_skip_no_inc(ev, cpu)) {
+ ev->cpu_iter++;
+ return false;
+ }
+ return true;
+}
+
void evlist__disable(struct evlist *evlist)
{
struct evsel *pos;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 13051409fd22..12606efc1f7c 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -333,9 +333,17 @@ void perf_evlist__to_front(struct evlist *evlist,
#define evlist__for_each_entry_safe(evlist, tmp, evsel) \
__evlist__for_each_entry_safe(&(evlist)->core.entries, tmp, evsel)

+#define evlist__for_each_cpu(evlist, index, cpu) \
+ evlist__cpu_iter_start(evlist); \
+ perf_cpu_map__for_each_cpu (cpu, index, (evlist)->core.all_cpus)
+
void perf_evlist__set_tracking_event(struct evlist *evlist,
struct evsel *tracking_evsel);

+void evlist__cpu_iter_start(struct evlist *evlist);
+bool evsel__cpu_iter_skip(struct evsel *ev, int cpu);
+bool evsel__cpu_iter_skip_no_inc(struct evsel *ev, int cpu);
+
struct evsel *
perf_evlist__find_evsel_by_str(struct evlist *evlist, const char *str);

diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ddc5ee6f6592..b10d5ba21966 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -95,6 +95,7 @@ struct evsel {
bool collect_stat;
bool weak_group;
bool percore;
+ int cpu_iter;
const char *pmu_name;
struct {
perf_evsel__sb_cb_t *cb;
--
2.23.0

2019-11-20 17:19:56

by Jiri Olsa

Subject: Re: Optimize perf stat for large number of events/cpus

On Fri, Nov 15, 2019 at 09:52:17PM -0800, Andi Kleen wrote:
> [v7: Address review feedback. Fix python script problem
> reported by 0day. Drop merged patches.]
>
> This patch kit optimizes perf stat for a large number of events
> on systems with many CPUs and PMUs.
>
> Some profiling shows that most of the overhead is doing IPIs to
> all the target CPUs. We can optimize this by using sched_setaffinity
> to set the affinity to a target CPU once and then doing
> the perf operation for all events on that CPU. This requires
> some restructuring, but cuts the set up time quite a bit.
>
> In theory we could go further by parallelizing these setups
> too, but that would be much more complicated and for now just batching it
> per CPU seems to be sufficient. At some point with many more cores
> parallelization or a better bulk perf setup API might be needed though.
>
> In addition perf does a lot of redundant /sys accesses with
> many PMUs, which can also be expensive. This is also optimized.
>
> On a large test case (>700 events with many weak groups) on a 94 CPU
> system I go from
>
> real 0m8.607s
> user 0m0.550s
> sys 0m8.041s
>
> to
>
> real 0m3.269s
> user 0m0.760s
> sys 0m1.694s
>
> so shaving ~6 seconds of system time, at slightly more cost
> in perf stat itself. On a 4 socket system the savings
> are more dramatic:
>
> real 0m15.641s
> user 0m0.873s
> sys 0m14.729s
>
> to
>
> real 0m4.493s
> user 0m1.578s
> sys 0m2.444s
>
> so 11s difference in the user visible set up time.
>
> Also available in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/stat-scale-10
>
> v1: Initial post.
> v2: Rebase. Fix some minor issues.
> v3: Rebase. Address review feedback. Fix one minor issue
> v4: Modified based on review feedback. Now it maintains
> all_cpus per evlist. There is still a need for cpu_index iteration
> to get the correct index for indexing the file descriptors.
> Fix bug with unsorted cpu maps, now they are always sorted.
> Some cleanups and refactoring.
> v5: Split patches. Redo loop iteration again. Fix cpu map
> merging for uncore. Remove duplicates from cpumaps. Add unit
> tests.
> v6: Address review feedback. Fix some bugs. Add more comments.
> Merge one invalid patch split.
> v7: Address review feedback. Fix python scripting (thanks 0day)
> Minor updates.

I posted another 2 comments, but other than that I think it's ok

I don't like it, but can't see a better way ;-) and the speedup
is really impressive

thanks,
jirka