2019-11-21 00:20:59

by Andi Kleen

Subject: Optimize perf stat for large number of events/cpus

[v8: Address review feedback. Only changes one patch.]

This patch kit optimizes perf stat for a large number of events
on systems with many CPUs and PMUs.

Some profiling shows that the most overhead is doing IPIs to
all the target CPUs. We can optimize this by using sched_setaffinity
to set the affinity to a target CPU once and then doing
the perf operation for all events on that CPU. This requires
some restructuring, but cuts the setup time quite a bit.

In theory we could go further by parallelizing these setups
too, but that would be much more complicated and for now just batching it
per CPU seems to be sufficient. At some point with many more cores
parallelization or a better bulk perf setup API might be needed though.

In addition perf does a lot of redundant /sys accesses with
many PMUs, which can also be expensive. This is also optimized.

On a large test case (>700 events with many weak groups) on a 94 CPU
system I go from

real 0m8.607s
user 0m0.550s
sys 0m8.041s

to

real 0m3.269s
user 0m0.760s
sys 0m1.694s

so shaving ~6 seconds off system time, at slightly more cost
in perf stat itself. On a 4 socket system the savings
are more dramatic:

real 0m15.641s
user 0m0.873s
sys 0m14.729s

to

real 0m4.493s
user 0m1.578s
sys 0m2.444s

so an ~11s difference in the user-visible setup time.

Also available in

git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/stat-scale-11

v1: Initial post.
v2: Rebase. Fix some minor issues.
v3: Rebase. Address review feedback. Fix one minor issue.
v4: Modified based on review feedback. Now it maintains
all_cpus per evlist. cpu_index iteration is still needed
to get the correct index into the file descriptor arrays.
Fix bug with unsorted cpu maps, now they are always sorted.
Some cleanups and refactoring.
v5: Split patches. Redo loop iteration again. Fix cpu map
merging for uncore. Remove duplicates from cpumaps. Add unit
tests.
v6: Address review feedback. Fix some bugs. Add more comments.
Merge one invalid patch split.
v7: Address review feedback. Fix python scripting (thanks 0day)
Minor updates.
v8: Address review feedback.

-Andi


2019-11-21 12:49:47

by Andi Kleen

Subject: Re: Optimize perf stat for large number of events/cpus

Andi Kleen <[email protected]> writes:

> [v8: Address review feedback. Only changes one patch.]

Sorry, I forgot to add the -v8 to the subject.

The patches in this thread are version 8.

I'll not repost with the new Subject unless someone asks me to.

-Andi

2019-11-21 14:35:14

by Arnaldo Carvalho de Melo

Subject: Re: Optimize perf stat for large number of events/cpus

Em Thu, Nov 21, 2019 at 04:47:29AM -0800, Andi Kleen escreveu:
> Andi Kleen <[email protected]> writes:
>
> > [v8: Address review feedback. Only changes one patch.]
>
> Sorry forgot to add the -v8 to the subject.
>
> The patches in this thread are version 8.
>
> I'll not repost with the new Subject unless someone asks me to.

Ok, I'll try to go over it later today.

> -Andi

--

- Arnaldo

2019-11-27 15:20:29

by Arnaldo Carvalho de Melo

Subject: Re: Optimize perf stat for large number of events/cpus

Em Wed, Nov 20, 2019 at 04:15:10PM -0800, Andi Kleen escreveu:
> [v8: Address review feedback. Only changes one patch.]
>
> This patch kit optimizes perf stat for a large number of events
> on systems with many CPUs and PMUs.
>
> Some profiling shows that the most overhead is doing IPIs to
> all the target CPUs. We can optimize this by using sched_setaffinity
> to set the affinity to a target CPU once and then doing
> the perf operation for all events on that CPU. This requires
> some restructuring, but cuts the set up time quite a bit.
>
> In theory we could go further by parallelizing these setups
> too, but that would be much more complicated and for now just batching it
> per CPU seems to be sufficient. At some point with many more cores
> parallelization or a better bulk perf setup API might be needed though.
>
> In addition perf does a lot of redundant /sys accesses with
> many PMUs, which can also be expensive. This is also optimized.
>
> On a large test case (>700 events with many weak groups) on a 94 CPU
> system I go from
>
> real 0m8.607s
> user 0m0.550s
> sys 0m8.041s
>
> to
>
> real 0m3.269s
> user 0m0.760s
> sys 0m1.694s
>
> so shaving ~6 seconds of system time, at slightly more cost
> in perf stat itself. On a 4 socket system the savings
> are more dramatic:
>
> real 0m15.641s
> user 0m0.873s
> sys 0m14.729s
>
> to
>
> real 0m4.493s
> user 0m1.578s
> sys 0m2.444s
>
> so 11s difference in the user visible set up time.

Applied to my local perf/core branch, now undergoing test builds on all
the containers.

Thanks,

- Arnaldo

2019-11-27 15:46:29

by Arnaldo Carvalho de Melo

Subject: Re: Optimize perf stat for large number of events/cpus

Em Wed, Nov 27, 2019 at 12:16:57PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Nov 20, 2019 at 04:15:10PM -0800, Andi Kleen escreveu:
> > [v8: Address review feedback. Only changes one patch.]
> >
> > This patch kit optimizes perf stat for a large number of events
> > on systems with many CPUs and PMUs.
> >
> > Some profiling shows that the most overhead is doing IPIs to
> > all the target CPUs. We can optimize this by using sched_setaffinity
> > to set the affinity to a target CPU once and then doing
> > the perf operation for all events on that CPU. This requires
> > some restructuring, but cuts the set up time quite a bit.
> >
> > In theory we could go further by parallelizing these setups
> > too, but that would be much more complicated and for now just batching it
> > per CPU seems to be sufficient. At some point with many more cores
> > parallelization or a better bulk perf setup API might be needed though.
> >
> > In addition perf does a lot of redundant /sys accesses with
> > many PMUs, which can also be expensive. This is also optimized.
> >
> > On a large test case (>700 events with many weak groups) on a 94 CPU
> > system I go from
> >
> > real 0m8.607s
> > user 0m0.550s
> > sys 0m8.041s
> >
> > to
> >
> > real 0m3.269s
> > user 0m0.760s
> > sys 0m1.694s
> >
> > so shaving ~6 seconds of system time, at slightly more cost
> > in perf stat itself. On a 4 socket system the savings
> > are more dramatic:
> >
> > real 0m15.641s
> > user 0m0.873s
> > sys 0m14.729s
> >
> > to
> >
> > real 0m4.493s
> > user 0m1.578s
> > sys 0m2.444s
> >
> > so 11s difference in the user visible set up time.
>
> Applied to my local perf/core branch, now undergoing test builds on all
> the containers.

So, have you tried running 'perf test' after each cset is applied and
built?

[root@quaco ~]# perf test 49
49: Event times : FAILED!

I did a bisect and it ends at:

[acme@quaco perf]$ git bisect good
af39eb7d060751f7f3336e0ffa713575c6bea902 is the first bad commit
commit af39eb7d060751f7f3336e0ffa713575c6bea902
Author: Andi Kleen <[email protected]>
Date: Wed Nov 20 16:15:19 2019 -0800

perf stat: Use affinity for opening events

Restructure the event opening in perf stat to cycle through the events
by CPU after setting affinity to that CPU.

---------

Which for me was a surprise until I saw that this doesn't touch just
'perf stat', as the commit log seems to indicate.

Please check this, and consider splitting the patches to help with
bisection.

I'm keeping this in a separate local branch for now; I'll keep the
first few patches, which seem ok to go now.

- Arnaldo

2019-11-27 23:28:27

by Andi Kleen

Subject: Re: Optimize perf stat for large number of events/cpus

On Wed, Nov 27, 2019 at 12:43:05PM -0300, Arnaldo Carvalho de Melo wrote:
> So, have you tried running 'perf test' after each cset is applied and
> built?

I ran it at the end, but there are quite a few failures out of the box,
so I missed that one, thanks.

This patch fixes it. Let me know if I should submit it in a more
formal way.

---

Fix event times test case

Reported-by: Arnaldo
Signed-off-by: Andi Kleen <[email protected]>

diff --git a/tools/perf/lib/evsel.c b/tools/perf/lib/evsel.c
index 4c6485fc31b9..4dc06289f4c7 100644
--- a/tools/perf/lib/evsel.c
+++ b/tools/perf/lib/evsel.c
@@ -224,7 +224,7 @@ int perf_evsel__enable(struct perf_evsel *evsel)
int i;
int err = 0;

- for (i = 0; i < evsel->cpus->nr && !err; i++)
+ for (i = 0; i < xyarray__max_x(evsel->fd) && !err; i++)
err = perf_evsel__run_ioctl(evsel, PERF_EVENT_IOC_ENABLE, NULL, i);
return err;
}
@@ -239,7 +239,7 @@ int perf_evsel__disable(struct perf_evsel *evsel)
int i;
int err = 0;

- for (i = 0; i < evsel->cpus->nr && !err; i++)
+ for (i = 0; i < xyarray__max_x(evsel->fd) && !err; i++)
err = perf_evsel__run_ioctl(evsel, PERF_EVENT_IOC_DISABLE, NULL, i);
return err;
}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 59b9b4f3fe34..0844e3e29fb0 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1853,6 +1853,10 @@ int perf_evsel__open_per_cpu(struct evsel *evsel,
struct perf_cpu_map *cpus,
int cpu)
{
+ if (cpu == -1)
+ return evsel__open_cpu(evsel, cpus, NULL, 0,
+ cpus ? cpus->nr : 1);
+
return evsel__open_cpu(evsel, cpus, NULL, cpu, cpu + 1);
}

2019-11-28 00:02:32

by Arnaldo Carvalho de Melo

Subject: Re: Optimize perf stat for large number of events/cpus

On November 27, 2019 8:26:57 PM GMT-03:00, Andi Kleen <[email protected]> wrote:
>On Wed, Nov 27, 2019 at 12:43:05PM -0300, Arnaldo Carvalho de Melo
>wrote:
>> So, have you tried running 'perf test' after each cset is applied and
>> built?
>
>I ran it at the end, but there are quite a few fails out of the box,
>so I missed that one thanks.
>
>This patch fixes it. Let me know if I should submit it in a more
>formal way.
>
>---
>
>Fix event times test case
>
>Reported-by: Arnaldo
>Signed-off-by: Andi Kleen <[email protected]>
>
>diff --git a/tools/perf/lib/evsel.c b/tools/perf/lib/evsel.c
>index 4c6485fc31b9..4dc06289f4c7 100644
>--- a/tools/perf/lib/evsel.c
>+++ b/tools/perf/lib/evsel.c
>@@ -224,7 +224,7 @@ int perf_evsel__enable(struct perf_evsel *evsel)
> int i;
> int err = 0;
>
>- for (i = 0; i < evsel->cpus->nr && !err; i++)
>+ for (i = 0; i < xyarray__max_x(evsel->fd) && !err; i++)
> err = perf_evsel__run_ioctl(evsel, PERF_EVENT_IOC_ENABLE, NULL, i);
> return err;
> }
>@@ -239,7 +239,7 @@ int perf_evsel__disable(struct perf_evsel *evsel)
> int i;
> int err = 0;
>
>- for (i = 0; i < evsel->cpus->nr && !err; i++)
>+ for (i = 0; i < xyarray__max_x(evsel->fd) && !err; i++)
> err = perf_evsel__run_ioctl(evsel, PERF_EVENT_IOC_DISABLE, NULL, i);
> return err;
> }
>diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>index 59b9b4f3fe34..0844e3e29fb0 100644
>--- a/tools/perf/util/evsel.c
>+++ b/tools/perf/util/evsel.c
>@@ -1853,6 +1853,10 @@ int perf_evsel__open_per_cpu(struct evsel
>*evsel,
> struct perf_cpu_map *cpus,
> int cpu)
> {
>+ if (cpu == -1)
>+ return evsel__open_cpu(evsel, cpus, NULL, 0,
>+ cpus ? cpus->nr : 1);
>+
> return evsel__open_cpu(evsel, cpus, NULL, cpu, cpu + 1);
> }
>

Just save me some time by telling me which cset in v8 I should squash this into, so that we keep the whole shebang bisectable.

Thanks,

- Arnaldo