2023-04-03 09:16:18

by Peter Zijlstra

Subject: [PATCH] perf: Optimize perf_pmu_migrate_context()


Thomas reported that offlining CPUs spends a lot of time in
synchronize_rcu() as called from perf_pmu_migrate_context() even though
he's not actually using uncore events.

Turns out, the thing is unconditionally waiting for RCU, even if there
are no actual events to migrate.

Fixes: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")
Reported-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Thomas Gleixner <[email protected]>
---
kernel/events/core.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fb3e436bcd4a..115320faf1db 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12893,12 +12893,14 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
 	__perf_pmu_remove(src_ctx, src_cpu, pmu, &src_ctx->pinned_groups, &events);
 	__perf_pmu_remove(src_ctx, src_cpu, pmu, &src_ctx->flexible_groups, &events);
 
-	/*
-	 * Wait for the events to quiesce before re-instating them.
-	 */
-	synchronize_rcu();
+	if (!list_empty(&events)) {
+		/*
+		 * Wait for the events to quiesce before re-instating them.
+		 */
+		synchronize_rcu();
 
-	__perf_pmu_install(dst_ctx, dst_cpu, pmu, &events);
+		__perf_pmu_install(dst_ctx, dst_cpu, pmu, &events);
+	}
 
 	mutex_unlock(&dst_ctx->mutex);
 	mutex_unlock(&src_ctx->mutex);
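
The shape of the fix -- unlink everything first, then only pay for a grace
period when the unlinked list is non-empty -- can be mimicked in userspace
with liburcu. The sketch below is only an illustration of that control flow;
the names (struct item, migrate(), the src/dst lists) are made up, it is
single-threaded, and it is not the kernel code.

/*
 * Userspace sketch of the "skip the grace period when nothing was removed"
 * pattern, using liburcu's legacy API and its kernel-style list helpers.
 * Build (assuming liburcu is installed): gcc sketch.c -lurcu
 */
#include <stdio.h>
#include <urcu.h>		/* rcu_register_thread(), synchronize_rcu() */
#include <urcu/list.h>		/* cds_list_*(): doubly linked list helpers */

struct item {
	int id;
	struct cds_list_head node;
};

static CDS_LIST_HEAD(src);	/* stand-in for the source context */
static CDS_LIST_HEAD(dst);	/* stand-in for the destination context */

static void migrate(void)
{
	CDS_LIST_HEAD(moved);
	struct item *it, *tmp;

	/* Unlink everything from the source list onto a private list. */
	cds_list_for_each_entry_safe(it, tmp, &src, node) {
		cds_list_del(&it->node);
		cds_list_add(&it->node, &moved);
	}

	/*
	 * Only wait for a grace period when something was actually unlinked;
	 * with an empty list no reader can still hold a reference, so the
	 * wait would be pure overhead -- the case Thomas ran into.
	 */
	if (!cds_list_empty(&moved)) {
		synchronize_rcu();
		cds_list_splice(&moved, &dst);
	}
}

int main(void)
{
	struct item a = { .id = 1 };

	rcu_register_thread();

	migrate();			/* empty source: no synchronize_rcu() */

	cds_list_add(&a.node, &src);
	migrate();			/* one entry: grace period, then move */

	printf("dst %s\n", cds_list_empty(&dst) ? "is empty" : "holds the entry");

	rcu_unregister_thread();
	return 0;
}

The point is simply that synchronize_rcu() is only worth its milliseconds
when a reader could still hold a reference to something that was just
unlinked; an empty removal list makes the wait pure overhead.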


2023-04-03 22:16:17

by Thomas Gleixner

Subject: Re: [PATCH] perf: Optimize perf_pmu_migrate_context()

On Mon, Apr 03 2023 at 11:08, Peter Zijlstra wrote:
> Thomas reported that offlining CPUs spends a lot of time in
> synchronize_rcu() as called from perf_pmu_migrate_context() even though
> he's not actually using uncore events.

That happens when offlining CPUs from a socket > 0 in the same order in
which those CPUs have been brought up. On socket 0 this is not observable
unless the bogus CPU0 offlining hack is enabled.

If the offlining happens in the reverse order then all is shiny.

The reason is that the first online CPU on a socket gets the uncore
events assigned and when it is offlined then those are moved to the next
online CPU in the same socket.

On a SKL-X with 56 threads per socket this results in a whopping _1_
second delay per thread (except for the last one, which shuts down the
per-socket uncore events with no delay because there are no users) due
to 62 pointless synchronize_rcu() invocations, each taking ~16ms on a
HZ=250 kernel.

Which in turn is interesting because that machine is completely idle
other than running the offline muck...
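
For scale: at HZ=250 a tick is 4ms, so the ~16ms above is on the order of
four ticks per pointless grace period. A rough way to see those per-CPU
delays is to time the sysfs offline writes directly. The sketch below is
only illustrative: it needs root, it leaves the CPUs offline afterwards,
and the CPU range is a placeholder that has to be adjusted to the CPUs of
the socket being offlined; none of it is taken from this thread.

/*
 * Time each CPU offline via /sys/devices/system/cpu/cpuN/online.
 * Run as root; CPUs are left offline (re-online with "echo 1 > .../online").
 */
#include <stdio.h>
#include <time.h>

enum { FIRST_CPU = 1, LAST_CPU = 55 };	/* placeholder: adjust to the target socket */

/* Write "0" to the CPU's online file; 0 on success, -1 on failure. */
static int cpu_offline(int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "0\n") < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) ? -1 : 0;
}

int main(void)
{
	struct timespec t0, t1;
	int cpu;

	/* Offline in bring-up order, i.e. the slow direction described above. */
	for (cpu = FIRST_CPU; cpu <= LAST_CPU; cpu++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (cpu_offline(cpu)) {
			fprintf(stderr, "cpu%d: offline failed\n", cpu);
			continue;
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("cpu%d offlined in %.1f ms\n", cpu,
		       (t1.tv_sec - t0.tv_sec) * 1e3 +
		       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	}
	return 0;
}

With the unconditional synchronize_rcu() each offline that has to migrate
uncore events should stand out by roughly the delay quoted above; with the
patch applied the empty-list case no longer pays for it.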

> Turns out, the thing is unconditionally waiting for RCU, even if there
> are no actual events to migrate.
>
> Fixes: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")
> Reported-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Tested-by: Thomas Gleixner <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2023-04-03 22:53:19

by Paul E. McKenney

Subject: Re: [PATCH] perf: Optimize perf_pmu_migrate_context()

On Tue, Apr 04, 2023 at 12:07:30AM +0200, Thomas Gleixner wrote:
> On Mon, Apr 03 2023 at 11:08, Peter Zijlstra wrote:
> > Thomas reported that offlining CPUs spends a lot of time in
> > synchronize_rcu() as called from perf_pmu_migrate_context() even though
> > he's not actually using uncore events.
>
> That happens when offlining CPUs from a socket > 0 in the same order in
> which those CPUs have been brought up. On socket 0 this is not observable
> unless the bogus CPU0 offlining hack is enabled.
>
> If the offlining happens in the reverse order then all is shiny.
>
> The reason is that the first online CPU on a socket gets the uncore
> events assigned and when it is offlined then those are moved to the next
> online CPU in the same socket.
>
> On a SKL-X with 56 threads per socket this results in a whopping _1_
> second delay per thread (except for the last one, which shuts down the
> per-socket uncore events with no delay because there are no users) due
> to 62 pointless synchronize_rcu() invocations, each taking ~16ms on a
> HZ=250 kernel.
>
> Which in turn is interesting because that machine is completely idle
> other than running the offline muck...
>
> > Turns out, the thing is unconditionally waiting for RCU, even if there
> > are no actual events to migrate.
> >
> > Fixes: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")
> > Reported-by: Thomas Gleixner <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > Tested-by: Thomas Gleixner <[email protected]>
>
> Reviewed-by: Thomas Gleixner <[email protected]>

Yow! ;-)

Assuming that all the events run under RCU protection, as in preemption
disabled:

Reviewed-by: Paul E. McKenney <[email protected]>

Subject: [tip: perf/urgent] perf: Optimize perf_pmu_migrate_context()

The following commit has been merged into the perf/urgent branch of tip:

Commit-ID: b168098912926236bbeebaf7795eb7aab76d2b45
Gitweb: https://git.kernel.org/tip/b168098912926236bbeebaf7795eb7aab76d2b45
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 03 Apr 2023 11:08:58 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 05 Apr 2023 09:58:46 +02:00

perf: Optimize perf_pmu_migrate_context()

Thomas reported that offlining CPUs spends a lot of time in
synchronize_rcu() as called from perf_pmu_migrate_context() even though
he's not actually using uncore events.

Turns out, the thing is unconditionally waiting for RCU, even if there
are no actual events to migrate.

Fixes: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")
Reported-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Thomas Gleixner <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/events/core.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fb3e436..115320f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12893,12 +12893,14 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
 	__perf_pmu_remove(src_ctx, src_cpu, pmu, &src_ctx->pinned_groups, &events);
 	__perf_pmu_remove(src_ctx, src_cpu, pmu, &src_ctx->flexible_groups, &events);
 
-	/*
-	 * Wait for the events to quiesce before re-instating them.
-	 */
-	synchronize_rcu();
+	if (!list_empty(&events)) {
+		/*
+		 * Wait for the events to quiesce before re-instating them.
+		 */
+		synchronize_rcu();
 
-	__perf_pmu_install(dst_ctx, dst_cpu, pmu, &events);
+		__perf_pmu_install(dst_ctx, dst_cpu, pmu, &events);
+	}
 
 	mutex_unlock(&dst_ctx->mutex);
 	mutex_unlock(&src_ctx->mutex);