2023-10-03 18:04:13

by Uladzislau Rezki

Subject: [PATCH v2 1/1] rcu: Reduce synchronize_rcu() waiting time

A call to synchronize_rcu() can be optimized from a latency point of
view. Different workloads can be affected by this, especially the
ones which use this API in their time-critical sections.

For example, if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
callback can be delayed, and the delay depends on where in the nocb
list it is located.

1. On our Android devices I can easily trigger a scenario where
it is the last out of ~3600 callbacks in the list:

<snip>
<...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
<...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
<...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>

2. On our Android devices we use cpuset/cgroup to classify tasks
and assign them to different cgroups, for example a "background"
group which binds tasks only to little CPUs, or a "foreground"
group which binds to all CPUs, i.e. tasks can be migrated between
groups.

See below an example of how the "surfaceflinger" task is migrated.
Initially it is located in the "system-background" cgroup, which
allows it to run only on little cores. In order to speed it up,
it can be temporarily moved into the "foreground" cgroup, which
allows it to use big CPUs:

cgroup_attach_task():
-> cgroup_migrate_execute()
-> cpuset_can_attach()
-> percpu_down_write()
-> rcu_sync_enter()
-> synchronize_rcu()
-> now move tasks to the new cgroup.
-> cgroup_migrate_finish()

<snip>
rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
PERFD-SERVER-1855 [000] d..1. 7030.530383: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
<snip>

From this example it is clear that the "moving time" also depends
on how fast synchronize_rcu() completes.

3. This patch improves synchronize_rcu() by approximately 30%-50%
on synthetic tests. Apart from that, I have tested the launch of
the camera app, where I also see better performance figures:

542 vs 489 diff: 9%
540 vs 466 diff: 13%
518 vs 468 diff: 9%
531 vs 457 diff: 13%
548 vs 475 diff: 13%
509 vs 484 diff: 4%
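
For reference, the "diff" column above corresponds to
100 * (before - after) / before, rounded down. Below is a tiny
stand-alone snippet that reproduces it; it is an illustration only
and is not part of the patch:

<snip>
#include <stdio.h>

int main(void)
{
	/* App-launch figures from the table above: before vs. after. */
	static const int before[] = { 542, 540, 518, 531, 548, 509 };
	static const int after[]  = { 489, 466, 468, 457, 475, 484 };

	for (int i = 0; i < 6; i++)
		printf("%d vs %d diff: %d%%\n", before[i], after[i],
		       100 * (before[i] - after[i]) / before[i]);

	return 0;
}
<snip>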

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
kernel/rcu/tree.c | 151 +++++++++++++++++++++++++++++++++++++++++-
kernel/rcu/tree_exp.h | 2 +-
2 files changed, 151 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 78554e7181dd..a347c1f98f11 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1384,6 +1384,122 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}

+/*
+ * There are three lists for handling synchronize_rcu() users.
+ * The first list collects newly arrived users, the second holds
+ * users which are waiting for a grace period, and the third holds
+ * users whose grace period has already passed.
+ */
+static struct sr_normal_state {
+ struct llist_head curr; /* request a GP users. */
+ struct llist_head wait; /* wait for GP users. */
+ struct llist_head done; /* ready for GP users. */
+ struct llist_node *curr_tail;
+ struct llist_node *wait_tail;
+ atomic_t active;
+} sr;
+
+/* Enable it by default. */
+static int rcu_normal_wake_from_gp = 1;
+module_param(rcu_normal_wake_from_gp, int, 0644);
+
+static void rcu_sr_normal_complete(struct llist_node *node)
+{
+ struct rcu_synchronize *rs = container_of(
+ (struct rcu_head *) node, struct rcu_synchronize, head);
+ unsigned long oldstate = (unsigned long) rs->head.func;
+
+ if (!poll_state_synchronize_rcu(oldstate))
+ WARN_ONCE(1, "A full grace period is not passed yet: %lu",
+ rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
+
+ /* Finally. */
+ complete(&rs->completion);
+}
+
+static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
+{
+ struct llist_node *done, *rcu, *next;
+
+ done = llist_del_all(&sr.done);
+ if (!done)
+ return;
+
+ llist_for_each_safe(rcu, next, done)
+ rcu_sr_normal_complete(rcu);
+}
+static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
+
+/*
+ * Helper function for rcu_gp_cleanup().
+ */
+static void rcu_sr_normal_gp_cleanup(void)
+{
+ struct llist_node *first, *tail;
+
+ tail = READ_ONCE(sr.wait_tail);
+ first = llist_del_all(&sr.wait);
+ if (!first)
+ return;
+
+ /* Only one user? */
+ if (!first->next) {
+ rcu_sr_normal_complete(first);
+ return;
+ }
+
+ /* Can be not empty. */
+ llist_add_batch(first, tail, &sr.done);
+ queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
+}
+
+/*
+ * Helper function for rcu_gp_init().
+ */
+static void rcu_sr_normal_gp_init(void)
+{
+ struct llist_node *llnode, *rcu;
+ int ret;
+
+ if (llist_empty(&sr.curr))
+ return;
+
+ /*
+ * A waiting list of GP should be empty on this step,
+ * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
+ * rolls it over. If not, it is a BUG, warn a user.
+ */
+ WARN_ON_ONCE(!llist_empty(&sr.wait));
+
+ /*
+ * Obtain a tail of current active users. It is guaranteed
+ * that if we are only one active user and the list is not
+ * empty, the tail has already been updated.
+ */
+ ret = atomic_inc_return(&sr.active);
+ WRITE_ONCE(sr.wait_tail, (ret == 1) ? READ_ONCE(sr.curr_tail):NULL);
+ llnode = llist_del_all(&sr.curr);
+ atomic_dec(&sr.active);
+
+ if (ret != 1) {
+ llist_for_each(rcu, llnode) {
+ if (!rcu->next)
+ WRITE_ONCE(sr.wait_tail, rcu);
+ }
+ }
+
+ llist_add_batch(llnode, READ_ONCE(sr.wait_tail), &sr.wait);
+}
+
+static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
+{
+ atomic_inc(&sr.active);
+ if (llist_add((struct llist_node *) &rs->head, &sr.curr))
+ /* Set the tail. Only first and one user can do that. */
+ WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
+ atomic_dec(&sr.active);
+}
+
/*
* Initialize a new grace period. Return false if no grace period required.
*/
@@ -1420,6 +1536,7 @@ static noinline_for_stack bool rcu_gp_init(void)
ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
+ rcu_sr_normal_gp_init();
raw_spin_unlock_irq_rcu_node(rnp);

/*
@@ -1787,6 +1904,9 @@ static noinline void rcu_gp_cleanup(void)
}
raw_spin_unlock_irq_rcu_node(rnp);

+ // Make synchronize_rcu() users aware of the end of old grace period.
+ rcu_sr_normal_gp_cleanup();
+
// If strict, make all CPUs aware of the end of the old grace period.
if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
@@ -3500,6 +3620,35 @@ static int rcu_blocking_is_gp(void)
return true;
}

+/*
+ * Helper function for the synchronize_rcu() API.
+ */
+static void synchronize_rcu_normal(void)
+{
+ struct rcu_synchronize rs;
+
+ if (READ_ONCE(rcu_normal_wake_from_gp)) {
+ init_rcu_head_on_stack(&rs.head);
+ init_completion(&rs.completion);
+
+ /*
+ * This code might be preempted, therefore take a GP
+ * snapshot before adding a request.
+ */
+ rs.head.func = (void *) get_state_synchronize_rcu();
+ rcu_sr_normal_add_req(&rs);
+
+ /* Kick a GP and start waiting. */
+ (void) start_poll_synchronize_rcu();
+
+ /* Now we can wait. */
+ wait_for_completion(&rs.completion);
+ destroy_rcu_head_on_stack(&rs.head);
+ } else {
+ wait_rcu_gp(call_rcu_hurry);
+ }
+}
+
/**
* synchronize_rcu - wait until a grace period has elapsed.
*
@@ -3551,7 +3700,7 @@ void synchronize_rcu(void)
if (rcu_gp_is_expedited())
synchronize_rcu_expedited();
else
- wait_rcu_gp(call_rcu_hurry);
+ synchronize_rcu_normal();
return;
}

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6d7cea5d591f..279a37beb05a 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -987,7 +987,7 @@ void synchronize_rcu_expedited(void)

/* If expedited grace periods are prohibited, fall back to normal. */
if (rcu_gp_is_normal()) {
- wait_rcu_gp(call_rcu_hurry);
+ synchronize_rcu_normal();
return;
}

--
2.30.2
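
To make the above easier to follow, here is a small userspace sketch
of the idea (illustration only, not kernel code): waiters push
themselves onto a lock-free "curr" list, and a single grace-period
thread snapshots that list when a cycle starts and wakes everyone on
the snapshot when the cycle ends, so no waiter has to travel through
a long callback queue. It uses pthreads and C11 atomics in place of
the kernel's llist/completion machinery, leaves out the curr/wait
tail bookkeeping and the workqueue offload of the "done" list, and
every name in it is made up for the sketch. Build with "gcc -pthread":

<snip>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct waiter {
	struct waiter *next;
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static _Atomic(struct waiter *) curr_head;	/* the "curr" list */
static atomic_bool stop_gp;

/* "GP kthread": snapshot curr, wait a while, then complete the snapshot. */
static void *gp_thread(void *arg)
{
	while (!atomic_load(&stop_gp)) {
		/* Like rcu_sr_normal_gp_init(): grab the whole curr list at once. */
		struct waiter *wait_list = atomic_exchange(&curr_head, NULL);

		usleep(10 * 1000);	/* pretend a grace period elapses */

		/* Like rcu_sr_normal_gp_cleanup(): wake every captured waiter. */
		while (wait_list) {
			struct waiter *w = wait_list;

			wait_list = w->next;
			pthread_mutex_lock(&w->lock);
			w->done = true;
			pthread_cond_signal(&w->cond);
			pthread_mutex_unlock(&w->lock);
		}
	}
	return NULL;
}

/* The synchronize_rcu()-like call: register on curr, then sleep until woken. */
static void synchronize_sketch(void)
{
	struct waiter w = { 0 };

	pthread_mutex_init(&w.lock, NULL);
	pthread_cond_init(&w.cond, NULL);

	/* llist_add()-style lock-free push onto the curr list. */
	w.next = atomic_load(&curr_head);
	while (!atomic_compare_exchange_weak(&curr_head, &w.next, &w))
		;

	pthread_mutex_lock(&w.lock);
	while (!w.done)
		pthread_cond_wait(&w.cond, &w.lock);
	pthread_mutex_unlock(&w.lock);

	pthread_cond_destroy(&w.cond);
	pthread_mutex_destroy(&w.lock);
}

static void *user_thread(void *arg)
{
	synchronize_sketch();
	printf("waiter %ld woken right after the grace period\n", (long) arg);
	return NULL;
}

int main(void)
{
	pthread_t gp, users[4];
	long i;

	pthread_create(&gp, NULL, gp_thread, NULL);
	for (i = 0; i < 4; i++)
		pthread_create(&users[i], NULL, user_thread, (void *) i);
	for (i = 0; i < 4; i++)
		pthread_join(users[i], NULL);

	atomic_store(&stop_gp, true);
	pthread_join(gp, NULL);
	return 0;
}
<snip>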


2023-10-04 15:24:45

by Paul E. McKenney

Subject: Re: [PATCH v2 1/1] rcu: Reduce synchronize_rcu() waiting time

On Tue, Oct 03, 2023 at 08:04:03PM +0200, Uladzislau Rezki (Sony) wrote:
> [...]
> +/*
> + * There are three lists for handling synchronize_rcu() users.
> + * The first list collects newly arrived users, the second holds
> + * users which are waiting for a grace period, and the third holds
> + * users whose grace period has already passed.
> + */
> +static struct sr_normal_state {
> + struct llist_head curr; /* request a GP users. */
> + struct llist_head wait; /* wait for GP users. */
> + struct llist_head done; /* ready for GP users. */
> + struct llist_node *curr_tail;
> + struct llist_node *wait_tail;
> + atomic_t active;
> +} sr;
> +
> +/* Enable it by default. */
> +static int rcu_normal_wake_from_gp = 1;
> +module_param(rcu_normal_wake_from_gp, int, 0644);

Nice!

But could you please make this default to zero in order to avoid
surprising people for whom the old way works better?

Thanx, Paul

> [...]

2023-10-04 15:27:12

by Uladzislau Rezki

Subject: Re: [PATCH v2 1/1] rcu: Reduce synchronize_rcu() waiting time

On Wed, Oct 04, 2023 at 08:24:35AM -0700, Paul E. McKenney wrote:
> On Tue, Oct 03, 2023 at 08:04:03PM +0200, Uladzislau Rezki (Sony) wrote:
> > [...]
> > +/* Enable it by default. */
> > +static int rcu_normal_wake_from_gp = 1;
> > +module_param(rcu_normal_wake_from_gp, int, 0644);
>
> Nice!
>
> But could you please make this default to zero in order to avoid
> surprising people for whom the old way works better?
>
Yep, that I can do, if people prefer a slower version of it :)
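Those who want the faster behaviour can then still opt in, presumably
via the rcutree.rcu_normal_wake_from_gp=1 boot parameter or the
corresponding file under /sys/module/rcutree/parameters/ (assuming the
usual "rcutree." MODULE_PARAM_PREFIX of tree.c applies here).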

--
Uladzislau Rezki

2023-10-04 16:07:59

by Paul E. McKenney

Subject: Re: [PATCH v2 1/1] rcu: Reduce synchronize_rcu() waiting time

On Wed, Oct 04, 2023 at 05:26:52PM +0200, Uladzislau Rezki wrote:
> On Wed, Oct 04, 2023 at 08:24:35AM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 03, 2023 at 08:04:03PM +0200, Uladzislau Rezki (Sony) wrote:
> > > [...]
> > > +/* Enable it by default. */
> > > +static int rcu_normal_wake_from_gp = 1;
> > > +module_param(rcu_normal_wake_from_gp, int, 0644);
> >
> > Nice!
> >
> > But could you please make this default to zero in order to avoid
> > surprising people for whom the old way works better?
> >
> Yep, that I can do, if people prefer a slower version of it :)

Keeping in mind that "slower" often means more updates to spread the
grace-period overhead across, thus lower per-update RCU overhead. ;-)

Thanx, Paul

2023-10-04 16:47:57

by Uladzislau Rezki

Subject: Re: [PATCH v2 1/1] rcu: Reduce synchronize_rcu() waiting time

On Wed, Oct 04, 2023 at 09:07:37AM -0700, Paul E. McKenney wrote:
> On Wed, Oct 04, 2023 at 05:26:52PM +0200, Uladzislau Rezki wrote:
> > On Wed, Oct 04, 2023 at 08:24:35AM -0700, Paul E. McKenney wrote:
> > > On Tue, Oct 03, 2023 at 08:04:03PM +0200, Uladzislau Rezki (Sony) wrote:
> > > > [...]
> > > > +/* Enable it by default. */
> > > > +static int rcu_normal_wake_from_gp = 1;
> > > > +module_param(rcu_normal_wake_from_gp, int, 0644);
> > >
> > > Nice!
> > >
> > > But could you please make this default to zero in order to avoid
> > > surprising people for whom the old way works better?
> > >
> > Yep, that I can do, if people prefer a slower version of it :)
>
> Keeping in mind that "slower" often means more updates to spread the
> grace-period overhead across, thus lower per-update RCU overhead. ;-)
>
Right :)

--
Uladzislau Rezki