2024-01-04 16:25:39

by Uladzislau Rezki (Sony)

Subject: [PATCH v4 0/4] Reduce synchronize_rcu() latency(v4)

This is v4 of the series that aims to improve the synchronize_rcu() call.
To be more specific, it is about reducing the waiting time (especially in
worst cases) of a caller that blocks until a grace period has elapsed.

In general, this series separates synchronize_rcu() callers from other
callbacks. We keep a dedicated, independent queue, so its processing
starts as soon as the grace period is over and there is no need to wait
until other callbacks are processed one by one. Please note, the number
of such callbacks can be 10K, 20K, 60K and so on. That is why this series
maintains a separate track for this call that blocks a context.

v3 -> v4:
- Squash patches;
- Add more description;
- Fix comments based on v3 feedback.

v3: https://lore.kernel.org/lkml/cd45b0b5-f86b-43fb-a5f3-47d340cd4f9f@paulmck-laptop/T/
v2: https://lore.kernel.org/all/[email protected]/T/
v1: https://lore.kernel.org/lkml/[email protected]/T/

Neeraj Upadhyay (1):
rcu: Improve handling of synchronize_rcu() users

Uladzislau Rezki (Sony) (3):
rcu: Reduce synchronize_rcu() latency
rcu: Add a trace event for synchronize_rcu_normal()
rcu: Support direct wake-up of synchronize_rcu() users

.../admin-guide/kernel-parameters.txt | 14 +
include/trace/events/rcu.h | 27 ++
kernel/rcu/Kconfig.debug | 12 +
kernel/rcu/tree.c | 361 +++++++++++++++++-
kernel/rcu/tree.h | 19 +
kernel/rcu/tree_exp.h | 2 +-
6 files changed, 433 insertions(+), 2 deletions(-)

--
2.39.2



2024-01-04 16:25:50

by Uladzislau Rezki (Sony)

Subject: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

A call to synchronize_rcu() can be optimized from a latency
point of view. Workloads which depend on it can benefit from this.

The delay of the wakeme_after_rcu() callback, which unblocks a waiter,
depends on several factors:

- how fast the offloading process is started. A combination of:
    - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
    - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
    - other.
- once started, whether the invoking path is interrupted due to:
    - the time limit;
    - need_resched();
    - the callback batch limit being reached.
- where in a nocb list the callback is located;
- how fast previous callbacks are completed.

Example:

1. On our embedded devices I can easily trigger a scenario where the
wakeme_after_rcu() callback is the last in a list of ~3600 callbacks:

<snip>
<...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
<...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
<...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>

2. We use cpuset/cgroup to classify tasks and assign them into
different cgroups. For example, the "background" group binds tasks
only to little CPUs, whereas "foreground" makes use of all CPUs.
Tasks can be migrated between groups on request if acceleration
is needed.

See below an example of how the "surfaceflinger" task gets migrated.
Initially it is located in the "system-background" cgroup, which
allows it to run only on little cores. In order to speed it up, it
can be temporarily moved into the "foreground" cgroup, which allows
it to use big/all CPUs:

cgroup_attach_task():
-> cgroup_migrate_execute()
-> cpuset_can_attach()
-> percpu_down_write()
-> rcu_sync_enter()
-> synchronize_rcu()
-> now move tasks to the new cgroup.
-> cgroup_migrate_finish()

<snip>
rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
<snip>

"Boosting a task" depends on synchronize_rcu() latency:

- the first trace shows the completion of synchronize_rcu();
- the second shows the task being attached to the new group;
- the last shows the final step, when the migration occurs.

3. To address this drawback, maintain a separate track that consists
of synchronize_rcu() callers only. After completion of a grace period,
users are deferred to a dedicated worker which processes the requests.

4. This patch reduces the latency of synchronize_rcu() by approximately
30-40% on synthetic tests. The real test case, camera launch time,
shows (time is in milliseconds):

1-run 542 vs 489 improvement 9%
2-run 540 vs 466 improvement 13%
3-run 518 vs 468 improvement 9%
4-run 531 vs 457 improvement 13%
5-run 548 vs 475 improvement 13%
6-run 509 vs 484 improvement 4%

Synthetic test (no "noise" from other callbacks):
Hardware: x86_64, 64 CPUs, 64GB of memory
Linux-6.6

- 10K tasks (simultaneous);
- each task does (1000 loops):
      synchronize_rcu();
      kfree(p);

default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.

Running 60K tasks gives approximately the same results on my setup.
Please note this is without any interaction with other types of
callbacks; otherwise the default case would be impacted a lot more.
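
Below is a minimal sketch of such a synthetic test written as a
standalone kernel module. It is only an illustration of the test
described above, not part of this series: the module itself, its
parameters (nr_tasks, nr_loops) and the scaled-down defaults are
hypothetical, and error handling is omitted.

<snip>
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>
#include <linux/completion.h>
#include <linux/atomic.h>
#include <linux/ktime.h>

static int nr_tasks = 32;	/* scaled down from 10K for illustration */
static int nr_loops = 1000;
module_param(nr_tasks, int, 0444);
module_param(nr_loops, int, 0444);

static atomic_t done_cnt;
static DECLARE_COMPLETION(all_done);

/* Each kthread: allocate, wait for a grace period, free. */
static int sr_test_fn(void *arg)
{
	int i;

	for (i = 0; i < nr_loops; i++) {
		void *p = kmalloc(64, GFP_KERNEL);

		synchronize_rcu();	/* the call whose latency is measured */
		kfree(p);
	}

	if (atomic_inc_return(&done_cnt) == nr_tasks)
		complete(&all_done);

	return 0;
}

static int __init sr_test_init(void)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < nr_tasks; i++)
		kthread_run(sr_test_fn, NULL, "sr_test/%d", i);

	/* Module load blocks until all tasks are done, then prints the time. */
	wait_for_completion(&all_done);
	pr_info("sr_test: %d tasks x %d loops took %lld ms\n",
		nr_tasks, nr_loops, ktime_ms_delta(ktime_get(), start));

	return 0;
}

static void __exit sr_test_exit(void)
{
}

module_init(sr_test_init);
module_exit(sr_test_exit);
MODULE_LICENSE("GPL");
<snip>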

5. An extra CONFIG_RCU_SR_NORMAL_DEBUG_GP kernel option has been added.
It enables additional debugging for detecting an incomplete grace
period for synchronize_rcu() users. If a GP has not fully elapsed for
any user, a warning message is emitted.

6. By default the new behaviour is disabled. To enable it, use one of
the following:

echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 14 ++
kernel/rcu/Kconfig.debug | 12 ++
kernel/rcu/tree.c | 138 +++++++++++++++++-
kernel/rcu/tree_exp.h | 2 +-
4 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 17a454909ab4..2cca75e4f0c6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5047,6 +5047,20 @@
delay, memory pressure or callback list growing too
big.

+ rcutree.rcu_normal_wake_from_gp= [KNL]
+ Reduces a latency of synchronize_rcu() call. This approach
+ maintains its own track of synchronize_rcu() callers, so it
+ does not interact with regular callbacks because it does not
+ use a call_rcu[_hurry]() path. Please note, this is for a
+ normal grace period.
+
+ How to enable it:
+
+ echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
+ or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
+
+ Default is 0.
+
rcuscale.gp_async= [KNL]
Measure performance of asynchronous
grace-period primitives such as call_rcu().
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 9b0b52e1836f..4812c6249185 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -168,4 +168,16 @@ config RCU_STRICT_GRACE_PERIOD
when looking for certain types of RCU usage bugs, for example,
too-short RCU read-side critical sections.

+config RCU_SR_NORMAL_DEBUG_GP
+ bool "Debug synchronize_rcu() callers for a grace period completion"
+ depends on DEBUG_KERNEL && RCU_EXPERT
+ default n
+ help
+ This option enables additional debugging for detecting a grace
+ period incompletion for synchronize_rcu() users. If a GP is not
+ fully passed for any user, the warning message is emitted.
+
+ Say Y here if you want to enable such debugging
+ Say N if you are unsure.
+
endmenu # "RCU Debugging"
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 499803234176..b756c40e4960 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1422,6 +1422,106 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}

+/*
+ * There are three lists for handling synchronize_rcu() users.
+ * A first list corresponds to new coming users, second for users
+ * which wait for a grace period and third is for which a grace
+ * period is passed.
+ */
+static struct sr_normal_state {
+ struct llist_head srs_next; /* request a GP users. */
+ struct llist_head srs_wait; /* wait for GP users. */
+ struct llist_head srs_done; /* ready for GP users. */
+
+ /*
+ * In order to add a batch of nodes to already
+ * existing srs-done-list, a tail of srs-wait-list
+ * is maintained.
+ */
+ struct llist_node *srs_wait_tail;
+} sr;
+
+/* Disabled by default. */
+static int rcu_normal_wake_from_gp;
+module_param(rcu_normal_wake_from_gp, int, 0644);
+
+static void rcu_sr_normal_complete(struct llist_node *node)
+{
+ struct rcu_synchronize *rs = container_of(
+ (struct rcu_head *) node, struct rcu_synchronize, head);
+ unsigned long oldstate = (unsigned long) rs->head.func;
+
+ WARN_ONCE(IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP) &&
+ !poll_state_synchronize_rcu(oldstate),
+ "A full grace period is not passed yet: %lu",
+ rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
+
+ /* Finally. */
+ complete(&rs->completion);
+}
+
+static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
+{
+ struct llist_node *done, *rcu, *next;
+
+ done = llist_del_all(&sr.srs_done);
+ if (!done)
+ return;
+
+ llist_for_each_safe(rcu, next, done)
+ rcu_sr_normal_complete(rcu);
+}
+static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
+
+/*
+ * Helper function for rcu_gp_cleanup().
+ */
+static void rcu_sr_normal_gp_cleanup(void)
+{
+ struct llist_node *head, *tail;
+
+ if (llist_empty(&sr.srs_wait))
+ return;
+
+ tail = READ_ONCE(sr.srs_wait_tail);
+ head = __llist_del_all(&sr.srs_wait);
+
+ if (head) {
+ /* Can be not empty. */
+ llist_add_batch(head, tail, &sr.srs_done);
+ queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
+ }
+}
+
+/*
+ * Helper function for rcu_gp_init().
+ */
+static void rcu_sr_normal_gp_init(void)
+{
+ struct llist_node *head, *tail;
+
+ if (llist_empty(&sr.srs_next))
+ return;
+
+ tail = llist_del_all(&sr.srs_next);
+ head = llist_reverse_order(tail);
+
+ /*
+ * A waiting list of GP should be empty on this step,
+ * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
+ * rolls it over. If not, it is a BUG, warn a user.
+ */
+ WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
+
+ WRITE_ONCE(sr.srs_wait_tail, tail);
+ __llist_add_batch(head, tail, &sr.srs_wait);
+}
+
+static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
+{
+ llist_add((struct llist_node *) &rs->head, &sr.srs_next);
+}
+
/*
* Initialize a new grace period. Return false if no grace period required.
*/
@@ -1456,6 +1556,7 @@ static noinline_for_stack bool rcu_gp_init(void)
/* Record GP times before starting GP, hence rcu_seq_start(). */
rcu_seq_start(&rcu_state.gp_seq);
ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
+ rcu_sr_normal_gp_init();
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
raw_spin_unlock_irq_rcu_node(rnp);
@@ -1825,6 +1926,9 @@ static noinline void rcu_gp_cleanup(void)
}
raw_spin_unlock_irq_rcu_node(rnp);

+ // Make synchronize_rcu() users aware of the end of old grace period.
+ rcu_sr_normal_gp_cleanup();
+
// If strict, make all CPUs aware of the end of the old grace period.
if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
@@ -3561,6 +3665,38 @@ static int rcu_blocking_is_gp(void)
return true;
}

+/*
+ * Helper function for the synchronize_rcu() API.
+ */
+static void synchronize_rcu_normal(void)
+{
+ struct rcu_synchronize rs;
+
+ if (!READ_ONCE(rcu_normal_wake_from_gp)) {
+ wait_rcu_gp(call_rcu_hurry);
+ return;
+ }
+
+ init_rcu_head_on_stack(&rs.head);
+ init_completion(&rs.completion);
+
+ /*
+ * This code might be preempted, therefore take a GP
+ * snapshot before adding a request.
+ */
+ if (IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP))
+ rs.head.func = (void *) get_state_synchronize_rcu();
+
+ rcu_sr_normal_add_req(&rs);
+
+ /* Kick a GP and start waiting. */
+ (void) start_poll_synchronize_rcu();
+
+ /* Now we can wait. */
+ wait_for_completion(&rs.completion);
+ destroy_rcu_head_on_stack(&rs.head);
+}
+
/**
* synchronize_rcu - wait until a grace period has elapsed.
*
@@ -3612,7 +3748,7 @@ void synchronize_rcu(void)
if (rcu_gp_is_expedited())
synchronize_rcu_expedited();
else
- wait_rcu_gp(call_rcu_hurry);
+ synchronize_rcu_normal();
return;
}

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 014ddf672165..bdc30d972d32 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -985,7 +985,7 @@ void synchronize_rcu_expedited(void)

/* If expedited grace periods are prohibited, fall back to normal. */
if (rcu_gp_is_normal()) {
- wait_rcu_gp(call_rcu_hurry);
+ synchronize_rcu_normal();
return;
}

--
2.39.2


2024-01-04 16:25:55

by Uladzislau Rezki (Sony)

Subject: [PATCH v4 2/4] rcu: Add a trace event for synchronize_rcu_normal()

Add an rcu_sr_normal() trace event. It takes three arguments:
the first is the name of the RCU flavour, the second identifies
the user which triggered synchronize_rcu_normal(), and the last
one is an event.

There are two trace points in synchronize_rcu_normal(): one on
entry, when a new request is registered, and one on exit, when
the request is completed.

Please note, CONFIG_RCU_TRACE=y is required to activate the traces.
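
For example, assuming a standard tracefs mount at /sys/kernel/tracing
(the path and the trace-header fields below are illustrative, the
per-event format follows the event's TP_printk()), each call is
expected to produce a "request"/"complete" pair:

  echo 1 > /sys/kernel/tracing/events/rcu/rcu_sr_normal/enable
  cat /sys/kernel/tracing/trace_pipe
    <task>-123 [000] ..... 100.000000: rcu_sr_normal: rcu_preempt rhp=0x<addr> event=request
    <task>-123 [000] ..... 100.004000: rcu_sr_normal: rcu_preempt rhp=0x<addr> event=complete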

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
include/trace/events/rcu.h | 27 +++++++++++++++++++++++++++
kernel/rcu/tree.c | 7 ++++++-
2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 2ef9c719772a..31b3e0d3e65f 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -707,6 +707,33 @@ TRACE_EVENT_RCU(rcu_invoke_kfree_bulk_callback,
__entry->rcuname, __entry->p, __entry->nr_records)
);

+/*
+ * Tracepoint for a normal synchronize_rcu() states. The first argument
+ * is the RCU flavor, the second argument is a pointer to rcu_head the
+ * last one is an event.
+ */
+TRACE_EVENT_RCU(rcu_sr_normal,
+
+ TP_PROTO(const char *rcuname, struct rcu_head *rhp, const char *srevent),
+
+ TP_ARGS(rcuname, rhp, srevent),
+
+ TP_STRUCT__entry(
+ __field(const char *, rcuname)
+ __field(void *, rhp)
+ __field(const char *, srevent)
+ ),
+
+ TP_fast_assign(
+ __entry->rcuname = rcuname;
+ __entry->rhp = rhp;
+ __entry->srevent = srevent;
+ ),
+
+ TP_printk("%s rhp=0x%p event=%s",
+ __entry->rcuname, __entry->rhp, __entry->srevent)
+);
+
/*
* Tracepoint for exiting rcu_do_batch after RCU callbacks have been
* invoked. The first argument is the name of the RCU flavor,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b756c40e4960..7d2ed89efcb3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3672,9 +3672,11 @@ static void synchronize_rcu_normal(void)
{
struct rcu_synchronize rs;

+ trace_rcu_sr_normal(rcu_state.name, &rs.head, TPS("request"));
+
if (!READ_ONCE(rcu_normal_wake_from_gp)) {
wait_rcu_gp(call_rcu_hurry);
- return;
+ goto trace_complete_out;
}

init_rcu_head_on_stack(&rs.head);
@@ -3695,6 +3697,9 @@ static void synchronize_rcu_normal(void)
/* Now we can wait. */
wait_for_completion(&rs.completion);
destroy_rcu_head_on_stack(&rs.head);
+
+trace_complete_out:
+ trace_rcu_sr_normal(rcu_state.name, &rs.head, TPS("complete"));
}

/**
--
2.39.2


2024-01-04 16:26:12

by Uladzislau Rezki (Sony)

Subject: [PATCH v4 3/4] rcu: Improve handling of synchronize_rcu() users

From: Neeraj Upadhyay <[email protected]>

Currently, processing the next batch of rcu_synchronize nodes
for the new grace period requires an llist reversal operation to
find the tail element of the list. This can be a very costly
operation (a high number of cache misses) for a long list.

To address this, this patch introduces a "dummy-wait-node" entity.
At every grace period init, a new wait node is added to the llist.
This wait node is used as the wait tail for the new grace period.

This allows lockless additions of new rcu_synchronize nodes in
rcu_sr_normal_add_req() while the cleanup work executes and makes
progress. The dummy nodes are removed in the next round of cleanup
work execution.
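
As a rough illustration of the idea (a user-space toy, not the kernel
implementation; all names below are made up), pushing a sentinel
"wait head" onto a lock-free LIFO marks the boundary of the batch that
belongs to the just-started grace period, so the completion side can
simply walk forward from the sentinel instead of reversing the list to
find its tail:

<snip>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct node {
	struct node *next;
	int is_wait_head;	/* sentinel injected at GP start */
	int user_id;		/* payload of an ordinary request */
};

static _Atomic(struct node *) list_head;

/* Lockless LIFO push, as synchronize_rcu() callers would do. */
static void push(struct node *n)
{
	struct node *old = atomic_load(&list_head);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&list_head, &old, n));
}

/*
 * "GP start": inject a wait head. Nodes pushed earlier end up behind
 * it, nodes pushed later end up in front of it, so no reversal is
 * needed to know which users belong to this grace period.
 */
static void gp_start(struct node *wait_head)
{
	wait_head->is_wait_head = 1;
	push(wait_head);
}

/*
 * "GP cleanup": walk forward from the wait head and complete users
 * until the previous batch's wait head (or the end) is reached.
 */
static void gp_cleanup(struct node *wait_head)
{
	struct node *n = wait_head->next;

	while (n && !n->is_wait_head) {
		printf("complete user %d\n", n->user_id);
		n = n->next;
	}
}

int main(void)
{
	struct node users[3] = { { 0 } };
	struct node wh = { 0 };
	int i;

	for (i = 0; i < 3; i++) {
		users[i].user_id = i;
		push(&users[i]);	/* three pending synchronize_rcu() users */
	}

	gp_start(&wh);		/* new GP: inject the wait head */
	gp_cleanup(&wh);	/* after the GP: completes users 2, 1, 0 */
	return 0;
}
<snip>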

Co-developed-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Neeraj Upadhyay <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
kernel/rcu/tree.c | 271 +++++++++++++++++++++++++++++++++++++++-------
kernel/rcu/tree.h | 13 +++
2 files changed, 244 insertions(+), 40 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7d2ed89efcb3..88a47a6a658e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1423,23 +1423,157 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
}

/*
- * There are three lists for handling synchronize_rcu() users.
- * A first list corresponds to new coming users, second for users
- * which wait for a grace period and third is for which a grace
- * period is passed.
+ * There is a single llist, which is used for handling
+ * synchronize_rcu() users' enqueued rcu_synchronize nodes.
+ * Within this llist, there are two tail pointers:
+ *
+ * wait tail: Tracks the set of nodes, which need to
+ * wait for the current GP to complete.
+ * done tail: Tracks the set of nodes, for which grace
+ * period has elapsed. These nodes processing
+ * will be done as part of the cleanup work
+ * execution by a kworker.
+ *
+ * At every grace period init, a new wait node is added
+ * to the llist. This wait node is used as wait tail
+ * for this new grace period. Given that there are a fixed
+ * number of wait nodes, if all wait nodes are in use
+ * (which can happen when kworker callback processing
+ * is delayed) and additional grace period is requested.
+ * This means, a system is slow in processing callbacks.
+ *
+ * TODO: If a slow processing is detected, a first node
+ * in the llist should be used as a wait-tail for this
+ * grace period, therefore users which should wait due
+ * to a slow process are handled by _this_ grace period
+ * and not next.
+ *
+ * Below is an illustration of how the done and wait
+ * tail pointers move from one set of rcu_synchronize nodes
+ * to the other, as grace periods start and finish and
+ * nodes are processed by kworker.
+ *
+ *
+ * a. Initial llist callbacks list:
+ *
+ * +----------+ +--------+ +-------+
+ * | | | | | |
+ * | head |---------> | cb2 |--------->| cb1 |
+ * | | | | | |
+ * +----------+ +--------+ +-------+
+ *
+ *
+ *
+ * b. New GP1 Start:
+ *
+ * WAIT TAIL
+ * |
+ * |
+ * v
+ * +----------+ +--------+ +--------+ +-------+
+ * | | | | | | | |
+ * | head ------> wait |------> cb2 |------> | cb1 |
+ * | | | head1 | | | | |
+ * +----------+ +--------+ +--------+ +-------+
+ *
+ *
+ *
+ * c. GP completion:
+ *
+ * WAIT_TAIL == DONE_TAIL
+ *
+ * DONE TAIL
+ * |
+ * |
+ * v
+ * +----------+ +--------+ +--------+ +-------+
+ * | | | | | | | |
+ * | head ------> wait |------> cb2 |------> | cb1 |
+ * | | | head1 | | | | |
+ * +----------+ +--------+ +--------+ +-------+
+ *
+ *
+ *
+ * d. New callbacks and GP2 start:
+ *
+ * WAIT TAIL DONE TAIL
+ * | |
+ * | |
+ * v v
+ * +----------+ +------+ +------+ +------+ +-----+ +-----+ +-----+
+ * | | | | | | | | | | | | | |
+ * | head ------> wait |--->| cb4 |--->| cb3 |--->|wait |--->| cb2 |--->| cb1 |
+ * | | | head2| | | | | |head1| | | | |
+ * +----------+ +------+ +------+ +------+ +-----+ +-----+ +-----+
+ *
+ *
+ *
+ * e. GP2 completion:
+ *
+ * WAIT_TAIL == DONE_TAIL
+ * DONE TAIL
+ * |
+ * |
+ * v
+ * +----------+ +------+ +------+ +------+ +-----+ +-----+ +-----+
+ * | | | | | | | | | | | | | |
+ * | head ------> wait |--->| cb4 |--->| cb3 |--->|wait |--->| cb2 |--->| cb1 |
+ * | | | head2| | | | | |head1| | | | |
+ * +----------+ +------+ +------+ +------+ +-----+ +-----+ +-----+
+ *
+ *
+ * While the llist state transitions from d to e, a kworker
+ * can start executing rcu_sr_normal_gp_cleanup_work() and
+ * can observe either the old done tail (@c) or the new
+ * done tail (@e). So, done tail updates and reads need
+ * to use the rel-acq semantics. If the concurrent kworker
+ * observes the old done tail, the newly queued work
+ * execution will process the updated done tail. If the
+ * concurrent kworker observes the new done tail, then
+ * the newly queued work will skip processing the done
+ * tail, as workqueue semantics guarantees that the new
+ * work is executed only after the previous one completes.
+ *
+ * f. kworker callbacks processing complete:
+ *
+ *
+ * DONE TAIL
+ * |
+ * |
+ * v
+ * +----------+ +--------+
+ * | | | |
+ * | head ------> wait |
+ * | | | head2 |
+ * +----------+ +--------+
+ *
*/
-static struct sr_normal_state {
- struct llist_head srs_next; /* request a GP users. */
- struct llist_head srs_wait; /* wait for GP users. */
- struct llist_head srs_done; /* ready for GP users. */
+static bool rcu_sr_is_wait_head(struct llist_node *node)
+{
+ return &(rcu_state.srs_wait_nodes)[0].node <= node &&
+ node <= &(rcu_state.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
+}

- /*
- * In order to add a batch of nodes to already
- * existing srs-done-list, a tail of srs-wait-list
- * is maintained.
- */
- struct llist_node *srs_wait_tail;
-} sr;
+static struct llist_node *rcu_sr_get_wait_head(void)
+{
+ struct sr_wait_node *sr_wn;
+ int i;
+
+ for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
+ sr_wn = &(rcu_state.srs_wait_nodes)[i];
+
+ if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
+ return &sr_wn->node;
+ }
+
+ return NULL;
+}
+
+static void rcu_sr_put_wait_head(struct llist_node *node)
+{
+ struct sr_wait_node *sr_wn = container_of(node, struct sr_wait_node, node);
+ atomic_set_release(&sr_wn->inuse, 0);
+}

/* Disabled by default. */
static int rcu_normal_wake_from_gp;
@@ -1462,14 +1596,44 @@ static void rcu_sr_normal_complete(struct llist_node *node)

static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
{
- struct llist_node *done, *rcu, *next;
+ struct llist_node *done, *rcu, *next, *head;

- done = llist_del_all(&sr.srs_done);
+ /*
+ * This work execution can potentially execute
+ * while a new done tail is being updated by
+ * grace period kthread in rcu_sr_normal_gp_cleanup().
+ * So, read and updates of done tail need to
+ * follow acq-rel semantics.
+ *
+ * Given that wq semantics guarantees that a single work
+ * cannot execute concurrently by multiple kworkers,
+ * the done tail list manipulations are protected here.
+ */
+ done = smp_load_acquire(&rcu_state.srs_done_tail);
if (!done)
return;

- llist_for_each_safe(rcu, next, done)
- rcu_sr_normal_complete(rcu);
+ WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
+ head = done->next;
+ done->next = NULL;
+
+ /*
+ * The dummy node, which is pointed to by the
+ * done tail which is acq-read above is not removed
+ * here. This allows lockless additions of new
+ * rcu_synchronize nodes in rcu_sr_normal_add_req(),
+ * while the cleanup work executes. The dummy
+ * nodes is removed, in next round of cleanup
+ * work execution.
+ */
+ llist_for_each_safe(rcu, next, head) {
+ if (!rcu_sr_is_wait_head(rcu)) {
+ rcu_sr_normal_complete(rcu);
+ continue;
+ }
+
+ rcu_sr_put_wait_head(rcu);
+ }
}
static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);

@@ -1478,48 +1642,61 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
*/
static void rcu_sr_normal_gp_cleanup(void)
{
- struct llist_node *head, *tail;
+ struct llist_node *wait_tail;

- if (llist_empty(&sr.srs_wait))
+ wait_tail = rcu_state.srs_wait_tail;
+ if (wait_tail == NULL)
return;

- tail = READ_ONCE(sr.srs_wait_tail);
- head = __llist_del_all(&sr.srs_wait);
+ rcu_state.srs_wait_tail = NULL;
+ ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);

- if (head) {
- /* Can be not empty. */
- llist_add_batch(head, tail, &sr.srs_done);
+ // concurrent sr_normal_gp_cleanup work might observe this update.
+ smp_store_release(&rcu_state.srs_done_tail, wait_tail);
+ ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
+
+ if (wait_tail)
queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
- }
}

/*
* Helper function for rcu_gp_init().
*/
-static void rcu_sr_normal_gp_init(void)
+static bool rcu_sr_normal_gp_init(void)
{
- struct llist_node *head, *tail;
+ struct llist_node *first;
+ struct llist_node *wait_head;
+ bool start_new_poll = false;

- if (llist_empty(&sr.srs_next))
- return;
+ first = READ_ONCE(rcu_state.srs_next.first);
+ if (!first || rcu_sr_is_wait_head(first))
+ return start_new_poll;
+
+ wait_head = rcu_sr_get_wait_head();
+ if (!wait_head) {
+ // Kick another GP to retry.
+ start_new_poll = true;
+ return start_new_poll;
+ }

- tail = llist_del_all(&sr.srs_next);
- head = llist_reverse_order(tail);
+ /* Inject a wait-dummy-node. */
+ llist_add(wait_head, &rcu_state.srs_next);

/*
- * A waiting list of GP should be empty on this step,
- * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
+ * A waiting list of rcu_synchronize nodes should be empty on
+ * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
* rolls it over. If not, it is a BUG, warn a user.
*/
- WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
+ WARN_ON_ONCE(rcu_state.srs_wait_tail != NULL);
+ rcu_state.srs_wait_tail = wait_head;
+ ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);

- WRITE_ONCE(sr.srs_wait_tail, tail);
- __llist_add_batch(head, tail, &sr.srs_wait);
+ return start_new_poll;
}

static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
{
- llist_add((struct llist_node *) &rs->head, &sr.srs_next);
+ llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
}

/*
@@ -1532,6 +1709,7 @@ static noinline_for_stack bool rcu_gp_init(void)
unsigned long mask;
struct rcu_data *rdp;
struct rcu_node *rnp = rcu_get_root();
+ bool start_new_poll;

WRITE_ONCE(rcu_state.gp_activity, jiffies);
raw_spin_lock_irq_rcu_node(rnp);
@@ -1556,11 +1734,24 @@ static noinline_for_stack bool rcu_gp_init(void)
/* Record GP times before starting GP, hence rcu_seq_start(). */
rcu_seq_start(&rcu_state.gp_seq);
ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
- rcu_sr_normal_gp_init();
+ start_new_poll = rcu_sr_normal_gp_init();
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
raw_spin_unlock_irq_rcu_node(rnp);

+ /*
+ * The "start_new_poll" is set to true, only when this GP is not able
+ * to handle anything and there are outstanding users. It happens when
+ * the rcu_sr_normal_gp_init() function was not able to insert a dummy
+ * separator to the llist, because there were no left any dummy-nodes.
+ *
+ * Number of dummy-nodes is fixed, it could be that we are run out of
+ * them, if so we start a new pool request to repeat a try. It is rare
+ * and it means that a system is doing a slow processing of callbacks.
+ */
+ if (start_new_poll)
+ (void) start_poll_synchronize_rcu();
+
/*
* Apply per-leaf buffered online and offline operations to
* the rcu_node tree. Note that this new grace period need not
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index e9821a8422db..4c35d7d37647 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -316,6 +316,13 @@ do { \
__set_current_state(TASK_RUNNING); \
} while (0)

+#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
+
+struct sr_wait_node {
+ atomic_t inuse;
+ struct llist_node node;
+};
+
/*
* RCU global state, including node hierarchy. This hierarchy is
* represented in "heap" form in a dense array. The root (first level)
@@ -401,6 +408,12 @@ struct rcu_state {
/* Synchronize offline with */
/* GP pre-initialization. */
int nocb_is_setup; /* nocb is setup from boot */
+
+ /* synchronize_rcu() part. */
+ struct llist_head srs_next; /* request a GP users. */
+ struct llist_node *srs_wait_tail; /* wait for GP users. */
+ struct llist_node *srs_done_tail; /* ready for GP users. */
+ struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
};

/* Values for rcu_state structure's gp_flags field. */
--
2.39.2


2024-01-04 16:27:06

by Uladzislau Rezki (Sony)

Subject: [PATCH v4 4/4] rcu: Support direct wake-up of synchronize_rcu() users

This patch introduces a small enhancement which allows a direct
wake-up of synchronize_rcu() callers. It occurs after the completion
of a grace period, thus it is done by the gp-kthread.

The number of clients is limited by a hard-coded maximum threshold.
The remaining part, if any, is deferred to the main worker.

Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
---
kernel/rcu/tree.c | 31 +++++++++++++++++++++++++++++--
kernel/rcu/tree.h | 6 ++++++
2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 88a47a6a658e..96fe9cc122ca 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1642,7 +1642,8 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
*/
static void rcu_sr_normal_gp_cleanup(void)
{
- struct llist_node *wait_tail;
+ struct llist_node *wait_tail, *next, *rcu;
+ int done = 0;

wait_tail = rcu_state.srs_wait_tail;
if (wait_tail == NULL)
@@ -1650,12 +1651,38 @@ static void rcu_sr_normal_gp_cleanup(void)

rcu_state.srs_wait_tail = NULL;
ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
+ WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
+
+ /*
+ * Process (a) and (d) cases. See an illustration. Apart of
+ * that it handles the scenario when all clients are done,
+ * wait-head is released if last. The worker is not kicked.
+ */
+ llist_for_each_safe(rcu, next, wait_tail->next) {
+ if (rcu_sr_is_wait_head(rcu)) {
+ if (!rcu->next) {
+ rcu_sr_put_wait_head(rcu);
+ wait_tail->next = NULL;
+ } else {
+ wait_tail->next = rcu;
+ }
+
+ break;
+ }
+
+ rcu_sr_normal_complete(rcu);
+ // It can be last, update a next on this step.
+ wait_tail->next = next;
+
+ if (++done == SR_MAX_USERS_WAKE_FROM_GP)
+ break;
+ }

// concurrent sr_normal_gp_cleanup work might observe this update.
smp_store_release(&rcu_state.srs_done_tail, wait_tail);
ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);

- if (wait_tail)
+ if (wait_tail->next)
queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
}

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4c35d7d37647..5d8b71a1caec 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -316,6 +316,12 @@ do { \
__set_current_state(TASK_RUNNING); \
} while (0)

+/*
+ * A max threshold for synchronize_rcu() users which are
+ * awaken directly by the rcu_gp_kthread(). Left part is
+ * deferred to the main worker.
+ */
+#define SR_MAX_USERS_WAKE_FROM_GP 5
#define SR_NORMAL_GP_WAIT_HEAD_MAX 5

struct sr_wait_node {
--
2.39.2


2024-01-09 19:26:05

by Kalesh Singh

Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

On Thu, Jan 04, 2024 at 05:25:07PM +0100, Uladzislau Rezki (Sony) wrote:
> A call to a synchronize_rcu() can be optimized from a latency
> point of view. Workloads which depend on this can benefit of it.
>
> The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> depends on several factors:
>
> - how fast a process of offloading is started. Combination of:
> - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> - other.
> - when started, invoking path is interrupted due to:
> - time limit;
> - need_resched();
> - if limit is reached.
> - where in a nocb list it is located;
> - how fast previous callbacks completed;
>
> Example:
>
> 1. On our embedded devices i can easily trigger the scenario when
> it is a last in the list out of ~3600 callbacks:
>
> <snip>
> <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
> <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
>
> 2. We use cpuset/cgroup to classify tasks and assign them into
> different cgroups. For example "backgrond" group which binds tasks
> only to little CPUs or "foreground" which makes use of all CPUs.
> Tasks can be migrated between groups by a request if an acceleration
> is needed.
>
> See below an example how "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup which
> allows to run only on little cores. In order to speed it up it
> can be temporary moved into "foreground" cgroup which allows
> to use big/all CPUs:
>
> cgroup_attach_task():
> -> cgroup_migrate_execute()
> -> cpuset_can_attach()
> -> percpu_down_write()
> -> rcu_sync_enter()
> -> synchronize_rcu()
> -> now move tasks to the new cgroup.
> -> cgroup_migrate_finish()
>
> <snip>
> rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
> PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
> TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
> <snip>
>
> "Boosting a task" depends on synchronize_rcu() latency:
>
> - first trace shows a completion of synchronize_rcu();
> - second shows attaching a task to a new group;
> - last shows a final step when migration occurs.
>
> 3. To address this drawback, maintain a separate track that consists
> of synchronize_rcu() callers only. After completion of a grace period
> users are deferred to a dedicated worker to process requests.
>
> 4. This patch reduces the latency of synchronize_rcu() approximately
> by ~30-40% on synthetic tests. The real test case, camera launch time,
> shows(time is in milliseconds):
>
> 1-run 542 vs 489 improvement 9%
> 2-run 540 vs 466 improvement 13%
> 3-run 518 vs 468 improvement 9%
> 4-run 531 vs 457 improvement 13%
> 5-run 548 vs 475 improvement 13%
> 6-run 509 vs 484 improvement 4%
>
> Synthetic test(no "noise" from other callbacks):
> Hardware: x86_64 64 CPUs, 64GB of memory
> Linux-6.6
>
> - 10K tasks(simultaneous);
> - each task does(1000 loops)
> synchronize_rcu();
> kfree(p);
>
> default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
> patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.
>
> Running 60K gives approximately same results on my setup. Please note
> it is without any interaction with another type of callbacks, otherwise
> it will impact a lot a default case.
>
> 5. An extra CONFIG_RCU_SR_NORMAL_DEBUG_GP kernel option is added
> which enables additional debugging for detecting a grace period
> incompletion for synchronize_rcu() users. If a GP is not fully
> passed for any user, the warning message is emitted.
>
> 6. By default it is disabled. To enable this perform one of the
> below sequence:
>
> echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>

Hi Uladzislau,

I've tried your patches (v3) on Android with a 6.1.43 kernel.

The test cycles 10 apps (including camera) sequentially for 100
iterations.

I've set rcu_normal to override rcu_expedited in the boot
parameters:

adb shell cat /proc/cmdline | tr ' ' '\n' | grep rcu

rcupdate.rcu_normal=1
rcupdate.rcu_expedited=1
rcu_nocbs=0-7


The configurations are:

A - echo 0 >/sys/module/rcutree/parameters/rcu_normal_wake_from_gp
B - echo 1 >/sys/module/rcutree/parameters/rcu_normal_wake_from_gp

Results:

= APP LAUNCH TIME =
delta (B-A) ratio(%)
overall_app_launch_time(ms) -11399.00 -6.65


== camera_launch_time
type delta(B-A %) A_count B_count
HOT -7.05 99 99
COLD -6.33 1 1


=== Function Latencies ===

Tracing synchronize_rcu_expedited. Hit Ctrl-C to exit Tracing synchronize_rcu_expedited. Hit Ctrl-C to exit

nsec : count distribution nsec : count distribution
0 -> 1 : 0 | | 0 -> 1 : 0 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
4 -> 7 : 0 | | 4 -> 7 : 0 | |
8 -> 15 : 0 | | 8 -> 15 : 0 | |
16 -> 31 : 0 | | 16 -> 31 : 0 | |
32 -> 63 : 0 | | 32 -> 63 : 0 | |
64 -> 127 : 0 | | 64 -> 127 : 0 | |
128 -> 255 : 0 | | 128 -> 255 : 0 | |
256 -> 511 : 0 | | 256 -> 511 : 0 | |
512 -> 1023 : 0 | | 512 -> 1023 : 0 | |
1024 -> 2047 : 0 | | 1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | | 2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | | 4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | | 8192 -> 16383 : 0 | |
16384 -> 32767 : 0 | | 16384 -> 32767 : 0 | |
32768 -> 65535 : 0 | | 32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | | 65536 -> 131071 : 0 | |
131072 -> 262143 : 0 | | 131072 -> 262143 : 0 | |
262144 -> 524287 : 0 | | 262144 -> 524287 : 0 | |
524288 -> 1048575 : 0 | | 524288 -> 1048575 : 0 | |
1048576 -> 2097151 : 0 | | 1048576 -> 2097151 : 0 | |
2097152 -> 4194303 : 0 | | 2097152 -> 4194303 : 0 | |
4194304 -> 8388607 : 871 |** | 4194304 -> 8388607 : 1180 |**** |
8388608 -> 16777215 : 3204 |******** | 8388608 -> 16777215 : 7020 |************************* |
16777216 -> 33554431 : 15013 |****************************************| 16777216 -> 33554431 : 10952 |****************************************|
Exiting trace of synchronize_rcu_expedited Exiting trace of synchronize_rcu_expedited


Tracing synchronize_rcu. Hit Ctrl-C to exit Tracing synchronize_rcu. Hit Ctrl-C to exit

nsec : count distribution nsec : count distribution
0 -> 1 : 0 | | 0 -> 1 : 0 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
4 -> 7 : 0 | | 4 -> 7 : 0 | |
8 -> 15 : 0 | | 8 -> 15 : 0 | |
16 -> 31 : 0 | | 16 -> 31 : 0 | |
32 -> 63 : 0 | | 32 -> 63 : 0 | |
64 -> 127 : 0 | | 64 -> 127 : 0 | |
128 -> 255 : 0 | | 128 -> 255 : 0 | |
256 -> 511 : 0 | | 256 -> 511 : 0 | |
512 -> 1023 : 0 | | 512 -> 1023 : 0 | |
1024 -> 2047 : 0 | | 1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | | 2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | | 4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | | 8192 -> 16383 : 0 | |
16384 -> 32767 : 0 | | 16384 -> 32767 : 0 | |
32768 -> 65535 : 0 | | 32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | | 65536 -> 131071 : 0 | |
131072 -> 262143 : 0 | | 131072 -> 262143 : 0 | |
262144 -> 524287 : 0 | | 262144 -> 524287 : 0 | |
524288 -> 1048575 : 0 | | 524288 -> 1048575 : 0 | |
1048576 -> 2097151 : 0 | | 1048576 -> 2097151 : 0 | |
2097152 -> 4194303 : 0 | | 2097152 -> 4194303 : 0 | |
4194304 -> 8388607 : 861 |** | 4194304 -> 8388607 : 1136 |**** |
8388608 -> 16777215 : 3111 |******** | 8388608 -> 16777215 : 6320 |************************ |
16777216 -> 33554431 : 13901 |****************************************| 16777216 -> 33554431 : 10484 |****************************************|
Exiting trace of synchronize_rcu Exiting trace of synchronize_rcu


Interestingly I tried the same experiment without rcu_normal=1 (leaving rcu_expedited=1):

adb shell cat /proc/cmdline | tr ' ' '\n' | grep rcu
rcupdate.rcu_expedited=1
rcu_nocbs=0-7

In this case I also saw the 6 to 7% decrease in the app launch times,
but I don't have a good explanation for why that would be. (The function
latency histograms in this case didn't show any significant difference.)
Do you have any insight into why this may happen?

Thanks,
Kalesh





2024-01-10 09:22:10

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

Hello, Kalesh!

>
> Hi Uladzislau,
>
> I've tried your patches (v3) on Android with 6.1.43 kernel.
>
> The test cycles 10 apps (including camera) sequentially for 100
> iterations.
>
> I've set rcu_normal to override the rcu_expedited in the boot
> parameters:
>
> adb shell cat /proc/cmdline | tr ' ' '\n' | grep rcu
>
> rcupdate.rcu_normal=1
> rcupdate.rcu_expedited=1
> rcu_nocbs=0-7
>
>
> The configurations are:
>
> A - echo 0 >/sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> B - echo 1 >/sys/module/rcutree/parameters/rcu_normal_wake_from_gp
>
> Results:
>
> = APP LAUNCH TIME =
> delta (B-A) ratio(%)
> overall_app_launch_time(ms) -11399.00 -6.65
>
>
> == camera_launch_time
> type delta(B-A %) A_count B_count
> HOT -7.05 99 99
> COLD -6.33 1 1
>
>
If I interpret it correctly, you also see that this series reduces
launch time by 6-7% on your app set. Is that correct?

> === Function Latencies ===
>
> Tracing synchronize_rcu_expedited. Hit Ctrl-C to exit Tracing synchronize_rcu_expedited. Hit Ctrl-C to exit
>
> nsec : count distribution nsec : count distribution
> 0 -> 1 : 0 | | 0 -> 1 : 0 | |
> 2 -> 3 : 0 | | 2 -> 3 : 0 | |
> 4 -> 7 : 0 | | 4 -> 7 : 0 | |
> 8 -> 15 : 0 | | 8 -> 15 : 0 | |
> 16 -> 31 : 0 | | 16 -> 31 : 0 | |
> 32 -> 63 : 0 | | 32 -> 63 : 0 | |
> 64 -> 127 : 0 | | 64 -> 127 : 0 | |
> 128 -> 255 : 0 | | 128 -> 255 : 0 | |
> 256 -> 511 : 0 | | 256 -> 511 : 0 | |
> 512 -> 1023 : 0 | | 512 -> 1023 : 0 | |
> 1024 -> 2047 : 0 | | 1024 -> 2047 : 0 | |
> 2048 -> 4095 : 0 | | 2048 -> 4095 : 0 | |
> 4096 -> 8191 : 0 | | 4096 -> 8191 : 0 | |
> 8192 -> 16383 : 0 | | 8192 -> 16383 : 0 | |
> 16384 -> 32767 : 0 | | 16384 -> 32767 : 0 | |
> 32768 -> 65535 : 0 | | 32768 -> 65535 : 0 | |
> 65536 -> 131071 : 0 | | 65536 -> 131071 : 0 | |
> 131072 -> 262143 : 0 | | 131072 -> 262143 : 0 | |
> 262144 -> 524287 : 0 | | 262144 -> 524287 : 0 | |
> 524288 -> 1048575 : 0 | | 524288 -> 1048575 : 0 | |
> 1048576 -> 2097151 : 0 | | 1048576 -> 2097151 : 0 | |
> 2097152 -> 4194303 : 0 | | 2097152 -> 4194303 : 0 | |
> 4194304 -> 8388607 : 871 |** | 4194304 -> 8388607 : 1180 |**** |
> 8388608 -> 16777215 : 3204 |******** | 8388608 -> 16777215 : 7020 |************************* |
> 16777216 -> 33554431 : 15013 |****************************************| 16777216 -> 33554431 : 10952 |****************************************|
> Exiting trace of synchronize_rcu_expedited Exiting trace of synchronize_rcu_expedited
>
>
> Tracing synchronize_rcu. Hit Ctrl-C to exit Tracing synchronize_rcu. Hit Ctrl-C to exit
>
> nsec : count distribution nsec : count distribution
> 0 -> 1 : 0 | | 0 -> 1 : 0 | |
> 2 -> 3 : 0 | | 2 -> 3 : 0 | |
> 4 -> 7 : 0 | | 4 -> 7 : 0 | |
> 8 -> 15 : 0 | | 8 -> 15 : 0 | |
> 16 -> 31 : 0 | | 16 -> 31 : 0 | |
> 32 -> 63 : 0 | | 32 -> 63 : 0 | |
> 64 -> 127 : 0 | | 64 -> 127 : 0 | |
> 128 -> 255 : 0 | | 128 -> 255 : 0 | |
> 256 -> 511 : 0 | | 256 -> 511 : 0 | |
> 512 -> 1023 : 0 | | 512 -> 1023 : 0 | |
> 1024 -> 2047 : 0 | | 1024 -> 2047 : 0 | |
> 2048 -> 4095 : 0 | | 2048 -> 4095 : 0 | |
> 4096 -> 8191 : 0 | | 4096 -> 8191 : 0 | |
> 8192 -> 16383 : 0 | | 8192 -> 16383 : 0 | |
> 16384 -> 32767 : 0 | | 16384 -> 32767 : 0 | |
> 32768 -> 65535 : 0 | | 32768 -> 65535 : 0 | |
> 65536 -> 131071 : 0 | | 65536 -> 131071 : 0 | |
> 131072 -> 262143 : 0 | | 131072 -> 262143 : 0 | |
> 262144 -> 524287 : 0 | | 262144 -> 524287 : 0 | |
> 524288 -> 1048575 : 0 | | 524288 -> 1048575 : 0 | |
> 1048576 -> 2097151 : 0 | | 1048576 -> 2097151 : 0 | |
> 2097152 -> 4194303 : 0 | | 2097152 -> 4194303 : 0 | |
> 4194304 -> 8388607 : 861 |** | 4194304 -> 8388607 : 1136 |**** |
> 8388608 -> 16777215 : 3111 |******** | 8388608 -> 16777215 : 6320 |************************ |
> 16777216 -> 33554431 : 13901 |****************************************| 16777216 -> 33554431 : 10484 |****************************************|
> Exiting trace of synchronize_rcu Exiting trace of synchronize_rcu
>
Who is B and who is A?

>
> Interestingly I tried the same experiment without rcu_normal=1 (leaving rcu_expedited=1):
>
> adb shell cat /proc/cmdline | tr ' ' '\n' | grep rcu
> rcupdate.rcu_expedited=1
> rcu_nocbs=0-7
>
> In this case I also saw the -6 to -7% decrease in the app launch times
> but I don't have a good explanation why that would be? (The fucntion
> latency histograms in this case didn't show any significant difference).
> Do you have any insight why this may happen?
>
When rcu_expedited=1 is set and rcu_normal is disabled (rcu_normal=0),
the synchronize_rcu() call is converted into synchronize_rcu_expedited():

<snip>
void synchronize_rcu(void)
{
	unsigned long flags;
	struct rcu_node *rnp;

	RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
			 lock_is_held(&rcu_lock_map) ||
			 lock_is_held(&rcu_sched_lock_map),
			 "Illegal synchronize_rcu() in RCU read-side critical section");
	if (!rcu_blocking_is_gp()) {
		if (rcu_gp_is_expedited())
			synchronize_rcu_expedited();
		else
			synchronize_rcu_normal();
		return;
	}
	..
<snip>

rcu_gp_is_expedited() is true, so the "expedited" version is invoked.

I have some concerns about preferring the expedited version as a global
replacement. First of all, it matters for latency-sensitive workloads,
because in order to expedite a grace period it sends out IPIs to all
online CPUs to force them to report a quiescent state as soon as
possible. I have not yet investigated how it affects such workloads.

That is why, in your case, you also see a performance boost for your app set.

Thank you for looking at it!

--
Uladzislau Rezki

2024-01-11 16:37:59

by Kalesh Singh

Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

On Wed, Jan 10, 2024 at 1:22 AM Uladzislau Rezki <[email protected]> wrote:
>
> Hello, Kalesh!
>
> >
> > Hi Uladzislau,
> >
> > I've tried your patches (v3) on Android with 6.1.43 kernel.
> >
> > The test cycles 10 apps (including camera) sequentially for 100
> > iterations.
> >
> > I've set rcu_normal to override the rcu_expedited in the boot
> > parameters:
> >
> > adb shell cat /proc/cmdline | tr ' ' '\n' | grep rcu
> >
> > rcupdate.rcu_normal=1
> > rcupdate.rcu_expedited=1
> > rcu_nocbs=0-7
> >
> >
> > The configurations are:
> >
> > A - echo 0 >/sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> > B - echo 1 >/sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> >
> > Results:
> >
> > = APP LAUNCH TIME =
> > delta (B-A) ratio(%)
> > overall_app_launch_time(ms) -11399.00 -6.65
> >
> >
> > == camera_launch_time
> > type delta(B-A %) A_count B_count
> > HOT -7.05 99 99
> > COLD -6.33 1 1
> >
> >

Hi Uladzislau,

> If i interpret it correctly you also see that this series reduces
> a launch time by 6/7% on your app set. Is that correct?

Yes your understanding is correct.

>
> > === Function Latencies ===
> >
> > Tracing synchronize_rcu_expedited. Hit Ctrl-C to exit Tracing synchronize_rcu_expedited. Hit Ctrl-C to exit
> >
> > nsec : count distribution nsec : count distribution
> > 0 -> 1 : 0 | | 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | | 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | | 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | | 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | | 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | | 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | | 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | | 128 -> 255 : 0 | |
> > 256 -> 511 : 0 | | 256 -> 511 : 0 | |
> > 512 -> 1023 : 0 | | 512 -> 1023 : 0 | |
> > 1024 -> 2047 : 0 | | 1024 -> 2047 : 0 | |
> > 2048 -> 4095 : 0 | | 2048 -> 4095 : 0 | |
> > 4096 -> 8191 : 0 | | 4096 -> 8191 : 0 | |
> > 8192 -> 16383 : 0 | | 8192 -> 16383 : 0 | |
> > 16384 -> 32767 : 0 | | 16384 -> 32767 : 0 | |
> > 32768 -> 65535 : 0 | | 32768 -> 65535 : 0 | |
> > 65536 -> 131071 : 0 | | 65536 -> 131071 : 0 | |
> > 131072 -> 262143 : 0 | | 131072 -> 262143 : 0 | |
> > 262144 -> 524287 : 0 | | 262144 -> 524287 : 0 | |
> > 524288 -> 1048575 : 0 | | 524288 -> 1048575 : 0 | |
> > 1048576 -> 2097151 : 0 | | 1048576 -> 2097151 : 0 | |
> > 2097152 -> 4194303 : 0 | | 2097152 -> 4194303 : 0 | |
> > 4194304 -> 8388607 : 871 |** | 4194304 -> 8388607 : 1180 |**** |
> > 8388608 -> 16777215 : 3204 |******** | 8388608 -> 16777215 : 7020 |************************* |
> > 16777216 -> 33554431 : 15013 |****************************************| 16777216 -> 33554431 : 10952 |****************************************|
> > Exiting trace of synchronize_rcu_expedited Exiting trace of synchronize_rcu_expedited
> >
> >
> > Tracing synchronize_rcu. Hit Ctrl-C to exit Tracing synchronize_rcu. Hit Ctrl-C to exit
> >
> > nsec : count distribution nsec : count distribution
> > 0 -> 1 : 0 | | 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | | 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | | 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | | 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | | 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | | 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | | 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | | 128 -> 255 : 0 | |
> > 256 -> 511 : 0 | | 256 -> 511 : 0 | |
> > 512 -> 1023 : 0 | | 512 -> 1023 : 0 | |
> > 1024 -> 2047 : 0 | | 1024 -> 2047 : 0 | |
> > 2048 -> 4095 : 0 | | 2048 -> 4095 : 0 | |
> > 4096 -> 8191 : 0 | | 4096 -> 8191 : 0 | |
> > 8192 -> 16383 : 0 | | 8192 -> 16383 : 0 | |
> > 16384 -> 32767 : 0 | | 16384 -> 32767 : 0 | |
> > 32768 -> 65535 : 0 | | 32768 -> 65535 : 0 | |
> > 65536 -> 131071 : 0 | | 65536 -> 131071 : 0 | |
> > 131072 -> 262143 : 0 | | 131072 -> 262143 : 0 | |
> > 262144 -> 524287 : 0 | | 262144 -> 524287 : 0 | |
> > 524288 -> 1048575 : 0 | | 524288 -> 1048575 : 0 | |
> > 1048576 -> 2097151 : 0 | | 1048576 -> 2097151 : 0 | |
> > 2097152 -> 4194303 : 0 | | 2097152 -> 4194303 : 0 | |
> > 4194304 -> 8388607 : 861 |** | 4194304 -> 8388607 : 1136 |**** |
> > 8388608 -> 16777215 : 3111 |******** | 8388608 -> 16777215 : 6320 |************************ |
> > 16777216 -> 33554431 : 13901 |****************************************| 16777216 -> 33554431 : 10484 |****************************************|
> > Exiting trace of synchronize_rcu Exiting trace of synchronize_rcu
> >
> Who is B and who is A?

Left is A (rcu_normal_wake_from_gp=0) and right is B
(rcu_normal_wake_from_gp=1)
>
> >
> > Interestingly I tried the same experiment without rcu_normal=1 (leaving rcu_expedited=1):
> >
> > adb shell cat /proc/cmdline | tr ' ' '\n' | grep rcu
> > rcupdate.rcu_expedited=1
> > rcu_nocbs=0-7
> >
> > In this case I also saw the -6 to -7% decrease in the app launch times,
> > but I don't have a good explanation for why that would be. (The function
> > latency histograms in this case didn't show any significant difference.)
> > Do you have any insight into why this may happen?
> >
> When rcu_expedited=1 is set and rcu_normal=0 (disabled), the
> synchronize_rcu() call is converted into synchronize_rcu_expedited():
>
> <snip>
> void synchronize_rcu(void)
> {
> unsigned long flags;
> struct rcu_node *rnp;
>
> RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
> lock_is_held(&rcu_lock_map) ||
> lock_is_held(&rcu_sched_lock_map),
> "Illegal synchronize_rcu() in RCU read-side critical section");
> if (!rcu_blocking_is_gp()) {
> if (rcu_gp_is_expedited())
> synchronize_rcu_expedited();
> else
> synchronize_rcu_normal();
> return;
> }
> ...
> <snip>
>
> rcu_gp_is_expedited() returns true, so the "expedited" version is invoked.
>
> I see some concerns with preferring the expedited version as a global
> replacement. First of all, it can hurt latency-sensitive workloads,
> because in order to expedite a grace period it sends out IPIs to all
> online CPUs to force them to report a quiescent state as soon as
> possible. I have not yet investigated how it affects such workloads.
>
> That is also why, in your case, you see a performance boost for your app set.

IIUC the patch shouldn't affect the case? The only difference in A vs
B is rcu_normal_wake_from_gp (both have rcu_expedited=1).

Thanks,
Kalesh
>
> Thank you for looking at it!
>
> --
> Uladzislau Rezki

2024-01-11 17:35:48

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

Hello, Kalesh!
> >
> > >
> > > Hi Uladzislau,
> > >
> > > I've tried your patches (v3) on Android with 6.1.43 kernel.
> > >
> > > The test cycles 10 apps (including camera) sequentially for 100
> > > iterations.
> > >
> > > I've set rcu_normal to override the rcu_expedited in the boot
> > > parameters:
> > >
> > > adb shell cat /proc/cmdline | tr ' ' '\n' | grep rcu
> > >
> > > rcupdate.rcu_normal=1
> > > rcupdate.rcu_expedited=1
> > > rcu_nocbs=0-7
> > >
> > >
> > > The configurations are:
> > >
> > > A - echo 0 >/sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> > > B - echo 1 >/sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> > >
> > > Results:
> > >
> > > = APP LAUNCH TIME =
> > > delta (B-A) ratio(%)
> > > overall_app_launch_time(ms) -11399.00 -6.65
> > >
> > >
> > > == camera_launch_time
> > > type delta(B-A %) A_count B_count
> > > HOT -7.05 99 99
> > > COLD -6.33 1 1
> > >
> > >
>
> Hi Uladzislau,
>
> > If I interpret it correctly, you also see that this series reduces
> > the launch time by 6-7% on your app set. Is that correct?
>
> Yes, your understanding is correct.
>
> >
> > > === Function Latencies ===
> > >
> > > [ ... function latency histograms snipped; identical to those quoted above ... ]
> > >
> > Who is B and who is A?
>
> Left is A (rcu_normal_wake_from_gp=0) and right is B
> (rcu_normal_wake_from_gp=1)
> >
> > >
> > > Interestingly I tried the same experiment without rcu_normal=1 (leaving rcu_expedited=1):
> > >
> > > adb shell cat /proc/cmdline | tr ' ' '\n' | grep rcu
> > > rcupdate.rcu_expedited=1
> > > rcu_nocbs=0-7
> > >
> > > In this case I also saw the -6 to -7% decrease in the app launch times,
> > > but I don't have a good explanation for why that would be. (The function
> > > latency histograms in this case didn't show any significant difference.)
> > > Do you have any insight into why this may happen?
> > >
> > When rcu_expedited=1 is set and rcu_normal=0 (disabled), the
> > synchronize_rcu() call is converted into synchronize_rcu_expedited():
> >
> > <snip>
> > void synchronize_rcu(void)
> > {
> > unsigned long flags;
> > struct rcu_node *rnp;
> >
> > RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
> > lock_is_held(&rcu_lock_map) ||
> > lock_is_held(&rcu_sched_lock_map),
> > "Illegal synchronize_rcu() in RCU read-side critical section");
> > if (!rcu_blocking_is_gp()) {
> > if (rcu_gp_is_expedited())
> > synchronize_rcu_expedited();
> > else
> > synchronize_rcu_normal();
> > return;
> > }
> > ...
> > <snip>
> >
> > rcu_gp_is_expedited() returns true, so the "expedited" version is invoked.
> >
> > I see some concerns with preferring the expedited version as a global
> > replacement. First of all, it can hurt latency-sensitive workloads,
> > because in order to expedite a grace period it sends out IPIs to all
> > online CPUs to force them to report a quiescent state as soon as
> > possible. I have not yet investigated how it affects such workloads.
> >
> > That is also why, in your case, you see a performance boost for your app set.
>
> IIUC the patch shouldn't affect the case? The only difference in A vs
> B is rcu_normal_wake_from_gp (both have rcu_expedited=1).
>
Right. This patch does not touch the "expedited" version at all.

I appreciate the test results and the review!

--
Uladzislau Rezki

2024-01-12 23:10:16

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

Le Thu, Jan 04, 2024 at 05:25:07PM +0100, Uladzislau Rezki (Sony) a écrit :
> diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
> index 9b0b52e1836f..4812c6249185 100644
> --- a/kernel/rcu/Kconfig.debug
> +++ b/kernel/rcu/Kconfig.debug
> @@ -168,4 +168,16 @@ config RCU_STRICT_GRACE_PERIOD
> when looking for certain types of RCU usage bugs, for example,
> too-short RCU read-side critical sections.
>
> +config RCU_SR_NORMAL_DEBUG_GP
> + bool "Debug synchronize_rcu() callers for a grace period completion"
> + depends on DEBUG_KERNEL && RCU_EXPERT
> + default n
> + help
> + This option enables additional debugging for detecting a grace
> + period incompletion for synchronize_rcu() users. If a GP is not
> + fully passed for any user, the warning message is emitted.
> +
> + Say Y here if you want to enable such debugging
> + Say N if you are unsure.

How about just reuse CONFIG_PROVE_RCU instead?

> +
> endmenu # "RCU Debugging"
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 499803234176..b756c40e4960 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1422,6 +1422,106 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> }
>
> +/*
> + * There are three lists for handling synchronize_rcu() users.
> + * A first list corresponds to new coming users, second for users
> + * which wait for a grace period and third is for which a grace
> + * period is passed.
> + */
> +static struct sr_normal_state {
> + struct llist_head srs_next; /* request a GP users. */
> + struct llist_head srs_wait; /* wait for GP users. */
> + struct llist_head srs_done; /* ready for GP users. */
> +
> + /*
> + * In order to add a batch of nodes to already
> + * existing srs-done-list, a tail of srs-wait-list
> + * is maintained.
> + */
> + struct llist_node *srs_wait_tail;
> +} sr;

"sr" is good enough for a function scope variable but not for a file scope one.

At least "sr_state" would be better. Or maybe you don't even need to name that
struct and make instead:

struct {
...
...
} sr_normal_state;


> +
> +/* Disabled by default. */
> +static int rcu_normal_wake_from_gp;
> +module_param(rcu_normal_wake_from_gp, int, 0644);
> +
> +static void rcu_sr_normal_complete(struct llist_node *node)
> +{
> + struct rcu_synchronize *rs = container_of(
> + (struct rcu_head *) node, struct rcu_synchronize, head);

Should there be some union in struct rcu_synchronize between struct rcu_head
and struct llist_node?

Anyway it's stack allocated, they could even be separate fields.
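
That is, just a sketch of the separate-fields variant (the extra field
names below are only an example, not something from the series):

<snip>
struct rcu_synchronize {
	struct rcu_head head;
	struct completion completion;

	/* Used only by synchronize_rcu_normal(). */
	struct llist_node node;
	unsigned long oldstate;
};
<snip>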

> + unsigned long oldstate = (unsigned long) rs->head.func;

Luckily struct callback_head layout allows such magic but if rcu_head
and llist_node were separate, reviewers would be less hurt.

If stack space really matters, something like the below?

struct rcu_synchronize {
	union {
		struct rcu_head head;
		struct {
			struct llist_node node;
			unsigned long seq;
		};
	};
	struct completion completion;
};


> +
> + WARN_ONCE(IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP) &&
> + !poll_state_synchronize_rcu(oldstate),
> + "A full grace period is not passed yet: %lu",
> + rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
> +
> + /* Finally. */
> + complete(&rs->completion);
> +}
> +
[...]
> +
> +/*
> + * Helper function for rcu_gp_cleanup().
> + */
> +static void rcu_sr_normal_gp_cleanup(void)
> +{
> + struct llist_node *head, *tail;
> +
> + if (llist_empty(&sr.srs_wait))
> + return;
> +
> + tail = READ_ONCE(sr.srs_wait_tail);

Is the READ_ONCE() needed?

A part from those boring details:

Reviewed-by: Frederic Weisbecker <[email protected]>

2024-01-12 23:20:25

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] rcu: Add a trace event for synchronize_rcu_normal()

Le Thu, Jan 04, 2024 at 05:25:08PM +0100, Uladzislau Rezki (Sony) a écrit :
> Add an rcu_sr_normal() trace event. It takes three arguments: the
> first one is the name of the RCU flavour, the second one is a user id
> which triggers synchronize_rcu_normal() and the last one is an
> event.
>
> There are two trace points in synchronize_rcu_normal(): on entry,
> when a new request is registered, and on exit, when the request
> is completed.
>
> Please note, CONFIG_RCU_TRACE=y is required to activate traces.
>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---
> include/trace/events/rcu.h | 27 +++++++++++++++++++++++++++
> kernel/rcu/tree.c | 7 ++++++-
> 2 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
> index 2ef9c719772a..31b3e0d3e65f 100644
> --- a/include/trace/events/rcu.h
> +++ b/include/trace/events/rcu.h
> @@ -707,6 +707,33 @@ TRACE_EVENT_RCU(rcu_invoke_kfree_bulk_callback,
> __entry->rcuname, __entry->p, __entry->nr_records)
> );
>
> +/*
> + * Tracepoint for a normal synchronize_rcu() states. The first argument
> + * is the RCU flavor, the second argument is a pointer to rcu_head the
> + * last one is an event.
> + */
> +TRACE_EVENT_RCU(rcu_sr_normal,

Can we call this "synchronize_rcu" instead? So people really know what it's
about.

Then should the need arise, we can still add "synchronize_rcu_expedited" trace
events later.

Thanks.

> +
> + TP_PROTO(const char *rcuname, struct rcu_head *rhp, const char *srevent),
> +
> + TP_ARGS(rcuname, rhp, srevent),
> +
> + TP_STRUCT__entry(
> + __field(const char *, rcuname)
> + __field(void *, rhp)
> + __field(const char *, srevent)
> + ),
> +
> + TP_fast_assign(
> + __entry->rcuname = rcuname;
> + __entry->rhp = rhp;
> + __entry->srevent = srevent;
> + ),
> +
> + TP_printk("%s rhp=0x%p event=%s",
> + __entry->rcuname, __entry->rhp, __entry->srevent)
> +);
> +
> /*
> * Tracepoint for exiting rcu_do_batch after RCU callbacks have been
> * invoked. The first argument is the name of the RCU flavor,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index b756c40e4960..7d2ed89efcb3 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3672,9 +3672,11 @@ static void synchronize_rcu_normal(void)
> {
> struct rcu_synchronize rs;
>
> + trace_rcu_sr_normal(rcu_state.name, &rs.head, TPS("request"));
> +
> if (!READ_ONCE(rcu_normal_wake_from_gp)) {
> wait_rcu_gp(call_rcu_hurry);
> - return;
> + goto trace_complete_out;
> }
>
> init_rcu_head_on_stack(&rs.head);
> @@ -3695,6 +3697,9 @@ static void synchronize_rcu_normal(void)
> /* Now we can wait. */
> wait_for_completion(&rs.completion);
> destroy_rcu_head_on_stack(&rs.head);
> +
> +trace_complete_out:
> + trace_rcu_sr_normal(rcu_state.name, &rs.head, TPS("complete"));
> }
>
> /**
> --
> 2.39.2
>

2024-01-13 09:20:04

by Zqiang

[permalink] [raw]
Subject: Re: [PATCH v4 4/4] rcu: Support direct wake-up of synchronize_rcu() users

>
> This patch introduces a small enhancement which allows a direct
> wake-up of synchronize_rcu() callers. It occurs after the completion
> of a grace period, thus by the gp-kthread.
>
> The number of clients is limited by a hard-coded maximum allowed
> threshold. The remaining part, if any, is deferred to
> the main worker.
>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---
> kernel/rcu/tree.c | 31 +++++++++++++++++++++++++++++--
> kernel/rcu/tree.h | 6 ++++++
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 88a47a6a658e..96fe9cc122ca 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1642,7 +1642,8 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
> */
> static void rcu_sr_normal_gp_cleanup(void)
> {
> - struct llist_node *wait_tail;
> + struct llist_node *wait_tail, *next, *rcu;
> + int done = 0;
>
> wait_tail = rcu_state.srs_wait_tail;
> if (wait_tail == NULL)
> @@ -1650,12 +1651,38 @@ static void rcu_sr_normal_gp_cleanup(void)
>
> rcu_state.srs_wait_tail = NULL;
> ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
> + WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> +
> + /*
> + * Process (a) and (d) cases. See an illustration. Apart of
> + * that it handles the scenario when all clients are done,
> + * wait-head is released if last. The worker is not kicked.
> + */
> + llist_for_each_safe(rcu, next, wait_tail->next) {
> + if (rcu_sr_is_wait_head(rcu)) {
> + if (!rcu->next) {
> + rcu_sr_put_wait_head(rcu);
> + wait_tail->next = NULL;
> + } else {
> + wait_tail->next = rcu;
> + }
> +
> + break;
> + }
> +
> + rcu_sr_normal_complete(rcu);
> + // It can be last, update a next on this step.
> + wait_tail->next = next;
> +
> + if (++done == SR_MAX_USERS_WAKE_FROM_GP)
> + break;
> + }
>
> // concurrent sr_normal_gp_cleanup work might observe this update.
> smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
>
> - if (wait_tail)
> + if (wait_tail->next)
> queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
>

I'm testing these patches :), and one question is as follows:
Can we use a (WQ_MEM_RECLAIM | WQ_HIGHPRI) type of workqueue to perform
the wake-up actions? That would avoid a kworker creation failure under
memory pressure, which could delay the wake-up action.
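
For illustration only (the workqueue name and the init hook below are
placeholders, not something from the series), such a dedicated workqueue
could be allocated once and then used instead of system_highpri_wq:

<snip>
/* Sketch: a reclaim-safe, high-priority workqueue for the wake-ups. */
static struct workqueue_struct *sr_normal_wq;

static int __init rcu_sr_normal_wq_init(void)
{
	sr_normal_wq = alloc_workqueue("rcu_sr_normal",
				       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
	return sr_normal_wq ? 0 : -ENOMEM;
}
core_initcall(rcu_sr_normal_wq_init);

/* ... then in rcu_sr_normal_gp_cleanup(): */
queue_work(sr_normal_wq, &sr_normal_gp_cleanup);
<snip>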

Thanks
Zqiang



> }
>
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4c35d7d37647..5d8b71a1caec 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -316,6 +316,12 @@ do { \
> __set_current_state(TASK_RUNNING); \
> } while (0)
>
> +/*
> + * A max threshold for synchronize_rcu() users which are
> + * awaken directly by the rcu_gp_kthread(). Left part is
> + * deferred to the main worker.
> + */
> +#define SR_MAX_USERS_WAKE_FROM_GP 5
> #define SR_NORMAL_GP_WAIT_HEAD_MAX 5
>
> struct sr_wait_node {
> --
> 2.39.2
>
>

2024-01-15 10:46:47

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 4/4] rcu: Support direct wake-up of synchronize_rcu() users

Hello, Zqiang.

> >
> > // concurrent sr_normal_gp_cleanup work might observe this update.
> > smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
> >
> > - if (wait_tail)
> > + if (wait_tail->next)
> > queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> >
>
> I'm testing these patches :) , one question is as follows:
> Can we use (WQ_MEM_RECLAIM | WQ_HIGHPR)type of workqueue to perform
> wake-up actions? avoid kworker creation failure under memory pressure, causing
> the wake-up action to be delayed.
>
I do not have any objections to doing that, so we can add it.

Thank you for testing this!

--
Uladzislau Rezki

2024-01-15 10:57:53

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 4/4] rcu: Support direct wake-up of synchronize_rcu() users

> Hello, Zqiang.
>
> > >
> > > // concurrent sr_normal_gp_cleanup work might observe this update.
> > > smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > > ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
> > >
> > > - if (wait_tail)
> > > + if (wait_tail->next)
> > > queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> > >
> >
> > I'm testing these patches :) , one question is as follows:
> > Can we use (WQ_MEM_RECLAIM | WQ_HIGHPR)type of workqueue to perform
> > wake-up actions? avoid kworker creation failure under memory pressure, causing
> > the wake-up action to be delayed.
> >
> I do not have any objections to doing that, so we can add it.
>
> Thank you for testing this!
>
I forgot to ask: does your testing simulate a low-memory condition where
you see the failure you refer to? Or is it just a possible scenario?

Thanks!

--
Uladzislau Rezki


2024-01-15 12:14:57

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] rcu: Add a trace event for synchronize_rcu_normal()

On Sat, Jan 13, 2024 at 12:20:14AM +0100, Frederic Weisbecker wrote:
> Le Thu, Jan 04, 2024 at 05:25:08PM +0100, Uladzislau Rezki (Sony) a écrit :
> > Add an rcu_sr_normal() trace event. It takes three arguments: the
> > first one is the name of the RCU flavour, the second one is a user id
> > which triggers synchronize_rcu_normal() and the last one is an
> > event.
> >
> > There are two trace points in synchronize_rcu_normal(): on entry,
> > when a new request is registered, and on exit, when the request
> > is completed.
> >
> > Please note, CONFIG_RCU_TRACE=y is required to activate traces.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> > ---
> > include/trace/events/rcu.h | 27 +++++++++++++++++++++++++++
> > kernel/rcu/tree.c | 7 ++++++-
> > 2 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
> > index 2ef9c719772a..31b3e0d3e65f 100644
> > --- a/include/trace/events/rcu.h
> > +++ b/include/trace/events/rcu.h
> > @@ -707,6 +707,33 @@ TRACE_EVENT_RCU(rcu_invoke_kfree_bulk_callback,
> > __entry->rcuname, __entry->p, __entry->nr_records)
> > );
> >
> > +/*
> > + * Tracepoint for a normal synchronize_rcu() states. The first argument
> > + * is the RCU flavor, the second argument is a pointer to rcu_head the
> > + * last one is an event.
> > + */
> > +TRACE_EVENT_RCU(rcu_sr_normal,
>
> Can we call this "synchronize_rcu" instead? So people really know what it's
> about.
>
I will update it accordingly!
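
I.e., only the event name would change; a sketch of the updated
definition (the body stays the same as in the posted patch):

<snip>
TRACE_EVENT_RCU(synchronize_rcu,

	TP_PROTO(const char *rcuname, struct rcu_head *rhp, const char *srevent),

	TP_ARGS(rcuname, rhp, srevent),

	TP_STRUCT__entry(
		__field(const char *, rcuname)
		__field(void *, rhp)
		__field(const char *, srevent)
	),

	TP_fast_assign(
		__entry->rcuname = rcuname;
		__entry->rhp = rhp;
		__entry->srevent = srevent;
	),

	TP_printk("%s rhp=0x%p event=%s",
		  __entry->rcuname, __entry->rhp, __entry->srevent)
);
<snip>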

--
Uladzislau Rezki

2024-01-16 06:19:54

by Zqiang

[permalink] [raw]
Subject: Re: [PATCH v4 4/4] rcu: Support direct wake-up of synchronize_rcu() users

>
> > Hello, Zqiang.
> >
> > > >
> > > > // concurrent sr_normal_gp_cleanup work might observe this update.
> > > > smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > > > ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
> > > >
> > > > - if (wait_tail)
> > > > + if (wait_tail->next)
> > > > queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> > > >
> > >
> > > I'm testing these patches :), and one question is as follows:
> > > Can we use a (WQ_MEM_RECLAIM | WQ_HIGHPRI) type of workqueue to perform
> > > the wake-up actions? That would avoid a kworker creation failure under
> > > memory pressure, which could delay the wake-up action.
> > >
> > I do not have any objections to doing that, so we can add it.
> >
> > Thank you for testing this!
> >
> I forgot to ask: does your testing simulate a low-memory condition where
> you see the failure you refer to? Or is it just a possible scenario?
>

I'm not currently testing this feature in low-memory scenarios; I just
thought of it as a possible scenario. I will test it in a low-memory
scenario later and let you know if it happens :).

Thanks
Zqiang

>
> Thanks!

>
> --
> Uladzislau Rezki
>

2024-01-16 16:18:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

On Thu, Jan 04, 2024 at 05:25:07PM +0100, Uladzislau Rezki (Sony) wrote:
> A call to a synchronize_rcu() can be optimized from a latency
> point of view. Workloads which depend on this can benefit of it.
>
> The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> depends on several factors:
>
> - how fast a process of offloading is started. Combination of:
> - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> - other.
> - when started, invoking path is interrupted due to:
> - time limit;
> - need_resched();
> - if limit is reached.
> - where in a nocb list it is located;
> - how fast previous callbacks completed;
>
> Example:
>
> 1. On our embedded devices i can easily trigger the scenario when
> it is a last in the list out of ~3600 callbacks:
>
> <snip>
> <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
> <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
>
> 2. We use cpuset/cgroup to classify tasks and assign them into
> different cgroups. For example "backgrond" group which binds tasks
> only to little CPUs or "foreground" which makes use of all CPUs.
> Tasks can be migrated between groups by a request if an acceleration
> is needed.
>
> See below an example how "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup which
> allows to run only on little cores. In order to speed it up it
> can be temporary moved into "foreground" cgroup which allows
> to use big/all CPUs:
>
> cgroup_attach_task():
> -> cgroup_migrate_execute()
> -> cpuset_can_attach()
> -> percpu_down_write()
> -> rcu_sync_enter()
> -> synchronize_rcu()
> -> now move tasks to the new cgroup.
> -> cgroup_migrate_finish()
>
> <snip>
> rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
> PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
> TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
> <snip>
>
> "Boosting a task" depends on synchronize_rcu() latency:
>
> - first trace shows a completion of synchronize_rcu();
> - second shows attaching a task to a new group;
> - last shows a final step when migration occurs.
>
> 3. To address this drawback, maintain a separate track that consists
> of synchronize_rcu() callers only. After completion of a grace period
> users are deferred to a dedicated worker to process requests.
>
> 4. This patch reduces the latency of synchronize_rcu() approximately
> by ~30-40% on synthetic tests. The real test case, camera launch time,
> shows(time is in milliseconds):
>
> 1-run 542 vs 489 improvement 9%
> 2-run 540 vs 466 improvement 13%
> 3-run 518 vs 468 improvement 9%
> 4-run 531 vs 457 improvement 13%
> 5-run 548 vs 475 improvement 13%
> 6-run 509 vs 484 improvement 4%
>
> Synthetic test(no "noise" from other callbacks):
> Hardware: x86_64 64 CPUs, 64GB of memory
> Linux-6.6
>
> - 10K tasks(simultaneous);
> - each task does(1000 loops)
> synchronize_rcu();
> kfree(p);
>
> default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
> patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.
>
> Running 60K gives approximately same results on my setup. Please note
> it is without any interaction with another type of callbacks, otherwise
> it will impact a lot a default case.
>
> 5. An extra CONFIG_RCU_SR_NORMAL_DEBUG_GP kernel option is added
> which enables additional debugging for detecting a grace period
> incompletion for synchronize_rcu() users. If a GP is not fully
> passed for any user, the warning message is emitted.
>
> 6. By default it is disabled. To enable this perform one of the
> below sequence:
>
> echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>

Again, nice latency reductions!

A few comments and questions below.

Thanx, Paul

> ---
> .../admin-guide/kernel-parameters.txt | 14 ++
> kernel/rcu/Kconfig.debug | 12 ++
> kernel/rcu/tree.c | 138 +++++++++++++++++-
> kernel/rcu/tree_exp.h | 2 +-
> 4 files changed, 164 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 17a454909ab4..2cca75e4f0c6 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5047,6 +5047,20 @@
> delay, memory pressure or callback list growing too
> big.
>
> + rcutree.rcu_normal_wake_from_gp= [KNL]
> + Reduces a latency of synchronize_rcu() call. This approach
> + maintains its own track of synchronize_rcu() callers, so it
> + does not interact with regular callbacks because it does not
> + use a call_rcu[_hurry]() path. Please note, this is for a
> + normal grace period.
> +
> + How to enable it:
> +
> + echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> + or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
> +
> + Default is 0.
> +
> rcuscale.gp_async= [KNL]
> Measure performance of asynchronous
> grace-period primitives such as call_rcu().
> diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
> index 9b0b52e1836f..4812c6249185 100644
> --- a/kernel/rcu/Kconfig.debug
> +++ b/kernel/rcu/Kconfig.debug
> @@ -168,4 +168,16 @@ config RCU_STRICT_GRACE_PERIOD
> when looking for certain types of RCU usage bugs, for example,
> too-short RCU read-side critical sections.
>
> +config RCU_SR_NORMAL_DEBUG_GP
> + bool "Debug synchronize_rcu() callers for a grace period completion"
> + depends on DEBUG_KERNEL && RCU_EXPERT
> + default n
> + help
> + This option enables additional debugging for detecting a grace
> + period incompletion for synchronize_rcu() users. If a GP is not
> + fully passed for any user, the warning message is emitted.
> +
> + Say Y here if you want to enable such debugging
> + Say N if you are unsure.
> +
> endmenu # "RCU Debugging"
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 499803234176..b756c40e4960 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1422,6 +1422,106 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> }
>
> +/*
> + * There are three lists for handling synchronize_rcu() users.
> + * A first list corresponds to new coming users, second for users
> + * which wait for a grace period and third is for which a grace
> + * period is passed.
> + */
> +static struct sr_normal_state {
> + struct llist_head srs_next; /* request a GP users. */
> + struct llist_head srs_wait; /* wait for GP users. */
> + struct llist_head srs_done; /* ready for GP users. */
> +
> + /*
> + * In order to add a batch of nodes to already
> + * existing srs-done-list, a tail of srs-wait-list
> + * is maintained.
> + */
> + struct llist_node *srs_wait_tail;
> +} sr;

Please put this in the rcu_state structure. Having the separate structure
is fine (it does group the fields nicely, plus you can take a pointer
to it in the functions using this state), but it is good to have the
state in one place.

Also, please add the data structures in a separate patch. This might
save someone a lot of time and effort should someone break the kernel
in a way that depends on data-structure size. It would be much easier
for us if their bisection converged on the commit that adds the data
structures instead of the commit that also adds a lot of code.

> +
> +/* Disabled by default. */
> +static int rcu_normal_wake_from_gp;
> +module_param(rcu_normal_wake_from_gp, int, 0644);
> +
> +static void rcu_sr_normal_complete(struct llist_node *node)
> +{
> + struct rcu_synchronize *rs = container_of(
> + (struct rcu_head *) node, struct rcu_synchronize, head);
> + unsigned long oldstate = (unsigned long) rs->head.func;
> +
> + WARN_ONCE(IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP) &&
> + !poll_state_synchronize_rcu(oldstate),
> + "A full grace period is not passed yet: %lu",
> + rcu_seq_diff(get_state_synchronize_rcu(), oldstate));

Good, the false-positive-prone check is now under debug. Or at
least possible, even if not prone.

> + /* Finally. */
> + complete(&rs->completion);
> +}
> +
> +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> +{
> + struct llist_node *done, *rcu, *next;
> +
> + done = llist_del_all(&sr.srs_done);
> + if (!done)
> + return;
> +
> + llist_for_each_safe(rcu, next, done)
> + rcu_sr_normal_complete(rcu);
> +}
> +static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);

Why not put this into the sr_normal_state structure? You can use
__WORK_INITIALIZER() to initialize it, as is done in a number of other
places in the kernel.
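
For example (just a sketch; the new field name is only illustrative),
the work item could be embedded there and initialized statically with
__WORK_INITIALIZER():

<snip>
static struct sr_normal_state {
	struct llist_head srs_next;	/* request a GP users. */
	struct llist_head srs_wait;	/* wait for GP users. */
	struct llist_head srs_done;	/* ready for GP users. */
	struct llist_node *srs_wait_tail;
	struct work_struct srs_cleanup_work;
} sr = {
	.srs_cleanup_work = __WORK_INITIALIZER(sr.srs_cleanup_work,
				rcu_sr_normal_gp_cleanup_work),
};
<snip>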

> +/*
> + * Helper function for rcu_gp_cleanup().
> + */
> +static void rcu_sr_normal_gp_cleanup(void)
> +{
> + struct llist_node *head, *tail;
> +
> + if (llist_empty(&sr.srs_wait))
> + return;
> +
> + tail = READ_ONCE(sr.srs_wait_tail);
> + head = __llist_del_all(&sr.srs_wait);
> +
> + if (head) {
> + /* Can be not empty. */
> + llist_add_batch(head, tail, &sr.srs_done);
> + queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> + }
> +}
> +
> +/*
> + * Helper function for rcu_gp_init().
> + */
> +static void rcu_sr_normal_gp_init(void)
> +{
> + struct llist_node *head, *tail;
> +
> + if (llist_empty(&sr.srs_next))
> + return;
> +
> + tail = llist_del_all(&sr.srs_next);
> + head = llist_reverse_order(tail);

Again, reversing the order is going to cause trouble on large systems.
Let's please not do that. (I could have sworn that this was not present
in the last series...)

> + /*
> + * A waiting list of GP should be empty on this step,
> + * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> + * rolls it over. If not, it is a BUG, warn a user.
> + */
> + WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> +
> + WRITE_ONCE(sr.srs_wait_tail, tail);
> + __llist_add_batch(head, tail, &sr.srs_wait);
> +}
> +
> +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> +{
> + llist_add((struct llist_node *) &rs->head, &sr.srs_next);
> +}
> +
> /*
> * Initialize a new grace period. Return false if no grace period required.
> */
> @@ -1456,6 +1556,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> /* Record GP times before starting GP, hence rcu_seq_start(). */
> rcu_seq_start(&rcu_state.gp_seq);
> ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> + rcu_sr_normal_gp_init();
> trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> raw_spin_unlock_irq_rcu_node(rnp);
> @@ -1825,6 +1926,9 @@ static noinline void rcu_gp_cleanup(void)
> }
> raw_spin_unlock_irq_rcu_node(rnp);
>
> + // Make synchronize_rcu() users aware of the end of old grace period.
> + rcu_sr_normal_gp_cleanup();
> +
> // If strict, make all CPUs aware of the end of the old grace period.
> if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
> on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> @@ -3561,6 +3665,38 @@ static int rcu_blocking_is_gp(void)
> return true;
> }
>
> +/*
> + * Helper function for the synchronize_rcu() API.
> + */
> +static void synchronize_rcu_normal(void)
> +{
> + struct rcu_synchronize rs;
> +
> + if (!READ_ONCE(rcu_normal_wake_from_gp)) {
> + wait_rcu_gp(call_rcu_hurry);
> + return;
> + }
> +
> + init_rcu_head_on_stack(&rs.head);
> + init_completion(&rs.completion);
> +
> + /*
> + * This code might be preempted, therefore take a GP
> + * snapshot before adding a request.
> + */
> + if (IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP))
> + rs.head.func = (void *) get_state_synchronize_rcu();
> +
> + rcu_sr_normal_add_req(&rs);
> +
> + /* Kick a GP and start waiting. */
> + (void) start_poll_synchronize_rcu();

It is unfortunate that the debugging requires an extra timestamp.
The ways I can think of to avoid this have problems, though. If this
thing was replicated per leaf rcu_node structure, the usual approach
would be to protect it with that structure's ->lock.

Thoughts?

> + /* Now we can wait. */
> + wait_for_completion(&rs.completion);
> + destroy_rcu_head_on_stack(&rs.head);
> +}
> +
> /**
> * synchronize_rcu - wait until a grace period has elapsed.
> *
> @@ -3612,7 +3748,7 @@ void synchronize_rcu(void)
> if (rcu_gp_is_expedited())
> synchronize_rcu_expedited();
> else
> - wait_rcu_gp(call_rcu_hurry);
> + synchronize_rcu_normal();
> return;
> }
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 014ddf672165..bdc30d972d32 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -985,7 +985,7 @@ void synchronize_rcu_expedited(void)
>
> /* If expedited grace periods are prohibited, fall back to normal. */
> if (rcu_gp_is_normal()) {
> - wait_rcu_gp(call_rcu_hurry);
> + synchronize_rcu_normal();
> return;
> }
>
> --
> 2.39.2
>

2024-01-16 16:32:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] rcu: Improve handling of synchronize_rcu() users

On Thu, Jan 04, 2024 at 05:25:09PM +0100, Uladzislau Rezki (Sony) wrote:
> From: Neeraj Upadhyay <[email protected]>
>
> Currently, processing of the next batch of rcu_synchronize nodes
> for the new grace period requires doing a llist reversal operation
> to find the tail element of the list. This can be a very costly
> operation (high number of cache misses) for a long list.
>
> To address this, this patch introduces a "dummy-wait-node" entity.
> At every grace period init, a new wait node is added to the llist.
> This wait node is used as wait tail for this new grace period.
>
> This allows lockless additions of new rcu_synchronize nodes in
> rcu_sr_normal_add_req() while the cleanup work executes and makes
> progress. The dummy nodes are removed on the next round of cleanup
> work execution.

OK, now I am reminded that the list-reversal step was removed later.

So never mind that piece of feedback on the first patch!

Thanx, Paul

> Co-developed-by: Uladzislau Rezki (Sony) <[email protected]>
> Signed-off-by: Neeraj Upadhyay <[email protected]>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> ---
> kernel/rcu/tree.c | 271 +++++++++++++++++++++++++++++++++++++++-------
> kernel/rcu/tree.h | 13 +++
> 2 files changed, 244 insertions(+), 40 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 7d2ed89efcb3..88a47a6a658e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1423,23 +1423,157 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> }
>
> /*
> - * There are three lists for handling synchronize_rcu() users.
> - * A first list corresponds to new coming users, second for users
> - * which wait for a grace period and third is for which a grace
> - * period is passed.
> + * There is a single llist, which is used for handling
> + * synchronize_rcu() users' enqueued rcu_synchronize nodes.
> + * Within this llist, there are two tail pointers:
> + *
> + * wait tail: Tracks the set of nodes, which need to
> + * wait for the current GP to complete.
> + * done tail: Tracks the set of nodes, for which grace
> + * period has elapsed. These nodes processing
> + * will be done as part of the cleanup work
> + * execution by a kworker.
> + *
> + * At every grace period init, a new wait node is added
> + * to the llist. This wait node is used as wait tail
> + * for this new grace period. Given that there are a fixed
> + * number of wait nodes, if all wait nodes are in use
> + * (which can happen when kworker callback processing
> + * is delayed) and additional grace period is requested.
> + * This means, a system is slow in processing callbacks.
> + *
> + * TODO: If a slow processing is detected, a first node
> + * in the llist should be used as a wait-tail for this
> + * grace period, therefore users which should wait due
> + * to a slow process are handled by _this_ grace period
> + * and not next.
> + *
> + * Below is an illustration of how the done and wait
> + * tail pointers move from one set of rcu_synchronize nodes
> + * to the other, as grace periods start and finish and
> + * nodes are processed by kworker.
> + *
> + *
> + * a. Initial llist callbacks list:
> + *
> + * +----------+ +--------+ +-------+
> + * | | | | | |
> + * | head |---------> | cb2 |--------->| cb1 |
> + * | | | | | |
> + * +----------+ +--------+ +-------+
> + *
> + *
> + *
> + * b. New GP1 Start:
> + *
> + * WAIT TAIL
> + * |
> + * |
> + * v
> + * +----------+ +--------+ +--------+ +-------+
> + * | | | | | | | |
> + * | head ------> wait |------> cb2 |------> | cb1 |
> + * | | | head1 | | | | |
> + * +----------+ +--------+ +--------+ +-------+
> + *
> + *
> + *
> + * c. GP completion:
> + *
> + * WAIT_TAIL == DONE_TAIL
> + *
> + * DONE TAIL
> + * |
> + * |
> + * v
> + * +----------+ +--------+ +--------+ +-------+
> + * | | | | | | | |
> + * | head ------> wait |------> cb2 |------> | cb1 |
> + * | | | head1 | | | | |
> + * +----------+ +--------+ +--------+ +-------+
> + *
> + *
> + *
> + * d. New callbacks and GP2 start:
> + *
> + * WAIT TAIL DONE TAIL
> + * | |
> + * | |
> + * v v
> + * +----------+ +------+ +------+ +------+ +-----+ +-----+ +-----+
> + * | | | | | | | | | | | | | |
> + * | head ------> wait |--->| cb4 |--->| cb3 |--->|wait |--->| cb2 |--->| cb1 |
> + * | | | head2| | | | | |head1| | | | |
> + * +----------+ +------+ +------+ +------+ +-----+ +-----+ +-----+
> + *
> + *
> + *
> + * e. GP2 completion:
> + *
> + * WAIT_TAIL == DONE_TAIL
> + * DONE TAIL
> + * |
> + * |
> + * v
> + * +----------+ +------+ +------+ +------+ +-----+ +-----+ +-----+
> + * | | | | | | | | | | | | | |
> + * | head ------> wait |--->| cb4 |--->| cb3 |--->|wait |--->| cb2 |--->| cb1 |
> + * | | | head2| | | | | |head1| | | | |
> + * +----------+ +------+ +------+ +------+ +-----+ +-----+ +-----+
> + *
> + *
> + * While the llist state transitions from d to e, a kworker
> + * can start executing rcu_sr_normal_gp_cleanup_work() and
> + * can observe either the old done tail (@c) or the new
> + * done tail (@e). So, done tail updates and reads need
> + * to use the rel-acq semantics. If the concurrent kworker
> + * observes the old done tail, the newly queued work
> + * execution will process the updated done tail. If the
> + * concurrent kworker observes the new done tail, then
> + * the newly queued work will skip processing the done
> + * tail, as workqueue semantics guarantees that the new
> + * work is executed only after the previous one completes.
> + *
> + * f. kworker callbacks processing complete:
> + *
> + *
> + * DONE TAIL
> + * |
> + * |
> + * v
> + * +----------+ +--------+
> + * | | | |
> + * | head ------> wait |
> + * | | | head2 |
> + * +----------+ +--------+
> + *
> */
> -static struct sr_normal_state {
> - struct llist_head srs_next; /* request a GP users. */
> - struct llist_head srs_wait; /* wait for GP users. */
> - struct llist_head srs_done; /* ready for GP users. */
> +static bool rcu_sr_is_wait_head(struct llist_node *node)
> +{
> + return &(rcu_state.srs_wait_nodes)[0].node <= node &&
> + node <= &(rcu_state.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
> +}
>
> - /*
> - * In order to add a batch of nodes to already
> - * existing srs-done-list, a tail of srs-wait-list
> - * is maintained.
> - */
> - struct llist_node *srs_wait_tail;
> -} sr;
> +static struct llist_node *rcu_sr_get_wait_head(void)
> +{
> + struct sr_wait_node *sr_wn;
> + int i;
> +
> + for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
> + sr_wn = &(rcu_state.srs_wait_nodes)[i];
> +
> + if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
> + return &sr_wn->node;
> + }
> +
> + return NULL;
> +}
> +
> +static void rcu_sr_put_wait_head(struct llist_node *node)
> +{
> + struct sr_wait_node *sr_wn = container_of(node, struct sr_wait_node, node);
> + atomic_set_release(&sr_wn->inuse, 0);
> +}
>
> /* Disabled by default. */
> static int rcu_normal_wake_from_gp;
> @@ -1462,14 +1596,44 @@ static void rcu_sr_normal_complete(struct llist_node *node)
>
> static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> {
> - struct llist_node *done, *rcu, *next;
> + struct llist_node *done, *rcu, *next, *head;
>
> - done = llist_del_all(&sr.srs_done);
> + /*
> + * This work execution can potentially execute
> + * while a new done tail is being updated by
> + * grace period kthread in rcu_sr_normal_gp_cleanup().
> + * So, read and updates of done tail need to
> + * follow acq-rel semantics.
> + *
> + * Given that wq semantics guarantees that a single work
> + * cannot execute concurrently by multiple kworkers,
> + * the done tail list manipulations are protected here.
> + */
> + done = smp_load_acquire(&rcu_state.srs_done_tail);
> if (!done)
> return;
>
> - llist_for_each_safe(rcu, next, done)
> - rcu_sr_normal_complete(rcu);
> + WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> + head = done->next;
> + done->next = NULL;
> +
> + /*
> + * The dummy node, which is pointed to by the
> + * done tail which is acq-read above is not removed
> + * here. This allows lockless additions of new
> + * rcu_synchronize nodes in rcu_sr_normal_add_req(),
> + * while the cleanup work executes. The dummy
> + * nodes is removed, in next round of cleanup
> + * work execution.
> + */
> + llist_for_each_safe(rcu, next, head) {
> + if (!rcu_sr_is_wait_head(rcu)) {
> + rcu_sr_normal_complete(rcu);
> + continue;
> + }
> +
> + rcu_sr_put_wait_head(rcu);
> + }
> }
> static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
>
> @@ -1478,48 +1642,61 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
> */
> static void rcu_sr_normal_gp_cleanup(void)
> {
> - struct llist_node *head, *tail;
> + struct llist_node *wait_tail;
>
> - if (llist_empty(&sr.srs_wait))
> + wait_tail = rcu_state.srs_wait_tail;
> + if (wait_tail == NULL)
> return;
>
> - tail = READ_ONCE(sr.srs_wait_tail);
> - head = __llist_del_all(&sr.srs_wait);
> + rcu_state.srs_wait_tail = NULL;
> + ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
>
> - if (head) {
> - /* Can be not empty. */
> - llist_add_batch(head, tail, &sr.srs_done);
> + // concurrent sr_normal_gp_cleanup work might observe this update.
> + smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> + ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
> +
> + if (wait_tail)
> queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> - }
> }
>
> /*
> * Helper function for rcu_gp_init().
> */
> -static void rcu_sr_normal_gp_init(void)
> +static bool rcu_sr_normal_gp_init(void)
> {
> - struct llist_node *head, *tail;
> + struct llist_node *first;
> + struct llist_node *wait_head;
> + bool start_new_poll = false;
>
> - if (llist_empty(&sr.srs_next))
> - return;
> + first = READ_ONCE(rcu_state.srs_next.first);
> + if (!first || rcu_sr_is_wait_head(first))
> + return start_new_poll;
> +
> + wait_head = rcu_sr_get_wait_head();
> + if (!wait_head) {
> + // Kick another GP to retry.
> + start_new_poll = true;
> + return start_new_poll;
> + }
>
> - tail = llist_del_all(&sr.srs_next);
> - head = llist_reverse_order(tail);
> + /* Inject a wait-dummy-node. */
> + llist_add(wait_head, &rcu_state.srs_next);
>
> /*
> - * A waiting list of GP should be empty on this step,
> - * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> + * A waiting list of rcu_synchronize nodes should be empty on
> + * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> * rolls it over. If not, it is a BUG, warn a user.
> */
> - WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> + WARN_ON_ONCE(rcu_state.srs_wait_tail != NULL);
> + rcu_state.srs_wait_tail = wait_head;
> + ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
>
> - WRITE_ONCE(sr.srs_wait_tail, tail);
> - __llist_add_batch(head, tail, &sr.srs_wait);
> + return start_new_poll;
> }
>
> static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> {
> - llist_add((struct llist_node *) &rs->head, &sr.srs_next);
> + llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
> }
>
> /*
> @@ -1532,6 +1709,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> unsigned long mask;
> struct rcu_data *rdp;
> struct rcu_node *rnp = rcu_get_root();
> + bool start_new_poll;
>
> WRITE_ONCE(rcu_state.gp_activity, jiffies);
> raw_spin_lock_irq_rcu_node(rnp);
> @@ -1556,11 +1734,24 @@ static noinline_for_stack bool rcu_gp_init(void)
> /* Record GP times before starting GP, hence rcu_seq_start(). */
> rcu_seq_start(&rcu_state.gp_seq);
> ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> - rcu_sr_normal_gp_init();
> + start_new_poll = rcu_sr_normal_gp_init();
> trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> raw_spin_unlock_irq_rcu_node(rnp);
>
> + /*
> + * The "start_new_poll" is set to true, only when this GP is not able
> + * to handle anything and there are outstanding users. It happens when
> + * the rcu_sr_normal_gp_init() function was not able to insert a dummy
> + * separator to the llist, because there were no left any dummy-nodes.
> + *
> + * Number of dummy-nodes is fixed, it could be that we are run out of
> + * them, if so we start a new pool request to repeat a try. It is rare
> + * and it means that a system is doing a slow processing of callbacks.
> + */
> + if (start_new_poll)
> + (void) start_poll_synchronize_rcu();
> +
> /*
> * Apply per-leaf buffered online and offline operations to
> * the rcu_node tree. Note that this new grace period need not
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index e9821a8422db..4c35d7d37647 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -316,6 +316,13 @@ do { \
> __set_current_state(TASK_RUNNING); \
> } while (0)
>
> +#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
> +
> +struct sr_wait_node {
> + atomic_t inuse;
> + struct llist_node node;
> +};
> +
> /*
> * RCU global state, including node hierarchy. This hierarchy is
> * represented in "heap" form in a dense array. The root (first level)
> @@ -401,6 +408,12 @@ struct rcu_state {
> /* Synchronize offline with */
> /* GP pre-initialization. */
> int nocb_is_setup; /* nocb is setup from boot */
> +
> + /* synchronize_rcu() part. */
> + struct llist_head srs_next; /* request a GP users. */
> + struct llist_node *srs_wait_tail; /* wait for GP users. */
> + struct llist_node *srs_done_tail; /* ready for GP users. */
> + struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
> };
>
> /* Values for rcu_state structure's gp_flags field. */
> --
> 2.39.2
>

2024-01-17 12:26:43

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

> > +/*
> > + * There are three lists for handling synchronize_rcu() users.
> > + * A first list corresponds to new coming users, second for users
> > + * which wait for a grace period and third is for which a grace
> > + * period is passed.
> > + */
> > +static struct sr_normal_state {
> > + struct llist_head srs_next; /* request a GP users. */
> > + struct llist_head srs_wait; /* wait for GP users. */
> > + struct llist_head srs_done; /* ready for GP users. */
> > +
> > + /*
> > + * In order to add a batch of nodes to already
> > + * existing srs-done-list, a tail of srs-wait-list
> > + * is maintained.
> > + */
> > + struct llist_node *srs_wait_tail;
> > +} sr;
>
> Please put this in the rcu_state structure. Having the separate structure
> is fine (it does group the fields nicely, plus you can take a pointer
> to it in the functions using this state), but it is good to have the
> state in one place.
>
> Also, please add the data structures in a separate patch. This might
> save someone a lot of time and effort should someone break the kernel
> in a way that depends on data-structure size. It would be much easier
> for us if their bisection converged on the commit that adds the data
> structures instead of the commit that also adds a lot of code.
>
I put the data under rcu_state in patch 3 of this series, but I can
create a separate patch for this purpose. Should I split it out or not?

> > + /* Finally. */
> > + complete(&rs->completion);
> > +}
> > +
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > + struct llist_node *done, *rcu, *next;
> > +
> > + done = llist_del_all(&sr.srs_done);
> > + if (!done)
> > + return;
> > +
> > + llist_for_each_safe(rcu, next, done)
> > + rcu_sr_normal_complete(rcu);
> > +}
> > +static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
>
> Why not put this into the sr_normal_state structure? You can use
> __WORK_INITIALIZER() to initialize it, as is done in a number of other
> places in the kernel.
>
It is not a big problem. I can move it under "rcu_state" as well!

> > +/*
> > + * Helper function for rcu_gp_cleanup().
> > + */
> > +static void rcu_sr_normal_gp_cleanup(void)
> > +{
> > + struct llist_node *head, *tail;
> > +
> > + if (llist_empty(&sr.srs_wait))
> > + return;
> > +
> > + tail = READ_ONCE(sr.srs_wait_tail);
> > + head = __llist_del_all(&sr.srs_wait);
> > +
> > + if (head) {
> > + /* Can be not empty. */
> > + llist_add_batch(head, tail, &sr.srs_done);
> > + queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> > + }
> > +}
> > +
> > +/*
> > + * Helper function for rcu_gp_init().
> > + */
> > +static void rcu_sr_normal_gp_init(void)
> > +{
> > + struct llist_node *head, *tail;
> > +
> > + if (llist_empty(&sr.srs_next))
> > + return;
> > +
> > + tail = llist_del_all(&sr.srs_next);
> > + head = llist_reverse_order(tail);
>
> Again, reversing the order is going to cause trouble on large systems.
> Let's please not do that. (I could have sworn that this was not present
> in the last series...)
>
> > + /*
> > + * A waiting list of GP should be empty on this step,
> > + * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > + * rolls it over. If not, it is a BUG, warn a user.
> > + */
> > + WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > +
> > + WRITE_ONCE(sr.srs_wait_tail, tail);
> > + __llist_add_batch(head, tail, &sr.srs_wait);
> > +}
> > +
> > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > +{
> > + llist_add((struct llist_node *) &rs->head, &sr.srs_next);
> > +}
> > +
> > /*
> > * Initialize a new grace period. Return false if no grace period required.
> > */
> > @@ -1456,6 +1556,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > /* Record GP times before starting GP, hence rcu_seq_start(). */
> > rcu_seq_start(&rcu_state.gp_seq);
> > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > + rcu_sr_normal_gp_init();
> > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > raw_spin_unlock_irq_rcu_node(rnp);
> > @@ -1825,6 +1926,9 @@ static noinline void rcu_gp_cleanup(void)
> > }
> > raw_spin_unlock_irq_rcu_node(rnp);
> >
> > + // Make synchronize_rcu() users aware of the end of old grace period.
> > + rcu_sr_normal_gp_cleanup();
> > +
> > // If strict, make all CPUs aware of the end of the old grace period.
> > if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
> > on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > @@ -3561,6 +3665,38 @@ static int rcu_blocking_is_gp(void)
> > return true;
> > }
> >
> > +/*
> > + * Helper function for the synchronize_rcu() API.
> > + */
> > +static void synchronize_rcu_normal(void)
> > +{
> > + struct rcu_synchronize rs;
> > +
> > + if (!READ_ONCE(rcu_normal_wake_from_gp)) {
> > + wait_rcu_gp(call_rcu_hurry);
> > + return;
> > + }
> > +
> > + init_rcu_head_on_stack(&rs.head);
> > + init_completion(&rs.completion);
> > +
> > + /*
> > + * This code might be preempted, therefore take a GP
> > + * snapshot before adding a request.
> > + */
> > + if (IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP))
> > + rs.head.func = (void *) get_state_synchronize_rcu();
> > +
> > + rcu_sr_normal_add_req(&rs);
> > +
> > + /* Kick a GP and start waiting. */
> > + (void) start_poll_synchronize_rcu();
>
> It is unfortunate that the debugging requires an extra timestamp.
> The ways I can think of to avoid this have problems, though. If this
> thing was replicated per leaf rcu_node structure, the usual approach
> would be to protect it with that structure's ->lock.
>
Hmm.. a per-node approach can be deployed later. As discussed earlier :)

The debugging part I do not follow; could you please elaborate a bit?
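
Just to be sure we talk about the same thing, by "extra timestamp" I assume
you mean the pre-queue cookie, i.e. something like this (sketch only):

<snip>
	unsigned long snap;

	/* Snapshot taken before the request is queued. */
	snap = get_state_synchronize_rcu();

	/* ... request is queued, a grace period elapses ... */

	/* Debug-only check that a full GP has really passed. */
	WARN_ON_ONCE(!poll_state_synchronize_rcu(snap));
<snip>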

Thanks!

--
Uladzislau Rezki

2024-01-18 10:37:26

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

On Sat, Jan 13, 2024 at 12:09:51AM +0100, Frederic Weisbecker wrote:
> On Thu, Jan 04, 2024 at 05:25:07PM +0100, Uladzislau Rezki (Sony) wrote:
> > diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
> > index 9b0b52e1836f..4812c6249185 100644
> > --- a/kernel/rcu/Kconfig.debug
> > +++ b/kernel/rcu/Kconfig.debug
> > @@ -168,4 +168,16 @@ config RCU_STRICT_GRACE_PERIOD
> > when looking for certain types of RCU usage bugs, for example,
> > too-short RCU read-side critical sections.
> >
> > +config RCU_SR_NORMAL_DEBUG_GP
> > + bool "Debug synchronize_rcu() callers for a grace period completion"
> > + depends on DEBUG_KERNEL && RCU_EXPERT
> > + default n
> > + help
> > + This option enables additional debugging for detecting a grace
> > + period incompletion for synchronize_rcu() users. If a GP is not
> > + fully passed for any user, the warning message is emitted.
> > +
> > + Say Y here if you want to enable such debugging
> > + Say N if you are unsure.
>
> How about just reuse CONFIG_PROVE_RCU instead?
>
The less extra CONFIG_* configuration we have, the better. I do not mind,
so we can reuse it. Thanks for this point :)

Indeed, I see it is already used in some places as a debugging aid.
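
So the check could simply become something like this (a sketch, assuming we
only swap the Kconfig symbol and keep the rest as is):

<snip>
	WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) &&
		!poll_state_synchronize_rcu(oldstate),
		"A full grace period is not passed yet: %lu",
		rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
<snip>

and the same IS_ENABLED(CONFIG_PROVE_RCU) condition would guard taking the
snapshot in synchronize_rcu_normal().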

> > +
> > endmenu # "RCU Debugging"
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 499803234176..b756c40e4960 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1422,6 +1422,106 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > }
> >
> > +/*
> > + * There are three lists for handling synchronize_rcu() users.
> > + * A first list corresponds to new coming users, second for users
> > + * which wait for a grace period and third is for which a grace
> > + * period is passed.
> > + */
> > +static struct sr_normal_state {
> > + struct llist_head srs_next; /* request a GP users. */
> > + struct llist_head srs_wait; /* wait for GP users. */
> > + struct llist_head srs_done; /* ready for GP users. */
> > +
> > + /*
> > + * In order to add a batch of nodes to already
> > + * existing srs-done-list, a tail of srs-wait-list
> > + * is maintained.
> > + */
> > + struct llist_node *srs_wait_tail;
> > +} sr;
>
> "sr" is good enough for a function scope variable but not for a file scope one.
>
> At least "sr_state" would be better. Or maybe you don't even need to name that
> struct and make instead:
>
> struct {
> ...
> ...
> } sr_normal_state;
>
It is moved under the "rcu_state" structure by a later patch in this series.

>
> > +
> > +/* Disabled by default. */
> > +static int rcu_normal_wake_from_gp;
> > +module_param(rcu_normal_wake_from_gp, int, 0644);
> > +
> > +static void rcu_sr_normal_complete(struct llist_node *node)
> > +{
> > + struct rcu_synchronize *rs = container_of(
> > + (struct rcu_head *) node, struct rcu_synchronize, head);
>
> Should there be some union in struct rcu_synchronize between struct rcu_head
> and struct llist_node?
>
> Anyway it's stack allocated, they could even be separate fields.
>
> > + unsigned long oldstate = (unsigned long) rs->head.func;
>
> Luckily struct callback_head layout allows such magic but if rcu_head
> and llist_node were separate, reviewers would be less hurt.
>
> If stack space really matters, something like the below?
>
> struct rcu_synchronize {
> union {
> struct rcu_head head;
> struct {
> struct llist_node node;
> unsigned long seq;
> }
> }
> struct completion completion;
> };
>
>
We can do that. I am not sure whether it should be a separate patch or part
of the bigger change; I tend to keep it separate.
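
If we go with your layout, the completion path could then drop the casts,
roughly like this (field names "node" and "seq" taken from your sketch above):

<snip>
static void rcu_sr_normal_complete(struct llist_node *node)
{
	struct rcu_synchronize *rs = container_of(node,
			struct rcu_synchronize, node);
	unsigned long oldstate = rs->seq;

	WARN_ONCE(IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP) &&
		!poll_state_synchronize_rcu(oldstate),
		"A full grace period is not passed yet: %lu",
		rcu_seq_diff(get_state_synchronize_rcu(), oldstate));

	/* Finally. */
	complete(&rs->completion);
}
<snip>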

> > +
> > + WARN_ONCE(IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP) &&
> > + !poll_state_synchronize_rcu(oldstate),
> > + "A full grace period is not passed yet: %lu",
> > + rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
> > +
> > + /* Finally. */
> > + complete(&rs->completion);
> > +}
> > +
> [...]
> > +
> > +/*
> > + * Helper function for rcu_gp_cleanup().
> > + */
> > +static void rcu_sr_normal_gp_cleanup(void)
> > +{
> > + struct llist_node *head, *tail;
> > +
> > + if (llist_empty(&sr.srs_wait))
> > + return;
> > +
> > + tail = READ_ONCE(sr.srs_wait_tail);
>
> Is the READ_ONCE() needed?
>
> A part from those boring details:
>
> Reviewed-by: Frederic Weisbecker <[email protected]>
>
I appreciate the review. I will address all the comments.

Thanks!

--
Uladzislau Rezki

2024-01-19 15:24:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

On Wed, Jan 17, 2024 at 01:26:19PM +0100, Uladzislau Rezki wrote:
> > > +/*
> > > + * There are three lists for handling synchronize_rcu() users.
> > > + * A first list corresponds to new coming users, second for users
> > > + * which wait for a grace period and third is for which a grace
> > > + * period is passed.
> > > + */
> > > +static struct sr_normal_state {
> > > + struct llist_head srs_next; /* request a GP users. */
> > > + struct llist_head srs_wait; /* wait for GP users. */
> > > + struct llist_head srs_done; /* ready for GP users. */
> > > +
> > > + /*
> > > + * In order to add a batch of nodes to already
> > > + * existing srs-done-list, a tail of srs-wait-list
> > > + * is maintained.
> > > + */
> > > + struct llist_node *srs_wait_tail;
> > > +} sr;
> >
> > Please put this in the rcu_state structure. Having the separate structure
> > is fine (it does group the fields nicely, plus you can take a pointer
> > to it in the functions using this state), but it is good to have the
> > state in one place.
> >
> > Also, please add the data structures in a separate patch. This might
> > save someone a lot of time and effort should someone breaks the kernel
> > in a way that depends on data-structure size. It would be much easier
> > for us if their bisection converged on the commit that adds the data
> > structures instead of the commit that also adds a lot of code.
> >
> I put the data under rcu_state in the patch-3 in this series. But i can
> create a separate patch for this purpose. Should i split it or not?

Bisection is best if the data-structure changes come first, keeping in
mind the example where the change in data size triggers some unrelated
bug. Better to have that bisection converge on a data-structure only
commit than on a more complex commit.

So it would be much better if the data started out in rcu_state.

> > > + /* Finally. */
> > > + complete(&rs->completion);
> > > +}
> > > +
> > > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > +{
> > > + struct llist_node *done, *rcu, *next;
> > > +
> > > + done = llist_del_all(&sr.srs_done);
> > > + if (!done)
> > > + return;
> > > +
> > > + llist_for_each_safe(rcu, next, done)
> > > + rcu_sr_normal_complete(rcu);
> > > +}
> > > +static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
> >
> > Why not put this into the sr_normal_state structure? You can use
> > __WORK_INITIALIZER() to initialize it, as is done in a number of other
> > places in the kernel.
> >
> It is not a big problem. I can move it under "rcu_state" also!

Very good, thank you!

> > > +/*
> > > + * Helper function for rcu_gp_cleanup().
> > > + */
> > > +static void rcu_sr_normal_gp_cleanup(void)
> > > +{
> > > + struct llist_node *head, *tail;
> > > +
> > > + if (llist_empty(&sr.srs_wait))
> > > + return;
> > > +
> > > + tail = READ_ONCE(sr.srs_wait_tail);
> > > + head = __llist_del_all(&sr.srs_wait);
> > > +
> > > + if (head) {
> > > + /* Can be not empty. */
> > > + llist_add_batch(head, tail, &sr.srs_done);
> > > + queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * Helper function for rcu_gp_init().
> > > + */
> > > +static void rcu_sr_normal_gp_init(void)
> > > +{
> > > + struct llist_node *head, *tail;
> > > +
> > > + if (llist_empty(&sr.srs_next))
> > > + return;
> > > +
> > > + tail = llist_del_all(&sr.srs_next);
> > > + head = llist_reverse_order(tail);
> >
> > Again, reversing the order is going to cause trouble on large systems.
> > Let's please not do that. (I could have sworn that this was not present
> > in the last series...)
> >
> > > + /*
> > > + * A waiting list of GP should be empty on this step,
> > > + * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > + * rolls it over. If not, it is a BUG, warn a user.
> > > + */
> > > + WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > +
> > > + WRITE_ONCE(sr.srs_wait_tail, tail);
> > > + __llist_add_batch(head, tail, &sr.srs_wait);
> > > +}
> > > +
> > > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > +{
> > > + llist_add((struct llist_node *) &rs->head, &sr.srs_next);
> > > +}
> > > +
> > > /*
> > > * Initialize a new grace period. Return false if no grace period required.
> > > */
> > > @@ -1456,6 +1556,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > /* Record GP times before starting GP, hence rcu_seq_start(). */
> > > rcu_seq_start(&rcu_state.gp_seq);
> > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > + rcu_sr_normal_gp_init();
> > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > > raw_spin_unlock_irq_rcu_node(rnp);
> > > @@ -1825,6 +1926,9 @@ static noinline void rcu_gp_cleanup(void)
> > > }
> > > raw_spin_unlock_irq_rcu_node(rnp);
> > >
> > > + // Make synchronize_rcu() users aware of the end of old grace period.
> > > + rcu_sr_normal_gp_cleanup();
> > > +
> > > // If strict, make all CPUs aware of the end of the old grace period.
> > > if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
> > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > @@ -3561,6 +3665,38 @@ static int rcu_blocking_is_gp(void)
> > > return true;
> > > }
> > >
> > > +/*
> > > + * Helper function for the synchronize_rcu() API.
> > > + */
> > > +static void synchronize_rcu_normal(void)
> > > +{
> > > + struct rcu_synchronize rs;
> > > +
> > > + if (!READ_ONCE(rcu_normal_wake_from_gp)) {
> > > + wait_rcu_gp(call_rcu_hurry);
> > > + return;
> > > + }
> > > +
> > > + init_rcu_head_on_stack(&rs.head);
> > > + init_completion(&rs.completion);
> > > +
> > > + /*
> > > + * This code might be preempted, therefore take a GP
> > > + * snapshot before adding a request.
> > > + */
> > > + if (IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP))
> > > + rs.head.func = (void *) get_state_synchronize_rcu();
> > > +
> > > + rcu_sr_normal_add_req(&rs);
> > > +
> > > + /* Kick a GP and start waiting. */
> > > + (void) start_poll_synchronize_rcu();
> >
> > It is unfortunate that the debugging requires an extra timestamp.
> > The ways I can think of to avoid this have problems, though. If this
> > thing was replicated per leaf rcu_node structure, the usual approach
> > would be to protect it with that structure's ->lock.
> >
> Hmm.. a per-node approach can be deployed later. As discussed earlier :)

Agreed!

> Debugging part i do not follow, could you please elaborate a bit?

Let's not worry about this unless and until we need per-rcu_node lists
of tasks waiting on grace periods. At that point, we will know more
and things will be more clear.

Thanx, Paul

2024-01-22 18:10:56

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

On Fri, Jan 19, 2024 at 07:24:28AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 17, 2024 at 01:26:19PM +0100, Uladzislau Rezki wrote:
> > > > +/*
> > > > + * There are three lists for handling synchronize_rcu() users.
> > > > + * A first list corresponds to new coming users, second for users
> > > > + * which wait for a grace period and third is for which a grace
> > > > + * period is passed.
> > > > + */
> > > > +static struct sr_normal_state {
> > > > + struct llist_head srs_next; /* request a GP users. */
> > > > + struct llist_head srs_wait; /* wait for GP users. */
> > > > + struct llist_head srs_done; /* ready for GP users. */
> > > > +
> > > > + /*
> > > > + * In order to add a batch of nodes to already
> > > > + * existing srs-done-list, a tail of srs-wait-list
> > > > + * is maintained.
> > > > + */
> > > > + struct llist_node *srs_wait_tail;
> > > > +} sr;
> > >
> > > Please put this in the rcu_state structure. Having the separate structure
> > > is fine (it does group the fields nicely, plus you can take a pointer
> > > to it in the functions using this state), but it is good to have the
> > > state in one place.
> > >
> > > Also, please add the data structures in a separate patch. This might
> > > save someone a lot of time and effort should someone breaks the kernel
> > > in a way that depends on data-structure size. It would be much easier
> > > for us if their bisection converged on the commit that adds the data
> > > structures instead of the commit that also adds a lot of code.
> > >
> > I put the data under rcu_state in the patch-3 in this series. But i can
> > create a separate patch for this purpose. Should i split it or not?
>
> Bisection is best if the data-structure changes come first, keeping in
> mind the example where the change in data size triggers some unrelated
> bug. Better to have that bisection converge on a data-structure only
> commit than on a more complex commit.
>
> So it would be much better if the data started out in rcu_state.
>
OK. Then I will also combine the two patches into one:

rcu: Improve handling of synchronize_rcu() users
rcu: Reduce synchronize_rcu() latency

to reduce the mess.

> > > > + /* Finally. */
> > > > + complete(&rs->completion);
> > > > +}
> > > > +
> > > > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > +{
> > > > + struct llist_node *done, *rcu, *next;
> > > > +
> > > > + done = llist_del_all(&sr.srs_done);
> > > > + if (!done)
> > > > + return;
> > > > +
> > > > + llist_for_each_safe(rcu, next, done)
> > > > + rcu_sr_normal_complete(rcu);
> > > > +}
> > > > +static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
> > >
> > > Why not put this into the sr_normal_state structure? You can use
> > > __WORK_INITIALIZER() to initialize it, as is done in a number of other
> > > places in the kernel.
> > >
> > It is not a big problem. I can move it under "rcu_state" also!
>
> Very good, thank you!
>
> > > > +/*
> > > > + * Helper function for rcu_gp_cleanup().
> > > > + */
> > > > +static void rcu_sr_normal_gp_cleanup(void)
> > > > +{
> > > > + struct llist_node *head, *tail;
> > > > +
> > > > + if (llist_empty(&sr.srs_wait))
> > > > + return;
> > > > +
> > > > + tail = READ_ONCE(sr.srs_wait_tail);
> > > > + head = __llist_del_all(&sr.srs_wait);
> > > > +
> > > > + if (head) {
> > > > + /* Can be not empty. */
> > > > + llist_add_batch(head, tail, &sr.srs_done);
> > > > + queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> > > > + }
> > > > +}
> > > > +
> > > > +/*
> > > > + * Helper function for rcu_gp_init().
> > > > + */
> > > > +static void rcu_sr_normal_gp_init(void)
> > > > +{
> > > > + struct llist_node *head, *tail;
> > > > +
> > > > + if (llist_empty(&sr.srs_next))
> > > > + return;
> > > > +
> > > > + tail = llist_del_all(&sr.srs_next);
> > > > + head = llist_reverse_order(tail);
> > >
> > > Again, reversing the order is going to cause trouble on large systems.
> > > Let's please not do that. (I could have sworn that this was not present
> > > in the last series...)
> > >
> > > > + /*
> > > > + * A waiting list of GP should be empty on this step,
> > > > + * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > > + * rolls it over. If not, it is a BUG, warn a user.
> > > > + */
> > > > + WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > > +
> > > > + WRITE_ONCE(sr.srs_wait_tail, tail);
> > > > + __llist_add_batch(head, tail, &sr.srs_wait);
> > > > +}
> > > > +
> > > > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > > +{
> > > > + llist_add((struct llist_node *) &rs->head, &sr.srs_next);
> > > > +}
> > > > +
> > > > /*
> > > > * Initialize a new grace period. Return false if no grace period required.
> > > > */
> > > > @@ -1456,6 +1556,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > > /* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > rcu_seq_start(&rcu_state.gp_seq);
> > > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > + rcu_sr_normal_gp_init();
> > > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > > > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > > > raw_spin_unlock_irq_rcu_node(rnp);
> > > > @@ -1825,6 +1926,9 @@ static noinline void rcu_gp_cleanup(void)
> > > > }
> > > > raw_spin_unlock_irq_rcu_node(rnp);
> > > >
> > > > + // Make synchronize_rcu() users aware of the end of old grace period.
> > > > + rcu_sr_normal_gp_cleanup();
> > > > +
> > > > // If strict, make all CPUs aware of the end of the old grace period.
> > > > if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
> > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > > @@ -3561,6 +3665,38 @@ static int rcu_blocking_is_gp(void)
> > > > return true;
> > > > }
> > > >
> > > > +/*
> > > > + * Helper function for the synchronize_rcu() API.
> > > > + */
> > > > +static void synchronize_rcu_normal(void)
> > > > +{
> > > > + struct rcu_synchronize rs;
> > > > +
> > > > + if (!READ_ONCE(rcu_normal_wake_from_gp)) {
> > > > + wait_rcu_gp(call_rcu_hurry);
> > > > + return;
> > > > + }
> > > > +
> > > > + init_rcu_head_on_stack(&rs.head);
> > > > + init_completion(&rs.completion);
> > > > +
> > > > + /*
> > > > + * This code might be preempted, therefore take a GP
> > > > + * snapshot before adding a request.
> > > > + */
> > > > + if (IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP))
> > > > + rs.head.func = (void *) get_state_synchronize_rcu();
> > > > +
> > > > + rcu_sr_normal_add_req(&rs);
> > > > +
> > > > + /* Kick a GP and start waiting. */
> > > > + (void) start_poll_synchronize_rcu();
> > >
> > > It is unfortunate that the debugging requires an extra timestamp.
> > > The ways I can think of to avoid this have problems, though. If this
> > > thing was replicated per leaf rcu_node structure, the usual approach
> > > would be to protect it with that structure's ->lock.
> > >
> > Hmm.. a per-node approach can be deployed later. As discussed earlier :)
>
> Agreed!
>
> > Debugging part i do not follow, could you please elaborate a bit?
>
> Let's not worry about this unless and until we need per-rcu_node lists
> of tasks waiting on grace periods. At that point, we will know more
> and things will be more clear.
>
Good and thank you :)

--
Uladzislau Rezki

2024-01-23 11:21:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency

On Mon, Jan 22, 2024 at 06:35:44PM +0100, Uladzislau Rezki wrote:
> On Fri, Jan 19, 2024 at 07:24:28AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 17, 2024 at 01:26:19PM +0100, Uladzislau Rezki wrote:
> > > > > +/*
> > > > > + * There are three lists for handling synchronize_rcu() users.
> > > > > + * A first list corresponds to new coming users, second for users
> > > > > + * which wait for a grace period and third is for which a grace
> > > > > + * period is passed.
> > > > > + */
> > > > > +static struct sr_normal_state {
> > > > > + struct llist_head srs_next; /* request a GP users. */
> > > > > + struct llist_head srs_wait; /* wait for GP users. */
> > > > > + struct llist_head srs_done; /* ready for GP users. */
> > > > > +
> > > > > + /*
> > > > > + * In order to add a batch of nodes to already
> > > > > + * existing srs-done-list, a tail of srs-wait-list
> > > > > + * is maintained.
> > > > > + */
> > > > > + struct llist_node *srs_wait_tail;
> > > > > +} sr;
> > > >
> > > > Please put this in the rcu_state structure. Having the separate structure
> > > > is fine (it does group the fields nicely, plus you can take a pointer
> > > > to it in the functions using this state), but it is good to have the
> > > > state in one place.
> > > >
> > > > Also, please add the data structures in a separate patch. This might
> > > > save someone a lot of time and effort should someone breaks the kernel
> > > > in a way that depends on data-structure size. It would be much easier
> > > > for us if their bisection converged on the commit that adds the data
> > > > structures instead of the commit that also adds a lot of code.
> > > >
> > > I put the data under rcu_state in the patch-3 in this series. But i can
> > > create a separate patch for this purpose. Should i split it or not?
> >
> > Bisection is best if the data-structure changes come first, keeping in
> > mind the example where the change in data size triggers some unrelated
> > bug. Better to have that bisection converge on a data-structure only
> > commit than on a more complex commit.
> >
> > So it would be much better if the data started out in rcu_state.
> >
> OK. Then i will also combine two patches into one:
>
> rcu: Improve handling of synchronize_rcu() users
> rcu: Reduce synchronize_rcu() latency
>
> to reduce the mess.

That sounds like an excellent next step, thank you!

Thanx, Paul

> > > > > + /* Finally. */
> > > > > + complete(&rs->completion);
> > > > > +}
> > > > > +
> > > > > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > > +{
> > > > > + struct llist_node *done, *rcu, *next;
> > > > > +
> > > > > + done = llist_del_all(&sr.srs_done);
> > > > > + if (!done)
> > > > > + return;
> > > > > +
> > > > > + llist_for_each_safe(rcu, next, done)
> > > > > + rcu_sr_normal_complete(rcu);
> > > > > +}
> > > > > +static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
> > > >
> > > > Why not put this into the sr_normal_state structure? You can use
> > > > __WORK_INITIALIZER() to initialize it, as is done in a number of other
> > > > places in the kernel.
> > > >
> > > It is not a big problem. I can move it under "rcu_state" also!
> >
> > Very good, thank you!
> >
> > > > > +/*
> > > > > + * Helper function for rcu_gp_cleanup().
> > > > > + */
> > > > > +static void rcu_sr_normal_gp_cleanup(void)
> > > > > +{
> > > > > + struct llist_node *head, *tail;
> > > > > +
> > > > > + if (llist_empty(&sr.srs_wait))
> > > > > + return;
> > > > > +
> > > > > + tail = READ_ONCE(sr.srs_wait_tail);
> > > > > + head = __llist_del_all(&sr.srs_wait);
> > > > > +
> > > > > + if (head) {
> > > > > + /* Can be not empty. */
> > > > > + llist_add_batch(head, tail, &sr.srs_done);
> > > > > + queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> > > > > + }
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Helper function for rcu_gp_init().
> > > > > + */
> > > > > +static void rcu_sr_normal_gp_init(void)
> > > > > +{
> > > > > + struct llist_node *head, *tail;
> > > > > +
> > > > > + if (llist_empty(&sr.srs_next))
> > > > > + return;
> > > > > +
> > > > > + tail = llist_del_all(&sr.srs_next);
> > > > > + head = llist_reverse_order(tail);
> > > >
> > > > Again, reversing the order is going to cause trouble on large systems.
> > > > Let's please not do that. (I could have sworn that this was not present
> > > > in the last series...)
> > > >
> > > > > + /*
> > > > > + * A waiting list of GP should be empty on this step,
> > > > > + * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > > > + * rolls it over. If not, it is a BUG, warn a user.
> > > > > + */
> > > > > + WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > > > +
> > > > > + WRITE_ONCE(sr.srs_wait_tail, tail);
> > > > > + __llist_add_batch(head, tail, &sr.srs_wait);
> > > > > +}
> > > > > +
> > > > > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > > > +{
> > > > > + llist_add((struct llist_node *) &rs->head, &sr.srs_next);
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Initialize a new grace period. Return false if no grace period required.
> > > > > */
> > > > > @@ -1456,6 +1556,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > > > /* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > rcu_seq_start(&rcu_state.gp_seq);
> > > > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > + rcu_sr_normal_gp_init();
> > > > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > > > > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > > > > raw_spin_unlock_irq_rcu_node(rnp);
> > > > > @@ -1825,6 +1926,9 @@ static noinline void rcu_gp_cleanup(void)
> > > > > }
> > > > > raw_spin_unlock_irq_rcu_node(rnp);
> > > > >
> > > > > + // Make synchronize_rcu() users aware of the end of old grace period.
> > > > > + rcu_sr_normal_gp_cleanup();
> > > > > +
> > > > > // If strict, make all CPUs aware of the end of the old grace period.
> > > > > if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
> > > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > > > @@ -3561,6 +3665,38 @@ static int rcu_blocking_is_gp(void)
> > > > > return true;
> > > > > }
> > > > >
> > > > > +/*
> > > > > + * Helper function for the synchronize_rcu() API.
> > > > > + */
> > > > > +static void synchronize_rcu_normal(void)
> > > > > +{
> > > > > + struct rcu_synchronize rs;
> > > > > +
> > > > > + if (!READ_ONCE(rcu_normal_wake_from_gp)) {
> > > > > + wait_rcu_gp(call_rcu_hurry);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + init_rcu_head_on_stack(&rs.head);
> > > > > + init_completion(&rs.completion);
> > > > > +
> > > > > + /*
> > > > > + * This code might be preempted, therefore take a GP
> > > > > + * snapshot before adding a request.
> > > > > + */
> > > > > + if (IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP))
> > > > > + rs.head.func = (void *) get_state_synchronize_rcu();
> > > > > +
> > > > > + rcu_sr_normal_add_req(&rs);
> > > > > +
> > > > > + /* Kick a GP and start waiting. */
> > > > > + (void) start_poll_synchronize_rcu();
> > > >
> > > > It is unfortunate that the debugging requires an extra timestamp.
> > > > The ways I can think of to avoid this have problems, though. If this
> > > > thing was replicated per leaf rcu_node structure, the usual approach
> > > > would be to protect it with that structure's ->lock.
> > > >
> > > Hmm.. a per-node approach can be deployed later. As discussed earlier :)
> >
> > Agreed!
> >
> > > Debugging part i do not follow, could you please elaborate a bit?
> >
> > Let's not worry about this unless and until we need per-rcu_node lists
> > of tasks waiting on grace periods. At that point, we will know more
> > and things will be more clear.
> >
> Good and thank you :)
>
> --
> Uladzislau Rezki

2024-01-27 07:07:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] Reduce synchronize_rcu() latency(v4)

On Thu, Jan 04, 2024 at 05:25:06PM +0100, Uladzislau Rezki (Sony) wrote:
> This is a v4 that tends to improve synchronize_rcu() call. To be more
> specific it is about reducing a waiting time(especially worst cases)
> of caller that blocks until a grace period is elapsed.
>
> In general, this series separates synchronize_rcu() callers from other
> callbacks. We keep a dedicated an independent queue, thus the processing
> of it starts as soon as grace period is over, so there is no need to wait
> until other callbacks are processed one by one. Please note, a number of
> callbacks can be 10K, 20K, 60K and so on. That is why this series maintain
> a separate track for this call that blocks a context.

And before I forget (again), a possible follow-on to this work is to
reduce cond_synchronize_rcu() and cond_synchronize_rcu_full() latency.
Right now, these wait for a full additional grace period (and maybe
more) when the required grace period has not elapsed. In contrast,
this work might enable waiting only for the needed portion of a grace
period to elapse.
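
For reference, the current cond_synchronize_rcu() behavior is roughly the
following (simplified sketch; cond_synchronize_rcu_full() is analogous with
the _full flavors of the polling API):

<snip>
void cond_synchronize_rcu(unsigned long oldstate)
{
	/* Not elapsed yet: fall back to a full synchronize_rcu(). */
	if (!poll_state_synchronize_rcu(oldstate))
		synchronize_rcu();
}
<snip>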

Thanx, Paul

> v3 -> v4:
> - Squash patches;
> - Add more description;
> - Fix comments based on v3 feedback.
>
> v3: https://lore.kernel.org/lkml/cd45b0b5-f86b-43fb-a5f3-47d340cd4f9f@paulmck-laptop/T/
> v2: https://lore.kernel.org/all/[email protected]/T/
> v1: https://lore.kernel.org/lkml/[email protected]/T/
>
> Neeraj Upadhyay (1):
> rcu: Improve handling of synchronize_rcu() users
>
> Uladzislau Rezki (Sony) (3):
> rcu: Reduce synchronize_rcu() latency
> rcu: Add a trace event for synchronize_rcu_normal()
> rcu: Support direct wake-up of synchronize_rcu() users
>
> .../admin-guide/kernel-parameters.txt | 14 +
> include/trace/events/rcu.h | 27 ++
> kernel/rcu/Kconfig.debug | 12 +
> kernel/rcu/tree.c | 361 +++++++++++++++++-
> kernel/rcu/tree.h | 19 +
> kernel/rcu/tree_exp.h | 2 +-
> 6 files changed, 433 insertions(+), 2 deletions(-)
>
> --
> 2.39.2
>

2024-01-29 16:37:51

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] Reduce synchronize_rcu() latency(v4)

On Fri, Jan 26, 2024 at 11:07:18PM -0800, Paul E. McKenney wrote:
> On Thu, Jan 04, 2024 at 05:25:06PM +0100, Uladzislau Rezki (Sony) wrote:
> > This is a v4 that tends to improve synchronize_rcu() call. To be more
> > specific it is about reducing a waiting time(especially worst cases)
> > of caller that blocks until a grace period is elapsed.
> >
> > In general, this series separates synchronize_rcu() callers from other
> > callbacks. We keep a dedicated an independent queue, thus the processing
> > of it starts as soon as grace period is over, so there is no need to wait
> > until other callbacks are processed one by one. Please note, a number of
> > callbacks can be 10K, 20K, 60K and so on. That is why this series maintain
> > a separate track for this call that blocks a context.
>
> And before I forget (again), a possible follow-on to this work is to
> reduce cond_synchronize_rcu() and cond_synchronize_rcu_full() latency.
> Right now, these wait for a full additional grace period (and maybe
> more) when the required grace period has not elapsed. In contrast,
> this work might enable waiting only for the needed portion of a grace
> period to elapse.
>
Thanks, I see it. Probably we also need to move the "sync"-related
functionality out of tree.c into a separate sync.c file, or something
similarly named, IMO.

Thanks!

--
Uladzislau Rezki

2024-01-29 19:44:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] Reduce synchronize_rcu() latency(v4)

On Mon, Jan 29, 2024 at 05:23:01PM +0100, Uladzislau Rezki wrote:
> On Fri, Jan 26, 2024 at 11:07:18PM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 04, 2024 at 05:25:06PM +0100, Uladzislau Rezki (Sony) wrote:
> > > This is a v4 that tends to improve synchronize_rcu() call. To be more
> > > specific it is about reducing a waiting time(especially worst cases)
> > > of caller that blocks until a grace period is elapsed.
> > >
> > > In general, this series separates synchronize_rcu() callers from other
> > > callbacks. We keep a dedicated an independent queue, thus the processing
> > > of it starts as soon as grace period is over, so there is no need to wait
> > > until other callbacks are processed one by one. Please note, a number of
> > > callbacks can be 10K, 20K, 60K and so on. That is why this series maintain
> > > a separate track for this call that blocks a context.
> >
> > And before I forget (again), a possible follow-on to this work is to
> > reduce cond_synchronize_rcu() and cond_synchronize_rcu_full() latency.
> > Right now, these wait for a full additional grace period (and maybe
> > more) when the required grace period has not elapsed. In contrast,
> > this work might enable waiting only for the needed portion of a grace
> > period to elapse.
> >
> Thanks. I see it. Probably we also need to move "sync" related
> functionality out of tree.c file to the sync.c or something similar
> to that name. IMO.

I would prioritize moving the kfree_rcu() code out of tree.c quite
a ways over moving out the synchronous-wait code. ;-)

Thanx, Paul

2024-01-29 20:36:55

by Uladzislau Rezki (Sony)

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] Reduce synchronize_rcu() latency(v4)

On Mon, Jan 29, 2024 at 11:43:43AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 29, 2024 at 05:23:01PM +0100, Uladzislau Rezki wrote:
> > On Fri, Jan 26, 2024 at 11:07:18PM -0800, Paul E. McKenney wrote:
> > > On Thu, Jan 04, 2024 at 05:25:06PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > This is a v4 that tends to improve synchronize_rcu() call. To be more
> > > > specific it is about reducing a waiting time(especially worst cases)
> > > > of caller that blocks until a grace period is elapsed.
> > > >
> > > > In general, this series separates synchronize_rcu() callers from other
> > > > callbacks. We keep a dedicated an independent queue, thus the processing
> > > > of it starts as soon as grace period is over, so there is no need to wait
> > > > until other callbacks are processed one by one. Please note, a number of
> > > > callbacks can be 10K, 20K, 60K and so on. That is why this series maintain
> > > > a separate track for this call that blocks a context.
> > >
> > > And before I forget (again), a possible follow-on to this work is to
> > > reduce cond_synchronize_rcu() and cond_synchronize_rcu_full() latency.
> > > Right now, these wait for a full additional grace period (and maybe
> > > more) when the required grace period has not elapsed. In contrast,
> > > this work might enable waiting only for the needed portion of a grace
> > > period to elapse.
> > >
> > Thanks. I see it. Probably we also need to move "sync" related
> > functionality out of tree.c file to the sync.c or something similar
> > to that name. IMO.
>
> I would prioritize moving the kfree_rcu() code out of tree.c quite
> a ways over moving out the synchronous-wait code. ;-)
>
Indeed. But I was not talking about priorities :)

--
Uladzislau Rezki