2024-06-04 22:23:57

by Paul E. McKenney

Subject: [PATCH rcu 0/9] Miscellaneous fixes for v6.11

Hello!

This series provides miscellaneous fixes:

1. Add lockdep_assert_in_rcu_read_lock() and friends.

2. Reduce synchronize_rcu() delays when all wait heads are in use,
courtesy of Neeraj Upadhyay.

3. rcu/tree: Reduce wake up for synchronize_rcu() common case,
courtesy of "Joel Fernandes (Google)".

4. Disable interrupts directly in rcu_gp_init().

5. Disable interrupts directly in srcu_gp_end().

6. Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter.

7. MAINTAINERS: Add Uladzislau Rezki as RCU maintainer.

8. Eliminate lockless accesses to rcu_sync->gp_count, courtesy of
Oleg Nesterov.

9. Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation, courtesy
of Frederic Weisbecker.

Thanx, Paul

------------------------------------------------------------------------

b/Documentation/admin-guide/kernel-parameters.txt | 8 ++
b/MAINTAINERS | 1
b/include/linux/rcupdate.h | 60 +++++++++++++++++++++
b/kernel/rcu/srcutree.c | 5 -
b/kernel/rcu/sync.c | 12 +---
b/kernel/rcu/tree.c | 40 ++++++++------
b/kernel/rcu/tree.h | 1
b/kernel/rcu/tree_plugin.h | 10 +++
kernel/rcu/tree.c | 61 +++++++++++++++++-----
9 files changed, 157 insertions(+), 41 deletions(-)


2024-06-04 22:24:19

by Paul E. McKenney

Subject: [PATCH rcu 2/9] rcu: Reduce synchronize_rcu() delays when all wait heads are in use

From: Neeraj Upadhyay <[email protected]>

When all wait heads are in use, which can happen when
rcu_sr_normal_gp_cleanup_work()'s callback processing
is slow, any new synchronize_rcu() user's rcu_synchronize
node's processing is deferred to future GP periods. This
can result in long list of synchronize_rcu() invocations
waiting for full grace period processing, which can delay
freeing of memory. Mitigate this problem by using first
node in the list as wait tail when all wait heads are in use.
While methods to speed up callback processing would be needed
to recover from this situation, allowing new nodes to complete
their grace period can help prevent delays due to a fixed
number of wait head nodes.

Signed-off-by: Neeraj Upadhyay <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 40 +++++++++++++++++++++++-----------------
1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 28c7031711a3f..6ba36d9c09bde 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1463,14 +1463,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
* for this new grace period. Given that there are a fixed
* number of wait nodes, if all wait nodes are in use
* (which can happen when kworker callback processing
- * is delayed) and additional grace period is requested.
- * This means, a system is slow in processing callbacks.
- *
- * TODO: If a slow processing is detected, a first node
- * in the llist should be used as a wait-tail for this
- * grace period, therefore users which should wait due
- * to a slow process are handled by _this_ grace period
- * and not next.
+ * is delayed), first node in the llist is used as wait
+ * tail for this grace period. This means, the first node
+ * has to go through additional grace periods before it is
+ * part of the wait callbacks. This should be ok, as
+ * the system is slow in processing callbacks anyway.
*
* Below is an illustration of how the done and wait
* tail pointers move from one set of rcu_synchronize nodes
@@ -1639,7 +1636,6 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
if (!done)
return;

- WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
head = done->next;
done->next = NULL;

@@ -1676,13 +1672,21 @@ static void rcu_sr_normal_gp_cleanup(void)

rcu_state.srs_wait_tail = NULL;
ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
- WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));

/*
* Process (a) and (d) cases. See an illustration.
*/
llist_for_each_safe(rcu, next, wait_tail->next) {
- if (rcu_sr_is_wait_head(rcu))
+ /*
+ * The done tail may reference a rcu_synchronize node.
+ * Stop at done tail, as using rcu_sr_normal_complete()
+ * from this path can result in use-after-free. This
+ * may occur if, following the wake-up of the synchronize_rcu()
+ * wait contexts and freeing up of node memory,
+ * rcu_sr_normal_gp_cleanup_work() accesses the done tail and
+ * its subsequent nodes.
+ */
+ if (wait_tail->next == rcu_state.srs_done_tail)
break;

rcu_sr_normal_complete(rcu);
@@ -1719,15 +1723,17 @@ static bool rcu_sr_normal_gp_init(void)
return start_new_poll;

wait_head = rcu_sr_get_wait_head();
- if (!wait_head) {
- // Kick another GP to retry.
+ if (wait_head) {
+ /* Inject a wait-dummy-node. */
+ llist_add(wait_head, &rcu_state.srs_next);
+ } else {
+ // Kick another GP for first node.
start_new_poll = true;
- return start_new_poll;
+ if (first == rcu_state.srs_done_tail)
+ return start_new_poll;
+ wait_head = first;
}

- /* Inject a wait-dummy-node. */
- llist_add(wait_head, &rcu_state.srs_next);
-
/*
* A waiting list of rcu_synchronize nodes should be empty on
* this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
--
2.40.1


2024-06-04 22:24:25

by Paul E. McKenney

Subject: [PATCH rcu 5/9] srcu: Disable interrupts directly in srcu_gp_end()

Interrupts are enabled in srcu_gp_end(), so this commit switches from
spin_lock_irqsave_rcu_node() and spin_unlock_irqrestore_rcu_node()
to spin_lock_irq_rcu_node() and spin_unlock_irq_rcu_node().

Link: https://lore.kernel.org/all/[email protected]/

Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/srcutree.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index bc4b58b0204e9..d14d350f505f4 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -845,7 +845,6 @@ static void srcu_gp_end(struct srcu_struct *ssp)
bool cbs;
bool last_lvl;
int cpu;
- unsigned long flags;
unsigned long gpseq;
int idx;
unsigned long mask;
@@ -907,12 +906,12 @@ static void srcu_gp_end(struct srcu_struct *ssp)
if (!(gpseq & counter_wrap_check))
for_each_possible_cpu(cpu) {
sdp = per_cpu_ptr(ssp->sda, cpu);
- spin_lock_irqsave_rcu_node(sdp, flags);
+ spin_lock_irq_rcu_node(sdp);
if (ULONG_CMP_GE(gpseq, sdp->srcu_gp_seq_needed + 100))
sdp->srcu_gp_seq_needed = gpseq;
if (ULONG_CMP_GE(gpseq, sdp->srcu_gp_seq_needed_exp + 100))
sdp->srcu_gp_seq_needed_exp = gpseq;
- spin_unlock_irqrestore_rcu_node(sdp, flags);
+ spin_unlock_irq_rcu_node(sdp);
}

/* Callback initiation done, allow grace periods after next. */
--
2.40.1


2024-06-04 22:24:25

by Paul E. McKenney

Subject: [PATCH rcu 4/9] rcu: Disable interrupts directly in rcu_gp_init()

Interrupts are enabled in rcu_gp_init(), so this commit switches from
local_irq_save() and local_irq_restore() to local_irq_disable() and
local_irq_enable().

Link: https://lore.kernel.org/all/[email protected]/

Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2fe08e6186b4d..35bf4a3736765 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1841,7 +1841,7 @@ static noinline_for_stack bool rcu_gp_init(void)
WRITE_ONCE(rcu_state.gp_state, RCU_GP_ONOFF);
/* Exclude CPU hotplug operations. */
rcu_for_each_leaf_node(rnp) {
- local_irq_save(flags);
+ local_irq_disable();
arch_spin_lock(&rcu_state.ofl_lock);
raw_spin_lock_rcu_node(rnp);
if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
@@ -1849,7 +1849,7 @@ static noinline_for_stack bool rcu_gp_init(void)
/* Nothing to do on this leaf rcu_node structure. */
raw_spin_unlock_rcu_node(rnp);
arch_spin_unlock(&rcu_state.ofl_lock);
- local_irq_restore(flags);
+ local_irq_enable();
continue;
}

@@ -1886,7 +1886,7 @@ static noinline_for_stack bool rcu_gp_init(void)

raw_spin_unlock_rcu_node(rnp);
arch_spin_unlock(&rcu_state.ofl_lock);
- local_irq_restore(flags);
+ local_irq_enable();
}
rcu_gp_slow(gp_preinit_delay); /* Races with CPU hotplug. */

--
2.40.1


2024-06-04 22:24:25

by Paul E. McKenney

Subject: [PATCH rcu 1/9] rcu: Add lockdep_assert_in_rcu_read_lock() and friends

There is no direct RCU counterpart to lockdep_assert_irqs_disabled()
and friends. Although it is possible to construct them, it would
be more convenient to have the following lockdep assertions:

lockdep_assert_in_rcu_read_lock()
lockdep_assert_in_rcu_read_lock_bh()
lockdep_assert_in_rcu_read_lock_sched()
lockdep_assert_in_rcu_reader()

This commit therefore creates them.

Reported-by: Jens Axboe <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 60 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index dfd2399f2cde0..8470a85f65634 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -421,11 +421,71 @@ static inline void rcu_preempt_sleep_check(void) { }
"Illegal context switch in RCU-sched read-side critical section"); \
} while (0)

+// See RCU_LOCKDEP_WARN() for an explanation of the double call to
+// debug_lockdep_rcu_enabled().
+static inline bool lockdep_assert_rcu_helper(bool c)
+{
+ return debug_lockdep_rcu_enabled() &&
+ (c || !rcu_is_watching() || !rcu_lockdep_current_cpu_online()) &&
+ debug_lockdep_rcu_enabled();
+}
+
+/**
+ * lockdep_assert_in_rcu_read_lock - WARN if not protected by rcu_read_lock()
+ *
+ * Splats if lockdep is enabled and there is no rcu_read_lock() in effect.
+ */
+#define lockdep_assert_in_rcu_read_lock() \
+ WARN_ON_ONCE(lockdep_assert_rcu_helper(!lock_is_held(&rcu_lock_map)))
+
+/**
+ * lockdep_assert_in_rcu_read_lock_bh - WARN if not protected by rcu_read_lock_bh()
+ *
+ * Splats if lockdep is enabled and there is no rcu_read_lock_bh() in effect.
+ * Note that local_bh_disable() and friends do not suffice here, instead an
+ * actual rcu_read_lock_bh() is required.
+ */
+#define lockdep_assert_in_rcu_read_lock_bh() \
+ WARN_ON_ONCE(lockdep_assert_rcu_helper(!lock_is_held(&rcu_bh_lock_map)))
+
+/**
+ * lockdep_assert_in_rcu_read_lock_sched - WARN if not protected by rcu_read_lock_sched()
+ *
+ * Splats if lockdep is enabled and there is no rcu_read_lock_sched()
+ * in effect. Note that preempt_disable() and friends do not suffice here,
+ * instead an actual rcu_read_lock_sched() is required.
+ */
+#define lockdep_assert_in_rcu_read_lock_sched() \
+ WARN_ON_ONCE(lockdep_assert_rcu_helper(!lock_is_held(&rcu_sched_lock_map)))
+
+/**
+ * lockdep_assert_in_rcu_reader - WARN if not within some type of RCU reader
+ *
+ * Splats if lockdep is enabled and there is no RCU reader of any
+ * type in effect. Note that regions of code protected by things like
+ * preempt_disable, local_bh_disable(), and local_irq_disable() all qualify
+ * as RCU readers.
+ *
+ * Note that this will never trigger in PREEMPT_NONE or PREEMPT_VOLUNTARY
+ * kernels that are not also built with PREEMPT_COUNT. But if you have
+ * lockdep enabled, you might as well also enable PREEMPT_COUNT.
+ */
+#define lockdep_assert_in_rcu_reader() \
+ WARN_ON_ONCE(lockdep_assert_rcu_helper(!lock_is_held(&rcu_lock_map) && \
+ !lock_is_held(&rcu_bh_lock_map) && \
+ !lock_is_held(&rcu_sched_lock_map) && \
+ preemptible()))
+
#else /* #ifdef CONFIG_PROVE_RCU */

#define RCU_LOCKDEP_WARN(c, s) do { } while (0 && (c))
#define rcu_sleep_check() do { } while (0)

+#define lockdep_assert_in_rcu_read_lock() do { } while (0)
+#define lockdep_assert_in_rcu_read_lock_bh() do { } while (0)
+#define lockdep_assert_in_rcu_read_lock_sched() do { } while (0)
+#define lockdep_assert_in_rcu_reader() do { } while (0)
+
#endif /* #else #ifdef CONFIG_PROVE_RCU */

/*
--
2.40.1
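
As a concrete usage sketch (my_table, my_entry, and my_lookup are hypothetical
names invented for this illustration; only lockdep_assert_in_rcu_read_lock()
comes from the patch above), a lookup helper that requires an RCU reader can
now assert its precondition directly:

#include <linux/rcupdate.h>

struct my_entry { int key; };
struct my_table { struct my_entry __rcu *slots[16]; };

/* Caller must be inside rcu_read_lock()/rcu_read_unlock(). */
static struct my_entry *my_lookup(struct my_table *t, int key)
{
	/* Splats under CONFIG_PROVE_RCU if the caller forgot rcu_read_lock(). */
	lockdep_assert_in_rcu_read_lock();
	return rcu_dereference(t->slots[key & 15]);
}

With CONFIG_PROVE_RCU=n the assertion compiles away, so the check costs
nothing in production builds.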


2024-06-04 22:24:26

by Paul E. McKenney

Subject: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

From: "Joel Fernandes (Google)" <[email protected]>

In the synchronize_rcu() common case, we will have less than
SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
is pointless just to free the last injected wait head since at that point,
all the users have already been awakened.

Introduce a new counter to track this and prevent the wakeup in the
common case.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
kernel/rcu/tree.h | 1 +
2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6ba36d9c09bde..2fe08e6186b4d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
rcu_sr_normal_gp_cleanup_work),
+ .srs_cleanups_pending = ATOMIC_INIT(0),
};

/* Dump rcu_node combining tree at boot to verify correct setup. */
@@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
* the done tail list manipulations are protected here.
*/
done = smp_load_acquire(&rcu_state.srs_done_tail);
- if (!done)
+ if (!done) {
+ /* See comments below. */
+ atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
return;
+ }

head = done->next;
done->next = NULL;
@@ -1656,6 +1660,9 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)

rcu_sr_put_wait_head(rcu);
}
+
+ /* Order list manipulations with atomic access. */
+ atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
}

/*
@@ -1663,7 +1670,7 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
*/
static void rcu_sr_normal_gp_cleanup(void)
{
- struct llist_node *wait_tail, *next, *rcu;
+ struct llist_node *wait_tail, *next = NULL, *rcu = NULL;
int done = 0;

wait_tail = rcu_state.srs_wait_tail;
@@ -1697,16 +1704,34 @@ static void rcu_sr_normal_gp_cleanup(void)
break;
}

- // concurrent sr_normal_gp_cleanup work might observe this update.
- smp_store_release(&rcu_state.srs_done_tail, wait_tail);
+ /*
+ * Fast path, no more users to process except putting the second last
+ * wait head if no inflight-workers. If there are in-flight workers,
+ * they will remove the last wait head.
+ *
+ * Note that the ACQUIRE orders atomic access with list manipulation.
+ */
+ if (wait_tail->next && wait_tail->next->next == NULL &&
+ rcu_sr_is_wait_head(wait_tail->next) &&
+ !atomic_read_acquire(&rcu_state.srs_cleanups_pending)) {
+ rcu_sr_put_wait_head(wait_tail->next);
+ wait_tail->next = NULL;
+ }
+
+ /* Concurrent sr_normal_gp_cleanup work might observe this update. */
ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
+ smp_store_release(&rcu_state.srs_done_tail, wait_tail);

/*
* We schedule a work in order to perform a final processing
* of outstanding users(if still left) and releasing wait-heads
* added by rcu_sr_normal_gp_init() call.
*/
- queue_work(sync_wq, &rcu_state.srs_cleanup_work);
+ if (wait_tail->next) {
+ atomic_inc(&rcu_state.srs_cleanups_pending);
+ if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
+ atomic_dec(&rcu_state.srs_cleanups_pending);
+ }
}

/*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bae7925c497fe..affcb92a358c3 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -420,6 +420,7 @@ struct rcu_state {
struct llist_node *srs_done_tail; /* ready for GP users. */
struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
struct work_struct srs_cleanup_work;
+ atomic_t srs_cleanups_pending; /* srs inflight worker cleanups. */
};

/* Values for rcu_state structure's gp_flags field. */
--
2.40.1
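
As a standalone restatement of the counting discipline this patch introduces,
the sketch below uses hypothetical names (cleanups_pending, cleanup_wq,
cleanup_work) rather than the rcu_state fields; it only illustrates the
inc-before-queue, dec-on-queue-failure, and release-decrement-at-end-of-worker
pairing that the fast path's acquire read relies on.

#include <linux/atomic.h>
#include <linux/workqueue.h>

static atomic_t cleanups_pending = ATOMIC_INIT(0);
static struct workqueue_struct *cleanup_wq;	/* Assumed allocated at init time. */

static void cleanup_workfn(struct work_struct *work)
{
	/* ... process and free whatever the queuer left behind ... */

	/* Order the list manipulations above before the decrement. */
	atomic_dec_return_release(&cleanups_pending);
}
static DECLARE_WORK(cleanup_work, cleanup_workfn);

/* Fast-path check: pairs with the release decrement in cleanup_workfn(). */
static bool cleanup_in_flight(void)
{
	return atomic_read_acquire(&cleanups_pending) != 0;
}

static void queue_cleanup(void)
{
	atomic_inc(&cleanups_pending);
	if (!queue_work(cleanup_wq, &cleanup_work))
		atomic_dec(&cleanups_pending);	/* Work already queued; undo our count. */
}

In the patch, rcu_sr_normal_gp_cleanup() frees the final wait head itself and
skips queue_work() entirely when only that head remains and the pending
counter reads zero, which is the common-case wakeup being avoided.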


2024-06-04 22:25:25

by Paul E. McKenney

Subject: [PATCH rcu 7/9] MAINTAINERS: Add Uladzislau Rezki as RCU maintainer

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
---
MAINTAINERS | 1 +
1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index d6c90161c7bfe..3f2047082073f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18868,6 +18868,7 @@ M: Neeraj Upadhyay <[email protected]> (kernel/rcu/tasks.h)
M: Joel Fernandes <[email protected]>
M: Josh Triplett <[email protected]>
M: Boqun Feng <[email protected]>
+M: Uladzislau Rezki <[email protected]>
R: Steven Rostedt <[email protected]>
R: Mathieu Desnoyers <[email protected]>
R: Lai Jiangshan <[email protected]>
--
2.40.1


2024-06-04 22:25:26

by Paul E. McKenney

Subject: [PATCH rcu 8/9] rcu: Eliminate lockless accesses to rcu_sync->gp_count

From: Oleg Nesterov <[email protected]>

The rcu_sync structure's ->gp_count field is always accessed under the
protection of that same structure's ->rss_lock field, with the exception
of a pair of WARN_ON_ONCE() calls just prior to acquiring that lock in
functions rcu_sync_exit() and rcu_sync_dtor(). These lockless accesses
are unnecessary and impair KCSAN's ability to catch bugs that might be
inserted via other lockless accesses.

This commit therefore moves those WARN_ON_ONCE() calls under the lock.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/sync.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index 6c2bd9001adcd..da60a9947c005 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -122,7 +122,7 @@ void rcu_sync_enter(struct rcu_sync *rsp)
* we are called at early boot time but this shouldn't happen.
*/
}
- WRITE_ONCE(rsp->gp_count, rsp->gp_count + 1);
+ rsp->gp_count++;
spin_unlock_irq(&rsp->rss_lock);

if (gp_state == GP_IDLE) {
@@ -151,15 +151,11 @@ void rcu_sync_enter(struct rcu_sync *rsp)
*/
void rcu_sync_exit(struct rcu_sync *rsp)
{
- int gpc;
-
WARN_ON_ONCE(READ_ONCE(rsp->gp_state) == GP_IDLE);
- WARN_ON_ONCE(READ_ONCE(rsp->gp_count) == 0);

spin_lock_irq(&rsp->rss_lock);
- gpc = rsp->gp_count - 1;
- WRITE_ONCE(rsp->gp_count, gpc);
- if (!gpc) {
+ WARN_ON_ONCE(rsp->gp_count == 0);
+ if (!--rsp->gp_count) {
if (rsp->gp_state == GP_PASSED) {
WRITE_ONCE(rsp->gp_state, GP_EXIT);
rcu_sync_call(rsp);
@@ -178,10 +174,10 @@ void rcu_sync_dtor(struct rcu_sync *rsp)
{
int gp_state;

- WARN_ON_ONCE(READ_ONCE(rsp->gp_count));
WARN_ON_ONCE(READ_ONCE(rsp->gp_state) == GP_PASSED);

spin_lock_irq(&rsp->rss_lock);
+ WARN_ON_ONCE(rsp->gp_count);
if (rsp->gp_state == GP_REPLAY)
WRITE_ONCE(rsp->gp_state, GP_EXIT);
gp_state = rsp->gp_state;
--
2.40.1
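
The rule being applied here generalizes: once every access to a field happens
under a single lock, the _ONCE() markings become unnecessary and KCSAN can
flag any remaining unlocked access as a genuine race. A minimal sketch with
invented names (this is not the rcu_sync code, just the access pattern):

#include <linux/spinlock.h>
#include <linux/bug.h>

struct gated_count {
	spinlock_t lock;
	int count;		/* Accessed only while holding ->lock. */
};

static bool gated_count_dec(struct gated_count *g)
{
	bool last;

	spin_lock_irq(&g->lock);
	WARN_ON_ONCE(g->count == 0);	/* Checked under the lock, not lockless beforehand. */
	last = !--g->count;
	spin_unlock_irq(&g->lock);

	return last;	/* True if this call released the final reference. */
}

Before the patch, the equivalent WARN_ON_ONCE() in rcu_sync_exit() read
->gp_count with READ_ONCE() outside the lock, which forced the writers to use
WRITE_ONCE() and weakened KCSAN's checking of that field.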


2024-06-04 22:25:27

by Paul E. McKenney

Subject: [PATCH rcu 6/9] rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter

If a CPU is running either a userspace application or a guest OS in
nohz_full mode, it is possible for a system call to occur just as an
RCU grace period is starting. If that CPU also has the scheduling-clock
tick enabled for any reason (such as a second runnable task), and if the
system was booted with rcutree.use_softirq=0, then RCU can add insult to
injury by awakening that CPU's rcuc kthread, resulting in yet another
task and yet more OS jitter due to switching to that task, running it,
and switching back.

In addition, in the common case where that system call is not of
excessively long duration, awakening the rcuc task is pointless.
This pointlessness is due to the fact that the CPU will enter an extended
quiescent state upon returning to the userspace application or guest OS.
In this case, the rcuc kthread cannot do anything that the main RCU
grace-period kthread cannot do on its behalf, at least if it is given
a few additional milliseconds (for example, given the time duration
specified by rcutree.jiffies_till_first_fqs, give or take scheduling
delays).

This commit therefore adds a rcutree.nocb_patience_delay kernel boot
parameter that specifies the grace period age (in milliseconds)
before which RCU will refrain from awakening the rcuc kthread.
Preliminary experiementation suggests a value of 1000, that is,
one second. Increasing rcutree.nocb_patience_delay will increase
grace-period latency and in turn increase memory footprint, so systems
with constrained memory might choose a smaller value. Systems with
less-aggressive OS-jitter requirements might choose the default value
of zero, which keeps the traditional immediate-wakeup behavior, thus
avoiding increases in grace-period latency.

[ paulmck: Apply Leonardo Bras feedback. ]

Link: https://lore.kernel.org/all/[email protected]/

Reported-by: Leonardo Bras <[email protected]>
Suggested-by: Leonardo Bras <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Leonardo Bras <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++
kernel/rcu/tree.c | 10 ++++++++--
kernel/rcu/tree_plugin.h | 10 ++++++++++
3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 500cfa7762257..2d4a512cf1fc6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5018,6 +5018,14 @@
the ->nocb_bypass queue. The definition of "too
many" is supplied by this kernel boot parameter.

+ rcutree.nocb_patience_delay= [KNL]
+ On callback-offloaded (rcu_nocbs) CPUs, avoid
+ disturbing RCU unless the grace period has
+ reached the specified age in milliseconds.
+ Defaults to zero. Large values will be capped
+ at five seconds. All values will be rounded down
+ to the nearest value representable by jiffies.
+
rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which
batch limiting is disabled.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 35bf4a3736765..408b020c9501f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -176,6 +176,9 @@ static int gp_init_delay;
module_param(gp_init_delay, int, 0444);
static int gp_cleanup_delay;
module_param(gp_cleanup_delay, int, 0444);
+static int nocb_patience_delay;
+module_param(nocb_patience_delay, int, 0444);
+static int nocb_patience_delay_jiffies;

// Add delay to rcu_read_unlock() for strict grace periods.
static int rcu_unlock_delay;
@@ -4344,11 +4347,14 @@ static int rcu_pending(int user)
return 1;

/* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
- if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+ gp_in_progress = rcu_gp_in_progress();
+ if ((user || rcu_is_cpu_rrupt_from_idle() ||
+ (gp_in_progress &&
+ time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay_jiffies))) &&
+ rcu_nohz_full_cpu())
return 0;

/* Is the RCU core waiting for a quiescent state from this CPU? */
- gp_in_progress = rcu_gp_in_progress();
if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
return 1;

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 340bbefe5f652..31c539f09c150 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -93,6 +93,16 @@ static void __init rcu_bootup_announce_oddness(void)
pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
if (gp_cleanup_delay)
pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
+ if (nocb_patience_delay < 0) {
+ pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
+ nocb_patience_delay = 0;
+ } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
+ pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
+ nocb_patience_delay = 5 * MSEC_PER_SEC;
+ } else if (nocb_patience_delay) {
+ pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
+ }
+ nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
if (!use_softirq)
pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
--
2.40.1
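
The check this patch folds into rcu_pending() is small enough to restate on
its own. The helper name below is hypothetical (the real code is open-coded
in rcu_pending()); it only illustrates the jiffies comparison against the
grace period's start time:

#include <linux/jiffies.h>

static int nocb_patience_delay_jiffies;	/* rcutree.nocb_patience_delay converted to jiffies. */

/*
 * True when a grace period is in progress but still younger than the
 * configured patience, in which case a nohz_full CPU can be left alone
 * and the grace-period kthread trusted to do the work itself.
 */
static bool rcu_gp_younger_than_patience(bool gp_in_progress, unsigned long gp_start)
{
	return gp_in_progress &&
	       time_before(jiffies, gp_start + nocb_patience_delay_jiffies);
}

Because time_before(jiffies, gp_start) is false once the grace period has
started, the default of zero leaves rcu_pending()'s behavior unchanged,
matching the traditional immediate-wakeup behavior described above.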


2024-06-04 22:25:28

by Paul E. McKenney

Subject: [PATCH rcu 9/9] rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation

From: Frederic Weisbecker <[email protected]>

When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off
rnp->qsmaskinitnext, it means that all accesses from the offline CPU
preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including
callbacks expiration and counter updates.

However interrupts can still fire after stop_machine() re-enables
interrupts and before rcutree_report_cpu_dead(). The related accesses
happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing
are _NOT_ guaranteed to be seen by rcu_barrier() without proper
ordering, especially when callbacks are invoked there to the end, making
rcutree_migrate_callback() bypass barrier_lock.

The following theoretical race example can make rcu_barrier() hang:

    CPU 0                                      CPU 1
    -----                                      -----
                                               //cpu_down()
                                               smpboot_park_threads()
                                               //ksoftirqd is parked now
                                               <IRQ>
                                               rcu_sched_clock_irq()
                                                  invoke_rcu_core()
                                               do_softirq()
                                                  rcu_core()
                                                     rcu_do_batch()
                                                        // callback storm
                                                        // rcu_do_batch() returns
                                                        // before completing all
                                                        // of them
                                               // do_softirq also returns early because of
                                               // timeout. It defers to ksoftirqd but
                                               // it's parked
                                               </IRQ>
                                               stop_machine()
                                                  take_cpu_down()
    rcu_barrier()
       spin_lock(barrier_lock)
       // observes rcu_segcblist_n_cbs(&rdp->cblist) != 0
                                               <IRQ>
                                               do_softirq()
                                                  rcu_core()
                                                     rcu_do_batch()
                                                        //completes all pending callbacks
                                                        //smp_mb() implied _after_ callback number dec
                                               </IRQ>

                                               rcutree_report_cpu_dead()
                                                  rnp->qsmaskinitnext &= ~rdp->grpmask;

                                               rcutree_migrate_callback()
                                                  // no callback, early return without locking
                                                  // barrier_lock
       //observes !rcu_rdp_cpu_online(rdp)
       rcu_barrier_entrain()
          rcu_segcblist_entrain()
             // Observe rcu_segcblist_n_cbs(rsclp) == 0
             // because no barrier between reading
             // rnp->qsmaskinitnext and rsclp->len
             rcu_segcblist_add_len()
                smp_mb__before_atomic()
                // will now observe the 0 count and empty
                // list, but too late, we enqueue regardless
                WRITE_ONCE(rsclp->len, rsclp->len + v);
       // ignored barrier callback
       // rcu barrier stall...

This could be solved with a read memory barrier, enforcing the message
passing between rnp->qsmaskinitnext and rsclp->len, matching the full
memory barrier after rsclp->len addition in rcu_segcblist_add_len()
performed at the end of rcu_do_batch().

However the rcu_barrier() is complicated enough and probably doesn't
need too many more subtleties. CPU down is a slowpath and the
barrier_lock seldom contended. Solve the issue with unconditionally
locking the barrier_lock on rcutree_migrate_callbacks(). This makes sure
that either rcu_barrier() sees the empty queue or its entrained
callback will be migrated.

Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 408b020c9501f..c58fc10fb5969 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -5147,11 +5147,15 @@ void rcutree_migrate_callbacks(int cpu)
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
bool needwake;

- if (rcu_rdp_is_offloaded(rdp) ||
- rcu_segcblist_empty(&rdp->cblist))
- return; /* No callbacks to migrate. */
+ if (rcu_rdp_is_offloaded(rdp))
+ return;

raw_spin_lock_irqsave(&rcu_state.barrier_lock, flags);
+ if (rcu_segcblist_empty(&rdp->cblist)) {
+ raw_spin_unlock_irqrestore(&rcu_state.barrier_lock, flags);
+ return; /* No callbacks to migrate. */
+ }
+
WARN_ON_ONCE(rcu_rdp_cpu_online(rdp));
rcu_barrier_entrain(rdp);
my_rdp = this_cpu_ptr(&rcu_data);
--
2.40.1


2024-06-05 13:09:55

by Frederic Weisbecker

Subject: Re: [PATCH rcu 2/9] rcu: Reduce synchronize_rcu() delays when all wait heads are in use

Le Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney a écrit :
> From: Neeraj Upadhyay <[email protected]>
>
> When all wait heads are in use, which can happen when
> rcu_sr_normal_gp_cleanup_work()'s callback processing
> is slow, any new synchronize_rcu() user's rcu_synchronize
> node's processing is deferred to future GP periods. This
> can result in long list of synchronize_rcu() invocations
> waiting for full grace period processing, which can delay
> freeing of memory. Mitigate this problem by using first
> node in the list as wait tail when all wait heads are in use.
> While methods to speed up callback processing would be needed
> to recover from this situation, allowing new nodes to complete
> their grace period can help prevent delays due to a fixed
> number of wait head nodes.
>
> Signed-off-by: Neeraj Upadhyay <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>

IIRC we agreed that this patch could be a step too far that
made an already not so simple state machine even less simple,
breaking the wait_head based flow.

Should we postpone this change until it is observed that a workqueue
not being scheduled for 5 grace periods is a real issue?

Thanks.

2024-06-05 16:47:34

by Frederic Weisbecker

Subject: Re: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> From: "Joel Fernandes (Google)" <[email protected]>
>
> In the synchronize_rcu() common case, we will have less than
> SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> is pointless just to free the last injected wait head since at that point,
> all the users have already been awakened.
>
> Introduce a new counter to track this and prevent the wakeup in the
> common case.
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> kernel/rcu/tree.h | 1 +
> 2 files changed, 31 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 6ba36d9c09bde..2fe08e6186b4d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> rcu_sr_normal_gp_cleanup_work),
> + .srs_cleanups_pending = ATOMIC_INIT(0),
> };
>
> /* Dump rcu_node combining tree at boot to verify correct setup. */
> @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> * the done tail list manipulations are protected here.
> */
> done = smp_load_acquire(&rcu_state.srs_done_tail);
> - if (!done)
> + if (!done) {
> + /* See comments below. */
> + atomic_dec_return_release(&rcu_state.srs_cleanups_pending);

This condition is not supposed to happen. If the work is scheduled,
there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
may make things worse.

So this should be:

if (WARN_ON_ONCE(!done))
return;

Thanks.

2024-06-05 18:42:50

by Paul E. McKenney

Subject: Re: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

On Wed, Jun 05, 2024 at 06:35:08PM +0200, Frederic Weisbecker wrote:
> Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > From: "Joel Fernandes (Google)" <[email protected]>
> >
> > In the synchronize_rcu() common case, we will have less than
> > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > is pointless just to free the last injected wait head since at that point,
> > all the users have already been awakened.
> >
> > Introduce a new counter to track this and prevent the wakeup in the
> > common case.
> >
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > kernel/rcu/tree.h | 1 +
> > 2 files changed, 31 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > rcu_sr_normal_gp_cleanup_work),
> > + .srs_cleanups_pending = ATOMIC_INIT(0),
> > };
> >
> > /* Dump rcu_node combining tree at boot to verify correct setup. */
> > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > * the done tail list manipulations are protected here.
> > */
> > done = smp_load_acquire(&rcu_state.srs_done_tail);
> > - if (!done)
> > + if (!done) {
> > + /* See comments below. */
> > + atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
>
> This condition is not supposed to happen. If the work is scheduled,
> there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> may make things worse.
>
> So this should be:
>
> if (WARN_ON_ONCE(!done))
> return;

Or just this:

WARN_ON_ONCE(!smp_load_acquire(&rcu_state.srs_done_tail));

Uladzislau, thoughts? Is there some corner case where we really can
see smp_load_acquire() hand back a NULL pointer, perhaps during boot?

Thanx, Paul

2024-06-05 18:53:28

by Paul E. McKenney

Subject: Re: [PATCH rcu 2/9] rcu: Reduce synchronize_rcu() delays when all wait heads are in use

On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> Le Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney a écrit :
> > From: Neeraj Upadhyay <[email protected]>
> >
> > When all wait heads are in use, which can happen when
> > rcu_sr_normal_gp_cleanup_work()'s callback processing
> > is slow, any new synchronize_rcu() user's rcu_synchronize
> > node's processing is deferred to future GP periods. This
> > can result in long list of synchronize_rcu() invocations
> > waiting for full grace period processing, which can delay
> > freeing of memory. Mitigate this problem by using first
> > node in the list as wait tail when all wait heads are in use.
> > While methods to speed up callback processing would be needed
> > to recover from this situation, allowing new nodes to complete
> > their grace period can help prevent delays due to a fixed
> > number of wait head nodes.
> >
> > Signed-off-by: Neeraj Upadhyay <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> IIRC we agreed that this patch could be a step too far that
> made an already not so simple state machine even less simple,
> breaking the wait_head based flow.

True, which is why we agreed not to submit it into the v6.10 merge window.

And I don't recall us saying what merge window to send it to.

> Should we postpone this change until it is observed that a workqueue
> not being scheduled for 5 grace periods is a real issue?

Neeraj, thoughts? Or, better yet, test results? ;-)

Thanx, Paul

2024-06-06 03:46:44

by Neeraj Upadhyay

Subject: Re: [PATCH rcu 2/9] rcu: Reduce synchronize_rcu() delays when all wait heads are in use



On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
>> Le Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney a écrit :
>>> From: Neeraj Upadhyay <[email protected]>
>>>
>>> When all wait heads are in use, which can happen when
>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>> node's processing is deferred to future GP periods. This
>>> can result in long list of synchronize_rcu() invocations
>>> waiting for full grace period processing, which can delay
>>> freeing of memory. Mitigate this problem by using first
>>> node in the list as wait tail when all wait heads are in use.
>>> While methods to speed up callback processing would be needed
>>> to recover from this situation, allowing new nodes to complete
>>> their grace period can help prevent delays due to a fixed
>>> number of wait head nodes.
>>>
>>> Signed-off-by: Neeraj Upadhyay <[email protected]>
>>> Signed-off-by: Paul E. McKenney <[email protected]>
>>
>> IIRC we agreed that this patch could be a step too far that
>> made an already not so simple state machine even less simple,
>> breaking the wait_head based flow.
>
> True, which is why we agreed not to submit it into the v6.10 merge window.
>
> And I don't recall us saying what merge window to send it to.
>
>> Should we postpone this change until it is observed that a workqueue
>> not being scheduled for 5 grace periods is a real issue?
>
> Neeraj, thoughts? Or, better yet, test results? ;-)

Yes I agree that we postpone this change until we see it as a real
problem. I had run a test to invoke synchronize_rcu() from all CPUs
on a 96 core system in parallel. I didn't specifically check if this
scenario was hit. Will run RCU torture test with this change.


Thanks
Neeraj


>
> Thanx, Paul

2024-06-06 05:58:31

by Neeraj upadhyay

Subject: Re: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <[email protected]> wrote:
>
> Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > From: "Joel Fernandes (Google)" <[email protected]>
> >
> > In the synchronize_rcu() common case, we will have less than
> > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > is pointless just to free the last injected wait head since at that point,
> > all the users have already been awakened.
> >
> > Introduce a new counter to track this and prevent the wakeup in the
> > common case.
> >
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > kernel/rcu/tree.h | 1 +
> > 2 files changed, 31 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > rcu_sr_normal_gp_cleanup_work),
> > + .srs_cleanups_pending = ATOMIC_INIT(0),
> > };
> >
> > /* Dump rcu_node combining tree at boot to verify correct setup. */
> > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > * the done tail list manipulations are protected here.
> > */
> > done = smp_load_acquire(&rcu_state.srs_done_tail);
> > - if (!done)
> > + if (!done) {
> > + /* See comments below. */
> > + atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
>
> This condition is not supposed to happen. If the work is scheduled,
> there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> may make things worse.
>

I also don't see a scenario where this can happen. However, if we are
returning from here, given that for every queued work we do an
increment of rcu_state.srs_cleanups_pending, I think it's safer to
decrement in this
case, as that counter tracks only the work queuing and execution counts.

atomic_inc(&rcu_state.srs_cleanups_pending);
if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
atomic_dec(&rcu_state.srs_cleanups_pending);




Thanks
Neeraj

> So this should be:
>
> if (WARN_ON_ONCE(!done))
> return;
>
> Thanks.
>

2024-06-06 17:08:58

by Paul E. McKenney

Subject: Re: [PATCH rcu 2/9] rcu: Reduce synchronize_rcu() delays when all wait heads are in use

On Thu, Jun 06, 2024 at 09:16:08AM +0530, Neeraj Upadhyay wrote:
>
>
> On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> > On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> >> Le Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney a écrit :
> >>> From: Neeraj Upadhyay <[email protected]>
> >>>
> >>> When all wait heads are in use, which can happen when
> >>> rcu_sr_normal_gp_cleanup_work()'s callback processing
> >>> is slow, any new synchronize_rcu() user's rcu_synchronize
> >>> node's processing is deferred to future GP periods. This
> >>> can result in long list of synchronize_rcu() invocations
> >>> waiting for full grace period processing, which can delay
> >>> freeing of memory. Mitigate this problem by using first
> >>> node in the list as wait tail when all wait heads are in use.
> >>> While methods to speed up callback processing would be needed
> >>> to recover from this situation, allowing new nodes to complete
> >>> their grace period can help prevent delays due to a fixed
> >>> number of wait head nodes.
> >>>
> >>> Signed-off-by: Neeraj Upadhyay <[email protected]>
> >>> Signed-off-by: Paul E. McKenney <[email protected]>
> >>
> >> IIRC we agreed that this patch could be a step too far that
> >> made an already not so simple state machine even less simple,
> >> breaking the wait_head based flow.
> >
> > True, which is why we agreed not to submit it into the v6.10 merge window.
> >
> > And I don't recall us saying what merge window to send it to.
> >
> >> Should we postpone this change until it is observed that a workqueue
> >> not being scheduled for 5 grace periods is a real issue?
> >
> > Neeraj, thoughts? Or, better yet, test results? ;-)
>
> Yes I agree that we postpone this change until we see it as a real
> problem. I had run a test to invoke synchronize_rcu() from all CPUs
> on a 96 core system in parallel. I didn't specifically check if this
> scenario was hit. Will run RCU torture test with this change.

Very well, I will drop this patch with the expectation that you will
re-post it if a problem does arise.

Thanx, Paul

2024-06-06 18:43:49

by Paul E. McKenney

Subject: Re: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <[email protected]> wrote:
> >
> > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > From: "Joel Fernandes (Google)" <[email protected]>
> > >
> > > In the synchronize_rcu() common case, we will have less than
> > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > is pointless just to free the last injected wait head since at that point,
> > > all the users have already been awakened.
> > >
> > > Introduce a new counter to track this and prevent the wakeup in the
> > > common case.
> > >
> > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> > > kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > kernel/rcu/tree.h | 1 +
> > > 2 files changed, 31 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > rcu_sr_normal_gp_cleanup_work),
> > > + .srs_cleanups_pending = ATOMIC_INIT(0),
> > > };
> > >
> > > /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > * the done tail list manipulations are protected here.
> > > */
> > > done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > - if (!done)
> > > + if (!done) {
> > > + /* See comments below. */
> > > + atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> >
> > This condition is not supposed to happen. If the work is scheduled,
> > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > may make things worse.
> >
>
> I also don't see a scenario where this can happen. However, if we are
> returning from here, given that for every queued work we do an
> increment of rcu_state.srs_cleanups_pending, I think it's safer to
> decrement in this
> case, as that counter tracks only the work queuing and execution counts.
>
> atomic_inc(&rcu_state.srs_cleanups_pending);
> if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> atomic_dec(&rcu_state.srs_cleanups_pending);

Linus Torvalds' general rule is that if you cannot imagine how a bug
can happen, don't attempt to clean up after it. His rationale (which
is *almost* always a good one) is that not knowing how the bug happens
means that attempts to clean up will usually just make matters worse.
And all too often, the clean-up code makes debugging more difficult.

One example exception to this rule is when debug-objects detects a
duplicate call_rcu(). In that case, we ignore that second call_rcu().
But the reason is that experience has shown that the usual cause really
is someone doing a duplicate call_rcu(), and also that ignoring the
second call_rcu() makes debugging easier.

So what is it that Frederic and I are missing here?

Thanx, Paul

> Thanks
> Neeraj
>
> > So this should be:
> >
> > if (WARN_ON_ONCE(!done))
> > return;
> >
> > Thanks.
> >
>

2024-06-07 01:52:19

by Neeraj upadhyay

Subject: Re: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

On Thu, Jun 6, 2024 at 11:42 PM Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> > On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <[email protected]> wrote:
> > >
> > > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > > From: "Joel Fernandes (Google)" <[email protected]>
> > > >
> > > > In the synchronize_rcu() common case, we will have less than
> > > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > > is pointless just to free the last injected wait head since at that point,
> > > > all the users have already been awakened.
> > > >
> > > > Introduce a new counter to track this and prevent the wakeup in the
> > > > common case.
> > > >
> > > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > > Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > ---
> > > > kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > > kernel/rcu/tree.h | 1 +
> > > > 2 files changed, 31 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > > .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > > .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > > rcu_sr_normal_gp_cleanup_work),
> > > > + .srs_cleanups_pending = ATOMIC_INIT(0),
> > > > };
> > > >
> > > > /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > * the done tail list manipulations are protected here.
> > > > */
> > > > done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > - if (!done)
> > > > + if (!done) {
> > > > + /* See comments below. */
> > > > + atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> > >
> > > This condition is not supposed to happen. If the work is scheduled,
> > > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > > may make things worse.
> > >
> >
> > I also don't see a scenario where this can happen. However, if we are
> > returning from here, given that for every queued work we do an
> > increment of rcu_state.srs_cleanups_pending, I think it's safer to
> > decrement in this
> > case, as that counter tracks only the work queuing and execution counts.
> >
> > atomic_inc(&rcu_state.srs_cleanups_pending);
> > if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> > atomic_dec(&rcu_state.srs_cleanups_pending);
>
> Linus Torvalds' general rule is that if you cannot imagine how a bug
> can happen, don't attempt to clean up after it. His rationale (which
> is *almost* always a good one) is that not knowing how the bug happens
> means that attempts to clean up will usually just make matters worse.
> And all too often, the clean-up code makes debugging more difficult.
>

Ok. Thanks for sharing this info!

> One example exception to this rule is when debug-objects detects a
> duplicate call_rcu(). In that case, we ignore that second call_rcu().
> But the reason is that experience has shown that the usual cause really
> is someone doing a duplicate call_rcu(), and also that ignoring the
> second call_rcu() makes debugging easier.
>
> So what is it that Frederic and I are missing here?
>

Maybe nothing. As kworker context does not modify srs_done_tail and
invalid values
of srs_done_tail can only be caused by the GP kthread manipulations
of srs_done_tail, my thought here was that we can keep the pending
rcu_sr_normal_gp_cleanup_work count consistent with the number of
queue_work() and kworker executions, even when we see unexpected
srs_done_tail values like these. However, as you described the general rule
is to not attempt any clean up for such scenarios.

Thanks
Neeraj



> Thanx, Paul
>
> > Thanks
> > Neeraj
> >
> > > So this should be:
> > >
> > > if (WARN_ON_ONCE(!done))
> > > return;
> > >
> > > Thanks.
> > >
> >

2024-06-10 05:05:51

by Leonardo Bras

Subject: Re: [PATCH rcu 6/9] rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter

On Tue, Jun 04, 2024 at 03:23:52PM -0700, Paul E. McKenney wrote:
> If a CPU is running either a userspace application or a guest OS in
> nohz_full mode, it is possible for a system call to occur just as an
> RCU grace period is starting. If that CPU also has the scheduling-clock
> tick enabled for any reason (such as a second runnable task), and if the
> system was booted with rcutree.use_softirq=0, then RCU can add insult to
> injury by awakening that CPU's rcuc kthread, resulting in yet another
> task and yet more OS jitter due to switching to that task, running it,
> and switching back.
>
> In addition, in the common case where that system call is not of
> excessively long duration, awakening the rcuc task is pointless.
> This pointlessness is due to the fact that the CPU will enter an extended
> quiescent state upon returning to the userspace application or guest OS.
> In this case, the rcuc kthread cannot do anything that the main RCU
> grace-period kthread cannot do on its behalf, at least if it is given
> a few additional milliseconds (for example, given the time duration
> specified by rcutree.jiffies_till_first_fqs, give or take scheduling
> delays).
>
> This commit therefore adds a rcutree.nocb_patience_delay kernel boot
> parameter that specifies the grace period age (in milliseconds)
> before which RCU will refrain from awakening the rcuc kthread.
> Preliminary experiementation suggests a value of 1000, that is,

Just a nit I found when cherry-picking here
s/experiementation/experimentation/

Thanks!
Leo

> one second. Increasing rcutree.nocb_patience_delay will increase
> grace-period latency and in turn increase memory footprint, so systems
> with constrained memory might choose a smaller value. Systems with
> less-aggressive OS-jitter requirements might choose the default value
> of zero, which keeps the traditional immediate-wakeup behavior, thus
> avoiding increases in grace-period latency.
>
> [ paulmck: Apply Leonardo Bras feedback. ]
>
> Link: https://lore.kernel.org/all/[email protected]/
>
> Reported-by: Leonardo Bras <[email protected]>
> Suggested-by: Leonardo Bras <[email protected]>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Reviewed-by: Leonardo Bras <[email protected]>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++
> kernel/rcu/tree.c | 10 ++++++++--
> kernel/rcu/tree_plugin.h | 10 ++++++++++
> 3 files changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 500cfa7762257..2d4a512cf1fc6 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5018,6 +5018,14 @@
> the ->nocb_bypass queue. The definition of "too
> many" is supplied by this kernel boot parameter.
>
> + rcutree.nocb_patience_delay= [KNL]
> + On callback-offloaded (rcu_nocbs) CPUs, avoid
> + disturbing RCU unless the grace period has
> + reached the specified age in milliseconds.
> + Defaults to zero. Large values will be capped
> + at five seconds. All values will be rounded down
> + to the nearest value representable by jiffies.
> +
> rcutree.qhimark= [KNL]
> Set threshold of queued RCU callbacks beyond which
> batch limiting is disabled.
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 35bf4a3736765..408b020c9501f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -176,6 +176,9 @@ static int gp_init_delay;
> module_param(gp_init_delay, int, 0444);
> static int gp_cleanup_delay;
> module_param(gp_cleanup_delay, int, 0444);
> +static int nocb_patience_delay;
> +module_param(nocb_patience_delay, int, 0444);
> +static int nocb_patience_delay_jiffies;
>
> // Add delay to rcu_read_unlock() for strict grace periods.
> static int rcu_unlock_delay;
> @@ -4344,11 +4347,14 @@ static int rcu_pending(int user)
> return 1;
>
> /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> + gp_in_progress = rcu_gp_in_progress();
> + if ((user || rcu_is_cpu_rrupt_from_idle() ||
> + (gp_in_progress &&
> + time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay_jiffies))) &&
> + rcu_nohz_full_cpu())
> return 0;
>
> /* Is the RCU core waiting for a quiescent state from this CPU? */
> - gp_in_progress = rcu_gp_in_progress();
> if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> return 1;
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 340bbefe5f652..31c539f09c150 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -93,6 +93,16 @@ static void __init rcu_bootup_announce_oddness(void)
> pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> if (gp_cleanup_delay)
> pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> + if (nocb_patience_delay < 0) {
> + pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
> + nocb_patience_delay = 0;
> + } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> + pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
> + nocb_patience_delay = 5 * MSEC_PER_SEC;
> + } else if (nocb_patience_delay) {
> + pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
> + }
> + nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> if (!use_softirq)
> pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> --
> 2.40.1
>


2024-06-10 15:11:03

by Paul E. McKenney

Subject: Re: [PATCH rcu 6/9] rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter

On Mon, Jun 10, 2024 at 02:05:30AM -0300, Leonardo Bras wrote:
> On Tue, Jun 04, 2024 at 03:23:52PM -0700, Paul E. McKenney wrote:
> > If a CPU is running either a userspace application or a guest OS in
> > nohz_full mode, it is possible for a system call to occur just as an
> > RCU grace period is starting. If that CPU also has the scheduling-clock
> > tick enabled for any reason (such as a second runnable task), and if the
> > system was booted with rcutree.use_softirq=0, then RCU can add insult to
> > injury by awakening that CPU's rcuc kthread, resulting in yet another
> > task and yet more OS jitter due to switching to that task, running it,
> > and switching back.
> >
> > In addition, in the common case where that system call is not of
> > excessively long duration, awakening the rcuc task is pointless.
> > This pointlessness is due to the fact that the CPU will enter an extended
> > quiescent state upon returning to the userspace application or guest OS.
> > In this case, the rcuc kthread cannot do anything that the main RCU
> > grace-period kthread cannot do on its behalf, at least if it is given
> > a few additional milliseconds (for example, given the time duration
> > specified by rcutree.jiffies_till_first_fqs, give or take scheduling
> > delays).
> >
> > This commit therefore adds a rcutree.nocb_patience_delay kernel boot
> > parameter that specifies the grace period age (in milliseconds)
> > before which RCU will refrain from awakening the rcuc kthread.
> > Preliminary experiementation suggests a value of 1000, that is,
>
> Just a nit I found when cherry-picking here
> s/experiementation/experimentation/

Good eyes! I will fix this on my next rebase, thank you!

Thanx, Paul

> Thanks!
> Leo
>
> > one second. Increasing rcutree.nocb_patience_delay will increase
> > grace-period latency and in turn increase memory footprint, so systems
> > with constrained memory might choose a smaller value. Systems with
> > less-aggressive OS-jitter requirements might choose the default value
> > of zero, which keeps the traditional immediate-wakeup behavior, thus
> > avoiding increases in grace-period latency.
> >
> > [ paulmck: Apply Leonardo Bras feedback. ]
> >
> > Link: https://lore.kernel.org/all/[email protected]/
> >
> > Reported-by: Leonardo Bras <[email protected]>
> > Suggested-by: Leonardo Bras <[email protected]>
> > Suggested-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > Reviewed-by: Leonardo Bras <[email protected]>
> > ---
> > Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++
> > kernel/rcu/tree.c | 10 ++++++++--
> > kernel/rcu/tree_plugin.h | 10 ++++++++++
> > 3 files changed, 26 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 500cfa7762257..2d4a512cf1fc6 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5018,6 +5018,14 @@
> > the ->nocb_bypass queue. The definition of "too
> > many" is supplied by this kernel boot parameter.
> >
> > + rcutree.nocb_patience_delay= [KNL]
> > + On callback-offloaded (rcu_nocbs) CPUs, avoid
> > + disturbing RCU unless the grace period has
> > + reached the specified age in milliseconds.
> > + Defaults to zero. Large values will be capped
> > + at five seconds. All values will be rounded down
> > + to the nearest value representable by jiffies.
> > +
> > rcutree.qhimark= [KNL]
> > Set threshold of queued RCU callbacks beyond which
> > batch limiting is disabled.
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 35bf4a3736765..408b020c9501f 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -176,6 +176,9 @@ static int gp_init_delay;
> > module_param(gp_init_delay, int, 0444);
> > static int gp_cleanup_delay;
> > module_param(gp_cleanup_delay, int, 0444);
> > +static int nocb_patience_delay;
> > +module_param(nocb_patience_delay, int, 0444);
> > +static int nocb_patience_delay_jiffies;
> >
> > // Add delay to rcu_read_unlock() for strict grace periods.
> > static int rcu_unlock_delay;
> > @@ -4344,11 +4347,14 @@ static int rcu_pending(int user)
> > return 1;
> >
> > /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> > - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > + gp_in_progress = rcu_gp_in_progress();
> > + if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > + (gp_in_progress &&
> > + time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay_jiffies))) &&
> > + rcu_nohz_full_cpu())
> > return 0;
> >
> > /* Is the RCU core waiting for a quiescent state from this CPU? */
> > - gp_in_progress = rcu_gp_in_progress();
> > if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> > return 1;
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 340bbefe5f652..31c539f09c150 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -93,6 +93,16 @@ static void __init rcu_bootup_announce_oddness(void)
> > pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > if (gp_cleanup_delay)
> > pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > + if (nocb_patience_delay < 0) {
> > + pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
> > + nocb_patience_delay = 0;
> > + } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > + pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
> > + nocb_patience_delay = 5 * MSEC_PER_SEC;
> > + } else if (nocb_patience_delay) {
> > + pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
> > + }
> > + nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> > if (!use_softirq)
> > pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > --
> > 2.40.1
> >
>
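
To restate the rcu_pending() change discussed in this patch, here is a
short sketch with the new grace-period-age test pulled out into a helper.
Note that rcu_gp_is_young() is not a function in the patch or in mainline;
it is named here purely for illustration, and everything else (jiffies,
rcu_state.gp_start, nocb_patience_delay_jiffies, and so on) comes straight
from the hunk quoted above:

        /* Hypothetical helper: is the current grace period still "young"? */
        static bool rcu_gp_is_young(void)
        {
                return rcu_gp_in_progress() &&
                       time_before(jiffies, READ_ONCE(rcu_state.gp_start) +
                                            nocb_patience_delay_jiffies);
        }

        /* The nohz_full early exit in rcu_pending(), rewritten with the helper. */
        if ((user || rcu_is_cpu_rrupt_from_idle() || rcu_gp_is_young()) &&
            rcu_nohz_full_cpu())
                return 0;

While the grace period is younger than the configured patience, a
nohz_full CPU is treated much like one that is idle or in userspace, so
RCU refrains from poking it and, in particular, from waking its rcuc
kthread.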

2024-06-10 15:14:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

On Fri, Jun 07, 2024 at 07:21:55AM +0530, Neeraj upadhyay wrote:
> On Thu, Jun 6, 2024 at 11:42 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> > > On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <[email protected]> wrote:
> > > >
> > > > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > > > From: "Joel Fernandes (Google)" <[email protected]>
> > > > >
> > > > > In the synchronize_rcu() common case, we will have less than
> > > > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > > > is pointless just to free the last injected wait head since at that point,
> > > > > all the users have already been awakened.
> > > > >
> > > > > Introduce a new counter to track this and prevent the wakeup in the
> > > > > common case.
> > > > >
> > > > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > > > Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
> > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > ---
> > > > > kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > > > kernel/rcu/tree.h | 1 +
> > > > > 2 files changed, 31 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > > > .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > > > .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > > > rcu_sr_normal_gp_cleanup_work),
> > > > > + .srs_cleanups_pending = ATOMIC_INIT(0),
> > > > > };
> > > > >
> > > > > /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > > * the done tail list manipulations are protected here.
> > > > > */
> > > > > done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > > - if (!done)
> > > > > + if (!done) {
> > > > > + /* See comments below. */
> > > > > + atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> > > >
> > > > This condition is not supposed to happen. If the work is scheduled,
> > > > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > > > may make things worse.
> > > >
> > >
> > > I also don't see a scenario where this can happen. However, if we are
> > > returning from here, given that for every queued work we do an
> > > increment of rcu_state.srs_cleanups_pending, I think it's safer to
> > > decrement in this
> > > case, as that counter tracks only the work queuing and execution counts.
> > >
> > > atomic_inc(&rcu_state.srs_cleanups_pending);
> > > if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> > > atomic_dec(&rcu_state.srs_cleanups_pending);
> >
> > Linus Torvalds's general rule is that if you cannot imagine how a bug
> > can happen, don't attempt to clean up after it. His rationale (which
> > is *almost* always a good one) is that not knowing how the bug happens
> > means that attempts to clean up will usually just make matters worse.
> > And all too often, the clean-up code makes debugging more difficult.
> >
>
> Ok. Thanks for sharing this info!
>
> > One example exception to this rule is when debug-objects detects a
> > duplicate call_rcu(). In that case, we ignore that second call_rcu().
> > But the reason is that experience has shown that the usual cause really
> > is someone doing a duplicate call_rcu(), and also that ignoring the
> > second call_rcu() makes debugging easier.
> >
> > So what is it that Frederic and I are missing here?
>
> Maybe nothing. As kworker context does not modify srs_done_tail and
> invalid values
> of srs_done_tail can only be caused by the GP kthread manipulations
> of srs_done_tail, my thought here was that we can keep the pending
> rcu_sr_normal_gp_cleanup_work count consistent with the number of
> queue_work() and kworker executions, even when we see unexpected
> srs_done_tail values like these. However, as you described the general rule
> is to not attempt any clean up for such scenarios.

So "if (WARN_ON_ONCE(!done) return;"?

Or is there a better way?

Thanx, Paul
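
For readers skimming this subthread, the enqueue-side pairing that Neeraj
sketches above would look roughly as follows in context (schematic only;
the wrapper name is invented for illustration, and sync_wq plus the
rcu_state fields are as in the quoted patch). The point of the conditional
decrement is that queue_work() returns false when the work item was
already pending, in which case no additional execution of the cleanup
worker is owed:

        /* Hypothetical wrapper around the enqueue-side pairing. */
        static void rcu_sr_schedule_cleanup_work(void)
        {
                atomic_inc(&rcu_state.srs_cleanups_pending);
                if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
                        atomic_dec(&rcu_state.srs_cleanups_pending);
        }

The question Paul and Frederic raise is only about the other end: whether
the cleanup worker should also decrement when it sees a NULL
srs_done_tail, which is not supposed to happen in the first place.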

2024-06-11 10:12:53

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [PATCH rcu 2/9] rcu: Reduce synchronize_rcu() delays when all wait heads are in use

On Thu, Jun 06, 2024 at 09:49:59AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 06, 2024 at 09:16:08AM +0530, Neeraj Upadhyay wrote:
> >
> >
> > On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> > > On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> > >> Le Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney a écrit :
> > >>> From: Neeraj Upadhyay <[email protected]>
> > >>>
> > >>> When all wait heads are in use, which can happen when
> > >>> rcu_sr_normal_gp_cleanup_work()'s callback processing
> > >>> is slow, any new synchronize_rcu() user's rcu_synchronize
> > >>> node's processing is deferred to future GP periods. This
> > >>> can result in long list of synchronize_rcu() invocations
> > >>> waiting for full grace period processing, which can delay
> > >>> freeing of memory. Mitigate this problem by using first
> > >>> node in the list as wait tail when all wait heads are in use.
> > >>> While methods to speed up callback processing would be needed
> > >>> to recover from this situation, allowing new nodes to complete
> > >>> their grace period can help prevent delays due to a fixed
> > >>> number of wait head nodes.
> > >>>
> > >>> Signed-off-by: Neeraj Upadhyay <[email protected]>
> > >>> Signed-off-by: Paul E. McKenney <[email protected]>
> > >>
> > >> IIRC we agreed that this patch could be a step too far that
> > >> made an already not so simple state machine even less simple,
> > >> breaking the wait_head based flow.
> > >
> > > True, which is why we agreed not to submit it into the v6.10 merge window.
> > >
> > > And I don't recall us saying what merge window to send it to.
> > >
> > >> Should we postpone this change until it is observed that a workqueue
> > >> not being scheduled for 5 grace periods is a real issue?
> > >
> > > Neeraj, thoughts? Or, better yet, test results? ;-)
> >
> > Yes I agree that we postpone this change until we see it as a real
> > problem. I had run a test to invoke synchronize_rcu() from all CPUs
> > on a 96 core system in parallel. I didn't specifically check if this
> > scenario was hit. Will run RCU torture test with this change.
>
> Very well, I will drop this patch with the expectation that you will
> re-post it if a problem does arise.
>
Thank you! We discussed it before and came to the conclusion that it adds
extra complexity. Once we hit an issue with delays, we can introduce it
and describe a workload that triggers it.

--
Uladzislau Rezki

2024-06-11 13:48:03

by Neeraj upadhyay

[permalink] [raw]
Subject: Re: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

On Mon, Jun 10, 2024 at 8:42 PM Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Jun 07, 2024 at 07:21:55AM +0530, Neeraj upadhyay wrote:
> > On Thu, Jun 6, 2024 at 11:42 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> > > > On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <[email protected]> wrote:
> > > > >
> > > > > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > > > > From: "Joel Fernandes (Google)" <[email protected]>
> > > > > >
> > > > > > In the synchronize_rcu() common case, we will have less than
> > > > > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > > > > is pointless just to free the last injected wait head since at that point,
> > > > > > all the users have already been awakened.
> > > > > >
> > > > > > Introduce a new counter to track this and prevent the wakeup in the
> > > > > > common case.
> > > > > >
> > > > > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > > > > Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
> > > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > > ---
> > > > > > kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > > > > kernel/rcu/tree.h | 1 +
> > > > > > 2 files changed, 31 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > > > > .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > > > > .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > > > > rcu_sr_normal_gp_cleanup_work),
> > > > > > + .srs_cleanups_pending = ATOMIC_INIT(0),
> > > > > > };
> > > > > >
> > > > > > /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > > > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > > > * the done tail list manipulations are protected here.
> > > > > > */
> > > > > > done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > > > - if (!done)
> > > > > > + if (!done) {
> > > > > > + /* See comments below. */
> > > > > > + atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> > > > >
> > > > > This condition is not supposed to happen. If the work is scheduled,
> > > > > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > > > > may make things worse.
> > > > >
> > > >
> > > > I also don't see a scenario where this can happen. However, if we are
> > > > returning from here, given that for every queued work we do an
> > > > increment of rcu_state.srs_cleanups_pending, I think it's safer to
> > > > decrement in this
> > > > case, as that counter tracks only the work queuing and execution counts.
> > > >
> > > > atomic_inc(&rcu_state.srs_cleanups_pending);
> > > > if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> > > > atomic_dec(&rcu_state.srs_cleanups_pending);
> > >
> > > Linus Torvalds's general rule is that if you cannot imagine how a bug
> > > can happen, don't attempt to clean up after it. His rationale (which
> > > is *almost* always a good one) is that not knowing how the bug happens
> > > means that attempts to clean up will usually just make matters worse.
> > > And all too often, the clean-up code makes debugging more difficult.
> > >
> >
> > Ok. Thanks for sharing this info!
> >
> > > One example exception to this rule is when debug-objects detects a
> > > duplicate call_rcu(). In that case, we ignore that second call_rcu().
> > > But the reason is that experience has shown that the usual cause really
> > > is someone doing a duplicate call_rcu(), and also that ignoring the
> > > second call_rcu() makes debugging easier.
> > >
> > > So what is it that Frederic and I are missing here?
> >
> > Maybe nothing. As kworker context does not modify srs_done_tail and
> > invalid values
> > of srs_done_tail can only be caused by the GP kthread manipulations
> > of srs_done_tail, my thought here was that we can keep the pending
> > rcu_sr_normal_gp_cleanup_work count consistent with the number of
> > queue_work() and kworker executions, even when we see unexpected
> > srs_done_tail values like these. However, as you described the general rule
> > is to not attempt any clean up for such scenarios.
>
> So "if (WARN_ON_ONCE(!done) return;"?
>

Looks good. One nit: it needs one more closing parenthesis: "if (WARN_ON_ONCE(!done)) return;"

> Or is there a better way?
>

I think the above check suffices.


Thanks
Neeraj

> Thanx, Paul

2024-06-11 16:18:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH rcu 3/9] rcu/tree: Reduce wake up for synchronize_rcu() common case

On Tue, Jun 11, 2024 at 07:16:37PM +0530, Neeraj upadhyay wrote:
> On Mon, Jun 10, 2024 at 8:42 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Fri, Jun 07, 2024 at 07:21:55AM +0530, Neeraj upadhyay wrote:
> > > On Thu, Jun 6, 2024 at 11:42 PM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Thu, Jun 06, 2024 at 11:28:07AM +0530, Neeraj upadhyay wrote:
> > > > > On Wed, Jun 5, 2024 at 10:05 PM Frederic Weisbecker <[email protected]> wrote:
> > > > > >
> > > > > > Le Tue, Jun 04, 2024 at 03:23:49PM -0700, Paul E. McKenney a écrit :
> > > > > > > From: "Joel Fernandes (Google)" <[email protected]>
> > > > > > >
> > > > > > > In the synchronize_rcu() common case, we will have less than
> > > > > > > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > > > > > > is pointless just to free the last injected wait head since at that point,
> > > > > > > all the users have already been awakened.
> > > > > > >
> > > > > > > Introduce a new counter to track this and prevent the wakeup in the
> > > > > > > common case.
> > > > > > >
> > > > > > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > > > > > Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
> > > > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > > > ---
> > > > > > > kernel/rcu/tree.c | 35 ++++++++++++++++++++++++++++++-----
> > > > > > > kernel/rcu/tree.h | 1 +
> > > > > > > 2 files changed, 31 insertions(+), 5 deletions(-)
> > > > > > >
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 6ba36d9c09bde..2fe08e6186b4d 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
> > > > > > > .ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> > > > > > > .srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> > > > > > > rcu_sr_normal_gp_cleanup_work),
> > > > > > > + .srs_cleanups_pending = ATOMIC_INIT(0),
> > > > > > > };
> > > > > > >
> > > > > > > /* Dump rcu_node combining tree at boot to verify correct setup. */
> > > > > > > @@ -1633,8 +1634,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > > > > * the done tail list manipulations are protected here.
> > > > > > > */
> > > > > > > done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > > > > - if (!done)
> > > > > > > + if (!done) {
> > > > > > > + /* See comments below. */
> > > > > > > + atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
> > > > > >
> > > > > > This condition is not supposed to happen. If the work is scheduled,
> > > > > > there has to be a wait_queue in rcu_state.srs_done_tail. And decrementing
> > > > > > may make things worse.
> > > > > >
> > > > >
> > > > > I also don't see a scenario where this can happen. However, if we are
> > > > > returning from here, given that for every queued work we do an
> > > > > increment of rcu_state.srs_cleanups_pending, I think it's safer to
> > > > > decrement in this
> > > > > case, as that counter tracks only the work queuing and execution counts.
> > > > >
> > > > > atomic_inc(&rcu_state.srs_cleanups_pending);
> > > > > if (!queue_work(sync_wq, &rcu_state.srs_cleanup_work))
> > > > > atomic_dec(&rcu_state.srs_cleanups_pending);
> > > >
> > > > Linus Torvalds's general rule is that if you cannot imagine how a bug
> > > > can happen, don't attempt to clean up after it. His rationale (which
> > > > is *almost* always a good one) is that not knowing how the bug happens
> > > > means that attempts to clean up will usually just make matters worse.
> > > > And all too often, the clean-up code makes debugging more difficult.
> > > >
> > >
> > > Ok. Thanks for sharing this info!
> > >
> > > > One example exception to this rule is when debug-objects detects a
> > > > duplicate call_rcu(). In that case, we ignore that second call_rcu().
> > > > But the reason is that experience has shown that the usual cause really
> > > > is someone doing a duplicate call_rcu(), and also that ignoring the
> > > > second call_rcu() makes debugging easier.
> > > >
> > > > So what is it that Frederic and I are missing here?
> > >
> > > Maybe nothing. As kworker context does not modify srs_done_tail and
> > > invalid values
> > > of srs_done_tail can only be caused by the GP kthread manipulations
> > > of srs_done_tail, my thought here was that we can keep the pending
> > > rcu_sr_normal_gp_cleanup_work count consistent with the number of
> > > queue_work() and kworker executions, even when we see unexpected
> > > srs_done_tail values like these. However, as you described the general rule
> > > is to not attempt any clean up for such scenarios.
> >
> > So "if (WARN_ON_ONCE(!done) return;"?
> >
>
> Looks good, nit: one more closing parenthesis "if (WARN_ON_ONCE(!done)) return;"
>
> > Or is there a better way?
> >
>
> I think, above check suffices .

Very well, I will make this change on the next rebase.

Thanx, Paul
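
Putting the agreed-upon change in context, the top of
rcu_sr_normal_gp_cleanup_work() would then look something like the sketch
below (kernel-style, not the final committed form; only the
WARN_ON_ONCE() early exit is new relative to the hunk quoted earlier in
the thread):

        static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
        {
                struct llist_node *done;

                /*
                 * The acquire load pairs with the grace-period kthread's
                 * updates of srs_done_tail, as in the original patch.
                 */
                done = smp_load_acquire(&rcu_state.srs_done_tail);

                /* Not supposed to happen: complain once and bail out. */
                if (WARN_ON_ONCE(!done))
                        return;

                /* ... process the done list, then drop srs_cleanups_pending ... */
        }

This follows the rule discussed above: warn about the "impossible" state
rather than attempting to clean up after it.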