Hello!
This series provides miscellaneous fixes:
1. Fix to include first blocked task in stall warning, courtesy of
Yanfei Xu.
2. Fix stall-warning deadlock due to non-release of rcu_node ->lock,
courtesy of Yanfei Xu.
3. Remove special bit at the bottom of the ->dynticks counter,
courtesy of "Joel Fernandes (Google)".
4. Weaken ->dynticks accesses and updates.
5. Mark accesses to ->rcu_read_lock_nesting.
6. Unify documentation about missing list_empty_rcu(), courtesy of
Julian Wiedmann.
7. Handle VM stoppage in stall detection, courtesy of Sergey
Senozhatsky.
8. Do not disable GP stall detection in rcu_cpu_stall_reset(),
courtesy of Sergey Senozhatsky.
9. Start timing stall repetitions after the warning completes.
10. Mark read-side data races.
11. Mark lockless ->qsmask read in rcu_check_boost_fail().
12. Make rcu_gp_init() and rcu_gp_fqs_loop() noinline to conserve stack.
13. Remove trailing spaces and tabs, courtesy of Zhen Lei.
14. Mark accesses in tree_stall.h.
15. Remove useless "ret" update in rcu_gp_fqs_loop(), courtesy of
Liu Song.
16. Use per_cpu_ptr to get the pointer of per_cpu variable, courtesy
of Liu Song.
17. Explain why rcu_all_qs() is a stub in preemptible TREE RCU,
courtesy of Frederic Weisbecker.
18. Print human-readable message for schedule() in RCU reader.
Thanx, Paul
------------------------------------------------------------------------
b/include/linux/rculist.h | 35 +++++++-------
b/include/linux/rcupdate.h | 2
b/include/linux/rcutiny.h | 3 -
b/include/linux/srcutiny.h | 8 +--
b/kernel/rcu/srcutiny.c | 2
b/kernel/rcu/tasks.h | 2
b/kernel/rcu/tree.c | 77 +++++---------------------------
b/kernel/rcu/tree.h | 2
b/kernel/rcu/tree_plugin.h | 9 ++-
b/kernel/rcu/tree_stall.h | 4 -
b/kernel/sched/core.c | 11 ++++
kernel/rcu/tree.c | 62 ++++++++++++++++----------
kernel/rcu/tree_plugin.h | 2
kernel/rcu/tree_stall.h | 107 ++++++++++++++++++++++++++++-----------------
14 files changed, 165 insertions(+), 161 deletions(-)
From: Yanfei Xu <[email protected]>
The for loop in rcu_print_task_stall() always omits ts[0], which points
to the first task blocking the stalled grace period. This in turn fails
to count this first task, which means that ndetected will be equal to
zero when all CPUs have passed through their quiescent states and only
one task is blocking the stalled grace period. This zero value for
ndetected will in turn result in an incorrect "All QSes seen" message:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 12-23):
(detected by 15, t=6504 jiffies, g=164777, q=9011209)
rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2
BUG: sleeping function called from invalid context at include/linux/uaccess.h:156
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04
INFO: lockdep is turned off.
Preemption disabled at:
[<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0
CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted
5.12.2-yoctodev-standard #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x2cc
show_stack+0x24/0x30
dump_stack+0x110/0x188
___might_sleep+0x214/0x2d0
__might_sleep+0x7c/0xe0
This commit therefore fixes the loop to include ts[0].
Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Signed-off-by: Yanfei Xu <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_stall.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 6c76988cc019f..2e96f9741666d 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -280,8 +280,8 @@ static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags)
break;
}
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
- for (i--; i; i--) {
- t = ts[i];
+ while (i) {
+ t = ts[--i];
if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr))
pr_cont(" P%d", t->pid);
else
--
2.31.1.189.g2e36527f23
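As an aside, the off-by-one is easy to see in a minimal stand-alone sketch (ordinary user-space C, not kernel code; the array values are placeholders). With three recorded tasks, the old loop visits indices 2 and 1 only, while the new loop also visits index 0:

#include <stdio.h>

int main(void)
{
	int ts[3] = { 100, 101, 102 };	/* stand-ins for recorded task pointers */
	int i;

	i = 3;				/* three tasks were recorded */
	for (i--; i; i--)		/* old form: never reaches ts[0] */
		printf("old: ts[%d] = %d\n", i, ts[i]);

	i = 3;
	while (i) {			/* new form: also reaches ts[0] */
		i--;
		printf("new: ts[%d] = %d\n", i, ts[i]);
	}
	return 0;
}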
Accesses to the rcu_data structure's ->dynticks field have always been
fully ordered because it was not possible to prove that weaker ordering
was safe. However, with the removal of the rcu_eqs_special_set() function
and the advent of the Linux-kernel memory model, it is now easy to show
that two of the four original full memory barriers can be weakened to
acquire and release operations. The remaining pair must remain full
memory barriers. This change makes the memory ordering requirements
more evident, and it might well also speed up the to-idle and from-idle
fastpaths on some architectures.
The following litmus test, adapted from one supplied off-list by Frederic
Weisbecker, models the RCU grace-period kthread detecting an idle CPU
that is concurrently transitioning to non-idle:
C dynticks-from-idle
{
DYNTICKS=0; (* Initially idle. *)
}
P0(int *X, int *DYNTICKS)
{
int dynticks;
int x;
// Idle.
dynticks = READ_ONCE(*DYNTICKS);
smp_store_release(DYNTICKS, dynticks + 1);
smp_mb();
// Now non-idle
x = READ_ONCE(*X);
}
P1(int *X, int *DYNTICKS)
{
int dynticks;
WRITE_ONCE(*X, 1);
smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
}
exists (1:dynticks=0 /\ 0:x=1)
Running "herd7 -conf linux-kernel.cfg dynticks-from-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread (P1)
sees another CPU as idle (P0), then any memory access prior to the start
of the grace period (P1's write to X) will be seen by any RCU read-side
critical section following the to-non-idle transition (P0's read from X).
This is a straightforward use of full memory barriers to force ordering
in a store-buffering (SB) litmus test.
The following litmus test, also adapted from the one supplied off-list
by Frederic Weisbecker, models the RCU grace-period kthread detecting
a non-idle CPU that is concurrently transitioning to idle:
C dynticks-into-idle
{
DYNTICKS=1; (* Initially non-idle. *)
}
P0(int *X, int *DYNTICKS)
{
int dynticks;
// Non-idle.
WRITE_ONCE(*X, 1);
dynticks = READ_ONCE(*DYNTICKS);
smp_store_release(DYNTICKS, dynticks + 1);
smp_mb();
// Now idle.
}
P1(int *X, int *DYNTICKS)
{
int x;
int dynticks;
smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
x = READ_ONCE(*X);
}
exists (1:dynticks=2 /\ 1:x=0)
Running "herd7 -conf linux-kernel.cfg dynticks-into-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread
(P1) sees another CPU as newly idle (P0), then any pre-idle memory access
(P0's write to X) will be seen by any code following the grace period
(P1's read from X). This is a simple release-acquire pair forcing
ordering in a message-passing (MP) litmus test.
Of course, if the grace-period kthread detects the CPU as non-idle,
it will refrain from reporting a quiescent state on behalf of that CPU,
so there are no ordering requirements from the grace-period kthread in
that case. However, other subsystems call rcu_is_idle_cpu() to check
for CPUs being non-idle from an RCU perspective. That case is also
verified by the above litmus tests with the proviso that the sense of
the low-order bit of the DYNTICKS counter be inverted.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 50 +++++++++++++++++++++++++++++++----------------
kernel/rcu/tree.h | 2 +-
2 files changed, 34 insertions(+), 18 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 42a0032dd99f7..bc6ccf0ba3b24 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -77,7 +77,7 @@
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
.dynticks_nesting = 1,
.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
- .dynticks = ATOMIC_INIT(1),
+ .dynticks = 1UL,
#ifdef CONFIG_RCU_NOCB_CPU
.cblist.flags = SEGCBLIST_SOFTIRQ_ONLY,
#endif
@@ -251,6 +251,21 @@ void rcu_softirq_qs(void)
rcu_tasks_qs(current, false);
}
+/*
+ * Increment the current CPU's rcu_data structure's ->dynticks field
+ * with ordering. Return the new value.
+ */
+static noinstr unsigned long rcu_dynticks_inc(int incby)
+{
+ struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+ int seq;
+
+ seq = READ_ONCE(rdp->dynticks) + incby;
+ smp_store_release(&rdp->dynticks, seq);
+ smp_mb(); // Fundamental RCU ordering guarantee.
+ return seq;
+}
+
/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state, that is,
@@ -267,7 +282,7 @@ static noinstr void rcu_dynticks_eqs_enter(void)
* next idle sojourn.
*/
rcu_dynticks_task_trace_enter(); // Before ->dynticks update!
- seq = arch_atomic_inc_return(&this_cpu_ptr(&rcu_data)->dynticks);
+ seq = rcu_dynticks_inc(1);
// RCU is no longer watching. Better be in extended quiescent state!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1));
}
@@ -286,7 +301,7 @@ static noinstr void rcu_dynticks_eqs_exit(void)
* and we also must force ordering with the next RCU read-side
* critical section.
*/
- seq = arch_atomic_inc_return(&this_cpu_ptr(&rcu_data)->dynticks);
+ seq = rcu_dynticks_inc(1);
// RCU is now watching. Better not be in an extended quiescent state!
rcu_dynticks_task_trace_exit(); // After ->dynticks update!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1));
@@ -306,9 +321,9 @@ static void rcu_dynticks_eqs_online(void)
{
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
- if (atomic_read(&rdp->dynticks) & 0x1)
+ if (READ_ONCE(rdp->dynticks) & 0x1)
return;
- atomic_inc(&rdp->dynticks);
+ rcu_dynticks_inc(1);
}
/*
@@ -318,7 +333,7 @@ static void rcu_dynticks_eqs_online(void)
*/
static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
{
- return !(arch_atomic_read(&this_cpu_ptr(&rcu_data)->dynticks) & 0x1);
+ return !(READ_ONCE(this_cpu_ptr(&rcu_data)->dynticks) & 0x1);
}
/*
@@ -327,7 +342,8 @@ static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
*/
static int rcu_dynticks_snap(struct rcu_data *rdp)
{
- return atomic_add_return(0, &rdp->dynticks);
+ smp_mb(); // Fundamental RCU ordering guarantee.
+ return smp_load_acquire(&rdp->dynticks);
}
/*
@@ -367,7 +383,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
int snap;
// If not quiescent, force back to earlier extended quiescent state.
- snap = atomic_read(&rdp->dynticks) & ~0x1;
+ snap = READ_ONCE(rdp->dynticks) & ~0x1;
smp_rmb(); // Order ->dynticks and *vp reads.
if (READ_ONCE(*vp))
@@ -375,7 +391,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
smp_rmb(); // Order *vp read and ->dynticks re-read.
// If still in the same extended quiescent state, we are good!
- return snap == atomic_read(&rdp->dynticks);
+ return snap == READ_ONCE(rdp->dynticks);
}
/*
@@ -391,12 +407,12 @@ bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
*/
notrace void rcu_momentary_dyntick_idle(void)
{
- int special;
+ int seq;
raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
- special = atomic_add_return(2, &this_cpu_ptr(&rcu_data)->dynticks);
+ seq = rcu_dynticks_inc(2);
/* It is illegal to call this from idle state. */
- WARN_ON_ONCE(!(special & 0x1));
+ WARN_ON_ONCE(!(seq & 0x1));
rcu_preempt_deferred_qs(current);
}
EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
@@ -612,7 +628,7 @@ static noinstr void rcu_eqs_enter(bool user)
lockdep_assert_irqs_disabled();
instrumentation_begin();
- trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
+ trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, READ_ONCE(rdp->dynticks));
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
rcu_prepare_for_idle();
rcu_preempt_deferred_qs(current);
@@ -747,7 +763,7 @@ noinstr void rcu_nmi_exit(void)
*/
if (rdp->dynticks_nmi_nesting != 1) {
trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
- atomic_read(&rdp->dynticks));
+ READ_ONCE(rdp->dynticks));
WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
rdp->dynticks_nmi_nesting - 2);
instrumentation_end();
@@ -755,7 +771,7 @@ noinstr void rcu_nmi_exit(void)
}
/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
- trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
+ trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, READ_ONCE(rdp->dynticks));
WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
if (!in_nmi())
@@ -863,7 +879,7 @@ static void noinstr rcu_eqs_exit(bool user)
instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
rcu_cleanup_after_idle();
- trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
+ trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, READ_ONCE(rdp->dynticks));
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
WRITE_ONCE(rdp->dynticks_nesting, 1);
WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
@@ -1026,7 +1042,7 @@ noinstr void rcu_nmi_enter(void)
trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
rdp->dynticks_nmi_nesting,
- rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
+ rdp->dynticks_nmi_nesting + incby, READ_ONCE(rdp->dynticks));
instrumentation_end();
WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
rdp->dynticks_nmi_nesting + incby);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 305cf6aeb4086..ce611da2ff6b3 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -184,7 +184,7 @@ struct rcu_data {
int dynticks_snap; /* Per-GP tracking for dynticks. */
long dynticks_nesting; /* Track process nesting level. */
long dynticks_nmi_nesting; /* Track irq/NMI nesting level. */
- atomic_t dynticks; /* Even value for idle, else odd. */
+ unsigned long dynticks; /* Even value for idle, else odd. */
bool rcu_need_heavy_qs; /* GP old, so heavy quiescent state! */
bool rcu_urgent_qs; /* GP old need light quiescent state. */
bool rcu_forced_tick; /* Forced tick to provide QS. */
--
2.31.1.189.g2e36527f23
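For readers who prefer to reason about the ordering in plain C11, here is a hedged user-space sketch (not the kernel implementation; names are illustrative) contrasting the weakened update sequence introduced by this patch with the original fully ordered increment. The non-read-modify-write form is valid only because, as in the kernel, a single thread (the owning CPU) ever updates the counter:

#include <stdatomic.h>

static _Atomic unsigned long dynticks;

/* Weakened form: relaxed read, release store, then a full fence. */
static unsigned long dynticks_inc_weak(int incby)
{
	unsigned long seq;

	seq = atomic_load_explicit(&dynticks, memory_order_relaxed) + incby;
	atomic_store_explicit(&dynticks, seq, memory_order_release);
	atomic_thread_fence(memory_order_seq_cst);	/* fundamental RCU ordering */
	return seq;
}

/* Original form: one fully ordered read-modify-write. */
static unsigned long dynticks_inc_strong(int incby)
{
	return atomic_fetch_add(&dynticks, incby) + incby;
}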
From: Liu Song <[email protected]>
There are a few remaining locations in kernel/rcu that still use
"&per_cpu()". This commit replaces them with "per_cpu_ptr(&)", and does
not introduce any functional change.
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Neeraj Upadhyay <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Liu Song <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tasks.h | 2 +-
kernel/rcu/tree.c | 2 +-
kernel/rcu/tree_stall.h | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 8536c55df5142..21f00194e69d7 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -920,7 +920,7 @@ static void trc_read_check_handler(void *t_in)
// Allow future IPIs to be sent on CPU and for task.
// Also order this IPI handler against any later manipulations of
// the intended task.
- smp_store_release(&per_cpu(trc_ipi_to_cpu, smp_processor_id()), false); // ^^^
+ smp_store_release(per_cpu_ptr(&trc_ipi_to_cpu, smp_processor_id()), false); // ^^^
smp_store_release(&texp->trc_ipi_to_cpu, -1); // ^^^
}
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7b65ac3f49e5a..888efad0361d3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1292,7 +1292,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
*/
jtsq = READ_ONCE(jiffies_to_sched_qs);
ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
- rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
+ rnhqp = per_cpu_ptr(&rcu_data.rcu_need_heavy_qs, rdp->cpu);
if (!READ_ONCE(*rnhqp) &&
(time_after(jiffies, rcu_state.gp_start + jtsq * 2) ||
time_after(jiffies, rcu_state.jiffies_resched) ||
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index a8d0fcf0826f4..677ee3d8671bf 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -351,7 +351,7 @@ static void rcu_dump_cpu_stacks(void)
static void print_cpu_stall_fast_no_hz(char *cp, int cpu)
{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+ struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
sprintf(cp, "last_accelerate: %04lx/%04lx dyntick_enabled: %d",
rdp->last_accelerate & 0xffff, jiffies & 0xffff,
--
2.31.1.189.g2e36527f23
The kbuild test project found an oversized stack frame in rcu_gp_kthread()
for some kernel configurations. This oversizing was due to a very large
amount of inlining, which is unnecessary because this code executes
infrequently. This commit therefore marks rcu_gp_init() and
rcu_gp_fqs_loop() noinline_for_stack to conserve stack space.
Reported-by: kernel test robot <[email protected]>
Tested-by: Rong Chen <[email protected]>
[ paulmck: noinline_for_stack per Nathan Chancellor. ]
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bc6ccf0ba3b24..e472c78036011 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1739,7 +1739,7 @@ static void rcu_strict_gp_boundary(void *unused)
/*
* Initialize a new grace period. Return false if no grace period required.
*/
-static bool rcu_gp_init(void)
+static noinline_for_stack bool rcu_gp_init(void)
{
unsigned long firstseq;
unsigned long flags;
@@ -1933,7 +1933,7 @@ static void rcu_gp_fqs(bool first_time)
/*
* Loop doing repeated quiescent-state forcing until the grace period ends.
*/
-static void rcu_gp_fqs_loop(void)
+static noinline_for_stack void rcu_gp_fqs_loop(void)
{
bool first_gp_fqs;
int gf = 0;
--
2.31.1.189.g2e36527f23
Systems with low-bandwidth consoles can have very large printk()
latencies, and on such systems it makes no sense for the next RCU CPU
stall warning message to start printing before the prior message has
completed. This commit therefore sets the time of the next stall warning
only after the current one has finished printing. While printing is in
progress, the time of the next stall message is set to ULONG_MAX/2
jiffies in the future.
Reviewed-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_stall.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index e199022dce9dc..42847caa3909b 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -647,6 +647,7 @@ static void print_cpu_stall(unsigned long gps)
static void check_cpu_stall(struct rcu_data *rdp)
{
+ bool didstall = false;
unsigned long gs1;
unsigned long gs2;
unsigned long gps;
@@ -692,7 +693,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
ULONG_CMP_GE(gps, js))
return; /* No stall or GP completed since entering function. */
rnp = rdp->mynode;
- jn = jiffies + 3 * rcu_jiffies_till_stall_check() + 3;
+ jn = jiffies + ULONG_MAX / 2;
if (rcu_gp_in_progress() &&
(READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
@@ -709,6 +710,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
print_cpu_stall(gps);
if (READ_ONCE(rcu_cpu_stall_ftrace_dump))
rcu_ftrace_dump(DUMP_ALL);
+ didstall = true;
} else if (rcu_gp_in_progress() &&
ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
@@ -726,6 +728,11 @@ static void check_cpu_stall(struct rcu_data *rdp)
print_other_cpu_stall(gs2, gps);
if (READ_ONCE(rcu_cpu_stall_ftrace_dump))
rcu_ftrace_dump(DUMP_ALL);
+ didstall = true;
+ }
+ if (didstall && READ_ONCE(rcu_state.jiffies_stall) == jn) {
+ jn = jiffies + 3 * rcu_jiffies_till_stall_check() + 3;
+ WRITE_ONCE(rcu_state.jiffies_stall, jn);
}
}
--
2.31.1.189.g2e36527f23
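The rearm-after-printing logic can be summarized by this small user-space sketch (illustrative only, not kernel code): the next-warning deadline is parked roughly ULONG_MAX/2 ticks in the future while printing is in progress and is rearmed relative to the post-printing time, so a slow console cannot cause the next warning to begin before the current one finishes:

#include <limits.h>

static unsigned long now;		/* stand-in for jiffies */
static unsigned long next_warning;	/* stand-in for rcu_state.jiffies_stall */

static void slow_print_warning(void)
{
	now += 10000;	/* printing to a low-bandwidth console takes a long time */
}

static void check_stall(unsigned long stall_timeout)
{
	if (now < next_warning)		/* the kernel uses wraparound-safe ULONG_CMP */
		return;
	next_warning = now + ULONG_MAX / 2;	/* park the deadline while printing */
	slow_print_warning();
	/* Rearm relative to the post-printing time, not the pre-printing time. */
	next_warning = now + stall_timeout;
}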
From: Frederic Weisbecker <[email protected]>
The cond_resched() function reports an RCU quiescent state only in
non-preemptible TREE RCU implementation. This commit therefore adds a
comment explaining why cond_resched() does nothing in preemptible kernels.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/sched/core.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d9ff40f46619..6a03c3fac55cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7781,6 +7781,17 @@ int __sched __cond_resched(void)
preempt_schedule_common();
return 1;
}
+ /*
+ * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
+ * whether the current CPU is in an RCU read-side critical section,
+ * so the tick can report quiescent states even for CPUs looping
+ * in kernel context. In contrast, in non-preemptible kernels,
+ * RCU readers leave no in-memory hints, which means that CPU-bound
+ * processes executing in kernel context might never report an
+ * RCU quiescent state. Therefore, the following code causes
+ * cond_resched() to report a quiescent state, but only when RCU
+ * is in urgent need of one.
+ */
#ifndef CONFIG_PREEMPT_RCU
rcu_all_qs();
#endif
--
2.31.1.189.g2e36527f23
From: Yanfei Xu <[email protected]>
If rcu_print_task_stall() is invoked on an rcu_node structure that does
not contain any tasks blocking the current grace period, it takes an
early exit that fails to release that rcu_node structure's lock. This
results in a self-deadlock, which is detected by lockdep.
To reproduce this bug:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
This will also result in other complaints, including RCU's scheduler
hook complaining about blocking rather than preemption and an rcutorture
writer stall.
Only a partial RCU CPU stall warning message will be printed because of
the self-deadlock.
This commit therefore releases the lock on the rcu_print_task_stall()
function's early exit path.
Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Signed-off-by: Yanfei Xu <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_stall.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 2e96f9741666d..bd4de5bc5807e 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -267,8 +267,10 @@ static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags)
struct task_struct *ts[8];
lockdep_assert_irqs_disabled();
- if (!rcu_preempt_blocked_readers_cgp(rnp))
+ if (!rcu_preempt_blocked_readers_cgp(rnp)) {
+ raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return 0;
+ }
pr_err("\tTasks blocked on level-%d rcu_node (CPUs %d-%d):",
rnp->level, rnp->grplo, rnp->grphi);
t = list_entry(rnp->gp_tasks->prev,
--
2.31.1.189.g2e36527f23
KCSAN flags accesses to ->rcu_read_lock_nesting as data races, but
in the past, the overhead of marked accesses was excessive. However,
that was long ago, and much has changed since then, both in terms of
hardware and of compilers. Here is data taken on an eight-core laptop
using Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz with a kernel built
using gcc version 9.3.0, with all data in nanoseconds.
Unmarked accesses (status quo), measured by three refscale runs:
Minimum reader duration: 3.286 2.851 3.395
Median reader duration: 3.698 3.531 3.4695
Maximum reader duration: 4.481 5.215 5.157
Marked accesses, also measured by three refscale runs:
Minimum reader duration: 3.501 3.677 3.580
Median reader duration: 4.053 3.723 3.895
Maximum reader duration: 7.307 4.999 5.511
This focused microbenchmark shows only sub-nanosecond differences, which
are unlikely to be visible at the system level. This commit therefore
marks data-racing accesses to ->rcu_read_lock_nesting.
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 2 +-
kernel/rcu/tree_plugin.h | 9 ++++++---
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index d9680b798b211..cfeb43bfc719e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -53,7 +53,7 @@ void __rcu_read_unlock(void);
* nesting depth, but makes sense only if CONFIG_PREEMPT_RCU -- in other
* types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
*/
-#define rcu_preempt_depth() (current->rcu_read_lock_nesting)
+#define rcu_preempt_depth() READ_ONCE(current->rcu_read_lock_nesting)
#else /* #ifdef CONFIG_PREEMPT_RCU */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index de1dc3bb7f701..83a702a4e2963 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -405,17 +405,20 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
static void rcu_preempt_read_enter(void)
{
- current->rcu_read_lock_nesting++;
+ WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1);
}
static int rcu_preempt_read_exit(void)
{
- return --current->rcu_read_lock_nesting;
+ int ret = READ_ONCE(current->rcu_read_lock_nesting) - 1;
+
+ WRITE_ONCE(current->rcu_read_lock_nesting, ret);
+ return ret;
}
static void rcu_preempt_depth_set(int val)
{
- current->rcu_read_lock_nesting = val;
+ WRITE_ONCE(current->rcu_read_lock_nesting, val);
}
/*
--
2.31.1.189.g2e36527f23
From: Zhen Lei <[email protected]>
Run the following command to find and remove the trailing spaces and tabs:
find kernel/rcu/ -type f | xargs sed -r -i 's/[ \t]+$//'
Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e472c78036011..37bc3a702b6ea 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -259,7 +259,7 @@ static noinstr unsigned long rcu_dynticks_inc(int incby)
{
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
int seq;
-
+
seq = READ_ONCE(rdp->dynticks) + incby;
smp_store_release(&rdp->dynticks, seq);
smp_mb(); // Fundamental RCU ordering guarantee.
--
2.31.1.189.g2e36527f23
From: Liu Song <[email protected]>
Within rcu_gp_fqs_loop(), the "ret" local variable is set to the
return value from swait_event_idle_timeout_exclusive(), but "ret" is
unconditionally overwritten later in the code. This commit therefore
removes this useless assignment.
Signed-off-by: Liu Song <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 37bc3a702b6ea..7b65ac3f49e5a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1960,8 +1960,8 @@ static noinline_for_stack void rcu_gp_fqs_loop(void)
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq,
TPS("fqswait"));
WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_FQS);
- ret = swait_event_idle_timeout_exclusive(
- rcu_state.gp_wq, rcu_gp_fqs_check_wake(&gf), j);
+ (void)swait_event_idle_timeout_exclusive(rcu_state.gp_wq,
+ rcu_gp_fqs_check_wake(&gf), j);
rcu_gp_torture_wait();
WRITE_ONCE(rcu_state.gp_state, RCU_GP_DOING_FQS);
/* Locking provides needed memory barriers. */
--
2.31.1.189.g2e36527f23
From: Sergey Senozhatsky <[email protected]>
rcu_cpu_stall_reset() is one of the functions that virtual CPUs
execute during VM resume in order to handle the jiffies skew
that can otherwise trigger false-positive stall warnings. Paul has
pointed out that this approach is problematic because
rcu_cpu_stall_reset() disables RCU grace-period stall detection
virtually forever, when in fact it need only restart the
stall-detection timeout.
Suggested-by: "Paul E. McKenney" <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_stall.h | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 0e7a60706d1c0..e199022dce9dc 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -119,17 +119,14 @@ static void panic_on_rcu_stall(void)
}
/**
- * rcu_cpu_stall_reset - prevent further stall warnings in current grace period
- *
- * Set the stall-warning timeout way off into the future, thus preventing
- * any RCU CPU stall-warning messages from appearing in the current set of
- * RCU grace periods.
+ * rcu_cpu_stall_reset - restart stall-warning timeout for current grace period
*
* The caller must disable hard irqs.
*/
void rcu_cpu_stall_reset(void)
{
- WRITE_ONCE(rcu_state.jiffies_stall, jiffies + ULONG_MAX / 2);
+ WRITE_ONCE(rcu_state.jiffies_stall,
+ jiffies + rcu_jiffies_till_stall_check());
}
//////////////////////////////////////////////////////////////////////////////
--
2.31.1.189.g2e36527f23
This commit marks some interrupt-induced read-side data races in
__srcu_read_lock(), __srcu_read_unlock(), and srcu_torture_stats_print().
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/srcutiny.h | 8 ++++----
kernel/rcu/srcutiny.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h
index 0e0cf4d6a72a0..6cfaa0a9a9b96 100644
--- a/include/linux/srcutiny.h
+++ b/include/linux/srcutiny.h
@@ -61,7 +61,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp)
int idx;
idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
- WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1);
+ WRITE_ONCE(ssp->srcu_lock_nesting[idx], READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1);
return idx;
}
@@ -81,11 +81,11 @@ static inline void srcu_torture_stats_print(struct srcu_struct *ssp,
{
int idx;
- idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
+ idx = ((data_race(READ_ONCE(ssp->srcu_idx)) + 1) & 0x2) >> 1;
pr_alert("%s%s Tiny SRCU per-CPU(idx=%d): (%hd,%hd)\n",
tt, tf, idx,
- READ_ONCE(ssp->srcu_lock_nesting[!idx]),
- READ_ONCE(ssp->srcu_lock_nesting[idx]));
+ data_race(READ_ONCE(ssp->srcu_lock_nesting[!idx])),
+ data_race(READ_ONCE(ssp->srcu_lock_nesting[idx])));
}
#endif
diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
index 26344dc6483b0..a0ba2ed49bc61 100644
--- a/kernel/rcu/srcutiny.c
+++ b/kernel/rcu/srcutiny.c
@@ -96,7 +96,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
*/
void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
{
- int newval = ssp->srcu_lock_nesting[idx] - 1;
+ int newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) - 1;
WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
if (!newval && READ_ONCE(ssp->srcu_gp_waiting))
--
2.31.1.189.g2e36527f23
Accesses to ->qsmask are normally protected by ->lock, but there is an
exception in the diagnostic code in rcu_check_boost_fail(). This commit
therefore applies data_race() to this access to avoid KCSAN complaining
about the C-language writes protected by ->lock.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_stall.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 42847caa3909b..6dd6c9aa3f757 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -766,7 +766,7 @@ bool rcu_check_boost_fail(unsigned long gp_state, int *cpup)
rcu_for_each_leaf_node(rnp) {
if (!cpup) {
- if (READ_ONCE(rnp->qsmask)) {
+ if (data_race(READ_ONCE(rnp->qsmask))) {
return false;
} else {
if (READ_ONCE(rnp->gp_tasks))
--
2.31.1.189.g2e36527f23
From: Sergey Senozhatsky <[email protected]>
The soft watchdog timer function checks whether the virtual machine
was suspended, in which case what looks like a lockup is in fact
a false positive.
This is what kvm_check_and_clear_guest_paused() does: it tests the
guest's PVCLOCK_GUEST_STOPPED flag (which is set by the host), and
if that flag is set, all watchdogs are touched and the check bails
out.
The watchdog timer function runs from IRQ context, so this
PVCLOCK_GUEST_STOPPED check works fine.
There is, however, one more watchdog that also runs from IRQ context
and therefore races with the watchdog timer function, but is not
aware of PVCLOCK_GUEST_STOPPED: the RCU stall detector.
apic_timer_interrupt()
smp_apic_timer_interrupt()
hrtimer_interrupt()
__hrtimer_run_queues()
tick_sched_timer()
tick_sched_handle()
update_process_times()
rcu_sched_clock_irq()
This triggers RCU stalls on our devices during VM resume.
If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
before watchdog_timer_fn()->kvm_check_and_clear_guest_paused(),
then nothing on that VCPU touches the watchdogs, and RCU reads a
stale grace-period stall timestamp together with the new jiffies
value, which makes it conclude that RCU has stalled.
Make the RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
do not report RCU stalls while the VM is resuming.
Signed-off-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_stall.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index bd4de5bc5807e..0e7a60706d1c0 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -7,6 +7,8 @@
* Author: Paul E. McKenney <[email protected]>
*/
+#include <linux/kvm_para.h>
+
//////////////////////////////////////////////////////////////////////////////
//
// Controlling CPU stall warnings, including delay calculation.
@@ -698,6 +700,14 @@ static void check_cpu_stall(struct rcu_data *rdp)
(READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+ /*
+ * If a virtual machine is stopped by the host it can look to
+ * the watchdog like an RCU stall. Check to see if the host
+ * stopped the vm.
+ */
+ if (kvm_check_and_clear_guest_paused())
+ return;
+
/* We haven't checked in, so go dump stack. */
print_cpu_stall(gps);
if (READ_ONCE(rcu_cpu_stall_ftrace_dump))
@@ -707,6 +717,14 @@ static void check_cpu_stall(struct rcu_data *rdp)
ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+ /*
+ * If a virtual machine is stopped by the host it can look to
+ * the watchdog like an RCU stall. Check to see if the host
+ * stopped the vm.
+ */
+ if (kvm_check_and_clear_guest_paused())
+ return;
+
/* They had a few time units to dump stack, so complain. */
print_other_cpu_stall(gs2, gps);
if (READ_ONCE(rcu_cpu_stall_ftrace_dump))
--
2.31.1.189.g2e36527f23
Hmm.
This actually seems to make some of the ordering worse.
I'm not seeing a lot of weakening or optimization, but it depends a
bit on what is common and what is not.
On Wed, Jul 21, 2021 at 1:21 PM Paul E. McKenney <[email protected]> wrote:
>
> +/*
> + * Increment the current CPU's rcu_data structure's ->dynticks field
> + * with ordering. Return the new value.
> + */
> +static noinstr unsigned long rcu_dynticks_inc(int incby)
> +{
> + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> + int seq;
> +
> + seq = READ_ONCE(rdp->dynticks) + incby;
> + smp_store_release(&rdp->dynticks, seq);
> + smp_mb(); // Fundamental RCU ordering guarantee.
> + return seq;
> +}
So this is actually likely *more* expensive than the old code was, at
least on x86.
The READ_ONCE/smp_store_release are cheap, but then the smp_mb() is expensive.
The old code did just arch_atomic_inc_return(), which included the
memory barrier.
There *might* be some cache ordering advantage to letting the
READ_ONCE() float upwards, but from a pure barrier standpoint this is
more expensive than what we used to have.
> - if (atomic_read(&rdp->dynticks) & 0x1)
> + if (READ_ONCE(rdp->dynticks) & 0x1)
> return;
> - atomic_inc(&rdp->dynticks);
> + rcu_dynticks_inc(1);
And this one seems to not take advantage of the new rule, so we end up
having two reads, and then that potentially more expensive sequence.
> static int rcu_dynticks_snap(struct rcu_data *rdp)
> {
> - return atomic_add_return(0, &rdp->dynticks);
> + smp_mb(); // Fundamental RCU ordering guarantee.
> + return smp_load_acquire(&rdp->dynticks);
> }
This is likely cheaper - not because of barriers, but simply because
it avoids dirtying the cacheline.
So which operation do we _care_ about, and do we have numbers for why
this improves anything? Because looking at the patch, it's not obvious
that this is an improvement.
Linus
This commit marks the accesses in tree_stall.h so as both to avoid
undesirable compiler optimizations and to keep KCSAN focused on the
accesses of the core algorithm.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_stall.h | 63 +++++++++++++++++++++--------------------
1 file changed, 33 insertions(+), 30 deletions(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 6dd6c9aa3f757..a8d0fcf0826f4 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -465,9 +465,10 @@ static void rcu_check_gp_kthread_starvation(void)
pr_err("%s kthread starved for %ld jiffies! g%ld f%#x %s(%d) ->state=%#x ->cpu=%d\n",
rcu_state.name, j,
(long)rcu_seq_current(&rcu_state.gp_seq),
- data_race(rcu_state.gp_flags),
- gp_state_getname(rcu_state.gp_state), rcu_state.gp_state,
- gpk ? gpk->__state : ~0, cpu);
+ data_race(READ_ONCE(rcu_state.gp_flags)),
+ gp_state_getname(rcu_state.gp_state),
+ data_race(READ_ONCE(rcu_state.gp_state)),
+ gpk ? data_race(READ_ONCE(gpk->__state)) : ~0, cpu);
if (gpk) {
pr_err("\tUnless %s kthread gets sufficient CPU time, OOM is now expected behavior.\n", rcu_state.name);
pr_err("RCU grace-period kthread stack dump:\n");
@@ -510,7 +511,7 @@ static void rcu_check_gp_kthread_expired_fqs_timer(void)
(long)rcu_seq_current(&rcu_state.gp_seq),
data_race(rcu_state.gp_flags),
gp_state_getname(RCU_GP_WAIT_FQS), RCU_GP_WAIT_FQS,
- gpk->__state);
+ data_race(READ_ONCE(gpk->__state)));
pr_err("\tPossible timer handling issue on cpu=%d timer-softirq=%u\n",
cpu, kstat_softirqs_cpu(TIMER_SOFTIRQ, cpu));
}
@@ -569,11 +570,11 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
pr_err("INFO: Stall ended before state dump start\n");
} else {
j = jiffies;
- gpa = data_race(rcu_state.gp_activity);
+ gpa = data_race(READ_ONCE(rcu_state.gp_activity));
pr_err("All QSes seen, last %s kthread activity %ld (%ld-%ld), jiffies_till_next_fqs=%ld, root ->qsmask %#lx\n",
rcu_state.name, j - gpa, j, gpa,
- data_race(jiffies_till_next_fqs),
- rcu_get_root()->qsmask);
+ data_race(READ_ONCE(jiffies_till_next_fqs)),
+ data_race(READ_ONCE(rcu_get_root()->qsmask)));
}
}
/* Rewrite if needed in case of slow consoles. */
@@ -815,32 +816,34 @@ void show_rcu_gp_kthreads(void)
struct task_struct *t = READ_ONCE(rcu_state.gp_kthread);
j = jiffies;
- ja = j - data_race(rcu_state.gp_activity);
- jr = j - data_race(rcu_state.gp_req_activity);
- js = j - data_race(rcu_state.gp_start);
- jw = j - data_race(rcu_state.gp_wake_time);
+ ja = j - data_race(READ_ONCE(rcu_state.gp_activity));
+ jr = j - data_race(READ_ONCE(rcu_state.gp_req_activity));
+ js = j - data_race(READ_ONCE(rcu_state.gp_start));
+ jw = j - data_race(READ_ONCE(rcu_state.gp_wake_time));
pr_info("%s: wait state: %s(%d) ->state: %#x ->rt_priority %u delta ->gp_start %lu ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_max %lu ->gp_flags %#x\n",
rcu_state.name, gp_state_getname(rcu_state.gp_state),
- rcu_state.gp_state, t ? t->__state : 0x1ffff, t ? t->rt_priority : 0xffU,
- js, ja, jr, jw, (long)data_race(rcu_state.gp_wake_seq),
- (long)data_race(rcu_state.gp_seq),
- (long)data_race(rcu_get_root()->gp_seq_needed),
- data_race(rcu_state.gp_max),
- data_race(rcu_state.gp_flags));
+ data_race(READ_ONCE(rcu_state.gp_state)),
+ t ? data_race(READ_ONCE(t->__state)) : 0x1ffff, t ? t->rt_priority : 0xffU,
+ js, ja, jr, jw, (long)data_race(READ_ONCE(rcu_state.gp_wake_seq)),
+ (long)data_race(READ_ONCE(rcu_state.gp_seq)),
+ (long)data_race(READ_ONCE(rcu_get_root()->gp_seq_needed)),
+ data_race(READ_ONCE(rcu_state.gp_max)),
+ data_race(READ_ONCE(rcu_state.gp_flags)));
rcu_for_each_node_breadth_first(rnp) {
if (ULONG_CMP_GE(READ_ONCE(rcu_state.gp_seq), READ_ONCE(rnp->gp_seq_needed)) &&
- !data_race(rnp->qsmask) && !data_race(rnp->boost_tasks) &&
- !data_race(rnp->exp_tasks) && !data_race(rnp->gp_tasks))
+ !data_race(READ_ONCE(rnp->qsmask)) && !data_race(READ_ONCE(rnp->boost_tasks)) &&
+ !data_race(READ_ONCE(rnp->exp_tasks)) && !data_race(READ_ONCE(rnp->gp_tasks)))
continue;
pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld ->qsmask %#lx %c%c%c%c ->n_boosts %ld\n",
rnp->grplo, rnp->grphi,
- (long)data_race(rnp->gp_seq), (long)data_race(rnp->gp_seq_needed),
- data_race(rnp->qsmask),
- ".b"[!!data_race(rnp->boost_kthread_task)],
- ".B"[!!data_race(rnp->boost_tasks)],
- ".E"[!!data_race(rnp->exp_tasks)],
- ".G"[!!data_race(rnp->gp_tasks)],
- data_race(rnp->n_boosts));
+ (long)data_race(READ_ONCE(rnp->gp_seq)),
+ (long)data_race(READ_ONCE(rnp->gp_seq_needed)),
+ data_race(READ_ONCE(rnp->qsmask)),
+ ".b"[!!data_race(READ_ONCE(rnp->boost_kthread_task))],
+ ".B"[!!data_race(READ_ONCE(rnp->boost_tasks))],
+ ".E"[!!data_race(READ_ONCE(rnp->exp_tasks))],
+ ".G"[!!data_race(READ_ONCE(rnp->gp_tasks))],
+ data_race(READ_ONCE(rnp->n_boosts)));
if (!rcu_is_leaf_node(rnp))
continue;
for_each_leaf_node_possible_cpu(rnp, cpu) {
@@ -850,12 +853,12 @@ void show_rcu_gp_kthreads(void)
READ_ONCE(rdp->gp_seq_needed)))
continue;
pr_info("\tcpu %d ->gp_seq_needed %ld\n",
- cpu, (long)data_race(rdp->gp_seq_needed));
+ cpu, (long)data_race(READ_ONCE(rdp->gp_seq_needed)));
}
}
for_each_possible_cpu(cpu) {
rdp = per_cpu_ptr(&rcu_data, cpu);
- cbs += data_race(rdp->n_cbs_invoked);
+ cbs += data_race(READ_ONCE(rdp->n_cbs_invoked));
if (rcu_segcblist_is_offloaded(&rdp->cblist))
show_rcu_nocb_state(rdp);
}
@@ -937,11 +940,11 @@ void rcu_fwd_progress_check(unsigned long j)
if (rcu_gp_in_progress()) {
pr_info("%s: GP age %lu jiffies\n",
- __func__, jiffies - rcu_state.gp_start);
+ __func__, jiffies - data_race(READ_ONCE(rcu_state.gp_start)));
show_rcu_gp_kthreads();
} else {
pr_info("%s: Last GP end %lu jiffies ago\n",
- __func__, jiffies - rcu_state.gp_end);
+ __func__, jiffies - data_race(READ_ONCE(rcu_state.gp_end)));
preempt_disable();
rdp = this_cpu_ptr(&rcu_data);
rcu_check_gp_start_stall(rdp->mynode, rdp, j);
--
2.31.1.189.g2e36527f23
The WARN_ON_ONCE() invocation within the CONFIG_PREEMPT=y version of
rcu_note_context_switch() triggers when there is a voluntary context
switch in an RCU read-side critical section, but there is quite a gap
between the output of that WARN_ON_ONCE() and this RCU-usage error.
This commit therefore converts the WARN_ON_ONCE() to a WARN_ONCE()
that explicitly describes the problem in its message.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 83a702a4e2963..e8b45ab72a799 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -346,7 +346,7 @@ void rcu_note_context_switch(bool preempt)
trace_rcu_utilization(TPS("Start context switch"));
lockdep_assert_irqs_disabled();
- WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0);
+ WARN_ONCE(!preempt && rcu_preempt_depth() > 0, "Voluntary context switch within RCU read-side critical section!");
if (rcu_preempt_depth() > 0 &&
!t->rcu_read_unlock_special.b.blocked) {
--
2.31.1.189.g2e36527f23
From: Julian Wiedmann <[email protected]>
We have two separate sections that talk about why list_empty_rcu()
is not needed, so this commit consolidates them.
Signed-off-by: Julian Wiedmann <[email protected]>
[ paulmck: The usual wordsmithing. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rculist.h | 35 +++++++++++++++++------------------
1 file changed, 17 insertions(+), 18 deletions(-)
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index f8633d37e3581..d29740be4833e 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -10,15 +10,6 @@
#include <linux/list.h>
#include <linux/rcupdate.h>
-/*
- * Why is there no list_empty_rcu()? Because list_empty() serves this
- * purpose. The list_empty() function fetches the RCU-protected pointer
- * and compares it to the address of the list head, but neither dereferences
- * this pointer itself nor provides this pointer to the caller. Therefore,
- * it is not necessary to use rcu_dereference(), so that list_empty() can
- * be used anywhere you would want to use a list_empty_rcu().
- */
-
/*
* INIT_LIST_HEAD_RCU - Initialize a list_head visible to RCU readers
* @list: list to be initialized
@@ -318,21 +309,29 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
/*
* Where are list_empty_rcu() and list_first_entry_rcu()?
*
- * Implementing those functions following their counterparts list_empty() and
- * list_first_entry() is not advisable because they lead to subtle race
- * conditions as the following snippet shows:
+ * They do not exist because they would lead to subtle race conditions:
*
* if (!list_empty_rcu(mylist)) {
* struct foo *bar = list_first_entry_rcu(mylist, struct foo, list_member);
* do_something(bar);
* }
*
- * The list may not be empty when list_empty_rcu checks it, but it may be when
- * list_first_entry_rcu rereads the ->next pointer.
- *
- * Rereading the ->next pointer is not a problem for list_empty() and
- * list_first_entry() because they would be protected by a lock that blocks
- * writers.
+ * The list might be non-empty when list_empty_rcu() checks it, but it
+ * might have become empty by the time that list_first_entry_rcu() rereads
+ * the ->next pointer, which would result in a SEGV.
+ *
+ * When not using RCU, it is OK for list_first_entry() to re-read that
+ * pointer because both functions should be protected by some lock that
+ * blocks writers.
+ *
+ * When using RCU, list_empty() uses READ_ONCE() to fetch the
+ * RCU-protected ->next pointer and then compares it to the address of the
+ * list head. However, it neither dereferences this pointer nor provides
+ * this pointer to its caller. Thus, READ_ONCE() suffices (that is,
+ * rcu_dereference() is not needed), which means that list_empty() can be
+ * used anywhere you would want to use list_empty_rcu(). Just don't
+ * expect anything useful to happen if you do a subsequent lockless
+ * call to list_first_entry_rcu()!!!
*
* See list_first_or_null_rcu for an alternative.
*/
--
2.31.1.189.g2e36527f23
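As a hedged illustration of the alternative mentioned at the end of the new comment, the following sketch shows the intended pattern; struct foo, mylist, and do_something() are the placeholders used in the comment itself, and the enclosing RCU read-side critical section is assumed:

#include <linux/rculist.h>
#include <linux/rcupdate.h>

struct foo {
	struct list_head list_member;
	int data;
};

static LIST_HEAD(mylist);

static void do_something(struct foo *bar);	/* placeholder consumer */

static void consume_first(void)
{
	struct foo *bar;

	rcu_read_lock();
	/* Fetch the first entry, or NULL if the list is empty, in one shot. */
	bar = list_first_or_null_rcu(&mylist, struct foo, list_member);
	if (bar)
		do_something(bar);
	rcu_read_unlock();
}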
On Wed, Jul 21, 2021 at 01:41:46PM -0700, Linus Torvalds wrote:
> Hmm.
>
> This actually seems to make some of the ordering worse.
>
> I'm not seeing a lot of weakening or optimization, but it depends a
> bit on what is common and what is not.
Agreed, and I expect that I will be reworking this patch rather
thoroughly.
Something about smp_mb() often being a locked atomic operation on a
stack location. :-/
But you did ask for this to be sped up some years back (before the
memory model was formalized), so I figured I should at least show what
can be done. Plus I expect that you know much more about what Intel is
planning than I do.
> On Wed, Jul 21, 2021 at 1:21 PM Paul E. McKenney <[email protected]> wrote:
> >
> > +/*
> > + * Increment the current CPU's rcu_data structure's ->dynticks field
> > + * with ordering. Return the new value.
> > + */
> > +static noinstr unsigned long rcu_dynticks_inc(int incby)
> > +{
> > + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > + int seq;
> > +
> > + seq = READ_ONCE(rdp->dynticks) + incby;
> > + smp_store_release(&rdp->dynticks, seq);
> > + smp_mb(); // Fundamental RCU ordering guarantee.
> > + return seq;
> > +}
>
> So this is actually likely *more* expensive than the old code was, at
> least on x86.
>
> The READ_ONCE/smp_store_release are cheap, but then the smp_mb() is expensive.
>
> The old code did just arch_atomic_inc_return(), which included the
> memory barrier.
>
> There *might* be some cache ordering advantage to letting the
> READ_ONCE() float upwards, but from a pure barrier standpoint this is
> more expensive than what we used to have.
No argument here.
> > - if (atomic_read(&rdp->dynticks) & 0x1)
> > + if (READ_ONCE(rdp->dynticks) & 0x1)
> > return;
> > - atomic_inc(&rdp->dynticks);
> > + rcu_dynticks_inc(1);
>
> And this one seems to not take advantage of the new rule, so we end up
> having two reads, and then that potentially more expensive sequence.
This one only executes when a CPU comes online, so I am not worried
about its overhead.
> > static int rcu_dynticks_snap(struct rcu_data *rdp)
> > {
> > - return atomic_add_return(0, &rdp->dynticks);
> > + smp_mb(); // Fundamental RCU ordering guarantee.
> > + return smp_load_acquire(&rdp->dynticks);
> > }
>
> This is likely cheaper - not because of barriers, but simply because
> it avoids dirtying the cacheline.
>
> So which operation do we _care_ about, and do we have numbers for why
> this improves anything? Because looking at the patch, it's not obvious
> that this is an improvement.
It sounds like I should keep this hunk and revert the rest back to
atomic operations, but still in the new rcu_dynticks_inc() function.
Either way, thank you for looking this over!
Thanx, Paul
Accesses to the rcu_data structure's ->dynticks field have always been
fully ordered because it was not possible to prove that weaker ordering
was safe. However, with the removal of the rcu_eqs_special_set() function
and the advent of the Linux-kernel memory model, it is now easy to show
that two of the four original full memory barriers can be weakened to
acquire and release operations. The remaining pair must remain full
memory barriers. This change makes the memory ordering requirements
more evident, and it might well also speed up the to-idle and from-idle
fastpaths on some architectures.
The following litmus test, adapted from one supplied off-list by Frederic
Weisbecker, models the RCU grace-period kthread detecting an idle CPU
that is concurrently transitioning to non-idle:
C dynticks-from-idle
{
DYNTICKS=0; (* Initially idle. *)
}
P0(int *X, int *DYNTICKS)
{
int dynticks;
int x;
// Idle.
dynticks = READ_ONCE(*DYNTICKS);
smp_store_release(DYNTICKS, dynticks + 1);
smp_mb();
// Now non-idle
x = READ_ONCE(*X);
}
P1(int *X, int *DYNTICKS)
{
int dynticks;
WRITE_ONCE(*X, 1);
smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
}
exists (1:dynticks=0 /\ 0:x=1)
Running "herd7 -conf linux-kernel.cfg dynticks-from-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread (P1)
sees another CPU as idle (P0), then any memory access prior to the start
of the grace period (P1's write to X) will be seen by any RCU read-side
critical section following the to-non-idle transition (P0's read from X).
This is a straightforward use of full memory barriers to force ordering
in a store-buffering (SB) litmus test.
The following litmus test, also adapted from the one supplied off-list
by Frederic Weisbecker, models the RCU grace-period kthread detecting
a non-idle CPU that is concurrently transitioning to idle:
C dynticks-into-idle
{
DYNTICKS=1; (* Initially non-idle. *)
}
P0(int *X, int *DYNTICKS)
{
int dynticks;
// Non-idle.
WRITE_ONCE(*X, 1);
dynticks = READ_ONCE(*DYNTICKS);
smp_store_release(DYNTICKS, dynticks + 1);
smp_mb();
// Now idle.
}
P1(int *X, int *DYNTICKS)
{
int x;
int dynticks;
smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
x = READ_ONCE(*X);
}
exists (1:dynticks=2 /\ 1:x=0)
Running "herd7 -conf linux-kernel.cfg dynticks-into-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread
(P1) sees another CPU as newly idle (P0), then any pre-idle memory access
(P0's write to X) will be seen by any code following the grace period
(P1's read from X). This is a simple release-acquire pair forcing
ordering in a message-passing (MP) litmus test.
Of course, if the grace-period kthread detects the CPU as non-idle,
it will refrain from reporting a quiescent state on behalf of that CPU,
so there are no ordering requirements from the grace-period kthread in
that case. However, other subsystems call rcu_is_idle_cpu() to check
for CPUs being non-idle from an RCU perspective. That case is also
verified by the above litmus tests with the proviso that the sense of
the low-order bit of the DYNTICKS counter be inverted.
Unfortunately, on x86 smp_mb() is as expensive as a cache-local atomic
increment. This commit therefore weakens only the read from ->dynticks.
However, the updates are abstracted into a rcu_dynticks_inc() function
to ease any future changes that might be needed.
[ paulmck: Apply Linus Torvalds feedback. ]
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 42a0032dd99f7..c87b3a271d65b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -251,6 +251,15 @@ void rcu_softirq_qs(void)
rcu_tasks_qs(current, false);
}
+/*
+ * Increment the current CPU's rcu_data structure's ->dynticks field
+ * with ordering. Return the new value.
+ */
+static noinstr unsigned long rcu_dynticks_inc(int incby)
+{
+ return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
+}
+
/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state, that is,
@@ -267,7 +276,7 @@ static noinstr void rcu_dynticks_eqs_enter(void)
* next idle sojourn.
*/
rcu_dynticks_task_trace_enter(); // Before ->dynticks update!
- seq = arch_atomic_inc_return(&this_cpu_ptr(&rcu_data)->dynticks);
+ seq = rcu_dynticks_inc(1);
// RCU is no longer watching. Better be in extended quiescent state!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1));
}
@@ -286,7 +295,7 @@ static noinstr void rcu_dynticks_eqs_exit(void)
* and we also must force ordering with the next RCU read-side
* critical section.
*/
- seq = arch_atomic_inc_return(&this_cpu_ptr(&rcu_data)->dynticks);
+ seq = rcu_dynticks_inc(1);
// RCU is now watching. Better not be in an extended quiescent state!
rcu_dynticks_task_trace_exit(); // After ->dynticks update!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1));
@@ -308,7 +317,7 @@ static void rcu_dynticks_eqs_online(void)
if (atomic_read(&rdp->dynticks) & 0x1)
return;
- atomic_inc(&rdp->dynticks);
+ rcu_dynticks_inc(1);
}
/*
@@ -318,7 +327,7 @@ static void rcu_dynticks_eqs_online(void)
*/
static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
{
- return !(arch_atomic_read(&this_cpu_ptr(&rcu_data)->dynticks) & 0x1);
+ return !(atomic_read(this_cpu_ptr(&rcu_data.dynticks)) & 0x1);
}
/*
@@ -327,7 +336,8 @@ static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
*/
static int rcu_dynticks_snap(struct rcu_data *rdp)
{
- return atomic_add_return(0, &rdp->dynticks);
+ smp_mb(); // Fundamental RCU ordering guarantee.
+ return atomic_read_acquire(&rdp->dynticks);
}
/*
@@ -391,12 +401,12 @@ bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
*/
notrace void rcu_momentary_dyntick_idle(void)
{
- int special;
+ int seq;
raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
- special = atomic_add_return(2, &this_cpu_ptr(&rcu_data)->dynticks);
+ seq = rcu_dynticks_inc(2);
/* It is illegal to call this from idle state. */
- WARN_ON_ONCE(!(special & 0x1));
+ WARN_ON_ONCE(!(seq & 0x1));
rcu_preempt_deferred_qs(current);
}
EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
On Wed, Jul 28, 2021 at 10:37 AM Paul E. McKenney <[email protected]> wrote:
>
> +/*
> + * Increment the current CPU's rcu_data structure's ->dynticks field
> + * with ordering. Return the new value.
> + */
> +static noinstr unsigned long rcu_dynticks_inc(int incby)
> +{
> + return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
> +}
Maybe inline?
But more I reacted to how we sadly don't have percpu atomics. They'd
be fairly easy to add on x86, but I guess it's not a huge deal.
And hey, if this is pretty much the only place that would use them, I
guess that's not much of an issue.
Linus
----- On Jul 28, 2021, at 1:58 PM, Linus Torvalds [email protected] wrote:
> On Wed, Jul 28, 2021 at 10:37 AM Paul E. McKenney <[email protected]> wrote:
>>
>> +/*
>> + * Increment the current CPU's rcu_data structure's ->dynticks field
>> + * with ordering. Return the new value.
>> + */
>> +static noinstr unsigned long rcu_dynticks_inc(int incby)
>> +{
>> + return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
>> +}
>
> Maybe inline?
>
> But more I reacted to how we sadly don't have percpu atomics. They'd
> be fairly easy to add on x86, but I guess it's not a huge deal.
Are the percpu atomics you have in mind different from what is found in
Documentation/core-api/this_cpu_ops.rst ?
Namely this_cpu_add_return(pcp, val) in this case.
I must be missing something subtle because AFAIU those are already
available. Those per-cpu atomics don't provide any memory ordering
though, which may be why those are not used here.
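That is, something like this rough sketch, assuming ->dynticks were a plain
per-CPU integer rather than an atomic_t (which it is not in this patch):

/* Sketch only: IRQ/preempt-atomic, not SMP-atomic, and no memory ordering. */
static unsigned long rcu_dynticks_inc(int incby)
{
	return this_cpu_add_return(rcu_data.dynticks, incby);
}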
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Jul 28, 2021, at 1:37 PM, paulmck [email protected] wrote:
[...]
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 42a0032dd99f7..c87b3a271d65b 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -251,6 +251,15 @@ void rcu_softirq_qs(void)
> rcu_tasks_qs(current, false);
> }
>
> +/*
> + * Increment the current CPU's rcu_data structure's ->dynticks field
> + * with ordering. Return the new value.
> + */
> +static noinstr unsigned long rcu_dynticks_inc(int incby)
> +{
> + return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
> +}
> +
[...]
> @@ -308,7 +317,7 @@ static void rcu_dynticks_eqs_online(void)
>
> if (atomic_read(&rdp->dynticks) & 0x1)
> return;
Can the thread be migrated at this point ? If yes, then
the check and the increment may happen on different cpu's rdps. Is
that OK ?
> - atomic_inc(&rdp->dynticks);
> + rcu_dynticks_inc(1);
> }
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Jul 28, 2021, at 2:32 PM, Linus Torvalds [email protected] wrote:
> On Wed, Jul 28, 2021 at 11:12 AM Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> Are the percpu atomics you have in mind different from what is found in
>> Documentation/core-api/this_cpu_ops.rst ?
>>
>> Namely this_cpu_add_return(pcp, val) in this case.
>
> Nope.
>
> Those are only "CPU-atomic", ie atomic wrt interrupts etc.
>
> The RCU code wants SMP-atomic, and it's mainly that we *could* do the
> addressing more efficiently.
OK, so combining the addressing tricks of this_cpu operations with
smp-atomic operations (e.g. LOCK prefix on x86). It may indeed become
worthwhile given enough users, and fast enough atomic operations.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Jul 28, 2021 at 11:12 AM Mathieu Desnoyers
<[email protected]> wrote:
>
> Are the percpu atomics you have in mind different from what is found in
> Documentation/core-api/this_cpu_ops.rst ?
>
> Namely this_cpu_add_return(pcp, val) in this case.
Nope.
Those are only "CPU-atomic", ie atomic wrt interrupts etc.
The RCU code wants SMP-atomic, and it's mainly that we *could* do the
addressing more efficiently.
Linus
On Wed, Jul 28, 2021 at 02:12:05PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 28, 2021, at 1:58 PM, Linus Torvalds [email protected] wrote:
>
> > On Wed, Jul 28, 2021 at 10:37 AM Paul E. McKenney <[email protected]> wrote:
> >>
> >> +/*
> >> + * Increment the current CPU's rcu_data structure's ->dynticks field
> >> + * with ordering. Return the new value.
> >> + */
> >> +static noinstr unsigned long rcu_dynticks_inc(int incby)
> >> +{
> >> + return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
> >> +}
> >
> > Maybe inline?
> >
> > But more I reacted to how we sadly don't have percpu atomics. They'd
> > be fairly easy to add on x86, but I guess it's not a huge deal.
>
> Are the percpu atomics you have in mind different from what is found in
> Documentation/core-api/this_cpu_ops.rst ?
>
> Namely this_cpu_add_return(pcp, val) in this case.
>
> I must be missing something subtle because AFAIU those are already
> available. Those per-cpu atomics don't provide any memory ordering
> though, which may be why those are not used here.
Good point, but this code does indeed need the ordering. It also must
support the occasional off-CPU access.
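For example, rcu_dynticks_snap() in this patch is handed some other CPU's
rdp by the grace-period kthread:

static int rcu_dynticks_snap(struct rcu_data *rdp)
{
	smp_mb(); // Fundamental RCU ordering guarantee.
	return atomic_read_acquire(&rdp->dynticks);
}

So a purely this_cpu-based primitive could not cover all of the ->dynticks
uses.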
Thanx, Paul
On Wed, Jul 28, 2021 at 10:58:08AM -0700, Linus Torvalds wrote:
> On Wed, Jul 28, 2021 at 10:37 AM Paul E. McKenney <[email protected]> wrote:
> >
> > +/*
> > + * Increment the current CPU's rcu_data structure's ->dynticks field
> > + * with ordering. Return the new value.
> > + */
> > +static noinstr unsigned long rcu_dynticks_inc(int incby)
> > +{
> > + return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
> > +}
>
> Maybe inline?
I bet the compiler does this already, but why give it the choice?
I will add inline.
> But more I reacted to how we sadly don't have percpu atomics. They'd
> be fairly easy to add on x86, but I guess it's not a huge deal.
>
> And hey, if this is pretty much the only place that would use them, I
> guess that's not much of an issue.
I did a git grep for "atomic_.*this_cpu_ptr" in -rcu, and came up with
this:
arch/s390/kernel/time.c: atomic_t *sw_ptr = this_cpu_ptr(&clock_sync_word);
arch/s390/kernel/time.c: atomic_t *sw_ptr = this_cpu_ptr(&clock_sync_word);
arch/x86/xen/spinlock.c: atomic_t *nest_cnt = this_cpu_ptr(&xen_qlock_wait_nest);
drivers/irqchip/irq-apple-aic.c: atomic_andnot(irq_bit, this_cpu_ptr(&aic_vipi_enable));
drivers/irqchip/irq-apple-aic.c: atomic_or(irq_bit, this_cpu_ptr(&aic_vipi_enable));
drivers/irqchip/irq-apple-aic.c: if (atomic_read(this_cpu_ptr(&aic_vipi_flag)) & irq_bit)
drivers/irqchip/irq-apple-aic.c: enabled = atomic_read(this_cpu_ptr(&aic_vipi_enable));
drivers/irqchip/irq-apple-aic.c: firing = atomic_fetch_andnot(enabled, this_cpu_ptr(&aic_vipi_flag)) & enabled;
kernel/events/core.c: if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
kernel/events/core.c: if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
kernel/rcu/rcuscale.c: atomic_dec(this_cpu_ptr(&n_async_inflight));
kernel/rcu/rcuscale.c: if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) {
kernel/rcu/rcuscale.c: atomic_inc(this_cpu_ptr(&n_async_inflight));
kernel/rcu/tree.c: return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
kernel/rcu/tree.c: return !(atomic_read(this_cpu_ptr(&rcu_data.dynticks)) & 0x1);
kernel/trace/ring_buffer.c: if (atomic_inc_return(this_cpu_ptr(&checking)) != 1)
kernel/trace/ring_buffer.c: atomic_dec(this_cpu_ptr(&checking));
Given that this mostly moves text from the first argument to the function
name, I suspect that we would need quite a few instances of any given
atomic operation to make the added lines for all the definitions and
the documentation worth it. Maybe for atomic_read()?
/**
* atomic_read_this_cpu - atomically read from this CPU's variable
* @p: Pointer to per-CPU variable.
*
* Do an atomic_read() from the current CPU's instance of the
* specified per-CPU variable.
*/
#define atomic_read_this_cpu(p) atomic_read(this_cpu_ptr(p))
But atomic_read_this_cpu(&rcu_data.dynticks) isn't all that much shorter
than atomic_read(this_cpu_ptr(&rcu_data.dynticks)).
Thanx, Paul
On Wed, Jul 28, 2021 at 11:46 AM Paul E. McKenney <[email protected]> wrote:
>
> But atomic_read_this_cpu(&rcu_data.dynticks) isn't all that much shorter
> than atomic_read(this_cpu_ptr(&rcu_data.dynticks)).
It's not so much that it's shorter to write for a human, it's that we
could generate better code for it.
That atomic_read(this_cpu_ptr()) pattern generates code like
movq $rcu_data+288, %rax
add %gs:this_cpu_off(%rip), %rax
movl (%rax), %eax
but it *could* just generate
movl %gs:rcu_data+288, %rax
instead.
Similar patterns for the other per-cpu atomics, ie it would be
possible to just generate
lock ; xaddl %gs:..., %rax
instead of generating the address by doing that "add %gs:this_cpu_off" thing..
But no, it doesn't look like there are enough users of this to matter.
We're just talking a few extra bytes, and a couple of extra
instructions (and possibly slightly higher register pressure, which
then generates more instructions).
The *expensive* part remains the SMP serialization of the "lock".
Linus
On Wed, Jul 28, 2021 at 02:23:05PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 28, 2021, at 1:37 PM, paulmck [email protected] wrote:
> [...]
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 42a0032dd99f7..c87b3a271d65b 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -251,6 +251,15 @@ void rcu_softirq_qs(void)
> > rcu_tasks_qs(current, false);
> > }
> >
> > +/*
> > + * Increment the current CPU's rcu_data structure's ->dynticks field
> > + * with ordering. Return the new value.
> > + */
> > +static noinstr unsigned long rcu_dynticks_inc(int incby)
> > +{
> > + return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
> > +}
> > +
>
> [...]
>
> > @@ -308,7 +317,7 @@ static void rcu_dynticks_eqs_online(void)
> >
> > if (atomic_read(&rdp->dynticks) & 0x1)
> > return;
>
> Can the thread be migrated at this point ? If yes, then
> the check and the increment may happen on different cpu's rdps. Is
> that OK ?
Good point! Actually, it can be migrated, but it does not matter.
In fact, it so completely fails to matter that it is totally useless. :-/
The incoming CPU is still offline, so this is run from some other
completely-online CPU. Because this CPU is executing in non-idle
kernel context, that "if" condition must evaluate to true, so that the
rcu_dynticks_inc() below is dead code.
Maybe I should move the call to rcu_dynticks_eqs_online() to
rcu_cpu_starting(), which is pinned to the incoming CPU. Yes, I
could remove it completely, but then small changes in the offline
process could cause great mischief.
Good catch, thank you!
Thanx, Paul
> > - atomic_inc(&rdp->dynticks);
> > + rcu_dynticks_inc(1);
> > }
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
On Wed, Jul 28, 2021 at 11:58:54AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 28, 2021 at 02:23:05PM -0400, Mathieu Desnoyers wrote:
> > ----- On Jul 28, 2021, at 1:37 PM, paulmck [email protected] wrote:
> > [...]
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 42a0032dd99f7..c87b3a271d65b 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -251,6 +251,15 @@ void rcu_softirq_qs(void)
> > > rcu_tasks_qs(current, false);
> > > }
> > >
> > > +/*
> > > + * Increment the current CPU's rcu_data structure's ->dynticks field
> > > + * with ordering. Return the new value.
> > > + */
> > > +static noinstr unsigned long rcu_dynticks_inc(int incby)
> > > +{
> > > + return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
> > > +}
> > > +
> >
> > [...]
> >
> > > @@ -308,7 +317,7 @@ static void rcu_dynticks_eqs_online(void)
> > >
> > > if (atomic_read(&rdp->dynticks) & 0x1)
> > > return;
> >
> > Can the thread be migrated at this point ? If yes, then
> > the check and the increment may happen on different cpu's rdps. Is
> > that OK ?
>
> Good point! Actually, it can be migrated, but it does not matter.
> In fact, it so completely fails to matter that it is totally useless. :-/
>
> The incoming CPU is still offline, so this is run from some other
> completely-online CPU. Because this CPU is executing in non-idle
> kernel context, that "if" condition must evaluate to true, so that the
> rcu_dynticks_inc() below is dead code.
>
> Maybe I should move the call to rcu_dynticks_eqs_online() to
> rcu_cpu_starting(), which is pinned to the incoming CPU. Yes, I
> could remove it completely, but then small changes in the offline
> process could cause great mischief.
>
> Good catch, thank you!
And how about like this?
Thanx, Paul
------------------------------------------------------------------------
commit cb8914dcc6443cca15ce48d937a93c0dfdb114d3
Author: Paul E. McKenney <[email protected]>
Date: Wed Jul 28 12:38:42 2021 -0700
rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
counter of an incoming CPU if required. It is currently is invoked
from rcutree_prepare_cpu(), which runs before the incoming CPU is
running, and thus on some other CPU. This makes the per-CPU accesses in
rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
the running CPU cannot possibly be in dyntick-idle mode, which means
that rcu_dynticks_eqs_online() never has any effect. One could argue
that this means that rcu_dynticks_eqs_online() is unnecessary, however,
removing it makes the CPU-online process vulnerable to slight changes
in the CPU-offline process.
This commit therefore moves the call to rcu_dynticks_eqs_online() from
rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
to be running on the incoming CPU. The call to this function must of
course be placed before this rcu_cpu_starting() announces this CPU's
presence to RCU.
Reported-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0172a5fd6d8de..aa00babdaf544 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
rdp->blimit = blimit;
rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */
- rcu_dynticks_eqs_online();
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
/*
@@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu)
mask = rdp->grpmask;
WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
+ rcu_dynticks_eqs_online();
smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
raw_spin_lock_irqsave_rcu_node(rnp, flags);
WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
----- On Jul 28, 2021, at 3:45 PM, paulmck [email protected] wrote:
[...]
>
> And how about like this?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit cb8914dcc6443cca15ce48d937a93c0dfdb114d3
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Jul 28 12:38:42 2021 -0700
>
> rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
>
> The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
> counter of an incoming CPU if required. It is currently is invoked
"is currently is" -> "is currently"
> from rcutree_prepare_cpu(), which runs before the incoming CPU is
> running, and thus on some other CPU. This makes the per-CPU accesses in
> rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
> the running CPU cannot possibly be in dyntick-idle mode, which means
> that rcu_dynticks_eqs_online() never has any effect. One could argue
> that this means that rcu_dynticks_eqs_online() is unnecessary, however,
> removing it makes the CPU-online process vulnerable to slight changes
> in the CPU-offline process.
Why favor moving this from the prepare_cpu to the cpu_starting hotplug step,
rather than using the target cpu's rdp from rcutree_prepare_cpu ? Maybe there
was a good reason for having this very early in the prepare_cpu step ?
Also, the commit message refers to this bug as having no effect because the
running CPU cannot possibly be in dyntick-idle mode. I understand that calling
this function was indeed effect-less, but then why is it OK for the CPU coming
online to skip this call in the first place ? This commit message hints at
"slight changes in the CPU-offline process" which could break it, but therer is
no explanation of what makes this not an actual bug fix.
Thanks,
Mathieu
>
> This commit therefore moves the call to rcu_dynticks_eqs_online() from
> rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
> to be running on the incoming CPU. The call to this function must of
> course be placed before this rcu_cpu_starting() announces this CPU's
> presence to RCU.
>
> Reported-by: Mathieu Desnoyers <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 0172a5fd6d8de..aa00babdaf544 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
> rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
> rdp->blimit = blimit;
> rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */
> - rcu_dynticks_eqs_online();
> raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
>
> /*
> @@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu)
> mask = rdp->grpmask;
> WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
> WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
> + rcu_dynticks_eqs_online();
> smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
> raw_spin_lock_irqsave_rcu_node(rnp, flags);
> WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Jul 28, 2021 at 04:03:02PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 28, 2021, at 3:45 PM, paulmck [email protected] wrote:
> [...]
> >
> > And how about like this?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit cb8914dcc6443cca15ce48d937a93c0dfdb114d3
> > Author: Paul E. McKenney <[email protected]>
> > Date: Wed Jul 28 12:38:42 2021 -0700
> >
> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
> >
> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
> > counter of an incoming CPU if required. It is currently is invoked
>
> "is currently is" -> "is currently"
Good catch, fixed!
> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
> > running, and thus on some other CPU. This makes the per-CPU accesses in
> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
> > the running CPU cannot possibly be in dyntick-idle mode, which means
> > that rcu_dynticks_eqs_online() never has any effect. One could argue
> > that this means that rcu_dynticks_eqs_online() is unnecessary, however,
> > removing it makes the CPU-online process vulnerable to slight changes
> > in the CPU-offline process.
>
> Why favor moving this from the prepare_cpu to the cpu_starting hotplug step,
> rather than using the target cpu's rdp from rcutree_prepare_cpu ? Maybe there
> was a good reason for having this very early in the prepare_cpu step ?
Some years back, there was a good reason. This reason was that
rcutree_prepare_cpu() marked the CPU as being online from an RCU
viewpoint. But now rcu_cpu_starting() is the one that marks the CPU as
being online, so the ->dynticks check can be deferred to this function.
> Also, the commit message refers to this bug as having no effect because the
> running CPU cannot possibly be in dyntick-idle mode. I understand that calling
> this function was indeed effect-less, but then why is it OK for the CPU coming
> online to skip this call in the first place ? This commit message hints at
> "slight changes in the CPU-offline process" which could break it, but therer is
> no explanation of what makes this not an actual bug fix.
Because rcutorture would not have suffered in silence had this
situation ever arisen.
I have updated the commit log to answer these questions as shown
below. Thoughts?
Thanx, Paul
------------------------------------------------------------------------
commit 516c8c4cc6fce62539f7e0182739812db4591c1d
Author: Paul E. McKenney <[email protected]>
Date: Wed Jul 28 12:38:42 2021 -0700
rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
counter of an incoming CPU when required. It is currently invoked
from rcutree_prepare_cpu(), which runs before the incoming CPU is
running, and thus on some other CPU. This makes the per-CPU accesses in
rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
the running CPU cannot possibly be in dyntick-idle mode, which means
that rcu_dynticks_eqs_online() never has any effect.
It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
only because the CPU-offline process just happens to leave ->dynticks in
the correct state. After all, if ->dynticks were in the wrong state on a
just-onlined CPU, rcutorture would complain bitterly the next time that
CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
for example, those built by rcutorture scenario TREE04. One could
argue that this means that rcu_dynticks_eqs_online() is unnecessary,
however, removing it would make the CPU-online process vulnerable to
slight changes in the CPU-offline process.
One could also ask why it is safe to move the rcu_dynticks_eqs_online()
call so late in the CPU-online process. Indeed, there was a time when it
would not have been safe, which does much to explain its current location.
However, the marking of a CPU as online from an RCU perspective has long
since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
that is required is that ->dynticks be set correctly by the time that
the CPU is marked as online from an RCU perspective. After all, the RCU
grace-period kthread does not check to see if offline CPUs are also idle.
(In case you were curious, this is one reason why there is quiescent-state
reporting as part of the offlining process.)
This commit therefore moves the call to rcu_dynticks_eqs_online() from
rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
to be running on the incoming CPU. The call to this function must of
course be placed before this rcu_cpu_starting() announces this CPU's
presence to RCU.
Reported-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0172a5fd6d8de..aa00babdaf544 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
rdp->blimit = blimit;
rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */
- rcu_dynticks_eqs_online();
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
/*
@@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu)
mask = rdp->grpmask;
WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
+ rcu_dynticks_eqs_online();
smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
raw_spin_lock_irqsave_rcu_node(rnp, flags);
WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
On Wed, Jul 28, 2021 at 10:37:15AM -0700, Paul E. McKenney wrote:
> This change makes the memory ordering requirements
> more evident, and it might well also speed up the to-idle and from-idle
> fastpaths on some architectures.
Cleaning up the memory ordering requirements certainly seems worthwhile.
But is there any straightforward benchmark that might quantify the
"might well also speed up" here? How much does weakening the memory
ordering buy us, in practice?
On Wed, Jul 28, 2021 at 01:37:19PM -0700, Josh Triplett wrote:
> On Wed, Jul 28, 2021 at 10:37:15AM -0700, Paul E. McKenney wrote:
> > This change makes the memory ordering requirements
> > more evident, and it might well also speed up the to-idle and from-idle
> > fastpaths on some architectures.
>
> Cleaning up the memory ordering requirements certainly seems worthwhile.
> But is there any straightforward benchmark that might quantify the
> "might well also speed up" here? How much does weakening the memory
> ordering buy us, in practice?
None that I know of!
Thanx, Paul
On Wed, Jul 28, 2021 at 01:47:20PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 28, 2021 at 01:37:19PM -0700, Josh Triplett wrote:
> > On Wed, Jul 28, 2021 at 10:37:15AM -0700, Paul E. McKenney wrote:
> > > This change makes the memory ordering requirements
> > > more evident, and it might well also speed up the to-idle and from-idle
> > > fastpaths on some architectures.
> >
> > Cleaning up the memory ordering requirements certainly seems worthwhile.
> > But is there any straightforward benchmark that might quantify the
> > "might well also speed up" here? How much does weakening the memory
> > ordering buy us, in practice?
>
> None that I know of!
I know two:
1) The whole debate makes us review again (and again) the memory ordering
requirements in RCU VS dynticks-idle, which can only be good to enforce
correctness.
2) The more we weaken the ordering, the better we grasp and understand the
underlying ordering requirements. Unnecessary full memory barriers tend to
obfuscate our ordering expectations, making the code less self-explanatory.
3) I have terrible ideas to remove a full barrier in the dynticks idle path
that should work in practice but not in theory and therefore I'm never going
to talk about it unless everyone in the room is drunk.
On Thu, Jul 29, 2021 at 12:23:33AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 28, 2021 at 01:47:20PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 28, 2021 at 01:37:19PM -0700, Josh Triplett wrote:
> > > On Wed, Jul 28, 2021 at 10:37:15AM -0700, Paul E. McKenney wrote:
> > > > This change makes the memory ordering requirements
> > > > more evident, and it might well also speed up the to-idle and from-idle
> > > > fastpaths on some architectures.
> > >
> > > Cleaning up the memory ordering requirements certainly seems worthwhile.
> > > But is there any straightforward benchmark that might quantify the
> > > "might well also speed up" here? How much does weakening the memory
> > > ordering buy us, in practice?
> >
> > None that I know of!
>
> I know two:
>
> 1) The whole debate makes us review again (and again) the memory ordering
> requirements in RCU VS dynticks-idle, which can only be good to enforce
> correctness.
>
> 2) The more we weaken the ordering, the better we grasp and understand the
> underlying ordering requirements. Unnecessary full memory barriers tend to
> obfuscate our ordering expectations, making the code less self-explanatory.
>
> 3) I have terrible ideas to remove a full barrier in the dynticks idle path
> that should work in practice but not in theory and therefore I'm never going
> to talk about it unless everyone in the room is drunk.
Cute!
On #3/2, I don't drink, so I guess you have to leave me out. ;-)
The other side of this coin is that weakening ordering often decreases
robustness and increases complexity. In unquantifiable ways, of course,
which can make discussion of the tradeoffs problematic.
Thanx, Paul
Hi Paul,
On Wed, Jul 21, 2021 at 01:21:12PM -0700, Paul E. McKenney wrote:
> Accesses to the rcu_data structure's ->dynticks field have always been
> fully ordered because it was not possible to prove that weaker ordering
> was safe. However, with the removal of the rcu_eqs_special_set() function
> and the advent of the Linux-kernel memory model, it is now easy to show
> that two of the four original full memory barriers can be weakened to
> acquire and release operations. The remaining pair must remain full
> memory barriers. This change makes the memory ordering requirements
> more evident, and it might well also speed up the to-idle and from-idle
> fastpaths on some architectures.
>
> The following litmus test, adapted from one supplied off-list by Frederic
> Weisbecker, models the RCU grace-period kthread detecting an idle CPU
> that is concurrently transitioning to non-idle:
>
> C dynticks-from-idle
>
> {
> DYNTICKS=0; (* Initially idle. *)
> }
>
> P0(int *X, int *DYNTICKS)
> {
> int dynticks;
> int x;
>
> // Idle.
> dynticks = READ_ONCE(*DYNTICKS);
> smp_store_release(DYNTICKS, dynticks + 1);
> smp_mb();
This smp_mb() is needed, because there is an ->fr relation in the cycle
we want to avoid, however...
> // Now non-idle
> x = READ_ONCE(*X);
> }
>
> P1(int *X, int *DYNTICKS)
> {
> int dynticks;
>
> WRITE_ONCE(*X, 1);
> smp_mb();
> dynticks = smp_load_acquire(DYNTICKS);
> }
>
> exists (1:dynticks=0 /\ 0:x=1)
>
> Running "herd7 -conf linux-kernel.cfg dynticks-from-idle.litmus" verifies
> this transition, namely, showing that if the RCU grace-period kthread (P1)
> sees another CPU as idle (P0), then any memory access prior to the start
> of the grace period (P1's write to X) will be seen by any RCU read-side
> critical section following the to-non-idle transition (P0's read from X).
> This is a straightforward use of full memory barriers to force ordering
> in a store-buffering (SB) litmus test.
>
> The following litmus test, also adapted from the one supplied off-list
> by Frederic Weisbecker, models the RCU grace-period kthread detecting
> a non-idle CPU that is concurrently transitioning to idle:
>
> C dynticks-into-idle
>
> {
> DYNTICKS=1; (* Initially non-idle. *)
> }
>
> P0(int *X, int *DYNTICKS)
> {
> int dynticks;
>
> // Non-idle.
> WRITE_ONCE(*X, 1);
> dynticks = READ_ONCE(*DYNTICKS);
> smp_store_release(DYNTICKS, dynticks + 1);
> smp_mb();
this smp_mb() is not needed, as we rely on the release-acquire pair to
provide the ordering.
This means that if we use different implementations of rcu_dynticks_inc()
(one w/ smp_mb(), another w/o) for idle-to-nonidle and nonidle-to-idle,
we could save an smp_mb(). Thoughts?
> // Now idle.
> }
>
> P1(int *X, int *DYNTICKS)
> {
> int x;
> int dynticks;
>
> smp_mb();
> dynticks = smp_load_acquire(DYNTICKS);
> x = READ_ONCE(*X);
> }
>
> exists (1:dynticks=2 /\ 1:x=0)
>
> Running "herd7 -conf linux-kernel.cfg dynticks-into-idle.litmus" verifies
> this transition, namely, showing that if the RCU grace-period kthread
> (P1) sees another CPU as newly idle (P0), then any pre-idle memory access
> (P0's write to X) will be seen by any code following the grace period
> (P1's read from X). This is a simple release-acquire pair forcing
> ordering in a message-passing (MP) litmus test.
>
> Of course, if the grace-period kthread detects the CPU as non-idle,
> it will refrain from reporting a quiescent state on behalf of that CPU,
> so there are no ordering requirements from the grace-period kthread in
> that case. However, other subsystems call rcu_is_idle_cpu() to check
> for CPUs being non-idle from an RCU perspective. That case is also
> verified by the above litmus tests with the proviso that the sense of
> the low-order bit of the DYNTICKS counter be inverted.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree.c | 50 +++++++++++++++++++++++++++++++----------------
> kernel/rcu/tree.h | 2 +-
> 2 files changed, 34 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 42a0032dd99f7..bc6ccf0ba3b24 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -77,7 +77,7 @@
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
> .dynticks_nesting = 1,
> .dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
> - .dynticks = ATOMIC_INIT(1),
> + .dynticks = 1UL,
> #ifdef CONFIG_RCU_NOCB_CPU
> .cblist.flags = SEGCBLIST_SOFTIRQ_ONLY,
> #endif
> @@ -251,6 +251,21 @@ void rcu_softirq_qs(void)
> rcu_tasks_qs(current, false);
> }
>
> +/*
> + * Increment the current CPU's rcu_data structure's ->dynticks field
> + * with ordering. Return the new value.
> + */
> +static noinstr unsigned long rcu_dynticks_inc(int incby)
> +{
> + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> + int seq;
> +
> + seq = READ_ONCE(rdp->dynticks) + incby;
> + smp_store_release(&rdp->dynticks, seq);
> + smp_mb(); // Fundamental RCU ordering guarantee.
> + return seq;
> +}
So say we have a rcu_dynticks_inc_no_mb() which has the same body except
the smp_mb(), and ...
> +
> /*
> * Record entry into an extended quiescent state. This is only to be
> * called when not already in an extended quiescent state, that is,
> @@ -267,7 +282,7 @@ static noinstr void rcu_dynticks_eqs_enter(void)
> * next idle sojourn.
> */
> rcu_dynticks_task_trace_enter(); // Before ->dynticks update!
> - seq = arch_atomic_inc_return(&this_cpu_ptr(&rcu_data)->dynticks);
> + seq = rcu_dynticks_inc(1);
...we can actually change this to the no smp_mb() version because we are
exiting RCU here:
seq = rcu_dynticks_inc_no_mb(1);
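with rcu_dynticks_inc_no_mb() being something like this sketch (same body
as the rcu_dynticks_inc() quoted above, minus the smp_mb()):

static noinstr unsigned long rcu_dynticks_inc_no_mb(int incby)
{
	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
	int seq;

	seq = READ_ONCE(rdp->dynticks) + incby;
	smp_store_release(&rdp->dynticks, seq); // Pairs with the GP kthread's acquire.
	return seq;
}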
> // RCU is no longer watching. Better be in extended quiescent state!
> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1));
> }
> @@ -286,7 +301,7 @@ static noinstr void rcu_dynticks_eqs_exit(void)
> * and we also must force ordering with the next RCU read-side
> * critical section.
> */
> - seq = arch_atomic_inc_return(&this_cpu_ptr(&rcu_data)->dynticks);
> + seq = rcu_dynticks_inc(1);
> // RCU is now watching. Better not be in an extended quiescent state!
> rcu_dynticks_task_trace_exit(); // After ->dynticks update!
> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1));
> @@ -306,9 +321,9 @@ static void rcu_dynticks_eqs_online(void)
> {
> struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>
> - if (atomic_read(&rdp->dynticks) & 0x1)
> + if (READ_ONCE(rdp->dynticks) & 0x1)
> return;
> - atomic_inc(&rdp->dynticks);
> + rcu_dynticks_inc(1);
> }
>
> /*
> @@ -318,7 +333,7 @@ static void rcu_dynticks_eqs_online(void)
> */
> static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
> {
> - return !(arch_atomic_read(&this_cpu_ptr(&rcu_data)->dynticks) & 0x1);
> + return !(READ_ONCE(this_cpu_ptr(&rcu_data)->dynticks) & 0x1);
> }
>
> /*
> @@ -327,7 +342,8 @@ static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
> */
> static int rcu_dynticks_snap(struct rcu_data *rdp)
> {
> - return atomic_add_return(0, &rdp->dynticks);
> + smp_mb(); // Fundamental RCU ordering guarantee.
> + return smp_load_acquire(&rdp->dynticks);
> }
>
> /*
> @@ -367,7 +383,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
> int snap;
>
> // If not quiescent, force back to earlier extended quiescent state.
> - snap = atomic_read(&rdp->dynticks) & ~0x1;
> + snap = READ_ONCE(rdp->dynticks) & ~0x1;
>
> smp_rmb(); // Order ->dynticks and *vp reads.
> if (READ_ONCE(*vp))
> @@ -375,7 +391,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
> smp_rmb(); // Order *vp read and ->dynticks re-read.
>
> // If still in the same extended quiescent state, we are good!
> - return snap == atomic_read(&rdp->dynticks);
> + return snap == READ_ONCE(rdp->dynticks);
> }
>
> /*
> @@ -391,12 +407,12 @@ bool rcu_dynticks_zero_in_eqs(int cpu, int *vp)
> */
> notrace void rcu_momentary_dyntick_idle(void)
> {
> - int special;
> + int seq;
>
> raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
> - special = atomic_add_return(2, &this_cpu_ptr(&rcu_data)->dynticks);
> + seq = rcu_dynticks_inc(2);
> /* It is illegal to call this from idle state. */
> - WARN_ON_ONCE(!(special & 0x1));
> + WARN_ON_ONCE(!(seq & 0x1));
> rcu_preempt_deferred_qs(current);
> }
> EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
> @@ -612,7 +628,7 @@ static noinstr void rcu_eqs_enter(bool user)
>
> lockdep_assert_irqs_disabled();
> instrumentation_begin();
> - trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
> + trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, READ_ONCE(rdp->dynticks));
> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
> rcu_prepare_for_idle();
> rcu_preempt_deferred_qs(current);
> @@ -747,7 +763,7 @@ noinstr void rcu_nmi_exit(void)
> */
> if (rdp->dynticks_nmi_nesting != 1) {
> trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
> - atomic_read(&rdp->dynticks));
> + READ_ONCE(rdp->dynticks));
> WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
> rdp->dynticks_nmi_nesting - 2);
> instrumentation_end();
> @@ -755,7 +771,7 @@ noinstr void rcu_nmi_exit(void)
> }
>
> /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> - trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
> + trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, READ_ONCE(rdp->dynticks));
> WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
>
> if (!in_nmi())
> @@ -863,7 +879,7 @@ static void noinstr rcu_eqs_exit(bool user)
> instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
>
> rcu_cleanup_after_idle();
> - trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
> + trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, READ_ONCE(rdp->dynticks));
> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
> WRITE_ONCE(rdp->dynticks_nesting, 1);
> WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
> @@ -1026,7 +1042,7 @@ noinstr void rcu_nmi_enter(void)
>
> trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
> rdp->dynticks_nmi_nesting,
> - rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
> + rdp->dynticks_nmi_nesting + incby, READ_ONCE(rdp->dynticks));
> instrumentation_end();
> WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
> rdp->dynticks_nmi_nesting + incby);
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 305cf6aeb4086..ce611da2ff6b3 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -184,7 +184,7 @@ struct rcu_data {
> int dynticks_snap; /* Per-GP tracking for dynticks. */
> long dynticks_nesting; /* Track process nesting level. */
> long dynticks_nmi_nesting; /* Track irq/NMI nesting level. */
> - atomic_t dynticks; /* Even value for idle, else odd. */
> + unsigned long dynticks; /* Even value for idle, else odd. */
Here the type of dynticks is changed, so we should change the type of
the trace point as well. Or use "unsigned int" as the type.
Regards,
Boqun
> bool rcu_need_heavy_qs; /* GP old, so heavy quiescent state! */
> bool rcu_urgent_qs; /* GP old need light quiescent state. */
> bool rcu_forced_tick; /* Forced tick to provide QS. */
> --
> 2.31.1.189.g2e36527f23
>
On Wed, Jul 21, 2021 at 01:21:18PM -0700, Paul E. McKenney wrote:
> This commit marks some interrupt-induced read-side data races in
> __srcu_read_lock(), __srcu_read_unlock(), and srcu_torture_stats_print().
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> include/linux/srcutiny.h | 8 ++++----
> kernel/rcu/srcutiny.c | 2 +-
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h
> index 0e0cf4d6a72a0..6cfaa0a9a9b96 100644
> --- a/include/linux/srcutiny.h
> +++ b/include/linux/srcutiny.h
> @@ -61,7 +61,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp)
> int idx;
>
> idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
> - WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1);
> + WRITE_ONCE(ssp->srcu_lock_nesting[idx], READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1);
> return idx;
> }
>
> @@ -81,11 +81,11 @@ static inline void srcu_torture_stats_print(struct srcu_struct *ssp,
> {
> int idx;
>
> - idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
> + idx = ((data_race(READ_ONCE(ssp->srcu_idx)) + 1) & 0x2) >> 1;
This looks very weird, any explanation why we want to put data_race() on
a READ_ONCE()?
Regards,
Boqun
> pr_alert("%s%s Tiny SRCU per-CPU(idx=%d): (%hd,%hd)\n",
> tt, tf, idx,
> - READ_ONCE(ssp->srcu_lock_nesting[!idx]),
> - READ_ONCE(ssp->srcu_lock_nesting[idx]));
> + data_race(READ_ONCE(ssp->srcu_lock_nesting[!idx])),
> + data_race(READ_ONCE(ssp->srcu_lock_nesting[idx])));
> }
>
> #endif
> diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
> index 26344dc6483b0..a0ba2ed49bc61 100644
> --- a/kernel/rcu/srcutiny.c
> +++ b/kernel/rcu/srcutiny.c
> @@ -96,7 +96,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
> */
> void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
> {
> - int newval = ssp->srcu_lock_nesting[idx] - 1;
> + int newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) - 1;
>
> WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
> if (!newval && READ_ONCE(ssp->srcu_gp_waiting))
> --
> 2.31.1.189.g2e36527f23
>
On Wed, Jul 21, 2021 at 01:21:19PM -0700, Paul E. McKenney wrote:
> Accesses to ->qsmask are normally protected by ->lock, but there is an
> exception in the diagnostic code in rcu_check_boost_fail(). This commit
> therefore applies data_race() to this access to avoid KCSAN complaining
> about the C-language writes protected by ->lock.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree_stall.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index 42847caa3909b..6dd6c9aa3f757 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -766,7 +766,7 @@ bool rcu_check_boost_fail(unsigned long gp_state, int *cpup)
>
> rcu_for_each_leaf_node(rnp) {
> if (!cpup) {
> - if (READ_ONCE(rnp->qsmask)) {
> + if (data_race(READ_ONCE(rnp->qsmask))) {
If the write sides allow normal writes, i.e. allow store tearing, the
READ_ONCE() here could read incomplete writes, which could be basically
anything? And we get the same result if we remove the READ_ONCE(),
don't we? Yes, I know, without the READ_ONCE(), compilers can do
anything to optimize accesses to rnp->qsmask, but the end result is the
same as or similar to reading incomplete writes (which is also a result
of compiler optimization). So if we mark something as data_race(),
**in theory** it makes no difference with or without the READ_ONCE(),
so I think maybe we can remove the READ_ONCE() here?
Regards,
Boqun
> return false;
> } else {
> if (READ_ONCE(rnp->gp_tasks))
> --
> 2.31.1.189.g2e36527f23
>
On Thu, Jul 29, 2021 at 03:58:04PM +0800, Boqun Feng wrote:
> > The following litmus test, also adapted from the one supplied off-list
> > by Frederic Weisbecker, models the RCU grace-period kthread detecting
> > a non-idle CPU that is concurrently transitioning to idle:
> >
> > C dynticks-into-idle
> >
> > {
> > DYNTICKS=1; (* Initially non-idle. *)
> > }
> >
> > P0(int *X, int *DYNTICKS)
> > {
> > int dynticks;
> >
> > // Non-idle.
> > WRITE_ONCE(*X, 1);
> > dynticks = READ_ONCE(*DYNTICKS);
> > smp_store_release(DYNTICKS, dynticks + 1);
> > smp_mb();
>
> this smp_mb() is not needed, as we rely on the release-acquire pair to
> provide the ordering.
>
> This means that if we use different implementations of rcu_dynticks_inc()
> (one w/ smp_mb(), another w/o) for idle-to-nonidle and nonidle-to-idle,
> we could save an smp_mb(). Thoughts?
That's exactly what I wanted to propose but everybody was sober. Namely order
only the RCU read side critical sections before/after idle together:
READ side critical section
//enter idle
//exit idle
smp_mb()
READ side critical section
instead of ordering the RCU read side critical section before idle - with the RCU
idle extended quiescent state - with the RCU read side critical section after idle:
READ side critical section
//enter idle
smp_mb();
//exit idle
smp_mb()
READ side critical section
So the side effect now is that if the write side waits for the reader to
report a quiescent state and scans its dynticks state and sees it's not yet in
RCU idle mode, then later on when the read side enters RCU idle mode we
expect it to see the write side updates.
But after the barrier removal the reader will only see the write side update
once we exit RCU idle mode.
So the following may happen:
P0(int *X, int *Y, int *DYNTICKS)
{
int y;
WRITE_ONCE(*X, 1);
smp_store_release(DYNTICKS, 1); // rcu_eqs_enter
//smp_mb() not there anymore
y = READ_ONCE(*Y);
smp_store_release(DYNTICKS, 2); // rcu_eqs_exit()
smp_mb();
}
P1(int *X, int *Y, int *DYNTICKS)
{
int x;
int dynticks;
WRITE_ONCE(*Y, 1);
smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
x = READ_ONCE(*X);
}
exists (1:x=0 /\ 0:y=0)
Theoretically it shouldn't matter because the RCU idle mode isn't
supposed to perform RCU reads. But theoretically again once a CPU
has reported a quiescent state, any further read is expected to see
the latest updates from the write side.
So I don't know what to think. In practice I believe it's not a big deal
because RCU idle mode code is usually a fragile path that just handles
cpuidle code to put the CPU in/out low power mode. But what about dragons...
On Thu, Jul 29, 2021 at 04:23:08PM +0800, Boqun Feng wrote:
> On Wed, Jul 21, 2021 at 01:21:18PM -0700, Paul E. McKenney wrote:
> > This commit marks some interrupt-induced read-side data races in
> > __srcu_read_lock(), __srcu_read_unlock(), and srcu_torture_stats_print().
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > include/linux/srcutiny.h | 8 ++++----
> > kernel/rcu/srcutiny.c | 2 +-
> > 2 files changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h
> > index 0e0cf4d6a72a0..6cfaa0a9a9b96 100644
> > --- a/include/linux/srcutiny.h
> > +++ b/include/linux/srcutiny.h
> > @@ -61,7 +61,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp)
> > int idx;
> >
> > idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
> > - WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1);
> > + WRITE_ONCE(ssp->srcu_lock_nesting[idx], READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1);
> > return idx;
> > }
> >
> > @@ -81,11 +81,11 @@ static inline void srcu_torture_stats_print(struct srcu_struct *ssp,
> > {
> > int idx;
> >
> > - idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
> > + idx = ((data_race(READ_ONCE(ssp->srcu_idx)) + 1) & 0x2) >> 1;
>
> This looks very weird, any explanation why we want to put data_race() on
> a READ_ONCE()?
We don't want KCSAN to check this read, but we also don't want the
compiler to mess it up.
Thanx, Paul
> Regards,
> Boqun
>
> > pr_alert("%s%s Tiny SRCU per-CPU(idx=%d): (%hd,%hd)\n",
> > tt, tf, idx,
> > - READ_ONCE(ssp->srcu_lock_nesting[!idx]),
> > - READ_ONCE(ssp->srcu_lock_nesting[idx]));
> > + data_race(READ_ONCE(ssp->srcu_lock_nesting[!idx])),
> > + data_race(READ_ONCE(ssp->srcu_lock_nesting[idx])));
> > }
> >
> > #endif
> > diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
> > index 26344dc6483b0..a0ba2ed49bc61 100644
> > --- a/kernel/rcu/srcutiny.c
> > +++ b/kernel/rcu/srcutiny.c
> > @@ -96,7 +96,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
> > */
> > void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
> > {
> > - int newval = ssp->srcu_lock_nesting[idx] - 1;
> > + int newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) - 1;
> >
> > WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
> > if (!newval && READ_ONCE(ssp->srcu_gp_waiting))
> > --
> > 2.31.1.189.g2e36527f23
> >
On Thu, Jul 29, 2021 at 04:54:30PM +0800, Boqun Feng wrote:
> On Wed, Jul 21, 2021 at 01:21:19PM -0700, Paul E. McKenney wrote:
> > Accesses to ->qsmask are normally protected by ->lock, but there is an
> > exception in the diagnostic code in rcu_check_boost_fail(). This commit
> > therefore applies data_race() to this access to avoid KCSAN complaining
> > about the C-language writes protected by ->lock.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree_stall.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> > index 42847caa3909b..6dd6c9aa3f757 100644
> > --- a/kernel/rcu/tree_stall.h
> > +++ b/kernel/rcu/tree_stall.h
> > @@ -766,7 +766,7 @@ bool rcu_check_boost_fail(unsigned long gp_state, int *cpup)
> >
> > rcu_for_each_leaf_node(rnp) {
> > if (!cpup) {
> > - if (READ_ONCE(rnp->qsmask)) {
> > + if (data_race(READ_ONCE(rnp->qsmask))) {
>
> If the write sides allow normal writes, i.e. allow store tearing, the
> READ_ONCE() here could read incomplete writes, which could be basically
> anything? And we get the same result if we remove the READ_ONCE(),
> don't we? Yes, I know, without the READ_ONCE(), compilers can do
> anything to optimize accesses to rnp->qsmask, but the end result is the
> same as or similar to reading incomplete writes (which is also a result
> of compiler optimization). So if we mark something as data_race(),
> **in theory** it makes no difference with or without the READ_ONCE(),
> so I think maybe we can remove the READ_ONCE() here?
In this particular case, perhaps. But there is also the possibility
of ASSERT_EXCLUSIVE_WRITER() in conjunction with WRITE_ONCE(), and
data_race(READ_ONCE()) handles all such cases properly.
Again, the point here is to prevent the compiler from messing with
the load in strange and unpredictable ways while also telling KCSAN
that this particular read should not be considered to be part of the
concurrent algorithm.
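In other words, something like the following sketch (illustration only,
not a change being proposed here):

	/* Update side, with rnp->lock held: no store tearing, and have
	 * KCSAN check for unexpected concurrent writers.
	 */
	ASSERT_EXCLUSIVE_WRITER(rnp->qsmask);
	WRITE_ONCE(rnp->qsmask, rnp->qsmask & ~mask);

	/* Lockless diagnostic read side: a single untorn load, but one
	 * that KCSAN treats as outside the concurrent algorithm.
	 */
	if (data_race(READ_ONCE(rnp->qsmask)))
		return false;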
Thanx, Paul
> Regards,
> Boqun
>
> > return false;
> > } else {
> > if (READ_ONCE(rnp->gp_tasks))
> > --
> > 2.31.1.189.g2e36527f23
> >
----- On Jul 28, 2021, at 4:28 PM, paulmck [email protected] wrote:
> On Wed, Jul 28, 2021 at 04:03:02PM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 28, 2021, at 3:45 PM, paulmck [email protected] wrote:
>> [...]
>> >
>> > And how about like this?
>> >
>> > Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > commit cb8914dcc6443cca15ce48d937a93c0dfdb114d3
>> > Author: Paul E. McKenney <[email protected]>
>> > Date: Wed Jul 28 12:38:42 2021 -0700
>> >
>> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
>> >
>> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
>> > counter of an incoming CPU if required. It is currently is invoked
>>
>> "is currently is" -> "is currently"
>
> Good catch, fixed!
>
>> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
>> > running, and thus on some other CPU. This makes the per-CPU accesses in
>> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
>> > the running CPU cannot possibly be in dyntick-idle mode, which means
>> > that rcu_dynticks_eqs_online() never has any effect. One could argue
>> > that this means that rcu_dynticks_eqs_online() is unnecessary, however,
>> > removing it makes the CPU-online process vulnerable to slight changes
>> > in the CPU-offline process.
>>
>> Why favor moving this from the prepare_cpu to the cpu_starting hotplug step,
>> rather than using the target cpu's rdp from rcutree_prepare_cpu ? Maybe there
>> was a good reason for having this very early in the prepare_cpu step ?
>
> Some years back, there was a good reason. This reason was that
> rcutree_prepare_cpu() marked the CPU as being online from an RCU
> viewpoint. But now rcu_cpu_starting() is the one that marks the CPU as
> being online, so the ->dynticks check can be deferred to this function.
>
>> Also, the commit message refers to this bug as having no effect because the
>> running CPU cannot possibly be in dyntick-idle mode. I understand that calling
>> this function was indeed effect-less, but then why is it OK for the CPU coming
>> online to skip this call in the first place ? This commit message hints at
>> "slight changes in the CPU-offline process" which could break it, but therer is
>> no explanation of what makes this not an actual bug fix.
>
> Because rcutorture would not have suffered in silence had this
> situation ever arisen.
Testing can usually prove the presence of a bug, but it's rather tricky to prove
the absence of a bug.
>
> I have updated the commit log to answer these questions as shown
> below. Thoughts?
I'm still concerned about one scenario wrt moving rcu_dynticks_eqs_online()
from rcutree_prepare_cpu to rcu_cpu_starting. What happens if an interrupt
handler, or an NMI handler, nests early over the CPU-online startup code ?
AFAIU, this interrupt handler could contain RCU read-side critical sections,
but if the eqs state does not show the CPU as "online", I wonder whether it
will work as expected.
Thanks,
Mathieu
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 516c8c4cc6fce62539f7e0182739812db4591c1d
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Jul 28 12:38:42 2021 -0700
>
> rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
>
> The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
> counter of an incoming CPU when required. It is currently invoked
> from rcutree_prepare_cpu(), which runs before the incoming CPU is
> running, and thus on some other CPU. This makes the per-CPU accesses in
> rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
> the running CPU cannot possibly be in dyntick-idle mode, which means
> that rcu_dynticks_eqs_online() never has any effect.
>
> It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
> only because the CPU-offline process just happens to leave ->dynticks in
> the correct state. After all, if ->dynticks were in the wrong state on a
> just-onlined CPU, rcutorture would complain bitterly the next time that
> CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
> for example, those built by rcutorture scenario TREE04. One could
> argue that this means that rcu_dynticks_eqs_online() is unnecessary,
> however, removing it would make the CPU-online process vulnerable to
> slight changes in the CPU-offline process.
>
> One could also ask why it is safe to move the rcu_dynticks_eqs_online()
> call so late in the CPU-online process. Indeed, there was a time when it
> would not have been safe, which does much to explain its current location.
> However, the marking of a CPU as online from an RCU perspective has long
> since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
> that is required is that ->dynticks be set correctly by the time that
> the CPU is marked as online from an RCU perspective. After all, the RCU
> grace-period kthread does not check to see if offline CPUs are also idle.
> (In case you were curious, this is one reason why there is quiescent-state
> reporting as part of the offlining process.)
>
> This commit therefore moves the call to rcu_dynticks_eqs_online() from
> rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
> to be running on the incoming CPU. The call to this function must of
> course be placed before this rcu_cpu_starting() announces this CPU's
> presence to RCU.
>
> Reported-by: Mathieu Desnoyers <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 0172a5fd6d8de..aa00babdaf544 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
> rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
> rdp->blimit = blimit;
> rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */
> - rcu_dynticks_eqs_online();
> raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
>
> /*
> @@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu)
> mask = rdp->grpmask;
> WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
> WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
> + rcu_dynticks_eqs_online();
> smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
> raw_spin_lock_irqsave_rcu_node(rnp, flags);
> WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Jul 29, 2021 at 10:41:18AM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 28, 2021, at 4:28 PM, paulmck [email protected] wrote:
>
> > On Wed, Jul 28, 2021 at 04:03:02PM -0400, Mathieu Desnoyers wrote:
> >> ----- On Jul 28, 2021, at 3:45 PM, paulmck [email protected] wrote:
> >> [...]
> >> >
> >> > And how about like this?
> >> >
> >> > Thanx, Paul
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > commit cb8914dcc6443cca15ce48d937a93c0dfdb114d3
> >> > Author: Paul E. McKenney <[email protected]>
> >> > Date: Wed Jul 28 12:38:42 2021 -0700
> >> >
> >> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
> >> >
> >> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
> >> > counter of an incoming CPU if required. It is currently is invoked
> >>
> >> "is currently is" -> "is currently"
> >
> > Good catch, fixed!
> >
> >> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
> >> > running, and thus on some other CPU. This makes the per-CPU accesses in
> >> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
> >> > the running CPU cannot possibly be in dyntick-idle mode, which means
> >> > that rcu_dynticks_eqs_online() never has any effect. One could argue
> >> > that this means that rcu_dynticks_eqs_online() is unnecessary, however,
> >> > removing it makes the CPU-online process vulnerable to slight changes
> >> > in the CPU-offline process.
> >>
> >> Why favor moving this from the prepare_cpu to the cpu_starting hotplug step,
> >> rather than using the target cpu's rdp from rcutree_prepare_cpu ? Maybe there
> >> was a good reason for having this very early in the prepare_cpu step ?
> >
> > Some years back, there was a good reason. This reason was that
> > rcutree_prepare_cpu() marked the CPU as being online from an RCU
> > viewpoint. But now rcu_cpu_starting() is the one that marks the CPU as
> > being online, so the ->dynticks check can be deferred to this function.
> >
> >> Also, the commit message refers to this bug as having no effect because the
> >> running CPU cannot possibly be in dyntick-idle mode. I understand that calling
> >> this function was indeed effect-less, but then why is it OK for the CPU coming
> >> online to skip this call in the first place ? This commit message hints at
> >> "slight changes in the CPU-offline process" which could break it, but therer is
> >> no explanation of what makes this not an actual bug fix.
> >
> > Because rcutorture would not have suffered in silence had this
> > situation ever arisen.
>
> Testing can usually prove the presence of a bug, but it's rather tricky to prove
> the absence of bug.
In general, true enough.
But in this particular case, a WARN would have deterministically triggered
the very next time that this CPU found its way either to the idle loop
or to nohz_full usermode execution.
> > I have updated the commit log to answer these questions as shown
> > below. Thoughts?
>
> I'm still concerned about one scenario wrt moving rcu_dynticks_eqs_online()
> from rcutree_prepare_cpu to rcu_cpu_starting. What happens if an interrupt
> handler, or a NMI handler, nests early over the CPU-online startup code ?
> AFAIU, this interrupt handler could contain RCU read-side critical sections,
> but if the eqs state does not show the CPU as "online", I wonder whether it
> will work as expected.
Interrupts are still disabled at this point in the onlining process,
my _irqsave() locks notwithstanding.
You are right about NMI handlers, but there would be much more damage
from an early NMI handler than mere RCU issues. And this can be handled
as described in the next paragraph.
Now, there are architectures (including x86) that need RCU readers
before notify_cpu_starting() time (which is where rcu_cpu_starting()
is invoked by default), and those architectures can (and do) simply
place a call to rcu_cpu_starting() at an earlier appropriate point in
the architecture-specific CPU-bringup code. And this is in fact the
reason for the ->cpu_started check at the beginning of rcu_cpu_starting().
So an architecture using NMIs early in the CPU-bringup code should
invoke rcu_cpu_starting() before enabling NMIs.
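
As an aside, here is a minimal sketch of the ->cpu_started guard being
referred to, roughly the shape it has in kernel/rcu/tree.c of this era;
the surrounding code is elided, so treat it as an illustration rather
than the exact source:

	void rcu_cpu_starting(unsigned int cpu)
	{
		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

		/*
		 * Idempotence: if the architecture already invoked this
		 * function earlier in its bringup code (for example, before
		 * enabling NMIs), the default notify_cpu_starting()-time
		 * call becomes a no-op.
		 */
		if (rdp->cpu_started)
			return;
		rdp->cpu_started = true;

		/*
		 * ... mark the CPU online from RCU's perspective, which with
		 * this patch includes the rcu_dynticks_eqs_online() call ...
		 */
	}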
So why not just be safe and mark the CPU online early in the process?
Because that could result in unbounded grace periods and strange
deadlocks. These deadlocks were broken earlier by code that assumed that
a CPU could not possibly take more than one jiffy to come online, but that
assumption is clearly broken on today's systems, even the bare-metal ones.
In theory, I could change the raw_spin_lock_irqsave_rcu_node() to
raw_spin_lock_rcu_node(), rely on the lockdep_assert_irqs_disabled()
in the matching raw_spin_unlock_rcu_node(), and ditch the "flags"
local variable, but rcu_report_qs_rnp() needs that "flags" argument.
OK, I guess one approach is to get the "flags" value from local_save_flags()
and get rid of the _irqsave and _irq restore. Assuming lockdep is
functional that early in CPU bringup.
But would that really be better than status quo?
Thanx, Paul
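
For concreteness, the not-adopted alternative being weighed here might
look something like the sketch below; the helper name is invented, and
this is purely an illustration of the trade-off, not code that was merged:

	/*
	 * Hypothetical alternative only: interrupts are already disabled
	 * this early in CPU bringup, so capture the flags value that
	 * rcu_report_qs_rnp() wants via local_save_flags() and take the
	 * rcu_node lock without the _irqsave form.
	 */
	static void rcu_cpu_starting_lock_sketch(struct rcu_node *rnp,
						 unsigned long *flags)
	{
		lockdep_assert_irqs_disabled();	/* Bringup runs with irqs off. */
		local_save_flags(*flags);	/* Later consumed by rcu_report_qs_rnp(). */
		raw_spin_lock_rcu_node(rnp);	/* Plain lock, no _irqsave dance. */
	}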
> Thanks,
>
> Mathieu
>
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 516c8c4cc6fce62539f7e0182739812db4591c1d
> > Author: Paul E. McKenney <[email protected]>
> > Date: Wed Jul 28 12:38:42 2021 -0700
> >
> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
> >
> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
> > counter of an incoming CPU when required. It is currently invoked
> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
> > running, and thus on some other CPU. This makes the per-CPU accesses in
> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
> > the running CPU cannot possibly be in dyntick-idle mode, which means
> > that rcu_dynticks_eqs_online() never has any effect.
> >
> > It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
> > only because the CPU-offline process just happens to leave ->dynticks in
> > the correct state. After all, if ->dynticks were in the wrong state on a
> > just-onlined CPU, rcutorture would complain bitterly the next time that
> > CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
> > for example, those built by rcutorture scenario TREE04. One could
> > argue that this means that rcu_dynticks_eqs_online() is unnecessary,
> > however, removing it would make the CPU-online process vulnerable to
> > slight changes in the CPU-offline process.
> >
> > One could also ask why it is safe to move the rcu_dynticks_eqs_online()
> > call so late in the CPU-online process. Indeed, there was a time when it
> > would not have been safe, which does much to explain its current location.
> > However, the marking of a CPU as online from an RCU perspective has long
> > since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
> > that is required is that ->dynticks be set correctly by the time that
> > the CPU is marked as online from an RCU perspective. After all, the RCU
> > grace-period kthread does not check to see if offline CPUs are also idle.
> > (In case you were curious, this is one reason why there is quiescent-state
> > reporting as part of the offlining process.)
> >
> > This commit therefore moves the call to rcu_dynticks_eqs_online() from
> > rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
> > to be running on the incoming CPU. The call to this function must of
> > course be placed before this rcu_cpu_starting() announces this CPU's
> > presence to RCU.
> >
> > Reported-by: Mathieu Desnoyers <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 0172a5fd6d8de..aa00babdaf544 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
> > rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
> > rdp->blimit = blimit;
> > rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */
> > - rcu_dynticks_eqs_online();
> > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> >
> > /*
> > @@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu)
> > mask = rdp->grpmask;
> > WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
> > WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
> > + rcu_dynticks_eqs_online();
> > smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
> > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
----- On Jul 29, 2021, at 11:57 AM, paulmck [email protected] wrote:
> On Thu, Jul 29, 2021 at 10:41:18AM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 28, 2021, at 4:28 PM, paulmck [email protected] wrote:
>>
>> > On Wed, Jul 28, 2021 at 04:03:02PM -0400, Mathieu Desnoyers wrote:
>> >> ----- On Jul 28, 2021, at 3:45 PM, paulmck [email protected] wrote:
>> >> [...]
>> >> >
>> >> > And how about like this?
>> >> >
>> >> > Thanx, Paul
>> >> >
>> >> > ------------------------------------------------------------------------
>> >> >
>> >> > commit cb8914dcc6443cca15ce48d937a93c0dfdb114d3
>> >> > Author: Paul E. McKenney <[email protected]>
>> >> > Date: Wed Jul 28 12:38:42 2021 -0700
>> >> >
>> >> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
>> >> >
>> >> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
>> >> > counter of an incoming CPU if required. It is currently is invoked
>> >>
>> >> "is currently is" -> "is currently"
>> >
>> > Good catch, fixed!
>> >
>> >> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
>> >> > running, and thus on some other CPU. This makes the per-CPU accesses in
>> >> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
>> >> > the running CPU cannot possibly be in dyntick-idle mode, which means
>> >> > that rcu_dynticks_eqs_online() never has any effect. One could argue
>> >> > that this means that rcu_dynticks_eqs_online() is unnecessary, however,
>> >> > removing it makes the CPU-online process vulnerable to slight changes
>> >> > in the CPU-offline process.
>> >>
>> >> Why favor moving this from the prepare_cpu to the cpu_starting hotplug step,
>> >> rather than using the target cpu's rdp from rcutree_prepare_cpu ? Maybe there
>> >> was a good reason for having this very early in the prepare_cpu step ?
>> >
>> > Some years back, there was a good reason. This reason was that
>> > rcutree_prepare_cpu() marked the CPU as being online from an RCU
>> > viewpoint. But now rcu_cpu_starting() is the one that marks the CPU as
>> > being online, so the ->dynticks check can be deferred to this function.
>> >
>> >> Also, the commit message refers to this bug as having no effect because the
>> >> running CPU cannot possibly be in dyntick-idle mode. I understand that calling
>> >> this function was indeed effect-less, but then why is it OK for the CPU coming
>> >> online to skip this call in the first place ? This commit message hints at
>> >> "slight changes in the CPU-offline process" which could break it, but therer is
>> >> no explanation of what makes this not an actual bug fix.
>> >
>> > Because rcutorture would not have suffered in silence had this
>> > situation ever arisen.
>>
>> Testing can usually prove the presence of a bug, but it's rather tricky to prove
>> the absence of bug.
>
> In general, true enough.
>
> But in this particular case, a WARN would have deterministically triggered
> the very next time that this CPU found its way either to the idle loop
> or to nohz_full usermode execution.
>
>> > I have updated the commit log to answer these questions as shown
>> > below. Thoughts?
>>
>> I'm still concerned about one scenario wrt moving rcu_dynticks_eqs_online()
>> from rcutree_prepare_cpu to rcu_cpu_starting. What happens if an interrupt
>> handler, or a NMI handler, nests early over the CPU-online startup code ?
>> AFAIU, this interrupt handler could contain RCU read-side critical sections,
>> but if the eqs state does not show the CPU as "online", I wonder whether it
>> will work as expected.
>
> Interrupts are still disabled at this point in the onlining process,
> my _irqsave() locks notwithstanding.
>
> You are right about NMI handlers, but there would be much more damage
> from an early NMI handler than mere RCU issues. And this can be handled
> as described in the next paragraph.
>
> Now, there are architectures (including x86) that need RCU readers
> before notify_cpu_starting() time (which is where rcu_cpu_starting()
> is invoked by default, and those architectures can (and do) simply
> place a call to rcu_cpu_starting() at an earlier appropriate point in
> the architecture-specific CPU-bringup code. And this is in fact the
> reason for the ->cpu_started check at the beginning of rcu_cpu_starting().
> So an architecture using NMIs early in the CPU-bringup code should
> invoke rcu_cpu_starting() before enabling NMIs.
>
> So why not just be safe and mark the CPU online early in the process?
>
> Because that could result in unbounded grace periods and strange
> deadlocks. These deadlocks were broken earlier by code that assumed that
> a CPU could not possibly take more than one jiffy to come online, but that
> assumption is clearly broken on todays systems, even the bare-metal ones.
>
> In theory, I could change the raw_spin_lock_irqsave_rcu_node() to
> raw_spin_lock_rcu_node(), rely on the lockdep_assert_irqs_disabled()
> in the matching raw_spin_unlock_rcu_node(), and ditch the "flags"
> local variable, but rcu_report_qs_rnp() needs that "flags" argument.
>
> OK, I guess one approach is to get the "flags" value from local_save_flags()
> and get rid of the _irqsave and _irq restore. Assuming lockdep is
> functional that early in CPU bringup.
>
> But would that really be better than status quo?
I'm OK with your explanation that the interrupt and NMI scenarios
are correctly handled, so moving this call from prepare_cpu to cpu_starting
is fine with me.
Thanks,
Mathieu
>
> Thanx, Paul
>
>> Thanks,
>>
>> Mathieu
>>
>> >
>> > Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > commit 516c8c4cc6fce62539f7e0182739812db4591c1d
>> > Author: Paul E. McKenney <[email protected]>
>> > Date: Wed Jul 28 12:38:42 2021 -0700
>> >
>> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
>> >
>> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
>> > counter of an incoming CPU when required. It is currently invoked
>> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
>> > running, and thus on some other CPU. This makes the per-CPU accesses in
>> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
>> > the running CPU cannot possibly be in dyntick-idle mode, which means
>> > that rcu_dynticks_eqs_online() never has any effect.
>> >
>> > It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
>> > only because the CPU-offline process just happens to leave ->dynticks in
>> > the correct state. After all, if ->dynticks were in the wrong state on a
>> > just-onlined CPU, rcutorture would complain bitterly the next time that
>> > CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
>> > for example, those built by rcutorture scenario TREE04. One could
>> > argue that this means that rcu_dynticks_eqs_online() is unnecessary,
>> > however, removing it would make the CPU-online process vulnerable to
>> > slight changes in the CPU-offline process.
>> >
>> > One could also ask why it is safe to move the rcu_dynticks_eqs_online()
>> > call so late in the CPU-online process. Indeed, there was a time when it
>> > would not have been safe, which does much to explain its current location.
>> > However, the marking of a CPU as online from an RCU perspective has long
>> > since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
>> > that is required is that ->dynticks be set correctly by the time that
>> > the CPU is marked as online from an RCU perspective. After all, the RCU
>> > grace-period kthread does not check to see if offline CPUs are also idle.
>> > (In case you were curious, this is one reason why there is quiescent-state
>> > reporting as part of the offlining process.)
>> >
>> > This commit therefore moves the call to rcu_dynticks_eqs_online() from
>> > rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
>> > to be running on the incoming CPU. The call to this function must of
>> > course be placed before this rcu_cpu_starting() announces this CPU's
>> > presence to RCU.
>> >
>> > Reported-by: Mathieu Desnoyers <[email protected]>
>> > Signed-off-by: Paul E. McKenney <[email protected]>
>> >
>> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> > index 0172a5fd6d8de..aa00babdaf544 100644
>> > --- a/kernel/rcu/tree.c
>> > +++ b/kernel/rcu/tree.c
>> > @@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
>> > rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
>> > rdp->blimit = blimit;
>> > rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */
>> > - rcu_dynticks_eqs_online();
>> > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
>> >
>> > /*
>> > @@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu)
>> > mask = rdp->grpmask;
>> > WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
>> > WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
>> > + rcu_dynticks_eqs_online();
>> > smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
>> > raw_spin_lock_irqsave_rcu_node(rnp, flags);
>> > WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Jul 29, 2021 at 01:41:41PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 29, 2021, at 11:57 AM, paulmck [email protected] wrote:
>
> > On Thu, Jul 29, 2021 at 10:41:18AM -0400, Mathieu Desnoyers wrote:
> >> ----- On Jul 28, 2021, at 4:28 PM, paulmck [email protected] wrote:
> >>
> >> > On Wed, Jul 28, 2021 at 04:03:02PM -0400, Mathieu Desnoyers wrote:
> >> >> ----- On Jul 28, 2021, at 3:45 PM, paulmck [email protected] wrote:
> >> >> [...]
> >> >> >
> >> >> > And how about like this?
> >> >> >
> >> >> > Thanx, Paul
> >> >> >
> >> >> > ------------------------------------------------------------------------
> >> >> >
> >> >> > commit cb8914dcc6443cca15ce48d937a93c0dfdb114d3
> >> >> > Author: Paul E. McKenney <[email protected]>
> >> >> > Date: Wed Jul 28 12:38:42 2021 -0700
> >> >> >
> >> >> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
> >> >> >
> >> >> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
> >> >> > counter of an incoming CPU if required. It is currently is invoked
> >> >>
> >> >> "is currently is" -> "is currently"
> >> >
> >> > Good catch, fixed!
> >> >
> >> >> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
> >> >> > running, and thus on some other CPU. This makes the per-CPU accesses in
> >> >> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
> >> >> > the running CPU cannot possibly be in dyntick-idle mode, which means
> >> >> > that rcu_dynticks_eqs_online() never has any effect. One could argue
> >> >> > that this means that rcu_dynticks_eqs_online() is unnecessary, however,
> >> >> > removing it makes the CPU-online process vulnerable to slight changes
> >> >> > in the CPU-offline process.
> >> >>
> >> >> Why favor moving this from the prepare_cpu to the cpu_starting hotplug step,
> >> >> rather than using the target cpu's rdp from rcutree_prepare_cpu ? Maybe there
> >> >> was a good reason for having this very early in the prepare_cpu step ?
> >> >
> >> > Some years back, there was a good reason. This reason was that
> >> > rcutree_prepare_cpu() marked the CPU as being online from an RCU
> >> > viewpoint. But now rcu_cpu_starting() is the one that marks the CPU as
> >> > being online, so the ->dynticks check can be deferred to this function.
> >> >
> >> >> Also, the commit message refers to this bug as having no effect because the
> >> >> running CPU cannot possibly be in dyntick-idle mode. I understand that calling
> >> >> this function was indeed effect-less, but then why is it OK for the CPU coming
> >> >> online to skip this call in the first place ? This commit message hints at
> >> >> "slight changes in the CPU-offline process" which could break it, but therer is
> >> >> no explanation of what makes this not an actual bug fix.
> >> >
> >> > Because rcutorture would not have suffered in silence had this
> >> > situation ever arisen.
> >>
> >> Testing can usually prove the presence of a bug, but it's rather tricky to prove
> >> the absence of bug.
> >
> > In general, true enough.
> >
> > But in this particular case, a WARN would have deterministically triggered
> > the very next time that this CPU found its way either to the idle loop
> > or to nohz_full usermode execution.
> >
> >> > I have updated the commit log to answer these questions as shown
> >> > below. Thoughts?
> >>
> >> I'm still concerned about one scenario wrt moving rcu_dynticks_eqs_online()
> >> from rcutree_prepare_cpu to rcu_cpu_starting. What happens if an interrupt
> >> handler, or a NMI handler, nests early over the CPU-online startup code ?
> >> AFAIU, this interrupt handler could contain RCU read-side critical sections,
> >> but if the eqs state does not show the CPU as "online", I wonder whether it
> >> will work as expected.
> >
> > Interrupts are still disabled at this point in the onlining process,
> > my _irqsave() locks notwithstanding.
> >
> > You are right about NMI handlers, but there would be much more damage
> > from an early NMI handler than mere RCU issues. And this can be handled
> > as described in the next paragraph.
> >
> > Now, there are architectures (including x86) that need RCU readers
> > before notify_cpu_starting() time (which is where rcu_cpu_starting()
> > is invoked by default, and those architectures can (and do) simply
> > place a call to rcu_cpu_starting() at an earlier appropriate point in
> > the architecture-specific CPU-bringup code. And this is in fact the
> > reason for the ->cpu_started check at the beginning of rcu_cpu_starting().
> > So an architecture using NMIs early in the CPU-bringup code should
> > invoke rcu_cpu_starting() before enabling NMIs.
> >
> > So why not just be safe and mark the CPU online early in the process?
> >
> > Because that could result in unbounded grace periods and strange
> > deadlocks. These deadlocks were broken earlier by code that assumed that
> > a CPU could not possibly take more than one jiffy to come online, but that
> > assumption is clearly broken on todays systems, even the bare-metal ones.
> >
> > In theory, I could change the raw_spin_lock_irqsave_rcu_node() to
> > raw_spin_lock_rcu_node(), rely on the lockdep_assert_irqs_disabled()
> > in the matching raw_spin_unlock_rcu_node(), and ditch the "flags"
> > local variable, but rcu_report_qs_rnp() needs that "flags" argument.
> >
> > OK, I guess one approach is to get the "flags" value from local_save_flags()
> > and get rid of the _irqsave and _irq restore. Assuming lockdep is
> > functional that early in CPU bringup.
> >
> > But would that really be better than status quo?
>
> I'm OK with your explanation about the fact that interrupts and NMIs scenarios
> are correctly handled, so moving this call from prepare_cpu to cpu_starting
> is fine with me.
I will add a "Link:" to this conversation.
May I also add your "Acked-by" or similar?
Thanx, Paul
> Thanks,
>
> Mathieu
>
> >
> > Thanx, Paul
> >
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> >
> >> > Thanx, Paul
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > commit 516c8c4cc6fce62539f7e0182739812db4591c1d
> >> > Author: Paul E. McKenney <[email protected]>
> >> > Date: Wed Jul 28 12:38:42 2021 -0700
> >> >
> >> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
> >> >
> >> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
> >> > counter of an incoming CPU when required. It is currently invoked
> >> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
> >> > running, and thus on some other CPU. This makes the per-CPU accesses in
> >> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
> >> > the running CPU cannot possibly be in dyntick-idle mode, which means
> >> > that rcu_dynticks_eqs_online() never has any effect.
> >> >
> >> > It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
> >> > only because the CPU-offline process just happens to leave ->dynticks in
> >> > the correct state. After all, if ->dynticks were in the wrong state on a
> >> > just-onlined CPU, rcutorture would complain bitterly the next time that
> >> > CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
> >> > for example, those built by rcutorture scenario TREE04. One could
> >> > argue that this means that rcu_dynticks_eqs_online() is unnecessary,
> >> > however, removing it would make the CPU-online process vulnerable to
> >> > slight changes in the CPU-offline process.
> >> >
> >> > One could also ask why it is safe to move the rcu_dynticks_eqs_online()
> >> > call so late in the CPU-online process. Indeed, there was a time when it
> >> > would not have been safe, which does much to explain its current location.
> >> > However, the marking of a CPU as online from an RCU perspective has long
> >> > since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
> >> > that is required is that ->dynticks be set correctly by the time that
> >> > the CPU is marked as online from an RCU perspective. After all, the RCU
> >> > grace-period kthread does not check to see if offline CPUs are also idle.
> >> > (In case you were curious, this is one reason why there is quiescent-state
> >> > reporting as part of the offlining process.)
> >> >
> >> > This commit therefore moves the call to rcu_dynticks_eqs_online() from
> >> > rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
> >> > to be running on the incoming CPU. The call to this function must of
> >> > course be placed before this rcu_cpu_starting() announces this CPU's
> >> > presence to RCU.
> >> >
> >> > Reported-by: Mathieu Desnoyers <[email protected]>
> >> > Signed-off-by: Paul E. McKenney <[email protected]>
> >> >
> >> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> > index 0172a5fd6d8de..aa00babdaf544 100644
> >> > --- a/kernel/rcu/tree.c
> >> > +++ b/kernel/rcu/tree.c
> >> > @@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
> >> > rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
> >> > rdp->blimit = blimit;
> >> > rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */
> >> > - rcu_dynticks_eqs_online();
> >> > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> >> >
> >> > /*
> >> > @@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu)
> >> > mask = rdp->grpmask;
> >> > WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
> >> > WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
> >> > + rcu_dynticks_eqs_online();
> >> > smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
> >> > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> >> > WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> > > http://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
----- On Jul 29, 2021, at 2:05 PM, paulmck [email protected] wrote:
> On Thu, Jul 29, 2021 at 01:41:41PM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 29, 2021, at 11:57 AM, paulmck [email protected] wrote:
>>
>> > On Thu, Jul 29, 2021 at 10:41:18AM -0400, Mathieu Desnoyers wrote:
>> >> ----- On Jul 28, 2021, at 4:28 PM, paulmck [email protected] wrote:
>> >>
>> >> > On Wed, Jul 28, 2021 at 04:03:02PM -0400, Mathieu Desnoyers wrote:
>> >> >> ----- On Jul 28, 2021, at 3:45 PM, paulmck [email protected] wrote:
>> >> >> [...]
>> >> >> >
>> >> >> > And how about like this?
>> >> >> >
>> >> >> > Thanx, Paul
>> >> >> >
>> >> >> > ------------------------------------------------------------------------
>> >> >> >
>> >> >> > commit cb8914dcc6443cca15ce48d937a93c0dfdb114d3
>> >> >> > Author: Paul E. McKenney <[email protected]>
>> >> >> > Date: Wed Jul 28 12:38:42 2021 -0700
>> >> >> >
>> >> >> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
>> >> >> >
>> >> >> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
>> >> >> > counter of an incoming CPU if required. It is currently is invoked
>> >> >>
>> >> >> "is currently is" -> "is currently"
>> >> >
>> >> > Good catch, fixed!
>> >> >
>> >> >> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
>> >> >> > running, and thus on some other CPU. This makes the per-CPU accesses in
>> >> >> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
>> >> >> > the running CPU cannot possibly be in dyntick-idle mode, which means
>> >> >> > that rcu_dynticks_eqs_online() never has any effect. One could argue
>> >> >> > that this means that rcu_dynticks_eqs_online() is unnecessary, however,
>> >> >> > removing it makes the CPU-online process vulnerable to slight changes
>> >> >> > in the CPU-offline process.
>> >> >>
>> >> >> Why favor moving this from the prepare_cpu to the cpu_starting hotplug step,
>> >> >> rather than using the target cpu's rdp from rcutree_prepare_cpu ? Maybe there
>> >> >> was a good reason for having this very early in the prepare_cpu step ?
>> >> >
>> >> > Some years back, there was a good reason. This reason was that
>> >> > rcutree_prepare_cpu() marked the CPU as being online from an RCU
>> >> > viewpoint. But now rcu_cpu_starting() is the one that marks the CPU as
>> >> > being online, so the ->dynticks check can be deferred to this function.
>> >> >
>> >> >> Also, the commit message refers to this bug as having no effect because the
>> >> >> running CPU cannot possibly be in dyntick-idle mode. I understand that calling
>> >> >> this function was indeed effect-less, but then why is it OK for the CPU coming
>> >> >> online to skip this call in the first place ? This commit message hints at
>> >> >> "slight changes in the CPU-offline process" which could break it, but therer is
>> >> >> no explanation of what makes this not an actual bug fix.
>> >> >
>> >> > Because rcutorture would not have suffered in silence had this
>> >> > situation ever arisen.
>> >>
>> >> Testing can usually prove the presence of a bug, but it's rather tricky to prove
>> >> the absence of bug.
>> >
>> > In general, true enough.
>> >
>> > But in this particular case, a WARN would have deterministically triggered
>> > the very next time that this CPU found its way either to the idle loop
>> > or to nohz_full usermode execution.
>> >
>> >> > I have updated the commit log to answer these questions as shown
>> >> > below. Thoughts?
>> >>
>> >> I'm still concerned about one scenario wrt moving rcu_dynticks_eqs_online()
>> >> from rcutree_prepare_cpu to rcu_cpu_starting. What happens if an interrupt
>> >> handler, or a NMI handler, nests early over the CPU-online startup code ?
>> >> AFAIU, this interrupt handler could contain RCU read-side critical sections,
>> >> but if the eqs state does not show the CPU as "online", I wonder whether it
>> >> will work as expected.
>> >
>> > Interrupts are still disabled at this point in the onlining process,
>> > my _irqsave() locks notwithstanding.
>> >
>> > You are right about NMI handlers, but there would be much more damage
>> > from an early NMI handler than mere RCU issues. And this can be handled
>> > as described in the next paragraph.
>> >
>> > Now, there are architectures (including x86) that need RCU readers
>> > before notify_cpu_starting() time (which is where rcu_cpu_starting()
>> > is invoked by default, and those architectures can (and do) simply
>> > place a call to rcu_cpu_starting() at an earlier appropriate point in
>> > the architecture-specific CPU-bringup code. And this is in fact the
>> > reason for the ->cpu_started check at the beginning of rcu_cpu_starting().
>> > So an architecture using NMIs early in the CPU-bringup code should
>> > invoke rcu_cpu_starting() before enabling NMIs.
>> >
>> > So why not just be safe and mark the CPU online early in the process?
>> >
>> > Because that could result in unbounded grace periods and strange
>> > deadlocks. These deadlocks were broken earlier by code that assumed that
>> > a CPU could not possibly take more than one jiffy to come online, but that
>> > assumption is clearly broken on todays systems, even the bare-metal ones.
>> >
>> > In theory, I could change the raw_spin_lock_irqsave_rcu_node() to
>> > raw_spin_lock_rcu_node(), rely on the lockdep_assert_irqs_disabled()
>> > in the matching raw_spin_unlock_rcu_node(), and ditch the "flags"
>> > local variable, but rcu_report_qs_rnp() needs that "flags" argument.
>> >
>> > OK, I guess one approach is to get the "flags" value from local_save_flags()
>> > and get rid of the _irqsave and _irq restore. Assuming lockdep is
>> > functional that early in CPU bringup.
>> >
>> > But would that really be better than status quo?
>>
>> I'm OK with your explanation about the fact that interrupts and NMIs scenarios
>> are correctly handled, so moving this call from prepare_cpu to cpu_starting
>> is fine with me.
>
> I will add a "Link:" to this conversation.
>
> May I also add your "Acked-by" or similar?
Of course, feel free to add my Acked-by.
Thanks,
Mathieu
>
> Thanx, Paul
>
>> Thanks,
>>
>> Mathieu
>>
>> >
>> > Thanx, Paul
>> >
>> >> Thanks,
>> >>
>> >> Mathieu
>> >>
>> >> >
>> >> > Thanx, Paul
>> >> >
>> >> > ------------------------------------------------------------------------
>> >> >
>> >> > commit 516c8c4cc6fce62539f7e0182739812db4591c1d
>> >> > Author: Paul E. McKenney <[email protected]>
>> >> > Date: Wed Jul 28 12:38:42 2021 -0700
>> >> >
>> >> > rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
>> >> >
>> >> > The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
>> >> > counter of an incoming CPU when required. It is currently invoked
>> >> > from rcutree_prepare_cpu(), which runs before the incoming CPU is
>> >> > running, and thus on some other CPU. This makes the per-CPU accesses in
>> >> > rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
>> >> > the running CPU cannot possibly be in dyntick-idle mode, which means
>> >> > that rcu_dynticks_eqs_online() never has any effect.
>> >> >
>> >> > It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
>> >> > only because the CPU-offline process just happens to leave ->dynticks in
>> >> > the correct state. After all, if ->dynticks were in the wrong state on a
>> >> > just-onlined CPU, rcutorture would complain bitterly the next time that
>> >> > CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
>> >> > for example, those built by rcutorture scenario TREE04. One could
>> >> > argue that this means that rcu_dynticks_eqs_online() is unnecessary,
>> >> > however, removing it would make the CPU-online process vulnerable to
>> >> > slight changes in the CPU-offline process.
>> >> >
>> >> > One could also ask why it is safe to move the rcu_dynticks_eqs_online()
>> >> > call so late in the CPU-online process. Indeed, there was a time when it
>> >> > would not have been safe, which does much to explain its current location.
>> >> > However, the marking of a CPU as online from an RCU perspective has long
>> >> > since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
>> >> > that is required is that ->dynticks be set correctly by the time that
>> >> > the CPU is marked as online from an RCU perspective. After all, the RCU
>> >> > grace-period kthread does not check to see if offline CPUs are also idle.
>> >> > (In case you were curious, this is one reason why there is quiescent-state
>> >> > reporting as part of the offlining process.)
>> >> >
>> >> > This commit therefore moves the call to rcu_dynticks_eqs_online() from
>> >> > rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
>> >> > to be running on the incoming CPU. The call to this function must of
>> >> > course be placed before this rcu_cpu_starting() announces this CPU's
>> >> > presence to RCU.
>> >> >
>> >> > Reported-by: Mathieu Desnoyers <[email protected]>
>> >> > Signed-off-by: Paul E. McKenney <[email protected]>
>> >> >
>> >> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> >> > index 0172a5fd6d8de..aa00babdaf544 100644
>> >> > --- a/kernel/rcu/tree.c
>> >> > +++ b/kernel/rcu/tree.c
>> >> > @@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
>> >> > rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
>> >> > rdp->blimit = blimit;
>> >> > rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */
>> >> > - rcu_dynticks_eqs_online();
>> >> > raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
>> >> >
>> >> > /*
>> >> > @@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu)
>> >> > mask = rdp->grpmask;
>> >> > WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
>> >> > WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
>> >> > + rcu_dynticks_eqs_online();
>> >> > smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
>> >> > raw_spin_lock_irqsave_rcu_node(rnp, flags);
>> >> > WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>> >>
>> >> --
>> >> Mathieu Desnoyers
>> >> EfficiOS Inc.
>> > > http://www.efficios.com
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Jul 29, 2021 at 07:03:17AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 29, 2021 at 04:54:30PM +0800, Boqun Feng wrote:
> > On Wed, Jul 21, 2021 at 01:21:19PM -0700, Paul E. McKenney wrote:
> > > Accesses to ->qsmask are normally protected by ->lock, but there is an
> > > exception in the diagnostic code in rcu_check_boost_fail(). This commit
> > > therefore applies data_race() to this access to avoid KCSAN complaining
> > > about the C-language writes protected by ->lock.
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> > > kernel/rcu/tree_stall.h | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> > > index 42847caa3909b..6dd6c9aa3f757 100644
> > > --- a/kernel/rcu/tree_stall.h
> > > +++ b/kernel/rcu/tree_stall.h
> > > @@ -766,7 +766,7 @@ bool rcu_check_boost_fail(unsigned long gp_state, int *cpup)
> > >
> > > rcu_for_each_leaf_node(rnp) {
> > > if (!cpup) {
> > > - if (READ_ONCE(rnp->qsmask)) {
> > > + if (data_race(READ_ONCE(rnp->qsmask))) {
> >
> > If the write sides allow normal writes, i.e. allowing store tearing, the
> > READ_ONCE() here could read incomplete writes, which could be anything
> > basically? And we get the same result if we remove the READ_ONCE(),
> > don't we? Yes, I know, without the READ_ONCE(), compilers can do
> > anything to optimize on rnp->qsmask, but the end result is same or
> > similar to reading incomplete writes (which is also a result by compiler
> > optimization). So if we mark something as data_race(), **in theory**, it
> > makes no difference with or without the READ_ONCE(), so I think maybe we
> > can remove the READ_ONCE() here?
>
> In this particular case, perhaps. But there is also the possibility
> of ASSERT_EXCLUSIVE_WRITER() in conjunction with WRITE_ONCE(), and
> data_race(READ_ONCE(()) handles all such cases properly.
>
> Again, the point here is to prevent the compiler from messing with
> the load in strange and unpredictable ways while also telling KCSAN
> that this particular read should not be considered to be part of the
> concurrent algorithm.
>
Thanks for explaining this ;-)
I guess I'm just a little concerned that things may end up using
data_race(READ_ONCE()) everywhere instead of data_race(), because who
doesn't want his/her racy code to be 1) not reported by KCSAN (using
data_race()), and 2) less racy (using READ_ONCE())? ;-)
Regards,
Boqun
> Thanx, Paul
>
> > Regards,
> > Boqun
> >
> > > return false;
> > > } else {
> > > if (READ_ONCE(rnp->gp_tasks))
> > > --
> > > 2.31.1.189.g2e36527f23
> > >
On Fri, Jul 30, 2021 at 10:28:58AM +0800, Boqun Feng wrote:
> On Thu, Jul 29, 2021 at 07:03:17AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 29, 2021 at 04:54:30PM +0800, Boqun Feng wrote:
> > > On Wed, Jul 21, 2021 at 01:21:19PM -0700, Paul E. McKenney wrote:
> > > > Accesses to ->qsmask are normally protected by ->lock, but there is an
> > > > exception in the diagnostic code in rcu_check_boost_fail(). This commit
> > > > therefore applies data_race() to this access to avoid KCSAN complaining
> > > > about the C-language writes protected by ->lock.
> > > >
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > ---
> > > > kernel/rcu/tree_stall.h | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> > > > index 42847caa3909b..6dd6c9aa3f757 100644
> > > > --- a/kernel/rcu/tree_stall.h
> > > > +++ b/kernel/rcu/tree_stall.h
> > > > @@ -766,7 +766,7 @@ bool rcu_check_boost_fail(unsigned long gp_state, int *cpup)
> > > >
> > > > rcu_for_each_leaf_node(rnp) {
> > > > if (!cpup) {
> > > > - if (READ_ONCE(rnp->qsmask)) {
> > > > + if (data_race(READ_ONCE(rnp->qsmask))) {
> > >
> > > If the write sides allow normal writes, i.e. allowing store tearing, the
> > > READ_ONCE() here could read incomplete writes, which could be anything
> > > basically? And we get the same result if we remove the READ_ONCE(),
> > > don't we? Yes, I know, without the READ_ONCE(), compilers can do
> > > anything to optimize on rnp->qsmask, but the end result is same or
> > > similar to reading incomplete writes (which is also a result by compiler
> > > optimization). So if we mark something as data_race(), **in theory**, it
> > > makes no difference with or without the READ_ONCE(), so I think maybe we
> > > can remove the READ_ONCE() here?
> >
> > In this particular case, perhaps. But there is also the possibility
> > of ASSERT_EXCLUSIVE_WRITER() in conjunction with WRITE_ONCE(), and
> > data_race(READ_ONCE(()) handles all such cases properly.
> >
> > Again, the point here is to prevent the compiler from messing with
> > the load in strange and unpredictable ways while also telling KCSAN
> > that this particular read should not be considered to be part of the
> > concurrent algorithm.
>
> Thanks for explaining this ;-)
>
> I guess I'm just a little concerned that things may end up with using
> data_race(READ_ONCE()) everywhere instead of data_race(), because who
> doesn't want his/her racy code to be 1) not reported by KCSan (using
> data_race()), and 2) less racy (using READ_ONCE())? ;-)
There always is the risk of a too-attractive "easy way out",
isn't there? Just make sure to understand exactly where
the "out" leads longer term. ;-)
Thanx, Paul
> Regards,
> Boqun
>
> > Thanx, Paul
> >
> > > Regards,
> > > Boqun
> > >
> > > > return false;
> > > > } else {
> > > > if (READ_ONCE(rnp->gp_tasks))
> > > > --
> > > > 2.31.1.189.g2e36527f23
> > > >
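
As an aside to the exchange above, here is a self-contained sketch of the
data_race(READ_ONCE()) combination under discussion; the names foo,
foo_lock, foo_update(), and foo_diag_nonzero() are invented for
illustration, and this is not a proposed patch:

	#include <linux/compiler.h>	/* data_race(), READ_ONCE(), WRITE_ONCE() */
	#include <linux/kcsan-checks.h>	/* ASSERT_EXCLUSIVE_WRITER() */
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(foo_lock);
	static unsigned long foo;

	/* Normal updater, always under the lock. */
	static void foo_update(unsigned long mask)
	{
		spin_lock(&foo_lock);
		ASSERT_EXCLUSIVE_WRITER(foo);	/* KCSAN: complain about any concurrent writer. */
		WRITE_ONCE(foo, foo | mask);	/* Marked write: no store tearing. */
		spin_unlock(&foo_lock);
	}

	/* Lockless diagnostic-only reader. */
	static bool foo_diag_nonzero(void)
	{
		/*
		 * READ_ONCE() keeps the compiler from tearing or refetching
		 * the load; data_race() tells KCSAN that this read is not
		 * part of the concurrent algorithm, so it is not reported
		 * against the updater above.
		 */
		return !!data_race(READ_ONCE(foo));
	}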
On Thu, Jul 29, 2021 at 12:53:31PM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 29, 2021 at 03:58:04PM +0800, Boqun Feng wrote:
> > > The following litmus test, also adapted from the one supplied off-list
> > > by Frederic Weisbecker, models the RCU grace-period kthread detecting
> > > a non-idle CPU that is concurrently transitioning to idle:
> > >
> > > C dynticks-into-idle
> > >
> > > {
> > > DYNTICKS=1; (* Initially non-idle. *)
> > > }
> > >
> > > P0(int *X, int *DYNTICKS)
> > > {
> > > int dynticks;
> > >
> > > // Non-idle.
> > > WRITE_ONCE(*X, 1);
> > > dynticks = READ_ONCE(*DYNTICKS);
> > > smp_store_release(DYNTICKS, dynticks + 1);
> > > smp_mb();
> >
> > this smp_mb() is not needed, as we rely on the release-acquire pair to
> > provide the ordering.
> >
> > This means that if we use different implementations (one w/ smp_mb(),
> > another w/o) rcu_dynticks_inc() for idle-to-nonidle and nonidle-to-idle,
> > we could save a smp_mb(). Thoughts?
>
> That's exactly what I wanted to propose but everybody was sober. Namely order
> only the RCU read side critical sections before/after idle together:
>
> READ side critical section
> //enter idle
> //exit idle
> smp_mb()
> READ side critical section
>
> instead of ordering the RCU read side critical section before idle - with the RCU
> idle extended quiescent state - with the RCU read side critical section after idle:
>
> READ side critical section
> //enter idle
> smp_mb();
> //exit idle
> smp_mb()
> READ side critical section
>
> So the side effect now is that if the write side waits for the reader to
> report a quiescent state and scans its dynticks state and see it's not yet in
> RCU idle mode, then later on when the read side enters in RCU idle mode we
> expect it to see the write side updates.
> But after the barrier removal the reader will only see the write side update
> once we exit RCU idle mode.
>
> So the following may happen:
>
> P0(int *X, int *Y, int *DYNTICKS)
> {
> int y;
>
> WRITE_ONCE(*X, 1);
> smp_store_release(DYNTICKS, 1); // rcu_eqs_enter
> //smp_mb() not there anymore
> y = READ_ONCE(*Y);
> smp_store_release(DYNTICKS, 2); // rcu_eqs_exit()
> smp_mb();
> }
>
> P1(int *X, int *Y, int *DYNTICKS)
> {
> int x;
> int dynticks;
>
> WRITE_ONCE(*Y, 1);
> smp_mb();
> dynticks = smp_load_acquire(DYNTICKS);
> x = READ_ONCE(*X);
> }
>
> exists (1:x=0 /\ 0:y=0)
>
Thanks for the detailed explanation ;-)
> Theoretically it shouldn't matter because the RCU idle mode isn't
> supposed to perform RCU reads. But theoretically again once a CPU
Right, in a LOCKDEP=y kernel, rcu_read_lock_held() requires
rcu_is_watching(), so rcu_dereference() is not allowed in idle mode
unless RCU_NONIDLE() or rcu_irq_enter_irqson() is used to temporarily
exit the idle mode.
> has reported a quiescent state, any further read is expected to see
> the latest updates from the write side.
Yes, but in your above case, doesn't P0 already reach a quiescent
state even before the WRITE_ONCE()? IOW, that case is similar to the
following:
P0(int *X, int *Y)
{
// in QS
WRITE_ONCE(*X, 1);
y = READ_ONCE(*Y);
}
P1(int *X, int *Y)
{
WRITE_ONCE(*Y, 1);
synchronize_rcu();
x = READ_ONCE(*X);
}
exists (1:x=0 /\ 0:y=0)
And RCU doesn't guarantee the READ_ONCE() on P0 sees the WRITE_ONCE() on
P1.
>
> So I don't know what to think. In practice I believe it's not a big deal
> because RCU idle mode code is usually a fragile path that just handles
> cpuidle code to put the CPU in/out low power mode. But what about dragons...
My current thought is that if the cpuidle code requires some ordering
with synchronize_rcu(), RCU_NONIDLE() should be used, and ordering can
be guaranteed in this case (RCU_NONIDLE() has a rcu_eqs_exit() in it).
Otherwise, it's a bug.
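
A minimal sketch of the pattern being recommended, with invented names
(shared_state and idle_diag_read()); RCU_NONIDLE() exits the RCU-idle
extended quiescent state around its argument, which is what supplies the
ordering against synchronize_rcu():

	#include <linux/rcupdate.h>	/* RCU_NONIDLE() */
	#include <linux/printk.h>

	static int shared_state;	/* Invented example variable. */

	/*
	 * Diagnostic read from deep-idle code that must be ordered against
	 * updaters that use synchronize_rcu().
	 */
	static void idle_diag_read(void)
	{
		int val;

		RCU_NONIDLE(val = READ_ONCE(shared_state));
		pr_info("shared_state=%d\n", val);
	}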
So it looks like we can drop that smp_mb() in rcu_eqs_enter()? At least, we
can say something in the doc to prevent people from relying on the
ordering between normal reads in RCU idle mode and synchronize_rcu().
Thoughts?
Regards,
Boqun
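
For anyone who wants to check the claim above with herd7 and the
Linux-kernel memory model, here is a complete rendering of the sketch,
with the declarations and init block filled in and an invented test name;
the "exists" state is reachable, which is exactly the point that
synchronize_rcu() provides no ordering for a CPU that is already in a
quiescent state:

	C qs-before-write

	{
	}

	P0(int *X, int *Y)
	{
		int y;

		WRITE_ONCE(*X, 1);
		y = READ_ONCE(*Y);
	}

	P1(int *X, int *Y)
	{
		int x;

		WRITE_ONCE(*Y, 1);
		synchronize_rcu();
		x = READ_ONCE(*X);
	}

	exists (1:x=0 /\ 0:y=0)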
On Fri, Jul 30, 2021 at 01:56:41PM +0800, Boqun Feng wrote:
> On Thu, Jul 29, 2021 at 12:53:31PM +0200, Frederic Weisbecker wrote:
> > On Thu, Jul 29, 2021 at 03:58:04PM +0800, Boqun Feng wrote:
> > > > The following litmus test, also adapted from the one supplied off-list
> > > > by Frederic Weisbecker, models the RCU grace-period kthread detecting
> > > > a non-idle CPU that is concurrently transitioning to idle:
> > > >
> > > > C dynticks-into-idle
> > > >
> > > > {
> > > > DYNTICKS=1; (* Initially non-idle. *)
> > > > }
> > > >
> > > > P0(int *X, int *DYNTICKS)
> > > > {
> > > > int dynticks;
> > > >
> > > > // Non-idle.
> > > > WRITE_ONCE(*X, 1);
> > > > dynticks = READ_ONCE(*DYNTICKS);
> > > > smp_store_release(DYNTICKS, dynticks + 1);
> > > > smp_mb();
> > >
> > > this smp_mb() is not needed, as we rely on the release-acquire pair to
> > > provide the ordering.
> > >
> > > This means that if we use different implementations (one w/ smp_mb(),
> > > another w/o) rcu_dynticks_inc() for idle-to-nonidle and nonidle-to-idle,
> > > we could save a smp_mb(). Thoughts?
> >
> > That's exactly what I wanted to propose but everybody was sober. Namely order
> > only the RCU read side critical sections before/after idle together:
> >
> > READ side critical section
> > //enter idle
> > //exit idle
> > smp_mb()
> > READ side critical section
> >
> > instead of ordering the RCU read side critical section before idle - with the RCU
> > idle extended quiescent state - with the RCU read side critical section after idle:
> >
> > READ side critical section
> > //enter idle
> > smp_mb();
> > //exit idle
> > smp_mb()
> > READ side critical section
> >
> > So the side effect now is that if the write side waits for the reader to
> > report a quiescent state and scans its dynticks state and see it's not yet in
> > RCU idle mode, then later on when the read side enters in RCU idle mode we
> > expect it to see the write side updates.
> > But after the barrier removal the reader will only see the write side update
> > once we exit RCU idle mode.
> >
> > So the following may happen:
> >
> > P0(int *X, int *Y, int *DYNTICKS)
> > {
> > int y;
> >
> > WRITE_ONCE(*X, 1);
> > smp_store_release(DYNTICKS, 1); // rcu_eqs_enter
> > //smp_mb() not there anymore
> > y = READ_ONCE(*Y);
> > smp_store_release(DYNTICKS, 2); // rcu_eqs_exit()
> > smp_mb();
> > }
> >
> > P1(int *X, int *Y, int *DYNTICKS)
> > {
> > int x;
> > int dynticks;
> >
> > WRITE_ONCE(*Y, 1);
> > smp_mb();
> > dynticks = smp_load_acquire(DYNTICKS);
> > x = READ_ONCE(*X);
> > }
> >
> > exists (1:x=0 /\ 0:y=0)
> >
>
> Thanks for the detailed explanation ;-)
>
> > Theoretically it shouldn't matter because the RCU idle mode isn't
> > supposed to perform RCU reads. But theoretically again once a CPU
>
> Right, in LOCKDEP=y kernel, rcu_read_lock_held() requires
> rcu_is_watching(), so rcu_dereference() is not allowed in idle mode,
> unless using RCU_NONIDLE() or rcu_irq_enter_irqson() to temporarily exit
> the idle mode.
>
> > has reported a quiescent state, any further read is expected to see
> > the latest updates from the write side.
>
> Yes, but in your above case, doesn't P0 already reach to a quiescent
> state even before WRITE_ONCE()? IOW, that case is similar to the
> following:
>
> P0(int *X, int *Y)
> {
> // in QS
>
> WRITE_ONCE(*X, 1);
> y = READ_ONCE(*Y);
> }
>
> P1(int *X, int *Y)
> {
> WRITE_ONCE(*Y, 1);
> synchronize_rcu();
> x = READ_ONCE(*X);
> }
>
> exists (1:x=0 /\ 0:y=0)
>
> And RCU doesn't guarantee the READ_ONCE() on P0 sees the WRITE_ONCE() on
> P1.
>
> >
> > So I don't know what to think. In practice I believe it's not a big deal
> > because RCU idle mode code is usually a fragile path that just handles
> > cpuidle code to put the CPU in/out low power mode. But what about dragons...
>
> My current thought is that if the cpuidle code requires some ordering
> with synchronize_rcu(), RCU_NONIDLE() should be used, and ordering can
> be guaranteed in this case (RCU_NONIDLE() has a rcu_eqs_exit() in it).
> Otherwise, it's a bug.
>
> So looks like we can drop that smp_mb() in rcu_eqs_enter()? At least, we
> can say something in the doc to prevent people from relying on the
> ordering between normal reads in RCU idle mode and synchronize_rcu().
>
> Thoughts?
Is there a benchmark that can show a system-level difference? My
guess is that the realtime interrupt-latency and scheduler-latency
benchmarks would have the best chance of seeing this.
Thanx, Paul
Hi
On 07/21/21 13:21, Paul E. McKenney wrote:
> From: Yanfei Xu <[email protected]>
>
> If rcu_print_task_stall() is invoked on an rcu_node structure that does
> not contain any tasks blocking the current grace period, it takes an
> early exit that fails to release that rcu_node structure's lock. This
> results in a self-deadlock, which is detected by lockdep.
>
> To reproduce this bug:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
>
> This will also result in other complaints, including RCU's scheduler
> hook complaining about blocking rather than preemption and an rcutorture
> writer stall.
>
> Only a partial RCU CPU stall warning message will be printed because of
> the self-deadlock.
>
> This commit therefore releases the lock on the rcu_print_task_stall()
> function's early exit path.
>
> Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> Signed-off-by: Yanfei Xu <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
We are seeing a similar stall/deadlock issue on an Android 5.10 kernel; is the fix
relevant here? We are trying to apply the patches and test, but the problem is tricky
to reproduce, so we thought it worth asking first.
[ 1010.244334] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1010.250538] rcu: 	2-...!: (42 GPs behind) idle=884/0/0x0 softirq=1646/1647 fqs=0 (false positive?)
[ 1010.259746] rcu: 	3-...!: (1 ticks this GP) idle=134/0/0x0 softirq=3205/3205 fqs=0 (false positive?)
[ 1010.269013] 	(detected by 4, t=6502 jiffies, g=9121, q=110)
[ 1010.274621]
[ 1010.276115] ============================================
[ 1010.281438] WARNING: possible recursive locking detected
[ 1010.286763] 5.10.43-g92fdbb553f50-ab979 #1 Not tainted
[ 1010.291912] --------------------------------------------
[ 1010.297235] swapper/4/0 is trying to acquire lock:
[ 1010.302037] ffff8000121fe618 (rcu_node_0){-.-.}-{2:2}, at: rcu_dump_cpu_stacks+0x60/0xf8
[ 1010.310183]
[ 1010.310183] but task is already holding lock:
[ 1010.316028] ffff8000121fe618 (rcu_node_0){-.-.}-{2:2}, at: rcu_sched_clock_irq+0x658/0xca0
[ 1010.324341]
[ 1010.324341] other info that might help us debug this:
[ 1010.330882] Possible unsafe locking scenario:
[ 1010.330882]
[ 1010.336813] CPU0
[ 1010.339263] ----
[ 1010.341713] lock(rcu_node_0);
[ 1010.344872] lock(rcu_node_0);
[ 1010.348029]
[ 1010.348029] *** DEADLOCK ***
[ 1010.348029]
[ 1010.353961] May be due to missing lock nesting notation
[ 1010.353961]
[ 1010.360764] 1 lock held by swapper/4/0:
[ 1010.364607] #0: ffff8000121fe618 (rcu_node_0){-.-.}-{2:2}, at: rcu_sched_clock_irq+0x658/0xca0
[ 1010.373359]
[ 1010.373359] stack backtrace:
[ 1010.377729] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.10.43-g92fdbb553f50-ab979 #1
[ 1010.385489] Hardware name: ARM Juno development board (r0) (DT)
[ 1010.391421] Call trace:
[ 1010.393877] dump_backtrace+0x0/0x1c0
[ 1010.397551] show_stack+0x24/0x30
[ 1010.400877] dump_stack_lvl+0xf4/0x130
[ 1010.404636] dump_stack+0x18/0x58
[ 1010.407962] __lock_acquire+0xa18/0x1ffc
[ 1010.411895] lock_acquire.part.0+0xc8/0x30c
[ 1010.416089] lock_acquire+0x68/0x84
[ 1010.419589] _raw_spin_lock_irqsave+0x84/0x158
[ 1010.424045] rcu_dump_cpu_stacks+0x60/0xf8
[ 1010.428154] rcu_sched_clock_irq+0x8d8/0xca0
[ 1010.432435] update_process_times+0x6c/0xb0
[ 1010.436632] tick_sched_handle+0x3c/0x60
[ 1010.440566] tick_sched_timer+0x58/0xb0
[ 1010.444412] __hrtimer_run_queues+0x1a4/0x5b0
[ 1010.448781] hrtimer_interrupt+0xf4/0x2cc
[ 1010.452803] arch_timer_handler_phys+0x40/0x50
[ 1010.457260] handle_percpu_devid_irq+0x98/0x180
[ 1010.461803] __handle_domain_irq+0x80/0xec
[ 1010.465911] gic_handle_irq+0x5c/0xf0
[ 1010.469583] el1_irq+0xcc/0x180
[ 1010.472735] cpuidle_enter_state+0xe8/0x350
[ 1010.476929] cpuidle_enter+0x44/0x60
[ 1010.480515] do_idle+0x25c/0x2f0
[ 1010.483751] cpu_startup_entry+0x34/0x8c
[ 1010.487687] secondary_start_kernel+0x154/0x190
Thanks
--
Qais Yousef
On Tue, Aug 03, 2021 at 03:24:58PM +0100, Qais Yousef wrote:
> Hi
>
> On 07/21/21 13:21, Paul E. McKenney wrote:
> > From: Yanfei Xu <[email protected]>
> >
> > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > not contain any tasks blocking the current grace period, it takes an
> > early exit that fails to release that rcu_node structure's lock. This
> > results in a self-deadlock, which is detected by lockdep.
> >
> > To reproduce this bug:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> >
> > This will also result in other complaints, including RCU's scheduler
> > hook complaining about blocking rather than preemption and an rcutorture
> > writer stall.
> >
> > Only a partial RCU CPU stall warning message will be printed because of
> > the self-deadlock.
> >
> > This commit therefore releases the lock on the rcu_print_task_stall()
> > function's early exit path.
> >
> > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > Signed-off-by: Yanfei Xu <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
>
> We are seeing similar stall/deadlock issue on android 5.10 kernel, is the fix
> relevant here? Trying to apply the patches and test, but the problem is tricky
> to reproduce so thought worth asking first.
Looks like the relevant symptoms to me, so I suggest trying this series
from -rcu:
8baded711edc ("rcu: Fix to include first blocked task in stall warning")
f6b3995a8b56 ("rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock")
Thanx, Paul
> [ 1010.244334] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1010.250538] rcu: 	2-...!: (42 GPs behind) idle=884/0/0x0 softirq=1646/1647 fqs=0 (false positive?)
> [ 1010.259746] rcu: 	3-...!: (1 ticks this GP) idle=134/0/0x0 softirq=3205/3205 fqs=0 (false positive?)
> [ 1010.269013] 	(detected by 4, t=6502 jiffies, g=9121, q=110)
> [ 1010.274621]
> [ 1010.276115] ============================================
> [ 1010.281438] WARNING: possible recursive locking detected
> [ 1010.286763] 5.10.43-g92fdbb553f50-ab979 #1 Not tainted
> [ 1010.291912] --------------------------------------------
> [ 1010.297235] swapper/4/0 is trying to acquire lock:
> [ 1010.302037] ffff8000121fe618 (rcu_node_0){-.-.}-{2:2}, at: rcu_dump_cpu_stacks+0x60/0xf8
> [ 1010.310183]
> [ 1010.310183] but task is already holding lock:
> [ 1010.316028] ffff8000121fe618 (rcu_node_0){-.-.}-{2:2}, at: rcu_sched_clock_irq+0x658/0xca0
> [ 1010.324341]
> [ 1010.324341] other info that might help us debug this:
> [ 1010.330882] Possible unsafe locking scenario:
> [ 1010.330882]
> [ 1010.336813] CPU0
> [ 1010.339263] ----
> [ 1010.341713] lock(rcu_node_0);
> [ 1010.344872] lock(rcu_node_0);
> [ 1010.348029]
> [ 1010.348029] *** DEADLOCK ***
> [ 1010.348029]
> [ 1010.353961] May be due to missing lock nesting notation
> [ 1010.353961]
> [ 1010.360764] 1 lock held by swapper/4/0:
> [ 1010.364607] #0: ffff8000121fe618 (rcu_node_0){-.-.}-{2:2}, at: rcu_sched_clock_irq+0x658/0xca0
> [ 1010.373359]
> [ 1010.373359] stack backtrace:
> [ 1010.377729] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.10.43-g92fdbb553f50-ab979 #1
> [ 1010.385489] Hardware name: ARM Juno development board (r0) (DT)
> [ 1010.391421] Call trace:
> [ 1010.393877] dump_backtrace+0x0/0x1c0
> [ 1010.397551] show_stack+0x24/0x30
> [ 1010.400877] dump_stack_lvl+0xf4/0x130
> [ 1010.404636] dump_stack+0x18/0x58
> [ 1010.407962] __lock_acquire+0xa18/0x1ffc
> [ 1010.411895] lock_acquire.part.0+0xc8/0x30c
> [ 1010.416089] lock_acquire+0x68/0x84
> [ 1010.419589] _raw_spin_lock_irqsave+0x84/0x158
> [ 1010.424045] rcu_dump_cpu_stacks+0x60/0xf8
> [ 1010.428154] rcu_sched_clock_irq+0x8d8/0xca0
> [ 1010.432435] update_process_times+0x6c/0xb0
> [ 1010.436632] tick_sched_handle+0x3c/0x60
> [ 1010.440566] tick_sched_timer+0x58/0xb0
> [ 1010.444412] __hrtimer_run_queues+0x1a4/0x5b0
> [ 1010.448781] hrtimer_interrupt+0xf4/0x2cc
> [ 1010.452803] arch_timer_handler_phys+0x40/0x50
> [ 1010.457260] handle_percpu_devid_irq+0x98/0x180
> [ 1010.461803] __handle_domain_irq+0x80/0xec
> [ 1010.465911] gic_handle_irq+0x5c/0xf0
> [ 1010.469583] el1_irq+0xcc/0x180
> [ 1010.472735] cpuidle_enter_state+0xe8/0x350
> [ 1010.476929] cpuidle_enter+0x44/0x60
> [ 1010.480515] do_idle+0x25c/0x2f0
> [ 1010.483751] cpu_startup_entry+0x34/0x8c
> [ 1010.487687] secondary_start_kernel+0x154/0x190
>
>
> Thanks
>
> --
> Qais Yousef
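
For readers following the lockdep report above: the bug class is a function that is documented to release a lock its caller already holds but that returns early without doing so, so the next attempt to take the same lock self-deadlocks. The standalone sketch below reproduces that shape with a pthread mutex standing in for the rcu_node ->lock; the names (print_blocked_tasks(), node_lock, nr_blocked) are invented for the illustration and this is not the kernel code itself.

/* unlock_on_all_paths.c: standalone illustration, not kernel code. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_blocked;	/* stand-in for the rcu_node blocked-tasks list */

/*
 * The caller acquires node_lock; this helper is expected to release it
 * on every return path, mirroring what rcu_print_task_stall() must do.
 */
static int print_blocked_tasks(void)
{
	if (nr_blocked == 0) {
		/*
		 * The buggy version omitted this unlock and returned
		 * with node_lock still held.
		 */
		pthread_mutex_unlock(&node_lock);
		return 0;
	}
	printf("%d tasks blocking the grace period\n", nr_blocked);
	pthread_mutex_unlock(&node_lock);
	return nr_blocked;
}

int main(void)
{
	pthread_mutex_lock(&node_lock);
	print_blocked_tasks();	/* must release node_lock */

	/*
	 * Had the early exit kept node_lock held, this second acquisition
	 * would hang, which is the self-deadlock that lockdep reports
	 * above as "possible recursive locking".
	 */
	pthread_mutex_lock(&node_lock);
	pthread_mutex_unlock(&node_lock);
	printf("no deadlock: every exit path released the lock\n");
	return 0;
}

Building it with something like "cc unlock_on_all_paths.c -lpthread" and running it shows the balanced case; removing the unlock in the early-exit branch makes the second lock attempt hang, which is the userspace analogue of the splat quoted above.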
On 08/03/21 08:52, Paul E. McKenney wrote:
> On Tue, Aug 03, 2021 at 03:24:58PM +0100, Qais Yousef wrote:
> > Hi
> >
> > On 07/21/21 13:21, Paul E. McKenney wrote:
> > > From: Yanfei Xu <[email protected]>
> > >
> > > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > > not contain any tasks blocking the current grace period, it takes an
> > > early exit that fails to release that rcu_node structure's lock. This
> > > results in a self-deadlock, which is detected by lockdep.
> > >
> > > To reproduce this bug:
> > >
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> > >
> > > This will also result in other complaints, including RCU's scheduler
> > > hook complaining about blocking rather than preemption and an rcutorture
> > > writer stall.
> > >
> > > Only a partial RCU CPU stall warning message will be printed because of
> > > the self-deadlock.
> > >
> > > This commit therefore releases the lock on the rcu_print_task_stall()
> > > function's early exit path.
> > >
> > > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > > Signed-off-by: Yanfei Xu <[email protected]>
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> >
> > We are seeing similar stall/deadlock issue on android 5.10 kernel, is the fix
> > relevant here? Trying to apply the patches and test, but the problem is tricky
> > to reproduce so thought worth asking first.
>
> Looks like the relevant symptoms to me, so I suggest trying this series
> from -rcu:
>
> 8baded711edc ("rcu: Fix to include first blocked task in stall warning")
> f6b3995a8b56 ("rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock")
Great thanks. These are the ones we picked as the rest was a bit tricky to
apply on 5.10.
While at it, we also see the errors below. They look harmless, but they happen
all the time:
[ 595.292685] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #02!!!
[ 595.301467] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
[ 595.389353] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
[ 595.397454] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
[ 595.417112] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
[ 595.425215] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
[ 595.438807] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
I used to see them on mainline a while back, but they seem to have been fixed since.
Something didn't get backported to 5.10 perhaps?
It might be a question for Frederic, actually.
Thanks!
--
Qais Yousef
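
As an aside for anyone decoding the "handler #NN" part of those NOHZ messages: the kernel prints the hex bitmask returned by local_softirq_pending(), so with the usual softirq numbering #02 means TIMER_SOFTIRQ is pending and #08 means NET_RX_SOFTIRQ. Below is a small userspace decoder written against that assumed numbering (the table order is an assumption taken from the conventional softirq enum, not quoted from this thread).

#include <stdio.h>

/* Softirq numbering as conventionally defined in include/linux/interrupt.h. */
static const char *const softirq_names[] = {
	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
	"IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
};

/* Decode the hex bitmask printed after "handler #" in the warning. */
static void decode_pending(unsigned int mask)
{
	printf("handler #%02x ->", mask);
	for (unsigned int bit = 0; bit < 10; bit++)
		if (mask & (1u << bit))
			printf(" %s_SOFTIRQ", softirq_names[bit]);
	printf("\n");
}

int main(void)
{
	decode_pending(0x02);	/* TIMER_SOFTIRQ, as in the first line above */
	decode_pending(0x08);	/* NET_RX_SOFTIRQ, as in the remaining lines */
	return 0;
}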
On 08/03/21 09:28, Paul E. McKenney wrote:
> On Tue, Aug 03, 2021 at 05:12:21PM +0100, Qais Yousef wrote:
> > On 08/03/21 08:52, Paul E. McKenney wrote:
> > > On Tue, Aug 03, 2021 at 03:24:58PM +0100, Qais Yousef wrote:
> > > > Hi
> > > >
> > > > On 07/21/21 13:21, Paul E. McKenney wrote:
> > > > > From: Yanfei Xu <[email protected]>
> > > > >
> > > > > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > > > > not contain any tasks blocking the current grace period, it takes an
> > > > > early exit that fails to release that rcu_node structure's lock. This
> > > > > results in a self-deadlock, which is detected by lockdep.
> > > > >
> > > > > To reproduce this bug:
> > > > >
> > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> > > > >
> > > > > This will also result in other complaints, including RCU's scheduler
> > > > > hook complaining about blocking rather than preemption and an rcutorture
> > > > > writer stall.
> > > > >
> > > > > Only a partial RCU CPU stall warning message will be printed because of
> > > > > the self-deadlock.
> > > > >
> > > > > This commit therefore releases the lock on the rcu_print_task_stall()
> > > > > function's early exit path.
> > > > >
> > > > > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > > > > Signed-off-by: Yanfei Xu <[email protected]>
> > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > ---
> > > >
> > > > We are seeing similar stall/deadlock issue on android 5.10 kernel, is the fix
> > > > relevant here? Trying to apply the patches and test, but the problem is tricky
> > > > to reproduce so thought worth asking first.
> > >
> > > Looks like the relevant symptoms to me, so I suggest trying this series
> > > from -rcu:
> > >
> > > 8baded711edc ("rcu: Fix to include first blocked task in stall warning")
> > > f6b3995a8b56 ("rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock")
> >
> > Great thanks. These are the ones we picked as the rest was a bit tricky to
> > apply on 5.10.
> >
> > While at it, we see these errors too though they look harmless. They happen
> > all the time
> >
> > [ 595.292685] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #02!!!
> > [ 595.301467] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.389353] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.397454] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.417112] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.425215] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.438807] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> >
> > I used to see them on mainline a while back but seem to have been fixed.
> > Something didn't get backported to 5.10 perhaps?
>
> I believe that you need at least this one:
>
> 47c218dcae65 ("tick/sched: Prevent false positive softirq pending warnings on RT")
Thanks! Shall try that.
Cheers
--
Qais Yousef
On Wed, 2021-07-21 at 13:21 -0700, Paul E. McKenney wrote:
> From: Liu Song <[email protected]>
>
> Within rcu_gp_fqs_loop(), the "ret" local variable is set to the
> return value from swait_event_idle_timeout_exclusive(), but "ret" is
> unconditionally overwritten later in the code. This commit therefore
> removes this useless assignment.
[]
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
[]
> @@ -1960,8 +1960,8 @@ static noinline_for_stack void rcu_gp_fqs_loop(void)
>  		trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq,
>  				       TPS("fqswait"));
>  		WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_FQS);
> - ret = swait_event_idle_timeout_exclusive(
> - rcu_state.gp_wq, rcu_gp_fqs_check_wake(&gf), j);
> + (void)swait_event_idle_timeout_exclusive(rcu_state.gp_wq,
> + rcu_gp_fqs_check_wake(&gf), j);
It doesn't seem that this is a __must_check routine, so why
bother to cast to void?
>  		rcu_gp_torture_wait();
>  		WRITE_ONCE(rcu_state.gp_state, RCU_GP_DOING_FQS);
>  		/* Locking provides needed memory barriers. */
On Tue, Aug 03, 2021 at 09:48:23AM -0700, Joe Perches wrote:
> On Wed, 2021-07-21 at 13:21 -0700, Paul E. McKenney wrote:
> > From: Liu Song <[email protected]>
> >
> > Within rcu_gp_fqs_loop(), the "ret" local variable is set to the
> > return value from swait_event_idle_timeout_exclusive(), but "ret" is
> > unconditionally overwritten later in the code. This commit therefore
> > removes this useless assignment.
> []
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> []
> > @@ -1960,8 +1960,8 @@ static noinline_for_stack void rcu_gp_fqs_loop(void)
> >  		trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq,
> >  				       TPS("fqswait"));
> >  		WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_FQS);
> > - ret = swait_event_idle_timeout_exclusive(
> > - rcu_state.gp_wq, rcu_gp_fqs_check_wake(&gf), j);
> > + (void)swait_event_idle_timeout_exclusive(rcu_state.gp_wq,
> > + rcu_gp_fqs_check_wake(&gf), j);
>
> It doesn't seem this is a __must_check routine so why
> bother to cast to void ?
As a hint to the reader that discarding the return value is intentional
rather than an oversight.
Thanx, Paul
> >  		rcu_gp_torture_wait();
> >  		WRITE_ONCE(rcu_state.gp_state, RCU_GP_DOING_FQS);
> >  		/* Locking provides needed memory barriers. */
>
>
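
To make Paul's point concrete, here is a tiny self-contained example of the (void) idiom. The helper name is made up for the illustration and nothing here is kernel code; it simply shows the documentation value of the cast that the quoted hunk adds.

#include <stdio.h>

/* Hypothetical helper whose return value callers often do not need. */
static int wait_for_event_timeout(int timeout_ms)
{
	/* Pretend we waited and report how much of the timeout was left. */
	return timeout_ms / 2;
}

int main(void)
{
	/*
	 * The (void) cast changes nothing at runtime.  It tells the reader
	 * that the return value is being discarded on purpose rather than
	 * by oversight, and it would also silence the compiler if the
	 * callee were marked __must_check / warn_unused_result.
	 */
	(void)wait_for_event_timeout(100);

	printf("timed wait done; remaining time intentionally ignored\n");
	return 0;
}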
On Tue, Aug 03, 2021 at 05:12:21PM +0100, Qais Yousef wrote:
> On 08/03/21 08:52, Paul E. McKenney wrote:
> > On Tue, Aug 03, 2021 at 03:24:58PM +0100, Qais Yousef wrote:
> > > Hi
> > >
> > > On 07/21/21 13:21, Paul E. McKenney wrote:
> > > > From: Yanfei Xu <[email protected]>
> > > >
> > > > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > > > not contain any tasks blocking the current grace period, it takes an
> > > > early exit that fails to release that rcu_node structure's lock. This
> > > > results in a self-deadlock, which is detected by lockdep.
> > > >
> > > > To reproduce this bug:
> > > >
> > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> > > >
> > > > This will also result in other complaints, including RCU's scheduler
> > > > hook complaining about blocking rather than preemption and an rcutorture
> > > > writer stall.
> > > >
> > > > Only a partial RCU CPU stall warning message will be printed because of
> > > > the self-deadlock.
> > > >
> > > > This commit therefore releases the lock on the rcu_print_task_stall()
> > > > function's early exit path.
> > > >
> > > > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > > > Signed-off-by: Yanfei Xu <[email protected]>
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > ---
> > >
> > > We are seeing similar stall/deadlock issue on android 5.10 kernel, is the fix
> > > relevant here? Trying to apply the patches and test, but the problem is tricky
> > > to reproduce so thought worth asking first.
> >
> > Looks like the relevant symptoms to me, so I suggest trying this series
> > from -rcu:
> >
> > 8baded711edc ("rcu: Fix to include first blocked task in stall warning")
> > f6b3995a8b56 ("rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock")
>
> Great thanks. These are the ones we picked as the rest was a bit tricky to
> apply on 5.10.
>
> While at it, we see these errors too though they look harmless. They happen
> all the time
>
> [ 595.292685] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #02!!!
> [ 595.301467] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> [ 595.389353] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> [ 595.397454] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> [ 595.417112] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> [ 595.425215] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> [ 595.438807] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
>
> I used to see them on mainline a while back but seem to have been fixed.
> Something didn't get backported to 5.10 perhaps?
I believe that you need at least this one:
47c218dcae65 ("tick/sched: Prevent false positive softirq pending warnings on RT")
Thanx, Paul
> It might be a question to Frederic actually..
>
> Thanks!
>
> --
> Qais Yousef
On 08/03/21 09:28, Paul E. McKenney wrote:
> On Tue, Aug 03, 2021 at 05:12:21PM +0100, Qais Yousef wrote:
> > On 08/03/21 08:52, Paul E. McKenney wrote:
> > > On Tue, Aug 03, 2021 at 03:24:58PM +0100, Qais Yousef wrote:
> > > > Hi
> > > >
> > > > On 07/21/21 13:21, Paul E. McKenney wrote:
> > > > > From: Yanfei Xu <[email protected]>
> > > > >
> > > > > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > > > > not contain any tasks blocking the current grace period, it takes an
> > > > > early exit that fails to release that rcu_node structure's lock. This
> > > > > results in a self-deadlock, which is detected by lockdep.
> > > > >
> > > > > To reproduce this bug:
> > > > >
> > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> > > > >
> > > > > This will also result in other complaints, including RCU's scheduler
> > > > > hook complaining about blocking rather than preemption and an rcutorture
> > > > > writer stall.
> > > > >
> > > > > Only a partial RCU CPU stall warning message will be printed because of
> > > > > the self-deadlock.
> > > > >
> > > > > This commit therefore releases the lock on the rcu_print_task_stall()
> > > > > function's early exit path.
> > > > >
> > > > > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > > > > Signed-off-by: Yanfei Xu <[email protected]>
> > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > ---
> > > >
> > > > We are seeing similar stall/deadlock issue on android 5.10 kernel, is the fix
> > > > relevant here? Trying to apply the patches and test, but the problem is tricky
> > > > to reproduce so thought worth asking first.
> > >
> > > Looks like the relevant symptoms to me, so I suggest trying this series
> > > from -rcu:
> > >
> > > 8baded711edc ("rcu: Fix to include first blocked task in stall warning")
> > > f6b3995a8b56 ("rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock")
> >
> > Great thanks. These are the ones we picked as the rest was a bit tricky to
> > apply on 5.10.
> >
> > While at it, we see these errors too though they look harmless. They happen
> > all the time
> >
> > [ 595.292685] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #02!!!
> > [ 595.301467] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.389353] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.397454] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.417112] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.425215] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > [ 595.438807] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> >
> > I used to see them on mainline a while back but seem to have been fixed.
> > Something didn't get backported to 5.10 perhaps?
>
> I believe that you need at least this one:
>
> 47c218dcae65 ("tick/sched: Prevent false positive softirq pending warnings on RT")
After looking at the content of the patch, it's not related. We don't run with
PREEMPT_RT.
I think we're hitting a genuine issue, most likely due to out-of-tree changes
done by Android to fix RT latency problems against softirq (surprise surprise).
Thanks for your help and sorry for the noise.
Cheers
--
Qais Yousef
On Wed, Aug 04, 2021 at 02:50:17PM +0100, Qais Yousef wrote:
> On 08/03/21 09:28, Paul E. McKenney wrote:
> > On Tue, Aug 03, 2021 at 05:12:21PM +0100, Qais Yousef wrote:
> > > On 08/03/21 08:52, Paul E. McKenney wrote:
> > > > On Tue, Aug 03, 2021 at 03:24:58PM +0100, Qais Yousef wrote:
> > > > > Hi
> > > > >
> > > > > On 07/21/21 13:21, Paul E. McKenney wrote:
> > > > > > From: Yanfei Xu <[email protected]>
> > > > > >
> > > > > > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > > > > > not contain any tasks blocking the current grace period, it takes an
> > > > > > early exit that fails to release that rcu_node structure's lock. This
> > > > > > results in a self-deadlock, which is detected by lockdep.
> > > > > >
> > > > > > To reproduce this bug:
> > > > > >
> > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> > > > > >
> > > > > > This will also result in other complaints, including RCU's scheduler
> > > > > > hook complaining about blocking rather than preemption and an rcutorture
> > > > > > writer stall.
> > > > > >
> > > > > > Only a partial RCU CPU stall warning message will be printed because of
> > > > > > the self-deadlock.
> > > > > >
> > > > > > This commit therefore releases the lock on the rcu_print_task_stall()
> > > > > > function's early exit path.
> > > > > >
> > > > > > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > > > > > Signed-off-by: Yanfei Xu <[email protected]>
> > > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > > ---
> > > > >
> > > > > We are seeing similar stall/deadlock issue on android 5.10 kernel, is the fix
> > > > > relevant here? Trying to apply the patches and test, but the problem is tricky
> > > > > to reproduce so thought worth asking first.
> > > >
> > > > Looks like the relevant symptoms to me, so I suggest trying this series
> > > > from -rcu:
> > > >
> > > > 8baded711edc ("rcu: Fix to include first blocked task in stall warning")
> > > > f6b3995a8b56 ("rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock")
> > >
> > > Great thanks. These are the ones we picked as the rest was a bit tricky to
> > > apply on 5.10.
> > >
> > > While at it, we see these errors too though they look harmless. They happen
> > > all the time
> > >
> > > [ 595.292685] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #02!!!
> > > [ 595.301467] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > [ 595.389353] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > [ 595.397454] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > [ 595.417112] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > [ 595.425215] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > [ 595.438807] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > >
> > > I used to see them on mainline a while back but seem to have been fixed.
> > > Something didn't get backported to 5.10 perhaps?
> >
> > I believe that you need at least this one:
> >
> > 47c218dcae65 ("tick/sched: Prevent false positive softirq pending warnings on RT")
>
> After looking at the content of the patch, it's not related. We don't run with
> PREEMPT_RT.
>
> I think we're hitting a genuine issue, most likely due to out-of-tree changes
> done by Android to fix RT latency problems against softirq (surprise surprise).
>
> Thanks for your help and sorry for the noise.
No problem!
But I used to see this very frequently in non-PREEMPT_RT rcutorture runs,
and there was a patch from Thomas that made them go away. So it might
be worth looking at his patches in this area since 5.10. Maybe I just
got confused and picked the wrong one.
Thanx, Paul
On 08/04/21 15:33, Paul E. McKenney wrote:
> On Wed, Aug 04, 2021 at 02:50:17PM +0100, Qais Yousef wrote:
> > On 08/03/21 09:28, Paul E. McKenney wrote:
> > > On Tue, Aug 03, 2021 at 05:12:21PM +0100, Qais Yousef wrote:
> > > > On 08/03/21 08:52, Paul E. McKenney wrote:
> > > > > On Tue, Aug 03, 2021 at 03:24:58PM +0100, Qais Yousef wrote:
> > > > > > Hi
> > > > > >
> > > > > > On 07/21/21 13:21, Paul E. McKenney wrote:
> > > > > > > From: Yanfei Xu <[email protected]>
> > > > > > >
> > > > > > > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > > > > > > not contain any tasks blocking the current grace period, it takes an
> > > > > > > early exit that fails to release that rcu_node structure's lock. This
> > > > > > > results in a self-deadlock, which is detected by lockdep.
> > > > > > >
> > > > > > > To reproduce this bug:
> > > > > > >
> > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> > > > > > >
> > > > > > > This will also result in other complaints, including RCU's scheduler
> > > > > > > hook complaining about blocking rather than preemption and an rcutorture
> > > > > > > writer stall.
> > > > > > >
> > > > > > > Only a partial RCU CPU stall warning message will be printed because of
> > > > > > > the self-deadlock.
> > > > > > >
> > > > > > > This commit therefore releases the lock on the rcu_print_task_stall()
> > > > > > > function's early exit path.
> > > > > > >
> > > > > > > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > > > > > > Signed-off-by: Yanfei Xu <[email protected]>
> > > > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > > > ---
> > > > > >
> > > > > > We are seeing similar stall/deadlock issue on android 5.10 kernel, is the fix
> > > > > > relevant here? Trying to apply the patches and test, but the problem is tricky
> > > > > > to reproduce so thought worth asking first.
> > > > >
> > > > > Looks like the relevant symptoms to me, so I suggest trying this series
> > > > > from -rcu:
> > > > >
> > > > > 8baded711edc ("rcu: Fix to include first blocked task in stall warning")
> > > > > f6b3995a8b56 ("rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock")
> > > >
> > > > Great thanks. These are the ones we picked as the rest was a bit tricky to
> > > > apply on 5.10.
> > > >
> > > > While at it, we see these errors too though they look harmless. They happen
> > > > all the time
> > > >
> > > > [ 595.292685] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #02!!!
> > > > [ 595.301467] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > > [ 595.389353] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > > [ 595.397454] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > > [ 595.417112] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > > [ 595.425215] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > > [ 595.438807] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> > > >
> > > > I used to see them on mainline a while back but seem to have been fixed.
> > > > Something didn't get backported to 5.10 perhaps?
> > >
> > > I believe that you need at least this one:
> > >
> > > 47c218dcae65 ("tick/sched: Prevent false positive softirq pending warnings on RT")
> >
> > After looking at the content of the patch, it's not related. We don't run with
> > PREEMPT_RT.
> >
> > I think we're hitting a genuine issue, most likely due to out-of-tree changes
> > done by Android to fix RT latency problems against softirq (surprise surprise).
> >
> > Thanks for your help and sorry for the noise.
>
> No problem!
>
> But I used to see this very frequently in non-PREEMPT_RT rcutorture runs,
> and there was a patch from Thomas that made them go away. So it might
> be worth looking at his patches in this area since 5.10. Maybe I just
> got confused and picked the wrong one.
My suspicion turned out to be correct in the end, so there is no issue on vanilla 5.10.
These warnings were hard to miss at some point in mainline for me too, which is
why I suspected something was not backported. All good now.
Cheers
--
Qais Yousef
On 07/21/21 13:21, Paul E. McKenney wrote:
> From: Yanfei Xu <[email protected]>
>
> If rcu_print_task_stall() is invoked on an rcu_node structure that does
> not contain any tasks blocking the current grace period, it takes an
> early exit that fails to release that rcu_node structure's lock. This
> results in a self-deadlock, which is detected by lockdep.
>
> To reproduce this bug:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
>
> This will also result in other complaints, including RCU's scheduler
> hook complaining about blocking rather than preemption and an rcutorture
> writer stall.
>
> Only a partial RCU CPU stall warning message will be printed because of
> the self-deadlock.
>
> This commit therefore releases the lock on the rcu_print_task_stall()
> function's early exit path.
>
> Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> Signed-off-by: Yanfei Xu <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
We were seeing a similar issue on Android 5.10. Applying patches 1 and 2 did fix
the deadlock problem, and we now get a proper RCU stall splat.
For patches 1 and 2:
Tested-by: Qais Yousef <[email protected]>
They have Fixes tags, so should end up in 5.10 stable I presume.
Thanks!
--
Qais Yousef
On Fri, Aug 06, 2021 at 10:57:30AM +0100, Qais Yousef wrote:
> On 07/21/21 13:21, Paul E. McKenney wrote:
> > From: Yanfei Xu <[email protected]>
> >
> > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > not contain any tasks blocking the current grace period, it takes an
> > early exit that fails to release that rcu_node structure's lock. This
> > results in a self-deadlock, which is detected by lockdep.
> >
> > To reproduce this bug:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> >
> > This will also result in other complaints, including RCU's scheduler
> > hook complaining about blocking rather than preemption and an rcutorture
> > writer stall.
> >
> > Only a partial RCU CPU stall warning message will be printed because of
> > the self-deadlock.
> >
> > This commit therefore releases the lock on the rcu_print_task_stall()
> > function's early exit path.
> >
> > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > Signed-off-by: Yanfei Xu <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
>
> We were seeing similar issue on Android 5.10. Applying patches 1 and 2 did fix
> the deadlock problem and we get proper RCU stall splat now.
>
> For patches 1 and 2:
>
> Tested-by: Qais Yousef <[email protected]>
Thank you, and I will apply this on the next rebase.
> They have Fixes tags, so should end up in 5.10 stable I presume.
Here is hoping! ;-)
Thanx, Paul
On 08/06/21 04:43, Paul E. McKenney wrote:
> On Fri, Aug 06, 2021 at 10:57:30AM +0100, Qais Yousef wrote:
> > On 07/21/21 13:21, Paul E. McKenney wrote:
> > > From: Yanfei Xu <[email protected]>
> > >
> > > If rcu_print_task_stall() is invoked on an rcu_node structure that does
> > > not contain any tasks blocking the current grace period, it takes an
> > > early exit that fails to release that rcu_node structure's lock. This
> > > results in a self-deadlock, which is detected by lockdep.
> > >
> > > To reproduce this bug:
> > >
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
> > >
> > > This will also result in other complaints, including RCU's scheduler
> > > hook complaining about blocking rather than preemption and an rcutorture
> > > writer stall.
> > >
> > > Only a partial RCU CPU stall warning message will be printed because of
> > > the self-deadlock.
> > >
> > > This commit therefore releases the lock on the rcu_print_task_stall()
> > > function's early exit path.
> > >
> > > Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> > > Signed-off-by: Yanfei Xu <[email protected]>
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> >
> > We were seeing similar issue on Android 5.10. Applying patches 1 and 2 did fix
> > the deadlock problem and we get proper RCU stall splat now.
> >
> > For patches 1 and 2:
> >
> > Tested-by: Qais Yousef <[email protected]>
>
> Thank you, and I will apply this on the next rebase.
>
> > They have Fixes tags, so should end up in 5.10 stable I presume.
>
> Here is hoping! ;-)
Leave it with me, I've got a todo to make sure this happens ;-)
Thanks!
--
Qais Yousef