2019-08-03 14:25:40

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 0/14] No-CBs bypass addition for v5.4

Hello!

This series is a sneak preview of additional work for the move of no-CBs
CPUs to the ->cblist segmented RCU callback list. This work adds
a ->nocb_bypass list with its own lock to further reduce contention.
This series also includes some nascent work to turn the scheduling-clock
interrupt back on for nohz_full CPUs doing heavy rcutorture work or RCU
callback invocation, both of which can keep a CPU in the kernel for long
periods of time, which in turn can impede CPU hotplug removals. (On some
systems "impede" means up to seven minutes for stop-machine to actually
get things to stop, a problem that has not yet been observed on no-CBs
CPUs that are not also nohz_full CPUs.)

1. Atomic ->len field in rcu_segcblist structure.

2. Add bypass callback queueing in ->nocb_bypass with its own
->nocb_bypass_lock.

3. (Experimental) Check use and usefulness of ->nocb_lock_contended.

4. Print no-CBs diagnostics when rcutorture writer unduly delayed.

5. Avoid synchronous wakeup in __call_rcu_nocb_wake().

6. Advance CBs after merge in rcutree_migrate_callbacks() to
avoid unnecessary invocation delays.

7. Reduce nocb_cb_wait() leaf rcu_node ->lock contention.

8. Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention.

9. Don't wake no-CBs GP kthread if timer posted under overload,
thus reducing overhead in the overload case.

10. Allow rcu_do_batch() to dynamically adjust batch sizes, courtesy
of Eric Dumazet.

11. (Experimental) Add TICK_DEP_BIT_RCU, courtesy of Frederic Weisbecker.

12. Force on tick when invoking lots of callbacks to reduce the
probability of long stop-machine delays.

13. Force on tick for readers and callback flooders, again to reduce
the probability of long stop-machine delays.

14. (Experimental and likely quite imperfect) Make multi_cpu_stop()
enable tick on all online CPUs, yet again to reduce the
probability of long stop-machine delays.

Thanx, Paul

------------------------------------------------------------------------

include/linux/rcu_segcblist.h |   4
include/linux/tick.h          |   7
kernel/rcu/rcu_segcblist.c    | 116 +++++++++-
kernel/rcu/rcu_segcblist.h    |  17 +
kernel/rcu/rcutorture.c       |  25 +-
kernel/rcu/tree.c             |  41 +++
kernel/rcu/tree.h             |  35 ++-
kernel/rcu/tree_plugin.h      | 486 +++++++++++++++++++++++++++++++++++++-----
kernel/rcu/tree_stall.h       |   5
kernel/stop_machine.c         |   9
kernel/time/tick-sched.c      |   2
11 files changed, 667 insertions(+), 80 deletions(-)


2019-08-03 14:25:54

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 01/14] rcu/nocb: Atomic ->len field in rcu_segcblist structure

Upcoming ->nocb_lock contention-reduction work requires that the
rcu_segcblist structure's ->len field be concurrently manipulated,
but only if there are no-CBs CPUs in the kernel. This commit
therefore makes this ->len field be an atomic_long_t, but only
in CONFIG_RCU_NOCB_CPU=y kernels.
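
As a rough illustration of the resulting usage pattern, here is a
standalone userspace analogue using C11 atomics (not kernel code; the
names are invented for the sketch).  Callers go through accessors, so
they never need to know which representation of ->len is in use:

    #include <stdatomic.h>
    #include <stdio.h>

    #define CONFIG_RCU_NOCB_CPU 1   /* flip to 0 to model !NOCB kernels */

    struct segcblist {
    #if CONFIG_RCU_NOCB_CPU
            atomic_long len;        /* may be updated concurrently */
    #else
            long len;               /* updated only by the owning CPU */
    #endif
    };

    /* Read the callback count, whichever representation is in use. */
    static long segcblist_n_cbs(struct segcblist *p)
    {
    #if CONFIG_RCU_NOCB_CPU
            return atomic_load(&p->len);
    #else
            return p->len;          /* the kernel uses READ_ONCE() here */
    #endif
    }

    /* Adjust the callback count; v may be negative. */
    static void segcblist_add_len(struct segcblist *p, long v)
    {
    #if CONFIG_RCU_NOCB_CPU
            atomic_fetch_add(&p->len, v);   /* the kernel adds full barriers */
    #else
            p->len += v;
    #endif
    }

    int main(void)
    {
            struct segcblist s = { 0 };

            segcblist_add_len(&s, 3);
            segcblist_add_len(&s, -1);
            printf("len = %ld\n", segcblist_n_cbs(&s));     /* prints 2 */
            return 0;
    }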

Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcu_segcblist.h | 4 ++
kernel/rcu/rcu_segcblist.c | 86 ++++++++++++++++++++++++++++++++---
kernel/rcu/rcu_segcblist.h | 12 ++++-
3 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 82977726da29..38e4b811af9b 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -65,7 +65,11 @@ struct rcu_segcblist {
struct rcu_head *head;
struct rcu_head **tails[RCU_CBLIST_NSEGS];
unsigned long gp_seq[RCU_CBLIST_NSEGS];
+#ifdef CONFIG_RCU_NOCB_CPU
+ atomic_long_t len;
+#else
long len;
+#endif
long len_lazy;
u8 enabled;
u8 offloaded;
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 92968b856593..ff431cc83037 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -23,6 +23,19 @@ void rcu_cblist_init(struct rcu_cblist *rclp)
rclp->len_lazy = 0;
}

+/*
+ * Enqueue an rcu_head structure onto the specified callback list.
+ * This function assumes that the callback is non-lazy because it
+ * is intended for use by no-CBs CPUs, which do not distinguish
+ * between lazy and non-lazy RCU callbacks.
+ */
+void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
+{
+ *rclp->tail = rhp;
+ rclp->tail = &rhp->next;
+ WRITE_ONCE(rclp->len, rclp->len + 1);
+}
+
/*
* Dequeue the oldest rcu_head structure from the specified callback
* list. This function assumes that the callback is non-lazy, but
@@ -44,6 +57,67 @@ struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp)
return rhp;
}

+/* Set the length of an rcu_segcblist structure. */
+void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v)
+{
+#ifdef CONFIG_RCU_NOCB_CPU
+ atomic_long_set(&rsclp->len, v);
+#else
+ WRITE_ONCE(rsclp->len, v);
+#endif
+}
+
+/*
+ * Increase the numeric length of an rcu_segcblist structure by the
+ * specified amount, which can be negative. This can cause the ->len
+ * field to disagree with the actual number of callbacks on the structure.
+ * This increase is fully ordered with respect to the caller's accesses
+ * both before and after.
+ */
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
+{
+#ifdef CONFIG_RCU_NOCB_CPU
+ smp_mb__before_atomic(); /* Up to the caller! */
+ atomic_long_add(v, &rsclp->len);
+ smp_mb__after_atomic(); /* Up to the caller! */
+#else
+ smp_mb(); /* Up to the caller! */
+ WRITE_ONCE(rsclp->len, rsclp->len + v);
+ smp_mb(); /* Up to the caller! */
+#endif
+}
+
+/*
+ * Increase the numeric length of an rcu_segcblist structure by one.
+ * This can cause the ->len field to disagree with the actual number of
+ * callbacks on the structure. This increase is fully ordered with respect
+ * to the caller's accesses both before and after.
+ */
+void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp)
+{
+ rcu_segcblist_add_len(rsclp, 1);
+}
+
+/*
+ * Exchange the numeric length of the specified rcu_segcblist structure
+ * with the specified value. This can cause the ->len field to disagree
+ * with the actual number of callbacks on the structure. This exchange is
+ * fully ordered with respect to the caller's accesses both before and after.
+ */
+long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
+{
+#ifdef CONFIG_RCU_NOCB_CPU
+ return atomic_long_xchg(&rsclp->len, v);
+#else
+ long ret = rsclp->len;
+
+ smp_mb(); /* Up to the caller! */
+ WRITE_ONCE(rsclp->len, v);
+ smp_mb(); /* Up to the caller! */
+ return ret;
+#endif
+}
+
/*
* Initialize an rcu_segcblist structure.
*/
@@ -56,7 +130,7 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp)
rsclp->head = NULL;
for (i = 0; i < RCU_CBLIST_NSEGS; i++)
rsclp->tails[i] = &rsclp->head;
- rsclp->len = 0;
+ rcu_segcblist_set_len(rsclp, 0);
rsclp->len_lazy = 0;
rsclp->enabled = 1;
}
@@ -151,7 +225,7 @@ bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp)
void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
struct rcu_head *rhp, bool lazy)
{
- WRITE_ONCE(rsclp->len, rsclp->len + 1); /* ->len sampled locklessly. */
+ rcu_segcblist_inc_len(rsclp);
if (lazy)
rsclp->len_lazy++;
smp_mb(); /* Ensure counts are updated before callback is enqueued. */
@@ -177,7 +251,7 @@ bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp,

if (rcu_segcblist_n_cbs(rsclp) == 0)
return false;
- WRITE_ONCE(rsclp->len, rsclp->len + 1);
+ rcu_segcblist_inc_len(rsclp);
if (lazy)
rsclp->len_lazy++;
smp_mb(); /* Ensure counts are updated before callback is entrained. */
@@ -204,9 +278,8 @@ void rcu_segcblist_extract_count(struct rcu_segcblist *rsclp,
struct rcu_cblist *rclp)
{
rclp->len_lazy += rsclp->len_lazy;
- rclp->len += rsclp->len;
rsclp->len_lazy = 0;
- WRITE_ONCE(rsclp->len, 0); /* ->len sampled locklessly. */
+ rclp->len = rcu_segcblist_xchg_len(rsclp, 0);
}

/*
@@ -259,8 +332,7 @@ void rcu_segcblist_insert_count(struct rcu_segcblist *rsclp,
struct rcu_cblist *rclp)
{
rsclp->len_lazy += rclp->len_lazy;
- /* ->len sampled locklessly. */
- WRITE_ONCE(rsclp->len, rsclp->len + rclp->len);
+ rcu_segcblist_add_len(rsclp, rclp->len);
rclp->len_lazy = 0;
rclp->len = 0;
}
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index db38f0a512c4..1ff996647d3c 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -9,6 +9,12 @@

#include <linux/rcu_segcblist.h>

+/* Return number of callbacks in the specified callback list. */
+static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
+{
+ return READ_ONCE(rclp->len);
+}
+
/*
* Account for the fact that a previously dequeued callback turned out
* to be marked as lazy.
@@ -42,7 +48,11 @@ static inline bool rcu_segcblist_empty(struct rcu_segcblist *rsclp)
/* Return number of callbacks in segmented callback list. */
static inline long rcu_segcblist_n_cbs(struct rcu_segcblist *rsclp)
{
+#ifdef CONFIG_RCU_NOCB_CPU
+ return atomic_long_read(&rsclp->len);
+#else
return READ_ONCE(rsclp->len);
+#endif
}

/* Return number of lazy callbacks in segmented callback list. */
@@ -54,7 +64,7 @@ static inline long rcu_segcblist_n_lazy_cbs(struct rcu_segcblist *rsclp)
/* Return number of lazy callbacks in segmented callback list. */
static inline long rcu_segcblist_n_nonlazy_cbs(struct rcu_segcblist *rsclp)
{
- return rsclp->len - rsclp->len_lazy;
+ return rcu_segcblist_n_cbs(rsclp) - rsclp->len_lazy;
}

/*
--
2.17.1

2019-08-03 14:26:02

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

The multi_cpu_stop() function relies on the scheduler to gain control from
whatever is running on the various online CPUs, including any nohz_full
CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
interrupt on such CPUs can delay multi_cpu_stop() for several minutes
and can also result in RCU CPU stall warnings. This commit therefore
causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
online CPUs.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/stop_machine.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index b4f83f7bdf86..a2659f61ed92 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -20,6 +20,7 @@
#include <linux/smpboot.h>
#include <linux/atomic.h>
#include <linux/nmi.h>
+#include <linux/tick.h>
#include <linux/sched/wake_q.h>

/*
@@ -187,15 +188,19 @@ static int multi_cpu_stop(void *data)
{
struct multi_stop_data *msdata = data;
enum multi_stop_state curstate = MULTI_STOP_NONE;
- int cpu = smp_processor_id(), err = 0;
+ int cpu, err = 0;
const struct cpumask *cpumask;
unsigned long flags;
bool is_active;

+ for_each_online_cpu(cpu)
+ tick_nohz_dep_set_cpu(cpu, TICK_DEP_MASK_RCU);
+
/*
* When called from stop_machine_from_inactive_cpu(), irq might
* already be disabled. Save the state and restore it on exit.
*/
+ cpu = smp_processor_id();
local_save_flags(flags);

if (!msdata->active_cpus) {
@@ -236,6 +241,8 @@ static int multi_cpu_stop(void *data)
} while (curstate != MULTI_STOP_EXIT);

local_irq_restore(flags);
+ for_each_online_cpu(cpu)
+ tick_nohz_dep_clear_cpu(cpu, TICK_DEP_MASK_RCU);
return err;
}

--
2.17.1

2019-08-03 14:26:22

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 12/14] rcu/nohz: Force on tick when invoking lots of callbacks

Callback invocation can run for a significant time period, and within
CONFIG_NO_HZ_FULL=y kernels, this period will be devoid of scheduler-clock
interrupts. In-kernel execution without such interrupts can cause all
manner of malfunction, with RCU CPU stall warnings being but one result.

This commit therefore forces scheduling-clock interrupts on whenever more
than a few RCU callbacks are invoked. Because offloaded callback invocation
can be preempted, this forcing is withdrawn on each context switch. This
in turn requires that the loop invoking RCU callbacks reiterate the forcing
periodically.
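
Condensed from the diff below, the shape of the change is simply to
bracket the callback-invocation loop in rcu_do_batch() with a per-task
tick dependency (the IS_ENABLED() checks make this a no-op in
CONFIG_NO_HZ_FULL=n kernels):

    if (IS_ENABLED(CONFIG_NO_HZ_FULL))
            tick_dep_set_task(current, TICK_DEP_MASK_RCU);
    /* ... dequeue and invoke the ready callbacks ... */
    if (IS_ENABLED(CONFIG_NO_HZ_FULL))
            tick_dep_clear_task(current, TICK_DEP_MASK_RCU);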

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 71395e91b876..b80eb11c19e1 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2151,6 +2151,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
rcu_nocb_unlock_irqrestore(rdp, flags);

/* Invoke callbacks. */
+ if (IS_ENABLED(CONFIG_NO_HZ_FULL))
+ tick_dep_set_task(current, TICK_DEP_MASK_RCU);
rhp = rcu_cblist_dequeue(&rcl);
for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
debug_rcu_head_unqueue(rhp);
@@ -2217,6 +2219,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
/* Re-invoke RCU core processing if there are callbacks remaining. */
if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist))
invoke_rcu_core();
+ if (IS_ENABLED(CONFIG_NO_HZ_FULL))
+ tick_dep_clear_task(current, TICK_DEP_MASK_RCU);
}

/*
--
2.17.1

2019-08-03 14:26:25

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 07/14] rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention

Currently, nocb_cb_wait() advances callbacks on each pass through its
loop, though only if it succeeds in conditionally acquiring its leaf
rcu_node structure's ->lock. Despite the conditional acquisition of
->lock, this does increase contention. This commit therefore avoids
advancing callbacks unless there are callbacks in ->cblist whose grace
period has completed.

Note that nocb_cb_wait() doesn't worry about callbacks that have not
yet been assigned a grace period. The idea is that the only reason for
nocb_cb_wait() to advance callbacks is to allow it to continue invoking
callbacks. Time will tell whether this is the correct choice.
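
As a hedged aside, the added gate relies on rcu_seq_done(), which at
bottom is a wrap-safe "has the current sequence number reached the
requested one" comparison.  A standalone sketch of just that comparison
(function and variable names invented for the example):

    #include <limits.h>
    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Wrap-safe "cur has reached req" for unsigned sequence counters,
     * in the spirit of the kernel's ULONG_CMP_GE()-based rcu_seq_done().
     */
    static bool seq_done(unsigned long cur, unsigned long req)
    {
            return (cur - req) <= ULONG_MAX / 2;
    }

    int main(void)
    {
            printf("%d\n", seq_done(108, 104));            /* 1: done */
            printf("%d\n", seq_done(100, 104));            /* 0: not yet */
            printf("%d\n", seq_done(4, ULONG_MAX - 3));    /* 1: wrapped, done */
            return 0;
    }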

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 5db39ce1cae1..02739366ef5d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2076,6 +2076,7 @@ static int rcu_nocb_gp_kthread(void *arg)
*/
static void nocb_cb_wait(struct rcu_data *rdp)
{
+ unsigned long cur_gp_seq;
unsigned long flags;
bool needwake_gp = false;
struct rcu_node *rnp = rdp->mynode;
@@ -2088,7 +2089,9 @@ static void nocb_cb_wait(struct rcu_data *rdp)
local_bh_enable();
lockdep_assert_irqs_enabled();
rcu_nocb_lock_irqsave(rdp, flags);
- if (raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
+ if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
+ rcu_seq_done(&rnp->gp_seq, cur_gp_seq) &&
+ raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
needwake_gp = rcu_advance_cbs(rdp->mynode, rdp);
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
}
--
2.17.1

2019-08-03 14:26:42

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 13/14] rcutorture: Force on tick for readers and callback flooders

Readers and callback flooders in the rcutorture stress-test suite run for
extended time periods by design. They do take pains to relinquish the
CPU from time to time, but in some cases this relies on the scheduler
being active, which in turn relies on the scheduler-clock interrupt
firing from time to time.

This commit therefore forces scheduling-clock interrupts within
these loops. While in the area, this commit also prevents
rcu_torture_reader()'s occasional timed sleeps from delaying shutdown.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcutorture.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 3c9feca1eab1..bf08aa783ecc 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -44,6 +44,7 @@
#include <linux/sched/debug.h>
#include <linux/sched/sysctl.h>
#include <linux/oom.h>
+#include <linux/tick.h>

#include "rcu.h"

@@ -1363,15 +1364,16 @@ rcu_torture_reader(void *arg)
set_user_nice(current, MAX_NICE);
if (irqreader && cur_ops->irq_capable)
timer_setup_on_stack(&t, rcu_torture_timer, 0);
-
+ if (IS_ENABLED(CONFIG_NO_HZ_FULL))
+ tick_dep_set_task(current, TICK_DEP_MASK_RCU);
do {
if (irqreader && cur_ops->irq_capable) {
if (!timer_pending(&t))
mod_timer(&t, jiffies + 1);
}
- if (!rcu_torture_one_read(&rand))
+ if (!rcu_torture_one_read(&rand) && !torture_must_stop())
schedule_timeout_interruptible(HZ);
- if (time_after(jiffies, lastsleep)) {
+ if (time_after(jiffies, lastsleep) && !torture_must_stop()) {
schedule_timeout_interruptible(1);
lastsleep = jiffies + 10;
}
@@ -1383,6 +1385,8 @@ rcu_torture_reader(void *arg)
del_timer_sync(&t);
destroy_timer_on_stack(&t);
}
+ if (IS_ENABLED(CONFIG_NO_HZ_FULL))
+ tick_dep_clear_task(current, TICK_DEP_MASK_RCU);
torture_kthread_stopping("rcu_torture_reader");
return 0;
}
@@ -1729,10 +1733,10 @@ static void rcu_torture_fwd_prog_cond_resched(unsigned long iter)
// Real call_rcu() floods hit userspace, so emulate that.
if (need_resched() || (iter & 0xfff))
schedule();
- } else {
- // No userspace emulation: CB invocation throttles call_rcu()
- cond_resched();
+ return;
}
+ // No userspace emulation: CB invocation throttles call_rcu()
+ cond_resched();
}

/*
@@ -1781,6 +1785,8 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries)
init_rcu_head_on_stack(&fcs.rh);
selfpropcb = true;
}
+ if (IS_ENABLED(CONFIG_NO_HZ_FULL))
+ tick_dep_set_task(current, TICK_DEP_MASK_RCU);

/* Tight loop containing cond_resched(). */
WRITE_ONCE(rcu_fwd_cb_nodelay, true);
@@ -1826,6 +1832,8 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries)
destroy_rcu_head_on_stack(&fcs.rh);
}
schedule_timeout_uninterruptible(HZ / 10); /* Let kthreads recover. */
+ if (IS_ENABLED(CONFIG_NO_HZ_FULL))
+ tick_dep_clear_task(current, TICK_DEP_MASK_RCU);
WRITE_ONCE(rcu_fwd_cb_nodelay, false);
}

@@ -1865,6 +1873,8 @@ static void rcu_torture_fwd_prog_cr(void)
cver = READ_ONCE(rcu_torture_current_version);
gps = cur_ops->get_gp_seq();
rcu_launder_gp_seq_start = gps;
+ if (IS_ENABLED(CONFIG_NO_HZ_FULL))
+ tick_dep_set_task(current, TICK_DEP_MASK_RCU);
while (time_before(jiffies, stopat) &&
!shutdown_time_arrived() &&
!READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) {
@@ -1911,6 +1921,8 @@ static void rcu_torture_fwd_prog_cr(void)
rcu_torture_fwd_cb_hist();
}
schedule_timeout_uninterruptible(HZ); /* Let CBs drain. */
+ if (IS_ENABLED(CONFIG_NO_HZ_FULL))
+ tick_dep_clear_task(current, TICK_DEP_MASK_RCU);
WRITE_ONCE(rcu_fwd_cb_nodelay, false);
}

--
2.17.1

2019-08-03 14:27:33

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 05/14] rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()

When callbacks are in full flow, the common case is waiting for a
grace period, and this grace period will normally take a few jiffies to
complete. It therefore isn't all that helpful for __call_rcu_nocb_wake()
to do a synchronous wakeup in this case. This commit therefore turns this
into a timer-based deferred wakeup of the no-CBs grace-period kthread.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 19 +++++--------------
1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index aaeb8a658f4b..5db39ce1cae1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1897,22 +1897,13 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
} else if (len > rdp->qlen_last_fqs_check + qhimark) {
/* ... or if many callbacks queued. */
rdp->qlen_last_fqs_check = len;
- if (!rdp->nocb_cb_sleep &&
- rcu_segcblist_ready_cbs(&rdp->cblist)) {
- // Already going full tilt, so don't try to rewake.
- rcu_nocb_unlock_irqrestore(rdp, flags);
- } else {
+ if (rdp->nocb_cb_sleep ||
+ !rcu_segcblist_ready_cbs(&rdp->cblist)) {
rcu_advance_cbs_nowake(rdp->mynode, rdp);
- if (!irqs_disabled_flags(flags)) {
- wake_nocb_gp(rdp, false, flags);
- trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
- TPS("WakeOvf"));
- } else {
- wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
- TPS("WakeOvfIsDeferred"));
- rcu_nocb_unlock_irqrestore(rdp, flags);
- }
+ wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
+ TPS("WakeOvfIsDeferred"));
}
+ rcu_nocb_unlock_irqrestore(rdp, flags);
} else {
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
rcu_nocb_unlock_irqrestore(rdp, flags);
--
2.17.1

2019-08-03 14:28:06

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 11/14] EXP nohz: Add TICK_DEP_BIT_RCU

From: Frederic Weisbecker <[email protected]>

Kernel code is normally assumed to be quick enough (between two extended grace
periods) that RCU does not need the tick. But some long-lasting kernel code
may need the tick temporarily.

You can use tick_nohz_dep_set_cpu(cpu, TICK_DEP_MASK_RCU) with the
following: @@@

[ paulmck: Enable use within rcutorture and from !NO_HZ_FULL kernels. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/tick.h | 7 ++++++-
kernel/time/tick-sched.c | 2 ++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 196a0a7bfc4f..0dea6fb33a11 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -108,7 +108,8 @@ enum tick_dep_bits {
TICK_DEP_BIT_POSIX_TIMER = 0,
TICK_DEP_BIT_PERF_EVENTS = 1,
TICK_DEP_BIT_SCHED = 2,
- TICK_DEP_BIT_CLOCK_UNSTABLE = 3
+ TICK_DEP_BIT_CLOCK_UNSTABLE = 3,
+ TICK_DEP_BIT_RCU = 4
};

#define TICK_DEP_MASK_NONE 0
@@ -116,6 +117,7 @@ enum tick_dep_bits {
#define TICK_DEP_MASK_PERF_EVENTS (1 << TICK_DEP_BIT_PERF_EVENTS)
#define TICK_DEP_MASK_SCHED (1 << TICK_DEP_BIT_SCHED)
#define TICK_DEP_MASK_CLOCK_UNSTABLE (1 << TICK_DEP_BIT_CLOCK_UNSTABLE)
+#define TICK_DEP_MASK_RCU (1 << TICK_DEP_BIT_RCU)

#ifdef CONFIG_NO_HZ_COMMON
extern bool tick_nohz_enabled;
@@ -258,6 +260,9 @@ static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }

+static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
+static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { }
+
static inline void tick_dep_set(enum tick_dep_bits bit) { }
static inline void tick_dep_clear(enum tick_dep_bits bit) { }
static inline void tick_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index be9707f68024..5d4844d543cf 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -324,6 +324,7 @@ void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit)
preempt_enable();
}
}
+EXPORT_SYMBOL_GPL(tick_nohz_dep_set_cpu);

void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit)
{
@@ -331,6 +332,7 @@ void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit)

atomic_andnot(BIT(bit), &ts->tick_dep_mask);
}
+EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu);

/*
* Set a per-task tick dependency. Posix CPU timers need this in order to elapse
--
2.17.1

2019-08-03 14:28:07

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 04/14] rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed

This commit causes locking, sleeping, and callback state to be printed
for no-CBs CPUs when the rcutorture writer is delayed sufficiently for
rcutorture to complain.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcutorture.c | 1 +
kernel/rcu/tree.h | 7 +++-
kernel/rcu/tree_plugin.h | 82 ++++++++++++++++++++++++++++++++++++++++
kernel/rcu/tree_stall.h | 5 +++
4 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index b22947324423..3c9feca1eab1 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -2176,6 +2176,7 @@ rcu_torture_cleanup(void)
return;
}

+ show_rcu_gp_kthreads();
rcu_torture_barrier_cleanup();
torture_stop_kthread(rcu_torture_fwd_prog, fwd_prog_task);
torture_stop_kthread(rcu_torture_stall, stall_task);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index e4df86db8137..c612f306fe89 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -212,7 +212,11 @@ struct rcu_data {
/* The following fields are used by GP kthread, hence own cacheline. */
raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
- bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */
+ u8 nocb_gp_sleep; /* Is the nocb GP thread asleep? */
+ u8 nocb_gp_bypass; /* Found a bypass on last scan? */
+ u8 nocb_gp_gp; /* GP to wait for on last scan? */
+ unsigned long nocb_gp_seq; /* If so, ->gp_seq to wait for. */
+ unsigned long nocb_gp_loops; /* # passes through wait code. */
struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */
struct task_struct *nocb_cb_kthread;
@@ -438,6 +442,7 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
static void rcu_spawn_cpu_nocb_kthread(int cpu);
static void __init rcu_spawn_nocb_kthreads(void);
+static void show_rcu_nocb_state(struct rcu_data *rdp);
static void rcu_nocb_lock(struct rcu_data *rdp);
static void rcu_nocb_unlock(struct rcu_data *rdp);
static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index fcbd61738dff..aaeb8a658f4b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2018,6 +2018,9 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
rcu_gp_kthread_wake();
}

+ my_rdp->nocb_gp_bypass = bypass;
+ my_rdp->nocb_gp_gp = needwait_gp;
+ my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
if (bypass && !rcu_nocb_poll) {
// At least one child with non-empty ->nocb_bypass, so set
// timer in order to avoid stranding its callbacks.
@@ -2052,6 +2055,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
WRITE_ONCE(my_rdp->nocb_gp_sleep, true);
raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
}
+ my_rdp->nocb_gp_seq = -1;
WARN_ON(signal_pending(current));
}

@@ -2068,6 +2072,7 @@ static int rcu_nocb_gp_kthread(void *arg)
struct rcu_data *rdp = arg;

for (;;) {
+ WRITE_ONCE(rdp->nocb_gp_loops, rdp->nocb_gp_loops + 1);
nocb_gp_wait(rdp);
cond_resched_tasks_rcu_qs();
}
@@ -2359,6 +2364,79 @@ void rcu_bind_current_to_nocb(void)
}
EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb);

+/*
+ * Dump out nocb grace-period kthread state for the specified rcu_data
+ * structure.
+ */
+static void show_rcu_nocb_gp_state(struct rcu_data *rdp)
+{
+ struct rcu_node *rnp = rdp->mynode;
+
+ pr_info("nocb GP %d %c%c%c%c%c%c %c[%c%c] %c%c:%ld rnp %d:%d %lu\n",
+ rdp->cpu,
+ "kK"[!!rdp->nocb_gp_kthread],
+ "lL"[raw_spin_is_locked(&rdp->nocb_gp_lock)],
+ "dD"[!!rdp->nocb_defer_wakeup],
+ "tT"[timer_pending(&rdp->nocb_timer)],
+ "bB"[timer_pending(&rdp->nocb_bypass_timer)],
+ "sS"[!!rdp->nocb_gp_sleep],
+ ".W"[swait_active(&rdp->nocb_gp_wq)],
+ ".W"[swait_active(&rnp->nocb_gp_wq[0])],
+ ".W"[swait_active(&rnp->nocb_gp_wq[1])],
+ ".B"[!!rdp->nocb_gp_bypass],
+ ".G"[!!rdp->nocb_gp_gp],
+ (long)rdp->nocb_gp_seq,
+ rnp->grplo, rnp->grphi, READ_ONCE(rdp->nocb_gp_loops));
+}
+
+/* Dump out nocb kthread state for the specified rcu_data structure. */
+static void show_rcu_nocb_state(struct rcu_data *rdp)
+{
+ struct rcu_segcblist *rsclp = &rdp->cblist;
+ bool waslocked;
+ bool wastimer;
+ bool wassleep;
+
+ if (rdp->nocb_gp_rdp == rdp)
+ show_rcu_nocb_gp_state(rdp);
+
+ pr_info(" CB %d->%d %c%c%c%c%c%c F%ld L%ld C%d %c%c%c%c%c q%ld\n",
+ rdp->cpu, rdp->nocb_gp_rdp->cpu,
+ "kK"[!!rdp->nocb_cb_kthread],
+ "bB"[raw_spin_is_locked(&rdp->nocb_bypass_lock)],
+ "cC"[!!atomic_read(&rdp->nocb_lock_contended)],
+ "lL"[raw_spin_is_locked(&rdp->nocb_lock)],
+ "sS"[!!rdp->nocb_cb_sleep],
+ ".W"[swait_active(&rdp->nocb_cb_wq)],
+ jiffies - rdp->nocb_bypass_first,
+ jiffies - rdp->nocb_nobypass_last,
+ rdp->nocb_nobypass_count,
+ ".D"[rcu_segcblist_ready_cbs(rsclp)],
+ ".W"[!rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL)],
+ ".R"[!rcu_segcblist_restempty(rsclp, RCU_WAIT_TAIL)],
+ ".N"[!rcu_segcblist_restempty(rsclp, RCU_NEXT_READY_TAIL)],
+ ".B"[!!rcu_cblist_n_cbs(&rdp->nocb_bypass)],
+ rcu_segcblist_n_cbs(&rdp->cblist));
+
+ /* It is OK for GP kthreads to have GP state. */
+ if (rdp->nocb_gp_rdp == rdp)
+ return;
+
+ waslocked = raw_spin_is_locked(&rdp->nocb_gp_lock);
+ wastimer = timer_pending(&rdp->nocb_timer);
+ wassleep = swait_active(&rdp->nocb_gp_wq);
+ if (!rdp->nocb_defer_wakeup && !rdp->nocb_gp_sleep &&
+ !waslocked && !wastimer && !wassleep)
+ return; /* Nothing untowards. */
+
+ pr_info(" !!! %c%c%c%c %c\n",
+ "lL"[waslocked],
+ "dD"[!!rdp->nocb_defer_wakeup],
+ "tT"[wastimer],
+ "sS"[!!rdp->nocb_gp_sleep],
+ ".W"[wassleep]);
+}
+
#else /* #ifdef CONFIG_RCU_NOCB_CPU */

/* No ->nocb_lock to acquire. */
@@ -2436,6 +2514,10 @@ static void __init rcu_spawn_nocb_kthreads(void)
{
}

+static void show_rcu_nocb_state(struct rcu_data *rdp)
+{
+}
+
#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */

/*
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 0627a66699a6..841ab43f3e60 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -589,6 +589,11 @@ void show_rcu_gp_kthreads(void)
cpu, (long)rdp->gp_seq_needed);
}
}
+ for_each_possible_cpu(cpu) {
+ rdp = per_cpu_ptr(&rcu_data, cpu);
+ if (rcu_segcblist_is_offloaded(&rdp->cblist))
+ show_rcu_nocb_state(rdp);
+ }
/* sched_show_task(rcu_state.gp_kthread); */
}
EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
--
2.17.1

2019-08-03 14:28:16

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 09/14] rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload

When under overload conditions, __call_rcu_nocb_wake() will wake the
no-CBs GP kthread any time the no-CBs CB kthread is asleep or there
are no ready-to-invoke callbacks, but only after a timer delay. If the
no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup
from __call_rcu_nocb_wake() is redundant. This commit therefore makes
__call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if
->nocb_bypass_timer is pending. This requires adding a bit of ordering
of timer actions.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index efd7f6fa2790..379cb7e50a62 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1906,8 +1906,10 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
rcu_advance_cbs_nowake(rdp->mynode, rdp);
rdp->nocb_gp_adv_time = j;
}
- if (rdp->nocb_cb_sleep ||
- !rcu_segcblist_ready_cbs(&rdp->cblist))
+ smp_mb(); /* Enqueue before timer_pending(). */
+ if ((rdp->nocb_cb_sleep ||
+ !rcu_segcblist_ready_cbs(&rdp->cblist)) &&
+ !timer_pending(&rdp->nocb_bypass_timer))
wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
TPS("WakeOvfIsDeferred"));
rcu_nocb_unlock_irqrestore(rdp, flags);
@@ -1926,6 +1928,7 @@ static void do_nocb_bypass_wakeup_timer(struct timer_list *t)

trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
rcu_nocb_lock_irqsave(rdp, flags);
+ smp_mb__after_spinlock(); /* Timer expire before wakeup. */
__call_rcu_nocb_wake(rdp, true, flags);
}

--
2.17.1

2019-08-03 14:28:33

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 03/14] rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index bb906295538d..fcbd61738dff 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1513,6 +1513,7 @@ static void rcu_nocb_bypass_lock(struct rcu_data *rdp)
if (raw_spin_trylock(&rdp->nocb_bypass_lock))
return;
atomic_inc(&rdp->nocb_lock_contended);
+ WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
smp_mb__after_atomic(); /* atomic_inc() before lock. */
raw_spin_lock(&rdp->nocb_bypass_lock);
smp_mb__before_atomic(); /* atomic_dec() after lock. */
@@ -1531,7 +1532,8 @@ static void rcu_nocb_bypass_lock(struct rcu_data *rdp)
*/
static void rcu_nocb_wait_contended(struct rcu_data *rdp)
{
- while (atomic_read(&rdp->nocb_lock_contended))
+ WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
+ while (WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended)))
cpu_relax();
}

--
2.17.1

2019-08-03 14:28:32

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing

Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations. However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking. Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.

This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure. Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.

The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first. During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations. However,
a CPU could simply stop invoking call_rcu() at any time. The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first). This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
used to provide the needed wakeups.
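
To put the rate limit in concrete terms, the default works out to
roughly 16,000 direct ->cblist enqueues per second regardless of HZ,
with everything beyond that going to ->nocb_bypass.  A standalone
arithmetic sketch (the HZ values are only examples):

    #include <stdio.h>

    int main(void)
    {
            const int hz_values[] = { 100, 250, 300, 1000 };
            int i;

            for (i = 0; i < 4; i++) {
                    int hz = hz_values[i];
                    /* Default from this patch: 16 * 1000 / HZ per jiffy. */
                    int per_jiffy = 16 * 1000 / hz;

                    printf("HZ=%4d: %3d enqueues/jiffy (~%d/second)\n",
                           hz, per_jiffy, per_jiffy * hz);
            }
            return 0;
    }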

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcu_segcblist.c | 30 ++++
kernel/rcu/rcu_segcblist.h | 5 +
kernel/rcu/tree.c | 16 +-
kernel/rcu/tree.h | 28 +--
kernel/rcu/tree_plugin.h | 356 ++++++++++++++++++++++++++++++++++---
5 files changed, 394 insertions(+), 41 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index ff431cc83037..495c58ce1640 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -36,6 +36,36 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
WRITE_ONCE(rclp->len, rclp->len + 1);
}

+/*
+ * Flush the second rcu_cblist structure onto the first one, obliterating
+ * any contents of the first. If rhp is non-NULL, enqueue it as the sole
+ * element of the second rcu_cblist structure, but ensuring that the second
+ * rcu_cblist structure, if initially non-empty, always appears non-empty
+ * throughout the process. If rhp is NULL, the second rcu_cblist structure
+ * is instead initialized to empty.
+ */
+void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
+ struct rcu_cblist *srclp,
+ struct rcu_head *rhp)
+{
+ drclp->head = srclp->head;
+ if (drclp->head)
+ drclp->tail = srclp->tail;
+ else
+ drclp->tail = &drclp->head;
+ drclp->len = srclp->len;
+ drclp->len_lazy = srclp->len_lazy;
+ if (!rhp) {
+ rcu_cblist_init(srclp);
+ } else {
+ rhp->next = NULL;
+ srclp->head = rhp;
+ srclp->tail = &rhp->next;
+ WRITE_ONCE(srclp->len, 1);
+ srclp->len_lazy = 0;
+ }
+}
+
/*
* Dequeue the oldest rcu_head structure from the specified callback
* list. This function assumes that the callback is non-lazy, but
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 1ff996647d3c..815c2fdd3fcc 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -25,6 +25,10 @@ static inline void rcu_cblist_dequeued_lazy(struct rcu_cblist *rclp)
}

void rcu_cblist_init(struct rcu_cblist *rclp);
+void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp);
+void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
+ struct rcu_cblist *srclp,
+ struct rcu_head *rhp);
struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp);

/*
@@ -92,6 +96,7 @@ static inline bool rcu_segcblist_restempty(struct rcu_segcblist *rsclp, int seg)
return !READ_ONCE(*READ_ONCE(rsclp->tails[seg]));
}

+void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp);
void rcu_segcblist_init(struct rcu_segcblist *rsclp);
void rcu_segcblist_disable(struct rcu_segcblist *rsclp);
void rcu_segcblist_offload(struct rcu_segcblist *rsclp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ec320658aeef..457623100d12 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1251,6 +1251,7 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
unsigned long gp_seq_req;
bool ret = false;

+ rcu_lockdep_assert_cblist_protected(rdp);
raw_lockdep_assert_held_rcu_node(rnp);

/* If no pending (not yet ready to invoke) callbacks, nothing to do. */
@@ -1292,7 +1293,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
unsigned long c;
bool needwake;

- lockdep_assert_irqs_disabled();
+ rcu_lockdep_assert_cblist_protected(rdp);
c = rcu_seq_snap(&rcu_state.gp_seq);
if (!rdp->gpwrap && ULONG_CMP_GE(rdp->gp_seq_needed, c)) {
/* Old request still live, so mark recent callbacks. */
@@ -1318,6 +1319,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp,
*/
static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
{
+ rcu_lockdep_assert_cblist_protected(rdp);
raw_lockdep_assert_held_rcu_node(rnp);

/* If no pending (not yet ready to invoke) callbacks, nothing to do. */
@@ -1341,6 +1343,7 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp,
struct rcu_data *rdp)
{
+ rcu_lockdep_assert_cblist_protected(rdp);
if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) ||
!raw_spin_trylock_rcu_node(rnp))
return;
@@ -2187,7 +2190,9 @@ static void rcu_do_batch(struct rcu_data *rdp)
* The following usually indicates a double call_rcu(). To track
* this down, try building with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y.
*/
- WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) != (count == 0));
+ WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
+ WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
+ count != 0 && rcu_segcblist_empty(&rdp->cblist));

rcu_nocb_unlock_irqrestore(rdp, flags);

@@ -2564,8 +2569,9 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func, bool lazy)
if (rcu_segcblist_empty(&rdp->cblist))
rcu_segcblist_init(&rdp->cblist);
}
- rcu_nocb_lock(rdp);
- was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+ if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+ return; // Enqueued onto ->nocb_bypass, so just leave.
+ /* If we get here, rcu_nocb_try_bypass() acquired ->nocb_lock. */
rcu_segcblist_enqueue(&rdp->cblist, head, lazy);
if (__is_kfree_rcu_offset((unsigned long)func))
trace_rcu_kfree_callback(rcu_state.name, head,
@@ -2839,6 +2845,7 @@ static void rcu_barrier_func(void *unused)
rdp->barrier_head.func = rcu_barrier_callback;
debug_rcu_head_queue(&rdp->barrier_head);
rcu_nocb_lock(rdp);
+ WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head, 0)) {
atomic_inc(&rcu_state.barrier_cpu_count);
} else {
@@ -3192,6 +3199,7 @@ void rcutree_migrate_callbacks(int cpu)
my_rdp = this_cpu_ptr(&rcu_data);
my_rnp = my_rdp->mynode;
rcu_nocb_lock(my_rdp); /* irqs already disabled. */
+ WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
/* Leverage recent GPs and set GP for new callbacks. */
needwake = rcu_advance_cbs(my_rnp, rdp) ||
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2c3e9068671c..e4df86db8137 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -200,18 +200,26 @@ struct rcu_data {
atomic_t nocb_lock_contended; /* Contention experienced. */
int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */
struct timer_list nocb_timer; /* Enforce finite deferral. */
+ unsigned long nocb_gp_adv_time; /* Last call_rcu() CB adv (jiffies). */
+
+ /* The following fields are used by call_rcu, hence own cacheline. */
+ raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
+ struct rcu_cblist nocb_bypass; /* Lock-contention-bypass CB list. */
+ unsigned long nocb_bypass_first; /* Time (jiffies) of first enqueue. */
+ unsigned long nocb_nobypass_last; /* Last ->cblist enqueue (jiffies). */
+ int nocb_nobypass_count; /* # ->cblist enqueues at ^^^ time. */

/* The following fields are used by GP kthread, hence own cacheline. */
raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
- bool nocb_gp_sleep;
- /* Is the nocb GP thread asleep? */
+ struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
+ bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */
struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */
struct task_struct *nocb_cb_kthread;
struct rcu_data *nocb_next_cb_rdp;
/* Next rcu_data in wakeup chain. */

- /* The following fields are used by CB kthread, hence new cachline. */
+ /* The following fields are used by CB kthread, hence new cacheline. */
struct rcu_data *nocb_gp_rdp ____cacheline_internodealigned_in_smp;
/* GP rdp takes GP-end wakeups. */
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
@@ -419,6 +427,10 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
static void rcu_init_one_nocb(struct rcu_node *rnp);
+static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
+ unsigned long j);
+static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
+ bool *was_alldone, unsigned long flags);
static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
unsigned long flags);
static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
@@ -430,19 +442,15 @@ static void rcu_nocb_lock(struct rcu_data *rdp);
static void rcu_nocb_unlock(struct rcu_data *rdp);
static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
unsigned long flags);
+static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp);
#ifdef CONFIG_RCU_NOCB_CPU
static void __init rcu_organize_nocb_kthreads(void);
#define rcu_nocb_lock_irqsave(rdp, flags) \
do { \
- if (!rcu_segcblist_is_offloaded(&(rdp)->cblist)) { \
+ if (!rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
local_irq_save(flags); \
- } else if (!raw_spin_trylock_irqsave(&(rdp)->nocb_lock, (flags))) {\
- atomic_inc(&(rdp)->nocb_lock_contended); \
- smp_mb__after_atomic(); /* atomic_inc() before lock. */ \
+ else \
raw_spin_lock_irqsave(&(rdp)->nocb_lock, (flags)); \
- smp_mb__before_atomic(); /* atomic_dec() after lock. */ \
- atomic_dec(&(rdp)->nocb_lock_contended); \
- } \
} while (0)
#else /* #ifdef CONFIG_RCU_NOCB_CPU */
#define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e164d2c5fa93..bb906295538d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1495,19 +1495,26 @@ static int __init parse_rcu_nocb_poll(char *arg)
early_param("rcu_nocb_poll", parse_rcu_nocb_poll);

/*
- * Acquire the specified rcu_data structure's ->nocb_lock, but only
- * if it corresponds to a no-CBs CPU. If the lock isn't immediately
- * available, increment ->nocb_lock_contended to flag the contention.
+ * Don't bother bypassing ->cblist if the call_rcu() rate is low.
+ * After all, the main point of bypassing is to avoid lock contention
+ * on ->nocb_lock, which only can happen at high call_rcu() rates.
*/
-static void rcu_nocb_lock(struct rcu_data *rdp)
+int nocb_nobypass_lim_per_jiffy = 16 * 1000 / HZ;
+module_param(nocb_nobypass_lim_per_jiffy, int, 0);
+
+/*
+ * Acquire the specified rcu_data structure's ->nocb_bypass_lock. If the
+ * lock isn't immediately available, increment ->nocb_lock_contended to
+ * flag the contention.
+ */
+static void rcu_nocb_bypass_lock(struct rcu_data *rdp)
{
lockdep_assert_irqs_disabled();
- if (!rcu_segcblist_is_offloaded(&rdp->cblist) ||
- raw_spin_trylock(&rdp->nocb_lock))
+ if (raw_spin_trylock(&rdp->nocb_bypass_lock))
return;
atomic_inc(&rdp->nocb_lock_contended);
smp_mb__after_atomic(); /* atomic_inc() before lock. */
- raw_spin_lock(&rdp->nocb_lock);
+ raw_spin_lock(&rdp->nocb_bypass_lock);
smp_mb__before_atomic(); /* atomic_dec() after lock. */
atomic_dec(&rdp->nocb_lock_contended);
}
@@ -1528,6 +1535,37 @@ static void rcu_nocb_wait_contended(struct rcu_data *rdp)
cpu_relax();
}

+/*
+ * Conditionally acquire the specified rcu_data structure's
+ * ->nocb_bypass_lock.
+ */
+static bool rcu_nocb_bypass_trylock(struct rcu_data *rdp)
+{
+ lockdep_assert_irqs_disabled();
+ return raw_spin_trylock(&rdp->nocb_bypass_lock);
+}
+
+/*
+ * Release the specified rcu_data structure's ->nocb_bypass_lock.
+ */
+static void rcu_nocb_bypass_unlock(struct rcu_data *rdp)
+{
+ lockdep_assert_irqs_disabled();
+ raw_spin_unlock(&rdp->nocb_bypass_lock);
+}
+
+/*
+ * Acquire the specified rcu_data structure's ->nocb_lock, but only
+ * if it corresponds to a no-CBs CPU.
+ */
+static void rcu_nocb_lock(struct rcu_data *rdp)
+{
+ lockdep_assert_irqs_disabled();
+ if (!rcu_segcblist_is_offloaded(&rdp->cblist))
+ return;
+ raw_spin_lock(&rdp->nocb_lock);
+}
+
/*
* Release the specified rcu_data structure's ->nocb_lock, but only
* if it corresponds to a no-CBs CPU.
@@ -1555,6 +1593,15 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
}
}

+/* Lockdep check that ->cblist may be safely accessed. */
+static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
+{
+ lockdep_assert_irqs_disabled();
+ if (rcu_segcblist_is_offloaded(&rdp->cblist) &&
+ cpu_online(rdp->cpu))
+ lockdep_assert_held(&rdp->nocb_lock);
+}
+
/*
* Wake up any no-CBs CPUs' kthreads that were waiting on the just-ended
* grace period.
@@ -1591,24 +1638,27 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
unsigned long flags)
__releases(rdp->nocb_lock)
{
+ bool needwake = false;
struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;

lockdep_assert_held(&rdp->nocb_lock);
if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+ TPS("AlreadyAwake"));
rcu_nocb_unlock_irqrestore(rdp, flags);
return;
}
- if (READ_ONCE(rdp_gp->nocb_gp_sleep) || force) {
- del_timer(&rdp->nocb_timer);
- rcu_nocb_unlock_irqrestore(rdp, flags);
- smp_mb(); /* enqueue before ->nocb_gp_sleep. */
- raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
+ del_timer(&rdp->nocb_timer);
+ rcu_nocb_unlock_irqrestore(rdp, flags);
+ raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
+ if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
- raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
- wake_up_process(rdp_gp->nocb_gp_kthread);
- } else {
- rcu_nocb_unlock_irqrestore(rdp, flags);
+ needwake = true;
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DoWake"));
}
+ raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
+ if (needwake)
+ wake_up_process(rdp_gp->nocb_gp_kthread);
}

/*
@@ -1625,6 +1675,188 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
}

+/*
+ * Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL.
+ * However, if there is a callback to be enqueued and if ->nocb_bypass
+ * proves to be initially empty, just return false because the no-CB GP
+ * kthread may need to be awakened in this case.
+ *
+ * Note that this function always returns true if rhp is NULL.
+ */
+static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
+ unsigned long j)
+{
+ struct rcu_cblist rcl;
+
+ WARN_ON_ONCE(!rcu_segcblist_is_offloaded(&rdp->cblist));
+ rcu_lockdep_assert_cblist_protected(rdp);
+ lockdep_assert_held(&rdp->nocb_bypass_lock);
+ if (rhp && !rcu_cblist_n_cbs(&rdp->nocb_bypass)) {
+ raw_spin_unlock(&rdp->nocb_bypass_lock);
+ return false;
+ }
+ /* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
+ if (rhp)
+ rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
+ rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+ rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
+ WRITE_ONCE(rdp->nocb_bypass_first, j);
+ rcu_nocb_bypass_unlock(rdp);
+ return true;
+}
+
+/*
+ * Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL.
+ * However, if there is a callback to be enqueued and if ->nocb_bypass
+ * proves to be initially empty, just return false because the no-CB GP
+ * kthread may need to be awakened in this case.
+ *
+ * Note that this function always returns true if rhp is NULL.
+ */
+static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
+ unsigned long j)
+{
+ if (!rcu_segcblist_is_offloaded(&rdp->cblist))
+ return true;
+ rcu_lockdep_assert_cblist_protected(rdp);
+ rcu_nocb_bypass_lock(rdp);
+ return rcu_nocb_do_flush_bypass(rdp, rhp, j);
+}
+
+/*
+ * If the ->nocb_bypass_lock is immediately available, flush the
+ * ->nocb_bypass queue into ->cblist.
+ */
+static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
+{
+ rcu_lockdep_assert_cblist_protected(rdp);
+ if (!rcu_segcblist_is_offloaded(&rdp->cblist) ||
+ !rcu_nocb_bypass_trylock(rdp))
+ return;
+ WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
+}
+
+/*
+ * See whether it is appropriate to use the ->nocb_bypass list in order
+ * to control contention on ->nocb_lock. A limited number of direct
+ * enqueues are permitted into ->cblist per jiffy. If ->nocb_bypass
+ * is non-empty, further callbacks must be placed into ->nocb_bypass,
+ * otherwise rcu_barrier() breaks. Use rcu_nocb_flush_bypass() to switch
+ * back to direct use of ->cblist. However, ->nocb_bypass should not be
+ * used if ->cblist is empty, because otherwise callbacks can be stranded
+ * on ->nocb_bypass because we cannot count on the current CPU ever again
+ * invoking call_rcu(). The general rule is that if ->nocb_bypass is
+ * non-empty, the corresponding no-CBs grace-period kthread must not be
+ * in an indefinite sleep state.
+ *
+ * Finally, it is not permitted to use the bypass during early boot,
+ * as doing so would confuse the auto-initialization code. Besides
+ * which, there is no point in worrying about lock contention while
+ * there is only one CPU in operation.
+ */
+static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
+ bool *was_alldone, unsigned long flags)
+{
+ unsigned long c;
+ unsigned long cur_gp_seq;
+ unsigned long j = jiffies;
+ long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+
+ if (!rcu_segcblist_is_offloaded(&rdp->cblist)) {
+ *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+ return false; /* Not offloaded, no bypassing. */
+ }
+ lockdep_assert_irqs_disabled();
+
+ // Don't use ->nocb_bypass during early boot.
+ if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
+ rcu_nocb_lock(rdp);
+ WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
+ *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+ return false;
+ }
+
+ // If we have advanced to a new jiffy, reset counts to allow
+ // moving back from ->nocb_bypass to ->cblist.
+ if (j == rdp->nocb_nobypass_last) {
+ c = rdp->nocb_nobypass_count + 1;
+ } else {
+ WRITE_ONCE(rdp->nocb_nobypass_last, j);
+ c = rdp->nocb_nobypass_count - nocb_nobypass_lim_per_jiffy;
+ if (c > nocb_nobypass_lim_per_jiffy)
+ c = nocb_nobypass_lim_per_jiffy;
+ else if (c < 0)
+ c = 0;
+ }
+ WRITE_ONCE(rdp->nocb_nobypass_count, c);
+
+ // If there hasn't yet been all that many ->cblist enqueues
+ // this jiffy, tell the caller to enqueue onto ->cblist. But flush
+ // ->nocb_bypass first.
+ if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
+ rcu_nocb_lock(rdp);
+ *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+ if (*was_alldone)
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+ TPS("FirstQ"));
+ WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
+ WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
+ return false; // Caller must enqueue the callback.
+ }
+
+ // If ->nocb_bypass has been used too long or is too full,
+ // flush ->nocb_bypass to ->cblist.
+ if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+ ncbs >= qhimark) {
+ rcu_nocb_lock(rdp);
+ if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
+ *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+ if (*was_alldone)
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+ TPS("FirstQ"));
+ WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
+ return false; // Caller must enqueue the callback.
+ }
+ if (j != rdp->nocb_gp_adv_time &&
+ rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
+ rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
+ rcu_advance_cbs_nowake(rdp->mynode, rdp);
+ rdp->nocb_gp_adv_time = j;
+ }
+ rcu_nocb_unlock_irqrestore(rdp, flags);
+ return true; // Callback already enqueued.
+ }
+
+ // We need to use the bypass.
+ rcu_nocb_wait_contended(rdp);
+ rcu_nocb_bypass_lock(rdp);
+ ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+ rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
+ rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+ if (!ncbs) {
+ WRITE_ONCE(rdp->nocb_bypass_first, j);
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
+ }
+ rcu_nocb_bypass_unlock(rdp);
+ smp_mb(); /* Order enqueue before wake. */
+ if (ncbs) {
+ local_irq_restore(flags);
+ } else {
+ // No-CBs GP kthread might be indefinitely asleep, if so, wake.
+ rcu_nocb_lock(rdp); // Rare during call_rcu() flood.
+ if (!rcu_segcblist_pend_cbs(&rdp->cblist)) {
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+ TPS("FirstBQwake"));
+ __call_rcu_nocb_wake(rdp, true, flags);
+ } else {
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+ TPS("FirstBQnoWake"));
+ rcu_nocb_unlock_irqrestore(rdp, flags);
+ }
+ }
+ return true; // Callback already enqueued.
+}
+
/*
* Awaken the no-CBs grace-period kthead if needed, either due to it
* legitimately being asleep or due to overload conditions.
@@ -1683,23 +1915,33 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
rcu_nocb_unlock_irqrestore(rdp, flags);
}
- if (!irqs_disabled_flags(flags)) {
- lockdep_assert_irqs_enabled();
- rcu_nocb_wait_contended(rdp);
- }
return;
}

+/* Wake up the no-CBs GP kthread to flush ->nocb_bypass. */
+static void do_nocb_bypass_wakeup_timer(struct timer_list *t)
+{
+ unsigned long flags;
+ struct rcu_data *rdp = from_timer(rdp, t, nocb_bypass_timer);
+
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
+ rcu_nocb_lock_irqsave(rdp, flags);
+ __call_rcu_nocb_wake(rdp, true, flags);
+}
+
/*
* No-CBs GP kthreads come here to wait for additional callbacks to show up
* or for grace periods to end.
*/
static void nocb_gp_wait(struct rcu_data *my_rdp)
{
+ bool bypass = false;
+ long bypass_ncbs;
int __maybe_unused cpu = my_rdp->cpu;
unsigned long cur_gp_seq;
unsigned long flags;
bool gotcbs;
+ unsigned long j = jiffies;
bool needwait_gp = false; // This prevents actual uninitialized use.
bool needwake;
bool needwake_gp;
@@ -1713,21 +1955,50 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
* and the global grace-period kthread are awakened if needed.
*/
for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) {
- if (rcu_segcblist_empty(&rdp->cblist))
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
+ rcu_nocb_lock_irqsave(rdp, flags);
+ bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+ if (bypass_ncbs &&
+ (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
+ bypass_ncbs > 2 * qhimark)) {
+ // Bypass full or old, so flush it.
+ (void)rcu_nocb_try_flush_bypass(rdp, j);
+ bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+ } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
+ rcu_nocb_unlock_irqrestore(rdp, flags);
continue; /* No callbacks here, try next. */
+ }
+ if (bypass_ncbs) {
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+ TPS("Bypass"));
+ bypass = true;
+ }
rnp = rdp->mynode;
- rcu_nocb_lock_irqsave(rdp, flags);
- WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
- del_timer(&my_rdp->nocb_timer);
- raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
- needwake_gp = rcu_advance_cbs(rnp, rdp);
- raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
+ if (bypass) { // Avoid race with first bypass CB.
+ WRITE_ONCE(my_rdp->nocb_defer_wakeup,
+ RCU_NOCB_WAKE_NOT);
+ del_timer(&my_rdp->nocb_timer);
+ }
+ // Advance callbacks if helpful and low contention.
+ needwake_gp = false;
+ if (!rcu_segcblist_restempty(&rdp->cblist,
+ RCU_NEXT_READY_TAIL) ||
+ (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
+ rcu_seq_done(&rnp->gp_seq, cur_gp_seq))) {
+ raw_spin_lock_rcu_node(rnp); /* irqs disabled. */
+ needwake_gp = rcu_advance_cbs(rnp, rdp);
+ raw_spin_unlock_rcu_node(rnp); /* irqs disabled. */
+ }
// Need to wait on some grace period?
+ WARN_ON_ONCE(!rcu_segcblist_restempty(&rdp->cblist,
+ RCU_NEXT_READY_TAIL));
if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq)) {
if (!needwait_gp ||
ULONG_CMP_LT(cur_gp_seq, wait_gp_seq))
wait_gp_seq = cur_gp_seq;
needwait_gp = true;
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+ TPS("NeedWaitGP"));
}
if (rcu_segcblist_ready_cbs(&rdp->cblist)) {
needwake = rdp->nocb_cb_sleep;
@@ -1745,6 +2016,13 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
rcu_gp_kthread_wake();
}

+ if (bypass && !rcu_nocb_poll) {
+ // At least one child with non-empty ->nocb_bypass, so set
+ // timer in order to avoid stranding its callbacks.
+ raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
+ mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+ raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
+ }
if (rcu_nocb_poll) {
/* Polling, so trace if first poll in the series. */
if (gotcbs)
@@ -1755,6 +2033,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Sleep"));
swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq,
!READ_ONCE(my_rdp->nocb_gp_sleep));
+ trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("EndSleep"));
} else {
rnp = my_rdp->mynode;
trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("StartWait"));
@@ -1766,6 +2045,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
}
if (!rcu_nocb_poll) {
raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
+ if (bypass)
+ del_timer(&my_rdp->nocb_bypass_timer);
WRITE_ONCE(my_rdp->nocb_gp_sleep, true);
raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
}
@@ -1947,8 +2228,11 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
init_swait_queue_head(&rdp->nocb_cb_wq);
init_swait_queue_head(&rdp->nocb_gp_wq);
raw_spin_lock_init(&rdp->nocb_lock);
+ raw_spin_lock_init(&rdp->nocb_bypass_lock);
raw_spin_lock_init(&rdp->nocb_gp_lock);
timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
+ timer_setup(&rdp->nocb_bypass_timer, do_nocb_bypass_wakeup_timer, 0);
+ rcu_cblist_init(&rdp->nocb_bypass);
}

/*
@@ -2092,6 +2376,12 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
local_irq_restore(flags);
}

+/* Lockdep check that ->cblist may be safely accessed. */
+static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
+{
+ lockdep_assert_irqs_disabled();
+}
+
static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq)
{
}
@@ -2105,6 +2395,18 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
{
}

+static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
+ unsigned long j)
+{
+ return true;
+}
+
+static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
+ bool *was_alldone, unsigned long flags)
+{
+ return false;
+}
+
static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
unsigned long flags)
{
--
2.17.1

2019-08-03 14:28:57

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC tip/core/rcu 08/14] rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention

Currently, __call_rcu_nocb_wake() advances callbacks each time that it
detects excessive numbers of callbacks, though only if it succeeds in
conditionally acquiring its leaf rcu_node structure's ->lock. Despite
the conditional acquisition of ->lock, this does increase contention.
This commit therefore avoids advancing callbacks unless there are
callbacks in ->cblist whose grace period has completed and advancing
has not yet been done during this jiffy.

Note that this decision does not take the presence of new callbacks
into account. That is because on this code path, there will always be
at least one new callback, namely the one we just enqueued.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 02739366ef5d..efd7f6fa2790 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1869,6 +1869,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
unsigned long flags)
__releases(rdp->nocb_lock)
{
+ unsigned long cur_gp_seq;
+ unsigned long j;
long len;
struct task_struct *t;

@@ -1897,12 +1899,17 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
} else if (len > rdp->qlen_last_fqs_check + qhimark) {
/* ... or if many callbacks queued. */
rdp->qlen_last_fqs_check = len;
- if (rdp->nocb_cb_sleep ||
- !rcu_segcblist_ready_cbs(&rdp->cblist)) {
+ j = jiffies;
+ if (j != rdp->nocb_gp_adv_time &&
+ rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
+ rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
rcu_advance_cbs_nowake(rdp->mynode, rdp);
+ rdp->nocb_gp_adv_time = j;
+ }
+ if (rdp->nocb_cb_sleep ||
+ !rcu_segcblist_ready_cbs(&rdp->cblist))
wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
TPS("WakeOvfIsDeferred"));
- }
rcu_nocb_unlock_irqrestore(rdp, flags);
} else {
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
--
2.17.1

2019-08-03 14:29:03

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC tip/core/rcu 10/14] rcu: Allow rcu_do_batch() to dynamically adjust batch sizes

From: Eric Dumazet <[email protected]>

Bimodal behavior of rcu_do_batch() is not really suited to Google
applications like gfe servers.

When a process with millions of sockets exits, closing all files
queues two rcu callbacks per socket.

This eventually reaches the point where RCU enters an emergency
mode, in which rcu_do_batch() does not return until the whole queue
is flushed.

Each rcu callback lasts at least 70 nsec, so with millions of
elements, we easily spend more than 100 msec without rescheduling.

The goal of this patch is to avoid infamous messages like the following:
"need_resched set for > 51999388 ns (52 ticks) without schedule"

We dynamically adjust the number of elements we process: instead of the
old 10/INFINITE choice, we use a floor of ~1% of the current entries.

If the number is above 1000, we switch to a time-based limit of 3 msec
per batch, adjustable via /sys/module/rcutree/parameters/rcu_resched_ns.
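
For example (purely illustrative numbers): with two million callbacks
queued and the default rcu_divisor of 7, the floor works out to
2000000 >> 7 = 15625 callbacks per batch, well above 1000, so the
3 msec time limit applies as well.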

Signed-off-by: Eric Dumazet <[email protected]>
[ paulmck: Forward-port and remove debug statements. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3e89b5b83ea0..71395e91b876 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -56,6 +56,7 @@
#include <linux/smpboot.h>
#include <linux/jiffies.h>
#include <linux/sched/isolation.h>
+#include <linux/sched/clock.h>
#include "../time/tick-internal.h"

#include "tree.h"
@@ -416,6 +417,12 @@ module_param(qlowmark, long, 0444);
static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;
static bool rcu_kick_kthreads;
+static int rcu_divisor = 7;
+module_param(rcu_divisor, int, 0644);
+
+/* Force an exit from rcu_do_batch() after 3 milliseconds. */
+static long rcu_resched_ns = 3 * NSEC_PER_MSEC;
+module_param(rcu_resched_ns, long, 0644);

/*
* How long the grace period must be before we start recruiting
@@ -2109,6 +2116,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
long bl, count;
+ long pending, tlimit = 0;

/* If no callbacks are ready, just return. */
if (!rcu_segcblist_ready_cbs(&rdp->cblist)) {
@@ -2130,7 +2138,10 @@ static void rcu_do_batch(struct rcu_data *rdp)
local_irq_save(flags);
rcu_nocb_lock(rdp);
WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
- bl = rdp->blimit;
+ pending = rcu_segcblist_n_cbs(&rdp->cblist);
+ bl = max(rdp->blimit, pending >> rcu_divisor);
+ if (unlikely(bl > 100))
+ tlimit = local_clock() + rcu_resched_ns;
trace_rcu_batch_start(rcu_state.name,
rcu_segcblist_n_lazy_cbs(&rdp->cblist),
rcu_segcblist_n_cbs(&rdp->cblist), bl);
@@ -2153,6 +2164,13 @@ static void rcu_do_batch(struct rcu_data *rdp)
(need_resched() ||
(!is_idle_task(current) && !rcu_is_callbacks_kthread())))
break;
+ if (unlikely(tlimit)) {
+ /* only call local_clock() every 32 callbacks */
+ if (likely((-rcl.len & 31) || local_clock() < tlimit))
+ continue;
+ /* Exceeded the time limit, so leave. */
+ break;
+ }
if (offloaded) {
WARN_ON_ONCE(in_serving_softirq());
local_bh_enable();
--
2.17.1

2019-08-03 20:01:07

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC tip/core/rcu 06/14] rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()

The rcutree_migrate_callbacks() function invokes rcu_advance_cbs() on
both the offlined CPU's ->cblist and that of the surviving CPU, then
merges them. However, after the merge, any of the offlined CPU's callbacks
that were not ready to be invoked will no longer be associated with a
grace-period number. This commit therefore invokes rcu_advance_cbs()
one more time on the merged ->cblist in order to assign a grace-period
number to these callbacks.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 457623100d12..3e89b5b83ea0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3205,6 +3205,7 @@ void rcutree_migrate_callbacks(int cpu)
needwake = rcu_advance_cbs(my_rnp, rdp) ||
rcu_advance_cbs(my_rnp, my_rdp);
rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
+ needwake = needwake || rcu_advance_cbs(my_rnp, my_rdp);
rcu_segcblist_disable(&rdp->cblist);
WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
!rcu_segcblist_n_cbs(&my_rdp->cblist));
--
2.17.1

2019-08-04 14:45:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> The multi_cpu_stop() function relies on the scheduler to gain control from
> whatever is running on the various online CPUs, including any nohz_full
> CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> and can also result in RCU CPU stall warnings. This commit therefore
> causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> online CPUs.

This sounds wrong; should we be fixing sched_can_stop_tick() instead to
return false when the stop task is runnable?
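
(For concreteness, a minimal sketch of that suggestion -- illustrative
only, not a tested patch -- would be a check along these lines in
sched_can_stop_tick():

	bool sched_can_stop_tick(struct rq *rq)
	{
		...
		/* Keep the tick while the stop task is queued here. */
		if (rq->stop && task_on_rq_queued(rq->stop))
			return false;
		...
	}

so that a nohz_full CPU would retake the tick as soon as the stop work
is queued for it.)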

2019-08-04 14:50:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Sun, Aug 04, 2019 at 04:43:17PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > The multi_cpu_stop() function relies on the scheduler to gain control from
> > whatever is running on the various online CPUs, including any nohz_full
> > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > and can also result in RCU CPU stall warnings. This commit therefore
> > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > online CPUs.
>
> This sounds wrong; should we be fixing sched_can_stop_tick() instead to
> return false when the stop task is runnable?

And even without that; I don't understand how we're not instantly
preempted the moment we enqueue the stop task.

Any enqueue, should go through check_preempt_curr() which will be an
instant resched_curr() when we just woke the stop class.

2019-08-04 14:52:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/14] rcu/nocb: Atomic ->len field in rcu_segcblist structure

On Fri, Aug 02, 2019 at 08:14:48AM -0700, Paul E. McKenney wrote:
> +/*
> + * Exchange the numeric length of the specified rcu_segcblist structure
> + * with the specified value. This can cause the ->len field to disagree
> + * with the actual number of callbacks on the structure. This exchange is
> + * fully ordered with respect to the callers accesses both before and after.
> + */
> +long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
> +{
> +#ifdef CONFIG_RCU_NOCB_CPU
> + return atomic_long_xchg(&rsclp->len, v);
> +#else
> + long ret = rsclp->len;
> +
> + smp_mb(); /* Up to the caller! */
> + WRITE_ONCE(rsclp->len, v);
> + smp_mb(); /* Up to the caller! */
> + return ret;
> +#endif
> +}

That one's weird; for matching semantics the load needs to be between
the memory barriers.
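
I.e., for the !CONFIG_RCU_NOCB_CPU leg, something like this (sketch
only):

	smp_mb(); /* Up to the caller! */
	ret = rsclp->len;
	WRITE_ONCE(rsclp->len, v);
	smp_mb(); /* Up to the caller! */
	return ret;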

2019-08-04 14:55:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/14] rcu/nocb: Atomic ->len field in rcu_segcblist structure

On Sun, Aug 04, 2019 at 04:50:51PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 02, 2019 at 08:14:48AM -0700, Paul E. McKenney wrote:
> > +/*
> > + * Exchange the numeric length of the specified rcu_segcblist structure
> > + * with the specified value. This can cause the ->len field to disagree
> > + * with the actual number of callbacks on the structure. This exchange is
> > + * fully ordered with respect to the callers accesses both before and after.
> > + */
> > +long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
> > +{
> > +#ifdef CONFIG_RCU_NOCB_CPU
> > + return atomic_long_xchg(&rsclp->len, v);
> > +#else
> > + long ret = rsclp->len;
> > +
> > + smp_mb(); /* Up to the caller! */
> > + WRITE_ONCE(rsclp->len, v);
> > + smp_mb(); /* Up to the caller! */
> > + return ret;
> > +#endif
> > +}
>
> That one's weird; for matching semantics the load needs to be between
> the memory barriers.

Also, since you WRITE_ONCE() the thing, the load needs to be a
READ_ONCE().

2019-08-04 18:42:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Sun, Aug 04, 2019 at 04:48:35PM +0200, Peter Zijlstra wrote:
> On Sun, Aug 04, 2019 at 04:43:17PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > The multi_cpu_stop() function relies on the scheduler to gain control from
> > > whatever is running on the various online CPUs, including any nohz_full
> > > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > > and can also result in RCU CPU stall warnings. This commit therefore
> > > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > > online CPUs.
> >
> > This sounds wrong; should we be fixing sched_can_stop_tick() instead to
> > return false when the stop task is runnable?

Agreed. However, it is proving surprisingly hard to come up with a
code sequence that has the effect of rcu_nocb without nohz_full.
And rcu_nocb works just fine. With nohz_full also in place, I am
decreasing the failure rate, but it still fails, perhaps a few times
per hour of TREE04 rcutorture on an eight-CPU system. (My 12-CPU
system stubbornly refuses to fail. Good thing I kept the eight-CPU
system around, I guess.)

When I arrive at some sequence of actions that actually work reliably,
then by all means let's put it somewhere in the NO_HZ_FULL machinery!

> And even without that; I don't understand how we're not instantly
> preempted the moment we enqueue the stop task.

There is no preemption because CONFIG_PREEMPT=n for the scenarios still
having trouble. Yes, there are cond_resched() calls, but they don't do
anything unless the appropriate flags are set, which won't always happen
without the tick, apparently. Or without -something- that isn't always
happening as it should.

> Any enqueue, should go through check_preempt_curr() which will be an
> instant resched_curr() when we just woke the stop class.

I did try hitting all of the CPUs with resched_cpu(). Ten times on each
CPU with a ten-jiffy wait between each. This might have decreased the
probability of excessively long CPU-stopper waits by a factor of two or
three, but it did not eliminate the excessively long waits.
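
(Roughly the following, as an illustrative debugging hack rather than
the actual code:

	for (i = 0; i < 10; i++) {
		for_each_online_cpu(cpu)
			resched_cpu(cpu);		/* Kick each online CPU. */
		schedule_timeout_uninterruptible(10);	/* Ten-jiffy wait. */
	}

where the constant 10 in the sleep is the ten-jiffy wait mentioned
above.)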

What else should I try?

For example, are there any diagnostics I could collect, say from within
the CPU stopper when things are taking too long? I see CPU-stopper
delays in excess of five -minutes-, so this is anything but subtle.

Thanx, Paul

2019-08-04 18:44:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/14] rcu/nocb: Atomic ->len field in rcu_segcblist structure

On Sun, Aug 04, 2019 at 04:50:51PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 02, 2019 at 08:14:48AM -0700, Paul E. McKenney wrote:
> > +/*
> > + * Exchange the numeric length of the specified rcu_segcblist structure
> > + * with the specified value. This can cause the ->len field to disagree
> > + * with the actual number of callbacks on the structure. This exchange is
> > + * fully ordered with respect to the callers accesses both before and after.
> > + */
> > +long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
> > +{
> > +#ifdef CONFIG_RCU_NOCB_CPU
> > + return atomic_long_xchg(&rsclp->len, v);
> > +#else
> > + long ret = rsclp->len;
> > +
> > + smp_mb(); /* Up to the caller! */
> > + WRITE_ONCE(rsclp->len, v);
> > + smp_mb(); /* Up to the caller! */
> > + return ret;
> > +#endif
> > +}
>
> That one's weird; for matching semantics the load needs to be between
> the memory barriers.

This is to match the concurrent version, which uses atomic_xchg().

Thanx, Paul

2019-08-04 18:47:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 01/14] rcu/nocb: Atomic ->len field in rcu_segcblist structure

On Sun, Aug 04, 2019 at 04:52:46PM +0200, Peter Zijlstra wrote:
> On Sun, Aug 04, 2019 at 04:50:51PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 02, 2019 at 08:14:48AM -0700, Paul E. McKenney wrote:
> > > +/*
> > > + * Exchange the numeric length of the specified rcu_segcblist structure
> > > + * with the specified value. This can cause the ->len field to disagree
> > > + * with the actual number of callbacks on the structure. This exchange is
> > > + * fully ordered with respect to the callers accesses both before and after.
> > > + */
> > > +long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
> > > +{
> > > +#ifdef CONFIG_RCU_NOCB_CPU
> > > + return atomic_long_xchg(&rsclp->len, v);
> > > +#else
> > > + long ret = rsclp->len;
> > > +
> > > + smp_mb(); /* Up to the caller! */
> > > + WRITE_ONCE(rsclp->len, v);
> > > + smp_mb(); /* Up to the caller! */
> > > + return ret;
> > > +#endif
> > > +}
> >
> > That one's weird; for matching semantics the load needs to be between
> > the memory barriers.
>
> Also, since you WRITE_ONCE() the thing, the load needs to be a
> READ_ONCE().

Not in this case, because ->len is written only by the CPU in question
in the !RCU_NOCB_CPU case.

It would not be hard to convince me that adding READ_ONCE() would be
cheap and easy future-proofing, but Linus has objected to that sort of
thing in the past.

Thanx, Paul

2019-08-04 20:26:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Sun, Aug 04, 2019 at 11:41:59AM -0700, Paul E. McKenney wrote:
> On Sun, Aug 04, 2019 at 04:48:35PM +0200, Peter Zijlstra wrote:
> > On Sun, Aug 04, 2019 at 04:43:17PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > > The multi_cpu_stop() function relies on the scheduler to gain control from
> > > > whatever is running on the various online CPUs, including any nohz_full
> > > > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > > > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > > > and can also result in RCU CPU stall warnings. This commit therefore
> > > > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > > > online CPUs.
> > >
> > > This sounds wrong; should we be fixing sched_can_stop_tick() instead to
> > > return false when the stop task is runnable?
>
> Agreed. However, it is proving surprisingly hard to come up with a
> code sequence that has the effect of rcu_nocb without nohz_full.
> And rcu_nocb works just fine. With nohz_full also in place, I am
> decreasing the failure rate, but it still fails, perhaps a few times
> per hour of TREE04 rcutorture on an eight-CPU system. (My 12-CPU
> system stubbornly refuses to fail. Good thing I kept the eight-CPU
> system around, I guess.)
>
> When I arrive at some sequence of actions that actually work reliably,
> then by all means let's put it somewhere in the NO_HZ_FULL machinery!
>
> > And even without that; I don't understand how we're not instantly
> > preempted the moment we enqueue the stop task.
>
> There is no preemption because CONFIG_PREEMPT=n for the scenarios still
> having trouble. Yes, there are cond_resched() calls, but they don't do
> anything unless the appropriate flags are set, which won't always happen
> without the tick, apparently. Or without -something- that isn't always
> happening as it should.
>
> > Any enqueue, should go through check_preempt_curr() which will be an
> > instant resched_curr() when we just woke the stop class.
>
> I did try hitting all of the CPUs with resched_cpu(). Ten times on each
> CPU with a ten-jiffy wait between each. This might have decreased the
> probability of excessively long CPU-stopper waits by a factor of two or
> three, but it did not eliminate the excessively long waits.
>
> What else should I try?
>
> For example, are there any diagnostics I could collect, say from within
> the CPU stopper when things are taking too long? I see CPU-stopper
> delays in excess of five -minutes-, so this is anything but subtle.

For whatever it is worth, the things on my list include using 25 rounds
of resched_cpu() on each CPU with ten-jiffy wait between each (instead of
merely 10 rounds), using waitqueues or some such to actually force a
meaningful context switch on the other CPUs, etc.

Thanx, Paul

2019-08-05 04:20:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Sun, Aug 04, 2019 at 01:24:46PM -0700, Paul E. McKenney wrote:
> On Sun, Aug 04, 2019 at 11:41:59AM -0700, Paul E. McKenney wrote:
> > On Sun, Aug 04, 2019 at 04:48:35PM +0200, Peter Zijlstra wrote:
> > > On Sun, Aug 04, 2019 at 04:43:17PM +0200, Peter Zijlstra wrote:
> > > > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > > > The multi_cpu_stop() function relies on the scheduler to gain control from
> > > > > whatever is running on the various online CPUs, including any nohz_full
> > > > > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > > > > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > > > > and can also result in RCU CPU stall warnings. This commit therefore
> > > > > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > > > > online CPUs.
> > > >
> > > > This sounds wrong; should we be fixing sched_can_stop_tick() instead to
> > > > return false when the stop task is runnable?
> >
> > Agreed. However, it is proving surprisingly hard to come up with a
> > code sequence that has the effect of rcu_nocb without nohz_full.
> > And rcu_nocb works just fine. With nohz_full also in place, I am
> > decreasing the failure rate, but it still fails, perhaps a few times
> > per hour of TREE04 rcutorture on an eight-CPU system. (My 12-CPU
> > system stubbornly refuses to fail. Good thing I kept the eight-CPU
> > system around, I guess.)
> >
> > When I arrive at some sequence of actions that actually work reliably,
> > then by all means let's put it somewhere in the NO_HZ_FULL machinery!
> >
> > > And even without that; I don't understand how we're not instantly
> > > preempted the moment we enqueue the stop task.
> >
> > There is no preemption because CONFIG_PREEMPT=n for the scenarios still
> > having trouble. Yes, there are cond_resched() calls, but they don't do
> > anything unless the appropriate flags are set, which won't always happen
> > without the tick, apparently. Or without -something- that isn't always
> > happening as it should.
> >
> > > Any enqueue, should go through check_preempt_curr() which will be an
> > > instant resched_curr() when we just woke the stop class.
> >
> > I did try hitting all of the CPUs with resched_cpu(). Ten times on each
> > CPU with a ten-jiffy wait between each. This might have decreased the
> > probability of excessively long CPU-stopper waits by a factor of two or
> > three, but it did not eliminate the excessively long waits.
> >
> > What else should I try?
> >
> > For example, are there any diagnostics I could collect, say from within
> > the CPU stopper when things are taking too long? I see CPU-stopper
> > delays in excess of five -minutes-, so this is anything but subtle.
>
> For whatever it is worth, the things on my list include using 25 rounds
> of resched_cpu() on each CPU with ten-jiffy wait between each (instead of
> merely 10 rounds), using waitqueues or some such to actually force a
> meaningful context switch on the other CPUs, etc.

Which appears to have reduced the bug rate by about a factor of two.
(But statistics and all that.)

I am now trying the same test, but with CONFIG_PREEMPT=y and without
quite so much hammering on the scheduler. This is keying off Peter's
earlier mention of preemption. If this turns out to be solid, perhaps
we outlaw CONFIG_PREEMPT=n && CONFIG_NO_HZ_FULL=y?

Thanx, Paul

2019-08-05 08:07:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Sun, Aug 04, 2019 at 11:41:59AM -0700, Paul E. McKenney wrote:
> On Sun, Aug 04, 2019 at 04:48:35PM +0200, Peter Zijlstra wrote:
> > On Sun, Aug 04, 2019 at 04:43:17PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > > The multi_cpu_stop() function relies on the scheduler to gain control from
> > > > whatever is running on the various online CPUs, including any nohz_full
> > > > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > > > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > > > and can also result in RCU CPU stall warnings. This commit therefore
> > > > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > > > online CPUs.
> > >
> > > This sounds wrong; should we be fixing sched_can_stop_tick() instead to
> > > return false when the stop task is runnable?
>
> Agreed. However, it is proving surprisingly hard to come up with a
> code sequence that has the effect of rcu_nocb without nohz_full.
> And rcu_nocb works just fine. With nohz_full also in place, I am
> decreasing the failure rate, but it still fails, perhaps a few times
> per hour of TREE04 rcutorture on an eight-CPU system. (My 12-CPU
> system stubbornly refuses to fail. Good thing I kept the eight-CPU
> system around, I guess.)
>
> When I arrive at some sequence of actions that actually work reliably,
> then by all means let's put it somewhere in the NO_HZ_FULL machinery!

I'm confused; what are you arguing? The patch as proposed is just wrong,
it needs to go.

> > And even without that; I don't understand how we're not instantly
> > preempted the moment we enqueue the stop task.
>
> There is no preemption because CONFIG_PREEMPT=n for the scenarios still

That doesn't make sense; even with CONFIG_PREEMPT=n we set
TIF_NEED_RESCHED. We'll just not react to it as promptly (only explicit
rescheduling points and return to userspace). Enabling the tick will not
make any difference whatsoever.

Tick based preemption will not 'fix' the lack of wakeup preemption. If
the stop task wakeup didn't set TIF_NEED_RESCHED, the OTHER/CFS tick
will not either.
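
To spell it out, with CONFIG_PREEMPT=n a voluntary preemption point is
roughly the following (simplified sketch, not the exact source):

	int _cond_resched(void)
	{
		/* Yields only if TIF_NEED_RESCHED is already set. */
		if (should_resched(0)) {
			preempt_schedule_common();
			return 1;
		}
		return 0;
	}

so the tick matters only insofar as something sets TIF_NEED_RESCHED, and
the stop-task wakeup should already have done that via resched_curr().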

> having trouble. Yes, there are cond_resched() calls, but they don't do
> anything unless the appropriate flags are set, which won't always happen
> without the tick, apparently. Or without -something- that isn't always
> happening as it should.

Right; so clearly we're not understanding what's happening. That seems
like a requirement for actually doing a patch.

> > Any enqueue, should go through check_preempt_curr() which will be an
> > instant resched_curr() when we just woke the stop class.
>
> I did try hitting all of the CPUs with resched_cpu(). Ten times on each
> CPU with a ten-jiffy wait between each. This might have decreased the
> probability of excessively long CPU-stopper waits by a factor of two or
> three, but it did not eliminate the excessively long waits.
>
> What else should I try?
>
> For example, are there any diagnostics I could collect, say from within
> the CPU stopper when things are taking too long? I see CPU-stopper
> delays in excess of five -minutes-, so this is anything but subtle.

Catch the whole thing in a function trace?

The chain that should instantly set TIF_NEED_RESCHED:

stop_machine()
stop_machine_cpuslocked()
stop_cpus()
__stop_cpus()
queue_stop_cpus_work()
cpu_stop_queue_work()
wake_up_q()
wake_up_process()


wake_up_process()
try_to_wake_up()
ttwu_queue()
ttwu_queue_remote()
<- scheduler_ipi()
sched_ttwu_pending()
ttwu_do_activate()

ttwu_do_activate()
activate_task()
ttwu_do_wakeup()
check_preempt_curr()
resched_curr()

You could frob some tracing into __stop_cpus(), before
wait_for_completion(), at that point all the CPUs in @cpumask should
either be running the stop task or have TIF_NEED_RESCHED set.


2019-08-05 08:08:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Sun, Aug 04, 2019 at 09:19:01PM -0700, Paul E. McKenney wrote:
> On Sun, Aug 04, 2019 at 01:24:46PM -0700, Paul E. McKenney wrote:

> > For whatever it is worth, the things on my list include using 25 rounds
> > of resched_cpu() on each CPU with ten-jiffy wait between each (instead of
> > merely 10 rounds), using waitqueues or some such to actually force a
> > meaningful context switch on the other CPUs, etc.

That really should not be needed. What are those other CPUs doing?

> Which appears to have reduced the bug rate by about a factor of two.
> (But statistics and all that.)

Which is just weird..

> I am now trying the same test, but with CONFIG_PREEMPT=y and without
> quite so much hammering on the scheduler. This is keying off Peter's
> earlier mention of preemption. If this turns out to be solid, perhaps
> we outlaw CONFIG_PREEMPT=n && CONFIG_NO_HZ_FULL=y?

CONFIG_PREEMPT=n should work just fine, _something_ is off.

2019-08-05 14:50:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 05, 2019 at 10:07:36AM +0200, Peter Zijlstra wrote:
> On Sun, Aug 04, 2019 at 09:19:01PM -0700, Paul E. McKenney wrote:
> > On Sun, Aug 04, 2019 at 01:24:46PM -0700, Paul E. McKenney wrote:
>
> > > For whatever it is worth, the things on my list include using 25 rounds
> > > of resched_cpu() on each CPU with ten-jiffy wait between each (instead of
> > > merely 10 rounds), using waitqueues or some such to actually force a
> > > meaningful context switch on the other CPUs, etc.
>
> That really should not be needed. What are those other CPUs doing?

Excellent question. It would be really nice to have a CPU-stopper stall
warning, wouldn't it? But who knows? Maybe I am the only one to have
run into this. However, the comment in multi_cpu_stop() just before
the call to touch_nmi_watchdog() leads me to believe otherwise. ;-)

> > Which appears to have reduced the bug rate by about a factor of two.
> > (But statistics and all that.)
>
> Which is just weird..

Indeed. Your point being?

> > I am now trying the same test, but with CONFIG_PREEMPT=y and without
> > quite so much hammering on the scheduler. This is keying off Peter's
> > earlier mention of preemption. If this turns out to be solid, perhaps
> > we outlaw CONFIG_PREEMPT=n && CONFIG_NO_HZ_FULL=y?
>
> CONFIG_PREEMPT=n should work just fine, _something_ is off.

Thank you, that is what I needed to know.

Thanx, Paul

2019-08-05 14:56:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 05, 2019 at 10:05:31AM +0200, Peter Zijlstra wrote:
> On Sun, Aug 04, 2019 at 11:41:59AM -0700, Paul E. McKenney wrote:
> > On Sun, Aug 04, 2019 at 04:48:35PM +0200, Peter Zijlstra wrote:
> > > On Sun, Aug 04, 2019 at 04:43:17PM +0200, Peter Zijlstra wrote:
> > > > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > > > The multi_cpu_stop() function relies on the scheduler to gain control from
> > > > > whatever is running on the various online CPUs, including any nohz_full
> > > > > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > > > > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > > > > and can also result in RCU CPU stall warnings. This commit therefore
> > > > > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > > > > online CPUs.
> > > >
> > > > This sounds wrong; should we be fixing sched_can_stop_tick() instead to
> > > > return false when the stop task is runnable?
> >
> > Agreed. However, it is proving surprisingly hard to come up with a
> > code sequence that has the effect of rcu_nocb without nohz_full.
> > And rcu_nocb works just fine. With nohz_full also in place, I am
> > decreasing the failure rate, but it still fails, perhaps a few times
> > per hour of TREE04 rcutorture on an eight-CPU system. (My 12-CPU
> > system stubbornly refuses to fail. Good thing I kept the eight-CPU
> > system around, I guess.)
> >
> > When I arrive at some sequence of actions that actually work reliably,
> > then by all means let's put it somewhere in the NO_HZ_FULL machinery!
>
> I'm confused; what are you arguing? The patch as proposed is just wrong,
> it needs to go.

Eventually, sure. But one dragon at a time. Right now that dragon is
"what is required to get multi_cpu_stop() to work in a timely fashioon".
The "where does that code really go" dragon comes later.

> > > And even without that; I don't understand how we're not instantly
> > > preempted the moment we enqueue the stop task.
> >
> > There is no preemption because CONFIG_PREEMPT=n for the scenarios still
>
> That doesn't make sense; even with CONFIG_PREEMPT=n we set
> TIF_NEED_RESCHED. We'll just not react to it as promptly (only explicit
> rescheduling points and return to userspace). Enabling the tick will not
> make any difference whatsoever.
>
> Tick based preemption will not 'fix' the lack of wakeup preemption. If
> the stop task wakeup didn't set TIF_NEED_RESCHED, the OTHER/CFS tick
> will not either.

Seems logical except for the fact that multi_cpu_stop() really is taking
in excess of five minutes on a regular basis.

> > having trouble. Yes, there are cond_resched() calls, but they don't do
> > anything unless the appropriate flags are set, which won't always happen
> > without the tick, apparently. Or without -something- that isn't always
> > happening as it should.
>
> Right; so clearly we're not understanding what's happening. That seems
> like a requirement for actually doing a patch.

Almost but not quite. It is a requirement for a patch *that* *is*
*supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
writing experimental patches, please feel free to take a long walk on
a short pier.

Understood???

> > > Any enqueue, should go through check_preempt_curr() which will be an
> > > instant resched_curr() when we just woke the stop class.
> >
> > I did try hitting all of the CPUs with resched_cpu(). Ten times on each
> > CPU with a ten-jiffy wait between each. This might have decreased the
> > probability of excessively long CPU-stopper waits by a factor of two or
> > three, but it did not eliminate the excessively long waits.
> >
> > What else should I try?
> >
> > For example, are there any diagnostics I could collect, say from within
> > the CPU stopper when things are taking too long? I see CPU-stopper
> > delays in excess of five -minutes-, so this is anything but subtle.
>
> Catch the whole thing in a function trace?
>
> The chain that should instantly set TIF_NEED_RESCHED:
>
> stop_machine()
> stop_machine_cpuslocked()
> stop_cpus()
> __stop_cpus()
> queue_stop_cpus_work()
> cpu_stop_queue_work()
> wake_up_q()
> wake_up_process()
>
>
> wake_up_process()
> try_to_wake_up()
> ttwu_queue()
> ttwu_queue_remote()
> <- scheduler_ipi()
> sched_ttwu_pending()
> ttwu_do_activate()
>
> ttwu_do_activate()
> activate_task()
> ttwu_do_wakeup()
> check_preempt_curr()
> resched_curr()
>
> You could frob some tracing into __stop_cpus(), before
> wait_for_completion(), at that point all the CPUs in @cpumask should
> either be running the stop task or have TIF_NEED_RESCHED set.

Thank you, this should be quite helpful.

Thanx, Paul

2019-08-05 15:52:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:

> > Right; so clearly we're not understanding what's happening. That seems
> > like a requirement for actually doing a patch.
>
> Almost but not quite. It is a requirement for a patch *that* *is*
> *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> writing experimental patches, please feel free to take a long walk on
> a short pier.
>
> Understood???

Ah, my bad, I thought you were actually proposing this as an actual
patch. I now see that is my bad, I'd overlooked the RFC part.

2019-08-05 17:49:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
>
> > > Right; so clearly we're not understanding what's happening. That seems
> > > like a requirement for actually doing a patch.
> >
> > Almost but not quite. It is a requirement for a patch *that* *is*
> > *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> > writing experimental patches, please feel free to take a long walk on
> > a short pier.
> >
> > Understood???
>
> Ah, my bad, I thought you were actually proposing this as an actual
> patch. I now see that is my bad, I'd overlooked the RFC part.

No problem!

And of course adding tracing decreases the frequency and duration of
the multi_cpu_stop(). Re-running with shorter-duration triggering. ;-)

Thanx, Paul

2019-08-06 18:11:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> >
> > > > Right; so clearly we're not understanding what's happening. That seems
> > > > like a requirement for actually doing a patch.
> > >
> > > Almost but not quite. It is a requirement for a patch *that* *is*
> > > *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> > > writing experimental patches, please feel free to take a long walk on
> > > a short pier.
> > >
> > > Understood???
> >
> > Ah, my bad, I thought you were actually proposing this as an actual
> > patch. I now see that is my bad, I'd overlooked the RFC part.
>
> No problem!
>
> And of course adding tracing decreases the frequency and duration of
> the multi_cpu_stop(). Re-running with shorter-duration triggering. ;-)

And I did eventually get a good trace. If I am interpreting this trace
correctly, the torture_-135 task didn't get around to attempting to wake
up all of the CPUs. I will try again, but this time with the sched_switch
trace event enabled.

As a side note, enabling ftrace from the command line seems to interact
badly with turning tracing off and on in the kernel, so I eventually
resorted to trace_printk() in the functions of interest. The trace
output is below, followed by the current diagnostic patch. Please note
that I am -not- using the desperation hammer-the-scheduler patches.
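
(The instrumentation itself is nothing fancy -- an entry marker in each
function of interest, shown here for cpu_stop_queue_work() as an
illustrative sketch:

	static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
	{
		trace_printk("cpu_stop_queue_work entered\n");
		...
	}

which is what produces the "entered" lines in the trace below.)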

More as I learn more!

Thanx, Paul

------------------------------------------------------------------------

[ 280.918596] torture_-135 0.... 163481679us : __stop_cpus: __stop_cpus entered
[ 280.918596] torture_-135 0.... 163481680us : queue_stop_cpus_work: queue_stop_cpus_work entered
[ 280.918596] torture_-135 0.... 163481681us : cpu_stop_queue_work: cpu_stop_queue_work entered
[ 280.918596] torture_-135 0.... 163481681us : wake_up_q: wake_up_q entered
[ 280.918596] torture_-135 0.... 163481682us : wake_up_process: wake_up_process entered
[ 280.918596] torture_-135 0.... 163481682us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] torture_-135 0d... 163481682us : try_to_wake_up: ttwu_queue entered
[ 280.918596] torture_-135 0d... 163481682us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] torture_-135 0d... 163481682us : activate_task: activate_task entered
[ 280.918596] torture_-135 0d... 163481683us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] torture_-135 0d... 163481683us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] torture_-135 0d... 163481683us : resched_curr: resched_curr entered
[ 280.918596] torture_-135 0.N.. 163481684us : cpu_stop_queue_work: cpu_stop_queue_work entered
[ 280.918596] torture_-135 0.N.. 163481684us : wake_up_q: wake_up_q entered
[ 280.918596] torture_-135 0.N.. 163481684us : wake_up_process: wake_up_process entered
[ 280.918596] torture_-135 0.N.. 163481685us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] torture_-135 0dN.. 163481685us : try_to_wake_up: ttwu_queue entered
[ 280.918596] torture_-135 0dN.. 163481685us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] torture_-135 0.N.. 163481695us : cpu_stop_queue_work: cpu_stop_queue_work entered
[ 280.918596] <idle>-0 1d... 163481700us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] <idle>-0 1d.h. 163481702us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] <idle>-0 1d.h. 163481702us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] <idle>-0 1d.h. 163481702us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] <idle>-0 1d.h. 163481703us : activate_task: activate_task entered
[ 280.918596] <idle>-0 1d.h. 163481703us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] <idle>-0 1d.h. 163481703us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] <idle>-0 1d.h. 163481707us : resched_curr: resched_curr entered
[ 280.918596] torture_-135 0.N.. 163481713us : wake_up_q: wake_up_q entered
[ 280.918596] torture_-135 0.N.. 163481713us : wake_up_process: wake_up_process entered
[ 280.918596] torture_-135 0.N.. 163481713us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] torture_-135 0dN.. 163481713us : try_to_wake_up: ttwu_queue entered
[ 280.918596] torture_-135 0dN.. 163481714us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] torture_-135 0.N.. 163481722us : cpu_stop_queue_work: cpu_stop_queue_work entered
[ 280.918596] torture_-135 0.N.. 163481723us : wake_up_q: wake_up_q entered
[ 280.918596] torture_-135 0.N.. 163481723us : wake_up_process: wake_up_process entered
[ 280.918596] torture_-135 0.N.. 163481723us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] torture_-135 0dN.. 163481723us : try_to_wake_up: ttwu_queue entered
[ 280.918596] torture_-135 0dN.. 163481724us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] <idle>-0 1.N.. 163481730us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] torture_-135 0.N.. 163481732us : cpu_stop_queue_work: cpu_stop_queue_work entered
[ 280.918596] torture_-135 0.N.. 163481732us : wake_up_q: wake_up_q entered
[ 280.918596] torture_-135 0.N.. 163481732us : wake_up_process: wake_up_process entered
[ 280.918596] torture_-135 0.N.. 163481733us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] torture_-135 0dN.. 163481733us : try_to_wake_up: ttwu_queue entered
[ 280.918596] rcu_tort-130 3d... 163481733us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] torture_-135 0dN.. 163481733us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] rcu_tort-130 3d.h. 163481734us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] rcu_tort-130 3d.h. 163481734us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] rcu_tort-130 3d.h. 163481735us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] rcu_tort-130 3d.h. 163481735us : activate_task: activate_task entered
[ 280.918596] rcu_tort-130 3d.h. 163481735us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] rcu_tort-130 3d.h. 163481735us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] rcu_tort-130 3d.h. 163481735us : resched_curr: resched_curr entered
[ 280.918596] torture_-135 0.N.. 163481742us : cpu_stop_queue_work: cpu_stop_queue_work entered
[ 280.918596] torture_-135 0.N.. 163481742us : wake_up_q: wake_up_q entered
[ 280.918596] torture_-135 0.N.. 163481742us : wake_up_process: wake_up_process entered
[ 280.918596] torture_-135 0.N.. 163481742us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] torture_-135 0dN.. 163481743us : try_to_wake_up: ttwu_queue entered
[ 280.918596] torture_-135 0dN.. 163481743us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] rcu_tort-128 4d... 163481743us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] rcu_tort-124 2d... 163481746us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] rcu_tort-128 4d.h. 163481746us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] rcu_tort-128 4d.h. 163481746us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] rcu_tort-124 2d.h. 163481747us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] rcu_tort-128 4d.h. 163481747us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] rcu_tort-124 2d.h. 163481747us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] rcu_tort-124 2d.h. 163481747us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] rcu_tort-128 4d.h. 163481747us : activate_task: activate_task entered
[ 280.918596] rcu_tort-124 2d.h. 163481748us : activate_task: activate_task entered
[ 280.918596] rcu_tort-128 4d.h. 163481748us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] rcu_tort-124 2d.h. 163481748us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] rcu_tort-128 4d.h. 163481748us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] rcu_tort-124 2d.h. 163481748us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] rcu_tort-128 4d.h. 163481748us : resched_curr: resched_curr entered
[ 280.918596] rcu_tort-124 2d.h. 163481749us : resched_curr: resched_curr entered
[ 280.918596] torture_-135 0.N.. 163481752us : cpu_stop_queue_work: cpu_stop_queue_work entered
[ 280.918596] torture_-135 0.N.. 163481753us : wake_up_q: wake_up_q entered
[ 280.918596] torture_-135 0.N.. 163481753us : wake_up_process: wake_up_process entered
[ 280.918596] torture_-135 0.N.. 163481753us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] torture_-135 0dN.. 163481753us : try_to_wake_up: ttwu_queue entered
[ 280.918596] torture_-135 0dN.. 163481753us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] <idle>-0 5d... 163481772us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] <idle>-0 5d.h. 163481773us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] <idle>-0 5d.h. 163481774us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] <idle>-0 5d.h. 163481774us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] <idle>-0 5d.h. 163481775us : activate_task: activate_task entered
[ 280.918596] <idle>-0 5d.h. 163481776us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] <idle>-0 5d.h. 163481776us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] <idle>-0 5d.h. 163481776us : resched_curr: resched_curr entered
[ 280.918596] <idle>-0 5.N.. 163481808us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] migratio-11 0..s1 163481828us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0..s1 163481829us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 163481829us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.s1 163481829us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] migratio-14 1d..1 163481869us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] migratio-14 1d.h1 163481870us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] migratio-14 1d.h1 163481870us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] migratio-14 1d.h1 163481871us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-14 1d.h1 163481890us : activate_task: activate_task entered
[ 280.918596] migratio-14 1d.h1 163481891us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-14 1d.h1 163481892us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0..s1 163483828us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0..s1 163483828us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 163483828us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.s1 163483829us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-11 0d.s1 163483829us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.s1 163483829us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-11 0d.s1 163483829us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] <idle>-0 7d... 163483829us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] <idle>-0 7d.h. 163486880us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] <idle>-0 7d.h. 163486881us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] <idle>-0 7d.h. 163486885us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] <idle>-0 7d.h. 163486885us : activate_task: activate_task entered
[ 280.918596] <idle>-0 7d.h. 163486885us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] <idle>-0 7d.h. 163486886us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] <idle>-0 7d.h. 163486889us : resched_curr: resched_curr entered
[ 280.918596] <idle>-0 7.N.. 163486937us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] migratio-40 5d..1 278818221us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] migratio-11 0d.h1 278818380us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.h1 278818382us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.h1 278818384us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.h1 278818384us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] rcu_tort-123 3d... 278818385us : wake_up_process: wake_up_process entered
[ 280.918596] rcu_tort-123 3d... 278818386us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] rcu_tort-123 3d... 278818387us : try_to_wake_up: ttwu_queue entered
[ 280.918596] rcu_tort-123 3d... 278818387us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] migratio-33 4d.s1 278818403us : activate_task: activate_task entered
[ 280.918596] migratio-33 4d.s1 278818405us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0d.s1 278818416us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] migratio-11 0d.H1 278818417us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] migratio-11 0d.H1 278818417us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] migratio-11 0d.H1 278818418us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-11 0d.H1 278818421us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.H1 278818423us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-11 0d.H1 278818423us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0..s1 278818425us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0..s1 278818426us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818427us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.s1 278818427us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-11 0d.s1 278818427us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.s1 278818428us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-11 0d.s1 278818428us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] rcu_tort-124 2d... 278818429us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.s1 278818430us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818431us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] rcu_tort-124 2d... 278818431us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0d.s1 278818431us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.s1 278818432us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-11 0d.s1 278818432us : activate_task: activate_task entered
[ 280.918596] rcu_tort-124 2d... 278818432us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.s1 278818432us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-11 0d.s1 278818433us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] rcu_tort-124 2d... 278818434us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0d.s1 278818434us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818434us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818435us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818435us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818436us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818436us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818438us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818438us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818439us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818440us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0..s1 278818441us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0..s1 278818441us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818441us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.s1 278818442us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-11 0d.s1 278818442us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.s1 278818443us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-11 0d.s1 278818443us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0..s1 278818444us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0..s1 278818444us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818445us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.s1 278818445us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-11 0d.s1 278818446us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.s1 278818446us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-11 0d.s1 278818447us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0..s1 278818449us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0..s1 278818449us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] rcu_tort-126 2d... 278818451us : activate_task: activate_task entered
[ 280.918596] rcu_tort-126 2d... 278818452us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] rcu_tort-126 2d... 278818452us : resched_curr: resched_curr entered
[ 280.918596] rcu_tort-126 2dN.. 278818453us : activate_task: activate_task entered
[ 280.918596] rcu_tort-126 2dN.. 278818454us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] rcu_tort-126 2dN.. 278818454us : activate_task: activate_task entered
[ 280.918596] rcu_tort-126 2dN.. 278818455us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0d.s1 278818466us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.s1 278818466us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-11 0d.s1 278818467us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.s1 278818467us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-11 0d.s1 278818468us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0d.s1 278818469us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818469us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0..s1 278818481us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0..s1 278818481us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818482us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-11 0d.s1 278818483us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-11 0d.s1 278818483us : activate_task: activate_task entered
[ 280.918596] migratio-11 0d.s1 278818484us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-11 0d.s1 278818484us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-11 0d.s1 278818485us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818485us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818487us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818487us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818488us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818488us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818489us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818490us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818491us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818491us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818492us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818492us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818493us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818493us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-11 0d.s1 278818494us : wake_up_process: wake_up_process entered
[ 280.918596] migratio-11 0d.s1 278818495us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-62 0d... 278818512us : wake_up_process: wake_up_process entered
[ 280.918596] kworker/-62 0d... 278818512us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-62 0d... 278818513us : try_to_wake_up: ttwu_queue entered
[ 280.918596] kworker/-62 0d... 278818514us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] kworker/-62 0d... 278818514us : activate_task: activate_task entered
[ 280.918596] kworker/-62 0d... 278818515us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] kworker/-62 0d... 278818516us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] kworker/-161 0d... 278818518us : wake_up_process: wake_up_process entered
[ 280.918596] kworker/-161 0d... 278818519us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-161 0d... 278818519us : try_to_wake_up: ttwu_queue entered
[ 280.918596] kworker/-161 0d... 278818519us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] kworker/-161 0d... 278818520us : activate_task: activate_task entered
[ 280.918596] kworker/-161 0d... 278818520us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] kworker/-161 0d... 278818521us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] kworker/-161 0d... 278818522us : wake_up_process: wake_up_process entered
[ 280.918596] kworker/-161 0d... 278818522us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-161 0d... 278818523us : wake_up_process: wake_up_process entered
[ 280.918596] kworker/-161 0d... 278818523us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-161 0d... 278818524us : wake_up_process: wake_up_process entered
[ 280.918596] kworker/-161 0d... 278818524us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-161 0d... 278818526us : wake_up_process: wake_up_process entered
[ 280.918596] kworker/-161 0d... 278818526us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-177 0.... 278818628us : wake_up_q: wake_up_q entered
[ 280.918596] kworker/-177 0d... 278818629us : wake_up_process: wake_up_process entered
[ 280.918596] kworker/-177 0d... 278818629us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-177 0d... 278818630us : try_to_wake_up: ttwu_queue entered
[ 280.918596] kworker/-177 0d... 278818630us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] kworker/-177 0d... 278818631us : activate_task: activate_task entered
[ 280.918596] kworker/-177 0d... 278818632us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] kworker/-177 0d... 278818632us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] rcu_tort-131 2d.s. 278821050us : wake_up_process: wake_up_process entered
[ 280.918596] rcu_tort-131 2d.s. 278821051us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] rcu_tort-131 2d.s. 278821052us : try_to_wake_up: ttwu_queue entered
[ 280.918596] rcu_tort-131 2d.s. 278821052us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] rcu_tort-131 2d.s. 278821052us : activate_task: activate_task entered
[ 280.918596] rcu_tort-131 2d.s. 278821053us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] rcu_tort-131 2d.s. 278821054us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-14 1dNs1 278821054us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] migratio-14 1dNH1 278821055us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] migratio-14 1dNH1 278821055us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] migratio-14 1dNH1 278821056us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] migratio-14 1dNH1 278821056us : activate_task: activate_task entered
[ 280.918596] migratio-14 1dNH1 278821058us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] migratio-14 1dNH1 278821058us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-14 1dNs1 278821070us : activate_task: activate_task entered
[ 280.918596] migratio-14 1dNs1 278821070us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-14 1dNs1 278821071us : activate_task: activate_task entered
[ 280.918596] migratio-14 1dNs1 278821071us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] migratio-14 1dN.1 278821073us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] migratio-14 1dN.1 278821074us : try_to_wake_up: ttwu_queue entered
[ 280.918596] migratio-14 1dN.1 278821075us : try_to_wake_up: ttwu_queue_remote entered
[ 280.918596] kworker/-178 0d... 278821153us : scheduler_ipi: scheduler_ipi entered
[ 280.918596] kworker/-178 0d.h. 278821154us : sched_ttwu_pending: sched_ttwu_pending entered
[ 280.918596] kworker/-178 0d.h. 278821154us : sched_ttwu_pending: sched_ttwu_pending non-NULL llist
[ 280.918596] kworker/-178 0d.h. 278821155us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] kworker/-178 0d.h. 278821155us : activate_task: activate_task entered
[ 280.918596] kworker/-178 0d.h. 278821157us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] kworker/-178 0d.h. 278821157us : check_preempt_curr: check_preempt_curr entered
[ 280.918596] kworker/-178 0d... 278821174us : try_to_wake_up: try_to_wake_up entered
[ 280.918596] kworker/-178 0d... 278821175us : try_to_wake_up: ttwu_queue entered
[ 280.918596] kworker/-178 0d... 278821175us : ttwu_do_activate.isra.108: ttwu_do_activate entered
[ 280.918596] kworker/-178 0d... 278821176us : activate_task: activate_task entered
[ 280.918596] kworker/-178 0d... 278821176us : ttwu_do_wakeup.isra.107: ttwu_do_wakeup entered
[ 280.918596] kworker/-178 0d... 278821177us : check_preempt_curr: check_preempt_curr entered

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ce00b442ced0..1a50ed258ef0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3569,6 +3569,7 @@ void __init rcu_init(void)
rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
WARN_ON(!rcu_par_gp_wq);
srcu_init();
+ tracing_off();
}

#include "tree_stall.h"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b22e55cebe8..6949ae27fae5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -479,6 +479,7 @@ void wake_up_q(struct wake_q_head *head)
{
struct wake_q_node *node = head->first;

+ trace_printk("%s entered\n", __func__);
while (node != WAKE_Q_TAIL) {
struct task_struct *task;

@@ -509,6 +510,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;

+ trace_printk("%s entered\n", __func__);
lockdep_assert_held(&rq->lock);

if (test_tsk_need_resched(curr))
@@ -1197,6 +1199,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
+ trace_printk("%s entered\n", __func__);
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;

@@ -1298,6 +1301,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
const struct sched_class *class;

+ trace_printk("%s entered\n", __func__);
if (p->sched_class == rq->curr->sched_class) {
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
} else {
@@ -2097,6 +2101,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
{
+ trace_printk("%s entered\n", __func__);
check_preempt_curr(rq, p, wake_flags);
p->state = TASK_RUNNING;
trace_sched_wakeup(p);
@@ -2132,6 +2137,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
{
int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;

+ trace_printk("%s entered\n", __func__);
lockdep_assert_held(&rq->lock);

#ifdef CONFIG_SMP
@@ -2178,9 +2184,11 @@ void sched_ttwu_pending(void)
struct task_struct *p, *t;
struct rq_flags rf;

+ trace_printk("%s entered\n", __func__);
if (!llist)
return;

+ trace_printk("%s non-NULL llist\n", __func__);
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);

@@ -2192,6 +2200,7 @@ void sched_ttwu_pending(void)

void scheduler_ipi(void)
{
+ trace_printk("%s entered\n", __func__);
/*
* Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
* TIF_NEED_RESCHED remotely (for the first time) will also send
@@ -2232,6 +2241,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
{
struct rq *rq = cpu_rq(cpu);

+ trace_printk("%s entered\n", __func__);
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
@@ -2277,6 +2287,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq *rq = cpu_rq(cpu);
struct rq_flags rf;

+ trace_printk("%s entered\n", __func__);
#if defined(CONFIG_SMP)
if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
@@ -2399,6 +2410,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
unsigned long flags;
int cpu, success = 0;

+ trace_printk("%s entered\n", __func__);
preempt_disable();
if (p == current) {
/*
@@ -2545,6 +2557,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
int wake_up_process(struct task_struct *p)
{
+ trace_printk("%s entered\n", __func__);
return try_to_wake_up(p, TASK_NORMAL, 0);
}
EXPORT_SYMBOL(wake_up_process);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 5c2b2f90fae1..d3441acabc80 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -80,6 +80,7 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
unsigned long flags;
bool enabled;

+ trace_printk("%s entered\n", __func__);
preempt_disable();
raw_spin_lock_irqsave(&stopper->lock, flags);
enabled = stopper->enabled;
@@ -382,6 +383,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
* preempted by a stopper which might wait for other stoppers
* to enter @fn which can lead to deadlock.
*/
+ trace_printk("%s entered\n", __func__);
preempt_disable();
stop_cpus_in_progress = true;
for_each_cpu(cpu, cpumask) {
@@ -402,11 +404,17 @@ static int __stop_cpus(const struct cpumask *cpumask,
cpu_stop_fn_t fn, void *arg)
{
struct cpu_stop_done done;
+ unsigned long j = jiffies;

+ tracing_on();
+ trace_printk("%s entered\n", __func__);
cpu_stop_init_done(&done, cpumask_weight(cpumask));
if (!queue_stop_cpus_work(cpumask, fn, arg, &done))
return -ENOENT;
wait_for_completion(&done.completion);
+ tracing_off();
+ if (time_after(jiffies, j + HZ * 20))
+ ftrace_dump(DUMP_ALL);
return done.ret;
}

@@ -442,6 +450,7 @@ int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
{
int ret;

+ trace_printk("%s entered\n", __func__);
/* static works are used, process one request at a time */
mutex_lock(&stop_cpus_mutex);
ret = __stop_cpus(cpumask, fn, arg);
@@ -599,6 +608,7 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
.active_cpus = cpus,
};

+ trace_printk("%s: entered\n", __func__);
lockdep_assert_cpus_held();

if (!stop_machine_initialized) {

2019-08-07 00:04:19

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing

On Fri, Aug 02, 2019 at 08:14:49AM -0700, Paul E. McKenney wrote:
> Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
> takes advantage of unrelated grace periods, thus reducing the memory
> footprint in the face of floods of call_rcu() invocations. However,
> the ->cblist field is a more-complex rcu_segcblist structure which must
> be protected via locking. Even though there are only three entities
> which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
> grace-period kthread, and the no-CBs callbacks kthread), the contention
> on this lock is excessive under heavy stress.
>
> This commit therefore greatly reduces contention by provisioning
> an rcu_cblist structure field named ->nocb_bypass within the
> rcu_data structure. Each no-CBs CPU is permitted only a limited
> number of enqueues onto the ->cblist per jiffy, controlled by a new
> nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
> about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
> exceeded, the CPU instead enqueues onto the new ->nocb_bypass.

Looks quite interesting. I am guessing the not-no-CB (regular) enqueues don't
need to use the same technique because both enqueues / callback execution are
happening on the same CPU.
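
If I am reading it right, the per-jiffy limit amounts to roughly the
following (a greatly simplified sketch with a hypothetical helper name, not
the actual rcu_nocb_try_bypass() from the patch; with HZ=250 the default
16 * 1000 / HZ works out to 64 direct enqueues per jiffy, which is about 16
per millisecond):

	/* Sketch only; hypothetical helper, not the patch's code. */
	static bool nocb_should_bypass(struct rcu_data *rdp, unsigned long j)
	{
		if (rdp->nocb_nobypass_last != j) {
			rdp->nocb_nobypass_last = j;	/* New jiffy: reset count. */
			rdp->nocb_nobypass_count = 0;
		}
		/* Over this jiffy's budget?  Then use ->nocb_bypass instead. */
		return ++rdp->nocb_nobypass_count > nocb_nobypass_lim_per_jiffy;
	}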

Still looking through patch but I understood the basic idea. Some nits below:

[snip]
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 2c3e9068671c..e4df86db8137 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -200,18 +200,26 @@ struct rcu_data {
> atomic_t nocb_lock_contended; /* Contention experienced. */
> int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */
> struct timer_list nocb_timer; /* Enforce finite deferral. */
> + unsigned long nocb_gp_adv_time; /* Last call_rcu() CB adv (jiffies). */
> +
> + /* The following fields are used by call_rcu, hence own cacheline. */
> + raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
> + struct rcu_cblist nocb_bypass; /* Lock-contention-bypass CB list. */
> + unsigned long nocb_bypass_first; /* Time (jiffies) of first enqueue. */
> + unsigned long nocb_nobypass_last; /* Last ->cblist enqueue (jiffies). */
> + int nocb_nobypass_count; /* # ->cblist enqueues at ^^^ time. */

Can these and below fields be ifdef'd out if !CONFIG_RCU_NOCB_CPU so as to
keep the size of struct smaller for benefit of systems that don't use NOCB?


>
> /* The following fields are used by GP kthread, hence own cacheline. */
> raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
> - bool nocb_gp_sleep;
> - /* Is the nocb GP thread asleep? */
> + struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
> + bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */

And these too, I think.


> struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
> bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */
> struct task_struct *nocb_cb_kthread;
> struct rcu_data *nocb_next_cb_rdp;
> /* Next rcu_data in wakeup chain. */
>
> - /* The following fields are used by CB kthread, hence new cachline. */
> + /* The following fields are used by CB kthread, hence new cacheline. */
> struct rcu_data *nocb_gp_rdp ____cacheline_internodealigned_in_smp;
> /* GP rdp takes GP-end wakeups. */
> #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
[snip]
> +static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
> +{
> + rcu_lockdep_assert_cblist_protected(rdp);
> + if (!rcu_segcblist_is_offloaded(&rdp->cblist) ||
> + !rcu_nocb_bypass_trylock(rdp))
> + return;
> + WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
> +}
> +
> +/*
> + * See whether it is appropriate to use the ->nocb_bypass list in order
> + * to control contention on ->nocb_lock. A limited number of direct
> + * enqueues are permitted into ->cblist per jiffy. If ->nocb_bypass
> + * is non-empty, further callbacks must be placed into ->nocb_bypass,
> + * otherwise rcu_barrier() breaks. Use rcu_nocb_flush_bypass() to switch
> + * back to direct use of ->cblist. However, ->nocb_bypass should not be
> + * used if ->cblist is empty, because otherwise callbacks can be stranded
> + * on ->nocb_bypass because we cannot count on the current CPU ever again
> + * invoking call_rcu(). The general rule is that if ->nocb_bypass is
> + * non-empty, the corresponding no-CBs grace-period kthread must not be
> + * in an indefinite sleep state.
> + *
> + * Finally, it is not permitted to use the bypass during early boot,
> + * as doing so would confuse the auto-initialization code. Besides
> + * which, there is no point in worrying about lock contention while
> + * there is only one CPU in operation.
> + */
> +static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> + bool *was_alldone, unsigned long flags)
> +{
> + unsigned long c;
> + unsigned long cur_gp_seq;
> + unsigned long j = jiffies;
> + long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +
> + if (!rcu_segcblist_is_offloaded(&rdp->cblist)) {
> + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
> + return false; /* Not offloaded, no bypassing. */
> + }
> + lockdep_assert_irqs_disabled();
> +
> + // Don't use ->nocb_bypass during early boot.

Very minor nit: comment style should be /* */

thanks,

- Joel

[snip]

2019-08-07 00:18:30

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing

On Tue, Aug 06, 2019 at 08:03:13PM -0400, Joel Fernandes wrote:
> On Fri, Aug 02, 2019 at 08:14:49AM -0700, Paul E. McKenney wrote:
> > Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
> > takes advantage of unrelated grace periods, thus reducing the memory
> > footprint in the face of floods of call_rcu() invocations. However,
> > the ->cblist field is a more-complex rcu_segcblist structure which must
> > be protected via locking. Even though there are only three entities
> > which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
> > grace-period kthread, and the no-CBs callbacks kthread), the contention
> > on this lock is excessive under heavy stress.
> >
> > This commit therefore greatly reduces contention by provisioning
> > an rcu_cblist structure field named ->nocb_bypass within the
> > rcu_data structure. Each no-CBs CPU is permitted only a limited
> > number of enqueues onto the ->cblist per jiffy, controlled by a new
> > nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
> > about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
> > exceeded, the CPU instead enqueues onto the new ->nocb_bypass.
>
> Looks quite interesting. I am guessing the not-no-CB (regular) enqueues don't
> need to use the same technique because both enqueues / callback execution are
> happening on same CPU..
>
> Still looking through patch but I understood the basic idea. Some nits below:
>
> [snip]
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 2c3e9068671c..e4df86db8137 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -200,18 +200,26 @@ struct rcu_data {
> > atomic_t nocb_lock_contended; /* Contention experienced. */
> > int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */
> > struct timer_list nocb_timer; /* Enforce finite deferral. */
> > + unsigned long nocb_gp_adv_time; /* Last call_rcu() CB adv (jiffies). */
> > +
> > + /* The following fields are used by call_rcu, hence own cacheline. */
> > + raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
> > + struct rcu_cblist nocb_bypass; /* Lock-contention-bypass CB list. */
> > + unsigned long nocb_bypass_first; /* Time (jiffies) of first enqueue. */
> > + unsigned long nocb_nobypass_last; /* Last ->cblist enqueue (jiffies). */
> > + int nocb_nobypass_count; /* # ->cblist enqueues at ^^^ time. */
>
> Can these and below fields be ifdef'd out if !CONFIG_RCU_NOCB_CPU so as to
> keep the size of struct smaller for benefit of systems that don't use NOCB?
>
>
> >
> > /* The following fields are used by GP kthread, hence own cacheline. */
> > raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
> > - bool nocb_gp_sleep;
> > - /* Is the nocb GP thread asleep? */
> > + struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
> > + bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */
>
> And these too, I think.

Please ignore this comment, I missed that these were already ifdef'd out;
the #ifdef itself simply did not appear in the quoted diff context.

thanks!

2019-08-07 00:37:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing

On Tue, Aug 06, 2019 at 08:03:13PM -0400, Joel Fernandes wrote:
> On Fri, Aug 02, 2019 at 08:14:49AM -0700, Paul E. McKenney wrote:
> > Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
> > takes advantage of unrelated grace periods, thus reducing the memory
> > footprint in the face of floods of call_rcu() invocations. However,
> > the ->cblist field is a more-complex rcu_segcblist structure which must
> > be protected via locking. Even though there are only three entities
> > which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
> > grace-period kthread, and the no-CBs callbacks kthread), the contention
> > on this lock is excessive under heavy stress.
> >
> > This commit therefore greatly reduces contention by provisioning
> > an rcu_cblist structure field named ->nocb_bypass within the
> > rcu_data structure. Each no-CBs CPU is permitted only a limited
> > number of enqueues onto the ->cblist per jiffy, controlled by a new
> > nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
> > about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
> > exceeded, the CPU instead enqueues onto the new ->nocb_bypass.
>
> Looks quite interesting. I am guessing the not-no-CB (regular) enqueues don't
> need to use the same technique because both enqueues / callback execution are
> happening on same CPU..

That is the theory! ;-)

> Still looking through patch but I understood the basic idea. Some nits below:
>
> [snip]
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 2c3e9068671c..e4df86db8137 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -200,18 +200,26 @@ struct rcu_data {
> > atomic_t nocb_lock_contended; /* Contention experienced. */
> > int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */
> > struct timer_list nocb_timer; /* Enforce finite deferral. */
> > + unsigned long nocb_gp_adv_time; /* Last call_rcu() CB adv (jiffies). */
> > +
> > + /* The following fields are used by call_rcu, hence own cacheline. */
> > + raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
> > + struct rcu_cblist nocb_bypass; /* Lock-contention-bypass CB list. */
> > + unsigned long nocb_bypass_first; /* Time (jiffies) of first enqueue. */
> > + unsigned long nocb_nobypass_last; /* Last ->cblist enqueue (jiffies). */
> > + int nocb_nobypass_count; /* # ->cblist enqueues at ^^^ time. */
>
> Can these and below fields be ifdef'd out if !CONFIG_RCU_NOCB_CPU so as to
> keep the size of struct smaller for benefit of systems that don't use NOCB?

Please see below...

> > /* The following fields are used by GP kthread, hence own cacheline. */
> > raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
> > - bool nocb_gp_sleep;
> > - /* Is the nocb GP thread asleep? */
> > + struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
> > + bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */
>
> And these too, I think.
>
>
> > struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
> > bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */
> > struct task_struct *nocb_cb_kthread;
> > struct rcu_data *nocb_next_cb_rdp;
> > /* Next rcu_data in wakeup chain. */
> >
> > - /* The following fields are used by CB kthread, hence new cachline. */
> > + /* The following fields are used by CB kthread, hence new cacheline. */
> > struct rcu_data *nocb_gp_rdp ____cacheline_internodealigned_in_smp;
> > /* GP rdp takes GP-end wakeups. */
> > #endif /* #ifdef CONFIG_RCU_NOCB_CPU */

I believe that they in fact are all under CONFIG_RCU_NOCB_CPU.
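
That is, the relevant region of kernel/rcu/tree.h looks roughly like this
(abridged from the patch context, so CONFIG_RCU_NOCB_CPU=n kernels do not
carry any of the new fields):

	struct rcu_data {
		/* ... fields present in all configurations ... */
	#ifdef CONFIG_RCU_NOCB_CPU
		/* ... existing no-CBs fields ... */
		raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
		struct rcu_cblist nocb_bypass;	  /* Lock-contention-bypass CB list. */
		unsigned long nocb_bypass_first;  /* Time (jiffies) of first enqueue. */
		unsigned long nocb_nobypass_last; /* Last ->cblist enqueue (jiffies). */
		int nocb_nobypass_count;	  /* # ->cblist enqueues at ^^^ time. */
		/* ... remaining no-CBs fields ... */
	#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
	};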

> [snip]
> > +static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
> > +{
> > + rcu_lockdep_assert_cblist_protected(rdp);
> > + if (!rcu_segcblist_is_offloaded(&rdp->cblist) ||
> > + !rcu_nocb_bypass_trylock(rdp))
> > + return;
> > + WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
> > +}
> > +
> > +/*
> > + * See whether it is appropriate to use the ->nocb_bypass list in order
> > + * to control contention on ->nocb_lock. A limited number of direct
> > + * enqueues are permitted into ->cblist per jiffy. If ->nocb_bypass
> > + * is non-empty, further callbacks must be placed into ->nocb_bypass,
> > + * otherwise rcu_barrier() breaks. Use rcu_nocb_flush_bypass() to switch
> > + * back to direct use of ->cblist. However, ->nocb_bypass should not be
> > + * used if ->cblist is empty, because otherwise callbacks can be stranded
> > + * on ->nocb_bypass because we cannot count on the current CPU ever again
> > + * invoking call_rcu(). The general rule is that if ->nocb_bypass is
> > + * non-empty, the corresponding no-CBs grace-period kthread must not be
> > + * in an indefinite sleep state.
> > + *
> > + * Finally, it is not permitted to use the bypass during early boot,
> > + * as doing so would confuse the auto-initialization code. Besides
> > + * which, there is no point in worrying about lock contention while
> > + * there is only one CPU in operation.
> > + */
> > +static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> > + bool *was_alldone, unsigned long flags)
> > +{
> > + unsigned long c;
> > + unsigned long cur_gp_seq;
> > + unsigned long j = jiffies;
> > + long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > +
> > + if (!rcu_segcblist_is_offloaded(&rdp->cblist)) {
> > + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
> > + return false; /* Not offloaded, no bypassing. */
> > + }
> > + lockdep_assert_irqs_disabled();
> > +
> > + // Don't use ->nocb_bypass during early boot.
>
> Very minor nit: comment style should be /* */

I thought that Linus said that "//" was now OK. Am I confused?

Thanx, Paul

2019-08-07 00:41:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing

On Tue, 6 Aug 2019 17:35:01 -0700
"Paul E. McKenney" <[email protected]> wrote:

> > > + // Don't use ->nocb_bypass during early boot.
> >
> > Very minor nit: comment style should be /* */
>
> I thought that Linus said that "//" was now OK. Am I confused?

Have a link?

-- Steve

2019-08-07 01:18:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing

On Tue, Aug 06, 2019 at 08:40:55PM -0400, Steven Rostedt wrote:
> On Tue, 6 Aug 2019 17:35:01 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > > + // Don't use ->nocb_bypass during early boot.
> > >
> > > Very minor nit: comment style should be /* */
> >
> > I thought that Linus said that "//" was now OK. Am I confused?
>
> Have a link?

https://lkml.org/lkml/2016/7/8/625

Thanx, Paul

2019-08-07 01:25:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing

On Tue, 6 Aug 2019 18:17:07 -0700
"Paul E. McKenney" <[email protected]> wrote:

> On Tue, Aug 06, 2019 at 08:40:55PM -0400, Steven Rostedt wrote:
> > On Tue, 6 Aug 2019 17:35:01 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > > > + // Don't use ->nocb_bypass during early boot.
> > > >
> > > > Very minor nit: comment style should be /* */
> > >
> > > I thought that Linus said that "//" was now OK. Am I confused?
> >
> > Have a link?
>
> https://lkml.org/lkml/2016/7/8/625
>


The (c) form is particularly good for things like enum or structure
member comments at the end of code, where you might want to align
things up, but the ending comment marker ends up being visually pretty
distracting (and lining _that_ up is too much make-believe work).


I think it's still for special occasions, and the above example doesn't
look like one of them ;-)

I basically avoid the '//' comment, as it just adds inconsistency.

-- Steve

2019-08-07 03:50:45

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing

On Tue, Aug 06, 2019 at 09:24:15PM -0400, Steven Rostedt wrote:
> On Tue, 6 Aug 2019 18:17:07 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Tue, Aug 06, 2019 at 08:40:55PM -0400, Steven Rostedt wrote:
> > > On Tue, 6 Aug 2019 17:35:01 -0700
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > > > + // Don't use ->nocb_bypass during early boot.
> > > > >
> > > > > Very minor nit: comment style should be /* */
> > > >
> > > > I thought that Linus said that "//" was now OK. Am I confused?
> > >
> > > Have a link?
> >
> > https://lkml.org/lkml/2016/7/8/625
>
> The (c) form is particularly good for things like enum or structure
> member comments at the end of code, where you might want to align
> things up, but the ending comment marker ends up being visually pretty
> distracting (and lining _that_ up is too much make-believe work).
>
> I think it's still for special occasions, and the above example doesn't
> look like one of them ;-)

It does say "particularly good for", not "only good for". ;-)

> I basically avoid the '//' comment, as it just adds inconsistency.

It saves me two whacks on the shift key and three whacks on other
keys. ;-)

Thanx, Paul

2019-08-07 21:44:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Tue, Aug 06, 2019 at 11:08:24AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> > >
> > > > > Right; so clearly we're not understanding what's happening. That seems
> > > > > like a requirement for actually doing a patch.
> > > >
> > > > Almost but not quite. It is a requirement for a patch *that* *is*
> > > > *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> > > > writing experimental patches, please feel free to take a long walk on
> > > > a short pier.
> > > >
> > > > Understood???
> > >
> > > Ah, my bad, I thought you were actually proposing this as an actual
> > > patch. I now see that is my bad, I'd overlooked the RFC part.
> >
> > No problem!
> >
> > And of course adding tracing decreases the frequency and duration of
> > the multi_cpu_stop(). Re-running with shorter-duration triggering. ;-)
>
> And I did eventually get a good trace. If I am interpreting this trace
> correctly, the torture_-135 task didn't get around to attempting to wake
> up all of the CPUs. I will try again, but this time with the sched_switch
> trace event enabled.
>
> As a side note, enabling ftrace from the command line seems to interact
> badly with turning tracing off and on in the kernel, so I eventually
> resorted to trace_printk() in the functions of interest. The trace
> output is below, followed by the current diagnostic patch. Please note
> that I am -not- using the desperation hammer-the-scheduler patches.
>
> More as I learn more!

And of course I forgot to dump out the online CPUs, so I really had no
idea whether or not all the CPUs were accounted for. I added tracing
to dump out the online CPUs at the beginning of __stop_cpus() and then
reworked it a few times to get the problem to happen in reasonable time.
Please see below for the resulting annotated trace.
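
The dump itself amounts to a one-line trace_printk() at the top of
__stop_cpus(); the exact line is not in the diff posted earlier, so the
following is a reconstruction from the trace output:

	/* Reconstruction; the posted diff predates this line. */
	trace_printk("CPUs %*pbl online\n",
		     cpumask_pr_args(cpu_online_mask));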

I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
but all the migration threads are running within about 2 milliseconds.
It is then almost two minutes(!) until the next trace message.

Looks like time to (very carefully!) instrument multi_cpu_stop().

Of course, if you have any ideas, please do not keep them a secret!

Thanx, Paul

------------------------------------------------------------------------

This trace is taken after an RCU CPU stall warning following the start
of CPU 4 going offline:

[ 2579.977765] Unregister pv shared memory for cpu 4

Here is the trace, trimming out the part from earlier CPU-hotplug operations:

[ 2813.040289] torture_-135 0.... 2578022804us : __stop_cpus: CPUs 0-7 online
All eight CPUs are online.
[ 2813.040289] torture_-135 0.... 2578022805us : queue_stop_cpus_work: entered
[ 2813.040289] torture_-135 0.... 2578022805us : cpu_stop_queue_work: entered for CPU 0
[ 2813.040289] torture_-135 0.... 2578022806us : wake_up_q: entered
[ 2813.040289] torture_-135 0.... 2578022806us : wake_up_process: entered
[ 2813.040289] torture_-135 0.... 2578022806us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0d... 2578022806us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0d... 2578022807us : ttwu_do_activate.isra.108: entered
[ 2813.040289] torture_-135 0d... 2578022807us : activate_task: entered
[ 2813.040289] torture_-135 0d... 2578022807us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] torture_-135 0d... 2578022807us : check_preempt_curr: entered
[ 2813.040289] torture_-135 0d... 2578022807us : resched_curr: entered
[ 2813.040289] torture_-135 0.N.. 2578022808us : cpu_stop_queue_work: entered for CPU 1
[ 2813.040289] torture_-135 0.N.. 2578022809us : wake_up_q: entered
[ 2813.040289] torture_-135 0.N.. 2578022809us : wake_up_process: entered
[ 2813.040289] torture_-135 0.N.. 2578022809us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022809us : try_to_wake_up: entered
Note: trace_printk() is confused by inlining.
[ 2813.040289] torture_-135 0dN.. 2578022810us : try_to_wake_up: entered, CPU 1
So the above is really ttwu_queue_remote().
We are running on CPU 0, so presumably don't need to IPI it.
We have IPIed CPU 1.
[ 2813.040289] torture_-135 0.N.. 2578022819us : cpu_stop_queue_work: entered for CPU 2
[ 2813.040289] torture_-135 0.N.. 2578022820us : wake_up_q: entered
[ 2813.040289] torture_-135 0.N.. 2578022820us : wake_up_process: entered
[ 2813.040289] torture_-135 0.N.. 2578022820us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022820us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022821us : try_to_wake_up: entered, CPU 2
We have IPIed CPUs 1-2.
[ 2813.040289] rcu_tort-129 1d... 2578022821us : scheduler_ipi: entered
CPU 1 got its IPI, so CPUs 1-2 IPIed and CPU 1 received IPI.
[ 2813.040289] rcu_tort-129 1d.h. 2578022821us : sched_ttwu_pending: entered
[ 2813.040289] rcu_tort-129 1d.h. 2578022821us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] rcu_tort-129 1d.h. 2578022821us : ttwu_do_activate.isra.108: entered
[ 2813.040289] rcu_tort-129 1d.h. 2578022821us : activate_task: entered
[ 2813.040289] rcu_tort-129 1d.h. 2578022821us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] rcu_tort-129 1d.h. 2578022821us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-129 1d.h. 2578022821us : resched_curr: entered
[ 2813.040289] rcu_tort-129 1d... 2578022821us : sched_switch: prev_comm=rcu_torture_rea prev_pid=129 prev_prio=139 prev_state=R+ ==> next_comm=migration/1 next_pid=14 next_prio=0
CPU 1 has switched to its migration kthread.
[ 2813.040289] torture_-135 0.N.. 2578022830us : cpu_stop_queue_work: entered for CPU 3
[ 2813.040289] torture_-135 0.N.. 2578022831us : wake_up_q: entered
[ 2813.040289] torture_-135 0.N.. 2578022831us : wake_up_process: entered
[ 2813.040289] torture_-135 0.N.. 2578022831us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022831us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022832us : try_to_wake_up: entered, CPU 3
We have IPIed CPUs 1-3.
[ 2813.040289] rcu_tort-126 2d... 2578022832us : scheduler_ipi: entered
CPU 2 got its IPI, so CPUs 1-3 IPIed and CPUs 1-2 received IPI.
[ 2813.040289] rcu_tort-126 2d.h. 2578022832us : sched_ttwu_pending: entered
[ 2813.040289] rcu_tort-126 2d.h. 2578022832us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] rcu_tort-126 2d.h. 2578022832us : ttwu_do_activate.isra.108: entered
[ 2813.040289] rcu_tort-126 2d.h. 2578022832us : activate_task: entered
[ 2813.040289] rcu_tort-126 2d.h. 2578022832us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] rcu_tort-126 2d.h. 2578022832us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-126 2d.h. 2578022832us : resched_curr: entered
[ 2813.040289] torture_-135 0.N.. 2578022841us : cpu_stop_queue_work: entered for CPU 4
[ 2813.040289] torture_-135 0.N.. 2578022841us : wake_up_q: entered
[ 2813.040289] torture_-135 0.N.. 2578022841us : wake_up_process: entered
[ 2813.040289] torture_-135 0.N.. 2578022841us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022842us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022842us : try_to_wake_up: entered, CPU 4
We have IPIed CPUs 1-4.
[ 2813.040289] rcu_tort-128 3d... 2578022842us : scheduler_ipi: entered
CPU 3 got its IPI, so CPUs 1-4 IPIed and CPUs 1-3 received IPI.
[ 2813.040289] rcu_tort-128 3d.h. 2578022842us : sched_ttwu_pending: entered
[ 2813.040289] rcu_tort-128 3d.h. 2578022842us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] rcu_tort-128 3d.h. 2578022842us : ttwu_do_activate.isra.108: entered
[ 2813.040289] rcu_tort-128 3d.h. 2578022842us : activate_task: entered
[ 2813.040289] rcu_tort-128 3d.h. 2578022842us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] rcu_tort-128 3d.h. 2578022842us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-128 3d.h. 2578022842us : resched_curr: entered
[ 2813.040289] torture_-135 0.N.. 2578022853us : cpu_stop_queue_work: entered for CPU 5
[ 2813.040289] torture_-135 0.N.. 2578022853us : wake_up_q: entered
[ 2813.040289] torture_-135 0.N.. 2578022853us : wake_up_process: entered
[ 2813.040289] torture_-135 0.N.. 2578022854us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022854us : try_to_wake_up: entered
[ 2813.040289] torture_-135 0dN.. 2578022856us : try_to_wake_up: entered, CPU 5
We have IPIed CPUs 1-5.
[ 2813.040289] <idle>-0 4d... 2578022863us : scheduler_ipi: entered
CPU 4 got its IPI, so CPUs 1-5 IPIed and CPUs 1-4 received IPI.
[ 2813.040289] <idle>-0 4d.h. 2578022865us : sched_ttwu_pending: entered
[ 2813.040289] <idle>-0 4d.h. 2578022865us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] <idle>-0 4d.h. 2578022866us : ttwu_do_activate.isra.108: entered
[ 2813.040289] <idle>-0 4d.h. 2578022866us : activate_task: entered
[ 2813.040289] <idle>-0 4d.h. 2578022867us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] rcu_tort-126 2d... 2578022867us : sched_switch: prev_comm=rcu_torture_rea prev_pid=126 prev_prio=139 prev_state=R+ ==> next_comm=migration/2 next_pid=21 next_prio=0
CPUs 1-2 have switched to their migration kthreads.
[ 2813.040289] <idle>-0 4d.h. 2578022868us : check_preempt_curr: entered
[ 2813.040289] <idle>-0 4d.h. 2578022868us : resched_curr: entered
[ 2813.040289] torture_-135 0.N.. 2578022872us : cpu_stop_queue_work: entered for CPU 6
[ 2813.040289] torture_-135 0.N.. 2578022873us : wake_up_q: entered
[ 2813.040289] torture_-135 0.N.. 2578022873us : wake_up_process: entered
[ 2813.040289] torture_-135 0.N.. 2578022874us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-125 5d... 2578022874us : scheduler_ipi: entered
CPU 5 got its IPI, so CPUs 1-5 IPIed and CPUs 1-5 received IPI.
[ 2813.040289] rcu_tort-125 5d.h. 2578022874us : sched_ttwu_pending: entered
[ 2813.040289] rcu_tort-125 5d.h. 2578022875us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] rcu_tort-125 5d.h. 2578022875us : ttwu_do_activate.isra.108: entered
[ 2813.040289] rcu_tort-125 5d.h. 2578022875us : activate_task: entered
[ 2813.040289] rcu_tort-125 5d.h. 2578022877us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] rcu_tort-125 5d.h. 2578022877us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-125 5d.h. 2578022877us : resched_curr: entered
[ 2813.040289] rcu_tort-128 3d... 2578022877us : sched_switch: prev_comm=rcu_torture_rea prev_pid=128 prev_prio=139 prev_state=R+ ==> next_comm=migration/3 next_pid=27 next_prio=0
CPUs 1-3 have switched to their migration kthreads.
[ 2813.040289] <idle>-0 4.N.. 2578022912us : sched_ttwu_pending: entered
[ 2813.040289] torture_-135 0dN.. 2578022914us : try_to_wake_up: entered
[ 2813.040289] <idle>-0 4d... 2578022914us : sched_switch: prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=migration/4 next_pid=33 next_prio=0
CPUs 1-4 have switched to their migration kthreads.
[ 2813.040289] torture_-135 0dN.. 2578022915us : try_to_wake_up: entered, CPU 6
We have IPIed CPUs 1-6.
[ 2813.040289] rcu_tort-125 5d... 2578022926us : sched_switch: prev_comm=rcu_torture_rea prev_pid=125 prev_prio=139 prev_state=R+ ==> next_comm=migration/5 next_pid=40 next_prio=0
CPUs 1-5 have switched to their migration kthreads.
[ 2813.040289] rcu_tort-136 6d... 2578022926us : scheduler_ipi: entered
CPU 6 got its IPI, so CPUs 1-6 IPIed and CPUs 1-6 received IPI.
[ 2813.040289] torture_-135 0.N.. 2578022929us : cpu_stop_queue_work: entered for CPU 7
[ 2813.040289] torture_-135 0.N.. 2578022930us : wake_up_q: entered
[ 2813.040289] torture_-135 0.N.. 2578022930us : wake_up_process: entered
[ 2813.040289] torture_-135 0.N.. 2578022930us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-136 6d.h. 2578022930us : sched_ttwu_pending: entered
[ 2813.040289] torture_-135 0dN.. 2578022930us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-136 6d.h. 2578022930us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] rcu_tort-136 6d.h. 2578022930us : ttwu_do_activate.isra.108: entered
[ 2813.040289] torture_-135 0dN.. 2578022931us : try_to_wake_up: entered, CPU 7
We have IPIed CPUs 1-7.
[ 2813.040289] rcu_tort-136 6d.h. 2578022931us : activate_task: entered
[ 2813.040289] rcu_tort-136 6d.h. 2578022931us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] rcu_tort-136 6d.h. 2578022931us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-136 6d.h. 2578022931us : resched_curr: entered
[ 2813.040289] torture_-135 0d... 2578022951us : sched_switch: prev_comm=torture_onoff prev_pid=135 prev_prio=120 prev_state=D ==> next_comm=migration/0 next_pid=11 next_prio=0
CPUs 0-5 have switched to their migration kthreads.
[ 2813.040289] rcu_tort-136 6d... 2578022951us : sched_switch: prev_comm=rcu_torture_fwd prev_pid=136 prev_prio=139 prev_state=R+ ==> next_comm=migration/6 next_pid=46 next_prio=0
CPUs 0-6 have switched to their migration kthreads.
[ 2813.040289] migratio-46 6d.s1 2578023060us : wake_up_process: entered
[ 2813.040289] migratio-46 6d.s1 2578023060us : try_to_wake_up: entered
[ 2813.040289] migratio-46 6d.s1 2578023060us : try_to_wake_up: entered
[ 2813.040289] migratio-46 6d.s1 2578023061us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-46 6d.s1 2578023061us : activate_task: entered
[ 2813.040289] migratio-46 6d.s1 2578023062us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-46 6d.s1 2578023062us : check_preempt_curr: entered
[ 2813.040289] <idle>-0 7d... 2578023758us : scheduler_ipi: entered
CPU 7 got its IPI, so CPUs 1-7 IPIed and CPUs 1-7 received IPI.
[ 2813.040289] <idle>-0 7d.h. 2578025037us : sched_ttwu_pending: entered
[ 2813.040289] <idle>-0 7d.h. 2578025038us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] <idle>-0 7d.h. 2578025038us : ttwu_do_activate.isra.108: entered
[ 2813.040289] <idle>-0 7d.h. 2578025038us : activate_task: entered
[ 2813.040289] <idle>-0 7d.h. 2578025043us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] <idle>-0 7d.h. 2578025043us : check_preempt_curr: entered
[ 2813.040289] <idle>-0 7d.h. 2578025043us : resched_curr: entered
[ 2813.040289] <idle>-0 7.N.. 2578025107us : sched_ttwu_pending: entered
[ 2813.040289] <idle>-0 7d... 2578025132us : sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=migration/7 next_pid=52 next_prio=0
CPUs 0-7 have all switched to their migration kthreads. So we
should be done, but we clearly are not. Almost two minutes later:
[ 2813.040289] migratio-33 4d..1 2689243033us : sched_ttwu_pending: entered
[ 2813.040289] migratio-11 0d.h1 2689243195us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.h1 2689243196us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.h1 2689243198us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.h1 2689243201us : try_to_wake_up: entered, CPU 5
[ 2813.040289] migratio-40 5d... 2689243204us : sched_switch: prev_comm=migration/5 prev_pid=40 prev_prio=0 prev_state=S ==> next_comm=rcu_torture_rea next_pid=125 next_prio=139
CPU 5 switches from its migration kthread.
[ 2813.040289] migratio-46 6d... 2689243222us : sched_switch: prev_comm=migration/6 prev_pid=46 prev_prio=0 prev_state=S ==> next_comm=kworker/6:1 next_pid=175 next_prio=120
CPU 6 switches from its migration kthread.
[ 2813.040289] migratio-21 2d.s1 2689243222us : activate_task: entered
[ 2813.040289] migratio-21 2d.s1 2689243225us : check_preempt_curr: entered
[ 2813.040289] migratio-27 3d.s1 2689243226us : activate_task: entered
[ 2813.040289] rcu_tort-125 5d... 2689243227us : scheduler_ipi: entered
[ 2813.040289] rcu_tort-125 5d.h. 2689243228us : sched_ttwu_pending: entered
[ 2813.040289] migratio-27 3d.s1 2689243228us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-125 5d.h. 2689243228us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] migratio-27 3d.s1 2689243229us : activate_task: entered
[ 2813.040289] rcu_tort-125 5d.h. 2689243229us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-11 0..s1 2689243229us : wake_up_process: entered
[ 2813.040289] rcu_tort-125 5d.h. 2689243229us : activate_task: entered
[ 2813.040289] migratio-27 3d.s1 2689243229us : check_preempt_curr: entered
[ 2813.040289] migratio-11 0..s1 2689243229us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243230us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243231us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-11 0d.s1 2689243231us : activate_task: entered
[ 2813.040289] migratio-21 2d... 2689243232us : sched_switch: prev_comm=migration/2 prev_pid=21 prev_prio=0 prev_state=S ==> next_comm=rcu_torture_rea next_pid=126 next_prio=139
CPU 2 switches from its migration kthread.
[ 2813.040289] migratio-11 0d.s1 2689243232us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] rcu_tort-125 5d.h. 2689243232us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-11 0d.s1 2689243232us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-125 5d.h. 2689243232us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-125 5d.h. 2689243233us : resched_curr: entered
[ 2813.040289] migratio-11 0..s1 2689243234us : wake_up_process: entered
[ 2813.040289] migratio-11 0..s1 2689243234us : try_to_wake_up: entered
[ 2813.040289] migratio-27 3d... 2689243234us : sched_switch: prev_comm=migration/3 prev_pid=27 prev_prio=0 prev_state=S ==> next_comm=rcu_torture_fak next_pid=122 next_prio=139
CPU 3 switches from its migration kthread.
[ 2813.040289] migratio-11 0d.s1 2689243235us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243235us : ttwu_do_activate.isra.108: entered
[ 2813.040289] kworker/-175 6d... 2689243235us : sched_switch: prev_comm=kworker/6:1 prev_pid=175 prev_prio=120 prev_state=I ==> next_comm=rcu_torture_fwd next_pid=136 next_prio=139
CPU 6 switches yet again.
[ 2813.040289] migratio-11 0d.s1 2689243235us : activate_task: entered
[ 2813.040289] migratio-11 0d.s1 2689243236us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-11 0d.s1 2689243237us : check_preempt_curr: entered
[ 2813.040289] migratio-11 0d.s1 2689243239us : wake_up_process: entered
[ 2813.040289] rcu_tort-126 2d... 2689243240us : sched_switch: prev_comm=rcu_torture_rea prev_pid=126 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
CPU 2 switches yet again.
[ 2813.040289] migratio-11 0d.s1 2689243240us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243240us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243240us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-11 0d.s1 2689243240us : activate_task: entered
[ 2813.040289] migratio-11 0d.s1 2689243241us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-11 0d.s1 2689243241us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-125 5d... 2689243242us : sched_switch: prev_comm=rcu_torture_rea prev_pid=125 prev_prio=139 prev_state=S ==> next_comm=init next_pid=1 next_prio=120
CPU 5 switches yet again.
[ 2813.040289] migratio-11 0d.s1 2689243242us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243243us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243243us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243243us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243244us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243244us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-124 2d... 2689243244us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
CPU 2 switches yet again.
[ 2813.040289] migratio-11 0d.s1 2689243245us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243245us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243245us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243246us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243246us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243246us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0..s1 2689243247us : wake_up_process: entered
[ 2813.040289] migratio-11 0..s1 2689243247us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243248us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243248us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-11 0d.s1 2689243249us : activate_task: entered
[ 2813.040289] migratio-11 0d.s1 2689243249us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-11 0d.s1 2689243249us : check_preempt_curr: entered
[ 2813.040289] migratio-11 0d.s1 2689243251us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243251us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243251us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243252us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-11 0d.s1 2689243252us : activate_task: entered
[ 2813.040289] migratio-11 0d.s1 2689243252us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-11 0d.s1 2689243253us : check_preempt_curr: entered
[ 2813.040289] migratio-11 0d.s1 2689243263us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243263us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243264us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243264us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0..s1 2689243265us : wake_up_process: entered
[ 2813.040289] migratio-11 0..s1 2689243265us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243266us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243266us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-11 0d.s1 2689243266us : activate_task: entered
[ 2813.040289] migratio-11 0d.s1 2689243268us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-11 0d.s1 2689243268us : check_preempt_curr: entered
[ 2813.040289] migratio-11 0..s1 2689243269us : wake_up_process: entered
[ 2813.040289] migratio-11 0..s1 2689243269us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243270us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243270us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-11 0d.s1 2689243270us : activate_task: entered
[ 2813.040289] migratio-11 0d.s1 2689243271us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-11 0d.s1 2689243272us : check_preempt_curr: entered
[ 2813.040289] migratio-52 7d... 2689243291us : sched_switch: prev_comm=migration/7 prev_pid=52 prev_prio=0 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
CPU 7 switches from its migration kthread.
[ 2813.040289] migratio-11 0d.s1 2689243315us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243315us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243316us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243316us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243316us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243317us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d.s1 2689243317us : wake_up_process: entered
[ 2813.040289] migratio-11 0d.s1 2689243318us : try_to_wake_up: entered
[ 2813.040289] migratio-11 0d... 2689243323us : sched_switch: prev_comm=migration/0 prev_pid=11 prev_prio=0 prev_state=S ==> next_comm=kworker/0:3 next_pid=171 next_prio=120
CPU 0 switches from its migration kthread.
[ 2813.040289] kworker/-171 0d... 2689243327us : try_to_wake_up: entered
[ 2813.040289] kworker/-171 0d... 2689243327us : try_to_wake_up: entered
[ 2813.040289] kworker/-171 0d... 2689243328us : try_to_wake_up: entered, CPU 5
[ 2813.040289] init-1 5d... 2689243343us : scheduler_ipi: entered
[ 2813.040289] init-1 5d.h. 2689243344us : sched_ttwu_pending: entered
[ 2813.040289] init-1 5d.h. 2689243344us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] init-1 5d.h. 2689243351us : ttwu_do_activate.isra.108: entered
[ 2813.040289] init-1 5d.h. 2689243351us : activate_task: entered
[ 2813.040289] init-1 5d.h. 2689243353us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] init-1 5d.h. 2689243353us : check_preempt_curr: entered
[ 2813.040289] kworker/-171 0d... 2689243362us : sched_switch: prev_comm=kworker/0:3 prev_pid=171 prev_prio=120 prev_state=I ==> next_comm=rcu_sched next_pid=10 next_prio=120
CPU 0 switches again.
[ 2813.040289] rcu_sche-10 0d... 2689243365us : try_to_wake_up: entered
[ 2813.040289] rcu_sche-10 0d... 2689243366us : try_to_wake_up: entered
[ 2813.040289] rcu_sche-10 0d... 2689243366us : try_to_wake_up: entered, CPU 5
[ 2813.040289] rcu_sche-10 0d... 2689243379us : sched_switch: prev_comm=rcu_sched prev_pid=10 prev_prio=120 prev_state=I ==> next_comm=rcu_torture_fak next_pid=123 next_prio=139
CPU 0 switches again.
[ 2813.040289] init-1 5d... 2689243380us : scheduler_ipi: entered
[ 2813.040289] init-1 5d.h. 2689243380us : sched_ttwu_pending: entered
[ 2813.040289] init-1 5d.h. 2689243380us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] init-1 5d.h. 2689243380us : ttwu_do_activate.isra.108: entered
[ 2813.040289] init-1 5d.h. 2689243381us : activate_task: entered
[ 2813.040289] init-1 5d.h. 2689243382us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] init-1 5d.h. 2689243382us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-123 0d... 2689243509us : sched_switch: prev_comm=rcu_torture_fak prev_pid=123 prev_prio=139 prev_state=D ==> next_comm=kworker/u16:0 next_pid=181 next_prio=120
CPU 0 switches again.
[ 2813.040289] kworker/-181 0d... 2689243510us : wake_up_process: entered
[ 2813.040289] kworker/-181 0d... 2689243510us : try_to_wake_up: entered
[ 2813.040289] kworker/-181 0d... 2689243510us : try_to_wake_up: entered
[ 2813.040289] kworker/-181 0d... 2689243511us : ttwu_do_activate.isra.108: entered
[ 2813.040289] kworker/-181 0d... 2689243511us : activate_task: entered
[ 2813.040289] kworker/-181 0d... 2689243512us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] kworker/-181 0d... 2689243512us : check_preempt_curr: entered
[ 2813.040289] kworker/-181 0d... 2689243513us : wake_up_process: entered
[ 2813.040289] kworker/-181 0d... 2689243513us : try_to_wake_up: entered
[ 2813.040289] kworker/-181 0d... 2689243513us : wake_up_process: entered
[ 2813.040289] kworker/-181 0d... 2689243513us : try_to_wake_up: entered
[ 2813.040289] kworker/-181 0d... 2689243514us : wake_up_process: entered
[ 2813.040289] kworker/-181 0d... 2689243514us : try_to_wake_up: entered
[ 2813.040289] kworker/-181 0d... 2689243514us : wake_up_process: entered
[ 2813.040289] kworker/-181 0d... 2689243514us : try_to_wake_up: entered
[ 2813.040289] kworker/-181 0d... 2689243515us : wake_up_process: entered
[ 2813.040289] kworker/-181 0d... 2689243515us : try_to_wake_up: entered
[ 2813.040289] kworker/-181 0d... 2689243516us : sched_switch: prev_comm=kworker/u16:0 prev_pid=181 prev_prio=120 prev_state=I ==> next_comm=torture_shuffle next_pid=132 next_prio=120
CPU 0 switches again.
[ 2813.040289] torture_-132 0.... 2689243518us : wake_up_q: entered
[ 2813.040289] torture_-132 0d... 2689243519us : sched_switch: prev_comm=torture_shuffle prev_pid=132 prev_prio=120 prev_state=D ==> next_comm=kworker/0:1 next_pid=183 next_prio=120
CPU 0 switches again.
[ 2813.040289] kworker/-183 0d... 2689243521us : sched_switch: prev_comm=kworker/0:1 prev_pid=183 prev_prio=120 prev_state=I ==> next_comm=rcu_torture_sta next_pid=131 next_prio=120
CPU 0 switches again.
[ 2813.040289] rcu_tort-131 0d... 2689243559us : sched_switch: prev_comm=rcu_torture_sta prev_pid=131 prev_prio=120 prev_state=S ==> next_comm=torture_stutter next_pid=133 next_prio=120
CPU 0 switches again.
[ 2813.040289] torture_-133 0d... 2689243561us : sched_switch: prev_comm=torture_stutter prev_pid=133 prev_prio=120 prev_state=S ==> next_comm=kworker/u16:4 next_pid=161 next_prio=120
[ 2813.040289] kworker/-161 0d... 2689243566us : activate_task: entered
[ 2813.040289] kworker/-161 0d... 2689243567us : check_preempt_curr: entered
[ 2813.040289] kworker/-161 0d... 2689243568us : sched_switch: prev_comm=kworker/u16:4 prev_pid=161 prev_prio=120 prev_state=I ==> next_comm=rcu_torture_rea next_pid=130 next_prio=139
CPU 0 switches again.
[ 2813.040289] rcu_tort-130 0d... 2689243571us : activate_task: entered
[ 2813.040289] rcu_tort-130 0d... 2689243572us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-130 0d... 2689243572us : activate_task: entered
[ 2813.040289] rcu_tort-130 0d... 2689243572us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-130 0d... 2689243573us : sched_switch: prev_comm=rcu_torture_rea prev_pid=130 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=127 next_prio=139
CPU 0 switches again.
[ 2813.040289] rcu_tort-127 0d... 2689243585us : sched_switch: prev_comm=rcu_torture_rea prev_pid=127 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=128 next_prio=139
CPU 0 switches again.
[ 2813.040289] rcu_tort-128 0d... 2689243588us : activate_task: entered
[ 2813.040289] rcu_tort-128 0d... 2689243588us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-128 0d... 2689243588us : sched_switch: prev_comm=rcu_torture_rea prev_pid=128 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=129 next_prio=139
CPU 0 switches again.
[ 2813.040289] rcu_tort-129 0d... 2689243591us : sched_switch: prev_comm=rcu_torture_rea prev_pid=129 prev_prio=139 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
CPU 0 switches again.
[ 2813.040289] rcu_tort-122 3.... 2689243707us : wake_up_process: entered
[ 2813.040289] rcu_tort-122 3.... 2689243707us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-122 3d... 2689243711us : sched_switch: prev_comm=rcu_torture_fak prev_pid=122 prev_prio=139 prev_state=D ==> next_comm=swapper/3 next_pid=0 next_prio=120
CPU 3 switches again.
[ 2813.040289] <idle>-0 0d... 2689244073us : wake_up_process: entered
[ 2813.040289] <idle>-0 0d... 2689244074us : try_to_wake_up: entered
[ 2813.040289] <idle>-0 0d... 2689244074us : try_to_wake_up: entered
[ 2813.040289] <idle>-0 0d... 2689244074us : ttwu_do_activate.isra.108: entered
[ 2813.040289] <idle>-0 0d... 2689244075us : activate_task: entered
[ 2813.040289] <idle>-0 0d... 2689244076us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] <idle>-0 0d... 2689244076us : check_preempt_curr: entered
[ 2813.040289] <idle>-0 0d... 2689244076us : resched_curr: entered
[ 2813.040289] <idle>-0 0.N.. 2689244087us : sched_ttwu_pending: entered
[ 2813.040289] <idle>-0 0d... 2689244088us : sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=9 next_prio=120
CPU 0 switches again.
[ 2813.040289] init-1 5d... 2689244336us : sched_switch: prev_comm=init prev_pid=1 prev_prio=120 prev_state=S ==> next_comm=rcu_torture_fak next_pid=120 next_prio=139
CPU 5 switches again.
[ 2813.040289] rcu_tort-120 5.... 2689244344us : wake_up_q: entered
[ 2813.040289] rcu_tort-120 5.... 2689244344us : wake_up_process: entered
[ 2813.040289] rcu_tort-120 5.... 2689244344us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-120 5d... 2689244345us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-120 5d... 2689244346us : try_to_wake_up: entered, CPU 0
[ 2813.040289] rcu_tort-120 5d... 2689244359us : sched_switch: prev_comm=rcu_torture_fak prev_pid=120 prev_prio=139 prev_state=S ==> next_comm=rcuog/1 next_pid=18 next_prio=120
CPU 5 switches again.
[ 2813.040289] rcuog/1-18 5d... 2689244361us : wake_up_process: entered
[ 2813.040289] rcuog/1-18 5d... 2689244362us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689244362us : scheduler_ipi: entered
[ 2813.040289] rcuog/1-18 5d... 2689244362us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244362us : sched_ttwu_pending: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244363us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] rcuog/1-18 5d... 2689244363us : try_to_wake_up: entered, CPU 0
[ 2813.040289] ksoftirq-9 0d.H. 2689244363us : ttwu_do_activate.isra.108: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244363us : activate_task: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244365us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244365us : check_preempt_curr: entered
[ 2813.040289] rcuog/1-18 5d... 2689244375us : sched_switch: prev_comm=rcuog/1 prev_pid=18 prev_prio=120 prev_state=S ==> next_comm=swapper/5 next_pid=0 next_prio=120
CPU 5 switches again.
[ 2813.040289] ksoftirq-9 0d.s. 2689244386us : scheduler_ipi: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244386us : sched_ttwu_pending: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244387us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] ksoftirq-9 0d.H. 2689244387us : ttwu_do_activate.isra.108: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244387us : activate_task: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244388us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689244388us : check_preempt_curr: entered
[ 2813.040289] ksoftirq-9 0..s. 2689245078us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689245078us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245079us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245079us : ttwu_do_activate.isra.108: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245080us : activate_task: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245080us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245080us : check_preempt_curr: entered
[ 2813.040289] ksoftirq-9 0..s. 2689245081us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689245081us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245081us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245082us : ttwu_do_activate.isra.108: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245082us : activate_task: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245082us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245083us : check_preempt_curr: entered
[ 2813.040289] ksoftirq-9 0..s. 2689245083us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689245083us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245084us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245084us : ttwu_do_activate.isra.108: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245084us : activate_task: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245084us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245085us : check_preempt_curr: entered
[ 2813.040289] ksoftirq-9 0..s. 2689245085us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689245085us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245086us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245086us : ttwu_do_activate.isra.108: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245086us : activate_task: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245087us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689245087us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-136 6.... 2689245087us : wake_up_process: entered
[ 2813.040289] rcu_tort-136 6.... 2689245087us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-136 6d... 2689245087us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-136 6d... 2689245087us : try_to_wake_up: entered, CPU 0
[ 2813.040289] ksoftirq-9 0d.s. 2689245639us : scheduler_ipi: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245639us : sched_ttwu_pending: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245639us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] ksoftirq-9 0d.H. 2689245639us : ttwu_do_activate.isra.108: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245640us : activate_task: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245640us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245640us : check_preempt_curr: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245640us : resched_curr: entered
[ 2813.040289] rcu_tort-136 6d... 2689245640us : activate_task: entered
[ 2813.040289] rcu_tort-136 6d... 2689245640us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-136 6d... 2689245640us : activate_task: entered
[ 2813.040289] rcu_tort-136 6d... 2689245640us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-136 6d... 2689245640us : resched_curr: entered
[ 2813.040289] rcu_tort-136 6dN.. 2689245640us : activate_task: entered
[ 2813.040289] rcu_tort-136 6dN.. 2689245640us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-136 6d... 2689245640us : sched_switch: prev_comm=rcu_torture_fwd prev_pid=136 prev_prio=139 prev_state=D ==> next_comm=rcuos/2 next_pid=25 next_prio=120
CPU 6 switches again.
[ 2813.040289] ksoftirq-9 0d... 2689245654us : sched_switch: prev_comm=ksoftirqd/0 prev_pid=9 prev_prio=120 prev_state=R+ ==> next_comm=rcuog/4 next_pid=37 next_prio=120
CPU 0 switches again.
[ 2813.040289] rcuog/4-37 0d... 2689245658us : sched_switch: prev_comm=rcuog/4 prev_pid=37 prev_prio=120 prev_state=S ==> next_comm=ksoftirqd/0 next_pid=9 next_prio=120
CPU 0 switches again.
[ 2813.040289] rcuos/2-25 6d... 2689245658us : try_to_wake_up: entered
[ 2813.040289] rcuos/2-25 6d... 2689245658us : try_to_wake_up: entered
[ 2813.040289] rcuos/2-25 6d... 2689245658us : try_to_wake_up: entered, CPU 2
[ 2813.040289] rcuos/2-25 6d... 2689245658us : sched_switch: prev_comm=rcuos/2 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=rcu_torture_fak next_pid=123 next_prio=139
CPU 6 switches again.
[ 2813.040289] rcu_tort-123 6d... 2689245658us : wake_up_process: entered
[ 2813.040289] rcu_tort-123 6d... 2689245658us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-123 6d... 2689245658us : try_to_wake_up: entered
[ 2813.040289] rcu_tort-123 6d... 2689245658us : try_to_wake_up: entered, CPU 0
[ 2813.040289] rcu_tort-123 6d... 2689245658us : sched_switch: prev_comm=rcu_torture_fak prev_pid=123 prev_prio=139 prev_state=D ==> next_comm=rcu_torture_rea next_pid=129 next_prio=139
CPU 6 switches again.
[ 2813.040289] rcu_tort-129 6d... 2689245658us : activate_task: entered
[ 2813.040289] rcu_tort-129 6d... 2689245658us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-129 6d... 2689245658us : sched_switch: prev_comm=rcu_torture_rea prev_pid=129 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=128 next_prio=139
CPU 6 switches again.
[ 2813.040289] ksoftirq-9 0d.s. 2689245781us : scheduler_ipi: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245781us : sched_ttwu_pending: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245781us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] ksoftirq-9 0d.H. 2689245781us : ttwu_do_activate.isra.108: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245782us : activate_task: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245782us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245782us : check_preempt_curr: entered
[ 2813.040289] ksoftirq-9 0d.H. 2689245783us : resched_curr: entered
[ 2813.040289] rcu_tort-128 6d... 2689245783us : activate_task: entered
[ 2813.040289] rcu_tort-128 6d... 2689245783us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-128 6d... 2689245783us : sched_switch: prev_comm=rcu_torture_rea prev_pid=128 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=127 next_prio=139
CPU 6 switches again.
[ 2813.040289] rcu_tort-127 6d... 2689245783us : activate_task: entered
[ 2813.040289] rcu_tort-127 6d... 2689245783us : check_preempt_curr: entered
[ 2813.040289] rcu_tort-127 6d... 2689245783us : sched_switch: prev_comm=rcu_torture_rea prev_pid=127 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=130 next_prio=139
CPU 6 switches again.
[ 2813.040289] rcu_tort-130 6d... 2689245783us : sched_switch: prev_comm=rcu_torture_rea prev_pid=130 prev_prio=139 prev_state=S ==> next_comm=swapper/6 next_pid=0 next_prio=120
CPU 6 switches again.
[ 2813.040289] ksoftirq-9 0d... 2689245805us : sched_switch: prev_comm=ksoftirqd/0 prev_pid=9 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=183 next_prio=120
CPU 0 switches again.
[ 2813.040289] <idle>-0 2d... 2689245805us : scheduler_ipi: entered
[ 2813.040289] <idle>-0 2d.h. 2689246047us : sched_ttwu_pending: entered
[ 2813.040289] <idle>-0 2d.h. 2689246047us : sched_ttwu_pending: non-NULL llist
[ 2813.040289] <idle>-0 2d.h. 2689246048us : ttwu_do_activate.isra.108: entered
[ 2813.040289] <idle>-0 2d.h. 2689246048us : activate_task: entered
[ 2813.040289] <idle>-0 2d.h. 2689246050us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] <idle>-0 2d.h. 2689246050us : check_preempt_curr: entered
[ 2813.040289] <idle>-0 2d.h. 2689246050us : resched_curr: entered
[ 2813.040289] <idle>-0 2.N.. 2689246090us : sched_ttwu_pending: entered
[ 2813.040289] <idle>-0 2d... 2689246091us : sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_torture_wri next_pid=119 next_prio=120
CPU 2 switches again.
[ 2813.040289] rcu_tort-119 2d... 2689246096us : sched_switch: prev_comm=rcu_torture_wri prev_pid=119 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
CPU 2 switches again.
[ 2813.040289] kworker/-183 0d.h. 2689280510us : resched_curr: entered
[ 2813.040289] kworker/-183 0dNh. 2689285096us : resched_curr: entered
[ 2813.040289] kworker/-183 0dN.. 2689285103us : wake_up_process: entered
[ 2813.040289] kworker/-183 0dN.. 2689285103us : try_to_wake_up: entered
[ 2813.040289] kworker/-183 0dN.. 2689285104us : try_to_wake_up: entered
[ 2813.040289] kworker/-183 0dN.. 2689285105us : ttwu_do_activate.isra.108: entered
[ 2813.040289] kworker/-183 0dN.. 2689285105us : activate_task: entered
[ 2813.040289] kworker/-183 0dN.. 2689285106us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] kworker/-183 0dN.. 2689285106us : check_preempt_curr: entered
[ 2813.040289] kworker/-183 0d... 2689285109us : sched_switch: prev_comm=kworker/0:1 prev_pid=183 prev_prio=120 prev_state=D ==> next_comm=kworker/0:3 next_pid=171 next_prio=120
CPU 0 switches again.
[ 2813.040289] kworker/-171 0d... 2689285121us : try_to_wake_up: entered
[ 2813.040289] kworker/-171 0d... 2689285121us : try_to_wake_up: entered
[ 2813.040289] kworker/-171 0d... 2689285122us : ttwu_do_activate.isra.108: entered
[ 2813.040289] kworker/-171 0d... 2689285122us : activate_task: entered
[ 2813.040289] kworker/-171 0d... 2689285122us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] kworker/-171 0d... 2689285122us : check_preempt_curr: entered
[ 2813.040289] kworker/-171 0d... 2689285123us : sched_switch: prev_comm=kworker/0:3 prev_pid=171 prev_prio=120 prev_state=I ==> next_comm=ksoftirqd/0 next_pid=9 next_prio=120
CPU 0 switches again.
[ 2813.040289] ksoftirq-9 0..s. 2689285125us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689285126us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285126us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285127us : try_to_wake_up: entered, CPU 5
[ 2813.040289] ksoftirq-9 0..s. 2689285138us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689285138us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285138us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285138us : try_to_wake_up: entered, CPU 6
[ 2813.040289] migratio-14 1dN.1 2689285146us : try_to_wake_up: entered
[ 2813.040289] migratio-14 1dN.1 2689285148us : try_to_wake_up: entered
[ 2813.040289] migratio-14 1dN.1 2689285148us : ttwu_do_activate.isra.108: entered
[ 2813.040289] migratio-14 1dN.1 2689285148us : activate_task: entered
[ 2813.040289] migratio-14 1dN.1 2689285149us : ttwu_do_wakeup.isra.107: entered
[ 2813.040289] migratio-14 1dN.1 2689285149us : check_preempt_curr: entered
[ 2813.040289] ksoftirq-9 0..s. 2689285150us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689285150us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285150us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285151us : try_to_wake_up: entered, CPU 6
[ 2813.040289] ksoftirq-9 0..s. 2689285151us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689285151us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285152us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285152us : try_to_wake_up: entered, CPU 6
[ 2813.040289] ksoftirq-9 0..s. 2689285152us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689285152us : try_to_wake_up: entered
[ 2813.040289] migratio-14 1d... 2689285153us : sched_switch: prev_comm=migration/1 prev_pid=14 prev_prio=0 prev_state=S ==> next_comm=torture_onoff next_pid=135 next_prio=120
CPU 1 switches from its migration kthread.
[ 2813.040289] ksoftirq-9 0d.s. 2689285153us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285153us : try_to_wake_up: entered, CPU 6
[ 2813.040289] ksoftirq-9 0..s. 2689285153us : wake_up_process: entered
[ 2813.040289] ksoftirq-9 0..s. 2689285153us : try_to_wake_up: entered
[ 2813.040289] ksoftirq-9 0d.s. 2689285154us : try_to_wake_up: entered
[ 2813.040289] ---------------------------------

And finally, the trace ends and CPU 4 is announced as being offline:

[ 2813.040289] smpboot: CPU 4 is now offline

2019-08-08 20:38:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Wed, Aug 07, 2019 at 02:41:31PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 06, 2019 at 11:08:24AM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > > > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> > > >
> > > > > > Right; so clearly we're not understanding what's happening. That seems
> > > > > > like a requirement for actually doing a patch.
> > > > >
> > > > > Almost but not quite. It is a requirement for a patch *that* *is*
> > > > > *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> > > > > writing experimental patches, please feel free to take a long walk on
> > > > > a short pier.
> > > > >
> > > > > Understood???
> > > >
> > > > Ah, my bad, I thought you were actually proposing this as an actual
> > > > patch. I now see that is my bad, I'd overlooked the RFC part.
> > >
> > > No problem!
> > >
> > > And of course adding tracing decreases the frequency and duration of
> > > the multi_cpu_stop(). Re-running with shorter-duration triggering. ;-)
> >
> > And I did eventually get a good trace. If I am interpreting this trace
> > correctly, the torture_-135 task didn't get around to attempting to wake
> > up all of the CPUs. I will try again, but this time with the sched_switch
> > trace event enabled.
> >
> > As a side note, enabling ftrace from the command line seems to interact
> > badly with turning tracing off and on in the kernel, so I eventually
> > resorted to trace_printk() in the functions of interest. The trace
> > output is below, followed by the current diagnostic patch. Please note
> > that I am -not- using the desperation hammer-the-scheduler patches.
> >
> > More as I learn more!
>
> And of course I forgot to dump out the online CPUs, so I really had no
> idea whether or not all the CPUs were accounted for. I added tracing
> to dump out the online CPUs at the beginning of __stop_cpus() and then
> reworked it a few times to get the problem to happen in reasonable time.
> Please see below for the resulting annotated trace.
>
> I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> but all the migration threads are running within about 2 milliseconds.
> It is then almost two minutes(!) until the next trace message.
>
> Looks like time to (very carefully!) instrument multi_cpu_stop().
>
> Of course, if you have any ideas, please do not keep them a secret!
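
For reference, the "<function>: entered" lines in the traces are just
trace_printk() markers of roughly the following shape (a sketch, not the
actual diagnostic diff; the function name in the output comes from
trace_printk() tagging each message with its caller):

/* Hypothetical call site -- each function of interest got one of these. */
static noinline void some_function_of_interest(void)
{
	trace_printk("entered\n");	/* shows up as "some_function_of_interest: entered" */

	/* ... original body of the function ... */
}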

Functionally, multi_cpu_stop() is working fine, according to the trace
below (search for a line beginning with TAB). But somehow CPU 2 took
almost three -minutes- to do one iteration of the loop. The prime suspect
in that loop is cpu_relax() due to the hypervisor having an opportunity
to do something at that point. The commentary below (again, search for
a line beginning with TAB) gives my analysis.

Of course, if I am correct, it should be possible to catch cpu_relax()
in the act. That is the next step, give or take the Heisenbuggy nature
of this beast.
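
Something like the following is the sort of instrumentation I have in mind
for that (just a sketch and untested; the helper name, the 100ms threshold,
and the exact placement around the cpu_relax() in stop_machine_yield() are
all placeholders):

static noinline void timed_cpu_relax(void)
{
	u64 t0 = local_clock(), delta;

	cpu_relax();				/* hypervisor can take the CPU here */
	delta = local_clock() - t0;
	if (delta > 100 * NSEC_PER_MSEC)	/* arbitrary threshold */
		trace_printk("cpu_relax() took %llu ns\n", delta);
}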

Another thing for me to try is to run longer with !NO_HZ_FULL, just in
case the earlier runs got lucky.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

[ 1564.195213] Unregister pv shared memory for cpu 6
[ 1731.578001] rcu: INFO: rcu_sched self-detected stall on CPU
...
[ 1731.632619] torture_-135 0.... 1562233456us : __stop_cpus: CPUs 0-2,6-7 online
[ 1731.632619] torture_-135 0.... 1562233457us : queue_stop_cpus_work: entered
[ 1731.632619] torture_-135 0.... 1562233457us : cpu_stop_queue_work: entered for CPU 0
[ 1731.632619] torture_-135 0.... 1562233458us : wake_up_q: entered
[ 1731.632619] torture_-135 0.... 1562233458us : wake_up_process: entered
[ 1731.632619] torture_-135 0.... 1562233458us : try_to_wake_up: entered
[ 1731.632619] torture_-135 0d... 1562233458us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] torture_-135 0d... 1562233459us : ttwu_do_activate.isra.108: entered
[ 1731.632619] torture_-135 0d... 1562233459us : activate_task: entered
[ 1731.632619] torture_-135 0d... 1562233459us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] torture_-135 0d... 1562233459us : check_preempt_curr: entered
[ 1731.632619] torture_-135 0d... 1562233459us : resched_curr: entered
[ 1731.632619] torture_-135 0.N.. 1562233460us : cpu_stop_queue_work: entered for CPU 1
[ 1731.632619] torture_-135 0.N.. 1562233460us : wake_up_q: entered
[ 1731.632619] torture_-135 0.N.. 1562233460us : wake_up_process: entered
[ 1731.632619] torture_-135 0.N.. 1562233460us : try_to_wake_up: entered
[ 1731.632619] torture_-135 0dN.. 1562233461us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] torture_-135 0dN.. 1562233478us : try_to_wake_up: ttwu_queue_remote entered, CPU 1
[ 1731.632619] torture_-135 0.N.. 1562233488us : cpu_stop_queue_work: entered for CPU 2
[ 1731.632619] torture_-135 0.N.. 1562233488us : wake_up_q: entered
[ 1731.632619] torture_-135 0.N.. 1562233488us : wake_up_process: entered
[ 1731.632619] torture_-135 0.N.. 1562233488us : try_to_wake_up: entered
[ 1731.632619] torture_-135 0dN.. 1562233489us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] torture_-135 0dN.. 1562233489us : try_to_wake_up: ttwu_queue_remote entered, CPU 2
[ 1731.632619] <idle>-0 1d... 1562233493us : scheduler_ipi: entered
[ 1731.632619] <idle>-0 1d.h. 1562233495us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 1d.h. 1562233495us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] <idle>-0 1d.h. 1562233495us : ttwu_do_activate.isra.108: entered
[ 1731.632619] <idle>-0 1d.h. 1562233495us : activate_task: entered
[ 1731.632619] <idle>-0 1d.h. 1562233496us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] <idle>-0 1d.h. 1562233496us : check_preempt_curr: entered
[ 1731.632619] <idle>-0 1d.h. 1562233496us : resched_curr: entered
[ 1731.632619] torture_-135 0.N.. 1562233498us : cpu_stop_queue_work: entered for CPU 6
[ 1731.632619] torture_-135 0.N.. 1562233498us : wake_up_q: entered
[ 1731.632619] torture_-135 0.N.. 1562233499us : wake_up_process: entered
[ 1731.632619] torture_-135 0.N.. 1562233499us : try_to_wake_up: entered
[ 1731.632619] torture_-135 0dN.. 1562233499us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] torture_-135 0dN.. 1562233499us : try_to_wake_up: ttwu_queue_remote entered, CPU 6
[ 1731.632619] torture_-135 0.N.. 1562233509us : cpu_stop_queue_work: entered for CPU 7
[ 1731.632619] torture_-135 0.N.. 1562233509us : wake_up_q: entered
[ 1731.632619] torture_-135 0.N.. 1562233509us : wake_up_process: entered
[ 1731.632619] torture_-135 0.N.. 1562233509us : try_to_wake_up: entered
[ 1731.632619] torture_-135 0dN.. 1562233510us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] torture_-135 0dN.. 1562233510us : try_to_wake_up: ttwu_queue_remote entered, CPU 7
[ 1731.632619] <idle>-0 2d... 1562233515us : scheduler_ipi: entered
[ 1731.632619] <idle>-0 2d.h. 1562233517us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 2d.h. 1562233517us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] <idle>-0 2d.h. 1562233518us : ttwu_do_activate.isra.108: entered
[ 1731.632619] <idle>-0 2d.h. 1562233519us : activate_task: entered
[ 1731.632619] <idle>-0 2d.h. 1562233519us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] <idle>-0 2d.h. 1562233519us : check_preempt_curr: entered
[ 1731.632619] <idle>-0 2d.h. 1562233519us : resched_curr: entered
[ 1731.632619] torture_-135 0d... 1562233520us : sched_switch: prev_comm=torture_onoff prev_pid=135 prev_prio=120 prev_state=D ==> next_comm=migration/0 next_pid=11 next_prio=0
[ 1731.632619] rcu_tort-128 7d... 1562233520us : scheduler_ipi: entered
[ 1731.632619] rcu_tort-128 7d.h. 1562233521us : sched_ttwu_pending: entered
[ 1731.632619] rcu_tort-128 7d.h. 1562233521us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] rcu_tort-128 7d.h. 1562233521us : ttwu_do_activate.isra.108: entered
[ 1731.632619] rcu_tort-128 7d.h. 1562233521us : activate_task: entered
[ 1731.632619] migratio-11 0...1 1562233521us : multi_cpu_stop: curstate = 1, ack = 5
[ 1731.632619] rcu_tort-128 7d.h. 1562233521us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] rcu_tort-128 7d.h. 1562233523us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-128 7d.h. 1562233524us : resched_curr: entered
[ 1731.632619] rcu_tort-128 7d... 1562233532us : sched_switch: prev_comm=rcu_torture_rea prev_pid=128 prev_prio=139 prev_state=R+ ==> next_comm=migration/7 next_pid=52 next_prio=0
[ 1731.632619] migratio-52 7...1 1562233535us : multi_cpu_stop: curstate = 1, ack = 4
[ 1731.632619] <idle>-0 1dNs. 1562233552us : activate_task: entered
[ 1731.632619] <idle>-0 2.N.. 1562233553us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 1dNs. 1562233553us : check_preempt_curr: entered
[ 1731.632619] <idle>-0 2d... 1562233554us : sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=migration/2 next_pid=21 next_prio=0
[ 1731.632619] <idle>-0 1dNs. 1562233554us : resched_curr: entered
[ 1731.632619] <idle>-0 1.N.. 1562233554us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 1d... 1562233555us : sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=migration/1 next_pid=14 next_prio=0
[ 1731.632619] migratio-21 2...1 1562233556us : multi_cpu_stop: curstate = 1, ack = 3
[ 1731.632619] migratio-14 1...1 1562233556us : multi_cpu_stop: curstate = 1, ack = 2
[ 1731.632619] migratio-11 0..s1 1562234528us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1562234528us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1562234529us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1562234529us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1562234529us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1562234529us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1562234530us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0..s1 1562235527us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1562235527us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1562235528us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1562235528us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1562235528us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1562235528us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1562235529us : check_preempt_curr: entered
[ 1731.632619] <idle>-0 6d... 1562235529us : scheduler_ipi: entered
[ 1731.632619] <idle>-0 6d.h. 1562249376us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 6d.h. 1562249377us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] <idle>-0 6d.h. 1562249377us : ttwu_do_activate.isra.108: entered
[ 1731.632619] <idle>-0 6d.h. 1562249377us : activate_task: entered
[ 1731.632619] <idle>-0 6d.h. 1562249378us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] <idle>-0 6d.h. 1562249378us : check_preempt_curr: entered
[ 1731.632619] <idle>-0 6d.h. 1562249378us : resched_curr: entered
[ 1731.632619] <idle>-0 6.N.. 1562249472us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 6d... 1562249480us : sched_switch: prev_comm=swapper/6 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=migration/6 next_pid=46 next_prio=0
[ 1731.632619] migratio-46 6...1 1562249502us : multi_cpu_stop: curstate = 1, ack = 1
[ 1731.632619] migratio-14 1...1 1562249503us : multi_cpu_stop: curstate = 2, ack = 5
[ 1731.632619] migratio-52 7...1 1562249503us : multi_cpu_stop: curstate = 2, ack = 5
[ 1731.632619] migratio-11 0...1 1562249504us : multi_cpu_stop: curstate = 2, ack = 3
[ 1731.632619] migratio-46 6...1 1562249505us : multi_cpu_stop: curstate = 2, ack = 2
Almost three -minutes- delay. CPU 2 was just fine 16ms earlier:
2...1 1562233556us : multi_cpu_stop: curstate = 1, ack = 3
"curstate = 1" above corresponds to MULTI_STOP_PREPARE.
"curstate = 2" below corresponds to MULTI_STOP_DISABLE_IRQ
So what did that CPU need to execute in the meantime?
o ack_state(), which does an atomic_dec_and_test(), an
atomic_set(), an smp_wmb(), and a WRITE_ONCE().
o rcu_momentary_dyntick_idle(), which does a
raw_cpu_write(), an atomic_add_return(), an
integer comparison for a WARN_ON_ONCE(), and a call to
rcu_preempt_deferred_qs(), which in this configuration
is an empty function.
o stop_machine_yield(), which does a cpu_relax(), which might
pass control to the hypervisor.
o Multiple passes through a loop doing READ_ONCE(msdata->state)
(paranoia on my part), a couple of comparisons, an
rcu_momentary_dyntick_idle(), and a stop_machine_yield().
(A rough sketch of this loop follows this list.)
o We don't call the function msdata->fn(msdata->data)
because that doesn't happen until after we get to
MULTI_STOP_RUN, which is after MULTI_STOP_DISABLE_IRQ.
o This would be easy to blame on the hypervisor, but then
why is this behavior restricted to CONFIG_NO_HZ_FULL
guest-OS kernels with lots of nohz_full CPUs? I suppose
maybe the guest's scheduling-clock interrupt attempts
might be having an effect, but...
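
A rough sketch of the loop in question, reconstructed from the
description above rather than from kernel/stop_machine.c itself, so the
stand-in structure and the details are approximate:

/* The curstate values in the trace map onto these states. */
enum multi_stop_state {
	MULTI_STOP_NONE,		/* 0 */
	MULTI_STOP_PREPARE,		/* 1 */
	MULTI_STOP_DISABLE_IRQ,		/* 2 */
	MULTI_STOP_RUN,			/* 3: msdata->fn(msdata->data) runs here */
	MULTI_STOP_EXIT,		/* 4 */
};

/* Stand-in for the struct multi_stop_data fields used below. */
struct msdata_sketch {
	int (*fn)(void *);
	void *data;
	unsigned int num_threads;
	enum multi_stop_state state;
	atomic_t thread_ack;
};

static void multi_cpu_stop_sketch(struct msdata_sketch *msdata, bool is_active)
{
	enum multi_stop_state curstate = MULTI_STOP_NONE;

	do {
		cpu_relax();	/* stop_machine_yield(): the hypervisor window */
		if (READ_ONCE(msdata->state) != curstate) {
			curstate = READ_ONCE(msdata->state);
			switch (curstate) {
			case MULTI_STOP_DISABLE_IRQ:
				local_irq_disable();
				break;
			case MULTI_STOP_RUN:
				if (is_active)	/* only the "active" CPU runs fn() */
					msdata->fn(msdata->data);
				break;
			default:
				break;
			}
			/* ack_state(), per the list above. */
			if (atomic_dec_and_test(&msdata->thread_ack)) {
				atomic_set(&msdata->thread_ack, msdata->num_threads);
				smp_wmb();
				WRITE_ONCE(msdata->state, curstate + 1);
			}
		} else if (curstate > MULTI_STOP_PREPARE) {
			rcu_momentary_dyntick_idle();
		}
	} while (curstate != MULTI_STOP_EXIT);
}
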
[ 1731.632619] migratio-21 2...1 1729613677us : multi_cpu_stop: curstate = 2, ack = 1
[ 1731.632619] migratio-21 2d..1 1729613680us : multi_cpu_stop: curstate = 3, ack = 5
[ 1731.632619] migratio-46 6d..1 1729613680us : multi_cpu_stop: curstate = 3, ack = 5
[ 1731.632619] migratio-14 1d..1 1729613680us : multi_cpu_stop: curstate = 3, ack = 5
[ 1731.632619] migratio-52 7d..1 1729613680us : multi_cpu_stop: curstate = 3, ack = 4
[ 1731.632619] migratio-11 0d..1 1729613680us : multi_cpu_stop: curstate = 3, ack = 4
[ 1731.632619] migratio-46 6d..1 1729613680us : sched_ttwu_pending: entered
[ 1731.632619] migratio-14 1d..1 1729613680us : multi_cpu_stop: curstate = 4, ack = 5
[ 1731.632619] migratio-46 6d..1 1729613680us : multi_cpu_stop: curstate = 4, ack = 5
[ 1731.632619] migratio-52 7d..1 1729613680us : multi_cpu_stop: curstate = 4, ack = 5
[ 1731.632619] migratio-21 2d..1 1729614637us : multi_cpu_stop: curstate = 4, ack = 5
[ 1731.632619] migratio-14 1d... 1729614817us : sched_switch: prev_comm=migration/1 prev_pid=14 prev_prio=0 prev_state=S ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
[ 1731.632619] migratio-52 7d... 1729614820us : sched_switch: prev_comm=migration/7 prev_pid=52 prev_prio=0 prev_state=S ==> next_comm=rcu_torture_rea next_pid=129 next_prio=139
[ 1731.632619] rcu_tort-129 7d... 1729614828us : sched_switch: prev_comm=rcu_torture_rea prev_pid=129 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=128 next_prio=139
[ 1731.632619] rcu_tort-128 7d... 1729614833us : sched_switch: prev_comm=rcu_torture_rea prev_pid=128 prev_prio=139 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
[ 1731.632619] rcu_tort-124 1d... 1729614834us : activate_task: entered
[ 1731.632619] rcu_tort-124 1d... 1729614836us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729614836us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729614836us : activate_task: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729614837us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729614837us : activate_task: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729614837us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729614837us : activate_task: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729614838us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729614839us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_wri next_pid=119 next_prio=120
[ 1731.632619] migratio-11 0d..1 1729614839us : multi_cpu_stop: curstate = 4, ack = 1
[ 1731.632619] migratio-11 0d.h1 1729614974us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.h1 1729614974us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.h1 1729614976us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.h1 1729614977us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.h1 1729614977us : activate_task: entered
[ 1731.632619] migratio-11 0d.h1 1729614978us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.h1 1729614979us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0..s1 1729614988us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1729614989us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729614990us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729614991us : try_to_wake_up: ttwu_queue_remote entered, CPU 7
[ 1731.632619] migratio-11 0..s1 1729615008us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1729615008us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615009us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615010us : try_to_wake_up: ttwu_queue_remote entered, CPU 7
[ 1731.632619] migratio-11 0..s1 1729615010us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1729615010us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615011us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615011us : try_to_wake_up: ttwu_queue_remote entered, CPU 1
[ 1731.632619] rcu_tort-119 1d... 1729615025us : scheduler_ipi: entered
[ 1731.632619] rcu_tort-119 1d.h. 1729615026us : sched_ttwu_pending: entered
[ 1731.632619] rcu_tort-119 1d.h. 1729615026us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] rcu_tort-119 1d.h. 1729615026us : ttwu_do_activate.isra.108: entered
[ 1731.632619] rcu_tort-119 1d.h. 1729615027us : activate_task: entered
[ 1731.632619] rcu_tort-119 1d.h. 1729615028us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] rcu_tort-119 1d.h. 1729615028us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0d.s1 1729615029us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615029us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615030us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615030us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1729615030us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1729615031us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1729615031us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0d.s1 1729615032us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615032us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0..s1 1729615033us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1729615033us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615034us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615034us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1729615034us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1729615035us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1729615035us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0..s1 1729615036us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1729615036us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615038us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615038us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1729615038us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1729615039us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1729615039us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0d.s1 1729615040us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615040us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615042us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615042us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0..s1 1729615043us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1729615043us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615044us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615044us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1729615044us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1729615045us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1729615045us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0..s1 1729615046us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1729615046us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615046us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615047us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1729615047us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1729615047us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1729615047us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0..s1 1729615048us : wake_up_process: entered
[ 1731.632619] migratio-11 0..s1 1729615049us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615050us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615050us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1729615051us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1729615051us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1729615051us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0d.s1 1729615057us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615057us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615057us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-11 0d.s1 1729615058us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-11 0d.s1 1729615058us : activate_task: entered
[ 1731.632619] migratio-11 0d.s1 1729615058us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-11 0d.s1 1729615059us : check_preempt_curr: entered
[ 1731.632619] migratio-11 0d.s1 1729615060us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615060us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615061us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615061us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615094us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615094us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615095us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615095us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615096us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615096us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d.s1 1729615097us : wake_up_process: entered
[ 1731.632619] migratio-11 0d.s1 1729615097us : try_to_wake_up: entered
[ 1731.632619] migratio-11 0d... 1729615103us : sched_switch: prev_comm=migration/0 prev_pid=11 prev_prio=0 prev_state=S ==> next_comm=rcu_sched next_pid=10 next_prio=120
[ 1731.632619] rcu_sche-10 0d... 1729615108us : sched_switch: prev_comm=rcu_sched prev_pid=10 prev_prio=120 prev_state=I ==> next_comm=init next_pid=1 next_prio=120
[ 1731.632619] rcu_tort-119 1d... 1729615678us : sched_switch: prev_comm=rcu_torture_wri prev_pid=119 prev_prio=120 prev_state=S ==> next_comm=rcu_torture_rea next_pid=125 next_prio=139
[ 1731.632619] rcu_tort-125 1d... 1729615683us : sched_switch: prev_comm=rcu_torture_rea prev_pid=125 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
[ 1731.632619] <idle>-0 7d... 1729615886us : scheduler_ipi: entered
[ 1731.632619] <idle>-0 7d.h. 1729616090us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 7d.h. 1729616090us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] <idle>-0 7d.h. 1729616091us : ttwu_do_activate.isra.108: entered
[ 1731.632619] <idle>-0 7d.h. 1729616091us : activate_task: entered
[ 1731.632619] <idle>-0 7d.h. 1729616092us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] <idle>-0 7d.h. 1729616093us : check_preempt_curr: entered
[ 1731.632619] <idle>-0 7d.h. 1729616093us : resched_curr: entered
[ 1731.632619] <idle>-0 7dNh. 1729616094us : ttwu_do_activate.isra.108: entered
[ 1731.632619] <idle>-0 7dNh. 1729616094us : activate_task: entered
[ 1731.632619] <idle>-0 7dNh. 1729616095us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] <idle>-0 7dNh. 1729616095us : check_preempt_curr: entered
[ 1731.632619] <idle>-0 7dNh. 1729616117us : resched_curr: entered
[ 1731.632619] <idle>-0 7.N.. 1729616153us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 7d... 1729616155us : sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_torture_rea next_pid=129 next_prio=139
[ 1731.632619] rcu_tort-129 7d... 1729616159us : sched_switch: prev_comm=rcu_torture_rea prev_pid=129 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=128 next_prio=139
[ 1731.632619] init-1 0d... 1729616201us : sched_switch: prev_comm=init prev_pid=1 prev_prio=120 prev_state=S ==> next_comm=kworker/u16:5 next_pid=162 next_prio=120
[ 1731.632619] kworker/-162 0d... 1729616207us : wake_up_process: entered
[ 1731.632619] kworker/-162 0d... 1729616207us : try_to_wake_up: entered
[ 1731.632619] kworker/-162 0d... 1729616208us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] kworker/-162 0d... 1729616209us : ttwu_do_activate.isra.108: entered
[ 1731.632619] kworker/-162 0d... 1729616209us : activate_task: entered
[ 1731.632619] kworker/-162 0d... 1729616211us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] kworker/-162 0d... 1729616211us : check_preempt_curr: entered
[ 1731.632619] kworker/-162 0d... 1729616213us : wake_up_process: entered
[ 1731.632619] kworker/-162 0d... 1729616214us : try_to_wake_up: entered
[ 1731.632619] kworker/-162 0d... 1729616215us : wake_up_process: entered
[ 1731.632619] kworker/-162 0d... 1729616215us : try_to_wake_up: entered
[ 1731.632619] rcu_tort-124 1d.H. 1729618001us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729618015us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=127 next_prio=139
[ 1731.632619] rcu_tort-127 1d... 1729618022us : sched_switch: prev_comm=rcu_torture_rea prev_pid=127 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=126 next_prio=139
[ 1731.632619] rcu_tort-126 1d.h. 1729620527us : resched_curr: entered
[ 1731.632619] rcu_tort-126 1d... 1729620533us : sched_switch: prev_comm=rcu_torture_rea prev_pid=126 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
[ 1731.632619] rcu_tort-124 1d.h. 1729623526us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729623532us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=126 next_prio=139
[ 1731.632619] rcu_tort-126 1d.h. 1729626526us : resched_curr: entered
[ 1731.632619] rcu_tort-126 1d... 1729626534us : sched_switch: prev_comm=rcu_torture_rea prev_pid=126 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
[ 1731.632619] rcu_tort-124 1d.h. 1729629526us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729629719us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=126 next_prio=139
[ 1731.632619] rcu_tort-126 1d.h. 1729632524us : resched_curr: entered
[ 1731.632619] rcu_tort-126 1d... 1729632677us : sched_switch: prev_comm=rcu_torture_rea prev_pid=126 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
[ 1731.632619] rcu_tort-124 1d.h. 1729635526us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729635533us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=126 next_prio=139
[ 1731.632619] rcu_tort-126 1d.h. 1729638531us : resched_curr: entered
[ 1731.632619] rcu_tort-126 1d... 1729638541us : sched_switch: prev_comm=rcu_torture_rea prev_pid=126 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
[ 1731.632619] rcu_tort-124 1d.h. 1729641526us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729641534us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=126 next_prio=139
[ 1731.632619] rcu_tort-126 1d.h. 1729644526us : resched_curr: entered
[ 1731.632619] rcu_tort-126 1d... 1729644622us : sched_switch: prev_comm=rcu_torture_rea prev_pid=126 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
[ 1731.632619] rcu_tort-124 1d.h. 1729647528us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729647534us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=R+ ==> next_comm=rcu_torture_rea next_pid=126 next_prio=139
[ 1731.632619] rcu_tort-126 1d... 1729647662us : sched_switch: prev_comm=rcu_torture_rea prev_pid=126 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=124 next_prio=139
[ 1731.632619] kworker/-162 0d.h. 1729647665us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729647674us : activate_task: entered
[ 1731.632619] rcu_tort-124 1d... 1729647675us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729647675us : activate_task: entered
[ 1731.632619] rcu_tort-124 1d... 1729647675us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729647676us : resched_curr: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729647676us : activate_task: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729647676us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729647676us : activate_task: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729647677us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729647677us : activate_task: entered
[ 1731.632619] rcu_tort-124 1dN.. 1729647677us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-124 1d... 1729647679us : sched_switch: prev_comm=rcu_torture_rea prev_pid=124 prev_prio=139 prev_state=S ==> next_comm=torture_stutter next_pid=133 next_prio=120
[ 1731.632619] torture_-133 1d... 1729647682us : sched_switch: prev_comm=torture_stutter prev_pid=133 prev_prio=120 prev_state=S ==> next_comm=rcu_torture_fwd next_pid=136 next_prio=139
[ 1731.632619] kworker/-162 0.Ns. 1729647684us : wake_up_process: entered
[ 1731.632619] kworker/-162 0.Ns. 1729647685us : try_to_wake_up: entered
[ 1731.632619] kworker/-162 0dNs. 1729647688us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] kworker/-162 0dNs. 1729647690us : try_to_wake_up: ttwu_queue_remote entered, CPU 1
[ 1731.632619] rcu_tort-128 7d... 1729647690us : sched_switch: prev_comm=rcu_torture_rea prev_pid=128 prev_prio=139 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
[ 1731.632619] kworker/-162 0.Ns. 1729647707us : wake_up_process: entered
[ 1731.632619] kworker/-162 0.Ns. 1729647707us : try_to_wake_up: entered
[ 1731.632619] rcu_tort-136 1d... 1729647709us : scheduler_ipi: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647709us : sched_ttwu_pending: entered
[ 1731.632619] kworker/-162 0dNs. 1729647709us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647709us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] rcu_tort-136 1d.h. 1729647710us : ttwu_do_activate.isra.108: entered
[ 1731.632619] kworker/-162 0dNs. 1729647710us : try_to_wake_up: ttwu_queue_remote entered, CPU 1
[ 1731.632619] rcu_tort-136 1d.h. 1729647710us : activate_task: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647711us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647711us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-136 1.... 1729647714us : wake_up_process: entered
[ 1731.632619] rcu_tort-136 1.... 1729647714us : try_to_wake_up: entered
[ 1731.632619] rcu_tort-136 1d... 1729647716us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] rcu_tort-136 1d... 1729647716us : try_to_wake_up: ttwu_queue_remote entered, CPU 2
[ 1731.632619] kworker/-162 0.Ns. 1729647723us : wake_up_process: entered
[ 1731.632619] kworker/-162 0.Ns. 1729647723us : try_to_wake_up: entered
[ 1731.632619] kworker/-162 0dNs. 1729647724us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] kworker/-162 0dNs. 1729647724us : ttwu_do_activate.isra.108: entered
[ 1731.632619] kworker/-162 0dNs. 1729647725us : activate_task: entered
[ 1731.632619] kworker/-162 0dNs. 1729647726us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] kworker/-162 0dNs. 1729647726us : check_preempt_curr: entered
[ 1731.632619] kworker/-162 0.Ns. 1729647727us : wake_up_process: entered
[ 1731.632619] kworker/-162 0.Ns. 1729647727us : try_to_wake_up: entered
[ 1731.632619] kworker/-162 0dNs. 1729647728us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] kworker/-162 0dNs. 1729647729us : try_to_wake_up: ttwu_queue_remote entered, CPU 1
[ 1731.632619] rcu_tort-136 1d... 1729647730us : scheduler_ipi: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647730us : sched_ttwu_pending: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647730us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] rcu_tort-136 1d.h. 1729647731us : ttwu_do_activate.isra.108: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647731us : activate_task: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647731us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647732us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-136 1d.h. 1729647732us : resched_curr: entered
[ 1731.632619] rcu_tort-136 1dNh. 1729647732us : ttwu_do_activate.isra.108: entered
[ 1731.632619] rcu_tort-136 1dNh. 1729647732us : activate_task: entered
[ 1731.632619] rcu_tort-136 1dNh. 1729647733us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] kworker/-162 0d... 1729647733us : sched_switch: prev_comm=kworker/u16:5 prev_pid=162 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:3 next_pid=171 next_prio=120
[ 1731.632619] rcu_tort-136 1dNh. 1729647733us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-136 1d... 1729647734us : sched_switch: prev_comm=rcu_torture_fwd prev_pid=136 prev_prio=139 prev_state=D ==> next_comm=rcu_torture_wri next_pid=119 next_prio=120
[ 1731.632619] kworker/-171 0d... 1729647752us : sched_switch: prev_comm=kworker/0:3 prev_pid=171 prev_prio=120 prev_state=I ==> next_comm=kworker/u16:4 next_pid=161 next_prio=120
[ 1731.632619] kworker/-161 0d... 1729647754us : sched_switch: prev_comm=kworker/u16:4 prev_pid=161 prev_prio=120 prev_state=I ==> next_comm=rcu_sched next_pid=10 next_prio=120
[ 1731.632619] rcu_tort-119 1d... 1729648207us : sched_switch: prev_comm=rcu_torture_wri prev_pid=119 prev_prio=120 prev_state=D ==> next_comm=rcu_torture_sta next_pid=131 next_prio=120
[ 1731.632619] rcu_sche-10 0d... 1729648662us : resched_curr: entered
[ 1731.632619] rcu_sche-10 0..s. 1729650719us : wake_up_process: entered
[ 1731.632619] rcu_sche-10 0..s. 1729650719us : try_to_wake_up: entered
[ 1731.632619] rcu_sche-10 0d.s. 1729650720us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-21 2dNs1 1729650720us : scheduler_ipi: entered
[ 1731.632619] rcu_sche-10 0d.s. 1729650721us : try_to_wake_up: ttwu_queue_remote entered, CPU 7
[ 1731.632619] migratio-21 2dNH1 1729650721us : sched_ttwu_pending: entered
[ 1731.632619] migratio-21 2dNH1 1729650721us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] migratio-21 2dNH1 1729650721us : ttwu_do_activate.isra.108: entered
[ 1731.632619] migratio-21 2dNH1 1729650722us : activate_task: entered
[ 1731.632619] migratio-21 2dNH1 1729650723us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] migratio-21 2dNH1 1729650723us : check_preempt_curr: entered
[ 1731.632619] rcu_sche-10 0..s. 1729650734us : wake_up_process: entered
[ 1731.632619] rcu_sche-10 0..s. 1729650735us : try_to_wake_up: entered
[ 1731.632619] rcu_sche-10 0d.s. 1729650735us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] rcu_sche-10 0d.s. 1729650736us : try_to_wake_up: ttwu_queue_remote entered, CPU 1
[ 1731.632619] <idle>-0 7d... 1729650736us : scheduler_ipi: entered
[ 1731.632619] rcu_sche-10 0..s. 1729650747us : wake_up_process: entered
[ 1731.632619] rcu_sche-10 0..s. 1729650747us : try_to_wake_up: entered
[ 1731.632619] rcu_sche-10 0d.s. 1729650748us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] rcu_sche-10 0d.s. 1729650748us : try_to_wake_up: ttwu_queue_remote entered, CPU 1
[ 1731.632619] <idle>-0 7d.h. 1729650749us : sched_ttwu_pending: entered
[ 1731.632619] <idle>-0 7d.h. 1729650749us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] <idle>-0 7d.h. 1729650750us : ttwu_do_activate.isra.108: entered
[ 1731.632619] <idle>-0 7d.h. 1729650750us : activate_task: entered
[ 1731.632619] rcu_sche-10 0d... 1729650754us : sched_switch: prev_comm=rcu_sched prev_pid=10 prev_prio=120 prev_state=I ==> next_comm=kworker/u16:5 next_pid=162 next_prio=120
[ 1731.632619] <idle>-0 7d.h. 1729650755us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] <idle>-0 7d.h. 1729650755us : check_preempt_curr: entered
[ 1731.632619] <idle>-0 7d.h. 1729650755us : resched_curr: entered
[ 1731.632619] migratio-21 2dNs1 1729650758us : wake_up_process: entered
[ 1731.632619] migratio-21 2dNs1 1729650758us : try_to_wake_up: entered
[ 1731.632619] migratio-21 2dNs1 1729650759us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-21 2dNs1 1729650759us : try_to_wake_up: ttwu_queue_remote entered, CPU 0
[ 1731.632619] kworker/-162 0d... 1729650763us : activate_task: entered
[ 1731.632619] kworker/-162 0d... 1729650764us : check_preempt_curr: entered
[ 1731.632619] kworker/-162 0d... 1729650765us : activate_task: entered
[ 1731.632619] kworker/-162 0d... 1729650765us : check_preempt_curr: entered
[ 1731.632619] kworker/-162 0d... 1729650765us : activate_task: entered
[ 1731.632619] kworker/-162 0d... 1729650766us : check_preempt_curr: entered
[ 1731.632619] migratio-21 2dN.1 1729650770us : try_to_wake_up: entered
[ 1731.632619] kworker/-162 0d... 1729650770us : sched_switch: prev_comm=kworker/u16:5 prev_pid=162 prev_prio=120 prev_state=I ==> next_comm=rcu_torture_rea next_pid=125 next_prio=139
[ 1731.632619] migratio-21 2dN.1 1729650771us : try_to_wake_up: ttwu_queue entered
[ 1731.632619] migratio-21 2dN.1 1729650771us : try_to_wake_up: ttwu_queue_remote entered, CPU 0
[ 1731.632619] migratio-21 2d... 1729650774us : sched_switch: prev_comm=migration/2 prev_pid=21 prev_prio=0 prev_state=S ==> next_comm=rcuog/1 next_pid=18 next_prio=120
[ 1731.632619] rcu_tort-125 0d... 1729650775us : scheduler_ipi: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650775us : sched_ttwu_pending: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650775us : sched_ttwu_pending: non-NULL llist
[ 1731.632619] rcu_tort-125 0d.h. 1729650776us : ttwu_do_activate.isra.108: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650777us : activate_task: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650778us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650778us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650779us : ttwu_do_activate.isra.108: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650779us : activate_task: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650779us : ttwu_do_wakeup.isra.107: entered
[ 1731.632619] rcu_tort-125 0d.h. 1729650779us : check_preempt_curr: entered
[ 1731.632619] rcu_tort-125 0d... 1729650783us : sched_switch: prev_comm=rcu_torture_rea prev_pid=125 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=127 next_prio=139
[ 1731.632619] rcuog/1-18 2d... 1729650783us : sched_switch: prev_comm=rcuog/1 prev_pid=18 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
[ 1731.632619] rcu_tort-127 0d... 1729650786us : sched_switch: prev_comm=rcu_torture_rea prev_pid=127 prev_prio=139 prev_state=S ==> next_comm=rcu_torture_rea next_pid=130 next_prio=139
[ 1731.632619] rcu_tort-130 0d... 1729650789us : sched_switch: prev_comm=rcu_torture_rea prev_pid=130 prev_prio=139 prev_state=S ==> next_comm=torture_onoff next_pid=135 next_prio=120
[ 1731.632619] ---------------------------------
[ 1795.802039] smpboot: CPU 6 is now offline

2019-08-08 21:32:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Thu, Aug 08, 2019 at 01:35:41PM -0700, Paul E. McKenney wrote:
> On Wed, Aug 07, 2019 at 02:41:31PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 06, 2019 at 11:08:24AM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > > > > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> > > > >
> > > > > > > Right; so clearly we're not understanding what's happening. That seems
> > > > > > > like a requirement for actually doing a patch.
> > > > > >
> > > > > > Almost but not quite. It is a requirement for a patch *that* *is*
> > > > > > *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> > > > > > writing experimental patches, please feel free to take a long walk on
> > > > > > a short pier.
> > > > > >
> > > > > > Understood???
> > > > >
> > > > > Ah, my bad, I thought you were actually proposing this as an actual
> > > > > patch. I now see that is my bad, I'd overlooked the RFC part.
> > > >
> > > > No problem!
> > > >
> > > > And of course adding tracing decreases the frequency and duration of
> > > > the multi_cpu_stop(). Re-running with shorter-duration triggering. ;-)
> > >
> > > And I did eventually get a good trace. If I am interpreting this trace
> > > correctly, the torture_-135 task didn't get around to attempting to wake
> > > up all of the CPUs. I will try again, but this time with the sched_switch
> > > trace event enabled.
> > >
> > > As a side note, enabling ftrace from the command line seems to interact
> > > badly with turning tracing off and on in the kernel, so I eventually
> > > resorted to trace_printk() in the functions of interest. The trace
> > > output is below, followed by the current diagnostic patch. Please note
> > > that I am -not- using the desperation hammer-the-scheduler patches.
> > >
> > > More as I learn more!
> >
> > And of course I forgot to dump out the online CPUs, so I really had no
> > idea whether or not all the CPUs were accounted for. I added tracing
> > to dump out the online CPUs at the beginning of __stop_cpus() and then
> > reworked it a few times to get the problem to happen in reasonable time.
> > Please see below for the resulting annotated trace.
> >
> > I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> > but all the migration threads are running within about 2 milliseconds.
> > It is then almost two minutes(!) until the next trace message.
> >
> > Looks like time to (very carefully!) instrument multi_cpu_stop().
> >
> > Of course, if you have any ideas, please do not keep them a secret!
>
> Functionally, multi_cpu_stop() is working fine, according to the trace
> below (search for a line beginning with TAB). But somehow CPU 2 took
> almost three -minutes- to do one iteration of the loop. The prime suspect
> in that loop is cpu_relax() due to the hypervisor having an opportunity
> to do something at that point. The commentary below (again, search for
> a line beginning with TAB) gives my analysis.
>
> Of course, if I am correct, it should be possible to catch cpu_relax()
> in the act. That is the next step, give or take the Heisenbuggy nature
> of this beast.
>
> Another thing for me to try is to run longer with !NO_HZ_FULL, just in
> case the earlier runs just got lucky.
>
> Thoughts?

And it really can happen:

[ 1881.467922] migratio-33 4...1 1879530317us : stop_machine_yield: cpu_relax() took 756140 ms

The previous timestamp was 1123391100us, so the cpu_relax() is almost
exactly the full delay.
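
(Checking the arithmetic: 1879530317 us - 1123391100 us = 756139217 us, or
roughly 756139 ms, which matches the reported 756140 ms to within a
millisecond.)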

But another instance stalled for many minutes without a ten-second
cpu_relax(). So it is not just cpu_relax() causing trouble. I could
rationalize that as vCPU preemption being at fault...

And my diagnostic patch is below, just in case I am doing something
stupid with that.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ce00b442ced0..1a50ed258ef0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3569,6 +3569,7 @@ void __init rcu_init(void)
rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
WARN_ON(!rcu_par_gp_wq);
srcu_init();
+ tracing_off();
}

#include "tree_stall.h"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b22e55cebe8..a5a879a49051 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -479,6 +479,7 @@ void wake_up_q(struct wake_q_head *head)
{
struct wake_q_node *node = head->first;

+ trace_printk("entered\n");
while (node != WAKE_Q_TAIL) {
struct task_struct *task;

@@ -509,6 +510,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;

+ trace_printk("entered\n");
lockdep_assert_held(&rq->lock);

if (test_tsk_need_resched(curr))
@@ -1197,6 +1199,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
+ trace_printk("entered\n");
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;

@@ -1298,6 +1301,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
const struct sched_class *class;

+ trace_printk("entered\n");
if (p->sched_class == rq->curr->sched_class) {
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
} else {
@@ -2097,6 +2101,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
{
+ trace_printk("entered\n");
check_preempt_curr(rq, p, wake_flags);
p->state = TASK_RUNNING;
trace_sched_wakeup(p);
@@ -2132,6 +2137,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
{
int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;

+ trace_printk("entered\n");
lockdep_assert_held(&rq->lock);

#ifdef CONFIG_SMP
@@ -2178,9 +2184,11 @@ void sched_ttwu_pending(void)
struct task_struct *p, *t;
struct rq_flags rf;

+ trace_printk("entered\n");
if (!llist)
return;

+ trace_printk("non-NULL llist\n");
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);

@@ -2192,6 +2200,7 @@ void sched_ttwu_pending(void)

void scheduler_ipi(void)
{
+ trace_printk("entered\n");
/*
* Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
* TIF_NEED_RESCHED remotely (for the first time) will also send
@@ -2232,6 +2241,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
{
struct rq *rq = cpu_rq(cpu);

+ trace_printk("%s entered, CPU %d\n", __func__, cpu);
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
@@ -2277,6 +2287,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq *rq = cpu_rq(cpu);
struct rq_flags rf;

+ trace_printk("%s entered\n", __func__);
#if defined(CONFIG_SMP)
if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
@@ -2399,6 +2410,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
unsigned long flags;
int cpu, success = 0;

+ trace_printk("entered\n");
preempt_disable();
if (p == current) {
/*
@@ -2545,6 +2557,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
int wake_up_process(struct task_struct *p)
{
+ trace_printk("entered\n");
return try_to_wake_up(p, TASK_NORMAL, 0);
}
EXPORT_SYMBOL(wake_up_process);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 5c2b2f90fae1..a07f77b9c1f2 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
#include <linux/atomic.h>
#include <linux/nmi.h>
#include <linux/sched/wake_q.h>
+#include <linux/sched/clock.h>

/*
* Structure to determine completion condition and record errors. May
@@ -80,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
unsigned long flags;
bool enabled;

+ trace_printk("entered for CPU %u\n", cpu);
preempt_disable();
raw_spin_lock_irqsave(&stopper->lock, flags);
enabled = stopper->enabled;
@@ -167,7 +169,7 @@ static void set_state(struct multi_stop_data *msdata,
/* Reset ack counter. */
atomic_set(&msdata->thread_ack, msdata->num_threads);
smp_wmb();
- msdata->state = newstate;
+ WRITE_ONCE(msdata->state, newstate);
}

/* Last one to ack a state moves to the next state. */
@@ -179,7 +181,15 @@ static void ack_state(struct multi_stop_data *msdata)

void __weak stop_machine_yield(const struct cpumask *cpumask)
{
+ u64 starttime = local_clock();
+ u64 endtime;
+ const u64 delta = 100ULL * 1000ULL * 1000ULL * 1000ULL;
+
cpu_relax();
+ endtime = local_clock();
+ if (time_after64(endtime, starttime + delta))
+ trace_printk("cpu_relax() took %llu ms\n",
+ (endtime - starttime) / (1000ULL * 1000ULL));
}

/* This is the cpu_stop function which stops the CPU. */
@@ -210,8 +220,9 @@ static int multi_cpu_stop(void *data)
do {
/* Chill out and ensure we re-read multi_stop_state. */
stop_machine_yield(cpumask);
- if (msdata->state != curstate) {
- curstate = msdata->state;
+ if (READ_ONCE(msdata->state) != curstate) {
+ curstate = READ_ONCE(msdata->state);
+ trace_printk("curstate = %d, ack = %d\n", curstate, atomic_read(&msdata->thread_ack));
switch (curstate) {
case MULTI_STOP_DISABLE_IRQ:
local_irq_disable();
@@ -382,6 +393,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
* preempted by a stopper which might wait for other stoppers
* to enter @fn which can lead to deadlock.
*/
+ trace_printk("entered\n");
preempt_disable();
stop_cpus_in_progress = true;
for_each_cpu(cpu, cpumask) {
@@ -402,11 +414,18 @@ static int __stop_cpus(const struct cpumask *cpumask,
cpu_stop_fn_t fn, void *arg)
{
struct cpu_stop_done done;
+ unsigned long j = jiffies;

+ tracing_on();
+ trace_printk("entered\n");
+ trace_printk("CPUs %*pbl online\n", cpumask_pr_args(cpu_online_mask));
cpu_stop_init_done(&done, cpumask_weight(cpumask));
if (!queue_stop_cpus_work(cpumask, fn, arg, &done))
return -ENOENT;
wait_for_completion(&done.completion);
+ tracing_off();
+ if (time_after(jiffies, j + HZ * 20))
+ ftrace_dump(DUMP_ALL);
return done.ret;
}

@@ -442,6 +461,7 @@ int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
{
int ret;

+ trace_printk("entered\n");
/* static works are used, process one request at a time */
mutex_lock(&stop_cpus_mutex);
ret = __stop_cpus(cpumask, fn, arg);
@@ -599,6 +619,7 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
.active_cpus = cpus,
};

+ trace_printk("entered\n");
lockdep_assert_cpus_held();

if (!stop_machine_initialized) {

2019-08-09 17:32:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Thu, Aug 08, 2019 at 02:30:12PM -0700, Paul E. McKenney wrote:
> On Thu, Aug 08, 2019 at 01:35:41PM -0700, Paul E. McKenney wrote:
> > On Wed, Aug 07, 2019 at 02:41:31PM -0700, Paul E. McKenney wrote:
> > > On Tue, Aug 06, 2019 at 11:08:24AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > > > > > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> > > > > >
> > > > > > > > Right; so clearly we're not understanding what's happening. That seems
> > > > > > > > like a requirement for actually doing a patch.
> > > > > > >
> > > > > > > Almost but not quite. It is a requirement for a patch *that* *is*
> > > > > > > *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> > > > > > > writing experimental patches, please feel free to take a long walk on
> > > > > > > a short pier.
> > > > > > >
> > > > > > > Understood???
> > > > > >
> > > > > > Ah, my bad, I thought you were actually proposing this as an actual
> > > > > > patch. I now see that is my bad, I'd overlooked the RFC part.
> > > > >
> > > > > No problem!
> > > > >
> > > > > And of course adding tracing decreases the frequency and duration of
> > > > > the multi_cpu_stop(). Re-running with shorter-duration triggering. ;-)
> > > >
> > > > And I did eventually get a good trace. If I am interpreting this trace
> > > > correctly, the torture_-135 task didn't get around to attempting to wake
> > > > up all of the CPUs. I will try again, but this time with the sched_switch
> > > > trace event enabled.
> > > >
> > > > As a side note, enabling ftrace from the command line seems to interact
> > > > badly with turning tracing off and on in the kernel, so I eventually
> > > > resorted to trace_printk() in the functions of interest. The trace
> > > > output is below, followed by the current diagnostic patch. Please note
> > > > that I am -not- using the desperation hammer-the-scheduler patches.
> > > >
> > > > More as I learn more!
> > >
> > > And of course I forgot to dump out the online CPUs, so I really had no
> > > idea whether or not all the CPUs were accounted for. I added tracing
> > > to dump out the online CPUs at the beginning of __stop_cpus() and then
> > > reworked it a few times to get the problem to happen in reasonable time.
> > > Please see below for the resulting annotated trace.
> > >
> > > I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> > > but all the migration threads are running within about 2 milliseconds.
> > > It is then almost two minutes(!) until the next trace message.
> > >
> > > Looks like time to (very carefully!) instrument multi_cpu_stop().
> > >
> > > Of course, if you have any ideas, please do not keep them a secret!
> >
> > Functionally, multi_cpu_stop() is working fine, according to the trace
> > below (search for a line beginning with TAB). But somehow CPU 2 took
> > almost three -minutes- to do one iteration of the loop. The prime suspect
> > in that loop is cpu_relax() due to the hypervisor having an opportunity
> > to do something at that point. The commentary below (again, search for
> > a line beginning with TAB) gives my analysis.
> >
> > Of course, if I am correct, it should be possible to catch cpu_relax()
> > in the act. That is the next step, give or take the Heisenbuggy nature
> > of this beast.
> >
> > Another thing for me to try is to run longer with !NO_HZ_FULL, just in
> > case the earlier runs just got lucky.
> >
> > Thoughts?
>
> And it really can happen:
>
> [ 1881.467922] migratio-33 4...1 1879530317us : stop_machine_yield: cpu_relax() took 756140 ms
>
> The previous timestamp was 1123391100us, so the cpu_relax() is almost
> exactly the full delay.
>
> But another instance stalled for many minutes without a ten-second
> cpu_relax(). So it is not just cpu_relax() causing trouble. I could
> rationalize that vCPU preemption being at fault...
>
> And my diagnostic patch is below, just in case I am doing something
> stupid with that.

I did a 12-hour run with the same configuration except for leaving out the
"nohz_full=1-7" kernel parameter without problems (aside from the RCU CPU
stall warnings due to the ftrace_dump() at the end of the run -- isn't
there some way to clear the ftrace buffer without actually printing it?).

My next step is to do an over-the-weekend run with the same configuration,
then a similar run with a more recent kernel and qemu but with the
"nohz_full=1-7". If both of those pass, I will consider this to be a
KVM/qemu bug that has since been fixed.

Thanx, Paul

2019-08-09 18:10:11

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Fri, Aug 09, 2019 at 09:51:20AM -0700, Paul E. McKenney wrote:
> > > > And of course I forgot to dump out the online CPUs, so I really had no
> > > > idea whether or not all the CPUs were accounted for. I added tracing
> > > > to dump out the online CPUs at the beginning of __stop_cpus() and then
> > > > reworked it a few times to get the problem to happen in reasonable time.
> > > > Please see below for the resulting annotated trace.
> > > >
> > > > I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> > > > but all the migration threads are running within about 2 milliseconds.
> > > > It is then almost two minutes(!) until the next trace message.
> > > >
> > > > Looks like time to (very carefully!) instrument multi_cpu_stop().
> > > >
> > > > Of course, if you have any ideas, please do not keep them a secret!
> > >
> > > Functionally, multi_cpu_stop() is working fine, according to the trace
> > > below (search for a line beginning with TAB). But somehow CPU 2 took
> > > almost three -minutes- to do one iteration of the loop. The prime suspect
> > > in that loop is cpu_relax() due to the hypervisor having an opportunity
> > > to do something at that point. The commentary below (again, search for
> > > a line beginning with TAB) gives my analysis.
> > >
> > > Of course, if I am correct, it should be possible to catch cpu_relax()
> > > in the act. That is the next step, give or take the Heisenbuggy nature
> > > of this beast.
> > >
> > > Another thing for me to try is to run longer with !NO_HZ_FULL, just in
> > > case the earlier runs just got lucky.
> > >
> > > Thoughts?
> >
> > And it really can happen:
> >
> > [ 1881.467922] migratio-33 4...1 1879530317us : stop_machine_yield: cpu_relax() took 756140 ms
> >
> > The previous timestamp was 1123391100us, so the cpu_relax() is almost
> > exactly the full delay.
> >
> > But another instance stalled for many minutes without a ten-second
> > cpu_relax(). So it is not just cpu_relax() causing trouble. I could
> > rationalize that vCPU preemption being at fault...
> >
> > And my diagnostic patch is below, just in case I am doing something
> > stupid with that.
>
> I did a 12-hour run with the same configuration except for leaving out the
> "nohz_full=1-7" kernel parameter without problems (aside from the RCU CPU
> stall warnings due to the ftrace_dump() at the end of the run -- isn't
> there some way to clear the ftrace buffer without actually printing it?).

I think if tracing_reset_all_online_cpus() is exposed, then calling that with
the appropriate locks held can reset the ftrace ring buffer and clear the
trace. Steve, is there a better way?
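
(A minimal sketch of what is being suggested here, assuming
tracing_reset_all_online_cpus() were made callable from outside kernel/trace/;
the choice of trace_types_lock as the "appropriate lock" is an assumption, not
something confirmed in this thread:)

/* Hypothetical helper, sketched as if it lived in kernel/trace/trace.c where
 * trace_types_lock and tracing_reset_all_online_cpus() are visible: discard
 * the current contents of the global trace buffer without printing them. */
void ftrace_clear_trace(void)
{
	mutex_lock(&trace_types_lock);
	tracing_reset_all_online_cpus();
	mutex_unlock(&trace_types_lock);
}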

thanks,

- Joel

2019-08-09 18:41:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Fri, Aug 09, 2019 at 02:07:21PM -0400, Joel Fernandes wrote:
> On Fri, Aug 09, 2019 at 09:51:20AM -0700, Paul E. McKenney wrote:
> > > > > And of course I forgot to dump out the online CPUs, so I really had no
> > > > > idea whether or not all the CPUs were accounted for. I added tracing
> > > > > to dump out the online CPUs at the beginning of __stop_cpus() and then
> > > > > reworked it a few times to get the problem to happen in reasonable time.
> > > > > Please see below for the resulting annotated trace.
> > > > >
> > > > > I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> > > > > but all the migration threads are running within about 2 milliseconds.
> > > > > It is then almost two minutes(!) until the next trace message.
> > > > >
> > > > > Looks like time to (very carefully!) instrument multi_cpu_stop().
> > > > >
> > > > > Of course, if you have any ideas, please do not keep them a secret!
> > > >
> > > > Functionally, multi_cpu_stop() is working fine, according to the trace
> > > > below (search for a line beginning with TAB). But somehow CPU 2 took
> > > > almost three -minutes- to do one iteration of the loop. The prime suspect
> > > > in that loop is cpu_relax() due to the hypervisor having an opportunity
> > > > to do something at that point. The commentary below (again, search for
> > > > a line beginning with TAB) gives my analysis.
> > > >
> > > > Of course, if I am correct, it should be possible to catch cpu_relax()
> > > > in the act. That is the next step, give or take the Heisenbuggy nature
> > > > of this beast.
> > > >
> > > > Another thing for me to try is to run longer with !NO_HZ_FULL, just in
> > > > case the earlier runs just got lucky.
> > > >
> > > > Thoughts?
> > >
> > > And it really can happen:
> > >
> > > [ 1881.467922] migratio-33 4...1 1879530317us : stop_machine_yield: cpu_relax() took 756140 ms
> > >
> > > The previous timestamp was 1123391100us, so the cpu_relax() is almost
> > > exactly the full delay.
> > >
> > > But another instance stalled for many minutes without a ten-second
> > > cpu_relax(). So it is not just cpu_relax() causing trouble. I could
> > > rationalize that vCPU preemption being at fault...
> > >
> > > And my diagnostic patch is below, just in case I am doing something
> > > stupid with that.
> >
> > I did a 12-hour run with the same configuration except for leaving out the
> > "nohz_full=1-7" kernel parameter without problems (aside from the RCU CPU
> > stall warnings due to the ftrace_dump() at the end of the run -- isn't
> > there some way to clear the ftrace buffer without actually printing it?).
>
> I think if tracing_reset_all_online_cpus() is exposed, then calling that with
> the appropriate locks held can reset the ftrace ring buffer and clear the
> trace. Steve, is there a better way?

On the off-chance that it helps, here is my use case:

o I have a race condition that becomes extremely unlikely if
I leave tracing enabled at all times.

o I therefore enable tracing at the beginning of a CPU-hotplug
removal.

o At the end of that CPU-hotplug removal, I disable tracing.

o I can test whether the problem occurred without affecting its
probability. If it occurred, I want to dump only that portion
of the trace buffer since enabling it above. If the problem
did not occur, I want to flush (not dump) the trace buffer.

o I cannot reasonably just make the trace buffer small because
the number of events in a given occurrence of the problem
can vary widely.

Thanx, Paul
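
(The use case above, pulled into a compact sketch -- not code from this
thread, just a restatement reusing the tracing_on()/tracing_off()/ftrace_dump()
calls from the diagnostic patch earlier in the thread; the two names marked
"hypothetical" are stand-ins:)

	tracing_on();                    /* start capturing at the start of the hotplug removal */
	do_cpu_hotplug_removal();        /* hypothetical stand-in for the removal itself */
	tracing_off();                   /* stop capturing at the end of the removal */
	if (removal_took_too_long())     /* hypothetical check for "the problem occurred" */
		ftrace_dump(DUMP_ALL);   /* dump only the window captured above */
	else
		; /* want: flush the captured window without printing it */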

2019-08-12 21:37:38

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> The multi_cpu_stop() function relies on the scheduler to gain control from
> whatever is running on the various online CPUs, including any nohz_full
> CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> and can also result in RCU CPU stall warnings. This commit therefore
> causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> online CPUs.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/stop_machine.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index b4f83f7bdf86..a2659f61ed92 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -20,6 +20,7 @@
> #include <linux/smpboot.h>
> #include <linux/atomic.h>
> #include <linux/nmi.h>
> +#include <linux/tick.h>
> #include <linux/sched/wake_q.h>
>
> /*
> @@ -187,15 +188,19 @@ static int multi_cpu_stop(void *data)
> {
> struct multi_stop_data *msdata = data;
> enum multi_stop_state curstate = MULTI_STOP_NONE;
> - int cpu = smp_processor_id(), err = 0;
> + int cpu, err = 0;
> const struct cpumask *cpumask;
> unsigned long flags;
> bool is_active;
>
> + for_each_online_cpu(cpu)
> + tick_nohz_dep_set_cpu(cpu, TICK_DEP_MASK_RCU);

Looks like it's not the right fix but, should you ever need to set an
all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep().

Thanks.

2019-08-12 23:27:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 12, 2019 at 11:02:33PM +0200, Frederic Weisbecker wrote:
> On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > The multi_cpu_stop() function relies on the scheduler to gain control from
> > whatever is running on the various online CPUs, including any nohz_full
> > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > and can also result in RCU CPU stall warnings. This commit therefore
> > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > online CPUs.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/stop_machine.c | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index b4f83f7bdf86..a2659f61ed92 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -20,6 +20,7 @@
> > #include <linux/smpboot.h>
> > #include <linux/atomic.h>
> > #include <linux/nmi.h>
> > +#include <linux/tick.h>
> > #include <linux/sched/wake_q.h>
> >
> > /*
> > @@ -187,15 +188,19 @@ static int multi_cpu_stop(void *data)
> > {
> > struct multi_stop_data *msdata = data;
> > enum multi_stop_state curstate = MULTI_STOP_NONE;
> > - int cpu = smp_processor_id(), err = 0;
> > + int cpu, err = 0;
> > const struct cpumask *cpumask;
> > unsigned long flags;
> > bool is_active;
> >
> > + for_each_online_cpu(cpu)
> > + tick_nohz_dep_set_cpu(cpu, TICK_DEP_MASK_RCU);
>
> Looks like it's not the right fix but, should you ever need to set an
> all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep().

Indeed, I have dropped this patch, but I now do something similar in
RCU's CPU-hotplug notifiers. Which does have an effect, especially on
the system that isn't subject to the insane-latency cpu_relax().

Plus I am having to put a similar workaround into RCU's quiescent-state
forcing logic.

But how should this really be done?

Isn't there some sort of monitoring of nohz_full CPUs for accounting
purposes? If so, would it make sense for this monitoring to check for
long-duration kernel execution and enable the tick in this case? The
RCU dyntick machinery can be used to remotely detect the long-duration
kernel execution using something like the following:

int nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);

...

if (rcu_dynticks_in_eqs_cpu(cpu, nohz_in_kernel_snap))
nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
else
/* Turn on the tick! */

I would supply rcu_dynticks_snap_cpu() and rcu_dynticks_in_eqs_cpu(),
which would be simple wrappers around RCU's private rcu_dynticks_snap()
and rcu_dynticks_in_eqs() functions.
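
(For concreteness, a guess at what those wrappers might look like -- the names
are the ones proposed above, but the bodies below are a sketch based on the
existing rcu_dynticks_snap()/rcu_dynticks_in_eqs() helpers and the per-CPU
rcu_data structure in kernel/rcu/tree.c, not code from this thread:)

/* Snapshot the specified CPU's dynticks counter. */
int rcu_dynticks_snap_cpu(int cpu)
{
	return rcu_dynticks_snap(per_cpu_ptr(&rcu_data, cpu));
}

/* Return true if the specified CPU is now in an extended quiescent state
 * or has passed through one since the given snapshot was taken. */
bool rcu_dynticks_in_eqs_cpu(int cpu, int snap)
{
	return rcu_dynticks_in_eqs(snap) ||
	       snap != rcu_dynticks_snap(per_cpu_ptr(&rcu_data, cpu));
}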

Would this make sense as a general solution, or am I missing a corner
case or three?

Thanx, Paul

2019-08-13 01:36:51

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 12, 2019 at 7:23 PM Paul E. McKenney <[email protected]> wrote:
>
> On Mon, Aug 12, 2019 at 11:02:33PM +0200, Frederic Weisbecker wrote:
> > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > The multi_cpu_stop() function relies on the scheduler to gain control from
> > > whatever is running on the various online CPUs, including any nohz_full
> > > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > > and can also result in RCU CPU stall warnings. This commit therefore
> > > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > > online CPUs.
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> > > kernel/stop_machine.c | 9 ++++++++-
> > > 1 file changed, 8 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > > index b4f83f7bdf86..a2659f61ed92 100644
> > > --- a/kernel/stop_machine.c
> > > +++ b/kernel/stop_machine.c
> > > @@ -20,6 +20,7 @@
> > > #include <linux/smpboot.h>
> > > #include <linux/atomic.h>
> > > #include <linux/nmi.h>
> > > +#include <linux/tick.h>
> > > #include <linux/sched/wake_q.h>
> > >
> > > /*
> > > @@ -187,15 +188,19 @@ static int multi_cpu_stop(void *data)
> > > {
> > > struct multi_stop_data *msdata = data;
> > > enum multi_stop_state curstate = MULTI_STOP_NONE;
> > > - int cpu = smp_processor_id(), err = 0;
> > > + int cpu, err = 0;
> > > const struct cpumask *cpumask;
> > > unsigned long flags;
> > > bool is_active;
> > >
> > > + for_each_online_cpu(cpu)
> > > + tick_nohz_dep_set_cpu(cpu, TICK_DEP_MASK_RCU);
> >
> > Looks like it's not the right fix but, should you ever need to set an
> > all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep().
>
> Indeed, I have dropped this patch, but I now do something similar in
> RCU's CPU-hotplug notifiers. Which does have an effect, especially on
> the system that isn't subject to the insane-latency cpu_relax().
>
> Plus I am having to put a similar workaround into RCU's quiescent-state
> forcing logic.
>
> But how should this really be done?
>
> Isn't there some sort of monitoring of nohz_full CPUs for accounting
> purposes? If so, would it make sense for this monitoring to check for
> long-duration kernel execution and enable the tick in this case? The
> RCU dyntick machinery can be used to remotely detect the long-duration
> kernel execution using something like the following:
>
> int nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
>
> ...
>
> if (rcu_dynticks_in_eqs_cpu(cpu, nohz_in_kernel_snap)
> nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
> else
> /* Turn on the tick! */

This solution does make sense to me and is simpler than the
rcu_nmi_exit_common() solution you proposed in the other thread.

Though I am not sure about the intricacies of remotely enabling the
timer tick for a CPU. But overall, the above solution does seem
simpler.

thanks,

- Joel

2019-08-13 12:31:40

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 12, 2019 at 04:23:16PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 12, 2019 at 11:02:33PM +0200, Frederic Weisbecker wrote:
> > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > Looks like it's not the right fix but, should you ever need to set an
> > all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep().
>
> Indeed, I have dropped this patch, but I now do something similar in
> RCU's CPU-hotplug notifiers. Which does have an effect, especially on
> the system that isn't subject to the insane-latency cpu_relax().
>
> Plus I am having to put a similar workaround into RCU's quiescent-state
> forcing logic.
>
> But how should this really be done?
>
> Isn't there some sort of monitoring of nohz_full CPUs for accounting
> purposes? If so, would it make sense for this monitoring to check for
> long-duration kernel execution and enable the tick in this case? The
> RCU dyntick machinery can be used to remotely detect the long-duration
> kernel execution using something like the following:
>
> int nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
>
> ...
>
> if (rcu_dynticks_in_eqs_cpu(cpu, nohz_in_kernel_snap)
> nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
> else
> /* Turn on the tick! */
>
> I would supply rcu_dynticks_snap_cpu() and rcu_dynticks_in_eqs_cpu(),
> which would be simple wrappers around RCU's private rcu_dynticks_snap()
> and rcu_dynticks_in_eqs() functions.
>
> Would this make sense as a general solution, or am I missing a corner
> case or three?

Oh I see. Until now we considered that running into the kernel (between user/guest/idle)
is supposed to be short but there can be specific places where it doesn't apply.

I'm wondering if, more than just providing wrappers, this shouldn't be entirely
driven by RCU using the tick_set_dep_cpu()/tick_clear_dep_cpu() at appropriate timings.

I don't want to sound like I'm trying to put all the work on you :p It's just that
the tick shouldn't know much about RCU, it's rather RCU that is a client for the tick and
is probably better suited to determine when a CPU becomes annoying with its extended grace
period.

Arming a CPU timer could also be an alternative to tick_set_dep_cpu() for that.

What do you think?

2019-08-13 16:07:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Tue, Aug 13, 2019 at 02:30:19PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 12, 2019 at 04:23:16PM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 12, 2019 at 11:02:33PM +0200, Frederic Weisbecker wrote:
> > > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > Looks like it's not the right fix but, should you ever need to set an
> > > all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep().
> >
> > Indeed, I have dropped this patch, but I now do something similar in
> > RCU's CPU-hotplug notifiers. Which does have an effect, especially on
> > the system that isn't subject to the insane-latency cpu_relax().
> >
> > Plus I am having to put a similar workaround into RCU's quiescent-state
> > forcing logic.
> >
> > But how should this really be done?
> >
> > Isn't there some sort of monitoring of nohz_full CPUs for accounting
> > purposes? If so, would it make sense for this monitoring to check for
> > long-duration kernel execution and enable the tick in this case? The
> > RCU dyntick machinery can be used to remotely detect the long-duration
> > kernel execution using something like the following:
> >
> > int nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
> >
> > ...
> >
> > if (rcu_dynticks_in_eqs_cpu(cpu, nohz_in_kernel_snap)
> > nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
> > else
> > /* Turn on the tick! */
> >
> > I would supply rcu_dynticks_snap_cpu() and rcu_dynticks_in_eqs_cpu(),
> > which would be simple wrappers around RCU's private rcu_dynticks_snap()
> > and rcu_dynticks_in_eqs() functions.
> >
> > Would this make sense as a general solution, or am I missing a corner
> > case or three?
>
> Oh I see. Until now we considered than running into the kernel (between user/guest/idle)
> is supposed to be short but there can be specific places where it doesn't apply.
>
> I'm wondering if, more than just providing wrappers, this shouldn't be entirely
> driven by RCU using the tick_set_dep_cpu()/tick_clear_dep_cpu() at appropriate timings.
>
> I don't want to sound like I'm trying to put all the work on you :p It's just that
> the tick shouldn't know much about RCU, it's rather RCU that is a client for the tick and
> is probably better suited to determine when a CPU becomes annoying with its extended grace
> period.
>
> Arming a CPU timer could also be an alternative to tick_set_dep_cpu() for that.
>
> What do you think?

Left to itself, RCU would take action only when a given nohz_full
in-kernel CPU was delaying a grace period, which is what the (lightly
tested) patch below is supposed to help with. If that is all that is
needed, well and good!

But should we need long-running in-kernel nohz_full CPUs to turn on
their ticks when they are not blocking an RCU grace period, for example,
when RCU is idle, more will be needed. To that point, isn't there some
sort of monitoring that checks up on nohz_full CPUs every second or so?
If so, perhaps that monitoring could periodically invoke an RCU function
that I provide for deciding when to turn the tick on. We would also need
to work out how to turn the tick off in a timely fashion once the CPU got
out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common().

If this would be called only every second or so, the separate grace-period
checking is still needed for its shorter timespan, though.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit 1cb89508804f6f2fdb79a1be032b1932d52318c4
Author: Paul E. McKenney <[email protected]>
Date: Mon Aug 12 16:14:00 2019 -0700

rcu: Force tick on for nohz_full CPUs not reaching quiescent states

CPUs running for long time periods in the kernel in nohz_full mode
might leave the scheduling-clock interrupt disabled for the full
duration of their in-kernel execution. This can (among other things)
delay grace periods. This commit therefore forces the tick back on
for any nohz_full CPU that is failing to pass through a quiescent state
upon return from interrupt, which the resched_cpu() will induce.

Reported-by: Joel Fernandes <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8c494a692728..8b8f5bffdc5a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
*/
if (rdp->dynticks_nmi_nesting != 1) {
trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
+ if (tick_nohz_full_cpu(rdp->cpu) &&
+ rdp->dynticks_nmi_nesting == 2 &&
+ rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
+ rdp->rcu_forced_tick = true;
+ tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
+ }
WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
rdp->dynticks_nmi_nesting - 2);
return;
@@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void)
local_irq_restore(flags);
}

+/*
+ * If the scheduler-clock interrupt was enabled on a nohz_full CPU
+ * in order to get to a quiescent state, disable it.
+ */
+void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
+{
+ if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
+ tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
+}
+
/**
* rcu_is_watching - see if RCU thinks that the current CPU is not idle
*
@@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
if (!offloaded)
needwake = rcu_accelerate_cbs(rnp, rdp);

+ rcu_disable_tick_upon_qs(rdp);
rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
/* ^^^ Released rnp->lock */
if (needwake)
@@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
int cpu;
unsigned long flags;
unsigned long mask;
+ struct rcu_data *rdp;
struct rcu_node *rnp;

rcu_for_each_leaf_node(rnp) {
@@ -2293,8 +2311,11 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
for_each_leaf_node_possible_cpu(rnp, cpu) {
unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
if ((rnp->qsmask & bit) != 0) {
- if (f(per_cpu_ptr(&rcu_data, cpu)))
+ rdp = per_cpu_ptr(&rcu_data, cpu);
+ if (f(rdp)) {
mask |= bit;
+ rcu_disable_tick_upon_qs(rdp);
+ }
}
}
if (mask != 0) {
@@ -2322,7 +2343,7 @@ void rcu_force_quiescent_state(void)
rnp = __this_cpu_read(rcu_data.mynode);
for (; rnp != NULL; rnp = rnp->parent) {
ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
- !raw_spin_trylock(&rnp->fqslock);
+ !raw_spin_trylock(&rnp->fqslock);
if (rnp_old != NULL)
raw_spin_unlock(&rnp_old->fqslock);
if (ret)
@@ -2855,7 +2876,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
{
if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
rcu_barrier_trace(TPS("LastCB"), -1,
- rcu_state.barrier_sequence);
+ rcu_state.barrier_sequence);
complete(&rcu_state.barrier_completion);
} else {
rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
@@ -2879,7 +2900,7 @@ static void rcu_barrier_func(void *unused)
} else {
debug_rcu_head_unqueue(&rdp->barrier_head);
rcu_barrier_trace(TPS("IRQNQ"), -1,
- rcu_state.barrier_sequence);
+ rcu_state.barrier_sequence);
}
rcu_nocb_unlock(rdp);
}
@@ -2906,7 +2927,7 @@ void rcu_barrier(void)
/* Did someone else do our work for us? */
if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
rcu_barrier_trace(TPS("EarlyExit"), -1,
- rcu_state.barrier_sequence);
+ rcu_state.barrier_sequence);
smp_mb(); /* caller's subsequent code after above check. */
mutex_unlock(&rcu_state.barrier_mutex);
return;
@@ -2938,11 +2959,11 @@ void rcu_barrier(void)
continue;
if (rcu_segcblist_n_cbs(&rdp->cblist)) {
rcu_barrier_trace(TPS("OnlineQ"), cpu,
- rcu_state.barrier_sequence);
+ rcu_state.barrier_sequence);
smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
} else {
rcu_barrier_trace(TPS("OnlineNQ"), cpu,
- rcu_state.barrier_sequence);
+ rcu_state.barrier_sequence);
}
}
put_online_cpus();
@@ -3168,6 +3189,7 @@ void rcu_cpu_starting(unsigned int cpu)
rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
+ rcu_disable_tick_upon_qs(rdp);
/* Report QS -after- changing ->qsmaskinitnext! */
rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
} else {
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index c612f306fe89..055c31781d3a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -181,6 +181,7 @@ struct rcu_data {
atomic_t dynticks; /* Even value for idle, else odd. */
bool rcu_need_heavy_qs; /* GP old, so heavy quiescent state! */
bool rcu_urgent_qs; /* GP old need light quiescent state. */
+ bool rcu_forced_tick; /* Forced tick to provide QS. */
#ifdef CONFIG_RCU_FAST_NO_HZ
bool all_lazy; /* All CPU's CBs lazy at idle start? */
unsigned long last_accelerate; /* Last jiffy CBs were accelerated. */

2019-08-13 21:09:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Mon, Aug 12, 2019 at 04:23:16PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 12, 2019 at 11:02:33PM +0200, Frederic Weisbecker wrote:
> > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > The multi_cpu_stop() function relies on the scheduler to gain control from
> > > whatever is running on the various online CPUs, including any nohz_full
> > > CPUs running long loops in kernel-mode code. Lack of the scheduler-clock
> > > interrupt on such CPUs can delay multi_cpu_stop() for several minutes
> > > and can also result in RCU CPU stall warnings. This commit therefore
> > > causes multi_cpu_stop() to enable the scheduler-clock interrupt on all
> > > online CPUs.
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> > > kernel/stop_machine.c | 9 ++++++++-
> > > 1 file changed, 8 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > > index b4f83f7bdf86..a2659f61ed92 100644
> > > --- a/kernel/stop_machine.c
> > > +++ b/kernel/stop_machine.c
> > > @@ -20,6 +20,7 @@
> > > #include <linux/smpboot.h>
> > > #include <linux/atomic.h>
> > > #include <linux/nmi.h>
> > > +#include <linux/tick.h>
> > > #include <linux/sched/wake_q.h>
> > >
> > > /*
> > > @@ -187,15 +188,19 @@ static int multi_cpu_stop(void *data)
> > > {
> > > struct multi_stop_data *msdata = data;
> > > enum multi_stop_state curstate = MULTI_STOP_NONE;
> > > - int cpu = smp_processor_id(), err = 0;
> > > + int cpu, err = 0;
> > > const struct cpumask *cpumask;
> > > unsigned long flags;
> > > bool is_active;
> > >
> > > + for_each_online_cpu(cpu)
> > > + tick_nohz_dep_set_cpu(cpu, TICK_DEP_MASK_RCU);
> >
> > Looks like it's not the right fix but, should you ever need to set an
> > all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep().

Except that I am not finding anything resembling tick_set_dep() in current
mainline. What should I be looking for instead?

Thanx, Paul

2019-08-14 17:58:08

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Tue, Aug 13, 2019 at 07:48:09AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 13, 2019 at 02:30:19PM +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 12, 2019 at 04:23:16PM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 12, 2019 at 11:02:33PM +0200, Frederic Weisbecker wrote:
> > > > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > > Looks like it's not the right fix but, should you ever need to set an
> > > > all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep().
> > >
> > > Indeed, I have dropped this patch, but I now do something similar in
> > > RCU's CPU-hotplug notifiers. Which does have an effect, especially on
> > > the system that isn't subject to the insane-latency cpu_relax().
> > >
> > > Plus I am having to put a similar workaround into RCU's quiescent-state
> > > forcing logic.
> > >
> > > But how should this really be done?
> > >
> > > Isn't there some sort of monitoring of nohz_full CPUs for accounting
> > > purposes? If so, would it make sense for this monitoring to check for
> > > long-duration kernel execution and enable the tick in this case? The
> > > RCU dyntick machinery can be used to remotely detect the long-duration
> > > kernel execution using something like the following:
> > >
> > > int nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
> > >
> > > ...
> > >
> > > if (rcu_dynticks_in_eqs_cpu(cpu, nohz_in_kernel_snap)
> > > nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
> > > else
> > > /* Turn on the tick! */
> > >
> > > I would supply rcu_dynticks_snap_cpu() and rcu_dynticks_in_eqs_cpu(),
> > > which would be simple wrappers around RCU's private rcu_dynticks_snap()
> > > and rcu_dynticks_in_eqs() functions.
> > >
> > > Would this make sense as a general solution, or am I missing a corner
> > > case or three?
> >
> > Oh I see. Until now we considered than running into the kernel (between user/guest/idle)
> > is supposed to be short but there can be specific places where it doesn't apply.
> >
> > I'm wondering if, more than just providing wrappers, this shouldn't be entirely
> > driven by RCU using the tick_set_dep_cpu()/tick_clear_dep_cpu() at appropriate timings.
> >
> > I don't want to sound like I'm trying to put all the work on you :p It's just that
> > the tick shouldn't know much about RCU, it's rather RCU that is a client for the tick and
> > is probably better suited to determine when a CPU becomes annoying with its extended grace
> > period.
> >
> > Arming a CPU timer could also be an alternative to tick_set_dep_cpu() for that.
> >
> > What do you think?
>
> Left to itself, RCU would take action only when a given nohz_full
> in-kernel CPU was delaying a grace period, which is what the (lightly
> tested) patch below is supposed to help with. If that is all that is
> needed, well and good!
>
> But should we need long-running in-kernel nohz_full CPUs to turn on
> their ticks when they are not blocking an RCU grace period, for example,
> when RCU is idle, more will be needed. To that point, isn't there some
> sort of monitoring that checks up on nohz_full CPUs ever second or so?

Wouldn't such monitoring need to run more often than once a second, given that
rcu_urgent_qs and rcu_need_heavy_qs are typically configured to trigger sooner
than that (200-300 jiffies on my system)?
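
(For scale: 200-300 jiffies is 200-300 ms at HZ=1000 and roughly 0.8-1.2 s at
HZ=250.)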

> If so, perhaps that monitoring could periodically invoke an RCU function
> that I provide for deciding when to turn the tick on. We would also need
> to work out how to turn the tick off in a timely fashion once the CPU got
> out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common().
>
> If this would be called only every second or so, the separate grace-period
> checking is still needed for its shorter timespan, though.
>
> Thoughts?

Do you want me to test the below patch to see if it fixes the issue with my
other test case (where I had a nohz full CPU holding up a grace period).

2 comments below:

> ------------------------------------------------------------------------
>
> commit 1cb89508804f6f2fdb79a1be032b1932d52318c4
> Author: Paul E. McKenney <[email protected]>
> Date: Mon Aug 12 16:14:00 2019 -0700
>
> rcu: Force tick on for nohz_full CPUs not reaching quiescent states
>
> CPUs running for long time periods in the kernel in nohz_full mode
> might leave the scheduling-clock interrupt disabled for then full
> duration of their in-kernel execution. This can (among other things)
> delay grace periods. This commit therefore forces the tick back on
> for any nohz_full CPU that is failing to pass through a quiescent state
> upon return from interrupt, which the resched_cpu() will induce.
>
> Reported-by: Joel Fernandes <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8c494a692728..8b8f5bffdc5a 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> */
> if (rdp->dynticks_nmi_nesting != 1) {
> trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> + if (tick_nohz_full_cpu(rdp->cpu) &&
> + rdp->dynticks_nmi_nesting == 2 &&
> + rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> + rdp->rcu_forced_tick = true;
> + tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> + }
> WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
> rdp->dynticks_nmi_nesting - 2);
> return;
> @@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void)
> local_irq_restore(flags);
> }
>
> +/*
> + * If the scheduler-clock interrupt was enabled on a nohz_full CPU
> + * in order to get to a quiescent state, disable it.
> + */
> +void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
> +{
> + if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
> + tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> +}
> +
> /**
> * rcu_is_watching - see if RCU thinks that the current CPU is not idle
> *
> @@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
> if (!offloaded)
> needwake = rcu_accelerate_cbs(rnp, rdp);
>
> + rcu_disable_tick_upon_qs(rdp);
> rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> /* ^^^ Released rnp->lock */
> if (needwake)
> @@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> int cpu;
> unsigned long flags;
> unsigned long mask;
> + struct rcu_data *rdp;
> struct rcu_node *rnp;
>
> rcu_for_each_leaf_node(rnp) {
> @@ -2293,8 +2311,11 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> for_each_leaf_node_possible_cpu(rnp, cpu) {
> unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
> if ((rnp->qsmask & bit) != 0) {
> - if (f(per_cpu_ptr(&rcu_data, cpu)))
> + rdp = per_cpu_ptr(&rcu_data, cpu);
> + if (f(rdp)) {
> mask |= bit;
> + rcu_disable_tick_upon_qs(rdp);
> + }

I am guessing this was the earlier thing you corrected, cool!!

> }
> }
> if (mask != 0) {
> @@ -2322,7 +2343,7 @@ void rcu_force_quiescent_state(void)
> rnp = __this_cpu_read(rcu_data.mynode);
> for (; rnp != NULL; rnp = rnp->parent) {
> ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
> - !raw_spin_trylock(&rnp->fqslock);
> + !raw_spin_trylock(&rnp->fqslock);
> if (rnp_old != NULL)
> raw_spin_unlock(&rnp_old->fqslock);
> if (ret)
> @@ -2855,7 +2876,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
> {
> if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
> rcu_barrier_trace(TPS("LastCB"), -1,
> - rcu_state.barrier_sequence);
> + rcu_state.barrier_sequence);
> complete(&rcu_state.barrier_completion);
> } else {
> rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
> @@ -2879,7 +2900,7 @@ static void rcu_barrier_func(void *unused)
> } else {
> debug_rcu_head_unqueue(&rdp->barrier_head);
> rcu_barrier_trace(TPS("IRQNQ"), -1,
> - rcu_state.barrier_sequence);
> + rcu_state.barrier_sequence);
> }
> rcu_nocb_unlock(rdp);
> }
> @@ -2906,7 +2927,7 @@ void rcu_barrier(void)
> /* Did someone else do our work for us? */
> if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
> rcu_barrier_trace(TPS("EarlyExit"), -1,
> - rcu_state.barrier_sequence);
> + rcu_state.barrier_sequence);
> smp_mb(); /* caller's subsequent code after above check. */
> mutex_unlock(&rcu_state.barrier_mutex);
> return;
> @@ -2938,11 +2959,11 @@ void rcu_barrier(void)
> continue;
> if (rcu_segcblist_n_cbs(&rdp->cblist)) {
> rcu_barrier_trace(TPS("OnlineQ"), cpu,
> - rcu_state.barrier_sequence);
> + rcu_state.barrier_sequence);
> smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
> } else {
> rcu_barrier_trace(TPS("OnlineNQ"), cpu,
> - rcu_state.barrier_sequence);
> + rcu_state.barrier_sequence);
> }
> }
> put_online_cpus();
> @@ -3168,6 +3189,7 @@ void rcu_cpu_starting(unsigned int cpu)
> rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
> rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
> if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> + rcu_disable_tick_upon_qs(rdp);
> /* Report QS -after- changing ->qsmaskinitnext! */
> rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);

Just curious about the existing code. If a CPU is just starting up (after
bringing it online), how can RCU be waiting on it? I thought RCU would not be
watching offline CPUs.

thanks,

- Joel
[snip]

2019-08-14 22:27:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Wed, Aug 14, 2019 at 01:55:46PM -0400, Joel Fernandes wrote:
> On Tue, Aug 13, 2019 at 07:48:09AM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 13, 2019 at 02:30:19PM +0200, Frederic Weisbecker wrote:
> > > On Mon, Aug 12, 2019 at 04:23:16PM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 12, 2019 at 11:02:33PM +0200, Frederic Weisbecker wrote:
> > > > > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote:
> > > > > Looks like it's not the right fix but, should you ever need to set an
> > > > > all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep().
> > > >
> > > > Indeed, I have dropped this patch, but I now do something similar in
> > > > RCU's CPU-hotplug notifiers. Which does have an effect, especially on
> > > > the system that isn't subject to the insane-latency cpu_relax().
> > > >
> > > > Plus I am having to put a similar workaround into RCU's quiescent-state
> > > > forcing logic.
> > > >
> > > > But how should this really be done?
> > > >
> > > > Isn't there some sort of monitoring of nohz_full CPUs for accounting
> > > > purposes? If so, would it make sense for this monitoring to check for
> > > > long-duration kernel execution and enable the tick in this case? The
> > > > RCU dyntick machinery can be used to remotely detect the long-duration
> > > > kernel execution using something like the following:
> > > >
> > > > int nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
> > > >
> > > > ...
> > > >
> > > > if (rcu_dynticks_in_eqs_cpu(cpu, nohz_in_kernel_snap))
> > > > nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu);
> > > > else
> > > > /* Turn on the tick! */
> > > >
> > > > I would supply rcu_dynticks_snap_cpu() and rcu_dynticks_in_eqs_cpu(),
> > > > which would be simple wrappers around RCU's private rcu_dynticks_snap()
> > > > and rcu_dynticks_in_eqs() functions.
> > > >
> > > > Would this make sense as a general solution, or am I missing a corner
> > > > case or three?
> > >
> > > Oh I see. Until now we considered that running into the kernel (between user/guest/idle)
> > > is supposed to be short but there can be specific places where it doesn't apply.
> > >
> > > I'm wondering if, more than just providing wrappers, this shouldn't be entirely
> > > driven by RCU using the tick_set_dep_cpu()/tick_clear_dep_cpu() at appropriate timings.
> > >
> > > I don't want to sound like I'm trying to put all the work on you :p It's just that
> > > the tick shouldn't know much about RCU, it's rather RCU that is a client for the tick and
> > > is probably better suited to determine when a CPU becomes annoying with its extended grace
> > > period.
> > >
> > > Arming a CPU timer could also be an alternative to tick_set_dep_cpu() for that.
> > >
> > > What do you think?
> >
> > Left to itself, RCU would take action only when a given nohz_full
> > in-kernel CPU was delaying a grace period, which is what the (lightly
> > tested) patch below is supposed to help with. If that is all that is
> > needed, well and good!
> >
> > But should we need long-running in-kernel nohz_full CPUs to turn on
> > their ticks when they are not blocking an RCU grace period, for example,
> > when RCU is idle, more will be needed. To that point, isn't there some
> > sort of monitoring that checks up on nohz_full CPUs every second or so?
>
> Wouldn't such monitoring need to be more often than a second, given that
> rcu_urgent_qs and rcu_need_heavy_qs are configured typically to be sooner
> (200-300 jiffies on my system).

Either it would have to be more often than once per second, or RCU would
need to retain its more frequent checks. But note that RCU isn't going
to check unless there is a grace period in progress.
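
For concreteness, a minimal sketch of the per-CPU check proposed above.
This is an illustration only: rcu_dynticks_snap_cpu() and
rcu_dynticks_in_eqs_cpu() are the hypothetical wrappers mentioned earlier
in this thread and do not exist in the kernel, and the tick_dep_set_cpu()
usage simply mirrors the posted patch.

/*
 * Sketch only.  *snap holds the dynticks snapshot taken the last time
 * this nohz_full CPU was examined by the (hypothetical) monitor.
 */
static void nohz_full_check_cpu(int cpu, int *snap)
{
	if (rcu_dynticks_in_eqs_cpu(cpu, *snap)) {
		/* Not stuck in the kernel since the last check: re-snapshot. */
		*snap = rcu_dynticks_snap_cpu(cpu);
	} else {
		/* Long-running in-kernel execution: force the tick back on. */
		tick_dep_set_cpu(cpu, TICK_DEP_MASK_RCU);
	}
}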

> > If so, perhaps that monitoring could periodically invoke an RCU function
> > that I provide for deciding when to turn the tick on. We would also need
> > to work out how to turn the tick off in a timely fashion once the CPU got
> > out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common().
> >
> > If this would be called only every second or so, the separate grace-period
> > checking is still needed for its shorter timespan, though.
> >
> > Thoughts?
>
> Do you want me to test the below patch to see if it fixes the issue with my
> other test case (where I had a nohz full CPU holding up a grace period).

Please!

> 2 comments below:
>
> > ------------------------------------------------------------------------
> >
> > commit 1cb89508804f6f2fdb79a1be032b1932d52318c4
> > Author: Paul E. McKenney <[email protected]>
> > Date: Mon Aug 12 16:14:00 2019 -0700
> >
> > rcu: Force tick on for nohz_full CPUs not reaching quiescent states
> >
> > CPUs running for long time periods in the kernel in nohz_full mode
> > might leave the scheduling-clock interrupt disabled for the full
> > duration of their in-kernel execution. This can (among other things)
> > delay grace periods. This commit therefore forces the tick back on
> > for any nohz_full CPU that is failing to pass through a quiescent state
> > upon return from interrupt, which the resched_cpu() will induce.
> >
> > Reported-by: Joel Fernandes <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8c494a692728..8b8f5bffdc5a 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > */
> > if (rdp->dynticks_nmi_nesting != 1) {
> > trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > + if (tick_nohz_full_cpu(rdp->cpu) &&
> > + rdp->dynticks_nmi_nesting == 2 &&
> > + rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > + rdp->rcu_forced_tick = true;
> > + tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > + }
> > WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
> > rdp->dynticks_nmi_nesting - 2);
> > return;
> > @@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void)
> > local_irq_restore(flags);
> > }
> >
> > +/*
> > + * If the scheduler-clock interrupt was enabled on a nohz_full CPU
> > + * in order to get to a quiescent state, disable it.
> > + */
> > +void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
> > +{
> > + if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
> > + tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > +}
> > +
> > /**
> > * rcu_is_watching - see if RCU thinks that the current CPU is not idle
> > *
> > @@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
> > if (!offloaded)
> > needwake = rcu_accelerate_cbs(rnp, rdp);
> >
> > + rcu_disable_tick_upon_qs(rdp);
> > rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> > /* ^^^ Released rnp->lock */
> > if (needwake)
> > @@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> > int cpu;
> > unsigned long flags;
> > unsigned long mask;
> > + struct rcu_data *rdp;
> > struct rcu_node *rnp;
> >
> > rcu_for_each_leaf_node(rnp) {
> > @@ -2293,8 +2311,11 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> > for_each_leaf_node_possible_cpu(rnp, cpu) {
> > unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
> > if ((rnp->qsmask & bit) != 0) {
> > - if (f(per_cpu_ptr(&rcu_data, cpu)))
> > + rdp = per_cpu_ptr(&rcu_data, cpu);
> > + if (f(rdp)) {
> > mask |= bit;
> > + rcu_disable_tick_upon_qs(rdp);
> > + }
>
> I am guessing this was the earlier thing you corrected, cool!!

Like I said, stupid error on my part. The usual kind. ;-)

> > }
> > }
> > if (mask != 0) {
> > @@ -2322,7 +2343,7 @@ void rcu_force_quiescent_state(void)
> > rnp = __this_cpu_read(rcu_data.mynode);
> > for (; rnp != NULL; rnp = rnp->parent) {
> > ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
> > - !raw_spin_trylock(&rnp->fqslock);
> > + !raw_spin_trylock(&rnp->fqslock);
> > if (rnp_old != NULL)
> > raw_spin_unlock(&rnp_old->fqslock);
> > if (ret)
> > @@ -2855,7 +2876,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
> > {
> > if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
> > rcu_barrier_trace(TPS("LastCB"), -1,
> > - rcu_state.barrier_sequence);
> > + rcu_state.barrier_sequence);
> > complete(&rcu_state.barrier_completion);
> > } else {
> > rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
> > @@ -2879,7 +2900,7 @@ static void rcu_barrier_func(void *unused)
> > } else {
> > debug_rcu_head_unqueue(&rdp->barrier_head);
> > rcu_barrier_trace(TPS("IRQNQ"), -1,
> > - rcu_state.barrier_sequence);
> > + rcu_state.barrier_sequence);
> > }
> > rcu_nocb_unlock(rdp);
> > }
> > @@ -2906,7 +2927,7 @@ void rcu_barrier(void)
> > /* Did someone else do our work for us? */
> > if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
> > rcu_barrier_trace(TPS("EarlyExit"), -1,
> > - rcu_state.barrier_sequence);
> > + rcu_state.barrier_sequence);
> > smp_mb(); /* caller's subsequent code after above check. */
> > mutex_unlock(&rcu_state.barrier_mutex);
> > return;
> > @@ -2938,11 +2959,11 @@ void rcu_barrier(void)
> > continue;
> > if (rcu_segcblist_n_cbs(&rdp->cblist)) {
> > rcu_barrier_trace(TPS("OnlineQ"), cpu,
> > - rcu_state.barrier_sequence);
> > + rcu_state.barrier_sequence);
> > smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
> > } else {
> > rcu_barrier_trace(TPS("OnlineNQ"), cpu,
> > - rcu_state.barrier_sequence);
> > + rcu_state.barrier_sequence);
> > }
> > }
> > put_online_cpus();
> > @@ -3168,6 +3189,7 @@ void rcu_cpu_starting(unsigned int cpu)
> > rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
> > rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
> > if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> > + rcu_disable_tick_upon_qs(rdp);
> > /* Report QS -after- changing ->qsmaskinitnext! */
> > rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
>
> Just curious about the existing code. If a CPU is just starting up (after
> bringing it online), how can RCU be waiting on it? I thought RCU would not be
> watching offline CPUs.

Well, neither grace periods nor CPU-hotplug operations are atomic,
and each can take significant time to complete.

So suppose we have a large system with multiple leaf rcu_node structures
(not that 17 CPUs is all that many these days, but please bear with me).
Suppose just after a new grace period initializes a given leaf rcu_node
structure, one of its CPUs goes offline (yes, that CPU would have to
have waited on a grace period, but that might have been the previous
grace period). But before the FQS scan notices that RCU is waiting on
an offline CPU, the CPU comes back online.

That situation is exactly what the above code is intended to handle.

Without that code, RCU can give false-positive splats at various points
in its processing. ("Wait! How can a task be blocked waiting on a
grace period that hasn't even started yet???")

Thanx, Paul

2019-08-15 17:15:26

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Wed, Aug 14, 2019 at 03:05:16PM -0700, Paul E. McKenney wrote:
[snip]
> > > > Arming a CPU timer could also be an alternative to tick_set_dep_cpu() for that.
> > > >
> > > > What do you think?
> > >
> > > Left to itself, RCU would take action only when a given nohz_full
> > > in-kernel CPU was delaying a grace period, which is what the (lightly
> > > tested) patch below is supposed to help with. If that is all that is
> > > needed, well and good!
> > >
> > > But should we need long-running in-kernel nohz_full CPUs to turn on
> > > their ticks when they are not blocking an RCU grace period, for example,
> > > when RCU is idle, more will be needed. To that point, isn't there some
> > > sort of monitoring that checks up on nohz_full CPUs every second or so?
> >
> > Wouldn't such monitoring need to be more often than a second, given that
> > rcu_urgent_qs and rcu_need_heavy_qs are configured typically to be sooner
> > (200-300 jiffies on my system).
>
> Either it would have to be more often than once per second, or RCU would
> need to retain its more frequent checks. But note that RCU isn't going
> to check unless there is a grace period in progress.

Sure.

> > > If so, perhaps that monitoring could periodically invoke an RCU function
> > > that I provide for deciding when to turn the tick on. We would also need
> > > to work out how to turn the tick off in a timely fashion once the CPU got
> > > out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common().
> > >
> > > If this would be called only every second or so, the separate grace-period
> > > checking is still needed for its shorter timespan, though.
> > >
> > > Thoughts?
> >
> > Do you want me to test the below patch to see if it fixes the issue with my
> > other test case (where I had a nohz full CPU holding up a grace period).
>
> Please!

I tried the patch below, but it did not seem to make a difference to the
issue I was seeing. My test tree is here in case you can spot anything I did
not do right: https://github.com/joelagnel/linux-kernel/commits/rcu/nohz-test
The main patch is here:
https://github.com/joelagnel/linux-kernel/commit/4dc282b559d918a0be826936f997db0bdad7abb3

On the trace output, I grep something like: egrep "(rcu_perf|cpu 3|3d)". I
see a few ticks after 300ms, but then there are no more ticks and just a
periodic resched_cpu() from rcu_implicit_dynticks_qs():

[ 19.534107] rcu_perf-165 12.... 2276436us : rcu_perf_writer: Start of rcuperf test
[ 19.557968] rcu_pree-10 0d..1 2287973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 20.136222] rcu_perf-165 3d.h. 2591894us : rcu_sched_clock_irq: sched-tick
[ 20.137185] rcu_perf-165 3d.h2 2591906us : rcu_sched_clock_irq: sched-tick
[ 20.138149] rcu_perf-165 3d.h. 2591911us : rcu_sched_clock_irq: sched-tick
[ 20.139106] rcu_perf-165 3d.h. 2591915us : rcu_sched_clock_irq: sched-tick
[ 20.140077] rcu_perf-165 3d.h. 2591919us : rcu_sched_clock_irq: sched-tick
[ 20.141041] rcu_perf-165 3d.h. 2591924us : rcu_sched_clock_irq: sched-tick
[ 20.142001] rcu_perf-165 3d.h. 2591928us : rcu_sched_clock_irq: sched-tick
[ 20.142961] rcu_perf-165 3d.h. 2591932us : rcu_sched_clock_irq: sched-tick
[ 20.143925] rcu_perf-165 3d.h. 2591936us : rcu_sched_clock_irq: sched-tick
[ 20.144885] rcu_perf-165 3d.h. 2591940us : rcu_sched_clock_irq: sched-tick
[ 20.145876] rcu_perf-165 3d.h. 2591945us : rcu_sched_clock_irq: sched-tick
[ 20.146835] rcu_perf-165 3d.h. 2591949us : rcu_sched_clock_irq: sched-tick
[ 20.147797] rcu_perf-165 3d.h. 2591953us : rcu_sched_clock_irq: sched-tick
[ 20.148759] rcu_perf-165 3d.h. 2591957us : rcu_sched_clock_irq: sched-tick
[ 20.151655] rcu_pree-10 0d..1 2591979us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 20.732938] rcu_pree-10 0d..1 2895960us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 21.318104] rcu_pree-10 0d..1 3199975us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 21.899908] rcu_pree-10 0d..1 3503964us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 22.481316] rcu_pree-10 0d..1 3807990us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 23.065623] rcu_pree-10 0d..1 4111990us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 23.650875] rcu_pree-10 0d..1 4415989us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 24.233999] rcu_pree-10 0d..1 4719978us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 24.818397] rcu_pree-10 0d..1 5023982us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 25.402633] rcu_pree-10 0d..1 5327981us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 25.984104] rcu_pree-10 0d..1 5631976us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 26.566100] rcu_pree-10 0d..1 5935982us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 27.144497] rcu_pree-10 0d..1 6239973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 27.192661] rcu_perf-165 3d.h. 6276923us : rcu_sched_clock_irq: sched-tick
[ 27.705789] rcu_pree-10 0d..1 6541901us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 28.292155] rcu_pree-10 0d..1 6845974us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 28.874049] rcu_pree-10 0d..1 7149972us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[ 29.112646] rcu_perf-165 3.... 7275951us : rcu_perf_writer: End of rcuperf test

[snip]
> > > @@ -2906,7 +2927,7 @@ void rcu_barrier(void)
> > > /* Did someone else do our work for us? */
> > > if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
> > > rcu_barrier_trace(TPS("EarlyExit"), -1,
> > > - rcu_state.barrier_sequence);
> > > + rcu_state.barrier_sequence);
> > > smp_mb(); /* caller's subsequent code after above check. */
> > > mutex_unlock(&rcu_state.barrier_mutex);
> > > return;
> > > @@ -2938,11 +2959,11 @@ void rcu_barrier(void)
> > > continue;
> > > if (rcu_segcblist_n_cbs(&rdp->cblist)) {
> > > rcu_barrier_trace(TPS("OnlineQ"), cpu,
> > > - rcu_state.barrier_sequence);
> > > + rcu_state.barrier_sequence);
> > > smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
> > > } else {
> > > rcu_barrier_trace(TPS("OnlineNQ"), cpu,
> > > - rcu_state.barrier_sequence);
> > > + rcu_state.barrier_sequence);
> > > }
> > > }
> > > put_online_cpus();
> > > @@ -3168,6 +3189,7 @@ void rcu_cpu_starting(unsigned int cpu)
> > > rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
> > > rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
> > > if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> > > + rcu_disable_tick_upon_qs(rdp);
> > > /* Report QS -after- changing ->qsmaskinitnext! */
> > > rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> >
> > Just curious about the existing code. If a CPU is just starting up (after
> > bringing it online), how can RCU be waiting on it? I thought RCU would not be
> > watching offline CPUs.
>
> Well, neither grace periods nor CPU-hotplug operations are atomic,
> and each can take significant time to complete.
>
> So suppose we have a large system with multiple leaf rcu_node structures
> (not that 17 CPUs is all that many these days, but please bear with me).
> Suppose just after a new grace period initializes a given leaf rcu_node
> structure, one of its CPUs goes offline (yes, that CPU would have to
> have waited on a grace period, but that might have been the previous
> grace period). But before the FQS scan notices that RCU is waiting on
> an offline CPU, the CPU comes back online.
>
> That situation is exactly what the above code is intended to handle.

That makes sense!

> Without that code, RCU can give false-positive splats at various points
> in its processing. ("Wait! How can a task be blocked waiting on a
> grace period that hasn't even started yet???")

I did not fully understand the question in brackets, though. A task can be on
a different CPU which has nothing to do with the CPU that's going
offline/online, so it could totally be waiting on a grace period, right?

Also waiting on a grace period that hasn't even started is totally possible:

       GP1        GP2
  |<--------->|<-------->|
  ^                      ^
  |                      |____ task gets unblocked
task blocks
on synchronize_rcu
but is waiting on
GP2 which hasn't started

Or did I misunderstand the question?

thanks!

- Joel

2019-08-15 18:31:22

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Thu, Aug 15, 2019 at 10:23:51AM -0700, Paul E. McKenney wrote:
> On Thu, Aug 15, 2019 at 11:07:35AM -0400, Joel Fernandes wrote:
> > On Wed, Aug 14, 2019 at 03:05:16PM -0700, Paul E. McKenney wrote:
> > [snip]
> > > > > If so, perhaps that monitoring could periodically invoke an RCU function
> > > > > that I provide for deciding when to turn the tick on. We would also need
> > > > > to work out how to turn the tick off in a timely fashion once the CPU got
> > > > > out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common().
> > > > >
> > > > > If this would be called only every second or so, the separate grace-period
> > > > > checking is still needed for its shorter timespan, though.
> > > > >
> > > > > Thoughts?
> > > >
> > > > Do you want me to test the below patch to see if it fixes the issue with my
> > > > other test case (where I had a nohz full CPU holding up a grace period).
> > >
> > > Please!
> >
> > I tried the patch below, but it did not seem to make a difference to the
> > issue I was seeing. My test tree is here in case you can spot anything I did
> > not do right: https://github.com/joelagnel/linux-kernel/commits/rcu/nohz-test
> > The main patch is here:
> > https://github.com/joelagnel/linux-kernel/commit/4dc282b559d918a0be826936f997db0bdad7abb3
>
> That is more aggressive than rcutorture's rcu_torture_fwd_prog_nr(), so
> I am guessing that I need to up rcu_torture_fwd_prog_nr()'s game. I am
> currently testing that.
>
> > On the trace output, I grep something like: egrep "(rcu_perf|cpu 3|3d)". I
> > see a few ticks after 300ms, but then there are no more ticks and just a
> > periodic resched_cpu() from rcu_implicit_dynticks_qs():
> >
> > [ 19.534107] rcu_perf-165 12.... 2276436us : rcu_perf_writer: Start of rcuperf test
> > [ 19.557968] rcu_pree-10 0d..1 2287973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > [ 20.136222] rcu_perf-165 3d.h. 2591894us : rcu_sched_clock_irq: sched-tick
> > [ 20.137185] rcu_perf-165 3d.h2 2591906us : rcu_sched_clock_irq: sched-tick
> > [ 20.138149] rcu_perf-165 3d.h. 2591911us : rcu_sched_clock_irq: sched-tick
> > [ 20.139106] rcu_perf-165 3d.h. 2591915us : rcu_sched_clock_irq: sched-tick
[snip]
> > [ 20.147797] rcu_perf-165 3d.h. 2591953us : rcu_sched_clock_irq: sched-tick
> > [ 20.148759] rcu_perf-165 3d.h. 2591957us : rcu_sched_clock_irq: sched-tick
> > [ 20.151655] rcu_pree-10 0d..1 2591979us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > [ 20.732938] rcu_pree-10 0d..1 2895960us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[snip]
> > [ 26.566100] rcu_pree-10 0d..1 5935982us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > [ 27.144497] rcu_pree-10 0d..1 6239973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > [ 27.192661] rcu_perf-165 3d.h. 6276923us : rcu_sched_clock_irq: sched-tick
> > [ 27.705789] rcu_pree-10 0d..1 6541901us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > [ 28.292155] rcu_pree-10 0d..1 6845974us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > [ 28.874049] rcu_pree-10 0d..1 7149972us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > [ 29.112646] rcu_perf-165 3.... 7275951us : rcu_perf_writer: End of rcuperf test
>
> That would be due to my own stupidity. I forgot to clear ->rcu_forced_tick
> in rcu_disable_tick_upon_qs() inside the "if" statement. This of course
> prevents rcu_nmi_exit_common() from ever re-enabling it.
>
> Excellent catch! Thank you for testing this!!!

Ah I missed it too. Happy to help! I tried setting it as below but getting
same results:

+/*
+ * If the scheduler-clock interrupt was enabled on a nohz_full CPU
+ * in order to get to a quiescent state, disable it.
+ */
+void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
+{
+ if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
+ tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
+ rdp->rcu_forced_tick = false;
+}
+

> > [snip]
> > > > > if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> > > > > + rcu_disable_tick_upon_qs(rdp);
> > > > > /* Report QS -after- changing ->qsmaskinitnext! */
> > > > > rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> > > >
> > > > Just curious about the existing code. If a CPU is just starting up (after
> > > > bringing it online), how can RCU be waiting on it? I thought RCU would not be
> > > > watching offline CPUs.
> > >
> > > Well, neither grace periods nor CPU-hotplug operations are atomic,
> > > and each can take significant time to complete.
> > >
> > > So suppose we have a large system with multiple leaf rcu_node structures
> > > (not that 17 CPUs is all that many these days, but please bear with me).
> > > Suppose just after a new grace period initializes a given leaf rcu_node
> > > structure, one of its CPUs goes offline (yes, that CPU would have to
> > > have waited on a grace period, but that might have been the previous
> > > grace period). But before the FQS scan notices that RCU is waiting on
> > > an offline CPU, the CPU comes back online.
> > >
> > > That situation is exactly what the above code is intended to handle.
> >
> > That makes sense!
> >
> > > Without that code, RCU can give false-positive splats at various points
> > > in its processing. ("Wait! How can a task be blocked waiting on a
> > > grace period that hasn't even started yet???")
> >
> > I did not fully understand the question in brackets, though. A task can be on
> > a different CPU which has nothing to do with the CPU that's going
> > offline/online, so it could totally be waiting on a grace period, right?
> >
> > Also waiting on a grace period that hasn't even started is totally possible:
> >
> >        GP1        GP2
> >   |<--------->|<-------->|
> >   ^                      ^
> >   |                      |____ task gets unblocked
> > task blocks
> > on synchronize_rcu
> > but is waiting on
> > GP2 which hasn't started
> >
> > Or did I misunderstand the question?
>
> There is a ->gp_tasks field in the leaf rcu_node structures that
> references a list of tasks blocking the current grace period. When there
> is no grace period in progress (as is the case from the end of GP1 to
> the beginning of GP2), the RCU code expects ->gp_tasks to be NULL.
> Without the curiosity code you pointed out above, ->gp_tasks could
> in fact end up being non-NULL when no grace period was in progress.
>
> And did end up being non-NULL from time to time, initially every few
> hundred hours of a particular rcutorture scenario.

Oh ok! I will think more about it. I am not yet able to connect the gp_tasks
being non-NULL to the CPU going offline/online scenario though. Maybe I
should delete this code, run an experiment and trace for this condition
(gp_tasks != NULL)?

I love how you found these issues by heavy testing and fixed them.

thanks,

- Joel

2019-08-15 18:44:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Thu, Aug 15, 2019 at 02:15:00PM -0400, Joel Fernandes wrote:
> On Thu, Aug 15, 2019 at 10:23:51AM -0700, Paul E. McKenney wrote:
> > On Thu, Aug 15, 2019 at 11:07:35AM -0400, Joel Fernandes wrote:
> > > On Wed, Aug 14, 2019 at 03:05:16PM -0700, Paul E. McKenney wrote:
> > > [snip]
> > > > > > If so, perhaps that monitoring could periodically invoke an RCU function
> > > > > > that I provide for deciding when to turn the tick on. We would also need
> > > > > > to work out how to turn the tick off in a timely fashion once the CPU got
> > > > > > out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common().
> > > > > >
> > > > > > If this would be called only every second or so, the separate grace-period
> > > > > > checking is still needed for its shorter timespan, though.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > Do you want me to test the below patch to see if it fixes the issue with my
> > > > > other test case (where I had a nohz full CPU holding up a grace period).
> > > >
> > > > Please!
> > >
> > > I tried the patch below, but it did not seem to make a difference to the
> > > issue I was seeing. My test tree is here in case you can spot anything I did
> > > not do right: https://github.com/joelagnel/linux-kernel/commits/rcu/nohz-test
> > > The main patch is here:
> > > https://github.com/joelagnel/linux-kernel/commit/4dc282b559d918a0be826936f997db0bdad7abb3
> >
> > That is more aggressive than rcutorture's rcu_torture_fwd_prog_nr(), so
> > I am guessing that I need to up rcu_torture_fwd_prog_nr()'s game. I am
> > currently testing that.
> >
> > > On the trace output, I grep something like: egrep "(rcu_perf|cpu 3|3d)". I
> > > see a few ticks after 300ms, but then there are no more ticks and just a
> > > periodic resched_cpu() from rcu_implicit_dynticks_qs():
> > >
> > > [ 19.534107] rcu_perf-165 12.... 2276436us : rcu_perf_writer: Start of rcuperf test
> > > [ 19.557968] rcu_pree-10 0d..1 2287973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > [ 20.136222] rcu_perf-165 3d.h. 2591894us : rcu_sched_clock_irq: sched-tick
> > > [ 20.137185] rcu_perf-165 3d.h2 2591906us : rcu_sched_clock_irq: sched-tick
> > > [ 20.138149] rcu_perf-165 3d.h. 2591911us : rcu_sched_clock_irq: sched-tick
> > > [ 20.139106] rcu_perf-165 3d.h. 2591915us : rcu_sched_clock_irq: sched-tick
> [snip]
> > > [ 20.147797] rcu_perf-165 3d.h. 2591953us : rcu_sched_clock_irq: sched-tick
> > > [ 20.148759] rcu_perf-165 3d.h. 2591957us : rcu_sched_clock_irq: sched-tick
> > > [ 20.151655] rcu_pree-10 0d..1 2591979us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > [ 20.732938] rcu_pree-10 0d..1 2895960us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [snip]
> > > [ 26.566100] rcu_pree-10 0d..1 5935982us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > [ 27.144497] rcu_pree-10 0d..1 6239973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > [ 27.192661] rcu_perf-165 3d.h. 6276923us : rcu_sched_clock_irq: sched-tick
> > > [ 27.705789] rcu_pree-10 0d..1 6541901us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > [ 28.292155] rcu_pree-10 0d..1 6845974us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > [ 28.874049] rcu_pree-10 0d..1 7149972us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > [ 29.112646] rcu_perf-165 3.... 7275951us : rcu_perf_writer: End of rcuperf test
> >
> > That would be due to my own stupidity. I forgot to clear ->rcu_forced_tick
> > in rcu_disable_tick_upon_qs() inside the "if" statement. This of course
> > prevents rcu_nmi_exit_common() from ever re-enabling it.
> >
> > Excellent catch! Thank you for testing this!!!
>
> Ah I missed it too. Happy to help! I tried setting it as below but getting
> same results:
>
> +/*
> + * If the scheduler-clock interrupt was enabled on a nohz_full CPU
> + * in order to get to a quiescent state, disable it.
> + */
> +void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
> +{
> + if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
> + tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> + rdp->rcu_forced_tick = false;

I put this inside the "if" statement, though I would not expect that to
change behavior in this case.
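
For reference, a minimal sketch of that variant, assuming it matches the
description here rather than the exact code under test:

/*
 * Clear ->rcu_forced_tick only when this CPU's tick dependency is
 * actually being removed, so that rcu_nmi_exit_common() can force the
 * tick on again for a later overly long grace period.
 */
void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
{
	if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick) {
		tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
		rdp->rcu_forced_tick = false;
	}
}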

Does your test case still avoid turning on the tick more than once? Or
is it turning on the tick each time the grace period gets too long, but
without the tick managing to end the grace periods?

> +}
> +
>
> > > [snip]
> > > > > > if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> > > > > > + rcu_disable_tick_upon_qs(rdp);
> > > > > > /* Report QS -after- changing ->qsmaskinitnext! */
> > > > > > rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> > > > >
> > > > > Just curious about the existing code. If a CPU is just starting up (after
> > > > > bringing it online), how can RCU be waiting on it? I thought RCU would not be
> > > > > watching offline CPUs.
> > > >
> > > > Well, neither grace periods nor CPU-hotplug operations are atomic,
> > > > and each can take significant time to complete.
> > > >
> > > > So suppose we have a large system with multiple leaf rcu_node structures
> > > > (not that 17 CPUs is all that many these days, but please bear with me).
> > > > Suppose just after a new grace period initializes a given leaf rcu_node
> > > > structure, one of its CPUs goes offline (yes, that CPU would have to
> > > > have waited on a grace period, but that might have been the previous
> > > > grace period). But before the FQS scan notices that RCU is waiting on
> > > > an offline CPU, the CPU comes back online.
> > > >
> > > > That situation is exactly what the above code is intended to handle.
> > >
> > > That makes sense!
> > >
> > > > Without that code, RCU can give false-positive splats at various points
> > > > in its processing. ("Wait! How can a task be blocked waiting on a
> > > > grace period that hasn't even started yet???")
> > >
> > > I did not fully understand the question in brackets, though. A task can be on
> > > a different CPU which has nothing to do with the CPU that's going
> > > offline/online, so it could totally be waiting on a grace period, right?
> > >
> > > Also waiting on a grace period that hasn't even started is totally possible:
> > >
> > >        GP1        GP2
> > >   |<--------->|<-------->|
> > >   ^                      ^
> > >   |                      |____ task gets unblocked
> > > task blocks
> > > on synchronize_rcu
> > > but is waiting on
> > > GP2 which hasn't started
> > >
> > > Or did I misunderstand the question?
> >
> > There is a ->gp_tasks field in the leaf rcu_node structures that
> > references a list of tasks blocking the current grace period. When there
> > is no grace period in progress (as is the case from the end of GP1 to
> > the beginning of GP2, the RCU code expects ->gp_tasks to be NULL.
> > Without the curiosity code you pointed out above, ->gp_tasks could
> > in fact end up being non-NULL when no grace period was in progress.
> >
> > And did end up being non-NULL from time to time, initially every few
> > hundred hours of a particular rcutorture scenario.
>
> Oh ok! I will think more about it. I am not yet able to connect the gp_tasks
> being non-NULL to the CPU going offline/online scenario though. Maybe I
> should delete this code, run an experiment and trace for this condition
> (gp_tasks != NULL)?

Or you could dig through the git logs for this code change.

> I love how you found these issues by heavy testing and fixed them.

Me, I would have rather foreseen them and avoided them in the first place,
but I agree that it is better for rcutorture to find them than for some
hapless user somewhere to be inconvenienced by them. ;-)

Thanx, Paul

2019-08-15 19:44:02

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Thu, Aug 15, 2019 at 11:07:35AM -0400, Joel Fernandes wrote:
> On Wed, Aug 14, 2019 at 03:05:16PM -0700, Paul E. McKenney wrote:
> [snip]
> > > > > Arming a CPU timer could also be an alternative to tick_set_dep_cpu() for that.
> > > > >
> > > > > What do you think?
> > > >
> > > > Left to itself, RCU would take action only when a given nohz_full
> > > > in-kernel CPU was delaying a grace period, which is what the (lightly
> > > > tested) patch below is supposed to help with. If that is all that is
> > > > needed, well and good!
> > > >
> > > > But should we need long-running in-kernel nohz_full CPUs to turn on
> > > > their ticks when they are not blocking an RCU grace period, for example,
> > > > when RCU is idle, more will be needed. To that point, isn't there some
> > > > sort of monitoring that checks up on nohz_full CPUs every second or so?
> > >
> > > Wouldn't such monitoring need to be more often than a second, given that
> > > rcu_urgent_qs and rcu_need_heavy_qs are configured typically to be sooner
> > > (200-300 jiffies on my system).
> >
> > Either it would have to be more often than once per second, or RCU would
> > need to retain its more frequent checks. But note that RCU isn't going
> > to check unless there is a grace period in progress.
>
> Sure.
>
> > > > If so, perhaps that monitoring could periodically invoke an RCU function
> > > > that I provide for deciding when to turn the tick on. We would also need
> > > > to work out how to turn the tick off in a timely fashion once the CPU got
> > > > out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common().
> > > >
> > > > If this would be called only every second or so, the separate grace-period
> > > > checking is still needed for its shorter timespan, though.
> > > >
> > > > Thoughts?
> > >
> > > Do you want me to test the below patch to see if it fixes the issue with my
> > > other test case (where I had a nohz full CPU holding up a grace period).
> >
> > Please!
>
> I tried the patch below, but it did not seem to make a difference to the
> issue I was seeing. My test tree is here in case you can spot anything I did
> not do right: https://github.com/joelagnel/linux-kernel/commits/rcu/nohz-test
> The main patch is here:
> https://github.com/joelagnel/linux-kernel/commit/4dc282b559d918a0be826936f997db0bdad7abb3

That is more aggressive than rcutorture's rcu_torture_fwd_prog_nr(), so
I am guessing that I need to up rcu_torture_fwd_prog_nr()'s game. I am
currently testing that.

> On the trace output, I grep something like: egrep "(rcu_perf|cpu 3|3d)". I
> see a few ticks after 300ms, but then there are no more ticks and just a
> periodic resched_cpu() from rcu_implicit_dynticks_qs():
>
> [ 19.534107] rcu_perf-165 12.... 2276436us : rcu_perf_writer: Start of rcuperf test
> [ 19.557968] rcu_pree-10 0d..1 2287973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 20.136222] rcu_perf-165 3d.h. 2591894us : rcu_sched_clock_irq: sched-tick
> [ 20.137185] rcu_perf-165 3d.h2 2591906us : rcu_sched_clock_irq: sched-tick
> [ 20.138149] rcu_perf-165 3d.h. 2591911us : rcu_sched_clock_irq: sched-tick
> [ 20.139106] rcu_perf-165 3d.h. 2591915us : rcu_sched_clock_irq: sched-tick
> [ 20.140077] rcu_perf-165 3d.h. 2591919us : rcu_sched_clock_irq: sched-tick
> [ 20.141041] rcu_perf-165 3d.h. 2591924us : rcu_sched_clock_irq: sched-tick
> [ 20.142001] rcu_perf-165 3d.h. 2591928us : rcu_sched_clock_irq: sched-tick
> [ 20.142961] rcu_perf-165 3d.h. 2591932us : rcu_sched_clock_irq: sched-tick
> [ 20.143925] rcu_perf-165 3d.h. 2591936us : rcu_sched_clock_irq: sched-tick
> [ 20.144885] rcu_perf-165 3d.h. 2591940us : rcu_sched_clock_irq: sched-tick
> [ 20.145876] rcu_perf-165 3d.h. 2591945us : rcu_sched_clock_irq: sched-tick
> [ 20.146835] rcu_perf-165 3d.h. 2591949us : rcu_sched_clock_irq: sched-tick
> [ 20.147797] rcu_perf-165 3d.h. 2591953us : rcu_sched_clock_irq: sched-tick
> [ 20.148759] rcu_perf-165 3d.h. 2591957us : rcu_sched_clock_irq: sched-tick
> [ 20.151655] rcu_pree-10 0d..1 2591979us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 20.732938] rcu_pree-10 0d..1 2895960us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 21.318104] rcu_pree-10 0d..1 3199975us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 21.899908] rcu_pree-10 0d..1 3503964us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 22.481316] rcu_pree-10 0d..1 3807990us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 23.065623] rcu_pree-10 0d..1 4111990us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 23.650875] rcu_pree-10 0d..1 4415989us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 24.233999] rcu_pree-10 0d..1 4719978us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 24.818397] rcu_pree-10 0d..1 5023982us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 25.402633] rcu_pree-10 0d..1 5327981us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 25.984104] rcu_pree-10 0d..1 5631976us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 26.566100] rcu_pree-10 0d..1 5935982us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 27.144497] rcu_pree-10 0d..1 6239973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 27.192661] rcu_perf-165 3d.h. 6276923us : rcu_sched_clock_irq: sched-tick
> [ 27.705789] rcu_pree-10 0d..1 6541901us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 28.292155] rcu_pree-10 0d..1 6845974us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 28.874049] rcu_pree-10 0d..1 7149972us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [ 29.112646] rcu_perf-165 3.... 7275951us : rcu_perf_writer: End of rcuperf test

That would be due to my own stupidity. I forgot to clear ->rcu_forced_tick
in rcu_disable_tick_upon_qs() inside the "if" statement. This of course
prevents rcu_nmi_exit_common() from ever re-enabling it.

Excellent catch! Thank you for testing this!!!

> [snip]
> > > > @@ -2906,7 +2927,7 @@ void rcu_barrier(void)
> > > > /* Did someone else do our work for us? */
> > > > if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
> > > > rcu_barrier_trace(TPS("EarlyExit"), -1,
> > > > - rcu_state.barrier_sequence);
> > > > + rcu_state.barrier_sequence);
> > > > smp_mb(); /* caller's subsequent code after above check. */
> > > > mutex_unlock(&rcu_state.barrier_mutex);
> > > > return;
> > > > @@ -2938,11 +2959,11 @@ void rcu_barrier(void)
> > > > continue;
> > > > if (rcu_segcblist_n_cbs(&rdp->cblist)) {
> > > > rcu_barrier_trace(TPS("OnlineQ"), cpu,
> > > > - rcu_state.barrier_sequence);
> > > > + rcu_state.barrier_sequence);
> > > > smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
> > > > } else {
> > > > rcu_barrier_trace(TPS("OnlineNQ"), cpu,
> > > > - rcu_state.barrier_sequence);
> > > > + rcu_state.barrier_sequence);
> > > > }
> > > > }
> > > > put_online_cpus();
> > > > @@ -3168,6 +3189,7 @@ void rcu_cpu_starting(unsigned int cpu)
> > > > rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
> > > > rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
> > > > if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> > > > + rcu_disable_tick_upon_qs(rdp);
> > > > /* Report QS -after- changing ->qsmaskinitnext! */
> > > > rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> > >
> > > Just curious about the existing code. If a CPU is just starting up (after
> > > bringing it online), how can RCU be waiting on it? I thought RCU would not be
> > > watching offline CPUs.
> >
> > Well, neither grace periods nor CPU-hotplug operations are atomic,
> > and each can take significant time to complete.
> >
> > So suppose we have a large system with multiple leaf rcu_node structures
> > (not that 17 CPUs is all that many these days, but please bear with me).
> > Suppose just after a new grace period initializes a given leaf rcu_node
> > structure, one of its CPUs goes offline (yes, that CPU would have to
> > have waited on a grace period, but that might have been the previous
> > grace period). But before the FQS scan notices that RCU is waiting on
> > an offline CPU, the CPU comes back online.
> >
> > That situation is exactly what the above code is intended to handle.
>
> That makes sense!
>
> > Without that code, RCU can give false-positive splats at various points
> > in its processing. ("Wait! How can a task be blocked waiting on a
> > grace period that hasn't even started yet???")
>
> I did not fully understand the question in brackets though, a task can be on
> a different CPU though which has nothing to do with the CPU that's going
> offline/online so it could totally be waiting on a grace period right?
>
> Also waiting on a grace period that hasn't even started is totally possible:
>
>        GP1        GP2
>   |<--------->|<-------->|
>   ^                      ^
>   |                      |____ task gets unblocked
> task blocks
> on synchronize_rcu
> but is waiting on
> GP2 which hasn't started
>
> Or did I misunderstand the question?

There is a ->gp_tasks field in the leaf rcu_node structures that
references a list of tasks blocking the current grace period. When there
is no grace period in progress (as is the case from the end of GP1 to
the beginning of GP2), the RCU code expects ->gp_tasks to be NULL.
Without the curiosity code you pointed out above, ->gp_tasks could
in fact end up being non-NULL when no grace period was in progress.

And did end up being non-NULL from time to time, initially every few
hundred hours of a particular rcutorture scenario.
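
As an illustration of that invariant, a hypothetical debug check for
preemptible RCU might look as follows; rcu_gp_in_progress(), ->gp_tasks,
and rcu_for_each_leaf_node() are existing internals, but the helper
itself is made up and is not part of this series:

/*
 * Hypothetical check: with no grace period in progress, no leaf
 * rcu_node structure should have tasks queued as blocking the
 * "current" grace period.
 */
static void check_no_gp_tasks_when_idle(void)
{
	struct rcu_node *rnp;

	if (rcu_gp_in_progress())
		return;
	rcu_for_each_leaf_node(rnp)
		WARN_ON_ONCE(READ_ONCE(rnp->gp_tasks) != NULL);
}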

Thanx, Paul

2019-08-15 20:24:05

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

On Thu, Aug 15, 2019 at 11:39:37AM -0700, Paul E. McKenney wrote:
> On Thu, Aug 15, 2019 at 02:15:00PM -0400, Joel Fernandes wrote:
> > On Thu, Aug 15, 2019 at 10:23:51AM -0700, Paul E. McKenney wrote:
> > > > [snip]
> > > > > > > If so, perhaps that monitoring could periodically invoke an RCU function
> > > > > > > that I provide for deciding when to turn the tick on. We would also need
> > > > > > > to work out how to turn the tick off in a timely fashion once the CPU got
> > > > > > > out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common().
> > > > > > >
> > > > > > > If this would be called only every second or so, the separate grace-period
> > > > > > > checking is still needed for its shorter timespan, though.
> > > > > > >
> > > > > > > Thoughts?
> > > > > >
> > > > > > Do you want me to test the below patch to see if it fixes the issue with my
> > > > > > other test case (where I had a nohz full CPU holding up a grace period).
> > > > >
> > > > > Please!
> > > >
> > > > I tried the patch below, but it did not seem to make a difference to the
> > > > issue I was seeing. My test tree is here in case you can spot anything I did
> > > > not do right: https://github.com/joelagnel/linux-kernel/commits/rcu/nohz-test
> > > > The main patch is here:
> > > > https://github.com/joelagnel/linux-kernel/commit/4dc282b559d918a0be826936f997db0bdad7abb3
> > >
> > > That is more aggressive that rcutorture's rcu_torture_fwd_prog_nr(), so
> > > I am guessing that I need to up rcu_torture_fwd_prog_nr()'s game. I am
> > > currently testing that.
> > >
> > > > On the trace output, I grep something like: egrep "(rcu_perf|cpu 3|3d)". I
> > > > see a few ticks after 300ms, but then there are no more ticks and just a
> > > > periodic resched_cpu() from rcu_implicit_dynticks_qs():
> > > >
> > > > [ 19.534107] rcu_perf-165 12.... 2276436us : rcu_perf_writer: Start of rcuperf test
> > > > [ 19.557968] rcu_pree-10 0d..1 2287973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > > [ 20.136222] rcu_perf-165 3d.h. 2591894us : rcu_sched_clock_irq: sched-tick
> > > > [ 20.137185] rcu_perf-165 3d.h2 2591906us : rcu_sched_clock_irq: sched-tick
> > > > [ 20.138149] rcu_perf-165 3d.h. 2591911us : rcu_sched_clock_irq: sched-tick
> > > > [ 20.139106] rcu_perf-165 3d.h. 2591915us : rcu_sched_clock_irq: sched-tick
> > [snip]
> > > > [ 20.147797] rcu_perf-165 3d.h. 2591953us : rcu_sched_clock_irq: sched-tick
> > > > [ 20.148759] rcu_perf-165 3d.h. 2591957us : rcu_sched_clock_irq: sched-tick
> > > > [ 20.151655] rcu_pree-10 0d..1 2591979us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > > [ 20.732938] rcu_pree-10 0d..1 2895960us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > [snip]
> > > > [ 26.566100] rcu_pree-10 0d..1 5935982us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > > [ 27.144497] rcu_pree-10 0d..1 6239973us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > > [ 27.192661] rcu_perf-165 3d.h. 6276923us : rcu_sched_clock_irq: sched-tick
> > > > [ 27.705789] rcu_pree-10 0d..1 6541901us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > > [ 28.292155] rcu_pree-10 0d..1 6845974us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > > [ 28.874049] rcu_pree-10 0d..1 7149972us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> > > > [ 29.112646] rcu_perf-165 3.... 7275951us : rcu_perf_writer: End of rcuperf test
> > >
> > > That would be due to my own stupidity. I forgot to clear ->rcu_forced_tick
> > > in rcu_disable_tick_upon_qs() inside the "if" statement. This of course
> > > prevents rcu_nmi_exit_common() from ever re-enabling it.
> > >
> > > Excellent catch! Thank you for testing this!!!
> >
> > Ah I missed it too. Happy to help! I tried setting it as below but getting
> > same results:
> >
> > +/*
> > + * If the scheduler-clock interrupt was enabled on a nohz_full CPU
> > + * in order to get to a quiescent state, disable it.
> > + */
> > +void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
> > +{
> > + if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
> > + tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > + rdp->rcu_forced_tick = false;
>
> I put this inside the "if" statement, though I would not expect that to
> change behavior in this case.
>
> Does your test case still avoid turning on the tick more than once? Or
> is it turning on the tick each time the grace period gets too long, but
> without the tick managing to end the grace periods?

I will put some more prints and let you know. But it looks like I see a print
from rcu_sched_clock_irq() only once at around 700ms from the start of the
test loop. After that I don't see prints at all for the rest of the 7 seconds
of the test.

Before the test starts, I see several rcu_sched_clock_irq() at the regular
tick interval of 1 ms (HZ=1000).

> > > > [snip]
> > > > > Without that code, RCU can give false-positive splats at various points
> > > > > in its processing. ("Wait! How can a task be blocked waiting on a
> > > > > grace period that hasn't even started yet???")
> > > >
> > > > I did not fully understand the question in brackets, though. A task can be on
> > > > a different CPU which has nothing to do with the CPU that's going
> > > > offline/online, so it could totally be waiting on a grace period, right?
> > > >
> > > > Also waiting on a grace period that hasn't even started is totally possible:
> > > >
> > > >        GP1        GP2
> > > >   |<--------->|<-------->|
> > > >   ^                      ^
> > > >   |                      |____ task gets unblocked
> > > > task blocks
> > > > on synchronize_rcu
> > > > but is waiting on
> > > > GP2 which hasn't started
> > > >
> > > > Or did I misunderstand the question?
> > >
> > > There is a ->gp_tasks field in the leaf rcu_node structures that
> > > references a list of tasks blocking the current grace period. When there
> > > is no grace period in progress (as is the case from the end of GP1 to
> > > the beginning of GP2), the RCU code expects ->gp_tasks to be NULL.
> > > Without the curiosity code you pointed out above, ->gp_tasks could
> > > in fact end up being non-NULL when no grace period was in progress.
> > >
> > > And did end up being non-NULL from time to time, initially every few
> > > hundred hours of a particular rcutorture scenario.
> >
> > Oh ok! I will think more about it. I am not yet able to connect the gp_tasks
> > being non-NULL to the CPU going offline/online scenario though. Maybe I
> > should delete this code, run an experiment and trace for this condition
> > (gp_tasks != NULL)?
>
> Or you could dig through the git logs for this code change.

Ok will do.

> > I love how you found these issues by heavy testing and fixed them.
>
> Me, I would have rather foreseen them and avoided them in the first place,
> but I agree that it is better for rcutorture to find them than for some
> hapless user somewhere to be inconvenienced by them. ;-)

True, foreseeing is always better ;)

thanks,

- Joel