2015-07-17 23:29:12

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 0/19] Expedited grace period changes for 4.3

Hello!

This series contains a series of optimizations for RCU's expedited
grace periods that reduce the jitter they induce in other CPUs. There
is more work to be done here, but this should introduce a good and
sufficient level of instability for 4.3. :-/

1. Stop disabling CPU hotplug in synchronize_rcu_expedited().
Longer term, synchronize_sched_expedited() will get the
same treatment.

2. Remove CONFIG_RCU_CPU_STALL_INFO, so that kernels are always
built as if CONFIG_RCU_CPU_STALL_INFO=y. A bit off-topic,
but remains in this series due to conflicts.

3. Switch synchronize_sched_expedited() to stop_one_cpu(), courtesy
of Peter Zijlstra.

4. Simplify synchronize_sched_expedited() counter handling.

5. Get rid of synchronize_sched_expedited()'s polling loop in
favor of tree-based funnel locking.

6. Make expedited GP CPU stoppage asynchronous, courtesy of Peter
Zijlstra. This repairs a performance regression in #3.

7. Abstract sequence counting from synchronize_sched_expedited().

8. Make synchronize_rcu_expedited() use sequence-counter scheme.

9. Abstract funnel locking from synchronize_sched_expedited().

10. Fix synchronize_sched_expedited() type error for variable "s".

11. Use funnel locking for synchronize_rcu_expedited()'s polling loop.

12. Apply rcu_seq operations to _rcu_barrier().

13. Consolidate last open-coded expedited memory barrier.

14. Extend expedited funnel locking to rcu_data structure, allowing
full parallel handling of expedited grace periods.

15. Add stall warnings to synchronize_sched_expedited().

16. Document new expedited stall warnings.

17. Pull out wait_event*() condition into helper function.

18. Rename RCU_GP_DONE_FQS to RCU_GP_DOING_FQS.

19. Add fastpath bypassing funnel locking to speed up the common
case where only one expedited grace period happens at a time
across the full system.

Thanx, Paul

------------------------------------------------------------------------

b/Documentation/RCU/stallwarn.txt | 29
b/Documentation/RCU/trace.txt | 36
b/include/trace/events/rcu.h | 1
b/kernel/rcu/tree.c | 625 +++++-----
b/kernel/rcu/tree.h | 39
b/kernel/rcu/tree_plugin.h | 123 -
b/kernel/rcu/tree_trace.c | 29
b/lib/Kconfig.debug | 14
b/tools/testing/selftests/rcutorture/configs/rcu/TREE01 | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE02 | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE02-T | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE03 | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE05 | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE06 | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE07 | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE08 | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE08-T | 1
b/tools/testing/selftests/rcutorture/configs/rcu/TREE09 | 1
b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 1
20 files changed, 425 insertions(+), 483 deletions(-)


2015-07-17 23:29:41

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 01/19] rcu: Stop disabling CPU hotplug in synchronize_rcu_expedited()

From: "Paul E. McKenney" <[email protected]>

The fact that tasks could be migrated from leaf to root rcu_node
structures meant that synchronize_rcu_expedited() had to disable
CPU hotplug. However, tasks now stay put, so this commit removes the
CPU-hotplug disabling from synchronize_rcu_expedited().

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 25 ++-----------------------
1 file changed, 2 insertions(+), 23 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 5dac0a10a985..7234f03e0aa2 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -728,42 +728,23 @@ void synchronize_rcu_expedited(void)
smp_mb(); /* Above access cannot bleed into critical section. */

/*
- * Block CPU-hotplug operations. This means that any CPU-hotplug
- * operation that finds an rcu_node structure with tasks in the
- * process of being boosted will know that all tasks blocking
- * this expedited grace period will already be in the process of
- * being boosted. This simplifies the process of moving tasks
- * from leaf to root rcu_node structures.
- */
- if (!try_get_online_cpus()) {
- /* CPU-hotplug operation in flight, fall back to normal GP. */
- wait_rcu_gp(call_rcu);
- return;
- }
-
- /*
* Acquire lock, falling back to synchronize_rcu() if too many
* lock-acquisition failures. Of course, if someone does the
* expedited grace period for us, just leave.
*/
while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
if (ULONG_CMP_LT(snap,
- READ_ONCE(sync_rcu_preempt_exp_count))) {
- put_online_cpus();
+ READ_ONCE(sync_rcu_preempt_exp_count)))
goto mb_ret; /* Others did our work for us. */
- }
if (trycount++ < 10) {
udelay(trycount * num_online_cpus());
} else {
- put_online_cpus();
wait_rcu_gp(call_rcu);
return;
}
}
- if (ULONG_CMP_LT(snap, READ_ONCE(sync_rcu_preempt_exp_count))) {
- put_online_cpus();
+ if (ULONG_CMP_LT(snap, READ_ONCE(sync_rcu_preempt_exp_count)))
goto unlock_mb_ret; /* Others did our work for us. */
- }

/* force all RCU readers onto ->blkd_tasks lists. */
synchronize_sched_expedited();
@@ -779,8 +760,6 @@ void synchronize_rcu_expedited(void)
rcu_for_each_leaf_node(rsp, rnp)
sync_rcu_preempt_exp_init2(rsp, rnp);

- put_online_cpus();
-
/* Wait for snapshotted ->blkd_tasks lists to drain. */
rnp = rcu_get_root(rsp);
wait_event(sync_rcu_preempt_exp_wq,
--
1.8.1.5

2015-07-17 23:35:38

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 02/19] rcu: Remove CONFIG_RCU_CPU_STALL_INFO

From: "Paul E. McKenney" <[email protected]>

The CONFIG_RCU_CPU_STALL_INFO has been default-y for a couple of
releases with no complaints, so it is time to eliminate this Kconfig
option entirely, so that the long-form RCU CPU stall warnings cannot
be disabled. This commit does just that.

Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/stallwarn.txt | 12 +-----
kernel/rcu/tree.h | 4 --
kernel/rcu/tree_plugin.h | 45 ----------------------
lib/Kconfig.debug | 14 -------
.../selftests/rcutorture/configs/rcu/TREE01 | 1 -
.../selftests/rcutorture/configs/rcu/TREE02 | 1 -
.../selftests/rcutorture/configs/rcu/TREE02-T | 1 -
.../selftests/rcutorture/configs/rcu/TREE03 | 1 -
.../selftests/rcutorture/configs/rcu/TREE04 | 1 -
.../selftests/rcutorture/configs/rcu/TREE05 | 1 -
.../selftests/rcutorture/configs/rcu/TREE06 | 1 -
.../selftests/rcutorture/configs/rcu/TREE07 | 1 -
.../selftests/rcutorture/configs/rcu/TREE08 | 1 -
.../selftests/rcutorture/configs/rcu/TREE08-T | 1 -
.../selftests/rcutorture/configs/rcu/TREE09 | 1 -
.../selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 1 -
16 files changed, 2 insertions(+), 85 deletions(-)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index b57c0c1cdac6..046f32637b95 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -26,12 +26,6 @@ CONFIG_RCU_CPU_STALL_TIMEOUT
Stall-warning messages may be enabled and disabled completely via
/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress.

-CONFIG_RCU_CPU_STALL_INFO
-
- This kernel configuration parameter causes the stall warning to
- print out additional per-CPU diagnostic information, including
- information on scheduling-clock ticks and RCU's idle-CPU tracking.
-
RCU_STALL_DELAY_DELTA

Although the lockdep facility is extremely useful, it does add
@@ -101,15 +95,13 @@ interact. Please note that it is not possible to entirely eliminate this
sort of false positive without resorting to things like stop_machine(),
which is overkill for this sort of problem.

-If the CONFIG_RCU_CPU_STALL_INFO kernel configuration parameter is set,
-more information is printed with the stall-warning message, for example:
+Recent kernels will print a long form of the stall-warning message:

INFO: rcu_preempt detected stall on CPU
0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 softirq=82/543
(t=65000 jiffies)

-In kernels with CONFIG_RCU_FAST_NO_HZ, even more information is
-printed:
+In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed:

INFO: rcu_preempt detected stall on CPU
0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index faee5242d6ff..7c0b09d754a1 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -288,12 +288,10 @@ struct rcu_data {
bool gpwrap; /* Possible gpnum/completed wrap. */
struct rcu_node *mynode; /* This CPU's leaf of hierarchy */
unsigned long grpmask; /* Mask to apply to leaf qsmask. */
-#ifdef CONFIG_RCU_CPU_STALL_INFO
unsigned long ticks_this_gp; /* The number of scheduling-clock */
/* ticks this CPU has handled */
/* during and after the last grace */
/* period it is aware of. */
-#endif /* #ifdef CONFIG_RCU_CPU_STALL_INFO */

/* 2) batch handling */
/*
@@ -388,9 +386,7 @@ struct rcu_data {
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */

/* 8) RCU CPU stall data. */
-#ifdef CONFIG_RCU_CPU_STALL_INFO
unsigned int softirq_snap; /* Snapshot of softirq activity. */
-#endif /* #ifdef CONFIG_RCU_CPU_STALL_INFO */

int cpu;
struct rcu_state *rsp;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7234f03e0aa2..ef41c1b04ba6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -82,8 +82,6 @@ static void __init rcu_bootup_announce_oddness(void)
pr_info("\tRCU lockdep checking is enabled.\n");
if (IS_ENABLED(CONFIG_RCU_TORTURE_TEST_RUNNABLE))
pr_info("\tRCU torture testing starts during boot.\n");
- if (IS_ENABLED(CONFIG_RCU_CPU_STALL_INFO))
- pr_info("\tAdditional per-CPU info printed with stalls.\n");
if (RCU_NUM_LVLS >= 4)
pr_info("\tFour(or more)-level hierarchy is enabled.\n");
if (RCU_FANOUT_LEAF != 16)
@@ -418,8 +416,6 @@ static void rcu_print_detail_task_stall(struct rcu_state *rsp)
rcu_print_detail_task_stall_rnp(rnp);
}

-#ifdef CONFIG_RCU_CPU_STALL_INFO
-
static void rcu_print_task_stall_begin(struct rcu_node *rnp)
{
pr_err("\tTasks blocked on level-%d rcu_node (CPUs %d-%d):",
@@ -431,18 +427,6 @@ static void rcu_print_task_stall_end(void)
pr_cont("\n");
}

-#else /* #ifdef CONFIG_RCU_CPU_STALL_INFO */
-
-static void rcu_print_task_stall_begin(struct rcu_node *rnp)
-{
-}
-
-static void rcu_print_task_stall_end(void)
-{
-}
-
-#endif /* #else #ifdef CONFIG_RCU_CPU_STALL_INFO */
-
/*
* Scan the current list of tasks blocked within RCU read-side critical
* sections, printing out the tid of each.
@@ -1685,8 +1669,6 @@ early_initcall(rcu_register_oom_notifier);

#endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */

-#ifdef CONFIG_RCU_CPU_STALL_INFO
-
#ifdef CONFIG_RCU_FAST_NO_HZ

static void print_cpu_stall_fast_no_hz(char *cp, int cpu)
@@ -1775,33 +1757,6 @@ static void increment_cpu_stall_ticks(void)
raw_cpu_inc(rsp->rda->ticks_this_gp);
}

-#else /* #ifdef CONFIG_RCU_CPU_STALL_INFO */
-
-static void print_cpu_stall_info_begin(void)
-{
- pr_cont(" {");
-}
-
-static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
-{
- pr_cont(" %d", cpu);
-}
-
-static void print_cpu_stall_info_end(void)
-{
- pr_cont("} ");
-}
-
-static void zero_cpu_stall_ticks(struct rcu_data *rdp)
-{
-}
-
-static void increment_cpu_stall_ticks(void)
-{
-}
-
-#endif /* #else #ifdef CONFIG_RCU_CPU_STALL_INFO */
-
#ifdef CONFIG_RCU_NOCB_CPU

/*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e2894b23efb6..8a34205d6922 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1353,20 +1353,6 @@ config RCU_CPU_STALL_TIMEOUT
RCU grace period persists, additional CPU stall warnings are
printed at more widely spaced intervals.

-config RCU_CPU_STALL_INFO
- bool "Print additional diagnostics on RCU CPU stall"
- depends on (TREE_RCU || PREEMPT_RCU) && DEBUG_KERNEL
- default y
- help
- For each stalled CPU that is aware of the current RCU grace
- period, print out additional per-CPU diagnostic information
- regarding scheduling-clock ticks, idle state, and,
- for RCU_FAST_NO_HZ kernels, idle-entry state.
-
- Say N if you are unsure.
-
- Say Y if you want to enable such diagnostics.
-
config RCU_TRACE
bool "Enable tracing for RCU"
depends on DEBUG_KERNEL
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE01 b/tools/testing/selftests/rcutorture/configs/rcu/TREE01
index 8e9137f66831..f572b873c620 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE01
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE01
@@ -13,7 +13,6 @@ CONFIG_MAXSMP=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ZERO=y
CONFIG_DEBUG_LOCK_ALLOC=n
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE02 b/tools/testing/selftests/rcutorture/configs/rcu/TREE02
index aeea6a204d14..ef6a22c44dea 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE02
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE02
@@ -17,7 +17,6 @@ CONFIG_RCU_FANOUT_LEAF=3
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=n
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE02-T b/tools/testing/selftests/rcutorture/configs/rcu/TREE02-T
index 2ac9e68ea3d1..917d2517b5b5 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE02-T
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE02-T
@@ -17,6 +17,5 @@ CONFIG_RCU_FANOUT_LEAF=3
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=n
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03 b/tools/testing/selftests/rcutorture/configs/rcu/TREE03
index 72aa7d87ea99..7a17c503b382 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03
@@ -13,7 +13,6 @@ CONFIG_RCU_FANOUT=2
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=n
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_BOOST=y
CONFIG_RCU_KTHREAD_PRIO=2
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index 3f5112751cda..39a2c6d7d7ec 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -17,6 +17,5 @@ CONFIG_RCU_FANOUT=4
CONFIG_RCU_FANOUT_LEAF=4
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=n
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE05 b/tools/testing/selftests/rcutorture/configs/rcu/TREE05
index c04dfea6fd21..1257d3227b1e 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE05
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE05
@@ -17,6 +17,5 @@ CONFIG_RCU_NOCB_CPU_NONE=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
#CHECK#CONFIG_PROVE_RCU=y
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE06 b/tools/testing/selftests/rcutorture/configs/rcu/TREE06
index f51d2c73a68e..d3e456b74cbe 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE06
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE06
@@ -18,6 +18,5 @@ CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
#CHECK#CONFIG_PROVE_RCU=y
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE07 b/tools/testing/selftests/rcutorture/configs/rcu/TREE07
index f422af4ff5a3..3956b4131f72 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE07
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE07
@@ -17,6 +17,5 @@ CONFIG_RCU_FANOUT=2
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=n
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE08 b/tools/testing/selftests/rcutorture/configs/rcu/TREE08
index a24d2ca30646..bb9b0c1a23c2 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE08
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE08
@@ -19,7 +19,6 @@ CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_PROVE_LOCKING=y
#CHECK#CONFIG_PROVE_RCU=y
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE08-T b/tools/testing/selftests/rcutorture/configs/rcu/TREE08-T
index b2b8cea69dc9..2ad13f0d29cc 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE08-T
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE08-T
@@ -17,6 +17,5 @@ CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_DEBUG_LOCK_ALLOC=n
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE09 b/tools/testing/selftests/rcutorture/configs/rcu/TREE09
index aa4ed08d999d..6710e749d9de 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE09
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE09
@@ -13,7 +13,6 @@ CONFIG_SUSPEND=n
CONFIG_HIBERNATION=n
CONFIG_RCU_NOCB_CPU=n
CONFIG_DEBUG_LOCK_ALLOC=n
-CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
#CHECK#CONFIG_RCU_EXPERT=n
diff --git a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
index b24c0004fc49..657f3a035488 100644
--- a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
+++ b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
@@ -16,7 +16,6 @@ CONFIG_PROVE_LOCKING -- Do several, covering CONFIG_DEBUG_LOCK_ALLOC=y and not.
CONFIG_PROVE_RCU -- Hardwired to CONFIG_PROVE_LOCKING.
CONFIG_RCU_BOOST -- one of PREEMPT_RCU.
CONFIG_RCU_KTHREAD_PRIO -- set to 2 for _BOOST testing.
-CONFIG_RCU_CPU_STALL_INFO -- Now default, avoid at least twice.
CONFIG_RCU_FANOUT -- Cover hierarchy, but overlap with others.
CONFIG_RCU_FANOUT_LEAF -- Do one non-default.
CONFIG_RCU_FAST_NO_HZ -- Do one, but not with CONFIG_RCU_NOCB_CPU_ALL.
--
1.8.1.5

2015-07-17 23:35:36

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 03/19] rcu: Switch synchronize_sched_expedited() to stop_one_cpu()

From: Peter Zijlstra <[email protected]>

The synchronize_sched_expedited() currently invokes try_stop_cpus(),
which schedules the stopper kthreads on each online non-idle CPU,
and waits until all those kthreads are running before letting any
of them stop. This is disastrous for real-time workloads, which
get hit with a preemption that is as long as the longest scheduling
latency on any CPU, including any non-realtime housekeeping CPUs.
This commit therefore switches to using stop_one_cpu() on each CPU
in turn. This avoids inflicting the worst-case scheduling latency
on the worst-case CPU onto all other CPUs, and also simplifies the
code a little bit.

Follow-up commits will simplify the counter-snapshotting algorithm
and convert a number of the counters that are now protected by the
new ->expedited_mutex to non-atomic.

Signed-off-by: Peter Zijlstra <[email protected]>
[ paulmck: Kept stop_one_cpu(), dropped disabling of "guardrails". ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 41 ++++++++++++++---------------------------
kernel/rcu/tree.h | 1 +
2 files changed, 15 insertions(+), 27 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a2147d7b51c0..ae39a49daa58 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -103,6 +103,7 @@ struct rcu_state sname##_state = { \
.orphan_nxttail = &sname##_state.orphan_nxtlist, \
.orphan_donetail = &sname##_state.orphan_donelist, \
.barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \
+ .expedited_mutex = __MUTEX_INITIALIZER(sname##_state.expedited_mutex), \
.name = RCU_STATE_NAME(sname), \
.abbr = sabbr, \
}
@@ -3305,8 +3306,6 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
*/
void synchronize_sched_expedited(void)
{
- cpumask_var_t cm;
- bool cma = false;
int cpu;
long firstsnap, s, snap;
int trycount = 0;
@@ -3342,28 +3341,11 @@ void synchronize_sched_expedited(void)
}
WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id()));

- /* Offline CPUs, idle CPUs, and any CPU we run on are quiescent. */
- cma = zalloc_cpumask_var(&cm, GFP_KERNEL);
- if (cma) {
- cpumask_copy(cm, cpu_online_mask);
- cpumask_clear_cpu(raw_smp_processor_id(), cm);
- for_each_cpu(cpu, cm) {
- struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
-
- if (!(atomic_add_return(0, &rdtp->dynticks) & 0x1))
- cpumask_clear_cpu(cpu, cm);
- }
- if (cpumask_weight(cm) == 0)
- goto all_cpus_idle;
- }
-
/*
* Each pass through the following loop attempts to force a
* context switch on each CPU.
*/
- while (try_stop_cpus(cma ? cm : cpu_online_mask,
- synchronize_sched_expedited_cpu_stop,
- NULL) == -EAGAIN) {
+ while (!mutex_trylock(&rsp->expedited_mutex)) {
put_online_cpus();
atomic_long_inc(&rsp->expedited_tryfail);

@@ -3373,7 +3355,6 @@ void synchronize_sched_expedited(void)
/* ensure test happens before caller kfree */
smp_mb__before_atomic(); /* ^^^ */
atomic_long_inc(&rsp->expedited_workdone1);
- free_cpumask_var(cm);
return;
}

@@ -3383,7 +3364,6 @@ void synchronize_sched_expedited(void)
} else {
wait_rcu_gp(call_rcu_sched);
atomic_long_inc(&rsp->expedited_normal);
- free_cpumask_var(cm);
return;
}

@@ -3393,7 +3373,6 @@ void synchronize_sched_expedited(void)
/* ensure test happens before caller kfree */
smp_mb__before_atomic(); /* ^^^ */
atomic_long_inc(&rsp->expedited_workdone2);
- free_cpumask_var(cm);
return;
}

@@ -3408,16 +3387,23 @@ void synchronize_sched_expedited(void)
/* CPU hotplug operation in flight, use normal GP. */
wait_rcu_gp(call_rcu_sched);
atomic_long_inc(&rsp->expedited_normal);
- free_cpumask_var(cm);
return;
}
snap = atomic_long_read(&rsp->expedited_start);
smp_mb(); /* ensure read is before try_stop_cpus(). */
}
- atomic_long_inc(&rsp->expedited_stoppedcpus);

-all_cpus_idle:
- free_cpumask_var(cm);
+ /* Stop each CPU that is online, non-idle, and not us. */
+ for_each_online_cpu(cpu) {
+ struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+ /* Skip our CPU and any idle CPUs. */
+ if (raw_smp_processor_id() == cpu ||
+ !(atomic_add_return(0, &rdtp->dynticks) & 0x1))
+ continue;
+ stop_one_cpu(cpu, synchronize_sched_expedited_cpu_stop, NULL);
+ }
+ atomic_long_inc(&rsp->expedited_stoppedcpus);

/*
* Everyone up to our most recent fetch is covered by our grace
@@ -3436,6 +3422,7 @@ all_cpus_idle:
}
} while (atomic_long_cmpxchg(&rsp->expedited_done, s, snap) != s);
atomic_long_inc(&rsp->expedited_done_exit);
+ mutex_unlock(&rsp->expedited_mutex);

put_online_cpus();
}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7c0b09d754a1..7c25fe473ad9 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -480,6 +480,7 @@ struct rcu_state {
/* _rcu_barrier(). */
/* End of fields guarded by barrier_mutex. */

+ struct mutex expedited_mutex; /* Serializes expediting. */
atomic_long_t expedited_start; /* Starting ticket. */
atomic_long_t expedited_done; /* Done ticket. */
atomic_long_t expedited_wrap; /* # near-wrap incidents. */
--
1.8.1.5

2015-07-17 23:31:58

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 04/19] rcu: Rework synchronize_sched_expedited() counter handling

From: "Paul E. McKenney" <[email protected]>

Now that synchronize_sched_expedited() have a mutex, it can use simpler
work-already-done detection scheme. This commit simplifies this scheme
by using something similar to the sequence-locking counter scheme.
A counter is incremented before and after each grace period, so that
the counter is odd in the midst of the grace period and even otherwise.
So if the counter has advanced to the second even number that is
greater than or equal to the snapshot, the required grace period has
already happened.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 98 +++++++++++++++----------------------------------
kernel/rcu/tree.h | 9 +----
kernel/rcu/tree_trace.c | 12 ++----
3 files changed, 36 insertions(+), 83 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ae39a49daa58..3c182fdec805 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3283,56 +3283,24 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
* restructure your code to batch your updates, and then use a single
* synchronize_sched() instead.
*
- * This implementation can be thought of as an application of ticket
- * locking to RCU, with sync_sched_expedited_started and
- * sync_sched_expedited_done taking on the roles of the halves
- * of the ticket-lock word. Each task atomically increments
- * sync_sched_expedited_started upon entry, snapshotting the old value,
- * then attempts to stop all the CPUs. If this succeeds, then each
- * CPU will have executed a context switch, resulting in an RCU-sched
- * grace period. We are then done, so we use atomic_cmpxchg() to
- * update sync_sched_expedited_done to match our snapshot -- but
- * only if someone else has not already advanced past our snapshot.
- *
- * On the other hand, if try_stop_cpus() fails, we check the value
- * of sync_sched_expedited_done. If it has advanced past our
- * initial snapshot, then someone else must have forced a grace period
- * some time after we took our snapshot. In this case, our work is
- * done for us, and we can simply return. Otherwise, we try again,
- * but keep our initial snapshot for purposes of checking for someone
- * doing our work for us.
- *
- * If we fail too many times in a row, we fall back to synchronize_sched().
+ * This implementation can be thought of as an application of sequence
+ * locking to expedited grace periods, but using the sequence counter to
+ * determine when someone else has already done the work instead of for
+ * retrying readers. We do a mutex_trylock() polling loop, but if we fail
+ * too many times in a row, we fall back to synchronize_sched().
*/
void synchronize_sched_expedited(void)
{
int cpu;
- long firstsnap, s, snap;
+ long s;
int trycount = 0;
struct rcu_state *rsp = &rcu_sched_state;

- /*
- * If we are in danger of counter wrap, just do synchronize_sched().
- * By allowing sync_sched_expedited_started to advance no more than
- * ULONG_MAX/8 ahead of sync_sched_expedited_done, we are ensuring
- * that more than 3.5 billion CPUs would be required to force a
- * counter wrap on a 32-bit system. Quite a few more CPUs would of
- * course be required on a 64-bit system.
- */
- if (ULONG_CMP_GE((ulong)atomic_long_read(&rsp->expedited_start),
- (ulong)atomic_long_read(&rsp->expedited_done) +
- ULONG_MAX / 8)) {
- wait_rcu_gp(call_rcu_sched);
- atomic_long_inc(&rsp->expedited_wrap);
- return;
- }
+ /* Take a snapshot of the sequence number. */
+ smp_mb(); /* Caller's modifications seen first by other CPUs. */
+ s = (READ_ONCE(rsp->expedited_sequence) + 3) & ~0x1;
+ smp_mb(); /* Above access must not bleed into critical section. */

- /*
- * Take a ticket. Note that atomic_inc_return() implies a
- * full memory barrier.
- */
- snap = atomic_long_inc_return(&rsp->expedited_start);
- firstsnap = snap;
if (!try_get_online_cpus()) {
/* CPU hotplug operation in flight, fall back to normal GP. */
wait_rcu_gp(call_rcu_sched);
@@ -3342,16 +3310,15 @@ void synchronize_sched_expedited(void)
WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id()));

/*
- * Each pass through the following loop attempts to force a
- * context switch on each CPU.
+ * Each pass through the following loop attempts to acquire
+ * ->expedited_mutex, checking for others doing our work each time.
*/
while (!mutex_trylock(&rsp->expedited_mutex)) {
put_online_cpus();
atomic_long_inc(&rsp->expedited_tryfail);

/* Check to see if someone else did our work for us. */
- s = atomic_long_read(&rsp->expedited_done);
- if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
+ if (ULONG_CMP_GE(READ_ONCE(rsp->expedited_sequence), s)) {
/* ensure test happens before caller kfree */
smp_mb__before_atomic(); /* ^^^ */
atomic_long_inc(&rsp->expedited_workdone1);
@@ -3368,8 +3335,7 @@ void synchronize_sched_expedited(void)
}

/* Recheck to see if someone else did our work for us. */
- s = atomic_long_read(&rsp->expedited_done);
- if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
+ if (ULONG_CMP_GE(READ_ONCE(rsp->expedited_sequence), s)) {
/* ensure test happens before caller kfree */
smp_mb__before_atomic(); /* ^^^ */
atomic_long_inc(&rsp->expedited_workdone2);
@@ -3389,10 +3355,20 @@ void synchronize_sched_expedited(void)
atomic_long_inc(&rsp->expedited_normal);
return;
}
- snap = atomic_long_read(&rsp->expedited_start);
- smp_mb(); /* ensure read is before try_stop_cpus(). */
}

+ /* Recheck yet again to see if someone else did our work for us. */
+ if (ULONG_CMP_GE(READ_ONCE(rsp->expedited_sequence), s)) {
+ rsp->expedited_workdone3++;
+ mutex_unlock(&rsp->expedited_mutex);
+ smp_mb(); /* ensure test happens before caller kfree */
+ return;
+ }
+
+ WRITE_ONCE(rsp->expedited_sequence, rsp->expedited_sequence + 1);
+ smp_mb(); /* Ensure expedited GP seen after counter increment. */
+ WARN_ON_ONCE(!(rsp->expedited_sequence & 0x1));
+
/* Stop each CPU that is online, non-idle, and not us. */
for_each_online_cpu(cpu) {
struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
@@ -3403,26 +3379,12 @@ void synchronize_sched_expedited(void)
continue;
stop_one_cpu(cpu, synchronize_sched_expedited_cpu_stop, NULL);
}
- atomic_long_inc(&rsp->expedited_stoppedcpus);

- /*
- * Everyone up to our most recent fetch is covered by our grace
- * period. Update the counter, but only if our work is still
- * relevant -- which it won't be if someone who started later
- * than we did already did their update.
- */
- do {
- atomic_long_inc(&rsp->expedited_done_tries);
- s = atomic_long_read(&rsp->expedited_done);
- if (ULONG_CMP_GE((ulong)s, (ulong)snap)) {
- /* ensure test happens before caller kfree */
- smp_mb__before_atomic(); /* ^^^ */
- atomic_long_inc(&rsp->expedited_done_lost);
- break;
- }
- } while (atomic_long_cmpxchg(&rsp->expedited_done, s, snap) != s);
- atomic_long_inc(&rsp->expedited_done_exit);
+ smp_mb(); /* Ensure expedited GP seen before counter increment. */
+ WRITE_ONCE(rsp->expedited_sequence, rsp->expedited_sequence + 1);
+ WARN_ON_ONCE(rsp->expedited_sequence & 0x1);
mutex_unlock(&rsp->expedited_mutex);
+ smp_mb(); /* ensure subsequent action seen after grace period. */

put_online_cpus();
}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7c25fe473ad9..6a2b741436de 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -481,17 +481,12 @@ struct rcu_state {
/* End of fields guarded by barrier_mutex. */

struct mutex expedited_mutex; /* Serializes expediting. */
- atomic_long_t expedited_start; /* Starting ticket. */
- atomic_long_t expedited_done; /* Done ticket. */
- atomic_long_t expedited_wrap; /* # near-wrap incidents. */
+ unsigned long expedited_sequence; /* Take a ticket. */
atomic_long_t expedited_tryfail; /* # acquisition failures. */
atomic_long_t expedited_workdone1; /* # done by others #1. */
atomic_long_t expedited_workdone2; /* # done by others #2. */
+ unsigned long expedited_workdone3; /* # done by others #3. */
atomic_long_t expedited_normal; /* # fallbacks to normal. */
- atomic_long_t expedited_stoppedcpus; /* # successful stop_cpus. */
- atomic_long_t expedited_done_tries; /* # tries to update _done. */
- atomic_long_t expedited_done_lost; /* # times beaten to _done. */
- atomic_long_t expedited_done_exit; /* # times exited _done loop. */

unsigned long jiffies_force_qs; /* Time at which to invoke */
/* force_quiescent_state(). */
diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index 3ea7ffc7d5c4..a1ab3a5f6290 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -185,18 +185,14 @@ static int show_rcuexp(struct seq_file *m, void *v)
{
struct rcu_state *rsp = (struct rcu_state *)m->private;

- seq_printf(m, "s=%lu d=%lu w=%lu tf=%lu wd1=%lu wd2=%lu n=%lu sc=%lu dt=%lu dl=%lu dx=%lu\n",
- atomic_long_read(&rsp->expedited_start),
- atomic_long_read(&rsp->expedited_done),
- atomic_long_read(&rsp->expedited_wrap),
+ seq_printf(m, "t=%lu tf=%lu wd1=%lu wd2=%lu wd3=%lu n=%lu sc=%lu\n",
+ rsp->expedited_sequence,
atomic_long_read(&rsp->expedited_tryfail),
atomic_long_read(&rsp->expedited_workdone1),
atomic_long_read(&rsp->expedited_workdone2),
+ rsp->expedited_workdone3,
atomic_long_read(&rsp->expedited_normal),
- atomic_long_read(&rsp->expedited_stoppedcpus),
- atomic_long_read(&rsp->expedited_done_tries),
- atomic_long_read(&rsp->expedited_done_lost),
- atomic_long_read(&rsp->expedited_done_exit));
+ rsp->expedited_sequence / 2);
return 0;
}

--
1.8.1.5

2015-07-17 23:29:37

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 05/19] rcu: Get rid of synchronize_sched_expedited()'s polling loop

From: "Paul E. McKenney" <[email protected]>

This commit gets rid of synchronize_sched_expedited()'s mutex_trylock()
polling loop in favor of a funnel-locking scheme based on the rcu_node
tree. The work-done check is done at each level of the tree, allowing
high-contention situations to be resolved quickly with reasonable levels
of mutex contention.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 95 +++++++++++++++++++++----------------------------
kernel/rcu/tree.h | 8 +++--
kernel/rcu/tree_trace.c | 3 +-
3 files changed, 47 insertions(+), 59 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3c182fdec805..b310b40a49a2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -70,6 +70,7 @@ MODULE_ALIAS("rcutree");

static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS];
+static struct lock_class_key rcu_exp_class[RCU_NUM_LVLS];

/*
* In order to export the rcu_state name to the tracing tools, it
@@ -103,7 +104,6 @@ struct rcu_state sname##_state = { \
.orphan_nxttail = &sname##_state.orphan_nxtlist, \
.orphan_donetail = &sname##_state.orphan_donelist, \
.barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \
- .expedited_mutex = __MUTEX_INITIALIZER(sname##_state.expedited_mutex), \
.name = RCU_STATE_NAME(sname), \
.abbr = sabbr, \
}
@@ -3272,6 +3272,22 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
return 0;
}

+/* Common code for synchronize_sched_expedited() work-done checking. */
+static bool sync_sched_exp_wd(struct rcu_state *rsp, struct rcu_node *rnp,
+ atomic_long_t *stat, unsigned long s)
+{
+ if (ULONG_CMP_GE(READ_ONCE(rsp->expedited_sequence), s)) {
+ if (rnp)
+ mutex_unlock(&rnp->exp_funnel_mutex);
+ /* Ensure test happens before caller kfree(). */
+ smp_mb__before_atomic(); /* ^^^ */
+ atomic_long_inc(stat);
+ put_online_cpus();
+ return true;
+ }
+ return false;
+}
+
/**
* synchronize_sched_expedited - Brute-force RCU-sched grace period
*
@@ -3286,15 +3302,15 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
* This implementation can be thought of as an application of sequence
* locking to expedited grace periods, but using the sequence counter to
* determine when someone else has already done the work instead of for
- * retrying readers. We do a mutex_trylock() polling loop, but if we fail
- * too many times in a row, we fall back to synchronize_sched().
+ * retrying readers.
*/
void synchronize_sched_expedited(void)
{
int cpu;
long s;
- int trycount = 0;
struct rcu_state *rsp = &rcu_sched_state;
+ struct rcu_node *rnp0;
+ struct rcu_node *rnp1 = NULL;

/* Take a snapshot of the sequence number. */
smp_mb(); /* Caller's modifications seen first by other CPUs. */
@@ -3310,60 +3326,25 @@ void synchronize_sched_expedited(void)
WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id()));

/*
- * Each pass through the following loop attempts to acquire
- * ->expedited_mutex, checking for others doing our work each time.
+ * Each pass through the following loop works its way
+ * up the rcu_node tree, returning if others have done the
+ * work or otherwise falls through holding the root rnp's
+ * ->exp_funnel_mutex. The mapping from CPU to rcu_node structure
+ * can be inexact, as it is just promoting locality and is not
+ * strictly needed for correctness.
*/
- while (!mutex_trylock(&rsp->expedited_mutex)) {
- put_online_cpus();
- atomic_long_inc(&rsp->expedited_tryfail);
-
- /* Check to see if someone else did our work for us. */
- if (ULONG_CMP_GE(READ_ONCE(rsp->expedited_sequence), s)) {
- /* ensure test happens before caller kfree */
- smp_mb__before_atomic(); /* ^^^ */
- atomic_long_inc(&rsp->expedited_workdone1);
- return;
- }
-
- /* No joy, try again later. Or just synchronize_sched(). */
- if (trycount++ < 10) {
- udelay(trycount * num_online_cpus());
- } else {
- wait_rcu_gp(call_rcu_sched);
- atomic_long_inc(&rsp->expedited_normal);
+ rnp0 = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
+ for (; rnp0 != NULL; rnp0 = rnp0->parent) {
+ if (sync_sched_exp_wd(rsp, rnp1, &rsp->expedited_workdone1, s))
return;
- }
-
- /* Recheck to see if someone else did our work for us. */
- if (ULONG_CMP_GE(READ_ONCE(rsp->expedited_sequence), s)) {
- /* ensure test happens before caller kfree */
- smp_mb__before_atomic(); /* ^^^ */
- atomic_long_inc(&rsp->expedited_workdone2);
- return;
- }
-
- /*
- * Refetching sync_sched_expedited_started allows later
- * callers to piggyback on our grace period. We retry
- * after they started, so our grace period works for them,
- * and they started after our first try, so their grace
- * period works for us.
- */
- if (!try_get_online_cpus()) {
- /* CPU hotplug operation in flight, use normal GP. */
- wait_rcu_gp(call_rcu_sched);
- atomic_long_inc(&rsp->expedited_normal);
- return;
- }
+ mutex_lock(&rnp0->exp_funnel_mutex);
+ if (rnp1)
+ mutex_unlock(&rnp1->exp_funnel_mutex);
+ rnp1 = rnp0;
}
-
- /* Recheck yet again to see if someone else did our work for us. */
- if (ULONG_CMP_GE(READ_ONCE(rsp->expedited_sequence), s)) {
- rsp->expedited_workdone3++;
- mutex_unlock(&rsp->expedited_mutex);
- smp_mb(); /* ensure test happens before caller kfree */
+ rnp0 = rnp1; /* rcu_get_root(rsp), AKA root rcu_node structure. */
+ if (sync_sched_exp_wd(rsp, rnp0, &rsp->expedited_workdone2, s))
return;
- }

WRITE_ONCE(rsp->expedited_sequence, rsp->expedited_sequence + 1);
smp_mb(); /* Ensure expedited GP seen after counter increment. */
@@ -3383,7 +3364,7 @@ void synchronize_sched_expedited(void)
smp_mb(); /* Ensure expedited GP seen before counter increment. */
WRITE_ONCE(rsp->expedited_sequence, rsp->expedited_sequence + 1);
WARN_ON_ONCE(rsp->expedited_sequence & 0x1);
- mutex_unlock(&rsp->expedited_mutex);
+ mutex_unlock(&rnp0->exp_funnel_mutex);
smp_mb(); /* ensure subsequent action seen after grace period. */

put_online_cpus();
@@ -3940,6 +3921,7 @@ static void __init rcu_init_one(struct rcu_state *rsp,
{
static const char * const buf[] = RCU_NODE_NAME_INIT;
static const char * const fqs[] = RCU_FQS_NAME_INIT;
+ static const char * const exp[] = RCU_EXP_NAME_INIT;
static u8 fl_mask = 0x1;

int levelcnt[RCU_NUM_LVLS]; /* # nodes in each level. */
@@ -3998,6 +3980,9 @@ static void __init rcu_init_one(struct rcu_state *rsp,
rnp->level = i;
INIT_LIST_HEAD(&rnp->blkd_tasks);
rcu_init_one_nocb(rnp);
+ mutex_init(&rnp->exp_funnel_mutex);
+ lockdep_set_class_and_name(&rnp->exp_funnel_mutex,
+ &rcu_exp_class[i], exp[i]);
}
}

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 6a2b741436de..2ef036b356f7 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -68,6 +68,7 @@
# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
# define RCU_NODE_NAME_INIT { "rcu_node_0" }
# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
+# define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
#elif NR_CPUS <= RCU_FANOUT_2
# define RCU_NUM_LVLS 2
# define NUM_RCU_LVL_0 1
@@ -76,6 +77,7 @@
# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
+# define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
#elif NR_CPUS <= RCU_FANOUT_3
# define RCU_NUM_LVLS 3
# define NUM_RCU_LVL_0 1
@@ -85,6 +87,7 @@
# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
+# define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
#elif NR_CPUS <= RCU_FANOUT_4
# define RCU_NUM_LVLS 4
# define NUM_RCU_LVL_0 1
@@ -95,6 +98,7 @@
# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
+# define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
#else
# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
#endif /* #if (NR_CPUS) <= RCU_FANOUT_1 */
@@ -237,6 +241,8 @@ struct rcu_node {
int need_future_gp[2];
/* Counts of upcoming no-CB GP requests. */
raw_spinlock_t fqslock ____cacheline_internodealigned_in_smp;
+
+ struct mutex exp_funnel_mutex ____cacheline_internodealigned_in_smp;
} ____cacheline_internodealigned_in_smp;

/*
@@ -480,12 +486,10 @@ struct rcu_state {
/* _rcu_barrier(). */
/* End of fields guarded by barrier_mutex. */

- struct mutex expedited_mutex; /* Serializes expediting. */
unsigned long expedited_sequence; /* Take a ticket. */
atomic_long_t expedited_tryfail; /* # acquisition failures. */
atomic_long_t expedited_workdone1; /* # done by others #1. */
atomic_long_t expedited_workdone2; /* # done by others #2. */
- unsigned long expedited_workdone3; /* # done by others #3. */
atomic_long_t expedited_normal; /* # fallbacks to normal. */

unsigned long jiffies_force_qs; /* Time at which to invoke */
diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index a1ab3a5f6290..d2aab8dcd58e 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -185,12 +185,11 @@ static int show_rcuexp(struct seq_file *m, void *v)
{
struct rcu_state *rsp = (struct rcu_state *)m->private;

- seq_printf(m, "t=%lu tf=%lu wd1=%lu wd2=%lu wd3=%lu n=%lu sc=%lu\n",
+ seq_printf(m, "t=%lu tf=%lu wd1=%lu wd2=%lu n=%lu sc=%lu\n",
rsp->expedited_sequence,
atomic_long_read(&rsp->expedited_tryfail),
atomic_long_read(&rsp->expedited_workdone1),
atomic_long_read(&rsp->expedited_workdone2),
- rsp->expedited_workdone3,
atomic_long_read(&rsp->expedited_normal),
rsp->expedited_sequence / 2);
return 0;
--
1.8.1.5

2015-07-17 23:29:35

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 06/19] rcu: Make expedited GP CPU stoppage asynchronous

From: Peter Zijlstra <[email protected]>

Sequentially stopping the CPUs slows down expedited grace periods by
at least a factor of two, based on rcutorture's grace-period-per-second
rate. This is a conservative measure because rcutorture uses unusually
long RCU read-side critical sections and because rcutorture periodically
quiesces the system in order to test RCU's ability to ramp down to and
up from the idle state. This commit therefore replaces the stop_one_cpu()
with stop_one_cpu_nowait(), using an atomic-counter scheme to determine
when all CPUs have passed through the stopped state.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 31 +++++++++++++++++--------------
kernel/rcu/tree.h | 6 ++++++
kernel/rcu/tree_trace.c | 3 ++-
3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b310b40a49a2..c5c8509054ef 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3257,18 +3257,11 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu);

static int synchronize_sched_expedited_cpu_stop(void *data)
{
- /*
- * There must be a full memory barrier on each affected CPU
- * between the time that try_stop_cpus() is called and the
- * time that it returns.
- *
- * In the current initial implementation of cpu_stop, the
- * above condition is already met when the control reaches
- * this point and the following smp_mb() is not strictly
- * necessary. Do smp_mb() anyway for documentation and
- * robustness against future implementation changes.
- */
- smp_mb(); /* See above comment block. */
+ struct rcu_state *rsp = data;
+
+ /* We are here: If we are last, do the wakeup. */
+ if (atomic_dec_and_test(&rsp->expedited_need_qs))
+ wake_up(&rsp->expedited_wq);
return 0;
}

@@ -3308,9 +3301,9 @@ void synchronize_sched_expedited(void)
{
int cpu;
long s;
- struct rcu_state *rsp = &rcu_sched_state;
struct rcu_node *rnp0;
struct rcu_node *rnp1 = NULL;
+ struct rcu_state *rsp = &rcu_sched_state;

/* Take a snapshot of the sequence number. */
smp_mb(); /* Caller's modifications seen first by other CPUs. */
@@ -3351,16 +3344,26 @@ void synchronize_sched_expedited(void)
WARN_ON_ONCE(!(rsp->expedited_sequence & 0x1));

/* Stop each CPU that is online, non-idle, and not us. */
+ init_waitqueue_head(&rsp->expedited_wq);
+ atomic_set(&rsp->expedited_need_qs, 1); /* Extra count avoids race. */
for_each_online_cpu(cpu) {
+ struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);

/* Skip our CPU and any idle CPUs. */
if (raw_smp_processor_id() == cpu ||
!(atomic_add_return(0, &rdtp->dynticks) & 0x1))
continue;
- stop_one_cpu(cpu, synchronize_sched_expedited_cpu_stop, NULL);
+ atomic_inc(&rsp->expedited_need_qs);
+ stop_one_cpu_nowait(cpu, synchronize_sched_expedited_cpu_stop,
+ rsp, &rdp->exp_stop_work);
}

+ /* Remove extra count and, if necessary, wait for CPUs to stop. */
+ if (!atomic_dec_and_test(&rsp->expedited_need_qs))
+ wait_event(rsp->expedited_wq,
+ !atomic_read(&rsp->expedited_need_qs));
+
smp_mb(); /* Ensure expedited GP seen before counter increment. */
WRITE_ONCE(rsp->expedited_sequence, rsp->expedited_sequence + 1);
WARN_ON_ONCE(rsp->expedited_sequence & 0x1);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2ef036b356f7..4edc277d08eb 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -27,6 +27,7 @@
#include <linux/threads.h>
#include <linux/cpumask.h>
#include <linux/seqlock.h>
+#include <linux/stop_machine.h>

/*
* Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and
@@ -298,6 +299,9 @@ struct rcu_data {
/* ticks this CPU has handled */
/* during and after the last grace */
/* period it is aware of. */
+ struct cpu_stop_work exp_stop_work;
+ /* Expedited grace-period control */
+ /* for CPU stopping. */

/* 2) batch handling */
/*
@@ -491,6 +495,8 @@ struct rcu_state {
atomic_long_t expedited_workdone1; /* # done by others #1. */
atomic_long_t expedited_workdone2; /* # done by others #2. */
atomic_long_t expedited_normal; /* # fallbacks to normal. */
+ atomic_t expedited_need_qs; /* # CPUs left to check in. */
+ wait_queue_head_t expedited_wq; /* Wait for check-ins. */

unsigned long jiffies_force_qs; /* Time at which to invoke */
/* force_quiescent_state(). */
diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index d2aab8dcd58e..36c04b46d3b8 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -185,12 +185,13 @@ static int show_rcuexp(struct seq_file *m, void *v)
{
struct rcu_state *rsp = (struct rcu_state *)m->private;

- seq_printf(m, "t=%lu tf=%lu wd1=%lu wd2=%lu n=%lu sc=%lu\n",
+ seq_printf(m, "t=%lu tf=%lu wd1=%lu wd2=%lu n=%lu enq=%d sc=%lu\n",
rsp->expedited_sequence,
atomic_long_read(&rsp->expedited_tryfail),
atomic_long_read(&rsp->expedited_workdone1),
atomic_long_read(&rsp->expedited_workdone2),
atomic_long_read(&rsp->expedited_normal),
+ atomic_read(&rsp->expedited_need_qs),
rsp->expedited_sequence / 2);
return 0;
}
--
1.8.1.5

2015-07-17 23:29:39

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 07/19] rcu: Abstract sequence counting from synchronize_sched_expedited()

From: "Paul E. McKenney" <[email protected]>

This commit creates rcu_exp_gp_seq_start() and rcu_exp_gp_seq_end() to
bracket an expedited grace period, rcu_exp_gp_seq_snap() to snapshot the
sequence counter, and rcu_exp_gp_seq_done() to check to see if a full
expedited grace period has elapsed since the snapshot. These will be
applied to synchronize_rcu_expedited(). These are defined in terms of
underlying rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(), rcu_seq_done(),
which will be applied to _rcu_barrier().

One reason that this commit doesn't use the seqcount primitives themselves
is that the smp_wmb() in those primitive is insufficient due to the fact
that expedited grace periods do reads as well as writes. In addition,
the read-side seqcount primitives detect a potentially partial change,
where the expedited primitives instead need a guaranteed full change.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 58 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c5c8509054ef..67fe75725486 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3255,6 +3255,60 @@ void cond_synchronize_rcu(unsigned long oldstate)
}
EXPORT_SYMBOL_GPL(cond_synchronize_rcu);

+/* Adjust sequence number for start of update-side operation. */
+static void rcu_seq_start(unsigned long *sp)
+{
+ WRITE_ONCE(*sp, *sp + 1);
+ smp_mb(); /* Ensure update-side operation after counter increment. */
+ WARN_ON_ONCE(!(*sp & 0x1));
+}
+
+/* Adjust sequence number for end of update-side operation. */
+static void rcu_seq_end(unsigned long *sp)
+{
+ smp_mb(); /* Ensure update-side operation before counter increment. */
+ WRITE_ONCE(*sp, *sp + 1);
+ WARN_ON_ONCE(*sp & 0x1);
+}
+
+/* Take a snapshot of the update side's sequence number. */
+static unsigned long rcu_seq_snap(unsigned long *sp)
+{
+ unsigned long s;
+
+ smp_mb(); /* Caller's modifications seen first by other CPUs. */
+ s = (READ_ONCE(*sp) + 3) & ~0x1;
+ smp_mb(); /* Above access must not bleed into critical section. */
+ return s;
+}
+
+/*
+ * Given a snapshot from rcu_seq_snap(), determine whether or not a
+ * full update-side operation has occurred.
+ */
+static bool rcu_seq_done(unsigned long *sp, unsigned long s)
+{
+ return ULONG_CMP_GE(READ_ONCE(*sp), s);
+}
+
+/* Wrapper functions for expedited grace periods. */
+static void rcu_exp_gp_seq_start(struct rcu_state *rsp)
+{
+ rcu_seq_start(&rsp->expedited_sequence);
+}
+static void rcu_exp_gp_seq_end(struct rcu_state *rsp)
+{
+ rcu_seq_end(&rsp->expedited_sequence);
+}
+static unsigned long rcu_exp_gp_seq_snap(struct rcu_state *rsp)
+{
+ return rcu_seq_snap(&rsp->expedited_sequence);
+}
+static bool rcu_exp_gp_seq_done(struct rcu_state *rsp, unsigned long s)
+{
+ return rcu_seq_done(&rsp->expedited_sequence, s);
+}
+
static int synchronize_sched_expedited_cpu_stop(void *data)
{
struct rcu_state *rsp = data;
@@ -3269,7 +3323,7 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
static bool sync_sched_exp_wd(struct rcu_state *rsp, struct rcu_node *rnp,
atomic_long_t *stat, unsigned long s)
{
- if (ULONG_CMP_GE(READ_ONCE(rsp->expedited_sequence), s)) {
+ if (rcu_exp_gp_seq_done(rsp, s)) {
if (rnp)
mutex_unlock(&rnp->exp_funnel_mutex);
/* Ensure test happens before caller kfree(). */
@@ -3306,9 +3360,7 @@ void synchronize_sched_expedited(void)
struct rcu_state *rsp = &rcu_sched_state;

/* Take a snapshot of the sequence number. */
- smp_mb(); /* Caller's modifications seen first by other CPUs. */
- s = (READ_ONCE(rsp->expedited_sequence) + 3) & ~0x1;
- smp_mb(); /* Above access must not bleed into critical section. */
+ s = rcu_exp_gp_seq_snap(rsp);

if (!try_get_online_cpus()) {
/* CPU hotplug operation in flight, fall back to normal GP. */
@@ -3339,9 +3391,7 @@ void synchronize_sched_expedited(void)
if (sync_sched_exp_wd(rsp, rnp0, &rsp->expedited_workdone2, s))
return;

- WRITE_ONCE(rsp->expedited_sequence, rsp->expedited_sequence + 1);
- smp_mb(); /* Ensure expedited GP seen after counter increment. */
- WARN_ON_ONCE(!(rsp->expedited_sequence & 0x1));
+ rcu_exp_gp_seq_start(rsp);

/* Stop each CPU that is online, non-idle, and not us. */
init_waitqueue_head(&rsp->expedited_wq);
@@ -3364,9 +3414,7 @@ void synchronize_sched_expedited(void)
wait_event(rsp->expedited_wq,
!atomic_read(&rsp->expedited_need_qs));

- smp_mb(); /* Ensure expedited GP seen before counter increment. */
- WRITE_ONCE(rsp->expedited_sequence, rsp->expedited_sequence + 1);
- WARN_ON_ONCE(rsp->expedited_sequence & 0x1);
+ rcu_exp_gp_seq_end(rsp);
mutex_unlock(&rnp0->exp_funnel_mutex);
smp_mb(); /* ensure subsequent action seen after grace period. */

--
1.8.1.5

2015-07-17 23:31:56

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 08/19] rcu: Make synchronize_rcu_expedited() use sequence-counter scheme

From: "Paul E. McKenney" <[email protected]>

Although synchronize_rcu_expedited() uses a sequence-counter scheme, it
is based on a single increment per grace period, which means that tasks
piggybacking off of concurrent grace periods may be forced to wait longer
than necessary. This commit therefore applies the new sequence-count
functions developed for synchronize_sched_expedited() to speed things
up a bit and to consolidate the sequence-counter implementation.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index ef41c1b04ba6..759883f51de7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -536,7 +536,6 @@ void synchronize_rcu(void)
EXPORT_SYMBOL_GPL(synchronize_rcu);

static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
-static unsigned long sync_rcu_preempt_exp_count;
static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);

/*
@@ -704,12 +703,10 @@ void synchronize_rcu_expedited(void)
{
struct rcu_node *rnp;
struct rcu_state *rsp = rcu_state_p;
- unsigned long snap;
+ unsigned long s;
int trycount = 0;

- smp_mb(); /* Caller's modifications seen first by other CPUs. */
- snap = READ_ONCE(sync_rcu_preempt_exp_count) + 1;
- smp_mb(); /* Above access cannot bleed into critical section. */
+ s = rcu_exp_gp_seq_snap(rsp);

/*
* Acquire lock, falling back to synchronize_rcu() if too many
@@ -717,8 +714,7 @@ void synchronize_rcu_expedited(void)
* expedited grace period for us, just leave.
*/
while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
- if (ULONG_CMP_LT(snap,
- READ_ONCE(sync_rcu_preempt_exp_count)))
+ if (rcu_exp_gp_seq_done(rsp, s))
goto mb_ret; /* Others did our work for us. */
if (trycount++ < 10) {
udelay(trycount * num_online_cpus());
@@ -727,8 +723,9 @@ void synchronize_rcu_expedited(void)
return;
}
}
- if (ULONG_CMP_LT(snap, READ_ONCE(sync_rcu_preempt_exp_count)))
+ if (rcu_exp_gp_seq_done(rsp, s))
goto unlock_mb_ret; /* Others did our work for us. */
+ rcu_exp_gp_seq_start(rsp);

/* force all RCU readers onto ->blkd_tasks lists. */
synchronize_sched_expedited();
@@ -750,8 +747,7 @@ void synchronize_rcu_expedited(void)
sync_rcu_preempt_exp_done(rnp));

/* Clean up and exit. */
- smp_mb(); /* ensure expedited GP seen before counter increment. */
- WRITE_ONCE(sync_rcu_preempt_exp_count, sync_rcu_preempt_exp_count + 1);
+ rcu_exp_gp_seq_end(rsp);
unlock_mb_ret:
mutex_unlock(&sync_rcu_preempt_exp_mutex);
mb_ret:
--
1.8.1.5

2015-07-17 23:31:53

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 09/19] rcu: Abstract funnel locking from synchronize_sched_expedited()

From: "Paul E. McKenney" <[email protected]>

This commit abstracts funnel locking from synchronize_sched_expedited()
so that it may be used by synchronize_rcu_expedited().

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 80 ++++++++++++++++++++++++++++++++-----------------------
1 file changed, 47 insertions(+), 33 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 67fe75725486..f79a1c646846 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3309,16 +3309,6 @@ static bool rcu_exp_gp_seq_done(struct rcu_state *rsp, unsigned long s)
return rcu_seq_done(&rsp->expedited_sequence, s);
}

-static int synchronize_sched_expedited_cpu_stop(void *data)
-{
- struct rcu_state *rsp = data;
-
- /* We are here: If we are last, do the wakeup. */
- if (atomic_dec_and_test(&rsp->expedited_need_qs))
- wake_up(&rsp->expedited_wq);
- return 0;
-}
-
/* Common code for synchronize_sched_expedited() work-done checking. */
static bool sync_sched_exp_wd(struct rcu_state *rsp, struct rcu_node *rnp,
atomic_long_t *stat, unsigned long s)
@@ -3335,6 +3325,48 @@ static bool sync_sched_exp_wd(struct rcu_state *rsp, struct rcu_node *rnp,
return false;
}

+/*
+ * Funnel-lock acquisition for expedited grace periods. Returns a
+ * pointer to the root rcu_node structure, or NULL if some other
+ * task did the expedited grace period for us.
+ */
+static struct rcu_node *exp_funnel_lock(struct rcu_state *rsp, unsigned long s)
+{
+ struct rcu_node *rnp0;
+ struct rcu_node *rnp1 = NULL;
+
+ /*
+ * Each pass through the following loop works its way
+ * up the rcu_node tree, returning if others have done the
+ * work or otherwise falls through holding the root rnp's
+ * ->exp_funnel_mutex. The mapping from CPU to rcu_node structure
+ * can be inexact, as it is just promoting locality and is not
+ * strictly needed for correctness.
+ */
+ rnp0 = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
+ for (; rnp0 != NULL; rnp0 = rnp0->parent) {
+ if (sync_sched_exp_wd(rsp, rnp1, &rsp->expedited_workdone1, s))
+ return NULL;
+ mutex_lock(&rnp0->exp_funnel_mutex);
+ if (rnp1)
+ mutex_unlock(&rnp1->exp_funnel_mutex);
+ rnp1 = rnp0;
+ }
+ if (sync_sched_exp_wd(rsp, rnp1, &rsp->expedited_workdone2, s))
+ return NULL;
+ return rnp1;
+}
+
+static int synchronize_sched_expedited_cpu_stop(void *data)
+{
+ struct rcu_state *rsp = data;
+
+ /* We are here: If we are last, do the wakeup. */
+ if (atomic_dec_and_test(&rsp->expedited_need_qs))
+ wake_up(&rsp->expedited_wq);
+ return 0;
+}
+
/**
* synchronize_sched_expedited - Brute-force RCU-sched grace period
*
@@ -3355,8 +3387,7 @@ void synchronize_sched_expedited(void)
{
int cpu;
long s;
- struct rcu_node *rnp0;
- struct rcu_node *rnp1 = NULL;
+ struct rcu_node *rnp;
struct rcu_state *rsp = &rcu_sched_state;

/* Take a snapshot of the sequence number. */
@@ -3370,26 +3401,9 @@ void synchronize_sched_expedited(void)
}
WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id()));

- /*
- * Each pass through the following loop works its way
- * up the rcu_node tree, returning if others have done the
- * work or otherwise falls through holding the root rnp's
- * ->exp_funnel_mutex. The mapping from CPU to rcu_node structure
- * can be inexact, as it is just promoting locality and is not
- * strictly needed for correctness.
- */
- rnp0 = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
- for (; rnp0 != NULL; rnp0 = rnp0->parent) {
- if (sync_sched_exp_wd(rsp, rnp1, &rsp->expedited_workdone1, s))
- return;
- mutex_lock(&rnp0->exp_funnel_mutex);
- if (rnp1)
- mutex_unlock(&rnp1->exp_funnel_mutex);
- rnp1 = rnp0;
- }
- rnp0 = rnp1; /* rcu_get_root(rsp), AKA root rcu_node structure. */
- if (sync_sched_exp_wd(rsp, rnp0, &rsp->expedited_workdone2, s))
- return;
+ rnp = exp_funnel_lock(rsp, s);
+ if (rnp == NULL)
+ return; /* Someone else did our work for us. */

rcu_exp_gp_seq_start(rsp);

@@ -3415,7 +3429,7 @@ void synchronize_sched_expedited(void)
!atomic_read(&rsp->expedited_need_qs));

rcu_exp_gp_seq_end(rsp);
- mutex_unlock(&rnp0->exp_funnel_mutex);
+ mutex_unlock(&rnp->exp_funnel_mutex);
smp_mb(); /* ensure subsequent action seen after grace period. */

put_online_cpus();
--
1.8.1.5

2015-07-17 23:29:36

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 10/19] rcu: Fix synchronize_sched_expedited() type error for "s"

From: "Paul E. McKenney" <[email protected]>

The type of "s" has been "long" rather than the correct "unsigned long"
for quite some time. This commit fixes this type error.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f79a1c646846..094ed8ff82b4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3386,7 +3386,7 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
void synchronize_sched_expedited(void)
{
int cpu;
- long s;
+ unsigned long s;
struct rcu_node *rnp;
struct rcu_state *rsp = &rcu_sched_state;

--
1.8.1.5

2015-07-17 23:35:33

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 11/19] rcu: Use funnel locking for synchronize_rcu_expedited()'s polling loop

From: "Paul E. McKenney" <[email protected]>

This commit gets rid of synchronize_rcu_expedited()'s mutex_trylock()
polling loop in favor of the funnel-locking scheme that was abstracted
from synchronize_sched_expedited().

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 15 ++++++++-------
kernel/rcu/tree_plugin.h | 36 ++++++++++--------------------------
2 files changed, 18 insertions(+), 33 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 094ed8ff82b4..338ea61929bd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3309,9 +3309,9 @@ static bool rcu_exp_gp_seq_done(struct rcu_state *rsp, unsigned long s)
return rcu_seq_done(&rsp->expedited_sequence, s);
}

-/* Common code for synchronize_sched_expedited() work-done checking. */
-static bool sync_sched_exp_wd(struct rcu_state *rsp, struct rcu_node *rnp,
- atomic_long_t *stat, unsigned long s)
+/* Common code for synchronize_{rcu,sched}_expedited() work-done checking. */
+static bool sync_exp_work_done(struct rcu_state *rsp, struct rcu_node *rnp,
+ atomic_long_t *stat, unsigned long s)
{
if (rcu_exp_gp_seq_done(rsp, s)) {
if (rnp)
@@ -3319,7 +3319,6 @@ static bool sync_sched_exp_wd(struct rcu_state *rsp, struct rcu_node *rnp,
/* Ensure test happens before caller kfree(). */
smp_mb__before_atomic(); /* ^^^ */
atomic_long_inc(stat);
- put_online_cpus();
return true;
}
return false;
@@ -3345,14 +3344,14 @@ static struct rcu_node *exp_funnel_lock(struct rcu_state *rsp, unsigned long s)
*/
rnp0 = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
for (; rnp0 != NULL; rnp0 = rnp0->parent) {
- if (sync_sched_exp_wd(rsp, rnp1, &rsp->expedited_workdone1, s))
+ if (sync_exp_work_done(rsp, rnp1, &rsp->expedited_workdone1, s))
return NULL;
mutex_lock(&rnp0->exp_funnel_mutex);
if (rnp1)
mutex_unlock(&rnp1->exp_funnel_mutex);
rnp1 = rnp0;
}
- if (sync_sched_exp_wd(rsp, rnp1, &rsp->expedited_workdone2, s))
+ if (sync_exp_work_done(rsp, rnp1, &rsp->expedited_workdone2, s))
return NULL;
return rnp1;
}
@@ -3402,8 +3401,10 @@ void synchronize_sched_expedited(void)
WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id()));

rnp = exp_funnel_lock(rsp, s);
- if (rnp == NULL)
+ if (rnp == NULL) {
+ put_online_cpus();
return; /* Someone else did our work for us. */
+ }

rcu_exp_gp_seq_start(rsp);

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 759883f51de7..f0d71449ec0c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -536,7 +536,6 @@ void synchronize_rcu(void)
EXPORT_SYMBOL_GPL(synchronize_rcu);

static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
-static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);

/*
* Return non-zero if there are any tasks in RCU read-side critical
@@ -556,7 +555,7 @@ static int rcu_preempted_readers_exp(struct rcu_node *rnp)
* for the current expedited grace period. Works only for preemptible
* RCU -- other RCU implementation use other means.
*
- * Caller must hold sync_rcu_preempt_exp_mutex.
+ * Caller must hold the root rcu_node's exp_funnel_mutex.
*/
static int sync_rcu_preempt_exp_done(struct rcu_node *rnp)
{
@@ -572,7 +571,7 @@ static int sync_rcu_preempt_exp_done(struct rcu_node *rnp)
* recursively up the tree. (Calm down, calm down, we do the recursion
* iteratively!)
*
- * Caller must hold sync_rcu_preempt_exp_mutex.
+ * Caller must hold the root rcu_node's exp_funnel_mutex.
*/
static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
bool wake)
@@ -611,7 +610,7 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
* set the ->expmask bits on the leaf rcu_node structures to tell phase 2
* that work is needed here.
*
- * Caller must hold sync_rcu_preempt_exp_mutex.
+ * Caller must hold the root rcu_node's exp_funnel_mutex.
*/
static void
sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
@@ -654,7 +653,7 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
* invoke rcu_report_exp_rnp() to clear out the upper-level ->expmask bits,
* enabling rcu_read_unlock_special() to do the bit-clearing.
*
- * Caller must hold sync_rcu_preempt_exp_mutex.
+ * Caller must hold the root rcu_node's exp_funnel_mutex.
*/
static void
sync_rcu_preempt_exp_init2(struct rcu_state *rsp, struct rcu_node *rnp)
@@ -702,29 +701,16 @@ sync_rcu_preempt_exp_init2(struct rcu_state *rsp, struct rcu_node *rnp)
void synchronize_rcu_expedited(void)
{
struct rcu_node *rnp;
+ struct rcu_node *rnp_unlock;
struct rcu_state *rsp = rcu_state_p;
unsigned long s;
- int trycount = 0;

s = rcu_exp_gp_seq_snap(rsp);

- /*
- * Acquire lock, falling back to synchronize_rcu() if too many
- * lock-acquisition failures. Of course, if someone does the
- * expedited grace period for us, just leave.
- */
- while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
- if (rcu_exp_gp_seq_done(rsp, s))
- goto mb_ret; /* Others did our work for us. */
- if (trycount++ < 10) {
- udelay(trycount * num_online_cpus());
- } else {
- wait_rcu_gp(call_rcu);
- return;
- }
- }
- if (rcu_exp_gp_seq_done(rsp, s))
- goto unlock_mb_ret; /* Others did our work for us. */
+ rnp_unlock = exp_funnel_lock(rsp, s);
+ if (rnp_unlock == NULL)
+ return; /* Someone else did our work for us. */
+
rcu_exp_gp_seq_start(rsp);

/* force all RCU readers onto ->blkd_tasks lists. */
@@ -748,9 +734,7 @@ void synchronize_rcu_expedited(void)

/* Clean up and exit. */
rcu_exp_gp_seq_end(rsp);
-unlock_mb_ret:
- mutex_unlock(&sync_rcu_preempt_exp_mutex);
-mb_ret:
+ mutex_unlock(&rnp_unlock->exp_funnel_mutex);
smp_mb(); /* ensure subsequent action seen after grace period. */
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
--
1.8.1.5

2015-07-17 23:31:15

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 12/19] rcu: Apply rcu_seq operations to _rcu_barrier()

From: "Paul E. McKenney" <[email protected]>

The rcu_seq operations were open-coded in _rcu_barrier(), so this commit
replaces the open-coding with the shiny new rcu_seq operations.

Signed-off-by: Paul E. McKenney <[email protected]>
---
include/trace/events/rcu.h | 1 -
kernel/rcu/tree.c | 72 ++++++++++++----------------------------------
kernel/rcu/tree.h | 2 +-
kernel/rcu/tree_trace.c | 4 +--
4 files changed, 22 insertions(+), 57 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index c78e88ce5ea3..ef72c4aada56 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -661,7 +661,6 @@ TRACE_EVENT(rcu_torture_read,
* Tracepoint for _rcu_barrier() execution. The string "s" describes
* the _rcu_barrier phase:
* "Begin": _rcu_barrier() started.
- * "Check": _rcu_barrier() checking for piggybacking.
* "EarlyExit": _rcu_barrier() piggybacked, thus early exit.
* "Inc1": _rcu_barrier() piggyback check counter incremented.
* "OfflineNoCB": _rcu_barrier() found callback on never-online CPU
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 338ea61929bd..44245ae4c1c2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3568,10 +3568,10 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
struct rcu_state *rsp = rdp->rsp;

if (atomic_dec_and_test(&rsp->barrier_cpu_count)) {
- _rcu_barrier_trace(rsp, "LastCB", -1, rsp->n_barrier_done);
+ _rcu_barrier_trace(rsp, "LastCB", -1, rsp->barrier_sequence);
complete(&rsp->barrier_completion);
} else {
- _rcu_barrier_trace(rsp, "CB", -1, rsp->n_barrier_done);
+ _rcu_barrier_trace(rsp, "CB", -1, rsp->barrier_sequence);
}
}

@@ -3583,7 +3583,7 @@ static void rcu_barrier_func(void *type)
struct rcu_state *rsp = type;
struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);

- _rcu_barrier_trace(rsp, "IRQ", -1, rsp->n_barrier_done);
+ _rcu_barrier_trace(rsp, "IRQ", -1, rsp->barrier_sequence);
atomic_inc(&rsp->barrier_cpu_count);
rsp->call(&rdp->barrier_head, rcu_barrier_callback);
}
@@ -3596,55 +3596,24 @@ static void _rcu_barrier(struct rcu_state *rsp)
{
int cpu;
struct rcu_data *rdp;
- unsigned long snap = READ_ONCE(rsp->n_barrier_done);
- unsigned long snap_done;
+ unsigned long s = rcu_seq_snap(&rsp->barrier_sequence);

- _rcu_barrier_trace(rsp, "Begin", -1, snap);
+ _rcu_barrier_trace(rsp, "Begin", -1, s);

/* Take mutex to serialize concurrent rcu_barrier() requests. */
mutex_lock(&rsp->barrier_mutex);

- /*
- * Ensure that all prior references, including to ->n_barrier_done,
- * are ordered before the _rcu_barrier() machinery.
- */
- smp_mb(); /* See above block comment. */
-
- /*
- * Recheck ->n_barrier_done to see if others did our work for us.
- * This means checking ->n_barrier_done for an even-to-odd-to-even
- * transition. The "if" expression below therefore rounds the old
- * value up to the next even number and adds two before comparing.
- */
- snap_done = rsp->n_barrier_done;
- _rcu_barrier_trace(rsp, "Check", -1, snap_done);
-
- /*
- * If the value in snap is odd, we needed to wait for the current
- * rcu_barrier() to complete, then wait for the next one, in other
- * words, we need the value of snap_done to be three larger than
- * the value of snap. On the other hand, if the value in snap is
- * even, we only had to wait for the next rcu_barrier() to complete,
- * in other words, we need the value of snap_done to be only two
- * greater than the value of snap. The "(snap + 3) & ~0x1" computes
- * this for us (thank you, Linus!).
- */
- if (ULONG_CMP_GE(snap_done, (snap + 3) & ~0x1)) {
- _rcu_barrier_trace(rsp, "EarlyExit", -1, snap_done);
+ /* Did someone else do our work for us? */
+ if (rcu_seq_done(&rsp->barrier_sequence, s)) {
+ _rcu_barrier_trace(rsp, "EarlyExit", -1, rsp->barrier_sequence);
smp_mb(); /* caller's subsequent code after above check. */
mutex_unlock(&rsp->barrier_mutex);
return;
}

- /*
- * Increment ->n_barrier_done to avoid duplicate work. Use
- * WRITE_ONCE() to prevent the compiler from speculating
- * the increment to precede the early-exit check.
- */
- WRITE_ONCE(rsp->n_barrier_done, rsp->n_barrier_done + 1);
- WARN_ON_ONCE((rsp->n_barrier_done & 0x1) != 1);
- _rcu_barrier_trace(rsp, "Inc1", -1, rsp->n_barrier_done);
- smp_mb(); /* Order ->n_barrier_done increment with below mechanism. */
+ /* Mark the start of the barrier operation. */
+ rcu_seq_start(&rsp->barrier_sequence);
+ _rcu_barrier_trace(rsp, "Inc1", -1, rsp->barrier_sequence);

/*
* Initialize the count to one rather than to zero in order to
@@ -3668,10 +3637,10 @@ static void _rcu_barrier(struct rcu_state *rsp)
if (rcu_is_nocb_cpu(cpu)) {
if (!rcu_nocb_cpu_needs_barrier(rsp, cpu)) {
_rcu_barrier_trace(rsp, "OfflineNoCB", cpu,
- rsp->n_barrier_done);
+ rsp->barrier_sequence);
} else {
_rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
- rsp->n_barrier_done);
+ rsp->barrier_sequence);
smp_mb__before_atomic();
atomic_inc(&rsp->barrier_cpu_count);
__call_rcu(&rdp->barrier_head,
@@ -3679,11 +3648,11 @@ static void _rcu_barrier(struct rcu_state *rsp)
}
} else if (READ_ONCE(rdp->qlen)) {
_rcu_barrier_trace(rsp, "OnlineQ", cpu,
- rsp->n_barrier_done);
+ rsp->barrier_sequence);
smp_call_function_single(cpu, rcu_barrier_func, rsp, 1);
} else {
_rcu_barrier_trace(rsp, "OnlineNQ", cpu,
- rsp->n_barrier_done);
+ rsp->barrier_sequence);
}
}
put_online_cpus();
@@ -3695,16 +3664,13 @@ static void _rcu_barrier(struct rcu_state *rsp)
if (atomic_dec_and_test(&rsp->barrier_cpu_count))
complete(&rsp->barrier_completion);

- /* Increment ->n_barrier_done to prevent duplicate work. */
- smp_mb(); /* Keep increment after above mechanism. */
- WRITE_ONCE(rsp->n_barrier_done, rsp->n_barrier_done + 1);
- WARN_ON_ONCE((rsp->n_barrier_done & 0x1) != 0);
- _rcu_barrier_trace(rsp, "Inc2", -1, rsp->n_barrier_done);
- smp_mb(); /* Keep increment before caller's subsequent code. */
-
/* Wait for all rcu_barrier_callback() callbacks to be invoked. */
wait_for_completion(&rsp->barrier_completion);

+ /* Mark the end of the barrier operation. */
+ _rcu_barrier_trace(rsp, "Inc2", -1, rsp->barrier_sequence);
+ rcu_seq_end(&rsp->barrier_sequence);
+
/* Other rcu_barrier() invocations can now safely proceed. */
mutex_unlock(&rsp->barrier_mutex);
}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4edc277d08eb..5c1042d9c310 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -486,7 +486,7 @@ struct rcu_state {
struct mutex barrier_mutex; /* Guards barrier fields. */
atomic_t barrier_cpu_count; /* # CPUs waiting on. */
struct completion barrier_completion; /* Wake at barrier end. */
- unsigned long n_barrier_done; /* ++ at start and end of */
+ unsigned long barrier_sequence; /* ++ at start and end of */
/* _rcu_barrier(). */
/* End of fields guarded by barrier_mutex. */

diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index 36c04b46d3b8..d9982a2ce305 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -81,9 +81,9 @@ static void r_stop(struct seq_file *m, void *v)
static int show_rcubarrier(struct seq_file *m, void *v)
{
struct rcu_state *rsp = (struct rcu_state *)m->private;
- seq_printf(m, "bcc: %d nbd: %lu\n",
+ seq_printf(m, "bcc: %d bseq: %lu\n",
atomic_read(&rsp->barrier_cpu_count),
- rsp->n_barrier_done);
+ rsp->barrier_sequence);
return 0;
}

--
1.8.1.5

2015-07-17 23:35:31

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 13/19] rcu: Consolidate last open-coded expedited memory barrier

From: "Paul E. McKenney" <[email protected]>

One of the requirements on RCU grace periods is that if there is a
causal chain of operations that starts after one grace period and
ends before another grace period, then the two grace periods must
be serialized. There has been (and might still be) code that relies
on this, for example, certain types of reference-counting code that
does a call_rcu() within an RCU callback function.

This requirement is why there is an smp_mb() at the end of both
synchronize_sched_expedited() and synchronize_rcu_expedited().
However, this is the only smp_mb() in these functions, so it would
be nicer to consolidate it into rcu_exp_gp_seq_end(). This commit
does just that.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
kernel/rcu/tree_plugin.h | 1 -
2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 44245ae4c1c2..a905d3ba8673 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3299,6 +3299,7 @@ static void rcu_exp_gp_seq_start(struct rcu_state *rsp)
static void rcu_exp_gp_seq_end(struct rcu_state *rsp)
{
rcu_seq_end(&rsp->expedited_sequence);
+ smp_mb(); /* Ensure that consecutive grace periods serialize. */
}
static unsigned long rcu_exp_gp_seq_snap(struct rcu_state *rsp)
{
@@ -3431,7 +3432,6 @@ void synchronize_sched_expedited(void)

rcu_exp_gp_seq_end(rsp);
mutex_unlock(&rnp->exp_funnel_mutex);
- smp_mb(); /* ensure subsequent action seen after grace period. */

put_online_cpus();
}
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index f0d71449ec0c..27b714601c6e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -735,7 +735,6 @@ void synchronize_rcu_expedited(void)
/* Clean up and exit. */
rcu_exp_gp_seq_end(rsp);
mutex_unlock(&rnp_unlock->exp_funnel_mutex);
- smp_mb(); /* ensure subsequent action seen after grace period. */
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);

--
1.8.1.5

2015-07-17 23:31:49

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 14/19] rcu: Extend expedited funnel locking to rcu_data structure

From: "Paul E. McKenney" <[email protected]>

The strictly rcu_node based funnel-locking scheme works well in many
cases, but systems with CONFIG_RCU_FANOUT_LEAF=64 won't necessarily get
all that much concurrency. This commit therefore extends the funnel
locking into the per-CPU rcu_data structure, providing concurrency equal
to the number of CPUs.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 19 ++++++++++++++++---
kernel/rcu/tree.h | 4 +++-
kernel/rcu/tree_trace.c | 3 ++-
3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a905d3ba8673..e45097fc39fa 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3312,11 +3312,14 @@ static bool rcu_exp_gp_seq_done(struct rcu_state *rsp, unsigned long s)

/* Common code for synchronize_{rcu,sched}_expedited() work-done checking. */
static bool sync_exp_work_done(struct rcu_state *rsp, struct rcu_node *rnp,
+ struct rcu_data *rdp,
atomic_long_t *stat, unsigned long s)
{
if (rcu_exp_gp_seq_done(rsp, s)) {
if (rnp)
mutex_unlock(&rnp->exp_funnel_mutex);
+ else if (rdp)
+ mutex_unlock(&rdp->exp_funnel_mutex);
/* Ensure test happens before caller kfree(). */
smp_mb__before_atomic(); /* ^^^ */
atomic_long_inc(stat);
@@ -3332,6 +3335,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, struct rcu_node *rnp,
*/
static struct rcu_node *exp_funnel_lock(struct rcu_state *rsp, unsigned long s)
{
+ struct rcu_data *rdp;
struct rcu_node *rnp0;
struct rcu_node *rnp1 = NULL;

@@ -3343,16 +3347,24 @@ static struct rcu_node *exp_funnel_lock(struct rcu_state *rsp, unsigned long s)
* can be inexact, as it is just promoting locality and is not
* strictly needed for correctness.
*/
- rnp0 = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
+ rdp = per_cpu_ptr(rsp->rda, raw_smp_processor_id());
+ if (sync_exp_work_done(rsp, NULL, NULL, &rsp->expedited_workdone1, s))
+ return NULL;
+ mutex_lock(&rdp->exp_funnel_mutex);
+ rnp0 = rdp->mynode;
for (; rnp0 != NULL; rnp0 = rnp0->parent) {
- if (sync_exp_work_done(rsp, rnp1, &rsp->expedited_workdone1, s))
+ if (sync_exp_work_done(rsp, rnp1, rdp,
+ &rsp->expedited_workdone2, s))
return NULL;
mutex_lock(&rnp0->exp_funnel_mutex);
if (rnp1)
mutex_unlock(&rnp1->exp_funnel_mutex);
+ else
+ mutex_unlock(&rdp->exp_funnel_mutex);
rnp1 = rnp0;
}
- if (sync_exp_work_done(rsp, rnp1, &rsp->expedited_workdone2, s))
+ if (sync_exp_work_done(rsp, rnp1, rdp,
+ &rsp->expedited_workdone3, s))
return NULL;
return rnp1;
}
@@ -3733,6 +3745,7 @@ rcu_boot_init_percpu_data(int cpu, struct rcu_state *rsp)
WARN_ON_ONCE(atomic_read(&rdp->dynticks->dynticks) != 1);
rdp->cpu = cpu;
rdp->rsp = rsp;
+ mutex_init(&rdp->exp_funnel_mutex);
rcu_boot_init_nocb_percpu_data(rdp);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 5c1042d9c310..efee84ce1e08 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -364,11 +364,12 @@ struct rcu_data {
unsigned long n_rp_nocb_defer_wakeup;
unsigned long n_rp_need_nothing;

- /* 6) _rcu_barrier() and OOM callbacks. */
+ /* 6) _rcu_barrier(), OOM callbacks, and expediting. */
struct rcu_head barrier_head;
#ifdef CONFIG_RCU_FAST_NO_HZ
struct rcu_head oom_head;
#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
+ struct mutex exp_funnel_mutex;

/* 7) Callback offloading. */
#ifdef CONFIG_RCU_NOCB_CPU
@@ -494,6 +495,7 @@ struct rcu_state {
atomic_long_t expedited_tryfail; /* # acquisition failures. */
atomic_long_t expedited_workdone1; /* # done by others #1. */
atomic_long_t expedited_workdone2; /* # done by others #2. */
+ atomic_long_t expedited_workdone3; /* # done by others #3. */
atomic_long_t expedited_normal; /* # fallbacks to normal. */
atomic_t expedited_need_qs; /* # CPUs left to check in. */
wait_queue_head_t expedited_wq; /* Wait for check-ins. */
diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index d9982a2ce305..ec62369f1b02 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -185,11 +185,12 @@ static int show_rcuexp(struct seq_file *m, void *v)
{
struct rcu_state *rsp = (struct rcu_state *)m->private;

- seq_printf(m, "t=%lu tf=%lu wd1=%lu wd2=%lu n=%lu enq=%d sc=%lu\n",
+ seq_printf(m, "t=%lu tf=%lu wd1=%lu wd2=%lu wd3=%lu n=%lu enq=%d sc=%lu\n",
rsp->expedited_sequence,
atomic_long_read(&rsp->expedited_tryfail),
atomic_long_read(&rsp->expedited_workdone1),
atomic_long_read(&rsp->expedited_workdone2),
+ atomic_long_read(&rsp->expedited_workdone3),
atomic_long_read(&rsp->expedited_normal),
atomic_read(&rsp->expedited_need_qs),
rsp->expedited_sequence / 2);
--
1.8.1.5

2015-07-17 23:31:55

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 15/19] rcu: Add stall warnings to synchronize_sched_expedited()

From: "Paul E. McKenney" <[email protected]>

Although synchronize_sched_expedited() historically has no RCU CPU stall
warnings, the availability of the rcupdate.rcu_expedited boot parameter
invalidates the old assumption that synchronize_sched()'s stall warnings
would suffice. This commit therefore adds RCU CPU stall warnings to
synchronize_sched_expedited().

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++----
kernel/rcu/tree.h | 1 +
2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e45097fc39fa..4b6594c7db58 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3369,16 +3369,65 @@ static struct rcu_node *exp_funnel_lock(struct rcu_state *rsp, unsigned long s)
return rnp1;
}

+/* Invoked on each online non-idle CPU for expedited quiescent state. */
static int synchronize_sched_expedited_cpu_stop(void *data)
{
- struct rcu_state *rsp = data;
+ struct rcu_data *rdp = data;
+ struct rcu_state *rsp = rdp->rsp;

/* We are here: If we are last, do the wakeup. */
+ rdp->exp_done = true;
if (atomic_dec_and_test(&rsp->expedited_need_qs))
wake_up(&rsp->expedited_wq);
return 0;
}

+static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
+{
+ int cpu;
+ unsigned long jiffies_stall;
+ unsigned long jiffies_start;
+ struct rcu_data *rdp;
+ int ret;
+
+ jiffies_stall = rcu_jiffies_till_stall_check();
+ jiffies_start = jiffies;
+
+ for (;;) {
+ ret = wait_event_interruptible_timeout(
+ rsp->expedited_wq,
+ !atomic_read(&rsp->expedited_need_qs),
+ jiffies_stall);
+ if (ret > 0)
+ return;
+ if (ret < 0) {
+ /* Hit a signal, disable CPU stall warnings. */
+ wait_event(rsp->expedited_wq,
+ !atomic_read(&rsp->expedited_need_qs));
+ return;
+ }
+ pr_err("INFO: %s detected expedited stalls on CPUs: {",
+ rsp->name);
+ for_each_online_cpu(cpu) {
+ rdp = per_cpu_ptr(rsp->rda, cpu);
+
+ if (rdp->exp_done)
+ continue;
+ pr_cont(" %d", cpu);
+ }
+ pr_cont(" } %lu jiffies s: %lu\n",
+ jiffies - jiffies_start, rsp->expedited_sequence);
+ for_each_online_cpu(cpu) {
+ rdp = per_cpu_ptr(rsp->rda, cpu);
+
+ if (rdp->exp_done)
+ continue;
+ dump_cpu_task(cpu);
+ }
+ jiffies_stall = 3 * rcu_jiffies_till_stall_check() + 3;
+ }
+}
+
/**
* synchronize_sched_expedited - Brute-force RCU-sched grace period
*
@@ -3428,19 +3477,20 @@ void synchronize_sched_expedited(void)
struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);

+ rdp->exp_done = false;
+
/* Skip our CPU and any idle CPUs. */
if (raw_smp_processor_id() == cpu ||
!(atomic_add_return(0, &rdtp->dynticks) & 0x1))
continue;
atomic_inc(&rsp->expedited_need_qs);
stop_one_cpu_nowait(cpu, synchronize_sched_expedited_cpu_stop,
- rsp, &rdp->exp_stop_work);
+ rdp, &rdp->exp_stop_work);
}

/* Remove extra count and, if necessary, wait for CPUs to stop. */
if (!atomic_dec_and_test(&rsp->expedited_need_qs))
- wait_event(rsp->expedited_wq,
- !atomic_read(&rsp->expedited_need_qs));
+ synchronize_sched_expedited_wait(rsp);

rcu_exp_gp_seq_end(rsp);
mutex_unlock(&rnp->exp_funnel_mutex);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index efee84ce1e08..b3ae8d3cffbc 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -370,6 +370,7 @@ struct rcu_data {
struct rcu_head oom_head;
#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
struct mutex exp_funnel_mutex;
+ bool exp_done; /* Expedited QS for this CPU? */

/* 7) Callback offloading. */
#ifdef CONFIG_RCU_NOCB_CPU
--
1.8.1.5

2015-07-17 23:31:50

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 16/19] documentation: Describe new expedited stall warnings

From: "Paul E. McKenney" <[email protected]>

Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/stallwarn.txt | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 046f32637b95..efb9454875ab 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -163,6 +163,23 @@ message will be about three times the interval between the beginning
of the stall and the first message.


+Stall Warnings for Expedited Grace Periods
+
+If an expedited grace period detects a stall, it will place a message
+like the following in dmesg:
+
+ INFO: rcu_sched detected expedited stalls on CPUs: { 1 2 6 } 26009 jiffies s: 1043
+
+This indicates that CPUs 1, 2, and 6 have failed to respond to a
+reschedule IPI, that the expedited grace period has been going on for
+26,009 jiffies, and that the expedited grace-period sequence counter is
+1043. The fact that this last value is odd indicates that an expedited
+grace period is in flight.
+
+It is entirely possible to see stall warnings from normal and from
+expedited grace periods at about the same time from the same run.
+
+
What Causes RCU CPU Stall Warnings?

So your kernel printed an RCU CPU stall warning. The next question is
--
1.8.1.5

2015-07-17 23:30:45

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 17/19] rcu: Pull out wait_event*() condition into helper function

From: "Paul E. McKenney" <[email protected]>

The condition for the wait_event_interruptible_timeout() that waits
to do the next force-quiescent-state scan is a bit ornate:

((gf = READ_ONCE(rsp->gp_flags)) &
RCU_GP_FLAG_FQS) ||
(!READ_ONCE(rnp->qsmask) &&
!rcu_preempt_blocked_readers_cgp(rnp))

This commit therefore pulls this condition out into a helper function
and comments its component conditions.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4b6594c7db58..b2803730ac13 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1904,6 +1904,26 @@ static int rcu_gp_init(struct rcu_state *rsp)
}

/*
+ * Helper function for wait_event_interruptible_timeout() wakeup
+ * at force-quiescent-state time.
+ */
+static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp, int *gfp)
+{
+ struct rcu_node *rnp = rcu_get_root(rsp);
+
+ /* Someone like call_rcu() requested a force-quiescent-state scan. */
+ *gfp = READ_ONCE(rsp->gp_flags);
+ if (*gfp & RCU_GP_FLAG_FQS)
+ return true;
+
+ /* The current grace period has completed. */
+ if (!READ_ONCE(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp))
+ return true;
+
+ return false;
+}
+
+/*
* Do one round of quiescent-state forcing.
*/
static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
@@ -2067,11 +2087,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
TPS("fqswait"));
rsp->gp_state = RCU_GP_WAIT_FQS;
ret = wait_event_interruptible_timeout(rsp->gp_wq,
- ((gf = READ_ONCE(rsp->gp_flags)) &
- RCU_GP_FLAG_FQS) ||
- (!READ_ONCE(rnp->qsmask) &&
- !rcu_preempt_blocked_readers_cgp(rnp)),
- j);
+ rcu_gp_fqs_check_wake(rsp, &gf), j);
rsp->gp_state = RCU_GP_DONE_FQS;
/* Locking provides needed memory barriers. */
/* If grace period done, leave loop. */
--
1.8.1.5

2015-07-17 23:35:29

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 18/19] rcu: Rename RCU_GP_DONE_FQS to RCU_GP_DOING_FQS

From: "Paul E. McKenney" <[email protected]>

The grace-period kthread sleeps waiting to do a force-quiescent-state
scan, and when awakened sets rsp->gp_state to RCU_GP_DONE_FQS.
However, this is confusing because the kthread has not done the
force-quiescent-state, but is instead just starting to do it. This commit
therefore renames RCU_GP_DONE_FQS to RCU_GP_DOING_FQS in order to make
things a bit easier on reviewers.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
kernel/rcu/tree.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b2803730ac13..f66f6e7730bc 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2088,7 +2088,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
rsp->gp_state = RCU_GP_WAIT_FQS;
ret = wait_event_interruptible_timeout(rsp->gp_wq,
rcu_gp_fqs_check_wake(rsp, &gf), j);
- rsp->gp_state = RCU_GP_DONE_FQS;
+ rsp->gp_state = RCU_GP_DOING_FQS;
/* Locking provides needed memory barriers. */
/* If grace period done, leave loop. */
if (!READ_ONCE(rnp->qsmask) &&
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index b3ae8d3cffbc..543ba726396c 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -535,7 +535,7 @@ struct rcu_state {
#define RCU_GP_WAIT_GPS 1 /* Wait for grace-period start. */
#define RCU_GP_DONE_GPS 2 /* Wait done for grace-period start. */
#define RCU_GP_WAIT_FQS 3 /* Wait for force-quiescent-state time. */
-#define RCU_GP_DONE_FQS 4 /* Wait done for force-quiescent-state time. */
+#define RCU_GP_DOING_FQS 4 /* Wait done for force-quiescent-state time. */
#define RCU_GP_CLEANUP 5 /* Grace-period cleanup started. */
#define RCU_GP_CLEANED 6 /* Grace-period cleanup complete. */

--
1.8.1.5

2015-07-17 23:31:52

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

From: "Paul E. McKenney" <[email protected]>

In the common case, there will be only one expedited grace period in
the system at a given time, in which case it is not helpful to use
funnel locking. This commit therefore adds a fastpath that bypasses
funnel locking when the root ->exp_funnel_mutex is not held.

Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/trace.txt | 36 ++++++++++--------------------------
kernel/rcu/tree.c | 16 ++++++++++++++++
kernel/rcu/tree.h | 2 +-
kernel/rcu/tree_trace.c | 4 ++--
4 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index 08651da15448..97f17e9decda 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -237,42 +237,26 @@ o "ktl" is the low-order 16 bits (in hexadecimal) of the count of

The output of "cat rcu/rcu_preempt/rcuexp" looks as follows:

-s=21872 d=21872 w=0 tf=0 wd1=0 wd2=0 n=0 sc=21872 dt=21872 dl=0 dx=21872
+s=21872 wd0=0 wd1=0 wd2=0 wd3=5 n=0 enq=0 sc=21872

These fields are as follows:

-o "s" is the starting sequence number.
+o "s" is the sequence number, with an odd number indicating that
+ an expedited grace period is in progress.

-o "d" is the ending sequence number. When the starting and ending
- numbers differ, there is an expedited grace period in progress.
-
-o "w" is the number of times that the sequence numbers have been
- in danger of wrapping.
-
-o "tf" is the number of times that contention has resulted in a
- failure to begin an expedited grace period.
-
-o "wd1" and "wd2" are the number of times that an attempt to
- start an expedited grace period found that someone else had
- completed an expedited grace period that satisfies the
+o "wd0", "wd1", "wd2", and "wd3" are the number of times that an
+ attempt to start an expedited grace period found that someone
+ else had completed an expedited grace period that satisfies the
attempted request. "Our work is done."

-o "n" is number of times that contention was so great that
- the request was demoted from an expedited grace period to
- a normal grace period.
+o "n" is number of times that a concurrent CPU-hotplug operation
+ forced a fallback to a normal grace period.
+
+o "enq" is the number of quiescent states still outstanding.

o "sc" is the number of times that the attempt to start a
new expedited grace period succeeded.

-o "dt" is the number of times that we attempted to update
- the "d" counter.
-
-o "dl" is the number of times that we failed to update the "d"
- counter.
-
-o "dx" is the number of times that we succeeded in updating
- the "d" counter.
-

The output of "cat rcu/rcu_preempt/rcugp" looks as follows:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f66f6e7730bc..3af0dee2d045 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3356,6 +3356,22 @@ static struct rcu_node *exp_funnel_lock(struct rcu_state *rsp, unsigned long s)
struct rcu_node *rnp1 = NULL;

/*
+ * First try directly acquiring the root lock in order to reduce
+ * latency in the common case where expedited grace periods are
+ * rare. We check mutex_is_locked() to avoid pathological levels of
+ * memory contention on ->exp_funnel_mutex in the heavy-load case.
+ */
+ rnp0 = rcu_get_root(rsp);
+ if (!mutex_is_locked(&rnp0->exp_funnel_mutex)) {
+ if (mutex_trylock(&rnp0->exp_funnel_mutex)) {
+ if (sync_exp_work_done(rsp, rnp0, NULL,
+ &rsp->expedited_workdone0, s))
+ return NULL;
+ return rnp0;
+ }
+ }
+
+ /*
* Each pass through the following loop works its way
* up the rcu_node tree, returning if others have done the
* work or otherwise falls through holding the root rnp's
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 543ba726396c..80d974df0ea0 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -493,7 +493,7 @@ struct rcu_state {
/* End of fields guarded by barrier_mutex. */

unsigned long expedited_sequence; /* Take a ticket. */
- atomic_long_t expedited_tryfail; /* # acquisition failures. */
+ atomic_long_t expedited_workdone0; /* # done by others #0. */
atomic_long_t expedited_workdone1; /* # done by others #1. */
atomic_long_t expedited_workdone2; /* # done by others #2. */
atomic_long_t expedited_workdone3; /* # done by others #3. */
diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index ec62369f1b02..6fc4c5ff3bb5 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -185,9 +185,9 @@ static int show_rcuexp(struct seq_file *m, void *v)
{
struct rcu_state *rsp = (struct rcu_state *)m->private;

- seq_printf(m, "t=%lu tf=%lu wd1=%lu wd2=%lu wd3=%lu n=%lu enq=%d sc=%lu\n",
+ seq_printf(m, "s=%lu wd0=%lu wd1=%lu wd2=%lu wd3=%lu n=%lu enq=%d sc=%lu\n",
rsp->expedited_sequence,
- atomic_long_read(&rsp->expedited_tryfail),
+ atomic_long_read(&rsp->expedited_workdone0),
atomic_long_read(&rsp->expedited_workdone1),
atomic_long_read(&rsp->expedited_workdone2),
atomic_long_read(&rsp->expedited_workdone3),
--
1.8.1.5

2015-07-30 12:50:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 02/19] rcu: Remove CONFIG_RCU_CPU_STALL_INFO

On Fri, Jul 17, 2015 at 04:29:07PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> The CONFIG_RCU_CPU_STALL_INFO has been default-y for a couple of
> releases with no complaints, so it is time to eliminate this Kconfig
> option entirely, so that the long-form RCU CPU stall warnings cannot
> be disabled. This commit does just that.

I would think the tiny people (/me looks @ Josh) would complain about
this..

2015-07-30 14:45:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

On Fri, Jul 17, 2015 at 04:29:24PM -0700, Paul E. McKenney wrote:

> /*
> + * First try directly acquiring the root lock in order to reduce
> + * latency in the common case where expedited grace periods are
> + * rare. We check mutex_is_locked() to avoid pathological levels of
> + * memory contention on ->exp_funnel_mutex in the heavy-load case.
> + */
> + rnp0 = rcu_get_root(rsp);
> + if (!mutex_is_locked(&rnp0->exp_funnel_mutex)) {
> + if (mutex_trylock(&rnp0->exp_funnel_mutex)) {
> + if (sync_exp_work_done(rsp, rnp0, NULL,
> + &rsp->expedited_workdone0, s))
> + return NULL;
> + return rnp0;
> + }
> + }

So our 'new' locking primitives do things like:

static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
if (!atomic_read(&lock->val) &&
(atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
return 1;
return 0;
}

mutexes do not do this.

Now I suppose the question is, does that extra read slow down the
(common) uncontended case? (remember, we should optimize locks for the
uncontended case, heavy lock contention should be fixed with better
locking schemes, not lock implementations).

Davidlohr, Waiman, do we have data on this?

If the extra read before the cmpxchg() does not hurt, we should do the
same for mutex and make the above redundant.

2015-07-30 15:13:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 02/19] rcu: Remove CONFIG_RCU_CPU_STALL_INFO

On Thu, Jul 30, 2015 at 02:49:55PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2015 at 04:29:07PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > The CONFIG_RCU_CPU_STALL_INFO has been default-y for a couple of
> > releases with no complaints, so it is time to eliminate this Kconfig
> > option entirely, so that the long-form RCU CPU stall warnings cannot
> > be disabled. This commit does just that.
>
> I would think the tiny people (/me looks @ Josh) would complain about
> this..

They might, but if they do, I will point out that CONFIG_RCU_CPU_STALL_INFO
only has effect for Tree RCU. And also that RCU CPU stall warnings are
compiled out unless CONFIG_RCU_TRACE=y, which I suspect they are disabling.

All that aside, I do appreciate the attention to tininess.

Thanx, Paul

2015-07-30 15:31:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 02/19] rcu: Remove CONFIG_RCU_CPU_STALL_INFO

On Thu, Jul 30, 2015 at 08:13:41AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 30, 2015 at 02:49:55PM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 17, 2015 at 04:29:07PM -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <[email protected]>
> > >
> > > The CONFIG_RCU_CPU_STALL_INFO has been default-y for a couple of
> > > releases with no complaints, so it is time to eliminate this Kconfig
> > > option entirely, so that the long-form RCU CPU stall warnings cannot
> > > be disabled. This commit does just that.
> >
> > I would think the tiny people (/me looks @ Josh) would complain about
> > this..
>
> They might, but if they do, I will point out that CONFIG_RCU_CPU_STALL_INFO
> only has effect for Tree RCU. And also that RCU CPU stall warnings are
> compiled out unless CONFIG_RCU_TRACE=y, which I suspect they are disabling.

Ah, fair enough.

2015-07-30 15:35:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

On Thu, Jul 30, 2015 at 04:44:55PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2015 at 04:29:24PM -0700, Paul E. McKenney wrote:
>
> > /*
> > + * First try directly acquiring the root lock in order to reduce
> > + * latency in the common case where expedited grace periods are
> > + * rare. We check mutex_is_locked() to avoid pathological levels of
> > + * memory contention on ->exp_funnel_mutex in the heavy-load case.
> > + */
> > + rnp0 = rcu_get_root(rsp);
> > + if (!mutex_is_locked(&rnp0->exp_funnel_mutex)) {
> > + if (mutex_trylock(&rnp0->exp_funnel_mutex)) {
> > + if (sync_exp_work_done(rsp, rnp0, NULL,
> > + &rsp->expedited_workdone0, s))
> > + return NULL;
> > + return rnp0;
> > + }
> > + }
>
> So our 'new' locking primitives do things like:
>
> static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> {
> if (!atomic_read(&lock->val) &&
> (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> return 1;
> return 0;
> }
>
> mutexes do not do this.
>
> Now I suppose the question is, does that extra read slow down the
> (common) uncontended case? (remember, we should optimize locks for the
> uncontended case, heavy lock contention should be fixed with better
> locking schemes, not lock implementations).
>
> Davidlohr, Waiman, do we have data on this?
>
> If the extra read before the cmpxchg() does not hurt, we should do the
> same for mutex and make the above redundant.

I am pretty sure that different hardware wants it done differently. :-/
So I agree that hard data would be good.

I could probably further optimize the RCU code by checking for a
single-node tree, but I am not convinced that this is worthwhile.
However, skipping three cache misses in the uncontended case is
definitely worthwhile, hence this patch. ;-)

Thanx, Paul

2015-07-30 15:40:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

On Thu, Jul 30, 2015 at 08:34:52AM -0700, Paul E. McKenney wrote:
> > If the extra read before the cmpxchg() does not hurt, we should do the
> > same for mutex and make the above redundant.
>
> I am pretty sure that different hardware wants it done differently. :-/
> So I agree that hard data would be good.
>
> I could probably further optimize the RCU code by checking for a
> single-node tree, but I am not convinced that this is worthwhile.
> However, skipping three cache misses in the uncontended case is
> definitely worthwhile, hence this patch. ;-)

I was mostly talking about the !mutex_is_locked() && mutex_try_lock()
thing. The fast path thing makes sense.

2015-07-30 15:45:13

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 02/19] rcu: Remove CONFIG_RCU_CPU_STALL_INFO

On Thu, Jul 30, 2015 at 08:13:41AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 30, 2015 at 02:49:55PM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 17, 2015 at 04:29:07PM -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <[email protected]>
> > >
> > > The CONFIG_RCU_CPU_STALL_INFO has been default-y for a couple of
> > > releases with no complaints, so it is time to eliminate this Kconfig
> > > option entirely, so that the long-form RCU CPU stall warnings cannot
> > > be disabled. This commit does just that.
> >
> > I would think the tiny people (/me looks @ Josh) would complain about
> > this..
>
> They might, but if they do, I will point out that CONFIG_RCU_CPU_STALL_INFO
> only has effect for Tree RCU. And also that RCU CPU stall warnings are
> compiled out unless CONFIG_RCU_TRACE=y, which I suspect they are disabling.

You took the words right out of my mouth. :)

> All that aside, I do appreciate the attention to tininess.

Likewise!

2015-07-30 16:34:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

On Thu, Jul 30, 2015 at 08:34:52AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 30, 2015 at 04:44:55PM +0200, Peter Zijlstra wrote:

> > If the extra read before the cmpxchg() does not hurt, we should do the
> > same for mutex and make the above redundant.
>
> I am pretty sure that different hardware wants it done differently. :-/

I think that most archs won't notice since any RmW includes a load of
that variable anyhow. The only case where it can matter is if the RmW is
done outside of the normal cache hierarchy -- like on Power, where the
ll/sc bypasses the L1.

2015-07-31 02:04:01

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

On 07/30/2015 10:44 AM, Peter Zijlstra wrote:
> On Fri, Jul 17, 2015 at 04:29:24PM -0700, Paul E. McKenney wrote:
>
>> /*
>> + * First try directly acquiring the root lock in order to reduce
>> + * latency in the common case where expedited grace periods are
>> + * rare. We check mutex_is_locked() to avoid pathological levels of
>> + * memory contention on ->exp_funnel_mutex in the heavy-load case.
>> + */
>> + rnp0 = rcu_get_root(rsp);
>> + if (!mutex_is_locked(&rnp0->exp_funnel_mutex)) {
>> + if (mutex_trylock(&rnp0->exp_funnel_mutex)) {
>> + if (sync_exp_work_done(rsp, rnp0, NULL,
>> + &rsp->expedited_workdone0, s))
>> + return NULL;
>> + return rnp0;
>> + }
>> + }
> So our 'new' locking primitives do things like:
>
> static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> {
> if (!atomic_read(&lock->val)&&
> (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> return 1;
> return 0;
> }
>
> mutexes do not do this.
>
> Now I suppose the question is, does that extra read slow down the
> (common) uncontended case? (remember, we should optimize locks for the
> uncontended case, heavy lock contention should be fixed with better
> locking schemes, not lock implementations).

I suppose the extra read may slow down the uncontended case, but I am
not sure by how much as I haven't run any test to quantify this.
However, there are use cases where it is advantageous to do a read
first, like when the lock cacheline is likely to be hot (in the
slowpath, for example). So it depends on how the trylock is being used.

Cheers,
Longman

2015-07-31 15:57:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

On Thu, Jul 30, 2015 at 06:34:27PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 30, 2015 at 08:34:52AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 30, 2015 at 04:44:55PM +0200, Peter Zijlstra wrote:
>
> > > If the extra read before the cmpxchg() does not hurt, we should do the
> > > same for mutex and make the above redundant.
> >
> > I am pretty sure that different hardware wants it done differently. :-/
>
> I think that most archs won't notice since any RmW includes a load of
> that variable anyhow. The only case where it can matter is if the RmW is
> done outside of the normal cache hierarchy -- like on Power, where the
> ll/sc bypasses the L1.

Some years back, AMD and Intel variants of x86 had different preferences
on this matter. Timings indicated that one or the other of them (I
cannot recall which) would get the cacheline shared, then have to get
it exclusive, while the other would get it exclusive to begin with.

I honestly do not know what the preferences of current Power hardware
might be.

Thanx, Paul

2015-08-03 20:05:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

On Thu, 30 Jul 2015 17:40:01 +0200
Peter Zijlstra <[email protected]> wrote:

> On Thu, Jul 30, 2015 at 08:34:52AM -0700, Paul E. McKenney wrote:
> > > If the extra read before the cmpxchg() does not hurt, we should do the
> > > same for mutex and make the above redundant.
> >
> > I am pretty sure that different hardware wants it done differently. :-/
> > So I agree that hard data would be good.
> >
> > I could probably further optimize the RCU code by checking for a
> > single-node tree, but I am not convinced that this is worthwhile.
> > However, skipping three cache misses in the uncontended case is
> > definitely worthwhile, hence this patch. ;-)
>
> I was mostly talking about the !mutex_is_locked() && mutex_try_lock()
> thing. The fast path thing makes sense.

Note, mutex does do this for the optimistic spin. See
mutex_try_to_aquire().

-- Steve

2015-08-03 20:06:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 19/19] rcu: Add fastpath bypassing funnel locking

On Mon, Aug 03, 2015 at 04:05:33PM -0400, Steven Rostedt wrote:
> On Thu, 30 Jul 2015 17:40:01 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Thu, Jul 30, 2015 at 08:34:52AM -0700, Paul E. McKenney wrote:
> > > > If the extra read before the cmpxchg() does not hurt, we should do the
> > > > same for mutex and make the above redundant.
> > >
> > > I am pretty sure that different hardware wants it done differently. :-/
> > > So I agree that hard data would be good.
> > >
> > > I could probably further optimize the RCU code by checking for a
> > > single-node tree, but I am not convinced that this is worthwhile.
> > > However, skipping three cache misses in the uncontended case is
> > > definitely worthwhile, hence this patch. ;-)
> >
> > I was mostly talking about the !mutex_is_locked() && mutex_try_lock()
> > thing. The fast path thing makes sense.
>
> Note, mutex does do this for the optimistic spin. See
> mutex_try_to_aquire().

Right but that's mutex_lock(). mutex_trylock() does not.