2013-04-12 23:18:57

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 0/7] RCU fixes for 3.11

Hello!

This series contains the following fixes for RCU:

1-2. Convert remaining printk() calls to pr_*().

3. Kick adaptive-ticks CPUs that are holding up RCU grace periods.

4. Don't allocate bootmem from rcu_init(), courtesy of Sasha Levin.

5. Remove "Experimental" flags from old RCU Kconfig options.

6. Automatically tune defaults for delays between attempts to
force quiescent states.

7. Merge adjacent identical #ifdefs.

Thanx, Paul

b/init/Kconfig | 2 -
b/kernel/rcupdate.c | 3 -
b/kernel/rcutree.c | 34 ++++++++++++++++++---
b/kernel/rcutree.h | 13 +++++++-
b/kernel/rcutree_plugin.h | 74 +++++++++++++++++++++++++++-------------------
5 files changed, 87 insertions(+), 39 deletions(-)


2013-04-12 23:19:27

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 1/7] rcu: Convert rcutree.c printk calls

From: "Paul E. McKenney" <[email protected]>

This commit converts printk() calls to the corresponding pr_*() calls.
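
(For reference, the pr_*() macros in include/linux/printk.h are thin
wrappers around printk(), roughly:

	#define pr_err(fmt, ...) \
		printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)

so the conversion produces identical output wherever pr_fmt() is left
at its default.)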

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutree.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 2d5f94c..bc3eac5 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -856,7 +856,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
* See Documentation/RCU/stallwarn.txt for info on how to debug
* RCU CPU stall warnings.
*/
- printk(KERN_ERR "INFO: %s detected stalls on CPUs/tasks:",
+ pr_err("INFO: %s detected stalls on CPUs/tasks:",
rsp->name);
print_cpu_stall_info_begin();
rcu_for_each_leaf_node(rsp, rnp) {
@@ -889,7 +889,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
smp_processor_id(), (long)(jiffies - rsp->gp_start),
rsp->gpnum, rsp->completed, totqlen);
if (ndetected == 0)
- printk(KERN_ERR "INFO: Stall ended before state dump start\n");
+ pr_err("INFO: Stall ended before state dump start\n");
else if (!trigger_all_cpu_backtrace())
rcu_dump_cpu_stacks(rsp);

@@ -912,7 +912,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
* See Documentation/RCU/stallwarn.txt for info on how to debug
* RCU CPU stall warnings.
*/
- printk(KERN_ERR "INFO: %s self-detected stall on CPU", rsp->name);
+ pr_err("INFO: %s self-detected stall on CPU", rsp->name);
print_cpu_stall_info_begin();
print_cpu_stall_info(rsp, smp_processor_id());
print_cpu_stall_info_end();
--
1.8.1.5

2013-04-12 23:19:41

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 7/7] rcu: Merge adjacent identical ifdefs

From: "Paul E. McKenney" <[email protected]>

Two #ifdefs in kernel/rcupdate.c now have identical conditions with
nothing between them, so this commit merges them into a single #ifdef.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcupdate.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index 48ab703..faeea98 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -145,9 +145,6 @@ static struct lock_class_key rcu_sched_lock_key;
struct lockdep_map rcu_sched_lock_map =
STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_sched", &rcu_sched_lock_key);
EXPORT_SYMBOL_GPL(rcu_sched_lock_map);
-#endif
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC

int debug_lockdep_rcu_enabled(void)
{
--
1.8.1.5

2013-04-12 23:19:40

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 2/7] rcu: Convert rcutree_plugin.h printk calls

From: "Paul E. McKenney" <[email protected]>

This commit converts printk() calls to the corresponding pr_*() calls.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutree_plugin.h | 45 ++++++++++++++++++++++-----------------------
1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index d084ae3..e6cf7e5 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -52,38 +52,37 @@ static char __initdata nocb_buf[NR_CPUS * 5];
static void __init rcu_bootup_announce_oddness(void)
{
#ifdef CONFIG_RCU_TRACE
- printk(KERN_INFO "\tRCU debugfs-based tracing is enabled.\n");
+ pr_info("\tRCU debugfs-based tracing is enabled.\n");
#endif
#if (defined(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 64) || (!defined(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 32)
- printk(KERN_INFO "\tCONFIG_RCU_FANOUT set to non-default value of %d\n",
+ pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d\n",
CONFIG_RCU_FANOUT);
#endif
#ifdef CONFIG_RCU_FANOUT_EXACT
- printk(KERN_INFO "\tHierarchical RCU autobalancing is disabled.\n");
+ pr_info("\tHierarchical RCU autobalancing is disabled.\n");
#endif
#ifdef CONFIG_RCU_FAST_NO_HZ
- printk(KERN_INFO
- "\tRCU dyntick-idle grace-period acceleration is enabled.\n");
+ pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n");
#endif
#ifdef CONFIG_PROVE_RCU
- printk(KERN_INFO "\tRCU lockdep checking is enabled.\n");
+ pr_info("\tRCU lockdep checking is enabled.\n");
#endif
#ifdef CONFIG_RCU_TORTURE_TEST_RUNNABLE
- printk(KERN_INFO "\tRCU torture testing starts during boot.\n");
+ pr_info("\tRCU torture testing starts during boot.\n");
#endif
#if defined(CONFIG_TREE_PREEMPT_RCU) && !defined(CONFIG_RCU_CPU_STALL_VERBOSE)
- printk(KERN_INFO "\tDump stacks of tasks blocking RCU-preempt GP.\n");
+ pr_info("\tDump stacks of tasks blocking RCU-preempt GP.\n");
#endif
#if defined(CONFIG_RCU_CPU_STALL_INFO)
- printk(KERN_INFO "\tAdditional per-CPU info printed with stalls.\n");
+ pr_info("\tAdditional per-CPU info printed with stalls.\n");
#endif
#if NUM_RCU_LVL_4 != 0
- printk(KERN_INFO "\tFour-level hierarchy is enabled.\n");
+ pr_info("\tFour-level hierarchy is enabled.\n");
#endif
if (rcu_fanout_leaf != CONFIG_RCU_FANOUT_LEAF)
- printk(KERN_INFO "\tExperimental boot-time adjustment of leaf fanout to %d.\n", rcu_fanout_leaf);
+ pr_info("\tExperimental boot-time adjustment of leaf fanout to %d.\n", rcu_fanout_leaf);
if (nr_cpu_ids != NR_CPUS)
- printk(KERN_INFO "\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%d.\n", NR_CPUS, nr_cpu_ids);
+ pr_info("\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%d.\n", NR_CPUS, nr_cpu_ids);
#ifdef CONFIG_RCU_NOCB_CPU
#ifndef CONFIG_RCU_NOCB_CPU_NONE
if (!have_rcu_nocb_mask) {
@@ -122,7 +121,7 @@ static int rcu_preempted_readers_exp(struct rcu_node *rnp);
*/
static void __init rcu_bootup_announce(void)
{
- printk(KERN_INFO "Preemptible hierarchical RCU implementation.\n");
+ pr_info("Preemptible hierarchical RCU implementation.\n");
rcu_bootup_announce_oddness();
}

@@ -489,13 +488,13 @@ static void rcu_print_detail_task_stall(struct rcu_state *rsp)

static void rcu_print_task_stall_begin(struct rcu_node *rnp)
{
- printk(KERN_ERR "\tTasks blocked on level-%d rcu_node (CPUs %d-%d):",
+ pr_err("\tTasks blocked on level-%d rcu_node (CPUs %d-%d):",
rnp->level, rnp->grplo, rnp->grphi);
}

static void rcu_print_task_stall_end(void)
{
- printk(KERN_CONT "\n");
+ pr_cont("\n");
}

#else /* #ifdef CONFIG_RCU_CPU_STALL_INFO */
@@ -525,7 +524,7 @@ static int rcu_print_task_stall(struct rcu_node *rnp)
t = list_entry(rnp->gp_tasks,
struct task_struct, rcu_node_entry);
list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
- printk(KERN_CONT " P%d", t->pid);
+ pr_cont(" P%d", t->pid);
ndetected++;
}
rcu_print_task_stall_end();
@@ -941,7 +940,7 @@ static struct rcu_state *rcu_state = &rcu_sched_state;
*/
static void __init rcu_bootup_announce(void)
{
- printk(KERN_INFO "Hierarchical RCU implementation.\n");
+ pr_info("Hierarchical RCU implementation.\n");
rcu_bootup_announce_oddness();
}

@@ -1882,7 +1881,7 @@ static void print_cpu_stall_fast_no_hz(char *cp, int cpu)
/* Initiate the stall-info list. */
static void print_cpu_stall_info_begin(void)
{
- printk(KERN_CONT "\n");
+ pr_cont("\n");
}

/*
@@ -1913,7 +1912,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
ticks_value = rsp->gpnum - rdp->gpnum;
}
print_cpu_stall_fast_no_hz(fast_no_hz, cpu);
- printk(KERN_ERR "\t%d: (%lu %s) idle=%03x/%llx/%d softirq=%u/%u %s\n",
+ pr_err("\t%d: (%lu %s) idle=%03x/%llx/%d softirq=%u/%u %s\n",
cpu, ticks_value, ticks_title,
atomic_read(&rdtp->dynticks) & 0xfff,
rdtp->dynticks_nesting, rdtp->dynticks_nmi_nesting,
@@ -1924,7 +1923,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
/* Terminate the stall-info list. */
static void print_cpu_stall_info_end(void)
{
- printk(KERN_ERR "\t");
+ pr_err("\t");
}

/* Zero ->ticks_this_gp for all flavors of RCU. */
@@ -1947,17 +1946,17 @@ static void increment_cpu_stall_ticks(void)

static void print_cpu_stall_info_begin(void)
{
- printk(KERN_CONT " {");
+ pr_cont(" {");
}

static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
{
- printk(KERN_CONT " %d", cpu);
+ pr_cont(" %d", cpu);
}

static void print_cpu_stall_info_end(void)
{
- printk(KERN_CONT "} ");
+ pr_cont("} ");
}

static void zero_cpu_stall_ticks(struct rcu_data *rdp)
--
1.8.1.5

2013-04-12 23:19:26

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 5/7] rcu: Remove "Experimental" flags

From: "Paul E. McKenney" <[email protected]>

After a release or two, features are no longer experimental. Therefore,
this commit removes the "Experimental" tag from them.

Reported-by: Paul Gortmaker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
init/Kconfig | 2 +-
kernel/rcutree_plugin.h | 10 +++++-----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a3a2304..bac0483 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -658,7 +658,7 @@ config RCU_BOOST_DELAY
Accept the default if unsure.

config RCU_NOCB_CPU
- bool "Offload RCU callback processing from boot-selected CPUs (EXPERIMENTAL"
+ bool "Offload RCU callback processing from boot-selected CPUs"
depends on TREE_RCU || TREE_PREEMPT_RCU
default n
help
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 44b0998..dcab269 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -80,7 +80,7 @@ static void __init rcu_bootup_announce_oddness(void)
pr_info("\tFour-level hierarchy is enabled.\n");
#endif
if (rcu_fanout_leaf != CONFIG_RCU_FANOUT_LEAF)
- pr_info("\tExperimental boot-time adjustment of leaf fanout to %d.\n", rcu_fanout_leaf);
+ pr_info("\tBoot-time adjustment of leaf fanout to %d.\n", rcu_fanout_leaf);
if (nr_cpu_ids != NR_CPUS)
pr_info("\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%d.\n", NR_CPUS, nr_cpu_ids);
#ifdef CONFIG_RCU_NOCB_CPU
@@ -90,19 +90,19 @@ static void __init rcu_bootup_announce_oddness(void)
have_rcu_nocb_mask = true;
}
#ifdef CONFIG_RCU_NOCB_CPU_ZERO
- pr_info("\tExperimental no-CBs CPU 0\n");
+ pr_info("\tOffload RCU callbacks from CPU 0\n");
cpumask_set_cpu(0, rcu_nocb_mask);
#endif /* #ifdef CONFIG_RCU_NOCB_CPU_ZERO */
#ifdef CONFIG_RCU_NOCB_CPU_ALL
- pr_info("\tExperimental no-CBs for all CPUs\n");
+ pr_info("\tOffload RCU callbacks from all CPUs\n");
cpumask_setall(rcu_nocb_mask);
#endif /* #ifdef CONFIG_RCU_NOCB_CPU_ALL */
#endif /* #ifndef CONFIG_RCU_NOCB_CPU_NONE */
if (have_rcu_nocb_mask) {
cpulist_scnprintf(nocb_buf, sizeof(nocb_buf), rcu_nocb_mask);
- pr_info("\tExperimental no-CBs CPUs: %s.\n", nocb_buf);
+ pr_info("\tOffload RCU callbacks from CPUs: %s.\n", nocb_buf);
if (rcu_nocb_poll)
- pr_info("\tExperimental polled no-CBs CPUs.\n");
+ pr_info("\tPoll for callbacks from no-CBs CPUs.\n");
}
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
}
--
1.8.1.5

2013-04-12 23:20:29

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 3/7] rcu: Kick adaptive-ticks CPUs that are holding up RCU grace periods

From: "Paul E. McKenney" <[email protected]>

Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
not necessarily turn the scheduler-clock tick back on. This state of
affairs could result in RCU waiting on an adaptive-ticks CPU running
for an extended period in kernel mode. Such a CPU will never run the
RCU state machine, and could therefore indefinitely extend the current
grace period, sooner or later resulting in an OOM condition.

This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
causes RCU's force-quiescent-state processing to check for this condition
and to send an IPI to CPUs that remain in that state for too long.
"Too long" currently means about three jiffies by default, which is
quite some time for a CPU to remain in the kernel without blocking.
The rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
sysfs variables may be used to tune "too long" if needed.

Reported-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutree.c | 10 ++++++++++
kernel/rcutree.h | 1 +
kernel/rcutree_plugin.h | 17 +++++++++++++++++
3 files changed, 28 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index bc3eac5..3710d74 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -799,6 +799,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
rdp->offline_fqs++;
return 1;
}
+
+ /*
+ * There is a possibility that a CPU in adaptive-ticks state
+ * might run in the kernel with the scheduling-clock tick disabled
+ * for an extended time period. Invoke rcu_kick_nohz_cpu() to
+ * force the CPU to restart the scheduling-clock tick if this
+ * CPU is in this state.
+ */
+ rcu_kick_nohz_cpu(rdp->cpu);
+
return 0;
}

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 14ee407..08972c9 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -538,6 +538,7 @@ static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
static bool init_nocb_callback_list(struct rcu_data *rdp);
+static void rcu_kick_nohz_cpu(int cpu);

#endif /* #ifndef RCU_TREE_NONCORE */

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index e6cf7e5..ca6e39c 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2336,3 +2336,20 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
}

#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
+
+/*
+ * An adaptive-ticks CPU can potentially execute in kernel mode for an
+ * arbitrarily long period of time with the scheduling-clock tick turned
+ * off. RCU will be paying attention to this CPU because it is in the
+ * kernel, but the CPU cannot be guaranteed to be executing the RCU state
+ * machine because the scheduling-clock tick has been disabled. Therefore,
+ * if an adaptive-ticks CPU is failing to respond to the current grace
+ * period and has not been idle from an RCU perspective, kick it.
+ */
+static void rcu_kick_nohz_cpu(int cpu)
+{
+#ifdef CONFIG_NO_HZ_EXTENDED
+ if (tick_nohz_full_cpu(cpu))
+ smp_send_reschedule(cpu);
+#endif /* #ifdef CONFIG_NO_HZ_EXTENDED */
+}
--
1.8.1.5

2013-04-12 23:20:26

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 4/7] rcu: Don't allocate bootmem from rcu_init()

From: Sasha Levin <[email protected]>

When rcu_init() is called we already have slab working, so allocating
bootmem at that point results in warnings and an allocation from
slab. This commit therefore changes alloc_bootmem_cpumask_var() to
alloc_cpumask_var() in rcu_bootup_announce_oddness(), which is called
from rcu_init().

Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index ca6e39c..44b0998 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -86,7 +86,7 @@ static void __init rcu_bootup_announce_oddness(void)
#ifdef CONFIG_RCU_NOCB_CPU
#ifndef CONFIG_RCU_NOCB_CPU_NONE
if (!have_rcu_nocb_mask) {
- alloc_bootmem_cpumask_var(&rcu_nocb_mask);
+ alloc_cpumask_var(&rcu_nocb_mask, GFP_KERNEL);
have_rcu_nocb_mask = true;
}
#ifdef CONFIG_RCU_NOCB_CPU_ZERO
--
1.8.1.5

2013-04-12 23:19:25

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

From: "Paul E. McKenney" <[email protected]>

Systems with HZ=100 can have slow bootup times due to the default
three-jiffy delays between quiescent-state forcing attempts. This
commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
on the value of HZ. However, this would break very large systems that
require more time between quiescent-state forcing attempts. This
commit therefore also ups the default delay by one jiffy for each
256 CPUs that might be on the system (based off of nr_cpu_ids at
runtime, -not- NR_CPUS at build time).
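
As a worked example, with HZ=100 the base delay below is one jiffy, so
a 4096-CPU system gets a default of 1 + 4096/256 = 17 jiffies, while a
4-CPU box keeps the one-jiffy minimum.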

Reported-by: Paul Mackerras <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcutree.c | 18 ++++++++++++++++--
kernel/rcutree.h | 12 +++++++++++-
2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 3710d74..cbfb4ee 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -218,8 +218,8 @@ module_param(blimit, long, 0444);
module_param(qhimark, long, 0444);
module_param(qlowmark, long, 0444);

-static ulong jiffies_till_first_fqs = RCU_JIFFIES_TILL_FORCE_QS;
-static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS;
+static ulong jiffies_till_first_fqs = ULONG_MAX;
+static ulong jiffies_till_next_fqs = ULONG_MAX;

module_param(jiffies_till_first_fqs, ulong, 0644);
module_param(jiffies_till_next_fqs, ulong, 0644);
@@ -3252,11 +3252,25 @@ static void __init rcu_init_one(struct rcu_state *rsp,
*/
static void __init rcu_init_geometry(void)
{
+ ulong d;
int i;
int j;
int n = nr_cpu_ids;
int rcu_capacity[MAX_RCU_LVLS + 1];

+ /*
+ * Initialize any unspecified boot parameters.
+ * The default values of jiffies_till_first_fqs and
+ * jiffies_till_next_fqs are set to the RCU_JIFFIES_TILL_FORCE_QS
+ * value, which is a function of HZ, plus one jiffy for each
+ * RCU_JIFFIES_FQS_DIV CPUs that might be on the system.
+ */
+ d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
+ if (jiffies_till_first_fqs == ULONG_MAX)
+ jiffies_till_first_fqs = d;
+ if (jiffies_till_next_fqs == ULONG_MAX)
+ jiffies_till_next_fqs = d;
+
/* If the compile-time values are accurate, just leave. */
if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
nr_cpu_ids == NR_CPUS)
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 08972c9..7d5f876 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -342,7 +342,17 @@ struct rcu_data {
#define RCU_FORCE_QS 3 /* Need to force quiescent state. */
#define RCU_SIGNAL_INIT RCU_SAVE_DYNTICK

-#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for rsp->jiffies_force_qs */
+#if HZ > 500
+#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for jiffies_till_first_fqs */
+#elif HZ > 250
+#define RCU_JIFFIES_TILL_FORCE_QS 2
+#else
+#define RCU_JIFFIES_TILL_FORCE_QS 1
+#endif
+#define RCU_JIFFIES_FQS_DIV 256 /* Very large systems need */
+ /* more delay between bouts */
+ /* of quiescent-state */
+ /* forcing. */

#define RCU_STALL_RAT_DELAY 2 /* Allow other CPUs time */
/* to take at least one */
--
1.8.1.5

2013-04-12 23:54:19

by Josh Triplett

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> Systems with HZ=100 can have slow bootup times due to the default
> three-jiffy delays between quiescent-state forcing attempts. This
> commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> on the value of HZ. However, this would break very large systems that
> require more time between quiescent-state forcing attempts. This
> commit therefore also ups the default delay by one jiffy for each
> 256 CPUs that might be on the system (based off of nr_cpu_ids at
> runtime, -not- NR_CPUS at build time).
>
> Reported-by: Paul Mackerras <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>

Something seems very wrong if RCU regularly hits the fqs code during
boot; feels like there's some more straightforward solution we're
missing. What causes these CPUs to fall under RCU's scrutiny during
boot yet not actually hit the RCU codepaths naturally?

Also, a comment below.

> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -342,7 +342,17 @@ struct rcu_data {
> #define RCU_FORCE_QS 3 /* Need to force quiescent state. */
> #define RCU_SIGNAL_INIT RCU_SAVE_DYNTICK
>
> -#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for rsp->jiffies_force_qs */
> +#if HZ > 500
> +#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for jiffies_till_first_fqs */
> +#elif HZ > 250
> +#define RCU_JIFFIES_TILL_FORCE_QS 2
> +#else
> +#define RCU_JIFFIES_TILL_FORCE_QS 1
> +#endif

This seems like it really wants to use a duration calculated directly
from HZ; perhaps (HZ/100)?

- Josh Triplett

2013-04-13 00:02:08

by Josh Triplett

Subject: Re: [PATCH tip/core/rcu 0/7] RCU fixes for 3.11

On Fri, Apr 12, 2013 at 04:18:47PM -0700, Paul E. McKenney wrote:
> Hello!
>
> This series contains the following fixes for RCU:
>
> 1-2. Convert remaining printk() calls to pr_*().
>
> 3. Kick adaptive-ticks CPUs that are holding up RCU grace periods.
>
> 4. Don't allocate bootmem from rcu_init(), courtesy of Sasha Levin.
>
> 5. Remove "Experimental" flags from old RCU Kconfig options.
>
> 6. Automatically tune defaults for delays between attempts to
> force quiescent states.
>
> 7. Merge adjacent identical #ifdefs.

For 1-5 and 7:
Reviewed-by: Josh Triplett <[email protected]>

Responded to 6 separately.

- Josh Triplett

2013-04-13 06:38:13

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > Systems with HZ=100 can have slow bootup times due to the default
> > three-jiffy delays between quiescent-state forcing attempts. This
> > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > on the value of HZ. However, this would break very large systems that
> > require more time between quiescent-state forcing attempts. This
> > commit therefore also ups the default delay by one jiffy for each
> > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > runtime, -not- NR_CPUS at build time).
> >
> > Reported-by: Paul Mackerras <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> Something seems very wrong if RCU regularly hits the fqs code during
> boot; feels like there's some more straightforward solution we're
> missing. What causes these CPUs to fall under RCU's scrutiny during
> boot yet not actually hit the RCU codepaths naturally?

The problem is that they are running HZ=100, so that RCU will often
take 30-60 milliseconds per grace period. At that point, you only
need 16-30 grace periods to chew up a full second, so it is not all
that hard to eat up the additional 8-12 seconds of boot time that
they were seeing. IIRC, UP boot was costing them 4 seconds.

For HZ=1000, this would translate to 800ms to 1.2s, which is nowhere
near as annoying.
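
(To spell out the arithmetic: a grace period here typically spans one
or two of the default three-jiffy fqs waits, i.e. 3-6 jiffies, and a
jiffy is 10ms at HZ=100, hence 30-60ms per grace period; at HZ=1000 a
jiffy is 1ms, which scales the 8-12 seconds down to 0.8-1.2s.)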

> Also, a comment below.
>
> > --- a/kernel/rcutree.h
> > +++ b/kernel/rcutree.h
> > @@ -342,7 +342,17 @@ struct rcu_data {
> > #define RCU_FORCE_QS 3 /* Need to force quiescent state. */
> > #define RCU_SIGNAL_INIT RCU_SAVE_DYNTICK
> >
> > -#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for rsp->jiffies_force_qs */
> > +#if HZ > 500
> > +#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for jiffies_till_first_fqs */
> > +#elif HZ > 250
> > +#define RCU_JIFFIES_TILL_FORCE_QS 2
> > +#else
> > +#define RCU_JIFFIES_TILL_FORCE_QS 1
> > +#endif
>
> This seems like it really wants to use a duration calculated directly
> from HZ; perhaps (HZ/100)?

Very possibly to the direct calculation, but HZ/100 would get 10 ticks
delay at HZ=1000, which is too high -- the value of 3 ticks for HZ=1000
works well. But I could do something like this:

#define RCU_JIFFIES_TILL_FORCE_QS (((HZ + 199) / 300) + ((HZ + 199) / 300 ? 0 : 1))

Or maybe a bit better:

#define RCU_JTFQS_SE ((HZ + 199) / 300)
#define RCU_JIFFIES_TILL_FORCE_QS (RCU_JTFQS_SE + (RCU_JTFQS_SE ? 0 : 1))
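
For concreteness, with the usual integer division the second form
works out to:

	HZ = 100:  RCU_JTFQS_SE = 299/300  = 0  ->  1 jiffy
	HZ = 500:  RCU_JTFQS_SE = 699/300  = 2  ->  2 jiffies
	HZ = 1000: RCU_JTFQS_SE = 1199/300 = 3  ->  3 jiffies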

This would come reasonably close to the values shown above. Would
this work for you?

Thanx, Paul

2013-04-13 14:07:01

by Frederic Weisbecker

Subject: Re: [PATCH tip/core/rcu 3/7] rcu: Kick adaptive-ticks CPUs that are holding up RCU grace periods

2013/4/13 Paul E. McKenney <[email protected]>:
> From: "Paul E. McKenney" <[email protected]>
>
> Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
> not necessarily turn the scheduler-clock tick back on. This state of
> affairs could result in RCU waiting on an adaptive-ticks CPU running
> for an extended period in kernel mode. Such a CPU will never run the
> RCU state machine, and could therefore indefinitely extend the current
> grace period, sooner or later resulting in an OOM condition.
>
> This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
> causes RCU's force-quiescent-state processing to check for this condition
> and to send an IPI to CPUs that remain in that state for too long.
> "Too long" currently means about three jiffies by default, which is
> quite some time for a CPU to remain in the kernel without blocking.
> The rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
> sysfs variables may be used to tune "too long" if needed.
>
> Reported-by: Frederic Weisbecker <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>

It might be better if I take this patch to get it through
tip:timers/nohz so that I can keep it in sync with the rest. What do
you think?

Thanks.

2013-04-13 15:20:04

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 3/7] rcu: Kick adaptive-ticks CPUs that are holding up RCU grace periods

On Sat, Apr 13, 2013 at 04:06:58PM +0200, Frederic Weisbecker wrote:
> 2013/4/13 Paul E. McKenney <[email protected]>:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
> > not necessarily turn the scheduler-clock tick back on. This state of
> > affairs could result in RCU waiting on an adaptive-ticks CPU running
> > for an extended period in kernel mode. Such a CPU will never run the
> > RCU state machine, and could therefore indefinitely extend the current
> > grace period, sooner or later resulting in an OOM condition.
> >
> > This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
> > causes RCU's force-quiescent-state processing to check for this condition
> > and to send an IPI to CPUs that remain in that state for too long.
> > "Too long" currently means about three jiffies by default, which is
> > quite some time for a CPU to remain in the kernel without blocking.
> > The rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
> > sysfs variables may be used to tune "too long" if needed.
> >
> > Reported-by: Frederic Weisbecker <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> It might be better if I take this patch to get it through
> tip:timers/nohz so that I can keep it in sync with the rest. What do
> you think?

Works for me! Let me know when you have picked it up and I will drop
it from my tree.

Thanx, Paul

2013-04-13 18:18:20

by Josh Triplett

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <[email protected]>
> > >
> > > Systems with HZ=100 can have slow bootup times due to the default
> > > three-jiffy delays between quiescent-state forcing attempts. This
> > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > on the value of HZ. However, this would break very large systems that
> > > require more time between quiescent-state forcing attempts. This
> > > commit therefore also ups the default delay by one jiffy for each
> > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > runtime, -not- NR_CPUS at build time).
> > >
> > > Reported-by: Paul Mackerras <[email protected]>
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > Something seems very wrong if RCU regularly hits the fqs code during
> > boot; feels like there's some more straightforward solution we're
> > missing. What causes these CPUs to fall under RCU's scrutiny during
> > boot yet not actually hit the RCU codepaths naturally?
>
> The problem is that they are running HZ=100, so that RCU will often
> take 30-60 milliseconds per grace period. At that point, you only
> need 16-30 grace periods to chew up a full second, so it is not all
> that hard to eat up the additional 8-12 seconds of boot time that
> they were seeing. IIRC, UP boot was costing them 4 seconds.
>
> For HZ=1000, this would translate to 800ms to 1.2s, which is nowhere
> near as annoying.

That raises two questions, though. First, who calls synchronize_rcu()
repeatedly during boot, and could they call call_rcu() instead to avoid
blocking for an RCU grace period? Second, why does RCU need 3-6 jiffies
to resolve a grace period during boot? That suggests that RCU doesn't
actually resolve a grace period until the force-quiescent-state
machinery kicks in, meaning that the normal quiescent-state mechanism
didn't work.

> > Also, a comment below.
> >
> > > --- a/kernel/rcutree.h
> > > +++ b/kernel/rcutree.h
> > > @@ -342,7 +342,17 @@ struct rcu_data {
> > > #define RCU_FORCE_QS 3 /* Need to force quiescent state. */
> > > #define RCU_SIGNAL_INIT RCU_SAVE_DYNTICK
> > >
> > > -#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for rsp->jiffies_force_qs */
> > > +#if HZ > 500
> > > +#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for jiffies_till_first_fqs */
> > > +#elif HZ > 250
> > > +#define RCU_JIFFIES_TILL_FORCE_QS 2
> > > +#else
> > > +#define RCU_JIFFIES_TILL_FORCE_QS 1
> > > +#endif
> >
> > This seems like it really wants to use a duration calculated directly
> > from HZ; perhaps (HZ/100)?
>
> Very possibly to the direct calculation, but HZ/100 would get 10 ticks
> delay at HZ=1000, which is too high -- the value of 3 ticks for HZ=1000
> works well. But I could do something like this:
>
> #define RCU_JIFFIES_TILL_FORCE_QS (((HZ + 199) / 300) + ((HZ + 199) / 300 ? 0 : 1))
>
> Or maybe a bit better:
>
> #define RCU_JTFQS_SE ((HZ + 199) / 300)
> #define RCU_JIFFIES_TILL_FORCE_QS (RCU_JTFQS_SE + (RCU_JTFQS_SE ? 0 : 1))
>
> This would come reasonably close to the values shown above. Would
> this work for you?

I'd argue that if you need something that complex, you should just
explicitly write it as a step function:

#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
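
(Worked values, since the comparisons evaluate to 0 or 1 in C: HZ=100
gives 1, HZ=300 gives 2, HZ=1000 gives 3, matching the #if ladder
above.)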

- Josh Triplett

2013-04-13 19:34:43

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Sat, Apr 13, 2013 at 11:18:00AM -0700, Josh Triplett wrote:
> On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > > From: "Paul E. McKenney" <[email protected]>
> > > >
> > > > Systems with HZ=100 can have slow bootup times due to the default
> > > > three-jiffy delays between quiescent-state forcing attempts. This
> > > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > > on the value of HZ. However, this would break very large systems that
> > > > require more time between quiescent-state forcing attempts. This
> > > > commit therefore also ups the default delay by one jiffy for each
> > > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > > runtime, -not- NR_CPUS at build time).
> > > >
> > > > Reported-by: Paul Mackerras <[email protected]>
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > >
> > > Something seems very wrong if RCU regularly hits the fqs code during
> > > boot; feels like there's some more straightforward solution we're
> > > missing. What causes these CPUs to fall under RCU's scrutiny during
> > > boot yet not actually hit the RCU codepaths naturally?
> >
> > The problem is that they are running HZ=100, so that RCU will often
> > take 30-60 milliseconds per grace period. At that point, you only
> > need 16-30 grace periods to chew up a full second, so it is not all
> > that hard to eat up the additional 8-12 seconds of boot time that
> > they were seeing. IIRC, UP boot was costing them 4 seconds.
> >
> > For HZ=1000, this would translate to 800ms to 1.2s, which is nowhere
> > near as annoying.
>
> That raises two questions, though. First, who calls synchronize_rcu()
> repeatedly during boot, and could they call call_rcu() instead to avoid
> blocking for an RCU grace period? Second, why does RCU need 3-6 jiffies
> to resolve a grace period during boot? That suggests that RCU doesn't
> actually resolve a grace period until the force-quiescent-state
> machinery kicks in, meaning that the normal quiescent-state mechanism
> didn't work.

Indeed, converting synchronize_rcu() to call_rcu() might also be
helpful. The reason that RCU often does not resolve grace periods until
force_quiescent_state() is that it is often the case during boot that
all but one CPU is idle. RCU tries hard to avoid waking up idle CPUs,
so it must scan them. Scanning is relatively expensive, so there is
reason to wait.

One thing that could be done would be to scan immediately during boot,
and then back off once boot has completed. Of course, RCU has no idea
when boot has completed, but one way to get this effect is to boot
with rcutree.jiffies_till_first_fqs=0, and then use sysfs to set it
to 3 once boot has completed.
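
Concretely, that would look something like this (a sketch; the
parameter is writable at runtime because it is declared with mode
0644):

	# on the kernel command line:
	rcutree.jiffies_till_first_fqs=0

	# from userspace, once it decides that boot is complete:
	echo 3 > /sys/module/rcutree/parameters/jiffies_till_first_fqs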

> > > Also, a comment below.
> > >
> > > > --- a/kernel/rcutree.h
> > > > +++ b/kernel/rcutree.h
> > > > @@ -342,7 +342,17 @@ struct rcu_data {
> > > > #define RCU_FORCE_QS 3 /* Need to force quiescent state. */
> > > > #define RCU_SIGNAL_INIT RCU_SAVE_DYNTICK
> > > >
> > > > -#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for rsp->jiffies_force_qs */
> > > > +#if HZ > 500
> > > > +#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for jiffies_till_first_fqs */
> > > > +#elif HZ > 250
> > > > +#define RCU_JIFFIES_TILL_FORCE_QS 2
> > > > +#else
> > > > +#define RCU_JIFFIES_TILL_FORCE_QS 1
> > > > +#endif
> > >
> > > This seems like it really wants to use a duration calculated directly
> > > from HZ; perhaps (HZ/100)?
> >
> > Very possibly to the direct calculation, but HZ/100 would get 10 ticks
> > delay at HZ=1000, which is too high -- the value of 3 ticks for HZ=1000
> > works well. But I could do something like this:
> >
> > #define RCU_JIFFIES_TILL_FORCE_QS (((HZ + 199) / 300) + ((HZ + 199) / 300 ? 0 : 1))
> >
> > Or maybe a bit better:
> >
> > #define RCU_JTFQS_SE ((HZ + 199) / 300)
> > #define RCU_JIFFIES_TILL_FORCE_QS (RCU_JTFQS_SE + (RCU_JTFQS_SE ? 0 : 1))
> >
> > This would come reasonably close to the values shown above. Would
> > this work for you?
>
> I'd argue that if you need something that complex, you should just
> explicitly write it as a step function:
>
> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))

Yeah, I couldn't resist handling HZ>1000, but that doesn't sound all
that likely. I will use your suggested approach.

Thanx, Paul

2013-04-13 19:53:48

by Josh Triplett

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Sat, Apr 13, 2013 at 12:34:25PM -0700, Paul E. McKenney wrote:
> On Sat, Apr 13, 2013 at 11:18:00AM -0700, Josh Triplett wrote:
> > On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> > > On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > > > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > > > From: "Paul E. McKenney" <[email protected]>
> > > > >
> > > > > Systems with HZ=100 can have slow bootup times due to the default
> > > > > three-jiffy delays between quiescent-state forcing attempts. This
> > > > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > > > on the value of HZ. However, this would break very large systems that
> > > > > require more time between quiescent-state forcing attempts. This
> > > > > commit therefore also ups the default delay by one jiffy for each
> > > > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > > > runtime, -not- NR_CPUS at build time).
> > > > >
> > > > > Reported-by: Paul Mackerras <[email protected]>
> > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > >
> > > > Something seems very wrong if RCU regularly hits the fqs code during
> > > > boot; feels like there's some more straightforward solution we're
> > > > missing. What causes these CPUs to fall under RCU's scrutiny during
> > > > boot yet not actually hit the RCU codepaths naturally?
> > >
> > > The problem is that they are running HZ=100, so that RCU will often
> > > take 30-60 milliseconds per grace period. At that point, you only
> > > need 16-30 grace periods to chew up a full second, so it is not all
> > > that hard to eat up the additional 8-12 seconds of boot time that
> > > they were seeing. IIRC, UP boot was costing them 4 seconds.
> > >
> > > For HZ=1000, this would translate to 800ms to 1.2s, which is nowhere
> > > near as annoying.
> >
> > That raises two questions, though. First, who calls synchronize_rcu()
> > repeatedly during boot, and could they call call_rcu() instead to avoid
> > blocking for an RCU grace period? Second, why does RCU need 3-6 jiffies
> > to resolve a grace period during boot? That suggests that RCU doesn't
> > actually resolve a grace period until the force-quiescent-state
> > machinery kicks in, meaning that the normal quiescent-state mechanism
> > didn't work.
>
> Indeed, converting synchronize_rcu() to call_rcu() might also be
> helpful. The reason that RCU often does not resolve grace periods until
> force_quiescent_state() is that it is often the case during boot that
> all but one CPU is idle. RCU tries hard to avoid waking up idle CPUs,
> so it must scan them. Scanning is relatively expensive, so there is
> reason to wait.

How are those CPUs going idle without first telling RCU that they're
quiesced? Seems like, during boot at least, you want RCU to use its
idle==quiesced logic to proactively note continuously-quiescent states.
Ideally, you should not hit the FQS code at all during boot.

> One thing that could be done would be to scan immediately during boot,
> and then back off once boot has completed. Of course, RCU has no idea
> when boot has completed, but one way to get this effect is to boot
> with rcutree.jiffies_till_first_fqs=0, and then use sysfs to set it
> to 3 once boot has completed.

What do you mean by "boot has completed" here? The kernel's early
initialization, the kernel's initialization up to running /sbin/init, or
userspace initialization up through supporting user login?

In any case, I don't think it makes sense to do this with FQS.

- Josh Triplett

2013-04-13 22:09:52

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Sat, Apr 13, 2013 at 12:53:36PM -0700, Josh Triplett wrote:
> On Sat, Apr 13, 2013 at 12:34:25PM -0700, Paul E. McKenney wrote:
> > On Sat, Apr 13, 2013 at 11:18:00AM -0700, Josh Triplett wrote:
> > > On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > > > > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > > > > From: "Paul E. McKenney" <[email protected]>
> > > > > >
> > > > > > Systems with HZ=100 can have slow bootup times due to the default
> > > > > > three-jiffy delays between quiescent-state forcing attempts. This
> > > > > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > > > > on the value of HZ. However, this would break very large systems that
> > > > > > require more time between quiescent-state forcing attempts. This
> > > > > > commit therefore also ups the default delay by one jiffy for each
> > > > > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > > > > runtime, -not- NR_CPUS at build time).
> > > > > >
> > > > > > Reported-by: Paul Mackerras <[email protected]>
> > > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > >
> > > > > Something seems very wrong if RCU regularly hits the fqs code during
> > > > > boot; feels like there's some more straightforward solution we're
> > > > > missing. What causes these CPUs to fall under RCU's scrutiny during
> > > > > boot yet not actually hit the RCU codepaths naturally?
> > > >
> > > > The problem is that they are running HZ=100, so that RCU will often
> > > > take 30-60 milliseconds per grace period. At that point, you only
> > > > need 16-30 grace periods to chew up a full second, so it is not all
> > > > that hard to eat up the additional 8-12 seconds of boot time that
> > > > they were seeing. IIRC, UP boot was costing them 4 seconds.
> > > >
> > > > For HZ=1000, this would translate to 800ms to 1.2s, which is nowhere
> > > > near as annoying.
> > >
> > > That raises two questions, though. First, who calls synchronize_rcu()
> > > repeatedly during boot, and could they call call_rcu() instead to avoid
> > > blocking for an RCU grace period? Second, why does RCU need 3-6 jiffies
> > > to resolve a grace period during boot? That suggests that RCU doesn't
> > > actually resolve a grace period until the force-quiescent-state
> > > machinery kicks in, meaning that the normal quiescent-state mechanism
> > > didn't work.
> >
> > Indeed, converting synchronize_rcu() to call_rcu() might also be
> > helpful. The reason that RCU often does not resolve grace periods until
> > force_quiescent_state() is that it is often the case during boot that
> > all but one CPU is idle. RCU tries hard to avoid waking up idle CPUs,
> > so it must scan them. Scanning is relatively expensive, so there is
> > reason to wait.
>
> How are those CPUs going idle without first telling RCU that they're
> quiesced? Seems like, during boot at least, you want RCU to use its
> idle==quiesced logic to proactively note continuously-quiescent states.
> Ideally, you should not hit the FQS code at all during boot.

FQS is RCU's idle==quiesced logic. ;-)

In theory, RCU could add logic at idle entry to report a quiescent state,
in fact CONFIG_RCU_FAST_NO_HZ used to do exactly that. In practice,
this is not good for energy efficiency at runtime for a goodly number
of workloads, which is why CONFIG_RCU_FAST_NO_HZ now relies on callback
numbering and FQS.

I understand that at boot time, energy efficiency is best served by
making boot go faster, but that means that something has to tell RCU
when boot is complete.

> > One thing that could be done would be to scan immediately during boot,
> > and then back off once boot has completed. Of course, RCU has no idea
> > when boot has completed, but one way to get this effect is to boot
> > with rcutree.jiffies_till_first_fqs=0, and then use sysfs to set it
> > to 3 once boot has completed.
>
> What do you mean by "boot has completed" here? The kernel's early
> initialization, the kernel's initialization up to running /sbin/init, or
> userspace initialization up through supporting user login?

That is exactly the question. After all, if RCU is going to do something
special during boot, it needs to know when boot ends. People normally
count boot as up to user login, but RCU currently has no way to know
when this is, at least as far as I know. Which is why I suggested that
something tell RCU via sysfs.

Regardless, for the usual definition of "boot is complete", user space has
to decide when boot is complete. The kernel is out of the loop early on.

> In any case, I don't think it makes sense to do this with FQS.

OK, let's go through the possibilities I can imagine at the moment:

1. Force the scheduling-clock interrupt to remain on during
boot. This way, each CPU could tell RCU of its idle/non-idle
state. Of course, something then needs to tell the kernel
when boot is over so that it can go back to energy-efficient
mode.

2. Set rcutree.jiffies_till_first_fqs=0 at boot time, then when
boot is complete, set it to 3 via sysfs, or to some magic number
telling RCU to recompute the default. This has the virtue of
allowing different userspaces to handle this differently.

3. Take a half-step by having RCU register a callback during the
latest phase of kernel-visible boot. I am under the impression
that this is a relatively small fraction of boot, so it would
be sub-optimal.

4. Make CPUs announce quiescence on each entry to idle. This
covers the transition to idle, but when a given CPU stays idle
for more than one grace period, RCU has to do something to verify
that the CPU remains idle. Right now, that is FQS's job --
it cycles through the dyntick-idle structures of all CPUs that
have not already announced quiescence.

5. Make CPUs IPI RCU's grace-period kthread on each transition
to and from idle. I might be missing something, but given the
cost and disruptiveness of IPIs, this does not seem to me to be
a strategy to win.

6. IPI the CPUs to see if they are still idle. This would defeat
energy efficiency. Of course, RCU could take this approach
only during boot, but it is cheaper and faster to just check
each CPU's rcu_dynticks structure -- which is what FQS does.

7. Treat all normal grace periods as expedited grace periods, but
only during boot. It is fairly easy for RCU to do this, but
again, something has to tell RCU when boot is complete.

8. Your idea here. Plus more of mine as I remember them. ;-)

So, what am I missing?

Thanx, Paul

2013-04-14 06:10:15

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Sat, Apr 13, 2013 at 03:09:43PM -0700, Paul E. McKenney wrote:
> On Sat, Apr 13, 2013 at 12:53:36PM -0700, Josh Triplett wrote:
> > On Sat, Apr 13, 2013 at 12:34:25PM -0700, Paul E. McKenney wrote:
> > > On Sat, Apr 13, 2013 at 11:18:00AM -0700, Josh Triplett wrote:
> > > > On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > > > > > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > > > > > From: "Paul E. McKenney" <[email protected]>
> > > > > > >
> > > > > > > Systems with HZ=100 can have slow bootup times due to the default
> > > > > > > three-jiffy delays between quiescent-state forcing attempts. This
> > > > > > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > > > > > on the value of HZ. However, this would break very large systems that
> > > > > > > require more time between quiescent-state forcing attempts. This
> > > > > > > commit therefore also ups the default delay by one jiffy for each
> > > > > > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > > > > > runtime, -not- NR_CPUS at build time).
> > > > > > >
> > > > > > > Reported-by: Paul Mackerras <[email protected]>
> > > > > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > > >
> > > > > > Something seems very wrong if RCU regularly hits the fqs code during
> > > > > > boot; feels like there's some more straightforward solution we're
> > > > > > missing. What causes these CPUs to fall under RCU's scrutiny during
> > > > > > boot yet not actually hit the RCU codepaths naturally?
> > > > >
> > > > > The problem is that they are running HZ=100, so that RCU will often
> > > > > take 30-60 milliseconds per grace period. At that point, you only
> > > > > need 16-30 grace periods to chew up a full second, so it is not all
> > > > > that hard to eat up the additional 8-12 seconds of boot time that
> > > > > they were seeing. IIRC, UP boot was costing them 4 seconds.
> > > > >
> > > > > For HZ=1000, this would translate to 800ms to 1.2s, which is nowhere
> > > > > near as annoying.
> > > >
> > > > That raises two questions, though. First, who calls synchronize_rcu()
> > > > repeatedly during boot, and could they call call_rcu() instead to avoid
> > > > blocking for an RCU grace period? Second, why does RCU need 3-6 jiffies
> > > > to resolve a grace period during boot? That suggests that RCU doesn't
> > > > actually resolve a grace period until the force-quiescent-state
> > > > machinery kicks in, meaning that the normal quiescent-state mechanism
> > > > didn't work.
> > >
> > > Indeed, converting synchronize_rcu() to call_rcu() might also be
> > > helpful. The reason that RCU often does not resolve grace periods until
> > > force_quiescent_state() is that it is often the case during boot that
> > > all but one CPU is idle. RCU tries hard to avoid waking up idle CPUs,
> > > so it must scan them. Scanning is relatively expensive, so there is
> > > reason to wait.
> >
> > How are those CPUs going idle without first telling RCU that they're
> > quiesced? Seems like, during boot at least, you want RCU to use its
> > idle==quiesced logic to proactively note continuously-quiescent states.
> > Ideally, you should not hit the FQS code at all during boot.
>
> FQS is RCU's idle==quiesced logic. ;-)
>
> In theory, RCU could add logic at idle entry to report a quiescent state,
> in fact CONFIG_RCU_FAST_NO_HZ used to do exactly that. In practice,
> this is not good for energy efficiency at runtime for a goodly number
> of workloads, which is why CONFIG_RCU_FAST_NO_HZ now relies on callback
> numbering and FQS.
>
> I understand that at boot time, energy efficiency is best served by
> making boot go faster, but that means that something has to tell RCU
> when boot is complete.
>
> > > One thing that could be done would be to scan immediately during boot,
> > > and then back off once boot has completed. Of course, RCU has no idea
> > > when boot has completed, but one way to get this effect is to boot
> > > with rcutree.jiffies_till_first_fqs=0, and then use sysfs to set it
> > > to 3 once boot has completed.
> >
> > What do you mean by "boot has completed" here? The kernel's early
> > initialization, the kernel's initialization up to running /sbin/init, or
> > userspace initialization up through supporting user login?
>
> That is exactly the question. After all, if RCU is going to do something
> special during boot, it needs to know when boot ends. People normally
> count boot as up to user login, but RCU currently has no way to know
> when this is, at least as far as I know. Which is why I suggested that
> something tell RCU via sysfs.
>
> Regardless, for the usual definition of "boot is complete", user space has
> to decide when boot is complete. The kernel is out of the loop early on.
>
> > In any case, I don't think it makes sense to do this with FQS.
>
> OK, let's go through the possibilities I can imagine at the moment:
>
> 1. Force the scheduling-clock interrupt to remain on during
> boot. This way, each CPU could tell RCU of its idle/non-idle
> state. Of course, something then needs to tell the kernel
> when boot is over so that it can go back to energy-efficient
> mode.
>
> 2. Set rcutree.jiffies_till_first_fqs=0 at boot time, then when
> boot is complete, set it to 3 via sysfs, or to some magic number
> telling RCU to recompute the default. This has the virtue of
> allowing different userspaces to handle this differently.
>
> 3. Take a half-step by having RCU register a callback during the
> latest phase of kernel-visible boot. I am under the impression
> that this is a relatively small fraction of boot, so it would
> be sub-optimal.
>
> 4. Make CPUs announce quiescence on each entry to idle. This
> covers the transition to idle, but when a given CPU stays idle
> for more than one grace period, RCU has to do something to verify
> that the CPU remains idle. Right now, that is FQS's job --
> it cycles through the dyntick-idle structures of all CPUs that
> have not already announced quiescence.
>
> 5. Make CPUs IPI RCU's grace-period kthread on each transition
> to and from idle. I might be missing something, but given the
> cost and disruptiveness of IPIs, this does not seem to me to be
> a strategy to win.
>
> 6. IPI the CPUs to see if they are still idle. This would defeat
> energy efficiency. Of course, RCU could take this approach
> only during boot, but it is cheaper and faster to just check
> each CPU's rcu_dynticks structure -- which is what FQS does.
>
> 7. Treat all normal grace periods as expedited grace periods, but
> only during boot. It is fairly easy for RCU to do this, but
> again, something has to tell RCU when boot is complete.
>
> 8. Your idea here. Plus more of mine as I remember them. ;-)
>
> So, what am I missing?

Hmmm... I suppose I could have RCU define boot as being (say) the ten
seconds following the early_inits. That is crude enough that it might
actually work reasonably well.
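
A minimal sketch of that heuristic (illustrative only; none of these
names exist in the series, and the ten-second window is just the guess
above):

	/* Hypothetical: treat the first ten seconds after RCU's early
	 * init as "boot" and use the aggressive fqs delay until then.
	 * Relies on jiffies/time_before() from <linux/jiffies.h>. */
	static unsigned long rcu_boot_end_jiffies __read_mostly;

	static void __init rcu_note_boot_window(void)
	{
		rcu_boot_end_jiffies = jiffies + 10 * HZ;
	}

	static bool rcu_probably_booting(void)
	{
		return time_before(jiffies, rcu_boot_end_jiffies);
	}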

Thanx, Paul

2013-04-15 02:04:09

by Paul Mackerras

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <[email protected]>
> > >
> > > Systems with HZ=100 can have slow bootup times due to the default
> > > three-jiffy delays between quiescent-state forcing attempts. This
> > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > on the value of HZ. However, this would break very large systems that
> > > require more time between quiescent-state forcing attempts. This
> > > commit therefore also ups the default delay by one jiffy for each
> > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > runtime, -not- NR_CPUS at build time).
> > >
> > > Reported-by: Paul Mackerras <[email protected]>
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > Something seems very wrong if RCU regularly hits the fqs code during
> > boot; feels like there's some more straightforward solution we're
> > missing. What causes these CPUs to fall under RCU's scrutiny during
> > boot yet not actually hit the RCU codepaths naturally?
>
> The problem is that they are running HZ=100, so that RCU will often
> take 30-60 milliseconds per grace period. At that point, you only
> need 16-30 grace periods to chew up a full second, so it is not all
> that hard to eat up the additional 8-12 seconds of boot time that
> they were seeing. IIRC, UP boot was costing them 4 seconds.

I added some instrumentation, which counted 202 calls to
synchronize_sched() during boot (Fedora 17 minimal install +
development tools) with a 3.8.0 kernel on a 4-cpu KVM virtual machine
on a POWER7. Without this patch, those 202 calls take up a total of
4.32 seconds; with it, they take up 3.6 seconds. The kernel is
compiled with HZ=100 and NR_CPUS=1024, like the standard Fedora
kernel.

I suspect a lot of the calls are in udevd and related processes.
Interestingly there were no calls to synchronize_rcu_bh or
synchronize_sched_expedited.

Paul.

2013-04-15 17:27:45

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Mon, Apr 15, 2013 at 12:03:54PM +1000, Paul Mackerras wrote:
> On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > > From: "Paul E. McKenney" <[email protected]>
> > > >
> > > > Systems with HZ=100 can have slow bootup times due to the default
> > > > three-jiffy delays between quiescent-state forcing attempts. This
> > > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > > on the value of HZ. However, this would break very large systems that
> > > > require more time between quiescent-state forcing attempts. This
> > > > commit therefore also ups the default delay by one jiffy for each
> > > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > > runtime, -not- NR_CPUS at build time).
> > > >
> > > > Reported-by: Paul Mackerras <[email protected]>
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > >
> > > Something seems very wrong if RCU regularly hits the fqs code during
> > > boot; feels like there's some more straightforward solution we're
> > > missing. What causes these CPUs to fall under RCU's scrutiny during
> > > boot yet not actually hit the RCU codepaths naturally?
> >
> > The problem is that they are running HZ=100, so that RCU will often
> > take 30-60 milliseconds per grace period. At that point, you only
> > need 16-30 grace periods to chew up a full second, so it is not all
> > that hard to eat up the additional 8-12 seconds of boot time that
> > they were seeing. IIRC, UP boot was costing them 4 seconds.
>
> I added some instrumentation, which counted 202 calls to
> synchronize_sched() during boot (Fedora 17 minimal install +
> development tools) with a 3.8.0 kernel on a 4-cpu KVM virtual machine
> on a POWER7. Without this patch, those 202 calls take up a total of
> 4.32 seconds; with it, they take up 3.6 seconds. The kernel is
> compiled with HZ=100 and NR_CPUS=1024, like the standard Fedora
> kernel.

Going from 4.32 seconds down to 3.6 seconds is an improvement, but there
is clearly room for more. The following experimental not-for-inclusion
patch might help eliminate most of the remaining 3.6 seconds. Could you please
try it out?

> I suspect a lot of the calls are in udevd and related processes.
> Interestingly there were no calls to synchronize_rcu_bh or
> synchronize_sched_expedited.

The lack of synchronize_rcu_bh() suggests that networking is not
involved in the slowdown. The lack of synchronize_sched_expedited()
is not surprising, unless you booted with rcupdate.rcu_expedited=1,
but in that case I would expect a much greater reduction in boot time.

Thanx, Paul

------------------------------------------------------------------------

rcu: Not for inclusion: Force expedited grace periods

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index a9610d1..55c5ef6 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2420,7 +2420,7 @@ void synchronize_sched(void)
"Illegal synchronize_sched() in RCU-sched read-side critical section");
if (rcu_blocking_is_gp())
return;
- if (rcu_expedited)
+ if (1)
synchronize_sched_expedited();
else
wait_rcu_gp(call_rcu_sched);
@@ -2447,7 +2447,7 @@ void synchronize_rcu_bh(void)
"Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
if (rcu_blocking_is_gp())
return;
- if (rcu_expedited)
+ if (1)
synchronize_rcu_bh_expedited();
else
wait_rcu_gp(call_rcu_bh);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 46b93b0..190a199 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -711,7 +711,7 @@ void synchronize_rcu(void)
"Illegal synchronize_rcu() in RCU read-side critical section");
if (!rcu_scheduler_active)
return;
- if (rcu_expedited)
+ if (1)
synchronize_rcu_expedited();
else
wait_rcu_gp(call_rcu);

2013-05-14 12:23:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Sat, Apr 13, 2013 at 03:09:43PM -0700, Paul E. McKenney wrote:
> > How are those CPUs going idle without first telling RCU that they're
> > quiesced? Seems like, during boot at least, you want RCU to use its
> > idle==quiesced logic to proactively note continuously-quiescent states.
> > Ideally, you should not hit the FQS code at all during boot.
>
> FQS is RCU's idle==quiesced logic. ;-)
>
> In theory, RCU could add logic at idle entry to report a quiescent state,
> in fact CONFIG_RCU_FAST_NO_HZ used to do exactly that. In practice,
> this is not good for energy efficiency at runtime for a goodly number
> of workloads, which is why CONFIG_RCU_FAST_NO_HZ now relies on callback
> numbering and FQS.

OK, so bear with me.. I've missed a few months of RCU so I might not be as
up-to-date as I'd like to be.

So going by the above, FAST_NO_HZ used to kick RCU into quiescence on entering
NO_HZ. This made some ARM people happy but made the rest of the world sad
because of immense idle-entry times.

The above implies you've changed this about to allow CPUs to go idle without
reporting home but instead rely on Forced Quiescent States to push the remote
idle cpus into quiescence.

Now I understand that advancing the RCU state machine and processing callbacks
takes time; however at boot (and possibly thereafter) we have the special case
where we have no pending RCU state.

Could we not, under those circumstances, quickly remove the CPU from the RCU
state machine so that FQS aren't required to prod quite as much remote state?

2013-05-14 14:18:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Tue, May 14, 2013 at 02:20:49PM +0200, Peter Zijlstra wrote:
> On Sat, Apr 13, 2013 at 03:09:43PM -0700, Paul E. McKenney wrote:
> > > How are those CPUs going idle without first telling RCU that they're
> > > quiesced? Seems like, during boot at least, you want RCU to use its
> > > idle==quiesced logic to proactively note continuously-quiescent states.
> > > Ideally, you should not hit the FQS code at all during boot.
> >
> > FQS is RCU's idle==quiesced logic. ;-)
> >
> > In theory, RCU could add logic at idle entry to report a quiescent state,
> > in fact CONFIG_RCU_FAST_NO_HZ used to do exactly that. In practice,
> > this is not good for energy efficiency at runtime for a goodly number
> > of workloads, which is why CONFIG_RCU_FAST_NO_HZ now relies on callback
> > numbering and FQS.
>
> OK, so bear with me.. I've missed a few months of RCU so I might not be as
> up-to-date as I'd like to be.
>
> So going by the above, FAST_NO_HZ used to kick RCU into quiescence on entering
> NO_HZ. This made some ARM people happy but made the rest of the world sad
> because of immense idle-entry times.

The old RCU_FAST_NO_HZ was too heavy-weight. The effect was that
it achieved the stated goal of producing long idle periods without
scheduling-clock interrupts, but incurred to-idle and from-idle overhead
that rivaled the savings. :-(

> The above implies you've changed this about to allow CPUs to go idle without
> reporting home but instead rely on Forced Quiescent States to push the remote
> idle cpus into quiescence.

This is one prong of the mechanism, which is the same prong used for
normal dyntick-idle CPUs. The other prong is a slowed-down timer tick,
4 jiffies if there is at least one non-lazy callback on a given CPU, 6
seconds if all of that CPU's callbacks are lazy ("lazy" as in kfree_rcu()
as opposed to synchronize_rcu() or call_rcu()).
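
In code, the two delays are roughly the following constants from
rcutree_plugin.h (a hedged sketch, not a verbatim quote):

#define RCU_IDLE_GP_DELAY	4		/* jiffies: some non-lazy CBs */
#define RCU_IDLE_LAZY_GP_DELAY	(6 * HZ)	/* jiffies: all CBs lazy */

rcu_needs_cpu() chooses between them and rounds the wake-up time to a
multiple of the chosen delay so that nearby CPUs' idle timers tend to
fire together; the "if" statement doing the choosing is the one the
patch below flips.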

Unfortunately, idiot here got the lazy/non-lazy comparison backwards,
which is what I believe to be responsible for the excessive boot times.
(Also for the excessive suspend and hibernation times, but it appears
that using expedited grace periods works really well for that.) The
patch to fix my mistake is attached below.

> Now I understand that advancing the RCU state machine and processing callbacks
> takes time; however at boot (and possibly thereafter) we have the special case
> where we have no pending RCU state.
>
> Could we not, under those circumstances, quickly remove the CPU from the RCU
> state machine so that FQS aren't required to prod quite as much remote state?

In theory, yes. In practice, this requires lots of lock acquisitions
and releases on large systems, including some global locks. The weight
could be reduced, but...

What I would like to do instead would be to specify expedited grace
periods during boot. The challenge here appears to be somehow telling
RCU when boot is done. The APIs are there from an RCU viewpoint: boot
with rcupdate.rcu_expedited=1, then, once boot is complete (whatever
that means on your platform) "echo 0 > /sys/kernel/rcu_expedited".

Thanx, Paul

------------------------------------------------------------------------

rcu: Fix comparison sense in rcu_needs_cpu()

Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
callbacks) introduced a bug that can result in excessively long grace
periods. This bug reversed the sense of the "if" statement checking
for lazy callbacks, so that RCU takes a lazy approach when there are
in fact non-lazy callbacks. This can result in excessive boot, suspend,
and resume times.

This commit therefore fixes the sense of this "if" statement.

Reported-by: Borislav Petkov <[email protected]>
Reported-by: Björn Mork <[email protected]>
Reported-by: Joerg Roedel <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 170814d..6d939a6 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1667,7 +1667,7 @@ int rcu_needs_cpu(int cpu, unsigned long *dj)
rdtp->last_accelerate = jiffies;

/* Request timer delay depending on laziness, and round. */
- if (rdtp->all_lazy) {
+ if (!rdtp->all_lazy) {
*dj = round_up(rcu_idle_gp_delay + jiffies,
rcu_idle_gp_delay) - jiffies;
} else {

2013-05-14 14:53:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

> In theory, yes. In practice, this requires lots of lock acquisitions
> and releases on large systems, including some global locks. The weight
> could be reduced, but...
>
> What I would like to do instead would be to specify expedited grace
> periods during boot.

But why, surely going idle without any RCU callbacks isn't completely unheard
of, even outside of the boot process?

Being able to quickly drop out of the RCU state machinery would be a good thing IMO.

> The challenge here appears to be somehow telling
> RCU when boot is done. The APIs are there from an RCU viewpoint: boot
> with rcupdate.rcu_expedited=1, then, once boot is complete (whatever
> that means on your platform) "echo 0 > /sys/kernel/rcu_expedited".

Ha, and here you assume userspace is sane and co-operative. Fail in my book ;-)

2013-05-14 15:47:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Tue, May 14, 2013 at 04:51:20PM +0200, Peter Zijlstra wrote:
> > In theory, yes. In practice, this requires lots of lock acquisitions
> > and releases on large systems, including some global locks. The weight
> > could be reduced, but...
> >
> > What I would like to do instead would be to specify expedited grace
> > periods during boot.
>
> But why, surely going idle without any RCU callbacks isn't completely unheard
> of, even outside of the boot process?

Yep, and RCU has special-cased that for quite some time.

> Being able to quickly drop out of the RCU state machinery would be a good thing IMO.

And this is currently possible -- this is the job of rcu_idle_enter()
and friends. And it works well, at least when I get my "if" statements
set up correctly (hence the earlier patch).

Or are you seeing a slowdown even with that earlier patch applied? If so,
please let me know what you are seeing.

> > The challenge here appears to be somehow telling
> > RCU when boot is done. The APIs are there from an RCU viewpoint: boot
> > with rcupdate.rcu_expedited=1, then, once boot is complete (whatever
> > that means on your platform) "echo 0 > /sys/kernel/rcu_expedited".
>
> Ha, and here you assume userspace is sane and co-operative. Fail in my book ;-)

If they are insane or uncooperative, they pay the price. ;-)

Thanx, Paul

2013-05-15 09:01:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Tue, May 14, 2013 at 08:47:28AM -0700, Paul E. McKenney wrote:
> On Tue, May 14, 2013 at 04:51:20PM +0200, Peter Zijlstra wrote:
> > > In theory, yes. In practice, this requires lots of lock acquisitions
> > > and releases on large systems, including some global locks. The weight
> > > could be reduced, but...
> > >
> > > What I would like to do instead would be to specify expedited grace
> > > periods during boot.
> >
> > But why, surely going idle without any RCU callbacks isn't completely unheard
> > of, even outside of the boot process?
>
> Yep, and RCU has special-cased that for quite some time.
>
> > Being able to quickly drop out of the RCU state machinery would be a good thing IMO.
>
> And this is currently possible -- this is the job of rcu_idle_enter()
> and friends. And it works well, at least when I get my "if" statements
> set up correctly (hence the earlier patch).
>
> Or are you seeing a slowdown even with that earlier patch applied? If so,
> please let me know what you are seeing.

I'm not running anything in particular, except maybe a broken mental
model of RCU ;-)

So what I'm talking about is the !rcu_cpu_has_callbacks() case, where
there's absolutely nothing for RCU to do except tell the state machine
it's no longer participating.

Your patch to rcu_needs_cpu() frobbing the lazy condition is after that
and thus irrelevant for this AFAICT.

Now as far as I can see, rcu_needs_cpu() will return false in this case;
allowing the cpu to enter NO_HZ state. We then call rcu_idle_enter()
which would call rcu_eqs_enter(). Which should put the CPU in extended
quiescent state.

However, you're still running into these FQSs delaying boot. Why is
that? Is that because rcu_eqs_enter() doesn't really do enough?

The thing is, if all other CPUs are idle, detecting the end of a grace
period should be rather trivial and not involve FQSs and thus be tons
faster.

Clearly I'm missing something obvious and not communicating right or so.

2013-05-15 09:03:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Wed, May 15, 2013 at 10:56:39AM +0200, Peter Zijlstra wrote:
> On Tue, May 14, 2013 at 08:47:28AM -0700, Paul E. McKenney wrote:
> > On Tue, May 14, 2013 at 04:51:20PM +0200, Peter Zijlstra wrote:
> > > > In theory, yes. In practice, this requires lots of lock acquisitions
> > > > and releases on large systems, including some global locks. The weight
> > > > could be reduced, but...
> > > >
> > > > What I would like to do instead would be to specify expedited grace
> > > > periods during boot.
> > >
> > > But why, surely going idle without any RCU callbacks isn't completely unheard
> > > of, even outside of the boot process?
> >
> > Yep, and RCU has special-cased that for quite some time.
> >
> > > Being able to quickly drop out of the RCU state machinery would be a good thing IMO.
> >
> > And this is currently possible -- this is the job of rcu_idle_enter()
> > and friends. And it works well, at least when I get my "if" statements
> > set up correctly (hence the earlier patch).
> >
> > Or are you seeing a slowdown even with that earlier patch applied? If so,
> > please let me know what you are seeing.
>
> I'm not running anything in particular, except maybe a broken mental
> model of RCU ;-)
>
> So what I'm talking about is the !rcu_cpu_has_callbacks() case, where
> there's absolutely nothing for RCU to do except tell the state machine
> it's no longer participating.
>
> Your patch to rcu_needs_cpu() frobbing the lazy condition is after that
> and thus irrelevant for this AFAICT.
>
> Now as far as I can see, rcu_needs_cpu() will return false in this case;
> allowing the cpu to enter NO_HZ state. We then call rcu_idle_enter()
> which would call rcu_eqs_enter(). Which should put the CPU in extended
> quiescent state.
>
> However, you're still running into these FQSs delaying boot. Why is
> that? Is that because rcu_eqs_enter() doesn't really do enough?
>
> The thing is, if all other CPUs are idle, detecting the end of a grace
> period should be rather trivial and not involve FQSs and thus be tons
> faster.
>
> Clearly I'm missing something obvious and not communicating right or so.

Earlier you said that improving EQS behaviour was expensive in that it
would require taking (global) locks or somesuch.

Would it not be possible to have the cpu performing a FQS finish this
work; that way the first FQS would be a little slow, but after that no
FQS would be needed anymore, right? Since we'd no longer require the
other CPUs to end a grace period.

2013-05-15 09:21:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ


* Paul E. McKenney <[email protected]> wrote:

> rcu: Fix comparison sense in rcu_needs_cpu()
>
> Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
> callbacks) introduced a bug that can result in excessively long grace
> periods. This bug reversed the sense of the "if" statement checking
> for lazy callbacks, so that RCU takes a lazy approach when there are
> in fact non-lazy callbacks. This can result in excessive boot, suspend,
> and resume times.
>
> This commit therefore fixes the sense of this "if" statement.
>
> Reported-by: Borislav Petkov <[email protected]>
> Reported-by: Björn Mork <[email protected]>
> Reported-by: Joerg Roedel <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index 170814d..6d939a6 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -1667,7 +1667,7 @@ int rcu_needs_cpu(int cpu, unsigned long *dj)
> rdtp->last_accelerate = jiffies;
>
> /* Request timer delay depending on laziness, and round. */
> - if (rdtp->all_lazy) {
> + if (!rdtp->all_lazy) {
> *dj = round_up(rcu_idle_gp_delay + jiffies,
> rcu_idle_gp_delay) - jiffies;

Neat - could this explain sporadic long (but not infinite) boot times with
NOHZ_FULL?

We changed HZ to be at least 1 Hz pretty recently, which might have worked
around this bug.

Thanks,

Ingo

2013-05-15 15:45:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Wed, May 15, 2013 at 11:20:55AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > rcu: Fix comparison sense in rcu_needs_cpu()
> >
> > Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
> > callbacks) introduced a bug that can result in excessively long grace
> > periods. This bug reversed the sense of the "if" statement checking
> > for lazy callbacks, so that RCU takes a lazy approach when there are
> > in fact non-lazy callbacks. This can result in excessive boot, suspend,
> > and resume times.
> >
> > This commit therefore fixes the sense of this "if" statement.
> >
> > Reported-by: Borislav Petkov <[email protected]>
> > Reported-by: Björn Mork <[email protected]>
> > Reported-by: Joerg Roedel <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > index 170814d..6d939a6 100644
> > --- a/kernel/rcutree_plugin.h
> > +++ b/kernel/rcutree_plugin.h
> > @@ -1667,7 +1667,7 @@ int rcu_needs_cpu(int cpu, unsigned long *dj)
> > rdtp->last_accelerate = jiffies;
> >
> > /* Request timer delay depending on laziness, and round. */
> > - if (rdtp->all_lazy) {
> > + if (!rdtp->all_lazy) {
> > *dj = round_up(rcu_idle_gp_delay + jiffies,
> > rcu_idle_gp_delay) - jiffies;
>
> Neat - could this explain sporadic long (but not infinite) boot times with
> NOHZ_FULL?
>
> We changed HZ to be at least 1 Hz pretty recently, which might have worked
> around this bug.

Quite possibly...

Of course, I don't see the boot slowdowns in my testing. :-/

Thanx, Paul

2013-05-15 21:10:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Wed, May 15, 2013 at 10:56:39AM +0200, Peter Zijlstra wrote:
> On Tue, May 14, 2013 at 08:47:28AM -0700, Paul E. McKenney wrote:
> > On Tue, May 14, 2013 at 04:51:20PM +0200, Peter Zijlstra wrote:
> > > > In theory, yes. In practice, this requires lots of lock acquisitions
> > > > and releases on large systems, including some global locks. The weight
> > > > could be reduced, but...
> > > >
> > > > What I would like to do instead would be to specify expedited grace
> > > > periods during boot.
> > >
> > > But why, surely going idle without any RCU callbacks isn't completely unheard
> > > of, even outside of the boot process?
> >
> > Yep, and RCU has special-cased that for quite some time.
> >
> > > Being able to quickly drop out of the RCU state machinery would be a good thing IMO.
> >
> > And this is currently possible -- this is the job of rcu_idle_enter()
> > and friends. And it works well, at least when I get my "if" statements
> > set up correctly (hence the earlier patch).
> >
> > Or are you seeing a slowdown even with that earlier patch applied? If so,
> > please let me know what you are seeing.
>
> I'm not running anything in particular, except maybe a broken mental
> model of RCU ;-)
>
> So what I'm talking about is the !rcu_cpu_has_callbacks() case, where
> there's absolutely nothing for RCU to do except tell the state machine
> > it's no longer participating.
>
> Your patch to rcu_needs_cpu() frobbing the lazy condition is after that
> and thus irrelevant for this AFAICT.
>
> Now as far as I can see, rcu_needs_cpu() will return false in this case;
> allowing the cpu to enter NO_HZ state. We then call rcu_idle_enter()
> which would call rcu_eqs_enter(). Which should put the CPU in extended
> quiescent state.

Yep, that is exactly what happens in that case.

But it really was the wrongly frobbed lazy check that was causing the
regression in boot times and in suspend/hibernate times.

> However, you're still running into these FQSs delaying boot. Why is
> that? Is that because rcu_eqs_enter() doesn't really do enough?

You are assuming that they are delaying boot. Maybe they are and maybe
they are not. One way to find out would be to boot both with and without
rcupdate.rcu_expedited=1 and compare the boot times. I don't see a
statistically significant difference when I try it, but other hardware
and software configurations might see other results.

For the sake of argument, let's assume that they are.

> The thing is, if all other CPUs are idle, detecting the end of a grace
> period should be rather trivial and not involve FQSs and thus be tons
> faster.
>
> Clearly I'm missing something obvious and not communicating right or so.

Or maybe it is me missing the obvious -- wouldn't be the first time! ;-)

The need is to detect that an idle CPU is idle without making it do
anything. To do otherwise would kill battery lifetime and introduce
OS jitter.

This other CPU must be able to correctly detect idle CPUs regardless of
how long they have been idle. In particular, it is necessary to detect
CPUs that were idle at the start of the current grace period and have
remained idle throughout the entirety of the current grace period.

A CPU might transition between idle and non-idle states at any time.
Therefore, if RCU collects a given CPU's idleness state during a given
grace period, it must be very careful to avoid relying on that state
during some other grace period.

Therefore, from what I can see, unless all CPUs explicitly report a
quiescent state in a timely fashion during a given grace period (in
which case each CPU was non-idle at some point during that grace period),
there is no alternative to polling RCU's per-CPU rcu_dynticks structures
during that grace period. In particular, if at least one CPU remained
idle throughout that grace period, it will be necessary to poll.
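
To make the polling concrete, here is a hedged sketch of the per-CPU
check (illustrative name and simplified logic, not the actual kernel
function). Each CPU maintains an even/odd dynticks counter, even
meaning idle, which the FQS scan snapshots and later rechecks:

static int sketch_dynticks_qs(struct rcu_dynticks *rdtp, int *snap)
{
	int cur = atomic_add_return(0, &rdtp->dynticks); /* full barrier */

	if (!(cur & 0x1) || cur != *snap) {
		*snap = cur;	/* even: idle now; changed: idle sojourn */
		return 1;	/* report a quiescent state for this CPU */
	}
	return 0;		/* no evidence of idleness: keep polling */
}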

Of course, during boot time, there are often long time periods during
which at least one CPU remains idle. Therefore, we can expect many
boot-time grace periods to delay for at least one FQS time period.

OK, so how much delay does this cause? The delay from the start
of the grace period until the first FQS scan is controlled by
jiffies_till_first_fqs, which defaults to 3 jiffies. One question
might be "Why delay at all?" The reason for delaying is efficiency at
run time -- the longer a given grace period delays, the more updates
will be handled by a given grace period, and the lower the per-update
grace-period overhead.
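
For reference, the patch heading this thread derives the base delay
from HZ along roughly these lines (a sketch of the rcutree.h
definitions, close to but not guaranteed verbatim):

#if HZ > 500
#define RCU_JIFFIES_TILL_FORCE_QS 3	/* base jiffies between FQS attempts */
#elif HZ > 250
#define RCU_JIFFIES_TILL_FORCE_QS 2
#else
#define RCU_JIFFIES_TILL_FORCE_QS 1	/* HZ=100 waits one jiffy, not three */
#endif
#define RCU_JIFFIES_FQS_DIV	256	/* extra jiffy per 256 possible CPUs */

with the effective delay computed at run time as
RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV.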

This still leaves the question of whether it would be better to
do the first scan immediately after initializing the grace period.
It turns out that you can make the current code do this by booting
with rcutree.jiffies_till_first_fqs=0. You can also adjust the value
after boot via sysfs, though it will clamp values to one second's worth
of jiffies.

So, if you are seeing RCU slowing down boot, there are two things to try:

1. Boot with rcupdate.rcu_expedited=1.

2. Boot with rcutree.jiffies_till_first_fqs=0.

I cannot imagine changing the default for rcupdate.rcu_expedited
unless userspace set it back after boot completes, but if
rcutree.jiffies_till_first_fqs=0 helps, it might be worth changing
the default.

Thanx, Paul

2013-05-15 21:10:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Wed, May 15, 2013 at 11:02:34AM +0200, Peter Zijlstra wrote:
> On Wed, May 15, 2013 at 10:56:39AM +0200, Peter Zijlstra wrote:
> > On Tue, May 14, 2013 at 08:47:28AM -0700, Paul E. McKenney wrote:
> > > On Tue, May 14, 2013 at 04:51:20PM +0200, Peter Zijlstra wrote:
> > > > > In theory, yes. In practice, this requires lots of lock acquisitions
> > > > > and releases on large systems, including some global locks. The weight
> > > > > could be reduced, but...
> > > > >
> > > > > What I would like to do instead would be to specify expedited grace
> > > > > periods during boot.
> > > >
> > > > But why, surely going idle without any RCU callbacks isn't completely unheard
> > > > of, even outside of the boot process?
> > >
> > > Yep, and RCU has special-cased that for quite some time.
> > >
> > > > Being able to quickly drop out of the RCU state machinery would be a good thing IMO.
> > >
> > > And this is currently possible -- this is the job of rcu_idle_enter()
> > > and friends. And it works well, at least when I get my "if" statements
> > > set up correctly (hence the earlier patch).
> > >
> > > Or are you seeing a slowdown even with that earlier patch applied? If so,
> > > please let me know what you are seeing.
> >
> > I'm not running anything in particular, except maybe a broken mental
> > model of RCU ;-)
> >
> > So what I'm talking about is the !rcu_cpu_has_callbacks() case, where
> > there's absolutely nothing for RCU to do except tell the state machine
> > it's no longer participating.
> >
> > Your patch to rcu_needs_cpu() frobbing the lazy condition is after that
> > and thus irrelevant for this AFAICT.
> >
> > Now as far as I can see, rcu_needs_cpu() will return false in this case;
> > allowing the cpu to enter NO_HZ state. We then call rcu_idle_enter()
> > which would call rcu_eqs_enter(). Which should put the CPU in extended
> > quiescent state.
> >
> > However, you're still running into these FQSs delaying boot. Why is
> > that? Is that because rcu_eqs_enter() doesn't really do enough?
> >
> > The thing is, if all other CPUs are idle, detecting the end of a grace
> > period should be rather trivial and not involve FQSs and thus be tons
> > faster.
> >
> > Clearly I'm missing something obvious and not communicating right or so.
>
> Earlier you said that improving EQS behaviour was expensive in that it
> would require taking (global) locks or somesuch.
>
> Would it not be possible to have the cpu performing a FQS finish this
> work; that way the first FQS would be a little slow, but after that no
> FQS would be needed anymore, right? Since we'd no longer require the
> other CPUs to end a grace period.

It is not just the first FQS that would be slow; it would also be slow
the next time that this CPU transitioned from idle to non-idle, which
is when this work would need to be undone.

Furthermore, in this approach, RCU would still need to scan all the CPUs
to see if any did the first part of the transition to idle. And if we
have to scan either way, why not keep the idle-nonidle transitions cheap
and continue to rely on the scan? Here are the rationales I can think
of and what I am thinking in terms of doing instead:

1. The scan could become a scalability bottleneck. There is one
way to handle this today, and one possible future change. The way
to handle this today is to increase rcutree.jiffies_till_first_fqs,
for example, the SGI guys set it to 20 or thereabouts. If this
becomes problematic, I could easily create multiple kthreads to
carry out the FQS scan in parallel for large systems.

2. Someone could demonstrate that RCU's grace periods were significantly
delaying boot. There are several ways of dealing with this:

a. Set rcupdate.rcu_expedited=1 at boot, and set it back
after boot completes.

b. Set rcutree.jiffies_till_first_fqs=0.

c. As (b) above, but modifying RCU to use additional
kthreads for the per-CPU grace-period operations.

Make sense?

Thanx, Paul

2013-05-16 09:39:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Wed, May 15, 2013 at 09:37:00AM -0700, Paul E. McKenney wrote:
> The need is to detect that an idle CPU is idle without making it do
> anything. To do otherwise would kill battery lifetime and introduce
> OS jitter.

"Not anything" isn't leaving us much room to wriggle; we could maybe try and do a
wee bit without people shooting us :-) In fact, looking at rcu_idle_enter(),
it's very much not an empty function.

> This other CPU must be able to correctly detect idle CPUs regardless of
> how long they have been idle. In particular, it is necessary to detect
> CPUs that were idle at the start of the current grace period and have
> remained idle throughout the entirety of the current grace period.

OK, so continuing this hypothetical merry go round :-)

Since RCU is a global endeavour, I'm assuming there is a global GP sequence
number. Could we not stamp the CPU with the current GP# in rcu_idle_enter()?

> A CPU might transition between idle and non-idle states at any time.
> Therefore, if RCU collects a given CPU's idleness state during a given
> grace period, it must be very careful to avoid relying on that state
> during some other grace period.

However, if we know during which GP it became idle, we know we can ignore it
for all GPs thereafter, right?

> Therefore, from what I can see, unless all CPUs explicitly report a
> quiescent state in a timely fashion during a given grace period (in
> which case each CPU was non-idle at some point during that grace period),
> there is no alternative to polling RCU's per-CPU rcu_dynticks structures
> during that grace period. In particular, if at least one CPU remained
> idle throughout that grace period, it will be necessary to poll.

Agreed..

> Of course, during boot time, there are often long time periods during
> which at least one CPU remains idle. Therefore, we can expect many
> boot-time grace periods to delay for at least one FQS time period.
>
> OK, so how much delay does this cause?

Oh, I'm so way past that, it is a neat puzzle by now ;-)

2013-05-16 09:49:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Wed, May 15, 2013 at 10:31:42AM -0700, Paul E. McKenney wrote:
> On Wed, May 15, 2013 at 11:02:34AM +0200, Peter Zijlstra wrote:

> > Earlier you said that improving EQS behaviour was expensive in that it
> > would require taking (global) locks or somesuch.
> >
> > Would it not be possible to have the cpu performing a FQS finish this
> > work; that way the first FQS would be a little slow, but after that no
> > FQS would be needed anymore, right? Since we'd no longer require the
> > other CPUs to end a grace period.
>
> It is not just the first FQS that would be slow; it would also be slow
> the next time that this CPU transitioned from idle to non-idle, which
> is when this work would need to be undone.

Hurm, yes I suppose that is true. If you've saved more on FQS cost it might be
worth it for the throughput people though.

But somehow I imagined making a CPU part of the GP would be easier than taking
it out. After all, taking it out is dangerous and careful work; one is not to
accidentally execute a callback or otherwise end a GP before time.

When entering the GP cycle there is no such concern, the CPU state is clean
after all.

> Furthermore, in this approach, RCU would still need to scan all the CPUs
> to see if any did the first part of the transition to idle. And if we
> have to scan either way, why not keep the idle-nonidle transitions cheap
> and continue to rely on the scan? Here are the rationales I can think
> of and what I am thinking in terms of doing instead:
>
> 1. The scan could become a scalability bottleneck. There is one
> way to handle this today, and one possible future change. The way
> > to handle this today is to increase rcutree.jiffies_till_first_fqs,
> for example, the SGI guys set it to 20 or thereabouts. If this
> becomes problematic, I could easily create multiple kthreads to
> carry out the FQS scan in parallel for large systems.

*groan* whoever thought all this SMP nonsense was worth it again? :-)

> 2. Someone could demonstrate that RCU's grace periods were significantly
> delaying boot. There are several ways of dealing with this:

Surely there's also non-boot cases where most of the machine is 'idle' and
we're running into FQS? Esp. now with that userspace NO_HZ stuff from Frederic.

2013-05-16 13:14:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Thu, May 16, 2013 at 11:37:40AM +0200, Peter Zijlstra wrote:
> On Wed, May 15, 2013 at 09:37:00AM -0700, Paul E. McKenney wrote:
> > The need is to detect that an idle CPU is idle without making it do
> > anything. To do otherwise would kill battery lifetime and introduce
> > OS jitter.
>
> "Not anything" isn't leaving us much room to wriggle; we could maybe try and do a
> wee bit without people shooting us :-) In fact, looking at rcu_idle_enter(),
> it's very much not an empty function.

That said, it operates on CPU-local variables, so the first idle/nonidle
transition after an FQS scan is expensive, but subsequent transitions
will not incur communications cache misses.
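
For reference, that CPU-local work is essentially an even/odd counter
bumped with full memory ordering on each side; a sketch patterned on
the rcu_eqs_enter() path of that era (not quoted verbatim):

static void sketch_eqs_enter(struct rcu_dynticks *rdtp)
{
	smp_mb__before_atomic_inc();	/* order prior accesses before idle */
	atomic_inc(&rdtp->dynticks);	/* odd -> even: CPU now in an EQS */
	smp_mb__after_atomic_inc();	/* order against the next sojourn */
	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
}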

> > This other CPU must be able to correctly detect idle CPUs regardless of
> > how long they have been idle. In particular, it is necessary to detect
> > CPUs that were idle at the start of the current grace period and have
> > remained idle throughout the entirety of the current grace period.
>
> OK, so continuing this hypothetical merry go round :-)
>
> Since RCU is a global endeavour, I'm assuming there is a global GP sequence
> number. Could we not stamp the CPU with the current GP# in rcu_idle_enter()?

We could do that. But we would still need to store it surrounded by
memory barriers, and we would still need to scan it every grace period
to which the CPU did not otherwise respond.

This would get rid of atomic-instruction overhead, but the atomic part
of the increment is on my list to eliminate in any case.

Furthermore, if we stamp the CPU with the last grace period during which
it was non-idle (which needs to be the case for the non-idle-to-idle
transition), we cannot tell whether or not that CPU went idle during
the current grace period if it was non-idle during that grace period.
In contrast, the current scheme can detect arbitrarily short idle
sojourns, regardless of the current state of the CPU.

> > A CPU might transition between idle and non-idle states at any time.
> > Therefore, if RCU collects a given CPU's idleness state during a given
> > grace period, it must be very careful to avoid relying on that state
> > during some other grace period.
>
> However, if we know during which GP it became idle, we know we can ignore it
> for all GPs thereafter, right?

Yes, but as noted above, we wouldn't know to ignore it during the GP during
which it became idle, which is quite important -- many workloads have
short idle sojourns, e.g., due to interrupts arriving at an otherwise
idle CPU.

> > Therefore, from what I can see, unless all CPUs explicitly report a
> > quiescent state in a timely fashion during a given grace period (in
> > which case each CPU was non-idle at some point during that grace period),
> > there is no alternative to polling RCU's per-CPU rcu_dynticks structures
> > during that grace period. In particular, if at least one CPU remained
> > idle throughout that grace period, it will be necessary to poll.
>
> Agreed..
>
> > Of course, during boot time, there are often long time periods during
> > which at least one CPU remains idle. Therefore, we can expect many
> > boot-time grace periods to delay for at least one FQS time period.
> >
> > OK, so how much delay does this cause?
>
> Oh, I'm so way past that, it is a neat puzzle by now ;-)

I can identify with that feeling! ;-)

Thanx, Paul

2013-05-16 13:22:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Thu, May 16, 2013 at 11:45:19AM +0200, Peter Zijlstra wrote:
> On Wed, May 15, 2013 at 10:31:42AM -0700, Paul E. McKenney wrote:
> > On Wed, May 15, 2013 at 11:02:34AM +0200, Peter Zijlstra wrote:
>
> > > Earlier you said that improving EQS behaviour was expensive in that it
> > > would require taking (global) locks or somesuch.
> > >
> > > Would it not be possible to have the cpu performing a FQS finish this
> > > work; that way the first FQS would be a little slow, but after that no
> > > FQS would be needed anymore, right? Since we'd no longer require the
> > > other CPUs to end a grace period.
> >
> > It is not just the first FQS that would be slow; it would also be slow
> > the next time that this CPU transitioned from idle to non-idle, which
> > is when this work would need to be undone.
>
> Hurm, yes I suppose that is true. If you've saved more on FQS cost it might be
> worth it for the throughput people though.

But the NO_HZ_PERIODIC and NO_HZ_IDLE throughput people will have their
CPUs non-idle, which means that they are reporting their quiescent states
and the FQS scan just isn't happening. The NO_HZ_FULL throughput people
will have their RCU GP kthreads pinned to the timekeeping CPU, and therefore
won't care much about the overhead of the FQS scan.

> But somehow I imagined making a CPU part of the GP would be easier than taking
> it out. After all, taking it out is dangerous and careful work; one is not to
> accidentally execute a callback or otherwise end a GP before time.
>
> When entering the GP cycle there is no such concern, the CPU state is clean
> after all.

But that would increase the overhead of GP initialization. Right now,
GP initialization touches only the leaf rcu_node structures, of which
there are by default one per 16 CPUs (and can be configured up to one per
64 CPUs, which it is on really big systems). So on busy mixed-workload
systems, this approach increases GP initialization overhead for no
good reason -- and on systems running these sorts of workloads, there
usually aren't "sacrificial lamb" timekeeping CPUs whose utilization
doesn't matter.
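
As a back-of-envelope sketch (illustrative arithmetic, not kernel
code), the number of leaf structures touched at grace-period
initialization is

	DIV_ROUND_UP(nr_cpu_ids, CONFIG_RCU_FANOUT_LEAF)

so a 4096-CPU system with a leaf fanout of 64 initializes 64 leaf
structures rather than 4096 per-CPU structures.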

> > Furthermore, in this approach, RCU would still need to scan all the CPUs
> > to see if any did the first part of the transition to idle. And if we
> > have to scan either way, why not keep the idle-nonidle transitions cheap
> > and continue to rely on the scan? Here are the rationales I can think
> > of and what I am thinking in terms of doing instead:
> >
> > 1. The scan could become a scalability bottleneck. There is one
> > way to handle this today, and one possible future change. The way
> > to handle this today is to increase rcutree.jiffies_till_first_fqs,
> > for example, the SGI guys set it to 20 or thereabouts. If this
> > becomes problematic, I could easily create multiple kthreads to
> > carry out the FQS scan in parallel for large systems.
>
> *groan* whoever thought all this SMP nonsense was worth it again? :-)

NR_CPUS=0!!! It is the only way! ;-)

> > 2. Someone could demonstrate that RCU's grace periods were significantly
> > delaying boot. There are several ways of dealing with this:
>
> Surely there's also non-boot cases where most of the machine is 'idle' and
> we're running into FQS? Esp. now with that userspace NO_HZ stuff from Frederic.

Yep, but as noted above, the NO_HZ_FULL case will be running the RCU
GP kthreads on the timekeeping CPUs, where they aren't running worker
threads. In the general-purpose workload case, the CPUs are busy and
doing a wide variety of things, so that with high probability each
CPU checks in before the three-jiffies FQS scan has a chance to get
kicked off.

Thanx, Paul

2013-05-21 13:25:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Thu, May 16, 2013 at 06:22:10AM -0700, Paul E. McKenney wrote:
> > But somehow I imagined making a CPU part of the GP would be easier than taking
> > it out. After all, taking it out is dangerous and careful work; one is not to
> > accidentally execute a callback or otherwise end a GP before time.
> >
> > When entering the GP cycle there is no such concern, the CPU state is clean
> > after all.
>
> But that would increase the overhead of GP initialization. Right now,
> GP initialization touches only the leaf rcu_node structures, of which
> there are by default one per 16 CPUs (and can be configured up to one per
> 64 CPUs, which it is on really big systems). So on busy mixed-workload
> systems, this approach increases GP initialization overhead for no
> good reason -- and on systems running these sorts of workloads, there
> usually aren't "sacrificial lamb" timekeeping CPUs whose utilization
> doesn't matter.

Right, so I read through some of the fqs code to get a better feel for
things and I suppose I see what you're talking about :-)

The only thing I could come up with is making fqslock a global/local
style lock, so that individual CPUs can adjust their own state without
bouncing the lock around.

It would make the fqs itself a 'bit' more expensive but ideally those
don't happen that often, ha!

But yeah, every time you let the fqs propagate 'idle' state up the tree
your join becomes more expensive too.

2013-05-21 17:05:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Tue, May 21, 2013 at 11:45:31AM +0200, Peter Zijlstra wrote:
> On Thu, May 16, 2013 at 06:22:10AM -0700, Paul E. McKenney wrote:
> > > But somehow I imagined making a CPU part of the GP would be easier than taking
> > > it out. After all, taking it out is dangerous and careful work; one is not to
> > > accidentally execute a callback or otherwise end a GP before time.
> > >
> > > When entering the GP cycle there is no such concern, the CPU state is clean
> > > after all.
> >
> > But that would increase the overhead of GP initialization. Right now,
> > GP initialization touches only the leaf rcu_node structures, of which
> > there are by default one per 16 CPUs (and can be configured up to one per
> > 64 CPUs, which it is on really big systems). So on busy mixed-workload
> > systems, this approach increases GP initialization overhead for no
> > good reason -- and on systems running these sorts of workloads, there
> > usually aren't "sacrificial lamb" timekeeping CPUs whose utilization
> > doesn't matter.
>
> Right, so I read through some of the fqs code to get a better feel for
> things and I suppose I see what you're talking about :-)
>
> The only thing I could come up with is making fqslock a global/local
> style lock, so that individual CPUs can adjust their own state without
> bouncing the lock around.

Maybe... The current design uses bitmasks at each level, and avoiding the
upper-level locks would mean making RCU work with out-of-date bitmasks
at the upper levels. Might be possible, but it is not clear to me that
this would be a win.

I could also maintain yet another bitmask at the bottom level to record
the idle CPUs, but it is not clear that this is a win, either, especially
on systems with frequent idle/busy transitions.

> It would make the fqs itself a 'bit' more expensive but ideally those
> don't happen that often, ha!
>
> But yeah, every time you let the fqs propagate 'idle' state up the tree
> your join becomes more expensive too.

Yep! :-/

Thanx, Paul

2013-05-28 10:07:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ


* Paul E. McKenney <[email protected]> wrote:

> On Wed, May 15, 2013 at 11:20:55AM +0200, Ingo Molnar wrote:
> >
> > * Paul E. McKenney <[email protected]> wrote:
> >
> > > rcu: Fix comparison sense in rcu_needs_cpu()
> > >
> > > Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
> > > callbacks) introduced a bug that can result in excessively long grace
> > > periods. This bug reversed the sense of the "if" statement checking
> > > for lazy callbacks, so that RCU takes a lazy approach when there are
> > > in fact non-lazy callbacks. This can result in excessive boot, suspend,
> > > and resume times.
> > >
> > > This commit therefore fixes the sense of this "if" statement.
> > >
> > > Reported-by: Borislav Petkov <[email protected]>
> > > Reported-by: Björn Mork <[email protected]>
> > > Reported-by: Joerg Roedel <[email protected]>
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > >
> > > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > > index 170814d..6d939a6 100644
> > > --- a/kernel/rcutree_plugin.h
> > > +++ b/kernel/rcutree_plugin.h
> > > @@ -1667,7 +1667,7 @@ int rcu_needs_cpu(int cpu, unsigned long *dj)
> > > rdtp->last_accelerate = jiffies;
> > >
> > > /* Request timer delay depending on laziness, and round. */
> > > - if (rdtp->all_lazy) {
> > > + if (!rdtp->all_lazy) {
> > > *dj = round_up(rcu_idle_gp_delay + jiffies,
> > > rcu_idle_gp_delay) - jiffies;
> >
> > Neat - could this explain sporadic long (but not infinite) boot times with
> > NOHZ_FULL?
> >
> > We changed HZ to be at least 1 Hz pretty recently, which might have worked
> > around this bug.
>
> Quite possibly...
>
> Of course, I don't see the boot slowdowns in my testing. :-/

They were pretty sporadic and only popped up (and down) during randconfig
testing. Simple unrelated changes to the .config made them go away -
heisenbugs.

Thanks,

Ingo

2013-05-29 01:29:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ

On Tue, May 28, 2013 at 12:07:42PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > On Wed, May 15, 2013 at 11:20:55AM +0200, Ingo Molnar wrote:
> > >
> > > * Paul E. McKenney <[email protected]> wrote:
> > >
> > > > rcu: Fix comparison sense in rcu_needs_cpu()
> > > >
> > > > Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
> > > > callbacks) introduced a bug that can result in excessively long grace
> > > > periods. This bug reversed the sense of the "if" statement checking
> > > > for lazy callbacks, so that RCU takes a lazy approach when there are
> > > > in fact non-lazy callbacks. This can result in excessive boot, suspend,
> > > > and resume times.
> > > >
> > > > This commit therefore fixes the sense of this "if" statement.
> > > >
> > > > Reported-by: Borislav Petkov <[email protected]>
> > > > Reported-by: Björn Mork <[email protected]>
> > > > Reported-by: Joerg Roedel <[email protected]>
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > >
> > > > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > > > index 170814d..6d939a6 100644
> > > > --- a/kernel/rcutree_plugin.h
> > > > +++ b/kernel/rcutree_plugin.h
> > > > @@ -1667,7 +1667,7 @@ int rcu_needs_cpu(int cpu, unsigned long *dj)
> > > > rdtp->last_accelerate = jiffies;
> > > >
> > > > /* Request timer delay depending on laziness, and round. */
> > > > - if (rdtp->all_lazy) {
> > > > + if (!rdtp->all_lazy) {
> > > > *dj = round_up(rcu_idle_gp_delay + jiffies,
> > > > rcu_idle_gp_delay) - jiffies;
> > >
> > > Neat - could this explain sporadic long (but not infinite) boot times with
> > > NOHZ_FULL?
> > >
> > > We changed HZ to be at least 1 Hz pretty recently, which might have worked
> > > around this bug.
> >
> > Quite possibly...
> >
> > Of course, I don't see the boot slowdowns in my testing. :-/
>
> They were pretty sporadic and only popped up (and down) during randconfig
> testing. Simple unrelated changes to the .config made them go away -
> heisenbugs.

I can believe that... The system has to be very quiet for this bug to
significantly slow down boot. Interrupts scattered across CPUs (for
example) would tend to force RCU's state machine forward.

Thanx, Paul

> Thanks,
>
> Ingo
>