2014-06-21 03:00:08

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
fixed a problem where a CPU looping in the kernel with but one runnable
task would give RCU CPU stall warnings, even if the in-kernel loop
contained cond_resched() calls. Unfortunately, in so doing, it introduced
performance regressions in Anton Blanchard's will-it-scale "open1" test.
The problem appears to be not so much the increased cond_resched() path
length as an increase in the rate at which grace periods complete, which
increased per-update grace-period overhead.

This commit takes a different approach to fixing this bug, mainly by
moving the RCU-visible quiescent state from cond_resched() to
rcu_note_context_switch(), and by further reducing the check to a
simple non-zero test of a single per-CPU variable. However, this
approach requires that the force-quiescent-state processing send
resched IPIs to the offending CPUs. These will be sent only once
the grace period has reached an age specified by the boot/sysfs
parameter rcutree.jiffies_till_sched_qs, or once the grace period
reaches an age halfway to the point at which RCU CPU stall warnings
will be emitted, whichever comes first.

Reported-by: Dave Hansen <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Eric Dumazet <[email protected]>

---

b/Documentation/kernel-parameters.txt |    6 +
b/include/linux/rcupdate.h            |   36 --------
b/kernel/rcu/tree.c                   |  140 +++++++++++++++++++++++++++-------
b/kernel/rcu/tree.h                   |    6 +
b/kernel/rcu/tree_plugin.h            |    2
b/kernel/rcu/update.c                 |   18 ----
b/kernel/sched/core.c                 |    7 -
7 files changed, 125 insertions(+), 90 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 6eaa9cdb7094..910c3829f81d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2785,6 +2785,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
leaf rcu_node structure. Useful for very large
systems.

+ rcutree.jiffies_till_sched_qs= [KNL]
+ Set required age in jiffies for a
+ given grace period before RCU starts
+ soliciting quiescent-state help from
+ rcu_note_context_switch().
+
rcutree.jiffies_till_first_fqs= [KNL]
Set delay from grace-period initialization to
first attempt to force quiescent states.
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5a75d19aa661..243aa4656cb7 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -44,7 +44,6 @@
#include <linux/debugobjects.h>
#include <linux/bug.h>
#include <linux/compiler.h>
-#include <linux/percpu.h>
#include <asm/barrier.h>

extern int rcu_expedited; /* for sysctl */
@@ -300,41 +299,6 @@ bool __rcu_is_watching(void);
#endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */

/*
- * Hooks for cond_resched() and friends to avoid RCU CPU stall warnings.
- */
-
-#define RCU_COND_RESCHED_LIM 256 /* ms vs. 100s of ms. */
-DECLARE_PER_CPU(int, rcu_cond_resched_count);
-void rcu_resched(void);
-
-/*
- * Is it time to report RCU quiescent states?
- *
- * Note unsynchronized access to rcu_cond_resched_count. Yes, we might
- * increment some random CPU's count, and possibly also load the result from
- * yet another CPU's count. We might even clobber some other CPU's attempt
- * to zero its counter. This is all OK because the goal is not precision,
- * but rather reasonable amortization of rcu_note_context_switch() overhead
- * and extremely high probability of avoiding RCU CPU stall warnings.
- * Note that this function has to be preempted in just the wrong place,
- * many thousands of times in a row, for anything bad to happen.
- */
-static inline bool rcu_should_resched(void)
-{
- return raw_cpu_inc_return(rcu_cond_resched_count) >=
- RCU_COND_RESCHED_LIM;
-}
-
-/*
- * Report quiscent states to RCU if it is time to do so.
- */
-static inline void rcu_cond_resched(void)
-{
- if (unlikely(rcu_should_resched()))
- rcu_resched();
-}
-
-/*
* Infrastructure to implement the synchronize_() primitives in
* TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
*/
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f1ba77363fbb..7d711f9a2e86 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -206,6 +206,70 @@ void rcu_bh_qs(int cpu)
rdp->passed_quiesce = 1;
}

+static DEFINE_PER_CPU(int, rcu_sched_qs_mask);
+
+static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
+ .dynticks_nesting = DYNTICK_TASK_EXIT_IDLE,
+ .dynticks = ATOMIC_INIT(1),
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+ .dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
+ .dynticks_idle = ATOMIC_INIT(1),
+#endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+};
+
+/*
+ * Let the RCU core know that this CPU has gone through the scheduler,
+ * which is a quiescent state. This is called when the need for a
+ * quiescent state is urgent, so we burn an atomic operation and full
+ * memory barriers to let the RCU core know about it, regardless of what
+ * this CPU might (or might not) do in the near future.
+ *
+ * We inform the RCU core by emulating a zero-duration dyntick-idle
+ * period, which we in turn do by incrementing the ->dynticks counter
+ * by two.
+ */
+void rcu_momentary_dyntick_idle(void)
+{
+ unsigned long flags;
+ struct rcu_data *rdp;
+ struct rcu_dynticks *rdtp;
+ int resched_mask;
+ struct rcu_state *rsp;
+
+ local_irq_save(flags);
+
+ /*
+ * Yes, we can lose flag-setting operations. This is OK, because
+ * the flag will be set again after some delay.
+ */
+ resched_mask = raw_cpu_read(rcu_sched_qs_mask);
+ raw_cpu_write(rcu_sched_qs_mask, 0);
+
+ /* Find the flavor that needs a quiescent state. */
+ for_each_rcu_flavor(rsp) {
+ rdp = raw_cpu_ptr(rsp->rda);
+ if (!(resched_mask & rsp->flavor_mask))
+ continue;
+ smp_mb(); /* ->flavor_mask before ->cond_resched_completed. */
+ if (ACCESS_ONCE(rdp->mynode->completed) !=
+ ACCESS_ONCE(rdp->cond_resched_completed))
+ continue;
+
+ /*
+ * Pretend to be momentarily idle for the quiescent state.
+ * This allows the grace-period kthread to record the
+ * quiescent state, with no need for this CPU to do anything
+ * further.
+ */
+ rdtp = this_cpu_ptr(&rcu_dynticks);
+ smp_mb__before_atomic(); /* Earlier stuff before QS. */
+ atomic_add(2, &rdtp->dynticks); /* QS. */
+ smp_mb__after_atomic(); /* Later stuff after QS. */
+ break;
+ }
+ local_irq_restore(flags);
+}
+
/*
* Note a context switch. This is a quiescent state for RCU-sched,
* and requires special handling for preemptible RCU.
@@ -216,19 +280,12 @@ void rcu_note_context_switch(int cpu)
trace_rcu_utilization(TPS("Start context switch"));
rcu_sched_qs(cpu);
rcu_preempt_note_context_switch(cpu);
+ if (unlikely(raw_cpu_read(rcu_sched_qs_mask)))
+ rcu_momentary_dyntick_idle();
trace_rcu_utilization(TPS("End context switch"));
}
EXPORT_SYMBOL_GPL(rcu_note_context_switch);

-static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
- .dynticks_nesting = DYNTICK_TASK_EXIT_IDLE,
- .dynticks = ATOMIC_INIT(1),
-#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
- .dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
- .dynticks_idle = ATOMIC_INIT(1),
-#endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
-};
-
static long blimit = 10; /* Maximum callbacks per rcu_do_batch. */
static long qhimark = 10000; /* If this many pending, ignore blimit. */
static long qlowmark = 100; /* Once only this many pending, use blimit. */
@@ -243,6 +300,13 @@ static ulong jiffies_till_next_fqs = ULONG_MAX;
module_param(jiffies_till_first_fqs, ulong, 0644);
module_param(jiffies_till_next_fqs, ulong, 0644);

+/*
+ * How long the grace period must be before we start recruiting
+ * quiescent-state help from rcu_note_context_switch().
+ */
+static ulong jiffies_till_sched_qs = HZ / 20;
+module_param(jiffies_till_sched_qs, ulong, 0644);
+
static bool rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
struct rcu_data *rdp);
static void force_qs_rnp(struct rcu_state *rsp,
@@ -853,6 +917,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
bool *isidle, unsigned long *maxj)
{
unsigned int curr;
+ int *rcrmp;
unsigned int snap;

curr = (unsigned int)atomic_add_return(0, &rdp->dynticks->dynticks);
@@ -893,27 +958,43 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
}

/*
- * There is a possibility that a CPU in adaptive-ticks state
- * might run in the kernel with the scheduling-clock tick disabled
- * for an extended time period. Invoke rcu_kick_nohz_cpu() to
- * force the CPU to restart the scheduling-clock tick in this
- * CPU is in this state.
- */
- rcu_kick_nohz_cpu(rdp->cpu);
-
- /*
- * Alternatively, the CPU might be running in the kernel
- * for an extended period of time without a quiescent state.
- * Attempt to force the CPU through the scheduler to gain the
- * needed quiescent state, but only if the grace period has gone
- * on for an uncommonly long time. If there are many stuck CPUs,
- * we will beat on the first one until it gets unstuck, then move
- * to the next. Only do this for the primary flavor of RCU.
+ * A CPU running for an extended time within the kernel can
+ * delay RCU grace periods. When the CPU is in NO_HZ_FULL mode,
+ * even context-switching back and forth between a pair of
+ * in-kernel CPU-bound tasks cannot advance grace periods.
+ * So if the grace period is old enough, make the CPU pay attention.
+ * Note that the unsynchronized assignments to the per-CPU
+ * rcu_sched_qs_mask variable are safe. Yes, setting of
+ * bits can be lost, but they will be set again on the next
+ * force-quiescent-state pass. So lost bit sets do not result
+ * in incorrect behavior, merely in a grace period lasting
+ * a few jiffies longer than it might otherwise. Because
+ * there are at most four threads involved, and because the
+ * updates are only once every few jiffies, the probability of
+ * lossage (and thus of slight grace-period extension) is
+ * quite low.
+ *
+ * Note that if the jiffies_till_sched_qs boot/sysfs parameter
+ * is set too high, we override with half of the RCU CPU stall
+ * warning delay.
*/
- if (rdp->rsp == rcu_state_p &&
+ rcrmp = &per_cpu(rcu_sched_qs_mask, rdp->cpu);
+ if (ULONG_CMP_GE(jiffies,
+ rdp->rsp->gp_start + jiffies_till_sched_qs) ||
ULONG_CMP_GE(jiffies, rdp->rsp->jiffies_resched)) {
- rdp->rsp->jiffies_resched += 5;
- resched_cpu(rdp->cpu);
+ if (!(ACCESS_ONCE(*rcrmp) & rdp->rsp->flavor_mask)) {
+ ACCESS_ONCE(rdp->cond_resched_completed) =
+ ACCESS_ONCE(rdp->mynode->completed);
+ smp_mb(); /* ->cond_resched_completed before *rcrmp. */
+ ACCESS_ONCE(*rcrmp) =
+ ACCESS_ONCE(*rcrmp) + rdp->rsp->flavor_mask;
+ resched_cpu(rdp->cpu); /* Force CPU into scheduler. */
+ rdp->rsp->jiffies_resched += 5; /* Enable beating. */
+ } else if (ULONG_CMP_GE(jiffies, rdp->rsp->jiffies_resched)) {
+ /* Time to beat on that CPU again! */
+ resched_cpu(rdp->cpu); /* Force CPU into scheduler. */
+ rdp->rsp->jiffies_resched += 5; /* Re-enable beating. */
+ }
}

return 0;
@@ -3491,6 +3572,7 @@ static void __init rcu_init_one(struct rcu_state *rsp,
"rcu_node_fqs_1",
"rcu_node_fqs_2",
"rcu_node_fqs_3" }; /* Match MAX_RCU_LVLS */
+ static u8 fl_mask = 0x1;
int cpustride = 1;
int i;
int j;
@@ -3509,6 +3591,8 @@ static void __init rcu_init_one(struct rcu_state *rsp,
for (i = 1; i < rcu_num_lvls; i++)
rsp->level[i] = rsp->level[i - 1] + rsp->levelcnt[i - 1];
rcu_init_levelspread(rsp);
+ rsp->flavor_mask = fl_mask;
+ fl_mask <<= 1;

/* Initialize the elements themselves, starting from the leaves. */

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bf2c1e669691..0f69a79c5b7d 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -307,6 +307,9 @@ struct rcu_data {
/* 4) reasons this CPU needed to be kicked by force_quiescent_state */
unsigned long dynticks_fqs; /* Kicked due to dynticks idle. */
unsigned long offline_fqs; /* Kicked due to being offline. */
+ unsigned long cond_resched_completed;
+ /* Grace period that needs help */
+ /* from cond_resched(). */

/* 5) __rcu_pending() statistics. */
unsigned long n_rcu_pending; /* rcu_pending() calls since boot. */
@@ -392,6 +395,7 @@ struct rcu_state {
struct rcu_node *level[RCU_NUM_LVLS]; /* Hierarchy levels. */
u32 levelcnt[MAX_RCU_LVLS + 1]; /* # nodes in each level. */
u8 levelspread[RCU_NUM_LVLS]; /* kids/node in each level. */
+ u8 flavor_mask; /* bit in flavor mask. */
struct rcu_data __percpu *rda; /* pointer of percu rcu_data. */
void (*call)(struct rcu_head *head, /* call_rcu() flavor. */
void (*func)(struct rcu_head *head));
@@ -563,7 +567,7 @@ static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
-static void rcu_kick_nohz_cpu(int cpu);
+static void __maybe_unused rcu_kick_nohz_cpu(int cpu);
static bool init_nocb_callback_list(struct rcu_data *rdp);
static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index cbc2c45265e2..02ac0fb186b8 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2404,7 +2404,7 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
* if an adaptive-ticks CPU is failing to respond to the current grace
* period and has not be idle from an RCU perspective, kick it.
*/
-static void rcu_kick_nohz_cpu(int cpu)
+static void __maybe_unused rcu_kick_nohz_cpu(int cpu)
{
#ifdef CONFIG_NO_HZ_FULL
if (tick_nohz_full_cpu(cpu))
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index a2aeb4df0f60..d22309cae9f5 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -350,21 +350,3 @@ static int __init check_cpu_stall_init(void)
early_initcall(check_cpu_stall_init);

#endif /* #ifdef CONFIG_RCU_STALL_COMMON */
-
-/*
- * Hooks for cond_resched() and friends to avoid RCU CPU stall warnings.
- */
-
-DEFINE_PER_CPU(int, rcu_cond_resched_count);
-
-/*
- * Report a set of RCU quiescent states, for use by cond_resched()
- * and friends. Out of line due to being called infrequently.
- */
-void rcu_resched(void)
-{
- preempt_disable();
- __this_cpu_write(rcu_cond_resched_count, 0);
- rcu_note_context_switch(smp_processor_id());
- preempt_enable();
-}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b494fe..bc1638b33449 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4147,7 +4147,6 @@ static void __cond_resched(void)

int __sched _cond_resched(void)
{
- rcu_cond_resched();
if (should_resched()) {
__cond_resched();
return 1;
@@ -4166,18 +4165,15 @@ EXPORT_SYMBOL(_cond_resched);
*/
int __cond_resched_lock(spinlock_t *lock)
{
- bool need_rcu_resched = rcu_should_resched();
int resched = should_resched();
int ret = 0;

lockdep_assert_held(lock);

- if (spin_needbreak(lock) || resched || need_rcu_resched) {
+ if (spin_needbreak(lock) || resched) {
spin_unlock(lock);
if (resched)
__cond_resched();
- else if (unlikely(need_rcu_resched))
- rcu_resched();
else
cpu_relax();
ret = 1;
@@ -4191,7 +4187,6 @@ int __sched __cond_resched_softirq(void)
{
BUG_ON(!in_softirq());

- rcu_cond_resched(); /* BH disabled OK, just recording QSes. */
if (should_resched()) {
local_bh_enable();
__cond_resched();
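
A quick standalone illustration of the changelog's "whichever comes
first" rule. This is a sketch only: rcu_gp_wants_sched_qs() is a
hypothetical helper (the patch itself expresses the halfway-to-stall
cap through rsp->jiffies_resched rather than computing it here), and
stall_timeout stands in for the RCU CPU stall-warning age in jiffies.

#include <linux/jiffies.h>

/*
 * Hypothetical: should force-quiescent-state processing start
 * recruiting quiescent-state help (and sending resched IPIs)
 * for the current grace period?
 */
static bool rcu_gp_wants_sched_qs(unsigned long gp_start,
				  unsigned long stall_timeout)
{
	unsigned long deadline = gp_start + jiffies_till_sched_qs;
	unsigned long halfway = gp_start + stall_timeout / 2;

	/* Cap an overly large jiffies_till_sched_qs at the halfway mark. */
	if (time_after(deadline, halfway))
		deadline = halfway;

	return time_after_eq(jiffies, deadline);
}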


2014-06-21 04:30:17

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
> Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
> fixed a problem where a CPU looping in the kernel with but one runnable
> task would give RCU CPU stall warnings, even if the in-kernel loop
> contained cond_resched() calls. Unfortunately, in so doing, it introduced
> performance regressions in Anton Blanchard's will-it-scale "open1" test.
> The problem appears to be not so much the increased cond_resched() path
> length as an increase in the rate at which grace periods complete, which
> increased per-update grace-period overhead.
>
> This commit takes a different approach to fixing this bug, mainly by
> moving the RCU-visible quiescent state from cond_resched() to
> rcu_note_context_switch(), and by further reducing the check to a
> simple non-zero test of a single per-CPU variable. However, this
> approach requires that the force-quiescent-state processing send
> resched IPIs to the offending CPUs. These will be sent only once
> the grace period has reached an age specified by the boot/sysfs
> parameter rcutree.jiffies_till_sched_qs, or once the grace period
> reaches an age halfway to the point at which RCU CPU stall warnings
> will be emitted, whichever comes first.
>
> Reported-by: Dave Hansen <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Josh Triplett <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Eric Dumazet <[email protected]>

I like this approach *far* better. This is the kind of thing I had in
mind when I suggested using the fqs machinery: remove the poll entirely
and just thwack a CPU if it takes too long without a quiescent state.
Reviewed-by: Josh Triplett <[email protected]>


2014-06-21 06:06:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Fri, Jun 20, 2014 at 09:29:58PM -0700, Josh Triplett wrote:
> On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
> > [ ... ]
>
> I like this approach *far* better. This is the kind of thing I had in
> mind when I suggested using the fqs machinery: remove the poll entirely
> and just thwack a CPU if it takes too long without a quiescent state.
> Reviewed-by: Josh Triplett <[email protected]>

Glad you like it. Not a fan of the IPI myself, but then again if you
are spending that much time looping in the kernel, an extra IPI is the
least of your problems.

I will be testing this more thoroughly, and if nothing bad happens will
send it on up within a few days.

Thanx, Paul


2014-06-23 06:26:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
> [ ... ]

Right, and I suppose the force quiescent stuff is triggered from the
tick, which in turn wakes some of these rcu kthreads, which on UP would
themselves cause scheduling.

On the topic of these threads; I recently noticed RCU grew a metric ton
of them, I found some 75 rcu kthreads on my box, wth is up with that?

2014-06-23

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, 23 Jun 2014, Peter Zijlstra wrote:

> On the topic of these threads; I recently noticed RCU grew a metric ton
> of them, I found some 75 rcu kthreads on my box, wth is up with that?

Would kworker threads work for rcu? That would also avoid the shifting
around of RCU threads for NOHZ configurations (which seems to have to be
done manually right now). The kworker subsystem work that allows
restriction to non-NOHZ hardware threads would then also allow the
shifting of the rcu threads, which would simplify the whole endeavor.
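
In sketch form, the idea might look something like the following. Only
a sketch: rcu_do_batch_for_cpu() is a hypothetical helper standing in
for "invoke this CPU's ready callbacks", and nothing like this exists
in the tree today.

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

struct rcu_cb_work {
	struct work_struct work;
	int cpu;
};

static DEFINE_PER_CPU(struct rcu_cb_work, rcu_cb_work);

static void rcu_cb_workfn(struct work_struct *work)
{
	struct rcu_cb_work *rcw = container_of(work, struct rcu_cb_work, work);

	rcu_do_batch_for_cpu(rcw->cpu);	/* hypothetical: invoke ready callbacks */
}

static int __init rcu_cb_work_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct rcu_cb_work *rcw = &per_cpu(rcu_cb_work, cpu);

		rcw->cpu = cpu;
		INIT_WORK(&rcw->work, rcu_cb_workfn);
	}
	return 0;
}
early_initcall(rcu_cb_work_init);

/* Called instead of waking an rcuo kthread when callbacks are ready. */
static void rcu_queue_callbacks(int cpu)
{
	/*
	 * Unbound workers can be confined to housekeeping CPUs via
	 * workqueue attributes, which is what would take care of the
	 * NOHZ placement automatically.
	 */
	queue_work(system_unbound_wq, &per_cpu(rcu_cb_work, cpu).work);
}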


2014-06-23

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Fri, 20 Jun 2014, Paul E. McKenney wrote:

> > I like this approach *far* better. This is the kind of thing I had in
> > mind when I suggested using the fqs machinery: remove the poll entirely
> > and just thwack a CPU if it takes too long without a quiescent state.
> > Reviewed-by: Josh Triplett <[email protected]>
>
> Glad you like it. Not a fan of the IPI myself, but then again if you
> are spending that much time looping in the kernel, an extra IPI is the
> least of your problems.

Good. The IPI is only used when actually necessary. The code it
replaces was always there and always executed, although rarely needed.

2014-06-23 14:54:59

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 08:26:15AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 20, 2014 at 07:59:58PM -0700, Paul E. McKenney wrote:
> > [ ... ]
>
> Right, and I suppose the force quiescent stuff is triggered from the
> tick, which in turn wakes some of these rcu kthreads, which on UP would
> themselves cause scheduling.

Yep, which is another reason why this commit only affects TREE_RCU and
TREE_PREEMPT_RCU, not TINY_RCU.

> On the topic of these threads; I recently noticed RCU grew a metric ton
> of them, I found some 75 rcu kthreads on my box, wth is up with that?

The most likely cause of a recent increase would be if you now have
CONFIG_RCU_NOCB_CPU_ALL=y, which would give you a pair of kthreads per
CPU for callback offloading. Plus an additional kthread per CPU (for
a total of three new kthreads per CPU) for CONFIG_PREEMPT=y. These would
be the rcuo kthreads.
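
(On, say, a 24-CPU box that works out to 3 * 24 = 72 rcuo kthreads, and
the grace-period kthreads for the three flavors would account for the
rest of your 75.)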

Are they causing you trouble?

Thanx, Paul

2014-06-23 15:27:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 08:53:12AM -0500, Christoph Lameter wrote:
> On Fri, 20 Jun 2014, Paul E. McKenney wrote:
>
> > > I like this approach *far* better. This is the kind of thing I had in
> > > mind when I suggested using the fqs machinery: remove the poll entirely
> > > and just thwack a CPU if it takes too long without a quiescent state.
> > > Reviewed-by: Josh Triplett <[email protected]>
> >
> > Glad you like it. Not a fan of the IPI myself, but then again if you
> > are spending that much time looping in the kernel, an extra IPI is the
> > least of your problems.
>
> Good. The IPI is only used when actually necessary. The code it
> replaces was always there and always executed, although rarely needed.

Interesting. I actually proposed this approach several times in the
earlier thread, but to deafening silence: https://lkml.org/lkml/2014/6/18/836,
https://lkml.org/lkml/2014/6/17/793, and https://lkml.org/lkml/2014/6/20/479.

I guess this further validates interpreting silence as assent.

Thanx, Paul

2014-06-23 15:49:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

> On the topic of these threads; I recently noticed RCU grew a metric ton
> of them, I found some 75 rcu kthreads on my box, wth up with that?

It seems like RCU is growing in complexity all the time.

Can it be put on a diet in general?

No more new CONFIGs please either.

-Andi
--
[email protected] -- Speaking for myself only

2014-06-23 16:43:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 08:49:31AM -0700, Andi Kleen wrote:
> > On the topic of these threads; I recently noticed RCU grew a metric ton
> > of them, I found some 75 rcu kthreads on my box, wth up with that?
>
> It seems like RCU is growing in complexity all the time.
>
> Can it be put on a diet in general?

In 3.10, RCU had 14,046 lines of code, not counting documentation and
test scripting. In 3.15, RCU had 13,208 lines of code, again not counting
documentation and test scripting. That is a decrease of almost 1KLoC,
so your wish is granted.

In the future, I hope to be able to make NOCB the default and remove the
softirq-based callback handling, which should shrink things a bit further.
Of course, continued work to make NOCB handle various corner cases will
offset that expected shrinkage, though hopefully not by too much.

Of course, I cannot resist taking your call for RCU simplicity as a vote
against Peter's proposal for aligning the rcu_node tree to the hardware's
electrical structure. ;-)

> No more new CONFIGs please either.

Since 3.10, I have gotten rid of CONFIG_RCU_CPU_STALL_TIMEOUT.

Over time, it might be possible to make CONFIG_RCU_FAST_NO_HZ the default,
and thus eliminate that Kconfig parameter. As noted above, ditto for
CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
and CONFIG_RCU_NOCB_CPU_ALL. It also might be reasonable to replace
uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
CONFIG_PROVE_RCU to be eliminated. CONFIG_PROVE_RCU_DELAY hasn't proven
very good at finding bugs, so I am considering eliminating it as well.
Given recent and planned changes related to RCU's stall-warning stack
dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
(And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
for some time beforehand.) I have also been considering getting rid of
CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.

That should make room for additional RCU Kconfig parameters as needed
for specialized or high-risk new functionality.

Thanx, Paul

2014-06-23 16:45:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 08:51:08AM -0500, Christoph Lameter wrote:
> On Mon, 23 Jun 2014, Peter Zijlstra wrote:
>
> > On the topic of these threads; I recently noticed RCU grew a metric ton
> > of them, I found some 75 rcu kthreads on my box, wth up with that?
>
> Would kworker threads work for RCU? That would also avoid the shifting
> around of RCU threads for NOHZ configurations (which seems to have to be
> done manually right now). The kworker subsystem work that allows
> restricting kworkers to non-NOHZ hardware threads would then also handle
> the shifting of the RCU threads, which would simplify the whole endeavor.

Short term, I am planning to use a different method to automate the
binding of the rcuo kthreads to housekeeping CPUs, but longer term,
it might well make a lot of sense to move to workqueues and the kworker
threads.

Thanx, Paul

2014-06-23 16:55:26

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

This still has a regression. Commit 1ed70de (from Paul's git tree),
gets a result of 52231880. If I back up two commits to v3.16-rc1 and
revert ac1bea85 (the original culprit) the result goes back up to 57308512.

So something is still going on here.

I'll go back and compare the grace period ages to see if I can tell what
is going on.

--

root@bigbox:~/will-it-scale# ./open1_processes -t 160 -s30
testcase:Separate file open/close
warmup
min:319787 max:386071 total:57464989
min:307235 max:351905 total:53289241
min:291765 max:342439 total:51364514
min:297948 max:349214 total:52552745
min:294950 max:340132 total:51586179
min:290791 max:339958 total:50793238
measurement
min:298851 max:346868 total:51951469
min:292879 max:340704 total:50817269
min:305768 max:347381 total:52655149
min:301046 max:345616 total:52449584
min:300428 max:345293 total:52021166
min:293404 max:337973 total:51012206
min:303569 max:348191 total:52713179
min:305523 max:357448 total:53707053
min:307040 max:356937 total:53271883
min:302134 max:347923 total:52477496
min:297823 max:340488 total:51884417
min:286981 max:338246 total:50496850
min:295920 max:349405 total:51792563
min:302749 max:343780 total:52305074
min:298497 max:345208 total:52035318
min:291393 max:332195 total:50163093
min:303561 max:353396 total:52983515
min:301613 max:352988 total:53029200
min:300693 max:343726 total:52057334
min:296801 max:352408 total:52028824
min:304834 max:358236 total:53526191
min:297933 max:338351 total:51578481
min:299571 max:341679 total:51817941
min:308225 max:354075 total:53760098
min:296262 max:346965 total:51856596
min:309196 max:356432 total:53455141
min:295604 max:341814 total:51449366
min:296931 max:345961 total:52051944
min:300533 max:350304 total:52652951
min:299887 max:350764 total:52955064
average:52231880
root@bigbox:~/will-it-scale# uname -a
Linux bigbox 3.16.0-rc1-00002-g1ed70de #176 SMP Mon Jun 23 09:04:02 PDT
2014 x86_64 x86_64 x86_64 GNU/Linux


root@bigbox:~/will-it-scale# ./open1_processes -t 160 -s 30
testcase:Separate file open/close
warmup
min:346853 max:416035 total:62412724
min:281766 max:344178 total:52207349
min:311187 max:374918 total:57149451
min:326309 max:391061 total:60200366
min:310327 max:375619 total:56744357
min:323336 max:393415 total:59619164
measurement
min:323934 max:393718 total:59665843
min:307247 max:368313 total:55681436
min:318210 max:378048 total:57849321
min:314494 max:383884 total:57741073
min:316497 max:385223 total:58565045
min:320490 max:397636 total:59003133
min:318695 max:391712 total:57789360
min:304368 max:378540 total:56412216
min:314609 max:384462 total:58298008
min:317235 max:384205 total:58812490
min:323556 max:388014 total:59468492
min:301011 max:362664 total:55381779
min:301113 max:364712 total:55375445
min:311730 max:369336 total:56640530
min:316951 max:381341 total:58649244
min:317077 max:383943 total:58132878
min:316970 max:390127 total:59039489
min:315895 max:375937 total:57404755
min:295500 max:346523 total:53086962
min:310882 max:371923 total:56612144
min:321837 max:390544 total:59651640
min:303481 max:368716 total:56135908
min:306437 max:367658 total:56388659
min:307343 max:373645 total:56893136
min:298703 max:358090 total:54152268
min:319162 max:386583 total:58999429
min:304881 max:361968 total:55286607
min:311034 max:381100 total:57846182
min:312786 max:378270 total:57964383
min:311740 max:367481 total:56327526
average:57308512
root@bigbox:~/will-it-scale# uname -a
Linux bigbox 3.16.0-rc1-dirty #177 SMP Mon Jun 23 09:13:59 PDT 2014
x86_64 x86_64 x86_64 GNU/Linux

2014-06-23 17:16:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 09:55:21AM -0700, Dave Hansen wrote:
> This still has a regression. Commit 1ed70de (from Paul's git tree),
> gets a result of 52231880. If I back up two commits to v3.16-rc1 and
> revert ac1bea85 (the original culprit) the result goes back up to 57308512.
>
> So something is still going on here.

And commit 1ed70de is in fact the right one, so...

The rcutree.jiffies_till_sched_qs boot/sysfs parameter controls how
long RCU waits before asking for quiescent states. The default is
currently HZ/20. Does increasing this parameter help? Easy for me to
increase the default if it does.
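
To make the timing rule concrete, here is a minimal userspace sketch of
the behavior described in the patch (names and the HZ and stall-timeout
values below are assumptions for illustration, not necessarily your
configuration): help is solicited once the grace period's age reaches
jiffies_till_sched_qs or half the stall-warning timeout, whichever
comes first.

#include <stdio.h>

#define HZ 1000					/* assumed tick rate */

static unsigned long jiffies_till_sched_qs = HZ / 20;	/* current default */
static unsigned long stall_timeout = 21 * HZ;		/* assumed stall limit */

/* Nonzero once RCU would start soliciting quiescent-state help. */
static int past_sched_qs_threshold(unsigned long gp_age)
{
	unsigned long limit = jiffies_till_sched_qs;

	if (stall_timeout / 2 < limit)
		limit = stall_timeout / 2;
	return gp_age >= limit;
}

int main(void)
{
	printf("age 10: %d, age 100: %d\n",
	       past_sched_qs_threshold(10), past_sched_qs_threshold(100));
	return 0;
}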

> I'll go back and compare the grace period ages to see if I can tell what
> is going on.

That would be very helpful, thank you!

Thanx, Paul

2014-06-23 17:17:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On 06/23/2014 09:55 AM, Dave Hansen wrote:
> This still has a regression. Commit 1ed70de (from Paul's git tree),
> gets a result of 52231880. If I back up two commits to v3.16-rc1 and
> revert ac1bea85 (the original culprit) the result goes back up to 57308512.
>
> So something is still going on here.
>
> I'll go back and compare the grace period ages to see if I can tell what
> is going on.

RCU_TRACE interferes with the benchmark a little bit, and it lowers the
delta that the regression causes. So, evaluate this cautiously.

According to rcu_sched/rcugp, the average "age" is:

v3.16-rc1, with ac1bea85 reverted: 10.7
v3.16-rc1, plus e552592e: 6.1

Paul, have you been keeping an eye on rcugp? Even if I run my system
with only 10 threads, I still see this basic pattern where the average
"age" is lower when I see lower performance. It seems to be a
reasonable proxy that could be used instead of waiting on me to re-run
tests.
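
As a toy model only (the fixed per-grace-period cost and update rate
below are invented for illustration), the correlation described above
falls out of simple arithmetic: grace-period overhead is roughly fixed
per grace period, so a lower average age means more grace periods per
second, and thus more overhead charged to each update.

#include <stdio.h>

int main(void)
{
	double hz = 1000.0;		/* assumed tick rate */
	double gp_cost_us = 50.0;	/* invented fixed cost per grace period */
	double updates_per_sec = 1e6;	/* invented update rate */
	double ages[] = { 10.7, 6.1 };	/* average GP ages from rcugp above */
	int i;

	for (i = 0; i < 2; i++) {
		double gps_per_sec = hz / ages[i];
		printf("age %.1f jiffies -> %.4f us of GP overhead per update\n",
		       ages[i], gp_cost_us * gps_per_sec / updates_per_sec);
	}
	return 0;
}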

2014-06-23 17:19:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

> In 3.10, RCU had 14,046 lines of code, not counting documentation and
> test scripting. In 3.15, RCU had 13,208 lines of code, again not counting
> documentation and test scripting. That is a decrease of almost 1KLoC,
> so your wish is granted.

Ok that's good progress.

> CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
> and CONFIG_RCU_NOCB_CPU_ALL. It also might be reasonable to replace
> uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
> CONFIG_PROVE_RCU to be eliminated. CONFIG_PROVE_RCU_DELAY hasn't proven
> very good at finding bugs, so I am considering eliminating it as well.
> Given recent and planned changes related to RCU's stall-warning stack
> dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
> CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
> (And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
> for some time beforehand.) I have also been considering getting rid of
> CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.

Yes please to all.

Sounds good, thanks.

-Andi

2014-06-23 17:34:29

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On 06/23/2014 10:16 AM, Paul E. McKenney wrote:
> On Mon, Jun 23, 2014 at 09:55:21AM -0700, Dave Hansen wrote:
>> This still has a regression. Commit 1ed70de (from Paul's git tree),
>> gets a result of 52231880. If I back up two commits to v3.16-rc1 and
>> revert ac1bea85 (the original culprit) the result goes back up to 57308512.
>>
>> So something is still going on here.
>
> And commit 1ed70de is in fact the right one, so...
>
> The rcutree.jiffies_till_sched_qs boot/sysfs parameter controls how
> long RCU waits before asking for quiescent states. The default is
> currently HZ/20. Does increasing this parameter help? Easy for me to
> increase the default if it does.

Making it an insane value:

echo 12 > /sys/module/rcutree/parameters/jiffies_till_sched_qs
average:52248706
echo 9999999 > /sys/module/rcutree/parameters/jiffies_till_sched_qs
average:55712533

gets us back up _closer_ to our original 57M number, but it's still not
quite there.

2014-06-23 17:42:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 10:19:05AM -0700, Andi Kleen wrote:
> > In 3.10, RCU had 14,046 lines of code, not counting documentation and
> > test scripting. In 3.15, RCU had 13,208 lines of code, again not counting
> > documentation and test scripting. That is a decrease of almost 1KLoC,
> > so your wish is granted.
>
> Ok that's good progress.

Glad you like it!

> > CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO,
> > and CONFIG_RCU_NOCB_CPU_ALL. It also might be reasonable to replace
> > uses of CONFIG_PROVE_RCU with CONFIG_PROVE_LOCKING, thus allowing
> > CONFIG_PROVE_RCU to be eliminated. CONFIG_PROVE_RCU_DELAY hasn't proven
> > very good at finding bugs, so I am considering eliminating it as well.
> > Given recent and planned changes related to RCU's stall-warning stack
> > dumping, I hope to eliminate both CONFIG_RCU_CPU_STALL_VERBOSE and
> > CONFIG_RCU_CPU_STALL_INFO, making them both happen unconditionally.
> > (And yes, I should probably make CONFIG_RCU_CPU_STALL_INFO be the default
> > for some time beforehand.) I have also been considering getting rid of
> > CONFIG_RCU_FANOUT_EXACT, given that it appears that no one uses it.
>
> Yes please to all.
>
> Sounds good thanks.

Very good!

Please note that this will take some time. For example, getting rid
of CONFIG_RCU_CPU_STALL_TIMEOUT resulted in a series of bugs over a
period of well over a year. It turned out that very few people were
exercising it while it was non-default. Hopefully, Fengguang Wu's
RANDCONFIG testing is testing these things more these days.

Also, some of them might have non-obvious effects on performance,
witness the cond_resched() fun and excitement.

Thanx, Paul

2014-06-23 18:09:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 10:17:19AM -0700, Dave Hansen wrote:
> On 06/23/2014 09:55 AM, Dave Hansen wrote:
> > This still has a regression. Commit 1ed70de (from Paul's git tree),
> > gets a result of 52231880. If I back up two commits to v3.16-rc1 and
> > revert ac1bea85 (the original culprit) the result goes back up to 57308512.
> >
> > So something is still going on here.
> >
> > I'll go back and compare the grace period ages to see if I can tell what
> > is going on.
>
> RCU_TRACE interferes with the benchmark a little bit, and it lowers the
> delta that the regression causes. So, evaluate this cautiously.

RCU_TRACE does increase overhead somewhat, so I would expect somewhat
less difference with it enabled. Though I am a bit surprised that the
overhead of its counters is measurable. Or is something else going on?

> According to rcu_sched/rcugp, the average "age" is:
>
> v3.16-rc1, with ac1bea85 reverted: 10.7
> v3.16-rc1, plus e552592e: 6.1
>
> Paul, have you been keeping an eye on rcugp? Even if I run my system
> with only 10 threads, I still see this basic pattern where the average
> "age" is lower when I see lower performance. It seems to be a
> reasonable proxy that could be used instead of waiting on me to re-run
> tests.

I do print out GPs/sec when running rcutorture, and they do vary somewhat,
but mostly with different Kconfig parameter settings. Plus rcutorture
ramps up and down, so the GPs/sec is less than what you might see in a
system running an unvarying workload. That said, increasing grace-period
latency is not always good for performance; in fact, I usually get beaten
up for grace periods completing too quickly rather than too slowly.
This current issue is one of the rare exceptions, perhaps even the
only exception.

So let's see... The open1 benchmark sits in a loop doing open()
and close(), and probably spends most of its time in the kernel.
It doesn't do much context switching. I am guessing that you don't
have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
much effect because then the first quiescent-state-forcing attempt would
likely finish the grace period.

So, given that short grace periods help other workloads (I have the
scars to prove it), and given that the patch fixes some real problems,
and given that the large number for rcutree.jiffies_till_sched_qs got
us within 3%, shouldn't we consider this issue closed?

Thanx, Paul

2014-06-23 23:30:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On 06/23/2014 11:09 AM, Paul E. McKenney wrote:
> So let's see... The open1 benchmark sits in a loop doing open()
> and close(), and probably spends most of its time in the kernel.
> It doesn't do much context switching. I am guessing that you don't
> have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
> much effect because then the first quiescent-state-forcing attempt would
> likely finish the grace period.
>
> So, given that short grace periods help other workloads (I have the
> scars to prove it), and given that the patch fixes some real problems,

I'm not arguing that short grace periods _can_ help some workloads, or
that one is better than the other. The patch in question changes
existing behavior by shortening grace periods. This change of existing
behavior removes some of the benefits that my system gets out of RCU. I
suspect this affects a lot more systems, but my core count makes it
easier to see.

Perhaps I'm misunderstanding the original patch's intent, but it seemed
to me to be working around an overactive debug message. While often a
_useful_ debug message, it was firing falsely in the case being
addressed in the patch.

> and given that the large number for rcutree.jiffies_till_sched_qs got
> us within 3%, shouldn't we consider this issue closed?

With the default value for the tunable, the regression is still solidly
over 10%. I think we can have a reasonable argument about it once the
default delta is down to the small single digits.

One more thing I just realized: this isn't a scalability problem, at
least with rcutree.jiffies_till_sched_qs=12. There's a pretty
consistent delta in throughput throughout the entire range of threads
from 1->160. See the "processes" column in the data files:

plain 3.15:
> https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.15/open1.csv
e552592e0383bc:
> https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.16.0-rc1-pf2/open1.csv

or visually:

> https://www.sr71.net/~dave/intel/array-join.html?1=willitscale/systems/bigbox/3.15&2=willitscale/systems/bigbox/3.16.0-rc1-pf2&hide=linear,threads_idle,processes_idle

2014-06-24 00:15:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 04:30:12PM -0700, Dave Hansen wrote:
> On 06/23/2014 11:09 AM, Paul E. McKenney wrote:
> > So let's see... The open1 benchmark sits in a loop doing open()
> > and close(), and probably spends most of its time in the kernel.
> > It doesn't do much context switching. I am guessing that you don't
> > have CONFIG_NO_HZ_FULL=y, or the boot/sysfs parameter would not have
> > much effect because then the first quiescent-state-forcing attempt would
> > likely finish the grace period.
> >
> > So, given that short grace periods help other workloads (I have the
> > scars to prove it), and given that the patch fixes some real problems,
>
> I'm not arguing that short grace periods _can_ help some workloads, or
> that one is better than the other. The patch in question changes
> existing behavior by shortening grace periods. This change of existing
> behavior removes some of the benefits that my system gets out of RCU. I
> suspect this affects a lot more systems, but my core count makes it
> easier to see.

And it adds some benefits for other systems. Your tight loop on open() and
close() will be sensitive to some things, and tight loops on other syscalls
will be sensitive to others.

> Perhaps I'm misunderstanding the original patch's intent, but it seemed
> to me to be working around an overactive debug message. While often a
> _useful_ debug message, it was firing falsely in the case being
> addressed in the patch.

You are indeed misunderstanding the original patch's intent. It was
preventing OOMs. The "overactive debug message" is just a warning that
OOMs are possible.

> > and given that the large number for rcutree.jiffies_till_sched_qs got
> > us within 3%, shouldn't we consider this issue closed?
>
> With the default value for the tunable, the regression is still solidly
> over 10%. I think we can have a reasonable argument about it once the
> default delta is down to the small single digits.

Look, you are to be congratulated for identifying a micro-benchmark that
exposes such small changes in timing, but I am not at all interested
in that micro-benchmark becoming the kernel's straightjacket. If you
have real workloads for which this micro-benchmark is a good predictor
of performance, we can talk about quite a few additional steps to take
to tune for those workloads.

> One more thing I just realized: this isn't a scalability problem, at
> least with rcutree.jiffies_till_sched_qs=12. There's a pretty
> consistent delta in throughput throughout the entire range of threads
> from 1->160. See the "processes" column in the data files:
>
> plain 3.15:
> > https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.15/open1.csv
> e552592e0383bc:
> > https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.16.0-rc1-pf2/open1.csv
>
> or visually:
>
> > https://www.sr71.net/~dave/intel/array-join.html?1=willitscale/systems/bigbox/3.15&2=willitscale/systems/bigbox/3.16.0-rc1-pf2&hide=linear,threads_idle,processes_idle

Just out of curiosity, how many CPUs does your system have? 80?
If 160, looks like something bad is happening at 80.

Thanx, Paul

2014-06-24 00:20:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
> Just out of curiosity, how many CPUs does your system have? 80?
> If 160, looks like something bad is happening at 80.

80 cores, 160 threads. >80 processes/threads is where we start using
the second thread on the cores. The tasks are also pinned to
hyperthread pairs, so they disturb each other, and the scheduler moves
them between threads on occasion which causes extra noise.

2014-06-24 00:39:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
> On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
> > Just out of curiosity, how many CPUs does your system have? 80?
> > If 160, looks like something bad is happening at 80.
>
> 80 cores, 160 threads. >80 processes/threads is where we start using
> the second thread on the cores. The tasks are also pinned to
> hyperthread pairs, so they disturb each other, and the scheduler moves
> them between threads on occasion which causes extra noise.

OK, that could explain the near flattening of throughput near 80
processes. Is 3.16.0-rc1-pf2 with the two RCU patches? If so, is the
new sysfs parameter at its default value?

Thanx, Paul

2014-06-24 16:19:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On 06/23/2014 05:39 PM, Paul E. McKenney wrote:
> On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
>> On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
>>> Just out of curiosity, how many CPUs does your system have? 80?
>>> If 160, looks like something bad is happening at 80.
>>
>> 80 cores, 160 threads. >80 processes/threads is where we start using
>> the second thread on the cores. The tasks are also pinned to
>> hyperthread pairs, so they disturb each other, and the scheduler moves
>> them between threads on occasion which causes extra noise.
>
> OK, that could explain the near flattening of throughput near 80
> processes. Is 3.16.0-rc1-pf2 with the two RCU patches?

It's actually with _just_ e552592e03 applied on top of 3.16-rc1.

> If so, is the new sysfs parameter at its default value?

I didn't record that, and I've forgotten. I'll re-run it to verify what
it was.

2014-06-24 20:43:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On 06/23/2014 05:39 PM, Paul E. McKenney wrote:
> On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
>> On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
>>> Just out of curiosity, how many CPUs does your system have? 80?
>>> If 160, looks like something bad is happening at 80.
>>
>> 80 cores, 160 threads. >80 processes/threads is where we start using
>> the second thread on the cores. The tasks are also pinned to
>> hyperthread pairs, so they disturb each other, and the scheduler moves
>> them between threads on occasion which causes extra noise.
>
> OK, that could explain the near flattening of throughput near 80
> processes. Is 3.16.0-rc1-pf2 with the two RCU patches? If so, is the
> new sysfs parameter at its default value?

Here's 3.16-rc1 with e552592e applied and jiffies_till_sched_qs=12 vs. 3.15:

> https://www.sr71.net/~dave/intel/bb.html?2=3.16.0-rc1-paultry2-jtsq12&1=3.15

3.16-rc1 is actually in the lead up until the end when we're filling up
the hyperthreads. The same pattern holds when comparing
3.16-rc1+e552592e to 3.16-rc1 with ac1bea8 reverted:

> https://www.sr71.net/~dave/intel/bb.html?2=3.16.0-rc1-paultry2-jtsq12&1=3.16.0-rc1-wrevert

So, the current situation is generally _better_ than 3.15, except during
the noisy ranges of the test where hyperthreading and the scheduler are
coming into play. I made the mistake of doing all my spot-checks at
the 160-thread number, which honestly wasn't the best point to be
looking at.

At this point, I'm satisfied with how e552592e is dealing with the
original regression. Thanks for all the prompt attention on this one, Paul.

2014-06-24 21:15:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu] Reduce overhead of cond_resched() checks for RCU

On Tue, Jun 24, 2014 at 01:43:16PM -0700, Dave Hansen wrote:
> On 06/23/2014 05:39 PM, Paul E. McKenney wrote:
> > On Mon, Jun 23, 2014 at 05:20:30PM -0700, Dave Hansen wrote:
> >> On 06/23/2014 05:15 PM, Paul E. McKenney wrote:
> >>> Just out of curiosity, how many CPUs does your system have? 80?
> >>> If 160, looks like something bad is happening at 80.
> >>
> >> 80 cores, 160 threads. >80 processes/threads is where we start using
> >> the second thread on the cores. The tasks are also pinned to
> >> hyperthread pairs, so they disturb each other, and the scheduler moves
> >> them between threads on occasion which causes extra noise.
> >
> > OK, that could explain the near flattening of throughput near 80
> > processes. Is 3.16.0-rc1-pf2 with the two RCU patches? If so, is the
> > new sysfs parameter at its default value?
>
> Here's 3.16-rc1 with e552592e applied and jiffies_till_sched_qs=12 vs. 3.15:
>
> > https://www.sr71.net/~dave/intel/bb.html?2=3.16.0-rc1-paultry2-jtsq12&1=3.15
>
> 3.16-rc1 is actually in the lead up until the end when we're filling up
> the hyperthreads. The same pattern holds when comparing
> 3.16-rc1+e552592e to 3.16-rc1 with ac1bea8 reverted:
>
> > https://www.sr71.net/~dave/intel/bb.html?2=3.16.0-rc1-paultry2-jtsq12&1=3.16.0-rc1-wrevert
>
> So, the current situation is generally _better_ than 3.15, except during
> the noisy ranges of the test where hyperthreading and the scheduler are
> coming into play.

Good to know that my intuition is not yet completely broken. ;-)

> I made the mistake of doing all my spot-checks at
> the 160-thread number, which honestly wasn't the best point to be
> looking at.

That would do it! ;-)

> At this point, I'm satisfied with how e552592e is dealing with the
> original regression. Thanks for all the prompt attention on this one, Paul.

Glad it worked out. I have sent a pull request to Ingo to hopefully
get this into 3.16.

Thanx, Paul