Hello!
This series contains some highly experimental patches that allow normal
grace periods to take advantage of the work done by concurrent expedited
grace periods. This can reduce the overhead incurred by normal grace
periods by eliminating the need for force-quiescent-state scans that
would otherwise have happened after the expedited grace period completed.
It is not clear whether this is a useful tradeoff. Nevertheless, this
series contains the following patches:
1. Pure code-movement commit that allows the RCU grace-period kthread
to access the expedited grace period's state.
2. Use snapshot technique to allow a normal grace period to complete
after an expedited grace period has completed.
3. Changes to rcutorture to fully test normal grace periods. Current
tests would fail to notice some types of forward-progress bugs due
to the presence of expedited grace periods in the common case.
4. Make expedited grace periods awaken any concurrent normal grace
periods.
5. Throttle the awakening to avoid excessive CPU consumption by the
normal grace-period kthread.
A separate series will cover changes to the expedited grace-period
machinery itself.
Thanx, Paul
------------------------------------------------------------------------
b/kernel/rcu/rcutorture.c | 22 +++++-
b/kernel/rcu/tree.c | 165 +++++++++++++++++++++++++++-------------------
b/kernel/rcu/tree.h | 6 +
3 files changed, 127 insertions(+), 66 deletions(-)
From: "Paul E. McKenney" <[email protected]>
This is a pure code-movement commit in preparation for changes that
allow short-circuiting of normal grace-period processing due to
concurrent execution of an expedited grace period.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 108 +++++++++++++++++++++++++++---------------------------
1 file changed, 54 insertions(+), 54 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 041968e3b7ce..bea9b9c80d91 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1393,6 +1393,60 @@ void rcu_cpu_stall_reset(void)
WRITE_ONCE(rsp->jiffies_stall, jiffies + ULONG_MAX / 2);
}
+/* Adjust sequence number for start of update-side operation. */
+static void rcu_seq_start(unsigned long *sp)
+{
+ WRITE_ONCE(*sp, *sp + 1);
+ smp_mb(); /* Ensure update-side operation after counter increment. */
+ WARN_ON_ONCE(!(*sp & 0x1));
+}
+
+/* Adjust sequence number for end of update-side operation. */
+static void rcu_seq_end(unsigned long *sp)
+{
+ smp_mb(); /* Ensure update-side operation before counter increment. */
+ WRITE_ONCE(*sp, *sp + 1);
+ WARN_ON_ONCE(*sp & 0x1);
+}
+
+/* Take a snapshot of the update side's sequence number. */
+static unsigned long rcu_seq_snap(unsigned long *sp)
+{
+ unsigned long s;
+
+ smp_mb(); /* Caller's modifications seen first by other CPUs. */
+ s = (READ_ONCE(*sp) + 3) & ~0x1;
+ smp_mb(); /* Above access must not bleed into critical section. */
+ return s;
+}
+
+/*
+ * Given a snapshot from rcu_seq_snap(), determine whether or not a
+ * full update-side operation has occurred.
+ */
+static bool rcu_seq_done(unsigned long *sp, unsigned long s)
+{
+ return ULONG_CMP_GE(READ_ONCE(*sp), s);
+}
+
+/* Wrapper functions for expedited grace periods. */
+static void rcu_exp_gp_seq_start(struct rcu_state *rsp)
+{
+ rcu_seq_start(&rsp->expedited_sequence);
+}
+static void rcu_exp_gp_seq_end(struct rcu_state *rsp)
+{
+ rcu_seq_end(&rsp->expedited_sequence);
+}
+static unsigned long rcu_exp_gp_seq_snap(struct rcu_state *rsp)
+{
+ return rcu_seq_snap(&rsp->expedited_sequence);
+}
+static bool rcu_exp_gp_seq_done(struct rcu_state *rsp, unsigned long s)
+{
+ return rcu_seq_done(&rsp->expedited_sequence, s);
+}
+
/*
* Initialize the specified rcu_data structure's default callback list
* to empty. The default callback list is the one that is not used by
@@ -3307,60 +3361,6 @@ void cond_synchronize_sched(unsigned long oldstate)
}
EXPORT_SYMBOL_GPL(cond_synchronize_sched);
-/* Adjust sequence number for start of update-side operation. */
-static void rcu_seq_start(unsigned long *sp)
-{
- WRITE_ONCE(*sp, *sp + 1);
- smp_mb(); /* Ensure update-side operation after counter increment. */
- WARN_ON_ONCE(!(*sp & 0x1));
-}
-
-/* Adjust sequence number for end of update-side operation. */
-static void rcu_seq_end(unsigned long *sp)
-{
- smp_mb(); /* Ensure update-side operation before counter increment. */
- WRITE_ONCE(*sp, *sp + 1);
- WARN_ON_ONCE(*sp & 0x1);
-}
-
-/* Take a snapshot of the update side's sequence number. */
-static unsigned long rcu_seq_snap(unsigned long *sp)
-{
- unsigned long s;
-
- smp_mb(); /* Caller's modifications seen first by other CPUs. */
- s = (READ_ONCE(*sp) + 3) & ~0x1;
- smp_mb(); /* Above access must not bleed into critical section. */
- return s;
-}
-
-/*
- * Given a snapshot from rcu_seq_snap(), determine whether or not a
- * full update-side operation has occurred.
- */
-static bool rcu_seq_done(unsigned long *sp, unsigned long s)
-{
- return ULONG_CMP_GE(READ_ONCE(*sp), s);
-}
-
-/* Wrapper functions for expedited grace periods. */
-static void rcu_exp_gp_seq_start(struct rcu_state *rsp)
-{
- rcu_seq_start(&rsp->expedited_sequence);
-}
-static void rcu_exp_gp_seq_end(struct rcu_state *rsp)
-{
- rcu_seq_end(&rsp->expedited_sequence);
-}
-static unsigned long rcu_exp_gp_seq_snap(struct rcu_state *rsp)
-{
- return rcu_seq_snap(&rsp->expedited_sequence);
-}
-static bool rcu_exp_gp_seq_done(struct rcu_state *rsp, unsigned long s)
-{
- return rcu_seq_done(&rsp->expedited_sequence, s);
-}
-
/* Common code for synchronize_{rcu,sched}_expedited() work-done checking. */
static bool sync_exp_work_done(struct rcu_state *rsp, struct rcu_node *rnp,
atomic_long_t *stat, unsigned long s)
--
1.8.1.5
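
As a rough illustration of the even/odd sequence-counter protocol that the
rcu_seq_*() helpers above implement, here is a minimal user-space analogue
using C11 atomics in place of the kernel's WRITE_ONCE()/smp_mb(); the names
and the demo in main() are illustrative only, not kernel code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Minimal user-space analogue of rcu_seq_start/end/snap/done.
 * An odd counter value means an update is in flight; each complete
 * update advances the counter by two.
 */
static atomic_ulong seq;

static void seq_start(void) { atomic_fetch_add(&seq, 1); } /* counter becomes odd */
static void seq_end(void)   { atomic_fetch_add(&seq, 1); } /* counter becomes even */

/*
 * Earliest counter value at which a full update beginning after this
 * call is guaranteed to have completed: step past any update already
 * in flight, then clear the low bit.
 */
static unsigned long seq_snap(void)
{
	return (atomic_load(&seq) + 3) & ~0x1UL;
}

static bool seq_done(unsigned long snap)
{
	return atomic_load(&seq) >= snap; /* the kernel uses ULONG_CMP_GE() for wrap */
}

int main(void)
{
	unsigned long snap;

	/* No update in flight at snapshot time: one full update suffices. */
	snap = seq_snap();			/* counter 0 -> snap 2 */
	seq_start(); seq_end();			/* counter 2 */
	printf("after one update:  done=%d\n", seq_done(snap));	/* 1 */

	/* Update in flight at snapshot time: it does not count. */
	seq_start();				/* counter 3 (odd) */
	snap = seq_snap();			/* (3 + 3) & ~1 = 6 */
	seq_end();				/* counter 4 */
	printf("in-flight only:    done=%d\n", seq_done(snap));	/* 0 */
	seq_start(); seq_end();			/* counter 6 */
	printf("plus full update:  done=%d\n", seq_done(snap));	/* 1 */
	return 0;
}
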
From: "Paul E. McKenney" <[email protected]>
If an expedited grace period executes after the start of a normal
grace period, there is no point in the normal grace period doing any
more work, as the expedited grace period has supplied all the needed
quiescent states. This commit therefore makes the normal grace-period
initialization process take a snapshot of the expedited grace-period
state, and then recheck the state just before each force-quiescent-state
scan. If the recheck determines that a full expedited grace period
executed since the beginning of the normal grace period, the grace-period
kthread proceeds immediately to grace-period cleanup.
Because the expedited grace period does not awaken the grace-period
kthread, this change should provide only a minimal reduction in
grace-period latency; however, it does reduce the overhead of detecting
the end of the grace period.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 40 ++++++++++++++++++++++++++++++++--------
kernel/rcu/tree.h | 6 ++++++
2 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bea9b9c80d91..5365f6332a60 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -104,6 +104,7 @@ struct rcu_state sname##_state = { \
.orphan_nxttail = &sname##_state.orphan_nxtlist, \
.orphan_donetail = &sname##_state.orphan_donelist, \
.barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \
+ .exp_rsp = &sname##_state, \
.name = RCU_STATE_NAME(sname), \
.abbr = sabbr, \
}
@@ -1954,6 +1955,15 @@ static int rcu_gp_init(struct rcu_state *rsp)
WRITE_ONCE(rsp->gp_activity, jiffies);
}
+ /*
+ * Record associated expedited-grace-period snapshot. This cannot
+ * be done before the root rcu_node structure has been initialized
+ * due to the fact that callbacks can still be registered for the
+ * current grace period until that initialization is complete.
+ */
+ rsp->gp_exp_snap = rcu_exp_gp_seq_snap(rsp->exp_rsp);
+ rsp->gp_exp_help = false;
+
return 1;
}
@@ -2035,8 +2045,14 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
rcu_for_each_node_breadth_first(rsp, rnp) {
raw_spin_lock_irq(&rnp->lock);
smp_mb__after_unlock_lock();
- WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
- WARN_ON_ONCE(rnp->qsmask);
+ if (!rsp->gp_exp_help) {
+ WARN_ON_ONCE(rnp->qsmask);
+ WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
+ } else {
+ /* Clear out state from truncated grace period. */
+ rnp->qsmask = 0;
+ rnp->gp_tasks = NULL;
+ }
WRITE_ONCE(rnp->completed, rsp->gpnum);
rdp = this_cpu_ptr(rsp->rda);
if (rnp == rdp->mynode)
@@ -2121,17 +2137,24 @@ static int __noreturn rcu_gp_kthread(void *arg)
TPS("fqswait"));
rsp->gp_state = RCU_GP_WAIT_FQS;
ret = wait_event_interruptible_timeout(rsp->gp_wq,
- ((gf = READ_ONCE(rsp->gp_flags)) &
- RCU_GP_FLAG_FQS) ||
- (!READ_ONCE(rnp->qsmask) &&
- !rcu_preempt_blocked_readers_cgp(rnp)),
- j);
+ ((gf = READ_ONCE(rsp->gp_flags)) &
+ RCU_GP_FLAG_FQS) ||
+ (!READ_ONCE(rnp->qsmask) &&
+ !rcu_preempt_blocked_readers_cgp(rnp)) ||
+ rcu_exp_gp_seq_done(rsp->exp_rsp,
+ rsp->gp_exp_snap),
+ j);
rsp->gp_state = RCU_GP_DONE_FQS;
/* Locking provides needed memory barriers. */
/* If grace period done, leave loop. */
if (!READ_ONCE(rnp->qsmask) &&
- !rcu_preempt_blocked_readers_cgp(rnp))
+ !rcu_preempt_blocked_readers_cgp(rnp)) {
break;
+ } else if (rcu_exp_gp_seq_done(rsp->exp_rsp,
+ rsp->gp_exp_snap)) {
+ rsp->gp_exp_help = true;
+ break;
+ }
/* If time for quiescent-state forcing, do it. */
if (ULONG_CMP_GE(jiffies, rsp->jiffies_force_qs) ||
(gf & RCU_GP_FLAG_FQS)) {
@@ -4190,6 +4213,7 @@ void __init rcu_init(void)
rcu_init_geometry();
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
+ rcu_bh_state.exp_rsp = &rcu_sched_state;
if (dump_tree)
rcu_dump_rcu_node_tree(&rcu_sched_state);
__rcu_init_preempt();
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index e40b65d45495..eed6e84cb182 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -466,6 +466,10 @@ struct rcu_state {
wait_queue_head_t gp_wq; /* Where GP task waits. */
short gp_flags; /* Commands for GP task. */
short gp_state; /* GP kthread sleep state. */
+ bool gp_exp_help; /* Expedited GP helped the */
+ /* just-completed normal GP. */
+ unsigned long gp_exp_snap; /* Expedited snapshot for */
+ /* FQS short-circuiting. */
/* End of fields guarded by root rcu_node's lock. */
@@ -495,6 +499,8 @@ struct rcu_state {
atomic_long_t expedited_normal; /* # fallbacks to normal. */
atomic_t expedited_need_qs; /* # CPUs left to check in. */
wait_queue_head_t expedited_wq; /* Wait for check-ins. */
+ struct rcu_state *exp_rsp; /* RCU flavor that expedites */
+ /* for this flavor. */
unsigned long jiffies_force_qs; /* Time at which to invoke */
/* force_quiescent_state(). */
--
1.8.1.5
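
To make the control flow of this patch concrete, here is a small
single-threaded sketch of the short-circuit: the normal grace period
snapshots the expedited sequence at initialization and, before each
would-be force-quiescent-state scan, checks whether a full expedited grace
period has completed since. The counters and names below are illustrative
stand-ins, not the kernel's data structures:

#include <stdbool.h>
#include <stdio.h>

static unsigned long exp_seq;	/* stands in for rsp->expedited_sequence */

static unsigned long exp_snap(void) { return (exp_seq + 3) & ~0x1UL; }
static bool exp_done(unsigned long s) { return exp_seq >= s; }

int main(void)
{
	unsigned long gp_exp_snap;
	bool gp_exp_help = false;
	int qs_outstanding = 4;		/* CPUs still owing a quiescent state */
	int fqs_round;

	gp_exp_snap = exp_snap();	/* taken at normal-GP initialization */

	for (fqs_round = 1; fqs_round <= 5; fqs_round++) {
		if (fqs_round == 2)
			exp_seq += 2;	/* a full expedited GP completes here */

		if (qs_outstanding == 0)
			break;		/* normal completion path */
		if (exp_done(gp_exp_snap)) {
			gp_exp_help = true;	/* expedited GP supplied the QSes */
			break;
		}
		printf("round %d: forcing quiescent states\n", fqs_round);
		qs_outstanding--;	/* pretend each scan finds one QS */
	}
	printf("gp_exp_help=%d\n", gp_exp_help);
	return 0;
}
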
From: "Paul E. McKenney" <[email protected]>
Currently, rcutorture runs between 25% and 50% of grace periods as
expedited grace periods. This means that on large configurations,
there are expected to be a number of expedited grace periods executing
concurrently with each normal grace period. Now that normal grace periods
can be assisted by expedited grace periods, rcutorture would be unable to
detect a bug that prevented normal grace periods from completing on their
own, without an expedited grace period helping them along.
This commit therefore alternates 30-second phases that run expedited
grace periods and 30-second phases that do not, so that this sort of
bug will provoke an RCU CPU stall warning.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcutorture.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 59aa76b4460e..c62c1c34d73c 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -54,6 +54,11 @@
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Paul E. McKenney <[email protected]> and Josh Triplett <[email protected]>");
+#ifdef CONFIG_RCU_CPU_STALL_TIMEOUT
+#define NOEXP_SECS_DEFAULT (CONFIG_RCU_CPU_STALL_TIMEOUT + 10)
+#else /* #ifdef CONFIG_RCU_CPU_STALL_TIMEOUT */
+#define NOEXP_SECS_DEFAULT 30
+#endif /* #else #ifdef CONFIG_RCU_CPU_STALL_TIMEOUT */
torture_param(int, cbflood_inter_holdoff, HZ,
"Holdoff between floods (jiffies)");
@@ -75,6 +80,8 @@ torture_param(int, irqreader, 1, "Allow RCU readers from irq handlers");
torture_param(int, n_barrier_cbs, 0,
"# of callbacks/kthreads for barrier testing");
torture_param(int, nfakewriters, 4, "Number of RCU fake writer threads");
+torture_param(int, noexp_secs, NOEXP_SECS_DEFAULT,
+ "Time to suppress/enable expedited GPs (s), 0 to disable");
torture_param(int, nreaders, -1, "Number of RCU reader threads");
torture_param(int, object_debug, 0,
"Enable debug-object double call_rcu() testing");
@@ -192,6 +199,8 @@ static u64 notrace rcu_trace_clock_local(void)
}
#endif /* #else #ifdef CONFIG_RCU_TRACE */
+static unsigned long rcutorture_noexp;
+
static unsigned long boost_starttime; /* jiffies of next boost test start. */
static DEFINE_MUTEX(boost_mutex); /* protect setting boost_starttime */
/* and boost task create/destroy. */
@@ -894,6 +903,8 @@ rcu_torture_writer(void *arg)
bool gp_cond1 = gp_cond, gp_exp1 = gp_exp, gp_normal1 = gp_normal;
bool gp_sync1 = gp_sync;
int i;
+ int noexp_jiffies = noexp_secs * HZ;
+ unsigned long noexp_jiffies_next = jiffies - 1;
struct rcu_torture *rp;
struct rcu_torture *old_rp;
static DEFINE_TORTURE_RANDOM(rand);
@@ -942,6 +953,10 @@ rcu_torture_writer(void *arg)
do {
rcu_torture_writer_state = RTWS_FIXED_DELAY;
schedule_timeout_uninterruptible(1);
+ if (time_after(jiffies, noexp_jiffies_next)) {
+ rcutorture_noexp = jiffies + noexp_jiffies;
+ noexp_jiffies_next = rcutorture_noexp + noexp_jiffies;
+ }
rp = rcu_torture_alloc();
if (rp == NULL)
continue;
@@ -966,6 +981,9 @@ rcu_torture_writer(void *arg)
cur_ops->deferred_free(old_rp);
break;
case RTWS_EXP_SYNC:
+ if (time_before(jiffies, rcutorture_noexp) &&
+ cur_ops->sync && gp_sync1)
+ goto noexp;
rcu_torture_writer_state = RTWS_EXP_SYNC;
cur_ops->exp_sync();
rcu_torture_pipe_update(old_rp);
@@ -982,6 +1000,7 @@ rcu_torture_writer(void *arg)
rcu_torture_pipe_update(old_rp);
break;
case RTWS_SYNC:
+noexp:
rcu_torture_writer_state = RTWS_SYNC;
cur_ops->sync();
rcu_torture_pipe_update(old_rp);
@@ -1036,7 +1055,8 @@ rcu_torture_fakewriter(void *arg)
torture_random(&rand) % (nfakewriters * 8) == 0) {
cur_ops->cb_barrier();
} else if (gp_normal == gp_exp) {
- if (torture_random(&rand) & 0x80)
+ if (time_before(jiffies, rcutorture_noexp) ||
+ torture_random(&rand) & 0x80)
cur_ops->sync();
else
cur_ops->exp_sync();
--
1.8.1.5
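
The phase alternation above boils down to a simple time-window check; a
minimal user-space sketch (using a plain counter in place of jiffies, with
illustrative names) might look like this:

#include <stdbool.h>
#include <stdio.h>

#define PHASE_LEN 30	/* length of each phase, in arbitrary "seconds" */

static unsigned long noexp_until;	/* expedited GPs suppressed before this time */
static unsigned long phase_next;	/* when to schedule the next no-exp phase */

/* Called once per writer-loop pass; true if expedited GPs are allowed now. */
static bool exp_allowed(unsigned long now)
{
	if (now >= phase_next) {	/* time to start a new no-exp phase */
		noexp_until = now + PHASE_LEN;
		phase_next = noexp_until + PHASE_LEN;
	}
	return now >= noexp_until;
}

int main(void)
{
	unsigned long t;

	for (t = 0; t < 120; t += 15)
		printf("t=%3lu: expedited %s\n", t,
		       exp_allowed(t) ? "allowed" : "suppressed");
	return 0;
}
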
From: "Paul E. McKenney" <[email protected]>
Let the expedited grace period confer its lower latency onto any
normal grace period that is in progress.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5365f6332a60..308b6acb4260 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3507,6 +3507,8 @@ void synchronize_sched_expedited(void)
rcu_exp_gp_seq_end(rsp);
mutex_unlock(&rnp->exp_funnel_mutex);
smp_mb(); /* ensure subsequent action seen after grace period. */
+ if (rsp->gp_kthread && rcu_gp_in_progress(rsp))
+ wake_up(&rsp->gp_wq);
put_online_cpus();
}
--
1.8.1.5
From: "Paul E. McKenney" <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 308b6acb4260..247aa1120c4c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3505,10 +3505,19 @@ void synchronize_sched_expedited(void)
!atomic_read(&rsp->expedited_need_qs));
rcu_exp_gp_seq_end(rsp);
- mutex_unlock(&rnp->exp_funnel_mutex);
smp_mb(); /* ensure subsequent action seen after grace period. */
- if (rsp->gp_kthread && rcu_gp_in_progress(rsp))
- wake_up(&rsp->gp_wq);
+ if (rsp->gp_kthread && rcu_gp_in_progress(rsp)) {
+ static unsigned long nextgp;
+ static unsigned long nextjiffy;
+
+ if (time_after_eq(jiffies, nextgp) ||
+ ULONG_CMP_GE(rsp->gpnum, nextgp)) {
+ nextgp = rsp->gpnum + 4;
+ nextjiffy = jiffies + 10;
+ wake_up(&rsp->gp_wq);
+ }
+ }
+ mutex_unlock(&rnp->exp_funnel_mutex);
put_online_cpus();
}
--
1.8.1.5
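
The intent of the throttle, per the discussion below, is to wake the
grace-period kthread at most once every ten jiffies or every fourth grace
period, whichever permits the wakeup sooner. A user-space sketch of that
intended behavior, with plain counters standing in for jiffies and ->gpnum
and illustrative names, might look like this:

#include <stdbool.h>
#include <stdio.h>

static unsigned long next_wake_time;
static unsigned long next_wake_gp;

/* Rate-limit wakeups: allow one, then hold off for 10 time units or 4 GPs. */
static bool should_wake(unsigned long now, unsigned long gpnum)
{
	if (now >= next_wake_time || gpnum >= next_wake_gp) {
		next_wake_time = now + 10;
		next_wake_gp = gpnum + 4;
		return true;
	}
	return false;
}

int main(void)
{
	unsigned long t, gp = 100;	/* a burst of expedited GPs during one normal GP */

	for (t = 0; t < 30; t++)
		if (should_wake(t, gp))
			printf("t=%2lu gp=%lu: wake the GP kthread\n", t, gp);
	return 0;
}
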
On Tue, Jun 30, 2015 at 11:48 PM, Paul E. McKenney
<[email protected]> wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 308b6acb4260..247aa1120c4c 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3505,10 +3505,19 @@ void synchronize_sched_expedited(void)
> !atomic_read(&rsp->expedited_need_qs));
>
> rcu_exp_gp_seq_end(rsp);
> - mutex_unlock(&rnp->exp_funnel_mutex);
> smp_mb(); /* ensure subsequent action seen after grace period. */
> - if (rsp->gp_kthread && rcu_gp_in_progress(rsp))
> - wake_up(&rsp->gp_wq);
> + if (rsp->gp_kthread && rcu_gp_in_progress(rsp)) {
> + static unsigned long nextgp;
> + static unsigned long nextjiffy;
> +
> + if (time_after_eq(jiffies, nextgp) ||
> + ULONG_CMP_GE(rsp->gpnum, nextgp)) {
> + nextgp = rsp->gpnum + 4;
> + nextjiffy = jiffies + 10;
Do you want 10 ticks or 10 ms (as stated in title) ?
> + wake_up(&rsp->gp_wq);
> + }
> + }
> + mutex_unlock(&rnp->exp_funnel_mutex);
>
> put_online_cpus();
> }
> --
> 1.8.1.5
>
On Tue, Jun 30, 2015 at 02:48:05PM -0700, Paul E. McKenney wrote:
> Hello!
>
> This series contains some highly experimental patches that allow normal
> grace periods to take advantage of the work done by concurrent expedited
> grace periods. This can reduce the overhead incurred by normal grace
> periods by eliminating the need for force-quiescent-state scans that
> would otherwise have happened after the expedited grace period completed.
> It is not clear whether this is a useful tradeoff. Nevertheless, this
> series contains the following patches:
While it makes sense to avoid unnecessarily delaying a normal grace
period if the expedited machinery has provided the necessary delay, I'm
also *deeply* concerned that this will create a new class of
nondeterministic performance issues. Something that uses RCU may
perform badly due to grace period latency, but then suddenly start
performing well because an unrelated task starts hammering expedited
grace periods. This seems particularly likely during boot, for
instance, where RCU grace periods can be a significant component of boot
time (when you're trying to boot to userspace in small fractions of a
second).
- Josh Triplett
On Tue, Jun 30, 2015 at 11:56:37PM +0200, Eric Dumazet wrote:
> On Tue, Jun 30, 2015 at 11:48 PM, Paul E. McKenney
> <[email protected]> wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 15 ++++++++++++---
> > 1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 308b6acb4260..247aa1120c4c 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3505,10 +3505,19 @@ void synchronize_sched_expedited(void)
> > !atomic_read(&rsp->expedited_need_qs));
> >
> > rcu_exp_gp_seq_end(rsp);
> > - mutex_unlock(&rnp->exp_funnel_mutex);
> > smp_mb(); /* ensure subsequent action seen after grace period. */
> > - if (rsp->gp_kthread && rcu_gp_in_progress(rsp))
> > - wake_up(&rsp->gp_wq);
> > + if (rsp->gp_kthread && rcu_gp_in_progress(rsp)) {
> > + static unsigned long nextgp;
> > + static unsigned long nextjiffy;
> > +
> > + if (time_after_eq(jiffies, nextgp) ||
> > + ULONG_CMP_GE(rsp->gpnum, nextgp)) {
> > + nextgp = rsp->gpnum + 4;
> > + nextjiffy = jiffies + 10;
>
> Do you want 10 ticks or 10 ms (as stated in title) ?
Ten ticks, good catch!
Thanx, Paul
> > + wake_up(&rsp->gp_wq);
> > + }
> > + }
> > + mutex_unlock(&rnp->exp_funnel_mutex);
> >
> > put_online_cpus();
> > }
> > --
> > 1.8.1.5
> >
>
On Tue, Jun 30, 2015 at 03:00:15PM -0700, [email protected] wrote:
> On Tue, Jun 30, 2015 at 02:48:05PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > This series contains some highly experimental patches that allow normal
> > grace periods to take advantage of the work done by concurrent expedited
> > grace periods. This can reduce the overhead incurred by normal grace
> > periods by eliminating the need for force-quiescent-state scans that
> > would otherwise have happened after the expedited grace period completed.
> > It is not clear whether this is a useful tradeoff. Nevertheless, this
> > series contains the following patches:
>
> While it makes sense to avoid unnecessarily delaying a normal grace
> period if the expedited machinery has provided the necessary delay, I'm
> also *deeply* concerned that this will create a new class of
> nondeterministic performance issues. Something that uses RCU may
> perform badly due to grace period latency, but then suddenly start
> performing well because an unrelated task starts hammering expedited
> grace periods. This seems particularly likely during boot, for
> instance, where RCU grace periods can be a significant component of boot
> time (when you're trying to boot to userspace in small fractions of a
> second).
I will take that as another vote against. And for a reason that I had
not yet come up with, so good show! ;-)
Thanx, Paul
On Tue, Jun 30, 2015 at 03:12:24PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 30, 2015 at 03:00:15PM -0700, [email protected] wrote:
> > On Tue, Jun 30, 2015 at 02:48:05PM -0700, Paul E. McKenney wrote:
> > > Hello!
> > >
> > > This series contains some highly experimental patches that allow normal
> > > grace periods to take advantage of the work done by concurrent expedited
> > > grace periods. This can reduce the overhead incurred by normal grace
> > > periods by eliminating the need for force-quiescent-state scans that
> > > would otherwise have happened after the expedited grace period completed.
> > > It is not clear whether this is a useful tradeoff. Nevertheless, this
> > > series contains the following patches:
> >
> > While it makes sense to avoid unnecessarily delaying a normal grace
> > period if the expedited machinery has provided the necessary delay, I'm
> > also *deeply* concerned that this will create a new class of
> > nondeterministic performance issues. Something that uses RCU may
> > perform badly due to grace period latency, but then suddenly start
> > performing well because an unrelated task starts hammering expedited
> > grace periods. This seems particularly likely during boot, for
> > instance, where RCU grace periods can be a significant component of boot
> > time (when you're trying to boot to userspace in small fractions of a
> > second).
>
> I will take that as another vote against. And for a reason that I had
> not yet come up with, so good show! ;-)
Consider it a fairly weak concern against. Increasing performance seems
like a good thing in general; I just don't relish the future "feels less
responsive" bug reports that take a long time to track down and turn out
to be "this completely unrelated driver was loaded and started using
expedited grace periods".
Then again, perhaps the more relevant concern would be why drivers use
expedited grace periods in the first place.
- Josh Triplett
On Tue, Jun 30, 2015 at 04:46:33PM -0700, [email protected] wrote:
> On Tue, Jun 30, 2015 at 03:12:24PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 30, 2015 at 03:00:15PM -0700, [email protected] wrote:
> > > On Tue, Jun 30, 2015 at 02:48:05PM -0700, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > This series contains some highly experimental patches that allow normal
> > > > grace periods to take advantage of the work done by concurrent expedited
> > > > grace periods. This can reduce the overhead incurred by normal grace
> > > > periods by eliminating the need for force-quiescent-state scans that
> > > > would otherwise have happened after the expedited grace period completed.
> > > > It is not clear whether this is a useful tradeoff. Nevertheless, this
> > > > series contains the following patches:
> > >
> > > While it makes sense to avoid unnecessarily delaying a normal grace
> > > period if the expedited machinery has provided the necessary delay, I'm
> > > also *deeply* concerned that this will create a new class of
> > > nondeterministic performance issues. Something that uses RCU may
> > > perform badly due to grace period latency, but then suddenly start
> > > performing well because an unrelated task starts hammering expedited
> > > grace periods. This seems particularly likely during boot, for
> > > instance, where RCU grace periods can be a significant component of boot
> > > time (when you're trying to boot to userspace in small fractions of a
> > > second).
> >
> > I will take that as another vote against. And for a reason that I had
> > not yet come up with, so good show! ;-)
>
> Consider it a fairly weak concern against. Increasing performance seems
> like a good thing in general; I just don't relish the future "feels less
> responsive" bug reports that take a long time to track down and turn out
> to be "this completely unrelated driver was loaded and started using
> expedited grace periods".
From what I can see, this one needs a good reason to go in, as opposed
to a good reason to stay out.
> Then again, perhaps the more relevant concern would be why drivers use
> expedited grace periods in the first place.
Networking uses expedited grace periods when RTNL is held to reduce
contention on that lock. Several other places have used it to minimize
user-visible grace-period slowdown. But there are probably places that
would be better served doing something different. That is after all
the common case for most synchronization primitives. ;-)
Thanx, Paul
On Tue, Jun 30, 2015 at 05:15:58PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 30, 2015 at 04:46:33PM -0700, [email protected] wrote:
> > On Tue, Jun 30, 2015 at 03:12:24PM -0700, Paul E. McKenney wrote:
> > > On Tue, Jun 30, 2015 at 03:00:15PM -0700, [email protected] wrote:
> > > > On Tue, Jun 30, 2015 at 02:48:05PM -0700, Paul E. McKenney wrote:
> > > > > Hello!
> > > > >
> > > > > This series contains some highly experimental patches that allow normal
> > > > > grace periods to take advantage of the work done by concurrent expedited
> > > > > grace periods. This can reduce the overhead incurred by normal grace
> > > > > periods by eliminating the need for force-quiescent-state scans that
> > > > > would otherwise have happened after the expedited grace period completed.
> > > > > It is not clear whether this is a useful tradeoff. Nevertheless, this
> > > > > series contains the following patches:
> > > >
> > > > While it makes sense to avoid unnecessarily delaying a normal grace
> > > > period if the expedited machinery has provided the necessary delay, I'm
> > > > also *deeply* concerned that this will create a new class of
> > > > nondeterministic performance issues. Something that uses RCU may
> > > > perform badly due to grace period latency, but then suddenly start
> > > > performing well because an unrelated task starts hammering expedited
> > > > grace periods. This seems particularly likely during boot, for
> > > > instance, where RCU grace periods can be a significant component of boot
> > > > time (when you're trying to boot to userspace in small fractions of a
> > > > second).
> > >
> > > I will take that as another vote against. And for a reason that I had
> > > not yet come up with, so good show! ;-)
> >
> > Consider it a fairly weak concern against. Increasing performance seems
> > like a good thing in general; I just don't relish the future "feels less
> > responsive" bug reports that take a long time to track down and turn out
> > to be "this completely unrelated driver was loaded and started using
> > expedited grace periods".
>
> From what I can see, this one needs a good reason to go in, as opposed
> to a good reason to stay out.
>
> > Then again, perhaps the more relevant concern would be why drivers use
> > expedited grace periods in the first place.
>
> Networking uses expedited grace periods when RTNL is held to reduce
> contention on that lock.
Wait, what? Why is anything using traditional (non-S) RCU while *any*
lock is held?
> Several other places have used it to minimize
> user-visible grace-period slowdown. But there are probably places that
> would be better served doing something different. That is after all
> the common case for most synchronization primitives. ;-)
Sounds likely. :)
- Josh Triplett
On Tue, Jun 30, 2015 at 05:42:14PM -0700, Josh Triplett wrote:
> On Tue, Jun 30, 2015 at 05:15:58PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 30, 2015 at 04:46:33PM -0700, [email protected] wrote:
> > > On Tue, Jun 30, 2015 at 03:12:24PM -0700, Paul E. McKenney wrote:
> > > > On Tue, Jun 30, 2015 at 03:00:15PM -0700, [email protected] wrote:
> > > > > On Tue, Jun 30, 2015 at 02:48:05PM -0700, Paul E. McKenney wrote:
> > > > > > Hello!
> > > > > >
> > > > > > This series contains some highly experimental patches that allow normal
> > > > > > grace periods to take advantage of the work done by concurrent expedited
> > > > > > grace periods. This can reduce the overhead incurred by normal grace
> > > > > > periods by eliminating the need for force-quiescent-state scans that
> > > > > > would otherwise have happened after the expedited grace period completed.
> > > > > > It is not clear whether this is a useful tradeoff. Nevertheless, this
> > > > > > series contains the following patches:
> > > > >
> > > > > While it makes sense to avoid unnecessarily delaying a normal grace
> > > > > period if the expedited machinery has provided the necessary delay, I'm
> > > > > also *deeply* concerned that this will create a new class of
> > > > > nondeterministic performance issues. Something that uses RCU may
> > > > > perform badly due to grace period latency, but then suddenly start
> > > > > performing well because an unrelated task starts hammering expedited
> > > > > grace periods. This seems particularly likely during boot, for
> > > > > instance, where RCU grace periods can be a significant component of boot
> > > > > time (when you're trying to boot to userspace in small fractions of a
> > > > > second).
> > > >
> > > > I will take that as another vote against. And for a reason that I had
> > > > not yet come up with, so good show! ;-)
> > >
> > > Consider it a fairly weak concern against. Increasing performance seems
> > > like a good thing in general; I just don't relish the future "feels less
> > > responsive" bug reports that take a long time to track down and turn out
> > > to be "this completely unrelated driver was loaded and started using
> > > expedited grace periods".
> >
> > From what I can see, this one needs a good reason to go in, as opposed
> > to a good reason to stay out.
> >
> > > Then again, perhaps the more relevant concern would be why drivers use
> > > expedited grace periods in the first place.
> >
> > Networking uses expedited grace periods when RTNL is held to reduce
> > contention on that lock.
>
> Wait, what? Why is anything using traditional (non-S) RCU while *any*
> lock is held?
In their defense, it is a sleeplock that is never taken except when
rearranging networking configuration. Sometimes they need a grace period
under the lock. So synchronize_net() checks to see if RTNL is held, and
does a synchronize_rcu_expedited() if so and a synchronize_rcu() if not.
But maybe I am misunderstanding your question?
> > Several other places have used it to minimize
> > user-visible grace-period slowdown. But there are probably places that
> > would be better served doing something different. That is after all
> > the common case for most synchronization primitives. ;-)
>
> Sounds likely. :)
;-)
Thanx, Paul
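
For reference, the behavior Paul describes amounts to roughly the following
kernel-side logic (a paraphrase of synchronize_net(), not necessarily the
exact upstream source):

void synchronize_net(void)
{
	might_sleep();
	if (rtnl_is_locked())
		synchronize_rcu_expedited();	/* RTNL held: keep the hold time short */
	else
		synchronize_rcu();		/* otherwise take the normal, cheaper path */
}
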
On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> @@ -2121,17 +2137,24 @@ static int __noreturn rcu_gp_kthread(void *arg)
> TPS("fqswait"));
> rsp->gp_state = RCU_GP_WAIT_FQS;
> ret = wait_event_interruptible_timeout(rsp->gp_wq,
> - ((gf = READ_ONCE(rsp->gp_flags)) &
> - RCU_GP_FLAG_FQS) ||
> - (!READ_ONCE(rnp->qsmask) &&
> - !rcu_preempt_blocked_readers_cgp(rnp)),
> - j);
> + ((gf = READ_ONCE(rsp->gp_flags)) &
> + RCU_GP_FLAG_FQS) ||
> + (!READ_ONCE(rnp->qsmask) &&
> + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> + rcu_exp_gp_seq_done(rsp->exp_rsp,
> + rsp->gp_exp_snap),
> + j);
How about using a helper function there?
static inline bool rcu_gp_done(rsp, rnp)
{
/* Forced Quiescent State complete */
if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
return true;
/* QS not masked and not blocked by preempted readers */
if (!READ_ONCE(rnp->qsmask) &&
!rcu_preempt_blocked_readers_cgp(rnp))
return true;
/* Expedited Grace Period completed */
if (rcu_exp_gp_seq_done(rsp))
return true;
return false;
}
ret = wait_event_interruptible_timeout(rsp->gp_wq,
rcu_gp_done(rsp, rnp), j);
On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> rsp->gp_state = RCU_GP_WAIT_FQS;
> ret = wait_event_interruptible_timeout(rsp->gp_wq,
> + ((gf = READ_ONCE(rsp->gp_flags)) &
> + RCU_GP_FLAG_FQS) ||
> + (!READ_ONCE(rnp->qsmask) &&
> + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> + rcu_exp_gp_seq_done(rsp->exp_rsp,
> + rsp->gp_exp_snap),
> + j);
> rsp->gp_state = RCU_GP_DONE_FQS;
How can the GP be done if we timed out or got interrupted?
On Tue, Jun 30, 2015 at 02:48:30PM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <[email protected]>
This seems like a good place to explain why this is a desirable thing,
no?
Why would you want to limit this?
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 308b6acb4260..247aa1120c4c 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3505,10 +3505,19 @@ void synchronize_sched_expedited(void)
> !atomic_read(&rsp->expedited_need_qs));
>
> rcu_exp_gp_seq_end(rsp);
> - mutex_unlock(&rnp->exp_funnel_mutex);
> smp_mb(); /* ensure subsequent action seen after grace period. */
> - if (rsp->gp_kthread && rcu_gp_in_progress(rsp))
> - wake_up(&rsp->gp_wq);
> + if (rsp->gp_kthread && rcu_gp_in_progress(rsp)) {
> + static unsigned long nextgp;
> + static unsigned long nextjiffy;
> +
> + if (time_after_eq(jiffies, nextgp) ||
> + ULONG_CMP_GE(rsp->gpnum, nextgp)) {
> + nextgp = rsp->gpnum + 4;
> + nextjiffy = jiffies + 10;
> + wake_up(&rsp->gp_wq);
> + }
> + }
> + mutex_unlock(&rnp->exp_funnel_mutex);
>
> put_online_cpus();
> }
> --
> 1.8.1.5
>
On Tue, Jun 30, 2015 at 04:46:33PM -0700, [email protected] wrote:
> Consider it a fairly weak concern against. Increasing performance seems
> like a good thing in general; I just don't relish the future "feels less
> responsive" bug reports that take a long time to track down and turn out
> to be "this completely unrelated driver was loaded and started using
> expedited grace periods".
random drivers, or for that matter, new-code of any sort. Should _NOT_
be using expedited grace periods.
They're a horrid hack only suitable for unfixable ABI.
> Then again, perhaps the more relevant concern would be why drivers use
> expedited grace periods in the first place.
Right.
On Tue, Jun 30, 2015 at 08:37:01PM -0700, Paul E. McKenney wrote:
> > Wait, what? Why is anything using traditional (non-S) RCU while *any*
> > lock is held?
>
> In their defense, it is a sleeplock that is never taken except when
> rearranging networking configuration. Sometimes they need a grace period
> under the lock. So synchronize_net() checks to see if RTNL is held, and
> does a synchronize_rcu_expedited() if so and a synchronize_rcu() if not.
Sounds vile.
On Wed, Jul 01, 2015 at 12:09:39PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 04:46:33PM -0700, [email protected] wrote:
> > Consider it a fairly weak concern against. Increasing performance seems
> > like a good thing in general; I just don't relish the future "feels less
> > responsive" bug reports that take a long time to track down and turn out
> > to be "this completely unrelated driver was loaded and started using
> > expedited grace periods".
>
> random drivers, or for that matter, new-code of any sort. Should _NOT_
> be using expedited grace periods.
>
> They're a horrid hack only suitable for unfixable ABI.
Let me repeat, just in case I've not been clear. Expedited grace periods
are _BAD_ and should be avoided at all costs.
They perturb the _entire_ machine.
Yes we can polish the turd, but in the end it's still a turd.
Sadly people seem to have taken a liking to them, ooh a make RCU go
faster button. And there's not been much if any pushback on people using
it.
On Wed, Jul 01, 2015 at 12:05:21PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> > rsp->gp_state = RCU_GP_WAIT_FQS;
> > ret = wait_event_interruptible_timeout(rsp->gp_wq,
> > + ((gf = READ_ONCE(rsp->gp_flags)) &
> > + RCU_GP_FLAG_FQS) ||
> > + (!READ_ONCE(rnp->qsmask) &&
> > + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> > + rcu_exp_gp_seq_done(rsp->exp_rsp,
> > + rsp->gp_exp_snap),
> > + j);
> > rsp->gp_state = RCU_GP_DONE_FQS;
>
> How can the GP be done if we timed out or got interrupted?
If all the CPUs still blocking the grace period went idle, or in a
NO_HZ_FULL kernel, entered nohz_full userspace execution. Or, if
certain low-probability races happen, went offline.
Thanx, Paul
On Wed, Jul 01, 2015 at 12:03:52PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> > @@ -2121,17 +2137,24 @@ static int __noreturn rcu_gp_kthread(void *arg)
> > TPS("fqswait"));
> > rsp->gp_state = RCU_GP_WAIT_FQS;
> > ret = wait_event_interruptible_timeout(rsp->gp_wq,
> > - ((gf = READ_ONCE(rsp->gp_flags)) &
> > - RCU_GP_FLAG_FQS) ||
> > - (!READ_ONCE(rnp->qsmask) &&
> > - !rcu_preempt_blocked_readers_cgp(rnp)),
> > - j);
> > + ((gf = READ_ONCE(rsp->gp_flags)) &
> > + RCU_GP_FLAG_FQS) ||
> > + (!READ_ONCE(rnp->qsmask) &&
> > + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> > + rcu_exp_gp_seq_done(rsp->exp_rsp,
> > + rsp->gp_exp_snap),
> > + j);
>
> How about using a helper function there?
Good point, will do.
Thanx, Paul
> static inline bool rcu_gp_done(rsp, rnp)
> {
> /* Forced Quiescent State complete */
> if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
> return true;
>
> /* QS not masked and not blocked by preempted readers */
> if (!READ_ONCE(rnp->qsmask) &&
> !rcu_preempt_blocked_readers_cgp(rnp))
> return true;
>
> /* Expedited Grace Period completed */
> if (rcu_exp_gp_seq_done(rsp))
> return true;
>
> return false;
> }
>
> ret = wait_event_interruptible_timeout(rsp->gp_wq,
> rcu_gp_done(rsp, rnp), j);
>
>
On Wed, Jul 01, 2015 at 12:07:30PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 02:48:30PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <[email protected]>
>
> This seems like a good place to explain why this is a desirable thing,
> no?
Good point.
> Why would you want to limit this?
Because the unconditional wakeup is a two-edged sword. It reduces
the latency of normal RCU grace periods on the one hand, but it makes
rcu_sched consume even more CPU on the other.
Thanx, Paul
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 15 ++++++++++++---
> > 1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 308b6acb4260..247aa1120c4c 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3505,10 +3505,19 @@ void synchronize_sched_expedited(void)
> > !atomic_read(&rsp->expedited_need_qs));
> >
> > rcu_exp_gp_seq_end(rsp);
> > - mutex_unlock(&rnp->exp_funnel_mutex);
> > smp_mb(); /* ensure subsequent action seen after grace period. */
> > - if (rsp->gp_kthread && rcu_gp_in_progress(rsp))
> > - wake_up(&rsp->gp_wq);
> > + if (rsp->gp_kthread && rcu_gp_in_progress(rsp)) {
> > + static unsigned long nextgp;
> > + static unsigned long nextjiffy;
> > +
> > + if (time_after_eq(jiffies, nextgp) ||
> > + ULONG_CMP_GE(rsp->gpnum, nextgp)) {
> > + nextgp = rsp->gpnum + 4;
> > + nextjiffy = jiffies + 10;
> > + wake_up(&rsp->gp_wq);
> > + }
> > + }
> > + mutex_unlock(&rnp->exp_funnel_mutex);
> >
> > put_online_cpus();
> > }
> > --
> > 1.8.1.5
> >
>
On Wed, Jul 01, 2015 at 06:41:44AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 01, 2015 at 12:05:21PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> > > rsp->gp_state = RCU_GP_WAIT_FQS;
> > > ret = wait_event_interruptible_timeout(rsp->gp_wq,
> > > + ((gf = READ_ONCE(rsp->gp_flags)) &
> > > + RCU_GP_FLAG_FQS) ||
> > > + (!READ_ONCE(rnp->qsmask) &&
> > > + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> > > + rcu_exp_gp_seq_done(rsp->exp_rsp,
> > > + rsp->gp_exp_snap),
> > > + j);
> > > rsp->gp_state = RCU_GP_DONE_FQS;
> >
> > How can the GP be done if we timed out or got interrupted?
>
> If all the CPUs still blocking the grace period went idle, or in a
> NO_HZ_FULL kernel, entered nohz_full userspace execution. Or, if
> certain low-probability races happen, went offline.
But what if none of those are true and we still timed out? You
unconditionally grant the GP.
On Wed, Jul 01, 2015 at 12:55:11PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 12:09:39PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 30, 2015 at 04:46:33PM -0700, [email protected] wrote:
> > > Consider it a fairly weak concern against. Increasing performance seems
> > > like a good thing in general; I just don't relish the future "feels less
> > > responsive" bug reports that take a long time to track down and turn out
> > > to be "this completely unrelated driver was loaded and started using
> > > expedited grace periods".
> >
> > random drivers, or for that matter, new-code of any sort. Should _NOT_
> > be using expedited grace periods.
> >
> > They're a horrid hack only suitable for unfixable ABI.
>
> Let me repeat, just in case I've not been clear. Expedited grace periods
> are _BAD_ and should be avoided at all costs.
That is a bit extreme, Peter.
There should not be a problem using them for operations that are not done
while the system is running the main workload. Which is why I am OK with
synchronize_net() using synchronize_rcu_expedited() when RTNL is held.
The operations that do that are setup/configuration operations that you
won't normally be doing while your HPC or realtime workload is running.
I believe that many of the other uses are similar.
> They perturb the _entire_ machine.
The portion of the machine that is non-idle and not executing in nohz_full
userspace, that is. So nohz_full usermode execution is unperturbed by
expedited grace periods.
In addition, synchronize_srcu_expedited() does -not- perturb anything
other than the task actually executing synchronize_srcu_expedited().
So those read-side memory barriers are gaining you something, and there
should not be much need to push back on synchronize_srcu_expedited().
> Yes we can polish the turd, but in the end it's still a turd.
And I intend to polish it even more. ;-)
> Sadly people seem to have taken a liking to them, ooh a make RCU go
> faster button. And there's not been much if any pushback on people using
> it.
There aren't all that many uses, so I don't believe that
people are abusing it that much. There are only four non-RCU
uses of synchronize_rcu_expedited() and only two non-RCU uses of
synchronize_sched_expedited(). In contrast, there are a couple hundred
uses of synchronize_rcu() and about 40 uses of synchronize_sched().
So I am not seeing much evidence of wanton use of either
synchronize_rcu_expedited() or synchronize_sched_expedited().
Are a huge pile of them coming in this merge window or something?
What raised your concerns on this issue?
Thanx, Paul
On Wed, Jul 01, 2015 at 12:12:21PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 08:37:01PM -0700, Paul E. McKenney wrote:
> > > Wait, what? Why is anything using traditional (non-S) RCU while *any*
> > > lock is held?
> >
> > In their defense, it is a sleeplock that is never taken except when
> > rearranging networking configuration. Sometimes they need a grace period
> > under the lock. So synchronize_net() checks to see if RTNL is held, and
> > does a synchronize_rcu_expedited() if so and a synchronize_rcu() if not.
>
> Sounds vile.
OK, I'll bite. Exactly what seems especially vile about it?
Thanx, Paul
On Wed, Jul 01, 2015 at 03:48:06PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 06:41:44AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 01, 2015 at 12:05:21PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> > > > rsp->gp_state = RCU_GP_WAIT_FQS;
> > > > ret = wait_event_interruptible_timeout(rsp->gp_wq,
> > > > + ((gf = READ_ONCE(rsp->gp_flags)) &
> > > > + RCU_GP_FLAG_FQS) ||
> > > > + (!READ_ONCE(rnp->qsmask) &&
> > > > + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> > > > + rcu_exp_gp_seq_done(rsp->exp_rsp,
> > > > + rsp->gp_exp_snap),
> > > > + j);
> > > > rsp->gp_state = RCU_GP_DONE_FQS;
> > >
> > > How can the GP be done if we timed out or got interrupted?
> >
> > If all the CPUs still blocking the grace period went idle, or in a
> > NO_HZ_FULL kernel, entered nohz_full userspace execution. Or, if
> > certain low-probability races happen, went offline.
>
> But what if none of those are true and we still timed out? You
> unconditionally grant the GP.
Say what???
I recheck the conditions and break out of the loop only if one or more
of the grace-period-end conditions is satisfied. If not, I invoke
rcu_gp_fqs() to do the scan to see if all remaining CPUs are idle,
in nohz_full userspace execution, or offline.
What am I missing here?
Thanx, Paul
> OK, I'll bite. Exactly what seems especially vile about it?
Presumably we would need to convert everything to async model
(call_rcu() and friends),
but that is a long way to go...
For example, synchronize_srcu_expedited() in kvm_set_irq_routing()
really looks overkill.
On Wed, Jul 01, 2015 at 07:00:31AM -0700, Paul E. McKenney wrote:
> That is a bit extreme, Peter.
Of course; but I'm really not seeing people taking due care with them
> Are a huge pile of them coming in this merge window or something?
> What raised your concerns on this issue?
This is complete horse manure (breaking the nvidiot binary blob is a
good thing):
74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
Also, I'm not entirely convinced things like:
fd2ed4d25270 ("dm: add statistics support")
83d5e5b0af90 ("dm: optimize use SRCU and RCU")
ef3230880abd ("backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu")
Are in the 'never' happens category. Esp. the backing-dev one, it
triggers every time you unplug a USB stick or similar.
Rejigging a DM might indeed be rare enough; but then again, people use
DM explicitly so they can rejig while in operation.
Also, they really do not explain how expedited really is the only option
available. Why things can't be batched etc..
On Tue, Jun 30, 2015 at 08:37:01PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 30, 2015 at 05:42:14PM -0700, Josh Triplett wrote:
> > On Tue, Jun 30, 2015 at 05:15:58PM -0700, Paul E. McKenney wrote:
> > > On Tue, Jun 30, 2015 at 04:46:33PM -0700, [email protected] wrote:
> > > > On Tue, Jun 30, 2015 at 03:12:24PM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Jun 30, 2015 at 03:00:15PM -0700, [email protected] wrote:
> > > > > > On Tue, Jun 30, 2015 at 02:48:05PM -0700, Paul E. McKenney wrote:
> > > > > > > Hello!
> > > > > > >
> > > > > > > This series contains some highly experimental patches that allow normal
> > > > > > > grace periods to take advantage of the work done by concurrent expedited
> > > > > > > grace periods. This can reduce the overhead incurred by normal grace
> > > > > > > periods by eliminating the need for force-quiescent-state scans that
> > > > > > > would otherwise have happened after the expedited grace period completed.
> > > > > > > It is not clear whether this is a useful tradeoff. Nevertheless, this
> > > > > > > series contains the following patches:
> > > > > >
> > > > > > While it makes sense to avoid unnecessarily delaying a normal grace
> > > > > > period if the expedited machinery has provided the necessary delay, I'm
> > > > > > also *deeply* concerned that this will create a new class of
> > > > > > nondeterministic performance issues. Something that uses RCU may
> > > > > > perform badly due to grace period latency, but then suddenly start
> > > > > > performing well because an unrelated task starts hammering expedited
> > > > > > grace periods. This seems particularly likely during boot, for
> > > > > > instance, where RCU grace periods can be a significant component of boot
> > > > > > time (when you're trying to boot to userspace in small fractions of a
> > > > > > second).
> > > > >
> > > > > I will take that as another vote against. And for a reason that I had
> > > > > not yet come up with, so good show! ;-)
> > > >
> > > > Consider it a fairly weak concern against. Increasing performance seems
> > > > like a good thing in general; I just don't relish the future "feels less
> > > > responsive" bug reports that take a long time to track down and turn out
> > > > to be "this completely unrelated driver was loaded and started using
> > > > expedited grace periods".
> > >
> > > From what I can see, this one needs a good reason to go in, as opposed
> > > to a good reason to stay out.
> > >
> > > > Then again, perhaps the more relevant concern would be why drivers use
> > > > expedited grace periods in the first place.
> > >
> > > Networking uses expedited grace periods when RTNL is held to reduce
> > > contention on that lock.
> >
> > Wait, what? Why is anything using traditional (non-S) RCU while *any*
> > lock is held?
>
> In their defense, it is a sleeplock that is never taken except when
> rearranging networking configuration. Sometimes they need a grace period
> under the lock. So synchronize_net() checks to see if RTNL is held, and
> does a synchronize_rcu_expedited() if so and a synchronize_rcu() if not.
>
> But maybe I am misunderstanding your question?
No, you understood my question. It seems wrong that they would need a
grace period *under* the lock, rather than the usual case of making a
change under the lock, dropping the lock, and *then* synchronizing.
- Josh Triplett
On Wed, Jul 01, 2015 at 04:08:27PM +0200, Eric Dumazet wrote:
> > OK, I'll bite. Exactly what seems especially vile about it?
>
> Presumably we would need to convert everything to async model
> (call_rcu() and friends),
> but that is a long way to go...
>
> For example, synchronize_srcu_expedited() in kvm_set_irq_routing()
> really looks overkill.
But synchronize_srcu_expedited() is SRCU, and does not disturb the
rest of the system.
Don't get me wrong, if converting it to call_srcu() helps, why not?
But synchronize_srcu_expedited() isn't messing up HPC or real-time
workloads.
Thanx, Paul
On Wed, Jul 01, 2015 at 08:43:54AM -0700, Josh Triplett wrote:
> On Tue, Jun 30, 2015 at 08:37:01PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 30, 2015 at 05:42:14PM -0700, Josh Triplett wrote:
> > > On Tue, Jun 30, 2015 at 05:15:58PM -0700, Paul E. McKenney wrote:
> > > > On Tue, Jun 30, 2015 at 04:46:33PM -0700, [email protected] wrote:
> > > > > On Tue, Jun 30, 2015 at 03:12:24PM -0700, Paul E. McKenney wrote:
> > > > > > On Tue, Jun 30, 2015 at 03:00:15PM -0700, [email protected] wrote:
> > > > > > > On Tue, Jun 30, 2015 at 02:48:05PM -0700, Paul E. McKenney wrote:
> > > > > > > > Hello!
> > > > > > > >
> > > > > > > > This series contains some highly experimental patches that allow normal
> > > > > > > > grace periods to take advantage of the work done by concurrent expedited
> > > > > > > > grace periods. This can reduce the overhead incurred by normal grace
> > > > > > > > periods by eliminating the need for force-quiescent-state scans that
> > > > > > > > would otherwise have happened after the expedited grace period completed.
> > > > > > > > It is not clear whether this is a useful tradeoff. Nevertheless, this
> > > > > > > > series contains the following patches:
> > > > > > >
> > > > > > > While it makes sense to avoid unnecessarily delaying a normal grace
> > > > > > > period if the expedited machinery has provided the necessary delay, I'm
> > > > > > > also *deeply* concerned that this will create a new class of
> > > > > > > nondeterministic performance issues. Something that uses RCU may
> > > > > > > perform badly due to grace period latency, but then suddenly start
> > > > > > > performing well because an unrelated task starts hammering expedited
> > > > > > > grace periods. This seems particularly likely during boot, for
> > > > > > > instance, where RCU grace periods can be a significant component of boot
> > > > > > > time (when you're trying to boot to userspace in small fractions of a
> > > > > > > second).
> > > > > >
> > > > > > I will take that as another vote against. And for a reason that I had
> > > > > > not yet come up with, so good show! ;-)
> > > > >
> > > > > Consider it a fairly weak concern against. Increasing performance seems
> > > > > like a good thing in general; I just don't relish the future "feels less
> > > > > responsive" bug reports that take a long time to track down and turn out
> > > > > to be "this completely unrelated driver was loaded and started using
> > > > > expedited grace periods".
> > > >
> > > > From what I can see, this one needs a good reason to go in, as opposed
> > > > to a good reason to stay out.
> > > >
> > > > > Then again, perhaps the more relevant concern would be why drivers use
> > > > > expedited grace periods in the first place.
> > > >
> > > > Networking uses expedited grace periods when RTNL is held to reduce
> > > > contention on that lock.
> > >
> > > Wait, what? Why is anything using traditional (non-S) RCU while *any*
> > > lock is held?
> >
> > In their defense, it is a sleeplock that is never taken except when
> > rearranging networking configuration. Sometimes they need a grace period
> > under the lock. So synchronize_net() checks to see if RTNL is held, and
> > does a synchronize_rcu_expedited() if so and a synchronize_rcu() if not.
> >
> > But maybe I am misunderstanding your question?
>
> No, you understood my question. It seems wrong that they would need a
> grace period *under* the lock, rather than the usual case of making a
> change under the lock, dropping the lock, and *then* synchronizing.
On that I must defer to the networking folks.
Thanx, Paul
On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 07:00:31AM -0700, Paul E. McKenney wrote:
>
> > That is a bit extreme, Peter.
>
> Of course; but I'm really not seeing people taking due care with them
;-)
> > Are a huge pile of them coming in this merge window or something?
> > What raised your concerns on this issue?
>
> This is complete horse manure (breaking the nvidiot binary blob is a
> good thing):
>
> 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
Really???
I am not concerned about this one. After all, one of the first things
that people do for OS-jitter-sensitive workloads is to get rid of
binary blobs. And any runtime use of ACPI as well. And let's face it,
if your latency-sensitive workload is using either binary blobs or ACPI,
you have already completely lost. Therefore, an additional expedited
grace period cannot possibly cause you to lose any more.
> Also, I'm not entirely convinced things like:
>
> fd2ed4d25270 ("dm: add statistics support")
> 83d5e5b0af90 ("dm: optimize use SRCU and RCU")
> ef3230880abd ("backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu")
>
> Are in the 'never' happens category. Esp. the backing-dev one, it
> triggers every time you unplug a USB stick or similar.
Which people should be assiduously avoiding for any sort of
industrial-control system, especially given things like STUXNET.
> Rejigging a DM might indeed be rare enough; but then again, people use
> DM explicitly so they can rejig while in operation.
They rejig DM when running OS-jitter-sensitive workloads?
> Also, they really do not explain how expedited really is the only option
> available. Why things can't be batched etc..
Fair question!
Thanx, Paul
On Wed, Jul 01, 2015 at 09:17:05AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
> > 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
>
> Really???
>
> I am not concerned about this one. After all, one of the first things
> that people do for OS-jitter-sensitive workloads is to get rid of
> binary blobs. And any runtime use of ACPI as well. And let's face it,
> if your latency-sensitive workload is using either binary blobs or ACPI,
> you have already completely lost. Therefore, an additional expedited
> grace period cannot possibly cause you to lose any more.
This isn't solely about rt etc.. this call is a generic facility used by
however many consumers. A normal workstation/server could run into it at
relatively high frequency depending on its workload.
Even on not latency sensitive workloads I think hammering all active
CPUs is bad behaviour. Remember that a typical server class machine
easily has more than 32 CPUs these days.
> > Also, I'm not entirely convinced things like:
> >
> > fd2ed4d25270 ("dm: add statistics support")
> > 83d5e5b0af90 ("dm: optimize use SRCU and RCU")
> > ef3230880abd ("backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu")
> >
> > Are in the 'never' happens category. Esp. the backing-dev one, it
> > triggers every time you unplug a USB stick or similar.
>
> Which people should be assiduously avoiding for any sort of
> industrial-control system, especially given things like STUXNET.
USB sure, but a backing dev is involved in nfs clients, loopback and all
sorts of block/filesystem like setups.
unmount an NFS mount and voila expedited rcu, unmount a loopback, tada.
All you need is a regular server workload triggering any of that on a
semi regular basis and even !rt people might start to notice something
is up.
> > Rejigging a DM might indeed be rare enough; but then again, people use
> > DM explicitly so they can rejig while in operation.
>
> They rejig DM when running OS-jitter-sensitive workloads?
Unlikely but who knows, I don't really know DM, so I can't even tell
what would trigger these.
On Wed, Jul 01, 2015 at 12:07:30PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 02:48:30PM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <[email protected]>
>
> This seems like a good place to explain why this is a desirable thing,
> no?
>
> Why would you want to limit this?
Good point! I added the following to the commit log:
Currently, the expedited grace period unconditionally awakens the
normal grace period anytime there is a grace period in progress.
However, this can be too much of a good thing for the following
reasons: (1) It can result in excessive CPU consumption on
the part of the RCU grace-period kthread, (2) The resulting
variations in normal grace-period latency may cause confusion
(as Josh Triplett points out), and (3) In many cases, reducing
normal grace-period latency will be of little or no value.
This commit therefore does the wakeups only once per ten jiffies
or every fourth grace period, whichever is more frequent.
Does that help?
Thanx, Paul
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 15 ++++++++++++---
> > 1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 308b6acb4260..247aa1120c4c 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3505,10 +3505,19 @@ void synchronize_sched_expedited(void)
> > !atomic_read(&rsp->expedited_need_qs));
> >
> > rcu_exp_gp_seq_end(rsp);
> > - mutex_unlock(&rnp->exp_funnel_mutex);
> > smp_mb(); /* ensure subsequent action seen after grace period. */
> > - if (rsp->gp_kthread && rcu_gp_in_progress(rsp))
> > - wake_up(&rsp->gp_wq);
> > + if (rsp->gp_kthread && rcu_gp_in_progress(rsp)) {
> > + static unsigned long nextgp;
> > + static unsigned long nextjiffy;
> > +
> > + if (time_after_eq(jiffies, nextjiffy) ||
> > + ULONG_CMP_GE(rsp->gpnum, nextgp)) {
> > + nextgp = rsp->gpnum + 4;
> > + nextjiffy = jiffies + 10;
> > + wake_up(&rsp->gp_wq);
> > + }
> > + }
> > + mutex_unlock(&rnp->exp_funnel_mutex);
> >
> > put_online_cpus();
> > }
> > --
> > 1.8.1.5
> >
>
On Wed, Jul 01, 2015 at 07:02:42PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 09:17:05AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
>
> > > 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
> >
> > Really???
> >
> > I am not concerned about this one. After all, one of the first things
> > that people do for OS-jitter-sensitive workloads is to get rid of
> > binary blobs. And any runtime use of ACPI as well. And let's face it,
> > if your latency-sensitive workload is using either binary blobs or ACPI,
> > you have already completely lost. Therefore, an additional expedited
> > grace period cannot possibly cause you to lose any more.
>
> This isn't solely about rt etc.. this call is a generic facility used by
> however many consumers. A normal workstation/server could run into it at
> relatively high frequency depending on its workload.
>
> Even on not latency sensitive workloads I think hammering all active
> CPUs is bad behaviour. Remember that a typical server class machine
> easily has more than 32 CPUs these days.
Well, that certainly is one reason for the funnel locking, sequence
counters, etc., keeping the overhead bounded despite large numbers
of CPUs. So I don't believe that a non-RT/non-HPC workload is going
to notice.
> > > Also, I'm not entirely convinced things like:
> > >
> > > fd2ed4d25270 ("dm: add statistics support")
> > > 83d5e5b0af90 ("dm: optimize use SRCU and RCU")
> > > ef3230880abd ("backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu")
> > >
> > > Are in the 'never' happens category. Esp. the backing-dev one, it
> > > triggers every time you unplug a USB stick or similar.
> >
> > Which people should be assiduously avoiding for any sort of
> > industrial-control system, especially given things like STUXNET.
>
> USB sure, but a backing dev is involved in nfs clients, loopback and all
> sorts of block/filesystem like setups.
>
> unmount an NFS mount and voila expedited rcu, unmount a loopback, tada.
>
> All you need is a regular server workload triggering any of that on a
> semi regular basis and even !rt people might start to notice something
> is up.
I don't believe that latency-sensitive systems are going to be messing
with remapping their storage at runtime, let alone on a regular basis.
If they are not latency sensitive, and assuming that the rate of
storage remapping is at all sane, I bet that they won't notice the
synchronize_rcu_expedited() overhead. The overhead of the actual
remapping will very likely leave the synchronize_rcu_expedited() overhead
way down in the noise.
And if they are doing completely insane rates of storage remapping,
I suspect that the batching in the synchronize_rcu_expedited()
implementation will reduce the expedited-grace-period overhead still
further as a fraction of the total.
> > > Rejigging a DM might indeed be rare enough; but then again, people use
> > > DM explicitly so they can rejig while in operation.
> >
> > They rejig DM when running OS-jitter-sensitive workloads?
>
> Unlikely but who knows, I don't really know DM, so I can't even tell
> what would trigger these.
In my experience, the hard-core low-latency guys avoid doing pretty
much anything that isn't completely essential to the job at hand.
Thanx, Paul
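For readers unfamiliar with the funnel locking and batching that Paul refers
to above, here is a rough illustrative sketch of the idea. This is not the
in-tree code: the funnel_node layout and the gp_seq_done() and
do_one_expedited_gp() helpers are hypothetical stand-ins. Concurrent
requesters climb a tree of mutexes toward the root, and any requester that
finds its snapshot already satisfied along the way simply drops out, so a
burst of concurrent requests is served by only a handful of actual expedited
grace periods:

	#include <linux/mutex.h>

	struct funnel_node {
		struct mutex lock;
		struct funnel_node *parent;	/* NULL at the root. */
	};

	/* @snap: sequence snapshot taken by the caller before entry. */
	static void expedited_gp_funnel(struct funnel_node *leaf, unsigned long snap)
	{
		struct funnel_node *fn = leaf;

		mutex_lock(&fn->lock);
		while (fn->parent) {
			/* Did someone else's grace period already cover us? */
			if (gp_seq_done(snap)) {
				mutex_unlock(&fn->lock);
				return;
			}
			/* Hand-over-hand: acquire the parent, release the child. */
			mutex_lock(&fn->parent->lock);
			mutex_unlock(&fn->lock);
			fn = fn->parent;
		}
		/* Only the holder of the root lock does the real work. */
		if (!gp_seq_done(snap))
			do_one_expedited_gp();
		mutex_unlock(&fn->lock);
	}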
On Wed, Jul 01, 2015 at 12:03:52PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> > @@ -2121,17 +2137,24 @@ static int __noreturn rcu_gp_kthread(void *arg)
> > TPS("fqswait"));
> > rsp->gp_state = RCU_GP_WAIT_FQS;
> > ret = wait_event_interruptible_timeout(rsp->gp_wq,
> > - ((gf = READ_ONCE(rsp->gp_flags)) &
> > - RCU_GP_FLAG_FQS) ||
> > - (!READ_ONCE(rnp->qsmask) &&
> > - !rcu_preempt_blocked_readers_cgp(rnp)),
> > - j);
> > + ((gf = READ_ONCE(rsp->gp_flags)) &
> > + RCU_GP_FLAG_FQS) ||
> > + (!READ_ONCE(rnp->qsmask) &&
> > + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> > + rcu_exp_gp_seq_done(rsp->exp_rsp,
> > + rsp->gp_exp_snap),
> > + j);
>
> How about using a helper function there?
>
> static inline bool rcu_gp_done(rsp, rnp)
> {
> /* Forced Quiescent State complete */
> if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
> return true;
>
> /* QS not masked and not blocked by preempted readers */
> if (!READ_ONCE(rnp->qsmask) &&
> !rcu_preempt_blocked_readers_cgp(rnp))
> return true;
>
> /* Expedited Grace Period completed */
> if (rcu_exp_gp_seq_done(rsp))
> return true;
>
> return false;
> }
>
> ret = wait_event_interruptible_timeout(rsp->gp_wq,
> rcu_gp_done(rsp, rnp), j);
Fair point, and applies to the original as well, give or take comments
and naming. Please see below for the corresponding patch. Thoughts?
Thanx, Paul
------------------------------------------------------------------------
commit a8bb9abbf3690e507d61efe8144cec091ce5fb02
Author: Paul E. McKenney <[email protected]>
Date: Wed Jul 1 13:50:28 2015 -0700
rcu: Pull out wait_event*() condition into helper function
The condition for the wait_event_interruptible_timeout() that waits
to do the next force-quiescent-state scan is a bit ornate:
((gf = READ_ONCE(rsp->gp_flags)) &
RCU_GP_FLAG_FQS) ||
(!READ_ONCE(rnp->qsmask) &&
!rcu_preempt_blocked_readers_cgp(rnp))
This commit therefore pulls this condition out into a helper function
and comments its component conditions.
Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4992bfd360b6..2afb8e8c5134 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1904,6 +1904,26 @@ static int rcu_gp_init(struct rcu_state *rsp)
}
/*
+ * Helper function for wait_event_interruptible_timeout() wakeup
+ * at force-quiescent-state time.
+ */
+static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp, int *gfp)
+{
+ struct rcu_node *rnp = rcu_get_root(rsp);
+
+ /* Someone like call_rcu() requested a force-quiescent-state scan. */
+ *gfp = READ_ONCE(rsp->gp_flags);
+ if (*gfp & RCU_GP_FLAG_FQS)
+ return true;
+
+ /* The current grace period has completed. */
+ if (!READ_ONCE(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp))
+ return true;
+
+ return false;
+}
+
+/*
* Do one round of quiescent-state forcing.
*/
static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
@@ -2067,11 +2087,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
TPS("fqswait"));
rsp->gp_state = RCU_GP_WAIT_FQS;
ret = wait_event_interruptible_timeout(rsp->gp_wq,
- ((gf = READ_ONCE(rsp->gp_flags)) &
- RCU_GP_FLAG_FQS) ||
- (!READ_ONCE(rnp->qsmask) &&
- !rcu_preempt_blocked_readers_cgp(rnp)),
- j);
+ rcu_gp_fqs_check_wake(rsp, &gf), j);
rsp->gp_state = RCU_GP_DONE_FQS;
/* Locking provides needed memory barriers. */
/* If grace period done, leave loop. */
On Wed, Jul 01, 2015 at 01:09:36PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 01, 2015 at 07:02:42PM +0200, Peter Zijlstra wrote:
> > USB sure, but a backing dev is involved in nfs clients, loopback and all
> > sorts of block/filesystem like setups.
> >
> > unmount an NFS mount and voila expedited rcu, unmount a loopback, tada.
> >
> > All you need is a regular server workload triggering any of that on a
> > semi regular basis and even !rt people might start to notice something
> > is up.
>
> I don't believe that latency-sensitive systems are going to be messing
> with remapping their storage at runtime, let alone on a regular basis.
> If they are not latency sensitive, and assuming that the rate of
> storage remapping is at all sane, I bet that they won't notice the
> synchronize_rcu_expedited() overhead. The overhead of the actual
> remapping will very likely leave the synchronize_rcu_expedited() overhead
> way down in the noise.
>
> And if they are doing completely insane rates of storage remapping,
> I suspect that the batching in the synchronize_rcu_expedited()
> implementation will reduce the expedited-grace-period overhead still
> further as a fraction of the total.
Consider the case of container-based systems, calling mount as part of
container setup and umount as part of container teardown.
And those workloads are often sensitive to latency, not throughput.
- Josh Triplett
On Wed, Jul 01, 2015 at 02:20:01PM -0700, [email protected] wrote:
> On Wed, Jul 01, 2015 at 01:09:36PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 01, 2015 at 07:02:42PM +0200, Peter Zijlstra wrote:
> > > USB sure, but a backing dev is involved in nfs clients, loopback and all
> > > sorts of block/filesystem like setups.
> > >
> > > unmount an NFS mount and voila expedited rcu, unmount a loopback, tada.
> > >
> > > All you need is a regular server workload triggering any of that on a
> > > semi regular basis and even !rt people might start to notice something
> > > is up.
> >
> > I don't believe that latency-sensitive systems are going to be messing
> > with remapping their storage at runtime, let alone on a regular basis.
> > If they are not latency sensitive, and assuming that the rate of
> > storage remapping is at all sane, I bet that they won't notice the
> > synchronize_rcu_expedited() overhead. The overhead of the actual
> > remapping will very likely leave the synchronize_rcu_expedited() overhead
> > way down in the noise.
> >
> > And if they are doing completely insane rates of storage remapping,
> > I suspect that the batching in the synchronize_rcu_expedited()
> > implementation will reduce the expedited-grace-period overhead still
> > further as a fraction of the total.
>
> Consider the case of container-based systems, calling mount as part of
> container setup and umount as part of container teardown.
>
> And those workloads are often sensitive to latency, not throughput.
So people are really seeing a synchronize_rcu_expedited() on each
container setup/teardown right now? Or is this something that could
happen if they were mounting block devices rather than rebind mounts?
And when you say that these workloads are sensitive to latency, I am
guessing that you mean to the millisecond-level latencies seen from
synchronize_rcu() as opposed to the microsecond-level OS jitter from
synchronize_rcu_expedited(). Or are there really containers workloads
that care about the few microseconds of OS jitter that would be incurred
due to expedited grace periods?
Thanx, Paul
On Wed, 2015-07-01 at 09:17 -0700, Paul E. McKenney wrote:
> On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2015 at 07:00:31AM -0700, Paul E. McKenney wrote:
> >
> > > That is a bit extreme, Peter.
> >
> > Of course; but I'm really not seeing people taking due care with them
>
> ;-)
>
> > > Are a huge pile of them coming in this merge window or something?
> > > What raised your concerns on this issue?
> >
> > This is complete horse manure (breaking the nvidiot binary blob is a
> > good thing):
> >
> > 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
>
> Really???
>
> I am not concerned about this one. After all, one of the first things
> that people do for OS-jitter-sensitive workloads is to get rid of
> binary blobs.
I know two users who have no choice but to use the nvidia driver with
their realtime applications, as nouveau is not up to the task.
-Mike
On Thu, Jul 02, 2015 at 03:11:24AM +0200, Mike Galbraith wrote:
> On Wed, 2015-07-01 at 09:17 -0700, Paul E. McKenney wrote:
> > On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 01, 2015 at 07:00:31AM -0700, Paul E. McKenney wrote:
> > >
> > > > That is a bit extreme, Peter.
> > >
> > > Of course; but I'm really not seeing people taking due care with them
> >
> > ;-)
> >
> > > > Are a huge pile of them coming in this merge window or something?
> > > > What raised your concerns on this issue?
> > >
> > > This is complete horse manure (breaking the nvidiot binary blob is a
> > > good thing):
> > >
> > > 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
> >
> > Really???
> >
> > I am not concerned about this one. After all, one of the first things
> > that people do for OS-jitter-sensitive workloads is to get rid of
> > binary blobs.
>
> I know two users who have no choice but to use the nvidia driver with
> their realtime applications, as nouveau is not up to the task.
Sounds like they have a relatively loose definition of "realtime", then.
- Josh Triplett
On Wed, 2015-07-01 at 18:34 -0700, [email protected] wrote:
> On Thu, Jul 02, 2015 at 03:11:24AM +0200, Mike Galbraith wrote:
> > On Wed, 2015-07-01 at 09:17 -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Jul 01, 2015 at 07:00:31AM -0700, Paul E. McKenney wrote:
> > > >
> > > > > That is a bit extreme, Peter.
> > > >
> > > > Of course; but I'm really not seeing people taking due care with them
> > >
> > > ;-)
> > >
> > > > > Are a huge pile of them coming in this merge window or something?
> > > > > What raised your concerns on this issue?
> > > >
> > > > This is complete horse manure (breaking the nvidiot binary blob is a
> > > > good thing):
> > > >
> > > > 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
> > >
> > > Really???
> > >
> > > I am not concerned about this one. After all, one of the first things
> > > that people do for OS-jitter-sensitive workloads is to get rid of
> > > binary blobs.
> >
> > I know two users who have no choice but to use the nvidia driver with
> > their realtime applications, as nouveau is not up to the task.
>
> Sounds like they have a relatively loose definition of "realtime", then.
It would be better if they broke their beasts up into a bunch of small
synchronized boxen, but they use big boxen here and now, with realtime
rendering being a non-disposable portion of the load.
-Mike
On Thu, Jul 02, 2015 at 03:59:55AM +0200, Mike Galbraith wrote:
> On Wed, 2015-07-01 at 18:34 -0700, [email protected] wrote:
> > On Thu, Jul 02, 2015 at 03:11:24AM +0200, Mike Galbraith wrote:
> > > On Wed, 2015-07-01 at 09:17 -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
> > > > > On Wed, Jul 01, 2015 at 07:00:31AM -0700, Paul E. McKenney wrote:
> > > > >
> > > > > > That is a bit extreme, Peter.
> > > > >
> > > > > Of course; but I'm really not seeing people taking due care with them
> > > >
> > > > ;-)
> > > >
> > > > > > Are a huge pile of them coming in this merge window or something?
> > > > > > What raised your concerns on this issue?
> > > > >
> > > > > This is complete horse manure (breaking the nvidiot binary blob is a
> > > > > good thing):
> > > > >
> > > > > 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
> > > >
> > > > Really???
> > > >
> > > > I am not concerned about this one. After all, one of the first things
> > > > that people do for OS-jitter-sensitive workloads is to get rid of
> > > > binary blobs.
> > >
> > > I know two users who have no choice but to use the nvidia driver with
> > > their realtime applications, as nouveau is not up to the task.
> >
> > Sounds like they have a relatively loose definition of "realtime", then.
>
> It would be better if they broke their beasts up into a bunch of small
> synchronized boxen, but they use big boxen here and now, with realtime
> rendering being a non-disposable portion of the load.
Does their normal workload trigger the condition that results in the
expedited grace period? If so, do they use NO_HZ_FULL? (I am guessing
"no".) If not, is the resulting double-context-switch on their worker
CPUs a real problem for them? (I am guessing "no" given that they have
not complained, at least not that I know of.)
Thanx, Paul
On Wed, 2015-07-01 at 19:18 -0700, Paul E. McKenney wrote:
> Does their normal workload trigger the condition that results in the
> expedited grace period? If so, do they use NO_HZ_FULL?
They're having no trouble that I'm aware of, and I definitely would be
made aware. They're not currently using a kernel with NO_HZ_FULL even
built in, as that stuff is far far too raw in the current kernel.
They may meet a NO_HZ_FULL enabled kernel soonish, but will be told to
not turn it on for anything other than pure compute, as NO_HZ_FULL is
currently not usable for mixed load in partitioned box, much less rt.
-Mike
On Thu, Jul 02, 2015 at 04:50:45AM +0200, Mike Galbraith wrote:
> On Wed, 2015-07-01 at 19:18 -0700, Paul E. McKenney wrote:
>
> > Does their normal workload trigger the condition that results in the
> > expedited grace period? If so, do they use NO_HZ_FULL?
>
> They're having no trouble that I'm aware of, and I definitely would be
> made aware. They're not currently using a kernel with NO_HZ_FULL even
> built in, as that stuff is far far too raw in the current kernel.
>
> They may meet a NO_HZ_FULL enabled kernel soonish, but will be told to
> not turn it on for anything other than pure compute, as NO_HZ_FULL is
> currently not usable for mixed load in partitioned box, much less rt.
OK, sounds like synchronize_rcu_expedited() is a complete non-problem
for them, then.
Thanx, Paul
* Paul E. McKenney <[email protected]> wrote:
> On Wed, Jul 01, 2015 at 07:02:42PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2015 at 09:17:05AM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
> >
> > > > 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
> > >
> > > Really???
> > >
> > > I am not concerned about this one. After all, one of the first things that
> > > people do for OS-jitter-sensitive workloads is to get rid of binary blobs.
> > > And any runtime use of ACPI as well. And let's face it, if your
> > > latency-sensitive workload is using either binary blobs or ACPI, you have
> > > already completely lost. Therefore, an additional expedited grace period
> > > cannot possibly cause you to lose any more.
> >
> > This isn't solely about rt etc.. this call is a generic facility used by
> > however many consumers. A normal workstation/server could run into it at
> > relatively high frequency depending on its workload.
> >
> > Even on not latency sensitive workloads I think hammering all active CPUs is
> > bad behaviour. Remember that a typical server class machine easily has more
> > than 32 CPUs these days.
>
> Well, that certainly is one reason for the funnel locking, sequence counters,
> etc., keeping the overhead bounded despite large numbers of CPUs. So I don't
> believe that a non-RT/non-HPC workload is going to notice.
So I think Peter's concern is that we should not be offering/promoting APIs that
are easy to add, hard to remove/convert - especially if we _know_ they eventually
have to be converted. That model does not scale, it piles up increasing amounts of
crud.
Also, there will be a threshold over which it will be increasingly harder to make
hard-rt promises, because so much seemingly mundane functionality will be using
these APIs. The big plus of -rt is that it's out of the box hard RT - if people
are able to control their environment carefully they can use RTAI or so. I.e. it
directly cuts into the usability of Linux in certain segments.
Death by a thousand cuts and such.
And it's not like it's that hard to stem the flow of algorithmic sloppiness at the
source, right?
Thanks,
Ingo
On Wed, Jul 01, 2015 at 07:03:13AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 01, 2015 at 03:48:06PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2015 at 06:41:44AM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 01, 2015 at 12:05:21PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> > > > > rsp->gp_state = RCU_GP_WAIT_FQS;
> > > > > ret = wait_event_interruptible_timeout(rsp->gp_wq,
> > > > > + ((gf = READ_ONCE(rsp->gp_flags)) &
> > > > > + RCU_GP_FLAG_FQS) ||
> > > > > + (!READ_ONCE(rnp->qsmask) &&
> > > > > + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> > > > > + rcu_exp_gp_seq_done(rsp->exp_rsp,
> > > > > + rsp->gp_exp_snap),
> > > > > + j);
> > > > > rsp->gp_state = RCU_GP_DONE_FQS;
> > > >
> > > > How can the GP be done if we timed out or got interrupted?
> > >
> > > If all the CPUs still blocking the grace period went idle, or in a
> > > NO_HZ_FULL kernel, entered nohz_full userspace execution. Or, if
> > > certain low-probability races happen, went offline.
> >
> > But what if none of those are true and we still timed out? You
> > unconditionally grant the GP.
>
> Say what???
>
> I recheck the conditions and break out of the loop only if one or more
> of the grace-period-end conditions is satisfied. If not, I invoke
> rcu_gp_fqs() to do the scan to see if all remaining CPUs are idle,
> in nohz_full userspace execution, or offline.
>
> What am I missing here?
The whole wait_event_interruptible_timeout() thing can end without @cond
being true, after which you unconditionally set ->gp_state =
RCU_GP_DONE_FQS.
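For context on Peter's point: per its kerneldoc, wait_event_interruptible_timeout()
can indeed return without its condition being true. A brief annotated excerpt of
the call in question, using the rcu_gp_fqs_check_wake() helper from the patch
earlier in this thread:

	/*
	 * Return values of wait_event_interruptible_timeout():
	 *   ret > 0  : condition became true; ret is the remaining jiffies
	 *              (or 1 if it became true just as the timeout elapsed)
	 *   ret == 0 : timeout elapsed with the condition still false
	 *   ret < 0  : interrupted by a signal (-ERESTARTSYS)
	 *
	 * So the code after the wait must re-check grace-period state before
	 * concluding that anything completed, which is what the subsequent
	 * checks in the FQS loop do.
	 */
	ret = wait_event_interruptible_timeout(rsp->gp_wq,
					       rcu_gp_fqs_check_wake(rsp, &gf), j);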
On Thu, Jul 02, 2015 at 09:47:19AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > On Wed, Jul 01, 2015 at 07:02:42PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 01, 2015 at 09:17:05AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 01, 2015 at 04:17:10PM +0200, Peter Zijlstra wrote:
> > >
> > > > > 74b51ee152b6 ("ACPI / osl: speedup grace period in acpi_os_map_cleanup")
> > > >
> > > > Really???
> > > >
> > > > I am not concerned about this one. After all, one of the first things that
> > > > people do for OS-jitter-sensitive workloads is to get rid of binary blobs.
> > > > And any runtime use of ACPI as well. And let's face it, if your
> > > > latency-sensitive workload is using either binary blobs or ACPI, you have
> > > > already completely lost. Therefore, an additional expedited grace period
> > > > cannot possibly cause you to lose any more.
> > >
> > > This isn't solely about rt etc.. this call is a generic facility used by
> > > however many consumers. A normal workstation/server could run into it at
> > > relatively high frequency depending on its workload.
> > >
> > > Even on not latency sensitive workloads I think hammering all active CPUs is
> > > bad behaviour. Remember that a typical server class machine easily has more
> > > than 32 CPUs these days.
> >
> > Well, that certainly is one reason for the funnel locking, sequence counters,
> > etc., keeping the overhead bounded despite large numbers of CPUs. So I don't
> > believe that a non-RT/non-HPC workload is going to notice.
>
> So I think Peter's concern is that we should not be offering/promoting APIs that
> are easy to add, hard to remove/convert - especially if we _know_ they eventually
> have to be converted. That model does not scale, it piles up increasing amounts of
> crud.
>
> Also, there will be a threshold over which it will be increasingly harder to make
> hard-rt promises, because so much seemingly mundane functionality will be using
> these APIs. The big plus of -rt is that it's out of the box hard RT - if people
> are able to control their environment carefully they can use RTAI or so. I.e. it
> directly cuts into the usability of Linux in certain segments.
>
> Death by a thousand cuts and such.
>
> And it's not like it's that hard to stem the flow of algorithmic sloppiness at the
> source, right?
OK, first let me make sure that I understand what you are asking for:
1. Completely eliminate synchronize_rcu_expedited() and
synchronize_sched_expedited(), replacing all uses with their
unexpedited counterparts. (Note that synchronize_srcu_expedited()
does not wake up CPUs, courtesy of its read-side memory barriers.)
The fast-boot guys are probably going to complain, along with
the networking guys.
2. Keep synchronize_rcu_expedited() and synchronize_sched_expedited(),
but push back hard on any new uses and question any existing uses.
3. Revert 74b51ee152b6 ("ACPI / osl: speedup grace period in
acpi_os_map_cleanup").
4. Something else?
Thanx, Paul
On Thu, Jul 02, 2015 at 02:03:29PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 07:03:13AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 01, 2015 at 03:48:06PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 01, 2015 at 06:41:44AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 01, 2015 at 12:05:21PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Jun 30, 2015 at 02:48:27PM -0700, Paul E. McKenney wrote:
> > > > > > rsp->gp_state = RCU_GP_WAIT_FQS;
> > > > > > ret = wait_event_interruptible_timeout(rsp->gp_wq,
> > > > > > + ((gf = READ_ONCE(rsp->gp_flags)) &
> > > > > > + RCU_GP_FLAG_FQS) ||
> > > > > > + (!READ_ONCE(rnp->qsmask) &&
> > > > > > + !rcu_preempt_blocked_readers_cgp(rnp)) ||
> > > > > > + rcu_exp_gp_seq_done(rsp->exp_rsp,
> > > > > > + rsp->gp_exp_snap),
> > > > > > + j);
> > > > > > rsp->gp_state = RCU_GP_DONE_FQS;
> > > > >
> > > > > How can the GP be done if we timed out or got interrupted?
> > > >
> > > > If all the CPUs still blocking the grace period went idle, or in a
> > > > NO_HZ_FULL kernel, entered nohz_full userspace execution. Or, if
> > > > certain low-probability races happen, went offline.
> > >
> > > But what if none of those are true and we still timed out? You
> > > unconditionally grant the GP.
> >
> > Say what???
> >
> > I recheck the conditions and break out of the loop only if one or more
> > of the grace-period-end conditions is satisfied. If not, I invoke
> > rcu_gp_fqs() to do the scan to see if all remaining CPUs are idle,
> > in nohz_full userspace execution, or offline.
> >
> > What am I missing here?
>
> The whole wait_event_interruptible_timeout() thing can end without @cond
> being true, after which you unconditionally set ->gp_state =
> RCU_GP_DONE_FQS.
Ah, but then if the grace period is not done, even after a scan forcing
quiescent states, we go back through the loop and set it back to
RCU_GP_WAIT_FQS.
Or is your point that RCU_GP_DONE_FQS is a bad name? Perhaps I should
change it to something like RCU_GP_DOING_FQS. Or am I still missing
something here?
Thanx, Paul
On Thu, Jul 02, 2015 at 07:06:17AM -0700, Paul E. McKenney wrote:
> Or is your point that RCU_GP_DONE_FQS is a bad name? Perhaps I should
> change it to something like RCU_GP_DOING_FQS. Or am I still missing
> something here?
Yes, that might have been the root of my confusion.
* Paul E. McKenney <[email protected]> wrote:
> > And it's not like it's that hard to stem the flow of algorithmic sloppiness at
> > the source, right?
>
> OK, first let me make sure that I understand what you are asking for:
>
> 1. Completely eliminate synchronize_rcu_expedited() and
> synchronize_sched_expedited(), replacing all uses with their
> unexpedited counterparts. (Note that synchronize_srcu_expedited()
> does not wake up CPUs, courtesy of its read-side memory barriers.)
> The fast-boot guys are probably going to complain, along with
> the networking guys.
>
> 2. Keep synchronize_rcu_expedited() and synchronize_sched_expedited(),
> but push back hard on any new uses and question any existing uses.
>
> 3. Revert 74b51ee152b6 ("ACPI / osl: speedup grace period in
> acpi_os_map_cleanup").
>
> 4. Something else?
I'd love to have 1) but 2) would be a realistic second best option? ;-)
Thanks,
Ingo
----- On Jul 2, 2015, at 2:35 PM, Ingo Molnar [email protected] wrote:
> * Paul E. McKenney <[email protected]> wrote:
>
>> > And it's not like it's that hard to stem the flow of algorithmic sloppiness at
>> > the source, right?
>>
>> OK, first let me make sure that I understand what you are asking for:
>>
>> 1. Completely eliminate synchronize_rcu_expedited() and
>> synchronize_sched_expedited(), replacing all uses with their
>> unexpedited counterparts. (Note that synchronize_srcu_expedited()
>> does not wake up CPUs, courtesy of its read-side memory barriers.)
>> The fast-boot guys are probably going to complain, along with
>> the networking guys.
>>
>> 2. Keep synchronize_rcu_expedited() and synchronize_sched_expedited(),
>> but push back hard on any new uses and question any existing uses.
>>
>> 3. Revert 74b51ee152b6 ("ACPI / osl: speedup grace period in
>> acpi_os_map_cleanup").
>>
>> 4. Something else?
>
> I'd love to have 1) but 2) would be a realistic second best option? ;-)
Perhaps triggering a printk warning if use of
synchronize_{rcu,sched}_expedited() goes beyond a certain rate might be
another option? If we detect that a caller calls it too often, we could
emit a printk warning with a stack trace. This should ensure everyone
is very careful about where they use it.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
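As a rough sketch of the kind of rate-limited warning Mathieu suggests, the
kernel's existing ratelimit machinery could be used along the following lines.
The threshold, the function name, and where it would be called from are
assumptions for illustration, not anything proposed in this thread:

	#include <linux/ratelimit.h>

	/* Allow ~20 expedited requests per 10 seconds before complaining. */
	static DEFINE_RATELIMIT_STATE(exp_gp_rs, 10 * HZ, 20);

	static void check_expedited_rate(void)
	{
		/*
		 * __ratelimit() returns nonzero while still under the budget,
		 * so warn (once, with a stack trace) only after callers have
		 * exceeded it.
		 */
		if (!__ratelimit(&exp_gp_rs))
			WARN_ONCE(1, "high rate of expedited RCU grace periods\n");
	}

As Paul notes later in the thread, a storm of such warnings could obscure the
splats identifying the real problem, which is one argument for the WARN_ONCE()
form here, or for the checkpatch-time warning instead.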
On Thu, Jul 02, 2015 at 08:35:57PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > And it's not like it's that hard to stem the flow of algorithmic sloppiness at
> > > the source, right?
> >
> > OK, first let me make sure that I understand what you are asking for:
> >
> > 1. Completely eliminate synchronize_rcu_expedited() and
> > synchronize_sched_expedited(), replacing all uses with their
> > unexpedited counterparts. (Note that synchronize_srcu_expedited()
> > does not wake up CPUs, courtesy of its read-side memory barriers.)
> > The fast-boot guys are probably going to complain, along with
> > the networking guys.
> >
> > 2. Keep synchronize_rcu_expedited() and synchronize_sched_expedited(),
> > but push back hard on any new uses and question any existing uses.
> >
> > 3. Revert 74b51ee152b6 ("ACPI / osl: speedup grace period in
> > acpi_os_map_cleanup").
> >
> > 4. Something else?
>
> I'd love to have 1) but 2) would be a realistic second best option? ;-)
OK, how about the following checkpatch.pl patch?
And here are some other actions I have taken and will take to improve
the situation, both for OS jitter and for scalability:
o Reduce OS jitter by switching from try_stop_cpus() to
stop_one_cpu_nowait(), courtesy of Peter Zijlstra.
I expect to send this in v4.3 or v4.4, depending on how
testing and review goes.
o Eliminate expedited-grace-period-induced OS jitter on idle CPUs.
This went into v3.19. Note that this also avoids IPIing
nohz_full CPUs.
o I believe that I can reduce OS jitter by a further factor of two
(worst case) or factor of five (average case), but I am still
thinking about exactly how to do this. (This also has the
benefit of shutting up a lockdep false positive.)
o There is a global counter that synchronize_sched_expedited()
uses to determine when all the CPUs have passed through a
quiescent state. This is a scalability bottleneck on modest
systems under heavy load, so I will be switching this to
instead use the combining tree.
o Because both synchronize_sched_expedited() and
synchronize_rcu_expedited() can potentially wake up each and
every CPU, on sufficiently large systems, they are quite slow.
If this scalability problem ever becomes real, I intend to use
multiple kthreads to do the wakeups on large systems.
Seem reasonable?
Thanx, Paul
------------------------------------------------------------------------
scripts: Make checkpatch.pl warn on expedited RCU grace periods
The synchronize_rcu_expedited() and synchronize_sched_expedited()
expedited-grace-period primitives induce OS jitter, which can degrade
real-time response. This commit therefore adds a checkpatch.pl warning
on any patch adding them.
Note that this patch does not warn on synchronize_srcu_expedited()
because it does not induce OS jitter, courtesy of its otherwise
much-maligned read-side memory barriers.
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Joe Perches <[email protected]>
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 89b1df4e72ab..ddd82d743bba 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -4898,6 +4898,12 @@ sub process {
"memory barrier without comment\n" . $herecurr);
}
}
+# check for expedited grace periods that interrupt CPUs.
+# note that synchronize_srcu_expedited() does -not- do this, so no complaints.
+ if ($line =~ /\b(synchronize_rcu_expedited|synchronize_sched_expedited)\(/) {
+ WARN("EXPEDITED_RCU_GRACE_PERIOD",
+ "expedited RCU grace periods should be avoided\n" . $herecurr);
+ }
# check of hardware specific defines
if ($line =~ m@^.\s*\#\s*if.*\b(__i386__|__powerpc64__|__sun__|__s390x__)\b@ && $realfile !~ m@include/asm-@) {
CHK("ARCH_DEFINES",
On Thu, Jul 02, 2015 at 06:47:47PM +0000, Mathieu Desnoyers wrote:
> ----- On Jul 2, 2015, at 2:35 PM, Ingo Molnar [email protected] wrote:
>
> > * Paul E. McKenney <[email protected]> wrote:
> >
> >> > And it's not like it's that hard to stem the flow of algorithmic sloppiness at
> >> > the source, right?
> >>
> >> OK, first let me make sure that I understand what you are asking for:
> >>
> >> 1. Completely eliminate synchronize_rcu_expedited() and
> >> synchronize_sched_expedited(), replacing all uses with their
> >> unexpedited counterparts. (Note that synchronize_srcu_expedited()
> >> does not wake up CPUs, courtesy of its read-side memory barriers.)
> >> The fast-boot guys are probably going to complain, along with
> >> the networking guys.
> >>
> >> 2. Keep synchronize_rcu_expedited() and synchronize_sched_expedited(),
> >> but push back hard on any new uses and question any existing uses.
> >>
> >> 3. Revert 74b51ee152b6 ("ACPI / osl: speedup grace period in
> >> acpi_os_map_cleanup").
> >>
> >> 4. Something else?
> >
> > I'd love to have 1) but 2) would be a realistic second best option? ;-)
>
> Perhaps triggering a printk warning if use of
> synchronize_{rcu,sched}_expedited() goes beyond a certain rate might be
> another option? If we detect that a caller calls it too often, we could
> emit a printk warning with a stack trace. This should ensure everyone
> is very careful about where they use it.
My first thought is that a storm of expedited grace periods would be
most likely to show up in some error condition, and having them
splat might obscure the splats identifying the real problem. Or did
you have something else in mind here?
Thanx, Paul
On Thu, Jul 02, 2015 at 06:48:16PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2015 at 07:06:17AM -0700, Paul E. McKenney wrote:
> > Or is your point that RCU_GP_DONE_FQS is a bad name? Perhaps I should
> > change it to something like RCU_GP_DOING_FQS. Or am I still missing
> > something here?
>
> Yes, that might have been the root of my confusion.
OK, how about this, then?
Thanx, Paul
------------------------------------------------------------------------
rcu: Rename RCU_GP_DONE_FQS to RCU_GP_DOING_FQS
The grace-period kthread sleeps waiting to do a force-quiescent-state
scan, and when awakened sets rsp->gp_state to RCU_GP_DONE_FQS.
However, this is confusing because the kthread has not yet done the
force-quiescent-state scan, but is instead just starting it. This commit
therefore renames RCU_GP_DONE_FQS to RCU_GP_DOING_FQS in order to make
things a bit easier on reviewers.
Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2afb8e8c5134..e1d9909fd59d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2088,7 +2088,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
rsp->gp_state = RCU_GP_WAIT_FQS;
ret = wait_event_interruptible_timeout(rsp->gp_wq,
rcu_gp_fqs_check_wake(rsp, &gf), j);
- rsp->gp_state = RCU_GP_DONE_FQS;
+ rsp->gp_state = RCU_GP_DOING_FQS;
/* Locking provides needed memory barriers. */
/* If grace period done, leave loop. */
if (!READ_ONCE(rnp->qsmask) &&
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index cb402b5c2e71..852b810df38e 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -533,7 +533,7 @@ struct rcu_state {
#define RCU_GP_WAIT_GPS 1 /* Wait for grace-period start. */
#define RCU_GP_DONE_GPS 2 /* Wait done for grace-period start. */
#define RCU_GP_WAIT_FQS 3 /* Wait for force-quiescent-state time. */
-#define RCU_GP_DONE_FQS 4 /* Wait done for force-quiescent-state time. */
+#define RCU_GP_DOING_FQS 4 /* Wait done for force-quiescent-state time. */
#define RCU_GP_CLEANUP 5 /* Grace-period cleanup started. */
#define RCU_GP_CLEANED 6 /* Grace-period cleanup complete. */
----- On Jul 2, 2015, at 3:23 PM, Paul E. McKenney [email protected] wrote:
> On Thu, Jul 02, 2015 at 06:47:47PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jul 2, 2015, at 2:35 PM, Ingo Molnar [email protected] wrote:
>>
>> > * Paul E. McKenney <[email protected]> wrote:
>> >
>> >> > And it's not like it's that hard to stem the flow of algorithmic sloppiness at
>> >> > the source, right?
>> >>
>> >> OK, first let me make sure that I understand what you are asking for:
>> >>
>> >> 1. Completely eliminate synchronize_rcu_expedited() and
>> >> synchronize_sched_expedited(), replacing all uses with their
>> >> unexpedited counterparts. (Note that synchronize_srcu_expedited()
>> >> does not wake up CPUs, courtesy of its read-side memory barriers.)
>> >> The fast-boot guys are probably going to complain, along with
>> >> the networking guys.
>> >>
>> >> 2. Keep synchronize_rcu_expedited() and synchronize_sched_expedited(),
>> >> but push back hard on any new uses and question any existing uses.
>> >>
>> >> 3. Revert 74b51ee152b6 ("ACPI / osl: speedup grace period in
>> >> acpi_os_map_cleanup").
>> >>
>> >> 4. Something else?
>> >
>> > I'd love to have 1) but 2) would be a realistic second best option? ;-)
>>
>> Perhaps triggering a printk warning if use of
>> synchronize_{rcu,sched}_expedited() goes beyond a certain rate might be
>> another option? If we detect that a caller calls it too often, we could
>> emit a printk warning with a stack trace. This should ensure everyone
>> is very careful about where they use it.
>
> My first thought is that a storm of expedited grace periods would be
> most likely to show up in some error condition, and having them
> splat might obscure the splats identifying the real problem. Or did
> you have something else in mind here?
Fair point! So I guess your checkpatch approach is more appropriate.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Jul 02, 2015 at 12:35:45PM -0700, Paul E. McKenney wrote:
> OK, how about this, then?
> rcu: Rename RCU_GP_DONE_FQS to RCU_GP_DOING_FQS
>
> The grace-period kthread sleeps waiting to do a force-quiescent-state
> scan, and when awakened sets rsp->gp_state to RCU_GP_DONE_FQS.
> However, this is confusing because the kthread has not yet done the
> force-quiescent-state scan, but is instead just starting it. This commit
> therefore renames RCU_GP_DONE_FQS to RCU_GP_DOING_FQS in order to make
> things a bit easier on reviewers.
>
> Reported-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
Much better, thanks! :-)