Hello!
This series, possibly for v3.21, contains changes that allow in-kernel
code to specify that all subsequent synchronous grace-period primitives
(synchronize_rcu() and friends) be expedited. New rcu_expedite_gp()
and rcu_unexpedite_gp() primitives enable and disable expediting,
and these may be nested. Note that the rcu_expedited boot/sysfs
variable, if non-zero, causes expediting to happen regardless of calls
to rcu_expedite_gp().
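As a usage sketch (illustrative only, not code from the series; the
function name below is made up), a caller issuing a burst of synchronous
grace periods could bracket them like this:

void example_flush_handlers(void)
{
	rcu_expedite_gp();	/* Nestable: pairs with rcu_unexpedite_gp(). */
	/* Synchronous grace periods in this window run expedited. */
	synchronize_rcu();
	rcu_unexpedite_gp();	/* Normal behavior resumes once all nested
				 * calls are undone, unless rcu_expedited
				 * is set. */
}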
Because one of the use cases for these primitives is to expedite
grace periods during the in-kernel portion of boot, a new Kconfig
parameter named CONFIG_RCU_EXPEDITE_BOOT causes the kernel to act
as if rcu_expedite_gp() had been called very early in boot. At the end
of boot (presumably just before init is spawned), a call to
rcu_end_inkernel_boot() will provide the matching rcu_unexpedite_gp()
if required.
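The series leaves the rcu_end_inkernel_boot() call site to a later patch;
a hypothetical placement in init/main.c (condensed, and an assumption
rather than part of this series) might look like:

static int __ref kernel_init(void *unused)
{
	kernel_init_freeable();		/* In-kernel portion of boot. */
	rcu_end_inkernel_boot();	/* Undo CONFIG_RCU_EXPEDITE_BOOT's
					 * boot-time rcu_expedite_gp(). */
	if (!run_init_process("/sbin/init"))
		return 0;
	panic("No working init found.");
}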
The patches in this series are as follows:
1. Add rcu_expedite_gp() and rcu_unexpedite_gp() functions.
2. Add rcutorture tests for rcu_expedite_gp() and rcu_unexpedite_gp().
3. Change open-coded access to the rcu_expedited variable to
instead use a new rcu_gp_is_expedited() function.
4. Add the CONFIG_RCU_EXPEDITE_BOOT Kconfig parameter and
the rcu_end_inkernel_boot() function.
This passes light rcutorture testing.
Thanx, Paul
------------------------------------------------------------------------
b/include/linux/rcupdate.h | 21 ++++++++++++++++
b/init/Kconfig | 13 +++++++++
b/kernel/rcu/rcutorture.c | 24 ++++++++++++++++++
b/kernel/rcu/srcu.c | 2 +-
b/kernel/rcu/tree.c | 9 +++---
b/kernel/rcu/tree_plugin.h | 2 +-
b/kernel/rcu/update.c | 59 ++++++++++++++++++++++++++++++++++++++++++++-
7 files changed, 123 insertions(+), 7 deletions(-)
From: "Paul E. McKenney" <[email protected]>
Currently, expediting of normal synchronous grace-period primitives
(synchronize_rcu() and friends) is controlled by the rcu_expedited
boot/sysfs parameter. This works well, but does not handle nesting.
This commit therefore provides rcu_expedite_gp() to enable expediting
and rcu_unexpedite_gp() to cancel a prior rcu_expedite_gp(), both of
which support nesting.
Reported-by: Arjan van de Ven <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 20 ++++++++++++++++++++
kernel/rcu/update.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 68 insertions(+)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 9d3d0e1d0766..0189ac088c67 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -48,6 +48,26 @@
extern int rcu_expedited; /* for sysctl */
+#ifdef CONFIG_TINY_RCU
+/* Tiny RCU doesn't expedite, as its purpose in life is instead to be tiny. */
+static inline bool rcu_gp_is_expedited(void) /* Internal RCU use. */
+{
+ return false;
+}
+
+static inline void rcu_expedite_gp(void)
+{
+}
+
+static inline void rcu_unexpedite_gp(void)
+{
+}
+#else /* #ifdef CONFIG_TINY_RCU */
+bool rcu_gp_is_expedited(void); /* Internal RCU use. */
+void rcu_expedite_gp(void);
+void rcu_unexpedite_gp(void);
+#endif /* #else #ifdef CONFIG_TINY_RCU */
+
enum rcutorture_type {
RCU_FLAVOR,
RCU_BH_FLAVOR,
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 8864ed90f0d7..44a74c24936a 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -62,6 +62,54 @@ MODULE_ALIAS("rcupdate");
module_param(rcu_expedited, int, 0);
+#ifndef CONFIG_TINY_RCU
+
+static atomic_t rcu_expedited_nesting;
+
+/*
+ * Should normal grace-period primitives be expedited? Intended for
+ * use within RCU. Note that this function takes the rcu_expedited
+ * sysfs/boot variable into account as well as the rcu_expedite_gp()
+ * nesting. So looping on rcu_unexpedite_gp() until rcu_gp_is_expedited()
+ * returns false is a -really- bad idea.
+ */
+bool rcu_gp_is_expedited(void)
+{
+ return rcu_expedited || atomic_read(&rcu_expedited_nesting);
+}
+EXPORT_SYMBOL_GPL(rcu_gp_is_expedited);
+
+/**
+ * rcu_expedite_gp - Expedite future RCU grace periods
+ *
+ * After a call to this function, future calls to synchronize_rcu() and
+ * friends act as if the corresponding synchronize_rcu_expedited()
+ * function had instead been called.
+ */
+void rcu_expedite_gp(void)
+{
+ atomic_inc(&rcu_expedited_nesting);
+}
+EXPORT_SYMBOL_GPL(rcu_expedite_gp);
+
+/**
+ * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation
+ *
+ * Undo a prior call to rcu_expedite_gp(). If all prior calls to
+ * rcu_expedite_gp() are undone by subsequent calls to rcu_unexpedite_gp(),
+ * and if the rcu_expedited sysfs/boot parameter is not set, then all
+ * subsequent calls to synchronize_rcu() and friends will return to
+ * their normal non-expedited behavior.
+ */
+void rcu_unexpedite_gp(void)
+{
+ atomic_dec(&rcu_expedited_nesting);
+}
+EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
+
+#endif /* #ifndef CONFIG_TINY_RCU */
+
+
#ifdef CONFIG_PREEMPT_RCU
/*
--
1.8.1.5
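To make the looping caveat in rcu_gp_is_expedited()'s header comment
concrete: if the rcu_expedited boot/sysfs variable is set, the function
never returns false, so a cleanup loop like the following sketch
(illustrative, not code from the series) would spin forever while driving
rcu_expedited_nesting negative:

	/* BROKEN: never terminates when rcu_expedited is set, and
	 * underflows rcu_expedited_nesting while spinning. */
	while (rcu_gp_is_expedited())
		rcu_unexpedite_gp();

The correct pattern is simply to balance each rcu_expedite_gp() with
exactly one rcu_unexpedite_gp().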
From: "Paul E. McKenney" <[email protected]>
This commit adds rcutorture testing for the new rcu_expedite_gp() and
rcu_unexpedite_gp() functions, with rcu_torture_writer() cycling through
the expediting nesting levels whenever dynamic expediting is possible.
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcutorture.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 30d42aa55d83..3a60815ce97c 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -853,6 +853,8 @@ rcu_torture_fqs(void *arg)
static int
rcu_torture_writer(void *arg)
{
+ bool can_expedite = !rcu_gp_is_expedited();
+ int expediting = 0;
unsigned long gp_snap;
bool gp_cond1 = gp_cond, gp_exp1 = gp_exp, gp_normal1 = gp_normal;
bool gp_sync1 = gp_sync;
@@ -865,6 +867,12 @@ rcu_torture_writer(void *arg)
int nsynctypes = 0;
VERBOSE_TOROUT_STRING("rcu_torture_writer task started");
+ pr_alert("%s" TORTURE_FLAG
+ " Grace periods expedited from boot/sysfs for %s,\n",
+ torture_type, cur_ops->name);
+ pr_alert("%s" TORTURE_FLAG
+ " Testing of dynamic grace-period expediting disabled.\n",
+ torture_type);
/* Initialize synctype[] array. If none set, take default. */
if (!gp_cond1 && !gp_exp1 && !gp_normal1 && !gp_sync)
@@ -949,9 +957,25 @@ rcu_torture_writer(void *arg)
}
}
rcutorture_record_progress(++rcu_torture_current_version);
+ /* Cycle through nesting levels of rcu_expedite_gp() calls. */
+ if (can_expedite && torture_random(&rand)) {
+ WARN_ON_ONCE(expediting == 0 && rcu_gp_is_expedited());
+ if (expediting >= 0)
+ rcu_expedite_gp();
+ else
+ rcu_unexpedite_gp();
+ if (++expediting > 3)
+ expediting = -expediting;
+ }
rcu_torture_writer_state = RTWS_STUTTER;
stutter_wait("rcu_torture_writer");
} while (!torture_must_stop());
+ /* Reset expediting back to unexpedited. */
+ if (expediting > 0)
+ expediting = -expediting;
+ while (can_expedite && expediting++ < 0)
+ rcu_unexpedite_gp();
+ WARN_ON_ONCE(can_expedite && rcu_gp_is_expedited());
rcu_torture_writer_state = RTWS_STOPPING;
torture_kthread_stopping("rcu_torture_writer");
return 0;
--
1.8.1.5
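The expediting counter above encodes both depth and direction:
non-negative means keep nesting deeper, negative means unwind. A
stand-alone user-space sketch of the same state machine (not part of the
patch) shows the nesting depth cycling 1, 2, 3, 4 and back down to 0:

#include <stdio.h>

static int nesting;			/* Stands in for rcu_expedited_nesting. */

static void rcu_expedite_gp(void)	{ nesting++; }
static void rcu_unexpedite_gp(void)	{ nesting--; }

int main(void)
{
	int expediting = 0;
	int iter;

	for (iter = 0; iter < 16; iter++) {
		if (expediting >= 0)
			rcu_expedite_gp();
		else
			rcu_unexpedite_gp();
		if (++expediting > 3)
			expediting = -expediting;
		printf("iter %2d: expediting = %2d, nesting = %d\n",
		       iter, expediting, nesting);
	}
	return 0;
}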
From: "Paul E. McKenney" <[email protected]>
This commit updates open-coded tests of the rcu_expedited variable
to instead use rcu_gp_is_expedited().
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/srcu.c | 2 +-
kernel/rcu/tree.c | 9 +++++----
kernel/rcu/tree_plugin.h | 2 +-
3 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 81f53b504c18..cad76e76b4e7 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -490,7 +490,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
*/
void synchronize_srcu(struct srcu_struct *sp)
{
- __synchronize_srcu(sp, rcu_expedited
+ __synchronize_srcu(sp, rcu_gp_is_expedited()
? SYNCHRONIZE_SRCU_EXP_TRYCOUNT
: SYNCHRONIZE_SRCU_TRYCOUNT);
}
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 715e7bb35852..5fb9888af6d5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3064,7 +3064,7 @@ void synchronize_sched(void)
"Illegal synchronize_sched() in RCU-sched read-side critical section");
if (rcu_blocking_is_gp())
return;
- if (rcu_expedited)
+ if (rcu_gp_is_expedited())
synchronize_sched_expedited();
else
wait_rcu_gp(call_rcu_sched);
@@ -3091,7 +3091,7 @@ void synchronize_rcu_bh(void)
"Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
if (rcu_blocking_is_gp())
return;
- if (rcu_expedited)
+ if (rcu_gp_is_expedited())
synchronize_rcu_bh_expedited();
else
wait_rcu_gp(call_rcu_bh);
@@ -3783,11 +3783,12 @@ static int rcu_pm_notify(struct notifier_block *self,
case PM_HIBERNATION_PREPARE:
case PM_SUSPEND_PREPARE:
if (nr_cpu_ids <= 256) /* Expediting bad for large systems. */
- rcu_expedited = 1;
+ rcu_expedite_gp();
break;
case PM_POST_HIBERNATION:
case PM_POST_SUSPEND:
- rcu_expedited = 0;
+ if (nr_cpu_ids <= 256) /* Expediting bad for large systems. */
+ rcu_unexpedite_gp();
break;
default:
break;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 65629cc3c6cf..d11612ff3daa 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -572,7 +572,7 @@ void synchronize_rcu(void)
"Illegal synchronize_rcu() in RCU read-side critical section");
if (!rcu_scheduler_active)
return;
- if (rcu_expedited)
+ if (rcu_gp_is_expedited())
synchronize_rcu_expedited();
else
wait_rcu_gp(call_rcu);
--
1.8.1.5
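One subtlety in the tree.c hunk above: the old code cleared rcu_expedited
unconditionally on resume, but with the nesting counter, the PM_POST_*
path must repeat the nr_cpu_ids test so that rcu_unexpedite_gp() runs
only when the PREPARE path actually called rcu_expedite_gp(); otherwise
resuming on a large system would underflow the count. Condensed from
the hunk:

	case PM_HIBERNATION_PREPARE:
	case PM_SUSPEND_PREPARE:
		if (nr_cpu_ids <= 256)	/* Expediting bad for large systems. */
			rcu_expedite_gp();
		break;
	case PM_POST_HIBERNATION:
	case PM_POST_SUSPEND:
		if (nr_cpu_ids <= 256)	/* Must mirror the PREPARE-side test. */
			rcu_unexpedite_gp();
		break;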
From: "Paul E. McKenney" <[email protected]>
This commit adds a CONFIG_RCU_EXPEDITE_BOOT Kconfig parameter
that emulates a call to rcu_expedite_gp() very early in boot. A late-boot
call to rcu_end_inkernel_boot() will provide the corresponding
rcu_unexpedite_gp(). The late-boot call to rcu_end_inkernel_boot()
should be made just before init is spawned.
Reported-by: Arjan van de Ven <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 1 +
init/Kconfig | 13 +++++++++++++
kernel/rcu/update.c | 11 ++++++++++-
3 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0189ac088c67..573a5afd5ed8 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -287,6 +287,7 @@ static inline int rcu_preempt_depth(void)
/* Internal to kernel */
void rcu_init(void);
+void rcu_end_inkernel_boot(void);
void rcu_sched_qs(void);
void rcu_bh_qs(void);
void rcu_check_callbacks(int user);
diff --git a/init/Kconfig b/init/Kconfig
index 1354ac09b516..ece7e9380867 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -791,6 +791,19 @@ config RCU_NOCB_CPU_ALL
endchoice
+config RCU_EXPEDITE_BOOT
+ bool
+ default n
+ help
+ This option enables expedited grace periods at boot time,
+ as if rcu_expedite_gp() had been invoked early in boot.
+ The corresponding rcu_unexpedite_gp() is invoked from
+ rcu_end_inkernel_boot(), which is intended to be invoked
+ at the end of the kernel-only boot sequence, just before
+ init is exec'ed.
+
+ Accept the default if unsure.
+
endmenu # "RCU Subsystem"
config BUILD_BIN2C
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 44a74c24936a..1f133350da01 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -64,7 +64,8 @@ module_param(rcu_expedited, int, 0);
#ifndef CONFIG_TINY_RCU
-static atomic_t rcu_expedited_nesting;
+static atomic_t rcu_expedited_nesting =
+ ATOMIC_INIT(IS_ENABLED(CONFIG_RCU_EXPEDITE_BOOT) ? 1 : 0);
/*
* Should normal grace-period primitives be expedited? Intended for
@@ -109,6 +110,14 @@ EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
#endif /* #ifndef CONFIG_TINY_RCU */
+/*
+ * Inform RCU of the end of the in-kernel boot sequence.
+ */
+void rcu_end_inkernel_boot(void)
+{
+ if (IS_ENABLED(CONFIG_RCU_EXPEDITE_BOOT))
+ rcu_unexpedite_gp();
+}
#ifdef CONFIG_PREEMPT_RCU
--
1.8.1.5
On Thu, Feb 19, 2015 at 09:08:50PM -0800, Paul E. McKenney wrote:
> Hello!
>
> This series, possibly for v3.21, contains changes that allow in-kernel
> code to specify that all subsequent synchronous grace-period primitives
> (synchronize_rcu() and friends) be expedited. New rcu_expedite_gp()
> and rcu_unexpedite_gp() primitives enable and disable expediting,
> and these may be nested. Note that the rcu_expedited boot/sysfs
> variable, if non-zero, causes expediting to happen regardless of calls
> to rcu_expedite_gp().
>
> Because one of the use cases for these primitives is to expedite
> grace periods during the in-kernel portion of boot, a new Kconfig
> parameter named CONFIG_RCU_EXPEDITE_BOOT causes the kernel to act
> as if rcu_expedite_gp() had been called very early in boot. At the end
> of boot (presumably just before init is spawned), a call to
> rcu_end_inkernel_boot() will provide the matching rcu_unexpedite_gp()
> if required.
So I thought we wanted to get rid of / limit the expedited stuff because
it's IPI happy, and here it's spreading.
Does it really make a machine boot much faster? Why are people using
synchronous gp primitives if they care about speed? Should we not fix
that instead?
On Fri, Feb 20, 2015 at 10:11:07AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 19, 2015 at 09:08:50PM -0800, Paul E. McKenney wrote:
> > Hello!
> >
> > This series, possibly for v3.21, contains changes that allow in-kernel
> > code to specify that all subsequent synchronous grace-period primitives
> > (synchronize_rcu() and friends) be expedited. New rcu_expedite_gp()
> > and rcu_unexpedite_gp() primitives enable and disable expediting,
> > and these may be nested. Note that the rcu_expedited boot/sysfs
> > variable, if non-zero, causes expediting to happen regardless of calls
> > to rcu_expedite_gp().
> >
> > Because one of the use cases for these primitives is to expedite
> > grace periods during the in-kernel portion of boot, a new Kconfig
> > parameter named CONFIG_RCU_EXPEDITE_BOOT causes the kernel to act
> > as if rcu_expedite_gp() had been called very early in boot. At the end
> > of boot (presumably just before init is spawned), a call to
> > rcu_end_inkernel_boot() will provide the matching rcu_unexpedite_gp()
> > if required.
>
> So I thought we wanted to get rid of / limit the expedited stuff because
> it's IPI happy, and here it's spreading.
Well, at least it no longer IPIs idle CPUs. ;-)
And this is during boot, when a few extra IPIs should not be a big deal.
> Does it really make a machine boot much faster? Why are people using
> synchronous gp primitives if they care about speed? Should we not fix
> that instead?
The report I heard was that it provided 10-15% faster boot times.
Thanx, Paul
On Fri, Feb 20, 2015 at 08:37:37AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 20, 2015 at 10:11:07AM +0100, Peter Zijlstra wrote:
> > So I thought we wanted to get rid of / limit the expedited stuff because
> > it's IPI happy, and here it's spreading.
>
> Well, at least it no longer IPIs idle CPUs. ;-)
>
> And this is during boot, when a few extra IPIs should not be a big deal.
Well the one application now is during boot; but you expose the
interface for all to use, and therefore someone will.
> > Does it really make a machine boot much faster? Why are people using
> > synchronous gp primitives if they care about speed? Should we not fix
> > that instead?
>
> The report I heard was that it provided 10-15% faster boot times.
That's not insignificant; got more details? I think we should really
look at why people are using the sync primitives.
On Fri, Feb 20, 2015 at 05:54:09PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 20, 2015 at 08:37:37AM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 20, 2015 at 10:11:07AM +0100, Peter Zijlstra wrote:
> > > So I thought we wanted to get rid of / limit the expedited stuff because
> > > it's IPI happy, and here it's spreading.
> >
> > Well, at least it no longer IPIs idle CPUs. ;-)
> >
> > And this is during boot, when a few extra IPIs should not be a big deal.
>
> Well the one application now is during boot; but you expose the
> interface for all to use, and therefore someone will.
I could make rcu_expedite_gp() and rcu_unexpedite_gp() be static,
I suppose. Except that I need to test them with rcutorture.
I suppose I could put the declaration in rcutorture.c, but then
0day will tell me to make them static. :-/
> > > Does it really make a machine boot much faster? Why are people using
> > > synchronous gp primitives if they care about speed? Should we not fix
> > > that instead?
> >
> > The report I heard was that it provided 10-15% faster boot times.
>
> That's not insignificant; got more details? I think we should really
> look at why people are using the sync primitives.
I must defer to the people who took the exact measurements.
But yes, once I have that info, I should add it to the commit log.
Thanx, Paul
>>>> Does it really make a machine boot much faster? Why are people using
>>>> synchronous gp primitives if they care about speed? Should we not fix
>>>> that instead?
>>>
>>> The report I heard was that it provided 10-15% faster boot times.
>>
>> That's not insignificant; got more details? I think we should really
>> look at why people are using the sync primitives.
>
> I must defer to the people who took the exact measurements.
>
> But yes, once I have that info, I should add it to the commit log.
so the two most obvious cases are
Registering sysrq keys ... even when the old key code had no handler
(have a patch pending for this)
registering idle handlers
(this is more tricky, it's very obvious abuse but the fix is less clear)
there's a few others as well that I'm chasing down...
.. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
> there's a few others as well that I'm chasing down...
> .. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
So my objections are twofold:
- I object to fast expedites in principle; they spray IPIs across the
system, so ideally we'd not have them at all, therefore also not at
boot.
Because as soon as the option exists, people will use it for other
things too.
- The proposed interface is very much exposed to everybody who wants
it; this again is wide open to (ab)use.
Once it exists people will start to use it, and before you know it we'll
always have that fast counter incremented and we're in IPI hell. Most
likely because someone was lazy and it seemed like a quick fix for
some stupid code.
And esp. in bootup code you can special case a lot of stuff; there's
limited concurrency esp. because userspace is not there yet. So we might
not actually need those sync calls.
On 2/20/2015 9:43 AM, Peter Zijlstra wrote:
> On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
>> there's a few others as well that I'm chasing down...
>> .. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
>
> So my objections are twofold:
>
> - I object to fast expedites in principle; they spray IPIs across the
> system, so ideally we'd not have them at all, therefore also not at
> boot.
>
> Because as soon as the option exists, people will use it for other
> things too.
the option exists today in sysfs and kernel parameter...
>
> And esp. in bootup code you can special case a lot of stuff; there's
> limited concurrency esp. because userspace is not there yet. So we might
> not actually need those sync calls.
yeah I am going down that angle as well absolutely.
but there are cases that may well be legit (or are 5 function calls deep into common code)
On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
> >>>>Does it really make a machine boot much faster? Why are people using
> >>>>synchronous gp primitives if they care about speed? Should we not fix
> >>>>that instead?
> >>>
> >>>The report I heard was that it provided 10-15% faster boot times.
> >>
> >>That's not insignificant; got more details? I think we should really
> >>look at why people are using the sync primitives.
> >
> >I must defer to the people who took the exact measurements.
> >
> >But yes, once I have that info, I should add it to the commit log.
>
> so the two most obvious cases are
>
> Registering sysrq keys ... even when the old key code had no handler
> (have a patch pending for this)
>
> registering idle handlers
> (this is more tricky, it's very obvious abuse but the fix is less clear)
>
> there's a few others as well that I'm chasing down...
> .. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
It would be good to have before-and-after measurements of actual
boot time. Are these numbers available?
Thanx, Paul
On 2/20/2015 10:27 AM, Paul E. McKenney wrote:
> On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
>>>>>> Does it really make a machine boot much faster? Why are people using
>>>>>> synchronous gp primitives if they care about speed? Should we not fix
>>>>>> that instead?
>>>>>
>>>>> The report I heard was that it provided 10-15% faster boot times.
>>>>
>>>> That's not insignificant; got more details? I think we should really
>>>> look at why people are using the sync primitives.
>>>
>>> I must defer to the people who took the exact measurements.
>>>
>>> But yes, once I have that info, I should add it to the commit log.
>>
>> so the two most obvious cases are
>>
>> Registering sysrq keys ... even when the old key code had no handler
>> (have a patch pending for this)
>>
>> registering idle handlers
>> (this is more tricky, it's very obvious abuse but the fix is less clear)
>>
>> there's a few others as well that I'm chasing down...
>> .. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
>
> It would be good to have before-and-after measurements of actual
> boot time. Are these numbers available?
I'll make you pretty graphs when I get home from collab summit, which
should be later today
On Fri, Feb 20, 2015 at 06:43:59PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
> > there's a few others as well that I'm chasing down...
> > .. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
>
> So my objections are twofold:
>
> - I object to fast expedites in principle; they spray IPIs across the
> system, so ideally we'd not have them at all, therefore also not at
> boot.
There are only a few uses of expedited grace periods, despite their
having been in the kernel for some years. So people do seem to be
exercising appropriate restraint here.
> Because as soon as the option exists, people will use it for other
> things too.
>
> - The proposed interface is very much exposed to everybody who wants
> it; this again is wide open to (ab)use.
>
> Once it exists people will start to use it, and before you know it we'll
> always have that fast counter incremented and we're in IPI hell. Most
> likely because someone was lazy and it seemed like a quick fix for
> some stupid code.
I suppose that another way to keep it private would be to have the
declaration in both update.c and rcutorture.c. This would mean that no
other file could invoke it, and should keep 0day happy. It would mean
that the declarations would be duplicated, but worse things could happen.
> And esp. in bootup code you can special case a lot of stuff; there's
> limited concurrency esp. because userspace is not there yet. So we might
> not actually need those sync calls.
I expect that some could be rewritten, but it might not work well for
code common to boot and to runtime.
Thanx, Paul
On Fri, Feb 20, 2015 at 10:29:34AM -0800, Arjan van de Ven wrote:
> On 2/20/2015 10:27 AM, Paul E. McKenney wrote:
> >On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
> >>>>>>Does it really make a machine boot much faster? Why are people using
> >>>>>>synchronous gp primitives if they care about speed? Should we not fix
> >>>>>>that instead?
> >>>>>
> >>>>>The report I heard was that it provided 10-15% faster boot times.
> >>>>
> >>>>That's not insignificant; got more details? I think we should really
> >>>>look at why people are using the sync primitives.
> >>>
> >>>I must defer to the people who took the exact measurements.
> >>>
> >>>But yes, once I have that info, I should add it to the commit log.
> >>
> >>so the two most obvious cases are
> >>
> >>Registering sysrq keys ... even when the old key code had no handler
> >>(have a patch pending for this)
> >>
> >>registering idle handlers
> >>(this is more tricky, it's very obvious abuse but the fix is less clear)
> >>
> >>there's a few others as well that I'm chasing down...
> >>.. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
> >
> >It would be good to have before-and-after measurements of actual
> >boot time. Are these numbers available?
>
> I'll make you pretty graphs when I get home from collab summit, which
> should be later today
Very good, looking forward to seeing them. ;-)
Thanx, Paul
On Fri, Feb 20, 2015 at 05:54:09PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 20, 2015 at 08:37:37AM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 20, 2015 at 10:11:07AM +0100, Peter Zijlstra wrote:
> > > Does it really make a machine boot much faster? Why are people using
> > > synchronous gp primitives if they care about speed? Should we not fix
> > > that instead?
> >
> > The report I heard was that it provided 10-15% faster boot times.
>
> That's not insignificant; got more details? I think we should really
> look at why people are using the sync primitives.
Paul, what do you think about adding a compile-time debug option to
synchronize_rcu() that causes it to capture the time on entry and exit
and print the duration together with the file:line of the caller?
Similar to initcall_debug, but for blocking calls to synchronize_rcu().
Put that together with initcall_debug, and you'd have a pretty good idea
of where that holds up boot.
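A minimal sketch of what such an option might look like, assuming a
hypothetical CONFIG_RCU_SYNC_DEBUG symbol and wrapper name (neither
exists in this series or in the kernel):

#ifdef CONFIG_RCU_SYNC_DEBUG		/* Hypothetical Kconfig symbol. */
static inline void synchronize_rcu_timed(const char *file, int line)
{
	ktime_t start = ktime_get();

	synchronize_rcu();
	pr_info("synchronize_rcu at %s:%d took %lld us\n",
		file, line, ktime_us_delta(ktime_get(), start));
}
#define synchronize_rcu_dbg() synchronize_rcu_timed(__FILE__, __LINE__)
#else
#define synchronize_rcu_dbg() synchronize_rcu()
#endif

Hooking synchronize_rcu() itself rather than a wrapper would catch
existing callers without changing them, which is presumably the intent.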
We do want early boot to run as asynchronously as possible, and to avoid
having later bits of boot waiting on a synchronize_rcu from earlier bits
of boot. Switching a caller over to call_rcu() doesn't actually help if
it still has to finish a grace period before it can allow later bits to
run. Ideally, we ought to be able to work out the "depth" of boot in
grace-periods.
Has anyone wired initcall_debug up to a bootchart-like graph?
- Josh Triplett
----- Original Message -----
> From: "Josh Triplett" <[email protected]>
> To: "Peter Zijlstra" <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>, [email protected], [email protected],
> [email protected], [email protected], [email protected], "mathieu desnoyers"
> <[email protected]>, [email protected], [email protected], [email protected], [email protected],
> [email protected], [email protected], [email protected], "bobby prani" <[email protected]>
> Sent: Saturday, February 21, 2015 1:04:28 AM
> Subject: Re: [PATCH tip/core/rcu 0/4] Programmatic nestable expedited grace periods
>
> On Fri, Feb 20, 2015 at 05:54:09PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 20, 2015 at 08:37:37AM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 20, 2015 at 10:11:07AM +0100, Peter Zijlstra wrote:
> > > > Does it really make a machine boot much faster? Why are people using
> > > > synchronous gp primitives if they care about speed? Should we not fix
> > > > that instead?
> > >
> > > The report I heard was that it provided 10-15% faster boot times.
> >
> > That's not insignificant; got more details? I think we should really
> > look at why people are using the sync primitives.
>
> Paul, what do you think about adding a compile-time debug option to
> synchronize_rcu() that causes it to capture the time on entry and exit
> and print the duration together with the file:line of the caller?
> Similar to initcall_debug, but for blocking calls to synchronize_rcu().
> Put that together with initcall_debug, and you'd have a pretty good idea
> of where that holds up boot.
>
> We do want early boot to run as asynchronously as possible, and to avoid
> having later bits of boot waiting on a synchronize_rcu from earlier bits
> of boot. Switching a caller over to call_rcu() doesn't actually help if
> it still has to finish a grace period before it can allow later bits to
> run. Ideally, we ought to be able to work out the "depth" of boot in
> grace-periods.
>
> Has anyone wired initcall_debug up to a bootchart-like graph?
The information about begin/end of synchronize_rcu(), as well as begin/end
of rcu_barrier(), seems to be very relevant here. This should perhaps be
covered by tracepoints? Isn't it already?
Thanks,
Mathieu
>
> - Josh Triplett
>
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
>>
>> there's a few others as well that I'm chasing down...
>> .. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
>
> It would be good to have before-and-after measurements of actual
> boot time. Are these numbers available?
To show the boot time, I'm using the timestamp of the "Write protecting" line,
that's pretty much the last thing we print prior to ring 3 execution.
A kernel with default RCU behavior (inside KVM, only virtual devices) looks like this:
[ 0.038724] Write protecting the kernel read-only data: 10240k
a kernel with expedited RCU (using the command line option, so that I don't have
to recompile between measurements and thus am completely oranges-to-oranges)
[ 0.031768] Write protecting the kernel read-only data: 10240k
which, in percentage, is an 18% improvement.
On Fri, Feb 20, 2015 at 09:45:39AM -0800, Arjan van de Ven wrote:
> On 2/20/2015 9:43 AM, Peter Zijlstra wrote:
> >On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
> >>there's a few others as well that I'm chasing down...
> >>.. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
> >
> >So my objections are twofold:
> >
> > - I object to fast expedites in principle; they spray IPIs across the
> > system, so ideally we'd not have them at all, therefore also not at
> > boot.
> >
> > Because as soon as the option exists, people will use it for other
> > things too.
>
> the option exists today in sysfs and kernel parameter...
Yeah, Paul and I have been having this argument for a while now ;-)
> >And esp. in bootup code you can special case a lot of stuff; there's
> >limited concurrency esp. because userspace is not there yet. So we might
> >not actually need those sync calls.
>
> yeah I am going down that angle as well absolutely.
> but there are cases that may well be legit (or are 5 function calls deep into common code)
Good ;-)
On Fri, Feb 20, 2015 at 10:38:49AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 20, 2015 at 06:43:59PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
> > > there's a few others as well that I'm chasing down...
> > > .. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
> >
> > So my objections are twofold:
> >
> > - I object to fast expedites in principle; they spray IPIs across the
> > system, so ideally we'd not have them at all, therefore also not at
> > boot.
>
> There are only a few uses of expedited grace periods, despite their
> having been in the kernel for some years. So people do seem to be
> exercising appropriate restraint here.
Or people just don't know about it :-)
> > Because as soon as the option exists, people will use it for other
> > things too.
> >
> > - The proposed interface is very much exposed to everybody who wants
> > it; this again is wide open to (ab)use.
> >
> > Once it exists people will start to use it, and before you know it we'll
> > always have that fast counter incremented and we're in IPI hell. Most
> > likely because someone was lazy and it seemed like a quick fix for
> > some stupid code.
>
> I suppose that another way to keep it private would be to have the
> declaration in both update.c and rcutorture.c. This would mean that no
> other file could invoke it, and should keep 0day happy. It would mean
> that the declarations would be duplicated, but worse things could happen.
Why do you need it for rcutorture? That can use the regular expedited
calls to exercise those RCU paths, right?
That would allow you to use system_state < SYSTEM_RUNNING if you really
wanted to do this without exposing any interface for this.
On Sat, Feb 21, 2015 at 05:11:30PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 20, 2015 at 10:38:49AM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 20, 2015 at 06:43:59PM +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
> > > > there's a few others as well that I'm chasing down...
> > > > .. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
> > >
> > > So my objections are twofold:
> > >
> > > - I object to fast expedites in principle; they spray IPIs across the
> > > system, so ideally we'd not have them at all, therefore also not at
> > > boot.
> >
> > There are only a few uses of expedited grace periods, despite their
> > having been in the kernel for some years. So people do seem to be
> > exercising appropriate restraint here.
>
> Or people just don't know about it :-)
"Ignorance: The #1 contributor to appropriate restraint!" ;-)
> > > Because as soon as the option exists, people will use it for other
> > > things too.
> > >
> > > - The proposed interface is very much exposed to everybody who wants
> > > it; this again is wide open to (ab)use.
> > >
> > > Once it exists people will start to use it, and before you know it we'll
> > > always have that fast counter incremented and we're in IPI hell. Most
> > > likely because someone was lazy and it seemed like a quick fix for
> > > some stupid code.
> >
> > I suppose that another way to keep it private would be to have the
> > declaration in both update.c and rcutorture.c. This would mean that no
> > other file could invoke it, and should keep 0day happy. It would mean
> > that the declarations would be duplicated, but worse things could happen.
>
> Why do you need it for rcutorture? That can use the regular expedited
> calls to exercise those RCU paths, right?
Yes, but not the ability to turn expediting on and off in the normal path.
> That would allow you to use system_state < SYSTEM_RUNNING if you really
> wanted to do this without exposing any interface for this.
This decision is not up to RCU. Something else must tell RCU whether or
not and when to treat normal grace periods as expedited.
Thanx, Paul
On Sat, Feb 21, 2015 at 07:51:34AM -0800, Arjan van de Ven wrote:
> >>
> >>there's a few others as well that I'm chasing down...
> >>.. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
> >
> >It would be good to have before-and-after measurements of actual
> >boot time. Are these numbers available?
>
>
> To show the boot time, I'm using the timestamp of the "Write protecting" line,
> that's pretty much the last thing we print prior to ring 3 execution.
>
> A kernel with default RCU behavior (inside KVM, only virtual devices) looks like this:
>
> [ 0.038724] Write protecting the kernel read-only data: 10240k
>
> a kernel with expedited RCU (using the command line option, so that I don't have
> to recompile between measurements and thus am completely oranges-to-oranges)
>
> [ 0.031768] Write protecting the kernel read-only data: 10240k
>
> which, in percentage, is an 18% improvement.
Thank you, will repost with this info.
Thanx, Paul
On Sat, Feb 21, 2015 at 03:12:01PM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > From: "Josh Triplett" <[email protected]>
> > To: "Peter Zijlstra" <[email protected]>
> > Cc: "Paul E. McKenney" <[email protected]>, [email protected], [email protected],
> > [email protected], [email protected], [email protected], "mathieu desnoyers"
> > <[email protected]>, [email protected], [email protected], [email protected], [email protected],
> > [email protected], [email protected], [email protected], "bobby prani" <[email protected]>
> > Sent: Saturday, February 21, 2015 1:04:28 AM
> > Subject: Re: [PATCH tip/core/rcu 0/4] Programmatic nestable expedited grace periods
> >
> > On Fri, Feb 20, 2015 at 05:54:09PM +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 20, 2015 at 08:37:37AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 20, 2015 at 10:11:07AM +0100, Peter Zijlstra wrote:
> > > > > Does it really make a machine boot much faster? Why are people using
> > > > > synchronous gp primitives if they care about speed? Should we not fix
> > > > > that instead?
> > > >
> > > > The report I heard was that it provided 10-15% faster boot times.
> > >
> > > That's not insignificant; got more details? I think we should really
> > > look at why people are using the sync primitives.
> >
> > Paul, what do you think about adding a compile-time debug option to
> > synchronize_rcu() that causes it to capture the time on entry and exit
> > and print the duration together with the file:line of the caller?
> > Similar to initcall_debug, but for blocking calls to synchronize_rcu().
> > Put that together with initcall_debug, and you'd have a pretty good idea
> > of where that holds up boot.
> >
> > We do want early boot to run as asynchronously as possible, and to avoid
> > having later bits of boot waiting on a synchronize_rcu from earlier bits
> > of boot. Switching a caller over to call_rcu() doesn't actually help if
> > it still has to finish a grace period before it can allow later bits to
> > run. Ideally, we ought to be able to work out the "depth" of boot in
> > grace-periods.
> >
> > Has anyone wired initcall_debug up to a bootchart-like graph?
>
> The information about begin/end of synchronize_rcu(), as well as begin/end
> of rcu_barrier(), seems to be very relevant here. This should perhaps be
> covered by tracepoints? Isn't it already?
Good points, but they did measure this somehow. Wouldn't some ftrace
magic get this result?
Thanx, Paul
> Thanks,
>
> Mathieu
>
> >
> > - Josh Triplett
> >
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>
On Sat, Feb 21, 2015 at 05:08:52PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 20, 2015 at 09:45:39AM -0800, Arjan van de Ven wrote:
> > On 2/20/2015 9:43 AM, Peter Zijlstra wrote:
> > >On Fri, Feb 20, 2015 at 09:32:39AM -0800, Arjan van de Ven wrote:
> > >>there's a few others as well that I'm chasing down...
> > >>.. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
> > >
> > >So my objections are twofold:
> > >
> > > - I object to fast expedites in principle; they spray IPIs across the
> > > system, so ideally we'd not have them at all, therefore also not at
> > > boot.
> > >
> > > Because as soon as the option exists, people will use it for other
> > > things too.
> >
> > the option exists today in sysfs and kernel parameter...
>
> Yeah, Paul and I have been having this argument for a while now ;-)
Indeed we have. ;-)
And if expedited grace periods start causing latency issues in real-world
workloads, I will address those issues.
In the meantime, one of the nice things about NO_HZ_FULL is that
synchronize_sched_expedited() avoids IPIing CPUs having a single runnable
task that is running in nohz_full mode. ;-)
Thanx, Paul
> > >And esp. in bootup code you can special case a lot of stuff; there's
> > >limited concurrency esp. because userspace is not there yet. So we might
> > >not actually need those sync calls.
> >
> > yeah I am going down that angle as well absolutely.
> > but there are cases that may well be legit (or are 5 function calls deep into common code)
>
> Good ;-)
>
On Sat, Feb 21, 2015 at 07:51:34AM -0800, Arjan van de Ven wrote:
> >>
> >>there's a few others as well that I'm chasing down...
> >>.. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
> >
> >It would be good to have before-and-after measurements of actual
> >boot time. Are these numbers available?
>
> To show the boot time, I'm using the timestamp of the "Write protecting" line,
> that's pretty much the last thing we print prior to ring 3 execution.
That's a little sad; we ought to be write-protecting kernel read-only
data as *early* as possible.
> A kernel with default RCU behavior (inside KVM, only virtual devices) looks like this:
>
> [ 0.038724] Write protecting the kernel read-only data: 10240k
>
> a kernel with expedited RCU (using the command line option, so that I don't have
> to recompile between measurements and thus am completely oranges-to-oranges)
>
> [ 0.031768] Write protecting the kernel read-only data: 10240k
>
> which, in percentage, is an 18% improvement.
Nice improvement, but that suggests that we're spending far too much
time waiting on RCU grace periods at boot time.
- Josh Triplett
On Sat, Feb 21, 2015 at 07:58:07PM -0800, Josh Triplett wrote:
> On Sat, Feb 21, 2015 at 07:51:34AM -0800, Arjan van de Ven wrote:
> > >>
> > >>there's a few others as well that I'm chasing down...
> > >>.. but the flip side, prior to running ring 3 code, why NOT do fast expedites?
> > >
> > >It would be good to have before-and-after measurements of actual
> > >boot time. Are these numbers available?
> >
> > To show the boot time, I'm using the timestamp of the "Write protecting" line,
> > that's pretty much the last thing we print prior to ring 3 execution.
>
> That's a little sad; we ought to be write-protecting kernel read-only
> data as *early* as possible.
>
> > A kernel with default RCU behavior (inside KVM, only virtual devices) looks like this:
> >
> > [ 0.038724] Write protecting the kernel read-only data: 10240k
> >
> > a kernel with expedited RCU (using the command line option, so that I don't have
> > to recompile between measurements and thus am completely oranges-to-oranges)
> >
> > [ 0.031768] Write protecting the kernel read-only data: 10240k
> >
> > which, in percentage, is an 18% improvement.
>
> Nice improvement, but that suggests that we're spending far too much
> time waiting on RCU grace periods at boot time.
Let's see... 0.038724-0.031768=0.006956, or about seven milliseconds.
This might be as many as ten grace periods, but is more likely to be
about two of them. Of course, this counts only the grace periods after
the scheduler starts, as those prior to scheduler start are no-ops,
courtesy of your single-CPU optimization.
So, how many grace periods between scheduler start and init spawning
do you feel would be appropriate?
Thanx, Paul
>>
>> To show the boot time, I'm using the timestamp of the "Write protecting" line,
>> that's pretty much the last thing we print prior to ring 3 execution.
>
> That's a little sad; we ought to be write-protecting kernel read-only
> data as *early* as possible.
well... if you are compromised before the first ring 3 instruction...
.... you have a slightly bigger problem than where in the kernel we write protect things.
On Sun, Feb 22, 2015 at 10:31:26AM -0800, Arjan van de Ven wrote:
> >>To show the boot time, I'm using the timestamp of the "Write protecting" line,
> >>that's pretty much the last thing we print prior to ring 3 execution.
> >
> >That's a little sad; we ought to be write-protecting kernel read-only
> >data as *early* as possible.
>
> well... if you are compromised before the first ring 3 instruction...
> .... you have a slightly bigger problem than where in the kernel we write protect things.
Definitely not talking about malicious compromise here; malicious code
could just remove the write protection. However, write-protecting
kernel read-only data also protects against a class of bugs.
- Josh Triplett