jiffies_till_sched_qs is of little use while it is read-only, since its value
is used to set jiffies_to_sched_qs regardless of the first/next fqs jiffies.
A change made through sysfs should therefore take effect immediately.

adjust_jiffies_till_sched_qs(), the function that sets jiffies_to_sched_qs,
is called only if the value written through sysfs is not ULONG_MAX, and that
value is used as-is, without being adjusted the way the first/next fqs
jiffies are.

While at it, move the two module_param() declarations further down.
Signed-off-by: Byungchul Park <[email protected]>
---
kernel/rcu/tree.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a2f8ba2..a28e2fe 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
* quiescent-state help from rcu_note_context_switch().
*/
static ulong jiffies_till_sched_qs = ULONG_MAX;
-module_param(jiffies_till_sched_qs, ulong, 0444);
static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
-module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
/*
* Make sure that we give the grace-period kthread time to detect any
@@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
WRITE_ONCE(jiffies_to_sched_qs, j);
}
+static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
+{
+ ulong j;
+ int ret = kstrtoul(val, 0, &j);
+
+ if (!ret && j != ULONG_MAX) {
+ WRITE_ONCE(*(ulong *)kp->arg, j);
+ adjust_jiffies_till_sched_qs();
+ }
+ return ret;
+}
+
static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
{
ulong j;
@@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
return ret;
}
+static struct kernel_param_ops sched_qs_jiffies_ops = {
+ .set = param_set_sched_qs_jiffies,
+ .get = param_get_ulong,
+};
+
static struct kernel_param_ops first_fqs_jiffies_ops = {
.set = param_set_first_fqs_jiffies,
.get = param_get_ulong,
@@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
.get = param_get_ulong,
};
+module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
+
+module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
module_param(rcu_kick_kthreads, bool, 0644);
static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
--
1.9.1
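For reference, the mechanism the patch relies on is the standard
module_param_cb()/kernel_param_ops pattern: a custom .set callback parses the
written string, stores the value, and triggers any recomputation, while .get
reuses the stock param_get_ulong(). Below is a minimal, self-contained sketch
of that pattern in an out-of-tree module; the demo_* names are hypothetical
and only illustrate the mechanism, they do not mirror the RCU code exactly.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

static ulong demo_threshold = 100;	/* runtime-writable tunable */
static ulong demo_derived;		/* recomputed whenever demo_threshold changes */

static void demo_adjust(void)
{
	/* Derive a second value from the tunable, much as
	 * adjust_jiffies_till_sched_qs() derives jiffies_to_sched_qs. */
	WRITE_ONCE(demo_derived, 2 * READ_ONCE(demo_threshold));
}

static int demo_param_set(const char *val, const struct kernel_param *kp)
{
	ulong j;
	int ret = kstrtoul(val, 0, &j);

	if (!ret) {
		WRITE_ONCE(*(ulong *)kp->arg, j);
		demo_adjust();
	}
	return ret;
}

static const struct kernel_param_ops demo_ops = {
	.set = demo_param_set,
	.get = param_get_ulong,
};

/* Writable at runtime via /sys/module/<module>/parameters/demo_threshold. */
module_param_cb(demo_threshold, &demo_ops, &demo_threshold, 0644);
module_param(demo_derived, ulong, 0444);	/* display only */

MODULE_LICENSE("GPL");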
On Mon, Jul 08, 2019 at 09:03:59AM -0400, Joel Fernandes wrote:
> Good morning!
>
> On Mon, Jul 08, 2019 at 05:50:13AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 08, 2019 at 03:00:09PM +0900, Byungchul Park wrote:
> > > jiffies_till_sched_qs is useless if it's readonly as it is used to set
> > > jiffies_to_sched_qs with its value regardless of first/next fqs jiffies.
> > > And it should be applied immediately on change through sysfs.
>
> It is interesting it can be setup at boot time, but not at runtime. I think
> this can be mentioned in the change log that it is not really "read-only",
> because it is something that can be dynamically changed as a kernel boot
> parameter.
In Byungchul's defense, the current module_param() permissions are
0444, which really is read-only. Although I do agree that they can
be written at boot, one could use this same line of reasoning to argue
that const variables can be written at compile time (or, for on-stack
const variables, at function-invocation time). But we still call them
"const".
> > Actually, the intent was to only allow this to be changed at boot time.
> > Of course, if there is now a good reason to adjust it, it needs
> > to be adjustable. So what situation is making you want to change
> > jiffies_till_sched_qs at runtime? To what values is it proving useful
> > to adjust it? What (if any) relationships between this timeout and the
> > various other RCU timeouts need to be maintained? What changes to
> > rcutorture should be applied in order to test the ability to change
> > this at runtime?
>
> I am also interested in the context, are you changing it at runtime for
> experimentation? I recently was doing some performance experiments and it is
> quite interesting how reducing this value can shorten grace period times :)
If you -really- want to reduce grace-period latencies, you can always
boot with rcupdate.rcu_expedited=1. ;-)
If you want to reduce grace-period latencies, but without all the IPIs
that expedited grace periods give you, the rcutree.jiffies_till_first_fqs
and rcutree.jiffies_till_next_fqs kernel boot parameters might be better
places to start than rcutree.jiffies_till_sched_qs. For one thing,
adjusting these two affects the value of jiffies_till_sched_qs.
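For concreteness, that coupling is computed by adjust_jiffies_till_sched_qs()
(the value being recomputed is jiffies_to_sched_qs, as gets clarified further
down this thread). Condensed from the version quoted in full later in the
thread:

	j = READ_ONCE(jiffies_till_first_fqs) + 2 * READ_ONCE(jiffies_till_next_fqs);
	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
	WRITE_ONCE(jiffies_to_sched_qs, j);

So with, say, HZ=1000, a modest CPU count, and fqs settings of a few jiffies
each, j comes out in the single digits and is then raised to the HZ/10-based
floor of roughly 100 jiffies; only an explicit rcutree.jiffies_till_sched_qs
at boot bypasses this calculation.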
Thanx, Paul
> Joel
>
>
> > Thanx, Paul
> >
> > > The function for setting jiffies_to_sched_qs,
> > > adjust_jiffies_till_sched_qs() will be called only if
> > > the value from sysfs != ULONG_MAX. And the value won't be adjusted
> > > unlike first/next fqs jiffies.
> > >
> > > While at it, changed the positions of two module_param()s downward.
> > >
> > > Signed-off-by: Byungchul Park <[email protected]>
> > > ---
> > > kernel/rcu/tree.c | 22 ++++++++++++++++++++--
> > > 1 file changed, 20 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a2f8ba2..a28e2fe 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > * quiescent-state help from rcu_note_context_switch().
> > > */
> > > static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > -module_param(jiffies_till_sched_qs, ulong, 0444);
> > > static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > >
> > > /*
> > > * Make sure that we give the grace-period kthread time to detect any
> > > @@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
> > > WRITE_ONCE(jiffies_to_sched_qs, j);
> > > }
> > >
> > > +static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
> > > +{
> > > + ulong j;
> > > + int ret = kstrtoul(val, 0, &j);
> > > +
> > > + if (!ret && j != ULONG_MAX) {
> > > + WRITE_ONCE(*(ulong *)kp->arg, j);
> > > + adjust_jiffies_till_sched_qs();
> > > + }
> > > + return ret;
> > > +}
> > > +
> > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > {
> > > ulong j;
> > > @@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > return ret;
> > > }
> > >
> > > +static struct kernel_param_ops sched_qs_jiffies_ops = {
> > > + .set = param_set_sched_qs_jiffies,
> > > + .get = param_get_ulong,
> > > +};
> > > +
> > > static struct kernel_param_ops first_fqs_jiffies_ops = {
> > > .set = param_set_first_fqs_jiffies,
> > > .get = param_get_ulong,
> > > @@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > .get = param_get_ulong,
> > > };
> > >
> > > +module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
> > > module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
> > > module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
> > > +
> > > +module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > module_param(rcu_kick_kthreads, bool, 0644);
> > >
> > > static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
> > > --
> > > 1.9.1
> > >
> >
>
On Mon, Jul 08, 2019 at 03:00:09PM +0900, Byungchul Park wrote:
> jiffies_till_sched_qs is useless if it's readonly as it is used to set
> jiffies_to_sched_qs with its value regardless of first/next fqs jiffies.
> And it should be applied immediately on change through sysfs.
Actually, the intent was to only allow this to be changed at boot time.
Of course, if there is now a good reason to adjust it, it needs
to be adjustable. So what situation is making you want to change
jiffies_till_sched_qs at runtime? To what values is it proving useful
to adjust it? What (if any) relationships between this timeout and the
various other RCU timeouts need to be maintained? What changes to
rcutorture should be applied in order to test the ability to change
this at runtime?
Thanx, Paul
> The function for setting jiffies_to_sched_qs,
> adjust_jiffies_till_sched_qs() will be called only if
> the value from sysfs != ULONG_MAX. And the value won't be adjusted
> unlike first/next fqs jiffies.
>
> While at it, changed the positions of two module_param()s downward.
>
> Signed-off-by: Byungchul Park <[email protected]>
> ---
> kernel/rcu/tree.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a2f8ba2..a28e2fe 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> * quiescent-state help from rcu_note_context_switch().
> */
> static ulong jiffies_till_sched_qs = ULONG_MAX;
> -module_param(jiffies_till_sched_qs, ulong, 0444);
> static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
>
> /*
> * Make sure that we give the grace-period kthread time to detect any
> @@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
> WRITE_ONCE(jiffies_to_sched_qs, j);
> }
>
> +static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
> +{
> + ulong j;
> + int ret = kstrtoul(val, 0, &j);
> +
> + if (!ret && j != ULONG_MAX) {
> + WRITE_ONCE(*(ulong *)kp->arg, j);
> + adjust_jiffies_till_sched_qs();
> + }
> + return ret;
> +}
> +
> static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> {
> ulong j;
> @@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> return ret;
> }
>
> +static struct kernel_param_ops sched_qs_jiffies_ops = {
> + .set = param_set_sched_qs_jiffies,
> + .get = param_get_ulong,
> +};
> +
> static struct kernel_param_ops first_fqs_jiffies_ops = {
> .set = param_set_first_fqs_jiffies,
> .get = param_get_ulong,
> @@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> .get = param_get_ulong,
> };
>
> +module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
> module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
> module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
> +
> +module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> module_param(rcu_kick_kthreads, bool, 0644);
>
> static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
> --
> 1.9.1
>
Good morning!
On Mon, Jul 08, 2019 at 05:50:13AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 08, 2019 at 03:00:09PM +0900, Byungchul Park wrote:
> > jiffies_till_sched_qs is useless if it's readonly as it is used to set
> > jiffies_to_sched_qs with its value regardless of first/next fqs jiffies.
> > And it should be applied immediately on change through sysfs.
It is interesting that it can be set up at boot time but not at runtime. I
think the changelog could mention that it is not really "read-only", because
it is something that can be dynamically changed as a kernel boot parameter.
> Actually, the intent was to only allow this to be changed at boot time.
> Of course, if there is now a good reason to adjust it, it needs
> to be adjustable. So what situation is making you want to change
> jiffies_till_sched_qs at runtime? To what values is it proving useful
> to adjust it? What (if any) relationships between this timeout and the
> various other RCU timeouts need to be maintained? What changes to
> rcutorture should be applied in order to test the ability to change
> this at runtime?
I am also interested in the context: are you changing it at runtime for
experimentation? I was recently doing some performance experiments, and it is
quite interesting how reducing this value can shorten grace period times :)
Joel
> Thanx, Paul
>
> > The function for setting jiffies_to_sched_qs,
> > adjust_jiffies_till_sched_qs() will be called only if
> > the value from sysfs != ULONG_MAX. And the value won't be adjusted
> > unlike first/next fqs jiffies.
> >
> > While at it, changed the positions of two module_param()s downward.
> >
> > Signed-off-by: Byungchul Park <[email protected]>
> > ---
> > kernel/rcu/tree.c | 22 ++++++++++++++++++++--
> > 1 file changed, 20 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index a2f8ba2..a28e2fe 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > * quiescent-state help from rcu_note_context_switch().
> > */
> > static ulong jiffies_till_sched_qs = ULONG_MAX;
> > -module_param(jiffies_till_sched_qs, ulong, 0444);
> > static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> >
> > /*
> > * Make sure that we give the grace-period kthread time to detect any
> > @@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
> > WRITE_ONCE(jiffies_to_sched_qs, j);
> > }
> >
> > +static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
> > +{
> > + ulong j;
> > + int ret = kstrtoul(val, 0, &j);
> > +
> > + if (!ret && j != ULONG_MAX) {
> > + WRITE_ONCE(*(ulong *)kp->arg, j);
> > + adjust_jiffies_till_sched_qs();
> > + }
> > + return ret;
> > +}
> > +
> > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > {
> > ulong j;
> > @@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > return ret;
> > }
> >
> > +static struct kernel_param_ops sched_qs_jiffies_ops = {
> > + .set = param_set_sched_qs_jiffies,
> > + .get = param_get_ulong,
> > +};
> > +
> > static struct kernel_param_ops first_fqs_jiffies_ops = {
> > .set = param_set_first_fqs_jiffies,
> > .get = param_get_ulong,
> > @@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > .get = param_get_ulong,
> > };
> >
> > +module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
> > module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
> > module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
> > +
> > +module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > module_param(rcu_kick_kthreads, bool, 0644);
> >
> > static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
> > --
> > 1.9.1
> >
>
On Mon, Jul 08, 2019 at 06:19:42AM -0700, Paul E. McKenney wrote:
[snip]
> > > Actually, the intent was to only allow this to be changed at boot time.
> > > Of course, if there is now a good reason to adjust it, it needs
> > > to be adjustable. So what situation is making you want to change
> > > jiffies_till_sched_qs at runtime? To what values is it proving useful
> > > to adjust it? What (if any) relationships between this timeout and the
> > > various other RCU timeouts need to be maintained? What changes to
> > > rcutorture should be applied in order to test the ability to change
> > > this at runtime?
> >
> > I am also interested in the context, are you changing it at runtime for
> > experimentation? I recently was doing some performance experiments and it is
> > quite interesting how reducing this value can shorten grace period times :)
>
> If you -really- want to reduce grace-period latencies, you can always
> boot with rcupdate.rcu_expedited=1. ;-)
>
> If you want to reduce grace-period latencies, but without all the IPIs
> that expedited grace periods give you, the rcutree.jiffies_till_first_fqs
> and rcutree.jiffies_till_next_fqs kernel boot parameters might be better
> places to start than rcutree.jiffies_till_sched_qs. For one thing,
> adjusting these two affects the value of jiffies_till_sched_qs.
Yes, agreed ;)
Joel
> >
> > > Thanx, Paul
> > >
> > > > The function for setting jiffies_to_sched_qs,
> > > > adjust_jiffies_till_sched_qs() will be called only if
> > > > the value from sysfs != ULONG_MAX. And the value won't be adjusted
> > > > unlike first/next fqs jiffies.
> > > >
> > > > While at it, changed the positions of two module_param()s downward.
> > > >
> > > > Signed-off-by: Byungchul Park <[email protected]>
> > > > ---
> > > > kernel/rcu/tree.c | 22 ++++++++++++++++++++--
> > > > 1 file changed, 20 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index a2f8ba2..a28e2fe 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > > * quiescent-state help from rcu_note_context_switch().
> > > > */
> > > > static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > > -module_param(jiffies_till_sched_qs, ulong, 0444);
> > > > static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > >
> > > > /*
> > > > * Make sure that we give the grace-period kthread time to detect any
> > > > @@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
> > > > WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > }
> > > >
> > > > +static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
> > > > +{
> > > > + ulong j;
> > > > + int ret = kstrtoul(val, 0, &j);
> > > > +
> > > > + if (!ret && j != ULONG_MAX) {
> > > > + WRITE_ONCE(*(ulong *)kp->arg, j);
> > > > + adjust_jiffies_till_sched_qs();
> > > > + }
> > > > + return ret;
> > > > +}
> > > > +
> > > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > > {
> > > > ulong j;
> > > > @@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > > return ret;
> > > > }
> > > >
> > > > +static struct kernel_param_ops sched_qs_jiffies_ops = {
> > > > + .set = param_set_sched_qs_jiffies,
> > > > + .get = param_get_ulong,
> > > > +};
> > > > +
> > > > static struct kernel_param_ops first_fqs_jiffies_ops = {
> > > > .set = param_set_first_fqs_jiffies,
> > > > .get = param_get_ulong,
> > > > @@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > > .get = param_get_ulong,
> > > > };
> > > >
> > > > +module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
> > > > module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
> > > > module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
> > > > +
> > > > +module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > > module_param(rcu_kick_kthreads, bool, 0644);
> > > >
> > > > static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
> > > > --
> > > > 1.9.1
> > > >
> > >
> >
>
On Mon, Jul 08, 2019 at 09:03:59AM -0400, Joel Fernandes wrote:
> > Actually, the intent was to only allow this to be changed at boot time.
> > Of course, if there is now a good reason to adjust it, it needs
> > to be adjustable. So what situation is making you want to change
> > jiffies_till_sched_qs at runtime? To what values is it proving useful
> > to adjust it? What (if any) relationships between this timeout and the
> > various other RCU timeouts need to be maintained? What changes to
> > rcutorture should be applied in order to test the ability to change
> > this at runtime?
>
> I am also interested in the context, are you changing it at runtime for
> experimentation? I recently was doing some performance experiments and it is
> quite interesting how reducing this value can shorten grace period times :)
Hi Joel,
I've read a thread about your experiments on how grace periods change
depending on the tunable variables, which was interesting to me. While
reading it, I found out that jiffies_till_sched_qs is not tunable at runtime,
unlike jiffies_till_{first,next}_fqs, which seemed like nonsense to me;
that's why I tried this patch. :)
Hi Paul,
IMHO, just as we want to tune the time at which fqs is initiated, we may also
want to tune the time at which help from the scheduler starts. I thought the
only difference between them is the level of urgency. I might be wrong; it
would be appreciated if you let me know if I am missing something.
And it's ok even if the patch is turned down based on your criteria. :)
Thanks,
Byungchul
> Joel
>
>
> > Thanx, Paul
> >
> > > The function for setting jiffies_to_sched_qs,
> > > adjust_jiffies_till_sched_qs() will be called only if
> > > the value from sysfs != ULONG_MAX. And the value won't be adjusted
> > > unlike first/next fqs jiffies.
> > >
> > > While at it, changed the positions of two module_param()s downward.
> > >
> > > Signed-off-by: Byungchul Park <[email protected]>
> > > ---
> > > kernel/rcu/tree.c | 22 ++++++++++++++++++++--
> > > 1 file changed, 20 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a2f8ba2..a28e2fe 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > * quiescent-state help from rcu_note_context_switch().
> > > */
> > > static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > -module_param(jiffies_till_sched_qs, ulong, 0444);
> > > static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > >
> > > /*
> > > * Make sure that we give the grace-period kthread time to detect any
> > > @@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
> > > WRITE_ONCE(jiffies_to_sched_qs, j);
> > > }
> > >
> > > +static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
> > > +{
> > > + ulong j;
> > > + int ret = kstrtoul(val, 0, &j);
> > > +
> > > + if (!ret && j != ULONG_MAX) {
> > > + WRITE_ONCE(*(ulong *)kp->arg, j);
> > > + adjust_jiffies_till_sched_qs();
> > > + }
> > > + return ret;
> > > +}
> > > +
> > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > {
> > > ulong j;
> > > @@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > return ret;
> > > }
> > >
> > > +static struct kernel_param_ops sched_qs_jiffies_ops = {
> > > + .set = param_set_sched_qs_jiffies,
> > > + .get = param_get_ulong,
> > > +};
> > > +
> > > static struct kernel_param_ops first_fqs_jiffies_ops = {
> > > .set = param_set_first_fqs_jiffies,
> > > .get = param_get_ulong,
> > > @@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > .get = param_get_ulong,
> > > };
> > >
> > > +module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
> > > module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
> > > module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
> > > +
> > > +module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > module_param(rcu_kick_kthreads, bool, 0644);
> > >
> > > static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
> > > --
> > > 1.9.1
> > >
> >
On Mon, Jul 08, 2019 at 06:19:42AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 08, 2019 at 09:03:59AM -0400, Joel Fernandes wrote:
> > Good morning!
> >
> > On Mon, Jul 08, 2019 at 05:50:13AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 08, 2019 at 03:00:09PM +0900, Byungchul Park wrote:
> > > > jiffies_till_sched_qs is useless if it's readonly as it is used to set
> > > > jiffies_to_sched_qs with its value regardless of first/next fqs jiffies.
> > > > And it should be applied immediately on change through sysfs.
> >
> > It is interesting it can be setup at boot time, but not at runtime. I think
> > this can be mentioned in the change log that it is not really "read-only",
> > because it is something that can be dynamically changed as a kernel boot
> > parameter.
>
> In Byungchul's defense, the current module_param() permissions are
> 0444, which really is read-only. Although I do agree that they can
> be written at boot, one could use this same line of reasoning to argue
> that const variables can be written at compile time (or, for on-stack
> const variables, at function-invocation time). But we still call them
> "const".
;-)
> > > Actually, the intent was to only allow this to be changed at boot time.
> > > Of course, if there is now a good reason to adjust it, it needs
> > > to be adjustable. So what situation is making you want to change
> > > jiffies_till_sched_qs at runtime? To what values is it proving useful
> > > to adjust it? What (if any) relationships between this timeout and the
> > > various other RCU timeouts need to be maintained? What changes to
> > > rcutorture should be applied in order to test the ability to change
> > > this at runtime?
> >
> > I am also interested in the context, are you changing it at runtime for
> > experimentation? I recently was doing some performance experiments and it is
> > quite interesting how reducing this value can shorten grace period times :)
>
> If you -really- want to reduce grace-period latencies, you can always
> boot with rcupdate.rcu_expedited=1. ;-)
It's quite a different mechanism at the moment though... :(
> If you want to reduce grace-period latencies, but without all the IPIs
> that expedited grace periods give you, the rcutree.jiffies_till_first_fqs
> and rcutree.jiffies_till_next_fqs kernel boot parameters might be better
> places to start than rcutree.jiffies_till_sched_qs. For one thing,
> adjusting these two affects the value of jiffies_till_sched_qs.
Do you mean:
adjusting these two affects the value of *jiffies_to_sched_qs* (instead
of jiffies_till_sched_qs).
Right?
Thanks,
Byungchul
>
> Thanx, Paul
>
> > Joel
> >
> >
> > > Thanx, Paul
> > >
> > > > The function for setting jiffies_to_sched_qs,
> > > > adjust_jiffies_till_sched_qs() will be called only if
> > > > the value from sysfs != ULONG_MAX. And the value won't be adjusted
> > > > unlike first/next fqs jiffies.
> > > >
> > > > While at it, changed the positions of two module_param()s downward.
> > > >
> > > > Signed-off-by: Byungchul Park <[email protected]>
> > > > ---
> > > > kernel/rcu/tree.c | 22 ++++++++++++++++++++--
> > > > 1 file changed, 20 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index a2f8ba2..a28e2fe 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > > * quiescent-state help from rcu_note_context_switch().
> > > > */
> > > > static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > > -module_param(jiffies_till_sched_qs, ulong, 0444);
> > > > static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > >
> > > > /*
> > > > * Make sure that we give the grace-period kthread time to detect any
> > > > @@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
> > > > WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > }
> > > >
> > > > +static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
> > > > +{
> > > > + ulong j;
> > > > + int ret = kstrtoul(val, 0, &j);
> > > > +
> > > > + if (!ret && j != ULONG_MAX) {
> > > > + WRITE_ONCE(*(ulong *)kp->arg, j);
> > > > + adjust_jiffies_till_sched_qs();
> > > > + }
> > > > + return ret;
> > > > +}
> > > > +
> > > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > > {
> > > > ulong j;
> > > > @@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > > return ret;
> > > > }
> > > >
> > > > +static struct kernel_param_ops sched_qs_jiffies_ops = {
> > > > + .set = param_set_sched_qs_jiffies,
> > > > + .get = param_get_ulong,
> > > > +};
> > > > +
> > > > static struct kernel_param_ops first_fqs_jiffies_ops = {
> > > > .set = param_set_first_fqs_jiffies,
> > > > .get = param_get_ulong,
> > > > @@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > > .get = param_get_ulong,
> > > > };
> > > >
> > > > +module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
> > > > module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
> > > > module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
> > > > +
> > > > +module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > > module_param(rcu_kick_kthreads, bool, 0644);
> > > >
> > > > static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
> > > > --
> > > > 1.9.1
> > > >
> > >
> >
On Tue, Jul 09, 2019 at 02:58:16PM +0900, Byungchul Park wrote:
> On Mon, Jul 08, 2019 at 09:03:59AM -0400, Joel Fernandes wrote:
> > > Actually, the intent was to only allow this to be changed at boot time.
> > > Of course, if there is now a good reason to adjust it, it needs
> > > to be adjustable. So what situation is making you want to change
> > > jiffies_till_sched_qs at runtime? To what values is it proving useful
> > > to adjust it? What (if any) relationships between this timeout and the
> > > various other RCU timeouts need to be maintained? What changes to
> > > rcutorture should be applied in order to test the ability to change
> > > this at runtime?
> >
> > I am also interested in the context, are you changing it at runtime for
> > experimentation? I recently was doing some performance experiments and it is
> > quite interesting how reducing this value can shorten grace period times :)
>
> Hi Joel,
>
> I've read a thread talking about your experiment to see how the grace
> periods change depending on the tunnable variables which was interesting
> to me. While reading it, I found out jiffies_till_sched_qs is not
> tunnable at runtime unlike jiffies_till_{first,next}_fqs which looks
> like non-sense to me that's why I tried this patch. :)
>
> Hi Paul,
>
> IMHO, as much as we want to tune the time for fqs to be initiated, we
> can also want to tune the time for the help from scheduler to start.
Let me mention one more thing here...
This is about jiffies_till_sched_qs, which, I think, is used to directly set
jiffies_to_sched_qs.
> I thought only difference between them is a level of urgency.
Of course, they are coupled in the case where jiffies_to_sched_qs is
calculated from jiffies_till_{first,next}_fqs; here I'm just talking about
its original motivation or concept.
Thanks,
Byungchul
On Tue, Jul 09, 2019 at 02:58:16PM +0900, Byungchul Park wrote:
> On Mon, Jul 08, 2019 at 09:03:59AM -0400, Joel Fernandes wrote:
> > > Actually, the intent was to only allow this to be changed at boot time.
> > > Of course, if there is now a good reason to adjust it, it needs
> > > to be adjustable. So what situation is making you want to change
> > > jiffies_till_sched_qs at runtime? To what values is it proving useful
> > > to adjust it? What (if any) relationships between this timeout and the
> > > various other RCU timeouts need to be maintained? What changes to
> > > rcutorture should be applied in order to test the ability to change
> > > this at runtime?
> >
> > I am also interested in the context, are you changing it at runtime for
> > experimentation? I recently was doing some performance experiments and it is
> > quite interesting how reducing this value can shorten grace period times :)
>
> Hi Joel,
>
> I've read a thread talking about your experiment to see how the grace
> periods change depending on the tunnable variables which was interesting
> to me. While reading it, I found out jiffies_till_sched_qs is not
> tunnable at runtime unlike jiffies_till_{first,next}_fqs which looks
> like non-sense to me that's why I tried this patch. :)
>
> Hi Paul,
>
> IMHO, as much as we want to tune the time for fqs to be initiated, we
> can also want to tune the time for the help from scheduler to start.
> I thought only difference between them is a level of urgency. I might be
> wrong. It would be appreciated if you let me know if I miss something.
Hello, Byungchul,
I understand that one hypothetically might want to tune this at runtime,
but have you had need to tune this at runtime on a real production
workload? If so, what problem was happening that caused you to want to
do this tuning?
> And it's ok even if the patch is turned down based on your criteria. :)
If there is a real need, something needs to be provided to meet that
need. But in the absence of a real need, past experience has shown
that speculative tuning knobs usually do more harm than good. ;-)
Hence my question to you about a real need.
Thanx, Paul
> Thanks,
> Byungchul
>
> > Joel
> >
> >
> > > Thanx, Paul
> > >
> > > > The function for setting jiffies_to_sched_qs,
> > > > adjust_jiffies_till_sched_qs() will be called only if
> > > > the value from sysfs != ULONG_MAX. And the value won't be adjusted
> > > > unlike first/next fqs jiffies.
> > > >
> > > > While at it, changed the positions of two module_param()s downward.
> > > >
> > > > Signed-off-by: Byungchul Park <[email protected]>
> > > > ---
> > > > kernel/rcu/tree.c | 22 ++++++++++++++++++++--
> > > > 1 file changed, 20 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index a2f8ba2..a28e2fe 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > > * quiescent-state help from rcu_note_context_switch().
> > > > */
> > > > static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > > -module_param(jiffies_till_sched_qs, ulong, 0444);
> > > > static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > >
> > > > /*
> > > > * Make sure that we give the grace-period kthread time to detect any
> > > > @@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
> > > > WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > }
> > > >
> > > > +static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
> > > > +{
> > > > + ulong j;
> > > > + int ret = kstrtoul(val, 0, &j);
> > > > +
> > > > + if (!ret && j != ULONG_MAX) {
> > > > + WRITE_ONCE(*(ulong *)kp->arg, j);
> > > > + adjust_jiffies_till_sched_qs();
> > > > + }
> > > > + return ret;
> > > > +}
> > > > +
> > > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > > {
> > > > ulong j;
> > > > @@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > > return ret;
> > > > }
> > > >
> > > > +static struct kernel_param_ops sched_qs_jiffies_ops = {
> > > > + .set = param_set_sched_qs_jiffies,
> > > > + .get = param_get_ulong,
> > > > +};
> > > > +
> > > > static struct kernel_param_ops first_fqs_jiffies_ops = {
> > > > .set = param_set_first_fqs_jiffies,
> > > > .get = param_get_ulong,
> > > > @@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > > .get = param_get_ulong,
> > > > };
> > > >
> > > > +module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
> > > > module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
> > > > module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
> > > > +
> > > > +module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > > module_param(rcu_kick_kthreads, bool, 0644);
> > > >
> > > > static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
> > > > --
> > > > 1.9.1
> > > >
> > >
>
On Tue, Jul 09, 2019 at 03:05:58PM +0900, Byungchul Park wrote:
> On Mon, Jul 08, 2019 at 06:19:42AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 08, 2019 at 09:03:59AM -0400, Joel Fernandes wrote:
> > > Good morning!
> > >
> > > On Mon, Jul 08, 2019 at 05:50:13AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Jul 08, 2019 at 03:00:09PM +0900, Byungchul Park wrote:
> > > > > jiffies_till_sched_qs is useless if it's readonly as it is used to set
> > > > > jiffies_to_sched_qs with its value regardless of first/next fqs jiffies.
> > > > > And it should be applied immediately on change through sysfs.
> > >
> > > It is interesting it can be setup at boot time, but not at runtime. I think
> > > this can be mentioned in the change log that it is not really "read-only",
> > > because it is something that can be dynamically changed as a kernel boot
> > > parameter.
> >
> > In Byungchul's defense, the current module_param() permissions are
> > 0444, which really is read-only. Although I do agree that they can
> > be written at boot, one could use this same line of reasoning to argue
> > that const variables can be written at compile time (or, for on-stack
> > const variables, at function-invocation time). But we still call them
> > "const".
>
> ;-)
>
> > > > Actually, the intent was to only allow this to be changed at boot time.
> > > > Of course, if there is now a good reason to adjust it, it needs
> > > > to be adjustable. So what situation is making you want to change
> > > > jiffies_till_sched_qs at runtime? To what values is it proving useful
> > > > to adjust it? What (if any) relationships between this timeout and the
> > > > various other RCU timeouts need to be maintained? What changes to
> > > > rcutorture should be applied in order to test the ability to change
> > > > this at runtime?
> > >
> > > I am also interested in the context, are you changing it at runtime for
> > > experimentation? I recently was doing some performance experiments and it is
> > > quite interesting how reducing this value can shorten grace period times :)
> >
> > If you -really- want to reduce grace-period latencies, you can always
> > boot with rcupdate.rcu_expedited=1. ;-)
>
> It's a quite different mechanism at the moment though... :(
Quite true, but a rather effective mechanism for reducing grace-period
latency, give or take the needs of aggressive real-time workloads.
> > If you want to reduce grace-period latencies, but without all the IPIs
> > that expedited grace periods give you, the rcutree.jiffies_till_first_fqs
> > and rcutree.jiffies_till_next_fqs kernel boot parameters might be better
> > places to start than rcutree.jiffies_till_sched_qs. For one thing,
> > adjusting these two affects the value of jiffies_till_sched_qs.
>
> Do you mean:
>
> adjusting these two affects the value of *jiffies_to_sched_qs* (instead
> of jiffies_till_sched_qs).
>
> Right?
Yes, and apologies for my confusion!
Thanx, Paul
> Thanks,
> Byungchul
>
> >
> > Thanx, Paul
> >
> > > Joel
> > >
> > >
> > > > Thanx, Paul
> > > >
> > > > > The function for setting jiffies_to_sched_qs,
> > > > > adjust_jiffies_till_sched_qs() will be called only if
> > > > > the value from sysfs != ULONG_MAX. And the value won't be adjusted
> > > > > unlike first/next fqs jiffies.
> > > > >
> > > > > While at it, changed the positions of two module_param()s downward.
> > > > >
> > > > > Signed-off-by: Byungchul Park <[email protected]>
> > > > > ---
> > > > > kernel/rcu/tree.c | 22 ++++++++++++++++++++--
> > > > > 1 file changed, 20 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index a2f8ba2..a28e2fe 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -422,9 +422,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > > > * quiescent-state help from rcu_note_context_switch().
> > > > > */
> > > > > static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > > > -module_param(jiffies_till_sched_qs, ulong, 0444);
> > > > > static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > > >
> > > > > /*
> > > > > * Make sure that we give the grace-period kthread time to detect any
> > > > > @@ -450,6 +448,18 @@ static void adjust_jiffies_till_sched_qs(void)
> > > > > WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > > }
> > > > >
> > > > > +static int param_set_sched_qs_jiffies(const char *val, const struct kernel_param *kp)
> > > > > +{
> > > > > + ulong j;
> > > > > + int ret = kstrtoul(val, 0, &j);
> > > > > +
> > > > > + if (!ret && j != ULONG_MAX) {
> > > > > + WRITE_ONCE(*(ulong *)kp->arg, j);
> > > > > + adjust_jiffies_till_sched_qs();
> > > > > + }
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > > > {
> > > > > ulong j;
> > > > > @@ -474,6 +484,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > > > return ret;
> > > > > }
> > > > >
> > > > > +static struct kernel_param_ops sched_qs_jiffies_ops = {
> > > > > + .set = param_set_sched_qs_jiffies,
> > > > > + .get = param_get_ulong,
> > > > > +};
> > > > > +
> > > > > static struct kernel_param_ops first_fqs_jiffies_ops = {
> > > > > .set = param_set_first_fqs_jiffies,
> > > > > .get = param_get_ulong,
> > > > > @@ -484,8 +499,11 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param
> > > > > .get = param_get_ulong,
> > > > > };
> > > > >
> > > > > +module_param_cb(jiffies_till_sched_qs, &sched_qs_jiffies_ops, &jiffies_till_sched_qs, 0644);
> > > > > module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
> > > > > module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
> > > > > +
> > > > > +module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > > > module_param(rcu_kick_kthreads, bool, 0644);
> > > > >
> > > > > static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
> > > > > --
> > > > > 1.9.1
> > > > >
> > > >
> > >
>
On Tue, Jul 09, 2019 at 05:41:02AM -0700, Paul E. McKenney wrote:
> > Hi Paul,
> >
> > IMHO, as much as we want to tune the time for fqs to be initiated, we
> > can also want to tune the time for the help from scheduler to start.
> > I thought only difference between them is a level of urgency. I might be
> > wrong. It would be appreciated if you let me know if I miss something.
>
> Hello, Byungchul,
>
> I understand that one hypothetically might want to tune this at runtime,
> but have you had need to tune this at runtime on a real production
> workload? If so, what problem was happening that caused you to want to
> do this tuning?
Not actually.
> > And it's ok even if the patch is turned down based on your criteria. :)
>
> If there is a real need, something needs to be provided to meet that
> need. But in the absence of a real need, past experience has shown
> that speculative tuning knobs usually do more harm than good. ;-)
It makes sense that "speculative tuning knobs usually do more harm than
good".

Then it would be better to leave jiffies_till_{first,next}_fqs tunable, but
not to make jiffies_till_sched_qs tunable until we need it.
However,
(1) In case that jiffies_till_sched_qs is tunable:

We might need all of jiffies_till_{first,next}_fqs,
jiffies_till_sched_qs and jiffies_to_sched_qs, because
jiffies_to_sched_qs can be affected by any of them. So we
should be able to read each value at any time.

(2) In case that jiffies_till_sched_qs is not tunable:

I think we don't have to keep jiffies_till_sched_qs any
longer, since it is only used to set jiffies_to_sched_qs at
*boot time*, which can be done with jiffies_to_sched_qs too.
It's meaningless to keep all three variables.

The simpler things are, and the fewer knobs we have beyond what we really
need, the better.

What do you think about it?
In the following patch, I (1) removed jiffies_till_sched_qs and then
(2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs, because I
think jiffies_till_sched_qs is a much better name for the purpose. I
will resend it with a commit message once I know your opinion on it.
Thanks,
Byungchul
---8<---
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e72c184..94b58f5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3792,10 +3792,6 @@
a value based on the most recent settings
of rcutree.jiffies_till_first_fqs
and rcutree.jiffies_till_next_fqs.
- This calculated value may be viewed in
- rcutree.jiffies_to_sched_qs. Any attempt to set
- rcutree.jiffies_to_sched_qs will be cheerfully
- overwritten.
rcutree.kthread_prio= [KNL,BOOT]
Set the SCHED_FIFO priority of the RCU per-CPU
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a2f8ba2..ad9dc86 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -421,10 +421,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
* How long the grace period must be before we start recruiting
* quiescent-state help from rcu_note_context_switch().
*/
-static ulong jiffies_till_sched_qs = ULONG_MAX;
+static ulong jiffies_till_sched_qs = ULONG_MAX; /* See adjust_jiffies_till_sched_qs(). */
module_param(jiffies_till_sched_qs, ulong, 0444);
-static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
-module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
/*
* Make sure that we give the grace-period kthread time to detect any
@@ -436,18 +434,13 @@ static void adjust_jiffies_till_sched_qs(void)
{
unsigned long j;
- /* If jiffies_till_sched_qs was specified, respect the request. */
- if (jiffies_till_sched_qs != ULONG_MAX) {
- WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
- return;
- }
/* Otherwise, set to third fqs scan, but bound below on large system. */
j = READ_ONCE(jiffies_till_first_fqs) +
2 * READ_ONCE(jiffies_till_next_fqs);
if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
- WRITE_ONCE(jiffies_to_sched_qs, j);
+ WRITE_ONCE(jiffies_till_sched_qs, j);
}
static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
@@ -1033,16 +1026,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
/*
* A CPU running for an extended time within the kernel can
- * delay RCU grace periods: (1) At age jiffies_to_sched_qs,
- * set .rcu_urgent_qs, (2) At age 2*jiffies_to_sched_qs, set
+ * delay RCU grace periods: (1) At age jiffies_till_sched_qs,
+ * set .rcu_urgent_qs, (2) At age 2*jiffies_till_sched_qs, set
* both .rcu_need_heavy_qs and .rcu_urgent_qs. Note that the
* unsynchronized assignments to the per-CPU rcu_need_heavy_qs
* variable are safe because the assignments are repeated if this
* CPU failed to pass through a quiescent state. This code
- * also checks .jiffies_resched in case jiffies_to_sched_qs
+ * also checks .jiffies_resched in case jiffies_till_sched_qs
* is set way high.
*/
- jtsq = READ_ONCE(jiffies_to_sched_qs);
+ jtsq = READ_ONCE(jiffies_till_sched_qs);
ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
if (!READ_ONCE(*rnhqp) &&
@@ -3383,7 +3376,8 @@ static void __init rcu_init_geometry(void)
jiffies_till_first_fqs = d;
if (jiffies_till_next_fqs == ULONG_MAX)
jiffies_till_next_fqs = d;
- adjust_jiffies_till_sched_qs();
+ if (jiffies_till_sched_qs == ULONG_MAX)
+ adjust_jiffies_till_sched_qs();
/* If the compile-time values are accurate, just leave. */
if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
On Wed, Jul 10, 2019 at 10:20:25AM +0900, Byungchul Park wrote:
> On Tue, Jul 09, 2019 at 05:41:02AM -0700, Paul E. McKenney wrote:
> > > Hi Paul,
> > >
> > > IMHO, as much as we want to tune the time for fqs to be initiated, we
> > > can also want to tune the time for the help from scheduler to start.
> > > I thought only difference between them is a level of urgency. I might be
> > > wrong. It would be appreciated if you let me know if I miss something.
> >
> > Hello, Byungchul,
> >
> > I understand that one hypothetically might want to tune this at runtime,
> > but have you had need to tune this at runtime on a real production
> > workload? If so, what problem was happening that caused you to want to
> > do this tuning?
>
> Not actually.
>
> > > And it's ok even if the patch is turned down based on your criteria. :)
> >
> > If there is a real need, something needs to be provided to meet that
> > need. But in the absence of a real need, past experience has shown
> > that speculative tuning knobs usually do more harm than good. ;-)
>
> It makes sense, "A speculative tuning knobs do more harm than good".
>
> Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> but jiffies_till_sched_qs until we need it.
>
> However,
>
> (1) In case that jiffies_till_sched_qs is tunnable:
>
> We might need all of jiffies_till_{first,next}_qs,
> jiffies_till_sched_qs and jiffies_to_sched_qs because
> jiffies_to_sched_qs can be affected by any of them. So we
> should be able to read each value at any time.
>
> (2) In case that jiffies_till_sched_qs is not tunnable:
>
> I think we don't have to keep the jiffies_till_sched_qs any
> longer since that's only for setting jiffies_to_sched_qs at
> *booting time*, which can be done with jiffies_to_sched_qs too.
> It's meaningless to keep all of tree variables.
>
> The simpler and less knobs that we really need we have, the better.
>
> what do you think about it?
>
> In the following patch, I (1) removed jiffies_till_sched_qs and then
> (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> think jiffies_till_sched_qs is a much better name for the purpose. I
> will resend it with a commit msg after knowing your opinion on it.
I will give you a definite "maybe".
Here are the two reasons for changing RCU's embarrassingly large array
of tuning parameters:
1. They are causing a problem in production. This would represent a
bug that clearly must be fixed. As you say, this change is not
in this category.
2. The change simplifies either RCU's code or the process of tuning
RCU, but without degrading RCU's ability to run everywhere and
without removing debugging tools.
The change below clearly simplifies things by removing a few lines of
code, and it does not change RCU's default self-configuration. But are
we sure about the debugging aspect? (Please keep in mind that many more
sites are willing to change boot parameters than are willing to patch
their kernels.)
What do you think?
Finally, I urge you to join with Joel Fernandes and go through these
grace-period-duration tuning parameters. Once you guys get your heads
completely around all of them and how they interact across the different
possible RCU configurations, I bet that the two of you will have excellent
ideas for improvement.
Thanx, Paul
> Thanks,
> Byungchul
>
> ---8<---
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index e72c184..94b58f5 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3792,10 +3792,6 @@
> a value based on the most recent settings
> of rcutree.jiffies_till_first_fqs
> and rcutree.jiffies_till_next_fqs.
> - This calculated value may be viewed in
> - rcutree.jiffies_to_sched_qs. Any attempt to set
> - rcutree.jiffies_to_sched_qs will be cheerfully
> - overwritten.
>
> rcutree.kthread_prio= [KNL,BOOT]
> Set the SCHED_FIFO priority of the RCU per-CPU
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a2f8ba2..ad9dc86 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -421,10 +421,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> * How long the grace period must be before we start recruiting
> * quiescent-state help from rcu_note_context_switch().
> */
> -static ulong jiffies_till_sched_qs = ULONG_MAX;
> +static ulong jiffies_till_sched_qs = ULONG_MAX; /* See adjust_jiffies_till_sched_qs(). */
> module_param(jiffies_till_sched_qs, ulong, 0444);
> -static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
>
> /*
> * Make sure that we give the grace-period kthread time to detect any
> @@ -436,18 +434,13 @@ static void adjust_jiffies_till_sched_qs(void)
> {
> unsigned long j;
>
> - /* If jiffies_till_sched_qs was specified, respect the request. */
> - if (jiffies_till_sched_qs != ULONG_MAX) {
> - WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> - return;
> - }
> /* Otherwise, set to third fqs scan, but bound below on large system. */
> j = READ_ONCE(jiffies_till_first_fqs) +
> 2 * READ_ONCE(jiffies_till_next_fqs);
> if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> - WRITE_ONCE(jiffies_to_sched_qs, j);
> + WRITE_ONCE(jiffies_till_sched_qs, j);
> }
>
> static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> @@ -1033,16 +1026,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
>
> /*
> * A CPU running for an extended time within the kernel can
> - * delay RCU grace periods: (1) At age jiffies_to_sched_qs,
> - * set .rcu_urgent_qs, (2) At age 2*jiffies_to_sched_qs, set
> + * delay RCU grace periods: (1) At age jiffies_till_sched_qs,
> + * set .rcu_urgent_qs, (2) At age 2*jiffies_till_sched_qs, set
> * both .rcu_need_heavy_qs and .rcu_urgent_qs. Note that the
> * unsynchronized assignments to the per-CPU rcu_need_heavy_qs
> * variable are safe because the assignments are repeated if this
> * CPU failed to pass through a quiescent state. This code
> - * also checks .jiffies_resched in case jiffies_to_sched_qs
> + * also checks .jiffies_resched in case jiffies_till_sched_qs
> * is set way high.
> */
> - jtsq = READ_ONCE(jiffies_to_sched_qs);
> + jtsq = READ_ONCE(jiffies_till_sched_qs);
> ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
> rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
> if (!READ_ONCE(*rnhqp) &&
> @@ -3383,7 +3376,8 @@ static void __init rcu_init_geometry(void)
> jiffies_till_first_fqs = d;
> if (jiffies_till_next_fqs == ULONG_MAX)
> jiffies_till_next_fqs = d;
> - adjust_jiffies_till_sched_qs();
> + if (jiffies_till_sched_qs == ULONG_MAX)
> + adjust_jiffies_till_sched_qs();
>
> /* If the compile-time values are accurate, just leave. */
> if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
On Thu, Jul 11, 2019 at 05:30:52AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 10, 2019 at 10:20:25AM +0900, Byungchul Park wrote:
> > On Tue, Jul 09, 2019 at 05:41:02AM -0700, Paul E. McKenney wrote:
> > > > Hi Paul,
> > > >
> > > > IMHO, as much as we want to tune the time for fqs to be initiated, we
> > > > can also want to tune the time for the help from scheduler to start.
> > > > I thought only difference between them is a level of urgency. I might be
> > > > wrong. It would be appreciated if you let me know if I miss something.
> > >
> > > Hello, Byungchul,
> > >
> > > I understand that one hypothetically might want to tune this at runtime,
> > > but have you had need to tune this at runtime on a real production
> > > workload? If so, what problem was happening that caused you to want to
> > > do this tuning?
> >
> > Not actually.
> >
> > > > And it's ok even if the patch is turned down based on your criteria. :)
> > >
> > > If there is a real need, something needs to be provided to meet that
> > > need. But in the absence of a real need, past experience has shown
> > > that speculative tuning knobs usually do more harm than good. ;-)
> >
> > It makes sense, "A speculative tuning knobs do more harm than good".
> >
> > Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> > but jiffies_till_sched_qs until we need it.
> >
> > However,
> >
> > (1) In case that jiffies_till_sched_qs is tunnable:
> >
> > We might need all of jiffies_till_{first,next}_qs,
> > jiffies_till_sched_qs and jiffies_to_sched_qs because
> > jiffies_to_sched_qs can be affected by any of them. So we
> > should be able to read each value at any time.
> >
> > (2) In case that jiffies_till_sched_qs is not tunnable:
> >
> > I think we don't have to keep the jiffies_till_sched_qs any
> > longer since that's only for setting jiffies_to_sched_qs at
> > *booting time*, which can be done with jiffies_to_sched_qs too.
> > It's meaningless to keep all of tree variables.
> >
> > The simpler and less knobs that we really need we have, the better.
> >
> > what do you think about it?
> >
> > In the following patch, I (1) removed jiffies_till_sched_qs and then
> > (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> > think jiffies_till_sched_qs is a much better name for the purpose. I
> > will resend it with a commit msg after knowing your opinion on it.
>
> I will give you a definite "maybe".
>
> Here are the two reasons for changing RCU's embarrassingly large array
> of tuning parameters:
>
> 1. They are causing a problem in production. This would represent a
> bug that clearly must be fixed. As you say, this change is not
> in this category.
>
> 2. The change simplifies either RCU's code or the process of tuning
> RCU, but without degrading RCU's ability to run everywhere and
> without removing debugging tools.
>
> The change below clearly simplifies things by removing a few lines of
> code, and it does not change RCU's default self-configuration. But are
> we sure about the debugging aspect? (Please keep in mind that many more
> sites are willing to change boot parameters than are willing to patch
> their kernels.)
>
> What do you think?
Just to add that, independent of whether the runtime tunable makes sense or
not, maybe it is still worth correcting the 0444 to 0644 in a separate
patch?
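Something as small as this one-liner against tree.c, just for illustration (by itself it would not re-run adjust_jiffies_till_sched_qs() on a write, so jiffies_to_sched_qs would still not pick up the new value without a setter callback like the one in the patch above):

-module_param(jiffies_till_sched_qs, ulong, 0444);
+module_param(jiffies_till_sched_qs, ulong, 0644);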
> Finally, I urge you to join with Joel Fernandes and go through these
> grace-period-duration tuning parameters. Once you guys get your heads
> completely around all of them and how they interact across the different
> possible RCU configurations, I bet that the two of you will have excellent
> ideas for improvement.
Yes, I am quite happy to join forces. Byungchul, let me know your thoughts about
this or other things you had in mind. I have some other RCU topics too that I am
trying to get my head around, and I am planning to work on more patches.
Paul, in case you had any other specific tunables or experiments in mind, let
me know. I am quite happy to try out new experiments and learn something
based on tuning something.
thanks,
- Joel
> > Thanks,
> > Byungchul
> >
> > ---8<---
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index e72c184..94b58f5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3792,10 +3792,6 @@
> > a value based on the most recent settings
> > of rcutree.jiffies_till_first_fqs
> > and rcutree.jiffies_till_next_fqs.
> > - This calculated value may be viewed in
> > - rcutree.jiffies_to_sched_qs. Any attempt to set
> > - rcutree.jiffies_to_sched_qs will be cheerfully
> > - overwritten.
> >
> > rcutree.kthread_prio= [KNL,BOOT]
> > Set the SCHED_FIFO priority of the RCU per-CPU
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index a2f8ba2..ad9dc86 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -421,10 +421,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > * How long the grace period must be before we start recruiting
> > * quiescent-state help from rcu_note_context_switch().
> > */
> > -static ulong jiffies_till_sched_qs = ULONG_MAX;
> > +static ulong jiffies_till_sched_qs = ULONG_MAX; /* See adjust_jiffies_till_sched_qs(). */
> > module_param(jiffies_till_sched_qs, ulong, 0444);
> > -static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> >
> > /*
> > * Make sure that we give the grace-period kthread time to detect any
> > @@ -436,18 +434,13 @@ static void adjust_jiffies_till_sched_qs(void)
> > {
> > unsigned long j;
> >
> > - /* If jiffies_till_sched_qs was specified, respect the request. */
> > - if (jiffies_till_sched_qs != ULONG_MAX) {
> > - WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > - return;
> > - }
> > /* Otherwise, set to third fqs scan, but bound below on large system. */
> > j = READ_ONCE(jiffies_till_first_fqs) +
> > 2 * READ_ONCE(jiffies_till_next_fqs);
> > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > - WRITE_ONCE(jiffies_to_sched_qs, j);
> > + WRITE_ONCE(jiffies_till_sched_qs, j);
> > }
> >
> > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > @@ -1033,16 +1026,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> >
> > /*
> > * A CPU running for an extended time within the kernel can
> > - * delay RCU grace periods: (1) At age jiffies_to_sched_qs,
> > - * set .rcu_urgent_qs, (2) At age 2*jiffies_to_sched_qs, set
> > + * delay RCU grace periods: (1) At age jiffies_till_sched_qs,
> > + * set .rcu_urgent_qs, (2) At age 2*jiffies_till_sched_qs, set
> > * both .rcu_need_heavy_qs and .rcu_urgent_qs. Note that the
> > * unsynchronized assignments to the per-CPU rcu_need_heavy_qs
> > * variable are safe because the assignments are repeated if this
> > * CPU failed to pass through a quiescent state. This code
> > - * also checks .jiffies_resched in case jiffies_to_sched_qs
> > + * also checks .jiffies_resched in case jiffies_till_sched_qs
> > * is set way high.
> > */
> > - jtsq = READ_ONCE(jiffies_to_sched_qs);
> > + jtsq = READ_ONCE(jiffies_till_sched_qs);
> > ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
> > rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
> > if (!READ_ONCE(*rnhqp) &&
> > @@ -3383,7 +3376,8 @@ static void __init rcu_init_geometry(void)
> > jiffies_till_first_fqs = d;
> > if (jiffies_till_next_fqs == ULONG_MAX)
> > jiffies_till_next_fqs = d;
> > - adjust_jiffies_till_sched_qs();
> > + if (jiffies_till_sched_qs == ULONG_MAX)
> > + adjust_jiffies_till_sched_qs();
> >
> > /* If the compile-time values are accurate, just leave. */
> > if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
>
On Thu, Jul 11, 2019 at 09:08:49AM -0400, Joel Fernandes wrote:
> On Thu, Jul 11, 2019 at 05:30:52AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 10, 2019 at 10:20:25AM +0900, Byungchul Park wrote:
> > > On Tue, Jul 09, 2019 at 05:41:02AM -0700, Paul E. McKenney wrote:
> > > > > Hi Paul,
> > > > >
> > > > > IMHO, as much as we want to tune the time for fqs to be initiated, we
> > > > > can also want to tune the time for the help from scheduler to start.
> > > > > I thought only difference between them is a level of urgency. I might be
> > > > > wrong. It would be appreciated if you let me know if I miss something.
> > > >
> > > > Hello, Byungchul,
> > > >
> > > > I understand that one hypothetically might want to tune this at runtime,
> > > > but have you had need to tune this at runtime on a real production
> > > > workload? If so, what problem was happening that caused you to want to
> > > > do this tuning?
> > >
> > > Not actually.
> > >
> > > > > And it's ok even if the patch is turned down based on your criteria. :)
> > > >
> > > > If there is a real need, something needs to be provided to meet that
> > > > need. But in the absence of a real need, past experience has shown
> > > > that speculative tuning knobs usually do more harm than good. ;-)
> > >
> > > It makes sense, "A speculative tuning knobs do more harm than good".
> > >
> > > Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> > > but jiffies_till_sched_qs until we need it.
> > >
> > > However,
> > >
> > > (1) In case that jiffies_till_sched_qs is tunnable:
> > >
> > > We might need all of jiffies_till_{first,next}_qs,
> > > jiffies_till_sched_qs and jiffies_to_sched_qs because
> > > jiffies_to_sched_qs can be affected by any of them. So we
> > > should be able to read each value at any time.
> > >
> > > (2) In case that jiffies_till_sched_qs is not tunnable:
> > >
> > > I think we don't have to keep the jiffies_till_sched_qs any
> > > longer since that's only for setting jiffies_to_sched_qs at
> > > *booting time*, which can be done with jiffies_to_sched_qs too.
> > > It's meaningless to keep all of tree variables.
> > >
> > > The simpler and less knobs that we really need we have, the better.
> > >
> > > what do you think about it?
> > >
> > > In the following patch, I (1) removed jiffies_till_sched_qs and then
> > > (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> > > think jiffies_till_sched_qs is a much better name for the purpose. I
> > > will resend it with a commit msg after knowing your opinion on it.
> >
> > I will give you a definite "maybe".
> >
> > Here are the two reasons for changing RCU's embarrassingly large array
> > of tuning parameters:
> >
> > 1. They are causing a problem in production. This would represent a
> > bug that clearly must be fixed. As you say, this change is not
> > in this category.
> >
> > 2. The change simplifies either RCU's code or the process of tuning
> > RCU, but without degrading RCU's ability to run everywhere and
> > without removing debugging tools.
> >
> > The change below clearly simplifies things by removing a few lines of
> > code, and it does not change RCU's default self-configuration. But are
> > we sure about the debugging aspect? (Please keep in mind that many more
> > sites are willing to change boot parameters than are willing to patch
> > their kernels.)
> >
> > What do you think?
>
> Just to add that independent of whether the runtime tunable make sense or
> not, may be it is still worth correcting the 0444 to be 0644 to be a separate
> patch?
You lost me on this one. Doesn't changing from 0444 to 0644 make it be
a runtime tunable?
> > Finally, I urge you to join with Joel Fernandes and go through these
> > grace-period-duration tuning parameters. Once you guys get your heads
> > completely around all of them and how they interact across the different
> > possible RCU configurations, I bet that the two of you will have excellent
> > ideas for improvement.
>
> Yes, I am quite happy to join forces. Byungchul, let me know what about this
> or other things you had in mind. I have some other RCU topics too I am trying
> to get my head around and planning to work on more patches.
>
> Paul, in case you had any other specific tunables or experiments in mind, let
> me know. I am quite happy to try out new experiments and learn something
> based on tuning something.
These would be the tunables controlling how quickly RCU takes its
various actions to encourage the current grace period to end quickly.
I would be happy to give you the exact list if you wish, but most of
them have appeared in this thread.
The experiments should be designed to work out whether the current
default settings have configurations where they act badly. This might
also come up with advice for people attempting hand-tuning, or proposed
parameter-checking code to avoid bad combinations.
For one example, setting the RCU CPU stall timeout too low will definitely
cause some unwanted splats. (Yes, one could argue that other things in
the kernel should change to allow this value to decrease, but things
like latency tracer and friends are probably more useful and important.)
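As for parameter-checking code, one shape it could take is sketched below. This is purely illustrative, not existing code; it assumes the current meanings of rcu_cpu_stall_timeout (seconds) and jiffies_to_sched_qs (jiffies), and the threshold is arbitrary:

static void __init rcu_sanity_check_tunables(void)
{
	unsigned long stall_j = rcu_cpu_stall_timeout * HZ;

	/* Warn if scheduler enlistment starts at or beyond half the stall timeout. */
	if (2 * jiffies_to_sched_qs >= stall_j)
		pr_warn("RCU: jiffies_to_sched_qs=%lu is at least half the stall timeout (%lu jiffies); expect splats\n",
			jiffies_to_sched_qs, stall_j);
}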
Thanx, Paul
> thanks,
>
> - Joel
>
>
>
> > > Thanks,
> > > Byungchul
> > >
> > > ---8<---
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index e72c184..94b58f5 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -3792,10 +3792,6 @@
> > > a value based on the most recent settings
> > > of rcutree.jiffies_till_first_fqs
> > > and rcutree.jiffies_till_next_fqs.
> > > - This calculated value may be viewed in
> > > - rcutree.jiffies_to_sched_qs. Any attempt to set
> > > - rcutree.jiffies_to_sched_qs will be cheerfully
> > > - overwritten.
> > >
> > > rcutree.kthread_prio= [KNL,BOOT]
> > > Set the SCHED_FIFO priority of the RCU per-CPU
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a2f8ba2..ad9dc86 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -421,10 +421,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > * How long the grace period must be before we start recruiting
> > > * quiescent-state help from rcu_note_context_switch().
> > > */
> > > -static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > +static ulong jiffies_till_sched_qs = ULONG_MAX; /* See adjust_jiffies_till_sched_qs(). */
> > > module_param(jiffies_till_sched_qs, ulong, 0444);
> > > -static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > >
> > > /*
> > > * Make sure that we give the grace-period kthread time to detect any
> > > @@ -436,18 +434,13 @@ static void adjust_jiffies_till_sched_qs(void)
> > > {
> > > unsigned long j;
> > >
> > > - /* If jiffies_till_sched_qs was specified, respect the request. */
> > > - if (jiffies_till_sched_qs != ULONG_MAX) {
> > > - WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > > - return;
> > > - }
> > > /* Otherwise, set to third fqs scan, but bound below on large system. */
> > > j = READ_ONCE(jiffies_till_first_fqs) +
> > > 2 * READ_ONCE(jiffies_till_next_fqs);
> > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > - WRITE_ONCE(jiffies_to_sched_qs, j);
> > > + WRITE_ONCE(jiffies_till_sched_qs, j);
> > > }
> > >
> > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > @@ -1033,16 +1026,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > >
> > > /*
> > > * A CPU running for an extended time within the kernel can
> > > - * delay RCU grace periods: (1) At age jiffies_to_sched_qs,
> > > - * set .rcu_urgent_qs, (2) At age 2*jiffies_to_sched_qs, set
> > > + * delay RCU grace periods: (1) At age jiffies_till_sched_qs,
> > > + * set .rcu_urgent_qs, (2) At age 2*jiffies_till_sched_qs, set
> > > * both .rcu_need_heavy_qs and .rcu_urgent_qs. Note that the
> > > * unsynchronized assignments to the per-CPU rcu_need_heavy_qs
> > > * variable are safe because the assignments are repeated if this
> > > * CPU failed to pass through a quiescent state. This code
> > > - * also checks .jiffies_resched in case jiffies_to_sched_qs
> > > + * also checks .jiffies_resched in case jiffies_till_sched_qs
> > > * is set way high.
> > > */
> > > - jtsq = READ_ONCE(jiffies_to_sched_qs);
> > > + jtsq = READ_ONCE(jiffies_till_sched_qs);
> > > ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
> > > rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
> > > if (!READ_ONCE(*rnhqp) &&
> > > @@ -3383,7 +3376,8 @@ static void __init rcu_init_geometry(void)
> > > jiffies_till_first_fqs = d;
> > > if (jiffies_till_next_fqs == ULONG_MAX)
> > > jiffies_till_next_fqs = d;
> > > - adjust_jiffies_till_sched_qs();
> > > + if (jiffies_till_sched_qs == ULONG_MAX)
> > > + adjust_jiffies_till_sched_qs();
> > >
> > > /* If the compile-time values are accurate, just leave. */
> > > if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
> >
On Thu, Jul 11, 2019 at 08:02:15AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 11, 2019 at 09:08:49AM -0400, Joel Fernandes wrote:
> > On Thu, Jul 11, 2019 at 05:30:52AM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 10, 2019 at 10:20:25AM +0900, Byungchul Park wrote:
> > > > On Tue, Jul 09, 2019 at 05:41:02AM -0700, Paul E. McKenney wrote:
> > > > > > Hi Paul,
> > > > > >
> > > > > > IMHO, as much as we want to tune the time for fqs to be initiated, we
> > > > > > can also want to tune the time for the help from scheduler to start.
> > > > > > I thought only difference between them is a level of urgency. I might be
> > > > > > wrong. It would be appreciated if you let me know if I miss something.
> > > > >
> > > > > Hello, Byungchul,
> > > > >
> > > > > I understand that one hypothetically might want to tune this at runtime,
> > > > > but have you had need to tune this at runtime on a real production
> > > > > workload? If so, what problem was happening that caused you to want to
> > > > > do this tuning?
> > > >
> > > > Not actually.
> > > >
> > > > > > And it's ok even if the patch is turned down based on your criteria. :)
> > > > >
> > > > > If there is a real need, something needs to be provided to meet that
> > > > > need. But in the absence of a real need, past experience has shown
> > > > > that speculative tuning knobs usually do more harm than good. ;-)
> > > >
> > > > It makes sense, "A speculative tuning knobs do more harm than good".
> > > >
> > > > Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> > > > but jiffies_till_sched_qs until we need it.
> > > >
> > > > However,
> > > >
> > > > (1) In case that jiffies_till_sched_qs is tunnable:
> > > >
> > > > We might need all of jiffies_till_{first,next}_qs,
> > > > jiffies_till_sched_qs and jiffies_to_sched_qs because
> > > > jiffies_to_sched_qs can be affected by any of them. So we
> > > > should be able to read each value at any time.
> > > >
> > > > (2) In case that jiffies_till_sched_qs is not tunnable:
> > > >
> > > > I think we don't have to keep the jiffies_till_sched_qs any
> > > > longer since that's only for setting jiffies_to_sched_qs at
> > > > *booting time*, which can be done with jiffies_to_sched_qs too.
> > > > It's meaningless to keep all of tree variables.
> > > >
> > > > The simpler and less knobs that we really need we have, the better.
> > > >
> > > > what do you think about it?
> > > >
> > > > In the following patch, I (1) removed jiffies_till_sched_qs and then
> > > > (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> > > > think jiffies_till_sched_qs is a much better name for the purpose. I
> > > > will resend it with a commit msg after knowing your opinion on it.
> > >
> > > I will give you a definite "maybe".
> > >
> > > Here are the two reasons for changing RCU's embarrassingly large array
> > > of tuning parameters:
> > >
> > > 1. They are causing a problem in production. This would represent a
> > > bug that clearly must be fixed. As you say, this change is not
> > > in this category.
> > >
> > > 2. The change simplifies either RCU's code or the process of tuning
> > > RCU, but without degrading RCU's ability to run everywhere and
> > > without removing debugging tools.
> > >
> > > The change below clearly simplifies things by removing a few lines of
> > > code, and it does not change RCU's default self-configuration. But are
> > > we sure about the debugging aspect? (Please keep in mind that many more
> > > sites are willing to change boot parameters than are willing to patch
> > > their kernels.)
> > >
> > > What do you think?
> >
> > Just to add that independent of whether the runtime tunable make sense or
> > not, may be it is still worth correcting the 0444 to be 0644 to be a separate
> > patch?
>
> You lost me on this one. Doesn't changing from 0444 to 0644 make it be
> a runtime tunable?
I was going by our earlier discussion that the parameter is still writable at
boot time. You mentioned something like the following:
---
In Byungchul's defense, the current module_param() permissions are
0444, which really is read-only. Although I do agree that they can
be written at boot, one could use this same line of reasoning to argue
that const variables can be written at compile time (or, for on-stack
const variables, at function-invocation time). But we still call them
"const".
---
Sorry if I got confused. You are right that we could leave it as read-only.
> > > Finally, I urge you to join with Joel Fernandes and go through these
> > > grace-period-duration tuning parameters. Once you guys get your heads
> > > completely around all of them and how they interact across the different
> > > possible RCU configurations, I bet that the two of you will have excellent
> > > ideas for improvement.
> >
> > Yes, I am quite happy to join forces. Byungchul, let me know what about this
> > or other things you had in mind. I have some other RCU topics too I am trying
> > to get my head around and planning to work on more patches.
> >
> > Paul, in case you had any other specific tunables or experiments in mind, let
> > me know. I am quite happy to try out new experiments and learn something
> > based on tuning something.
>
> These would be the tunables controlling how quickly RCU takes its
> various actions to encourage the current grace period to end quickly.
> I would be happy to give you the exact list if you wish, but most of
> them have appeared in this thread.
>
> The experiments should be designed to work out whether the current
> > default settings have configurations where they act badly. This might
> also come up with advice for people attempting hand-tuning, or proposed
> parameter-checking code to avoid bad combinations.
>
> For one example, setting the RCU CPU stall timeout too low will definitely
> > cause some unwanted splats. (Yes, one could argue that other things in
> the kernel should change to allow this value to decrease, but things
> like latency tracer and friends are probably more useful and important.)
Ok, thank you for the hints. I will try to poke around with these as well. I
am also currently thinking of spending my RCU time on reviving the list
API patches:
https://lore.kernel.org/patchwork/project/lkml/list/?series=396464
I noticed the occasional splat even with my preempt-disable test, but that
was just when trace dumping was interfering, which we discussed at length last
week.
- Joel
On Thu, Jul 11, 2019 at 12:48:18PM -0400, Joel Fernandes wrote:
> On Thu, Jul 11, 2019 at 08:02:15AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 11, 2019 at 09:08:49AM -0400, Joel Fernandes wrote:
> > > On Thu, Jul 11, 2019 at 05:30:52AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 10, 2019 at 10:20:25AM +0900, Byungchul Park wrote:
> > > > > On Tue, Jul 09, 2019 at 05:41:02AM -0700, Paul E. McKenney wrote:
> > > > > > > Hi Paul,
> > > > > > >
> > > > > > > IMHO, as much as we want to tune the time for fqs to be initiated, we
> > > > > > > can also want to tune the time for the help from scheduler to start.
> > > > > > > I thought only difference between them is a level of urgency. I might be
> > > > > > > wrong. It would be appreciated if you let me know if I miss something.
> > > > > >
> > > > > > Hello, Byungchul,
> > > > > >
> > > > > > I understand that one hypothetically might want to tune this at runtime,
> > > > > > but have you had need to tune this at runtime on a real production
> > > > > > workload? If so, what problem was happening that caused you to want to
> > > > > > do this tuning?
> > > > >
> > > > > Not actually.
> > > > >
> > > > > > > And it's ok even if the patch is turned down based on your criteria. :)
> > > > > >
> > > > > > If there is a real need, something needs to be provided to meet that
> > > > > > need. But in the absence of a real need, past experience has shown
> > > > > > that speculative tuning knobs usually do more harm than good. ;-)
> > > > >
> > > > > It makes sense, "A speculative tuning knobs do more harm than good".
> > > > >
> > > > > Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> > > > > but jiffies_till_sched_qs until we need it.
> > > > >
> > > > > However,
> > > > >
> > > > > (1) In case that jiffies_till_sched_qs is tunnable:
> > > > >
> > > > > We might need all of jiffies_till_{first,next}_qs,
> > > > > jiffies_till_sched_qs and jiffies_to_sched_qs because
> > > > > jiffies_to_sched_qs can be affected by any of them. So we
> > > > > should be able to read each value at any time.
> > > > >
> > > > > (2) In case that jiffies_till_sched_qs is not tunnable:
> > > > >
> > > > > I think we don't have to keep the jiffies_till_sched_qs any
> > > > > longer since that's only for setting jiffies_to_sched_qs at
> > > > > *booting time*, which can be done with jiffies_to_sched_qs too.
> > > > > It's meaningless to keep all of tree variables.
> > > > >
> > > > > The simpler and less knobs that we really need we have, the better.
> > > > >
> > > > > what do you think about it?
> > > > >
> > > > > In the following patch, I (1) removed jiffies_till_sched_qs and then
> > > > > (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> > > > > think jiffies_till_sched_qs is a much better name for the purpose. I
> > > > > will resend it with a commit msg after knowing your opinion on it.
> > > >
> > > > I will give you a definite "maybe".
> > > >
> > > > Here are the two reasons for changing RCU's embarrassingly large array
> > > > of tuning parameters:
> > > >
> > > > 1. They are causing a problem in production. This would represent a
> > > > bug that clearly must be fixed. As you say, this change is not
> > > > in this category.
> > > >
> > > > 2. The change simplifies either RCU's code or the process of tuning
> > > > RCU, but without degrading RCU's ability to run everywhere and
> > > > without removing debugging tools.
> > > >
> > > > The change below clearly simplifies things by removing a few lines of
> > > > code, and it does not change RCU's default self-configuration. But are
> > > > we sure about the debugging aspect? (Please keep in mind that many more
> > > > sites are willing to change boot parameters than are willing to patch
> > > > their kernels.)
> > > >
> > > > What do you think?
> > >
> > > Just to add that independent of whether the runtime tunable make sense or
> > > not, may be it is still worth correcting the 0444 to be 0644 to be a separate
> > > patch?
> >
> > You lost me on this one. Doesn't changing from 0444 to 0644 make it be
> > a runtime tunable?
>
> I was going by our earlier discussion that the parameter is still writable at
> boot time. You mentioned something like the following:
> ---
> In Byungchul's defense, the current module_param() permissions are
> 0444, which really is read-only. Although I do agree that they can
> be written at boot, one could use this same line of reasoning to argue
> that const variables can be written at compile time (or, for on-stack
> const variables, at function-invocation time). But we still call them
> "const".
> ---
>
> Sorry if I got confused. You are right that we could leave it as read-only.
>
> > > > Finally, I urge you to join with Joel Fernandes and go through these
> > > > grace-period-duration tuning parameters. Once you guys get your heads
> > > > completely around all of them and how they interact across the different
> > > > possible RCU configurations, I bet that the two of you will have excellent
> > > > ideas for improvement.
> > >
> > > Yes, I am quite happy to join forces. Byungchul, let me know what about this
> > > or other things you had in mind. I have some other RCU topics too I am trying
> > > to get my head around and planning to work on more patches.
> > >
> > > Paul, in case you had any other specific tunables or experiments in mind, let
> > > me know. I am quite happy to try out new experiments and learn something
> > > based on tuning something.
> >
> > These would be the tunables controlling how quickly RCU takes its
> > various actions to encourage the current grace period to end quickly.
> > I would be happy to give you the exact list if you wish, but most of
> > them have appeared in this thread.
> >
> > The experiments should be designed to work out whether the current
> > default settings have configurations where they act badly. This might
> > also come up with advice for people attempting hand-tuning, or proposed
> > parameter-checking code to avoid bad combinations.
> >
> > For one example, setting the RCU CPU stall timeout too low will definitely
> > cause some unwanted splats. (Yes, one could argue that other things in
> > the kernel should change to allow this value to decrease, but things
> > like latency tracer and friends are probably more useful and important.)
>
> Ok, thank you for the hints.
Hmm, speaking of grace period durations, it seems to me the maximum grace
period ever is recorded in rcu_state.gp_max. However it is not read from
anywhere.
Any idea why it was added but not used?
I am interested in dumping this value just for fun, and seeing what I get.
I also wonder whether it would be useful to dump it in rcutorture/rcuperf to find
any issues, or even expose it in /sys or /proc to see what worst-case grace
periods look like.
- Joel
On Thu, Jul 11, 2019 at 05:30:52AM -0700, Paul E. McKenney wrote:
> > > If there is a real need, something needs to be provided to meet that
> > > need. But in the absence of a real need, past experience has shown
> > > that speculative tuning knobs usually do more harm than good. ;-)
> >
> > It makes sense, "A speculative tuning knobs do more harm than good".
> >
> > Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> > but jiffies_till_sched_qs until we need it.
> >
> > However,
> >
> > (1) In case that jiffies_till_sched_qs is tunnable:
> >
> > We might need all of jiffies_till_{first,next}_qs,
> > jiffies_till_sched_qs and jiffies_to_sched_qs because
> > jiffies_to_sched_qs can be affected by any of them. So we
> > should be able to read each value at any time.
> >
> > (2) In case that jiffies_till_sched_qs is not tunnable:
> >
> > I think we don't have to keep the jiffies_till_sched_qs any
> > longer since that's only for setting jiffies_to_sched_qs at
> > *booting time*, which can be done with jiffies_to_sched_qs too.
> > It's meaningless to keep all of tree variables.
> >
> > The simpler and less knobs that we really need we have, the better.
> >
> > what do you think about it?
> >
> > In the following patch, I (1) removed jiffies_till_sched_qs and then
> > (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> > think jiffies_till_sched_qs is a much better name for the purpose. I
> > will resend it with a commit msg after knowing your opinion on it.
Hi Paul,
> I will give you a definite "maybe".
>
> Here are the two reasons for changing RCU's embarrassingly large array
> of tuning parameters:
>
> 1. They are causing a problem in production. This would represent a
> bug that clearly must be fixed. As you say, this change is not
> in this category.
>
> 2. The change simplifies either RCU's code or the process of tuning
> RCU, but without degrading RCU's ability to run everywhere and
> without removing debugging tools.
Agree.
> The change below clearly simplifies things by removing a few lines of
> code, and it does not change RCU's default self-configuration. But are
> we sure about the debugging aspect? (Please keep in mind that many more
I'm sorry, I don't get it. I don't think this patch affects any debugging
ability. How do you think it hurts? Could you explain it a bit more?
> sites are willing to change boot parameters than are willing to patch
> their kernels.)
Right.
> What do you think?
>
> Finally, I urge you to join with Joel Fernandes and go through these
> grace-period-duration tuning parameters. Once you guys get your heads
> completely around all of them and how they interact across the different
> possible RCU configurations, I bet that the two of you will have excellent
> ideas for improvement.
Great. I'd be happy to join the improvement effort together with Joel. I might
need to ask you in more detail exactly what you expect. Anyway, I will
willingly go with it. :)
Thanks,
Byungchul
> Thanx, Paul
>
> > Thanks,
> > Byungchul
> >
> > ---8<---
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index e72c184..94b58f5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3792,10 +3792,6 @@
> > a value based on the most recent settings
> > of rcutree.jiffies_till_first_fqs
> > and rcutree.jiffies_till_next_fqs.
> > - This calculated value may be viewed in
> > - rcutree.jiffies_to_sched_qs. Any attempt to set
> > - rcutree.jiffies_to_sched_qs will be cheerfully
> > - overwritten.
> >
> > rcutree.kthread_prio= [KNL,BOOT]
> > Set the SCHED_FIFO priority of the RCU per-CPU
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index a2f8ba2..ad9dc86 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -421,10 +421,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > * How long the grace period must be before we start recruiting
> > * quiescent-state help from rcu_note_context_switch().
> > */
> > -static ulong jiffies_till_sched_qs = ULONG_MAX;
> > +static ulong jiffies_till_sched_qs = ULONG_MAX; /* See adjust_jiffies_till_sched_qs(). */
> > module_param(jiffies_till_sched_qs, ulong, 0444);
> > -static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> >
> > /*
> > * Make sure that we give the grace-period kthread time to detect any
> > @@ -436,18 +434,13 @@ static void adjust_jiffies_till_sched_qs(void)
> > {
> > unsigned long j;
> >
> > - /* If jiffies_till_sched_qs was specified, respect the request. */
> > - if (jiffies_till_sched_qs != ULONG_MAX) {
> > - WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > - return;
> > - }
> > /* Otherwise, set to third fqs scan, but bound below on large system. */
> > j = READ_ONCE(jiffies_till_first_fqs) +
> > 2 * READ_ONCE(jiffies_till_next_fqs);
> > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > - WRITE_ONCE(jiffies_to_sched_qs, j);
> > + WRITE_ONCE(jiffies_till_sched_qs, j);
> > }
> >
> > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > @@ -1033,16 +1026,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> >
> > /*
> > * A CPU running for an extended time within the kernel can
> > - * delay RCU grace periods: (1) At age jiffies_to_sched_qs,
> > - * set .rcu_urgent_qs, (2) At age 2*jiffies_to_sched_qs, set
> > + * delay RCU grace periods: (1) At age jiffies_till_sched_qs,
> > + * set .rcu_urgent_qs, (2) At age 2*jiffies_till_sched_qs, set
> > * both .rcu_need_heavy_qs and .rcu_urgent_qs. Note that the
> > * unsynchronized assignments to the per-CPU rcu_need_heavy_qs
> > * variable are safe because the assignments are repeated if this
> > * CPU failed to pass through a quiescent state. This code
> > - * also checks .jiffies_resched in case jiffies_to_sched_qs
> > + * also checks .jiffies_resched in case jiffies_till_sched_qs
> > * is set way high.
> > */
> > - jtsq = READ_ONCE(jiffies_to_sched_qs);
> > + jtsq = READ_ONCE(jiffies_till_sched_qs);
> > ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
> > rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
> > if (!READ_ONCE(*rnhqp) &&
> > @@ -3383,7 +3376,8 @@ static void __init rcu_init_geometry(void)
> > jiffies_till_first_fqs = d;
> > if (jiffies_till_next_fqs == ULONG_MAX)
> > jiffies_till_next_fqs = d;
> > - adjust_jiffies_till_sched_qs();
> > + if (jiffies_till_sched_qs == ULONG_MAX)
> > + adjust_jiffies_till_sched_qs();
> >
> > /* If the compile-time values are accurate, just leave. */
> > if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
On Thu, Jul 11, 2019 at 09:08:49AM -0400, Joel Fernandes wrote:
> > Finally, I urge you to join with Joel Fernandes and go through these
> > grace-period-duration tuning parameters. Once you guys get your heads
> > completely around all of them and how they interact across the different
> > possible RCU configurations, I bet that the two of you will have excellent
> > ideas for improvement.
>
> Yes, I am quite happy to join forces. Byungchul, let me know what about this
I love it. Let's talk later :)
> or other things you had in mind. I have some other RCU topics too I am trying
> to get my head around and planning to work on more patches.
Great.
Thanks,
Byungchul
> Paul, in case you had any other specific tunables or experiments in mind, let
> me know. I am quite happy to try out new experiments and learn something
> based on tuning something.
>
> thanks,
>
> - Joel
>
>
>
> > > Thanks,
> > > Byungchul
> > >
> > > ---8<---
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index e72c184..94b58f5 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -3792,10 +3792,6 @@
> > > a value based on the most recent settings
> > > of rcutree.jiffies_till_first_fqs
> > > and rcutree.jiffies_till_next_fqs.
> > > - This calculated value may be viewed in
> > > - rcutree.jiffies_to_sched_qs. Any attempt to set
> > > - rcutree.jiffies_to_sched_qs will be cheerfully
> > > - overwritten.
> > >
> > > rcutree.kthread_prio= [KNL,BOOT]
> > > Set the SCHED_FIFO priority of the RCU per-CPU
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a2f8ba2..ad9dc86 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -421,10 +421,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > * How long the grace period must be before we start recruiting
> > > * quiescent-state help from rcu_note_context_switch().
> > > */
> > > -static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > +static ulong jiffies_till_sched_qs = ULONG_MAX; /* See adjust_jiffies_till_sched_qs(). */
> > > module_param(jiffies_till_sched_qs, ulong, 0444);
> > > -static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > >
> > > /*
> > > * Make sure that we give the grace-period kthread time to detect any
> > > @@ -436,18 +434,13 @@ static void adjust_jiffies_till_sched_qs(void)
> > > {
> > > unsigned long j;
> > >
> > > - /* If jiffies_till_sched_qs was specified, respect the request. */
> > > - if (jiffies_till_sched_qs != ULONG_MAX) {
> > > - WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > > - return;
> > > - }
> > > /* Otherwise, set to third fqs scan, but bound below on large system. */
> > > j = READ_ONCE(jiffies_till_first_fqs) +
> > > 2 * READ_ONCE(jiffies_till_next_fqs);
> > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > - WRITE_ONCE(jiffies_to_sched_qs, j);
> > > + WRITE_ONCE(jiffies_till_sched_qs, j);
> > > }
> > >
> > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > @@ -1033,16 +1026,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > >
> > > /*
> > > * A CPU running for an extended time within the kernel can
> > > - * delay RCU grace periods: (1) At age jiffies_to_sched_qs,
> > > - * set .rcu_urgent_qs, (2) At age 2*jiffies_to_sched_qs, set
> > > + * delay RCU grace periods: (1) At age jiffies_till_sched_qs,
> > > + * set .rcu_urgent_qs, (2) At age 2*jiffies_till_sched_qs, set
> > > * both .rcu_need_heavy_qs and .rcu_urgent_qs. Note that the
> > > * unsynchronized assignments to the per-CPU rcu_need_heavy_qs
> > > * variable are safe because the assignments are repeated if this
> > > * CPU failed to pass through a quiescent state. This code
> > > - * also checks .jiffies_resched in case jiffies_to_sched_qs
> > > + * also checks .jiffies_resched in case jiffies_till_sched_qs
> > > * is set way high.
> > > */
> > > - jtsq = READ_ONCE(jiffies_to_sched_qs);
> > > + jtsq = READ_ONCE(jiffies_till_sched_qs);
> > > ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
> > > rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
> > > if (!READ_ONCE(*rnhqp) &&
> > > @@ -3383,7 +3376,8 @@ static void __init rcu_init_geometry(void)
> > > jiffies_till_first_fqs = d;
> > > if (jiffies_till_next_fqs == ULONG_MAX)
> > > jiffies_till_next_fqs = d;
> > > - adjust_jiffies_till_sched_qs();
> > > + if (jiffies_till_sched_qs == ULONG_MAX)
> > > + adjust_jiffies_till_sched_qs();
> > >
> > > /* If the compile-time values are accurate, just leave. */
> > > if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
> >
On Thu, Jul 11, 2019 at 08:02:15AM -0700, Paul E. McKenney wrote:
> These would be the tunables controlling how quickly RCU takes its
> various actions to encourage the current grace period to end quickly.
Seriously, this is one of the most interesting things in all of kernel work.
> I would be happy to give you the exact list if you wish, but most of
> them have appeared in this thread.
Thank you. :)
> The experiments should be designed to work out whether the current
> default settings have configurations where they act badly. This might
> also come up with advice for people attempting hand-tuning, or proposed
> parameter-checking code to avoid bad combinations.
Great.
> For one example, setting the RCU CPU stall timeout too low will definitely
> cause some unwanted splats. (Yes, one could argue that other things in
> the kernel should change to allow this value to decrease, but things
> like latency tracer and friends are probably more useful and important.)
Agree.
Thanks,
Byungchul
>
> Thanx, Paul
>
> > thanks,
> >
> > - Joel
> >
> >
> >
> > > > Thanks,
> > > > Byungchul
> > > >
> > > > ---8<---
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index e72c184..94b58f5 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -3792,10 +3792,6 @@
> > > > a value based on the most recent settings
> > > > of rcutree.jiffies_till_first_fqs
> > > > and rcutree.jiffies_till_next_fqs.
> > > > - This calculated value may be viewed in
> > > > - rcutree.jiffies_to_sched_qs. Any attempt to set
> > > > - rcutree.jiffies_to_sched_qs will be cheerfully
> > > > - overwritten.
> > > >
> > > > rcutree.kthread_prio= [KNL,BOOT]
> > > > Set the SCHED_FIFO priority of the RCU per-CPU
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index a2f8ba2..ad9dc86 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -421,10 +421,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > > * How long the grace period must be before we start recruiting
> > > > * quiescent-state help from rcu_note_context_switch().
> > > > */
> > > > -static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > > +static ulong jiffies_till_sched_qs = ULONG_MAX; /* See adjust_jiffies_till_sched_qs(). */
> > > > module_param(jiffies_till_sched_qs, ulong, 0444);
> > > > -static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > > >
> > > > /*
> > > > * Make sure that we give the grace-period kthread time to detect any
> > > > @@ -436,18 +434,13 @@ static void adjust_jiffies_till_sched_qs(void)
> > > > {
> > > > unsigned long j;
> > > >
> > > > - /* If jiffies_till_sched_qs was specified, respect the request. */
> > > > - if (jiffies_till_sched_qs != ULONG_MAX) {
> > > > - WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > > > - return;
> > > > - }
> > > > /* Otherwise, set to third fqs scan, but bound below on large system. */
> > > > j = READ_ONCE(jiffies_till_first_fqs) +
> > > > 2 * READ_ONCE(jiffies_till_next_fqs);
> > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > > - WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > + WRITE_ONCE(jiffies_till_sched_qs, j);
> > > > }
> > > >
> > > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > > @@ -1033,16 +1026,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > > >
> > > > /*
> > > > * A CPU running for an extended time within the kernel can
> > > > - * delay RCU grace periods: (1) At age jiffies_to_sched_qs,
> > > > - * set .rcu_urgent_qs, (2) At age 2*jiffies_to_sched_qs, set
> > > > + * delay RCU grace periods: (1) At age jiffies_till_sched_qs,
> > > > + * set .rcu_urgent_qs, (2) At age 2*jiffies_till_sched_qs, set
> > > > * both .rcu_need_heavy_qs and .rcu_urgent_qs. Note that the
> > > > * unsynchronized assignments to the per-CPU rcu_need_heavy_qs
> > > > * variable are safe because the assignments are repeated if this
> > > > * CPU failed to pass through a quiescent state. This code
> > > > - * also checks .jiffies_resched in case jiffies_to_sched_qs
> > > > + * also checks .jiffies_resched in case jiffies_till_sched_qs
> > > > * is set way high.
> > > > */
> > > > - jtsq = READ_ONCE(jiffies_to_sched_qs);
> > > > + jtsq = READ_ONCE(jiffies_till_sched_qs);
> > > > ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
> > > > rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
> > > > if (!READ_ONCE(*rnhqp) &&
> > > > @@ -3383,7 +3376,8 @@ static void __init rcu_init_geometry(void)
> > > > jiffies_till_first_fqs = d;
> > > > if (jiffies_till_next_fqs == ULONG_MAX)
> > > > jiffies_till_next_fqs = d;
> > > > - adjust_jiffies_till_sched_qs();
> > > > + if (jiffies_till_sched_qs == ULONG_MAX)
> > > > + adjust_jiffies_till_sched_qs();
> > > >
> > > > /* If the compile-time values are accurate, just leave. */
> > > > if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
> > >
On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> Hmm, speaking of grace period durations, it seems to me the maximum grace
> period ever is recorded in rcu_state.gp_max. However it is not read from
> anywhere.
>
> Any idea why it was added but not used?
>
> I am interested in dumping this value just for fun, and seeing what I get.
>
> I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> issues, or even expose it in sys/proc fs to see what worst case grace periods
> look like.
Hi,
commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
rcu: Remove debugfs tracing
removed all debugfs tracing, gp_max included.
And your idea sounds great. It even looks not that hard to add, something like
the following. :)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ad9dc86..86095ff 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
raw_spin_lock_irq_rcu_node(rnp);
rcu_state.gp_end = jiffies;
gp_duration = rcu_state.gp_end - rcu_state.gp_start;
- if (gp_duration > rcu_state.gp_max)
+ if (gp_duration > rcu_state.gp_max) {
rcu_state.gp_max = gp_duration;
+ trace_rcu_grace_period(something something);
+ }
/*
* We know the grace period is complete, but to everyone else
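where the arguments could be filled in along these lines, assuming the existing
trace_rcu_grace_period() signature (flavor name, gp_seq, event string); the
"newmax" event string is just a made-up example:

	if (gp_duration > rcu_state.gp_max) {
		rcu_state.gp_max = gp_duration;
		/* "newmax" is a hypothetical event string, not an existing one. */
		trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("newmax"));
	}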
Thanks,
Byungchul
On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > period ever is recorded in rcu_state.gp_max. However it is not read from
> > anywhere.
> >
> > Any idea why it was added but not used?
> >
> > I am interested in dumping this value just for fun, and seeing what I get.
> >
> > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > look like.
>
> Hi,
>
> commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> rcu: Remove debugfs tracing
>
> removed all debugfs tracing, gp_max also included.
>
> And you sounds great. And even looks not that hard to add it like,
>
> :)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index ad9dc86..86095ff 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> raw_spin_lock_irq_rcu_node(rnp);
> rcu_state.gp_end = jiffies;
> gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> - if (gp_duration > rcu_state.gp_max)
> + if (gp_duration > rcu_state.gp_max) {
> rcu_state.gp_max = gp_duration;
> + trace_rcu_grace_period(something something);
> + }
Yes, that makes sense. But I think it is much better off as a readable value
from a virtual fs. The drawbacks of tracing for this sort of thing are:
- Tracing will only catch it if tracing is on.
- Tracing data can be lost if there are too many events; then no one has a clue
what the max gp time is.
- The data is already available in rcu_state::gp_max, so copying it into the
trace buffer seems a bit pointless IMHO.
- It is a lot easier on one's eyes to process a single counter than to process
heaps of traces.
I think a minimal set of RCU counters exposed to /proc or /sys should not
hurt and could do more good than not. The scheduler already does this for
scheduler statistics. I have seen Peter complain a lot about new tracepoints
but not much (or never) about new statistics.
Tracing has its strengths but may not apply well here IMO. I think a counter
like this could be useful for tuning of things like the jiffies_*_sched_qs,
the stall timeouts and also any other RCU knobs. What do you think?
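Even a minimal read-only hook would do. A sketch of what I mean, assuming
rcu_state.gp_max stays an unsigned long count of jiffies (the parameter name and
placement below are made up):

static int param_get_gp_max(char *buffer, const struct kernel_param *kp)
{
	/* Report the longest grace period observed so far, in jiffies. */
	return sprintf(buffer, "%lu\n", READ_ONCE(rcu_state.gp_max));
}

static struct kernel_param_ops gp_max_ops = {
	.get = param_get_gp_max,
};

/* Read-only; would show up under /sys/module/rcutree/parameters/. */
module_param_cb(rcu_gp_max, &gp_max_ops, NULL, 0444);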
- Joel
On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> On Thu, Jul 11, 2019 at 12:48:18PM -0400, Joel Fernandes wrote:
> > On Thu, Jul 11, 2019 at 08:02:15AM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 11, 2019 at 09:08:49AM -0400, Joel Fernandes wrote:
> > > > On Thu, Jul 11, 2019 at 05:30:52AM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Jul 10, 2019 at 10:20:25AM +0900, Byungchul Park wrote:
> > > > > > On Tue, Jul 09, 2019 at 05:41:02AM -0700, Paul E. McKenney wrote:
> > > > > > > > Hi Paul,
> > > > > > > >
> > > > > > > > IMHO, as much as we want to tune the time for fqs to be initiated, we
> > > > > > > > can also want to tune the time for the help from scheduler to start.
> > > > > > > > I thought only difference between them is a level of urgency. I might be
> > > > > > > > wrong. It would be appreciated if you let me know if I miss something.
> > > > > > >
> > > > > > > Hello, Byungchul,
> > > > > > >
> > > > > > > I understand that one hypothetically might want to tune this at runtime,
> > > > > > > but have you had need to tune this at runtime on a real production
> > > > > > > workload? If so, what problem was happening that caused you to want to
> > > > > > > do this tuning?
> > > > > >
> > > > > > Not actually.
> > > > > >
> > > > > > > > And it's ok even if the patch is turned down based on your criteria. :)
> > > > > > >
> > > > > > > If there is a real need, something needs to be provided to meet that
> > > > > > > need. But in the absence of a real need, past experience has shown
> > > > > > > that speculative tuning knobs usually do more harm than good. ;-)
> > > > > >
> > > > > > It makes sense, "A speculative tuning knobs do more harm than good".
> > > > > >
> > > > > > Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> > > > > > but jiffies_till_sched_qs until we need it.
> > > > > >
> > > > > > However,
> > > > > >
> > > > > > (1) In case that jiffies_till_sched_qs is tunnable:
> > > > > >
> > > > > > We might need all of jiffies_till_{first,next}_qs,
> > > > > > jiffies_till_sched_qs and jiffies_to_sched_qs because
> > > > > > jiffies_to_sched_qs can be affected by any of them. So we
> > > > > > should be able to read each value at any time.
> > > > > >
> > > > > > (2) In case that jiffies_till_sched_qs is not tunnable:
> > > > > >
> > > > > > I think we don't have to keep the jiffies_till_sched_qs any
> > > > > > longer since that's only for setting jiffies_to_sched_qs at
> > > > > > *booting time*, which can be done with jiffies_to_sched_qs too.
> > > > > > It's meaningless to keep all of tree variables.
> > > > > >
> > > > > > The simpler and less knobs that we really need we have, the better.
> > > > > >
> > > > > > what do you think about it?
> > > > > >
> > > > > > In the following patch, I (1) removed jiffies_till_sched_qs and then
> > > > > > (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> > > > > > think jiffies_till_sched_qs is a much better name for the purpose. I
> > > > > > will resend it with a commit msg after knowing your opinion on it.
> > > > >
> > > > > I will give you a definite "maybe".
> > > > >
> > > > > Here are the two reasons for changing RCU's embarrassingly large array
> > > > > of tuning parameters:
> > > > >
> > > > > 1. They are causing a problem in production. This would represent a
> > > > > bug that clearly must be fixed. As you say, this change is not
> > > > > in this category.
> > > > >
> > > > > 2. The change simplifies either RCU's code or the process of tuning
> > > > > RCU, but without degrading RCU's ability to run everywhere and
> > > > > without removing debugging tools.
> > > > >
> > > > > The change below clearly simplifies things by removing a few lines of
> > > > > code, and it does not change RCU's default self-configuration. But are
> > > > > we sure about the debugging aspect? (Please keep in mind that many more
> > > > > sites are willing to change boot parameters than are willing to patch
> > > > > their kernels.)
> > > > >
> > > > > What do you think?
> > > >
> > > > Just to add that independent of whether the runtime tunable make sense or
> > > > not, may be it is still worth correcting the 0444 to be 0644 to be a separate
> > > > patch?
> > >
> > > You lost me on this one. Doesn't changing from 0444 to 0644 make it be
> > > a runtime tunable?
> >
> > I was going by our earlier discussion that the parameter is still writable at
> > boot time. You mentioned something like the following:
> > ---
> > In Byungchul's defense, the current module_param() permissions are
> > 0444, which really is read-only. Although I do agree that they can
> > be written at boot, one could use this same line of reasoning to argue
> > that const variables can be written at compile time (or, for on-stack
> > const variables, at function-invocation time). But we still call them
> > "const".
> > ---
> >
> > Sorry if I got confused. You are right that we could leave it as read-only.
> >
> > > > > Finally, I urge you to join with Joel Fernandes and go through these
> > > > > grace-period-duration tuning parameters. Once you guys get your heads
> > > > > completely around all of them and how they interact across the different
> > > > > possible RCU configurations, I bet that the two of you will have excellent
> > > > > ideas for improvement.
> > > >
> > > > Yes, I am quite happy to join forces. Byungchul, let me know what about this
> > > > or other things you had in mind. I have some other RCU topics too I am trying
> > > > to get my head around and planning to work on more patches.
> > > >
> > > > Paul, in case you had any other specific tunables or experiments in mind, let
> > > > me know. I am quite happy to try out new experiments and learn something
> > > > based on tuning something.
> > >
> > > These would be the tunables controlling how quickly RCU takes its
> > > various actions to encourage the current grace period to end quickly.
> > > I would be happy to give you the exact list if you wish, but most of
> > > them have appeared in this thread.
> > >
> > > The experiments should be designed to work out whether the current
> > > default settings have configurations where they act badly. This might
> > > also come up with advice for people attempting hand-tuning, or proposed
> > > parameter-checking code to avoid bad combinations.
> > >
> > > For one example, setting the RCU CPU stall timeout too low will definitely
> > > cause some unwanted splats. (Yes, one could argue that other things in
> > > the kernel should change to allow this value to decrease, but things
> > > like latency tracer and friends are probably more useful and important.)
> >
> > Ok, thank you for the hints.
>
> Hmm, speaking of grace period durations, it seems to me the maximum grace
> period ever is recorded in rcu_state.gp_max. However it is not read from
> anywhere.
>
> Any idea why it was added but not used?
If I remember correctly, it used to be used in debugfs prints. It is
useful for working out how low you can decrease rcutorture.stall_cpu to
without getting RCU CPU stall warnings. A rather infrequent need,
given that the mainline default has been adjusted only once.
> I am interested in dumping this value just for fun, and seeing what I get.
>
> I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> issues, or even expose it in sys/proc fs to see what worst case grace periods
> look like.
That might be worthwhile.
Thanx, Paul
On Fri, Jul 12, 2019 at 08:51:16AM -0400, Joel Fernandes wrote:
> On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > anywhere.
> > >
> > > Any idea why it was added but not used?
> > >
> > > I am interested in dumping this value just for fun, and seeing what I get.
> > >
> > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > look like.
> >
> > Hi,
> >
> > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > rcu: Remove debugfs tracing
> >
> > removed all debugfs tracing, gp_max also included.
> >
> > And you sounds great. And even looks not that hard to add it like,
> >
> > :)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index ad9dc86..86095ff 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > raw_spin_lock_irq_rcu_node(rnp);
> > rcu_state.gp_end = jiffies;
> > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > - if (gp_duration > rcu_state.gp_max)
> > + if (gp_duration > rcu_state.gp_max) {
> > rcu_state.gp_max = gp_duration;
> > + trace_rcu_grace_period(something something);
> > + }
>
> Yes, that makes sense. But I think it is much better off as a readable value
> from a virtual fs. The drawback of tracing for this sort of thing are:
> - Tracing will only catch it if tracing is on
> - Tracing data can be lost if too many events, then no one has a clue what
> the max gp time is.
> - The data is already available in rcu_state::gp_max so copying it into the
> trace buffer seems a bit pointless IMHO
> - It is a lot easier on ones eyes to process a single counter than process
> heaps of traces.
>
> I think a minimal set of RCU counters exposed to /proc or /sys should not
> hurt and could do more good than not. The scheduler already does this for
> scheduler statistics. I have seen Peter complain a lot about new tracepoints
> but not much (or never) about new statistics.
>
> Tracing has its strengths but may not apply well here IMO. I think a counter
> like this could be useful for tuning of things like the jiffies_*_sched_qs,
> the stall timeouts and also any other RCU knobs. What do you think?
Is this one of those cases where eBPF is the answer, regardless of
the question? ;-)
Thanx, Paul
On Fri, Jul 12, 2019 at 06:01:05AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > On Thu, Jul 11, 2019 at 12:48:18PM -0400, Joel Fernandes wrote:
> > > On Thu, Jul 11, 2019 at 08:02:15AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Jul 11, 2019 at 09:08:49AM -0400, Joel Fernandes wrote:
> > > > > On Thu, Jul 11, 2019 at 05:30:52AM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Jul 10, 2019 at 10:20:25AM +0900, Byungchul Park wrote:
> > > > > > > On Tue, Jul 09, 2019 at 05:41:02AM -0700, Paul E. McKenney wrote:
> > > > > > > > > Hi Paul,
> > > > > > > > >
> > > > > > > > > IMHO, as much as we want to tune the time for fqs to be initiated, we
> > > > > > > > > can also want to tune the time for the help from scheduler to start.
> > > > > > > > > I thought only difference between them is a level of urgency. I might be
> > > > > > > > > wrong. It would be appreciated if you let me know if I miss something.
> > > > > > > >
> > > > > > > > Hello, Byungchul,
> > > > > > > >
> > > > > > > > I understand that one hypothetically might want to tune this at runtime,
> > > > > > > > but have you had need to tune this at runtime on a real production
> > > > > > > > workload? If so, what problem was happening that caused you to want to
> > > > > > > > do this tuning?
> > > > > > >
> > > > > > > Not actually.
> > > > > > >
> > > > > > > > > And it's ok even if the patch is turned down based on your criteria. :)
> > > > > > > >
> > > > > > > > If there is a real need, something needs to be provided to meet that
> > > > > > > > need. But in the absence of a real need, past experience has shown
> > > > > > > > that speculative tuning knobs usually do more harm than good. ;-)
> > > > > > >
> > > > > > > It makes sense, "A speculative tuning knobs do more harm than good".
> > > > > > >
> > > > > > > Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> > > > > > > but jiffies_till_sched_qs until we need it.
> > > > > > >
> > > > > > > However,
> > > > > > >
> > > > > > > (1) In case that jiffies_till_sched_qs is tunnable:
> > > > > > >
> > > > > > > We might need all of jiffies_till_{first,next}_qs,
> > > > > > > jiffies_till_sched_qs and jiffies_to_sched_qs because
> > > > > > > jiffies_to_sched_qs can be affected by any of them. So we
> > > > > > > should be able to read each value at any time.
> > > > > > >
> > > > > > > (2) In case that jiffies_till_sched_qs is not tunnable:
> > > > > > >
> > > > > > > I think we don't have to keep the jiffies_till_sched_qs any
> > > > > > > longer since that's only for setting jiffies_to_sched_qs at
> > > > > > > *booting time*, which can be done with jiffies_to_sched_qs too.
> > > > > > > It's meaningless to keep all of tree variables.
> > > > > > >
> > > > > > > The simpler and less knobs that we really need we have, the better.
> > > > > > >
> > > > > > > what do you think about it?
> > > > > > >
> > > > > > > In the following patch, I (1) removed jiffies_till_sched_qs and then
> > > > > > > (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> > > > > > > think jiffies_till_sched_qs is a much better name for the purpose. I
> > > > > > > will resend it with a commit msg after knowing your opinion on it.
> > > > > >
> > > > > > I will give you a definite "maybe".
> > > > > >
> > > > > > Here are the two reasons for changing RCU's embarrassingly large array
> > > > > > of tuning parameters:
> > > > > >
> > > > > > 1. They are causing a problem in production. This would represent a
> > > > > > bug that clearly must be fixed. As you say, this change is not
> > > > > > in this category.
> > > > > >
> > > > > > 2. The change simplifies either RCU's code or the process of tuning
> > > > > > RCU, but without degrading RCU's ability to run everywhere and
> > > > > > without removing debugging tools.
> > > > > >
> > > > > > The change below clearly simplifies things by removing a few lines of
> > > > > > code, and it does not change RCU's default self-configuration. But are
> > > > > > we sure about the debugging aspect? (Please keep in mind that many more
> > > > > > sites are willing to change boot parameters than are willing to patch
> > > > > > their kernels.)
> > > > > >
> > > > > > What do you think?
> > > > >
> > > > > Just to add that independent of whether the runtime tunable make sense or
> > > > > not, may be it is still worth correcting the 0444 to be 0644 to be a separate
> > > > > patch?
> > > >
> > > > You lost me on this one. Doesn't changing from 0444 to 0644 make it be
> > > > a runtime tunable?
> > >
> > > I was going by our earlier discussion that the parameter is still writable at
> > > boot time. You mentioned something like the following:
> > > ---
> > > In Byungchul's defense, the current module_param() permissions are
> > > 0444, which really is read-only. Although I do agree that they can
> > > be written at boot, one could use this same line of reasoning to argue
> > > that const variables can be written at compile time (or, for on-stack
> > > const variables, at function-invocation time). But we still call them
> > > "const".
> > > ---
> > >
> > > Sorry if I got confused. You are right that we could leave it as read-only.
> > >
> > > > > > Finally, I urge you to join with Joel Fernandes and go through these
> > > > > > grace-period-duration tuning parameters. Once you guys get your heads
> > > > > > completely around all of them and how they interact across the different
> > > > > > possible RCU configurations, I bet that the two of you will have excellent
> > > > > > ideas for improvement.
> > > > >
> > > > > Yes, I am quite happy to join forces. Byungchul, let me know what about this
> > > > > or other things you had in mind. I have some other RCU topics too I am trying
> > > > > to get my head around and planning to work on more patches.
> > > > >
> > > > > Paul, in case you had any other specific tunables or experiments in mind, let
> > > > > me know. I am quite happy to try out new experiments and learn something
> > > > > based on tuning something.
> > > >
> > > > These would be the tunables controlling how quickly RCU takes its
> > > > various actions to encourage the current grace period to end quickly.
> > > > I would be happy to give you the exact list if you wish, but most of
> > > > them have appeared in this thread.
> > > >
> > > > The experiments should be designed to work out whether the current
> > > > default settings have configurations where they act badly. This might
> > > > also come up with advice for people attempting hand-tuning, or proposed
> > > > parameter-checking code to avoid bad combinations.
> > > >
> > > > For one example, setting the RCU CPU stall timeout too low will definitely
> > > > cause some unwanted splats. (Yes, one could argue that other things in
> > > > the kernel should change to allow this value to decrease, but things
> > > > like latency tracer and friends are probably more useful and important.)
> > >
> > > Ok, thank you for the hints.
> >
> > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > period ever is recorded in rcu_state.gp_max. However it is not read from
> > anywhere.
> >
> > Any idea why it was added but not used?
>
> If I remember correctly, it used to be used in debugfs prints. It is
> useful for working out how low you can decrease rcutorture.stall_cpu to
> without getting RCU CPU stall warnings. A rather infrequent need,
> given that the mainline default has been adjusted only once.
Got it.
> > I am interested in dumping this value just for fun, and seeing what I get.
> >
> > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > look like.
>
> That might be worthwhile.
Thanks. I am also thinking it will help show whether something else, like GP
thread priority or other configuration, could be causing long GP times that
are otherwise not possible to see without a counter.
I will work out with Byungchul and come up with something.
Cheers,
- Joel
On Fri, Jul 12, 2019 at 06:02:42AM -0700, Paul E. McKenney wrote:
> On Fri, Jul 12, 2019 at 08:51:16AM -0400, Joel Fernandes wrote:
> > On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > > anywhere.
> > > >
> > > > Any idea why it was added but not used?
> > > >
> > > > I am interested in dumping this value just for fun, and seeing what I get.
> > > >
> > > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > > look like.
> > >
> > > Hi,
> > >
> > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > rcu: Remove debugfs tracing
> > >
> > > removed all debugfs tracing, gp_max also included.
> > >
> > > And you sounds great. And even looks not that hard to add it like,
> > >
> > > :)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index ad9dc86..86095ff 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > raw_spin_lock_irq_rcu_node(rnp);
> > > rcu_state.gp_end = jiffies;
> > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > - if (gp_duration > rcu_state.gp_max)
> > > + if (gp_duration > rcu_state.gp_max) {
> > > rcu_state.gp_max = gp_duration;
> > > + trace_rcu_grace_period(something something);
> > > + }
> >
> > Yes, that makes sense. But I think it is much better off as a readable value
> > from a virtual fs. The drawback of tracing for this sort of thing are:
> > - Tracing will only catch it if tracing is on
> > - Tracing data can be lost if too many events, then no one has a clue what
> > the max gp time is.
> > - The data is already available in rcu_state::gp_max so copying it into the
> > trace buffer seems a bit pointless IMHO
> > - It is a lot easier on ones eyes to process a single counter than process
> > heaps of traces.
> >
> > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > hurt and could do more good than not. The scheduler already does this for
> > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > but not much (or never) about new statistics.
> >
> > Tracing has its strengths but may not apply well here IMO. I think a counter
> > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > the stall timeouts and also any other RCU knobs. What do you think?
>
> Is this one of those cases where eBPF is the answer, regardless of
> the question? ;-)
It could be. Except that people who fancy busybox still could benefit from
the counter ;-)
And also, eBPF uses RCU itself heavily, so it would help if the debug-related
counter itself didn't depend on it. ;-)
thanks,
- Joel
On Fri, Jul 12, 2019 at 09:43:11AM -0400, Joel Fernandes wrote:
> On Fri, Jul 12, 2019 at 06:02:42AM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 12, 2019 at 08:51:16AM -0400, Joel Fernandes wrote:
> > > On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > > > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > > > anywhere.
> > > > >
> > > > > Any idea why it was added but not used?
> > > > >
> > > > > I am interested in dumping this value just for fun, and seeing what I get.
> > > > >
> > > > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > > > look like.
> > > >
> > > > Hi,
> > > >
> > > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > > rcu: Remove debugfs tracing
> > > >
> > > > removed all debugfs tracing, gp_max also included.
> > > >
> > > > And you sounds great. And even looks not that hard to add it like,
> > > >
> > > > :)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index ad9dc86..86095ff 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > > raw_spin_lock_irq_rcu_node(rnp);
> > > > rcu_state.gp_end = jiffies;
> > > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > > - if (gp_duration > rcu_state.gp_max)
> > > > + if (gp_duration > rcu_state.gp_max) {
> > > > rcu_state.gp_max = gp_duration;
> > > > + trace_rcu_grace_period(something something);
> > > > + }
> > >
> > > Yes, that makes sense. But I think it is much better off as a readable value
> > > from a virtual fs. The drawback of tracing for this sort of thing are:
> > > - Tracing will only catch it if tracing is on
> > > - Tracing data can be lost if too many events, then no one has a clue what
> > > the max gp time is.
> > > - The data is already available in rcu_state::gp_max so copying it into the
> > > trace buffer seems a bit pointless IMHO
> > > - It is a lot easier on ones eyes to process a single counter than process
> > > heaps of traces.
> > >
> > > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > > hurt and could do more good than not. The scheduler already does this for
> > > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > > but not much (or never) about new statistics.
> > >
> > > Tracing has its strengths but may not apply well here IMO. I think a counter
> > > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > > the stall timeouts and also any other RCU knobs. What do you think?
> >
> > Is this one of those cases where eBPF is the answer, regardless of
> > the question? ;-)
>
> It could be. Except that people who fancy busybox still could benefit from
> the counter ;-)
;-)
Another approach might be for RCU to dump statistics, including this
one, to the console every so often, for example, maybe every minute for
the first hour, every hour for the first day, and every day after that.
What else might work?
(I am not a huge fan of bringing back RCU's debugfs due to testability
concerns and due to the fact that very few people ever used it.)
Plus the busybox people could always add a trace_printk() or similar.
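For the console-dump option above, a very rough (and untested) sketch of the
cadence, assuming it sits next to rcu_state in kernel/rcu/tree.c. The names
are made up, and only gp_max is printed, but any other statistic could be
added:

/*
 * Illustrative only: dump a statistic every minute for the first hour,
 * every hour for the first day, and every day after that.
 */
static void rcu_stats_dump_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(rcu_stats_dump_work, rcu_stats_dump_fn);

static void rcu_stats_dump_fn(struct work_struct *work)
{
        static int ndumps;
        unsigned long delay;

        pr_info("rcu: longest grace period so far: %lu jiffies\n",
                READ_ONCE(rcu_state.gp_max));

        if (++ndumps < 60)              /* Every minute for the first hour. */
                delay = 60 * HZ;
        else if (ndumps < 60 + 23)      /* Every hour for the first day. */
                delay = 60 * 60 * HZ;
        else                            /* Every day after that. */
                delay = 24 * 60 * 60 * HZ;
        schedule_delayed_work(&rcu_stats_dump_work, delay);
}

/* Primed once during boot, for example from rcu_init():            */
/*      schedule_delayed_work(&rcu_stats_dump_work, 60 * HZ);       */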
> And also, eBPF uses RCU itself heavily, so it would help if the debug related
> counter itself didn't depend on it. ;-)
I would think that eBPF printing a statistical counter from RCU wouldn't
pose danger even given that eBPF uses RCU, so I suspect that your point
about busybox carries more force in this argument. ;-)
Thanx, Paul
On Fri, Jul 12, 2019 at 9:51 PM Joel Fernandes <[email protected]> wrote:
>
> On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > anywhere.
> > >
> > > Any idea why it was added but not used?
> > >
> > > I am interested in dumping this value just for fun, and seeing what I get.
> > >
> > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > look like.
> >
> > Hi,
> >
> > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > rcu: Remove debugfs tracing
> >
> > removed all debugfs tracing, gp_max also included.
> >
> > And you sounds great. And even looks not that hard to add it like,
> >
> > :)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index ad9dc86..86095ff 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > raw_spin_lock_irq_rcu_node(rnp);
> > rcu_state.gp_end = jiffies;
> > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > - if (gp_duration > rcu_state.gp_max)
> > + if (gp_duration > rcu_state.gp_max) {
> > rcu_state.gp_max = gp_duration;
> > + trace_rcu_grace_period(something something);
> > + }
>
> Yes, that makes sense. But I think it is much better off as a readable value
> from a virtual fs. The drawback of tracing for this sort of thing are:
> - Tracing will only catch it if tracing is on
> - Tracing data can be lost if too many events, then no one has a clue what
> the max gp time is.
> - The data is already available in rcu_state::gp_max so copying it into the
> trace buffer seems a bit pointless IMHO
> - It is a lot easier on ones eyes to process a single counter than process
> heaps of traces.
>
> I think a minimal set of RCU counters exposed to /proc or /sys should not
> hurt and could do more good than not. The scheduler already does this for
> scheduler statistics. I have seen Peter complain a lot about new tracepoints
> but not much (or never) about new statistics.
>
> Tracing has its strengths but may not apply well here IMO. I think a counter
> like this could be useful for tuning of things like the jiffies_*_sched_qs,
> the stall timeouts and also any other RCU knobs. What do you think?
I prefer a proc/sys knob for it over a tracepoint. The only reason I hesitated
is that it looks like undoing what Paul did in ae91aa0ad.
I think your reasoning is sound. I just wondered what Paul thinks of it.
--
Thanks,
Byungchul
On Fri, Jul 12, 2019 at 2:50 PM Byungchul Park <[email protected]> wrote:
>
> On Thu, Jul 11, 2019 at 05:30:52AM -0700, Paul E. McKenney wrote:
> > > > If there is a real need, something needs to be provided to meet that
> > > > need. But in the absence of a real need, past experience has shown
> > > > that speculative tuning knobs usually do more harm than good. ;-)
> > >
> > > It makes sense, "A speculative tuning knobs do more harm than good".
> > >
> > > Then, it would be better to leave jiffies_till_{first,next}_fqs tunnable
> > > but jiffies_till_sched_qs until we need it.
> > >
> > > However,
> > >
> > > (1) In case that jiffies_till_sched_qs is tunnable:
> > >
> > > We might need all of jiffies_till_{first,next}_qs,
> > > jiffies_till_sched_qs and jiffies_to_sched_qs because
> > > jiffies_to_sched_qs can be affected by any of them. So we
> > > should be able to read each value at any time.
> > >
> > > (2) In case that jiffies_till_sched_qs is not tunnable:
> > >
> > > I think we don't have to keep the jiffies_till_sched_qs any
> > > longer since that's only for setting jiffies_to_sched_qs at
> > > *booting time*, which can be done with jiffies_to_sched_qs too.
> > > It's meaningless to keep all of tree variables.
> > >
> > > The simpler and less knobs that we really need we have, the better.
> > >
> > > what do you think about it?
> > >
> > > In the following patch, I (1) removed jiffies_till_sched_qs and then
> > > (2) renamed jiffies_*to*_sched_qs to jiffies_*till*_sched_qs because I
> > > think jiffies_till_sched_qs is a much better name for the purpose. I
> > > will resend it with a commit msg after knowing your opinion on it.
>
> Hi Paul,
>
> > I will give you a definite "maybe".
> >
> > Here are the two reasons for changing RCU's embarrassingly large array
> > of tuning parameters:
> >
> > 1. They are causing a problem in production. This would represent a
> > bug that clearly must be fixed. As you say, this change is not
> > in this category.
> >
> > 2. The change simplifies either RCU's code or the process of tuning
> > RCU, but without degrading RCU's ability to run everywhere and
> > without removing debugging tools.
>
> Agree.
>
> > The change below clearly simplifies things by removing a few lines of
> > code, and it does not change RCU's default self-configuration. But are
> > we sure about the debugging aspect? (Please keep in mind that many more
>
> I'm sorry I don't get it. I don't think this patch affects any debugging
> ability. In what way do you think it hurts? Could you explain it more?
I meant that we can achieve it, even for debugging purposes, by directly
changing rcutree.jiffies_to_sched_qs, without changing the code itself.
Thoughts?
--
Thanks,
Byungchul
> > sites are willing to change boot parameters than are willing to patch
> > their kernels.)
>
> Right.
>
> > What do you think?
> >
> > Finally, I urge you to join with Joel Fernandes and go through these
> > grace-period-duration tuning parameters. Once you guys get your heads
> > completely around all of them and how they interact across the different
> > possible RCU configurations, I bet that the two of you will have excellent
> > ideas for improvement.
>
> Great. I'd be happy if I join the improvement and with Joel. I might
> need to ask you exactly what you expect in detail maybe. Anyway I will
> willingly go with it. :)
>
> Thanks,
> Byungchul
>
> > Thanx, Paul
> >
> > > Thanks,
> > > Byungchul
> > >
> > > ---8<---
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index e72c184..94b58f5 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -3792,10 +3792,6 @@
> > > a value based on the most recent settings
> > > of rcutree.jiffies_till_first_fqs
> > > and rcutree.jiffies_till_next_fqs.
> > > - This calculated value may be viewed in
> > > - rcutree.jiffies_to_sched_qs. Any attempt to set
> > > - rcutree.jiffies_to_sched_qs will be cheerfully
> > > - overwritten.
> > >
> > > rcutree.kthread_prio= [KNL,BOOT]
> > > Set the SCHED_FIFO priority of the RCU per-CPU
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a2f8ba2..ad9dc86 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -421,10 +421,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > > * How long the grace period must be before we start recruiting
> > > * quiescent-state help from rcu_note_context_switch().
> > > */
> > > -static ulong jiffies_till_sched_qs = ULONG_MAX;
> > > +static ulong jiffies_till_sched_qs = ULONG_MAX; /* See adjust_jiffies_till_sched_qs(). */
> > > module_param(jiffies_till_sched_qs, ulong, 0444);
> > > -static ulong jiffies_to_sched_qs; /* See adjust_jiffies_till_sched_qs(). */
> > > -module_param(jiffies_to_sched_qs, ulong, 0444); /* Display only! */
> > >
> > > /*
> > > * Make sure that we give the grace-period kthread time to detect any
> > > @@ -436,18 +434,13 @@ static void adjust_jiffies_till_sched_qs(void)
> > > {
> > > unsigned long j;
> > >
> > > - /* If jiffies_till_sched_qs was specified, respect the request. */
> > > - if (jiffies_till_sched_qs != ULONG_MAX) {
> > > - WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > > - return;
> > > - }
> > > /* Otherwise, set to third fqs scan, but bound below on large system. */
> > > j = READ_ONCE(jiffies_till_first_fqs) +
> > > 2 * READ_ONCE(jiffies_till_next_fqs);
> > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > - WRITE_ONCE(jiffies_to_sched_qs, j);
> > > + WRITE_ONCE(jiffies_till_sched_qs, j);
> > > }
> > >
> > > static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
> > > @@ -1033,16 +1026,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > >
> > > /*
> > > * A CPU running for an extended time within the kernel can
> > > - * delay RCU grace periods: (1) At age jiffies_to_sched_qs,
> > > - * set .rcu_urgent_qs, (2) At age 2*jiffies_to_sched_qs, set
> > > + * delay RCU grace periods: (1) At age jiffies_till_sched_qs,
> > > + * set .rcu_urgent_qs, (2) At age 2*jiffies_till_sched_qs, set
> > > * both .rcu_need_heavy_qs and .rcu_urgent_qs. Note that the
> > > * unsynchronized assignments to the per-CPU rcu_need_heavy_qs
> > > * variable are safe because the assignments are repeated if this
> > > * CPU failed to pass through a quiescent state. This code
> > > - * also checks .jiffies_resched in case jiffies_to_sched_qs
> > > + * also checks .jiffies_resched in case jiffies_till_sched_qs
> > > * is set way high.
> > > */
> > > - jtsq = READ_ONCE(jiffies_to_sched_qs);
> > > + jtsq = READ_ONCE(jiffies_till_sched_qs);
> > > ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu);
> > > rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu);
> > > if (!READ_ONCE(*rnhqp) &&
> > > @@ -3383,7 +3376,8 @@ static void __init rcu_init_geometry(void)
> > > jiffies_till_first_fqs = d;
> > > if (jiffies_till_next_fqs == ULONG_MAX)
> > > jiffies_till_next_fqs = d;
> > > - adjust_jiffies_till_sched_qs();
> > > + if (jiffies_till_sched_qs == ULONG_MAX)
> > > + adjust_jiffies_till_sched_qs();
> > >
> > > /* If the compile-time values are accurate, just leave. */
> > > if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
On Sat, Jul 13, 2019 at 4:47 AM Byungchul Park
<[email protected]> wrote:
>
> On Fri, Jul 12, 2019 at 9:51 PM Joel Fernandes <[email protected]> wrote:
> >
> > On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > > anywhere.
> > > >
> > > > Any idea why it was added but not used?
> > > >
> > > > I am interested in dumping this value just for fun, and seeing what I get.
> > > >
> > > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > > look like.
> > >
> > > Hi,
> > >
> > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > rcu: Remove debugfs tracing
> > >
> > > removed all debugfs tracing, gp_max also included.
> > >
> > > And you sounds great. And even looks not that hard to add it like,
> > >
> > > :)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index ad9dc86..86095ff 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > raw_spin_lock_irq_rcu_node(rnp);
> > > rcu_state.gp_end = jiffies;
> > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > - if (gp_duration > rcu_state.gp_max)
> > > + if (gp_duration > rcu_state.gp_max) {
> > > rcu_state.gp_max = gp_duration;
> > > + trace_rcu_grace_period(something something);
> > > + }
> >
> > Yes, that makes sense. But I think it is much better off as a readable value
> > from a virtual fs. The drawback of tracing for this sort of thing are:
> > - Tracing will only catch it if tracing is on
> > - Tracing data can be lost if too many events, then no one has a clue what
> > the max gp time is.
> > - The data is already available in rcu_state::gp_max so copying it into the
> > trace buffer seems a bit pointless IMHO
> > - It is a lot easier on ones eyes to process a single counter than process
> > heaps of traces.
> >
> > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > hurt and could do more good than not. The scheduler already does this for
> > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > but not much (or never) about new statistics.
> >
> > Tracing has its strengths but may not apply well here IMO. I think a counter
> > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > the stall timeouts and also any other RCU knobs. What do you think?
>
> I prefer proc/sys knob for it to tracepoint. Why I've considered it is just it
> looks like undoing what Paul did at ae91aa0ad.
>
> I think you're rational enough. I just wondered how Paul think of it.
I believe that, at least initially, a set of statistics can be made
available only when the rcutorture or rcuperf module is loaded. That way
they are purely for debugging and nothing needs to be exposed in normally
distributed kernels, which reduces the testability concerns.
rcu_state::gp_max would be trivial to expose through this, but for
other statistics that are more complicated - perhaps
tracepoint_probe_register can be used to add hooks on to the
tracepoints and generate statistics from them. Again the registration
of the probe and the probe handler itself would all be in
rcutorture/rcuperf test code and not a part of the kernel proper.
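Roughly what I have in mind, as an untested sketch that would live entirely
in the rcuperf/rcutorture code. The rcu_grace_period tracepoint prototype,
the "end" event string, and the availability of for_each_kernel_tracepoint()
to the module are my assumptions about the current tree:

/*
 * Illustrative only: count grace-period "end" events from within the
 * test module by attaching a probe to the existing rcu_grace_period
 * tracepoint, with no new code in the kernel proper.
 */
#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/printk.h>
#include <linux/string.h>
#include <linux/tracepoint.h>

static atomic_long_t nr_gp_ends;
static struct tracepoint *rcu_gp_tp;

/* Probe prototype: void *data first, then the tracepoint's TP_PROTO(). */
static void rcuperf_gp_probe(void *data, const char *rcuname,
                             unsigned long gp_seq, const char *gpevent)
{
        if (!strcmp(gpevent, "end"))    /* Assumed to match rcu_gp_cleanup(). */
                atomic_long_inc(&nr_gp_ends);
}

static void rcuperf_find_tp(struct tracepoint *tp, void *priv)
{
        if (!strcmp(tp->name, "rcu_grace_period"))
                rcu_gp_tp = tp;
}

static int rcuperf_stats_init(void)
{
        for_each_kernel_tracepoint(rcuperf_find_tp, NULL);
        if (!rcu_gp_tp)
                return -ENODEV;
        return tracepoint_probe_register(rcu_gp_tp,
                                         (void *)rcuperf_gp_probe, NULL);
}

static void rcuperf_stats_cleanup(void)
{
        if (rcu_gp_tp)
                tracepoint_probe_unregister(rcu_gp_tp,
                                            (void *)rcuperf_gp_probe, NULL);
        pr_info("rcuperf: observed %ld grace-period ends\n",
                atomic_long_read(&nr_gp_ends));
}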
Thoughts?
- Joel
On Sat, Jul 13, 2019 at 10:20:02AM -0400, Joel Fernandes wrote:
> On Sat, Jul 13, 2019 at 4:47 AM Byungchul Park
> <[email protected]> wrote:
> >
> > On Fri, Jul 12, 2019 at 9:51 PM Joel Fernandes <[email protected]> wrote:
> > >
> > > On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > > > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > > > anywhere.
> > > > >
> > > > > Any idea why it was added but not used?
> > > > >
> > > > > I am interested in dumping this value just for fun, and seeing what I get.
> > > > >
> > > > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > > > look like.
> > > >
> > > > Hi,
> > > >
> > > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > > rcu: Remove debugfs tracing
> > > >
> > > > removed all debugfs tracing, gp_max also included.
> > > >
> > > > And you sounds great. And even looks not that hard to add it like,
> > > >
> > > > :)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index ad9dc86..86095ff 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > > raw_spin_lock_irq_rcu_node(rnp);
> > > > rcu_state.gp_end = jiffies;
> > > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > > - if (gp_duration > rcu_state.gp_max)
> > > > + if (gp_duration > rcu_state.gp_max) {
> > > > rcu_state.gp_max = gp_duration;
> > > > + trace_rcu_grace_period(something something);
> > > > + }
> > >
> > > Yes, that makes sense. But I think it is much better off as a readable value
> > > from a virtual fs. The drawback of tracing for this sort of thing are:
> > > - Tracing will only catch it if tracing is on
> > > - Tracing data can be lost if too many events, then no one has a clue what
> > > the max gp time is.
> > > - The data is already available in rcu_state::gp_max so copying it into the
> > > trace buffer seems a bit pointless IMHO
> > > - It is a lot easier on ones eyes to process a single counter than process
> > > heaps of traces.
> > >
> > > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > > hurt and could do more good than not. The scheduler already does this for
> > > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > > but not much (or never) about new statistics.
> > >
> > > Tracing has its strengths but may not apply well here IMO. I think a counter
> > > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > > the stall timeouts and also any other RCU knobs. What do you think?
> >
> > I prefer proc/sys knob for it to tracepoint. Why I've considered it is just it
> > looks like undoing what Paul did at ae91aa0ad.
> >
> > I think you're rational enough. I just wondered how Paul think of it.
>
> I believe at least initially, a set of statistics can be made
> available only when rcutorture or rcuperf module is loaded. That way
> they are purely only for debugging and nothing needs to be exposed to
> normal kernels distributed thus reducing testability concerns.
>
> rcu_state::gp_max would be trivial to expose through this, but for
> other statistics that are more complicated - perhaps
> tracepoint_probe_register can be used to add hooks on to the
> tracepoints and generate statistics from them. Again the registration
> of the probe and the probe handler itself would all be in
> rcutorture/rcuperf test code and not a part of the kernel proper.
> Thoughts?
It still feels like you guys are hyperfocusing on this one particular
knob. I instead need you to look at the interrelating knobs as a group.
On the debugging side, suppose someone gives you an RCU bug report.
What information will you need? How can you best get that information
without excessive numbers of over-and-back interactions with the guy
reporting the bug? As part of this last question, what information is
normally supplied with the bug? Alternatively, what information are
bug reporters normally expected to provide when asked?
Thanx, Paul
On Sat, Jul 13, 2019 at 08:13:30AM -0700, Paul E. McKenney wrote:
> On Sat, Jul 13, 2019 at 10:20:02AM -0400, Joel Fernandes wrote:
> > On Sat, Jul 13, 2019 at 4:47 AM Byungchul Park
> > <[email protected]> wrote:
> > >
> > > On Fri, Jul 12, 2019 at 9:51 PM Joel Fernandes <[email protected]> wrote:
> > > >
> > > > On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > > > > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > > > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > > > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > > > > anywhere.
> > > > > >
> > > > > > Any idea why it was added but not used?
> > > > > >
> > > > > > I am interested in dumping this value just for fun, and seeing what I get.
> > > > > >
> > > > > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > > > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > > > > look like.
> > > > >
> > > > > Hi,
> > > > >
> > > > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > > > rcu: Remove debugfs tracing
> > > > >
> > > > > removed all debugfs tracing, gp_max also included.
> > > > >
> > > > > And you sounds great. And even looks not that hard to add it like,
> > > > >
> > > > > :)
> > > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index ad9dc86..86095ff 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > > > raw_spin_lock_irq_rcu_node(rnp);
> > > > > rcu_state.gp_end = jiffies;
> > > > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > > > - if (gp_duration > rcu_state.gp_max)
> > > > > + if (gp_duration > rcu_state.gp_max) {
> > > > > rcu_state.gp_max = gp_duration;
> > > > > + trace_rcu_grace_period(something something);
> > > > > + }
> > > >
> > > > Yes, that makes sense. But I think it is much better off as a readable value
> > > > from a virtual fs. The drawback of tracing for this sort of thing are:
> > > > - Tracing will only catch it if tracing is on
> > > > - Tracing data can be lost if too many events, then no one has a clue what
> > > > the max gp time is.
> > > > - The data is already available in rcu_state::gp_max so copying it into the
> > > > trace buffer seems a bit pointless IMHO
> > > > - It is a lot easier on ones eyes to process a single counter than process
> > > > heaps of traces.
> > > >
> > > > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > > > hurt and could do more good than not. The scheduler already does this for
> > > > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > > > but not much (or never) about new statistics.
> > > >
> > > > Tracing has its strengths but may not apply well here IMO. I think a counter
> > > > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > > > the stall timeouts and also any other RCU knobs. What do you think?
> > >
> > > I prefer proc/sys knob for it to tracepoint. Why I've considered it is just it
> > > looks like undoing what Paul did at ae91aa0ad.
> > >
> > > I think you're rational enough. I just wondered how Paul think of it.
> >
> > I believe at least initially, a set of statistics can be made
> > available only when rcutorture or rcuperf module is loaded. That way
> > they are purely only for debugging and nothing needs to be exposed to
> > normal kernels distributed thus reducing testability concerns.
> >
> > rcu_state::gp_max would be trivial to expose through this, but for
> > other statistics that are more complicated - perhaps
> > tracepoint_probe_register can be used to add hooks on to the
> > tracepoints and generate statistics from them. Again the registration
> > of the probe and the probe handler itself would all be in
> > rcutorture/rcuperf test code and not a part of the kernel proper.
> > Thoughts?
>
> It still feels like you guys are hyperfocusing on this one particular
> knob. I instead need you to look at the interrelating knobs as a group.
Thanks for the hints, we'll do that.
> On the debugging side, suppose someone gives you an RCU bug report.
> What information will you need? How can you best get that information
> without excessive numbers of over-and-back interactions with the guy
> reporting the bug? As part of this last question, what information is
> normally supplied with the bug? Alternatively, what information are
> bug reporters normally expected to provide when asked?
I suppose I could dig out some of our past Android bug reports where there
were RCU issues, but if there are any fires you are currently fighting, do
send them our way as debugging homework ;-)
thanks,
- Joel
On Sat, Jul 13, 2019 at 11:42:57AM -0400, Joel Fernandes wrote:
> On Sat, Jul 13, 2019 at 08:13:30AM -0700, Paul E. McKenney wrote:
> > On Sat, Jul 13, 2019 at 10:20:02AM -0400, Joel Fernandes wrote:
> > > On Sat, Jul 13, 2019 at 4:47 AM Byungchul Park
> > > <[email protected]> wrote:
> > > >
> > > > On Fri, Jul 12, 2019 at 9:51 PM Joel Fernandes <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > > > > > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > > > > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > > > > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > > > > > anywhere.
> > > > > > >
> > > > > > > Any idea why it was added but not used?
> > > > > > >
> > > > > > > I am interested in dumping this value just for fun, and seeing what I get.
> > > > > > >
> > > > > > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > > > > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > > > > > look like.
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > > > > rcu: Remove debugfs tracing
> > > > > >
> > > > > > removed all debugfs tracing, gp_max also included.
> > > > > >
> > > > > > And you sounds great. And even looks not that hard to add it like,
> > > > > >
> > > > > > :)
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index ad9dc86..86095ff 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > > > > raw_spin_lock_irq_rcu_node(rnp);
> > > > > > rcu_state.gp_end = jiffies;
> > > > > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > > > > - if (gp_duration > rcu_state.gp_max)
> > > > > > + if (gp_duration > rcu_state.gp_max) {
> > > > > > rcu_state.gp_max = gp_duration;
> > > > > > + trace_rcu_grace_period(something something);
> > > > > > + }
> > > > >
> > > > > Yes, that makes sense. But I think it is much better off as a readable value
> > > > > from a virtual fs. The drawback of tracing for this sort of thing are:
> > > > > - Tracing will only catch it if tracing is on
> > > > > - Tracing data can be lost if too many events, then no one has a clue what
> > > > > the max gp time is.
> > > > > - The data is already available in rcu_state::gp_max so copying it into the
> > > > > trace buffer seems a bit pointless IMHO
> > > > > - It is a lot easier on ones eyes to process a single counter than process
> > > > > heaps of traces.
> > > > >
> > > > > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > > > > hurt and could do more good than not. The scheduler already does this for
> > > > > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > > > > but not much (or never) about new statistics.
> > > > >
> > > > > Tracing has its strengths but may not apply well here IMO. I think a counter
> > > > > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > > > > the stall timeouts and also any other RCU knobs. What do you think?
> > > >
> > > > I prefer proc/sys knob for it to tracepoint. Why I've considered it is just it
> > > > looks like undoing what Paul did at ae91aa0ad.
> > > >
> > > > I think you're rational enough. I just wondered how Paul think of it.
> > >
> > > I believe at least initially, a set of statistics can be made
> > > available only when rcutorture or rcuperf module is loaded. That way
> > > they are purely only for debugging and nothing needs to be exposed to
> > > normal kernels distributed thus reducing testability concerns.
> > >
> > > rcu_state::gp_max would be trivial to expose through this, but for
> > > other statistics that are more complicated - perhaps
> > > tracepoint_probe_register can be used to add hooks on to the
> > > tracepoints and generate statistics from them. Again the registration
> > > of the probe and the probe handler itself would all be in
> > > rcutorture/rcuperf test code and not a part of the kernel proper.
> > > Thoughts?
> >
> > It still feels like you guys are hyperfocusing on this one particular
> > knob. I instead need you to look at the interrelating knobs as a group.
>
> Thanks for the hints, we'll do that.
>
> > On the debugging side, suppose someone gives you an RCU bug report.
> > What information will you need? How can you best get that information
> > without excessive numbers of over-and-back interactions with the guy
> > reporting the bug? As part of this last question, what information is
> > normally supplied with the bug? Alternatively, what information are
> > bug reporters normally expected to provide when asked?
>
> I suppose I could dig out some of our Android bug reports of the past where
> there were RCU issues but if there's any fires you are currently fighting do
> send it our way as debugging homework ;-)
Evading the questions, are we?
OK, I can be flexible. Suppose that you were getting RCU CPU stall
warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
Of course, this really means that some other CPU/task is holding up
multi_cpu_stop() without also blocking the current grace period.
What is the best way to work out what is really holding things up?
Thanx, Paul
On Sun, Jul 14, 2019 at 2:41 AM Paul E. McKenney <[email protected]> wrote:
>
> On Sat, Jul 13, 2019 at 11:42:57AM -0400, Joel Fernandes wrote:
> > On Sat, Jul 13, 2019 at 08:13:30AM -0700, Paul E. McKenney wrote:
> > > On Sat, Jul 13, 2019 at 10:20:02AM -0400, Joel Fernandes wrote:
> > > > On Sat, Jul 13, 2019 at 4:47 AM Byungchul Park
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jul 12, 2019 at 9:51 PM Joel Fernandes <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > > > > > > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > > > > > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > > > > > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > > > > > > anywhere.
> > > > > > > >
> > > > > > > > Any idea why it was added but not used?
> > > > > > > >
> > > > > > > > I am interested in dumping this value just for fun, and seeing what I get.
> > > > > > > >
> > > > > > > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > > > > > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > > > > > > look like.
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > > > > > rcu: Remove debugfs tracing
> > > > > > >
> > > > > > > removed all debugfs tracing, gp_max also included.
> > > > > > >
> > > > > > > And you sounds great. And even looks not that hard to add it like,
> > > > > > >
> > > > > > > :)
> > > > > > >
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index ad9dc86..86095ff 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > > > > > raw_spin_lock_irq_rcu_node(rnp);
> > > > > > > rcu_state.gp_end = jiffies;
> > > > > > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > > > > > - if (gp_duration > rcu_state.gp_max)
> > > > > > > + if (gp_duration > rcu_state.gp_max) {
> > > > > > > rcu_state.gp_max = gp_duration;
> > > > > > > + trace_rcu_grace_period(something something);
> > > > > > > + }
> > > > > >
> > > > > > Yes, that makes sense. But I think it is much better off as a readable value
> > > > > > from a virtual fs. The drawback of tracing for this sort of thing are:
> > > > > > - Tracing will only catch it if tracing is on
> > > > > > - Tracing data can be lost if too many events, then no one has a clue what
> > > > > > the max gp time is.
> > > > > > - The data is already available in rcu_state::gp_max so copying it into the
> > > > > > trace buffer seems a bit pointless IMHO
> > > > > > - It is a lot easier on ones eyes to process a single counter than process
> > > > > > heaps of traces.
> > > > > >
> > > > > > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > > > > > hurt and could do more good than not. The scheduler already does this for
> > > > > > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > > > > > but not much (or never) about new statistics.
> > > > > >
> > > > > > Tracing has its strengths but may not apply well here IMO. I think a counter
> > > > > > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > > > > > the stall timeouts and also any other RCU knobs. What do you think?
> > > > >
> > > > > I prefer proc/sys knob for it to tracepoint. Why I've considered it is just it
> > > > > looks like undoing what Paul did at ae91aa0ad.
> > > > >
> > > > > I think you're rational enough. I just wondered how Paul think of it.
> > > >
> > > > I believe at least initially, a set of statistics can be made
> > > > available only when rcutorture or rcuperf module is loaded. That way
> > > > they are purely only for debugging and nothing needs to be exposed to
> > > > normal kernels distributed thus reducing testability concerns.
> > > >
> > > > rcu_state::gp_max would be trivial to expose through this, but for
> > > > other statistics that are more complicated - perhaps
> > > > tracepoint_probe_register can be used to add hooks on to the
> > > > tracepoints and generate statistics from them. Again the registration
> > > > of the probe and the probe handler itself would all be in
> > > > rcutorture/rcuperf test code and not a part of the kernel proper.
> > > > Thoughts?
> > >
> > > It still feels like you guys are hyperfocusing on this one particular
> > > knob. I instead need you to look at the interrelating knobs as a group.
> >
> > Thanks for the hints, we'll do that.
> >
> > > On the debugging side, suppose someone gives you an RCU bug report.
> > > What information will you need? How can you best get that information
> > > without excessive numbers of over-and-back interactions with the guy
> > > reporting the bug? As part of this last question, what information is
> > > normally supplied with the bug? Alternatively, what information are
> > > bug reporters normally expected to provide when asked?
> >
> > I suppose I could dig out some of our Android bug reports of the past where
> > there were RCU issues but if there's any fires you are currently fighting do
> > send it our way as debugging homework ;-)
>
> Evading the questions, are we?
>
> OK, I can be flexible. Suppose that you were getting RCU CPU stall
> warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> Of course, this really means that some other CPU/task is holding up
> multi_cpu_stop() without also blocking the current grace period.
>
> What is the best way to work out what is really holding things up?
Either, in the PREEMPT case, the stopper preempted another task that was in
a critical section and something is wrong there, or the mechanisms for
urgent quiescent-state control aren't working correctly.
I don't know exactly what you intended, but I would check things like
(1) irq-disable / eqs / tick / scheduler events and (2) whether the special
handling for each level of QS urgency has started correctly. For that
purpose, the full history of those events would be the most useful.
Thinking about it more, we could come up with a good way to make use of
that data to identify what the problem is. Did I catch your point
correctly? If so, Joel and I can start working on it.
Otherwise, please correct me.
--
Thanks,
Byungchul
On Sun, Jul 14, 2019 at 10:39:58PM +0900, Byungchul Park wrote:
> On Sun, Jul 14, 2019 at 2:41 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Sat, Jul 13, 2019 at 11:42:57AM -0400, Joel Fernandes wrote:
> > > On Sat, Jul 13, 2019 at 08:13:30AM -0700, Paul E. McKenney wrote:
> > > > On Sat, Jul 13, 2019 at 10:20:02AM -0400, Joel Fernandes wrote:
> > > > > On Sat, Jul 13, 2019 at 4:47 AM Byungchul Park
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Jul 12, 2019 at 9:51 PM Joel Fernandes <[email protected]> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 12, 2019 at 03:32:40PM +0900, Byungchul Park wrote:
> > > > > > > > On Thu, Jul 11, 2019 at 03:58:39PM -0400, Joel Fernandes wrote:
> > > > > > > > > Hmm, speaking of grace period durations, it seems to me the maximum grace
> > > > > > > > > period ever is recorded in rcu_state.gp_max. However it is not read from
> > > > > > > > > anywhere.
> > > > > > > > >
> > > > > > > > > Any idea why it was added but not used?
> > > > > > > > >
> > > > > > > > > I am interested in dumping this value just for fun, and seeing what I get.
> > > > > > > > >
> > > > > > > > > I wonder also it is useful to dump it in rcutorture/rcuperf to find any
> > > > > > > > > issues, or even expose it in sys/proc fs to see what worst case grace periods
> > > > > > > > > look like.
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > > > > > > rcu: Remove debugfs tracing
> > > > > > > >
> > > > > > > > removed all debugfs tracing, gp_max also included.
> > > > > > > >
> > > > > > > > And you sounds great. And even looks not that hard to add it like,
> > > > > > > >
> > > > > > > > :)
> > > > > > > >
> > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > > index ad9dc86..86095ff 100644
> > > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > > > > > > raw_spin_lock_irq_rcu_node(rnp);
> > > > > > > > rcu_state.gp_end = jiffies;
> > > > > > > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > > > > > > - if (gp_duration > rcu_state.gp_max)
> > > > > > > > + if (gp_duration > rcu_state.gp_max) {
> > > > > > > > rcu_state.gp_max = gp_duration;
> > > > > > > > + trace_rcu_grace_period(something something);
> > > > > > > > + }
> > > > > > >
> > > > > > > Yes, that makes sense. But I think it is much better off as a readable value
> > > > > > > from a virtual fs. The drawback of tracing for this sort of thing are:
> > > > > > > - Tracing will only catch it if tracing is on
> > > > > > > - Tracing data can be lost if too many events, then no one has a clue what
> > > > > > > the max gp time is.
> > > > > > > - The data is already available in rcu_state::gp_max so copying it into the
> > > > > > > trace buffer seems a bit pointless IMHO
> > > > > > > - It is a lot easier on ones eyes to process a single counter than process
> > > > > > > heaps of traces.
> > > > > > >
> > > > > > > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > > > > > > hurt and could do more good than not. The scheduler already does this for
> > > > > > > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > > > > > > but not much (or never) about new statistics.
> > > > > > >
> > > > > > > Tracing has its strengths but may not apply well here IMO. I think a counter
> > > > > > > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > > > > > > the stall timeouts and also any other RCU knobs. What do you think?
> > > > > >
> > > > > > I prefer proc/sys knob for it to tracepoint. Why I've considered it is just it
> > > > > > looks like undoing what Paul did at ae91aa0ad.
> > > > > >
> > > > > > I think you're rational enough. I just wondered how Paul think of it.
> > > > >
> > > > > I believe at least initially, a set of statistics can be made
> > > > > available only when rcutorture or rcuperf module is loaded. That way
> > > > > they are purely only for debugging and nothing needs to be exposed to
> > > > > normal kernels distributed thus reducing testability concerns.
> > > > >
> > > > > rcu_state::gp_max would be trivial to expose through this, but for
> > > > > other statistics that are more complicated - perhaps
> > > > > tracepoint_probe_register can be used to add hooks on to the
> > > > > tracepoints and generate statistics from them. Again the registration
> > > > > of the probe and the probe handler itself would all be in
> > > > > rcutorture/rcuperf test code and not a part of the kernel proper.
> > > > > Thoughts?
> > > >
> > > > It still feels like you guys are hyperfocusing on this one particular
> > > > knob. I instead need you to look at the interrelating knobs as a group.
> > >
> > > Thanks for the hints, we'll do that.
> > >
> > > > On the debugging side, suppose someone gives you an RCU bug report.
> > > > What information will you need? How can you best get that information
> > > > without excessive numbers of over-and-back interactions with the guy
> > > > reporting the bug? As part of this last question, what information is
> > > > normally supplied with the bug? Alternatively, what information are
> > > > bug reporters normally expected to provide when asked?
> > >
> > > I suppose I could dig out some of our Android bug reports of the past where
> > > there were RCU issues but if there's any fires you are currently fighting do
> > > send it our way as debugging homework ;-)
> >
> > Evading the questions, are we?
> >
> > OK, I can be flexible. Suppose that you were getting RCU CPU stall
> > warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> > Of course, this really means that some other CPU/task is holding up
> > multi_cpu_stop() without also blocking the current grace period.
> >
> > What is the best way to work out what is really holding things up?
>
> Either the stopper preempted another being in a critical section and
> has something wrong itself in case of PREEMPT or mechanisms for
> urgent control doesn't work correctly.
>
> I don't know what exactly you intended but I would check things like
> (1) irq disable / eqs / tick / scheduler events and (2) whether special
> handling for each level of qs urgency has started correctly. For that
> purpose all the history of those events would be more useful.
>
> And with thinking it more, we could come up with a good way to
> make use of those data to identify what the problem is. Do I catch
> the point correctly? If so, me and Joel can start to work on it.
> Otherwise, please correct me.
I believe you are on the right track. In short, it would be great if
the kernel would automatically dump out the needed information when
cpu_stopper gets stalled, sort of like RCU does (much of the time,
anyway) in its CPU stall warnings. Given a patch that did this, I would
be quite happy to help advocate for it!
Thanx, Paul
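As a rough illustration of the readable-counter idea quoted above, rcu_state.gp_max
could be exposed as a read-only module parameter. The parameter name and exact
formatting below are assumptions made for the sketch, not an existing interface:

/*
 * Sketch only: expose the worst-case grace-period duration (in jiffies)
 * under /sys/module/rcutree/parameters/. The name "gp_max_jiffies" is
 * made up for this example.
 */
static int param_get_gp_max(char *buffer, const struct kernel_param *kp)
{
	return sprintf(buffer, "%lu\n", READ_ONCE(rcu_state.gp_max));
}

static int param_set_gp_max(const char *val, const struct kernel_param *kp)
{
	return -EPERM;	/* display only */
}

static const struct kernel_param_ops gp_max_ops = {
	.set = param_set_gp_max,
	.get = param_get_gp_max,
};

module_param_cb(gp_max_jiffies, &gp_max_ops, NULL, 0444);

Whether such a counter belongs in the kernel proper or only behind rcuperf/rcutorture,
as debated above, is a separate question.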
On Sun, Jul 14, 2019 at 06:56:10AM -0700, Paul E. McKenney wrote:
[snip]
> > > > > > > > >
> > > > > > > > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > > > > > > > rcu: Remove debugfs tracing
> > > > > > > > >
> > > > > > > > > removed all debugfs tracing, gp_max also included.
> > > > > > > > >
> > > > > > > > > And you sounds great. And even looks not that hard to add it like,
> > > > > > > > >
> > > > > > > > > :)
> > > > > > > > >
> > > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > > > index ad9dc86..86095ff 100644
> > > > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > > > > > > > raw_spin_lock_irq_rcu_node(rnp);
> > > > > > > > > rcu_state.gp_end = jiffies;
> > > > > > > > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > > > > > > > - if (gp_duration > rcu_state.gp_max)
> > > > > > > > > + if (gp_duration > rcu_state.gp_max) {
> > > > > > > > > rcu_state.gp_max = gp_duration;
> > > > > > > > > + trace_rcu_grace_period(something something);
> > > > > > > > > + }
> > > > > > > >
> > > > > > > > Yes, that makes sense. But I think it is much better off as a readable value
> > > > > > > > from a virtual fs. The drawback of tracing for this sort of thing are:
> > > > > > > > - Tracing will only catch it if tracing is on
> > > > > > > > - Tracing data can be lost if too many events, then no one has a clue what
> > > > > > > > the max gp time is.
> > > > > > > > - The data is already available in rcu_state::gp_max so copying it into the
> > > > > > > > trace buffer seems a bit pointless IMHO
> > > > > > > > - It is a lot easier on ones eyes to process a single counter than process
> > > > > > > > heaps of traces.
> > > > > > > >
> > > > > > > > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > > > > > > > hurt and could do more good than not. The scheduler already does this for
> > > > > > > > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > > > > > > > but not much (or never) about new statistics.
> > > > > > > >
> > > > > > > > Tracing has its strengths but may not apply well here IMO. I think a counter
> > > > > > > > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > > > > > > > the stall timeouts and also any other RCU knobs. What do you think?
> > > > > > >
> > > > > > > I prefer proc/sys knob for it to tracepoint. Why I've considered it is just it
> > > > > > > looks like undoing what Paul did at ae91aa0ad.
> > > > > > >
> > > > > > > I think you're rational enough. I just wondered how Paul think of it.
> > > > > >
> > > > > > I believe at least initially, a set of statistics can be made
> > > > > > available only when rcutorture or rcuperf module is loaded. That way
> > > > > > they are purely only for debugging and nothing needs to be exposed to
> > > > > > normal kernels distributed thus reducing testability concerns.
> > > > > >
> > > > > > rcu_state::gp_max would be trivial to expose through this, but for
> > > > > > other statistics that are more complicated - perhaps
> > > > > > tracepoint_probe_register can be used to add hooks on to the
> > > > > > tracepoints and generate statistics from them. Again the registration
> > > > > > of the probe and the probe handler itself would all be in
> > > > > > rcutorture/rcuperf test code and not a part of the kernel proper.
> > > > > > Thoughts?
> > > > >
> > > > > It still feels like you guys are hyperfocusing on this one particular
> > > > > knob. I instead need you to look at the interrelating knobs as a group.
> > > >
> > > > Thanks for the hints, we'll do that.
> > > >
> > > > > On the debugging side, suppose someone gives you an RCU bug report.
> > > > > What information will you need? How can you best get that information
> > > > > without excessive numbers of over-and-back interactions with the guy
> > > > > reporting the bug? As part of this last question, what information is
> > > > > normally supplied with the bug? Alternatively, what information are
> > > > > bug reporters normally expected to provide when asked?
> > > >
> > > > I suppose I could dig out some of our Android bug reports of the past where
> > > > there were RCU issues but if there's any fires you are currently fighting do
> > > > send it our way as debugging homework ;-)
> > >
> > > Evading the questions, are we?
Sorry if I sounded like I was evading; I was just saying that it would be nice
to work on some specific issue or bug report and then figure out what is needed
from there, instead of trying to predict what data is useful. Of course, I
think predicting and dumping data is also useful, but I was just wondering if
you had some specific issue in mind that we could work off of (you did
mention the CPU stopper below, so thank you!)
> > > OK, I can be flexible. Suppose that you were getting RCU CPU stall
> > > warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> > > Of course, this really means that some other CPU/task is holding up
> > > multi_cpu_stop() without also blocking the current grace period.
> > >
> > > What is the best way to work out what is really holding things up?
> >
> > Either the stopper preempted another being in a critical section and
> > has something wrong itself in case of PREEMPT or mechanisms for
> > urgent control doesn't work correctly.
> >
> > I don't know what exactly you intended but I would check things like
> > (1) irq disable / eqs / tick / scheduler events and (2) whether special
> > handling for each level of qs urgency has started correctly. For that
> > purpose all the history of those events would be more useful.
Agreed, these are all good.
> > And with thinking it more, we could come up with a good way to
> > make use of those data to identify what the problem is. Do I catch
> > the point correctly? If so, me and Joel can start to work on it.
> > Otherwise, please correct me.
>
> I believe you are on the right track. In short, it would be great if
> the kernel would automatically dump out the needed information when
> cpu_stopper gets stalled, sort of like RCU does (much of the time,
> anyway) in its CPU stall warnings. Given a patch that did this, I would
> be quite happy to help advocate for it!
In case you have a LKML link to a thread or bug report to this specific
cpu_stopper issue, please do pass it along.
Happy to work on this with Byungchul, thanks,
- Joel
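To make the tracepoint_probe_register() idea quoted above a bit more concrete, a
test-only module could look the rcu_grace_period tracepoint up by name and track the
worst case itself. The tracepoint prototype (rcuname, gp_seq, gpevent) and the
"start"/"end" event strings are assumptions about the current TRACE_EVENT definition,
so treat this strictly as a sketch:

#include <linux/jiffies.h>
#include <linux/module.h>
#include <linux/string.h>
#include <linux/tracepoint.h>

static struct tracepoint *gp_tp;
static unsigned long probe_gp_start;
static unsigned long probe_gp_max;	/* Worst-case GP duration, in jiffies. */
module_param(probe_gp_max, ulong, 0444);

/* Probe signature must match the rcu_grace_period TP_PROTO (assumed here). */
static void probe_rcu_gp(void *data, const char *rcuname,
			 unsigned long gp_seq, const char *gpevent)
{
	if (!strcmp(gpevent, "start"))
		probe_gp_start = jiffies;
	else if (!strcmp(gpevent, "end") &&
		 jiffies - probe_gp_start > probe_gp_max)
		probe_gp_max = jiffies - probe_gp_start;
}

static void find_gp_tp(struct tracepoint *tp, void *priv)
{
	if (!strcmp(tp->name, "rcu_grace_period"))
		*(struct tracepoint **)priv = tp;
}

static int __init gp_stats_init(void)
{
	/* Look the tracepoint up by name, then attach the probe. */
	for_each_kernel_tracepoint(find_gp_tp, &gp_tp);
	if (!gp_tp)
		return -ENODEV;
	return tracepoint_probe_register(gp_tp, probe_rcu_gp, NULL);
}
module_init(gp_stats_init);

static void __exit gp_stats_exit(void)
{
	if (gp_tp)
		tracepoint_probe_unregister(gp_tp, probe_rcu_gp, NULL);
}
module_exit(gp_stats_exit);
MODULE_LICENSE("GPL");

This keeps both the registration and the probe handler in test-only code, as suggested,
while the resulting maximum is still readable from sysfs.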
On Mon, Jul 15, 2019 at 01:39:24PM -0400, Joel Fernandes wrote:
> On Sun, Jul 14, 2019 at 06:56:10AM -0700, Paul E. McKenney wrote:
> [snip]
> > > > > > > > > >
> > > > > > > > > > commit ae91aa0adb14dc33114d566feca2f7cb7a96b8b7
> > > > > > > > > > rcu: Remove debugfs tracing
> > > > > > > > > >
> > > > > > > > > > removed all debugfs tracing, gp_max also included.
> > > > > > > > > >
> > > > > > > > > > And you sounds great. And even looks not that hard to add it like,
> > > > > > > > > >
> > > > > > > > > > :)
> > > > > > > > > >
> > > > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > > > > index ad9dc86..86095ff 100644
> > > > > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > > > > @@ -1658,8 +1658,10 @@ static void rcu_gp_cleanup(void)
> > > > > > > > > > raw_spin_lock_irq_rcu_node(rnp);
> > > > > > > > > > rcu_state.gp_end = jiffies;
> > > > > > > > > > gp_duration = rcu_state.gp_end - rcu_state.gp_start;
> > > > > > > > > > - if (gp_duration > rcu_state.gp_max)
> > > > > > > > > > + if (gp_duration > rcu_state.gp_max) {
> > > > > > > > > > rcu_state.gp_max = gp_duration;
> > > > > > > > > > + trace_rcu_grace_period(something something);
> > > > > > > > > > + }
> > > > > > > > >
> > > > > > > > > Yes, that makes sense. But I think it is much better off as a readable value
> > > > > > > > > from a virtual fs. The drawback of tracing for this sort of thing are:
> > > > > > > > > - Tracing will only catch it if tracing is on
> > > > > > > > > - Tracing data can be lost if too many events, then no one has a clue what
> > > > > > > > > the max gp time is.
> > > > > > > > > - The data is already available in rcu_state::gp_max so copying it into the
> > > > > > > > > trace buffer seems a bit pointless IMHO
> > > > > > > > > - It is a lot easier on ones eyes to process a single counter than process
> > > > > > > > > heaps of traces.
> > > > > > > > >
> > > > > > > > > I think a minimal set of RCU counters exposed to /proc or /sys should not
> > > > > > > > > hurt and could do more good than not. The scheduler already does this for
> > > > > > > > > scheduler statistics. I have seen Peter complain a lot about new tracepoints
> > > > > > > > > but not much (or never) about new statistics.
> > > > > > > > >
> > > > > > > > > Tracing has its strengths but may not apply well here IMO. I think a counter
> > > > > > > > > like this could be useful for tuning of things like the jiffies_*_sched_qs,
> > > > > > > > > the stall timeouts and also any other RCU knobs. What do you think?
> > > > > > > >
> > > > > > > > I prefer proc/sys knob for it to tracepoint. Why I've considered it is just it
> > > > > > > > looks like undoing what Paul did at ae91aa0ad.
> > > > > > > >
> > > > > > > > I think you're rational enough. I just wondered how Paul think of it.
> > > > > > >
> > > > > > > I believe at least initially, a set of statistics can be made
> > > > > > > available only when rcutorture or rcuperf module is loaded. That way
> > > > > > > they are purely only for debugging and nothing needs to be exposed to
> > > > > > > normal kernels distributed thus reducing testability concerns.
> > > > > > >
> > > > > > > rcu_state::gp_max would be trivial to expose through this, but for
> > > > > > > other statistics that are more complicated - perhaps
> > > > > > > tracepoint_probe_register can be used to add hooks on to the
> > > > > > > tracepoints and generate statistics from them. Again the registration
> > > > > > > of the probe and the probe handler itself would all be in
> > > > > > > rcutorture/rcuperf test code and not a part of the kernel proper.
> > > > > > > Thoughts?
> > > > > >
> > > > > > It still feels like you guys are hyperfocusing on this one particular
> > > > > > knob. I instead need you to look at the interrelating knobs as a group.
> > > > >
> > > > > Thanks for the hints, we'll do that.
> > > > >
> > > > > > On the debugging side, suppose someone gives you an RCU bug report.
> > > > > > What information will you need? How can you best get that information
> > > > > > without excessive numbers of over-and-back interactions with the guy
> > > > > > reporting the bug? As part of this last question, what information is
> > > > > > normally supplied with the bug? Alternatively, what information are
> > > > > > bug reporters normally expected to provide when asked?
> > > > >
> > > > > I suppose I could dig out some of our Android bug reports of the past where
> > > > > there were RCU issues but if there's any fires you are currently fighting do
> > > > > send it our way as debugging homework ;-)
> > > >
> > > > Evading the questions, are we?
>
> Sorry if I sounded like evading, I was just saying that it would be nice to
> work on some specific issue or bug report and then figure out what is needed
> from there, instead of trying to predict what data is useful. Of course, I
> think predicting and dumping data is also useful, but I was just wondering if
> you had some specific issue in mind that we could work off of (you did
> mention the CPU stopper below so thank you!)
No worries, and fair enough!
> > > > OK, I can be flexible. Suppose that you were getting RCU CPU stall
> > > > warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> > > > Of course, this really means that some other CPU/task is holding up
> > > > multi_cpu_stop() without also blocking the current grace period.
> > > >
> > > > What is the best way to work out what is really holding things up?
> > >
> > > Either the stopper preempted another being in a critical section and
> > > has something wrong itself in case of PREEMPT or mechanisms for
> > > urgent control doesn't work correctly.
> > >
> > > I don't know what exactly you intended but I would check things like
> > > (1) irq disable / eqs / tick / scheduler events and (2) whether special
> > > handling for each level of qs urgency has started correctly. For that
> > > purpose all the history of those events would be more useful.
>
> Agreed, these are all good.
Just to reiterate -- it would be good if multi_cpu_stop() could
automatically dump out useful information when it has been delayed too
long, where "too long" is probably a bit shorter than the RCU CPU stall
warning timeout. Useful information would include the stack of the task
impeding multi_cpu_stop().
> > > And with thinking it more, we could come up with a good way to
> > > make use of those data to identify what the problem is. Do I catch
> > > the point correctly? If so, me and Joel can start to work on it.
> > > Otherwise, please correct me.
> >
> > I believe you are on the right track. In short, it would be great if
> > the kernel would automatically dump out the needed information when
> > cpu_stopper gets stalled, sort of like RCU does (much of the time,
> > anyway) in its CPU stall warnings. Given a patch that did this, I would
> > be quite happy to help advocate for it!
>
> In case you have a LKML link to a thread or bug report to this specific
> cpu_stopper issue, please do pass it along.
I hope to have an rcutorture-based repeat-by in a few days, give or take.
(Famous last words!)
> Happy to work on this with Byungchul, thanks,
Thank you both!
Thanx, Paul
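A very rough sketch of the kind of watchdog described above, with a made-up threshold
and helper name; the idea is that multi_cpu_stop()'s wait loop would call this each
time around, resetting the start time whenever the state machine makes progress:

#include <linux/jiffies.h>
#include <linux/nmi.h>
#include <linux/printk.h>

/* Made-up threshold: a bit shorter than the RCU CPU stall timeout. */
#define MULTI_STOP_STALL_TIMEOUT	(15 * HZ)

/*
 * Sketch: @start is when the current multi-stop state was entered,
 * *@dumped makes sure we complain only once per stall.
 */
static void multi_stop_check_stall(unsigned long start, bool *dumped)
{
	if (*dumped || !time_after(jiffies, start + MULTI_STOP_STALL_TIMEOUT))
		return;

	*dumped = true;
	pr_warn("multi_cpu_stop(): no progress for %lu jiffies\n",
		jiffies - start);
	/*
	 * Dump every CPU's stack for now; ideally only the CPUs that
	 * have not yet acked the current state would be targeted.
	 */
	trigger_all_cpu_backtrace();
}

Which stacks to dump, and how, is exactly what the discussion below about backtracing
only the holdout CPUs gets into.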
Trimming the list a bit to keep my noise level low,
On Sat, Jul 13, 2019 at 1:41 PM Paul E. McKenney <[email protected]> wrote:
[snip]
> > It still feels like you guys are hyperfocusing on this one particular
> > > knob. I instead need you to look at the interrelating knobs as a group.
> >
> > Thanks for the hints, we'll do that.
> >
> > > On the debugging side, suppose someone gives you an RCU bug report.
> > > What information will you need? How can you best get that information
> > > without excessive numbers of over-and-back interactions with the guy
> > > reporting the bug? As part of this last question, what information is
> > > normally supplied with the bug? Alternatively, what information are
> > > bug reporters normally expected to provide when asked?
> >
> > I suppose I could dig out some of our Android bug reports of the past where
> > there were RCU issues but if there's any fires you are currently fighting do
> > send it our way as debugging homework ;-)
>
> Suppose that you were getting RCU CPU stall
> warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> Of course, this really means that some other CPU/task is holding up
> multi_cpu_stop() without also blocking the current grace period.
>
So I took a shot at this trying to learn how CPU stoppers work in
relation to this problem.
I am assuming here that CPU X has entered the MULTI_STOP_DISABLE_IRQ state
in multi_cpu_stop() but another CPU Y has not yet entered this state.
So CPU X is stalling RCU, but it is really because of CPU Y. Now in the
problem statement, you mentioned CPU Y is not holding up the grace
period, which means Y doesn't have any of IRQ, BH or preemption
disabled; but it is still somehow stalling RCU indirectly by troubling
X.
This can only happen if:
- CPU Y has a thread executing on it that is higher priority than CPU
X's stopper thread, which prevents the stopper from getting scheduled -- but
the CPU stopper thread (migration/..) is highest-priority RT, so this would
be some kind of odd scheduler bug.
- There is a bug in the CPU stopper machinery itself preventing it
from scheduling the stopper on Y. Even though Y is not holding up the
grace period.
Did I get that right? Would be exciting to run the rcutorture test
once Paul has it available to reproduce this problem.
thanks,
- Joel
On Thu, Jul 18, 2019 at 12:14 PM Joel Fernandes <[email protected]> wrote:
>
> Trimming the list a bit to keep my noise level low,
>
> On Sat, Jul 13, 2019 at 1:41 PM Paul E. McKenney <[email protected]> wrote:
> [snip]
> > > It still feels like you guys are hyperfocusing on this one particular
> > > > knob. I instead need you to look at the interrelating knobs as a group.
> > >
> > > Thanks for the hints, we'll do that.
> > >
> > > > On the debugging side, suppose someone gives you an RCU bug report.
> > > > What information will you need? How can you best get that information
> > > > without excessive numbers of over-and-back interactions with the guy
> > > > reporting the bug? As part of this last question, what information is
> > > > normally supplied with the bug? Alternatively, what information are
> > > > bug reporters normally expected to provide when asked?
> > >
> > > I suppose I could dig out some of our Android bug reports of the past where
> > > there were RCU issues but if there's any fires you are currently fighting do
> > > send it our way as debugging homework ;-)
> >
> > Suppose that you were getting RCU CPU stall
> > warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> > Of course, this really means that some other CPU/task is holding up
> > multi_cpu_stop() without also blocking the current grace period.
> >
>
> So I took a shot at this trying to learn how CPU stoppers work in
> relation to this problem.
>
> I am assuming here say CPU X has entered MULTI_STOP_DISABLE_IRQ state
> in multi_cpu_stop() but another CPU Y has not yet entered this state.
> So CPU X is stalling RCU but it is really because of CPU Y. Now in the
> problem statement, you mentioned CPU Y is not holding up the grace
> period, which means Y doesn't have any of IRQ, BH or preemption
> disabled ; but is still somehow stalling RCU indirectly by troubling
> X.
>
> This can only happen if :
> - CPU Y has a thread executing on it that is higher priority than CPU
> X's stopper thread which prevents it from getting scheduled. - but the
Sorry, here I meant "higher priority than CPU Y's stopper thread".
On Thu, Jul 18, 2019 at 12:14:22PM -0400, Joel Fernandes wrote:
> Trimming the list a bit to keep my noise level low,
>
> On Sat, Jul 13, 2019 at 1:41 PM Paul E. McKenney <[email protected]> wrote:
> [snip]
> > > It still feels like you guys are hyperfocusing on this one particular
> > > > knob. I instead need you to look at the interrelating knobs as a group.
> > >
> > > Thanks for the hints, we'll do that.
> > >
> > > > On the debugging side, suppose someone gives you an RCU bug report.
> > > > What information will you need? How can you best get that information
> > > > without excessive numbers of over-and-back interactions with the guy
> > > > reporting the bug? As part of this last question, what information is
> > > > normally supplied with the bug? Alternatively, what information are
> > > > bug reporters normally expected to provide when asked?
> > >
> > > I suppose I could dig out some of our Android bug reports of the past where
> > > there were RCU issues but if there's any fires you are currently fighting do
> > > send it our way as debugging homework ;-)
> >
> > Suppose that you were getting RCU CPU stall
> > warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> > Of course, this really means that some other CPU/task is holding up
> > multi_cpu_stop() without also blocking the current grace period.
> >
>
> So I took a shot at this trying to learn how CPU stoppers work in
> relation to this problem.
>
> I am assuming here say CPU X has entered MULTI_STOP_DISABLE_IRQ state
> in multi_cpu_stop() but another CPU Y has not yet entered this state.
> So CPU X is stalling RCU but it is really because of CPU Y. Now in the
> problem statement, you mentioned CPU Y is not holding up the grace
> period, which means Y doesn't have any of IRQ, BH or preemption
> disabled ; but is still somehow stalling RCU indirectly by troubling
> X.
>
> This can only happen if :
> - CPU Y has a thread executing on it that is higher priority than CPU
> X's stopper thread which prevents it from getting scheduled. - but the
> CPU stopper thread (migration/..) is highest priority RT so this would
> be some kind of an odd scheduler bug.
> - There is a bug in the CPU stopper machinery itself preventing it
> from scheduling the stopper on Y. Even though Y is not holding up the
> grace period.
- CPU Y might have already passed through its quiescent state for
the current grace period, then disabled IRQs indefinitely.
Now, CPU Y would block a later grace period, but CPU X is
preventing the current grace period from ending, so no such
later grace period can start.
> Did I get that right? Would be exciting to run the rcutorture test
> once Paul has it available to reproduce this problem.
Working on it! Slow, I know!
Thanx, Paul
On Thu, Jul 18, 2019 at 12:14:22PM -0400, Joel Fernandes wrote:
> Trimming the list a bit to keep my noise level low,
>
> On Sat, Jul 13, 2019 at 1:41 PM Paul E. McKenney <[email protected]> wrote:
> [snip]
> > > It still feels like you guys are hyperfocusing on this one particular
> > > > knob. I instead need you to look at the interrelating knobs as a group.
> > >
> > > Thanks for the hints, we'll do that.
> > >
> > > > On the debugging side, suppose someone gives you an RCU bug report.
> > > > What information will you need? How can you best get that information
> > > > without excessive numbers of over-and-back interactions with the guy
> > > > reporting the bug? As part of this last question, what information is
> > > > normally supplied with the bug? Alternatively, what information are
> > > > bug reporters normally expected to provide when asked?
> > >
> > > I suppose I could dig out some of our Android bug reports of the past where
> > > there were RCU issues but if there's any fires you are currently fighting do
> > > send it our way as debugging homework ;-)
> >
> > Suppose that you were getting RCU CPU stall
> > warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> > Of course, this really means that some other CPU/task is holding up
> > multi_cpu_stop() without also blocking the current grace period.
> >
>
> So I took a shot at this trying to learn how CPU stoppers work in
> relation to this problem.
>
> I am assuming here say CPU X has entered MULTI_STOP_DISABLE_IRQ state
> in multi_cpu_stop() but another CPU Y has not yet entered this state.
> So CPU X is stalling RCU but it is really because of CPU Y. Now in the
> problem statement, you mentioned CPU Y is not holding up the grace
> period, which means Y doesn't have any of IRQ, BH or preemption
> disabled ; but is still somehow stalling RCU indirectly by troubling
> X.
>
> This can only happen if :
> - CPU Y has a thread executing on it that is higher priority than CPU
> X's stopper thread which prevents it from getting scheduled. - but the
> CPU stopper thread (migration/..) is highest priority RT so this would
> be some kind of an odd scheduler bug.
I think this bug is unlikely to happen.
> - There is a bug in the CPU stopper machinery itself preventing it
> from scheduling the stopper on Y. Even though Y is not holding up the
> grace period.
Or any thread on Y is busy with preemption/irq disabled preventing the
stopper from being scheduled on Y.
Or something is stuck in ttwu() to wake up the stopper on Y due to any
scheduler locks such as pi_lock or rq->lock or something.
I think what you mentioned can happen easily.
Basically we would need information about preemption/irq disabled
sections on Y and scheduler's current activity on every cpu at that time.
Am I missing something?
> Did I get that right? Would be exciting to run the rcutorture test
> once Paul has it available to reproduce this problem.
Thanks,
Byungchul
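One piece of the per-CPU information mentioned above (together with the
quiescent-state urgency handling discussed earlier in the thread) could be dumped with
something like the sketch below. It presumes the code sits in kernel/rcu/tree.c where
the rcu_data per-CPU variable is visible, and the field names are assumptions about
the current rcu_data layout rather than anything from this thread:

/*
 * Sketch: print, for each online CPU, whether the urgent-QS flags have
 * been set and what the dyntick counter looks like.
 */
static void rcu_dump_qs_urgency(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

		pr_info("CPU %d: rcu_urgent_qs=%d rcu_need_heavy_qs=%d dynticks=%#x\n",
			cpu, READ_ONCE(rdp->rcu_urgent_qs),
			READ_ONCE(rdp->rcu_need_heavy_qs),
			atomic_read(&rdp->dynticks));
	}
}

A dump like this, taken when the stopper stalls, would show whether the urgency
escalation ever kicked in on the holdout CPU.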
On Thu, Jul 18, 2019 at 02:34:19PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 18, 2019 at 12:14:22PM -0400, Joel Fernandes wrote:
> > Trimming the list a bit to keep my noise level low,
> >
> > On Sat, Jul 13, 2019 at 1:41 PM Paul E. McKenney <[email protected]> wrote:
> > [snip]
> > > > It still feels like you guys are hyperfocusing on this one particular
> > > > > knob. I instead need you to look at the interrelating knobs as a group.
> > > >
> > > > Thanks for the hints, we'll do that.
> > > >
> > > > > On the debugging side, suppose someone gives you an RCU bug report.
> > > > > What information will you need? How can you best get that information
> > > > > without excessive numbers of over-and-back interactions with the guy
> > > > > reporting the bug? As part of this last question, what information is
> > > > > normally supplied with the bug? Alternatively, what information are
> > > > > bug reporters normally expected to provide when asked?
> > > >
> > > > I suppose I could dig out some of our Android bug reports of the past where
> > > > there were RCU issues but if there's any fires you are currently fighting do
> > > > send it our way as debugging homework ;-)
> > >
> > > Suppose that you were getting RCU CPU stall
> > > warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> > > Of course, this really means that some other CPU/task is holding up
> > > multi_cpu_stop() without also blocking the current grace period.
> > >
> >
> > So I took a shot at this trying to learn how CPU stoppers work in
> > relation to this problem.
> >
> > I am assuming here say CPU X has entered MULTI_STOP_DISABLE_IRQ state
> > in multi_cpu_stop() but another CPU Y has not yet entered this state.
> > So CPU X is stalling RCU but it is really because of CPU Y. Now in the
> > problem statement, you mentioned CPU Y is not holding up the grace
> > period, which means Y doesn't have any of IRQ, BH or preemption
> > disabled ; but is still somehow stalling RCU indirectly by troubling
> > X.
> >
> > This can only happen if :
> > - CPU Y has a thread executing on it that is higher priority than CPU
> > X's stopper thread which prevents it from getting scheduled. - but the
> > CPU stopper thread (migration/..) is highest priority RT so this would
> > be some kind of an odd scheduler bug.
> > - There is a bug in the CPU stopper machinery itself preventing it
> > from scheduling the stopper on Y. Even though Y is not holding up the
> > grace period.
>
> - CPU Y might have already passed through its quiescent state for
> the current grace period, then disabled IRQs indefinitely.
> Now, CPU Y would block a later grace period, but CPU X is
> preventing the current grace period from ending, so no such
> later grace period can start.
Ah totally possible, yes!
thanks,
- Joel
On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
[snip]
> > - There is a bug in the CPU stopper machinery itself preventing it
> > from scheduling the stopper on Y. Even though Y is not holding up the
> > grace period.
>
> Or any thread on Y is busy with preemption/irq disabled preventing the
> stopper from being scheduled on Y.
>
> Or something is stuck in ttwu() to wake up the stopper on Y due to any
> scheduler locks such as pi_lock or rq->lock or something.
>
> I think what you mentioned can happen easily.
>
> Basically we would need information about preemption/irq disabled
> sections on Y and scheduler's current activity on every cpu at that time.
I think all that's needed is an NMI backtrace on all CPUs. On ARM we
don't have NMI solutions and only IPI or interrupt based backtrace
works, which should at least catch the preempt disable and softirq
disable cases.
But yeah, I don't see why just the stacks of those CPUs that are
blocking CPU X would not suffice for the trivial cases where a
piece of misbehaving code disabled interrupts / preemption and
prevented the stopper thread from executing.
Maybe once the test case is ready (no rush!), it will be more
clear what can help.
J.
On Thu, Jul 18, 2019 at 02:34:19PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 18, 2019 at 12:14:22PM -0400, Joel Fernandes wrote:
> > Trimming the list a bit to keep my noise level low,
> >
> > On Sat, Jul 13, 2019 at 1:41 PM Paul E. McKenney <[email protected]> wrote:
> > [snip]
> > > > It still feels like you guys are hyperfocusing on this one particular
> > > > > knob. I instead need you to look at the interrelating knobs as a group.
> > > >
> > > > Thanks for the hints, we'll do that.
> > > >
> > > > > On the debugging side, suppose someone gives you an RCU bug report.
> > > > > What information will you need? How can you best get that information
> > > > > without excessive numbers of over-and-back interactions with the guy
> > > > > reporting the bug? As part of this last question, what information is
> > > > > normally supplied with the bug? Alternatively, what information are
> > > > > bug reporters normally expected to provide when asked?
> > > >
> > > > I suppose I could dig out some of our Android bug reports of the past where
> > > > there were RCU issues but if there's any fires you are currently fighting do
> > > > send it our way as debugging homework ;-)
> > >
> > > Suppose that you were getting RCU CPU stall
> > > warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> > > Of course, this really means that some other CPU/task is holding up
> > > multi_cpu_stop() without also blocking the current grace period.
> > >
> >
> > So I took a shot at this trying to learn how CPU stoppers work in
> > relation to this problem.
> >
> > I am assuming here say CPU X has entered MULTI_STOP_DISABLE_IRQ state
> > in multi_cpu_stop() but another CPU Y has not yet entered this state.
> > So CPU X is stalling RCU but it is really because of CPU Y. Now in the
> > problem statement, you mentioned CPU Y is not holding up the grace
> > period, which means Y doesn't have any of IRQ, BH or preemption
> > disabled ; but is still somehow stalling RCU indirectly by troubling
> > X.
> >
> > This can only happen if :
> > - CPU Y has a thread executing on it that is higher priority than CPU
> > X's stopper thread which prevents it from getting scheduled. - but the
> > CPU stopper thread (migration/..) is highest priority RT so this would
> > be some kind of an odd scheduler bug.
> > - There is a bug in the CPU stopper machinery itself preventing it
> > from scheduling the stopper on Y. Even though Y is not holding up the
> > grace period.
>
> - CPU Y might have already passed through its quiescent state for
> the current grace period, then disabled IRQs indefinitely.
Or disabled them for longer than the period that RCU considers a stall. Or
had preemption disabled for that long. Or the stopper on Y has yet to be
woken up inside the scheduler for some reason, perhaps locks.
> Now, CPU Y would block a later grace period, but CPU X is
> preventing the current grace period from ending, so no such
> later grace period can start.
>
> > Did I get that right? Would be exciting to run the rcutorture test
> > once Paul has it available to reproduce this problem.
>
> Working on it! Slow, I know!
>
> Thanx, Paul
On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> [snip]
> > > - There is a bug in the CPU stopper machinery itself preventing it
> > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > grace period.
> >
> > Or any thread on Y is busy with preemption/irq disabled preventing the
> > stopper from being scheduled on Y.
> >
> > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > scheduler locks such as pi_lock or rq->lock or something.
> >
> > I think what you mentioned can happen easily.
> >
> > Basically we would need information about preemption/irq disabled
> > sections on Y and scheduler's current activity on every cpu at that time.
>
> I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> don't have NMI solutions and only IPI or interrupt based backtrace
> works which should at least catch and the preempt disable and softirq
> disable cases.
>
> But yeah I don't see why just the stacks of those CPUs that are
> blocking the CPU X would not suffice for the trivial cases where a
> piece of misbehaving code disable interrupts / preemption and
> prevented the stopper thread from executing.
Right. That makes it all the more interesting though! :-)
> May be once the test case is ready (no rush!) , then it will be more
> clear what can help.
Yes. I'm really happy to help get things about RCU, which I love, fixed or
improved. And with the test case or a real issue, I believe I can do
more helpful work. Looking forward to it, too (no rush!).
Thanks,
Byungchul
On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> [snip]
> > > - There is a bug in the CPU stopper machinery itself preventing it
> > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > grace period.
> >
> > Or any thread on Y is busy with preemption/irq disabled preventing the
> > stopper from being scheduled on Y.
> >
> > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > scheduler locks such as pi_lock or rq->lock or something.
> >
> > I think what you mentioned can happen easily.
> >
> > Basically we would need information about preemption/irq disabled
> > sections on Y and scheduler's current activity on every cpu at that time.
>
> I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> don't have NMI solutions and only IPI or interrupt based backtrace
> works which should at least catch and the preempt disable and softirq
> disable cases.
True, though people with systems having hundreds of CPUs might not
thank you for forcing an NMI backtrace on each of them. Is it possible
to NMI only the ones that are holding up the CPU stopper?
Thanx, Paul
> But yeah I don't see why just the stacks of those CPUs that are
> blocking the CPU X would not suffice for the trivial cases where a
> piece of misbehaving code disable interrupts / preemption and
> prevented the stopper thread from executing.
>
> May be once the test case is ready (no rush!) , then it will be more
> clear what can help.
>
> J.
>
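Paul's question about backtracing only the holdouts might look roughly like the sketch
below. It assumes the stopper code tracked which CPUs have acked the current state in
a cpumask; the real multi_stop_data only keeps an atomic count of acks, so that mask
is purely an assumption for illustration:

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/nmi.h>
#include <linux/printk.h>

/*
 * Sketch: @to_stop is the set of CPUs multi_cpu_stop() is waiting on,
 * @acked is a hypothetical mask of CPUs that reached the current state.
 */
static void multi_stop_backtrace_holdouts(const struct cpumask *to_stop,
					  const struct cpumask *acked)
{
	cpumask_var_t holdouts;

	if (!alloc_cpumask_var(&holdouts, GFP_ATOMIC))
		return;

	cpumask_andnot(holdouts, to_stop, acked);
	if (!cpumask_empty(holdouts)) {
		pr_warn("multi_cpu_stop(): CPUs %*pbl not yet in target state\n",
			cpumask_pr_args(holdouts));
		/* NMI where the arch has one; IPI-based backtrace otherwise. */
		trigger_cpumask_backtrace(holdouts);
	}
	free_cpumask_var(holdouts);
}

On architectures without a true NMI this is only as good as the IPI-based fallback,
which ties into the IRQ-disabled concern raised just below.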
On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> > [snip]
> > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > > grace period.
> > >
> > > Or any thread on Y is busy with preemption/irq disabled preventing the
> > > stopper from being scheduled on Y.
> > >
> > > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > > scheduler locks such as pi_lock or rq->lock or something.
> > >
> > > I think what you mentioned can happen easily.
> > >
> > > Basically we would need information about preemption/irq disabled
> > > sections on Y and scheduler's current activity on every cpu at that time.
> >
> > I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> > don't have NMI solutions and only IPI or interrupt based backtrace
> > works which should at least catch and the preempt disable and softirq
> > disable cases.
>
> True, though people with systems having hundreds of CPUs might not
> thank you for forcing an NMI backtrace on each of them. Is it possible
> to NMI only the ones that are holding up the CPU stopper?
What a good idea! I think it's possible!
But we need to think about the case where NMI doesn't work and the
hold-up was caused by disabled IRQs.
Though the weekend is just around the corner, I will keep thinking
about it over the weekend!
Thanks,
Byungchul
> Thanx, Paul
>
> > But yeah I don't see why just the stacks of those CPUs that are
> > blocking the CPU X would not suffice for the trivial cases where a
> > piece of misbehaving code disable interrupts / preemption and
> > prevented the stopper thread from executing.
> >
> > May be once the test case is ready (no rush!) , then it will be more
> > clear what can help.
> >
> > J.
> >
--
Thanks,
Byungchul
On Fri, Jul 19, 2019 at 06:57:58PM +0900, Byungchul Park wrote:
> On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> > > [snip]
> > > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > > > grace period.
> > > >
> > > > Or any thread on Y is busy with preemption/irq disabled preventing the
> > > > stopper from being scheduled on Y.
> > > >
> > > > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > > > scheduler locks such as pi_lock or rq->lock or something.
> > > >
> > > > I think what you mentioned can happen easily.
> > > >
> > > > Basically we would need information about preemption/irq disabled
> > > > sections on Y and scheduler's current activity on every cpu at that time.
> > >
> > > I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> > > don't have NMI solutions and only IPI or interrupt based backtrace
> > > works which should at least catch and the preempt disable and softirq
> > > disable cases.
> >
> > True, though people with systems having hundreds of CPUs might not
> > thank you for forcing an NMI backtrace on each of them. Is it possible
> > to NMI only the ones that are holding up the CPU stopper?
>
> What a good idea! I think it's possible!
>
> But we need to think about the case NMI doesn't work when the
> holding-up was caused by IRQ disabled.
>
> Though it's just around the corner of weekend, I will keep thinking
> on it during weekend!
Very good!
Thanx, Paul
> Thanks,
> Byungchul
>
> > Thanx, Paul
> >
> > > But yeah I don't see why just the stacks of those CPUs that are
> > > blocking the CPU X would not suffice for the trivial cases where a
> > > piece of misbehaving code disable interrupts / preemption and
> > > prevented the stopper thread from executing.
> > >
> > > May be once the test case is ready (no rush!) , then it will be more
> > > clear what can help.
> > >
> > > J.
> > >
>
>
>
> --
> Thanks,
> Byungchul
>
On Fri, Jul 19, 2019 at 3:57 PM Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Jul 19, 2019 at 06:57:58PM +0900, Byungchul Park wrote:
> > On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > > > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> > > > [snip]
> > > > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > > > > grace period.
> > > > >
> > > > > Or any thread on Y is busy with preemption/irq disabled preventing the
> > > > > stopper from being scheduled on Y.
> > > > >
> > > > > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > > > > scheduler locks such as pi_lock or rq->lock or something.
> > > > >
> > > > > I think what you mentioned can happen easily.
> > > > >
> > > > > Basically we would need information about preemption/irq disabled
> > > > > sections on Y and scheduler's current activity on every cpu at that time.
> > > >
> > > > I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> > > > don't have NMI solutions and only IPI or interrupt based backtrace
> > > > works which should at least catch and the preempt disable and softirq
> > > > disable cases.
> > >
> > > True, though people with systems having hundreds of CPUs might not
> > > thank you for forcing an NMI backtrace on each of them. Is it possible
> > > to NMI only the ones that are holding up the CPU stopper?
> >
> > What a good idea! I think it's possible!
> >
> > But we need to think about the case NMI doesn't work when the
> > holding-up was caused by IRQ disabled.
> >
> > Though it's just around the corner of weekend, I will keep thinking
> > on it during weekend!
>
> Very good!
I too will think more about it ;-) Agreed with the point about the
hundreds-of-CPUs use case.
Thanks, have a great weekend,
- Joel
On Fri, Jul 19, 2019 at 04:33:56PM -0400, Joel Fernandes wrote:
> On Fri, Jul 19, 2019 at 3:57 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Fri, Jul 19, 2019 at 06:57:58PM +0900, Byungchul Park wrote:
> > > On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > > > > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> > > > > [snip]
> > > > > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > > > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > > > > > grace period.
> > > > > >
> > > > > > Or any thread on Y is busy with preemption/irq disabled preventing the
> > > > > > stopper from being scheduled on Y.
> > > > > >
> > > > > > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > > > > > scheduler locks such as pi_lock or rq->lock or something.
> > > > > >
> > > > > > I think what you mentioned can happen easily.
> > > > > >
> > > > > > Basically we would need information about preemption/irq disabled
> > > > > > sections on Y and scheduler's current activity on every cpu at that time.
> > > > >
> > > > > I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> > > > > don't have NMI solutions and only IPI or interrupt based backtrace
> > > > > works which should at least catch and the preempt disable and softirq
> > > > > disable cases.
> > > >
> > > > True, though people with systems having hundreds of CPUs might not
> > > > thank you for forcing an NMI backtrace on each of them. Is it possible
> > > > to NMI only the ones that are holding up the CPU stopper?
> > >
> > > What a good idea! I think it's possible!
> > >
> > > But we need to think about the case NMI doesn't work when the
> > > holding-up was caused by IRQ disabled.
> > >
> > > Though it's just around the corner of weekend, I will keep thinking
> > > on it during weekend!
> >
> > Very good!
>
> Me too will think more about it ;-) Agreed with point about 100s of
> CPUs usecase,
>
> Thanks, have a great weekend,
BTW, if there's any long code section with irq/preemption disabled, then
the problem would not only be an RCU stall. And we could also use the
latency tracer or something similar to detect the bad situation.
So in this case, sending an IPI/NMI to the CPUs where the stoppers cannot
be scheduled does not give us additional meaningful information.
I think Paul started to think about this to solve some real problem. I
seriously love to help RCU and it's my pleasure to dig deep into this kind
of RCU stuff, but I've yet to define exactly what the problem is. Sorry.
Could you share the real issue? I don't think you have to reproduce it.
Just sharing the issue that inspired you is enough. Then I
might be able to develop the 'how' with Joel! :-) It's our pleasure!
Thanks,
Byungchul
On Tue, Jul 23, 2019 at 08:05:21PM +0900, Byungchul Park wrote:
> On Fri, Jul 19, 2019 at 04:33:56PM -0400, Joel Fernandes wrote:
> > On Fri, Jul 19, 2019 at 3:57 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Fri, Jul 19, 2019 at 06:57:58PM +0900, Byungchul Park wrote:
> > > > On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > > > > > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> > > > > > [snip]
> > > > > > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > > > > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > > > > > > grace period.
> > > > > > >
> > > > > > > Or any thread on Y is busy with preemption/irq disabled preventing the
> > > > > > > stopper from being scheduled on Y.
> > > > > > >
> > > > > > > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > > > > > > scheduler locks such as pi_lock or rq->lock or something.
> > > > > > >
> > > > > > > I think what you mentioned can happen easily.
> > > > > > >
> > > > > > > Basically we would need information about preemption/irq disabled
> > > > > > > sections on Y and scheduler's current activity on every cpu at that time.
> > > > > >
> > > > > > I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> > > > > > don't have NMI solutions and only IPI or interrupt based backtrace
> > > > > > works which should at least catch and the preempt disable and softirq
> > > > > > disable cases.
> > > > >
> > > > > True, though people with systems having hundreds of CPUs might not
> > > > > thank you for forcing an NMI backtrace on each of them. Is it possible
> > > > > to NMI only the ones that are holding up the CPU stopper?
> > > >
> > > > What a good idea! I think it's possible!
> > > >
> > > > But we need to think about the case NMI doesn't work when the
> > > > holding-up was caused by IRQ disabled.
> > > >
> > > > Though it's just around the corner of weekend, I will keep thinking
> > > > on it during weekend!
> > >
> > > Very good!
> >
> > Me too will think more about it ;-) Agreed with point about 100s of
> > CPUs usecase,
> >
> > Thanks, have a great weekend,
>
> BTW, if there's any long code section with irq/preemption disabled, then
> the problem would be not only about RCU stall. And we can also use
> latency tracer or something to detect the bad situation.
>
> So in this case, sending ipi/nmi to the CPUs where the stoppers cannot
> to be scheduled does not give us additional meaningful information.
>
> I think Paul started to think about this to solve some real problem. I
> seriously love to help RCU and it's my pleasure to dig deep into kind of
> RCU stuff, but I've yet to define exactly what problem is. Sorry.
>
> Could you share the real issue? I think you don't have to reproduce it.
> Just sharing the issue that you got inspired from is enough. Then I
> might be able to develop 'how' with Joel! :-) It's our pleasure!
It is unfortunately quite intermittent. I was hoping to find a way
to make it happen more often. Part of the underlying problem appears
to be lock contention, in that reducing contention made it even more
intermittent. Which is good in general, but not for exercising the
CPU-stopper issue.
But perhaps your hardware will make this happen more readily than does
mine. The repeat-by is simple, namely run TREE04 on branch "dev" on an
eight-CPU system. It appears that the number of CPUs used by the test
should match the number available on the system that you are running on,
though perhaps affinity could allow mismatches.
So why not try it and see what happens?
Thanx, Paul
On Tue, Jul 23, 2019 at 06:47:17AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 23, 2019 at 08:05:21PM +0900, Byungchul Park wrote:
> > On Fri, Jul 19, 2019 at 04:33:56PM -0400, Joel Fernandes wrote:
> > > On Fri, Jul 19, 2019 at 3:57 PM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Fri, Jul 19, 2019 at 06:57:58PM +0900, Byungchul Park wrote:
> > > > > On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > > > > > > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> > > > > > > [snip]
> > > > > > > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > > > > > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > > > > > > > grace period.
> > > > > > > >
> > > > > > > > Or any thread on Y is busy with preemption/irq disabled preventing the
> > > > > > > > stopper from being scheduled on Y.
> > > > > > > >
> > > > > > > > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > > > > > > > scheduler locks such as pi_lock or rq->lock or something.
> > > > > > > >
> > > > > > > > I think what you mentioned can happen easily.
> > > > > > > >
> > > > > > > > Basically we would need information about preemption/irq disabled
> > > > > > > > sections on Y and scheduler's current activity on every cpu at that time.
> > > > > > >
> > > > > > > I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> > > > > > > don't have NMI solutions and only IPI or interrupt based backtrace
> > > > > > > works which should at least catch and the preempt disable and softirq
> > > > > > > disable cases.
> > > > > >
> > > > > > True, though people with systems having hundreds of CPUs might not
> > > > > > thank you for forcing an NMI backtrace on each of them. Is it possible
> > > > > > to NMI only the ones that are holding up the CPU stopper?
> > > > >
> > > > > What a good idea! I think it's possible!
> > > > >
> > > > > But we need to think about the case NMI doesn't work when the
> > > > > holding-up was caused by IRQ disabled.
> > > > >
> > > > > Though it's just around the corner of weekend, I will keep thinking
> > > > > on it during weekend!
> > > >
> > > > Very good!
> > >
> > > Me too will think more about it ;-) Agreed with point about 100s of
> > > CPUs usecase,
> > >
> > > Thanks, have a great weekend,
> >
> > BTW, if there's any long code section with irq/preemption disabled, then
> > the problem would be not only about RCU stall. And we can also use
> > latency tracer or something to detect the bad situation.
> >
> > So in this case, sending ipi/nmi to the CPUs where the stoppers cannot
> > to be scheduled does not give us additional meaningful information.
> >
> > I think Paul started to think about this to solve some real problem. I
> > seriously love to help RCU and it's my pleasure to dig deep into kind of
> > RCU stuff, but I've yet to define exactly what problem is. Sorry.
> >
> > Could you share the real issue? I think you don't have to reproduce it.
> > Just sharing the issue that you got inspired from is enough. Then I
> > might be able to develop 'how' with Joel! :-) It's our pleasure!
>
> It is unfortunately quite intermittent. I was hoping to find a way
> to make it happen more often. Part of the underlying problem appears
> to be lock contention, in that reducing contention made it even more
> intermittent. Which is good in general, but not for exercising the
> CPU-stopper issue.
>
> But perhaps your hardware will make this happen more readily than does
> mine. The repeat-by is simple, namely run TREE04 on branch "dev" on an
> eight-CPU system. It appear that the number of CPUs used by the test
> should match the number available on the system that you are running on,
> though perhaps affinity could allow mismatches.
>
> So why not try it and see what happens?
And another potential issue causing this is a CONFIG_NO_HZ_FULL=y
kernel running in kernel mode (rcutorture on the one hand and callback
invocation on the other) for extended periods of time with the scheduling
clock disabled. Just started the tests for this. They will be running
for quite some time, which this week is a good thing. ;-)
Thanx, Paul
On Tue, Jul 23, 2019 at 09:54:03AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 23, 2019 at 06:47:17AM -0700, Paul E. McKenney wrote:
> > On Tue, Jul 23, 2019 at 08:05:21PM +0900, Byungchul Park wrote:
> > > On Fri, Jul 19, 2019 at 04:33:56PM -0400, Joel Fernandes wrote:
> > > > On Fri, Jul 19, 2019 at 3:57 PM Paul E. McKenney <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jul 19, 2019 at 06:57:58PM +0900, Byungchul Park wrote:
> > > > > > On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > > > > > > > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> > > > > > > > [snip]
> > > > > > > > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > > > > > > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > > > > > > > > grace period.
> > > > > > > > >
> > > > > > > > > Or any thread on Y is busy with preemption/irq disabled preventing the
> > > > > > > > > stopper from being scheduled on Y.
> > > > > > > > >
> > > > > > > > > Or something is stuck in ttwu() to wake up the stopper on Y due to any
> > > > > > > > > scheduler locks such as pi_lock or rq->lock or something.
> > > > > > > > >
> > > > > > > > > I think what you mentioned can happen easily.
> > > > > > > > >
> > > > > > > > > Basically we would need information about preemption/irq disabled
> > > > > > > > > sections on Y and scheduler's current activity on every cpu at that time.
> > > > > > > >
> > > > > > > > I think all that's needed is an NMI backtrace on all CPUs. An ARM we
> > > > > > > > don't have NMI solutions and only IPI or interrupt based backtrace
> > > > > > > > works which should at least catch and the preempt disable and softirq
> > > > > > > > disable cases.
> > > > > > >
> > > > > > > True, though people with systems having hundreds of CPUs might not
> > > > > > > thank you for forcing an NMI backtrace on each of them. Is it possible
> > > > > > > to NMI only the ones that are holding up the CPU stopper?
> > > > > >
> > > > > > What a good idea! I think it's possible!
> > > > > >
> > > > > > But we need to think about the case NMI doesn't work when the
> > > > > > holding-up was caused by IRQ disabled.
> > > > > >
> > > > > > Though it's just around the corner of weekend, I will keep thinking
> > > > > > on it during weekend!
> > > > >
> > > > > Very good!
> > > >
> > > > Me too will think more about it ;-) Agreed with point about 100s of
> > > > CPUs usecase,
> > > >
> > > > Thanks, have a great weekend,
> > >
> > > BTW, if there's any long code section with irq/preemption disabled, then
> > > the problem would be not only about RCU stall. And we can also use
> > > latency tracer or something to detect the bad situation.
> > >
> > > So in this case, sending an ipi/nmi to the CPUs where the stoppers cannot
> > > be scheduled does not give us additional meaningful information.
> > >
> > > I think Paul started to think about this to solve some real problem. I
> > > seriously love to help RCU and it's my pleasure to dig deep into this kind of
> > > RCU stuff, but I've yet to pin down exactly what the problem is. Sorry.
> > >
> > > Could you share the real issue? I think you don't have to reproduce it.
> > > Just sharing the issue that inspired you is enough. Then I
> > > might be able to develop the 'how' with Joel! :-) It's our pleasure!
> >
> > It is unfortunately quite intermittent. I was hoping to find a way
> > to make it happen more often. Part of the underlying problem appears
> > to be lock contention, in that reducing contention made it even more
> > intermittent. Which is good in general, but not for exercising the
> > CPU-stopper issue.
> >
> > But perhaps your hardware will make this happen more readily than does
> > mine. The repeat-by is simple, namely run TREE04 on branch "dev" on an
> > eight-CPU system. It appears that the number of CPUs used by the test
> > should match the number available on the system that you are running on,
> > though perhaps affinity could allow mismatches.
> >
> > So why not try it and see what happens?
>
> And another potential issue causing this is a CONFIG_NO_HZ_FULL=y
> kernel running in kernel mode (rcutorture on the one hand and callback
> invocation on the other) for extended periods of time with the scheduling
> clock disabled. Just started the tests for this. They will be running
> for quite some time, which this week is a good thing. ;-)
>
> Thanx, Paul

I see. This provides more insight into the problem.

Thanks,
Byungchul
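Picking up on the suggestion above to backtrace only the CPUs that are
holding up the CPU stopper, here is a minimal sketch. It assumes the
caller already has a cpumask of the CPUs whose stopper work has not yet
run; building that mask is exactly what the stopper machinery would have
to expose and is not shown here.

#include <linux/nmi.h>
#include <linux/cpumask.h>
#include <linux/sched/debug.h>

/*
 * Hypothetical helper: dump stacks only for the CPUs in @pending
 * (the CPUs whose stopper work has not run yet) instead of NMI-ing
 * every CPU in the system.
 */
static void backtrace_pending_stopper_cpus(const struct cpumask *pending)
{
	int cpu;

	for_each_cpu(cpu, pending) {
		/*
		 * trigger_single_cpu_backtrace() returns false when the
		 * architecture has no backtrace NMI/IPI support at all.
		 * Where only IPI-based backtraces exist, the request is
		 * sent but may never be serviced if the target CPU has
		 * interrupts disabled, which is the limitation discussed
		 * above. Fall back to dumping what the scheduler knows
		 * about the task currently running on that CPU.
		 */
		if (!trigger_single_cpu_backtrace(cpu))
			dump_cpu_task(cpu);
	}
}

The remaining question is how the stopper code would track which CPUs
are still pending, and what to do on architectures where only IPI-based
backtraces are available and the target has interrupts disabled.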
On Tue, Jul 23, 2019 at 06:47:17AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 23, 2019 at 08:05:21PM +0900, Byungchul Park wrote:
> > On Fri, Jul 19, 2019 at 04:33:56PM -0400, Joel Fernandes wrote:
> > > On Fri, Jul 19, 2019 at 3:57 PM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Fri, Jul 19, 2019 at 06:57:58PM +0900, Byungchul Park wrote:
> > > > > On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > > > > > > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <[email protected]> wrote:
> > > > > > > [snip]
> > > > > > > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > > > > > > from scheduling the stopper on Y. Even though Y is not holding up the
> > > > > > > > > grace period.
> > > > > > > >
> > > > > > > > Or any thread on Y is busy with preemption/irq disabled preventing the
> > > > > > > > stopper from being scheduled on Y.
> > > > > > > >
> > > > > > > > Or something is stuck in ttwu() trying to wake up the stopper on Y due to
> > > > > > > > scheduler locks such as pi_lock or rq->lock or something.
> > > > > > > >
> > > > > > > > I think what you mentioned can happen easily.
> > > > > > > >
> > > > > > > > Basically we would need information about preemption/irq-disabled
> > > > > > > > sections on Y and the scheduler's current activity on every CPU at that time.
> > > > > > >
> > > > > > > I think all that's needed is an NMI backtrace on all CPUs. On ARM we
> > > > > > > don't have NMI solutions and only IPI- or interrupt-based backtraces
> > > > > > > work, which should at least catch the preempt-disable and softirq-disable
> > > > > > > cases.
> > > > > >
> > > > > > True, though people with systems having hundreds of CPUs might not
> > > > > > thank you for forcing an NMI backtrace on each of them. Is it possible
> > > > > > to NMI only the ones that are holding up the CPU stopper?
> > > > >
> > > > > What a good idea! I think it's possible!
> > > > >
> > > > > But we need to think about the case where NMI doesn't work and the
> > > > > hold-up was caused by IRQs being disabled.
> > > > >
> > > > > Though the weekend is just around the corner, I will keep thinking
> > > > > about it over the weekend!
> > > >
> > > > Very good!
> > >
> > > I too will think more about it ;-) Agreed with the point about the
> > > 100s-of-CPUs use case,
> > >
> > > Thanks, have a great weekend,
> >
> > BTW, if there's any long code section with irq/preemption disabled, then
> > the problem would not just be about RCU stalls. And we can also use the
> > latency tracer or something to detect the bad situation.
> >
> > So in this case, sending an ipi/nmi to the CPUs where the stoppers cannot
> > be scheduled does not give us additional meaningful information.
> >
> > I think Paul started to think about this to solve some real problem. I
> > seriously love to help RCU and it's my pleasure to dig deep into this kind of
> > RCU stuff, but I've yet to pin down exactly what the problem is. Sorry.
> >
> > Could you share the real issue? I think you don't have to reproduce it.
> > Just sharing the issue that inspired you is enough. Then I
> > might be able to develop the 'how' with Joel! :-) It's our pleasure!
>
> It is unfortunately quite intermittent. I was hoping to find a way
> to make it happen more often. Part of the underlying problem appears
> to be lock contention, in that reducing contention made it even more
> intermittent. Which is good in general, but not for exercising the
> CPU-stopper issue.
>
> But perhaps your hardware will make this happen more readily than does
> mine. The repeat-by is simple, namely run TREE04 on branch "dev" on an
> eight-CPU system. It appears that the number of CPUs used by the test
> should match the number available on the system that you are running on,
> though perhaps affinity could allow mismatches.
>
> So why not try it and see what happens?
Thank you. I'll try it too.