2013-04-22 15:38:51

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH v7] sched: fix init NOHZ_IDLE flag

On my smp platform which is made of 5 cores in 2 clusters, I have the
nr_busy_cpu field of sched_group_power struct that is not null when the
platform is fully idle. The root cause is:
During the boot sequence, some CPUs reach the idle loop and set their
NOHZ_IDLE flag while waiting for others CPUs to boot. But the nr_busy_cpus
field is initialized later with the assumption that all CPUs are in the busy
state whereas some CPUs have already set their NOHZ_IDLE flag.

More generally, the NOHZ_IDLE flag must be initialized when new sched_domains
are created in order to ensure that NOHZ_IDLE and nr_busy_cpus are aligned.

This condition can be ensured by adding a synchronize_rcu between the
destruction of old sched_domains and the creation of new ones so the NOHZ_IDLE
flag will not be updated with old sched_domain once it has been initialized.
But this solution introduces a additionnal latency in the rebuild sequence
that is called during cpu hotplug.

As suggested by Frederic Weisbecker, another solution is to have the same
rcu lifecycle for both NOHZ_IDLE and sched_domain struct.
A new nohz_flags has been added to sched_domain so both flags and sched_domain
will share the same RCU lifecycle and will be always synchronized. This
solution is prefered to the creation of a new struct with an extra pointer
indirection.

The synchronization is done at the cost of :
- An additional indirection and a rcu_dereference for accessing the NOHZ_IDLE
flag.
- We use only the nohz_flags field of the top sched_domain.

Change since v6:
- Add the flags in struct sched_domain instead of creating a sched_domain_rq.

Change since v5:
- minor variable and function name change.
- remove a useless null check before kfree
- fix a compilation error when NO_HZ is not set.

Change since v4:
- link both sched_domain and NOHZ_IDLE flag in one RCU object so
their states are always synchronized.

Change since V3;
- NOHZ flag is not cleared if a NULL domain is attached to the CPU
- Remove patch 2/2 which becomes useless with latest modifications

Change since V2:
- change the initialization to idle state instead of busy state so a CPU that
enters idle during the build of the sched_domain will not corrupt the
initialization state

Change since V1:
- remove the patch for SCHED softirq on an idle core use case as it was
a side effect of the other use cases.

Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 34 ++++++++++++++++++++++++----------
2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..cde4f7f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -899,6 +899,7 @@ struct sched_domain {
unsigned int wake_idx;
unsigned int forkexec_idx;
unsigned int smt_gain;
+ unsigned long nohz_flags; /* NOHZ_IDLE flag status */
int flags; /* See SD_* */
int level;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..09e440f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5394,14 +5394,21 @@ static inline void set_cpu_sd_state_busy(void)
{
struct sched_domain *sd;
int cpu = smp_processor_id();
-
- if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
- return;
- clear_bit(NOHZ_IDLE, nohz_flags(cpu));
+ int first_nohz_idle = 1;

rcu_read_lock();
- for_each_domain(cpu, sd)
+ for_each_domain(cpu, sd) {
+ if (first_nohz_idle) {
+ if (!test_bit(NOHZ_IDLE, &sd->nohz_flags))
+ goto unlock;
+
+ clear_bit(NOHZ_IDLE, &sd->nohz_flags);
+ first_nohz_idle = 0;
+ }
+
atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+ }
+unlock:
rcu_read_unlock();
}

@@ -5409,14 +5416,21 @@ void set_cpu_sd_state_idle(void)
{
struct sched_domain *sd;
int cpu = smp_processor_id();
-
- if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
- return;
- set_bit(NOHZ_IDLE, nohz_flags(cpu));
+ int first_nohz_idle = 1;

rcu_read_lock();
- for_each_domain(cpu, sd)
+ for_each_domain(cpu, sd) {
+ if (first_nohz_idle) {
+ if (test_bit(NOHZ_IDLE, &sd->nohz_flags))
+ goto unlock;
+
+ set_bit(NOHZ_IDLE, &sd->nohz_flags);
+ first_nohz_idle = 0;
+ }
+
atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+ }
+unlock:
rcu_read_unlock();
}

--
1.7.9.5


2013-04-22 20:10:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7] sched: fix init NOHZ_IDLE flag

On Mon, 2013-04-22 at 17:38 +0200, Vincent Guittot wrote:

> ---
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 34 ++++++++++++++++++++++++----------
> 2 files changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d35d2b6..cde4f7f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -899,6 +899,7 @@ struct sched_domain {
> unsigned int wake_idx;
> unsigned int forkexec_idx;
> unsigned int smt_gain;
> + unsigned long nohz_flags; /* NOHZ_IDLE flag status */
> int flags; /* See SD_* */
> int level;

There was a 4 byte hole here, I'm assuming you used unsigned long and
not utilized the hole because of the whole atomic bitop interface
taking unsigned long?

Bit wasteful but ok..

So we're only pulling NOHZ_IDLE out of nohz_flags()? Should we then not
also amend the rq_nohz_flag_bits enum? And it seems pointless to me to
set bit 2 our nohz_flags word if all other bits are unused.

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..09e440f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5394,14 +5394,21 @@ static inline void set_cpu_sd_state_busy(void)
> {
> struct sched_domain *sd;
> int cpu = smp_processor_id();
> -
> - if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
> - return;
> - clear_bit(NOHZ_IDLE, nohz_flags(cpu));
> + int first_nohz_idle = 1;
>
> rcu_read_lock();
> - for_each_domain(cpu, sd)
> + for_each_domain(cpu, sd) {
> + if (first_nohz_idle) {
> + if (!test_bit(NOHZ_IDLE, &sd->nohz_flags))
> + goto unlock;
> +
> + clear_bit(NOHZ_IDLE, &sd->nohz_flags);
> + first_nohz_idle = 0;
> + }
> +
> atomic_inc(&sd->groups->sgp->nr_busy_cpus);
> + }

the mind boggles... what's wrong with something like:

static inline unsigned long *rq_nohz_flags(int cpu)
{
return rcu_dereference(cpu_rq(cpu)->sd)->nohz_flags;
}

if (!test_bit(0, rq_nohz_flags(cpu)))
return;
clear_bit(0, rq_nohz_flags(cpu));

> +unlock:
> rcu_read_unlock();
> }
>
> @@ -5409,14 +5416,21 @@ void set_cpu_sd_state_idle(void)
> {
> struct sched_domain *sd;
> int cpu = smp_processor_id();
> -
> - if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
> - return;
> - set_bit(NOHZ_IDLE, nohz_flags(cpu));
> + int first_nohz_idle = 1;
>
> rcu_read_lock();
> - for_each_domain(cpu, sd)
> + for_each_domain(cpu, sd) {
> + if (first_nohz_idle) {
> + if (test_bit(NOHZ_IDLE, &sd->nohz_flags))
> + goto unlock;
> +
> + set_bit(NOHZ_IDLE, &sd->nohz_flags);
> + first_nohz_idle = 0;
> + }
> +
> atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> + }
> +unlock:

Same here, .. why on earth do it for every sched_domain for that cpu?

> rcu_read_unlock();
> }
>

And since its now only a single bit in a single word, we can easily
change it to something like:

if (*rq_nohz_idle(cpu))
return;
xchg(rq_nohz_idle(cpu), 1); /* set; use 0 for clear */

which drops the unsigned long nonsense..

Or am I completely missing something obvious here?

2013-04-23 07:53:02

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7] sched: fix init NOHZ_IDLE flag

On 22 April 2013 22:10, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2013-04-22 at 17:38 +0200, Vincent Guittot wrote:
>
>> ---
>> include/linux/sched.h | 1 +
>> kernel/sched/fair.c | 34 ++++++++++++++++++++++++----------
>> 2 files changed, 25 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index d35d2b6..cde4f7f 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -899,6 +899,7 @@ struct sched_domain {
>> unsigned int wake_idx;
>> unsigned int forkexec_idx;
>> unsigned int smt_gain;
>> + unsigned long nohz_flags; /* NOHZ_IDLE flag status */
>> int flags; /* See SD_* */
>> int level;
>
> There was a 4 byte hole here, I'm assuming you used unsigned long and
> not utilized the hole because of the whole atomic bitop interface
> taking unsigned long?
>
> Bit wasteful but ok..
>
> So we're only pulling NOHZ_IDLE out of nohz_flags()? Should we then not
> also amend the rq_nohz_flag_bits enum? And it seems pointless to me to
> set bit 2 our nohz_flags word if all other bits are unused.

Yes, i have been too quick to update the patchset. i'm going to remove
NOHZ_IDLE from the rq_nohz_flag_bits

>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a33e59..09e440f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5394,14 +5394,21 @@ static inline void set_cpu_sd_state_busy(void)
>> {
>> struct sched_domain *sd;
>> int cpu = smp_processor_id();
>> -
>> - if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
>> - return;
>> - clear_bit(NOHZ_IDLE, nohz_flags(cpu));
>> + int first_nohz_idle = 1;
>>
>> rcu_read_lock();
>> - for_each_domain(cpu, sd)
>> + for_each_domain(cpu, sd) {
>> + if (first_nohz_idle) {
>> + if (!test_bit(NOHZ_IDLE, &sd->nohz_flags))
>> + goto unlock;
>> +
>> + clear_bit(NOHZ_IDLE, &sd->nohz_flags);
>> + first_nohz_idle = 0;
>> + }
>> +
>> atomic_inc(&sd->groups->sgp->nr_busy_cpus);
>> + }
>
> the mind boggles... what's wrong with something like:
>
> static inline unsigned long *rq_nohz_flags(int cpu)
> {
> return rcu_dereference(cpu_rq(cpu)->sd)->nohz_flags;
> }
>
> if (!test_bit(0, rq_nohz_flags(cpu)))
> return;
> clear_bit(0, rq_nohz_flags(cpu));
>

AFAICT, if we use different rcu_dereferences for modifying nohz_flags
and for updating the nr_busy_cpus, we open a time window for
desynchronization, isn't it ?
That why i'm doing the update of rq_nohz_flags with the same
rcu_dereference than nr_busy_cpus

>> +unlock:
>> rcu_read_unlock();
>> }
>>
>> @@ -5409,14 +5416,21 @@ void set_cpu_sd_state_idle(void)
>> {
>> struct sched_domain *sd;
>> int cpu = smp_processor_id();
>> -
>> - if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
>> - return;
>> - set_bit(NOHZ_IDLE, nohz_flags(cpu));
>> + int first_nohz_idle = 1;
>>
>> rcu_read_lock();
>> - for_each_domain(cpu, sd)
>> + for_each_domain(cpu, sd) {
>> + if (first_nohz_idle) {
>> + if (test_bit(NOHZ_IDLE, &sd->nohz_flags))
>> + goto unlock;
>> +
>> + set_bit(NOHZ_IDLE, &sd->nohz_flags);
>> + first_nohz_idle = 0;
>> + }
>> +
>> atomic_dec(&sd->groups->sgp->nr_busy_cpus);
>> + }
>> +unlock:
>
> Same here, .. why on earth do it for every sched_domain for that cpu?

The update of rq_nohz_flags is done only once on the top level
sched_domain and the nr_busy_cpus is updated on each sched_domain
level in order to have a quick access to the number of busy cpu when
we check for the need to kick an idle load balance

>
>> rcu_read_unlock();
>> }
>>
>
> And since its now only a single bit in a single word, we can easily
> change it to something like:
>
> if (*rq_nohz_idle(cpu))
> return;
> xchg(rq_nohz_idle(cpu), 1); /* set; use 0 for clear */
>
> which drops the unsigned long nonsense..

that's fair. i'm going to use xchg instead of atomic

>
> Or am I completely missing something obvious here?
>

2013-04-23 08:50:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7] sched: fix init NOHZ_IDLE flag

On Tue, 2013-04-23 at 09:52 +0200, Vincent Guittot wrote:

> > static inline unsigned long *rq_nohz_flags(int cpu)
> > {
> > return rcu_dereference(cpu_rq(cpu)->sd)->nohz_flags;
> > }
> >
> > if (!test_bit(0, rq_nohz_flags(cpu)))
> > return;
> > clear_bit(0, rq_nohz_flags(cpu));
> >
>
> AFAICT, if we use different rcu_dereferences for modifying nohz_flags
> and for updating the nr_busy_cpus, we open a time window for
> desynchronization, isn't it ?

Oh right, we need to call rq_nohz_flags() once and use the pointer
twice.

> >> +unlock:
> >> rcu_read_unlock();
> >> }
> >>
> >> @@ -5409,14 +5416,21 @@ void set_cpu_sd_state_idle(void)
> >> {
> >> struct sched_domain *sd;
> >> int cpu = smp_processor_id();
> >> -
> >> - if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
> >> - return;
> >> - set_bit(NOHZ_IDLE, nohz_flags(cpu));
> >> + int first_nohz_idle = 1;
> >>
> >> rcu_read_lock();
> >> - for_each_domain(cpu, sd)
> >> + for_each_domain(cpu, sd) {
> >> + if (first_nohz_idle) {
> >> + if (test_bit(NOHZ_IDLE, &sd->nohz_flags))
> >> + goto unlock;
> >> +
> >> + set_bit(NOHZ_IDLE, &sd->nohz_flags);
> >> + first_nohz_idle = 0;
> >> + }
> >> +
> >> atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> >> + }
> >> +unlock:
> >
> > Same here, .. why on earth do it for every sched_domain for that
> cpu?
>
> The update of rq_nohz_flags is done only once on the top level
> sched_domain and the nr_busy_cpus is updated on each sched_domain
> level in order to have a quick access to the number of busy cpu when
> we check for the need to kick an idle load balance

Ah, I didn't see the whole first_nohz_idle thing.. but why did you
place it inside the loop in the first place? Wouldn't GCC be able to
avoid the double cpu_rq(cpu)->sd dereference using CSE? Argh no,.. its
got an ACCESS_ONCE in it that defeats GCC.

Bothersome..

I'd almost write it like:

struct sched_domain *sd;

rcu_read_lock();
sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);

if (sd->nohz_idle)
goto unlock;
xchg(&sd->nohz_idle, 1); /* do we actually need atomics here? */

for (; sd; sd = sd->parent)
atomic_dec(&sd->groups->sgp->nr_busy_cpus);
unlock:
rcu_read_unlock();



2013-04-23 09:01:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7] sched: fix init NOHZ_IDLE flag

On 23 April 2013 10:50, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2013-04-23 at 09:52 +0200, Vincent Guittot wrote:
>
>> > static inline unsigned long *rq_nohz_flags(int cpu)
>> > {
>> > return rcu_dereference(cpu_rq(cpu)->sd)->nohz_flags;
>> > }
>> >
>> > if (!test_bit(0, rq_nohz_flags(cpu)))
>> > return;
>> > clear_bit(0, rq_nohz_flags(cpu));
>> >
>>
>> AFAICT, if we use different rcu_dereferences for modifying nohz_flags
>> and for updating the nr_busy_cpus, we open a time window for
>> desynchronization, isn't it ?
>
> Oh right, we need to call rq_nohz_flags() once and use the pointer
> twice.
>
>> >> +unlock:
>> >> rcu_read_unlock();
>> >> }
>> >>
>> >> @@ -5409,14 +5416,21 @@ void set_cpu_sd_state_idle(void)
>> >> {
>> >> struct sched_domain *sd;
>> >> int cpu = smp_processor_id();
>> >> -
>> >> - if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
>> >> - return;
>> >> - set_bit(NOHZ_IDLE, nohz_flags(cpu));
>> >> + int first_nohz_idle = 1;
>> >>
>> >> rcu_read_lock();
>> >> - for_each_domain(cpu, sd)
>> >> + for_each_domain(cpu, sd) {
>> >> + if (first_nohz_idle) {
>> >> + if (test_bit(NOHZ_IDLE, &sd->nohz_flags))
>> >> + goto unlock;
>> >> +
>> >> + set_bit(NOHZ_IDLE, &sd->nohz_flags);
>> >> + first_nohz_idle = 0;
>> >> + }
>> >> +
>> >> atomic_dec(&sd->groups->sgp->nr_busy_cpus);
>> >> + }
>> >> +unlock:
>> >
>> > Same here, .. why on earth do it for every sched_domain for that
>> cpu?
>>
>> The update of rq_nohz_flags is done only once on the top level
>> sched_domain and the nr_busy_cpus is updated on each sched_domain
>> level in order to have a quick access to the number of busy cpu when
>> we check for the need to kick an idle load balance
>
> Ah, I didn't see the whole first_nohz_idle thing.. but why did you
> place it inside the loop in the first place? Wouldn't GCC be able to

My goal was to keep for_each_domain that is a normally used for
accessing sched_domain tree

> avoid the double cpu_rq(cpu)->sd dereference using CSE? Argh no,.. its
> got an ACCESS_ONCE in it that defeats GCC.
>
> Bothersome..
>
> I'd almost write it like:
>
> struct sched_domain *sd;
>
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
>
> if (sd->nohz_idle)
> goto unlock;
> xchg(&sd->nohz_idle, 1); /* do we actually need atomics here? */
>
> for (; sd; sd = sd->parent)
> atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
>

ok, i'm going to modify the patch and follow the sequence above

Vincent

>
>
>