2013-05-30 15:23:31

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometime not
triggered for a while on my platform.

CPU 0 and CPU 1 are running tasks and CPU 2 is idle

CPU 1 kicks the Idle Load Balance
CPU 1 selects CPU 2 as the new Idle Load Balancer
CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
CPU 1 sends a reschedule IPI to CPU 2
While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking task A on CPU 2
CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
Task A quickly goes back to sleep (before a tick occurs on CPU 2)
CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 will be selected for the ILB, reschedule IPI will be not
sent to CPU2, which is idle, because NOHZ_BALANCE_KICK is already set
and no Idle Load Balance will be performed.

We must wait for the sched softirq to be raised on CPU 2 thanks to
another part of the kernel to clear NOHZ_BALANCE_KICKand come back to
a normal situation.

The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if
we can't raise the sched_softirq for the Idle Load Balance.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58453b8..51fc715 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1420,7 +1420,8 @@ void scheduler_ipi(void)
if (unlikely(got_nohz_idle_kick() && !need_resched())) {
this_rq()->idle_balance = 1;
raise_softirq_irqoff(SCHED_SOFTIRQ);
- }
+ } else
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(smp_processor_id()));
irq_exit();
}

--
1.7.9.5


2013-06-03 22:48:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On Thu, May 30, 2013 at 05:23:05PM +0200, Vincent Guittot wrote:
> I have faced a sequence where the Idle Load Balance was sometime not
> triggered for a while on my platform.
>
> CPU 0 and CPU 1 are running tasks and CPU 2 is idle
>
> CPU 1 kicks the Idle Load Balance
> CPU 1 selects CPU 2 as the new Idle Load Balancer
> CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
> CPU 1 sends a reschedule IPI to CPU 2
> While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking task A on CPU 2
> CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
> Task A quickly goes back to sleep (before a tick occurs on CPU 2)
> CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
>
> Whenever CPU 2 will be selected for the ILB, reschedule IPI will be not
> sent to CPU2, which is idle, because NOHZ_BALANCE_KICK is already set
> and no Idle Load Balance will be performed.
>
> We must wait for the sched softirq to be raised on CPU 2 thanks to
> another part of the kernel to clear NOHZ_BALANCE_KICKand come back to
> a normal situation.
>
> The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if
> we can't raise the sched_softirq for the Idle Load Balance.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/core.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 58453b8..51fc715 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1420,7 +1420,8 @@ void scheduler_ipi(void)
> if (unlikely(got_nohz_idle_kick() && !need_resched())) {
> this_rq()->idle_balance = 1;
> raise_softirq_irqoff(SCHED_SOFTIRQ);
> - }
> + } else
> + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(smp_processor_id()));

But then do we reach this if the IPI happens while running the non-idle task in
CPU 2? The first got_nohz_idle_kick() test would drop us out early from scheduler_ipi()
due to the idle_cpu() test. So the flag doesn't get cleared in this case.

2013-06-04 08:21:11

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On 4 June 2013 00:48, Frederic Weisbecker <[email protected]> wrote:
> On Thu, May 30, 2013 at 05:23:05PM +0200, Vincent Guittot wrote:
>> I have faced a sequence where the Idle Load Balance was sometime not
>> triggered for a while on my platform.
>>
>> CPU 0 and CPU 1 are running tasks and CPU 2 is idle
>>
>> CPU 1 kicks the Idle Load Balance
>> CPU 1 selects CPU 2 as the new Idle Load Balancer
>> CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
>> CPU 1 sends a reschedule IPI to CPU 2
>> While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking task A on CPU 2
>> CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
>> Task A quickly goes back to sleep (before a tick occurs on CPU 2)
>> CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
>>
>> Whenever CPU 2 will be selected for the ILB, reschedule IPI will be not
>> sent to CPU2, which is idle, because NOHZ_BALANCE_KICK is already set
>> and no Idle Load Balance will be performed.
>>
>> We must wait for the sched softirq to be raised on CPU 2 thanks to
>> another part of the kernel to clear NOHZ_BALANCE_KICKand come back to
>> a normal situation.
>>
>> The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if
>> we can't raise the sched_softirq for the Idle Load Balance.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/core.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 58453b8..51fc715 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1420,7 +1420,8 @@ void scheduler_ipi(void)
>> if (unlikely(got_nohz_idle_kick() && !need_resched())) {
>> this_rq()->idle_balance = 1;
>> raise_softirq_irqoff(SCHED_SOFTIRQ);
>> - }
>> + } else
>> + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(smp_processor_id()));
>
> But then do we reach this if the IPI happens while running the non-idle task in
> CPU 2? The first got_nohz_idle_kick() test would drop us out early from scheduler_ipi()
> due to the idle_cpu() test. So the flag doesn't get cleared in this case.

The 1st point is that only idle cpu can be selected for idle load
balance. But this doesn't prevent the cpu to wake up while it is
kicked for idle load balance.
I had added the clear_bit for the 1st got_nohz_idle_kick in the draft
version of this patch but the test of the emptiness of the wake_list,
the call to smp_send_reschedule in the various way to wake up the idle
cpu and the results of the tests have convinced me (may be wrongly)
that it was not necessary.

Vincent

2013-06-04 09:36:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK


The best I can seem to come up with is something like the below; but I think
its ghastly. Surely we can do something saner with that bit.

Having to clear it at 3 different places is just wrong.

---
kernel/sched/core.c | 35 +++++++++++++++++++++++++----------
kernel/sched/fair.c | 6 ++++--
2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 142c682..4846223 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -630,15 +630,23 @@ void wake_up_nohz_cpu(int cpu)
wake_up_idle_cpu(cpu);
}

-static inline bool got_nohz_idle_kick(void)
+static inline bool got_nohz_idle_kick(int cpu)
{
- int cpu = smp_processor_id();
- return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+ bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+
+ if (!nohz_kick)
+ return false;
+
+ if (idle_cpu(cpu))
+ return true;
+
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+ return false;
}

#else /* CONFIG_NO_HZ_COMMON */

-static inline bool got_nohz_idle_kick(void)
+static inline bool got_nohz_idle_kick(int cpu)
{
return false;
}
@@ -1395,8 +1403,11 @@ static void sched_ttwu_pending(void)

void scheduler_ipi(void)
{
- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
- && !tick_nohz_full_cpu(smp_processor_id()))
+ int this_cpu = smp_processor_id();
+ bool nohz_kick = got_nohz_idle_kick(this_cpu);
+
+ if (llist_empty(&this_rq()->wake_list) && !nohz_kick &&
+ !tick_nohz_full_cpu(this_cpu))
return;

/*
@@ -1413,15 +1424,19 @@ void scheduler_ipi(void)
* somewhat pessimize the simple resched case.
*/
irq_enter();
- tick_nohz_full_check();
+ tick_nohz_full_check(); /* restart tick if required */
sched_ttwu_pending();

/*
* Check if someone kicked us for doing the nohz idle load balance.
*/
- if (unlikely(got_nohz_idle_kick() && !need_resched())) {
- this_rq()->idle_balance = 1;
- raise_softirq_irqoff(SCHED_SOFTIRQ);
+ if (unlikely(nohz_kick)) {
+ if (!need_resched()) {
+ this_rq()->idle_balance = 1;
+ raise_softirq_irqoff(SCHED_SOFTIRQ);
+ } else {
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+ }
}
irq_exit();
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 143dcdb..4e6cff4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5600,8 +5600,10 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
struct rq *rq;
int balance_cpu;

- if (idle != CPU_IDLE ||
- !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
+ if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
+ return;
+
+ if (idle != CPU_IDLE)
goto end;

for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {

2013-06-04 09:56:42

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On Tue, Jun 04, 2013 at 10:21:06AM +0200, Vincent Guittot wrote:
> On 4 June 2013 00:48, Frederic Weisbecker <[email protected]> wrote:
> > On Thu, May 30, 2013 at 05:23:05PM +0200, Vincent Guittot wrote:
> >> I have faced a sequence where the Idle Load Balance was sometime not
> >> triggered for a while on my platform.
> >>
> >> CPU 0 and CPU 1 are running tasks and CPU 2 is idle
> >>
> >> CPU 1 kicks the Idle Load Balance
> >> CPU 1 selects CPU 2 as the new Idle Load Balancer
> >> CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
> >> CPU 1 sends a reschedule IPI to CPU 2
> >> While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking task A on CPU 2
> >> CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
> >> Task A quickly goes back to sleep (before a tick occurs on CPU 2)
> >> CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
> >>
> >> Whenever CPU 2 will be selected for the ILB, reschedule IPI will be not
> >> sent to CPU2, which is idle, because NOHZ_BALANCE_KICK is already set
> >> and no Idle Load Balance will be performed.
> >>
> >> We must wait for the sched softirq to be raised on CPU 2 thanks to
> >> another part of the kernel to clear NOHZ_BALANCE_KICKand come back to
> >> a normal situation.
> >>
> >> The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if
> >> we can't raise the sched_softirq for the Idle Load Balance.
> >>
> >> Signed-off-by: Vincent Guittot <[email protected]>
> >> ---
> >> kernel/sched/core.c | 3 ++-
> >> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 58453b8..51fc715 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1420,7 +1420,8 @@ void scheduler_ipi(void)
> >> if (unlikely(got_nohz_idle_kick() && !need_resched())) {
> >> this_rq()->idle_balance = 1;
> >> raise_softirq_irqoff(SCHED_SOFTIRQ);
> >> - }
> >> + } else
> >> + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(smp_processor_id()));
> >
> > But then do we reach this if the IPI happens while running the non-idle task in
> > CPU 2? The first got_nohz_idle_kick() test would drop us out early from scheduler_ipi()
> > due to the idle_cpu() test. So the flag doesn't get cleared in this case.
>
> The 1st point is that only idle cpu can be selected for idle load
> balance. But this doesn't prevent the cpu to wake up while it is
> kicked for idle load balance.

Yep.

> I had added the clear_bit for the 1st got_nohz_idle_kick in the draft
> version of this patch but the test of the emptiness of the wake_list,
> the call to smp_send_reschedule in the various way to wake up the idle
> cpu and the results of the tests have convinced me (may be wrongly)
> that it was not necessary.

Hmm, if the CPU is idle, get selected as an ilb, but then the CPU schedules
a non-idle task and receive the IPI in this non-idle context then finally
it goes back to idle for a long time. It can stay idle without ever been
notified with this NOHZ_BALANCE_KICK flag set.

But I can be missing something that clears the flag somewhere in that scenario.
In any case it's not obvious.

2013-06-04 10:26:28

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
>
> The best I can seem to come up with is something like the below; but I think
> its ghastly. Surely we can do something saner with that bit.
>
> Having to clear it at 3 different places is just wrong.

We could clear the flag early in scheduler_ipi() and set some
specific value in rq->idle_balance that tells we want nohz idle
balancing from the softirq, something like this untested:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58453b8..330136b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -630,15 +630,14 @@ void wake_up_nohz_cpu(int cpu)
wake_up_idle_cpu(cpu);
}

-static inline bool got_nohz_idle_kick(void)
+static inline bool got_nohz_idle_kick(int cpu)
{
- int cpu = smp_processor_id();
- return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+ return test_and_clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
}

#else /* CONFIG_NO_HZ_COMMON */

-static inline bool got_nohz_idle_kick(void)
+static inline bool got_nohz_idle_kick(int cpu)
{
return false;
}
@@ -1393,8 +1392,12 @@ static void sched_ttwu_pending(void)

void scheduler_ipi(void)
{
- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
- && !tick_nohz_full_cpu(smp_processor_id()))
+ int cpu = smp_processor_id();
+ bool idle_kick = got_nohz_idle_kick(cpu);
+
+ if (!(idle_kick && idle_cpu(cpu))
+ && llist_empty(&this_rq()->wake_list)
+ && !tick_nohz_full_cpu(cpu)
return;

/*
@@ -1417,8 +1420,8 @@ void scheduler_ipi(void)
/*
* Check if someone kicked us for doing the nohz idle load balance.
*/
- if (unlikely(got_nohz_idle_kick() && !need_resched())) {
- this_rq()->idle_balance = 1;
+ if (unlikely(idle_kick && idle_cpu(cpu) && !need_resched())) {
+ this_rq()->idle_balance = IDLE_NOHZ_BALANCE;
raise_softirq_irqoff(SCHED_SOFTIRQ);
}
irq_exit();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..816e7b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5577,15 +5577,14 @@ out:
* In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
* rebalancing for all the cpus for whom scheduler ticks are stopped.
*/
-static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
+static void nohz_idle_balance(int this_cpu)
{
struct rq *this_rq = cpu_rq(this_cpu);
struct rq *rq;
int balance_cpu;

- if (idle != CPU_IDLE ||
- !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
- goto end;
+ if (this_rq->idle_balance != IDLE_NOHZ_BALANCE)
+ return;

for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
@@ -5612,8 +5611,12 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
this_rq->next_balance = rq->next_balance;
}
nohz.next_balance = this_rq->next_balance;
-end:
- clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+
+ /* There could be concurrent updates from irqs but we don't care */
+ if (idle_cpu(this_cpu))
+ this_rq->idle_balance = IDLE_BALANCE;
+ else
+ this_rq->idle_balance = 0;
}

/*
@@ -5679,7 +5682,7 @@ need_kick:
return 1;
}
#else
-static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
+static void nohz_idle_balance(int this_cpu) { }
#endif

/*
@@ -5700,7 +5703,7 @@ static void run_rebalance_domains(struct softirq_action *h)
* balancing on behalf of the other idle cpus whose ticks are
* stopped.
*/
- nohz_idle_balance(this_cpu, idle);
+ nohz_idle_balance(this_cpu);
}

static inline int on_null_domain(int cpu)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce39224..e9de976 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -387,6 +387,11 @@ extern struct root_domain def_root_domain;

#endif /* CONFIG_SMP */

+enum idle_balance_type {
+ IDLE_BALANCE = 1,
+ IDLE_NOHZ_BALANCE = 2,
+};
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -458,7 +463,7 @@ struct rq {

unsigned long cpu_power;

- unsigned char idle_balance;
+ enum idle_balance_type idle_balance;
/* For active balancing */
int post_schedule;
int active_balance;

2013-06-04 11:11:51

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On 4 June 2013 12:26, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
>>
>> The best I can seem to come up with is something like the below; but I think
>> its ghastly. Surely we can do something saner with that bit.
>>
>> Having to clear it at 3 different places is just wrong.
>
> We could clear the flag early in scheduler_ipi() and set some
> specific value in rq->idle_balance that tells we want nohz idle
> balancing from the softirq, something like this untested:

I'm not sure that we can have less than 2 places to clear it: cancel
place or acknowledge place otherwise we can face a situation where
idle load balance will be triggered 2 consecutive times because
NOHZ_BALANCE_KICK will be cleared before the idle load balance has
been done and had a chance to migrate tasks.

>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 58453b8..330136b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -630,15 +630,14 @@ void wake_up_nohz_cpu(int cpu)
> wake_up_idle_cpu(cpu);
> }
>
> -static inline bool got_nohz_idle_kick(void)
> +static inline bool got_nohz_idle_kick(int cpu)
> {
> - int cpu = smp_processor_id();
> - return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> + return test_and_clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> }
>
> #else /* CONFIG_NO_HZ_COMMON */
>
> -static inline bool got_nohz_idle_kick(void)
> +static inline bool got_nohz_idle_kick(int cpu)
> {
> return false;
> }
> @@ -1393,8 +1392,12 @@ static void sched_ttwu_pending(void)
>
> void scheduler_ipi(void)
> {
> - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
> - && !tick_nohz_full_cpu(smp_processor_id()))
> + int cpu = smp_processor_id();
> + bool idle_kick = got_nohz_idle_kick(cpu);
> +
> + if (!(idle_kick && idle_cpu(cpu))
> + && llist_empty(&this_rq()->wake_list)
> + && !tick_nohz_full_cpu(cpu)
> return;
>
> /*
> @@ -1417,8 +1420,8 @@ void scheduler_ipi(void)
> /*
> * Check if someone kicked us for doing the nohz idle load balance.
> */
> - if (unlikely(got_nohz_idle_kick() && !need_resched())) {
> - this_rq()->idle_balance = 1;
> + if (unlikely(idle_kick && idle_cpu(cpu) && !need_resched())) {
> + this_rq()->idle_balance = IDLE_NOHZ_BALANCE;
> raise_softirq_irqoff(SCHED_SOFTIRQ);
> }
> irq_exit();
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c61a614..816e7b0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5577,15 +5577,14 @@ out:
> * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
> * rebalancing for all the cpus for whom scheduler ticks are stopped.
> */
> -static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
> +static void nohz_idle_balance(int this_cpu)
> {
> struct rq *this_rq = cpu_rq(this_cpu);
> struct rq *rq;
> int balance_cpu;
>
> - if (idle != CPU_IDLE ||
> - !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
> - goto end;
> + if (this_rq->idle_balance != IDLE_NOHZ_BALANCE)
> + return;
>
> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> @@ -5612,8 +5611,12 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
> this_rq->next_balance = rq->next_balance;
> }
> nohz.next_balance = this_rq->next_balance;
> -end:
> - clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> +
> + /* There could be concurrent updates from irqs but we don't care */
> + if (idle_cpu(this_cpu))
> + this_rq->idle_balance = IDLE_BALANCE;
> + else
> + this_rq->idle_balance = 0;
> }
>
> /*
> @@ -5679,7 +5682,7 @@ need_kick:
> return 1;
> }
> #else
> -static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
> +static void nohz_idle_balance(int this_cpu) { }
> #endif
>
> /*
> @@ -5700,7 +5703,7 @@ static void run_rebalance_domains(struct softirq_action *h)
> * balancing on behalf of the other idle cpus whose ticks are
> * stopped.
> */
> - nohz_idle_balance(this_cpu, idle);
> + nohz_idle_balance(this_cpu);
> }
>
> static inline int on_null_domain(int cpu)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ce39224..e9de976 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -387,6 +387,11 @@ extern struct root_domain def_root_domain;
>
> #endif /* CONFIG_SMP */
>
> +enum idle_balance_type {
> + IDLE_BALANCE = 1,
> + IDLE_NOHZ_BALANCE = 2,
> +};
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -458,7 +463,7 @@ struct rq {
>
> unsigned long cpu_power;
>
> - unsigned char idle_balance;
> + enum idle_balance_type idle_balance;
> /* For active balancing */
> int post_schedule;
> int active_balance;

2013-06-04 11:15:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On Tue, Jun 04, 2013 at 12:26:22PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
> >
> > The best I can seem to come up with is something like the below; but I think
> > its ghastly. Surely we can do something saner with that bit.
> >
> > Having to clear it at 3 different places is just wrong.
>
> We could clear the flag early in scheduler_ipi() and set some
> specific value in rq->idle_balance that tells we want nohz idle
> balancing from the softirq, something like this untested:

Yeah, I suppose something like that is a little better.. a few nits
though:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 58453b8..330136b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -630,15 +630,14 @@ void wake_up_nohz_cpu(int cpu)
> wake_up_idle_cpu(cpu);
> }
>
> -static inline bool got_nohz_idle_kick(void)
> +static inline bool got_nohz_idle_kick(int cpu)
> {
> - int cpu = smp_processor_id();
> - return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> + return test_and_clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> }
>
> #else /* CONFIG_NO_HZ_COMMON */
>
> -static inline bool got_nohz_idle_kick(void)
> +static inline bool got_nohz_idle_kick(int cpu)
> {
> return false;
> }
> @@ -1393,8 +1392,12 @@ static void sched_ttwu_pending(void)
>
> void scheduler_ipi(void)
> {
> - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
> - && !tick_nohz_full_cpu(smp_processor_id()))
> + int cpu = smp_processor_id();
> + bool idle_kick = got_nohz_idle_kick(cpu);

This puts an unconditional atomic instruction in the IPI path.
if (test) clear();
is lots cheaper, esp. since most IPIs won't have this flag set.

> +
> + if (!(idle_kick && idle_cpu(cpu))
> + && llist_empty(&this_rq()->wake_list)
> + && !tick_nohz_full_cpu(cpu)

What's with this weird operator first split style?

> return;
>
> /*

> +enum idle_balance_type {
> + IDLE_BALANCE = 1,
> + IDLE_NOHZ_BALANCE = 2,
> +};

You might want to update the rq->idle_balance assignment in
scheduler_tick() to make sure it uses the right value (it does now, but
there's nothing stopping people from changing the values).

2013-06-04 11:19:11

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
> On 4 June 2013 12:26, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
> >>
> >> The best I can seem to come up with is something like the below; but I think
> >> its ghastly. Surely we can do something saner with that bit.
> >>
> >> Having to clear it at 3 different places is just wrong.
> >
> > We could clear the flag early in scheduler_ipi() and set some
> > specific value in rq->idle_balance that tells we want nohz idle
> > balancing from the softirq, something like this untested:
>
> I'm not sure that we can have less than 2 places to clear it: cancel
> place or acknowledge place otherwise we can face a situation where
> idle load balance will be triggered 2 consecutive times because
> NOHZ_BALANCE_KICK will be cleared before the idle load balance has
> been done and had a chance to migrate tasks.

I guess it depends what is the minimum value of rq->next_balance, it seems
to be large enough to avoid this kind of incident. Although I don't
know well the whole logic with rq->next_balance and ilb trigger so I must
defer to you.

2013-06-04 11:48:52

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On 4 June 2013 13:19, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
>> On 4 June 2013 12:26, Frederic Weisbecker <[email protected]> wrote:
>> > On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
>> >>
>> >> The best I can seem to come up with is something like the below; but I think
>> >> its ghastly. Surely we can do something saner with that bit.
>> >>
>> >> Having to clear it at 3 different places is just wrong.
>> >
>> > We could clear the flag early in scheduler_ipi() and set some
>> > specific value in rq->idle_balance that tells we want nohz idle
>> > balancing from the softirq, something like this untested:
>>
>> I'm not sure that we can have less than 2 places to clear it: cancel
>> place or acknowledge place otherwise we can face a situation where
>> idle load balance will be triggered 2 consecutive times because
>> NOHZ_BALANCE_KICK will be cleared before the idle load balance has
>> been done and had a chance to migrate tasks.
>
> I guess it depends what is the minimum value of rq->next_balance, it seems
> to be large enough to avoid this kind of incident. Although I don't
> know well the whole logic with rq->next_balance and ilb trigger so I must
> defer to you.

In the trace that was showing the issue, i can see that both CPU0 and
CPU1 were trying to trig ILB almost simultaneously and the
test_and_set NOHZ_BALANCE_KICK filters one request so i would say that
clearing the bit before the end of the idle load balance sequence can
generate such sequence

In the sequence below, i have minimized the clear of NOHZ_BALANCE_KICK
in 2 places : acknowledge and cancel. I have reused part of the
proposal from peter which clears the bit if the condition doesn't
match but i have reordered the tests to done that only if all other
condition are matching

static inline bool got_nohz_idle_kick(void)
{
- int cpu = smp_processor_id();
- return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+ bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+
+ if (!nohz_kick)
+ return false;
+
+ if (idle_cpu(cpu) && !need_resched())
+ return true;
+
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+ return false;
}

#else /* CONFIG_NO_HZ_COMMON */
@@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)

void scheduler_ipi(void)
{
- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
- && !tick_nohz_full_cpu(smp_processor_id()))
+ if (llist_empty(&this_rq()->wake_list)
+ && !tick_nohz_full_cpu(smp_processor_id())
+ && !got_nohz_idle_kick())
return;

/*
@@ -1417,7 +1426,7 @@ void scheduler_ipi(void)
/*
* Check if someone kicked us for doing the nohz idle load balance.
*/
- if (unlikely(got_nohz_idle_kick() && !need_resched())) {
+ if (unlikely(got_nohz_idle_kick())) {
this_rq()->idle_balance = 1;
raise_softirq_irqoff(SCHED_SOFTIRQ);
}

2013-06-04 11:53:42

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On Tue, Jun 04, 2013 at 01:15:10PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 04, 2013 at 12:26:22PM +0200, Frederic Weisbecker wrote:
> > @@ -1393,8 +1392,12 @@ static void sched_ttwu_pending(void)
> >
> > void scheduler_ipi(void)
> > {
> > - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
> > - && !tick_nohz_full_cpu(smp_processor_id()))
> > + int cpu = smp_processor_id();
> > + bool idle_kick = got_nohz_idle_kick(cpu);
>
> This puts an unconditional atomic instruction in the IPI path.
> if (test) clear();
> is lots cheaper, esp. since most IPIs won't have this flag set.

Agreed but I'm a bit worried about ordering:

CPU 0 CPU 1

test_and_set_bit(nohz_kick, CPU 1) scheduler_ipi
smp_send_reschedule(CPU 1) if (test_and_clear_bit(nohz_kick))
do_something

I'm not sure what base guarantee we have with ordering against raw IPIs such as the
the scheduler ipi. But unless both IPI trigger and IPI receive imply a full barrier
(or just IPI receive implies read barrier, it seems that's all we need), we need
test_and_set_bit() or smp_rmb()/smp_mb__before_clear_bit() && smp_mb__after_clear_bit().

>
> > +
> > + if (!(idle_kick && idle_cpu(cpu))
> > + && llist_empty(&this_rq()->wake_list)
> > + && !tick_nohz_full_cpu(cpu)
>
> What's with this weird operator first split style?

Yeah ugly, I'll fix.

>
> > return;
> >
> > /*
>
> > +enum idle_balance_type {
> > + IDLE_BALANCE = 1,
> > + IDLE_NOHZ_BALANCE = 2,
> > +};
>
> You might want to update the rq->idle_balance assignment in
> scheduler_tick() to make sure it uses the right value (it does now, but
> there's nothing stopping people from changing the values).

Agreed!

Thanks.

2013-06-04 14:45:00

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On Tue, Jun 04, 2013 at 01:48:47PM +0200, Vincent Guittot wrote:
> On 4 June 2013 13:19, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
> >> On 4 June 2013 12:26, Frederic Weisbecker <[email protected]> wrote:
> >> > On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
> >> >>
> >> >> The best I can seem to come up with is something like the below; but I think
> >> >> its ghastly. Surely we can do something saner with that bit.
> >> >>
> >> >> Having to clear it at 3 different places is just wrong.
> >> >
> >> > We could clear the flag early in scheduler_ipi() and set some
> >> > specific value in rq->idle_balance that tells we want nohz idle
> >> > balancing from the softirq, something like this untested:
> >>
> >> I'm not sure that we can have less than 2 places to clear it: cancel
> >> place or acknowledge place otherwise we can face a situation where
> >> idle load balance will be triggered 2 consecutive times because
> >> NOHZ_BALANCE_KICK will be cleared before the idle load balance has
> >> been done and had a chance to migrate tasks.
> >
> > I guess it depends what is the minimum value of rq->next_balance, it seems
> > to be large enough to avoid this kind of incident. Although I don't
> > know well the whole logic with rq->next_balance and ilb trigger so I must
> > defer to you.
>
> In the trace that was showing the issue, i can see that both CPU0 and
> CPU1 were trying to trig ILB almost simultaneously and the
> test_and_set NOHZ_BALANCE_KICK filters one request so i would say that
> clearing the bit before the end of the idle load balance sequence can
> generate such sequence

I see.

>
> In the sequence below, i have minimized the clear of NOHZ_BALANCE_KICK
> in 2 places : acknowledge and cancel. I have reused part of the
> proposal from peter which clears the bit if the condition doesn't
> match but i have reordered the tests to done that only if all other
> condition are matching
>
> static inline bool got_nohz_idle_kick(void)
> {
> - int cpu = smp_processor_id();
> - return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> + bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> +
> + if (!nohz_kick)
> + return false;
> +
> + if (idle_cpu(cpu) && !need_resched())
> + return true;
> +
> + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> + return false;
> }
>
> #else /* CONFIG_NO_HZ_COMMON */
> @@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)
>
> void scheduler_ipi(void)
> {
> - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
> - && !tick_nohz_full_cpu(smp_processor_id()))
> + if (llist_empty(&this_rq()->wake_list)
> + && !tick_nohz_full_cpu(smp_processor_id())
> + && !got_nohz_idle_kick())
> return;

But we still need got_nohz_idle_kick() to be the first check, don't we? Otherwise
if we run an "idle -> quick task slice -> idle" sequence we may keep the flag
but lose the notifying IPI in between.

2013-06-04 15:29:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On 4 June 2013 16:44, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jun 04, 2013 at 01:48:47PM +0200, Vincent Guittot wrote:
>> On 4 June 2013 13:19, Frederic Weisbecker <[email protected]> wrote:
>> > On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
>> >> On 4 June 2013 12:26, Frederic Weisbecker <[email protected]> wrote:
>> >> > On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
>> >> >>
>> >> >> The best I can seem to come up with is something like the below; but I think
>> >> >> its ghastly. Surely we can do something saner with that bit.
>> >> >>
>> >> >> Having to clear it at 3 different places is just wrong.
>> >> >
>> >> > We could clear the flag early in scheduler_ipi() and set some
>> >> > specific value in rq->idle_balance that tells we want nohz idle
>> >> > balancing from the softirq, something like this untested:
>> >>
>> >> I'm not sure that we can have less than 2 places to clear it: cancel
>> >> place or acknowledge place otherwise we can face a situation where
>> >> idle load balance will be triggered 2 consecutive times because
>> >> NOHZ_BALANCE_KICK will be cleared before the idle load balance has
>> >> been done and had a chance to migrate tasks.
>> >
>> > I guess it depends what is the minimum value of rq->next_balance, it seems
>> > to be large enough to avoid this kind of incident. Although I don't
>> > know well the whole logic with rq->next_balance and ilb trigger so I must
>> > defer to you.
>>
>> In the trace that was showing the issue, i can see that both CPU0 and
>> CPU1 were trying to trig ILB almost simultaneously and the
>> test_and_set NOHZ_BALANCE_KICK filters one request so i would say that
>> clearing the bit before the end of the idle load balance sequence can
>> generate such sequence
>
> I see.
>
>>
>> In the sequence below, i have minimized the clear of NOHZ_BALANCE_KICK
>> in 2 places : acknowledge and cancel. I have reused part of the
>> proposal from peter which clears the bit if the condition doesn't
>> match but i have reordered the tests to done that only if all other
>> condition are matching
>>
>> static inline bool got_nohz_idle_kick(void)
>> {
>> - int cpu = smp_processor_id();
>> - return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
>> + bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
>> +
>> + if (!nohz_kick)
>> + return false;
>> +
>> + if (idle_cpu(cpu) && !need_resched())
>> + return true;
>> +
>> + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
>> + return false;
>> }
>>
>> #else /* CONFIG_NO_HZ_COMMON */
>> @@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)
>>
>> void scheduler_ipi(void)
>> {
>> - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
>> - && !tick_nohz_full_cpu(smp_processor_id()))
>> + if (llist_empty(&this_rq()->wake_list)
>> + && !tick_nohz_full_cpu(smp_processor_id())
>> + && !got_nohz_idle_kick())
>> return;
>
> But we still need got_nohz_idle_kick() to be the first check, don't we? Otherwise
> if we run an "idle -> quick task slice -> idle" sequence we may keep the flag
> but lose the notifying IPI in between.

I'm not sure to catch the sequence you are describing above: "idle ->
quick task slice -> idle".
In addition, got_nohz_idle_kick must be the last tested condition (in
my proposal) in order to clear NOHZ_BALANCE_KICK only if we are sure
that we are going to return without possibility to trig the Idle load
balance

Vincent

2013-06-05 13:31:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] sched: fix clear NOHZ_BALANCE_KICK

On Tue, Jun 04, 2013 at 05:29:39PM +0200, Vincent Guittot wrote:
> On 4 June 2013 16:44, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Jun 04, 2013 at 01:48:47PM +0200, Vincent Guittot wrote:
> >> On 4 June 2013 13:19, Frederic Weisbecker <[email protected]> wrote:
> >> > On Tue, Jun 04, 2013 at 01:11:47PM +0200, Vincent Guittot wrote:
> >> >> On 4 June 2013 12:26, Frederic Weisbecker <[email protected]> wrote:
> >> >> > On Tue, Jun 04, 2013 at 11:36:11AM +0200, Peter Zijlstra wrote:
> >> >> >>
> >> >> >> The best I can seem to come up with is something like the below; but I think
> >> >> >> its ghastly. Surely we can do something saner with that bit.
> >> >> >>
> >> >> >> Having to clear it at 3 different places is just wrong.
> >> >> >
> >> >> > We could clear the flag early in scheduler_ipi() and set some
> >> >> > specific value in rq->idle_balance that tells we want nohz idle
> >> >> > balancing from the softirq, something like this untested:
> >> >>
> >> >> I'm not sure that we can have less than 2 places to clear it: cancel
> >> >> place or acknowledge place otherwise we can face a situation where
> >> >> idle load balance will be triggered 2 consecutive times because
> >> >> NOHZ_BALANCE_KICK will be cleared before the idle load balance has
> >> >> been done and had a chance to migrate tasks.
> >> >
> >> > I guess it depends what is the minimum value of rq->next_balance, it seems
> >> > to be large enough to avoid this kind of incident. Although I don't
> >> > know well the whole logic with rq->next_balance and ilb trigger so I must
> >> > defer to you.
> >>
> >> In the trace that was showing the issue, i can see that both CPU0 and
> >> CPU1 were trying to trig ILB almost simultaneously and the
> >> test_and_set NOHZ_BALANCE_KICK filters one request so i would say that
> >> clearing the bit before the end of the idle load balance sequence can
> >> generate such sequence
> >
> > I see.
> >
> >>
> >> In the sequence below, i have minimized the clear of NOHZ_BALANCE_KICK
> >> in 2 places : acknowledge and cancel. I have reused part of the
> >> proposal from peter which clears the bit if the condition doesn't
> >> match but i have reordered the tests to done that only if all other
> >> condition are matching
> >>
> >> static inline bool got_nohz_idle_kick(void)
> >> {
> >> - int cpu = smp_processor_id();
> >> - return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> >> + bool nohz_kick = test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> >> +
> >> + if (!nohz_kick)
> >> + return false;
> >> +
> >> + if (idle_cpu(cpu) && !need_resched())
> >> + return true;
> >> +
> >> + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> >> + return false;
> >> }
> >>
> >> #else /* CONFIG_NO_HZ_COMMON */
> >> @@ -1393,8 +1401,9 @@ static void sched_ttwu_pending(void)
> >>
> >> void scheduler_ipi(void)
> >> {
> >> - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
> >> - && !tick_nohz_full_cpu(smp_processor_id()))
> >> + if (llist_empty(&this_rq()->wake_list)
> >> + && !tick_nohz_full_cpu(smp_processor_id())
> >> + && !got_nohz_idle_kick())
> >> return;
> >
> > But we still need got_nohz_idle_kick() to be the first check, don't we? Otherwise
> > if we run an "idle -> quick task slice -> idle" sequence we may keep the flag
> > but lose the notifying IPI in between.
>
> I'm not sure to catch the sequence you are describing above: "idle ->
> quick task slice -> idle".
> In addition, got_nohz_idle_kick must be the last tested condition (in
> my proposal) in order to clear NOHZ_BALANCE_KICK only if we are sure
> that we are going to return without possibility to trig the Idle load
> balance

Right, sorry for the confusion.