We use task_util in find_idlest_group via capacity_spare_wake. This
task_util is updated in wake_cap. However wake_cap is not the only
reason for ending up in find_idlest_group - we could have been sent
there by wake_wide. So explicitly sync the task util with prev_cpu
when we are about to head to find_idlest_group.
We could simply do this at the beginning of
select_task_rq_fair (i.e. irrespective of whether we're heading to
select_idle_sibling or find_idlest_group & co), but I didn't want to
slow down the select_idle_sibling path more than necessary.
Don't do this during fork balancing, we won't need the task_util and
we'd just clobber the last_update_time, which is supposed to be 0.
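For context (not part of this patch), capacity_spare_wake() currently looks
roughly like this - a simplified sketch, so treat the exact body as
approximate:

	static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
	{
		/* Spare capacity left on @cpu once p's util is accounted for */
		return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
	}

find_idlest_group uses this spare-capacity figure when comparing groups, so a
stale task util skews that comparison directly.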
Signed-off-by: Brendan Jackman <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e216f6..62869ff252b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5913,6 +5913,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
new_cpu = cpu;
}
+ if (sd && !(sd_flag & SD_BALANCE_FORK))
+ /*
+ * We're going to need the task's util for capacity_spare_wake
+ * in select_idlest_group. Sync it up to prev_cpu's
+ * last_update_time.
+ */
+ sync_entity_load_avg(&p->se);
+
if (!sd) {
pick_cpu:
if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
--
2.13.0

On Wed, Aug 02, 2017 at 02:10:02PM +0100, Brendan Jackman wrote:
> We use task_util in find_idlest_group via capacity_spare_wake. This
> task_util is updated in wake_cap. However wake_cap is not the only
> reason for ending up in find_idlest_group - we could have been sent
> there by wake_wide. So explicitly sync the task util with prev_cpu
> when we are about to head to find_idlest_group.
>
> We could simply do this at the beginning of
> select_task_rq_fair (i.e. irrespective of whether we're heading to
> select_idle_sibling or find_idlest_group & co), but I didn't want to
> slow down the select_idle_sibling path more than necessary.
>
> Don't do this during fork balancing, we won't need the task_util and
> we'd just clobber the last_update_time, which is supposed to be 0.
So I remember Morten explicitly not aging util of tasks on wakeup
because the old util was higher and better representative of what the
new util would be, or something along those lines.
Morten?
> Signed-off-by: Brendan Jackman <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Josef Bacik <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> kernel/sched/fair.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..62869ff252b4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5913,6 +5913,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> new_cpu = cpu;
> }
>
> + if (sd && !(sd_flag & SD_BALANCE_FORK))
> + /*
> + * We're going to need the task's util for capacity_spare_wake
> + * in select_idlest_group. Sync it up to prev_cpu's
> + * last_update_time.
> + */
> + sync_entity_load_avg(&p->se);
> +
That has missing {}
> if (!sd) {
> pick_cpu:
And if this patch lives, can you please fix up that broken label indent?
> if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */

On Wed, Aug 02 2017 at 13:24, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 02:10:02PM +0100, Brendan Jackman wrote:
>> We use task_util in find_idlest_group via capacity_spare_wake. This
>> task_util is updated in wake_cap. However wake_cap is not the only
>> reason for ending up in find_idlest_group - we could have been sent
>> there by wake_wide. So explicitly sync the task util with prev_cpu
>> when we are about to head to find_idlest_group.
>>
>> We could simply do this at the beginning of
>> select_task_rq_fair (i.e. irrespective of whether we're heading to
>> select_idle_sibling or find_idlest_group & co), but I didn't want to
>> slow down the select_idle_sibling path more than necessary.
>>
>> Don't do this during fork balancing, we won't need the task_util and
>> we'd just clobber the last_update_time, which is supposed to be 0.
>
> So I remember Morten explicitly not aging util of tasks on wakeup
> because the old util was higher and better representative of what the
> new util would be, or something along those lines.
>
> Morten?
>
>> Signed-off-by: Brendan Jackman <[email protected]>
>> Cc: Dietmar Eggemann <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Cc: Josef Bacik <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Morten Rasmussen <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> ---
>> kernel/sched/fair.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c95880e216f6..62869ff252b4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5913,6 +5913,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> new_cpu = cpu;
>> }
>>
>> + if (sd && !(sd_flag & SD_BALANCE_FORK))
>> + /*
>> + * We're going to need the task's util for capacity_spare_wake
>> + * in select_idlest_group. Sync it up to prev_cpu's
>> + * last_update_time.
>> + */
>> + sync_entity_load_avg(&p->se);
>> +
>
> That has missing {}
OK. Also just noticed it refers to "select_idlest_group", will change to
"find_idlest_group".
>
>
>> if (!sd) {
>> pick_cpu:
>
> And if this patch lives, can you please fix up that broken label indent?
Sure.
>> if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
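With the braces added and the comment fixed up, that hunk would look something
like this (sketch only; the label indent fix isn't shown here):

	if (sd && !(sd_flag & SD_BALANCE_FORK)) {
		/*
		 * We're going to need the task's util for capacity_spare_wake
		 * in find_idlest_group. Sync it up to prev_cpu's
		 * last_update_time.
		 */
		sync_entity_load_avg(&p->se);
	}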
Cheers,
Brendan

On Wed, Aug 02, 2017 at 03:24:05PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 02:10:02PM +0100, Brendan Jackman wrote:
> > We use task_util in find_idlest_group via capacity_spare_wake. This
> > task_util is updated in wake_cap. However wake_cap is not the only
> > reason for ending up in find_idlest_group - we could have been sent
> > there by wake_wide. So explicitly sync the task util with prev_cpu
> > when we are about to head to find_idlest_group.
> >
> > We could simply do this at the beginning of
> > select_task_rq_fair (i.e. irrespective of whether we're heading to
> > select_idle_sibling or find_idlest_group & co), but I didn't want to
> > slow down the select_idle_sibling path more than necessary.
> >
> > Don't do this during fork balancing, we won't need the task_util and
> > we'd just clobber the last_update_time, which is supposed to be 0.
>
> So I remember Morten explicitly not aging util of tasks on wakeup
> because the old util was higher and better representative of what the
> new util would be, or something along those lines.
>
> Morten?
That was the intention, but when we discussed the wake_cap() stuff we
decided to drop that, hoping that decay clamping or some other magic
would be added on top later. So this patch is in line with current
behaviour.
Using non-aged util is causing trouble when comparing prev_cpu to other
cpus. In cpu_util_wake() we compensate for the fact that the aged task
util is already included in the cpu util on the prev_cpu. For that to
work, we need to age the task util so we know how much is already
accounted for. In the original wake_cap() series I think I had a patch
that stored the non-aged version so we could calculate the potential cpu
util as:
	predicted_cpu_util(prev_cpu) =
		cpu_util(prev_cpu) - task_util_aged(task) + task_util_nonaged(task)

	predicted_cpu_util(other_cpu) =
		cpu_util(other_cpu) + task_util_nonaged(task)
This would be better than always under-estimating the task util by using
the aged util, as we currently do:
	predicted_cpu_util(prev_cpu) =
		cpu_util(prev_cpu) - task_util_aged(task) + task_util_aged(task)

	predicted_cpu_util(other_cpu) =
		cpu_util(other_cpu) + task_util_aged(task)
but at least it gives us a fair comparison between prev_cpu and other
cpus.
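In pseudo-C, with task_util_nonaged() as a purely hypothetical helper (nothing
like it exists in mainline), the comparison above would be something like:

	static unsigned long predicted_cpu_util(int cpu, struct task_struct *p)
	{
		unsigned long util = cpu_util(cpu);

		/*
		 * prev_cpu still carries p's aged contribution in its cpu
		 * util, so remove that first (this is what cpu_util_wake()
		 * compensates for).
		 */
		if (cpu == task_cpu(p))
			util -= min(util, task_util(p));	/* aged util */

		/* Add the estimate of what p will contribute once enqueued. */
		return util + task_util_nonaged(p);		/* hypothetical */
	}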
The Android kernel carries additional patches that track the max (peak)
utilization and use that as the non-aged util for wake-up placement.
I'm hoping we can discuss this topic again at LPC, as last year's idea of
clamping decay didn't work very well to solve this issue.