LinuxLists.cc - [PATCH] sched/fair: Update rq_clock, cfs_rq before migrating for asym cpu capacity

2019-07-09 11:59:20

Subject: [PATCH] sched/fair: Update rq_clock, cfs_rq before migrating for asym cpu capacity

The ancient workaround to avoid the cost of updating rq clocks in the
middle of a migration causes some issues on asymmetric CPU capacity
systems where we use task utilization to determine which cpus fit a task.
On quiet systems we can inflate task util after a migration which
causes misfit to fire and force-migrate the task.

This occurs when:

(a) a task has util close to the non-overutilized capacity limit of a
particular cpu (cpu0 here); and
(b) the prev_cpu was quiet otherwise, such that rq clock is
sufficiently out of date (cpu1 here).

e.g.
_____
cpu0: ________________________| |______________

|<- misfit happens
______ ___ ___
cpu1: ____| |______________|___| |_________|

->| |<- wakeup migration time
last rq clock update

When the task util is in just the right range for the system, we can end
up migrating an unlucky task back and forth many times until we are lucky
and the source rq happens to be updated close to the migration time.

In order to address this, lets update both rq_clock and cfs_rq where
this could be an issue.

Signed-off-by: Chris Redpath <[email protected]>
---
kernel/sched/fair.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b798fe7ff7cd..51791db26a2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6545,6 +6545,21 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* wakee task is less decayed, but giving the wakee more load
* sounds not bad.
*/
+ if (static_branch_unlikely(&sched_asym_cpucapacity) &&
+ p->state == TASK_WAKING) {
+ /*
+ * On asymmetric capacity systems task util guides
+ * wake placement so update rq_clock and cfs_rq
+ */
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rq *rq = task_rq(p);
+ struct rq_flags rf;
+
+ rq_lock_irqsave(rq, &rf);
+ update_rq_clock(rq);
+ update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+ rq_unlock_irqrestore(rq, &rf);
+ }
remove_entity_load_avg(&p->se);
}

--
2.17.1

2019-07-09 13:51:39

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Update rq_clock, cfs_rq before migrating for asym cpu capacity

On Tue, Jul 09, 2019 at 12:57:59PM +0100, Chris Redpath wrote:
> The ancient workaround to avoid the cost of updating rq clocks in the
> middle of a migration causes some issues on asymmetric CPU capacity
> systems where we use task utilization to determine which cpus fit a task.
> On quiet systems we can inflate task util after a migration which
> causes misfit to fire and force-migrate the task.
>
> This occurs when:
>
> (a) a task has util close to the non-overutilized capacity limit of a
> particular cpu (cpu0 here); and
> (b) the prev_cpu was quiet otherwise, such that rq clock is
> sufficiently out of date (cpu1 here).
>
> e.g.
> _____
> cpu0: ________________________| |______________
>
> |<- misfit happens
> ______ ___ ___
> cpu1: ____| |______________|___| |_________|
>
> ->| |<- wakeup migration time
> last rq clock update
>
> When the task util is in just the right range for the system, we can end
> up migrating an unlucky task back and forth many times until we are lucky
> and the source rq happens to be updated close to the migration time.
>
> In order to address this, lets update both rq_clock and cfs_rq where
> this could be an issue.

Can you quantify how much of a problem this really is? It is really sad,
but this is already the second place where we take rq->lock on
migration. We worked so hard to avoid having to acquire it :/

> Signed-off-by: Chris Redpath <[email protected]>
> ---
> kernel/sched/fair.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b798fe7ff7cd..51791db26a2a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6545,6 +6545,21 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> * wakee task is less decayed, but giving the wakee more load
> * sounds not bad.
> */
> + if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> + p->state == TASK_WAKING) {

nit: indent fail.

> + /*
> + * On asymmetric capacity systems task util guides
> + * wake placement so update rq_clock and cfs_rq
> + */
> + struct cfs_rq *cfs_rq = task_cfs_rq(p);
> + struct rq *rq = task_rq(p);
> + struct rq_flags rf;
> +
> + rq_lock_irqsave(rq, &rf);
> + update_rq_clock(rq);
> + update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
> + rq_unlock_irqrestore(rq, &rf);
> + }
> remove_entity_load_avg(&p->se);
> }
>
> --
> 2.17.1
>

2019-07-09 15:24:33

by Chris Redpath

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Update rq_clock, cfs_rq before migrating for asym cpu capacity

Hi Peter,

On 09/07/2019 14:50, Peter Zijlstra wrote:
> On Tue, Jul 09, 2019 at 12:57:59PM +0100, Chris Redpath wrote:
>> The ancient workaround to avoid the cost of updating rq clocks in the
>> middle of a migration causes some issues on asymmetric CPU capacity
>> systems where we use task utilization to determine which cpus fit a task.
>> On quiet systems we can inflate task util after a migration which
>> causes misfit to fire and force-migrate the task.
>>
>> This occurs when:
>>
>> (a) a task has util close to the non-overutilized capacity limit of a
>> particular cpu (cpu0 here); and
>> (b) the prev_cpu was quiet otherwise, such that rq clock is
>> sufficiently out of date (cpu1 here).
>>
>> e.g.
>> _____
>> cpu0: ________________________| |______________
>>
>> |<- misfit happens
>> ______ ___ ___
>> cpu1: ____| |______________|___| |_________|
>>
>> ->| |<- wakeup migration time
>> last rq clock update
>>
>> When the task util is in just the right range for the system, we can end
>> up migrating an unlucky task back and forth many times until we are lucky
>> and the source rq happens to be updated close to the migration time.
>>
>> In order to address this, lets update both rq_clock and cfs_rq where
>> this could be an issue.
>
> Can you quantify how much of a problem this really is? It is really sad,
> but this is already the second place where we take rq->lock on
> migration. We worked so hard to avoid having to acquire it :/
>

I think you're familiar with the way we test the EAS and misfit stuff,
but some might not be, so I'll just outline them.

We have performance and placement tests for a suite of simple synthetic
scenarios selected to trigger the EAS & misfit mechanisms. The
performance tests use rt-app's slack metric, and we try to minimise
negative slack (i.e. missed deadlines).

In the placement tests we estimate the minimum energy consumed to run a
particular synthetic test job and we calculate the energy consumed in
the actual execution according to a trace. We pass the test if our
estimate of actual is less than ideal+20%.

We enter this code quite often in our testing, most individual runs of a
test which has small tasks involved have at least one hit where we make
a change to the clock with this patch in.

That said - despite the relatively high number of hits only about 5% of
runs see enough additional energy consumed to trigger a test failure. We
do try to keep a quiet system as much as possible and only run for a few
seconds so the impact we see in testing is also probably higher than in
the real world.

I totally appreciate the reluctance to add this - I don't much like it
either, but I was hoping that sticking it behind the asym_cpucapacity
key might be a reasonable compromise.

At least only those people who select a CPU using task util and capacity
see this cost, and we have smaller systems so in theory the cost is smaller.

I'm very open to exploring alternatives :)

>> Signed-off-by: Chris Redpath <[email protected]>
>> ---
>> kernel/sched/fair.c | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b798fe7ff7cd..51791db26a2a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6545,6 +6545,21 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>> * wakee task is less decayed, but giving the wakee more load
>> * sounds not bad.
>> */
>> + if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>> + p->state == TASK_WAKING) {
>
> nit: indent fail.
>

oops, will tweak it

--Chris
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

2019-07-09 15:37:38

by Vincent Guittot

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Update rq_clock, cfs_rq before migrating for asym cpu capacity

Hi Chris,

On Tue, 9 Jul 2019 at 17:23, Chris Redpath <[email protected]> wrote:
>
> Hi Peter,
>
> On 09/07/2019 14:50, Peter Zijlstra wrote:
> > On Tue, Jul 09, 2019 at 12:57:59PM +0100, Chris Redpath wrote:
> >> The ancient workaround to avoid the cost of updating rq clocks in the
> >> middle of a migration causes some issues on asymmetric CPU capacity
> >> systems where we use task utilization to determine which cpus fit a task.
> >> On quiet systems we can inflate task util after a migration which
> >> causes misfit to fire and force-migrate the task.
> >>
> >> This occurs when:
> >>
> >> (a) a task has util close to the non-overutilized capacity limit of a
> >> particular cpu (cpu0 here); and
> >> (b) the prev_cpu was quiet otherwise, such that rq clock is
> >> sufficiently out of date (cpu1 here).
> >>
> >> e.g.
> >> _____
> >> cpu0: ________________________| |______________
> >>
> >> |<- misfit happens
> >> ______ ___ ___
> >> cpu1: ____| |______________|___| |_________|
> >>
> >> ->| |<- wakeup migration time
> >> last rq clock update
> >>
> >> When the task util is in just the right range for the system, we can end
> >> up migrating an unlucky task back and forth many times until we are lucky
> >> and the source rq happens to be updated close to the migration time.
> >>
> >> In order to address this, lets update both rq_clock and cfs_rq where
> >> this could be an issue.
> >
> > Can you quantify how much of a problem this really is? It is really sad,
> > but this is already the second place where we take rq->lock on
> > migration. We worked so hard to avoid having to acquire it :/
> >
>
> I think you're familiar with the way we test the EAS and misfit stuff,
> but some might not be, so I'll just outline them.
>
> We have performance and placement tests for a suite of simple synthetic
> scenarios selected to trigger the EAS & misfit mechanisms. The
> performance tests use rt-app's slack metric, and we try to minimise
> negative slack (i.e. missed deadlines).
>
> In the placement tests we estimate the minimum energy consumed to run a
> particular synthetic test job and we calculate the energy consumed in
> the actual execution according to a trace. We pass the test if our
> estimate of actual is less than ideal+20%.
>
> We enter this code quite often in our testing, most individual runs of a
> test which has small tasks involved have at least one hit where we make
> a change to the clock with this patch in.

Do you have a rt-app file that you can share ?

>
> That said - despite the relatively high number of hits only about 5% of
> runs see enough additional energy consumed to trigger a test failure. We
> do try to keep a quiet system as much as possible and only run for a few
> seconds so the impact we see in testing is also probably higher than in
> the real world.

Yeah, I'm curious to see the impact on a real system which have a
60fps screen update like an android phone

>
> I totally appreciate the reluctance to add this - I don't much like it
> either, but I was hoping that sticking it behind the asym_cpucapacity
> key might be a reasonable compromise.
>
> At least only those people who select a CPU using task util and capacity
> see this cost, and we have smaller systems so in theory the cost is smaller.
>
> I'm very open to exploring alternatives :)
>
> >> Signed-off-by: Chris Redpath <[email protected]>
> >> ---
> >> kernel/sched/fair.c | 15 +++++++++++++++
> >> 1 file changed, 15 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b798fe7ff7cd..51791db26a2a 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6545,6 +6545,21 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> >> * wakee task is less decayed, but giving the wakee more load
> >> * sounds not bad.
> >> */
> >> + if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> >> + p->state == TASK_WAKING) {
> >
> > nit: indent fail.
> >
>
> oops, will tweak it
>
> --Chris
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

2019-07-09 15:44:36

by Chris Redpath

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Update rq_clock, cfs_rq before migrating for asym cpu capacity

On 09/07/2019 16:36, Vincent Guittot wrote:
> Hi Chris,
>
>>
>> We enter this code quite often in our testing, most individual runs of a
>> test which has small tasks involved have at least one hit where we make
>> a change to the clock with this patch in.
>
> Do you have a rt-app file that you can share ?
>

The ThreeSmallTasks test which is the worst hit produces this:

{
"tasks": {
"small_0": {
"policy": "SCHED_OTHER",
"delay": 0,
"loop": 1,
"phases": {
"p000001": {
"loop": 62,
"run": 2880,
"timer": {
"ref": "small_0",
"period": 16000
}
}
}
},
"small_1": {
"policy": "SCHED_OTHER",
"delay": 0,
"loop": 1,
"phases": {
"p000001": {
"loop": 62,
"run": 2880,
"timer": {
"ref": "small_1",
"period": 16000
}
}
}
},
"small_2": {
"policy": "SCHED_OTHER",
"delay": 0,
"loop": 1,
"phases": {
"p000001": {
"loop": 62,
"run": 2880,
"timer": {
"ref": "small_2",
"period": 16000
}
}
}
}
},
"global": {
"default_policy": "SCHED_OTHER",
"duration": -1,
"calibration": 264,
"logdir": "/root/devlib-target"
}
}

when I run it

>>
>> That said - despite the relatively high number of hits only about 5% of
>> runs see enough additional energy consumed to trigger a test failure. We
>> do try to keep a quiet system as much as possible and only run for a few
>> seconds so the impact we see in testing is also probably higher than in
>> the real world.
>
> Yeah, I'm curious to see the impact on a real system which have a
> 60fps screen update like an android phone
>

I wouldn't expect much change there but I would on the idle-ish
homescreen/day-of-use type benchmarks.

If I had a platform with any kind of representative energy use, I'd test
it :)

2019-07-09 15:47:22

by Vincent Guittot

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: Update rq_clock, cfs_rq before migrating for asym cpu capacity

On Tue, 9 Jul 2019 at 17:42, Chris Redpath <[email protected]> wrote:
>
> On 09/07/2019 16:36, Vincent Guittot wrote:
> > Hi Chris,
> >
> >>
> >> We enter this code quite often in our testing, most individual runs of a
> >> test which has small tasks involved have at least one hit where we make
> >> a change to the clock with this patch in.
> >
> > Do you have a rt-app file that you can share ?
> >
>
> The ThreeSmallTasks test which is the worst hit produces this:
>
> {
> "tasks": {
> "small_0": {
> "policy": "SCHED_OTHER",
> "delay": 0,
> "loop": 1,
> "phases": {
> "p000001": {
> "loop": 62,
> "run": 2880,
> "timer": {
> "ref": "small_0",
> "period": 16000
> }
> }
> }
> },
> "small_1": {
> "policy": "SCHED_OTHER",
> "delay": 0,
> "loop": 1,
> "phases": {
> "p000001": {
> "loop": 62,
> "run": 2880,
> "timer": {
> "ref": "small_1",
> "period": 16000
> }
> }
> }
> },
> "small_2": {
> "policy": "SCHED_OTHER",
> "delay": 0,
> "loop": 1,
> "phases": {
> "p000001": {
> "loop": 62,
> "run": 2880,
> "timer": {
> "ref": "small_2",
> "period": 16000
> }
> }
> }
> }
> },
> "global": {
> "default_policy": "SCHED_OTHER",
> "duration": -1,
> "calibration": 264,
> "logdir": "/root/devlib-target"
> }
> }
>
> when I run it

Thanks I will make it a try on my b.L platform

>
> >>
> >> That said - despite the relatively high number of hits only about 5% of
> >> runs see enough additional energy consumed to trigger a test failure. We
> >> do try to keep a quiet system as much as possible and only run for a few
> >> seconds so the impact we see in testing is also probably higher than in
> >> the real world.
> >
> > Yeah, I'm curious to see the impact on a real system which have a
> > 60fps screen update like an android phone
> >
>
> I wouldn't expect much change there but I would on the idle-ish
> homescreen/day-of-use type benchmarks.
>
> If I had a platform with any kind of representative energy use, I'd test
> it :)