2022-07-13 04:07:05

by Chengming Zhou

Subject: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

Since update_load_avg() now supports DO_ATTACH and DO_DETACH, we can
use it to implement the attach/detach of entity load_avg.

Another advantage of using update_load_avg() is that it checks
last_update_time before attaching or detaching, instead of the
unconditional attach/detach in the current code.

This lets us avoid some problematic corner cases in load tracking, like
the twice-attach problem and the detach-unattached-NEW-task problem.

1. switch to fair class (twice attach problem)

p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
if (queued)
    enqueue_task(p);
        ...
        enqueue_entity()
            update_load_avg(UPDATE_TG | DO_ATTACH)
                if (!se->avg.last_update_time && (flags & DO_ATTACH)) --> true
                    attach_entity_load_avg() --> attached, will set last_update_time
check_class_changed()
    switched_from() (!fair)
    switched_to() (fair)
        switched_to_fair()
            attach_entity_load_avg() --> unconditional attach again!

2. change cgroup of NEW task (detach unattached task problem)

sched_move_group(p)
    if (queued)
        dequeue_task()
    task_move_group_fair()
        detach_task_cfs_rq()
            detach_entity_load_avg() --> detach unattached NEW task
        set_task_rq()
        attach_task_cfs_rq()
            attach_entity_load_avg()
    if (queued)
        enqueue_task()

These problems were fixed in commit 7dc603c9028e
("sched/fair: Fix PELT integrity for new tasks"), but that fix
brings its own problems.

First, it adds a new task state, TASK_NEW, and an unnecessary
limitation: changing the cgroup of a TASK_NEW task fails.

Second, it attaches the entity load_avg in post_init_entity_util_avg(),
where we only set the sched_avg last_update_time for !fair tasks, which
causes a PELT integrity problem when switched_to_fair() runs.

This patch makes update_load_avg() the only place we attach/detach
entity load_avg, and handles corner cases like changing the cgroup of a
NEW task by checking last_update_time before attaching/detaching.
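
For illustration, the conditional core of update_load_avg() then looks
roughly like the sketch below (simplified, not the exact kernel code;
decay bookkeeping, propagation and the cpufreq hook are omitted):

    static inline void update_load_avg(struct cfs_rq *cfs_rq,
                                       struct sched_entity *se, int flags)
    {
        u64 now = cfs_rq_clock_pelt(cfs_rq);

        /* Age the entity signal, unless it was reset (fork/migration). */
        if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
            __update_load_avg_se(now, cfs_rq, se);

        update_cfs_rq_load_avg(now, cfs_rq);

        if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
            /* Not attached yet: attach; this sets last_update_time. */
            attach_entity_load_avg(cfs_rq, se);
            update_tg_load_avg(cfs_rq);
        } else if (se->avg.last_update_time && (flags & DO_DETACH)) {
            /* Attached: remove our contribution from the cfs_rq. */
            detach_entity_load_avg(cfs_rq, se);
            update_tg_load_avg(cfs_rq);
        }
    }

Both branches key off se->avg.last_update_time, so an unattached entity
(last_update_time == 0) can never be detached, and an attached one can
never be attached twice.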

Signed-off-by: Chengming Zhou <[email protected]>
---
kernel/sched/fair.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29811869c1fe..51fc20c161a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4307,11 +4307,6 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

static inline void remove_entity_load_avg(struct sched_entity *se) {}

-static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void
-detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-
static inline int newidle_balance(struct rq *rq, struct rq_flags *rf)
{
return 0;
@@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);

/* Catch up with the cfs_rq and remove our load when we leave */
- update_load_avg(cfs_rq, se, 0);
- detach_entity_load_avg(cfs_rq, se);
- update_tg_load_avg(cfs_rq);
+ update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
propagate_entity_cfs_rq(se);
}

@@ -11537,10 +11530,8 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);

- /* Synchronize entity with its cfs_rq */
- update_load_avg(cfs_rq, se, 0);
- attach_entity_load_avg(cfs_rq, se);
- update_tg_load_avg(cfs_rq);
+ /* Synchronize entity with its cfs_rq and attach our load */
+ update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
propagate_entity_cfs_rq(se);
}

--
2.36.1


2022-07-15 11:25:48

by Dietmar Eggemann

Subject: Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

On 13/07/2022 06:04, Chengming Zhou wrote:
> Since update_load_avg() now supports DO_ATTACH and DO_DETACH, we can
> use it to implement the attach/detach of entity load_avg.
>
> Another advantage of using update_load_avg() is that it checks
> last_update_time before attaching or detaching, instead of the
> unconditional attach/detach in the current code.
>
> This lets us avoid some problematic corner cases in load tracking, like
> the twice-attach problem and the detach-unattached-NEW-task problem.

This explanation is somewhat hard to follow for me. Since both issues
have already been fixed (you mention this further below), are you saying
that your change doesn't reintroduce them?

> 1. switch to fair class (twice attach problem)
>
> p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
> if (queued)
>     enqueue_task(p);
>         ...
>         enqueue_entity()
>             update_load_avg(UPDATE_TG | DO_ATTACH)
>                 if (!se->avg.last_update_time && (flags & DO_ATTACH)) --> true
>                     attach_entity_load_avg() --> attached, will set last_update_time
> check_class_changed()
>     switched_from() (!fair)
>     switched_to() (fair)
>         switched_to_fair()
>             attach_entity_load_avg() --> unconditional attach again!
>
> 2. change cgroup of NEW task (detach unattached task problem)
>
> sched_move_group(p)
>     if (queued)
>         dequeue_task()
>     task_move_group_fair()
>         detach_task_cfs_rq()
>             detach_entity_load_avg() --> detach unattached NEW task
>         set_task_rq()
>         attach_task_cfs_rq()
>             attach_entity_load_avg()
>     if (queued)
>         enqueue_task()
>
> These problems were fixed in commit 7dc603c9028e
> ("sched/fair: Fix PELT integrity for new tasks"), but that fix
> brings its own problems.
>
> First, it adds a new task state, TASK_NEW, and an unnecessary
> limitation: changing the cgroup of a TASK_NEW task fails.
>
> Second, it attaches the entity load_avg in post_init_entity_util_avg(),
> where we only set the sched_avg last_update_time for !fair tasks, which
> causes a PELT integrity problem when switched_to_fair() runs.

I guess those PELT integrity problems are less severe since we have the
enqueue_task_fair() before the switched_to_fair() for enqueued tasks. So
we always decay the time the task spends outside fair.

Looks to me that you want to replace this by your `freeze PELT when
outside fair` model.

> This patch makes update_load_avg() the only place we attach/detach
> entity load_avg, and handles corner cases like changing the cgroup of a
> NEW task by checking last_update_time before attaching/detaching.

[...]

> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> /* Catch up with the cfs_rq and remove our load when we leave */
> - update_load_avg(cfs_rq, se, 0);
> - detach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq);
> + update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);

IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
updated in this case.

[...]

2022-07-15 16:30:39

by Chengming Zhou

Subject: Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

On 2022/7/15 19:18, Dietmar Eggemann wrote:
> On 13/07/2022 06:04, Chengming Zhou wrote:
>> Since update_load_avg() now supports DO_ATTACH and DO_DETACH, we can
>> use it to implement the attach/detach of entity load_avg.
>>
>> Another advantage of using update_load_avg() is that it checks
>> last_update_time before attaching or detaching, instead of the
>> unconditional attach/detach in the current code.
>>
>> This lets us avoid some problematic corner cases in load tracking, like
>> the twice-attach problem and the detach-unattached-NEW-task problem.
>
> This explanation is somewhat hard to follow for me. Since both issues
> have already been fixed (you mention this further below), are you saying
> that your change doesn't reintroduce them?

Sorry for the unclear explanation.

Yes, both issues have already been fixed; what I want to say is that the
bugfix brings its own problems and limitations, mentioned below.

So I want to solve these problems in a better way.

>
>> 1. switch to fair class (twice attach problem)
>>
>> p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
>> if (queued)
>>     enqueue_task(p);
>>         ...
>>         enqueue_entity()
>>             update_load_avg(UPDATE_TG | DO_ATTACH)
>>                 if (!se->avg.last_update_time && (flags & DO_ATTACH)) --> true
>>                     attach_entity_load_avg() --> attached, will set last_update_time
>> check_class_changed()
>>     switched_from() (!fair)
>>     switched_to() (fair)
>>         switched_to_fair()
>>             attach_entity_load_avg() --> unconditional attach again!
>>
>> 2. change cgroup of NEW task (detach unattached task problem)
>>
>> sched_move_group(p)
>>     if (queued)
>>         dequeue_task()
>>     task_move_group_fair()
>>         detach_task_cfs_rq()
>>             detach_entity_load_avg() --> detach unattached NEW task
>>         set_task_rq()
>>         attach_task_cfs_rq()
>>             attach_entity_load_avg()
>>     if (queued)
>>         enqueue_task()
>>
>> These problems were fixed in commit 7dc603c9028e
>> ("sched/fair: Fix PELT integrity for new tasks"), but that fix
>> brings its own problems.
>>
>> First, it adds a new task state, TASK_NEW, and an unnecessary
>> limitation: changing the cgroup of a TASK_NEW task fails.

This is the limitation that the bugfix brought.

We can't change the cgroup of a task with last_update_time == 0, or
switch it to fair, if we don't have conditional detach/attach.

So we have to:

1. also set last_update_time for !fair tasks;
2. make cpu_cgroup_can_attach() wait for TASK_NEW tasks to be fully
attached (roughly the check sketched below).
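
For context, the check that the commit added to cpu_cgroup_can_attach()
looks roughly like this (a trimmed sketch from memory, not the verbatim
code):

    static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
    {
        struct task_struct *task;
        struct cgroup_subsys_state *css;
        int ret = 0;

        cgroup_taskset_for_each(task, css, tset) {
            /* Serialize against wake_up_new_task(). */
            raw_spin_lock_irq(&task->pi_lock);
            /*
             * Refuse to move a task whose load_avg is not
             * attached yet: sched_move_task() would want to
             * detach+attach while nothing is attached at all.
             */
            if (READ_ONCE(task->__state) == TASK_NEW)
                ret = -EINVAL;
            raw_spin_unlock_irq(&task->pi_lock);
            if (ret)
                break;
        }
        return ret;
    }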

>>
>> Second, it attaches the entity load_avg in post_init_entity_util_avg(),
>> where we only set the sched_avg last_update_time for !fair tasks, which
>> causes a PELT integrity problem when switched_to_fair() runs.
>
> I guess those PELT integrity problems are less severe since we have the
> enqueue_task_fair() before the switched_to_fair() for enqueued tasks. So
> we always decay the time the task spends outside fair.

enqueue_task_fair()
    enqueue_entity()
        update_load_avg()
            if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) --> true
                __update_load_avg_se(now, cfs_rq, se); --> (1)

As we can see above, for a queued !fair task, (1) will decay the delta
time (now - se.avg.last_update_time), even for a NEW !fair task.

>
> Looks to me that you want to replace this by your `freeze PELT when
> outside fair` model.

Yes, I want to freeze PELT for two !fair cases:

1. A !fair task that has never been fair: it will still have its init
load_avg when it switches to fair.

2. A !fair task that went through switched_from_fair(): it will still
keep its latest load_avg when it switches back to fair.

>
>> This patch makes update_load_avg() the only place we attach/detach
>> entity load_avg, and handles corner cases like changing the cgroup of a
>> NEW task by checking last_update_time before attaching/detaching.
>
> [...]
>
>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
>> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>
>> /* Catch up with the cfs_rq and remove our load when we leave */
>> - update_load_avg(cfs_rq, se, 0);
>> - detach_entity_load_avg(cfs_rq, se);
>> - update_tg_load_avg(cfs_rq);
>> + update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
>
> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
> updated in this case.

Correct, will do.

Thanks.

>
> [...]

2022-07-19 10:42:44

by Vincent Guittot

Subject: Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
<[email protected]> wrote:
>
> On 2022/7/15 19:18, Dietmar Eggemann wrote:
> > On 13/07/2022 06:04, Chengming Zhou wrote:
> >> Since update_load_avg() now supports DO_ATTACH and DO_DETACH, we can
> >> use it to implement the attach/detach of entity load_avg.
> >>
> >> Another advantage of using update_load_avg() is that it checks
> >> last_update_time before attaching or detaching, instead of the
> >> unconditional attach/detach in the current code.
> >>
> >> This lets us avoid some problematic corner cases in load tracking, like
> >> the twice-attach problem and the detach-unattached-NEW-task problem.
> >
> > This explanation is somewhat hard to follow for me. Since both issues
> > have already been fixed (you mention this further below), are you saying
> > that your change doesn't reintroduce them?
>
> Sorry for the unclear explanation.
>
> Yes, both issues have already been fixed; what I want to say is that the
> bugfix brings its own problems and limitations, mentioned below.

As Dietmar said, the commit message is misleading because someone might
think you are fixing these bugs, whereas that's not the case.

>
> So I want to solve these problems in a better way.
>
> >
> >> 1. switch to fair class (twice attach problem)
> >>
> >> p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
> >> if (queued)
> >>     enqueue_task(p);
> >>         ...
> >>         enqueue_entity()
> >>             update_load_avg(UPDATE_TG | DO_ATTACH)
> >>                 if (!se->avg.last_update_time && (flags & DO_ATTACH)) --> true
> >>                     attach_entity_load_avg() --> attached, will set last_update_time
> >> check_class_changed()
> >>     switched_from() (!fair)
> >>     switched_to() (fair)
> >>         switched_to_fair()
> >>             attach_entity_load_avg() --> unconditional attach again!
> >>
> >> 2. change cgroup of NEW task (detach unattached task problem)
> >>
> >> sched_move_group(p)
> >>     if (queued)
> >>         dequeue_task()
> >>     task_move_group_fair()
> >>         detach_task_cfs_rq()
> >>             detach_entity_load_avg() --> detach unattached NEW task
> >>         set_task_rq()
> >>         attach_task_cfs_rq()
> >>             attach_entity_load_avg()
> >>     if (queued)
> >>         enqueue_task()
> >>
> >> These problems were fixed in commit 7dc603c9028e
> >> ("sched/fair: Fix PELT integrity for new tasks"), but that fix
> >> brings its own problems.
> >>
> >> First, it adds a new task state, TASK_NEW, and an unnecessary
> >> limitation: changing the cgroup of a TASK_NEW task fails.
>
> This is the limitation that the bugfix brought.
>
> We can't change the cgroup of a task with last_update_time == 0, or
> switch it to fair, if we don't have conditional detach/attach.
>
> So we have to:
>
> 1. also set last_update_time for !fair tasks;
> 2. make cpu_cgroup_can_attach() wait for TASK_NEW tasks to be fully
> attached.
>
> >>
> >> Second, it attaches the entity load_avg in post_init_entity_util_avg(),
> >> where we only set the sched_avg last_update_time for !fair tasks, which
> >> causes a PELT integrity problem when switched_to_fair() runs.
> >
> > I guess those PELT integrity problems are less severe since we have the
> > enqueue_task_fair() before the switched_to_fair() for enqueued tasks. So
> > we always decay the time the task spends outside fair.
>
> enqueue_task_fair()
>     enqueue_entity()
>         update_load_avg()
>             if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) --> true
>                 __update_load_avg_se(now, cfs_rq, se); --> (1)
>
> As we can see above, for a queued !fair task, (1) will decay the delta
> time (now - se.avg.last_update_time), even for a NEW !fair task.
>
> >
> > Looks to me that you want to replace this by your `freeze PELT when
> > outside fair` model.
>
> Yes, I want to freeze PELT for two !fair cases:
>
> 1. A !fair task that has never been fair: it will still have its init
> load_avg when it switches to fair.

But I'm not sure it makes sense to keep these init values. As an
example, the util_avg is set according to the cpu utilization at the
time of the task's creation. I would tend to decay them, as these init
values become less and less relevant.

So we should return early in post_init_entity_util_avg() and not set
util_avg if the sched class is not cfs.
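
Something like the below (an untested sketch of the idea, not a real
patch):

    void post_init_entity_util_avg(struct task_struct *p)
    {
        struct sched_entity *se = &p->se;

        /*
         * A !fair task's initial estimate would be stale by the
         * time it ever switches to fair: skip the cpu-utilization
         * based init and leave util_avg/runnable_avg at 0.
         */
        if (p->sched_class != &fair_sched_class)
            return;

        /* ... existing util_avg/runnable_avg initialization ... */
    }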

>
> 2. A !fair task that went through switched_from_fair(): it will still
> keep its latest load_avg when it switches back to fair.
>
> >
> >> This patch makes update_load_avg() the only place we attach/detach
> >> entity load_avg, and handles corner cases like changing the cgroup of a
> >> NEW task by checking last_update_time before attaching/detaching.
> >
> > [...]
> >
> >> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
> >> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >>
> >> /* Catch up with the cfs_rq and remove our load when we leave */
> >> - update_load_avg(cfs_rq, se, 0);
> >> - detach_entity_load_avg(cfs_rq, se);
> >> - update_tg_load_avg(cfs_rq);
> >> + update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
> >
> > IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
> > updated in this case.
>
> Correct, will do.
>
> Thanks.
>
> >
> > [...]

2022-07-19 15:10:38

by Dietmar Eggemann

Subject: Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

On 15/07/2022 18:21, Chengming Zhou wrote:
> On 2022/7/15 19:18, Dietmar Eggemann wrote:
>> On 13/07/2022 06:04, Chengming Zhou wrote:
>>> Since update_load_avg() now supports DO_ATTACH and DO_DETACH, we can
>>> use it to implement the attach/detach of entity load_avg.
>>>
>>> Another advantage of using update_load_avg() is that it checks
>>> last_update_time before attaching or detaching, instead of the
>>> unconditional attach/detach in the current code.
>>>
>>> This lets us avoid some problematic corner cases in load tracking, like
>>> the twice-attach problem and the detach-unattached-NEW-task problem.
>>
>> This explanation is somewhat hard to follow for me. Since both issues
>> have already been fixed (you mention this further below), are you saying
>> that your change doesn't reintroduce them?
>
> Sorry for the unclear explanation.
>
> Yes, both issues have already been fixed; what I want to say is that the
> bugfix brings its own problems and limitations, mentioned below.
>
> So I want to solve these problems in a better way.

[...]

>>> These problems were fixed in commit 7dc603c9028e
>>> ("sched/fair: Fix PELT integrity for new tasks"), but that fix
>>> brings its own problems.
>>>
>>> First, it adds a new task state, TASK_NEW, and an unnecessary
>>> limitation: changing the cgroup of a TASK_NEW task fails.
>
> This is the limitation that the bugfix brought.
>
> We can't change the cgroup of a task with last_update_time == 0, or
> switch it to fair, if we don't have conditional detach/attach.
>
> So we have to:
>
> 1. also set last_update_time for !fair tasks;
> 2. make cpu_cgroup_can_attach() wait for TASK_NEW tasks to be fully
> attached.

I see.

`cgroup_migrate_execute() -> cpu_cgroup_[can|]_attach()` has to wait for
`wake_up_new_task() -> WRITE_ONCE(p->__state, TASK_RUNNING)`.

Just to understand this change better: IMHO, this is still the case for
fair tasks, right?

`wake_up_new_task() -> post_init_entity_util_avg() ->
attach_entity_cfs_rq()` has to happen before the fair task can move
between taskgroups in `cgroup_migrate_execute() -> cpu_cgroup_attach()
-> sched_move_task() -> sched_change_group() -> task_change_group_fair()`.

[...]

2022-07-20 14:01:23

by Chengming Zhou

Subject: Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

On 2022/7/19 23:02, Dietmar Eggemann wrote:
> On 15/07/2022 18:21, Chengming Zhou wrote:
>> On 2022/7/15 19:18, Dietmar Eggemann wrote:
>>> On 13/07/2022 06:04, Chengming Zhou wrote:
>>>> Since update_load_avg() now supports DO_ATTACH and DO_DETACH, we can
>>>> use it to implement the attach/detach of entity load_avg.
>>>>
>>>> Another advantage of using update_load_avg() is that it checks
>>>> last_update_time before attaching or detaching, instead of the
>>>> unconditional attach/detach in the current code.
>>>>
>>>> This lets us avoid some problematic corner cases in load tracking, like
>>>> the twice-attach problem and the detach-unattached-NEW-task problem.
>>>
>>> This explanation is somewhat hard to follow for me. Since both issues
>>> have already been fixed (you mention this further below), are you saying
>>> that your change doesn't reintroduce them?
>>
>> Sorry for the unclear explanation.
>>
>> Yes, both issues have already been fixed; what I want to say is that the
>> bugfix brings its own problems and limitations, mentioned below.
>>
>> So I want to solve these problems in a better way.
>
> [...]
>
>>>> These problems were fixed in commit 7dc603c9028e
>>>> ("sched/fair: Fix PELT integrity for new tasks"), but that fix
>>>> brings its own problems.
>>>>
>>>> First, it adds a new task state, TASK_NEW, and an unnecessary
>>>> limitation: changing the cgroup of a TASK_NEW task fails.
>>
>> This is the limitation that the bugfix brought.
>>
>> We can't change the cgroup of a task with last_update_time == 0, or
>> switch it to fair, if we don't have conditional detach/attach.
>>
>> So we have to:
>>
>> 1. also set last_update_time for !fair tasks;
>> 2. make cpu_cgroup_can_attach() wait for TASK_NEW tasks to be fully
>> attached.
>
> I see.
>
> `cgroup_migrate_execute() -> cpu_cgroup_[can|]_attach()` has to wait for
> `wake_up_new_task() -> WRITE_ONCE(p->__state, TASK_RUNNING)`.
>
> Just to understand this change better: IMHO, this is still the case for
> fair tasks, right?

Yes, I think so. We could drop this limitation once we have conditional
detach/attach, since nothing will be detached when last_update_time == 0.
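
In other words, with this series the detach path is guarded roughly like
(sketch, not the verbatim code):

    /* Only detach a load_avg that was actually attached. */
    if (se->avg.last_update_time && (flags & DO_DETACH))
        detach_entity_load_avg(cfs_rq, se);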

Thanks!

>
> `wake_up_new_task() -> post_init_entity_util_avg() ->
> attach_entity_cfs_rq()` has to happen before the fair task can move
> between taskgroups in `cgroup_migrate_execute() -> cpu_cgroup_attach()
> -> sched_move_task() -> sched_change_group() -> task_change_group_fair()`.
>
> [...]

2022-07-20 14:06:24

by Chengming Zhou

Subject: Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

On 2022/7/19 18:29, Vincent Guittot wrote:
> On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
> <[email protected]> wrote:
>>
>> On 2022/7/15 19:18, Dietmar Eggemann wrote:
>>> On 13/07/2022 06:04, Chengming Zhou wrote:
>>>> Since update_load_avg() now supports DO_ATTACH and DO_DETACH, we can
>>>> use it to implement the attach/detach of entity load_avg.
>>>>
>>>> Another advantage of using update_load_avg() is that it checks
>>>> last_update_time before attaching or detaching, instead of the
>>>> unconditional attach/detach in the current code.
>>>>
>>>> This lets us avoid some problematic corner cases in load tracking, like
>>>> the twice-attach problem and the detach-unattached-NEW-task problem.
>>>
>>> This explanation is somewhat hard to follow for me. Since both issues
>>> have already been fixed (you mention this further below), are you saying
>>> that your change doesn't reintroduce them?
>>
>> Sorry for the unclear explanation.
>>
>> Yes, both issues have already been fixed; what I want to say is that the
>> bugfix brings its own problems and limitations, mentioned below.
>
> As Dietmar said, the commit message is misleading because someone might
> think you are fixing these bugs, whereas that's not the case.

Hi Vincent, thanks for your review! I will rework the commit message to
avoid this confusion; sorry for my unclear English.

>
>>
>> So I want to solve these problems in a better way.
>>
>>>
>>>> 1. switch to fair class (twice attach problem)
>>>>
>>>> p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
>>>> if (queued)
>>>>     enqueue_task(p);
>>>>         ...
>>>>         enqueue_entity()
>>>>             update_load_avg(UPDATE_TG | DO_ATTACH)
>>>>                 if (!se->avg.last_update_time && (flags & DO_ATTACH)) --> true
>>>>                     attach_entity_load_avg() --> attached, will set last_update_time
>>>> check_class_changed()
>>>>     switched_from() (!fair)
>>>>     switched_to() (fair)
>>>>         switched_to_fair()
>>>>             attach_entity_load_avg() --> unconditional attach again!
>>>>
>>>> 2. change cgroup of NEW task (detach unattached task problem)
>>>>
>>>> sched_move_group(p)
>>>>     if (queued)
>>>>         dequeue_task()
>>>>     task_move_group_fair()
>>>>         detach_task_cfs_rq()
>>>>             detach_entity_load_avg() --> detach unattached NEW task
>>>>         set_task_rq()
>>>>         attach_task_cfs_rq()
>>>>             attach_entity_load_avg()
>>>>     if (queued)
>>>>         enqueue_task()
>>>>
>>>> These problems were fixed in commit 7dc603c9028e
>>>> ("sched/fair: Fix PELT integrity for new tasks"), but that fix
>>>> brings its own problems.
>>>>
>>>> First, it adds a new task state, TASK_NEW, and an unnecessary
>>>> limitation: changing the cgroup of a TASK_NEW task fails.
>>
>> This is the limitation that the bugfix brought.
>>
>> We can't change the cgroup of a task with last_update_time == 0, or
>> switch it to fair, if we don't have conditional detach/attach.
>>
>> So we have to:
>>
>> 1. also set last_update_time for !fair tasks;
>> 2. make cpu_cgroup_can_attach() wait for TASK_NEW tasks to be fully
>> attached.
>>
>>>>
>>>> Second, it attaches the entity load_avg in post_init_entity_util_avg(),
>>>> where we only set the sched_avg last_update_time for !fair tasks, which
>>>> causes a PELT integrity problem when switched_to_fair() runs.
>>>
>>> I guess those PELT integrity problems are less severe since we have the
>>> enqueue_task_fair() before the switched_to_fair() for enqueued tasks. So
>>> we always decay the time the task spends outside fair.
>>
>> enqueue_task_fair()
>>     enqueue_entity()
>>         update_load_avg()
>>             if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) --> true
>>                 __update_load_avg_se(now, cfs_rq, se); --> (1)
>>
>> As we can see above, for a queued !fair task, (1) will decay the delta
>> time (now - se.avg.last_update_time), even for a NEW !fair task.
>>
>>>
>>> Looks to me that you want to replace this by your `freeze PELT when
>>> outside fair` model.
>>
>> Yes, I want to freeze PELT for two !fair cases:
>>
>> 1. A !fair task that has never been fair: it will still have its init
>> load_avg when it switches to fair.
>
> But I'm not sure it makes sense to keep these init values. As an
> example, the util_avg is set according to the cpu utilization at the
> time of the task's creation. I would tend to decay them, as these init
> values become less and less relevant.
>
> So we should return early in post_init_entity_util_avg() and not set
> util_avg if the sched class is not cfs.

Yes, this indeed is a problem if we attach the init sched_avg of a
!fair task. I'm also not sure whether it makes sense to keep them at 0.
Would it cause an unfairness problem between cfs_rqs?

>
>>
>> 2. A !fair task that went through switched_from_fair(): it will still
>> keep its latest load_avg when it switches back to fair.
>>
>>>
>>>> This patch makes update_load_avg() the only place we attach/detach
>>>> entity load_avg, and handles corner cases like changing the cgroup of a
>>>> NEW task by checking last_update_time before attaching/detaching.
>>>
>>> [...]
>>>
>>>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
>>>> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>>
>>>> /* Catch up with the cfs_rq and remove our load when we leave */
>>>> - update_load_avg(cfs_rq, se, 0);
>>>> - detach_entity_load_avg(cfs_rq, se);
>>>> - update_tg_load_avg(cfs_rq);
>>>> + update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
>>>
>>> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
>>> updated in this case.
>>
>> Correct, will do.
>>
>> Thanks.
>>
>>>
>>> [...]

2022-07-20 15:52:48

by Vincent Guittot

Subject: Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

On Wed, 20 Jul 2022 at 15:41, Chengming Zhou
<[email protected]> wrote:
>
> On 2022/7/19 18:29, Vincent Guittot wrote:
> > On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
> > <[email protected]> wrote:
> >>

...

> >>
> >>>
> >>> Looks to me that you want to replace this by your `freeze PELT when
> >>> outside fair` model.
> >>
> >> Yes, I want to freeze PELT for two !fair cases:
> >>
> >> 1. A !fair task that has never been fair: it will still have its init
> >> load_avg when it switches to fair.
> >
> > But I'm not sure it makes sense to keep these init values. As an
> > example, the util_avg is set according to the cpu utilization at the
> > time of the task's creation. I would tend to decay them, as these init
> > values become less and less relevant.
> >
> > So we should return early in post_init_entity_util_avg() and not set
> > util_avg if the sched class is not cfs.
>
> Yes, this indeed is a problem if we attach the init sched_avg of a
> !fair task. I'm also not sure whether it makes sense to keep them at 0.
> Would it cause an unfairness problem between cfs_rqs?

Why should it cause an unfairness problem? !fair tasks are not
accounted, and their PELT values will have decayed down to 0 after 320ms
anyway (with the current implementation). So it's just as if you had
started directly after those 320ms.
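
(For reference, the arithmetic behind the 320ms: PELT's half-life is
32ms, so 320ms is ten half-lives, and the remaining signal is (1/2)^10 =
1/1024 of the initial value, which rounds down to 0 at PELT's
resolution.)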

>
> >
> >>
> >> 2. A !fair task that went through switched_from_fair(): it will still
> >> keep its latest load_avg when it switches back to fair.
> >>
> >>>
> >>>> This patch makes update_load_avg() the only place we attach/detach
> >>>> entity load_avg, and handles corner cases like changing the cgroup of a
> >>>> NEW task by checking last_update_time before attaching/detaching.
> >>>
> >>> [...]
> >>>
> >>>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
> >>>> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >>>>
> >>>> /* Catch up with the cfs_rq and remove our load when we leave */
> >>>> - update_load_avg(cfs_rq, se, 0);
> >>>> - detach_entity_load_avg(cfs_rq, se);
> >>>> - update_tg_load_avg(cfs_rq);
> >>>> + update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
> >>>
> >>> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
> >>> updated in this case.
> >>
> >> Correct, will do.
> >>
> >> Thanks.
> >>
> >>>
> >>> [...]

2022-07-21 15:05:52

by Chengming Zhou

Subject: Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg

On 2022/7/20 23:34, Vincent Guittot wrote:
> On Wed, 20 Jul 2022 at 15:41, Chengming Zhou
> <[email protected]> wrote:
>>
>> On 2022/7/19 18:29, Vincent Guittot wrote:
>>> On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
>>> <[email protected]> wrote:
>>>>
>
> ...
>
>>>>
>>>>>
>>>>> Looks to me that you want to replace this by your `freeze PELT when
>>>>> outside fair` model.
>>>>
>>>> Yes, I want to freeze PELT for two !fair cases:
>>>>
>>>> 1. A !fair task that has never been fair: it will still have its init
>>>> load_avg when it switches to fair.
>>>
>>> But I'm not sure it makes sense to keep these init values. As an
>>> example, the util_avg is set according to the cpu utilization at the
>>> time of the task's creation. I would tend to decay them, as these init
>>> values become less and less relevant.
>>>
>>> So we should return early in post_init_entity_util_avg() and not set
>>> util_avg if the sched class is not cfs.
>>
>> Yes, this indeed is a problem if we attach the init sched_avg of a
>> !fair task. I'm also not sure whether it makes sense to keep them at 0.
>> Would it cause an unfairness problem between cfs_rqs?
>
> Why should it cause an unfairness problem? !fair tasks are not
> accounted, and their PELT values will have decayed down to 0 after 320ms
> anyway (with the current implementation). So it's just as if you had
> started directly after those 320ms.

Thanks for your patient explanation. IMHO, I am wondering: if we have an
init sched_avg for a new fair task (A), but 0 for a new task switched
from !fair (B), then what's the point of the init sched_avg for the fair
task?

Task B will need some time to reach its stable load value, and during
that time its cfs_rq may not get enough shares? Imagine the scenario
below: if we have fair task A and task B switched from !fair at the same
time, could that cause unfairness between cfs0 and cfs1?

CPU0        tg        CPU1
  |        /  \         |
  |       /    \        |
  `--- cfs0    cfs1 ---'
       (A)      (B)

If runnable_avg and util_avg are 0 when switching from !fair, do we then
need more time for load balancing or CPU frequency adjustment? I think
that's the reason why we have an init sched_avg for new fair tasks.
Should we care about this, or will it be no problem?

I'm not sure, I must have missed something :-)

Thanks!

>
>>
>>>
>>>>
>>>> 2. A !fair task that went through switched_from_fair(): it will still
>>>> keep its latest load_avg when it switches back to fair.
>>>>
>>>>>
>>>>>> This patch makes update_load_avg() the only place we attach/detach
>>>>>> entity load_avg, and handles corner cases like changing the cgroup of a
>>>>>> NEW task by checking last_update_time before attaching/detaching.
>>>>>
>>>>> [...]
>>>>>
>>>>>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
>>>>>> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>>>>
>>>>>> /* Catch up with the cfs_rq and remove our load when we leave */
>>>>>> - update_load_avg(cfs_rq, se, 0);
>>>>>> - detach_entity_load_avg(cfs_rq, se);
>>>>>> - update_tg_load_avg(cfs_rq);
>>>>>> + update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
>>>>>
>>>>> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
>>>>> updated in this case.
>>>>
>>>> Correct, will do.
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> [...]