2024-02-01 13:12:24

by Hongyan Xia

Subject: [RFC PATCH v2 0/7] uclamp sum aggregation

The current implementation of uclamp leaves many things to be desired.
There are three noteworthy problems:

1. Max aggregation only takes into account the max uclamp value. All
other uclamp values are not in effect.
2. Uclamp max aggregation gives UCLAMP_MIN and UCLAMP_MAX at the rq
level, and whether that translates to the desired performance of a
specific task is unknown, because it depends on the other tasks on the
rq.
3. Complexity. Uclamp max aggregation now sits at more than 750 lines of
code and there is ongoing work to rework several interfaces to
prepare for further uclamp changes. Uclamp max aggregation itself
also needs future improvements in uclamp filtering and load balancing.

The first 2 points can manifest as the following symptoms:

1. High-rate OPP changes (the "frequency spike" problem). An
always-running task with a UCLAMP_MAX of 200 will drive the CPU at 200
even though its utilization is 1024. However, when a task with a
util_avg of 300 but the default UCLAMP_MAX of 1024 joins the rq, the
rq UCLAMP_MAX is uncapped and the UCLAMP_MAX of the first task is no
longer in effect, driving the CPU at 1024, the highest OPP. When the
second task sleeps, the OPP is reduced back to 200. This fast and
sudden OPP switch every time the 2nd task wakes up or sleeps is
unnecessary.
2. Using UCLAMP_MIN to boost performance under max aggregation has been
shown to be less effective than "sum aggregated" approaches, including
the util_guest proposal [1] and the uclamp sum aggregation in this
series. The performance level a task gets from UCLAMP_MIN under max
aggregation is unpredictable when there is more than one runnable task
on the rq.

This series solves these problems by tracking a util_avg_uclamp signal
in tasks and in the root cfs_rq. At the task level,
p->se.avg.util_avg_uclamp basically tracks the normal util_avg, but
clamped within the task's uclamp min and max. At the cfs_rq level,
util_avg_uclamp must always be the sum of the util_avg_uclamp of all
entities on this cfs_rq. As a result, rq->cfs.avg.util_avg_uclamp is the
sum aggregation of all the clamped values, which hints at the frequency
this rq should run at and what its utilization is. This proposal has
some similarities to the util_guest series for VM workloads, in that it
brings the desired performance to the task that requested it, not to the
rq, where the share of each task is unpredictable.

Note: This new signal does not change the existing PELT signal. The new
signal is only an extra hint to the scheduler to improve scheduling
decisions.
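To make the invariant concrete, here is a minimal, illustrative sketch
(not the actual implementation: helper names like task_util_uclamp() and
rq_util_uclamp_sum() are made up for this example, and the series
maintains the sum incrementally at enqueue/dequeue/update time rather
than by iterating over tasks):

/* Task level: the normal PELT util_avg, clamped by the task's hints. */
static inline unsigned long task_util_uclamp(struct task_struct *p)
{
        unsigned long util = READ_ONCE(p->se.avg.util_avg);

        return clamp(util, uclamp_eff_value(p, UCLAMP_MIN),
                           uclamp_eff_value(p, UCLAMP_MAX));
}

/*
 * rq level: the sum of the clamped contributions of all runnable CFS
 * tasks, which is the value rq->cfs.avg.util_avg_uclamp is kept equal to.
 */
static unsigned long rq_util_uclamp_sum(struct rq *rq)
{
        struct task_struct *p;
        unsigned long sum = 0;

        list_for_each_entry(p, &rq->cfs_tasks, se.group_node)
                sum += task_util_uclamp(p);

        return sum;
}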

TL;DR OF THIS SERIES:

- Our evaluation shows that sum aggregation is significantly more
effective than max aggregation. UI benchmarks and VM workloads show
improved latency and higher scores at the same or even reduced power
consumption.
- For other benchmarks that do not involve uclamp, we have not observed
any noticeable regressions.
- This series is the entirety of sum aggregation. No groundwork or
refactoring in the scheduler is needed. The complexity of several
existing uclamp-related functions is massively reduced and sum
aggregation code is less than half of that in max aggregation (304+,
701-). The complexity gap will be even greater with all the ongoing
patches for max aggregation.

DECOMPOSITION OF SUM AGGREGATION:

- Patch 1 reverts a max aggregation mitigation patch. Sum aggregation
does not show the problem it mitigates, so the patch is unnecessary,
and it also has other undesirable side effects.
- Patches 2 and 3 introduce new sum-aggregated signals that serve as a
more effective hint to the scheduler. Patch 3 employs a math trick to
make it significantly simpler to track on-rq and off-rq task
utilization contributions.
- Patches 4, 5 and 6 start using the new signal while significantly
simplifying existing uclamp code, including the total removal of
uclamp buckets and max aggregation.
- Patch 7 and part of Patch 6 remove the overhead of uclamp on task
enqueue and dequeue, because uclamp values and buckets no longer need
to be re-computed every time a uclamp'ed task joins or leaves the rq.

TESTING:

Sum aggregation generally performs better in tests. Two notebooks, max
vs. sum aggregation, are shared at

https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/max.ipynb
https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/sum.ipynb

The experiments in the notebooks were done on an Arm Juno r2 board.
CPU0-3 are little cores with a capacity of 383. CPU4-5 are big cores.
The rt-app profiles used for these experiments are included in the
notebooks.

Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.

The scheduling decisions are plotted in Out[11]. Both max and sum
aggregation recognize the UCLAMP_MAX hints and run all the threads
on the little Performance Domain (PD). However, max aggregation in this
test shows some randomness in task placement, and we see that the 4
threads are often not evenly distributed across the 4 little CPUs. This
uneven task placement is also the reason why we revert the patch:

"sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

When the spare capacity is 0, the reverted patch tends to gather many
tasks on the same CPU, because its Energy Model (EM) calculation is
bogus and it thinks this placement is more energy efficient.

Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
slightly above the capacity of the little CPU.

Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
on the big cores but not run them at the highest OPP. Both max and sum
aggregation accomplish this successfully, running the threads on the
big cluster while not driving the frequency to the max.

Scenario 3: Task A is a task with small utilization, pinned to CPU4.
Task B is an always-running task pinned to CPU5, but with UCLAMP_MAX
capped at 300. After a while, task A is also pinned to CPU5, joining B.

Results are in Out[23]. The util_avg curve is the original root CFS
util_avg. The root_cfs_util_uclamp is the root CFS utilization after
considering uclamp. Max aggregation sees a frequency spike at 751.7s.
When zoomed in, one can see square-wave-like utilization and CPU
frequency values caused by task A periodically going to sleep. When A
wakes up, its default UCLAMP_MAX of 1024 uncaps the rq and the CPU runs
at the highest frequency. When A sleeps, B's UCLAMP_MAX takes effect
and reduces the rq utilization to 300. This happens repeatedly, hence
the square wave. In contrast, sum aggregation sees a normal increase in
utilization when A joins B, at around 430.64s, without any square-wave
behavior. The CPU frequency also stays stable while the two tasks are on
the same rq.

Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
little PD (CPU0-3). 4 identical tasks pinned to the big PD (CPU4-5).
After a while, remove the CPU pinning of the 4 tasks on the big PD.

Results are in Out[29]. Both max and sum aggregation understand that we
can move the 4 tasks from the big PD to the little PD to reduce power
because of the UCLAMP_MAX hints. However, max aggregation shows severely
unbalanced task placement, scheduling 5 tasks on CPU0 while 1 each on
CPU1-3. Sum aggregation schedules 2 tasks on each little CPU, honoring
UCLAMP_MAX while maintaining balanced task placement.

Again, this unbalanced task placement is the reason why we reverted:

"sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

Scenario 5: 8 tasks with UCLAMP_MAX of 120.

This test is similar to Scenario 4, shown in Out[35]. Both max and sum
aggregation understand the UCLAMP_MAX hints and schedule all tasks on
the 4 little CPUs. Max aggregation again shows unstable and unbalanced
task placement while sum aggregation schedules 2 tasks on each little
CPU, and the task placement remains stable. The total task residency is
shown in Out[36], which highlights how unbalanced max aggregation is.

BENCHMARKS:

Geekbench 6, no uclamp (on Rock-5B board)
+-----+-------------+------------+
| | Single-core | Multi-core |
+-----+-------------+------------+
| Max | 801.3 | 2976.8 |
| Sum | 802.8 | 2989.2 |
+-----+-------------+------------+

No regression is seen after switching to sum aggregation.

Jankbench (backporting sum aggregation to Pixel 6 Android 5.18 mainline kernel):

Jank percentage:
+------+-------+-----------+
| | value | perc_diff |
+------+-------+-----------+
| main | 1.1 | 0.00% |
| sum | 0.5 | -53.92% |
+------+-------+-----------+

Average power:
+------------+------+-------+-----------+
| | tag | value | perc_diff |
+------------+------+-------+-----------+
| CPU | max | 166.1 | 0.00% |
| CPU-Big | max | 55.1 | 0.00% |
| CPU-Little | max | 91.7 | 0.00% |
| CPU-Mid | max | 19.2 | 0.00% |
| CPU | sum | 161.3 | -2.85% |
| CPU-Big | sum | 52.9 | -3.97% |
| CPU-Little | sum | 86.7 | -5.39% |
| CPU-Mid | sum | 21.6 | 12.63% |
+------------+------+-------+-----------+

UIBench (backporting sum aggregation to Pixel 6 Android 6.6 mainline kernel):

Jank percentage:
+-------------+-------+-----------+
| tag | value | perc_diff |
+-------------+-------+-----------+
| max_aggr | 0.3 | 0.0 |
| sum_aggr | 0.26 | -12.5 |
+-------------+-------+-----------+

Average input latency:
+-------------+--------+-----------+
| tag | value | perc_diff |
+-------------+--------+-----------+
| max_aggr | 107.39 | 0.0 |
| sum_aggr | 81.135 | -24.5 |
+-------------+--------+-----------+

Average power:
+------------+--------------+--------+-----------+
| channel | tag | value | perc_diff |
+------------+--------------+--------+-----------+
| CPU | max_aggr | 209.85 | 0.0% |
| CPU-Big | max_aggr | 89.8 | 0.0% |
| CPU-Little | max_aggr | 94.45 | 0.0% |
| CPU-Mid | max_aggr | 25.45 | 0.0% |
| GPU | max_aggr | 22.9 | 0.0% |
| Total | max_aggr | 232.75 | 0.0% |
| CPU | sum_aggr | 206.05 | -1.81% |
| CPU-Big | sum_aggr | 84.7 | -5.68% |
| CPU-Little | sum_aggr | 94.9 | 0.48% |
| CPU-Mid | sum_aggr | 26.3 | 3.34% |
| GPU | sum_aggr | 22.45 | -1.97% |
| Total | sum_aggr | 228.5 | -1.83% |
+------------+--------------+--------+-----------+

It should be noted that sum aggregation reduces both jank and input
latency while consuming less power.

VM cpufreq hypercall driver [1], on Rock-5B board. Baseline indicates a
setup without the VM cpufreq driver:

Geekbench 6 uncontended. No other host threads running.
+------+-------------+-----------+------------+-----------+
| | Single-core | perc_diff | Multi-core | perc_diff |
+------+-------------+-----------+------------+-----------+
| base | 796.4 | 0 | 2947.0 | 0 |
| max | 795.6 | -0.10 | 2929.6 | -0.59 |
| sum | 794.6 | -0.23 | 2935.6 | -0.39 |
+------+-------------+-----------+------------+-----------+

Geekbench 6 contended. Each host CPU has a 50% duty-cycle task running.
+------+-------------+-----------+------------+-----------+
| | Single-core | perc_diff | Multi-core | perc_diff |
+------+-------------+-----------+------------+-----------+
| base | 604.6 | 0 | 2330.2 | 0 |
| max | 599.4 | -0.86 | 2327.2 | -0.13 |
| sum | 645.8 | 6.81 | 2336.6 | 0.27 |
+------+-------------+-----------+------------+-----------+

The VM cpufreq driver using sum aggregation outperforms max aggregation
when the host is contended. When the host has no contention (only the VM
vCPUs are running and each host CPU accommodates one guest vCPU), the
two aggregation methods are roughly the same and, a bit surprisingly,
offer no speed-up versus the baseline. This is probably because of the
overhead of hypercalls, and because Geekbench is CPU-intensive and not
the best workload to show the effectiveness of the VM cpufreq driver.
We will try to benchmark more VM workloads.

LIMITATIONS:

1. RT sum aggregation is not shown in the series.

A proof-of-concept RT sum aggregation implementation is done and going
through testing, with < 50 lines of code, using the same ideas as in
CFS. It will be sent out separately if we can agree on CFS sum
aggregation and once the testing is done.

2. A heavily UCLAMP_MAX-throttled task may prevent the CFS rq from
triggering over-utilization.

For example, consider two always-running tasks, each with a utilization
of 512. If one of the tasks is severely UCLAMP_MAX restricted, say with
a UCLAMP_MAX of 1, then the total CFS sum aggregation will be 512 + 1 =
513, which won't trigger over-utilization even though the other task has
no UCLAMP_MAX and wants more performance.
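As a rough, standalone illustration of the arithmetic (the ~20% headroom
constants below mirror mainline's fits_capacity() check at the time of
writing and are an assumption of this example, not part of the series):

#include <stdbool.h>
#include <stdio.h>

/* Assumed ~20% headroom check, mirroring mainline's fits_capacity(). */
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

int main(void)
{
        unsigned long rq_util = 512 + 1; /* unclamped task + UCLAMP_MAX=1 task */
        unsigned long cpu_cap = 1024;

        /* 513 * 1280 = 656640 < 1024 * 1024 = 1048576, so the CPU still "fits". */
        bool overutilized = !fits_capacity(rq_util, cpu_cap);

        printf("overutilized: %d\n", overutilized); /* prints 0 */
        return 0;
}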

I'm working on a fix for this problem. At the moment it can be solved
either by not giving long-running tasks ridiculously low UCLAMP_MAX
values, or by adjusting the priority of UCLAMP_MAX tasks to make sure
they do not get a share of CPU run-time that vastly exceeds their
UCLAMP_MAX. However, my personal view is that maybe UCLAMP_MIN and
UCLAMP_MAX just don't belong together, and the proper way is to have a
per-task bandwidth throttling mechanism; what we want from UCLAMP_MAX
may actually belong to that mechanism.

However, since the Android GROUP_THROTTLE feature [2] has the exact same
problem and has been used in actual products, we don't think this is a
big limitation in practice.

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510

---
Changed in v2:
- Rework util_avg_uclamp to be closer to the style of util_est.
- Rewrite patch notes to reflect the new style.
- Add the discussion of the under-utilized example in limitations,
found by Vincent G.
- Remove task group uclamp to focus on tasks first.
- Fix several bugs in task migration.
- Add benchmark numbers from UIBench and VM cpufreq.
- Update python notebooks to reflect the latest max vs. sum aggregation.

Hongyan Xia (7):
Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
0"
sched/uclamp: Track uclamped util_avg in sched_avg
sched/uclamp: Introduce root_cfs_util_uclamp for rq
sched/fair: Use CFS util_avg_uclamp for utilization and frequency
sched/fair: Massively simplify util_fits_cpu()
sched/uclamp: Remove all uclamp bucket logic
sched/uclamp: Simplify uclamp_eff_value()

include/linux/sched.h | 7 +-
init/Kconfig | 32 ---
kernel/sched/core.c | 324 +++---------------------------
kernel/sched/cpufreq_schedutil.c | 10 +-
kernel/sched/fair.c | 333 +++++++------------------------
kernel/sched/pelt.c | 144 ++++++++++++-
kernel/sched/pelt.h | 6 +-
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 145 ++++++--------
9 files changed, 304 insertions(+), 701 deletions(-)

--
2.34.1



2024-02-01 13:12:31

by Hongyan Xia

Subject: [RFC PATCH v2 1/7] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

From: Hongyan Xia <[email protected]>

That commit creates further problems because 0 spare capacity can be
either a real indication that the CPU is maxed out, or a sign that the
CPU is UCLAMP_MAX throttled, but we end up giving all of them a chance,
which can result in bogus energy calculations. It also tends to schedule
tasks on the same CPU and requires load balancing patches. Sum
aggregation solves these problems and this patch is not needed.

This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/fair.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a03..d5cc87db4845 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7978,10 +7978,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
for (; pd; pd = pd->next) {
unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_thermal_cap, util;
- long prev_spare_cap = -1, max_spare_cap = -1;
+ unsigned long cur_delta, max_spare_cap = 0;
unsigned long rq_util_min, rq_util_max;
- unsigned long cur_delta, base_energy;
+ unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
+ unsigned long base_energy;
int fits, max_fits = -1;

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -8044,7 +8045,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
prev_spare_cap = cpu_cap;
prev_fits = fits;
} else if ((fits > max_fits) ||
- ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
+ ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
@@ -8056,7 +8057,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}
}

- if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
+ if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
continue;

eenv_pd_busy_time(&eenv, cpus, p);
@@ -8064,7 +8065,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
base_energy = compute_energy(&eenv, pd, cpus, p, -1);

/* Evaluate the energy impact of using prev_cpu. */
- if (prev_spare_cap > -1) {
+ if (prev_spare_cap > 0) {
prev_delta = compute_energy(&eenv, pd, cpus, p,
prev_cpu);
/* CPU utilization has changed */
--
2.34.1


2024-02-01 13:13:31

by Hongyan Xia

Subject: [RFC PATCH v2 5/7] sched/fair: Massively simplify util_fits_cpu()

Currently, there's no way to distinguish between 1) a CPU that is
actually maxed out at its highest frequency and 2) one that is throttled
because of UCLAMP_MAX, since both present util_avg values of 1024. This
is problematic because when we try to pick a CPU for a task to run on,
we would like to give 2) a chance, or at least prefer 2) over 1).

Current upstream now gives all 0 spare capacity CPUs a chance to
consider queuing more tasks because there's a chance that 0 spare
capacity is due to UCLAMP_MAX. However, this creates further problems
because energy calculations are now bogus when spare capacity is already
0, and tasks tend to pile up on one CPU.

Fix this by using util_avg_uclamp in util_fits_cpu(). This way, case 1)
will still keep its utilization at 1024, whereas case 2) shows spare
capacity if the sum of util_avg_uclamp values is still under the CPU
capacity.

Under sum aggregation, checking whether a task fits a CPU becomes much
simpler. We simply do fits_capacity() and no longer need all kinds of
code checking corner cases for uclamp. This means util_fits_cpu()
returns to true and false instead of a tri-state value, simplifying a
huge amount of code.

[1]: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/fair.c | 253 ++++----------------------------------------
1 file changed, 23 insertions(+), 230 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b92739e1c52f..49997f1f58fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4974,135 +4974,19 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
trace_sched_util_est_se_tp(&p->se);
}

-static inline int util_fits_cpu(unsigned long util,
- unsigned long uclamp_min,
- unsigned long uclamp_max,
- int cpu)
+/* util must be the uclamp'ed value (i.e. from util_avg_uclamp). */
+static inline int util_fits_cpu(unsigned long util, int cpu)
{
- unsigned long capacity_orig, capacity_orig_thermal;
unsigned long capacity = capacity_of(cpu);
- bool fits, uclamp_max_fits;

- /*
- * Check if the real util fits without any uclamp boost/cap applied.
- */
- fits = fits_capacity(util, capacity);
-
- if (!uclamp_is_used())
- return fits;
-
- /*
- * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
- * uclamp_max. We only care about capacity pressure (by using
- * capacity_of()) for comparing against the real util.
- *
- * If a task is boosted to 1024 for example, we don't want a tiny
- * pressure to skew the check whether it fits a CPU or not.
- *
- * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
- * should fit a little cpu even if there's some pressure.
- *
- * Only exception is for thermal pressure since it has a direct impact
- * on available OPP of the system.
- *
- * We honour it for uclamp_min only as a drop in performance level
- * could result in not getting the requested minimum performance level.
- *
- * For uclamp_max, we can tolerate a drop in performance level as the
- * goal is to cap the task. So it's okay if it's getting less.
- */
- capacity_orig = arch_scale_cpu_capacity(cpu);
- capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
-
- /*
- * We want to force a task to fit a cpu as implied by uclamp_max.
- * But we do have some corner cases to cater for..
- *
- *
- * C=z
- * | ___
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | |
- * | | | | | | | (util somewhere in this region)
- * | | | | | | |
- * | | | | | | |
- * +----------------------------------------
- * cpu0 cpu1 cpu2
- *
- * In the above example if a task is capped to a specific performance
- * point, y, then when:
- *
- * * util = 80% of x then it does not fit on cpu0 and should migrate
- * to cpu1
- * * util = 80% of y then it is forced to fit on cpu1 to honour
- * uclamp_max request.
- *
- * which is what we're enforcing here. A task always fits if
- * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
- * the normal upmigration rules should withhold still.
- *
- * Only exception is when we are on max capacity, then we need to be
- * careful not to block overutilized state. This is so because:
- *
- * 1. There's no concept of capping at max_capacity! We can't go
- * beyond this performance level anyway.
- * 2. The system is being saturated when we're operating near
- * max capacity, it doesn't make sense to block overutilized.
- */
- uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
- uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
- fits = fits || uclamp_max_fits;
-
- /*
- *
- * C=z
- * | ___ (region a, capped, util >= uclamp_max)
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
- * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
- * | | | | | | |
- * | | | | | | | (region c, boosted, util < uclamp_min)
- * +----------------------------------------
- * cpu0 cpu1 cpu2
- *
- * a) If util > uclamp_max, then we're capped, we don't care about
- * actual fitness value here. We only care if uclamp_max fits
- * capacity without taking margin/pressure into account.
- * See comment above.
- *
- * b) If uclamp_min <= util <= uclamp_max, then the normal
- * fits_capacity() rules apply. Except we need to ensure that we
- * enforce we remain within uclamp_max, see comment above.
- *
- * c) If util < uclamp_min, then we are boosted. Same as (b) but we
- * need to take into account the boosted value fits the CPU without
- * taking margin/pressure into account.
- *
- * Cases (a) and (b) are handled in the 'fits' variable already. We
- * just need to consider an extra check for case (c) after ensuring we
- * handle the case uclamp_min > uclamp_max.
- */
- uclamp_min = min(uclamp_min, uclamp_max);
- if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
- return -1;
-
- return fits;
+ return fits_capacity(util, capacity);
}

static inline int task_fits_cpu(struct task_struct *p, int cpu)
{
- unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
- unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
unsigned long util = task_util_est(p);
- /*
- * Return true only if the cpu fully fits the task requirements, which
- * include the utilization but also the performance hints.
- */
- return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
+
+ return util_fits_cpu(util, cpu);
}

static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -6678,11 +6562,8 @@ static inline void hrtick_update(struct rq *rq)
#ifdef CONFIG_SMP
static inline bool cpu_overutilized(int cpu)
{
- unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
- unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
-
/* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(cpu_util_cfs(cpu), cpu);
}

static inline void update_overutilized_status(struct rq *rq)
@@ -7463,8 +7344,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
- unsigned long task_util, util_min, util_max, best_cap = 0;
- int fits, best_fits = 0;
+ unsigned long task_util, best_cap = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;

@@ -7472,8 +7352,6 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);

for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
@@ -7481,44 +7359,22 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;

- fits = util_fits_cpu(task_util, util_min, util_max, cpu);
-
- /* This CPU fits with all requirements */
- if (fits > 0)
+ if (util_fits_cpu(task_util, cpu))
return cpu;
- /*
- * Only the min performance hint (i.e. uclamp_min) doesn't fit.
- * Look for the CPU with best capacity.
- */
- else if (fits < 0)
- cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));

- /*
- * First, select CPU which fits better (-1 being better than 0).
- * Then, select the one with best capacity at same level.
- */
- if ((fits < best_fits) ||
- ((fits == best_fits) && (cpu_cap > best_cap))) {
+ if (cpu_cap > best_cap) {
best_cap = cpu_cap;
best_cpu = cpu;
- best_fits = fits;
}
}

return best_cpu;
}

-static inline bool asym_fits_cpu(unsigned long util,
- unsigned long util_min,
- unsigned long util_max,
- int cpu)
+static inline bool asym_fits_cpu(unsigned long util, int cpu)
{
if (sched_asym_cpucap_active())
- /*
- * Return true only if the cpu fully fits the task requirements
- * which include the utilization and the performance hints.
- */
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return util_fits_cpu(util, cpu);

return true;
}
@@ -7530,7 +7386,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
bool has_idle_core = false;
struct sched_domain *sd;
- unsigned long task_util, util_min, util_max;
+ unsigned long task_util;
int i, recent_used_cpu, prev_aff = -1;

/*
@@ -7540,8 +7396,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);
}

/*
@@ -7550,7 +7404,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
lockdep_assert_irqs_disabled();

if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
- asym_fits_cpu(task_util, util_min, util_max, target))
+ asym_fits_cpu(task_util, target))
return target;

/*
@@ -7558,7 +7412,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util, prev)) {

if (!static_branch_unlikely(&sched_cluster_active) ||
cpus_share_resources(prev, target))
@@ -7579,7 +7433,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
in_task() &&
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util, prev)) {
return prev;
}

@@ -7591,7 +7445,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
- asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
+ asym_fits_cpu(task_util, recent_used_cpu)) {

if (!static_branch_unlikely(&sched_cluster_active) ||
cpus_share_resources(recent_used_cpu, target))
@@ -7966,13 +7820,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
- unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
- int prev_fits = -1, best_fits = -1;
- unsigned long best_thermal_cap = 0;
- unsigned long prev_thermal_cap = 0;
struct sched_domain *sd;
struct perf_domain *pd;
struct energy_env eenv;
@@ -8001,14 +7850,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv_task_busy_time(&eenv, p, prev_cpu);

for (; pd; pd = pd->next) {
- unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_thermal_cap, util;
unsigned long cur_delta, max_spare_cap = 0;
- unsigned long rq_util_min, rq_util_max;
unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
unsigned long base_energy;
- int fits, max_fits = -1;

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

@@ -8024,8 +7870,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv.pd_cap = 0;

for_each_cpu(cpu, cpus) {
- struct rq *rq = cpu_rq(cpu);
-
eenv.pd_cap += cpu_thermal_cap;

if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
@@ -8036,31 +7880,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

util = cpu_util(cpu, p, cpu, 0);
cpu_cap = capacity_of(cpu);
-
- /*
- * Skip CPUs that cannot satisfy the capacity request.
- * IOW, placing the task there would make the CPU
- * overutilized. Take uclamp into account to see how
- * much capacity we can get out of the CPU; this is
- * aligned with sched_cpu_util().
- */
- if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
- /*
- * Open code uclamp_rq_util_with() except for
- * the clamp() part. Ie: apply max aggregation
- * only. util_fits_cpu() logic requires to
- * operate on non clamped util but must use the
- * max-aggregated uclamp_{min, max}.
- */
- rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
- rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
-
- util_min = max(rq_util_min, p_util_min);
- util_max = max(rq_util_max, p_util_max);
- }
-
- fits = util_fits_cpu(util, util_min, util_max, cpu);
- if (!fits)
+ if (!util_fits_cpu(util, cpu))
continue;

lsub_positive(&cpu_cap, util);
@@ -8068,9 +7888,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (cpu == prev_cpu) {
/* Always use prev_cpu as a candidate. */
prev_spare_cap = cpu_cap;
- prev_fits = fits;
- } else if ((fits > max_fits) ||
- ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
+ } else if (cpu_cap > max_spare_cap) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
@@ -8078,7 +7896,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
*/
max_spare_cap = cpu_cap;
max_spare_cap_cpu = cpu;
- max_fits = fits;
}
}

@@ -8097,50 +7914,26 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (prev_delta < base_energy)
goto unlock;
prev_delta -= base_energy;
- prev_thermal_cap = cpu_thermal_cap;
best_delta = min(best_delta, prev_delta);
}

/* Evaluate the energy impact of using max_spare_cap_cpu. */
if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
- /* Current best energy cpu fits better */
- if (max_fits < best_fits)
- continue;
-
- /*
- * Both don't fit performance hint (i.e. uclamp_min)
- * but best energy cpu has better capacity.
- */
- if ((max_fits < 0) &&
- (cpu_thermal_cap <= best_thermal_cap))
- continue;
-
cur_delta = compute_energy(&eenv, pd, cpus, p,
max_spare_cap_cpu);
/* CPU utilization has changed */
if (cur_delta < base_energy)
goto unlock;
cur_delta -= base_energy;
-
- /*
- * Both fit for the task but best energy cpu has lower
- * energy impact.
- */
- if ((max_fits > 0) && (best_fits > 0) &&
- (cur_delta >= best_delta))
- continue;
-
- best_delta = cur_delta;
- best_energy_cpu = max_spare_cap_cpu;
- best_fits = max_fits;
- best_thermal_cap = cpu_thermal_cap;
+ if (cur_delta < best_delta) {
+ best_delta = cur_delta;
+ best_energy_cpu = max_spare_cap_cpu;
+ }
}
}
rcu_read_unlock();

- if ((best_fits > prev_fits) ||
- ((best_fits > 0) && (best_delta < prev_delta)) ||
- ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
+ if (best_delta < prev_delta)
target = best_energy_cpu;

return target;
--
2.34.1


2024-02-01 13:13:46

by Hongyan Xia

Subject: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

The problem with rq->cfs.avg.util_avg_uclamp is that it only tracks the
sum of the contributions of CFS tasks that are on the rq. However, CFS
tasks that belong to this CPU but have just been dequeued from rq->cfs
still have decaying contributions to the rq utilization due to PELT.
Introduce root_cfs_util_uclamp to capture the total utilization of CFS
tasks both on and off this rq.

Theoretically, keeping track of the sum of all tasks on a CPU (either
on or off the rq) requires us to periodically sample the decaying PELT
utilization of all off-rq tasks and then sum them up, which introduces
substantial extra code and overhead. However, we can avoid that
overhead, as shown in this example:

Let's assume 3 tasks, A, B and C. A is still on rq->cfs but B and C have
just been dequeued. The cfs.avg.util_avg_uclamp has dropped from A + B +
C to just A but the instantaneous utilization only just starts to decay
and is now still A + B + C. Let's denote root_cfs_util_uclamp_old as the
instantaneous total utilization right before B and C are dequeued.

After p periods, with y being the decay factor, the new
root_cfs_util_uclamp becomes:

root_cfs_util_uclamp
= A + B * y^p + C * y^p
= A + (A + B + C - A) * y^p
= cfs.avg.util_avg_uclamp +
(root_cfs_util_uclamp_old - cfs.avg.util_avg_uclamp) * y^p
= cfs.avg.util_avg_uclamp + diff * y^p

So, whenever we want to calculate the new root_cfs_util_uclamp
(including both on- and off-rq CFS tasks of a CPU), we could just decay
the diff between root_cfs_util_uclamp and cfs.avg.util_avg_uclamp, and
add the decayed diff to cfs.avg.util_avg_uclamp to obtain the new
root_cfs_util_uclamp, without bothering to periodically sample off-rq
CFS tasks and sum them up. This significantly reduces the overhead
needed to maintain this signal, and makes sure we now also include the
decaying contributions of CFS tasks that are dequeued.
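The patch implements this in ___update_util_uclamp_towards(); the
following is only a simplified sketch of the same idea, with the helper
name and the surrounding locking assumed for illustration:

/*
 * Simplified sketch of the diff-decay idea. decay_load() is the usual
 * PELT decay helper and 'periods' is the number of elapsed PELT periods.
 */
static void root_cfs_util_uclamp_decay(struct rq *rq, u64 periods)
{
        unsigned int sum  = READ_ONCE(rq->cfs.avg.util_avg_uclamp);
        unsigned int root = READ_ONCE(rq->root_cfs_util_uclamp);
        unsigned int diff;

        /* The invariant is root >= sum; nothing left to decay otherwise. */
        if (root <= sum) {
                WRITE_ONCE(rq->root_cfs_util_uclamp, sum);
                return;
        }

        /* Decay only the off-rq remainder instead of every dequeued task. */
        diff = decay_load(root - sum, periods);
        WRITE_ONCE(rq->root_cfs_util_uclamp, sum + diff);
}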

NOTE: In no way do we change how PELT and util_avg work. The original
PELT signal is kept as-is and is used when needed. The new signals,
util_avg_uclamp and root_cfs_util_uclamp, are additional hints to the
scheduler and are not meant to replace the original PELT signals.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/fair.c | 7 +++
kernel/sched/pelt.c | 106 +++++++++++++++++++++++++++++++++++++++----
kernel/sched/pelt.h | 3 +-
kernel/sched/sched.h | 16 +++++++
4 files changed, 123 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f535c96463b..36357cfaf48d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6710,6 +6710,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ bool __maybe_unused migrated = p->se.avg.last_update_time == 0;

/*
* The code below (indirectly) updates schedutil which looks at
@@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_UCLAMP_TASK
util_uclamp_enqueue(&rq->cfs.avg, p);
update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
+ if (migrated)
+ rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
+ rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
+ rq->cfs.avg.util_avg_uclamp);
/* TODO: Better skip the frequency update in the for loop above. */
cpufreq_update_util(rq, 0);
#endif
@@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
migrate_se_pelt_lag(se);
}

+ remove_root_cfs_util_uclamp(p);
/* Tell new CPU we are migrated */
se->avg.last_update_time = 0;

@@ -8261,6 +8267,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
static void task_dead_fair(struct task_struct *p)
{
remove_entity_load_avg(&p->se);
+ remove_root_cfs_util_uclamp(p);
}

static int
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index eca45a863f9f..9ba208ac26db 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -267,14 +267,78 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
}

#ifdef CONFIG_UCLAMP_TASK
+static int ___update_util_uclamp_towards(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ unsigned int *old,
+ unsigned int new_val)
+{
+ unsigned int old_val = READ_ONCE(*old);
+ u64 delta, periods;
+
+ if (old_val <= new_val) {
+ WRITE_ONCE(*old, new_val);
+ return old_val < new_val;
+ }
+
+ if (!last_update_time)
+ return 0;
+ delta = now - last_update_time;
+ if ((s64)delta < 0)
+ return 0;
+ delta >>= 10;
+ if (!delta)
+ return 0;
+
+ delta += period_contrib;
+ periods = delta / 1024;
+ if (periods) {
+ u64 diff = old_val - new_val;
+
+ /*
+ * Let's assume 3 tasks, A, B and C. A is still on rq but B and
+ * C have just been dequeued. The cfs.avg.util_avg_uclamp has
+ * become A but root_cfs_util_uclamp just starts to decay and is
+ * now still A + B + C.
+ *
+ * After p periods with y being the decay factor, the new
+ * root_cfs_util_uclamp should become
+ *
+ * A + B * y^p + C * y^p == A + (A + B + C - A) * y^p
+ * == cfs.avg.util_avg_uclamp +
+ * (root_cfs_util_uclamp_at_the_start - cfs.avg.util_avg_uclamp) * y^p
+ * == cfs.avg.util_avg_uclamp + diff * y^p
+ *
+ * So, instead of summing up each individual decayed values, we
+ * could just decay the diff and not bother with the summation
+ * at all. This is why we decay the diff here.
+ */
+ diff = decay_load(diff, periods);
+ WRITE_ONCE(*old, new_val + diff);
+ return old_val != *old;
+ }
+
+ return 0;
+}
+
/* avg must belong to the queue this se is on. */
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+void update_util_uclamp(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ struct sched_avg *avg,
+ struct task_struct *p)
{
unsigned int util, uclamp_min, uclamp_max;
int delta;

- if (!p->se.on_rq)
+ if (!p->se.on_rq) {
+ ___update_util_uclamp_towards(now,
+ last_update_time,
+ period_contrib,
+ &p->se.avg.util_avg_uclamp,
+ 0);
return;
+ }

if (!avg)
return;
@@ -294,7 +358,11 @@ void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
WRITE_ONCE(avg->util_avg_uclamp, util);
}
#else /* !CONFIG_UCLAMP_TASK */
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+void update_util_uclamp(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ struct sched_avg *avg,
+ struct task_struct *p)
{
}
#endif
@@ -327,23 +395,32 @@ void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)

void __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
+ u64 last_update_time = se->avg.last_update_time;
+ u32 period_contrib = se->avg.period_contrib;
+
if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se));
if (entity_is_task(se))
- update_util_uclamp(NULL, task_of(se));
+ update_util_uclamp(now, last_update_time,
+ period_contrib, NULL, task_of(se));
trace_pelt_se_tp(se);
}
}

void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ u64 last_update_time = se->avg.last_update_time;
+ u32 period_contrib = se->avg.period_contrib;
+
if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
cfs_rq->curr == se)) {

___update_load_avg(&se->avg, se_weight(se));
cfs_se_util_change(&se->avg);
if (entity_is_task(se))
- update_util_uclamp(&rq_of(cfs_rq)->cfs.avg,
+ update_util_uclamp(now, last_update_time,
+ period_contrib,
+ &rq_of(cfs_rq)->cfs.avg,
task_of(se));
trace_pelt_se_tp(se);
}
@@ -351,17 +428,30 @@ void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *s

int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
+ u64 __maybe_unused last_update_time = cfs_rq->avg.last_update_time;
+ u32 __maybe_unused period_contrib = cfs_rq->avg.period_contrib;
+ int ret = 0;
+
if (___update_load_sum(now, &cfs_rq->avg,
scale_load_down(cfs_rq->load.weight),
cfs_rq->h_nr_running,
cfs_rq->curr != NULL)) {

___update_load_avg(&cfs_rq->avg, 1);
- trace_pelt_cfs_tp(cfs_rq);
- return 1;
+ ret = 1;
}

- return 0;
+#ifdef CONFIG_UCLAMP_TASK
+ if (&rq_of(cfs_rq)->cfs == cfs_rq)
+ ret = ___update_util_uclamp_towards(now,
+ last_update_time, period_contrib,
+ &rq_of(cfs_rq)->root_cfs_util_uclamp,
+ READ_ONCE(cfs_rq->avg.util_avg_uclamp));
+#endif
+ if (ret)
+ trace_pelt_cfs_tp(cfs_rq);
+
+ return ret;
}

/*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 6862f79e0fcd..a2852d5e862d 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -1,7 +1,8 @@
#ifdef CONFIG_SMP
#include "sched-pelt.h"

-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p);
+void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
+ struct sched_avg *avg, struct task_struct *p);
void __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35036246824b..ce80b87b549b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -998,6 +998,7 @@ struct rq {
/* Utilization clamp values based on CPU's RUNNABLE tasks */
struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
unsigned int uclamp_flags;
+ unsigned int root_cfs_util_uclamp;
#define UCLAMP_FLAG_IDLE 0x01
#endif

@@ -3112,6 +3113,17 @@ static inline void util_uclamp_dequeue(struct sched_avg *avg,

WRITE_ONCE(avg->util_avg_uclamp, new_val);
}
+
+static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
+ unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;
+
+ new_util = (root_util > p_util) ? root_util - p_util : 0;
+ new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
+ WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);
+}
#else /* CONFIG_UCLAMP_TASK */
static inline unsigned long uclamp_eff_value(struct task_struct *p,
enum uclamp_id clamp_id)
@@ -3147,6 +3159,10 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
{
return false;
}
+
+static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
+{
+}
#endif /* CONFIG_UCLAMP_TASK */

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.34.1


2024-02-01 13:13:58

by Hongyan Xia

Subject: [RFC PATCH v2 4/7] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

Switch to the new util_avg_uclamp for task and runqueue utilization.
Since task_util_est() calls task_util() which now uses util_avg_uclamp,
this means util_est is now also a clamped value.

Now that we have the sum aggregated CFS util value, we do not need to
consult uclamp buckets to know how the frequency should be clamped. We
simply look at the aggregated top level root_cfs_util_uclamp to know
what frequency to choose.

TODO: Sum aggregation for RT tasks. I have already implemented RT sum
aggregation, which is only 49 lines of code, but I don't want RT to
distract from this series, which is mainly CFS-focused. RT will be sent
in a separate mini series.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/core.c | 17 ++++----------
kernel/sched/cpufreq_schedutil.c | 10 ++------
kernel/sched/fair.c | 39 ++++++++++++++++----------------
kernel/sched/sched.h | 21 +++++++++++++----
4 files changed, 42 insertions(+), 45 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..0bedc05c883f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7465,6 +7465,9 @@ int sched_core_idle_cpu(int cpu)
* The DL bandwidth number otoh is not a measured metric but a value computed
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
+ *
+ * The util_cfs parameter has already taken uclamp into account (unless uclamp
+ * support is not compiled in).
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
unsigned long *min,
@@ -7490,13 +7493,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
}

if (min) {
- /*
- * The minimum utilization returns the highest level between:
- * - the computed DL bandwidth needed with the IRQ pressure which
- * steals time to the deadline task.
- * - The minimum performance requirement for CFS and/or RT.
- */
- *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
+ *min = irq + cpu_bw_dl(rq);

/*
* When an RT task is runnable and uclamp is not used, we must
@@ -7515,12 +7512,8 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
util = util_cfs + cpu_util_rt(rq);
util += cpu_util_dl(rq);

- /*
- * The maximum hint is a soft bandwidth requirement, which can be lower
- * than the actual utilization because of uclamp_max requirements.
- */
if (max)
- *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+ *max = scale;

if (util >= scale)
return scale;
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..48a4e4a685d0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -381,11 +381,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
/*
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
*/
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
+ if (sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
!sg_policy->need_freq_update) {
next_f = sg_policy->next_freq;

@@ -435,11 +432,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
/*
* Do not reduce the target performance level if the CPU has not been
* idle recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
*/
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+ if (sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 36357cfaf48d..b92739e1c52f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4821,10 +4821,17 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)

static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);

+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned long task_util(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_avg_uclamp);
+}
+#else
static inline unsigned long task_util(struct task_struct *p)
{
return READ_ONCE(p->se.avg.util_avg);
}
+#endif

static inline unsigned long task_runnable(struct task_struct *p)
{
@@ -4932,8 +4939,13 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* To avoid underestimate of task utilization, skip updates of EWMA if
* we cannot grant that thread got all CPU time it wanted.
*/
- if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
+ if ((READ_ONCE(p->se.avg.util_avg) + UTIL_EST_MARGIN) <
+ task_runnable(p)) {
+ ewma = clamp(ewma,
+ uclamp_eff_value(p, UCLAMP_MIN),
+ uclamp_eff_value(p, UCLAMP_MAX));
goto done;
+ }


/*
@@ -7685,11 +7697,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
static unsigned long
cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
{
- struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
- unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
+ struct rq *rq = cpu_rq(cpu);
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ unsigned long util = root_cfs_util(rq);
+ bool capped = uclamp_rq_is_capped(rq);
unsigned long runnable;

- if (boost) {
+ if (boost && !capped) {
runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
util = max(util, runnable);
}
@@ -7867,7 +7881,6 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
int cpu;

for_each_cpu(cpu, pd_cpus) {
- struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
unsigned long eff_util, min, max;

@@ -7880,20 +7893,6 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
*/
eff_util = effective_cpu_util(cpu, util, &min, &max);

- /* Task's uclamp can modify min and max value */
- if (tsk && uclamp_is_used()) {
- min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
- /*
- * If there is no active max uclamp constraint,
- * directly use task's one, otherwise keep max.
- */
- if (uclamp_rq_is_idle(cpu_rq(cpu)))
- max = uclamp_eff_value(p, UCLAMP_MAX);
- else
- max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
- }
-
eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
max_util = max(max_util, eff_util);
}
@@ -7996,7 +7995,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;

sync_entity_load_avg(&p->se);
- if (!task_util_est(p) && p_util_min == 0)
+ if (!task_util_est(p))
goto unlock;

eenv_task_busy_time(&eenv, p, prev_cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce80b87b549b..3ee28822f48f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3062,16 +3062,17 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
- unsigned long rq_util;
- unsigned long max_util;
+ unsigned long rq_uclamp_util, rq_real_util;

if (!static_branch_likely(&sched_uclamp_used))
return false;

- rq_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
- max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+ rq_uclamp_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
+ rq_real_util = READ_ONCE(rq->cfs.avg.util_avg) +
+ READ_ONCE(rq->avg_rt.util_avg);

- return max_util != SCHED_CAPACITY_SCALE && rq_util >= max_util;
+ return rq_uclamp_util < SCHED_CAPACITY_SCALE &&
+ rq_real_util > rq_uclamp_util;
}

/*
@@ -3087,6 +3088,11 @@ static inline bool uclamp_is_used(void)
return static_branch_likely(&sched_uclamp_used);
}

+static inline unsigned long root_cfs_util(struct rq *rq)
+{
+ return READ_ONCE(rq->root_cfs_util_uclamp);
+}
+
static inline void util_uclamp_enqueue(struct sched_avg *avg,
struct task_struct *p)
{
@@ -3160,6 +3166,11 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return false;
}

+static inline unsigned long root_cfs_util(struct rq *rq)
+{
+ return READ_ONCE(rq->cfs.avg.util_avg);
+}
+
static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
{
}
--
2.34.1


2024-02-01 13:14:05

by Hongyan Xia

Subject: [RFC PATCH v2 7/7] sched/uclamp: Simplify uclamp_eff_value()

The commit

sched: Remove all uclamp bucket logic

removes the uclamp_rq_{inc/dec}() functions, so p->uclamp now contains
the correct values at all times after a uclamp_update_active() call, and
there's no need to toggle the boolean `active` after an update. As a
result, uclamp_eff_value() is now fairly simple and can live as a static
inline function.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/core.c | 13 ++++---------
kernel/sched/sched.h | 14 ++++++++++++--
2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a3b36adc4dcc..f5f5f056525c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1499,21 +1499,15 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
return uc_req;
}

-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
-{
- if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
- return uclamp_none(clamp_id);
-
- return p->uclamp[clamp_id].value;
-}
-
static inline void
uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;

- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
+ p->uclamp[clamp_id].active = 1;
+ }
}

static inline void
@@ -1773,6 +1767,7 @@ static void uclamp_fork(struct task_struct *p)
for_each_clamp_id(clamp_id) {
uclamp_se_set(&p->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
+ p->uclamp[clamp_id].active = 0;
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81578410984c..2caefc3344bb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2991,8 +2991,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#endif

#ifdef CONFIG_UCLAMP_TASK
-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
-
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
@@ -3022,6 +3020,18 @@ static inline bool uclamp_is_used(void)
return static_branch_likely(&sched_uclamp_used);
}

+static inline unsigned long uclamp_eff_value(struct task_struct *p,
+ enum uclamp_id clamp_id)
+{
+ if (uclamp_is_used() && p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].value;
+
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+
+ return SCHED_CAPACITY_SCALE;
+}
+
static inline unsigned long root_cfs_util(struct rq *rq)
{
return READ_ONCE(rq->root_cfs_util_uclamp);
--
2.34.1


2024-02-01 13:14:36

by Hongyan Xia

Subject: [RFC PATCH v2 6/7] sched/uclamp: Remove all uclamp bucket logic

Also rewrite uclamp_update_active() so that the effective uclamp values
are updated every time we change task group properties, change system
defaults, or a request is issued from userspace.

This also significantly reduces uclamp overhead, because we no longer
need to compute effective uclamp values and manipulate buckets every
time a task is enqueued or dequeued (in uclamp_rq_{inc/dec}()).

TODO: Rewrite documentation to match the new logic.

Signed-off-by: Hongyan Xia <[email protected]>

---
Changed in v2:
- Remove stale comments about 'uclamp buckets'.
---
include/linux/sched.h | 4 -
init/Kconfig | 32 -----
kernel/sched/core.c | 300 +++---------------------------------------
kernel/sched/fair.c | 4 -
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 85 ------------
6 files changed, 19 insertions(+), 410 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f28eeff169ff..291b6781b221 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -678,9 +678,6 @@ struct sched_dl_entity {
};

#ifdef CONFIG_UCLAMP_TASK
-/* Number of utilization clamp buckets (shorter alias) */
-#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
-
/*
* Utilization clamp for a scheduling entity
* @value: clamp value "assigned" to a se
@@ -706,7 +703,6 @@ struct sched_dl_entity {
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int active : 1;
unsigned int user_defined : 1;
};
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927..1c8e11dcda17 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -808,38 +808,6 @@ config UCLAMP_TASK
enforce or grant any specific bandwidth for tasks.

If in doubt, say N.
-
-config UCLAMP_BUCKETS_COUNT
- int "Number of supported utilization clamp buckets"
- range 5 20
- default 5
- depends on UCLAMP_TASK
- help
- Defines the number of clamp buckets to use. The range of each bucket
- will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
- number of clamp buckets the finer their granularity and the higher
- the precision of clamping aggregation and tracking at run-time.
-
- For example, with the minimum configuration value we will have 5
- clamp buckets tracking 20% utilization each. A 25% boosted tasks will
- be refcounted in the [20..39]% bucket and will set the bucket clamp
- effective value to 25%.
- If a second 30% boosted task should be co-scheduled on the same CPU,
- that task will be refcounted in the same bucket of the first task and
- it will boost the bucket clamp effective value to 30%.
- The clamp effective value of a bucket is reset to its nominal value
- (20% in the example above) when there are no more tasks refcounted in
- that bucket.
-
- An additional boost/capping margin can be added to some tasks. In the
- example above the 25% task will be boosted to 30% until it exits the
- CPU. If that should be considered not acceptable on certain systems,
- it's always possible to reduce the margin by increasing the number of
- clamp buckets to trade off used memory for run-time tracking
- precision.
-
- If in doubt, use the default value.
-
endmenu

#
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bedc05c883f..a3b36adc4dcc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1408,17 +1408,9 @@ static struct uclamp_se uclamp_default[UCLAMP_CNT];
*/
DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

-/* Integer rounded range for each bucket */
-#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
-
#define for_each_clamp_id(clamp_id) \
for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)

-static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
-{
- return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
-}
-
static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
{
if (clamp_id == UCLAMP_MIN)
@@ -1430,58 +1422,9 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
unsigned int value, bool user_defined)
{
uc_se->value = value;
- uc_se->bucket_id = uclamp_bucket_id(value);
uc_se->user_defined = user_defined;
}

-static inline unsigned int
-uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /*
- * Avoid blocked utilization pushing up the frequency when we go
- * idle (which drops the max-clamp) by retaining the last known
- * max-clamp.
- */
- if (clamp_id == UCLAMP_MAX) {
- rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
- return clamp_value;
- }
-
- return uclamp_none(UCLAMP_MIN);
-}
-
-static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /* Reset max-clamp retention only on idle exit */
- if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- return;
-
- uclamp_rq_set(rq, clamp_id, clamp_value);
-}
-
-static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
- int bucket_id = UCLAMP_BUCKETS - 1;
-
- /*
- * Since both min and max clamps are max aggregated, find the
- * top most bucket with tasks in.
- */
- for ( ; bucket_id >= 0; bucket_id--) {
- if (!bucket[bucket_id].tasks)
- continue;
- return bucket[bucket_id].value;
- }
-
- /* No tasks -- default clamp values */
- return uclamp_idle_value(rq, clamp_id, clamp_value);
-}
-
static void __uclamp_update_util_min_rt_default(struct task_struct *p)
{
unsigned int default_util_min;
@@ -1537,8 +1480,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
}

/*
- * The effective clamp bucket index of a task depends on, by increasing
- * priority:
+ * The effective uclamp value of a task depends on, by increasing priority:
* - the task specific clamp value, when explicitly requested from userspace
* - the task group effective clamp value, for tasks not either in the root
* group or in an autogroup
@@ -1559,196 +1501,24 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)

unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
{
- struct uclamp_se uc_eff;
-
- /* Task currently refcounted: use back-annotated (effective) value */
- if (p->uclamp[clamp_id].active)
- return (unsigned long)p->uclamp[clamp_id].value;
-
- uc_eff = uclamp_eff_get(p, clamp_id);
-
- return (unsigned long)uc_eff.value;
-}
-
-/*
- * When a task is enqueued on a rq, the clamp bucket currently defined by the
- * task's uclamp::bucket_id is refcounted on that rq. This also immediately
- * updates the rq's clamp value if required.
- *
- * Tasks can have a task-specific value requested from user-space, track
- * within each bucket the maximum value for tasks refcounted in it.
- * This "local max aggregation" allows to track the exact "requested" value
- * for each bucket when all its RUNNABLE tasks require the same clamp.
- */
-static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
-
- lockdep_assert_rq_held(rq);
+ if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
+ return uclamp_none(clamp_id);

- /* Update task effective clamp */
- p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
- bucket->tasks++;
- uc_se->active = true;
-
- uclamp_idle_reset(rq, clamp_id, uc_se->value);
-
- /*
- * Local max aggregation: rq buckets always track the max
- * "requested" clamp value of its RUNNABLE tasks.
- */
- if (bucket->tasks == 1 || uc_se->value > bucket->value)
- bucket->value = uc_se->value;
-
- if (uc_se->value > uclamp_rq_get(rq, clamp_id))
- uclamp_rq_set(rq, clamp_id, uc_se->value);
+ return p->uclamp[clamp_id].value;
}

-/*
- * When a task is dequeued from a rq, the clamp bucket refcounted by the task
- * is released. If this is the last task reference counting the rq's max
- * active clamp value, then the rq's clamp value is updated.
- *
- * Both refcounted tasks and rq's cached clamp values are expected to be
- * always valid. If it's detected they are not, as defensive programming,
- * enforce the expected state and warn.
- */
-static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
- unsigned int bkt_clamp;
- unsigned int rq_clamp;
-
- lockdep_assert_rq_held(rq);
-
- /*
- * If sched_uclamp_used was enabled after task @p was enqueued,
- * we could end up with unbalanced call to uclamp_rq_dec_id().
- *
- * In this case the uc_se->active flag should be false since no uclamp
- * accounting was performed at enqueue time and we can just return
- * here.
- *
- * Need to be careful of the following enqueue/dequeue ordering
- * problem too
- *
- * enqueue(taskA)
- * // sched_uclamp_used gets enabled
- * enqueue(taskB)
- * dequeue(taskA)
- * // Must not decrement bucket->tasks here
- * dequeue(taskB)
- *
- * where we could end up with stale data in uc_se and
- * bucket[uc_se->bucket_id].
- *
- * The following check here eliminates the possibility of such race.
- */
- if (unlikely(!uc_se->active))
- return;
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
-
- SCHED_WARN_ON(!bucket->tasks);
- if (likely(bucket->tasks))
- bucket->tasks--;
-
- uc_se->active = false;
-
- /*
- * Keep "local max aggregation" simple and accept to (possibly)
- * overboost some RUNNABLE tasks in the same bucket.
- * The rq clamp bucket value is reset to its base value whenever
- * there are no more RUNNABLE tasks refcounting it.
- */
- if (likely(bucket->tasks))
- return;
-
- rq_clamp = uclamp_rq_get(rq, clamp_id);
- /*
- * Defensive programming: this should never happen. If it happens,
- * e.g. due to future modification, warn and fixup the expected value.
- */
- SCHED_WARN_ON(bucket->value > rq_clamp);
- if (bucket->value >= rq_clamp) {
- bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
- uclamp_rq_set(rq, clamp_id, bkt_clamp);
- }
-}
-
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
-{
- enum uclamp_id clamp_id;
-
- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
- for_each_clamp_id(clamp_id)
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /* Reset clamp idle holding when there is one RUNNABLE task */
- if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
-}
-
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+static inline void
+uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;

- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
for_each_clamp_id(clamp_id)
- uclamp_rq_dec_id(rq, p, clamp_id);
-}
-
-static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- if (!p->uclamp[clamp_id].active)
- return;
-
- uclamp_rq_dec_id(rq, p, clamp_id);
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /*
- * Make sure to clear the idle flag if we've transiently reached 0
- * active tasks on rq.
- */
- if (clamp_id == UCLAMP_MAX && (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+ p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
}

static inline void
uclamp_update_active(struct task_struct *p)
{
- enum uclamp_id clamp_id;
struct rq_flags rf;
struct rq *rq;

@@ -1762,14 +1532,7 @@ uclamp_update_active(struct task_struct *p)
*/
rq = task_rq_lock(p, &rf);

- /*
- * Setting the clamp bucket is serialized by task_rq_lock().
- * If the task is not yet RUNNABLE and its task_struct is not
- * affecting a valid clamp bucket, the next time it's enqueued,
- * it will already see the updated clamp bucket value.
- */
- for_each_clamp_id(clamp_id)
- uclamp_rq_reinc_id(rq, p, clamp_id);
+ uclamp_update_active_nolock(p);

task_rq_unlock(rq, p, &rf);
}
@@ -1998,26 +1761,22 @@ static void __setscheduler_uclamp(struct task_struct *p,
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
}
+
+ uclamp_update_active_nolock(p);
}

static void uclamp_fork(struct task_struct *p)
{
enum uclamp_id clamp_id;

- /*
- * We don't need to hold task_rq_lock() when updating p->uclamp_* here
- * as the task is still at its early fork stages.
- */
- for_each_clamp_id(clamp_id)
- p->uclamp[clamp_id].active = false;
-
- if (likely(!p->sched_reset_on_fork))
- return;
-
- for_each_clamp_id(clamp_id) {
- uclamp_se_set(&p->uclamp_req[clamp_id],
- uclamp_none(clamp_id), false);
+ if (unlikely(p->sched_reset_on_fork)) {
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&p->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
}
+
+ uclamp_update_active(p);
}

static void uclamp_post_fork(struct task_struct *p)
@@ -2025,28 +1784,10 @@ static void uclamp_post_fork(struct task_struct *p)
uclamp_update_util_min_rt_default(p);
}

-static void __init init_uclamp_rq(struct rq *rq)
-{
- enum uclamp_id clamp_id;
- struct uclamp_rq *uc_rq = rq->uclamp;
-
- for_each_clamp_id(clamp_id) {
- uc_rq[clamp_id] = (struct uclamp_rq) {
- .value = uclamp_none(clamp_id)
- };
- }
-
- rq->uclamp_flags = UCLAMP_FLAG_IDLE;
-}
-
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
enum uclamp_id clamp_id;
- int cpu;
-
- for_each_possible_cpu(cpu)
- init_uclamp_rq(cpu_rq(cpu));

for_each_clamp_id(clamp_id) {
uclamp_se_set(&init_task.uclamp_req[clamp_id],
@@ -2065,8 +1806,7 @@ static void __init init_uclamp(void)
}

#else /* CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_update_active_nolock(struct task_struct *p) { }
static inline int uclamp_validate(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -2113,7 +1853,6 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}

- uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);

if (sched_core_enabled(rq))
@@ -2133,7 +1872,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}

- uclamp_rq_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

@@ -10480,6 +10218,7 @@ void sched_move_task(struct task_struct *tsk)
put_prev_task(rq, tsk);

sched_change_group(tsk, group);
+ uclamp_update_active_nolock(tsk);

if (queued)
enqueue_task(rq, tsk, queue_flags);
@@ -10612,7 +10351,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
if (eff[clamp_id] == uc_se[clamp_id].value)
continue;
uc_se[clamp_id].value = eff[clamp_id];
- uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
if (!clamps) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49997f1f58fb..ac1dd5739ec6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12996,10 +12996,6 @@ DEFINE_SCHED_CLASS(fair) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_fair,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..86733bed0e3c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2681,10 +2681,6 @@ DEFINE_SCHED_CLASS(rt) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_rt,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ee28822f48f..81578410984c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -913,46 +913,6 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif /* CONFIG_SMP */

#ifdef CONFIG_UCLAMP_TASK
-/*
- * struct uclamp_bucket - Utilization clamp bucket
- * @value: utilization clamp value for tasks on this clamp bucket
- * @tasks: number of RUNNABLE tasks on this clamp bucket
- *
- * Keep track of how many tasks are RUNNABLE for a given utilization
- * clamp value.
- */
-struct uclamp_bucket {
- unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
-};
-
-/*
- * struct uclamp_rq - rq's utilization clamp
- * @value: currently active clamp values for a rq
- * @bucket: utilization clamp buckets affecting a rq
- *
- * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
- * A clamp value is affecting a rq when there is at least one task RUNNABLE
- * (or actually running) with that value.
- *
- * There are up to UCLAMP_CNT possible different clamp values, currently there
- * are only two: minimum utilization and maximum utilization.
- *
- * All utilization clamping values are MAX aggregated, since:
- * - for util_min: we want to run the CPU at least at the max of the minimum
- * utilization required by its currently RUNNABLE tasks.
- * - for util_max: we want to allow the CPU to run up to the max of the
- * maximum utilization allowed by its currently RUNNABLE tasks.
- *
- * Since on each system we expect only a limited number of different
- * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
- * the metrics required to compute all the per-rq utilization clamp values.
- */
-struct uclamp_rq {
- unsigned int value;
- struct uclamp_bucket bucket[UCLAMP_BUCKETS];
-};
-
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */

@@ -995,11 +955,7 @@ struct rq {
u64 nr_switches;

#ifdef CONFIG_UCLAMP_TASK
- /* Utilization clamp values based on CPU's RUNNABLE tasks */
- struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
- unsigned int uclamp_flags;
unsigned int root_cfs_util_uclamp;
-#define UCLAMP_FLAG_IDLE 0x01
#endif

struct cfs_rq cfs;
@@ -2247,11 +2203,6 @@ struct affinity_context {
extern s64 update_curr_common(struct rq *rq);

struct sched_class {
-
-#ifdef CONFIG_UCLAMP_TASK
- int uclamp_enabled;
-#endif
-
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
@@ -3042,23 +2993,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#ifdef CONFIG_UCLAMP_TASK
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);

-static inline unsigned long uclamp_rq_get(struct rq *rq,
- enum uclamp_id clamp_id)
-{
- return READ_ONCE(rq->uclamp[clamp_id].value);
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int value)
-{
- WRITE_ONCE(rq->uclamp[clamp_id].value, value);
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
- return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
-}
-
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
@@ -3147,25 +3081,6 @@ static inline bool uclamp_is_used(void)
return false;
}

-static inline unsigned long uclamp_rq_get(struct rq *rq,
- enum uclamp_id clamp_id)
-{
- if (clamp_id == UCLAMP_MIN)
- return 0;
-
- return SCHED_CAPACITY_SCALE;
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int value)
-{
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
- return false;
-}
-
static inline unsigned long root_cfs_util(struct rq *rq)
{
return READ_ONCE(rq->cfs.avg.util_avg);
--
2.34.1


2024-02-01 13:32:09

by Hongyan Xia

[permalink] [raw]
Subject: [RFC PATCH v2 2/7] sched/uclamp: Track uclamped util_avg in sched_avg

Track a uclamped version of util_avg in sched_avg, which clamps util_avg
within [uclamp[UCLAMP_MIN], uclamp[UCLAMP_MAX]] every time util_avg is
updated. At the root CFS rq level, just like util_est,
rq->cfs.avg.util_avg_uclamp must always be the sum of all
util_avg_uclamp of CFS tasks on this rq. So, each time the
util_avg_uclamp of a task gets updated, we also track the delta and
update the root cfs_rq. When a CFS task gets enqueued or dequeued, the
rq->cfs.avg.util_avg_uclamp also needs to add or subtract the
util_avg_uclamp of this task.
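
Purely as an illustration of the rule above, here is a minimal
standalone model of the delta idea (the numbers and local names are made
up; the actual implementation is update_util_uclamp() plus
util_uclamp_enqueue()/util_uclamp_dequeue() in the diff below):

#include <stdio.h>

/* Stand-in for the kernel's clamp(). */
static unsigned int clamp_val(unsigned int v, unsigned int lo, unsigned int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
	/* Hypothetical task: util_avg 700, clamped to [0, 300]. */
	unsigned int task_util_avg = 700;
	unsigned int uclamp_min = 0, uclamp_max = 300;
	unsigned int task_util_avg_uclamp = 250;	/* previous clamped value */
	unsigned int rq_util_avg_uclamp = 1000;		/* sum over runnable tasks */

	/* On every update of the task's util_avg: re-clamp, propagate the delta. */
	unsigned int new_val = clamp_val(task_util_avg, uclamp_min, uclamp_max);
	int delta = (int)new_val - (int)task_util_avg_uclamp;

	task_util_avg_uclamp = new_val;
	rq_util_avg_uclamp += delta;	/* root cfs_rq stays the sum of its tasks */

	/* Enqueue adds the task's value to the rq sum; dequeue subtracts it. */
	printf("task=%u rq=%u\n", task_util_avg_uclamp, rq_util_avg_uclamp);
	return 0;
}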

Signed-off-by: Hongyan Xia <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 21 +++++++++++++++++++
kernel/sched/pelt.c | 48 +++++++++++++++++++++++++++++++++++--------
kernel/sched/pelt.h | 5 +++--
kernel/sched/sched.h | 27 ++++++++++++++++++++++++
5 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab2951..f28eeff169ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -470,6 +470,9 @@ struct sched_avg {
unsigned long runnable_avg;
unsigned long util_avg;
unsigned int util_est;
+#ifdef CONFIG_UCLAMP_TASK
+ unsigned int util_avg_uclamp;
+#endif
} ____cacheline_aligned;

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d5cc87db4845..4f535c96463b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1089,6 +1089,9 @@ void post_init_entity_util_avg(struct task_struct *p)
}

sa->runnable_avg = sa->util_avg;
+#ifdef CONFIG_UCLAMP_TASK
+ sa->util_avg_uclamp = sa->util_avg;
+#endif
}

#else /* !CONFIG_SMP */
@@ -6763,6 +6766,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

/* At this point se is NULL and we are at root level*/
add_nr_running(rq, 1);
+#ifdef CONFIG_UCLAMP_TASK
+ util_uclamp_enqueue(&rq->cfs.avg, p);
+ update_util_uclamp(&rq->cfs.avg, p);
+ /* TODO: Better skip the frequency update in the for loop above. */
+ cpufreq_update_util(rq, 0);
+#endif

/*
* Since new tasks are assigned an initial util_avg equal to
@@ -6854,6 +6863,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

/* At this point se is NULL and we are at root level*/
sub_nr_running(rq, 1);
+#ifdef CONFIG_UCLAMP_TASK
+ util_uclamp_dequeue(&rq->cfs.avg, p);
+#endif

/* balance early to pull high priority tasks */
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
@@ -6862,6 +6874,15 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
+
+#ifdef CONFIG_UCLAMP_TASK
+ if (rq->cfs.h_nr_running == 0) {
+ WARN_ONCE(rq->cfs.avg.util_avg_uclamp,
+ "0 tasks on CFS of CPU %d, but util_avg_uclamp is %u\n",
+ rq->cpu, rq->cfs.avg.util_avg_uclamp);
+ WRITE_ONCE(rq->cfs.avg.util_avg_uclamp, 0);
+ }
+#endif
}

#ifdef CONFIG_SMP
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 63b6cf898220..eca45a863f9f 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -266,6 +266,39 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}

+#ifdef CONFIG_UCLAMP_TASK
+/* avg must belong to the queue this se is on. */
+void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+{
+ unsigned int util, uclamp_min, uclamp_max;
+ int delta;
+
+ if (!p->se.on_rq)
+ return;
+
+ if (!avg)
+ return;
+
+ util = READ_ONCE(p->se.avg.util_avg);
+ uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
+ uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
+ util = clamp(util, uclamp_min, uclamp_max);
+
+ delta = util - READ_ONCE(p->se.avg.util_avg_uclamp);
+ if (delta == 0)
+ return;
+
+ WRITE_ONCE(p->se.avg.util_avg_uclamp, util);
+ util = READ_ONCE(avg->util_avg_uclamp);
+ util += delta;
+ WRITE_ONCE(avg->util_avg_uclamp, util);
+}
+#else /* !CONFIG_UCLAMP_TASK */
+void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+{
+}
+#endif
+
/*
* sched_entity:
*
@@ -292,29 +325,28 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
* load_avg = \Sum se->avg.load_avg
*/

-int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
+void __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se));
+ if (entity_is_task(se))
+ update_util_uclamp(NULL, task_of(se));
trace_pelt_se_tp(se);
- return 1;
}
-
- return 0;
}

-int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
+void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
cfs_rq->curr == se)) {

___update_load_avg(&se->avg, se_weight(se));
cfs_se_util_change(&se->avg);
+ if (entity_is_task(se))
+ update_util_uclamp(&rq_of(cfs_rq)->cfs.avg,
+ task_of(se));
trace_pelt_se_tp(se);
- return 1;
}
-
- return 0;
}

int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 9e1083465fbc..6862f79e0fcd 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -1,8 +1,9 @@
#ifdef CONFIG_SMP
#include "sched-pelt.h"

-int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
-int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
+void update_util_uclamp(struct sched_avg *avg, struct task_struct *p);
+void __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
+void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..35036246824b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3085,6 +3085,33 @@ static inline bool uclamp_is_used(void)
{
return static_branch_likely(&sched_uclamp_used);
}
+
+static inline void util_uclamp_enqueue(struct sched_avg *avg,
+ struct task_struct *p)
+{
+ unsigned int avg_val = READ_ONCE(avg->util_avg_uclamp);
+ unsigned int p_val = READ_ONCE(p->se.avg.util_avg_uclamp);
+
+ WRITE_ONCE(avg->util_avg_uclamp, avg_val + p_val);
+}
+
+static inline void util_uclamp_dequeue(struct sched_avg *avg,
+ struct task_struct *p)
+{
+ unsigned int avg_val = READ_ONCE(avg->util_avg_uclamp);
+ unsigned int p_val = READ_ONCE(p->se.avg.util_avg_uclamp), new_val;
+
+ if (avg_val > p_val)
+ new_val = avg_val - p_val;
+ else {
+ WARN_ONCE(avg_val < p_val,
+ "avg_val underflow. avg_val %u is even less than p_val %u before subtraction\n",
+ avg_val, p_val);
+ new_val = 0;
+ }
+
+ WRITE_ONCE(avg->util_avg_uclamp, new_val);
+}
#else /* CONFIG_UCLAMP_TASK */
static inline unsigned long uclamp_eff_value(struct task_struct *p,
enum uclamp_id clamp_id)
--
2.34.1


2024-02-06 15:36:07

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 02/01/24 13:11, Hongyan Xia wrote:

> [1]: https://lore.kernel.org/all/[email protected]/

Their solution is not acceptable for the same reason yours isn't. Saravana and
David know this and we discussed at LPC. uclamp hints are limits and should not
be summed.

> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510

I think I have clarified several times so far that this is not related to
uclamp. Could you please refrain from referring to it again in the future?
This is misleading and helps neither your cause nor its cause. The fact
that you keep relating the two makes me very worried, as both links
demonstrate a lack of understanding of, or confusion about, what uclamp is
supposed to be.

Again, this solution is not acceptable and you're moving things in the wrong
direction. We don't want to redesign what uclamp means, but fix some corner
cases. And you're doing the former not the latter.

2024-02-06 17:33:23

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 06/02/2024 15:20, Qais Yousef wrote:
> On 02/01/24 13:11, Hongyan Xia wrote:
>
>> [1]: https://lore.kernel.org/all/[email protected]/
>
> Their solution is not acceptable for the same reason yours isn't. Saravana and
> David know this and we discussed at LPC. uclamp hints are limits and should not
> be summed.

Uclamp is a performance hint and nothing in its definition says it can't
be summed. Clearly, whether a uclamp approach should be taken should be
determined by how well it works as a hint, not by how we calculate it. I
am not rejecting max aggregation simply because it throws away all
uclamp values except the max; it's because I have real evaluation
results showing sum aggregation works as a much better hint.

>> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
>
> I think I have clarified several times so far that this is not related to
> uclamp. Could you please refrain from referring to it again in the future?
> This is misleading and helps neither your cause nor its cause. The fact
> that you keep relating the two makes me very worried, as both links
> demonstrate a lack of understanding of, or confusion about, what uclamp is
> supposed to be.

The intention of the code is irrelevant. What I'm talking about is the
effect the code actually has. The fact that you keep thinking I don't
understand what the code does, even after I explained "I know what the
intention of the code is, I'm just talking about the actual effect of
the code", is an even more worrying sign.

> Again, this solution is not acceptable and you're moving things in the wrong
> direction. We don't want to redesign what uclamp means, but fix some corner
> cases. And you're doing the former not the latter.

I'm saying max aggregation is not effective and proposing a more
effective implementation. In fact, you have sent a series that removes
max aggregation. Clearly that does not count as fixing corner cases but
is actually a redesign, and I don't understand why you are allowed to do
such things and I am not. Also, when something becomes harder and harder
to fix, a redesign that solves all the problems is clearly justified.

What I can summarize from sum aggregation is:

Pros:
1. A more effective implementation, proven by evaluation numbers
2. Consuming the same or even less power in benchmarks
3. 350 lines of code in total, less than half of max aggregation
4. This series shows the entirety and effectiveness of sum aggregation,
at this very moment, today. Max aggregation needs further filtering and
load balancing patches which we have not seen yet.
5. Resolves the drawbacks from max aggregation (which you might say is
the same as 4)
6. Significantly reduces uclamp overhead, no bucket operations

Cons:
1. should not be summed (although the scheduler used to sum up
utilization and util_est sums up a processed PELT signal today)
2. The under-utilization case, sketched below (which is a problem
GROUP_THROTTLE also has, and can be worked around. Please, I know the
intention of GROUP_THROTTLE; I'm just talking about its actual effects).
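
For reference, here is a standalone sketch of that corner case, using
the numbers from the limitations section of the cover letter (two
always-running tasks of utilization ~512, one with a UCLAMP_MAX of 1);
the 80% over-utilization margin below is only an assumption for
illustration:

#include <stdio.h>

int main(void)
{
	unsigned int capacity = 1024;
	unsigned int util_a = 512;			/* uncapped task */
	unsigned int util_b = 512, b_uclamp_max = 1;	/* heavily capped task */

	unsigned int b_clamped = util_b > b_uclamp_max ? b_uclamp_max : util_b;
	unsigned int sum = util_a + b_clamped;		/* 512 + 1 = 513 */

	/* Assumed rule of thumb: over-utilized above ~80% of capacity. */
	int overutilized = sum * 5 > capacity * 4;

	printf("sum aggregated = %u, over-utilized = %s\n",
	       sum, overutilized ? "yes" : "no");
	return 0;
}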

I don't see how the things I listed above are moving in the wrong direction.

2024-02-12 09:17:10

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On Thu, 1 Feb 2024 at 14:12, Hongyan Xia <[email protected]> wrote:
>
> Current implementation of uclamp leaves many things to be desired.
> There are three noteworthy problems:
>
> 1. Max aggregation only takes into account the max uclamp value. All
> other uclamp values are not in effect.
> 2. Uclamp max aggregation gives UCLAMP_MIN and UCLAMP_MAX at the rq
> level, and whether that translates to the desired performance of a
> specific task is unknown, because it depends on other tasks on rq.
> 3. Complexity. Uclamp max aggregation now sits at more than 750 lines of
> code and there is ongoing work to rework several interfaces to
> prepare for further uclamp changes. Uclamp max aggregation itself
> also needs future improvements in uclamp filtering and load balancing
>
> The first 2 points can manifest into the following symptoms,
>
> 1. High-rate OPP changes ("frequency spike" problem). An always-running
> task with a UCLAMP_MAX of 200 will drive the CPU at 200 even though
> its utilization is 1024. However, when a util_avg task of 300 but
> with default UCLAMP_MAX of 1024 joins the rq, the rq UCLAMP_MAX will
> be uncapped, and the UCLAMP_MAX of the first task is no longer in
> effect therefore driving the CPU at 1024, the highest OPP. When the
> second task sleeps, the OPP will be reduced to 200. This fast and
> sudden OPP switch every time the 2nd task wakes up or sleeps is
> unnecessary.
> 2. Using UCLAMP_MIN to boost performance under max aggregation has been
> shown to have weaker effectiveness than "sum aggregated" approaches,
> including the util_guest proposal [1] and uclamp sum aggregation in
> this series. The performance level of UCLAMP_MIN for a task under max
> aggregation is unpredictable when there are more than 1 task runnable
> on the rq.
>
> This series solves these problems by tracking a
> util_avg_uclamp signal in tasks and root cfs_rq. At task level,
> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
> must always be the sum of all util_avg_uclamp of all the entities on
> this cfs_rq. As a result, rq->cfs.avg.util_avg_uclamp is the sum
> aggregation of all the clamped values, which hints the frequency
> this rq should run at and what the utilization is. This proposal has
> some similarities to the util_guest series for VM workloads, in that it
> brings the desired performance to the task that requested it, not to the
> rq, in which the share of the task is unpredictable.

As I mentioned in your previous version, I don't want such mixed
signals, which are hard to maintain (as you then try to compensate for
all the side effects, compensating for some decaying as an example) and
don't mean anything in the end. You must find another way if you want
to go in the direction of sum aggregation.

>
> Note: This new signal does not change the existing PELT signal. The new
> signal is only an extra hint to the scheduler to improve scheduling
> decisions.
>
> TL;DR OF THIS SERIES:
>
> - Our evaluation shows significantly better effectiveness than max
> aggregation. UI benchmarks and VM workloads have improved latency and
> higher scores at the same or even reduced power consumption.
> - For other benchmarks that do not involve uclamp, we have not observed
> any noticeable regressions.
> - This series is the entirety of sum aggregation. No groundwork or
> refactoring in the scheduler is needed. The complexity of several
> existing uclamp-related functions is massively reduced and sum
> aggregation code is less than half of that in max aggregation (304+,
> 701-). The complexity gap will be even greater with all the ongoing
> patches for max aggregation.
>
> DECOMPOSITION OF SUM AGGREGATION:
>
> - Patch 1 reverts some max aggregation code. Sum aggregation shows no

Sorry, I don't get what you mean here by reverting some max aggregation
code. This is not really linked to max aggregation but to how EAS and
feec() try to compute energy.

The code that is reverted here mainly highlights where your problem is
and, more generally speaking, one of the problems of the current
algorithm of EAS and feec() when it tries to estimate the energy cost.
We don't take into account that a utilization becoming larger than the
CPU capacity means more running time. Such a problem has been
highlighted previously and is one root cause of the problems that you
are trying to solve here.

> such problems so mitigation patches are not necessary, and that
> patch has other undesirable side effects.
> - Patch 2 and 3 introduce new sum aggregated signals to be a more
> effective hint to the scheduler. Patch 3 employs a math trick to make
> it significantly simpler to track on-rq and off-rq task utilization
> contributions.
> - Patch 4, 5 and 6 start using the new signal while significantly
> simplifying existing uclamp code, including the total removal of
> uclamp buckets and max aggregation.
> - Patch 7 and part of 6 remove the overhead of uclamp on task
> enqueue and dequeue, because uclamp values and buckets no longer need
> to be re-computed every time a uclamp'ed task joins or leaves the rq.
>
> TESTING:
>
> Sum aggregation generally performs better in tests. Two notebooks, max
> vs. sum aggregation, are shared at
>
> https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/max.ipynb
> https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/sum.ipynb
>
> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
> profiles used for these experiments are included in the notebooks.
>
> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.
>
> The scheduling decisions are plotted in Out[11]. Both max and sum
> aggregation recognizes the UCLAMP_MAX hints and run all the threads
> on the little Performance Domain (PD). However, max aggregation in the
> test shows some randomness in task placement, and we see the 4 threads
> are often not evenly distributed across the 4 little CPUs. This uneven
> task placement is also the reason why we revert the patch:
>
> "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
>
> When the spare capacity is 0, the reverted patch tends to gather many
> tasks on the same CPU because its EM calculation is bogus and it thinks
> this placement is more energy efficient.

So the problem is not max vs. sum aggregation but fixing feec().

>
> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
> slightly above the capacity of the little CPU.
>
> Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
> on the big core but not to run at the highest OPP. Both max and sum
> aggregation accomplish this task successfully, running the two threads
> at the big cluster while not driving the frequency at the max.

No problem here

>
> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
> at 300. After a while, task A is then pinned to CPU5, joining B.
>
> Results are in Out[23]. The util_avg curve is the original root CFS
> util_avg. The root_cfs_util_uclamp is the root CFS utilization after
> considering uclamp. Max aggregation sees a frequency spike at 751.7s.
> When zoomed in, one can see square-wave-like utilization and CPU

Side note: there is no graph shown in your links, so there is nothing
to see or zoom into.

> frequency values because of task A periodically going to sleep. When A
> wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the
> highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect
> and will reduce rq utilization to 300. This happens repeatedly, hence
> the square wave. In contrast, sum aggregation sees a normal increase in

What do you mean by a normal increase?

> utilization when A joins B, at around 430.64s, without any square-wave
> behavior. The CPU frequency also stays stable while the two tasks are on
> the same rq.

It is difficult to make any comment as there is no graph to see in your
link.

What should be the right frequency in such a case?

>
> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
> little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5).
> After a while, remove the CPU pinning of the 4 tasks on the big PD.
>
> Results are in Out[29]. Both max and sum aggregation understand that we
> can move the 4 tasks from the big PD to the little PD to reduce power
> because of the UCLAMP_MAX hints. However, max aggregation shows severely
> unbalanced task placement, scheduling 5 tasks on CPU0 while 1 each on
> CPU1-3. Sum aggregation schedules 2 tasks on each little CPU, honoring
> UCLAMP_MAX while maintaining balanced task placement.
>
> Again, this unbalanced task placement is the reason why we reverted:
>
> "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

So then it's not a matter of sum vs. max aggregation but of fixing the
EAS policy to place tasks correctly.

>
> Scenario 5: 8 tasks with UCLAMP_MAX of 120.
>
> This test is similar to Scenario 4, shown in Out[35]. Both max and sum
> aggregation understand the UCLAMP_MAX hints and schedule all tasks on
> the 4 little CPUs. Max aggregation again shows unstable and unbalanced
> task placement while sum aggregation schedules 2 tasks on each little
> CPU, and the task placement remains stable. The total task residency is
> shown in Out[36], showing how unbalanced max aggregation is.

same comment as previous test

>
> BENCHMARKS:
>
> Geekbench 6, no uclamp (on Rock-5B board)
> +-----+-------------+------------+
> | | Single-core | Multi-core |
> +-----+-------------+------------+
> | Max | 801.3 | 2976.8 |
> | Sum | 802.8 | 2989.2 |
> +-----+-------------+------------+
>
> No regression is seen after switching to sum aggregation.
>
> Jankbench (backporting sum aggregation to Pixel 6 Android 5.18 mainline kernel):
>
> Jank percentage:
> +------+-------+-----------+
> | | value | perc_diff |
> +------+-------+-----------+
> | main | 1.1 | 0.00% |
> | sum | 0.5 | -53.92% |
> +------+-------+-----------+
>
> Average power:
> +------------+------+-------+-----------+
> | | tag | value | perc_diff |
> +------------+------+-------+-----------+
> | CPU | max | 166.1 | 0.00% |
> | CPU-Big | max | 55.1 | 0.00% |
> | CPU-Little | max | 91.7 | 0.00% |
> | CPU-Mid | max | 19.2 | 0.00% |
> | CPU | sum | 161.3 | -2.85% |
> | CPU-Big | sum | 52.9 | -3.97% |
> | CPU-Little | sum | 86.7 | -5.39% |
> | CPU-Mid | sum | 21.6 | 12.63% |
> +------------+------+-------+-----------+
>
> UIBench (backporting sum aggregation to Pixel 6 Android 6.6 mainline kernel):
>
> Jank percentage:
> +-------------+-------+-----------+
> | tag | value | perc_diff |
> +-------------+-------+-----------+
> | max_aggr | 0.3 | 0.0 |
> | sum_aggr | 0.26 | -12.5 |
> +-------------+-------+-----------+
>
> Average input latency:
> +-------------+--------+-----------+
> | tag | value | perc_diff |
> +-------------+--------+-----------+
> | max_aggr | 107.39 | 0.0 |
> | sum_aggr | 81.135 | -24.5 |
> +-------------+--------+-----------+
>
> Average power:
> +------------+--------------+--------+-----------+
> | channel | tag | value | perc_diff |
> +------------+--------------+--------+-----------+
> | CPU | max_aggr | 209.85 | 0.0% |
> | CPU-Big | max_aggr | 89.8 | 0.0% |
> | CPU-Little | max_aggr | 94.45 | 0.0% |
> | CPU-Mid | max_aggr | 25.45 | 0.0% |
> | GPU | max_aggr | 22.9 | 0.0% |
> | Total | max_aggr | 232.75 | 0.0% |
> | CPU | sum_aggr | 206.05 | -1.81% |
> | CPU-Big | sum_aggr | 84.7 | -5.68% |
> | CPU-Little | sum_aggr | 94.9 | 0.48% |
> | CPU-Mid | sum_aggr | 26.3 | 3.34% |
> | GPU | sum_aggr | 22.45 | -1.97% |
> | Total | sum_aggr | 228.5 | -1.83% |
> +------------+--------------+--------+-----------+
>
> It should be noted that sum aggregation reduces jank and reduces input
> latency while consuming less power.
>
> VM cpufreq hypercall driver [1], on Rock-5B board. Baseline indicates a
> setup without the VM cpufreq driver:
>
> Geekbench 6 uncontended. No other host threads running.
> +------+-------------+-----------+------------+-----------+
> | | Single-core | perc_diff | Multi-core | perc_diff |
> +------+-------------+-----------+------------+-----------+
> | base | 796.4 | 0 | 2947.0 | 0 |
> | max | 795.6 | -0.10 | 2929.6 | -0.59 |
> | sum | 794.6 | -0.23 | 2935.6 | -0.39 |
> +------+-------------+-----------+------------+-----------+
>
> Geekbench 6 contended. Host CPUs each has a 50% duty-cycle task running.
> +------+-------------+-----------+------------+-----------+
> | | Single-core | perc_diff | Multi-core | perc_diff |
> +------+-------------+-----------+------------+-----------+
> | base | 604.6 | 0 | 2330.2 | 0 |
> | max | 599.4 | -0.86 | 2327.2 | -0.13 |
> | sum | 645.8 | 6.81 | 2336.6 | 0.27 |
> +------+-------------+-----------+------------+-----------+
>
> VM CPUfreq driver using sum aggregation outperforms max aggregation when
> the host is contended. When the host has no contention (only the VM
> vCPUs are running and each host CPU accommodates one guest vCPU), the
> two aggregation methods are roughly the same, and a bit surprisingly,
> offers no speed-up versus the baseline. This is probably because of the
> overhead of hypercalls, and the fact that Geekbench is CPU intensive and
> is not the best workload to show the effectiveness of VM cpufreq driver.
> We will try to benchmark on more VM workloads.
>
> LIMITATIONS:
>
> 1. RT sum aggregation is not shown in the series.
>
> A proof-of-concept RT sum aggregation implementation is done and going
> through testing, with < 50 lines of code, using the same ideas as in
> CFS. They will be sent out separately if we can agree on CFS sum
> aggregation and once the testing is done.
>
> 2. A heavily UCLAMP_MAX-throttled task may prevent the CFS rq from
> triggering over-utilization.
>
> For example, two always-running task each having utilization of 512. If
> one of the task is severely UCLAMP_MAX restricted, say, with a
> UCLAMP_MAX of 1, then the total CFS sum aggregation will be 512 + 1 =
> 513, which won't trigger over-utilization even though the other task has
> no UCLAMP_MAX and wants more performance.

But what if they are pinned to the same CPU, or the other CPUs are
already fully busy? This is a blocking point because the uclamp of one
task then impacts the CPU bandwidth of another, unclamped task.

>
> I'm working on a fix for this problem. This at the moment can be solved
> by either not giving long-running tasks ridiculously low UCLAMP_MAX
> values, or adjusting the priority of UCLAMP_MAX tasks to make sure it
> does not get a share of CPU run-time that vastly exceeds its UCLAMP_MAX.
> However, my personal view is that maybe UCLAMP_MIN and UCLAMP_MAX just
> don't belong together, and the proper way is to have a per-task

No, this is a blocking point. We don't want to add a dependency between
nice priority and uclamp.

> bandwidth throttling mechanism and what we want as UCLAMP_MAX maybe
> actually belongs to that mechanism.
>
> However, since the Android GROUP_THROTTLE feature [2] has the exact same
> problem and has been used in actual products, we don't think this is a
> big limitation in practice.

The fact that an out-of-tree patchset is fine with going ahead with a
wrong behavior does not make this patchset acceptable.

>
> [1]: https://lore.kernel.org/all/[email protected]/
> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
>
> ---
> Changed in v2:
> - Rework util_avg_uclamp to be closer to the style of util_est.
> - Rewrite patch notes to reflect the new style.
> - Add the discussion of the under-utilizated example in limitations,
> found by Vincent G.
> - Remove task group uclamp to focus on tasks first.
> - Fix several bugs in task migration.
> - Add benchmark numbers from UIBench and VM cpufreq.
> - Update python notebooks to reflect the latest max vs. sum aggregation.
>
> Hongyan Xia (7):
> Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
> 0"
> sched/uclamp: Track uclamped util_avg in sched_avg
> sched/uclamp: Introduce root_cfs_util_uclamp for rq
> sched/fair: Use CFS util_avg_uclamp for utilization and frequency
> sched/fair: Massively simplify util_fits_cpu()
> sched/uclamp: Remove all uclamp bucket logic
> sched/uclamp: Simplify uclamp_eff_value()
>
> include/linux/sched.h | 7 +-
> init/Kconfig | 32 ---
> kernel/sched/core.c | 324 +++---------------------------
> kernel/sched/cpufreq_schedutil.c | 10 +-
> kernel/sched/fair.c | 333 +++++++------------------------
> kernel/sched/pelt.c | 144 ++++++++++++-
> kernel/sched/pelt.h | 6 +-
> kernel/sched/rt.c | 4 -
> kernel/sched/sched.h | 145 ++++++--------
> 9 files changed, 304 insertions(+), 701 deletions(-)
>
> --
> 2.34.1
>

2024-02-12 14:33:53

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 12/02/2024 09:16, Vincent Guittot wrote:
> On Thu, 1 Feb 2024 at 14:12, Hongyan Xia <[email protected]> wrote:
>>
>> Current implementation of uclamp leaves many things to be desired.
>> There are three noteworthy problems:
>>
>> 1. Max aggregation only takes into account the max uclamp value. All
>> other uclamp values are not in effect.
>> 2. Uclamp max aggregation gives UCLAMP_MIN and UCLAMP_MAX at the rq
>> level, and whether that translates to the desired performance of a
>> specific task is unknown, because it depends on other tasks on rq.
>> 3. Complexity. Uclamp max aggregation now sits at more than 750 lines of
>> code and there is ongoing work to rework several interfaces to
>> prepare for further uclamp changes. Uclamp max aggregation itself
>> also needs future improvements in uclamp filtering and load balancing
>>
>> The first 2 points can manifest into the following symptoms,
>>
>> 1. High-rate OPP changes ("frequency spike" problem). An always-running
>> task with a UCLAMP_MAX of 200 will drive the CPU at 200 even though
>> its utilization is 1024. However, when a util_avg task of 300 but
>> with default UCLAMP_MAX of 1024 joins the rq, the rq UCLAMP_MAX will
>> be uncapped, and the UCLAMP_MAX of the first task is no longer in
>> effect therefore driving the CPU at 1024, the highest OPP. When the
>> second task sleeps, the OPP will be reduced to 200. This fast and
>> sudden OPP switch every time the 2nd task wakes up or sleeps is
>> unnecessary.
>> 2. Using UCLAMP_MIN to boost performance under max aggregation has been
>> shown to have weaker effectiveness than "sum aggregated" approaches,
>> including the util_guest proposal [1] and uclamp sum aggregation in
>> this series. The performance level of UCLAMP_MIN for a task under max
>> aggregation is unpredictable when there are more than 1 task runnable
>> on the rq.
>>
>> This series solves these problems by tracking a
>> util_avg_uclamp signal in tasks and root cfs_rq. At task level,
>> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
>> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
>> must always be the sum of all util_avg_uclamp of all the entities on
>> this cfs_rq. As a result, rq->cfs.avg.util_avg_uclamp is the sum
>> aggregation of all the clamped values, which hints the frequency
>> this rq should run at and what the utilization is. This proposal has
>> some similarities to the util_guest series for VM workloads, in that it
>> brings the desired performance to the task that requested it, not to the
>> rq, in which the share of the task is unpredictable.
>
> As I mentioned in your previous version, I don't want such mixed
> signals, which are hard to maintain (as you then try to compensate for
> all the side effects, compensating for some decaying as an example) and
> don't mean anything in the end. You must find another way if you want
> to go in the direction of sum aggregation.

The new style in v2 follows util_est and the old scheduler way of
tracking root CFS. The old root CFS tracking was changed because we
didn't want the cost of updating all the intermediate task group levels,
and this series brings back root CFS sum tracking without that task
group cost.

I'm not convinced that this new signal shouldn't exist or that it is
hard to maintain, when it has proven to be more effective than max
aggregation, which is more than twice the amount of code (and still
growing, to fix all the different cases) and whose bucket logic is
already fairly complicated to understand and maintain. I also just
learned that the math to decay the CFS sum in Patch 3/7 isn't something
I invented but what CFS itself used to do.

>>
>> Note: This new signal does not change the existing PELT signal. The new
>> signal is only an extra hint to the scheduler to improve scheduling
>> decisions.
>>
>> TL;DR OF THIS SERIES:
>>
>> - Our evaluation shows significantly better effectiveness than max
>> aggregation. UI benchmarks and VM workloads have improved latency and
>> higher scores at the same or even reduced power consumption.
>> - For other benchmarks that do not involve uclamp, we have not observed
>> any noticeable regressions.
>> - This series is the entirety of sum aggregation. No groundwork or
>> refactoring in the scheduler is needed. The complexity of several
>> existing uclamp-related functions is massively reduced and sum
>> aggregation code is less than half of that in max aggregation (304+,
>> 701-). The complexity gap will be even greater with all the ongoing
>> patches for max aggregation.
>>
>> DECOMPOSITION OF SUM AGGREGATION:
>>
>> - Patch 1 reverts some max aggregation code. Sum aggregation shows no
>
> Sorry, I don't get what you mean here by reverting some max aggregation
> code. This is not really linked to max aggregation but to how EAS and
> feec() try to compute energy.

I agree. Strictly speaking this is EAS related and not directly related
to max aggregation. I'll reword the message.

>
> The code that is reverted here mainly highlights where your problem is
> and, more generally speaking, one of the problems of the current
> algorithm of EAS and feec() when it tries to estimate the energy cost.
> We don't take into account that a utilization becoming larger than the
> CPU capacity means more running time. Such a problem has been
> highlighted previously and is one root cause of the problems that you
> are trying to solve here.

Under max aggregation, one always-running task capped at 200 on a CPU
and 10 such tasks on the same CPU result in the same total running time
and CPU frequency. I call this the 'infinite tasks problem' because
theoretically you can have an infinite number of such tasks on the same
CPU without max aggregation complaining.
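
Purely as an illustration (the numbers and the 1024 capacity cap are
assumptions, not taken from the series), the difference between the two
rules for N such tasks looks like this:

#include <stdio.h>

int main(void)
{
	unsigned int capacity = 1024, clamp = 200, n;

	for (n = 1; n <= 10; n++) {
		/* Max aggregation: the rq-level hint is max(200, ..., 200). */
		unsigned int max_aggr = clamp;
		/* Sum aggregation: clamped contributions add up and saturate. */
		unsigned int sum_aggr = n * clamp;

		if (sum_aggr > capacity)
			sum_aggr = capacity;

		printf("N=%2u  max-aggregated=%4u  sum-aggregated=%4u\n",
		       n, max_aggr, sum_aggr);
	}
	return 0;
}

The max-aggregated hint never moves no matter how many such tasks pile
onto the CPU, whereas the sum-aggregated signal grows with each extra
task until it hits the CPU capacity.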

This problem and the reverted patch both suggest to me that, yes, we
need to consider the time axis like you said. However, I must say it
feels hacky when the uclamp implementation needs to take extra
parameters into account to work well.

>> such problems so mitigation patches are not necessary, and that
>> patch has other undesirable side effects.
>> - Patch 2 and 3 introduce new sum aggregated signals to be a more
>> effective hint to the scheduler. Patch 3 employs a math trick to make
>> it significantly simpler to track on-rq and off-rq task utilization
>> contributions.
>> - Patch 4, 5 and 6 start using the new signal while significantly
>> simplifying existing uclamp code, including the total removal of
>> uclamp buckets and max aggregation.
>> - Patch 7 and part of 6 remove the overhead of uclamp on task
>> enqueue and dequeue, because uclamp values and buckets no longer need
>> to be re-computed every time a uclamp'ed task joins or leaves the rq.
>>
>> TESTING:
>>
>> Sum aggregation generally performs better in tests. Two notebooks, max
>> vs. sum aggregation, are shared at
>>
>> https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/max.ipynb
>> https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/sum.ipynb
>>
>> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
>> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
>> profiles used for these experiments are included in the notebooks.
>>
>> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.
>>
>> The scheduling decisions are plotted in Out[11]. Both max and sum
>> aggregation recognizes the UCLAMP_MAX hints and run all the threads
>> on the little Performance Domain (PD). However, max aggregation in the
>> test shows some randomness in task placement, and we see the 4 threads
>> are often not evenly distributed across the 4 little CPUs. This uneven
>> task placement is also the reason why we revert the patch:
>>
>> "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
>>
>> When the spare capacity is 0, the reverted patch tends to gather many
>> tasks on the same CPU because its EM calculation is bogus and it thinks
>> this placement is more energy efficient.
>
> So the problem is not max vs. sum aggregation but fixing feec().

As in my previous comment, I agree, although I think max aggregation
makes an accurate feec() tricky when the whole rq can be capped and
uncapped by a single task, and running time may need to be considered
to get it to work well.

>>
>> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
>> slightly above the capacity of the little CPU.
>>
>> Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
>> on the big core but not to run at the highest OPP. Both max and sum
>> aggregation accomplish this task successfully, running the two threads
>> at the big cluster while not driving the frequency at the max.
>
> No problem here
>
>>
>> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
>> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
>> at 300. After a while, task A is then pinned to CPU5, joining B.
>>
>> Results are in Out[23]. The util_avg curve is the original root CFS
>> util_avg. The root_cfs_util_uclamp is the root CFS utilization after
>> considering uclamp. Max aggregation sees a frequency spike at 751.7s.
>> When zoomed in, one can see square-wave-like utilization and CPU
>
> Side note: there is no graph shown in your links, so there is nothing
> to see or zoom into.

Hmm, I confirmed with a couple of people that the graphs showed up
before sending out the series. This sometimes happens when you have an
ad or cookie blocker. Maybe disabling them or viewing in a private
window helps. If it doesn't, I'll try other ways of sharing the
results.

>> frequency values because of task A periodically going to sleep. When A
>> wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the
>> highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect
>> and will reduce rq utilization to 300. This happens repeatedly, hence
>> the square wave. In contrast, sum aggregation sees a normal increase in
>
> What do you mean by a normal increase?

I mean not having a frequency spike and not jumping between the highest
and lowest OPP on every time slice.

>> utilization when A joins B, at around 430.64s, without any square-wave
>> behavior. The CPU frequency also stays stable while the two tasks are on
>> the same rq.
>
> It is difficult to make any comment as there is no graph to see in
> your link.
>
> What should be the right frequency in such a case?

A frequency high enough to satisfy util(A) and clamped(B).
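
To put rough, purely hypothetical numbers on that (A's utilization is
only described as "small" in the profile, so 100 here is an assumption;
B is always-running and capped at 300):

#include <stdio.h>

int main(void)
{
	/* Assumed figures: util(A) ~ 100, util(B) ~ 1024 capped at 300. */
	unsigned int util_a = 100;
	unsigned int util_b = 1024, b_uclamp_max = 300;

	unsigned int b_clamped = util_b > b_uclamp_max ? b_uclamp_max : util_b;
	unsigned int sum = util_a + b_clamped;	/* ~400, and it stays there */

	printf("sum-aggregated request = %u\n", sum);
	/*
	 * Max aggregation instead flips between ~300 (A asleep, B's cap in
	 * effect) and 1024 (A runnable, rq uncapped), hence the square wave.
	 */
	return 0;
}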

>>
>> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
>> little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5).
>> After a while, remove the CPU pinning of the 4 tasks on the big PD.
>>
>> Results are in Out[29]. Both max and sum aggregation understand that we
>> can move the 4 tasks from the big PD to the little PD to reduce power
>> because of the UCLAMP_MAX hints. However, max aggregation shows severely
>> unbalanced task placement, scheduling 5 tasks on CPU0 while 1 each on
>> CPU1-3. Sum aggregation schedules 2 tasks on each little CPU, honoring
>> UCLAMP_MAX while maintaining balanced task placement.
>>
>> Again, this unbalanced task placement is the reason why we reverted:
>>
>> "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
>
> So then it's not a matter of sum vs max aggregation but of fixing the
> EAS policy to place tasks correctly
>
>>
>> Scenario 5: 8 tasks with UCLAMP_MAX of 120.
>>
>> This test is similar to Scenario 4, shown in Out[35]. Both max and sum
>> aggregation understand the UCLAMP_MAX hints and schedule all tasks on
>> the 4 little CPUs. Max aggregation again shows unstable and unbalanced
>> task placement while sum aggregation schedules 2 tasks on each little
>> CPU, and the task placement remains stable. The total task residency is
>> shown in Out[36], showing how unbalanced max aggregation is.
>
> same comment as previous test
>
>>
>> BENCHMARKS:
>>
>> Geekbench 6, no uclamp (on Rock-5B board)
>> +-----+-------------+------------+
>> | | Single-core | Multi-core |
>> +-----+-------------+------------+
>> | Max | 801.3 | 2976.8 |
>> | Sum | 802.8 | 2989.2 |
>> +-----+-------------+------------+
>>
>> No regression is seen after switching to sum aggregation.
>>
>> Jankbench (backporting sum aggregation to Pixel 6 Android 5.18 mainline kernel):
>>
>> Jank percentage:
>> +------+-------+-----------+
>> | | value | perc_diff |
>> +------+-------+-----------+
>> | main | 1.1 | 0.00% |
>> | sum | 0.5 | -53.92% |
>> +------+-------+-----------+
>>
>> Average power:
>> +------------+------+-------+-----------+
>> | | tag | value | perc_diff |
>> +------------+------+-------+-----------+
>> | CPU | max | 166.1 | 0.00% |
>> | CPU-Big | max | 55.1 | 0.00% |
>> | CPU-Little | max | 91.7 | 0.00% |
>> | CPU-Mid | max | 19.2 | 0.00% |
>> | CPU | sum | 161.3 | -2.85% |
>> | CPU-Big | sum | 52.9 | -3.97% |
>> | CPU-Little | sum | 86.7 | -5.39% |
>> | CPU-Mid | sum | 21.6 | 12.63% |
>> +------------+------+-------+-----------+
>>
>> UIBench (backporting sum aggregation to Pixel 6 Android 6.6 mainline kernel):
>>
>> Jank percentage:
>> +-------------+-------+-----------+
>> | tag | value | perc_diff |
>> +-------------+-------+-----------+
>> | max_aggr | 0.3 | 0.0 |
>> | sum_aggr | 0.26 | -12.5 |
>> +-------------+-------+-----------+
>>
>> Average input latency:
>> +-------------+--------+-----------+
>> | tag | value | perc_diff |
>> +-------------+--------+-----------+
>> | max_aggr | 107.39 | 0.0 |
>> | sum_aggr | 81.135 | -24.5 |
>> +-------------+--------+-----------+
>>
>> Average power:
>> +------------+--------------+--------+-----------+
>> | channel | tag | value | perc_diff |
>> +------------+--------------+--------+-----------+
>> | CPU | max_aggr | 209.85 | 0.0% |
>> | CPU-Big | max_aggr | 89.8 | 0.0% |
>> | CPU-Little | max_aggr | 94.45 | 0.0% |
>> | CPU-Mid | max_aggr | 25.45 | 0.0% |
>> | GPU | max_aggr | 22.9 | 0.0% |
>> | Total | max_aggr | 232.75 | 0.0% |
>> | CPU | sum_aggr | 206.05 | -1.81% |
>> | CPU-Big | sum_aggr | 84.7 | -5.68% |
>> | CPU-Little | sum_aggr | 94.9 | 0.48% |
>> | CPU-Mid | sum_aggr | 26.3 | 3.34% |
>> | GPU | sum_aggr | 22.45 | -1.97% |
>> | Total | sum_aggr | 228.5 | -1.83% |
>> +------------+--------------+--------+-----------+
>>
>> It should be noted that sum aggregation reduces jank and reduces input
>> latency while consuming less power.
>>
>> VM cpufreq hypercall driver [1], on Rock-5B board. Baseline indicates a
>> setup without the VM cpufreq driver:
>>
>> Geekbench 6 uncontended. No other host threads running.
>> +------+-------------+-----------+------------+-----------+
>> | | Single-core | perc_diff | Multi-core | perc_diff |
>> +------+-------------+-----------+------------+-----------+
>> | base | 796.4 | 0 | 2947.0 | 0 |
>> | max | 795.6 | -0.10 | 2929.6 | -0.59 |
>> | sum | 794.6 | -0.23 | 2935.6 | -0.39 |
>> +------+-------------+-----------+------------+-----------+
>>
>> Geekbench 6 contended. Each host CPU has a 50% duty-cycle task running.
>> +------+-------------+-----------+------------+-----------+
>> | | Single-core | perc_diff | Multi-core | perc_diff |
>> +------+-------------+-----------+------------+-----------+
>> | base | 604.6 | 0 | 2330.2 | 0 |
>> | max | 599.4 | -0.86 | 2327.2 | -0.13 |
>> | sum | 645.8 | 6.81 | 2336.6 | 0.27 |
>> +------+-------------+-----------+------------+-----------+
>>
>> The VM cpufreq driver using sum aggregation outperforms max aggregation
>> when the host is contended. When the host has no contention (only the
>> VM vCPUs are running and each host CPU accommodates one guest vCPU),
>> the two aggregation methods are roughly the same and, a bit
>> surprisingly, offer no speed-up versus the baseline. This is probably
>> because of the overhead of hypercalls, and because Geekbench is
>> CPU-intensive and not the best workload to show the effectiveness of
>> the VM cpufreq driver. We will try to benchmark more VM workloads.
>>
>> LIMITATIONS:
>>
>> 1. RT sum aggregation is not shown in the series.
>>
>> A proof-of-concept RT sum aggregation implementation is done and going
>> through testing, with < 50 lines of code, using the same ideas as in
>> CFS. It will be sent out separately once the testing is done and if we
>> can agree on CFS sum aggregation.
>>
>> 2. A heavily UCLAMP_MAX-throttled task may prevent the CFS rq from
>> triggering over-utilization.
>>
>> For example, consider two always-running tasks, each with a utilization
>> of 512. If one of the tasks is severely UCLAMP_MAX-restricted, say with
>> a UCLAMP_MAX of 1, then the total CFS sum aggregation will be 512 + 1 =
>> 513, which won't trigger over-utilization even though the other task
>> has no UCLAMP_MAX and wants more performance.
>
> But what if they are pinned to the same CPU, or the other CPUs are
> already fully busy? This is a blocking point because the uclamp of one
> task then impacts the CPU bandwidth of another, unclamped task

True. This is a limitation I acknowledge. I'm working on a fix.
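
To put rough numbers on the quoted example, a toy calculation (the real
over-utilization check has more to it than the ~20% margin sketched
here):

#include <stdbool.h>
#include <stdio.h>

/* Roughly the kernel's "fits with ~20% headroom" margin; illustrative. */
static bool fits(unsigned long util, unsigned long cap)
{
	return util * 1280 < cap * 1024;
}

int main(void)
{
	unsigned long cap = 1024;
	unsigned long sum_uclamp = 512 + 1;	/* clamped sum: 513 */
	unsigned long sum_real   = 512 + 512;	/* what actually runs: 1024 */

	/* 513 still "fits" 1024, so over-utilization is not triggered,
	 * even though the unclamped task alone wants far more. */
	printf("clamped sum fits: %d, real sum fits: %d\n",
	       fits(sum_uclamp, cap), fits(sum_real, cap));
	return 0;
}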

>>
>> I'm working on a fix for this problem. At the moment this can be worked
>> around either by not giving long-running tasks ridiculously low
>> UCLAMP_MAX values, or by adjusting the priority of UCLAMP_MAX tasks so
>> that they do not get a share of CPU run-time that vastly exceeds their
>> UCLAMP_MAX. However, my personal view is that maybe UCLAMP_MIN and
>> UCLAMP_MAX just don't belong together, and the proper way is to have a
>> per-task
>
> No, this is a blocking point. We don't want to add a dependency between
> nice priority and uclamp.

Actually, in my testing I find the 'infinite tasks' problem to be a far
bigger blocker than this one. Under max aggregation, uclamp_max tasks are
often placed on the same CPU because the energy calculation is bogus,
just like in Scenario 4. I'm just saying that at least the
under-utilization problem can easily be worked around with priorities
when you use uclamp_max in sum aggregation.

>> bandwidth throttling mechanism and what we want as UCLAMP_MAX maybe
>> actually belongs to that mechanism.
>>
>> However, since the Android GROUP_THROTTLE feature [2] has the exact same
>> problem and has been used in actual products, we don't think this is a
>> big limitation in practice.
>
> Just because an out-of-tree patchset is OK with going with the wrong
> behavior does not make this patchset acceptable

I completely agree.

I was not arguing that what somebody else does makes a limitation
acceptable. I was saying that often one doesn't run into that
limitation. It's rare in practice that one wants a task to run at normal
or even higher priority but with a heavily throttled uclamp_max. In the
testing and evaluation, one runs into the limitations of max aggregation
more often than seeing this problem in sum aggregation. But, you are
right, this problem should be fixed.

Finally, I may not have expressed this clearly in email threads, but I'm
not strongly against max aggregation; it's just that it has many
drawbacks that need to be addressed, and we do not yet know how many more
lines of code are needed and how well it will perform in the end. If we
have a way forward and the effectiveness is shown by evaluation numbers,
then I don't see why we can't go with the current uclamp implementation.

>>
>> [1]: https://lore.kernel.org/all/[email protected]/
>> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
>>
>> ---
>> Changed in v2:
>> - Rework util_avg_uclamp to be closer to the style of util_est.
>> - Rewrite patch notes to reflect the new style.
>> - Add the discussion of the under-utilized example in limitations,
>> found by Vincent G.
>> - Remove task group uclamp to focus on tasks first.
>> - Fix several bugs in task migration.
>> - Add benchmark numbers from UIBench and VM cpufreq.
>> - Update python notebooks to reflect the latest max vs. sum aggregation.
>>
>> Hongyan Xia (7):
>> Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
>> 0"
>> sched/uclamp: Track uclamped util_avg in sched_avg
>> sched/uclamp: Introduce root_cfs_util_uclamp for rq
>> sched/fair: Use CFS util_avg_uclamp for utilization and frequency
>> sched/fair: Massively simplify util_fits_cpu()
>> sched/uclamp: Remove all uclamp bucket logic
>> sched/uclamp: Simplify uclamp_eff_value()
>>
>> include/linux/sched.h | 7 +-
>> init/Kconfig | 32 ---
>> kernel/sched/core.c | 324 +++---------------------------
>> kernel/sched/cpufreq_schedutil.c | 10 +-
>> kernel/sched/fair.c | 333 +++++++------------------------
>> kernel/sched/pelt.c | 144 ++++++++++++-
>> kernel/sched/pelt.h | 6 +-
>> kernel/sched/rt.c | 4 -
>> kernel/sched/sched.h | 145 ++++++--------
>> 9 files changed, 304 insertions(+), 701 deletions(-)
>>
>> --
>> 2.34.1
>>

2024-02-20 15:49:44

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 02/06/24 17:32, Hongyan Xia wrote:
> On 06/02/2024 15:20, Qais Yousef wrote:
> > On 02/01/24 13:11, Hongyan Xia wrote:
> >
> > > [1]: https://lore.kernel.org/all/[email protected]/
> >
> > Their solution is not acceptable for the same reason yours isn't. Saravana and
> > David know this and we discussed at LPC. uclamp hints are limits and should not
> > be summed.
>
> Uclamp is a performance hint and nothing in its definition says it can't be

That's the major problem here. The definition says it can't be summed because
they're limits. So if two tasks boosted to 1024, then their combined
performance limit is 2048? Or if 4 tasks are capped to 512, then the combined
limits is 2048? You're changing the behavior. If not, please help me understand
how not. You're treating them as 'bandwidth' hints.

> summed. Clearly whether a uclamp approach should be taken should be
> determined by how well it works as a hint, not by how we calculate it. I
> would not say I want to reject max aggregation simply because it throws away
> all other uclamp values except the max. It's because I have real evaluation
> results showing sum aggregation works as a much better hint.

It is easy to get numbers that show improvements. That doesn't
necessarily mean it is correct.

>
> > > [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
> >
> > I think I clarified several times so far that this is not related to uclamp.
> > Could you please refrain from referring to it again in the future? This is
> > misleading and neither helps your cause nor its cause. The fact that you're
> > relating to it makes me very worried as both links demonstrate lack of
> > understanding/confusion of what uclamp is supposed to be.
>
> The intention of the code is irrelevant. What I'm talking about is what

It is relevant.

1. You failed to get consent on using this as an example.
2. You failed to demonstrate the purpose of the code and why it was
actually added. I don't think you know the full details anyway.
3. You're using a hack that is problematic and needs to be deleted as
proof of something good that _we do not want_.
4. How do you know it works the way you think it does? You're making
wrong assumptions that are misleading and quite frankly annoying.

Generally we don't want to discuss this in this venue. So for the last time
please stop bringing it up altogether and let's focus on the problems you are
seeing and trying to fix and leave other people's code out of the discussion.

> effect the code actually has. The fact that you keep thinking I don't
> understand what the code does even after me explaining "I know what the
> intention of the code is, I'm just talking about the actual effect of the
> code" is an even more worrying sign.
>
> > Again, this solution is not acceptable and you're moving things in the wrong
> > direction. We don't want to redesign what uclamp means, but fix some corner
> > cases. And you're doing the former not the latter.
>
> I'm saying max aggregation is not effective and proposing a more effective
> implementation. In fact, you have sent a series that removes max
> aggregation. Clearly that does not count as fixing corner cases but is
> actually a redesign, and I don't understand why you are allowed to do such
> things and I am not. Also, when something becomes harder and harder to fix,
> a redesign that solves all the problems is clearly justified.

I don't have issues with max aggregation removal. I actually would love to see
it gone. But summing uclamp values changes the meaning of uclamp hints. Which
is my number 1 concern.

And you're lumping too many problems onto max aggregation. It was brought
up several times that the overutilized definition being tied to misfit is
not fit for purpose anymore. This needs to be fixed. feec() fails to
distribute fairly when capacities are equal. This is a known old problem.
This needs fixing. Energy computation needs to evolve to deal with a
number of new realities. This needs fixing. The load balancer can't
correct bad wake-up decisions in a way aligned with feec(); this needs
fixing. Adding a new signal is not the right way to fix these problems.
And the only issue with max-aggregation in practice is the fact that it
is not effective in restricting power use for uclamp_max. It also has the
minor problem of boosting unnecessarily for the whole time the task is
enqueued. The correct behavior, which we want and which other vendors
have expressed to me in the past, is to apply the boost/cap only when
required, i.e. when the task is running. The hardware can cope with this.
And this is tied to how the cpufreq governor works; if there are hardware
limitations that make this hard to achieve, then it is up to the governor
to decide this, not the scheduler. We already have util_avg and util_est.
I really don't think we need another addition. We just need to fix issues
elsewhere and improve the interface with the cpufreq governor to honour
the performance limits the tasks have, so that we can get rid of
max-aggregation.

>
> What I can summarize from sum aggregation is:
>
> Pros:
> 1. A more effective implementation, proven by evaluation numbers
> 2. Consuming the same or even less power in benchmarks
> 3. 350 lines of code in total, less than half of max aggregation
> 4. This series shows the entirety and effectiveness of sum aggregation, at
> this very moment, today. Max aggregation needs further filtering and load
> balancing patches which we have not seen yet.
> 5. Resolves the drawbacks from max aggregation (which you might say is the
> same as 4)
> 6. Significantly reduces uclamp overhead, no bucket operations
>
> Cons:
> 1. should not be summed (although the scheduler used to sum up utilization
> and util_est sums up a processed PELT signal today)
> 2. Under-utilization case (which is a problem GROUP_THROTTLE also has, and
> can be worked around. Please, I know the intention of GROUP_THROTTLE, I'm
> just talking about its actual effects).
>
> I don't see why the things I listed above are in the wrong direction.

2024-02-20 22:43:13

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 20/02/2024 15:47, Qais Yousef wrote:
> On 02/06/24 17:32, Hongyan Xia wrote:
>> On 06/02/2024 15:20, Qais Yousef wrote:
>>> On 02/01/24 13:11, Hongyan Xia wrote:
>>>
>>>> [1]: https://lore.kernel.org/all/[email protected]/
>>>
>>> Their solution is not acceptable for the same reason yours isn't. Saravana and
>>> David know this and we discussed at LPC. uclamp hints are limits and should not
>>> be summed.
>>
>> Uclamp is a performance hint and nothing in its definition says it can't be
>
> That's the major problem here. The definition says it can't be summed because
> they're limits. So if two tasks boosted to 1024, then their combined
> performance limit is 2048? Or if 4 tasks are capped to 512, then the combined
> limits is 2048? You're changing the behavior. If not, please help me understand
> how not. You're treating them as 'bandwidth' hints.

You have not looked at this series closely. '4 tasks are capped to 512,
then the combined limits is 2048?'. This is a misinterpretation of this
series. The fact that you conclude this series like this is already
alarming to me.

Also, if you spend time playing with this series or just look at my
evaluation results, you'll see it gives a much better 'dynamic range'
than max aggregation. You need only a smaller uclamp value in sum
aggregation to achieve the same results, often with less power
consumption. Higher values can give you better performance in a way max
aggregation can't, and the same amount of power budget often gives you
less jank.

About changing behavior, yes, I'm changing it from a less effective hint
to a more effective one.

>> summed. Clearly whether a uclamp approach should be taken should be
>> determined by how well it works as a hint, not by how we calculate it. I
>> would not say I want to reject max aggregation simply because it throws away
>> all other uclamp values except the max. It's because I have real evaluation
>> results showing sum aggregation works as a much better hint.
>
> It is easy to get numbers that show improvements. That doesn't
> necessarily mean it is correct.

But telling people that your implementation meets a definition of uclamp
without showing improved results or a roadmap to address all the
drawbacks won't convince people that it's the correct solution.

Also, it's likely something that is simpler and works better is correct.
It's not always true, but it's likely.

>>
>>>> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
>>>
>>> I think I clarified several times so far that this is not related to uclamp.
>>> Could you please refrain from referring to it again in the future? This is
>>> misleading and neither helps your cause nor its cause. The fact that you're
>>> relating to it makes me very worried as both links demonstrate lack of
>>> understanding/confusion of what uclamp is supposed to be.
>>
>> The intention of the code is irrelevant. What I'm talking about is what
>
> It is relevant.
>
> 1. You failed to get consent on using this as an example.
> 2. You failed to demonstrate the purpose of the code and why it was
> actually added. I don't think you know the full details anyway.
> 3. You're using a hack that is problematic and needs to be deleted as
> proof of something good that _we do not want_.
> 4. How do you know it works the way you think it does? You're making
> wrong assumptions that are misleading and quite frankly annoying.
>
> Generally we don't want to discuss this in this venue. So for the last time
> please stop bringing it up altogether and let's focus on the problems you are
> seeing and trying to fix and leave other people's code out of the discussion.

I find these 4 points arrogant and offensive, and they do not belong on
the mailing list.

I do not need consent from you, or need to know the full back story, to
talk about what a piece of public code does. The fun thing is, I even
spent time understanding what the code intends to do, but clearly to you
I still 'don't know the full details anyway' and am not qualified to
talk about what the code actually does in the real world, as if the
intention of a kitchen knife matters so much when I explain that the
knife is dangerous and can kill people.

Meanwhile you are making the incorrect comment that this series is
'combining the limits'. Of course, I could just say you failed to
demonstrate a good understanding of the series (this time not just the
intention, but also what it actually does), but that kind of language
rarely helps in making any technical progress on mailing lists.

>> effect the code actually has. The fact that you keep thinking I don't
>> understand what the code does even after me explaining "I know what the
>> intention of the code is, I'm just talking about the actual effect of the
>> code" is an even more worrying sign.
>>
>>> Again, this solution is not acceptable and you're moving things in the wrong
>>> direction. We don't want to redesign what uclamp means, but fix some corner
>>> cases. And you're doing the former not the latter.
>>
>> I'm saying max aggregation is not effective and proposing a more effective
>> implementation. In fact, you have sent a series that removes max
>> aggregation. Clearly that does not count as fixing corner cases but is
>> actually a redesign, and I don't understand why you are allowed to do such
>> things and I am not. Also, when something becomes harder and harder to fix,
>> a redesign that solves all the problems is clearly justified.
>
> I don't have issues with max aggregation removal. I actually would love to see
> it gone. But summing uclamp values changes the meaning of uclamp hints. Which
> is my number 1 concern.
>
> And you're lumping too many problems onto max aggregation. It was brought
> up several times that the overutilized definition being tied to misfit is
> not fit for purpose anymore. This needs to be fixed. feec() fails to
> distribute fairly when capacities are equal. This is a known old problem.
> This needs fixing. Energy computation needs to evolve to deal with a
> number of new realities. This needs fixing. The load balancer can't
> correct bad wake-up decisions in a way aligned with feec(); this needs
> fixing. Adding a new signal is not the right way to fix these problems.
> And the only issue with max-aggregation in practice is the fact that it
> is not effective in restricting power use for uclamp_max. It also has the
> minor problem of boosting unnecessarily for the whole time the task is
> enqueued. The correct behavior, which we want and which other vendors
> have expressed to me in the past, is to apply the boost/cap only when
> required, i.e. when the task is running. The hardware can cope with this.
> And this is tied to how the cpufreq governor works; if there are hardware
> limitations that make this hard to achieve, then it is up to the governor
> to decide this, not the scheduler. We already have util_avg and util_est.
> I really don't think we need another addition. We just need to fix issues
> elsewhere and improve the interface with the cpufreq governor to honour
> the performance limits the tasks have, so that we can get rid of
> max-aggregation.

We need to lay quite some groundwork for the existing implementation, as
seen in the many patches merged recently. A patch like "sched/uclamp: Set
max_spare_cap_cpu even if max_spare_cap is 0" has already created other
problems, shown in my evaluation results (which I doubt you have looked
at), and now you are saying we need to cope with the cpufreq governor and
hardware limits. It sounds like a much bigger hack when the effectiveness
of uclamp depends on the governor and hardware limits. Also, we still
don't know how many lines of code are needed to address all these
problems.

This conversation is going nowhere. The fact that we have gone back and
forth so many times, that you have spent quite some time lecturing me on
what I supposedly don't know, and that you have not commented even once
on the drawbacks shown in my whitebox tests, on the evaluation numbers
from the benchmarks, or on anything technical in this series is very
alarming to me. Meanwhile, other people have spotted and pointed out
technical issues like the under-utilization problem. We have wasted too
much time on this, and I will probably only reply to you on technical
discussions from now on.

>>
>> What I can summarize from sum aggregation is:
>>
>> Pros:
>> 1. A more effective implementation, proven by evaluation numbers
>> 2. Consuming the same or even less power in benchmarks
>> 3. 350 lines of code in total, less than half of max aggregation
>> 4. This series shows the entirety and effectiveness of sum aggregation, at
>> this very moment, today. Max aggregation needs further filtering and load
>> balancing patches which we have not seen yet.
>> 5. Resolves the drawbacks from max aggregation (which you might say is the
>> same as 4)
>> 6. Significantly reduces uclamp overhead, no bucket operations
>>
>> Cons:
>> 1. should not be summed (although the scheduler used to sum up utilization
>> and util_est sums up a processed PELT signal today)
>> 2. Under-utilization case (which is a problem GROUP_THROTTLE also has, and
>> can be worked around. Please, I know the intention of GROUP_THROTTLE, I'm
>> just talking about its actual effects).
>>
>> I don't see why the things I listed above are in the wrong direction.

2024-03-15 12:31:29

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/7] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

On 01/02/2024 14:12, Hongyan Xia wrote:

[...]

> @@ -7685,11 +7697,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> static unsigned long
> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> {
> - struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> - unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> + struct rq *rq = cpu_rq(cpu);
> + struct cfs_rq *cfs_rq = &rq->cfs;
> + unsigned long util = root_cfs_util(rq);
> + bool capped = uclamp_rq_is_capped(rq);

I tried to rerun your tests in your 2 ipynbs (cover letter) but this
lets the sum aggr stack go sideways ...

If 'sched_uclamp_used' is set, then uclamp_rq_is_capped() will call
cpu_util_cfs()->cpu_util(), which then calls uclamp_rq_is_capped()
recursively, resulting in a stack overflow.

Do you have a fix for that you can share? For the time being I have
removed the call to uclamp_rq_is_capped() in cpu_util().

[...]

2024-03-15 19:49:06

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/7] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

On 15/03/2024 12:31, Dietmar Eggemann wrote:
> On 01/02/2024 14:12, Hongyan Xia wrote:
>
> [...]
>
>> @@ -7685,11 +7697,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> static unsigned long
>> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>> {
>> - struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
>> - unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
>> + struct rq *rq = cpu_rq(cpu);
>> + struct cfs_rq *cfs_rq = &rq->cfs;
>> + unsigned long util = root_cfs_util(rq);
>> + bool capped = uclamp_rq_is_capped(rq);
>
> I tried to rerun your tests in your 2 ipynbs (cover letter) but this
> lets the sum aggr stack go sideways ...
>
> If 'sched_uclamp_used' is set, then uclamp_rq_is_capped() will call
> cpu_util_cfs()->cpu_util(), which then calls uclamp_rq_is_capped()
> recursively, resulting in a stack overflow.
>
> Do you have a fix for that you can share? For the time being I have
> removed the call to uclamp_rq_is_capped() in cpu_util().

My apologies. This was fixed long ago. Here is the diff, which breaks the
recursion by reading the tracked signals directly instead of going
through cpu_util_cfs():

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1ebdd0b9ebca..d5dcda036e0d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3018,9 +3018,8 @@ static inline bool uclamp_rq_is_capped(struct rq *rq)
if (!static_branch_likely(&sched_uclamp_used))
return false;

- rq_uclamp_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
- rq_real_util = READ_ONCE(rq->cfs.avg.util_avg) +
- READ_ONCE(rq->avg_rt.util_avg);
+ rq_uclamp_util = READ_ONCE(rq->root_cfs_util_uclamp);
+ rq_real_util = READ_ONCE(rq->cfs.avg.util_avg);

return rq_uclamp_util < SCHED_CAPACITY_SCALE &&
rq_real_util > rq_uclamp_util;

> [...]
>

2024-03-18 18:22:07

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

On 01/02/2024 14:11, Hongyan Xia wrote:

[...]

> /*
> * The code below (indirectly) updates schedutil which looks at
> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> #ifdef CONFIG_UCLAMP_TASK
> util_uclamp_enqueue(&rq->cfs.avg, p);
> update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
> + if (migrated)

IMHO, you don't need 'bool __maybe_unused migrated'. You can use:

if (flags & ENQUEUE_MIGRATED)

> + rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
> + rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
> + rq->cfs.avg.util_avg_uclamp);
> /* TODO: Better skip the frequency update in the for loop above. */
> cpufreq_update_util(rq, 0);
> #endif
> @@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> migrate_se_pelt_lag(se);
> }
>
> + remove_root_cfs_util_uclamp(p);

You can't always do this here. In the '!task_on_rq_migrating()' case we
don't hold the 'old' rq->lock.

Have a look into remove_entity_load_avg() for what we do for PELT in
this case.

And:

144d8487bc6e ("sched/fair: Implement synchonous PELT detach on load-balance migrate")
e1f078f50478 ("sched/fair: Combine detach into dequeue when migrating task")

@@ -3081,6 +3081,8 @@ static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;

+ lockdep_assert_rq_held(task_rq(p));
+
new_util = (root_util > p_util) ? root_util - p_util : 0;
new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);

[...]

> /* avg must belong to the queue this se is on. */
> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
> +void update_util_uclamp(u64 now,
> + u64 last_update_time,
> + u32 period_contrib,
> + struct sched_avg *avg,
> + struct task_struct *p)
> {

I was wondering why you use such a long parameter list for this
function.

IMHO

update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)

would work as well. You could check whether se represents a task inside
update_util_uclamp() as well as get last_update_time and period_contrib.

The only reason I see is that you want to use this function for the RT
class as well later, where you have to deal with 'struct rt_rq' and
'struct sched_rt_entity'.

IMHO, it's always better to keep the implementation to the minimum and
only introduce changes which are related to the functionality you
present. This would make reviewing so much easier.


> unsigned int util, uclamp_min, uclamp_max;
> int delta;
>
> - if (!p->se.on_rq)
> + if (!p->se.on_rq) {
> + ___update_util_uclamp_towards(now,
> + last_update_time,
> + period_contrib,
> + &p->se.avg.util_avg_uclamp,
> + 0);
> return;
> + }

You decay 'p->se.avg.util_avg_uclamp' which is not really related to
root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.

This is the util_avg_uclamp handling for a se (task):

enqueue_task_fair()

util_uclamp_enqueue()

update_util_uclamp() (1)

if (!p->se.on_rq) (x)
___update_util_uclamp_towards() (2)

dequeue_task_fair()

util_uclamp_dequeue()

__update_load_avg_blocked_se()

update_util_uclamp()

(x)

__update_load_avg_se()

update_util_uclamp()

(x)

Why is it so unbalanced? Why do you need (1) and (2)?

Isn't this just an indication that the se util_avg_uclamp
is done at the wrong places?

Is there an other way to provide a task/rq signal as the base
for uclamp sum aggregation?

[...]

2024-03-19 11:50:54

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

On 18/03/2024 18:21, Dietmar Eggemann wrote:
> On 01/02/2024 14:11, Hongyan Xia wrote:
>
> [...]
>
>> /*
>> * The code below (indirectly) updates schedutil which looks at
>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> #ifdef CONFIG_UCLAMP_TASK
>> util_uclamp_enqueue(&rq->cfs.avg, p);
>> update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>> + if (migrated)
>
> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>
> if (flags & ENQUEUE_MIGRATED)

I'm not sure if they are entirely equivalent. Both
task_on_rq_migrating() and !task_on_rq_migrating() can have
last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
Maybe I'm missing something?

>> + rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
>> + rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
>> + rq->cfs.avg.util_avg_uclamp);
>> /* TODO: Better skip the frequency update in the for loop above. */
>> cpufreq_update_util(rq, 0);
>> #endif
>> @@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>> migrate_se_pelt_lag(se);
>> }
>>
>> + remove_root_cfs_util_uclamp(p);
>
> You can't always do this here. In the '!task_on_rq_migrating()' case we
> don't hold the 'old' rq->lock.
>
> Have a look into remove_entity_load_avg() for what we do for PELT in
> this case.
>
> And:
>
> 144d8487bc6e ("sched/fair: Implement synchonous PELT detach on load-balance migrate")
> e1f078f50478 ("sched/fair: Combine detach into dequeue when migrating task")
>
> @@ -3081,6 +3081,8 @@ static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
> unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
> unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;
>
> + lockdep_assert_rq_held(task_rq(p));
> +
> new_util = (root_util > p_util) ? root_util - p_util : 0;
> new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
> WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);

Ack. I saw the removed_* functions. I will change into that style.

> [...]
>
>> /* avg must belong to the queue this se is on. */
>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>> +void update_util_uclamp(u64 now,
>> + u64 last_update_time,
>> + u32 period_contrib,
>> + struct sched_avg *avg,
>> + struct task_struct *p)
>> {
>
> I was wondering why you use such a long parameter list for this
> function.
>
> IMHO
>
> update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> would work as well. You could check whether se represents a task inside
> update_util_uclamp() as well as get last_update_time and period_contrib.
>
> The only reason I see is that you want to use this function for the RT
> class as well later, where you have to deal with 'struct rt_rq' and
> 'struct sched_rt_entity'.
>
> IMHO, it's always better to keep the implementation to the minimum and
> only introduce changes which are related to the functionality you
> present. This would make reviewing so much easier.

Those parameters are necessary because of

if (___update_load_sum()) {
___update_load_avg();
update_util_uclamp();
}

We have to cache last_update_time and period_contrib, because after
___update_load_sum() is done, both parameters in sched_avg have already
been updated and overwritten and we lose the timestamp when the
sched_avg was previously updated. update_util_uclamp() needs to know
when sched_avg was previously updated.
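
Roughly, the ordering constraint looks like this (a sketch using the v2
names, with argument lists elided as above; not a drop-in hunk):

	u64 last_update_time = se->avg.last_update_time;
	u32 period_contrib = se->avg.period_contrib;

	/*
	 * ___update_load_sum() advances last_update_time and
	 * period_contrib, so snapshot them first and pass the old values
	 * on, letting util_avg_uclamp decay over the same window that
	 * util_avg was just updated for.
	 */
	if (___update_load_sum()) {
		___update_load_avg();
		update_util_uclamp(now, last_update_time, period_contrib,
				   &cfs_rq->avg, task_of(se));
	}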

>
>> unsigned int util, uclamp_min, uclamp_max;
>> int delta;
>>
>> - if (!p->se.on_rq)
>> + if (!p->se.on_rq) {
>> + ___update_util_uclamp_towards(now,
>> + last_update_time,
>> + period_contrib,
>> + &p->se.avg.util_avg_uclamp,
>> + 0);
>> return;
>> + }
>
> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.

I would say this still belongs to 3/7, because 2/7 only implements
utilization for on_rq tasks. This patch implements utilization for both
on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).

For tasks, we also have two utilization numbers, one is on_rq and the
other is on_rq plus !on_rq. However, we know they do not have to be
stored in separate variables and util_avg_uclamp can capture both.

Moving this to 2/7 might be fine, although then this would be the only
!on_rq utilization in 2/7. I can add comments to clarify the situation.

> This is the util_avg_uclamp handling for a se (task):
>
> enqueue_task_fair()
>
> util_uclamp_enqueue()
>
> update_util_uclamp() (1)
>
> if (!p->se.on_rq) (x)
> ___update_util_uclamp_towards() (2)
>
> dequeue_task_fair()
>
> util_uclamp_dequeue()
>
> __update_load_avg_blocked_se()
>
> update_util_uclamp()
>
> (x)
>
> __update_load_avg_se()
>
> update_util_uclamp()
>
> (x)
>
> Why is it so unbalanced? Why do you need (1) and (2)?
>
> Isn't this just an indication that the se util_avg_uclamp
> is done at the wrong places?
>
> Is there an other way to provide a task/rq signal as the base
> for uclamp sum aggregation?

(2) won't happen, because at that point p->se.on_rq must be 1.

The sequence during enqueue_task_fair() is:

enqueue_task_fair(p)
enqueue_entity(se)
update_load_avg(se)
update_util_uclamp(p) (decay path)
p->se.on_rq = 1;
util_uclamp_enqueue(p)
update_util_uclamp(p) (update path)

The only reason why we want to issue update_util_uclamp() after seeing
on_rq == 1 is that now it goes down the normal uclamp path and not the
decay path. Otherwise, uclamp won't immediately engage during enqueue
and has to wait a timer tick.

Ideally, we should:

enqueue_task_fair(p)
enqueue_entity(se)
update_load_avg(se)
util_uclamp_enqueue(p)
update_util_uclamp(p) (force update path)
p->se.on_rq = 1;

This requires us to invent a new flag to update_util_uclamp() to force
the update path even when p->se.on_rq is 0.
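
A rough sketch of what such a flag could look like (the flag name and
the extra parameter are invented here purely for illustration and are
not part of the posted series; the normal update path is elided):

/* Hypothetical flag, invented for illustration only. */
#define UTIL_UCLAMP_FORCE	0x1

void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
			struct sched_avg *avg, struct task_struct *p,
			int flags)
{
	if (!p->se.on_rq && !(flags & UTIL_UCLAMP_FORCE)) {
		/* Decay path: converge the task signal towards 0. */
		___update_util_uclamp_towards(now, last_update_time,
					      period_contrib,
					      &p->se.avg.util_avg_uclamp, 0);
		return;
	}

	/*
	 * Update path (elided): clamp the task's util within its
	 * uclamp_min/uclamp_max and propagate the delta to avg, even
	 * before p->se.on_rq is set during enqueue.
	 */
}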

At the moment I'm treating util_avg_uclamp in the same way as util_est
after the comments in v1, which is independent and has its own code
path. We can go back to the old style, where util_avg_uclamp is closer
to how we treat se rather than a separate thing like util_est.

> [...]

2024-03-19 15:42:47

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/7] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

On 01/02/2024 14:11, Hongyan Xia wrote:
> From: Hongyan Xia <[email protected]>
>
> That commit creates further problems because 0 spare capacity can be
> either a real indication that the CPU is maxed out, or the CPU is
> UCLAMP_MAX throttled, but we end up giving all of them a chance which
> can result in bogus energy calculations. It also tends to schedule
> tasks on the same CPU and requires load balancing patches. Sum
> aggregation solves these problems and this patch is not needed.
>
> This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.

I assume you did this revert especially for the 'Scenario 5: 8 tasks
with UCLAMP_MAX of 120' testcase?

IMHO, the issue is especially visible in compute_energy()'s busy_time
computation with a valid destination CPU (dst_cpu >= 0). I.e. when we
have to add performance domain (pd) and task busy time.

find_energy_efficient_cpu() (feec())

for each pd
for each cpu in pd

set {prev_,max}_spare_cap

bail if prev_ and max_spare_cap < 0 (was == 0 before )

{base_,prev_,cur_}energy = compute_energy

So with the patch we potentially compute energy for a saturated PD
according:

compute_energy()

if (dst_cpu >= 0)
busy_time = min(eenv->pd_cap, eenv->busy_time + eenv->task_busy_time)
<----(a)---> <--------------(b)------------------->

energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap)

If (b) > (a) then we're saturated and 'energy' is bogus.

The way to fix this is up for discussion:

(1) feec() returning prev_cpu
(2) feec() returning -1 (forcing wakeup into sis() -> sic())
(3) using uclamped values for task and rq utilization

None of those have immediately given the desired task placement on
mainline (2 tasks on each of the 4 little CPUs and no task on the 2 big
CPUs on my [l B B l l l] w/ CPU capacities = [446 1024 1024 446 446 446]
machine) you can achieve with uclamp sum aggregation.

[...]

2024-03-19 17:23:55

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/7] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

On 19/03/2024 15:34, Dietmar Eggemann wrote:
> On 01/02/2024 14:11, Hongyan Xia wrote:
>> From: Hongyan Xia <[email protected]>
>>
>> That commit creates further problems because 0 spare capacity can be
>> either a real indication that the CPU is maxed out, or the CPU is
>> UCLAMP_MAX throttled, but we end up giving all of them a chance which
>> can result in bogus energy calculations. It also tends to schedule
>> tasks on the same CPU and requires load balancing patches. Sum
>> aggregation solves these problems and this patch is not needed.
>>
>> This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.
>
> I assume you did this revert especially for the 'Scenario 5: 8 tasks
> with UCLAMP_MAX of 120' testcase?

More or less. Actually, you can already see the problem in Scenario 1.
Ideally the 4 uclamp_max tasks should be evenly distributed across the 4
little CPUs, but from time to time task placement puts more than 1 such
task on the same CPU, leaving some other little CPUs unoccupied.

> IMHO, the issue is especially visible in compute_energy()'s busy_time
> computation with a valid destination CPU (dst_cpu >= 0). I.e. when we
> have to add performance domain (pd) and task busy time.
>
> find_energy_efficient_cpu() (feec())
>
> for each pd
> for each cpu in pd
>
> set {prev_,max}_spare_cap
>
> bail if prev_ and max_spare_cap < 0 (was == 0 before )
>
> {base_,prev_,cur_}energy = compute_energy
>
> So with the patch we potentially compute energy for a saturated PD
> according:
>
> compute_energy()
>
> if (dst_cpu >= 0)
> busy_time = min(eenv->pd_cap, eenv->busy_time + eenv->task_busy_time)
> <----(a)---> <--------------(b)------------------->
>
> energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap)
>
> If (b) > (a) then we're saturated and 'energy' is bogus.

Yeah, I think what's happening is that placing more tasks on the same
CPU won't increase the energy computation, so in the end task placement
thinks it's the better decision. The root issue is that once you have
uclamp_max, you can theoretically fit an infinite number of such tasks
on the same CPU.
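
To put rough numbers on the min() clamping quoted above (assumed values
for the 4 x 446 little PD of the machine you describe, not taken from a
real trace or real eenv contents):

#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long pd_cap = 4 * 446;		/* 1784 */
	unsigned long pd_busy = 1700;		/* PD already close to full */
	unsigned long task_busy = 300;		/* capped task being placed */

	/* Once the sum saturates at pd_cap, placing this task (and the
	 * next one, and the next...) no longer changes busy_time, so the
	 * energy estimate stays flat and the placement looks "free". */
	printf("busy_time = %lu\n", min_ul(pd_cap, pd_busy + task_busy));
	return 0;
}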

> The way to fix this is up for discussion:
>
> (1) feec() returning prev_cpu
> (2) feec() returning -1 (forcing wakeup into sis() -> sic())
> (3) using uclamped values for task and rq utilization
>
> None of those have immediately given the desired task placement on
> mainline (2 tasks on each of the 4 little CPUs and no task on the 2 big
> CPUs on my [l B B l l l] w/ CPU capacities = [446 1024 1024 446 446 446]
> machine) you can achieve with uclamp sum aggregation.

Personally, from the results I've seen I definitely prefer (3), although
(3) has other problems. One issue is that sum aggregation pushes
utilization up with uclamp_min, but the energy consumption definitely
won't be that high; the real energy lies somewhere between its util_avg
and util_avg_uclamp.

I haven't seen this become a real problem, but maybe we could see even
better task placement if this were accounted for.

> [...]

2024-03-20 15:28:20

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

On 19/03/2024 12:50, Hongyan Xia wrote:
> On 18/03/2024 18:21, Dietmar Eggemann wrote:
>> On 01/02/2024 14:11, Hongyan Xia wrote:
>>
>> [...]
>>
>>>       /*
>>>        * The code below (indirectly) updates schedutil which looks at
>>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct
>>> task_struct *p, int flags)
>>>   #ifdef CONFIG_UCLAMP_TASK
>>>       util_uclamp_enqueue(&rq->cfs.avg, p);
>>>       update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>>> +    if (migrated)
>>
>> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>>
>>    if (flags & ENQUEUE_MIGRATED)
>
> I'm not sure if they are entirely equivalent. Both
> task_on_rq_migrating() and !task_on_rq_migrating() can have
> last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
> Maybe I'm missing something?

That's true. There are 2:

(!task_on_rq_migrating() && !p->se.avg.last_update_time)

cases:

(1) wakeup migration: ENQUEUE_MIGRATED (0x40) set in ttwu_do_wakeup()

(2) new task: wake_up_new_task() -> activate_task(), ENQUEUE_MIGRATED is
not set.

I assume you want to add the task's util_avg_uclamp to
rq->root_cfs_util_uclamp in (2) too? So:

if (flags & ENQUEUE_MIGRATED || task_new)

[...]

>>>   /* avg must belong to the queue this se is on. */
>>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>>> +void update_util_uclamp(u64 now,
>>> +            u64 last_update_time,
>>> +            u32 period_contrib,
>>> +            struct sched_avg *avg,
>>> +            struct task_struct *p)
>>>   {
>>
>> I was wondering why you use such a long parameter list for this
>> function.
>>
>> IMHO
>>
>>    update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct
>> sched_entity *se)
>>
>> would work as well. You could check whether se represents a task inside
>> update_util_uclamp() as well as get last_update_time and period_contrib.
>>
>> The only reason I see is that you want to use this function for the RT
>> class as well later, where you have to deal with 'struct rt_rq' and
>> 'struct sched_rt_entity'.
>>
>> IMHO, it's always better to keep the implementation to the minimum and
>> only introduce changes which are related to the functionality you
>> present. This would make reviewing so much easier.
>
> Those parameters are necessary because of
>
> if (___update_load_sum()) {
>     ___update_load_avg();
>     update_util_uclamp();
> }

So you need ___update_load_avg() happening for the `normal uclamp path`
and `last_update_time` and `period_contrib` for the `decay path` (1) of
update_util_uclamp()?

This is pretty hard to grasp. Isn't there a cleaner solution for this?

Why do you need the:

if (!avg)
return;

thing in update_util_uclamp()? __update_load_avg_blocked_se() calls
update_util_uclamp(..., avg = NULL, ...) but this should follow (1)?

> We have to cache last_update_time and period_contrib, because after
> ___update_load_sum() is done, both parameters in sched_avg have already
> been updated and overwritten and we lose the timestamp when the
> sched_avg was previously updated. update_util_uclamp() needs to know
> when sched_avg was previously updated.
>
>>
>>>       unsigned int util, uclamp_min, uclamp_max;
>>>       int delta;
>>>   -    if (!p->se.on_rq)
>>> +    if (!p->se.on_rq) {
>>> +        ___update_util_uclamp_towards(now,
>>> +                          last_update_time,
>>> +                          period_contrib,
>>> +                          &p->se.avg.util_avg_uclamp,
>>> +                          0);
>>>           return;
>>> +    }
>>
>> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
>> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
>
> I would say this still belongs to 3/7, because 2/7 only implements
> utilization for on_rq tasks. This patch implements utilization for both
> on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
> on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).

Looks like you maintain `rq->cfs.avg.util_avg_uclamp` (2) (consider all
runnable tasks) to be able to:

(a) set rq->root_cfs_util_uclamp (3) to max((3), (2))

(b) check that if 'rq->cfs.h_nr_running == 0' that (2) = 0 too.

uclamp is based on runnable tasks (so enqueue/dequeue) but you uclamp
around util_avg which has contributions from blocked tasks. And that's
why you have to maintain (3). And (3) only decays within
__update_load_avg_cfs_rq().
Can this be the reason why the implementation is so convoluted? It's
definitely more complicated than util_est with its easy layout:

enqueue_task_fair()
util_est_enqueue()

dequeue_task_fair()
util_est_dequeue()
util_est_update()

> For tasks, we also have two utilization numbers, one is on_rq and the
> other is on_rq plus !on_rq. However, we know they do not have to be
> stored in separate variables and util_avg_uclamp can capture both.

Here you lost me. Which value does 'p->se.avg.util_avg_uclamp' store?
'runnable' or 'runnable + blocking'? I would say it's the latter one but
like in PELT we don't update the task signal when it's sleeping.

> Moving this to 2/7 might be fine, although then this would be the only
> !on_rq utilization in 2/7. I can add comments to clarify the situation.
>
>> This is the util_avg_uclamp handling for a se (task):
>>
>> enqueue_task_fair()
>>
>>    util_uclamp_enqueue()
>>
>>    update_util_uclamp()                 (1)
>>
>>      if (!p->se.on_rq)                  (x)
>>        ___update_util_uclamp_towards()  (2)
>>
>> dequeue_task_fair()
>>
>>    util_uclamp_dequeue()
>>
>> __update_load_avg_blocked_se()
>>
>>    update_util_uclamp()
>>
>>      (x)
>>
>> __update_load_avg_se()
>>
>>    update_util_uclamp()
>>
>>      (x)
>>
>> Why is it so unbalanced? Why do you need (1) and (2)?
>>
>> Isn't this just an indication that the se util_avg_uclamp
>> is done at the wrong places?
>>
>> Is there an other way to provide a task/rq signal as the base
>> for uclamp sum aggregation?
>
> (2) won't happen, because at that point p->se.on_rq must be 1.

I see.

> The sequence during enqueue_task_fair() is:
>
> enqueue_task_fair(p)
>     enqueue_entity(se)
>         update_load_avg(se)
>             update_util_uclamp(p) (decay path)
>         p->se.on_rq = 1;
>     util_uclamp_enqueue(p)
>     update_util_uclamp(p) (update path)
>
> The only reason why we want to issue update_util_uclamp() after seeing
> on_rq == 1 is that now it goes down the normal uclamp path and not the
> decay path. Otherwise, uclamp won't immediately engage during enqueue
> and has to wait a timer tick.

OK, I see, the task contribution should be visible immediately after the
enqueue.

> Ideally, we should:
>
> enqueue_task_fair(p)
>     enqueue_entity(se)
>         update_load_avg(se)   
>             util_uclamp_enqueue(p)
>             update_util_uclamp(p) (force update path)
>         p->se.on_rq = 1;
>
> This requires us to invent a new flag to update_util_uclamp() to force
> the update path even when p->se.on_rq is 0.

I guess you have to go this way to achieve a cleaner/easier integration
of 'util_avg_uclamp'.

> At the moment I'm treating util_avg_uclamp in the same way as util_est
> after the comments in v1, which is independent and has its own code
> path. We can go back to the old style, where util_avg_uclamp is closer
> to how we treat se rather than a separate thing like util_est.

Except that the 'util_est' integration is much easier to understand. And
this is because 'util_est' is clearly runnable-based only and is not
linked to any blocked part.

[...]


2024-03-20 17:23:23

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

On 20/03/2024 15:27, Dietmar Eggemann wrote:
> On 19/03/2024 12:50, Hongyan Xia wrote:
>> On 18/03/2024 18:21, Dietmar Eggemann wrote:
>>> On 01/02/2024 14:11, Hongyan Xia wrote:
>>>
>>> [...]
>>>
>>>>       /*
>>>>        * The code below (indirectly) updates schedutil which looks at
>>>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct
>>>> task_struct *p, int flags)
>>>>   #ifdef CONFIG_UCLAMP_TASK
>>>>       util_uclamp_enqueue(&rq->cfs.avg, p);
>>>>       update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>>>> +    if (migrated)
>>>
>>> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>>>
>>>    if (flags & ENQUEUE_MIGRATED)
>>
>> I'm not sure if they are entirely equivalent. Both
>> task_on_rq_migrating() and !task_on_rq_migrating() can have
>> last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
>> Maybe I'm missing something?
>
> That's true. There are 2:
>
> (!task_on_rq_migrating() && !p->se.avg.last_update_time)
>
> cases:
>
> (1) wakeup migration: ENQUEUE_MIGRATED (0x40) set in ttwu_do_wakeup()
>
> (2) new task: wake_up_new_task() -> activate_task(), ENQUEUE_MIGRATED is
> not set.
>
> I assume you want to add the task's util_avg_uclamp to
> rq->root_cfs_util_uclamp in (2) too? So:
>
> if (flags & ENQUEUE_MIGRATED || task_new)

I see. Maybe we don't need to check last_update_time. I'll take a look.

Although if we want to integrate sum aggregation with the se rather than
making it fully independent like util_est (as in your comments below),
we'll probably just do that in

update_load_avg()
if (!se->avg.last_update_time && (flags & DO_ATTACH)) { ... }

and not bother doing it outside the loop in enqueue_task_fair() anyway.

> [...]
>
>>>>   /* avg must belong to the queue this se is on. */
>>>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>>>> +void update_util_uclamp(u64 now,
>>>> +            u64 last_update_time,
>>>> +            u32 period_contrib,
>>>> +            struct sched_avg *avg,
>>>> +            struct task_struct *p)
>>>>   {
>>>
>>> I was wondering why you use such a long parameter list for this
>>> function.
>>>
>>> IMHO
>>>
>>>    update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct
>>> sched_entity *se)
>>>
>>> would work as well. You could check whether se represents a task inside
>>> update_util_uclamp() as well as get last_update_time and period_contrib.
>>>
>>> The only reason I see is that you want to use this function for the RT
>>> class as well later, where you have to deal with 'struct rt_rq' and
>>> 'struct sched_rt_entity'.
>>>
>>> IMHO, it's always better to keep the implementation to the minimum and
>>> only introduce changes which are related to the functionality you
>>> present. This would make reviewing so much easier.
>>
>> Those parameters are necessary because of
>>
>> if (___update_load_sum()) {
>>     ___update_load_avg();
>>     update_util_uclamp();
>> }
>
> So you need ___update_load_avg() happening for the `normal uclamp path`
> and `last_update_time` and `period_contrib` for the `decay path` (1) of
> update_util_uclamp()?
>
> This is pretty hard to grasp. Isn't there a cleaner solution for this?

You are correct. Not sure if there's a better way.

> Why do you need the:
>
> if (!avg)
> return;
>
> thing in update_util_uclamp()? __update_load_avg_blocked_se() calls
> update_util_uclamp(..., avg = NULL, ...) but this should follow (1)?

I added it as a guard to rule out edge cases, but I think when you do
__update_load_avg_blocked_se(), it has to be !on_rq so it will never
enter this path. I think it doesn't hurt but I can remove it.

>> We have to cache last_update_time and period_contrib, because after
>> ___update_load_sum() is done, both parameters in sched_avg have already
>> been updated and overwritten and we lose the timestamp when the
>> sched_avg was previously updated. update_util_uclamp() needs to know
>> when sched_avg was previously updated.
>>
>>>
>>>>       unsigned int util, uclamp_min, uclamp_max;
>>>>       int delta;
>>>>   -    if (!p->se.on_rq)
>>>> +    if (!p->se.on_rq) {
>>>> +        ___update_util_uclamp_towards(now,
>>>> +                          last_update_time,
>>>> +                          period_contrib,
>>>> +                          &p->se.avg.util_avg_uclamp,
>>>> +                          0);
>>>>           return;
>>>> +    }
>>>
>>> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
>>> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
>>
>> I would say this still belongs to 3/7, because 2/7 only implements
>> utilization for on_rq tasks. This patch implements utilization for both
>> on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
>> on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).
>
> Looks like you maintain `rq->cfs.avg.util_avg_uclamp` (2) (consider all
> runnable tasks) to be able to:
>
> (a) set rq->root_cfs_util_uclamp (3) to max((3), (2))
>
> (b) check that if 'rq->cfs.h_nr_running == 0' that (2) = 0 too.
>
> uclamp is based on runnable tasks (so enqueue/dequeue) but you uclamp
> around util_avg which has contributions from blocked tasks. And that's
> why you have to maintain (3). And (3) only decays within
> __update_load_avg_cfs_rq().
> Can this be the reason why the implementation is so convoluted? It's
> definitely more complicated than util_est with its easy layout:
>
> enqueue_task_fair()
> util_est_enqueue()
>
> dequeue_task_fair()
> util_est_dequeue()
> util_est_update()

There's only one rule here, which is that the root value must be the
sum of all task util_avg_uclamp values. If PELT slowly decays each
util_avg, then the sum will also decay in a similar fashion. The code
here is just doing that.

This 'convoluted' property is from PELT and not from sum aggregation.
Actually this is what PELT used to do a while back, when the root CFS
util was the sum of all tasks instead of being tracked separately, and
we do the same decay here as we did back then.

You are right that this feels convoluted, but sum aggregation doesn't
attempt to change how PELT works, so the decaying behavior comes from
PELT itself. I can keep exploring ways to make the logic easier to
follow.
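
To spell the property out (in pseudo-comment form; decay_load() is the
existing PELT helper, and the equality holds up to integer rounding):

/*
 * Invariant: rq->root_cfs_util_uclamp == sum of util_avg_uclamp over
 * all tasks attached to this rq, runnable or blocked.
 *
 * PELT decay is linear in the signal value:
 *
 *      decay_load(u1, n) + decay_load(u2, n) == decay_load(u1 + u2, n)
 *
 * so decaying the root sum directly (from __update_load_avg_cfs_rq())
 * gives the same result as decaying every task signal and re-summing,
 * without having to walk all tasks.
 */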

>> For tasks, we also have two utilization numbers, one is on_rq and the
>> other is on_rq plus !on_rq. However, we know they do not have to be
>> stored in separate variables and util_avg_uclamp can capture both.
>
> Here you lost me. Which value does 'p->se.avg.util_avg_uclamp' store?
> 'runnable' or 'runnable + blocking'? I would say it's the latter one but
> like in PELT we don't update the task signal when it's sleeping.

The latter. We don't update the task signal while it's sleeping, but we
do bring it up to date when we need to use it, and that's enough. The
same is true for all blocked util_avg.
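
I.e. something like this on the consumer side (sketch only,
task_util_uclamp() is a made-up helper name):

static unsigned int task_util_uclamp(struct task_struct *p, u64 now)
{
        /*
         * Bring a sleeping task's clamped signal up to date on demand,
         * the same way blocked util_avg is only decayed when somebody
         * needs the value.
         */
        if (!p->se.on_rq)
                ___update_util_uclamp_towards(now,
                                              p->se.avg.last_update_time,
                                              p->se.avg.period_contrib,
                                              &p->se.avg.util_avg_uclamp, 0);

        return READ_ONCE(p->se.avg.util_avg_uclamp);
}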

>> Moving this to 2/7 might be fine, although then this would be the only
>> !on_rq utilization in 2/7. I can add comments to clarify the situation.
>>
>>> This is the util_avg_uclamp handling for a se (task):
>>>
>>> enqueue_task_fair()
>>>
>>>    util_uclamp_enqueue()
>>>
>>>    update_util_uclamp()                 (1)
>>>
>>>      if (!p->se.on_rq)                  (x)
>>>        ___update_util_uclamp_towards()  (2)
>>>
>>> dequeue_task_fair()
>>>
>>>    util_uclamp_dequeue()
>>>
>>> __update_load_avg_blocked_se()
>>>
>>>    update_util_uclamp()
>>>
>>>      (x)
>>>
>>> __update_load_avg_se()
>>>
>>>    update_util_uclamp()
>>>
>>>      (x)
>>>
>>> Why is it so unbalanced? Why do you need (1) and (2)?
>>>
>>> Isn't this just an indication that the se util_avg_uclamp
>>> is done at the wrong places?
>>>
>>> Is there an other way to provide a task/rq signal as the base
>>> for uclamp sum aggregation?
>>
>> (2) won't happen, because at that point p->se.on_rq must be 1.
>
> I see.
>
>> The sequence during enqueue_task_fair() is:
>>
>> enqueue_task_fair(p)
>>     enqueue_entity(se)
>>         update_load_avg(se)
>>             update_util_uclamp(p) (decay path)
>>         p->se.on_rq = 1;
>>     util_uclamp_enqueue(p)
>>     update_util_uclamp(p) (update path)
>>
>> The only reason why we want to issue update_util_uclamp() after seeing
>> on_rq == 1 is that now it goes down the normal uclamp path and not the
>> decay path. Otherwise, uclamp won't immediately engage during enqueue
>> and has to wait a timer tick.
>
> OK, I see, the task contribution should be visible immediately after the
> enqueue.
>
>> Ideally, we should:
>>
>> enqueue_task_fair(p)
>>     enqueue_entity(se)
>>         update_load_avg(se)
>>             util_uclamp_enqueue(p)
>>             update_util_uclamp(p) (force update path)
>>         p->se.on_rq = 1;
>>
>> This requires us to invent a new flag to update_util_uclamp() to force
>> the update path even when p->se.on_rq is 0.
>
> I guess you have to go this way to achieve a cleaner/easier integration
> of 'util_avg_uclamp'.

If we don't want to keep util_avg_uclamp separate like util_est, but
instead move it closer to the se, then we can explore this option.
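
Something along these lines, where UTIL_UCLAMP_FORCE is a hypothetical
flag (just a sketch of the idea, untested):

#define UTIL_UCLAMP_FORCE       0x1     /* hypothetical flag */

void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
                        struct sched_avg *avg, struct task_struct *p,
                        int flags)
{
        if (!p->se.on_rq && !(flags & UTIL_UCLAMP_FORCE)) {
                /* Decay path, as in the current patch. */
                ___update_util_uclamp_towards(now, last_update_time,
                                              period_contrib,
                                              &p->se.avg.util_avg_uclamp, 0);
                return;
        }

        /*
         * Normal path: clamp util_avg within uclamp_min/max and
         * propagate the delta to the root, as today.
         */
}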

>> At the moment I'm treating util_avg_uclamp in the same way as util_est
>> after the comments in v1, which is independent and has its own code
>> path. We can go back to the old style, where util_avg_uclamp is closer
>> to how we treat se rather than a separate thing like util_est.
>
> Except that 'util_est' integration is much easier to understand. And
> this is because 'util_est' is clearly runnable-based only and is not
> linked to any blocked part.

True.

Well, I can go back to the style in RFC v1. One big advantage of v2 is
that we can compile out uclamp support very easily, because it's
treated more or less as an independent thing.
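
For example (a sketch; only the helper names come from this series, the
stub signatures are guessed):

#ifdef CONFIG_UCLAMP_TASK
void util_uclamp_enqueue(struct sched_avg *avg, struct task_struct *p);
void util_uclamp_dequeue(struct sched_avg *avg, struct task_struct *p);
#else
static inline void util_uclamp_enqueue(struct sched_avg *avg,
                                       struct task_struct *p) { }
static inline void util_uclamp_dequeue(struct sched_avg *avg,
                                       struct task_struct *p) { }
#endif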

> [...]
>