LinuxLists.cc - [PATCH] sched/documentation: elaborate on uclamp limitations

2023-05-05 15:44:55

Subject: [PATCH] sched/documentation: elaborate on uclamp limitations

The story in 5.2 about util_avg abruptly jumping from 300 when
Fmax/Fmin == 3 to 1024 when Fmax/Fmin == 4 hides some details about how
clock_pelt works behind the scenes. Explicitly mention it to make it
easier for readers to follow.

Signed-off-by: Hongyan Xia <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
Documentation/scheduler/sched-util-clamp.rst | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
index 74d5b7c6431d..524df07bceba 100644
--- a/Documentation/scheduler/sched-util-clamp.rst
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -669,6 +669,19 @@ but not proportional to Fmax/Fmin.

p0->util_avg = 300 + small_error

+The reason why util_avg is around 300 even though it runs for 900 at Fmin is:
+Although running at Fmin reduces the rate of rq_clock_pelt() to 1/3 thus
+accumulates util_sum at 1/3 of the rate at Fmax, the clock period
+(rq_clock_pelt() now minus previous rq_clock_pelt()) in:
+
+::
+
+ util_sum / clock period = util_avg
+
+does not shrink to 1/3, since rq->clock_pelt is periodically synchronized with
+rq->clock_task as long as there's idle time. As a result, we get util_avg of
+about 300, not 900.
+
Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:

::
@@ -682,6 +695,10 @@ this happens, then the _actual_ util_avg will become:

p0->util_avg = 1024

+This is because rq->clock_pelt is no longer synchronized with the task clock.
+The clock period therefore is proportionally shrunk by the same ratio of
+(Fmax/Fmin), giving us a maximal util_avg of 1024.
+
If task p1 wakes up on this CPU, which have:

::
--
2.34.1

2023-05-05 15:46:22

by Jonathan Corbet

[permalink] [raw]

Subject: Re: [PATCH] sched/documentation: elaborate on uclamp limitations

Hongyan Xia <[email protected]> writes:

> The story in 5.2 about util_avg abruptly jumping from 300 when
> Fmax/Fmin == 3 to 1024 when Fmax/Fmin == 4 hides some details about how
> clock_pelt works behind the scenes. Explicitly mention it to make it
> easier for readers to follow.

So this is a nit, but...

> +Although running at Fmin reduces the rate of rq_clock_pelt() to 1/3 thus
> +accumulates util_sum at 1/3 of the rate at Fmax, the clock period
> +(rq_clock_pelt() now minus previous rq_clock_pelt()) in:
> +
> +::
> +
> + util_sum / clock period = util_avg

This can be written as:

...previous rq_clock_pelt()) in::

util_sum / clock period = util_avg

reducing the markup clutter and making the document a little more
readable.

Thanks,

jon

2023-05-18 11:40:53

by Qais Yousef

[permalink] [raw]

Subject: Re: [PATCH] sched/documentation: elaborate on uclamp limitations

Please CC sched maintainers (Ingo + Peter) next time as they should pick this
up ultimately and they won't see it from the list only.

On 05/05/23 16:24, Hongyan Xia wrote:
> The story in 5.2 about util_avg abruptly jumping from 300 when
> Fmax/Fmin == 3 to 1024 when Fmax/Fmin == 4 hides some details about how
> clock_pelt works behind the scenes. Explicitly mention it to make it
> easier for readers to follow.
>
> Signed-off-by: Hongyan Xia <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> ---
> Documentation/scheduler/sched-util-clamp.rst | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
> index 74d5b7c6431d..524df07bceba 100644
> --- a/Documentation/scheduler/sched-util-clamp.rst
> +++ b/Documentation/scheduler/sched-util-clamp.rst
> @@ -669,6 +669,19 @@ but not proportional to Fmax/Fmin.
>
> p0->util_avg = 300 + small_error
>
> +The reason why util_avg is around 300 even though it runs for 900 at Fmin is:
> +Although running at Fmin reduces the rate of rq_clock_pelt() to 1/3 thus
> +accumulates util_sum at 1/3 of the rate at Fmax, the clock period
> +(rq_clock_pelt() now minus previous rq_clock_pelt()) in:
> +
> +::
> +
> + util_sum / clock period = util_avg
> +
> +does not shrink to 1/3, since rq->clock_pelt is periodically synchronized with
> +rq->clock_task as long as there's idle time. As a result, we get util_avg of
> +about 300, not 900.
> +

I feel neutral about these changes. It does answer some questions, but poses
more questions like what is clock_pelt. So we might end up in recursive
regression of explaining the explanation.

I don't think we have a doc about clock_pelt. Worth adding one and just add
a reference to it from here for those interested in understanding more details
on why we need to go to idle to correct util_avg? I think our code has
explanation, a reference to update_rq_clock_pelt() might suffice too.

Vincent, do you have an opinion here?

Thanks!

--
Qais Yousef

> Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:
>
> ::
> @@ -682,6 +695,10 @@ this happens, then the _actual_ util_avg will become:
>
> p0->util_avg = 1024
>
> +This is because rq->clock_pelt is no longer synchronized with the task clock.
> +The clock period therefore is proportionally shrunk by the same ratio of
> +(Fmax/Fmin), giving us a maximal util_avg of 1024.
> +
> If task p1 wakes up on this CPU, which have:
>
> ::
> --
> 2.34.1
>

2023-05-18 12:44:27

by Hongyan Xia

[permalink] [raw]

Subject: Re: [PATCH] sched/documentation: elaborate on uclamp limitations

Hi Qais,

On 2023-05-18 12:30, Qais Yousef wrote:
> Please CC sched maintainers (Ingo + Peter) next time as they should pick this
> up ultimately and they won't see it from the list only.

Will do. I was using the get_maintainers script and I thought that gave
me all the CCs.

> On 05/05/23 16:24, Hongyan Xia wrote:
>> The story in 5.2 about util_avg abruptly jumping from 300 when
>> Fmax/Fmin == 3 to 1024 when Fmax/Fmin == 4 hides some details about how
>> clock_pelt works behind the scenes. Explicitly mention it to make it
>> easier for readers to follow.
>>
>> Signed-off-by: Hongyan Xia <[email protected]>
>> Cc: Qais Yousef <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> ---
>> Documentation/scheduler/sched-util-clamp.rst | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
>> index 74d5b7c6431d..524df07bceba 100644
>> --- a/Documentation/scheduler/sched-util-clamp.rst
>> +++ b/Documentation/scheduler/sched-util-clamp.rst
>> @@ -669,6 +669,19 @@ but not proportional to Fmax/Fmin.
>>
>> p0->util_avg = 300 + small_error
>>
>> +The reason why util_avg is around 300 even though it runs for 900 at Fmin is:
>> +Although running at Fmin reduces the rate of rq_clock_pelt() to 1/3 thus
>> +accumulates util_sum at 1/3 of the rate at Fmax, the clock period
>> +(rq_clock_pelt() now minus previous rq_clock_pelt()) in:
>> +
>> +::
>> +
>> + util_sum / clock period = util_avg
>> +
>> +does not shrink to 1/3, since rq->clock_pelt is periodically synchronized with
>> +rq->clock_task as long as there's idle time. As a result, we get util_avg of
>> +about 300, not 900.
>> +
>
> I feel neutral about these changes. It does answer some questions, but poses
> more questions like what is clock_pelt. So we might end up in recursive
> regression of explaining the explanation.
>
> I don't think we have a doc about clock_pelt. Worth adding one and just add
> a reference to it from here for those interested in understanding more details
> on why we need to go to idle to correct util_avg? I think our code has
> explanation, a reference to update_rq_clock_pelt() might suffice too.
>
> Vincent, do you have an opinion here?

Sounds reasonable. I don't mind drafting a doc or just a couple of
paragraphs for clock_pelt (or all the different clocks like clock,
clock_task, clock_idle_*), if that's what we can agree on.

Hongyan

2023-05-23 09:28:59

by Vincent Guittot

[permalink] [raw]

Subject: Re: [PATCH] sched/documentation: elaborate on uclamp limitations

On Thu, 18 May 2023 at 14:42, Hongyan Xia <[email protected]> wrote:
>
> Hi Qais,
>
> On 2023-05-18 12:30, Qais Yousef wrote:
> > Please CC sched maintainers (Ingo + Peter) next time as they should pick this
> > up ultimately and they won't see it from the list only.
>
> Will do. I was using the get_maintainers script and I thought that gave
> me all the CCs.
>
> > On 05/05/23 16:24, Hongyan Xia wrote:
> >> The story in 5.2 about util_avg abruptly jumping from 300 when
> >> Fmax/Fmin == 3 to 1024 when Fmax/Fmin == 4 hides some details about how
> >> clock_pelt works behind the scenes. Explicitly mention it to make it
> >> easier for readers to follow.
> >>
> >> Signed-off-by: Hongyan Xia <[email protected]>
> >> Cc: Qais Yousef <[email protected]>
> >> Cc: Vincent Guittot <[email protected]>
> >> ---
> >> Documentation/scheduler/sched-util-clamp.rst | 17 +++++++++++++++++
> >> 1 file changed, 17 insertions(+)
> >>
> >> diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
> >> index 74d5b7c6431d..524df07bceba 100644
> >> --- a/Documentation/scheduler/sched-util-clamp.rst
> >> +++ b/Documentation/scheduler/sched-util-clamp.rst
> >> @@ -669,6 +669,19 @@ but not proportional to Fmax/Fmin.
> >>
> >> p0->util_avg = 300 + small_error
> >>
> >> +The reason why util_avg is around 300 even though it runs for 900 at Fmin is:

What does it mean running for 900 at Fmin ? util_avg is a ratio in the
range [0:1024] without time unit

> >> +Although running at Fmin reduces the rate of rq_clock_pelt() to 1/3 thus
> >> +accumulates util_sum at 1/3 of the rate at Fmax, the clock period
> >> +(rq_clock_pelt() now minus previous rq_clock_pelt()) in:
> >> +
> >> +::
> >> +
> >> + util_sum / clock period = util_avg

I don't get the meaning of the formula above ? There is no "clock
period" (although I'm not sure what it means here) involved when
computing util_avg

Also, there is no linear relation between util_avg and Fmin/Fmax
ratio. Fmin/Fmax ratio is meaningful in regards to the ratio between
running time and period time of a periodic task. I understand the
reference of pelt in this document as a quite simplified description
of PELT so I'm not sure that adding a partial explanation will help.
It will probably cause more confusion to people. The only thing that
is sure, is that PELT expects some idle time to stay fully invariant
for periodic task

> >> +
> >> +does not shrink to 1/3, since rq->clock_pelt is periodically synchronized with
> >> +rq->clock_task as long as there's idle time. As a result, we get util_avg of
> >> +about 300, not 900.
> >> +
> >
> > I feel neutral about these changes. It does answer some questions, but poses
> > more questions like what is clock_pelt. So we might end up in recursive
> > regression of explaining the explanation.
> >
> > I don't think we have a doc about clock_pelt. Worth adding one and just add
> > a reference to it from here for those interested in understanding more details
> > on why we need to go to idle to correct util_avg? I think our code has
> > explanation, a reference to update_rq_clock_pelt() might suffice too.
> >
> > Vincent, do you have an opinion here?
>
> Sounds reasonable. I don't mind drafting a doc or just a couple of
> paragraphs for clock_pelt (or all the different clocks like clock,
> clock_task, clock_idle_*), if that's what we can agree on.

I don't have a strong opinion on adding a doc on PELT.

>
> Hongyan

2023-05-23 15:00:26

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [PATCH] sched/documentation: elaborate on uclamp limitations

On 23/05/2023 11:23, Vincent Guittot wrote:
> On Thu, 18 May 2023 at 14:42, Hongyan Xia <[email protected]> wrote:
>>
>> Hi Qais,
>>
>> On 2023-05-18 12:30, Qais Yousef wrote:
>>> Please CC sched maintainers (Ingo + Peter) next time as they should pick this
>>> up ultimately and they won't see it from the list only.
>>
>> Will do. I was using the get_maintainers script and I thought that gave
>> me all the CCs.
>>
>>> On 05/05/23 16:24, Hongyan Xia wrote:

[...]

>>>> diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
>>>> index 74d5b7c6431d..524df07bceba 100644
>>>> --- a/Documentation/scheduler/sched-util-clamp.rst
>>>> +++ b/Documentation/scheduler/sched-util-clamp.rst
>>>> @@ -669,6 +669,19 @@ but not proportional to Fmax/Fmin.
>>>>
>>>> p0->util_avg = 300 + small_error
>>>>
>>>> +The reason why util_avg is around 300 even though it runs for 900 at Fmin is:
>
> What does it mean running for 900 at Fmin ? util_avg is a ratio in the
> range [0:1024] without time unit
>
>>>> +Although running at Fmin reduces the rate of rq_clock_pelt() to 1/3 thus
>>>> +accumulates util_sum at 1/3 of the rate at Fmax, the clock period
>>>> +(rq_clock_pelt() now minus previous rq_clock_pelt()) in:
>>>> +
>>>> +::
>>>> +
>>>> + util_sum / clock period = util_avg
>
> I don't get the meaning of the formula above ? There is no "clock
> period" (although I'm not sure what it means here) involved when
> computing util_avg

I also didn't get this one. IMHO. the relation between util_avg and
util_sum is `divider = LOAD_AVG_MAX - 1024 + avg->period_contrib`. But
I can't see how this matters here.

The crucial point here is IMHO as long we have idle time (p->util_avg <
CPU (current) capacity) the util_avg will not raise to 1024 since at
wakeup util_avg will be only decayed (since the task was sleeping, i.e.
!!se->on_rq = 0). And we are scale invariant thanks to the functionality
in update_rq_clock_pelt() (which is executed when p is running).

The pelt clock update at this moment (wakeup) is setting clock_pelt to
clock_task since rq->curr is the idle task but IMHO that is not the
reason why p->util_avg behaves like this.

The moment `p->util_avg >= CPU (current) capacity` there is no idle time
left, i.e. no such `only decay` updates happens for p anymore (only
`accrue/decay` updates in tick) and the result is that p->util_avg goes
1024.

> Also, there is no linear relation between util_avg and Fmin/Fmax
> ratio. Fmin/Fmax ratio is meaningful in regards to the ratio between
> running time and period time of a periodic task. I understand the
> reference of pelt in this document as a quite simplified description
> of PELT so I'm not sure that adding a partial explanation will help.
> It will probably cause more confusion to people. The only thing that
> is sure, is that PELT expects some idle time to stay fully invariant
> for periodic task

+1 ... we have to be able to understand the code. BTW, schedutil.rst has
also paragraphs about PELT and `Frequency / CPU Invariance` and also
refers to kernel/sched/pelt.h:update_rq_clock_pelt() for details.

[...]