2023-09-12 12:04:52

by Lukasz Luba

Subject: Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins

Hi Peter,

On 9/7/23 21:16, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
>
>>> What task characteristic is tied to this? That is, this seems trivial to
>>> modify per-task.
>>
>> In particular the Speedometer test and the main browser task, which
>> reaches ~900 util but sometimes vanishes and waits for other background
>> tasks to do something. In the meantime it can decay and wake up on
>> Mid/Little (which can cause a penalty to the score of up to 5-10% vs.
>> pinning the task to big CPUs). So, a longer util_est helps to avoid
>> at least a very bad down migration to the Littles...
>
> Do they do a few short activations (wakeup/sleeps) while waiting? That
> would indeed completely ruin things since the EWMA thing is activation
> based.
>
> I wonder if there's anything sane we can do here...

My apologies for the delay, I have tried to push the graphs for you.

The experiment is on a Pixel 7*. It's running the browser on the phone
with the test 'Speedometer 2.0'. It's a web test (you can also run it on
your phone) available here, no need to install anything:
https://browserbench.org/Speedometer2.0/

Here is the Jupyter notebook [1], with plots of the signals:
- top 20 tasks' (based on runtime) utilization
- Util EST signals for the top 20 tasks, with the longer-decaying ewma
filter (which is the 'red' plot called 'ewma')
- the main task (comm=CrRendererMain) Util, Util EST and task residency
(which tries to stick to CPUs 6,7*)
- the test score was 144.6 (vs. ~134 with the fast-decay ewma), so
staying on the big CPUs helps the score in this case

(The plots are interactive, you can zoom in with the 'Box Zoom' icon;
e.g. you can zoom into the task activation plot, which is also linked
with the 'Util EST' plot on top, for this main task.)

You can see the util signal of that 'CrRendererMain' task and those
utilization drops over time, which I was referring to. When the util
drops below some threshold, the task might 'fit' into a smaller CPU,
which could be prevented automatically by maintaining the util est
for longer (but not for all tasks).
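
To illustrate the idea (just a sketch, not the actual patch; the
asymmetric shift and the names below are made up), the filter could
simply decay more slowly than it ramps up:

/*
 * Sketch only: stock util_est uses a single weight w = 1/4
 * (UTIL_EST_WEIGHT_SHIFT == 2) for the EWMA:
 *
 *   ewma(t) = w * task_util(p) + (1 - w) * ewma(t-1)
 *
 * A "longer" filter could use a smaller weight when the new sample is
 * below the current estimate, so a few short activations during the
 * dips do not immediately shrink the estimate and invite a down
 * migration. The slow-decay shift is a made-up value.
 */
#define UTIL_EST_WEIGHT_SHIFT		2	/* ramp up: w = 1/4 (as today) */
#define UTIL_EST_SLOW_DECAY_SHIFT	4	/* decay: w = 1/16 (hypothetical) */

static unsigned long util_est_ewma(unsigned long ewma, unsigned long sample)
{
	long diff = (long)sample - (long)ewma;
	int shift = diff < 0 ? UTIL_EST_SLOW_DECAY_SHIFT : UTIL_EST_WEIGHT_SHIFT;

	/* ewma += w * (sample - ewma), scaled to stay in integer math */
	ewma <<= shift;
	ewma  += diff;
	ewma >>= shift;

	return ewma;
}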

I do like your idea that Util EST might be per-task. I'm going to
check this, because that might help to get rid of the overutilized state,
which probably appears because small tasks also look 'bigger' for longer.

If this util_est change has a chance to fly upstream, I could send an RFC
if you don't mind.

Regards,
Lukasz

*CPUs 6,7 - big (1024 capacity), CPUs 4,5 Mid (768 capacity), CPUs 0-3
Littles (~150 capacity)

[1]
https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/p7_wa_speedometer2_small_size.ipynb


2023-09-12 15:07:58

by Vincent Guittot

Subject: Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins

Hi Lukasz,

On Tue, 12 Sept 2023 at 13:51, Lukasz Luba <[email protected]> wrote:
>
> Hi Peter,
>
> On 9/7/23 21:16, Peter Zijlstra wrote:
> > On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
> >
> >>> What task characteristic is tied to this? That is, this seems trivial to
> >>> modify per-task.
> >>
> >> In particular the Speedometer test and the main browser task, which
> >> reaches ~900 util but sometimes vanishes and waits for other background
> >> tasks to do something. In the meantime it can decay and wake up on
> >> Mid/Little (which can cause a penalty to the score of up to 5-10% vs.
> >> pinning the task to big CPUs). So, a longer util_est helps to avoid
> >> at least a very bad down migration to the Littles...
> >
> > Do they do a few short activations (wakeup/sleeps) while waiting? That
> > would indeed completely ruin things since the EWMA thing is activation
> > based.
> >
> > I wonder if there's anything sane we can do here...
>
> My apologies for the delay, I have tried to push the graphs for you.
>
> The experiment is on a Pixel 7*. It's running the browser on the phone
> with the test 'Speedometer 2.0'. It's a web test (you can also run it on
> your phone) available here, no need to install anything:
> https://browserbench.org/Speedometer2.0/
>
> Here is the Jupyter notebook [1], with plots of the signals:
> - top 20 tasks' (based on runtime) utilization
> - Util EST signals for the top 20 tasks, with the longer-decaying ewma
> filter (which is the 'red' plot called 'ewma')
> - the main task (comm=CrRendererMain) Util, Util EST and task residency
> (which tries to stick to CPUs 6,7*)
> - the test score was 144.6 (vs. ~134 with the fast-decay ewma), so
> staying on the big CPUs helps the score in this case
>
> (The plots are interactive, you can zoom in with the 'Box Zoom' icon;
> e.g. you can zoom into the task activation plot, which is also linked
> with the 'Util EST' plot on top, for this main task.)
>
> You can see the util signal of that 'CrRendererMain' task and those
> utilization drops over time, which I was referring to. When the util
> drops below some threshold, the task might 'fit' into a smaller CPU,
> which could be prevented automatically by maintaining the util est
> for longer (but not for all tasks).

I was looking at your nice charts and I wonder if you could also add
the runnable_avg of the tasks?

My 1st impression is that the decrease happens when your task starts
to share the CPU with some other tasks, and this ends up with a
decrease of its utilization because util_avg doesn't take the waiting
time into account. So typically a task with a utilization of 1024 will
see its utilization decrease because of other tasks running on the
same CPU. This would explain the drops that you can see.

I wonder if we should take the runnable_avg into account when applying
the ewma on util_est, so that the util_est will not decrease just
because of time sharing with other tasks?
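
Something along these lines maybe (purely illustrative, the helper name
and the margin are made up):

/*
 * Skip the downward EWMA update when the task spent a significant part
 * of the window runnable but waiting, because then the lower util_avg
 * sample reflects CPU sharing rather than the task really getting
 * smaller.
 */
static bool util_sample_trusted(unsigned long util_avg,
				unsigned long runnable_avg)
{
	/* contention inflates runnable_avg above util_avg */
	return runnable_avg <= util_avg + (util_avg >> 2);
}

static unsigned long util_est_ewma(unsigned long ewma, unsigned long sample,
				   unsigned long util_avg,
				   unsigned long runnable_avg)
{
	long diff = (long)sample - (long)ewma;

	/* don't decay the estimate when the dip is caused by time sharing */
	if (diff < 0 && !util_sample_trusted(util_avg, runnable_avg))
		return ewma;

	/* otherwise the usual w = 1/4 update */
	ewma <<= 2;
	ewma  += diff;
	ewma >>= 2;

	return ewma;
}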

>
> I do like your idea that Util EST might be per-task. I'm going to
> check this, because that might help to get rid of the overutilized state,
> which probably appears because small tasks also look 'bigger' for longer.
>
> If this util_est change has a chance to fly upstream, I could send an RFC
> if you don't mind.
>
> Regards,
> Lukasz
>
> *CPUs 6,7 - big (1024 capacity), CPUs 4,5 Mid (768 capacity), CPUs 0-3
> Littles (~150 capacity)
>
> [1]
> https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/p7_wa_speedometer2_small_size.ipynb

2023-09-12 20:03:07

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins

On Tue, Sep 12, 2023 at 12:51:52PM +0100, Lukasz Luba wrote:

> You can see the util signal of that 'CrRendererMain' task and those
> utilization drops over time, which I was referring to. When the util
> drops below some threshold, the task might 'fit' into a smaller CPU,
> which could be prevented automatically by maintaining the util est
> for longer (but not for all tasks).

Right, so right at those util_est dips it has some small activations.
It's like a poll loop or something instead of a full block waiting for
things to happen.

And yeah, that'll destroy util_est in a hurry :/
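
(Back of the envelope with the current w = 1/4 weight: an ewma of ~900
hit by three tiny activations of ~50 util goes 900 -> ~687 -> ~528 ->
~409, i.e. small enough to 'fit' on a Mid after just three samples.)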

> I do like your idea that Util EST might be per-task. I'm going to
> check this, because that might help to get rid of the overutilized state,
> which probably appears because small tasks also look 'bigger' for longer.
>
> If this util_est change has a chance to fly upstream, I could send an RFC
> if you don't mind.

The biggest stumbling block I see is the user interface; some generic
QoS-hints based thing that allows us to do random things -- like tuning
the above -- might do, dunno.

2023-09-13 09:59:46

by Lukasz Luba

Subject: Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins

Hi Vincent,

On 9/12/23 15:01, Vincent Guittot wrote:
> Hi Lukasz,
>
> On Tue, 12 Sept 2023 at 13:51, Lukasz Luba <[email protected]> wrote:
>>
>> Hi Peter,
>>
>> On 9/7/23 21:16, Peter Zijlstra wrote:
>>> On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
>>>
>>>>> What task characteristic is tied to this? That is, this seems trivial to
>>>>> modify per-task.
>>>>
>>>> In particular the Speedometer test and the main browser task, which
>>>> reaches ~900 util but sometimes vanishes and waits for other background
>>>> tasks to do something. In the meantime it can decay and wake up on
>>>> Mid/Little (which can cause a penalty to the score of up to 5-10% vs.
>>>> pinning the task to big CPUs). So, a longer util_est helps to avoid
>>>> at least a very bad down migration to the Littles...
>>>
>>> Do they do a few short activations (wakeup/sleeps) while waiting? That
>>> would indeed completely ruin things since the EWMA thing is activation
>>> based.
>>>
>>> I wonder if there's anything sane we can do here...
>>
>> My apologies for the delay, I have tried to push the graphs for you.
>>
>> The experiment is on a Pixel 7*. It's running the browser on the phone
>> with the test 'Speedometer 2.0'. It's a web test (you can also run it on
>> your phone) available here, no need to install anything:
>> https://browserbench.org/Speedometer2.0/
>>
>> Here is the Jupyter notebook [1], with plots of the signals:
>> - top 20 tasks' (based on runtime) utilization
>> - Util EST signals for the top 20 tasks, with the longer-decaying ewma
>> filter (which is the 'red' plot called 'ewma')
>> - the main task (comm=CrRendererMain) Util, Util EST and task residency
>> (which tries to stick to CPUs 6,7*)
>> - the test score was 144.6 (vs. ~134 with the fast-decay ewma), so
>> staying on the big CPUs helps the score in this case
>>
>> (The plots are interactive, you can zoom in with the 'Box Zoom' icon;
>> e.g. you can zoom into the task activation plot, which is also linked
>> with the 'Util EST' plot on top, for this main task.)
>>
>> You can see the util signal of that 'CrRendererMain' task and those
>> utilization drops over time, which I was referring to. When the util
>> drops below some threshold, the task might 'fit' into a smaller CPU,
>> which could be prevented automatically by maintaining the util est
>> for longer (but not for all tasks).
>
> I was looking at your nice charts and I wonder if you could also add
> the runnable_avg of the tasks?

Yes, I will try today or tomorrow to add such plots as well.

>
> My 1st impression is that the decrease happens when your task starts
> to share the CPU with some other tasks, and this ends up with a
> decrease of its utilization because util_avg doesn't take the waiting
> time into account. So typically a task with a utilization of 1024 will
> see its utilization decrease because of other tasks running on the
> same CPU. This would explain the drops that you can see.
>
> I wonder if we should take the runnable_avg into account when applying
> the ewma on util_est, so that the util_est will not decrease just
> because of time sharing with other tasks?

Yes, that sounds like a good idea. Let me provide those plots so we can
go further with the analysis. I will try to capture whether that happens
to this particular task on the CPU (and whether there are other tasks
running there as well).


Thanks for jumping in to the discussion!

Lukasz