Many Android devices still prefer to run PELT with a shorter halflife
than the 32ms hardcoded in mainline.
The Android folks claim better response time of display pipeline tasks
(higher min and avg fps for 60, 90 or 120Hz refresh rates). Some
benchmarks, like PCMark web browsing, show higher scores when running
with a 16ms or 8ms PELT halflife. The gain in response time and
performance is considered to outweigh the increase in energy
consumption in these cases.
The original idea of introducing a PELT halflife compile-time option
for 32, 16 or 8ms, from Patrick Bellasi in 2018,
https://lkml.kernel.org/r/[email protected]
wasn't integrated into mainline, mainly because it breaks the PELT
stability requirement (see (1) below).
We have been experimenting with a new idea from Morten Rasmussen to
instead introduce an additional clock between the task and pelt clock.
This way the effect of a shorter PELT halflife of 8ms or 16ms can be
achieved by left-shifting the elapsed time. This is similar to the time
shifting of the pelt clock used to achieve scale invariance in PELT.
The implementation is from Vincent Donnefort with some minor
modifications to align with current tip sched/core.
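
As a rough illustration of why this works, here is a minimal userspace
sketch (not Vincent's actual patch): applying the unchanged 32ms-halflife
decay to a time delta that has been left-shifted by one or two bits gives
the same result as a genuine 16ms or 8ms halflife on the unshifted delta.

/*
 * Sketch only: y32^(t << shift) == y_{32 >> shift}^t, i.e. feeding a
 * left-shifted elapsed time into the standard 32ms PELT decay behaves
 * like a 16ms (shift=1) or 8ms (shift=2) halflife.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
        double y32 = pow(2.0, -1.0 / 32.0); /* per-ms decay factor, 32ms halflife */
        double t = 24.0;                    /* arbitrary elapsed time in ms */

        for (int shift = 0; shift <= 2; shift++) {
                double shifted = pow(y32, t * (1 << shift));   /* shifted pelt clock */
                double direct  = pow(2.0, -t / (32 >> shift)); /* real shorter halflife */

                printf("shift=%d (halflife %2dms): shifted=%.6f direct=%.6f\n",
                       shift, 32 >> shift, shifted, direct);
        }
        return 0;
}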
---
Known potential issues:
(1) PELT stability requirement
PELT halflife has to be larger than or equal to the scheduling period.
The sched_period (sysctl_sched_latency) of a typical mobile device with
8 CPUs and the default logarithmic tuning is 24ms, so only the 32ms
PELT halflife meets this. A shorter halflife of 16ms or even 8ms would
break it.
It looks like this problem might not exist anymore because of the
PELT rewrite in 2015, i.e. commit 9d89c257dfb9
("sched/fair: Rewrite runnable load and utilization average tracking").
Since then, sched entities (tasks & task groups) and cfs_rq's are
maintained independently rather than each entity update also maintaining
the cfs_rq at the same time.
This seems to mitigate the issue that the cfs_rq signal is not correct
when not all runnable entities are able to do a self-update during a
PELT halflife window.
That said, I'm not entirely sure whether the entity-cfs_rq
synchronization is the only issue behind this PELT stability requirement.
(2) PELT utilization versus util_est (estimated utilization)
The PELT signal of a periodic task oscillates with higher peak amplitude
when using a smaller halflife. For a typical periodic task of the display
pipeline with a runtime/period of 8ms/16ms the peak amplitude is at ~40
for 32ms, at ~80 for 16ms and at ~160 for 8ms. Util_est stores the
util_avg peak as util_est.enqueued per task.
An additional exponentially weighted moving average (EWMA) smooths
decreases in task utilization, and the util_est values of the runnable
tasks are aggregated on the root cfs_rq.
CPU and task utilization for CPU frequency selection and task placement
is the max of util_est and util_avg.
I.e., because of how util_est is implemented, higher CPU Operating
Performance Points and more capable CPUs are already chosen when using
a smaller PELT halflife.
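
For reference, a back-of-envelope userspace sketch (my own numbers, using
the common steady-state approximation; the exact kernel math in 1024us
segments differs slightly) which roughly reproduces the ~40/~80/~160
figures above as the peak excursion of the 8ms/16ms task over its ~512
long-term mean:

/*
 * Approximate steady-state PELT peak of a periodic task (runtime r,
 * period p, in ms): u_peak = 1024 * (1 - y^r) / (1 - y^p) with
 * y = 2^(-1/halflife).
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
        const double r = 8.0, p = 16.0;           /* 8ms runtime / 16ms period */
        const double halflife[] = { 32.0, 16.0, 8.0 };
        const double mean = 1024.0 * r / p;       /* long-term average, ~512 */

        for (int i = 0; i < 3; i++) {
                double y = pow(2.0, -1.0 / halflife[i]);
                double peak = 1024.0 * (1.0 - pow(y, r)) / (1.0 - pow(y, p));

                printf("halflife=%2.0fms: peak=%4.0f (~%3.0f above the %3.0f mean)\n",
                       halflife[i], peak, peak - mean, mean);
        }
        return 0;
}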
(3) Wrong PELT history when switching PELT multiplier
The PELT history becomes stale the moment the PELT multiplier is changed
during runtime. So all decisions based on PELT are skewed for the time
interval it takes to rebuild LOAD_AVG_MAX (the saturation value of the
geometric series), which is ~345ms for halflife=32ms (shorter for 8ms or
16ms).
Rate limiting PELT multiplier changes to this interval does not solve
the issue here. So the user would have to live with possibly incorrect
decisions during these PELT multiplier transition times.
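
To put a rough number on how long the stale history lingers (my own
illustration, not from the patch): the weight that samples older than
t ms still carry in the PELT sum is y^t = 2^(-t/halflife), so pre-switch
history is essentially gone after about one LOAD_AVG_MAX window.

/* Weight of PELT history older than t ms: y^t = 2^(-t / halflife). */
#include <math.h>
#include <stdio.h>

int main(void)
{
        const double halflife[] = { 32.0, 16.0, 8.0 };
        const double t[] = { 32.0, 160.0, 345.0 };  /* ms since the switch */

        for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                        printf("halflife=%2.0fms: history older than %3.0fms weighs %7.4f%%\n",
                               halflife[i], t[j],
                               100.0 * pow(2.0, -t[j] / halflife[i]));
        return 0;
}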
---
It looks like individual task boosting, e.g. via uclamp_min, possibly
abstracted by middleware frameworks like the Android Dynamic Performance
Framework (ADPF), would be the way to go here, but until this is fully
available and adopted, some Android folks will still prefer the overall
system boosting they achieve by running with a shorter PELT halflife.
Vincent Donnefort (1):
sched/pelt: Introduce PELT multiplier
kernel/sched/core.c | 2 +-
kernel/sched/pelt.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 42 ++++++++++++++++++++++++++++---
kernel/sched/sched.h | 1 +
4 files changed, 100 insertions(+), 5 deletions(-)
--
2.25.1
Here is some updated test data from an Android phone showing that
switching the PELT HL at runtime is helpful functionality.
We switch the PELT HL at runtime depending on the scenario, e.g.
pelt8 while playing a game, pelt32 for camera video recording. Supporting
runtime switching of the PELT HL gives flexibility for different workloads.
The table below shows performance & power data points:
------------------------------------------------------------------------
|                      |                 PELT halflife                 |
|                      |-----------------------------------------------|
|                      |      32       |      16       |       8       |
|                      |-----------------------------------------------|
|                      | avg  min  avg | avg  min  avg | avg  min  avg |
| Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr |
|----------------------------------------------------------------------|
| HOK game  60fps      | 100  100  100 | 105 *134* 102 | 104 *152* 106 |
| HOK game  90fps      | 100  100  100 | 101 *114* 101 | 103 *129* 105 |
| HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105 |
| FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103 |
| Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102 |
------------------------------------------------------------------------
HOK ... Honour Of Kings, Video game
FHD ... Full High Definition
fps ... frames per second
pwr ... power consumption
table values are in %
On Mon, 2022-08-29 at 07:54 +0200, Dietmar Eggemann wrote:
> Many of the Android devices still prefer to run PELT with a shorter
> halflife than the hardcoded value of 32ms in mainline.
>
> The Android folks claim better response time of display pipeline
> tasks
> (higher min and avg fps for 60, 90 or 120Hz refresh rate). Some of
> the
> benchmarks like PCmark web-browsing show higher scores when running
> with 16ms or 8ms PELT halflife. The gain in response time and
> performance is considered to outweigh the increase of energy
> consumption in these cases.
>
> The original idea of introducing a PELT halflife compile time option
> for 32, 16, 8ms from Patrick Bellasi in 2018
>
> https://lkml.kernel.org/r/[email protected]
>
> wasn't integrated into mainline mainly because of breaking the PELT
> stability requirement (see (1) below).
>
> We have been experimenting with a new idea from Morten Rasmussen to
> instead introduce an additional clock between task and pelt clock.
> This
> way the effect of a shorter PELT halflife of 8ms or 16ms can be
> achieved by left-shifting the elapsed time. This is similar to the
> use
> of time shifting of the pelt clock to achieve scale invariance in
> PELT.
> The implementation is from Vincent Donnefort with some minor
> modifications to align with current tip sched/core.
>
> ---
>
> Known potential issues:
>
> (1) PELT stability requirement
>
> PELT halflife has to be larger than or equal to the scheduling
> period.
>
> The sched_period (sysctl_sched_latency) of a typical mobile device
> with
> 8 CPUs with the default logarithmical tuning is 24ms so only the 32
> ms
> PELT halflife met this. Shorter halflife of 16ms or even 8ms would
> break
> this.
>
> It looks like that this problem might not exist anymore because of
> the
> PELT rewrite in 2015, i.e. with commit 9d89c257dfb9
> ("sched/fair: Rewrite runnable load and utilization average
> tracking").
> Since then sched entities (task & task groups) and cfs_rq's are
> independently maintained rather than each entity update maintains the
> cfs_rq at the same time.
>
> This seems to mitigate the issue that the cfs_rq signal is not
> correct
> when there are not all runnable entities able ot do a self update
> during
> a PELT halflife window.
>
> That said, I'm not entirely sure whether the entity-cfs_rq
> synchronization is the only issue behind this PELT stability
> requirement.
>
>
> (2) PELT utilization versus util_est (estimated utilization)
>
> The PELT signal of a periodic task oscillates with higher peak
> amplitude
> when using smaller halflife. For a typical periodic task of the
> display
> pipeline with a runtime/period of 8ms/16ms the peak amplitude is at
> ~40
> for 32ms, at ~80 for 16ms and at ~160 for 8ms. Util_est stores the
> util_avg peak as util_est.enqueued per task.
>
> With an additional exponential weighted moving average (ewma) to
> smooth
> task utilization decreases, util_est values of the runnable tasks are
> aggregated on the root cfs_rq.
> CPU and task utilization for CPU frequency selection and task
> placement
> is the max value out of util_est and util_avg.
> I.e. because of how util_est is implemented higher CPU Operating
> Performance Points and more capable CPUs are already chosen when
> using
> smaller PELT halflife.
>
>
> (3) Wrong PELT history when switching PELT multiplier
>
> The PELT history becomes stale the moment the PELT multiplier is
> changed
> during runtime. So all decisions based on PELT are skewed for the
> time
> interval to produce LOAD_MAX_AVG (the sum of the infinite geometric
> series) which value is ~345ms for halflife=32ms (smaller for 8ms or
> 16ms).
>
> Rate limiting the PELT multiplier change to this value is not solving
> the issue here. So the user would have to live with possible
> incorrect
> discussions during these PELT multiplier transition times.
>
> ---
>
> It looks like that individual task boosting e.g. via uclamp_min,
> possibly abstracted by middleware frameworks like Android Dynamic
> Performance Framework (ADPF) would be the way to go here but until
> this
> is fully available and adopted some Android folks will still prefer
> the
> overall system boosting they achieve by running with a shorter PELT
> halflife.
>
> Vincent Donnefort (1):
> sched/pelt: Introduce PELT multiplier
>
> kernel/sched/core.c | 2 +-
> kernel/sched/pelt.c | 60
> ++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/pelt.h | 42 ++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 1 +
> 4 files changed, 100 insertions(+), 5 deletions(-)
>
+ Kajetan Puchalski <[email protected]>
On 20/09/2022 16:07, Jian-Min Liu wrote:
>
> Update some test data in android phone to support switching PELT HL
> is helpful functionality.
>
> We switch runtime PELT HL during runtime by difference scenario e.g.
> pelt8 in playing game, pelt32 in camera video. Support runntime
> switching PELT HL is flexible for different workloads.
>
> the below table show performance & power data points:
>
> ------------------------------------------------------------------------
> |                      |                 PELT halflife                 |
> |                      |-----------------------------------------------|
> |                      |      32       |      16       |       8       |
> |                      |-----------------------------------------------|
> |                      | avg  min  avg | avg  min  avg | avg  min  avg |
> | Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr |
> |----------------------------------------------------------------------|
> | HOK game  60fps      | 100  100  100 | 105 *134* 102 | 104 *152* 106 |
> | HOK game  90fps      | 100  100  100 | 101 *114* 101 | 103 *129* 105 |
> | HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105 |
> | FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103 |
> | Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102 |
> ------------------------------------------------------------------------
>
> HOK ... Honour Of Kings, Video game
> FHD ... Full High Definition
> fps ... frame per second
> pwr ... power consumption
>
> table values are in %
I assume that you are specifically interested in those higher min fps
numbers which can be achieved with a tolerable energy consumption
increase when running the game with 16ms or even 8ms PELT halflife.
We see a similar effect when running the UI performance benchmark
Jankbench.
So you need this runtime-switchable PELT multiplier. Would this sched
feature interface:
https://lkml.kernel.org/r/[email protected]
be sufficient for you? People don't like to support `changing PELT
halflife` via an official sysctl.
[...]
Because it messes up the order in which people normally read text.
Why is top-posting such a bad thing?
Top-posting.
What is the most annoying thing in e-mail?
On Tue, Sep 20, 2022 at 10:07:59PM +0800, Jian-Min Liu wrote:
>
> Update some test data in android phone to support switching PELT HL
> is helpful functionality.
>
> We switch runtime PELT HL during runtime by difference scenario e.g.
> pelt8 in playing game, pelt32 in camera video. Support runntime
> switching PELT HL is flexible for different workloads.
>
> the below table show performance & power data points:
>
> ------------------------------------------------------------------------
> |                      |                 PELT halflife                 |
> |                      |-----------------------------------------------|
> |                      |      32       |      16       |       8       |
> |                      |-----------------------------------------------|
> |                      | avg  min  avg | avg  min  avg | avg  min  avg |
> | Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr |
> |----------------------------------------------------------------------|
> | HOK game  60fps      | 100  100  100 | 105 *134* 102 | 104 *152* 106 |
> | HOK game  90fps      | 100  100  100 | 101 *114* 101 | 103 *129* 105 |
> | HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105 |
You have your min and avg fps columns mixed up, your min cannot be larger
than avg.
Also, with min fps mostly above the actual screen fps, who cares. And
seriously 120fps on a phone !?!? for worse power usage! you gotta be
kidding me.
And I googled this game; it is some top-down tactical thing with
real-time combat (as opposed to turn-based) (DOTA like I suppose),
60 fps locked should be plenty fine.
> | FHD video rec. 60fps | 100 100 100 | n/a n/a n/a | 100 100 103|
> | Camera snapshot | 100 100 100 | n/a n/a n/a | 100 100 102|
Mostly I think you've demonstrated that none of this is worth it.
> -----------------------------------------------------------------------
>
> HOK ... Honour Of Kings, Video game
> FHD ... Full High Definition
> fps ... frame per second
> pwr ... power consumption
>
> table values are in %
Oh... that's bloody insane; that's why none of it makes sense.
How is any of that an answer to:
"They want; I want an explanation of what exact problem is fixed how ;-)"
This is just random numbers showing poking the number has some effect;
it has zero explanation of why poking the number changes the workload
and if that is in fact the right way to go about solving that particular
issue.
On Thu, Sep 29, 2022 at 11:47:23AM +0200, Peter Zijlstra wrote:
[...]
> Mostly I think you've demonstrated that none of this is worth it.
>
> > -----------------------------------------------------------------------
> >
> > HOK ... Honour Of Kings, Video game
> > FHD ... Full High Definition
> > fps ... frame per second
> > pwr ... power consumption
> >
> > table values are in %
>
> Oh... that's bloody insane; that's why none of it makes sense.
Hi,
We have seen similar results to the ones provided by MTK while running
Jankbench, a UI performance benchmark.
For the following tables, the pelt numbers refer to multiplier values so
pelt_1 -> 32ms, pelt_2 -> 16ms, pelt_4 -> 8ms.
We can see the max frame durations decreasing significantly in line with
changing the pelt multiplier. Having a faster-responding pelt lets us
improve the worst-case scenario by a large margin which is why it can be
useful in some cases where that worst-case scenario is important.
Max frame duration (ms)
+------------------+----------+
| kernel | value |
|------------------+----------|
| pelt_1 | 157.426 |
| pelt_2 | 111.975 |
| pelt_4 | 85.2713 |
+------------------+----------+
However, it is accompanied by a very noticeable increase in power usage.
We have seen even bigger power usage increases for different workloads.
This is why we think it makes much more sense as something that can be
changed at runtime - if set at boot time the energy consumption increase
would nullify any of the potential benefits. For limited workloads or
scenarios, the tradeoff might be worth it.
Power usage [mW]
+------------------+---------+-------------+
| kernel | value | perc_diff |
|------------------+---------+-------------|
| pelt_1 | 139.9 | 0.0% |
| pelt_2 | 146.4 | 4.62% |
| pelt_4 | 158.5 | 13.25% |
+------------------+---------+-------------+
At the same time we see that the average-case can improve slightly as
well in the process and the consistency either doesn't get worse or
improves a bit too.
Mean frame duration (ms)
+---------------+------------------+---------+-------------+
| variable | kernel | value | perc_diff |
|---------------+------------------+---------+-------------|
| mean_duration | pelt_1 | 14.6 | 0.0% |
| mean_duration | pelt_2 | 13.8 | -5.43% |
| mean_duration | pelt_4 | 14.5 | -0.58% |
+---------------+------------------+---------+-------------+
Jank percentage
+------------+------------------+---------+-------------+
| variable | kernel | value | perc_diff |
|------------+------------------+---------+-------------|
| jank_perc | pelt_1 | 2.1 | 0.0% |
| jank_perc | pelt_2 | 2.1 | 0.11% |
| jank_perc | pelt_4 | 2 | -3.46% |
+------------+------------------+---------+-------------+
> How is any of that an answer to:
>
> "They want; I want an explanation of what exact problem is fixed how ;-)"
>
> This is just random numbers showing poking the number has some effect;
> it has zero explaination of why poking the number changes the workload
> and if that is in fact the right way to go about solving that particular
> issue.
Overall, the problem being solved here is that based on our testing the
PELT half life can occasionally be too slow to keep up in scenarios
where many frames need to be rendered quickly, especially on high-refresh
rate phones and similar devices. While it's not a problem most of the
time and so it doesn't warrant changing the default or having it set at
boot time, introducing this pelt multiplier would be very useful as a
tool to be able to avoid the worst-case in limited scenarios.
----
Kajetan
On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
> Overall, the problem being solved here is that based on our testing the
> PELT half life can occasionally be too slow to keep up in scenarios
> where many frames need to be rendered quickly, especially on high-refresh
> rate phones and similar devices.
But is it a problem of DVFS not ramping up quickly enough, or of the
load-balancer not reacting to the increase in load, or which other aspect
controlled by PELT is responsible for the improvement seen?
On 29/09/2022 11:47, Peter Zijlstra wrote:
[...]
>> ------------------------------------------------------------------------
>> |                      |                 PELT halflife                 |
>> |                      |-----------------------------------------------|
>> |                      |      32       |      16       |       8       |
>> |                      |-----------------------------------------------|
>> |                      | avg  min  avg | avg  min  avg | avg  min  avg |
>> | Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr |
>> |----------------------------------------------------------------------|
>> | HOK game  60fps      | 100  100  100 | 105 *134* 102 | 104 *152* 106 |
>> | HOK game  90fps      | 100  100  100 | 101 *114* 101 | 103 *129* 105 |
>> | HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105 |
>
> You have your min and avg fps columns mixed up, your min cannot be larger
> than avg.
>
> Also, with min fps mostly above the actual screen fps, who cares. And
> seriously 120fps on a phone !?!? for worse power usage! you gotta be
> kidding me.
I agree that since we don't know what 100% at 32 means, it's unclear what
problem actually gets solved here by running with 16 or 8.
> And I googled this game; it is some top-down tactical thing with
> real-time combat (as opposed to turn-based) (DOTA like I suppose),
> 60 fps locked should be plenty fine.
>
>> | FHD video rec. 60fps | 100 100 100 | n/a n/a n/a | 100 100 103|
>> | Camera snapshot | 100 100 100 | n/a n/a n/a | 100 100 102|
>
> Mostly I think you've demonstrated that none of this is worth it.
I assume Jian-Min added those two lines to demonstrate that they would
need the run-time switch.
>> -----------------------------------------------------------------------
>>
>> HOK ... Honour Of Kings, Video game
>> FHD ... Full High Definition
>> fps ... frame per second
>> pwr ... power consumption
>>
>> table values are in %
>
> Oh... that's bloody insane; that's why none of it makes sense.
>
>
> How is any of that an answer to:
>
> "They want; I want an explanation of what exact problem is fixed how ;-)"
>
> This is just random numbers showing poking the number has some effect;
> it has zero explaination of why poking the number changes the workload
> and if that is in fact the right way to go about solving that particular
> issue.
Jian-Min, would you be able to show real numbers in comparison to the
chosen fps here? And explain what the problem is which gets solved. What
is the effect of these higher min fps values when running 16 or 8? And
why is the default 32 not sufficient here?
On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
>
> > Overall, the problem being solved here is that based on our testing the
> > PELT half life can occasionally be too slow to keep up in scenarios
> > where many frames need to be rendered quickly, especially on high-refresh
> > rate phones and similar devices.
>
> But it is a problem of DVFS not ramping up quick enough; or of the
> load-balancer not reacting to the increase in load, or what aspect
> controlled by PELT is responsible for the improvement seen?
Based on all the tests we've seen, jankbench or otherwise, the
improvement can mainly be attributed to the faster ramp up of frequency
caused by the shorter PELT window while using schedutil. Alongside that
the signals rising faster also mean that the task would get migrated
faster to bigger CPUs on big.LITTLE systems which improves things too
but it's mostly the frequency aspect of it.
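
For reference, here is a rough userspace sketch of the schedutil mapping
in question (mainline's get_next_freq()/map_util_freq() applies ~25%
headroom on top of the utilization); the capacity and max frequency below
are made-up example values, the point being that any increase in util
translates more or less linearly into a higher requested frequency:

#include <stdio.h>

/* Rough sketch of schedutil's util -> frequency mapping (~1.25x headroom). */
static unsigned long next_freq(unsigned long util, unsigned long max_cap,
                               unsigned long max_freq)
{
        /* roughly: freq = 1.25 * max_freq * util / capacity, clamped to max */
        unsigned long freq = (max_freq + (max_freq >> 2)) * util / max_cap;

        return freq < max_freq ? freq : max_freq;
}

int main(void)
{
        unsigned long utils[] = { 256, 512, 768 };  /* example utilization values */

        for (int i = 0; i < 3; i++)
                printf("util=%4lu -> requested freq=%lu kHz\n",
                       utils[i], next_freq(utils[i], 1024, 2000000));
        return 0;
}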
To establish that this benchmark is sensitive to frequency I ran some
tests using the 'performance' cpufreq governor.
Max frame duration (ms)
+------------------+-------------+----------+
| kernel | iteration | value |
|------------------+-------------+----------|
| pelt_1 | 10 | 157.426 |
| pelt_4 | 10 | 85.2713 |
| performance | 10 | 40.9308 |
+------------------+-------------+----------+
Mean frame duration (ms)
+---------------+------------------+---------+-------------+
| variable | kernel | value | perc_diff |
|---------------+------------------+---------+-------------|
| mean_duration | pelt_1 | 14.6 | 0.0% |
| mean_duration | pelt_4 | 14.5 | -0.58% |
| mean_duration | performance | 4.4 | -69.75% |
+---------------+------------------+---------+-------------+
Jank percentage
+------------+------------------+---------+-------------+
| variable | kernel | value | perc_diff |
|------------+------------------+---------+-------------|
| jank_perc | pelt_1 | 2.1 | 0.0% |
| jank_perc | pelt_4 | 2 | -3.46% |
| jank_perc | performance | 0.1 | -97.25% |
+------------+------------------+---------+-------------+
As you can see, bumping up frequency can hugely improve the results
here. This is what's happening when we decrease the PELT window, just on
a much smaller and not as drastic scale. It also explains specifically
where the increased power usage is coming from.
We have some data on an earlier build of Pixel 6a, which also runs a
slightly modified "sched" governor. The tuning definitely has both
performance and power impact on UX. With some additional user space
hints such as ADPF (Android Dynamic Performance Framework) and/or the
old-fashioned INTERACTION power hint, different trade-offs can be
achieved with this sort of tuning.
+----------------------------------------------------------+----------+----------+
| Metrics                                                   | 32ms     | 8ms      |
+----------------------------------------------------------+----------+----------+
| Sum of gfxinfo_com.android.test.uibench_deadline_missed  | 185.00   | 112.00   |
| Sum of SFSTATS_GLOBAL_MISSEDFRAMES                        | 62.00    | 49.00    |
| CPU Power                                                 | 6,204.00 | 7,040.00 |
| Sum of Gfxinfo.frame.95th                                 | 582.00   | 506.00   |
| Avg of Gfxinfo.frame.95th                                 | 18.19    | 15.81    |
+----------------------------------------------------------+----------+----------+
On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski
<[email protected]> wrote:
>
> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
> >
> > > Overall, the problem being solved here is that based on our testing the
> > > PELT half life can occasionally be too slow to keep up in scenarios
> > > where many frames need to be rendered quickly, especially on high-refresh
> > > rate phones and similar devices.
> >
> > But it is a problem of DVFS not ramping up quick enough; or of the
> > load-balancer not reacting to the increase in load, or what aspect
> > controlled by PELT is responsible for the improvement seen?
>
> Based on all the tests we've seen, jankbench or otherwise, the
> improvement can mainly be attributed to the faster ramp up of frequency
> caused by the shorter PELT window while using schedutil. Alongside that
> the signals rising faster also mean that the task would get migrated
> faster to bigger CPUs on big.LITTLE systems which improves things too
> but it's mostly the frequency aspect of it.
>
> To establish that this benchmark is sensitive to frequency I ran some
> tests using the 'performance' cpufreq governor.
>
> Max frame duration (ms)
>
> +------------------+-------------+----------+
> | kernel | iteration | value |
> |------------------+-------------+----------|
> | pelt_1 | 10 | 157.426 |
> | pelt_4 | 10 | 85.2713 |
> | performance | 10 | 40.9308 |
> +------------------+-------------+----------+
>
> Mean frame duration (ms)
>
> +---------------+------------------+---------+-------------+
> | variable | kernel | value | perc_diff |
> |---------------+------------------+---------+-------------|
> | mean_duration | pelt_1 | 14.6 | 0.0% |
> | mean_duration | pelt_4 | 14.5 | -0.58% |
> | mean_duration | performance | 4.4 | -69.75% |
> +---------------+------------------+---------+-------------+
>
> Jank percentage
>
> +------------+------------------+---------+-------------+
> | variable | kernel | value | perc_diff |
> |------------+------------------+---------+-------------|
> | jank_perc | pelt_1 | 2.1 | 0.0% |
> | jank_perc | pelt_4 | 2 | -3.46% |
> | jank_perc | performance | 0.1 | -97.25% |
> +------------+------------------+---------+-------------+
>
> As you can see, bumping up frequency can hugely improve the results
> here. This is what's happening when we decrease the PELT window, just on
> a much smaller and not as drastic scale. It also explains specifically
> where the increased power usage is coming from.
Hi Wei,
On 04/10/2022 00:57, Wei Wang wrote:
Please don't do top-posting.
> We have some data on an earlier build of Pixel 6a, which also runs a
> slightly modified "sched" governor. The tuning definitely has both
> performance and power impact on UX. With some additional user space
> hints such as ADPF (Android Dynamic Performance Framework) and/or the
> old-fashioned INTERACTION power hint, different trade-offs can be
> archived with this sort of tuning.
>
>
> +----------------------------------------------------------+----------+----------+
> | Metrics                                                   | 32ms     | 8ms      |
> +----------------------------------------------------------+----------+----------+
> | Sum of gfxinfo_com.android.test.uibench_deadline_missed  | 185.00   | 112.00   |
> | Sum of SFSTATS_GLOBAL_MISSEDFRAMES                        | 62.00    | 49.00    |
> | CPU Power                                                 | 6,204.00 | 7,040.00 |
> | Sum of Gfxinfo.frame.95th                                 | 582.00   | 506.00   |
> | Avg of Gfxinfo.frame.95th                                 | 18.19    | 15.81    |
> +----------------------------------------------------------+----------+----------+
Which App is package `gfxinfo_com.android.test`? Is this UIBench? Never
ran it.
I'm familiar with `dumpsys gfxinfo <PACKAGE_NAME>`.
# adb shell dumpsys gfxinfo <PACKAGE_NAME>
...
** Graphics info for pid XXXX [<PACKAGE_NAME>] **
...
95th percentile: XXms <-- (a)
...
Number Frame deadline missed: XX <-- (b)
...
I assume that `Gfxinfo.frame.95th` is related to (a) and
`gfxinfo_com.android.test.uibench_deadline_missed` to (b)? Not sure
where `SFSTATS_GLOBAL_MISSEDFRAMES` is coming from?
What's the Sum here? Is it that you ran the test 32 times (582/18.19 = 32)?
[...]
> On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski
> <[email protected]> wrote:
>>
>> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
>>> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
>>>
>>>> Overall, the problem being solved here is that based on our testing the
>>>> PELT half life can occasionally be too slow to keep up in scenarios
>>>> where many frames need to be rendered quickly, especially on high-refresh
>>>> rate phones and similar devices.
>>>
>>> But it is a problem of DVFS not ramping up quick enough; or of the
>>> load-balancer not reacting to the increase in load, or what aspect
>>> controlled by PELT is responsible for the improvement seen?
>>
>> Based on all the tests we've seen, jankbench or otherwise, the
>> improvement can mainly be attributed to the faster ramp up of frequency
>> caused by the shorter PELT window while using schedutil. Alongside that
>> the signals rising faster also mean that the task would get migrated
>> faster to bigger CPUs on big.LITTLE systems which improves things too
>> but it's mostly the frequency aspect of it.
>>
>> To establish that this benchmark is sensitive to frequency I ran some
>> tests using the 'performance' cpufreq governor.
>>
>> Max frame duration (ms)
>>
>> +------------------+-------------+----------+
>> | kernel | iteration | value |
>> |------------------+-------------+----------|
>> | pelt_1 | 10 | 157.426 |
>> | pelt_4 | 10 | 85.2713 |
>> | performance | 10 | 40.9308 |
>> +------------------+-------------+----------+
>>
>> Mean frame duration (ms)
>>
>> +---------------+------------------+---------+-------------+
>> | variable | kernel | value | perc_diff |
>> |---------------+------------------+---------+-------------|
>> | mean_duration | pelt_1 | 14.6 | 0.0% |
>> | mean_duration | pelt_4 | 14.5 | -0.58% |
>> | mean_duration | performance | 4.4 | -69.75% |
>> +---------------+------------------+---------+-------------+
>>
>> Jank percentage
>>
>> +------------+------------------+---------+-------------+
>> | variable | kernel | value | perc_diff |
>> |------------+------------------+---------+-------------|
>> | jank_perc | pelt_1 | 2.1 | 0.0% |
>> | jank_perc | pelt_4 | 2 | -3.46% |
>> | jank_perc | performance | 0.1 | -97.25% |
>> +------------+------------------+---------+-------------+
>>
>> As you can see, bumping up frequency can hugely improve the results
>> here. This is what's happening when we decrease the PELT window, just on
>> a much smaller and not as drastic scale. It also explains specifically
>> where the increased power usage is coming from.
On Tue, Oct 4, 2022 at 2:33 AM Dietmar Eggemann
<[email protected]> wrote:
>
> Hi Wei,
>
> On 04/10/2022 00:57, Wei Wang wrote:
>
> Please don't do top-posting.
>
Sorry, forgot this was posted to the list...
> > We have some data on an earlier build of Pixel 6a, which also runs a
> > slightly modified "sched" governor. The tuning definitely has both
> > performance and power impact on UX. With some additional user space
> > hints such as ADPF (Android Dynamic Performance Framework) and/or the
> > old-fashioned INTERACTION power hint, different trade-offs can be
> > archived with this sort of tuning.
> >
> >
> > +----------------------------------------------------------+----------+----------+
> > | Metrics                                                   | 32ms     | 8ms      |
> > +----------------------------------------------------------+----------+----------+
> > | Sum of gfxinfo_com.android.test.uibench_deadline_missed  | 185.00   | 112.00   |
> > | Sum of SFSTATS_GLOBAL_MISSEDFRAMES                        | 62.00    | 49.00    |
> > | CPU Power                                                 | 6,204.00 | 7,040.00 |
> > | Sum of Gfxinfo.frame.95th                                 | 582.00   | 506.00   |
> > | Avg of Gfxinfo.frame.95th                                 | 18.19    | 15.81    |
> > +----------------------------------------------------------+----------+----------+
>
> Which App is package `gfxinfo_com.android.test`? Is this UIBench? Never
> ran it.
>
Yes.
> I'm familiar with `dumpsys gfxinfo <PACKAGE_NAME>`.
>
> # adb shell dumpsys gfxinfo <PACKAGE_NAME>
>
> ...
> ** Graphics info for pid XXXX [<PACKAGE_NAME>] **
> ...
> 95th percentile: XXms <-- (a)
> ...
> Number Frame deadline missed: XX <-- (b)
> ...
>
>
> I assume that `Gfxinfo.frame.95th` is related to (a) and
> `gfxinfo_com.android.test.uibench_deadline_missed` to (b)? Not sure
> where `SFSTATS_GLOBAL_MISSEDFRAMES` is coming from?
>
a) is correct, b) is from surfaceflinger. The Android display pipeline
involves both a) the app (generation) and b) surfaceflinger
(presentation).
> What's the Sum here? Is it that you ran the test 32 times (582/18.19 = 32)?
>
Uibench[1] has several micro tests and it is the sum of those tests.
[1]: https://cs.android.com/android/platform/superproject/+/master:platform_testing/tests/microbenchmarks/uibench/src/com/android/uibench/microbenchmark/
> [...]
>
> > On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski
> > <[email protected]> wrote:
> >>
> >> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
> >>> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
> >>>
> >>>> Overall, the problem being solved here is that based on our testing the
> >>>> PELT half life can occasionally be too slow to keep up in scenarios
> >>>> where many frames need to be rendered quickly, especially on high-refresh
> >>>> rate phones and similar devices.
> >>>
> >>> But it is a problem of DVFS not ramping up quick enough; or of the
> >>> load-balancer not reacting to the increase in load, or what aspect
> >>> controlled by PELT is responsible for the improvement seen?
> >>
> >> Based on all the tests we've seen, jankbench or otherwise, the
> >> improvement can mainly be attributed to the faster ramp up of frequency
> >> caused by the shorter PELT window while using schedutil. Alongside that
> >> the signals rising faster also mean that the task would get migrated
> >> faster to bigger CPUs on big.LITTLE systems which improves things too
> >> but it's mostly the frequency aspect of it.
> >>
> >> To establish that this benchmark is sensitive to frequency I ran some
> >> tests using the 'performance' cpufreq governor.
> >>
> >> Max frame duration (ms)
> >>
> >> +------------------+-------------+----------+
> >> | kernel | iteration | value |
> >> |------------------+-------------+----------|
> >> | pelt_1 | 10 | 157.426 |
> >> | pelt_4 | 10 | 85.2713 |
> >> | performance | 10 | 40.9308 |
> >> +------------------+-------------+----------+
> >>
> >> Mean frame duration (ms)
> >>
> >> +---------------+------------------+---------+-------------+
> >> | variable | kernel | value | perc_diff |
> >> |---------------+------------------+---------+-------------|
> >> | mean_duration | pelt_1 | 14.6 | 0.0% |
> >> | mean_duration | pelt_4 | 14.5 | -0.58% |
> >> | mean_duration | performance | 4.4 | -69.75% |
> >> +---------------+------------------+---------+-------------+
> >>
> >> Jank percentage
> >>
> >> +------------+------------------+---------+-------------+
> >> | variable | kernel | value | perc_diff |
> >> |------------+------------------+---------+-------------|
> >> | jank_perc | pelt_1 | 2.1 | 0.0% |
> >> | jank_perc | pelt_4 | 2 | -3.46% |
> >> | jank_perc | performance | 0.1 | -97.25% |
> >> +------------+------------------+---------+-------------+
> >>
> >> As you can see, bumping up frequency can hugely improve the results
> >> here. This is what's happening when we decrease the PELT window, just on
> >> a much smaller and not as drastic scale. It also explains specifically
> >> where the increased power usage is coming from.
>
Hi,
On Thu, 2022-09-29 at 11:47 +0200, Peter Zijlstra wrote:
> Because it messes up the order in which people normally read text.
> Why is top-posting such a bad thing?
> Top-posting.
> What is the most annoying thing in e-mail?
>
Sorry for top-posting...
> On Tue, Sep 20, 2022 at 10:07:59PM +0800, Jian-Min Liu wrote:
> >
> > Update some test data in android phone to support switching PELT
> > HL
> > is helpful functionality.
> >
> > We switch runtime PELT HL during runtime by difference scenario
> > e.g.
> > pelt8 in playing game, pelt32 in camera video. Support runntime
> > switching PELT HL is flexible for different workloads.
> >
> > the below table show performance & power data points:
> >
> > ------------------------------------------------------------------------
> > |                      |                 PELT halflife                 |
> > |                      |-----------------------------------------------|
> > |                      |      32       |      16       |       8       |
> > |                      |-----------------------------------------------|
> > |                      | avg  min  avg | avg  min  avg | avg  min  avg |
> > | Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr |
> > |----------------------------------------------------------------------|
> > | HOK game  60fps      | 100  100  100 | 105 *134* 102 | 104 *152* 106 |
> > | HOK game  90fps      | 100  100  100 | 101 *114* 101 | 103 *129* 105 |
> > | HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105 |
>
> You have your min and avg fps columns mixed up, your min cannot be
> larger
> than avg.
>
> Also, with min fps mostly above the actual screen fps, who cares. And
> seriously 120fps on a phone !?!? for worse power usage! you gotta be
> kidding me.
>
> And I googled this game; it is some top-down tactical thing with
> real-time combat (as opposed to turn-based) (DOTA like I suppose),
> 60 fps locked should be plenty fine.
>
> > | FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103 |
> > | Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102 |
>
> Mostly I think you've demonstrated that none of this is worth it.
>
> > ------------------------------------------------------------------------
> >
> > HOK ... Honour Of Kings, Video game
> > FHD ... Full High Definition
> > fps ... frame per second
> > pwr ... power consumption
> >
> > table values are in %
>
> Oh... that's bloody insane; that's why none of it makes sense.
>
>
> How is any of that an answer to:
>
> "They want; I want an explanation of what exact problem is fixed
> how ;-)"
>
> This is just random numbers showing poking the number has some
> effect;
> it has zero explaination of why poking the number changes the
> workload
> and if that is in fact the right way to go about solving that
> particular
> issue.
Sorry that the data wasn't clear to understand. I'll try again with
absolute FPS numbers and some additional explanation, as well as a
summary of why we need to have the PELT halflife tunable at runtime.
HOK* 60FPS
+-------+-----------------------------------------+
| | avg. FPS | min. FPS | power |
+-------+--------+-------+-------+-------+--------+
|kernel | value |diff(%)| value |diff(%)| diff(%)|
+-------+--------+-------+-------+-------+--------+
|pelt_1 | 54.1 | 0.0% | 21.8 | 0.0% | 0.0% |
+-------+--------+-------+-------+-------+--------+
|pelt_2 | 56.9 | 5.2% | 29.2 | 34.0% | 2.2% |
+-------+--------+-------+-------+-------+--------+
|pelt_4 | 56.6 | 4.5% | 33.2 | 52.4% | 6.3% |
+-------+--------+-------+-------+-------+--------+
*Honour Of Kings, video game
Test methodology:
We choose 60FPS in the game setup. Android's systrace (similar to
ftrace) then provides the real FPS from which we take the average and
minimum value.
Sorry, but we can't share absolute numbers for power from our test
device since this is still considered sensitive information.
FHD 60fps video recording
+-------+-----------------------------------------+
| | avg. FPS | min. FPS | power |
+-------+--------+-------+-------+-------+--------+
|kernel | value |diff(%)| value |diff(%)| diff(%)|
+-------+--------+-------+-------+-------+--------+
|pelt_1 | 60.0 | 0.0% | 60.0 | 0.0% | 0.0% |
+-------+--------+-------+-------+-------+--------+
|pelt_4 | 60.0 | 0.0% | 60.0 | 0.0% | 2.1% |
+-------+--------+-------+-------+-------+--------+
To summarize, we need a smaller PELT halflife to reach higher avg. FPS
and min. FPS values for video gaming to achieve a smoother game-play
experience even when it comes with slightly higher power consumption.
Especially the improvement in min. FPS is important here to minimize
situations in which the game otherwise would stutter.
Since not all use cases profit from this behaviour (e.g. video
recording) the PELT halflife should be tunable at runtime.
On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> Based on all the tests we've seen, jankbench or otherwise, the
> improvement can mainly be attributed to the faster ramp up of frequency
> caused by the shorter PELT window while using schedutil.
Would something terrible like the below help some?
If not, I suppose it could be modified to take the current state as
history. But basically it runs a faster pelt sum alongside the regular
signal just for ramping up the frequency.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..9ba07a1d19f6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,6 +96,7 @@ SCHED_FEAT(WA_BIAS, true)
*/
SCHED_FEAT(UTIL_EST, true)
SCHED_FEAT(UTIL_EST_FASTUP, true)
+SCHED_FEAT(UTIL_EST_FASTER, true)
SCHED_FEAT(LATENCY_WARN, false)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 0f310768260c..13cd9e27ce3e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
return periods;
}
+/*
+ * Compute a pelt util_avg assuming no history and @delta runtime.
+ */
+unsigned long faster_est_approx(u64 delta)
+{
+ unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
+ u64 periods = delta / 1024;
+
+ if (periods) {
+ delta %= 1024;
+ contrib = __accumulate_pelt_segments(periods, 1024, delta);
+ }
+
+ return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
+}
+
/*
* We can represent the historical contribution to runnable average as the
* coefficients of a geometric series. To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4a20046e586..99827d5dda27 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2922,6 +2922,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
return READ_ONCE(rq->avg_dl.util_avg);
}
+extern unsigned long faster_est_approx(u64 runtime);
+
/**
* cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
* @cpu: the CPU to get the utilization for.
@@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
*/
static inline unsigned long cpu_util_cfs(int cpu)
{
+ struct rq *rq = cpu_rq(cpu);
struct cfs_rq *cfs_rq;
unsigned long util;
- cfs_rq = &cpu_rq(cpu)->cfs;
+ cfs_rq = &rq->cfs;
util = READ_ONCE(cfs_rq->avg.util_avg);
if (sched_feat(UTIL_EST)) {
+ if (sched_feat(UTIL_EST_FASTER)) {
+ struct task_struct *curr;
+
+ rcu_read_lock();
+ curr = rcu_dereference(rq->curr);
+ if (likely(curr->sched_class == &fair_sched_class)) {
+ u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
+ util = max_t(unsigned long, util,
+ faster_est_approx(runtime * 2));
+ }
+ rcu_read_unlock();
+ }
util = max_t(unsigned long, util,
READ_ONCE(cfs_rq->avg.util_est.enqueued));
}
On 11/07/22 14:41, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> > Based on all the tests we've seen, jankbench or otherwise, the
> > improvement can mainly be attributed to the faster ramp up of frequency
> > caused by the shorter PELT window while using schedutil.
>
> Would something terrible like the below help some?
>
> If not, I suppose it could be modified to take the current state as
> history. But basically it runs a faster pelt sum along side the regular
> signal just for ramping up the frequency.
A bit of a tangent, but this reminded me of this old patch:
https://lore.kernel.org/lkml/[email protected]/
I think we have a bit too many moving cogs that might be creating an
undesired compound effect.
Should we consider removing margins in favour of improving util ramp up/down?
(whether via util_est or pelt hf).
Only worry is lower end devices; but it seems to me better to improve util
response and get rid of these magic numbers - if we can. Having the ability to
adjust them at runtime will help defer the trade-offs to sys admins.
Thanks
--
Qais Yousef
Hi Peter,
On 11/7/22 13:41, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
>> Based on all the tests we've seen, jankbench or otherwise, the
>> improvement can mainly be attributed to the faster ramp up of frequency
>> caused by the shorter PELT window while using schedutil.
>
> Would something terrible like the below help some?
>
> If not, I suppose it could be modified to take the current state as
> history. But basically it runs a faster pelt sum along side the regular
> signal just for ramping up the frequency.
[snip]
> +
> + rcu_read_lock();
> + curr = rcu_dereference(rq->curr);
> + if (likely(curr->sched_class == &fair_sched_class)) {
> + u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
> + util = max_t(unsigned long, util,
> + faster_est_approx(runtime * 2));
That's a really nice hack :)
I wonder why we end up in such a situation on Android. Maybe there is
a different solution?
Maybe shorter tick (then also align PELT Half-Life)?
The problem is mostly in those high-FPS phones. You know, we now have
phones with 144Hz displays and even games > 100FPS (which wasn't the
case a few years ago when we invested a lot of effort into this
PELT+EAS). We also have a lot faster CPUs (~2x in 3-4 years).
IMO those games (and the OS mechanisms assisting them) would probably have
different needs (if they do these 'soft-real-time simulations' at such high
granularity, ~120/s -> every ~8ms).
IMO one old setting might not fit well into this: 4ms tick (which is the
Android use case), which then implies scheduler min granularity, which
we also align with the y^32 PELT.
Is this a correct chain of thinking?
Would it make sense to ask Android phone vendors to experiment with
1ms tick (w/ aligned PELT HL)? With that change, there might be some
power spike issues, though.
Regards,
Lukasz
On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> On 11/07/22 14:41, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> >
> > > Based on all the tests we've seen, jankbench or otherwise, the
> > > improvement can mainly be attributed to the faster ramp up of frequency
> > > caused by the shorter PELT window while using schedutil.
> >
> > Would something terrible like the below help some?
> >
> > If not, I suppose it could be modified to take the current state as
> > history. But basically it runs a faster pelt sum along side the regular
> > signal just for ramping up the frequency.
>
> A bit of a tangent, but this reminded me of this old patch:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> I think we have a bit too many moving cogs that might be creating undesired
> compound effect.
>
> Should we consider removing margins in favour of improving util ramp up/down?
> (whether via util_est or pelt hf).
Yeah, possibly.
So one thing that was key to that hack I proposed is that it is
per-task. This means we can either set or detect the task activation
period and use that to select an appropriate PELT multiplier.
But please explain; once tasks are in a steady state (60Hz, 90Hz or god
forbid higher), the utilization should be the same between the various
PELT window sizes, provided the activation period isn't *much* larger
than the window.
Are these things running a ton of single shot tasks or something daft
like that?
On 07/11/2022 14:41, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
[...]
> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> */
> static inline unsigned long cpu_util_cfs(int cpu)
> {
> + struct rq *rq = cpu_rq(cpu);
> struct cfs_rq *cfs_rq;
> unsigned long util;
>
> - cfs_rq = &cpu_rq(cpu)->cfs;
> + cfs_rq = &rq->cfs;
> util = READ_ONCE(cfs_rq->avg.util_avg);
>
> if (sched_feat(UTIL_EST)) {
> + if (sched_feat(UTIL_EST_FASTER)) {
> + struct task_struct *curr;
> +
> + rcu_read_lock();
> + curr = rcu_dereference(rq->curr);
> + if (likely(curr->sched_class == &fair_sched_class)) {
> + u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
Don't we end up with gigantic runtime numbers here?
root@juno:~# cat /proc/1676/task/1676/schedstat
36946300 1150620 11
root@juno:~# cat /proc/1676/task/1676/sched
rt-app (1676, #threads: 2)
-------------------------------------------------------------------
se.exec_start : 77766.964240 <- !
se.vruntime : 563.587883
se.sum_exec_runtime : 36.946300 <- !
se.nr_migrations : 0
...
I expect cpu_util_cfs() to be ~1024 almost all the time now.
> + util = max_t(unsigned long, util,
> + faster_est_approx(runtime * 2));
> + }
> + rcu_read_unlock();
> + }
> util = max_t(unsigned long, util,
> READ_ONCE(cfs_rq->avg.util_est.enqueued));
> }
[...]
On 11/09/22 16:49, Peter Zijlstra wrote:
> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> > On 11/07/22 14:41, Peter Zijlstra wrote:
> > > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> > >
> > > > Based on all the tests we've seen, jankbench or otherwise, the
> > > > improvement can mainly be attributed to the faster ramp up of frequency
> > > > caused by the shorter PELT window while using schedutil.
> > >
> > > Would something terrible like the below help some?
> > >
> > > If not, I suppose it could be modified to take the current state as
> > > history. But basically it runs a faster pelt sum along side the regular
> > > signal just for ramping up the frequency.
> >
> > A bit of a tangent, but this reminded me of this old patch:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > I think we have a bit too many moving cogs that might be creating undesired
> > compound effect.
> >
> > Should we consider removing margins in favour of improving util ramp up/down?
> > (whether via util_est or pelt hf).
>
> Yeah, possibly.
>
> So one thing that was key to that hack I proposed is that it is
> per-task. This means we can either set or detect the task activation
> period and use that to select an appropriate PELT multiplier.
Note that a big difference compared to the PELT HL change is that util_est only
biases towards going up faster; not being able to go down as quickly could
impact power, as our residency at higher frequencies will be higher. Only
testing can show how big of a problem this is in practice.
>
> But please explain; once tasks are in a steady state (60HZ, 90HZ or god
> forbit higher), the utilization should be the same between the various
> PELT window sizes, provided the activation period isn't *much* larger
> than the window.
It is a steady state only for a short period of time, before something else
happens that changes the nature of the workload.
For example, standing still in an empty room, then an explosion suddenly
happens causing lots of activity to appear on the screen.
We can have a steady state at 20%, but an action on the screen could suddenly
change the demand to 100%.
You can find a lot of videos on how to tweak cpu frequencies and governor to
improve gaming performances on youtube by the way:
https://www.youtube.com/results?search_query=android+gaming+cpu+boost
And this ancient video from google about impact of frequency scaling on games:
https://www.youtube.com/watch?v=AZ97b2nT-Vo
this is truly ancient and the advice given then (over 8 years ago) is not
a reflection of the current state of affairs.
The problem is not new; and I guess expectations just keep going higher on
what one can do on their phone in spite of all the past improvements :-)
>
> Are these things running a ton of single shot tasks or something daft
> like that?
I'm not sure how all game engines behave; but the few I've seen don't tend
to do that.
I've seen apps like instagram using single shot tasks sometime in the (distant)
past to retrieve images. Generally I'm not sure how the Java based APIs behave.
There is an API for Job Scheduler that allows apps to schedule background and
foreground work; that could end up reusing a pool of tasks or creating new
ones. I'm not sure. Game engines tend to be written in NDKs; but simpler games
might not be.
Cheers
--
Qais Yousef
On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
> On 07/11/2022 14:41, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
> > @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> > */
> > static inline unsigned long cpu_util_cfs(int cpu)
> > {
> > + struct rq *rq = cpu_rq(cpu);
> > struct cfs_rq *cfs_rq;
> > unsigned long util;
> >
> > - cfs_rq = &cpu_rq(cpu)->cfs;
> > + cfs_rq = &rq->cfs;
> > util = READ_ONCE(cfs_rq->avg.util_avg);
> >
> > if (sched_feat(UTIL_EST)) {
> > + if (sched_feat(UTIL_EST_FASTER)) {
> > + struct task_struct *curr;
> > +
> > + rcu_read_lock();
> > + curr = rcu_dereference(rq->curr);
> > + if (likely(curr->sched_class == &fair_sched_class)) {
> > + u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
>
> Don't we and up with gigantic runtime numbers here?
>
> oot@juno:~# cat /proc/1676/task/1676/schedstat
> 36946300 1150620 11
> root@juno:~# cat /proc/1676/task/1676/sched
> rt-app (1676, #threads: 2)
> -------------------------------------------------------------------
> se.exec_start : 77766.964240 <- !
> se.vruntime : 563.587883
> e.sum_exec_runtime : 36.946300 <- !
> se.nr_migrations : 0
> ...
>
> I expect cpu_util_cfs() to be ~1024 almost all the time now.
Duh, obviously I meant to measure the runtime of the current activation
and messed up.
We don't appear to have the right information to compute this atm :/
Hi,
> Would something terrible like the below help some?
>
> If not, I suppose it could be modified to take the current state as
> history. But basically it runs a faster pelt sum along side the regular
> signal just for ramping up the frequency.
As Dietmar mentioned in the other email, there seems to be an issue with
how the patch computes 'runtime'. Nevertheless I tested it just to see
what would happen so here are the results if you're interested.
Here's a comparison of Jankbench results on a normal system vs pelt_4 vs
performance cpufreq governor vs your pelt_rampup patch.
Max frame duration (ms)
+-----------------------+-----------+------------+
| kernel | iteration | value |
+-----------------------+-----------+------------+
| menu | 10 | 142.973401 |
| menu_pelt_4 | 10 | 85.271279 |
| menu_pelt_rampup | 10 | 61.494636 |
| menu_performance | 10 | 40.930829 |
+-----------------------+-----------+------------+
Power usage [mW]
+--------------+-----------------------+-------+-----------+
| chan_name | kernel | value | perc_diff |
+--------------+-----------------------+-------+-----------+
| total_power | menu | 144.6 | 0.0% |
| total_power | menu_pelt_4 | 158.5 | 9.63% |
| total_power | menu_pelt_rampup | 272.1 | 88.23% |
| total_power | menu_performance | 485.6 | 235.9% |
+--------------+-----------------------+-------+-----------+
Mean frame duration (ms)
+---------------+-----------------------+-------+-----------+
| variable | kernel | value | perc_diff |
+---------------+-----------------------+-------+-----------+
| mean_duration | menu | 13.9 | 0.0% |
| mean_duration | menu_pelt_4 | 14.5 | 4.74% |
| mean_duration | menu_pelt_rampup | 8.3 | -40.31% |
| mean_duration | menu_performance | 4.4 | -68.13% |
+---------------+-----------------------+-------+-----------+
Jank percentage
+-----------+-----------------------+-------+-----------+
| variable | kernel | value | perc_diff |
+-----------+-----------------------+-------+-----------+
| jank_perc | menu | 1.5 | 0.0% |
| jank_perc | menu_pelt_4 | 2.0 | 30.08% |
| jank_perc | menu_pelt_rampup | 0.1 | -93.09% |
| jank_perc | menu_performance | 0.1 | -96.29% |
+-----------+-----------------------+-------+-----------+
[...]
Some variant of this that's tunable at runtime could be workable for the
purposes described before. At least this further proves that it's manipulating
frequency that's responsible for the results here.
---
Kajetan
On 10/11/2022 14:05, Peter Zijlstra wrote:
> On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
>> On 07/11/2022 14:41, Peter Zijlstra wrote:
>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>>
>> [...]
>>
>>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>>> */
>>> static inline unsigned long cpu_util_cfs(int cpu)
>>> {
>>> + struct rq *rq = cpu_rq(cpu);
>>> struct cfs_rq *cfs_rq;
>>> unsigned long util;
>>>
>>> - cfs_rq = &cpu_rq(cpu)->cfs;
>>> + cfs_rq = &rq->cfs;
>>> util = READ_ONCE(cfs_rq->avg.util_avg);
>>>
>>> if (sched_feat(UTIL_EST)) {
>>> + if (sched_feat(UTIL_EST_FASTER)) {
>>> + struct task_struct *curr;
>>> +
>>> + rcu_read_lock();
>>> + curr = rcu_dereference(rq->curr);
>>> + if (likely(curr->sched_class == &fair_sched_class)) {
>>> + u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
>>
>> Don't we end up with gigantic runtime numbers here?
>>
>> root@juno:~# cat /proc/1676/task/1676/schedstat
>> 36946300 1150620 11
>> root@juno:~# cat /proc/1676/task/1676/sched
>> rt-app (1676, #threads: 2)
>> -------------------------------------------------------------------
>> se.exec_start : 77766.964240 <- !
>> se.vruntime : 563.587883
>> se.sum_exec_runtime : 36.946300 <- !
>> se.nr_migrations : 0
>> ...
>>
>> I expect cpu_util_cfs() to be ~1024 almost all the time now.
>
> Duh, obviously I meant to measure the runtime of the current activation
> and messed up.
>
> We don't appear to have the right information to compute this atm :/
This would be:
u64 now = rq_clock_task(rq);
u64 runtime = now - curr->se.exec_start;
but we don't hold the rq lock so we can't get `now`?
On Thu, Nov 10, 2022 at 03:59:01PM +0100, Dietmar Eggemann wrote:
> On 10/11/2022 14:05, Peter Zijlstra wrote:
> > On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
> >> On 07/11/2022 14:41, Peter Zijlstra wrote:
> >>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> >>
> >> [...]
> >>
> >>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> >>> */
> >>> static inline unsigned long cpu_util_cfs(int cpu)
> >>> {
> >>> + struct rq *rq = cpu_rq(cpu);
> >>> struct cfs_rq *cfs_rq;
> >>> unsigned long util;
> >>>
> >>> - cfs_rq = &cpu_rq(cpu)->cfs;
> >>> + cfs_rq = &rq->cfs;
> >>> util = READ_ONCE(cfs_rq->avg.util_avg);
> >>>
> >>> if (sched_feat(UTIL_EST)) {
> >>> + if (sched_feat(UTIL_EST_FASTER)) {
> >>> + struct task_struct *curr;
> >>> +
> >>> + rcu_read_lock();
> >>> + curr = rcu_dereference(rq->curr);
> >>> + if (likely(curr->sched_class == &fair_sched_class)) {
> >>> + u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
> >>
> >> Don't we end up with gigantic runtime numbers here?
> >>
> >> root@juno:~# cat /proc/1676/task/1676/schedstat
> >> 36946300 1150620 11
> >> root@juno:~# cat /proc/1676/task/1676/sched
> >> rt-app (1676, #threads: 2)
> >> -------------------------------------------------------------------
> >> se.exec_start : 77766.964240 <- !
> >> se.vruntime : 563.587883
> >> se.sum_exec_runtime : 36.946300 <- !
> >> se.nr_migrations : 0
> >> ...
> >>
> >> I expect cpu_util_cfs() to be ~1024 almost all the time now.
> >
> > Duh, obviously I meant to measure the runtime of the current activation
> > and messed up.
> >
> > We don't appear to have the right information to compute this atm :/
>
> This would be:
>
> u64 now = rq_clock_task(rq);
> u64 runtime = now - curr->se.exec_start;
>
> but we don't hold the rq lock so we can't get `now`?
Not quite the same; that's the time since we got on-cpu last, but that's
not the same as the runtime of this activation (it is when you discount
preemption).
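A minimal sketch of how that information could be made available,
assuming a new per-entity field that snapshots sum_exec_runtime at
enqueue time (field and helper names are purely illustrative):

/*
 * Snapshot sum_exec_runtime when the task is activated so the runtime
 * of the current activation can later be derived without the rq lock.
 */
static inline void note_activation(struct sched_entity *se)
{
	/* called from the enqueue path, i.e. activate_task() */
	se->sum_exec_runtime_at_activation = se->sum_exec_runtime;
}

static inline u64 activation_runtime(const struct sched_entity *se)
{
	/* CPU time consumed since the task last became runnable */
	return se->sum_exec_runtime - se->sum_exec_runtime_at_activation;
}

Since sum_exec_runtime only advances while the task is running, this
counts the running time of the current activation and discounts
preemption.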
On 10/11/2022 18:51, Peter Zijlstra wrote:
> On Thu, Nov 10, 2022 at 03:59:01PM +0100, Dietmar Eggemann wrote:
>> On 10/11/2022 14:05, Peter Zijlstra wrote:
>>> On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
>>>> On 07/11/2022 14:41, Peter Zijlstra wrote:
>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>>>>> */
>>>>> static inline unsigned long cpu_util_cfs(int cpu)
>>>>> {
>>>>> + struct rq *rq = cpu_rq(cpu);
>>>>> struct cfs_rq *cfs_rq;
>>>>> unsigned long util;
>>>>>
>>>>> - cfs_rq = &cpu_rq(cpu)->cfs;
>>>>> + cfs_rq = &rq->cfs;
>>>>> util = READ_ONCE(cfs_rq->avg.util_avg);
>>>>>
>>>>> if (sched_feat(UTIL_EST)) {
>>>>> + if (sched_feat(UTIL_EST_FASTER)) {
>>>>> + struct task_struct *curr;
>>>>> +
>>>>> + rcu_read_lock();
>>>>> + curr = rcu_dereference(rq->curr);
>>>>> + if (likely(curr->sched_class == &fair_sched_class)) {
>>>>> + u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
>>>>
>>>> Don't we end up with gigantic runtime numbers here?
>>>>
>>>> root@juno:~# cat /proc/1676/task/1676/schedstat
>>>> 36946300 1150620 11
>>>> root@juno:~# cat /proc/1676/task/1676/sched
>>>> rt-app (1676, #threads: 2)
>>>> -------------------------------------------------------------------
>>>> se.exec_start : 77766.964240 <- !
>>>> se.vruntime : 563.587883
>>>> se.sum_exec_runtime : 36.946300 <- !
>>>> se.nr_migrations : 0
>>>> ...
>>>>
>>>> I expect cpu_util_cfs() to be ~1024 almost all the time now.
>>>
>>> Duh, obviously I meant to measure the runtime of the current activation
>>> and messed up.
>>>
>>> We don't appear to have the right information to compute this atm :/
>>
>> This would be:
>>
>> u64 now = rq_clock_task(rq);
>> u64 runtime = now - curr->se.exec_start;
>>
>> but we don't hold the rq lock so we can't get `now`?
>
> Not quite the same; that's the time since we got on-cpu last, but that's
> not the same as the runtime of this activation (it is when you discount
> preemption).
----|----|----|----|----|----|--->
    a    s1   p1   s2   p2   d
a ... activate_task() -> enqueue_task()
s ... set_next_entity()
p ... put_prev_entity()
d ... deactivate_task() -> dequeue_task()
By `runtime of the activation` you refer to `curr->sum_exec_runtime -
time(a)` ? And the latter we don't have?
And `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
only covers the time since we last got onto the CPU, right?
With the missing `runtime >>= 10` (from __update_load_sum()) added and using
`runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
for a single-task workload (so no preemption) with factor 2 or 4, I get at
least close to the original rq->cfs.avg.util_avg and util_est.enqueued
signals (cells (5)-(8) in the notebook below).
https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/UTIL_EST_FASTER.ipynb?flush_cache=true
----
set_next_entity()
update_stats_curr_start()
se->exec_start = rq_clock_task()
cfs_rq->curr = se (1)
se->prev_sum_exec_runtime = se->sum_exec_runtime (2)
update_curr()
now = rq_clock_task(rq_of(cfs_rq))
delta_exec = now - curr->exec_start (3)
curr->exec_start = now
curr->sum_exec_runtime += delta_exec; (4)
put_prev_entity()
cfs_rq->curr = NULL (5)
On Wed, Nov 30, 2022 at 07:14:51PM +0100, Dietmar Eggemann wrote:
> By `runtime of the activation` you refer to `curr->sum_exec_runtime -
> time(a)` ? And the latter we don't have?
>
> And `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_run`
> is only covering the time since we got onto the cpu, right?
>
> With a missing `runtime >>= 10` (from __update_load_sum()) and using
> `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
> for a 1 task-workload (so no preemption) with factor 2 or 4 I get at
> least close to the original rq->cfs.avg.util_avg and util_est.enqueued
> signals (cells (5)-(8) in the notebook below).
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/UTIL_EST_FASTER.ipynb?flush_cache=true
>
With those two changes as described above the comparative results are as
follows:
Max frame durations (worst case scenario)
+--------------------------------+-----------+------------+
| kernel | iteration | value |
+--------------------------------+-----------+------------+
| baseline_60hz | 10 | 149.935514 |
| pelt_rampup_runtime_shift_60hz | 10 | 108.126862 |
+--------------------------------+-----------+------------+
Power usage [mW]
+--------------+--------------------------------+-------+-----------+
| chan_name | kernel | value | perc_diff |
+--------------+--------------------------------+-------+-----------+
| total_power | baseline_60hz | 141.6 | 0.0% |
| total_power | pelt_rampup_runtime_shift_60hz | 168.0 | 18.61% |
+--------------+--------------------------------+-------+-----------+
Mean frame duration (average case)
+---------------+--------------------------------+-------+-----------+
| variable | kernel | value | perc_diff |
+---------------+--------------------------------+-------+-----------+
| mean_duration | baseline_60hz | 16.7 | 0.0% |
| mean_duration | pelt_rampup_runtime_shift_60hz | 13.6 | -18.9% |
+---------------+--------------------------------+-------+-----------+
Jank percentage
+-----------+--------------------------------+-------+-----------+
| variable | kernel | value | perc_diff |
+-----------+--------------------------------+-------+-----------+
| jank_perc | baseline_60hz | 4.0 | 0.0% |
| jank_perc | pelt_rampup_runtime_shift_60hz | 1.5 | -64.04% |
+-----------+--------------------------------+-------+-----------+
Meaning it's a middle ground of sorts - instead of a 90% increase in
power usage it's 'just' 19%. At the same time though the fastest PELT
multiplier (pelt_4) was getting better max frame durations (85ms vs
108ms) for about half the power increase (9.6% vs 18.6%).
On 09/11/2022 16:49, Peter Zijlstra wrote:
> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
[...]
> So one thing that was key to that hack I proposed is that it is
> per-task. This means we can either set or detect the task activation
> period and use that to select an appropriate PELT multiplier.
>
> But please explain; once tasks are in a steady state (60HZ, 90HZ or god
> forbit higher), the utilization should be the same between the various
> PELT window sizes, provided the activation period isn't *much* larger
> than the window.
>
> Are these things running a ton of single shot tasks or something daft
> like that?
This investigation tries to answer these questions. The results can
be found in chapters (B) and (C).
I ran 'util_est_faster' with delta equal to 'duration of the current
activation'. I.e. the following patch is needed:
https://lkml.kernel.org/r/ec049fd9b635f76a9e1d1ad380fd9184ebeeca53.1671158588.git.yu.c.chen@intel.com
The testcase is Jankbench on Android 12 on Pixel6, CPU orig capacity
= [124 124 124 124 427 427 1024 1024], w/ mainline v5.18 kernel and
forward ported task scheduler patches.
(A) *** 'util_est_faster' vs. 'scaled util_est_faster' ***
The initial approach didn't scale the runtime duration. It is based
on task clock and not PELT clock but it should be scaled by uArch
and frequency to align with the PELT time used for util tracking.
That said, the original approach shows better results than the scaled
one: the even more aggressive boosting on non-big CPUs helps to raise the
frequency even quicker in the scenario described under (B).
All tests ran 10 iterations of all Jankbench sub-tests.
Max_frame_duration:
+------------------------+------------+
| kernel | value |
+------------------------+------------+
| base-a30b17f016b0 | 147.571352 |
| util_est_faster | 84.834999 |
| scaled_util_est_faster | 127.72855 |
+------------------------+------------+
Mean_frame_duration:
+------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0 | 14.7 | 0.0% |
| util_est_faster | 12.6 | -14.01% |
| scaled_util_est_faster | 13.5 | -8.45% |
+------------------------+-------+-----------+
Jank percentage (Jank deadline 16ms):
+------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0 | 1.8 | 0.0% |
| util_est_faster | 0.8 | -57.8% |
| scaled_util_est_faster | 1.4 | -25.89% |
+------------------------+-------+-----------+
Power usage [mW] (total - all CPUs):
+------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0 | 144.4 | 0.0% |
| util_est_faster | 150.9 | 4.45% |
| scaled_util_est_faster | 152.2 | 5.4% |
+------------------------+-------+-----------+
'scaled util_est_faster' is used as the base for all following tests.
(B) *** Where does util_est_faster help exactly? ***
It turns out that the score improvement comes from the more aggressive
DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
-> effective_cpu_util(..., cpu_util_cfs(), ...).
At the beginning of an episode (e.g. beginning of an image list view
fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
frequency') of the Android Graphics Pipeline (AGP) start to run, the
CPU Operating Performance Point (OPP) is often so low that those tasks
run more like 10/16ms, which lets the test application count a lot of
Jankframes at those moments.
And there is where this util_est_faster approach helps by boosting CPU
util according to the 'runtime of the current activation'.
Moreover it could also be that the tasks have simply more work to do in
these first activations at the beginning of an episode.
All the other places in which cpu_util_cfs() is used:
(2) CFS load balance ('_lb')
(3) CPU overutilization ('_ou')
(4) CFS fork/exec task placement ('_slowpath')
when tested individually, show no improvement or even a regression.
Max_frame_duration:
+---------------------------------+------------+
| kernel | value |
+---------------------------------+------------+
| scaled_util_est_faster | 127.72855 |
| scaled_util_est_faster_freq | 126.646506 |
| scaled_util_est_faster_lb | 162.596249 |
| scaled_util_est_faster_ou | 166.59519 |
| scaled_util_est_faster_slowpath | 153.966638 |
+---------------------------------+------------+
Mean_frame_duration:
+---------------------------------+-------+-----------+
| kernel | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster | 13.5 | 0.0% |
| scaled_util_est_faster_freq | 13.7 | 1.79% |
| scaled_util_est_faster_lb | 14.8 | 9.87% |
| scaled_util_est_faster_ou | 14.5 | 7.46% |
| scaled_util_est_faster_slowpath | 16.2 | 20.45% |
+---------------------------------+-------+-----------+
Jank percentage (Jank deadline 16ms):
+---------------------------------+-------+-----------+
| kernel | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster | 1.4 | 0.0% |
| scaled_util_est_faster_freq | 1.3 | -2.34% |
| scaled_util_est_faster_lb | 1.7 | 27.42% |
| scaled_util_est_faster_ou | 2.1 | 50.33% |
| scaled_util_est_faster_slowpath | 2.8 | 102.39% |
+---------------------------------+-------+-----------+
Power usage [mW] (total - all CPUs):
+---------------------------------+-------+-----------+
| kernel | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster | 152.2 | 0.0% |
| scaled_util_est_faster_freq | 132.3 | -13.1% |
| scaled_util_est_faster_lb | 137.1 | -9.96% |
| scaled_util_est_faster_ou | 132.4 | -13.04% |
| scaled_util_est_faster_slowpath | 141.3 | -7.18% |
+---------------------------------+-------+-----------+
(C) *** Which tasks contribute the most to the score improvement? ***
A trace_event capturing the cases in which task's util_est_fast trumps
CPU util was added to cpu_util_cfs(). This is 1 iteration of Jankbench
and the base is (1) 'scaled_util_est_faster_freq':
https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_6.ipynb
'Cell [6]' shows the tasks of the Jankbench process
'[com.an]droid.benchmark' which are boosting the CPU frequency request.
Among them are 2 main threads of the AGP, '[com.an]droid.benchmark' and
'RenderThread'.
The spikes in util_est_fast are congruent with the aforementioned
beginning of an episode in which these periodic tasks are running and
when their runtime/period is rather ~10/16ms and not ~1-2/16ms since
the CPU OPP is still low.
Very few other Jankbench tasks 'Cell [6]' show the same behaviour. The
Surfaceflinger process 'Cell [8]' is not affected, and of the kernel
tasks only kcompactd0 creates a mild boost 'Cell [9]'.
As expected, running a non-scaled version of (1) shows more aggressive
boosting on non-big CPUs:
https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_5.ipynb
Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
util when periodic tasks have a longer runtime compared to when they reach
steady state.
The result is very similar to PELT halflife reduction. The advantage is
that 'util_est_faster' is only activated selectively when the runtime of
the current task in its current activation is long enough to create this
CPU util boost.
Original patch:
https://lkml.kernel.org/r/[email protected]
Changes applied:
- use 'duration of the current activation' as delta
- delta >>= 10
- uArch and frequency scaling of delta
-->%--
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index efdc29c42161..76d146d06bbe 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -97,6 +97,7 @@ SCHED_FEAT(WA_BIAS, true)
*/
SCHED_FEAT(UTIL_EST, true)
SCHED_FEAT(UTIL_EST_FASTUP, true)
+SCHED_FEAT(UTIL_EST_FASTER, true)
SCHED_FEAT(LATENCY_WARN, false)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 0f310768260c..13cd9e27ce3e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
return periods;
}
+/*
+ * Compute a pelt util_avg assuming no history and @delta runtime.
+ */
+unsigned long faster_est_approx(u64 delta)
+{
+ unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
+ u64 periods = delta / 1024;
+
+ if (periods) {
+ delta %= 1024;
+ contrib = __accumulate_pelt_segments(periods, 1024, delta);
+ }
+
+ return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
+}
+
/*
* We can represent the historical contribution to runnable average as the
* coefficients of a geometric series. To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1072502976df..7cb45f1d8062 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2961,6 +2961,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
return READ_ONCE(rq->avg_dl.util_avg);
}
+extern unsigned long faster_est_approx(u64 runtime);
+
/**
* cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
* @cpu: the CPU to get the utilization for.
@@ -2995,13 +2997,39 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
*/
static inline unsigned long cpu_util_cfs(int cpu)
{
+ struct rq *rq = cpu_rq(cpu);
struct cfs_rq *cfs_rq;
unsigned long util;
- cfs_rq = &cpu_rq(cpu)->cfs;
+ cfs_rq = &rq->cfs;
util = READ_ONCE(cfs_rq->avg.util_avg);
if (sched_feat(UTIL_EST)) {
+ if (sched_feat(UTIL_EST_FASTER)) {
+ struct task_struct *curr;
+
+ rcu_read_lock();
+ curr = rcu_dereference(rq->curr);
+ if (likely(curr->sched_class == &fair_sched_class)) {
+ unsigned long util_est_fast;
+ u64 delta;
+
+ delta = curr->se.sum_exec_runtime -
+ curr->se.prev_sum_exec_runtime_vol;
+
+ delta >>= 10;
+ if (!delta)
+ goto unlock;
+
+ delta = cap_scale(delta, arch_scale_cpu_capacity(cpu));
+ delta = cap_scale(delta, arch_scale_freq_capacity(cpu));
+
+ util_est_fast = faster_est_approx(delta * 2);
+ util = max(util, util_est_fast);
+ }
+unlock:
+ rcu_read_unlock();
+ }
util = max_t(unsigned long, util,
READ_ONCE(cfs_rq->avg.util_est.enqueued));
}
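As a rough back-of-the-envelope check of the magnitudes involved (an
approximation, not measured data): for a task with no history running
continuously, the code above behaves roughly like

  util_est_fast ~= 1024 * (1 - y^p), with y^32 = 0.5 and
  p = (delta * 2) / 1024

so ~8ms of (uArch- and frequency-scaled) runtime gives p ~= 16 periods
and util_est_fast ~= 1024 * (1 - 0.5^(16/32)) ~= 300. The boost
therefore becomes significant after only a few milliseconds of
uninterrupted runtime.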
On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <[email protected]> wrote:
>
> On 09/11/2022 16:49, Peter Zijlstra wrote:
> > On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> >> On 11/07/22 14:41, Peter Zijlstra wrote:
> >>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
> > So one thing that was key to that hack I proposed is that it is
> > per-task. This means we can either set or detect the task activation
> > period and use that to select an appropriate PELT multiplier.
> >
> > But please explain; once tasks are in a steady state (60HZ, 90HZ or god
> > forbit higher), the utilization should be the same between the various
> > PELT window sizes, provided the activation period isn't *much* larger
> > than the window.
> >
> > Are these things running a ton of single shot tasks or something daft
> > like that?
>
> This investigation tries to answer these questions. The results can
> be found in chapter (B) and (C).
>
> I ran 'util_est_faster' with delta equal to 'duration of the current
> activation'. I.e. the following patch is needed:
>
> https://lkml.kernel.org/r/ec049fd9b635f76a9e1d1ad380fd9184ebeeca53.1671158588.git.yu.c.chen@intel.com
>
> The testcase is Jankbench on Android 12 on Pixel6, CPU orig capacity
> = [124 124 124 124 427 427 1024 1024], w/ mainline v5.18 kernel and
> forward ported task scheduler patches.
>
> (A) *** 'util_est_faster' vs. 'scaled util_est_faster' ***
>
> The initial approach didn't scale the runtime duration. It is based
> on task clock and not PELT clock but it should be scaled by uArch
> and frequency to align with the PELT time used for util tracking.
>
> Although the original approach shows better results than the scaled
> one. Even more aggressive boosting on non-big CPUs helps to raise the
> frequency even quicker in the scenario described under (B).
>
> All tests ran 10 iterations of all Jankbench sub-tests.
>
> Max_frame_duration:
> +------------------------+------------+
> | kernel | value |
> +------------------------+------------+
> | base-a30b17f016b0 | 147.571352 |
> | util_est_faster | 84.834999 |
> | scaled_util_est_faster | 127.72855 |
> +------------------------+------------+
>
> Mean_frame_duration:
> +------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +------------------------+-------+-----------+
> | base-a30b17f016b0 | 14.7 | 0.0% |
> | util_est_faster | 12.6 | -14.01% |
> | scaled_util_est_faster | 13.5 | -8.45% |
> +------------------------+-------+-----------+
>
> Jank percentage (Jank deadline 16ms):
> +------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +------------------------+-------+-----------+
> | base-a30b17f016b0 | 1.8 | 0.0% |
> | util_est_faster | 0.8 | -57.8% |
> | scaled_util_est_faster | 1.4 | -25.89% |
> +------------------------+-------+-----------+
>
> Power usage [mW] (total - all CPUs):
> +------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +------------------------+-------+-----------+
> | base-a30b17f016b0 | 144.4 | 0.0% |
> | util_est_faster | 150.9 | 4.45% |
> | scaled_util_est_faster | 152.2 | 5.4% |
> +------------------------+-------+-----------+
>
> 'scaled util_est_faster' is used as the base for all following tests.
>
> (B) *** Where does util_est_faster help exactly? ***
>
> It turns out that the score improvement comes from the more aggressive
> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
> -> effective_cpu_util(..., cpu_util_cfs(), ...).
>
> At the beginning of an episode (e.g. beginning of an image list view
> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
> frequency') of the Android Graphics Pipeline (AGP) start to run, the
> CPU Operating Performance Point (OPP) is often so low that those tasks
> run more like 10/16ms which let the test application count a lot of
> Jankframes at those moments.
I don't see how util_est_faster can help this 1ms task here? It's
most probably never preempted during this 1ms. For such a short Android
Graphics Pipeline task, isn't uclamp_min designed for this and a
better solution?
>
> And there is where this util_est_faster approach helps by boosting CPU
> util according to the 'runtime of the current activation'.
> Moreover it could also be that the tasks have simply more work to do in
> these first activations at the beginning of an episode.
>
> All the other places in which cpu_util_cfs() is used:
>
> (2) CFS load balance ('_lb')
> (3) CPU overutilization ('_ou')
> (4) CFS fork/exec task placement ('_slowpath')
>
> when tested individually don't show any improvement or even regression.
>
> Max_frame_duration:
> +---------------------------------+------------+
> | kernel | value |
> +---------------------------------+------------+
> | scaled_util_est_faster | 127.72855 |
> | scaled_util_est_faster_freq | 126.646506 |
> | scaled_util_est_faster_lb | 162.596249 |
> | scaled_util_est_faster_ou | 166.59519 |
> | scaled_util_est_faster_slowpath | 153.966638 |
> +---------------------------------+------------+
>
> Mean_frame_duration:
> +---------------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +---------------------------------+-------+-----------+
> | scaled_util_est_faster | 13.5 | 0.0% |
> | scaled_util_est_faster_freq | 13.7 | 1.79% |
> | scaled_util_est_faster_lb | 14.8 | 9.87% |
> | scaled_util_est_faster_ou | 14.5 | 7.46% |
> | scaled_util_est_faster_slowpath | 16.2 | 20.45% |
> +---------------------------------+-------+-----------+
>
> Jank percentage (Jank deadline 16ms):
> +---------------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +---------------------------------+-------+-----------+
> | scaled_util_est_faster | 1.4 | 0.0% |
> | scaled_util_est_faster_freq | 1.3 | -2.34% |
> | scaled_util_est_faster_lb | 1.7 | 27.42% |
> | scaled_util_est_faster_ou | 2.1 | 50.33% |
> | scaled_util_est_faster_slowpath | 2.8 | 102.39% |
> +---------------------------------+-------+-----------+
>
> Power usage [mW] (total - all CPUs):
> +---------------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +---------------------------------+-------+-----------+
> | scaled_util_est_faster | 152.2 | 0.0% |
> | scaled_util_est_faster_freq | 132.3 | -13.1% |
> | scaled_util_est_faster_lb | 137.1 | -9.96% |
> | scaled_util_est_faster_ou | 132.4 | -13.04% |
> | scaled_util_est_faster_slowpath | 141.3 | -7.18% |
> +---------------------------------+-------+-----------+
>
> (C) *** Which tasks contribute the most to the score improvement? ***
>
> A trace_event capturing the cases in which task's util_est_fast trumps
> CPU util was added to cpu_util_cfs(). This is 1 iteration of Jankbench
> and the base is (1) 'scaled_util_est_faster_freq':
>
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_6.ipynb
>
> 'Cell [6]' shows the tasks of the Jankbench process
> '[com.an]droid.benchmark' which are boosting the CPU frequency request.
>
> Among them are 2 main threads of the AGP, '[com.an]droid.benchmark' and
> 'RenderThread'.
> The spikes in util_est_fast are congruent with the aforementioned
> beginning of an episode in which these periodic tasks are running and
> when their runtime/period is rather ~10/16ms and not ~1-2/16ms since
> the CPU OPP is still low.
>
> Very few other Jankbench tasks 'Cell [6] show the same behaviour. The
> Surfaceflinger process 'Cell [8]' is not affected and from the kernel
> tasks only kcompctd0 creates a mild boost 'Cell [9]'.
>
> As expected, running a non-scaled version of (1) shows more aggressive
> boosting on non-big CPUs:
>
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_5.ipynb
>
> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
> util when periodic tasks have a longer runtime compared to when they reach
> steady state.
>
> The result is very similar to PELT halflife reduction. The advantage is
> that 'util_est_faster' is only activated selectively when the runtime of
> the current task in its current activation is long enough to create this
> CPU util boost.
IIUC how util_est_faster works, it removes the waiting time when
sharing CPU time with other tasks. So as long as there is no (runnable
but not running) time, the result is the same as with the current util_est.
util_est_faster makes a difference only when the task alternates
between runnable and running slices.
Have you considered using the runnable_avg metric when increasing the CPU
frequency? It takes into account the runnable slices and not only the running
time, and it increases faster than util_avg when tasks compete for the same
CPU.
>
> Original patch:
> https://lkml.kernel.org/r/[email protected]
>
> Changes applied:
> - use 'duration of the current activation' as delta
> - delta >>= 10
> - uArch and frequency scaling of delta
>
> -->%--
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index efdc29c42161..76d146d06bbe 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -97,6 +97,7 @@ SCHED_FEAT(WA_BIAS, true)
> */
> SCHED_FEAT(UTIL_EST, true)
> SCHED_FEAT(UTIL_EST_FASTUP, true)
> +SCHED_FEAT(UTIL_EST_FASTER, true)
>
> SCHED_FEAT(LATENCY_WARN, false)
>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 0f310768260c..13cd9e27ce3e 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
> return periods;
> }
>
> +/*
> + * Compute a pelt util_avg assuming no history and @delta runtime.
> + */
> +unsigned long faster_est_approx(u64 delta)
> +{
> + unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
> + u64 periods = delta / 1024;
> +
> + if (periods) {
> + delta %= 1024;
> + contrib = __accumulate_pelt_segments(periods, 1024, delta);
> + }
> +
> + return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
> +}
> +
> /*
> * We can represent the historical contribution to runnable average as the
> * coefficients of a geometric series. To do this we sub-divide our runnable
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1072502976df..7cb45f1d8062 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2961,6 +2961,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> return READ_ONCE(rq->avg_dl.util_avg);
> }
>
> +extern unsigned long faster_est_approx(u64 runtime);
> +
> /**
> * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
> * @cpu: the CPU to get the utilization for.
> @@ -2995,13 +2997,39 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> */
> static inline unsigned long cpu_util_cfs(int cpu)
> {
> + struct rq *rq = cpu_rq(cpu);
> struct cfs_rq *cfs_rq;
> unsigned long util;
>
> - cfs_rq = &cpu_rq(cpu)->cfs;
> + cfs_rq = &rq->cfs;
> util = READ_ONCE(cfs_rq->avg.util_avg);
>
> if (sched_feat(UTIL_EST)) {
> + if (sched_feat(UTIL_EST_FASTER)) {
> + struct task_struct *curr;
> +
> + rcu_read_lock();
> + curr = rcu_dereference(rq->curr);
> + if (likely(curr->sched_class == &fair_sched_class)) {
> + unsigned long util_est_fast;
> + u64 delta;
> +
> + delta = curr->se.sum_exec_runtime -
> + curr->se.prev_sum_exec_runtime_vol;
> +
> + delta >>= 10;
> + if (!delta)
> + goto unlock;
> +
> + delta = cap_scale(delta, arch_scale_cpu_capacity(cpu));
> + delta = cap_scale(delta, arch_scale_freq_capacity(cpu));
> +
> + util_est_fast = faster_est_approx(delta * 2);
> + util = max(util, util_est_fast);
> + }
> +unlock:
> + rcu_read_unlock();
> + }
> util = max_t(unsigned long, util,
> READ_ONCE(cfs_rq->avg.util_est.enqueued));
> }
On 09/02/2023 17:16, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 09/11/2022 16:49, Peter Zijlstra wrote:
>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
[...]
>> (B) *** Where does util_est_faster help exactly? ***
>>
>> It turns out that the score improvement comes from the more aggressive
>> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
>> -> effective_cpu_util(..., cpu_util_cfs(), ...).
>>
>> At the beginning of an episode (e.g. beginning of an image list view
>> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
>> frequency') of the Android Graphics Pipeline (AGP) start to run, the
>> CPU Operating Performance Point (OPP) is often so low that those tasks
>> run more like 10/16ms which let the test application count a lot of
>> Jankframes at those moments.
>
> I don't see how util_est_faster can help this 1ms task here ? It's
> most probably never be preempted during this 1ms. For such an Android
It's 1/16ms (runtime/period) at max CPU frequency and on a big CPU. The
runtime could be longer with min CPU frequency on a little CPU. I see
runtimes of up to 10ms at the beginning of a test episode.
Like I mentioned below, it could also be that the tasks have more work
to do at the beginning. It's easy to spot using Google's perfetto and
those moments also correlate with the occurrence of jankframes. I'm not
yet sure how much this has to do with the perfetto instrumentation though.
But you're right, on top of that, there is preemption (e.g. of the UI
thread) by other threads (render thread, involved binder threads,
surfaceflinger, etc.) going on. So the UI thread could be
running+runnable for > 20ms, again marked as a jankframe.
> Graphics Pipeline short task, hasn't uclamp_min been designed for and
> a better solution ?
Yes, it has. I'm not sure how feasible this is to do for all tasks
involved. I'm thinking about the Binder threads here for instance.
[...]
>> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
>> util when periodic tasks have a longer runtime compared to when they reach
>> steady state.
>>
>> The result is very similar to PELT halflife reduction. The advantage is
>> that 'util_est_faster' is only activated selectively when the runtime of
>> the current task in its current activation is long enough to create this
>> CPU util boost.
>
> IIUC how util_est_faster works, it removes the waiting time when
> sharing cpu time with other tasks. So as long as there is no (runnable
> but not running time), the result is the same as current util_est.
> util_est_faster makes a difference only when the task alternates
> between runnable and running slices.
> Have you considered using runnable_avg metrics in the increase of cpu
> freq ? This takes into the runnable slice and not only the running
> time and increase faster than util_avg when tasks compete for the same
> CPU
Good idea! No, I haven't.
I just glanced over the code; there shouldn't be an advantage in terms
of a more recent update between `curr->sum_exec_runtime` and
update_load_avg(cfs_rq), even in the taskgroup case.
Per-task view:
https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb
All tests ran 10 iterations of all Jankbench sub-tests. (Reran
`max_util_scaled_util_est_faster_rbl_freq` once with very similar
results, just to make sure the numbers are reliable.)
Max_frame_duration:
+------------------------------------------+------------+
| kernel | value |
+------------------------------------------+------------+
| base-a30b17f016b0 | 147.571352 |
| pelt-hl-m2 | 119.416351 |
| pelt-hl-m4 | 96.473412 |
| scaled_util_est_faster_freq | 126.646506 |
| max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
+------------------------------------------+------------+
Mean_frame_duration:
+------------------------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------------------------+-------+-----------+
| base-a30b17f016b0 | 14.7 | 0.0% |
| pelt-hl-m2 | 13.6 | -7.5% |
| pelt-hl-m4 | 13.0 | -11.68% |
| scaled_util_est_faster_freq | 13.7 | -6.81% |
| max_util_scaled_util_est_faster_rbl_freq | 12.1 | -17.85% |
+------------------------------------------+-------+-----------+
Jank percentage (Jank deadline 16ms):
+------------------------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------------------------+-------+-----------+
| base-a30b17f016b0 | 1.8 | 0.0% |
| pelt-hl-m2 | 1.8 | -4.91% |
| pelt-hl-m4 | 1.2 | -36.61% |
| scaled_util_est_faster_freq | 1.3 | -27.63% |
| max_util_scaled_util_est_faster_rbl_freq | 0.8 | -54.86% |
+------------------------------------------+-------+-----------+
Power usage [mW] (total - all CPUs):
+------------------------------------------+-------+-----------+
| kernel | value | perc_diff |
+------------------------------------------+-------+-----------+
| base-a30b17f016b0 | 144.4 | 0.0% |
| pelt-hl-m2 | 141.6 | -1.97% |
| pelt-hl-m4 | 163.2 | 12.99% |
| scaled_util_est_faster_freq | 132.3 | -8.41% |
| max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67% |
+------------------------------------------+-------+-----------+
There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
`Jank percentage` and `Power usage` are better.
So maybe DVFS boosting in preempt-scenarios is really the thing here to
further improve the Android Graphics Pipeline.
I ran the same test (boosting only for DVFS requests) with:
-->8--
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dbc56e8b85f9..7a4bf38f2920 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
READ_ONCE(cfs_rq->avg.util_est.enqueued));
}
+ util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
+
return min(util, capacity_orig_of(cpu));
Thanks!
-- Dietmar
On Thu, Feb 09, 2023 at 05:16:46PM +0100, Vincent Guittot wrote:
> > The result is very similar to PELT halflife reduction. The advantage is
> > that 'util_est_faster' is only activated selectively when the runtime of
> > the current task in its current activation is long enough to create this
> > CPU util boost.
>
> IIUC how util_est_faster works, it removes the waiting time when
> sharing cpu time with other tasks. So as long as there is no (runnable
> but not running time), the result is the same as current util_est.
Uh.. it's double the speed, no? Even if there is no contention, the
fake/in-situ pelt sum runs at double time and thus will ramp up faster
than normal.
> util_est_faster makes a difference only when the task alternates
> between runnable and running slices.
UTIL_EST was supposed to help mitigate some of that, but yes. Also note
that _FASTER sorta sucks here because it starts from 0 every time; if it
were to start from the state saved by util_est_dequeue(), it would ramp
up faster still.
Patch has a comment along those lines I think.
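A minimal sketch of such a seeded ramp, assuming access to the PELT
decay helper decay_load() from kernel/sched/pelt.c and to the util_est
value saved at dequeue time (function name is made up):

/*
 * Start the in-situ ramp from 'seed' (e.g. the task's util_est saved at
 * dequeue) instead of from zero history. After p = delta/1024 periods
 * of running, a PELT-like signal seeded with 'seed' is roughly
 * seed * y^p + 1024 * (1 - y^p); faster_est_approx(delta) already
 * approximates the second term.
 */
static unsigned long faster_est_seeded(unsigned long seed, u64 delta)
{
	unsigned long util;

	util = decay_load(seed, delta / 1024) + faster_est_approx(delta);

	return min_t(unsigned long, util, SCHED_CAPACITY_SCALE);
}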
> Have you considered using runnable_avg metrics in the increase of cpu
> freq ? This takes into the runnable slice and not only the running
> time and increase faster than util_avg when tasks compete for the same
> CPU
Interesting! Indeed, that's boosting the DVFS for contention. And as
deggeman's reply shows, it seems to work well.
I wonder if that one place where it regresses is exactly the case
without contention.
On Mon, 20 Feb 2023 at 11:13, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Feb 09, 2023 at 05:16:46PM +0100, Vincent Guittot wrote:
>
> > > The result is very similar to PELT halflife reduction. The advantage is
> > > that 'util_est_faster' is only activated selectively when the runtime of
> > > the current task in its current activation is long enough to create this
> > > CPU util boost.
> >
> > IIUC how util_est_faster works, it removes the waiting time when
> > sharing cpu time with other tasks. So as long as there is no (runnable
> > but not running time), the result is the same as current util_est.
>
> Uh.. it's double the speed, no? Even if there is no contention, the
> fake/in-situ pelt sum runs at double time and thus will ramp up faster
> than normal.
Ah yes. I haven't noticed it was (delta * 2) and not delta
>
> > util_est_faster makes a difference only when the task alternates
> > between runnable and running slices.
>
> UTIL_EST was supposed to help mitigate some of that, but yes. Also note
> that _FASTER sorta sucks here because it starts from 0 every time, if it
> were to start from the state saved by util_est_dequeue(), it would ramp
> up faster still.
Yes.
>
> Patch has a comment along those lines I think.
>
> > Have you considered using runnable_avg metrics in the increase of cpu
> > freq ? This takes into the runnable slice and not only the running
> > time and increase faster than util_avg when tasks compete for the same
> > CPU
>
> Interesting! Indeed, that's boosting the DVFS for contention. And as
> deggeman's reply shows, it seems to work well.
>
> I wonder if that one place where it regresses is exactly the case
> without contention.
Yes, that might be the case indeed. I would expect uclamp_min to help
ensure a min frequency in such a scenario.
On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <[email protected]> wrote:
>
> On 09/02/2023 17:16, Vincent Guittot wrote:
> > On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 09/11/2022 16:49, Peter Zijlstra wrote:
> >>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> >>>> On 11/07/22 14:41, Peter Zijlstra wrote:
> >>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
> >> (B) *** Where does util_est_faster help exactly? ***
> >>
> >> It turns out that the score improvement comes from the more aggressive
> >> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
> >> -> effective_cpu_util(..., cpu_util_cfs(), ...).
> >>
> >> At the beginning of an episode (e.g. beginning of an image list view
> >> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
> >> frequency') of the Android Graphics Pipeline (AGP) start to run, the
> >> CPU Operating Performance Point (OPP) is often so low that those tasks
> >> run more like 10/16ms which let the test application count a lot of
> >> Jankframes at those moments.
> >
> > I don't see how util_est_faster can help this 1ms task here ? It's
> > most probably never be preempted during this 1ms. For such an Android
>
> It's 1/16ms at max CPU frequency and on a big CPU. Could be a longer
> runtime with min CPU frequency at little CPU. I see runtime up to 10ms
> at the beginning of a test episode.
>
> Like I mentioned below, it could also be that the tasks have more work
> to do at the beginning. It's easy to spot using Google's perfetto and
> those moments also correlate with the occurrence of jankframes. I'm not
> yet sure how much this has to do with the perfetto instrumentation though.
>
> But you're right, on top of that, there is preemption (e.g. of the UI
> thread) by other threads (render thread, involved binder threads,
> surfaceflinger, etc.) going on. So the UI thread could be
> running+runnable for > 20ms, again marked as a jankframe.
>
> > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > a better solution ?
>
> Yes, it has. I'm not sure how feasible this is to do for all tasks
> involved. I'm thinking about the Binder threads here for instance.
Yes, that can probably not help for all threads but some system
threads like surfaceflinger and graphic composer should probably
benefit from min uclamp
>
> [...]
>
> >> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
> >> util when periodic tasks have a longer runtime compared to when they reach
> >> steady state.
> >>
> >> The result is very similar to PELT halflife reduction. The advantage is
> >> that 'util_est_faster' is only activated selectively when the runtime of
> >> the current task in its current activation is long enough to create this
> >> CPU util boost.
> >
> > IIUC how util_est_faster works, it removes the waiting time when
> > sharing cpu time with other tasks. So as long as there is no (runnable
> > but not running time), the result is the same as current util_est.
> > util_est_faster makes a difference only when the task alternates
> > between runnable and running slices.
> > Have you considered using runnable_avg metrics in the increase of cpu
> > freq ? This takes into the runnable slice and not only the running
> > time and increase faster than util_avg when tasks compete for the same
> > CPU
>
> Good idea! No, I haven't.
>
> I just glanced over the code, there shouldn't be an advantage in terms
> of more recent update between `curr->sum_exec_runtime` and
> update_load_avg(cfs_rq) even in the taskgroup case.
>
> Per-task view:
>
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb
>
>
> All tests ran 10 iterations of all Jankbench sub-tests. (Reran the
> `max_util_scaled_util_est_faster_rbl_freq` once with very similar
> results. Just to make sure the results are somehow correct).
>
> Max_frame_duration:
> +------------------------------------------+------------+
> | kernel | value |
> +------------------------------------------+------------+
> | base-a30b17f016b0 | 147.571352 |
> | pelt-hl-m2 | 119.416351 |
> | pelt-hl-m4 | 96.473412 |
> | scaled_util_est_faster_freq | 126.646506 |
> | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
> +------------------------------------------+------------+
>
> Mean_frame_duration:
> +------------------------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +------------------------------------------+-------+-----------+
> | base-a30b17f016b0 | 14.7 | 0.0% |
> | pelt-hl-m2 | 13.6 | -7.5% |
> | pelt-hl-m4 | 13.0 | -11.68% |
> | scaled_util_est_faster_freq | 13.7 | -6.81% |
> | max_util_scaled_util_est_faster_rbl_freq | 12.1 | -17.85% |
> +------------------------------------------+-------+-----------+
>
> Jank percentage (Jank deadline 16ms):
> +------------------------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +------------------------------------------+-------+-----------+
> | base-a30b17f016b0 | 1.8 | 0.0% |
> | pelt-hl-m2 | 1.8 | -4.91% |
> | pelt-hl-m4 | 1.2 | -36.61% |
> | scaled_util_est_faster_freq | 1.3 | -27.63% |
> | max_util_scaled_util_est_faster_rbl_freq | 0.8 | -54.86% |
> +------------------------------------------+-------+-----------+
>
> Power usage [mW] (total - all CPUs):
> +------------------------------------------+-------+-----------+
> | kernel | value | perc_diff |
> +------------------------------------------+-------+-----------+
> | base-a30b17f016b0 | 144.4 | 0.0% |
> | pelt-hl-m2 | 141.6 | -1.97% |
> | pelt-hl-m4 | 163.2 | 12.99% |
> | scaled_util_est_faster_freq | 132.3 | -8.41% |
> | max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67% |
> +------------------------------------------+-------+-----------+
>
> There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
> `Jank percentage` and `Power usage` are better.
The max frame duration is interesting. Could it be the very 1st frame
of the test ?
It's interesting that it's even worse than baseline whereas it should
take the max of baseline and runnable_avg
>
> So maybe DVFS boosting in preempt-scenarios is really the thing here to
> further improve the Android Graphics Pipeline.
>
> I ran the same test (boosting only for DVFS requests) with:
>
> -->8--
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index dbc56e8b85f9..7a4bf38f2920 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
> READ_ONCE(cfs_rq->avg.util_est.enqueued));
> }
>
> + util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
> +
> return min(util, capacity_orig_of(cpu));
>
> Thanks!
>
> -- Dietmar
>
>
>
>
>
>
On Mon, 20 Feb 2023 at 14:54, Vincent Guittot
<[email protected]> wrote:
>
> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <[email protected]> wrote:
> >
> > On 09/02/2023 17:16, Vincent Guittot wrote:
> > > On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <[email protected]> wrote:
> > >>
> > >> On 09/11/2022 16:49, Peter Zijlstra wrote:
> > >>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> > >>>> On 11/07/22 14:41, Peter Zijlstra wrote:
> > >>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> >
> > [...]
> >
> > >> (B) *** Where does util_est_faster help exactly? ***
> > >>
> > >> It turns out that the score improvement comes from the more aggressive
> > >> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
> > >> -> effective_cpu_util(..., cpu_util_cfs(), ...).
> > >>
> > >> At the beginning of an episode (e.g. beginning of an image list view
> > >> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
> > >> frequency') of the Android Graphics Pipeline (AGP) start to run, the
> > >> CPU Operating Performance Point (OPP) is often so low that those tasks
> > >> run more like 10/16ms which let the test application count a lot of
> > >> Jankframes at those moments.
> > >
> > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > most probably never be preempted during this 1ms. For such an Android
> >
> > It's 1/16ms at max CPU frequency and on a big CPU. Could be a longer
> > runtime with min CPU frequency at little CPU. I see runtime up to 10ms
> > at the beginning of a test episode.
> >
> > Like I mentioned below, it could also be that the tasks have more work
> > to do at the beginning. It's easy to spot using Google's perfetto and
> > those moments also correlate with the occurrence of jankframes. I'm not
> > yet sure how much this has to do with the perfetto instrumentation though.
> >
> > But you're right, on top of that, there is preemption (e.g. of the UI
> > thread) by other threads (render thread, involved binder threads,
> > surfaceflinger, etc.) going on. So the UI thread could be
> > running+runnable for > 20ms, again marked as a jankframe.
> >
> > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > a better solution ?
> >
> > Yes, it has. I'm not sure how feasible this is to do for all tasks
> > involved. I'm thinking about the Binder threads here for instance.
>
> Yes, that can probably not help for all threads but some system
> threads like surfaceflinger and graphic composer should probably
> benefit from min uclamp
>
> >
> > [...]
> >
> > >> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
> > >> util when periodic tasks have a longer runtime compared to when they reach
> > >> steady state.
> > >>
> > >> The result is very similar to PELT halflife reduction. The advantage is
> > >> that 'util_est_faster' is only activated selectively when the runtime of
> > >> the current task in its current activation is long enough to create this
> > >> CPU util boost.
> > >
> > > IIUC how util_est_faster works, it removes the waiting time when
> > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > but not running time), the result is the same as current util_est.
> > > util_est_faster makes a difference only when the task alternates
> > > between runnable and running slices.
> > > Have you considered using runnable_avg metrics in the increase of cpu
> > > freq ? This takes into the runnable slice and not only the running
> > > time and increase faster than util_avg when tasks compete for the same
> > > CPU
> >
> > Good idea! No, I haven't.
> >
> > I just glanced over the code, there shouldn't be an advantage in terms
> > of more recent update between `curr->sum_exec_runtime` and
> > update_load_avg(cfs_rq) even in the taskgroup case.
> >
> > Per-task view:
> >
> > https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb
> >
> >
> > All tests ran 10 iterations of all Jankbench sub-tests. (Reran the
> > `max_util_scaled_util_est_faster_rbl_freq` once with very similar
> > results. Just to make sure the results are somehow correct).
> >
> > Max_frame_duration:
> > +------------------------------------------+------------+
> > | kernel | value |
> > +------------------------------------------+------------+
> > | base-a30b17f016b0 | 147.571352 |
> > | pelt-hl-m2 | 119.416351 |
> > | pelt-hl-m4 | 96.473412 |
> > | scaled_util_est_faster_freq | 126.646506 |
> > | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
> > +------------------------------------------+------------+
> >
> > Mean_frame_duration:
> > +------------------------------------------+-------+-----------+
> > | kernel | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > | base-a30b17f016b0 | 14.7 | 0.0% |
> > | pelt-hl-m2 | 13.6 | -7.5% |
> > | pelt-hl-m4 | 13.0 | -11.68% |
> > | scaled_util_est_faster_freq | 13.7 | -6.81% |
> > | max_util_scaled_util_est_faster_rbl_freq | 12.1 | -17.85% |
> > +------------------------------------------+-------+-----------+
> >
> > Jank percentage (Jank deadline 16ms):
> > +------------------------------------------+-------+-----------+
> > | kernel | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > | base-a30b17f016b0 | 1.8 | 0.0% |
> > | pelt-hl-m2 | 1.8 | -4.91% |
> > | pelt-hl-m4 | 1.2 | -36.61% |
> > | scaled_util_est_faster_freq | 1.3 | -27.63% |
> > | max_util_scaled_util_est_faster_rbl_freq | 0.8 | -54.86% |
> > +------------------------------------------+-------+-----------+
> >
> > Power usage [mW] (total - all CPUs):
> > +------------------------------------------+-------+-----------+
> > | kernel | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > | base-a30b17f016b0 | 144.4 | 0.0% |
> > | pelt-hl-m2 | 141.6 | -1.97% |
> > | pelt-hl-m4 | 163.2 | 12.99% |
> > | scaled_util_est_faster_freq | 132.3 | -8.41% |
> > | max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67% |
> > +------------------------------------------+-------+-----------+
> >
> > There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
> > `Jank percentage` and `Power usage` are better.
>
> The max frame duration is interesting. Could it be the very 1st frame
> of the test ?
> It's interesting that it's even worse than baseline whereas it should
> take the max of baseline and runnable_avg
>
> >
> > So maybe DVFS boosting in preempt-scenarios is really the thing here to
> > further improve the Android Graphics Pipeline.
> >
> > I ran the same test (boosting only for DVFS requests) with:
> >
> > -->8--
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index dbc56e8b85f9..7a4bf38f2920 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
> > READ_ONCE(cfs_rq->avg.util_est.enqueued));
> > }
> >
> > + util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
> > +
Another reason why it gives better results could be that
cpu_util_cfs() is not only used for DVFS selection but also to track
the CPU utilization in load balance and EAS, so the CPU will be seen
as overloaded sooner and tasks will be spread around when there is
contention.
Could you try to take cfs_rq->avg.runnable_avg into account only when
selecting the frequency (a rough sketch of this is below)?
That being said, I can see some places in load balance where
cfs_rq->avg.runnable_avg could give some benefit, like in
find_busiest_queue(), where it could be better to take contention into
account when selecting the busiest queue.
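A minimal sketch of that, assuming a separate wrapper (name made up)
which would only be called from sugov_get_util(), so load balance and
EAS keep seeing the unboosted cpu_util_cfs():

static inline unsigned long cpu_util_cfs_dvfs(int cpu)
{
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
	unsigned long util = cpu_util_cfs(cpu);

	/* contention-aware boost restricted to cpufreq requests */
	util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));

	return min(util, capacity_orig_of(cpu));
}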
> > return min(util, capacity_orig_of(cpu));
> >
> > Thanks!
> >
> > -- Dietmar
> >
> >
> >
> >
> >
> >
On 20/02/2023 14:54, Vincent Guittot wrote:
> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 09/02/2023 17:16, Vincent Guittot wrote:
>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 09/11/2022 16:49, Peter Zijlstra wrote:
>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
[...]
>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and
>>> a better solution ?
>>
>> Yes, it has. I'm not sure how feasible this is to do for all tasks
>> involved. I'm thinking about the Binder threads here for instance.
>
> Yes, that can probably not help for all threads but some system
> threads like surfaceflinger and graphic composer should probably
> benefit from min uclamp
Yes, and it looks like the Android version I'm using,
SQ1D.220205.004 (Feb '22) (automatic system updates turned off), is
already using uclamp_min != 0 for tasks like the UI thread. It's not one
particular value but different values from [0 .. 512] over the runtime
of a Jankbench iteration. I'll have to take a closer look.
[...]
>> Max_frame_duration:
>> +------------------------------------------+------------+
>> | kernel | value |
>> +------------------------------------------+------------+
>> | base-a30b17f016b0 | 147.571352 |
>> | pelt-hl-m2 | 119.416351 |
>> | pelt-hl-m4 | 96.473412 |
>> | scaled_util_est_faster_freq | 126.646506 |
>> | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
>> +------------------------------------------+------------+
>>
>> Mean_frame_duration:
>> +------------------------------------------+-------+-----------+
>> | kernel | value | perc_diff |
>> +------------------------------------------+-------+-----------+
>> | base-a30b17f016b0 | 14.7 | 0.0% |
>> | pelt-hl-m2 | 13.6 | -7.5% |
>> | pelt-hl-m4 | 13.0 | -11.68% |
>> | scaled_util_est_faster_freq | 13.7 | -6.81% |
>> | max_util_scaled_util_est_faster_rbl_freq | 12.1 | -17.85% |
>> +------------------------------------------+-------+-----------+
>>
>> Jank percentage (Jank deadline 16ms):
>> +------------------------------------------+-------+-----------+
>> | kernel | value | perc_diff |
>> +------------------------------------------+-------+-----------+
>> | base-a30b17f016b0 | 1.8 | 0.0% |
>> | pelt-hl-m2 | 1.8 | -4.91% |
>> | pelt-hl-m4 | 1.2 | -36.61% |
>> | scaled_util_est_faster_freq | 1.3 | -27.63% |
>> | max_util_scaled_util_est_faster_rbl_freq | 0.8 | -54.86% |
>> +------------------------------------------+-------+-----------+
>>
>> Power usage [mW] (total - all CPUs):
>> +------------------------------------------+-------+-----------+
>> | kernel | value | perc_diff |
>> +------------------------------------------+-------+-----------+
>> | base-a30b17f016b0 | 144.4 | 0.0% |
>> | pelt-hl-m2 | 141.6 | -1.97% |
>> | pelt-hl-m4 | 163.2 | 12.99% |
>> | scaled_util_est_faster_freq | 132.3 | -8.41% |
>> | max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67% |
>> +------------------------------------------+-------+-----------+
>>
>> There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
>> `Jank percentage` and `Power usage` are better.
>
> The max frame duration is interesting. Could it be the very 1st frame
> of the test ?
> It's interesting that it's even worse than baseline whereas it should
> take the max of baseline and runnable_avg
Since you asked in the following email: I just used the boosting for CPU
frequency selection (from sugov_get_util()). I added the `_freq`
suffix in the kernel name to indicate this.
I don't have any helpful `ftrace` or `perfetto` data for these test runs
though.
That's why I ran another iteration with perfetto on
`max_util_scaled_util_est_faster_rbl_freq`.
`Max frame duration` = 121ms (< 158ms but this was over 10 iterations)
happened at the beginning of the 3/8 `List View Fling` episode.
The UI thread (com.android.benchmark) runs on CPU1. Just before the
start of this episode the CPU freq is 0.3GHz. It takes 43ms for the CPU
freq to go up to 1.1GHz.
oriole:/sys # cat devices/system/cpu/cpu1/cpu_capacity
124
oriole:/sys # cat devices/system/cpu/cpu1/cpufreq/scaling_available_frequencies
300000 574000 738000 930000 1098000 1197000 1328000 1401000 1598000
1704000 1803000
So the combination of little CPU and low CPU frequency is the reason
why. But I can't see how using `max(max(util_avg, util_est.enq),
rbl_avg)` can make `max frame duration` worse?
I don't understand how asking for higher CPU frequencies under
contention would favor the UI thread being scheduled on little CPUs at
the beginning of an episode.
Also the particular uclamp_min settings of the runnable tasks at this
moment can have an influence on this `max frame duration` value.
[...]
On 21/02/2023 10:29, Vincent Guittot wrote:
> On Mon, 20 Feb 2023 at 14:54, Vincent Guittot
> <[email protected]> wrote:
>>
>> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 09/02/2023 17:16, Vincent Guittot wrote:
>>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 09/11/2022 16:49, Peter Zijlstra wrote:
>>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
[...]
>>> I ran the same test (boosting only for DVFS requests) with:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ *
>>>
>>> -->8--
>>>
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index dbc56e8b85f9..7a4bf38f2920 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
>>> READ_ONCE(cfs_rq->avg.util_est.enqueued));
>>> }
>>>
>>> + util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
>>> +
>
> Another reason why it gives better results could be that
> cpu_util_cfs() is not only used for DVFS selection but also to track
> the cpu utilization in load balance and EAS so the cpu will be faster
> seen as overloaded and tasks will be spread around when there are
> contentions.
>
> Could you try to take cfs_rq->avg.runnable_avg into account only when
> selecting frequency ?
I actually did exactly this. (* but not shown in the code snippet).
I just used the boosting for CPU frequency selection (from
sugov_get_util()). I added the `_freq` suffix in the kernel name to
indicate this.
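For illustration, a minimal sketch of what such a DVFS-only boost could
look like (hypothetical helper, not the exact patch that was tested;
only the schedutil path would call it, while load balance and EAS keep
using the unboosted cpu_util_cfs()):

static inline unsigned long cpu_util_cfs_dvfs(int cpu)
{
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
	unsigned long util = cpu_util_cfs(cpu);

	/* Contention boost: runnable_avg >= util_avg when tasks wait. */
	util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));

	return min(util, capacity_orig_of(cpu));
}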
> That being said I can see some place in load balance where
> cfs_rq->avg.runnable_avg could give some benefits like in
> find_busiest_queue() where it could be better to take into account the
> contention when selecting the busiest queue
Could be. Looks like so far we only use it in group_has_capacity(),
group_is_overloaded() and for NUMA.
[...]
On 02/09/23 17:16, Vincent Guittot wrote:
> I don't see how util_est_faster can help this 1ms task here ? It's
> most probably never be preempted during this 1ms. For such an Android
> Graphics Pipeline short task, hasn't uclamp_min been designed for and
> a better solution ?
uclamp_min is being used in UI and helping there. But your mileage might vary
with adoption still.
The major motivation behind this is to help things like gaming as the original
thread started. It can help UI and other use cases too. Android framework has
a lot of context on the type of workload that can help it make a decision when
this helps. And OEMs can have the chance to tune and apply based on the
characteristics of their device.
> IIUC how util_est_faster works, it removes the waiting time when
> sharing cpu time with other tasks. So as long as there is no (runnable
> but not running time), the result is the same as current util_est.
> util_est_faster makes a difference only when the task alternates
> between runnable and running slices.
> Have you considered using runnable_avg metrics in the increase of cpu
> freq ? This takes into the runnable slice and not only the running
> time and increase faster than util_avg when tasks compete for the same
> CPU
Just to understand why we're heading in this direction now.
AFAIU the desired outcome is to have faster rampup time (and on HMP faster up
migration), both of which are tied to the utilization signal.
Wouldn't making the util response time faster help not just for rampup, but
for rampdown too?
If we improve util response time, couldn't this mean we can remove util_est or
am I missing something?
Currently we have util response which is tweaked by util_est and then that is
tweaked further by schedutil with that 25% margin when mapping util to
frequency.
I think if we can allow improving general util response time by tweaking PELT
HALFLIFE we can potentially remove util_est and potentially that magic 25%
margin too.
Why the approach of further tweaking util_est is better?
Recently phoronix reported that schedutil behavior is suboptimal and I wonder
if the response time is contributing to that
https://www.phoronix.com/review/schedutil-quirky-2023
Cheers
--
Qais Yousef
On Wed, 22 Feb 2023 at 21:29, Dietmar Eggemann <[email protected]> wrote:
>
> On 21/02/2023 10:29, Vincent Guittot wrote:
> > On Mon, 20 Feb 2023 at 14:54, Vincent Guittot
> > <[email protected]> wrote:
> >>
> >> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <[email protected]> wrote:
> >>>
> >>> On 09/02/2023 17:16, Vincent Guittot wrote:
> >>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <[email protected]> wrote:
> >>>>>
> >>>>> On 09/11/2022 16:49, Peter Zijlstra wrote:
> >>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> >>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
> >>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
> >>> I ran the same test (boosting only for DVFS requests) with:
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ *
> >>>
> >>> -->8--
> >>>
> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>> index dbc56e8b85f9..7a4bf38f2920 100644
> >>> --- a/kernel/sched/sched.h
> >>> +++ b/kernel/sched/sched.h
> >>> @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
> >>> READ_ONCE(cfs_rq->avg.util_est.enqueued));
> >>> }
> >>>
> >>> + util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
> >>> +
> >
> > Another reason why it gives better results could be that
> > cpu_util_cfs() is not only used for DVFS selection but also to track
> > the cpu utilization in load balance and EAS so the cpu will be faster
> > seen as overloaded and tasks will be spread around when there are
> > contentions.
> >
> > Could you try to take cfs_rq->avg.runnable_avg into account only when
> > selecting frequency ?
>
> I actually did exactly this. (* but not shown in the code snippet).
> I just used the boosting for CPU frequency selection (from
> sugov_get_util()). I added the the `_freq` suffix in the kernel name to
> indicate this.
Ok. So the improvement that you are seeing is really related to
better freq selection
>
> > That being said I can see some place in load balance where
> > cfs_rq->avg.runnable_avg could give some benefits like in
> > find_busiest_queue() where it could be better to take into account the
> > contention when selecting the busiest queue
>
> Could be. Looks like so far we only use it in group_has_capacity(),
> group_is_overloaded() and for NUMA.
I think it could be interesting to use runnable_avg in
find_busiest_queue() for the migrate_util case to select the rq with
the highest contention, as an example.
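A very rough sketch of that idea (illustrative only, not a tested
patch) would be to base the migrate_util comparison on the runnable
average of the rq instead of its utilization:

		case migrate_util:
			/*
			 * Prefer the rq with the highest contention,
			 * i.e. where runnable_avg exceeds utilization
			 * because tasks are waiting for the CPU.
			 */
			util = READ_ONCE(rq->cfs.avg.runnable_avg);

			if (busiest_util < util) {
				busiest_util = util;
				busiest = rq;
			}
			break;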
>
> [...]
On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
>
> On 02/09/23 17:16, Vincent Guittot wrote:
>
> > I don't see how util_est_faster can help this 1ms task here ? It's
> > most probably never be preempted during this 1ms. For such an Android
> > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > a better solution ?
>
> uclamp_min is being used in UI and helping there. But your mileage might vary
> with adoption still.
>
> The major motivation behind this is to help things like gaming as the original
> thread started. It can help UI and other use cases too. Android framework has
> a lot of context on the type of workload that can help it make a decision when
> this helps. And OEMs can have the chance to tune and apply based on the
> characteristics of their device.
>
> > IIUC how util_est_faster works, it removes the waiting time when
> > sharing cpu time with other tasks. So as long as there is no (runnable
> > but not running time), the result is the same as current util_est.
> > util_est_faster makes a difference only when the task alternates
> > between runnable and running slices.
> > Have you considered using runnable_avg metrics in the increase of cpu
> > freq ? This takes into the runnable slice and not only the running
> > time and increase faster than util_avg when tasks compete for the same
> > CPU
>
> Just to understand why we're heading into this direction now.
>
> AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> migration) which both are tied to utilization signal.
>
> Wouldn't make the util response time faster help not just for rampup, but
> rampdown too?
>
> If we improve util response time, couldn't this mean we can remove util_est or
> am I missing something?
not sure because you still have a ramping step whereas util_est
directly gives you the final tager
>
> Currently we have util response which is tweaked by util_est and then that is
> tweaked further by schedutil with that 25% margin when maping util to
> frequency.
the 25% is not related to the ramping time but to the fact that you
always need some margin to cover unexpected events and estimation
error
>
> I think if we can allow improving general util response time by tweaking PELT
> HALFLIFE we can potentially remove util_est and potentially that magic 25%
> margin too.
>
> Why the approach of further tweaking util_est is better?
note that in this case it doesn't really tweak util_est but Dietmar
has taken into account runnable_avg to increase the freq in case of
contention
Also IIUC Dietmar's results, the problem seems more linked to the
selection of a higher freq than increasing the utilization;
runnable_avg tests give similar perf results to a shorter half life
and better power consumption.
>
> Recently phoronix reported that schedutil behavior is suboptimal and I wonder
> if the response time is contributing to that
>
> https://www.phoronix.com/review/schedutil-quirky-2023
>
>
> Cheers
>
> --
> Qais Yousef
On 03/01/23 11:39, Vincent Guittot wrote:
> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
> >
> > On 02/09/23 17:16, Vincent Guittot wrote:
> >
> > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > most probably never be preempted during this 1ms. For such an Android
> > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > a better solution ?
> >
> > uclamp_min is being used in UI and helping there. But your mileage might vary
> > with adoption still.
> >
> > The major motivation behind this is to help things like gaming as the original
> > thread started. It can help UI and other use cases too. Android framework has
> > a lot of context on the type of workload that can help it make a decision when
> > this helps. And OEMs can have the chance to tune and apply based on the
> > characteristics of their device.
> >
> > > IIUC how util_est_faster works, it removes the waiting time when
> > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > but not running time), the result is the same as current util_est.
> > > util_est_faster makes a difference only when the task alternates
> > > between runnable and running slices.
> > > Have you considered using runnable_avg metrics in the increase of cpu
> > > freq ? This takes into the runnable slice and not only the running
> > > time and increase faster than util_avg when tasks compete for the same
> > > CPU
> >
> > Just to understand why we're heading into this direction now.
> >
> > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > migration) which both are tied to utilization signal.
> >
> > Wouldn't make the util response time faster help not just for rampup, but
> > rampdown too?
> >
> > If we improve util response time, couldn't this mean we can remove util_est or
> > am I missing something?
>
> not sure because you still have a ramping step whereas util_est
> directly gives you the final tager
I didn't get you. tager?
>
> >
> > Currently we have util response which is tweaked by util_est and then that is
> > tweaked further by schedutil with that 25% margin when maping util to
> > frequency.
>
> the 25% is not related to the ramping time but to the fact that you
> always need some margin to cover unexpected events and estimation
> error
At the moment we have
util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
I think we have too many transformations before deciding the current
frequencies. Which makes it hard to tweak the system response.
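For reference, that 25% margin step boils down to something like the
following (a sketch of the mainline map_util_perf()/map_util_freq()
headroom; the exact helper name and location have changed over
versions):

static inline unsigned long map_util_perf(unsigned long util)
{
	/* Request ~1.25 * util so the CPU runs ahead of current demand. */
	return util + (util >> 2);
}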
>
> >
> > I think if we can allow improving general util response time by tweaking PELT
> > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > margin too.
> >
> > Why the approach of further tweaking util_est is better?
>
> note that in this case it doesn't really tweak util_est but Dietmar
> has taken into account runnable_avg to increase the freq in case of
> contention
>
> Also IIUC Dietmar's results, the problem seems more linked to the
> selection of a higher freq than increasing the utilization;
> runnable_avg tests give similar perf results than shorter half life
> and better power consumption.
Does it ramp down faster too?
Thanks
--
Qais Yousef
>
> >
> > Recently phoronix reported that schedutil behavior is suboptimal and I wonder
> > if the response time is contributing to that
> >
> > https://www.phoronix.com/review/schedutil-quirky-2023
> >
> >
> > Cheers
> >
> > --
> > Qais Yousef
On Wed, 1 Mar 2023 at 18:25, Qais Yousef <[email protected]> wrote:
>
> On 03/01/23 11:39, Vincent Guittot wrote:
> > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
> > >
> > > On 02/09/23 17:16, Vincent Guittot wrote:
> > >
> > > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > > most probably never be preempted during this 1ms. For such an Android
> > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > > a better solution ?
> > >
> > > uclamp_min is being used in UI and helping there. But your mileage might vary
> > > with adoption still.
> > >
> > > The major motivation behind this is to help things like gaming as the original
> > > thread started. It can help UI and other use cases too. Android framework has
> > > a lot of context on the type of workload that can help it make a decision when
> > > this helps. And OEMs can have the chance to tune and apply based on the
> > > characteristics of their device.
> > >
> > > > IIUC how util_est_faster works, it removes the waiting time when
> > > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > > but not running time), the result is the same as current util_est.
> > > > util_est_faster makes a difference only when the task alternates
> > > > between runnable and running slices.
> > > > Have you considered using runnable_avg metrics in the increase of cpu
> > > > freq ? This takes into the runnable slice and not only the running
> > > > time and increase faster than util_avg when tasks compete for the same
> > > > CPU
> > >
> > > Just to understand why we're heading into this direction now.
> > >
> > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > > migration) which both are tied to utilization signal.
> > >
> > > Wouldn't make the util response time faster help not just for rampup, but
> > > rampdown too?
> > >
> > > If we improve util response time, couldn't this mean we can remove util_est or
> > > am I missing something?
> >
> > not sure because you still have a ramping step whereas util_est
> > directly gives you the final tager
>
> I didn't get you. tager?
target
>
> >
> > >
> > > Currently we have util response which is tweaked by util_est and then that is
> > > tweaked further by schedutil with that 25% margin when maping util to
> > > frequency.
> >
> > the 25% is not related to the ramping time but to the fact that you
> > always need some margin to cover unexpected events and estimation
> > error
>
> At the moment we have
>
> util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
>
> I think we have too many transformations before deciding the current
> frequencies. Which makes it hard to tweak the system response.
What is proposed here with runnable_avg is more to take a new input
when selecting a frequency: the level of contention on the cpu. But
this is not used to modify the utilization seen by the scheduler
>
> >
> > >
> > > I think if we can allow improving general util response time by tweaking PELT
> > > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > > margin too.
> > >
> > > Why the approach of further tweaking util_est is better?
> >
> > note that in this case it doesn't really tweak util_est but Dietmar
> > has taken into account runnable_avg to increase the freq in case of
> > contention
> >
> > Also IIUC Dietmar's results, the problem seems more linked to the
> > selection of a higher freq than increasing the utilization;
> > runnable_avg tests give similar perf results than shorter half life
> > and better power consumption.
>
> Does it ramp down faster too?
I don't think so.
To be honest, I'm not convinced that modifying the half life is the
right way to solve this. If it was only a matter of the half life not
being suitable for a system, the half life would be set once at boot
and people would not ask to modify it at run time.
>
>
> Thanks
>
> --
> Qais Yousef
>
> >
> > >
> > > Recently phoronix reported that schedutil behavior is suboptimal and I wonder
> > > if the response time is contributing to that
> > >
> > > https://www.phoronix.com/review/schedutil-quirky-2023
> > >
> > >
> > > Cheers
> > >
> > > --
> > > Qais Yousef
On 22/02/2023 21:13, Dietmar Eggemann wrote:
> On 20/02/2023 14:54, Vincent Guittot wrote:
>> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 09/02/2023 17:16, Vincent Guittot wrote:
>>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 09/11/2022 16:49, Peter Zijlstra wrote:
>>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
>>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and
>>>> a better solution ?
>>>
>>> Yes, it has. I'm not sure how feasible this is to do for all tasks
>>> involved. I'm thinking about the Binder threads here for instance.
>>
>> Yes, that can probably not help for all threads but some system
>> threads like surfaceflinger and graphic composer should probably
>> benefit from min uclamp
>
> Yes, and it looks like that the Android version I'm using
> SQ1D.220205.004 (Feb '22) (automatic system updates turned off) is
> already using uclamp_min != 0 for tasks like UI thread. It's not one
> particular value but different values from [0 .. 512] over the runtime
> of a Jankbench iteration. I have to have a closer look.
I did more Jankbench and Speedometer testing especially to understand
the influence of the already used uclamp_min boosting (Android Dynamic
Performance Framework (ADPF) `CPU performance hints` feature:
https://developer.android.com/games/optimize/adpf#cpu-hints) for some
App tasks.
The following notebooks show which of the App tasks are uclamp_min
boosted (their diagram title carries an additional 'uclamp_min_boost'
tag) and how uclamp_min boost relates to the other boost values:
This is probably not a fixed mapping and could change between test runs.
I assume that Android will issue performance hints in the form of
uclamp_min boosting when it detects certain scenarios like a specific
jankframe threshold or something similar.
https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/jankbench_uclamp_min_boost.ipynb
https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/speedometer_uclamp_min_boost.ipynb
`base` has changed compared to `base-a30b17f016b0`. It now also
contains: e5ed0550c04c - sched/fair: unlink misfit task from cpu
overutilized (2023-02-11 Vincent Guittot)
Former `max_util_scaled_util_est_faster_rbl_freq` has been renamed to
`cpu_rbl_freq`.
Jankbench:
Max_frame_duration:
+-----------------------------+------------+
| kernel | value |
+-----------------------------+------------+
| base | 156.299159 |
| base_wo_uclamp | 171.063764 | uclamp disabled*
| pelt-hl-m2 | 126.190232 |
| pelt-hl-m4 | 100.865171 |
| scaled_util_est_faster_freq | 126.074194 |
| cpu_rbl_freq | 153.123089 |
+-----------------------------+------------+
* We still let Android set the uclamp_min values.
Just the uclamp setters are bypassed now.
Mean_frame_duration:
+-----------------------------+-------+-----------+
| kernel | value | perc_diff |
+-----------------------------+-------+-----------+
| base | 15.5 | 0.0% |
| base_wo_uclamp | 16.6 | 7.76% |
| pelt-hl-m2 | 14.9 | -3.27% |
| pelt-hl-m4 | 13.6 | -12.16% |
| scaled_util_est_faster_freq | 14.7 | -4.88% |
| cpu_rbl_freq | 12.2 | -20.84% |
+-----------------------------+-------+-----------+
Jank percentage (Jank deadline 16ms):
+-----------------------------+-------+-----------+
| kernel | value | perc_diff |
+-----------------------------+-------+-----------+
| base | 2.6 | 0.0% |
| base_wo_uclamp | 3.0 | 17.47% |
| pelt-hl-m2 | 2.0 | -23.33% |
| pelt-hl-m4 | 1.3 | -48.55% |
| scaled_util_est_faster_freq | 1.7 | -32.21% |
| cpu_rbl_freq | 0.7 | -71.36% |
+-----------------------------+-------+-----------+
Power usage [mW] (total - all CPUs):
+-----------------------------+-------+-----------+
| kernel | value | perc_diff |
+-----------------------------+-------+-----------+
| base | 141.1 | 0.0% |
| base_wo_uclamp | 116.6 | -17.4% |
| pelt-hl-m2 | 138.7 | -1.7% |
| pelt-hl-m4 | 156.5 | 10.87% |
| scaled_util_est_faster_freq | 147.6 | 4.57% |
| cpu_rbl_freq | 135.0 | -4.33% |
+-----------------------------+-------+-----------+
Speedometer:
Score:
+-----------------------------+-------+-----------+
| kernel | value | perc_diff |
+-----------------------------+-------+-----------+
| base | 108.4 | 0.0% |
| base_wo_uclamp | 95.2 | -12.17% |
| pelt-hl-m2 | 112.9 | 4.13% |
| scaled_util_est_faster_freq | 114.7 | 5.77% |
| cpu_rbl_freq | 127.7 | 17.75% |
+-----------------------------+-------+-----------+
Power usage [mW] (total - all CPUs):
+-----------------------------+--------+-----------+
| kernel | value | perc_diff |
+-----------------------------+--------+-----------+
| base | 2268.4 | 0.0% |
| base_wo_uclamp | 1789.5 | -21.11% |
| pelt-hl-m2 | 2386.5 | 5.21% |
| scaled_util_est_faster_freq | 2292.3 | 1.05% |
| cpu_rbl_freq | 2198.3 | -3.09% |
+-----------------------------+--------+-----------+
The explanation I have is that the `CPU performance hints` feature
tries to recreate the information about contention for a specific set of
tasks. Since there is also contention in which only non uclamp_min
boosted tasks are runnable, mechanisms like `util_est_faster` or
`cpu_runnable boosting` can help on top of what's already provided with
uclamp_min boosting from userspace.
[...]
On 02/03/2023 09:00, Vincent Guittot wrote:
> On Wed, 1 Mar 2023 at 18:25, Qais Yousef <[email protected]> wrote:
>>
>> On 03/01/23 11:39, Vincent Guittot wrote:
>>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
>>>>
>>>> On 02/09/23 17:16, Vincent Guittot wrote:
[...]
>>>> Just to understand why we're heading into this direction now.
>>>>
>>>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up
>>>> migration) which both are tied to utilization signal.
>>>>
>>>> Wouldn't make the util response time faster help not just for rampup, but
>>>> rampdown too?
>>>>
>>>> If we improve util response time, couldn't this mean we can remove util_est or
>>>> am I missing something?
>>>
>>> not sure because you still have a ramping step whereas util_est
>>> directly gives you the final tager
>>
>> I didn't get you. tager?
>
> target
uclamp_min boosting (ADPF's `CPU performance hints` feature) could
eclipse util_est but only if it's higher and only for those tasks
affected by the feature.
[...]
>>>> I think if we can allow improving general util response time by tweaking PELT
>>>> HALFLIFE we can potentially remove util_est and potentially that magic 25%
>>>> margin too.
>>>>
>>>> Why the approach of further tweaking util_est is better?
>>>
>>> note that in this case it doesn't really tweak util_est but Dietmar
>>> has taken into account runnable_avg to increase the freq in case of
>>> contention
>>>
>>> Also IIUC Dietmar's results, the problem seems more linked to the
>>> selection of a higher freq than increasing the utilization;
>>> runnable_avg tests give similar perf results than shorter half life
>>> and better power consumption.
>>
>> Does it ramp down faster too?
>
> I don't think so.
>
> To be honest, I'm not convinced that modifying the half time is the
> right way to solve this. If it was only a matter of half life not
> being suitable for a system, the halk life would be set once at boot
> and people would not ask to modify it at run time.
IMHO, what people don't like about PELT halflife mods is the fact that
all sched entities and every functionality based on PELT signals would
be affected, even though this might not be beneficial, or could even be
harmful, for system behaviour not covered by the specific benchmark
numbers shown.
That's why we wanted to figure out what the actual reason is for the
improvement of those Jankbench (or Speedometer resp. game FPS) numbers.
In this case we would be able to boost more selectively than PELT
halflife modding can do.
Util_est_faster (1) is an approach to boost only the CPU util signal,
depending on the current task's activation duration (the sum of the
task's running time). This time is multiplied by 2 when calculating the
fake PELT signal, which is then max-compared with the existing CPU util.
And the idea to max-compare CPU util and CPU runnable (2) is to help
tasks under contention. Android testing showed that contention very
often accompanies jankframe occurrences for example.
I only applied (1) and (2) to DVFS requests in my testing.
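To make (1) and (2) a bit more concrete, a rough pseudo-C sketch of the
DVFS-side aggregation (task_activation_runtime() and faux_util() are
placeholders for the runtime accounting and the runtime-to-PELT
conversion, not mainline functions):

static unsigned long dvfs_cpu_util(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long util = cpu_util_cfs(cpu);

	/* (1) util_est_faster: fake PELT util from 2 * curr task's runtime */
	util = max(util, faux_util(2 * task_activation_runtime(rq->curr)));

	/* (2) contention boost: runnable_avg grows when tasks have to wait */
	util = max(util, READ_ONCE(rq->cfs.avg.runnable_avg));

	return min(util, capacity_orig_of(cpu));
}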
[...]
On 03/02/23 09:00, Vincent Guittot wrote:
> On Wed, 1 Mar 2023 at 18:25, Qais Yousef <[email protected]> wrote:
> >
> > On 03/01/23 11:39, Vincent Guittot wrote:
> > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
> > > >
> > > > On 02/09/23 17:16, Vincent Guittot wrote:
> > > >
> > > > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > > > most probably never be preempted during this 1ms. For such an Android
> > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > > > a better solution ?
> > > >
> > > > uclamp_min is being used in UI and helping there. But your mileage might vary
> > > > with adoption still.
> > > >
> > > > The major motivation behind this is to help things like gaming as the original
> > > > thread started. It can help UI and other use cases too. Android framework has
> > > > a lot of context on the type of workload that can help it make a decision when
> > > > this helps. And OEMs can have the chance to tune and apply based on the
> > > > characteristics of their device.
> > > >
> > > > > IIUC how util_est_faster works, it removes the waiting time when
> > > > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > > > but not running time), the result is the same as current util_est.
> > > > > util_est_faster makes a difference only when the task alternates
> > > > > between runnable and running slices.
> > > > > Have you considered using runnable_avg metrics in the increase of cpu
> > > > > freq ? This takes into the runnable slice and not only the running
> > > > > time and increase faster than util_avg when tasks compete for the same
> > > > > CPU
> > > >
> > > > Just to understand why we're heading into this direction now.
> > > >
> > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > > > migration) which both are tied to utilization signal.
> > > >
> > > > Wouldn't make the util response time faster help not just for rampup, but
> > > > rampdown too?
> > > >
> > > > If we improve util response time, couldn't this mean we can remove util_est or
> > > > am I missing something?
> > >
> > > not sure because you still have a ramping step whereas util_est
> > > directly gives you the final tager
> >
> > I didn't get you. tager?
>
> target
It seems you're referring to the holding function of util_est? ie: keep the
util high to avoid 'spurious' decays?
Isn't this a duplication of the schedutil's filter which is also a holding
function to prevent rapid frequency changes?
FWIW, that schedutil filter does get tweaked a lot in android world. Many add
an additional down_filter to prevent this premature drop in freq (AFAICT).
Which tells me util_est is not delivering completely on that front in practice.
>
> >
> > >
> > > >
> > > > Currently we have util response which is tweaked by util_est and then that is
> > > > tweaked further by schedutil with that 25% margin when maping util to
> > > > frequency.
> > >
> > > the 25% is not related to the ramping time but to the fact that you
> > > always need some margin to cover unexpected events and estimation
> > > error
> >
> > At the moment we have
> >
> > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> >
> > I think we have too many transformations before deciding the current
> > frequencies. Which makes it hard to tweak the system response.
>
> What is proposed here with runnable_avg is more to take a new input
> when selecting a frequency: the level of contention on the cpu. But
What if there's no contention on the CPU and it's just a single task running
there that suddenly becomes always running for a number of frames?
> this is not used to modify the utilization seen by the scheduler
>
> >
> > >
> > > >
> > > > I think if we can allow improving general util response time by tweaking PELT
> > > > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > > > margin too.
> > > >
> > > > Why the approach of further tweaking util_est is better?
> > >
> > > note that in this case it doesn't really tweak util_est but Dietmar
> > > has taken into account runnable_avg to increase the freq in case of
> > > contention
> > >
> > > Also IIUC Dietmar's results, the problem seems more linked to the
> > > selection of a higher freq than increasing the utilization;
> > > runnable_avg tests give similar perf results than shorter half life
> > > and better power consumption.
> >
> > Does it ramp down faster too?
>
> I don't think so.
>
> To be honest, I'm not convinced that modifying the half time is the
> right way to solve this. If it was only a matter of half life not
> being suitable for a system, the halk life would be set once at boot
> and people would not ask to modify it at run time.
I'd like to understand more the reason behind these concerns. What is the
problem with modifying the halflife?
The way I see it, it is an important metric of how responsive the system is to
how loaded it is. Which drives a lot of important decisions.
32ms means the system needs approximately 200ms to detect an always running
task (from idle).
16ms halves it to 100ms. And 8ms halves it further to 50ms.
Or you can phrase it the opposite way, it takes 200ms to detect the system is
now idle from always busy state. etc.
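The numbers above roughly check out if the ramp-up is modelled as
1024 * (1 - 0.5^(t/halflife)), ignoring PELT's per-ms segmenting; a
small userspace check (illustrative only):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double hl[] = { 32.0, 16.0, 8.0 };

	for (int i = 0; i < 3; i++) {
		/* time for an always-running task to reach ~99% of max util */
		double t99 = hl[i] * (log(0.01) / log(0.5));
		printf("halflife %2.0f ms -> ~%3.0f ms to ~99%% of max\n",
		       hl[i], t99);
	}
	return 0;
}

which prints ~213ms, ~106ms and ~53ms respectively.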
Why is it bad for a sys admin to have the ability to adjust this response time
as they see fit?
What goes wrong?
AFAICS the two natural places to control the response time of the system are
the pelt halflife for overall system responsiveness, and the mapping function
in schedutil for more fine grained frequency response.
There are issues with current filtering mechanism in schedutil too:
1. It drops any requests during the filtering window. At CFS enqueue we
could end up with multiple calls to cpufreq_update_util(); or if we
do multiple consecutive enqueues. In a shared domain, there's a race
which cpu issues the updated freq request first. Which might not be
the best for the domain during this window.
2. Maybe it needs asymmetric values for up and down.
I could be naive, but I see util_est as something we should strive to remove to
be honest. I think there are too many moving cogs.
Thanks!
--
Qais Yousef
On Mon, 6 Mar 2023 at 20:11, Qais Yousef <[email protected]> wrote:
>
> On 03/02/23 09:00, Vincent Guittot wrote:
> > On Wed, 1 Mar 2023 at 18:25, Qais Yousef <[email protected]> wrote:
> > >
> > > On 03/01/23 11:39, Vincent Guittot wrote:
> > > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
> > > > >
> > > > > On 02/09/23 17:16, Vincent Guittot wrote:
> > > > >
> > > > > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > > > > most probably never be preempted during this 1ms. For such an Android
> > > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > > > > a better solution ?
> > > > >
> > > > > uclamp_min is being used in UI and helping there. But your mileage might vary
> > > > > with adoption still.
> > > > >
> > > > > The major motivation behind this is to help things like gaming as the original
> > > > > thread started. It can help UI and other use cases too. Android framework has
> > > > > a lot of context on the type of workload that can help it make a decision when
> > > > > this helps. And OEMs can have the chance to tune and apply based on the
> > > > > characteristics of their device.
> > > > >
> > > > > > IIUC how util_est_faster works, it removes the waiting time when
> > > > > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > > > > but not running time), the result is the same as current util_est.
> > > > > > util_est_faster makes a difference only when the task alternates
> > > > > > between runnable and running slices.
> > > > > > Have you considered using runnable_avg metrics in the increase of cpu
> > > > > > freq ? This takes into the runnable slice and not only the running
> > > > > > time and increase faster than util_avg when tasks compete for the same
> > > > > > CPU
> > > > >
> > > > > Just to understand why we're heading into this direction now.
> > > > >
> > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > > > > migration) which both are tied to utilization signal.
> > > > >
> > > > > Wouldn't make the util response time faster help not just for rampup, but
> > > > > rampdown too?
> > > > >
> > > > > If we improve util response time, couldn't this mean we can remove util_est or
> > > > > am I missing something?
> > > >
> > > > not sure because you still have a ramping step whereas util_est
> > > > directly gives you the final tager
> > >
> > > I didn't get you. tager?
> >
> > target
>
> It seems you're referring to the holding function of util_est? ie: keep the
> util high to avoid 'spurious' decays?
I mean whatever the half life, you will have to wait for the utilization
to increase.
>
> Isn't this a duplication of the schedutil's filter which is also a holding
> function to prevent rapid frequency changes?
util_est is used by scheduler to estimate the final utilization of the cfs
>
> FWIW, that schedutil filter does get tweaked a lot in android world. Many add
> an additional down_filter to prevent this premature drop in freq (AFAICT).
> Which tells me util_est is not delivering completely on that front in practice.
>
> >
> > >
> > > >
> > > > >
> > > > > Currently we have util response which is tweaked by util_est and then that is
> > > > > tweaked further by schedutil with that 25% margin when maping util to
> > > > > frequency.
> > > >
> > > > the 25% is not related to the ramping time but to the fact that you
> > > > always need some margin to cover unexpected events and estimation
> > > > error
> > >
> > > At the moment we have
> > >
> > > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> > >
> > > I think we have too many transformations before deciding the current
> > > frequencies. Which makes it hard to tweak the system response.
> >
> > What is proposed here with runnable_avg is more to take a new input
> > when selecting a frequency: the level of contention on the cpu. But
>
> What if there's no contention on the CPU and it's just a single task running
> there that suddenly becomes always running for a number of frames?
>
> > this is not used to modify the utilization seen by the scheduler
> >
> > >
> > > >
> > > > >
> > > > > I think if we can allow improving general util response time by tweaking PELT
> > > > > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > > > > margin too.
> > > > >
> > > > > Why the approach of further tweaking util_est is better?
> > > >
> > > > note that in this case it doesn't really tweak util_est but Dietmar
> > > > has taken into account runnable_avg to increase the freq in case of
> > > > contention
> > > >
> > > > Also IIUC Dietmar's results, the problem seems more linked to the
> > > > selection of a higher freq than increasing the utilization;
> > > > runnable_avg tests give similar perf results than shorter half life
> > > > and better power consumption.
> > >
> > > Does it ramp down faster too?
> >
> > I don't think so.
> >
> > To be honest, I'm not convinced that modifying the half time is the
> > right way to solve this. If it was only a matter of half life not
> > being suitable for a system, the halk life would be set once at boot
> > and people would not ask to modify it at run time.
>
> I'd like to understand more the reason behind these concerns. What is the
> problem with modifying the halflife?
I can somehow understand that some systems would like a different half
life than the current one because of the number of cpus, the pace of
the system... But this should be fixed at boot. The fact that people
need to dynamically change the half life tells me that even after
changing it they still don't get the correct utilization. And I
think that the problem is not really related (or at least not only) to
the correctness of utilization tracking but to a lack of taking other
input into account when selecting a frequency. And the contention
(runnable_avg) is a good input to take into account when selecting a
frequency because it reflects that some tasks are waiting to run on
the cpu.
>
> The way I see it it is an important metric of how responsive the system to how
> loaded it is. Which drives a lot of important decisions.
>
> 32ms means the system needs approximately 200ms to detect an always running
> task (from idle).
>
> 16ms halves it to 100ms. And 8ms halves it further to 50ms.
>
> Or you can phrase it the opposite way, it takes 200ms to detect the system is
> now idle from always busy state. etc.
>
> Why is it bad for a sys admin to have the ability to adjust this response time
> as they see fit?
because they will use it to bias the response of the system and abuse it
at runtime instead of identifying the root cause.
>
> What goes wrong?
>
> AFAICS the two natural places to control the response time of the system is
> pelt halflife for overall system responsiveness, and the mapping function in
> schedutil for more fine grained frequency response.
>
> There are issues with current filtering mechanism in schedutil too:
>
> 1. It drops any requests during the filtering window. At CFS enqueue we
> could end up with multiple calls to cpufreq_update_util(); or if we
> do multiple consecutive enqueues. In a shared domain, there's a race
> which cpu issues the updated freq request first. Which might not be
> the best for the domain during this window.
> 2. Maybe it needs asymmetric values for up and down.
>
> I could be naive, but I see util_est as something we should strive to remove to
> be honest. I think there are too many moving cogs.
>
>
> Thanks!
>
> --
> Qais Yousef
On 03/07/23 14:22, Vincent Guittot wrote:
> On Mon, 6 Mar 2023 at 20:11, Qais Yousef <[email protected]> wrote:
> >
> > On 03/02/23 09:00, Vincent Guittot wrote:
> > > On Wed, 1 Mar 2023 at 18:25, Qais Yousef <[email protected]> wrote:
> > > >
> > > > On 03/01/23 11:39, Vincent Guittot wrote:
> > > > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
> > > > > >
> > > > > > On 02/09/23 17:16, Vincent Guittot wrote:
> > > > > >
> > > > > > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > > > > > most probably never be preempted during this 1ms. For such an Android
> > > > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > > > > > a better solution ?
> > > > > >
> > > > > > uclamp_min is being used in UI and helping there. But your mileage might vary
> > > > > > with adoption still.
> > > > > >
> > > > > > The major motivation behind this is to help things like gaming as the original
> > > > > > thread started. It can help UI and other use cases too. Android framework has
> > > > > > a lot of context on the type of workload that can help it make a decision when
> > > > > > this helps. And OEMs can have the chance to tune and apply based on the
> > > > > > characteristics of their device.
> > > > > >
> > > > > > > IIUC how util_est_faster works, it removes the waiting time when
> > > > > > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > > > > > but not running time), the result is the same as current util_est.
> > > > > > > util_est_faster makes a difference only when the task alternates
> > > > > > > between runnable and running slices.
> > > > > > > Have you considered using runnable_avg metrics in the increase of cpu
> > > > > > > freq ? This takes into the runnable slice and not only the running
> > > > > > > time and increase faster than util_avg when tasks compete for the same
> > > > > > > CPU
> > > > > >
> > > > > > Just to understand why we're heading into this direction now.
> > > > > >
> > > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > > > > > migration) which both are tied to utilization signal.
> > > > > >
> > > > > > Wouldn't make the util response time faster help not just for rampup, but
> > > > > > rampdown too?
> > > > > >
> > > > > > If we improve util response time, couldn't this mean we can remove util_est or
> > > > > > am I missing something?
> > > > >
> > > > > not sure because you still have a ramping step whereas util_est
> > > > > directly gives you the final tager
> > > >
> > > > I didn't get you. tager?
> > >
> > > target
> >
> > It seems you're referring to the holding function of util_est? ie: keep the
> > util high to avoid 'spurious' decays?
>
> I mean whatever the half life, you will have to wait the utilization
> to increase.
Yes - which is the ramp up delay that is unacceptable in some cases and
seems to have been raised several times over the years.
>
> >
> > Isn't this a duplication of the schedutil's filter which is also a holding
> > function to prevent rapid frequency changes?
>
> util_est is used by scheduler to estimate the final utilization of the cfs
If I read the commit message that introduced it correctly, it talks
about ramp up delays - and issues with premature decaying for periodic
tasks.
So it is a mechanism to speed up util_avg response time. The same issue
we're trying to address again now.
>
> >
> > FWIW, that schedutil filter does get tweaked a lot in android world. Many add
> > an additional down_filter to prevent this premature drop in freq (AFAICT).
> > Which tells me util_est is not delivering completely on that front in practice.
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Currently we have util response which is tweaked by util_est and then that is
> > > > > > tweaked further by schedutil with that 25% margin when maping util to
> > > > > > frequency.
> > > > >
> > > > > the 25% is not related to the ramping time but to the fact that you
> > > > > always need some margin to cover unexpected events and estimation
> > > > > error
> > > >
> > > > At the moment we have
> > > >
> > > > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> > > >
> > > > I think we have too many transformations before deciding the current
> > > > frequencies. Which makes it hard to tweak the system response.
> > >
> > > What is proposed here with runnable_avg is more to take a new input
> > > when selecting a frequency: the level of contention on the cpu. But
> >
> > What if there's no contention on the CPU and it's just a single task running
> > there that suddenly becomes always running for a number of frames?
> >
> > > this is not used to modify the utilization seen by the scheduler
> > >
> > > >
> > > > >
> > > > > >
> > > > > > I think if we can allow improving general util response time by tweaking PELT
> > > > > > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > > > > > margin too.
> > > > > >
> > > > > > Why the approach of further tweaking util_est is better?
> > > > >
> > > > > note that in this case it doesn't really tweak util_est but Dietmar
> > > > > has taken into account runnable_avg to increase the freq in case of
> > > > > contention
> > > > >
> > > > > Also IIUC Dietmar's results, the problem seems more linked to the
> > > > > selection of a higher freq than increasing the utilization;
> > > > > runnable_avg tests give similar perf results than shorter half life
> > > > > and better power consumption.
> > > >
> > > > Does it ramp down faster too?
> > >
> > > I don't think so.
> > >
> > > To be honest, I'm not convinced that modifying the half time is the
> > > right way to solve this. If it was only a matter of half life not
> > > being suitable for a system, the halk life would be set once at boot
> > > and people would not ask to modify it at run time.
> >
> > I'd like to understand more the reason behind these concerns. What is the
> > problem with modifying the halflife?
>
> I can somehow understand that some systems would like a different half
> life than the current one because of the number of cpus, the pace of
> the system... But this should be fixed at boot. The fact that people
Setting it at boot time might be the only thing required. I think some systems
only need this already. The difficulty in practice is that on some systems this
might result in worse power over a day of use. So it'll all depend, hence the
desire to have it as a runtime knob. Why invent more crystal balls that might
or might not be the best thing depending on who you ask?
> needs to dynamically change the half life means for me that even after
> changing it then they still don't get the correct utilization. And I
What is the correct utilization? It is just a signal in an attempt to crystal
ball the future. It can't be correct in general IMHO. It's a best effort that
we know already fails occasionally.
As I said above - there's a trade-off in perf/power and that will highly depend
on the system.
The proposed high contention detection doesn't address this trade-off; rather
it biases the system further towards perf-first, which is not always the right
trade-off. It could be a useful addition - but it needs to be a tunable too.
> think that the problem is not really related (or at least not only) to
> the correctness of utilization tracking but a lack of taking into
It's not a correctness issue. It's a response time issue. It's a simple
matter of improving the reactiveness of the system, which has a power cost that
some users don't want to incur when not necessary.
> account other input when selecting a frequency. And the contention
> (runnable_avg) is a good input to take into account when selecting a
> frequency because it reflects that some tasks are waiting to run on
> the cpu
You did not answer my question above. What if there's no contention and
a single task on a cpu suddenly moves from mostly idle to always running for
a number of frames? There's no contention in there. How will this be improved?
>
> >
> > The way I see it it is an important metric of how responsive the system to how
> > loaded it is. Which drives a lot of important decisions.
> >
> > 32ms means the system needs approximately 200ms to detect an always running
> > task (from idle).
> >
> > 16ms halves it to 100ms. And 8ms halves it further to 50ms.
> >
> > Or you can phrase it the opposite way, it takes 200ms to detect the system is
> > now idle from always busy state. etc.
> >
> > Why is it bad for a sys admin to have the ability to adjust this response time
> > as they see fit?
>
> because it will use it to bias the response of the system and abuse it
> at runtime instead of identifying the root cause.
No one wants to abuse anything. But the one size fits all approach is not
always right either. And sys admins and end users have the right to tune their
systems the way they see fit. There are too many variations out there to hard
code the system response. I view this like the right to repair - it's their
system, why do they have to hack the kernel to tune it?
The root cause is that the system reactiveness is controlled by this value.
And there's a trade-off between perf/power that is highly dependent on the
system characteristics. In some cases a boot time setting is all that one
needs. In others, it might be desirable to improve only specific use cases
like gaming, as speeding things up at boot time can hurt overall battery life
in normal use cases.
I think the story is simple :)
In my view util_est is borderline a hack. We just need to enable controlling
pelt ramp-up/down response times + improve schedutil. I highlight a few
shortcomings that are already known in practice below. And that phoronix
article about schedutil not being better than ondemand demonstrates that this
is an issue outside of mobile too.
schedutil - as the name says - depends on the util signal, which also depends
on the pelt halflife. I really think this is the most natural and predictable
way to tune the system. I can't see the drawbacks.
I think we need to distinguish between picking sensible default behavior and
enforcing policies or restricting the user's choice. AFAICS the discussion is
going towards the latter.
On the topic of defaults - I do think 16ms is a more sensible default for
modern day hardware and use cases.
/me runs and hides :)
Cheers
--
Qais Yousef
>
> >
> > What goes wrong?
> >
> > AFAICS the two natural places to control the response time of the system is
> > pelt halflife for overall system responsiveness, and the mapping function in
> > schedutil for more fine grained frequency response.
> >
> > There are issues with current filtering mechanism in schedutil too:
> >
> > 1. It drops any requests during the filtering window. At CFS enqueue we
> > could end up with multiple calls to cpufreq_update_util(); or if we
> > do multiple consecutive enqueues. In a shared domain, there's a race
> > which cpu issues the updated freq request first. Which might not be
> > the best for the domain during this window.
> > 2. Maybe it needs asymmetric values for up and down.
> >
> > I could be naive, but I see util_est as something we should strive to remove to
> > be honest. I think there are too many moving cogs.
> >
> >
> > Thanks!
> >
> > --
> > Qais Yousef
On 01/03/2023 18:24, Qais Yousef wrote:
> On 03/01/23 11:39, Vincent Guittot wrote:
>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
>>>
>>> On 02/09/23 17:16, Vincent Guittot wrote:
>>>
>>>> I don't see how util_est_faster can help this 1ms task here ? It's
>>>> most probably never be preempted during this 1ms. For such an Android
>>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and
>>>> a better solution ?
>>>
>>> uclamp_min is being used in UI and helping there. But your mileage might vary
>>> with adoption still.
>>>
>>> The major motivation behind this is to help things like gaming as the original
>>> thread started. It can help UI and other use cases too. Android framework has
>>> a lot of context on the type of workload that can help it make a decision when
>>> this helps. And OEMs can have the chance to tune and apply based on the
>>> characteristics of their device.
>>>
>>>> IIUC how util_est_faster works, it removes the waiting time when
>>>> sharing cpu time with other tasks. So as long as there is no (runnable
>>>> but not running time), the result is the same as current util_est.
>>>> util_est_faster makes a difference only when the task alternates
>>>> between runnable and running slices.
>>>> Have you considered using runnable_avg metrics in the increase of cpu
>>>> freq ? This takes into the runnable slice and not only the running
>>>> time and increase faster than util_avg when tasks compete for the same
>>>> CPU
>>>
>>> Just to understand why we're heading into this direction now.
>>>
>>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up
>>> migration) which both are tied to utilization signal.
>>>
>>> Wouldn't make the util response time faster help not just for rampup, but
>>> rampdown too?
>>>
>>> If we improve util response time, couldn't this mean we can remove util_est or
>>> am I missing something?
>>
>> not sure because you still have a ramping step whereas util_est
>> directly gives you the final tager
util_est gives us an instantaneous signal at enqueue for periodic tasks,
something PELT will never be able to do.
> I didn't get you. tager?
>
>>
>>>
>>> Currently we have util response which is tweaked by util_est and then that is
>>> tweaked further by schedutil with that 25% margin when maping util to
>>> frequency.
>>
>> the 25% is not related to the ramping time but to the fact that you
>> always need some margin to cover unexpected events and estimation
>> error
>
> At the moment we have
>
> util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
>
> I think we have too many transformations before deciding the current
> frequencies. Which makes it hard to tweak the system response.
To me it looks more like this:
max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
^^^^^^^^^^^^
new proposal to factor in root cfs_rq contention
Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
there to create the safety margin used by schedutil & EAS.
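As a rough sketch only (not the mainline code, the function name is made
up), the aggregation above plus that safety margin boils down to something
like:

/*
 * Illustrative sketch: how the signals combine before schedutil turns
 * them into a frequency request. cpu_util_cfs() and map_util_perf()
 * are the authoritative mainline versions.
 */
static unsigned long effective_cpu_util_sketch(unsigned long util_avg,
					       unsigned long util_est,
					       unsigned long runnable_avg)
{
	unsigned long util = util_avg > util_est ? util_avg : util_est;

	/* proposal discussed here: factor in root cfs_rq contention */
	if (runnable_avg > util)
		util = runnable_avg;

	/* ~25% safety margin, cf. map_util_perf(): util + (util >> 2) */
	return util + (util >> 2);
}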
* The schedutil up/down filter thing has already been NAKed in Nov 2016.
IMHO, this is where util_est was initially discussed as an alternative.
We have it in mainline as well, but one value (default 10ms) for both
directions. There was discussion to map it to the driver's
translation_latency instead.
In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
So on `up` your rate is as small as possible (only respecting the
driver's translation_latency) but on `down` you use much more than that.
Why exactly do you have this higher value on `down`? My hunch is
scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
so util_est is 0 and the blocked utilization is decaying (too fast,
4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
again when the CPU wakes up in those 20ms?
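For reference, a back-of-the-envelope sketch of that decay (assuming pure
geometric decay with the default 32ms halflife and no new contributions
while the CPUs are idle):

/*
 * util * 0.5^(idle_ms/32): 20ms of idle already costs roughly a third
 * of the blocked utilization.
 */
static unsigned long decayed_blocked_util(unsigned long util,
					  unsigned int idle_ms)
{
	for (; idle_ms >= 32; idle_ms -= 32)
		util >>= 1;			/* one full halflife */

	/* crude linear interpolation within the last halflife */
	return util - (util * idle_ms) / 64;
}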
>>> I think if we can allow improving general util response time by tweaking PELT
>>> HALFLIFE we can potentially remove util_est and potentially that magic 25%
>>> margin too.
>>>
>>> Why the approach of further tweaking util_est is better?
>>
>> note that in this case it doesn't really tweak util_est but Dietmar
>> has taken into account runnable_avg to increase the freq in case of
>> contention
>>
>> Also IIUC Dietmar's results, the problem seems more linked to the
>> selection of a higher freq than increasing the utilization;
>> runnable_avg tests give similar perf results to a shorter half life
>> and better power consumption.
>
> Does it ramp down faster too?
Not sure why you are interested in this? Can't be related to the
`driving DVFS` functionality discussed above.
Hi Dietmar
On 03/23/23 17:29, Dietmar Eggemann wrote:
> On 01/03/2023 18:24, Qais Yousef wrote:
> > On 03/01/23 11:39, Vincent Guittot wrote:
> >> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
> >>>
> >>> On 02/09/23 17:16, Vincent Guittot wrote:
> >>>
> >>>> I don't see how util_est_faster can help this 1ms task here? It will
> >>>> most probably never be preempted during this 1ms. For such an Android
> >>>> Graphics Pipeline short task, hasn't uclamp_min been designed for this,
> >>>> and isn't it a better solution?
> >>>
> >>> uclamp_min is being used in UI and helping there. But your mileage might vary
> >>> with adoption still.
> >>>
> >>> The major motivation behind this is to help things like gaming as the original
> >>> thread started. It can help UI and other use cases too. Android framework has
> >>> a lot of context on the type of workload that can help it make a decision when
> >>> this helps. And OEMs can have the chance to tune and apply based on the
> >>> characteristics of their device.
> >>>
> >>>> IIUC how util_est_faster works, it removes the waiting time when
> >>>> sharing cpu time with other tasks. So as long as there is no (runnable
> >>>> but not running time), the result is the same as current util_est.
> >>>> util_est_faster makes a difference only when the task alternates
> >>>> between runnable and running slices.
> >>>> Have you considered using runnable_avg metrics in the increase of cpu
> >>>> freq? This takes into account the runnable slice and not only the running
> >>>> time, and increases faster than util_avg when tasks compete for the same
> >>>> CPU.
> >>>
> >>> Just to understand why we're heading in this direction now.
> >>>
> >>> AFAIU the desired outcome is to have faster rampup time (and on HMP faster
> >>> up migration), both of which are tied to the utilization signal.
> >>>
> >>> Wouldn't making the util response time faster help not just rampup, but
> >>> rampdown too?
> >>>
> >>> If we improve util response time, couldn't this mean we can remove util_est or
> >>> am I missing something?
> >>
> >> not sure because you still have a ramping step whereas util_est
> >> directly gives you the final tager
>
> util_est gives us instantaneous signal at enqueue for periodic tasks,
How do you define instantaneous and periodic here? How would you describe the
behavior for non periodic tasks?
> something PELT will never be able to do.
Why? By selecting a lower pelt halflife, don't we achieve something similar?
>
> > I didn't get you. tager?
> >
> >>
> >>>
> >>> Currently we have util response which is tweaked by util_est and then that is
> >>> tweaked further by schedutil with that 25% margin when mapping util to
> >>> frequency.
> >>
> >> the 25% is not related to the ramping time but to the fact that you
> >> always need some margin to cover unexpected events and estimation
> >> error
> >
> > At the moment we have
> >
> > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> >
> > I think we have too many transformations before deciding the current
> > frequencies. Which makes it hard to tweak the system response.
>
> To me it looks more like this:
>
> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
> ^^^^^^^^^^^^
> new proposal to factor in root cfs_rq contention
These are still 5 stages even if written differently.
What if it's background tasks that are causing the contention? How can you tell
it to ignore that and NOT drive the frequency up unnecessarily for those
unimportant ones? If userspace were fully aware of uclamp - this whole
discussion wouldn't be necessary. And I still have a bunch of fixes to push
before uclamp_max is actually usable in production.
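For illustration, this is roughly what a fully uclamp-aware userspace would
have to do per background task today (raw sched_setattr(); the clamp value
of 256 is only an example):

#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* uapi sched_attr layout, redeclared to keep the sketch standalone */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime, sched_deadline, sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

/* cap how much a background task is allowed to drive frequency up */
static int cap_background_task(pid_t pid, uint32_t util_max)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
			   SCHED_FLAG_UTIL_CLAMP_MAX;
	attr.sched_util_max = util_max;		/* e.g. 256 out of 1024 */

	return syscall(SYS_sched_setattr, pid, &attr, 0);
}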
> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
> there to create the safety margin used by schedutil & EAS.
Yes I know and that's not the point. The point is that it's a chain reaction.
25% headroom is already very aggressive and causes issues at the inefficient
top end of the cores. And when util is high, you might end up in a situation
where you skip frequencies. Making everything go up faster without balancing it
by either enabling going down faster too or tuning this value can lead to power
and thermal issues on powerful systems.
I think all we need is controlling the pelt halflife and this margin to tune
the system to the desired trade-off.
>
> * The schedutil up/down filter thing has already been NAKed in Nov 2016.
> IMHO, this is where util_est was initially discussed as an alternative.
Well, I don't see anyone not using a down filter. So I'm not sure util_est has
been a true alternative.
> We have it in mainline as well, but one value (default 10ms) for both
> directions. There was discussion to map it to the driver's
> translation_latency instead.
Which can be filled wrong sometimes :(
>
> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
>
> So on `up` your rate is as small as possible (only respecting the
> driver's translation_latency) but on `down` you use much more than that.
>
> Why exactly do you have this higher value on `down`? My hunch is
> scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
> so util_est is 0 and the blocked utilization is decaying (too fast,
> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
> again when the CPU wakes up in those 20ms?
The down filter prevents changing the frequency to a lower value. So it's
a holding function to keep the residency at a higher frequency for at least
20ms. It is, sort of, similar to the max() functions you used above. The max
function will allow you to follow the fastest ramping up signal on the way up,
and the slowest ramping down one on the way down.
I think this is a more deterministic way to do it.
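Something along these lines (illustrative only, not the actual governor
code):

#include <stdint.h>

/*
 * Asymmetric up/down filter: increases pass through immediately,
 * decreases are held back until down_rate_limit_ns has elapsed, i.e.
 * the higher frequency gets a minimum residency.
 */
static unsigned int filter_next_freq(unsigned int cur_freq,
				     unsigned int next_freq,
				     uint64_t now_ns,
				     uint64_t *last_update_ns,
				     uint64_t down_rate_limit_ns)
{
	if (next_freq >= cur_freq) {
		*last_update_ns = now_ns;	/* up: no holding */
		return next_freq;
	}

	if (now_ns - *last_update_ns < down_rate_limit_ns)
		return cur_freq;		/* down: keep holding */

	*last_update_ns = now_ns;
	return next_freq;
}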
>
> >>> I think if we can allow improving general util response time by tweaking PELT
> >>> HALFLIFE we can potentially remove util_est and potentially that magic 25%
> >>> margin too.
> >>>
> >>> Why the approach of further tweaking util_est is better?
> >>
> >> note that in this case it doesn't really tweak util_est but Dietmar
> >> has taken into account runnable_avg to increase the freq in case of
> >> contention
> >>
> >> Also IIUC Dietmar's results, the problem seems more linked to the
> >> selection of a higher freq than increasing the utilization;
> >> runnable_avg tests give similar perf results than shorter half life
> >> and better power consumption.
> >
> > Does it ramp down faster too?
>
> Not sure why you are interested in this? Can't be related to the
> `driving DVFS` functionality discussed above.
If you change the reaction time to be more aggressive in going up, then it's
only natural to have it symmetrical so your residency on the power hungry OPPs
doesn't go through the roof and end up with thermal and power issues.
I am concerned about us biasing towards perf first too much and not enabling
sys admins to select a proper trade-off for their system and use case. Which
are not static. The workloads the system needs to accommodate are abundant
and operating conditions could change. And the diversity of hardware available
out there is huge - I am not sure how we can expect one response to
accommodate all of them.
What I'm trying to push for here is that we should look at the chain as one
unit. And we should consider that there's an important trade-off to be had here;
having a sensible default doesn't mean the user shouldn't be allowed to select
a different trade-off. I'm not sure the problem can be generalized and fixed
automatically. But happy to be proven wrong of course :-)
FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind
pasting the patch for load_avg consideration so I can take it into account too
in my experiments?
Thanks!
--
Qais Yousef
Hi Qais,
On 03/04/2023 16:45, Qais Yousef wrote:
> Hi Dietmar
>
> On 03/23/23 17:29, Dietmar Eggemann wrote:
>> On 01/03/2023 18:24, Qais Yousef wrote:
>>> On 03/01/23 11:39, Vincent Guittot wrote:
>>>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
>>>>>
>>>>> On 02/09/23 17:16, Vincent Guittot wrote:
[...]
>>>>> If we improve util response time, couldn't this mean we can remove util_est or
>>>>> am I missing something?
>>>>
>>>> not sure because you still have a ramping step whereas util_est
>>>> directly gives you the final tager
>>
>> util_est gives us instantaneous signal at enqueue for periodic tasks,
>
> How do you define instantaneous and periodic here? How would you describe the
> behavior for non periodic tasks?
Instantaneous is when the max value is available already @wakeup. That
is the main use case for util_est: provide this boost to periodic tasks.
A non-periodic task doesn't benefit from this. The working assumption back
then was that the important tasks involved here are the periodic (back then
60Hz, 16.67 ms period) tasks of the Android display pipeline.
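Roughly (a simplified sketch, not the mainline implementation):

/*
 * util_est keeps a snapshot of the task's util_avg from the moment it
 * last went to sleep and adds it back to the rq-level estimate at
 * enqueue, so the full value is visible immediately at wakeup instead
 * of after a PELT ramp-up.
 */
struct util_est_sketch {
	unsigned int enqueued;	/* util_avg snapshot at last dequeue */
	unsigned int ewma;	/* smoothed history of those snapshots */
};

static void util_est_enqueue_sketch(unsigned long *rq_util_est,
				    struct util_est_sketch *ue)
{
	unsigned int boost = ue->enqueued > ue->ewma ? ue->enqueued : ue->ewma;

	/* instantaneous contribution @wakeup, no ramping step */
	*rq_util_est += boost;
}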
>> something PELT will never be able to do.
>
> > Why? By selecting a lower pelt halflife, don't we achieve something similar?
You get closer but you still would need time to ramp-up. That's without
util_est.
[...]
>>>> the 25% is not related to the ramping time but to the fact that you
>>>> always need some margin to cover unexpected events and estimation
>>>> error
>>>
>>> At the moment we have
>>>
>>> util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
>>>
>>> I think we have too many transformations before deciding the current
>>> frequencies. Which makes it hard to tweak the system response.
>>
>> To me it looks more like this:
>>
>> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
>> ^^^^^^^^^^^^
>> new proposal to factor in root cfs_rq contention
>
> These are still 5 stages even if written differently.
>
> What if it's background tasks that are causing the contention? How can you
> tell it to ignore that and NOT drive the frequency up unnecessarily for those
> unimportant ones? If userspace were fully aware of uclamp - this whole discussion
> wouldn't be necessary. And I still have a bunch of fixes to push before
> uclamp_max is actually usable in production.
You're hinting at the other open discussion we have on uclamp in feec():
https://lkml.kernel.org/r/[email protected]
IMHO, this is a different discussion. No classification of tasks here.
>> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
>> there to create the safety margin used by schedutil & EAS.
>
> Yes I know and that's not the point. The point is that it's a chain reaction.
> 25% headroom is already very aggressive and causes issues on the top
> inefficient ends of the cores. And when util is high, you might end up in
> a situation where you skip frequencies. Making everything go up faster without
> balancing it with either enabling going down faster too or tune this value can
> lead to power and thermal issues on powerful systems.
I try to follow here but I fail. You're saying that the safety margin is
too wide and in case util is within the safety margin, the logic is
eclipsed by going max or choosing a CPU from a higher CPU capacity
Perf-domain?
Wouldn't `going down faster` contradict with schedutil's 20ms down rate
limit?
>
> I think all we need is controlling pelt halflife and this one to tune the
> system to the desired trade-off.
>
>>
> >> * The schedutil up/down filter thing has already been NAKed in Nov 2016.
>> IMHO, this is where util_est was initially discussed as an alternative.
>
> Well, I don't see anyone not using a down filter. So I'm not sure util_est has
> been a true alternative.
Definitely not in down direction. util_est is 0 w/o any runnable tasks.
And blocked utilization is decaying much faster than your 20ms down rate
limit.
>> We have it in mainline as well, but one value (default 10ms) for both
>> directions. There was discussion to map it to the driver's
>> translation_latency instead.
>
> Which can be filled wrong sometimes :(
>
>>
>> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
>>
>> So on `up` your rate is as small as possible (only respecting the
>> driver's translation_latency) but on `down` you use much more than that.
>>
>> Why exactly do you have this higher value on `down`? My hunch is
>> scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
>> so util_est is 0 and the blocked utilization is decaying (too fast,
>> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
>> again when the CPU wakes up in those 20ms?
>
> The down filter prevents changing the frequency to a lower value. So it's
> a holding function to keep the residency at a higher frequency for at least
> 20ms. It is, sort of, similar to the max() functions you used above. The max
> > function will allow you to follow the fastest ramping up signal on the way up,
> and the slowest ramping down one on the way down.
>
> > I think this is a more deterministic way to do it.
But a faster PELT wouldn't help here, quite the opposite.
[...]
>>>> Also IIUC Dietmar's results, the problem seems more linked to the
>>>> selection of a higher freq than increasing the utilization;
>>>> runnable_avg tests give similar perf results to a shorter half life
>>>> and better power consumption.
>>>
>>> Does it ramp down faster too?
>>
>> Not sure why you are interested in this? Can't be related to the
>> `driving DVFS` functionality discussed above.
>
> If you change the reaction time to be more aggressive in going up, then it's
> only natural to have it symmetrical so your residency on the power hungry OPPs
> doesn't go through the roof and end up with thermal and power issues.
But you apply this 20ms down rate limit on the big cores too?
> I am concerned about us biasing towards perf first too much and not enabling
> sys admins to select a proper trade off for their system and use case. Which
> are not static. The workloads the system needs to accommodate to are abundant
> and operating conditions could change. And the diversity of hardware available
> out there is huge - I am not sure how can we expect we can have one response to
> accommodate for all of them.
>
> What I'm trying to push for here is that we should look at the chain as one
> unit. And we should consider that there's important trade-off to be had here;
> having a sensible default doesn't mean the user shouldn't be allowed to select
> a different trade-off. I'm not sure the problem can be generalized and fixed
> automatically. But happy to be proven wrong of course :-)
>
> FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind
> pasting the patch for load_avg consideration so I can take it into account too
> in my experiments?
Just posted it:
https://lkml.kernel.org/r/[email protected]
Hi Dietmar!
On 04/06/23 17:58, Dietmar Eggemann wrote:
> Hi Qais,
>
> On 03/04/2023 16:45, Qais Yousef wrote:
> > Hi Dietmar
> >
> > On 03/23/23 17:29, Dietmar Eggemann wrote:
> >> On 01/03/2023 18:24, Qais Yousef wrote:
> >>> On 03/01/23 11:39, Vincent Guittot wrote:
> >>>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <[email protected]> wrote:
> >>>>>
> >>>>> On 02/09/23 17:16, Vincent Guittot wrote:
>
> [...]
>
> >>>>> If we improve util response time, couldn't this mean we can remove util_est or
> >>>>> am I missing something?
> >>>>
> >>>> not sure because you still have a ramping step whereas util_est
> >>>> directly gives you the final tager
> >>
> >> util_est gives us instantaneous signal at enqueue for periodic tasks,
> >
> > How do you define instantaneous and periodic here? How would you describe the
> > behavior for non periodic tasks?
>
> Instantaneous is when the max value is available already @wakeup. That
> is the main use case for util_est, provide this boost to periodic tasks.
> A non-periodic task doesn't benefit from this. Work assumption back then
> was that the important tasks involved here are the periodic (back then
> 60Hz, 16.67 ms period) tasks of the Android display pipeline.
Not all tasks in the system are periodic.
Note that the main use case that was brought up here is gaming - which is not
the same as the Android display pipeline.
>
> >> something PELT will never be able to do.
> >
> > > Why? By selecting a lower pelt halflife, don't we achieve something similar?
>
> You get closer but you still would need time to ramp-up. That's without
> util_est.
Yes we'll always need time to ramp up. Even for util_est, no?
>
> [...]
>
> >>>> the 25% is not related to the ramping time but to the fact that you
> >>>> always need some margin to cover unexpected events and estimation
> >>>> error
> >>>
> >>> At the moment we have
> >>>
> >>> util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> >>>
> >>> I think we have too many transformations before deciding the current
> >>> frequencies. Which makes it hard to tweak the system response.
> >>
> >> To me it looks more like this:
> >>
> >> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
> >> ^^^^^^^^^^^^
> >> new proposal to factor in root cfs_rq contention
> >
> > These are still 5 stages even if written differently.
> >
> > What if it's background tasks that are causing the contention? How can you
> > tell it to ignore that and NOT drive the frequency up unnecessarily for those
> > unimportant ones? If userspace were fully aware of uclamp - this whole discussion
> > wouldn't be necessary. And I still have a bunch of fixes to push before
> > uclamp_max is actually usable in production.
>
> You're hinting at the other open discussion we have on uclamp in feec():
No, no I am not.
>
> https://lkml.kernel.org/r/[email protected]
>
> IMHO, this is a different discussion. No classification of tasks here.
That patch has nothing to do with what I'm trying to say here. You say looking
at load_avg helps with contention. My point was: what if the contention is
caused by background tasks? They'll cause the frequency to go up higher, which
is not the desired effect.
So it won't distinguish between cases that matter and cases that don't matter;
and there's no ability to control this behavior.
As you know cpuset is used to keep background tasks on little cores, whose top
frequencies on the latest devices are very expensive. This could lead to higher
residency on those expensive frequencies with your change.
We need to be selective - which is the whole point behind wanting a runtime
control. Not all workloads are equal. And not all systems handle the same
workload similarly. There are trade-offs.
>
> >> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
> >> there to create the safety margin used by schedutil & EAS.
> >
> > Yes I know and that's not the point. The point is that it's a chain reaction.
> > 25% headroom is already very aggressive and causes issues on the top
> > inefficient ends of the cores. And when util is high, you might end up in
> > a situation where you skip frequencies. Making everything go up faster without
> > balancing it with either enabling going down faster too or tune this value can
> > lead to power and thermal issues on powerful systems.
>
> I try to follow here but I fail. You're saying that the safety margin is
> too wide and in case util is within the safety margin, the logic is
> eclipsed by going max or choosing a CPU from a higher CPU capacity
> Perf-domain?
>
> Wouldn't `going down faster` contradict with schedutil's 20ms down rate
> limit?
No. 200ms is a far cry from 20ms.
>
> >
> > I think all we need is controlling pelt halflife and this one to tune the
> > system to the desired trade-off.
> >
> >>
> >> * The schedutil up/down filter thing has already been NAKed in Nov 2016.
> >> IMHO, this is where util_est was initially discussed as an alternative.
> >
> > Well, I don't see anyone not using a down filter. So I'm not sure util_est has
> > been a true alternative.
>
> Definitely not in down direction. util_est is 0 w/o any runnable tasks.
> And blocked utilization is decaying much faster than your 20ms down rate
> limit.
Okay I'll keep this in mind when looking at this in the future. Maybe there's
something fishy in there that we could improve.
>
> >> We have it in mainline as well, but one value (default 10ms) for both
> >> directions. There was discussion to map it to the driver's
> >> translation_latency instead.
> >
> > Which can be filled wrong sometimes :(
> >
> >>
> >> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
> >>
> >> So on `up` your rate is as small as possible (only respecting the
> >> driver's translation_latency) but on `down` you use much more than that.
> >>
> >> Why exactly do you have this higher value on `down`? My hunch is
> >> scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
> >> so util_est is 0 and the blocked utilization is decaying (too fast,
> >> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
> >> again when the CPU wakes up in those 20ms?
> >
> > The down filter prevents changing the frequency to a lower value. So it's
> > a holding function to keep the residency at a higher frequency for at least
> > 20ms. It is, sort of, similar to the max() functions you used above. The max
> > function will allow you to follow the fastest ramping up signal on the way up,
> > and the slowest ramping down one on the way down.
> >
> > I think this is a more deterministic way to do it.
>
> But a faster PELT wouldn't help here, quite the opposite.
I didn't mention PELT here. I was comparing util_est max() to the filter in
schedutil.
> [...]
>
> >>>> Also IIUC Dietmar's results, the problem seems more linked to the
> >>>> selection of a higher freq than increasing the utilization;
> >>>> runnable_avg tests give similar perf results to a shorter half life
> >>>> and better power consumption.
> >>>
> >>> Does it ramp down faster too?
> >>
> >> Not sure why you are interested in this? Can't be related to the
> >> `driving DVFS` functionality discussed above.
> >
> > If you change the reaction time to be more aggressive in going up, then it's
> > only natural to have it symmetrical so your residency on the power hungry OPPs
> > doesn't go through the roof and end up with thermal and power issues.
>
> But you apply this 20ms down rate limit on the big cores too?
>
> > I am concerned about us biasing towards perf first too much and not enabling
> > sys admins to select a proper trade off for their system and use case. Which
> > are not static. The workloads the system needs to accommodate to are abundant
> > and operating conditions could change. And the diversity of hardware available
> > out there is huge - I am not sure how can we expect we can have one response to
> > accommodate for all of them.
> >
> > What I'm trying to push for here is that we should look at the chain as one
> > unit. And we should consider that there's important trade-off to be had here;
> > having a sensible default doesn't mean the user shouldn't be allowed to select
> > a different trade-off. I'm not sure the problem can be generalized and fixed
> > automatically. But happy to be proven wrong of course :-)
> >
> > FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind
> > pasting the patch for load_avg consideration so I can take it into account too
> > in my experiments?
>
> Just posted it:
>
> https://lkml.kernel.org/r/[email protected]
Thanks a lot! I'll revisit the whole story taking into account the relationship
with all these other controls. I will need some time though. But I will get back
with some data hopefully to help us pave the right way. I think we shredded
this thread to pieces enough :)
Thanks!
--
Qais Yousef