Hi,
At the moment, all the available idle governors operate mainly based on their own past performance
without taking into account any scheduling information. Especially on interactive systems, this
results in them frequently selecting a deeper idle state and then waking up before its target
residency is hit, thus leading to increased wakeup latency and lower performance with no power
saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep')
account for over 24% of all wakeups.
At the same time, on some platforms C0 can be power efficient enough that it is worth preferring
over C1. Sleeps that happened in C0 while they could have used C1 ('too shallow') merely save
less power than they otherwise could have. Too deep sleeps, on the other hand, harm performance
and nullify the potential power saving from using C1 in the first place. Taking this into
account, it is preferable on balance for an idle governor to have more too shallow
sleeps than too deep sleeps.
Currently the best available governor under this metric is TEO which on average results in less than
half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
increased performance in the process.
This proposed optional extension to TEO would specifically tune it for minimising too deep
sleeps and minimising latency to achieve better performance. To this end, before selecting the next
idle state it uses the util_avg signal of a CPU's runqueue in order to determine to what extent the
CPU is being utilized. This util value is then compared to a threshold defined as a percentage of
the CPU's capacity (capacity >> 6, i.e. ~1.56% in the current implementation). If the util is above the
threshold, the governor directly selects the shallowest available idle state. If the util is below
the threshold, the governor defaults to the TEO metrics mechanism to try to select the deepest
available idle state based on the closest timer event and its own past correctness.
Effectively this functions like a governor that on the fly disables deeper idle states when there
are things happening on the cpu and then immediately reenables them as soon as the cpu isn't
being utilized anymore.
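To make the decision above concrete, here is a minimal userspace sketch of it. It is not the kernel code from the patch: the helper names (util_threshold, cpu_is_utilized, select_state) and the `disabled` array are hypothetical, and falling back to the metrics candidate when no state is enabled is an assumption made for the sketch. The capacity value of 1024 used in the comment is the scheduler's usual full-capacity scale.

```c
#include <stdbool.h>

/* Shift used by the patch: capacity >> 6 is 1/64 (~1.56%) of the capacity. */
#define UTIL_THRESHOLD_SHIFT 6

/* Hypothetical helper: threshold for a CPU of the given capacity,
 * e.g. a capacity of 1024 gives a threshold of 16. */
static unsigned long util_threshold(unsigned long capacity)
{
	return capacity >> UTIL_THRESHOLD_SHIFT;
}

/* Hypothetical helper: utilized means strictly above the threshold. */
static bool cpu_is_utilized(unsigned long util, unsigned long threshold)
{
	return util > threshold;
}

/*
 * Hypothetical helper: when utilized, return the shallowest enabled state;
 * otherwise return whatever the metrics-based path picked.
 * disabled[i] is true when state i may not be used.
 */
static int select_state(bool utilized, const bool *disabled, int state_count,
			int metrics_candidate)
{
	int i;

	if (!utilized)
		return metrics_candidate;

	for (i = 0; i < state_count; i++) {
		if (!disabled[i])
			return i;
	}

	/* Assumption for this sketch: no enabled state, keep the candidate. */
	return metrics_candidate;
}
```

A util exactly equal to the threshold still counts as not utilized here, which matches the strict comparison in the patch.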
Initially I am sending this as a patch for TEO to visualize the proposed mechanism and simplify
the review process. An alternative implementation that does not interfere with existing TEO code
would be to fork TEO into a separate governor (working name 'idleutil') that is mostly identical
for the time being and implement util-awareness there, so that the two approaches can coexist and
both be available at runtime instead of relying on a compile-time option.
I am happy to send a patchset doing that if you think it's a cleaner approach than doing it this way.
This approach can outperform all the other currently available governors, at least on mobile device
workloads, which is why I think it is worth keeping as an option.
Additionally, in my view, the governor's metrics are the reason why it makes more sense to implement
this type of mechanism inside a governor rather than outside using something like QoS or some other
way to disable certain idle states on the fly. If we were disabling idle states and reenabling
them without the governor 'knowing' about it, the governor's metrics would end up being updated
based on state selections not caused by the governor itself. This could interfere with the
correctness of said metrics as that's not what they were designed for as far as I understand.
This approach skips metrics updates whenever a state was selected based on the util and not based
on the metrics.
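As a rough illustration of that last point, the sketch below models the bookkeeping with a simplified, hypothetical per-CPU structure (struct gov_cpu is not the kernel's struct teo_cpu, and the `updates` counter merely stands in for the real metrics).

```c
#include <stdbool.h>

/* Hypothetical, simplified per-CPU governor state. */
struct gov_cpu {
	bool utilized;        /* was the last state picked on the util path? */
	unsigned int updates; /* stand-in for the governor's metrics */
};

/* On wakeup, account the last sleep only if the metrics chose the state. */
static void gov_account_wakeup(struct gov_cpu *c)
{
	if (!c->utilized)
		c->updates++;
}
```

Sleeps entered on the util path leave the counter untouched, so the metrics only ever reflect decisions the metrics themselves made.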
This mechanism has no particular attachment to or reliance on TEO. I simply chose to base
it on TEO because it performs the best out of all the available options and I didn't think there was
any point in reinventing the wheel on the side of computing governor metrics. If a
better approach comes along at some point, there's no reason why the same util-aware mechanism
couldn't be used with any other metrics algorithm. That would, however, require implementing it as
a separate governor rather than a TEO add-on.
As for how the extension performs in practice, below I'll add some benchmark results I got while
testing this patchset.
Pixel 6 (Android 12, mainline kernel 5.18):
1. Geekbench 5 (latency-sensitive, heavy load test)
The values below are gmean values across 3 back-to-back iterations of Geekbench 5.
As GB5 is a heavy benchmark, after more than 3 iterations intense throttling kicks in on mobile devices
resulting in skewed benchmark scores, which makes it difficult to collect reliable results. The actual
values for all of the governors can change between runs as the benchmark might be affected by factors
other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently achieved better
scores than all the other governors.
'shallow' is a trivial governor that only ever selects the shallowest available state, included here
for reference and to establish the lower bound of latency possible to achieve through cpuidle.
'gmean too deep %' and 'gmean too shallow %' are percentages of too deep and too shallow sleeps
computed using the new trace event - cpu_idle_miss. The percentage is obtained by counting the two
types of misses over the course of a run and then dividing them by the total number of wakeups.
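The percentage calculation described above can be sketched as follows. The helper name miss_percent is hypothetical, and returning 0 for a run with no wakeups is an assumption made for the sketch.

```c
/* Hypothetical helper: share of all wakeups that were one type of miss. */
static double miss_percent(unsigned long misses, unsigned long wakeups)
{
	if (wakeups == 0)
		return 0.0; /* assumption: report 0% when nothing was recorded */

	return 100.0 * (double)misses / (double)wakeups;
}
```

It would be called once with the count of too deep misses and once with the count of too shallow misses, both divided by the same total number of wakeups.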
| metric                                | menu           | teo               | shallow           | teo + util-aware  |
| ------------------------------------- | -------------- | ----------------- | ----------------- | ----------------- |
| gmean score                           | 2716.4 (0.0%)  | 2795 (+2.89%)     | 2780.5 (+2.36%)   | 2830.8 (+4.21%)   |
| gmean too deep %                      | 16.64%         | 9.61%             | 0%                | 4.19%             |
| gmean too shallow %                   | 2.66%          | 5.54%             | 31.47%            | 15.3%             |
| gmean task wakeup latency (gb5)       | 82.05μs (0.0%) | 73.97μs (-9.85%)  | 42.05μs (-48.76%) | 66.91μs (-18.45%) |
| gmean task wakeup latency (asynctask) | 75.66μs (0.0%) | 56.58μs (-25.22%) | 65.78μs (-13.06%) | 55.35μs (-26.84%) |
In the case of this benchmark, the difference in latency does seem to translate into better scores.
Additionally, here's a set of runs of Geekbench done after holding the phone in
the fridge for exactly an hour each time in order to minimise the impact of thermal issues.
| metric | menu | teo | teo + util-aware |
| ------------------------------------- | ------------- | --------------- | --------------- |
| gmean multicore score | 2792.1 (0.0%) | 2845.2 (+1.9%) | 2857.4 (+2.34%) |
| gmean single-core score | 1048.3 (0.0%) | 1052.6 (+0.41%) | 1055.3 (+0.67%) |
2. PCMark Web Browsing (non latency-sensitive, normal usage test)
The table below contains gmean values across 20 back to back iterations of PCMark 2 Web Browsing.
| metric                    | menu           | teo               | shallow          | teo + util-aware  |
| ------------------------- | -------------- | ----------------- | ---------------- | ----------------- |
| gmean score               | 6283.0 (0.0%)  | 6262.9 (-0.32%)   | 6258.4 (-0.39%)  | 6323.7 (+0.65%)   |
| gmean too deep %          | 24.15%         | 10.32%            | 0%               | 3.2%              |
| gmean too shallow %       | 2.81%          | 7.68%             | 27.69%           | 17.189%           |
| gmean power usage [mW]    | 209.1 (0.0%)   | 187.8 (-10.17%)   | 205.5 (-1.71%)   | 205 (-1.96%)      |
| gmean task wakeup latency | 204.6μs (0.0%) | 184.39μs (-9.87%) | 95.55μs (-53.3%) | 95.98μs (-53.09%) |
As this is a web browsing benchmark, the task for which the wakeup latency was recorded was Chrome's
rendering task, i.e. CrRendererMain. The latency improvement for the actual benchmark task was very
similar.
In this case the large latency improvement does not translate into a notable increase in benchmark score as
this particular benchmark mainly responds to changes in operating frequency. Nevertheless, the small power
saving compared to menu with no decrease in benchmark score indicates that there are no regressions for this
type of workload while using this governor.
Note: As mentioned, the results above were obtained on the 5.18 kernel. Results for Geekbench obtained after
backporting CFS patches from the most recent mainline can be found in the pdf linked below [1].
The results and improvements still hold up but the numbers change slightly. Additionally, the pdf contains
plots for all the relevant results obtained with this and other idle governors.
At the very least this approach seems promising so I wanted to discuss it in RFC form first.
Thank you for taking your time to read this!
--
Kajetan
[1] https://github.com/mrkajetanp/lisa-notebooks/blob/a2361a5b647629bfbfc676b942c8e6498fb9bd03/idle_util_aware.pdf
Kajetan Puchalski (1):
cpuidle: teo: Introduce optional util-awareness
drivers/cpuidle/Kconfig | 12 +++++
drivers/cpuidle/governors/teo.c | 86 +++++++++++++++++++++++++++++++++
2 files changed, 98 insertions(+)
--
2.37.1
Modern interactive systems, such as recent Android phones, tend to have
power efficient shallow idle states. Selecting deeper idle states on a
device while a latency-sensitive workload is running can adversely impact
performance due to increased latency. Additionally, if the CPU wakes up
from a deeper sleep before its target residency as is often the case, it
results in a waste of energy on top of that.
This patch extends the TEO governor with an optional mechanism adding
util-awareness, effectively providing a way for the governor to switch
between only selecting the shallowest idle state when the cpu is being
utilized over a certain threshold and trying to select the deepest possible
state using TEO's metrics when the cpu is not being utilized. This is now
possible since the CPU utilization is exported from the scheduler with the
sched_cpu_util function and already used e.g. in the thermal governor IPA.
This can provide drastically decreased latency and performance benefits in
certain types of mobile workloads that are sensitive to latency,
such as Geekbench 5.
Signed-off-by: Kajetan Puchalski <[email protected]>
---
drivers/cpuidle/Kconfig | 12 +++++
drivers/cpuidle/governors/teo.c | 86 +++++++++++++++++++++++++++++++++
2 files changed, 98 insertions(+)
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index ff71dd662880..6b66ee88a2b2 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -33,6 +33,18 @@ config CPU_IDLE_GOV_TEO
Some workloads benefit from using it and it generally should be safe
to use. Say Y here if you are not happy with the alternatives.
+config CPU_IDLE_GOV_TEO_UTIL_AWARE
+ bool "Util-awareness mechanism for TEO"
+ depends on CPU_IDLE_GOV_TEO
+ help
+ Util-awareness mechanism for the TEO governor. With this enabled,
+ the governor will choose the shallowest available state when the
+ CPU's average util is above a certain threshold and default to
+ using the metrics-based approach when it's not.
+
+ Some latency-sensitive workloads on interactive devices can benefit
+ from using it.
+
config CPU_IDLE_GOV_HALTPOLL
bool "Haltpoll governor (for virtualized systems)"
depends on KVM_GUEST
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index d9262db79cae..fd5b2eb750be 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -2,8 +2,13 @@
/*
* Timer events oriented CPU idle governor
*
+ * TEO governor:
* Copyright (C) 2018 - 2021 Intel Corporation
* Author: Rafael J. Wysocki <[email protected]>
+ *
+ * Util-awareness mechanism:
+ * Copyright (C) 2022 Arm Ltd.
+ * Author: Kajetan Puchalski <[email protected]>
*/
/**
@@ -99,14 +104,48 @@
* select the given idle state instead of the candidate one.
*
* 3. By default, select the candidate state.
+ *
+ * Util-awareness mechanism:
+ *
+ * The idea behind the util-awareness extension is that there are two distinct
+ * scenarios for the CPU which should result in two different approaches to idle
+ * state selection - utilized and not utilized.
+ *
+ * In this case, 'utilized' means that the average runqueue util of the CPU is
+ * above a certain threshold.
+ *
+ * When the CPU is utilized while going into idle, more likely than not it will
+ * be woken up to do more work soon and so the shallowest idle state should be
+ * selected to minimise latency and maximise performance. When the CPU is not
+ * being utilized, the usual metrics-based approach to selecting the deepest
+ * available idle state should be preferred to take advantage of the power
+ * saving.
+ *
+ * In order to achieve this, the governor uses a utilization threshold.
+ * The threshold is computed per-cpu as a percentage of the CPU's capacity
+ * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
+ * seems to be getting the best results.
+ *
+ * Before selecting the next idle state, the governor compares the current CPU
+ * util to the precomputed util threhsold. If it's below, it defaults to the
+ * TEO metrics mechanism. If it's above, it simply selects the shallowest
+ * enabled idle state.
*/
#include <linux/cpuidle.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/tick.h>
+/*
+ * The number of bits to shift the cpu's capacity by in order to determine
+ * the utilized threshold
+ */
+#define UTIL_THRESHOLD_SHIFT 6
+
+
/*
* The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
* is used for decreasing metrics on a regular basis.
@@ -140,6 +179,8 @@ struct teo_bin {
* @total: Grand total of the "intercepts" and "hits" mertics for all bins.
* @next_recent_idx: Index of the next @recent_idx entry to update.
* @recent_idx: Indices of bins corresponding to recent "intercepts".
+ * @util_threshold: Threshold above which the CPU is considered utilized
+ * @utilized: Whether the last sleep on the CPU happened while utilized
*/
struct teo_cpu {
s64 time_span_ns;
@@ -148,10 +189,28 @@ struct teo_cpu {
unsigned int total;
int next_recent_idx;
int recent_idx[NR_RECENT];
+#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
+ unsigned long util_threshold;
+ bool utilized;
+#endif
};
static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
+#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
+/**
+ * teo_get_util - Update the CPU utilized status
+ * @dev: Target CPU
+ * @cpu_data: Governor CPU data for the target CPU
+ */
+static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
+{
+ unsigned long util = sched_cpu_util(dev->cpu);
+
+ cpu_data->utilized = util > cpu_data->util_threshold;
+}
+#endif
+
/**
* teo_update - Update CPU metrics after wakeup.
* @drv: cpuidle driver containing state data.
@@ -301,7 +360,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
int i;
if (dev->last_state_idx >= 0) {
+#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
+ /* don't update metrics if the cpu was utilized during the last sleep */
+ if (!cpu_data->utilized)
+ teo_update(drv, dev);
+#else
teo_update(drv, dev);
+#endif
dev->last_state_idx = -1;
}
@@ -321,6 +386,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
goto end;
}
+#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
+ teo_get_util(dev, cpu_data);
+ /* if the cpu is being utilized, choose the shallowest state and exit */
+ if (cpu_data->utilized) {
+ for (i = 0; i < drv->state_count; ++i) {
+ if (dev->states_usage[i].disable)
+ continue;
+ break;
+ }
+
+ idx = i;
+ goto end;
+ }
+#endif
+
/*
* Find the deepest idle state whose target residency does not exceed
* the current sleep length and the deepest idle state not deeper than
@@ -508,9 +588,15 @@ static int teo_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
+ unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
+#endif
int i;
memset(cpu_data, 0, sizeof(*cpu_data));
+#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
+ cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
+#endif
for (i = 0; i < NR_RECENT; i++)
cpu_data->recent_idx[i] = -1;
--
2.37.1
On Thu, Sep 15, 2022 at 9:45 AM Kajetan Puchalski
<[email protected]> wrote:
>
> Hi,
Hi,
I tried it.
>
> At the moment, all the available idle governors operate mainly based on their own past performance
> without taking into account any scheduling information. Especially on interactive systems, this
> results in them frequently selecting a deeper idle state and then waking up before its target
> residency is hit, thus leading to increased wakeup latency and lower performance with no power
> saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep')
> account for over 24% of all wakeups.
>
> At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer
> it over C1. Sleeps that happened in C0 while they could have used C1 ('too shallow') only save
> less power than they otherwise could have. Too deep sleeps, on the other hand, harm performance
> and nullify the potential power saving from using C1 in the first place. While taking this into
> account, it is clear that on balance it is preferable for an idle governor to have more too shallow
> sleeps instead of more too deep sleeps.
>
> Currently the best available governor under this metric is TEO which on average results in less than
> half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
> increased performance in the process.
>
> This proposed optional extension to TEO would specifically tune it for minimising too deep
> sleeps and minimising latency to achieve better performance. To this end, before selecting the next
> idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
> CPU is being utilized. This util value is then compared to a threshold defined as a percentage of
> the cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation).
That seems quite a bit too low to me. However, on my processor the energy
cost of using idle state 0 versus anything deeper is very high, so I do not
have a good way to test.
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
On an idle system:
with only idle state 0 enabled, processor package power is ~46 watts
with only idle state 1 enabled, processor package power is ~2.6 watts
with all idle states enabled, processor package power is ~1.4 watts
> If the util is above the
> threshold, the governor directly selects the shallowest available idle state. If the util is below
> the threshold, the governor defaults to the TEO metrics mechanism to try to select the deepest
> available idle state based on the closest timer event and its own past correctness.
>
> Effectively this functions like a governor that on the fly disables deeper idle states when there
> are things happening on the cpu and then immediately reenables them as soon as the cpu isn't
> being utilized anymore.
>
> Initially I am sending this as a patch for TEO to visualize the proposed mechanism and simplify
> the review process. An alternative way of implementing it while not interfering
> with existing TEO code would be to fork TEO into a separate but mostly identical for the time being
> governor (working name 'idleutil') and then implement util-awareness there, so that the two
> approaches can coexist and both be available at runtime instead of relying on a compile-time option.
> I am happy to send a patchset doing that if you think it's a cleaner approach than doing it this way.
I would prefer the two to coexist for testing, as it makes it easier to
manually compare some areas of focus.
>
> This approach can outperform all the other currently available governors, at least on mobile device
> workloads, which is why I think it is worth keeping as an option.
>
> Additionally, in my view, the reason why it makes more sense to implement this type of mechanism
> inside a governor rather than outside using something like QoS or some other way to disable certain
> idle states on the fly are the governor's metrics. If we were disabling idle states and reenabling
> them without the governor 'knowing' about it, the governor's metrics would end up being updated
> based on state selections not caused by the governor itself. This could interfere with the
> correctness of said metrics as that's not what they were designed for as far as I understand.
> This approach skips metrics updates whenever a state was selected based on the util and not based
> on the metrics.
>
> There is no particular attachment or reliance on TEO for this mechanism, I simply chose to base
> it on TEO because it performs the best out of all the available options and I didn't think there was
> any point in reinventing the wheel on the side of computing governor metrics. If a
> better approach comes along at some point, there's no reason why the same idle aware mechanism
> couldn't be used with any other metrics algorithm. That would, however, require implemeting it as
> a separate governor rather than a TEO add-on.
>
> As for how the extension performs in practice, below I'll add some benchmark results I got while
> testing this patchset.
>
> Pixel 6 (Android 12, mainline kernel 5.18):
>
> 1. Geekbench 5 (latency-sensitive, heavy load test)
>
> The values below are gmean values across 3 back to back iteration of Geekbench 5.
> As GB5 is a heavy benchmark, after more than 3 iterations intense throttling kicks in on mobile devices
> resulting in skewed benchmark scores, which makes it difficult to collect reliable results. The actual
> values for all of the governors can change between runs as the benchmark might be affected by factors
> other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently achieved better
> scores than all the other governors.
>
> 'shallow' is a trivial governor that only ever selects the shallowest available state, included here
> for reference and to establish the lower bound of latency possible to achieve through cpuidle.
>
> 'gmean too deep %' and 'gmean too shallow %' are percentages of too deep and too shallow sleeps
> computed using the new trace event - cpu_idle_miss. The percentage is obtained by counting the two
> types of misses over the course of a run and then dividing them by the total number of wakeups.
>
> | metric | menu | teo | shallow | teo + util-aware |
> | ------------------------------------- | ------------- | --------------- | --------------- | --------------- |
> | gmean score | 2716.4 (0.0%) | 2795 (+2.89%) | 2780.5 (+2.36%) | 2830.8 (+4.21%) |
> | gmean too deep % | 16.64% | 9.61% | 0% | 4.19% |
> | gmean too shallow % | 2.66% | 5.54% | 31.47% | 15.3% |
> | gmean task wakeup latency (gb5) | 82.05μs (0.0%) | 73.97μs (-9.85%) | 42.05μs (-48.76%) | 66.91μs (-18.45%) |
> | gmean task wakeup latency (asynctask) | 75.66μs (0.0%) | 56.58μs (-25.22%) | 65.78μs (-13.06%) | 55.35μs (-26.84%) |
>
> In case of this benchmark, the difference in latency does seem to translate into better scores.
>
> Additionally, here's a set of runs of Geekbench done after holding the phone in
> the fridge for exactly an hour each time in order to minimise the impact of thermal issues.
>
> | metric | menu | teo | teo + util-aware |
> | ------------------------------------- | ------------- | --------------- | --------------- |
> | gmean multicore score | 2792.1 (0.0%) | 2845.2 (+1.9%) | 2857.4 (+2.34%) |
> | gmean single-core score | 1048.3 (0.0%) | 1052.6 (+0.41%) | 1055.3 (+0.67%) |
>
> 2. PCMark Web Browsing (non latency-sensitive, normal usage test)
>
> The table below contains gmean values across 20 back to back iterations of PCMark 2 Web Browsing.
>
> | metric | menu | teo | shallow | teo + util-aware |
> | ------------------------- | ------------- | --------------- | --------------- | --------------- |
> | gmean score | 6283.0 (0.0%) | 6262.9 (-0.32%) | 6258.4 (-0.39%) | 6323.7 (+0.65%) |
> | gmean too deep % | 24.15% | 10.32% | 0% | 3.2% |
> | gmean too shallow % | 2.81% | 7.68% | 27.69% | 17.189% |
> | gmean power usage [mW] | 209.1 (0.0%) | 187.8 (-10.17%) | 205.5 (-1.71%) | 205 (-1.96%) |
> | gmean task wakeup latency | 204.6μs (0.0%) | 184.39μs (-9.87%) | 95.55μs (-53.3%) | 95.98μs (-53.09%) |
>
> As this is a web browsing benchmark, the task for which the wakeup latency was recorded was Chrome's
> rendering task, ie CrRendererMain. The latency improvement for the actual benchmark task was very
> similar.
>
> In this case the large latency improvement does not translate into a notable increase in benchmark score as
> this particular benchmark mainly responds to changes in operating frequency. Nevertheless, the small power
> saving compared to menu with no decrease in benchmark score indicate that there are no regressions for this
> type of workload while using this governor.
>
> Note: The results above were as mentioned obtained on the 5.18 kernel. Results for Geekbench obtained after
> backporting CFS patches from the most recent mainline can be found in the pdf linked below [1].
> The results and improvements still hold up but the numbers change slightly. Additionally, the pdf contains
> plots for all the relevant results obtained with this and other idle governors.
>
> At the very least this approach seems promising so I wanted to discuss it in RFC form first.
> Thank you for taking your time to read this!
There might be a way forward for my type of processor if the algorithm
were to just reduce the idle depth by 1 instead of going all the way to
idle state 0. Not sure. As it stands, the patch seems to bypass all that
the teo governor is attempting to achieve.
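A minimal sketch of what "reduce the idle depth by 1" could look like is below. It is not part of the posted patch and has not been tested on any hardware; the helper name one_state_shallower and the `disabled` array are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: given the state index the metrics would pick, return
 * the next shallower enabled state, or the candidate itself if there is none.
 * disabled[i] is true when state i may not be used.
 */
static int one_state_shallower(int candidate, const bool *disabled)
{
	int i;

	for (i = candidate - 1; i >= 0; i--) {
		if (!disabled[i])
			return i;
	}

	return candidate;
}
```

With this variant the metrics-based choice still sets the starting point, and the util signal only nudges the result one step shallower.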
For a single periodic workflow with a very light workload (I test
work/sleep frequencies from 5 hertz to 411 hertz), processor package
powers at 73 hertz work/sleep frequency:
teo: ~1.5 watts
menu: ~1.5 watts
util: ~19 watts
For 12 periodic workflow threads with a very light workload (again tested
from 5 hertz to 411 hertz), processor package powers at 73 hertz
work/sleep frequency:
teo: ~2.8 watts
menu: ~2.8 watts
util: ~49 watts
My test computer is a server, with no GUI. I started a desktop Linux VM
guest that isn't doing much:
teo: ~1.8 watts
menu: ~1.8 watts
util: ~7.8 watts
>
> --
> Kajetan
>
> [1] https://github.com/mrkajetanp/lisa-notebooks/blob/a2361a5b647629bfbfc676b942c8e6498fb9bd03/idle_util_aware.pdf
>
>
> Kajetan Puchalski (1):
> cpuidle: teo: Introduce optional util-awareness
>
> drivers/cpuidle/Kconfig | 12 +++++
> drivers/cpuidle/governors/teo.c | 86 +++++++++++++++++++++++++++++++++
> 2 files changed, 98 insertions(+)
>
> --
> 2.37.1
>
On Thu, Sep 15, 2022 at 9:45 AM Kajetan Puchalski
<[email protected]> wrote:
>
> Modern interactive systems, such as recent Android phones, tend to have
> power efficient shallow idle states. Selecting deeper idle states on a
> device while a latency-sensitive workload is running can adversely impact
> performance due to increased latency. Additionally, if the CPU wakes up
> from a deeper sleep before its target residency as is often the case, it
> results in a waste of energy on top of that.
>
> This patch extends the TEO governor with an optional mechanism adding
> util-awareness, effectively providing a way for the governor to switch
> between only selecting the shallowest idle state when the cpu is being
> utilized over a certain threshold and trying to select the deepest possible
> state using TEO's metrics when the cpu is not being utilized. This is now
> possible since the CPU utilization is exported from the scheduler with the
> sched_cpu_util function and already used e.g. in the thermal governor IPA.
>
> This can provide drastically decreased latency and performance benefits in
> certain types of mobile workloads that are sensitive to latency,
> such as Geekbench 5.
>
> Signed-off-by: Kajetan Puchalski <[email protected]>
> ---
> drivers/cpuidle/Kconfig | 12 +++++
> drivers/cpuidle/governors/teo.c | 86 +++++++++++++++++++++++++++++++++
> 2 files changed, 98 insertions(+)
>
> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
> index ff71dd662880..6b66ee88a2b2 100644
> --- a/drivers/cpuidle/Kconfig
> +++ b/drivers/cpuidle/Kconfig
> @@ -33,6 +33,18 @@ config CPU_IDLE_GOV_TEO
> Some workloads benefit from using it and it generally should be safe
> to use. Say Y here if you are not happy with the alternatives.
>
> +config CPU_IDLE_GOV_TEO_UTIL_AWARE
> + bool "Util-awareness mechanism for TEO"
> + depends on CPU_IDLE_GOV_TEO
> + help
> + Util-awareness mechanism for the TEO governor. With this enabled,
> + the governor will choose the shallowest available state when the
> + CPU's average util is above a certain threshold and default to
> + using the metrics-based approach when it's not.
> +
> + Some latency-sensitive workloads on interactive devices can benefit
> + from using it.
> +
> config CPU_IDLE_GOV_HALTPOLL
> bool "Haltpoll governor (for virtualized systems)"
> depends on KVM_GUEST
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index d9262db79cae..fd5b2eb750be 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -2,8 +2,13 @@
> /*
> * Timer events oriented CPU idle governor
> *
> + * TEO governor:
> * Copyright (C) 2018 - 2021 Intel Corporation
> * Author: Rafael J. Wysocki <[email protected]>
> + *
> + * Util-awareness mechanism:
> + * Copyright (C) 2022 Arm Ltd.
> + * Author: Kajetan Puchalski <[email protected]>
> */
>
> /**
> @@ -99,14 +104,48 @@
> * select the given idle state instead of the candidate one.
> *
> * 3. By default, select the candidate state.
> + *
> + * Util-awareness mechanism:
> + *
> + * The idea behind the util-awareness extension is that there are two distinct
> + * scenarios for the CPU which should result in two different approaches to idle
> + * state selection - utilized and not utilized.
> + *
> + * In this case, 'utilized' means that the average runqueue util of the CPU is
> + * above a certain threshold.
> + *
> + * When the CPU is utilized while going into idle, more likely than not it will
> + * be woken up to do more work soon and so the shallowest idle state should be
> + * selected to minimise latency and maximise performance. When the CPU is not
> + * being utilized, the usual metrics-based approach to selecting the deepest
> + * available idle state should be preferred to take advantage of the power
> + * saving.
> + *
> + * In order to achieve this, the governor uses a utilization threshold.
> + * The threshold is computed per-cpu as a percentage of the CPU's capacity
> + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> + * seems to be getting the best results.
> + *
> + * Before selecting the next idle state, the governor compares the current CPU
> + * util to the precomputed util threhsold. If it's below, it defaults to the
threshold
> + * TEO metrics mechanism. If it's above, it simply selects the shallowest
> + * enabled idle state.
> */
>
> #include <linux/cpuidle.h>
> #include <linux/jiffies.h>
> #include <linux/kernel.h>
> +#include <linux/sched.h>
I think it also needs this line:
+#include <linux/sched/topology.h>
At least for me, it didn't compile without it.
> #include <linux/sched/clock.h>
> #include <linux/tick.h>
>
> +/*
> + * The number of bits to shift the cpu's capacity by in order to determine
> + * the utilized threshold
> + */
> +#define UTIL_THRESHOLD_SHIFT 6
> +
> +
> /*
> * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> * is used for decreasing metrics on a regular basis.
> @@ -140,6 +179,8 @@ struct teo_bin {
> * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
metrics
> * @next_recent_idx: Index of the next @recent_idx entry to update.
> * @recent_idx: Indices of bins corresponding to recent "intercepts".
> + * @util_threshold: Threshold above which the CPU is considered utilized
> + * @utilized: Whether the last sleep on the CPU happened while utilized
> */
> struct teo_cpu {
> s64 time_span_ns;
> @@ -148,10 +189,28 @@ struct teo_cpu {
> unsigned int total;
> int next_recent_idx;
> int recent_idx[NR_RECENT];
> +#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
> + unsigned long util_threshold;
> + bool utilized;
> +#endif
> };
>
> static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> +#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
> +/**
> + * teo_get_util - Update the CPU utilized status
> + * @dev: Target CPU
> + * @cpu_data: Governor CPU data for the target CPU
> + */
> +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> +{
> + unsigned long util = sched_cpu_util(dev->cpu);
> +
> + cpu_data->utilized = util > cpu_data->util_threshold;
> +}
> +#endif
> +
> /**
> * teo_update - Update CPU metrics after wakeup.
> * @drv: cpuidle driver containing state data.
> @@ -301,7 +360,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> int i;
>
> if (dev->last_state_idx >= 0) {
> +#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
> + /* don't update metrics if the cpu was utilized during the last sleep */
> + if (!cpu_data->utilized)
> + teo_update(drv, dev);
> +#else
> teo_update(drv, dev);
> +#endif
> dev->last_state_idx = -1;
> }
>
> @@ -321,6 +386,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> goto end;
> }
>
> +#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
> + teo_get_util(dev, cpu_data);
> + /* if the cpu is being utilized, choose the shallowest state and exit */
> + if (cpu_data->utilized) {
> + for (i = 0; i < drv->state_count; ++i) {
> + if (dev->states_usage[i].disable)
> + continue;
> + break;
> + }
> +
> + idx = i;
> + goto end;
> + }
> +#endif
> +
> /*
> * Find the deepest idle state whose target residency does not exceed
> * the current sleep length and the deepest idle state not deeper than
> @@ -508,9 +588,15 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> struct cpuidle_device *dev)
> {
> struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
> + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> +#endif
> int i;
>
> memset(cpu_data, 0, sizeof(*cpu_data));
> +#ifdef CONFIG_CPU_IDLE_GOV_TEO_UTIL_AWARE
> + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> +#endif
>
> for (i = 0; i < NR_RECENT; i++)
> cpu_data->recent_idx[i] = -1;
> --
> 2.37.1
>
On Fri, Sep 16, 2022 at 12:49 AM Kajetan Puchalski
<[email protected]> wrote:
>
> Modern interactive systems, such as recent Android phones, tend to have
> power efficient shallow idle states. Selecting deeper idle states on a
> device while a latency-sensitive workload is running can adversely impact
> performance due to increased latency. Additionally, if the CPU wakes up
> from a deeper sleep before its target residency as is often the case, it
> results in a waste of energy on top of that.
>
> This patch extends the TEO governor with an optional mechanism adding
> util-awareness, effectively providing a way for the governor to switch
> between only selecting the shallowest idle state when the cpu is being
> utilized over a certain threshold and trying to select the deepest possible
> state using TEO's metrics when the cpu is not being utilized.
Not sure if we can use util_avg the way schedutil does, but it looks interesting.
The last time I proposed using util_avg to optimize some code in the kernel,
it was suggested that it would be better to make the strategy gradual rather
than a binary 0/1 switch. So I was wondering whether we could do something like:
next_idx = cpuidle_select();
next_idx = next_idx * (cpu_cap - util_avg) / cpu_cap;
The lower the util_avg is, the more we honor the choice of the governor, and
vice versa.
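To make the arithmetic concrete, here is a minimal userspace sketch of that scaling. The helper name scale_idle_index() and its guard conditions are hypothetical; only the formula itself comes from the two lines above:

```c
/*
 * Hypothetical helper, not part of the patch: scale the governor's chosen
 * idle state index down as utilization approaches the CPU's capacity.
 */
static int scale_idle_index(int next_idx, unsigned long cpu_cap,
			    unsigned long util_avg)
{
	/* Assumed guards: bad input or a fully utilized CPU gives state 0. */
	if (next_idx <= 0 || cpu_cap == 0 || util_avg >= cpu_cap)
		return 0;

	/* Integer division truncates towards the shallower state. */
	return (int)((unsigned long)next_idx * (cpu_cap - util_avg) / cpu_cap);
}
```

With cpu_cap = 1024, a governor choice of 8 at util_avg = 512 would be scaled down to 4. Because the division truncates, a governor choice of 1 is only preserved when util_avg is 0.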
> This is now possible since the CPU utilization is exported from the scheduler with the
> sched_cpu_util function and already used e.g. in the thermal governor IPA.
>
> This can provide drastically decreased latency and performance benefits in
> certain types of mobile workloads that are sensitive to latency,
> such as Geekbench 5.
As Doug mentioned in another thread, data on the impact on energy consumption
would also be interesting.
thanks,
Chenyu
Hi, thanks for taking a look!
> > This proposed optional extension to TEO would specifically tune it for minimising too deep
> > sleeps and minimising latency to achieve better performance. To this end, before selecting the next
> > idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
> > CPU is being utilized. This util value is then compared to a threshold defined as a percentage of
> > the cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation).
>
> That seems quite a bit too low to me. However, on my processor the
> energy cost of using
> idle state 0 versus anything deeper is very high, so I do not have a
> good way to test.
I suppose it does look low but, as I said, at least in my own testing
higher thresholds completely nullify the potential benefits of using
this. It could be because with a low enough threshold like this we are
able to catch the average util as it starts to rise, so we're already in
the 'low-latency mode' by the time it gets higher, as opposed to
correcting after the fact. We could also always make it into some kind
of tunable if need be; I was testing it with a dedicated sysctl
and it worked all right.
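For reference, the threshold arithmetic can be sketched in plain userspace C. The helper names here are made up for illustration, and the capacity of 1024 used in the note below is an assumed example value (the scheduler's full-capacity scale); the shift and the comparison mirror what the patch does:

```c
#include <stdbool.h>

#define UTIL_THRESHOLD_SHIFT 6

/* capacity >> 6 equals capacity / 64, i.e. 1.5625% of the capacity. */
static unsigned long util_threshold(unsigned long max_capacity)
{
	return max_capacity >> UTIL_THRESHOLD_SHIFT;
}

/* The CPU counts as utilized only when util is strictly above the threshold. */
static bool cpu_utilized(unsigned long util, unsigned long threshold)
{
	return util > threshold;
}
```

For a capacity of 1024 this gives a threshold of 16, so a util of 17 counts as utilized while a util of 16 does not.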
>
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> On an idle system :
> with only Idle state 0 enabled, processor package power is ~46 watts.
> with only idle state 1 enabled, processor package power is ~2.6 watts
> with all idle states enabled, processor package power is ~1.4 watts
>
Ah I see, yeah this definitely won't work on systems with idle power
usage like above. It was designed for Arm devices like the Pixel 6 where
C0 is so power efficient that running with only C0 enabled can sometimes
actually use *less* power than running with all idle states enabled.
This was for non-intensive workloads like PCMark Web Browsing where
there were enough too deep sleeps in C1 to offset the entire power
saving. The entire idea we're relying upon here is C0 being very good to
begin with but wanting to still use *some* C1 in order to avoid bumping
into thermal issues.
> > If the util is above the
> > threshold, the governor directly selects the shallowest available idle state. If the util is below
> > the threshold, the governor defaults to the TEO metrics mechanism to try to select the deepest
> > available idle state based on the closest timer event and its own past correctness.
> >
> > Effectively this functions like a governor that on the fly disables deeper idle states when there
> > are things happening on the cpu and then immediately reenables them as soon as the cpu isn't
> > being utilized anymore.
> >
> > Initially I am sending this as a patch for TEO to visualize the proposed mechanism and simplify
> > the review process. An alternative way of implementing it while not interfering
> > with existing TEO code would be to fork TEO into a separate but mostly identical for the time being
> > governor (working name 'idleutil') and then implement util-awareness there, so that the two
> > approaches can coexist and both be available at runtime instead of relying on a compile-time option.
> > I am happy to send a patchset doing that if you think it's a cleaner approach than doing it this way.
>
> I would prefer the two to coexist for testing, as it makes it easier
> to manually compare some
> areas of focus.
That would be my preference as well, it just seems like a cleaner
approach despite having to copy over some code to begin with. I'm just
waiting for Rafael to express a view one way or the other :)
> > At the very least this approach seems promising so I wanted to discuss it in RFC form first.
> > Thank you for taking your time to read this!
>
> There might be a way forward for my type of processor if the algorithm
> were to just reduce the idle
> depth by 1 instead of all the way to idle state 0. Not sure. It seems
> to bypass all that the teo
> governor is attempting to achieve.
Oh interesting, that could definitely be worth a try. As I said, this
was designed for Arm CPUs and all of the targeted ones only have 2 idle
states, C0 and C1. Thus reducing by 1 and going all the way to 0 are the
same thing for our use case. You're right that this is potentially
pretty excessive on Intel CPUs where you could be going from state 8/9 to
0. It would result in some wasted cycles on Arm but I imagine there should
be some way forward where we could accommodate the two.
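A rough userspace sketch of the two policies side by side, purely for illustration. The helper names and the plain disabled[] array are hypothetical stand-ins, not cpuidle structures, and the fallback values are assumptions:

```c
#include <stdbool.h>

/* Shallowest enabled state, as the RFC selects when the CPU is utilized. */
static int shallowest_enabled(const bool *disabled, int state_count)
{
	for (int i = 0; i < state_count; i++)
		if (!disabled[i])
			return i;

	return 0; /* assumed fallback when every state is disabled */
}

/* Nearest enabled state shallower than idx, or idx itself if there is none. */
static int one_shallower(const bool *disabled, int idx)
{
	for (int i = idx - 1; i >= 0; i--)
		if (!disabled[i])
			return i;

	return idx;
}
```

With two states both helpers turn a choice of 1 into 0. With nine states, a choice of 8 becomes 7 under "reduce by one" and 0 under the current RFC behaviour.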
> For a single periodic workflow at any work sleep frequency (well, I
> test 5 hertz to 411 hertz) and very
> light workload: Processor package powers for 73 hertz work/sleep frequency:
>
> teo: ~1.5 watts
> menu: ~1.5 watts
> util: ~19 watts
>
> For 12 periodic workflow threads at 73 hertz work/sleep frequency
> (well, I test 5 hertz to 411 hertz) and very
> light workload: Processor package powers:
>
> teo: ~2.8 watts
> menu: ~2.8 watts
> util: ~49 watts
>
> My test computer is a server, with no gui. I started a desktop linux
> VM guest that isn't doing much:
>
> teo: ~1.8 watts
> menu: ~1.8 watts
> util: ~7.8 watts
Ouch that's definitely not great, really good to know what this looks
like on Intel CPUs though. Thanks a lot for taking your time to test
this out!
> >
> > --
> > Kajetan
> >
> > [1] https://github.com/mrkajetanp/lisa-notebooks/blob/a2361a5b647629bfbfc676b942c8e6498fb9bd03/idle_util_aware.pdf
> >
> >
> > Kajetan Puchalski (1):
> > cpuidle: teo: Introduce optional util-awareness
> >
> > drivers/cpuidle/Kconfig | 12 +++++
> > drivers/cpuidle/governors/teo.c | 86 +++++++++++++++++++++++++++++++++
> > 2 files changed, 98 insertions(+)
> >
> > --
> > 2.37.1
> >
> Not sure if we can use util_avg the way schedutil does, but it looks interesting.
> The last time I proposed using util_avg to optimize some code in the kernel,
> it was suggested that it would be better to make the strategy gradual rather
> than a binary 0/1 switch. So I was wondering whether we could do something like:
>
> next_idx = cpuidle_select();
> next_idx = next_idx * (cpu_cap - util_avg) / cpu_cap;
>
> The lower the util_avg is, the more we honor the choice of the governor, and
> vice versa.
Would that be in order to still make use of intermediate idle states (ie
the ones between first and last) or to change how the util threshold
works? It seems similar to the issue Doug pointed out.
I think there are two scenarios here. The idle landscape on Arm just looks
really different from the one on x86/Intel and we should probably
account for that. In our use case "gradual" and 0-1 are the same thing,
it's just all about how you set the threshold. On x86, on the other hand,
you have both the threshold and the approach to state selection to worry about.
This just further makes me think that separating this out into a
separate governor is preferable as this can work really nicely on
certain systems like ours and really badly on others like Doug's. We
probably shouldn't be bundling this with generic solutions like TEO that
work well across the board.
It might also make sense to have slightly different implementations for
x86 and Arm to account for the hardware differences, but that'd also be
up to Rafael to express a view on.
> > This is now possible since the CPU utilization is exported from the scheduler with the
> > sched_cpu_util function and already used e.g. in the thermal governor IPA.
> >
> > This can provide drastically decreased latency and performance benefits in
> > certain types of mobile workloads that are sensitive to latency,
> > such as Geekbench 5.
> As Doug mentioned in another thread, data on the impact on energy consumption
> would also be interesting.
I included energy consumption plots in the pdf I linked in the cover
letter, here's the link:
https://github.com/mrkajetanp/lisa-notebooks/blob/a2361a5b647629bfbfc676b942c8e6498fb9bd03/idle_util_aware.pdf
The plots show the geometric mean (gmean) of the power measurements in mW, so
they reflect average power usage over the course of the benchmark. They also
include a 'shallow' column which shows power consumption with only C0 enabled
and visualises why this works on Arm and how different this is from the x86
behaviour described by Doug.
> thanks,
> Chenyu
Hi Rafael,
Just a gentle ping here. Could you please take a look at this
discussion?
I'd like to address some comments I received, especially on the subject
of making it reduce the state by one as opposed to going all the way to
0 to account for different hardware and how we can accommodate different
architectures in the implementation of that mechanism.
Before I send a v2 it'd be great to know your opinions on this and
whether I should still send it as a TEO patch or fork it into a separate
governor and make the changes there as both Doug and I seem to prefer.
Thank you in advance for your time,
Kajetan
On Thu, Sep 15, 2022 at 05:44:10PM +0100, Kajetan Puchalski wrote:
> At the very least this approach seems promising so I wanted to discuss it in RFC form first.
> Thank you for taking your time to read this!
>
> [1] https://github.com/mrkajetanp/lisa-notebooks/blob/a2361a5b647629bfbfc676b942c8e6498fb9bd03/idle_util_aware.pdf
>
> Kajetan Puchalski (1):
> cpuidle: teo: Introduce optional util-awareness
>
> drivers/cpuidle/Kconfig | 12 +++++
> drivers/cpuidle/governors/teo.c | 86 +++++++++++++++++++++++++++++++++
> 2 files changed, 98 insertions(+)
>
> --
> 2.37.1
>
Hi,
On Wed, Sep 28, 2022 at 2:42 PM Kajetan Puchalski
<[email protected]> wrote:
>
> Hi Rafael,
>
> Just a gentle ping here. Could you please take a look at this
> discussion?
I have seen it, but I haven't thought it through yet.
Overall, this is quite subtle and requires quite a bit of consideration IMV.
> I'd like to address some comments I received, especially on the subject
> of making it reduce the state by one as opposed to going all the way to
> 0 to account for different hardware and how we can accommodate different
> architectures in the implementation of that mechanism.
I need to think more about that.
> Before I send a v2 it'd be great to know your opinions on this and
> whether I should still send it as a TEO patch or fork it into a separate
> governor and make the changes there as both Doug and I seem to prefer.
Well, it is not a super-large patch against TEO, so I'm not sure if
adding a new governor just for this one bit is a good idea.
I surely don't like the #ifdeffery there, so if it can be made part of
the default TEO behavior, it will be much more appealing to me.
Thanks!