Hi,
At the moment, none of the available idle governors take any scheduling
information into account. They also tend to overestimate the idle
duration quite often, which causes them to select excessively deep idle
states, thus leading to increased wakeup latency and lower performance with no
power saving. For 'menu' while web browsing on Android for instance, those
types of wakeups ('too deep') account for over 24% of all wakeups.
At the same time, on some platforms idle state 0 can be power efficient
enough to warrant wanting to prefer it over idle state 1. This is because
the power usage of the two states can be so close that sufficient amounts
of too deep state 1 sleeps can completely offset the state 1 power saving to the
point where it would've been more power efficient to just use state 0 instead.
This is of course for systems where state 0 is not a polling state, such as
arm-based devices.
Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only
save less power than they otherwise could have. Too deep sleeps, on the other
hand, harm performance and nullify the potential power saving from using state 1 in
the first place. While taking this into account, it is clear that on balance it
is preferable for an idle governor to have more too shallow sleeps instead of
more too deep sleeps on those kinds of platforms.
Currently the best available governor under this metric is TEO which on average results in less than
half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
increased performance in the process.
This patchset specifically tunes TEO to prefer shallower idle states in order to reduce wakeup latency
and achieve better performance. To this end, before selecting the next idle state it uses the avg_util
signal of a CPU's runqueue in order to determine to what extent the CPU is being utilized.
This util value is then compared to a threshold defined as a percentage of the cpu's capacity
(capacity >> 6 ie. ~1.5% in the current implementation). If the util is above the threshold, the idle
state selected by TEO metrics will be reduced by 1, thus selecting a shallower state. If the util is
below the threshold, the governor defaults to the TEO metrics mechanism to try to select the deepest
available idle state based on the closest timer event and its own correctness.
The main goal of this is to reduce latency and increase performance for some workloads. Under some
workloads it will result in an increase in power usage (Geekbench 5) while for other workloads it
will also result in a decrease in power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
As of v2 the patch includes a 'fast exit' path for arm-based and similar systems where only 2 idle
states are present. If there's just 2 idle states and the CPU is utilized, we can directly select
the shallowest state and save cycles by skipping the entire metrics mechanism.
Under the current implementation, the state will not be reduced by 1 if the change would lead to
selecting a polling state instead of a non-polling state.
This approach can outperform all the other currently available governors, at least on mobile device
workloads, which is why I think it is worth keeping as an option.
There is no particular attachment or reliance on TEO for this mechanism, I simply chose to base
it on TEO because it performs the best out of all the available options and I didn't think there was
any point in reinventing the wheel on the side of computing governor metrics. If a
better approach comes along at some point, there's no reason why the same idle aware mechanism
couldn't be used with any other metrics algorithm. That would, however, require implemeting it as
a separate governor rather than a TEO add-on.
As for how the extension performs in practice, below I'll add some benchmark results I got while
testing this patchset. All the benchmarks were run after holding the phone in the fridge for exactly
an hour each time to minimise the impact of thermal issues.
Pixel 6 (Android 12, mainline kernel 5.18, with newer mainline CFS patches):
1. Geekbench 5 (latency-sensitive, heavy load test)
The values below are gmean values across 3 back to back iteration of Geekbench 5.
As GB5 is a heavy benchmark, after more than 3 iterations intense throttling kicks in on mobile devices
resulting in skewed benchmark scores, which makes it difficult to collect reliable results. The actual
values for all of the governors can change between runs as the benchmark might be affected by factors
other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently achieved better
scores than all the other governors.
Benchmark scores
+-----------------+-------------+---------+-------------+
| metric | kernel | value | perc_diff |
|-----------------+-------------+---------+-------------|
| multicore_score | menu | 2826.5 | 0.0% |
| multicore_score | teo | 2764.8 | -2.18% |
| multicore_score | teo_util_v3 | 2849 | 0.8% |
| multicore_score | teo_util_v4 | 2865 | 1.36% |
| score | menu | 1053 | 0.0% |
| score | teo | 1050.7 | -0.22% |
| score | teo_util_v3 | 1059.6 | 0.63% |
| score | teo_util_v4 | 1057.6 | 0.44% |
+-----------------+-------------+---------+-------------+
Idle misses
The numbers are percentages of too deep and too shallow sleeps computed using the new trace
event - cpu_idle_miss. The percentage is obtained by counting the two types of misses over
the course of a run and then dividing them by the total number of wakeups in that run.
+-------------+-------------+--------------+
| wa_path | type | count_perc |
|-------------+-------------+--------------|
| menu | too deep | 14.994% |
| teo | too deep | 9.649% |
| teo_util_v3 | too deep | 4.298% |
| teo_util_v4 | too deep | 4.02 % |
| menu | too shallow | 2.497% |
| teo | too shallow | 5.963% |
| teo_util_v3 | too shallow | 13.773% |
| teo_util_v4 | too shallow | 14.598% |
+-------------+-------------+--------------+
Power usage [mW]
+--------------+----------+-------------+---------+-------------+
| chan_name | metric | kernel | value | perc_diff |
|--------------+----------+-------------+---------+-------------|
| total_power | gmean | menu | 2551.4 | 0.0% |
| total_power | gmean | teo | 2606.8 | 2.17% |
| total_power | gmean | teo_util_v3 | 2670.1 | 4.65% |
| total_power | gmean | teo_util_v4 | 2722.3 | 6.7% |
+--------------+----------+-------------+---------+-------------+
Task wakeup latency
+-----------------+----------+-------------+-------------+-------------+
| comm | metric | kernel | value | perc_diff |
|-----------------+----------+-------------+-------------+-------------|
| AsyncTask #1 | gmean | menu | 78.16μs | 0.0% |
| AsyncTask #1 | gmean | teo | 61.60μs | -21.19% |
| AsyncTask #1 | gmean | teo_util_v3 | 74.34μs | -4.89% |
| AsyncTask #1 | gmean | teo_util_v4 | 54.45μs | -30.34% |
| labs.geekbench5 | gmean | menu | 88.55μs | 0.0% |
| labs.geekbench5 | gmean | teo | 100.97μs | 14.02% |
| labs.geekbench5 | gmean | teo_util_v3 | 53.57μs | -39.5% |
| labs.geekbench5 | gmean | teo_util_v4 | 59.60μs | -32.7% |
+-----------------+----------+-------------+-------------+-------------+
In case of this benchmark, the difference in latency does seem to translate into better scores.
2. PCMark Web Browsing (non latency-sensitive, normal usage web browsing test)
The table below contains gmean values across 20 back to back iterations of PCMark 2 Web Browsing.
Benchmark scores
+----------------+-------------+---------+-------------+
| metric | kernel | value | perc_diff |
|----------------+-------------+---------+-------------|
| PcmaWebV2Score | menu | 5232 | 0.0% |
| PcmaWebV2Score | teo | 5219.8 | -0.23% |
| PcmaWebV2Score | teo_util_v3 | 5273.5 | 0.79% |
| PcmaWebV2Score | teo_util_v4 | 5239.9 | 0.15% |
+----------------+-------------+---------+-------------+
Idle misses
+-------------+-------------+--------------+
| wa_path | type | count_perc |
|-------------+-------------+--------------|
| menu | too deep | 24.814% |
| teo | too deep | 11.65% |
| teo_util_v3 | too deep | 3.481% |
| teo_util_v4 | too deep | 3.662% |
| menu | too shallow | 3.101% |
| teo | too shallow | 8.578% |
| teo_util_v3 | too shallow | 18.326% |
| teo_util_v4 | too shallow | 18.692% |
+-------------+-------------+--------------+
Power usage [mW]
+--------------+----------+-------------+---------+-------------+
| chan_name | metric | kernel | value | perc_diff |
|--------------+----------+-------------+---------+-------------|
| total_power | gmean | menu | 179.2 | 0.0% |
| total_power | gmean | teo | 184.8 | 3.1% |
| total_power | gmean | teo_util_v3 | 177.4 | -1.02% |
| total_power | gmean | teo_util_v4 | 184.1 | 2.71% |
+--------------+----------+-------------+---------+-------------+
Task wakeup latency
+-----------------+----------+-------------+-------------+-------------+
| comm | metric | kernel | value | perc_diff |
|-----------------+----------+-------------+-------------+-------------|
| CrRendererMain | gmean | menu | 236.63μs | 0.0% |
| CrRendererMain | gmean | teo | 201.85μs | -14.7% |
| CrRendererMain | gmean | teo_util_v3 | 106.46μs | -55.01% |
| CrRendererMain | gmean | teo_util_v4 | 106.72μs | -54.9% |
| chmark:workload | gmean | menu | 100.30μs | 0.0% |
| chmark:workload | gmean | teo | 80.20μs | -20.04% |
| chmark:workload | gmean | teo_util_v3 | 65.88μs | -34.32% |
| chmark:workload | gmean | teo_util_v4 | 57.90μs | -42.28% |
| surfaceflinger | gmean | menu | 97.57μs | 0.0% |
| surfaceflinger | gmean | teo | 98.86μs | 1.31% |
| surfaceflinger | gmean | teo_util_v3 | 56.49μs | -42.1% |
| surfaceflinger | gmean | teo_util_v4 | 72.68μs | -25.52% |
+-----------------+----------+-------------+-------------+-------------+
In this case the large latency improvement does not translate into a notable increase in benchmark score as
this particular benchmark mainly responds to changes in operating frequency.
3. Jankbench (locked 60hz screen) (normal usage UI test)
Frame durations
+---------------+------------------+---------+-------------+
| variable | kernel | value | perc_diff |
|---------------+------------------+---------+-------------|
| mean_duration | menu_60hz | 13.9 | 0.0% |
| mean_duration | teo_60hz | 14.7 | 6.0% |
| mean_duration | teo_util_v3_60hz | 13.8 | -0.87% |
| mean_duration | teo_util_v4_60hz | 12.6 | -9.0% |
+---------------+------------------+---------+-------------+
Jank percentage
+------------+------------------+---------+-------------+
| variable | kernel | value | perc_diff |
|------------+------------------+---------+-------------|
| jank_perc | menu_60hz | 1.5 | 0.0% |
| jank_perc | teo_60hz | 2.1 | 36.99% |
| jank_perc | teo_util_v3_60hz | 1.3 | -13.95% |
| jank_perc | teo_util_v4_60hz | 1.3 | -17.37% |
+------------+------------------+---------+-------------+
Idle misses
+------------------+-------------+--------------+
| wa_path | type | count_perc |
|------------------+-------------+--------------|
| menu_60hz | too deep | 26.00% |
| teo_60hz | too deep | 11.00% |
| teo_util_v3_60hz | too deep | 2.33% |
| teo_util_v4_60hz | too deep | 2.54% |
| menu_60hz | too shallow | 4.74% |
| teo_60hz | too shallow | 11.89% |
| teo_util_v3_60hz | too shallow | 21.78% |
| teo_util_v4_60hz | too shallow | 21.93% |
+------------------+-------------+--------------+
Power usage [mW]
+--------------+------------------+---------+-------------+
| chan_name | kernel | value | perc_diff |
|--------------+------------------+---------+-------------|
| total_power | menu_60hz | 144.6 | 0.0% |
| total_power | teo_60hz | 136.9 | -5.27% |
| total_power | teo_util_v3_60hz | 134.2 | -7.19% |
| total_power | teo_util_v4_60hz | 121.3 | -16.08% |
+--------------+------------------+---------+-------------+
Task wakeup latency
+-----------------+------------------+-------------+-------------+
| comm | kernel | value | perc_diff |
|-----------------+------------------+-------------+-------------|
| RenderThread | menu_60hz | 139.52μs | 0.0% |
| RenderThread | teo_60hz | 116.51μs | -16.49% |
| RenderThread | teo_util_v3_60hz | 86.76μs | -37.82% |
| RenderThread | teo_util_v4_60hz | 91.11μs | -34.7% |
| droid.benchmark | menu_60hz | 135.88μs | 0.0% |
| droid.benchmark | teo_60hz | 105.21μs | -22.57% |
| droid.benchmark | teo_util_v3_60hz | 83.92μs | -38.24% |
| droid.benchmark | teo_util_v4_60hz | 83.18μs | -38.79% |
| surfaceflinger | menu_60hz | 124.03μs | 0.0% |
| surfaceflinger | teo_60hz | 151.90μs | 22.47% |
| surfaceflinger | teo_util_v3_60hz | 100.19μs | -19.22% |
| surfaceflinger | teo_util_v4_60hz | 87.65μs | -29.33% |
+-----------------+------------------+-------------+-------------+
4. Speedometer 2 (heavy load web browsing test)
Benchmark scores
+-------------------+-------------+---------+-------------+
| metric | kernel | value | perc_diff |
|-------------------+-------------+---------+-------------|
| Speedometer Score | menu | 102 | 0.0% |
| Speedometer Score | teo | 104.9 | 2.88% |
| Speedometer Score | teo_util_v3 | 102.1 | 0.16% |
| Speedometer Score | teo_util_v4 | 103.8 | 1.83% |
+-------------------+-------------+---------+-------------+
Idle misses
+-------------+-------------+--------------+
| wa_path | type | count_perc |
|-------------+-------------+--------------|
| menu | too deep | 17.95% |
| teo | too deep | 6.46% |
| teo_util_v3 | too deep | 0.63% |
| teo_util_v4 | too deep | 0.64% |
| menu | too shallow | 3.86% |
| teo | too shallow | 8.21% |
| teo_util_v3 | too shallow | 14.72% |
| teo_util_v4 | too shallow | 14.43% |
+-------------+-------------+--------------+
Power usage [mW]
+--------------+----------+-------------+---------+-------------+
| chan_name | metric | kernel | value | perc_diff |
|--------------+----------+-------------+---------+-------------|
| total_power | gmean | menu | 2059 | 0.0% |
| total_power | gmean | teo | 2187.8 | 6.26% |
| total_power | gmean | teo_util_v3 | 2212.9 | 7.47% |
| total_power | gmean | teo_util_v4 | 2121.8 | 3.05% |
+--------------+----------+-------------+---------+-------------+
Task wakeup latency
+-----------------+----------+-------------+-------------+-------------+
| comm | metric | kernel | value | perc_diff |
|-----------------+----------+-------------+-------------+-------------|
| CrRendererMain | gmean | menu | 17.18μs | 0.0% |
| CrRendererMain | gmean | teo | 16.18μs | -5.82% |
| CrRendererMain | gmean | teo_util_v3 | 18.04μs | 5.05% |
| CrRendererMain | gmean | teo_util_v4 | 18.25μs | 6.27% |
| RenderThread | gmean | menu | 68.60μs | 0.0% |
| RenderThread | gmean | teo | 48.44μs | -29.39% |
| RenderThread | gmean | teo_util_v3 | 48.01μs | -30.02% |
| RenderThread | gmean | teo_util_v4 | 51.24μs | -25.3% |
| surfaceflinger | gmean | menu | 42.23μs | 0.0% |
| surfaceflinger | gmean | teo | 29.84μs | -29.33% |
| surfaceflinger | gmean | teo_util_v3 | 24.51μs | -41.95% |
| surfaceflinger | gmean | teo_util_v4 | 29.64μs | -29.8% |
+-----------------+----------+-------------+-------------+-------------+
Thank you for taking your time to read this!
--
Kajetan
v5 -> v6:
- amended some wording in the commit description & cover letter
- included test results in the commit description
- refactored checking the CPU utilized status to account for !SMP systems
- dropped the RFC from the patchset header
v4 -> v5:
- remove the restriction to only apply the mechanism for C1 candidate state
- clarify some code comments, fix comment style
- refactor the fast-exit path loop implementation
- move some cover letter information into the commit description
v3 -> v4:
- remove the chunk of code skipping metrics updates when the CPU was utilized
- include new test results and more benchmarks in the cover letter
v2 -> v3:
- add a patch adding an option to skip polling states in teo_find_shallower_state()
- only reduce the state if the candidate state is C1 and C0 is not a polling state
- add a check for polling states in the 2-states fast-exit path
- remove the ifdefs and Kconfig option
v1 -> v2:
- rework the mechanism to reduce selected state by 1 instead of directly selecting C0 (suggested by Doug Smythies)
- add a fast-exit path for systems with 2 idle states to not waste cycles on metrics when utilized
- fix typos in comments
- include a missing header
Kajetan Puchalski (2):
cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
cpuidle: teo: Introduce util-awareness
drivers/cpuidle/governors/teo.c | 100 ++++++++++++++++++++++++++++++--
1 file changed, 96 insertions(+), 4 deletions(-)
--
2.37.1
Modern interactive systems, such as recent Android phones, tend to have power
efficient shallow idle states. Selecting deeper idle states on a device while a
latency-sensitive workload is running can adversely impact performance due to
increased latency. Additionally, if the CPU wakes up from a deeper sleep before
its target residency as is often the case, it results in a waste of energy on
top of that.
At the moment, none of the available idle governors take any scheduling
information into account. They also tend to overestimate the idle
duration quite often, which causes them to select excessively deep idle
states, thus leading to increased wakeup latency and lower performance with no
power saving. For 'menu' while web browsing on Android for instance, those
types of wakeups ('too deep') account for over 24% of all wakeups.
At the same time, on some platforms idle state 0 can be power efficient
enough to warrant wanting to prefer it over idle state 1. This is because
the power usage of the two states can be so close that sufficient amounts
of too deep state 1 sleeps can completely offset the state 1 power saving to the
point where it would've been more power efficient to just use state 0 instead.
This is of course for systems where state 0 is not a polling state, such as
arm-based devices.
Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only
save less power than they otherwise could have. Too deep sleeps, on the other
hand, harm performance and nullify the potential power saving from using state 1 in
the first place. While taking this into account, it is clear that on balance it
is preferable for an idle governor to have more too shallow sleeps instead of
more too deep sleeps on those kinds of platforms.
This patch specifically tunes TEO to prefer shallower idle states in
order to reduce wakeup latency and achieve better performance.
To this end, before selecting the next idle state it uses the avg_util signal
of a CPU's runqueue in order to determine to what extent the CPU is being utilized.
This util value is then compared to a threshold defined as a percentage of the
cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation). If the
util is above the threshold, the idle state selected by TEO metrics will be
reduced by 1, thus selecting a shallower state. If the util is below the threshold,
the governor defaults to the TEO metrics mechanism to try to select the deepest
available idle state based on the closest timer event and its own correctness.
The main goal of this is to reduce latency and increase performance for some
workloads. Under some workloads it will result in an increase in power usage
(Geekbench 5) while for other workloads it will also result in a decrease in
power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
It can provide drastically decreased latency and performance benefits in certain
types of workloads that are sensitive to latency.
Example test results:
1. GB5 (better score, latency & more power usage)
| metric | menu | teo | teo-util-aware |
| ------------------------------------- | -------------- | ----------------- | ----------------- |
| gmean score | 2826.5 (0.0%) | 2764.8 (-2.18%) | 2865 (1.36%) |
| gmean power usage [mW] | 2551.4 (0.0%) | 2606.8 (2.17%) | 2722.3 (6.7%) |
| gmean too deep % | 14.99% | 9.65% | 4.02% |
| gmean too shallow % | 2.5% | 5.96% | 14.59% |
| gmean task wakeup latency (asynctask) | 78.16μs (0.0%) | 61.60μs (-21.19%) | 54.45μs (-30.34%) |
2. Jankbench (better score, latency & less power usage)
| metric | menu | teo | teo-util-aware |
| ------------------------------------- | -------------- | ----------------- | ----------------- |
| gmean frame duration | 13.9 (0.0%) | 14.7 (6.0%) | 12.6 (-9.0%) |
| gmean jank percentage | 1.5 (0.0%) | 2.1 (36.99%) | 1.3 (-17.37%) |
| gmean power usage [mW] | 144.6 (0.0%) | 136.9 (-5.27%) | 121.3 (-16.08%) |
| gmean too deep % | 26.00% | 11.00% | 2.54% |
| gmean too shallow % | 4.74% | 11.89% | 21.93% |
| gmean wakeup latency (RenderThread) | 139.5μs (0.0%) | 116.5μs (-16.49%) | 91.11μs (-34.7%) |
| gmean wakeup latency (surfaceflinger) | 124.0μs (0.0%) | 151.9μs (22.47%) | 87.65μs (-29.33%) |
Signed-off-by: Kajetan Puchalski <[email protected]>
---
drivers/cpuidle/governors/teo.c | 92 ++++++++++++++++++++++++++++++++-
1 file changed, 91 insertions(+), 1 deletion(-)
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index e2864474a98d..2a2be4f45b70 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -2,8 +2,13 @@
/*
* Timer events oriented CPU idle governor
*
+ * TEO governor:
* Copyright (C) 2018 - 2021 Intel Corporation
* Author: Rafael J. Wysocki <[email protected]>
+ *
+ * Util-awareness mechanism:
+ * Copyright (C) 2022 Arm Ltd.
+ * Author: Kajetan Puchalski <[email protected]>
*/
/**
@@ -99,14 +104,55 @@
* select the given idle state instead of the candidate one.
*
* 3. By default, select the candidate state.
+ *
+ * Util-awareness mechanism:
+ *
+ * The idea behind the util-awareness extension is that there are two distinct
+ * scenarios for the CPU which should result in two different approaches to idle
+ * state selection - utilized and not utilized.
+ *
+ * In this case, 'utilized' means that the average runqueue util of the CPU is
+ * above a certain threshold.
+ *
+ * When the CPU is utilized while going into idle, more likely than not it will
+ * be woken up to do more work soon and so a shallower idle state should be
+ * selected to minimise latency and maximise performance. When the CPU is not
+ * being utilized, the usual metrics-based approach to selecting the deepest
+ * available idle state should be preferred to take advantage of the power
+ * saving.
+ *
+ * In order to achieve this, the governor uses a utilization threshold.
+ * The threshold is computed per-cpu as a percentage of the CPU's capacity
+ * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
+ * seems to be getting the best results.
+ *
+ * Before selecting the next idle state, the governor compares the current CPU
+ * util to the precomputed util threshold. If it's below, it defaults to the
+ * TEO metrics mechanism. If it's above, the idle state will be reduced to C0
+ * as long as C0 is not a polling state.
*/
#include <linux/cpuidle.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/sched/topology.h>
#include <linux/tick.h>
+/*
+ * The number of bits to shift the cpu's capacity by in order to determine
+ * the utilized threshold.
+ *
+ * 6 was chosen based on testing as the number that achieved the best balance
+ * of power and performance on average.
+ *
+ * The resulting threshold is high enough to not be triggered by background
+ * noise and low enough to react quickly when activity starts to ramp up.
+ */
+#define UTIL_THRESHOLD_SHIFT 6
+
+
/*
* The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
* is used for decreasing metrics on a regular basis.
@@ -137,9 +183,11 @@ struct teo_bin {
* @time_span_ns: Time between idle state selection and post-wakeup update.
* @sleep_length_ns: Time till the closest timer event (at the selection time).
* @state_bins: Idle state data bins for this CPU.
- * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
+ * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
* @next_recent_idx: Index of the next @recent_idx entry to update.
* @recent_idx: Indices of bins corresponding to recent "intercepts".
+ * @util_threshold: Threshold above which the CPU is considered utilized
+ * @utilized: Whether the last sleep on the CPU happened while utilized
*/
struct teo_cpu {
s64 time_span_ns;
@@ -148,10 +196,29 @@ struct teo_cpu {
unsigned int total;
int next_recent_idx;
int recent_idx[NR_RECENT];
+ unsigned long util_threshold;
+ bool utilized;
};
static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
+/**
+ * teo_cpu_is_utilized - Check if the CPU's util is above the threshold
+ * @cpu: Target CPU
+ * @cpu_data: Governor CPU data for the target CPU
+ */
+#ifdef CONFIG_SMP
+static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
+{
+ return sched_cpu_util(cpu) > cpu_data->util_threshold;
+}
+#else
+static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
+{
+ return false;
+}
+#endif
+
/**
* teo_update - Update CPU metrics after wakeup.
* @drv: cpuidle driver containing state data.
@@ -323,6 +390,20 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
goto end;
}
+ cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
+ /*
+ * The cpu is being utilized over the threshold there are only 2 states to choose from.
+ * No need to consider metrics, choose the shallowest non-polling state and exit.
+ */
+ if (drv->state_count < 3 && cpu_data->utilized) {
+ for (i = 0; i < drv->state_count; ++i) {
+ if (!dev->states_usage[i].disable && !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
+ idx = i;
+ goto end;
+ }
+ }
+ }
+
/*
* Find the deepest idle state whose target residency does not exceed
* the current sleep length and the deepest idle state not deeper than
@@ -454,6 +535,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (idx > constraint_idx)
idx = constraint_idx;
+ /*
+ * If the CPU is being utilized over the threshold,
+ * choose a shallower non-polling state to improve latency
+ */
+ if (cpu_data->utilized)
+ idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
+
end:
/*
* Don't stop the tick if the selected state is a polling one or if the
@@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+ unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
int i;
memset(cpu_data, 0, sizeof(*cpu_data));
+ cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
for (i = 0; i < NR_RECENT; i++)
cpu_data->recent_idx[i] = -1;
--
2.37.1
On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
<[email protected]> wrote:
>
> Modern interactive systems, such as recent Android phones, tend to have power
> efficient shallow idle states. Selecting deeper idle states on a device while a
> latency-sensitive workload is running can adversely impact performance due to
> increased latency. Additionally, if the CPU wakes up from a deeper sleep before
> its target residency as is often the case, it results in a waste of energy on
> top of that.
>
> At the moment, none of the available idle governors take any scheduling
> information into account. They also tend to overestimate the idle
> duration quite often, which causes them to select excessively deep idle
> states, thus leading to increased wakeup latency and lower performance with no
> power saving. For 'menu' while web browsing on Android for instance, those
> types of wakeups ('too deep') account for over 24% of all wakeups.
>
> At the same time, on some platforms idle state 0 can be power efficient
> enough to warrant wanting to prefer it over idle state 1. This is because
> the power usage of the two states can be so close that sufficient amounts
> of too deep state 1 sleeps can completely offset the state 1 power saving to the
> point where it would've been more power efficient to just use state 0 instead.
> This is of course for systems where state 0 is not a polling state, such as
> arm-based devices.
>
> Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only
> save less power than they otherwise could have. Too deep sleeps, on the other
> hand, harm performance and nullify the potential power saving from using state 1 in
> the first place. While taking this into account, it is clear that on balance it
> is preferable for an idle governor to have more too shallow sleeps instead of
> more too deep sleeps on those kinds of platforms.
>
> This patch specifically tunes TEO to prefer shallower idle states in
> order to reduce wakeup latency and achieve better performance.
> To this end, before selecting the next idle state it uses the avg_util signal
> of a CPU's runqueue in order to determine to what extent the CPU is being utilized.
> This util value is then compared to a threshold defined as a percentage of the
> cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation). If the
> util is above the threshold, the idle state selected by TEO metrics will be
> reduced by 1, thus selecting a shallower state. If the util is below the threshold,
> the governor defaults to the TEO metrics mechanism to try to select the deepest
> available idle state based on the closest timer event and its own correctness.
>
> The main goal of this is to reduce latency and increase performance for some
> workloads. Under some workloads it will result in an increase in power usage
> (Geekbench 5) while for other workloads it will also result in a decrease in
> power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
>
> It can provide drastically decreased latency and performance benefits in certain
> types of workloads that are sensitive to latency.
>
> Example test results:
>
> 1. GB5 (better score, latency & more power usage)
>
> | metric | menu | teo | teo-util-aware |
> | ------------------------------------- | -------------- | ----------------- | ----------------- |
> | gmean score | 2826.5 (0.0%) | 2764.8 (-2.18%) | 2865 (1.36%) |
> | gmean power usage [mW] | 2551.4 (0.0%) | 2606.8 (2.17%) | 2722.3 (6.7%) |
> | gmean too deep % | 14.99% | 9.65% | 4.02% |
> | gmean too shallow % | 2.5% | 5.96% | 14.59% |
> | gmean task wakeup latency (asynctask) | 78.16μs (0.0%) | 61.60μs (-21.19%) | 54.45μs (-30.34%) |
>
> 2. Jankbench (better score, latency & less power usage)
>
> | metric | menu | teo | teo-util-aware |
> | ------------------------------------- | -------------- | ----------------- | ----------------- |
> | gmean frame duration | 13.9 (0.0%) | 14.7 (6.0%) | 12.6 (-9.0%) |
> | gmean jank percentage | 1.5 (0.0%) | 2.1 (36.99%) | 1.3 (-17.37%) |
> | gmean power usage [mW] | 144.6 (0.0%) | 136.9 (-5.27%) | 121.3 (-16.08%) |
> | gmean too deep % | 26.00% | 11.00% | 2.54% |
> | gmean too shallow % | 4.74% | 11.89% | 21.93% |
> | gmean wakeup latency (RenderThread) | 139.5μs (0.0%) | 116.5μs (-16.49%) | 91.11μs (-34.7%) |
> | gmean wakeup latency (surfaceflinger) | 124.0μs (0.0%) | 151.9μs (22.47%) | 87.65μs (-29.33%) |
>
> Signed-off-by: Kajetan Puchalski <[email protected]>
This looks good enough for me.
There are still a couple of things I would change in it, but I may as
well do that when applying it, so never mind.
The most important question for now is what the scheduler people think
about calling sched_cpu_util() from a CPU idle governor. Peter,
Vincent?
> ---
> drivers/cpuidle/governors/teo.c | 92 ++++++++++++++++++++++++++++++++-
> 1 file changed, 91 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index e2864474a98d..2a2be4f45b70 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -2,8 +2,13 @@
> /*
> * Timer events oriented CPU idle governor
> *
> + * TEO governor:
> * Copyright (C) 2018 - 2021 Intel Corporation
> * Author: Rafael J. Wysocki <[email protected]>
> + *
> + * Util-awareness mechanism:
> + * Copyright (C) 2022 Arm Ltd.
> + * Author: Kajetan Puchalski <[email protected]>
> */
>
> /**
> @@ -99,14 +104,55 @@
> * select the given idle state instead of the candidate one.
> *
> * 3. By default, select the candidate state.
> + *
> + * Util-awareness mechanism:
> + *
> + * The idea behind the util-awareness extension is that there are two distinct
> + * scenarios for the CPU which should result in two different approaches to idle
> + * state selection - utilized and not utilized.
> + *
> + * In this case, 'utilized' means that the average runqueue util of the CPU is
> + * above a certain threshold.
> + *
> + * When the CPU is utilized while going into idle, more likely than not it will
> + * be woken up to do more work soon and so a shallower idle state should be
> + * selected to minimise latency and maximise performance. When the CPU is not
> + * being utilized, the usual metrics-based approach to selecting the deepest
> + * available idle state should be preferred to take advantage of the power
> + * saving.
> + *
> + * In order to achieve this, the governor uses a utilization threshold.
> + * The threshold is computed per-cpu as a percentage of the CPU's capacity
> + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> + * seems to be getting the best results.
> + *
> + * Before selecting the next idle state, the governor compares the current CPU
> + * util to the precomputed util threshold. If it's below, it defaults to the
> + * TEO metrics mechanism. If it's above, the idle state will be reduced to C0
> + * as long as C0 is not a polling state.
> */
>
> #include <linux/cpuidle.h>
> #include <linux/jiffies.h>
> #include <linux/kernel.h>
> +#include <linux/sched.h>
> #include <linux/sched/clock.h>
> +#include <linux/sched/topology.h>
> #include <linux/tick.h>
>
> +/*
> + * The number of bits to shift the cpu's capacity by in order to determine
> + * the utilized threshold.
> + *
> + * 6 was chosen based on testing as the number that achieved the best balance
> + * of power and performance on average.
> + *
> + * The resulting threshold is high enough to not be triggered by background
> + * noise and low enough to react quickly when activity starts to ramp up.
> + */
> +#define UTIL_THRESHOLD_SHIFT 6
> +
> +
> /*
> * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> * is used for decreasing metrics on a regular basis.
> @@ -137,9 +183,11 @@ struct teo_bin {
> * @time_span_ns: Time between idle state selection and post-wakeup update.
> * @sleep_length_ns: Time till the closest timer event (at the selection time).
> * @state_bins: Idle state data bins for this CPU.
> - * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
> + * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
> * @next_recent_idx: Index of the next @recent_idx entry to update.
> * @recent_idx: Indices of bins corresponding to recent "intercepts".
> + * @util_threshold: Threshold above which the CPU is considered utilized
> + * @utilized: Whether the last sleep on the CPU happened while utilized
> */
> struct teo_cpu {
> s64 time_span_ns;
> @@ -148,10 +196,29 @@ struct teo_cpu {
> unsigned int total;
> int next_recent_idx;
> int recent_idx[NR_RECENT];
> + unsigned long util_threshold;
> + bool utilized;
> };
>
> static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> +/**
> + * teo_cpu_is_utilized - Check if the CPU's util is above the threshold
> + * @cpu: Target CPU
> + * @cpu_data: Governor CPU data for the target CPU
> + */
> +#ifdef CONFIG_SMP
> +static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> +{
> + return sched_cpu_util(cpu) > cpu_data->util_threshold;
> +}
> +#else
> +static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> +{
> + return false;
> +}
> +#endif
> +
> /**
> * teo_update - Update CPU metrics after wakeup.
> * @drv: cpuidle driver containing state data.
> @@ -323,6 +390,20 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> goto end;
> }
>
> + cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> + /*
> + * The cpu is being utilized over the threshold there are only 2 states to choose from.
> + * No need to consider metrics, choose the shallowest non-polling state and exit.
> + */
> + if (drv->state_count < 3 && cpu_data->utilized) {
> + for (i = 0; i < drv->state_count; ++i) {
> + if (!dev->states_usage[i].disable && !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> + idx = i;
> + goto end;
> + }
> + }
> + }
> +
> /*
> * Find the deepest idle state whose target residency does not exceed
> * the current sleep length and the deepest idle state not deeper than
> @@ -454,6 +535,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> if (idx > constraint_idx)
> idx = constraint_idx;
>
> + /*
> + * If the CPU is being utilized over the threshold,
> + * choose a shallower non-polling state to improve latency
> + */
> + if (cpu_data->utilized)
> + idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> +
> end:
> /*
> * Don't stop the tick if the selected state is a polling one or if the
> @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> struct cpuidle_device *dev)
> {
> struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> int i;
>
> memset(cpu_data, 0, sizeof(*cpu_data));
> + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
> for (i = 0; i < NR_RECENT; i++)
> cpu_data->recent_idx[i] = -1;
> --
> 2.37.1
>
On 1/5/23 15:07, Rafael J. Wysocki wrote:
> On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
> <[email protected]> wrote:
>>
>> Modern interactive systems, such as recent Android phones, tend to have power
>> efficient shallow idle states. Selecting deeper idle states on a device while a
>> latency-sensitive workload is running can adversely impact performance due to
>> increased latency. Additionally, if the CPU wakes up from a deeper sleep before
>> its target residency as is often the case, it results in a waste of energy on
>> top of that.
>>
>> At the moment, none of the available idle governors take any scheduling
>> information into account. They also tend to overestimate the idle
>> duration quite often, which causes them to select excessively deep idle
>> states, thus leading to increased wakeup latency and lower performance with no
>> power saving. For 'menu' while web browsing on Android for instance, those
>> types of wakeups ('too deep') account for over 24% of all wakeups.
>>
>> At the same time, on some platforms idle state 0 can be power efficient
>> enough to warrant wanting to prefer it over idle state 1. This is because
>> the power usage of the two states can be so close that sufficient amounts
>> of too deep state 1 sleeps can completely offset the state 1 power saving to the
>> point where it would've been more power efficient to just use state 0 instead.
>> This is of course for systems where state 0 is not a polling state, such as
>> arm-based devices.
>>
>> Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only
>> save less power than they otherwise could have. Too deep sleeps, on the other
>> hand, harm performance and nullify the potential power saving from using state 1 in
>> the first place. While taking this into account, it is clear that on balance it
>> is preferable for an idle governor to have more too shallow sleeps instead of
>> more too deep sleeps on those kinds of platforms.
>>
>> This patch specifically tunes TEO to prefer shallower idle states in
>> order to reduce wakeup latency and achieve better performance.
>> To this end, before selecting the next idle state it uses the avg_util signal
>> of a CPU's runqueue in order to determine to what extent the CPU is being utilized.
>> This util value is then compared to a threshold defined as a percentage of the
>> cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation). If the
>> util is above the threshold, the idle state selected by TEO metrics will be
>> reduced by 1, thus selecting a shallower state. If the util is below the threshold,
>> the governor defaults to the TEO metrics mechanism to try to select the deepest
>> available idle state based on the closest timer event and its own correctness.
>>
>> The main goal of this is to reduce latency and increase performance for some
>> workloads. Under some workloads it will result in an increase in power usage
>> (Geekbench 5) while for other workloads it will also result in a decrease in
>> power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
>>
>> It can provide drastically decreased latency and performance benefits in certain
>> types of workloads that are sensitive to latency.
>>
>> Example test results:
>>
>> 1. GB5 (better score, latency & more power usage)
>>
>> | metric | menu | teo | teo-util-aware |
>> | ------------------------------------- | -------------- | ----------------- | ----------------- |
>> | gmean score | 2826.5 (0.0%) | 2764.8 (-2.18%) | 2865 (1.36%) |
>> | gmean power usage [mW] | 2551.4 (0.0%) | 2606.8 (2.17%) | 2722.3 (6.7%) |
>> | gmean too deep % | 14.99% | 9.65% | 4.02% |
>> | gmean too shallow % | 2.5% | 5.96% | 14.59% |
>> | gmean task wakeup latency (asynctask) | 78.16μs (0.0%) | 61.60μs (-21.19%) | 54.45μs (-30.34%) |
>>
>> 2. Jankbench (better score, latency & less power usage)
>>
>> | metric | menu | teo | teo-util-aware |
>> | ------------------------------------- | -------------- | ----------------- | ----------------- |
>> | gmean frame duration | 13.9 (0.0%) | 14.7 (6.0%) | 12.6 (-9.0%) |
>> | gmean jank percentage | 1.5 (0.0%) | 2.1 (36.99%) | 1.3 (-17.37%) |
>> | gmean power usage [mW] | 144.6 (0.0%) | 136.9 (-5.27%) | 121.3 (-16.08%) |
>> | gmean too deep % | 26.00% | 11.00% | 2.54% |
>> | gmean too shallow % | 4.74% | 11.89% | 21.93% |
>> | gmean wakeup latency (RenderThread) | 139.5μs (0.0%) | 116.5μs (-16.49%) | 91.11μs (-34.7%) |
>> | gmean wakeup latency (surfaceflinger) | 124.0μs (0.0%) | 151.9μs (22.47%) | 87.65μs (-29.33%) |
>>
>> Signed-off-by: Kajetan Puchalski <[email protected]>
>
> This looks good enough for me.
>
> There are still a couple of things I would change in it, but I may as
> well do that when applying it, so never mind.
>
> The most important question for now is what the scheduler people think
> about calling sched_cpu_util() from a CPU idle governor. Peter,
> Vincent?
>
We have a precedence in thermal framework for purpose of thermal
governor - IPA. It's been there for a while to estimate the power
of CPUs in the frequency domain for cpufreq_cooling device [1].
That's how this API sched_cpu_util() got created. Then it was also
adopted to PowerCap DTPM [2] (for the same power estimation purpose).
It's a function available with form include/linux/sched.h so I don't
see reasons why to not use it.
[1]
https://elixir.bootlin.com/linux/latest/source/drivers/thermal/cpufreq_cooling.c#L151
[2]
https://elixir.bootlin.com/linux/latest/source/drivers/powercap/dtpm_cpu.c#L83
On Thu, 5 Jan 2023 at 16:07, Rafael J. Wysocki <[email protected]> wrote:
>
> On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
> <[email protected]> wrote:
> >
> > Modern interactive systems, such as recent Android phones, tend to have power
> > efficient shallow idle states. Selecting deeper idle states on a device while a
> > latency-sensitive workload is running can adversely impact performance due to
> > increased latency. Additionally, if the CPU wakes up from a deeper sleep before
> > its target residency as is often the case, it results in a waste of energy on
> > top of that.
> >
> > At the moment, none of the available idle governors take any scheduling
> > information into account. They also tend to overestimate the idle
> > duration quite often, which causes them to select excessively deep idle
> > states, thus leading to increased wakeup latency and lower performance with no
> > power saving. For 'menu' while web browsing on Android for instance, those
> > types of wakeups ('too deep') account for over 24% of all wakeups.
> >
> > At the same time, on some platforms idle state 0 can be power efficient
> > enough to warrant wanting to prefer it over idle state 1. This is because
> > the power usage of the two states can be so close that sufficient amounts
> > of too deep state 1 sleeps can completely offset the state 1 power saving to the
> > point where it would've been more power efficient to just use state 0 instead.
> > This is of course for systems where state 0 is not a polling state, such as
> > arm-based devices.
> >
> > Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only
> > save less power than they otherwise could have. Too deep sleeps, on the other
> > hand, harm performance and nullify the potential power saving from using state 1 in
> > the first place. While taking this into account, it is clear that on balance it
> > is preferable for an idle governor to have more too shallow sleeps instead of
> > more too deep sleeps on those kinds of platforms.
> >
> > This patch specifically tunes TEO to prefer shallower idle states in
> > order to reduce wakeup latency and achieve better performance.
> > To this end, before selecting the next idle state it uses the avg_util signal
> > of a CPU's runqueue in order to determine to what extent the CPU is being utilized.
> > This util value is then compared to a threshold defined as a percentage of the
> > cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation). If the
> > util is above the threshold, the idle state selected by TEO metrics will be
> > reduced by 1, thus selecting a shallower state. If the util is below the threshold,
> > the governor defaults to the TEO metrics mechanism to try to select the deepest
> > available idle state based on the closest timer event and its own correctness.
> >
> > The main goal of this is to reduce latency and increase performance for some
> > workloads. Under some workloads it will result in an increase in power usage
> > (Geekbench 5) while for other workloads it will also result in a decrease in
> > power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
> >
> > It can provide drastically decreased latency and performance benefits in certain
> > types of workloads that are sensitive to latency.
> >
> > Example test results:
> >
> > 1. GB5 (better score, latency & more power usage)
> >
> > | metric | menu | teo | teo-util-aware |
> > | ------------------------------------- | -------------- | ----------------- | ----------------- |
> > | gmean score | 2826.5 (0.0%) | 2764.8 (-2.18%) | 2865 (1.36%) |
> > | gmean power usage [mW] | 2551.4 (0.0%) | 2606.8 (2.17%) | 2722.3 (6.7%) |
> > | gmean too deep % | 14.99% | 9.65% | 4.02% |
> > | gmean too shallow % | 2.5% | 5.96% | 14.59% |
> > | gmean task wakeup latency (asynctask) | 78.16μs (0.0%) | 61.60μs (-21.19%) | 54.45μs (-30.34%) |
> >
> > 2. Jankbench (better score, latency & less power usage)
> >
> > | metric | menu | teo | teo-util-aware |
> > | ------------------------------------- | -------------- | ----------------- | ----------------- |
> > | gmean frame duration | 13.9 (0.0%) | 14.7 (6.0%) | 12.6 (-9.0%) |
> > | gmean jank percentage | 1.5 (0.0%) | 2.1 (36.99%) | 1.3 (-17.37%) |
> > | gmean power usage [mW] | 144.6 (0.0%) | 136.9 (-5.27%) | 121.3 (-16.08%) |
> > | gmean too deep % | 26.00% | 11.00% | 2.54% |
> > | gmean too shallow % | 4.74% | 11.89% | 21.93% |
> > | gmean wakeup latency (RenderThread) | 139.5μs (0.0%) | 116.5μs (-16.49%) | 91.11μs (-34.7%) |
> > | gmean wakeup latency (surfaceflinger) | 124.0μs (0.0%) | 151.9μs (22.47%) | 87.65μs (-29.33%) |
> >
> > Signed-off-by: Kajetan Puchalski <[email protected]>
>
> This looks good enough for me.
>
> There are still a couple of things I would change in it, but I may as
> well do that when applying it, so never mind.
>
> The most important question for now is what the scheduler people think
> about calling sched_cpu_util() from a CPU idle governor. Peter,
> Vincent?
I don't see a problem with using sched_cpu_util() outside the
scheduler as it's already used in thermal and dtpm to get cpu
utilization.
>
> > ---
> > drivers/cpuidle/governors/teo.c | 92 ++++++++++++++++++++++++++++++++-
> > 1 file changed, 91 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> > index e2864474a98d..2a2be4f45b70 100644
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -2,8 +2,13 @@
> > /*
> > * Timer events oriented CPU idle governor
> > *
> > + * TEO governor:
> > * Copyright (C) 2018 - 2021 Intel Corporation
> > * Author: Rafael J. Wysocki <[email protected]>
> > + *
> > + * Util-awareness mechanism:
> > + * Copyright (C) 2022 Arm Ltd.
> > + * Author: Kajetan Puchalski <[email protected]>
> > */
> >
> > /**
> > @@ -99,14 +104,55 @@
> > * select the given idle state instead of the candidate one.
> > *
> > * 3. By default, select the candidate state.
> > + *
> > + * Util-awareness mechanism:
> > + *
> > + * The idea behind the util-awareness extension is that there are two distinct
> > + * scenarios for the CPU which should result in two different approaches to idle
> > + * state selection - utilized and not utilized.
> > + *
> > + * In this case, 'utilized' means that the average runqueue util of the CPU is
> > + * above a certain threshold.
> > + *
> > + * When the CPU is utilized while going into idle, more likely than not it will
> > + * be woken up to do more work soon and so a shallower idle state should be
> > + * selected to minimise latency and maximise performance. When the CPU is not
> > + * being utilized, the usual metrics-based approach to selecting the deepest
> > + * available idle state should be preferred to take advantage of the power
> > + * saving.
> > + *
> > + * In order to achieve this, the governor uses a utilization threshold.
> > + * The threshold is computed per-cpu as a percentage of the CPU's capacity
> > + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> > + * seems to be getting the best results.
> > + *
> > + * Before selecting the next idle state, the governor compares the current CPU
> > + * util to the precomputed util threshold. If it's below, it defaults to the
> > + * TEO metrics mechanism. If it's above, the idle state will be reduced to C0
> > + * as long as C0 is not a polling state.
> > */
> >
> > #include <linux/cpuidle.h>
> > #include <linux/jiffies.h>
> > #include <linux/kernel.h>
> > +#include <linux/sched.h>
> > #include <linux/sched/clock.h>
> > +#include <linux/sched/topology.h>
> > #include <linux/tick.h>
> >
> > +/*
> > + * The number of bits to shift the cpu's capacity by in order to determine
> > + * the utilized threshold.
> > + *
> > + * 6 was chosen based on testing as the number that achieved the best balance
> > + * of power and performance on average.
> > + *
> > + * The resulting threshold is high enough to not be triggered by background
> > + * noise and low enough to react quickly when activity starts to ramp up.
> > + */
> > +#define UTIL_THRESHOLD_SHIFT 6
> > +
> > +
> > /*
> > * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> > * is used for decreasing metrics on a regular basis.
> > @@ -137,9 +183,11 @@ struct teo_bin {
> > * @time_span_ns: Time between idle state selection and post-wakeup update.
> > * @sleep_length_ns: Time till the closest timer event (at the selection time).
> > * @state_bins: Idle state data bins for this CPU.
> > - * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
> > + * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
> > * @next_recent_idx: Index of the next @recent_idx entry to update.
> > * @recent_idx: Indices of bins corresponding to recent "intercepts".
> > + * @util_threshold: Threshold above which the CPU is considered utilized
> > + * @utilized: Whether the last sleep on the CPU happened while utilized
> > */
> > struct teo_cpu {
> > s64 time_span_ns;
> > @@ -148,10 +196,29 @@ struct teo_cpu {
> > unsigned int total;
> > int next_recent_idx;
> > int recent_idx[NR_RECENT];
> > + unsigned long util_threshold;
> > + bool utilized;
> > };
> >
> > static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> >
> > +/**
> > + * teo_cpu_is_utilized - Check if the CPU's util is above the threshold
> > + * @cpu: Target CPU
> > + * @cpu_data: Governor CPU data for the target CPU
> > + */
> > +#ifdef CONFIG_SMP
> > +static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> > +{
> > + return sched_cpu_util(cpu) > cpu_data->util_threshold;
> > +}
> > +#else
> > +static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> > +{
> > + return false;
> > +}
> > +#endif
> > +
> > /**
> > * teo_update - Update CPU metrics after wakeup.
> > * @drv: cpuidle driver containing state data.
> > @@ -323,6 +390,20 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > goto end;
> > }
> >
> > + cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> > + /*
> > + * The cpu is being utilized over the threshold there are only 2 states to choose from.
> > + * No need to consider metrics, choose the shallowest non-polling state and exit.
> > + */
> > + if (drv->state_count < 3 && cpu_data->utilized) {
> > + for (i = 0; i < drv->state_count; ++i) {
> > + if (!dev->states_usage[i].disable && !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> > + idx = i;
> > + goto end;
> > + }
> > + }
> > + }
> > +
> > /*
> > * Find the deepest idle state whose target residency does not exceed
> > * the current sleep length and the deepest idle state not deeper than
> > @@ -454,6 +535,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > if (idx > constraint_idx)
> > idx = constraint_idx;
> >
> > + /*
> > + * If the CPU is being utilized over the threshold,
> > + * choose a shallower non-polling state to improve latency
> > + */
> > + if (cpu_data->utilized)
> > + idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> > +
> > end:
> > /*
> > * Don't stop the tick if the selected state is a polling one or if the
> > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > struct cpuidle_device *dev)
> > {
> > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> > int i;
> >
> > memset(cpu_data, 0, sizeof(*cpu_data));
> > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> >
> > for (i = 0; i < NR_RECENT; i++)
> > cpu_data->recent_idx[i] = -1;
> > --
> > 2.37.1
> >
On Thu, Jan 5, 2023 at 4:34 PM Vincent Guittot
<[email protected]> wrote:
>
> On Thu, 5 Jan 2023 at 16:07, Rafael J. Wysocki <[email protected]> wrote:
> >
> > On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
> > <[email protected]> wrote:
> > >
> > > Modern interactive systems, such as recent Android phones, tend to have power
> > > efficient shallow idle states. Selecting deeper idle states on a device while a
> > > latency-sensitive workload is running can adversely impact performance due to
> > > increased latency. Additionally, if the CPU wakes up from a deeper sleep before
> > > its target residency as is often the case, it results in a waste of energy on
> > > top of that.
> > >
> > > At the moment, none of the available idle governors take any scheduling
> > > information into account. They also tend to overestimate the idle
> > > duration quite often, which causes them to select excessively deep idle
> > > states, thus leading to increased wakeup latency and lower performance with no
> > > power saving. For 'menu' while web browsing on Android for instance, those
> > > types of wakeups ('too deep') account for over 24% of all wakeups.
> > >
> > > At the same time, on some platforms idle state 0 can be power efficient
> > > enough to warrant wanting to prefer it over idle state 1. This is because
> > > the power usage of the two states can be so close that sufficient amounts
> > > of too deep state 1 sleeps can completely offset the state 1 power saving to the
> > > point where it would've been more power efficient to just use state 0 instead.
> > > This is of course for systems where state 0 is not a polling state, such as
> > > arm-based devices.
> > >
> > > Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only
> > > save less power than they otherwise could have. Too deep sleeps, on the other
> > > hand, harm performance and nullify the potential power saving from using state 1 in
> > > the first place. While taking this into account, it is clear that on balance it
> > > is preferable for an idle governor to have more too shallow sleeps instead of
> > > more too deep sleeps on those kinds of platforms.
> > >
> > > This patch specifically tunes TEO to prefer shallower idle states in
> > > order to reduce wakeup latency and achieve better performance.
> > > To this end, before selecting the next idle state it uses the avg_util signal
> > > of a CPU's runqueue in order to determine to what extent the CPU is being utilized.
> > > This util value is then compared to a threshold defined as a percentage of the
> > > cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation). If the
> > > util is above the threshold, the idle state selected by TEO metrics will be
> > > reduced by 1, thus selecting a shallower state. If the util is below the threshold,
> > > the governor defaults to the TEO metrics mechanism to try to select the deepest
> > > available idle state based on the closest timer event and its own correctness.
> > >
> > > The main goal of this is to reduce latency and increase performance for some
> > > workloads. Under some workloads it will result in an increase in power usage
> > > (Geekbench 5) while for other workloads it will also result in a decrease in
> > > power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
> > >
> > > It can provide drastically decreased latency and performance benefits in certain
> > > types of workloads that are sensitive to latency.
> > >
> > > Example test results:
> > >
> > > 1. GB5 (better score, latency & more power usage)
> > >
> > > | metric | menu | teo | teo-util-aware |
> > > | ------------------------------------- | -------------- | ----------------- | ----------------- |
> > > | gmean score | 2826.5 (0.0%) | 2764.8 (-2.18%) | 2865 (1.36%) |
> > > | gmean power usage [mW] | 2551.4 (0.0%) | 2606.8 (2.17%) | 2722.3 (6.7%) |
> > > | gmean too deep % | 14.99% | 9.65% | 4.02% |
> > > | gmean too shallow % | 2.5% | 5.96% | 14.59% |
> > > | gmean task wakeup latency (asynctask) | 78.16μs (0.0%) | 61.60μs (-21.19%) | 54.45μs (-30.34%) |
> > >
> > > 2. Jankbench (better score, latency & less power usage)
> > >
> > > | metric | menu | teo | teo-util-aware |
> > > | ------------------------------------- | -------------- | ----------------- | ----------------- |
> > > | gmean frame duration | 13.9 (0.0%) | 14.7 (6.0%) | 12.6 (-9.0%) |
> > > | gmean jank percentage | 1.5 (0.0%) | 2.1 (36.99%) | 1.3 (-17.37%) |
> > > | gmean power usage [mW] | 144.6 (0.0%) | 136.9 (-5.27%) | 121.3 (-16.08%) |
> > > | gmean too deep % | 26.00% | 11.00% | 2.54% |
> > > | gmean too shallow % | 4.74% | 11.89% | 21.93% |
> > > | gmean wakeup latency (RenderThread) | 139.5μs (0.0%) | 116.5μs (-16.49%) | 91.11μs (-34.7%) |
> > > | gmean wakeup latency (surfaceflinger) | 124.0μs (0.0%) | 151.9μs (22.47%) | 87.65μs (-29.33%) |
> > >
> > > Signed-off-by: Kajetan Puchalski <[email protected]>
> >
> > This looks good enough for me.
> >
> > There are still a couple of things I would change in it, but I may as
> > well do that when applying it, so never mind.
> >
> > The most important question for now is what the scheduler people think
> > about calling sched_cpu_util() from a CPU idle governor. Peter,
> > Vincent?
>
> I don't see a problem with using sched_cpu_util() outside the
> scheduler as it's already used in thermal and dtpm to get cpu
> utilization.
OK, thanks!
> >
> > > ---
> > > drivers/cpuidle/governors/teo.c | 92 ++++++++++++++++++++++++++++++++-
> > > 1 file changed, 91 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> > > index e2864474a98d..2a2be4f45b70 100644
> > > --- a/drivers/cpuidle/governors/teo.c
> > > +++ b/drivers/cpuidle/governors/teo.c
> > > @@ -2,8 +2,13 @@
> > > /*
> > > * Timer events oriented CPU idle governor
> > > *
> > > + * TEO governor:
> > > * Copyright (C) 2018 - 2021 Intel Corporation
> > > * Author: Rafael J. Wysocki <[email protected]>
> > > + *
> > > + * Util-awareness mechanism:
> > > + * Copyright (C) 2022 Arm Ltd.
> > > + * Author: Kajetan Puchalski <[email protected]>
> > > */
> > >
> > > /**
> > > @@ -99,14 +104,55 @@
> > > * select the given idle state instead of the candidate one.
> > > *
> > > * 3. By default, select the candidate state.
> > > + *
> > > + * Util-awareness mechanism:
> > > + *
> > > + * The idea behind the util-awareness extension is that there are two distinct
> > > + * scenarios for the CPU which should result in two different approaches to idle
> > > + * state selection - utilized and not utilized.
> > > + *
> > > + * In this case, 'utilized' means that the average runqueue util of the CPU is
> > > + * above a certain threshold.
> > > + *
> > > + * When the CPU is utilized while going into idle, more likely than not it will
> > > + * be woken up to do more work soon and so a shallower idle state should be
> > > + * selected to minimise latency and maximise performance. When the CPU is not
> > > + * being utilized, the usual metrics-based approach to selecting the deepest
> > > + * available idle state should be preferred to take advantage of the power
> > > + * saving.
> > > + *
> > > + * In order to achieve this, the governor uses a utilization threshold.
> > > + * The threshold is computed per-cpu as a percentage of the CPU's capacity
> > > + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> > > + * seems to be getting the best results.
> > > + *
> > > + * Before selecting the next idle state, the governor compares the current CPU
> > > + * util to the precomputed util threshold. If it's below, it defaults to the
> > > + * TEO metrics mechanism. If it's above, the idle state will be reduced to C0
> > > + * as long as C0 is not a polling state.
> > > */
> > >
> > > #include <linux/cpuidle.h>
> > > #include <linux/jiffies.h>
> > > #include <linux/kernel.h>
> > > +#include <linux/sched.h>
> > > #include <linux/sched/clock.h>
> > > +#include <linux/sched/topology.h>
> > > #include <linux/tick.h>
> > >
> > > +/*
> > > + * The number of bits to shift the cpu's capacity by in order to determine
> > > + * the utilized threshold.
> > > + *
> > > + * 6 was chosen based on testing as the number that achieved the best balance
> > > + * of power and performance on average.
> > > + *
> > > + * The resulting threshold is high enough to not be triggered by background
> > > + * noise and low enough to react quickly when activity starts to ramp up.
> > > + */
> > > +#define UTIL_THRESHOLD_SHIFT 6
> > > +
> > > +
> > > /*
> > > * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> > > * is used for decreasing metrics on a regular basis.
> > > @@ -137,9 +183,11 @@ struct teo_bin {
> > > * @time_span_ns: Time between idle state selection and post-wakeup update.
> > > * @sleep_length_ns: Time till the closest timer event (at the selection time).
> > > * @state_bins: Idle state data bins for this CPU.
> > > - * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
> > > + * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
> > > * @next_recent_idx: Index of the next @recent_idx entry to update.
> > > * @recent_idx: Indices of bins corresponding to recent "intercepts".
> > > + * @util_threshold: Threshold above which the CPU is considered utilized
> > > + * @utilized: Whether the last sleep on the CPU happened while utilized
> > > */
> > > struct teo_cpu {
> > > s64 time_span_ns;
> > > @@ -148,10 +196,29 @@ struct teo_cpu {
> > > unsigned int total;
> > > int next_recent_idx;
> > > int recent_idx[NR_RECENT];
> > > + unsigned long util_threshold;
> > > + bool utilized;
> > > };
> > >
> > > static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> > >
> > > +/**
> > > + * teo_cpu_is_utilized - Check if the CPU's util is above the threshold
> > > + * @cpu: Target CPU
> > > + * @cpu_data: Governor CPU data for the target CPU
> > > + */
> > > +#ifdef CONFIG_SMP
> > > +static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> > > +{
> > > + return sched_cpu_util(cpu) > cpu_data->util_threshold;
> > > +}
> > > +#else
> > > +static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> > > +{
> > > + return false;
> > > +}
> > > +#endif
> > > +
> > > /**
> > > * teo_update - Update CPU metrics after wakeup.
> > > * @drv: cpuidle driver containing state data.
> > > @@ -323,6 +390,20 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > > goto end;
> > > }
> > >
> > > + cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> > > + /*
> > > + * The cpu is being utilized over the threshold there are only 2 states to choose from.
> > > + * No need to consider metrics, choose the shallowest non-polling state and exit.
> > > + */
> > > + if (drv->state_count < 3 && cpu_data->utilized) {
> > > + for (i = 0; i < drv->state_count; ++i) {
> > > + if (!dev->states_usage[i].disable && !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> > > + idx = i;
> > > + goto end;
> > > + }
> > > + }
> > > + }
> > > +
> > > /*
> > > * Find the deepest idle state whose target residency does not exceed
> > > * the current sleep length and the deepest idle state not deeper than
> > > @@ -454,6 +535,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > > if (idx > constraint_idx)
> > > idx = constraint_idx;
> > >
> > > + /*
> > > + * If the CPU is being utilized over the threshold,
> > > + * choose a shallower non-polling state to improve latency
> > > + */
> > > + if (cpu_data->utilized)
> > > + idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> > > +
> > > end:
> > > /*
> > > * Don't stop the tick if the selected state is a polling one or if the
> > > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > > struct cpuidle_device *dev)
> > > {
> > > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> > > int i;
> > >
> > > memset(cpu_data, 0, sizeof(*cpu_data));
> > > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> > >
> > > for (i = 0; i < NR_RECENT; i++)
> > > cpu_data->recent_idx[i] = -1;
> > > --
> > > 2.37.1
> > >
On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
<[email protected]> wrote:
>
> Hi,
>
> At the moment, none of the available idle governors take any scheduling
> information into account. They also tend to overestimate the idle
> duration quite often, which causes them to select excessively deep idle
> states, thus leading to increased wakeup latency and lower performance with no
> power saving. For 'menu' while web browsing on Android for instance, those
> types of wakeups ('too deep') account for over 24% of all wakeups.
>
> At the same time, on some platforms idle state 0 can be power efficient
> enough to warrant wanting to prefer it over idle state 1. This is because
> the power usage of the two states can be so close that sufficient amounts
> of too deep state 1 sleeps can completely offset the state 1 power saving to the
> point where it would've been more power efficient to just use state 0 instead.
> This is of course for systems where state 0 is not a polling state, such as
> arm-based devices.
>
> Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only
> save less power than they otherwise could have. Too deep sleeps, on the other
> hand, harm performance and nullify the potential power saving from using state 1 in
> the first place. While taking this into account, it is clear that on balance it
> is preferable for an idle governor to have more too shallow sleeps instead of
> more too deep sleeps on those kinds of platforms.
>
> Currently the best available governor under this metric is TEO which on average results in less than
> half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
> increased performance in the process.
>
> This patchset specifically tunes TEO to prefer shallower idle states in order to reduce wakeup latency
> and achieve better performance. To this end, before selecting the next idle state it uses the avg_util
> signal of a CPU's runqueue in order to determine to what extent the CPU is being utilized.
> This util value is then compared to a threshold defined as a percentage of the cpu's capacity
> (capacity >> 6 ie. ~1.5% in the current implementation). If the util is above the threshold, the idle
> state selected by TEO metrics will be reduced by 1, thus selecting a shallower state. If the util is
> below the threshold, the governor defaults to the TEO metrics mechanism to try to select the deepest
> available idle state based on the closest timer event and its own correctness.
>
> The main goal of this is to reduce latency and increase performance for some workloads. Under some
> workloads it will result in an increase in power usage (Geekbench 5) while for other workloads it
> will also result in a decrease in power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
>
> As of v2 the patch includes a 'fast exit' path for arm-based and similar systems where only 2 idle
> states are present. If there's just 2 idle states and the CPU is utilized, we can directly select
> the shallowest state and save cycles by skipping the entire metrics mechanism.
>
> Under the current implementation, the state will not be reduced by 1 if the change would lead to
> selecting a polling state instead of a non-polling state.
>
> This approach can outperform all the other currently available governors, at least on mobile device
> workloads, which is why I think it is worth keeping as an option.
>
> There is no particular attachment or reliance on TEO for this mechanism, I simply chose to base
> it on TEO because it performs the best out of all the available options and I didn't think there was
> any point in reinventing the wheel on the side of computing governor metrics. If a
> better approach comes along at some point, there's no reason why the same idle aware mechanism
> couldn't be used with any other metrics algorithm. That would, however, require implemeting it as
> a separate governor rather than a TEO add-on.
>
> As for how the extension performs in practice, below I'll add some benchmark results I got while
> testing this patchset. All the benchmarks were run after holding the phone in the fridge for exactly
> an hour each time to minimise the impact of thermal issues.
>
> Pixel 6 (Android 12, mainline kernel 5.18, with newer mainline CFS patches):
>
> 1. Geekbench 5 (latency-sensitive, heavy load test)
>
> The values below are gmean values across 3 back to back iteration of Geekbench 5.
> As GB5 is a heavy benchmark, after more than 3 iterations intense throttling kicks in on mobile devices
> resulting in skewed benchmark scores, which makes it difficult to collect reliable results. The actual
> values for all of the governors can change between runs as the benchmark might be affected by factors
> other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently achieved better
> scores than all the other governors.
>
> Benchmark scores
>
> +-----------------+-------------+---------+-------------+
> | metric | kernel | value | perc_diff |
> |-----------------+-------------+---------+-------------|
> | multicore_score | menu | 2826.5 | 0.0% |
> | multicore_score | teo | 2764.8 | -2.18% |
> | multicore_score | teo_util_v3 | 2849 | 0.8% |
> | multicore_score | teo_util_v4 | 2865 | 1.36% |
> | score | menu | 1053 | 0.0% |
> | score | teo | 1050.7 | -0.22% |
> | score | teo_util_v3 | 1059.6 | 0.63% |
> | score | teo_util_v4 | 1057.6 | 0.44% |
> +-----------------+-------------+---------+-------------+
>
> Idle misses
>
> The numbers are percentages of too deep and too shallow sleeps computed using the new trace
> event - cpu_idle_miss. The percentage is obtained by counting the two types of misses over
> the course of a run and then dividing them by the total number of wakeups in that run.
>
> +-------------+-------------+--------------+
> | wa_path | type | count_perc |
> |-------------+-------------+--------------|
> | menu | too deep | 14.994% |
> | teo | too deep | 9.649% |
> | teo_util_v3 | too deep | 4.298% |
> | teo_util_v4 | too deep | 4.02 % |
> | menu | too shallow | 2.497% |
> | teo | too shallow | 5.963% |
> | teo_util_v3 | too shallow | 13.773% |
> | teo_util_v4 | too shallow | 14.598% |
> +-------------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+-------------+---------+-------------+
> | chan_name | metric | kernel | value | perc_diff |
> |--------------+----------+-------------+---------+-------------|
> | total_power | gmean | menu | 2551.4 | 0.0% |
> | total_power | gmean | teo | 2606.8 | 2.17% |
> | total_power | gmean | teo_util_v3 | 2670.1 | 4.65% |
> | total_power | gmean | teo_util_v4 | 2722.3 | 6.7% |
> +--------------+----------+-------------+---------+-------------+
>
> Task wakeup latency
>
> +-----------------+----------+-------------+-------------+-------------+
> | comm | metric | kernel | value | perc_diff |
> |-----------------+----------+-------------+-------------+-------------|
> | AsyncTask #1 | gmean | menu | 78.16μs | 0.0% |
> | AsyncTask #1 | gmean | teo | 61.60μs | -21.19% |
> | AsyncTask #1 | gmean | teo_util_v3 | 74.34μs | -4.89% |
> | AsyncTask #1 | gmean | teo_util_v4 | 54.45μs | -30.34% |
> | labs.geekbench5 | gmean | menu | 88.55μs | 0.0% |
> | labs.geekbench5 | gmean | teo | 100.97μs | 14.02% |
> | labs.geekbench5 | gmean | teo_util_v3 | 53.57μs | -39.5% |
> | labs.geekbench5 | gmean | teo_util_v4 | 59.60μs | -32.7% |
> +-----------------+----------+-------------+-------------+-------------+
>
> In case of this benchmark, the difference in latency does seem to translate into better scores.
>
> 2. PCMark Web Browsing (non latency-sensitive, normal usage web browsing test)
>
> The table below contains gmean values across 20 back to back iterations of PCMark 2 Web Browsing.
>
> Benchmark scores
>
> +----------------+-------------+---------+-------------+
> | metric | kernel | value | perc_diff |
> |----------------+-------------+---------+-------------|
> | PcmaWebV2Score | menu | 5232 | 0.0% |
> | PcmaWebV2Score | teo | 5219.8 | -0.23% |
> | PcmaWebV2Score | teo_util_v3 | 5273.5 | 0.79% |
> | PcmaWebV2Score | teo_util_v4 | 5239.9 | 0.15% |
> +----------------+-------------+---------+-------------+
>
> Idle misses
>
> +-------------+-------------+--------------+
> | wa_path | type | count_perc |
> |-------------+-------------+--------------|
> | menu | too deep | 24.814% |
> | teo | too deep | 11.65% |
> | teo_util_v3 | too deep | 3.481% |
> | teo_util_v4 | too deep | 3.662% |
> | menu | too shallow | 3.101% |
> | teo | too shallow | 8.578% |
> | teo_util_v3 | too shallow | 18.326% |
> | teo_util_v4 | too shallow | 18.692% |
> +-------------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+-------------+---------+-------------+
> | chan_name | metric | kernel | value | perc_diff |
> |--------------+----------+-------------+---------+-------------|
> | total_power | gmean | menu | 179.2 | 0.0% |
> | total_power | gmean | teo | 184.8 | 3.1% |
> | total_power | gmean | teo_util_v3 | 177.4 | -1.02% |
> | total_power | gmean | teo_util_v4 | 184.1 | 2.71% |
> +--------------+----------+-------------+---------+-------------+
>
> Task wakeup latency
>
> +-----------------+----------+-------------+-------------+-------------+
> | comm | metric | kernel | value | perc_diff |
> |-----------------+----------+-------------+-------------+-------------|
> | CrRendererMain | gmean | menu | 236.63μs | 0.0% |
> | CrRendererMain | gmean | teo | 201.85μs | -14.7% |
> | CrRendererMain | gmean | teo_util_v3 | 106.46μs | -55.01% |
> | CrRendererMain | gmean | teo_util_v4 | 106.72μs | -54.9% |
> | chmark:workload | gmean | menu | 100.30μs | 0.0% |
> | chmark:workload | gmean | teo | 80.20μs | -20.04% |
> | chmark:workload | gmean | teo_util_v3 | 65.88μs | -34.32% |
> | chmark:workload | gmean | teo_util_v4 | 57.90μs | -42.28% |
> | surfaceflinger | gmean | menu | 97.57μs | 0.0% |
> | surfaceflinger | gmean | teo | 98.86μs | 1.31% |
> | surfaceflinger | gmean | teo_util_v3 | 56.49μs | -42.1% |
> | surfaceflinger | gmean | teo_util_v4 | 72.68μs | -25.52% |
> +-----------------+----------+-------------+-------------+-------------+
>
> In this case the large latency improvement does not translate into a notable increase in benchmark score as
> this particular benchmark mainly responds to changes in operating frequency.
>
> 3. Jankbench (locked 60hz screen) (normal usage UI test)
>
> Frame durations
>
> +---------------+------------------+---------+-------------+
> | variable | kernel | value | perc_diff |
> |---------------+------------------+---------+-------------|
> | mean_duration | menu_60hz | 13.9 | 0.0% |
> | mean_duration | teo_60hz | 14.7 | 6.0% |
> | mean_duration | teo_util_v3_60hz | 13.8 | -0.87% |
> | mean_duration | teo_util_v4_60hz | 12.6 | -9.0% |
> +---------------+------------------+---------+-------------+
>
> Jank percentage
>
> +------------+------------------+---------+-------------+
> | variable | kernel | value | perc_diff |
> |------------+------------------+---------+-------------|
> | jank_perc | menu_60hz | 1.5 | 0.0% |
> | jank_perc | teo_60hz | 2.1 | 36.99% |
> | jank_perc | teo_util_v3_60hz | 1.3 | -13.95% |
> | jank_perc | teo_util_v4_60hz | 1.3 | -17.37% |
> +------------+------------------+---------+-------------+
>
> Idle misses
>
> +------------------+-------------+--------------+
> | wa_path | type | count_perc |
> |------------------+-------------+--------------|
> | menu_60hz | too deep | 26.00% |
> | teo_60hz | too deep | 11.00% |
> | teo_util_v3_60hz | too deep | 2.33% |
> | teo_util_v4_60hz | too deep | 2.54% |
> | menu_60hz | too shallow | 4.74% |
> | teo_60hz | too shallow | 11.89% |
> | teo_util_v3_60hz | too shallow | 21.78% |
> | teo_util_v4_60hz | too shallow | 21.93% |
> +------------------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+------------------+---------+-------------+
> | chan_name | kernel | value | perc_diff |
> |--------------+------------------+---------+-------------|
> | total_power | menu_60hz | 144.6 | 0.0% |
> | total_power | teo_60hz | 136.9 | -5.27% |
> | total_power | teo_util_v3_60hz | 134.2 | -7.19% |
> | total_power | teo_util_v4_60hz | 121.3 | -16.08% |
> +--------------+------------------+---------+-------------+
>
> Task wakeup latency
>
> +-----------------+------------------+-------------+-------------+
> | comm | kernel | value | perc_diff |
> |-----------------+------------------+-------------+-------------|
> | RenderThread | menu_60hz | 139.52μs | 0.0% |
> | RenderThread | teo_60hz | 116.51μs | -16.49% |
> | RenderThread | teo_util_v3_60hz | 86.76μs | -37.82% |
> | RenderThread | teo_util_v4_60hz | 91.11μs | -34.7% |
> | droid.benchmark | menu_60hz | 135.88μs | 0.0% |
> | droid.benchmark | teo_60hz | 105.21μs | -22.57% |
> | droid.benchmark | teo_util_v3_60hz | 83.92μs | -38.24% |
> | droid.benchmark | teo_util_v4_60hz | 83.18μs | -38.79% |
> | surfaceflinger | menu_60hz | 124.03μs | 0.0% |
> | surfaceflinger | teo_60hz | 151.90μs | 22.47% |
> | surfaceflinger | teo_util_v3_60hz | 100.19μs | -19.22% |
> | surfaceflinger | teo_util_v4_60hz | 87.65μs | -29.33% |
> +-----------------+------------------+-------------+-------------+
>
> 4. Speedometer 2 (heavy load web browsing test)
>
> Benchmark scores
>
> +-------------------+-------------+---------+-------------+
> | metric | kernel | value | perc_diff |
> |-------------------+-------------+---------+-------------|
> | Speedometer Score | menu | 102 | 0.0% |
> | Speedometer Score | teo | 104.9 | 2.88% |
> | Speedometer Score | teo_util_v3 | 102.1 | 0.16% |
> | Speedometer Score | teo_util_v4 | 103.8 | 1.83% |
> +-------------------+-------------+---------+-------------+
>
> Idle misses
>
> +-------------+-------------+--------------+
> | wa_path | type | count_perc |
> |-------------+-------------+--------------|
> | menu | too deep | 17.95% |
> | teo | too deep | 6.46% |
> | teo_util_v3 | too deep | 0.63% |
> | teo_util_v4 | too deep | 0.64% |
> | menu | too shallow | 3.86% |
> | teo | too shallow | 8.21% |
> | teo_util_v3 | too shallow | 14.72% |
> | teo_util_v4 | too shallow | 14.43% |
> +-------------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+-------------+---------+-------------+
> | chan_name | metric | kernel | value | perc_diff |
> |--------------+----------+-------------+---------+-------------|
> | total_power | gmean | menu | 2059 | 0.0% |
> | total_power | gmean | teo | 2187.8 | 6.26% |
> | total_power | gmean | teo_util_v3 | 2212.9 | 7.47% |
> | total_power | gmean | teo_util_v4 | 2121.8 | 3.05% |
> +--------------+----------+-------------+---------+-------------+
>
> Task wakeup latency
>
> +-----------------+----------+-------------+-------------+-------------+
> | comm | metric | kernel | value | perc_diff |
> |-----------------+----------+-------------+-------------+-------------|
> | CrRendererMain | gmean | menu | 17.18μs | 0.0% |
> | CrRendererMain | gmean | teo | 16.18μs | -5.82% |
> | CrRendererMain | gmean | teo_util_v3 | 18.04μs | 5.05% |
> | CrRendererMain | gmean | teo_util_v4 | 18.25μs | 6.27% |
> | RenderThread | gmean | menu | 68.60μs | 0.0% |
> | RenderThread | gmean | teo | 48.44μs | -29.39% |
> | RenderThread | gmean | teo_util_v3 | 48.01μs | -30.02% |
> | RenderThread | gmean | teo_util_v4 | 51.24μs | -25.3% |
> | surfaceflinger | gmean | menu | 42.23μs | 0.0% |
> | surfaceflinger | gmean | teo | 29.84μs | -29.33% |
> | surfaceflinger | gmean | teo_util_v3 | 24.51μs | -41.95% |
> | surfaceflinger | gmean | teo_util_v4 | 29.64μs | -29.8% |
> +-----------------+----------+-------------+-------------+-------------+
>
> Thank you for taking your time to read this!
>
> --
> Kajetan
>
> v5 -> v6:
> - amended some wording in the commit description & cover letter
> - included test results in the commit description
> - refactored checking the CPU utilized status to account for !SMP systems
> - dropped the RFC from the patchset header
>
> v4 -> v5:
> - remove the restriction to only apply the mechanism for C1 candidate state
> - clarify some code comments, fix comment style
> - refactor the fast-exit path loop implementation
> - move some cover letter information into the commit description
>
> v3 -> v4:
> - remove the chunk of code skipping metrics updates when the CPU was utilized
> - include new test results and more benchmarks in the cover letter
>
> v2 -> v3:
> - add a patch adding an option to skip polling states in teo_find_shallower_state()
> - only reduce the state if the candidate state is C1 and C0 is not a polling state
> - add a check for polling states in the 2-states fast-exit path
> - remove the ifdefs and Kconfig option
>
> v1 -> v2:
> - rework the mechanism to reduce selected state by 1 instead of directly selecting C0 (suggested by Doug Smythies)
> - add a fast-exit path for systems with 2 idle states to not waste cycles on metrics when utilized
> - fix typos in comments
> - include a missing header
>
>
> Kajetan Puchalski (2):
> cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
> cpuidle: teo: Introduce util-awareness
>
> drivers/cpuidle/governors/teo.c | 100 ++++++++++++++++++++++++++++++--
> 1 file changed, 96 insertions(+), 4 deletions(-)
>
> --
Both patches in the series applied as 6.3 material, thanks!
On Thu, Jan 12, 2023 at 08:22:24PM +0100, Rafael J. Wysocki wrote:
> On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
> <[email protected]> wrote:
> >
> > Hi,
> >
> > At the moment, none of the available idle governors take any scheduling
> > information into account. They also tend to overestimate the idle
> > duration quite often, which causes them to select excessively deep idle
> > states, thus leading to increased wakeup latency and lower performance with no
> > power saving. For 'menu' while web browsing on Android for instance, those
> > types of wakeups ('too deep') account for over 24% of all wakeups.
> >
> > At the same time, on some platforms idle state 0 can be power efficient
> > enough to warrant wanting to prefer it over idle state 1. This is because
> > the power usage of the two states can be so close that sufficient amounts
> > of too deep state 1 sleeps can completely offset the state 1 power saving to the
> > point where it would've been more power efficient to just use state 0 instead.
> > This is of course for systems where state 0 is not a polling state, such as
> > arm-based devices.
> >
> > Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only
> > save less power than they otherwise could have. Too deep sleeps, on the other
> > hand, harm performance and nullify the potential power saving from using state 1 in
> > the first place. While taking this into account, it is clear that on balance it
> > is preferable for an idle governor to have more too shallow sleeps instead of
> > more too deep sleeps on those kinds of platforms.
> >
> > Currently the best available governor under this metric is TEO which on average results in less than
> > half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
> > increased performance in the process.
> >
> > This patchset specifically tunes TEO to prefer shallower idle states in order to reduce wakeup latency
> > and achieve better performance. To this end, before selecting the next idle state it uses the avg_util
> > signal of a CPU's runqueue in order to determine to what extent the CPU is being utilized.
> > This util value is then compared to a threshold defined as a percentage of the cpu's capacity
> > (capacity >> 6 ie. ~1.5% in the current implementation). If the util is above the threshold, the idle
> > state selected by TEO metrics will be reduced by 1, thus selecting a shallower state. If the util is
> > below the threshold, the governor defaults to the TEO metrics mechanism to try to select the deepest
> > available idle state based on the closest timer event and its own correctness.
> >
> > The main goal of this is to reduce latency and increase performance for some workloads. Under some
> > workloads it will result in an increase in power usage (Geekbench 5) while for other workloads it
> > will also result in a decrease in power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
> >
> > As of v2 the patch includes a 'fast exit' path for arm-based and similar systems where only 2 idle
> > states are present. If there's just 2 idle states and the CPU is utilized, we can directly select
> > the shallowest state and save cycles by skipping the entire metrics mechanism.
> >
> > Under the current implementation, the state will not be reduced by 1 if the change would lead to
> > selecting a polling state instead of a non-polling state.
> >
> > This approach can outperform all the other currently available governors, at least on mobile device
> > workloads, which is why I think it is worth keeping as an option.
> >
> > There is no particular attachment or reliance on TEO for this mechanism, I simply chose to base
> > it on TEO because it performs the best out of all the available options and I didn't think there was
> > any point in reinventing the wheel on the side of computing governor metrics. If a
> > better approach comes along at some point, there's no reason why the same idle aware mechanism
> > couldn't be used with any other metrics algorithm. That would, however, require implemeting it as
> > a separate governor rather than a TEO add-on.
> >
> > As for how the extension performs in practice, below I'll add some benchmark results I got while
> > testing this patchset. All the benchmarks were run after holding the phone in the fridge for exactly
> > an hour each time to minimise the impact of thermal issues.
> >
> > Pixel 6 (Android 12, mainline kernel 5.18, with newer mainline CFS patches):
> >
> > 1. Geekbench 5 (latency-sensitive, heavy load test)
> >
> > The values below are gmean values across 3 back to back iteration of Geekbench 5.
> > As GB5 is a heavy benchmark, after more than 3 iterations intense throttling kicks in on mobile devices
> > resulting in skewed benchmark scores, which makes it difficult to collect reliable results. The actual
> > values for all of the governors can change between runs as the benchmark might be affected by factors
> > other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently achieved better
> > scores than all the other governors.
> >
> > Benchmark scores
> >
> > +-----------------+-------------+---------+-------------+
> > | metric | kernel | value | perc_diff |
> > |-----------------+-------------+---------+-------------|
> > | multicore_score | menu | 2826.5 | 0.0% |
> > | multicore_score | teo | 2764.8 | -2.18% |
> > | multicore_score | teo_util_v3 | 2849 | 0.8% |
> > | multicore_score | teo_util_v4 | 2865 | 1.36% |
> > | score | menu | 1053 | 0.0% |
> > | score | teo | 1050.7 | -0.22% |
> > | score | teo_util_v3 | 1059.6 | 0.63% |
> > | score | teo_util_v4 | 1057.6 | 0.44% |
> > +-----------------+-------------+---------+-------------+
> >
> > Idle misses
> >
> > The numbers are percentages of too deep and too shallow sleeps computed using the new trace
> > event - cpu_idle_miss. The percentage is obtained by counting the two types of misses over
> > the course of a run and then dividing them by the total number of wakeups in that run.
> >
> > +-------------+-------------+--------------+
> > | wa_path | type | count_perc |
> > |-------------+-------------+--------------|
> > | menu | too deep | 14.994% |
> > | teo | too deep | 9.649% |
> > | teo_util_v3 | too deep | 4.298% |
> > | teo_util_v4 | too deep | 4.02 % |
> > | menu | too shallow | 2.497% |
> > | teo | too shallow | 5.963% |
> > | teo_util_v3 | too shallow | 13.773% |
> > | teo_util_v4 | too shallow | 14.598% |
> > +-------------+-------------+--------------+
> >
> > Power usage [mW]
> >
> > +--------------+----------+-------------+---------+-------------+
> > | chan_name | metric | kernel | value | perc_diff |
> > |--------------+----------+-------------+---------+-------------|
> > | total_power | gmean | menu | 2551.4 | 0.0% |
> > | total_power | gmean | teo | 2606.8 | 2.17% |
> > | total_power | gmean | teo_util_v3 | 2670.1 | 4.65% |
> > | total_power | gmean | teo_util_v4 | 2722.3 | 6.7% |
> > +--------------+----------+-------------+---------+-------------+
> >
> > Task wakeup latency
> >
> > +-----------------+----------+-------------+-------------+-------------+
> > | comm | metric | kernel | value | perc_diff |
> > |-----------------+----------+-------------+-------------+-------------|
> > | AsyncTask #1 | gmean | menu | 78.16μs | 0.0% |
> > | AsyncTask #1 | gmean | teo | 61.60μs | -21.19% |
> > | AsyncTask #1 | gmean | teo_util_v3 | 74.34μs | -4.89% |
> > | AsyncTask #1 | gmean | teo_util_v4 | 54.45μs | -30.34% |
> > | labs.geekbench5 | gmean | menu | 88.55μs | 0.0% |
> > | labs.geekbench5 | gmean | teo | 100.97μs | 14.02% |
> > | labs.geekbench5 | gmean | teo_util_v3 | 53.57μs | -39.5% |
> > | labs.geekbench5 | gmean | teo_util_v4 | 59.60μs | -32.7% |
> > +-----------------+----------+-------------+-------------+-------------+
> >
> > In case of this benchmark, the difference in latency does seem to translate into better scores.
> >
> > 2. PCMark Web Browsing (non latency-sensitive, normal usage web browsing test)
> >
> > The table below contains gmean values across 20 back to back iterations of PCMark 2 Web Browsing.
> >
> > Benchmark scores
> >
> > +----------------+-------------+---------+-------------+
> > | metric | kernel | value | perc_diff |
> > |----------------+-------------+---------+-------------|
> > | PcmaWebV2Score | menu | 5232 | 0.0% |
> > | PcmaWebV2Score | teo | 5219.8 | -0.23% |
> > | PcmaWebV2Score | teo_util_v3 | 5273.5 | 0.79% |
> > | PcmaWebV2Score | teo_util_v4 | 5239.9 | 0.15% |
> > +----------------+-------------+---------+-------------+
> >
> > Idle misses
> >
> > +-------------+-------------+--------------+
> > | wa_path | type | count_perc |
> > |-------------+-------------+--------------|
> > | menu | too deep | 24.814% |
> > | teo | too deep | 11.65% |
> > | teo_util_v3 | too deep | 3.481% |
> > | teo_util_v4 | too deep | 3.662% |
> > | menu | too shallow | 3.101% |
> > | teo | too shallow | 8.578% |
> > | teo_util_v3 | too shallow | 18.326% |
> > | teo_util_v4 | too shallow | 18.692% |
> > +-------------+-------------+--------------+
> >
> > Power usage [mW]
> >
> > +--------------+----------+-------------+---------+-------------+
> > | chan_name | metric | kernel | value | perc_diff |
> > |--------------+----------+-------------+---------+-------------|
> > | total_power | gmean | menu | 179.2 | 0.0% |
> > | total_power | gmean | teo | 184.8 | 3.1% |
> > | total_power | gmean | teo_util_v3 | 177.4 | -1.02% |
> > | total_power | gmean | teo_util_v4 | 184.1 | 2.71% |
> > +--------------+----------+-------------+---------+-------------+
> >
> > Task wakeup latency
> >
> > +-----------------+----------+-------------+-------------+-------------+
> > | comm | metric | kernel | value | perc_diff |
> > |-----------------+----------+-------------+-------------+-------------|
> > | CrRendererMain | gmean | menu | 236.63μs | 0.0% |
> > | CrRendererMain | gmean | teo | 201.85μs | -14.7% |
> > | CrRendererMain | gmean | teo_util_v3 | 106.46μs | -55.01% |
> > | CrRendererMain | gmean | teo_util_v4 | 106.72μs | -54.9% |
> > | chmark:workload | gmean | menu | 100.30μs | 0.0% |
> > | chmark:workload | gmean | teo | 80.20μs | -20.04% |
> > | chmark:workload | gmean | teo_util_v3 | 65.88μs | -34.32% |
> > | chmark:workload | gmean | teo_util_v4 | 57.90μs | -42.28% |
> > | surfaceflinger | gmean | menu | 97.57μs | 0.0% |
> > | surfaceflinger | gmean | teo | 98.86μs | 1.31% |
> > | surfaceflinger | gmean | teo_util_v3 | 56.49μs | -42.1% |
> > | surfaceflinger | gmean | teo_util_v4 | 72.68μs | -25.52% |
> > +-----------------+----------+-------------+-------------+-------------+
> >
> > In this case the large latency improvement does not translate into a notable increase in benchmark score as
> > this particular benchmark mainly responds to changes in operating frequency.
> >
> > 3. Jankbench (locked 60hz screen) (normal usage UI test)
> >
> > Frame durations
> >
> > +---------------+------------------+---------+-------------+
> > | variable | kernel | value | perc_diff |
> > |---------------+------------------+---------+-------------|
> > | mean_duration | menu_60hz | 13.9 | 0.0% |
> > | mean_duration | teo_60hz | 14.7 | 6.0% |
> > | mean_duration | teo_util_v3_60hz | 13.8 | -0.87% |
> > | mean_duration | teo_util_v4_60hz | 12.6 | -9.0% |
> > +---------------+------------------+---------+-------------+
> >
> > Jank percentage
> >
> > +------------+------------------+---------+-------------+
> > | variable | kernel | value | perc_diff |
> > |------------+------------------+---------+-------------|
> > | jank_perc | menu_60hz | 1.5 | 0.0% |
> > | jank_perc | teo_60hz | 2.1 | 36.99% |
> > | jank_perc | teo_util_v3_60hz | 1.3 | -13.95% |
> > | jank_perc | teo_util_v4_60hz | 1.3 | -17.37% |
> > +------------+------------------+---------+-------------+
> >
> > Idle misses
> >
> > +------------------+-------------+--------------+
> > | wa_path | type | count_perc |
> > |------------------+-------------+--------------|
> > | menu_60hz | too deep | 26.00% |
> > | teo_60hz | too deep | 11.00% |
> > | teo_util_v3_60hz | too deep | 2.33% |
> > | teo_util_v4_60hz | too deep | 2.54% |
> > | menu_60hz | too shallow | 4.74% |
> > | teo_60hz | too shallow | 11.89% |
> > | teo_util_v3_60hz | too shallow | 21.78% |
> > | teo_util_v4_60hz | too shallow | 21.93% |
> > +------------------+-------------+--------------+
> >
> > Power usage [mW]
> >
> > +--------------+------------------+---------+-------------+
> > | chan_name | kernel | value | perc_diff |
> > |--------------+------------------+---------+-------------|
> > | total_power | menu_60hz | 144.6 | 0.0% |
> > | total_power | teo_60hz | 136.9 | -5.27% |
> > | total_power | teo_util_v3_60hz | 134.2 | -7.19% |
> > | total_power | teo_util_v4_60hz | 121.3 | -16.08% |
> > +--------------+------------------+---------+-------------+
> >
> > Task wakeup latency
> >
> > +-----------------+------------------+-------------+-------------+
> > | comm | kernel | value | perc_diff |
> > |-----------------+------------------+-------------+-------------|
> > | RenderThread | menu_60hz | 139.52μs | 0.0% |
> > | RenderThread | teo_60hz | 116.51μs | -16.49% |
> > | RenderThread | teo_util_v3_60hz | 86.76μs | -37.82% |
> > | RenderThread | teo_util_v4_60hz | 91.11μs | -34.7% |
> > | droid.benchmark | menu_60hz | 135.88μs | 0.0% |
> > | droid.benchmark | teo_60hz | 105.21μs | -22.57% |
> > | droid.benchmark | teo_util_v3_60hz | 83.92μs | -38.24% |
> > | droid.benchmark | teo_util_v4_60hz | 83.18μs | -38.79% |
> > | surfaceflinger | menu_60hz | 124.03μs | 0.0% |
> > | surfaceflinger | teo_60hz | 151.90μs | 22.47% |
> > | surfaceflinger | teo_util_v3_60hz | 100.19μs | -19.22% |
> > | surfaceflinger | teo_util_v4_60hz | 87.65μs | -29.33% |
> > +-----------------+------------------+-------------+-------------+
> >
> > 4. Speedometer 2 (heavy load web browsing test)
> >
> > Benchmark scores
> >
> > +-------------------+-------------+---------+-------------+
> > | metric | kernel | value | perc_diff |
> > |-------------------+-------------+---------+-------------|
> > | Speedometer Score | menu | 102 | 0.0% |
> > | Speedometer Score | teo | 104.9 | 2.88% |
> > | Speedometer Score | teo_util_v3 | 102.1 | 0.16% |
> > | Speedometer Score | teo_util_v4 | 103.8 | 1.83% |
> > +-------------------+-------------+---------+-------------+
> >
> > Idle misses
> >
> > +-------------+-------------+--------------+
> > | wa_path | type | count_perc |
> > |-------------+-------------+--------------|
> > | menu | too deep | 17.95% |
> > | teo | too deep | 6.46% |
> > | teo_util_v3 | too deep | 0.63% |
> > | teo_util_v4 | too deep | 0.64% |
> > | menu | too shallow | 3.86% |
> > | teo | too shallow | 8.21% |
> > | teo_util_v3 | too shallow | 14.72% |
> > | teo_util_v4 | too shallow | 14.43% |
> > +-------------+-------------+--------------+
> >
> > Power usage [mW]
> >
> > +--------------+----------+-------------+---------+-------------+
> > | chan_name | metric | kernel | value | perc_diff |
> > |--------------+----------+-------------+---------+-------------|
> > | total_power | gmean | menu | 2059 | 0.0% |
> > | total_power | gmean | teo | 2187.8 | 6.26% |
> > | total_power | gmean | teo_util_v3 | 2212.9 | 7.47% |
> > | total_power | gmean | teo_util_v4 | 2121.8 | 3.05% |
> > +--------------+----------+-------------+---------+-------------+
> >
> > Task wakeup latency
> >
> > +-----------------+----------+-------------+-------------+-------------+
> > | comm | metric | kernel | value | perc_diff |
> > |-----------------+----------+-------------+-------------+-------------|
> > | CrRendererMain | gmean | menu | 17.18μs | 0.0% |
> > | CrRendererMain | gmean | teo | 16.18μs | -5.82% |
> > | CrRendererMain | gmean | teo_util_v3 | 18.04μs | 5.05% |
> > | CrRendererMain | gmean | teo_util_v4 | 18.25μs | 6.27% |
> > | RenderThread | gmean | menu | 68.60μs | 0.0% |
> > | RenderThread | gmean | teo | 48.44μs | -29.39% |
> > | RenderThread | gmean | teo_util_v3 | 48.01μs | -30.02% |
> > | RenderThread | gmean | teo_util_v4 | 51.24μs | -25.3% |
> > | surfaceflinger | gmean | menu | 42.23μs | 0.0% |
> > | surfaceflinger | gmean | teo | 29.84μs | -29.33% |
> > | surfaceflinger | gmean | teo_util_v3 | 24.51μs | -41.95% |
> > | surfaceflinger | gmean | teo_util_v4 | 29.64μs | -29.8% |
> > +-----------------+----------+-------------+-------------+-------------+
> >
> > Thank you for taking your time to read this!
> >
> > --
> > Kajetan
> >
> > v5 -> v6:
> > - amended some wording in the commit description & cover letter
> > - included test results in the commit description
> > - refactored checking the CPU utilized status to account for !SMP systems
> > - dropped the RFC from the patchset header
> >
> > v4 -> v5:
> > - remove the restriction to only apply the mechanism for C1 candidate state
> > - clarify some code comments, fix comment style
> > - refactor the fast-exit path loop implementation
> > - move some cover letter information into the commit description
> >
> > v3 -> v4:
> > - remove the chunk of code skipping metrics updates when the CPU was utilized
> > - include new test results and more benchmarks in the cover letter
> >
> > v2 -> v3:
> > - add a patch adding an option to skip polling states in teo_find_shallower_state()
> > - only reduce the state if the candidate state is C1 and C0 is not a polling state
> > - add a check for polling states in the 2-states fast-exit path
> > - remove the ifdefs and Kconfig option
> >
> > v1 -> v2:
> > - rework the mechanism to reduce selected state by 1 instead of directly selecting C0 (suggested by Doug Smythies)
> > - add a fast-exit path for systems with 2 idle states to not waste cycles on metrics when utilized
> > - fix typos in comments
> > - include a missing header
> >
> >
> > Kajetan Puchalski (2):
> > cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
> > cpuidle: teo: Introduce util-awareness
> >
> > drivers/cpuidle/governors/teo.c | 100 ++++++++++++++++++++++++++++++--
> > 1 file changed, 96 insertions(+), 4 deletions(-)
> >
> > --
>
> Both patches in the series applied as 6.3 material, thanks!
Thanks a lot, take care!
Hi Kajetan
On 01/05/23 14:51, Kajetan Puchalski wrote:
[...]
> @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> struct cpuidle_device *dev)
> {
> struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> int i;
>
> memset(cpu_data, 0, sizeof(*cpu_data));
> + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
Given that utilization is invariant, why do we set the threshold based on
cpu capacity?
I'm not sure if this is a problem, but on little cores this threshold would be
too low. Given that util is invariant - I wondered if we need to have a single
threshold for all type of CPUs instead. Have you tried something like that
while developing the patch?
Thanks
--
Qais Yousef
Hi Qais,
The rule is 'one size doesn't fit all', please see below.
On 7/11/23 18:58, Qais Yousef wrote:
> Hi Kajetan
>
> On 01/05/23 14:51, Kajetan Puchalski wrote:
>
> [...]
>
>> @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>> struct cpuidle_device *dev)
>> {
>> struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
>> + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
>> int i;
>>
>> memset(cpu_data, 0, sizeof(*cpu_data));
>> + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
> Given that utilization is invariant, why do we set the threshold based on
> cpu capacity?
To treat CPUs differently, not with the same policy.
>
> I'm not sure if this is a problem, but on little cores this threshold would be
> too low. Given that util is invariant - I wondered if we need to have a single
> threshold for all type of CPUs instead. Have you tried something like that
A single threshold for all CPUs might be biased towards some CPUs. Let's
pick the value 15 - which was tested to work really good in benchmarks
for the big CPUs. On the other hand when you set that value to little
CPUs, with max_capacity = 124, than you have 15/124 ~= 13% threshold.
That means you prefer to enter deeper idle state ~9x times (at max
freq). What if the Little's freq is set to e.g. < ~20% fmax, which
corresponds to capacity < ~25? Let's try to simulate such scenario.
In a situation we could have utilization 14 on Little CPU, than CPU
capacity (effectively frequency) voting based on utilization would be
1.2 * 14 = ~17 so let's pick OPP corresponding to 17 capacity.
In such condition the little CPU would run the 14-util-periodic-task for
14/17= ~82% of wall-clock time. That's a lot, and not suited for
entering deeper idle state on that CPU, isn't it?
Apart from that, the little CPUs are tiny in terms of silicon area
and are less leaky in WFI than big cores. Therefore, they don't need
aggressive entries into deeper idle state. At the same time, they
are often used for serving interrupts, where the latency is important
factor.
> while developing the patch?
We have tried different threshold values in terms of %, but for all CPUs
(at the same time) not per-cluster. The reason was to treat those CPUs
differently as described above.
Regards,
Lukasz
+CC Vincent and Peter
On 07/17/23 14:47, Lukasz Luba wrote:
> Hi Qais,
>
> The rule is 'one size doesn't fit all', please see below.
>
> On 7/11/23 18:58, Qais Yousef wrote:
> > Hi Kajetan
> >
> > On 01/05/23 14:51, Kajetan Puchalski wrote:
> >
> > [...]
> >
> > > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > > struct cpuidle_device *dev)
> > > {
> > > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> > > int i;
> > > memset(cpu_data, 0, sizeof(*cpu_data));
> > > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> >
> > Given that utilization is invariant, why do we set the threshold based on
> > cpu capacity?
>
>
> To treat CPUs differently, not with the same policy.
>
>
> >
> > I'm not sure if this is a problem, but on little cores this threshold would be
> > too low. Given that util is invariant - I wondered if we need to have a single
> > threshold for all type of CPUs instead. Have you tried something like that
>
> A single threshold for all CPUs might be biased towards some CPUs. Let's
> pick the value 15 - which was tested to work really good in benchmarks
> for the big CPUs. On the other hand when you set that value to little
> CPUs, with max_capacity = 124, than you have 15/124 ~= 13% threshold.
> That means you prefer to enter deeper idle state ~9x times (at max
> freq). What if the Little's freq is set to e.g. < ~20% fmax, which
> corresponds to capacity < ~25? Let's try to simulate such scenario.
Hmm what I'm struggling with is that PELT is invariant. So the time it takes to
rise and decay to threshold of 15 should be the same for all CPUs, no?
>
> In a situation we could have utilization 14 on Little CPU, than CPU capacity
> (effectively frequency) voting based on utilization would be
> 1.2 * 14 = ~17 so let's pick OPP corresponding to 17 capacity.
> In such condition the little CPU would run the 14-util-periodic-task for
> 14/17= ~82% of wall-clock time. That's a lot, and not suited for
> entering deeper idle state on that CPU, isn't it?
Yes runtime is stretched. But we counter this at utilization level by making
PELT invariant. I thought that any CPU in the system will now take the same
amount of time to ramp-up and decay to the same util level. No?
But maybe what you're saying is that we don't need to take the invariance into
account?
My concern (that is not backed by real problem yet) is that the threshold is
near 0, and since PELT is invariant, the time to gain few points is constant
irrespective of any CPU/capacity/freq and this means the little CPUs has to be
absolutely idle with no activity almost at all, IIUC.
>
> Apart from that, the little CPUs are tiny in terms of silicon area
> and are less leaky in WFI than big cores. Therefore, they don't need
> aggressive entries into deeper idle state. At the same time, they
> are often used for serving interrupts, where the latency is important
> factor.
On Pixel 6 this threshold will translate to util_threshold of 2. Which looks
too low to me. Can't TEO do a good job before we reach such extremely low level
of inactivity?
Thanks
--
Qais Yousef
>
> > while developing the patch?
>
> We have tried different threshold values in terms of %, but for all CPUs
> (at the same time) not per-cluster. The reason was to treat those CPUs
> differently as described above.
>
> Regards,
> Lukasz
On 7/17/23 19:21, Qais Yousef wrote:
> +CC Vincent and Peter
>
> On 07/17/23 14:47, Lukasz Luba wrote:
>> Hi Qais,
>>
>> The rule is 'one size doesn't fit all', please see below.
>>
>> On 7/11/23 18:58, Qais Yousef wrote:
>>> Hi Kajetan
>>>
>>> On 01/05/23 14:51, Kajetan Puchalski wrote:
>>>
>>> [...]
>>>
>>>> @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>>>> struct cpuidle_device *dev)
>>>> {
>>>> struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
>>>> + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
>>>> int i;
>>>> memset(cpu_data, 0, sizeof(*cpu_data));
>>>> + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>>>
>>> Given that utilization is invariant, why do we set the threshold based on
>>> cpu capacity?
>>
>>
>> To treat CPUs differently, not with the same policy.
>>
>>
>>>
>>> I'm not sure if this is a problem, but on little cores this threshold would be
>>> too low. Given that util is invariant - I wondered if we need to have a single
>>> threshold for all type of CPUs instead. Have you tried something like that
>>
>> A single threshold for all CPUs might be biased towards some CPUs. Let's
>> pick the value 15 - which was tested to work really good in benchmarks
>> for the big CPUs. On the other hand when you set that value to little
>> CPUs, with max_capacity = 124, than you have 15/124 ~= 13% threshold.
>> That means you prefer to enter deeper idle state ~9x times (at max
>> freq). What if the Little's freq is set to e.g. < ~20% fmax, which
>> corresponds to capacity < ~25? Let's try to simulate such scenario.
>
> Hmm what I'm struggling with is that PELT is invariant. So the time it takes to
> rise and decay to threshold of 15 should be the same for all CPUs, no?
Yes, but that's not an issue. The idea is to have a threshold value
set to a level which corresponds to a long enough slip time (in
wall-clock). Then you don't waste the energy for turning on/off the
core too often.
>
>>
>> In a situation we could have utilization 14 on Little CPU, than CPU capacity
>> (effectively frequency) voting based on utilization would be
>> 1.2 * 14 = ~17 so let's pick OPP corresponding to 17 capacity.
>> In such condition the little CPU would run the 14-util-periodic-task for
>> 14/17= ~82% of wall-clock time. That's a lot, and not suited for
>> entering deeper idle state on that CPU, isn't it?
>
> Yes runtime is stretched. But we counter this at utilization level by making
> PELT invariant. I thought that any CPU in the system will now take the same
> amount of time to ramp-up and decay to the same util level. No?
The stretched runtime is what we want to derived out of rq util
information, mostly after task migration to some next cpu.
>
> But maybe what you're saying is that we don't need to take the invariance into
> account?
Invariance is OK, since is the common ground when we migrate those tasks
across cores and we still can conclude based on util the sleeping
cpu time.
>
> My concern (that is not backed by real problem yet) is that the threshold is
> near 0, and since PELT is invariant, the time to gain few points is constant
> irrespective of any CPU/capacity/freq and this means the little CPUs has to be
> absolutely idle with no activity almost at all, IIUC.
As I said, that is good for those Little CPU in term of better latency
for the tasks they serve and also doesn't harm the energy. The
deeper idle state for those tiny silicon area cores (and low-power
cells) doesn't bring much in avg for real workloads.
This is the trade-off: you don't want to sacrifice the latency factor,
you would rather pay a bit more energy not entering deep idle
on Little cores, but get better latency there.
For the big cores, which occupy bigger silicon area and are made from
High-Performance (HP) cells (and low V-threshold) leaking more, that
trade-off is set differently. That's why a similar small util task on
big core might be facing larger latency do to facing the need to
wake-up cpu from deeper idle state.
>
>>
>> Apart from that, the little CPUs are tiny in terms of silicon area
>> and are less leaky in WFI than big cores. Therefore, they don't need
>> aggressive entries into deeper idle state. At the same time, they
>> are often used for serving interrupts, where the latency is important
>> factor.
>
> On Pixel 6 this threshold will translate to util_threshold of 2. Which looks
> too low to me. Can't TEO do a good job before we reach such extremely low level
> of inactivity?
Kajetan has introduced a new trace event 'cpu_idle_miss' which was
helping us to say how to set the trade-off in different thresholds.
It appeared that we would rather prefer to miss-predict and be wrong
about missing opportunity to enter deeper idle. The opposite was more
harmful. You can enable that trace event on your platform and analyze
your use cases.
Regards,
Lukasz
Hi Qais,
On Tue, Jul 11, 2023 at 06:58:14PM +0100, Qais Yousef wrote:
> Hi Kajetan
>
> On 01/05/23 14:51, Kajetan Puchalski wrote:
>
> [...]
>
> > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > struct cpuidle_device *dev)
> > {
> > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> > int i;
> >
> > memset(cpu_data, 0, sizeof(*cpu_data));
> > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
> Given that utilization is invariant, why do we set the threshold based on
> cpu capacity?
Conceptually, the threshold is meant to represent a level at which the
core is considered 'utilized'. I appreciate the definitions here can get
a little fuzzy but I think of it as "generally doing a non-insignificant
amount of work" even if there are currently no tasks scheduled on the core.
This comes in handy in real-world workloads where the core will go
through multiple cycles of busy-idle-busy-idle within each second.
The intention here is to be able to distinguish a scenario of "going
into idle for a few us because of the nature of the workload" from
"going into idle for longer because there is no workload".
I set the threshold based on capacity because I think conceptually it
makes more sense to say "every CPU is consireded to be utilized if the
util is above X% of its capacity" than to effectively have a varying
percentage based on the size of the core. 60 util is not that
much work for a 1024-util big core but it's almost half the capacity of
a little one, using a percentage/shift on capacity lets us account for that
while using a raw value would not.
There's also very practical issues but I'll describe those below.
> I'm not sure if this is a problem, but on little cores this threshold would be
> too low. Given that util is invariant - I wondered if we need to have a single
> threshold for all type of CPUs instead. Have you tried something like that
> while developing the patch?
Yes, the problem there is that it's very difficult to define what "too low"
actually means :)
Namely, do we define 'too low' based on the effects it has on
performance in terms of latency, on the resulting power usage or on the
prediction accuracy? In terms of the prediction accuracy, how do we
weigh the two possible types of mispredictions? I'll just try to explain
my thinking and how I got to my conclusions.
Based on my tests, on the types of platforms we both work with our
state0/wfi is very power efficient to stay in, very power efficient
to enter/exit and also very fast so it has very little impact on
latency. On the other hand, state1 is power efficient to *stay in* but
very costly to enter/exit in terms of *both* power and latency. The
effect this has is that there's many cases where going through a cycle
of busy-state1-busy-state1-busy and so on will actually use up more
power than if you only kept the core in wfi.
I had some tests done with effectively making the governor do "return 0"
in state selection, never using any state1 and the results were still
pretty good, only slightly worse than e.g. menu. The problem there was
that not using state1 on big cores would not leave them time to cool
down and we'd burn through the thermal budget too quickly then tank the
performance.
I don't have the numbers on hand but even completely disabling state1 on
the entire little cluster will work perfectly fine - your latency for
tasks that run on littles will improve and the thermal budget/power
won't take a particularly noticeable hit because of how small they are
in the first place.
This is why the governor is intentionally skewed towards shallower
states, they just work better most of the time. If you try to skew it
the other way the results just come out much worse because even a
relatively small amount of mispredicted state1 entries can completely
nullify any benefits that selecting state1 could bring.
The percentage approach does make the threshold for littles pretty small
but as desccribed above that's perfectly fine, could say a feature not a
bug :) If we tried setting a fixed one across all CPUs then we'd need to
pick one high enough for the big cores which would end up being too high
for the littles, lead to excessive state1 entries and all the issues
I've just described. TLDR: there's just more downsides on the other side.
In development I just had a sysctl to set the threshold shift and iirc I
tested values from 3 to 10-12 eventually arriving at 6 being the one
with best results across different metrics and benchmarks. If you're
backporting the patch somewhere and have a specific platform feel free
to test it with different shift values, it's possible that different
platforms will behave differently with this. I doubt there's any
appetite to make the shift tweakable at runtime rather than a
compile-time constant but if you'd like to push for that I'm happy to
sign off on it, would work just as well as it does now.
> Thanks
>
> --
> Qais Yousef
On 07/18/23 11:23, Lukasz Luba wrote:
>
>
> On 7/17/23 19:21, Qais Yousef wrote:
> > +CC Vincent and Peter
> >
> > On 07/17/23 14:47, Lukasz Luba wrote:
> > > Hi Qais,
> > >
> > > The rule is 'one size doesn't fit all', please see below.
> > >
> > > On 7/11/23 18:58, Qais Yousef wrote:
> > > > Hi Kajetan
> > > >
> > > > On 01/05/23 14:51, Kajetan Puchalski wrote:
> > > >
> > > > [...]
> > > >
> > > > > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > > > > struct cpuidle_device *dev)
> > > > > {
> > > > > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > > > > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> > > > > int i;
> > > > > memset(cpu_data, 0, sizeof(*cpu_data));
> > > > > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> > > >
> > > > Given that utilization is invariant, why do we set the threshold based on
> > > > cpu capacity?
> > >
> > >
> > > To treat CPUs differently, not with the same policy.
> > >
> > >
> > > >
> > > > I'm not sure if this is a problem, but on little cores this threshold would be
> > > > too low. Given that util is invariant - I wondered if we need to have a single
> > > > threshold for all type of CPUs instead. Have you tried something like that
> > >
> > > A single threshold for all CPUs might be biased towards some CPUs. Let's
> > > pick the value 15 - which was tested to work really good in benchmarks
> > > for the big CPUs. On the other hand when you set that value to little
> > > CPUs, with max_capacity = 124, than you have 15/124 ~= 13% threshold.
> > > That means you prefer to enter deeper idle state ~9x times (at max
> > > freq). What if the Little's freq is set to e.g. < ~20% fmax, which
> > > corresponds to capacity < ~25? Let's try to simulate such scenario.
> >
> > Hmm what I'm struggling with is that PELT is invariant. So the time it takes to
> > rise and decay to threshold of 15 should be the same for all CPUs, no?
>
> Yes, but that's not an issue. The idea is to have a threshold value
> set to a level which corresponds to a long enough slip time (in
> wall-clock). Then you don't waste the energy for turning on/off the
> core too often.
>
> >
> > >
> > > In a situation we could have utilization 14 on Little CPU, than CPU capacity
> > > (effectively frequency) voting based on utilization would be
> > > 1.2 * 14 = ~17 so let's pick OPP corresponding to 17 capacity.
> > > In such condition the little CPU would run the 14-util-periodic-task for
> > > 14/17= ~82% of wall-clock time. That's a lot, and not suited for
> > > entering deeper idle state on that CPU, isn't it?
> >
> > Yes runtime is stretched. But we counter this at utilization level by making
> > PELT invariant. I thought that any CPU in the system will now take the same
> > amount of time to ramp-up and decay to the same util level. No?
>
> The stretched runtime is what we want to derived out of rq util
> information, mostly after task migration to some next cpu.
>
> >
> > But maybe what you're saying is that we don't need to take the invariance into
> > account?
>
> Invariance is OK, since is the common ground when we migrate those tasks
> across cores and we still can conclude based on util the sleeping
> cpu time.
>
> >
> > My concern (that is not backed by real problem yet) is that the threshold is
> > near 0, and since PELT is invariant, the time to gain few points is constant
> > irrespective of any CPU/capacity/freq and this means the little CPUs has to be
> > absolutely idle with no activity almost at all, IIUC.
>
> As I said, that is good for those Little CPU in term of better latency
> for the tasks they serve and also doesn't harm the energy. The
> deeper idle state for those tiny silicon area cores (and low-power
> cells) doesn't bring much in avg for real workloads.
> This is the trade-off: you don't want to sacrifice the latency factor,
> you would rather pay a bit more energy not entering deep idle
> on Little cores, but get better latency there.
I can follow this argument much better than the time stretch one for sure.
I sort of agree, but feel cautious still this is too aggressive. I guess time
will let us know how suitable this is or we'll need to evolve :-)
FWIW, these patches are now in GKI, so we should get complaints if it is
a real problem.
> For the big cores, which occupy bigger silicon area and are made from
> High-Performance (HP) cells (and low V-threshold) leaking more, that
> trade-off is set differently. That's why a similar small util task on
> big core might be facing larger latency do to facing the need to
> wake-up cpu from deeper idle state.
>
> >
> > >
> > > Apart from that, the little CPUs are tiny in terms of silicon area
> > > and are less leaky in WFI than big cores. Therefore, they don't need
> > > aggressive entries into deeper idle state. At the same time, they
> > > are often used for serving interrupts, where the latency is important
> > > factor.
> >
> > On Pixel 6 this threshold will translate to util_threshold of 2. Which looks
> > too low to me. Can't TEO do a good job before we reach such extremely low level
> > of inactivity?
>
> Kajetan has introduced a new trace event 'cpu_idle_miss' which was
> helping us to say how to set the trade-off in different thresholds.
> It appeared that we would rather prefer to miss-predict and be wrong
> about missing opportunity to enter deeper idle. The opposite was more
> harmful. You can enable that trace event on your platform and analyze
> your use cases.
Okay, thanks for the info.
Cheers
--
Qais Yousef
On 07/18/23 13:02, Kajetan Puchalski wrote:
> Hi Qais,
>
> On Tue, Jul 11, 2023 at 06:58:14PM +0100, Qais Yousef wrote:
> > Hi Kajetan
> >
> > On 01/05/23 14:51, Kajetan Puchalski wrote:
> >
> > [...]
> >
> > > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > > struct cpuidle_device *dev)
> > > {
> > > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> > > int i;
> > >
> > > memset(cpu_data, 0, sizeof(*cpu_data));
> > > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> >
> > Given that utilization is invariant, why do we set the threshold based on
> > cpu capacity?
>
> Conceptually, the threshold is meant to represent a level at which the
> core is considered 'utilized'. I appreciate the definitions here can get
> a little fuzzy but I think of it as "generally doing a non-insignificant
> amount of work" even if there are currently no tasks scheduled on the core.
> This comes in handy in real-world workloads where the core will go
> through multiple cycles of busy-idle-busy-idle within each second.
> The intention here is to be able to distinguish a scenario of "going
> into idle for a few us because of the nature of the workload" from
> "going into idle for longer because there is no workload".
>
> I set the threshold based on capacity because I think conceptually it
> makes more sense to say "every CPU is consireded to be utilized if the
> util is above X% of its capacity" than to effectively have a varying
> percentage based on the size of the core. 60 util is not that
> much work for a 1024-util big core but it's almost half the capacity of
> a little one, using a percentage/shift on capacity lets us account for that
> while using a raw value would not.
Thanks for the explanation.
I did try the busy perspective, but I think I still view this as 60util means
we've are running on average for X ms. which I think what matters more than how
much this is of a work to the big core. I look at this; we still have few ms
worth of runtime on the CPU and it's not worth going to deeper idle state
yet.
I can appreciate you think that this percentage of runtime should be lower for
smaller cores. My doubt (which again is not backed by real problem - so I'm not
questioning but rather trying to understand :)) is that if this becomes too low
is it better than letting usual TEO logic to operate. The series in its current
shape is great and offers good improvement already, no doubt :)
By the way, by default big will get a threshold of 16, the little will get
a threshold of around 2. I think the latter will translate to few hundreds of
us of activity (haven't done proper measurement to be honest, so this could be
off but I don't think by much).
>
> There's also very practical issues but I'll describe those below.
>
> > I'm not sure if this is a problem, but on little cores this threshold would be
> > too low. Given that util is invariant - I wondered if we need to have a single
> > threshold for all type of CPUs instead. Have you tried something like that
> > while developing the patch?
>
> Yes, the problem there is that it's very difficult to define what "too low"
> actually means :)
target residency maybe?
> Namely, do we define 'too low' based on the effects it has on
> performance in terms of latency, on the resulting power usage or on the
> prediction accuracy? In terms of the prediction accuracy, how do we
> weigh the two possible types of mispredictions? I'll just try to explain
> my thinking and how I got to my conclusions.
>
> Based on my tests, on the types of platforms we both work with our
> state0/wfi is very power efficient to stay in, very power efficient
> to enter/exit and also very fast so it has very little impact on
> latency. On the other hand, state1 is power efficient to *stay in* but
> very costly to enter/exit in terms of *both* power and latency. The
> effect this has is that there's many cases where going through a cycle
> of busy-state1-busy-state1-busy and so on will actually use up more
> power than if you only kept the core in wfi.
>
> I had some tests done with effectively making the governor do "return 0"
> in state selection, never using any state1 and the results were still
> pretty good, only slightly worse than e.g. menu. The problem there was
> that not using state1 on big cores would not leave them time to cool
> down and we'd burn through the thermal budget too quickly then tank the
> performance.
>
> I don't have the numbers on hand but even completely disabling state1 on
> the entire little cluster will work perfectly fine - your latency for
> tasks that run on littles will improve and the thermal budget/power
> won't take a particularly noticeable hit because of how small they are
> in the first place.
>
> This is why the governor is intentionally skewed towards shallower
> states, they just work better most of the time. If you try to skew it
> the other way the results just come out much worse because even a
> relatively small amount of mispredicted state1 entries can completely
> nullify any benefits that selecting state1 could bring.
>
> The percentage approach does make the threshold for littles pretty small
> but as desccribed above that's perfectly fine, could say a feature not a
> bug :) If we tried setting a fixed one across all CPUs then we'd need to
I didn't think it's a bug. But it seemed too low, hence the question.
I actually thought a single value is enough for all CPUs since util is
invariant and the tipping point where TEO normal predictions should work should
be the same.
Thanks for the detailed explanation :)
> pick one high enough for the big cores which would end up being too high
> for the littles, lead to excessive state1 entries and all the issues
> I've just described. TLDR: there's just more downsides on the other side.
This is an artifact of the shifting algorithm you used to derive it. It is not
a real restriction. But I appreciate that simple approach proved to be good
enough, and I have nothing against it.
>
> In development I just had a sysctl to set the threshold shift and iirc I
> tested values from 3 to 10-12 eventually arriving at 6 being the one
> with best results across different metrics and benchmarks. If you're
> backporting the patch somewhere and have a specific platform feel free
> to test it with different shift values, it's possible that different
> platforms will behave differently with this. I doubt there's any
> appetite to make the shift tweakable at runtime rather than a
> compile-time constant but if you'd like to push for that I'm happy to
> sign off on it, would work just as well as it does now.
These patches are in GKI. So we'll if there are uncaught problems I guess :)
No appetite for a knob, but the very low value for littles did strike me and
thought I better ask at least. Today's littles are too tiny for their own good
and it seemed the threshold could end up being too aggressive especially in low
activity state. You effectively are saying that if we have few 100us of
activity, normal TEO predictions based on timers are no good and better to stay
shallower anyway.
Note that due to NOHZ, if we go to idle for an extended period the util value
might not decay for a while and miss some opportunities. Especially that when
it next wakes up, it's enough for this wake up to run for few 100s us to block
a deeper state before going back to sleep for extended period of time.
But we shall see. I got the answer I was looking for for now.
Thanks for the help!
--
Qais Yousef
On Tue, Jul 18, 2023 at 02:24:32PM +0100, Qais Yousef wrote:
> On 07/18/23 13:02, Kajetan Puchalski wrote:
> > Hi Qais,
> >
> > On Tue, Jul 11, 2023 at 06:58:14PM +0100, Qais Yousef wrote:
> > > Hi Kajetan
> > >
> > > On 01/05/23 14:51, Kajetan Puchalski wrote:
> > >
> > > [...]
> > >
> > > > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > > > struct cpuidle_device *dev)
> > > > {
> > > > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > > > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> > > > int i;
> > > >
> > > > memset(cpu_data, 0, sizeof(*cpu_data));
> > > > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> > >
> > > Given that utilization is invariant, why do we set the threshold based on
> > > cpu capacity?
> >
> > Conceptually, the threshold is meant to represent a level at which the
> > core is considered 'utilized'. I appreciate the definitions here can get
> > a little fuzzy but I think of it as "generally doing a non-insignificant
> > amount of work" even if there are currently no tasks scheduled on the core.
> > This comes in handy in real-world workloads where the core will go
> > through multiple cycles of busy-idle-busy-idle within each second.
> > The intention here is to be able to distinguish a scenario of "going
> > into idle for a few us because of the nature of the workload" from
> > "going into idle for longer because there is no workload".
> >
> > I set the threshold based on capacity because I think conceptually it
> > makes more sense to say "every CPU is consireded to be utilized if the
> > util is above X% of its capacity" than to effectively have a varying
> > percentage based on the size of the core. 60 util is not that
> > much work for a 1024-util big core but it's almost half the capacity of
> > a little one, using a percentage/shift on capacity lets us account for that
> > while using a raw value would not.
>
> Thanks for the explanation.
>
> I did try the busy perspective, but I think I still view this as 60util means
> we've are running on average for X ms. which I think what matters more than how
> much this is of a work to the big core. I look at this; we still have few ms
> worth of runtime on the CPU and it's not worth going to deeper idle state
> yet.
>
> I can appreciate you think that this percentage of runtime should be lower for
> smaller cores. My doubt (which again is not backed by real problem - so I'm not
> questioning but rather trying to understand :)) is that if this becomes too low
> is it better than letting usual TEO logic to operate. The series in its current
> shape is great and offers good improvement already, no doubt :)
No worries! In my experience it does tend to be better than just letting
the metrics logic operate as the metrics logic is pretty much just
reasonably good maths-based guessing and it relies on making mistakes
before adjusting - the util approach tries to just not make those
mistakes. Again, too shallow decisions most of the time are perfectly
fine, too deep decisions are an actual problem for both power and
performance.
> By the way, by default big will get a threshold of 16, the little will get
> a threshold of around 2. I think the latter will translate to few hundreds of
> us of activity (haven't done proper measurement to be honest, so this could be
> off but I don't think by much).
Yeah that sounds about right but I don't think that's really an issue -
I'm pretty sure a case could be made to just never use deep idle on
littles anyway. Having some threshold above 0 at least accounts for the
"the phone is in your pocket" scenario. Because littles tend to run
many small background tasks over time in my testing they end up getting
tons of too deep decisions that mess things up so avoiding it tends
to be more beneficial. Especially I saw better results in UI benchmarks
like Jankbench that mainly run on littles.
> >
> > There's also very practical issues but I'll describe those below.
> >
> > > I'm not sure if this is a problem, but on little cores this threshold would be
> > > too low. Given that util is invariant - I wondered if we need to have a single
> > > threshold for all type of CPUs instead. Have you tried something like that
> > > while developing the patch?
> >
> > Yes, the problem there is that it's very difficult to define what "too low"
> > actually means :)
>
> target residency maybe?
Target residency refers to how long the cpu stays in idle, we're talking
about a threshold determining the avg util at which we allow deeper
idle in the first place. I don't think one really impacts the other? I
don't think we can really extrapolate from avg util the precise amount
of time we have until the next wakeup on the CPU apart from the
"probably soon" that the threshold is meant to determine.
> > Namely, do we define 'too low' based on the effects it has on
> > performance in terms of latency, on the resulting power usage or on the
> > prediction accuracy? In terms of the prediction accuracy, how do we
> > weigh the two possible types of mispredictions? I'll just try to explain
> > my thinking and how I got to my conclusions.
> >
> > Based on my tests, on the types of platforms we both work with our
> > state0/wfi is very power efficient to stay in, very power efficient
> > to enter/exit and also very fast so it has very little impact on
> > latency. On the other hand, state1 is power efficient to *stay in* but
> > very costly to enter/exit in terms of *both* power and latency. The
> > effect this has is that there's many cases where going through a cycle
> > of busy-state1-busy-state1-busy and so on will actually use up more
> > power than if you only kept the core in wfi.
> >
> > I had some tests done with effectively making the governor do "return 0"
> > in state selection, never using any state1 and the results were still
> > pretty good, only slightly worse than e.g. menu. The problem there was
> > that not using state1 on big cores would not leave them time to cool
> > down and we'd burn through the thermal budget too quickly then tank the
> > performance.
> >
> > I don't have the numbers on hand but even completely disabling state1 on
> > the entire little cluster will work perfectly fine - your latency for
> > tasks that run on littles will improve and the thermal budget/power
> > won't take a particularly noticeable hit because of how small they are
> > in the first place.
> >
> > This is why the governor is intentionally skewed towards shallower
> > states, they just work better most of the time. If you try to skew it
> > the other way the results just come out much worse because even a
> > relatively small amount of mispredicted state1 entries can completely
> > nullify any benefits that selecting state1 could bring.
> >
> > The percentage approach does make the threshold for littles pretty small
> > but as desccribed above that's perfectly fine, could say a feature not a
> > bug :) If we tried setting a fixed one across all CPUs then we'd need to
>
> I didn't think it's a bug. But it seemed too low, hence the question.
> I actually thought a single value is enough for all CPUs since util is
> invariant and the tipping point where TEO normal predictions should work should
> be the same.
Makes sense of course, I was saying it in the sense of "yes it's
intentionally that low" :)
From what I recall when I was testing it higher thresholds just didn't
work as well for the activity on the littles but could be
platform-specifc, always worth testing I suppose.
> Thanks for the detailed explanation :)
>
> > pick one high enough for the big cores which would end up being too high
> > for the littles, lead to excessive state1 entries and all the issues
> > I've just described. TLDR: there's just more downsides on the other side.
>
> This is an artifact of the shifting algorithm you used to derive it. It is not
> a real restriction. But I appreciate that simple approach proved to be good
> enough, and I have nothing against it.
True but even without the shifting just putting in a fixed number across
all clusters would have the same effect I'm pretty sure. E.g. if we go
with 16 that's fine for the big cluster but will let through a big chunk
of the too deep sleeps on the littles. If we pick 5 it's better for the
littles but suddenly the big cores which tend to get bigger-util tasks
will almost never go into deep idle.
> >
> > In development I just had a sysctl to set the threshold shift and iirc I
> > tested values from 3 to 10-12 eventually arriving at 6 being the one
> > with best results across different metrics and benchmarks. If you're
> > backporting the patch somewhere and have a specific platform feel free
> > to test it with different shift values, it's possible that different
> > platforms will behave differently with this. I doubt there's any
> > appetite to make the shift tweakable at runtime rather than a
> > compile-time constant but if you'd like to push for that I'm happy to
> > sign off on it, would work just as well as it does now.
>
> These patches are in GKI. So we'll if there are uncaught problems I guess :)
More testing platform is always a good idea, maybe we'll find some new
things out :)
>
> No appetite for a knob, but the very low value for littles did strike me and
> thought I better ask at least. Today's littles are too tiny for their own good
> and it seemed the threshold could end up being too aggressive especially in low
> activity state. You effectively are saying that if we have few 100us of
> activity, normal TEO predictions based on timers are no good and better to stay
> shallower anyway.
It probably would be too aggressive if our state0 was a polling state
like on Intel cores, here we're very much leveraging the fact that arm
wfi is already very efficient. This makes it possible to afford multiple
'too shallow' sleeps that give us better latency + not a big power
difference as opposed to the 'too deep' ones that seriously harm both.
> Note that due to NOHZ, if we go to idle for an extended period the util value
> might not decay for a while and miss some opportunities. Especially that when
> it next wakes up, it's enough for this wake up to run for few 100s us to block
> a deeper state before going back to sleep for extended period of time.
That's true, again a tradeoff that usually comes out better - it's much
better overall to miss a deep sleep opportunity than to go into deep
idle and then not hit the residency.
> But we shall see. I got the answer I was looking for for now.
You're welcome, good luck!
>
> Thanks for the help!
>
> --
> Qais Yousef
Hi Kajetan
On 07/18/23 14:24, Qais Yousef wrote:
> These patches are in GKI. So we'll if there are uncaught problems I guess :)
>
> No appetite for a knob, but the very low value for littles did strike me and
> thought I better ask at least. Today's littles are too tiny for their own good
> and it seemed the threshold could end up being too aggressive especially in low
> activity state. You effectively are saying that if we have few 100us of
> activity, normal TEO predictions based on timers are no good and better to stay
> shallower anyway.
>
> Note that due to NOHZ, if we go to idle for an extended period the util value
> might not decay for a while and miss some opportunities. Especially that when
> it next wakes up, it's enough for this wake up to run for few 100s us to block
> a deeper state before going back to sleep for extended period of time.
>
> But we shall see. I got the answer I was looking for for now.
Unfortunately not too long after the patches got merged I got regression report
of worse power. As you know on Android things are not as mainline, so I need to
untangle this to make sure it's not a red herring. But if you want to take my
word for it, I think the chances of it being a true regression is high. I had
to introduce knobs to allow controlling the thresholds for now, so the good
news they do help and it's not a total revert. I don't have a lot of info to
share, but it's the low activity use cases that seem to got impacted. Like
video playback for instance.
Generally, I'm trying to remove some hardcoded values from the scheduler that
enforces a behavior that is not universally desired on all systems/workloads.
And I think the way the util awareness threshold are done today fall into the
same category.
As I tried to highlight before, it is easy to trick the threshold by a task
that runs for a short time then goes back to sleep for a long time.
And when the system runs full throttle for a while, it'll take around 150+ms
for the util to decay to the threshold value. That's a long time to block
entering deeper idle states for. I'm not sure how NOHZ and blocked averaged
updates can make this potentially worse.
In my view, the absolute comparison against util can be misleading. Even when
util is 512 for example, we still have 50% of idle time. How this time is
distributed can't be known from util alone. It could be one task waking up and
sleeping. It could be multiple tasks at many combination of patterns all
leading to the same outcome of CPU util being 512.
IIUC the idea is that if we have even small activity, then erring on the
shallow side is better. But given that target-residency is usually in few ms
range, do we really need to be that quite? With a target-residency of 3ms for
example, even at util of 900 there can be opportunities to enter it.
Can't we instead sample util at entry to idle loop and see if it is on a rising
or falling trend? When rising it makes sense to say there's demand, let's block
deeper idle state. But if it is falling, then if the decay time is longer than
target-residency we can say it's okay to permit the deeper idle states?
I need to think more about this; but I think it's worth trying to make these
thresholds more deterministic and quantifiable. There are too many workloads
and system variations. I'm not sure if a knob to control these thresholds is
good for anything but a workaround like I had to do. These hardcoded values
can be improved IMHO. Happy to help to find alternatives.
Cheers
--
Qais Yousef
Hi,
On Sun, Sep 17, 2023 at 02:05:16AM +0100, Qais Yousef wrote:
> Hi Kajetan
>
> On 07/18/23 14:24, Qais Yousef wrote:
>
> > These patches are in GKI. So we'll if there are uncaught problems I guess :)
> >
> > No appetite for a knob, but the very low value for littles did strike me and
> > thought I better ask at least. Today's littles are too tiny for their own good
> > and it seemed the threshold could end up being too aggressive especially in low
> > activity state. You effectively are saying that if we have few 100us of
> > activity, normal TEO predictions based on timers are no good and better to stay
> > shallower anyway.
> >
> > Note that due to NOHZ, if we go to idle for an extended period the util value
> > might not decay for a while and miss some opportunities. Especially that when
> > it next wakes up, it's enough for this wake up to run for few 100s us to block
> > a deeper state before going back to sleep for extended period of time.
> >
> > But we shall see. I got the answer I was looking for for now.
>
> Unfortunately not too long after the patches got merged I got regression report
> of worse power. As you know on Android things are not as mainline, so I need to
> untangle this to make sure it's not a red herring. But if you want to take my
> word for it, I think the chances of it being a true regression is high. I had
> to introduce knobs to allow controlling the thresholds for now, so the good
> news they do help and it's not a total revert. I don't have a lot of info to
> share, but it's the low activity use cases that seem to got impacted. Like
> video playback for instance.
Ah that's pretty unfortunate indeed. I definitely see how that could
happen, especially given that as you said it seems to be tied to
low activity use cases. When I was testing it out with UI benchmarks
(e.g. jankbench) the power numbers usually came out better. I suppose
video playback could be falling just over the currently set threshold
where it does not benefit from it in the same way as other workloads.
> Generally, I'm trying to remove some hardcoded values from the scheduler that
> enforces a behavior that is not universally desired on all systems/workloads.
> And I think the way the util awareness threshold are done today fall into the
> same category.
>
> As I tried to highlight before, it is easy to trick the threshold by a task
> that runs for a short time then goes back to sleep for a long time.
I do agree, the way I view this topic is just that using util as a proxy
for idle correctness gives us another dimension that cpuidle otherwise
does not have. I'm not very invested in the hardcoded threshold, it was
giving reasonable results in my testing so I stuck with it to not
overcomplicate the patchset but I see it more as "step 1" rather than
the end goal. There's definitely improvements to be had there and a
better way to do this, I'm just not sure what it is at the moment.
> And when the system runs full throttle for a while, it'll take around 150+ms
> for the util to decay to the threshold value. That's a long time to block
> entering deeper idle states for. I'm not sure how NOHZ and blocked averaged
> updates can make this potentially worse.
>
> In my view, the absolute comparison against util can be misleading. Even when
> util is 512 for example, we still have 50% of idle time. How this time is
> distributed can't be known from util alone. It could be one task waking up and
> sleeping. It could be multiple tasks at many combination of patterns all
> leading to the same outcome of CPU util being 512.
> IIUC the idea is that if we have even small activity, then erring on the
> shallow side is better. But given that target-residency is usually in few ms
> range, do we really need to be that quite? With a target-residency of 3ms for
> example, even at util of 900 there can be opportunities to enter it.
It could be misleading, the question then is which scenario happens more
often - the one where we have enough time for deeper idle but are
missing it because of the util or the one where the wakeups are erratic
enough that the util check saves us from incorrectly entering deep idle.
In most workloads I saw it was the latter hence the improvements but I
imagine the actual answer is very workload dependent so I'm not very
surprised that some cases go the other way.
> Can't we instead sample util at entry to idle loop and see if it is on a rising
> or falling trend? When rising it makes sense to say there's demand, let's block
> deeper idle state. But if it is falling, then if the decay time is longer than
> target-residency we can say it's okay to permit the deeper idle states?
Sounds like a pretty good idea to explore - I've not tried it myself at
this point so I have no idea how it'd work in practice but at least
conceptually this could make sense imo.
> I need to think more about this; but I think it's worth trying to make these
> thresholds more deterministic and quantifiable. There are too many workloads
> and system variations. I'm not sure if a knob to control these thresholds is
> good for anything but a workaround like I had to do. These hardcoded values
> can be improved IMHO. Happy to help to find alternatives.
Yes very much agreed on this part, they definitely can be improved. I'm
currently exploring some other approaches to this issue as well but
that's more of a long-term thing so it'll likely take a couple of months
to see whether they're workable or not.
The role I think the util bits are supposed to play here is mainly to
give us a view from a bit higher up than the metrics themselves do
because of how quickly those decay. Another way to do it would be some
way to make the threshold self-adjusting in the same way the metrics
are, e.g. increase the threshold if we're suddenly hitting too many too
shallow sleeps and decrease when we're hitting too many deep sleeps as a
result of the util checks. This would take care of the edge cases
currently falling through the cracks of their util being right around the
threshold, the mechanism would adjust within a few seconds of a workload
like video playback running. Could be a step in the right direction at least.
>
> Cheers
>
> --
> Qais Yousef
On 09/18/23 12:41, Kajetan Puchalski wrote:
> Yes very much agreed on this part, they definitely can be improved. I'm
> currently exploring some other approaches to this issue as well but
> that's more of a long-term thing so it'll likely take a couple of months
> to see whether they're workable or not.
Sounds good :)
> The role I think the util bits are supposed to play here is mainly to
> give us a view from a bit higher up than the metrics themselves do
> because of how quickly those decay. Another way to do it would be some
> way to make the threshold self-adjusting in the same way the metrics
> are, e.g. increase the threshold if we're suddenly hitting too many too
> shallow sleeps and decrease when we're hitting too many deep sleeps as a
> result of the util checks. This would take care of the edge cases
> currently falling through the cracks of their util being right around the
> threshold, the mechanism would adjust within a few seconds of a workload
> like video playback running. Could be a step in the right direction at least.
I'm not keen on the 'too many' part personally. It does feel finger in the air
to me. But maybe it can work.
Beside timers, tasks wake up on synchronization mechanisms and IPC, we can
potentially gather more info from there. For example, if a task is blocked on
a kernel lock, then chances its in kernel contention and it supposed to wake up
soon for the kernel to finish servicing the operation (ie: finish the syscall
and return to user mode).
Android has the notion of cpus being in deeper idle states in EAS. I think we
could benefit from such thing in upstream too. Not all idle cpus are equal when
idle states are taken into account, for both power and latency reasons. So
there's room to improve on the wake up side to help keep CPUs in deeper idle
states too.
Load balancer state can give indications too. It'll try to spread to idle CPUs
first. Its activity is a good indication on the demand for an idle CPU to be
ready to run something soon.
Generally maybe scheduler and idle governor can coordinate better to
provide/request some low latency idle CPUs, but allow the rest to go into
deeper idle states, per TEO's rule of course.
On HMP systems this will be trickier for us as not all CPUs are equal. And
cluster idle can complicate things further. So the nomination of a low latency
CPU needs to consider more things into account compared to other systems
(except NUMA maybe) where any idle CPU is as good as any other.
Thanks!
--
Qais Yousef
Hi All,
I'm quite late on this thread but this patchset creates a major
regression for psci cpuidle driver when using the OSI mode (OS
initiated mode). In such a case, cpuidle driver takes care only of
CPUs power state and the deeper C-states ,which includes cluster and
other power domains, are handled with power domain framework. In such
configuration ,cpuidle has only 2 c-states : WFI and cpu off states
and others states that include the clusters, are managed by genpd and
its governor.
This patch selects cpuidle c-state N-1 as soon as the utilization is
above CPU capacity / 64 which means at most a level of 16 on the big
core but can be as low as 4 on little cores. These levels are very low
and the main result is that as soon as there is very little activity
on a CPU, cpuidle always selects WFI states whatever the estimated
sleep duration and which prevents any deeper states. Another effect is
that it also keeps the tick firing every 1ms in my case.
IMO, we should at least increase the utilization level
Regards,
Vincent
On Sun, 17 Sept 2023 at 03:05, Qais Yousef <[email protected]> wrote:
>
> Hi Kajetan
>
> On 07/18/23 14:24, Qais Yousef wrote:
>
> > These patches are in GKI. So we'll if there are uncaught problems I guess :)
> >
> > No appetite for a knob, but the very low value for littles did strike me and
> > thought I better ask at least. Today's littles are too tiny for their own good
> > and it seemed the threshold could end up being too aggressive especially in low
> > activity state. You effectively are saying that if we have few 100us of
> > activity, normal TEO predictions based on timers are no good and better to stay
> > shallower anyway.
> >
> > Note that due to NOHZ, if we go to idle for an extended period the util value
> > might not decay for a while and miss some opportunities. Especially that when
> > it next wakes up, it's enough for this wake up to run for few 100s us to block
> > a deeper state before going back to sleep for extended period of time.
> >
> > But we shall see. I got the answer I was looking for for now.
>
> Unfortunately not too long after the patches got merged I got regression report
> of worse power. As you know on Android things are not as mainline, so I need to
> untangle this to make sure it's not a red herring. But if you want to take my
> word for it, I think the chances of it being a true regression is high. I had
> to introduce knobs to allow controlling the thresholds for now, so the good
> news they do help and it's not a total revert. I don't have a lot of info to
> share, but it's the low activity use cases that seem to got impacted. Like
> video playback for instance.
>
> Generally, I'm trying to remove some hardcoded values from the scheduler that
> enforces a behavior that is not universally desired on all systems/workloads.
> And I think the way the util awareness threshold are done today fall into the
> same category.
>
> As I tried to highlight before, it is easy to trick the threshold by a task
> that runs for a short time then goes back to sleep for a long time.
>
> And when the system runs full throttle for a while, it'll take around 150+ms
> for the util to decay to the threshold value. That's a long time to block
> entering deeper idle states for. I'm not sure how NOHZ and blocked averaged
> updates can make this potentially worse.
>
> In my view, the absolute comparison against util can be misleading. Even when
> util is 512 for example, we still have 50% of idle time. How this time is
> distributed can't be known from util alone. It could be one task waking up and
> sleeping. It could be multiple tasks at many combination of patterns all
> leading to the same outcome of CPU util being 512.
>
> IIUC the idea is that if we have even small activity, then erring on the
> shallow side is better. But given that target-residency is usually in few ms
> range, do we really need to be that quite? With a target-residency of 3ms for
> example, even at util of 900 there can be opportunities to enter it.
>
> Can't we instead sample util at entry to idle loop and see if it is on a rising
> or falling trend? When rising it makes sense to say there's demand, let's block
> deeper idle state. But if it is falling, then if the decay time is longer than
> target-residency we can say it's okay to permit the deeper idle states?
>
> I need to think more about this; but I think it's worth trying to make these
> thresholds more deterministic and quantifiable. There are too many workloads
> and system variations. I'm not sure if a knob to control these thresholds is
> good for anything but a workaround like I had to do. These hardcoded values
> can be improved IMHO. Happy to help to find alternatives.
>
>
> Cheers
>
> --
> Qais Yousef
Hi Vincent,
On 5/28/24 10:29, Vincent Guittot wrote:
> Hi All,
>
> I'm quite late on this thread but this patchset creates a major
> regression for psci cpuidle driver when using the OSI mode (OS
> initiated mode). In such a case, cpuidle driver takes care only of
> CPUs power state and the deeper C-states ,which includes cluster and
> other power domains, are handled with power domain framework. In such
> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
> and others states that include the clusters, are managed by genpd and
> its governor.
>
> This patch selects cpuidle c-state N-1 as soon as the utilization is
> above CPU capacity / 64 which means at most a level of 16 on the big
> core but can be as low as 4 on little cores. These levels are very low
> and the main result is that as soon as there is very little activity
> on a CPU, cpuidle always selects WFI states whatever the estimated
> sleep duration and which prevents any deeper states. Another effect is
> that it also keeps the tick firing every 1ms in my case.
Thanks for reporting this.
Could you add what regression it's causing, please?
Performance or higher power?
Do you have some handy numbers, so we can see the problem size?
>
> IMO, we should at least increase the utilization level
Something worth to discuss, or make it configurable even.
Regards,
Lukasz
>
> Regards,
> Vincent
On 5/28/24 10:29, Vincent Guittot wrote:
> Hi All,
>
> I'm quite late on this thread but this patchset creates a major
> regression for psci cpuidle driver when using the OSI mode (OS
> initiated mode). In such a case, cpuidle driver takes care only of
> CPUs power state and the deeper C-states ,which includes cluster and
> other power domains, are handled with power domain framework. In such
> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
> and others states that include the clusters, are managed by genpd and
> its governor.
>
> This patch selects cpuidle c-state N-1 as soon as the utilization is
> above CPU capacity / 64 which means at most a level of 16 on the big
> core but can be as low as 4 on little cores. These levels are very low
> and the main result is that as soon as there is very little activity
> on a CPU, cpuidle always selects WFI states whatever the estimated
> sleep duration and which prevents any deeper states. Another effect is
> that it also keeps the tick firing every 1ms in my case.
>
> IMO, we should at least increase the utilization level
>
> Regards,
> Vincent
I looked at teo too and what you describe looks reasonable within my
expectation.
Could you describe your workload a bit and details about the (I assume)
power regression?
Maybe compare with 64/32/16 as a divisor and a hack that doesn't override
tick_stop on utilization?
While /64 might seem aggressive (maybe it is too aggressive) I think it
tries to avoid cpu_idle_miss on c1 (too deep state selected) much more than
c0 (too shallow selected), because the latter only costs us some (small?)
energy while the former costs us performance and energy.
Basically WFI should already be really efficient and thus selecting
anything deeper be avoided as long as there is still some utilization.
But I'd be curious on your numbers.
Kind Regards,
Christian
Hi Vincent,
On Tue, May 28, 2024 at 11:29:02AM +0200, Vincent Guittot wrote:
> Hi All,
>
> I'm quite late on this thread but this patchset creates a major
> regression for psci cpuidle driver when using the OSI mode (OS
> initiated mode). In such a case, cpuidle driver takes care only of
> CPUs power state and the deeper C-states ,which includes cluster and
> other power domains, are handled with power domain framework. In such
> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
> and others states that include the clusters, are managed by genpd and
> its governor.
>
> This patch selects cpuidle c-state N-1 as soon as the utilization is
> above CPU capacity / 64 which means at most a level of 16 on the big
> core but can be as low as 4 on little cores. These levels are very low
> and the main result is that as soon as there is very little activity
> on a CPU, cpuidle always selects WFI states whatever the estimated
> sleep duration and which prevents any deeper states. Another effect is
> that it also keeps the tick firing every 1ms in my case.
>
> IMO, we should at least increase the utilization level
I think you're most likely right on this, the reason why I ended up
leaving the threshold at cap/64 was that at cap/32 it would be too high
for the mechanism to actually have any effects on the device I was
testing this on.
The issue then of course is that if you tailor the threshold to little
cores it becomes too high for big cores, conversely if you tailor it to
big cores it becomes too low for small ones.
We could get around this by making sure the threshold doesn't end up
being lower than a certain capacity-independent minimum, how does that sound?
cpu_data->util_threshold = max(MIN_THRESHOLD, max_capacity >> UTIL_THRESHOLD_SHIFT);
And we set MIN_THRESHOLD to something like 10, 15 or 20. Not sure which
one would be more appropriate but I think this could help alleviate some
issues with the mechanism being too aggressive.
>
> Regards,
> Vincent
>
> On Sun, 17 Sept 2023 at 03:05, Qais Yousef <[email protected]> wrote:
> >
> > Hi Kajetan
> >
> > On 07/18/23 14:24, Qais Yousef wrote:
> >
> > > These patches are in GKI. So we'll if there are uncaught problems I guess :)
> > >
> > > No appetite for a knob, but the very low value for littles did strike me and
> > > thought I better ask at least. Today's littles are too tiny for their own good
> > > and it seemed the threshold could end up being too aggressive especially in low
> > > activity state. You effectively are saying that if we have few 100us of
> > > activity, normal TEO predictions based on timers are no good and better to stay
> > > shallower anyway.
> > >
> > > Note that due to NOHZ, if we go to idle for an extended period the util value
> > > might not decay for a while and miss some opportunities. Especially that when
> > > it next wakes up, it's enough for this wake up to run for few 100s us to block
> > > a deeper state before going back to sleep for extended period of time.
> > >
> > > But we shall see. I got the answer I was looking for for now.
> >
> > Unfortunately not too long after the patches got merged I got regression report
> > of worse power. As you know on Android things are not as mainline, so I need to
> > untangle this to make sure it's not a red herring. But if you want to take my
> > word for it, I think the chances of it being a true regression is high. I had
> > to introduce knobs to allow controlling the thresholds for now, so the good
> > news they do help and it's not a total revert. I don't have a lot of info to
> > share, but it's the low activity use cases that seem to got impacted. Like
> > video playback for instance.
> >
> > Generally, I'm trying to remove some hardcoded values from the scheduler that
> > enforces a behavior that is not universally desired on all systems/workloads.
> > And I think the way the util awareness threshold are done today fall into the
> > same category.
> >
> > As I tried to highlight before, it is easy to trick the threshold by a task
> > that runs for a short time then goes back to sleep for a long time.
> >
> > And when the system runs full throttle for a while, it'll take around 150+ms
> > for the util to decay to the threshold value. That's a long time to block
> > entering deeper idle states for. I'm not sure how NOHZ and blocked averaged
> > updates can make this potentially worse.
> >
> > In my view, the absolute comparison against util can be misleading. Even when
> > util is 512 for example, we still have 50% of idle time. How this time is
> > distributed can't be known from util alone. It could be one task waking up and
> > sleeping. It could be multiple tasks at many combination of patterns all
> > leading to the same outcome of CPU util being 512.
> >
> > IIUC the idea is that if we have even small activity, then erring on the
> > shallow side is better. But given that target-residency is usually in few ms
> > range, do we really need to be that quite? With a target-residency of 3ms for
> > example, even at util of 900 there can be opportunities to enter it.
> >
> > Can't we instead sample util at entry to idle loop and see if it is on a rising
> > or falling trend? When rising it makes sense to say there's demand, let's block
> > deeper idle state. But if it is falling, then if the decay time is longer than
> > target-residency we can say it's okay to permit the deeper idle states?
> >
> > I need to think more about this; but I think it's worth trying to make these
> > thresholds more deterministic and quantifiable. There are too many workloads
> > and system variations. I'm not sure if a knob to control these thresholds is
> > good for anything but a workaround like I had to do. These hardcoded values
> > can be improved IMHO. Happy to help to find alternatives.
> >
> >
> > Cheers
> >
> > --
> > Qais Yousef
On Tue, 28 May 2024 at 11:59, Lukasz Luba <[email protected]> wrote:
>
> Hi Vincent,
>
> On 5/28/24 10:29, Vincent Guittot wrote:
> > Hi All,
> >
> > I'm quite late on this thread but this patchset creates a major
> > regression for psci cpuidle driver when using the OSI mode (OS
> > initiated mode). In such a case, cpuidle driver takes care only of
> > CPUs power state and the deeper C-states ,which includes cluster and
> > other power domains, are handled with power domain framework. In such
> > configuration ,cpuidle has only 2 c-states : WFI and cpu off states
> > and others states that include the clusters, are managed by genpd and
> > its governor.
> >
> > This patch selects cpuidle c-state N-1 as soon as the utilization is
> > above CPU capacity / 64 which means at most a level of 16 on the big
> > core but can be as low as 4 on little cores. These levels are very low
> > and the main result is that as soon as there is very little activity
> > on a CPU, cpuidle always selects WFI states whatever the estimated
> > sleep duration and which prevents any deeper states. Another effect is
> > that it also keeps the tick firing every 1ms in my case.
>
> Thanks for reporting this.
> Could you add what regression it's causing, please?
> Performance or higher power?
It's not a perf but rather a power regression. I don't have a power
counter so it's difficult to give figures but I found it while running
a unitary test below on my rb5:
run 500us every 19457ms on medium core (uclamp_min: 600).
With this use case, the idle time is more than 18ms (the 500us becomes
1ms as we don't run at max capacity) but the tick fires every 1ms
while the system is fully idle (all 8 cpus are idle) and as cpuidle
selects WFI, it prevents the full cluster power down. So even if WFI
is efficient, the power impact should be significant.
For a 5 sec test duration, the system doesn't spend any time in
cluster power down state with this patch but spent 3.9 sec in cluster
power down state without
> Do you have some handy numbers, so we can see the problem size?
>
> >
> > IMO, we should at least increase the utilization level
>
> Something worth to discuss, or make it configurable even.
>
> Regards,
> Lukasz
>
> >
> > Regards,
> > Vincent
On 05/28/24 11:29, Vincent Guittot wrote:
> Hi All,
>
> I'm quite late on this thread but this patchset creates a major
> regression for psci cpuidle driver when using the OSI mode (OS
> initiated mode). In such a case, cpuidle driver takes care only of
> CPUs power state and the deeper C-states ,which includes cluster and
> other power domains, are handled with power domain framework. In such
> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
> and others states that include the clusters, are managed by genpd and
> its governor.
>
> This patch selects cpuidle c-state N-1 as soon as the utilization is
> above CPU capacity / 64 which means at most a level of 16 on the big
> core but can be as low as 4 on little cores. These levels are very low
> and the main result is that as soon as there is very little activity
> on a CPU, cpuidle always selects WFI states whatever the estimated
> sleep duration and which prevents any deeper states. Another effect is
> that it also keeps the tick firing every 1ms in my case.
Unfortunately I think we need to revert this. We've been seeing the power
regressions for a long while now and it doesn't seem we'll see an improvement
soon based on last discussion.
>
> IMO, we should at least increase the utilization level
This won't help. We tried different values, unfortunately the logic is flawed.
Utilization value on its own says nothing about the idleness of the system.
I think best to revert and rethink the logic. Which is something we're pursuing
and we'll share outcome when we have something to share. As it stands, this
doesn't help. And we should really strive to avoid magic thresholds and values.
They don't scale.
Thanks!
--
Qais Yousef
On 05/28/24 13:12, Kajetan Puchalski wrote:
> Hi Vincent,
>
> On Tue, May 28, 2024 at 11:29:02AM +0200, Vincent Guittot wrote:
> > Hi All,
> >
> > I'm quite late on this thread but this patchset creates a major
> > regression for psci cpuidle driver when using the OSI mode (OS
> > initiated mode). In such a case, cpuidle driver takes care only of
> > CPUs power state and the deeper C-states ,which includes cluster and
> > other power domains, are handled with power domain framework. In such
> > configuration ,cpuidle has only 2 c-states : WFI and cpu off states
> > and others states that include the clusters, are managed by genpd and
> > its governor.
> >
> > This patch selects cpuidle c-state N-1 as soon as the utilization is
> > above CPU capacity / 64 which means at most a level of 16 on the big
> > core but can be as low as 4 on little cores. These levels are very low
> > and the main result is that as soon as there is very little activity
> > on a CPU, cpuidle always selects WFI states whatever the estimated
> > sleep duration and which prevents any deeper states. Another effect is
> > that it also keeps the tick firing every 1ms in my case.
> >
> > IMO, we should at least increase the utilization level
>
> I think you're most likely right on this, the reason why I ended up
> leaving the threshold at cap/64 was that at cap/32 it would be too high
> for the mechanism to actually have any effects on the device I was
> testing this on.
>
> The issue then of course is that if you tailor the threshold to little
> cores it becomes too high for big cores, conversely if you tailor it to
> big cores it becomes too low for small ones.
>
> We could get around this by making sure the threshold doesn't end up
> being lower than a certain capacity-independent minimum, how does that sound?
>
> cpu_data->util_threshold = max(MIN_THRESHOLD, max_capacity >> UTIL_THRESHOLD_SHIFT);
>
> And we set MIN_THRESHOLD to something like 10, 15 or 20. Not sure which
> one would be more appropriate but I think this could help alleviate some
> issues with the mechanism being too aggressive.
I'm afraid this is not a solution. The whole threshold logic doesn't work I am
afraid. And there's no magic number that we can come up with to make it work.
The whole approach needs rethinking.
> >
> > Regards,
> > Vincent
> >
> > On Sun, 17 Sept 2023 at 03:05, Qais Yousef <[email protected]> wrote:
> > >
> > > Hi Kajetan
> > >
> > > On 07/18/23 14:24, Qais Yousef wrote:
> > >
> > > > These patches are in GKI. So we'll if there are uncaught problems I guess :)
> > > >
> > > > No appetite for a knob, but the very low value for littles did strike me and
> > > > thought I better ask at least. Today's littles are too tiny for their own good
> > > > and it seemed the threshold could end up being too aggressive especially in low
> > > > activity state. You effectively are saying that if we have few 100us of
> > > > activity, normal TEO predictions based on timers are no good and better to stay
> > > > shallower anyway.
> > > >
> > > > Note that due to NOHZ, if we go to idle for an extended period the util value
> > > > might not decay for a while and miss some opportunities. Especially that when
> > > > it next wakes up, it's enough for this wake up to run for few 100s us to block
> > > > a deeper state before going back to sleep for extended period of time.
> > > >
> > > > But we shall see. I got the answer I was looking for for now.
> > >
> > > Unfortunately not too long after the patches got merged I got regression report
> > > of worse power. As you know on Android things are not as mainline, so I need to
> > > untangle this to make sure it's not a red herring. But if you want to take my
> > > word for it, I think the chances of it being a true regression is high. I had
> > > to introduce knobs to allow controlling the thresholds for now, so the good
> > > news they do help and it's not a total revert. I don't have a lot of info to
> > > share, but it's the low activity use cases that seem to got impacted. Like
> > > video playback for instance.
> > >
> > > Generally, I'm trying to remove some hardcoded values from the scheduler that
> > > enforces a behavior that is not universally desired on all systems/workloads.
> > > And I think the way the util awareness threshold are done today fall into the
> > > same category.
> > >
> > > As I tried to highlight before, it is easy to trick the threshold by a task
> > > that runs for a short time then goes back to sleep for a long time.
> > >
> > > And when the system runs full throttle for a while, it'll take around 150+ms
> > > for the util to decay to the threshold value. That's a long time to block
> > > entering deeper idle states for. I'm not sure how NOHZ and blocked averaged
> > > updates can make this potentially worse.
> > >
> > > In my view, the absolute comparison against util can be misleading. Even when
> > > util is 512 for example, we still have 50% of idle time. How this time is
> > > distributed can't be known from util alone. It could be one task waking up and
> > > sleeping. It could be multiple tasks at many combination of patterns all
> > > leading to the same outcome of CPU util being 512.
> > >
> > > IIUC the idea is that if we have even small activity, then erring on the
> > > shallow side is better. But given that target-residency is usually in few ms
> > > range, do we really need to be that quite? With a target-residency of 3ms for
> > > example, even at util of 900 there can be opportunities to enter it.
> > >
> > > Can't we instead sample util at entry to idle loop and see if it is on a rising
> > > or falling trend? When rising it makes sense to say there's demand, let's block
> > > deeper idle state. But if it is falling, then if the decay time is longer than
> > > target-residency we can say it's okay to permit the deeper idle states?
> > >
> > > I need to think more about this; but I think it's worth trying to make these
> > > thresholds more deterministic and quantifiable. There are too many workloads
> > > and system variations. I'm not sure if a knob to control these thresholds is
> > > good for anything but a workaround like I had to do. These hardcoded values
> > > can be improved IMHO. Happy to help to find alternatives.
> > >
> > >
> > > Cheers
> > >
> > > --
> > > Qais Yousef
On 5/28/24 15:07, Vincent Guittot wrote:
> On Tue, 28 May 2024 at 11:59, Lukasz Luba <[email protected]> wrote:
>>
>> Hi Vincent,
>>
>> On 5/28/24 10:29, Vincent Guittot wrote:
>>> Hi All,
>>>
>>> I'm quite late on this thread but this patchset creates a major
>>> regression for psci cpuidle driver when using the OSI mode (OS
>>> initiated mode). In such a case, cpuidle driver takes care only of
>>> CPUs power state and the deeper C-states ,which includes cluster and
>>> other power domains, are handled with power domain framework. In such
>>> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
>>> and others states that include the clusters, are managed by genpd and
>>> its governor.
>>>
>>> This patch selects cpuidle c-state N-1 as soon as the utilization is
>>> above CPU capacity / 64 which means at most a level of 16 on the big
>>> core but can be as low as 4 on little cores. These levels are very low
>>> and the main result is that as soon as there is very little activity
>>> on a CPU, cpuidle always selects WFI states whatever the estimated
>>> sleep duration and which prevents any deeper states. Another effect is
>>> that it also keeps the tick firing every 1ms in my case.
>>
>> Thanks for reporting this.
>> Could you add what regression it's causing, please?
>> Performance or higher power?
>
> It's not a perf but rather a power regression. I don't have a power
> counter so it's difficult to give figures but I found it while running
> a unitary test below on my rb5:
> run 500us every 19457ms on medium core (uclamp_min: 600).
Is that supposed to say 19.457ms?
(Because below you say idle time is >18ms and total test time 5sec)
Is the utilisation more like 1/20000 or 1/20?
In any case what you describe is probably an issue, I'll try to reproduce.
Note also my findings here:
https://lore.kernel.org/lkml/[email protected]/
Kind Regards,
Christian
On Wed, 29 May 2024 at 15:09, Christian Loehle <[email protected]> wrote:
>
> On 5/28/24 15:07, Vincent Guittot wrote:
> > On Tue, 28 May 2024 at 11:59, Lukasz Luba <[email protected]> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 5/28/24 10:29, Vincent Guittot wrote:
> >>> Hi All,
> >>>
> >>> I'm quite late on this thread but this patchset creates a major
> >>> regression for psci cpuidle driver when using the OSI mode (OS
> >>> initiated mode). In such a case, cpuidle driver takes care only of
> >>> CPUs power state and the deeper C-states ,which includes cluster and
> >>> other power domains, are handled with power domain framework. In such
> >>> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
> >>> and others states that include the clusters, are managed by genpd and
> >>> its governor.
> >>>
> >>> This patch selects cpuidle c-state N-1 as soon as the utilization is
> >>> above CPU capacity / 64 which means at most a level of 16 on the big
> >>> core but can be as low as 4 on little cores. These levels are very low
> >>> and the main result is that as soon as there is very little activity
> >>> on a CPU, cpuidle always selects WFI states whatever the estimated
> >>> sleep duration and which prevents any deeper states. Another effect is
> >>> that it also keeps the tick firing every 1ms in my case.
> >>
> >> Thanks for reporting this.
> >> Could you add what regression it's causing, please?
> >> Performance or higher power?
> >
> > It's not a perf but rather a power regression. I don't have a power
> > counter so it's difficult to give figures but I found it while running
> > a unitary test below on my rb5:
> > run 500us every 19457ms on medium core (uclamp_min: 600).
>
> Is that supposed to say 19.457ms?
Yes, it's a mistake. it's 19.457ms I forgot to put the dot when
copying the value from the rt-app json file
> (Because below you say idle time is >18ms and total test time 5sec)
> Is the utilisation more like 1/20000 or 1/20?
> In any case what you describe is probably an issue, I'll try to reproduce.
> Note also my findings here:
> https://lore.kernel.org/lkml/[email protected]/
>
> Kind Regards,
> Christian
>
Hi Vincent,
My apologies for delay, I was on sick leave.
On 5/28/24 15:07, Vincent Guittot wrote:
> On Tue, 28 May 2024 at 11:59, Lukasz Luba <[email protected]> wrote:
>>
>> Hi Vincent,
>>
>> On 5/28/24 10:29, Vincent Guittot wrote:
>>> Hi All,
>>>
>>> I'm quite late on this thread but this patchset creates a major
>>> regression for psci cpuidle driver when using the OSI mode (OS
>>> initiated mode). In such a case, cpuidle driver takes care only of
>>> CPUs power state and the deeper C-states ,which includes cluster and
>>> other power domains, are handled with power domain framework. In such
>>> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
>>> and others states that include the clusters, are managed by genpd and
>>> its governor.
>>>
>>> This patch selects cpuidle c-state N-1 as soon as the utilization is
>>> above CPU capacity / 64 which means at most a level of 16 on the big
>>> core but can be as low as 4 on little cores. These levels are very low
>>> and the main result is that as soon as there is very little activity
>>> on a CPU, cpuidle always selects WFI states whatever the estimated
>>> sleep duration and which prevents any deeper states. Another effect is
>>> that it also keeps the tick firing every 1ms in my case.
>>
>> Thanks for reporting this.
>> Could you add what regression it's causing, please?
>> Performance or higher power?
>
> It's not a perf but rather a power regression. I don't have a power
> counter so it's difficult to give figures but I found it while running
> a unitary test below on my rb5:
> run 500us every 19457ms on medium core (uclamp_min: 600).
Mid cores are built differently, they have low static power (leakage).
Therefore, for them the residency in deeper idle state should be
longer than for Big CPU. When you power off the CPU you loose your
cache data/code. The data needs to be stored in the L3 or
further memory. When the cpu is powered on again, it needs code & data.
Thus, it will transfer that data/code from L3 or from DDR. That
information transfer has energy cost (it's not for free). The cost
of data from DDR is very high.
Then we have to justify if the energy lost while sleeping in shallower
idle state can be higher than loading data/code from outside.
For different CPU it would be different.
>
> With this use case, the idle time is more than 18ms (the 500us becomes
> 1ms as we don't run at max capacity) but the tick fires every 1ms
> while the system is fully idle (all 8 cpus are idle) and as cpuidle
> selects WFI, it prevents the full cluster power down. So even if WFI
> is efficient, the power impact should be significant.
I would say it's a problem of the right threshold. In this situation
the tick would be bigger issue IMO.
Because you don't have energy meter on that board, it's hard to say
if the power impact is significant.
Let's estimate something, when the system is not much loaded:
Mig CPU often has low freq at ~300-400MHz and Energy Model power
~for that OPP is ~30mW.
If you are loaded in e.g. 1% at lowest frequency than your
avg power would be ~0.3mW, so ~1mW would be at ~3% load for
that frequency. That's dynamic power if you need to serve some IRQ,
like the tick.
The static power would be ~5% of total power (for these low-power
cells in Mid core) of this ~30mW, so something ~1.5mW.
I wouldn't say it's significant, it's some small power which might
be tackled.
This is when the system is not much loaded. When it's loaded then
we might pick higher OPP for the Mid cluster, but also quite often
get tasks in those CPUs. Then the WFI is better in such situations.
>
> For a 5 sec test duration, the system doesn't spend any time in
> cluster power down state with this patch but spent 3.9 sec in cluster
> power down state without
I think this can be achieved with just changing the thresholds.
Regards,
Lukasz
Hi Qais,
+Todd and Wei on CC
On 5/29/24 11:19, Qais Yousef wrote:
> On 05/28/24 11:29, Vincent Guittot wrote:
>> Hi All,
>>
>> I'm quite late on this thread but this patchset creates a major
>> regression for psci cpuidle driver when using the OSI mode (OS
>> initiated mode). In such a case, cpuidle driver takes care only of
>> CPUs power state and the deeper C-states ,which includes cluster and
>> other power domains, are handled with power domain framework. In such
>> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
>> and others states that include the clusters, are managed by genpd and
>> its governor.
>>
>> This patch selects cpuidle c-state N-1 as soon as the utilization is
>> above CPU capacity / 64 which means at most a level of 16 on the big
>> core but can be as low as 4 on little cores. These levels are very low
>> and the main result is that as soon as there is very little activity
>> on a CPU, cpuidle always selects WFI states whatever the estimated
>> sleep duration and which prevents any deeper states. Another effect is
>> that it also keeps the tick firing every 1ms in my case.
>
> Unfortunately I think we need to revert this. We've been seeing the power
> regressions for a long while now and it doesn't seem we'll see an improvement
> soon based on last discussion.
Could you be more precised when you say 'we'?
It's not Vincent, because he said he cannot measure power on his end.
Do you mean Google ACK? Or Google Pixel Team?
You send emails from your private account and people are confused when
you say 'we'.
>
>>
>> IMO, we should at least increase the utilization level
>
> This won't help. We tried different values, unfortunately the logic is flawed.
> Utilization value on its own says nothing about the idleness of the system.
This is not true. When you up-migrate a task to big CPU, then CPU idle
gov can instantly benefit from utilization information and won't make
mistake based on old local history and won't use deep idle state.
So migrating the utilization from one CPU to another CPU says a lot
about the idleness to that destination CPU.
When Christian removed the util he got -4.5% lower score in GB5, so
this util has impact [1].
> I think best to revert and rethink the logic. Which is something we're pursuing
> and we'll share outcome when we have something to share. As it stands, this
> doesn't help. And we should really strive to avoid magic thresholds and values.
> They don't scale.
Please share the power numbers. It's not helping when you just say
some power regression w/o numbers, but with assumption that you
are working for big company.
Regards,
Lukasz
[1]
https://lore.kernel.org/lkml/[email protected]/
On Wed, 12 Jun 2024 at 09:25, Lukasz Luba <[email protected]> wrote:
>
> Hi Vincent,
>
> My apologies for delay, I was on sick leave.
>
> On 5/28/24 15:07, Vincent Guittot wrote:
> > On Tue, 28 May 2024 at 11:59, Lukasz Luba <[email protected]> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 5/28/24 10:29, Vincent Guittot wrote:
> >>> Hi All,
> >>>
> >>> I'm quite late on this thread but this patchset creates a major
> >>> regression for psci cpuidle driver when using the OSI mode (OS
> >>> initiated mode). In such a case, cpuidle driver takes care only of
> >>> CPUs power state and the deeper C-states ,which includes cluster and
> >>> other power domains, are handled with power domain framework. In such
> >>> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
> >>> and others states that include the clusters, are managed by genpd and
> >>> its governor.
> >>>
> >>> This patch selects cpuidle c-state N-1 as soon as the utilization is
> >>> above CPU capacity / 64 which means at most a level of 16 on the big
> >>> core but can be as low as 4 on little cores. These levels are very low
> >>> and the main result is that as soon as there is very little activity
> >>> on a CPU, cpuidle always selects WFI states whatever the estimated
> >>> sleep duration and which prevents any deeper states. Another effect is
> >>> that it also keeps the tick firing every 1ms in my case.
> >>
> >> Thanks for reporting this.
> >> Could you add what regression it's causing, please?
> >> Performance or higher power?
> >
> > It's not a perf but rather a power regression. I don't have a power
> > counter so it's difficult to give figures but I found it while running
> > a unitary test below on my rb5:
> > run 500us every 19457ms on medium core (uclamp_min: 600).
>
> Mid cores are built differently, they have low static power (leakage).
> Therefore, for them the residency in deeper idle state should be
> longer than for Big CPU. When you power off the CPU you loose your
> cache data/code. The data needs to be stored in the L3 or
> further memory. When the cpu is powered on again, it needs code & data.
> Thus, it will transfer that data/code from L3 or from DDR. That
> information transfer has energy cost (it's not for free). The cost
> of data from DDR is very high.
> Then we have to justify if the energy lost while sleeping in shallower
> idle state can be higher than loading data/code from outside.
> For different CPU it would be different.
I'm aware of these points and the residency time of an idle state is
set to reflect this cost. In my case, the idle time is far above the
residency time which means that we should get some energy saving.
cpu off 4.488ms
cluster off 9.987ms
vs
sleep duration 18.000ms
Also, the policy of selecting a shallower idle state than the final
selected one doesn't work with PSCI OSI because cpuidle is only aware
of per CPU idle states but it is not aware of the cluster or
deeper/wider idle states so cpuidle doesn't know what will be the
final selected idle state. This is a major problem, in addition to
keep the tick firing
>
> >
> > With this use case, the idle time is more than 18ms (the 500us becomes
> > 1ms as we don't run at max capacity) but the tick fires every 1ms
> > while the system is fully idle (all 8 cpus are idle) and as cpuidle
> > selects WFI, it prevents the full cluster power down. So even if WFI
> > is efficient, the power impact should be significant.
>
> I would say it's a problem of the right threshold. In this situation
> the tick would be bigger issue IMO.
>
> Because you don't have energy meter on that board, it's hard to say
> if the power impact is significant.
>
> Let's estimate something, when the system is not much loaded:
> Mig CPU often has low freq at ~300-400MHz and Energy Model power
> ~for that OPP is ~30mW.
> If you are loaded in e.g. 1% at lowest frequency than your
> avg power would be ~0.3mW, so ~1mW would be at ~3% load for
> that frequency. That's dynamic power if you need to serve some IRQ,
> like the tick.
> The static power would be ~5% of total power (for these low-power
> cells in Mid core) of this ~30mW, so something ~1.5mW.
> I wouldn't say it's significant, it's some small power which might
> be tackled.
>
> This is when the system is not much loaded. When it's loaded then
> we might pick higher OPP for the Mid cluster, but also quite often
> get tasks in those CPUs. Then the WFI is better in such situations.
>
> >
> > For a 5 sec test duration, the system doesn't spend any time in
> > cluster power down state with this patch but spent 3.9 sec in cluster
> > power down state without
>
> I think this can be achieved with just changing the thresholds.
>
> Regards,
> Lukasz
On 6/12/24 10:04, Vincent Guittot wrote:
> On Wed, 12 Jun 2024 at 09:25, Lukasz Luba <[email protected]> wrote:
>>
>> Hi Vincent,
>>
>> My apologies for delay, I was on sick leave.
>>
>> On 5/28/24 15:07, Vincent Guittot wrote:
>>> On Tue, 28 May 2024 at 11:59, Lukasz Luba <[email protected]> wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> On 5/28/24 10:29, Vincent Guittot wrote:
>>>>> Hi All,
>>>>>
>>>>> I'm quite late on this thread but this patchset creates a major
>>>>> regression for psci cpuidle driver when using the OSI mode (OS
>>>>> initiated mode). In such a case, cpuidle driver takes care only of
>>>>> CPUs power state and the deeper C-states ,which includes cluster and
>>>>> other power domains, are handled with power domain framework. In such
>>>>> configuration ,cpuidle has only 2 c-states : WFI and cpu off states
>>>>> and others states that include the clusters, are managed by genpd and
>>>>> its governor.
>>>>>
>>>>> This patch selects cpuidle c-state N-1 as soon as the utilization is
>>>>> above CPU capacity / 64 which means at most a level of 16 on the big
>>>>> core but can be as low as 4 on little cores. These levels are very low
>>>>> and the main result is that as soon as there is very little activity
>>>>> on a CPU, cpuidle always selects WFI states whatever the estimated
>>>>> sleep duration and which prevents any deeper states. Another effect is
>>>>> that it also keeps the tick firing every 1ms in my case.
>>>>
>>>> Thanks for reporting this.
>>>> Could you add what regression it's causing, please?
>>>> Performance or higher power?
>>>
>>> It's not a perf but rather a power regression. I don't have a power
>>> counter so it's difficult to give figures but I found it while running
>>> a unitary test below on my rb5:
>>> run 500us every 19457ms on medium core (uclamp_min: 600).
>>
>> Mid cores are built differently, they have low static power (leakage).
>> Therefore, for them the residency in deeper idle state should be
>> longer than for Big CPU. When you power off the CPU you loose your
>> cache data/code. The data needs to be stored in the L3 or
>> further memory. When the cpu is powered on again, it needs code & data.
>> Thus, it will transfer that data/code from L3 or from DDR. That
>> information transfer has energy cost (it's not for free). The cost
>> of data from DDR is very high.
>> Then we have to justify if the energy lost while sleeping in shallower
>> idle state can be higher than loading data/code from outside.
>> For different CPU it would be different.
>
> I'm aware of these points and the residency time of an idle state is
> set to reflect this cost. In my case, the idle time is far above the
> residency time which means that we should get some energy saving.
> cpu off 4.488ms
> cluster off 9.987ms
> vs
> sleep duration 18.000ms
>
> Also, the policy of selecting a shallower idle state than the final
> selected one doesn't work with PSCI OSI because cpuidle is only aware
> of per CPU idle states but it is not aware of the cluster or
> deeper/wider idle states so cpuidle doesn't know what will be the
> final selected idle state. This is a major problem, in addition to
> keep the tick firing
I think we are aligned with this.
Something has to change in this implementation of idle gov.
I'm a bit more skeptical about your second point with PSCI.
That standard might be to strong to change.