Hi all,
so my investigation into teo led to the following fixes.
1/3:
As discussed, the utilization threshold is too high: while there are
benefits in certain workloads, there are quite a few regressions, too.
Revert the util-awareness patch.
This in itself leads to regressions, but part of that can be offset by
the later patches.
See
https://lore.kernel.org/lkml/CAKfTPtA6ZzRR-zMN7sodOW+N_P+GqwNv4tGR+aMB5VXRT2b5bg@mail.gmail.com/
2/3:
Remove the 'recent' intercepts logic, see my findings in:
https://lore.kernel.org/lkml/[email protected]/
I haven't found a way to salvage this properly, so I removed it.
The regular intercepts metric seems to decay fast enough not to need
it (see the sketch below), but we could revisit this if it turns out
that we need faster ramp-up and decay.
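For reference, a rough user-space sketch (not kernel code; it only
assumes the PULSE=1024 and DECAY_SHIFT=3 constants and the per-wakeup
"x -= x >> DECAY_SHIFT" decay from teo.c) of how quickly a saturated
intercepts bin fades once the intercepts stop:

#include <stdio.h>

#define PULSE		1024
#define DECAY_SHIFT	3

int main(void)
{
	/*
	 * A bin that sees an intercept on every wakeup converges to
	 * roughly 8 * PULSE (x = x - (x >> DECAY_SHIFT) + PULSE).
	 */
	unsigned int intercepts = 8 * PULSE;
	int wakeups = 0;

	/* Decay only, i.e. every further wakeup is accounted elsewhere. */
	while (intercepts > PULSE / 2) {
		intercepts -= intercepts >> DECAY_SHIFT;
		wakeups++;
	}

	printf("intercepts below PULSE/2 after %d wakeups\n", wakeups);
	return 0;
}

That is roughly 20 hit-only wakeups for a fully saturated bin to fade,
which looks fast enough without the extra 'recent' tracking.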
3/3:
The rest of the intercept logic had issues, too.
See the commit.
Happy for anyone to take a look and test as well.
Some numbers for context, comparing:
- IO workload (intercept-heavy).
- Timer workload, very low utilization (checks selection of the deepest state).
- hackbench (high utilization).
- Geekbench 5 on Pixel6 (high utilization).
Tests 1 to 3 are on RK3399 with CONFIG_HZ=100.
target_residencies: 1, 900, 2000
1. IO workload, 5 runs, results sorted, in read IOPS.
fio --minimal --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 | cut -d \; -f 8;
teo fixed v2:
/dev/nvme0n1
[4599, 4658, 4692, 4694, 4720]
/dev/mmcblk2
[5700, 5730, 5735, 5747, 5977]
/dev/mmcblk1
[2052, 2054, 2066, 2067, 2073]
teo mainline:
/dev/nvme0n1
[3793, 3825, 3846, 3865, 3964]
/dev/mmcblk2
[3831, 4110, 4154, 4203, 4228]
/dev/mmcblk1
[1559, 1564, 1596, 1611, 1618]
menu:
/dev/nvme0n1
[2571, 2630, 2804, 2813, 2917]
/dev/mmcblk2
[4181, 4260, 5062, 5260, 5329]
/dev/mmcblk1
[1567, 1581, 1585, 1603, 1769]
2. Timer workload (through IO for my convenience)
Results in read IOPS, fio same as above.
echo "0 2097152 zero" | dmsetup create dm-zeros
echo "0 2097152 delay /dev/mapper/dm-zeros 0 50" | dmsetup create dm-slow
(Each IO is delayed by a 50ms timer, so we should be mostly in state2; 5s total)
teo fixed v2:
idle_state time
2.0 4.807025
-1.0 0.219766
0.0 0.072007
1.0 0.169570
3199 cpu_idle total
38 cpu_idle_miss
31 cpu_idle_miss above
7 cpu_idle_miss below
teo mainline:
idle_state time
1.0 4.897942
-1.0 0.095375
0.0 0.253581
3221 cpu_idle total
1269 cpu_idle_miss
22 cpu_idle_miss above
1247 cpu_idle_miss below
menu:
idle_state time
2.0 4.295546
-1.0 0.234164
1.0 0.356344
0.0 0.401507
3421 cpu_idle total
129 cpu_idle_miss
52 cpu_idle_miss above
77 cpu_idle_miss below
Residencies:
teo mainline isn't in state2 at all; teo fixed is in state2 more than
menu, but both teo fixed and menu are in state2 the vast majority of
the time, as expected.
tldr: overall teo fixed spends more time in state2 while having fewer
idle_miss events than menu.
teo mainline was just way too aggressive at selecting shallow states.
3. Hackbench, 5 runs
for i in $(seq 0 4); do hackbench -l 100 -g 100 ; sleep 1; done
teo fixed v2:
Time: 4.937
Time: 4.898
Time: 4.871
Time: 4.833
Time: 4.898
teo mainline:
Time: 4.945
Time: 5.021
Time: 4.927
Time: 4.923
Time: 5.137
menu:
Time: 4.964
Time: 4.847
Time: 4.914
Time: 4.841
Time: 4.800
tldr: all comparable, teo mainline slightly worse
4. Geekbench 5 (multi-core) on Pixel6
(Device is cooled for each iteration separately)
teo mainline:
3113, 3068, 3079
mean 3086.66
teo revert util-awareness:
2947, 2941, 2952
mean 2946.66 (-4.54%)
teo fixed v2:
3032, 3066, 3019
mean 3039 (-1.54%)
Changes since v1:
- Removed all non-fixes.
- Do a full revert of Util-awareness instead of increasing thresholds.
- Address Dietmar's comments.
https://lore.kernel.org/linux-kernel/[email protected]/T/
Kind Regards,
Christian
Christian Loehle (3):
Revert: "cpuidle: teo: Introduce util-awareness"
cpuidle: teo: Remove recent intercepts metric
cpuidle: teo: Don't count non-existent intercepts
drivers/cpuidle/governors/teo.c | 189 +++++---------------------------
1 file changed, 27 insertions(+), 162 deletions(-)
--
2.34.1
Since commit 6da8f9ba5a87 ("cpuidle: teo: Skip
tick_nohz_get_sleep_length() call in some cases"), teo no longer
queries the sleep length when bailing out early and instead records an
expected sleep_length_ns value of KTIME_MAX.
This led to state0 accumulating lots of 'intercepts', because the
actually measured sleep length was < KTIME_MAX, and therefore to teo
taking too long to recover from intercept-heavy scenarios.
So count KTIME_MAX sleep lengths as hits (we have to count them as
something, otherwise we are stuck).
Fundamentally we can only do one of the two:
1. Skip the sleep_length_ns query when we think an intercept is likely.
2. Have accurate data on whether the sleep length was actually
intercepted when we believe it is currently intercepted.
This patch chooses the latter, as I've found the additional time it
takes to query the sleep length to be negligible, while the variants
of option 1 (count all unknowns as misses or count all unknowns as
hits) had significant regressions (counting them as misses led to lots
of too-shallow idle state selections, counting them as hits had
terrible performance in intercept-heavy workloads).
Fixes: 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Signed-off-by: Christian Loehle <[email protected]>
---
drivers/cpuidle/governors/teo.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index cc7df59f488d..1e4b40474f49 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -231,8 +231,13 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
* length, this is a "hit", so update the "hits" metric for that bin.
* Otherwise, update the "intercepts" metric for the bin fallen into by
* the measured idle duration.
+ * If teo_select() bailed out early, we have to count this as a hit as
+ * we don't know what the true sleep length would've been. Otherwise
+ * we accumulate lots of intercepts at the shallower state (especially
+ * state0) even though there weren't any actual intercepts, only our
+ * anticipation of one.
*/
- if (idx_timer == idx_duration)
+ if (idx_timer == idx_duration || cpu_data->sleep_length_ns == KTIME_MAX)
cpu_data->state_bins[idx_timer].hits += PULSE;
else
cpu_data->state_bins[idx_duration].intercepts += PULSE;
@@ -292,7 +297,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
unsigned int hit_sum = 0;
int constraint_idx = 0;
int idx0 = 0, idx = -1;
- bool alt_intercepts, alt_recent;
+ int prev_intercept_idx;
s64 duration_ns;
int i;
@@ -370,6 +375,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* all of the deeper states a shallower idle state is likely to be a
* better choice.
*/
+ prev_intercept_idx = idx;
if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
int first_suitable_idx = idx;
@@ -421,6 +427,14 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
first_suitable_idx = i;
}
}
+ if (!idx && prev_intercept_idx) {
+ /*
+ * We have to query the sleep length here otherwise we don't
+ * know after wakeup if our guess was correct.
+ */
+ duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+ cpu_data->sleep_length_ns = duration_ns;
+ }
/*
* If there is a latency constraint, it may be necessary to select an
--
2.34.1