2024-06-06 09:05:30

by Christian Loehle

Subject: [PATCH 0/6] cpuidle: teo: fixes and improvements

Hi all,
so my investigation into teo led to the following fixes and
improvements. Logically they are mostly independent, which is why this
cover letter is quite short; details are in the patches.

1/6:
As discussed, the utilization threshold is too low: while there are
benefits in certain workloads, there are quite a few regressions, too.
2/6:
Especially with the new util threshold, stopping the tick makes little
sense when the CPU is detected as utilized, so don't.
3/6:
Particularly with WFI, even if it's the only state, stopping the tick
has benefits, so enable that in the early bail-out path.
4/6:
Stopping the tick at zero cost (if the idle state dictates it) is too
aggressive IMO, so add a constant 1ms cost.
XXX: This has the issue that such wakeups are now counted as idle_miss,
so we could consider adding the cost to the states, too, but the simple
implementation of that would have the downside that the cost is added
to deeper states even if the tick is already off.
5/6:
Remove the 'recent' intercept logic, see my findings in:
https://lore.kernel.org/lkml/[email protected]/
I haven't found a way to salvage this properly, so I removed it.
The regular intercepts metric seems to decay fast enough to not need
this, but we could revisit that if it turns out not to be the case.
6/6:
The rest of the intercept logic had issues, too.
See the commit.

TODO: add some measurements of common workloads and some simple sanity
tests (like Vincent described: checking whether the state selection
looks reasonable in low-utilization workloads).
I have some, but more (and more standardized) ones would be beneficial.

Happy for anyone to take a look and test as well.

Some numbers for context:

Comparing:
- IO workload (intercept heavy)
- Timer workload, very low utilization (check for deepest state)
- hackbench (high utilization)
all on RK3399 with CONFIG_HZ=100.
target_residencies: 1, 900, 2000

1. IO workload, 5 runs, results sorted, in read IOPS.
fio --minimal --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 | cut -d \; -f 8;
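
Roughly how the five sorted numbers per device below were collected; the
exact harness isn't part of this posting, so treat this loop as a sketch:

for dev in /dev/nvme0n1 /dev/mmcblk2 /dev/mmcblk1; do
    for i in $(seq 5); do
        fio --minimal --time_based --name=fiotest --filename=$dev \
            --runtime=30 --rw=randread --bs=4k --ioengine=psync \
            --iodepth=1 --direct=1 | cut -d \; -f 8
    done | sort -n
done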

teo fixed:
/dev/nvme0n1
[4597, 4673, 4727, 4741, 4756]
/dev/mmcblk2
[5753, 5832, 5837, 5911, 5949]
/dev/mmcblk1
[2059, 2062, 2070, 2071, 2080]

teo mainline:
/dev/nvme0n1
[3793, 3825, 3846, 3865, 3964]
/dev/mmcblk2
[3831, 4110, 4154, 4203, 4228]
/dev/mmcblk1
[1559, 1564, 1596, 1611, 1618]

menu:
/dev/nvme0n1
[2571, 2630, 2804, 2813, 2917]
/dev/mmcblk2
[4181, 4260, 5062, 5260, 5329]
/dev/mmcblk1
[1567, 1581, 1585, 1603, 1769]

2. Timer workload (through IO for my convenience ;) )
fio same as above. The numbers below are cpu_idle trace event counts
rather than read IOPS.
echo "0 2097152 zero" | dmsetup create dm-zeros
echo "0 2097152 delay /dev/mapper/dm-zeros 0 50" | dmsetup create dm-slow
(Each IO is delayed by a 50ms timer, so the CPU should mostly be in state2)
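
The fio invocation is presumably the one from above with the filename
swapped to the delayed device, i.e. something like:

fio --minimal --time_based --name=fiotest --filename=/dev/mapper/dm-slow \
    --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1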

teo fixed:
3269 cpu_idle total
48 cpu_idle_miss
30 cpu_idle_miss above
18 cpu_idle_miss below

teo mainline:
3221 cpu_idle total
1269 cpu_idle_miss
22 cpu_idle_miss above
1247 cpu_idle_miss below

menu:
3433 cpu_idle total
114 cpu_idle_miss
61 cpu_idle_miss above
53 cpu_idle_miss below

Residencies:
teo mainline isn't in state2 at all; teo fixed spends more time in
state2 than menu, but both are >95%.

tldr: overall teo fixed spends more time in state2 while having
fewer idle_miss than menu.
teo mainline was just way too aggressive at selecting shallow states.

3. Hackbench, 5 runs
for i in $(seq 0 4); do hackbench -l 100 -g 100 ; sleep 1; done

teo fixed:
Time: 4.807
Time: 4.856
Time: 5.072
Time: 4.934
Time: 4.962

teo mainline:
Time: 4.945
Time: 5.021
Time: 4.927
Time: 4.923
Time: 5.137

menu:
Time: 4.991
Time: 4.884
Time: 4.880
Time: 4.946
Time: 4.980

tldr: all comparable, teo mainline slightly worse

Kind Regards,
Christian

Christian Loehle (6):
cpuidle: teo: Increase util-threshold
cpuidle: teo: Don't stop tick on utilized
cpuidle: teo: Don't always stop tick on one state
cpuidle: teo: Increase minimum time to stop tick
cpuidle: teo: Remove recent intercepts metric
cpuidle: teo: Don't count non-existent intercepts

drivers/cpuidle/governors/teo.c | 121 +++++++++++++-------------------
1 file changed, 48 insertions(+), 73 deletions(-)

--
2.34.1



2024-06-06 09:05:44

by Christian Loehle

Subject: [PATCH 1/6] cpuidle: teo: Increase util-threshold

Increase the util-threshold considerably, as it was low enough for some
minor load to always keep the CPU marked as utilized, especially on
smaller CPUs.

For small-capacity CPUs (e.g. on a Pixel6) the util threshold is as low
as 1, and for CPUs of capacity <64 it is 0. So also enforce a minimum
threshold.
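
For illustration, plugging some example capacities into the old and new
formula:
old (>> 6):          1024 -> 16,  128 -> 2,  64 -> 1,  <64 -> 0
new (>> 2, min 50):  1024 -> 256, 256 -> 64, <=200 -> 50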

Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness")
Reported-by: Qais Yousef <[email protected]>
Reported-by: Vincent Guittot <[email protected]>
Suggested-by: Kajetan Puchalski <[email protected]>
Signed-off-by: Christian Loehle <[email protected]>
---
drivers/cpuidle/governors/teo.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 7244f71c59c5..45f43e2ee02d 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -146,13 +146,11 @@
* The number of bits to shift the CPU's capacity by in order to determine
* the utilized threshold.
*
- * 6 was chosen based on testing as the number that achieved the best balance
- * of power and performance on average.
- *
* The resulting threshold is high enough to not be triggered by background
- * noise and low enough to react quickly when activity starts to ramp up.
+ * noise.
*/
-#define UTIL_THRESHOLD_SHIFT 6
+#define UTIL_THRESHOLD_SHIFT 2
+#define UTIL_THRESHOLD_MIN 50

/*
* The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
@@ -671,7 +669,8 @@ static int teo_enable_device(struct cpuidle_driver *drv,
int i;

memset(cpu_data, 0, sizeof(*cpu_data));
- cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
+ cpu_data->util_threshold = max(UTIL_THRESHOLD_MIN,
+ max_capacity >> UTIL_THRESHOLD_SHIFT);

for (i = 0; i < NR_RECENT; i++)
cpu_data->recent_idx[i] = -1;
--
2.34.1


2024-06-06 09:06:05

by Christian Loehle

Subject: [PATCH 2/6] cpuidle: teo: Don't stop tick on utilized

If the CPU is considered utilized, we expect it to be woken up early,
so stopping the tick is likely to be a waste.

Signed-off-by: Christian Loehle <[email protected]>
---
drivers/cpuidle/governors/teo.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 45f43e2ee02d..2c427dd4cac0 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -429,10 +429,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
idx = 0;
goto out_tick;
}
- /* Assume that state 1 is not a polling one and use it. */
+ /*
+ * Assume that state 1 is not a polling one and use it, but
+ * don't stop the tick as we expect to be woken up early.
+ */
idx = 1;
duration_ns = drv->states[1].target_residency_ns;
- goto end;
+ goto out_tick_state;
}

/* Compute the sums of metrics for early wakeup pattern detection. */
@@ -618,6 +621,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
duration_ns >= TICK_NSEC) || tick_nohz_tick_stopped())
return idx;

+out_tick_state:
/*
* The tick is not going to be stopped, so if the target residency of
* the state to be returned is not within the time till the closest
--
2.34.1


2024-06-06 09:06:29

by Christian Loehle

Subject: [PATCH 3/6] cpuidle: teo: Don't always stop tick on one state

Even if there is only one state, we unfortunately still need to query
the expected sleep length if state0 is a proper sleep state like WFI,
so that the tick can be stopped when the sleep is long enough. This
enables teo to save energy even in that scenario.

Signed-off-by: Christian Loehle <[email protected]>
---
drivers/cpuidle/governors/teo.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 2c427dd4cac0..216d34747e3b 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -403,7 +403,14 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
/* Check if there is any choice in the first place. */
if (drv->state_count < 2) {
idx = 0;
- goto out_tick;
+ if (drv->states[0].flags & CPUIDLE_FLAG_POLLING)
+ goto out_tick;
+ /*
+ * If we only have one state and it is a proper one, check if
+ * disabling the tick would be worth it.
+ */
+ duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+ goto end;
}

if (!dev->states_usage[0].disable)
--
2.34.1


2024-06-06 09:06:57

by Christian Loehle

Subject: [PATCH 5/6] cpuidle: teo: Remove recent intercepts metric

The logic for recent intercepts didn't work: there is an underflow that
can be observed during boot already, and teo usually doesn't recover
from it, making the entire logic pointless.
Furthermore, the recent intercepts were never reset, so they were not
actually very 'recent'.

If it turns out to be necessary to focus more heavily on recent events,
the regular intercepts metric could also be adjusted to decay more
quickly.

Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment")
Signed-off-by: Christian Loehle <[email protected]>
---
drivers/cpuidle/governors/teo.c | 74 ++++++---------------------------
1 file changed, 12 insertions(+), 62 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index ca9422bbd8db..3f4801d1e797 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -59,10 +59,6 @@
* shallower than the one whose bin is fallen into by the sleep length (these
* situations are referred to as "intercepts" below).
*
- * In addition to the metrics described above, the governor counts recent
- * intercepts (that is, intercepts that have occurred during the last
- * %NR_RECENT invocations of it for the given CPU) for each bin.
- *
* In order to select an idle state for a CPU, the governor takes the following
* steps (modulo the possible latency constraint that must be taken into account
* too):
@@ -81,20 +77,15 @@
* idle long enough to avoid being intercepted if the sleep length had been
* equal to the current one).
*
- * - The sum of the numbers of recent intercepts for all of the idle states
- * shallower than the candidate one.
- *
- * 2. If the second sum is greater than the first one or the third sum is
- * greater than %NR_RECENT / 2, the CPU is likely to wake up early, so look
- * for an alternative idle state to select.
+ * 2. If the second sum is greater than the first one the CPU is likely to wake
+ * up early, so look for an alternative idle state to select.
*
* - Traverse the idle states shallower than the candidate one in the
* descending order.
*
- * - For each of them compute the sum of the "intercepts" metrics and the sum
- * of the numbers of recent intercepts over all of the idle states between
- * it and the candidate one (including the former and excluding the
- * latter).
+ * - For each of them compute the sum of the "intercepts" metrics over all
+ * of the idle states between it and the candidate one (including the
+ * former and excluding the latter).
*
* - If each of these sums that needs to be taken into account (because the
* check related to it has indicated that the CPU is likely to wake up
@@ -159,22 +150,14 @@
#define PULSE 1024
#define DECAY_SHIFT 3

-/*
- * Number of the most recent idle duration values to take into consideration for
- * the detection of recent early wakeup patterns.
- */
-#define NR_RECENT 9
-
/**
* struct teo_bin - Metrics used by the TEO cpuidle governor.
* @intercepts: The "intercepts" metric.
* @hits: The "hits" metric.
- * @recent: The number of recent "intercepts".
*/
struct teo_bin {
unsigned int intercepts;
unsigned int hits;
- unsigned int recent;
};

/**
@@ -183,8 +166,6 @@ struct teo_bin {
* @sleep_length_ns: Time till the closest timer event (at the selection time).
* @state_bins: Idle state data bins for this CPU.
* @total: Grand total of the "intercepts" and "hits" metrics for all bins.
- * @next_recent_idx: Index of the next @recent_idx entry to update.
- * @recent_idx: Indices of bins corresponding to recent "intercepts".
* @tick_hits: Number of "hits" after TICK_NSEC.
* @util_threshold: Threshold above which the CPU is considered utilized
*/
@@ -193,8 +174,6 @@ struct teo_cpu {
s64 sleep_length_ns;
struct teo_bin state_bins[CPUIDLE_STATE_MAX];
unsigned int total;
- int next_recent_idx;
- int recent_idx[NR_RECENT];
unsigned int tick_hits;
unsigned long util_threshold;
};
@@ -284,13 +263,6 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
}
}

- i = cpu_data->next_recent_idx++;
- if (cpu_data->next_recent_idx >= NR_RECENT)
- cpu_data->next_recent_idx = 0;
-
- if (cpu_data->recent_idx[i] >= 0)
- cpu_data->state_bins[cpu_data->recent_idx[i]].recent--;
-
/*
* If the deepest state's target residency is below the tick length,
* make a record of it to help teo_select() decide whether or not
@@ -317,14 +289,10 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
* Otherwise, update the "intercepts" metric for the bin fallen into by
* the measured idle duration.
*/
- if (idx_timer == idx_duration) {
+ if (idx_timer == idx_duration)
cpu_data->state_bins[idx_timer].hits += PULSE;
- cpu_data->recent_idx[i] = -1;
- } else {
+ else
cpu_data->state_bins[idx_duration].intercepts += PULSE;
- cpu_data->state_bins[idx_duration].recent++;
- cpu_data->recent_idx[i] = idx_duration;
- }

end:
cpu_data->total += PULSE;
@@ -377,13 +345,10 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
unsigned int tick_intercept_sum = 0;
unsigned int idx_intercept_sum = 0;
unsigned int intercept_sum = 0;
- unsigned int idx_recent_sum = 0;
- unsigned int recent_sum = 0;
unsigned int idx_hit_sum = 0;
unsigned int hit_sum = 0;
int constraint_idx = 0;
int idx0 = 0, idx = -1;
- bool alt_intercepts, alt_recent;
bool cpu_utilized;
s64 duration_ns;
int i;
@@ -456,7 +421,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
*/
intercept_sum += prev_bin->intercepts;
hit_sum += prev_bin->hits;
- recent_sum += prev_bin->recent;

if (dev->states_usage[i].disable)
continue;
@@ -472,7 +436,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
/* Save the sums for the current state. */
idx_intercept_sum = intercept_sum;
idx_hit_sum = hit_sum;
- idx_recent_sum = recent_sum;
}

/* Avoid unnecessary overhead. */
@@ -497,37 +460,28 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* If the sum of the intercepts metric for all of the idle states
* shallower than the current candidate one (idx) is greater than the
* sum of the intercepts and hits metrics for the candidate state and
- * all of the deeper states, or the sum of the numbers of recent
- * intercepts over all of the states shallower than the candidate one
- * is greater than a half of the number of recent events taken into
- * account, a shallower idle state is likely to be a better choice.
+ * all of the deeper states a shallower idle state is likely to be a
+ * better choice.
*/
- alt_intercepts = 2 * idx_intercept_sum > cpu_data->total - idx_hit_sum;
- alt_recent = idx_recent_sum > NR_RECENT / 2;
- if (alt_recent || alt_intercepts) {
+ if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
int first_suitable_idx = idx;

/*
* Look for the deepest idle state whose target residency had
* not exceeded the idle duration in over a half of the relevant
- * cases (both with respect to intercepts overall and with
- * respect to the recent intercepts only) in the past.
+ * cases in the past.
*
* Take the possible duration limitation present if the tick
* has been stopped already into account.
*/
intercept_sum = 0;
- recent_sum = 0;

for (i = idx - 1; i >= 0; i--) {
struct teo_bin *bin = &cpu_data->state_bins[i];

intercept_sum += bin->intercepts;
- recent_sum += bin->recent;

- if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
- (!alt_intercepts ||
- 2 * intercept_sum > idx_intercept_sum)) {
+ if (2 * intercept_sum > idx_intercept_sum) {
/*
* Use the current state unless it is too
* shallow or disabled, in which case take the
@@ -677,15 +631,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
{
struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
- int i;

memset(cpu_data, 0, sizeof(*cpu_data));
cpu_data->util_threshold = max(UTIL_THRESHOLD_MIN,
max_capacity >> UTIL_THRESHOLD_SHIFT);

- for (i = 0; i < NR_RECENT; i++)
- cpu_data->recent_idx[i] = -1;
-
return 0;
}

--
2.34.1


2024-06-06 09:07:12

by Christian Loehle

Subject: [PATCH 6/6] cpuidle: teo: Don't count non-existent intercepts

Since commit 6da8f9ba5a87 ("cpuidle: teo:
Skip tick_nohz_get_sleep_length() call in some cases"), teo no longer
queries the sleep length when bailing out early and instead records an
expected sleep_length_ns value of KTIME_MAX.
This led to state0 accumulating lots of 'intercepts' because the
actually measured sleep length was < KTIME_MAX, so count a recorded
KTIME_MAX as a hit (we have to count these wakeups as something,
otherwise we are stuck).

Fundamentally we can only do one of two things:
1. Skip the sleep_length_ns query when we think an intercept is likely.
2. Query sleep_length_ns even then, so that we have accurate data on
whether the wakeup we anticipated as an intercept actually was one.

This patch chooses the latter, as I've found the additional time it
takes to query the sleep length to be negligible, while the variants of
option 1 (count all unknowns as misses, or count all unknowns as hits)
had significant regressions (counting them as misses led to lots of
too-shallow idle state selections, and counting them as hits gave
terrible performance in intercept-heavy workloads).

Fixes: 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Signed-off-by: Christian Loehle <[email protected]>
---
drivers/cpuidle/governors/teo.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 3f4801d1e797..f41788ff0a94 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -288,8 +288,13 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
* length, this is a "hit", so update the "hits" metric for that bin.
* Otherwise, update the "intercepts" metric for the bin fallen into by
* the measured idle duration.
+ * If teo_select() bailed out early, we have to count this as a hit as
+ * we don't know what the true sleep length would've been. Otherwise
+ * we accumulate lots of intercepts at the shallower state (especially
+ * state0) even though there weren't any intercepts due to us
+ * anticipating one.
*/
- if (idx_timer == idx_duration)
+ if (idx_timer == idx_duration || cpu_data->sleep_length_ns == KTIME_MAX)
cpu_data->state_bins[idx_timer].hits += PULSE;
else
cpu_data->state_bins[idx_duration].intercepts += PULSE;
@@ -349,6 +354,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
unsigned int hit_sum = 0;
int constraint_idx = 0;
int idx0 = 0, idx = -1;
+ int prev_intercept_idx;
bool cpu_utilized;
s64 duration_ns;
int i;
@@ -463,6 +469,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* all of the deeper states a shallower idle state is likely to be a
* better choice.
*/
+ prev_intercept_idx = idx;
if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
int first_suitable_idx = idx;

@@ -514,6 +521,14 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
first_suitable_idx = i;
}
}
+ if (prev_intercept_idx != idx && !idx) {
+ /*
+ * We have to query the sleep length here otherwise we don't
+ * know after wakeup if our guess was correct.
+ */
+ duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+ cpu_data->sleep_length_ns = duration_ns;
+ }

/*
* If there is a latency constraint, it may be necessary to select an
--
2.34.1


2024-06-06 09:33:36

by Christian Loehle

Subject: [PATCH 4/6] cpuidle: teo: Increase minimum time to stop tick

Since stopping the tick isn't free, add at least a small constant cost
(1ms) to the threshold for stopping the tick.

Signed-off-by: Christian Loehle <[email protected]>
---
drivers/cpuidle/governors/teo.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 216d34747e3b..ca9422bbd8db 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -622,10 +622,10 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
/*
* Allow the tick to be stopped unless the selected state is a polling
* one or the expected idle duration is shorter than the tick period
- * length.
+ * length plus some constant (1ms) to account for stopping it.
*/
if ((!(drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
- duration_ns >= TICK_NSEC) || tick_nohz_tick_stopped())
+ duration_ns > NSEC_PER_MSEC + TICK_NSEC) || tick_nohz_tick_stopped())
return idx;

out_tick_state:
--
2.34.1


2024-06-06 11:54:46

by Christian Loehle

Subject: Re: [PATCH 0/6] cpuidle: teo: fixes and improvements

On 6/6/24 10:00, Christian Loehle wrote:
> [...]
>
> 2. Timer workload (through IO for my convenience ;) )
> [...]
>
> Residencies:

Hmm, maybe actually including them would've been helpful too:
(Over 5s workload, only showing LITTLE cluster)
teo fixed:
idle_state
2.0 4.813378
-1.0 0.210820
1.0 0.202778
0.0 0.062426

teo mainline:
idle_state
1.0 4.895766
-1.0 0.098063
0.0 0.253069

menu:
idle_state
2.0 4.528356
-1.0 0.241486
1.0 0.345829
0.0 0.202505

>
> tldr: overall teo fixed spends more time in state2 while having
> fewer idle_miss than menu.
> teo mainline was just way too aggressive at selecting shallow states.
>
> [...]


2024-06-06 12:29:43

by Rafael J. Wysocki

Subject: Re: [PATCH 0/6] cpuidle: teo: fixes and improvements

On Thu, Jun 6, 2024 at 1:59 PM Christian Loehle
<[email protected]> wrote:
>
> On 6/6/24 10:00, Christian Loehle wrote:
> > [...]
> >
> > Residencies:
>
> Hmm, maybe actually including them would've been helpful too:
> (Over 5s workload, only showing LITTLE cluster)
> teo fixed:
> idle_state
> 2.0 4.813378
> -1.0 0.210820

Just to clarify, what does -1.0 mean here?

> 1.0 0.202778
> 0.0 0.062426
>
> teo mainline:
> idle_state
> 1.0 4.895766
> -1.0 0.098063
> 0.0 0.253069
>
> menu:
> idle_state
> 2.0 4.528356
> -1.0 0.241486
> 1.0 0.345829
> 0.0 0.202505
>
> [...]

2024-06-06 12:37:05

by Christian Loehle

Subject: Re: [PATCH 0/6] cpuidle: teo: fixes and improvements

On 6/6/24 13:29, Rafael J. Wysocki wrote:
> On Thu, Jun 6, 2024 at 1:59 PM Christian Loehle
> <[email protected]> wrote:
>>
>> On 6/6/24 10:00, Christian Loehle wrote:
>>> [...]
>>>
>>> Residencies:
>>
>> Hmm, maybe actually including them would've been helpful too:
>> (Over 5s workload, only showing LITTLE cluster)
>> teo fixed:
>> idle_state
>> 2.0 4.813378
>> -1.0 0.210820
>
> Just to clarify, what does -1.0 mean here?

Good point, I should've mentioned that.
This adds up the residencies just from the cpu_idle trace events.
-1 stands for PWR_EVENT_EXIT, i.e. non-idle; including it is quite
useful when we're looking at absolute numbers rather than just idle
state ratios.
tldr: it is the time the CPU isn't in any idle state.

> [snip]

2024-06-07 08:14:53

by Dietmar Eggemann

Subject: Re: [PATCH 4/6] cpuidle: teo: Increase minimum time to stop tick

On 06/06/2024 11:00, Christian Loehle wrote:
> Since stopping the tick isn't free, add at least some minor constant
> (1ms) for the threshold to stop the tick.

Sounds pretty arbitrary to me? 'duration_ns' is either based on
target_residency_ns or tick_nohz_get_sleep_length() or even set to
TICK_NSEC/2. Does adding 1ms make sense in all of these cases? And if
so, why 1ms?

> Signed-off-by: Christian Loehle <[email protected]>
> ---
> drivers/cpuidle/governors/teo.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index 216d34747e3b..ca9422bbd8db 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -622,10 +622,10 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> /*
> * Allow the tick to be stopped unless the selected state is a polling
> * one or the expected idle duration is shorter than the tick period
> - * length.
> + * length plus some constant (1ms) to account for stopping it.
> */
> if ((!(drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> - duration_ns >= TICK_NSEC) || tick_nohz_tick_stopped())
> + duration_ns > NSEC_PER_MSEC + TICK_NSEC) || tick_nohz_tick_stopped())
> return idx;
>
> out_tick_state:


2024-06-07 08:22:31

by Dietmar Eggemann

Subject: Re: [PATCH 1/6] cpuidle: teo: Increase util-threshold

On 06/06/2024 11:00, Christian Loehle wrote:
> Increase the util-threshold by a lot as it was low enough for some
> minor load to always be active, especially on smaller CPUs.

We see the blocked part of the CPU utilization as something telling the
task scheduler that the corresponding tasks might be runnable soon again
on this CPU.

This model seems to be used here as well. I guess folks are still
debating whether the amount of blocked utilization is a good enough
indicator for the length of idle time.

> For small cap CPUs (Pixel6) the util threshold is as low as 1.
> For CPUs of capacity <64 it is 0. So ensure it is at a minimum, too.

So before, this threshold was 16 on a capacity-1024 CPU, and now it's 256?

A CPU with capacity <= 200 now has a threshold of 50.

Where do those numbers come from? Just from running another workload on
a specific device?

[...]

2024-06-07 08:57:33

by Dietmar Eggemann

Subject: Re: [PATCH 5/6] cpuidle: teo: Remove recent intercepts metric

On 06/06/2024 11:00, Christian Loehle wrote:
> The logic for recent intercepts didn't work, there is an underflow
> that can be observed during boot already, which teo usually doesn't
> recover from, making the entire logic pointless.

Is this related to:

https://lkml.kernel.org/r/[email protected] ?

If so, a link here would be helpful to get the full story.

> Furthermore the recent intercepts also were never reset, thus not
> actually being very 'recent'.
>
> If it turns out to be necessary to focus more heavily on resets, the
> intercepts metric also could be adjusted to decay more quickly.
>
> Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment")
> Signed-off-by: Christian Loehle <[email protected]>

[...]


2024-06-07 09:26:42

by Christian Loehle

Subject: Re: [PATCH 5/6] cpuidle: teo: Remove recent intercepts metric

On 6/7/24 09:57, Dietmar Eggemann wrote:
> On 06/06/2024 11:00, Christian Loehle wrote:
>> The logic for recent intercepts didn't work, there is an underflow
>> that can be observed during boot already, which teo usually doesn't
>> recover from, making the entire logic pointless.
>
> Is this related to:
>
> https://lkml.kernel.org/r/[email protected] ?
>
> In this case, a link here would be helpful to get the story.

It is!
Will include it. In my defense, it is included in the cover letter.

>
>> Furthermore the recent intercepts also were never reset, thus not
>> actually being very 'recent'.
>>
>> If it turns out to be necessary to focus more heavily on resets, the
>> intercepts metric also could be adjusted to decay more quickly.
>>
>> Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment")
>> Signed-off-by: Christian Loehle <[email protected]>
>
> [...]
>


2024-06-07 09:29:34

by Christian Loehle

Subject: Re: [PATCH 4/6] cpuidle: teo: Increase minimum time to stop tick

On 6/7/24 09:14, Dietmar Eggemann wrote:
> On 06/06/2024 11:00, Christian Loehle wrote:
>> Since stopping the tick isn't free, add at least some minor constant
>> (1ms) for the threshold to stop the tick.
>
> Sounds pretty arbitrary to me? 'duration_ns' is either based on
> target_residency_ns or tick_nohz_get_sleep_length() or even set to
> TICK_NSEC/2. Does adding 1ms makes sense to all these cases? But then
> why 1ms?

It definitely is arbitrary, you're correct.
Feel free to treat this as an RFC. I'll probably just drop this from the
series and send it separately (to get the actual fixes merged more
quickly). Anyway, I'd like to hear comments on this.

We are only interested in the cost of stopping the tick, which doesn't
really depend on the selected state's residency or the expected sleep
length. 1ms works fine (for me!!); making it depend on TICK_NSEC would be
natural, too, but TICK_NSEC is far too long for CONFIG_HZ=100 (and
TICK_NSEC/2 is too short for CONFIG_HZ=1000).
The cost of stopping the tick depends on a number of factors, and knowing
all of them is probably not on the table anytime soon; until then I'd
consider this an improvement over 0.
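
To put rough numbers on that (just my arithmetic for illustration, with
TICK_NSEC being roughly 1s / CONFIG_HZ): at HZ=100 the tick period is
10ms, so the cutoff for stopping the tick moves from 10ms to 11ms; at
HZ=250 from 4ms to 5ms; at HZ=1000 from 1ms to 2ms. TICK_NSEC/2 at
HZ=1000 would only be 0.5ms.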

>
>> Signed-off-by: Christian Loehle <[email protected]>
>> ---
>> drivers/cpuidle/governors/teo.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
>> index 216d34747e3b..ca9422bbd8db 100644
>> --- a/drivers/cpuidle/governors/teo.c
>> +++ b/drivers/cpuidle/governors/teo.c
>> @@ -622,10 +622,10 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> /*
>> * Allow the tick to be stopped unless the selected state is a polling
>> * one or the expected idle duration is shorter than the tick period
>> - * length.
>> + * length plus some constant (1ms) to account for stopping it.
>> */
>> if ((!(drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
>> - duration_ns >= TICK_NSEC) || tick_nohz_tick_stopped())
>> + duration_ns > NSEC_PER_MSEC + TICK_NSEC) || tick_nohz_tick_stopped())
>> return idx;
>>
>> out_tick_state:
>


2024-06-07 09:36:09

by Christian Loehle

Subject: Re: [PATCH 1/6] cpuidle: teo: Increase util-threshold

On 6/7/24 09:01, Dietmar Eggemann wrote:
> On 06/06/2024 11:00, Christian Loehle wrote:
>> Increase the util-threshold by a lot as it was low enough for some
>> minor load to always be active, especially on smaller CPUs.
>
> We see the blocked part of the CPU utilization as something telling the
> task scheduler that the corresponding tasks might be runnable soon again
> on this CPU.
>
> This model seems to be used here as well. I guess folks are still
> debating whether the amount of blocked utilization is a good enough
> indicator for the length of idle time.

Right, the blocked utilization is treated as an indicator that we will
be brought out of sleep by a non-timer wakeup.

>
>> For small cap CPUs (Pixel6) the util threshold is as low as 1.
>> For CPUs of capacity <64 it is 0. So ensure it is at a minimum, too.
>
> So before this threshold was 16 on a 1024 CPU, now it's 256?
>
> A <= 200 CPU has now a threshold of 50.
>
> Where do those numbers come from? Just from running another workload on
> a specific device?
>
> [...]

More or less, yes.
Kajetan identified two broad use-cases for the utilization-based state
bypass: early utilization ramp-up and high-utilization scenarios.
The reports made it clear that the former can't be handled with a
threshold on just a single value, as that will be too aggressive in
sustained (non-ramp-up) workloads.
To be fair, with patches 5 and 6 of this series, the ramp-up is
also handled quite early by the intercepts logic itself.
So as a fix I increased the value high enough to not trigger in
low-utilization scenarios.
There is likely room for optimization here, e.g. many wakeups are IPIs
that relate more to the overall system utilization than to the current
CPU's.

2024-06-07 10:17:47

by Dietmar Eggemann

Subject: Re: [PATCH 6/6] cpuidle: teo: Don't count non-existent intercepts

On 06/06/2024 11:00, Christian Loehle wrote:
> When bailing out early, teo will not query the sleep length anymore
> since commit 6da8f9ba5a87 ("cpuidle: teo:
> Skip tick_nohz_get_sleep_length() call in some cases") with an
> expected sleep_length_ns value of KTIME_MAX.
> This lead to state0 accumulating lots of 'intercepts' because
> the actually measured sleep length was < KTIME_MAX, so count KTIME_MAX
> as a hit (we have to count them as something otherwise we are stuck).
>
> Fundamentally we can only do one of the two:
> 1. Skip sleep_length_ns query when we think intercept is likely
> 2. Have accurate data if sleep_length_ns is actually intercepted when
> we believe it is currently intercepted.
>
> This patch chooses that latter as I've found the additional time it
> takes to query the sleep length to be negligible and the variants of
> option 1 (count all unknowns as misses or count all unknown as hits)
> had significant regressions (as misses had lots of too shallow idle
> state selections and as hits had terrible performance in
> intercept-heavy workloads).

So '2.' is the 'if (prev_intercept_idx != idx && !idx)' case ?

[...]

> @@ -514,6 +521,14 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> first_suitable_idx = i;
> }
> }
> + if (prev_intercept_idx != idx && !idx) {

if (!idx && prev_intercept_idx) ?

> + /*
> + * We have to query the sleep length here otherwise we don't
> + * know after wakeup if our guess was correct.
> + */
> + duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> + cpu_data->sleep_length_ns = duration_ns;
> + }
>
> /*
> * If there is a latency constraint, it may be necessary to select an


2024-06-09 22:47:16

by Qais Yousef

Subject: Re: [PATCH 1/6] cpuidle: teo: Increase util-threshold

On 06/06/24 10:00, Christian Loehle wrote:
> Increase the util-threshold by a lot as it was low enough for some
> minor load to always be active, especially on smaller CPUs.
>
> For small cap CPUs (Pixel6) the util threshold is as low as 1.
> For CPUs of capacity <64 it is 0. So ensure it is at a minimum, too.
>
> Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness")
> Reported-by: Qais Yousef <[email protected]>
> Reported-by: Vincent Guittot <[email protected]>
> Suggested-by: Kajetan Puchalski <[email protected]>
> Signed-off-by: Christian Loehle <[email protected]>
> ---
> drivers/cpuidle/governors/teo.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index 7244f71c59c5..45f43e2ee02d 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -146,13 +146,11 @@
> * The number of bits to shift the CPU's capacity by in order to determine
> * the utilized threshold.
> *
> - * 6 was chosen based on testing as the number that achieved the best balance
> - * of power and performance on average.
> - *
> * The resulting threshold is high enough to not be triggered by background
> - * noise and low enough to react quickly when activity starts to ramp up.
> + * noise.
> */
> -#define UTIL_THRESHOLD_SHIFT 6
> +#define UTIL_THRESHOLD_SHIFT 2
> +#define UTIL_THRESHOLD_MIN 50
>
> /*
> * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> @@ -671,7 +669,8 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> int i;
>
> memset(cpu_data, 0, sizeof(*cpu_data));
> - cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> + cpu_data->util_threshold = max(UTIL_THRESHOLD_MIN,
> + max_capacity >> UTIL_THRESHOLD_SHIFT);

Thanks for trying to fix this. But I am afraid this is not a solution. There's
no magic number that can truly work here - we tried. As I tried to explain
before, a higher util value doesn't mean a long idle time is unlikely. And
blocked load can cause problems where the decay can take too long.

We are following up on the suggestions I threw out back then and we'll
share results if anything actually works.

For now, I think a revert is more appropriate. There was some perf benefit, but
the power regressions were bad and there's no threshold value that actually
works. The thresholding concept itself is incorrect and flawed - it seemed like
the correct thing back then, yes. But in hindsight it doesn't work.


Thanks!

--
Qais Yousef

>
> for (i = 0; i < NR_RECENT; i++)
> cpu_data->recent_idx[i] = -1;
> --
> 2.34.1
>

2024-06-10 09:40:45

by Christian Loehle

Subject: Re: [PATCH 1/6] cpuidle: teo: Increase util-threshold

On 6/9/24 23:47, Qais Yousef wrote:
> On 06/06/24 10:00, Christian Loehle wrote:
>> [...]
>
> Thanks for trying to fix this. But I am afraid this is not a solution. There's
> no magic number that can truly work here - we tried. As I tried to explain
> before, a higher util value doesn't mean long idle time is unlikely. And
> blocked load can cause problems where a decay can take too long.
>
> We are following up with the suggestions I have thrown back then and we'll
> share results if anything actually works.

Namely watching the increase / decrease of utilization?
I think you would have to watch at least a couple of values before
entering such logic, and at that point the intercepts logic will handle
it anyway.
Furthermore, IMO we should be wary of introducing any state in teo that
persists across calls unless absolutely necessary (as it is for
intercept detection), as it really makes teo much less predictable.

>
> For now, I think a revert is more appropriate. There was some perf benefit, but
> the power regressions were bad and there's no threshold value that actually
> works. The thresholding concept itself is incorrect and flawed - it seemed the
> correct thing back then, yes. But in a hindsight now it doesn't work.

Unfortunate :/
OK. I'll do some more testing with that, too. From what I can see a
revert wouldn't cause terrible fallout for the rest of the series, so I
might just make that change for v2 and drop 2/6.

Kind Regards,
Christian


2024-06-10 10:31:29

by Ulf Hansson

Subject: Re: [PATCH 1/6] cpuidle: teo: Increase util-threshold

On Mon, 10 Jun 2024 at 00:47, Qais Yousef <[email protected]> wrote:
>
> On 06/06/24 10:00, Christian Loehle wrote:
> > Increase the util-threshold by a lot as it was low enough for some
> > minor load to always be active, especially on smaller CPUs.
> >
> > For small cap CPUs (Pixel6) the util threshold is as low as 1.
> > For CPUs of capacity <64 it is 0. So ensure it is at a minimum, too.
> >
> > Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness")
> > Reported-by: Qais Yousef <[email protected]>
> > Reported-by: Vincent Guittot <[email protected]>
> > Suggested-by: Kajetan Puchalski <[email protected]>
> > Signed-off-by: Christian Loehle <[email protected]>
> > ---
> > drivers/cpuidle/governors/teo.c | 11 +++++------
> > 1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> > index 7244f71c59c5..45f43e2ee02d 100644
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -146,13 +146,11 @@
> > * The number of bits to shift the CPU's capacity by in order to determine
> > * the utilized threshold.
> > *
> > - * 6 was chosen based on testing as the number that achieved the best balance
> > - * of power and performance on average.
> > - *
> > * The resulting threshold is high enough to not be triggered by background
> > - * noise and low enough to react quickly when activity starts to ramp up.
> > + * noise.
> > */
> > -#define UTIL_THRESHOLD_SHIFT 6
> > +#define UTIL_THRESHOLD_SHIFT 2
> > +#define UTIL_THRESHOLD_MIN 50
> >
> > /*
> > * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> > @@ -671,7 +669,8 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > int i;
> >
> > memset(cpu_data, 0, sizeof(*cpu_data));
> > - cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> > + cpu_data->util_threshold = max(UTIL_THRESHOLD_MIN,
> > + max_capacity >> UTIL_THRESHOLD_SHIFT);
>
> Thanks for trying to fix this. But I am afraid this is not a solution. There's
> no magic number that can truly work here - we tried. As I tried to explain
> before, a higher util value doesn't mean long idle time is unlikely. And
> blocked load can cause problems where a decay can take too long.
>
> We are following up with the suggestions I have thrown back then and we'll
> share results if anything actually works.
>
> For now, I think a revert is more appropriate. There was some perf benefit, but
> the power regressions were bad and there's no threshold value that actually
> works. The thresholding concept itself is incorrect and flawed - it seemed the
> correct thing back then, yes. But in a hindsight now it doesn't work.
>

For the record, I fully agree with the above. A revert seems to be the
best option in my opinion too.

Besides the above reasons: when using cpuidle-psci in PSCI OSI mode, the
approach leads to disabling *all* of the cluster's idle states too, as
those aren't even visible to the teo governor. I am sure that was not
the intent of commit 9ce0f7c4bc64.

[...]

Kind regards
Uffe

2024-06-10 11:06:26

by Christian Loehle

Subject: Re: [PATCH 6/6] cpuidle: teo: Don't count non-existent intercepts

On 6/7/24 11:17, Dietmar Eggemann wrote:
> On 06/06/2024 11:00, Christian Loehle wrote:
>> When bailing out early, teo will not query the sleep length anymore
>> since commit 6da8f9ba5a87 ("cpuidle: teo:
>> Skip tick_nohz_get_sleep_length() call in some cases") with an
>> expected sleep_length_ns value of KTIME_MAX.
>> This lead to state0 accumulating lots of 'intercepts' because
>> the actually measured sleep length was < KTIME_MAX, so count KTIME_MAX
>> as a hit (we have to count them as something otherwise we are stuck).
>>
>> Fundamentally we can only do one of the two:
>> 1. Skip sleep_length_ns query when we think intercept is likely
>> 2. Have accurate data if sleep_length_ns is actually intercepted when
>> we believe it is currently intercepted.
>>
>> This patch chooses that latter as I've found the additional time it
>> takes to query the sleep length to be negligible and the variants of
>> option 1 (count all unknowns as misses or count all unknown as hits)
>> had significant regressions (as misses had lots of too shallow idle
>> state selections and as hits had terrible performance in
>> intercept-heavy workloads).
>
> So '2.' is the 'if (prev_intercept_idx != idx && !idx)' case ?
>
> [...]

Yes, we allow the logic to bail out early, but not without querying the
expected sleep length.
(For idx > 0 the logic will continue to query the expected sleep length
later on.)

>
>> @@ -514,6 +521,14 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> first_suitable_idx = i;
>> }
>> }
>> + if (prev_intercept_idx != idx && !idx) {
>
> if (!idx && prev_intercept_idx) ?
>

Thanks! I picked that up for the next version.

>> + /*
>> + * We have to query the sleep length here otherwise we don't
>> + * know after wakeup if our guess was correct.
>> + */
>> + duration_ns = tick_nohz_get_sleep_length(&delta_tick);
>> + cpu_data->sleep_length_ns = duration_ns;
>> + }
>>
>> /*
>> * If there is a latency constraint, it may be necessary to select an
>