2022-11-02 15:45:47

by Kajetan Puchalski

Subject: [RFC PATCH v4 0/2] cpuidle: teo: Introduce util-awareness

Hi,

At the moment, all the available idle governors operate mainly on the basis of their own past
correctness metrics along with timer events, without taking any scheduling information into account.
Especially on interactive systems, this results in them frequently selecting a deeper idle state and
then waking up before its target residency is hit, leading to increased wakeup latency and lower
performance with no power saving. With 'menu' while web browsing on Android, for instance, those
types of wakeups ('too deep') account for over 24% of all wakeups.

At the same time, on some platforms C0 can be power efficient enough to be worth preferring over C1.
This is because the power usage of the two states can be so close that a sufficient number of too deep
C1 sleeps can completely offset the C1 power saving, to the point where it would've been more power
efficient to just use C0 instead.

Sleeps that happen in C0 when C1 could have been used ('too shallow') merely save less power than
they otherwise could have. Too deep sleeps, on the other hand, harm performance and nullify the
potential power saving from using C1 in the first place. Taking this into account, it is clear that
on balance it is preferable for an idle governor to have more too shallow sleeps rather than more
too deep sleeps on those kinds of platforms.

Currently the best available governor under this metric is TEO, which on average results in less than
half the percentage of too deep sleeps compared to 'menu', achieving much better wakeup latencies and
increased performance in the process.

This proposed optional extension to TEO specifically tunes it for minimising too deep sleeps and
minimising latency in order to achieve better performance. To this end, before selecting the next
idle state it uses the avg_util signal of a CPU's runqueue to determine to what extent the CPU is
being utilized. This util value is then compared to a threshold defined as a percentage of the CPU's
capacity (capacity >> 6, i.e. ~1.56%, in the current implementation). If the util is above the
threshold, the idle state selected by the TEO metrics will be reduced by 1, thus selecting a shallower
state. If the util is below the threshold, the governor defaults to the TEO metrics mechanism to try
to select the deepest available idle state based on the closest timer event and its own correctness.
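
To make this concrete, here is a minimal sketch of the check, condensed from patch 2/2 below and
shown purely for illustration. For a CPU with capacity 1024 the threshold works out to 1024 >> 6 = 16:

#define UTIL_THRESHOLD_SHIFT 6

/* computed once per CPU when the governor is enabled */
cpu_data->util_threshold = arch_scale_cpu_capacity(dev->cpu) >> UTIL_THRESHOLD_SHIFT;

/* checked before each idle state selection */
cpu_data->utilized = sched_cpu_util(dev->cpu) > cpu_data->util_threshold;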

As of v2 the patch includes a 'fast exit' path for arm-based and similar systems where only 2 idle
states are present. If there are just 2 idle states and the CPU is utilized, we can directly select
the shallowest state and save cycles by skipping the entire metrics mechanism.

As of v3 it also includes an adjustment where, on systems with more than 2 idle states, the state will
only be reduced if the selected candidate is C1 and C0 is not a polling state. This effectively means
the patch has no effect on most Intel systems.
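
Taken together, the two adjustments make the selection path look roughly like the sketch below
(condensed from patch 2/2, with the metrics code and surrounding context omitted):

/*
 * 'Fast exit': with fewer than 3 idle states, skip the metrics entirely
 * when the CPU is utilized and pick the shallowest enabled non-polling state.
 */
if (drv->state_count < 3 && cpu_data->utilized) {
        for (i = 0; i < drv->state_count; ++i) {
                if (dev->states_usage[i].disable ||
                    drv->states[i].flags & CPUIDLE_FLAG_POLLING)
                        continue;
                break;
        }
        idx = i;
        goto end;
}

/* ... the usual TEO metrics pick a candidate 'idx' ... */

/* On deeper systems, only step down from C1 to a non-polling shallower state. */
if (cpu_data->utilized && idx == 1)
        idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);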

This approach can outperform all the other currently available governors, at least on mobile device
workloads, which is why I think it is worth keeping as an option.

There is no particular attachment or reliance on TEO for this mechanism; I simply chose to base
it on TEO because it performs the best out of all the available options and I didn't think there was
any point in reinventing the wheel on the side of computing governor metrics. If a
better approach comes along at some point, there's no reason why the same util-aware mechanism
couldn't be used with any other metrics algorithm. That would, however, require implementing it as
a separate governor rather than a TEO add-on.

As for how the extension performs in practice, below I'll add some benchmark results I got while
testing this patchset. All the benchmarks were run after holding the phone in the fridge for exactly
an hour each time to minimise the impact of thermal issues.

Pixel 6 (Android 12, mainline kernel 5.18, with newer mainline CFS patches):

1. Geekbench 5 (latency-sensitive, heavy load test)

The values below are gmean values across 3 back-to-back iterations of Geekbench 5.
As GB5 is a heavy benchmark, running more than 3 iterations triggers intense throttling on mobile
devices, resulting in skewed benchmark scores, which makes it difficult to collect reliable results.
The actual values for all of the governors can change between runs as the benchmark might be affected
by factors other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently
achieved better scores than all the other governors.

Benchmark scores

+-----------------+-------------+---------+-------------+
| metric | kernel | value | perc_diff |
|-----------------+-------------+---------+-------------|
| multicore_score | menu | 2826.5 | 0.0% |
| multicore_score | teo | 2764.8 | -2.18% |
| multicore_score | teo_util_v3 | 2849 | 0.8% |
| multicore_score | teo_util_v4 | 2865 | 1.36% |
| score | menu | 1053 | 0.0% |
| score | teo | 1050.7 | -0.22% |
| score | teo_util_v3 | 1059.6 | 0.63% |
| score | teo_util_v4 | 1057.6 | 0.44% |
+-----------------+-------------+---------+-------------+

Idle misses

The numbers are percentages of too deep and too shallow sleeps computed using the new cpu_idle_miss
trace event. The percentage is obtained by counting the two types of misses over the course of a run
and then dividing them by the total number of wakeups in that run.
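
Expressed as a formula, for each governor and each miss type:

    count_perc = 100 * (misses of that type) / (total wakeups in the run)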

+-------------+-------------+--------------+
| wa_path | type | count_perc |
|-------------+-------------+--------------|
| menu | too deep | 14.994% |
| teo | too deep | 9.649% |
| teo_util_v3 | too deep | 4.298% |
| teo_util_v4 | too deep | 4.02 % |
| menu | too shallow | 2.497% |
| teo | too shallow | 5.963% |
| teo_util_v3 | too shallow | 13.773% |
| teo_util_v4 | too shallow | 14.598% |
+-------------+-------------+--------------+

Power usage [mW]

+--------------+----------+-------------+---------+-------------+
| chan_name | metric | kernel | value | perc_diff |
|--------------+----------+-------------+---------+-------------|
| total_power | gmean | menu | 2551.4 | 0.0% |
| total_power | gmean | teo | 2606.8 | 2.17% |
| total_power | gmean | teo_util_v3 | 2670.1 | 4.65% |
| total_power | gmean | teo_util_v4 | 2722.3 | 6.7% |
+--------------+----------+-------------+---------+-------------+

Task wakeup latency

+-----------------+----------+-------------+-------------+-------------+
| comm | metric | kernel | value | perc_diff |
|-----------------+----------+-------------+-------------+-------------|
| AsyncTask #1 | gmean | menu | 78.16μs | 0.0% |
| AsyncTask #1 | gmean | teo | 61.60μs | -21.19% |
| AsyncTask #1 | gmean | teo_util_v3 | 74.34μs | -4.89% |
| AsyncTask #1 | gmean | teo_util_v4 | 54.45μs | -30.34% |
| labs.geekbench5 | gmean | menu | 88.55μs | 0.0% |
| labs.geekbench5 | gmean | teo | 100.97μs | 14.02% |
| labs.geekbench5 | gmean | teo_util_v3 | 53.57μs | -39.5% |
| labs.geekbench5 | gmean | teo_util_v4 | 59.60μs | -32.7% |
+-----------------+----------+-------------+-------------+-------------+

In the case of this benchmark, the difference in latency does seem to translate into better scores.

2. PCMark Web Browsing (non-latency-sensitive, normal usage web browsing test)

The table below contains gmean values across 20 back-to-back iterations of PCMark 2 Web Browsing.

Benchmark scores

+----------------+-------------+---------+-------------+
| metric | kernel | value | perc_diff |
|----------------+-------------+---------+-------------|
| PcmaWebV2Score | menu | 5232 | 0.0% |
| PcmaWebV2Score | teo | 5219.8 | -0.23% |
| PcmaWebV2Score | teo_util_v3 | 5273.5 | 0.79% |
| PcmaWebV2Score | teo_util_v4 | 5239.9 | 0.15% |
+----------------+-------------+---------+-------------+

Idle misses

+-------------+-------------+--------------+
| wa_path | type | count_perc |
|-------------+-------------+--------------|
| menu | too deep | 24.814% |
| teo | too deep | 11.65% |
| teo_util_v3 | too deep | 3.481% |
| teo_util_v4 | too deep | 3.662% |
| menu | too shallow | 3.101% |
| teo | too shallow | 8.578% |
| teo_util_v3 | too shallow | 18.326% |
| teo_util_v4 | too shallow | 18.692% |
+-------------+-------------+--------------+

Power usage [mW]

+--------------+----------+-------------+---------+-------------+
| chan_name | metric | kernel | value | perc_diff |
|--------------+----------+-------------+---------+-------------|
| total_power | gmean | menu | 179.2 | 0.0% |
| total_power | gmean | teo | 184.8 | 3.1% |
| total_power | gmean | teo_util_v3 | 177.4 | -1.02% |
| total_power | gmean | teo_util_v4 | 184.1 | 2.71% |
+--------------+----------+-------------+---------+-------------+

Task wakeup latency

+-----------------+----------+-------------+-------------+-------------+
| comm | metric | kernel | value | perc_diff |
|-----------------+----------+-------------+-------------+-------------|
| CrRendererMain | gmean | menu | 236.63μs | 0.0% |
| CrRendererMain | gmean | teo | 201.85μs | -14.7% |
| CrRendererMain | gmean | teo_util_v3 | 106.46μs | -55.01% |
| CrRendererMain | gmean | teo_util_v4 | 106.72μs | -54.9% |
| chmark:workload | gmean | menu | 100.30μs | 0.0% |
| chmark:workload | gmean | teo | 80.20μs | -20.04% |
| chmark:workload | gmean | teo_util_v3 | 65.88μs | -34.32% |
| chmark:workload | gmean | teo_util_v4 | 57.90μs | -42.28% |
| surfaceflinger | gmean | menu | 97.57μs | 0.0% |
| surfaceflinger | gmean | teo | 98.86μs | 1.31% |
| surfaceflinger | gmean | teo_util_v3 | 56.49μs | -42.1% |
| surfaceflinger | gmean | teo_util_v4 | 72.68μs | -25.52% |
+-----------------+----------+-------------+-------------+-------------+

In this case the large latency improvement does not translate into a notable increase in benchmark score as
this particular benchmark mainly responds to changes in operating frequency.

3. Jankbench (locked 60Hz screen) (normal usage UI test)

Frame durations

+---------------+------------------+---------+-------------+
| variable | kernel | value | perc_diff |
|---------------+------------------+---------+-------------|
| mean_duration | menu_60hz | 13.9 | 0.0% |
| mean_duration | teo_60hz | 14.7 | 6.0% |
| mean_duration | teo_util_v3_60hz | 13.8 | -0.87% |
| mean_duration | teo_util_v4_60hz | 12.6 | -9.0% |
+---------------+------------------+---------+-------------+

Jank percentage

+------------+------------------+---------+-------------+
| variable | kernel | value | perc_diff |
|------------+------------------+---------+-------------|
| jank_perc | menu_60hz | 1.5 | 0.0% |
| jank_perc | teo_60hz | 2.1 | 36.99% |
| jank_perc | teo_util_v3_60hz | 1.3 | -13.95% |
| jank_perc | teo_util_v4_60hz | 1.3 | -17.37% |
+------------+------------------+---------+-------------+

Idle misses

+------------------+-------------+--------------+
| wa_path | type | count_perc |
|------------------+-------------+--------------|
| menu_60hz | too deep | 26.00% |
| teo_60hz | too deep | 11.00% |
| teo_util_v3_60hz | too deep | 2.33% |
| teo_util_v4_60hz | too deep | 2.54% |
| menu_60hz | too shallow | 4.74% |
| teo_60hz | too shallow | 11.89% |
| teo_util_v3_60hz | too shallow | 21.78% |
| teo_util_v4_60hz | too shallow | 21.93% |
+------------------+-------------+--------------+

Power usage [mW]

+--------------+------------------+---------+-------------+
| chan_name | kernel | value | perc_diff |
|--------------+------------------+---------+-------------|
| total_power | menu_60hz | 144.6 | 0.0% |
| total_power | teo_60hz | 136.9 | -5.27% |
| total_power | teo_util_v3_60hz | 134.2 | -7.19% |
| total_power | teo_util_v4_60hz | 121.3 | -16.08% |
+--------------+------------------+---------+-------------+

Task wakeup latency

+-----------------+------------------+-------------+-------------+
| comm | kernel | value | perc_diff |
|-----------------+------------------+-------------+-------------|
| RenderThread | menu_60hz | 139.52μs | 0.0% |
| RenderThread | teo_60hz | 116.51μs | -16.49% |
| RenderThread | teo_util_v3_60hz | 86.76μs | -37.82% |
| RenderThread | teo_util_v4_60hz | 91.11μs | -34.7% |
| droid.benchmark | menu_60hz | 135.88μs | 0.0% |
| droid.benchmark | teo_60hz | 105.21μs | -22.57% |
| droid.benchmark | teo_util_v3_60hz | 83.92μs | -38.24% |
| droid.benchmark | teo_util_v4_60hz | 83.18μs | -38.79% |
| surfaceflinger | menu_60hz | 124.03μs | 0.0% |
| surfaceflinger | teo_60hz | 151.90μs | 22.47% |
| surfaceflinger | teo_util_v3_60hz | 100.19μs | -19.22% |
| surfaceflinger | teo_util_v4_60hz | 87.65μs | -29.33% |
+-----------------+------------------+-------------+-------------+

4. Speedometer 2 (heavy load web browsing test)

Benchmark scores

+-------------------+-------------+---------+-------------+
| metric | kernel | value | perc_diff |
|-------------------+-------------+---------+-------------|
| Speedometer Score | menu | 102 | 0.0% |
| Speedometer Score | teo | 104.9 | 2.88% |
| Speedometer Score | teo_util_v3 | 102.1 | 0.16% |
| Speedometer Score | teo_util_v4 | 103.8 | 1.83% |
+-------------------+-------------+---------+-------------+

Idle misses

+-------------+-------------+--------------+
| wa_path | type | count_perc |
|-------------+-------------+--------------|
| menu | too deep | 17.95% |
| teo | too deep | 6.46% |
| teo_util_v3 | too deep | 0.63% |
| teo_util_v4 | too deep | 0.64% |
| menu | too shallow | 3.86% |
| teo | too shallow | 8.21% |
| teo_util_v3 | too shallow | 14.72% |
| teo_util_v4 | too shallow | 14.43% |
+-------------+-------------+--------------+

Power usage [mW]

+--------------+----------+-------------+---------+-------------+
| chan_name | metric | kernel | value | perc_diff |
|--------------+----------+-------------+---------+-------------|
| total_power | gmean | menu | 2059 | 0.0% |
| total_power | gmean | teo | 2187.8 | 6.26% |
| total_power | gmean | teo_util_v3 | 2212.9 | 7.47% |
| total_power | gmean | teo_util_v4 | 2121.8 | 3.05% |
+--------------+----------+-------------+---------+-------------+

Task wakeup latency

+-----------------+----------+-------------+-------------+-------------+
| comm | metric | kernel | value | perc_diff |
|-----------------+----------+-------------+-------------+-------------|
| CrRendererMain | gmean | menu | 17.18μs | 0.0% |
| CrRendererMain | gmean | teo | 16.18μs | -5.82% |
| CrRendererMain | gmean | teo_util_v3 | 18.04μs | 5.05% |
| CrRendererMain | gmean | teo_util_v4 | 18.25μs | 6.27% |
| RenderThread | gmean | menu | 68.60μs | 0.0% |
| RenderThread | gmean | teo | 48.44μs | -29.39% |
| RenderThread | gmean | teo_util_v3 | 48.01μs | -30.02% |
| RenderThread | gmean | teo_util_v4 | 51.24μs | -25.3% |
| surfaceflinger | gmean | menu | 42.23μs | 0.0% |
| surfaceflinger | gmean | teo | 29.84μs | -29.33% |
| surfaceflinger | gmean | teo_util_v3 | 24.51μs | -41.95% |
| surfaceflinger | gmean | teo_util_v4 | 29.64μs | -29.8% |
+-----------------+----------+-------------+-------------+-------------+

At the very least this approach seems promising, so I wanted to discuss it in RFC form first.
Thank you for taking the time to read this!

--
Kajetan

v3 -> v4:
- remove the chunk of code skipping metrics updates when the CPU was utilized
- include new test results and more benchmarks in the cover letter

v2 -> v3:
- add a patch adding an option to skip polling states in teo_find_shallower_state()
- only reduce the state if the candidate state is C1 and C0 is not a polling state
- add a check for polling states in the 2-states fast-exit path
- remove the ifdefs and Kconfig option

v1 -> v2:
- rework the mechanism to reduce selected state by 1 instead of directly selecting C0 (suggested by Doug Smythies)
- add a fast-exit path for systems with 2 idle states to not waste cycles on metrics when utilized
- fix typos in comments
- include a missing header

Kajetan Puchalski (2):
cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
cpuidle: teo: Introduce util-awareness

drivers/cpuidle/governors/teo.c | 88 +++++++++++++++++++++++++++++++--
1 file changed, 84 insertions(+), 4 deletions(-)

--
2.37.1



2022-11-02 15:56:53

by Kajetan Puchalski

Subject: [RFC PATCH v4 2/2] cpuidle: teo: Introduce util-awareness

Modern interactive systems, such as recent Android phones, tend to have
power efficient shallow idle states. Selecting deeper idle states on a
device while a latency-sensitive workload is running can adversely impact
performance due to increased latency. Additionally, if the CPU wakes up
from a deeper sleep before its target residency, as is often the case, it
wastes energy on top of that.

This patch extends the TEO governor with a util-awareness mechanism,
effectively providing a way for the governor to reduce the selected idle
state by 1 when the CPU is being utilized over a certain threshold, while
still trying to select the deepest possible state using TEO's metrics
when the CPU is not being utilized. This is possible because the CPU
utilization is exported from the scheduler with the sched_cpu_util()
function and is already used e.g. in the IPA thermal governor.

Under this implementation, when the CPU is being utilized and the
selected candidate state is C1, it will be reduced to C0 as long as C0
is not a polling state. This should effectively make the patch a no-op
on most Intel systems.

This can provide drastically decreased latency and improved performance
in certain types of latency-sensitive mobile workloads, such as
Geekbench 5.

Signed-off-by: Kajetan Puchalski <[email protected]>
---
drivers/cpuidle/governors/teo.c | 80 ++++++++++++++++++++++++++++++++-
1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index e2864474a98d..2f37aeba8cb8 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -2,8 +2,13 @@
/*
* Timer events oriented CPU idle governor
*
+ * TEO governor:
* Copyright (C) 2018 - 2021 Intel Corporation
* Author: Rafael J. Wysocki <[email protected]>
+ *
+ * Util-awareness mechanism:
+ * Copyright (C) 2022 Arm Ltd.
+ * Author: Kajetan Puchalski <[email protected]>
*/

/**
@@ -99,14 +104,49 @@
* select the given idle state instead of the candidate one.
*
* 3. By default, select the candidate state.
+ *
+ * Util-awareness mechanism:
+ *
+ * The idea behind the util-awareness extension is that there are two distinct
+ * scenarios for the CPU which should result in two different approaches to idle
+ * state selection - utilized and not utilized.
+ *
+ * In this case, 'utilized' means that the average runqueue util of the CPU is
+ * above a certain threshold.
+ *
+ * When the CPU is utilized while going into idle, more likely than not it will
+ * be woken up to do more work soon and so a shallower idle state should be
+ * selected to minimise latency and maximise performance. When the CPU is not
+ * being utilized, the usual metrics-based approach to selecting the deepest
+ * available idle state should be preferred to take advantage of the power
+ * saving.
+ *
+ * In order to achieve this, the governor uses a utilization threshold.
+ * The threshold is computed per-cpu as a percentage of the CPU's capacity
+ * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
+ * seems to be getting the best results.
+ *
+ * Before selecting the next idle state, the governor compares the current CPU
+ * util to the precomputed util threshold. If it's below, it defaults to the
+ * TEO metrics mechanism. If it's above and the currently selected candidate is
+ * C1, the idle state will be reduced to C0 as long as C0 is not a polling state.
*/

#include <linux/cpuidle.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/sched/topology.h>
#include <linux/tick.h>

+/*
+ * The number of bits to shift the cpu's capacity by in order to determine
+ * the utilized threshold
+ */
+#define UTIL_THRESHOLD_SHIFT 6
+
+
/*
* The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
* is used for decreasing metrics on a regular basis.
@@ -137,9 +177,11 @@ struct teo_bin {
* @time_span_ns: Time between idle state selection and post-wakeup update.
* @sleep_length_ns: Time till the closest timer event (at the selection time).
* @state_bins: Idle state data bins for this CPU.
- * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
+ * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
* @next_recent_idx: Index of the next @recent_idx entry to update.
* @recent_idx: Indices of bins corresponding to recent "intercepts".
+ * @util_threshold: Threshold above which the CPU is considered utilized
+ * @utilized: Whether the last sleep on the CPU happened while utilized
*/
struct teo_cpu {
s64 time_span_ns;
@@ -148,10 +190,24 @@ struct teo_cpu {
unsigned int total;
int next_recent_idx;
int recent_idx[NR_RECENT];
+ unsigned long util_threshold;
+ bool utilized;
};

static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);

+/**
+ * teo_get_util - Update the CPU utilized status
+ * @dev: Target CPU
+ * @cpu_data: Governor CPU data for the target CPU
+ */
+static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
+{
+ unsigned long util = sched_cpu_util(dev->cpu);
+
+ cpu_data->utilized = util > cpu_data->util_threshold;
+}
+
/**
* teo_update - Update CPU metrics after wakeup.
* @drv: cpuidle driver containing state data.
@@ -323,6 +379,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
goto end;
}

+ teo_get_util(dev, cpu_data);
+ /* the cpu is being utilized and there's only 2 states to choose from */
+ /* no need to consider metrics, choose the shallowest non-polling state and exit */
+ if (drv->state_count < 3 && cpu_data->utilized) {
+ for (i = 0; i < drv->state_count; ++i) {
+ if (dev->states_usage[i].disable ||
+ drv->states[i].flags & CPUIDLE_FLAG_POLLING)
+ continue;
+ break;
+ }
+
+ idx = i;
+ goto end;
+ }
+
/*
* Find the deepest idle state whose target residency does not exceed
* the current sleep length and the deepest idle state not deeper than
@@ -454,6 +525,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (idx > constraint_idx)
idx = constraint_idx;

+ /* if the CPU is being utilized and C1 is the selected candidate */
+ /* choose a shallower non-polling state to improve latency */
+ if (cpu_data->utilized && idx == 1)
+ idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
+
end:
/*
* Don't stop the tick if the selected state is a polling one or if the
@@ -510,9 +586,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+ unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
int i;

memset(cpu_data, 0, sizeof(*cpu_data));
+ cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;

for (i = 0; i < NR_RECENT; i++)
cpu_data->recent_idx[i] = -1;
--
2.37.1


2022-11-21 12:50:54

by Kajetan Puchalski

Subject: Re: [RFC PATCH v4 0/2] cpuidle: teo: Introduce util-awareness

Hi Rafael,

On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:

[...]

> v3 -> v4:
> - remove the chunk of code skipping metrics updates when the CPU was utilized
> - include new test results and more benchmarks in the cover letter

[...]

It's been some time, so I just wanted to bump this. What do you think
about this v4? Doug has already tested it; results for his machine are
attached to the v3 thread.

Thanks,
Kajetan

2022-11-21 13:06:21

by Rafael J. Wysocki

Subject: Re: [RFC PATCH v4 0/2] cpuidle: teo: Introduce util-awareness

On Mon, Nov 21, 2022 at 1:23 PM Kajetan Puchalski
<[email protected]> wrote:
>
> Hi Rafael,
>
> On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
>
> [...]
>
> > v3 -> v4:
> > - remove the chunk of code skipping metrics updates when the CPU was utilized
> > - include new test results and more benchmarks in the cover letter
>
> [...]
>
> It's been some time so I just wanted to bump this, what do you think
> about this v4? Doug has already tested it, results for his machine are
> attached to the v3 thread.

I have some comments, but it's being pushed down by more urgent things, sorry.

First off, I think that the information from your cover letter should
go into the patch changelog (at least the majority of it), as it's
relevant for the motivation part.

Also I think that this optimization is really trading energy for
performance and that should be emphasized. IOW, it is not about
improving the prediction accuracy (which is what the cover letter and
changelog seem to be claiming), but about reducing the expected CPU
wakeup latency in some cases.

I'll send more comments later today if I have the time or later this
week otherwise.

2022-11-24 04:41:38

by Doug Smythies

Subject: RE: [RFC PATCH v4 0/2] cpuidle: teo: Introduce util-awareness

On 2022.11.21 04:23 Kajetan Puchalski wrote:

> Hi Rafael,
>
> On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
>
> [...]
>
>> v3 -> v4:
>> - remove the chunk of code skipping metrics updates when the CPU was utilized
>> - include new test results and more benchmarks in the cover letter
>
> [...]
>
> It's been some time so I just wanted to bump this, what do you think
> about this v4? Doug has already tested it, results for his machine are
> attached to the v3 thread.

Hi All,

I continued testing this and also included the proposed ladder idle governor
(which is why I added Rui as an addressee).
However, I ran out of time. Here is what I have:

Kernel: 6.1-rc3 and with patch sets
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
CPU scaling driver: intel_cpufreq
HWP disabled.
Unless otherwise stated, performance CPU scaling governor.

Legend:
teo: the current teo idle governor
util-v4: the RFC utilization teo patch set version 4.
menu: the menu idle governor
ladder-old: the current ladder idle governor
ladder: the RFC ladder patchset.

Workflow: shell-intensive serialized workloads.
Variable: PIDs per second.
Note: Single threaded.
Master reference: forced CPU affinity to 1 CPU.
Performance Results:
http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-perf.png
Schedutil Results:
http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png

Workflow: sleeping ebizzy 128 threads.
Variable: interval (uSecs).
Performance Results:
http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
Performance power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/
Schedutil Results:
http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png
Schedutil power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/ebizzy/su/

Workflow: 6 core ping-pong.
Variable: amount of work packet per token transfer
Forced CPU affinity, 16.67% load per core (6 CPUs idle, 6 busy).
Overview:
http://smythies.com/~doug/linux/idle/teo-util/graphs/6-core-ping-pong-sweep.png
short loop times detail:
http://smythies.com/~doug/linux/idle/teo-util/graphs/6-core-ping-pong-sweep-detail-a.png
Power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/ping-sweep/6-4/
The transition between 35 and 40 minutes will be the subject of some future investigation.

Workflow: periodic 73, 113, 211, 347, 401 work/sleep frequency.
Summary: Nothing interesting.
Variable: work packet (load), ramps up and then down.
Single threaded.
Power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/consume/idle-3/
Higher resolution power data:
http://smythies.com/~doug/linux/idle/teo-util/consume/ps73/
http://smythies.com/~doug/linux/idle/teo-util/consume/ps113/
http://smythies.com/~doug/linux/idle/teo-util/consume/ps211/
http://smythies.com/~doug/linux/idle/teo-util/consume/ps347/
http://smythies.com/~doug/linux/idle/teo-util/consume/ps401/

Workflow: fast speed 2 pair, 4 threads ping-pong.
Variable: none, this is a dwell test.
Results:
http://smythies.com/~doug/linux/idle/teo-util/many-0-400000000-2/times.txt
Performance power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-0-400000000-2/perf/
Schedutil power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-0-400000000-2/su/

Workflow: medium speed 2 pair, 4 threads ping-pong.
Variable: none, this is a dwell test.
Results:
http://smythies.com/~doug/linux/idle/teo-util/many-3000-100000000-2/times.txt
Performance power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-3000-100000000-2/perf/
Schedutil power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-3000-100000000-2/su/

Workflow: slow speed 2 pair, 4 threads ping-pong.
Variable: none, this is a dwell test.
Results:
http://smythies.com/~doug/linux/idle/teo-util/many-1000000-342000-2/times.txt
Performance power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-1000000-342000-2/perf/
Schedutil power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-1000000-342000-2/su/

Results summary:

Results are uSeconds per loop.
Less is better.

Slow ping pong - 2 pairs, 4 threads.

Performance:
ladder-old: Average: 2583 (-0.56%)
ladder: Average: 2617 (+0.81%)
menu: Average: 2596 Reference Time.
teo: Average: 2689 (+3.6%)
util-v4 Average: 2665 (+2.7%)

Schedutil:
ladder-old: Average: 4490 (+44%)
ladder: Average: 3296 (+5.9%)
menu: Average: 3113 Reference Time.
teo: Average: 4005 (+29%)
util-v4: Average: 3527 (+13%)

Medium ping pong - 2 pairs, 4 threads.

Performance:
ladder-old: Average: 11.8214 (+4.6%)
ladder: Average: 11.7730 (+4.2%)
menu: Average: 11.2971 Reference Time.
teo: Average: 11.355 (+0.51%)
util-v4: Average: 11.3364 (+0.35%)

Schedutil:
ladder-old: Average: 15.6813 (+30%)
ladder: Average: 15.4338 (+28%)
menu: Average: 12.0868 Reference Time.
teo: Average: 11.7367 (-2.9%)
util-v4: Average: 11.6352 (-3.7%)

Fast ping pong - 2 pairs, 4 threads.

Performance:
ladder-old: Average: 4.009 (+39%)
ladder: Average: 3.844 (+33%)
menu: Average: 2.891 Reference Time.
teo: Average: 3.053 (+5.6%)
util-v4: Average: 2.985 (+3.2%)

Schedutil:
ladder-old: Average: 5.053 (+64%)
ladder: Average: 5.278 (+71%)
menu: Average: 3.078 Reference Time.
teo: Average: 3.106 (+0.91%)
util-v4: Average: 3.15 (+2.35%)

... Doug


2022-11-26 17:08:45

by Zhang, Rui

[permalink] [raw]
Subject: Re: [RFC PATCH v4 0/2] cpuidle: teo: Introduce util-awareness

> > Workflow: sleeping ebizzy 128 threads.
> > Variable: interval (uSecs).
> > Performance Results:
> > http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
> > Performance power and idle data:
> > http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/
>
> for the "Idle state 0/1/2/3 was too deep" graphs, may I know how you
> assert that an idle state is too deep/shallow?
>
Is this obtained from the cpu_idle_miss trace event?

thanks,
rui

2022-11-26 17:19:21

by Zhang, Rui

Subject: Re: [RFC PATCH v4 0/2] cpuidle: teo: Introduce util-awareness

On Wed, 2022-11-23 at 20:08 -0800, Doug Smythies wrote:
> On 2022.11.21 04:23 Kajetan Puchalski wrote:
>
> > Hi Rafael,
> >
> > On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
> >
> > [...]
> >
> > > v3 -> v4:
> > > - remove the chunk of code skipping metrics updates when the CPU
> > > was utilized
> > > - include new test results and more benchmarks in the cover
> > > letter
> >
> > [...]
> >
> > It's been some time so I just wanted to bump this, what do you
> > think
> > about this v4? Doug has already tested it, results for his machine
> > are
> > attached to the v3 thread.
>
> Hi All,
>
> I continued to test this and included the proposed ladder idle
> governor in my continued testing.
> (Which is why I added Rui as an addressee)

Hi, Doug,

Really appreciated your testing data on this.
I have some dumb questions and I need your help so that I can better
understand some of the graphs. :)

> However, I ran out of time. Here is what I have:
>
> Kernel: 6.1-rc3 and with patch sets
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> CPU scaling driver: intel_cpufreq
> HWP disabled.
> Unless otherwise stated, performance CPU scaling governor.
>
> Legend:
> teo: the current teo idle governor
> util-v4: the RFC utilization teo patch set version 4.
> menu: the menu idle governor
> ladder-old: the current ladder idle governor
> ladder: the RFC ladder patchset.
>
> Workflow: shell-intensive serialized workloads.
> Variable: PIDs per second.
> Note: Single threaded.
> Master reference: forced CPU affinity to 1 CPU.
> Performance Results:
> http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-perf.png
> Schedutil Results:
> http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png

what does 1cpu mean?

>
> Workflow: sleeping ebizzy 128 threads.
> Variable: interval (uSecs).
> Performance Results:
> http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
> Performance power and idle data:
> http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/

for the "Idle state 0/1/2/3 was too deep" graphs, may I know how you
assert that an idle state is too deep/shallow?

thanks,
rui

2022-11-26 23:01:11

by Doug Smythies

Subject: RE: [RFC PATCH v4 0/2] cpuidle: teo: Introduce util-awareness

On 2022.11.26 08:26 Rui wrote:
> On Wed, 2022-11-23 at 20:08 -0800, Doug Smythies wrote:
>> On 2022.11.21 04:23 Kajetan Puchalski wrote:
>>> On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
>>>
>>> [...]
>>>
>>>> v3 -> v4:
>>>> - remove the chunk of code skipping metrics updates when the CPU
>>>> was utilized
>>>> - include new test results and more benchmarks in the cover
>>>> letter
>>>
>>> [...]
>>>
>>> It's been some time so I just wanted to bump this, what do you
>>> think
>>> about this v4? Doug has already tested it, results for his machine
>>> are
>>> attached to the v3 thread.
>>
>> Hi All,
>>
>> I continued to test this and included the proposed ladder idle
>> governor in my continued testing.
>> (Which is why I added Rui as an addressee)
>
> Hi, Doug,

Hi Rui,

> Really appreciated your testing data on this.
> I have some dumb questions and I need your help so that I can better
> understand some of the graphs. :)
>
>> However, I ran out of time. Here is what I have:
>>
>> Kernel: 6.1-rc3 and with patch sets
>> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
>> CPU scaling driver: intel_cpufreq
>> HWP disabled.
>> Unless otherwise stated, performance CPU scaling governor.
>>
>> Legend:
>> teo: the current teo idle governor
>> util-v4: the RFC utilization teo patch set version 4.
>> menu: the menu idle governor
>> ladder-old: the current ladder idle governor
>> ladder: the RFC ladder patchset.
>>
>> Workflow: shell-intensive serialized workloads.
>> Variable: PIDs per second.
>> Note: Single threaded.
>> Master reference: forced CPU affinity to 1 CPU.

This is the 1cpu on the graph.

>> Performance Results:
>> http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-perf.png
>> Schedutil Results:
>> http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png
>
> what does 1cpu mean?

For the shell-intensive serialized workflow, i.e.:

Dountil the list of tasks is finished:
Start the next task in the list of stuff to do (with a new PID).
Wait for it to finish
Enduntil
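
Expressed as a minimal C sketch (illustrative only; the real workload is a shell script and the
task list below is just a placeholder):

#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        /* placeholder task list; the real workload runs many short shell commands */
        const char *tasks[] = { "/bin/true", "/bin/true", "/bin/true", NULL };

        for (int i = 0; tasks[i]; i++) {
                pid_t pid = fork();

                if (pid == 0) {                 /* child: a new PID for every task */
                        execl(tasks[i], tasks[i], (char *)NULL);
                        _exit(127);
                }
                waitpid(pid, NULL, 0);          /* parent: wait for it to finish */
        }
        return 0;
}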

We know it represents a challenge for CPU frequency scaling drivers,
schedulers, and therefore idle drivers.

We also know that the best performance is achieved by overriding
the scheduler and forcing CPU affinity. I use this "best" case as the
master reference, using the label 1cpu on the graph.

>> Workflow: sleeping ebizzy 128 threads.
>> Variable: interval (uSecs).
>> Performance Results:
>> http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
>> Performance power and idle data:
>> http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/
>
> for the "Idle state 0/1/2/3 was too deep" graphs, may I know how you
> assert that an idle state is too deep/shallow?

I get those stats directly from the kernel driver statistics. For example:

$ grep . /sys/devices/system/cpu/cpu4/cpuidle/state*/above
/sys/devices/system/cpu/cpu4/cpuidle/state0/above:0
/sys/devices/system/cpu/cpu4/cpuidle/state1/above:38085
/sys/devices/system/cpu/cpu4/cpuidle/state2/above:7668
/sys/devices/system/cpu/cpu4/cpuidle/state3/above:6823

$ grep . /sys/devices/system/cpu/cpu4/cpuidle/state*/below
/sys/devices/system/cpu/cpu4/cpuidle/state0/below:72059
/sys/devices/system/cpu/cpu4/cpuidle/state1/below:246573
/sys/devices/system/cpu/cpu4/cpuidle/state2/below:7817
/sys/devices/system/cpu/cpu4/cpuidle/state3/below:0

I keep track of the changes per sample interval and graph
the sum for all CPUs as a percentage of the usage of
that idle state.
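
As a rough illustration (not the actual scripts), sampling one counter pair for one CPU looks
like the sketch below; the real processing sums the per-interval deltas over all CPUs and
expresses them as a percentage of that idle state's usage over the same interval:

#include <stdio.h>
#include <unistd.h>

/* read one cpuidle counter, e.g. .../cpuidle/state1/above */
static unsigned long long read_counter(const char *path)
{
        unsigned long long val = 0;
        FILE *f = fopen(path, "r");

        if (f) {
                fscanf(f, "%llu", &val);
                fclose(f);
        }
        return val;
}

int main(void)
{
        const char *above = "/sys/devices/system/cpu/cpu4/cpuidle/state1/above";
        const char *below = "/sys/devices/system/cpu/cpu4/cpuidle/state1/below";
        unsigned long long a0 = read_counter(above);
        unsigned long long b0 = read_counter(below);

        sleep(10);      /* one sample interval */

        printf("cpu4/state1 delta: above +%llu, below +%llu\n",
               read_counter(above) - a0, read_counter(below) - b0);
        return 0;
}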

Because I can never remember what "above" and "below"
actually mean, I use the terms "was too shallow"
and "was too deep".

... Doug


2022-11-27 07:21:08

by Zhang, Rui

Subject: Re: [RFC PATCH v4 0/2] cpuidle: teo: Introduce util-awareness

On Sat, 2022-11-26 at 13:56 -0800, Doug Smythies wrote:
> On 2022.11.26 08:26 Rui wrote:
> > On Wed, 2022-11-23 at 20:08 -0800, Doug Smythies wrote:
> > > On 2022.11.21 04:23 Kajetan Puchalski wrote:
> > > > On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski
> > > > wrote:
> > > >
> > > > [...]
> > > >
> > > > > v3 -> v4:
> > > > > - remove the chunk of code skipping metrics updates when the
> > > > > CPU
> > > > > was utilized
> > > > > - include new test results and more benchmarks in the cover
> > > > > letter
> > > >
> > > > [...]
> > > >
> > > > It's been some time so I just wanted to bump this, what do you
> > > > think
> > > > about this v4? Doug has already tested it, results for his
> > > > machine
> > > > are
> > > > attached to the v3 thread.
> > >
> > > Hi All,
> > >
> > > I continued to test this and included the proposed ladder idle
> > > governor in my continued testing.
> > > (Which is why I added Rui as an addressee)
> >
> > Hi, Doug,
>
> Hi Rui,
>
> > Really appreciated your testing data on this.
> > I have some dumb questions and I need your help so that I can
> > better
> > understand some of the graphs. :)
> >
> > > However, I ran out of time. Here is what I have:
> > >
> > > Kernel: 6.1-rc3 and with patch sets
> > > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > CPU scaling driver: intel_cpufreq
> > > HWP disabled.
> > > Unless otherwise stated, performance CPU scaling governor.
> > >
> > > Legend:
> > > teo: the current teo idle governor
> > > util-v4: the RFC utilization teo patch set version 4.
> > > menu: the menu idle governor
> > > ladder-old: the current ladder idle governor
> > > ladder: the RFC ladder patchset.
> > >
> > > Workflow: shell-intensive serialized workloads.
> > > Variable: PIDs per second.
> > > Note: Single threaded.
> > > Master reference: forced CPU affinity to 1 CPU.
>
> This is the 1cpu on the graph.
>
> > > Performance Results:
> > > http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-perf.png
> > > Schedutil Results:
> > > http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png
> >
> > what does 1cpu mean?
>
> For shell-intensive serialized workflow or:
>
> Dountil the list of tasks is finished:
> Start the next task in the list of stuff to do (with a new PID).
> Wait for it to finish
> Enduntil
>
> We know it represents a challenge for CPU frequency scaling drivers,
> schedulers, and therefore idle drivers.
>
> We also know that the best performance is achieved by overriding
> the scheduler and forcing CPU affinity. I use this "best" case as the
> master reference, using the label 1cpu on the graph.
>
Got it.

> > > Workflow: sleeping ebizzy 128 threads.
> > > Variable: interval (uSecs).
> > > Performance Results:
> > > http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
> > > Performance power and idle data:
> > > http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/
> >
> > for the "Idle state 0/1/2/3 was too deep" graphs, may I know how
> > you
> > assert that an idle state is too deep/shallow?
>
> I get those stats directly from the kernel driver statistics. For
> example:
>
> $ grep . /sys/devices/system/cpu/cpu4/cpuidle/state*/above
> /sys/devices/system/cpu/cpu4/cpuidle/state0/above:0
> /sys/devices/system/cpu/cpu4/cpuidle/state1/above:38085
> /sys/devices/system/cpu/cpu4/cpuidle/state2/above:7668
> /sys/devices/system/cpu/cpu4/cpuidle/state3/above:6823
>
> $ grep . /sys/devices/system/cpu/cpu4/cpuidle/state*/below
> /sys/devices/system/cpu/cpu4/cpuidle/state0/below:72059
> /sys/devices/system/cpu/cpu4/cpuidle/state1/below:246573
> /sys/devices/system/cpu/cpu4/cpuidle/state2/below:7817
> /sys/devices/system/cpu/cpu4/cpuidle/state3/below:0
>
> I keep track of the changes per sample interval and graph
> the sum for all CPUs as a percentage of the usage of
> that idle state.
>
> Because I can never remember what "above" and "below"
> actually mean, I use the terms "was too shallow"
> and "was too deep".

I just checked the code. My understanding is that
"above" means the previous idle state residency is too short, and a
shallower state would have been a better match.
"below" means the previous idle state residency is too long, and a
deeper state would have been a better match.

So probably "above" means "should be shallower" or "was too deep", and
"below" means "should be deeper" or "was to shallow"?

thanks,
rui