2022-10-31 12:23:55

by Kajetan Puchalski

[permalink] [raw]
Subject: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

Modern interactive systems, such as recent Android phones, tend to have
power efficient shallow idle states. Selecting deeper idle states on a
device while a latency-sensitive workload is running can adversely impact
performance due to increased latency. Additionally, if the CPU wakes up
from a deeper sleep before its target residency as is often the case, it
results in a waste of energy on top of that.

This patch extends the TEO governor with a mechanism adding util-awareness,
effectively providing a way for the governor to reduce the selected idle
state by 1 when the CPU is being utilized over a certain threshold while
still trying to select the deepest possible state using TEO's metrics when
the CPU is not being utilized. This is now possible since the CPU
utilization is exported from the scheduler with the sched_cpu_util function
and already used e.g. in the thermal governor IPA.

Under this implementation, when the CPU is being utilised and the
selected candidate state is C1, it will be reduced to C0 as long as C0
is not a polling state. This effectively should make the patch have no
effect on most Intel systems.

This can provide drastically decreased latency and performance benefits in
certain types of mobile workloads that are sensitive to latency,
such as Geekbench 5.

Signed-off-by: Kajetan Puchalski <[email protected]>
---
drivers/cpuidle/governors/teo.c | 84 ++++++++++++++++++++++++++++++++-
1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index e2864474a98d..0f38b3ecab3c 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -2,8 +2,13 @@
/*
* Timer events oriented CPU idle governor
*
+ * TEO governor:
* Copyright (C) 2018 - 2021 Intel Corporation
* Author: Rafael J. Wysocki <[email protected]>
+ *
+ * Util-awareness mechanism:
+ * Copyright (C) 2022 Arm Ltd.
+ * Author: Kajetan Puchalski <[email protected]>
*/

/**
@@ -99,14 +104,49 @@
* select the given idle state instead of the candidate one.
*
* 3. By default, select the candidate state.
+ *
+ * Util-awareness mechanism:
+ *
+ * The idea behind the util-awareness extension is that there are two distinct
+ * scenarios for the CPU which should result in two different approaches to idle
+ * state selection - utilized and not utilized.
+ *
+ * In this case, 'utilized' means that the average runqueue util of the CPU is
+ * above a certain threshold.
+ *
+ * When the CPU is utilized while going into idle, more likely than not it will
+ * be woken up to do more work soon and so a shallower idle state should be
+ * selected to minimise latency and maximise performance. When the CPU is not
+ * being utilized, the usual metrics-based approach to selecting the deepest
+ * available idle state should be preferred to take advantage of the power
+ * saving.
+ *
+ * In order to achieve this, the governor uses a utilization threshold.
+ * The threshold is computed per-cpu as a percentage of the CPU's capacity
+ * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
+ * seems to be getting the best results.
+ *
+ * Before selecting the next idle state, the governor compares the current CPU
+ * util to the precomputed util threshold. If it's below, it defaults to the
+ * TEO metrics mechanism. If it's above and the currently selected candidate is
+ * C1, the idle state will be reduced to C0 as long as C0 is not a polling state.
*/

#include <linux/cpuidle.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/sched/topology.h>
#include <linux/tick.h>

+/*
+ * The number of bits to shift the cpu's capacity by in order to determine
+ * the utilized threshold
+ */
+#define UTIL_THRESHOLD_SHIFT 6
+
+
/*
* The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
* is used for decreasing metrics on a regular basis.
@@ -137,9 +177,11 @@ struct teo_bin {
* @time_span_ns: Time between idle state selection and post-wakeup update.
* @sleep_length_ns: Time till the closest timer event (at the selection time).
* @state_bins: Idle state data bins for this CPU.
- * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
+ * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
* @next_recent_idx: Index of the next @recent_idx entry to update.
* @recent_idx: Indices of bins corresponding to recent "intercepts".
+ * @util_threshold: Threshold above which the CPU is considered utilized
+ * @utilized: Whether the last sleep on the CPU happened while utilized
*/
struct teo_cpu {
s64 time_span_ns;
@@ -148,10 +190,24 @@ struct teo_cpu {
unsigned int total;
int next_recent_idx;
int recent_idx[NR_RECENT];
+ unsigned long util_threshold;
+ bool utilized;
};

static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);

+/**
+ * teo_get_util - Update the CPU utilized status
+ * @dev: Target CPU
+ * @cpu_data: Governor CPU data for the target CPU
+ */
+static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
+{
+ unsigned long util = sched_cpu_util(dev->cpu);
+
+ cpu_data->utilized = util > cpu_data->util_threshold;
+}
+
/**
* teo_update - Update CPU metrics after wakeup.
* @drv: cpuidle driver containing state data.
@@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
int i;

if (dev->last_state_idx >= 0) {
- teo_update(drv, dev);
+ /* don't update metrics if the cpu was utilized during the last sleep */
+ if (!cpu_data->utilized)
+ teo_update(drv, dev);
dev->last_state_idx = -1;
}

@@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
goto end;
}

+ teo_get_util(dev, cpu_data);
+ /* the cpu is being utilized and there's only 2 states to choose from */
+ /* no need to consider metrics, choose the shallowest non-polling state and exit */
+ if (drv->state_count < 3 && cpu_data->utilized) {
+ for (i = 0; i < drv->state_count; ++i) {
+ if (dev->states_usage[i].disable ||
+ drv->states[i].flags & CPUIDLE_FLAG_POLLING)
+ continue;
+ break;
+ }
+
+ idx = i;
+ goto end;
+ }
+
/*
* Find the deepest idle state whose target residency does not exceed
* the current sleep length and the deepest idle state not deeper than
@@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (idx > constraint_idx)
idx = constraint_idx;

+ /* if the CPU is being utilized and C1 is the selected candidate */
+ /* choose a shallower non-polling state to improve latency */
+ if (cpu_data->utilized && idx == 1)
+ idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
+
end:
/*
* Don't stop the tick if the selected state is a polling one or if the
@@ -510,9 +588,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+ unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
int i;

memset(cpu_data, 0, sizeof(*cpu_data));
+ cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;

for (i = 0; i < NR_RECENT; i++)
cpu_data->recent_idx[i] = -1;
--
2.37.1



2022-11-01 07:09:46

by Doug Smythies

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

Hi Kajetan,

On Mon, Oct 31, 2022 at 5:14 AM Kajetan Puchalski
<[email protected]> wrote:

... [delete some]...

> /**
> * teo_update - Update CPU metrics after wakeup.
> * @drv: cpuidle driver containing state data.
> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> int i;
>
> if (dev->last_state_idx >= 0) {
> - teo_update(drv, dev);
> + /* don't update metrics if the cpu was utilized during the last sleep */
> + if (!cpu_data->utilized)
> + teo_update(drv, dev);
> dev->last_state_idx = -1;
> }

Ignoring the metrics is not the correct thing to do.
Depending on the workflow, it can severely bias the idle states deeper
than they should be because most of the needed information to select
the appropriate shallow state is tossed out.

Example 1:
2 pairs of ping pongs = 4 threads
Parameters chosen such that idle state 2 would be a most used state.
CPU frequency governor: Schedutil.
CPU frequency scaling driver: intel_cpufreq.
HWP: Disabled
Processor: i5-10600K (6 cores 12 cpus).
Kernel: 6.1-rc3
Run length: 1e8 cycles
Idle governor:
teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec

Where util-v4 is the same as this patch (util-v3) with the above code reverted.

Note: less time per loop is better.

Example 2: Same but parameters selected such that idle state 0 would
be a most used idle state.
Run Length: 4e8 cycles
Idle governor:
teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
menu: 3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

Note: the util-v4 test is still in progress, but it is late in my time
zone. But I can tell from the idle state usage, which I can observe
once per minute, that the issue is, at least mostly, fixed.

... Doug

2022-11-01 23:40:23

by Doug Smythies

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

On Mon, Oct 31, 2022 at 11:24 PM Doug Smythies <[email protected]> wrote:
>
> Hi Kajetan,
>
> On Mon, Oct 31, 2022 at 5:14 AM Kajetan Puchalski
> <[email protected]> wrote:
>
> ... [delete some]...
>
> > /**
> > * teo_update - Update CPU metrics after wakeup.
> > * @drv: cpuidle driver containing state data.
> > @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > int i;
> >
> > if (dev->last_state_idx >= 0) {
> > - teo_update(drv, dev);
> > + /* don't update metrics if the cpu was utilized during the last sleep */
> > + if (!cpu_data->utilized)
> > + teo_update(drv, dev);
> > dev->last_state_idx = -1;
> > }
>
> Ignoring the metrics is not the correct thing to do.
> Depending on the workflow, it can severely bias the idle states deeper
> than they should be because most of the needed information to select
> the appropriate shallow state is tossed out.
>
> Example 1:
> 2 pairs of ping pongs = 4 threads
> Parameters chosen such that idle state 2 would be a most used state.

Sorry, typo, I meant idle state 1 would be most used.

> CPU frequency governor: Schedutil.
> CPU frequency scaling driver: intel_cpufreq.
> HWP: Disabled
> Processor: i5-10600K (6 cores 12 cpus).
> Kernel: 6.1-rc3
> Run length: 1e8 cycles
> Idle governor:
> teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
> menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
> util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
> util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
>
> Where util-v4 is the same as this patch (util-v3) with the above code reverted.
>
> Note: less time per loop is better.
>
> Example 2: Same but parameters selected such that idle state 0 would
> be a most used idle state.
> Run Length: 4e8 cycles
> Idle governor:
> teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
> menu: 3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
> util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
> util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

util-v4: 3.15 uSecs/loop ; idle state 0 ~1.2e6 exits/sec

For the above tests we do not expect teo-util to have much impact.

For completeness:

Test 3: Same but parameters selected such that idle states 2 and 3
would be used the most.
Run Length: 3.42e5 cycles
CPU frequency scaling governor: schedutil.
CPU frequency scaling driver: intel_cpufreq.
Idle governor:
teo: 4005 uSecs/loop ; IS 2: 1917 IS 3: 107.4 exits/sec
menu: 3113 uSecs/loop ; IS 2: 1020 IS 3: 1576 exits/sec
util-v3: 3457 uSecs/loop ; IS 2: 1139 IS 3: 1000 exits/sec
util-v4: 3526 uSecs/loop ; IS 2: 2029 IS 3: 109 exits/sec

Now, things are very noisy with the schedutil governor, so...

Test 4: Same as test 3 except for frequency scaling governor.
Run Length: 3.42e5 cycles
CPU frequency scaling governor: performance.
CPU frequency scaling driver: intel_pstate.
Idle governor:
teo: 2688 uSecs/loop ; IS 2: 2489 IS 3: 16 exits/sec
menu: 2596 uSecs/loop ; IS 2: 865 IS 3: 2005 exits/sec
util-v3: 2766 uSecs/loop ; IS 2: 1049 IS 3: 1394 exits/sec
util-v4: 2756 uSecs/loop ; IS 2: 2440 IS 3: 24 exits/sec

... Doug

2022-11-02 15:01:03

by Kajetan Puchalski

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

On Mon, Oct 31, 2022 at 11:24:06PM -0700, Doug Smythies wrote:

> Ignoring the metrics is not the correct thing to do.
> Depending on the workflow, it can severely bias the idle states deeper
> than they should be because most of the needed information to select
> the appropriate shallow state is tossed out.

Interesting, thanks for flagging this actually. Arguably it made sense
originally but with v3 changes it probably causes more harm than good.

From testing it on my own side it still looks all right without the
skipping metrics part so I'm fine with just dropping it from the patch.
V4 it is.

I'll include my updated numbers and more tests in the new cover letter.

> Example 1:
> 2 pairs of ping pongs = 4 threads
> Parameters chosen such that idle state 2 would be a most used state.
> CPU frequency governor: Schedutil.
> CPU frequency scaling driver: intel_cpufreq.
> HWP: Disabled
> Processor: i5-10600K (6 cores 12 cpus).
> Kernel: 6.1-rc3
> Run length: 1e8 cycles
> Idle governor:
> teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
> menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
> util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
> util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
>
> Where util-v4 is the same as this patch (util-v3) with the above code reverted.
>
> Note: less time per loop is better.
>
> Example 2: Same but parameters selected such that idle state 0 would
> be a most used idle state.
> Run Length: 4e8 cycles
> Idle governor:
> teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
> menu: 3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
> util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
> util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

Thanks a lot for the testing!

---
Kajetan

2022-11-25 19:11:13

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

On Mon, Oct 31, 2022 at 1:14 PM Kajetan Puchalski
<[email protected]> wrote:
>
> Modern interactive systems, such as recent Android phones, tend to have
> power efficient shallow idle states. Selecting deeper idle states on a
> device while a latency-sensitive workload is running can adversely impact
> performance due to increased latency. Additionally, if the CPU wakes up
> from a deeper sleep before its target residency as is often the case, it
> results in a waste of energy on top of that.
>
> This patch extends the TEO governor with a mechanism adding util-awareness,
> effectively providing a way for the governor to reduce the selected idle
> state by 1 when the CPU is being utilized over a certain threshold while
> still trying to select the deepest possible state using TEO's metrics when
> the CPU is not being utilized. This is now possible since the CPU
> utilization is exported from the scheduler with the sched_cpu_util function
> and already used e.g. in the thermal governor IPA.
>
> Under this implementation, when the CPU is being utilised and the
> selected candidate state is C1, it will be reduced to C0 as long as C0
> is not a polling state. This effectively should make the patch have no
> effect on most Intel systems.
>
> This can provide drastically decreased latency and performance benefits in
> certain types of mobile workloads that are sensitive to latency,
> such as Geekbench 5.
>
> Signed-off-by: Kajetan Puchalski <[email protected]>
> ---
> drivers/cpuidle/governors/teo.c | 84 ++++++++++++++++++++++++++++++++-
> 1 file changed, 82 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index e2864474a98d..0f38b3ecab3c 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -2,8 +2,13 @@
> /*
> * Timer events oriented CPU idle governor
> *
> + * TEO governor:
> * Copyright (C) 2018 - 2021 Intel Corporation
> * Author: Rafael J. Wysocki <[email protected]>
> + *
> + * Util-awareness mechanism:
> + * Copyright (C) 2022 Arm Ltd.
> + * Author: Kajetan Puchalski <[email protected]>
> */
>
> /**
> @@ -99,14 +104,49 @@
> * select the given idle state instead of the candidate one.
> *
> * 3. By default, select the candidate state.
> + *
> + * Util-awareness mechanism:
> + *
> + * The idea behind the util-awareness extension is that there are two distinct
> + * scenarios for the CPU which should result in two different approaches to idle
> + * state selection - utilized and not utilized.
> + *
> + * In this case, 'utilized' means that the average runqueue util of the CPU is
> + * above a certain threshold.
> + *
> + * When the CPU is utilized while going into idle, more likely than not it will
> + * be woken up to do more work soon and so a shallower idle state should be
> + * selected to minimise latency and maximise performance. When the CPU is not
> + * being utilized, the usual metrics-based approach to selecting the deepest
> + * available idle state should be preferred to take advantage of the power
> + * saving.
> + *
> + * In order to achieve this, the governor uses a utilization threshold.
> + * The threshold is computed per-cpu as a percentage of the CPU's capacity
> + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> + * seems to be getting the best results.
> + *
> + * Before selecting the next idle state, the governor compares the current CPU
> + * util to the precomputed util threshold. If it's below, it defaults to the
> + * TEO metrics mechanism. If it's above and the currently selected candidate is
> + * C1, the idle state will be reduced to C0 as long as C0 is not a polling state.
> */
>
> #include <linux/cpuidle.h>
> #include <linux/jiffies.h>
> #include <linux/kernel.h>
> +#include <linux/sched.h>
> #include <linux/sched/clock.h>
> +#include <linux/sched/topology.h>
> #include <linux/tick.h>
>
> +/*
> + * The number of bits to shift the cpu's capacity by in order to determine
> + * the utilized threshold
> + */
> +#define UTIL_THRESHOLD_SHIFT 6

Why is this particular number regarded as the best one?

> +
> +
> /*
> * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> * is used for decreasing metrics on a regular basis.
> @@ -137,9 +177,11 @@ struct teo_bin {
> * @time_span_ns: Time between idle state selection and post-wakeup update.
> * @sleep_length_ns: Time till the closest timer event (at the selection time).
> * @state_bins: Idle state data bins for this CPU.
> - * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
> + * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
> * @next_recent_idx: Index of the next @recent_idx entry to update.
> * @recent_idx: Indices of bins corresponding to recent "intercepts".
> + * @util_threshold: Threshold above which the CPU is considered utilized
> + * @utilized: Whether the last sleep on the CPU happened while utilized
> */
> struct teo_cpu {
> s64 time_span_ns;
> @@ -148,10 +190,24 @@ struct teo_cpu {
> unsigned int total;
> int next_recent_idx;
> int recent_idx[NR_RECENT];
> + unsigned long util_threshold;
> + bool utilized;
> };
>
> static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> +/**
> + * teo_get_util - Update the CPU utilized status
> + * @dev: Target CPU
> + * @cpu_data: Governor CPU data for the target CPU
> + */
> +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> +{
> + unsigned long util = sched_cpu_util(dev->cpu);
> +
> + cpu_data->utilized = util > cpu_data->util_threshold;

Why exactly do you need the local variable here?

Then, if there's only one caller, maybe this could be folded into it?

> +}
> +
> /**
> * teo_update - Update CPU metrics after wakeup.
> * @drv: cpuidle driver containing state data.
> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> int i;
>
> if (dev->last_state_idx >= 0) {
> - teo_update(drv, dev);
> + /* don't update metrics if the cpu was utilized during the last sleep */

Why?

The metrics are related to idle duration and cpu_data->utilized
indicates whether or not latency should be reduced. These are
different things.

Moreover, this is just one data point and there may not be any direct
connection between it and the decision made in this particular cycle.

> + if (!cpu_data->utilized)
> + teo_update(drv, dev);
> dev->last_state_idx = -1;
> }
>
> @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> goto end;
> }
>
> + teo_get_util(dev, cpu_data);
> + /* the cpu is being utilized and there's only 2 states to choose from */
> + /* no need to consider metrics, choose the shallowest non-polling state and exit */

A proper kernel-coding-style 2-line comment, please!

Also I would say "utilized beyond the threshold" instead of just
"utilized" and "there are only 2 states" (plural).

> + if (drv->state_count < 3 && cpu_data->utilized) {
> + for (i = 0; i < drv->state_count; ++i) {
> + if (dev->states_usage[i].disable ||
> + drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> + continue;
> + break;

This looks odd. It would be more straightforward to do

if (!dev->states_usage[i].disable && !(drv->states[i].flags &
CPUIDLE_FLAG_POLLING)) {
idx = i;
goto end;
}

without the "break" and "continue".

> + }
> +
> + idx = i;
> + goto end;
> + }
> +
> /*
> * Find the deepest idle state whose target residency does not exceed
> * the current sleep length and the deepest idle state not deeper than
> @@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> if (idx > constraint_idx)
> idx = constraint_idx;
>
> + /* if the CPU is being utilized and C1 is the selected candidate */
> + /* choose a shallower non-polling state to improve latency */

Again, the kernel coding style for multi-line comments is different
from the above.

> + if (cpu_data->utilized && idx == 1)

I've changed my mind with respect to adding the idx == 1 check to
this. If the goal is to reduce latency for the "loaded" CPUs, this
applies to deeper idle states too.

> + idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> +
> end:
> /*
> * Don't stop the tick if the selected state is a polling one or if the
> @@ -510,9 +588,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> struct cpuidle_device *dev)
> {
> struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> int i;
>
> memset(cpu_data, 0, sizeof(*cpu_data));
> + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
> for (i = 0; i < NR_RECENT; i++)
> cpu_data->recent_idx[i] = -1;
> --

2022-11-25 20:17:47

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

On 2022.11.25 10:27 Rafael wrote:
> On Mon, Oct 31, 2022 at 1:14 PM Kajetan wrote:

... [delete some] ...

>> +}
>> +
>> /**
>> * teo_update - Update CPU metrics after wakeup.
>> * @drv: cpuidle driver containing state data.
>> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> int i;
>>
>> if (dev->last_state_idx >= 0) {
>> - teo_update(drv, dev);
>> + /* don't update metrics if the cpu was utilized during the last sleep */
>
> Why?
>
> The metrics are related to idle duration and cpu_data->utilized
> indicates whether or not latency should be reduced. These are
> different things.
>
> Moreover, this is just one data point and there may not be any direct
> connection between it and the decision made in this particular cycle.

Hi Rafael,

Yes, agreed. And version 4 of this patch set deleted this stuff.
See also my similar feedback, with accompanying test data
that verified the issue in [1].

[1] https://lore.kernel.org/linux-pm/CAAYoRsUgm6KyJCDowGKFVuMwJepnVN8NFEenjd3O-FN7+BETSw@mail.gmail.com/

... Doug


2022-11-27 05:29:16

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

On 2022.11.25 10:27 Rafael wrote:
> On Mon, Oct 31, 2022 at 1:14 PM Kajetan wrote:

... [delete some] ...

>> /*
>> * Find the deepest idle state whose target residency does not exceed
>> * the current sleep length and the deepest idle state not deeper than
>> @@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> if (idx > constraint_idx)
>> idx = constraint_idx;
>>
>> + /* if the CPU is being utilized and C1 is the selected candidate */
>> + /* choose a shallower non-polling state to improve latency */
>
> Again, the kernel coding style for multi-line comments is different
> from the above.
>
>> + if (cpu_data->utilized && idx == 1)
>
> I've changed my mind with respect to adding the idx == 1 check to
> this. If the goal is to reduce latency for the "loaded" CPUs, this
> applies to deeper idle states too.

After taking idle state 0 (POLL) out of it, the energy cost for reducing
the selected idle state by 1 was still high in some cases, at least on my
Intel processor. That was mainly for idle state 2 being bumped to idle
state 1. I don't recall significant differences bumping idle state 3 to idle
state 2, but I don't know about other Intel processors.

So, there is a trade-off here where we might want to accept this higher
energy consumption for no gain in some workflows verses the higher
energy for gain in other workflows, or not.

Example 1: Higher energy, for no benefit:
Workflow: a medium load at 211 work/sleep frequency.
This data is for one thread, but I looked at up to 6 threads.
No performance metric, the work just has to finish before
the next cycle begins.
CPU frequency scaling driver: intel_pstate
CPU frequency scaling governor: powersave
No HWP.
Kernel 6.1-rc3

teo: ~14.8 watts
util-v4 without the "idx == 1" above: 16.1 watts (+8.8%)
More info:
http://smythies.com/~doug/linux/idle/teo-util/consume/dwell-v4/

Example 2: Lower energy, but no loss in performance:
Workflow: 500 threads, light load per thread,
approximately 10 hertz work/sleep frequency per thread.
CPU frequency scaling driver: intel_cpufreq
CPU frequency scaling governor: schedutil
No HWP.
Kernel 6.1-rc3

teo: ~70 watts
util-v4 without the "idx == 1" above: ~59 watts (-16%)
Execution times were the same
More info:
http://smythies.com/~doug/linux/idle/teo-util/waiter/

Note: legend util-v4-1 is util-v4 without the "idx == 1".
I have also added util-v4-1 to some of the previous results.

For reference, my testing processor:

Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/desc
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH MWAIT 0x30
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH MWAIT 0x60

... Doug

2022-11-28 15:24:42

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

On Mon, Nov 28, 2022 at 3:33 PM Kajetan Puchalski
<[email protected]> wrote:
>
> On Fri, Nov 25, 2022 at 07:27:13PM +0100, Rafael J. Wysocki wrote:
> > > +/*
> > > + * The number of bits to shift the cpu's capacity by in order to determine
> > > + * the utilized threshold
> > > + */
> > > +#define UTIL_THRESHOLD_SHIFT 6
> >
> > Why is this particular number regarded as the best one?
>
> Based on my testing this number achieved the best balance of power and
> performance on average. It also makes sense from looking at the util
> plots. The resulting threshold is high enough to not be triggered by
> background noise and low enough to react quickly when activity starts
> ramping up.

It would be good to put some of this information (or maybe even all of
it) into the comment above the symbol definition.

> > > +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> > > +{
> > > + unsigned long util = sched_cpu_util(dev->cpu);
> > > +
> > > + cpu_data->utilized = util > cpu_data->util_threshold;
> >
> > Why exactly do you need the local variable here?
>
> It's not necessarily needed, I can replace it with comparing the result
> of the call directly.
>
> > Then, if there's only one caller, maybe this could be folded into it?
>
> I do think it's nicer to have it separated in its own helper function, that
> way if anything more has to be done with the util it'll all be
> self-contained. Having only one caller shouldn't be a big issue, it's
> also the case for teo_middle_of_bin and teo_find_shallower_state in the
> current TEO implementation.

OK

> > > + /* don't update metrics if the cpu was utilized during the last sleep */
> >
> > Why?
> >
> > The metrics are related to idle duration and cpu_data->utilized
> > indicates whether or not latency should be reduced. These are
> > different things.
> >
> > Moreover, this is just one data point and there may not be any direct
> > connection between it and the decision made in this particular cycle.
>
> Agreed, v4 already has this part removed.
>
> > > + if (!cpu_data->utilized)
> > > + teo_update(drv, dev);
> > > dev->last_state_idx = -1;
> > > }
> > >
> > > @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > > goto end;
> > > }
> > >
> > > + teo_get_util(dev, cpu_data);
> > > + /* the cpu is being utilized and there's only 2 states to choose from */
> > > + /* no need to consider metrics, choose the shallowest non-polling state and exit */
> >
> > A proper kernel-coding-style 2-line comment, please!
> >
> > Also I would say "utilized beyond the threshold" instead of just
> > "utilized" and "there are only 2 states" (plural).
>
> Both good points, I'll fix that.
>
> > > + if (drv->state_count < 3 && cpu_data->utilized) {
> > > + for (i = 0; i < drv->state_count; ++i) {
> > > + if (dev->states_usage[i].disable ||
> > > + drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> > > + continue;
> > > + break;
> >
> > This looks odd. It would be more straightforward to do
> >
> > if (!dev->states_usage[i].disable && !(drv->states[i].flags &
> > CPUIDLE_FLAG_POLLING)) {
> > idx = i;
> > goto end;
> > }
> >
> > without the "break" and "continue".
>
> Fair enough, this works as well.
>
> > I've changed my mind with respect to adding the idx == 1 check to
> > this. If the goal is to reduce latency for the "loaded" CPUs, this
> > applies to deeper idle states too.
>
> I see, this has no effect on arm devices one way or the other so I don't
> mind, it's completely up to you. In light of Doug's test results
> regarding this change, should I remove the check in v5?

Yes, please.

2022-11-28 15:24:54

by Kajetan Puchalski

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness

On Fri, Nov 25, 2022 at 07:27:13PM +0100, Rafael J. Wysocki wrote:
> > +/*
> > + * The number of bits to shift the cpu's capacity by in order to determine
> > + * the utilized threshold
> > + */
> > +#define UTIL_THRESHOLD_SHIFT 6
>
> Why is this particular number regarded as the best one?

Based on my testing this number achieved the best balance of power and
performance on average. It also makes sense from looking at the util
plots. The resulting threshold is high enough to not be triggered by
background noise and low enough to react quickly when activity starts
ramping up.

> > +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> > +{
> > + unsigned long util = sched_cpu_util(dev->cpu);
> > +
> > + cpu_data->utilized = util > cpu_data->util_threshold;
>
> Why exactly do you need the local variable here?

It's not necessarily needed, I can replace it with comparing the result
of the call directly.

> Then, if there's only one caller, maybe this could be folded into it?

I do think it's nicer to have it separated in its own helper function, that
way if anything more has to be done with the util it'll all be
self-contained. Having only one caller shouldn't be a big issue, it's
also the case for teo_middle_of_bin and teo_find_shallower_state in the
current TEO implementation.

> > + /* don't update metrics if the cpu was utilized during the last sleep */
>
> Why?
>
> The metrics are related to idle duration and cpu_data->utilized
> indicates whether or not latency should be reduced. These are
> different things.
>
> Moreover, this is just one data point and there may not be any direct
> connection between it and the decision made in this particular cycle.

Agreed, v4 already has this part removed.

> > + if (!cpu_data->utilized)
> > + teo_update(drv, dev);
> > dev->last_state_idx = -1;
> > }
> >
> > @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > goto end;
> > }
> >
> > + teo_get_util(dev, cpu_data);
> > + /* the cpu is being utilized and there's only 2 states to choose from */
> > + /* no need to consider metrics, choose the shallowest non-polling state and exit */
>
> A proper kernel-coding-style 2-line comment, please!
>
> Also I would say "utilized beyond the threshold" instead of just
> "utilized" and "there are only 2 states" (plural).

Both good points, I'll fix that.

> > + if (drv->state_count < 3 && cpu_data->utilized) {
> > + for (i = 0; i < drv->state_count; ++i) {
> > + if (dev->states_usage[i].disable ||
> > + drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> > + continue;
> > + break;
>
> This looks odd. It would be more straightforward to do
>
> if (!dev->states_usage[i].disable && !(drv->states[i].flags &
> CPUIDLE_FLAG_POLLING)) {
> idx = i;
> goto end;
> }
>
> without the "break" and "continue".

Fair enough, this works as well.

> I've changed my mind with respect to adding the idx == 1 check to
> this. If the goal is to reduce latency for the "loaded" CPUs, this
> applies to deeper idle states too.

I see, this has no effect on arm devices one way or the other so I don't
mind, it's completely up to you. In light of Doug's test results
regarding this change, should I remove the check in v5?

Thanks,
Kajetan