2020-12-07 16:42:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor

Hi,

This is based on the RFC posted a few days ago:

https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/

The majority of the original cover letter still applies, so let me quote it
here:

Using intel_pstate in the passive mode with HWP enabled, in particular under
the schedutil governor, is still kind of problematic, because it has to assume
that it should not allow the frequency to fall below the one requested by the
governor. For this reason, it translates the target frequency into HWP.REQ.MIN
which generally causes the processor to run a bit too fast.

Moreover, this allows the HWP algorithm to use any frequency between the target
one and HWP.REQ.MAX that corresponds to the policy max limit and some workloads
cause it to go for the max turbo frequency prematurely which hurts energy-
efficiency without improving performance, even though the schedutil governor
itself would not allow the frequency to ramp up so fast.

This patch series attempts to improve the situation by introducing a new driver
callback allowing the driver to receive more information from the governor. In
particular, this allows the min (required) and target (desired) performance
levels to be passed to it and those can be used to give better hints to the
hardware.

There are two additional schedutil patches this time (IMO they kind of make
sense without the other two, even though the first one increases the amount of
memory used by schedutil) and patches [2-3/4] correspond to the two patches
in the RFC, respectively.

Please refer to the patch changelogs for details.

Thanks!




2020-12-07 16:43:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH v1 3/4] cpufreq: Add special-purpose fast-switching callback for drivers

From: Rafael J. Wysocki <[email protected]>

First off, some cpufreq drivers (eg. intel_pstate) can pass hints
beyond the current target frequency to the hardware and there are no
provisions for doing that in the cpufreq framework. In particular,
today the driver has to assume that it should not allow the frequency
to fall below the one requested by the governor (or the required
capacity may not be provided) which may not be the case and which may
lead to excessive energy usage in some scenarios.

Second, the hints passed by these drivers to the hardware need not be
in terms of the frequency, so representing the utilization numbers
coming from the scheduler as frequency before passing them to those
drivers is not really useful.

Address the two points above by adding a special-purpose replacement
for the ->fast_switch callback, called ->adjust_perf, allowing the
governor to pass abstract performance level (rather than frequency)
values for the minimum (required) and target (desired) performance
along with the CPU capacity to compare them to.

Also update the schedutil governor to use the new callback instead
of ->fast_switch if present.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

Changes with respect to the RFC:
- Don't pass "busy" to ->adjust_perf().
- Use a special 'update_util' hook for the ->adjust_perf() case in
schedutil (this still requires an additional branch because of the
shared common code between this case and the "frequency" one, but
IMV this version is cleaner nevertheless).

---
drivers/cpufreq/cpufreq.c | 40 ++++++++++++++++++++++++++++++++
include/linux/cpufreq.h | 14 +++++++++++
include/linux/sched/cpufreq.h | 5 ++++
kernel/sched/cpufreq_schedutil.c | 48 +++++++++++++++++++++++++++++++--------
4 files changed, 98 insertions(+), 9 deletions(-)

Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -320,6 +320,15 @@ struct cpufreq_driver {
unsigned int index);
unsigned int (*fast_switch)(struct cpufreq_policy *policy,
unsigned int target_freq);
+ /*
+ * ->fast_switch() replacement for drivers that use an internal
+ * representation of performance levels and can pass hints other than
+ * the target performance level to the hardware.
+ */
+ void (*adjust_perf)(unsigned int cpu,
+ unsigned long min_perf,
+ unsigned long target_perf,
+ unsigned long capacity);

/*
* Caches and returns the lowest driver-supported frequency greater than
@@ -588,6 +597,11 @@ struct cpufreq_governor {
/* Pass a target to the cpufreq driver */
unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
unsigned int target_freq);
+void cpufreq_driver_adjust_perf(unsigned int cpu,
+ unsigned long min_perf,
+ unsigned long target_perf,
+ unsigned long capacity);
+bool cpufreq_driver_has_adjust_perf(void);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -2097,6 +2097,46 @@ unsigned int cpufreq_driver_fast_switch(
}
EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);

+/**
+ * cpufreq_driver_adjust_perf - Adjust CPU performance level in one go.
+ * @cpu: Target CPU.
+ * @min_perf: Minimum (required) performance level (units of @capacity).
+ * @target_perf: Terget (desired) performance level (units of @capacity).
+ * @capacity: Capacity of the target CPU.
+ *
+ * Carry out a fast performance level switch of @cpu without sleeping.
+ *
+ * The driver's ->adjust_perf() callback invoked by this function must be
+ * suitable for being called from within RCU-sched read-side critical sections
+ * and it is expected to select a suitable performance level equal to or above
+ * @min_perf and preferably equal to or below @target_perf.
+ *
+ * This function must not be called if policy->fast_switch_enabled is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same CPU and that it will never be called in
+ * parallel with either ->target() or ->target_index() or ->fast_switch() for
+ * the same CPU.
+ */
+void cpufreq_driver_adjust_perf(unsigned int cpu,
+ unsigned long min_perf,
+ unsigned long target_perf,
+ unsigned long capacity)
+{
+ cpufreq_driver->adjust_perf(cpu, min_perf, target_perf, capacity);
+}
+
+/**
+ * cpufreq_driver_has_adjust_perf - Check "direct fast switch" callback.
+ *
+ * Return 'true' if the ->adjust_perf callback is present for the
+ * current driver or 'false' otherwise.
+ */
+bool cpufreq_driver_has_adjust_perf(void)
+{
+ return !!cpufreq_driver->adjust_perf;
+}
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -432,13 +432,11 @@ static inline void ignore_dl_rate_limit(
sg_policy->limits_changed = true;
}

-static void sugov_update_single(struct update_util_data *hook, u64 time,
- unsigned int flags)
+static bool sugov_update_single_common(struct sugov_cpu *sg_cpu, u64 time,
+ unsigned int flags)
{
- struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
unsigned long prev_util = sg_cpu->util;
- unsigned int next_f;

sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -446,7 +444,7 @@ static void sugov_update_single(struct u
ignore_dl_rate_limit(sg_cpu, sg_policy);

if (!sugov_should_update_freq(sg_policy, time))
- return;
+ return false;

sugov_get_util(sg_cpu);
sugov_iowait_apply(sg_cpu, time);
@@ -458,6 +456,19 @@ static void sugov_update_single(struct u
if (sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

+ return true;
+}
+
+static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
+ unsigned int flags)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int next_f;
+
+ if (!sugov_update_single_common(sg_cpu, time, flags))
+ return;
+
next_f = get_next_freq(sg_policy, sg_cpu->util, sg_cpu->max);

/*
@@ -474,6 +485,20 @@ static void sugov_update_single(struct u
}
}

+static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
+ unsigned int flags)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+
+ if (!sugov_update_single_common(sg_cpu, time, flags))
+ return;
+
+ cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
+ map_util_perf(sg_cpu->util), sg_cpu->max);
+
+ sg_cpu->sg_policy->last_freq_update_time = time;
+}
+
static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
{
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
@@ -812,6 +837,7 @@ static void sugov_exit(struct cpufreq_po
static int sugov_start(struct cpufreq_policy *policy)
{
struct sugov_policy *sg_policy = policy->governor_data;
+ void (*uu)(struct update_util_data *data, u64 time, unsigned int flags);
unsigned int cpu;

sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
@@ -831,13 +857,17 @@ static int sugov_start(struct cpufreq_po
sg_cpu->sg_policy = sg_policy;
}

+ if (policy_is_shared(policy))
+ uu = sugov_update_shared;
+ else if (policy->fast_switch_enabled && cpufreq_driver_has_adjust_perf())
+ uu = sugov_update_single_perf;
+ else
+ uu = sugov_update_single_freq;
+
for_each_cpu(cpu, policy->cpus) {
struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

- cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
- policy_is_shared(policy) ?
- sugov_update_shared :
- sugov_update_single);
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, uu);
}
return 0;
}
Index: linux-pm/include/linux/sched/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/sched/cpufreq.h
+++ linux-pm/include/linux/sched/cpufreq.h
@@ -28,6 +28,11 @@ static inline unsigned long map_util_fre
{
return (freq + (freq >> 2)) * util / cap;
}
+
+static inline unsigned long map_util_perf(unsigned long util)
+{
+ return util + (util >> 2);
+}
#endif /* CONFIG_CPU_FREQ */

#endif /* _LINUX_SCHED_CPUFREQ_H */



2020-12-07 16:43:57

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH v1 4/4] cpufreq: intel_pstate: Implement the ->adjust_perf() callback

From: Rafael J. Wysocki <[email protected]>

Make intel_pstate expose the ->adjust_perf() callback when it
operates in the passive mode with HWP enabled which causes the
schedutil governor to use that callback instead of ->fast_switch().

The minimum and target performance-level values passed by the
governor to ->adjust_perf() are converted to HWP.REQ.MIN and
HWP.REQ.DESIRED, respectively, which allows the processor to
adjust its configuration to maximize energy-efficiency while
providing sufficient capacity.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

Changes with respect to the RFC:
- Drop the code related to the dropped "busy" argument of
->adjust_perf().
- Update the changelog accordingly.

---
drivers/cpufreq/intel_pstate.c | 70 +++++++++++++++++++++++++++++++++--------
1 file changed, 58 insertions(+), 12 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -2526,20 +2526,19 @@ static void intel_cpufreq_trace(struct c
fp_toint(cpu->iowait_boost * 100));
}

-static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
- bool strict, bool fast_switch)
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 min, u32 max,
+ u32 desired, bool fast_switch)
{
u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;

value &= ~HWP_MIN_PERF(~0L);
- value |= HWP_MIN_PERF(target_pstate);
+ value |= HWP_MIN_PERF(min);

- /*
- * The entire MSR needs to be updated in order to update the HWP min
- * field in it, so opportunistically update the max too if needed.
- */
value &= ~HWP_MAX_PERF(~0L);
- value |= HWP_MAX_PERF(strict ? target_pstate : cpu->max_perf_ratio);
+ value |= HWP_MAX_PERF(max);
+
+ value &= ~HWP_DESIRED_PERF(~0L);
+ value |= HWP_DESIRED_PERF(desired);

if (value == prev)
return;
@@ -2569,11 +2568,15 @@ static int intel_cpufreq_update_pstate(s
int old_pstate = cpu->pstate.current_pstate;

target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
- if (hwp_active)
- intel_cpufreq_adjust_hwp(cpu, target_pstate,
- policy->strict_target, fast_switch);
- else if (target_pstate != old_pstate)
+ if (hwp_active) {
+ int max_pstate = policy->strict_target ?
+ target_pstate : cpu->max_perf_ratio;
+
+ intel_cpufreq_adjust_hwp(cpu, target_pstate, max_pstate, 0,
+ fast_switch);
+ } else if (target_pstate != old_pstate) {
intel_cpufreq_adjust_perf_ctl(cpu, target_pstate, fast_switch);
+ }

cpu->pstate.current_pstate = target_pstate;

@@ -2634,6 +2637,47 @@ static unsigned int intel_cpufreq_fast_s
return target_pstate * cpu->pstate.scaling;
}

+static void intel_cpufreq_adjust_perf(unsigned int cpunum,
+ unsigned long min_perf,
+ unsigned long target_perf,
+ unsigned long capacity)
+{
+ struct cpudata *cpu = all_cpu_data[cpunum];
+ int old_pstate = cpu->pstate.current_pstate;
+ int cap_pstate, min_pstate, max_pstate, target_pstate;
+
+ update_turbo_state();
+ cap_pstate = global.turbo_disabled ? cpu->pstate.max_pstate :
+ cpu->pstate.turbo_pstate;
+
+ /* Optimization: Avoid unnecessary divisions. */
+
+ target_pstate = cap_pstate;
+ if (target_perf < capacity)
+ target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
+
+ min_pstate = cap_pstate;
+ if (min_perf < capacity)
+ min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
+
+ if (min_pstate < cpu->pstate.min_pstate)
+ min_pstate = cpu->pstate.min_pstate;
+
+ if (min_pstate < cpu->min_perf_ratio)
+ min_pstate = cpu->min_perf_ratio;
+
+ max_pstate = min(cap_pstate, cpu->max_perf_ratio);
+ if (max_pstate < min_pstate)
+ max_pstate = min_pstate;
+
+ target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
+
+ intel_cpufreq_adjust_hwp(cpu, min_pstate, max_pstate, target_pstate, true);
+
+ cpu->pstate.current_pstate = target_pstate;
+ intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+}
+
static int intel_cpufreq_cpu_init(struct cpufreq_policy *policy)
{
int max_state, turbo_max, min_freq, max_freq, ret;
@@ -3032,6 +3076,8 @@ static int __init intel_pstate_init(void
intel_pstate.attr = hwp_cpufreq_attrs;
intel_cpufreq.attr = hwp_cpufreq_attrs;
intel_cpufreq.flags |= CPUFREQ_NEED_UPDATE_LIMITS;
+ intel_cpufreq.fast_switch = NULL;
+ intel_cpufreq.adjust_perf = intel_cpufreq_adjust_perf;
if (!default_driver)
default_driver = &intel_pstate;




2020-12-08 12:12:28

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v1 3/4] cpufreq: Add special-purpose fast-switching callback for drivers

On 07-12-20, 17:35, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> First off, some cpufreq drivers (eg. intel_pstate) can pass hints
> beyond the current target frequency to the hardware and there are no
> provisions for doing that in the cpufreq framework. In particular,
> today the driver has to assume that it should not allow the frequency
> to fall below the one requested by the governor (or the required
> capacity may not be provided) which may not be the case and which may
> lead to excessive energy usage in some scenarios.
>
> Second, the hints passed by these drivers to the hardware need not be
> in terms of the frequency, so representing the utilization numbers
> coming from the scheduler as frequency before passing them to those
> drivers is not really useful.
>
> Address the two points above by adding a special-purpose replacement
> for the ->fast_switch callback, called ->adjust_perf, allowing the
> governor to pass abstract performance level (rather than frequency)
> values for the minimum (required) and target (desired) performance
> along with the CPU capacity to compare them to.
>
> Also update the schedutil governor to use the new callback instead
> of ->fast_switch if present.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> Changes with respect to the RFC:
> - Don't pass "busy" to ->adjust_perf().
> - Use a special 'update_util' hook for the ->adjust_perf() case in
> schedutil (this still requires an additional branch because of the
> shared common code between this case and the "frequency" one, but
> IMV this version is cleaner nevertheless).
>
> ---
> drivers/cpufreq/cpufreq.c | 40 ++++++++++++++++++++++++++++++++
> include/linux/cpufreq.h | 14 +++++++++++
> include/linux/sched/cpufreq.h | 5 ++++
> kernel/sched/cpufreq_schedutil.c | 48 +++++++++++++++++++++++++++++++--------
> 4 files changed, 98 insertions(+), 9 deletions(-)
>
> Index: linux-pm/include/linux/cpufreq.h
> ===================================================================
> --- linux-pm.orig/include/linux/cpufreq.h
> +++ linux-pm/include/linux/cpufreq.h
> @@ -320,6 +320,15 @@ struct cpufreq_driver {
> unsigned int index);
> unsigned int (*fast_switch)(struct cpufreq_policy *policy,
> unsigned int target_freq);
> + /*
> + * ->fast_switch() replacement for drivers that use an internal
> + * representation of performance levels and can pass hints other than
> + * the target performance level to the hardware.
> + */
> + void (*adjust_perf)(unsigned int cpu,
> + unsigned long min_perf,
> + unsigned long target_perf,
> + unsigned long capacity);

With this callback in place, do we still need to keep the other stuff we
introduced recently, like CPUFREQ_NEED_UPDATE_LIMITS ?

--
viresh

2020-12-08 13:01:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] cpufreq: intel_pstate: Implement the ->adjust_perf() callback

On Mon, Dec 07, 2020 at 05:38:58PM +0100, Rafael J. Wysocki wrote:

> +static void intel_cpufreq_adjust_perf(unsigned int cpunum,
> + unsigned long min_perf,
> + unsigned long target_perf,
> + unsigned long capacity)
> +{
> + struct cpudata *cpu = all_cpu_data[cpunum];
> + int old_pstate = cpu->pstate.current_pstate;
> + int cap_pstate, min_pstate, max_pstate, target_pstate;
> +
> + update_turbo_state();
> + cap_pstate = global.turbo_disabled ? cpu->pstate.max_pstate :
> + cpu->pstate.turbo_pstate;
> +
> + /* Optimization: Avoid unnecessary divisions. */
> +
> + target_pstate = cap_pstate;
> + if (target_perf < capacity)
> + target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
> +
> + min_pstate = cap_pstate;
> + if (min_perf < capacity)
> + min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
> +
> + if (min_pstate < cpu->pstate.min_pstate)
> + min_pstate = cpu->pstate.min_pstate;
> +
> + if (min_pstate < cpu->min_perf_ratio)
> + min_pstate = cpu->min_perf_ratio;
> +
> + max_pstate = min(cap_pstate, cpu->max_perf_ratio);
> + if (max_pstate < min_pstate)
> + max_pstate = min_pstate;
> +
> + target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
> +
> + intel_cpufreq_adjust_hwp(cpu, min_pstate, max_pstate, target_pstate, true);

I'm confused... HWP doesn't do pstate, yet everything here is now called
pstate, help?

> +
> + cpu->pstate.current_pstate = target_pstate;
> + intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
> +}

2020-12-08 16:36:34

by Giovanni Gherdovich

[permalink] [raw]
Subject: Re: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor

On Mon, 2020-12-07 at 17:25 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> This is based on the RFC posted a few days ago:
>
> https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/
>
> The majority of the original cover letter still applies, so let me quote it
> here:
> [...]

Hello Rafael,

I'd like to test this patch, as I have concerns on how it performs against the
current intel_pstate passive + HWP + schedutil (which leaves HWP.REQ.DESIRED
untouched).

I'll get results within a week. Do you mind holding back the merge until then?


Thanks!
Giovanni Gherdovich

2020-12-08 17:15:11

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] cpufreq: intel_pstate: Implement the ->adjust_perf() callback

On Tue, Dec 8, 2020 at 1:44 PM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Dec 07, 2020 at 05:38:58PM +0100, Rafael J. Wysocki wrote:
>
> > +static void intel_cpufreq_adjust_perf(unsigned int cpunum,
> > + unsigned long min_perf,
> > + unsigned long target_perf,
> > + unsigned long capacity)
> > +{
> > + struct cpudata *cpu = all_cpu_data[cpunum];
> > + int old_pstate = cpu->pstate.current_pstate;
> > + int cap_pstate, min_pstate, max_pstate, target_pstate;
> > +
> > + update_turbo_state();
> > + cap_pstate = global.turbo_disabled ? cpu->pstate.max_pstate :
> > + cpu->pstate.turbo_pstate;
> > +
> > + /* Optimization: Avoid unnecessary divisions. */
> > +
> > + target_pstate = cap_pstate;
> > + if (target_perf < capacity)
> > + target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
> > +
> > + min_pstate = cap_pstate;
> > + if (min_perf < capacity)
> > + min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
> > +
> > + if (min_pstate < cpu->pstate.min_pstate)
> > + min_pstate = cpu->pstate.min_pstate;
> > +
> > + if (min_pstate < cpu->min_perf_ratio)
> > + min_pstate = cpu->min_perf_ratio;
> > +
> > + max_pstate = min(cap_pstate, cpu->max_perf_ratio);
> > + if (max_pstate < min_pstate)
> > + max_pstate = min_pstate;
> > +
> > + target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
> > +
> > + intel_cpufreq_adjust_hwp(cpu, min_pstate, max_pstate, target_pstate, true);
>
> I'm confused... HWP doesn't do pstate, yet everything here is now called
> pstate, help?

HWP.REQ.MIN, HWP.REQ.MAX and HWP.REQ.DESIRED all are in the same space
of values as the original PERF_CTL MSR, which is P-states, at least
effectively.

> > +
> > + cpu->pstate.current_pstate = target_pstate;
> > + intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
> > +}

2020-12-08 17:17:48

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor

On Tue, Dec 8, 2020 at 5:31 PM Giovanni Gherdovich <[email protected]> wrote:
>
> On Mon, 2020-12-07 at 17:25 +0100, Rafael J. Wysocki wrote:
> > Hi,
> >
> > This is based on the RFC posted a few days ago:
> >
> > https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/
> >
> > The majority of the original cover letter still applies, so let me quote it
> > here:
> > [...]
>
> Hello Rafael,
>
> I'd like to test this patch, as I have concerns on how it performs against the
> current intel_pstate passive + HWP + schedutil (which leaves HWP.REQ.DESIRED
> untouched).

The performance will be worse in some cases, which is to be expected,
but the point here is to counter the problem that in some cases the
frequency is excessive now.

> I'll get results within a week. Do you mind holding back the merge until then?

Sure, the series is still under review.

2020-12-09 01:11:40

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor

On 2020.12.08 09:14 Rafael J. Wysocki wrote:
> On Tue, Dec 8, 2020 at 5:31 PM Giovanni Gherdovich <[email protected]> wrote:
>> On Mon, 2020-12-07 at 17:25 +0100, Rafael J. Wysocki wrote:
>>> Hi,
>>>
>>> This is based on the RFC posted a few days ago:
>>>
>>> https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/
>>>
>>> The majority of the original cover letter still applies, so let me quote it
>>> here:
>>> [...]
>>
>> Hello Rafael,
>>
>> I'd like to test this patch, as I have concerns on how it performs against the
>> current intel_pstate passive + HWP + schedutil (which leaves HWP.REQ.DESIRED
>> untouched).
>
>The performance will be worse in some cases, which is to be expected,

Agreed. More further below.

>but the point here is to counter the problem that in some cases the
>frequency is excessive now.

Agreed.

I retested with this new version with my load sweep test,
and my results were the pretty similar to previously reported, [1].
If anything, there might be a perceptible difference between the RFC
version and this version as a function of work/sleep frequency.

73 Hertz: RFC was 0.8% less power
113 Hertz: RFC was 1.5% less power
211 Hertz: RFC was 0.8% less power
347 Hertz: RFC was 1.2% more power
401 Hertz: RFC was 1.6% more power

Of course, let us not lose site of the original gains that were in the 13 to 17% range.

Now, for the "worse in some cases, which is to be expected" part:

At least on my system, it is most evident for some of the pipe type tests,
where the schedutil governor has never really known what to do. This patch
set seems to add enough of a downward bias that this version of the schedutil
governor now behaves much like the other versions

Here is an example (no forced CPU affinity, 2 pipes):

Legend:
hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq; schedutil (always)
rjw: Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpu-freq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpu-freq; schedutil
acpi-cpufreq (or just acpi): Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + the non RFC 4 patch version of this is patch set, HWP enabled; intel_cpu-freq; schedutil

hwp: 3.295 uSec/loop (steady); average power: 62.78 Watts (steady); freq: 4.60GHz (steady).
rjw: 3.6 uSec/loop (noisey) (9.3% worse); average power: 44.13 Watts (noisey); freq: ?.??GHz (noisey).
no-hwp: 3.35 uSec/loop (noisey); average power: 59.17 Watts (noisey); freq: ?.??GHz (much less noisey).
acpi-cpufreq: 3.4 uSec/loop (noisey); average power: 56.93 Watts (noisey); freq: ?.??GHz (less noisey).
patch: 3.6 uSec/loop (noisey) (9.3% worse); average power: 43.36 Watts (noisey); freq: ?.??GHz (noisey).

One could argue that this patch set might be driving the frequency down a little too hard in this case,
but this is one of those difficult rotating through the CPUs cases anyhow.
As a side note, all other governors (well, not powersave) pin the CPU frequency at max (4.6 GHz).

For my version of the "Alex" pipe test (a token is passed around and around via 6 named pipes,
with a bit of work to do per token stop)
I get (more power = better performance):

hwp: average power: 17.16 Watts (very noisey)
rjw: average power: 3.18 (noisey)
no-hwp: average power: 2.45 (less noisey)
acpi-cpufreq: average power: 2.46 (less noisey)
patch: average power: 2.47 (less noisey)

The "hwp" case is misleading. It is really bouncing around at about 0.2 hertz
and can't seem to make up its mind. If nothing else, all the other versions
are a little more deterministic in their response.

>
>> I'll get results within a week. Do you mind holding back the merge until then?

On my system that git "make test" is broken, so I can not do the high PIDs per second test that way.
My own PIDs per second test also has issues on this computer.
I am not sure when I'll get these type of tests working, but will try for within a week.

... Doug

References:

[1] https://marc.info/?l=linux-kernel&m=160692480907696&w=2

My tests results graphs (updated):
Note: I have to code the web site, or I get hammered by bots.
Note: it is .com only because it was less expensive than .org

73 Hertz (10 samples per second):
Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su73/
113 Hertz (10 samples per second):
Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su113/
211 Hertz (10 samples per second):
Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su211/
347 Hertz (10 samples per second):
Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su347/
401 Hertz (10 samples per second):
Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su401/

Note: The below graphs are mislabelled, because I made hacker use of a tool dedicated
to graphing, and HTML wrapping, the load sweep test. The test conditions are steady state
operation, no variable changes.

pipe-test (5 seconds per sample):
Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/pipe/
Alex test (5 seconds per sample):
Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/alex/


2020-12-13 19:18:23

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor

On 2020.12.08 11:15 Doug wrote:
> On 2020.12.08 09:14 Rafael J. Wysocki wrote:
> > On Tue, Dec 8, 2020 at 5:31 PM Giovanni Gherdovich <[email protected]> wrote:
> >> On Mon, 2020-12-07 at 17:25 +0100, Rafael J. Wysocki wrote:
> >> I'd like to test this patch, as I have concerns on how it performs against the
> >> current intel_pstate passive + HWP + schedutil (which leaves HWP.REQ.DESIRED
> >> untouched).
> >
> >The performance will be worse in some cases, which is to be expected,
>
> Agreed. More further below.
>
> >but the point here is to counter the problem that in some cases the
> >frequency is excessive now.
>
> Agreed.
...
>
> The "hwp" case is misleading. It is really bouncing around at about 0.2 hertz
> and can't seem to make up its mind.


> If nothing else, all the other versions
> are a little more deterministic in their response.

Hmmm... I think that statement is backwards, for this particular
workflow. Read on.

>
> >
> >> I'll get results within a week. Do you mind holding back the merge until then?
>
> On my system that git "make test" is broken, so I can not do the high PIDs per second test that way.
> My own PIDs per second test also has issues on this computer.
> I am not sure when I'll get these type of tests working, but will try for within a week.

Geovanni: Previously you have reported run to run variability with the git "make test"
serialized single threaded workflow, with a new PID per task. Previously, I had attributed
the variability to disk I/O. For the schedutil governor, there is also a significant variability
in execution time within and around a frequency of 0.2 Hertz. Generally, it would tend to get
averaged out over the long execution time of the "make test".

I have completely re-done everything for my version (no disk I/O) of the PIDs per second test:

Preamble:

As mentioned previously, the issue with attempting to acquire useful debugging
information with the schedutil governor is that it is always so oscillatory.
Sampling processor package energy at too high a frequency and it just looks like
noise, sample at too low a frequency and any useful information is filtered out.
To make a long story shorter, a sample frequency of about 1 hertz, or a sample
period of about 1 second, was revealing. Event periods of between 5 and 35 seconds
are also supported by listening to the fan speed modulation on my test server.
A lot of intel_pstate_tracer.py utility trace data was acquired and reviewed, but
I am unable to make any sense out of it.

Data is presented using the teo idle governor, and a 1000 hertz kernel, but be aware
that the menu governor and a 250 Hertz kernel were also tried, but results not written
up. Data is presented for one set of conditions only, but be aware that all of the
PIDs per second space was searched for any specific anomalies. It is the same
everywhere (tested ~40 - ~3500 PIDs per second) (obviously at some low enough PIDs
per second, the differences will diminish) and the operating point of focus for
further dwell type testing was not cherry picked.

Operating point for the rest of the work:
Fixed work packet per task (PID):
~1780 PIDs per second, performance governor.
~ 308 PIDs per second, powersave governor.

Test computer: i5-9600K

Legend:

hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq
rfc: Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpu-freq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpu-freq
acpi: Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + this patch set, HWP enabled; intel_cpu-freq; schedutil

Results:

Execution times (seconds. Less is better):

no-hwp:

conservative: Samples: 371 ; Average: 10.84296 ; Stand Deviation: 0.07160 ; Maximum: 11.09000 ; Minimum: 10.65000
ondemand: Samples: 342 ; Average: 11.76442 ; Stand Deviation: 0.02640 ; Maximum: 11.85000 ; Minimum: 11.71000
performance: Samples: 382 ; Average: 10.54450 ; Stand Deviation: 0.01564 ; Maximum: 10.61000 ; Minimum: 10.50000
powersave: Samples: 68 ; Average: 59.93897 ; Stand Deviation: 0.11697 ; Maximum: 60.19000 ; Minimum: 59.67000

schedutil: Samples: 293 ; Average: 13.73416 ; Stand Deviation: 0.73395 ; Maximum: 15.46000 ; Minimum: 11.68000
acpi: Samples: 253 ; Average: 15.94889 ; Stand Deviation: 1.28219 ; Maximum: 18.66000 ; Minimum: 12.04000

hwp:

conservative: Samples: 380 ; Average: 10.58518 ; Stand Deviation: 0.01562 ; Maximum: 10.64000 ; Minimum: 10.54000
ondemand: Samples: 380 ; Average: 10.58395 ; Stand Deviation: 0.01509 ; Maximum: 10.62000 ; Minimum: 10.54000
performance: Samples: 381 ; Average: 10.57871 ; Stand Deviation: 0.01870 ; Maximum: 10.63000 ; Minimum: 10.53000
powersave: Samples: 67 ; Average: 60.37075 ; Stand Deviation: 0.20095 ; Maximum: 60.81000 ; Minimum: 59.92000

schedutil: Samples: 380 ; Average: 10.58287 ; Stand Deviation: 0.01864 ; Maximum: 10.64000 ; Minimum: 10.54000
patch: Samples: 276 ; Average: 14.57029 ; Stand Deviation: 0.89771 ; Maximum: 16.04000 ; Minimum: 11.68000
rfc: Samples: 271 ; Average: 14.86037 ; Stand Deviation: 0.84164 ; Maximum: 16.04000 ; Minimum: 12.21000

Power (watts. More indicates higher CPU frequency and better performance. Sample time = 1 second.):

no-hwp:

conservative: Samples: 4000 ; Average: 23.60704 ; Stand Deviation: 1.09529 ; Maximum: 43.88623 ; Minimum: 18.32019
onemand: Samples: 4000 ; Average: 16.95654 ; Stand Deviation: 0.19147 ; Maximum: 17.60266 ; Minimum: 16.03534
performance: Samples: 4000 ; Average: 25.41355 ; Stand Deviation: 0.22156 ; Maximum: 26.01996 ; Minimum: 24.08807
powersave: Samples: 4000 ; Average: 1.93887 ; Stand Deviation: 0.06675 ; Maximum: 2.05737 ; Minimum: 1.26605

schedutil: Samples: 4000 ; Average: 12.58863 ; Stand Deviation: 5.48600 ; Maximum: 25.50934 ; Minimum: 7.54559
acpi: Samples: 4000 ; Average: 9.57924 ; Stand Deviation: 5.41157 ; Maximum: 25.06366 ; Minimum: 5.51129

hwp:

conservative: Samples: 4000 ; Average: 25.16285 ; Stand Deviation: 0.18960 ; Maximum: 25.85553 ; Minimum: 23.62634
ondemand: Samples: 4000 ; Average: 25.27046 ; Stand Deviation: 0.20764 ; Maximum: 25.89624 ; Minimum: 23.97101
performance: Samples: 4000 ; Average: 25.34541 ; Stand Deviation: 0.24076 ; Maximum: 26.01788 ; Minimum: 23.87823
powersave: Samples: 4000 ; Average: 1.90300 ; Stand Deviation: 0.09457 ; Maximum: 2.00519 ; Minimum: 1.27722

schedutil: Samples: 4000 ; Average: 25.24245 ; Stand Deviation: 0.19539 ; Maximum: 25.93671 ; Minimum: 24.14746
patch: Samples: 4000 ; Average: 11.07225 ; Stand Deviation: 5.63142 ; Maximum: 24.99493 ; Minimum: 3.67548
rfc: Samples: 4000 ; Average: 10.35842 ; Stand Deviation: 4.77915 ; Maximum: 24.95953 ; Minimum: 7.26202

Power histogram information is available for every test. For the schedutil governor, typically it reveals two
distributions. In linear time the average power will sometimes be high for a second and sometimes low
for multiple seconds, resulting in the high execution time variability as revealed by the standard deviation
numbers.

I also have mains power data, at 1 sample per second for an overnight run:
(Watts into the computer. ~ = 34 + processor package power).
Samples: 26674 ; Average: 50.39585 ; Stand Deviation: 8.63522 ; Maximum: 69.51000 ; Minimum: 35.30000
with peaks in the 42 and 62 watt regions.

Note 1: this is a never throttle scenario, not even close.
Note 2: this test computer never scales back CPU frequency as a function of active cores.
Note 3: sometimes maximum power includes an outlier data point or 2.
Standard deviation is the important statistic.
For example, no-hwp maximum power was really 25.8 watts, not 43.9.

Summary:

hwp enabled intel-cpufreq/schedutil was stable, with execution times comparable to other
governors before this patch set, but it is unstable with longer and variable execution
times after the patch set.

Conclusion:

Yes, the original hwp intel-cpufreq/schedutil (A.K.A intel-pstate passive mode, hwp enabled,
schedutil governor) does much better for this type of fixed work packet but variable period
workflow. However, isn't that the trade-off between fixed period fixed work packet workflow,
which improved considerably with this patch set?

After the patch set it does behave more like the other versions(no-hwp, acpi), which
I don't know if that is a good or bad thing. As we all know, this type of workflow has
always been a challenge for governors.

... Doug

References:
None.
My objective was to not need pretty graphs to support this e-mail.


2020-12-14 20:15:56

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor

Hi,

The timing of this is not perfect (sorry about that), but here's a refresh
of this series.

The majority of the previous cover letter still applies:

On Monday, December 7, 2020 5:25:38 PM CET Rafael J. Wysocki wrote:
>
> This is based on the RFC posted a few days ago:
>
> https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/
>
> Using intel_pstate in the passive mode with HWP enabled, in particular under
> the schedutil governor, is still kind of problematic, because it has to assume
> that it should not allow the frequency to fall below the one requested by the
> governor. For this reason, it translates the target frequency into HWP.REQ.MIN
> which generally causes the processor to run a bit too fast.
>
> Moreover, this allows the HWP algorithm to use any frequency between the target
> one and HWP.REQ.MAX that corresponds to the policy max limit and some workloads
> cause it to go for the max turbo frequency prematurely which hurts energy-
> efficiency without improving performance, even though the schedutil governor
> itself would not allow the frequency to ramp up so fast.
>
> This patch series attempts to improve the situation by introducing a new driver
> callback allowing the driver to receive more information from the governor. In
> particular, this allows the min (required) and target (desired) performance
> levels to be passed to it and those can be used to give better hints to the
> hardware.

In this second revision there are three patches (one preparatory patch for
schedutil that hasn't changed since the v1, the introduction of the new
callback and schedutil changes in patch [2/3] and the intel_pstate changes
in patch [3/3] that are the same as before.

Please see patch changelogs for details.

Thanks!



2020-12-14 20:20:43

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH v2 2/3] cpufreq: Add special-purpose fast-switching callback for drivers

From: Rafael J. Wysocki <[email protected]>

First off, some cpufreq drivers (eg. intel_pstate) can pass hints
beyond the current target frequency to the hardware and there are no
provisions for doing that in the cpufreq framework. In particular,
today the driver has to assume that it should not allow the frequency
to fall below the one requested by the governor (or the required
capacity may not be provided) which may not be the case and which may
lead to excessive energy usage in some scenarios.

Second, the hints passed by these drivers to the hardware need not be
in terms of the frequency, so representing the utilization numbers
coming from the scheduler as frequency before passing them to those
drivers is not really useful.

Address the two points above by adding a special-purpose replacement
for the ->fast_switch callback, called ->adjust_perf, allowing the
governor to pass abstract performance level (rather than frequency)
values for the minimum (required) and target (desired) performance
along with the CPU capacity to compare them to.

Also update the schedutil governor to use the new callback instead
of ->fast_switch if present and if the utilization mertics are
frequency-invariant (that is requisite for the direct mapping
between the utilization and the CPU performance levels to be a
reasonable approximation).

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v1 -> v2:
- Do not share optimization code between the freq and perf paths.
- Fall back from perf to freq if scale invariance is not supported.

Changes with respect to the RFC:
- Don't pass "busy" to ->adjust_perf().
- Use a special 'update_util' hook for the ->adjust_perf() case in
schedutil (this still requires an additional branch because of the
shared common code between this case and the "frequency" one, but
IMV this version is cleaner nevertheless).

---
drivers/cpufreq/cpufreq.c | 40 ++++++++++++++++++++++
include/linux/cpufreq.h | 14 ++++++++
include/linux/sched/cpufreq.h | 5 ++
kernel/sched/cpufreq_schedutil.c | 68 +++++++++++++++++++++++++++++++++------
4 files changed, 117 insertions(+), 10 deletions(-)

Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -320,6 +320,15 @@ struct cpufreq_driver {
unsigned int index);
unsigned int (*fast_switch)(struct cpufreq_policy *policy,
unsigned int target_freq);
+ /*
+ * ->fast_switch() replacement for drivers that use an internal
+ * representation of performance levels and can pass hints other than
+ * the target performance level to the hardware.
+ */
+ void (*adjust_perf)(unsigned int cpu,
+ unsigned long min_perf,
+ unsigned long target_perf,
+ unsigned long capacity);

/*
* Caches and returns the lowest driver-supported frequency greater than
@@ -588,6 +597,11 @@ struct cpufreq_governor {
/* Pass a target to the cpufreq driver */
unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
unsigned int target_freq);
+void cpufreq_driver_adjust_perf(unsigned int cpu,
+ unsigned long min_perf,
+ unsigned long target_perf,
+ unsigned long capacity);
+bool cpufreq_driver_has_adjust_perf(void);
int cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -2097,6 +2097,46 @@ unsigned int cpufreq_driver_fast_switch(
}
EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);

+/**
+ * cpufreq_driver_adjust_perf - Adjust CPU performance level in one go.
+ * @cpu: Target CPU.
+ * @min_perf: Minimum (required) performance level (units of @capacity).
+ * @target_perf: Terget (desired) performance level (units of @capacity).
+ * @capacity: Capacity of the target CPU.
+ *
+ * Carry out a fast performance level switch of @cpu without sleeping.
+ *
+ * The driver's ->adjust_perf() callback invoked by this function must be
+ * suitable for being called from within RCU-sched read-side critical sections
+ * and it is expected to select a suitable performance level equal to or above
+ * @min_perf and preferably equal to or below @target_perf.
+ *
+ * This function must not be called if policy->fast_switch_enabled is unset.
+ *
+ * Governors calling this function must guarantee that it will never be invoked
+ * twice in parallel for the same CPU and that it will never be called in
+ * parallel with either ->target() or ->target_index() or ->fast_switch() for
+ * the same CPU.
+ */
+void cpufreq_driver_adjust_perf(unsigned int cpu,
+ unsigned long min_perf,
+ unsigned long target_perf,
+ unsigned long capacity)
+{
+ cpufreq_driver->adjust_perf(cpu, min_perf, target_perf, capacity);
+}
+
+/**
+ * cpufreq_driver_has_adjust_perf - Check "direct fast switch" callback.
+ *
+ * Return 'true' if the ->adjust_perf callback is present for the
+ * current driver or 'false' otherwise.
+ */
+bool cpufreq_driver_has_adjust_perf(void)
+{
+ return !!cpufreq_driver->adjust_perf;
+}
+
/* Must set freqs->new to intermediate frequency */
static int __target_intermediate(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, int index)
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -432,13 +432,10 @@ static inline void ignore_dl_rate_limit(
sg_policy->limits_changed = true;
}

-static void sugov_update_single(struct update_util_data *hook, u64 time,
- unsigned int flags)
+static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
+ u64 time, unsigned int flags)
{
- struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
- unsigned int cached_freq = sg_policy->cached_raw_freq;
- unsigned int next_f;

sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -446,11 +443,25 @@ static void sugov_update_single(struct u
ignore_dl_rate_limit(sg_cpu, sg_policy);

if (!sugov_should_update_freq(sg_policy, time))
- return;
+ return false;

sugov_get_util(sg_cpu);
sugov_iowait_apply(sg_cpu, time);

+ return true;
+}
+
+static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
+ unsigned int flags)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int cached_freq = sg_policy->cached_raw_freq;
+ unsigned int next_f;
+
+ if (!sugov_update_single_common(sg_cpu, time, flags))
+ return;
+
next_f = get_next_freq(sg_policy, sg_cpu->util, sg_cpu->max);
/*
* Do not reduce the frequency if the CPU has not been idle
@@ -477,6 +488,38 @@ static void sugov_update_single(struct u
}
}

+static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
+ unsigned int flags)
+{
+ struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ unsigned long prev_util = sg_cpu->util;
+
+ /*
+ * Fall back to the "frequency" path if frequency invariance is not
+ * supported, because the direct mapping between the utilization and
+ * the performance levels depends on the frequency invariance.
+ */
+ if (!arch_scale_freq_invariant()) {
+ sugov_update_single_freq(hook, time, flags);
+ return;
+ }
+
+ if (!sugov_update_single_common(sg_cpu, time, flags))
+ return;
+
+ /*
+ * Do not reduce the target performance level if the CPU has not been
+ * idle recently, as the reduction is likely to be premature then.
+ */
+ if (sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+ sg_cpu->util = prev_util;
+
+ cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
+ map_util_perf(sg_cpu->util), sg_cpu->max);
+
+ sg_cpu->sg_policy->last_freq_update_time = time;
+}
+
static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
{
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
@@ -815,6 +858,7 @@ static void sugov_exit(struct cpufreq_po
static int sugov_start(struct cpufreq_policy *policy)
{
struct sugov_policy *sg_policy = policy->governor_data;
+ void (*uu)(struct update_util_data *data, u64 time, unsigned int flags);
unsigned int cpu;

sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
@@ -834,13 +878,17 @@ static int sugov_start(struct cpufreq_po
sg_cpu->sg_policy = sg_policy;
}

+ if (policy_is_shared(policy))
+ uu = sugov_update_shared;
+ else if (policy->fast_switch_enabled && cpufreq_driver_has_adjust_perf())
+ uu = sugov_update_single_perf;
+ else
+ uu = sugov_update_single_freq;
+
for_each_cpu(cpu, policy->cpus) {
struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

- cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
- policy_is_shared(policy) ?
- sugov_update_shared :
- sugov_update_single);
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, uu);
}
return 0;
}
Index: linux-pm/include/linux/sched/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/sched/cpufreq.h
+++ linux-pm/include/linux/sched/cpufreq.h
@@ -28,6 +28,11 @@ static inline unsigned long map_util_fre
{
return (freq + (freq >> 2)) * util / cap;
}
+
+static inline unsigned long map_util_perf(unsigned long util)
+{
+ return util + (util >> 2);
+}
#endif /* CONFIG_CPU_FREQ */

#endif /* _LINUX_SCHED_CPUFREQ_H */



2020-12-15 04:20:19

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v1 3/4] cpufreq: Add special-purpose fast-switching callback for drivers

On 08-12-20, 14:32, Viresh Kumar wrote:
> On 07-12-20, 17:35, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > First off, some cpufreq drivers (eg. intel_pstate) can pass hints
> > beyond the current target frequency to the hardware and there are no
> > provisions for doing that in the cpufreq framework. In particular,
> > today the driver has to assume that it should not allow the frequency
> > to fall below the one requested by the governor (or the required
> > capacity may not be provided) which may not be the case and which may
> > lead to excessive energy usage in some scenarios.
> >
> > Second, the hints passed by these drivers to the hardware need not be
> > in terms of the frequency, so representing the utilization numbers
> > coming from the scheduler as frequency before passing them to those
> > drivers is not really useful.
> >
> > Address the two points above by adding a special-purpose replacement
> > for the ->fast_switch callback, called ->adjust_perf, allowing the
> > governor to pass abstract performance level (rather than frequency)
> > values for the minimum (required) and target (desired) performance
> > along with the CPU capacity to compare them to.
> >
> > Also update the schedutil governor to use the new callback instead
> > of ->fast_switch if present.
> >
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---
> >
> > Changes with respect to the RFC:
> > - Don't pass "busy" to ->adjust_perf().
> > - Use a special 'update_util' hook for the ->adjust_perf() case in
> > schedutil (this still requires an additional branch because of the
> > shared common code between this case and the "frequency" one, but
> > IMV this version is cleaner nevertheless).
> >
> > ---
> > drivers/cpufreq/cpufreq.c | 40 ++++++++++++++++++++++++++++++++
> > include/linux/cpufreq.h | 14 +++++++++++
> > include/linux/sched/cpufreq.h | 5 ++++
> > kernel/sched/cpufreq_schedutil.c | 48 +++++++++++++++++++++++++++++++--------
> > 4 files changed, 98 insertions(+), 9 deletions(-)
> >
> > Index: linux-pm/include/linux/cpufreq.h
> > ===================================================================
> > --- linux-pm.orig/include/linux/cpufreq.h
> > +++ linux-pm/include/linux/cpufreq.h
> > @@ -320,6 +320,15 @@ struct cpufreq_driver {
> > unsigned int index);
> > unsigned int (*fast_switch)(struct cpufreq_policy *policy,
> > unsigned int target_freq);
> > + /*
> > + * ->fast_switch() replacement for drivers that use an internal
> > + * representation of performance levels and can pass hints other than
> > + * the target performance level to the hardware.
> > + */
> > + void (*adjust_perf)(unsigned int cpu,
> > + unsigned long min_perf,
> > + unsigned long target_perf,
> > + unsigned long capacity);
>
> With this callback in place, do we still need to keep the other stuff we
> introduced recently, like CPUFREQ_NEED_UPDATE_LIMITS ?

Ping

--
viresh

2020-12-15 04:22:36

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor

On 14-12-20, 21:01, Rafael J. Wysocki wrote:
> Hi,
>
> The timing of this is not perfect (sorry about that), but here's a refresh
> of this series.
>
> The majority of the previous cover letter still applies:
>
> On Monday, December 7, 2020 5:25:38 PM CET Rafael J. Wysocki wrote:
> >
> > This is based on the RFC posted a few days ago:
> >
> > https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/
> >
> > Using intel_pstate in the passive mode with HWP enabled, in particular under
> > the schedutil governor, is still kind of problematic, because it has to assume
> > that it should not allow the frequency to fall below the one requested by the
> > governor. For this reason, it translates the target frequency into HWP.REQ.MIN
> > which generally causes the processor to run a bit too fast.
> >
> > Moreover, this allows the HWP algorithm to use any frequency between the target
> > one and HWP.REQ.MAX that corresponds to the policy max limit and some workloads
> > cause it to go for the max turbo frequency prematurely which hurts energy-
> > efficiency without improving performance, even though the schedutil governor
> > itself would not allow the frequency to ramp up so fast.
> >
> > This patch series attempts to improve the situation by introducing a new driver
> > callback allowing the driver to receive more information from the governor. In
> > particular, this allows the min (required) and target (desired) performance
> > levels to be passed to it and those can be used to give better hints to the
> > hardware.
>
> In this second revision there are three patches (one preparatory patch for
> schedutil that hasn't changed since the v1, the introduction of the new
> callback and schedutil changes in patch [2/3] and the intel_pstate changes
> in patch [3/3] that are the same as before.
>
> Please see patch changelogs for details.

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2020-12-15 15:44:36

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v1 3/4] cpufreq: Add special-purpose fast-switching callback for drivers

On Tue, Dec 15, 2020 at 5:17 AM Viresh Kumar <[email protected]> wrote:
>
> On 08-12-20, 14:32, Viresh Kumar wrote:
> > On 07-12-20, 17:35, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > >
> > > First off, some cpufreq drivers (eg. intel_pstate) can pass hints
> > > beyond the current target frequency to the hardware and there are no
> > > provisions for doing that in the cpufreq framework. In particular,
> > > today the driver has to assume that it should not allow the frequency
> > > to fall below the one requested by the governor (or the required
> > > capacity may not be provided) which may not be the case and which may
> > > lead to excessive energy usage in some scenarios.
> > >
> > > Second, the hints passed by these drivers to the hardware need not be
> > > in terms of the frequency, so representing the utilization numbers
> > > coming from the scheduler as frequency before passing them to those
> > > drivers is not really useful.
> > >
> > > Address the two points above by adding a special-purpose replacement
> > > for the ->fast_switch callback, called ->adjust_perf, allowing the
> > > governor to pass abstract performance level (rather than frequency)
> > > values for the minimum (required) and target (desired) performance
> > > along with the CPU capacity to compare them to.
> > >
> > > Also update the schedutil governor to use the new callback instead
> > > of ->fast_switch if present.
> > >
> > > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > > ---
> > >
> > > Changes with respect to the RFC:
> > > - Don't pass "busy" to ->adjust_perf().
> > > - Use a special 'update_util' hook for the ->adjust_perf() case in
> > > schedutil (this still requires an additional branch because of the
> > > shared common code between this case and the "frequency" one, but
> > > IMV this version is cleaner nevertheless).
> > >
> > > ---
> > > drivers/cpufreq/cpufreq.c | 40 ++++++++++++++++++++++++++++++++
> > > include/linux/cpufreq.h | 14 +++++++++++
> > > include/linux/sched/cpufreq.h | 5 ++++
> > > kernel/sched/cpufreq_schedutil.c | 48 +++++++++++++++++++++++++++++++--------
> > > 4 files changed, 98 insertions(+), 9 deletions(-)
> > >
> > > Index: linux-pm/include/linux/cpufreq.h
> > > ===================================================================
> > > --- linux-pm.orig/include/linux/cpufreq.h
> > > +++ linux-pm/include/linux/cpufreq.h
> > > @@ -320,6 +320,15 @@ struct cpufreq_driver {
> > > unsigned int index);
> > > unsigned int (*fast_switch)(struct cpufreq_policy *policy,
> > > unsigned int target_freq);
> > > + /*
> > > + * ->fast_switch() replacement for drivers that use an internal
> > > + * representation of performance levels and can pass hints other than
> > > + * the target performance level to the hardware.
> > > + */
> > > + void (*adjust_perf)(unsigned int cpu,
> > > + unsigned long min_perf,
> > > + unsigned long target_perf,
> > > + unsigned long capacity);
> >
> > With this callback in place, do we still need to keep the other stuff we
> > introduced recently, like CPUFREQ_NEED_UPDATE_LIMITS ?
>
> Ping

Missed this one, sorry.

We still need those things for the other governors.

2020-12-17 15:29:00

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor

On 2020.12.14 12:02 Rafael J. Wysocki wrote:

> Hi,

Hi Rafael,

V2 test results below are new, other results are partially re-stated:

For readers that do not want to read on, I didn't find anything different than with
the other versions. This was more just due diligence.

Legend:

hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq
rfc (or rjw): Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpu-freq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpu-freq
acpi (or acpi-cpufreq): Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + V1 patch set, HWP enabled; intel_cpu-freq; schedutil
v2: Kernel 5.10-rc7 + V2 patch set, HWP enabled; intel_cpu-freq; schedutil

Fixed work packet, fixed period, periodic workflow, load sweep up/down:

load work/sleep frequency: 73 Hertz:

hwp: Average: 12.00822 watts
rjw: Average: 10.18089 watts
no-hwp: Average: 10.21947 watts
acpi-cpufreq: Average: 9.06585 watts
patch: Average: 10.26060 watts
v2: Average: 10.50444

load work/sleep frequency: 113 Hertz:

hwp: Average: 12.01056
rjw: Average: 10.12303
no-hwp: Average: 10.08228
acpi-cpufreq: Average: 9.02215
patch: Average: 10.27055
v2: Average: 10.31097

load work/sleep frequency: 211 Hertz:

hwp: Average: 12.16067
rjw: Average: 10.24413
no-hwp: Average: 10.12463
acpi-cpufreq: Average: 9.19175
patch: Average: 10.33000
v2: Average: 10.39811

load work/sleep frequency: 347 Hertz:

hwp: Average: 12.34169
rjw: Average: 10.79980
no-hwp: Average: 10.57296
acpi-cpufreq: Average: 9.84709
patch: Average: 10.67029
v2: Average: 10.93143

load work/sleep frequency: 401 Hertz:

hwp: Average: 12.42562
rjw: Average: 11.12465
no-hwp: Average: 11.24203
acpi-cpufreq: Average: 10.78670
patch: Average: 10.94514
v2: Average: 11.50324


Serialized single threaded via PIDs per second method:
A.K.A fixed work packet, variable period
Results:

Execution times (seconds. Less is better):

no-hwp:

performance: Samples: 382 ; Average: 10.54450 ; Stand Deviation: 0.01564 ; Maximum: 10.61000 ; Minimum: 10.50000

schedutil: Samples: 293 ; Average: 13.73416 ; Stand Deviation: 0.73395 ; Maximum: 15.46000 ; Minimum: 11.68000
acpi: Samples: 253 ; Average: 15.94889 ; Stand Deviation: 1.28219 ; Maximum: 18.66000 ; Minimum: 12.04000

hwp:

schedutil: Samples: 380 ; Average: 10.58287 ; Stand Deviation: 0.01864 ; Maximum: 10.64000 ; Minimum: 10.54000
patch: Samples: 276 ; Average: 14.57029 ; Stand Deviation: 0.89771 ; Maximum: 16.04000 ; Minimum: 11.68000
rfc: Samples: 271 ; Average: 14.86037 ; Stand Deviation: 0.84164 ; Maximum: 16.04000 ; Minimum: 12.21000
v2: Samples: 274 ; Average: 14.67978 ; Stand Deviation: 1.03378 ; Maximum: 16.07000 ; Minimum: 11.43000

Power (watts. More indicates higher CPU frequency and better performance. Sample time = 1 second.):

no-hwp:

performance: Samples: 4000 ; Average: 25.41355 ; Stand Deviation: 0.22156 ; Maximum: 26.01996 ; Minimum: 24.08807

schedutil: Samples: 4000 ; Average: 12.58863 ; Stand Deviation: 5.48600 ; Maximum: 25.50934 ; Minimum: 7.54559
acpi: Samples: 4000 ; Average: 9.57924 ; Stand Deviation: 5.41157 ; Maximum: 25.06366 ; Minimum: 5.51129

hwp:

schedutil: Samples: 4000 ; Average: 25.24245 ; Stand Deviation: 0.19539 ; Maximum: 25.93671 ; Minimum: 24.14746
patch: Samples: 4000 ; Average: 11.07225 ; Stand Deviation: 5.63142 ; Maximum: 24.99493 ; Minimum: 3.67548
rfc: Samples: 4000 ; Average: 10.35842 ; Stand Deviation: 4.77915 ; Maximum: 24.95953 ; Minimum: 7.26202
v2: Samples: 4000 ; Average: 10.98284 ; Stand Deviation: 5.48859 ; Maximum: 25.76331 ; Minimum: 7.53790


2020-12-18 15:35:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor

On Tue, Dec 08, 2020 at 11:14:36AM -0800, Doug Smythies wrote:
> At least on my system, it is most evident for some of the pipe type tests,
> where the schedutil governor has never really known what to do. This patch
> set seems to add enough of a downward bias that this version of the schedutil
> governor now behaves much like the other versions

Yeah, pipe relies on task-task interaction, where one task waits on
another, and by boosting the producer the consumer can start earlier and
we get more cycles done etc.. Rather similar to IO-wait, where by
boosting the producer we gain throughput.

schedutil doesn't track anything useful here, but it is a semi common
pattern and it would be really good if we could somehow fix this.

We obviously have access to the task A wakes task B information, but I'm
not sure what to do with it, we're tried some things like this in the
past (although for slightly different reasons) and they've always ended
up being a mess :/

2020-12-18 16:24:39

by Giovanni Gherdovich

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor

On Mon, 2020-12-14 at 21:01 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> The timing of this is not perfect (sorry about that), but here's a refresh
> of this series.
>
> The majority of the previous cover letter still applies:
> [...]

Hello,

the series is tested using

-> tbench (packets processing with loopback networking, measures throughput)
-> dbench (filesystem operations, measures average latency)
-> kernbench (kernel compilation, elapsed time)
-> and gitsource (long-running shell script, elapsed time)

These are chosen because none of them is bound by compute and all are
sensitive to freq scaling decisions. The machines are a Cascade Lake based
server, a client Skylake and a Coffee Lake laptop.

What's being compared:

sugov-HWP.desired : the present series; intel_pstate=passive, governor=schedutil
sugov-HWP.min : mainline; intel_pstate=passive, governor=schedutil
powersave-HWP : mainline; intel_pstate=active, governor=powersave
perfgov-HWP : mainline; intel_pstate=active, governor=performance
sugov-no-HWP : HWP disabled; intel_pstate=passive, governor=schedutil

Dbench and Kernbench have neutral results, but Tbench has sugov-HWP.desired
lose in both performance and performance-per-watt, while Gitsource show the
series as faster in raw performance but again worse than the competition in
efficiency.

1. SUMMARY BY BENCHMARK
1.1. TBENCH
1.2. DBENCH
1.3. KERNBENCH
1.4. GITSOURCE
2. SUMMARY BY USER PROFILE
2.1. PERFORMANCE USER: what if I switch pergov -> schedutil?
2.2. DEFAULT USER: what if I switch powersave -> schedutil?
2.3. DEVELOPER: what if I switch sugov-HWP.min -> sugov-HWP.desired?
3. RESULTS TABLES
PERFORMANCE RATIOS
PERFORMANCE-PER-WATT RATIOS


1. SUMMARY BY BENCHMARK
~~~~~~~~~~~~~~~~~~~~~~~

Tbench: sugov-HWP.desired is the worst performance on all three
machines. sugov-HWP.min is between 20% and 90% better. The baseline
sugov-HWP.desired offers a lower throughput, but does it increase
efficiency? It actually doesn't: on two out of three machines the
incumbent code (current sugov, or intel_pstate=active) has 10% to 35%
better efficiency. In other word, the status quo is both faster and more
efficient than the proposed series on this benchmark.
The absolute power consumption is lower, but the delivered performance is
"even more lower", and that's why performance-per-watt shows a net loss.

Dbench: generally neutral, in both performance and efficiency. Powersave is
occasionally behind the pack in performance, 5% to 15%. A 15% performance
loss on the Coffe Lake is compensated by an 80% improved efficiency. To be
noted that on the same Coffee Lake sugov-no-HWP is 20% ahead of the pack
in efficiency.

Kernbench: neutral, in both performance and efficiency. powersave looses 14%
to the pack in performance on the Cascade Lake.

Gitsource: this test show the most compelling case against the
sugov-HWP.desired series: on the Cascade Lake sugov-HWP.desired is 10%
faster than sugov-HWP.min (it was expected to be slower!) and 35% less
efficient (we expected more performance-per-watt, not less).


2. SUMMARY BY USER PROFILE
~~~~~~~~~~~~~~~~~~~~~~~~~~

If I was a perfgov-HWP user, I would be 20%-90% faster than with other governors
on tbench an gitsource. This speed gap comes with an unexpected efficiency
bonus on both test. Since dbench and kernbench have a flat profile across the
board, there is no incentive to try another governor.

If I was a powersave-HWP user, I'd be the slower of the bunch. The lost
performance is not, in general, balanced by better efficiency. This only
happens on Coffee Lake, which is a CPU for the mobile market and possibly HWP
has efficiency-oriented tuning there. Any flavor of schedutil would be an
improvement.

From a developer perspective, the obstacles to move from HWP.min to
HWP.desired are tbench, where HWP.desired is worse than having no HWP support
at all, and gitsource, where HWP.desired has the opposite properties than
those advertised (it's actually faster but less efficient).


3. RESULTS TABLES
~~~~~~~~~~~~~~~~~

Tilde (~) means the result is the same as baseline (or, the ratio is close to 1).
The double asterisk (**) is a visual aid and means the result is better than
baseline (higher or lower depending on the case).


| 80x_CASCADELAKE_NUMA: Intel Cascade Lake, 40 cores / 80 threads, NUMA, SATA SSD storage
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| sugov-HWP.des sugov-HWP.min powersave-HWP perfgov-HWP sugov-no-HWP better if
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| PERFORMANCE RATIOS
| tbench 1.00 1.89** 1.88** 1.89** 1.17** higher
| dbench 1.00 ~ 1.06 ~ ~ lower
| kernbench 1.00 ~ 1.14 ~ ~ lower
| gitsource 1.00 1.11 2.70 0.80** ~ lower
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| PERFORMANCE-PER-WATT RATIOS
| tbench 1.00 1.36** 1.38** 1.33** 1.04** higher
| dbench 1.00 ~ ~ ~ ~ higher
| kernbench 1.00 ~ ~ ~ ~ higher
| gitsource 1.00 1.36** 0.63 1.22** 1.02** higher


| 8x_COFFEELAKE_UMA: Intel Coffee Lake, 4 cores / 8 threads, UMA, NVMe SSD storage
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| sugov-HWP.des sugov-HWP.min powersave-HWP perfgov-HWP sugov-no-HWP better if
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| PERFORMANCE RATIOS
| tbench 1.00 1.27** 1.30** 1.30** 1.31** higher
| dbench 1.00 ~ 1.15 ~ ~ lower
| kernbench 1.00 ~ ~ ~ ~ lower
| gitsource 1.00 ~ 2.09 ~ ~ lower
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| PERFORMANCE-PER-WATT RATIOS
| tbench 1.00 ~ ~ ~ ~ higher
| dbench 1.00 ~ 1.82** ~ 1.22** higher
| kernbench 1.00 ~ ~ ~ ~ higher
| gitsource 1.00 ~ 1.56** ~ 1.17** higher


| 8x_SKYLAKE_UMA: Intel Skylake (client), 4 cores / 8 threads, UMA, SATA SSD storage
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| sugov-HWP.des sugov-HWP.min powersave-HWP perfgov-HWP sugov-no-HWP better if
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| PERFORMANCE RATIOS
| tbench 1.00 1.21** 1.22** 1.20** 1.06** higher
| dbench 1.00 ~ ~ ~ ~ lower
| kernbench 1.00 ~ ~ ~ ~ lower
| gitsource 1.00 ~ 1.71 0.96** ~ lower
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| PERFORMANCE-PER-WATT RATIOS
| tbench 1.00 1.11** 1.12** 1.10** 1.03** higher
| dbench 1.00 ~ ~ ~ ~ higher
| kernbench 1.00 ~ ~ ~ ~ higher
| gitsource 1.00 ~ 0.75 ~ ~ higher



Giovanni

2020-12-21 10:43:47

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor

On Thu, Dec 17, 2020 at 4:27 PM Doug Smythies <[email protected]> wrote:
>
> On 2020.12.14 12:02 Rafael J. Wysocki wrote:
>
> > Hi,
>
> Hi Rafael,
>
> V2 test results below are new, other results are partially re-stated:
>
> For readers that do not want to read on, I didn't find anything different than with
> the other versions. This was more just due diligence.

Thanks a lot for the data, much appreciated as always!

> Legend:
>
> hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq
> rfc (or rjw): Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpu-freq; schedutil
> no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpu-freq
> acpi (or acpi-cpufreq): Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
> patch: Kernel 5.10-rc7 + V1 patch set, HWP enabled; intel_cpu-freq; schedutil
> v2: Kernel 5.10-rc7 + V2 patch set, HWP enabled; intel_cpu-freq; schedutil
>
> Fixed work packet, fixed period, periodic workflow, load sweep up/down:
>
> load work/sleep frequency: 73 Hertz:
>
> hwp: Average: 12.00822 watts
> rjw: Average: 10.18089 watts
> no-hwp: Average: 10.21947 watts
> acpi-cpufreq: Average: 9.06585 watts
> patch: Average: 10.26060 watts
> v2: Average: 10.50444
>
> load work/sleep frequency: 113 Hertz:
>
> hwp: Average: 12.01056
> rjw: Average: 10.12303
> no-hwp: Average: 10.08228
> acpi-cpufreq: Average: 9.02215
> patch: Average: 10.27055
> v2: Average: 10.31097
>
> load work/sleep frequency: 211 Hertz:
>
> hwp: Average: 12.16067
> rjw: Average: 10.24413
> no-hwp: Average: 10.12463
> acpi-cpufreq: Average: 9.19175
> patch: Average: 10.33000
> v2: Average: 10.39811
>
> load work/sleep frequency: 347 Hertz:
>
> hwp: Average: 12.34169
> rjw: Average: 10.79980
> no-hwp: Average: 10.57296
> acpi-cpufreq: Average: 9.84709
> patch: Average: 10.67029
> v2: Average: 10.93143
>
> load work/sleep frequency: 401 Hertz:
>
> hwp: Average: 12.42562
> rjw: Average: 11.12465
> no-hwp: Average: 11.24203
> acpi-cpufreq: Average: 10.78670
> patch: Average: 10.94514
> v2: Average: 11.50324
>
>
> Serialized single threaded via PIDs per second method:
> A.K.A fixed work packet, variable period
> Results:
>
> Execution times (seconds. Less is better):
>
> no-hwp:
>
> performance: Samples: 382 ; Average: 10.54450 ; Stand Deviation: 0.01564 ; Maximum: 10.61000 ; Minimum: 10.50000
>
> schedutil: Samples: 293 ; Average: 13.73416 ; Stand Deviation: 0.73395 ; Maximum: 15.46000 ; Minimum: 11.68000
> acpi: Samples: 253 ; Average: 15.94889 ; Stand Deviation: 1.28219 ; Maximum: 18.66000 ; Minimum: 12.04000
>
> hwp:
>
> schedutil: Samples: 380 ; Average: 10.58287 ; Stand Deviation: 0.01864 ; Maximum: 10.64000 ; Minimum: 10.54000
> patch: Samples: 276 ; Average: 14.57029 ; Stand Deviation: 0.89771 ; Maximum: 16.04000 ; Minimum: 11.68000
> rfc: Samples: 271 ; Average: 14.86037 ; Stand Deviation: 0.84164 ; Maximum: 16.04000 ; Minimum: 12.21000
> v2: Samples: 274 ; Average: 14.67978 ; Stand Deviation: 1.03378 ; Maximum: 16.07000 ; Minimum: 11.43000
>
> Power (watts. More indicates higher CPU frequency and better performance. Sample time = 1 second.):
>
> no-hwp:
>
> performance: Samples: 4000 ; Average: 25.41355 ; Stand Deviation: 0.22156 ; Maximum: 26.01996 ; Minimum: 24.08807
>
> schedutil: Samples: 4000 ; Average: 12.58863 ; Stand Deviation: 5.48600 ; Maximum: 25.50934 ; Minimum: 7.54559
> acpi: Samples: 4000 ; Average: 9.57924 ; Stand Deviation: 5.41157 ; Maximum: 25.06366 ; Minimum: 5.51129
>
> hwp:
>
> schedutil: Samples: 4000 ; Average: 25.24245 ; Stand Deviation: 0.19539 ; Maximum: 25.93671 ; Minimum: 24.14746
> patch: Samples: 4000 ; Average: 11.07225 ; Stand Deviation: 5.63142 ; Maximum: 24.99493 ; Minimum: 3.67548
> rfc: Samples: 4000 ; Average: 10.35842 ; Stand Deviation: 4.77915 ; Maximum: 24.95953 ; Minimum: 7.26202
> v2: Samples: 4000 ; Average: 10.98284 ; Stand Deviation: 5.48859 ; Maximum: 25.76331 ; Minimum: 7.53790
>
>

2020-12-21 16:14:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor

Hi,

On Fri, Dec 18, 2020 at 5:22 PM Giovanni Gherdovich
<[email protected]> wrote:
>
> On Mon, 2020-12-14 at 21:01 +0100, Rafael J. Wysocki wrote:
> > Hi,
> >
> > The timing of this is not perfect (sorry about that), but here's a refresh
> > of this series.
> >
> > The majority of the previous cover letter still applies:
> > [...]
>
> Hello,
>
> the series is tested using
>
> -> tbench (packets processing with loopback networking, measures throughput)
> -> dbench (filesystem operations, measures average latency)
> -> kernbench (kernel compilation, elapsed time)
> -> and gitsource (long-running shell script, elapsed time)
>
> These are chosen because none of them is bound by compute and all are
> sensitive to freq scaling decisions. The machines are a Cascade Lake based
> server, a client Skylake and a Coffee Lake laptop.

First of all, many thanks for the results!

Any test results input is always much appreciated for all of the
changes under consideration.

> What's being compared:
>
> sugov-HWP.desired : the present series; intel_pstate=passive, governor=schedutil
> sugov-HWP.min : mainline; intel_pstate=passive, governor=schedutil
> powersave-HWP : mainline; intel_pstate=active, governor=powersave
> perfgov-HWP : mainline; intel_pstate=active, governor=performance
> sugov-no-HWP : HWP disabled; intel_pstate=passive, governor=schedutil
>
> Dbench and Kernbench have neutral results, but Tbench has sugov-HWP.desired
> lose in both performance and performance-per-watt, while Gitsource show the
> series as faster in raw performance but again worse than the competition in
> efficiency.

Well, AFAICS tbench "likes" high turbo and is sensitive to the
response time (as indicated by the fact that it is also sensitive to
the polling limit value in cpuidle).

Using the target perf to set HWP_REQ.DESIRED (instead of using it to
set HWP_REQ.MIN) generally causes the turbo to be less aggressive and
the response time to go up, so the tbench result is not a surprise at
all. This case represents the tradeoff being made here (as noted by
Doug in one of his previous messages).

The gitsource result is a bit counter-intuitive, but my conclusions
drawn from it are quite different from yours (more on that below).

> 1. SUMMARY BY BENCHMARK
> 1.1. TBENCH
> 1.2. DBENCH
> 1.3. KERNBENCH
> 1.4. GITSOURCE
> 2. SUMMARY BY USER PROFILE
> 2.1. PERFORMANCE USER: what if I switch pergov -> schedutil?
> 2.2. DEFAULT USER: what if I switch powersave -> schedutil?
> 2.3. DEVELOPER: what if I switch sugov-HWP.min -> sugov-HWP.desired?
> 3. RESULTS TABLES
> PERFORMANCE RATIOS
> PERFORMANCE-PER-WATT RATIOS
>
>
> 1. SUMMARY BY BENCHMARK
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> Tbench: sugov-HWP.desired is the worst performance on all three
> machines. sugov-HWP.min is between 20% and 90% better. The baseline
> sugov-HWP.desired offers a lower throughput, but does it increase
> efficiency? It actually doesn't: on two out of three machines the
> incumbent code (current sugov, or intel_pstate=active) has 10% to 35%
> better efficiency. In other word, the status quo is both faster and more
> efficient than the proposed series on this benchmark.
> The absolute power consumption is lower, but the delivered performance is
> "even more lower", and that's why performance-per-watt shows a net loss.

This benchmark is best off when run under the performance governor and
the observation that sugov-HWP.min is almost as good as the
performance governor for it is a consequence of a bias towards
performance in the former (which need not be regarded as a good
thing).

The drop in energy-efficiency is somewhat disappointing, but not
entirely unexpected too.

> Dbench: generally neutral, in both performance and efficiency. Powersave is
> occasionally behind the pack in performance, 5% to 15%. A 15% performance
> loss on the Coffe Lake is compensated by an 80% improved efficiency. To be
> noted that on the same Coffee Lake sugov-no-HWP is 20% ahead of the pack
> in efficiency.
>
> Kernbench: neutral, in both performance and efficiency. powersave looses 14%
> to the pack in performance on the Cascade Lake.
>
> Gitsource: this test show the most compelling case against the
> sugov-HWP.desired series: on the Cascade Lake sugov-HWP.desired is 10%
> faster than sugov-HWP.min (it was expected to be slower!) and 35% less
> efficient (we expected more performance-per-watt, not less).

This is a bit counter-intuitive, so it is good to try to understand
what's going on instead of drawing conclusions right away from pure
numbers.

My interpretation of the available data is that gitsource benefits
from the "race-to-idle" effect in terms of energy-efficiency which
also causes it to suffer in terms of performance. Namely, completing
the given piece of work faster causes some CPU idle time to become
available and that effectively reduces power, but it also increases
the response time (by the idle state exit latency) which causes
performance to drop. Whether or not this effect can be present depends
on what CPU idle states are available etc. and it may be a pure
coincidence.

What sugov-HWP.desired really does is to bias the frequency towards
whatever is perceived by schedutil as sufficient to run the workload
(which is a key property of it - see below) and it appears to do the
job here quite well, but it eliminates the "race-to-idle" effect that
the workload benefited from originally and, like it or not, schedutil
cannot know about that effect.

That effect can only be present if the frequencies used for running
the workload are too high and by a considerable margin (sufficient for
a deep enough idle state to be entered). In some cases running the
workload too fast helps (like in this one, although this time it
happens to hurt performance), but in some cases it really hurts
energy-efficiency and the question is whether or not this should be
always done.

There is a whole broad category of workloads involving periodic tasks
that do the same amount of work in every period regardless of the
frequency they run at (as long as the frequency is sufficient to avoid
"overrunning" the period) and they almost never benefit from
"race-to-idle".There is zero benefit from running them too fast and
the energy-efficiency goes down the sink when that happens.

Now the problem is that with sugov-HWP.min the users who care about
these workloads don't even have an option to use the task utilization
history recorded by the scheduler to bias the frequency towards the
"sufficient" level, because sugov-HWP.min only sets a lower bound on
the frequency selection to improve the situation, so the choice
between it and sugov-HWP.desired boils down to whether or not to give
that option to them and my clear preference is for that option to
exist. Sorry about that. [Note that it really is an option, though,
because "pure" HWP is still the default for HWP-enabled systems.]

It may be possible to restore some "race-to-idle" benefits by tweaking
HWP_REQ.EPP in the future, but that needs to be investigated.

BTW, what EPP value was there on the system where you saw better
performance under sugov-HWP.desired? If it was greater than zero, it
would be useful to decrease EPP (by adjusting the
energy_performance_preference attributes in sysfs for all CPUs) and
see what happens to the performance difference then.


>
> 2. SUMMARY BY USER PROFILE
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> If I was a perfgov-HWP user, I would be 20%-90% faster than with other governors
> on tbench an gitsource. This speed gap comes with an unexpected efficiency
> bonus on both test. Since dbench and kernbench have a flat profile across the
> board, there is no incentive to try another governor.
>
> If I was a powersave-HWP user, I'd be the slower of the bunch. The lost
> performance is not, in general, balanced by better efficiency. This only
> happens on Coffee Lake, which is a CPU for the mobile market and possibly HWP
> has efficiency-oriented tuning there. Any flavor of schedutil would be an
> improvement.
>
> From a developer perspective, the obstacles to move from HWP.min to
> HWP.desired are tbench, where HWP.desired is worse than having no HWP support
> at all, and gitsource, where HWP.desired has the opposite properties than
> those advertised (it's actually faster but less efficient).
>
>
> 3. RESULTS TABLES
> ~~~~~~~~~~~~~~~~~
>
> Tilde (~) means the result is the same as baseline (or, the ratio is close to 1).
> The double asterisk (**) is a visual aid and means the result is better than
> baseline (higher or lower depending on the case).
>
>
> | 80x_CASCADELAKE_NUMA: Intel Cascade Lake, 40 cores / 80 threads, NUMA, SATA SSD storage
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | sugov-HWP.des sugov-HWP.min powersave-HWP perfgov-HWP sugov-no-HWP better if
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | PERFORMANCE RATIOS
> | tbench 1.00 1.89** 1.88** 1.89** 1.17** higher
> | dbench 1.00 ~ 1.06 ~ ~ lower
> | kernbench 1.00 ~ 1.14 ~ ~ lower
> | gitsource 1.00 1.11 2.70 0.80** ~ lower
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | PERFORMANCE-PER-WATT RATIOS
> | tbench 1.00 1.36** 1.38** 1.33** 1.04** higher
> | dbench 1.00 ~ ~ ~ ~ higher
> | kernbench 1.00 ~ ~ ~ ~ higher
> | gitsource 1.00 1.36** 0.63 1.22** 1.02** higher
>
>
> | 8x_COFFEELAKE_UMA: Intel Coffee Lake, 4 cores / 8 threads, UMA, NVMe SSD storage
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | sugov-HWP.des sugov-HWP.min powersave-HWP perfgov-HWP sugov-no-HWP better if
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | PERFORMANCE RATIOS
> | tbench 1.00 1.27** 1.30** 1.30** 1.31** higher
> | dbench 1.00 ~ 1.15 ~ ~ lower
> | kernbench 1.00 ~ ~ ~ ~ lower
> | gitsource 1.00 ~ 2.09 ~ ~ lower
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | PERFORMANCE-PER-WATT RATIOS
> | tbench 1.00 ~ ~ ~ ~ higher
> | dbench 1.00 ~ 1.82** ~ 1.22** higher
> | kernbench 1.00 ~ ~ ~ ~ higher
> | gitsource 1.00 ~ 1.56** ~ 1.17** higher
>
>
> | 8x_SKYLAKE_UMA: Intel Skylake (client), 4 cores / 8 threads, UMA, SATA SSD storage
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | sugov-HWP.des sugov-HWP.min powersave-HWP perfgov-HWP sugov-no-HWP better if
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | PERFORMANCE RATIOS
> | tbench 1.00 1.21** 1.22** 1.20** 1.06** higher
> | dbench 1.00 ~ ~ ~ ~ lower
> | kernbench 1.00 ~ ~ ~ ~ lower
> | gitsource 1.00 ~ 1.71 0.96** ~ lower
> + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> | PERFORMANCE-PER-WATT RATIOS
> | tbench 1.00 1.11** 1.12** 1.10** 1.03** higher
> | dbench 1.00 ~ ~ ~ ~ higher
> | kernbench 1.00 ~ ~ ~ ~ higher
> | gitsource 1.00 ~ 0.75 ~ ~ higher

2020-12-23 13:09:18

by Giovanni Gherdovich

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor

On Mon, 2020-12-21 at 17:11 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> On Fri, Dec 18, 2020 at 5:22 PM Giovanni Gherdovich wrote:
> >
> > Gitsource: this test show the most compelling case against the
> > sugov-HWP.desired series: on the Cascade Lake sugov-HWP.desired is 10%
> > faster than sugov-HWP.min (it was expected to be slower!) and 35% less
> > efficient (we expected more performance-per-watt, not less).
>
> This is a bit counter-intuitive, so it is good to try to understand
> what's going on instead of drawing conclusions right away from pure
> numbers.
>
> My interpretation of the available data is that gitsource benefits
> from the "race-to-idle" effect in terms of energy-efficiency which
> also causes it to suffer in terms of performance. Namely, completing
> the given piece of work faster causes some CPU idle time to become
> available and that effectively reduces power, but it also increases
> the response time (by the idle state exit latency) which causes
> performance to drop. Whether or not this effect can be present depends
> on what CPU idle states are available etc. and it may be a pure
> coincidence.
>
> [snip]

Right, race-to-idle might explain the increased efficiency of HWP.MIN.
As you note, increased exit latencies from idle can also explain the overall
performance difference.

> There is a whole broad category of workloads involving periodic tasks
> that do the same amount of work in every period regardless of the
> frequency they run at (as long as the frequency is sufficient to avoid
> "overrunning" the period) and they almost never benefit from
> "race-to-idle".There is zero benefit from running them too fast and
> the energy-efficiency goes down the sink when that happens.
>
> Now the problem is that with sugov-HWP.min the users who care about
> these workloads don't even have an option to use the task utilization
> history recorded by the scheduler to bias the frequency towards the
> "sufficient" level, because sugov-HWP.min only sets a lower bound on
> the frequency selection to improve the situation, so the choice
> between it and sugov-HWP.desired boils down to whether or not to give
> that option to them and my clear preference is for that option to
> exist. Sorry about that. [Note that it really is an option, though,
> because "pure" HWP is still the default for HWP-enabled systems.]

Sure, the periodic workloads benefit from this patch, Doug's test shows that.

I guess I'm still confused by the difference between setting HWP.DESIRED and
disabling HWP completely. The Intel manual says that a non-zero HWP.DESIRED
"effectively disabl[es] HW autonomous selection", but then continues with "The
Desired_Performance input is non-constraining in terms of Performance and
Energy optimizations, which are independently controlled". The first
statement sounds like HWP is out of the picture (no more autonomous
frequency selections) but the latter part implies there are other
optimizations still available. I'm not sure how to reason about that.

> It may be possible to restore some "race-to-idle" benefits by tweaking
> HWP_REQ.EPP in the future, but that needs to be investigated.
>
> BTW, what EPP value was there on the system where you saw better
> performance under sugov-HWP.desired? If it was greater than zero, it
> would be useful to decrease EPP (by adjusting the
> energy_performance_preference attributes in sysfs for all CPUs) and
> see what happens to the performance difference then.

For sugov-HWP.desired the EPP was 0x80 (the default value).


Giovanni

2020-12-29 01:25:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor

On Wed, Dec 23, 2020 at 2:08 PM Giovanni Gherdovich
<[email protected]> wrote:
>
> On Mon, 2020-12-21 at 17:11 +0100, Rafael J. Wysocki wrote:
> > Hi,
> >
> > On Fri, Dec 18, 2020 at 5:22 PM Giovanni Gherdovich wrote:
> > >
> > > Gitsource: this test show the most compelling case against the
> > > sugov-HWP.desired series: on the Cascade Lake sugov-HWP.desired is 10%
> > > faster than sugov-HWP.min (it was expected to be slower!) and 35% less
> > > efficient (we expected more performance-per-watt, not less).
> >
> > This is a bit counter-intuitive, so it is good to try to understand
> > what's going on instead of drawing conclusions right away from pure
> > numbers.
> >
> > My interpretation of the available data is that gitsource benefits
> > from the "race-to-idle" effect in terms of energy-efficiency which
> > also causes it to suffer in terms of performance. Namely, completing
> > the given piece of work faster causes some CPU idle time to become
> > available and that effectively reduces power, but it also increases
> > the response time (by the idle state exit latency) which causes
> > performance to drop. Whether or not this effect can be present depends
> > on what CPU idle states are available etc. and it may be a pure
> > coincidence.
> >
> > [snip]
>
> Right, race-to-idle might explain the increased efficiency of HWP.MIN.
> As you note, increased exit latencies from idle can also explain the overall
> performance difference.
>
> > There is a whole broad category of workloads involving periodic tasks
> > that do the same amount of work in every period regardless of the
> > frequency they run at (as long as the frequency is sufficient to avoid
> > "overrunning" the period) and they almost never benefit from
> > "race-to-idle".There is zero benefit from running them too fast and
> > the energy-efficiency goes down the sink when that happens.
> >
> > Now the problem is that with sugov-HWP.min the users who care about
> > these workloads don't even have an option to use the task utilization
> > history recorded by the scheduler to bias the frequency towards the
> > "sufficient" level, because sugov-HWP.min only sets a lower bound on
> > the frequency selection to improve the situation, so the choice
> > between it and sugov-HWP.desired boils down to whether or not to give
> > that option to them and my clear preference is for that option to
> > exist. Sorry about that. [Note that it really is an option, though,
> > because "pure" HWP is still the default for HWP-enabled systems.]
>
> Sure, the periodic workloads benefit from this patch, Doug's test shows that.
>
> I guess I'm still confused by the difference between setting HWP.DESIRED and
> disabling HWP completely. The Intel manual says that a non-zero HWP.DESIRED
> "effectively disabl[es] HW autonomous selection", but then continues with "The
> Desired_Performance input is non-constraining in terms of Performance and
> Energy optimizations, which are independently controlled". The first
> statement sounds like HWP is out of the picture (no more autonomous
> frequency selections) but the latter part implies there are other
> optimizations still available. I'm not sure how to reason about that.

For example, if HWP_REQ.DESIRED is set below the point of maximum
energy-efficiency that is known to the processor, it is allowed to go
for the max energy-efficiency instead of following the hint.
Likewise, if the hint is above the P-state corresponding to the max
performance in the given conditions (i.e. increasing the frequency is
not likely to result in better performance due to some limitations
known to the processor), the processor is allowed to set that P-state
instead of following the hint.

Generally speaking, the processor may not follow the hint if better
results can be achieved by putting the given CPU into a P-state
different from the requested one.

> > It may be possible to restore some "race-to-idle" benefits by tweaking
> > HWP_REQ.EPP in the future, but that needs to be investigated.
> >
> > BTW, what EPP value was there on the system where you saw better
> > performance under sugov-HWP.desired? If it was greater than zero, it
> > would be useful to decrease EPP (by adjusting the
> > energy_performance_preference attributes in sysfs for all CPUs) and
> > see what happens to the performance difference then.
>
> For sugov-HWP.desired the EPP was 0x80 (the default value).

So it would be worth testing with EPP=0x20 (or even lower).

Lowering the EPP should cause the processor to ramp up turbo
frequencies faster and it may also allow higher turbo frequencies to
be used.