2021-05-21 20:22:39

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH v2 0/3] EM / PM: Inefficient OPPs

We (Power team in Arm) are working with an experimental kernel for the
Google's Pixel4 to evaluate and improve the current mainline performance
and energy consumption on a real life device with Android.

The SD855 SoC found in this phone has several OPPs that are inefficient.
I.e. despite a lower frequency, they have a greater cost. (That cost being
fmax * OPP power / OPP freq). This issue is twofold. First of course,
running a specific workload at an inefficient OPP is counterproductive
since it wastes wasting energy. But also, inefficient OPPs make a
performance domain less appealing for task placement than it really is.

We evaluated the change presented here by running 30 iterations of Android
PCMark "Work 2.0 Performance". While we did not see any statistically
significant performance impact, this change allowed to drastically improve
the idle time residency.


| Running | WFI [1] | Idle |
------------------------+-------------+-----------+-----------+
Little cluster (4 CPUs) | -0.35% | +0.35% | +0.79% |
------------------------+-------------+-----------+-----------+
Medium cluster (3 CPUs) | -6.3% | -18% | +12% |
------------------------+-------------+-----------+-----------+
Big cluster (1 CPU) | -6.4% | -6.5% | +2.8% |
------------------------+-------------+-----------+-----------+

On the SD855, the inefficient OPPs are found on the little cluster. By
removing them from the Energy Model, we make the most efficient CPUs more
appealing for task placement, helping to reduce the running time for the
medium and big CPUs. Increasing idle time is crucial for this platform due
to the substantial energy cost differences among the clusters. Also,
despite not appearing in the statistics (the idle driver used here doesn't
report it), we can speculate that we also improve the cluster idle time.

[1] WFI: Wait for interrupt.

Changelog since v1:
- Remove the Look-up table as the numbers weren't strong enough to justify the
implementation.
- Split the patch.

Vincent Donnefort (3):
PM / EM: Fix inefficient state detection
PM / EM: Extend em_perf_domain with a flag field
PM / EM: Skip inefficient OPPs

include/linux/energy_model.h | 95 ++++++++++++++++++++++++++++++++++++----
kernel/power/energy_model.c | 34 +++++++-------
kernel/sched/cpufreq_schedutil.c | 4 ++
3 files changed, 106 insertions(+), 27 deletions(-)

--
2.7.4


2021-05-21 20:22:39

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH v2 2/3] PM / EM: Extend em_perf_domain with a flag field

Merge the current "milliwatts" option into a "flag" field. This intends to
prepare the extension of this structure for inefficient states support in
the Energy Model.

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 757fc60..9be7bde 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -28,8 +28,7 @@ struct em_perf_state {
* em_perf_domain - Performance domain
* @table: List of performance states, in ascending order
* @nr_perf_states: Number of performance states
- * @milliwatts: Flag indicating the power values are in milli-Watts
- * or some other scale.
+ * @flags: See "em_perf_domain flags"
* @cpus: Cpumask covering the CPUs of the domain. It's here
* for performance reasons to avoid potential cache
* misses during energy calculations in the scheduler
@@ -44,10 +43,18 @@ struct em_perf_state {
struct em_perf_domain {
struct em_perf_state *table;
int nr_perf_states;
- int milliwatts;
+ int flags;
unsigned long cpus[];
};

+/*
+ * em_perf_domain flags:
+ *
+ * EM_PERF_DOMAIN_MILLIWATTS: The power values are in milli-Watts or some
+ * other scale.
+ */
+#define EM_PERF_DOMAIN_MILLIWATTS BIT(0)
+
#define em_span_cpus(em) (to_cpumask((em)->cpus))

#ifdef CONFIG_ENERGY_MODEL
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index c4871a8..638581c 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -55,7 +55,8 @@ DEFINE_SHOW_ATTRIBUTE(em_debug_cpus);
static int em_debug_units_show(struct seq_file *s, void *unused)
{
struct em_perf_domain *pd = s->private;
- char *units = pd->milliwatts ? "milliWatts" : "bogoWatts";
+ char *units = (pd->flags & EM_PERF_DOMAIN_MILLIWATTS) ?
+ "milliWatts" : "bogoWatts";

seq_printf(s, "%s\n", units);

@@ -341,7 +342,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
if (ret)
goto unlock;

- dev->em_pd->milliwatts = milliwatts;
+ if (milliwatts)
+ dev->em_pd->flags |= EM_PERF_DOMAIN_MILLIWATTS;

em_debug_create_pd(dev);
dev_info(dev, "EM: created perf domain\n");
--
2.7.4

2021-05-21 20:22:39

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

Some SoCs, such as the sd855 have OPPs within the same performance domain,
whose cost is higher than others with a higher frequency. Even though
those OPPs are interesting from a cooling perspective, it makes no sense
to use them when the device can run at full capacity. Those OPPs handicap
the performance domain, when choosing the most energy-efficient CPU and
are wasting energy. They are inefficient.

Hence, add support for such OPPs to the Energy Model. The table can now
be read skipping inefficient performance states (and by extension,
inefficient OPPs).

Currently, the efficient table is used in two paths. Schedutil, and
find_energy_efficient_cpu(). We have to modify both paths in the same
patch so they stay synchronized. The thermal framework still relies on
the full table.

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 9be7bde..daaeccf 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -17,13 +17,25 @@
* device). It can be a total power: static and dynamic.
* @cost: The cost coefficient associated with this level, used during
* energy calculation. Equal to: power * max_frequency / frequency
+ * @flags: see "em_perf_state flags" description below.
*/
struct em_perf_state {
unsigned long frequency;
unsigned long power;
unsigned long cost;
+ unsigned long flags;
};

+/*
+ * em_perf_state flags:
+ *
+ * EM_PERF_STATE_INEFFICIENT: The performance state is inefficient. There is
+ * in this em_perf_domain, another performance state with a higher frequency
+ * but a lower or equal power cost. Such inefficient states are ignored when
+ * using em_pd_get_efficient_*() functions.
+ */
+#define EM_PERF_STATE_INEFFICIENT BIT(0)
+
/**
* em_perf_domain - Performance domain
* @table: List of performance states, in ascending order
@@ -52,8 +64,12 @@ struct em_perf_domain {
*
* EM_PERF_DOMAIN_MILLIWATTS: The power values are in milli-Watts or some
* other scale.
+ *
+ * EM_PERF_DOMAIN_INEFFICIENCIES: This perf domain contains inefficient perf
+ * states.
*/
#define EM_PERF_DOMAIN_MILLIWATTS BIT(0)
+#define EM_PERF_DOMAIN_INEFFICIENCIES BIT(1)

#define em_span_cpus(em) (to_cpumask((em)->cpus))

@@ -93,6 +109,58 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
void em_dev_unregister_perf_domain(struct device *dev);

/**
+ * em_pd_get_efficient_state() - Get an efficient performance state from the EM
+ * @pd : Performance domain for which we want an efficient frequency
+ * @freq : Frequency to map with the EM
+ *
+ * It is called from the scheduler code quite frequently and as a consequence
+ * doesn't implement any check.
+ *
+ * Return: An efficient performance state, high enough to meet @freq
+ * requirement.
+ */
+static inline
+struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
+ unsigned long freq)
+{
+ struct em_perf_state *ps;
+ int i;
+
+ for (i = 0; i < pd->nr_perf_states; i++) {
+ ps = &pd->table[i];
+ if (ps->flags & EM_PERF_STATE_INEFFICIENT)
+ continue;
+ if (ps->frequency >= freq)
+ break;
+ }
+
+ return ps;
+}
+
+/**
+ * em_pd_get_efficient_freq() - Get the efficient frequency from the EM
+ * @pd : Performance domain for which we want an efficient frequency
+ * @freq : Frequency to map with the EM
+ *
+ * This function will return @freq if no inefficiencies have been found for
+ * that @pd. This is to avoid a useless table walk.
+ *
+ * Return: An efficient frequency, high enough to meet @freq requirement.
+ */
+static inline unsigned long em_pd_get_efficient_freq(struct em_perf_domain *pd,
+ unsigned long freq)
+{
+ struct em_perf_state *ps;
+
+ if (!pd || !(pd->flags & EM_PERF_DOMAIN_INEFFICIENCIES))
+ return freq;
+
+ ps = em_pd_get_efficient_state(pd, freq);
+
+ return ps->frequency;
+}
+
+/**
* em_cpu_energy() - Estimates the energy consumed by the CPUs of a
performance domain
* @pd : performance domain for which energy has to be estimated
@@ -111,7 +179,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
{
unsigned long freq, scale_cpu;
struct em_perf_state *ps;
- int i, cpu;
+ int cpu;

if (!sum_util)
return 0;
@@ -130,11 +198,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
* Find the lowest performance state of the Energy Model above the
* requested frequency.
*/
- for (i = 0; i < pd->nr_perf_states; i++) {
- ps = &pd->table[i];
- if (ps->frequency >= freq)
- break;
- }
+ ps = em_pd_get_efficient_state(pd, freq);

/*
* The capacity of a CPU in the domain at the performance state (ps)
@@ -224,6 +288,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
{
return 0;
}
+
+static inline unsigned long
+em_pd_get_efficient_freq(struct em_perf_domain *pd, unsigned long freq)
+{
+ return freq;
+}
#endif

#endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 638581c..af8d54f 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -2,7 +2,7 @@
/*
* Energy Model of devices
*
- * Copyright (c) 2018-2020, Arm ltd.
+ * Copyright (c) 2018-2021, Arm ltd.
* Written by: Quentin Perret, Arm ltd.
* Improvements provided by: Lukasz Luba, Arm ltd.
*/
@@ -42,6 +42,7 @@ static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
debugfs_create_ulong("power", 0444, d, &ps->power);
debugfs_create_ulong("cost", 0444, d, &ps->cost);
+ debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
}

static int em_debug_cpus_show(struct seq_file *s, void *unused)
@@ -161,6 +162,8 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
table[i].cost = div64_u64(fmax * table[i].power,
table[i].frequency);
if (table[i].cost >= prev_cost) {
+ table[i].flags = EM_PERF_STATE_INEFFICIENT;
+ pd->flags |= EM_PERF_DOMAIN_INEFFICIENCIES;
dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
table[i].frequency);
} else {
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4f09afd..5a91a2b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -10,6 +10,7 @@

#include "sched.h"

+#include <linux/energy_model.h>
#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>

@@ -153,6 +154,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,

freq = map_util_freq(util, freq, max);

+ /* Avoid inefficient performance states */
+ freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq);
+
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;

--
2.7.4

2021-05-24 12:45:45

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] PM / EM: Extend em_perf_domain with a flag field



On 5/21/21 5:54 PM, Vincent Donnefort wrote:
> Merge the current "milliwatts" option into a "flag" field. This intends to
> prepare the extension of this structure for inefficient states support in
> the Energy Model.
>
> Signed-off-by: Vincent Donnefort <[email protected]>
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 757fc60..9be7bde 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h

Reviewed-by: Lukasz Luba <[email protected]>

2021-05-24 12:57:54

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs



On 5/21/21 5:54 PM, Vincent Donnefort wrote:
> Some SoCs, such as the sd855 have OPPs within the same performance domain,
> whose cost is higher than others with a higher frequency. Even though
> those OPPs are interesting from a cooling perspective, it makes no sense
> to use them when the device can run at full capacity. Those OPPs handicap
> the performance domain, when choosing the most energy-efficient CPU and
> are wasting energy. They are inefficient.
>
> Hence, add support for such OPPs to the Energy Model. The table can now
> be read skipping inefficient performance states (and by extension,
> inefficient OPPs).
>
> Currently, the efficient table is used in two paths. Schedutil, and
> find_energy_efficient_cpu(). We have to modify both paths in the same
> patch so they stay synchronized. The thermal framework still relies on
> the full table.
>
> Signed-off-by: Vincent Donnefort <[email protected]>
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 9be7bde..daaeccf 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -17,13 +17,25 @@
> * device). It can be a total power: static and dynamic.
> * @cost: The cost coefficient associated with this level, used during
> * energy calculation. Equal to: power * max_frequency / frequency
> + * @flags: see "em_perf_state flags" description below.
> */
> struct em_perf_state {
> unsigned long frequency;
> unsigned long power;
> unsigned long cost;
> + unsigned long flags;

Maybe for now, we can have 'bool' here?
We would avoid *Num_opps of 'and' operations below

[snip]

> +static inline
> +struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
> + unsigned long freq)
> +{
> + struct em_perf_state *ps;
> + int i;
> +
> + for (i = 0; i < pd->nr_perf_states; i++) {
> + ps = &pd->table[i];
> + if (ps->flags & EM_PERF_STATE_INEFFICIENT)

Here, we can avoid this *N of '&', when having a simple bool

> + continue;
> + if (ps->frequency >= freq)
> + break;
> + }
> +
> + return ps;
> +}
> +

[snip

> +static inline unsigned long em_pd_get_efficient_freq(struct em_perf_domain *pd,
> + unsigned long freq)
> +{
> + struct em_perf_state *ps;
> +
> + if (!pd || !(pd->flags & EM_PERF_DOMAIN_INEFFICIENCIES))

This one is OK, since we have two features for this 'flags' now.

The rest looks good.

Regards,
Lukasz

2021-05-25 09:34:42

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On Friday 21 May 2021 at 17:54:24 (+0100), Vincent Donnefort wrote:
> @@ -161,6 +162,8 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> table[i].cost = div64_u64(fmax * table[i].power,
> table[i].frequency);
> if (table[i].cost >= prev_cost) {
> + table[i].flags = EM_PERF_STATE_INEFFICIENT;
> + pd->flags |= EM_PERF_DOMAIN_INEFFICIENCIES;

If we're looking for micro-optimizations, then perhaps you could store
the index of the next efficient OPP (which would be 'i' if the current
OPP is already efficient), so you can jump to it directly when doing the
search.

> dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
> table[i].frequency);
> } else {

2021-05-25 09:48:28

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On Tue, May 25, 2021 at 09:33:01AM +0000, Quentin Perret wrote:
> On Friday 21 May 2021 at 17:54:24 (+0100), Vincent Donnefort wrote:
> > @@ -161,6 +162,8 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> > table[i].cost = div64_u64(fmax * table[i].power,
> > table[i].frequency);
> > if (table[i].cost >= prev_cost) {
> > + table[i].flags = EM_PERF_STATE_INEFFICIENT;
> > + pd->flags |= EM_PERF_DOMAIN_INEFFICIENCIES;
>
> If we're looking for micro-optimizations, then perhaps you could store
> the index of the next efficient OPP (which would be 'i' if the current
> OPP is already efficient), so you can jump to it directly when doing the
> search.

Wouldn't add any new field compared to this version so yeah it seems an
interesting improvement I could add for a next version.

>
> > dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
> > table[i].frequency);
> > } else {

2021-05-25 09:57:22

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] PM / EM: Extend em_perf_domain with a flag field

On Friday 21 May 2021 at 17:54:23 (+0100), Vincent Donnefort wrote:
> Merge the current "milliwatts" option into a "flag" field. This intends to
> prepare the extension of this structure for inefficient states support in
> the Energy Model.
>
> Signed-off-by: Vincent Donnefort <[email protected]>

Reviewed-by: Quentin Perret <[email protected]>

2021-05-25 10:16:53

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On Tue, May 25, 2021 at 10:48:23AM +0200, Peter Zijlstra wrote:
> On Fri, May 21, 2021 at 05:54:24PM +0100, Vincent Donnefort wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 4f09afd..5a91a2b 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -10,6 +10,7 @@
> >
> > #include "sched.h"
> >
> > +#include <linux/energy_model.h>
> > #include <linux/sched/cpufreq.h>
> > #include <trace/events/power.h>
> >
> > @@ -153,6 +154,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >
> > freq = map_util_freq(util, freq, max);
> >
> > + /* Avoid inefficient performance states */
> > + freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq);
> > +
> > if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > return sg_policy->next_freq;
> >
>
> This seems somewhat unfortunate, it adds a loop over the OPPs only to
> then call into cpufreq to do the exact same thing again :/

Indeed, but it would be complicated to avoid the double loop:

It is possible to register OPPs (and by extension perf_states) for a
frequency for which, the cpufreq table entry is marked with
CPUFREQ_ENTRY_INVALID. It would probably be an issue that would have to be
fixed in the driver, but it is currently allowed.

More importantly, while resolving the frequency, we also cache the index in
cached_resolved_idx. Some drivers, such as qcom-cpufreq-hw rely on this
value for their fastswitch support.

Notice though, we would iterate over the EM only in the case where the
performance state has found inefficiencies.

2021-05-25 10:30:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On Fri, May 21, 2021 at 05:54:24PM +0100, Vincent Donnefort wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 4f09afd..5a91a2b 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -10,6 +10,7 @@
>
> #include "sched.h"
>
> +#include <linux/energy_model.h>
> #include <linux/sched/cpufreq.h>
> #include <trace/events/power.h>
>
> @@ -153,6 +154,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>
> freq = map_util_freq(util, freq, max);
>
> + /* Avoid inefficient performance states */
> + freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq);
> +
> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> return sg_policy->next_freq;
>

This seems somewhat unfortunate, it adds a loop over the OPPs only to
then call into cpufreq to do the exact same thing again :/

2021-05-25 10:33:35

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On Tue, May 25, 2021 at 09:33:01AM +0000, Quentin Perret wrote:
> On Friday 21 May 2021 at 17:54:24 (+0100), Vincent Donnefort wrote:
> > @@ -161,6 +162,8 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> > table[i].cost = div64_u64(fmax * table[i].power,
> > table[i].frequency);
> > if (table[i].cost >= prev_cost) {
> > + table[i].flags = EM_PERF_STATE_INEFFICIENT;
> > + pd->flags |= EM_PERF_DOMAIN_INEFFICIENCIES;
>
> If we're looking for micro-optimizations, then perhaps you could store
> the index of the next efficient OPP (which would be 'i' if the current
> OPP is already efficient), so you can jump to it directly when doing the
> search.

Wouldn't add any new field compared to this version so yeah might be an
interesting improvement.

>
> > dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
> > table[i].frequency);
> > } else {

2021-05-25 11:09:59

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs



On 5/25/21 10:46 AM, Vincent Donnefort wrote:
> On Tue, May 25, 2021 at 09:33:01AM +0000, Quentin Perret wrote:
>> On Friday 21 May 2021 at 17:54:24 (+0100), Vincent Donnefort wrote:
>>> @@ -161,6 +162,8 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>>> table[i].cost = div64_u64(fmax * table[i].power,
>>> table[i].frequency);
>>> if (table[i].cost >= prev_cost) {
>>> + table[i].flags = EM_PERF_STATE_INEFFICIENT;
>>> + pd->flags |= EM_PERF_DOMAIN_INEFFICIENCIES;
>>
>> If we're looking for micro-optimizations, then perhaps you could store
>> the index of the next efficient OPP (which would be 'i' if the current
>> OPP is already efficient), so you can jump to it directly when doing the
>> search.
>
> Wouldn't add any new field compared to this version so yeah might be an
> interesting improvement.
>

That's a few more instructions to parse the 'flags' filed. I'm not sure
if that brings speed improvements vs. if we not parse and have bool
filed with a simple looping. The out-of-order core might even suffer
from this parsing and loop index manipulations...

2021-05-25 13:11:14

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

Hi Lukasz,

On Tuesday 25 May 2021 at 12:03:14 (+0100), Lukasz Luba wrote:
> That's a few more instructions to parse the 'flags' filed. I'm not sure
> if that brings speed improvements vs. if we not parse and have bool
> filed with a simple looping. The out-of-order core might even suffer
> from this parsing and loop index manipulations...

I'm not sure what you mean about parsing here? I'm basically suggesting
to do something along the lines of:

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index daaeccfb9d6e..f02de32d2325 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -128,13 +128,11 @@ struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,

for (i = 0; i < pd->nr_perf_states; i++) {
ps = &pd->table[i];
- if (ps->flags & EM_PERF_STATE_INEFFICIENT)
- continue;
if (ps->frequency >= freq)
break;
}

- return ps;
+ return &pd->table[ps->next_efficient_idx];
}

What would be wrong with that?

Thanks,
Quentin

2021-05-25 13:36:18

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

Hi Quentin,

On 5/25/21 2:06 PM, Quentin Perret wrote:
> Hi Lukasz,
>
> On Tuesday 25 May 2021 at 12:03:14 (+0100), Lukasz Luba wrote:
>> That's a few more instructions to parse the 'flags' filed. I'm not sure
>> if that brings speed improvements vs. if we not parse and have bool
>> filed with a simple looping. The out-of-order core might even suffer
>> from this parsing and loop index manipulations...
>
> I'm not sure what you mean about parsing here? I'm basically suggesting
> to do something along the lines of:

I thought Vincent was going to re-use the 'flags' for it and keep it for
other purpose as well - which would require to parse/map-to-feature.
That's why I commented the patch earlier, pointing out that we shouldn't
prepare the code for future unknown EM_PERF_STATE_*. We can always
modify it when we need to add another feature later.

>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index daaeccfb9d6e..f02de32d2325 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -128,13 +128,11 @@ struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
>
> for (i = 0; i < pd->nr_perf_states; i++) {
> ps = &pd->table[i];
> - if (ps->flags & EM_PERF_STATE_INEFFICIENT)
> - continue;
> if (ps->frequency >= freq)
> break;
> }
>
> - return ps;
> + return &pd->table[ps->next_efficient_idx];
> }
>
> What would be wrong with that?

Until we measure it, I don't know TBH. It looks OK for the first glance.
I like it also because it's self-contained, doesn't require parsing,
doesn't bring any 'generic' variable.

Regards,
Lukasz

>
> Thanks,
> Quentin
>

2021-05-25 15:02:48

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On Tue, May 25, 2021 at 10:21:37AM +0100, Vincent Donnefort wrote:
> On Tue, May 25, 2021 at 10:48:23AM +0200, Peter Zijlstra wrote:
> > On Fri, May 21, 2021 at 05:54:24PM +0100, Vincent Donnefort wrote:
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 4f09afd..5a91a2b 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -10,6 +10,7 @@
> > >
> > > #include "sched.h"
> > >
> > > +#include <linux/energy_model.h>
> > > #include <linux/sched/cpufreq.h>
> > > #include <trace/events/power.h>
> > >
> > > @@ -153,6 +154,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > >
> > > freq = map_util_freq(util, freq, max);
> > >
> > > + /* Avoid inefficient performance states */
> > > + freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq);
> > > +
> > > if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > > return sg_policy->next_freq;
> > >
> >
> > This seems somewhat unfortunate, it adds a loop over the OPPs only to
> > then call into cpufreq to do the exact same thing again :/
>
> Indeed, but it would be complicated to avoid the double loop:
>
> It is possible to register OPPs (and by extension perf_states) for a
> frequency for which, the cpufreq table entry is marked with
> CPUFREQ_ENTRY_INVALID. It would probably be an issue that would have to be
> fixed in the driver, but it is currently allowed.
>
> More importantly, while resolving the frequency, we also cache the index in
> cached_resolved_idx. Some drivers, such as qcom-cpufreq-hw rely on this
> value for their fastswitch support.

Unless we are ok bringing the cpufreq idx into the Energy Model. But I
originally dismissed this idea. I didn't want to bring a such dependency between
the two frameworks, especially as the EM can also work with Devfreq. But maybe
it's worth it in the end. Any thoughts?

>
> Notice though, we would iterate over the EM only in the case where the
> performance state has found inefficiencies.
>

2021-05-26 05:41:38

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs

On 21-05-21, 17:54, Vincent Donnefort wrote:
> We (Power team in Arm) are working with an experimental kernel for the
> Google's Pixel4 to evaluate and improve the current mainline performance
> and energy consumption on a real life device with Android.
>
> The SD855 SoC found in this phone has several OPPs that are inefficient.
> I.e. despite a lower frequency, they have a greater cost. (That cost being
> fmax * OPP power / OPP freq). This issue is twofold. First of course,
> running a specific workload at an inefficient OPP is counterproductive
> since it wastes wasting energy. But also, inefficient OPPs make a
> performance domain less appealing for task placement than it really is.
>
> We evaluated the change presented here by running 30 iterations of Android
> PCMark "Work 2.0 Performance". While we did not see any statistically
> significant performance impact, this change allowed to drastically improve
> the idle time residency.
>
>
> | Running | WFI [1] | Idle |
> ------------------------+-------------+-----------+-----------+
> Little cluster (4 CPUs) | -0.35% | +0.35% | +0.79% |
> ------------------------+-------------+-----------+-----------+
> Medium cluster (3 CPUs) | -6.3% | -18% | +12% |
> ------------------------+-------------+-----------+-----------+
> Big cluster (1 CPU) | -6.4% | -6.5% | +2.8% |
> ------------------------+-------------+-----------+-----------+
>
> On the SD855, the inefficient OPPs are found on the little cluster. By
> removing them from the Energy Model, we make the most efficient CPUs more
> appealing for task placement, helping to reduce the running time for the
> medium and big CPUs. Increasing idle time is crucial for this platform due
> to the substantial energy cost differences among the clusters. Also,
> despite not appearing in the statistics (the idle driver used here doesn't
> report it), we can speculate that we also improve the cluster idle time.

First of all, sorry about not replying earlier. I have seen this earlier and was
shying away to receive some feedback from Rafael/Peter instead :(

I think the problem you mention is genuine, I have realized it in the past,
discussed with Vincent Guittot (cc'd) but never was able to get to a proper
solution as the EM model wasn't there then.

I have seen your approach (from top level) and I feel maybe we can improve upon
the whole idea a bit, lemme know what you think. The problem I see with this
approach is the unnecessary updates to schedutil that this series makes, which
IMHO is the wrong thing to do. Schedutil isn't the only governor and such
changes will end up making the performance delta between ondemand and schedutil
even more (difference based on their core design philosophy is fine, but these
are improvements which each of them should enjoy). And if another governor wants
these smart decisions to be added there, then it is trouble again.

Since the whole thing depends on EM and OPPs, I think we can actually do this.

When the cpufreq driver registers with the EM core, lets find all the
Inefficient OPPs and disable them once and for all. Of course, this must be done
on voluntarily basis, a flag from the drivers will do. With this, we won't be
required to update any thing at any of the governors end.

Will that work ?

--
viresh

2021-05-26 09:05:59

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs

On Wed, May 26, 2021 at 09:17:51AM +0530, Viresh Kumar wrote:
> On 21-05-21, 17:54, Vincent Donnefort wrote:
> > We (Power team in Arm) are working with an experimental kernel for the
> > Google's Pixel4 to evaluate and improve the current mainline performance
> > and energy consumption on a real life device with Android.
> >
> > The SD855 SoC found in this phone has several OPPs that are inefficient.
> > I.e. despite a lower frequency, they have a greater cost. (That cost being
> > fmax * OPP power / OPP freq). This issue is twofold. First of course,
> > running a specific workload at an inefficient OPP is counterproductive
> > since it wastes wasting energy. But also, inefficient OPPs make a
> > performance domain less appealing for task placement than it really is.
> >
> > We evaluated the change presented here by running 30 iterations of Android
> > PCMark "Work 2.0 Performance". While we did not see any statistically
> > significant performance impact, this change allowed to drastically improve
> > the idle time residency.
> >
> >
> > | Running | WFI [1] | Idle |
> > ------------------------+-------------+-----------+-----------+
> > Little cluster (4 CPUs) | -0.35% | +0.35% | +0.79% |
> > ------------------------+-------------+-----------+-----------+
> > Medium cluster (3 CPUs) | -6.3% | -18% | +12% |
> > ------------------------+-------------+-----------+-----------+
> > Big cluster (1 CPU) | -6.4% | -6.5% | +2.8% |
> > ------------------------+-------------+-----------+-----------+
> >
> > On the SD855, the inefficient OPPs are found on the little cluster. By
> > removing them from the Energy Model, we make the most efficient CPUs more
> > appealing for task placement, helping to reduce the running time for the
> > medium and big CPUs. Increasing idle time is crucial for this platform due
> > to the substantial energy cost differences among the clusters. Also,
> > despite not appearing in the statistics (the idle driver used here doesn't
> > report it), we can speculate that we also improve the cluster idle time.
>
> First of all, sorry about not replying earlier. I have seen this earlier and was
> shying away to receive some feedback from Rafael/Peter instead :(

No worries at all, thanks for your comments!

>
> I think the problem you mention is genuine, I have realized it in the past,
> discussed with Vincent Guittot (cc'd) but never was able to get to a proper
> solution as the EM model wasn't there then.
>
> I have seen your approach (from top level) and I feel maybe we can improve upon
> the whole idea a bit, lemme know what you think. The problem I see with this
> approach is the unnecessary updates to schedutil that this series makes, which
> IMHO is the wrong thing to do. Schedutil isn't the only governor and such
> changes will end up making the performance delta between ondemand and schedutil
> even more (difference based on their core design philosophy is fine, but these
> are improvements which each of them should enjoy). And if another governor wants
> these smart decisions to be added there, then it is trouble again.

I originally considered to add the inefficient knowledge into the CPUFreq table.
But I then gave up the idea for two reasons:

* The EM depends on having schedutil enabled. I don't think that any
other governor would then manage to rely on the inefficient OPPs. (also I
believe Peter had a plan to keep schedutil as the one and only governor)

* The CPUfreq driver doesn't have to rely on the CPUfreq table, if the
knowledge about inefficient OPPs is into the latter, some drivers might not
be able to rely on the feature (you might say 'their loss' though :))

For those reasons, I thought that adding inefficient support into the
CPUfreq table would complexify a lot the patchset for no functional gain.

>
> Since the whole thing depends on EM and OPPs, I think we can actually do this.
>
> When the cpufreq driver registers with the EM core, lets find all the
> Inefficient OPPs and disable them once and for all. Of course, this must be done
> on voluntarily basis, a flag from the drivers will do. With this, we won't be
> required to update any thing at any of the governors end.

We still need to keep the inefficient OPPs for thermal reason. But if we go with
the inefficiency support into the CPUfreq table, we could enable or disable
them, depending on the thermal pressure. Or add a flag to read the table with or
without inefficient OPPs?


>
> Will that work ?
>
> --
> viresh

2021-05-26 09:35:31

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs

On 26-05-21, 09:56, Lukasz Luba wrote:
> No, these OPPs have to stay because they are used in thermal for cooling
> states.

This won't break the thermal tables. Thermal just sets the max-freq for a CPU,
and it doesn't depend on the OPP table for that.

> DT cooling devices might have them set as a scope of possible
> states. We don't want to break existing platforms, don't we?

I don't think we will end up breaking anything here.

> We want to 'avoid' those OPPs when possible (no thermal pressure), but
> we might have to use them sometimes.

Why would we want to use them if they are inefficient ? Thermal or something
else as well ?

More in the other reply I am sending to Vincent.

--
viresh

2021-05-26 12:01:48

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs

Hi Viresh,

On 5/26/21 4:47 AM, Viresh Kumar wrote:

[snip]

>
> First of all, sorry about not replying earlier. I have seen this earlier and was
> shying away to receive some feedback from Rafael/Peter instead :(
>
> I think the problem you mention is genuine, I have realized it in the past,
> discussed with Vincent Guittot (cc'd) but never was able to get to a proper
> solution as the EM model wasn't there then.
>
> I have seen your approach (from top level) and I feel maybe we can improve upon
> the whole idea a bit, lemme know what you think. The problem I see with this
> approach is the unnecessary updates to schedutil that this series makes, which
> IMHO is the wrong thing to do. Schedutil isn't the only governor and such
> changes will end up making the performance delta between ondemand and schedutil
> even more (difference based on their core design philosophy is fine, but these
> are improvements which each of them should enjoy). And if another governor wants
> these smart decisions to be added there, then it is trouble again.
>
> Since the whole thing depends on EM and OPPs, I think we can actually do this.
>
> When the cpufreq driver registers with the EM core, lets find all the
> Inefficient OPPs and disable them once and for all. Of course, this must be done
> on voluntarily basis, a flag from the drivers will do. With this, we won't be
> required to update any thing at any of the governors end.
>
> Will that work ?
>

No, these OPPs have to stay because they are used in thermal for cooling
states. DT cooling devices might have them set as a scope of possible
states. We don't want to break existing platforms, don't we?
We want to 'avoid' those OPPs when possible (no thermal pressure), but
we might have to use them sometimes.

Regards,
Lukasz

2021-05-26 12:03:36

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs

On 26-05-21, 10:01, Vincent Donnefort wrote:
> I originally considered to add the inefficient knowledge into the CPUFreq table.

I wasn't talking about the cpufreq table here in the beginning, but calling
dev_pm_opp_disable(), which will eventually reflect in cpufreq table as well.

> But I then gave up the idea for two reasons:
>
> * The EM depends on having schedutil enabled. I don't think that any
> other governor would then manage to rely on the inefficient OPPs. (also I
> believe Peter had a plan to keep schedutil as the one and only governor)

Right, that EM is only there for schedutil.

I would encourage if this can be done even without the EM dependency, if
possible. It would be a good thing to do generally for any driver that wants to
do that.

> * The CPUfreq driver doesn't have to rely on the CPUfreq table, if the
> knowledge about inefficient OPPs is into the latter, some drivers might not
> be able to rely on the feature (you might say 'their loss' though :))
>
> For those reasons, I thought that adding inefficient support into the
> CPUfreq table would complexify a lot the patchset for no functional gain.

What about disabling the OPP in the OPP core itself ? So every user will get the
same picture.

> >
> > Since the whole thing depends on EM and OPPs, I think we can actually do this.
> >
> > When the cpufreq driver registers with the EM core, lets find all the
> > Inefficient OPPs and disable them once and for all. Of course, this must be done
> > on voluntarily basis, a flag from the drivers will do. With this, we won't be
> > required to update any thing at any of the governors end.
>
> We still need to keep the inefficient OPPs for thermal reason.

How will that benefit us if that OPP is never going to run anyway ? We won't be
cooling down the CPU then, isn't it ?

> But if we go with
> the inefficiency support into the CPUfreq table, we could enable or disable
> them, depending on the thermal pressure. Or add a flag to read the table with or
> without inefficient OPPs?

Yeah, I was looking for a cpufreq driver flag or something like that so OPPs
don't disappear magically for some platforms which don't want it to happen.

Moreover, a cpufreq driver first creates the OPP table, then registers with EM
or thermal. If we can play with that sequence a bit and make sure inefficient
OPPs are disabled before thermal or cpufreq tables are created, we will be good.

--
viresh

2021-05-26 12:32:02

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs

On 26-05-21, 15:08, Viresh Kumar wrote:
> On 26-05-21, 10:01, Vincent Donnefort wrote:
> > I originally considered to add the inefficient knowledge into the CPUFreq table.
>
> I wasn't talking about the cpufreq table here in the beginning, but calling
> dev_pm_opp_disable(), which will eventually reflect in cpufreq table as well.
>
> > But I then gave up the idea for two reasons:
> >
> > * The EM depends on having schedutil enabled. I don't think that any
> > other governor would then manage to rely on the inefficient OPPs. (also I
> > believe Peter had a plan to keep schedutil as the one and only governor)
>
> Right, that EM is only there for schedutil.
>
> I would encourage if this can be done even without the EM dependency, if
> possible. It would be a good thing to do generally for any driver that wants to
> do that.

Another benefit of course is that we will have less work to do in the hotpath,
it was obvious, but I thought of stating it anyway :)

--
viresh

2021-05-26 12:37:37

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs



On 5/26/21 10:38 AM, Viresh Kumar wrote:
> On 26-05-21, 10:01, Vincent Donnefort wrote:
>> I originally considered to add the inefficient knowledge into the CPUFreq table.
>
> I wasn't talking about the cpufreq table here in the beginning, but calling
> dev_pm_opp_disable(), which will eventually reflect in cpufreq table as well.
>
>> But I then gave up the idea for two reasons:
>>
>> * The EM depends on having schedutil enabled. I don't think that any
>> other governor would then manage to rely on the inefficient OPPs. (also I
>> believe Peter had a plan to keep schedutil as the one and only governor)
>
> Right, that EM is only there for schedutil.
>
> I would encourage if this can be done even without the EM dependency, if
> possible. It would be a good thing to do generally for any driver that wants to
> do that.
>
>> * The CPUfreq driver doesn't have to rely on the CPUfreq table, if the
>> knowledge about inefficient OPPs is into the latter, some drivers might not
>> be able to rely on the feature (you might say 'their loss' though :))
>>
>> For those reasons, I thought that adding inefficient support into the
>> CPUfreq table would complexify a lot the patchset for no functional gain.
>
> What about disabling the OPP in the OPP core itself ? So every user will get the
> same picture.

There are SoCs which have OPPs every 100MHz even at high freq. They are
used e.g. when thermal kicks in. We shouldn't disable them in generic
frameworks like OPP. They might be used to provide enough CPU capacity,
when temp is high. Imagine you have a board which does some work:
sends and received some UDP packets. The board has been tested in oven
that it will still be able to handle X messages/sec but using an OPP,
which in our heuristic is 'inefficient'. You cannot go above, because it
will overheat the SoC, you might go below and find first 'efficient'
OPP. You might harm this board performance if e.g. the OPP is 20% slower
that this 'inefficient' which was tested by engineers.

>
>>>
>>> Since the whole thing depends on EM and OPPs, I think we can actually do this.
>>>
>>> When the cpufreq driver registers with the EM core, lets find all the
>>> Inefficient OPPs and disable them once and for all. Of course, this must be done
>>> on voluntarily basis, a flag from the drivers will do. With this, we won't be
>>> required to update any thing at any of the governors end.
>>
>> We still need to keep the inefficient OPPs for thermal reason.
>
> How will that benefit us if that OPP is never going to run anyway ? We won't be

This OPP still might be used, the Vincent heuristic is just a 'hint'.
Schedutil will check policy->max and could clamp the 'efficient'
returned freq to first allowed, which might be 'inefficient'

> cooling down the CPU then, isn't it ?

The 'inefficient' OPP is called from our 'energy placement' angle. For
other folks from automotive, industrial or IoT who are stress testing
SoCs and boards in various circumstances, they might call our
'inefficient' perf state as 'efficient' - for they need.

In our internal review I pointed that we are optimizing for mobiles with
this and we might actually need a #ifdef, config or a switch for this
heuristic.

2021-05-26 16:51:55

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs



On 5/26/21 11:24 AM, Lukasz Luba wrote:

[snip]

>> What about disabling the OPP in the OPP core itself ? So every user
>> will get the
>> same picture.
>
> There are SoCs which have OPPs every 100MHz even at high freq. They are
> used e.g. when thermal kicks in. We shouldn't disable them in generic
> frameworks like OPP. They might be used to provide enough CPU capacity,
> when temp is high. Imagine you have a board which does some work:
> sends and received some UDP packets. The board has been tested in oven
> that it will still be able to handle X messages/sec but using an OPP,
> which in our heuristic is 'inefficient'. You cannot go above, because it
> will overheat the SoC, you might go below and find first 'efficient'
> OPP. You might harm this board performance if e.g. the OPP is 20% slower
> that this 'inefficient' which was tested by engineers.
>
>>
>>>>
>>>> Since the whole thing depends on EM and OPPs, I think we can
>>>> actually do this.
>>>>
>>>> When the cpufreq driver registers with the EM core, lets find all the
>>>> Inefficient OPPs and disable them once and for all. Of course, this
>>>> must be done
>>>> on voluntarily basis, a flag from the drivers will do. With this, we
>>>> won't be
>>>> required to update any thing at any of the governors end.
>>>
>>> We still need to keep the inefficient OPPs for thermal reason.
>>
>> How will that benefit us if that OPP is never going to run anyway ? We
>> won't be
>
> This OPP still might be used, the Vincent heuristic is just a 'hint'.
> Schedutil will check policy->max and could clamp the 'efficient'
> returned freq to first allowed, which might be 'inefficient'
>
>> cooling down the CPU then, isn't it ?
>
> The 'inefficient' OPP is called from our 'energy placement' angle. For
> other folks from automotive, industrial or IoT who are stress testing
> SoCs and boards in various circumstances, they might call our
> 'inefficient' perf state as 'efficient' - for they need.
>
> In our internal review I pointed that we are optimizing for mobiles with
> this and we might actually need a #ifdef, config or a switch for this
> heuristic.
>

But even in mobiles, we might start facing issues e.g. during high
resolution recording, when we just disable 'inefficient' OPPs,
which were used in such use case and higher temperature.

2021-05-26 17:41:12

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs



On 5/26/21 11:39 AM, Lukasz Luba wrote:
>

[snip]

To summarize:
- we don't want to disable some OPPs
- we want to give a 'hint' from energy perspective
- we rely on SchedUtil 2nd stage which clamps this
freq hint value to allowed OPPs which might set
actually not the one what we see as 'efficient'
-- we don't harm some existing platform which might
needs these 'inefficient' OPPs in some use cases
- we pay some extra cost in this SchedUtil freq
switch path, which shouldn't harm too much.
-- we pay this cost only for arm/arm64 platforms
which use EM
-- this cost is balanced by the benefit that we see in
benchmarks and measured energy
-- the LUT might limit the impact

I hope this would help to better understand the scope and
impact of this patch set.

Regards,
Lukasz

2021-05-26 18:47:21

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs

On Wed, May 26, 2021 at 03:08:07PM +0530, Viresh Kumar wrote:
> On 26-05-21, 10:01, Vincent Donnefort wrote:
> > I originally considered to add the inefficient knowledge into the CPUFreq table.
>
> I wasn't talking about the cpufreq table here in the beginning, but calling
> dev_pm_opp_disable(), which will eventually reflect in cpufreq table as well.
>
> > But I then gave up the idea for two reasons:
> >
> > * The EM depends on having schedutil enabled. I don't think that any
> > other governor would then manage to rely on the inefficient OPPs. (also I
> > believe Peter had a plan to keep schedutil as the one and only governor)
>
> Right, that EM is only there for schedutil.
>
> I would encourage if this can be done even without the EM dependency, if
> possible. It would be a good thing to do generally for any driver that wants to
> do that.
>
> > * The CPUfreq driver doesn't have to rely on the CPUfreq table, if the
> > knowledge about inefficient OPPs is into the latter, some drivers might not
> > be able to rely on the feature (you might say 'their loss' though :))
> >
> > For those reasons, I thought that adding inefficient support into the
> > CPUfreq table would complexify a lot the patchset for no functional gain.
>
> What about disabling the OPP in the OPP core itself ? So every user will get the
> same picture.
>
> > >
> > > Since the whole thing depends on EM and OPPs, I think we can actually do this.
> > >
> > > When the cpufreq driver registers with the EM core, lets find all the
> > > Inefficient OPPs and disable them once and for all. Of course, this must be done
> > > on voluntarily basis, a flag from the drivers will do. With this, we won't be
> > > required to update any thing at any of the governors end.
> >
> > We still need to keep the inefficient OPPs for thermal reason.
>
> How will that benefit us if that OPP is never going to run anyway ? We won't be
> cooling down the CPU then, isn't it ?

It would give more freedom for the cooling framework to pick a lower frequency
to mitigate the current temperature even if we know this isn't, energy
efficient.

As an example, on the Pixel4's SD855, the first 6 OPPs are inefficients on one
of the cluster. If we hide those from the cooling framework, we'll prevent
cooling for a quite wide range of frequencies.

That'd be however much more intrusive to support into cpufreq than just
preventing the OPPs to be registered.

>
> > But if we go with
> > the inefficiency support into the CPUfreq table, we could enable or disable
> > them, depending on the thermal pressure. Or add a flag to read the table with or
> > without inefficient OPPs?
>
> Yeah, I was looking for a cpufreq driver flag or something like that so OPPs
> don't disappear magically for some platforms which don't want it to happen.
>
> Moreover, a cpufreq driver first creates the OPP table, then registers with EM
> or thermal. If we can play with that sequence a bit and make sure inefficient
> OPPs are disabled before thermal or cpufreq tables are created, we will be good.
>
> --
> viresh

2021-05-27 07:37:58

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] EM / PM: Inefficient OPPs

Hi Viresh,

On 5/26/21 10:33 AM, Viresh Kumar wrote:
> On 26-05-21, 09:56, Lukasz Luba wrote:
>> No, these OPPs have to stay because they are used in thermal for cooling
>> states.
>
> This won't break the thermal tables. Thermal just sets the max-freq for a CPU,
> and it doesn't depend on the OPP table for that.
>
>> DT cooling devices might have them set as a scope of possible
>> states. We don't want to break existing platforms, don't we?
>
> I don't think we will end up breaking anything here.
>
>> We want to 'avoid' those OPPs when possible (no thermal pressure), but
>> we might have to use them sometimes.
>
> Why would we want to use them if they are inefficient ? Thermal or something
> else as well ?
>
> More in the other reply I am sending to Vincent.
>

I have responded to your email there. I don't know if you have seen it.
As I said there, these OPPs, which from energy perspective we call
'inefficient', might be used to provide enough performance under thermal
constraints.

Regards,
Lukasz

2021-05-28 06:18:50

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On 21-05-21, 17:54, Vincent Donnefort wrote:
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> +static inline
> +struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
> + unsigned long freq)
> +{
> + struct em_perf_state *ps;
> + int i;
> +
> + for (i = 0; i < pd->nr_perf_states; i++) {
> + ps = &pd->table[i];
> + if (ps->flags & EM_PERF_STATE_INEFFICIENT)
> + continue;
> + if (ps->frequency >= freq)
> + break;

I believe it may be more optimal if we change the sequence of these two 'if'
blocks here. We only need to check for inefficient frequencies if it is >= freq.

> + }
> +
> + return ps;
> +}


> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> @@ -153,6 +154,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>
> freq = map_util_freq(util, freq, max);
>
> + /* Avoid inefficient performance states */
> + freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq);
> +
> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> return sg_policy->next_freq;

Assume this freq-table (E=efficient, IE=inefficient): 800M (E), 1G (E), 1.2G (IE), 1.4G (IE), 1.6G (E).
Thermal limits max to 1.4G

Freq returned by map_util_freq() is 1.01G.

Will we not end up selecting 1.4G here ? Inefficient as well as much higher than
what we requested for ?

--
viresh

2021-05-28 07:25:42

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On 25-05-21, 10:48, Peter Zijlstra wrote:
> On Fri, May 21, 2021 at 05:54:24PM +0100, Vincent Donnefort wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 4f09afd..5a91a2b 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -10,6 +10,7 @@
> >
> > #include "sched.h"
> >
> > +#include <linux/energy_model.h>
> > #include <linux/sched/cpufreq.h>
> > #include <trace/events/power.h>
> >
> > @@ -153,6 +154,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >
> > freq = map_util_freq(util, freq, max);
> >
> > + /* Avoid inefficient performance states */
> > + freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq);
> > +
> > if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > return sg_policy->next_freq;
> >
>
> This seems somewhat unfortunate, it adds a loop over the OPPs only to
> then call into cpufreq to do the exact same thing again :/

And that's why I feel it needs to be done at a single place, either disable the
OPP (which seems like a bad option based on what Lukasz and Vincent said
earlier), or make changes in the cpufreq core itself to search for the best
frequency (like adding another API to mark some frequencies as inefficient, and
take that into account while selecting next freq).

There is a potential of ending up selecting the wrong frequency here because
there are too many decision making bodies here and so corner cases.

--
viresh

2021-05-28 09:13:13

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs



On 5/28/21 6:04 AM, Viresh Kumar wrote:
> On 21-05-21, 17:54, Vincent Donnefort wrote:
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> +static inline
>> +struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
>> + unsigned long freq)
>> +{
>> + struct em_perf_state *ps;
>> + int i;
>> +
>> + for (i = 0; i < pd->nr_perf_states; i++) {
>> + ps = &pd->table[i];
>> + if (ps->flags & EM_PERF_STATE_INEFFICIENT)
>> + continue;
>> + if (ps->frequency >= freq)
>> + break;
>
> I believe it may be more optimal if we change the sequence of these two 'if'
> blocks here. We only need to check for inefficient frequencies if it is >= freq.

Make sense (especially in context of your example below).

>
>> + }
>> +
>> + return ps;
>> +}
>
>
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> @@ -153,6 +154,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>>
>> freq = map_util_freq(util, freq, max);
>>
>> + /* Avoid inefficient performance states */
>> + freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq);
>> +
>> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
>> return sg_policy->next_freq;
>
> Assume this freq-table (E=efficient, IE=inefficient): 800M (E), 1G (E), 1.2G (IE), 1.4G (IE), 1.6G (E).
> Thermal limits max to 1.4G
>
> Freq returned by map_util_freq() is 1.01G.
>
> Will we not end up selecting 1.4G here ? Inefficient as well as much higher than
> what we requested for ?
>

Correct, it will be 1.4G, instead of selecting 1.2G, in this example.
The two or more consecutive inefficient OPPs in the table ruins the
proper behavior. We have to reevaluate this.

Thanks Viresh for pointing this out.

Regards,
Lukasz

2021-06-01 08:50:50

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On Fri, May 28, 2021 at 10:39:34AM +0530, Viresh Kumar wrote:
> On 25-05-21, 10:48, Peter Zijlstra wrote:
> > On Fri, May 21, 2021 at 05:54:24PM +0100, Vincent Donnefort wrote:
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 4f09afd..5a91a2b 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -10,6 +10,7 @@
> > >
> > > #include "sched.h"
> > >
> > > +#include <linux/energy_model.h>
> > > #include <linux/sched/cpufreq.h>
> > > #include <trace/events/power.h>
> > >
> > > @@ -153,6 +154,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > >
> > > freq = map_util_freq(util, freq, max);
> > >
> > > + /* Avoid inefficient performance states */
> > > + freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq);
> > > +
> > > if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > > return sg_policy->next_freq;
> > >
> >
> > This seems somewhat unfortunate, it adds a loop over the OPPs only to
> > then call into cpufreq to do the exact same thing again :/
>
> And that's why I feel it needs to be done at a single place, either disable the
> OPP (which seems like a bad option based on what Lukasz and Vincent said
> earlier), or make changes in the cpufreq core itself to search for the best
> frequency (like adding another API to mark some frequencies as inefficient, and
> take that into account while selecting next freq).
>
> There is a potential of ending up selecting the wrong frequency here because
> there are too many decision making bodies here and so corner cases.
>
> --
> viresh

Hi Viresh,

Seems like no one has been really convinced about the arguments in favor of
keeping inefficiencies into EM :) Let me then give a shot with marking the OPPs
for the next version.

--
Vincent

2021-06-01 09:00:20

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On 01-06-21, 09:47, Vincent Donnefort wrote:
> Seems like no one has been really convinced about the arguments in favor of
> keeping inefficiencies into EM :) Let me then give a shot with marking the OPPs
> for the next version.

Right, I think this is what you should do:

- Add another flag for OPP entries, and mark them inefficient.

- Whoever traverses the list to find the next frequency (cpufreq here), checks
that flag somehow (or replicates that to its own table) and get the right
frequency out.

--
viresh

2021-06-01 09:11:06

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On Tuesday 01 Jun 2021 at 14:26:28 (+0530), Viresh Kumar wrote:
> On 01-06-21, 09:47, Vincent Donnefort wrote:
> > Seems like no one has been really convinced about the arguments in favor of
> > keeping inefficiencies into EM :) Let me then give a shot with marking the OPPs
> > for the next version.
>
> Right, I think this is what you should do:
>
> - Add another flag for OPP entries, and mark them inefficient.
>
> - Whoever traverses the list to find the next frequency (cpufreq here), checks
> that flag somehow (or replicates that to its own table) and get the right
> frequency out.

Just to reiterate here what was discussed on IRC the other day, I still
feel that the choice of an efficient OPP or not is a policy decision,
and should be left to the governor.

It's not obvious to me that the userspace govenor for instance wants any
of this. Same thing with e.g. the powersave governor if the lowest OPPs
are inefficient (yes skipping them will not impact energy, but it will
impact instantaneous power).

So if we're going to move that logic to the cpufreq core, then we'll
probably want two separate APIs and make sure to use the effiency-aware
one is used only from the places where that makes sense.

Thanks,
Quentin

2021-06-01 09:14:38

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] PM / EM: Skip inefficient OPPs

On 01-06-21, 09:07, Quentin Perret wrote:
> Just to reiterate here what was discussed on IRC the other day, I still
> feel that the choice of an efficient OPP or not is a policy decision,
> and should be left to the governor.

I agree. cpufreq core shouldn't always do this.

> It's not obvious to me that the userspace govenor for instance wants any
> of this. Same thing with e.g. the powersave governor if the lowest OPPs
> are inefficient (yes skipping them will not impact energy, but it will
> impact instantaneous power).

Yes, these governors shouldn't end up using the efficient only stuff.

> So if we're going to move that logic to the cpufreq core, then we'll
> probably want two separate APIs and make sure to use the effiency-aware
> one is used only from the places where that makes sense.

Yeah, we need another API or parameter "bool efficient" or something.

--
viresh