Hi All,
This is how I would fix the issue reported in BZ 200759 (see this patch series
from Yu too: https://marc.info/?l=linux-pm&m=155137672924029&w=2).
Patch [1/2] causes intel_pstate to update all policies if it gets a _PPC change
notification and sees a global turbo disable/enable change.
Patch [2/2] makes it update cpuinfo.max_freq for all policies in those cases.
The patches here have not been tested yet, so testing would be much appreciated.
Of course, comments are welcome too!
Thanks,
Rafael
From: Rafael J. Wysocki <[email protected]>
In some cases, the platform firmware disables or enables turbo
frequencies for all CPUs globally before triggering a _PPC change
notification for one of them. Obviously, that global change affects
all CPUs, not just the notified one, and it needs to be acted upon by
cpufreq.
The intel_pstate driver is able to detect such global changes of
the settings, but it also needs to update policy limits for all
CPUs if that happens, in particular if turbo frequencies are
enabled globally - to allow them to be used.
For this reason, introduce a new cpufreq driver callback to be
invoked on _PPC notifications, if present, instead of simply
calling cpufreq_update_policy() for the notified CPU and make
intel_pstate use it to trigger policy updates for all CPUs
in the system if global settings change.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
Reported-by: Gabriele Mazzotta <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/acpi/processor_perflib.c | 2 +-
drivers/cpufreq/cpufreq.c | 16 ++++++++++++++++
drivers/cpufreq/intel_pstate.c | 25 +++++++++++++++++++++++++
include/linux/cpufreq.h | 4 ++++
4 files changed, 46 insertions(+), 1 deletion(-)
Index: linux-pm/drivers/acpi/processor_perflib.c
===================================================================
--- linux-pm.orig/drivers/acpi/processor_perflib.c
+++ linux-pm/drivers/acpi/processor_perflib.c
@@ -181,7 +181,7 @@ void acpi_processor_ppc_has_changed(stru
acpi_processor_ppc_ost(pr->handle, 0);
}
if (ret >= 0)
- cpufreq_update_policy(pr->id);
+ cpufreq_update_limits(pr->id);
}
int acpi_processor_get_bios_limit(int cpu, unsigned int *limit)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -2376,6 +2376,22 @@ unlock:
}
EXPORT_SYMBOL(cpufreq_update_policy);
+/**
+ * cpufreq_update_limits - Update policy limits for a given CPU.
+ * @cpu: CPU to update the policy limits for.
+ *
+ * Invoke the driver's ->update_limits callback if present or call
+ * cpufreq_update_policy() for @cpu.
+ */
+void cpufreq_update_limits(unsigned int cpu)
+{
+ if (cpufreq_driver->update_limits)
+ cpufreq_driver->update_limits(cpu);
+ else
+ cpufreq_update_policy(cpu);
+}
+EXPORT_SYMBOL(cpufreq_update_limits);
+
/*********************************************************************
* BOOST *
*********************************************************************/
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -179,6 +179,8 @@ struct vid_data {
* based on the MSR_IA32_MISC_ENABLE value and whether or
* not the maximum reported turbo P-state is different from
* the maximum reported non-turbo one.
+ * @turbo_disabled_upd: Last value of @turbo_disabled for which policies have
+ * been updated.
* @min_perf_pct: Minimum capacity limit in percent of the maximum turbo
* P-state capacity.
* @max_perf_pct: Maximum capacity limit in percent of the maximum turbo
@@ -187,6 +189,7 @@ struct vid_data {
struct global_params {
bool no_turbo;
bool turbo_disabled;
+ bool turbo_disabled_upd;
int max_perf_pct;
int min_perf_pct;
};
@@ -894,6 +897,25 @@ static void intel_pstate_update_policies
cpufreq_update_policy(cpu);
}
+static void intel_pstate_update_limits(unsigned int cpu)
+{
+ mutex_lock(&intel_pstate_driver_lock);
+
+ update_turbo_state();
+ /*
+ * If turbo has been turned on or off globally, policy limits for
+ * all CPUs need to be updated to reflect that.
+ */
+ if (global.turbo_disabled_upd != global.turbo_disabled) {
+ global.turbo_disabled_upd = global.turbo_disabled;
+ intel_pstate_update_policies();
+ } else {
+ cpufreq_update_policy(cpu);
+ }
+
+ mutex_unlock(&intel_pstate_driver_lock);
+}
+
/************************** sysfs begin ************************/
#define show_one(file_name, object) \
static ssize_t show_##file_name \
@@ -2135,6 +2157,7 @@ static int __intel_pstate_cpu_init(struc
/* cpuinfo and default policy values */
policy->cpuinfo.min_freq = cpu->pstate.min_pstate * cpu->pstate.scaling;
update_turbo_state();
+ global.turbo_disabled_upd = global.turbo_disabled;
policy->cpuinfo.max_freq = global.turbo_disabled ?
cpu->pstate.max_pstate : cpu->pstate.turbo_pstate;
policy->cpuinfo.max_freq *= cpu->pstate.scaling;
@@ -2179,6 +2202,7 @@ static struct cpufreq_driver intel_pstat
.init = intel_pstate_cpu_init,
.exit = intel_pstate_cpu_exit,
.stop_cpu = intel_pstate_stop_cpu,
+ .update_limits = intel_pstate_update_limits,
.name = "intel_pstate",
};
@@ -2313,6 +2337,7 @@ static struct cpufreq_driver intel_cpufr
.init = intel_cpufreq_cpu_init,
.exit = intel_pstate_cpu_exit,
.stop_cpu = intel_cpufreq_stop_cpu,
+ .update_limits = intel_pstate_update_limits,
.name = "intel_cpufreq",
};
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -195,6 +195,7 @@ void disable_cpufreq(void);
u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy);
int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
void cpufreq_update_policy(unsigned int cpu);
+void cpufreq_update_limits(unsigned int cpu);
bool have_governor_per_policy(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
@@ -322,6 +323,9 @@ struct cpufreq_driver {
/* should be defined, if possible */
unsigned int (*get)(unsigned int cpu);
+ /* Called to update policy limits on firmware notifications. */
+ void (*update_limits)(unsigned int cpu);
+
/* optional */
int (*bios_limit)(int cpu, unsigned int *limit);
From: Rafael J. Wysocki <[email protected]>
While the cpuinfo.max_freq value doesn't really matter for
intel_pstate in the active mode, in the passive mode it is used by
governors as the maximum physical frequency of the CPU and the
results of governor computations generally depend on it. Also it
is made available to user space via sysfs and it should match the
current HW configuration.
For this reason, make intel_pstate update cpuinfo.max_freq for all
CPUs if it detects a global change of turbo frequency settings from
"disable" to "enable" or the other way associated with a _PPC change
notification from the platform firmware.
Note that policy_is_inactive() and cpufreq_set_policy() need to be
made available to it for this purpose.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
Reported-by: Gabriele Mazzotta <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Update, because the patch sent previously doesn't build, due to an extra
arg declared for intel_pstate_update_max_freq().
---
drivers/cpufreq/cpufreq.c | 12 ++----------
drivers/cpufreq/intel_pstate.c | 33 ++++++++++++++++++++++++++++++++-
include/linux/cpufreq.h | 7 +++++++
3 files changed, 41 insertions(+), 11 deletions(-)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -897,6 +897,36 @@ static void intel_pstate_update_policies
cpufreq_update_policy(cpu);
}
+static void intel_pstate_update_max_freq(unsigned int cpu)
+{
+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+ struct cpufreq_policy new_policy;
+ struct cpudata *cpudata;
+
+ if (!policy)
+ return;
+
+ down_write(&policy->rwsem);
+
+ if (policy_is_inactive(policy))
+ goto unlock;
+
+ cpudata = all_cpu_data[cpu];
+ policy->cpuinfo.max_freq = global.turbo_disabled_upd ?
+ cpudata->pstate.max_freq : cpudata->pstate.turbo_freq;
+
+ memcpy(&new_policy, policy, sizeof(*policy));
+ new_policy.max = min(policy->user_policy.max, policy->cpuinfo.max_freq);
+ new_policy.min = min(policy->user_policy.min, new_policy.max);
+
+ cpufreq_set_policy(policy, &new_policy);
+
+unlock:
+ up_write(&policy->rwsem);
+
+ cpufreq_cpu_put(policy);
+}
+
static void intel_pstate_update_limits(unsigned int cpu)
{
mutex_lock(&intel_pstate_driver_lock);
@@ -908,7 +938,8 @@ static void intel_pstate_update_limits(u
*/
if (global.turbo_disabled_upd != global.turbo_disabled) {
global.turbo_disabled_upd = global.turbo_disabled;
- intel_pstate_update_policies();
+ for_each_possible_cpu(cpu)
+ intel_pstate_update_max_freq(cpu);
} else {
cpufreq_update_policy(cpu);
}
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -34,11 +34,6 @@
static LIST_HEAD(cpufreq_policy_list);
-static inline bool policy_is_inactive(struct cpufreq_policy *policy)
-{
- return cpumask_empty(policy->cpus);
-}
-
/* Macros to iterate over CPU policies */
#define for_each_suitable_policy(__policy, __active) \
list_for_each_entry(__policy, &cpufreq_policy_list, policy_list) \
@@ -675,9 +670,6 @@ static ssize_t show_scaling_cur_freq(str
return ret;
}
-static int cpufreq_set_policy(struct cpufreq_policy *policy,
- struct cpufreq_policy *new_policy);
-
/**
* cpufreq_per_cpu_attr_write() / store_##file_name() - sysfs write access
*/
@@ -2235,8 +2227,8 @@ EXPORT_SYMBOL(cpufreq_get_policy);
*
* The cpuinfo part of @policy is not updated by this function.
*/
-static int cpufreq_set_policy(struct cpufreq_policy *policy,
- struct cpufreq_policy *new_policy)
+int cpufreq_set_policy(struct cpufreq_policy *policy,
+ struct cpufreq_policy *new_policy)
{
struct cpufreq_governor *old_gov;
int ret;
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -178,6 +178,11 @@ static inline struct cpufreq_policy *cpu
static inline void cpufreq_cpu_put(struct cpufreq_policy *policy) { }
#endif
+static inline bool policy_is_inactive(struct cpufreq_policy *policy)
+{
+ return cpumask_empty(policy->cpus);
+}
+
static inline bool policy_is_shared(struct cpufreq_policy *policy)
{
return cpumask_weight(policy->cpus) > 1;
@@ -194,6 +199,8 @@ void disable_cpufreq(void);
u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy);
int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
+int cpufreq_set_policy(struct cpufreq_policy *policy,
+ struct cpufreq_policy *new_policy);
void cpufreq_update_policy(unsigned int cpu);
void cpufreq_update_limits(unsigned int cpu);
bool have_governor_per_policy(void);
From: Rafael J. Wysocki <[email protected]>
While the cpuinfo.max_freq value doesn't really matter for
intel_pstate in the active mode, in the passive mode it is used by
governors as the maximum physical frequency of the CPU and the
results of governor computations generally depend on it. Also it
is made available to user space via sysfs and it should match the
current HW configuration.
For this reason, make intel_pstate update cpuinfo.max_freq for all
CPUs if it detects a global change of turbo frequency settings from
"disable" to "enable" or the other way associated with a _PPC change
notification from the platform firmware.
Note that policy_is_inactive() and cpufreq_set_policy() need to be
made available to it for this purpose.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
Reported-by: Gabriele Mazzotta <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/cpufreq.c | 12 ++----------
drivers/cpufreq/intel_pstate.c | 33 ++++++++++++++++++++++++++++++++-
include/linux/cpufreq.h | 7 +++++++
3 files changed, 41 insertions(+), 11 deletions(-)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -897,6 +897,36 @@ static void intel_pstate_update_policies
cpufreq_update_policy(cpu);
}
+static void intel_pstate_update_max_freq(unsigned int cpu, bool turbo_disabled)
+{
+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+ struct cpufreq_policy new_policy;
+ struct cpudata *cpudata;
+
+ if (!policy)
+ return;
+
+ down_write(&policy->rwsem);
+
+ if (policy_is_inactive(policy))
+ goto unlock;
+
+ cpudata = all_cpu_data[cpu];
+ policy->cpuinfo.max_freq = global.turbo_disabled_upd ?
+ cpudata->pstate.max_freq : cpudata->pstate.turbo_freq;
+
+ memcpy(&new_policy, policy, sizeof(*policy));
+ new_policy.max = min(policy->user_policy.max, policy->cpuinfo.max_freq);
+ new_policy.min = min(policy->user_policy.min, new_policy.max);
+
+ cpufreq_set_policy(policy, &new_policy);
+
+unlock:
+ up_write(&policy->rwsem);
+
+ cpufreq_cpu_put(policy);
+}
+
static void intel_pstate_update_limits(unsigned int cpu)
{
mutex_lock(&intel_pstate_driver_lock);
@@ -908,7 +938,8 @@ static void intel_pstate_update_limits(u
*/
if (global.turbo_disabled_upd != global.turbo_disabled) {
global.turbo_disabled_upd = global.turbo_disabled;
- intel_pstate_update_policies();
+ for_each_possible_cpu(cpu)
+ intel_pstate_update_max_freq(cpu);
} else {
cpufreq_update_policy(cpu);
}
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -34,11 +34,6 @@
static LIST_HEAD(cpufreq_policy_list);
-static inline bool policy_is_inactive(struct cpufreq_policy *policy)
-{
- return cpumask_empty(policy->cpus);
-}
-
/* Macros to iterate over CPU policies */
#define for_each_suitable_policy(__policy, __active) \
list_for_each_entry(__policy, &cpufreq_policy_list, policy_list) \
@@ -675,9 +670,6 @@ static ssize_t show_scaling_cur_freq(str
return ret;
}
-static int cpufreq_set_policy(struct cpufreq_policy *policy,
- struct cpufreq_policy *new_policy);
-
/**
* cpufreq_per_cpu_attr_write() / store_##file_name() - sysfs write access
*/
@@ -2235,8 +2227,8 @@ EXPORT_SYMBOL(cpufreq_get_policy);
*
* The cpuinfo part of @policy is not updated by this function.
*/
-static int cpufreq_set_policy(struct cpufreq_policy *policy,
- struct cpufreq_policy *new_policy)
+int cpufreq_set_policy(struct cpufreq_policy *policy,
+ struct cpufreq_policy *new_policy)
{
struct cpufreq_governor *old_gov;
int ret;
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -178,6 +178,11 @@ static inline struct cpufreq_policy *cpu
static inline void cpufreq_cpu_put(struct cpufreq_policy *policy) { }
#endif
+static inline bool policy_is_inactive(struct cpufreq_policy *policy)
+{
+ return cpumask_empty(policy->cpus);
+}
+
static inline bool policy_is_shared(struct cpufreq_policy *policy)
{
return cpumask_weight(policy->cpus) > 1;
@@ -194,6 +199,8 @@ void disable_cpufreq(void);
u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy);
int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
+int cpufreq_set_policy(struct cpufreq_policy *policy,
+ struct cpufreq_policy *new_policy);
void cpufreq_update_policy(unsigned int cpu);
void cpufreq_update_limits(unsigned int cpu);
bool have_governor_per_policy(void);
On Fri, 2019-03-01 at 13:43 +0100, Rafael J. Wysocki wrote:
> Hi All,
>
> This is how I would fix the issue reported in BZ 200759 (see this
> patch series
> from Yu too: https://marc.info/?l=linux-pm&m=155137672924029&w=2).
>
> Patch [1/2] causes intel_pstate to update all policies if it gets a
> _PPC change
> notification and sees a global turbo disable/enable change.
>
> Patch [2/2] makes it update cpuinfo.max_freq for all policies in
> those cases.
>
> The patches here have not been tested yet, so testing would be much
> appreciated.
>
> Of course, comments are welcome too!
This is the only platform, someone reported such issue.
Can we solve this by some udev rules and offline/online cpu 1-3 on
power source change?
There are examples of changing governors on power source change.
https://bbs.archlinux.org/viewtopic.php?id=207186
Here instead of changing governor just
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
echo 0 >
/sys/devices/system/cpu/cpu2/online
echo 1 >
/sys/devices/system/cpu/cpu2/online
echo 0 >
/sys/devices/system/cpu/cpu3/online
echo 1 >
/sys/devices/system/cpu/cpu3/online
Thanks,
Srinivas
>
> Thanks,
> Rafael
>
On Fri, Mar 01, 2019 at 09:39:27AM -0800, Srinivas Pandruvada wrote:
> On Fri, 2019-03-01 at 13:43 +0100, Rafael J. Wysocki wrote:
> > Hi All,
> >
> > This is how I would fix the issue reported in BZ 200759 (see this
> > patch series
> > from Yu too: https://marc.info/?l=linux-pm&m=155137672924029&w=2).
> >
> > Patch [1/2] causes intel_pstate to update all policies if it gets a
> > _PPC change
> > notification and sees a global turbo disable/enable change.
> >
> > Patch [2/2] makes it update cpuinfo.max_freq for all policies in
> > those cases.
> >
> > The patches here have not been tested yet, so testing would be much
> > appreciated.
> >
> > Of course, comments are welcome too!
>
> This is the only platform, someone reported such issue.
> Can we solve this by some udev rules and offline/online cpu 1-3 on
> power source change?
>
Sound reasonable, we can deal with this BIOS problem in user space
too. But if cpu0 could not be offline, how could cpu0's policy
be updated?
Thanks,
Yu
> There are examples of changing governors on power source change.
> https://bbs.archlinux.org/viewtopic.php?id=207186
>
> Here instead of changing governor just
> echo 0 > /sys/devices/system/cpu/cpu1/online
> echo 1 > /sys/devices/system/cpu/cpu1/online
> echo 0 >
> /sys/devices/system/cpu/cpu2/online
> echo 1 >
> /sys/devices/system/cpu/cpu2/online
> echo 0 >
> /sys/devices/system/cpu/cpu3/online
> echo 1 >
> /sys/devices/system/cpu/cpu3/online
>
> Thanks,
> Srinivas
>
> >
> > Thanks,
> > Rafael
> >
>
On Sat, 2019-03-02 at 18:30 +0800, Yu Chen wrote:
> On Fri, Mar 01, 2019 at 09:39:27AM -0800, Srinivas Pandruvada wrote:
> > On Fri, 2019-03-01 at 13:43 +0100, Rafael J. Wysocki wrote:
> > > Hi All,
> > >
> > > This is how I would fix the issue reported in BZ 200759 (see this
> > > patch series
> > > from Yu too: https://marc.info/?l=linux-pm&m=155137672924029&w=2)
> > > .
> > >
> > > Patch [1/2] causes intel_pstate to update all policies if it gets
> > > a
> > > _PPC change
> > > notification and sees a global turbo disable/enable change.
> > >
> > > Patch [2/2] makes it update cpuinfo.max_freq for all policies in
> > > those cases.
> > >
> > > The patches here have not been tested yet, so testing would be
> > > much
> > > appreciated.
> > >
> > > Of course, comments are welcome too!
> >
> > This is the only platform, someone reported such issue.
> > Can we solve this by some udev rules and offline/online cpu 1-3 on
> > power source change?
> >
>
> Sound reasonable, we can deal with this BIOS problem in user space
> too. But if cpu0 could not be offline, how could cpu0's policy
> be updated?
I thought CPU0 is not a problem as the _PPC is sent on this and
policies gets updated from the following log.
# Unplug
[ 25.775643] CPU 0: _PPC is 6 - frequency limited
[ 25.775660] intel_pstate: set_policy cpuinfo.max 3000000 policy->max
1700000
[ 25.775666] intel_pstate: cpu:0 max_state 17 min_policy_perf:8
max_policy_perf:17
[ 25.775670] intel_pstate: cpu:0 global_min:8 global_max:30
[ 25.775674] intel_pstate: cpu:0 max_perf_ratio:17 min_perf_ratio:8
"REDUCED FREQUENCY ABOVE AFTER UNPLUG"
# Re-plug
[ 36.979264] CPU 0: _PPC is 6 - frequency limited
[ 36.979276] intel_pstate: policy->max > max non turbo frequency
[ 36.979280] intel_pstate: set_policy cpuinfo.max 3000000 policy->max
3000000
[ 36.979283] intel_pstate: cpu:0 max_state 30 min_policy_perf:8
max_policy_perf:30
[ 36.979286] intel_pstate: cpu:0 global_min:8 global_max:30
[ 36.979289] intel_pstate: cpu:0 max_perf_ratio:30 min_perf_ratio:8
Thanks,
Srinivas
>
> Thanks,
> Yu
>
> > There are examples of changing governors on power source change.
> > https://bbs.archlinux.org/viewtopic.php?id=207186
> >
> > Here instead of changing governor just
> > echo 0 > /sys/devices/system/cpu/cpu1/online
> > echo 1 > /sys/devices/system/cpu/cpu1/online
> > echo 0 >
> > /sys/devices/system/cpu/cpu2/online
> > echo 1 >
> > /sys/devices/system/cpu/cpu2/online
> > echo 0 >
> > /sys/devices/system/cpu/cpu3/online
> > echo 1 >
> > /sys/devices/system/cpu/cpu3/online
> >
> > Thanks,
> > Srinivas
> >
> > >
> > > Thanks,
> > > Rafael
> > >
On Fri, Mar 1, 2019 at 6:39 PM Srinivas Pandruvada
<[email protected]> wrote:
>
> On Fri, 2019-03-01 at 13:43 +0100, Rafael J. Wysocki wrote:
> > Hi All,
> >
> > This is how I would fix the issue reported in BZ 200759 (see this
> > patch series
> > from Yu too: https://marc.info/?l=linux-pm&m=155137672924029&w=2).
> >
> > Patch [1/2] causes intel_pstate to update all policies if it gets a
> > _PPC change
> > notification and sees a global turbo disable/enable change.
> >
> > Patch [2/2] makes it update cpuinfo.max_freq for all policies in
> > those cases.
> >
> > The patches here have not been tested yet, so testing would be much
> > appreciated.
> >
> > Of course, comments are welcome too!
>
> This is the only platform, someone reported such issue.
I don't think this matters.
First, it doesn't mean that no other problems have this problem.
Second, the current handling of MSR_IA32_MISC_ENABLE_TURBO_DISABLE
changes in intel_pstate is not sufficient if the changes is from set
to unset anyway.
On Sun, 2019-03-03 at 18:03 +0100, Rafael J. Wysocki wrote:
> On Fri, Mar 1, 2019 at 6:39 PM Srinivas Pandruvada
> <[email protected]> wrote:
> >
> > On Fri, 2019-03-01 at 13:43 +0100, Rafael J. Wysocki wrote:
> > > Hi All,
> > >
> > > This is how I would fix the issue reported in BZ 200759 (see this
> > > patch series
> > > from Yu too: https://marc.info/?l=linux-pm&m=155137672924029&w=2)
> > > .
> > >
> > > Patch [1/2] causes intel_pstate to update all policies if it gets
> > > a
> > > _PPC change
> > > notification and sees a global turbo disable/enable change.
> > >
> > > Patch [2/2] makes it update cpuinfo.max_freq for all policies in
> > > those cases.
> > >
> > > The patches here have not been tested yet, so testing would be
> > > much
> > > appreciated.
> > >
> > > Of course, comments are welcome too!
> >
> > This is the only platform, someone reported such issue.
>
> I don't think this matters.
>
> First, it doesn't mean that no other problems have this problem.
>
> Second, the current handling of MSR_IA32_MISC_ENABLE_TURBO_DISABLE
> changes in intel_pstate is not sufficient if the changes is from set
> to unset anyway.
Any platform with HWP, you can't even notify of this change. So any
platform beyond SKL is not useful.
Fixing is always good :-)
On Sun, Mar 3, 2019 at 10:20 PM Srinivas Pandruvada
<[email protected]> wrote:
>
> On Sun, 2019-03-03 at 18:03 +0100, Rafael J. Wysocki wrote:
> > On Fri, Mar 1, 2019 at 6:39 PM Srinivas Pandruvada
> > <[email protected]> wrote:
> > >
> > > On Fri, 2019-03-01 at 13:43 +0100, Rafael J. Wysocki wrote:
> > > > Hi All,
> > > >
> > > > This is how I would fix the issue reported in BZ 200759 (see this
> > > > patch series
> > > > from Yu too: https://marc.info/?l=linux-pm&m=155137672924029&w=2)
> > > > .
> > > >
> > > > Patch [1/2] causes intel_pstate to update all policies if it gets
> > > > a
> > > > _PPC change
> > > > notification and sees a global turbo disable/enable change.
> > > >
> > > > Patch [2/2] makes it update cpuinfo.max_freq for all policies in
> > > > those cases.
> > > >
> > > > The patches here have not been tested yet, so testing would be
> > > > much
> > > > appreciated.
> > > >
> > > > Of course, comments are welcome too!
> > >
> > > This is the only platform, someone reported such issue.
> >
> > I don't think this matters.
> >
> > First, it doesn't mean that no other problems have this problem.
> >
> > Second, the current handling of MSR_IA32_MISC_ENABLE_TURBO_DISABLE
> > changes in intel_pstate is not sufficient if the changes is from set
> > to unset anyway.
>
> Any platform with HWP, you can't even notify of this change. So any
> platform beyond SKL is not useful.
Do you mean that HWP platforms don't supply _PPS (as a rule) and so
they don't send _PPC notifications? Is there anything they do instead
of it?
That's fair enough, but the point is that the driver doesn't do the
right thing even if the platform does send a _PPC notification.
> Fixing is always good :-)
Well, I can only agree with that ...
Hi,
I applied these patches on top of 4.20. The max frequency of all the
cores now reflects the actual value and it's updated whenever I
change the power source.
Regards,
Gabriele
On Fri, 1 Mar 2019 at 13:51, Rafael J. Wysocki <[email protected]> wrote:
>
> Hi All,
>
> This is how I would fix the issue reported in BZ 200759 (see this patch series
> from Yu too: https://marc.info/?l=linux-pm&m=155137672924029&w=2).
>
> Patch [1/2] causes intel_pstate to update all policies if it gets a _PPC change
> notification and sees a global turbo disable/enable change.
>
> Patch [2/2] makes it update cpuinfo.max_freq for all policies in those cases.
>
> The patches here have not been tested yet, so testing would be much appreciated.
>
> Of course, comments are welcome too!
>
> Thanks,
> Rafael
>
On Sun, 2019-03-03 at 22:51 +0100, Rafael J. Wysocki wrote:
> On Sun, Mar 3, 2019 at 10:20 PM Srinivas Pandruvada
> <[email protected]> wrote:
> >
> > On Sun, 2019-03-03 at 18:03 +0100, Rafael J. Wysocki wrote:
> > > On Fri, Mar 1, 2019 at 6:39 PM Srinivas Pandruvada
> > > <[email protected]> wrote:
> > > >
> > > > On Fri, 2019-03-01 at 13:43 +0100, Rafael J. Wysocki wrote:
> > > > > Hi All,
> > > > >
> > > > > This is how I would fix the issue reported in BZ 200759 (see
> > > > > thisdev
> > > > > patch series
> > > > > from Yu too:
> > > > > https://marc.info/?l=linux-pm&m=155137672924029&w=2)
> > > > > .
> > > > >
> > > > > Patch [1/2] causes intel_pstate to update all policies if it
> > > > > gets
> > > > > a
> > > > > _PPC change
> > > > > notification and sees a global turbo disable/enable change.
> > > > >
> > > > > Patch [2/2] makes it update cpuinfo.max_freq for all policies
> > > > > in
> > > > > those cases.
> > > > >
> > > > > The patches here have not been tested yet, so testing would
> > > > > be
> > > > > much
> > > > > appreciated.
> > > > >
> > > > > Of course, comments are welcome too!
> > > >
> > > > This is the only platform, someone reported such issue.
> > >
> > > I don't think this matters.
> > >
> > > First, it doesn't mean that no other problems have this problem.
> > >
> > > Second, the current handling of
> > > MSR_IA32_MISC_ENABLE_TURBO_DISABLE
> > > changes in intel_pstate is not sufficient if the changes is from
> > > set
> > > to unset anyway.
> >
> > Any platform with HWP, you can't even notify of this change. So any
> > platform beyond SKL is not useful.
>
> Do you mean that HWP platforms don't supply _PPS (as a rule) and so
> they don't send _PPC notifications? Is there anything they do
> instead
> of it?
There are other methods like PL1 budget limit for such cases. FW can
just change the config TDP level.
>
> That's fair enough, but the point is that the driver doesn'dev_t do
> the
> right thing even if the platform does send a _PPC notification.
_PPC notification is to indicate levels in _PSS not to disable/enable
turbo via IA32_MISC_*. The platform could have just set _PPC to 1 or to
TAR-1. Here _PPC is sent for somthing more than just changing _PSS max
level. Do we have bug in if _PPC just changes performance level?
>
> > Fixing is always good :-)
>
> Well, I can only agree with that ...
On Mon, Mar 4, 2019 at 5:06 AM Srinivas Pandruvada
<[email protected]> wrote:
>
> On Sun, 2019-03-03 at 22:51 +0100, Rafael J. Wysocki wrote:
> > On Sun, Mar 3, 2019 at 10:20 PM Srinivas Pandruvada
> > <[email protected]> wrote:
> > >
> > > On Sun, 2019-03-03 at 18:03 +0100, Rafael J. Wysocki wrote:
> > > > On Fri, Mar 1, 2019 at 6:39 PM Srinivas Pandruvada
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Fri, 2019-03-01 at 13:43 +0100, Rafael J. Wysocki wrote:
> > > > > > Hi All,
> > > > > >
> > > > > > This is how I would fix the issue reported in BZ 200759 (see
> > > > > > thisdev
> > > > > > patch series
> > > > > > from Yu too:
> > > > > > https://marc.info/?l=linux-pm&m=155137672924029&w=2)
> > > > > > .
> > > > > >
> > > > > > Patch [1/2] causes intel_pstate to update all policies if it
> > > > > > gets
> > > > > > a
> > > > > > _PPC change
> > > > > > notification and sees a global turbo disable/enable change.
> > > > > >
> > > > > > Patch [2/2] makes it update cpuinfo.max_freq for all policies
> > > > > > in
> > > > > > those cases.
> > > > > >
> > > > > > The patches here have not been tested yet, so testing would
> > > > > > be
> > > > > > much
> > > > > > appreciated.
> > > > > >
> > > > > > Of course, comments are welcome too!
> > > > >
> > > > > This is the only platform, someone reported such issue.
> > > >
> > > > I don't think this matters.
> > > >
> > > > First, it doesn't mean that no other problems have this problem.
> > > >
> > > > Second, the current handling of
> > > > MSR_IA32_MISC_ENABLE_TURBO_DISABLE
> > > > changes in intel_pstate is not sufficient if the changes is from
> > > > set
> > > > to unset anyway.
> > >
> > > Any platform with HWP, you can't even notify of this change. So any
> > > platform beyond SKL is not useful.
> >
> > Do you mean that HWP platforms don't supply _PPS (as a rule) and so
> > they don't send _PPC notifications? Is there anything they do
> > instead
> > of it?
>
> There are other methods like PL1 budget limit for such cases. FW can
> just change the config TDP level.
OK, but that would be done without notification I suppose?
> > That's fair enough, but the point is that the driver doesn'dev_t do
> > the right thing even if the platform does send a _PPC notification.
>
> _PPC notification is to indicate levels in _PSS not to disable/enable
> turbo via IA32_MISC_*.
I was not talking about notifications, but about what the driver does
when MSR_IA32_MISC_ENABLE_TURBO_DISABLE is set or unset by FW on the
fly: this only really works if the change is from unset to set,
because the range of frequencies to use is restricted then, even
though user-visible policy limits are not updated. However, if the
change is from set to unset, the update of policy limits is actually
necessary for the change to really take effect (otherwise the user
policy max prevents turbo frequencies from being requested). HWP
seems to be affected too, for that matter, because the upper limit in
the MSR is not updated too then AFAICS.
> The platform could have just set _PPC to 1 or to TAR-1. Here _PPC is sent for somthing more than just changing _PSS max
> level. Do we have bug in if _PPC just changes performance level?
I think that we just need to live with the fact that _PPC may be
triggered for something more than changing _PSS max.
Note that the driver could trigger policy updates on
MSR_IA32_MISC_ENABLE_TURBO_DISABLE changes even without _PPC
notifications, but that would require more intrusive changes, because
work items cannot be queued up from scheduler context. It would need
to use irq_work for that and with some extra synchronization etc, and
for the time being it looks like we can piggy back on the _PPC change
notification mechanism.
On Sun, Mar 3, 2019 at 11:42 PM Gabriele Mazzotta
<[email protected]> wrote:
>
> Hi,
>
> I applied these patches on top of 4.20. The max frequency of all the
> cores now reflects the actual value and it's updated whenever I
> change the power source.
Thanks for the confirmation!
On Fri, Mar 01, 2019 at 01:57:06PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> While the cpuinfo.max_freq value doesn't really matter for
> intel_pstate in the active mode, in the passive mode it is used by
> governors as the maximum physical frequency of the CPU and the
> results of governor computations generally depend on it. Also it
> is made available to user space via sysfs and it should match the
> current HW configuration.
>
> For this reason, make intel_pstate update cpuinfo.max_freq for all
> CPUs if it detects a global change of turbo frequency settings from
> "disable" to "enable" or the other way associated with a _PPC change
> notification from the platform firmware.
>
> Note that policy_is_inactive() and cpufreq_set_policy() need to be
> made available to it for this purpose.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
> Reported-by: Gabriele Mazzotta <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> Update, because the patch sent previously doesn't build, due to an extra
> arg declared for intel_pstate_update_max_freq().
>
> ---
> drivers/cpufreq/cpufreq.c | 12 ++----------
> drivers/cpufreq/intel_pstate.c | 33 ++++++++++++++++++++++++++++++++-
> include/linux/cpufreq.h | 7 +++++++
> 3 files changed, 41 insertions(+), 11 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -897,6 +897,36 @@ static void intel_pstate_update_policies
> cpufreq_update_policy(cpu);
> }
>
> +static void intel_pstate_update_max_freq(unsigned int cpu)
> +{
> + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> + struct cpufreq_policy new_policy;
> + struct cpudata *cpudata;
> +
> + if (!policy)
> + return;
> +
> + down_write(&policy->rwsem);
> +
> + if (policy_is_inactive(policy))
> + goto unlock;
> +
> + cpudata = all_cpu_data[cpu];
> + policy->cpuinfo.max_freq = global.turbo_disabled_upd ?
> + cpudata->pstate.max_freq : cpudata->pstate.turbo_freq;
> +
> + memcpy(&new_policy, policy, sizeof(*policy));
> + new_policy.max = min(policy->user_policy.max, policy->cpuinfo.max_freq);
> + new_policy.min = min(policy->user_policy.min, new_policy.max);
> +
> + cpufreq_set_policy(policy, &new_policy);
> +
> +unlock:
> + up_write(&policy->rwsem);
> +
> + cpufreq_cpu_put(policy);
> +}
> +
I tried to test on a macbook in hand however I did not see the _PPC
notifier on this machine so it might not cover the code path in
this patch. I checked the cpufreq with this patch using
different load and the cpufreq scales well.
> static void intel_pstate_update_limits(unsigned int cpu)
> {
> mutex_lock(&intel_pstate_driver_lock);
> @@ -908,7 +938,8 @@ static void intel_pstate_update_limits(u
> */
> if (global.turbo_disabled_upd != global.turbo_disabled) {
> global.turbo_disabled_upd = global.turbo_disabled;
> - intel_pstate_update_policies();
> + for_each_possible_cpu(cpu)
> + intel_pstate_update_max_freq(cpu);
> } else {
> cpufreq_update_policy(cpu);
> }
> Index: linux-pm/drivers/cpufreq/cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq.c
> +++ linux-pm/drivers/cpufreq/cpufreq.c
> @@ -34,11 +34,6 @@
>
> static LIST_HEAD(cpufreq_policy_list);
>
> -static inline bool policy_is_inactive(struct cpufreq_policy *policy)
> -{
> - return cpumask_empty(policy->cpus);
> -}
> -
> /* Macros to iterate over CPU policies */
> #define for_each_suitable_policy(__policy, __active) \
> list_for_each_entry(__policy, &cpufreq_policy_list, policy_list) \
> @@ -675,9 +670,6 @@ static ssize_t show_scaling_cur_freq(str
> return ret;
> }
>
> -static int cpufreq_set_policy(struct cpufreq_policy *policy,
> - struct cpufreq_policy *new_policy);
> -
> /**
> * cpufreq_per_cpu_attr_write() / store_##file_name() - sysfs write access
> */
> @@ -2235,8 +2227,8 @@ EXPORT_SYMBOL(cpufreq_get_policy);
> *
> * The cpuinfo part of @policy is not updated by this function.
> */
There first seems to be some patching error when applying this on
top of upstream 5.0, but I realized that this patch is based on
intel-next.
Thanks,
Ryan
> -static int cpufreq_set_policy(struct cpufreq_policy *policy,
> - struct cpufreq_policy *new_policy)
> +int cpufreq_set_policy(struct cpufreq_policy *policy,
> + struct cpufreq_policy *new_policy)
> {
> struct cpufreq_governor *old_gov;
> int ret;
> Index: linux-pm/include/linux/cpufreq.h
> ===================================================================
> --- linux-pm.orig/include/linux/cpufreq.h
> +++ linux-pm/include/linux/cpufreq.h
> @@ -178,6 +178,11 @@ static inline struct cpufreq_policy *cpu
> static inline void cpufreq_cpu_put(struct cpufreq_policy *policy) { }
> #endif
>
> +static inline bool policy_is_inactive(struct cpufreq_policy *policy)
> +{
> + return cpumask_empty(policy->cpus);
> +}
> +
> static inline bool policy_is_shared(struct cpufreq_policy *policy)
> {
> return cpumask_weight(policy->cpus) > 1;
> @@ -194,6 +199,8 @@ void disable_cpufreq(void);
>
> u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy);
> int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
> +int cpufreq_set_policy(struct cpufreq_policy *policy,
> + struct cpufreq_policy *new_policy);
> void cpufreq_update_policy(unsigned int cpu);
> void cpufreq_update_limits(unsigned int cpu);
> bool have_governor_per_policy(void);
>
[...]
> > There are other methods like PL1 budget limit for such cases. FW
> > can
> > just change the config TDP level.
>
> OK, but that would be done without notification I suppose?
There is a notification via processor PCI device (B0D4). This is passed
to user space to change the power limits. The new element is called
PPCC and it is exposed via sysfs.
Disabling turbo is not very interesting as there can be more turbo than
non turbo. So you loose lots of performance. So instead you can control
power in turbo region to give you more control. _PPC is even less
interesting as you can't control uncore power.
>
> > > That's fair enough, but the point is that the driver doesn'dev_t
> > > do
> > > the right thing even if the platform does send a _PPC
> > > notification.
> >
> > _PPC notification is to indicate levels in _PSS not to
> > disable/enable
> > turbo via IA32_MISC_*.
>
> I was not talking about notifications, but about what the driver does
> when MSR_IA32_MISC_ENABLE_TURBO_DISABLE is set or unset by FW on the
> fly: this only really works if the change is from unset to set,
> because the range of frequencies to use is restricted then, even
> though user-visible policy limits are not updated. However, if the
> change is from set to unset, the update of policy limits is actually
> necessary for the change to really take effect (otherwise the user
> policy max prevents turbo frequencies from being requested). HWP
> seems to be affected too, for that matter, because the upper limit in
> the MSR is not updated too then AFAICS.
>
> > The platform could have just set _PPC to 1 or to TAR-1. Here _PPC
> > is sent for somthing more than just changing _PSS max
> > level. Do we have bug in if _PPC just changes performance level?
>
> I think that we just need to live with the fact that _PPC may be
> triggered for something more than changing _PSS max.
I understand the reason why you are doing the change. I investigated
this bugzilla before. I was thinking can udev rules should be enough to
address this issue. But it was not a requirement.
The change itself is fine. Except may be hide the policy->rwsem in some
header file instead of directly exposing to clients to use.
Thanks,
Srinivas
On Mon, Mar 4, 2019 at 7:06 PM Srinivas Pandruvada
<[email protected]> wrote:
>
> [...]
> > > There are other methods like PL1 budget limit for such cases. FW
> > > can
> > > just change the config TDP level.
> >
> > OK, but that would be done without notification I suppose?
>
> There is a notification via processor PCI device (B0D4). This is passed
> to user space to change the power limits. The new element is called
> PPCC and it is exposed via sysfs.
What do you mean by "new element" and how exactly is it exposed?
> Disabling turbo is not very interesting as there can be more turbo than
> non turbo. So you loose lots of performance. So instead you can control
> power in turbo region to give you more control. _PPC is even less
> interesting as you can't control uncore power.
I guess that designers should know about that. The kernel is on the
receiving end here. :-)
> >
> > > > That's fair enough, but the point is that the driver doesn'dev_t
> > > > do
> > > > the right thing even if the platform does send a _PPC
> > > > notification.
> > >
> > > _PPC notification is to indicate levels in _PSS not to
> > > disable/enable
> > > turbo via IA32_MISC_*.
> >
> > I was not talking about notifications, but about what the driver does
> > when MSR_IA32_MISC_ENABLE_TURBO_DISABLE is set or unset by FW on the
> > fly: this only really works if the change is from unset to set,
> > because the range of frequencies to use is restricted then, even
> > though user-visible policy limits are not updated. However, if the
> > change is from set to unset, the update of policy limits is actually
> > necessary for the change to really take effect (otherwise the user
> > policy max prevents turbo frequencies from being requested). HWP
> > seems to be affected too, for that matter, because the upper limit in
> > the MSR is not updated too then AFAICS.
> >
> > > The platform could have just set _PPC to 1 or to TAR-1. Here _PPC
> > > is sent for somthing more than just changing _PSS max
> > > level. Do we have bug in if _PPC just changes performance level?
> >
> > I think that we just need to live with the fact that _PPC may be
> > triggered for something more than changing _PSS max.
>
> I understand the reason why you are doing the change. I investigated
> this bugzilla before. I was thinking can udev rules should be enough to
> address this issue. But it was not a requirement.
>
> The change itself is fine. Except may be hide the policy->rwsem in some
> header file instead of directly exposing to clients to use.
I'm not sure what you mean.
I guess that you are talking about intel_pstate_update_max_freq()
which acquires policy->rwsem. If so, what exactly is the problem with
it?
On Mon, 2019-03-04 at 22:57 +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 4, 2019 at 7:06 PM Srinivas Pandruvada
> <[email protected]> wrote:
> >
> > [...]
> > > > There are other methods like PL1 budget limit for such cases.
> > > > FW
> > > > can
> > > > just change the config TDP level.
> > >
> > > OK, but that would be done without notification I suppose?
> >
> > There is a notification via processor PCI device (B0D4). This is
> > passed
> > to user space to change the power limits. The new element is called
> > PPCC and it is exposed via sysfs.
>
> What do you mean by "new element" and how exactly is it exposed?
This is part of DPTF processor ACPI object (INT3401 or B0D4). They are
exposed in sysfs
E.g, /sys/bus/platform/devices/INT3401:00/power_limits/
There is a thermal uevent sent when they change. Both dptf daemon and
thermald listen and use to set rapl power-constraints including step
sizes for control. Someone can write a udev rule to do the same.
>
> > Disabling turbo is not very interesting as there can be more turbo
> > than
> > non turbo. So you loose lots of performance. So instead you can
> > control
> > power in turbo region to give you more control. _PPC is even less
> > interesting as you can't control uncore power.
>
> I guess that designers should know about that. The kernel is on the
> receiving end here. :-)
I think they know. Hence you don't see this issue of enable/disable of
turbo by firmware quite often. This laptop here I guess released in
beginning of 2014 with Haswell.
>
> > >
[...]
>
> I guess that you are talking about intel_pstate_update_max_freq()
> which acquires policy->rwsem. If so, what exactly is the problem
> with
> it?
I was suggesting to use an API/define in cpufreq.h which does operation
on policy->rwsem for better abstraction. This is the first time it was
used outside core cpufreq.c. As more places it will be used in future,
common function will help debug, if in some path there is a bug in
aquire/release of semaphore. But you can ignore this.
Thanks,
Srinivas
On Tue, Mar 5, 2019 at 12:04 AM Srinivas Pandruvada
<[email protected]> wrote:
>
> On Mon, 2019-03-04 at 22:57 +0100, Rafael J. Wysocki wrote:
> > On Mon, Mar 4, 2019 at 7:06 PM Srinivas Pandruvada
> > <[email protected]> wrote:
> > >
> > > [...]
> > > > > There are other methods like PL1 budget limit for such cases.
> > > > > FW
> > > > > can
> > > > > just change the config TDP level.
> > > >
> > > > OK, but that would be done without notification I suppose?
> > >
> > > There is a notification via processor PCI device (B0D4). This is
> > > passed
> > > to user space to change the power limits. The new element is called
> > > PPCC and it is exposed via sysfs.
> >
> > What do you mean by "new element" and how exactly is it exposed?
>
> This is part of DPTF processor ACPI object (INT3401 or B0D4). They are
> exposed in sysfs
> E.g, /sys/bus/platform/devices/INT3401:00/power_limits/
>
> There is a thermal uevent sent when they change. Both dptf daemon and
> thermald listen and use to set rapl power-constraints including step
> sizes for control. Someone can write a udev rule to do the same.
But the measure at hand here is a power one, not a thermal one AFAICS.
> > > Disabling turbo is not very interesting as there can be more turbo
> > > than
> > > non turbo. So you loose lots of performance. So instead you can
> > > control
> > > power in turbo region to give you more control. _PPC is even less
> > > interesting as you can't control uncore power.
> >
> > I guess that designers should know about that. The kernel is on the
> > receiving end here. :-)
>
> I think they know. Hence you don't see this issue of enable/disable of
> turbo by firmware quite often. This laptop here I guess released in
> beginning of 2014 with Haswell.
In this particular case, the battery is probably to weak to sustain
the currents associated with using high turbo P-states, so turbo needs
to be disabled altogether in order to avoid using turbo P-states at
all.
I guess this still would be the case on a contemporary system with a
sufficiently small battery.
> > > >
> [...]
>
> >
> > I guess that you are talking about intel_pstate_update_max_freq()
> > which acquires policy->rwsem. If so, what exactly is the problem
> > with
> > it?
> I was suggesting to use an API/define in cpufreq.h which does operation
> on policy->rwsem for better abstraction. This is the first time it was
> used outside core cpufreq.c. As more places it will be used in future,
> common function will help debug, if in some path there is a bug in
> aquire/release of semaphore. But you can ignore this.
Well, I guess I could introduce something like
cpufreq_cpu_acquire/release() that will lock/unlock policy->rwsem in
addition to getting the policy. That sequence is used in a couple of
places already.
Hi Rafael,
+CC Peter since we were talking about cpuinfo.*_freq recently.
On Friday 01 Mar 2019 at 13:57:06 (+0100), Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> While the cpuinfo.max_freq value doesn't really matter for
> intel_pstate in the active mode, in the passive mode it is used by
> governors as the maximum physical frequency of the CPU and the
> results of governor computations generally depend on it. Also it
> is made available to user space via sysfs and it should match the
> current HW configuration.
>
> For this reason, make intel_pstate update cpuinfo.max_freq for all
> CPUs if it detects a global change of turbo frequency settings from
> "disable" to "enable" or the other way associated with a _PPC change
> notification from the platform firmware.
>
> Note that policy_is_inactive() and cpufreq_set_policy() need to be
> made available to it for this purpose.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
> Reported-by: Gabriele Mazzotta <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> Update, because the patch sent previously doesn't build, due to an extra
> arg declared for intel_pstate_update_max_freq().
>
> ---
> drivers/cpufreq/cpufreq.c | 12 ++----------
> drivers/cpufreq/intel_pstate.c | 33 ++++++++++++++++++++++++++++++++-
> include/linux/cpufreq.h | 7 +++++++
> 3 files changed, 41 insertions(+), 11 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -897,6 +897,36 @@ static void intel_pstate_update_policies
> cpufreq_update_policy(cpu);
> }
>
> +static void intel_pstate_update_max_freq(unsigned int cpu)
> +{
> + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> + struct cpufreq_policy new_policy;
> + struct cpudata *cpudata;
> +
> + if (!policy)
> + return;
> +
> + down_write(&policy->rwsem);
> +
> + if (policy_is_inactive(policy))
> + goto unlock;
> +
> + cpudata = all_cpu_data[cpu];
> + policy->cpuinfo.max_freq = global.turbo_disabled_upd ?
> + cpudata->pstate.max_freq : cpudata->pstate.turbo_freq;
> +
> + memcpy(&new_policy, policy, sizeof(*policy));
> + new_policy.max = min(policy->user_policy.max, policy->cpuinfo.max_freq);
> + new_policy.min = min(policy->user_policy.min, new_policy.max);
> +
> + cpufreq_set_policy(policy, &new_policy);
Do you want to force-restart the governor here ? Schedutil caches
cpuinfo.max_freq for the iowait stuff in sugov_start() [1]. I'm not sure
about the other governors.
And just removing sg_cpu->iowait_boost_max to use the cpuinfo struct
instead will conflict with [2], I think.
Thanks,
Quentin
[1] https://elixir.bootlin.com/linux/latest/source/kernel/sched/cpufreq_schedutil.c#L840
[2] https://lore.kernel.org/lkml/[email protected]/
> +
> +unlock:
> + up_write(&policy->rwsem);
> +
> + cpufreq_cpu_put(policy);
> +}
On Tuesday, March 5, 2019 11:42:59 AM CET Quentin Perret wrote:
> Hi Rafael,
>
> +CC Peter since we were talking about cpuinfo.*_freq recently.
>
> On Friday 01 Mar 2019 at 13:57:06 (+0100), Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > While the cpuinfo.max_freq value doesn't really matter for
> > intel_pstate in the active mode, in the passive mode it is used by
> > governors as the maximum physical frequency of the CPU and the
> > results of governor computations generally depend on it. Also it
> > is made available to user space via sysfs and it should match the
> > current HW configuration.
> >
> > For this reason, make intel_pstate update cpuinfo.max_freq for all
> > CPUs if it detects a global change of turbo frequency settings from
> > "disable" to "enable" or the other way associated with a _PPC change
> > notification from the platform firmware.
> >
> > Note that policy_is_inactive() and cpufreq_set_policy() need to be
> > made available to it for this purpose.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
> > Reported-by: Gabriele Mazzotta <[email protected]>
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---
> >
> > Update, because the patch sent previously doesn't build, due to an extra
> > arg declared for intel_pstate_update_max_freq().
> >
> > ---
> > drivers/cpufreq/cpufreq.c | 12 ++----------
> > drivers/cpufreq/intel_pstate.c | 33 ++++++++++++++++++++++++++++++++-
> > include/linux/cpufreq.h | 7 +++++++
> > 3 files changed, 41 insertions(+), 11 deletions(-)
> >
> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> > @@ -897,6 +897,36 @@ static void intel_pstate_update_policies
> > cpufreq_update_policy(cpu);
> > }
> >
> > +static void intel_pstate_update_max_freq(unsigned int cpu)
> > +{
> > + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> > + struct cpufreq_policy new_policy;
> > + struct cpudata *cpudata;
> > +
> > + if (!policy)
> > + return;
> > +
> > + down_write(&policy->rwsem);
> > +
> > + if (policy_is_inactive(policy))
> > + goto unlock;
> > +
> > + cpudata = all_cpu_data[cpu];
> > + policy->cpuinfo.max_freq = global.turbo_disabled_upd ?
> > + cpudata->pstate.max_freq : cpudata->pstate.turbo_freq;
> > +
> > + memcpy(&new_policy, policy, sizeof(*policy));
> > + new_policy.max = min(policy->user_policy.max, policy->cpuinfo.max_freq);
> > + new_policy.min = min(policy->user_policy.min, new_policy.max);
> > +
> > + cpufreq_set_policy(policy, &new_policy);
>
> Do you want to force-restart the governor here ?
cpufreq_set_policy() is expected to take care of the governor.
If it doesn't, there is a bug somewhere.
> Schedutil caches cpuinfo.max_freq for the iowait stuff in sugov_start() [1].
If it does so, it should update the cached value in sugov_limits().
I guess I can add a patch updating it to this series.
> I'm not sure about the other governors.
They don't do that AFAICS.
> And just removing sg_cpu->iowait_boost_max to use the cpuinfo struct
> instead will conflict with [2], I think.
Thanks for pointing this out.
Cheers,
Rafael
On Tuesday, March 5, 2019 11:50:56 AM CET Rafael J. Wysocki wrote:
> On Tuesday, March 5, 2019 11:42:59 AM CET Quentin Perret wrote:
> > Hi Rafael,
> >
> > +CC Peter since we were talking about cpuinfo.*_freq recently.
> >
> > On Friday 01 Mar 2019 at 13:57:06 (+0100), Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > >
> > > While the cpuinfo.max_freq value doesn't really matter for
> > > intel_pstate in the active mode, in the passive mode it is used by
> > > governors as the maximum physical frequency of the CPU and the
> > > results of governor computations generally depend on it. Also it
> > > is made available to user space via sysfs and it should match the
> > > current HW configuration.
> > >
> > > For this reason, make intel_pstate update cpuinfo.max_freq for all
> > > CPUs if it detects a global change of turbo frequency settings from
> > > "disable" to "enable" or the other way associated with a _PPC change
> > > notification from the platform firmware.
> > >
> > > Note that policy_is_inactive() and cpufreq_set_policy() need to be
> > > made available to it for this purpose.
> > >
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
> > > Reported-by: Gabriele Mazzotta <[email protected]>
> > > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > > ---
> > >
> > > Update, because the patch sent previously doesn't build, due to an extra
> > > arg declared for intel_pstate_update_max_freq().
> > >
> > > ---
> > > drivers/cpufreq/cpufreq.c | 12 ++----------
> > > drivers/cpufreq/intel_pstate.c | 33 ++++++++++++++++++++++++++++++++-
> > > include/linux/cpufreq.h | 7 +++++++
> > > 3 files changed, 41 insertions(+), 11 deletions(-)
> > >
> > > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> > > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> > > @@ -897,6 +897,36 @@ static void intel_pstate_update_policies
> > > cpufreq_update_policy(cpu);
> > > }
> > >
> > > +static void intel_pstate_update_max_freq(unsigned int cpu)
> > > +{
> > > + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> > > + struct cpufreq_policy new_policy;
> > > + struct cpudata *cpudata;
> > > +
> > > + if (!policy)
> > > + return;
> > > +
> > > + down_write(&policy->rwsem);
> > > +
> > > + if (policy_is_inactive(policy))
> > > + goto unlock;
> > > +
> > > + cpudata = all_cpu_data[cpu];
> > > + policy->cpuinfo.max_freq = global.turbo_disabled_upd ?
> > > + cpudata->pstate.max_freq : cpudata->pstate.turbo_freq;
> > > +
> > > + memcpy(&new_policy, policy, sizeof(*policy));
> > > + new_policy.max = min(policy->user_policy.max, policy->cpuinfo.max_freq);
> > > + new_policy.min = min(policy->user_policy.min, new_policy.max);
> > > +
> > > + cpufreq_set_policy(policy, &new_policy);
> >
> > Do you want to force-restart the governor here ?
>
> cpufreq_set_policy() is expected to take care of the governor.
> If it doesn't, there is a bug somewhere.
>
> > Schedutil caches cpuinfo.max_freq for the iowait stuff in sugov_start() [1].
>
> If it does so, it should update the cached value in sugov_limits().
>
> I guess I can add a patch updating it to this series.
So after the Peter's patch "sched/cpufreq: Fix 32bit math overflow"
I will need to recompute sg_cpu->min in sugov_limits().
I'll wait for that patch from Peter to go in and repost the series on
top of it.
On Tuesday 05 Mar 2019 at 11:50:56 (+0100), Rafael J. Wysocki wrote:
> On Tuesday, March 5, 2019 11:42:59 AM CET Quentin Perret wrote:
> > > +static void intel_pstate_update_max_freq(unsigned int cpu)
> > > +{
> > > + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> > > + struct cpufreq_policy new_policy;
> > > + struct cpudata *cpudata;
> > > +
> > > + if (!policy)
> > > + return;
> > > +
> > > + down_write(&policy->rwsem);
> > > +
> > > + if (policy_is_inactive(policy))
> > > + goto unlock;
> > > +
> > > + cpudata = all_cpu_data[cpu];
> > > + policy->cpuinfo.max_freq = global.turbo_disabled_upd ?
> > > + cpudata->pstate.max_freq : cpudata->pstate.turbo_freq;
> > > +
> > > + memcpy(&new_policy, policy, sizeof(*policy));
> > > + new_policy.max = min(policy->user_policy.max, policy->cpuinfo.max_freq);
> > > + new_policy.min = min(policy->user_policy.min, new_policy.max);
> > > +
> > > + cpufreq_set_policy(policy, &new_policy);
> >
> > Do you want to force-restart the governor here ?
>
> cpufreq_set_policy() is expected to take care of the governor.
> If it doesn't, there is a bug somewhere.
Yes, it does something when there is a governor change, but that's not
the case here IIUC.
> > Schedutil caches cpuinfo.max_freq for the iowait stuff in sugov_start() [1].
>
> If it does so, it should update the cached value in sugov_limits().
>
> I guess I can add a patch updating it to this series.
Right, I think we've been under the assumption that unlike policy->max,
cpuinfo.max_freq should be constant, so there was no need to update the
value at run time. But if that assumption doesn't hold any more, then
yeah we'll need something more dynamic I guess.
>
> > I'm not sure about the other governors.
>
> They don't do that AFAICS.
OK
> > And just removing sg_cpu->iowait_boost_max to use the cpuinfo struct
> > instead will conflict with [2], I think.
>
> Thanks for pointing this out.
N/p :-)
Thanks,
Quentin
On Tue, Mar 05, 2019 at 11:58:37AM +0100, Rafael J. Wysocki wrote:
> So after the Peter's patch "sched/cpufreq: Fix 32bit math overflow"
> I will need to recompute sg_cpu->min in sugov_limits().
So there's still an open question; do we want that ->min thing to depend
on available frequencies _at_all_ ?
I'm thinking it might be a good thing to have the iowait boost curve be
independent of all that.
Like said; if we set it at 128 (static), it takes 9 consequtive wake-ups
for it to reach 1024 (max). While now the curve depends on how wide the
gap is between min_freq and max_freq. And it seems weird to have this
behaviour depend on that. To me at least.
Now, I don't know if 128/9 is the right curve, it is just a random
number I pulled out of a hat. But it seems to make more sense than
depending on frequencies.
On Tuesday 05 Mar 2019 at 12:44:06 (+0100), Peter Zijlstra wrote:
> On Tue, Mar 05, 2019 at 11:58:37AM +0100, Rafael J. Wysocki wrote:
> > So after the Peter's patch "sched/cpufreq: Fix 32bit math overflow"
> > I will need to recompute sg_cpu->min in sugov_limits().
>
> So there's still an open question; do we want that ->min thing to depend
> on available frequencies _at_all_ ?
>
> I'm thinking it might be a good thing to have the iowait boost curve be
> independent of all that.
>
> Like said; if we set it at 128 (static), it takes 9 consequtive wake-ups
> for it to reach 1024 (max). While now the curve depends on how wide the
> gap is between min_freq and max_freq. And it seems weird to have this
> behaviour depend on that. To me at least.
I'm not conceptually against it, but that really wants testing I think.
I can already see how we're gonna see regressions: 128 is much lower
than 'min' on my juno for ex, so with the approach you suggest it's
gonna take several wake-up before the iowait stuff does anything at all.
The first steps will all be below the min freq, so they'll just be
transparent, while right now the iowait boost kicks in much faster :/
OTOH, you also have platforms like the recent Snapdragons with 30+ OPPs,
and for them starting at 128 will speed things up.
So maybe what you want is to start at max(min, 128) ?
> Now, I don't know if 128/9 is the right curve, it is just a random
> number I pulled out of a hat. But it seems to make more sense than
> depending on frequencies.
And btw why do you need 9 steps to reach MAX starting from 128 ? If we
double the boost at each wakeup you'll do 128 -> 256 -> 512 -> 1024 no ?
Thanks,
Quentin
On Tue, Mar 5, 2019 at 12:44 PM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Mar 05, 2019 at 11:58:37AM +0100, Rafael J. Wysocki wrote:
> > So after the Peter's patch "sched/cpufreq: Fix 32bit math overflow"
> > I will need to recompute sg_cpu->min in sugov_limits().
>
> So there's still an open question; do we want that ->min thing to depend
> on available frequencies _at_all_ ?
>
> I'm thinking it might be a good thing to have the iowait boost curve be
> independent of all that.
>
> Like said; if we set it at 128 (static), it takes 9 consequtive wake-ups
> for it to reach 1024 (max). While now the curve depends on how wide the
> gap is between min_freq and max_freq. And it seems weird to have this
> behaviour depend on that. To me at least.
>
> Now, I don't know if 128/9 is the right curve, it is just a random
> number I pulled out of a hat. But it seems to make more sense than
> depending on frequencies.
I agree.
On Tue, Mar 05, 2019 at 12:00:43PM +0000, Quentin Perret wrote:
> And btw why do you need 9 steps to reach MAX starting from 128 ? If we
> double the boost at each wakeup you'll do 128 -> 256 -> 512 -> 1024 no ?
Argh; I'm so full of fail this week; 10-7=3.
On Tue, Mar 5, 2019 at 1:00 PM Quentin Perret <[email protected]> wrote:
>
> On Tuesday 05 Mar 2019 at 12:44:06 (+0100), Peter Zijlstra wrote:
> > On Tue, Mar 05, 2019 at 11:58:37AM +0100, Rafael J. Wysocki wrote:
> > > So after the Peter's patch "sched/cpufreq: Fix 32bit math overflow"
> > > I will need to recompute sg_cpu->min in sugov_limits().
> >
> > So there's still an open question; do we want that ->min thing to depend
> > on available frequencies _at_all_ ?
> >
> > I'm thinking it might be a good thing to have the iowait boost curve be
> > independent of all that.
> >
> > Like said; if we set it at 128 (static), it takes 9 consequtive wake-ups
> > for it to reach 1024 (max). While now the curve depends on how wide the
> > gap is between min_freq and max_freq. And it seems weird to have this
> > behaviour depend on that. To me at least.
>
> I'm not conceptually against it, but that really wants testing I think.
> I can already see how we're gonna see regressions: 128 is much lower
> than 'min' on my juno for ex, so with the approach you suggest it's
> gonna take several wake-up before the iowait stuff does anything at all.
But that 128 needs to be compared to
(SCHED_CAPACITY_SCALE * cpuinfo.min_freq) / cpuinfo.max_freq
so with SCHED_CAPACITY_SCALE equal to 1024 this means max_freq 8x
higher than min_freq. That is not totally unreasonable IMO and
because sg_cpu->iowait_boost grows exponentially, the difference
between 8x and, say, 4x is one iteration.
> The first steps will all be below the min freq, so they'll just be
> transparent, while right now the iowait boost kicks in much faster :/
There can be one iteration of a difference this way or that way AFAICS
and I'm not even sure how much of a performance difference that makes
in practice.
OTOH I fundamentally don't see why the iowait boost should ramp up
faster on CPUs having a higher max_freq to min_freq ratio. Say you
have two platforms, both with max_freq of 2 GHz and with min_freq
equal to 250 MHz and 500 MHz, respectively. The ratios in question
will be 8 and 4 then, so the first one will reliably react 50% slower
to iowait than the second one for no particular reason at all.
> OTOH, you also have platforms like the recent Snapdragons with 30+ OPPs,
> and for them starting at 128 will speed things up.
>
> So maybe what you want is to start at max(min, 128) ?
That's not just min, though, or is it?
On Tuesday 05 Mar 2019 at 18:02:25 (+0100), Rafael J. Wysocki wrote:
> But that 128 needs to be compared to
>
> (SCHED_CAPACITY_SCALE * cpuinfo.min_freq) / cpuinfo.max_freq
>
> so with SCHED_CAPACITY_SCALE equal to 1024 this means max_freq 8x
> higher than min_freq. That is not totally unreasonable IMO and
> because sg_cpu->iowait_boost grows exponentially, the difference
> between 8x and, say, 4x is one iteration.
>
> > The first steps will all be below the min freq, so they'll just be
> > transparent, while right now the iowait boost kicks in much faster :/
>
> There can be one iteration of a difference this way or that way AFAICS
> and I'm not even sure how much of a performance difference that makes
> in practice.
Yeah I don't expect that to have a huge impact TBH but it'd be nice to
actually get numbers to verify that, that's all I'm saying :-)
You have 'funny' platforms like Juno r0 out there where the min/max
frequencies are 450MHz/850Mhz. In this case, starting from 128 you'll
need 3 wake-ups to reach what is currently the starting point. I'm not
sure if the impact is visible or not, but it's worth checking.
> OTOH I fundamentally don't see why the iowait boost should ramp up
> faster on CPUs having a higher max_freq to min_freq ratio. Say you
> have two platforms, both with max_freq of 2 GHz and with min_freq
> equal to 250 MHz and 500 MHz, respectively. The ratios in question
> will be 8 and 4 then, so the first one will reliably react 50% slower
> to iowait than the second one for no particular reason at all.
>
> > OTOH, you also have platforms like the recent Snapdragons with 30+ OPPs,
> > and for them starting at 128 will speed things up.
> >
> > So maybe what you want is to start at max(min, 128) ?
>
> That's not just min, though, or is it?
I'm not sure to get the question, so just to make sure it's clearer, I
was suggesting to do something along the lines of:
sg_cpu->min = max(min_freq * 1024 / max_freq, 128);
That basically just prevents you from starting too low -- some boards,
unlike juno, have tons of OPPs on the lower end of the curve so these
might benefit from getting a higher starting point. But then perhaps
this is in fact a good illustration of the issue of having different
ramp-up speeds depending on the min_freq so ... :-)
Thanks,
Quentin
On Tue, Mar 5, 2019 at 6:37 PM Quentin Perret <[email protected]> wrote:
>
> On Tuesday 05 Mar 2019 at 18:02:25 (+0100), Rafael J. Wysocki wrote:
> > But that 128 needs to be compared to
> >
> > (SCHED_CAPACITY_SCALE * cpuinfo.min_freq) / cpuinfo.max_freq
> >
> > so with SCHED_CAPACITY_SCALE equal to 1024 this means max_freq 8x
> > higher than min_freq. That is not totally unreasonable IMO and
> > because sg_cpu->iowait_boost grows exponentially, the difference
> > between 8x and, say, 4x is one iteration.
> >
> > > The first steps will all be below the min freq, so they'll just be
> > > transparent, while right now the iowait boost kicks in much faster :/
> >
> > There can be one iteration of a difference this way or that way AFAICS
> > and I'm not even sure how much of a performance difference that makes
> > in practice.
>
> Yeah I don't expect that to have a huge impact TBH but it'd be nice to
> actually get numbers to verify that, that's all I'm saying :-)
>
> You have 'funny' platforms like Juno r0 out there where the min/max
> frequencies are 450MHz/850Mhz. In this case, starting from 128 you'll
> need 3 wake-ups to reach what is currently the starting point. I'm not
> sure if the impact is visible or not, but it's worth checking.
Please recall that the iowait boosting algo was different to start
with, though: it jumped to the max right away and then backed off.
That turned out to be overly aggressive in general and led to some
undesired "jittery" behavior, which is why it was changed.
Thus it looks like the platforms on which it still jumps to the max
right away may actually benefit from changing it to more steps. :-)
In turn, the platforms where it takes more than 3 steps for the boost
to get to the max would get a slight performance improvement from this
changes and I'm not sure why that could be bad.
Moreover, it didn't depend on the min originally, just on the max and
just because I wanted the number of backoff steps needed to go back
down to zero to be independent of the platform IIRC. The dependency
on the min is sort of artificial here and leads to arbitrary
differences in behavior between different platforms which isn't
particularly fortunate.
With all of that in mind, I would go right away to making the boost
independent of min and max (the final number of steps to reach the max
is TBD in practice, but 3 looks like a good enough compromise to me).
FWIW, the internal governor in intel_pstate has been changed along
these lines already (the changes is waiting for Linus to pull it ATM).
Hi Rafael,
On Wednesday 06 Mar 2019 at 11:05:47 (+0100), Rafael J. Wysocki wrote:
> Please recall that the iowait boosting algo was different to start
> with, though: it jumped to the max right away and then backed off.
> That turned out to be overly aggressive in general and led to some
> undesired "jittery" behavior, which is why it was changed.
>
> Thus it looks like the platforms on which it still jumps to the max
> right away may actually benefit from changing it to more steps. :-)
On the energy side of things at least ... ;-)
> In turn, the platforms where it takes more than 3 steps for the boost
> to get to the max would get a slight performance improvement from this
> changes and I'm not sure why that could be bad.
For energy possibly ? IIUC this thread:
https://patchwork.kernel.org/patch/9735885/
the original intent of the ramp thing for the iowait boost was to reduce
power consumption.
> Moreover, it didn't depend on the min originally, just on the max and
> just because I wanted the number of backoff steps needed to go back
> down to zero to be independent of the platform IIRC. The dependency
> on the min is sort of artificial here and leads to arbitrary
> differences in behavior between different platforms which isn't
> particularly fortunate.
It's a question of perspective I would say. Right now you can say the
behaviour is somewhat coherent across platforms: getting an IOWAIT boost
means you'll run twice as fast regardless of your board. With the '128
approach', you may or may not run faster, depending on your set of OPPs.
Also on recent big little SoCs, the capacity ratio can be pretty high.
You can get little CPUs with 300 of capacity or so. The arbitrary 128
thing is basically gonna go near max freq in one step, although the CPUs
actually 20 available OPPs or so. And I guess that's a shame.
For these reasons, I feel like it's not completely idiotic to have a
platform-dependent starting point, although arguably min_freq might not
always be the best choice.
> With all of that in mind, I would go right away to making the boost
> independent of min and max (the final number of steps to reach the max
> is TBD in practice, but 3 looks like a good enough compromise to me).
Perhaps the energy model could help find a good starting point, and a
good number of steps ?
FWIW there should be a slot at OSPM to discuss how sugov could be made
smarter using the EM :-).
> FWIW, the internal governor in intel_pstate has been changed along
> these lines already (the changes is waiting for Linus to pull it ATM).
Thanks,
Quentin
On Thu, Mar 07, 2019 at 11:02:58AM +0000, Quentin Perret wrote:
> Also on recent big little SoCs, the capacity ratio can be pretty high.
> You can get little CPUs with 300 of capacity or so. The arbitrary 128
> thing is basically gonna go near max freq in one step, although the CPUs
> actually 20 available OPPs or so. And I guess that's a shame.
Ah, but remember; with the latest patch the iowait_boost is scaled on
max. So in order to hit that 300, you first have to hit 1024.
+ boost = (sg_cpu->iowait_boost * max) >> SCHED_CAPACITY_SHIFT;
+ return max(boost, util);
So iowait_boost goes from 128 to 1024 in 4 wakeups (the first sets 128,
the next 3 get it to 1024), but the effective value is, given @max ==
300:
128 -> 37
256 -> 75
512 -> 150
1024 -> 300
so irrespective of the number of OPPs and their spread, it takes 4
wakeups to get to max.
On Thu, Mar 7, 2019 at 12:03 PM Quentin Perret <[email protected]> wrote:
>
> Hi Rafael,
>
> On Wednesday 06 Mar 2019 at 11:05:47 (+0100), Rafael J. Wysocki wrote:
> > Please recall that the iowait boosting algo was different to start
> > with, though: it jumped to the max right away and then backed off.
> > That turned out to be overly aggressive in general and led to some
> > undesired "jittery" behavior, which is why it was changed.
> >
> > Thus it looks like the platforms on which it still jumps to the max
> > right away may actually benefit from changing it to more steps. :-)
>
> On the energy side of things at least ... ;-)
>
> > In turn, the platforms where it takes more than 3 steps for the boost
> > to get to the max would get a slight performance improvement from this
> > changes and I'm not sure why that could be bad.
>
> For energy possibly ? IIUC this thread:
>
> https://patchwork.kernel.org/patch/9735885/
>
> the original intent of the ramp thing for the iowait boost was to reduce
> power consumption.
>
> > Moreover, it didn't depend on the min originally, just on the max and
> > just because I wanted the number of backoff steps needed to go back
> > down to zero to be independent of the platform IIRC. The dependency
> > on the min is sort of artificial here and leads to arbitrary
> > differences in behavior between different platforms which isn't
> > particularly fortunate.
>
> It's a question of perspective I would say. Right now you can say the
> behaviour is somewhat coherent across platforms: getting an IOWAIT boost
> means you'll run twice as fast regardless of your board. With the '128
> approach', you may or may not run faster, depending on your set of OPPs.
>
> Also on recent big little SoCs, the capacity ratio can be pretty high.
> You can get little CPUs with 300 of capacity or so. The arbitrary 128
> thing is basically gonna go near max freq in one step, although the CPUs
> actually 20 available OPPs or so. And I guess that's a shame.
OK, you seem to be arguing that on some platforms there is a little
difference between 128 and 1024 in terms of power, while there may be
a lot of difference between, say, 64 and 128.
I can buy that, but then it also makes sense to boost quickly enough,
so maybe it could start at the min and jump from there to 256 right
away in the first step?
> For these reasons, I feel like it's not completely idiotic to have a
> platform-dependent starting point, although arguably min_freq might not
> always be the best choice.
>
> > With all of that in mind, I would go right away to making the boost
> > independent of min and max (the final number of steps to reach the max
> > is TBD in practice, but 3 looks like a good enough compromise to me).
>
> Perhaps the energy model could help find a good starting point, and a
> good number of steps ?
>
> FWIW there should be a slot at OSPM to discuss how sugov could be made
> smarter using the EM :-).
Well, if you can make a case for that. :-)
On Thursday 07 Mar 2019 at 12:23:44 (+0100), Peter Zijlstra wrote:
> Ah, but remember; with the latest patch the iowait_boost is scaled on
> max. So in order to hit that 300, you first have to hit 1024.
Argh ! Right ... So yeah, my whole point about big.little platforms was
totally wrong :(
/me goes back to sleep
On Thursday 07 Mar 2019 at 12:25:10 (+0100), Rafael J. Wysocki wrote:
> On Thu, Mar 7, 2019 at 12:03 PM Quentin Perret <[email protected]> wrote:
> >
> > Hi Rafael,
> >
> > On Wednesday 06 Mar 2019 at 11:05:47 (+0100), Rafael J. Wysocki wrote:
> > > Please recall that the iowait boosting algo was different to start
> > > with, though: it jumped to the max right away and then backed off.
> > > That turned out to be overly aggressive in general and led to some
> > > undesired "jittery" behavior, which is why it was changed.
> > >
> > > Thus it looks like the platforms on which it still jumps to the max
> > > right away may actually benefit from changing it to more steps. :-)
> >
> > On the energy side of things at least ... ;-)
> >
> > > In turn, the platforms where it takes more than 3 steps for the boost
> > > to get to the max would get a slight performance improvement from this
> > > changes and I'm not sure why that could be bad.
> >
> > For energy possibly ? IIUC this thread:
> >
> > https://patchwork.kernel.org/patch/9735885/
> >
> > the original intent of the ramp thing for the iowait boost was to reduce
> > power consumption.
> >
> > > Moreover, it didn't depend on the min originally, just on the max and
> > > just because I wanted the number of backoff steps needed to go back
> > > down to zero to be independent of the platform IIRC. The dependency
> > > on the min is sort of artificial here and leads to arbitrary
> > > differences in behavior between different platforms which isn't
> > > particularly fortunate.
> >
> > It's a question of perspective I would say. Right now you can say the
> > behaviour is somewhat coherent across platforms: getting an IOWAIT boost
> > means you'll run twice as fast regardless of your board. With the '128
> > approach', you may or may not run faster, depending on your set of OPPs.
> >
> > Also on recent big little SoCs, the capacity ratio can be pretty high.
> > You can get little CPUs with 300 of capacity or so. The arbitrary 128
> > thing is basically gonna go near max freq in one step, although the CPUs
> > actually 20 available OPPs or so. And I guess that's a shame.
>
> OK, you seem to be arguing that on some platforms there is a little
> difference between 128 and 1024 in terms of power, while there may be
> a lot of difference between, say, 64 and 128.
Well, in fact what I was saying here is wrong. As Peter said in another
email, we'll scale the boost to the CPU's cap, so even on little CPUs it
would take 4 steps to go to max ... So the problem I was trying to
highlight here simply doesn't exist.
So yes, please ignore that point :/
> I can buy that, but then it also makes sense to boost quickly enough,
> so maybe it could start at the min and jump from there to 256 right
> away in the first step?
>
> > For these reasons, I feel like it's not completely idiotic to have a
> > platform-dependent starting point, although arguably min_freq might not
> > always be the best choice.
> >
> > > With all of that in mind, I would go right away to making the boost
> > > independent of min and max (the final number of steps to reach the max
> > > is TBD in practice, but 3 looks like a good enough compromise to me).
> >
> > Perhaps the energy model could help find a good starting point, and a
> > good number of steps ?
> >
> > FWIW there should be a slot at OSPM to discuss how sugov could be made
> > smarter using the EM :-).
>
> Well, if you can make a case for that. :-)
:-)
Thanks,
Quentin