Hi All,
This v3 addresses feedback from Srinivas and Doug, as well as Artem's
concerns regarding the EPP sysfs interface changes.
The purpose of this series is to address some peculiarities related to
taking CPUs offline/online and switching between different operation
modes with HWP enabled that have become visible after allowing the
driver to work in the passive mode with HWP enabled in 5.9-rc1 (and
one that was there earlier, but can be addressed easily after the
changes made in 5.9-rc1).
Please refer to the patch changelogs for details.
For easier testing/review, the series is available from the git branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
intel_pstate-testing
I've done my best to address all of the possible corner cases, but the test
matrix is quite extensive and I may have missed something, so please go
ahead and test.
Thanks,
Rafael
From: "Rafael J. Wysocki" <[email protected]>
Make intel_pstate update the cached EPP value when setting the EPP
via sysfs in the active mode, just as it does in the passive mode,
both for consistency and for the benefit of subsequent changes.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
-> v2: No changes
v2 -> v3:
* Do not change the read part of the EPP sysfs interface.
* Change the subject and update the changelog.
---
drivers/cpufreq/intel_pstate.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index bcda1e700a73..e540448e0bd0 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -644,6 +644,8 @@ static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data, int *raw
static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
{
+ int ret;
+
/*
* Use the cached HWP Request MSR value, because in the active mode the
* register itself may be updated by intel_pstate_hwp_boost_up() or
@@ -659,7 +661,11 @@ static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
* function, so it cannot run in parallel with the update below.
*/
WRITE_ONCE(cpu->hwp_req_cached, value);
- return wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+ ret = wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+ if (!ret)
+ cpu->epp_cached = epp;
+
+ return ret;
}
static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
@@ -762,10 +768,8 @@ static ssize_t store_energy_performance_preference(
cpufreq_stop_governor(policy);
ret = intel_pstate_set_epp(cpu, epp);
err = cpufreq_start_governor(policy);
- if (!ret) {
- cpu->epp_cached = epp;
+ if (!ret)
ret = err;
- }
}
}
@@ -2378,6 +2382,12 @@ static int intel_pstate_cpu_init(struct cpufreq_policy *policy)
*/
policy->policy = CPUFREQ_POLICY_POWERSAVE;
+ if (hwp_active) {
+ struct cpudata *cpu = all_cpu_data[policy->cpu];
+
+ cpu->epp_cached = intel_pstate_get_epp(cpu, 0);
+ }
+
return 0;
}
@@ -2585,7 +2595,7 @@ static int intel_cpufreq_cpu_init(struct cpufreq_policy *policy)
policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
WRITE_ONCE(cpu->hwp_req_cached, value);
- cpu->epp_cached = (value & GENMASK_ULL(31, 24)) >> 24;
+ cpu->epp_cached = intel_pstate_get_epp(cpu, value);
} else {
turbo_max = cpu->pstate.turbo_pstate;
policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
--
2.26.2
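As an illustration of the rule this patch introduces (the cached request is updated first, but the cached EPP only after a successful MSR write), here is a plain userspace C model. All `fake_*` names and the in-memory "MSR" are stand-ins for illustration, not kernel code:

```c
#include <stdint.h>

/* Userspace model of the reworked intel_pstate_set_epp() flow: the
 * cached EPP value is updated only when the (simulated) MSR write
 * succeeds.  As in the patch, the cached request value is written
 * before the MSR write is attempted. */

struct fake_cpudata {
	uint64_t hwp_req_cached;	/* last HWP Request value composed */
	int16_t epp_cached;		/* last EPP value accepted */
};

static uint64_t fake_hwp_request;	/* stands in for MSR_HWP_REQUEST */

/* Simulated wrmsrl_on_cpu(): returns 0 on success, nonzero on failure. */
static int fake_wrmsr(uint64_t *msr, uint64_t value, int fail)
{
	if (fail)
		return -5;	/* pretend the MSR write faulted */
	*msr = value;
	return 0;
}

int fake_set_epp(struct fake_cpudata *cpu, uint32_t epp, int fail)
{
	uint64_t value = cpu->hwp_req_cached;
	int ret;

	/* EPP lives in bits 31:24 of the HWP Request register. */
	value &= ~(0xffULL << 24);
	value |= (uint64_t)epp << 24;

	cpu->hwp_req_cached = value;		/* cache updated first */
	ret = fake_wrmsr(&fake_hwp_request, value, fail);
	if (!ret)
		cpu->epp_cached = (int16_t)epp;	/* only on success */

	return ret;
}
```

A failed write therefore leaves epp_cached at its previous value, which is exactly what lets store_energy_performance_preference() drop its own epp_cached update.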
From: "Rafael J. Wysocki" <[email protected]>
Modify the EPP sysfs interface to reject attempts to change the EPP
to values different from 0 ("performance") in the active mode with
the "performance" policy (i.e. scaling_governor set to "performance"),
to avoid situations in which the kernel appears to discard data
passed to it via the EPP sysfs attribute.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
v2 -> v3: New patch
---
Documentation/admin-guide/pm/intel_pstate.rst | 4 +++-
drivers/cpufreq/intel_pstate.c | 8 ++++++++
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index cdd1a9a7f9a2..5072e7064d13 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -123,7 +123,9 @@ Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's
internal P-state selection logic is expected to focus entirely on performance.
This will override the EPP/EPB setting coming from the ``sysfs`` interface
-(see `Energy vs Performance Hints`_ below).
+(see `Energy vs Performance Hints`_ below). Moreover, any attempts to change
+the EPP/EPB to a value different from 0 ("performance") via ``sysfs`` in this
+configuration will be rejected.
Also, in this configuration the range of P-states available to the processor's
internal P-state selection logic is always restricted to the upper boundary
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index e540448e0bd0..b308c39b6204 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -684,6 +684,14 @@ static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
else if (epp == -EINVAL)
epp = epp_values[pref_index - 1];
+ /*
+ * To avoid confusion, refuse to set EPP to any values different
+ * from 0 (performance) if the current policy is "performance",
+ * because those values would be overridden.
+ */
+ if (epp > 0 && cpu_data->policy == CPUFREQ_POLICY_PERFORMANCE)
+ return -EBUSY;
+
ret = intel_pstate_set_epp(cpu_data, epp);
} else {
if (epp == -EINVAL)
--
2.26.2
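The effect of the new check can be modeled in plain userspace C. This is a sketch only; `fake_epp_store` and the constants are local stand-ins that merely mirror the kernel's values:

```c
/* Userspace model of the check added to
 * intel_pstate_set_energy_pref_index() above: a nonzero EPP request is
 * refused with -EBUSY while the policy is "performance", because the
 * "performance" policy would override it anyway. */

#define FAKE_POLICY_PERFORMANCE	1
#define FAKE_POLICY_POWERSAVE	2
#define FAKE_EINVAL		22
#define FAKE_EBUSY		16

int fake_epp_store(int policy, int epp)
{
	if (epp < 0 || epp > 255)
		return -FAKE_EINVAL;	/* EPP is an 8-bit value */

	if (epp > 0 && policy == FAKE_POLICY_PERFORMANCE)
		return -FAKE_EBUSY;	/* would be silently overridden */

	return 0;	/* accepted; the driver would program the MSR here */
}
```

So with scaling_governor set to "performance", only writing 0/"performance" to energy_performance_preference succeeds, and the write no longer appears to be silently discarded.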
From: "Rafael J. Wysocki" <[email protected]>
Add ->offline and ->online driver callbacks to prepare a CPU for
going offline and to restore its working configuration when it goes
back online, respectively, so as to avoid invoking the ->init
callback on every CPU online, which is quite a bit of unnecessary
overhead.

Define ->offline and ->online so that they can be used in the
passive mode as well as in the active mode.  Since ->offline will do
the majority of the ->stop_cpu work, the passive mode does not need
that callback any more, so drop it from there.

Also modify the active mode ->suspend and ->resume callbacks to
prevent them from interfering with the new ->offline and ->online
ones in case the latter are invoked within the system-wide suspend
and resume code flow, and make the passive mode use them too.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
-> v2: Rearrange intel_pstate_init_cpu() to restore some of its previous
   behavior, retaining the current active-mode EPP management.
v2 -> v3:
* Fold the previous [5/5] in, rework intel_pstate_resume(), add
intel_pstate_suspend().
* Drop intel_pstate_hwp_save_state() and drop epp_saved from struct cpudata.
* Update the changelog.
---
drivers/cpufreq/intel_pstate.c | 139 +++++++++++++++++++++------------
1 file changed, 91 insertions(+), 48 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index b308c39b6204..a265ccbcbbd7 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -219,14 +219,13 @@ struct global_params {
* @epp_policy: Last saved policy used to set EPP/EPB
* @epp_default: Power on default HWP energy performance
* preference/bias
- * @epp_saved: Saved EPP/EPB during system suspend or CPU offline
- * operation
* @epp_cached Cached HWP energy-performance preference value
* @hwp_req_cached: Cached value of the last HWP Request MSR
* @hwp_cap_cached: Cached value of the last HWP Capabilities MSR
* @last_io_update: Last time when IO wake flag was set
* @sched_flags: Store scheduler flags for possible cross CPU update
* @hwp_boost_min: Last HWP boosted min performance
+ * @suspended: Whether or not the driver has been suspended.
*
* This structure stores per CPU instance data for all CPUs.
*/
@@ -258,13 +257,13 @@ struct cpudata {
s16 epp_powersave;
s16 epp_policy;
s16 epp_default;
- s16 epp_saved;
s16 epp_cached;
u64 hwp_req_cached;
u64 hwp_cap_cached;
u64 last_io_update;
unsigned int sched_flags;
u32 hwp_boost_min;
+ bool suspended;
};
static struct cpudata **all_cpu_data;
@@ -871,12 +870,6 @@ static void intel_pstate_hwp_set(unsigned int cpu)
cpu_data->epp_policy = cpu_data->policy;
- if (cpu_data->epp_saved >= 0) {
- epp = cpu_data->epp_saved;
- cpu_data->epp_saved = -EINVAL;
- goto update_epp;
- }
-
if (cpu_data->policy == CPUFREQ_POLICY_PERFORMANCE) {
epp = intel_pstate_get_epp(cpu_data, value);
cpu_data->epp_powersave = epp;
@@ -903,7 +896,6 @@ static void intel_pstate_hwp_set(unsigned int cpu)
epp = cpu_data->epp_powersave;
}
-update_epp:
if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
value &= ~GENMASK_ULL(31, 24);
value |= (u64)epp << 24;
@@ -915,14 +907,24 @@ static void intel_pstate_hwp_set(unsigned int cpu)
wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
}
-static void intel_pstate_hwp_force_min_perf(int cpu)
+static void intel_pstate_hwp_offline(struct cpudata *cpu)
{
- u64 value;
+ u64 value = READ_ONCE(cpu->hwp_req_cached);
int min_perf;
- value = all_cpu_data[cpu]->hwp_req_cached;
+ if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
+ /*
+ * In case the EPP has been set to "performance" by the
+ * active mode "performance" scaling algorithm, replace that
+ * temporary value with the cached EPP one.
+ */
+ value &= ~GENMASK_ULL(31, 24);
+ value |= HWP_ENERGY_PERF_PREFERENCE(cpu->epp_cached);
+ WRITE_ONCE(cpu->hwp_req_cached, value);
+ }
+
value &= ~GENMASK_ULL(31, 0);
- min_perf = HWP_LOWEST_PERF(all_cpu_data[cpu]->hwp_cap_cached);
+ min_perf = HWP_LOWEST_PERF(cpu->hwp_cap_cached);
/* Set hwp_max = hwp_min */
value |= HWP_MAX_PERF(min_perf);
@@ -932,19 +934,7 @@ static void intel_pstate_hwp_force_min_perf(int cpu)
if (boot_cpu_has(X86_FEATURE_HWP_EPP))
value |= HWP_ENERGY_PERF_PREFERENCE(HWP_EPP_POWERSAVE);
- wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
-}
-
-static int intel_pstate_hwp_save_state(struct cpufreq_policy *policy)
-{
- struct cpudata *cpu_data = all_cpu_data[policy->cpu];
-
- if (!hwp_active)
- return 0;
-
- cpu_data->epp_saved = intel_pstate_get_epp(cpu_data, 0);
-
- return 0;
+ wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
}
#define POWER_CTL_EE_ENABLE 1
@@ -971,8 +961,22 @@ static void set_power_ctl_ee_state(bool input)
static void intel_pstate_hwp_enable(struct cpudata *cpudata);
+static int intel_pstate_suspend(struct cpufreq_policy *policy)
+{
+ struct cpudata *cpu = all_cpu_data[policy->cpu];
+
+ pr_debug("CPU %d suspending\n", cpu->cpu);
+
+ cpu->suspended = true;
+
+ return 0;
+}
+
static int intel_pstate_resume(struct cpufreq_policy *policy)
{
+ struct cpudata *cpu = all_cpu_data[policy->cpu];
+
+ pr_debug("CPU %d resuming\n", cpu->cpu);
/* Only restore if the system default is changed */
if (power_ctl_ee_state == POWER_CTL_EE_ENABLE)
@@ -980,18 +984,22 @@ static int intel_pstate_resume(struct cpufreq_policy *policy)
else if (power_ctl_ee_state == POWER_CTL_EE_DISABLE)
set_power_ctl_ee_state(false);
- if (!hwp_active)
- return 0;
+ if (hwp_active) {
+ mutex_lock(&intel_pstate_limits_lock);
- mutex_lock(&intel_pstate_limits_lock);
+ /*
+ * Enable for all CPUs, because the boot CPU may not be the
+ * first one to resume.
+ */
+ intel_pstate_hwp_enable(cpu);
- if (policy->cpu == 0)
- intel_pstate_hwp_enable(all_cpu_data[policy->cpu]);
+ wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST,
+ READ_ONCE(cpu->hwp_req_cached));
- all_cpu_data[policy->cpu]->epp_policy = 0;
- intel_pstate_hwp_set(policy->cpu);
+ mutex_unlock(&intel_pstate_limits_lock);
+ }
- mutex_unlock(&intel_pstate_limits_lock);
+ cpu->suspended = false;
return 0;
}
@@ -1440,7 +1448,6 @@ static void intel_pstate_hwp_enable(struct cpudata *cpudata)
wrmsrl_on_cpu(cpudata->cpu, MSR_HWP_INTERRUPT, 0x00);
wrmsrl_on_cpu(cpudata->cpu, MSR_PM_ENABLE, 0x1);
- cpudata->epp_policy = 0;
if (cpudata->epp_default == -EINVAL)
cpudata->epp_default = intel_pstate_get_epp(cpudata, 0);
}
@@ -2111,7 +2118,6 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
cpu->epp_default = -EINVAL;
cpu->epp_powersave = -EINVAL;
- cpu->epp_saved = -EINVAL;
}
cpu = all_cpu_data[cpunum];
@@ -2122,6 +2128,7 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
const struct x86_cpu_id *id;
intel_pstate_hwp_enable(cpu);
+ cpu->epp_policy = 0;
id = x86_match_cpu(intel_pstate_hwp_boost_ids);
if (id && intel_pstate_acpi_pm_profile_server())
@@ -2308,28 +2315,59 @@ static int intel_pstate_verify_policy(struct cpufreq_policy_data *policy)
return 0;
}
-static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
+static int intel_pstate_cpu_offline(struct cpufreq_policy *policy)
{
+ struct cpudata *cpu = all_cpu_data[policy->cpu];
+
+ pr_debug("CPU %d going offline\n", cpu->cpu);
+
+ if (cpu->suspended)
+ return 0;
+
+ intel_pstate_exit_perf_limits(policy);
+
+ /*
+ * If the CPU is an SMT thread and it goes offline with the performance
+ * settings different from the minimum, it will prevent its sibling
+ * from getting to lower performance levels, so force the minimum
+ * performance on CPU offline to prevent that from happening.
+ */
if (hwp_active)
- intel_pstate_hwp_force_min_perf(policy->cpu);
+ intel_pstate_hwp_offline(cpu);
else
- intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+ intel_pstate_set_min_pstate(cpu);
+
+ return 0;
}
-static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
+static int intel_pstate_cpu_online(struct cpufreq_policy *policy)
{
- pr_debug("CPU %d exiting\n", policy->cpu);
+ struct cpudata *cpu = all_cpu_data[policy->cpu];
+
+ pr_debug("CPU %d going online\n", cpu->cpu);
+
+ if (cpu->suspended)
+ return 0;
+
+ intel_pstate_init_acpi_perf_limits(policy);
- intel_pstate_clear_update_util_hook(policy->cpu);
if (hwp_active)
- intel_pstate_hwp_save_state(policy);
+ wrmsrl_on_cpu(policy->cpu, MSR_HWP_REQUEST,
+ READ_ONCE(cpu->hwp_req_cached));
+
+ return 0;
+}
+
+static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
+{
+ pr_debug("CPU %d stopping\n", policy->cpu);
- intel_cpufreq_stop_cpu(policy);
+ intel_pstate_clear_update_util_hook(policy->cpu);
}
static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
{
- intel_pstate_exit_perf_limits(policy);
+ pr_debug("CPU %d exiting\n", policy->cpu);
policy->fast_switch_possible = false;
@@ -2403,11 +2441,13 @@ static struct cpufreq_driver intel_pstate = {
.flags = CPUFREQ_CONST_LOOPS,
.verify = intel_pstate_verify_policy,
.setpolicy = intel_pstate_set_policy,
- .suspend = intel_pstate_hwp_save_state,
+ .suspend = intel_pstate_suspend,
.resume = intel_pstate_resume,
.init = intel_pstate_cpu_init,
.exit = intel_pstate_cpu_exit,
.stop_cpu = intel_pstate_stop_cpu,
+ .offline = intel_pstate_cpu_offline,
+ .online = intel_pstate_cpu_online,
.update_limits = intel_pstate_update_limits,
.name = "intel_pstate",
};
@@ -2662,7 +2702,10 @@ static struct cpufreq_driver intel_cpufreq = {
.fast_switch = intel_cpufreq_fast_switch,
.init = intel_cpufreq_cpu_init,
.exit = intel_cpufreq_cpu_exit,
- .stop_cpu = intel_cpufreq_stop_cpu,
+ .offline = intel_pstate_cpu_offline,
+ .online = intel_pstate_cpu_online,
+ .suspend = intel_pstate_suspend,
+ .resume = intel_pstate_resume,
.update_limits = intel_pstate_update_limits,
.name = "intel_cpufreq",
};
--
2.26.2
On Thu, 2020-08-27 at 17:08 +0200, Rafael J. Wysocki wrote:
Unchecked MSR access (dmesg attached); then all HWP settings go to 0
on every CPU except CPU 0 after resuming from S3.
Full console log (complete copy/paste):
[labuser@otcpl-perf-test-skx-i9 ~]$ sudo -s
[root@otcpl-perf-test-skx-i9 labuser]# cd
/sys/devices/system/cpu/intel_pstate/
[root@otcpl-perf-test-skx-i9 intel_pstate]# grep . *
hwp_dynamic_boost:0
max_perf_pct:100
min_perf_pct:27
no_turbo:0
num_pstates:32
status:active
turbo_pct:32
[root@otcpl-perf-test-skx-i9 intel_pstate]# cd ../cpu0/cpufreq/
[root@otcpl-perf-test-skx-i9 cpufreq]# grep . *
affected_cpus:0
base_frequency:3300000
cpuinfo_max_freq:4300000
cpuinfo_min_freq:1200000
cpuinfo_transition_latency:0
energy_performance_available_preferences:default performance balance_performance balance_power power
energy_performance_preference:balance_performance
related_cpus:0
scaling_available_governors:performance powersave
scaling_cur_freq:1199695
scaling_driver:intel_pstate
scaling_governor:powersave
scaling_max_freq:4300000
scaling_min_freq:1200000
scaling_setspeed:<unsupported>
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -a 0x771
70c212b
70c212b
70c212b
70c212d
70c212b
70c212b
70c212b
70c212b
70c212b
70c212d
70c212b
70c212b
70c212b
70c212d
70c212b
70c212b
70c212b
70c212b
70c212b
70c212d
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -a 0x774
80002b0c
80002b0c
80002b0c
80002d0c
80002b0c
80002b0c
80002b0c
80002b0c
80002b0c
80002d0c
80002b0c
80002b0c
80002b0c
80002d0c
80002b0c
80002b0c
80002b0c
80002b0c
80002b0c
80002d0d
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 4000000 > scaling_max_freq
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 9x774
rdmsr: CPU 0 cannot read MSR 0x00000009
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 0x774
8000280c
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 4200000 > scaling_max_freq
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 0x774
80002a0c
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 4300000 > scaling_max_freq
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 0x774
80002b0c
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 4000000 > scaling_max_freq
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 0x774
8000280c
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 127 > energy_performance_preference
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 0x774
7f00280c
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 0 > ../../
cpu0/   cpu13/  cpu18/  cpu5/  cpufreq/       kernel_max  online    uevent
cpu1/   cpu14/  cpu19/  cpu6/  cpuidle/       microcode/  possible  vulnerabilities/
cpu10/  cpu15/  cpu2/   cpu7/  hotplug/       modalias    power/
cpu11/  cpu16/  cpu3/   cpu8/  intel_pstate/  nohz_full   present
cpu12/  cpu17/  cpu4/   cpu9/  isolated       offline     smt/
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 0 > ../
acpi_cppc/  cpuidle/          driver/         microcode/  subsystem/         uevent
cache/      crash_notes       firmware_node/  node0/      thermal_throttle/
cpufreq/    crash_notes_size  hotplug/        power/      topology/
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 128 > energy_performance_preference
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 4300000 > scaling_max_freq
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 0x774
80002b0c
[root@otcpl-perf-test-skx-i9 cpufreq]# cd ../../cpu1
cpu1/  cpu10/  cpu11/  cpu12/  cpu13/  cpu14/  cpu15/  cpu16/  cpu17/  cpu18/  cpu19/
[root@otcpl-perf-test-skx-i9 cpufreq]# cd ../../cpu1
[root@otcpl-perf-test-skx-i9 cpu1]# cd cpufreq/
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 0x774
80002b0c
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 127 > energy_performance_preference
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 4000000 > scaling_max_freq
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 0 0x774
80002b0c
[root@otcpl-perf-test-skx-i9 cpufreq]# cat scaling_max_freq
4000000
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 1 0x774
7f00280c
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 0 > ../online
bash: ../online: Permission denied
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 0 > ../online
affected_cpus               energy_performance_available_preferences  scaling_driver
base_frequency              energy_performance_preference             scaling_governor
cpuinfo_max_freq            related_cpus                              scaling_max_freq
cpuinfo_min_freq            scaling_available_governors               scaling_min_freq
cpuinfo_transition_latency  scaling_cur_freq                          scaling_setspeed
[root@otcpl-perf-test-skx-i9 cpufreq]# cat ../
cat: ../: Is a directory
[root@otcpl-perf-test-skx-i9 cpufreq]# cat ../online
cat: ../online: No such file or directory
[root@otcpl-perf-test-skx-i9 cpufreq]# ls ../../
cpu0   cpu11  cpu14  cpu17  cpu2  cpu5  cpu8     cpuidle       isolated    modalias   online    present  vulnerabilities
cpu1   cpu12  cpu15  cpu18  cpu3  cpu6  cpu9     hotplug       kernel_max  nohz_full  possible  smt
cpu10  cpu13  cpu16  cpu19  cpu4  cpu7  cpufreq  intel_pstate  microcode   offline    power     uevent
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 0 > ../../cpu1
cpu1/  cpu10/  cpu11/  cpu12/  cpu13/  cpu14/  cpu15/  cpu16/  cpu17/  cpu18/  cpu19/
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 0 > ../../cpu1/online
[root@otcpl-perf-test-skx-i9 cpufreq]# echo 1 > ../../cpu1/online
[root@otcpl-perf-test-skx-i9 cpufreq]# dmesg | tail
[ 626.188805] intel_pstate: CPU 1 stopping
[ 626.188808] intel_pstate: CPU 1 going offline
[ 626.205589] smpboot: CPU 1 is now offline
[ 633.671812] x86: Booting SMP configuration:
[ 633.671815] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 633.677208] intel_pstate: CPU 1 going online
[ 633.677253] intel_pstate: set_policy cpuinfo.max 4300000 policy->max 4000000
[ 633.677259] intel_pstate: cpu:1 max_state 43 min_policy_perf:12 max_policy_perf:40
[ 633.677260] intel_pstate: cpu:1 global_min:12 global_max:43
[ 633.677262] intel_pstate: cpu:1 max_perf_ratio:40 min_perf_ratio:12
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 1 0x774
7f00280c
[root@otcpl-perf-test-skx-i9 cpufreq]# rtcwake -m mem -s 10
rtcwake: wakeup from "mem" using /dev/rtc0 at Thu Aug 27 19:42:51 2020
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 1 0x774
0
[root@otcpl-perf-test-skx-i9 cpufreq]# grep . *
affected_cpus:1
base_frequency:3300000
cpuinfo_max_freq:4300000
cpuinfo_min_freq:1200000
cpuinfo_transition_latency:0
energy_performance_available_preferences:default performance balance_performance balance_power power
energy_performance_preference:performance
related_cpus:1
scaling_available_governors:performance powersave
scaling_cur_freq:699896
scaling_driver:intel_pstate
scaling_governor:powersave
scaling_max_freq:4000000
scaling_min_freq:1200000
scaling_setspeed:<unsupported>
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -a 0x774
80002b0c
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
[root@otcpl-perf-test-skx-i9 cpufreq]# dmesg > /tmp/dmesg.txt
[root@otcpl-perf-test-skx-i9 cpufreq]# vi /tmp/dmesg.txt
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -a 0x774
80002b0c
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
[root@otcpl-perf-test-skx-i9 cpufreq]#
Thanks,
Srinivas
On Thu, 2020-08-27 at 17:14 +0200, Rafael J. Wysocki wrote:
> From: "Rafael J. Wysocki" <[email protected]>
>
> Modify the EPP sysfs interface to reject attempts to change the EPP
> to values different from 0 ("performance") in the active mode with
> the "performance" policy (ie. scaling_governor set to "performance"),
> to avoid situations in which the kernel appears to discard data
> passed to it via the EPP sysfs attribute.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
This one looks good to me, thanks.
Reviewed-by: Artem Bityutskiy <[email protected]>
--
Best Regards,
Artem Bityutskiy
On Thu, Aug 27, 2020 at 5:28 PM Rafael J. Wysocki <[email protected]> wrote:
>
> From: "Rafael J. Wysocki" <[email protected]>
>
[...]
> -static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> +static int intel_pstate_cpu_online(struct cpufreq_policy *policy)
> {
> - pr_debug("CPU %d exiting\n", policy->cpu);
> + struct cpudata *cpu = all_cpu_data[policy->cpu];
> +
> + pr_debug("CPU %d going online\n", cpu->cpu);
> +
> + if (cpu->suspended)
_cpu_online() needs to enable HWP if "suspended", or the MSR accesses
in _set_policy() will trigger warnings when resuming from ACPI S3.
This has been fixed in the intel_pstate-testing branch already and I
will send an update of the patch tomorrow.
Thanks!
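For reviewers following along, the intended control flow could look roughly like the sketch below. This is a hypothetical userspace model based purely on the description above (the actual fix in the branch may differ); `fake_cpu` and its fields are stand-ins, with `hwp_enabled` modeling MSR_PM_ENABLE:

```c
#include <stdint.h>

/* Hypothetical model of the corrected ->online path: when the callback
 * runs during resume from S3 (cpu->suspended set), HWP must be
 * re-enabled before any subsequent MSR accesses, instead of returning
 * early with HWP still disabled. */

struct fake_cpu {
	int suspended;		/* set by ->suspend, cleared on resume */
	int hwp_enabled;	/* models MSR_PM_ENABLE */
	uint64_t hwp_req;	/* stands in for MSR_HWP_REQUEST */
	uint64_t hwp_req_cached;
};

int fake_cpu_online(struct fake_cpu *cpu, int hwp_active)
{
	if (!hwp_active)
		return 0;

	if (cpu->suspended) {
		/* Resume path: re-enable HWP so that later MSR writes
		 * from ->resume/set_policy do not fault. */
		cpu->hwp_enabled = 1;
		return 0;
	}

	/* Plain CPU online: restore the last known request value. */
	cpu->hwp_req = cpu->hwp_req_cached;
	return 0;
}
```

The difference from the v3 patch is only in the suspended branch, which previously returned without touching HWP at all.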