This patch series has 3 goals:
1. Make "cpu MHz" in /proc/cpufino supportable.
2. Make /sys/.../cpufreq/scaling_cur_freq meaningful
and consistent on modern x86 systems.
3. Use 1. and 2. to remove scheduler and cpufreq overhead.
There are 3 main changes since this series was proposed
about a year ago:
This update responds to distro feedback to make /proc/cpuinfo
"cpu MHz" constant. Originally, we had proposed making it return
the exact same value as cpufreq sysfs.
Some community members suggested that sysfs MHz values should
be meaningful, even down to 10ms intervals. So this has been
changed from the original proposal, which did not re-compute
at intervals shorter than 100ms.
(The recommendation remains to use turbostat(8) or another utility
to reliably measure concurrent intervals of arbitrary length.)
The intel_pstate sampling mechanism has changed.
Originally this series removed an intel_pstate timer in HWP mode.
Now it removes the analogous scheduler call-back.
Please let me know if you see any issues with this series.
thanks!
[PATCH 1/5] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"
[PATCH 2/5] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
[PATCH 3/5] intel_pstate: remove intel_pstate.get()
[PATCH 4/5] intel_pstate: skip scheduler hook when in "performance" mode.
[PATCH 5/5] intel_pstate: delete scheduler hook in HWP mode
The following changes since commit 3c2993b8c6143d8a5793746a54eba8f86f95240f:
Linux 4.12-rc4 (2017-06-04 16:47:43 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git x86
for you to fetch changes up to 8554c677900ad6bcb308f4afe1099aead3cfebd3:
intel_pstate: delete scheduler hook in HWP mode (2017-06-07 19:20:45 -0700)
----------------------------------------------------------------
Len Brown (5):
x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"
x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
intel_pstate: remove intel_pstate.get()
intel_pstate: skip scheduler hook when in "performance" mode.
intel_pstate: delete scheduler hook in HWP mode
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/aperfmperf.c | 82 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/proc.c | 10 +----
drivers/cpufreq/cpufreq.c | 7 +++-
drivers/cpufreq/intel_pstate.c | 34 +++--------------
include/linux/cpufreq.h | 13 +++++++
6 files changed, 110 insertions(+), 37 deletions(-)
create mode 100644 arch/x86/kernel/cpu/aperfmperf.c
From: Len Brown <[email protected]>
The goal of this change is to give users a uniform and meaningful
result when they read /sys/.../cpufreq/scaling_cur_freq
on modern x86 hardware, as compared to what they get today.
Modern x86 processors include the hardware needed
to accurately calculate frequency over an interval --
APERF, MPERF, and the TSC.
Here we provide an x86 routine to make this calculation
on supported hardware, and use it in preference to any
driver-specific cpufreq_driver.get() routine.
MHz is computed like so:
MHz = base_MHz * delta_APERF / delta_MPERF
MHz is the average frequency of the busy processor
over a measurement interval. The interval is
defined to be the time between successive invocations
of aperfmperf_khz_on_cpu(), which are expected to
happen on-demand when users read sysfs attribute
cpufreq/scaling_cur_freq.
As with previous methods of calculating MHz,
idle time is excluded.
base_MHz above is from TSC calibration global "cpu_khz".
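For illustration, with hypothetical values: if cpu_khz is 2400000
(a 2400 MHz base clock) and, between two reads, APERF advanced by
3000000 while MPERF advanced by 2000000, then:

	MHz = 2400 * 3000000 / 2000000 = 3600

i.e. the CPU averaged 3600 MHz while it was busy during that interval.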
This x86 native method to calculate MHz returns a meaningful result
no matter whether P-states are controlled by hardware or firmware,
and whether or not the Linux cpufreq sub-system is installed.
When this routine is invoked more frequently, the measurement
interval becomes shorter. However, the code limits re-computation
to 10ms intervals so that average frequency remains meaningful.
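For illustration only (not part of this patch), a minimal user-space
sketch of the on-demand read described above; cpu0 and the standard
sysfs path are used as an example:

	#include <stdio.h>

	int main(void)
	{
		unsigned int khz;
		FILE *f;

		/*
		 * Each read of scaling_cur_freq invokes
		 * aperfmperf_khz_on_cpu() for that CPU; reads spaced
		 * closer than 10ms return the previously computed value.
		 */
		f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
		if (!f)
			return 1;
		if (fscanf(f, "%u", &khz) == 1)
			printf("cpu0: %u kHz\n", khz);
		fclose(f);
		return 0;
	}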
Discerning users are encouraged to take advantage of
the turbostat(8) utility, which can gracefully handle
concurrent measurement intervals of arbitrary length.
Signed-off-by: Len Brown <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/aperfmperf.c | 82 ++++++++++++++++++++++++++++++++++++++++
drivers/cpufreq/cpufreq.c | 7 +++-
include/linux/cpufreq.h | 13 +++++++
4 files changed, 102 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/aperfmperf.c
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 5200001..cdf8249 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -21,6 +21,7 @@ obj-y += common.o
obj-y += rdrand.o
obj-y += match.o
obj-y += bugs.o
+obj-$(CONFIG_CPU_FREQ) += aperfmperf.o
obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
new file mode 100644
index 0000000..5ccf63a
--- /dev/null
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -0,0 +1,82 @@
+/*
+ * x86 APERF/MPERF KHz calculation for
+ * /sys/.../cpufreq/scaling_cur_freq
+ *
+ * Copyright (C) 2017 Intel Corp.
+ * Author: Len Brown <[email protected]>
+ *
+ * This file is licensed under GPLv2.
+ */
+
+#include <linux/jiffies.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+
+struct aperfmperf_sample {
+ unsigned int khz;
+ unsigned long jiffies;
+ u64 aperf;
+ u64 mperf;
+};
+
+static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
+
+/*
+ * aperfmperf_snapshot_khz()
+ * On the current CPU, snapshot APERF, MPERF, and jiffies
+ * unless we already did it within 10ms
+ * calculate kHz, save snapshot
+ */
+static void aperfmperf_snapshot_khz(void *dummy)
+{
+ u64 aperf, aperf_delta;
+ u64 mperf, mperf_delta;
+ struct aperfmperf_sample *s = &get_cpu_var(samples);
+
+ /* Don't bother re-computing within 10 ms */
+ if (time_before(jiffies, s->jiffies + HZ/100))
+ goto out;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ aperf_delta = aperf - s->aperf;
+ mperf_delta = mperf - s->mperf;
+
+ /*
+ * There is no architectural guarantee that MPERF
+ * increments faster than we can read it.
+ */
+ if (mperf_delta == 0)
+ goto out;
+
+ /*
+ * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
+ * khz = (cpu_khz * aperf_delta) / mperf_delta
+ */
+ if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
+ s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
+ else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
+ s->khz = div64_u64(aperf_delta,
+ div64_u64(mperf_delta, cpu_khz));
+ s->jiffies = jiffies;
+ s->aperf = aperf;
+ s->mperf = mperf;
+
+out:
+ put_cpu_var(samples);
+}
+
+unsigned int aperfmperf_khz_on_cpu(unsigned int cpu)
+{
+ if (!cpu_khz)
+ return 0;
+
+ if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+ return 0;
+
+ smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+
+ return per_cpu(samples.khz, cpu);
+}
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 26b643d..a667fac 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -635,8 +635,13 @@ show_one(scaling_max_freq, max);
static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
{
ssize_t ret;
+ unsigned int freq;
- if (cpufreq_driver && cpufreq_driver->setpolicy && cpufreq_driver->get)
+ freq = arch_freq_get_on_cpu(policy->cpu);
+ if (freq)
+ ret = sprintf(buf, "%u\n", freq);
+ else if (cpufreq_driver && cpufreq_driver->setpolicy &&
+ cpufreq_driver->get)
ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
else
ret = sprintf(buf, "%u\n", policy->cur);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index a5ce0bbe..cfc6220 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -883,6 +883,19 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
}
#endif
+#ifdef CONFIG_X86
+extern unsigned int aperfmperf_khz_on_cpu(unsigned int cpu);
+static inline unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ return aperfmperf_khz_on_cpu(cpu);
+}
+#else
+static inline unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ return 0;
+}
+#endif
+
/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
--
2.7.4
From: Len Brown <[email protected]>
The x86 cpufreq core now uses aperfmperf_khz_on_cpu()
to supply /sys/.../cpufreq/scaling_cur_freq
on all x86 systems supporting APERF/MPERF.
That includes 100% of systems supported by intel_pstate,
and so intel_pstate.get() is now a NOP -- remove it.
Invoke aperfmperf_khz_on_cpu() directly,
if legacy-mode p-state tracing is enabled.
Signed-off-by: Len Brown <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 16 +---------------
1 file changed, 1 insertion(+), 15 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index b7de5bd..5d67780 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1597,12 +1597,6 @@ static inline bool intel_pstate_sample(struct cpudata *cpu, u64 time)
return false;
}
-static inline int32_t get_avg_frequency(struct cpudata *cpu)
-{
- return mul_ext_fp(cpu->sample.core_avg_perf,
- cpu->pstate.max_pstate_physical * cpu->pstate.scaling);
-}
-
static inline int32_t get_avg_pstate(struct cpudata *cpu)
{
return mul_ext_fp(cpu->pstate.max_pstate_physical,
@@ -1728,7 +1722,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
sample->mperf,
sample->aperf,
sample->tsc,
- get_avg_frequency(cpu),
+ aperfmperf_khz_on_cpu(cpu->cpu),
fp_toint(cpu->iowait_boost * 100));
}
@@ -1922,13 +1916,6 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
return 0;
}
-static unsigned int intel_pstate_get(unsigned int cpu_num)
-{
- struct cpudata *cpu = all_cpu_data[cpu_num];
-
- return cpu ? get_avg_frequency(cpu) : 0;
-}
-
static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
{
struct cpudata *cpu = all_cpu_data[cpu_num];
@@ -2157,7 +2144,6 @@ static struct cpufreq_driver intel_pstate = {
.setpolicy = intel_pstate_set_policy,
.suspend = intel_pstate_hwp_save_state,
.resume = intel_pstate_resume,
- .get = intel_pstate_get,
.init = intel_pstate_cpu_init,
.exit = intel_pstate_cpu_exit,
.stop_cpu = intel_pstate_stop_cpu,
--
2.7.4
From: Len Brown <[email protected]>
When the governor is set to "performance", intel_pstate does not
need the scheduler hook for doing any calculations. Under these
conditions, its only purpose is to continue to maintain
cpufreq/scaling_cur_freq.
But the cpufreq/scaling_cur_freq sysfs attribute is now provided by
the x86 cpufreq core on all modern x86 systems, including
all systems supported by the intel_pstate driver.
So in "performance" governor mode, the scheduler hook can be skipped.
This applies in both Software and Hardware P-state control modes.
Suggested-by: Srinivas Pandruvada <[email protected]>
Signed-off-by: Len Brown <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 5d67780..0ff3a4b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2025,10 +2025,10 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
*/
intel_pstate_clear_update_util_hook(policy->cpu);
intel_pstate_max_within_limits(cpu);
+ } else {
+ intel_pstate_set_update_util_hook(policy->cpu);
}
- intel_pstate_set_update_util_hook(policy->cpu);
-
if (hwp_active)
intel_pstate_hwp_set(policy->cpu);
--
2.7.4
From: Len Brown <[email protected]>
The cpufreq/scaling_cur_freq sysfs attribute is now provided by
the x86 cpufreq core on all modern x86 systems, including
all systems supported by the intel_pstate driver.
In HWP mode, maintaining that value was the sole purpose of
the scheduler hook, intel_pstate_update_util_hwp(),
so it can now be removed.
Signed-off-by: Len Brown <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 0ff3a4b..718732b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1726,16 +1726,6 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
fp_toint(cpu->iowait_boost * 100));
}
-static void intel_pstate_update_util_hwp(struct update_util_data *data,
- u64 time, unsigned int flags)
-{
- struct cpudata *cpu = container_of(data, struct cpudata, update_util);
- u64 delta_ns = time - cpu->sample.time;
-
- if ((s64)delta_ns >= INTEL_PSTATE_HWP_SAMPLING_INTERVAL)
- intel_pstate_sample(cpu, time);
-}
-
static void intel_pstate_update_util_pid(struct update_util_data *data,
u64 time, unsigned int flags)
{
@@ -1920,6 +1910,9 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
{
struct cpudata *cpu = all_cpu_data[cpu_num];
+ if (hwp_active)
+ return;
+
if (cpu->update_util_set)
return;
@@ -2543,7 +2536,6 @@ static int __init intel_pstate_init(void)
} else {
hwp_active++;
intel_pstate.attr = hwp_cpufreq_attrs;
- pstate_funcs.update_util = intel_pstate_update_util_hwp;
goto hwp_cpu_matched;
}
} else {
--
2.7.4
From: Len Brown <[email protected]>
cpufreq_quick_get() allows cpufreq drivers to override the cpu_khz
value that is otherwise reported in x86 /proc/cpuinfo "cpu MHz".
There are four problems with this scheme,
any one of which is sufficient justification to delete it.
1. Depending on which cpufreq driver is loaded, the behavior
of this field is different.
2. Distros complain that they have to explain to users
why and how this field changes. Distros have requested a constant.
3. The two major providers of this information, acpi_cpufreq
and intel_pstate, both "get it wrong" in different ways.
acpi_cpufreq lies to the user by telling them that
they are running at whatever frequency was last
requested by software.
intel_pstate lies to the user by telling them that
they are running at the average frequency computed
over an undefined measurement interval. But an average
computed over an undefined interval is, itself, undefined...
4. On modern processors, user space utilities, such as
turbostat(8), are more accurate and more precise, while
supporting concurrent measurement over arbitrary intervals.
Users who have been consulting /proc/cpuinfo to
track changing CPU frequency will be disappointed that
it no longer wiggles -- perhaps being unaware of the
limitations of the information they have been consuming.
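For example, with a hypothetical TSC calibration of cpu_khz = 3400000,
every processor now reports the same constant value:

	cpu MHz		: 3400.000

regardless of what frequency the CPU happens to be running at.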
Yes, they can change their scripts to look in sysfs
cpufreq/scaling_cur_freq. There they will find the same
data of dubious quality now removed from /proc/cpuinfo.
The value in sysfs will be fixed by a subsequent patch,
which addresses issues 1-3, above.
Issue 4 will remain -- users who really care about
accurate frequency information should not be using either
the proc or the sysfs kernel interfaces.
They should be using turbostat(8), or a similar
purpose-built analysis tool.
Signed-off-by: Len Brown <[email protected]>
---
arch/x86/kernel/cpu/proc.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 6df621a..218f798 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -2,7 +2,6 @@
#include <linux/timex.h>
#include <linux/string.h>
#include <linux/seq_file.h>
-#include <linux/cpufreq.h>
/*
* Get CPU information for use by the procfs.
@@ -76,14 +75,9 @@ static int show_cpuinfo(struct seq_file *m, void *v)
if (c->microcode)
seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
- if (cpu_has(c, X86_FEATURE_TSC)) {
- unsigned int freq = cpufreq_quick_get(cpu);
-
- if (!freq)
- freq = cpu_khz;
+ if (cpu_has(c, X86_FEATURE_TSC))
seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
- freq / 1000, (freq % 1000));
- }
+ cpu_khz / 1000, (cpu_khz % 1000));
/* Cache size */
if (c->x86_cache_size >= 0)
--
2.7.4
On Wednesday, June 07, 2017 07:39:14 PM Len Brown wrote:
> From: Len Brown <[email protected]>
>
> The x86 cpufreq core now uses aperfmperf_khz_on_cpu()
> to supply /sys/.../cpufreq/scaling_cur_freq
> on all x86 systems supporting APERF/MPERF.
>
> That includes 100% of systems supported by intel_pstate,
> and so intel_pstate.get() is now a NOP -- remove it.
>
> Invoke aperfmperf_khz_on_cpu() directly,
> if legacy-mode p-state tracing is enabled.
>
> Signed-off-by: Len Brown <[email protected]>
> ---
> drivers/cpufreq/intel_pstate.c | 16 +---------------
> 1 file changed, 1 insertion(+), 15 deletions(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index b7de5bd..5d67780 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1597,12 +1597,6 @@ static inline bool intel_pstate_sample(struct cpudata *cpu, u64 time)
> return false;
> }
>
> -static inline int32_t get_avg_frequency(struct cpudata *cpu)
> -{
> - return mul_ext_fp(cpu->sample.core_avg_perf,
> - cpu->pstate.max_pstate_physical * cpu->pstate.scaling);
> -}
> -
> static inline int32_t get_avg_pstate(struct cpudata *cpu)
> {
> return mul_ext_fp(cpu->pstate.max_pstate_physical,
> @@ -1728,7 +1722,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
> sample->mperf,
> sample->aperf,
> sample->tsc,
> - get_avg_frequency(cpu),
> + aperfmperf_khz_on_cpu(cpu->cpu),
> fp_toint(cpu->iowait_boost * 100));
> }
>
> @@ -1922,13 +1916,6 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
> return 0;
> }
>
> -static unsigned int intel_pstate_get(unsigned int cpu_num)
> -{
> - struct cpudata *cpu = all_cpu_data[cpu_num];
> -
> - return cpu ? get_avg_frequency(cpu) : 0;
> -}
> -
> static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
> {
> struct cpudata *cpu = all_cpu_data[cpu_num];
> @@ -2157,7 +2144,6 @@ static struct cpufreq_driver intel_pstate = {
> .setpolicy = intel_pstate_set_policy,
> .suspend = intel_pstate_hwp_save_state,
> .resume = intel_pstate_resume,
> - .get = intel_pstate_get,
> .init = intel_pstate_cpu_init,
> .exit = intel_pstate_cpu_exit,
> .stop_cpu = intel_pstate_stop_cpu,
>
This change will cause cpufreq_quick_get() to work differently and it is
called by KVM among other things. Will that still work?
Thanks,
Rafael
On Wednesday, June 07, 2017 07:39:15 PM Len Brown wrote:
> From: Len Brown <[email protected]>
>
> When the governor is set to "performance", intel_pstate does not
> need the scheduler hook for doing any calculations. Under these
> conditions, its only purpose is to continue to maintain
> cpufreq/scaling_cur_freq.
>
> But the cpufreq/scaling_cur_freq sysfs attribute is now provided by
> the x86 cpufreq core on all modern x86 systems, including
> all systems supported by the intel_pstate driver.
>
> So in "performance" governor mode, the scheduler hook can be skipped.
> This applies to both in Software and Hardware P-state control modes.
>
> Suggested-by: Srinivas Pandruvada <[email protected]>
> Signed-off-by: Len Brown <[email protected]>
> ---
> drivers/cpufreq/intel_pstate.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 5d67780..0ff3a4b 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2025,10 +2025,10 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
> */
> intel_pstate_clear_update_util_hook(policy->cpu);
The statement above shouldn't be necessary any more after the change below.
> intel_pstate_max_within_limits(cpu);
> + } else {
> + intel_pstate_set_update_util_hook(policy->cpu);
> }
>
> - intel_pstate_set_update_util_hook(policy->cpu);
> -
> if (hwp_active)
> intel_pstate_hwp_set(policy->cpu);
>
>
What about update_turbo_pstate()?
In theory MSR_IA32_MISC_ENABLE_TURBO_DISABLE can be set at any time, so
wouldn't that become problematic after this change?
Thanks,
Rafael
On Wednesday, June 07, 2017 07:39:16 PM Len Brown wrote:
> From: Len Brown <[email protected]>
>
> The cpufreqa/scaling_cur_freq sysfs attribute is now provided by
> the x86 cpufreq core on all modern x86 systems, including
> all systems supported by the intel_pstate driver.
Not sure what you mean by "x86 cpufreq core"?
Besides, I'd reorder this change with respect to patch [4/5] as this
eliminates the hook entirely and then the "performance"-related change
would only affect non-HWP.
Thanks,
Rafael
On Wednesday, June 07, 2017 07:39:14 PM Len Brown wrote:
> From: Len Brown <[email protected]>
>
> The x86 cpufreq core now uses aperfmperf_khz_on_cpu()
> to supply /sys/.../cpufreq/scaling_cur_freq
> on all x86 systems supporting APERF/MPERF.
>
> That includes 100% of systems supported by intel_pstate,
> and so intel_pstate.get() is now a NOP -- remove it.
>
> Invoke aperfmperf_khz_on_cpu() directly,
> if legacy-mode p-state tracing is enabled.
>
> Signed-off-by: Len Brown <[email protected]>
> ---
> drivers/cpufreq/intel_pstate.c | 16 +---------------
> 1 file changed, 1 insertion(+), 15 deletions(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index b7de5bd..5d67780 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1597,12 +1597,6 @@ static inline bool intel_pstate_sample(struct cpudata *cpu, u64 time)
> return false;
> }
>
> -static inline int32_t get_avg_frequency(struct cpudata *cpu)
> -{
> - return mul_ext_fp(cpu->sample.core_avg_perf,
> - cpu->pstate.max_pstate_physical * cpu->pstate.scaling);
> -}
> -
> static inline int32_t get_avg_pstate(struct cpudata *cpu)
> {
> return mul_ext_fp(cpu->pstate.max_pstate_physical,
> @@ -1728,7 +1722,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
> sample->mperf,
> sample->aperf,
> sample->tsc,
> - get_avg_frequency(cpu),
> + aperfmperf_khz_on_cpu(cpu->cpu),
I don't think this change is an improvement.
We compute cpu->sample.core_avg_perf anyway and calling aperfmperf_khz_on_cpu()
here means quite a bit of additional overhead.
Besides, the sample->mperf and sample->aperf values here are expected to be
the ones used for computing the frequency.
> fp_toint(cpu->iowait_boost * 100));
> }
>
> @@ -1922,13 +1916,6 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
> return 0;
> }
>
> -static unsigned int intel_pstate_get(unsigned int cpu_num)
> -{
> - struct cpudata *cpu = all_cpu_data[cpu_num];
> -
> - return cpu ? get_avg_frequency(cpu) : 0;
> -}
> -
> static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
> {
> struct cpudata *cpu = all_cpu_data[cpu_num];
> @@ -2157,7 +2144,6 @@ static struct cpufreq_driver intel_pstate = {
> .setpolicy = intel_pstate_set_policy,
> .suspend = intel_pstate_hwp_save_state,
> .resume = intel_pstate_resume,
> - .get = intel_pstate_get,
> .init = intel_pstate_cpu_init,
> .exit = intel_pstate_cpu_exit,
> .stop_cpu = intel_pstate_stop_cpu,
>
Thanks,
Rafael
On Fri, Jun 16, 2017 at 7:53 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Wednesday, June 07, 2017 07:39:14 PM Len Brown wrote:
>> From: Len Brown <[email protected]>
>>
>> The x86 cpufreq core now uses aperfmperf_khz_on_cpu()
>> to supply /sys/.../cpufreq/scaling_cur_freq
>> on all x86 systems supporting APERF/MPERF.
>>
>> That includes 100% of systems supported by intel_pstate,
>> and so intel_pstate.get() is now a NOP -- remove it.
>>
>> Invoke aperfmperf_khz_on_cpu() directly,
>> if legacy-mode p-state tracing is enabled.
>>
>> Signed-off-by: Len Brown <[email protected]>
>> ---
>> drivers/cpufreq/intel_pstate.c | 16 +---------------
>> 1 file changed, 1 insertion(+), 15 deletions(-)
>>
>> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> index b7de5bd..5d67780 100644
>> --- a/drivers/cpufreq/intel_pstate.c
>> +++ b/drivers/cpufreq/intel_pstate.c
>> @@ -1597,12 +1597,6 @@ static inline bool intel_pstate_sample(struct cpudata *cpu, u64 time)
>> return false;
>> }
>>
>> -static inline int32_t get_avg_frequency(struct cpudata *cpu)
>> -{
>> - return mul_ext_fp(cpu->sample.core_avg_perf,
>> - cpu->pstate.max_pstate_physical * cpu->pstate.scaling);
>> -}
>> -
>> static inline int32_t get_avg_pstate(struct cpudata *cpu)
>> {
>> return mul_ext_fp(cpu->pstate.max_pstate_physical,
>> @@ -1728,7 +1722,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
>> sample->mperf,
>> sample->aperf,
>> sample->tsc,
>> - get_avg_frequency(cpu),
>> + aperfmperf_khz_on_cpu(cpu->cpu),
Note that I deleted the line above in an updated version of this patch
that I'm ready to send out.
There were a couple of problems with it.
The first is that it was ugly that tracing (which occurs only in the
SW governor case)
could shorten the measurement interval as seen by the sysfs interface.
The second is that this trace point can be called with irqs off,
and smp_call_function_single() will WARN when called with irqs off.
Srinivas Acked that I simply remove this field from the tracepoint --
as it is redundant to calculate it in the kernel when we are already
exporting the raw values of aperf and mperf.
>> fp_toint(cpu->iowait_boost * 100));
>> }
>>
>> @@ -1922,13 +1916,6 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
>> return 0;
>> }
>>
>> -static unsigned int intel_pstate_get(unsigned int cpu_num)
>> -{
>> - struct cpudata *cpu = all_cpu_data[cpu_num];
>> -
>> - return cpu ? get_avg_frequency(cpu) : 0;
>> -}
>> -
>> static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
>> {
>> struct cpudata *cpu = all_cpu_data[cpu_num];
>> @@ -2157,7 +2144,6 @@ static struct cpufreq_driver intel_pstate = {
>> .setpolicy = intel_pstate_set_policy,
>> .suspend = intel_pstate_hwp_save_state,
>> .resume = intel_pstate_resume,
>> - .get = intel_pstate_get,
>> .init = intel_pstate_cpu_init,
>> .exit = intel_pstate_cpu_exit,
>> .stop_cpu = intel_pstate_stop_cpu,
>>
>
> This change will cause cpufreq_quick_get() to work differently and it is
> called by KVM among other things. Will that still work?
Neither invocation of cpufreq_quick_get() (ARM cpu-cooling and X86 variable TSC)
is aligned with the hardware supported by the intel_pstate driver.
If it were, then yes, cpufreq_quick_get() would not call the (absent)
intel_pstate.get,
and instead would return policy->cur.
acpi-cpufreq retains its internal .get routine, and so cpufreq_quick_get()
doesn't change at all on those legacy systems. As it turns out, this
is moot, since that driver returns the cached frequency request
in either case.
thanks,
Len Brown, Intel Open Source Technology Center
On Wednesday, June 07, 2017 07:39:13 PM Len Brown wrote:
> From: Len Brown <[email protected]>
>
> The goal of this change is to give users a uniform and meaningful
> result when they read /sys/...cpufreq/scaling_cur_freq
> on modern x86 hardware, as compared to what they get today.
>
> Modern x86 processors include the hardware needed
> to accurately calculate frequency over an interval --
> APERF, MPERF, and the TSC.
>
> Here we provide an x86 routine to make this calculation
> on supported hardware, and use it in preference to any
> driver driver-specific cpufreq_driver.get() routine.
>
> MHz is computed like so:
>
> MHz = base_MHz * delta_APERF / delta_MPERF
>
> MHz is the average frequency of the busy processor
> over a measurement interval. The interval is
> defined to be the time between successive invocations
> of aperfmperf_khz_on_cpu(), which are expected to to
> happen on-demand when users read sysfs attribute
> cpufreq/scaling_cur_freq.
>
> As with previous methods of calculating MHz,
> idle time is excluded.
>
> base_MHz above is from TSC calibration global "cpu_khz".
>
> This x86 native method to calculate MHz returns a meaningful result
> no matter if P-states are controlled by hardware or firmware
> and/or if the Linux cpufreq sub-system is or is-not installed.
>
> When this routine is invoked more frequently, the measurement
> interval becomes shorter. However, the code limits re-computation
> to 10ms intervals so that average frequency remains meaningful.
>
> Discerning users are encouraged to take advantage of
> the turbostat(8) utility, which can gracefully handle
> concurrent measurement intervals of arbitrary length.
>
> Signed-off-by: Len Brown <[email protected]>
> ---
> arch/x86/kernel/cpu/Makefile | 1 +
> arch/x86/kernel/cpu/aperfmperf.c | 82 ++++++++++++++++++++++++++++++++++++++++
> drivers/cpufreq/cpufreq.c | 7 +++-
> include/linux/cpufreq.h | 13 +++++++
> 4 files changed, 102 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/kernel/cpu/aperfmperf.c
>
> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> index 5200001..cdf8249 100644
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -21,6 +21,7 @@ obj-y += common.o
> obj-y += rdrand.o
> obj-y += match.o
> obj-y += bugs.o
> +obj-$(CONFIG_CPU_FREQ) += aperfmperf.o
>
> obj-$(CONFIG_PROC_FS) += proc.o
> obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
> diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
> new file mode 100644
> index 0000000..5ccf63a
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -0,0 +1,82 @@
> +/*
> + * x86 APERF/MPERF KHz calculation for
> + * /sys/.../cpufreq/scaling_cur_freq
> + *
> + * Copyright (C) 2017 Intel Corp.
> + * Author: Len Brown <[email protected]>
> + *
> + * This file is licensed under GPLv2.
> + */
> +
> +#include <linux/jiffies.h>
> +#include <linux/math64.h>
> +#include <linux/percpu.h>
> +#include <linux/smp.h>
> +
> +struct aperfmperf_sample {
> + unsigned int khz;
> + unsigned long jiffies;
> + u64 aperf;
> + u64 mperf;
> +};
> +
> +static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
> +
> +/*
> + * aperfmperf_snapshot_khz()
> + * On the current CPU, snapshot APERF, MPERF, and jiffies
> + * unless we already did it within 10ms
> + * calculate kHz, save snapshot
> + */
> +static void aperfmperf_snapshot_khz(void *dummy)
> +{
> + u64 aperf, aperf_delta;
> + u64 mperf, mperf_delta;
> + struct aperfmperf_sample *s = &get_cpu_var(samples);
> +
> + /* Don't bother re-computing within 10 ms */
> + if (time_before(jiffies, s->jiffies + HZ/100))
> + goto out;
> +
> + rdmsrl(MSR_IA32_APERF, aperf);
> + rdmsrl(MSR_IA32_MPERF, mperf);
> +
> + aperf_delta = aperf - s->aperf;
> + mperf_delta = mperf - s->mperf;
> +
> + /*
> + * There is no architectural guarantee that MPERF
> + * increments faster than we can read it.
> + */
> + if (mperf_delta == 0)
> + goto out;
> +
> + /*
> + * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
> + * khz = (cpu_khz * aperf_delta) / mperf_delta
> + */
> + if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
> + s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
> + else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
> + s->khz = div64_u64(aperf_delta,
> + div64_u64(mperf_delta, cpu_khz));
> + s->jiffies = jiffies;
> + s->aperf = aperf;
> + s->mperf = mperf;
> +
> +out:
> + put_cpu_var(samples);
> +}
> +
> +unsigned int aperfmperf_khz_on_cpu(unsigned int cpu)
> +{
> + if (!cpu_khz)
> + return 0;
> +
> + if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> + return 0;
> +
> + smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
> +
> + return per_cpu(samples.khz, cpu);
> +}
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 26b643d..a667fac 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -635,8 +635,13 @@ show_one(scaling_max_freq, max);
> static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> {
> ssize_t ret;
> + unsigned int freq;
>
> - if (cpufreq_driver && cpufreq_driver->setpolicy && cpufreq_driver->get)
> + freq = arch_freq_get_on_cpu(policy->cpu);
> + if (freq)
> + ret = sprintf(buf, "%u\n", freq);
> + else if (cpufreq_driver && cpufreq_driver->setpolicy &&
> + cpufreq_driver->get)
> ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
> else
> ret = sprintf(buf, "%u\n", policy->cur);
I wonder if we could change intel_pstate_get() to simply return
aperfmperf_khz_on_cpu(cpu_num)?
That would allow us to avoid the extra branch here and get rid of the
#ifdef x86 from the header.
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index a5ce0bbe..cfc6220 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -883,6 +883,19 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
> }
> #endif
>
> +#ifdef CONFIG_X86
> +extern unsigned int aperfmperf_khz_on_cpu(unsigned int cpu);
> +static inline unsigned int arch_freq_get_on_cpu(int cpu)
> +{
> + return aperfmperf_khz_on_cpu(cpu);
> +}
> +#else
> +static inline unsigned int arch_freq_get_on_cpu(int cpu)
> +{
> + return 0;
> +}
> +#endif
> +
> /* the following are really really optional */
> extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
> extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
>
Thanks,
Rafael
On Friday, June 16, 2017 08:35:19 PM Len Brown wrote:
> On Fri, Jun 16, 2017 at 7:53 PM, Rafael J. Wysocki <[email protected]> wrote:
> > On Wednesday, June 07, 2017 07:39:14 PM Len Brown wrote:
> >> From: Len Brown <[email protected]>
> >>
> >> The x86 cpufreq core now uses aperfmperf_khz_on_cpu()
> >> to supply /sys/.../cpufreq/scaling_cur_freq
> >> on all x86 systems supporting APERF/MPERF.
> >>
> >> That includes 100% of systems supported by intel_pstate,
> >> and so intel_pstate.get() is now a NOP -- remove it.
> >>
> >> Invoke aperfmperf_khz_on_cpu() directly,
> >> if legacy-mode p-state tracing is enabled.
> >>
> >> Signed-off-by: Len Brown <[email protected]>
> >> ---
> >> drivers/cpufreq/intel_pstate.c | 16 +---------------
> >> 1 file changed, 1 insertion(+), 15 deletions(-)
> >>
> >> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> >> index b7de5bd..5d67780 100644
> >> --- a/drivers/cpufreq/intel_pstate.c
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -1597,12 +1597,6 @@ static inline bool intel_pstate_sample(struct cpudata *cpu, u64 time)
> >> return false;
> >> }
> >>
> >> -static inline int32_t get_avg_frequency(struct cpudata *cpu)
> >> -{
> >> - return mul_ext_fp(cpu->sample.core_avg_perf,
> >> - cpu->pstate.max_pstate_physical * cpu->pstate.scaling);
> >> -}
> >> -
> >> static inline int32_t get_avg_pstate(struct cpudata *cpu)
> >> {
> >> return mul_ext_fp(cpu->pstate.max_pstate_physical,
> >> @@ -1728,7 +1722,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
> >> sample->mperf,
> >> sample->aperf,
> >> sample->tsc,
> >> - get_avg_frequency(cpu),
> >> + aperfmperf_khz_on_cpu(cpu->cpu),
>
> Note that I deleted the line above in an updated version of this patch
> that I'm ready to send out.
>
> There were a couple of problems with it.
> The first is that it was ugly that tracing (which occurs only in the
> SW governor case)
> could shorten the measurement interval as seen by the sysfs interface.
>
> The second is that this trace point can be called with irqs off,
> and smp_call_function_single() will WARN when called with irqs off.
>
> Srinivas Acked that I simply remove this field from the tracepoint --
> as it is redundant to calculate it in the kernel when we are already
> exporting the raw values of aperf and mperf.
This changes the tracepoint format and I know about a couple of user space
scripts that consume these tracepoints though.
What would be wrong with leaving it as is?
> >> fp_toint(cpu->iowait_boost * 100));
> >> }
> >>
> >> @@ -1922,13 +1916,6 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
> >> return 0;
> >> }
> >>
> >> -static unsigned int intel_pstate_get(unsigned int cpu_num)
> >> -{
> >> - struct cpudata *cpu = all_cpu_data[cpu_num];
> >> -
> >> - return cpu ? get_avg_frequency(cpu) : 0;
> >> -}
> >> -
> >> static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
> >> {
> >> struct cpudata *cpu = all_cpu_data[cpu_num];
> >> @@ -2157,7 +2144,6 @@ static struct cpufreq_driver intel_pstate = {
> >> .setpolicy = intel_pstate_set_policy,
> >> .suspend = intel_pstate_hwp_save_state,
> >> .resume = intel_pstate_resume,
> >> - .get = intel_pstate_get,
> >> .init = intel_pstate_cpu_init,
> >> .exit = intel_pstate_cpu_exit,
> >> .stop_cpu = intel_pstate_stop_cpu,
> >>
> >
> > This change will cause cpufreq_quick_get() to work differently and it is
> > called by KVM among other things. Will that still work?
>
> Neither invocation of cpufreq_quick_get() (ARM cpu-cooling and X86 variable TSC)
> are aligned with the hardware supported by the intel_pstate driver.
OK
Thanks,
Rafael
On Fri, Jun 16, 2017 at 8:04 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Wednesday, June 07, 2017 07:39:15 PM Len Brown wrote:
>> From: Len Brown <[email protected]>
>>
>> When the governor is set to "performance", intel_pstate does not
>> need the scheduler hook for doing any calculations. Under these
>> conditions, its only purpose is to continue to maintain
>> cpufreq/scaling_cur_freq.
>>
>> But the cpufreq/scaling_cur_freq sysfs attribute is now provided by
>> the x86 cpufreq core on all modern x86 systems, including
>> all systems supported by the intel_pstate driver.
>>
>> So in "performance" governor mode, the scheduler hook can be skipped.
>> This applies to both in Software and Hardware P-state control modes.
>>
>> Suggested-by: Srinivas Pandruvada <[email protected]>
>> Signed-off-by: Len Brown <[email protected]>
>> ---
>> drivers/cpufreq/intel_pstate.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> index 5d67780..0ff3a4b 100644
>> --- a/drivers/cpufreq/intel_pstate.c
>> +++ b/drivers/cpufreq/intel_pstate.c
>> @@ -2025,10 +2025,10 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
>> */
>> intel_pstate_clear_update_util_hook(policy->cpu);
>
> The statement above shouldn't be necessary any more after the change below.
The policy can change at run time from something other than performance
to performance, so we want to clear the hook in that case, no?
>> intel_pstate_max_within_limits(cpu);
>> + } else {
>> + intel_pstate_set_update_util_hook(policy->cpu);
>> }
>>
>> - intel_pstate_set_update_util_hook(policy->cpu);
>> -
>> if (hwp_active)
>> intel_pstate_hwp_set(policy->cpu);
>>
>
> What about update_turbo_pstate()?
>
> In theory MSR_IA32_MISC_ENABLE_TURBO_DISABLE can be set at any time, so
> wouldn't that become problematic after this change?
yes, the sysfs "no_turbo" attribute can be modified at any time, invoking
update_turbo_state(), which will update MSR_IA32_MISC_ENABLE_TURBO_DISABLE
But how is the presence or change in turbo related to the lack of a
need to hook the scheduler callback in "performance" mode? The hook
literally does nothing in this case, except consume cycles, no?
--
Len Brown, Intel Open Source Technology Center
>> >> - get_avg_frequency(cpu),
>> >> + aperfmperf_khz_on_cpu(cpu->cpu),
>>
>> Note that I deleted the line above in an updated version of this patch
>> that I'm ready to send out.
>>
>> There were a couple of problems with it.
>> The first is that it was ugly that tracing (which occurs only in the
>> SW governor case)
>> could shorten the measurement interval as seen by the sysfs interface.
>>
>> The second is that this trace point can be called with irqs off,
>> and smp_call_function_single() will WARN when called with irqs off.
>>
>> Srinivas Acked that I simply remove this field from the tracepoint --
>> as it is redundant to calculate it in the kernel when we are already
>> exporting the raw values of aperf and mperf.
>
> This changes the tracepoint format and I know about a couple of user space
> scripts that consume these tracepoints though.
> What would be wrong with leaving it as is?
I'm fine keeping get_avg_frequency() for the purpose of just this tracepoint
compatibility, if you think it is useful to do so.
I removed it because its only function was the (removed) intel_pstate.get()
and this tracepoint. And this tracepoint already includes aperf and mperf,
which can be used to calculate frequency in user-space, if desired.
Srinivas Acked that updating his user-space script would be fine --
dunno if that is sufficient.
cheers,
-Len
On Friday, June 16, 2017 08:52:53 PM Len Brown wrote:
> On Fri, Jun 16, 2017 at 8:04 PM, Rafael J. Wysocki <[email protected]> wrote:
> > On Wednesday, June 07, 2017 07:39:15 PM Len Brown wrote:
> >> From: Len Brown <[email protected]>
> >>
> >> When the governor is set to "performance", intel_pstate does not
> >> need the scheduler hook for doing any calculations. Under these
> >> conditions, its only purpose is to continue to maintain
> >> cpufreq/scaling_cur_freq.
> >>
> >> But the cpufreq/scaling_cur_freq sysfs attribute is now provided by
> >> the x86 cpufreq core on all modern x86 systems, including
> >> all systems supported by the intel_pstate driver.
> >>
> >> So in "performance" governor mode, the scheduler hook can be skipped.
> >> This applies to both in Software and Hardware P-state control modes.
> >>
> >> Suggested-by: Srinivas Pandruvada <[email protected]>
> >> Signed-off-by: Len Brown <[email protected]>
> >> ---
> >> drivers/cpufreq/intel_pstate.c | 4 ++--
> >> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> >> index 5d67780..0ff3a4b 100644
> >> --- a/drivers/cpufreq/intel_pstate.c
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -2025,10 +2025,10 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
> >> */
> >> intel_pstate_clear_update_util_hook(policy->cpu);
> >
> > The statement above shouldn't be necessary any more after the change below.
>
> The policy can change at run time form something other than performance
> to performance, so we want to clear the hook in that case, no?
Yes.
> >> intel_pstate_max_within_limits(cpu);
> >> + } else {
> >> + intel_pstate_set_update_util_hook(policy->cpu);
> >> }
> >>
> >> - intel_pstate_set_update_util_hook(policy->cpu);
> >> -
> >> if (hwp_active)
> >> intel_pstate_hwp_set(policy->cpu);
> >>
> >
> > What about update_turbo_pstate()?
> >
> > In theory MSR_IA32_MISC_ENABLE_TURBO_DISABLE can be set at any time, so
> > wouldn't that become problematic after this change?
>
> yes, the sysfs "no_turbo" attribute can be modified at any time, invoking
> update_turbo_state(), which will update MSR_IA32_MISC_ENABLE_TURBO_DISABLE
If that was the only way it could change, I wouldn't worry about it, but what
about changes by BMCs and similar? Are they not a concern?
> But how is the presence or change in turbo related to the lack of a
> need to hook the scheduler callback in "performance" mode? The hook
> literally does nothing in this case, except consume cycles, no?
No.
It actually sets the P-state to the current maximum (which admittedly is
excessive) exactly because the maximum may change on the fly in theory.
If it can't change on the fly (or we don't care), we can do some more
simplifications there. :-)
Thanks,
Rafael
On Fri, Jun 16, 2017 at 8:09 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Wednesday, June 07, 2017 07:39:16 PM Len Brown wrote:
>> From: Len Brown <[email protected]>
>>
>> The cpufreqa/scaling_cur_freq sysfs attribute is now provided by
>> the x86 cpufreq core on all modern x86 systems, including
>> all systems supported by the intel_pstate driver.
>
> Not sure what you mean by "x86 cpufreq core"?
I refer to code that builds if (CONFIG_X86 && CONFIG_CPU_FREQ)
Since it was enough to provoke a comment from you, how about this wording?:
The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.
> Besides, I'd reorder this change with respect to patch [4/5] as this
> eliminates the hook entirely and then the "performance"-related change
> would only affect non-HWP.
I don't actually see a problem with either order,
but i'll send the refresh with the order you suggest.
thanks,
Len Brown, Intel Open Source Technology Center
On Friday, June 16, 2017 09:10:39 PM Len Brown wrote:
> >> >> - get_avg_frequency(cpu),
> >> >> + aperfmperf_khz_on_cpu(cpu->cpu),
> >>
> >> Note that I deleted the line above in an updated version of this patch
> >> that I'm ready to send out.
> >>
> >> There were a couple of problems with it.
> >> The first is that it was ugly that tracing (which occurs only in the
> >> SW governor case)
> >> could shorten the measurement interval as seen by the sysfs interface.
> >>
> >> The second is that this trace point can be called with irqs off,
> >> and smp_call_function_single() will WARN when called with irqs off.
> >>
> >> Srinivas Acked that I simply remove this field from the tracepoint --
> >> as it is redundant to calculate it in the kernel when we are already
> >> exporting the raw values of aperf and mperf.
> >
> > This changes the tracepoint format and I know about a couple of user space
> > scripts that consume these tracepoints though.
> > What would be wrong with leaving it as is?
>
> I'm fine keeping get_avg_frequency() for the purpose of just this tracepoint
> compatibility, if you think it is useful to do so.
>
> I removed it because its only function was the (removed) intel_pstate.get()
> and this tracepoint. And this tracepoint already includes aperf and mperf,
> which can be used to calculate frequency in user-space, if desired.
> Srinivas Acked' that updating his user-space script would be fine --
> dunno if that is sufficient.
Well, we can try to make this change and see if there are any complaints
about it.
But Srinivas' script is in the tree, so it would be good to update it too,
along with the tracepoint.
Thanks,
Rafael
On Fri, Jun 16, 2017 at 9:06 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Friday, June 16, 2017 08:52:53 PM Len Brown wrote:
>> On Fri, Jun 16, 2017 at 8:04 PM, Rafael J. Wysocki <[email protected]> wrote:
>> > On Wednesday, June 07, 2017 07:39:15 PM Len Brown wrote:
>> >> From: Len Brown <[email protected]>
>> >>
>> >> When the governor is set to "performance", intel_pstate does not
>> >> need the scheduler hook for doing any calculations. Under these
>> >> conditions, its only purpose is to continue to maintain
>> >> cpufreq/scaling_cur_freq.
>> >>
>> >> But the cpufreq/scaling_cur_freq sysfs attribute is now provided by
>> >> the x86 cpufreq core on all modern x86 systems, including
>> >> all systems supported by the intel_pstate driver.
>> >>
>> >> So in "performance" governor mode, the scheduler hook can be skipped.
>> >> This applies to both in Software and Hardware P-state control modes.
>> >>
>> >> Suggested-by: Srinivas Pandruvada <[email protected]>
>> >> Signed-off-by: Len Brown <[email protected]>
>> >> ---
>> >> drivers/cpufreq/intel_pstate.c | 4 ++--
>> >> 1 file changed, 2 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> >> index 5d67780..0ff3a4b 100644
>> >> --- a/drivers/cpufreq/intel_pstate.c
>> >> +++ b/drivers/cpufreq/intel_pstate.c
>> >> @@ -2025,10 +2025,10 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
>> >> */
>> >> intel_pstate_clear_update_util_hook(policy->cpu);
>> >
>> > The statement above shouldn't be necessary any more after the change below.
>>
>> The policy can change at run time form something other than performance
>> to performance, so we want to clear the hook in that case, no?
>
> Yes.
>
>> >> intel_pstate_max_within_limits(cpu);
>> >> + } else {
>> >> + intel_pstate_set_update_util_hook(policy->cpu);
>> >> }
>> >>
>> >> - intel_pstate_set_update_util_hook(policy->cpu);
>> >> -
>> >> if (hwp_active)
>> >> intel_pstate_hwp_set(policy->cpu);
>> >>
>> >
>> > What about update_turbo_pstate()?
>> >
>> > In theory MSR_IA32_MISC_ENABLE_TURBO_DISABLE can be set at any time, so
>> > wouldn't that become problematic after this change?
>>
>> yes, the sysfs "no_turbo" attribute can be modified at any time, invoking
>> update_turbo_state(), which will update MSR_IA32_MISC_ENABLE_TURBO_DISABLE
>
> If that was the only way it could change, I wouldn't worry about it, but what
> about changes by BMCs and similar? Are they not a concern?
>
>> But how is the presence or change in turbo related to the lack of a
>> need to hook the scheduler callback in "performance" mode? The hook
>> literally does nothing in this case, except consume cycles, no?
>
> No.
>
> It actually sets the P-state to the current maximum (which admittedly is
> excessive) exactly because the maximum may change on the fly in theory.
There are 2 cases.
If turbo was enabled and were we requesting max turbo
and "somebody" disabled turbo in an MSR, then the HW would
simply clip our excessive req
> If it can't change on the fly (or we don't care), we can do some more
> simplifications there. :-)
I do not think it is Linux's responsibility to monitor changes to MSRs
such as Turbo enable/disable done behind its back by a BMC at run-time.
(if this is even possible)
--
Len Brown, Intel Open Source Technology Center
sorry, that was a premature send...
>>> > What about update_turbo_pstate()?
>>> >
>>> > In theory MSR_IA32_MISC_ENABLE_TURBO_DISABLE can be set at any time, so
>>> > wouldn't that become problematic after this change?
>>>
>>> yes, the sysfs "no_turbo" attribute can be modified at any time, invoking
>>> update_turbo_state(), which will update MSR_IA32_MISC_ENABLE_TURBO_DISABLE
>>
>> If that was the only way it could change, I wouldn't worry about it, but what
>> about changes by BMCs and similar? Are they not a concern?
>>
>>> But how is the presence or change in turbo related to the lack of a
>>> need to hook the scheduler callback in "performance" mode? The hook
>>> literally does nothing in this case, except consume cycles, no?
>>
>> No.
>>
>> It actually sets the P-state to the current maximum (which admittedly is
>> excessive) exactly because the maximum may change on the fly in theory.
>
> There are 2 cases.
>
> If turbo was enabled and were we requesting max turbo
> and "somebody" disabled turbo in an MSR, then the HW would
> simply clip our excessive
request to what the hardware supports.
if turbo was disabled and we were requesting max non-turbo,
and "somebody" enabled turbo in the MSR,
then it is highly likely that the current request is already high enough
to enable turbo. "highly likely" = would work on 100% of the
machines I have ever seen, including those with mis-configured TAR.
ie. it should not matter.
>> If it can't change on the fly (or we don't care), we can do some more
>> simplifications there. :-)
>
> I do not think it is Linux's responsibility to monitor changes to MSRs
> such as Turbo enable/disable done behind its back by a BMC at run-time.
> (if this is even possible)
>
>
> --
> Len Brown, Intel Open Source Technology Center
On Sat, Jun 17, 2017 at 3:21 AM, Rafael J. Wysocki <[email protected]> wrote:
> On Friday, June 16, 2017 09:10:39 PM Len Brown wrote:
>> >> >> - get_avg_frequency(cpu),
>> >> >> + aperfmperf_khz_on_cpu(cpu->cpu),
>> >>
>> >> Note that I deleted the line above in an updated version of this patch
>> >> that I'm ready to send out.
>> >>
>> >> There were a couple of problems with it.
>> >> The first is that it was ugly that tracing (which occurs only in the
>> >> SW governor case)
>> >> could shorten the measurement interval as seen by the sysfs interface.
>> >>
>> >> The second is that this trace point can be called with irqs off,
>> >> and smp_call_function_single() will WARN when called with irqs off.
>> >>
>> >> Srinivas Acked that I simply remove this field from the tracepoint --
>> >> as it is redundant to calculate it in the kernel when we are already
>> >> exporting the raw values of aperf and mperf.
>> >
>> > This changes the tracepoint format and I know about a couple of user space
>> > scripts that consume these tracepoints though.
>> > What would be wrong with leaving it as is?
>>
>> I'm fine keeping get_avg_frequency() for the purpose of just this tracepoint
>> compatibility, if you think it is useful to do so.
>>
>> I removed it because its only function was the (removed) intel_pstate.get()
>> and this tracepoint. And this tracepoint already includes aperf and mperf,
>> which can be used to calculate frequency in user-space, if desired.
>> Srinivas Acked' that updating his user-space script would be fine --
>> dunno if that is sufficient.
>
> Well, we can try to make this change and see if there are any complaints
> about it.
>
> But the Srinivas' script is in the tree, so it would be good to update it too
> along with the tracepoint.
On a second thought, in order to compute the frequency, user space
needs to know the scaling and the max_pstate_physical value too, which
may not be straightforward to obtain (on some Atoms, for example).
So why don't we leave the tracepoint as is for now?
Thanks,
Rafael
On Fri, Jun 16, 2017 at 8:30 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Wednesday, June 07, 2017 07:39:13 PM Len Brown wrote:
>> From: Len Brown <[email protected]>
>>
>> The goal of this change is to give users a uniform and meaningful
>> result when they read /sys/...cpufreq/scaling_cur_freq
>> on modern x86 hardware, as compared to what they get today.
>>
>> Modern x86 processors include the hardware needed
>> to accurately calculate frequency over an interval --
>> APERF, MPERF, and the TSC.
>>
>> Here we provide an x86 routine to make this calculation
>> on supported hardware, and use it in preference to any
>> driver driver-specific cpufreq_driver.get() routine.
>>
>> MHz is computed like so:
>>
>> MHz = base_MHz * delta_APERF / delta_MPERF
>>
>> MHz is the average frequency of the busy processor
>> over a measurement interval. The interval is
>> defined to be the time between successive invocations
>> of aperfmperf_khz_on_cpu(), which are expected to to
>> happen on-demand when users read sysfs attribute
>> cpufreq/scaling_cur_freq.
>>
>> As with previous methods of calculating MHz,
>> idle time is excluded.
>>
>> base_MHz above is from TSC calibration global "cpu_khz".
>>
>> This x86 native method to calculate MHz returns a meaningful result
>> no matter if P-states are controlled by hardware or firmware
>> and/or if the Linux cpufreq sub-system is or is-not installed.
>>
>> When this routine is invoked more frequently, the measurement
>> interval becomes shorter. However, the code limits re-computation
>> to 10ms intervals so that average frequency remains meaningful.
>>
>> Discerning users are encouraged to take advantage of
>> the turbostat(8) utility, which can gracefully handle
>> concurrent measurement intervals of arbitrary length.
>>
>> Signed-off-by: Len Brown <[email protected]>
>> ---
>> arch/x86/kernel/cpu/Makefile | 1 +
>> arch/x86/kernel/cpu/aperfmperf.c | 82 ++++++++++++++++++++++++++++++++++++++++
>> drivers/cpufreq/cpufreq.c | 7 +++-
>> include/linux/cpufreq.h | 13 +++++++
>> 4 files changed, 102 insertions(+), 1 deletion(-)
>> create mode 100644 arch/x86/kernel/cpu/aperfmperf.c
>>
>> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
>> index 5200001..cdf8249 100644
>> --- a/arch/x86/kernel/cpu/Makefile
>> +++ b/arch/x86/kernel/cpu/Makefile
>> @@ -21,6 +21,7 @@ obj-y += common.o
>> obj-y += rdrand.o
>> obj-y += match.o
>> obj-y += bugs.o
>> +obj-$(CONFIG_CPU_FREQ) += aperfmperf.o
>>
>> obj-$(CONFIG_PROC_FS) += proc.o
>> obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
>> diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
>> new file mode 100644
>> index 0000000..5ccf63a
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/aperfmperf.c
>> @@ -0,0 +1,82 @@
>> +/*
>> + * x86 APERF/MPERF KHz calculation for
>> + * /sys/.../cpufreq/scaling_cur_freq
>> + *
>> + * Copyright (C) 2017 Intel Corp.
>> + * Author: Len Brown <[email protected]>
>> + *
>> + * This file is licensed under GPLv2.
>> + */
>> +
>> +#include <linux/jiffies.h>
>> +#include <linux/math64.h>
>> +#include <linux/percpu.h>
>> +#include <linux/smp.h>
>> +
>> +struct aperfmperf_sample {
>> + unsigned int khz;
>> + unsigned long jiffies;
>> + u64 aperf;
>> + u64 mperf;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
>> +
>> +/*
>> + * aperfmperf_snapshot_khz()
>> + * On the current CPU, snapshot APERF, MPERF, and jiffies
>> + * unless we already did it within 10ms
>> + * calculate kHz, save snapshot
>> + */
>> +static void aperfmperf_snapshot_khz(void *dummy)
>> +{
>> + u64 aperf, aperf_delta;
>> + u64 mperf, mperf_delta;
>> + struct aperfmperf_sample *s = &get_cpu_var(samples);
>> +
>> + /* Don't bother re-computing within 10 ms */
>> + if (time_before(jiffies, s->jiffies + HZ/100))
>> + goto out;
>> +
>> + rdmsrl(MSR_IA32_APERF, aperf);
>> + rdmsrl(MSR_IA32_MPERF, mperf);
>> +
>> + aperf_delta = aperf - s->aperf;
>> + mperf_delta = mperf - s->mperf;
>> +
>> + /*
>> + * There is no architectural guarantee that MPERF
>> + * increments faster than we can read it.
>> + */
>> + if (mperf_delta == 0)
>> + goto out;
>> +
>> + /*
>> + * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
>> + * khz = (cpu_khz * aperf_delta) / mperf_delta
>> + */
>> + if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
>> + s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
>> + else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
>> + s->khz = div64_u64(aperf_delta,
>> + div64_u64(mperf_delta, cpu_khz));
>> + s->jiffies = jiffies;
>> + s->aperf = aperf;
>> + s->mperf = mperf;
>> +
>> +out:
>> + put_cpu_var(samples);
>> +}
>> +
>> +unsigned int aperfmperf_khz_on_cpu(unsigned int cpu)
>> +{
>> + if (!cpu_khz)
>> + return 0;
>> +
>> + if (!static_cpu_has(X86_FEATURE_APERFMPERF))
>> + return 0;
>> +
>> + smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
>> +
>> + return per_cpu(samples.khz, cpu);
>> +}
>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>> index 26b643d..a667fac 100644
>> --- a/drivers/cpufreq/cpufreq.c
>> +++ b/drivers/cpufreq/cpufreq.c
>> @@ -635,8 +635,13 @@ show_one(scaling_max_freq, max);
>> static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>> {
>> ssize_t ret;
>> + unsigned int freq;
>>
>> - if (cpufreq_driver && cpufreq_driver->setpolicy && cpufreq_driver->get)
>> + freq = arch_freq_get_on_cpu(policy->cpu);
>> + if (freq)
>> + ret = sprintf(buf, "%u\n", freq);
>> + else if (cpufreq_driver && cpufreq_driver->setpolicy &&
>> + cpufreq_driver->get)
>> ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
>> else
>> ret = sprintf(buf, "%u\n", policy->cur);
>
> I wonder if we could change intel_pstate_get() to simply return
> aperfmperf_khz_on_cpu(cpu_num)?
>
> That would allow us to avoid the extra branch here and get rid of the
> #ifdef x86 from the header.
The reason I put the hook here is specifically so that the same
code would always be called on the x86 architecture,
no matter what cpufreq driver is loaded.
Yes, alternatively, all possible driver.get routines could be updated
to call the same routine. That is acpi-cpufreq.c, intel_pstate.c,
others?
Do I think that saving a branch is meaningful in a system call to retrieve
the average frequency from the kernel? No, I don't.
I'm not religious about "if (CONFIG_X86)" appearing in cpufreq.h.
If somebody really cares, then the purist approach would be
to add an additional arch-specific-hook header -- though that seemed
overkill to me...
thanks,
-Len
>> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
>> index a5ce0bbe..cfc6220 100644
>> --- a/include/linux/cpufreq.h
>> +++ b/include/linux/cpufreq.h
>> @@ -883,6 +883,19 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
>> }
>> #endif
>>
>> +#ifdef CONFIG_X86
>> +extern unsigned int aperfmperf_khz_on_cpu(unsigned int cpu);
>> +static inline unsigned int arch_freq_get_on_cpu(int cpu)
>> +{
>> + return aperfmperf_khz_on_cpu(cpu);
>> +}
>> +#else
>> +static inline unsigned int arch_freq_get_on_cpu(int cpu)
>> +{
>> + return 0;
>> +}
>> +#endif
>> +
>> /* the following are really really optional */
>> extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
>> extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
>>
>
> Thanks,
> Rafael
>
--
Len Brown, Intel Open Source Technology Center
> On a second thought, in order to compute the frequency, user space
> needs to know the scaling and the max_pstate_physical value too, which
> may not be straightforward to obtain (on some Atoms, for example).
unless you run turbostat:-)
> So why don't we leave the tracepoint as is for now?
sure, no problem.
--
Len Brown, Intel Open Source Technology Center
On Friday, June 16, 2017 09:49:00 PM Len Brown wrote:
> On Fri, Jun 16, 2017 at 8:30 PM, Rafael J. Wysocki <[email protected]> wrote:
> > On Wednesday, June 07, 2017 07:39:13 PM Len Brown wrote:
> >> From: Len Brown <[email protected]>
[cut]
> >
> > I wonder if we could change intel_pstate_get() to simply return
> > aperfmperf_khz_on_cpu(cpu_num)?
> >
> > That would allow us to avoid the extra branch here and get rid of the
> > #ifdef x86 from the header.
>
> The reason I put the hook here is specifically so that the same
> code would always be called on the x86 architecture,
> no not matter what cpufreq driver is loaded.
>
> Yes, alternatively, all possible driver.get routines could be updated
> to call the same routine. That is acpi-cpufreq.c, intel_pstate.c,
> others?
Just acpi-cpufreq.c and intel_pstate.c.
Moreover, I wouldn't change the behavior on systems using acpi-cpufreq.c,
because why really?
And some users may complain that now what they see in cpuinfo_cur_freq is
different from what they saw there before.
That leaves us with just intel_pstate.
In addition to that, it would be good to avoid reading APERF and MPERF in the
cases when we already have the values. I have an idea on how to do that, but
I'll just need to prepare a patch.
> Do I think that saving a branch is meaningful in a system call to retrieve
> the average frequency from the kernel? No, I don't.
>
> I'm not religious about "if (CONFIG_X86)" appearing in cpufreq.h.
> If somebody really cares, then the purist approach would be
> to add an additional arch-specific-hook header -- though that seemed
> overkill to me...
I'm not a fan of arch stuff in general code, C or headers.
If it is necessary to add an arch header, let's add it, but at this point I'm
not sure about that even.
Thanks,
Rafael
On Monday, June 19, 2017 02:28:21 PM Rafael J. Wysocki wrote:
> On Friday, June 16, 2017 09:49:00 PM Len Brown wrote:
> > On Fri, Jun 16, 2017 at 8:30 PM, Rafael J. Wysocki <[email protected]> wrote:
> > > On Wednesday, June 07, 2017 07:39:13 PM Len Brown wrote:
> > >> From: Len Brown <[email protected]>
>
> [cut]
>
> > >
> > > I wonder if we could change intel_pstate_get() to simply return
> > > aperfmperf_khz_on_cpu(cpu_num)?
> > >
> > > That would allow us to avoid the extra branch here and get rid of the
> > > #ifdef x86 from the header.
> >
> > The reason I put the hook here is specifically so that the same
> > code would always be called on the x86 architecture,
> > no not matter what cpufreq driver is loaded.
> >
> > Yes, alternatively, all possible driver.get routines could be updated
> > to call the same routine. That is acpi-cpufreq.c, intel_pstate.c,
> > others?
>
> Just acpi-cpufreq.c and intel_pstate.c.
>
> Moreover, I wouldn't change the behavior on systems using acpi-cpufreq.c,
> because why really?
Actually, having thought a bit more about this, I see why that may be useful.
Also the #ifdef CONFIG_X86 in the header is better than an arch header, at
least for now, so I agree with the approach in your patch.
Thanks,
Rafael