In-Reply-To:
Hi Rafael,
This patch series has 3 goals:
1. Make "cpu MHz" in /proc/cpuinfo supportable.
2. Make /sys/.../cpufreq/scaling_cur_freq meaningful
and consistent on modern x86 systems.
3. Use 1. and 2. to remove scheduler and cpufreq overhead
There are 3 main changes since this series was proposed
about a year ago:
This update responds to distro feedback to make /proc/cpuinfo
"cpu MHz" constant. Originally, we had proposed making it return
the same dynamic value as cpufreq sysfs.
Some community members suggested that sysfs MHz values should
be meaninful, even down to 10ms intervals. So this has been
changed, versus the original proposal to not re-compute
at intervals shorter than 100ms.
(For those who really care about observing frequency, the
recommendation remains to use turbostat(8) or equivalent utility,
which can reliably measure concurrent intervals of arbitrary length)
The intel_pstate sampling mechanism has changed.
Originally this series removed an intel_pstate timer in HWP mode.
Now it removes the analogous scheduler call-back.
Most recently, in response to posting this patch on the list
about 10-days ago, the patch to remove frequency calculation
from inside intel_pstate was dropped, in order to maintain compatibility
with tracing scripts. Also, the order of the last two patches
has been exchanged.
Please let me know if you see any issues with this series.
thanks!
Len Brown, Intel Open Source Technology Center
The following changes since commit 3c2993b8c6143d8a5793746a54eba8f86f95240f:
Linux 4.12-rc4 (2017-06-04 16:47:43 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git x86
for you to fetch changes up to d020eed98440faa4a529c621f881aa9fda296956:
intel_pstate: skip scheduler hook when in "performance" mode. (2017-06-16 19:11:13 -0700)
----------------------------------------------------------------
Len Brown (4):
x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"
x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
intel_pstate: delete scheduler hook in HWP mode
intel_pstate: skip scheduler hook when in "performance" mode.
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/aperfmperf.c | 82 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/proc.c | 10 +----
drivers/cpufreq/cpufreq.c | 7 +++-
drivers/cpufreq/intel_pstate.c | 18 +++------
include/linux/cpufreq.h | 13 +++++++
6 files changed, 109 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/kernel/cpu/aperfmperf.c
From: Len Brown <[email protected]>
When the governor is set to "performance", intel_pstate does not
need the scheduler hook for doing any calculations. Under these
conditions, its only purpose is to continue to maintain
cpufreq/scaling_cur_freq.
The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.
So in "performance" governor mode, the scheduler hook can be skipped.
This applies to both in Software and Hardware P-state control modes.
Suggested-by: Srinivas Pandruvada <[email protected]>
Signed-off-by: Len Brown <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 4ec5668..4538182 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2031,10 +2031,10 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
*/
intel_pstate_clear_update_util_hook(policy->cpu);
intel_pstate_max_within_limits(cpu);
+ } else {
+ intel_pstate_set_update_util_hook(policy->cpu);
}
- intel_pstate_set_update_util_hook(policy->cpu);
-
if (hwp_active)
intel_pstate_hwp_set(policy->cpu);
--
2.7.4
From: Len Brown <[email protected]>
The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.
In HWP mode, maintaining that value was the sole purpose of
the scheduler hook, intel_pstate_update_util_hwp(),
so it can now be removed.
Signed-off-by: Len Brown <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index b7de5bd..4ec5668 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1732,16 +1732,6 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
fp_toint(cpu->iowait_boost * 100));
}
-static void intel_pstate_update_util_hwp(struct update_util_data *data,
- u64 time, unsigned int flags)
-{
- struct cpudata *cpu = container_of(data, struct cpudata, update_util);
- u64 delta_ns = time - cpu->sample.time;
-
- if ((s64)delta_ns >= INTEL_PSTATE_HWP_SAMPLING_INTERVAL)
- intel_pstate_sample(cpu, time);
-}
-
static void intel_pstate_update_util_pid(struct update_util_data *data,
u64 time, unsigned int flags)
{
@@ -1933,6 +1923,9 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
{
struct cpudata *cpu = all_cpu_data[cpu_num];
+ if (hwp_active)
+ return;
+
if (cpu->update_util_set)
return;
@@ -2557,7 +2550,6 @@ static int __init intel_pstate_init(void)
} else {
hwp_active++;
intel_pstate.attr = hwp_cpufreq_attrs;
- pstate_funcs.update_util = intel_pstate_update_util_hwp;
goto hwp_cpu_matched;
}
} else {
--
2.7.4
From: Len Brown <[email protected]>
The goal of this change is to give users a uniform and meaningful
result when they read /sys/...cpufreq/scaling_cur_freq
on modern x86 hardware, as compared to what they get today.
Modern x86 processors include the hardware needed
to accurately calculate frequency over an interval --
APERF, MPERF, and the TSC.
Here we provide an x86 routine to make this calculation
on supported hardware, and use it in preference to any
driver driver-specific cpufreq_driver.get() routine.
MHz is computed like so:
MHz = base_MHz * delta_APERF / delta_MPERF
MHz is the average frequency of the busy processor
over a measurement interval. The interval is
defined to be the time between successive invocations
of aperfmperf_khz_on_cpu(), which are expected to to
happen on-demand when users read sysfs attribute
cpufreq/scaling_cur_freq.
As with previous methods of calculating MHz,
idle time is excluded.
base_MHz above is from TSC calibration global "cpu_khz".
This x86 native method to calculate MHz returns a meaningful result
no matter if P-states are controlled by hardware or firmware
and/or if the Linux cpufreq sub-system is or is-not installed.
When this routine is invoked more frequently, the measurement
interval becomes shorter. However, the code limits re-computation
to 10ms intervals so that average frequency remains meaningful.
Discerning users are encouraged to take advantage of
the turbostat(8) utility, which can gracefully handle
concurrent measurement intervals of arbitrary length.
Signed-off-by: Len Brown <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/aperfmperf.c | 82 ++++++++++++++++++++++++++++++++++++++++
drivers/cpufreq/cpufreq.c | 7 +++-
include/linux/cpufreq.h | 13 +++++++
4 files changed, 102 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/aperfmperf.c
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 5200001..cdf8249 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -21,6 +21,7 @@ obj-y += common.o
obj-y += rdrand.o
obj-y += match.o
obj-y += bugs.o
+obj-$(CONFIG_CPU_FREQ) += aperfmperf.o
obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
new file mode 100644
index 0000000..5ccf63a
--- /dev/null
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -0,0 +1,82 @@
+/*
+ * x86 APERF/MPERF KHz calculation for
+ * /sys/.../cpufreq/scaling_cur_freq
+ *
+ * Copyright (C) 2017 Intel Corp.
+ * Author: Len Brown <[email protected]>
+ *
+ * This file is licensed under GPLv2.
+ */
+
+#include <linux/jiffies.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+
+struct aperfmperf_sample {
+ unsigned int khz;
+ unsigned long jiffies;
+ u64 aperf;
+ u64 mperf;
+};
+
+static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
+
+/*
+ * aperfmperf_snapshot_khz()
+ * On the current CPU, snapshot APERF, MPERF, and jiffies
+ * unless we already did it within 10ms
+ * calculate kHz, save snapshot
+ */
+static void aperfmperf_snapshot_khz(void *dummy)
+{
+ u64 aperf, aperf_delta;
+ u64 mperf, mperf_delta;
+ struct aperfmperf_sample *s = &get_cpu_var(samples);
+
+ /* Don't bother re-computing within 10 ms */
+ if (time_before(jiffies, s->jiffies + HZ/100))
+ goto out;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ aperf_delta = aperf - s->aperf;
+ mperf_delta = mperf - s->mperf;
+
+ /*
+ * There is no architectural guarantee that MPERF
+ * increments faster than we can read it.
+ */
+ if (mperf_delta == 0)
+ goto out;
+
+ /*
+ * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
+ * khz = (cpu_khz * aperf_delta) / mperf_delta
+ */
+ if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
+ s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
+ else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
+ s->khz = div64_u64(aperf_delta,
+ div64_u64(mperf_delta, cpu_khz));
+ s->jiffies = jiffies;
+ s->aperf = aperf;
+ s->mperf = mperf;
+
+out:
+ put_cpu_var(samples);
+}
+
+unsigned int aperfmperf_khz_on_cpu(unsigned int cpu)
+{
+ if (!cpu_khz)
+ return 0;
+
+ if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+ return 0;
+
+ smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+
+ return per_cpu(samples.khz, cpu);
+}
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 26b643d..a667fac 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -635,8 +635,13 @@ show_one(scaling_max_freq, max);
static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
{
ssize_t ret;
+ unsigned int freq;
- if (cpufreq_driver && cpufreq_driver->setpolicy && cpufreq_driver->get)
+ freq = arch_freq_get_on_cpu(policy->cpu);
+ if (freq)
+ ret = sprintf(buf, "%u\n", freq);
+ else if (cpufreq_driver && cpufreq_driver->setpolicy &&
+ cpufreq_driver->get)
ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
else
ret = sprintf(buf, "%u\n", policy->cur);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index a5ce0bbe..cfc6220 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -883,6 +883,19 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
}
#endif
+#ifdef CONFIG_X86
+extern unsigned int aperfmperf_khz_on_cpu(unsigned int cpu);
+static inline unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ return aperfmperf_khz_on_cpu(cpu);
+}
+#else
+static inline unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ return 0;
+}
+#endif
+
/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
--
2.7.4
From: Len Brown <[email protected]>
cpufreq_quick_get() allows cpufreq drivers to over-ride cpu_khz
that is otherwise reported in x86 /proc/cpuinfo "cpu MHz".
There are four problems with this scheme,
any of them is sufficient justification to delete it.
1. Depending on which cpufreq driver is loaded, the behavior
of this field is different.
2. Distros complain that they have to explain to users
why and how this field changes. Distros have requested a constant.
3. The two major providers of this information, acpi_cpufreq
and intel_pstate, both "get it wrong" in different ways.
acpi_cpufreq lies to the user by telling them that
they are running at whatever frequency was last
requested by software.
intel_pstate lies to the user by telling them that
they are running at the average frequency computed
over an undefined measurement. But an average computed
over an undefined interval, is itself, undefined...
4. On modern processors, user space utilities, such as
turbostat(1), are more accurate and more precise, while
supporing concurrent measurement over arbitrary intervals.
Users who have been consulting /proc/cpuinfo to
track changing CPU frequency will be dissapointed that
it no longer wiggles -- perhaps being unaware of the
limitations of the information they have been consuming.
Yes, they can change their scripts to look in sysfs
cpufreq/scaling_cur_frequency. Here they will find the same
data of dubious quality here removed from /proc/cpuinfo.
The value in sysfs will be addressed in a subsequent patch
to address issues 1-3, above.
Issue 4 will remain -- users that really care about
accurate frequency information should not be using either
proc or sysfs kernel interfaces.
They should be using using turbostat(8), or a similar
purpose-built analysis tool.
Signed-off-by: Len Brown <[email protected]>
---
arch/x86/kernel/cpu/proc.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 6df621a..218f798 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -2,7 +2,6 @@
#include <linux/timex.h>
#include <linux/string.h>
#include <linux/seq_file.h>
-#include <linux/cpufreq.h>
/*
* Get CPU information for use by the procfs.
@@ -76,14 +75,9 @@ static int show_cpuinfo(struct seq_file *m, void *v)
if (c->microcode)
seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
- if (cpu_has(c, X86_FEATURE_TSC)) {
- unsigned int freq = cpufreq_quick_get(cpu);
-
- if (!freq)
- freq = cpu_khz;
+ if (cpu_has(c, X86_FEATURE_TSC))
seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
- freq / 1000, (freq % 1000));
- }
+ cpu_khz / 1000, (cpu_khz % 1000));
/* Cache size */
if (c->x86_cache_size >= 0)
--
2.7.4
On Sat, Jun 17, 2017 at 5:03 AM, Len Brown <[email protected]> wrote:
>
> In-Reply-To:
>
> Hi Rafael,
>
> This patch series has 3 goals:
>
> 1. Make "cpu MHz" in /proc/cpuinfo supportable.
>
> 2. Make /sys/.../cpufreq/scaling_cur_freq meaningful
> and consistent on modern x86 systems.
>
> 3. Use 1. and 2. to remove scheduler and cpufreq overhead
>
> There are 3 main changes since this series was proposed
> about a year ago:
>
> This update responds to distro feedback to make /proc/cpuinfo
> "cpu MHz" constant. Originally, we had proposed making it return
> the same dynamic value as cpufreq sysfs.
>
> Some community members suggested that sysfs MHz values should
> be meaninful, even down to 10ms intervals. So this has been
> changed, versus the original proposal to not re-compute
> at intervals shorter than 100ms.
>
> (For those who really care about observing frequency, the
> recommendation remains to use turbostat(8) or equivalent utility,
> which can reliably measure concurrent intervals of arbitrary length)
>
> The intel_pstate sampling mechanism has changed.
> Originally this series removed an intel_pstate timer in HWP mode.
> Now it removes the analogous scheduler call-back.
>
> Most recently, in response to posting this patch on the list
> about 10-days ago, the patch to remove frequency calculation
> from inside intel_pstate was dropped, in order to maintain compatibility
> with tracing scripts. Also, the order of the last two patches
> has been exchanged.
>
> Please let me know if you see any issues with this series.
>
> thanks!
> Len Brown, Intel Open Source Technology Center
>
> The following changes since commit 3c2993b8c6143d8a5793746a54eba8f86f95240f:
>
> Linux 4.12-rc4 (2017-06-04 16:47:43 -0700)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git x86
>
> for you to fetch changes up to d020eed98440faa4a529c621f881aa9fda296956:
>
> intel_pstate: skip scheduler hook when in "performance" mode. (2017-06-16 19:11:13 -0700)
>
> ----------------------------------------------------------------
> Len Brown (4):
> x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"
> x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
> intel_pstate: delete scheduler hook in HWP mode
> intel_pstate: skip scheduler hook when in "performance" mode.
I'd like to hear from the x86 maintainers about the first two patches.
At least I'd like to know that there are no objections there.
Thanks,
Rafael
On Fri, 16 Jun 2017, Len Brown wrote:
> Issue 4 will remain -- users that really care about
> accurate frequency information should not be using either
> proc or sysfs kernel interfaces.
> They should be using using turbostat(8), or a similar
> purpose-built analysis tool.
>
> Signed-off-by: Len Brown <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
On Fri, 16 Jun 2017, Len Brown wrote:
> +#include <linux/jiffies.h>
> +#include <linux/math64.h>
> +#include <linux/percpu.h>
> +#include <linux/smp.h>
> +
> +struct aperfmperf_sample {
> + unsigned int khz;
> + unsigned long jiffies;
> + u64 aperf;
> + u64 mperf;
Nit. Please write these in tabular fashion:
unsigned int khz;
unsigned long jiffies;
u64 aperf;
u64 mperf;
> +};
> +
> +static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
> +
> +/*
> + * aperfmperf_snapshot_khz()
> + * On the current CPU, snapshot APERF, MPERF, and jiffies
> + * unless we already did it within 10ms
> + * calculate kHz, save snapshot
> + */
> +static void aperfmperf_snapshot_khz(void *dummy)
> +{
> + u64 aperf, aperf_delta;
> + u64 mperf, mperf_delta;
> + struct aperfmperf_sample *s = &get_cpu_var(samples);
This is invoked via a smp function call, so you want
this_cpu_ptr(samples)
here.
> + /* Don't bother re-computing within 10 ms */
> + if (time_before(jiffies, s->jiffies + HZ/100))
> + goto out;
That way you can spare the gotos and simply return
> index a5ce0bbe..cfc6220 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -883,6 +883,19 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
> }
> #endif
>
> +#ifdef CONFIG_X86
> +extern unsigned int aperfmperf_khz_on_cpu(unsigned int cpu);
> +static inline unsigned int arch_freq_get_on_cpu(int cpu)
Please don't add arch specific crap in general headers.
Also having cpu as int and unsigned int is not really consistent.
The simple way to avoid that is to have a weak function and have an arch
override for it. If that does not work because the cpufreq stuff must be
built as a module, then you still can avoid CONFIG_X86 and have something
like CONFIG_ARCH_HAS_FOO.
Thanks,
tglx
On Tue, Jun 20, 2017 at 6:15 AM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 16 Jun 2017, Len Brown wrote:
>> +#include <linux/jiffies.h>
>> +#include <linux/math64.h>
>> +#include <linux/percpu.h>
>> +#include <linux/smp.h>
>> +
>> +struct aperfmperf_sample {
>> + unsigned int khz;
>> + unsigned long jiffies;
>> + u64 aperf;
>> + u64 mperf;
>
> Nit. Please write these in tabular fashion:
> unsigned int khz;
> unsigned long jiffies;
> u64 aperf;
> u64 mperf;
sure, no problem -- I agreed that looks better.
But it seems there is some inconsistency about this style nit --
even in this directory. If there is consensus going forward,
would it make sense for coding-style.rst and checkpatch.pl
to squawk about this, so you don't have to?
>> +};
>> +
>> +static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
>> +
>> +/*
>> + * aperfmperf_snapshot_khz()
>> + * On the current CPU, snapshot APERF, MPERF, and jiffies
>> + * unless we already did it within 10ms
>> + * calculate kHz, save snapshot
>> + */
>> +static void aperfmperf_snapshot_khz(void *dummy)
>> +{
>> + u64 aperf, aperf_delta;
>> + u64 mperf, mperf_delta;
>> + struct aperfmperf_sample *s = &get_cpu_var(samples);
>
> This is invoked via a smp function call, so you want
>
> this_cpu_ptr(samples)
>
> here.
done. thanks!
>> + /* Don't bother re-computing within 10 ms */
>> + if (time_before(jiffies, s->jiffies + HZ/100))
>> + goto out;
>
> That way you can spare the gotos and simply return
happy to remove the gotos, thanks!
>> index a5ce0bbe..cfc6220 100644
>> --- a/include/linux/cpufreq.h
>> +++ b/include/linux/cpufreq.h
>> @@ -883,6 +883,19 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
>> }
>> #endif
>>
>> +#ifdef CONFIG_X86
>> +extern unsigned int aperfmperf_khz_on_cpu(unsigned int cpu);
>> +static inline unsigned int arch_freq_get_on_cpu(int cpu)
> ... having cpu as int and unsigned int is not really consistent.
True. the SMP code uses int cpu, while cpufreq_policy.cpu is an unsigned int.
I'll use "int cpu".
> Please don't add arch specific crap in general headers.
>
> The simple way to avoid that is to have a weak function and have an arch
> override for it. If that does not work because the cpufreq stuff must be
> built as a module, then you still can avoid CONFIG_X86 and have something
> like CONFIG_ARCH_HAS_FOO.
Thanks -- this is not modular code, and so yes, __weak works -- hadn't
had occasion to use it before...
Attached is the incremental diff responding to your comments, in case
that is convenient.
I'll re-send this series with this patch updated in a sec.
thanks,
Len Brown, Intel Open Source Technology Center