2017-06-24 05:12:03

by Len Brown

Subject: [PATCH 0/4 v2] x86,cpufreq: unify APERF/MPERF computation

Hi Rafael, Thomas,

Please see the updated 2nd patch in this series -- revised in response to tglx's review.
Patches 1, 3, and 4 are unchanged.

thanks,
-Len

This patch series has 3 goals:

1. Make "cpu MHz" in /proc/cpuinfo supportable.

2. Make /sys/.../cpufreq/scaling_cur_freq meaningful
and consistent on modern x86 systems.

3. Use 1. and 2. to remove scheduler and cpufreq overhead

There are 3 main changes since this series was proposed
about a year ago:

This update responds to distro feedback to make /proc/cpuinfo
"cpu MHz" constant. Originally, we had proposed making it return
the same dynamic value as cpufreq sysfs.

Some community members suggested that sysfs MHz values should
be meaningful even down to 10ms intervals. So this has been
changed from the original proposal, which did not re-compute
at intervals shorter than 100ms.

(For those who really care about observing frequency, the
recommendation remains to use turbostat(8) or an equivalent utility,
which can reliably measure concurrent intervals of arbitrary length.)

The intel_pstate sampling mechanism has changed.
Originally this series removed an intel_pstate timer in HWP mode.
Now it removes the analogous scheduler call-back.

Most recently, in response to feedback on the version of this series
posted to the list about 10 days ago, the patch to remove frequency
calculation from inside intel_pstate was dropped, in order to maintain
compatibility with tracing scripts. Also, the order of the last two
patches has been exchanged.

Please let me know if you see any issues with this series.

thanks!
Len Brown, Intel Open Source Technology Center

The following changes since commit 3c2993b8c6143d8a5793746a54eba8f86f95240f:

Linux 4.12-rc4 (2017-06-04 16:47:43 -0700)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git x86

for you to fetch changes up to 68516d288d3968fe22d6c8984a7bcbdcdbed351d:

intel_pstate: skip scheduler hook when in "performance" mode. (2017-06-23 22:01:46 -0700)

----------------------------------------------------------------
Len Brown (4):
x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"
x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
intel_pstate: delete scheduler hook in HWP mode
intel_pstate: skip scheduler hook when in "performance" mode.

arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/aperfmperf.c | 79 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/proc.c | 10 +----
drivers/cpufreq/cpufreq.c | 12 +++++-
drivers/cpufreq/intel_pstate.c | 18 +++------
include/linux/cpufreq.h | 2 +
6 files changed, 100 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/kernel/cpu/aperfmperf.c


2017-06-24 05:12:07

by Len Brown

Subject: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF

From: Len Brown <[email protected]>

The goal of this change is to give users a uniform and meaningful
result when they read /sys/...cpufreq/scaling_cur_freq
on modern x86 hardware, as compared to what they get today.

Modern x86 processors include the hardware needed
to accurately calculate frequency over an interval --
APERF, MPERF, and the TSC.

Here we provide an x86 routine to make this calculation
on supported hardware, and use it in preference to any
driver-specific cpufreq_driver.get() routine.

MHz is computed like so:

MHz = base_MHz * delta_APERF / delta_MPERF

MHz is the average frequency of the busy processor
over a measurement interval. The interval is
defined to be the time between successive invocations
of aperfmperf_khz_on_cpu(), which are expected to
happen on-demand when users read the sysfs attribute
cpufreq/scaling_cur_freq.

As with previous methods of calculating MHz,
idle time is excluded.

base_MHz above is from TSC calibration global "cpu_khz".
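
For illustration, with made-up numbers: if cpu_khz is 3,400,000 and,
over the interval, delta_APERF = 2,000,000 while delta_MPERF = 4,000,000
(i.e. the CPU ran at half its base clock while it was busy), then the
reported value is 3,400,000 * 2,000,000 / 4,000,000 = 1,700,000 kHz,
or 1700 MHz.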

This x86 native method to calculate MHz returns a meaningful result
no matter if P-states are controlled by hardware or firmware
and/or if the Linux cpufreq sub-system is or is-not installed.

When this routine is invoked more frequently, the measurement
interval becomes shorter. However, the code limits re-computation
to 10ms intervals so that average frequency remains meaningful.

Discerning users are encouraged to take advantage of
the turbostat(8) utility, which can gracefully handle
concurrent measurement intervals of arbitrary length.

Signed-off-by: Len Brown <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/aperfmperf.c | 79 ++++++++++++++++++++++++++++++++++++++++
drivers/cpufreq/cpufreq.c | 12 +++++-
include/linux/cpufreq.h | 2 +
4 files changed, 93 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/aperfmperf.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 5200001..cdf8249 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -21,6 +21,7 @@ obj-y += common.o
obj-y += rdrand.o
obj-y += match.o
obj-y += bugs.o
+obj-$(CONFIG_CPU_FREQ) += aperfmperf.o

obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
new file mode 100644
index 0000000..d869c86
--- /dev/null
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -0,0 +1,79 @@
+/*
+ * x86 APERF/MPERF KHz calculation for
+ * /sys/.../cpufreq/scaling_cur_freq
+ *
+ * Copyright (C) 2017 Intel Corp.
+ * Author: Len Brown <[email protected]>
+ *
+ * This file is licensed under GPLv2.
+ */
+
+#include <linux/jiffies.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+
+struct aperfmperf_sample {
+ unsigned int khz;
+ unsigned long jiffies;
+ u64 aperf;
+ u64 mperf;
+};
+
+static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
+
+/*
+ * aperfmperf_snapshot_khz()
+ * On the current CPU, snapshot APERF, MPERF, and jiffies
+ * unless we already did it within 10ms
+ * calculate kHz, save snapshot
+ */
+static void aperfmperf_snapshot_khz(void *dummy)
+{
+ u64 aperf, aperf_delta;
+ u64 mperf, mperf_delta;
+ struct aperfmperf_sample *s = this_cpu_ptr(&samples);
+
+ /* Don't bother re-computing within 10 ms */
+ if (time_before(jiffies, s->jiffies + HZ/100))
+ return;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ aperf_delta = aperf - s->aperf;
+ mperf_delta = mperf - s->mperf;
+
+ /*
+ * There is no architectural guarantee that MPERF
+ * increments faster than we can read it.
+ */
+ if (mperf_delta == 0)
+ return;
+
+ /*
+ * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
+ * khz = (cpu_khz * aperf_delta) / mperf_delta
+ */
+ if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
+ s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
+ else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
+ s->khz = div64_u64(aperf_delta,
+ div64_u64(mperf_delta, cpu_khz));
+ s->jiffies = jiffies;
+ s->aperf = aperf;
+ s->mperf = mperf;
+}
+
+unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ if (!cpu_khz)
+ return 0;
+
+ if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+ return 0;
+
+ smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+
+ return per_cpu(samples.khz, cpu);
+}
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 26b643d..6e7424d 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -632,11 +632,21 @@ show_one(cpuinfo_transition_latency, cpuinfo.transition_latency);
show_one(scaling_min_freq, min);
show_one(scaling_max_freq, max);

+__weak unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ return 0;
+}
+
static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
{
ssize_t ret;
+ unsigned int freq;

- if (cpufreq_driver && cpufreq_driver->setpolicy && cpufreq_driver->get)
+ freq = arch_freq_get_on_cpu(policy->cpu);
+ if (freq)
+ ret = sprintf(buf, "%u\n", freq);
+ else if (cpufreq_driver && cpufreq_driver->setpolicy &&
+ cpufreq_driver->get)
ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
else
ret = sprintf(buf, "%u\n", policy->cur);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index a5ce0bbe..905117b 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -883,6 +883,8 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
}
#endif

+extern unsigned int arch_freq_get_on_cpu(int cpu);
+
/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
--
2.7.4

2017-06-24 05:12:06

by Len Brown

Subject: [PATCH 3/4] intel_pstate: delete scheduler hook in HWP mode

From: Len Brown <[email protected]>

The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.

In HWP mode, maintaining that value was the sole purpose of
the scheduler hook, intel_pstate_update_util_hwp(),
so it can now be removed.

Signed-off-by: Len Brown <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index b7de5bd..4ec5668 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1732,16 +1732,6 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
fp_toint(cpu->iowait_boost * 100));
}

-static void intel_pstate_update_util_hwp(struct update_util_data *data,
- u64 time, unsigned int flags)
-{
- struct cpudata *cpu = container_of(data, struct cpudata, update_util);
- u64 delta_ns = time - cpu->sample.time;
-
- if ((s64)delta_ns >= INTEL_PSTATE_HWP_SAMPLING_INTERVAL)
- intel_pstate_sample(cpu, time);
-}
-
static void intel_pstate_update_util_pid(struct update_util_data *data,
u64 time, unsigned int flags)
{
@@ -1933,6 +1923,9 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
{
struct cpudata *cpu = all_cpu_data[cpu_num];

+ if (hwp_active)
+ return;
+
if (cpu->update_util_set)
return;

@@ -2557,7 +2550,6 @@ static int __init intel_pstate_init(void)
} else {
hwp_active++;
intel_pstate.attr = hwp_cpufreq_attrs;
- pstate_funcs.update_util = intel_pstate_update_util_hwp;
goto hwp_cpu_matched;
}
} else {
--
2.7.4

2017-06-24 05:12:39

by Len Brown

Subject: [PATCH 4/4] intel_pstate: skip scheduler hook when in "performance" mode.

From: Len Brown <[email protected]>

When the governor is set to "performance", intel_pstate does not
need the scheduler hook for doing any calculations. Under these
conditions, the hook's only purpose is to continue to maintain
cpufreq/scaling_cur_freq.

The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.

So in "performance" governor mode, the scheduler hook can be skipped.
This applies to both in Software and Hardware P-state control modes.

Suggested-by: Srinivas Pandruvada <[email protected]>
Signed-off-by: Len Brown <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 4ec5668..4538182 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2031,10 +2031,10 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
*/
intel_pstate_clear_update_util_hook(policy->cpu);
intel_pstate_max_within_limits(cpu);
+ } else {
+ intel_pstate_set_update_util_hook(policy->cpu);
}

- intel_pstate_set_update_util_hook(policy->cpu);
-
if (hwp_active)
intel_pstate_hwp_set(policy->cpu);

--
2.7.4

2017-06-24 05:13:28

by Len Brown

Subject: [PATCH 1/4] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"

From: Len Brown <[email protected]>

cpufreq_quick_get() allows cpufreq drivers to override the cpu_khz
value that is otherwise reported in x86 /proc/cpuinfo "cpu MHz".

There are four problems with this scheme,
any one of which is sufficient justification to delete it.

1. Depending on which cpufreq driver is loaded, the behavior
of this field is different.

2. Distros complain that they have to explain to users
why and how this field changes. Distros have requested a constant.

3. The two major providers of this information, acpi_cpufreq
and intel_pstate, both "get it wrong" in different ways.

acpi_cpufreq lies to the user by telling them that
they are running at whatever frequency was last
requested by software.

intel_pstate lies to the user by telling them that
they are running at the average frequency computed
over an undefined measurement interval. But an average
computed over an undefined interval is, itself, undefined...

4. On modern processors, user space utilities, such as
turbostat(8), are more accurate and more precise, while
supporting concurrent measurement over arbitrary intervals.

Users who have been consulting /proc/cpuinfo to
track changing CPU frequency will be disappointed that
it no longer wiggles -- perhaps being unaware of the
limitations of the information they have been consuming.

Yes, they can change their scripts to look at sysfs
cpufreq/scaling_cur_freq. There they will find the same
data of dubious quality that is here removed from /proc/cpuinfo.
The value in sysfs will be fixed by a subsequent patch,
which addresses issues 1-3 above.

Issue 4 will remain -- users who really care about
accurate frequency information should not be using either
the proc or sysfs kernel interfaces.
They should be using turbostat(8), or a similar
purpose-built analysis tool.

Signed-off-by: Len Brown <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/cpu/proc.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 6df621a..218f798 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -2,7 +2,6 @@
#include <linux/timex.h>
#include <linux/string.h>
#include <linux/seq_file.h>
-#include <linux/cpufreq.h>

/*
* Get CPU information for use by the procfs.
@@ -76,14 +75,9 @@ static int show_cpuinfo(struct seq_file *m, void *v)
if (c->microcode)
seq_printf(m, "microcode\t: 0x%x\n", c->microcode);

- if (cpu_has(c, X86_FEATURE_TSC)) {
- unsigned int freq = cpufreq_quick_get(cpu);
-
- if (!freq)
- freq = cpu_khz;
+ if (cpu_has(c, X86_FEATURE_TSC))
seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
- freq / 1000, (freq % 1000));
- }
+ cpu_khz / 1000, (cpu_khz % 1000));

/* Cache size */
if (c->x86_cache_size >= 0)
--
2.7.4

2017-06-24 08:56:49

by Thomas Gleixner

Subject: Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF

On Fri, 23 Jun 2017, Len Brown wrote:
> This x86 native method to calculate MHz returns a meaningful result
> no matter if P-states are controlled by hardware or firmware
> and/or if the Linux cpufreq sub-system is or is-not installed.
>
> When this routine is invoked more frequently, the measurement
> interval becomes shorter. However, the code limits re-computation
> to 10ms intervals so that average frequency remains meaningful.
>
> Discerning users are encouraged to take advantage of
> the turbostat(8) utility, which can gracefully handle
> concurrent measurement intervals of arbitrary length.
>
> Signed-off-by: Len Brown <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

Rafael, please take the whole lot through the cpufreq tree.

Thanks,

tglx

2017-06-24 12:03:44

by Rafael J. Wysocki

Subject: Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF

On Sat, Jun 24, 2017 at 10:56 AM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 23 Jun 2017, Len Brown wrote:
>> This x86 native method to calculate MHz returns a meaningful result
>> no matter if P-states are controlled by hardware or firmware
>> and/or if the Linux cpufreq sub-system is or is-not installed.
>>
>> When this routine is invoked more frequently, the measurement
>> interval becomes shorter. However, the code limits re-computation
>> to 10ms intervals so that average frequency remains meaningful.
>>
>> Discerning users are encouraged to take advantage of
>> the turbostat(8) utility, which can gracefully handle
>> concurrent measurement intervals of arbitrary length.
>>
>> Signed-off-by: Len Brown <[email protected]>
>
> Reviewed-by: Thomas Gleixner <[email protected]>
>
> Rafael, please take the whole lot through the cpufreq tree.

I will, thanks!

Rafael

2017-07-25 22:40:32

by Doug Smythies

Subject: RE: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF

Sorry to be late to the party on this one:

On 2017.06.23 10:12 Len Brown wrote:

> The goal of this change is to give users a uniform and meaningful
> result when they read /sys/...cpufreq/scaling_cur_freq
> on modern x86 hardware, as compared to what they get today.

Myself, I like what I got then, and not what I get now.

> Modern x86 processors include the hardware needed
> to accurately calculate frequency over an interval --
> APERF, MPERF, and the TSC.
>
> Here we provide an x86 routine to make this calculation
> on supported hardware, and use it in preference to any
> driver driver-specific cpufreq_driver.get() routine.
>
> MHz is computed like so:
>
> MHz = base_MHz * delta_APERF / delta_MPERF

Yes, thanks very much.

> MHz is the average frequency of the busy processor
> over a measurement interval. The interval is
> defined to be the time between successive invocations
> of aperfmperf_khz_on_cpu(), which are expected to to
> happen on-demand when users read sysfs attribute
> cpufreq/scaling_cur_freq.

Yes, but that can be hours apart, resulting in useless information.
This threw me for a loop for several days.

> As with previous methods of calculating MHz,
> idle time is excluded.

Which makes the response time to a correct answer
asymmetric, i.e. removal of a load on a CPU will
linger much, much longer than adding a load on a CPU.

> base_MHz above is from TSC calibration global "cpu_khz".

Yes, thank you very much.

> This x86 native method to calculate MHz returns a meaningful result
> no matter if P-states are controlled by hardware or firmware
> and/or if the Linux cpufreq sub-system is or is-not installed.
>
> When this routine is invoked more frequently, the measurement
> interval becomes shorter. However, the code limits re-computation
> to 10ms intervals so that average frequency remains meaningful.
>
> Discerning users are encouraged to take advantage of
> the turbostat(8) utility, which can gracefully handle
> concurrent measurement intervals of arbitrary length.

Somehow, somewhere along the way, turbostat no longer seems
to use base_MHz based on the actual TSC. It used to.

> Signed-off-by: Len Brown <[email protected]>
> ---
> arch/x86/kernel/cpu/Makefile | 1 +
> arch/x86/kernel/cpu/aperfmperf.c | 79 ++++++++++++++++++++++++++++++++++++++++
> drivers/cpufreq/cpufreq.c | 12 +++++-
> include/linux/cpufreq.h | 2 +
> 4 files changed, 93 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/kernel/cpu/aperfmperf.c

... [deleted some] ...

> + * aperfmperf_snapshot_khz()
> + * On the current CPU, snapshot APERF, MPERF, and jiffies
> + * unless we already did it within 10ms

Well, it'll be 8 mSec on a 250 Hz kernel.
There is no maximum time defined, so the interval can be anything,
and therefore the result can be dominated by stale information.

> + * calculate kHz, save snapshot
> + */
> +static void aperfmperf_snapshot_khz(void *dummy)
> +{
> + u64 aperf, aperf_delta;
> + u64 mperf, mperf_delta;
> + struct aperfmperf_sample *s = this_cpu_ptr(&samples);
> +
> + /* Don't bother re-computing within 10 ms */
> + if (time_before(jiffies, s->jiffies + HZ/100))
> + return;

The above condition would be 8 mSec on a 250 Hertz kernel,
wouldn't it?
(I don't care, I'm just saying.)
__________________________________

A long boring story is copied below, but it also includes my test data.

Summary:

. There no longer seems to be a way to check the CPU frequency without affecting the processor (i.e. forcing a wakeup),
thereby potentially influencing the system under test.
. Yes, the old way might have been a "lie", but in some situations it was much much less of a "lie", and took data that
was already available (and at the very maximum 4 seconds old), and didn't force a wakeup, thus monitoring CPU frequency
was a negligible perturbation to the system.
. Now the data is as old as the time the command was run, which might be hours.

For reference my test computer contains an i7-2600K processor, and TSC is 3411.1043 MHz. Minimum pstate 16.

I did follow the e-mail thread [1] about changes to the "cpu MHz" line from /proc/cpuinfo, and expected it to have changed,
and indeed, it only ever prints TSC now and never changes. Whereas with kernel 4.12 it printed the actual CPU frequency,
albeit with the limitations stated in the e-mail thread, which I have always understood and accepted. O.K. so now it
is useless as an actual CPU frequency inquiry tool.

Now, there are two other methods (well three if one includes turbostat) for observing CPU frequency:

The "sudo cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq" method, works the same as it
did in the past (well, there is another active thread about issues with it), but requires root access.

And the "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq" method, which works fine
with kernel 4.12, but seems to give incorrect information with kernel 4.13-rc1, unless one inquires two or
more times and discards the first inquiry.

Test 1 data:

Notes:
CPU 7 only. It is 100% busy all the time.
The CPU burn program prints a time stamp every N loops, as a way to do a sanity check on CPU frequency.
Sanity checks were also done by acquiring trace data.
Turbo is disabled, so the maximum CPU frequency is predictable and known, independent of what other cores are doing.
The data is not from the first loop through this test.

Data:
/sys/devices/system/cpu/intel_pstate/max_perf_pct: 100
Actual CPU 7 frequency: 3411104
Kernel 4.12: /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq: 3400000
Kernel 4.13-rc1: /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq: 3400000
Kernel 4.12: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3400000
Kernel 4.13-rc1, 1st read: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1765012*
Kernel 4.13-rc1, 2nd read: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3411286

/sys/devices/system/cpu/intel_pstate/max_perf_pct: 42
Actual CPU 7 frequency: 1605225
Kernel 4.12: /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq: 1599768
Kernel 4.13-rc1: /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq: 1599975
Kernel 4.12: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1599768
Kernel 4.13-rc1, 1st read: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3309707*
Kernel 4.13-rc1, 2nd read: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1605311

* The value listed for the first read is both a function of the time difference between
changing the maximum CPU frequency and the inquiry and how long since the last read the
actual CPU frequency was changed.

Data (for increase from 1.6 GHz to 3.4 GHz):
First read quickly (manually): 1765012
0.25 seconds to first read: 1663176
0.5 seconds to first read: 1671658
1 seconds to first read: 1767889
2 seconds to first read: 1872128
3 seconds to first read: 1769770
4 seconds to first read: 1814673
5 seconds to first read: 2297147
10 seconds to first read: 2394407
20 seconds to first read: 2720619
30 seconds to first read: 2875374
2 minutes to first read: 3373563
5 minutes to first read: 3363630
10 minutes to first read: 3376521

Data (for decrease from 3.4 GHz to 1.6 GHz):
0.25 seconds to first read: 3381255
0.5 seconds to first read: 3323808
1 seconds to first read: 3247873
2 seconds to first read: 3090182
3 seconds to first read: 3104870
4 seconds to first read: 2837281
5 seconds to first read: 2962827
10 seconds to first read: 2510951
20 seconds to first read: 2763956
30 seconds to first read: 2116198
2 minutes to first read: 1876923
5 minutes to first read: 1715839
10 minutes to first read: 1634040

Note: the above table was done more or less manually.

Test 2 data:

Just take the load off of CPU 7 and then look at its frequency (any amount of time later, I have yet to find a time limit):

Kernel 4.13-rc1, 1st read (1 minute after load removed): /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3410964
Kernel 4.13-rc1, 2nd read (anytime after the 1st read): /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1605326

Kernel 4.13-rc1, 1st read (24 minutes after load removed): /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3268873
Kernel 4.13-rc1, 2nd read (anytime after the 1st read): /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1605233

[1] http://marc.info/?t=149766883400002&r=1&w=2

Note: now also tested with kernel 4.13-rc2.

... Doug


2017-07-26 17:23:54

by Len Brown

Subject: Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF

Hi Doug,

Clearly you are a "discerning user", who understands the limitations
of the kernel sysfs interface,
both new and old, for communicating frequency. With the limitations
of the (old and new)
sysfs interfaces, why are you using it, rather than turbostat?

>> As with previous methods of calculating MHz,
>> idle time is excluded.
>
> Which makes the response time to a correct answer
> asymmetric, i.e. removal of a load on a CPU will
> linger much, much longer than adding a load on a CPU.

If the measurement interval is not defined, then a "correct answer"
is also not defined.

Users now have the capability to define the measurement interval.
Before, they couldn't; they could just observe that it was "short
enough that it looks current". Others may want to measure frequency
over a longer (or even known) interval, and the previous code made
that impossible.
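
For example, here is a minimal user-space sketch (not part of this
series; it assumes cpu0 and the usual cpufreq sysfs path) that defines
its own 5-second measurement interval -- the first read primes the
per-CPU snapshot, the second read reports the average busy frequency
(in kHz) over the chosen interval:

#include <stdio.h>
#include <unistd.h>

static unsigned int read_khz(const char *path)
{
	unsigned int khz = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	if (fscanf(f, "%u", &khz) != 1)
		khz = 0;
	fclose(f);
	return khz;
}

int main(void)
{
	const char *path =
		"/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";

	read_khz(path);		/* start of interval: take a snapshot */
	sleep(5);		/* the measurement interval is ours to choose */
	printf("average busy kHz over 5s: %u\n", read_khz(path));
	return 0;
}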

Also, you may be interested to know that in HWP mode, intel_pstate
used to have a periodic timer whose _only_ job was to wake up the
CPU so that the driver could update the frequency statistic for sysfs.
(now it is a scheduler callback)
Sure, a "discerning user" may have noticed that they have "fresh" data
in sysfs, but most users were better served by not having a timer
fire to refresh data that they'll never consume... The new code
never runs at all, unless the user asks it to.

> Somehow, somewhere along the way, turbostat no longer seems
> to use base_MHz based on the actual TSC. It used to.

True, though not directly related to this thread...
On current and future Intel hardware, base_mhz and TSC rate
are not in the same clock domain. Only on very specific configurations
are those clock rates now equal.

>> + /* Don't bother re-computing within 10 ms */
>> + if (time_before(jiffies, s->jiffies + HZ/100))
>> + return;
>
> The above condition would be 8 mSec on a 250 Hertz kernel,
> wouldn't it?
> (I don't care, I'm just saying.)

True. We could replace the "10ms" comment with "typically 10ms", or "recently".
Note that the value here isn't precise; it is there just to prevent
wasted overhead.
The previous version of this patch was equally valid with a value 10x larger.

> Summary:
>
> . There no longer seems to be a way to check the CPU frequency without affecting the processor (i.e. forcing a wakeup),
> thereby potentially influencing the system under test.

This has always been true, just that the wakeups used to happen inside
the kernel -- whether you consumed the answer or not.

> . Yes, the old way might have been a "lie", but in some situations it was much much less of a "lie", and took data that
> was already available (and at the very maximum 4 seconds old), and didn't force a wakeup, thus monitoring CPU frequency
> was a negligible perturbation to the system.

Frequency data isn't "already available", it has to be measured.
A measurement is not valid unless it is made over a known measurement interval.

> . Now the data is as old as the time the command was run, which might be hours.

True, under controlled conditions, the sysfs measurement interval
could be days or months long.

If a known interval is desired, then something needs to provoke a read
of the attribute at the start of the interval of interest.

Yes, we could do this inside the kernel, but then that would add
overhead to the
system for the vast majority of users who never even read this attribute,
and it would also take control of the interval away from the user.

Making this interface more complex inside the kernel doesn't seem like
a prudent path to go down
when turbostat already exists and can already measure
concurrent/overlapping intervals of arbitrary length
in user-space.

While I still haven't gleaned exactly what you are trying to measure,
I'm very much interested to know if/why you can't measure it using
the new sysfs attribute semantics, or better yet, using turbostat.

thanks,
Len Brown, Intel Open Source Technology Center

2017-07-28 00:21:26

by Rafael J. Wysocki

Subject: [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected

From: Rafael J. Wysocki <[email protected]>

After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to
calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute
in sysfs only behaves as expected on x86 with APERF/MPERF registers
available when it is read from at least twice in a row.

The value returned by the first read may not be meaningful, because
the computations in there use cached values from the previous
aperfmperf_snapshot_khz() call which may be stale. However, the
interface is expected to return meaningful values on every read,
including the first one.

To address this problem modify arch_freq_get_on_cpu() to call
aperfmperf_snapshot_khz() twice, with a short delay between
these calls, if the previous invocation of aperfmperf_snapshot_khz()
was too far back in the past (specifically, more that 1s ago) and
adjust aperfmperf_snapshot_khz() for that.

Fixes: f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF"
Reported-by: Doug Smythies <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
arch/x86/kernel/cpu/aperfmperf.c | 36 +++++++++++++++++++++++++++++-------
1 file changed, 29 insertions(+), 7 deletions(-)

Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c
+++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c
@@ -8,20 +8,25 @@
* This file is licensed under GPLv2.
*/

-#include <linux/jiffies.h>
+#include <linux/delay.h>
+#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/percpu.h>
#include <linux/smp.h>

struct aperfmperf_sample {
unsigned int khz;
- unsigned long jiffies;
+ ktime_t time;
u64 aperf;
u64 mperf;
};

static DEFINE_PER_CPU(struct aperfmperf_sample, samples);

+#define APERFMPERF_CACHE_THRESHOLD_MS 10
+#define APERFMPERF_REFRESH_DELAY_MS 20
+#define APERFMPERF_STALE_THRESHOLD_MS 1000
+
/*
* aperfmperf_snapshot_khz()
* On the current CPU, snapshot APERF, MPERF, and jiffies
@@ -33,9 +38,11 @@ static void aperfmperf_snapshot_khz(void
u64 aperf, aperf_delta;
u64 mperf, mperf_delta;
struct aperfmperf_sample *s = this_cpu_ptr(&samples);
+ ktime_t now = ktime_get();
+ s64 time_delta = ktime_ms_delta(now, s->time);

- /* Don't bother re-computing within 10 ms */
- if (time_before(jiffies, s->jiffies + HZ/100))
+ /* Don't bother re-computing within the cache threshold time. */
+ if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
return;

rdmsrl(MSR_IA32_APERF, aperf);
@@ -51,6 +58,16 @@ static void aperfmperf_snapshot_khz(void
if (mperf_delta == 0)
return;

+ s->time = now;
+ s->aperf = aperf;
+ s->mperf = mperf;
+
+ /* If the previous iteration was too long ago, discard it. */
+ if (time_delta > APERFMPERF_STALE_THRESHOLD_MS) {
+ s->khz = 0;
+ return;
+ }
+
/*
* if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
* khz = (cpu_khz * aperf_delta) / mperf_delta
@@ -60,13 +77,12 @@ static void aperfmperf_snapshot_khz(void
else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
s->khz = div64_u64(aperf_delta,
div64_u64(mperf_delta, cpu_khz));
- s->jiffies = jiffies;
- s->aperf = aperf;
- s->mperf = mperf;
}

unsigned int arch_freq_get_on_cpu(int cpu)
{
+ unsigned int khz;
+
if (!cpu_khz)
return 0;

@@ -74,6 +90,12 @@ unsigned int arch_freq_get_on_cpu(int cp
return 0;

smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+ khz = per_cpu(samples.khz, cpu);
+ if (khz)
+ return khz;
+
+ msleep(APERFMPERF_REFRESH_DELAY_MS);
+ smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);

return per_cpu(samples.khz, cpu);
}

2017-07-28 06:01:46

by Doug Smythies

Subject: RE: [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected

On 2017.07.27 17:13 Rafael J. Wysocki wrote:

> From: Rafael J. Wysocki <[email protected]>
>
> After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to
> calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute
> in sysfs only behaves as expected on x86 with APERF/MPERF registers
> available when it is read from at least twice in a row.
>
> The value returned by the first read may not be meaningful, because
> the computations in there use cached values from the previous
> aperfmperf_snapshot_khz() call which may be stale. However, the
> interface is expected to return meaningful values on every read,
> including the first one.
>
> To address this problem modify arch_freq_get_on_cpu() to call
> aperfmperf_snapshot_khz() twice, with a short delay between
> these calls, if the previous invocation of aperfmperf_snapshot_khz()
> was too far back in the past (specifically, more that 1s ago) and
> adjust aperfmperf_snapshot_khz() for that.
>
> Fixes: f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF"
> Reported-by: Doug Smythies <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> arch/x86/kernel/cpu/aperfmperf.c | 36 +++++++++++++++++++++++++++++-------
> 1 file changed, 29 insertions(+), 7 deletions(-)
>
> Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c

...[deleted the rest]...

This proposed patch would be good. However, I can only try it maybe by Sunday.
I think that the maximum time span (now capped at 1 second) means that this code:

/*
* if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
* khz = (cpu_khz * aperf_delta) / mperf_delta
*/
if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
s->khz = div64_u64(aperf_delta,
div64_u64(mperf_delta, cpu_khz));

Could be reduced to this:

s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);

Because it could never overflow anymore.
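
(Rough bound, assuming a maximum clock of around 5 GHz: with the
interval capped at about 1 second before the multiply is reached,
aperf_delta is at most a few times 10^9 (< 2^33), and cpu_khz is a few
million (< 2^23), so cpu_khz * aperf_delta stays below roughly 2^56,
well under ULLONG_MAX.)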

... Doug


2017-07-28 12:35:00

by Rafael J. Wysocki

Subject: Re: [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected

On Thursday, July 27, 2017 11:01:39 PM Doug Smythies wrote:
> On 2017.07.27 17:13 Rafael J. Wysocki wrote:
>
> > From: Rafael J. Wysocki <[email protected]>
> >
> > After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to
> > calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute
> > in sysfs only behaves as expected on x86 with APERF/MPERF registers
> > available when it is read from at least twice in a row.
> >
> > The value returned by the first read may not be meaningful, because
> > the computations in there use cached values from the previous
> > aperfmperf_snapshot_khz() call which may be stale. However, the
> > interface is expected to return meaningful values on every read,
> > including the first one.
> >
> > To address this problem modify arch_freq_get_on_cpu() to call
> > aperfmperf_snapshot_khz() twice, with a short delay between
> > these calls, if the previous invocation of aperfmperf_snapshot_khz()
> > was too far back in the past (specifically, more that 1s ago) and
> > adjust aperfmperf_snapshot_khz() for that.
> >
> > Fixes: f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF"
> > Reported-by: Doug Smythies <[email protected]>
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---
> > arch/x86/kernel/cpu/aperfmperf.c | 36 +++++++++++++++++++++++++++++-------
> > 1 file changed, 29 insertions(+), 7 deletions(-)
> >
> > Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
>
> ...[deleted the rest]...
>
> This proposed patch would be good. However, I can only try it maybe by Sunday.
> I think the maximum time span means that this code:
>
> /*
> * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
> * khz = (cpu_khz * aperf_delta) / mperf_delta
> */
> if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
> s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
> else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
> s->khz = div64_u64(aperf_delta,
> div64_u64(mperf_delta, cpu_khz));
>
> Could be reduced to this:
>
> s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
>
> Because it could never overflow anymore.

Right, that's a good point.

I'll send a v2 with this change included shortly.

Thanks,
Rafael

2017-07-28 12:53:08

by Rafael J. Wysocki

Subject: [PATCH v2] cpufreq: x86: Make scaling_cur_freq behave more as expected

From: Rafael J. Wysocki <[email protected]>

After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to
calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute
in sysfs only behaves as expected on x86 with APERF/MPERF registers
available when it is read from at least twice in a row. The value
returned by the first read may not be meaningful, because the
computations in there use cached values from the previous iteration
of aperfmperf_snapshot_khz() which may be stale.

To prevent that from happening, modify arch_freq_get_on_cpu() to
call aperfmperf_snapshot_khz() twice, with a short delay between
these calls, if the previous invocation of aperfmperf_snapshot_khz()
was too far back in the past (specifically, more than 1s ago).

Also, as pointed out by Doug Smythies, aperf_delta is limited now
and the multiplication of it by cpu_khz won't overflow, so simplify
the s->khz computations too.

Fixes: f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF"
Reported-by: Doug Smythies <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---

-> v2: Simplify the khz computations as per Doug's suggestion.

---
arch/x86/kernel/cpu/aperfmperf.c | 40 +++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)

Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c
+++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c
@@ -8,20 +8,25 @@
* This file is licensed under GPLv2.
*/

-#include <linux/jiffies.h>
+#include <linux/delay.h>
+#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/percpu.h>
#include <linux/smp.h>

struct aperfmperf_sample {
unsigned int khz;
- unsigned long jiffies;
+ ktime_t time;
u64 aperf;
u64 mperf;
};

static DEFINE_PER_CPU(struct aperfmperf_sample, samples);

+#define APERFMPERF_CACHE_THRESHOLD_MS 10
+#define APERFMPERF_REFRESH_DELAY_MS 20
+#define APERFMPERF_STALE_THRESHOLD_MS 1000
+
/*
* aperfmperf_snapshot_khz()
* On the current CPU, snapshot APERF, MPERF, and jiffies
@@ -33,9 +38,11 @@ static void aperfmperf_snapshot_khz(void
u64 aperf, aperf_delta;
u64 mperf, mperf_delta;
struct aperfmperf_sample *s = this_cpu_ptr(&samples);
+ ktime_t now = ktime_get();
+ s64 time_delta = ktime_ms_delta(now, s->time);

- /* Don't bother re-computing within 10 ms */
- if (time_before(jiffies, s->jiffies + HZ/100))
+ /* Don't bother re-computing within the cache threshold time. */
+ if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
return;

rdmsrl(MSR_IA32_APERF, aperf);
@@ -51,22 +58,21 @@ static void aperfmperf_snapshot_khz(void
if (mperf_delta == 0)
return;

- /*
- * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
- * khz = (cpu_khz * aperf_delta) / mperf_delta
- */
- if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
- s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
- else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
- s->khz = div64_u64(aperf_delta,
- div64_u64(mperf_delta, cpu_khz));
- s->jiffies = jiffies;
+ s->time = now;
s->aperf = aperf;
s->mperf = mperf;
+
+ /* If the previous iteration was too long ago, discard it. */
+ if (time_delta > APERFMPERF_STALE_THRESHOLD_MS)
+ s->khz = 0;
+ else
+ s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
}

unsigned int arch_freq_get_on_cpu(int cpu)
{
+ unsigned int khz;
+
if (!cpu_khz)
return 0;

@@ -74,6 +80,12 @@ unsigned int arch_freq_get_on_cpu(int cp
return 0;

smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+ khz = per_cpu(samples.khz, cpu);
+ if (khz)
+ return khz;
+
+ msleep(APERFMPERF_REFRESH_DELAY_MS);
+ smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);

return per_cpu(samples.khz, cpu);
}

2017-07-31 23:46:49

by Doug Smythies

Subject: RE: [PATCH v2] cpufreq: x86: Make scaling_cur_freq behave more as expected

On 2017.07.28 05:45 Rafael J. Wysocki wrote:

> From: Rafael J. Wysocki <[email protected]>
>
> After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to
> calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute
> in sysfs only behaves as expected on x86 with APERF/MPERF registers
> available when it is read from at least twice in a row. The value
> returned by the first read may not be meaningful, because the
> computations in there use cached values from the previous iteration
> of aperfmperf_snapshot_khz() which may be stale.
>
> To prevent that from happening, modify arch_freq_get_on_cpu() to
> call aperfmperf_snapshot_khz() twice, with a short delay between
> these calls, if the previous invocation of aperfmperf_snapshot_khz()
> was too far back in the past (specifically, more that 1s ago).

...[deleted the rest]...

This patch seems to work fine and addresses my complaints from last week.
Thanks.

... Doug


2017-08-01 00:58:31

by Rafael J. Wysocki

Subject: Re: [PATCH v2] cpufreq: x86: Make scaling_cur_freq behave more as expected

On Monday, July 31, 2017 04:46:42 PM Doug Smythies wrote:
> On 2017.07.28 05:45 Rafael J. Wysocki wrote:
>
> > From: Rafael J. Wysocki <[email protected]>
> >
> > After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to
> > calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute
> > in sysfs only behaves as expected on x86 with APERF/MPERF registers
> > available when it is read from at least twice in a row. The value
> > returned by the first read may not be meaningful, because the
> > computations in there use cached values from the previous iteration
> > of aperfmperf_snapshot_khz() which may be stale.
> >
> > To prevent that from happening, modify arch_freq_get_on_cpu() to
> > call aperfmperf_snapshot_khz() twice, with a short delay between
> > these calls, if the previous invocation of aperfmperf_snapshot_khz()
> > was too far back in the past (specifically, more that 1s ago).
>
> ...[deleted the rest]...
>
> This patch seems to work fine and addresses my complaints from last week.
> Thanks.

Thanks for the confirmation!

Rafael