2022-03-11 15:11:49

by Eric Dumazet

Subject: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

From: Eric Dumazet <[email protected]>

Opening /proc/cpuinfo can incur a large latency on hosts with many CPUs,
mostly because it is essentially doing:

for_each_online_cpu(cpu)
smp_call_function_single(cpu, aperfmperf_snapshot_khz, ...)

smp_call_function_single() reuses a common csd, meaning that
each invocation needs to wait for completion of the prior one.

Paul's recent patches have lowered the number of CPUs receiving the IPI,
but there are still cases where the latency of the above loop can
reach 10 ms; an extra msleep(10) is then performed, for a total of 20 ms.

Using smp_call_function_many() allows for full parallelism,
and brings latency down to ~80 usec on a host with 256 CPUs.
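
In short, the change replaces the serialized per-CPU cross-calls with a
single broadcast call. A schematic before/after sketch (declarations and
the housekeeping/idle filtering are elided; the real hunks, including the
cpumask allocation, are in the diff below):

	/*
	 * Before: one synchronous cross-call per CPU. Because the common
	 * csd is reused, call N+1 cannot be issued before call N has
	 * completed, so the loop is fully serialized.
	 */
	for_each_online_cpu(cpu)
		smp_call_function_single(cpu, aperfmperf_snapshot_khz,
					 NULL, wait);

	/*
	 * After: mark the target CPUs in a cpumask and issue a single
	 * broadcast; the IPIs go out in parallel and we wait at most
	 * once for all of them to complete.
	 */
	for_each_online_cpu(cpu)
		__cpumask_set_cpu(cpu, mask);
	smp_call_function_many(mask, aperfmperf_snapshot_khz, NULL, wait);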

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: <[email protected]>
---
arch/x86/kernel/cpu/aperfmperf.c | 32 +++++++++++++++++++++++---------
1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 22911deacb6e441ad60ddb57190ef3772afb3cf0..a305310ceb44784a0ad9be7c196061d98fa1adbc 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -67,7 +67,8 @@ static void aperfmperf_snapshot_khz(void *dummy)
atomic_set_release(&s->scfpending, 0);
}

-static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
+static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait,
+ struct cpumask *mask)
{
s64 time_delta = ktime_ms_delta(now, per_cpu(samples.time, cpu));
struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
@@ -76,9 +77,13 @@ static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
return true;

- if (!atomic_xchg(&s->scfpending, 1) || wait)
- smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait);
-
+ if (!atomic_xchg(&s->scfpending, 1) || wait) {
+ if (mask)
+ __cpumask_set_cpu(cpu, mask);
+ else
+ smp_call_function_single(cpu, aperfmperf_snapshot_khz,
+ NULL, wait);
+ }
/* Return false if the previous iteration was too long ago. */
return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
}
@@ -97,13 +102,14 @@ unsigned int aperfmperf_get_khz(int cpu)
if (rcu_is_idle_cpu(cpu))
return 0; /* Idle CPUs are completely uninteresting. */

- aperfmperf_snapshot_cpu(cpu, ktime_get(), true);
+ aperfmperf_snapshot_cpu(cpu, ktime_get(), true, NULL);
return per_cpu(samples.khz, cpu);
}

void arch_freq_prepare_all(void)
{
ktime_t now = ktime_get();
+ cpumask_var_t mask;
bool wait = false;
int cpu;

@@ -113,17 +119,25 @@ void arch_freq_prepare_all(void)
if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
return;

+ if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+ return;
+
+ cpus_read_lock();
for_each_online_cpu(cpu) {
if (!housekeeping_cpu(cpu, HK_FLAG_MISC))
continue;
if (rcu_is_idle_cpu(cpu))
continue; /* Idle CPUs are completely uninteresting. */
- if (!aperfmperf_snapshot_cpu(cpu, now, false))
+ if (!aperfmperf_snapshot_cpu(cpu, now, false, mask))
wait = true;
}

- if (wait)
- msleep(APERFMPERF_REFRESH_DELAY_MS);
+ preempt_disable();
+ smp_call_function_many(mask, aperfmperf_snapshot_khz, NULL, wait);
+ preempt_enable();
+ cpus_read_unlock();
+
+ free_cpumask_var(mask);
}

unsigned int arch_freq_get_on_cpu(int cpu)
@@ -139,7 +153,7 @@ unsigned int arch_freq_get_on_cpu(int cpu)
if (!housekeeping_cpu(cpu, HK_FLAG_MISC))
return 0;

- if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
+ if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true, NULL))
return per_cpu(samples.khz, cpu);

msleep(APERFMPERF_REFRESH_DELAY_MS);
--
2.35.1.723.g4982287a31-goog


2022-03-11 22:03:00

by Wysocki, Rafael J

Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

On 3/11/2022 2:17 AM, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>
>
> Opening /proc/cpuinfo can incur a large latency on hosts with many CPUs,
> mostly because it is essentially doing:
>
> for_each_online_cpu(cpu)
> smp_call_function_single(cpu, aperfmperf_snapshot_khz, ...)
>
> smp_call_function_single() reuses a common csd, meaning that
> each invocation needs to wait for completion of the prior one.
>
> Paul's recent patches have lowered the number of CPUs receiving the IPI,
> but there are still cases where the latency of the above loop can
> reach 10 ms; an extra msleep(10) is then performed, for a total of 20 ms.
>
> Using smp_call_function_many() allows for full parallelism,
> and brings latency down to ~80 usec on a host with 256 CPUs.

This looks reasonable to me.

Acked-by: Rafael J. Wysocki <[email protected]>

or if you want me to pick it up, please resend the patch with a CC to
[email protected].



2022-03-11 22:29:18

by Paul E. McKenney

Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

On Thu, Mar 10, 2022 at 05:17:15PM -0800, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>
>
> Opening /proc/cpuinfo can incur a large latency on hosts with many CPUs,
> mostly because it is essentially doing:
>
> for_each_online_cpu(cpu)
> smp_call_function_single(cpu, aperfmperf_snapshot_khz, ...)
>
> smp_call_function_single() reuses a common csd, meaning that
> each invocation needs to wait for completion of the prior one.
>
> Paul's recent patches have lowered the number of CPUs receiving the IPI,
> but there are still cases where the latency of the above loop can
> reach 10 ms; an extra msleep(10) is then performed, for a total of 20 ms.
>
> Using smp_call_function_many() allows for full parallelism,
> and brings latency down to ~80 usec on a host with 256 CPUs.
>
> Signed-off-by: Eric Dumazet <[email protected]>

Nice!!!

Acked-by: Paul E. McKenney <[email protected]>


2022-03-25 18:03:57

by Eric Dumazet

Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

On Fri, Mar 11, 2022 at 8:36 AM Rafael J. Wysocki
<[email protected]> wrote:
>
> On 3/11/2022 2:17 AM, Eric Dumazet wrote:
> > From: Eric Dumazet <[email protected]>
> >
> > Opening /proc/cpuinfo can incur a large latency on hosts with many CPUs,
> > mostly because it is essentially doing:
> >
> > for_each_online_cpu(cpu)
> > smp_call_function_single(cpu, aperfmperf_snapshot_khz, ...)
> >
> > smp_call_function_single() reuses a common csd, meaning that
> > each invocation needs to wait for completion of the prior one.
> >
> > Paul's recent patches have lowered the number of CPUs receiving the IPI,
> > but there are still cases where the latency of the above loop can
> > reach 10 ms; an extra msleep(10) is then performed, for a total of 20 ms.
> >
> > Using smp_call_function_many() allows for full parallelism,
> > and brings latency down to ~80 usec on a host with 256 CPUs.
>
> This looks reasonable to me.
>
> Acked-by: Rafael J. Wysocki <[email protected]>
>
> or if you want me to pick it up, please resend the patch with a CC to
> [email protected].

I do not know what the x86 maintainers prefer.

Let them give their advice here, thanks!


2022-03-30 19:08:59

by Eric Dumazet

Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

On Wed, Mar 30, 2022 at 10:05 AM Eric Dumazet <[email protected]> wrote:
>

> Can you send an actual patch, with a changelog then ?
>
> I saw kind of a rant about my patch, which was fine IMO.

I forgot to say that avoiding IPIs is very nice, of course.

Thanks!

2022-03-31 02:55:50

by Thomas Gleixner

Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

Eric,

On Thu, Mar 10 2022 at 17:17, Eric Dumazet wrote:
> Opening /proc/cpuinfo can incur a large latency on hosts with many CPUs,

this is important because open() sends IPIs? I assume you meant
reading. But even that is of questionable importance unless you care to
provide some useful information why this matters.

AFAIK, there are _two_ cases why /proc/cpuinfo is read:

1) Retrieve information about the CPUs and the [mis]features
supported by the kernel. This information is fully static and for
that purpose exposing the nominal CPU frequency would be
completely sufficient.

2) Retrieve the 'actual' CPU frequency because using the per CPU
sysfs interface is slow. In the worst case do that in a loop.

I consider #2 an abuse and in fact the exposure of aperf/mperf to that
interface should have never happened at all. But sure, features....

As a consequence we are tinkering with this nonsense and optimizing it
to death without even thinking about whether this interface makes sense
or not:

> Using smp_call_function_many() allows for full parallelism,
> and brings latency down to ~80 usec on a host with 256 CPUs.

which I hate with a passion because that allows *unprivileged* user
space to inject systemwide IPIs every 10ms just to read these counters,
which provide no more than an estimate and are of no value for
the only sane use case of /proc/cpuinfo, i.e. #1 above.

What's worse, that 80 usec worst case latency is spent in the context of
an *unprivileged* user space thread, with preemption disabled, waiting
for the SMP function calls to complete. RT users are very happy
about that...

On a machine with 256 CPUs the readout of /proc/cpuinfo without this
whole aperf/mperf IPI muck already takes ~3msec just to dump information
which is largely uninteresting:

Total size: 400014
Unique line size: 15146 ~= 3%

Total lines: 7168
Unique lines: 857 ~= 11%

This 3msec is only the time for 'read()' w/o any IPI costs or subsequent
parsing.

Can we please take a step back and think about this for real instead of
using the 'all I have is a hammer' approach?

The use cases I'm aware of are:

1) Read the CPU [mis]features supported by the kernel:

Why would you read more than one CPU just for this if it's
trivial to figure out whether the system supports heterogeneous
feature sets or not? Even if so, 90% of that
information is still redundant because the feature differences are not
per CPU, they are per CPU cluster.

2) Topology information

3) Provide a report for whatever purpose

4) CPU MHz retrieval

I might have missed some "important" use case here. Feel free to educate
me on that.

Neither #1 nor #2 have any interest in redundant information nor do they
care about "accurate" CPU MHz information.

For #3 the amount of redundant information does not matter, but neither
does the CPU MHz information. That's perfectly fine with the nominal
frequency.

So that leaves us with #4, which is a monitoring problem:

1) For the one off case the latency does not matter at all and if
done right then the whole IPI nonsense can be avoided
completely.

2) For continuous monitoring it matters obviously

If that's the real use case people care about then we should
provide a proper interface for it and do the obvious:

Set a flag to tell the CPUs to collect that data on a regular
basis, e.g. in the tick interrupt (sketched below).

The resulting overhead is going to be:

- The time to check the flag. If placed right then the cost is
in the low single digit cycles and not necessarily noticeable
at all in the noise of the tick interrupt.

- The readout time for the A/MPERF MSRs, i.e. about 300 cycles
total.

IOW, we are talking about 200 - 300 cycles of overhead for providing
the information on demand and very low single-digit cycles of
overhead per tick if the flag is not set.

Pretty much independent of the uarchs I tested on with a trivial
check, i.e. 'if (!collect) return;', the result was completely
within the noise of the timer interrupt and I really could not
read any significant difference out of it for the case where
collect was false.
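
A minimal sketch of that flag-driven collection, with made-up identifiers
(the draft at the end of this mail is even simpler and just piggy-backs on
arch_scale_freq_tick(), which reads these MSRs anyway):

	/* Sketch only; none of these names exist in the draft below. */
	static bool aperfmperf_collect;		/* monitoring requested? */
	static DEFINE_PER_CPU(u64, prev_aperf);
	static DEFINE_PER_CPU(u64, prev_mperf);
	static DEFINE_PER_CPU(unsigned int, last_khz);

	static void aperfmperf_tick(void)
	{
		u64 aperf, mperf, acnt, mcnt;

		/* A few cycles when nobody is monitoring. */
		if (!READ_ONCE(aperfmperf_collect))
			return;

		/* ~300 cycles for the two MSR reads. */
		rdmsrl(MSR_IA32_APERF, aperf);
		rdmsrl(MSR_IA32_MPERF, mperf);

		acnt = aperf - __this_cpu_read(prev_aperf);
		mcnt = mperf - __this_cpu_read(prev_mperf);
		__this_cpu_write(prev_aperf, aperf);
		__this_cpu_write(prev_mperf, mperf);

		if (mcnt)
			__this_cpu_write(last_khz,
					 div64_u64(cpu_khz * acnt, mcnt));
	}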

Now compare that to the current IPI case with your patch:

- The IPI cost is ~3us on the remote CPU on the machine I did
the experiments on. But that's not taking the resulting cache
pollution and whatever into account.

- The costs for waiting on the initiating CPU for the remote
CPUs maxed out at ~90us

which sums up to 90 + 256 * 3 = 858us total compute time every
10ms, which amounts to 1.7e6 cycles.

That means 300 * 256 = 76800 cycles per 10ms worst case if all
CPUs are busy and have a tick running versus 1.7e6 cycles plus
associated costs.

But it gets even better. The frequency invariance scheduling support
for x86 already reads APERF and MPERF on *every* tick on recent machines.

Of course this code lives elsewhere and does not share anything with the
preexisting aperf/mperf muck. Sigh!

So there is no real reason anymore to avoid a periodic readout of
APERF/MPERF and provide the data for the other users.

Something like the below makes all the IPI nonsense and more go
away. It's probably incomplete, but builds, boots and shows pretty
numbers. :)

Thanks,

tglx
---
arch/x86/kernel/cpu/aperfmperf.c | 464 +++++++++++++++++++++++++++++++--------
arch/x86/kernel/cpu/proc.c | 2
arch/x86/kernel/smpboot.c | 355 -----------------------------
fs/proc/cpuinfo.c | 6
include/linux/cpufreq.h | 1
5 files changed, 372 insertions(+), 456 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -6,146 +6,422 @@
* Copyright (C) 2017 Intel Corp.
* Author: Len Brown <[email protected]>
*/
-
-#include <linux/delay.h>
-#include <linux/ktime.h>
+#include <linux/cpufreq.h>
#include <linux/math64.h>
#include <linux/percpu.h>
-#include <linux/cpufreq.h>
-#include <linux/smp.h>
#include <linux/sched/isolation.h>
-#include <linux/rcupdate.h>
+#include <linux/sched/topology.h>
+#include <linux/smp.h>
+#include <linux/syscore_ops.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>

#include "cpu.h"

struct aperfmperf_sample {
- unsigned int khz;
- atomic_t scfpending;
- ktime_t time;
- u64 aperf;
- u64 mperf;
+ seqcount_t seq;
+ unsigned long last_update;
+ u64 acnt;
+ u64 mcnt;
+ u64 aperf;
+ u64 mperf;
};

-static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
-
-#define APERFMPERF_CACHE_THRESHOLD_MS 10
-#define APERFMPERF_REFRESH_DELAY_MS 10
-#define APERFMPERF_STALE_THRESHOLD_MS 1000
+static DEFINE_PER_CPU(struct aperfmperf_sample, samples) = {
+ .seq = SEQCNT_ZERO(apermperf_sample.s)
+};

-/*
- * aperfmperf_snapshot_khz()
- * On the current CPU, snapshot APERF, MPERF, and jiffies
- * unless we already did it within 10ms
- * calculate kHz, save snapshot
- */
-static void aperfmperf_snapshot_khz(void *dummy)
+unsigned int arch_freq_get_on_cpu(int cpu)
{
- u64 aperf, aperf_delta;
- u64 mperf, mperf_delta;
- struct aperfmperf_sample *s = this_cpu_ptr(&samples);
- unsigned long flags;
+ struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
+ unsigned long last;
+ unsigned int seq;
+ u64 acnt, mcnt;

- local_irq_save(flags);
- rdmsrl(MSR_IA32_APERF, aperf);
- rdmsrl(MSR_IA32_MPERF, mperf);
- local_irq_restore(flags);
+ if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+ return 0;

- aperf_delta = aperf - s->aperf;
- mperf_delta = mperf - s->mperf;
+ do {
+ seq = raw_read_seqcount_begin(&s->seq);
+ last = s->last_update;
+ acnt = s->acnt;
+ mcnt = s->mcnt;
+ } while (read_seqcount_retry(&s->seq, seq));

/*
- * There is no architectural guarantee that MPERF
- * increments faster than we can read it.
+ * Bail on invalid count and when the last update was too long ago,
+ * which covers idle and NOHZ full CPUs.
*/
- if (mperf_delta == 0)
- return;
+ if (!mcnt || (jiffies - last) > (HZ / 25))
+ return 0;

- s->time = ktime_get();
- s->aperf = aperf;
- s->mperf = mperf;
- s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
- atomic_set_release(&s->scfpending, 0);
+ return div64_u64((cpu_khz * acnt), mcnt);
}

-static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
+static void init_counter_refs(void)
{
- s64 time_delta = ktime_ms_delta(now, per_cpu(samples.time, cpu));
- struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
+ u64 aperf, mperf;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ this_cpu_write(samples.aperf, aperf);
+ this_cpu_write(samples.mperf, mperf);
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ * BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * An exception to the heuristic above is the Atom uarch, where we choose the
+ * highest turbo level for freq_max since Atom's are generally oriented towards
+ * power efficiency.
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */
+
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);

- /* Don't bother re-computing within the cache threshold time. */
- if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
- return true;
+static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
+static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;

- if (!atomic_xchg(&s->scfpending, 1) || wait)
- smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait);
+void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+ arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
+ arch_turbo_freq_ratio;
+}
+EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
+
+static bool turbo_disabled(void)
+{
+ u64 misc_en;
+ int err;

- /* Return false if the previous iteration was too long ago. */
- return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
+ err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+ if (err)
+ return false;
+
+ return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
}

-unsigned int aperfmperf_get_khz(int cpu)
+static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
{
- if (!cpu_khz)
- return 0;
+ int err;

- if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
- return 0;
+ err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
+ if (err)
+ return false;

- if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
- return 0;
+ err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
+ if (err)
+ return false;

- if (rcu_is_idle_cpu(cpu))
- return 0; /* Idle CPUs are completely uninteresting. */
+ *base_freq = (*base_freq >> 16) & 0x3F; /* max P state */
+ *turbo_freq = *turbo_freq & 0x3F; /* 1C turbo */

- aperfmperf_snapshot_cpu(cpu, ktime_get(), true);
- return per_cpu(samples.khz, cpu);
+ return true;
}

-void arch_freq_prepare_all(void)
+#define X86_MATCH(model) \
+ X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, \
+ INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+ X86_MATCH(XEON_PHI_KNL),
+ X86_MATCH(XEON_PHI_KNM),
+ {}
+};
+
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+ X86_MATCH(SKYLAKE_X),
+ {}
+};
+
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+ X86_MATCH(ATOM_GOLDMONT),
+ X86_MATCH(ATOM_GOLDMONT_D),
+ X86_MATCH(ATOM_GOLDMONT_PLUS),
+ {}
+};
+
+static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+ int num_delta_fratio)
{
- ktime_t now = ktime_get();
- bool wait = false;
- int cpu;
+ int fratio, delta_fratio, found;
+ int err, i;
+ u64 msr;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+ if (err)
+ return false;
+
+ fratio = (msr >> 8) & 0xFF;
+ i = 16;
+ found = 0;
+ do {
+ if (found >= num_delta_fratio) {
+ *turbo_freq = fratio;
+ return true;
+ }
+
+ delta_fratio = (msr >> (i + 5)) & 0x7;
+
+ if (delta_fratio) {
+ found += 1;
+ fratio -= delta_fratio;
+ }

- if (!cpu_khz)
- return;
+ i += 8;
+ } while (i < 64);

- if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return true;
+}
+
+static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+{
+ u64 ratios, counts;
+ u32 group_size;
+ int err, i;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
+ if (err)
+ return false;
+
+ for (i = 0; i < 64; i += 8) {
+ group_size = (counts >> i) & 0xFF;
+ if (group_size >= size) {
+ *turbo_freq = (ratios >> i) & 0xFF;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+ u64 msr;
+ int err;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+ *turbo_freq = (msr >> 24) & 0xFF; /* 4C turbo */
+
+ /* The CPU may have less than 4 cores */
+ if (!*turbo_freq)
+ *turbo_freq = msr & 0xFF; /* 1C turbo */
+
+ return true;
+}
+
+static bool intel_set_max_freq_ratio(void)
+{
+ u64 base_freq, turbo_freq;
+ u64 turbo_ratio;
+
+ if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
+ goto out;
+
+ if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
+ skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+ goto out;
+
+ if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
+ knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+ goto out;
+
+ if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
+ skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
+ goto out;
+
+ if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
+ goto out;
+
+ return false;
+
+out:
+ /*
+ * Some hypervisors advertise X86_FEATURE_APERFMPERF
+ * but then fill all MSR's with zeroes.
+ * Some CPUs have turbo boost but don't declare any turbo ratio
+ * in MSR_TURBO_RATIO_LIMIT.
+ */
+ if (!base_freq || !turbo_freq) {
+ pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
+ return false;
+ }
+
+ turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
+ if (!turbo_ratio) {
+ pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
+ return false;
+ }
+
+ arch_turbo_freq_ratio = turbo_ratio;
+ arch_set_max_freq_ratio(turbo_disabled());
+
+ return true;
+}
+
+#ifdef CONFIG_PM_SLEEP
+static struct syscore_ops freq_invariance_syscore_ops = {
+ .resume = init_counter_refs,
+};
+
+static void register_freq_invariance_syscore_ops(void)
+{
+ /* Bail out if registered already. */
+ if (freq_invariance_syscore_ops.node.prev)
return;

- for_each_online_cpu(cpu) {
- if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
- continue;
- if (rcu_is_idle_cpu(cpu))
- continue; /* Idle CPUs are completely uninteresting. */
- if (!aperfmperf_snapshot_cpu(cpu, now, false))
- wait = true;
+ register_syscore_ops(&freq_invariance_syscore_ops);
+}
+#else
+static inline void register_freq_invariance_syscore_ops(void) {}
+#endif
+
+static void __init_freq_invariance(bool cppc_ready)
+{
+ bool ret = false;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ ret = intel_set_max_freq_ratio();
+ else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+ if (!cppc_ready)
+ return;
+ ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
}

- if (wait)
- msleep(APERFMPERF_REFRESH_DELAY_MS);
+ if (ret) {
+ static_branch_enable(&arch_scale_freq_key);
+ register_freq_invariance_syscore_ops();
+ pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
+ } else {
+ pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
+ }
}

-unsigned int arch_freq_get_on_cpu(int cpu)
+static void disable_freq_invariance_workfn(struct work_struct *work)
{
- struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
+ static_branch_disable(&arch_scale_freq_key);
+}

- if (!cpu_khz)
- return 0;
+static DECLARE_WORK(disable_freq_invariance_work,
+ disable_freq_invariance_workfn);

- if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
- return 0;
+DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;

- if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
- return 0;
+static void scale_freq_tick(u64 acnt, u64 mcnt)
+{
+ u64 freq_scale;
+
+ if (!arch_scale_freq_invariant())
+ return;
+
+ if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
+ goto error;
+
+ if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
+ goto error;
+
+ freq_scale = div64_u64(acnt, mcnt);
+ if (!freq_scale)
+ goto error;
+
+ if (freq_scale > SCHED_CAPACITY_SCALE)
+ freq_scale = SCHED_CAPACITY_SCALE;
+
+ this_cpu_write(arch_freq_scale, freq_scale);
+ return;
+
+error:
+ pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
+ schedule_work(&disable_freq_invariance_work);
+}
+#else /* CONFIG_X86_64 */
+static inline void __init_freq_invariance(bool cppc_ready) { }
+static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
+#endif /* !CONFIG_X86_64 */
+
+void init_freq_invariance(bool secondary, bool cppc_ready)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+ return;
+
+ init_counter_refs();
+ if (!secondary)
+ __init_freq_invariance(cppc_ready);
+}
+
+void arch_scale_freq_tick(void)
+{
+ struct aperfmperf_sample *s = this_cpu_ptr(&samples);
+ u64 acnt, mcnt, aperf, mperf;

- if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
- return per_cpu(samples.khz, cpu);
+ if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+ return;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+ acnt = aperf - s->aperf;
+ mcnt = mperf - s->mperf;

- msleep(APERFMPERF_REFRESH_DELAY_MS);
- atomic_set(&s->scfpending, 1);
- smp_mb(); /* ->scfpending before smp_call_function_single(). */
- smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+ raw_write_seqcount_begin(&s->seq);
+ s->last_update = jiffies;
+ s->acnt = acnt;
+ s->mcnt = mcnt;
+ raw_write_seqcount_end(&s->seq);
+
+ s->aperf = aperf;
+ s->mperf = mperf;

- return per_cpu(samples.khz, cpu);
+ scale_freq_tick(acnt, mcnt);
}
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -84,7 +84,7 @@ static int show_cpuinfo(struct seq_file
seq_printf(m, "microcode\t: 0x%x\n", c->microcode);

if (cpu_has(c, X86_FEATURE_TSC)) {
- unsigned int freq = aperfmperf_get_khz(cpu);
+ unsigned int freq = arch_freq_get_on_cpu(cpu);

if (!freq)
freq = cpufreq_quick_get(cpu);
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -56,7 +56,6 @@
#include <linux/numa.h>
#include <linux/pgtable.h>
#include <linux/overflow.h>
-#include <linux/syscore_ops.h>

#include <asm/acpi.h>
#include <asm/desc.h>
@@ -1847,357 +1846,3 @@ void native_play_dead(void)
}

#endif
-
-#ifdef CONFIG_X86_64
-/*
- * APERF/MPERF frequency ratio computation.
- *
- * The scheduler wants to do frequency invariant accounting and needs a <1
- * ratio to account for the 'current' frequency, corresponding to
- * freq_curr / freq_max.
- *
- * Since the frequency freq_curr on x86 is controlled by micro-controller and
- * our P-state setting is little more than a request/hint, we need to observe
- * the effective frequency 'BusyMHz', i.e. the average frequency over a time
- * interval after discarding idle time. This is given by:
- *
- * BusyMHz = delta_APERF / delta_MPERF * freq_base
- *
- * where freq_base is the max non-turbo P-state.
- *
- * The freq_max term has to be set to a somewhat arbitrary value, because we
- * can't know which turbo states will be available at a given point in time:
- * it all depends on the thermal headroom of the entire package. We set it to
- * the turbo level with 4 cores active.
- *
- * Benchmarks show that's a good compromise between the 1C turbo ratio
- * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
- * which would ignore the entire turbo range (a conspicuous part, making
- * freq_curr/freq_max always maxed out).
- *
- * An exception to the heuristic above is the Atom uarch, where we choose the
- * highest turbo level for freq_max since Atom's are generally oriented towards
- * power efficiency.
- *
- * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
- * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
- */
-
-DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
-
-static DEFINE_PER_CPU(u64, arch_prev_aperf);
-static DEFINE_PER_CPU(u64, arch_prev_mperf);
-static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
-static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
-
-void arch_set_max_freq_ratio(bool turbo_disabled)
-{
- arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
- arch_turbo_freq_ratio;
-}
-EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
-
-static bool turbo_disabled(void)
-{
- u64 misc_en;
- int err;
-
- err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
- if (err)
- return false;
-
- return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
-}
-
-static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
-{
- int err;
-
- err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
- if (err)
- return false;
-
- err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
- if (err)
- return false;
-
- *base_freq = (*base_freq >> 16) & 0x3F; /* max P state */
- *turbo_freq = *turbo_freq & 0x3F; /* 1C turbo */
-
- return true;
-}
-
-#define X86_MATCH(model) \
- X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, \
- INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
-
-static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
- X86_MATCH(XEON_PHI_KNL),
- X86_MATCH(XEON_PHI_KNM),
- {}
-};
-
-static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
- X86_MATCH(SKYLAKE_X),
- {}
-};
-
-static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
- X86_MATCH(ATOM_GOLDMONT),
- X86_MATCH(ATOM_GOLDMONT_D),
- X86_MATCH(ATOM_GOLDMONT_PLUS),
- {}
-};
-
-static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
- int num_delta_fratio)
-{
- int fratio, delta_fratio, found;
- int err, i;
- u64 msr;
-
- err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
- if (err)
- return false;
-
- *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
-
- err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
- if (err)
- return false;
-
- fratio = (msr >> 8) & 0xFF;
- i = 16;
- found = 0;
- do {
- if (found >= num_delta_fratio) {
- *turbo_freq = fratio;
- return true;
- }
-
- delta_fratio = (msr >> (i + 5)) & 0x7;
-
- if (delta_fratio) {
- found += 1;
- fratio -= delta_fratio;
- }
-
- i += 8;
- } while (i < 64);
-
- return true;
-}
-
-static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
-{
- u64 ratios, counts;
- u32 group_size;
- int err, i;
-
- err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
- if (err)
- return false;
-
- *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
-
- err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
- if (err)
- return false;
-
- err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
- if (err)
- return false;
-
- for (i = 0; i < 64; i += 8) {
- group_size = (counts >> i) & 0xFF;
- if (group_size >= size) {
- *turbo_freq = (ratios >> i) & 0xFF;
- return true;
- }
- }
-
- return false;
-}
-
-static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
-{
- u64 msr;
- int err;
-
- err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
- if (err)
- return false;
-
- err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
- if (err)
- return false;
-
- *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
- *turbo_freq = (msr >> 24) & 0xFF; /* 4C turbo */
-
- /* The CPU may have less than 4 cores */
- if (!*turbo_freq)
- *turbo_freq = msr & 0xFF; /* 1C turbo */
-
- return true;
-}
-
-static bool intel_set_max_freq_ratio(void)
-{
- u64 base_freq, turbo_freq;
- u64 turbo_ratio;
-
- if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
- goto out;
-
- if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
- skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
- goto out;
-
- if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
- knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
- goto out;
-
- if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
- skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
- goto out;
-
- if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
- goto out;
-
- return false;
-
-out:
- /*
- * Some hypervisors advertise X86_FEATURE_APERFMPERF
- * but then fill all MSR's with zeroes.
- * Some CPUs have turbo boost but don't declare any turbo ratio
- * in MSR_TURBO_RATIO_LIMIT.
- */
- if (!base_freq || !turbo_freq) {
- pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
- return false;
- }
-
- turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
- if (!turbo_ratio) {
- pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
- return false;
- }
-
- arch_turbo_freq_ratio = turbo_ratio;
- arch_set_max_freq_ratio(turbo_disabled());
-
- return true;
-}
-
-static void init_counter_refs(void)
-{
- u64 aperf, mperf;
-
- rdmsrl(MSR_IA32_APERF, aperf);
- rdmsrl(MSR_IA32_MPERF, mperf);
-
- this_cpu_write(arch_prev_aperf, aperf);
- this_cpu_write(arch_prev_mperf, mperf);
-}
-
-#ifdef CONFIG_PM_SLEEP
-static struct syscore_ops freq_invariance_syscore_ops = {
- .resume = init_counter_refs,
-};
-
-static void register_freq_invariance_syscore_ops(void)
-{
- /* Bail out if registered already. */
- if (freq_invariance_syscore_ops.node.prev)
- return;
-
- register_syscore_ops(&freq_invariance_syscore_ops);
-}
-#else
-static inline void register_freq_invariance_syscore_ops(void) {}
-#endif
-
-void init_freq_invariance(bool secondary, bool cppc_ready)
-{
- bool ret = false;
-
- if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
- return;
-
- if (secondary) {
- if (static_branch_likely(&arch_scale_freq_key)) {
- init_counter_refs();
- }
- return;
- }
-
- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
- ret = intel_set_max_freq_ratio();
- else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
- if (!cppc_ready) {
- return;
- }
- ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
- }
-
- if (ret) {
- init_counter_refs();
- static_branch_enable(&arch_scale_freq_key);
- register_freq_invariance_syscore_ops();
- pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
- } else {
- pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
- }
-}
-
-static void disable_freq_invariance_workfn(struct work_struct *work)
-{
- static_branch_disable(&arch_scale_freq_key);
-}
-
-static DECLARE_WORK(disable_freq_invariance_work,
- disable_freq_invariance_workfn);
-
-DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
-
-void arch_scale_freq_tick(void)
-{
- u64 freq_scale;
- u64 aperf, mperf;
- u64 acnt, mcnt;
-
- if (!arch_scale_freq_invariant())
- return;
-
- rdmsrl(MSR_IA32_APERF, aperf);
- rdmsrl(MSR_IA32_MPERF, mperf);
-
- acnt = aperf - this_cpu_read(arch_prev_aperf);
- mcnt = mperf - this_cpu_read(arch_prev_mperf);
-
- this_cpu_write(arch_prev_aperf, aperf);
- this_cpu_write(arch_prev_mperf, mperf);
-
- if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
- goto error;
-
- if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
- goto error;
-
- freq_scale = div64_u64(acnt, mcnt);
- if (!freq_scale)
- goto error;
-
- if (freq_scale > SCHED_CAPACITY_SCALE)
- freq_scale = SCHED_CAPACITY_SCALE;
-
- this_cpu_write(arch_freq_scale, freq_scale);
- return;
-
-error:
- pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
- schedule_work(&disable_freq_invariance_work);
-}
-#endif /* CONFIG_X86_64 */
--- a/fs/proc/cpuinfo.c
+++ b/fs/proc/cpuinfo.c
@@ -5,14 +5,10 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

-__weak void arch_freq_prepare_all(void)
-{
-}
-
extern const struct seq_operations cpuinfo_op;
+
static int cpuinfo_open(struct inode *inode, struct file *file)
{
- arch_freq_prepare_all();
return seq_open(file, &cpuinfo_op);
}

--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1199,7 +1199,6 @@ static inline void sched_cpufreq_governo
struct cpufreq_governor *old_gov) { }
#endif

-extern void arch_freq_prepare_all(void);
extern unsigned int arch_freq_get_on_cpu(int cpu);

#ifndef arch_set_freq_scale
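
For readability, here is the reader side of the draft above lifted out of
the aperfmperf.c hunk, with comments added: /proc/cpuinfo (now via
arch_freq_get_on_cpu()) simply consumes whatever the last tick published,
so no IPIs and no msleep() are involved:

	unsigned int arch_freq_get_on_cpu(int cpu)
	{
		struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
		unsigned long last;
		unsigned int seq;
		u64 acnt, mcnt;

		if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
			return 0;

		/*
		 * Snapshot the deltas published by the tick on that CPU;
		 * retry if the writer updated them concurrently.
		 */
		do {
			seq = raw_read_seqcount_begin(&s->seq);
			last = s->last_update;
			acnt = s->acnt;
			mcnt = s->mcnt;
		} while (read_seqcount_retry(&s->seq, seq));

		/*
		 * Stale data (older than HZ/25, i.e. 40ms) or an invalid
		 * count means the CPU was idle or is NOHZ full: report 0.
		 */
		if (!mcnt || (jiffies - last) > (HZ / 25))
			return 0;

		return div64_u64((cpu_khz * acnt), mcnt);
	}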

2022-03-31 04:10:51

by Eric Dumazet

Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

On Wed, Mar 30, 2022 at 8:58 AM Thomas Gleixner <[email protected]> wrote:
>
> Eric,
>
> On Thu, Mar 10 2022 at 17:17, Eric Dumazet wrote:
> > Opening /proc/cpuinfo can incur a large latency on hosts with many CPUs,
>
> this is important because open() sends IPIs? I assume you meant
> reading. But even that is of questionable importance unless you care to
> provide some useful information why this matters.

In our case, we had soft lockups.
Backporting a recent patch from Paul helped a lot.

While doing the backport, I realized that the function was a bit silly,
because it was essentially looping X times sending an IPI, instead
of using the broadcast IPI facility, which is less expensive.

We _also_ removed the binary that was for some unknown reason scanning
/proc/cpuinfo (it was a copy of ethtool),
but I felt that we should fix the kernel to save headaches for others.

>
> AFAIK, there are _two_ cases why /proc/cpuinfo is read:
>
> 1) Retrieve information about the CPUs and the [mis]features
> supported by the kernel. This information is fully static and for
> that purpose exposing the nominal CPU frequency would be
> completely sufficient.
>
> 2) Retrieve the 'actual' CPU frequency because using the per CPU
> sysfs interface is slow. In the worst case do that in a loop.
>
> I consider #2 an abuse and in fact the exposure of aperf/mperf to that
> interface should have never happened at all. But sure, features....
>
> As a consequence we are tinkering with this nonsense and optimizing it
> to death without even thinking about whether this interface makes sense
> or not:
>
> > Using smp_call_function_many() allows for full parallelism,
> > and brings latency down to ~80 usec on a host with 256 CPUs.
>
> which I hate with a passion because that allows *unprivileged* user
> space to inject systemwide IPIs every 10ms just to read these counters,
> which provide no more than an estimate and are of no value for
> the only sane use case of /proc/cpuinfo, i.e. #1 above.

You do realize that before my patch, this was already happening?

My "optimization" simply replaces an open-coded loop of individual IPIs
with the broadcast IPI capability.

Are you saying we should remove broadcast IPIs and use loops
of IPIs, one CPU at a time?

> + return true;
> }
>
> -void arch_freq_prepare_all(void)
> +#define X86_MATCH(model) \
> + X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, \
> + INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
> +
> +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
> + X86_MATCH(XEON_PHI_KNL),
> + X86_MATCH(XEON_PHI_KNM),
> + {}
> +};
> +
> +static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
> + X86_MATCH(SKYLAKE_X),
> + {}
> +};
> +
> +static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
> + X86_MATCH(ATOM_GOLDMONT),
> + X86_MATCH(ATOM_GOLDMONT_D),
> + X86_MATCH(ATOM_GOLDMONT_PLUS),
> + {}
> +};
> +
> +static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
> + int num_delta_fratio)
> {
> - ktime_t now = ktime_get();
> - bool wait = false;
> - int cpu;
> + int fratio, delta_fratio, found;
> + int err, i;
> + u64 msr;
> +
> + err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> + if (err)
> + return false;
> +
> + *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
> +
> + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
> + if (err)
> + return false;
> +
> + fratio = (msr >> 8) & 0xFF;
> + i = 16;
> + found = 0;
> + do {
> + if (found >= num_delta_fratio) {
> + *turbo_freq = fratio;
> + return true;
> + }
> +
> + delta_fratio = (msr >> (i + 5)) & 0x7;
> +
> + if (delta_fratio) {
> + found += 1;
> + fratio -= delta_fratio;
> + }
>
> - if (!cpu_khz)
> - return;
> + i += 8;
> + } while (i < 64);
>
> - if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> + return true;
> +}
> +
> +static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
> +{
> + u64 ratios, counts;
> + u32 group_size;
> + int err, i;
> +
> + err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> + if (err)
> + return false;
> +
> + *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
> +
> + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
> + if (err)
> + return false;
> +
> + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
> + if (err)
> + return false;
> +
> + for (i = 0; i < 64; i += 8) {
> + group_size = (counts >> i) & 0xFF;
> + if (group_size >= size) {
> + *turbo_freq = (ratios >> i) & 0xFF;
> + return true;
> + }
> + }
> +
> + return false;
> +}
> +
> +static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> +{
> + u64 msr;
> + int err;
> +
> + err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> + if (err)
> + return false;
> +
> + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
> + if (err)
> + return false;
> +
> + *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
> + *turbo_freq = (msr >> 24) & 0xFF; /* 4C turbo */
> +
> + /* The CPU may have less than 4 cores */
> + if (!*turbo_freq)
> + *turbo_freq = msr & 0xFF; /* 1C turbo */
> +
> + return true;
> +}
> +
> +static bool intel_set_max_freq_ratio(void)
> +{
> + u64 base_freq, turbo_freq;
> + u64 turbo_ratio;
> +
> + if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
> + goto out;
> +
> + if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
> + skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
> + goto out;
> +
> + if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
> + knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
> + goto out;
> +
> + if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
> + skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
> + goto out;
> +
> + if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
> + goto out;
> +
> + return false;
> +
> +out:
> + /*
> + * Some hypervisors advertise X86_FEATURE_APERFMPERF
> + * but then fill all MSR's with zeroes.
> + * Some CPUs have turbo boost but don't declare any turbo ratio
> + * in MSR_TURBO_RATIO_LIMIT.
> + */
> + if (!base_freq || !turbo_freq) {
> + pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
> + return false;
> + }
> +
> + turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
> + if (!turbo_ratio) {
> + pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
> + return false;
> + }
> +
> + arch_turbo_freq_ratio = turbo_ratio;
> + arch_set_max_freq_ratio(turbo_disabled());
> +
> + return true;
> +}
> +
> +#ifdef CONFIG_PM_SLEEP
> +static struct syscore_ops freq_invariance_syscore_ops = {
> + .resume = init_counter_refs,
> +};
> +
> +static void register_freq_invariance_syscore_ops(void)
> +{
> + /* Bail out if registered already. */
> + if (freq_invariance_syscore_ops.node.prev)
> return;
>
> - for_each_online_cpu(cpu) {
> - if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
> - continue;
> - if (rcu_is_idle_cpu(cpu))
> - continue; /* Idle CPUs are completely uninteresting. */
> - if (!aperfmperf_snapshot_cpu(cpu, now, false))
> - wait = true;
> + register_syscore_ops(&freq_invariance_syscore_ops);
> +}
> +#else
> +static inline void register_freq_invariance_syscore_ops(void) {}
> +#endif
> +
> +static void __init_freq_invariance(bool cppc_ready)
> +{
> + bool ret = false;
> +
> + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> + ret = intel_set_max_freq_ratio();
> + else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
> + if (!cppc_ready)
> + return;
> + ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
> }
>
> - if (wait)
> - msleep(APERFMPERF_REFRESH_DELAY_MS);
> + if (ret) {
> + static_branch_enable(&arch_scale_freq_key);
> + register_freq_invariance_syscore_ops();
> + pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
> + } else {
> + pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
> + }
> }
>
> -unsigned int arch_freq_get_on_cpu(int cpu)
> +static void disable_freq_invariance_workfn(struct work_struct *work)
> {
> - struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu);
> + static_branch_disable(&arch_scale_freq_key);
> +}
>
> - if (!cpu_khz)
> - return 0;
> +static DECLARE_WORK(disable_freq_invariance_work,
> + disable_freq_invariance_workfn);
>
> - if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> - return 0;
> +DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
>
> - if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
> - return 0;
> +static void scale_freq_tick(u64 acnt, u64 mcnt)
> +{
> + u64 freq_scale;
> +
> + if (!arch_scale_freq_invariant())
> + return;
> +
> + if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
> + goto error;
> +
> + if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
> + goto error;
> +
> + freq_scale = div64_u64(acnt, mcnt);
> + if (!freq_scale)
> + goto error;
> +
> + if (freq_scale > SCHED_CAPACITY_SCALE)
> + freq_scale = SCHED_CAPACITY_SCALE;
> +
> + this_cpu_write(arch_freq_scale, freq_scale);
> + return;
> +
> +error:
> + pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
> + schedule_work(&disable_freq_invariance_work);
> +}
> +#else /* !CONFIG_X86_64 */
> +static inline void __init_freq_invariance(bool cppc_ready) { }
> +static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
> +#endif /* CONFIG_X86_64 */
> +
> +void init_freq_invariance(bool secondary, bool cppc_ready)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> + return;
> +
> + init_counter_refs();
> + if (!secondary)
> + __init_freq_invariance(cppc_ready);
> +}
> +
> +void arch_scale_freq_tick(void)
> +{
> + struct aperfmperf_sample *s = this_cpu_ptr(&samples);
> + u64 acnt, mcnt, aperf, mperf;
>
> - if (aperfmperf_snapshot_cpu(cpu, ktime_get(), true))
> - return per_cpu(samples.khz, cpu);
> + if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> + return;
> +
> + rdmsrl(MSR_IA32_APERF, aperf);
> + rdmsrl(MSR_IA32_MPERF, mperf);
> + acnt = aperf - s->aperf;
> + mcnt = mperf - s->mperf;
>
> - msleep(APERFMPERF_REFRESH_DELAY_MS);
> - atomic_set(&s->scfpending, 1);
> - smp_mb(); /* ->scfpending before smp_call_function_single(). */
> - smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
> + raw_write_seqcount_begin(&s->seq);
> + s->last_update = jiffies;
> + s->acnt = acnt;
> + s->mcnt = mcnt;
> + raw_write_seqcount_end(&s->seq);
> +
> + s->aperf = aperf;
> + s->mperf = mperf;
>
> - return per_cpu(samples.khz, cpu);
> + scale_freq_tick(acnt, mcnt);
> }
> --- a/arch/x86/kernel/cpu/proc.c
> +++ b/arch/x86/kernel/cpu/proc.c
> @@ -84,7 +84,7 @@ static int show_cpuinfo(struct seq_file
> seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
>
> if (cpu_has(c, X86_FEATURE_TSC)) {
> - unsigned int freq = aperfmperf_get_khz(cpu);
> + unsigned int freq = arch_freq_get_on_cpu(cpu);
>
> if (!freq)
> freq = cpufreq_quick_get(cpu);
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -56,7 +56,6 @@
> #include <linux/numa.h>
> #include <linux/pgtable.h>
> #include <linux/overflow.h>
> -#include <linux/syscore_ops.h>
>
> #include <asm/acpi.h>
> #include <asm/desc.h>
> @@ -1847,357 +1846,3 @@ void native_play_dead(void)
> }
>
> #endif
> -
> -#ifdef CONFIG_X86_64
> -/*
> - * APERF/MPERF frequency ratio computation.
> - *
> - * The scheduler wants to do frequency invariant accounting and needs a <1
> - * ratio to account for the 'current' frequency, corresponding to
> - * freq_curr / freq_max.
> - *
> - * Since the frequency freq_curr on x86 is controlled by micro-controller and
> - * our P-state setting is little more than a request/hint, we need to observe
> - * the effective frequency 'BusyMHz', i.e. the average frequency over a time
> - * interval after discarding idle time. This is given by:
> - *
> - * BusyMHz = delta_APERF / delta_MPERF * freq_base
> - *
> - * where freq_base is the max non-turbo P-state.
> - *
> - * The freq_max term has to be set to a somewhat arbitrary value, because we
> - * can't know which turbo states will be available at a given point in time:
> - * it all depends on the thermal headroom of the entire package. We set it to
> - * the turbo level with 4 cores active.
> - *
> - * Benchmarks show that's a good compromise between the 1C turbo ratio
> - * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
> - * which would ignore the entire turbo range (a conspicuous part, making
> - * freq_curr/freq_max always maxed out).
> - *
> - * An exception to the heuristic above is the Atom uarch, where we choose the
> - * highest turbo level for freq_max since Atom's are generally oriented towards
> - * power efficiency.
> - *
> - * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
> - * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
> - */
> -
> -DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
> -
> -static DEFINE_PER_CPU(u64, arch_prev_aperf);
> -static DEFINE_PER_CPU(u64, arch_prev_mperf);
> -static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
> -static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
> -
> -void arch_set_max_freq_ratio(bool turbo_disabled)
> -{
> - arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
> - arch_turbo_freq_ratio;
> -}
> -EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
> -
> -static bool turbo_disabled(void)
> -{
> - u64 misc_en;
> - int err;
> -
> - err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
> - if (err)
> - return false;
> -
> - return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
> -}
> -
> -static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> -{
> - int err;
> -
> - err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
> - if (err)
> - return false;
> -
> - err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
> - if (err)
> - return false;
> -
> - *base_freq = (*base_freq >> 16) & 0x3F; /* max P state */
> - *turbo_freq = *turbo_freq & 0x3F; /* 1C turbo */
> -
> - return true;
> -}
> -
> -#define X86_MATCH(model) \
> - X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, \
> - INTEL_FAM6_##model, X86_FEATURE_APERFMPERF, NULL)
> -
> -static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
> - X86_MATCH(XEON_PHI_KNL),
> - X86_MATCH(XEON_PHI_KNM),
> - {}
> -};
> -
> -static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
> - X86_MATCH(SKYLAKE_X),
> - {}
> -};
> -
> -static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
> - X86_MATCH(ATOM_GOLDMONT),
> - X86_MATCH(ATOM_GOLDMONT_D),
> - X86_MATCH(ATOM_GOLDMONT_PLUS),
> - {}
> -};
> -
> -static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
> - int num_delta_fratio)
> -{
> - int fratio, delta_fratio, found;
> - int err, i;
> - u64 msr;
> -
> - err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> - if (err)
> - return false;
> -
> - *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
> -
> - err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
> - if (err)
> - return false;
> -
> - fratio = (msr >> 8) & 0xFF;
> - i = 16;
> - found = 0;
> - do {
> - if (found >= num_delta_fratio) {
> - *turbo_freq = fratio;
> - return true;
> - }
> -
> - delta_fratio = (msr >> (i + 5)) & 0x7;
> -
> - if (delta_fratio) {
> - found += 1;
> - fratio -= delta_fratio;
> - }
> -
> - i += 8;
> - } while (i < 64);
> -
> - return true;
> -}
> -
> -static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
> -{
> - u64 ratios, counts;
> - u32 group_size;
> - int err, i;
> -
> - err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> - if (err)
> - return false;
> -
> - *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
> -
> - err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
> - if (err)
> - return false;
> -
> - err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
> - if (err)
> - return false;
> -
> - for (i = 0; i < 64; i += 8) {
> - group_size = (counts >> i) & 0xFF;
> - if (group_size >= size) {
> - *turbo_freq = (ratios >> i) & 0xFF;
> - return true;
> - }
> - }
> -
> - return false;
> -}
> -
> -static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
> -{
> - u64 msr;
> - int err;
> -
> - err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
> - if (err)
> - return false;
> -
> - err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
> - if (err)
> - return false;
> -
> - *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
> - *turbo_freq = (msr >> 24) & 0xFF; /* 4C turbo */
> -
> - /* The CPU may have less than 4 cores */
> - if (!*turbo_freq)
> - *turbo_freq = msr & 0xFF; /* 1C turbo */
> -
> - return true;
> -}
> -
> -static bool intel_set_max_freq_ratio(void)
> -{
> - u64 base_freq, turbo_freq;
> - u64 turbo_ratio;
> -
> - if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
> - goto out;
> -
> - if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
> - skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
> - goto out;
> -
> - if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
> - knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
> - goto out;
> -
> - if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
> - skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
> - goto out;
> -
> - if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
> - goto out;
> -
> - return false;
> -
> -out:
> - /*
> - * Some hypervisors advertise X86_FEATURE_APERFMPERF
> - * but then fill all MSR's with zeroes.
> - * Some CPUs have turbo boost but don't declare any turbo ratio
> - * in MSR_TURBO_RATIO_LIMIT.
> - */
> - if (!base_freq || !turbo_freq) {
> - pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
> - return false;
> - }
> -
> - turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
> - if (!turbo_ratio) {
> - pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
> - return false;
> - }
> -
> - arch_turbo_freq_ratio = turbo_ratio;
> - arch_set_max_freq_ratio(turbo_disabled());
> -
> - return true;
> -}
> -
> -static void init_counter_refs(void)
> -{
> - u64 aperf, mperf;
> -
> - rdmsrl(MSR_IA32_APERF, aperf);
> - rdmsrl(MSR_IA32_MPERF, mperf);
> -
> - this_cpu_write(arch_prev_aperf, aperf);
> - this_cpu_write(arch_prev_mperf, mperf);
> -}
> -
> -#ifdef CONFIG_PM_SLEEP
> -static struct syscore_ops freq_invariance_syscore_ops = {
> - .resume = init_counter_refs,
> -};
> -
> -static void register_freq_invariance_syscore_ops(void)
> -{
> - /* Bail out if registered already. */
> - if (freq_invariance_syscore_ops.node.prev)
> - return;
> -
> - register_syscore_ops(&freq_invariance_syscore_ops);
> -}
> -#else
> -static inline void register_freq_invariance_syscore_ops(void) {}
> -#endif
> -
> -void init_freq_invariance(bool secondary, bool cppc_ready)
> -{
> - bool ret = false;
> -
> - if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> - return;
> -
> - if (secondary) {
> - if (static_branch_likely(&arch_scale_freq_key)) {
> - init_counter_refs();
> - }
> - return;
> - }
> -
> - if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> - ret = intel_set_max_freq_ratio();
> - else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
> - if (!cppc_ready) {
> - return;
> - }
> - ret = amd_set_max_freq_ratio(&arch_turbo_freq_ratio);
> - }
> -
> - if (ret) {
> - init_counter_refs();
> - static_branch_enable(&arch_scale_freq_key);
> - register_freq_invariance_syscore_ops();
> - pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
> - } else {
> - pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
> - }
> -}
> -
> -static void disable_freq_invariance_workfn(struct work_struct *work)
> -{
> - static_branch_disable(&arch_scale_freq_key);
> -}
> -
> -static DECLARE_WORK(disable_freq_invariance_work,
> - disable_freq_invariance_workfn);
> -
> -DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
> -
> -void arch_scale_freq_tick(void)
> -{
> - u64 freq_scale;
> - u64 aperf, mperf;
> - u64 acnt, mcnt;
> -
> - if (!arch_scale_freq_invariant())
> - return;
> -
> - rdmsrl(MSR_IA32_APERF, aperf);
> - rdmsrl(MSR_IA32_MPERF, mperf);
> -
> - acnt = aperf - this_cpu_read(arch_prev_aperf);
> - mcnt = mperf - this_cpu_read(arch_prev_mperf);
> -
> - this_cpu_write(arch_prev_aperf, aperf);
> - this_cpu_write(arch_prev_mperf, mperf);
> -
> - if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
> - goto error;
> -
> - if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
> - goto error;
> -
> - freq_scale = div64_u64(acnt, mcnt);
> - if (!freq_scale)
> - goto error;
> -
> - if (freq_scale > SCHED_CAPACITY_SCALE)
> - freq_scale = SCHED_CAPACITY_SCALE;
> -
> - this_cpu_write(arch_freq_scale, freq_scale);
> - return;
> -
> -error:
> - pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
> - schedule_work(&disable_freq_invariance_work);
> -}
> -#endif /* CONFIG_X86_64 */
> --- a/fs/proc/cpuinfo.c
> +++ b/fs/proc/cpuinfo.c
> @@ -5,14 +5,10 @@
> #include <linux/proc_fs.h>
> #include <linux/seq_file.h>
>
> -__weak void arch_freq_prepare_all(void)
> -{
> -}
> -
> extern const struct seq_operations cpuinfo_op;
> +
> static int cpuinfo_open(struct inode *inode, struct file *file)
> {
> - arch_freq_prepare_all();
> return seq_open(file, &cpuinfo_op);
> }
>
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -1199,7 +1199,6 @@ static inline void sched_cpufreq_governo
> struct cpufreq_governor *old_gov) { }
> #endif
>
> -extern void arch_freq_prepare_all(void);
> extern unsigned int arch_freq_get_on_cpu(int cpu);
>
> #ifndef arch_set_freq_scale
>
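
The key idea in the patch above is that arch_scale_freq_tick() snapshots
APERF/MPERF on every scheduler tick and publishes the deltas in a per-CPU
sample protected by a seqcount, so arch_freq_get_on_cpu() can compute the
kHz value from already-cached data and never has to IPI the target CPU.
A rough, standalone sketch of that publish/read pattern is below; it is
plain userspace C11 with illustrative names and numbers, not kernel API:

/*
 * Rough userspace model of the sampling scheme in the patch above.
 * The "tick" publishes counter deltas under a sequence counter;
 * readers retry until they observe a consistent, even sequence and
 * never interrupt the sampled CPU.  All names are illustrative.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct sample {
	atomic_uint seq;        /* even = stable, odd = update in flight */
	_Atomic uint64_t acnt;  /* APERF-like delta from the last tick   */
	_Atomic uint64_t mcnt;  /* MPERF-like delta from the last tick   */
};

static struct sample cpu_sample;          /* one instance per CPU in the real code */
static const uint64_t base_khz = 2000000; /* assumed base frequency (2 GHz)        */

/* Writer side: what the per-CPU tick handler would do. */
static void tick_update(struct sample *s, uint64_t acnt, uint64_t mcnt)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed); /* odd  */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->acnt, acnt, memory_order_relaxed);
	atomic_store_explicit(&s->mcnt, mcnt, memory_order_relaxed);
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release); /* even */
}

/* Reader side: what a /proc/cpuinfo read would do, with no IPI at all. */
static uint64_t read_khz(struct sample *s)
{
	uint64_t acnt, mcnt;
	unsigned int s1, s2;

	do {
		s1 = atomic_load_explicit(&s->seq, memory_order_acquire);
		acnt = atomic_load_explicit(&s->acnt, memory_order_relaxed);
		mcnt = atomic_load_explicit(&s->mcnt, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return mcnt ? base_khz * acnt / mcnt : 0;
}

int main(void)
{
	tick_update(&cpu_sample, 1500, 1000); /* CPU ran at 1.5x base */
	printf("%llu kHz\n", (unsigned long long)read_khz(&cpu_sample));
	return 0;
}

With the illustrative numbers above (acnt = 1500, mcnt = 1000, assumed base
of 2000000 kHz) the reader reports 3000000 kHz, i.e. 1.5x the base frequency
over the sampled interval. The same acnt/mcnt pair feeds scale_freq_tick(),
which turns it into the scheduler's freq_scale, clamped to SCHED_CAPACITY_SCALE.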

2022-03-31 04:19:25

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

On Wed, Mar 30, 2022 at 10:02 AM Thomas Gleixner <[email protected]> wrote:
>
> On Wed, Mar 30 2022 at 09:51, Eric Dumazet wrote:
> > On Wed, Mar 30, 2022 at 8:58 AM Thomas Gleixner <[email protected]> wrote:
> >> which I hate with a passion because that allows *unprivileged* user
> >> space to inject systemwide IPIs every 10ms just to read these counters,
> >> which provide no more than an estimate and are of no value for
> >> the only sane use case of /proc/cpuinfo, i.e. #1 above.
> >
> > You do realize that before my patch, this was already happening?
> >
> > My "optimization" simply replaces an open-coded loop of individual IPIs
> > with use of the broadcast IPI capability.
> >
> > Are you saying we should remove the IPI broadcast and use loops
> > of IPIs, one CPU at a time?
>
> I'd rather have no IPIs at all...

Can you send an actual patch, with a changelog, then?

All I have seen so far is kind of a rant about my patch, which IMO was fine.

Sorry.

2022-03-31 04:25:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

On Wed, Mar 30 2022 at 09:51, Eric Dumazet wrote:
> On Wed, Mar 30, 2022 at 8:58 AM Thomas Gleixner <[email protected]> wrote:
>> which I hate with a passion because that allows *unprivileged* user
>> space to inject systemwide IPIs every 10ms just to read these counters,
>> which provide no more than an estimate and are of no value for
>> the only sane use case of /proc/cpuinfo, i.e. #1 above.
>
> You do realize that before my patch, this was already happening?
>
> My "optimization" simply replaces an open-coded loop of individual IPIs
> with use of the broadcast IPI capability.
>
> Are you saying we should remove the IPI broadcast and use loops
> of IPIs, one CPU at a time?

I'd rather have no IPIs at all...

2022-03-31 04:54:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] x86/cpu: use smp_call_function_many() in arch_freq_prepare_all()

On Wed, Mar 30 2022 at 10:05, Eric Dumazet wrote:
> On Wed, Mar 30, 2022 at 10:02 AM Thomas Gleixner <[email protected]> wrote:
>> On Wed, Mar 30 2022 at 09:51, Eric Dumazet wrote:
>> > On Wed, Mar 30, 2022 at 8:58 AM Thomas Gleixner <[email protected]> wrote:
>> >> which I hate with a passion because that allows *unprivileged* user
>> >> space to inject systemwide IPIs every 10ms just to read these counters,
>> >> which provide no more than an estimate and are of no value for
>> >> the only sane use case of /proc/cpuinfo, i.e. #1 above.
>> >
>> > You do realize that before my patch, this was already happening?
>> >
>> > My "optimization" simply replaces an open-coded loop of individual IPIs
>> > with use of the broadcast IPI capability.
>> >
>> > Are you saying we should remove the IPI broadcast and use loops
>> > of IPIs, one CPU at a time?
>>
>> I'd rather have no IPIs at all...
>
> Can you send an actual patch, with a changelog, then?

I can polish up the patch I sent, split it up and add changelogs. Sure.

Thanks,

tglx