2022-04-16 00:52:47

by Thomas Gleixner

[permalink] [raw]
Subject: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

APERF/MPERF is utilized in two ways:

1) Ad hoc readout of CPU frequency which requires IPIs

2) Frequency scale calculation for frequency invariant scheduling which
reads APERF/MPERF on every tick.

These are completely independent code parts. Eric observed long latencies
when reading /proc/cpuinfo which reads out CPU frequency via #1 and
proposed to replace the per CPU single IPI with a broadcast IPI.

While this makes the latency smaller, it is not necessary at all because #2
samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
which are excluded from IPI already.

It could be argued that not all APERF/MPERF capable systems have the
required BIOS information to enable frequency invariance support, but in
practice most of them do. So the APERF/MPERF sampling can be made
unconditional and just the frequency scale calculation for the scheduler
excluded.

The following series consolidates that.

Thanks,

tglx
---
arch/x86/include/asm/cpu.h | 2
arch/x86/include/asm/topology.h | 17 -
arch/x86/kernel/acpi/cppc.c | 28 --
arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++--------
arch/x86/kernel/cpu/proc.c | 2
arch/x86/kernel/smpboot.c | 358 -----------------------------
fs/proc/cpuinfo.c | 6
include/linux/cpufreq.h | 1
8 files changed, 405 insertions(+), 483 deletions(-)



2022-04-20 00:22:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

Eric,

On Tue, Apr 19 2022 at 08:51, Eric Dumazet wrote:
> On Fri, Apr 15, 2022 at 12:19 PM Thomas Gleixner <[email protected]> wrote:
>> It could be argued that not all APERF/MPERF capable systems have the
>> required BIOS information to enable frequency invariance support, but in
>> practice most of them do. So the APERF/MPERF sampling can be made
>> unconditional and just the frequency scale calculation for the scheduler
>> excluded.
>>
>> The following series consolidates that.
>>
>
> Thanks a lot for working on that Thomas.
>
> I am not sure I will be able to backport this to a Google prodkernel,
> as I guess there will be many merge conflicts.

:)

> Do you have by any chance this work available in a git branch ?

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/amperf

Thanks,

tglx

2022-04-20 08:54:02

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

On Fri, Apr 15, 2022 at 12:19 PM Thomas Gleixner <[email protected]> wrote:
>
> APERF/MPERF is utilized in two ways:
>
> 1) Ad hoc readout of CPU frequency which requires IPIs
>
> 2) Frequency scale calculation for frequency invariant scheduling which
> reads APERF/MPERF on every tick.
>
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
>
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
>
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
>
> The following series consolidates that.
>

Thanks a lot for working on that Thomas.

I am not sure I will be able to backport this to a Google prodkernel,
as I guess there will be many merge conflicts.

Do you have by any chance this work available in a git branch ?

Thanks.



> Thanks,
>
> tglx
> ---
> arch/x86/include/asm/cpu.h | 2
> arch/x86/include/asm/topology.h | 17 -
> arch/x86/kernel/acpi/cppc.c | 28 --
> arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++--------
> arch/x86/kernel/cpu/proc.c | 2
> arch/x86/kernel/smpboot.c | 358 -----------------------------
> fs/proc/cpuinfo.c | 6
> include/linux/cpufreq.h | 1
> 8 files changed, 405 insertions(+), 483 deletions(-)
>
>

2022-04-21 08:10:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote:
> ---
> arch/x86/include/asm/cpu.h | 2
> arch/x86/include/asm/topology.h | 17 -
> arch/x86/kernel/acpi/cppc.c | 28 --
> arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++--------
> arch/x86/kernel/cpu/proc.c | 2
> arch/x86/kernel/smpboot.c | 358 -----------------------------
> fs/proc/cpuinfo.c | 6
> include/linux/cpufreq.h | 1
> 8 files changed, 405 insertions(+), 483 deletions(-)


Acked-by: Peter Zijlstra (Intel) <[email protected]>

2022-04-21 09:47:16

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <[email protected]> wrote:
>
> Hi Thomas,
>
> On 2022.04.15 12:20 Thomas Gleixner wrote:
>
> > APERF/MPERF is utilized in two ways:
> >
> > 1) Ad hoc readout of CPU frequency which requires IPIs
> >
> > 2) Frequency scale calculation for frequency invariant scheduling which
> > reads APERF/MPERF on every tick.
> >
> > These are completely independent code parts. Eric observed long latencies
> > when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> > proposed to replace the per CPU single IPI with a broadcast IPI.
> >
> > While this makes the latency smaller, it is not necessary at all because #2
> > samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> > which are excluded from IPI already.
> >
> > It could be argued that not all APERF/MPERF capable systems have the
> > required BIOS information to enable frequency invariance support, but in
> > practice most of them do. So the APERF/MPERF sampling can be made
> > unconditional and just the frequency scale calculation for the scheduler
> > excluded.
> >
> > The following series consolidates that.
>
> I have used this patch set with the acpi-cpufreq, intel_cpufreq (passive),
> and intel_pstate (active) CPU frequency scaling drivers and various
> governors. Additionally, with HWP both enabled and disabled.
>
> For intel_pstate (active), both HWP enabled or disabled, the behaviour
> of scaling_cur_freq is inconsistent with prior to this patch set and other
> scaling driver governor combinations.
>
> Note there is no issue with " grep MHz /proc/cpuinfo" for any
> combination.
>
> Examples:
>
> No-HWP:
>
> active/powersave:
> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

That's because after the changes in this series scaling_cur_freq
returns 0 if the given CPU is idle.

I guess it could return the last known result, but that wouldn't be
more meaningful.

2022-04-21 23:43:04

by Doug Smythies

[permalink] [raw]
Subject: RE: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

Hi Thomas,

On 2022.04.15 12:20 Thomas Gleixner wrote:

> APERF/MPERF is utilized in two ways:
>
> 1) Ad hoc readout of CPU frequency which requires IPIs
>
> 2) Frequency scale calculation for frequency invariant scheduling which
> reads APERF/MPERF on every tick.
>
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
>
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
>
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
>
> The following series consolidates that.

I have used this patch set with the acpi-cpufreq, intel_cpufreq (passive),
and intel_pstate (active) CPU frequency scaling drivers and various
governors. Additionally, with HWP both enabled and disabled.

For intel_pstate (active), both HWP enabled or disabled, the behaviour
of scaling_cur_freq is inconsistent with prior to this patch set and other
scaling driver governor combinations.

Note there is no issue with " grep MHz /proc/cpuinfo" for any
combination.

Examples:

No-HWP:

active/powersave:
doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

active/performance:
doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

HWP:

active/powersave:
doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:799993
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800069
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800131
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:799844

active/performance:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:4800186
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:4800016
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

Other configurations:
intel_cpufreq /schedutil (no HWP), for example:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz

> Thanks,
>
> tglx
> ---
> arch/x86/include/asm/cpu.h | 2
> arch/x86/include/asm/topology.h | 17 -
> arch/x86/kernel/acpi/cppc.c | 28 --
> arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++--------
> arch/x86/kernel/cpu/proc.c | 2
> arch/x86/kernel/smpboot.c | 358 -----------------------------
> fs/proc/cpuinfo.c | 6
> include/linux/cpufreq.h | 1
> 8 files changed, 405 insertions(+), 483 deletions(-)


2022-04-22 00:39:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote:
> APERF/MPERF is utilized in two ways:
>
> 1) Ad hoc readout of CPU frequency which requires IPIs
>
> 2) Frequency scale calculation for frequency invariant scheduling which
> reads APERF/MPERF on every tick.
>
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
>
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
>
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
>
> The following series consolidates that.

Acked-by: Paul E. McKenney <[email protected]>

> Thanks,
>
> tglx
> ---
> arch/x86/include/asm/cpu.h | 2
> arch/x86/include/asm/topology.h | 17 -
> arch/x86/kernel/acpi/cppc.c | 28 --
> arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++--------
> arch/x86/kernel/cpu/proc.c | 2
> arch/x86/kernel/smpboot.c | 358 -----------------------------
> fs/proc/cpuinfo.c | 6
> include/linux/cpufreq.h | 1
> 8 files changed, 405 insertions(+), 483 deletions(-)
>
>

2022-04-22 09:49:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote:
> On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <[email protected]> wrote:
>> For intel_pstate (active), both HWP enabled or disabled, the behaviour
>> of scaling_cur_freq is inconsistent with prior to this patch set and other
>> scaling driver governor combinations.
>>
>> Note there is no issue with " grep MHz /proc/cpuinfo" for any
>> combination.
>>
>> Examples:
>>
>> No-HWP:
>>
>> active/powersave:
>> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
>> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
>> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
>> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0
>
> That's because after the changes in this series scaling_cur_freq
> returns 0 if the given CPU is idle.

Which is sensible IMO as there is really no point in waking an idle CPU
just to read those MSRs, then wait 20ms wake it up again to read those
MSRs again.

> I guess it could return the last known result, but that wouldn't be
> more meaningful.

Right.

Thanks,

tglx

2022-04-22 10:37:30

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

On Tue, Apr 19, 2022 at 1:39 PM Thomas Gleixner <[email protected]> wrote:
>
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/amperf
>

Excellent, things look fine to me.

Before:
# grep MHz /proc/cpuinfo | sort | uniq -c
255 cpu MHz : 2249.998
1 cpu MHz : 3297.719

# grep MHz /proc/cpuinfo | sort|uniq -c
1 cpu MHz : 1590.400
1 cpu MHz : 1684.772
1 cpu MHz : 1693.890
1 cpu MHz : 1780.072
1 cpu MHz : 1784.513
1 cpu MHz : 1831.106
1 cpu MHz : 1880.344
1 cpu MHz : 1953.481
1 cpu MHz : 1980.636
1 cpu MHz : 2013.620
1 cpu MHz : 2219.617
240 cpu MHz : 2250.173
1 cpu MHz : 3292.206
1 cpu MHz : 3294.956
1 cpu MHz : 3297.653
1 cpu MHz : 3298.385
1 cpu MHz : 3300.197

Tested-by: Eric Dumazet <[email protected]>

2022-04-22 17:53:52

by Doug Smythies

[permalink] [raw]
Subject: RE: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

Hi Thomas, Rafael,

Thank you for your replies.

On 2022.04.19 14:11 Thomas Gleixner wrote:
> On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote:
>> On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <[email protected]> wrote:
>>> For intel_pstate (active), both HWP enabled or disabled, the behaviour
>>> of scaling_cur_freq is inconsistent with prior to this patch set and other
>>> scaling driver governor combinations.
>>>
>>> Note there is no issue with " grep MHz /proc/cpuinfo" for any
>>> combination.
>>>
>>> Examples:
>>>
>>> No-HWP:
>>>
>>> active/powersave:
>>> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
>>> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
>>> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
>>> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0
>>
>> That's because after the changes in this series scaling_cur_freq
>> returns 0 if the given CPU is idle.
>
> Which is sensible IMO as there is really no point in waking an idle CPU
> just to read those MSRs, then wait 20ms wake it up again to read those
> MSRs again.

I totally agree.
It is the inconsistency for what is displayed as a function of driver/governor
that is my concern.

>
>> I guess it could return the last known result, but that wouldn't be
>> more meaningful.
>
> Right.

How about something like this, which I realize might break something else,
but just to demonstrate:

doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 80f535cc8a75..a161e75794cd 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -710,7 +710,7 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
else
- ret = sprintf(buf, "%u\n", policy->cur);
+ ret = sprintf(buf, "%u\n", freq);
return ret;
}

Note: I left the other 0 return condition, because I do not know what uses it.

Which gives:

acpi-cpufreq/schedutil
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:4100723

intel_pstate/powersave (no-HWP)
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800295
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800015

intel_cpufreq/schedutil (no-HWP)
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:1971265
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2785446
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

Which I suggest is more consistent.

Note: because it was deleted from this thread,
and just for reference, I'll repost the previous
intel_cpufreq/schedutil (no-HWP) output:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000

... Doug


2022-04-25 22:25:09

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote:
> On 2022.04.19 14:11 Thomas Gleixner wrote:
>>> That's because after the changes in this series scaling_cur_freq
>>> returns 0 if the given CPU is idle.
>>
>> Which is sensible IMO as there is really no point in waking an idle CPU
>> just to read those MSRs, then wait 20ms wake it up again to read those
>> MSRs again.
>
> I totally agree.
> It is the inconsistency for what is displayed as a function of driver/governor
> that is my concern.

Raphael suggested to move the show_cpuinfo() logic into the a/mperf
code. See below.

Thanks,

tglx
---
Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
From: Thomas Gleixner <[email protected]>
Date: Mon, 25 Apr 2022 15:19:29 +0200

Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return
0 when the last sample was too long ago.

show_cpuinfo() has a fallback to cpufreq_quick_get() and if that fails to
return cpu_khz, but the readout code for the per CPU scaling frequency in
sysfs does not.

Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same
when reading /proc/cpuinfo and /sys/..../cur_scaling_freq.

Suggested-by: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/cpu/aperfmperf.c | 10 +++++++---
arch/x86/kernel/cpu/proc.c | 7 +------
2 files changed, 8 insertions(+), 9 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -405,12 +405,12 @@ void arch_scale_freq_tick(void)
unsigned int arch_freq_get_on_cpu(int cpu)
{
struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+ unsigned int seq, freq;
unsigned long last;
- unsigned int seq;
u64 acnt, mcnt;

if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
- return 0;
+ goto fallback;

do {
seq = raw_read_seqcount_begin(&s->seq);
@@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cp
* which covers idle and NOHZ full CPUs.
*/
if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
- return 0;
+ goto fallback;

return div64_u64((cpu_khz * acnt), mcnt);
+
+fallback:
+ freq = cpufreq_quick_get(cpu);
+ return freq ? freq : cpu_khz;
}

static int __init bp_init_aperfmperf(void)
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file
if (cpu_has(c, X86_FEATURE_TSC)) {
unsigned int freq = arch_freq_get_on_cpu(cpu);

- if (!freq)
- freq = cpufreq_quick_get(cpu);
- if (!freq)
- freq = cpu_khz;
- seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
- freq / 1000, (freq % 1000));
+ seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
}

/* Cache size */

2022-04-26 10:35:08

by Doug Smythies

[permalink] [raw]
Subject: Re: [patch 00/10] x86/cpu: Consolidate APERF/MPERF code

On Mon, Apr 25, 2022 at 8:45 AM Thomas Gleixner <[email protected]> wrote:
> On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote:
>> On 2022.04.19 14:11 Thomas Gleixner wrote:
>>>> That's because after the changes in this series scaling_cur_freq
>>>> returns 0 if the given CPU is idle.
>>>
>>> Which is sensible IMO as there is really no point in waking an idle CPU
>>> just to read those MSRs, then wait 20ms wake it up again to read those
>>> MSRs again.
>>
>> I totally agree.
>> It is the inconsistency for what is displayed as a function of driver/governor
>> that is my concern.
>
> Raphael suggested to move the show_cpuinfo() logic into the a/mperf
> code. See below.

Hi Thomas,

I tested the patch on top of your 10 patch set on kernel 5.18-rc3.
It addresses my consistency concerns.

Thank you

... Doug

> ---
> Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
> From: Thomas Gleixner <[email protected]>
> Date: Mon, 25 Apr 2022 15:19:29 +0200
>
...

Subject: [tip: x86/cleanups] x86/aperfmperf: Integrate the fallback code from show_cpuinfo()

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID: e696cabf5da2b4ed104508674de6125b860f3c9f
Gitweb: https://git.kernel.org/tip/e696cabf5da2b4ed104508674de6125b860f3c9f
Author: Thomas Gleixner <[email protected]>
AuthorDate: Mon, 25 Apr 2022 17:45:42 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 27 Apr 2022 15:51:09 +02:00

x86/aperfmperf: Integrate the fallback code from show_cpuinfo()

Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return
0 when the last sample was too long ago.

show_cpuinfo() has a fallback to cpufreq_quick_get() and if that fails to
return cpu_khz, but the readout code for the per CPU scaling frequency in
sysfs does not.

Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same
when reading /proc/cpuinfo and /sys/..../cur_scaling_freq.

Suggested-by: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Doug Smythies <[email protected]>
Link: https://lore.kernel.org/r/87pml5180p.ffs@tglx

---
arch/x86/kernel/cpu/aperfmperf.c | 10 +++++++---
arch/x86/kernel/cpu/proc.c | 7 +------
2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index b15c884..1f60a2b 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -405,12 +405,12 @@ void arch_scale_freq_tick(void)
unsigned int arch_freq_get_on_cpu(int cpu)
{
struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+ unsigned int seq, freq;
unsigned long last;
- unsigned int seq;
u64 acnt, mcnt;

if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
- return 0;
+ goto fallback;

do {
seq = raw_read_seqcount_begin(&s->seq);
@@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cpu)
* which covers idle and NOHZ full CPUs.
*/
if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
- return 0;
+ goto fallback;

return div64_u64((cpu_khz * acnt), mcnt);
+
+fallback:
+ freq = cpufreq_quick_get(cpu);
+ return freq ? freq : cpu_khz;
}

static int __init bp_init_aperfmperf(void)
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 0a0ee55..099b6f0 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
if (cpu_has(c, X86_FEATURE_TSC)) {
unsigned int freq = arch_freq_get_on_cpu(cpu);

- if (!freq)
- freq = cpufreq_quick_get(cpu);
- if (!freq)
- freq = cpu_khz;
- seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
- freq / 1000, (freq % 1000));
+ seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
}

/* Cache size */

Subject: [tip: x86/cleanups] x86/aperfmperf: Integrate the fallback code from show_cpuinfo()

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID: fb4c77c21aba03677f283acda3cae748ef866abf
Gitweb: https://git.kernel.org/tip/fb4c77c21aba03677f283acda3cae748ef866abf
Author: Thomas Gleixner <[email protected]>
AuthorDate: Mon, 25 Apr 2022 17:45:42 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 27 Apr 2022 20:22:20 +02:00

x86/aperfmperf: Integrate the fallback code from show_cpuinfo()

Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return
0 when the last sample was too long ago.

show_cpuinfo() has a fallback to cpufreq_quick_get() and if that fails to
return cpu_khz, but the readout code for the per CPU scaling frequency in
sysfs does not.

Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same
when reading /proc/cpuinfo and /sys/..../cur_scaling_freq.

Suggested-by: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Doug Smythies <[email protected]>
Link: https://lore.kernel.org/r/87pml5180p.ffs@tglx

---
arch/x86/kernel/cpu/aperfmperf.c | 10 +++++++---
arch/x86/kernel/cpu/proc.c | 7 +------
2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index b15c884..1f60a2b 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -405,12 +405,12 @@ void arch_scale_freq_tick(void)
unsigned int arch_freq_get_on_cpu(int cpu)
{
struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+ unsigned int seq, freq;
unsigned long last;
- unsigned int seq;
u64 acnt, mcnt;

if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
- return 0;
+ goto fallback;

do {
seq = raw_read_seqcount_begin(&s->seq);
@@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cpu)
* which covers idle and NOHZ full CPUs.
*/
if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
- return 0;
+ goto fallback;

return div64_u64((cpu_khz * acnt), mcnt);
+
+fallback:
+ freq = cpufreq_quick_get(cpu);
+ return freq ? freq : cpu_khz;
}

static int __init bp_init_aperfmperf(void)
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 0a0ee55..099b6f0 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
if (cpu_has(c, X86_FEATURE_TSC)) {
unsigned int freq = arch_freq_get_on_cpu(cpu);

- if (!freq)
- freq = cpufreq_quick_get(cpu);
- if (!freq)
- freq = cpu_khz;
- seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
- freq / 1000, (freq % 1000));
+ seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
}

/* Cache size */