2020-10-22 16:51:31

by Peter Zijlstra

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > However I do want to retire ondemand, conservative and also very much
> > intel_pstate/active mode.
>
> I agree in general, but IMO it would not be prudent to do that without making
> schedutil provide the same level of performance in all of the relevant use
> cases.

Agreed; I thought we were there already.


2020-10-22 17:49:13

by Mel Gorman

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > > However I do want to retire ondemand, conservative and also very much
> > > intel_pstate/active mode.
> >
> > I agree in general, but IMO it would not be prudent to do that without making
> > schedutil provide the same level of performance in all of the relevant use
> > cases.
>
> Agreed; I thought we were there already.

AFAIK, not quite (added Giovanni as he has been paying more attention).
Schedutil has improved since it was merged but not to the extent where
it is a drop-in replacement. The standard it needs to meet is that
it is at least equivalent to powersave (in intel_pstate language)
or ondemand (acpi_cpufreq) and within a reasonable percentage of the
performance governor. Defaulting to performance is a) giving up and b)
the performance governor is not a universal win. There are some questions
currently on whether schedutil is good enough when HWP is not available.
There was some evidence (I don't have the data, Giovanni was looking into
it) that HWP was a requirement to make schedutil work well. That is a
hazard in itself because someone could test on the latest gen Intel CPU
and conclude everything is fine and miss that Intel-specific technology
is needed to make it work well while throwing everyone else under a bus.
Giovanni knows a lot more than I do about this, I could be wrong or
forgetting things.

For distros, switching to schedutil by default would be nice because
frequency selection state would follow the task instead of being per-cpu
and we could stop worrying about different HWP implementations but it's
not at the point where the switch is advisable. I would expect hard data
before switching the default and still would strongly advise having a
period of time where we can fall back when someone inevitably finds a
new corner case or exception.

For reference, SLUB had the same problem for years. It was switched
on by default in the kernel config but it was a long time before
SLUB was generally equivalent to SLAB in terms of performance. Block
multiqueue also had vaguely similar issues before the default changes
and a period of time before it was removed (example whinging mail
https://lore.kernel.org/lkml/[email protected]/)
It's schedutil's turn :P

--
Mel Gorman
SUSE Labs

2020-10-22 18:07:03

by A L

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core



---- From: Peter Zijlstra <[email protected]> -- Sent: 2020-10-22 - 14:29 ----

> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
>> > However I do want to retire ondemand, conservative and also very much
>> > intel_pstate/active mode.
>>
>> I agree in general, but IMO it would not be prudent to do that without making
>> schedutil provide the same level of performance in all of the relevant use
>> cases.
>
> Agreed; I thought we were there already.

Hi,


Currently schedutil does not populate all stats like ondemand does, which can be a problem for some monitoring software.

On my AMD 3000G CPU with kernel-5.9.1:


grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*

With ondemand:
time_in_state:3900000 145179
time_in_state:1600000 9588482
total_trans:177565
trans_table:   From  :    To
trans_table:         :   3900000   1600000
trans_table:  3900000:         0     88783
trans_table:  1600000:     88782         0

With schedutil only two files exist:
reset:<empty>
total_trans:216609


I'd really like to have these stats populated with schedutil, if that's possible.

Thanks.
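
For monitoring tools that depend on these files, a minimal sketch of how the
time_in_state content quoted above could be turned into per-frequency
residency shares. The function names are hypothetical, and the sample string
is simply the ondemand output from this mail; under schedutil the file was
reported as missing, so a real tool would have to handle its absence.

```python
def parse_time_in_state(text):
    """Parse cpufreq 'time_in_state' content: one '<freq_khz> <time>' pair per line."""
    residency = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        residency[int(fields[0])] = int(fields[1])
    return residency


def residency_share(residency):
    """Return the fraction of the total recorded time spent at each frequency."""
    total = sum(residency.values())
    if total == 0:
        return {freq: 0.0 for freq in residency}
    return {freq: ticks / total for freq, ticks in residency.items()}


# Sample copied from the ondemand output quoted in this mail.
SAMPLE = "3900000 145179\n1600000 9588482\n"

if __name__ == "__main__":
    for freq, share in residency_share(parse_time_in_state(SAMPLE)).items():
        print(f"{freq} kHz: {share:.1%}")
```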

2020-10-22 20:26:23

by Colin King

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On 22/10/2020 15:52, Mel Gorman wrote:
> On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
>>>> However I do want to retire ondemand, conservative and also very much
>>>> intel_pstate/active mode.
>>>
>>> I agree in general, but IMO it would not be prudent to do that without making
>>> schedutil provide the same level of performance in all of the relevant use
>>> cases.
>>
>> Agreed; I thought we were there already.
>
> AFAIK, not quite (added Giovanni as he has been paying more attention).
> Schedutil has improved since it was merged but not to the extent where
> it is a drop-in replacement. The standard it needs to meet is that
> it is at least equivalent to powersave (in intel_pstate language)
> or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> performance governor. Defaulting to performance is a) giving up and b)
> the performance governor is not a universal win. There are some questions
> currently on whether schedutil is good enough when HWP is not available.
> There was some evidence (I don't have the data, Giovanni was looking into
> it) that HWP was a requirement to make schedutil work well. That is a
> hazard in itself because someone could test on the latest gen Intel CPU
> and conclude everything is fine and miss that Intel-specific technology
> is needed to make it work well while throwing everyone else under a bus.
> Giovanni knows a lot more than I do about this, I could be wrong or
> forgetting things.
>
> For distros, switching to schedutil by default would be nice because
> frequency selection state would follow the task instead of being per-cpu
> and we could stop worrying about different HWP implementations but it's
> not at the point where the switch is advisable. I would expect hard data
> before switching the default and still would strongly advise having a
> period of time where we can fall back when someone inevitably finds a
> new corner case or exception.

..and it would be really useful for distros to know when the hard data
is available so that they can make an informed decision when to move to
schedutil.

>
> For reference, SLUB had the same problem for years. It was switched
> on by default in the kernel config but it was a long time before
> SLUB was generally equivalent to SLAB in terms of performance. Block
> multiqueue also had vaguely similar issues before the default changes
> and a period of time before it was removed (example whinging mail
> https://lore.kernel.org/lkml/[email protected]/)
> It's schedutil's turn :P
>

2020-10-22 23:59:01

by Peter Zijlstra

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote:

> There are some questions
> currently on whether schedutil is good enough when HWP is not available.

Srinivas and Rafael will know better, but Intel does run a lot of tests
and IIRC it was found that schedutil was on-par for !HWP. That was the
basis for commit:

33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")

But now it turns out that commit results in running intel_pstate-passive
on ondemand, which is quite horrible.

> There was some evidence (I don't have the data, Giovanni was looking into
> it) that HWP was a requirement to make schedutil work well.

That seems to be the question; Rafael just said the opposite.

> For distros, switching to schedutil by default would be nice because
> frequency selection state would follow the task instead of being per-cpu
> and we could stop worrying about different HWP implementations but it's

s/HWP/cpufreq-governors/ ? But yes.

> not at the point where the switch is advisable. I would expect hard data
> before switching the default and still would strongly advise having a
> period of time where we can fall back when someone inevitably finds a
> new corner case or exception.

Which is why I advocated to make it 'difficult' to use the old ones and
only later remove them.

> For reference, SLUB had the same problem for years. It was switched
> on by default in the kernel config but it was a long time before
> SLUB was generally equivalent to SLAB in terms of performance.

I remember :-)

2020-10-23 05:06:34

by Giovanni Gherdovich

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

Hello Peter, Rafael,

Back in August I tested a v5.8 kernel with Rafael's patches from v5.9 that
make schedutil and HWP work together, i.e. f6ebbcf08f37 ("cpufreq: intel_pstate:
Implement passive mode with HWP enabled").

The main point I took from the exercise is that tbench (network benchmark
in localhost) is problematic for schedutil and only with HWP (thanks to
Rafael's patch above) it reaches the throughput of the other governors.
When HWP isn't available, the penalty is 5-10% and I need to understand if
the cause is something that can affect other applications too (or just a
quirk of this test).

I ran this campaign this summer when Rafael CC'ed me on f6ebbcf08f37
("cpufreq: intel_pstate: Implement passive mode with HWP enabled").
I didn't reply as the patch was a win anyway (my bad, I should have posted
the positive results). The regression of tbench with schedutil w/o HWP,
which went unnoticed for a long time, got most of my attention.

Other remarks

* on gitsource (running the git unit test suite, measures elapsed time)
schedutil is a lot better than Intel's powersave but not as good as the
performance governor.

* for the AMD EPYC machines we haven't yet implemented frequency invariant
accounting, which might explain why schedutil loses to ondemand on all
the benchmarks.

* on dbench (filesystem, measures latency) and kernbench (kernel compilation),
sugov is as good as the Intel performance governor. You can add or remove
HWP (to either sugov or perfgov), it doesn't make a difference. Intel's
powersave in general trails behind.

* generally my main concern is performance, not power efficiency, but I was
a little disappointed to see schedutil being just as efficient as
perfgov (the performance-per-watt ratios): there are even a few cases
where (on tbench) the performance governor is both faster and more
efficient. From previous conversations with Rafael I recall that
switching frequency has an energy cost, so it could be that schedutil
switches too often to amortize it. I haven't checked.

To read the tables:

Tilde (~) means the result is the same as baseline (or, the ratio is close
to 1). The double asterisk (**) is a visual aid and means the result is
worse than baseline (higher or lower depending on the case).

For an overview of the possible configurations (intel_pstate passive,
active, HWP on/off etc) I made the diagram at
https://beta.suse.com/private/ggherdovich/cpufreq/x86-cpufreq.png

1) INTEL, HWP-CAPABLE MACHINES
2) INTEL, NON-HWP-CAPABLE MACHINES
3) AMD EPYC

1) INTEL, HWP-CAPABLE MACHINES:

64x_SKYLAKE_NUMA: Intel Skylake SP, 32 cores / 64 threads, NUMA, SATA SSD storage
------------------------------------------------------------------------------
            sugov-HWP   sugov-no-HWP   powersave-HWP   perfgov-HWP   better if
------------------------------------------------------------------------------
PERFORMANCE RATIOS
tbench      1.00        0.68           ~               1.03**        higher
dbench      1.00        ~              1.03            ~             lower
kernbench   1.00        ~              1.11            ~             lower
gitsource   1.00        1.03           2.26            0.82**        lower
------------------------------------------------------------------------------
PERFORMANCE-PER-WATT RATIOS
tbench      1.00        0.74           ~               ~             higher
dbench      1.00        ~              ~               ~             higher
kernbench   1.00        ~              0.96            ~             higher
gitsource   1.00        0.96           0.45            1.15**        higher


8x_SKYLAKE_UMA: Intel Skylake (client), 4 cores / 8 threads, UMA, SATA SSD storage
------------------------------------------------------------------------------
            sugov-HWP   sugov-no-HWP   powersave-HWP   perfgov-HWP   better if
------------------------------------------------------------------------------
PERFORMANCE RATIOS
tbench      1.00        0.91           ~               ~             higher
dbench      1.00        ~              ~               ~             lower
kernbench   1.00        ~              ~               ~             lower
gitsource   1.00        1.04           1.77            ~             lower
------------------------------------------------------------------------------
PERFORMANCE-PER-WATT RATIOS
tbench      1.00        0.95           ~               ~             higher
dbench      1.00        ~              ~               ~             higher
kernbench   1.00        ~              ~               ~             higher
gitsource   1.00        ~              0.74            ~             higher


8x_COFFEELAKE_UMA: Intel Coffee Lake, 4 cores / 8 threads, UMA, NVMe SSD storage
---------------------------------------------------------------
            sugov-HWP   powersave-HWP   perfgov-HWP   better if
---------------------------------------------------------------
PERFORMANCE RATIOS
tbench      1.00        ~               ~             higher
dbench      1.00        1.12            ~             lower
kernbench   1.00        ~               ~             lower
gitsource   1.00        2.05            ~             lower
---------------------------------------------------------------
PERFORMANCE-PER-WATT RATIOS
tbench      1.00        ~               ~             higher
dbench      1.00        1.80**          ~             higher
kernbench   1.00        ~               ~             higher
gitsource   1.00        1.52**          ~             higher


2) INTEL, NON-HWP-CAPABLE MACHINES:

80x_BROADWELL_NUMA: Intel Broadwell EP, 40 cores / 80 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
            sugov       powersave       perfgov       better if
---------------------------------------------------------------
PERFORMANCE RATIOS
tbench      1.00        1.11**          1.10**        higher
dbench      1.00        1.10            ~             lower
kernbench   1.00        1.10            ~             lower
gitsource   1.00        2.27            0.95**        lower
---------------------------------------------------------------
PERFORMANCE-PER-WATT RATIOS
tbench      1.00        1.05**          1.04**        higher
dbench      1.00        1.24**          0.95          higher
kernbench   1.00        ~               ~             higher
gitsource   1.00        0.86            1.04**        higher


48x_HASWELL_NUMA: Intel Haswell EP, 24 cores / 48 threads, NUMA, HDD storage
---------------------------------------------------------------
            sugov       powersave       perfgov       better if
---------------------------------------------------------------
PERFORMANCE RATIOS
tbench      1.00        1.25**          1.27**        higher
dbench      1.00        1.17            ~             lower
kernbench   1.00        1.04            ~             lower
gitsource   1.00        1.54            0.79**        lower
---------------------------------------------------------------
PERFORMANCE-PER-WATT RATIOS
tbench      1.00        1.18**          1.11**        higher
dbench      1.00        1.25**          ~             higher
kernbench   1.00        1.04**          0.97          higher
gitsource   1.00        0.77            ~             higher


3) AMD EPYC:

256x_ROME_NUMA: AMD Rome, 128 cores / 256 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
            sugov       ondemand        perfgov       better if
---------------------------------------------------------------
PERFORMANCE RATIOS
tbench      1.00        1.11**          1.58**        higher
dbench      1.00        0.44**          0.40**        lower
kernbench   1.00        ~               0.91**        lower
gitsource   1.00        0.96**          0.65**        lower


128x_NAPLES_NUMA: AMD Naples, 64 cores / 128 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
            sugov       ondemand        perfgov       better if
---------------------------------------------------------------
PERFORMANCE RATIOS
tbench      1.00        1.10**          1.19**        higher
dbench      1.00        1.05            0.95**        lower
kernbench   1.00        ~               0.95**        lower
gitsource   1.00        0.93**          0.55**        lower


Giovanni

2020-10-23 05:14:50

by Giovanni Gherdovich

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On Thu, 2020-10-22 at 22:10 +0200, Giovanni Gherdovich wrote:
> [...]
> To read the tables:
>
> Tilde (~) means the result is the same as baseline (or, the ratio is close
> to 1). The double asterisk (**) is a visual aid and means the result is
> worse than baseline (higher or lower depending on the case).

Ouch, the opposite. Double asterisk (**) is where the result is better
than baseline, and schedutil needs improvement.


Giovanni
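
With the corrected legend (** marks a result better than the schedutil
baseline), a table cell can be read mechanically as sketched below. This
assumes each ratio is alternative/baseline with the sugov column fixed at
1.00; the function name and the 2% band used for "~" are hypothetical, since
the mails do not state the threshold actually applied.

```python
def alternative_is_better(ratio, better_if, tolerance=0.02):
    """Decide whether a non-baseline column beats the schedutil baseline.

    ratio      -- table value, assumed to be alternative / baseline
    better_if  -- "higher" or "lower", taken from the table's last column
    tolerance  -- hypothetical band around 1.0 treated as "same" (~)
    """
    if better_if not in ("higher", "lower"):
        raise ValueError("better_if must be 'higher' or 'lower'")
    if abs(ratio - 1.0) <= tolerance:
        return False  # within noise: neither better nor worse
    if better_if == "higher":
        return ratio > 1.0
    return ratio < 1.0


if __name__ == "__main__":
    # 64x_SKYLAKE_NUMA, performance ratios, values copied from the tables.
    print(alternative_is_better(1.03, "higher"))  # tbench, perfgov-HWP
    print(alternative_is_better(2.26, "lower"))   # gitsource, powersave-HWP
    print(alternative_is_better(0.82, "lower"))   # gitsource, perfgov-HWP
```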

2020-10-23 08:14:02

by Peter Zijlstra

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
> * for the AMD EPYC machines we haven't yet implemented frequency invariant
> accounting, which might explain why schedutil loses to ondemand on all
> the benchmarks.

Right, I poked the AMD people on that a few times, but nothing seems to
be forthcoming :/ Tom, any way you could perhaps expedite the matter?

In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in

arch/x86/kernel/smpboot.c:init_freq_invariance()

The main issue is finding a 'max' frequency that is not the absolute max
turbo boost (this could result in not reaching it very often) but also
not too low such that we're always clipping.

And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
the chips have a CPPC interface which could be used instead. Is there
any progress on that?
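
The trade-off in picking the 'max' frequency can be illustrated with a
deliberately simplified sketch. It assumes the invariance factor is just
current/max scaled to SCHED_CAPACITY_SCALE (1024) and clipped there; the
kernel derives the ratio from hardware counters rather than from nominal
frequencies, and the frequencies used here are made up for illustration.

```python
SCHED_CAPACITY_SCALE = 1024  # fixed-point representation of "full capacity"


def freq_scale(cur_khz, max_khz):
    """Simplified frequency-invariance factor, clipped at full capacity."""
    return min(SCHED_CAPACITY_SCALE, cur_khz * SCHED_CAPACITY_SCALE // max_khz)


if __name__ == "__main__":
    # Hypothetical part: 2.0 GHz base, 3.0 GHz sustained boost, 3.4 GHz peak boost.
    sustained_khz = 3_000_000

    # 'max' set to the absolute peak: a CPU at its sustained boost never
    # reports full capacity.
    print(freq_scale(sustained_khz, 3_400_000))

    # 'max' set too low (base clock): every boost frequency clips to the
    # same value, so they can no longer be told apart.
    print(freq_scale(2_500_000, 2_000_000), freq_scale(sustained_khz, 2_000_000))
```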

2020-10-23 19:02:06

by Tom Lendacky

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On 10/23/20 2:03 AM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
>> * for the AMD EPYC machines we haven't yet implemented frequency invariant
>> accounting, which might explain why schedutil loses to ondemand on all
>> the benchmarks.
>
> Right, I poked the AMD people on that a few times, but nothing seems to
> be forthcoming :/ Tom, any way you could perhaps expedite the matter?

Adding Nathan to the thread to help out here.

Thanks,
Tom

>
> In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in
>
> arch/x86/kernel/smpboot.c:init_freq_invariance()
>
> The main issue is finding a 'max' frequency that is not the absolute max
> turbo boost (this could result in not reaching it very often) but also
> not too low such that we're always clipping.
>
> And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
> the chips have a CPPC interface which could be used instead. Is there
> any progress on that?
>

2020-10-27 06:39:08

by Nathan Fontenot

Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On 10/23/2020 12:46 PM, Tom Lendacky wrote:
> On 10/23/20 2:03 AM, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
>>> * for the AMD EPYC machines we haven't yet implemented frequency invariant
>>>    accounting, which might explain why schedutil loses to ondemand on all
>>>    the benchmarks.
>>
>> Right, I poked the AMD people on that a few times, but nothing seems to
>> be forthcoming :/ Tom, any way you could perhaps expedite the matter?
>
> Adding Nathan to the thread to help out here.
>
> Thanks,
> Tom

Thanks Tom, diving in...

>
>>
>> In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in
>>
>>    arch/x86/kernel/smpboot.c:init_freq_invariance()
>>
>> The main issue is finding a 'max' frequency that is not the absolute max
>> turbo boost (this could result in not reaching it very often) but also
>> not too low such that we're always clipping.

I've started looking into this and have a lead but need to confirm that the
frequency value I'm getting is not an absolute max.

>>
>> And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
>> the chips have a CPPC interface which could be used instead. Is there
>> any progress on that?
>>

Correct, AMD uses acpi_cpufreq. The newer AMD chips do have a CPPC interface
(not sure how far back 'newer' covers). I'll take a look at schedutil and
cppc_cpufreq and the possibility of transitioning to them for AMD.

-Nathan