2020-10-21 13:59:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> Prior to 5.8, my machine was using intel_pstate and had few background
> tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> the kernel decided that intel_cpufreq would be more appropriate, which
> introduced kworkers every 0.004 seconds on all cores.

That still doesn't make any sense. Are you running the legacy on-demand
thing or something?

Rafael, Srinivas, Viresh, how come it defaults to that?


2020-10-22 00:38:23

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core



On Wed, 21 Oct 2020, Peter Zijlstra wrote:

> On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > Prior to 5.8, my machine was using intel_pstate and had few background
> > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > the kernel decided that intel_cpufreq would be more appropriate, which
> > introduced kworkers every 0.004 seconds on all cores.
>
> That still doesn't make any sense. Are you running the legacy on-demand
> thing or something?
>
> Rafael, Srinivas, Viresh, how come it defaults to that?

The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a small
bug. I have a Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
have the HWP feature, even though the cores seemed to be able to change
their frequencies at the hardware level.

julia

2020-10-22 01:02:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Wed, Oct 21, 2020 at 02:42:20PM +0200, Julia Lawall wrote:
>
>
> On Wed, 21 Oct 2020, Peter Zijlstra wrote:
>
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> >
> > That still doesn't make any sense. Are you running the legacy on-demand
> > thing or something?
> >
> > Rafael, Srinivas, Viresh, how come it defaults to that?
>
> The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a small
> bug. I have a Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
> have the HWP feature, even though the cores seemed to be able to change
> their frequencies at the hardware level.

That just makes intel_pstate not prefer active mode. With the clear
intent that it should then go use schedutil, but somehow it looks like
you landed on ondemand, which is absolutely atrocious.

What's:

$ for i in /sys/devices/system/cpu/cpu0/cpufreq/scaling_*; do echo -n $i ": "; cat $i; done

say, for you? And if you do:

$ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo schedutil > $i; done

Are the kworkers gone?


2020-10-22 01:25:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > Prior to 5.8, my machine was using intel_pstate and had few background
> > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > the kernel decided that intel_cpufreq would be more appropriate, which
> > introduced kworkers every 0.004 seconds on all cores.
>
> That still doesn't make any sense. Are you running the legacy on-demand
> thing or something?
>
> Rafael, Srinivas, Viresh, how come it defaults to that?

Does we want something like this?

---
arch/x86/configs/i386_defconfig | 3 +--
arch/x86/configs/x86_64_defconfig | 3 +--
drivers/cpufreq/Kconfig | 7 +++++--
3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig
index 78210793d357..c343ad459737 100644
--- a/arch/x86/configs/i386_defconfig
+++ b/arch/x86/configs/i386_defconfig
@@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y
CONFIG_PM_TRACE_RTC=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_BGRT=y
-CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
-CONFIG_CPU_FREQ_GOV_ONDEMAND=y
+CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_EFI_VARS=y
CONFIG_KPROBES=y
diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index 9936528e1939..23e1ea85c1ec 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y
CONFIG_PM_TRACE_RTC=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_BGRT=y
-CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
-CONFIG_CPU_FREQ_GOV_ONDEMAND=y
+CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_IA32_EMULATION=y
CONFIG_EFI_VARS=y
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 2c7171e0b001..8dfca6e9b836 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -37,8 +37,7 @@ config CPU_FREQ_STAT
choice
prompt "Default CPUFreq governor"
default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
- default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
- default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
+ default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP
default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
help
This option sets which CPUFreq governor shall be loaded at
@@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE

config CPU_FREQ_DEFAULT_GOV_ONDEMAND
bool "ondemand"
+ depends on !SMP
select CPU_FREQ_GOV_ONDEMAND
select CPU_FREQ_GOV_PERFORMANCE
help
@@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND

config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
bool "conservative"
+ depends on !SMP
select CPU_FREQ_GOV_CONSERVATIVE
select CPU_FREQ_GOV_PERFORMANCE
help
@@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE

config CPU_FREQ_GOV_ONDEMAND
tristate "'ondemand' cpufreq policy governor"
+ depends on !SMP
select CPU_FREQ_GOV_COMMON
help
'ondemand' - This driver adds a dynamic cpufreq policy governor.
@@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND
config CPU_FREQ_GOV_CONSERVATIVE
tristate "'conservative' cpufreq governor"
depends on CPU_FREQ
+ depends on !SMP
select CPU_FREQ_GOV_COMMON
help
'conservative' - this driver is rather similar to the 'ondemand'

2020-10-22 01:48:55

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core



On Wed, 21 Oct 2020, Peter Zijlstra wrote:

> On Wed, Oct 21, 2020 at 02:42:20PM +0200, Julia Lawall wrote:
> >
> >
> > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> >
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense. Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> >
> > The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a small
> > bug. I have a Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
> > have the HWP feature, even though the cores seemed to be able to change
> > their frequencies at the hardware level.
>
> That just makes intel_pstate not prefer active mode. With the clear
> intent that it should then go use schedutil, but somehow it looks like
> you landed on ondemand, which is absolutely atrocious.
>
> What's:
>
> $ for i in /sys/devices/system/cpu/cpu0/cpufreq/scaling_*; do echo -n $i ": "; cat $i; done
>
> say, for you? And if you do:

/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors :
ondemand performance schedutil
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq : 1197706
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver : intel_cpufreq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor : ondemand
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq : 3000000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq : 1200000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed : <unsupported>

>
> $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo schedutil > $i; done
>
> Are the kworkers gone?

Yes, they are gone. The performance seems to be more uniform and better
with the kworkers, though (see attached traces). This is a very short
benchmark, though.

julia


Attachments:
cg2.pdf (267.74 kB)

2020-10-22 06:31:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Wednesday, October 21, 2020 3:10:26 PM CEST Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> >
> > That still doesn't make any sense. Are you running the legacy on-demand
> > thing or something?
> >
> > Rafael, Srinivas, Viresh, how come it defaults to that?
>
> Does we want something like this?
>
> ---
> arch/x86/configs/i386_defconfig | 3 +--
> arch/x86/configs/x86_64_defconfig | 3 +--
> drivers/cpufreq/Kconfig | 7 +++++--
> 3 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig
> index 78210793d357..c343ad459737 100644
> --- a/arch/x86/configs/i386_defconfig
> +++ b/arch/x86/configs/i386_defconfig
> @@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y
> CONFIG_PM_TRACE_RTC=y
> CONFIG_ACPI_DOCK=y
> CONFIG_ACPI_BGRT=y
> -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> -CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
> CONFIG_X86_ACPI_CPUFREQ=y
> CONFIG_EFI_VARS=y
> CONFIG_KPROBES=y
> diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
> index 9936528e1939..23e1ea85c1ec 100644
> --- a/arch/x86/configs/x86_64_defconfig
> +++ b/arch/x86/configs/x86_64_defconfig
> @@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y
> CONFIG_PM_TRACE_RTC=y
> CONFIG_ACPI_DOCK=y
> CONFIG_ACPI_BGRT=y
> -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> -CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
> CONFIG_X86_ACPI_CPUFREQ=y
> CONFIG_IA32_EMULATION=y
> CONFIG_EFI_VARS=y
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 2c7171e0b001..8dfca6e9b836 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -37,8 +37,7 @@ config CPU_FREQ_STAT
> choice
> prompt "Default CPUFreq governor"
> default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
> - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
> - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
> + default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP
> default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> help
> This option sets which CPUFreq governor shall be loaded at
> @@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
>
> config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> bool "ondemand"
> + depends on !SMP
> select CPU_FREQ_GOV_ONDEMAND
> select CPU_FREQ_GOV_PERFORMANCE
> help
> @@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
>
> config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> bool "conservative"
> + depends on !SMP
> select CPU_FREQ_GOV_CONSERVATIVE
> select CPU_FREQ_GOV_PERFORMANCE
> help

The changes above should work.

> @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
>
> config CPU_FREQ_GOV_ONDEMAND
> tristate "'ondemand' cpufreq policy governor"
> + depends on !SMP

But I don't think that we can do this and the one below.

> select CPU_FREQ_GOV_COMMON
> help
> 'ondemand' - This driver adds a dynamic cpufreq policy governor.
> @@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND
> config CPU_FREQ_GOV_CONSERVATIVE
> tristate "'conservative' cpufreq governor"
> depends on CPU_FREQ
> + depends on !SMP
> select CPU_FREQ_GOV_COMMON
> help
> 'conservative' - this driver is rather similar to the 'ondemand'
>




2020-10-22 06:33:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote:
>
> On Wed, 21 Oct 2020, Peter Zijlstra wrote:
>
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> >
> > That still doesn't make any sense. Are you running the legacy on-demand
> > thing or something?
> >
> > Rafael, Srinivas, Viresh, how come it defaults to that?
>
> The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a small
> bug. I have a Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
> have the HWP feature, even though the cores seemed to be able to change
> their frequencies at the hardware level.

That's in the range of "turbo" P-states (if a P-state above a certain threshold
is requested by the governor, the processor has a license to choose P-states
in the range above this threshold by itself).

Cheers!



2020-10-22 06:47:52

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core



On Wed, 21 Oct 2020, Rafael J. Wysocki wrote:

> On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote:
> >
> > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> >
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense. Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> >
> > The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a small
> > bug. I have a Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
> > have the HWP feature, even though the cores seemed to be able to change
> > their frequencies at the hardware level.
>
> That's in the range of "turbo" P-states (if a P-state above a certain threshold
> is requested by the governor, the processor has a license to choose P-states
> in the range above this threshold by itself).

Sorry, but I don't understand this answer at all.

thanks,
julia

2020-10-22 06:55:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Wednesday, October 21, 2020 9:47:51 PM CEST Julia Lawall wrote:
>
> On Wed, 21 Oct 2020, Rafael J. Wysocki wrote:
>
> > On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote:
> > >
> > > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> > >
> > > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > > introduced kworkers every 0.004 seconds on all cores.
> > > >
> > > > That still doesn't make any sense. Are you running the legacy on-demand
> > > > thing or something?
> > > >
> > > > Rafael, Srinivas, Viresh, how come it defaults to that?
> > >
> > > The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a small
> > > bug. I have a Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
> > > have the HWP feature, even though the cores seemed to be able to change
> > > their frequencies at the hardware level.
> >
> > That's in the range of "turbo" P-states (if a P-state above a certain threshold
> > is requested by the governor, the processor has a license to choose P-states
> > in the range above this threshold by itself).
>
> Sorry, but I don't understand this answer at all.

Well, sorry about that and let me rephrase then.

Contemporary CPUs have two ranges of P-states, the so called "guaranteed
performance" range and the "turbo" range.

In the "guaranteed performance" range the CPU runs in the P-state requested
by the governor, unless a higher P-state has been requested for another CPU in
its frequency domain (usually covering the entire processor package) , in which
case that higher P-state will be used (the effective P-state for all CPUs in the
frequency domain is the maximum of all P-states requested for individual CPUs).

However, if the governor requests a P-state from the "turbo" range, the
processor is not required to take that request literally and the PM unit in it may
override the governor's choice and cause the CPU to run in a different P-state
(also from the "turbo" range), even if lower P-states have been requested for
the other CPUs in the processor package.

This is also explained in Documentation/admin-guide/pm/intel_pstate.rst (in the
Turbo P-states Support section), in more detail and hopefully more clearly.

Cheers!



2020-10-22 08:25:28

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On 21-10-20, 20:11, Rafael J. Wysocki wrote:
> On Wednesday, October 21, 2020 3:10:26 PM CEST Peter Zijlstra wrote:
> > On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense. Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> >
> > Does we want something like this?
> >
> > ---
> > arch/x86/configs/i386_defconfig | 3 +--
> > arch/x86/configs/x86_64_defconfig | 3 +--
> > drivers/cpufreq/Kconfig | 7 +++++--
> > 3 files changed, 7 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig
> > index 78210793d357..c343ad459737 100644
> > --- a/arch/x86/configs/i386_defconfig
> > +++ b/arch/x86/configs/i386_defconfig
> > @@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y
> > CONFIG_PM_TRACE_RTC=y
> > CONFIG_ACPI_DOCK=y
> > CONFIG_ACPI_BGRT=y
> > -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> > -CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> > +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
> > CONFIG_X86_ACPI_CPUFREQ=y
> > CONFIG_EFI_VARS=y
> > CONFIG_KPROBES=y
> > diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
> > index 9936528e1939..23e1ea85c1ec 100644
> > --- a/arch/x86/configs/x86_64_defconfig
> > +++ b/arch/x86/configs/x86_64_defconfig
> > @@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y
> > CONFIG_PM_TRACE_RTC=y
> > CONFIG_ACPI_DOCK=y
> > CONFIG_ACPI_BGRT=y
> > -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> > -CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> > +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
> > CONFIG_X86_ACPI_CPUFREQ=y
> > CONFIG_IA32_EMULATION=y
> > CONFIG_EFI_VARS=y
> > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> > index 2c7171e0b001..8dfca6e9b836 100644
> > --- a/drivers/cpufreq/Kconfig
> > +++ b/drivers/cpufreq/Kconfig
> > @@ -37,8 +37,7 @@ config CPU_FREQ_STAT
> > choice
> > prompt "Default CPUFreq governor"
> > default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
> > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
> > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
> > + default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP
> > default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> > help
> > This option sets which CPUFreq governor shall be loaded at
> > @@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE
> >
> > config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> > bool "ondemand"
> > + depends on !SMP
> > select CPU_FREQ_GOV_ONDEMAND
> > select CPU_FREQ_GOV_PERFORMANCE
> > help
> > @@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> >
> > config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> > bool "conservative"
> > + depends on !SMP
> > select CPU_FREQ_GOV_CONSERVATIVE
> > select CPU_FREQ_GOV_PERFORMANCE
> > help
>
> The changes above should work.
>
> > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
> >
> > config CPU_FREQ_GOV_ONDEMAND
> > tristate "'ondemand' cpufreq policy governor"
> > + depends on !SMP
>
> But I don't think that we can do this and the one below.

I have exactly the same comments.

> > select CPU_FREQ_GOV_COMMON
> > help
> > 'ondemand' - This driver adds a dynamic cpufreq policy governor.
> > @@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND
> > config CPU_FREQ_GOV_CONSERVATIVE
> > tristate "'conservative' cpufreq governor"
> > depends on CPU_FREQ
> > + depends on !SMP
> > select CPU_FREQ_GOV_COMMON
> > help
> > 'conservative' - this driver is rather similar to the 'ondemand'
> >
>
>
>

--
viresh

2020-10-22 08:43:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Wed, Oct 21, 2020 at 08:11:59PM +0200, Rafael J. Wysocki wrote:

> > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
> >
> > config CPU_FREQ_GOV_ONDEMAND
> > tristate "'ondemand' cpufreq policy governor"
> > + depends on !SMP
>
> But I don't think that we can do this and the one below.

Well, but we need to do something to force people onto schedutil,
otherwise we'll get more crap like this thread.

Can we take the choice away? Only let Kconfig select which governors are
available and then set the default ourselves? I mean, the end goal being
to not have selectable governors at all, this seems like a good step
anyway.

---

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 2c7171e0b001..3a9f54b382c0 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -34,77 +34,6 @@ config CPU_FREQ_STAT

If in doubt, say N.

-choice
- prompt "Default CPUFreq governor"
- default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
- default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
- default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
- default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
- help
- This option sets which CPUFreq governor shall be loaded at
- startup. If in doubt, use the default setting.
-
-config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
- bool "performance"
- select CPU_FREQ_GOV_PERFORMANCE
- help
- Use the CPUFreq governor 'performance' as default. This sets
- the frequency statically to the highest frequency supported by
- the CPU.
-
-config CPU_FREQ_DEFAULT_GOV_POWERSAVE
- bool "powersave"
- select CPU_FREQ_GOV_POWERSAVE
- help
- Use the CPUFreq governor 'powersave' as default. This sets
- the frequency statically to the lowest frequency supported by
- the CPU.
-
-config CPU_FREQ_DEFAULT_GOV_USERSPACE
- bool "userspace"
- select CPU_FREQ_GOV_USERSPACE
- help
- Use the CPUFreq governor 'userspace' as default. This allows
- you to set the CPU frequency manually or when a userspace
- program shall be able to set the CPU dynamically without having
- to enable the userspace governor manually.
-
-config CPU_FREQ_DEFAULT_GOV_ONDEMAND
- bool "ondemand"
- select CPU_FREQ_GOV_ONDEMAND
- select CPU_FREQ_GOV_PERFORMANCE
- help
- Use the CPUFreq governor 'ondemand' as default. This allows
- you to get a full dynamic frequency capable system by simply
- loading your cpufreq low-level hardware driver.
- Be aware that not all cpufreq drivers support the ondemand
- governor. If unsure have a look at the help section of the
- driver. Fallback governor will be the performance governor.
-
-config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
- bool "conservative"
- select CPU_FREQ_GOV_CONSERVATIVE
- select CPU_FREQ_GOV_PERFORMANCE
- help
- Use the CPUFreq governor 'conservative' as default. This allows
- you to get a full dynamic frequency capable system by simply
- loading your cpufreq low-level hardware driver.
- Be aware that not all cpufreq drivers support the conservative
- governor. If unsure have a look at the help section of the
- driver. Fallback governor will be the performance governor.
-
-config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
- bool "schedutil"
- depends on SMP
- select CPU_FREQ_GOV_SCHEDUTIL
- select CPU_FREQ_GOV_PERFORMANCE
- help
- Use the 'schedutil' CPUFreq governor by default. If unsure,
- have a look at the help section of that governor. The fallback
- governor will be 'performance'.
-
-endchoice
-
config CPU_FREQ_GOV_PERFORMANCE
tristate "'performance' governor"
help
@@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE
config CPU_FREQ_GOV_ONDEMAND
tristate "'ondemand' cpufreq policy governor"
select CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_PERFORMANCE
help
'ondemand' - This driver adds a dynamic cpufreq policy governor.
The governor does a periodic polling and
@@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE
tristate "'conservative' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_PERFORMANCE
help
'conservative' - this driver is rather similar to the 'ondemand'
governor both in its source code and its purpose, the difference is
@@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL
bool "'schedutil' cpufreq policy governor"
depends on CPU_FREQ && SMP
select CPU_FREQ_GOV_ATTR_SET
+ select CPU_FREQ_GOV_PERFORMANCE
select IRQ_WORK
help
This governor makes decisions based on the utilization data provided
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 1877f5e2e5b0..6848e3337b65 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
return NULL;
}

+static struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ struct cpufreq_governor *gov = NULL;
+
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+ gov = find_governor("schedutil");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
+ gov = find_governor("ondemand");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
+ gov = find_governor("conservative");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+ gov = find_governor("userspace");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
+ gov = find_governor("powersave");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
+ gov = find_governor("performance");
+ if (gov)
+ return gov;
+#endif
+
+ return gov;
+}
+
static struct cpufreq_governor *get_governor(const char *str_governor)
{
struct cpufreq_governor *t;
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index aa39ff31ec9f..25449a1ee954 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -330,12 +330,5 @@ MODULE_DESCRIPTION("'cpufreq_conservative' - A dynamic cpufreq governor for "
"optimised for use in a battery environment");
MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &CPU_FREQ_GOV_CONSERVATIVE;
-}
-#endif
-
cpufreq_governor_init(CPU_FREQ_GOV_CONSERVATIVE);
cpufreq_governor_exit(CPU_FREQ_GOV_CONSERVATIVE);
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index ac361a8b1d3b..2a7fd645e1fb 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -467,12 +467,5 @@ MODULE_DESCRIPTION("'cpufreq_ondemand' - A dynamic cpufreq governor for "
"Low Latency Frequency Transition capable processors");
MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &CPU_FREQ_GOV_ONDEMAND;
-}
-#endif
-
cpufreq_governor_init(CPU_FREQ_GOV_ONDEMAND);
cpufreq_governor_exit(CPU_FREQ_GOV_ONDEMAND);
diff --git a/drivers/cpufreq/cpufreq_performance.c b/drivers/cpufreq/cpufreq_performance.c
index 71c1d9aba772..c6c7473a3a73 100644
--- a/drivers/cpufreq/cpufreq_performance.c
+++ b/drivers/cpufreq/cpufreq_performance.c
@@ -23,12 +23,6 @@ static struct cpufreq_governor cpufreq_gov_performance = {
.limits = cpufreq_gov_performance_limits,
};

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &cpufreq_gov_performance;
-}
-#endif
#ifndef CONFIG_CPU_FREQ_GOV_PERFORMANCE_MODULE
struct cpufreq_governor *cpufreq_fallback_governor(void)
{
diff --git a/drivers/cpufreq/cpufreq_powersave.c b/drivers/cpufreq/cpufreq_powersave.c
index 7749522355b5..8b9555f70e9d 100644
--- a/drivers/cpufreq/cpufreq_powersave.c
+++ b/drivers/cpufreq/cpufreq_powersave.c
@@ -27,12 +27,5 @@ MODULE_AUTHOR("Dominik Brodowski <[email protected]>");
MODULE_DESCRIPTION("CPUfreq policy governor 'powersave'");
MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &cpufreq_gov_powersave;
-}
-#endif
-
cpufreq_governor_init(cpufreq_gov_powersave);
cpufreq_governor_exit(cpufreq_gov_powersave);
diff --git a/drivers/cpufreq/cpufreq_userspace.c b/drivers/cpufreq/cpufreq_userspace.c
index 50a4d7846580..48ce53038ac0 100644
--- a/drivers/cpufreq/cpufreq_userspace.c
+++ b/drivers/cpufreq/cpufreq_userspace.c
@@ -131,12 +131,5 @@ MODULE_AUTHOR("Dominik Brodowski <[email protected]>, "
MODULE_DESCRIPTION("CPUfreq policy governor 'userspace'");
MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &cpufreq_gov_userspace;
-}
-#endif
-
cpufreq_governor_init(cpufreq_gov_userspace);
cpufreq_governor_exit(cpufreq_gov_userspace);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5ae7b4e6e8d6..3b77e408377a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -888,13 +888,6 @@ struct cpufreq_governor schedutil_gov = {
.limits = sugov_limits,
};

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &schedutil_gov;
-}
-#endif
-
cpufreq_governor_init(schedutil_gov);

#ifdef CONFIG_ENERGY_MODEL

2020-10-22 12:21:50

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Wednesday, October 21, 2020 2:52:03 PM CEST Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 02:42:20PM +0200, Julia Lawall wrote:
> >
> >
> > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> >
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense. Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> >
> > The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a small
> > bug. I have a Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not
> > have the HWP feature, even though the cores seemed to be able to change
> > their frequencies at the hardware level.
>
> That just makes intel_pstate not prefer active mode. With the clear
> intent that it should then go use schedutil, but somehow it looks like
> you landed on ondemand, which is absolutely atrocious.

Probably, the "default governor" setting was inherited from the old config.

It didn't matter when the active mode was used by default, because it only
recognizes the "powersave" and "performance" settings (and "ondemand" causes
it to fall back to "powersave" IIRC).

Cheers!



2020-10-22 16:16:28

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On 22-10-20, 09:11, Peter Zijlstra wrote:
> Well, but we need to do something to force people onto schedutil,
> otherwise we'll get more crap like this thread.
>
> Can we take the choice away? Only let Kconfig select which governors are
> available and then set the default ourselves? I mean, the end goal being
> to not have selectable governors at all, this seems like a good step
> anyway.

Just to clarify and complete the point a bit here, the users can still
pass the default governor from cmdline using
cpufreq.default_governor=, which will take precedence over the one the
below code is playing with. And later once the kernel is up, they can
still choose a different governor from userspace.

> ---
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 2c7171e0b001..3a9f54b382c0 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -34,77 +34,6 @@ config CPU_FREQ_STAT
>
> If in doubt, say N.
>
> -choice
> - prompt "Default CPUFreq governor"
> - default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
> - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
> - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
> - default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> - help
> - This option sets which CPUFreq governor shall be loaded at
> - startup. If in doubt, use the default setting.
> -
> -config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> - bool "performance"
> - select CPU_FREQ_GOV_PERFORMANCE
> - help
> - Use the CPUFreq governor 'performance' as default. This sets
> - the frequency statically to the highest frequency supported by
> - the CPU.
> -
> -config CPU_FREQ_DEFAULT_GOV_POWERSAVE
> - bool "powersave"
> - select CPU_FREQ_GOV_POWERSAVE
> - help
> - Use the CPUFreq governor 'powersave' as default. This sets
> - the frequency statically to the lowest frequency supported by
> - the CPU.
> -
> -config CPU_FREQ_DEFAULT_GOV_USERSPACE
> - bool "userspace"
> - select CPU_FREQ_GOV_USERSPACE
> - help
> - Use the CPUFreq governor 'userspace' as default. This allows
> - you to set the CPU frequency manually or when a userspace
> - program shall be able to set the CPU dynamically without having
> - to enable the userspace governor manually.
> -
> -config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> - bool "ondemand"
> - select CPU_FREQ_GOV_ONDEMAND
> - select CPU_FREQ_GOV_PERFORMANCE
> - help
> - Use the CPUFreq governor 'ondemand' as default. This allows
> - you to get a full dynamic frequency capable system by simply
> - loading your cpufreq low-level hardware driver.
> - Be aware that not all cpufreq drivers support the ondemand
> - governor. If unsure have a look at the help section of the
> - driver. Fallback governor will be the performance governor.
> -
> -config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> - bool "conservative"
> - select CPU_FREQ_GOV_CONSERVATIVE
> - select CPU_FREQ_GOV_PERFORMANCE
> - help
> - Use the CPUFreq governor 'conservative' as default. This allows
> - you to get a full dynamic frequency capable system by simply
> - loading your cpufreq low-level hardware driver.
> - Be aware that not all cpufreq drivers support the conservative
> - governor. If unsure have a look at the help section of the
> - driver. Fallback governor will be the performance governor.
> -
> -config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
> - bool "schedutil"
> - depends on SMP
> - select CPU_FREQ_GOV_SCHEDUTIL
> - select CPU_FREQ_GOV_PERFORMANCE
> - help
> - Use the 'schedutil' CPUFreq governor by default. If unsure,
> - have a look at the help section of that governor. The fallback
> - governor will be 'performance'.
> -
> -endchoice
> -
> config CPU_FREQ_GOV_PERFORMANCE
> tristate "'performance' governor"
> help
> @@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE
> config CPU_FREQ_GOV_ONDEMAND
> tristate "'ondemand' cpufreq policy governor"
> select CPU_FREQ_GOV_COMMON
> + select CPU_FREQ_GOV_PERFORMANCE
> help
> 'ondemand' - This driver adds a dynamic cpufreq policy governor.
> The governor does a periodic polling and
> @@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE
> tristate "'conservative' cpufreq governor"
> depends on CPU_FREQ
> select CPU_FREQ_GOV_COMMON
> + select CPU_FREQ_GOV_PERFORMANCE
> help
> 'conservative' - this driver is rather similar to the 'ondemand'
> governor both in its source code and its purpose, the difference is
> @@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL
> bool "'schedutil' cpufreq policy governor"
> depends on CPU_FREQ && SMP
> select CPU_FREQ_GOV_ATTR_SET
> + select CPU_FREQ_GOV_PERFORMANCE

And I am not really sure why we always wanted this backup performance
governor to be there unless the said governors are built as module.

> select IRQ_WORK
> help
> This governor makes decisions based on the utilization data provided
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 1877f5e2e5b0..6848e3337b65 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
> return NULL;
> }
>
> +static struct cpufreq_governor *cpufreq_default_governor(void)
> +{
> + struct cpufreq_governor *gov = NULL;
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> + gov = find_governor("schedutil");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
> + gov = find_governor("ondemand");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
> + gov = find_governor("conservative");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
> + gov = find_governor("userspace");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
> + gov = find_governor("powersave");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
> + gov = find_governor("performance");
> + if (gov)
> + return gov;
> +#endif
> +
> + return gov;
> +}

I think that would be fine with me. Though we would be required to
update couple of defconfigs here to make sure they keep working the
way they wanted to.

--
viresh

2020-10-22 16:28:29

by walter harms

[permalink] [raw]
Subject: AW: [PATCH] sched/fair: check for idle core

To avoid this #ifdef jungle ..
I guess it would be save to search for a governor even
ones that are not enabled.

and a second thing: can we change the subject please ?

jm2c
wh



________________________________________
Von: Peter Zijlstra [[email protected]]
Gesendet: Donnerstag, 22. Oktober 2020 09:11
An: Rafael J. Wysocki
Cc: Julia Lawall; Mel Gorman; Ingo Molnar; [email protected]; Juri Lelli; Vincent Guittot; Dietmar Eggemann; Steven Rostedt; Ben Segall; Daniel Bristot de Oliveira; [email protected]; Valentin Schneider; Gilles Muller; [email protected]; [email protected]
Betreff: Re: [PATCH] sched/fair: check for idle core

On Wed, Oct 21, 2020 at 08:11:59PM +0200, Rafael J. Wysocki wrote:

> > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE
> >
> > config CPU_FREQ_GOV_ONDEMAND
> > tristate "'ondemand' cpufreq policy governor"
> > + depends on !SMP
>
> But I don't think that we can do this and the one below.

Well, but we need to do something to force people onto schedutil,
otherwise we'll get more crap like this thread.

Can we take the choice away? Only let Kconfig select which governors are
available and then set the default ourselves? I mean, the end goal being
to not have selectable governors at all, this seems like a good step
anyway.

---

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 2c7171e0b001..3a9f54b382c0 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -34,77 +34,6 @@ config CPU_FREQ_STAT

If in doubt, say N.

-choice
- prompt "Default CPUFreq governor"
- default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
- default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
- default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
- default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
- help
- This option sets which CPUFreq governor shall be loaded at
- startup. If in doubt, use the default setting.
-
-config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
- bool "performance"
- select CPU_FREQ_GOV_PERFORMANCE
- help
- Use the CPUFreq governor 'performance' as default. This sets
- the frequency statically to the highest frequency supported by
- the CPU.
-
-config CPU_FREQ_DEFAULT_GOV_POWERSAVE
- bool "powersave"
- select CPU_FREQ_GOV_POWERSAVE
- help
- Use the CPUFreq governor 'powersave' as default. This sets
- the frequency statically to the lowest frequency supported by
- the CPU.
-
-config CPU_FREQ_DEFAULT_GOV_USERSPACE
- bool "userspace"
- select CPU_FREQ_GOV_USERSPACE
- help
- Use the CPUFreq governor 'userspace' as default. This allows
- you to set the CPU frequency manually or when a userspace
- program shall be able to set the CPU dynamically without having
- to enable the userspace governor manually.
-
-config CPU_FREQ_DEFAULT_GOV_ONDEMAND
- bool "ondemand"
- select CPU_FREQ_GOV_ONDEMAND
- select CPU_FREQ_GOV_PERFORMANCE
- help
- Use the CPUFreq governor 'ondemand' as default. This allows
- you to get a full dynamic frequency capable system by simply
- loading your cpufreq low-level hardware driver.
- Be aware that not all cpufreq drivers support the ondemand
- governor. If unsure have a look at the help section of the
- driver. Fallback governor will be the performance governor.
-
-config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
- bool "conservative"
- select CPU_FREQ_GOV_CONSERVATIVE
- select CPU_FREQ_GOV_PERFORMANCE
- help
- Use the CPUFreq governor 'conservative' as default. This allows
- you to get a full dynamic frequency capable system by simply
- loading your cpufreq low-level hardware driver.
- Be aware that not all cpufreq drivers support the conservative
- governor. If unsure have a look at the help section of the
- driver. Fallback governor will be the performance governor.
-
-config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
- bool "schedutil"
- depends on SMP
- select CPU_FREQ_GOV_SCHEDUTIL
- select CPU_FREQ_GOV_PERFORMANCE
- help
- Use the 'schedutil' CPUFreq governor by default. If unsure,
- have a look at the help section of that governor. The fallback
- governor will be 'performance'.
-
-endchoice
-
config CPU_FREQ_GOV_PERFORMANCE
tristate "'performance' governor"
help
@@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE
config CPU_FREQ_GOV_ONDEMAND
tristate "'ondemand' cpufreq policy governor"
select CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_PERFORMANCE
help
'ondemand' - This driver adds a dynamic cpufreq policy governor.
The governor does a periodic polling and
@@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE
tristate "'conservative' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
+ select CPU_FREQ_GOV_PERFORMANCE
help
'conservative' - this driver is rather similar to the 'ondemand'
governor both in its source code and its purpose, the difference is
@@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL
bool "'schedutil' cpufreq policy governor"
depends on CPU_FREQ && SMP
select CPU_FREQ_GOV_ATTR_SET
+ select CPU_FREQ_GOV_PERFORMANCE
select IRQ_WORK
help
This governor makes decisions based on the utilization data provided
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 1877f5e2e5b0..6848e3337b65 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
return NULL;
}

+static struct cpufreq_governor *cpufreq_default_governor(void)
+{
+ struct cpufreq_governor *gov = NULL;
+
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+ gov = find_governor("schedutil");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
+ gov = find_governor("ondemand");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
+ gov = find_governor("conservative");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+ gov = find_governor("userspace");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
+ gov = find_governor("powersave");
+ if (gov)
+ return gov;
+#endif
+
+#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
+ gov = find_governor("performance");
+ if (gov)
+ return gov;
+#endif
+
+ return gov;
+}
+
static struct cpufreq_governor *get_governor(const char *str_governor)
{
struct cpufreq_governor *t;
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index aa39ff31ec9f..25449a1ee954 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -330,12 +330,5 @@ MODULE_DESCRIPTION("'cpufreq_conservative' - A dynamic cpufreq governor for "
"optimised for use in a battery environment");
MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &CPU_FREQ_GOV_CONSERVATIVE;
-}
-#endif
-
cpufreq_governor_init(CPU_FREQ_GOV_CONSERVATIVE);
cpufreq_governor_exit(CPU_FREQ_GOV_CONSERVATIVE);
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index ac361a8b1d3b..2a7fd645e1fb 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -467,12 +467,5 @@ MODULE_DESCRIPTION("'cpufreq_ondemand' - A dynamic cpufreq governor for "
"Low Latency Frequency Transition capable processors");
MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &CPU_FREQ_GOV_ONDEMAND;
-}
-#endif
-
cpufreq_governor_init(CPU_FREQ_GOV_ONDEMAND);
cpufreq_governor_exit(CPU_FREQ_GOV_ONDEMAND);
diff --git a/drivers/cpufreq/cpufreq_performance.c b/drivers/cpufreq/cpufreq_performance.c
index 71c1d9aba772..c6c7473a3a73 100644
--- a/drivers/cpufreq/cpufreq_performance.c
+++ b/drivers/cpufreq/cpufreq_performance.c
@@ -23,12 +23,6 @@ static struct cpufreq_governor cpufreq_gov_performance = {
.limits = cpufreq_gov_performance_limits,
};

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &cpufreq_gov_performance;
-}
-#endif
#ifndef CONFIG_CPU_FREQ_GOV_PERFORMANCE_MODULE
struct cpufreq_governor *cpufreq_fallback_governor(void)
{
diff --git a/drivers/cpufreq/cpufreq_powersave.c b/drivers/cpufreq/cpufreq_powersave.c
index 7749522355b5..8b9555f70e9d 100644
--- a/drivers/cpufreq/cpufreq_powersave.c
+++ b/drivers/cpufreq/cpufreq_powersave.c
@@ -27,12 +27,5 @@ MODULE_AUTHOR("Dominik Brodowski <[email protected]>");
MODULE_DESCRIPTION("CPUfreq policy governor 'powersave'");
MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &cpufreq_gov_powersave;
-}
-#endif
-
cpufreq_governor_init(cpufreq_gov_powersave);
cpufreq_governor_exit(cpufreq_gov_powersave);
diff --git a/drivers/cpufreq/cpufreq_userspace.c b/drivers/cpufreq/cpufreq_userspace.c
index 50a4d7846580..48ce53038ac0 100644
--- a/drivers/cpufreq/cpufreq_userspace.c
+++ b/drivers/cpufreq/cpufreq_userspace.c
@@ -131,12 +131,5 @@ MODULE_AUTHOR("Dominik Brodowski <[email protected]>, "
MODULE_DESCRIPTION("CPUfreq policy governor 'userspace'");
MODULE_LICENSE("GPL");

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &cpufreq_gov_userspace;
-}
-#endif
-
cpufreq_governor_init(cpufreq_gov_userspace);
cpufreq_governor_exit(cpufreq_gov_userspace);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5ae7b4e6e8d6..3b77e408377a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -888,13 +888,6 @@ struct cpufreq_governor schedutil_gov = {
.limits = sugov_limits,
};

-#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
-struct cpufreq_governor *cpufreq_default_governor(void)
-{
- return &schedutil_gov;
-}
-#endif
-
cpufreq_governor_init(schedutil_gov);

#ifdef CONFIG_ENERGY_MODEL

2020-10-22 16:38:43

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> On 22-10-20, 09:11, Peter Zijlstra wrote:
> > Well, but we need to do something to force people onto schedutil,
> > otherwise we'll get more crap like this thread.
> >
> > Can we take the choice away? Only let Kconfig select which governors are
> > available and then set the default ourselves? I mean, the end goal being
> > to not have selectable governors at all, this seems like a good step
> > anyway.
>
> Just to clarify and complete the point a bit here, the users can still
> pass the default governor from cmdline using
> cpufreq.default_governor=, which will take precedence over the one the
> below code is playing with. And later once the kernel is up, they can
> still choose a different governor from userspace.

Right.

Also some people simply set "performance" as the default governor and then
don't touch cpufreq otherwise (the idea is to get everything to the max
freq right away and stay in that mode forever). This still needs to be
possible IMO.

> > ---
> >
> > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> > index 2c7171e0b001..3a9f54b382c0 100644
> > --- a/drivers/cpufreq/Kconfig
> > +++ b/drivers/cpufreq/Kconfig
> > @@ -34,77 +34,6 @@ config CPU_FREQ_STAT
> >
> > If in doubt, say N.
> >
> > -choice
> > - prompt "Default CPUFreq governor"
> > - default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ
> > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
> > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
> > - default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> > - help
> > - This option sets which CPUFreq governor shall be loaded at
> > - startup. If in doubt, use the default setting.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> > - bool "performance"
> > - select CPU_FREQ_GOV_PERFORMANCE
> > - help
> > - Use the CPUFreq governor 'performance' as default. This sets
> > - the frequency statically to the highest frequency supported by
> > - the CPU.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_POWERSAVE
> > - bool "powersave"
> > - select CPU_FREQ_GOV_POWERSAVE
> > - help
> > - Use the CPUFreq governor 'powersave' as default. This sets
> > - the frequency statically to the lowest frequency supported by
> > - the CPU.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_USERSPACE
> > - bool "userspace"
> > - select CPU_FREQ_GOV_USERSPACE
> > - help
> > - Use the CPUFreq governor 'userspace' as default. This allows
> > - you to set the CPU frequency manually or when a userspace
> > - program shall be able to set the CPU dynamically without having
> > - to enable the userspace governor manually.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_ONDEMAND
> > - bool "ondemand"
> > - select CPU_FREQ_GOV_ONDEMAND
> > - select CPU_FREQ_GOV_PERFORMANCE
> > - help
> > - Use the CPUFreq governor 'ondemand' as default. This allows
> > - you to get a full dynamic frequency capable system by simply
> > - loading your cpufreq low-level hardware driver.
> > - Be aware that not all cpufreq drivers support the ondemand
> > - governor. If unsure have a look at the help section of the
> > - driver. Fallback governor will be the performance governor.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> > - bool "conservative"
> > - select CPU_FREQ_GOV_CONSERVATIVE
> > - select CPU_FREQ_GOV_PERFORMANCE
> > - help
> > - Use the CPUFreq governor 'conservative' as default. This allows
> > - you to get a full dynamic frequency capable system by simply
> > - loading your cpufreq low-level hardware driver.
> > - Be aware that not all cpufreq drivers support the conservative
> > - governor. If unsure have a look at the help section of the
> > - driver. Fallback governor will be the performance governor.
> > -
> > -config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
> > - bool "schedutil"
> > - depends on SMP
> > - select CPU_FREQ_GOV_SCHEDUTIL
> > - select CPU_FREQ_GOV_PERFORMANCE
> > - help
> > - Use the 'schedutil' CPUFreq governor by default. If unsure,
> > - have a look at the help section of that governor. The fallback
> > - governor will be 'performance'.
> > -
> > -endchoice
> > -
> > config CPU_FREQ_GOV_PERFORMANCE
> > tristate "'performance' governor"
> > help
> > @@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE
> > config CPU_FREQ_GOV_ONDEMAND
> > tristate "'ondemand' cpufreq policy governor"
> > select CPU_FREQ_GOV_COMMON
> > + select CPU_FREQ_GOV_PERFORMANCE
> > help
> > 'ondemand' - This driver adds a dynamic cpufreq policy governor.
> > The governor does a periodic polling and
> > @@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE
> > tristate "'conservative' cpufreq governor"
> > depends on CPU_FREQ
> > select CPU_FREQ_GOV_COMMON
> > + select CPU_FREQ_GOV_PERFORMANCE
> > help
> > 'conservative' - this driver is rather similar to the 'ondemand'
> > governor both in its source code and its purpose, the difference is
> > @@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL
> > bool "'schedutil' cpufreq policy governor"
> > depends on CPU_FREQ && SMP
> > select CPU_FREQ_GOV_ATTR_SET
> > + select CPU_FREQ_GOV_PERFORMANCE
>
> And I am not really sure why we always wanted this backup performance
> governor to be there unless the said governors are built as module.

Apparently, some old drivers had problems with switching frequencies fast enough
for ondemand to be used with them and the fallback was for those cases. AFAICS.

> > select IRQ_WORK
> > help
> > This governor makes decisions based on the utilization data provided
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 1877f5e2e5b0..6848e3337b65 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
> > return NULL;
> > }
> >
> > +static struct cpufreq_governor *cpufreq_default_governor(void)
> > +{
> > + struct cpufreq_governor *gov = NULL;
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > + gov = find_governor("schedutil");
> > + if (gov)
> > + return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
> > + gov = find_governor("ondemand");
> > + if (gov)
> > + return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
> > + gov = find_governor("conservative");
> > + if (gov)
> > + return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
> > + gov = find_governor("userspace");
> > + if (gov)
> > + return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
> > + gov = find_governor("powersave");
> > + if (gov)
> > + return gov;
> > +#endif
> > +
> > +#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
> > + gov = find_governor("performance");
> > + if (gov)
> > + return gov;
> > +#endif
> > +
> > + return gov;
> > +}
>
> I think that would be fine with me. Though we would be required to
> update couple of defconfigs here to make sure they keep working the
> way they wanted to.

Generally agreed, but see the point about the "performance" governor above.



2020-10-22 20:09:56

by Peter Zijlstra

[permalink] [raw]
Subject: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > Well, but we need to do something to force people onto schedutil,
> > > otherwise we'll get more crap like this thread.
> > >
> > > Can we take the choice away? Only let Kconfig select which governors are
> > > available and then set the default ourselves? I mean, the end goal being
> > > to not have selectable governors at all, this seems like a good step
> > > anyway.
> >
> > Just to clarify and complete the point a bit here, the users can still
> > pass the default governor from cmdline using
> > cpufreq.default_governor=, which will take precedence over the one the
> > below code is playing with. And later once the kernel is up, they can
> > still choose a different governor from userspace.
>
> Right.
>
> Also some people simply set "performance" as the default governor and then
> don't touch cpufreq otherwise (the idea is to get everything to the max
> freq right away and stay in that mode forever). This still needs to be
> possible IMO.

Performance/powersave make sense to keep.

However I do want to retire ondemand, conservative and also very much
intel_pstate/active mode. I also have very little sympathy for
userspace.

We should start by making it hard to use them and eventually just delete
them outright.

2020-10-22 21:23:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

[CC linux-pm and Len]

On Thursday, October 22, 2020 2:02:13 PM CEST Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > > Well, but we need to do something to force people onto schedutil,
> > > > otherwise we'll get more crap like this thread.
> > > >
> > > > Can we take the choice away? Only let Kconfig select which governors are
> > > > available and then set the default ourselves? I mean, the end goal being
> > > > to not have selectable governors at all, this seems like a good step
> > > > anyway.
> > >
> > > Just to clarify and complete the point a bit here, the users can still
> > > pass the default governor from cmdline using
> > > cpufreq.default_governor=, which will take precedence over the one the
> > > below code is playing with. And later once the kernel is up, they can
> > > still choose a different governor from userspace.
> >
> > Right.
> >
> > Also some people simply set "performance" as the default governor and then
> > don't touch cpufreq otherwise (the idea is to get everything to the max
> > freq right away and stay in that mode forever). This still needs to be
> > possible IMO.
>
> Performance/powersave make sense to keep.
>
> However I do want to retire ondemand, conservative and also very much
> intel_pstate/active mode.

I agree in general, but IMO it would not be prudent to do that without making
schedutil provide the same level of performance in all of the relevant use
cases.

> I also have very little sympathy for userspace.

That I completely agree with.

> We should start by making it hard to use them and eventually just delete
> them outright.

Right, but see above: IMO step 0 should be to ensure that schedutil is a viable
replacement for all users.



2020-10-23 08:03:13

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On 22-10-20, 13:45, Rafael J. Wysocki wrote:
> On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > And I am not really sure why we always wanted this backup performance
> > governor to be there unless the said governors are built as module.
>
> Apparently, some old drivers had problems with switching frequencies fast enough
> for ondemand to be used with them and the fallback was for those cases. AFAICS.

Do we still need this ? Or better ask those platforms to individually
enable both of them.

--
viresh

2020-10-23 16:35:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On Friday, October 23, 2020 8:12:46 AM CEST Viresh Kumar wrote:
> On 22-10-20, 13:45, Rafael J. Wysocki wrote:
> > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > And I am not really sure why we always wanted this backup performance
> > > governor to be there unless the said governors are built as module.
> >
> > Apparently, some old drivers had problems with switching frequencies fast enough
> > for ondemand to be used with them and the fallback was for those cases. AFAICS.
>
> Do we still need this ?

For the reasonably modern hardware, I don't think so.

> Or better ask those platforms to individually
> enable both of them.

Bu who knows what they are? :-)



2020-10-27 14:01:46

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: check for idle core

On 23-10-20, 17:06, Rafael J. Wysocki wrote:
> On Friday, October 23, 2020 8:12:46 AM CEST Viresh Kumar wrote:
> > On 22-10-20, 13:45, Rafael J. Wysocki wrote:
> > > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > > And I am not really sure why we always wanted this backup performance
> > > > governor to be there unless the said governors are built as module.
> > >
> > > Apparently, some old drivers had problems with switching frequencies fast enough
> > > for ondemand to be used with them and the fallback was for those cases. AFAICS.
> >
> > Do we still need this ?
>
> For the reasonably modern hardware, I don't think so.
>
> > Or better ask those platforms to individually
> > enable both of them.
>
> Bu who knows what they are? :-)

I was planning to break them and let them complain :)

--
viresh

2020-10-27 14:41:05

by Valentin Schneider

[permalink] [raw]
Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core


On 27/10/20 11:11, Qais Yousef wrote:
> On 10/22/20 14:02, Peter Zijlstra wrote:
>> However I do want to retire ondemand, conservative and also very much
>> intel_pstate/active mode. I also have very little sympathy for
>> userspace.
>
> Userspace is useful for testing and sanity checking. Not sure if people use it
> to measure voltage/current at each frequency to generate
> dynamic-power-coefficient for their platform. Lukasz, Dietmar?
>

It's valuable even just for cpufreq sanity checking - we have that test
that goes through increasing frequencies and asserts the work done is
monotonically increasing. This has been quite useful in the past to detect
broken bits.

That *should* still be totally doable with any other governor by using the
scaling_{min, max}_freq sysfs interface.

> Thanks

2020-10-27 14:42:11

by Qais Yousef

[permalink] [raw]
Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On 10/27/20 11:26, Valentin Schneider wrote:
>
> On 27/10/20 11:11, Qais Yousef wrote:
> > On 10/22/20 14:02, Peter Zijlstra wrote:
> >> However I do want to retire ondemand, conservative and also very much
> >> intel_pstate/active mode. I also have very little sympathy for
> >> userspace.
> >
> > Userspace is useful for testing and sanity checking. Not sure if people use it
> > to measure voltage/current at each frequency to generate
> > dynamic-power-coefficient for their platform. Lukasz, Dietmar?
> >
>
> It's valuable even just for cpufreq sanity checking - we have that test
> that goes through increasing frequencies and asserts the work done is
> monotonically increasing. This has been quite useful in the past to detect
> broken bits.
>
> That *should* still be totally doable with any other governor by using the
> scaling_{min, max}_freq sysfs interface.

True. This effectively makes every governor a potential user space governor.

/me not sure to be happy or grumpy about it

Thanks

--
Qais Yousef

2020-10-28 07:05:29

by Qais Yousef

[permalink] [raw]
Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On 10/22/20 14:02, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > > Well, but we need to do something to force people onto schedutil,
> > > > otherwise we'll get more crap like this thread.
> > > >
> > > > Can we take the choice away? Only let Kconfig select which governors are
> > > > available and then set the default ourselves? I mean, the end goal being
> > > > to not have selectable governors at all, this seems like a good step
> > > > anyway.
> > >
> > > Just to clarify and complete the point a bit here, the users can still
> > > pass the default governor from cmdline using
> > > cpufreq.default_governor=, which will take precedence over the one the
> > > below code is playing with. And later once the kernel is up, they can
> > > still choose a different governor from userspace.
> >
> > Right.
> >
> > Also some people simply set "performance" as the default governor and then
> > don't touch cpufreq otherwise (the idea is to get everything to the max
> > freq right away and stay in that mode forever). This still needs to be
> > possible IMO.
>
> Performance/powersave make sense to keep.
>
> However I do want to retire ondemand, conservative and also very much
> intel_pstate/active mode. I also have very little sympathy for
> userspace.

Userspace is useful for testing and sanity checking. Not sure if people use it
to measure voltage/current at each frequency to generate
dynamic-power-coefficient for their platform. Lukasz, Dietmar?

Thanks

--
Qais Yousef

>
> We should start by making it hard to use them and eventually just delete
> them outright.
>

2020-10-28 07:17:04

by Viresh Kumar

[permalink] [raw]
Subject: Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

On 27-10-20, 11:42, Qais Yousef wrote:
> On 10/27/20 11:26, Valentin Schneider wrote:
> >
> > On 27/10/20 11:11, Qais Yousef wrote:
> > > On 10/22/20 14:02, Peter Zijlstra wrote:
> > >> However I do want to retire ondemand, conservative and also very much
> > >> intel_pstate/active mode. I also have very little sympathy for
> > >> userspace.
> > >
> > > Userspace is useful for testing and sanity checking. Not sure if people use it
> > > to measure voltage/current at each frequency to generate
> > > dynamic-power-coefficient for their platform. Lukasz, Dietmar?
> > >
> >
> > It's valuable even just for cpufreq sanity checking - we have that test
> > that goes through increasing frequencies and asserts the work done is
> > monotonically increasing. This has been quite useful in the past to detect
> > broken bits.
> >
> > That *should* still be totally doable with any other governor by using the
> > scaling_{min, max}_freq sysfs interface.
>
> True. This effectively makes every governor a potential user space governor.
>
> /me not sure to be happy or grumpy about it

Userspace governor should be kept as is, it is very effective to get
unnecessary governor code out of the path when testing basic
functioning of the hardware/driver. It is quite useful when things
don't work as expected.

--
viresh