On ia64, we need runtime control to manage CPU power state in the idle
loop. Logically it means a sysctl entry in /proc/sys/kernel. Even
though this sysctl entry doesn't exist today, lots of arch already has
some sort of API to dynamically enable/disable low power idle state.
Looking at linux-2.6.4, arm, arm26, cris, i386, parisc, sh, um, x86-64
all has very much the same code in each arch. So instead of replicate
another set under arch/ia64, we are proposing these API to be abstracted
out in the generic code. And also add a sysctl entry under /proc/sys/kernel.
Would this be useful to all architecture who wants this features? It
would be a lot less code duplication.
- Ken
diff -Nur linux-2.6.4/include/linux/sysctl.h linux-2.6.4.halt/include/linux/sysctl.h
--- linux-2.6.4/include/linux/sysctl.h 2004-03-10 18:55:28.000000000 -0800
+++ linux-2.6.4.halt/include/linux/sysctl.h 2004-03-17 15:33:30.000000000 -0800
@@ -131,6 +131,7 @@
KERN_PRINTK_RATELIMIT_BURST=61, /* int: tune printk ratelimiting */
KERN_PTY=62, /* dir: pty driver */
KERN_NGROUPS_MAX=63, /* int: NGROUPS_MAX */
+ KERN_LOWPOWER_IDLE=64, /* int: low power idle */
};
diff -Nur linux-2.6.4/kernel/cpu.c linux-2.6.4.halt/kernel/cpu.c
--- linux-2.6.4/kernel/cpu.c 2004-03-10 18:55:44.000000000 -0800
+++ linux-2.6.4.halt/kernel/cpu.c 2004-03-17 15:36:32.000000000 -0800
@@ -64,3 +64,15 @@
up(&cpucontrol);
return ret;
}
+
+atomic_t halt_counter;
+void enable_halt(void)
+{
+ atomic_dec(&halt_counter);
+}
+void disable_halt(void)
+{
+ atomic_inc(&halt_counter);
+}
+EXPORT_SYMBOL(enable_halt);
+EXPORT_SYMBOL(disable_halt);
diff -Nur linux-2.6.4/kernel/sysctl.c linux-2.6.4.halt/kernel/sysctl.c
--- linux-2.6.4/kernel/sysctl.c 2004-03-10 18:55:22.000000000 -0800
+++ linux-2.6.4.halt/kernel/sysctl.c 2004-03-17 15:34:52.000000000 -0800
@@ -64,6 +64,7 @@
extern int min_free_kbytes;
extern int printk_ratelimit_jiffies;
extern int printk_ratelimit_burst;
+extern atomic_t halt_counter;
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
@@ -615,6 +616,14 @@
.mode = 0444,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = KERN_LOWPOWER_IDLE,
+ .procname = "lowpower_idle",
+ .data = &halt_counter,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
{ .ctl_name = 0 }
};
"Kenneth Chen" <[email protected]> wrote:
>
> On ia64, we need runtime control to manage CPU power state in the idle
> loop.
Can you expand on this? Does this mean that the admin can select different
idle-loop algorithms? If so, what alternative algorithms exist?
On x86 I'd like to be able to turn idle=poll on when performing oprofile
run, so the numbers come out right. Will this let me do that?
> Logically it means a sysctl entry in /proc/sys/kernel.
Yes, but the *meanings* of the different values of that sysctl need to be
defined, and documented. If lowpower_idle=42 has a totally different
meaning on different architectures then that's unfortunate but
understandable. But we should at least enumerate the different values and
try to get different architectures to honour `42' in the same way.
> +atomic_t halt_counter;
Needs to be initialised - atomic_t's may have spinlocks inside them or
anything else.
> +extern atomic_t halt_counter;
Needs to be in a header, not in .c
> /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
> static int maxolduid = 65535;
> @@ -615,6 +616,14 @@
> .mode = 0444,
> .proc_handler = &proc_dointvec,
> },
> + {
> + .ctl_name = KERN_LOWPOWER_IDLE,
> + .procname = "lowpower_idle",
> + .data = &halt_counter,
> + .maxlen = sizeof (int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
> { .ctl_name = 0 }
> };
You cannot treat an int* as an atomic_t*!
Why do we want to inc and dec a user-specified tunable anyway? I think I
don't understand what you're trying to do with this patch.
>>>>> Andrew Morton on Wednesday, March 17, 2004 5:05 PM
> >
> > On ia64, we need runtime control to manage CPU power state in the
> > idle loop.
>
> Can you expand on this?
If architecture provides a facility for low power state, we would like
to turn that on in the idle loop to conserve power. However, in some
specific situation like for performance, it is desired to be off at
least during that period of time. A runtime control would allow power
state to be managed dynamically to accommodate both.
That's what we are trying to do: to have a sysctl to control whether
CPU goes into low power state or not in the default_idle() loop. In the
generic code, kernel provides a mechanism to set/clear a flag, and in each
arch, we can then test the flag before entering into low power state.
> Does this mean that the admin can select
> different idle-loop algorithms? If so, what alternative algorithms exist?
This patch isn't that fancy, nice feature but maybe next step :-)
> > Logically it means a sysctl entry in /proc/sys/kernel.
> Yes, but the *meanings* of the different values of that sysctl need
> to be defined, and documented. If lowpower_idle=42 has a totally
> different meaning on different architectures then that's unfortunate
> but understandable. But we should at least enumerate the different
> values and try to get different architectures to honour `42' in the
> same way.
Writing to sysctl should be a bool, reading the value can be number of
module currently disabled low power idle. I think the original intent
is to use ref count for enabling/disabling. (granted, we copied the
code from other arch).
> Needs to be initialised - atomic_t's may have spinlocks inside them or
> anything else.
>
> Needs to be in a header, not in .c
>
> You cannot treat an int* as an atomic_t*!
My monkey work, must be not having enough coffee today :-P
- Ken
"Kenneth Chen" <[email protected]> wrote:
>
> > > Logically it means a sysctl entry in /proc/sys/kernel.
> > Yes, but the *meanings* of the different values of that sysctl need
> > to be defined, and documented. If lowpower_idle=42 has a totally
> > different meaning on different architectures then that's unfortunate
> > but understandable. But we should at least enumerate the different
> > values and try to get different architectures to honour `42' in the
> > same way.
>
> Writing to sysctl should be a bool, reading the value can be number of
> module currently disabled low power idle. I think the original intent
> is to use ref count for enabling/disabling. (granted, we copied the
> code from other arch).
OK, so why not give us:
#define IDLE_HALT 0
#define IDLE_POLL 1
#define IDLE_SUPER_LOW_POWER_HALT 2
and so forth (are there any others?).
Set some system-wide integer via a sysctl and let the particular
architecture decide how best to implement the currently-selected idle mode?
On Wed, 17 Mar 2004, Andrew Morton wrote:
> "Kenneth Chen" <[email protected]> wrote:
> >
> > > > Logically it means a sysctl entry in /proc/sys/kernel.
> > > Yes, but the *meanings* of the different values of that sysctl need
> > > to be defined, and documented. If lowpower_idle=42 has a totally
> > > different meaning on different architectures then that's unfortunate
> > > but understandable. But we should at least enumerate the different
> > > values and try to get different architectures to honour `42' in the
> > > same way.
> >
> > Writing to sysctl should be a bool, reading the value can be number of
> > module currently disabled low power idle. I think the original intent
> > is to use ref count for enabling/disabling. (granted, we copied the
> > code from other arch).
>
> OK, so why not give us:
>
> #define IDLE_HALT 0
> #define IDLE_POLL 1
> #define IDLE_SUPER_LOW_POWER_HALT 2
>
> and so forth (are there any others?).
>
> Set some system-wide integer via a sysctl and let the particular
> architecture decide how best to implement the currently-selected idle mode?
I'm wondering whether the setting of these magic numbers can't be done
using cpufreq infrastructure.
<snip>
> OK, so why not give us:
>
> #define IDLE_HALT 0
> #define IDLE_POLL 1
> #define IDLE_SUPER_LOW_POWER_HALT 2
>
> and so forth (are there any others?).
#define IDLE_C1HALT ?
I created another one for Athlon Nforce2 to prevent lockups in apic mode.
It is proving more useful than I thought.
It has also been used on SIS740 to prevent same problem.
I know such a workaround should not be required but it works well.
It modifies C1 state idle behaviour by being a little more intelligent about
when it is worthwhile to do into disconnect and has a crude but effective
delay in case of back to back disconnect reconnect cycles.
Recent post.
http://linux.derkeiler.com/Mailing-Lists/Kernel/2004-03/4278.html
Patch is here.
http://linux.derkeiler.com/Mailing-Lists/Kernel/2004-02/6520.html
Currently it is kernel arg activated by "idle=C1halt".
>
> Set some system-wide integer via a sysctl and let the particular
> architecture decide how best to implement the currently-selected idle mode?
A lockup detector on Athlon systems could conceivably invoke above idle state
after a manual reboot as not all systems of the same chipset have the problem.
Ross.
On Wed, Mar 17, 2004 at 10:40:31PM -0500, Zwane Mwaikambo wrote:
> On Wed, 17 Mar 2004, Andrew Morton wrote:
>
> > "Kenneth Chen" <[email protected]> wrote:
> > >
> > > > > Logically it means a sysctl entry in /proc/sys/kernel.
> > > > Yes, but the *meanings* of the different values of that sysctl need
> > > > to be defined, and documented. If lowpower_idle=42 has a totally
> > > > different meaning on different architectures then that's unfortunate
> > > > but understandable. But we should at least enumerate the different
> > > > values and try to get different architectures to honour `42' in the
> > > > same way.
> > >
> > > Writing to sysctl should be a bool, reading the value can be number of
> > > module currently disabled low power idle. I think the original intent
> > > is to use ref count for enabling/disabling. (granted, we copied the
> > > code from other arch).
I assume ia64 does idling using the ACPI processor.c driver? If so, couldn't
writing to /proc/acpi/processor/./power be an option?
> > OK, so why not give us:
> >
> > #define IDLE_HALT 0
> > #define IDLE_POLL 1
> > #define IDLE_SUPER_LOW_POWER_HALT 2
> >
> > and so forth (are there any others?).
> >
> > Set some system-wide integer via a sysctl and let the particular
> > architecture decide how best to implement the currently-selected idle mode?
>
> I'm wondering whether the setting of these magic numbers can't be done
> using cpufreq infrastructure.
I doubt it -- there's no ia64 cpufreq driver anyway, and cpufreq is about
frequency scaling and (sometimes) throttling, but not "idling". And
"idling" is a too different implementation anyways.
Dominik
* Andrew Morton <[email protected]>:
> "Kenneth Chen" <[email protected]> wrote:
>>
>> On ia64, we need runtime control to manage CPU power state in the idle
>> loop.
>
> Can you expand on this? Does this mean that the admin can select different
> idle-loop algorithms? If so, what alternative algorithms exist?
At least on 760MPX chipset, when Athlon is pushed into powersaving mode
(disconnected from PCI bus) great deal of graphical distortions are
introduced into bttv -based card's picture.
I've dealt with this by rmmod:ing amd76x_pm -module (this action disables
powersaving mode), but some sane API for disabling these disturbancies
would be much better.
--
Psi -- <http://www.iki.fi/pasi.savolainen>
>>>>> Dominik Brodowski wrote on Thu, March 18, 2004 1:05 AM
> I assume ia64 does idling using the ACPI processor.c driver?
No, not really.
> If so, couldn't writing to /proc/acpi/processor/./power be
> an option?
Not all platform has ACPI support, so going through ACPI isn't
generic enough.
>>>>> Andrew Morton wrote on Wed, March 17, 2004 7:28 PM
> > "Kenneth Chen" <[email protected]> wrote:
> >
> > Writing to sysctl should be a bool, reading the value can be number of
> > module currently disabled low power idle. I think the original intent
> > is to use ref count for enabling/disabling. (granted, we copied the
> > code from other arch).
>
> OK, so why not give us:
>
> #define IDLE_HALT 0
> #define IDLE_POLL 1
> #define IDLE_SUPER_LOW_POWER_HALT 2
>
> and so forth (are there any others?).
>
> Set some system-wide integer via a sysctl and let the particular
> architecture decide how best to implement the currently-selected
> idle mode?
Sounds good, Thanks for the suggestion. I just coded it up:
diff -Nur linux-2.6.4/include/linux/cpu.h linux-2.6.4.halt/include/linux/cpu.h
--- linux-2.6.4/include/linux/cpu.h 2004-03-10 18:55:23.000000000 -0800
+++ linux-2.6.4.halt/include/linux/cpu.h 2004-03-18 13:47:43.000000000 -0800
@@ -52,6 +52,12 @@
#endif /* CONFIG_SMP */
extern struct sysdev_class cpu_sysdev_class;
+extern int idle_mode;
+
+#define IDLE_NOOP 0
+#define IDLE_HALT 1
+#define IDLE_POLL 2
+#define IDLE_ACPI 3
#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */
diff -Nur linux-2.6.4/include/linux/sysctl.h linux-2.6.4.halt/include/linux/sysctl.h
--- linux-2.6.4/include/linux/sysctl.h 2004-03-10 18:55:28.000000000 -0800
+++ linux-2.6.4.halt/include/linux/sysctl.h 2004-03-18 12:00:40.000000000 -0800
@@ -131,6 +131,7 @@
KERN_PRINTK_RATELIMIT_BURST=61, /* int: tune printk ratelimiting */
KERN_PTY=62, /* dir: pty driver */
KERN_NGROUPS_MAX=63, /* int: NGROUPS_MAX */
+ KERN_IDLE_MODE=64, /* int: arch specific cpu idle mode */
};
diff -Nur linux-2.6.4/kernel/cpu.c linux-2.6.4.halt/kernel/cpu.c
--- linux-2.6.4/kernel/cpu.c 2004-03-10 18:55:44.000000000 -0800
+++ linux-2.6.4.halt/kernel/cpu.c 2004-03-18 13:29:28.000000000 -0800
@@ -64,3 +64,5 @@
up(&cpucontrol);
return ret;
}
+
+int idle_mode = IDLE_HALT;
diff -Nur linux-2.6.4/kernel/sysctl.c linux-2.6.4.halt/kernel/sysctl.c
--- linux-2.6.4/kernel/sysctl.c 2004-03-10 18:55:22.000000000 -0800
+++ linux-2.6.4.halt/kernel/sysctl.c 2004-03-18 13:52:27.000000000 -0800
@@ -39,6 +39,7 @@
#include <linux/initrd.h>
#include <linux/times.h>
#include <linux/limits.h>
+#include <linux/cpu.h>
#include <asm/uaccess.h>
#ifdef CONFIG_ROOT_NFS
@@ -615,6 +616,14 @@
.mode = 0444,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = KERN_IDLE_MODE,
+ .procname = "idle_mode",
+ .data = &idle_mode,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
{ .ctl_name = 0 }
};
"Kenneth Chen" <[email protected]> wrote:
>
> >>>>> Andrew Morton wrote on Wed, March 17, 2004 7:28 PM
> > > "Kenneth Chen" <[email protected]> wrote:
> > >
> > > Writing to sysctl should be a bool, reading the value can be number of
> > > module currently disabled low power idle. I think the original intent
> > > is to use ref count for enabling/disabling. (granted, we copied the
> > > code from other arch).
> >
> > OK, so why not give us:
> >
> > #define IDLE_HALT 0
> > #define IDLE_POLL 1
> > #define IDLE_SUPER_LOW_POWER_HALT 2
> >
> > and so forth (are there any others?).
> >
> > Set some system-wide integer via a sysctl and let the particular
> > architecture decide how best to implement the currently-selected
> > idle mode?
>
>
> Sounds good, Thanks for the suggestion. I just coded it up:
>
Looks fine, thanks. I'll queue that up pending some code which actually
uses it. And the obligatory update to Documentation/kernel-parameters.txt ;)
Zwane Mwaikambo wrote:
>>Set some system-wide integer via a sysctl and let the particular
>>architecture decide how best to implement the currently-selected idle mode?
>
> I'm wondering whether the setting of these magic numbers can't be done
> using cpufreq infrastructure.
I'd vote for using Patrick Mochel's PM subsystem and use a standard set
of identifiers that are mapped to a platform-specific idle behavior, in
much the same way as platform suspend modes are handled today. For
example, strings echoed to /sys/power/idle could be an interface. If
folks are amenable to this I'd be happy to supply a (generic) patch for it.
--
Todd Poynor
MontaVista Software
Todd Poynor <[email protected]> wrote:
>
> Zwane Mwaikambo wrote:
>
> >>Set some system-wide integer via a sysctl and let the particular
> >>architecture decide how best to implement the currently-selected idle mode?
> >
> > I'm wondering whether the setting of these magic numbers can't be done
> > using cpufreq infrastructure.
>
> I'd vote for using Patrick Mochel's PM subsystem and use a standard set
> of identifiers that are mapped to a platform-specific idle behavior, in
> much the same way as platform suspend modes are handled today. For
> example, strings echoed to /sys/power/idle could be an interface. If
> folks are amenable to this I'd be happy to supply a (generic) patch for it.
That sounds suitable, thanks.
On Thu, 18 Mar 2004, Todd Poynor wrote:
> Zwane Mwaikambo wrote:
>
> >>Set some system-wide integer via a sysctl and let the particular
> >>architecture decide how best to implement the currently-selected idle mode?
> >
> > I'm wondering whether the setting of these magic numbers can't be done
> > using cpufreq infrastructure.
>
> I'd vote for using Patrick Mochel's PM subsystem and use a standard set
> of identifiers that are mapped to a platform-specific idle behavior, in
> much the same way as platform suspend modes are handled today. For
> example, strings echoed to /sys/power/idle could be an interface. If
> folks are amenable to this I'd be happy to supply a (generic) patch for it.
Ta, sounds good, give us a look.
Hi!
> Sounds good, Thanks for the suggestion. I just coded it up:
>
>
> diff -Nur linux-2.6.4/include/linux/cpu.h linux-2.6.4.halt/include/linux/cpu.h
> --- linux-2.6.4/include/linux/cpu.h 2004-03-10 18:55:23.000000000 -0800
> +++ linux-2.6.4.halt/include/linux/cpu.h 2004-03-18 13:47:43.000000000 -0800
> @@ -52,6 +52,12 @@
>
> #endif /* CONFIG_SMP */
> extern struct sysdev_class cpu_sysdev_class;
> +extern int idle_mode;
> +
> +#define IDLE_NOOP 0
> +#define IDLE_HALT 1
> +#define IDLE_POLL 2
> +#define IDLE_ACPI 3
>
How is idle_noop different from idle_poll?
idle_halt is equivalent to idle_acpi_C1. But acpi supports also C2
(deeper sleep), and C3 (sleep without coherent caches) and newer
machines support even more. You might want to talk to Len Brown.
[And yes, limiting to C2 (for example) *is* usefull; some machines
(nforce2 iirc) have bugs, and die if you do C3 at wrong time].
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
Hi!
> On ia64, we need runtime control to manage CPU power state in the idle
> loop. Logically it means a sysctl entry in /proc/sys/kernel. Even
> though this sysctl entry doesn't exist today, lots of arch already has
> some sort of API to dynamically enable/disable low power idle state.
If you make it "max Cx state to allow", it will be usefull for ACPI people, too...
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
>>>>> Pavel Machek wrote on Wed, March 24, 2004 1:54 AM
> > +extern int idle_mode;
> > +
> > +#define IDLE_NOOP 0
> > +#define IDLE_HALT 1
> > +#define IDLE_POLL 2
> > +#define IDLE_ACPI 3
> >
>
> How is idle_noop different from idle_poll?
I was thinking idle_noop truly does nothing at all, versus idle_poll
which optimize cross cpu wakeup.
>>>>> Todd Poynor wrote on Thu, March 18, 2004 3:00 PM
> I'd vote for using Patrick Mochel's PM subsystem and use a standard
> set of identifiers that are mapped to a platform-specific idle behavior,
> in much the same way as platform suspend modes are handled today. For
> example, strings echoed to /sys/power/idle could be an interface. If
> folks are amenable to this I'd be happy to supply a (generic) patch for it.
Just wondering what is the state of development for this new PM scheme?
I don't check LKML that frequently, sorry if I miss any posting on LKML.
- Ken