2005-11-15 09:08:48

by Ingo Molnar

Subject: 2.6.14-rt13

i have released the 2.6.14-rt13 tree, which can be downloaded from the
usual place:

http://redhat.com/~mingo/realtime-preempt/

lots of fixes in this release affecting all supported architectures, all
across the board. Big MIPS update from John Cooper.

Changes since 2.6.14-rt1:

- lots of RCU fixes and updates in signal handling and related areas
(Paul E. McKenney)

- big RCU torture-test update (Paul E. McKenney)

- fix netfilter/conntrack crash reported by Paweł Sikora

- big MIPS update (John Cooper)

- ARM updates (Daniel Walker)

- PPC updates (Benedikt Spranger)

- ktimers rounding fix (Thomas Gleixner)

- off by one fix in timespec normalization (George Anzinger)

- lpptest Kconfig dependency fix (Tom Rini)

- clean up get_cpu_tick() -> get_cycles() in blocker, lpptest and
latency.c. (Tom Rini)

- fix ppc32 bootwrapper code for new zlib (Tom Rini)

- rtc histogram fixes merged for real :-) (K.R. Foley)

- fix NMI watchdog false positive (Steven Rostedt, me)

- added the nsleep() kernel API, which uses high-resolution sleeps

- build fix on !PREEMPT_RT

- cleanup of the PER_CPU_LOCKED infrastructure

- fix softlockup false positives triggered by the RCU torture-test.

- do not send a false -ERESTART_RESTARTBLOCK to userspace if the
HRT timer hardware wakes us up early.
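
For readers unfamiliar with the timespec normalization item above, here is a generic userspace sketch of what "normalizing" a timespec means. It is not the actual fix from this release (the patch is not shown in this thread); the helper name and the NSEC_PER_SEC macro are assumptions made for illustration.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Fold tv_nsec into the range [0, NSEC_PER_SEC), carrying into tv_sec. */
static struct timespec timespec_normalize(struct timespec ts)
{
	/* Using ">" instead of ">=" here would be a classic off-by-one. */
	while (ts.tv_nsec >= NSEC_PER_SEC) {
		ts.tv_nsec -= NSEC_PER_SEC;
		ts.tv_sec++;
	}
	while (ts.tv_nsec < 0) {
		ts.tv_nsec += NSEC_PER_SEC;
		ts.tv_sec--;
	}
	return ts;
}
```

With a ">" comparison, a value of exactly one billion nanoseconds would be left unnormalized, which is the kind of boundary error the changelog entry hints at.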

to build a 2.6.14-rt13 tree, the following patches should be applied:

http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.14.tar.bz2
http://redhat.com/~mingo/realtime-preempt/patch-2.6.14-rt13

Ingo


2005-11-15 16:36:41

by Mark Knecht

Subject: Re: 2.6.14-rt13

On 11/15/05, Ingo Molnar <[email protected]> wrote:
> i have released the 2.6.14-rt13 tree, which can be downloaded from the
> usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> lots of fixes in this release affecting all supported architectures, all
> across the board. Big MIPS update from John Cooper.
<SNIP>

2.6.14-rt13 is up and running here. Everything looks fine in the first
couple of hours. Nothing negative to report.

Please let me know if there are any particular features that you'd
like me to look at on an AMD64 machine.

Cheers,
Mark

2005-11-15 19:56:48

by Paul E. McKenney

Subject: Re: 2.6.14-rt13

On Tue, Nov 15, 2005 at 08:36:40AM -0800, Mark Knecht wrote:
> On 11/15/05, Ingo Molnar <[email protected]> wrote:
> > i have released the 2.6.14-rt13 tree, which can be downloaded from the
> > usual place:
> >
> > http://redhat.com/~mingo/realtime-preempt/
> >
> > lots of fixes in this release affecting all supported architectures, all
> > across the board. Big MIPS update from John Cooper.
> <SNIP>
>
> 2.6.14-rt13 is up and running here. Everything looks fine in the first
> couple of hours. Nothing negative to report.

Ditto on an old x86 Netfinity box.

Thanx, Paul

2005-11-16 03:49:36

by K.R. Foley

Subject: Re: 2.6.14-rt13

Ingo Molnar wrote:
> i have released the 2.6.14-rt13 tree, which can be downloaded from the
> usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> lots of fixes in this release affecting all supported architectures, all
> across the board. Big MIPS update from John Cooper.
>
> Changes since 2.6.14-rt1:
>
> - lots of RCU fixes and updates in signal handling and related areas
> (Paul E. McKenney)
>
> - big RCU torture-test update (Paul E. McKenney)
>

In case anyone else makes the same mistake I did: if you are using the
same config from a previous build, you may have RCU_TORTURE_TEST=y (built
in, not a module) and not even know it when running RT patches. You will,
however, definitely notice it if you use that config to build a non-RT
kernel like 2.6.15-rc1. The previous RT patch defaulted to
RCU_TORTURE_TEST=y. By the way, the fact that I didn't even notice that
the torture test was running with the RT kernel is a true measure of how
well things have progressed. :-)

--
kr

2005-11-16 08:40:55

by Ingo Molnar

Subject: Re: 2.6.14-rt13


* K.R. Foley <[email protected]> wrote:

> > - big RCU torture-test update (Paul E. McKenney)
>
> In case anyone else makes the same mistake I did. If you are using the
> same config from a previous build, you may have RCU_TORTURE_TEST=Y
> (not module) and not even know it when running RT patches. You will
> however definitely notice it if you use the config to build a non RT
> kernel like 2.6.15-rc1. The previous RT patch defaulted
> RCU_TORTURE_TEST=y. By the way, the fact that I didn't even notice
> that the torture test was running with the RT kernel is a true measure
> of how well things have progressed. :-)

yeah - i left it on by default, i usually do that with new debugging
features, to give new code more exposure. In other words, mass
distributed RCU stress-testing by stealth ;-)

I'll make it default-off once the RCU related changes have calmed down.
The rcutorture kernel threads run at nice +19 so they should be barely
noticeable. (except for a sudden and unexplained spike in the world's
power consumption, and the resulting energy crisis ;-)

Ingo

2005-11-16 17:01:50

by Paul E. McKenney

Subject: Re: 2.6.14-rt13

On Wed, Nov 16, 2005 at 09:40:37AM +0100, Ingo Molnar wrote:
>
> * K.R. Foley <[email protected]> wrote:
>
> > > - big RCU torture-test update (Paul E. McKenney)
> >
> > In case anyone else makes the same mistake I did. If you are using the
> > same config from a previous build, you may have RCU_TORTURE_TEST=Y
> > (not module) and not even know it when running RT patches. You will
> > however definitely notice it if you use the config to build a non RT
> > kernel like 2.6.15-rc1. The previous RT patch defaulted
> > RCU_TORTURE_TEST=y. By the way, the fact that I didn't even notice
> > that the torture test was running with the RT kernel is a true measure
> > of how well things have progressed. :-)
>
> yeah - i left it on by default, i usually do that with new debugging
> features, to give new code more exposure. In other words, mass
> distributed RCU stress-testing by stealth ;-)

Cool!!! If anyone sees a printk line starting with "rcutorture:"
that includes the string "!!!", please pass it along accompanied by
your config and what your workload was doing at the time.
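
A minimal sketch of the check described above, for anyone who wants to scan a saved kernel log automatically. It assumes each line has already had any log-level or timestamp prefix stripped; the function name is hypothetical.

```c
#include <string.h>

/*
 * Returns 1 if a log line starts with "rcutorture:" and contains "!!!",
 * which is the pattern worth reporting according to the message above.
 */
static int is_rcutorture_failure(const char *line)
{
	static const char prefix[] = "rcutorture:";

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return 0;
	return strstr(line, "!!!") != NULL;
}
```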

Thanx, Paul

> I'll make it default-off once the RCU related changes have calmed down.
> The rcutorture kernel threads run at nice +19 so they should be barely
> noticeable. (except for a sudden and unexplained spike in the world's
> power consumption, and the resulting energy crisis ;-)
>
> Ingo
>

2005-11-18 18:05:11

by Fernando Lopez-Lezcano

Subject: Re: 2.6.14-rt13

On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> i have released the 2.6.14-rt13 tree, which can be downloaded from the
> usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> lots of fixes in this release affecting all supported architectures, all
> across the board. Big MIPS update from John Cooper.

Hi Ingo, I'm back from the trip and built -rt13 to test on my dual core
Athlons. As I emailed you yesterday off the list it looked good, but I
guess it took longer than usual for things to degrade. This morning I'm
seeing the usual warnings from Jack. And, for the first time in a while,
actual xruns. I'll try your suggestion of booting with idle=poll.

[begin speculation]
You mentioned before that the TSCs of the two cpus could drift from each
other over time. Assuming that is the source of timing (I have no idea),
that could explain the behavior of Jack: it gets a reference time from
one of the cpus and then compares that with what it gets from either cpu,
depending on where it is running at a given time. If it is the same cpu
all is fine; if it is the other and it has drifted, then the warning is
printed.

-- Fernando


2005-11-18 21:58:18

by Lee Revell

Subject: Re: 2.6.14-rt13

On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote:
> You mentioned before that the TSC's from both cpus could drift from
> each other over time. Assuming that is the source of timing (I have no
> idea) that could explain the behavior of Jack, it gets a reference
> time from one of the cpus and then compares that with what it gets
> from either cpu depending on where it is running at a given time. If
> it is the same cpu all is fine, if it is the other and it has drifted
> then the warning is printed.

Yes, JACK uses rdtsc() for microsecond resolution timing and assumes
that the TSCs are in sync.
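
To make the failure mode concrete, here is a small arithmetic sketch (not JACK's actual code) of what happens when a start timestamp is taken from one CPU's TSC and the end timestamp from another CPU whose TSC lags behind. The function name and all numbers are made up for illustration.

```c
#include <stdint.h>

/*
 * Convert a pair of raw cycle counts to elapsed microseconds.
 * The subtraction is done unsigned and then reinterpreted as signed,
 * so an "end" that is smaller than "start" yields a negative result.
 */
static int64_t cycles_to_usec(uint64_t start, uint64_t end,
			      uint64_t cycles_per_usec)
{
	return (int64_t)(end - start) / (int64_t)cycles_per_usec;
}
```

For example, at 2000 cycles per microsecond, cycles_to_usec(1000000, 1002000, 2000) gives 1 when both reads come from the same CPU. If the second read happens on a CPU whose counter is 10000 cycles behind, the same interval becomes cycles_to_usec(1000000, 992000, 2000), which is -4: time appears to run backwards.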

I've asked on this list what a better time source could be and didn't
get any useful responses, people just told me "use gettimeofday()" which
is WAY too slow.

Lee

2005-11-18 22:06:38

by Fernando Lopez-Lezcano

Subject: Re: 2.6.14-rt13

On Fri, 2005-11-18 at 16:54 -0500, Lee Revell wrote:
> On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote:
> > You mentioned before that the TSC's from both cpus could drift from
> > each other over time. Assuming that is the source of timing (I have no
> > idea) that could explain the behavior of Jack, it gets a reference
> > time from one of the cpus and then compares that with what it gets
> > from either cpu depending on where it is running at a given time. If
> > it is the same cpu all is fine, if it is the other and it has drifted
> > then the warning is printed.
>
> Yes, JACK uses rdtsc() for microsecond resolution timing and assumes
> that the TSCs are in sync.
>
> I've asked on this list what a better time source could be and didn't
> get any useful responses, people just told me "use gettimeofday()" which
> is WAY too slow.

Arghhh, at least I take this as a confirmation that the TSCs do drift
and there is no workaround. It currently makes the -rt/Jack combination
not very useful, at least in my tests.

Is there a way to resync the TSCs?
-- Fernando


2005-11-18 22:08:17

by Ingo Molnar

Subject: Re: 2.6.14-rt13


* Fernando Lopez-Lezcano <[email protected]> wrote:

> Arghhh, at least I take this as a confirmation that the TSCs do drift
> and there is no workaround. It currently makes the -rt/Jack
> combination not very useful, at least in my tests.
>
> Is there a way to resync the TSCs?

no reasonable way. Does idle=poll make any difference?

Ingo

2005-11-18 22:13:19

by Lee Revell

Subject: Re: 2.6.14-rt13

On Fri, 2005-11-18 at 14:05 -0800, Fernando Lopez-Lezcano wrote:
> On Fri, 2005-11-18 at 16:54 -0500, Lee Revell wrote:
> > On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote:
> > > You mentioned before that the TSC's from both cpus could drift from
> > > each other over time. Assuming that is the source of timing (I have no
> > > idea) that could explain the behavior of Jack, it gets a reference
> > > time from one of the cpus and then compares that with what it gets
> > > from either cpu depending on where it is running at a given time. If
> > > it is the same cpu all is fine, if it is the other and it has drifted
> > > then the warning is printed.
> >
> > Yes, JACK uses rdtsc() for microsecond resolution timing and assumes
> > that the TSCs are in sync.
> >
> > I've asked on this list what a better time source could be and didn't
> > get any useful responses, people just told me "use gettimeofday()" which
> > is WAY too slow.
>
> Arghhh, at least I take this as a confirmation that the TSCs do drift
> and there is no workaround. It currently makes the -rt/Jack combination
> not very useful, at least in my tests.
>
> Is there a way to resync the TSCs?

I don't think so. A better question is what mechanism have the hardware
vendors provided to replace the apparently-no-longer-reliable TSC for
cheap high res timing on modern machines. Unfortunately I suspect the
answer at this point is "nothing, you're screwed".

I've read that gettimeofday() does not have to enter the kernel on
x86-64, maybe it's fast enough, though almost certainly orders of
magnitude slower than rdtsc(). It seems like a huge step backwards for
any apps with high res timing requirements.

Lee

2005-11-18 22:15:58

by Lee Revell

Subject: Re: 2.6.14-rt13

On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <[email protected]> wrote:
>
> > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > and there is no workaround. It currently makes the -rt/Jack
> > combination not very useful, at least in my tests.
> >
> > Is there a way to resync the TSCs?
>
> no reasonable way. Does idle=poll make any difference?

But JACK itself uses rdtsc() for timing calculations so TSC drift is
invariably fatal.

Lee

2005-11-18 22:26:14

by Steven Rostedt

Subject: Re: 2.6.14-rt13


On Fri, 18 Nov 2005, Lee Revell wrote:

> On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> > * Fernando Lopez-Lezcano <[email protected]> wrote:
> >
> > > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > > and there is no workaround. It currently makes the -rt/Jack
> > > combination not very useful, at least in my tests.
> > >
> > > Is there a way to resync the TSCs?
> >
> > no reasonable way. Does idle=poll make any difference?
>
> But JACK itself uses rdtsc() for timing calculations so TSC drift is
> invariably fatal.

Can it simply be pinned to a cpu?

-- Steve

2005-11-18 22:32:44

by Vojtech Pavlik

Subject: Re: 2.6.14-rt13

On Fri, Nov 18, 2005 at 05:13:03PM -0500, Lee Revell wrote:
> On Fri, 2005-11-18 at 14:05 -0800, Fernando Lopez-Lezcano wrote:
> > On Fri, 2005-11-18 at 16:54 -0500, Lee Revell wrote:
> > > On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote:
> > > > You mentioned before that the TSC's from both cpus could drift from
> > > > each other over time. Assuming that is the source of timing (I have no
> > > > idea) that could explain the behavior of Jack, it gets a reference
> > > > time from one of the cpus and then compares that with what it gets
> > > > from either cpu depending on where it is running at a given time. If
> > > > it is the same cpu all is fine, if it is the other and it has drifted
> > > > then the warning is printed.
> > >
> > > Yes, JACK uses rdtsc() for microsecond resolution timing and assumes
> > > that the TSCs are in sync.
> > >
> > > I've asked on this list what a better time source could be and didn't
> > > get any useful responses, people just told me "use gettimeofday()" which
> > > is WAY too slow.
> >
> > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > and there is no workaround. It currently makes the -rt/Jack combination
> > not very useful, at least in my tests.
> >
> > Is there a way to resync the TSCs?
>
> I don't think so. A better question is what mechanism have the hardware
> vendors provided to replace the apparently-no-longer-reliable TSC for
> cheap high res timing on modern machines. Unfortunately I suspect the
> answer at this point is "nothing, you're screwed".

There are many mechanisms to keep time:

1) RTC: 0.5 sec resolution, interrupts
2) PIT: takes ages to read, overflows at each timer interrupt
3) PMTMR: takes ages to read, overflows in approx 4 seconds, no interrupt
4) HPET: slow to read, overflows in 5 minutes. Nice, but usually not present.
5) TSC: fast, completely unreliable. Frequency changes, CPUs diverge over time.
6) LAPIC: reasonably fast, unreliable, per-cpu

Userspace can only use 1), 4) and 5). mplayer uses the RTC to
synchronize, using it as a 1 kHz interrupt source.

The kernel does quite a lot of magic and jumps through many hoops to
make a reliable and fast time source combining these.
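
The overflow figures in the list above can be checked with simple arithmetic. The sketch below assumes typical hardware parameters that are not stated in this thread: a 24-bit ACPI PM timer running at 3.579545 MHz and a 32-bit HPET counter running at 14.31818 MHz. The function name is hypothetical.

```c
#include <stdint.h>

/* Seconds until a free-running counter of the given width wraps around. */
static double overflow_seconds(unsigned bits, double hz)
{
	return (double)((uint64_t)1 << bits) / hz;
}
```

Under those assumptions, overflow_seconds(24, 3579545.0) is about 4.69 seconds and overflow_seconds(32, 14318180.0) is just under 300 seconds, which matches the "approx 4 seconds" and "5 minutes" quoted for PMTMR and HPET.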

> I've read that gettimeofday() does not have to enter the kernel on
> x86-64, maybe it's fast enough, though almost certainly orders of
> magnitude slower than rdtsc().

It depends on the hardware config, and kernel version. With my latest
patch it takes approximately 175 ns on a reasonably fast CPU to do
gettimeofday() from userspace. And much better results will be possible
(~5x better) when RDTSCP enabled CPUs become available.

This patch still has problems, and as such I'll still have to rewrite
significant portions before I release it.

Anyway, current gettimeofday() on SMP AMD x86-64 can be as bad as 1500ns.

> It seems like a huge step backwards for
> any apps with high res timing requirements.

gettimeofday() is the only guaranteed working mechanism. And it's as
fast as the hardware allows.
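
For anyone who wants to see where their own machine falls between the 175 ns and 1500 ns figures mentioned above, a rough userspace measurement could look like the following sketch. The helper names are made up, and the result includes loop overhead, so treat it as an estimate only.

```c
#include <stddef.h>
#include <sys/time.h>

/* Difference b - a in microseconds. */
static long long timeval_diff_usec(struct timeval a, struct timeval b)
{
	return (long long)(b.tv_sec - a.tv_sec) * 1000000LL
		+ (long long)(b.tv_usec - a.tv_usec);
}

/* Average cost of one gettimeofday() call in nanoseconds. */
static double gettimeofday_cost_ns(long calls)
{
	struct timeval start, end, scratch;
	long i;

	gettimeofday(&start, NULL);
	for (i = 0; i < calls; i++)
		gettimeofday(&scratch, NULL);
	gettimeofday(&end, NULL);

	return (double)timeval_diff_usec(start, end) * 1000.0 / (double)calls;
}
```

Calling gettimeofday_cost_ns(1000000) and printing the result gives a per-call figure that can be compared directly with the numbers in this message.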

--
Vojtech Pavlik
SuSE Labs, SuSE CR

2005-11-18 22:42:05

by Fernando Lopez-Lezcano

Subject: Re: 2.6.14-rt13

On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <[email protected]> wrote:
>
> > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > and there is no workaround. It currently makes the -rt/Jack
> > combination not very useful, at least in my tests.
> >
> > Is there a way to resync the TSCs?
>
> no reasonable way. Does idle=poll make any difference?

I don't know yet, and I may never know :-) I've been running it for a
while and so far it works, but that's what I thought yesterday of -rt13.
It is not practical for normal use: it just heats the cpu unnecessarily
and there's no way to control it other than a reboot. I'll keep my
machine running like this till I go home later.

-- Fernando


2005-11-18 23:38:06

by Fernando Lopez-Lezcano

Subject: Re: 2.6.14-rt13

On Fri, 2005-11-18 at 17:25 -0500, Steven Rostedt wrote:
> On Fri, 18 Nov 2005, Lee Revell wrote:
>
> > On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> > > * Fernando Lopez-Lezcano <[email protected]> wrote:
> > >
> > > > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > > > and there is no workaround. It currently makes the -rt/Jack
> > > > combination not very useful, at least in my tests.
> > > >
> > > > Is there a way to resync the TSCs?
> > >
> > > no reasonable way. Does idle=poll make any difference?
> >
> > But JACK itself uses rdtsc() for timing calculations so TSC drift is
> > invariably fatal.
>
> Can it simply be pinned to a cpu?

Is there a way to know on which cpu a process is running? At least Jack
could ignore timing issues if the measurement is going to happen on a
different cpu than the one where the original timestamp was collected.

-- Fernando


2005-11-18 23:58:31

by Steven Rostedt

Subject: Re: 2.6.14-rt13



On Fri, 18 Nov 2005, Fernando Lopez-Lezcano wrote:

> > Can it simply be pinned to a cpu?
>
> Is there a way to know in which cpu a process is running? At least Jack
> could ignore timing issues if the measurement is going to happen in a
> different cpu than the one where the original timestamp was collected.
>

Simple answer? No. At least not meaningfully.

If you do:

	cpu = fictitious_get_my_cpu();
	if (cpu == last_cpu()) {
		rdtsc(oldtime);
		...
	}

There's no guarantee that jack doesn't switch cpu's from when it found out
what CPU it was on to doing the calculation. So it would be easier to pin
it.

(apt-get install schedutils)

man 1 taskset

or if you modify the code:

man 2 sched_setaffinity
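
As a minimal sketch of the "modify the code" route, assuming Linux with glibc, the calling thread can pin itself with sched_setaffinity(). The helper name is made up; it picks the first CPU the thread is already allowed to run on rather than hard-coding CPU 0, since CPU 0 may not be in the allowed set.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* needed for cpu_set_t and the CPU_* macros */
#endif
#include <sched.h>

/*
 * Pin the calling thread to the first CPU it is currently allowed to
 * run on. Returns that CPU number, or -1 on failure.
 */
static int pin_to_first_allowed_cpu(void)
{
	cpu_set_t allowed, target;
	int cpu;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		CPU_ZERO(&target);
		CPU_SET(cpu, &target);
		if (sched_setaffinity(0, sizeof(target), &target) != 0)
			return -1;
		return cpu;
	}
	return -1;
}
```

Once pinned, every rdtsc() in that thread reads the same CPU's counter, so drift between CPUs no longer affects the deltas. Note that the affinity mask is per thread, so a multi-threaded program would need to do this in each timing-sensitive thread.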

-- Steve



2005-11-19 02:29:41

by George Anzinger

Subject: Re: 2.6.14-rt13

Vojtech Pavlik wrote:
> On Fri, Nov 18, 2005 at 05:13:03PM -0500, Lee Revell wrote:
>
>>On Fri, 2005-11-18 at 14:05 -0800, Fernando Lopez-Lezcano wrote:
>>
>>>On Fri, 2005-11-18 at 16:54 -0500, Lee Revell wrote:
>>>
>>>>On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote:
>>>>
>>>>>You mentioned before that the TSC's from both cpus could drift from
>>>>>each other over time. Assuming that is the source of timing (I have no
>>>>>idea) that could explain the behavior of Jack, it gets a reference
>>>>>time from one of the cpus and then compares that with what it gets
>>>>>from either cpu depending on where it is running at a given time. If
>>>>>it is the same cpu all is fine, if it is the other and it has drifted
>>>>>then the warning is printed.
>>>>
>>>>Yes, JACK uses rdtsc() for microsecond resolution timing and assumes
>>>>that the TSCs are in sync.
>>>>
>>>>I've asked on this list what a better time source could be and didn't
>>>>get any useful responses, people just told me "use gettimeofday()" which
>>>>is WAY too slow.
>>>
>>>Arghhh, at least I take this as a confirmation that the TSCs do drift
>>>and there is no workaround. It currently makes the -rt/Jack combination
>>>not very useful, at least in my tests.
>>>
>>>Is there a way to resync the TSCs?
>>
>>I don't think so. A better question is what mechanism have the hardware
>>vendors provided to replace the apparently-no-longer-reliable TSC for
>>cheap high res timing on modern machines. Unfortunately I suspect the
>>answer at this point is "nothing, you're screwed".
>
>
> There are many mechanisms to keep time:
>
> 1) RTC: 0.5 sec resolution, interrupts
> 2) PIT: takes ages to read, overflows at each timer interrupt
> 3) PMTMR: takes ages to read, overflows in approx 4 seconds, no interrupt

The PMTMR can be read from user space (if you can find it). See the
"iopl" man page. It is an I/O access and so is slow, but you can read
it.

Finding it is another matter. It does not have a fixed address (i.e.
it differs from machine to machine, but is constant on any given
machine). The boot code roots it out of an info block put in memory
by the BIOS. I suppose one could put a printk in the boot code to
disclose it...

George
--


> 4) HPET: slow to read, overflows in 5 minutes. Nice, but usually not present.
> 5) TSC: fast, completely unreliable. Frequency changes, CPUs diverge over time.
> 6) LAPIC: reasonably fast, unreliable, per-cpu
>
> Userspace can only use 1), 4) and 5). mplayer uses the RTC to
> synchronize, using it as a 1 kHz interrupt source.
>
> The kernel does quite a lot of magic and jumps through many hoops to
> make a reliable and fast time source combining these.
>
>
>>I've read that gettimeofday() does not have to enter the kernel on
>>x86-64, maybe it's fast enough, though almost certainly orders of
>>magnitude slower than rdtsc().
>
>
> It depends on the hardware config, and kernel version. With my latest
> patch it takes approximately 175 ns on a reasonably fast CPU to do
> gettimeofday() from userspace. And much better results will be possible
> (~5x better) when RDTSCP enabled CPUs become available.
>
> This patch still has problems, and as such I'll still have to rewrite
> significant portions before I release it.
>
> Anyway, current gettimeofday() on SMP AMD x86-64 can be as bad as 1500ns.
>
>
>>It seems like a huge step backwards for
>>any apps with high res timing requirements.
>
>
> gettimeofday() is the only guaranteed working mechanism. And it's as
> fast as the hardware allows.
>

--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-11-19 02:39:41

by Steven Rostedt

Subject: Re: 2.6.14-rt13

On Fri, 2005-11-18 at 14:41 -0800, Fernando Lopez-Lezcano wrote:
> On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> > * Fernando Lopez-Lezcano <[email protected]> wrote:
> >
> > > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > > and there is no workaround. It currently makes the -rt/Jack
> > > combination not very useful, at least in my tests.
> > >
> > > Is there a way to resync the TSCs?
> >
> > no reasonable way. Does idle=poll make any difference?
>
> I don't know yet, and I may never know :-) I've been running it for a
> while and so far works but that's what I thought yesterday of -rt13. It
> is not practical for normal use, it just heats the cpu unnecessarily and
> there's no way to control it other than a reboot.

Not anymore!

OK, I used this as an exercise to learn how kobject and sysfs work (I've
been putting this off for too long). So if this isn't exactly proper,
let me know :-)

Ingo, this could be a temporary patch until we come up with a better
solution. It adds /sys/kernel/idle/idle_poll which, when idle=poll is
_not_ set on the command line, lets you switch the machine to idle=poll
on the fly and turn it off again. If you booted with idle=poll, the file
doesn't even show up.

So for example (I'm currently running it):

# cat /sys/kernel/idle/idle_poll
off
# echo 1 > /sys/kernel/idle/idle_poll
# cat /sys/kernel/idle/idle_poll
on
# echo 0 > /sys/kernel/idle/idle_poll
# cat /sys/kernel/idle/idle_poll
off

# echo on > /sys/kernel/idle/idle_poll
and
# echo off > /sys/kernel/idle/idle_poll
also work.

So, like I said, this could be used by those who need idle=poll for
running benchmarks but don't want to reboot when they are done.
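
For a benchmark harness that wants to flip the switch from C rather than from the shell, a sketch along these lines should work, assuming the patch below is applied and the caller has write permission on the file. The helper name is made up; the path is passed in so the same code can be tried against an ordinary file.

```c
#include <stdio.h>

/*
 * Write "on" or "off" to the idle_poll control file at `path`
 * (for example "/sys/kernel/idle/idle_poll").
 * Returns 0 on success, -1 on failure.
 */
static int set_idle_poll(const char *path, int enable)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(enable ? "on\n" : "off\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

A benchmark could call set_idle_poll(path, 1) before the timed section and set_idle_poll(path, 0) afterwards, which is the no-reboot workflow described above.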

-- Steve

PS. I haven't tested to see if the idle actually changes, but it looks
pretty obvious in the code in cpu_idle:

	idle = pm_idle;
	if (!idle)
		idle = default_idle;
	if (cpu_is_offline(smp_processor_id()))
		play_dead();
	stop_critical_timing();
	propagate_preempt_locks_value();
	idle();



Index: linux-2.6.14-rt13/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.14-rt13.orig/arch/x86_64/kernel/process.c 2005-11-15 11:12:37.000000000 -0500
+++ linux-2.6.14-rt13/arch/x86_64/kernel/process.c 2005-11-18 21:12:53.000000000 -0500
@@ -822,3 +822,104 @@
sp -= get_random_int() % 8192;
return sp & ~0xf;
}
+
+#ifdef CONFIG_SYSFS
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/spinlock.h>
+
+#define KERNEL_ATTR_RW(_name) \
+static struct subsys_attribute _name##_attr = \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static spinlock_t idle_switch_lock = SPIN_LOCK_UNLOCKED(idle_switch_lock);
+
+static struct idlep_kobject
+{
+	struct kobject kobj;
+	int is_poll;
+	void (*idle)(void);
+} idle_kobj;
+
+static ssize_t idle_poll_show(struct subsystem *subsys, char *page)
+{
+	return sprintf(page, "%s\n", (idle_kobj.is_poll ? "on" : "off"));
+}
+
+static ssize_t idle_poll_store(struct subsystem *subsys,
+			       const char *buf, size_t len)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&idle_switch_lock, flags);
+
+	if (strncmp(buf,"1",1)==0 ||
+	    (len >=2 && strncmp(buf,"on",2)==0)) {
+		if (idle_kobj.is_poll != 1) {
+			idle_kobj.is_poll = 1;
+			pm_idle = poll_idle;
+		}
+	} else if (strncmp(buf,"0",1)==0 ||
+		   (len >= 3 && strncmp(buf,"off",3)==0)) {
+		if (idle_kobj.is_poll != 0) {
+			idle_kobj.is_poll = 0;
+			pm_idle = idle_kobj.idle;
+		}
+	}
+
+	spin_unlock_irqrestore(&idle_switch_lock, flags);
+
+	return len;
+}
+
+
+KERNEL_ATTR_RW(idle_poll);
+
+static struct attribute * idle_attrs[] = {
+	&idle_poll_attr.attr,
+	NULL
+};
+
+static struct attribute_group idle_attr_group = {
+	.attrs = idle_attrs,
+};
+
+static int __init idle_poll_set_init(void)
+{
+	int err;
+
+	/*
+	 * If the default is already poll_idle then
+	 * don't even bother with this.
+	 */
+	if (pm_idle == poll_idle)
+		return 0;
+
+	memset(&idle_kobj, 0, sizeof(idle_kobj));
+
+	idle_kobj.is_poll = 0;
+	idle_kobj.idle = pm_idle;
+
+	err = kobject_set_name(&idle_kobj.kobj, "%s", "idle");
+	if (err)
+		goto out;
+
+	idle_kobj.kobj.parent = &kernel_subsys.kset.kobj;
+	err = kobject_register(&idle_kobj.kobj);
+	if (err)
+		goto out;
+
+	err = sysfs_create_group(&idle_kobj.kobj,
+				 &idle_attr_group);
+	if (err)
+		goto out;
+
+	return 0;
+out:
+	printk(KERN_INFO "Problem setting up sysfs idle_poll\n");
+	return 0;
+}
+
+late_initcall(idle_poll_set_init);
+#endif /* CONFIG_SYSFS */
+
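
The accepted input values in idle_poll_store() above can be modelled in userspace as follows. This is only an illustrative sketch with a made-up function name; unlike the patch, it returns -1 for unrecognized input instead of silently ignoring it, and it checks the length before looking at the first byte.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the string matching in idle_poll_store():
 * returns 1 for "1" or "on", 0 for "0" or "off", -1 otherwise.
 * As in the patch, only the leading characters are compared, so
 * trailing text such as the newline from echo is accepted.
 */
static int parse_idle_poll(const char *buf, size_t len)
{
	if (len >= 1 && buf[0] == '1')
		return 1;
	if (len >= 2 && strncmp(buf, "on", 2) == 0)
		return 1;
	if (len >= 1 && buf[0] == '0')
		return 0;
	if (len >= 3 && strncmp(buf, "off", 3) == 0)
		return 0;
	return -1;
}
```

Because the match is prefix-based, an input like "10" is treated the same as "1"; that mirrors the behaviour of the strncmp() calls in the patch.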


2005-11-19 07:45:17

by Vojtech Pavlik

Subject: Re: 2.6.14-rt13

On Fri, Nov 18, 2005 at 06:28:24PM -0800, George Anzinger wrote:

> >There are many mechanisms to keep time:
> >
> >1) RTC: 0.5 sec resolution, interrupts
> >2) PIT: takes ages to read, overflows at each timer interrupt
> >3) PMTMR: takes ages to read, overflows in approx 4 seconds, no interrupt
>
> The PMTMR can be read from user space (if you can find it). See the
> "iopl" man page. It is an I/O access and so is slow, but you can read
> it.

Yes, however this must be limited to a small number of privileged
applications - iopl() is only available to CAP_SYS_RAWIO IIRC,
and thus it's not suitable for general use.

> Finding it is another matter. It does not have a fixed address (i.e.
> it differs from machine to machine, but is constant on any given
> machine). The boot code roots it out of an info block put in memory
> by the BIOS. I suppose one could put a printk in the boot code to
> disclose it...

There is really no reason to do that, since the time to read it (~1200
ns) is much more than the time to enter the kernel (less than 200 ns),
so gettimeofday() is definitely easier to use and also doesn't overflow.

--
Vojtech Pavlik
SuSE Labs, SuSE CR

2005-11-19 18:36:07

by Lee Revell

Subject: Re: 2.6.14-rt13

On Sat, 2005-11-19 at 08:45 +0100, Vojtech Pavlik wrote:
> On Fri, Nov 18, 2005 at 06:28:24PM -0800, George Anzinger wrote:
> > Finding it is another matter. It does not have a fixed address (i.e.
> > it differs from machine to machine, but is constant on any given
> > machine). The boot code roots it out of an info block put in memory
> > by the BIOS. I suppose one could put a printk in the boot code to
> > disclose it...
>
> There is really no reason to do that, since the time to read it (~1200
> ns) is much more than the time to enter the kernel (less than 200 ns),
> so gettimeofday() is definitely easier to use and also doesn't overflow.
>

Thanks very much, you have answered my question. We would prefer
gettimeofday() anyway for portability, so if the plan is to make it
faster then we can deal with losing the TSC.

Lee

2005-11-21 21:33:42

by Fernando Lopez-Lezcano

Subject: Re: 2.6.14-rt13

On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> i have released the 2.6.14-rt13 tree, which can be downloaded from the
> usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> lots of fixes in this release affecting all supported architectures, all
> across the board. Big MIPS update from John Cooper.

Can someone tell me if 2.6.14-rt13 is supposed to be fixed re: the
problems I was having with random screensaver triggering and keyboard
repeats?

It is apparently not fixed.

I just had a short burst of key repeats and saw one random screen blank.
Right now everything seems normal but I was not hallucinating :-)

-- Fernando


2005-11-21 21:41:25

by john stultz

Subject: Re: 2.6.14-rt13

On Mon, 2005-11-21 at 13:32 -0800, Fernando Lopez-Lezcano wrote:
> On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> > i have released the 2.6.14-rt13 tree, which can be downloaded from the
> > usual place:
> >
> > http://redhat.com/~mingo/realtime-preempt/
> >
> > lots of fixes in this release affecting all supported architectures, all
> > across the board. Big MIPS update from John Cooper.
>
> Can someone tell me if 2.6.14-rt13 is supposed to be fixed re: the
> problems I was having with random screensaver triggering and keyboard
> repeats?
>
> It is apparently not fixed.
>
> I just had a short burst of key repeats and saw one random screen blank.
> Right now everything seems normal but I was not hallucinating :-)

Hmm. Sounds like timekeeping issues, could you send me dmesg output?

thanks
-john

2005-11-22 11:19:54

by Ingo Molnar

Subject: Re: 2.6.14-rt13


* Fernando Lopez-Lezcano <[email protected]> wrote:

> I just had a short burst of key repeats and saw one random screen
> blank. Right now everything seems normal but I was not hallucinating
> :-)

btw., today i have experienced a 'key repeat' event with the stock FC4
SMP kernel too, on an X2 athlon. That kernel didn't have idle=poll
specified, so gettimeofday() could time-warp in substantial ways.

so i'd say the 'key repeat' problem is almost certainly caused by TSC
"time warps" on X2's.

Ingo

2005-11-24 15:07:52

by Ingo Molnar

Subject: Re: 2.6.14-rt13


* Steven Rostedt <[email protected]> wrote:

> OK, I used this as an exercise to learn how kobject and sysfs work
> (I've been putting this off for too long). So if this isn't exactly
> proper, let me know :-)
>
> Ingo, This could be a temporary patch until we come up with a better
> solution. This adds /sys/kernel/idle/idle_poll, which if idle=poll is
> _not_ set, it still lets you switch the machine to idle=poll on the
> fly, as well as turn it off. If you have idle=poll, this doesn't even
> show up.

ok, i've applied this one too. Could you also submit it upstream (and
implement it for x86)? It makes sense to enable/disable the
polling-based idle routine at runtime.

Ingo

2005-11-24 15:22:36

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.14-rt13

On Thu, 2005-11-24 at 16:07 +0100, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > OK, I used this as an exercise to learn how kobject and sysfs work
> > (I've been putting this off for too long). So if this isn't exactly
> > proper, let me know :-)
> >
> > Ingo, This could be a temporary patch until we come up with a better
> > solution. This adds /sys/kernel/idle/idle_poll, which if idle=poll is
> > _not_ set, it still lets you switch the machine to idle=poll on the
> > fly, as well as turn it off. If you have idle=poll, this doesn't even
> > show up.
>
> ok, i've applied this one too. Could you also submit it upstream (and
> implement it for x86)? It makes sense to enable/disable the
> polling-based idle routine at runtime.

OK, it'll have to wait till tomorrow. As you probably know, it is
Thanksgiving here in the US. And my wife would kill me if I work
today ;-)

-- Steve


2005-11-25 20:56:43

by Steven Rostedt

[permalink] [raw]
Subject: [RFC][PATCH] Runtime switching to idle_poll (was: Re: 2.6.14-rt13)

On Thu, 2005-11-24 at 16:07 +0100, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
> > Ingo, This could be a temporary patch until we come up with a better
> > solution. This adds /sys/kernel/idle/idle_poll, which if idle=poll is
> > _not_ set, it still lets you switch the machine to idle=poll on the
> > fly, as well as turn it off. If you have idle=poll, this doesn't even
> > show up.
>
> ok, i've applied this one too. Could you also submit it upstream (and
> implement it for x86)? It makes sense to enable/disable the
> polling-based idle routine at runtime.

At Ingo's request, I fixed up this patch a little to allow both
x86_64 and i386 to switch to and from idle_poll at runtime. I noticed
that the ACPI driver in drivers/acpi/processor_idle.c may race with
this patch, so I added some protection there.
Basically, if the acpi code changes pm_idle, then you can't change to
idle_poll, and vice-versa.

This patch creates the entry /sys/kernel/idle/idle_poll. Reading it
shows whether idle_poll is being used as the runtime idle routine.
Writing to it sets the runtime idle routine.

with:

# echo 1 > /sys/kernel/idle/idle_poll
or
# echo on > /sys/kernel/idle/idle_poll

The system will switch to the idle_poll idle routine.

with:

# echo 0 > /sys/kernel/idle/idle_poll
or
# echo off > /sys/kernel/idle/idle_poll

The system will switch out of idle poll. Note that if the command line
states "idle=poll" then this entry is not created.
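
The input matching described above can be modeled in userspace roughly as follows. This is a sketch only: the enum and the function name parse_idle_poll are invented for illustration, and the explicit length check before comparing the single-character forms is an addition that the posted idle_poll_store() does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum idle_poll_request {
	IDLE_POLL_IGNORE = -1,	/* input not recognized, state unchanged */
	IDLE_POLL_OFF = 0,
	IDLE_POLL_ON = 1
};

/*
 * Mirror the prefix matching used by the patch: "1" or "on" switches
 * polling on, "0" or "off" switches it off. Only the leading
 * characters are compared, so the trailing newline added by echo is
 * accepted, and so is input such as "1abc".
 */
enum idle_poll_request parse_idle_poll(const char *buf, size_t len)
{
	assert(buf != NULL);

	if ((len >= 1 && strncmp(buf, "1", 1) == 0) ||
	    (len >= 2 && strncmp(buf, "on", 2) == 0))
		return IDLE_POLL_ON;
	if ((len >= 1 && strncmp(buf, "0", 1) == 0) ||
	    (len >= 3 && strncmp(buf, "off", 3) == 0))
		return IDLE_POLL_OFF;
	return IDLE_POLL_IGNORE;
}
```

As in the patch, unrecognized input is silently ignored rather than rejected; returning an error such as -EINVAL for it would be a reasonable alternative.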

This is still a work-in-progress. Since I only own x86_64 and i386
machines, those are the only archs I ported the code to and tested.
Looking for who else exports pm_idle, I see that the following archs
may also need to be updated:

arm, arm26, ia64, sparc.

I also have not yet protected the pm_idle in arch/i386/kernel/apm.c

I figure that I should get some comments before I spend any more time on
this.

Thanks,

-- Steve

Index: linux-2.6.15-rc2-git5/arch/i386/kernel/Makefile
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/Makefile 2005-10-27 20:02:08.000000000 -0400
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/Makefile 2005-11-25 11:56:25.000000000 -0500
@@ -34,6 +34,7 @@
obj-$(CONFIG_HPET_TIMER) += time_hpet.o
obj-$(CONFIG_EFI) += efi.o efi_stub.o
obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
+obj-$(CONFIG_SYSFS) += switch2poll.o

EXTRA_AFLAGS := -traditional

Index: linux-2.6.15-rc2-git5/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/process.c 2005-11-25 10:58:53.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/process.c 2005-11-25 12:18:12.000000000 -0500
@@ -39,6 +39,7 @@
#include <linux/ptrace.h>
#include <linux/random.h>
#include <linux/kprobes.h>
+#include <linux/spinlock.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -64,6 +65,12 @@
unsigned long boot_option_idle_override = 0;
EXPORT_SYMBOL(boot_option_idle_override);

+spinlock_t pm_idle_switch_lock = SPIN_LOCK_UNLOCKED;
+EXPORT_SYMBOL(pm_idle_switch_lock);
+
+int pm_idle_locked = 0;
+EXPORT_SYMBOL(pm_idle_locked);
+
/*
* Return saved PC of a blocked thread.
*/
@@ -126,7 +133,7 @@
* to poll the ->work.need_resched flag instead of waiting for the
* cross-CPU IPI to arrive. Use this option with caution.
*/
-static void poll_idle (void)
+void poll_idle (void)
{
local_irq_enable();

Index: linux-2.6.15-rc2-git5/arch/i386/kernel/switch2poll.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/switch2poll.c 2005-11-25 11:55:19.000000000 -0500
@@ -0,0 +1,5 @@
+/*
+ * Same type of hack used for early_printk. This keeps the code
+ * in one place.
+ */
+#include "../../x86_64/kernel/switch2poll.c"
Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/Makefile
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/Makefile 2005-11-22 12:13:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/Makefile 2005-11-25 11:56:40.000000000 -0500
@@ -30,6 +30,7 @@
obj-$(CONFIG_DUMMY_IOMMU) += pci-nommu.o pci-dma.o
obj-$(CONFIG_KPROBES) += kprobes.o
obj-$(CONFIG_X86_PM_TIMER) += pmtimer.o
+obj-$(CONFIG_SYSFS) += switch2poll.o

obj-$(CONFIG_MODULES) += module.o

Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/process.c 2005-11-25 10:58:53.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c 2005-11-25 12:17:53.000000000 -0500
@@ -36,6 +36,7 @@
#include <linux/utsname.h>
#include <linux/random.h>
#include <linux/kprobes.h>
+#include <linux/spinlock.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -60,6 +61,12 @@
unsigned long boot_option_idle_override = 0;
EXPORT_SYMBOL(boot_option_idle_override);

+spinlock_t pm_idle_switch_lock = SPIN_LOCK_UNLOCKED;
+EXPORT_SYMBOL(pm_idle_switch_lock);
+
+int pm_idle_locked = 0;
+EXPORT_SYMBOL(pm_idle_locked);
+
/*
* Powermanagement idle function, if any..
*/
@@ -110,7 +117,7 @@
* to poll the ->need_resched flag instead of waiting for the
* cross-CPU IPI to arrive. Use this option with caution.
*/
-static void poll_idle (void)
+void poll_idle (void)
{
local_irq_enable();

Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/switch2poll.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/switch2poll.c 2005-11-25 12:23:22.000000000 -0500
@@ -0,0 +1,112 @@
+#include <linux/module.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/spinlock.h>
+#include <linux/pm.h>
+
+extern void poll_idle (void);
+
+#define KERNEL_ATTR_RW(_name) \
+static struct subsys_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct idlep_kobject
+{
+ struct kobject kobj;
+ int is_poll;
+ void (*idle)(void);
+} idle_kobj;
+
+static ssize_t idle_poll_show(struct subsystem *subsys, char *page)
+{
+ return sprintf(page, "%s\n", (idle_kobj.is_poll ? "on" : "off"));
+}
+
+static ssize_t idle_poll_store(struct subsystem *subsys,
+ const char *buf, size_t len)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&pm_idle_switch_lock, flags);
+
+ /*
+ * If power management is handling the idle function,
+ * then leave it be.
+ */
+ if (pm_idle_locked) {
+ len = -EBUSY;
+ goto out;
+ }
+
+ if (strncmp(buf,"1",1)==0 ||
+ (len >=2 && strncmp(buf,"on",2)==0)) {
+ if (idle_kobj.is_poll != 1) {
+ idle_kobj.is_poll = 1;
+ boot_option_idle_override = 1;
+ idle_kobj.idle = pm_idle;
+ pm_idle = poll_idle;
+ }
+ } else if (strncmp(buf,"0",1)==0 ||
+ (len >= 3 && strncmp(buf,"off",3)==0)) {
+ if (idle_kobj.is_poll != 0) {
+ boot_option_idle_override = 0;
+ idle_kobj.is_poll = 0;
+ pm_idle = idle_kobj.idle;
+ }
+ }
+
+out:
+ spin_unlock_irqrestore(&pm_idle_switch_lock, flags);
+
+ return len;
+}
+
+
+KERNEL_ATTR_RW(idle_poll);
+
+static struct attribute * idle_attrs[] = {
+ &idle_poll_attr.attr,
+ NULL
+};
+
+static struct attribute_group idle_attr_group = {
+ .attrs = idle_attrs,
+};
+
+static int __init idle_poll_set_init(void)
+{
+ int err;
+
+ /*
+ * If the default is already poll_idle then
+ * don't even bother with this.
+ */
+ if (pm_idle == poll_idle)
+ return 0;
+
+ memset(&idle_kobj, 0, sizeof(idle_kobj));
+
+ idle_kobj.is_poll = 0;
+ idle_kobj.idle = pm_idle;
+
+ err = kobject_set_name(&idle_kobj.kobj, "%s", "idle");
+ if (err)
+ goto out;
+
+ idle_kobj.kobj.parent = &kernel_subsys.kset.kobj;
+ err = kobject_register(&idle_kobj.kobj);
+ if (err)
+ goto out;
+
+ err = sysfs_create_group(&idle_kobj.kobj,
+ &idle_attr_group);
+ if (err)
+ goto out;
+
+ return 0;
+out:
+ printk(KERN_INFO "Problem setting up sysfs idle_poll\n");
+ return 0;
+}
+
+late_initcall(idle_poll_set_init);
Index: linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/drivers/acpi/processor_idle.c 2005-11-22 12:13:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c 2005-11-25 13:15:59.000000000 -0500
@@ -38,6 +38,7 @@
#include <linux/dmi.h>
#include <linux/moduleparam.h>
#include <linux/sched.h> /* need_resched() */
+#include <linux/spinlock.h>

#include <asm/io.h>
#include <asm/uaccess.h>
@@ -990,6 +991,7 @@
static int first_run = 0;
struct proc_dir_entry *entry = NULL;
unsigned int i;
+ unsigned long flags;

ACPI_FUNCTION_TRACE("acpi_processor_power_init");

@@ -1023,6 +1025,7 @@
* Note that we use previously set idle handler will be used on
* platforms that only support C1.
*/
+ spin_lock_irqsave(&pm_idle_switch_lock, flags);
if ((pr->flags.power) && (!boot_option_idle_override)) {
printk(KERN_INFO PREFIX "CPU%d (power states:", pr->id);
for (i = 1; i <= pr->power.count; i++)
@@ -1034,8 +1037,13 @@
if (pr->id == 0) {
pm_idle_save = pm_idle;
pm_idle = acpi_processor_idle;
+ /*
+ * Don't allow switching of the pm_idle to poll.
+ */
+ pm_idle_locked = 1;
}
}
+ spin_unlock_irqrestore(&pm_idle_switch_lock, flags);

/* 'power' [R] */
entry = create_proc_entry(ACPI_PROCESSOR_FILE_POWER,
@@ -1078,5 +1086,7 @@
cpu_idle_wait();
}

+ pm_idle_locked = 0;
+
return_VALUE(0);
}
Index: linux-2.6.15-rc2-git5/include/linux/pm.h
===================================================================
--- linux-2.6.15-rc2-git5.orig/include/linux/pm.h 2005-11-25 12:05:33.000000000 -0500
+++ linux-2.6.15-rc2-git5/include/linux/pm.h 2005-11-25 12:17:17.000000000 -0500
@@ -25,6 +25,7 @@

#include <linux/config.h>
#include <linux/list.h>
+#include <linux/spinlock.h>
#include <asm/atomic.h>

/*
@@ -102,6 +103,8 @@
*/
extern void (*pm_idle)(void);
extern void (*pm_power_off)(void);
+extern spinlock_t pm_idle_switch_lock;
+extern int pm_idle_locked;

typedef int __bitwise suspend_state_t;



2005-11-26 13:06:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching to idle_poll (was: Re: 2.6.14-rt13)


* Steven Rostedt <[email protected]> wrote:

> At Ingo's request, I fixed up this patch a little to allow both
> x86_64 and i386 to switch to and from idle_poll at runtime. I noticed
> that the ACPI driver in drivers/acpi/processor_idle.c may race with
> this patch, so I added some protection there.
> Basically, if the acpi code changes pm_idle, then you can't change to
> idle_poll, and vice-versa.
>
> This patch creates the entry /sys/kernel/idle/idle_poll. Reading it
> shows whether idle_poll is being used as the runtime idle routine.
> Writing to it sets the runtime idle routine.
>
> with:
>
> # echo 1 > /sys/kernel/idle/idle_poll
> or
> # echo on > /sys/kernel/idle/idle_poll

find some minor cleanups below.

a more general question is, shouldn't the configuration method rather be
something like:

echo idle > /sys/kernel/idle

and there could also be a /sys/kernel/idle_methods which would enumerate
all the strings that are possible? This way we'd not hardcode
'idle-poll' in any way.

Ingo

Signed-off-by: Ingo Molnar <[email protected]>

arch/i386/kernel/process.c | 6 +++---
arch/x86_64/kernel/process.c | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)

Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -65,11 +65,11 @@ static int hlt_counter;
unsigned long boot_option_idle_override = 0;
EXPORT_SYMBOL(boot_option_idle_override);

-spinlock_t pm_idle_switch_lock = SPIN_LOCK_UNLOCKED;
-EXPORT_SYMBOL(pm_idle_switch_lock);
+DEFINE_SPINLOCK(pm_idle_switch_lock);
+EXPORT_SYMBOL_GPL(pm_idle_switch_lock);

int pm_idle_locked = 0;
-EXPORT_SYMBOL(pm_idle_locked);
+EXPORT_SYMBOL_GPL(pm_idle_locked);

/*
* Return saved PC of a blocked thread.
Index: linux/arch/x86_64/kernel/process.c
===================================================================
--- linux.orig/arch/x86_64/kernel/process.c
+++ linux/arch/x86_64/kernel/process.c
@@ -61,11 +61,11 @@ static atomic_t hlt_counter = ATOMIC_INI
unsigned long boot_option_idle_override = 0;
EXPORT_SYMBOL(boot_option_idle_override);

-spinlock_t pm_idle_switch_lock = SPIN_LOCK_UNLOCKED;
-EXPORT_SYMBOL(pm_idle_switch_lock);
+DEFINE_SPINLOCK(pm_idle_switch_lock);
+EXPORT_SYMBOL_GPL(pm_idle_switch_lock);

int pm_idle_locked = 0;
-EXPORT_SYMBOL(pm_idle_locked);
+EXPORT_SYMBOL_GPL(pm_idle_locked);

/*
* Powermanagement idle function, if any..

2005-11-29 02:48:39

by Steven Rostedt

[permalink] [raw]
Subject: [RFC][PATCH] Runtime switching of the idle function [take 2]

Here's an update on the switching of the idle function.

As Ingo has suggested, I removed this from being specific to the
poll_idle function.

Description:

This patch creates a directory in /sys/kernel called idle. This
directory contains two files: idle_ctrl and idle_methods. Reading
idle_ctrl shows the function that is currently being used for idle,
and idle_methods lists the methods that the user can write into
idle_ctrl to change which function to use for idle.

If the freeze attribute is set for an idle function (defined in the
idle_info struct explained below), then the user cannot switch to or
away from that function through sysfs. ACPI uses this, since I wasn't
sure how it would handle having its function switched in or out
dynamically. Functions that are frozen are shown in idle_methods (and
in idle_ctrl when in use) with an asterisk (*) in front of the name.
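
A userspace sketch of how such a listing could be built is shown below. The struct idle_method and the function format_idle_methods are stand-ins invented for this example, and the buffer handling differs slightly from the sysfs show routine in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, trimmed-down view of a registered idle method. */
struct idle_method {
	const char *name;
	int freeze;
};

/*
 * Write a space-separated list of method names into buf, prefixing
 * frozen ones with '*', and end with a newline. Stops at the first
 * entry that would not fit. Returns the number of characters written,
 * not counting the terminating NUL.
 */
size_t format_idle_methods(char *buf, size_t size,
			   const struct idle_method *m, size_t n)
{
	size_t i, len = 0;

	assert(buf != NULL && size >= 2);
	for (i = 0; i < n; i++) {
		/* optional separator + optional '*' + name */
		size_t need = (len ? 1 : 0) + (m[i].freeze ? 1 : 0) +
			      strlen(m[i].name);

		/* always keep room for the final "\n" and the NUL */
		if (len + need + 2 > size)
			break;
		len += (size_t)snprintf(buf + len, size - len, "%s%s%s",
					len ? " " : "",
					m[i].freeze ? "*" : "",
					m[i].name);
	}
	len += (size_t)snprintf(buf + len, size - len, "\n");
	return len;
}
```

With a frozen "pm_idle" entry followed by "poll" and "default", this produces the line "*pm_idle poll default".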

I moved the code from arch/x86_64 to outside the arch directories into
kernel. The file is called idle.c. This implements functions to
register idle and unregister idle. It also has the functions to set
which idle to use. This file also creates the entries into the sysfs
directory. Currently this is only compiled for i386, x86_64, and
ia64.

Since I only have i386 and x86_64, I was only able to test the changes
in those two archs. I modified ia64, but haven't even tried to compile
it. If someone with that arch would like to do me the favor, please
do ;-)

I've created an idle_info structure that is used to register the idle
functions. This is now how acpi adds its functions.

struct idle_info {
struct list_head list; /* used to link in with all other registered */
const char *name; /* name to be used to add as well as to show */
idlefunc_t func; /* the function to be called for idle */
int freeze; /* set to disallow the user from adding or removing it */
int inuse; /* set when being used as the idle function */
};

This is a much more robust way of handling changes of the idle function
and can easily be adapted to other archs that would like to also
implement dynamic changes of the idle function. This would be nice to
add to sparc (hint hint).
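
To make the intended life cycle of an idle_info easier to follow, here is a userspace model of register, select and unregister. It is a sketch under several assumptions: it uses a plain singly linked list instead of list_head, does no locking (the patch takes a semaphore), the demo_* functions are hypothetical stand-ins for real idle routines, and it assumes that selecting a method should also update idle_func and that a successful set_idle() returns 0.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef void (*idlefunc_t)(void);

struct idle_info {
	struct idle_info *next;	/* the patch uses struct list_head */
	const char *name;	/* name used to select and to display */
	idlefunc_t func;	/* function to call when idle */
	int freeze;		/* user may not switch to or from it */
	int inuse;		/* non-zero while selected */
};

static struct idle_info *idle_list;	/* registered methods */
static struct idle_info *curr_idle;	/* NULL means "default" */
idlefunc_t idle_func;			/* what the idle loop would call */

/* Hypothetical stand-ins for real idle routines. */
void demo_poll_idle(void) { }
void demo_acpi_idle(void) { }

static struct idle_info *find_idle(const char *name)
{
	struct idle_info *p;

	for (p = idle_list; p; p = p->next)
		if (strcmp(p->name, name) == 0)
			return p;
	return NULL;
}

int register_idle(struct idle_info *info)
{
	assert(info && info->name);
	if (find_idle(info->name))
		return -EEXIST;		/* names must be unique */
	info->next = idle_list;
	idle_list = info;
	return 0;
}

int unregister_idle(const char *name)
{
	struct idle_info **pp;

	for (pp = &idle_list; *pp; pp = &(*pp)->next) {
		if (strcmp((*pp)->name, name) != 0)
			continue;
		if ((*pp)->inuse)
			return -EBUSY;	/* still the active method */
		*pp = (*pp)->next;
		return 0;
	}
	return -EINVAL;			/* never registered */
}

/* Select a method by name; NULL selects the default. */
int set_idle(const char *name)
{
	struct idle_info *p = NULL;

	if (name && !(p = find_idle(name)))
		return -EINVAL;
	if (curr_idle)
		curr_idle->inuse--;
	if (p)
		p->inuse++;
	curr_idle = p;
	idle_func = p ? p->func : NULL;	/* NULL: caller uses its default */
	return 0;
}
```

The -EBUSY result from unregister_idle() is what the ACPI removal path in the patch retries on: it first calls set_idle(NULL) and then tries to unregister.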

Here's the patch:

Signed-off-by: Steven Rostedt <[email protected]>

Index: linux-2.6.15-rc2-git5/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/process.c 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/process.c 2005-11-28 20:30:51.000000000 -0500
@@ -39,6 +39,7 @@
#include <linux/ptrace.h>
#include <linux/random.h>
#include <linux/kprobes.h>
+#include <linux/idle.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -72,11 +73,6 @@
return ((unsigned long *)tsk->thread.esp)[3];
}

-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
static DEFINE_PER_CPU(unsigned int, cpu_idle_state);

void disable_hlt(void)
@@ -185,7 +181,7 @@
__get_cpu_var(cpu_idle_state) = 0;

rmb();
- idle = pm_idle;
+ idle = idle_func;

if (!idle)
idle = default_idle;
@@ -230,6 +226,8 @@
}
EXPORT_SYMBOL_GPL(cpu_idle_wait);

+static struct idle_info idle_mwait;
+
/*
* This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
* which can obviate IPI to trigger checking of need_resched.
@@ -258,25 +256,62 @@
* Skip, if setup has overridden idle.
* One CPU supports mwait => All CPUs supports mwait
*/
- if (!pm_idle) {
+ memset(&idle_mwait, 0, sizeof(idle_mwait));
+ idle_mwait.name = "mwait";
+ idle_mwait.func = mwait_idle;
+ register_idle(&idle_mwait);
+
+ if (!idle_func) {
printk("using mwait in idle threads.\n");
- pm_idle = mwait_idle;
+ set_idle("mwait");
}
}
}

+static struct idle_info idle_default;
+static struct idle_info idle_poll;
+
+static int __init add_idle(void)
+{
+ static int set;
+
+ if (set)
+ return 0;
+ set = 1;
+
+ memset(&idle_poll, 0, sizeof(idle_poll));
+ idle_poll.name = "poll";
+ idle_poll.func = poll_idle;
+ register_idle(&idle_poll);
+
+ /*
+ * Allow the user to switch out of poll_idle even
+ * if it was a boot option.
+ */
+ memset(&idle_default, 0, sizeof(idle_default));
+ idle_default.name = "default";
+ idle_default.func = default_idle;
+ register_idle(&idle_default);
+
+ return 0;
+}
+
+arch_initcall(add_idle);
+
static int __init idle_setup (char *str)
{
+ add_idle();
if (!strncmp(str, "poll", 4)) {
printk("using polling idle threads.\n");
- pm_idle = poll_idle;
+ set_idle("poll");
+
#ifdef CONFIG_X86_SMP
if (smp_num_siblings > 1)
printk("WARNING: polling idle and HT enabled, performance may degrade.\n");
#endif
} else if (!strncmp(str, "halt", 4)) {
printk("using halt in idle threads.\n");
- pm_idle = default_idle;
+ set_idle("default");
}

boot_option_idle_override = 1;
Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/process.c 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c 2005-11-28 20:30:21.000000000 -0500
@@ -36,6 +36,8 @@
#include <linux/utsname.h>
#include <linux/random.h>
#include <linux/kprobes.h>
+#include <linux/spinlock.h>
+#include <linux/idle.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -60,10 +62,6 @@
unsigned long boot_option_idle_override = 0;
EXPORT_SYMBOL(boot_option_idle_override);

-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
static DEFINE_PER_CPU(unsigned int, cpu_idle_state);

void disable_hlt(void)
@@ -195,7 +193,7 @@
__get_cpu_var(cpu_idle_state) = 0;

rmb();
- idle = pm_idle;
+ idle = idle_func;
if (!idle)
idle = default_idle;
if (cpu_is_offline(smp_processor_id()))
@@ -209,6 +207,8 @@
}
}

+struct idle_info idle_mwait;
+
/*
* This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
* which can obviate IPI to trigger checking of need_resched.
@@ -233,25 +233,61 @@
{
static int printed;
if (cpu_has(c, X86_FEATURE_MWAIT)) {
+ memset(&idle_mwait, 0, sizeof(idle_mwait));
+ idle_mwait.name = "mwait";
+ idle_mwait.func = mwait_idle;
+ register_idle(&idle_mwait);
+
/*
* Skip, if setup has overridden idle.
* One CPU supports mwait => All CPUs supports mwait
*/
- if (!pm_idle) {
+ if (!idle_func) {
if (!printed) {
printk("using mwait in idle threads.\n");
printed = 1;
}
- pm_idle = mwait_idle;
+ set_idle("mwait");
}
}
}

+static struct idle_info idle_default;
+static struct idle_info idle_poll;
+
+static int __init add_idle(void)
+{
+ static int set;
+
+ if (set)
+ return 0;
+ set = 1;
+
+ memset(&idle_poll, 0, sizeof(idle_poll));
+ idle_poll.name = "poll";
+ idle_poll.func = poll_idle;
+ register_idle(&idle_poll);
+
+ /*
+ * Allow the user to switch out of poll_idle even
+ * if it was a boot option.
+ */
+ memset(&idle_default, 0, sizeof(idle_default));
+ idle_default.name = "default";
+ idle_default.func = default_idle;
+ register_idle(&idle_default);
+
+ return 0;
+}
+arch_initcall(add_idle);
+
static int __init idle_setup (char *str)
{
+ add_idle();
+
if (!strncmp(str, "poll", 4)) {
printk("using polling idle threads.\n");
- pm_idle = poll_idle;
+ set_idle("poll");
}

boot_option_idle_override = 1;
Index: linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/drivers/acpi/processor_idle.c 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c 2005-11-28 19:59:42.000000000 -0500
@@ -38,6 +38,8 @@
#include <linux/dmi.h>
#include <linux/moduleparam.h>
#include <linux/sched.h> /* need_resched() */
+#include <linux/spinlock.h>
+#include <linux/idle.h>

#include <asm/io.h>
#include <asm/uaccess.h>
@@ -56,6 +58,7 @@
#define C3_OVERHEAD 4 /* 1us (3.579 ticks per us) */
static void (*pm_idle_save) (void);
module_param(max_cstate, uint, 0644);
+#define PM_IDLE_NAME "pm_idle"

static unsigned int nocst = 0;
module_param(nocst, uint, 0000);
@@ -891,13 +894,13 @@
return_VALUE(-ENODEV);

/* Fall back to the default idle loop */
- pm_idle = pm_idle_save;
+ set_idle(NULL);
synchronize_sched(); /* Relies on interrupts forcing exit from idle. */

pr->flags.power = 0;
result = acpi_processor_get_power_info(pr);
if ((pr->flags.power == 1) && (pr->flags.power_setup_done))
- pm_idle = acpi_processor_idle;
+ set_idle(PM_IDLE_NAME);

return_VALUE(result);
}
@@ -983,6 +986,8 @@
.release = single_release,
};

+static struct idle_info pm_idle_info;
+
int acpi_processor_power_init(struct acpi_processor *pr,
struct acpi_device *device)
{
@@ -1032,8 +1037,17 @@
printk(")\n");

if (pr->id == 0) {
- pm_idle_save = pm_idle;
- pm_idle = acpi_processor_idle;
+ memset(&pm_idle_info, 0, sizeof(pm_idle_info));
+ pm_idle_info.name = PM_IDLE_NAME;
+ pm_idle_info.func = acpi_processor_idle;
+ pm_idle_info.freeze = 1;
+
+ register_idle(&pm_idle_info);
+ /*
+ * Just use the default idle
+ */
+ pm_idle_save = get_idle(NULL);
+ set_idle(PM_IDLE_NAME);
}
}

@@ -1068,7 +1082,29 @@

/* Unregister the idle handler when processor #0 is removed. */
if (pr->id == 0) {
- pm_idle = pm_idle_save;
+ int tries = 0;
+ int ret;
+ set_idle(NULL);
+ do {
+ if ((ret = unregister_idle(PM_IDLE_NAME)) == 0)
+ break;
+ /*
+ * for some reason the idle function is being used.
+ * Wait a little and then try again.
+ */
+ if (ret == -EINVAL) {
+ printk(KERN_WARNING
+ "ACPI idle function never registered?\n");
+ break;
+ }
+ yield();
+ } while (tries++ < 10);
+ if (tries > 10) {
+ printk(KERN_WARNING
+ "Unable to unregister ACPI idle function\n");
+ /* don't unregister */
+ return_VALUE(ret);
+ }

/*
* We are about to unload the current idle thread pm callback
Index: linux-2.6.15-rc2-git5/include/linux/pm.h
===================================================================
--- linux-2.6.15-rc2-git5.orig/include/linux/pm.h 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/include/linux/pm.h 2005-11-28 19:59:42.000000000 -0500
@@ -25,6 +25,7 @@

#include <linux/config.h>
#include <linux/list.h>
+#include <linux/spinlock.h>
#include <asm/atomic.h>

/*
@@ -102,6 +103,8 @@
*/
extern void (*pm_idle)(void);
extern void (*pm_power_off)(void);
+extern spinlock_t pm_idle_switch_lock;
+extern int pm_idle_locked;

typedef int __bitwise suspend_state_t;

Index: linux-2.6.15-rc2-git5/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/Kconfig 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/Kconfig 2005-11-28 19:59:42.000000000 -0500
@@ -69,6 +69,10 @@
bool
default y

+config DYNAMIC_IDLE
+ bool
+ default y
+
source "init/Kconfig"


Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/x8664_ksyms.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/x8664_ksyms.c 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/x8664_ksyms.c 2005-11-28 19:59:42.000000000 -0500
@@ -58,7 +58,6 @@
EXPORT_SYMBOL(disable_irq_nosync);
EXPORT_SYMBOL(probe_irq_mask);
EXPORT_SYMBOL(kernel_thread);
-EXPORT_SYMBOL(pm_idle);
EXPORT_SYMBOL(pm_power_off);
EXPORT_SYMBOL(get_cmos_time);

Index: linux-2.6.15-rc2-git5/include/linux/idle.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-git5/include/linux/idle.h 2005-11-28 21:36:00.000000000 -0500
@@ -0,0 +1,67 @@
+/*
+ * idle.h - Registering of the idle function (for supported archs)
+ *
+ * Copyright (C) 2005 Steven Rostedt <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef _LINUX_IDLE_H
+#define _LINUX_IDLE_H
+
+#include <linux/config.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <asm/atomic.h>
+
+typedef void (*idlefunc_t)(void);
+
+struct idle_info {
+ struct list_head list;
+ const char *name; /* Name visible to users */
+ idlefunc_t func; /* idle function to run */
+ int freeze; /* Only allow kernel to add or remove */
+ int inuse; /* set when being used */
+};
+
+/*
+ * Registering and unregistering functions that may be used
+ * instead of the default idle function. This only adds
+ * them to the list of functions to be used, it does not
+ * set the idle function; use set_idle() for that.
+ */
+extern int register_idle(struct idle_info *info);
+extern int unregister_idle(const char *name);
+
+/*
+ * This sets the idle function to the registered function
+ * by name. Use NULL to set the idle function back to
+ * the default.
+ */
+extern int set_idle(const char *name);
+
+/*
+ * Return the function that is registered by name.
+ * Use NULL to get the default function.
+ * NULL may be returned (as that may be what the current
+ * idle function is set to, to use a default). NULL will
+ * also be returned if name is not registered.
+ */
+extern idlefunc_t get_idle(const char *name);
+
+extern idlefunc_t idle_func;
+
+#endif /* _LINUX_IDLE_H */
Index: linux-2.6.15-rc2-git5/kernel/Makefile
===================================================================
--- linux-2.6.15-rc2-git5.orig/kernel/Makefile 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/kernel/Makefile 2005-11-28 19:59:42.000000000 -0500
@@ -32,6 +32,7 @@
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_DYNAMIC_IDLE) += idle.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: linux-2.6.15-rc2-git5/kernel/idle.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-git5/kernel/idle.c 2005-11-28 20:29:57.000000000 -0500
@@ -0,0 +1,308 @@
+/*
+ * kernel/idle.c
+ *
+ * Setting up of the idle function to be dynamic.
+ *
+ * Copyright (C) 2005 Steven Rostedt
+ */
+#include <linux/module.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/spinlock.h>
+#include <linux/idle.h>
+
+idlefunc_t idle_func;
+
+static void (*idle_default)(void);
+static LIST_HEAD(idle_elements);
+static DECLARE_MUTEX(idle_sem);
+static struct idle_info *curr_idle;
+
+#ifdef CONFIG_SYSFS
+int idle_sysfs_init;
+#endif
+
+extern void poll_idle (void);
+
+static struct idle_info *__find_idle_info(const char *name)
+{
+ struct list_head *curr;
+ struct idle_info *p;
+ /*
+ * A little inefficient, but this isn't called often.
+ */
+ list_for_each(curr, &idle_elements) {
+ p = list_entry(curr, struct idle_info, list);
+ if (!strcmp(name, p->name))
+ break;
+ }
+ if (curr == &idle_elements)
+ p = NULL;
+
+ return p;
+}
+
+int register_idle(struct idle_info *info)
+{
+ struct idle_info *p;
+ int ret = -EEXIST;
+
+ BUG_ON(!info->name);
+
+ down(&idle_sem);
+
+ p = __find_idle_info(info->name);
+ if (p)
+ goto out;
+ ret = 0;
+
+ list_add(&info->list, &idle_elements);
+
+out:
+ up(&idle_sem);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(register_idle);
+
+int unregister_idle(const char *name)
+{
+ struct idle_info *p;
+ int ret = -EINVAL;
+
+ BUG_ON(!name);
+
+ down(&idle_sem);
+
+ p = __find_idle_info(name);
+ if (!p)
+ goto out;
+ if (p->inuse) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ ret = 0;
+
+ list_del_init(&p->list);
+
+out:
+ up(&idle_sem);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(unregister_idle);
+
+static int __set_idle(struct idle_info *info)
+{
+ if (curr_idle)
+ curr_idle->inuse--;
+ info->inuse++;
+ curr_idle = info;
+ return 0;
+}
+
+int set_idle(const char *name)
+{
+ struct idle_info *p;
+ int ret = 0;
+
+ down(&idle_sem);
+
+ if (!name) {
+ /* Set to the default function */
+ if (curr_idle) {
+ curr_idle->inuse--;
+ curr_idle = NULL;
+ }
+ idle_func = idle_default;
+ goto out;
+ }
+
+ ret = -EINVAL;
+ p = __find_idle_info(name);
+ if (!p)
+ goto out;
+
+ ret = __set_idle(p);
+out:
+ up(&idle_sem);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(set_idle);
+
+idlefunc_t get_idle(const char *name)
+{
+ struct idle_info *p;
+ idlefunc_t ret = idle_default;
+
+ down(&idle_sem);
+
+ if (!name)
+ goto out;
+
+ p = __find_idle_info(name);
+ if (!p)
+ goto out;
+
+ ret = p->func;
+out:
+ up(&idle_sem);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(get_idle);
+
+#ifdef CONFIG_SYSFS
+#define KERNEL_ATTR_RW(_name) \
+static struct subsys_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct idlep_kobject
+{
+ struct kobject kobj;
+} idle_kobj;
+
+static ssize_t idle_ctrl_show(struct subsystem *subsys, char *page)
+{
+ ssize_t ret;
+ char *star = "";
+ const char *name = "default";
+
+ down(&idle_sem);
+ if (curr_idle) {
+ name = curr_idle->name;
+ if (curr_idle->freeze)
+ star = "*";
+ }
+ ret = sprintf(page, "%s%s\n", star, name);
+ up(&idle_sem);
+
+ return ret;
+}
+
+static ssize_t idle_ctrl_store(struct subsystem *subsys,
+ const char *buf, size_t len)
+{
+ struct list_head *curr;
+ struct idle_info *p;
+ ssize_t ret = -EBUSY;
+
+ down(&idle_sem);
+
+ if (curr_idle && curr_idle->freeze)
+ goto out;
+
+ list_for_each(curr, &idle_elements) {
+ int size;
+ p = list_entry(curr, struct idle_info, list);
+
+ size = strlen(p->name);
+ if (len <= size)
+ continue;
+ if (!strncmp(p->name, buf, size))
+ break;
+ }
+ if (curr == &idle_elements) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * This idle routine may have been registered to
+ * not allow users to add or remove this.
+ */
+ if (p->freeze)
+ goto out;
+
+ __set_idle(p);
+
+ ret = len;
+out:
+ up(&idle_sem);
+
+ return ret;
+}
+
+KERNEL_ATTR_RW(idle_ctrl);
+
+static ssize_t idle_methods_show(struct subsystem *subsys, char *page)
+{
+ struct list_head *curr;
+ struct idle_info *p;
+ ssize_t len = 0;
+
+ down(&idle_sem);
+ list_for_each(curr, &idle_elements) {
+ p = list_entry(curr, struct idle_info, list);
+ if (len + 3 + strlen(p->name) >= PAGE_SIZE) {
+ printk("idle functions overflowed sysfs??\n");
+ break;
+ }
+ len += sprintf(page+len, "%s%s%s",
+ len ? " " : "",
+ p->freeze ? "*" : "",
+ p->name);
+ }
+ if (len + 2 < PAGE_SIZE)
+ len += sprintf(page+len, "\n");
+
+ up(&idle_sem);
+ return len;
+}
+
+static ssize_t idle_methods_store(struct subsystem *subsys,
+ const char *buf, size_t len)
+{
+ /* do nothing */
+ return len;
+}
+
+KERNEL_ATTR_RW(idle_methods);
+
+static struct attribute * idle_attrs[] = {
+ &idle_ctrl_attr.attr,
+ &idle_methods_attr.attr,
+ NULL
+};
+
+static struct attribute_group idle_attr_group = {
+ .attrs = idle_attrs,
+};
+
+static int __init idle_setup_sysfs(void)
+{
+ int err;
+
+ memset(&idle_kobj, 0, sizeof(idle_kobj));
+ err = kobject_set_name(&idle_kobj.kobj, "%s", "idle");
+ if (err)
+ goto out;
+
+ kobj_set_kset_s(&idle_kobj, kernel_subsys);
+
+ idle_kobj.kobj.parent = &kernel_subsys.kset.kobj;
+ err = kobject_register(&idle_kobj.kobj);
+ if (err)
+ goto out;
+
+ err = sysfs_create_group(&idle_kobj.kobj,
+ &idle_attr_group);
+ if (err)
+ goto out;
+
+ return 0;
+out:
+ printk(KERN_INFO "Problem setting up sysfs idle_ctrl\n");
+ return 0;
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init idle_setup(void)
+{
+ idle_default = idle_func;
+
+#ifdef CONFIG_SYSFS
+ idle_setup_sysfs();
+#endif
+ return 0;
+}
+
+late_initcall(idle_setup);
Index: linux-2.6.15-rc2-git5/arch/i386/Kconfig
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/Kconfig 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/i386/Kconfig 2005-11-28 19:59:42.000000000 -0500
@@ -45,6 +45,10 @@
bool
default y

+config DYNAMIC_IDLE
+ bool
+ default y
+
source "init/Kconfig"

menu "Processor type and features"
Index: linux-2.6.15-rc2-git5/arch/i386/kernel/apm.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/apm.c 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/apm.c 2005-11-28 19:59:42.000000000 -0500
@@ -225,6 +225,7 @@
#include <linux/smp_lock.h>
#include <linux/dmi.h>
#include <linux/suspend.h>
+#include <linux/idle.h>

#include <asm/system.h>
#include <asm/uaccess.h>
@@ -2220,6 +2221,9 @@
{ }
};

+static struct idle_info apm_idle;
+#define APM_IDLE_NAME "apm"
+
/*
* Just start the APM thread. We do NOT want to do APM BIOS
* calls from anything but the APM thread, if for no other reason
@@ -2373,8 +2377,14 @@
if (HZ != 100)
idle_period = (idle_period * HZ) / 100;
if (idle_threshold < 100) {
- original_pm_idle = pm_idle;
- pm_idle = apm_cpu_idle;
+ memset(&apm_idle, 0, sizeof(apm_idle));
+ apm_idle.name = APM_IDLE_NAME;
+ apm_idle.func = apm_cpu_idle;
+ apm_idle.freeze = 1;
+ register_idle(&apm_idle);
+
+ original_pm_idle = get_idle(NULL);
+ set_idle(APM_IDLE_NAME);
set_pm_idle = 1;
}

@@ -2386,7 +2396,26 @@
int error;

if (set_pm_idle) {
- pm_idle = original_pm_idle;
+ int tries = 0;
+ int ret;
+ set_idle(NULL);
+ do {
+ if ((ret = unregister_idle(APM_IDLE_NAME)) == 0)
+ break;
+ /*
+ * for some reason the idle function is being used.
+ * Wait a little and then try again.
+ */
+ if (ret == -EINVAL) {
+ printk(KERN_WARNING
+ "APM idle function never registered?\n");
+ break;
+ }
+ yield();
+ } while (tries++ < 10);
+ if (tries > 10)
+ printk(KERN_WARNING
+ "Unable to unresgister APM idle function\n");
/*
* We are about to unload the current idle thread pm callback
* (pm_idle), Wait for all processors to update cached/local
Index: linux-2.6.15-rc2-git5/arch/ia64/Kconfig
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/Kconfig 2005-11-22 12:13:22.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/Kconfig 2005-11-28 20:17:30.000000000 -0500
@@ -62,6 +62,10 @@
bool
default y

+config DYNAMIC_IDLE
+ bool
+ default y
+
choice
prompt "System type"
default IA64_GENERIC
Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/acpi.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/acpi.c 2005-11-22 12:13:22.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/kernel/acpi.c 2005-11-28 20:23:41.000000000 -0500
@@ -60,8 +60,6 @@

#define PREFIX "ACPI: "

-void (*pm_idle) (void);
-EXPORT_SYMBOL(pm_idle);
void (*pm_power_off) (void);
EXPORT_SYMBOL(pm_power_off);

Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/process.c 2005-11-25 10:58:53.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/kernel/process.c 2005-11-28 20:29:33.000000000 -0500
@@ -31,6 +31,7 @@
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/kprobes.h>
+#include <linux/idle.h>

#include <asm/cpu.h>
#include <asm/delay.h>
@@ -289,7 +290,7 @@
if (mark_idle)
(*mark_idle)(1);

- idle = pm_idle;
+ idle = idle_func;
if (!idle)
idle = default_idle;
(*idle)();
Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/setup.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/setup.c 2005-11-22 12:13:22.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/kernel/setup.c 2005-11-28 20:23:09.000000000 -0500
@@ -43,6 +43,7 @@
#include <linux/initrd.h>
#include <linux/platform.h>
#include <linux/pm.h>
+#include <linux/idle.h>

#include <asm/ia32.h>
#include <asm/machvec.h>
@@ -738,6 +739,8 @@
ia64_max_cacheline_size = max;
}

+struct idle_info idle_default;
+
/*
* cpu_init() initializes state that is per-CPU. This function acts
* as a 'CPU state barrier', nothing should get across.
@@ -861,7 +864,13 @@
/* size of physical stacked register partition plus 8 bytes: */
__get_cpu_var(ia64_phys_stacked_size_p8) = num_phys_stacked*8 + 8;
platform_cpu_init();
- pm_idle = default_idle;
+
+ memset(&idle_default, 0, sizeof(idle_default));
+ idle_default.name = "default";
+ idle_default.func = default_idle;
+ register_idle(&idle_default);
+
+ set_idle("default");
}

void


2005-11-29 03:03:57

by Andrew Morton

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

Steven Rostedt <[email protected]> wrote:
>
> This patch creates a directory in /sys/kernel called idle.
>

At no point do you appear to explain _why_ the kernel needs this feature?

> ...
> - pm_idle = pm_idle_save;
> + int tries = 0;
> + int ret;
> + set_idle(NULL);
> + do {
> + if ((ret = unregister_idle(PM_IDLE_NAME)) == 0)
> + break;
> + /*
> + * for some reason the idle function is being used.
> + * Wait a little and then try again.
> + */
> + if (ret == -EINVAL) {
> + printk(KERN_WARNING
> + "ACPI idle function never registered?\n");
> + break;
> + }
> + yield();
> + } while (tries++ < 10);

The use of yield() could be problematic - its semantics are rather
ill-defined. Maybe msleep(1) or something?

What's this loop here for anyway? Looks kludgy.

> + if (tries > 10) {
> + printk(KERN_WARNING
> + "Unable to unresgister ACPI idle function\n");

tpyo

> + memset(&idle_kobj, 0, sizeof(idle_kobj));

There are several memsets of statically allocated structures which are
already all-zero.

2005-11-29 03:42:27

by Steven Rostedt

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Mon, 2005-11-28 at 19:02 -0800, Andrew Morton wrote:
> Steven Rostedt <[email protected]> wrote:
> >
> > This patch creates a directory in /sys/kernel called idle.
> >
>
> At no point do you appear to explain _why_ the kernel needs this feature?

Sorry about that. This originally came up when we had problems with the
AMD64 x2 in the -rt patch. It was noted that the TSCs would get very
far out of sync and cause problems. The way to solve this was to set
idle=poll. The original patch I sent was to allow the user to change to
idle=poll dynamically. This way they could switch to the poll_idle and
run their tests (requiring tsc not to drift) and then switch back to the
default idle to save on electricity.
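
For illustration only, the switch-test-switch-back workflow could look roughly like this from user space. This is a minimal sketch, not part of the patch: the path assumes the "idle" kobject and its idle_ctrl attribute land at /sys/kernel/idle/idle_ctrl, and the helper names idle_ctrl_write() and idle_ctrl_read() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed location: kobject "idle" under /sys/kernel, attribute idle_ctrl. */
#define IDLE_CTRL_PATH "/sys/kernel/idle/idle_ctrl"

/*
 * Write an idle method name (e.g. "poll") to the control file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int idle_ctrl_write(const char *path, const char *name)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	/*
	 * The take-2 store routine skips an entry when len <= strlen(name),
	 * so the trailing newline is needed for the name to match.
	 */
	ok = fprintf(f, "%s\n", name) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Read the current idle method into buf, with the newline stripped.
 * A leading '*' in the result marks a frozen (kernel-only) method.
 * Returns 0 on success, -1 on error.
 */
int idle_ctrl_read(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A test harness would call idle_ctrl_write(IDLE_CTRL_PATH, "poll") before the measurement and idle_ctrl_write(IDLE_CTRL_PATH, "default") afterwards. The write is expected to fail with EBUSY while a frozen method such as the APM or ACPI one is active.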

Note: It's been stated that the tsc drift can cause problems with the
vanilla kernel too.

Ingo asked if I could make this more robust and not dependent on
idle_poll.

Maybe Ingo can give a better explanation?

>
> > ...
> > - pm_idle = pm_idle_save;
> > + int tries = 0;
> > + int ret;
> > + set_idle(NULL);
> > + do {
> > + if ((ret = unregister_idle(PM_IDLE_NAME)) == 0)
> > + break;
> > + /*
> > + * for some reason the idle function is being used.
> > + * Wait a little and then try again.
> > + */
> > + if (ret == -EINVAL) {
> > + printk(KERN_WARNING
> > + "ACPI idle function never registered?\n");
> > + break;
> > + }
> > + yield();
> > + } while (tries++ < 10);
>
> The use of yield() could be problematic - its semantics are rather
> ill-defined. Maybe msleep(1) or something?
>
> What's this loop here for anyway? Looks kludgy.

Oops! That was required by some other garbage that I had earlier. I
cleaned up the patch some more, and this is no longer required. (will
remove).

>
> > + if (tries > 10) {
> > + printk(KERN_WARNING
> > + "Unable to unresgister ACPI idle function\n");
>
> tpyo

Will fix.

>
> > + memset(&idle_kobj, 0, sizeof(idle_kobj));
>
> There are several memsets of statically allocated structures which are
> already all-zero.
>

:) I'm really paranoid! OK, I always like to do a memset even when it's
not needed. I'll purge them too.

Thanks for having a look.

-- Steve


2005-11-29 04:02:10

by Andrew Morton

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

Steven Rostedt <[email protected]> wrote:
>
> On Mon, 2005-11-28 at 19:02 -0800, Andrew Morton wrote:
> > Steven Rostedt <[email protected]> wrote:
> > >
> > > This patch creates a directory in /sys/kernel called idle.
> > >
> >
> > At no point do you appear to explain _why_ the kernel needs this feature?
>
> Sorry about that. This originally came up when we had problems with the
> AMD64 x2 in the -rt patch. It was noted that the TSCs would get very
> far out of sync and cause problems.

Unsynced TSCs are rare, but they happen. I guess even if we were to resync
them, these measurements would screw up.


> The way to solve this was to set
> idle=poll. The original patch I sent was to allow the user to change to
> idle=poll dynamically. This way they could switch to the poll_idle and
> run there tests (requiring tsc not to drift) and then switch back to the
> default idle to save on electricity.

Use gettimeofday()?

If it's just for some sort of instrumentation, run NR_CPUS instances of a
niced-down busyloop, pin each one to a different CPU? That way the idle
function doesn't get called at all..
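
A rough user-space sketch of that suggestion, assuming Linux and glibc (sched_setaffinity() and the CPU_* macros need _GNU_SOURCE). The helper names first_allowed_cpu(), pin_to_cpu() and spawn_spinner() are invented for this example:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the lowest-numbered CPU this process may run on, or -1. */
int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/* Pin the calling process to a single CPU. Returns 0 on success. */
int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Fork one lowest-priority busy loop pinned to `cpu`, so that CPU
 * always has something runnable and never enters the idle function.
 * Returns the child's pid in the parent, or -1 if fork() failed.
 */
pid_t spawn_spinner(int cpu)
{
	pid_t pid = fork();

	if (pid != 0)
		return pid;	/* parent, or fork failure */

	if (pin_to_cpu(cpu) != 0)
		_exit(1);
	errno = 0;		/* nice() may legitimately return -1 */
	if (nice(19) == -1 && errno != 0)
		_exit(1);
	for (;;)
		;		/* spin until killed */
}
```

The caller would invoke spawn_spinner() once per online CPU and kill the children when the measurement is done. Like idle=poll, this keeps every CPU busy, so the power cost is the same.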

2005-11-29 04:22:56

by john stultz

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Mon, 2005-11-28 at 22:42 -0500, Steven Rostedt wrote:
> On Mon, 2005-11-28 at 19:02 -0800, Andrew Morton wrote:
> > Steven Rostedt <[email protected]> wrote:
> > >
> > > This patch creates a directory in /sys/kernel called idle.
> > >
> >
> > At no point do you appear to explain _why_ the kernel needs this feature?
>
> Sorry about that. This originally came up when we had problems with the
> AMD64 x2 in the -rt patch. It was noted that the TSCs would get very
> far out of sync and cause problems. The way to solve this was to set
> idle=poll. The original patch I sent was to allow the user to change to
> idle=poll dynamically. This way they could switch to the poll_idle and
> run there tests (requiring tsc not to drift) and then switch back to the
> default idle to save on electricity.

The problem with this is that this must be a one way transition. That
is, once the TSCs have become unsynchronized, there is no use going back
to using the polling idle unless you add some code to re-sync the TSCs
which would be ugly to do after the system has booted.

Using idle=poll (for anything other than debugging) is really a worst
case workaround for systems that do not have alternative clocksources
like ACPI PM or HPET.

It's an interesting bit of code, but I'm not really sure I understand its
usefulness.

thanks
-john



2005-11-29 06:44:42

by Ingo Molnar

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]


* Andrew Morton <[email protected]> wrote:

> > The way to solve this was to set
> > idle=poll. The original patch I sent was to allow the user to change to
> > idle=poll dynamically. This way they could switch to the poll_idle and
> > run there tests (requiring tsc not to drift) and then switch back to the
> > default idle to save on electricity.
>
> Use gettimeofday()?
>
> If it's just for some sort of instrumentation, run NR_CPUS instances
> of a niced-down busyloop, pin each one to a different CPU? That way
> the idle function doesn't get called at all..

idle=poll is also frequently done for performance reasons [it reduces
idle wakeup latency by 10 usecs] - while it could be turned off if the
system has been idle for some time. E.g. cpufreqd could sample idle time
and turn on/off idle=poll. High-performance setups could enable it all
the time.

as long as it can be done with zero-cost, i don't see why Steven's patch
wouldn't be a plus for us. It's a performance thing, and having runtime
switches for seamless performance features cannot be bad.

Ingo

2005-11-29 06:55:15

by Nick Piggin

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
>
>
>>>The way to solve this was to set
>>> idle=poll. The original patch I sent was to allow the user to change to
>>> idle=poll dynamically. This way they could switch to the poll_idle and
>>> run there tests (requiring tsc not to drift) and then switch back to the
>>> default idle to save on electricity.
>>
>>Use gettimeofday()?
>>
>>If it's just for some sort of instrumentation, run NR_CPUS instances
>>of a niced-down busyloop, pin each one to a different CPU? That way
>>the idle function doesn't get called at all..
>
>
> idle=poll is also frequently done for performance reasons [it reduces
> idle wakeup latency by 10 usecs] - while it could be turned off if the
> system has been idle for some time. E.g. cpufreqd could sample idle time
> and turn on/off idle=poll. High-performance setups could enable it all
> the time.
>
> as long as it can be done with zero-cost, i dont see why Steven's patch
> wouldnt be a plus for us. It's a performance thing, and having runtime
> switches for seemless performance features cannot be bad.
>

Why not just slightly clean up and extend (e.g. to ACPI) the
hlt_counter thingy that many architectures already have?

Nick

--
SUSE Labs, Novell Inc.


2005-11-29 13:37:40

by Andi Kleen

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

Ingo Molnar <[email protected]> writes:

> * Andrew Morton <[email protected]> wrote:
>
> > > The way to solve this was to set
> > > idle=poll. The original patch I sent was to allow the user to change to
> > > idle=poll dynamically. This way they could switch to the poll_idle and
> > > run there tests (requiring tsc not to drift) and then switch back to the
> > > default idle to save on electricity.
> >
> > Use gettimeofday()?
> >
> > If it's just for some sort of instrumentation, run NR_CPUS instances
> > of a niced-down busyloop, pin each one to a different CPU? That way
> > the idle function doesn't get called at all..
>
> idle=poll is also frequently done for performance reasons [it reduces
> idle wakeup latency by 10 usecs]

And it's obsolete on CPUs with monitor/mwait.
And in practice the CPU will run so hot that only benchmarkers like it.

I think switching idle is the wrong way to do it. We should rather
fix the various problems.

For fixing the TSC issue it is 100% the wrong approach IMHO.
Basically software has to live with TSCs being unsynchronized
and gettimeofday should do the right thing (and if not, it should be fixed).
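
As a sketch of what "living with it" can look like for measurement code: read a kernel-maintained clock instead of the raw TSC. The example below uses POSIX clock_gettime() with CLOCK_MONOTONIC as a stand-in for gettimeofday(); the helper names monotonic_ns(), time_call_ns() and spin_briefly() are invented here, and whether the clock's cost and resolution are acceptable depends on the clocksource the kernel ends up using.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading in nanoseconds, or 0 on failure. */
uint64_t monotonic_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return 0;
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* A small placeholder workload to time. */
void spin_briefly(void)
{
	volatile unsigned int i;

	for (i = 0; i < 100000; i++)
		;
}

/*
 * Time one call of fn() in nanoseconds. The kernel keeps this clock
 * consistent across CPUs, so the result stays meaningful even if the
 * task migrates between two readings.
 */
uint64_t time_call_ns(void (*fn)(void))
{
	uint64_t start = monotonic_ns();

	fn();
	return monotonic_ns() - start;
}
```

Calling time_call_ns(spin_briefly) gives an elapsed time that does not depend on which CPU's TSC was read, at the price of a slower clock read than a bare rdtsc.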

> - while it could be turned off if the
> system has been idle for some time. E.g. cpufreqd could sample idle time
> and turn on/off idle=poll. High-performance setups could enable it all
> the time.

And upgrade their server air conditioning or issue additional ear protection
to the desktop user? Most likely you will just drive the CPUs into
thermal throttle at some point with that, not get more performance anyways.

> as long as it can be done with zero-cost, i dont see why Steven's patch
> wouldnt be a plus for us. It's a performance thing, and having runtime
> switches for seemless performance features cannot be bad.

The interface is ugly and I suspect fixing the various obscure races this
obscure feature would undoubtedly add will be a long term maintenance
issue. And it's the wrong thing to do anyways because it just papers
over other problems that should be fixed in the right way.

-Andi

2005-11-29 14:22:45

by Steven Rostedt

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Mon, 2005-11-28 at 20:22 -0800, john stultz wrote:
> On Mon, 2005-11-28 at 22:42 -0500, Steven Rostedt wrote:
> > On Mon, 2005-11-28 at 19:02 -0800, Andrew Morton wrote:
> > > Steven Rostedt <[email protected]> wrote:
> > > >
> > > > This patch creates a directory in /sys/kernel called idle.
> > > >
> > >
> > > At no point do you appear to explain _why_ the kernel needs this feature?
> >
> > Sorry about that. This originally came up when we had problems with the
> > AMD64 x2 in the -rt patch. It was noted that the TSCs would get very
> > far out of sync and cause problems. The way to solve this was to set
> > idle=poll. The original patch I sent was to allow the user to change to
> > idle=poll dynamically. This way they could switch to the poll_idle and
> > run there tests (requiring tsc not to drift) and then switch back to the
> > default idle to save on electricity.
>
> The problem with this is that this must be a one way transition. That
> is, once the TSCs have become unsynchronized, there is no use going back
> to using the polling idle unless you add some code to re-sync the TSCs
> which would be ugly to do after the system has booted.
>

I've thought about that too. But this patch does allow you to start
with idle=poll and then switch back. Also, if you do lock to a cpu,
you don't need to worry about the tsc slipping if you switch to
idle=poll.

-- Steve

> Using idle=poll (for anything other then debugging) is really a worst
> case workaround for systems that do not have alternative clocksources
> like ACPI PM or HPET.
>
> Its an interesting bit of code, but I'm not really sure I understand its
> usefulness.
>
> thanks
> -john
>
>
>

2005-11-29 14:19:42

by Steven Rostedt

Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Tue, 2005-11-29 at 11:05 -0700, Andi Kleen so nicely wrote:
> > idle=poll is also frequently done for performance reasons [it reduces
> > idle wakeup latency by 10 usecs]
>
> And it's obsolete on CPUs with monitor/mwait.

And I wish my system supported it.

> And in practice the CPU will run so hot that only benchmarkers like it.

Why would it run hot? What's the difference between polling and doing
other things? How many transistors does it take to poll?

>
> I think switching idle is the wrong way to do. We should rather
> fix the various problems.
>
> For fixing the TSC issue it is 100% the wrong approach Imho.

I would only say 80% the wrong approach, but that's me ;-)

> Basically software has to live with TSCs being unsynchronized
> and gettimeofday should do the right thing (and if not it should be fixed)

I guess the biggest complaint most have is that rdtsc _is_ the
fastest way to read a clock. If it isn't reliable, then what good is
it? It's unfortunate that Intel didn't solidify the clock usage. Yes,
use HPET, or something else, but those are slower, and may not be on all
systems. Every system that I have owned had a tsc, but for critical systems
it isn't up to par (what a shame).

>
> - while it could be turned off if the
> > system has been idle for some time. E.g. cpufreqd could sample idle time
> > and turn on/off idle=poll. High-performance setups could enable it all
> > the time.
>
> And upgrade their server air condition or issue additional ear protection
> to the desktop user? Most likely you will just drive the CPUs into
> thermal throttle at some point with that, not get more performance anyways.

Again, what would make it so hot? It is a waste of CPU cycles, and does
waste energy that way, but does it really heat up the CPU that much?
It's just a loop. I've run much more complex algorithms for days
without any problems. I only once overheated a CPU and that was doing
some brute force calculations of prime numbers.

>
> > as long as it can be done with zero-cost, i dont see why Steven's patch
> > wouldnt be a plus for us. It's a performance thing, and having runtime
> > switches for seemless performance features cannot be bad.
>
> The interface is ugly and I suspect fixing the various obscure race this
> obscure feature would undoubtedly add will be a long term maintenance
> issue. And it's the wrong thing to do anyways because it just papers
> over other problems that should be fixed in the right way.

Oh come now, it's not that ugly. And it would not produce any more
obscure race conditions than the current method of changing idle with
the acpi processor_idle module has.

But I'll agree that this is more of a paper over than a solution. Too
bad I wasted a day writing and testing it (mostly just to learn about
kobjects and sysfs which I still feel is very clumsy).

But since I did clean up the patch, and it is still useful for those
debugging problems with timers, I'm supplying this cleaned-up version
(Thank you Andrew for the comments).
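
For anyone skimming the patch, the register/set/unregister rules can be modelled in a few lines of plain C. This is only an illustrative user-space sketch, not the kernel code: the model_* names are invented, a singly linked list stands in for list_head, the idle_sem locking is left out, and rejecting duplicate names at registration is an assumption since register_idle() is not fully shown in this part of the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef void (*idlefunc_t)(void);

struct idle_info {
	struct idle_info *next;	/* stand-in for struct list_head */
	const char *name;	/* name visible to users */
	idlefunc_t func;	/* idle function to run */
	int inuse;		/* non-zero while selected */
};

static struct idle_info *idle_list;	/* registered methods */
static struct idle_info *curr_idle;	/* NULL means "default" */

void model_noop_idle(void) { }		/* placeholder idle routine */

static struct idle_info *find_idle(const char *name)
{
	struct idle_info *p;

	for (p = idle_list; p; p = p->next)
		if (strcmp(p->name, name) == 0)
			return p;
	return NULL;
}

/* Add a method to the list; it is not selected yet. */
int model_register_idle(struct idle_info *info)
{
	assert(info && info->name);
	if (find_idle(info->name))
		return -EEXIST;	/* assumption: duplicates are refused */
	info->inuse = 0;
	info->next = idle_list;
	idle_list = info;
	return 0;
}

/* Remove a method: -EINVAL if unknown, -EBUSY while it is selected. */
int model_unregister_idle(const char *name)
{
	struct idle_info **pp;

	for (pp = &idle_list; *pp; pp = &(*pp)->next) {
		struct idle_info *p = *pp;

		if (strcmp(p->name, name) != 0)
			continue;
		if (p->inuse)
			return -EBUSY;
		*pp = p->next;
		p->next = NULL;
		return 0;
	}
	return -EINVAL;
}

/* Select a method by name; NULL goes back to the default. */
int model_set_idle(const char *name)
{
	struct idle_info *p = NULL;

	if (name) {
		p = find_idle(name);
		if (!p)
			return -EINVAL;
	}
	if (curr_idle)
		curr_idle->inuse--;
	if (p)
		p->inuse++;
	curr_idle = p;
	return 0;
}
```

The point of the inuse count is the ordering it forces on teardown: a module has to call set_idle(NULL) first, and only then will unregistering its method succeed.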

-- Steve

Ingo, would you like this for -rt? Even if it will never be accepted
into mainline.


[take 3]:

Index: linux-2.6.15-rc2-git5/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/process.c 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/process.c 2005-11-29 07:43:52.000000000 -0500
@@ -39,6 +39,7 @@
#include <linux/ptrace.h>
#include <linux/random.h>
#include <linux/kprobes.h>
+#include <linux/idle.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -72,11 +73,6 @@
return ((unsigned long *)tsk->thread.esp)[3];
}

-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
static DEFINE_PER_CPU(unsigned int, cpu_idle_state);

void disable_hlt(void)
@@ -185,7 +181,7 @@
__get_cpu_var(cpu_idle_state) = 0;

rmb();
- idle = pm_idle;
+ idle = idle_func;

if (!idle)
idle = default_idle;
@@ -250,6 +246,11 @@
}
}

+static struct idle_info idle_mwait = {
+ .name = "mwait",
+ .func = mwait_idle
+};
+
void __devinit select_idle_routine(const struct cpuinfo_x86 *c)
{
if (cpu_has(c, X86_FEATURE_MWAIT)) {
@@ -258,25 +259,60 @@
* Skip, if setup has overridden idle.
* One CPU supports mwait => All CPUs supports mwait
*/
- if (!pm_idle) {
+ register_idle(&idle_mwait);
+
+ if (!idle_func) {
printk("using mwait in idle threads.\n");
- pm_idle = mwait_idle;
+ set_idle("mwait");
}
}
}

+static struct idle_info idle_default = {
+ .name = "default",
+ .func = default_idle
+};
+
+static struct idle_info idle_poll = {
+ .name = "poll",
+ .func = poll_idle
+};
+
+static int __init add_idle(void)
+{
+ static int set;
+
+ if (set)
+ return 0;
+ set = 1;
+
+ register_idle(&idle_poll);
+
+ /*
+ * Allow the user to switch out of poll_idle even
+ * if it was a boot option.
+ */
+ register_idle(&idle_default);
+
+ return 0;
+}
+
+arch_initcall(add_idle);
+
static int __init idle_setup (char *str)
{
+ add_idle();
if (!strncmp(str, "poll", 4)) {
printk("using polling idle threads.\n");
- pm_idle = poll_idle;
+ set_idle("poll");
+
#ifdef CONFIG_X86_SMP
if (smp_num_siblings > 1)
printk("WARNING: polling idle and HT enabled, performance may degrade.\n");
#endif
} else if (!strncmp(str, "halt", 4)) {
printk("using halt in idle threads.\n");
- pm_idle = default_idle;
+ set_idle("default");
}

boot_option_idle_override = 1;
Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/process.c 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c 2005-11-29 07:45:44.000000000 -0500
@@ -36,6 +36,8 @@
#include <linux/utsname.h>
#include <linux/random.h>
#include <linux/kprobes.h>
+#include <linux/spinlock.h>
+#include <linux/idle.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -60,10 +62,6 @@
unsigned long boot_option_idle_override = 0;
EXPORT_SYMBOL(boot_option_idle_override);

-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
static DEFINE_PER_CPU(unsigned int, cpu_idle_state);

void disable_hlt(void)
@@ -195,7 +193,7 @@
__get_cpu_var(cpu_idle_state) = 0;

rmb();
- idle = pm_idle;
+ idle = idle_func;
if (!idle)
idle = default_idle;
if (cpu_is_offline(smp_processor_id()))
@@ -229,29 +227,68 @@
}
}

+static struct idle_info idle_mwait = {
+ .name = "mwait",
+ .func = mwait_idle
+};
+
void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
{
static int printed;
if (cpu_has(c, X86_FEATURE_MWAIT)) {
+ register_idle(&idle_mwait);
+
/*
* Skip, if setup has overridden idle.
* One CPU supports mwait => All CPUs supports mwait
*/
- if (!pm_idle) {
+ if (!idle_func) {
if (!printed) {
printk("using mwait in idle threads.\n");
printed = 1;
}
- pm_idle = mwait_idle;
+ set_idle("mwait");
}
}
}

+static struct idle_info idle_default = {
+ .name = "default",
+ .func = default_idle
+};
+
+static struct idle_info idle_poll = {
+ .name = "poll",
+ .func = poll_idle
+};
+
+static int __init add_idle(void)
+{
+ static int set;
+
+ if (set)
+ return 0;
+ set = 1;
+
+ register_idle(&idle_poll);
+
+ /*
+ * Allow the user to switch out of poll_idle even
+ * if it was a boot option.
+ */
+ register_idle(&idle_default);
+
+ return 0;
+}
+arch_initcall(add_idle);
+
static int __init idle_setup (char *str)
{
+ add_idle();
+
if (!strncmp(str, "poll", 4)) {
printk("using polling idle threads.\n");
- pm_idle = poll_idle;
+ set_idle("poll");
}

boot_option_idle_override = 1;
Index: linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/drivers/acpi/processor_idle.c 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c 2005-11-29 07:47:52.000000000 -0500
@@ -38,6 +38,8 @@
#include <linux/dmi.h>
#include <linux/moduleparam.h>
#include <linux/sched.h> /* need_resched() */
+#include <linux/spinlock.h>
+#include <linux/idle.h>

#include <asm/io.h>
#include <asm/uaccess.h>
@@ -56,6 +58,7 @@
#define C3_OVERHEAD 4 /* 1us (3.579 ticks per us) */
static void (*pm_idle_save) (void);
module_param(max_cstate, uint, 0644);
+#define PM_IDLE_NAME "pm_idle"

static unsigned int nocst = 0;
module_param(nocst, uint, 0000);
@@ -891,13 +894,13 @@
return_VALUE(-ENODEV);

/* Fall back to the default idle loop */
- pm_idle = pm_idle_save;
+ set_idle(NULL);
synchronize_sched(); /* Relies on interrupts forcing exit from idle. */

pr->flags.power = 0;
result = acpi_processor_get_power_info(pr);
if ((pr->flags.power == 1) && (pr->flags.power_setup_done))
- pm_idle = acpi_processor_idle;
+ set_idle(PM_IDLE_NAME);

return_VALUE(result);
}
@@ -983,6 +986,12 @@
.release = single_release,
};

+static struct idle_info pm_idle_info = {
+ .name = PM_IDLE_NAME,
+ .func = acpi_processor_idle,
+ .freeze = 1
+};
+
int acpi_processor_power_init(struct acpi_processor *pr,
struct acpi_device *device)
{
@@ -1032,8 +1041,12 @@
printk(")\n");

if (pr->id == 0) {
- pm_idle_save = pm_idle;
- pm_idle = acpi_processor_idle;
+ register_idle(&pm_idle_info);
+ /*
+ * Just use the default idle
+ */
+ pm_idle_save = get_idle(NULL);
+ set_idle(PM_IDLE_NAME);
}
}

@@ -1068,8 +1081,8 @@

/* Unregister the idle handler when processor #0 is removed. */
if (pr->id == 0) {
- pm_idle = pm_idle_save;
-
+ set_idle(NULL);
+ unregister_idle(PM_IDLE_NAME);
/*
* We are about to unload the current idle thread pm callback
* (pm_idle), Wait for all processors to update cached/local
Index: linux-2.6.15-rc2-git5/include/linux/pm.h
===================================================================
--- linux-2.6.15-rc2-git5.orig/include/linux/pm.h 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/include/linux/pm.h 2005-11-28 20:31:47.000000000 -0500
@@ -25,6 +25,7 @@

#include <linux/config.h>
#include <linux/list.h>
+#include <linux/spinlock.h>
#include <asm/atomic.h>

/*
@@ -102,6 +103,8 @@
*/
extern void (*pm_idle)(void);
extern void (*pm_power_off)(void);
+extern spinlock_t pm_idle_switch_lock;
+extern int pm_idle_locked;

typedef int __bitwise suspend_state_t;

Index: linux-2.6.15-rc2-git5/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/Kconfig 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/Kconfig 2005-11-28 20:31:47.000000000 -0500
@@ -69,6 +69,10 @@
bool
default y

+config DYNAMIC_IDLE
+ bool
+ default y
+
source "init/Kconfig"


Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/x8664_ksyms.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/x8664_ksyms.c 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/x8664_ksyms.c 2005-11-28 20:31:47.000000000 -0500
@@ -58,7 +58,6 @@
EXPORT_SYMBOL(disable_irq_nosync);
EXPORT_SYMBOL(probe_irq_mask);
EXPORT_SYMBOL(kernel_thread);
-EXPORT_SYMBOL(pm_idle);
EXPORT_SYMBOL(pm_power_off);
EXPORT_SYMBOL(get_cmos_time);

Index: linux-2.6.15-rc2-git5/include/linux/idle.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-git5/include/linux/idle.h 2005-11-28 20:31:47.000000000 -0500
@@ -0,0 +1,71 @@
+/*
+ * idle.h - Registering of the idle function (for supported archs)
+ *
+ * Copyright (C) 2005 Steven Rostedt <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef _LINUX_IDLE_H
+#define _LINUX_IDLE_H
+
+#include <linux/config.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/kobject.h>
+#include <asm/atomic.h>
+
+typedef void (*idlefunc_t)(void);
+
+struct idle_info {
+ struct list_head list;
+ const char *name; /* Name visible to users */
+ idlefunc_t func; /* idle function to run */
+ int freeze; /* Only allow kernel to add or remove */
+ int inuse; /* set when being used */
+#ifdef CONFIG_SYSFS
+ struct kobject kobj;
+#endif
+};
+
+/*
+ * Registering and unregistering functions that may be used
+ * instead of the default idle function. This only adds
+ * them to the list of functions to be used, it does not
+ * set the idle function itself; use set_idle() for that.
+ */
+extern int register_idle(struct idle_info *info);
+extern int unregister_idle(const char *name);
+
+/*
+ * This sets the idle function to the registered function
+ * by name. Use NULL to set the idle function back to
+ * the default.
+ */
+extern int set_idle(const char *name);
+
+/*
+ * Return the function that is registered by name.
+ * Use NULL to get the default function.
+ * NULL may be returned (as that may be what the current
+ * idle function is set to, to use a default). NULL will
+ * also be returned if name is not registered.
+ */
+extern idlefunc_t get_idle(const char *name);
+
+extern idlefunc_t idle_func;
+
+#endif /* _LINUX_IDLE_H */
Index: linux-2.6.15-rc2-git5/kernel/Makefile
===================================================================
--- linux-2.6.15-rc2-git5.orig/kernel/Makefile 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/kernel/Makefile 2005-11-28 20:31:47.000000000 -0500
@@ -32,6 +32,7 @@
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_DYNAMIC_IDLE) += idle.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: linux-2.6.15-rc2-git5/kernel/idle.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-git5/kernel/idle.c 2005-11-28 20:31:47.000000000 -0500
@@ -0,0 +1,308 @@
+/*
+ * kernel/idle.c
+ *
+ * Setting up of the idle function to be dynamic.
+ *
+ * Copyright (C) 2005 Steven Rostedt
+ */
+#include <linux/module.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/spinlock.h>
+#include <linux/idle.h>
+
+idlefunc_t idle_func;
+
+static void (*idle_default)(void);
+static LIST_HEAD(idle_elements);
+static DECLARE_MUTEX(idle_sem);
+static struct idle_info *curr_idle;
+
+#ifdef CONFIG_SYSFS
+int idle_sysfs_init;
+#endif
+
+extern void poll_idle (void);
+
+static struct idle_info *__find_idle_info(const char *name)
+{
+ struct list_head *curr;
+ struct idle_info *p;
+ /*
+ * A little inefficient, but this isn't called often.
+ */
+ list_for_each(curr, &idle_elements) {
+ p = list_entry(curr, struct idle_info, list);
+ if (!strcmp(name, p->name))
+ break;
+ }
+ if (curr == &idle_elements)
+ p = NULL;
+
+ return p;
+}
+
+int register_idle(struct idle_info *info)
+{
+ struct idle_info *p;
+ int ret = -EEXIST;
+
+ BUG_ON(!info->name);
+
+ down(&idle_sem);
+
+ p = __find_idle_info(info->name);
+ if (p)
+ goto out;
+ ret = 0;
+
+ list_add(&info->list, &idle_elements);
+
+out:
+ up(&idle_sem);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(register_idle);
+
+int unregister_idle(const char *name)
+{
+ struct idle_info *p;
+ int ret = -EINVAL;
+
+ BUG_ON(!name);
+
+ down(&idle_sem);
+
+ p = __find_idle_info(name);
+ if (!p)
+ goto out;
+ if (p->inuse) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ ret = 0;
+
+ list_del_init(&p->list);
+
+out:
+ up(&idle_sem);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(unregister_idle);
+
+static int __set_idle(struct idle_info *info)
+{
+ if (curr_idle)
+ curr_idle->inuse--;
+ info->inuse++;
+ curr_idle = info;
+ return 0;
+}
+
+int set_idle(const char *name)
+{
+ struct idle_info *p;
+ int ret = 0;
+
+ down(&idle_sem);
+
+ if (!name) {
+ /* Set to the default function */
+ if (curr_idle) {
+ curr_idle->inuse--;
+ curr_idle = NULL;
+ }
+ idle_func = idle_default;
+ goto out;
+ }
+
+ ret = -EINVAL;
+ p = __find_idle_info(name);
+ if (!p)
+ goto out;
+
+ ret = __set_idle(p);
+out:
+ up(&idle_sem);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(set_idle);
+
+idlefunc_t get_idle(const char *name)
+{
+ struct idle_info *p;
+ idlefunc_t ret = idle_default;
+
+ down(&idle_sem);
+
+ if (!name)
+ goto out;
+
+ p = __find_idle_info(name);
+ if (!p)
+ goto out;
+
+ ret = p->func;
+out:
+ up(&idle_sem);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(get_idle);
+
+#ifdef CONFIG_SYSFS
+#define KERNEL_ATTR_RW(_name) \
+static struct subsys_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct idlep_kobject
+{
+ struct kobject kobj;
+} idle_kobj;
+
+static ssize_t idle_ctrl_show(struct subsystem *subsys, char *page)
+{
+ ssize_t ret;
+ char *star = "";
+ const char *name = "default";
+
+ down(&idle_sem);
+ if (curr_idle) {
+ name = curr_idle->name;
+ if (curr_idle->freeze)
+ star = "*";
+ }
+ ret = sprintf(page, "%s%s\n", star, name);
+ up(&idle_sem);
+
+ return ret;
+}
+
+static ssize_t idle_ctrl_store(struct subsystem *subsys,
+ const char *buf, size_t len)
+{
+ struct list_head *curr;
+ struct idle_info *p;
+ ssize_t ret = -EBUSY;
+
+ down(&idle_sem);
+
+ if (curr_idle && curr_idle->freeze)
+ goto out;
+
+ list_for_each(curr, &idle_elements) {
+ int size;
+ p = list_entry(curr, struct idle_info, list);
+
+ size = strlen(p->name);
+ if (len <= size)
+ continue;
+ if (!strncmp(p->name, buf, size))
+ break;
+ }
+ if (curr == &idle_elements) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * This idle routine may have been registered to
+ * not allow users to add or remove this.
+ */
+ if (p->freeze)
+ goto out;
+
+ __set_idle(p);
+
+ ret = len;
+out:
+ up(&idle_sem);
+
+ return ret;
+}
+
+KERNEL_ATTR_RW(idle_ctrl);
+
+static ssize_t idle_methods_show(struct subsystem *subsys, char *page)
+{
+ struct list_head *curr;
+ struct idle_info *p;
+ ssize_t len = 0;
+
+ down(&idle_sem);
+ list_for_each(curr, &idle_elements) {
+ p = list_entry(curr, struct idle_info, list);
+ if (len + 3 + strlen(p->name) >= PAGE_SIZE) {
+ printk("idle functions overflowed sysfs??\n");
+ break;
+ }
+ len += sprintf(page+len, "%s%s%s",
+ len ? " " : "",
+ p->freeze ? "*" : "",
+ p->name);
+ }
+ if (len + 2 < PAGE_SIZE)
+ len += sprintf(page+len, "\n");
+
+ up(&idle_sem);
+ return len;
+}
+
+static ssize_t idle_methods_store(struct subsystem *subsys,
+ const char *buf, size_t len)
+{
+ /* do nothing */
+ return len;
+}
+
+KERNEL_ATTR_RW(idle_methods);
+
+static struct attribute * idle_attrs[] = {
+ &idle_ctrl_attr.attr,
+ &idle_methods_attr.attr,
+ NULL
+};
+
+static struct attribute_group idle_attr_group = {
+ .attrs = idle_attrs,
+};
+
+static int __init idle_setup_sysfs(void)
+{
+ int err;
+
+ memset(&idle_kobj, 0, sizeof(idle_kobj));
+ err = kobject_set_name(&idle_kobj.kobj, "%s", "idle");
+ if (err)
+ goto out;
+
+ kobj_set_kset_s(&idle_kobj, kernel_subsys);
+
+ idle_kobj.kobj.parent = &kernel_subsys.kset.kobj;
+ err = kobject_register(&idle_kobj.kobj);
+ if (err)
+ goto out;
+
+ err = sysfs_create_group(&idle_kobj.kobj,
+ &idle_attr_group);
+ if (err)
+ goto out;
+
+ return 0;
+out:
+ printk(KERN_INFO "Problem setting up sysfs idle_ctrl\n");
+ return 0;
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init idle_setup(void)
+{
+ idle_default = idle_func;
+
+#ifdef CONFIG_SYSFS
+ idle_setup_sysfs();
+#endif
+ return 0;
+}
+
+late_initcall(idle_setup);
Index: linux-2.6.15-rc2-git5/arch/i386/Kconfig
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/Kconfig 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/i386/Kconfig 2005-11-28 20:31:47.000000000 -0500
@@ -45,6 +45,10 @@
bool
default y

+config DYNAMIC_IDLE
+ bool
+ default y
+
source "init/Kconfig"

menu "Processor type and features"
Index: linux-2.6.15-rc2-git5/arch/i386/kernel/apm.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/apm.c 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/apm.c 2005-11-28 20:31:47.000000000 -0500
@@ -225,6 +225,7 @@
#include <linux/smp_lock.h>
#include <linux/dmi.h>
#include <linux/suspend.h>
+#include <linux/idle.h>

#include <asm/system.h>
#include <asm/uaccess.h>
@@ -2220,6 +2221,9 @@
{ }
};

+static struct idle_info apm_idle;
+#define APM_IDLE_NAME "apm"
+
/*
* Just start the APM thread. We do NOT want to do APM BIOS
* calls from anything but the APM thread, if for no other reason
@@ -2373,8 +2377,14 @@
if (HZ != 100)
idle_period = (idle_period * HZ) / 100;
if (idle_threshold < 100) {
- original_pm_idle = pm_idle;
- pm_idle = apm_cpu_idle;
+ memset(&apm_idle, 0, sizeof(apm_idle));
+ apm_idle.name = APM_IDLE_NAME;
+ apm_idle.func = apm_cpu_idle;
+ apm_idle.freeze = 1;
+ register_idle(&apm_idle);
+
+ original_pm_idle = get_idle(NULL);
+ set_idle(APM_IDLE_NAME);
set_pm_idle = 1;
}

@@ -2386,7 +2396,26 @@
int error;

if (set_pm_idle) {
- pm_idle = original_pm_idle;
+ int tries = 0;
+ int ret;
+ set_idle(NULL);
+ do {
+ if ((ret = unregister_idle(APM_IDLE_NAME)) == 0)
+ break;
+ /*
+ * for some reason the idle function is being used.
+ * Wait a little and then try again.
+ */
+ if (ret == -EINVAL) {
+ printk(KERN_WARNING
+ "APM idle function never registered?\n");
+ break;
+ }
+ yield();
+ } while (tries++ < 10);
+ if (tries > 10)
+ printk(KERN_WARNING
+ "Unable to unregister APM idle function\n");
/*
* We are about to unload the current idle thread pm callback
* (pm_idle), Wait for all processors to update cached/local
Index: linux-2.6.15-rc2-git5/arch/ia64/Kconfig
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/Kconfig 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/Kconfig 2005-11-28 20:31:47.000000000 -0500
@@ -62,6 +62,10 @@
bool
default y

+config DYNAMIC_IDLE
+ bool
+ default y
+
choice
prompt "System type"
default IA64_GENERIC
Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/acpi.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/acpi.c 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/kernel/acpi.c 2005-11-28 20:31:47.000000000 -0500
@@ -60,8 +60,6 @@

#define PREFIX "ACPI: "

-void (*pm_idle) (void);
-EXPORT_SYMBOL(pm_idle);
void (*pm_power_off) (void);
EXPORT_SYMBOL(pm_power_off);

Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/process.c 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/kernel/process.c 2005-11-28 20:31:47.000000000 -0500
@@ -31,6 +31,7 @@
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/kprobes.h>
+#include <linux/idle.h>

#include <asm/cpu.h>
#include <asm/delay.h>
@@ -289,7 +290,7 @@
if (mark_idle)
(*mark_idle)(1);

- idle = pm_idle;
+ idle = idle_func;
if (!idle)
idle = default_idle;
(*idle)();
Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/setup.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/setup.c 2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/kernel/setup.c 2005-11-29 07:46:59.000000000 -0500
@@ -43,6 +43,7 @@
#include <linux/initrd.h>
#include <linux/platform.h>
#include <linux/pm.h>
+#include <linux/idle.h>

#include <asm/ia32.h>
#include <asm/machvec.h>
@@ -738,6 +739,11 @@
ia64_max_cacheline_size = max;
}

+struct idle_info idle_default = {
+ .name = "default",
+ .func = default_idle
+};
+
/*
* cpu_init() initializes state that is per-CPU. This function acts
* as a 'CPU state barrier', nothing should get across.
@@ -861,7 +867,10 @@
/* size of physical stacked register partition plus 8 bytes: */
__get_cpu_var(ia64_phys_stacked_size_p8) = num_phys_stacked*8 + 8;
platform_cpu_init();
- pm_idle = default_idle;
+
+ register_idle(&idle_default);
+
+ set_idle("default");
}

void
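
To make the registration semantics of the patch above concrete, here is a small userspace model (not kernel code) of the register/select/unregister bookkeeping. The names struct idle_entry, model_register(), model_set(), model_unregister() and model_active() are made up for this sketch. It assumes a single thread, so the idle_sem locking is left out. It also updates the active function pointer when a method is selected, which __set_idle() as posted does not appear to do.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef void (*idlefunc_t)(void);

/* Userspace stand-in for struct idle_info (no list_head, no kobject). */
struct idle_entry {
	struct idle_entry *next;
	const char *name;
	idlefunc_t func;
	int inuse;
};

static struct idle_entry *entries;	/* registered methods */
static struct idle_entry *selected;	/* NULL means "default" */
static idlefunc_t active_func;		/* what an idle loop would call */

static struct idle_entry *find_entry(const char *name)
{
	struct idle_entry *p;

	for (p = entries; p; p = p->next)
		if (strcmp(p->name, name) == 0)
			return p;
	return NULL;
}

/* Add a method to the list; duplicates are refused with -EEXIST. */
int model_register(struct idle_entry *e)
{
	if (find_entry(e->name))
		return -EEXIST;
	e->inuse = 0;
	e->next = entries;
	entries = e;
	return 0;
}

/* Remove a method; refused with -EBUSY while it is selected. */
int model_unregister(const char *name)
{
	struct idle_entry **pp;

	for (pp = &entries; *pp; pp = &(*pp)->next) {
		if (strcmp((*pp)->name, name) == 0) {
			if ((*pp)->inuse)
				return -EBUSY;
			*pp = (*pp)->next;
			return 0;
		}
	}
	return -EINVAL;
}

/* Select a method by name; NULL switches back to the default. */
int model_set(const char *name)
{
	struct idle_entry *p = NULL;

	if (name) {
		p = find_entry(name);
		if (!p)
			return -EINVAL;
	}
	if (selected)
		selected->inuse--;
	if (p)
		p->inuse++;
	selected = p;
	active_func = p ? p->func : NULL;
	return 0;
}

/* Currently active function, NULL for the default. */
idlefunc_t model_active(void)
{
	return active_func;
}
```

The -EBUSY case is the one the APM exit path has to deal with: a method cannot be unregistered until something else (or the default) has been selected, which is why that path calls set_idle(NULL) first and then retries unregister_idle().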


2005-11-29 14:50:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Tue, Nov 29, 2005 at 09:19:31AM -0500, Steven Rostedt wrote:
> > And in practice the CPU will run so hot that only benchmarkers like it.
>
> Why would it run hot? What's the difference between polling and doing
> other things. How many transistors does it take to poll?

It will prevent the CPU from going into sleep states and essentially
keep most of it enabled.

>
> >
> > I think switching idle is the wrong way to do. We should rather
> > fix the various problems.
> >
> > For fixing the TSC issue it is 100% the wrong approach Imho.
>
> I would only say 80% the wrong approach, but that's me ;-)
>
> > Basically software has to live with TSCs being unsynchronized
> > and gettimeofday should do the right thing (and if not it should be fixed)
>
> I guess the biggest complaint most have is that the rdtsc _is_ the
> fastest way to read a clock. If it isn't reliable, then what good is

It's the fastest way to read something which needs quite complex
knowledge to turn into a reliable clock value. In general only
the kernel has this knowledge.

And gettimeofday is optimized to give you the fastest reliable
clock.

> it? It's unfortunate that Intel didn't solidify the clock usage. Yes,
> use HPET, or something else, but those are slower, and may not be on all
> systems. Every system that I owned had a tsc but for critical systems
> it isn't up to par (what a shame).

Just use gettimeofday. It shields you from all that and when
the hardware supports it is quite fast too.

> > > system has been idle for some time. E.g. cpufreqd could sample idle time
> > > and turn on/off idle=poll. High-performance setups could enable it all
> > > the time.
> >
> > And upgrade their server air condition or issue additional ear protection
> > to the desktop user? Most likely you will just drive the CPUs into
> > thermal throttle at some point with that, not get more performance anyways.
>
> Again, what would make it so hot? It is a waste of CPU cycles, and does
> waste energy that way, but does it really heat up the CPU that much?

Yes it does.

-Andi

2005-11-29 15:42:45

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Tue, 2005-11-29 at 15:50 +0100, Andi Kleen wrote:
> On Tue, Nov 29, 2005 at 09:19:31AM -0500, Steven Rostedt wrote:
> > > And in practice the CPU will run so hot that only benchmarkers like it.
> >
> > Why would it run hot? What's the difference between polling and doing
> > other things. How many transistors does it take to poll?
>
> It will prevent the CPU from going into sleep states and essentially
> keep most of it enabled.

Well, there's one thing that my patch _does_ help with. (And it has
just helped me now). If you boot up with idle=poll and forget about it,
you can check what idle routine is being used and switch out of poll
without rebooting. (like I'm doing right now :-)
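
For illustration, a minimal sketch of how a userspace tool might interpret what it reads from /sys/kernel/idle/idle_ctrl. It relies only on the format printed by idle_ctrl_show() in the patch: an optional "*" (the method is frozen), the method name, then a newline. The function name parse_idle_ctrl() is hypothetical, and reading the file itself is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/*
 * Parse one line in the idle_ctrl_show() format: ["*"]name"\n".
 * On success copies the name into `name`, sets *frozen to 0 or 1
 * and returns 0. Returns -1 on bad input or if the name does not fit.
 */
int parse_idle_ctrl(const char *line, char *name, size_t name_len, int *frozen)
{
	size_t n;

	if (!line || !name || !frozen || name_len == 0)
		return -1;

	*frozen = (line[0] == '*');
	if (*frozen)
		line++;

	n = strcspn(line, "\n");	/* length up to the newline */
	if (n == 0 || n >= name_len)
		return -1;

	memcpy(name, line, n);
	name[n] = '\0';
	return 0;
}
```

Switching away from poll would then be a write of another listed name to the same file, which idle_ctrl_store() refuses with -EBUSY if either the current or the requested method is frozen.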

-- Steve

2005-11-29 19:39:23

by Brown, Len

[permalink] [raw]
Subject: RE: [RFC][PATCH] Runtime switching of the idle function [take 2]

idle=poll is a really bad way to go from a power perspective.
While it is diminishing returns to get into deeper C-states,
getting into at least C1 (HALT or MONITOR/MWAIT) is very important
on many processors.

Note that if the issue at hand is the TSC stopping in deep
ACPI C-states, that there is a flag already available to limit
how deep the C-states go. eg.

processor.max_cstate=2 will disable C3, C4 etc
You can do this at run-time by writing to
/sys/module/processor/parameters/max_cstate
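
As a rough sketch of the run-time write mentioned above, the helper below writes an integer to a file such as /sys/module/processor/parameters/max_cstate. The function name write_int_to_file() is made up for this example; whether the parameter is actually writable depends on the kernel, and the write normally needs root.

```c
#include <stdio.h>

/*
 * Write a decimal integer followed by a newline to `path`.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int write_int_to_file(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)	/* sysfs write errors can surface at close */
		ok = 0;

	return ok ? 0 : -1;
}
```

With the path quoted above, write_int_to_file("/sys/module/processor/parameters/max_cstate", 2) would correspond to booting with processor.max_cstate=2.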

I agree with Andi that we have some work to do to address
the issue directly, which is that the TSC is not reliable
under all conditions on all processors. I think we need
some modes for TSC to detect and handle the cases where it either
stops in C3 or changes speeds, vs the systems where it actually
works the way we want it to -- constant rate that never stops.

>Why not just slightly cleanup and extend (eg. to ACPI) the
>hlt_counter thingy that many architectures already have?

Hmmm, I see the floppy driver invoking hlt_counter,
but it isn't clear what the general semantics and general
users are supposed to be. Can you clue me in?

thanks,
-Len

2005-11-29 19:53:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Tue, Nov 29, 2005 at 02:37:53PM -0500, Brown, Len wrote:
> idle=poll is a really bad way to go from a power perspective.
> While it is diminishing returns to get into deeper C-states,
> getting into at least C1 (HALT or MONITOR/MWAIT) is very important
> on many processors.
>
> Note that if the issue at hand is the TSC stopping in deep
> ACPI C-states, that there is a flag already available to limit
> how deep the C-states go. eg.

No, I think they tried to work around the fact that
it's not synchronized on AMD systems - in particular
it drifts slightly even on single socket dual core
A64 X2s and disabling C1 works around that.

But idle=poll is too big a hammer for this. Vojtech
is working on a solution anyways that should address this
better.


> processor.max_cstate=2 will disable C3, C4 etc
> You can do this at run-time by writing to
> /sys/module/processor/parameters/max_cstate

In this case it's already C1 that's the problem,
so that won't help them.

> I agree with Andi that we have some work to do to address
> the issue directly, which is that the TSC is not reliable
> under all conditions on all processors. I think we need

We're mostly addressing it - there are problems left, but
overall it's looking good. The remaining problem is
an education issue of users to not use RDTSC directly,
but use gettimeofday/clock_gettime
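
A minimal sketch of what using clock_gettime instead of RDTSC can look like for interval timing in userspace. The helper name monotonic_ns() is an assumption for this example; CLOCK_MONOTONIC is used so that the value does not jump when the wall clock is set.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/*
 * Monotonic timestamp in nanoseconds.
 * Returns 0 if the clock cannot be read.
 */
uint64_t monotonic_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return 0;

	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
```

An interval is then just the difference of two readings; the kernel picks the underlying time source, so the caller does not need to know whether the TSC is usable on that machine.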

One remaining use is measurements, but for that it is
already dubious (e.g. due to ticking at a possible
different frequency than the CPU). For that I want
to establish the RDPMC 0 convention.

Probably need better documentation for all of this though...

> some modes for TSC to detect and handle the cases where it either
> stops in C3 or changes speeds, vs the systems where it actually
> works the way we want it to -- constant rate that never stops.
>
> >Why not just slightly cleanup and extend (eg. to ACPI) the
> >hlt_counter thingy that many architectures already have?
>
> Hmmm, I see the floppy driver invoking hlt_counter,
> but it isn't clear what the general semantics and general
> users are supposd to be. Can you clue me in?

It's an ancient hack for an ancient machine chipset bug, but AFAIK
not used/needed on anything modern.

Should probably remove it from x86-64 too.

-Andi

2005-11-29 20:35:48

by Lee Revell

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Tue, 2005-11-29 at 20:53 +0100, Andi Kleen wrote:
> We're mostly addressing it - there are problems left, but
> overall it's looking good. The remaining problem is
> an education issue of users to not use RDTSC directly,
> but use gettimeofday/clock_gettime

No, the issue is to make gettimeofday fast enough that the people who
currently have to use the TSC can use it. Right now it's 1500-3000 nsec
or so, Vojtech mentioned that he has a patch that could reduce that to
150-300 nsec.

Lee

2005-11-29 20:51:14

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Tue, Nov 29, 2005 at 03:35:39PM -0500, Lee Revell wrote:
> On Tue, 2005-11-29 at 20:53 +0100, Andi Kleen wrote:
> > We're mostly addressing it - there are problems left, but
> > overall it's looking good. The remaining problem is
> > an education issue of users to not use RDTSC directly,
> > but use gettimeofday/clock_gettime
>
> No the issue is to make gettimeofday fast enough that the people who
> currently have to use the TSC can use it. Right now it's 1500-3000 nsec
> or so, Vojtech mentioned that he has a patch that could reduce that to

It's only that slow if the hardware can't do better.

And the kernel only makes it slow when using RDTSC directly
is unsafe - so if you use it directly thinking the kernel cheats
you out of your cycles you're just shooting yourself in the foot.

> 150-300 nsec.

If you have capable hardware it can already do much better.

-Andi

2005-11-30 00:59:55

by Lee Revell

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Tue, 2005-11-29 at 21:51 +0100, Andi Kleen wrote:
> On Tue, Nov 29, 2005 at 03:35:39PM -0500, Lee Revell wrote:
> > On Tue, 2005-11-29 at 20:53 +0100, Andi Kleen wrote:
> > > We're mostly addressing it - there are problems left, but
> > > overall it's looking good. The remaining problem is
> > > an education issue of users to not use RDTSC directly,
> > > but use gettimeofday/clock_gettime
> >
> > No the issue is to make gettimeofday fast enough that the people who
> > currently have to use the TSC can use it. Right now it's 1500-3000 nsec
> > or so, Vojtech mentioned that he has a patch that could reduce that to
>
> It's only that slow if the hardware can't do better.
>
> And the kernel makes it only slow when using RDTSC directly
> is unsafe - so if you use it directly thinking the kernel cheats
> you for your cycles you're just shoting yourself in the own foot.
>
> > 150-300 nsec.
>
> If you have capable hardware it can already do much better.
>

But on my system gettimeofday uses the TSC and it's still ~35x slower
than RDTSC:

rlrevell@mindpipe:~$ ./timetest
rdtsc: 10000 calls in 1079 usecs
gettimeofday: 10000 calls in 36628 usecs

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef unsigned long long cycles_t;

/* Note: the "=A" constraint means edx:eax only on i386, not on x86-64. */
#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))

static inline cycles_t get_cycles_tsc (void)
{
	unsigned long long ret;

	rdtscll(ret);
	return ret;
}

/* Current time in microseconds (despite the name, not cycles). */
static inline cycles_t get_cycles_gtod (void)
{
	struct timeval tv;

	gettimeofday (&tv, NULL);

	/* Include tv_sec so the difference survives a second boundary. */
	return (cycles_t)tv.tv_sec * 1000000ULL + tv.tv_usec;
}

int main (void) {
	int i;
	cycles_t start_time;

	start_time = get_cycles_gtod();
	for (i = 0; i < 10000; i++) {
		get_cycles_tsc();
	}
	printf("rdtsc: %i calls in %llu usecs\n", i, get_cycles_gtod() - start_time);

	start_time = get_cycles_gtod();
	for (i = 0; i < 10000; i++) {
		get_cycles_gtod();
	}
	printf("gettimeofday: %i calls in %llu usecs\n", i, get_cycles_gtod() - start_time);
	return 0;
}


2005-11-30 01:07:08

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

> But on my system gettimeofday uses the TSC and it's still ~35x slower
> than RDTSC:
>
> rlrevell@mindpipe:~$ ./timetest
> rdtsc: 10000 calls in 1079 usecs
> gettimeofday: 10000 calls in 36628 usecs

First, if you run this on an Athlon 64 the measurement is likely
wrong because RDTSC can be speculated around. To get accurate
data you need to add synchronizing instructions.
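
One common way to add such a synchronizing instruction is to execute CPUID immediately before RDTSC, since CPUID is a serializing instruction. The sketch below is an illustration of that idea and not code from this thread; the name rdtsc_serialized() is made up. It reads the counter through separate "=a"/"=d" outputs, which works on both i386 and x86-64, whereas the "=A" constraint only names the edx:eax pair on i386. On non-x86 targets the sketch simply returns 0.

```c
#include <stdint.h>

/* Read the TSC after a serializing CPUID; returns 0 on non-x86 builds. */
static inline uint64_t rdtsc_serialized(void)
{
#if defined(__i386__) || defined(__x86_64__)
	uint32_t eax = 0, ebx, ecx = 0, edx;
	uint32_t lo, hi;

	/* CPUID leaf 0: used only for its serializing side effect. */
	__asm__ __volatile__("cpuid"
			     : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx)
			     :
			     : "memory");
	(void)ebx;
	(void)edx;

	/* Low half comes back in eax, high half in edx. */
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
#else
	return 0;
#endif
}
```

The CPUID adds a noticeable cost of its own, so numbers taken this way are not directly comparable with the unserialized ones quoted above.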

Then you're likely running 32bit. It doesn't use vsyscall gettimeofday
yet, which makes it slower. 64bit would.

-Andi

2005-11-30 01:22:53

by Lee Revell

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Wed, 2005-11-30 at 02:06 +0100, Andi Kleen wrote:
> > But on my system gettimeofday uses the TSC and it's still ~35x slower
> > than RDTSC:
> >
> > rlrevell@mindpipe:~$ ./timetest
> > rdtsc: 10000 calls in 1079 usecs
> > gettimeofday: 10000 calls in 36628 usecs
>
> First if you run this on an Athlon 64 the measurement is likely
> wrong because RDTSC can be speculated around. To get accurate
> data you need to add synchronizing instructions.
>

OK. Just for reference here's what people on the JACK list reported:

2.6.14-rt13, PREEMPT_RT, Athlon X2 4400+ (dual core)

rdtsc: 10000 calls in 68 usecs
gettimeofday: 10000 calls in 5170 usecs

[email protected]/HT (OpenSUSE 10.0 2.6.13-15-smp):

rdtsc: 10000 calls in 253 usecs
gettimeofday: 10000 calls in 26547 usecs
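
For comparison with the per-call figures discussed earlier in the thread, the totals above can be converted to nanoseconds per call. A small helper for that arithmetic (the name ns_per_call() is made up for this note):

```c
/* Average cost of one call in nanoseconds, given a total in microseconds. */
double ns_per_call(unsigned long long total_usecs, unsigned long calls)
{
	if (calls == 0)
		return 0.0;
	return (double)total_usecs * 1000.0 / (double)calls;
}
```

Applied to the numbers in this message and the previous one, that gives roughly 108 ns (rdtsc) against 3663 ns (gettimeofday) on the first system, 6.8 ns against 517 ns on the Athlon X2, and 25.3 ns against 2655 ns on the P4. The rdtsc figures include loop overhead and are subject to the speculation caveat above.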

> Then you're likely running 32bit. It doesn't use vsyscall gettimeofday
> yet, which makes it slower. 64bit would.

Yes, I am. So it sounds like vsyscall gettimeofday for i386 is in the
works?

Lee

2005-11-30 01:58:14

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

> > Then you're likely running 32bit. It doesn't use vsyscall gettimeofday
> > yet, which makes it slower. 64bit would.
>
> Yes, I am. So it sounds like vsyscall gettimeofday for i386 is in the
> works?

John Stultz used to have patches for it, but for some reason he never
pushed them into mainline. On i386 it unfortunately needs adding
a test and branch to the syscall path to be 100% ABI compatible, but I
doubt that was the reason he dropped it.

-Andi

2005-11-30 02:19:47

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

On Wed, 2005-11-30 at 02:58 +0100, Andi Kleen wrote:
> > > Then you're likely running 32bit. It doesn't use vsyscall gettimeofday
> > > yet, which makes it slower. 64bit would.
> >
> > Yes, I am. So it sounds like vsyscall gettimeofday for i386 is in the
> > works?
>
> John Stultz used to have patches for it, but for some reason he never
> pushed them into mainline.

Unfortunately it was a pretty ugly patch. Correctness issues with the
existing code have kept me focused on my timekeeping rework; however, I have
kept it in mind, and I do have an i386 vsyscall gtod patch that applies
on top of my tod work. I've been maintaining it on the side while I focus
on the core code, but it is much cleaner now. For fun I'll try to
remember to send it out with the next release.

> On i386 it unfortunately needs adding
> a test and branch to the syscall path to be 100% ABI compatible, but I
> doubt that was the reason he dropped it.

Yea, I didn't know enough about the VDSO/unwind bits to get it to do the
right thing w/ glibc, so that bit was pretty hackish. I'll still need
some help on this bit to make it really something that could be
included.

thanks
-john



2005-12-02 01:28:27

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

Andi Kleen wrote:
> Ingo Molnar <[email protected]> writes:
>>> If it's just for some sort of instrumentation, run NR_CPUS instances
>>> of a niced-down busyloop, pin each one to a different CPU? That way
>>> the idle function doesn't get called at all..
>> idle=poll is also frequently done for performance reasons [it reduces
>> idle wakeup latency by 10 usecs]
>
> And it's obsolete on CPUs with monitor/mwait.
There are some platforms, for example IBM ZPro Xeon based machines, where
monitor/mwait seems to trigger some kind of SMM and introduce horrible latencies.
With idle=poll ZPros show pretty good worst case latencies, on the order of 10 usec
(tested with RTAI/Fusion). With the default idle (i.e. mwait) even the average latency is in
the hundreds of milliseconds.
You might argue that it's a bug in their HW design or something, but as it stands
today I wouldn't say that monitor/mwait obsoletes idle=poll.

Also IMO saying that the CPU will run too hot with idle=poll is basically saying that those
CPUs cannot be used for simulations and stuff which run flat out for days (months actually).
Which is obviously not true (again speaking from experience :)).

Max

2005-12-02 01:45:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

> Also IMO saying that CPU will run too hot with idle=poll is basically
> saying that those
> CPUs cannot be used for simulations and stuff which run flat out for days
> (months actually).
> Which is obviously not true (again speaking from experience :)).

The CPUs can be used, but in many cooling setups
(AirCon in complete data centers, cooling in blade racks, laptops)
the cooling is now often designed not to handle
the maximum thermal output of all systems in parallel, but instead
to throttle the systems when things get too hot. This usually
works because in most workloads systems are more often idle
than busy, so no throttling is needed.

On desktops it probably won't throttle, but just become noisy
when all the fans spin up.

All things you don't really want.

Supercomputing is different of course, but even there the maximum
capacity of the air conditioning often limits how many CPUs you can buy.
And you need all the help you can get.

That said you're right that there is still a small niche
where idle=poll makes sense, but it's definitely nothing
that should be encouraged to be used regularly like that
original patch would.

-Andi

2005-12-03 02:18:28

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

Andi Kleen wrote:
>> Also IMO saying that CPU will run too hot with idle=poll is basically
>> saying that those
>> CPUs cannot be used for simulations and stuff which run flat out for days
>> (months actually).
>> Which is obviously not true (again speaking from experience :)).
>
> The CPUs can be used, but many cooling setups
> (both AirCon in complete data centers, cooling in Blade Racks, laptops)
> the cooling is now often designed to not cool
> the maximum thermal output of all systems in parallel, but instead
> throttle the systems when things get too hot. This usually
> works because in most workloads systems are more often idle
> than busy, so no throttling is needed.
>
> On desktops it probably won't throttle, but just become noisy
> when all the fans spin up.
>
> All things you don't really want.
We do it (simulations that is) on normal 1U and desktop machines. No special
cooling and stuff. And it does not cause any problems. Granted we don't use
cheap/crappy machines but still it's unmodified off-the-shelf HW.

btw That ZPro machine that I mentioned used to run with idle=poll for weeks
and fans would never spin up unless you put real load on it.

> Super computing is different of course, but even there maximum
> capacity of the air condition often limits how many CPUs you can buy.
> And you need all the help you can get.
>
> That said you're right that there is still a small niche
> where idle=poll makes sense, but it's definitely nothing
> that should be encouraged to be used regularly like that
> original patch would.
Agreed.

Max

2005-12-18 14:24:00

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]

Hi!

> Description:
>
> This patch creates a directory in /sys/kernel called idle. This
> directory contains two files: idle_ctrl and idle_methods. Reading
> idle_ctrl will show the function that is currently being used for idle,
> and idle_methods shows the available methods for the user to send write
> into idle_ctrl to change which function to use for idle.

Pretty ugly interface, I'd say... is listing the functions really necessary?

Pavel

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-12-18 15:26:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][PATCH] Runtime switching of the idle function [take 2]


On Tue, 29 Nov 2005, Pavel Machek wrote:

> Hi!
>
> > Description:
> >
> > This patch creates a directory in /sys/kernel called idle. This
> > directory contains two files: idle_ctrl and idle_methods. Reading
> > idle_ctrl will show the function that is currently being used for idle,
> > and idle_methods shows the available methods for the user to send write
> > into idle_ctrl to change which function to use for idle.
>
> Pretty ugly interface, I'd say... is listing function really neccessary?
>

What interface would you prefer? And the listing was a feature request
made by Ingo.

But this is pretty much moot, since the patch is not going any further
than the RT patch. And even then, it probably is only temporary, if it is
even still in there (I haven't checked).

--Steve