LinuxLists.cc - sched: deep power-saving states

2008-10-22 13:38:44

Subject: sched: deep power-saving states

Hi Arjan,
I was giving some thought to that topic you brought up at our
LF-end-user session on RT w.r.t. deep power state wakeup adding latency.

As Steven mentioned, we currently have this thing called "cpupri"
(kernel/sched_cpupri.c) in the scheduler which allows us to classify
each core (on a per disjoint cpuset basis) as being either IDLE,
SCHED_OTHER, or RT1 - RT99. (Note that currently we lump both IDLE and
SCHED_OTHER together as SCHED_OTHER because we don't yet care to
differentiate between them, but I have patches to fix this that I can
submit).

What I was thinking is that a simple mechanism to quantify the
power-state penalty would be to add those states as priority levels in
the cpupri namespace. E.g. We could substitute IDLE-RUNNING for IDLE,
and add IDLE-PS1, IDLE-PS2, .. IDLE-PSn, OTHER, RT1, .. RT99. This
means the scheduler would favor waking an IDLE-RUNNING core over an
IDLE-PS1-PSn, etc. The question in my mind is: can the power-states be
determined in a static fashion such that we know what value to quantify
the idle state before we enter it? Or is it more dynamic (e.g. the
longer it is in an MWAIT, the deeper the sleep gets).

If it's dynamic, is there a deterministic algorithm that could be applied
so that, say, a timer on a different CPU (the BSP makes sense to me) could
advance the IDLE-PSx state in cpupri on behalf of the low-power core as
time goes on?

Thoughts?
-Greg
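
To make the proposed ordering concrete, here is a minimal sketch in plain C. It is not the kernel's cpupri code: the level names, the NR_IDLE_PS count, and the pick_cpu() helper are all assumptions made for illustration. The levels are laid out in the order listed in the message above (IDLE-RUNNING, IDLE-PS1 .. IDLE-PSn, OTHER, RT1 .. RT99), and the search returns the CPU at the lowest level below the waking task's level, so a running-idle core is preferred over one in a power-saving state.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IDLE_PS 3 /* assumed number of power-saving idle states */

/* Hypothetical level layout, lowest level = cheapest CPU to hand a task to. */
enum {
    LVL_IDLE_RUNNING = 0,              /* idle, but not in a power-saving state */
    LVL_IDLE_PS1     = 1,              /* IDLE-PS1 .. IDLE-PSn follow */
    LVL_OTHER        = NR_IDLE_PS + 1, /* running a SCHED_OTHER task */
    LVL_RT1          = NR_IDLE_PS + 2, /* RT1 .. RT99 follow */
    LVL_RT99         = LVL_RT1 + 98,
    NR_LEVELS
};

/* Map power-saving state 1..n to its level. */
static int idle_ps_level(int ps)
{
    assert(ps >= 1 && ps <= NR_IDLE_PS);
    return LVL_IDLE_PS1 + ps - 1;
}

/* Map RT priority 1..99 to its level. */
static int rt_level(int rt_prio)
{
    assert(rt_prio >= 1 && rt_prio <= 99);
    return LVL_RT1 + rt_prio - 1;
}

/*
 * Return the CPU whose level is lowest among those strictly below
 * task_level, or -1 if every CPU is at or above the task's level.
 * Ties go to the lowest-numbered CPU.
 */
static int pick_cpu(const int *cpu_level, size_t nr_cpus, int task_level)
{
    int best = -1;

    for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
        if (cpu_level[cpu] >= task_level)
            continue;
        if (best < 0 || cpu_level[cpu] < cpu_level[best])
            best = (int)cpu;
    }
    return best;
}
```

The linear scan is only for readability; it says nothing about how the real cpupri lookup is structured.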

2008-10-22 13:47:38

by Arjan van de Ven

Subject: Re: sched: deep power-saving states

On Wed, 22 Oct 2008 09:42:52 -0400
Gregory Haskins wrote:

> What I was thinking is that a simple mechanism to quantify the
> power-state penalty would be to add those states as priority levels in
> the cpupri namespace. E.g. We could substitute IDLE-RUNNING for IDLE,
> and add IDLE-PS1, IDLE-PS2, .. IDLE-PSn, OTHER, RT1, .. RT99. This
> means the scheduler would favor waking an IDLE-RUNNING core over an
> IDLE-PS1-PSn, etc. The question in my mind is: can the power-states
> be determined in a static fashion such that we know what value to
> quantify the idle state before we enter it? Or is it more dynamic
> (e.g. the longer it is in an MWAIT, the deeper the sleep gets).

it's a little dynamic, but just assuming the worst will be a very good
approximation of reality. And we know what we're getting into in that
sense.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-10-22 14:01:23

by Gregory Haskins

Subject: Re: sched: deep power-saving states

Arjan van de Ven wrote:
> it's a little dynamic, but just assuming the worst will be a very good
> approximation of reality. And we know what we're getting into in that
> sense.
>

Ok, but if we just assume the worst case always, how do I differentiate
between, say, IDLE-RUNNING and IDLE-PSn? If I assign them all to
IDLE-PSn a priori, it's no better than the basic single IDLE state we
support today. Or am I misunderstanding you?

-Greg

2008-10-22 14:07:06

by Arjan van de Ven

Subject: Re: sched: deep power-saving states

On Wed, 22 Oct 2008 10:05:21 -0400
Gregory Haskins wrote:

> Ok, but if we just assume the worst case always, how do I
> differentiate between, say, IDLE-RUNNING and IDLE-PSn? If I assign
> them all to IDLE-PSn a priori, it's no better than the basic single IDLE
> state we support today. Or am I misunderstanding you?

eh yes I wasn't very clear; it's pre-coffee time here ;)

we know *for each C state* we go in, what its maximum latency is.
Now, that is the *maximum*; there are times where it'll be less
(there are several steps for going into a C-state hardware wise, and if
an interrupt comes in before they're all completed, getting out of it
means not having to undo ALL the steps, so it'll be faster)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
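
Since the worst-case exit latency is known for each C-state before the CPU enters it, the penalty can be looked up from a static table. The sketch below assumes a hypothetical table sorted from shallowest to deepest state; the struct, the function names, and above all the microsecond values are illustrative placeholders, not measurements from any real processor.

```c
#include <assert.h>
#include <stddef.h>

struct cstate_info {
    const char *name;
    unsigned int max_exit_latency_us; /* worst case; real wakeups may be faster */
};

/* Illustrative numbers only. Must stay sorted by increasing latency. */
static const struct cstate_info cstates[] = {
    { "POLL", 0 },
    { "C1",   1 },
    { "C2",   20 },
    { "C3",   200 },
};

#define NR_CSTATES (sizeof(cstates) / sizeof(cstates[0]))

/* Worst-case wakeup penalty for the state at index idx. */
static unsigned int worst_case_exit_us(size_t idx)
{
    assert(idx < NR_CSTATES);
    return cstates[idx].max_exit_latency_us;
}

/*
 * Index of the deepest state whose worst-case exit latency still fits
 * within limit_us. Falls back to index 0 (polling) when nothing else fits.
 */
static size_t deepest_state_within(unsigned int limit_us)
{
    size_t best = 0;

    for (size_t i = 1; i < NR_CSTATES; i++)
        if (cstates[i].max_exit_latency_us <= limit_us)
            best = i;
    return best;
}
```

Treating the table value as the cost of every wakeup is the "assume the worst" approximation described above: an early interrupt may get out faster, but never slower.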

2008-10-22 14:22:41

by Gregory Haskins

Subject: Re: sched: deep power-saving states

Arjan van de Ven wrote:
> eh yes I wasn't very clear; it's pre-coffee time here ;)
>
> we know *for each C state* we go in, what its maximum latency is.
> Now, that is the *maximum*; there are times where it'll be less
> (there are several steps for going into a C-state hardware wise, and if
> an interrupt comes in before they're all completed, getting out of it
> means not having to undo ALL the steps, so it'll be faster)
>

[Adding Peter Zijlstra to the thread]

Ah, yes of course! That makes sense. So I have to admit I am fairly
ignorant of the ACPI C-state stuff, so I just read up on it. In the
context of what you said, it makes perfect sense to me now.

IIUC, the OS selects which C-state it will enter at idle points based on
some internal criteria (TBD). All we have to do is remap the cpupri
"IDLE" state to something like IDLE-C1, IDLE-C2, ..., IDLE-Cn and have
the cpupri map get updated coincident with the pm_idle() call. Then the
scheduler will naturally favor cores that are in lighter sleep over
cores in deep sleep.

I am not sure if this is exactly what you were getting at during the
conf, since it doesn't really consider deep-sleep latency times
directly. But I think this is a step in the right direction.

-Greg
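
The remapping described in this message could look roughly like the following userspace sketch. It is loosely modeled on the idea of keeping one CPU mask per priority level, but every name here (toy_cpupri, toy_idle_enter(), the level macros, the 64-CPU limit, three C-states) is an assumption for illustration and not the kernel's actual code. The idle path records the chosen C-state just before going idle, and the lookup walks levels from the shallowest idle state upward, so lighter sleepers are found first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CSTATES    3                       /* assumed: C1..C3 */
#define LVL_IDLE_C(n) ((n) - 1)               /* IDLE-C1 = 0, IDLE-C2 = 1, ... */
#define LVL_OTHER     NR_CSTATES              /* SCHED_OTHER */
#define NR_LEVELS     (NR_CSTATES + 1 + 99)   /* idle + OTHER + RT1..RT99 */
#define MAX_CPUS      64

struct toy_cpupri {
    uint64_t level_mask[NR_LEVELS]; /* bit c set: CPU c is at this level */
    int cpu_level[MAX_CPUS];        /* current level per CPU, -1 if unset */
};

static void toy_cpupri_init(struct toy_cpupri *cp)
{
    memset(cp->level_mask, 0, sizeof(cp->level_mask));
    for (int cpu = 0; cpu < MAX_CPUS; cpu++)
        cp->cpu_level[cpu] = -1;
}

/* Move a CPU from its old level (if any) to a new one. */
static void toy_cpupri_set(struct toy_cpupri *cp, int cpu, int level)
{
    assert(cpu >= 0 && cpu < MAX_CPUS);
    assert(level >= 0 && level < NR_LEVELS);

    int old = cp->cpu_level[cpu];
    if (old >= 0)
        cp->level_mask[old] &= ~(UINT64_C(1) << cpu);
    cp->level_mask[level] |= UINT64_C(1) << cpu;
    cp->cpu_level[cpu] = level;
}

/* Would be called just before the idle routine enters C-state 1..n. */
static void toy_idle_enter(struct toy_cpupri *cp, int cpu, int cstate)
{
    assert(cstate >= 1 && cstate <= NR_CSTATES);
    toy_cpupri_set(cp, cpu, LVL_IDLE_C(cstate));
}

/*
 * Return the mask of CPUs at the first non-empty level below task_level,
 * scanning from the shallowest idle state upward. Returns 0 if none.
 */
static uint64_t toy_cpupri_find(const struct toy_cpupri *cp, int task_level)
{
    for (int lvl = 0; lvl < task_level && lvl < NR_LEVELS; lvl++)
        if (cp->level_mask[lvl])
            return cp->level_mask[lvl];
    return 0;
}
```

On wakeup, the CPU would be moved back to LVL_OTHER or an RT level with toy_cpupri_set(). Exit latencies are deliberately left out here; only the ordering of the C-states matters in this version.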

2008-10-22 14:36:28

by Arjan van de Ven

Subject: Re: sched: deep power-saving states

On Wed, 22 Oct 2008 10:26:49 -0400
Gregory Haskins wrote:
> [Adding Peter Zijlstra to the thread]
>
> Ah, yes of course! That makes sense. So I have to admit I am fairly
> ignorant of the ACPI C-state stuff, so I just read up on it. In the
> context of what you said, it makes perfect sense to me now.
>
> IIUC, the OS selects which C-state it will enter at idle points based
> on some internal criteria (TBD). All we have to do is remap the
> cpupri "IDLE" state to something like IDLE-C1, IDLE-C2, ..., IDLE-Cn
> and have the cpupri map get updated coincident with the pm_idle()
> call. Then the scheduler will naturally favor cores that are in
> lighter sleep over cores in deep sleep.
>
> I am not sure if this is exactly what you were getting at during the
> conf, since it doesn't really consider deep-sleep latency times
> directly. But I think this is a step in the right direction.

it for sure is a step in the right direction.
the actual exit costs are an optional parameter in this sense,
the steps between C states are non-linear (more like exponential)
so knowing the actual numbers could be used. but even if you don't
use it, it still makes sense and is a very good first order behavior.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-10-22 19:50:28

by Peter Zijlstra

Subject: Re: sched: deep power-saving states

On Wed, 2008-10-22 at 07:36 -0700, Arjan van de Ven wrote:
> it for sure is a step in the right direction.
> the actual exit costs are an optional parameter in this sense,
> the steps between C states are non-linear (more like exponential)
> so knowing the actual numbers could be used. but even if you don't
> use it, it still makes sense and is a very good first order behavior.

This still leaves us with the worst case IRQ response as given by the
deepest C state. Which might be un-desirable.

jcm was, once upon a time, working on dynamically changing the idle
routine, so that people who care about wakeup latency can run idle=poll
while their application runs, and the acpi C state stuff when nobody
cares.

This could of course then be tied into the PM QoS stuff Mark has been
doing.

Fact of life is, for some RT apps, anything but idle=poll is too much.

But yes, when C states are in play, it makes sense to try and wake a cpu
that's not deep over a very deep idle one.

2008-10-22 19:55:06

by Arjan van de Ven

Subject: Re: sched: deep power-saving states

On Wed, 22 Oct 2008 21:49:52 +0200
Peter Zijlstra wrote:
>
> This still leaves us with the worst case IRQ response as given by the
> deepest C state. Which might be un-desirable.

that's a different problem in a different problem space.
>
> jcm was, once upon a time, working on dynamically changing the idle
> routine, so that people who care about wakeup latency can run
> idle=poll while their application runs, and the acpi C state stuff
> when nobody cares.
>
> This could of course then be tied into the PM QoS stuff Mark has been
> doing.

in fact you already have this *exactly* today; this isn't future
technology.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-10-22 20:05:53

by Peter Zijlstra

Subject: Re: sched: deep power-saving states

On Wed, 2008-10-22 at 12:55 -0700, Arjan van de Ven wrote:
> On Wed, 22 Oct 2008 21:49:52 +0200
> Peter Zijlstra wrote:
> >
> > This still leaves us with the worst case IRQ response as given by the
> > deepest C state. Which might be un-desirable.
>
> that's a different problem in a different problem space.

Ah right, so the only point was trying to wake shallow cpus so as to try
and let deep cpus idle longer?

> > jcm was, once upon a time, working on dynamically changing the idle
> > routine, so that people who care about wakeup latency can run
> > idle=poll while their application runs, and the acpi C state stuff
> > when nobody cares.
> >
> > This could of course then be tied into the PM QoS stuff Mark has been
> > doing.
>
> in fact you already have this *exactly* today; this isn't future
> technology.

Interesting, what knob do I turn to get idle=poll dynamically?

2008-10-22 20:12:32

by Arjan van de Ven

Subject: Re: sched: deep power-saving states

On Wed, 22 Oct 2008 22:05:25 +0200
Peter Zijlstra wrote:

> >
> > in fact you already have this *exactly* today; this isn't future
> > technology.
>
> Interesting, what knob do I turn to get idle=poll dynamically?

you ask PMQOS for a 0 usec latency, and you just get idle=poll behavior;
you can do this from the kernel, or, as root, from userland.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
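
For the userland route, a minimal sketch follows. It assumes the PM QoS character device /dev/cpu_dma_latency, which takes a 32-bit latency value in microseconds and keeps the request in force only while the file descriptor stays open. The helper name hold_cpu_latency() is made up for this example, and the path is passed in as a parameter so the device name is not hard-coded.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Write a maximum tolerated wakeup latency (in usec) to a PM QoS device
 * node. Returns an open fd on success; the caller must keep it open for
 * as long as the request should hold and close() it afterwards.
 * Returns -1 on any error.
 */
int hold_cpu_latency(const char *path, int32_t max_latency_us)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    if (write(fd, &max_latency_us, sizeof(max_latency_us))
            != (ssize_t)sizeof(max_latency_us)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A latency-sensitive application would call hold_cpu_latency("/dev/cpu_dma_latency", 0) at startup, typically as root, and close the returned descriptor on exit so that deeper C-states become available again. Whether a 0 usec request yields exactly idle=poll behavior depends on the kernel version and the idle driver in use.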