2012-08-13 12:21:11

by Alex Shi

Subject: [discussion]sched: a rough proposal to enable power saving in scheduler

Since there is no power saving consideration in the CFS scheduler, I have a
very rough idea for enabling a new power saving scheme in CFS.

It is based on the following assumptions:
1, If many tasks crowd the system, letting only a few domain cpus run while
the other cpus stay idle does not save power. Letting all cpus take the
load, finish the tasks early, and then go idle will save more power and
give a better user experience.

2, Scheduler domains and scheduler groups match the hardware and its power
consumption units. So pulling all tasks out of a domain potentially lets
that power consumption unit go idle.

So, as Peter mentioned in commit 8e7fbcbc22c (sched: Remove stale
power aware scheduling), this proposal will adopt the sched_balance_policy
concept and use 2 kinds of policy: performance and power.

In scheduling, 2 places will care about the policy: load_balance(), and
select_task_rq_fair() on task fork/exec.

Here is some pseudo code that tries to explain the proposed behaviour in
load_balance() and select_task_rq_fair():


load_balance() {
        update_sd_lb_stats(); //get busiest group, idlest group data

        if (sd->nr_running > sd's capacity) {
                //power saving policy is not suitable for this
                //scenario; it runs like the performance policy
                move tasks from busiest cpu in busiest group to
                idlest cpu in idlest group;
        } else { //the sd has enough capacity to hold all tasks
                if (sg->nr_running > sg's capacity) {
                        //imbalanced between groups
                        if (schedule policy == performance) {
                                //when 2 busiest groups are at the same busy
                                //degree, prefer the one that has the
                                //softest group??
                                move tasks from busiest group to
                                idlest group;
                        } else if (schedule policy == power)
                                move tasks from busiest group to
                                idlest group until busiest is just full
                                of capacity;
                                //the busiest group can balance
                                //internally after the next LB
                } else {
                        //all groups have enough capacity for their tasks
                        if (schedule policy == performance)
                                //all tasks may have enough cpu
                                //resources to run;
                                //move tasks from busiest to idlest group?
                                //no, at this time it's better to keep
                                //the task on its current cpu,
                                //so it is maybe better to balance
                                //within each of the groups
                                for_each_imbalance_groups()
                                        move tasks from busiest cpu to
                                        idlest cpu in each of the groups;
                        else if (schedule policy == power) {
                                if (no hard pin in idlest group)
                                        move tasks from idlest group to
                                        busiest until busiest is full;
                                else
                                        move unpinned tasks to the biggest
                                        hard-pin group;
                        }
                }
        }
}

select_task_rq_fair()
{
        for_each_domain(cpu, tmp) {
                if (policy == power && tmp_has_capacity &&
                    tmp->flags & sd_flag) {
                        sd = tmp;
                        //it is fine to get a cpu in this domain
                        break;
                }
        }

        while (sd) {
                if (policy == power)
                        find_busiest_and_capable_group();
                else
                        find_idlest_group();
                if (!group) {
                        sd = sd->child;
                        continue;
                }
                ...
        }
}

sub proposal:
1, If possible, balance tasks onto the idlest cpu directly rather than onto
the appointed 'balance cpu'. That may save one more round of balancing.
The idlest cpu should prefer a newly idle cpu, otherwise the least loaded cpu.
2, se or task load is good for setting running time, but it should be the
second basis in load balancing. The first basis of LB is the number of
running tasks per group/cpu. Whatever the weight of a group is, if its
number of tasks is less than its number of cpus, the group still has
capacity to take more tasks. (SMT cpu power and big/little cpu capacity on
ARM will be considered later.) A small sketch of this check follows below.
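
For illustration only, a tiny user-space sketch of that first-basis check
(the structures, names and numbers here are invented, not kernel code):
order candidate groups by free task slots first, and use load only as a
tie breaker.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: a group is a cpu count, a count of running tasks
 * and a load figure.  The first basis for balancing is the number of
 * running tasks versus the number of cpus; group load is only the
 * "second basis" tie breaker. */
struct group_stat {
        const char *name;
        int nr_cpus;
        int nr_running;
        unsigned long load;
};

static int group_has_capacity(const struct group_stat *g)
{
        return g->nr_running < g->nr_cpus;
}

/* Order groups so the best pull target comes first: most free cpu
 * slots, then least load. */
static int cmp_groups(const void *a, const void *b)
{
        const struct group_stat *ga = a, *gb = b;
        int free_a = ga->nr_cpus - ga->nr_running;
        int free_b = gb->nr_cpus - gb->nr_running;

        if (free_a != free_b)
                return free_b - free_a;
        return (ga->load > gb->load) - (ga->load < gb->load);
}

int main(void)
{
        struct group_stat groups[] = {
                { "sg0", 4, 5, 4096 },
                { "sg1", 4, 2, 1024 },
                { "sg2", 4, 2, 2048 },
        };
        int i;

        qsort(groups, 3, sizeof(groups[0]), cmp_groups);
        for (i = 0; i < 3; i++)
                printf("%s: has capacity: %d\n", groups[i].name,
                       group_has_capacity(&groups[i]));
        return 0;
}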

unsolved issues:
1, Like the current scheduler, it does not handle cpu affinity well in
load_balance().
2, Task groups are not considered well in this rough proposal.

The proposal is not fully thought through and may contain mistakes. I just
want to share my ideas and hope they become better and workable through
your comments and discussion.

Thanks
Alex


2012-08-14 07:35:23

by Alex Shi

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/13/2012 08:21 PM, Alex Shi wrote:

> Since there is no power saving consideration in scheduler CFS, I has a
> very rough idea for enabling a new power saving schema in CFS.
>
> It bases on the following assumption:
> 1, If there are many task crowd in system, just let few domain cpus
> running and let other cpus idle can not save power. Let all cpu take the
> load, finish tasks early, and then get into idle. will save more power
> and have better user experience.
>
> 2, schedule domain, schedule group perfect match the hardware, and
> the power consumption unit. So, pull tasks out of a domain means
> potentially this power consumption unit idle.
>
> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
> power aware scheduling), this proposal will adopt the
> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> And in scheduling, 2 place will care the policy, load_balance() and in
> task fork/exec: select_task_rq_fair().



Any comments on this rough proposal, especially on the assumptions?

>
> Here is some pseudo code try to explain the proposal behaviour in
> load_balance() and select_task_rq_fair();
>
>
> load_balance() {
> update_sd_lb_stats(); //get busiest group, idlest group data.
>
> if (sd->nr_running > sd's capacity) {
> //power saving policy is not suitable for
> //this scenario, it runs like performance policy
> mv tasks from busiest cpu in busiest group to
> idlest cpu in idlest group;
> } else {// the sd has enough capacity to hold all tasks.
> if (sg->nr_running > sg's capacity) {
> //imbalanced between groups
> if (schedule policy == performance) {
> //when 2 busiest group at same busy
> //degree, need to prefer the one has
> // softest group??
> move tasks from busiest group to
> idletest group;
> } else if (schedule policy == power)
> move tasks from busiest group to
> idlest group until busiest is just full
> of capacity.
> //the busiest group can balance
> //internally after next time LB,
> } else {
> //all groups has enough capacity for its tasks.
> if (schedule policy == performance)
> //all tasks may has enough cpu
> //resources to run,
> //mv tasks from busiest to idlest group?
> //no, at this time, it's better to keep
> //the task on current cpu.
> //so, it is maybe better to do balance
> //in each of groups
> for_each_imbalance_groups()
> move tasks from busiest cpu to
> idlest cpu in each of groups;
> else if (schedule policy == power) {
> if (no hard pin in idlest group)
> mv tasks from idlest group to
> busiest until busiest full.
> else
> mv unpin tasks to the biggest
> hard pin group.
> }
> }
> }
> }
>
> select_task_rq_fair()
> {
> for_each_domain(cpu, tmp) {
> if (policy == power && tmp_has_capacity &&
> tmp->flags & sd_flag) {
> sd = tmp;
> //It is fine to got cpu in the domain
> break;
> }
> }
>
> while(sd) {
> if policy == power
> find_busiest_and_capable_group()
> else
> find_idlest_group();
> if (!group) {
> sd = sd->child;
> continue;
> }
> ...
> }
> }
>
> sub proposal:
> 1, If it's possible to balance task on idlest cpu not appointed 'balance
> cpu'. If so, it may can reduce one more time balancing.
> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
> 2, se or task load is good for running time setting.
> but it should the second basis in load balancing. The first basis of LB
> is running tasks' number in group/cpu. Since whatever of the weight of
> groups is, if the tasks number is less than cpu number, the group is
> still has capacity to take more tasks. (will consider the SMT cpu power
> or other big/little cpu capacity on ARM.)
>
> unsolved issues:
> 1, like current scheduler, it didn't handled cpu affinity well in
> load_balance.
> 2, task group that isn't consider well in this rough proposal.
>
> It isn't consider well and may has mistaken . So just share my ideas and
> hope it become better and workable in your comments and discussion.
>
> Thanks
> Alex

2012-08-15 08:23:20

by Peter Zijlstra

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Tue, 2012-08-14 at 15:35 +0800, Alex Shi wrote:
>
> Any comments for this rough proposal, specially for the assumptions?
>
Let me read it first ;-)

2012-08-15 11:05:58

by Peter Zijlstra

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
> Since there is no power saving consideration in scheduler CFS, I has a
> very rough idea for enabling a new power saving schema in CFS.

Adding Thomas, he always delights in poking holes in power schemes.

> It bases on the following assumption:
> 1, If there are many task crowd in system, just let few domain cpus
> running and let other cpus idle can not save power. Let all cpu take the
> load, finish tasks early, and then get into idle. will save more power
> and have better user experience.

I'm not sure this is a valid assumption. I've had it explained to me by
various people that race-to-idle isn't always the best thing. It has to
do with the cost of switching power states and the duration of execution
and other such things.

> 2, schedule domain, schedule group perfect match the hardware, and
> the power consumption unit. So, pull tasks out of a domain means
> potentially this power consumption unit idle.

I'm not sure I understand what you're saying, sorry.

> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
> power aware scheduling), this proposal will adopt the
> sched_balance_policy concept and use 2 kind of policy: performance, power.

Yay, ideally we'd also provide a 3rd option: auto, which simply switches
between the two based on AC/BAT, UPS status and simple things like that.
But this seems like a later concern, you have to have something to pick
between before you can pick :-)

> And in scheduling, 2 place will care the policy, load_balance() and in
> task fork/exec: select_task_rq_fair().

ack

> Here is some pseudo code try to explain the proposal behaviour in
> load_balance() and select_task_rq_fair();

Oh man.. A few words outlining the general idea would've been nice.

> load_balance() {
> update_sd_lb_stats(); //get busiest group, idlest group data.
>
> if (sd->nr_running > sd's capacity) {
> //power saving policy is not suitable for
> //this scenario, it runs like performance policy
> mv tasks from busiest cpu in busiest group to
> idlest cpu in idlest group;

Once upon a time we talked about adding a factor to the capacity for
this. So say you'd allow 2*capacity before overflowing and waking
another power group.

But I think we should not go on nr_running here, PJTs per-entity load
tracking stuff gives us much better measures -- also, repost that series
already Paul! :-)

Also, I'm not sure this is entirely correct, the thing you want to do
for power aware stuff is to minimize the number of active power domains,
this means you don't want idlest, you want least busy non-idle.
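
For what it's worth, a toy user-space sketch of that target choice (the 2x
factor, field names and helper here are assumptions for illustration, not
existing kernel code): prefer the least busy group that is already non-idle
and still under an inflated capacity, and only fall back to waking an idle
group when nothing qualifies.

#include <stddef.h>
#include <stdio.h>

/* Illustrative sketch, not kernel code.  util and capacity share an
 * arbitrary unit; OVERFLOW_FACTOR_PCT = 200 mimics "allow 2*capacity
 * before waking another power group". */
#define OVERFLOW_FACTOR_PCT 200

struct group_stat {
        const char *name;
        unsigned long util;       /* current utilization       */
        unsigned long capacity;   /* nominal capacity          */
        int nr_running;           /* 0 means the group is idle */
};

/* Pick the least busy non-idle group that still fits under the inflated
 * capacity; NULL means we really have to wake an idle group instead. */
static struct group_stat *pick_pack_target(struct group_stat *sg, int n)
{
        struct group_stat *best = NULL;
        int i;

        for (i = 0; i < n; i++) {
                unsigned long limit =
                        sg[i].capacity * OVERFLOW_FACTOR_PCT / 100;

                if (sg[i].nr_running == 0)   /* keep idle groups asleep */
                        continue;
                if (sg[i].util >= limit)     /* already overflowing     */
                        continue;
                if (!best || sg[i].util < best->util)
                        best = &sg[i];
        }
        return best;
}

int main(void)
{
        struct group_stat groups[] = {
                { "pkg0", 900, 1024, 3 },
                { "pkg1", 300, 1024, 1 },
                { "pkg2",   0, 1024, 0 },
        };
        struct group_stat *t = pick_pack_target(groups, 3);

        printf("pack target: %s\n", t ? t->name : "none, wake an idle group");
        return 0;
}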

> } else {// the sd has enough capacity to hold all tasks.
> if (sg->nr_running > sg's capacity) {
> //imbalanced between groups
> if (schedule policy == performance) {
> //when 2 busiest group at same busy
> //degree, need to prefer the one has
> // softest group??
> move tasks from busiest group to
> idletest group;

So I'd leave the currently implemented scheme as performance, and I
don't think the above describes the current state.

> } else if (schedule policy == power)
> move tasks from busiest group to
> idlest group until busiest is just full
> of capacity.
> //the busiest group can balance
> //internally after next time LB,

There's another thing we need to do, and that is collect tasks in a
minimal amount of power domains. The old code (that got deleted) did
something like that, you can revive some of that code if needed -- I
just killed everything to be able to start with a clean slate.


> } else {
> //all groups has enough capacity for its tasks.
> if (schedule policy == performance)
> //all tasks may has enough cpu
> //resources to run,
> //mv tasks from busiest to idlest group?
> //no, at this time, it's better to keep
> //the task on current cpu.
> //so, it is maybe better to do balance
> //in each of groups
> for_each_imbalance_groups()
> move tasks from busiest cpu to
> idlest cpu in each of groups;
> else if (schedule policy == power) {
> if (no hard pin in idlest group)
> mv tasks from idlest group to
> busiest until busiest full.
> else
> mv unpin tasks to the biggest
> hard pin group.
> }
> }
> }
> }

OK, so you only start to group later.. I think we can do better than
that.

>
> sub proposal:
> 1, If it's possible to balance task on idlest cpu not appointed 'balance
> cpu'. If so, it may can reduce one more time balancing.
> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
> 2, se or task load is good for running time setting.
> but it should the second basis in load balancing. The first basis of LB
> is running tasks' number in group/cpu. Since whatever of the weight of
> groups is, if the tasks number is less than cpu number, the group is
> still has capacity to take more tasks. (will consider the SMT cpu power
> or other big/little cpu capacity on ARM.)

Ah, no, we shouldn't balance on nr_running, but on the amount of time
consumed. Imagine two tasks being woken at the same time: both tasks
will only run a fraction of the available time, and you don't want this
to exceed your capacity, because, run back to back, the one cpu will
still be mostly idle.

What you want is to keep track of a per-cpu utilization level (inverse
of idle-time) and, using PJTs per-task runnable avg, see if placing the
new task on it will exceed the utilization limit.
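
A rough sketch of that placement test in plain user-space C (the
fixed-point scale, the limit and the names are invented placeholders, not
PJT's actual code): track per-cpu utilization and reject a cpu when adding
the task's runnable average would push it over the limit.

#include <stdio.h>

/* Illustrative only.  Everything is in a 0..1024 fixed-point scale,
 * loosely mimicking per-entity load tracking; these are not the real
 * kernel structures or limits. */
#define UTIL_SCALE      1024
#define UTIL_LIMIT       900    /* leave some headroom for spikes */

struct cpu_stat {
        unsigned int util;              /* UTIL_SCALE - recent idle fraction */
};

struct task_stat {
        unsigned int runnable_avg;      /* recent runnable fraction */
};

/* Would placing @p on @cpu exceed the utilization limit? */
static int placement_fits(const struct cpu_stat *cpu,
                          const struct task_stat *p)
{
        return cpu->util + p->runnable_avg <= UTIL_LIMIT;
}

int main(void)
{
        struct cpu_stat cpu0 = { .util = 700 };
        struct cpu_stat cpu1 = { .util = 300 };
        struct task_stat p = { .runnable_avg = 350 };

        printf("fits on cpu0: %d\n", placement_fits(&cpu0, &p));
        printf("fits on cpu1: %d\n", placement_fits(&cpu1, &p));
        return 0;
}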

I think some of the Linaro people actually played around with this,
Vincent?

> unsolved issues:
> 1, like current scheduler, it didn't handled cpu affinity well in
> load_balance.

cpu affinity is always 'fun'.. while there are still a few fun sites in
the current load-balancer, we do better than we did a while ago.

> 2, task group that isn't consider well in this rough proposal.

You mean the cgroup mess?

> It isn't consider well and may has mistaken . So just share my ideas and
> hope it become better and workable in your comments and discussion.

Very simplistically the current scheme is a 'spread' the load scheme
(SD_PREFER_SIBLING if you will). We spread load to maximize per-task
cache and cpu power.

The power scheme should be a 'pack' scheme, where we minimize the active
power domains.

One way to implement this is to keep track of an active and
under-utilized power domain (the target) and fail the regular (pull)
load-balance for all cpus not in that domain. For the cpus that are in
that domain we'll have find_busiest select from all other under-utilized
domains pulling tasks to fill our target, once full, we pick a new
target, goto 1.
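
A toy model of that loop (everything here is invented for illustration,
none of it is real kernel code): cpus outside the target domain refuse to
pull, the target pulls utilization from other under-utilized domains, and
a new target is picked once the current one is full.

#include <stdio.h>

/* Illustrative toy model of the pack balancer described above. */
#define NR_DOMAINS 4

struct pdomain {
        unsigned int util;      /* current utilization */
        unsigned int capacity;
};

static struct pdomain doms[NR_DOMAINS] = {
        { 200, 1024 }, { 300, 1024 }, { 150, 1024 }, { 0, 1024 },
};
static int target = 0;          /* the one domain allowed to pull */

static int under_utilized(int d)
{
        return doms[d].util < doms[d].capacity;
}

/* Regular pull balance: fail for every cpu outside the target domain. */
static int can_pull(int my_domain)
{
        return my_domain == target;
}

/* One packing step: move "amount" of utilization from another
 * under-utilized active domain into the target; pick a new target once
 * the current one is full. */
static void pack_step(unsigned int amount)
{
        int d;

        for (d = 0; d < NR_DOMAINS; d++) {
                unsigned int mv;

                if (d == target || doms[d].util == 0 || !under_utilized(d))
                        continue;
                mv = amount < doms[d].util ? amount : doms[d].util;
                doms[target].util += mv;
                doms[d].util -= mv;
                break;
        }
        if (!under_utilized(target)) {
                /* target is full: pick a new under-utilized active target */
                for (d = 0; d < NR_DOMAINS; d++)
                        if (d != target && under_utilized(d) &&
                            doms[d].util > 0)
                                break;
                if (d < NR_DOMAINS)
                        target = d;
        }
}

int main(void)
{
        int i;

        for (i = 0; i < 6; i++)
                pack_step(200);
        for (i = 0; i < NR_DOMAINS; i++)
                printf("domain %d util %u (pull allowed: %d)\n",
                       i, doms[i].util, can_pull(i));
        return 0;
}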

2012-08-15 13:15:20

by Borislav Petkov

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
> > Since there is no power saving consideration in scheduler CFS, I has a
> > very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
> > It bases on the following assumption:
> > 1, If there are many task crowd in system, just let few domain cpus
> > running and let other cpus idle can not save power. Let all cpu take the
> > load, finish tasks early, and then get into idle. will save more power
> > and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.

I think what he means here is that we might want to let all cores on
the node (i.e., domain) finish and then power down the whole node which
should bring much more power savings than letting a subset of the cores
idle. Alex?

[ … ]

> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
> > } else if (schedule policy == power)
> > move tasks from busiest group to
> > idlest group until busiest is just full
> > of capacity.
> > //the busiest group can balance
> > //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains.

Yep.

Btw, what heuristic would tell here when a domain overflows and another
needs to get woken? Combined load of the whole domain?

And if I absolutely positively don't want a node to wake up, do I
hotplug its cores off, or are we going to have a way to tell the
scheduler to overcommit the non-idle domains and spread the tasks only
among them?

I'm thinking of short bursts here, where it would probably be beneficial
to let the tasks wait runnable for a while rather than wake up the next
node and waste power...

Thanks.

--
Regards/Gruss,
Boris.

2012-08-15 13:45:03

by Arjan van de Ven

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/15/2012 4:05 AM, Peter Zijlstra wrote:
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS

nooooooo!

anything but this.
if anyone thinks that AC/Battery matters for power sensitivity... they need to go talk to a datacenter operator ;-)

seriously, there are possibly many ways to have a power/performance preference... but AC/battery is a very very poor one.

2012-08-15 14:20:25

by Arjan van de Ven

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/15/2012 4:05 AM, Peter Zijlstra wrote:
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing.

it's not so much race to idle (which is more about frequency than anything else)

it's about the situation that in order 0 approximation, the first (logical) CPU
you bring out of idle is the least efficient one, or rather, all consecutive CPUs that
you add cost less incremental power than this first one.
Keeping this first one on longer (versus parallelism) is a bad trade off.

in an order 1 approximation you are absolutely correct. If the other task will only run briefly,
moving it (and thus waking a core up) is a loss due to transition costs.

The whole situation hinges on what is "briefly" (or "long enough" in other words).

for a typical Intel or AMD based cpu, the tipping point will likely be somewhere between 100 usec and 300 usec,
but this is obviously somewhat CPU and architecture specific.

Interrupts usually are well below that (hopefully ;-).
Very short tasks, that just get a disk IO completion to then schedule the next IO... will be too.

Ideally the scheduler builds up some history of the typical run duration of the task (with a bias to more recent runs).
But... even then, the past is only a poor predictor for the future.
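
As a sketch of that kind of history keeping (the decay weights and the
200 usec threshold below are placeholders somewhere in the range mentioned
above, not measured values): an exponentially decaying average of recent
run durations, compared against a tipping point before deciding to wake
another core.

#include <stdio.h>

/* Illustrative only.  Durations are in microseconds; TIPPING_POINT_US
 * is just a placeholder in the 100-300 usec window mentioned above, and
 * the 3/4 decay is an arbitrary "bias to recent runs". */
#define TIPPING_POINT_US 200

struct run_history {
        unsigned long avg_us;
};

/* avg = 3/4 * old + 1/4 * latest: recent runs dominate after a few
 * samples. */
static void update_run_history(struct run_history *h, unsigned long run_us)
{
        h->avg_us = (3 * h->avg_us + run_us) / 4;
}

static int worth_waking_a_core(const struct run_history *h)
{
        return h->avg_us > TIPPING_POINT_US;
}

int main(void)
{
        struct run_history h = { 0 };
        unsigned long samples[] = { 50, 80, 400, 500, 450 };
        unsigned long i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                update_run_history(&h, samples[i]);
                printf("run %lu us -> avg %lu us, wake another core: %d\n",
                       samples[i], h.avg_us, worth_waking_a_core(&h));
        }
        return 0;
}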


2012-08-15 14:24:37

by Rakib Mullick

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/13/12, Alex Shi <[email protected]> wrote:
> Since there is no power saving consideration in scheduler CFS, I has a
> very rough idea for enabling a new power saving schema in CFS.
>
> It bases on the following assumption:
> 1, If there are many task crowd in system, just let few domain cpus
> running and let other cpus idle can not save power. Let all cpu take the
> load, finish tasks early, and then get into idle. will save more power
> and have better user experience.
>
This assumption indirectly points towards the scheme when performance
is enabled, doesn't it? Because you're trying to spread the load equally
amongst all the CPUs.

> 2, schedule domain, schedule group perfect match the hardware, and
> the power consumption unit. So, pull tasks out of a domain means
> potentially this power consumption unit idle.
>
How do you plan to test this power saving scheme? Using powertop? Or,
is there any other tools?

>
> select_task_rq_fair()
> {
> for_each_domain(cpu, tmp) {
> if (policy == power && tmp_has_capacity &&
> tmp->flags & sd_flag) {
> sd = tmp;
> //It is fine to got cpu in the domain
> break;
> }
> }
>
> while(sd) {
> if policy == power
> find_busiest_and_capable_group()

I'm not sure what find_busiest_and_capable_group() would really be; it
seems it'll find the busiest and capable group, but isn't that in
conflict with the first assumption of your proposal?

Thanks,
Rakib.

2012-08-15 14:39:24

by Peter Zijlstra

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, 2012-08-15 at 06:45 -0700, Arjan van de Ven wrote:
> On 8/15/2012 4:05 AM, Peter Zijlstra wrote:
> > Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> > between the two based on AC/BAT, UPS
>
> nooooooo!
>
> anything but this.
> if anyone thinks that AC/Battery matters for power sensitivity... they
> need to go talk to a datacenter operator ;-)

Servers in a datacenter have battery?

> seriously, there are possibly many ways to have a power/performance
> preference..,. but AC/battery is a very very poor one.
>
Do expand..

2012-08-15 14:43:45

by Peter Zijlstra

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, 2012-08-15 at 15:15 +0200, Borislav Petkov wrote:
> On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
> > On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
> > > Since there is no power saving consideration in scheduler CFS, I has a
> > > very rough idea for enabling a new power saving schema in CFS.
> >
> > Adding Thomas, he always delights poking holes in power schemes.
> >
> > > It bases on the following assumption:
> > > 1, If there are many task crowd in system, just let few domain cpus
> > > running and let other cpus idle can not save power. Let all cpu take the
> > > load, finish tasks early, and then get into idle. will save more power
> > > and have better user experience.
> >
> > I'm not sure this is a valid assumption. I've had it explained to me by
> > various people that race-to-idle isn't always the best thing. It has to
> > do with the cost of switching power states and the duration of execution
> > and other such things.
>
> I think what he means here is that we might want to let all cores on
> the node (i.e., domain) finish and then power down the whole node which
> should bring much more power savings than letting a subset of the cores
> idle. Alex?

Sure we can do that.

> > So I'd leave the currently implemented scheme as performance, and I
> > don't think the above describes the current state.
> >
> > > } else if (schedule policy == power)
> > > move tasks from busiest group to
> > > idlest group until busiest is just full
> > > of capacity.
> > > //the busiest group can balance
> > > //internally after next time LB,
> >
> > There's another thing we need to do, and that is collect tasks in a
> > minimal amount of power domains.
>
> Yep.
>
> Btw, what heuristic would tell here when a domain overflows and another
> needs to get woken? Combined load of the whole domain?
>
> And if I absolutely positively don't want a node to wake up, do I
> hotplug its cores off or are we going to have a way to tell the
> scheduler to overcommit the non-idle domains and spread the tasks only
> among them.
>
> I'm thinking of short bursts here where it would be probably beneficial
> to let the tasks rather wait runnable for a while then wake up the next
> node and waste power...

I was thinking of a utilization measure made of per-task weighted
runnable averages. This should indeed cover that case and we'll overflow
when on average there is no (significant) idle time over a period longer
than the averaging period.
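
A small sketch of such an overflow test (the window length and threshold
are made up): remember the domain's idle fraction over a sliding window
and call it overflowing when there was no significant idle time in any
slot of the window.

#include <stdio.h>

/* Illustrative sketch only.  The window is WINDOW_SLOTS fixed-length
 * periods; the domain "overflows" when its idle fraction stayed below
 * IDLE_THRESHOLD_PCT in every slot of the window. */
#define WINDOW_SLOTS       8
#define IDLE_THRESHOLD_PCT 5

struct domain_util {
        unsigned int idle_pct[WINDOW_SLOTS];    /* ring buffer of idle % */
        unsigned int head;
};

static void record_slot(struct domain_util *d, unsigned int idle_pct)
{
        d->idle_pct[d->head++ % WINDOW_SLOTS] = idle_pct;
}

static int domain_overflows(const struct domain_util *d)
{
        unsigned int i;

        for (i = 0; i < WINDOW_SLOTS; i++)
                if (d->idle_pct[i] >= IDLE_THRESHOLD_PCT)
                        return 0;
        return 1;       /* no significant idle time over the whole window */
}

int main(void)
{
        struct domain_util d = { { 0 }, 0 };
        unsigned int i;

        for (i = 0; i < WINDOW_SLOTS; i++)
                record_slot(&d, i < 2 ? 20 : 1);   /* busy after two slots */
        printf("overflow: %d\n", domain_overflows(&d));

        for (i = 0; i < WINDOW_SLOTS; i++)
                record_slot(&d, 1);                /* fully busy window */
        printf("overflow: %d\n", domain_overflows(&d));
        return 0;
}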

Anyway, I'm not too set on this and I'm very sure we can tweak this ad
infinitum, so starting with something relatively simple that works for
most is preferred.

As already stated, I think some of the Linaro people actually played
around with something like this based on PJTs patches.

2012-08-15 14:43:55

by Arjan van de Ven

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/15/2012 7:39 AM, Peter Zijlstra wrote:
> On Wed, 2012-08-15 at 06:45 -0700, Arjan van de Ven wrote:
>> On 8/15/2012 4:05 AM, Peter Zijlstra wrote:
>>> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
>>> between the two based on AC/BAT, UPS
>>
>> nooooooo!
>>
>> anything but this.
>> if anyone thinks that AC/Battery matters for power sensitivity... they
>> need to go talk to a datacenter operator ;-)
>
> Servers in a datacenter have battery?

they have AC, and sometimes a battery called "UPS".
DC is getting much more prevalent in datacenters in general.

>
>> seriously, there are possibly many ways to have a power/performance
>> preference..,. but AC/battery is a very very poor one.
>>
> Do expand..
>

The easy cop-out is to provide the sysadmin a slider.
The slightly less easy one is to (and we're taking this approach
in the new P state code we're working on) say "in the default
setting, we're going to sacrifice up to 5% performance from peak
to give you the best power savings within that performance loss budget"
(with a slider that can give you 0%, 2 1/2%, 5% and 10%).

on Intel PCs and servers, there usually is a bios switch/setting for this
(there is a setting the bios does to the CPU, and we can read that; not all
bioses expose this setting to the end user). We could take a clue from what
was set there.

2012-08-15 14:45:09

by Peter Zijlstra

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, 2012-08-15 at 07:19 -0700, Arjan van de Ven wrote:
> Ideally the scheduler builds up some history of the typical run
> duration of the task (with a bias to more recent runs).
> But... even then, the past is only a poor predictor for the future.

PJTs patches do this. But yes, a crystal-ball instruction which
completes in a single cycle would be most appreciated :-)

2012-08-15 14:47:33

by Thomas Gleixner

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, 15 Aug 2012, Peter Zijlstra wrote:

> On Wed, 2012-08-15 at 07:19 -0700, Arjan van de Ven wrote:
> > Ideally the scheduler builds up some history of the typical run
> > duration of the task (with a bias to more recent runs).
> > But... even then, the past is only a poor predictor for the future.
>
> PJTs patches do this. But yes, a crystal-ball instruction which
> completes in a single cycle would be most appreciated :-)

You could ask the S390 folks whether they could whip one up for proof
of concept. :)

2012-08-15 14:55:30

by Peter Zijlstra

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, 2012-08-15 at 20:24 +0600, Rakib Mullick wrote:
> How do you plan to test this power saving scheme? Using powertop? Or,
> is there any other tools?

We should start out simple enough that we can validate it by looking at
task placement by hand, eg. 4 tasks on a dual socket quad-core, should
only keep one socket awake.

We can also add a power aware evaluator to Linsched (another one of
those things that needs getting sorted).

And yeah, someone running with a power meter is of course king.

_BUT_ we shouldn't go off the wall with power meters as that very
quickly gets very specific to the system being measured.

We should really keep this thing as simple as possible while still
providing some benefit for all the various architectures, without tons of
per-arch knobs and knowhow.
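
One way to eyeball the placement by hand is a trivial test program like
the sketch below (Linux-specific, uses glibc's sched_getcpu(); it only
reports placement, mapping cpus to sockets via the /sys topology files is
left to the reader): spawn four cpu hogs and check that the cpus they
report all belong to one socket.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn NR_HOGS busy loops and print the cpu each one is running on a
 * few times; whether those cpus share a socket can be checked against
 * /sys/devices/system/cpu/cpuN/topology/physical_package_id. */
#define NR_HOGS 4

static void hog(int id)
{
        volatile unsigned long x = 0;
        int i;

        for (i = 0; i < 5; i++) {
                unsigned long spin;

                for (spin = 0; spin < 200000000UL; spin++)
                        x += spin;
                printf("hog %d on cpu %d\n", id, sched_getcpu());
        }
        exit(0);
}

int main(void)
{
        int i;

        for (i = 0; i < NR_HOGS; i++)
                if (fork() == 0)
                        hog(i);
        for (i = 0; i < NR_HOGS; i++)
                wait(NULL);
        return 0;
}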

2012-08-15 15:05:12

by Peter Zijlstra

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, 2012-08-15 at 07:43 -0700, Arjan van de Ven wrote:

> > Servers in a datacenter have battery?
>
> they have AC, and sometimes a battery called "UPS".
> DC is getting much more prevalent in datacenters in general.

AC/DC (/me slams a riff on his air-guitar)...

> >> seriously, there are possibly many ways to have a power/performance
> >> preference..,. but AC/battery is a very very poor one.
> >>
> > Do expand..
> >
>
> The easy cop-out is provide the sysadmin a slider.
> The slightly less easy one is to (and we're taking this approach
> in the new P state code we're working on) say "in the default
> setting, we're going to sacrifice up to 5% performance from peak
> to give you the best power savings within that performance loss budget"
> (with a slider that can give you 0%, 2 1/2% 5% and 10%)
>
> on Intel PCs and servers, there usually is a bios switch/setting for this
> (there is a setting the bios does to the CPU, and we can read that. Not all bioses
> expose this setting to the end user). We could take clue from what was set there.

This all sounds far too complicated.. we're talking about simple
spreading and packing balancers without deep arch knowledge and knobs,
we couldn't possibly evaluate anything like that.

I was really more thinking of something useful for the laptops out
there, when they pull the power cord it makes sense to try and keep CPUs
asleep until the one that's awake is saturated.

2012-08-15 16:19:35

by Matthew Garrett

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:

> power aware scheduling), this proposal will adopt the
> sched_balance_policy concept and use 2 kind of policy: performance, power.

Are there workloads in which "power" might provide more performance than
"performance"? If so, don't use these terms.

--
Matthew Garrett | [email protected]

2012-08-15 16:23:36

by Matthew Garrett

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:

> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.

Please, really, don't do that. Pushing power policy decisions into
multiple bits of the stack makes things much more awkward, especially
when the criteria you're describing are about the least interesting
reasons for switching these states. They're most relevant on
multi-socket systems, and the overwhelming power concern there is
rack-level overcommit or cooling. You're going to need an external
policy agent to handle the majority of cases people actually care about.

--
Matthew Garrett | [email protected]

2012-08-15 16:34:11

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
> > It bases on the following assumption:
> > 1, If there are many task crowd in system, just let few domain cpus
> > running and let other cpus idle can not save power. Let all cpu take the
> > load, finish tasks early, and then get into idle. will save more power
> > and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.

This is affected by Intel's implementation - if there's a single active
core in the system then you can't put *any* package into the deep
package C states, and that means you don't get into memory self refresh.
It's a pretty big difference. But this isn't inherently true, and I
suspect that any implementation is going to have to handle scenarios
where the behaviour of one package doesn't influence the behaviour of
another package.

Long term we probably also need to consider whether migrating pages
between nodes is worth it. That's going to be especially important as
systems start implementing ACPI 5's memory power management, which
effectively lets us cut power to all the RAM attached to a node.

--
Matthew Garrett | [email protected]

2012-08-15 17:59:47

by Arjan van de Ven

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/15/2012 8:04 AM, Peter Zijlstra wrote:
> This all sounds far too complicated.. we're talking about simple
> spreading and packing balancers without deep arch knowledge and knobs,
> we couldn't possibly evaluate anything like that.
>
> I was really more thinking of something useful for the laptops out
> there, when they pull the power cord it makes sense to try and keep CPUs
> asleep until the one that's awake is saturated.

as long as you don't do that on machines with an Intel CPU.. since that'd be
the worst case behavior for tasks that run for more than 100 usec.
(e.g. not interrupts, but almost everything else)



2012-08-15 18:02:22

by Arjan van de Ven

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/15/2012 9:34 AM, Matthew Garrett wrote:
> On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>
>> I'm not sure this is a valid assumption. I've had it explained to me by
>> various people that race-to-idle isn't always the best thing. It has to
>> do with the cost of switching power states and the duration of execution
>> and other such things.
>
> This is affected by Intel's implementation - if there's a single active

not just Intel.. also AMD.
Basically everyone who has the memory controller in the cpu package will end up with
a restriction very similar to this.

(this is because the exit-from-self-refresh latency is pretty high.. at least in DDR2/3)

2012-08-15 22:58:31

by Rakib Mullick

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, Aug 15, 2012 at 8:55 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-08-15 at 20:24 +0600, Rakib Mullick wrote:
>> How do you plan to test this power saving scheme? Using powertop? Or,
>> is there any other tools?
>
> We should start out simple enough that we can validate it by looking at
> task placement by hand, eg. 4 tasks on a dual socket quad-core, should
> only keep one socket awake.
>
Yeah, "task placement" is what we can do; that's what the scheduler is
known for :).

> We can also add an power aware evaluator to Linsched (another one of
> those things that needs getting sorted).
>
> And yeah, someone running with a power meter is of course king.
>
> _BUT_ we shouldn't go off the wall with power meters as that very
> quickly gets very specific to the system being measured.
>
> We should really keep this thing as simple as possible while still
> providing some benefit for all various architectures without tons of per
> arch knobs and knowhow.

Perhaps this is because there's no well sorted specification from the
various architectures on how to properly deal with power saving from the
scheduler's POV. Actually, I wasn't really sure whether there's any
documentation of how the scheduler should work to save power.

Thanks,
Rakib

2012-08-16 01:14:51

by Rik van Riel

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/15/2012 10:43 AM, Arjan van de Ven wrote:

> The easy cop-out is provide the sysadmin a slider.
> The slightly less easy one is to (and we're taking this approach
> in the new P state code we're working on) say "in the default
> setting, we're going to sacrifice up to 5% performance from peak
> to give you the best power savings within that performance loss budget"
> (with a slider that can give you 0%, 2 1/2% 5% and 10%)

On a related note, I am looking at the c-state menu governor.

We seem to have issues there, with Linux often going into a much
deeper C state than warranted, which can lead to a fairly steep
performance penalty for some workloads.

One of the issues we identified is that detect_repeating_patterns
would deal quite poorly with patterns that have a short pause,
followed by a long pause, followed by another short pause. For
example, pinging a virtual machine :)

The idea Matthew and I have is simply planning for a shorter
sleep period (discarding the outliers to the high end in the
function once known as detect_repeating_patterns), and going
to a deeper C state if we have significantly overslept.

The new estimation code is easy, but for the past days I have
been looking through the timer code to figure out how such a
timer could fire, and how we could recognize it without it
looking like a normal wakeup, if we do not end up accidentally
waking up another CPU, etc...

I guess I should probably post what I have and kick off a
debate between people who know that code much better than I do :)

--
All rights reversed

2012-08-16 01:17:57

by Arjan van de Ven

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/15/2012 6:14 PM, Rik van Riel wrote:
> On 08/15/2012 10:43 AM, Arjan van de Ven wrote:
>
>> The easy cop-out is provide the sysadmin a slider.
>> The slightly less easy one is to (and we're taking this approach
>> in the new P state code we're working on) say "in the default
>> setting, we're going to sacrifice up to 5% performance from peak
>> to give you the best power savings within that performance loss budget"
>> (with a slider that can give you 0%, 2 1/2% 5% and 10%)
>
> On a related note, I am looking at the c-state menu governor.
>
> We seem to have issues there, with Linux often going into a much
> deeper C state than warranted, which can lead to a fairly steep
> performance penalty for some workloads.
>

predicting the future is hard.
if you pick a too deep C state, you get a certain fixed performance hit;
if you pick a too shallow C state, you get a pretty large power hit (depending on how long you actually stay idle).
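
To make the trade-off concrete, a toy model (the state table values below
are invented, not any real CPU's numbers): pick the deepest C state whose
target residency still fits the predicted idle time, so a wrong prediction
costs either extra power (too shallow) or extra exit latency (too deep).

#include <stdio.h>

/* Toy C-state table, not real hardware numbers.  target_residency_us is
 * the minimum idle time for which entering the state saves energy. */
struct cstate {
        const char *name;
        unsigned long exit_latency_us;
        unsigned long target_residency_us;
};

static const struct cstate states[] = {
        { "C1", 2, 4 },
        { "C3", 50, 150 },
        { "C6", 150, 500 },
};

/* Pick the deepest state whose target residency fits the predicted idle
 * duration; fall back to the shallowest state otherwise. */
static const struct cstate *pick_cstate(unsigned long predicted_idle_us)
{
        const struct cstate *best = &states[0];
        unsigned long i;

        for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
                if (states[i].target_residency_us <= predicted_idle_us)
                        best = &states[i];
        return best;
}

int main(void)
{
        unsigned long guesses[] = { 10, 200, 2000 };
        unsigned long i;

        for (i = 0; i < 3; i++)
                printf("predicted %lu us -> %s\n",
                       guesses[i], pick_cstate(guesses[i])->name);
        return 0;
}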

we'd also need to know hw details; at least on Intel a bunch of things are done
by the firmware, and on some platforms we're not doing the right things as Linux
(or BIOS)

2012-08-16 01:21:22

by Arjan van de Ven

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/15/2012 6:14 PM, Rik van Riel wrote:
>
> The idea Matthew and I have is simply planning for a shorter
> sleep period (discarding the outliers to the high end in the
> function once known as detect_repeating_patterns), and going
> to a deeper C state if we have significantly overslept.
>
> The new estimation code is easy, but for the past days I have
> been looking through the timer code to figure out how such a
> timer could fire, and how we could recognize it without it
> looking like a normal wakeup, if we do not end up accidentally
> waking up another CPU, etc...

this sort of code we recently developed already in house;
I'm surprised it hasn't been posted to lkml yet

2012-08-16 03:08:00

by Alex Shi

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


Thanks very much for your detailed review and comments!

On 08/15/2012 07:05 PM, Peter Zijlstra wrote:

> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.


OK, I will keep that in mind. Thanks!

>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.


Sorry.
The assumption is that power domains can be matched to the current SDs.

So the 'pack' power scheme can simply minimise the active power domains
by minimising the active scheduler domains.

>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)


Sure. let's build up the performance/power first. :)

>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack


Thanks!

>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>> update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>> if (sd->nr_running > sd's capacity) {
>> //power saving policy is not suitable for
>> //this scenario, it runs like performance policy
>> mv tasks from busiest cpu in busiest group to
>> idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.


Sorry, I missed this and couldn't find the details on lkml or google. Could
anyone give the related URL, if convenient?

>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)


Agreed, that is the better solution; I will study Paul's post. :)

>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.


Sure. The least busy non-idle group is the better target.

>
>> } else {// the sd has enough capacity to hold all tasks.
>> if (sg->nr_running > sg's capacity) {
>> //imbalanced between groups
>> if (schedule policy == performance) {
>> //when 2 busiest group at same busy
>> //degree, need to prefer the one has
>> // softest group??
>> move tasks from busiest group to
>> idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.


Sure, it's better to keep the current state as performance. The little
difference here is that it tries to find a more suitable balance cpu in the
idlest group instead of the usual this_cpu.

But maybe the current solution is better.

>
>> } else if (schedule policy == power)
>> move tasks from busiest group to
>> idlest group until busiest is just full
>> of capacity.
>> //the busiest group can balance
>> //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.


Thanks for the reminder.
And I agree with clearing it all; painting on white paper is more of a pleasure. :)

>
>
>> } else {
>> //all groups has enough capacity for its tasks.
>> if (schedule policy == performance)
>> //all tasks may has enough cpu
>> //resources to run,
>> //mv tasks from busiest to idlest group?
>> //no, at this time, it's better to keep
>> //the task on current cpu.
>> //so, it is maybe better to do balance
>> //in each of groups
>> for_each_imbalance_groups()
>> move tasks from busiest cpu to
>> idlest cpu in each of groups;
>> else if (schedule policy == power) {
>> if (no hard pin in idlest group)
>> mv tasks from idlest group to
>> busiest until busiest full.
>> else
>> mv unpin tasks to the biggest
>> hard pin group.
>> }
>> }
>> }
>> }
>
> OK, so you only start to group later.. I think we can do better than
> that.


Would you like to share more detailed ideas here?

>
>>
>> sub proposal:
>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>> cpu'. If so, it may can reduce one more time balancing.
>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>> 2, se or task load is good for running time setting.
>> but it should the second basis in load balancing. The first basis of LB
>> is running tasks' number in group/cpu. Since whatever of the weight of
>> groups is, if the tasks number is less than cpu number, the group is
>> still has capacity to take more tasks. (will consider the SMT cpu power
>> or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.


Agree with you.

>
> What you want it to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJTs per-task runnable avg see if placing the
> new task on will exceed the utilization limit.


Thanks for the reminder!

>
> I think some of the Linaro people actually played around with this,
> Vincent?
>
>> unsolved issues:
>> 1, like current scheduler, it didn't handled cpu affinity well in
>> load_balance.
>
> cpu affinity is always 'fun'.. while there's still a few fun sites in
> the current load-balancer we do better than we did a while ago.
>
>> 2, task group that isn't consider well in this rough proposal.
>
> You mean the cgroup mess?


Yes.

>
>> It isn't consider well and may has mistaken . So just share my ideas and
>> hope it become better and workable in your comments and discussion.
>
> Very simplistically the current scheme is a 'spread' the load scheme
> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
> cache and cpu power.
>
> The power scheme should be a 'pack' scheme, where we minimize the active
> power domains.
>
> One way to implement this is to keep track of an active and
> under-utilized power domain (the target) and fail the regular (pull)
> load-balance for all cpus not in that domain. For the cpu that are in
> that domain we'll have find_busiest select from all other under-utilized
> domains pulling tasks to fill our target, once full, we pick a new
> target, goto 1.


Thanks for re-clarifying! That is also what this proposal wants to do.

And as to the select_task_rq_fair part, here is the rough idea corrected:

select_task_rq_fair()
{
        int powersaving = 0;

        for_each_domain(cpu, tmp) {
                if (policy == power && tmp_has_capacity &&
                    tmp->flags & sd_flag) {
                        sd = tmp;
                        //semi-idle domain is suitable for power scheme
                        powersaving = 1;
                        break;
                }
        }

        ...

        while (sd) {
                ...
                if (policy == power && powersaving == 1)
                        find_busiest_and_capable_group();
                else
                        find_idlest_group();

                if (!group) {
                        sd = sd->child;
                        continue;
                }
                ...
        }
}

>
>

2012-08-16 03:10:11

by Alex Shi

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/15/2012 09:15 PM, Borislav Petkov wrote:

> On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>> Since there is no power saving consideration in scheduler CFS, I has a
>>> very rough idea for enabling a new power saving schema in CFS.
>>
>> Adding Thomas, he always delights poking holes in power schemes.
>>
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>
>> I'm not sure this is a valid assumption. I've had it explained to me by
>> various people that race-to-idle isn't always the best thing. It has to
>> do with the cost of switching power states and the duration of execution
>> and other such things.
>
> I think what he means here is that we might want to let all cores on
> the node (i.e., domain) finish and then power down the whole node which
> should bring much more power savings than letting a subset of the cores
> idle. Alex?


Yes, that is my assumption. If my memory serves me well, the idea came
from Suresh when he introduced the old power saving scheme.

>
> [ … ]
>
>> So I'd leave the currently implemented scheme as performance, and I
>> don't think the above describes the current state.
>>
>>> } else if (schedule policy == power)
>>> move tasks from busiest group to
>>> idlest group until busiest is just full
>>> of capacity.
>>> //the busiest group can balance
>>> //internally after next time LB,
>>
>> There's another thing we need to do, and that is collect tasks in a
>> minimal amount of power domains.
>
> Yep.
>
> Btw, what heuristic would tell here when a domain overflows and another
> needs to get woken? Combined load of the whole domain?
>
> And if I absolutely positively don't want a node to wake up, do I
> hotplug its cores off or are we going to have a way to tell the
> scheduler to overcommit the non-idle domains and spread the tasks only
> among them.


You are right; using the least loaded non-idle group here is better than
using the idlest.

>
> I'm thinking of short bursts here where it would be probably beneficial
> to let the tasks rather wait runnable for a while then wake up the next
> node and waste power...


True. Maybe that is the reason Peter mentioned '2*capacity'?

>
> Thanks.
>

2012-08-16 03:22:31

by Alex Shi

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/15/2012 10:43 PM, Peter Zijlstra wrote:

> On Wed, 2012-08-15 at 15:15 +0200, Borislav Petkov wrote:
>> On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
>>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>>> Since there is no power saving consideration in scheduler CFS, I has a
>>>> very rough idea for enabling a new power saving schema in CFS.
>>>
>>> Adding Thomas, he always delights poking holes in power schemes.
>>>
>>>> It bases on the following assumption:
>>>> 1, If there are many task crowd in system, just let few domain cpus
>>>> running and let other cpus idle can not save power. Let all cpu take the
>>>> load, finish tasks early, and then get into idle. will save more power
>>>> and have better user experience.
>>>
>>> I'm not sure this is a valid assumption. I've had it explained to me by
>>> various people that race-to-idle isn't always the best thing. It has to
>>> do with the cost of switching power states and the duration of execution
>>> and other such things.
>>
>> I think what he means here is that we might want to let all cores on
>> the node (i.e., domain) finish and then power down the whole node which
>> should bring much more power savings than letting a subset of the cores
>> idle. Alex?
>
> Sure we can do that.
>
>>> So I'd leave the currently implemented scheme as performance, and I
>>> don't think the above describes the current state.
>>>
>>>> } else if (schedule policy == power)
>>>> move tasks from busiest group to
>>>> idlest group until busiest is just full
>>>> of capacity.
>>>> //the busiest group can balance
>>>> //internally after next time LB,
>>>
>>> There's another thing we need to do, and that is collect tasks in a
>>> minimal amount of power domains.
>>
>> Yep.
>>
>> Btw, what heuristic would tell here when a domain overflows and another
>> needs to get woken? Combined load of the whole domain?
>>
>> And if I absolutely positively don't want a node to wake up, do I
>> hotplug its cores off or are we going to have a way to tell the
>> scheduler to overcommit the non-idle domains and spread the tasks only
>> among them.
>>
>> I'm thinking of short bursts here where it would be probably beneficial
>> to let the tasks rather wait runnable for a while then wake up the next
>> node and waste power...
>
> I was thinking of a utilization measure made of per-task weighted
> runnable averages. This should indeed cover that case and we'll overflow
> when on average there is no (significant) idle time over a period longer
> than the averaging period.


It's also a good idea. :)

>
> Anyway, I'm not too set on this and I'm very sure we can tweak this ad
> infinitum, so starting with something relatively simple that works for
> most is preferred.
>
> As already stated, I think some of the Linaro people actually played
> around with something like this based on PJTs patches.


Vincent, would you like to introduce more?

2012-08-16 04:57:56

by Alex Shi

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/15/2012 10:24 PM, Rakib Mullick wrote:

> On 8/13/12, Alex Shi <[email protected]> wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>>
> This assumption indirectly point towards the scheme when performance
> is enabled, isn't it? Cause you're trying to spread the load equally
> amongst all the CPUs.


It is.

>
>>
>> select_task_rq_fair()
>> {

int powersaving = 0;

>> for_each_domain(cpu, tmp) {
>> if (policy == power && tmp_has_capacity &&
>> tmp->flags & sd_flag) {
>> sd = tmp;
>> //It is fine to got cpu in the domain

powersaving = 1;

>> break;
>> }
>> }
>>
>> while(sd) {
if (policy == power && powersaving == 1)
>> find_busiest_and_capable_group()
>
> I'm not sure what find_busiest_and_capable_group() would really be, it
> seems it'll find the busiest and capable group, but isn't it a
> conflict with the first assumption you proposed on your proposal?


This pseudo code missed a 'power saving workable' flag; adding it into
the above code should resolve your concern.

>
> Thanks,
> Rakib.

2012-08-16 05:03:45

by Alex Shi

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/16/2012 12:19 AM, Matthew Garrett wrote:

> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Are there workloads in which "power" might provide more performance than
> "performance"? If so, don't use these terms.
>


The power scheme should have no chance of better performance, by design.

2012-08-16 05:26:31

by Alex Shi

Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/15/2012 10:55 PM, Peter Zijlstra wrote:

> On Wed, 2012-08-15 at 20:24 +0600, Rakib Mullick wrote:
>> How do you plan to test this power saving scheme? Using powertop? Or,
>> is there any other tools?
>
> We should start out simple enough that we can validate it by looking at
> task placement by hand, eg. 4 tasks on a dual socket quad-core, should
> only keep one socket awake.
>
> We can also add an power aware evaluator to Linsched (another one of
> those things that needs getting sorted).
>
> And yeah, someone running with a power meter is of course king.
>
> _BUT_ we shouldn't go off the wall with power meters as that very
> quickly gets very specific to the system being measured.
>
> We should really keep this thing as simple as possible while still
> providing some benefit for all various architectures without tons of per
> arch knobs and knowhow.


Definitely agree with all of that!

2012-08-16 05:31:49

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Thu, Aug 16, 2012 at 01:03:32PM +0800, Alex Shi wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
> > Are there workloads in which "power" might provide more performance than
> > "performance"? If so, don't use these terms.
>
> Power scheme should no chance has better performance in design.

Power will tend to concentrate processes on packages, while performance
will tend to split them across packages? What if two cooperating
processes gain from being on the same package and sharing cache
locality?

--
Matthew Garrett | [email protected]

2012-08-16 05:39:49

by Alex Shi

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/16/2012 01:31 PM, Matthew Garrett wrote:

> On Thu, Aug 16, 2012 at 01:03:32PM +0800, Alex Shi wrote:
>> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>>> Are there workloads in which "power" might provide more performance than
>>> "performance"? If so, don't use these terms.
>>
>> Power scheme should no chance has better performance in design.
>
> Power will tend to concentrate processes on packages,


yes.

> while performance
> will tend to split them across packages?


No, there is still a balancing idea in this rough proposal. If a domain
is not overloaded, it is better to leave the old tasks unchanged. I should say
the current scheduler is the 'performance'-oriented scheme.

> What if two cooperating
> processes gain from being on the same package and sharing cache
> locality?
>

2012-08-16 05:45:55

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Thu, Aug 16, 2012 at 01:39:36PM +0800, Alex Shi wrote:
> On 08/16/2012 01:31 PM, Matthew Garrett wrote:
> > will tend to split them across packages?
>
>
> No, there is still has balance idea in this rough proposal. If a domain
> is not overload, it is better to left old tasks unchanged. I should say,
> current scheduler is the 'performance' trend scheme.

The current process isn't necessarily ideal for all workloads - that's
one of the reasons for letting userspace modify process affinity. I
agree that the "performance" mode will tend to provide better
performance than the "power" mode for an arbitrary workload, but if
there are workloads that would perform better in "power" then it's a
poor naming scheme.

--
Matthew Garrett | [email protected]

2012-08-16 06:42:46

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


Hi everyone,

From what I have understood so far, I will try to summarise the pinpointed
differences between the performance and power policies as found
relevant to the scheduler load-balancing mechanism. Any thoughts?

*Performance policy*:

Q1. Who triggers load_balance?
Load balance is triggered when a cpu is found to be idle. (Pull mechanism)

Q2. How is load_balance handled?
When triggered, the cpu looks to pull load from within its sched domain.
First the sched groups in the domain the cpu belongs to are queried,
followed by the runqueues in the busiest group; then the tasks are moved.

This course of action is found analogous to the performance policy because:

1. First, the idle cpu initiates the pull action.
2. The busiest cpu hands over the load to this cpu. A person who can
handle any work is asking who cannot handle more work.

*Power policy*:

So how is power policy different? As Peter says, 'pack more than spread
more'.

Q1. Who triggers load balance?
It is the cpu which cannot handle more work. The idle cpu is left to
remain idle. (Push mechanism)

Q2. How is load_balance handled?
First the least busy runqueue within the sched_group that the busy cpu
belongs to is queried. If none exists, i.e. all the runqueues are equally
busy, then move on to the other sched groups.

Here again the 'least busy' policy should be applied, first at the
group level and then at the runqueue level.

This course of action is found analogous to the power policy because,
as much as possible, busy and capable cpus within a small range try to
handle the existing load.
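
A minimal, user-space sketch of this push/pack target selection, using the
"least busy non-idle" idea from elsewhere in the thread; all runqueue
lengths and names below are invented, and none of it is kernel code:

#include <stdio.h>

#define NR_GROUPS       2
#define CPUS_PER_GROUP  4

int main(void)
{
    /* nr_running per cpu; group 0 is the overloaded cpu's own group */
    int rq[NR_GROUPS][CPUS_PER_GROUP] = {
        { 5, 3, 2, 0 },     /* own group: cpu 3 is idle */
        { 1, 0, 0, 0 },     /* remote group */
    };
    int g, c, best_cpu = -1, best_load = 1 << 30;

    for (g = 0; g < NR_GROUPS; g++) {
        for (c = 0; c < CPUS_PER_GROUP; c++) {
            if (rq[g][c] == 0)
                continue;   /* leave idle cpus idle */
            if (rq[g][c] < best_load) {
                best_load = rq[g][c];
                best_cpu = g * CPUS_PER_GROUP + c;
            }
        }
        if (best_cpu >= 0)
            break;          /* closest group had a non-idle target; stop */
    }

    if (best_cpu < 0)
        printf("everything is idle, nothing to pack onto\n");
    else
        printf("push load towards cpu %d (nr_running %d)\n",
               best_cpu, best_load);
    return 0;
}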

Regards
Preeti

2012-08-16 08:05:53

by Rakib Mullick

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/16/12, Alex Shi <[email protected]> wrote:
> On 08/15/2012 10:24 PM, Rakib Mullick wrote:
>
>> On 8/13/12, Alex Shi <[email protected]> wrote:
>>> Since there is no power saving consideration in scheduler CFS, I has a
>>> very rough idea for enabling a new power saving schema in CFS.
>>>
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>>
>> This assumption indirectly point towards the scheme when performance
>> is enabled, isn't it? Cause you're trying to spread the load equally
>> amongst all the CPUs.
>
>
> It is.
>
Okay, then what would be the default mechanism? Performance or
power saving? Your proposal deals with performance and power saving,
but there should be a default mechanism too; what would that default
mechanism be? Shouldn't performance be the default one, with the
check for it discarded?

>>
>>>
>>> select_task_rq_fair()
>>> {
>
> int powersaving = 0;
>
>>> for_each_domain(cpu, tmp) {
>>> if (policy == power && tmp_has_capacity &&
>>> tmp->flags & sd_flag) {
>>> sd = tmp;
>>> //It is fine to got cpu in the domain
>
> powersaving = 1;
>
>>> break;
>>> }
>>> }
>>>
>>> while(sd) {
> if (policy == power && powersaving == 1)
>>> find_busiest_and_capable_group()
>>
>> I'm not sure what find_busiest_and_capable_group() would really be, it
>> seems it'll find the busiest and capable group, but isn't it a
>> conflict with the first assumption you proposed on your proposal?
>
>
> This pseudo code missed a power saving workable flag , adding it into
> above code should solved your concern.
>
I think I should take a look at this one when it is prepared as an RFC.

Thanks,
Rakib.

2012-08-16 09:58:58

by Alex Shi

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/16/2012 02:53 PM, preeti wrote:

>
> Hi everyone,
>
> From what I have understood so far,I try to summarise pin pointed
> differences between the performance and power policies as found
> relevant to the scheduler-load balancing mechanism.Any thoughts?


Currently, the load_balance trigger is called from the timer, for either
the periodic tick or the dynamic (nohz) tick.
In the periodic tick the cpu is already awake, so doing load_balance does
not cost much. But for the dynamic tick, we had better check whether the
scenario suits the power policy in nohz_kick_needed(), and then do
nohz_balancer_kick on the least loaded but non-idle cpu if possible; that
reduces the chance of waking an idle cpu.
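
A minimal sketch of that target choice, with invented loads and without any
of the real nohz_kick_needed()/nohz_balancer_kick() machinery:

#include <stdio.h>

#define NR_CPUS 4

int main(void)
{
    int load[NR_CPUS] = { 7, 2, 0, 0 };   /* 0 == idle (nohz) cpu */
    int cpu, target = -1, best = 1 << 30;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (load[cpu] == 0)
            continue;               /* don't wake an idle cpu for the kick */
        if (load[cpu] < best) {
            best = load[cpu];
            target = cpu;
        }
    }

    if (target >= 0)
        printf("kick cpu %d (already awake, load %d)\n", target, best);
    else
        printf("all cpus idle: use the normal idle balancer path\n");
    return 0;
}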

Any comments?

2012-08-16 12:46:05

by Santosh Shilimkar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Thu, Aug 16, 2012 at 12:23 PM, preeti <[email protected]> wrote:
>
> Hi everyone,
>
> From what I have understood so far,I try to summarise pin pointed
> differences between the performance and power policies as found
> relevant to the scheduler-load balancing mechanism.Any thoughts?
>
> *Performance policy*:
>
> Q1.Who triggers load_balance?
> Load balance is triggered when a cpu is found to be idle.(Pull mechanism)
>
> Q2.How is load_balance handled?
> When triggered,the load is looked to be pulled from its sched domain.
> First the sched groups in the domain the cpu belongs to is queried
> followed by the runqueues in the busiest group.then the tasks are moved.
>
> This course of action is found analogous to the performance policy because:
>
> 1.First the idle cpu initiates the pull action
> 2.The busiest cpu hands over the load to this cpu.A person who can
> handle any work is querying as to who cannot handle more work.
>
> *Power policy*:
>
> So how is power policy different? As Peter says,'pack more than spread
> more'.
>
> Q1.Who triggers load balance?
> It is the cpu which cannot handle more work.Idle cpu is left to remain
> idle.(Push mechanism)
>
> Q2.How is load_balance handled?
> First the least busy runqueue,from within the sched_group that the busy
> cpu belongs to is queried.if none exist,ie all the runqueues are equally
> busy then move on to the other sched groups.
>
> Here again the 'least busy' policy should be applied,first at
> group level then at the runqueue level.
>
> This course of action is found analogous to the power policy because as
> much as possible busy and capable cpus within a small range try to
> handle the existing load.
>
Not to complicate the power policy scheme, but always *packing* may
not be the best approach for all CPU packages. As mentioned, packing
ensures that the least number of power domains are in use and effectively
reduces the active power consumption on paper, but there are a few
considerations which might conflict with this assumption.

-- Many architectures get the best power saving when the entire CPU cluster
or SD is idle. Intel folks already mentioned this and also extended this
concept to the memory attached to the CPU domain, from the self-refresh point
of view. This is true for CPUs which have very little active leakage, and
hence "race to idle" would be better so that the cluster can hit the deeper
C-state to save more power.

-- Spreading vs packing can actually be made OPP (CPU operating point)
dependent. Some of the mobile workload and power numbers measured in
the past have shown that when the CPU operates at a lower OPP (considering
the load is less), packing is the best option because it gives the cluster
a higher opportunity to idle, whereas while operating at a higher operating
point (assuming higher CPU load and possibly more threads), a spread with
race to idle in mind might be beneficial (see the sketch after this list).
Of course this is going to be a bit messy since CPUFreq and the scheduler
need to be linked.

-- Maybe this is already possible, but for architectures like big.LITTLE
the power consumption and active leakage can be significantly different
across the big and little CPU packages.
This means the big CPU cluster or SD might be more power efficient
with packing whereas the little CPU cluster would be power efficient
with spreading. Hence the possible need for per-SD configurability.
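
A toy sketch of the OPP-dependent choice from the second point above; the
frequencies and the crossover are invented, and nothing here is CPUFreq or
scheduler code:

#include <stdio.h>

static const char *mode_for_opp(unsigned int cur_khz, unsigned int max_khz)
{
    /* crossover at half the top frequency, purely for illustration */
    return cur_khz < max_khz / 2 ? "pack" : "spread";
}

int main(void)
{
    unsigned int max_khz = 1200000;

    printf(" 300 MHz -> %s\n", mode_for_opp(300000, max_khz));
    printf("1200 MHz -> %s\n", mode_for_opp(1200000, max_khz));
    return 0;
}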

Of course all of this can be done step by step, starting with the most
simple power policy as stated by Peter.

Regards
Santosh

2012-08-16 13:57:48

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/15/2012 10:03 PM, Alex Shi wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which "power" might provide more performance than
>> "performance"? If so, don't use these terms.
>>
>
>
> Power scheme should no chance has better performance in design.

ehm.....

so in reality, the very first thing that helps power is to run software efficiently.

anything else is completely secondary.

if placement policy leads to a placement that's different from the most efficient placement,
you're already burning extra power...

2012-08-16 14:01:32

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

> *Power policy*:
>
> So how is power policy different? As Peter says,'pack more than spread
> more'.

this is ... a dubiously general statement.

for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.

the only thing you do not want to do, is wake cpus up for
tasks that only run extremely briefly (think "100 usec" or less).

so maybe the balance interval is slightly different, or more, you don't balance tasks that
historically ran only for brief periods

2012-08-16 14:32:07

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

Hi all,

On Wed, Aug 15, 2012 at 12:05:38PM +0100, Peter Zijlstra wrote:
> >
> > sub proposal:
> > 1, If it's possible to balance task on idlest cpu not appointed 'balance
> > cpu'. If so, it may can reduce one more time balancing.
> > The idlest cpu can prefer the new idle cpu; and is the least load cpu;
> > 2, se or task load is good for running time setting.
> > but it should the second basis in load balancing. The first basis of LB
> > is running tasks' number in group/cpu. Since whatever of the weight of
> > groups is, if the tasks number is less than cpu number, the group is
> > still has capacity to take more tasks. (will consider the SMT cpu power
> > or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.
>
> What you want it to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJTs per-task runnable avg see if placing the
> new task on will exceed the utilization limit.
>
> I think some of the Linaro people actually played around with this,
> Vincent?
>

I agree. A better measure of cpu load and task weight than nr_running
and the current task load weight is necessary to do proper task
packing.

I have used PJTs per-task load-tracking for scheduling experiments on
heterogeneous systems and my experience is that it works quite well for
determining the load of a specific task. Something like PJTs work
would be a good starting point for power aware scheduling and better
support for heterogeneous systems.

One of the biggest challenges here for load-balancing is translating
task load from one cpu to another, as the task load is influenced by the
total load of its cpu. So a task that appears to be heavy on an
oversubscribed cpu might not be so heavy after all when it is moved to a
cpu with plenty of cpu time to spare. This issue is likely to be more
pronounced on heterogeneous systems and systems with aggressive frequency
scaling. It might be possible to avoid having to translate load, or it
might not really matter, but I haven't completely convinced myself yet.

My point is that getting the task load right or at least better is a
fundamental requirement for improving power aware scheduling.

Best regards,
Morten

2012-08-16 18:47:26

by Rik van Riel

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/16/2012 10:01 AM, Arjan van de Ven wrote:
>> *Power policy*:
>>
>> So how is power policy different? As Peter says,'pack more than spread
>> more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
>
> the only thing you do not want to do, is wake cpus up for
> tasks that only run extremely briefly (think "100 usec" or less).
>
> so maybe the balance interval is slightly different, or more, you don't balance tasks that
> historically ran only for brief periods

This makes me think that maybe, in addition to tracking
the idle residency time in the c-state governor, we may
also want to track the average run times in the scheduler.

The c-state governor can call the scheduler code before
putting a CPU to sleep, to indicate (1) the wakeup latency
of the CPU, and (2) whether TLB and/or cache get invalidated.

At wakeup time, the scheduler can check whether the CPU
the to-be-woken process ran on is in a deeper sleep state,
and whether the typical run time for the process significantly
exceeds the wakeup latency of the CPU it last ran on.

If the process typically runs for a short interval, and/or
the process's CPU lost its cached state, it may be better
to run the just-woken task on the CPU that is doing the
waking up, instead of on the CPU where it used to run.

Does that make sense?

Am I overlooking any factors?
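
A toy sketch of that wakeup-time check; the latency figure, the 2x margin
and the helper names are invented, and the real inputs would have to come
from the c-state governor and the scheduler's run-time tracking:

#include <stdio.h>
#include <stdbool.h>

struct sleep_state {
    unsigned int exit_latency_us;   /* cost of waking the sleeping cpu */
    bool cache_lost;                /* deep state flushed cache/TLB */
};

static int pick_wake_cpu(unsigned int avg_runtime_us, int prev_cpu,
                         struct sleep_state prev, int waker_cpu)
{
    /* short task, or the old cpu lost its cached state: run it on the
     * cpu doing the wakeup instead of paying to wake prev_cpu */
    if (prev.cache_lost || avg_runtime_us < 2 * prev.exit_latency_us)
        return waker_cpu;
    return prev_cpu;
}

int main(void)
{
    struct sleep_state prev = { .exit_latency_us = 200, .cache_lost = false };

    printf("   80us task -> cpu %d\n", pick_wake_cpu(80, 3, prev, 0));
    printf(" 5000us task -> cpu %d\n", pick_wake_cpu(5000, 3, prev, 0));
    return 0;
}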

2012-08-16 19:21:13

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/16/2012 11:45 AM, Rik van Riel wrote:
>
> The c-state governor can call the scheduler code before
> putting a CPU to sleep, to indicate (1) the wakeup latency
> of the CPU, and (2) whether TLB and/or cache get invalidated.

I don't think (2) is useful really; that basically always happens ;-)

2012-08-17 01:30:07

by Alex Shi

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/16/2012 10:01 PM, Arjan van de Ven wrote:

>> *Power policy*:
>>
>> So how is power policy different? As Peter says,'pack more than spread
>> more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
>
> the only thing you do not want to do, is wake cpus up for
> tasks that only run extremely briefly (think "100 usec" or less).


That's very important and valuable info!
I just want to know how you got this figure: from context-switch cost or
from cache/TLB refill cost?

>
> so maybe the balance interval is slightly different, or more, you don't balance tasks that
> historically ran only for brief periods
>
>

2012-08-17 08:49:12

by Paul Turner

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>> update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>> if (sd->nr_running > sd's capacity) {
>> //power saving policy is not suitable for
>> //this scenario, it runs like performance policy
>> mv tasks from busiest cpu in busiest group to
>> idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)

Yes -- I just got back from Africa this week. It's updated for almost
all the previous comments but I ran out of time before I left to
re-post. I'm just about caught up enough that I should be able to get
this done over the upcoming weekend. Monday at the latest.

>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>> } else {// the sd has enough capacity to hold all tasks.
>> if (sg->nr_running > sg's capacity) {
>> //imbalanced between groups
>> if (schedule policy == performance) {
>> //when 2 busiest group at same busy
>> //degree, need to prefer the one has
>> // softest group??
>> move tasks from busiest group to
>> idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>> } else if (schedule policy == power)
>> move tasks from busiest group to
>> idlest group until busiest is just full
>> of capacity.
>> //the busiest group can balance
>> //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>> } else {
>> //all groups has enough capacity for its tasks.
>> if (schedule policy == performance)
>> //all tasks may has enough cpu
>> //resources to run,
>> //mv tasks from busiest to idlest group?
>> //no, at this time, it's better to keep
>> //the task on current cpu.
>> //so, it is maybe better to do balance
>> //in each of groups
>> for_each_imbalance_groups()
>> move tasks from busiest cpu to
>> idlest cpu in each of groups;
>> else if (schedule policy == power) {
>> if (no hard pin in idlest group)
>> mv tasks from idlest group to
>> busiest until busiest full.
>> else
>> mv unpin tasks to the biggest
>> hard pin group.
>> }
>> }
>> }
>> }
>
> OK, so you only start to group later.. I think we can do better than
> that.
>
>>
>> sub proposal:
>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>> cpu'. If so, it may can reduce one more time balancing.
>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>> 2, se or task load is good for running time setting.
>> but it should the second basis in load balancing. The first basis of LB
>> is running tasks' number in group/cpu. Since whatever of the weight of
>> groups is, if the tasks number is less than cpu number, the group is
>> still has capacity to take more tasks. (will consider the SMT cpu power
>> or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.
>
> What you want it to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJTs per-task runnable avg see if placing the
> new task on will exceed the utilization limit.

Observations of the runnable average also have the nice property that
it quickly converges to 100% when over-scheduled.

Since we also have the usage average for a single task the ratio of
used avg:runnable avg is likely a useful pointwise estimate.
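
A toy illustration of that ratio with made-up numbers; the real signal
would come from the per-entity load tracking series, not from this code:

#include <stdio.h>

int main(void)
{
    unsigned long used_avg = 300;       /* time actually on the cpu */
    unsigned long runnable_avg = 900;   /* time running + waiting for a cpu */
    double ratio = (double)used_avg / (double)runnable_avg;

    /* near 1.0: the task got the cpu whenever it wanted it;
     * well below 1.0: the cpu is over-subscribed and the task's apparent
     * usage understates what it would use on an idle cpu */
    printf("used:runnable = %.2f (%s)\n", ratio,
           ratio < 0.8 ? "cpu looks over-committed" : "cpu lightly loaded");
    return 0;
}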

>
> I think some of the Linaro people actually played around with this,
> Vincent?
>
>> unsolved issues:
>> 1, like current scheduler, it didn't handled cpu affinity well in
>> load_balance.
>
> cpu affinity is always 'fun'.. while there's still a few fun sites in
> the current load-balancer we do better than we did a while ago.
>
>> 2, task group that isn't consider well in this rough proposal.
>
> You mean the cgroup mess?
>
>> It isn't consider well and may has mistaken . So just share my ideas and
>> hope it become better and workable in your comments and discussion.
>
> Very simplistically the current scheme is a 'spread' the load scheme
> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
> cache and cpu power.
>
> The power scheme should be a 'pack' scheme, where we minimize the active
> power domains.
>
> One way to implement this is to keep track of an active and
> under-utilized power domain (the target) and fail the regular (pull)
> load-balance for all cpus not in that domain. For the cpu that are in
> that domain we'll have find_busiest select from all other under-utilized
> domains pulling tasks to fill our target, once full, we pick a new
> target, goto 1.
>
>

2012-08-17 09:00:15

by Paul Turner

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, Aug 15, 2012 at 11:02 AM, Arjan van de Ven
<[email protected]> wrote:
> On 8/15/2012 9:34 AM, Matthew Garrett wrote:
>> On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
>>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>>> It bases on the following assumption:
>>>> 1, If there are many task crowd in system, just let few domain cpus
>>>> running and let other cpus idle can not save power. Let all cpu take the
>>>> load, finish tasks early, and then get into idle. will save more power
>>>> and have better user experience.
>>>
>>> I'm not sure this is a valid assumption. I've had it explained to me by
>>> various people that race-to-idle isn't always the best thing. It has to
>>> do with the cost of switching power states and the duration of execution
>>> and other such things.
>>
>> This is affected by Intel's implementation - if there's a single active
>
> not just intel.. also AMD
> basically everyone who has the memory controller in the cpu package will end up with
> a restriction very similar to this.
>

I think this is circular to discussion previously held on this topic.
This preference is arch specific; we need to reduce the set of inputs
to a sensible, actionable set, and plumb that so that the architecture
and not the scheduler can supply this preference.

That you believe 100-300us is actually the tipping point vs power
migration cost is probably in itself one of the most useful replies
I've seen on this topic in all of the last few rounds of discussion
its been through. It suggests we could actually parameterize this in
a manner similar to wake-up migration cost; with a minimum usage
average for which it's worth spilling to an idle sibling.
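
A minimal sketch of that parameterization; the knob name and the 200us value
are invented, loosely following the 100-300us range quoted earlier in the
thread:

#include <stdio.h>
#include <stdbool.h>

static const unsigned int spill_min_usage_us = 200;   /* assumed arch knob */

static bool worth_waking_sibling(unsigned int usage_avg_us)
{
    /* only wake (spill to) an idle sibling for tasks whose usage average
     * exceeds the break-even run time */
    return usage_avg_us >= spill_min_usage_us;
}

int main(void)
{
    printf(" 50us task: %s\n", worth_waking_sibling(50) ? "spill" : "pack");
    printf("800us task: %s\n", worth_waking_sibling(800) ? "spill" : "pack");
    return 0;
}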

- Paul

> (this is because the exit-from-self-refresh latency is pretty high.. at least in DDR2/3)
>
>

2012-08-17 18:41:22

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
> > *Power policy*:
> >
> > So how is power policy different? As Peter says,'pack more than spread
> > more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.

Is this really true? In a two-socket system I'd have thought the benefit
of keeping socket 1 in package C3 outweighed the cost of keeping socket
0 awake for slightly longer.

--
Matthew Garrett | [email protected]

2012-08-17 18:44:21

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/17/2012 11:41 AM, Matthew Garrett wrote:
> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>> *Power policy*:
>>>
>>> So how is power policy different? As Peter says,'pack more than spread
>>> more'.
>>
>> this is ... a dubiously general statement.
>>
>> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
>
> Is this really true? In a two-socket system I'd have thought the benefit
> of keeping socket 1 in package C3 outweighed the cost of keeping socket
> 0 awake for slightly longer.

not on Intel

you can't enter package c3 either until every one is down.
(e.g. memory controller must stay on etc etc)

2012-08-17 18:47:21

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
> > On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
> >> this is ... a dubiously general statement.
> >>
> >> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
> >
> > Is this really true? In a two-socket system I'd have thought the benefit
> > of keeping socket 1 in package C3 outweighed the cost of keeping socket
> > 0 awake for slightly longer.
>
> not on Intel
>
> you can't enter package c3 either until every one is down.
> (e.g. memory controller must stay on etc etc)

I thought that was only PC6 - is there any reason why the package cache
can't be entirely powered down?

--
Matthew Garrett | [email protected]

2012-08-17 19:47:35

by Chris Friesen

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/17/2012 12:47 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
>> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
>>> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>>> this is ... a dubiously general statement.
>>>>
>>>> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
>>> Is this really true? In a two-socket system I'd have thought the benefit
>>> of keeping socket 1 in package C3 outweighed the cost of keeping socket
>>> 0 awake for slightly longer.
>> not on Intel
>>
>> you can't enter package c3 either until every one is down.
>> (e.g. memory controller must stay on etc etc)
> I thought that was only PC6 - is there any reason why the package cache
> can't be entirely powered down?

According to
"http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf"
once you're in package C6 then you can go to package C7.

The datasheet for the Xeon E5 (my variant at least) says it doesn't do
C7 so never powers down the LLC. However, as you said earlier once you
can put the socket into C6 which saves about 30W compared to C1E.

So as far as I can see with this CPU at least you would benefit from
shutting down a whole socket when possible.

Chris

2012-08-17 19:51:04

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
> On 08/17/2012 12:47 PM, Matthew Garrett wrote:

> The datasheet for the Xeon E5 (my variant at least) says it doesn't
> do C7 so never powers down the LLC. However, as you said earlier
> once you can put the socket into C6 which saves about 30W compared
> to C1E.
>
> So as far as I can see with this CPU at least you would benefit from
> shutting down a whole socket when possible.

Having any active cores on the system prevents all packages from going
into PC6 or deeper. What I'm not clear on is whether less deep package C
states are also blocked.

--
Matthew Garrett | [email protected]

2012-08-17 20:17:45

by Chris Friesen

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/17/2012 01:50 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>
>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>> do C7 so never powers down the LLC. However, as you said earlier
>> once you can put the socket into C6 which saves about 30W compared
>> to C1E.
>>
>> So as far as I can see with this CPU at least you would benefit from
>> shutting down a whole socket when possible.
>
> Having any active cores on the system prevents all packages from going
> into PC6 or deeper. What I'm not clear on is whether less deep package C
> states are also blocked.
>

Right, we need the memory controller.

The E5 datasheet is a bit ambiguous, it reads:


A processor enters the package C3 low power state when:
-At least one core is in the C3 state.
-The other cores are in a C3 or lower power state, and the processor
has been granted permission by the platform.


Unfortunately it doesn't specify whether that is the other cores in the
package, or the other cores on the whole system.

Chris

2012-08-18 14:33:20

by Luming Yu

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Sat, Aug 18, 2012 at 4:16 AM, Chris Friesen
<[email protected]> wrote:
> On 08/17/2012 01:50 PM, Matthew Garrett wrote:
>>
>> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>>>
>>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>>
>>
>>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>>> do C7 so never powers down the LLC. However, as you said earlier
>>> once you can put the socket into C6 which saves about 30W compared
>>> to C1E.
>>>
>>> So as far as I can see with this CPU at least you would benefit from
>>> shutting down a whole socket when possible.
>>
>>
>> Having any active cores on the system prevents all packages from going
>> into PC6 or deeper. What I'm not clear on is whether less deep package C
>> states are also blocked.
>>
>
> Right, we need the memory controller.
>
> The E5 datasheet is a bit ambiguous, it reads:
>
>
> A processor enters the package C3 low power state when:
> -At least one core is in the C3 state.
> -The other cores are in a C3 or lower power state, and the processor has
> been granted permission by the platform.
>
>
> Unfortunately it doesn't specify whether that is the other cores in the
> package, or the other cores on the whole system.
>

Hardware limitations are just part of the problem. We could find them
out from various white papers or data sheets, or test them out. To me, the
key problem in terms of power and performance balancing still lies in
the CPU and memory allocation method. For example, on a system where we can
benefit from shutting down a whole socket when possible, if a workload
allocates 50% of the CPU cycles and 50% of the memory bandwidth and space
on a (modern) two socket system, an ideal allocation method (I assume that
is the goal of this discussion) should leave the CPU, cache, memory
controller and memory on one socket (node) completely idle and in the
deepest power saving mode. But obviously, we need to spread as much as
possible across all cores in the other socket (to race to idle). So from
the example above, we see a threshold that we need to reference before
selecting one of two completely different policies: spread or do not
spread... As long as there are hardware limitations, we will always
need a knob like that referenced threshold to adapt to different
hardware in one kernel....

/l

2012-08-18 14:52:54

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/18/2012 7:33 AM, Luming Yu wrote:
> saving mode. But obviously, we need to spread as much as possible
> across all cores in another socket(to race to idle). So from the
> example above, we see a threshold that we need to reference before
> selecting one from two complete different policy: spread or not
> spread... As long as there is hardware limitation, we could always
> need knob like that referenced threshold to adapt on different
> hardware in one kernel....

I think the physics are slightly simpler, if you abstract it one level.

every reasonable system out there has things that can be off if all cores are in the deep power state,
that have to be on if even one of them is alive. On "big core" Intel, that's uncore and memory controller,
on small core (atom/phone) Intel that is the chipset fabric only. On ARM it might be something else. On all of
them it's some clocks, PLLs, voltage regulators etc etc.

not all chips are advanced enough to aggressively turn these things off when they could, but most are nowadays.

so in the abstract, there's a power offset that gets you from 0 to 1. Let's call this P0
there is also a power offset to go from 1 to 2, but that's smaller than 0->1. Let's call this Pc

or rather, 0->1 has the same kind of offset as 1->2 plus some extra offset.. so P0 = Pbase + Pc

there's also an energy cost for waking a cpu up (and letting it go back to sleep afterwards)... call it Ewake

so the abstract question is
you're running a task A on cpu 0
you want to also run a task B, which you estimate to run for time T

it's more energy efficient to wake a 2nd cpu if

Ewake < T * Pbase

(this assumes all cores are the same, you get a more complex formula if that's not the case, where T is even core specific)


there is no hardware policy *switch* in such a formula, only parameters.
If Pbase = 0 (e.g. your hardware has no extra power savings), then the formula very naturally leads to one extreme of the behavior;
if Ewake is very high, then it leads to the other extreme.

The only other variable is the user preference between power and performance balance.. but that's a pure preference, not hardware
specific anymore.
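
To make the break-even concrete, a tiny numeric sketch with invented values
(real ones would have to come from the platform):

#include <stdio.h>

int main(void)
{
    double ewake_uj = 150.0;   /* assumed energy to wake + re-sleep a core, in uJ */
    double pbase_mw = 500.0;   /* assumed power of the shared blocks that must
                                  stay on while any core is awake, in mW */

    /* Ewake < T * Pbase, so waking a second cpu pays off for T > Ewake / Pbase;
     * uJ / mW gives milliseconds, convert to microseconds */
    double t_us = ewake_uj / pbase_mw * 1000.0;

    printf("waking a second core pays off for tasks longer than ~%.0f us\n", t_us);
    return 0;
}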



2012-08-19 10:12:19

by Juri Lelli

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

Hi all,
I can probably add some bits to the discussion, after all I'm preparing
a talk for Plumbers that is strictly related :-). My points are not CFS
related (so feel free to ignore me), but they would probably be
interesting if we talk about power aware scheduling in Linux in general.

On 08/16/2012 04:31 PM, Morten Rasmussen wrote:
> Hi all,
>
> On Wed, Aug 15, 2012 at 12:05:38PM +0100, Peter Zijlstra wrote:
>>>
>>> sub proposal:
>>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>>> cpu'. If so, it may can reduce one more time balancing.
>>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>>> 2, se or task load is good for running time setting.
>>> but it should the second basis in load balancing. The first basis of LB
>>> is running tasks' number in group/cpu. Since whatever of the weight of
>>> groups is, if the tasks number is less than cpu number, the group is
>>> still has capacity to take more tasks. (will consider the SMT cpu power
>>> or other big/little cpu capacity on ARM.)
>>
>> Ah, no we shouldn't balance on nr_running, but on the amount of time
>> consumed. Imagine two tasks being woken at the same time, both tasks
>> will only run a fraction of the available time, you don't want this to
>> exceed your capacity because ran back to back the one cpu will still be
>> mostly idle.
>>
>> What you want it to keep track of a per-cpu utilization level (inverse
>> of idle-time) and using PJTs per-task runnable avg see if placing the
>> new task on will exceed the utilization limit.
>>
>> I think some of the Linaro people actually played around with this,
>> Vincent?
>>
>
> I agree. A better measure of cpu load and task weight than nr_running
> and the current task load weight are necessary to do proper task
> packing.
>
> I have used PJTs per-task load-tracking for scheduling experiments on
> heterogeneous systems and my experience is that it works quite well for
> determining the load of a specific task. Something like PJTs work
> would be a good starting point for power aware scheduling and better
> support for heterogeneous systems.
>

I haven't tried PJT's work myself (it's on my todo list), but with
SCHED_DEADLINE you can see the picture from the other side and, instead
of tracking per-task load, you can enforce that a task does not exceed its
allowed "load".
This is done by reserving some fraction of CPU time (runtime or budget)
every predefined interval of time (period). Then this allocated
bandwidth is enforced with proper scheduling mechanisms (BTW, I have
another talk at Plumbers explaining the SCHED_DEADLINE patchset in more
detail).
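
A minimal sketch of such a reservation, using the sched_setattr() syscall
interface that the SCHED_DEADLINE work eventually settled on; the patchset
under discussion here may expose something different, so treat the interface
details as illustrative (it also needs headers that define SYS_sched_setattr,
a deadline-capable kernel and root to actually succeed):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;     /* budget per period, in ns */
    uint64_t sched_deadline;
    uint64_t sched_period;
};

int main(void)
{
    struct sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_DEADLINE;
    attr.sched_runtime = 10ULL * 1000 * 1000;    /* 10 ms of cpu ...  */
    attr.sched_period = 100ULL * 1000 * 1000;    /* ... every 100 ms  */
    attr.sched_deadline = attr.sched_period;

    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0)
        perror("sched_setattr");
    else
        puts("this task now has a 10% cpu reservation");
    return 0;
}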

> One of the biggest challenges here for load-balancing is translating
> task load from one cpu to another as the task load is influenced by the
> total load of its cpu. So a task that appears to be heavy on an
> oversubscribed cpu might not be so heavy after all when it is moved to a
> cpu with plenty cpu time to spare. This issue is likely to be more
> pronounced on heterogeneous systems and system with aggressive frequency
> scaling. It might be possible to avoid having to translate load or that
> it doesn't really matter, but I haven't completely convinced myself yet.
>

This is probably a key point where deadline scheduling could be helpful.
A task's load in this case cannot be influenced by other tasks in the
system and it is one of the known variables. Actually, this is however
only half true. Isolation is achieved only considering CPU time between
concurrently executing tasks; other terms like cache interference etc.
cannot be controlled. The nice fact is that a misbehaving task, one that
tries or experiments with deviations from its allowed CPU fraction, is
throttled and cannot influence other tasks' behavior.
As I will show during my talk (power aware deadline scheduling), other
techniques are required when a task's execution time is not strictly
known beforehand, be it due to interference or to intrinsic
variability in the performed activity. They fall in the domain of
adaptive/feedback scheduling.

> My point is that getting the task load right or at least better is a
> fundamental requirement for improving power aware scheduling.
>

Fully agree :-).

As I said, I just wanted to add something; sorry if I misinterpreted the
purpose of this discussion.

Best Regards,

- Juri Lelli

2012-08-20 08:06:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


* Arjan van de Ven <[email protected]> wrote:

> On 8/15/2012 8:04 AM, Peter Zijlstra wrote:
>
> > This all sounds far too complicated.. we're talking about
> > simple spreading and packing balancers without deep arch
> > knowledge and knobs, we couldn't possibly evaluate anything
> > like that.
> >
> > I was really more thinking of something useful for the
> > laptops out there, when they pull the power cord it makes
> > sense to try and keep CPUs asleep until the one that's awake
> > is saturated.

s/CPU/core ?

> as long as you don't do that on machines with an Intel CPU..
> since that'd be the worst case behavior for tasks that run for
> more than 100 usec. (e.g. not interrupts, but almost
> everything else)

The question is, do we need to balance for 'power saving', on
systems that care more about power use than they care about peak
performance/throughput, at all?

If the answer is 'no' then things get rather simple.

If the answer is 'yes' then there's clear cases where the kernel
(should) automatically know the events where we switch from
balancing for performance to balancing for power:

- the system boots up on battery

- the system was on AC but the cord has been pulled and the
system is now on battery

- the administrator configures the system on AC to be
power-conscious.

( and the opposite direction events wants the scheduler to
switch from 'balancing for power' to 'balancing for
performance'. )

There's also cases where the kernel has insufficient information
from the hardware and from the admin about the preferred
characteristics/policy of the system - a tweakable fallback knob
might be provided for that sad case.

The point is, that knob is not the policy setting and it's not
the main mechanism. It's a fallback.

Thanks,

Ingo

2012-08-20 08:26:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Mon, 2012-08-20 at 10:06 +0200, Ingo Molnar wrote:
> > > I was really more thinking of something useful for the
> > > laptops out there, when they pull the power cord it makes
> > > sense to try and keep CPUs asleep until the one that's awake
> > > is saturated.
>
> s/CPU/core ?

I was thinking logical cpus, but whatever really.

2012-08-20 13:26:28

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/20/2012 1:06 AM, Ingo Molnar wrote:
>
>
> There's also cases where the kernel has insufficient information
> from the hardware and from the admin about the preferred
> characteristics/policy of the system - a tweakable fallback knob
> might be provided for that sad case.
>
> The point is, that knob is not the policy setting and it's not
> the main mechanism. It's a fallback.

if we call the knob "powersave", it better save power...
if we call it "group together" or "spread out".. no problem with that.



2012-08-20 15:36:44

by Vincent Guittot

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 15 August 2012 13:05, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>> update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>> if (sd->nr_running > sd's capacity) {
>> //power saving policy is not suitable for
>> //this scenario, it runs like performance policy
>> mv tasks from busiest cpu in busiest group to
>> idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)
>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>> } else {// the sd has enough capacity to hold all tasks.
>> if (sg->nr_running > sg's capacity) {
>> //imbalanced between groups
>> if (schedule policy == performance) {
>> //when 2 busiest group at same busy
>> //degree, need to prefer the one has
>> // softest group??
>> move tasks from busiest group to
>> idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>> } else if (schedule policy == power)
>> move tasks from busiest group to
>> idlest group until busiest is just full
>> of capacity.
>> //the busiest group can balance
>> //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>> } else {
>> //all groups has enough capacity for its tasks.
>> if (schedule policy == performance)
>> //all tasks may has enough cpu
>> //resources to run,
>> //mv tasks from busiest to idlest group?
>> //no, at this time, it's better to keep
>> //the task on current cpu.
>> //so, it is maybe better to do balance
>> //in each of groups
>> for_each_imbalance_groups()
>> move tasks from busiest cpu to
>> idlest cpu in each of groups;
>> else if (schedule policy == power) {
>> if (no hard pin in idlest group)
>> mv tasks from idlest group to
>> busiest until busiest full.
>> else
>> mv unpin tasks to the biggest
>> hard pin group.
>> }
>> }
>> }
>> }
>
> OK, so you only start to group later.. I think we can do better than
> that.
>
>>
>> sub proposal:
>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>> cpu'. If so, it may can reduce one more time balancing.
>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>> 2, se or task load is good for running time setting.
>> but it should the second basis in load balancing. The first basis of LB
>> is running tasks' number in group/cpu. Since whatever of the weight of
>> groups is, if the tasks number is less than cpu number, the group is
>> still has capacity to take more tasks. (will consider the SMT cpu power
>> or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.
>
> What you want it to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJTs per-task runnable avg see if placing the
> new task on will exceed the utilization limit.
>
> I think some of the Linaro people actually played around with this,
> Vincent?

Sorry for the late reply, but I had almost no network access during the last few weeks.

So Linaro is also working on a power aware scheduler, as Peter mentioned.

Based on previous tests, we have concluded that the main drawback of the
(now removed) old power scheduler was that we had no way to tell the
difference between short and long running tasks, whereas that is a key
input (at least for phones) for deciding to pack tasks and for
selecting the core on an asymmetric system.
One additional key piece of information is the power distribution in the
system, which can have a finer granularity than the current sched_domain
description. Peter's proposal was to use a SHARE_POWERLINE flag,
similar to the flags that already describe whether a sched_domain shares
resources or cpu capacity.

With these two new pieces of information, we can have a first power saving
scheduler which spreads or packs tasks across cores and packages.
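
A toy sketch of how those two inputs could steer the decision; the flag,
the 300us threshold and the helper are invented for the example:

#include <stdio.h>
#include <stdbool.h>

static bool use_remote_cpu(bool shares_powerline, unsigned int runtime_us)
{
    /* a cpu on the same power rail is cheap to use: no new power domain
     * gets switched on, so take it regardless of the task length */
    if (shares_powerline)
        return true;
    /* otherwise only power up another rail for tasks long enough to
     * amortize it */
    return runtime_us >= 300;
}

int main(void)
{
    printf("100us task, other cluster : %s\n",
           use_remote_cpu(false, 100) ? "spread" : "pack");
    printf("100us task, same rail     : %s\n",
           use_remote_cpu(true, 100) ? "spread" : "pack");
    printf(" 10ms task, other cluster : %s\n",
           use_remote_cpu(false, 10000) ? "spread" : "pack");
    return 0;
}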

Vincent
>
>> unsolved issues:
>> 1, like current scheduler, it didn't handled cpu affinity well in
>> load_balance.
>
> cpu affinity is always 'fun'.. while there's still a few fun sites in
> the current load-balancer we do better than we did a while ago.
>
>> 2, task group that isn't consider well in this rough proposal.
>
> You mean the cgroup mess?
>
>> It isn't consider well and may has mistaken . So just share my ideas and
>> hope it become better and workable in your comments and discussion.
>
> Very simplistically the current scheme is a 'spread' the load scheme
> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
> cache and cpu power.
>
> The power scheme should be a 'pack' scheme, where we minimize the active
> power domains.
>
> One way to implement this is to keep track of an active and
> under-utilized power domain (the target) and fail the regular (pull)
> load-balance for all cpus not in that domain. For the cpu that are in
> that domain we'll have find_busiest select from all other under-utilized
> domains pulling tasks to fill our target, once full, we pick a new
> target, goto 1.
>
>

2012-08-20 15:47:34

by Vincent Guittot

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 16 August 2012 07:03, Alex Shi <[email protected]> wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which "power" might provide more performance than
>> "performance"? If so, don't use these terms.
>>
>
>
> Power scheme should no chance has better performance in design.

A side effect of packing small tasks on one core is that you always
use the core with the lowest C-state, which minimizes the wake-up
latency. So you can sometimes get better results than performance mode,
which will try to use another core in another cluster that takes more
time to wake up than simply waiting for the end of the current task.

Vincent

2012-08-20 15:47:54

by Christoph Lameter

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

One issue that is often forgotten is that there are users who want lowest
latency and not highest performance. Our systems sit idle for most of the
time but when a specific event occurs (typically a packet is received)
they must react in the fastest way possible.

On every new generation of hardware and software we keep on running into
various mechanisms that automatically power down when idle for a long time
(to save power...). And it's pretty hard to figure these things out given
the complexity of modern hardware. F.e. for the Sandybridges we found that
the memory channel powers down after 2 milliseconds of idle time, and that
was unaffected by any of the BIOS config options. Similar mechanisms exist
in the kernel but those are easier to discover since there is source.

So please make sure that there are obvious and easy ways to switch this
stuff off, or provide a "low latency" knob that keeps the system from
assuming that idle time means that full performance is not needed.

2012-08-20 15:52:26

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Mon, Aug 20, 2012 at 03:47:54PM +0000, Christoph Lameter wrote:

> So please make sure that there are obvious and easy ways to switch this
> stuff off or provide "low latency" know that keeps the system from
> assuming that idle time means that full performance is not needed.

That seems like an issue for cpuidle, not the scheduler. Does pm_qos not
already do what you want?

--
Matthew Garrett | [email protected]

2012-08-20 15:55:19

by Vincent Guittot

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 17 August 2012 10:43, Paul Turner <[email protected]> wrote:
> On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra <[email protected]> wrote:
>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>> Since there is no power saving consideration in scheduler CFS, I has a
>>> very rough idea for enabling a new power saving schema in CFS.
>>
>> Adding Thomas, he always delights poking holes in power schemes.
>>
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>
>> I'm not sure this is a valid assumption. I've had it explained to me by
>> various people that race-to-idle isn't always the best thing. It has to
>> do with the cost of switching power states and the duration of execution
>> and other such things.
>>
>>> 2, schedule domain, schedule group perfect match the hardware, and
>>> the power consumption unit. So, pull tasks out of a domain means
>>> potentially this power consumption unit idle.
>>
>> I'm not sure I understand what you're saying, sorry.
>>
>>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
>> between the two based on AC/BAT, UPS status and simple things like that.
>> But this seems like a later concern, you have to have something to pick
>> between before you can pick :-)
>>
>>> And in scheduling, 2 place will care the policy, load_balance() and in
>>> task fork/exec: select_task_rq_fair().
>>
>> ack
>>
>>> Here is some pseudo code try to explain the proposal behaviour in
>>> load_balance() and select_task_rq_fair();
>>
>> Oh man.. A few words outlining the general idea would've been nice.
>>
>>> load_balance() {
>>> update_sd_lb_stats(); //get busiest group, idlest group data.
>>>
>>> if (sd->nr_running > sd's capacity) {
>>> //power saving policy is not suitable for
>>> //this scenario, it runs like performance policy
>>> mv tasks from busiest cpu in busiest group to
>>> idlest cpu in idlest group;
>>
>> Once upon a time we talked about adding a factor to the capacity for
>> this. So say you'd allow 2*capacity before overflowing and waking
>> another power group.
>>
>> But I think we should not go on nr_running here, PJTs per-entity load
>> tracking stuff gives us much better measures -- also, repost that series
>> already Paul! :-)
>
> Yes -- I just got back from Africa this week. It's updated for almost
> all the previous comments but I ran out of time before I left to
> re-post. I'm just about caught up enough that I should be able to get
> this done over the upcoming weekend. Monday at the latest.
>
>>
>> Also, I'm not sure this is entirely correct, the thing you want to do
>> for power aware stuff is to minimize the number of active power domains,
>> this means you don't want idlest, you want least busy non-idle.
>>
>>> } else {// the sd has enough capacity to hold all tasks.
>>> if (sg->nr_running > sg's capacity) {
>>> //imbalanced between groups
>>> if (schedule policy == performance) {
>>> //when 2 busiest group at same busy
>>> //degree, need to prefer the one has
>>> // softest group??
>>> move tasks from busiest group to
>>> idletest group;
>>
>> So I'd leave the currently implemented scheme as performance, and I
>> don't think the above describes the current state.
>>
>>> } else if (schedule policy == power)
>>> move tasks from busiest group to
>>> idlest group until busiest is just full
>>> of capacity.
>>> //the busiest group can balance
>>> //internally after next time LB,
>>
>> There's another thing we need to do, and that is collect tasks in a
>> minimal amount of power domains. The old code (that got deleted) did
>> something like that, you can revive some of the that code if needed -- I
>> just killed everything to be able to start with a clean slate.
>>
>>
>>> } else {
>>> //all groups has enough capacity for its tasks.
>>> if (schedule policy == performance)
>>> //all tasks may has enough cpu
>>> //resources to run,
>>> //mv tasks from busiest to idlest group?
>>> //no, at this time, it's better to keep
>>> //the task on current cpu.
>>> //so, it is maybe better to do balance
>>> //in each of groups
>>> for_each_imbalance_groups()
>>> move tasks from busiest cpu to
>>> idlest cpu in each of groups;
>>> else if (schedule policy == power) {
>>> if (no hard pin in idlest group)
>>> mv tasks from idlest group to
>>> busiest until busiest full.
>>> else
>>> mv unpin tasks to the biggest
>>> hard pin group.
>>> }
>>> }
>>> }
>>> }
>>
>> OK, so you only start to group later.. I think we can do better than
>> that.
>>
>>>
>>> sub proposal:
>>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>>> cpu'. If so, it may can reduce one more time balancing.
>>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>>> 2, se or task load is good for running time setting.
>>> but it should the second basis in load balancing. The first basis of LB
>>> is running tasks' number in group/cpu. Since whatever of the weight of
>>> groups is, if the tasks number is less than cpu number, the group is
>>> still has capacity to take more tasks. (will consider the SMT cpu power
>>> or other big/little cpu capacity on ARM.)
>>
>> Ah, no we shouldn't balance on nr_running, but on the amount of time
>> consumed. Imagine two tasks being woken at the same time, both tasks
>> will only run a fraction of the available time, you don't want this to
>> exceed your capacity because ran back to back the one cpu will still be
>> mostly idle.
>>
>> What you want it to keep track of a per-cpu utilization level (inverse
>> of idle-time) and using PJTs per-task runnable avg see if placing the
>> new task on will exceed the utilization limit.
>
> Observations of the runnable average also have the nice property that
> it quickly converges to 100% when over-scheduled.
>
> Since we also have the usage average for a single task the ratio of
> used avg:runnable avg is likely a useful pointwise estimate.

Yes, that's clearly a good input from your per-task load tracking. You
can have a core which is 100% used by several tasks. In one case the
used avg and the runnable avg are quite similar, which means that we
don't wait for the core too much; in the other case the runnable avg
can be at its max value, which means that tasks are waiting for the core
and it's worth using 2 cores in the same cluster.

Vincent
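
To make the used-avg vs. runnable-avg point concrete, here is a tiny
illustrative sketch. The structure and field names are invented for the
example; they are not the ones defined by the per-entity load-tracking
series:

/*
 * Illustration only: when runnable time greatly exceeds used (running)
 * time, tasks are queueing behind each other and the CPU is contended,
 * so packing more work there is a bad idea.
 */
struct cpu_load_sample {
	unsigned long used_avg;		/* time tasks actually ran (scaled)  */
	unsigned long runnable_avg;	/* time tasks ran or waited (scaled) */
};

/* ~0 means tasks rarely wait for the CPU, ~100 means they mostly queue. */
static unsigned int contention_pct(const struct cpu_load_sample *s)
{
	if (!s->runnable_avg)
		return 0;
	return (unsigned int)(100 - (100 * s->used_avg) / s->runnable_avg);
}

When the core is fully used, a value near 0 means the tasks barely wait
for each other; a value near 100 means they mostly queue, which is
exactly the case where spreading onto a second core in the same cluster
pays off.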
>
>>
>> I think some of the Linaro people actually played around with this,
>> Vincent?
>>
>>> unsolved issues:
>>> 1, like current scheduler, it didn't handled cpu affinity well in
>>> load_balance.
>>
>> cpu affinity is always 'fun'.. while there's still a few fun sites in
>> the current load-balancer we do better than we did a while ago.
>>
>>> 2, task group that isn't consider well in this rough proposal.
>>
>> You mean the cgroup mess?
>>
>>> It isn't consider well and may has mistaken . So just share my ideas and
>>> hope it become better and workable in your comments and discussion.
>>
>> Very simplistically the current scheme is a 'spread' the load scheme
>> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
>> cache and cpu power.
>>
>> The power scheme should be a 'pack' scheme, where we minimize the active
>> power domains.
>>
>> One way to implement this is to keep track of an active and
>> under-utilized power domain (the target) and fail the regular (pull)
>> load-balance for all cpus not in that domain. For the cpu that are in
>> that domain we'll have find_busiest select from all other under-utilized
>> domains pulling tasks to fill our target, once full, we pick a new
>> target, goto 1.
>>
>>
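
A rough sketch of the 'pack' scheme described above: keep one
under-utilized target power domain, refuse pull-balancing everywhere
else, and rotate the target once it fills up. Every identifier below is
a hypothetical placeholder invented for illustration, not an existing
kernel symbol:

#include <stdbool.h>

struct power_domain;

extern struct power_domain *pack_target;	/* domain currently being filled */
extern bool domain_is_full(struct power_domain *pd);
extern struct power_domain *pick_next_target(void);	/* next under-utilized domain */
extern struct power_domain *cpu_power_domain(int cpu);

/* Called from the pull side of a (hypothetical) power-aware balancer. */
static bool power_balance_allowed(int this_cpu)
{
	/* CPUs outside the target never pull: other domains stay drained. */
	if (cpu_power_domain(this_cpu) != pack_target)
		return false;

	/* Once the target is full, pick a new under-utilized domain to fill. */
	if (domain_is_full(pack_target))
		pack_target = pick_next_target();

	return cpu_power_domain(this_cpu) == pack_target;
}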

2012-08-20 18:17:10

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:

> If the answer is 'yes' then there's clear cases where the kernel
> (should) automatically know the events where we switch from
> balancing for performance to balancing for power:

No. We can't identify all of these cases and we can't identify corner
cases. Putting this kind of policy in the kernel is an awful idea. It
should never be altering policy itself, because it'll get it wrong and
people will file bugs complaining that it got it wrong and the biggest
case where you *need* to be able to handle switching between performance
and power optimisations (your rack management unit just told you that
you're going to have to drop power consumption by 20W) is one where the
kernel doesn't have all the information it needs to do this. So why
bother at all?

--
Matthew Garrett | [email protected]

by Christoph Lameter

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Mon, 20 Aug 2012, Matthew Garrett wrote:

> On Mon, Aug 20, 2012 at 03:47:54PM +0000, Christoph Lameter wrote:
>
> > So please make sure that there are obvious and easy ways to switch this
> > stuff off or provide "low latency" know that keeps the system from
> > assuming that idle time means that full performance is not needed.
>
> That seems like an issue for cpuidle, not the scheduler. Does pm_qos not
> already do what you want?

Don't know. A simple solution is not to compile power management into the
kernel.

2012-08-21 00:58:29

by Alex Shi

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/20/2012 11:36 PM, Vincent Guittot wrote:

>> > What you want it to keep track of a per-cpu utilization level (inverse
>> > of idle-time) and using PJTs per-task runnable avg see if placing the
>> > new task on will exceed the utilization limit.
>> >
>> > I think some of the Linaro people actually played around with this,
>> > Vincent?
> Sorry for the late reply but I had almost no network access during last weeks.
>
> So Linaro also works on a power aware scheduler as Peter mentioned.
>
> Based on previous tests, we have concluded that main drawback of the
> (now removed) old power scheduler was that we had no way to make
> difference between short and long running tasks whereas it's a key
> input (at least for phone) for deciding to pack tasks and for
> selecting the core on an asymmetric system.


It is hard to estimate the future from a general viewpoint, but as a
hack, maybe you can add something to task_struct to hint at this. :)

> One additional key information is the power distribution in the system
> which can have a finer granularity than current sched_domain
> description. Peter's proposal was to use a SHARE_POWERLINE flag
> similarly to flags that already describe if a sched_domain share
> resources or cpu capacity.


Seems I missed this. What's the difference from the current SD_SHARE_CPUPOWER
and SD_SHARE_PKG_RESOURCES?

>
> With these 2 new information, we can have a 1st power saving scheduler
> which spread or packed tasks across core and package


Fine, I'd like to test them on x86, plus SMT and NUMA :)

>
> Vincent

2012-08-21 01:06:02

by Alex Shi

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/20/2012 11:47 PM, Vincent Guittot wrote:

> On 16 August 2012 07:03, Alex Shi <[email protected]> wrote:
>> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>>
>>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>>
>>>> power aware scheduling), this proposal will adopt the
>>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>>
>>> Are there workloads in which "power" might provide more performance than
>>> "performance"? If so, don't use these terms.
>>>
>>
>>
>> Power scheme should no chance has better performance in design.
>
> A side effect of packing small tasks on one core is that you always
> use the core with the lowest C-state which will minimize the wake up
> latency so you can sometime get better results than performance mode
> which will try to use a other core in another cluster which will take
> more time to wake up that waiting for the end of the current task.
>


Sure. In some scenarios, packing tasks into a smaller domain will bring
a performance benefit.

2012-08-21 09:42:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


* Matthew Garrett <[email protected]> wrote:

> On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:
>
> > If the answer is 'yes' then there's clear cases where the kernel
> > (should) automatically know the events where we switch from
> > balancing for performance to balancing for power:
>
> No. We can't identify all of these cases and we can't identify
> corner cases. [...]

There's no need to identify 'all' of these cases - but if the
kernel knows then it can have intelligent default behavior.

> [...] Putting this kind of policy in the kernel is an awful
> idea. [...]

A modern kernel better know what state the system is in: on
battery or on AC power.

> [...] It should never be altering policy itself, [...]

The kernel/scheduler simply offers sensible defaults where it
can. User-space can augment/modify/override that in any which
way it wishes to.

This stuff has not been properly sorted out in the last 10+
years since we have battery driven devices, so we might as well
start with the kernel offering sane default behavior where it
can ...

> [...] because it'll get it wrong and people will file bugs
> complaining that it got it wrong and the biggest case where
> you *need* to be able to handle switching between performance
> and power optimisations (your rack management unit just told
> you that you're going to have to drop power consumption by
> 20W) is one where the kernel doesn't have all the information
> it needs to do this. So why bother at all?

The point is to have a working default mechanism.

Thanks,

Ingo

2012-08-21 11:05:08

by Vincent Guittot

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 21 August 2012 02:58, Alex Shi <[email protected]> wrote:
> On 08/20/2012 11:36 PM, Vincent Guittot wrote:
>
>>> > What you want it to keep track of a per-cpu utilization level (inverse
>>> > of idle-time) and using PJTs per-task runnable avg see if placing the
>>> > new task on will exceed the utilization limit.
>>> >
>>> > I think some of the Linaro people actually played around with this,
>>> > Vincent?
>> Sorry for the late reply but I had almost no network access during last weeks.
>>
>> So Linaro also works on a power aware scheduler as Peter mentioned.
>>
>> Based on previous tests, we have concluded that main drawback of the
>> (now removed) old power scheduler was that we had no way to make
>> difference between short and long running tasks whereas it's a key
>> input (at least for phone) for deciding to pack tasks and for
>> selecting the core on an asymmetric system.
>
>
> It is hard to estimate future in general view point. but from hack
> point, maybe you can add something to hint this from task_struct. :)
>

The per-task load tracking patch set gives you a good view of the last few dozen ms.

>> One additional key information is the power distribution in the system
>> which can have a finer granularity than current sched_domain
>> description. Peter's proposal was to use a SHARE_POWERLINE flag
>> similarly to flags that already describe if a sched_domain share
>> resources or cpu capacity.
>
>
> Seems I missed this. what's difference with current SD_SHARE_CPUPOWER
> and SD_SHARE_PKG_RESOURCES.

SD_SHARE_CPUPOWER is set in a sched domain at the SMT level (sharing some
part of the physical core).
SD_SHARE_PKG_RESOURCES is set at the MC level (sharing some resources like
cache and memory access).
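
A SHARE_POWERLINE-style flag would sit alongside those two in the
sched_domain flags word. Purely as an illustration of the idea (the flag
value, the trimmed-down structure and the helper below are hypothetical,
not existing kernel code), a packing decision could walk up the domain
hierarchy while the flag is set:

#define SD_SHARE_POWERLINE	0x8000	/* hypothetical: CPUs share one power rail */

struct sched_domain {
	struct sched_domain *parent;
	unsigned int flags;
	/* ... other fields omitted for the sketch ... */
};

/*
 * Find the widest domain whose CPUs still share a power line with @sd:
 * packing tasks inside it avoids switching on additional power rails.
 */
static struct sched_domain *widest_shared_powerline(struct sched_domain *sd)
{
	while (sd->parent && (sd->parent->flags & SD_SHARE_POWERLINE))
		sd = sd->parent;
	return sd;
}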

>
>>
>> With these 2 new information, we can have a 1st power saving scheduler
>> which spread or packed tasks across core and package
>
>
> Fine, I like to test them on X86, plus SMT and NUMA :)
>
>>
>> Vincent
>
>

2012-08-21 11:40:12

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Tue, Aug 21, 2012 at 11:42:04AM +0200, Ingo Molnar wrote:
> * Matthew Garrett <[email protected]> wrote:
> > [...] Putting this kind of policy in the kernel is an awful
> > idea. [...]
>
> A modern kernel better know what state the system is in: on
> battery or on AC power.

That's a fundamentally uninteresting thing for the kernel to know about.
AC/battery is just not an important power management policy input when
compared to various other things.

> > [...] It should never be altering policy itself, [...]
>
> The kernel/scheduler simply offers sensible defaults where it
> can. User-space can augment/modify/override that in any which
> way it wishes to.
>
> This stuff has not been properly sorted out in the last 10+
> years since we have battery driven devices, so we might as well
> start with the kernel offering sane default behavior where it
> can ...

Userspace has been doing a perfectly reasonable job of determining
policy here.

> > [...] because it'll get it wrong and people will file bugs
> > complaining that it got it wrong and the biggest case where
> > you *need* to be able to handle switching between performance
> > and power optimisations (your rack management unit just told
> > you that you're going to have to drop power consumption by
> > 20W) is one where the kernel doesn't have all the information
> > it needs to do this. So why bother at all?
>
> The point is to have a working default mechanism.

Your suggestions aren't a working default mechanism.

--
Matthew Garrett | [email protected]

2012-08-21 15:19:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


* Matthew Garrett <[email protected]> wrote:

> On Tue, Aug 21, 2012 at 11:42:04AM +0200, Ingo Molnar wrote:
> > * Matthew Garrett <[email protected]> wrote:
> > > [...] Putting this kind of policy in the kernel is an awful
> > > idea. [...]
> >
> > A modern kernel better know what state the system is in: on
> > battery or on AC power.
>
> That's a fundamentally uninteresting thing for the kernel to
> know about. [...]

I disagree.

> [...] AC/battery is just not an important power management
> policy input when compared to various other things.

Such as?

The thing is, when I use Linux on a laptop then AC/battery is
*the* main policy input.

> > > [...] It should never be altering policy itself, [...]
> >
> > The kernel/scheduler simply offers sensible defaults where
> > it can. User-space can augment/modify/override that in any
> > which way it wishes to.
> >
> > This stuff has not been properly sorted out in the last 10+
> > years since we have battery driven devices, so we might as
> > well start with the kernel offering sane default behavior
> > where it can ...
>
> Userspace has been doing a perfectly reasonable job of
> determining policy here.

Has it properly switched the scheduler's balancing between
power-efficient and performance-maximizing strategies when, for
example, a laptop's AC got unplugged/replugged?

> > > [...] because it'll get it wrong and people will file bugs
> > > complaining that it got it wrong and the biggest case
> > > where you *need* to be able to handle switching between
> > > performance and power optimisations (your rack management
> > > unit just told you that you're going to have to drop power
> > > consumption by 20W) is one where the kernel doesn't have
> > > all the information it needs to do this. So why bother at
> > > all?
> >
> > The point is to have a working default mechanism.
>
> Your suggestions aren't a working default mechanism.

In what way?

Thanks,

Ingo

2012-08-21 15:25:17

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


>>> A modern kernel better know what state the system is in: on
>>> battery or on AC power.
>>
>> That's a fundamentally uninteresting thing for the kernel to
>> know about. [...]
>
> I disagree.

and I'll agree with Matthew and disagree with you ;-)

>
>> [...] AC/battery is just not an important power management
>> policy input when compared to various other things.
>
> Such as?
>
> The thing is, when I use Linux on a laptop then AC/battery is
> *the* main policy input.

I think you're wrong there.
First of all, not the whole world is a laptop.
Phones and servers are very different than laptops in this sense.
In a phone, when you're charging, you want to be EXTRA power efficient in many ways
(since charging creates heat, and that heat will take away your thermal budget).
In a datacenter, you're either on AC or DC all the time, and power efficiency still matters.

And even on a laptop.. heat production matters even when on AC... laptops are more and more like phones
that way.

2012-08-21 15:28:43

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> * Matthew Garrett <[email protected]> wrote:
> > [...] AC/battery is just not an important power management
> > policy input when compared to various other things.
>
> Such as?

The scheduler's behaviour is going to have a minimal impact on power
consumption on laptops. Other things are much more important - backlight
level, ASPM state, that kind of thing. So why special case the
scheduler? This is going to be hugely more important on multi-socket
systems, where your policy is usually going to be dictated by the
specific workload that you're running at the time. The exception is in
cases where your rack is overcommitted for power and your rack
management unit is telling you to reduce power consumption since
otherwise it's going to have to cut the power to one of the machines in
the rack in the next few seconds.

> The thing is, when I use Linux on a laptop then AC/battery is
> *the* main policy input.

And it's already well handled from userspace, as it has to be.

> > Userspace has been doing a perfectly reasonable job of
> > determining policy here.
>
> Has it properly switched the scheduler's balancing between
> power-effient and performance-maximizing strategies when for
> example a laptop's AC got unplugged/replugged?

No, because sched_mt_powersave usually crippled performance more than it
saved power and nobody makes multi-socket laptops.

--
Matthew Garrett | [email protected]

2012-08-21 15:58:43

by Alan

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

> > That's a fundamentally uninteresting thing for the kernel to
> > know about. [...]
>
> I disagree.

The kernel has no idea of the power architecture leading up to the plug
socket. The kernel has no idea of the policy concerns of the user.

> > [...] AC/battery is just not an important power management
> > policy input when compared to various other things.
>
> Such as?
>
> The thing is, when I use Linux on a laptop then AC/battery is
> *the* main policy input.

Along with the distance likely to be travelled without a socket being
available, whether you remembered the charger, and a pile of other things
('can I get this built before Linus wakes up').

The kernel isn't capable of computing these other factors. Userspace
can at least make an educated guess.

In the business space it's even more complicated because battery/mains may
well only be visible via SNMP queries to the power systems, and the bigger
concern may well be heat efficiency. If you are running a cloud, your
policy considerations also include things like your current spot
electricity price, outside temperature and your current chargeable spot
compute price.

> > Userspace has been doing a perfectly reasonable job of
> > determining policy here.
>
> Has it properly switched the scheduler's balancing between
> power-effient and performance-maximizing strategies when for
> example a laptop's AC got unplugged/replugged?

You work for Red Hat, maybe you should ask your distro people if they do.
While you are at it, perhaps also some of the ATA power management that
will probably be an order of magnitude more significant could get
included ;)

Seriously. On a typical laptop the things you can do about power are
dominated by the backlight, by disk power (eg idle SATA links), by USB
device power downs where possible, by turning off any unused phys and by
not having the CPU wake up, which means fixing the desktop apps to behave
sensibly.

I'd like to see actual numbers and evidence, on a wide range of workloads,
that the spread/don't-spread thing is even measurable, given that you've
also got to factor in effects like completing faster and turning everything
off. I'd *really* like to see such evidence on a laptop, which is your
one cited case where it might work.

> > Your suggestions aren't a working default mechanism.
>
> In what way?

For one, if the default behaviour is that when I get on the train and am
on battery my video playback begins to stutter due to some kernel
magic, then I shall be unamused and file it as a regression.....

Policy is userspace - the desktop can figure out I'm watching movies and
what this means, the kernel can't.

I'd also note there have been repeated attempts on various OS's to put
the power management policy

- in the hardware
- in SMM handlers
- in the kernel

and every single one has been a failure because those parts of the system
never have enough information nor do they have enough variety of control
to manage the complexity of input state.

It's a single policy file for a distro to do scheduler configuration
based upon power events. One trivial 'drop it here' shell script. The
difference then being that the desktop can be taught to do overrides and
policy properly.

It might be that the kernel has important knowledge about what "schedule
for efficiency" means, and it may even make sense to be able to ask the
kernel to do that - but it has no idea what the right policy is at any
given moment.

ie even if there is a /sys/mumble/schedule_for_efficiency

the echo "1" > and echo "0" > belong in a script

Alan
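
As an illustration of that split, the 'script' side could be as small
as the following; /sys/mumble/schedule_for_efficiency is the placeholder
knob from above, the power_supply name varies per machine, and none of
this is existing kernel or distro code:

/*
 * Userspace policy sketch: read AC status from the power_supply sysfs
 * attribute and flip the hypothetical schedule_for_efficiency knob.
 * A distro would hook this into its power-event handling (udev/upower)
 * rather than run it by hand.
 */
#include <stdio.h>

#define AC_ONLINE	"/sys/class/power_supply/AC/online"	/* name varies per machine */
#define SCHED_KNOB	"/sys/mumble/schedule_for_efficiency"	/* hypothetical knob */

int main(void)
{
	FILE *f = fopen(AC_ONLINE, "r");
	int on_ac = 1;

	if (f) {
		if (fscanf(f, "%d", &on_ac) != 1)
			on_ac = 1;	/* assume AC if the attribute is unreadable */
		fclose(f);
	}

	f = fopen(SCHED_KNOB, "w");
	if (!f)
		return 1;
	/* Efficiency scheduling on battery, performance scheduling on AC. */
	fprintf(f, "%d\n", on_ac ? 0 : 1);
	fclose(f);
	return 0;
}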

2012-08-21 15:59:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


* Matthew Garrett <[email protected]> wrote:

> On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett <[email protected]> wrote:
> > > [...] AC/battery is just not an important power management
> > > policy input when compared to various other things.
> >
> > Such as?
>
> The scheduler's behaviour is going to have a minimal impact on
> power consumption on laptops. Other things are much more
> important - backlight level, ASPM state, that kind of thing.
> So why special case the scheduler? [...]

I'm not special casing the scheduler - but we are talking about
scheduler policies here, so *if* it makes sense to handle this
dynamically then obviously the scheduler wants to use system
state information when/if the kernel can get it.

Your argument is as if you said that the shape of a car's side
view mirrors is not important to its top speed, because the
overall shape of the chassis and engine power are much more
important.

But we are designing side view mirrors here, so we might as well
do a good job there.

> [...] This is going to be hugely more important on
> multi-socket systems, where your policy is usually going to be
> dictated by the specific workload that you're running at the
> time. [...]

If only we had some kernel subsystem that is intimately familiar
with the workloads running on the system and could act
accordingly and with low latency.

We could name that subsystem in some intuitive fashion: it
switches and schedules workloads, so how about calling it the
'scheduler'?

> [...] The exception is in cases where your rack is
> overcommitted for power and your rack management unit is
> telling you to reduce power consumption since otherwise it's
> going to have to cut the power to one of the machines in the
> rack in the next few seconds.

( That must be some ACPI middleware driven crap, right? Not
really the Linux kernel's problem. )

> > The thing is, when I use Linux on a laptop then AC/battery
> > is *the* main policy input.
>
> And it's already well handled from userspace, as it has to be.

Not according to the developers switching away from Linux
desktop distros in droves, because MacOSX or Win7 has 30%+
better battery efficiency.

The scheduler might be a small part of the picture, but it's
certainly a part of it.

> > > Userspace has been doing a perfectly reasonable job of
> > > determining policy here.
> >
> > Has it properly switched the scheduler's balancing between
> > power-effient and performance-maximizing strategies when for
> > example a laptop's AC got unplugged/replugged?
>
> No, because sched_mt_powersave usually crippled performance
> more than it saved power and nobody makes multi-socket
> laptops.

That's a user-space policy management fail right there: why
wasn't this fixed? If the default policy is in the kernel we can
at least fix it in one place for the most common cases. If it's
spread out amongst multiple projects then progress only happens
at glacial speed ...

Thanks,

Ingo

2012-08-21 16:13:43

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> * Matthew Garrett <[email protected]> wrote:
> > The scheduler's behaviour is going to have a minimal impact on
> > power consumption on laptops. Other things are much more
> > important - backlight level, ASPM state, that kind of thing.
> > So why special case the scheduler? [...]
>
> I'm not special casing the scheduler - but we are talking about
> scheduler policies here, so *if* it makes sense to handle this
> dynamically then obviously the scheduler wants to use system
> state information when/if the kernel can get it.
>
> Your argument is as if you said that the shape of a car's side
> view mirrors is not important to its top speed, because the
> overall shape of the chassis and engine power are much more
> important.
>
> But we are desiging side view mirrors here, so we might as well
> do a good job there.

If the kernel is going to make power choices automatically then it
should do it everywhere, not piecemeal.

> > [...] This is going to be hugely more important on
> > multi-socket systems, where your policy is usually going to be
> > dictated by the specific workload that you're running at the
> > time. [...]
>
> If only we had some kernel subsystem that is intimiately familar
> with the workloads running on the system and could act
> accordingly and with low latency.
>
> We could name that subsystem it in some intuitive fashion: it
> switches and schedules workloads, so how about calling it the
> 'scheduler'?

The scheduler is unaware of whether I care about a process finishing
quickly or whether I care about it consuming less power.

> > [...] The exception is in cases where your rack is
> > overcommitted for power and your rack management unit is
> > telling you to reduce power consumption since otherwise it's
> > going to have to cut the power to one of the machines in the
> > rack in the next few seconds.
>
> ( That must be some ACPI middleware driven crap, right? Not
> really the Linux kernel's problem. )

It's as much the Linux kernel's problem as AC/battery decisions are -
ie, it's not.

> > > The thing is, when I use Linux on a laptop then AC/battery
> > > is *the* main policy input.
> >
> > And it's already well handled from userspace, as it has to be.
>
> Not according to the developers switching away from Linux
> desktop distros in droves, because MacOSX or Win7 has 30%+
> better battery efficiency.

Ok so what you're actually telling me here is that you don't understand
anything about power management and where our problems are.

> The scheduler might be a small part of the picture, but it's
> certainly a part of it.

It's in the drivers, which is where it has been since we went tickless.

> > No, because sched_mt_powersave usually crippled performance
> > more than it saved power and nobody makes multi-socket
> > laptops.
>
> That's a user-space policy management fail right there: why
> wasn't this fixed? If the default policy is in the kernel we can
> at least fix it in one place for the most common cases. If it's
> spread out amongst multiple projects then progress only happens
> at glacial speed ...

sched_mt_powersave was inherently going to have a huge impact on
performance, and with modern chips that would result in the platform
consuming more power. It was a feature that was useful for a small
number of generations of desktop CPUs - I don't think it would ever skew
the power/performance ratio in a useful direction on mobile hardware.
But feel free to blame userspace for hardware design.

--
Matthew Garrett | [email protected]

2012-08-21 18:23:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


* Matthew Garrett <[email protected]> wrote:

> On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett <[email protected]> wrote:
> > > The scheduler's behaviour is going to have a minimal impact on
> > > power consumption on laptops. Other things are much more
> > > important - backlight level, ASPM state, that kind of thing.
> > > So why special case the scheduler? [...]
> >
> > I'm not special casing the scheduler - but we are talking about
> > scheduler policies here, so *if* it makes sense to handle this
> > dynamically then obviously the scheduler wants to use system
> > state information when/if the kernel can get it.
> >
> > Your argument is as if you said that the shape of a car's side
> > view mirrors is not important to its top speed, because the
> > overall shape of the chassis and engine power are much more
> > important.
> >
> > But we are desiging side view mirrors here, so we might as well
> > do a good job there.
>
> If the kernel is going to make power choices automatically
> then it should do it everywhere, not piecemeal.

Why? Good scheduling is useful even in isolation.

> The scheduler is unaware of whether I care about a process
> finishing quickly or whether I care about it consuming less
> power.

You are posing them as if the two were mutually exclusive, while
in reality they are not necessarily exclusive: it's quite
possible that the highest (non-turbo) CPU frequency happens to
be the most energy efficient one for a CPU with a particular
workload ...

You also missed the bit of my mail where I suggested that such
user preferences and tolerances can be communicated to the
scheduler via a policy toggle - which the scheduler would take
into account.

I suggest using sane defaults, such as being energy efficient
on battery power (within a sane threshold) and maximizing
throughput on AC power (within a sane threshold).

That would go a *long* way toward improving the current mess. If Linux
power efficiency were so good today then I'd not ask for kernel
driven defaults - but the reality is that in terms of process
scheduling we suck today (and have sucked for the last 10 years),
so pretty much any approach will improve things.

> > > > The thing is, when I use Linux on a laptop then
> > > > AC/battery is *the* main policy input.
> > >
> > > And it's already well handled from userspace, as it has to
> > > be.
> >
> > Not according to the developers switching away from Linux
> > desktop distros in droves, because MacOSX or Win7 has 30%+
> > better battery efficiency.
>
> Ok so what you're actually telling me here is that you don't
> understand anything about power management and where our
> problems are.

Huh? In practice we suck today in terms of energy efficiency.
That covers both scheduling and other areas.

Saying this out loud does not say anything about my
understanding of power management...

So please outline a technical point.

> > The scheduler might be a small part of the picture, but it's
> > certainly a part of it.
>
> It's in the drivers, which is where it has been since we went
> tickless.

You mean the code is in drivers? Or the problem is in drivers?

Both are true currently - this discussion is about the future, to
make the scheduler aware of power concerns, as the scheduler
(and the timer subsystem) already calculates various interesting
metrics that matter to energy efficient scheduling.

> > > No, because sched_mt_powersave usually crippled performance
> > > more than it saved power and nobody makes multi-socket
> > > laptops.
> >
> > That's a user-space policy management fail right there: why
> > wasn't this fixed? If the default policy is in the kernel we can
> > at least fix it in one place for the most common cases. If it's
> > spread out amongst multiple projects then progress only happens
> > at glacial speed ...
>
> sched_mt_powersave was inherently going to have a huge impact
> on performance, and with modern chips that would result in the
> platform consuming more power. It was a feature that was
> useful for a small number of generations of desktop CPUs - I
> don't think it would ever skew the power/performance ratio in
> a useful direction on mobile hardware. But feel free to blame
> userspace for hardware design.

FYI, sched_mt_powersave is *GONE* in recent kernels, because it
basically never worked. This thread is about designing and
implementing something that actually works.

Thanks,

Ingo

2012-08-21 18:34:34

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Tue, Aug 21, 2012 at 08:23:46PM +0200, Ingo Molnar wrote:
> * Matthew Garrett <[email protected]> wrote:
> > The scheduler is unaware of whether I care about a process
> > finishing quickly or whether I care about it consuming less
> > power.
>
> You are posing them as if the two were mutually exclusive, while
> in reality they are not necessarily exclusive: it's quite
> possible that the highest (non-turbo) CPU frequency happens to
> be the most energy efficient one for a CPU with a particular
> workload ...

You just put in a proviso that makes them mutually exclusive. If I want
it done fast, I want it done in the highest turbo CPU frequency. If I
don't want it done fast, I want it done in the most efficient CPU
frequency. They're probably not the same thing.

> You also missed the bit of my mail where I suggested that such
> user preferences and tolerances can be communicated to the
> scheduler via a policy toggle - which the scheduler would take
> into account.

Yes. And that toggle should be the thing that defines the policy under
all circumstances.

> > Ok so what you're actually telling me here is that you don't
> > understand anything about power management and where our
> > problems are.
>
> Huh? In practice we suck today in terms of energy efficiency.
> That covers both scheduling and other areas.
>
> Saying this out aloud does not tell anything about my
> understanding of power management...
>
> So please outline a technical point.

That our power consumption is worse than under other operating systems is
almost entirely because only one of our three GPU drivers implements any
kind of useful power management. The power saving functionality that we
expose to userspace is already used when it's safe to do so. So blaming
our userspace policy management for our higher power consumption means
that you can't possibly understand where the problems actually are,
which indicates that you probably shouldn't be trying to tell me about
optimal approaches to power management.

> You mean the code is in drivers? Or the problem is in drivers?

The problem is in the drivers.

> > sched_mt_powersave was inherently going to have a huge impact
> > on performance, and with modern chips that would result in the
> > platform consuming more power. It was a feature that was
> > useful for a small number of generations of desktop CPUs - I
> > don't think it would ever skew the power/performance ratio in
> > a useful direction on mobile hardware. But feel free to blame
> > userspace for hardware design.
>
> FYI, sched_mt_powersave is *GONE* in recent kernels, because it
> basically never worked. This thread is about designing and
> implementing something that actually works.

Yes. You asked me whether userspace ever used the knobs that the kernel
exposed. I said no, because the only knob relevant for laptops would
never improve energy efficiency on laptops. It is therefore impossible
to use this as an example of userspace policy management not doing the
right thing.

--
Matthew Garrett | [email protected]

2012-08-21 18:48:52

by Alan

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

> Why? Good scheduling is useful even in isolation.

For power - I suspect it's damn near irrelevant except on a big big
machine.

Unless you've sorted out your SATA, fixed your phy handling, optimised
your desktop for wakeups and worked down the big wakeup causes one by one
it's turd polishing.

PM means fixing the stack top to bottom, and it's a whackamole game: each
one you fix, you find the next. You have to sort the entire stack from
desktop apps to kernel.

However benchmarks talk - so lets have some benchmarks ... on a laptop.

Alan

2012-08-22 05:42:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:

> I'd like to see actual numbers and evidence on a wide range of workloads
> the spread/don't spread thing is even measurable given that you've also
> got to factor in effects like completing faster and turning everything
> off. I'd *really* like to see such evidence on a laptop,which is your
> one cited case it might work.

For my dinky dual core laptop, I suspect you're right, but for a more
powerful laptop, I'd expect spread/don't to be noticeable.

Yeah, hard numbers would be nice to see.

If I had a powerful laptop, I'd kill irq balancing, and all but periodic
load balancing, and expect to see a positive result. Dunno what fickle
electron gods would _really_ do with those prayers though.

-Mike

2012-08-22 09:03:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


* Alan Cox <[email protected]> wrote:

> > Why? Good scheduling is useful even in isolation.
>
> For power - I suspect it's damn near irrelevant except on a
> big big machine.

With deep enough C states it's rather relevant whether we
continue to burn +50W for a couple of more milliseconds or not,
and whether we have the right information from the scheduler and
timer subsystem about how long the next idle period is expected
to be and how bursty a given task is.

'Balance for energy efficiency' obviously ties into the C state
and frequency selection logic, which is rather detached right
now, running its own (imperfect) scheduling metrics logic and
making pretty much the worst possible C state and frequency
decisions in typical everyday desktop workloads.

> Unless you've sorted out your SATA, fixed your phy handling,
> optimised your desktop for wakeups and worked down the big
> wakeup causes one by one it's turd polishing.
>
> PM means fixing the stack top to bottom, and its a whackamole
> game, each one you fix you find the next. You have to sort the
> entire stack from desktop apps to kernel.

Moving 'policy' into user-space has been an utter failure,
mostly because there's not a single project/subsystem
responsible for getting a good result to users. This is why
I resist the "policy should not be in the kernel" meme here.

> However benchmarks talk - so lets have some benchmarks ... on
> a laptop.

Agreed.

Thanks,

Ingo

2012-08-22 09:10:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


* Matthew Garrett <[email protected]> wrote:

> [...]
>
> Our power consumption is worse than under other operating
> systems is almost entirely because only one of our three GPU
> drivers implements any kind of useful power management. [...]

... and because our CPU frequency and C state selection logic is
making pretty much the worst possible decisions (on x86 at
least).

Regardless, you cannot possibly seriously suggest that because
there's even greater suckage elsewhere for some workloads we
should not even bother with improving the situation here.

Anyway, I agree with Alan that actual numbers matter.

Thanks,

Ingo

2012-08-22 10:56:35

by Alan

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

> With deep enough C states it's rather relevant whether we
> continue to burn +50W for a couple of more milliseconds or not,
> and whether we have the right information from the scheduler and
> timer subsystem about how long the next idle period is expected
> to be and how bursty a given task is.

50W for 2ms here and there is an irrelevance compared with burning a
continual half a watt due to the upstream tree lacking some of the SATA
power patches, for example.

It's the classic "standby mode" problem - energy efficiency has time as a
factor and there are a lot of milliseconds in 5 hours. That means
anything continually on rapidly dominates the problem space.

> > PM means fixing the stack top to bottom, and its a whackamole
> > game, each one you fix you find the next. You have to sort the
> > entire stack from desktop apps to kernel.
>
> Moving 'policy' into user-space has been an utter failure,
> mostly because there's not a single project/subsystem
> responsible for getting a good result to users. This is why
> I resist "policy should not be in the kernel" meme here.

You *can't* fix PM in one place. Power management is a top to bottom
thing. It starts in the hardware and propagates right to the top of the
user space stack.

A single stupid behaviour in a desktop app is all it needs to knock the
odd hour or two off your battery life. Something as mundane as refreshing
a bit of the display all the time can keep the GPU and CPU from sleeping
well.

Most distros haven't managed to do power management properly because it
is this entire integration problem. Every single piece of the puzzle has
to be in place before you get any serious gain.

It's not a kernel v user thing. The kernel can't fix it, random bits of
userspace can't fix it. This is effectively a "product level" integration
problem.

Alan

2012-08-22 11:34:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler


* Alan Cox <[email protected]> wrote:

> > With deep enough C states it's rather relevant whether we
> > continue to burn +50W for a couple of more milliseconds or
> > not, and whether we have the right information from the
> > scheduler and timer subsystem about how long the next idle
> > period is expected to be and how bursty a given task is.
>
> 50W for 2mS here and there is an irrelevance compared with
> burning a continual half a watt due to the upstream tree lack
> some of the SATA power patches for example.

It can be more than an irrelevance if the CPU is saturated - say
a game running on a mobile device very commonly saturates the
CPU. A third of the energy is spent in the CPU, sometimes more.

> It's the classic "standby mode" problem - energy efficiency
> has time as a factor and there are a lot of milliseconds in 5
> hours. That means anything continually on rapidly dominates
> the problem space.
>
> > > PM means fixing the stack top to bottom, and its a whackamole
> > > game, each one you fix you find the next. You have to sort the
> > > entire stack from desktop apps to kernel.
> >
> > Moving 'policy' into user-space has been an utter failure,
> > mostly because there's not a single project/subsystem
> > responsible for getting a good result to users. This is why
> > I resist "policy should not be in the kernel" meme here.
>
> You *can't* fix PM in one place. [...]

Preferably one project, not one place - but at least don't go
down the false path of:

" Policy always belongs into user-space so the kernel can
continue to do a shitty job even for pieces it could
understand better ..."

My opinion is that it depends, and I also think that we are so
bad currently (on x86) that we can do little harm by trying to
do things better.

> [...] Power management is a top to bottom thing. It starts in
> the hardware and propogates right to the top of the user space
> stack.

Partly because it's misdesigned: in practice there's very little
true user policy about power saving:

- On mobile devices I almost never tweak policy as a user -
sometimes I override screen brightness but that's all (and
it's trivial compared to all the many other things that go
on).

- On a laptop I'd love to never have to tweak it either -
running fast when on AC and running efficient when on battery
is a perfectly fine life-time default for me.

90% of the "policy" comes with the *form factor* - i.e. it's
something the hardware and thus the kernel could intimately
know about.

Yes, there are exceptions and there are servers.

The mobile device user mostly *only cares about battery life*,
for a given amount of real utility provided by the device. The
"user policy" fetish here is a serious misunderstanding of how
it should all work. There aren't millions of people out there
wanting to tweak the heck out of PM.

People prefer no knobs at all - they want good defaults and they
want at most a single, intuitive, actionable control to override
the automation in 1% of the usecases, such as screen brightness.

> A single stupid behaviour in a desktop app is all it needs to
> knock the odd hour or two off your battery life. Something is
> mundane as refreshing a bit of the display all the time
> keeping the GPU and CPU from sleeping well.

Even with highly powertop-optimized systems that have no such
app and have very low wakeup rates we still lag behind the
competition.

> Most distros haven't managed to do power management properly
> because it is this entire integration problem. Every single
> piece of the puzzle has to be in place before you get any
> serious gain.

Most certainly.

So why not move most pieces into one well-informed code domain
(the kernel) and only expose high level controls, instead of
expecting user-space to get it all right.

Then the 'only' job of user-space would be to not be silly when
implementing their functionality. (and there's nothing
intimately PM about that.)

> It's not a kernel v user thing. The kernel can't fix it,
> random bits of userspace can't fix it. This is effectively a
> "product level" integration problem.

Of course the kernel can fix many parts by offering automation
like automatically shutting down unused interfaces (and offering
better ABIs if that is not possible due to some poor historic
choice), choosing frequencies and C states wisely, etc.

Kernel design decisions *matter*:

Look for example at how moving the X low-level drivers from user-space
into kernel-space enabled GPU level power management to begin
with. With the old X method it was essentially impossible. Now
it's at least possible.

Or look at how Android adding a high-level interface like
suspend blockers materially improved the power saving situation
for them.

This learned helplessness that "the kernel can do nothing about
PM" is somewhat annoying :-)

Thanks,

Ingo

2012-08-22 11:35:27

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, Aug 22, 2012 at 11:10:13AM +0200, Ingo Molnar wrote:
>
> * Matthew Garrett <[email protected]> wrote:
>
> > [...]
> >
> > Our power consumption is worse than under other operating
> > systems is almost entirely because only one of our three GPU
> > drivers implements any kind of useful power management. [...]
>
> ... and because our CPU frequency and C state selection logic is
> doing pretty much the worst possible decisions (on x86 at
> least).

You have figures showing that our C state residence is worse than, say,
Windows? Because my own testing says that we're way better at that.
Could we be better? Sure. Is it why we're worse? No.

> Regardless, you cannot possibly seriously suggest that because
> there's even greater suckage elsewhere for some workloads we
> should not even bother with improving the situation here.

I'm enthusiastic about improving the scheduler's behaviour. I'm
unenthusiastic about putting in automatic hacks related to AC state.

--
Matthew Garrett | [email protected]

2012-08-22 12:55:01

by Alan

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

> It can be more than an irrelevance if the CPU is saturated - say
> a game running on a mobile device very commonly saturates the
> CPU. A third of the energy is spent in the CPU, sometimes more.

If the CPU is saturated you already lost. What are you going to do - the CPU
is saturated - slow it down? Then it'll use more power.

> > You *can't* fix PM in one place. [...]
>
> Preferably one project, not one place - but at least don't go
> down the false path of:
>
> " Policy always belongs into user-space so the kernel can
> continue to do a shitty job even for pieces it could
> understand better ..."
>
> My opinion is that it depends, and I also think that we are so
> bad currently (on x86) that we can do little harm by trying to
> do things better.

All the evidence I've seen says we are doing the kernel side stuff right.

>
> > [...] Power management is a top to bottom thing. It starts in
> > the hardware and propogates right to the top of the user space
> > stack.
>
> Partly because it's misdesigned: in practice there's very little
> true user policy about power saving:

It's not about policy, it's about code behaviour. You have to fix every
single piece of code.

> - On mobile devices I almost never tweak policy as a user -
> sometimes I override screen brightness but that's all (and
> it's trivial compared to all the many other things that go
> on).

Put a single badly broken app on an Android device and your battery life
will plough. That's despite Android having some highly active management
policies to minimise the effect. It works out of the box because someone
spent a huge amount of time with a power meter and monitoring tools
beating up whoever was top of the wakeup lists.

> it should all work. There arent millions of people out there
> wanting to tweak the heck out of PM.

Don't confuse policy managed by userspace with buttons for users to
tweak. Userspace understands things like "would it be better to drop
video quality or burn more power" and has access to info the kernel can't
even begin to evaluate.

> People prefer no knobs at all - they want good defaults and they
> want at most a single, intuitive, actionable control to override
> the automation in 1% of the usecases, such as screen brightness.

That's a different discussion.

> > A single stupid behaviour in a desktop app is all it needs to
> > knock the odd hour or two off your battery life. Something is
> > mundane as refreshing a bit of the display all the time
> > keeping the GPU and CPU from sleeping well.
>
> Even with highly powertop-optimized systems that have no such
> app and have very low wakeup rates we still lag behind the
> competition.

Actually we don't. Well not if your distro is put together properly,
and has the relevant SATA patches and the like merged. Stock Fedora may
be pants but if so that's a distro problem.

> So why not move most pieces into one well-informed code domain
> (the kernel) and only expose high level controls, instead of
> expecting user-space to get it all right.

Because the kernel doesn't have the information needed. You'd have to add
megabytes of code to the kernel - including things like video playback
engines.

> Then the 'only' job of user-space would be to not be silly when
> implementing their functionality. (and there's nothing
> intimately PM about that.)

That sounds like ignorance.

> Kernel design decisions *matter*:

Of course they do, but it's a tiny part of the story. The power management
function mathematically has a large number of important inputs for which
the kernel cannot deduce the values without massive layering violations.

Also inconveniently for your worldview but as demonstrated in every case
and by everyone who has dug into it, you also have to fix all the wakeup
sources on each level. That's the reality. From the moment you wake for
an event that was not strictly needed you are essentially attempting to
mitigate a failure not trying to deal with the actual problem.

> Look for example how moving X lowlevel drivers from user-space
> into kernel-space enabled GPU level power management to begin
> with. With the old X method it was essentially impossible. Now
> it's at least possible.

Actually it was perfectly possible before for what the cards of the time
could do. The kernel GPU stuff is for DMA and IRQ handling. It happens to
be a good place to do PM.

> Or look at how Android adding a high-level interface like
> suspend blockers materially improved the power saving situation
> for them.

Blockers are not policy. The blocking *policy* is managed elsewhere. They
are a tool for freezing stuff that is being rude.

> This learned helplessness that "the kernel can do nothing about
> PM" is somewhat annoying :-)

Sorry, was that a different thread I didn't read?

The inability to learn from both the past and basic systems theory is
what I find rather more irritating, plus your mistaken belief that we are
worse than the other OSes on this. We are not. If your system sucks, then
instrument it, get the SATA patches into your kernel, run powertweak over
it, and ask your distro folks why you had to change any of the settings
and why they hadn't shipped it that way.
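
As a concrete illustration of the kind of setting at issue, here is a
rough sketch, assuming the machine's first SATA host is host0 and the
kernel exposes the standard link_power_management_policy attribute, that
flips that one knob from C; powertop or a distro script would obviously
cover many more settings and hosts than this.

/*
 * Minimal sketch: enable aggressive SATA link power management on host0.
 * Assumes /sys/class/scsi_host/host0/link_power_management_policy exists;
 * a real tool would walk every hostN and report errors properly.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/scsi_host/host0/"
			"link_power_management_policy", "w");

	if (!f) {
		perror("link_power_management_policy");
		return 1;
	}
	/* let the SATA link drop into its low power states when idle */
	fputs("min_power\n", f);
	fclose(f);
	return 0;
}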

Alan

2012-08-22 13:03:37

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
>
>> I'd like to see actual numbers and evidence that, on a wide range of
>> workloads, the spread/don't-spread thing is even measurable, given that
>> you've also got to factor in effects like completing faster and turning
>> everything off. I'd *really* like to see such evidence on a laptop, which
>> is your one cited case where it might work.
>
> For my dinky dual core laptop, I suspect you're right, but for a more
> powerful laptop, I'd expect spread/don't to be noticeable.

yeah if you don't spread, you will waste some power.
but.. current linux behavior is to spread.
so we can only make it worse.


>
> Yeah, hard numbers would be nice to see.
>
> If I had a powerful laptop, I'd kill irq balancing, and all but periodic
> load balancing, and expect to see a positive result.

I'd expect to see a negative result ;-)

2012-08-22 13:10:06

by Mike Galbraith

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, 2012-08-22 at 06:02 -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
> >
> >> I'd like to see actual numbers and evidence that, on a wide range of
> >> workloads, the spread/don't-spread thing is even measurable, given that
> >> you've also got to factor in effects like completing faster and turning
> >> everything off. I'd *really* like to see such evidence on a laptop, which
> >> is your one cited case where it might work.
> >
> > For my dinky dual core laptop, I suspect you're right, but for a more
> > powerful laptop, I'd expect spread/don't to be noticeable.
>
> yeah if you don't spread, you will waste some power.
> but.. current linux behavior is to spread.
> so we can only make it worse.

Hm, so I can stop fretting about select_idle_sibling(). Good.

> > Yeah, hard numbers would be nice to see.
> >
> > If I had a powerful laptop, I'd kill irq balancing, and all but periodic
> > load balancing, and expect to see a positive result.
>
> I'd expect to see a negative result ;-)

Ok, so I have my head on backward. Gives a different perspective :)

-Mike

2012-08-22 13:21:32

by Matthew Garrett

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > For my dinky dual core laptop, I suspect you're right, but for a more
> > powerful laptop, I'd expect spread/don't to be noticeable.
>
> yeah if you don't spread, you will waste some power.
> but.. current linux behavior is to spread.
> so we can only make it worse.

Right. For a single socket system the only thing you can do is use two
threads in preference to using two cores. That'll keep an extra core in
a deep C state for longer, at the cost of keeping the package out of a
deep C state for longer. There might be a win if the two processes
benefit from improved L1 cache locality, or if you're talking about
short periodic work, but for the majority of cases I'd expect Arjan to
be completely correct here. Things get more interesting with
multi-socket systems, but that's beyond the laptop use case.
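
To put that packing option in rough code terms: a power-oriented hook in
the wakeup path could first look for an idle SMT sibling of the CPU the
task last ran on, and only then fall back to spreading. The helper below
is only a sketch of that idea; prefer_smt_sibling() is a made-up name,
not an existing scheduler interface, and it ignores load, groups and all
the capacity bookkeeping.

/*
 * Sketch only, not an existing kernel function: under a power-oriented
 * policy, prefer an idle SMT sibling of the task's previous CPU so the
 * other core, and ideally the rest of the package, can stay in a deep
 * C state.
 */
static int prefer_smt_sibling(struct task_struct *p, int prev_cpu)
{
	int cpu;

	for_each_cpu(cpu, cpu_smt_mask(prev_cpu)) {
		if (cpu == prev_cpu)
			continue;
		/* an idle sibling shares a core that is already powered up */
		if (idle_cpu(cpu) &&
		    cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
			return cpu;
	}

	return -1;	/* no idle sibling: fall back to the normal spread */
}

Whether that ever wins would come down to exactly the package C-state and
cache-locality trade-off described above.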

--
Matthew Garrett | [email protected]

2012-08-22 13:23:18

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 8/22/2012 6:21 AM, Matthew Garrett wrote:
> On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
>> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
>>> For my dinky dual core laptop, I suspect you're right, but for a more
>>> powerful laptop, I'd expect spread/don't to be noticeable.
>>
>> yeah if you don't spread, you will waste some power.
>> but.. current linux behavior is to spread.
>> so we can only make it worse.
>
> Right. For a single socket system the only thing you can do is use two
> threads in preference to using two cores. That'll keep an extra core in
> a deep C state for longer, at the cost of keeping the package out of a
> deep C state for longer. There might be a win if the two processes
> benefit from improved L1 cache locality, or if you're talking about

basically "if HT sharing would be good for performance" ;-)

(btw this is good news, it means this is not an actual power/performance
tradeoff, but a "get it right" tradeoff)

2012-08-23 08:20:21

by Alex Shi

[permalink] [raw]
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler

On 08/22/2012 05:10 PM, Ingo Molnar wrote:

>
> * Matthew Garrett <[email protected]> wrote:
>
>> [...]
>>
>> Our power consumption is worse than under other operating
>> systems almost entirely because only one of our three GPU
>> drivers implements any kind of useful power management. [...]
>
> ... and because our CPU frequency and C state selection logic is
> making pretty much the worst possible decisions (on x86 at
> least).
>
> Regardless, you cannot possibly seriously suggest that because
> there's even greater suckage elsewhere for some workloads we
> should not even bother with improving the situation here.
>
> Anyway, I agree with Alan that actual numbers matter.


Sure. We'd better turn the ideas into code, and then let benchmarks and
data speak.

>
> Thanks,
>
> Ingo