2008-06-25 19:09:27

by Vaidyanathan Srinivasan

Subject: [RFC v1] Tunable sched_mc_power_savings=n

Hi,

The existing power saving loadbalancer CONFIG_SCHED_MC attempts to run
the workload in the system on a minimum number of CPU packages and tries
to keep the rest of the CPU packages idle for longer durations. Thus
consolidating workloads onto fewer packages helps the other packages
stay idle and save power.

echo 1 > /sys/devices/system/cpu/sched_mc_power_savings is used to
turn on this feature.

When enabled, this tunable influences the loadbalancer decision
in find_busiest_group(). Two parameters are extracted at this
time: group_leader is the group that is almost full and has just
enough capacity to pull a few (one) tasks, while group_min is the group
that has too few tasks; if we can move them to group_leader, then
group_min can go completely idle.
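
To illustrate, here is a minimal user-space sketch of this selection
heuristic (an illustrative model only, not the actual kernel code; the
struct, the names and the one-task-per-core capacity model are
simplifying assumptions):

#include <stdio.h>

/* A "group" stands in for one CPU package in this model. */
struct group {
    const char *name;
    int nr_running;     /* tasks currently running in this package */
    int capacity;       /* ideal load: one task per core */
};

int main(void)
{
    struct group groups[] = {
        { "pkg0", 3, 4 },   /* nearly full: room for one more task */
        { "pkg1", 1, 4 },   /* nearly idle: candidate to be emptied */
        { "pkg2", 4, 4 },   /* fully loaded: neither leader nor min */
    };
    struct group *leader = NULL, *min = NULL;
    int i;

    for (i = 0; i < 3; i++) {
        struct group *g = &groups[i];

        /* group_min: the non-idle group with the fewest tasks,
         * so that moving them away empties the whole package */
        if (g->nr_running && (!min || g->nr_running < min->nr_running))
            min = g;

        /* group_leader: the most loaded group that still has
         * spare capacity to absorb group_min's tasks */
        if (g->nr_running < g->capacity &&
            (!leader || g->nr_running > leader->nr_running))
            leader = g;
    }

    if (leader && min && leader != min)
        printf("pull %d task(s) from %s into %s\n",
               min->nr_running, min->name, leader->name);
    return 0;
}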

The default criteria to select group_leader and group_min catch
long running threads on various packages and pull them to a single
package. group_capacity limits the number of tasks that are being
pulled; we expect one task per core in a package, with all the
cores in a package loaded.

This default selection criteria for sched_mc_power_savings=1 strikes
a good balance between power savings and minimal performance impact. The
conservative approach taken towards consolidation makes the selection
criteria workload dependent: long running steady state workloads are
placed correctly, but bursty workloads are not.

The idea being proposed is to enhance the tunable with varied degrees
of consolidation that can work best for different workload
characteristics. echo 2 > /sys/.../sched_mc_power_savings could
enable more aggressive consolidation than the default.

I am presently working on different criteria that can help consolidate
different types of workload with varied degrees of power savings and
performance impact.

Advantages:

* Enterprise workloads on large hardware configurations may need
an aggressive consolidation strategy
* Performance impact on a server is different from a desktop or laptop.
Interactivity is less of a concern on large enterprise servers, while
workload response times and performance per watt are more significant
* Aggressive power savings even with a marginal performance penalty is
a useful tunable for servers since it may provide good
performance-per-watt at low utilisation
* This tunable can influence other parts of the scheduler like wakeup
biasing for overall task consolidation

Proposed changes:

* Add more values to sched_mc_power_savings tunable (bit flags?)
* Enable different consolidation strategy based on the value
* Evaluate different strategy against different workloads and design
heuristics for auto tuning
* Modify selection of group_leader by changing the spare capacity
evaluation
* Increase group capacity of the group leader to avoid pulling tasks
away from group_leader within a short time
* Choose different load_idx while evaluating and selecting the load
* Use the sched_mc_power_savings settings outside of load balancer
like in task wakeup biasing
* Design power saving loadbalancer in combination with process wakeup
biasing in order to consolidate bursty and short running jobs to
fewer CPU packages in an idle or under-utilised system.

Disadvantages:

* More tunable settings will lead to sub-optimal performance if not
exploited correctly. Once the tunable criteria are established and
we have good heuristics, we can have a default setting that can
automatically choose the right technique.

I will send the changes in criteria and their impact in subsequent
RFCs. I would like to solicit feedback on the overall idea and inputs
from people who have already attempted similar changes.

Thanks,
Vaidy


2008-06-26 13:56:38

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Vaidyanathan Srinivasan <[email protected]> writes:
>
> The idea being proposed is to enhance the tunable with varied degrees
> of consolidation that can work best for different workload
> characteristics. echo 2 > /sys/.../sched_mc_power_savings could
> enable more aggressive consolidation than the default.

It would be better to fix the single power saving default to work
better with bursty workloads too than to add more tunables. Tunables
are basically "we give up, let's push the problem to the user"
which is not nice. I suspect a lot of users won't even know if their
workloads are bursty or not. Or they might have workloads which
are both bursty and not bursty.

Or did you try that and fail?

-Andi

2008-06-26 15:03:28

by Balbir Singh

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Andi Kleen wrote:
> Vaidyanathan Srinivasan <[email protected]> writes:
>> The idea being proposed is to enhance the tunable with varied degrees
>> of consolidation that can work best for different workload
>> characteristics. echo 2 > /sys/.../sched_mc_power_savings could
>> enable more aggressive consolidation than the default.
>
> It would be better to fix the single power saving default to work
> better with bursty workloads too than to add more tunables. Tunables
> are basically "we give up, let's push the problem to the user"
> which is not nice. I suspect a lot of users won't even know if their
> workloads are bursty or not. Or they might have workloads which
> are both bursty and not bursty.
>
> Or did you try that and fail?
>

A user could be an application and certain applications can predict their
workload. For example, a database, a file indexer, etc can predict their workload.

Policies are best known in user land and best controlled from there.
Consider a case where the end user might select a performance based policy or a
policy to aggressively save power (during peak tariff times). With
virtualization, the whole concept of application is changing; the OS by itself
could be an application :)


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-06-26 15:04:31

by Dipankar Sarma

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Thu, Jun 26, 2008 at 03:49:01PM +0200, Andi Kleen wrote:
> Vaidyanathan Srinivasan <[email protected]> writes:
> >
> > The idea being proposed is to enhance the tunable with varied degrees
> > of consolidation that can work best for different workload
> > characteristics. echo 2 > /sys/.../sched_mc_power_savings could
> > enable more aggressive consolidation than the default.
>
> It would be better to fix the single power saving default to work
> better with bursty workloads too than to add more tunables. Tunables
> are basically "we give up, let's push the problem to the user"
> which is not nice. I suspect a lot of users won't even know if their
> workloads are bursty or not. Or they might have workloads which
> are both bursty and not bursty.
>
> Or did you try that and fail?

I think we have a reasonable default with sched_mc_power_savings=1.
Beyond that it is hard to figure out how much work you can group together
and run in a small number of physical CPU packages. The approach
we are taking is to let system administrators decide what level
of power savings they want. If they want power savings at the cost
of performance, they should be able to do so using a higher
value of sched_mc_power_savings. If they see that they can pack
more work without affecting their transaction time, they should
be able to adjust the level of packing. Beyond a sane default,
it is hard to do this inside the kernel.

Thanks
Dipankar

2008-06-26 18:09:02

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n


> A user could be an application and certain applications can predict their
> workload.

So you expect the applications to run suid root and change a sysctl?
And what happens when two applications run that do that and they have differing
requirements? Will they fight over the sysctl?

> For example, a database, a file indexer, etc can predict their workload.


A file indexer should run with a high nice level, and low priority would ideally always
prefer power saving. But it doesn't currently. Perhaps it should?

>
> Policies are best known in user land and best controlled from there.
> Consider a case where the end user might select a performance based policy or a
> policy to aggressively save power (during peak tariff times). With

How many users are going to do that? Seems like an unrealistic case to me.

-Andi

2008-06-26 18:29:53

by Vaidyanathan Srinivasan

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

* Dipankar Sarma <[email protected]> [2008-06-26 20:31:00]:

> On Thu, Jun 26, 2008 at 03:49:01PM +0200, Andi Kleen wrote:
> > Vaidyanathan Srinivasan <[email protected]> writes:
> > >
> > > The idea being proposed is to enhance the tunable with varied degrees
> > > of consolidation that can work best for different workload
> > > characteristics. echo 2 > /sys/.../sched_mc_power_savings could
> > > enable more aggressive consolidation than the default.
> >
> > It would be better to fix the single power saving default to work
> > better with bursty workloads too than to add more tunables. Tunables
> > are basically "we give up, let's push the problem to the user"
> > which is not nice. I suspect a lot of users won't even know if their
> > workloads are bursty or not. Or they might have workloads which
> > are both bursty and not bursty.
> >
> > Or did you try that and fail?
>
> I think we have a reasonable default with sched_mc_power_savings=1.
> Beyond that it is hard to figure out how much work you can group together
> and run in a small number of physical CPU packages. The approach
> we are taking is to let system administrators decide what level
> of power savings they want. If they want power savings at the cost
> of performance, they should be able to do so using a higher
> value of sched_mc_power_savings. If they see that they can pack
> more work without affecting their transaction time, they should
> be able to adjust the level of packing. Beyond a sane default,
> it is hard to do this inside the kernel.

Hi Andi,

Aggressive grouping and consolidation may hurt performance to some
extent depending on the workload. The default setting could have the least
performance impact and moderate power savings. We certainly need
user/application input on how much 'potential' performance hit the
application is willing to take in order to save considerable power
under low system utilisation. As Dipankar has mentioned, the proposed
idea is to use sched_mc_power_savings as a power-savings and
performance trade-off tunable parameter.

We tried to tweak the wakeup logic to move tasks to one package at idle;
it works great at idle, but could potentially cause too much redundant
load balancing at certain system utilisation levels. Every technique used
to consolidate tasks has its benefits at a particular utilisation level and
also depends on the nature of the workload. I agree that we should avoid
tunables as far as possible, but we still need to make the changes
available to the community so that we can compare the different methods
across various workloads and system configurations. One of the
settings in the tunable can very well be 'let the kernel decide what
is best'.

--Vaidy

2008-06-26 18:51:14

by Vaidyanathan Srinivasan

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

* Andi Kleen <[email protected]> [2008-06-26 20:08:41]:

>
> > A user could be an application and certain applications can predict their
> > workload.
>
> So you expect the applications to run suid root and change a sysctl?
> And what happens when two applications run that do that and they have differing
> requirements? Will they fight over the sysctl?

System management software and workload monitoring and managing
software can potentially control the tunable on behalf of the
applications for best overall power savings and performance.

Applications with conflicting goals should resolve among themselves.
The application with highest performance requirement should win. The
power QoS framework set_acceptable_latency() ensures that the lowest
latency set across the system wins. This tunable can also be based on
a similar approach.


> > For example, a database, a file indexer, etc can predict their workload.
>
>
> A file indexer should run with a high nice level, and low priority would ideally always
> prefer power saving. But it doesn't currently. Perhaps it should?

Power management settings affect the entire system. It may not be
based on per application priority or nice value. However if the
priority of all the applications currently running in the system
indicate power savings, then the kernel can go to a more aggressive power
saving state.

> >
> > Policies are best known in user land and best controlled from there.
> > Consider a case where the end user might select a performance based policy or a
> > policy to aggressively save power (during peak tariff times). With
>
> How many users are going to do that? Seems like an unrealistic case to me.

System management software should do this. Certainly manual
intervention to change these settings will not be popular. Given the
trends in virtualisation and modular systems, most datacenters will
use some form of systems management software and infrastructure that
is empowered to make policy based decisions on provisioning and
systems configuration.

In small-scale datacenters, peak and off-peak hour settings can be
potentially done through simple cron jobs.

--Vaidy

2008-06-26 20:09:36

by David Collier-Brown

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Vaidyanathan Srinivasan wrote:
> * Andi Kleen <[email protected]> [2008-06-26 20:08:41]:
>
>
>>>A user could be an application and certain applications can predict their
>>>workload.
>>
>>So you expect the applications to run suid root and change a sysctl?
>>And what happens when two applications run that do that and they have differing
>>requirements? Will they fight over the sysctl?

There are cases where Oracle does this, to ensure the (critical!) log writer
isn't starved by cpu-hungry query optimizer processes...


> System management software and workload monitoring and managing
> software can potentially control the tunable on behalf of the
> applications for best overall power savings and performance.
>
> Applications with conflicting goals should resolve among themselves.
> The application with highest performance requirement should win. The
> power QoS framework set_acceptable_latency() ensures that the lowest
> latency set across the system wins. This tunable can also be based on
> a similar approach.

This is what the IBM zOS "WLM" does: a godlike service runs, records
the delays of workloads on the system, and then adjusts tuning
parameters to speed up processes which are running slower than their
service levels call for, taking the resources from processes which
are running faster than service agreements require.

Look for goal-directed resource management and "workload manager" in
Redbooks. Better, ask some of the IBM folks here (;-))


>>>For example, a database, a file indexer, etc can predict their workload.
>>
>>
>>A file indexer should run with a high nice level, and low priority would ideally always
>>prefer power saving. But it doesn't currently. Perhaps it should?
>
>
> Power management settings affect the entire system. It may not be
> based on per application priority or nice value. However if the
> priority of all the applications currently running in the system
> indicate power savings, then the kernel can go to a more aggressive power
> saving state.
>
>
>>>Policies are best known in user land and best controlled from there.
>>>Consider a case where the end user might select a performance based policy or a
>>>policy to aggressively save power (during peak tariff times). With
>>
>>How many users are going to do that? Seems like an unrealistic case to me.

It's just another policy you could have in your workload management
set: a friend and I were discussing that just the other day!

> System management software should do this. Certainly manual
> intervention to change these settings will not be popular. Given the
> trends in virtualisation and modular systems, most datacenters will
> use some form of systems management software and infrastructure that
> is empowered to make policy based decisions on provisioning and
> systems configuration.
>
> In small-scale datacenters, peak and off-peak hour settings can be
> potentially done through simple cron jobs.
>
> --Vaidy

--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#

2008-06-26 20:17:25

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Vaidyanathan Srinivasan wrote:

Playing devil's advocate here.


> * Andi Kleen <[email protected]> [2008-06-26 20:08:41]:
>
>>> A user could be an application and certain applications can predict their
>>> workload.
>> So you expect the applications to run suid root and change a sysctl?
>> And what happens when two applications run that do that and they have differing
>> requirements? Will they fight over the sysctl?
>
> System management software and workload monitoring and managing
> software can potentially control the tunable on behalf of the
> applications for best overall power savings and performance.

Does it have the needed information for that? e.g. real time information
on what the system does? I don't think anybody is in a better position
to control that than the kernel.

> Applications with conflicting goals should resolve among themselves.

That sounds wrong to me. Negotiating between conflicting requirements
from different applications is something that kernels are supposed
to do.

> The application with highest performance requirement should win.

That is right, but the kernel can do that based on nice levels
and possibly other information, can't it?


> The
> power QoS framework set_acceptable_latency() ensures that the lowest
> latency set across the system wins.

But that only helps kernel drivers, not user space, doesn't it?

> Power management settings affect the entire system. It may not be
> based on per application priority or nice value. However if the
> priority of all the applications currently running in the system
> indicate power savings, then the kernel can go to a more aggressive power
> saving state.

That's what I meant, yes. So if only the file system indexer is running
overnight, all niced, it will run as power efficiently as possible.

> In small-scale datacenters, peak and off-peak hour settings can be
> potentially done through simple cron jobs.

Is there any real drawback from only controlling it through nice levels?

Anyways I think the main thing I object to in your proposal is that
your tunable is system global, not per process. I'm also not
sure if a tunable is really a good idea and if the kernel couldn't
do a better job.

-Andi

2008-06-26 21:03:38

by Dipankar Sarma

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Thu, Jun 26, 2008 at 10:17:00PM +0200, Andi Kleen wrote:
> Vaidyanathan Srinivasan wrote:
> > System management software and workload monitoring and managing
> > software can potentially control the tunable on behalf of the
> > applications for best overall power savings and performance.
>
> Does it have the needed information for that? e.g. real time information
> on what the system does? I don't think anybody is in a better position
> to control that than the kernel.

Some workload managers already do that - they provision cpu and memory
resources based on request rates and response times. Such software is
in a better position to make a decision whether they can live with
reduced performance due to power saving mode or not. The point I am
making is that the kernel doesn't have any notion of transactional
performance - so if an administrator wants to run unimportant
transactions on a slower but low-power system, he/she should have
the option of doing so.

> > Applications with conflicting goals should resolve among themselves.
>
> That sounds wrong to me. Negotiating between conflicting requirements
> from different applications is something that kernels are supposed
> to do.

Agreed. However that is a difficult problem to solve and not the
intention of this idea. Global power setting is a simple first step.
I don't think we have a good understanding of cases where conflicting
power requirements from multiple applications need to be addressed.
We will have to look at that when the issue arises.

> > In small-scale datacenters, peak and off-peak hour settings can be
> > potentially done through simple cron jobs.
>
> Is there any real drawback from only controlling it through nice levels?

In a system with more than a couple of sockets, it is more beneficial
(power-wise) to pack all work into a small number of processors
and let the other processors go to very low power sleep. Compared
to running tasks slowly and spreading them all over the processors.

> Anyways I think the main thing I object to in your proposal is that
> your tunable is system global, not per process. I'm also not
> sure if a tunable is really a good idea and if the kernel couldn't
> do a better job.

While it would be nice to have a per process tunable, I am not sure
we are ready for that yet. A global setting is easy to implement
and we have immediate use for it. The kernel already does a decent
job conservatively - by packing one task per core in a package
when sched_mc_power_savings=1 is set. Any further packing may affect
performance and should not therefore be the default behavior.

Thanks
Dipankar

2008-06-26 21:37:29

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Dipankar Sarma wrote:

> Some workload managers already do that - they provision cpu and memory
> resources based on request rates and response times. Such software is
> in a better position to make a decision whether they can live with
> reduced performance due to power saving mode or not. The point I am
> making is that the kernel doesn't have any notion of transactional
> performance

The kernel definitely knows about burstiness vs non burstiness at least
(although it currently has no long term memory for that). Does it need
more than that for this? Anyways if nice levels were used that is not
even needed, because it's ok to run niced processes slower.

And your workload manager could just nice processes. It should probably
do that anyways to tell ondemand you don't need full frequency.

> - so if an administrator wants to run unimportant
> transactions on a slower but low-power system, he/she should have
> the option of doing so.
>
>>> Applications with conflicting goals should resolve among themselves.
>> That sounds wrong to me. Negotiating between conflicting requirements
>> from different applications is something that kernels are supposed
>> to do.
>
> Agreed. However that is a difficult problem to solve and not the
> intention of this idea. Global power setting is a simple first step.
> I don't think we have a good understanding of cases where conflicting

Always the guy who needs the most performance wins? And if only
niced processes are running it's ok to be slower.

It would be similar to nice levels. In fact nice levels could probably be
used directly (similar to how ionice co-opts them too)

Or another case that already uses it is cpufreq/ondemand: when only niced
processes run the CPU is not cranked up to the highest frequency.

I don't see why that information couldn't be used by the load balancer
either to optimize socket use for power. Ok, except that the load balancer
is already very tricky. But it would still probably be better to have some more
complex code that does DTRT automatically than another tunable.

>>> In small-scale datacenters, peak and off-peak hour settings can be
>>> potentially done through simple cron jobs.
>> Is there any real drawback from only controlling it through nice levels?
>
> In a system with more than a couple of sockets, it is more beneficial
> > (power-wise) to pack all work into a small number of processors
> and let the other processors go to very low power sleep. Compared
> to running tasks slowly and spreading them all over the processors.

You answered a different question?

> While it would be nice to have a per process tunable, I am not sure
> we are ready for that yet.

Can you please elaborate what you think is missing?

-Andi

2008-06-26 21:43:20

by Peter Zijlstra

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Thu, 2008-06-26 at 23:37 +0200, Andi Kleen wrote:
> Dipankar Sarma wrote:
>
> > Some workload managers already do that - they provision cpu and memory
> > resources based on request rates and response times. Such software is
> > in a better position to make a decision whether they can live with
> > reduced performance due to power saving mode or not. The point I am
> > making is that the kernel doesn't have any notion of transactional
> > performance
>
> The kernel definitely knows about burstiness vs non burstiness at least
> (although it currently has no long term memory for that). Does it need
> more than that for this? Anyways if nice levels were used that is not
> even needed, because it's ok to run niced processes slower.
>
> And your workload manager could just nice processes. It should probably
> do that anyways to tell ondemand you don't need full frequency.

Except that I want my nice 19 distcc processes to utilize as much cpu as
possible, but just not bother any other stuff I might be doing...


2008-06-26 22:39:11

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Peter Zijlstra wrote:

>> And your workload manager could just nice processes. It should probably
>> do that anyways to tell ondemand you don't need full frequency.
>
> Except that I want my nice 19 distcc processes to utilize as much cpu as
> possible, but just not bother any other stuff I might be doing...

They already won't do that if you run ondemand and cpufreq. It won't
crank up the frequency for niced processes.

Extending that existing policy to socket load balancing would be only
natural.

-Andi

2008-06-27 04:15:38

by Balbir Singh

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Andi Kleen wrote:
>> A user could be an application and certain applications can predict their
>> workload.
>
> So you expect the applications to run suid root and change a sysctl?
> And what happens when two applications run that do that and they have differing
> requirements? Will they fight over the sysctl?
>

We expect the system administrator to set an overall policy. The administrators
should have some flexibility in deciding how aggressive they want their power
savings to be.

>> For example, a database, a file indexer, etc can predict their workload.
>
>
> A file indexer should run with a high nice level, and low priority would ideally always
> prefer power saving. But it doesn't currently. Perhaps it should?
>

Replace the file indexer with a data warehouse. What if I have several instances of
these workloads running in parallel? The administrator should be able to decide
when to consolidate for power and when to spread for performance.


>> Policies are best known in user land and best controlled from there.
>> Consider a case where the end user might select a performance based policy or a
>> policy to aggressively save power (during peak tariff times). With
>
> How many users are going to do that? Seems like an unrealistic case to me.

Two generic comments about the users part:

1. The fact that we have sched_mc_power_savings is an indication that there are
users trying to use it for power savings
2. Users demand features, but they can only use them once we provide the tunables.

It might seem unrealistic for a one-machine scenario, but consider a data center
hosting thousands of servers. Depending on the utilization, the administrator
might decide to use different policies for different servers.



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-06-27 04:57:27

by Dipankar Sarma

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Thu, Jun 26, 2008 at 11:37:08PM +0200, Andi Kleen wrote:
> Dipankar Sarma wrote:
>
> > Some workload managers already do that - they provision cpu and memory
> > resources based on request rates and response times. Such software is
> > in a better position to make a decision whether they can live with
> > reduced performance due to power saving mode or not. The point I am
> > making is that the kernel doesn't have any notion of transactional
> > performance
>
> The kernel definitely knows about burstiness vs non burstiness at least
> (although it currently has no long term memory for that). Does it need
> more than that for this? Anyways if nice levels were used that is not
> even needed, because it's ok to run niced processes slower.
>
> And your workload manager could just nice processes. It should probably
> do that anyways to tell ondemand you don't need full frequency.

The current usage of this that we are looking at requires system-wide
settings. That means nicing every process running on the system.
That seems a little messy. Secondly, even if you nice the processes
they are still going to be spread all over the CPU packages
running at lower frequencies due to nice. The point I am making
is that it is more effective to push work into a smaller number
of cpu packages and let others go to a low-power sleep state.

> > Agreed. However that is a difficult problem to solve and not the
> > intention of this idea. Global power setting is a simple first step.
> > I don't think we have a good understanding of cases where conflicting
>
> Always the guy who needs the most performance wins? And if only
> niced processes are running it's ok to be slower.
>
> It would be similar to nice levels. In fact nice levels could be probably
> used directly (similar to how ionice coopts them too)
>
> Or another case that already uses it is cpufreq/ondemand: when only niced
> processes run the CPU is not cranked up to the highest frequency.

Using nice, you can force lowering of frequency - but you can do that
using the userspace governor as well - no need to mess with process
priorities. We are talking about a different optimization here - something
that will give more benefits in powersave mode when you have large
systems.

> >>> In small-scale datacenters, peak and off-peak hour settings can be
> >>> potentially done through simple cron jobs.
> >> Is there any real drawback from only controlling it through nice levels?
> >
> > In a system with more than a couple of sockets, it is more beneficial
> > (power-wise) to pack all work into a small number of processors
> > and let the other processors go to very low power sleep. Compared
> > to running tasks slowly and spreading them all over the processors.
>
> You answered a different question?

The point is that grouping tasks into a small number of sockets is
more effective than nicing, which may still spread the tasks all
over the sockets. Think of this as light-weight CPU hotplug:
something that can compact and expand CPU capacity fast and
extends an existing power management interface / logic.

Thanks
Dipankar

2008-06-27 06:23:20

by Vaidyanathan Srinivasan

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

* Andi Kleen <[email protected]> [2008-06-27 00:38:53]:

> Peter Zijlstra wrote:
>
> >> And your workload manager could just nice processes. It should probably
> >> do that anyways to tell ondemand you don't need full frequency.
> >
> > Except that I want my nice 19 distcc processes to utilize as much cpu as
> > possible, but just not bother any other stuff I might be doing...
>
> They already won't do that if you run ondemand and cpufreq. It won't
> crank up the frequency for niced processes.

This may not provide the best power saving if the workload is bursty.
Finishing the job quickly and entering sleep states has a better
impact. This is the race-to-idle problem, where we want to maximise
sleep state utilisation rather than reduce the frequency. The
benefit of this technique is certainly workload specific, and
even in this particular case, running at the lowest frequency is the
safest option from the OS point of view for power savings. For
maximum power savings, however, increasing sleep state utilisation
has the following advantages:

* Sleep states are per core, while voltage and frequency control are
for multiple cores in a multi-core package. Hence frequency change
decisions need to be taken at the package level. Though ondemand
makes the decision based on per-core utilisation and process
priority, the actual effect in hardware is the highest frequency
recommended among all cores; a per-core decision is actually only
a recommendation or a vote (see the sketch below).

* Moving tasks to fewer CPU packages in a multi-socket system
will provide maximum savings since even shared resources on the idle
sockets can be in low power states.
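
As a small hedged sketch of the package-level 'voting' (a user-space
model, not the actual cpufreq code), per-core frequency requests
resolve to a single package frequency, so lowering one core's vote
saves nothing while a sibling core stays busy:

#include <stdio.h>

int main(void)
{
    /* Per-core frequency requests ("votes") in kHz. */
    int core_request_khz[4] = { 800000, 800000, 2400000, 800000 };
    int package_khz = 0;
    int i;

    /* The package is granted the highest per-core request. */
    for (i = 0; i < 4; i++)
        if (core_request_khz[i] > package_khz)
            package_khz = core_request_khz[i];

    printf("package frequency: %d kHz\n", package_khz);
    return 0;
}

This is why per-core sleep states can save power where per-core
frequency votes alone cannot.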

Multi-socket systems with multi-core CPUs have more controls for power
savings than were previously available on single core systems.
Automatically making the right decision is an ideal solution. However,
since there are trade-offs, we would like the users to experiment with
what suits them best. The rationale is similar to why we provide
different cpufreq governors and tunables.

If we discover a good automatic technique to choose the right power
saving strategy that is widely acceptable, then certainly we will go
for it. Can we build the stepping stones to get there? Can we consider
these tunables as enablements for end users to try them out easily
and provide feedback?

>
> Extending that existing policy to socket load balancing would be only
> natural.

Consolidation based on task priority seems to be the challenge here.
However this is a good point. This is certainly a parameter for auto
tuning if only we can overcome the challenges in using priority for
task consolidation.

--Vaidy

2008-06-27 06:48:48

by Vaidyanathan Srinivasan

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

* David Collier-Brown <[email protected]> [2008-06-26 15:37:06]:

> Vaidyanathan Srinivasan wrote:
>> * Andi Kleen <[email protected]> [2008-06-26 20:08:41]:
>>
>>
>>>> A user could be an application and certain applications can predict their
>>>> workload.
>>>
>>> So you expect the applications to run suid root and change a sysctl?
>>> And what happens when two applications run that do that and they have differing
>>> requirements? Will they fight over the sysctl?
>
> There are cases where Oracle does this, to ensure the (critical!) log writer
> isn't starved by cpu-hungry query optimizer processes...

Good, here is an example of the use-case we are proposing ;)

>
>
>> System management software and workload monitoring and managing
>> software can potentially control the tunable on behalf of the
>> applications for best overall power savings and performance.
>>
>> Applications with conflicting goals should resolve among themselves.
>> The application with highest performance requirement should win. The
>> power QoS framework set_acceptable_latency() ensures that the lowest
>> latency set across the system wins. This tunable can also be based on
>> a similar approach.
>
> This is what the IBM zOS "WLM" does: a godlike service runs, records
> the delays of workloads on the system, and then adjusts tuning
> parameters to speed up processes which are running slower than their
> service levels call for, taking the resources from processes which
> are running faster than service agreements require.
>
> Look for goal-directed resource management and "workload manager" in
> Redbooks. Better, ask some of the IBM folks here (;-))

This tunable can certainly be very useful for such WLM software.
However this can be useful in simple system deployments as well. If
the purpose of the system and its workload characteristics are easily
determined and there is little runtime variation, then the
administrator can easily choose the correct tunable.

>
>>>> For example, a database, a file indexer, etc can predict their workload.
>>>
>>>
>>> A file indexer should run with a high nice level, and low priority would ideally always
>>> prefer power saving. But it doesn't currently. Perhaps it should?
>>
>>
>> Power management settings affect the entire system. It may not be
>> based on per application priority or nice value. However if the
>> priority of all the applications currently running in the system
>> indicate power savings, then the kernel can go to a more aggressive power
>> saving state.
>>
>>
>>>> Policies are best known in user land and best controlled from there.
>>>> Consider a case where the end user might select a performance based policy or a
>>>> policy to aggressively save power (during peak tariff times). With
>>>
>>> How many users are going to do that? Seems like an unrealistic case to me.
>
> It's just another policy you could have in your workload management
> set: a friend and I were discussing that just the other day!

A power policy across the datacenter could take into account customer
priority class and the current cost of power (peak vs off-peak time).

>> System management software should do this. Certainly manual
>> intervention to change these settings will not be popular. Given the
>> trends in virtualisation and modular systems, most datacenters will
>> use some form of systems management software and infrastructure that
>> is empowered to make policy based decisions on provisioning and
>> systems configuration.
>>
>> In small-scale datacenters, peak and off-peak hour settings can be
>> potentially done through simple cron jobs.
>>
>> --Vaidy
> -
>
> --dave
> --
> David Collier-Brown | Always do right. This will gratify
> Sun Microsystems, Toronto | some people and astonish the rest
> [email protected] | -- Mark Twain
> (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
> bridge: (877) 385-4099 code: 506 9191#

2008-06-27 07:18:16

by Vaidyanathan Srinivasan

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

* Andi Kleen <[email protected]> [2008-06-26 22:17:00]:

> Vaidyanathan Srinivasan wrote:
>
> Playing devil's advocate here.
>

[...]

> > The
> > power QoS framework set_acceptable_latency() ensures that the lowest
> > latency set across the system wins.
>
> But that only helps kernel drivers, not user space, doesn't it?

Yes, the QoS notification is mainly for kernel drivers, but
applications can control them using the /dev/[...,network_latency,...]
interface as documented in Documentation/power/pm_qos_interface.txt

The device drivers are expected to get feedback (tunable?) from
applications that are dependent on those drivers and set the correct
power saving level. Multimedia applications are expected to make use
of this interface to set/communicate the correct power saving levels
for audio drivers.

Many applications can set different latency requirements, but the lowest
will win. Here the PM-QoS framework in the kernel arbitrates between
applications and resolves conflicts by choosing the lowest latency, i.e.
the most conservative power saving mode.
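
As a concrete illustration, here is a hedged user-space sketch of the
/dev interface described in pm_qos_interface.txt. The 50 usec bound is
an arbitrary example value; the requirement is held while the file
descriptor stays open and withdrawn when it is closed:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t max_latency_us = 50;    /* example latency bound */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);

    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }

    /* Register the requirement as a binary s32 write; the kernel
     * honours the lowest value written by any process. */
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0) {
        perror("write");
        close(fd);
        return 1;
    }

    /* ... latency-sensitive work would run here ... */

    close(fd);    /* closing the fd withdraws the requirement */
    return 0;
}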

--Vaidy

[...]

2008-06-27 07:51:51

by Peter Zijlstra

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Fri, 2008-06-27 at 00:38 +0200, Andi Kleen wrote:
> Peter Zijlstra wrote:
>
> >> And your workload manager could just nice processes. It should probably
> >> do that anyways to tell ondemand you don't need full frequency.
> >
> > Except that I want my nice 19 distcc processes to utilize as much cpu as
> > possible, but just not bother any other stuff I might be doing...
>
> They already won't do that if you run ondemand and cpufreq. It won't
> crank up the frequency for niced processes.
>
> Extending that existing policy to socket load balancing would be only
> natural.

There used to be an option for them to also ramp up on niced load. If that
disappeared then I'd call that a huge usability regression. Basically
making ondemand useless.

/me checks,..

Yeah, on F9, my opteron runs at 1GHz when idle, but when I start distcc,
which like said runs on nice 19, the cpu speed goes up to 2.4GHz.

And it uses the ondemand governor.

2008-06-27 08:03:38

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Dipankar Sarma wrote:
> On Thu, Jun 26, 2008 at 11:37:08PM +0200, Andi Kleen wrote:
>> Dipankar Sarma wrote:
>>
>>> Some workload managers already do that - they provision cpu and memory
>>> resources based on request rates and response times. Such software is
>>> in a better position to make a decision whether they can live with
>>> reduced performance due to power saving mode or not. The point I am
>>> making is that the kernel doesn't have any notion of transactional
>>> performance
>> The kernel definitely knows about burstiness vs non burstiness at least
>> (although it currently has no long term memory for that). Does it need
>> more than that for this? Anyways if nice levels were used that is not
>> even needed, because it's ok to run niced processes slower.
>>
>> And your workload manager could just nice processes. It should probably
>> do that anyways to tell ondemand you don't need full frequency.
>
> The current usage of this that we are looking at requires system-wide
> settings. That means nicing every process running on the system.
> That seems a little messy.

Is it less messy than letting applications negotiate
for the best policy by themselves, as someone else suggested on the thread?

> Secondly, even if you nice the processes
> they are still going to be spread all over the CPU packages
> running at lower frequencies due to nice.

My point was that this could be fixed and you could use nice
(or another per process parameter if you prefer)
as an input to load balancer decisions.

> Using nice, you can force lowering of frequency - but you can do that
> using the userspace governor as well - no need to mess with process
> priorities.


> We are talking about a different optimization here - something
> that will give more benefits in powersave mode when you have large
> systems.

Yes it's a different optimization (although the overall theme -- power saving
-- is the same), but is there a real reason it cannot be driven from the
same per-process heuristics instead of your ugly global sysctl?

>>>>> In small-scale datacenters, peak and off-peak hour settings can be
>>>>> potentially done through simple cron jobs.
>>>> Is there any real drawback from only controlling it through nice levels?
>>> In a system with more than a couple of sockets, it is more beneficial
>>> (power-wise) to pack all work into a small number of processors
>>> and let the other processors go to very low power sleep. Compared
>>> to running tasks slowly and spreading them all over the processors.
>> You answered a different question?
>
> The point is that grouping tasks into a small number of sockets is
> more effective than nicing, which may still spread the tasks all
> over the sockets.

Sorry you completely misunderstood me. I know the principle
behind the socket grouping. And yes it's a different mechanism
from cpu frequency scaling.

My point was just that the heuristics
used by one power saving mechanism (ondemand) could be used
for the other too (socket grouping) -- and it would certainly be
a far saner interface than a global sysctl!

-Andi

2008-06-27 08:06:51

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Peter Zijlstra wrote:

> There used to be an option for them to also ramp up on niced load. If that

You could always force socket power saving mode to off globally too if
you don't want it at all.

> disappeared then I'd call that a huge usability regression. Basically
> making ondemand useless.
>
> /me checks,..
>
> Yeah, on F9, my opteron runs at 1GHz when idle, but when I start distcc,
> which like said runs on nice 19, the cpu speed goes up to 2.4GHz.

Ok distcc is a special case, but it doesn't apply to a lot of other
processes (do you really want your CPU to crank up for "updatedb" or
beagle or some backup job for example?)

Perhaps there should be a way to express this in priorities?
"I am low priority, but want to be work conserving if the system
is idle"

The group scheduler is changing the semantics of nice completely
anyways, so more changes could be applied.

-Andi

2008-06-27 08:19:34

by KOSAKI Motohiro

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Hi

> Advantages:
>
> * Enterprise workloads on large hardware configurations may need
> an aggressive consolidation strategy
> * Performance impact on a server is different from a desktop or laptop.
> Interactivity is less of a concern on large enterprise servers, while
> workload response times and performance per watt are more significant
> * Aggressive power savings even with a marginal performance penalty is
> a useful tunable for servers since it may provide good
> performance-per-watt at low utilisation
> * This tunable can influence other parts of the scheduler like wakeup
> biasing for overall task consolidation

I'd like to know how much power this saves.
If there are only small savings, I think this is not an interesting feature.

What percentage of savings do you expect?


2008-06-27 08:49:10

by Vaidyanathan Srinivasan

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

* KOSAKI Motohiro <[email protected]> [2008-06-27 17:08:22]:

> Hi
>
> > Advantages:
> >
> > * Enterprise workloads on large hardware configurations may need
> > an aggressive consolidation strategy
> > * Performance impact on a server is different from a desktop or laptop.
> > Interactivity is less of a concern on large enterprise servers, while
> > workload response times and performance per watt are more significant
> > * Aggressive power savings even with a marginal performance penalty is
> > a useful tunable for servers since it may provide good
> > performance-per-watt at low utilisation
> > * This tunable can influence other parts of the scheduler like wakeup
> > biasing for overall task consolidation
>
> I'd like to know how much power this saves.
> If there are only small savings, I think this is not an interesting feature.
>
> What percentage of savings do you expect?

The power savings depend on the number of sockets. With present
server hardware, we are seeing very small power savings. However,
deep sleep states and wider variation in CPU power consumption in
the future will increase the percentage. The percentage may be around
1 to 5 percent. Given the system utilisation pattern and the large number
of idle systems in a datacenter, this is not an insignificant number.
The power savings can be significant in a 4-socket or larger system
configuration.

--Vaidy

2008-06-27 12:57:05

by David Collier-Brown

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

KOSAKI Motohiro wrote:
> Hi
>
>
>>Advantages:
>>
>>* Enterprise workloads on large hardware configurations may need
>> an aggressive consolidation strategy
>>* Performance impact on a server is different from a desktop or laptop.
>> Interactivity is less of a concern on large enterprise servers, while
>> workload response times and performance per watt are more significant
>>* Aggressive power savings even with a marginal performance penalty is
>> a useful tunable for servers since it may provide good
>> performance-per-watt at low utilisation
>>* This tunable can influence other parts of the scheduler like wakeup
>> biasing for overall task consolidation
>
>
> I'd like to know how much power this saves.
> If there are only small savings, I think this is not an interesting feature.
>
> What percentage of savings do you expect?
>

An experiment using DVFS on Xeon yielded a 15-watt allowable reduction
even while running a considerable TPC-W workload. Lighter loads allowed
a 40-watt (out of 160) reduction.

--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#

2008-06-28 11:27:45

by Tim Connors

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Andi Kleen <[email protected]> said on Fri, 27 Jun 2008 00:38:53 +0200:
> Peter Zijlstra wrote:
>
> >> And your workload manager could just nice processes. It should probably
> >> do that anyways to tell ondemand you don't need full frequency.
> >
> > Except that I want my nice 19 distcc processes to utilize as much cpu as
> > possible, but just not bother any other stuff I might be doing...
>
> They already won't do that if you run ondemand and cpufreq. It won't
> crank up the frequency for niced processes.

Shouldn't there be a powernice, just as there is an ionice and a nice?
Just as you don't always want CPU priority and IO priority to be
coupled, Peter has just demonstrated a very good case where you don't
want power and CPU choices to be coupled. Whether the ondemand
governor of CPUFreq counts a process as wanting the CPU to run at a
higher speed, and these scheduler decisions, should be controlled by
powernice. By default, perhaps a high powernice should equal a high
nice equal to a high ionice, but the user should be able to change
this. The last thing you want is a distcc process taking up lots of
time, burning more Joules because it runs 10 times longer with only
half the power. It's not a nice choice between that and running at
nice 0 where it interferes with the user's editing.

2008-06-28 11:35:45

by Tim Connors

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Andi Kleen <[email protected]> said on Fri, 27 Jun 2008 10:06:28 +0200:
> Peter Zijlstra wrote:
> > disappeared then I'd call that a huge usability regression. Basically
> > making ondemand useless.
> >
> > /me checks,..
> >
> > Yeah, on F9, my opteron runs at 1GHz when idle, but when I start distcc,
> > which like said runs on nice 19, the cpu speed goes up to 2.4GHz.
>
> Ok distcc is a special case,

No it's not. Most compute heavy jobs most people run would be better
off being done sooner rather than later, otherwise you might as well
go out and buy a 100MHz computer. But most users also want "nice" to
do what was intended of it -- make one app not steal *any* CPU cycles
from another app that would really rather have those CPU cycles right now
(yes, I know that long running CPU jobs theoretically become lower
priority so steal less, and in theory, there is no difference between
theory and practice. But in practice, there is, and these long
running jobs still impact desktop and ssh interactivity)

I end up nicing opera and firefox half the time because I'm sick of
their CPU leaks. It doesn't mean I don't want them to finish their
screen updating sooner.

2008-06-28 11:56:07

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Tim Connors wrote:
> Andi Kleen <[email protected]> said on Fri, 27 Jun 2008 10:06:28 +0200:
>> Peter Zijlstra wrote:
>>> disappeared then I'd call that a huge usability regression. Basically
>>> making ondemand useless.
>>>
>>> /me checks,..
>>>
>>> Yeah, on F9, my opteron runs at 1GHz when idle, but when I start distcc,
>>> which like said runs on nice 19, the cpu speed goes up to 2.4GHz.
>> Ok distcc is a special case,
>
> No it's not.

You're arguing against the current default of ondemand then.

-Andi

2008-06-28 12:26:12

by Matthew Garrett

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Fri, Jun 27, 2008 at 10:06:28AM +0200, Andi Kleen wrote:

> Ok distcc is a special case, but it doesn't apply to a lot of other
> processes (do you really want your CPU to crank up for "updatedb" or
> beagle or some backup job for example?)

If something's CPU-bound, then you almost certainly want to speed the
CPU up. There's no power advantage to leaving it at a low frequency. I'd
be surprised if things like beagle or updatedb are CPU-bound, though.

--
Matthew Garrett | [email protected]

2008-06-28 12:36:20

by Andi Kleen

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Matthew Garrett wrote:
> On Fri, Jun 27, 2008 at 10:06:28AM +0200, Andi Kleen wrote:
>
>> Ok distcc is a special case, but it doesn't apply to a lot of other
>> processes (do you really want your CPU to crank up for "updatedb" or
>> beagle or some backup job for example?)
>
> If something's CPU-bound, then you almost certainly want to speed the
> CPU up. There's no power advantage to leaving it at a low frequency.

I'm not sure you can say it that certainly. While on many standalone systems
"race to idle" is the best strategy, there are cases where it is not
true.

For example, if you're in a data center at a specific operating point and
you would need to crank up the air conditioning at significant power cost, it
might well be better overall to force all servers to a lower operating point
and avoid that.

That said, in general you all should have complained when ondemand behaviour
was introduced.

Also it's unclear that the general "race to idle" heuristic really
applies to the case of the "keep sockets idle" power optimization
that started this thread.

Usually package C states bring much more than core C states
and keeping another package completely idle likely saves
more power than the power cost of running something a little
bit slower on a package that is already busy on another core.

I still think using nice levels for this is reasonable.

-Andi

2008-06-28 12:57:17

by Matthew Garrett

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Sat, Jun 28, 2008 at 02:36:02PM +0200, Andi Kleen wrote:

> For example, if you're in a data center at a specific operating point and
> you would need to crank up the air conditioning at significant power cost, it
> might well be better overall to force all servers to a lower operating point
> and avoid that.

Sure, there are cases where you have additional constraints. But within
those constraints, you probably want to run as fast as possible.

> That said in general you all should have complained when ondemand behaviour
> was introduced.

ignore_nice seems to be set to 0 by default?

> Also it's unclear that the general "race to idle" heuristic really
> applies to the case of the "keep sockets idle" power optimization
> that started this thread.
>
> Usually package C states bring much more than core C states
> and keeping another package completely idle likely saves
> more power than the power cost of running something a little
> bit slower on a package that is already busy on another core.

I'd agree with that.
--
Matthew Garrett | [email protected]

2008-06-29 18:05:46

by David Collier-Brown

Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Andi Kleen <[email protected]> said on Fri, 27 Jun 2008 00:38:53 +0200:
>>Peter Zijlstra wrote:
>>>>And your workload manager could just nice processes. It should probably
>>>>do that anyways to tell ondemand you don't need full frequency.
>>>
>>>Except that I want my nice 19 distcc processes to utilize as much cpu as
>>>possible, but just not bother any other stuff I might be doing...
>>
>>They already won't do that if you run ondemand and cpufreq. It won't
>>crank up the frequency for niced processes.


Tim Connors then wrote:
> Shouldn't there be a powernice, just as there is an ionice and a nice?
Hmmn, how about:

User Commands nice(1)

NAME
nice - invoke a command with an altered priority

SYNOPSIS
/usr/bin/nice [-increment | -n increment] [-s|-i|-e|-p] command [argu-
ment...]

DESCRIPTION
The nice utility invokes command, requesting that it be run
with a different priority. If -i is specified, the priority
of (disk) I/O is modified. If -e is specified, ethernet (or
other networking) priority is changed. If -p is specified, power
usage priority is changed and if -s is specified, or none
of -i, -e or -p is specified, then system scheduling priority
is modified...

--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#

2008-06-30 04:58:23

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

* David Collier-Brown <[email protected]> [2008-06-29 14:02:58]:

> Andi Kleen <[email protected]> said on Fri, 27 Jun 2008 00:38:53 +0200:
>> Peter Zijlstra wrote:
>>>> And your workload manager could just nice processes. It should probably
>>>> do that anyways to tell ondemand you don't need full frequency.
>>>
>>> Except that I want my nice 19 distcc processes to utilize as much cpu as
>>> possible, but just not bother any other stuff I might be doing...
>>
>> They already won't do that if you run ondemand and cpufreq. It won't
>> crank up the frequency for niced processes.
>
>
> Tim Connors then wrote:
>> Shouldn't there be a powernice, just as there is an ionice and a nice?
> Hmmn, how about:
>
> User Commands nice(1)
>
> NAME
> nice - invoke a command with an altered priority
>
> SYNOPSIS
> /usr/bin/nice [-increment | -n increment] [-s|-i|-e|-p] command [argu-
> ment...]
>
> DESCRIPTION
> The nice utility invokes command, requesting that it be run
> with a different priority. If -i is specified, the priority
> of (disk) I/O is modified. If -e is specified, ethernet (or
> other networking) priority is changed. If -p is specified, power
> usage priority is changed and if -s is specified, or none of -1,
> -e or -p is specified, then system scheduling priority
> is modified...

This is good. We are exploring powernice. Generally the cpu, io and
power nice values should be similar: all high or all low. Can we come
up with use cases where we want conflicting nice values for cpu, io
and power?

CPU IO POWER
distcc: low low low
firefox: low high high
ssh/shell: high high high
X: high high low


I am trying to find an answer to the question: should we have the
power saving tunable as a per-process 'nice' value or as a
system-wide setting?

How should we interpret the POWER parameter in a datacenter with power
constraints, as mentioned earlier in this thread? Or in the simple
case of AC vs battery on a laptop?

Thanks,
Vaidy

2008-06-30 07:17:40

by Tim Connors

[permalink] [raw]
Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Mon, 30 Jun 2008, Vaidyanathan Srinivasan wrote:

> * David Collier-Brown <[email protected]> [2008-06-29 14:02:58]:
>
> > Andi Kleen <[email protected]> said on Fri, 27 Jun 2008 00:38:53 +0200:
> >> Peter Zijlstra wrote:
> >>>> And your workload manager could just nice processes. It should probably
> >>>> do that anyways to tell ondemand you don't need full frequency.
> >>>
> >>> Except that I want my nice 19 distcc processes to utilize as much cpu as
> >>> possible, but just not bother any other stuff I might be doing...
> >>
> >> They already won't do that if you run ondemand and cpufreq. It won't
> >> crank up the frequency for niced processes.
> >
> >
> > Tim Connors then wrote:
> >> Shouldn't there be a powernice, just as there is an ionice and a nice?
> > Hmmn, how about:
> >
> > User Commands nice(1)
> >
> > NAME
> > nice - invoke a command with an altered priority
> >
> > SYNOPSIS
> > /usr/bin/nice [-increment | -n increment] [-s|-i|-e|-p] command [argu-
> > ment...]
> >
> > DESCRIPTION
> > The nice utility invokes command, requesting that it be run
> > with a different priority. If -i is specified, the priority
> > of (disk) I/O is modified. If -e is specified, ethernet (or
> > other networking) priority is changed. If -p is specified, power
> > usage priority is changed and if -s is specified, or none of -1,
-i ^^^
> > -e or -p is specified, then system scheduling priority
> > is modified...
>
> This is good. We are exploring powernice. Generally the cpu, io and
> power nice values should be similar: all high or all low. Can we come
> up with use cases where we want conflicting nice values for cpu, io
> and power?
>
> CPU IO POWER
> distcc: low low low
> firefox: low high high
> ssh/shell: high high high
> X: high high low

What's "high" mean? High priority, or high niceness?

Looks like you're referring to priority there. Although, if those are
real examples, then they demonstrate why different people would set
different priorities (I'd say firefox should be both high CPU and
high power nice).

distcc wants to be high CPU "nice" (low CPU priority -- let other
desktop things get done first), but low niceness for power and
probably io (get it over and done with sooner; its IO traffic is
bursty, so it won't interfere much with other IO).

> How should we interpret the POWER parameter in a datacenter with power
> constraints, as mentioned earlier in this thread? Or in the simple
> case of AC vs battery on a laptop?

On laptop battery, background tasks like firefox redrawing crappy
animations get high power nice and high cpu nice (i.e. even if it
were the only thing running and still wanted to chew 100% cpu, it
would only be chewing 100% of an 850MHz clock on my Core2 Duo). My
shell, though, will be running at the default io=cpu=power nice of 0.

In a datacentre running with little load because it's approaching
midnight localtime, let's run the general background tasks at high
power nice, medium cpu nice and medium IO nice. During peak times the
main transaction tasks, running at low power, cpu and io nice, will
be busy, and so the cpus all go up a notch or three. It's not just a
matter of installing powersaved and switching between "performance"
and "ondemand" at fixed times of the day; it's better to adjust
dynamically based on real load.
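
For what it's worth, the "adjust dynamically" part can be
approximated from userspace already. A rough sketch -- assuming the
usual cpufreq sysfs layout, a single policy driven through cpu0, and
completely arbitrary thresholds:

/* govswitch.c -- sketch: pick a cpufreq governor from the load average.
 * The sysfs path and the threshold are assumptions for illustration.
 */
#define _DEFAULT_SOURCE		/* for getloadavg() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define GOV_FILE "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

static void set_governor(const char *gov)
{
	FILE *f = fopen(GOV_FILE, "w");

	if (!f) {
		perror(GOV_FILE);
		return;
	}
	fprintf(f, "%s\n", gov);
	fclose(f);
}

int main(void)
{
	double load1;

	for (;;) {
		/* busy box: race to idle; quiet box: let ondemand save power */
		if (getloadavg(&load1, 1) == 1)
			set_governor(load1 > 4.0 ? "performance" : "ondemand");
		sleep(60);
	}
}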

--
Tim Connors

2008-06-30 14:21:24

by David Collier-Brown

[permalink] [raw]
Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

Vaidyanathan Srinivasan wrote:
> I am trying to find an answer to the question: should we have the
> power saving tunable as a per-process 'nice' value or as a
> system-wide setting?
>
> How should we interpret the POWER parameter in a datacenter with power
> constraints, as mentioned earlier in this thread? Or in the simple
> case of AC vs battery on a laptop?

I agree with Tim re setting them all independently, and suggest that
they're all really per-process values: setting power saving
system-wide is meaningful, but so are individual settings.
There is therefore an argument for making them subsets of
a higher-level nice program.

Mind you, the order in which one *implements* the capability,
and whether one does powernice first and adds it to nice later
is your call! I have no idea of how hard what I suggested is (;-))

--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#

2008-06-30 14:31:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

David Collier-Brown wrote:
> Vaidyanathan Srinivasan wrote:
>> I am trying to find an answer to the question: should we have the
>> power saving tunable as a per-process 'nice' value or as a
>> system-wide setting?
>>
>> How should we interpret the POWER parameter in a datacenter with power
>> constraints, as mentioned earlier in this thread? Or in the simple
>> case of AC vs battery on a laptop?
>
> I agree with Tim re setting them all independently,

I agree that powernice is likely a good idea (although the semantics
are not 100% clear yet), but there's still the issue (shared with
ionice) that 99.99+% of all setups won't set powernice explicitly,
so you still need a reasonable default when it is not set.

Methinks the correct strategy would be something like this:

- When powernice is set, prefer it.
- For the idle-socket optimization: use nice, because it's
unclear that "race to idle" applies here.
- For ondemand: when nice is set, behave more like the conservative
governor and take longer to crank up [this might be controversial].
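
A minimal sketch of that precedence, assuming a hypothetical per-task
power_nice value with an explicit unset state (neither the field nor
the constant exists today):

/* Sketch only: power_nice and POWER_NICE_UNSET are assumptions,
 * not existing kernel interfaces.
 */
#define POWER_NICE_UNSET	(-128)

struct task_policy {
	int cpu_nice;		/* -20..19, as today */
	int power_nice;		/* explicit setting, or POWER_NICE_UNSET */
};

static int effective_power_nice(const struct task_policy *p)
{
	if (p->power_nice != POWER_NICE_UNSET)
		return p->power_nice;	/* explicit powernice wins */

	/* Fall back to CPU nice for the idle-socket optimization,
	 * since "race to idle" may not apply across packages; ondemand
	 * could additionally ramp up more slowly for positive values.
	 */
	return p->cpu_nice;
}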

Also, are the best powernice semantics the same for idle
sockets and ondemand? I'm not sure.


> and suggest that
> they're all really per-process values: setting power saving system-wide
> is meaningful, but so are individual settings.
> There is therefore an argument for making them subsets of
> a higher-level nice program.
>
> Mind you, the order in which one *implements* the capability,
> and whether one does powernice first and adds it to nice later
> is your call! I have no idea of how hard what I suggested is (;-))

In general, for Linux deployment it tends to be easier
to provide another package with its own command instead of
patching a core package like coreutils.

With its own package you can just tell the user
"type (yum|zypper|apt-get|...) install powernice",
while an updated coreutils tends to be more trouble or may even
require a distribution update.

-Andi

2008-06-30 16:14:49

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [RFC v1] Tunable sched_mc_power_savings=n

On Fri, Jun 27, 2008 at 10:03:06AM +0200, Andi Kleen wrote:
> Dipankar Sarma wrote:
> > On Thu, Jun 26, 2008 at 11:37:08PM +0200, Andi Kleen wrote:
> >> Dipankar Sarma wrote:
> >>
> > The current usage of this we are looking requires system-wide
> > settings. That means nicing every process running on the system.
> > That seems a little messy.
>
> Is it less messy than letting applications negotiate for the best
> policy by themselves, as someone else suggested on the thread?

I don't think letting applications negotiate among
themselves is a good idea. The kernel should do that.

> > Secondly, even if you nice the processes
> > they are still going to be spread all over the CPU packages
> > running at lower frequencies due to nice.
>
> My point was that this could be fixed and you could use nice
> (or another per process parameter if you prefer)
> as an input to load balancer decisions.

Agreed. A variation of this that allows tasks to indicate their CPU
power requirement is something that we experimented with long ago.
There are some difficult issues that need to be sorted out if this
is to be effective:

1. For some applications, like xmms, the requirement is easy to
predict. For commercial workloads, like a database, it is hard
to get right.

2. Conflicting power requirements are hard to resolve. Grouping
tasks based on various combinations of power requirements
is complex.

3. Setting a global policy is expensive - you have to loop through
all the tasks in the system.
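
To illustrate #3 concretely: with per-task state, one system-wide
change becomes a walk over every task. A kernel-style sketch --
for_each_process() and tasklist_lock are real, but the power_nice
field is purely hypothetical:

/* Sketch only: ->power_nice is a hypothetical per-task field,
 * not something that exists in the kernel today.
 */
#include <linux/sched.h>

static void set_global_power_nice(int value)
{
	struct task_struct *p;

	read_lock(&tasklist_lock);
	for_each_process(p)		/* O(number of tasks) per knob change */
		p->power_nice = value;	/* hypothetical field */
	read_unlock(&tasklist_lock);
}

A single global tunable, by contrast, is one store that the load
balancer can read at decision time.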

> > We are talking about a different optimization here - something
> > that will give more benefits in powersave mode when you have large
> > systems.
>
> Yes it's a different optimization (although the overall theme -- power
> saving -- is the same), but is there a real reason it cannot be driven
> from the same per-process heuristics instead of your ugly global sysctl?

See issues #1 and #2 above. Apart from that, what we discovered is
that server admins really want a global setting at the moment; any
finer granularity would be wasted on them for now. No one is really
looking at running php+mysql at one powernice level and tomcat at
another *on the same server*.


> My point was just that the heuristics
> used by one power saving mechanism (ondemand) could be used
> for the other too (socket grouping) -- and it would certainly be
> a far saner interface than a global sysctl!

Per-task settings were the first thing we looked at when we
started out. I think we should experiment with them and see
if we can come up with a simple implementation that handles
conflicting requirements well. If this can also handle global
system power settings without having to loop through all the
tasks in the system, I am OK with it.


Thanks
Dipankar