2009-04-26 20:46:38

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

Hi,

The sched_mc_power_savings tunable can be set to {0,1,2} to enable
aggressive task consolidation onto fewer cpu packages and save
power. Under certain conditions, sched_mc=2 may provide better
performance in an underutilised system by keeping the group of tasks on
a single cpu package, facilitating cache sharing and reducing off-chip
traffic.

Extending this concept further, the following patch series tries to
implement sched_mc={3,4,5}, where CPUs/cores are forced to stay idle,
thereby saving power at the cost of performance. Some of the cpu
packages in the system are overloaded with tasks while other packages
are left with free cpus. This patch is a hack to discuss the idea and
requirements.
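
Usage note (not part of the patches): the tunable is exposed through sysfs
on kernels built with CONFIG_SCHED_MC, so the proposed levels would be set
the same way as the existing ones. A minimal user-space sketch, assuming
the standard sysfs location of sched_mc_power_savings:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch only: write the requested sched_mc level (0-5)
 * to the sysfs tunable and report whether it could be set.
 */
static int set_sched_mc(int level)
{
        FILE *f = fopen("/sys/devices/system/cpu/sched_mc_power_savings", "w");

        if (!f)
                return -1;
        fprintf(f, "%d\n", level);
        return fclose(f);
}

int main(int argc, char **argv)
{
        int level = (argc > 1) ? atoi(argv[1]) : 0;

        return set_sched_mc(level) ? EXIT_FAILURE : EXIT_SUCCESS;
}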

Objective:
----------

* Framework to evacuate tasks from cpus in order to force the cpu
cores to stay idle

* Interrupts can be moved using user space irqbalancer daemons, while
timer migration framework is being discussed:
http://lkml.org/lkml/2009/4/16/45

* Forcefully idling cpu cores in a system will reduce the power
consumption of the system and also cool cpu packages for thermal
management

Requirements:
------------

* Fast response time and low OS overhead to move tasks away from
selected cpu packages. CPU hotplug is too heavyweight for this
purpose

Use cases:
---------

* Enabling the right number of cpus to run the given workload can
provide good power vs performance tradeoffs.

* Ability to throttle the number of cores used in the system, along
with other power saving controls like cpufreq governors, can enable
the system to operate at a more power-efficient operating point and
still meet the design objectives.

* Facilitate thermal management by evacuating cores from hot cpu packages

Alternatives:
-------------

* CPU hotplug: Heavyweight and slow; involves setting up and tearing
down data structures. May need new fast or lightweight
notifications

* CPUSets: Exclusive CPU sets and partitioned sched domains involve
rebuilding sched domains and are relatively heavyweight for this purpose

The following patch is against 2.6.30-rc3 and will work only in
an under-utilised system (tasks <= number of cores).

Test results for ebizzy with 8 threads at various sched_mc settings have
been summarised with relative values below. The test platform is a
dual-socket quad-core x86 system (pre-Nehalem).

--------------------------------------------------------------
sched_mc    Cores used    Performance        AvgPower
                          (Records/sec)      (Watts)
--------------------------------------------------------------
    0            8            1.00x            1.00y
    1            8            1.02x            1.01y
    2            8            0.83x            1.01y
    3            7            0.86x            0.97y
    4            6            0.76x            0.92y
    5            4            0.72x            0.82y
--------------------------------------------------------------

There was wide run-to-run variation with ebizzy. The purpose of the above
data is to justify the use of core evacuation for power vs performance
trade-offs.

ToDo:
-----

* Make the core evacuation predictable under different system load
conditions and workload characteristics
* Enhance the framework to control which packages/cores will be
evacuated; this is needed for thermal management

I can experiment with different benchmarks/platforms and post results
while the framework is being discussed.

Please let me know your comments and suggestions.

Thanks,
Vaidy

---

Vaidyanathan Srinivasan (3):
sched: loadbalancer hacks for forced packing of tasks
sched: threshold helper functions
sched: add more levels of sched_mc


include/linux/sched.h | 4 ++++
kernel/sched.c | 35 ++++++++++++++++++++++++++++++++++-
2 files changed, 38 insertions(+), 1 deletions(-)

--


2009-04-26 20:46:55

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v1 1/3] sched: add more levels of sched_mc

Add a few more levels to sched_mc for cpu evacuation.
These levels will try to keep CPU cores free in order
to reduce power consumption.

sched_mc=3 to 5 enables cpu evacuation

*** This is an RFC patch for discussion ***

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..8b27295 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -788,6 +788,10 @@ enum powersavings_balance_level {
POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
* cpu package for power savings
*/
+ POWERSAVINGS_INCREASE_GROUP_CAPACITY_1, /* 1*imbalance_pct = 125% */
+ POWERSAVINGS_INCREASE_GROUP_CAPACITY_2, /* 2*imbalance_pct = 150% */
+ POWERSAVINGS_INCREASE_GROUP_CAPACITY_3, /* 4*imbalance_pct = 200% */
+
MAX_POWERSAVINGS_BALANCE_LEVELS
};

2009-04-26 20:47:20

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v1 2/3] sched: threshold helper functions

Define the group capacity threshold as a multiple of the
imbalance percentage at higher sched_mc settings.

sched_mc=3 Group capacity increased by 25% (5 tasks on quad core)
sched_mc=4 Group capacity increased by 50% (6 tasks on quad core)
sched_mc=5 Group capacity increased by 100% (8 tasks on quad core)
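
As a stand-alone illustration of the arithmetic above (assuming the MC
domain's imbalance_pct is 125, which is an assumption and not taken from
the patch), the bump and resulting quad-core capacity work out as follows:

#include <stdio.h>

/* Sketch of the capacity bump arithmetic for a quad-core group.
 * Effective group capacity in tasks is nr_cores * bump_pct / 100.
 */
static unsigned int bump_pct(int sched_mc, unsigned int imbalance_pct)
{
        if (sched_mc >= 5)      /* POWERSAVINGS_INCREASE_GROUP_CAPACITY_3 */
                return 100 + (imbalance_pct - 100) * 4;
        if (sched_mc >= 4)      /* POWERSAVINGS_INCREASE_GROUP_CAPACITY_2 */
                return 100 + (imbalance_pct - 100) * 2;
        if (sched_mc >= 3)      /* POWERSAVINGS_INCREASE_GROUP_CAPACITY_1 */
                return imbalance_pct;
        return 100;
}

int main(void)
{
        int level;

        for (level = 2; level <= 5; level++)
                printf("sched_mc=%d: bump=%u%%, quad-core capacity=%u tasks\n",
                       level, bump_pct(level, 125),
                       4 * bump_pct(level, 125) / 100);
        return 0;
}

This prints 4, 5, 6 and 8 tasks for sched_mc=2..5, matching the 5/6/8-task
figures above.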

*** RFC patch for discussion ***

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..f88ed04 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3291,6 +3291,21 @@ static inline int get_sd_load_idx(struct sched_domain *sd,


#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+
+static inline unsigned int group_capacity_bump_pct(struct sched_domain *sd)
+{
+ if (sched_mc_power_savings >= POWERSAVINGS_INCREASE_GROUP_CAPACITY_3)
+ return 100+(sd->imbalance_pct-100)*4;
+
+ if (sched_mc_power_savings >= POWERSAVINGS_INCREASE_GROUP_CAPACITY_2)
+ return 100+(sd->imbalance_pct-100)*2;
+
+ if (sched_mc_power_savings >= POWERSAVINGS_INCREASE_GROUP_CAPACITY_1)
+ return sd->imbalance_pct;
+
+ return 100;
+}
+
/**
* init_sd_power_savings_stats - Initialize power savings statistics for
* the given sched_domain, during load balancing.
@@ -3433,6 +3448,12 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
{
return 0;
}
+
+static inline unsigned int group_capacity_bump_pct(struct sched_domain *sd)
+{
+ return 100;
+}
+
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */


2009-04-26 20:47:38

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v1 3/3] sched: loadbalancer hacks for forced packing of tasks

Pack more tasks into a group so as to reduce the number of CPUs
used to run the work in the system.

For load-balancing purposes only, assume the group capacity
has been increased by group_capacity_bump_pct().

Hacks:

o Make non-idle cpus also perform powersave balance so
that we can pull more tasks into the group
o Increase the group capacity used in the calculation
o Increase the load-balancing threshold so that even if a
group is loaded up to group_capacity_bump_pct, it is
considered balanced (see the sketch after this list)
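
To make the last hack concrete, here is a stand-alone sketch of the extra
out_balanced test (assuming SCHED_LOAD_SCALE == 1024, as on this kernel):
the busiest group is treated as balanced until its per-cpu load exceeds
group_capacity_bump percent of SCHED_LOAD_SCALE, so a packed group is not
unloaded.

#define SCHED_LOAD_SCALE        1024UL

/* Sketch of the extra balance check added in find_busiest_group().
 * With a 150% bump, a busiest-group per-cpu load of up to 1536 is
 * still considered balanced.
 */
static int considered_balanced(unsigned long max_load,
                               unsigned int group_capacity_bump)
{
        return 100 * max_load <= group_capacity_bump * SCHED_LOAD_SCALE;
}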

*** RFC patch for discussion ***

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 14 +++++++++++++-
1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f88ed04..b20dbcb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3234,6 +3234,7 @@ struct sd_lb_stats {
int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
int power_savings_balance; /* Is powersave balance needed for this sd */
+ unsigned int group_capacity_bump; /* % increase in group capacity */
struct sched_group *group_min; /* Least loaded group in sd */
struct sched_group *group_leader; /* Group which relieves group_min */
unsigned long min_load_per_task; /* load_per_task in group_min */
@@ -3321,12 +3322,16 @@ static inline void init_sd_power_savings_stats(struct sched_domain *sd,
* Busy processors will not participate in power savings
* balance.
*/
- if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
+ if ((idle == CPU_NOT_IDLE &&
+ sched_mc_power_savings <
+ POWERSAVINGS_INCREASE_GROUP_CAPACITY_1) ||
+ !(sd->flags & SD_POWERSAVINGS_BALANCE))
sds->power_savings_balance = 0;
else {
sds->power_savings_balance = 1;
sds->min_nr_running = ULONG_MAX;
sds->leader_nr_running = 0;
+ sds->group_capacity_bump = group_capacity_bump_pct(sd);
}
}

@@ -3586,6 +3591,9 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,

if (local_group && balance && !(*balance))
return;
+ /* Bump up group capacity for forced packing of tasks */
+ sgs.group_capacity = sgs.group_capacity *
+ sds->group_capacity_bump / 100;

sds->total_load += sgs.group_load;
sds->total_pwr += group->__cpu_power;
@@ -3786,6 +3794,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
goto out_balanced;

+ /* Push the upper limits for overload */
+ if (100 * sds.max_load <= sds.group_capacity_bump * SCHED_LOAD_SCALE)
+ goto out_balanced;
+
sds.busiest_load_per_task /= sds.busiest_nr_running;
if (sds.group_imb)
sds.busiest_load_per_task =

2009-04-27 03:52:53

by Ingo Molnar

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n


* Vaidyanathan Srinivasan <[email protected]> wrote:

> Test results for ebizzy 8 threads at various sched_mc settings has
> been summarised with relative values below. The test platform is
> dual socket quad core x86 system (pre-Nehalem).
>
> --------------------------------------------------------
> sched_mc No Cores Performance AvgPower
> used Records/sec (Watts)
> --------------------------------------------------------
> 0 8 1.00x 1.00y
> 1 8 1.02x 1.01y
> 2 8 0.83x 1.01y
> 3 7 0.86x 0.97y
> 4 6 0.76x 0.92y
> 5 4 0.72x 0.82y
> --------------------------------------------------------

Looks like we want the kernel default to be sched_mc=1 ?

Regarding the values for 2...5 - is the AvgPower column time
normalized or workload normalized?

If it's time normalized then it appears there's no power win here at
all: we'd be better off by throttling the workload directly (by
injecting sleeps or something like that), right?

Ingo

2009-04-27 05:43:58

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

* Ingo Molnar <[email protected]> [2009-04-27 05:52:16]:

>
> * Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > Test results for ebizzy 8 threads at various sched_mc settings has
> > been summarised with relative values below. The test platform is
> > dual socket quad core x86 system (pre-Nehalem).
> >
> > --------------------------------------------------------
> > sched_mc No Cores Performance AvgPower
> > used Records/sec (Watts)
> > --------------------------------------------------------
> > 0 8 1.00x 1.00y
> > 1 8 1.02x 1.01y
> > 2 8 0.83x 1.01y
> > 3 7 0.86x 0.97y
> > 4 6 0.76x 0.92y
> > 5 4 0.72x 0.82y
> > --------------------------------------------------------
>
> Looks like we want the kernel default to be sched_mc=1 ?

Hi Ingo,

Yes, sched_mc wins for a simple cpu-bound workload like this. But the
challenge is that the best setting depends on the workload and the
system configuration. This leads me to think that the default setting
should be left with the distros, where we can factor in various
parameters and choose the right default from user space.


> Regarding the values for 2...5 - is the AvgPower column time
> normalized or workload normalized?

The AvgPower is time normalised, just the power value divided by the
baseline at sched_mc=0.

> If it's time normalized then it appears there's no power win here at
> all: we'd be better off by throttling the workload directly (by
> injecting sleeps or something like that), right?

Yes, there is no power win when comparing against peak benchmark
throughput in this case. However, more complex workload setups may not
show similar characteristics because they do not depend only on
CPU bandwidth for their peak performance.

* Reduction in cpu bandwidth may not directly translate to performance
reduction on complex workloads
* Even if there is degradation, the system may still meet the design
objectives. A 20-30% increase in response time over a 1-second
nominal value may be acceptable in most cases
* End users can tie application priority to such a tunable so that we
get power savings from low-priority applications
* Reducing average power consumption at a given point may save money
for datacenter operators based on differential power costs
* Reducing average power reduces heat and provides greater savings
from the cooling infrastructure
* This framework can be used for thermal leveling on larger
under-utilised machines to keep the overall temperature low and save
leakage power

Here, we would like end users and datacenter management software to
have fine grain steps to trade performance for power savings rather
than switching off servers and reducing application availability.

Your suggestion of throttling applications to achieve the same goal is
valid, but has the following limitations:

* The framework will be application dependent, and the level of
throttling required to evacuate cores is variable
* We get the best power savings when the granularity of control is at
a core level first and at a package level next (perhaps node level
also)
* Throttled applications may still not choose the most power-efficient
combination of cores to run on
* Having a framework to evacuate cores in the OS helps provide the
right granularity of control

The overall objective is to let users pick the right number of cores to
run the job and allow the kernel to choose the most power-efficient
combination of cores to run the job on.

sched_mc={1,2} will allow the kernel to pick the most power-efficient
combination of cores to run the workload, while sched_mc={3,4,5} lets
the user control the number of cores to use or evacuate.

--Vaidy

2009-04-27 05:54:24

by Ingo Molnar

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n


* Vaidyanathan Srinivasan <[email protected]> wrote:

> > > --------------------------------------------------------
> > > sched_mc No Cores Performance AvgPower
> > > used Records/sec (Watts)
> > > --------------------------------------------------------
> > > 0 8 1.00x 1.00y
> > > 1 8 1.02x 1.01y
> > > 2 8 0.83x 1.01y
> > > 3 7 0.86x 0.97y
> > > 4 6 0.76x 0.92y
> > > 5 4 0.72x 0.82y
> > > --------------------------------------------------------
> >
> > Looks like we want the kernel default to be sched_mc=1 ?
>
> Hi Ingo,
>
> Yes, sched_mc wins for a simple cpu bound workload like this. But
> the challenge is that the best settings depends on the workload
> and the system configuration. This leads me to think that the
> default setting should be left with the distros where we can
> factor in various parameters and choose the right default from
> user space.
>
>
> > Regarding the values for 2...5 - is the AvgPower column time
> > normalized or workload normalized?
>
> The AvgPower is time normalised, just the power value divided by
> the baseline at sched_mc=0.
>
> > If it's time normalized then it appears there's no power win
> > here at all: we'd be better off by throttling the workload
> > directly (by injecting sleeps or something like that), right?
>
> Yes, there is no power win when comparing with peak benchmark
> throughput in this case. However more complex workload setup may
> not show similar characteristics because they are not dependent
> only on CPU bandwidth for their peak performance.
>
> * Reduction in cpu bandwidth may not directly translate to performance
> reduction on complex workloads
> * Even if there is degradation, the system may still meet the design
> objectives. 20-30% increase in response time over a 1 second
> nominal value may be acceptable in most cases

But ... we could probably get a _better_ (near linear) slowdown by
injecting wait cycles into the workload.

I.e. we should only touch balancing if there's a _genuine_ power
saving: i.e. less power is used for the same throughput.

The numbers in the table show a plain slowdown: doing fewer
transactions means less power used. But that is trivial to achieve
for a CPU-bound workload: throttle the workload. I.e. inject less
work, save power.

And if we want to throttle 'transparently', from the kernel, we
should do it not via an artificial open-ended scale of
sched_mc=2,3,4,5... - we should do it via a _percentage_ value.

I.e. a system setting that says "at most utilize the system 80% of
its peak capacity". That can be implemented by the kernel injecting
small delays or by intentionally not scheduling on certain CPUs (but
not delaying tasks - forcing them to other cpus in essence).

Ingo

2009-04-27 05:55:22

by Dipankar Sarma

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

On Mon, Apr 27, 2009 at 05:52:16AM +0200, Ingo Molnar wrote:
>
> Regarding the values for 2...5 - is the AvgPower column time
> normalized or workload normalized?
>
> If it's time normalized then it appears there's no power win here at
> all: we'd be better off by throttling the workload directly (by
> injecting sleeps or something like that), right?

Energy savings with this will depend on the workload running. We have
seen transactional workloads where taking off a few cores has almost
no impact on throughput or response time.

Thanks
Dipankar

2009-04-27 06:39:37

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

* Ingo Molnar <[email protected]> [2009-04-27 07:53:47]:

>
> * Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > > > --------------------------------------------------------
> > > > sched_mc No Cores Performance AvgPower
> > > > used Records/sec (Watts)
> > > > --------------------------------------------------------
> > > > 0 8 1.00x 1.00y
> > > > 1 8 1.02x 1.01y
> > > > 2 8 0.83x 1.01y
> > > > 3 7 0.86x 0.97y
> > > > 4 6 0.76x 0.92y
> > > > 5 4 0.72x 0.82y
> > > > --------------------------------------------------------
> > >
> > > Looks like we want the kernel default to be sched_mc=1 ?
> >
> > Hi Ingo,
> >
> > Yes, sched_mc wins for a simple cpu bound workload like this. But
> > the challenge is that the best settings depends on the workload
> > and the system configuration. This leads me to think that the
> > default setting should be left with the distros where we can
> > factor in various parameters and choose the right default from
> > user space.
> >
> >
> > > Regarding the values for 2...5 - is the AvgPower column time
> > > normalized or workload normalized?
> >
> > The AvgPower is time normalised, just the power value divided by
> > the baseline at sched_mc=0.
> >
> > > If it's time normalized then it appears there's no power win
> > > here at all: we'd be better off by throttling the workload
> > > directly (by injecting sleeps or something like that), right?
> >
> > Yes, there is no power win when comparing with peak benchmark
> > throughput in this case. However more complex workload setup may
> > not show similar characteristics because they are not dependent
> > only on CPU bandwidth for their peak performance.
> >
> > * Reduction in cpu bandwidth may not directly translate to performance
> > reduction on complex workloads
> > * Even if there is degradation, the system may still meet the design
> > objectives. 20-30% increase in response time over a 1 second
> > nominal value may be acceptable in most cases
>
> But ... we could probably get a _better_ (near linear) slowdown by
> injecting wait cycles into the workload.

We have an advantage when complete cpu packages are left unused, as
opposed to just injecting idle time on all cores.

> I.e. we should only touch balancing if there's a _genuine_ power
> saving: i.e. less power is used for the same throughput.

The load balancer knows the cpu package topology and, in essence, knows the
most power-efficient combinations of cores to use. If we have to
schedule on 4 cores in an 8-core system, the load balancer can pick the
right combination.

> The numbers in the table show a plain slowdown: doing fewer
> transactions means less power used. But that is trivial to achieve
> for a CPU-bound workload: throttle the workload. I.e. inject less
> work, save power.

Agreed, this example does not show the best use case for this
feature; however, we can easily verify experimentally that targeted
evacuation of cores can provide better performance-per-watt than
plain throttling to reduce utilisation.

> And if we want to throttle 'transparently', from the kernel, we
> should do it not via an artificial open-ended scale of
> sched_mc=2,3,4,5... - we should do it via a _percentage_ value.

Yes, we want to throttle transparently from the kernel at core-level
granularity.

Having a percentage value that takes discrete steps based on the
number of cores in the system is a good idea. I will switch the
parameter to a percentage in the next iteration.
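
As a rough idea of how such a percentage could map onto whole cores (the
name and the rounding here are hypothetical; nothing in the current patches
implements this):

/* Hypothetical sketch: translate "use at most pct% of peak capacity"
 * into a whole number of cores to keep schedulable.  Rounds up so the
 * available capacity never drops below the requested fraction, and
 * always keeps at least one core.
 */
static unsigned int cores_to_keep(unsigned int total_cores, unsigned int pct)
{
        unsigned int n = (total_cores * pct + 99) / 100;

        return n ? n : 1;
}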

> I.e. a system setting that says "at most utilize the system 80% of
> its peak capacity". That can be implemented by the kernel injecting
> small delays or by intentionally not scheduling on certain CPUs (but
> not delaying tasks - forcing them to other cpus in essence).

Advances in hardware power management, like very-low-power deep sleep
states and further package-level power savings when all cores are idle,
change the above assumption.

Uniformly adding delays on all CPUs provides far less power savings
than not using one core or one complete package. Evacuating a
core/package essentially shuts it off, as compared to very short
bursts of idle time.

If we can accumulate all such idle times to a single core, with little
effect on fairness, we get better power savings for the same amount of
idle time or utilisation.

Agreed that this is a coarse granularity compared to injecting delays,
but this will become practical as core density increases in
enterprise processor designs.

--Vaidy

2009-04-27 07:02:49

by Balbir Singh

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

* Vaidyanathan Srinivasan <[email protected]> [2009-04-27 12:09:03]:

> * Ingo Molnar <[email protected]> [2009-04-27 07:53:47]:
>
> >
> > * Vaidyanathan Srinivasan <[email protected]> wrote:
> >
> > > > > --------------------------------------------------------
> > > > > sched_mc No Cores Performance AvgPower
> > > > > used Records/sec (Watts)
> > > > > --------------------------------------------------------
> > > > > 0 8 1.00x 1.00y
> > > > > 1 8 1.02x 1.01y
> > > > > 2 8 0.83x 1.01y
> > > > > 3 7 0.86x 0.97y
> > > > > 4 6 0.76x 0.92y
> > > > > 5 4 0.72x 0.82y
> > > > > --------------------------------------------------------
> > > >
> > > > Looks like we want the kernel default to be sched_mc=1 ?
> > >
> > > Hi Ingo,
> > >
> > > Yes, sched_mc wins for a simple cpu bound workload like this. But
> > > the challenge is that the best settings depends on the workload
> > > and the system configuration. This leads me to think that the
> > > default setting should be left with the distros where we can
> > > factor in various parameters and choose the right default from
> > > user space.
> > >
> > >
> > > > Regarding the values for 2...5 - is the AvgPower column time
> > > > normalized or workload normalized?
> > >
> > > The AvgPower is time normalised, just the power value divided by
> > > the baseline at sched_mc=0.
> > >
> > > > If it's time normalized then it appears there's no power win
> > > > here at all: we'd be better off by throttling the workload
> > > > directly (by injecting sleeps or something like that), right?
> > >
> > > Yes, there is no power win when comparing with peak benchmark
> > > throughput in this case. However more complex workload setup may
> > > not show similar characteristics because they are not dependent
> > > only on CPU bandwidth for their peak performance.
> > >
> > > * Reduction in cpu bandwidth may not directly translate to performance
> > > reduction on complex workloads
> > > * Even if there is degradation, the system may still meet the design
> > > objectives. 20-30% increase in response time over a 1 second
> > > nominal value may be acceptable in most cases
> >
> > But ... we could probably get a _better_ (near linear) slowdown by
> > injecting wait cycles into the workload.
>
> We have advantages when complete cpu packages are not used as opposed
> to just injecting idle time in all cores.
>
> > I.e. we should only touch balancing if there's a _genuine_ power
> > saving: i.e. less power is used for the same throughput.
>
> Load balancer knows the cpu package topology and in essence knows the
> most power efficient combinations of cores to use. If we have to
> schedule on 4 cores in a 8 core system, the load balancer can pick the
> right combination.
>
> > The numbers in the table show a plain slowdown: doing fewer
> > transactions means less power used. But that is trivial to achieve
> > for a CPU-bound workload: throttle the workload. I.e. inject less
> > work, save power.
>
> Agreed, this example does not show the best use case for this
> feature, however we can easily experimentally verify that targeted
> evacuation of cores can provide better performance-per-watt as
> compared to plain throttling to reduce utilisation.
>

We have throttling in the form of P-states, so that infrastructure
already exists, albeit in hardware. We want to go one step further
with targeted evacuation.

> > And if we want to throttle 'transparently', from the kernel, we
> > should do it not via an artificial open-ended scale of
> > sched_mc=2,3,4,5... - we should do it via a _percentage_ value.
>
> Yes we want to transparently throttle from the kernel at a core level
> granularity.
>
> Having a percentage value that can take discrete steps based on the
> number of cores in the system is a good idea. I will switch the
> parameter to percentage in the next iteration.
>
> > I.e. a system setting that says "at most utilize the system 80% of
> > its peak capacity". That can be implemented by the kernel injecting
> > small delays or by intentionally not scheduling on certain CPUs (but
> > not delaying tasks - forcing them to other cpus in essence).
>
> Advances in hardware power management like very low power deep sleep
> states and further package level power savings when all cores are idle
> changes the above assumption.
>
> Uniformly adding delays on all CPUs provide far less power savings as
> compared to not using one core or one complete package. Evacuating
> core/package essentially shuts them off as compared to very short
> bursts of idle times.
>
> If we can accumulate all such idle times to a single core, with little
> effect on fairness, we get better power savings for the same amount of
> idle time or utilisation.
>
> Agreed that this is a coarse granularity compared to injecting delay,
> but this will become practical as the core density increase in the
> enterprise processor design.

Apart from increasing core density, per-core power management is becoming
more mature, so evacuating cores is becoming an attractive
proposition.

--
Balbir

2009-04-27 10:10:13

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

On Mon, 2009-04-27 at 02:16 +0530, Vaidyanathan Srinivasan wrote:
> Hi,
>
> The sched_mc_powersavings tunable can be set to {0,1,2} to enable
> aggressive task consolidation to less number of cpu packages and save
> power. Under certain conditions, sched_mc=2 may provide better
> performance in a underutilised system by keeping the group of tasks on
> a single cpu package facilitating cache sharing and reduced off-chip
> traffic.
>
> Extending this concept further, the following patch series tries to
> implement sched_mc={3,4,5} where CPUs/cores are forced to be idle and
> thereby save power at the cost of performance. Some of the cpu
> packages in the system are overloaded with tasks while other packages
> can have free cpus. This patch is a hack to discuss the idea and
> requirements.
>
> Objective:
> ----------
>
> * Framework to evacuate tasks from cpus in order to force the cpu
> cores to stay at idle
>
> * Interrupts can be moved using user space irqbalancer daemons, while
> timer migration framework is being discussed:
> http://lkml.org/lkml/2009/4/16/45
>
> * Forcefully idling cpu cores in a system will reduce the power
> consumption of the system and also cool cpu packages for thermal
> management
>
> Requirements:
> ------------
>
> * Fast response time and low OS overhead to moved tasks away from
> selected cpu packages. CPU hotplug is too heavyweight for this
> purpose
>
> Use cases:
> ---------
>
> * Enabling the right number of cpus to run the given workload can
> provide good power vs performance tradeoffs.
>
> * Ability to throttle the number of cores uses in the system along
> with other power saving controls like cpufreq governors can enable
> the system to operate at a more power efficient operating point and
> still meet the design objectives.
>
> * Facilitate thermal management by evacuating cores from hot cpu packages
>
> Alternatives:
> -------------
>
> * CPU hotplug: Heavy weight and slow. Setting up and tear down of
> data structures involved. May need new fast or light weight
> notifications
>
> * CPUSets: Exclusive CPU sets and partitioned sched domains involve
> rebuilding sched domains and relatively heavy weight for the purpose
>
> The following patch is against 2.6.30-rc3 and will work only in
> an under utilised system (Tasks <= number of cores).
>
> Test results for ebizzy 8 threads at various sched_mc settings has been
> summarised with relative values below. The test platform is dual socket
> quad core x86 system (pre-Nehalem).
>
> --------------------------------------------------------
> sched_mc No Cores Performance AvgPower
> used Records/sec (Watts)
> --------------------------------------------------------
> 0 8 1.00x 1.00y
> 1 8 1.02x 1.01y
> 2 8 0.83x 1.01y
> 3 7 0.86x 0.97y
> 4 6 0.76x 0.92y
> 5 4 0.72x 0.82y
> --------------------------------------------------------
>
> There were wide run variation with ebizzy. The purpose of the above
> data is to justify use of core evacuation for power vs performance
> trade-offs.
>
> ToDo:
> -----
>
> * Make the core evacuation predictable under different system load
> conditions and workload characteristics
> * Enhance framework to control which packages/cores will be
> evacuated, this is needed for thermal management


I think this is going about it the wrong way.

The whole thing seems to be targeted at thermal management, not power
saving. Therefore using the power saving stuff is backwards.

Provide a knob that provides max_thermal_capacity, and schedule
accordingly.

FWIW I utterly hate these force idle things because they cause the
scheduler to become non-work conserving, but I have to concede that
software will likely be more suited to handle the thermal overload issue
than hardware will ever be -- so for that use case I'm willing to go
along.

Also, the user interface should be that single thermal capacity knob,
more fine grained control is undesired.

Also, before you continue, expand on the interaction with realtime
processes.

2009-04-27 14:20:49

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

* Peter Zijlstra <[email protected]> [2009-04-27 12:09:14]:

> On Mon, 2009-04-27 at 02:16 +0530, Vaidyanathan Srinivasan wrote:
> > Hi,
> >
> > The sched_mc_powersavings tunable can be set to {0,1,2} to enable
> > aggressive task consolidation to less number of cpu packages and save
> > power. Under certain conditions, sched_mc=2 may provide better
> > performance in a underutilised system by keeping the group of tasks on
> > a single cpu package facilitating cache sharing and reduced off-chip
> > traffic.
> >
> > Extending this concept further, the following patch series tries to
> > implement sched_mc={3,4,5} where CPUs/cores are forced to be idle and
> > thereby save power at the cost of performance. Some of the cpu
> > packages in the system are overloaded with tasks while other packages
> > can have free cpus. This patch is a hack to discuss the idea and
> > requirements.
> >
> > Objective:
> > ----------
> >
> > * Framework to evacuate tasks from cpus in order to force the cpu
> > cores to stay at idle
> >
> > * Interrupts can be moved using user space irqbalancer daemons, while
> > timer migration framework is being discussed:
> > http://lkml.org/lkml/2009/4/16/45
> >
> > * Forcefully idling cpu cores in a system will reduce the power
> > consumption of the system and also cool cpu packages for thermal
> > management
> >
> > Requirements:
> > ------------
> >
> > * Fast response time and low OS overhead to moved tasks away from
> > selected cpu packages. CPU hotplug is too heavyweight for this
> > purpose
> >
> > Use cases:
> > ---------
> >
> > * Enabling the right number of cpus to run the given workload can
> > provide good power vs performance tradeoffs.
> >
> > * Ability to throttle the number of cores uses in the system along
> > with other power saving controls like cpufreq governors can enable
> > the system to operate at a more power efficient operating point and
> > still meet the design objectives.
> >
> > * Facilitate thermal management by evacuating cores from hot cpu packages
> >
> > Alternatives:
> > -------------
> >
> > * CPU hotplug: Heavy weight and slow. Setting up and tear down of
> > data structures involved. May need new fast or light weight
> > notifications
> >
> > * CPUSets: Exclusive CPU sets and partitioned sched domains involve
> > rebuilding sched domains and relatively heavy weight for the purpose
> >
> > The following patch is against 2.6.30-rc3 and will work only in
> > an under utilised system (Tasks <= number of cores).
> >
> > Test results for ebizzy 8 threads at various sched_mc settings has been
> > summarised with relative values below. The test platform is dual socket
> > quad core x86 system (pre-Nehalem).
> >
> > --------------------------------------------------------
> > sched_mc No Cores Performance AvgPower
> > used Records/sec (Watts)
> > --------------------------------------------------------
> > 0 8 1.00x 1.00y
> > 1 8 1.02x 1.01y
> > 2 8 0.83x 1.01y
> > 3 7 0.86x 0.97y
> > 4 6 0.76x 0.92y
> > 5 4 0.72x 0.82y
> > --------------------------------------------------------
> >
> > There were wide run variation with ebizzy. The purpose of the above
> > data is to justify use of core evacuation for power vs performance
> > trade-offs.
> >
> > ToDo:
> > -----
> >
> > * Make the core evacuation predictable under different system load
> > conditions and workload characteristics
> > * Enhance framework to control which packages/cores will be
> > evacuated, this is needed for thermal management
>
>
> I think this is going about it the wrong way.
>
> The whole thing seems to be targeted at thermal management, not power
> saving. Therefore using the power saving stuff is backwards.

The framework is useful for both power savings and thermal management.
Actually, we can generalise this as a framework to throttle cores.

Power savings needs only core evacuation; the kernel can decide the
optimum cores to evacuate for the best power savings. For thermal
management we will additionally need a 'vector' parameter to direct the
load to different parts of the system and level out the heat generated.

> Provide a knob that provides max_thermal_capacity, and schedule
> accordingly.

Yes, we can pick a generic name and use this as a function of total
system capacity to indicate the number of cores to evacuate.

> FWIW I utterly hate these force idle things because they cause the
> scheduler to become non-work conserving, but I have to concede that
> software will likely be more suited to handle the thermal overload issue
> than hardware will ever be -- so for that use case I'm willing to go
> along.

Yes, I agree with your opinion. However, if we can come up with
a clean framework to take cores out of the scheduler's view, then the
work-conserving nature of the scheduler can be preserved on the subset of
remaining cores. Inserting idle states is more intrusive than leaving out
full cores.

> Also, the user interface should be that single thermal capacity knob,
> more fine grained control is undesired.

For power savings, a single evacuation knob will do, while for
thermal management we will need additional parameters to choose the right
cores to evacuate. Some sort of directional/vector parameter.

> Also, before you continue, expand on the interaction with realtime
> processes.

Sure. We will run into complications with respect to realtime
scheduling. You had earlier pointed out a need for variable cpu power
to achieve fairness for non-realtime tasks in the presence of realtime
tasks. We should re-visit that idea.

Thanks for the review comments.

--Vaidy

2009-04-28 08:34:09

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

On Mon, 2009-04-27 at 19:50 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2009-04-27 12:09:14]:

> > The whole thing seems to be targeted at thermal management, not power
> > saving. Therefore using the power saving stuff is backwards.
>
> The framework is useful for power savings and thermal management.
> Actually we can generalise this a framework to throttle cores.

To what purpose?

> Power savings need only core evacuation, kernel can decide the most
> optimum cores to evacuate for best power savings. While in thermal
> management we will additional need a 'vector' parameter to direct the
> load to different parts of the system and level the heat generated.

Power saving should not generate idle, it should just accumulate idle in
the most favourable way.

Thermal management must generate idle to avoid hardware breakdown etc.
Does it really need more than a single max_thermal_capacity knob? That
is, does it really matter which die in the machine generates the heat?

If so, why?

> > Provide a knob that provides max_thermal_capacity, and schedule
> > accordingly.
>
> Yes, we can pick a generic name and use this as a function of total
> system capacity to indicate number of cores to evacuate.

No, it should be in a thermal unit, not nr of cores.

> > FWIW I utterly hate these force idle things because they cause the
> > scheduler to become non-work conserving, but I have to concede that
> > software will likely be more suited to handle the thermal overload issue
> > than hardware will ever be -- so for that use case I'm willing to go
> > along.
>
> Yes, I agree with your opinion. However if we can come up with
> a clean framework to take cores out of scheduler's view, then the work
> conserving nature of the scheduler can be preserved on the sub-set of
> cores. Inserting idle states is more intrusive than leaving out full
> cores.

Not really; when you consider the machine (or load-balance domain),
taking out a few cores is still non-work-conserving as you take away
capacity.

I'm against taking out capacity for anything other than thermal
management -- full stop.

> > Also, the user interface should be that single thermal capacity knob,
> > more fine grained control is undesired.
>
> For power savings, a single evacuation knob will do. While for
> thermal we will need additional parameters to choose the right cores
> to evacuate. Some sort of directional/vector parameter.

Why? Are machines so non-uniform in cooling capacity that it really
matters which core generates the heat? Sounds like badly designed
hardware to me.

I would expect it to only be the total heat generated/power taken from
the rack unit.

> > Also, before you continue, expand on the interaction with realtime
> > processes.
>
> Sure. We will run into complications with respect to realtime
> scheduling. You had earlier pointed out a need for variable cpu power
> to achieve fairness for non-realtime tasks in the presence of realtime
> tasks. We should re-visit that idea.

There is that, another point is load generated by SCHED_OTHER tasks
pushing the machine in thermal overload should not shut down the
capacity needed for the real-time tasks.

2009-04-28 08:53:23

by Ingo Molnar

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n


* Peter Zijlstra <[email protected]> wrote:

> > > Also, the user interface should be that single thermal
> > > capacity knob, more fine grained control is undesired.
> >
> > For power savings, a single evacuation knob will do. While for
> > thermal we will need additional parameters to choose the right
> > cores to evacuate. Some sort of directional/vector parameter.
>
> Why? are machines that non-uniform in cooling capacity that it
> really matters which core generates the heat? Sounds like badly
> designed hardware to me.
>
> I would expect it to only be the total head generated/power taken
> from the rack unit.

If we add thermal throttling at the kernel level then a single knob
(with a percentile-ish unit) is probably the furthest we will go -
with "not doing it at all" still being the other, very tempting
alternative.

If the only technical way you can find to do it is via myriads of
non-intuitive knobs and per core settings - then the answer is
really 'no thanks'.

Ingo

2009-04-28 16:11:26

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

* Peter Zijlstra <[email protected]> [2009-04-28 10:33:38]:

> On Mon, 2009-04-27 at 19:50 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <[email protected]> [2009-04-27 12:09:14]:
>
> > > The whole thing seems to be targeted at thermal management, not power
> > > saving. Therefore using the power saving stuff is backwards.
> >
> > The framework is useful for power savings and thermal management.
> > Actually we can generalise this a framework to throttle cores.
>
> To what purpose?

Throttling work will save power and reduce heat. I was thinking that
to reduce heat we may have to take different cores off at different
times.

> > Power savings need only core evacuation, kernel can decide the most
> > optimum cores to evacuate for best power savings. While in thermal
> > management we will additional need a 'vector' parameter to direct the
> > load to different parts of the system and level the heat generated.
>
> Power saving should not generate idle, it should just accumulate idle in
> the most favourable way.

Agreed. I am looking for ideas on how to accumulate idle time onto a single
core or a subset of cores.

> Thermal management must generate idle to avoid hardware breakdown etc.
> Does it really need more than a single max_thermal_capacity knob? That
> is, does it really matter which die in the machine generates the heat?
>
> If so, why?

I think so because, apart from the over-heat trip, we have an
opportunity to reduce leakage power, which is proportional to
temperature. Uniformly heating all cores can save us leakage power.
But spreading work for this purpose is not favourable because we will
not go to package idle states.

We still need to consolidate idle time across the system onto certain cores
and also periodically shift which cores are kept idle.

Just an idea and possibility; flame me if this is weird enough :)

> > > Provide a knob that provides max_thermal_capacity, and schedule
> > > accordingly.
> >
> > Yes, we can pick a generic name and use this as a function of total
> > system capacity to indicate number of cores to evacuate.
>
> No, it should be in a thermal unit, not nr of cores.

A thermal unit is not as intuitive as cores or system capacity, right?
Are you suggesting that we specify the maximum heat that can be
generated?

> > > FWIW I utterly hate these force idle things because they cause the
> > > scheduler to become non-work conserving, but I have to concede that
> > > software will likely be more suited to handle the thermal overload issue
> > > than hardware will ever be -- so for that use case I'm willing to go
> > > along.
> >
> > Yes, I agree with your opinion. However if we can come up with
> > a clean framework to take cores out of scheduler's view, then the work
> > conserving nature of the scheduler can be preserved on the sub-set of
> > cores. Inserting idle states is more intrusive than leaving out full
> > cores.
>
> Not really, when you consider the machine (or load-balance domain)
> taking out a few cores it still non-work preserving as you take away
> capacity.

Agreed. But cpu offline, cpufreq governors, and multi-threaded
CPUs do take away capacity from the scheduler today.

> I'm against taking out capacity for anything other than thermal
> management -- full stop.

Are we entering the domain of resource management now? Should
throttling work be a resource management problem?

> > > Also, the user interface should be that single thermal capacity knob,
> > > more fine grained control is undesired.
> >
> > For power savings, a single evacuation knob will do. While for
> > thermal we will need additional parameters to choose the right cores
> > to evacuate. Some sort of directional/vector parameter.
>
> Why? are machines that non-uniform in cooling capacity that it really
> matters which core generates the heat? Sounds like badly designed
> hardware to me.
>
> I would expect it to only be the total head generated/power taken from
> the rack unit.

Your point is correct as long as we only want to prevent a thermal trip.
But in future systems we have an opportunity to save power by reducing
the core temperature at the same heat output. Basically, uniformly
heating all cores rather than just one part of the system, even if we
are within the total thermal limit, can help save leakage power.

> > > Also, before you continue, expand on the interaction with realtime
> > > processes.
> >
> > Sure. We will run into complications with respect to realtime
> > scheduling. You had earlier pointed out a need for variable cpu power
> > to achieve fairness for non-realtime tasks in the presence of realtime
> > tasks. We should re-visit that idea.
>
> There is that, another point is load generated by SCHED_OTHER tasks
> pushing the machine in thermal overload should not shut down the
> capacity needed for the real-time tasks.

Yes, this is an interesting and valid requirement. We should be able
to limit capacity to selected scheduler classes.

--Vaidy

2009-04-28 16:16:25

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n

* Ingo Molnar <[email protected]> [2009-04-28 10:52:37]:

>
> * Peter Zijlstra <[email protected]> wrote:
>
> > > > Also, the user interface should be that single thermal
> > > > capacity knob, more fine grained control is undesired.
> > >
> > > For power savings, a single evacuation knob will do. While for
> > > thermal we will need additional parameters to choose the right
> > > cores to evacuate. Some sort of directional/vector parameter.
> >
> > Why? are machines that non-uniform in cooling capacity that it
> > really matters which core generates the heat? Sounds like badly
> > designed hardware to me.
> >
> > I would expect it to only be the total head generated/power taken
> > from the rack unit.
>
> If we add thermal throttling at the kernel level then a single knob
> (with a percentile-ish unit) is probably the furthest we will go -
> with "not doing it at all" still being the other, very tempting
> alternative.

Sure, this is all we would like to do. A simpler interface is welcome
and will see easier adoption.

> If the only technical way you can find to do it is via myriads of
> non-intuitive knobs and per core settings - then the answer is
> really 'no thanks'.

Agreed. We definitely do not want to add myriads of non-intuitive
knobs. Let's see if a percentage/capacity-type knob will work for everyone.

Thanks,
Vaidy