Hi,

The existing power-saving load balancer (CONFIG_SCHED_MC) attempts to
run the system's workload on a minimum number of CPU packages and
tries to keep the remaining CPU packages idle for longer durations.
Consolidating workloads onto fewer packages thus helps the other
packages stay idle and save power. The current implementation is very
conservative and does not work effectively across different
workloads. The tunable sched_mc_power_savings=n was originally
proposed to enable tuning of the power-saving load balancer based on
system configuration, workload characteristics and end-user
requirements.

The power savings and performance of a given workload on an
underutilised system can be controlled by writing 0, 1 or 2 to
/sys/devices/system/cpu/sched_mc_power_savings, with 0 selecting
highest performance (least power savings) and 2 selecting maximum
power savings even at the cost of slight performance degradation.
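For illustration, here is a minimal user-space sketch (an editor's
example, not part of this series) that selects the maximum power
savings level, equivalent to
"echo 2 > /sys/devices/system/cpu/sched_mc_power_savings":

#include <stdio.h>

int main(void)
{
	/* 0 = highest performance, 2 = maximum power savings */
	FILE *fp = fopen("/sys/devices/system/cpu/sched_mc_power_savings", "w");

	if (!fp) {
		perror("sched_mc_power_savings");
		return 1;
	}
	fprintf(fp, "2\n");
	fclose(fp);
	return 0;
}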
Please refer to the following discussions and article for details:

[1] Making power policy just work
    http://lwn.net/Articles/287924/
[2] [RFC v1] Tunable sched_mc_power_savings=n
    http://lwn.net/Articles/287882/
[3] [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
    http://lwn.net/Articles/297306/
[4] [RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n
    http://lkml.org/lkml/2008/11/10/260
The following patch series demonstrates the basic framework for
tunable sched_mc_power_savings.

This version incorporates comments and feedback received on the
previous post. Thanks to Peter Zijlstra, Gregory Haskins, and Vatsa
for the reviews and comments.
Changes from v3:
----------------

* Fixed the locking code with double_lock_balance() in
  active-balance-newidle.patch
* Moved sched_mc_preferred_wakeup_cpu to the root_domain structure so
  that each partitioned sched domain gets an independently nominated cpu
* Added more comments to active-balance-newidle.patch
* Reverted the sched MC level and CPU level fine tuning merged in
  v2.6.28-rc4 for now. These changes hurt consolidation since
  SD_BALANCE_NEWIDLE is removed. I will rework the tuning in the next
  iteration to selectively enable them at sched_mc=2
* Patch series is based on the 2.6.28-rc6 kernel
Changes from v2:
----------------

* Fixed a locking order issue in active-balance new-idle
* Moved the wakeup biasing code to the wake_idle() function, preserving
  the wake_affine function. The previous version would break wake
  affinity in order to aggressively consolidate tasks
* Removed the sched_mc_preferred_wakeup_cpu global variable, moved it
  to doms_cur/dattr_cur, and added a per_cpu pointer to the appropriate
  storage in the partitioned sched domain. This change is needed to
  preserve functionality with partitioned sched domains
* Patch is based on the 2.6.28-rc3 kernel
Results:
--------

The basic functionality of the code has not changed, and the power vs
performance benefits for kernbench are similar to the ones posted
earlier.

KERNBENCH runs: make -j4 on an x86 8-core, dual-socket (quad-core per
package) system
SchedMC  Run Time (s)   Package Idle      Energy    Power
                        Pkg-0    Pkg-1
0        80.04          52.77%   53.23%   1.00x J   1.00y W
1        81.41          37.03%   69.67%   0.97x J   0.95y W
2        76.25          20.26%   85.86%   0.92x J   0.97y W
*** This is RFC code and not for inclusion ***
Please feel free to test, and let me know your comments and feedback.
Thanks,
Vaidy
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
Gautham R Shenoy (1):
sched: Framework for sched_mc/smt_power_savings=N
Vaidyanathan Srinivasan (6):
sched: fine-tune SD_MC_INIT -- revert
sched: re-tune balancing -- revert
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
arch/x86/include/asm/topology.h | 7 +--
include/linux/sched.h | 11 +++++
include/linux/topology.h | 6 +--
kernel/sched.c | 89 +++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 17 +++++++
5 files changed, 118 insertions(+), 12 deletions(-)
--
commit 9fcd18c9e63e325dbd2b4c726623f760788d5aa8
Author: Ingo Molnar <[email protected]>
Date: Wed Nov 5 16:52:08 2008 +0100
Reverting this patch since this affects consolidation
and power savings. Will rework this tuning in the next
iteration.
Impact: improve wakeup affinity on NUMA systems, tweak SMP systems
Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
balancing defaults on NUMA and SMP systems.
Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
why we would not want to have wakeup affinity across nodes as well.
(we already do this in the standard NUMA template.)
lat_ctx on a NUMA box is particularly happy about this change:
before:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.60
| 2 5.70
after:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.65
| 2 2.07
a 2.75x speedup.
pipe-test is similarly happy about it too:
| phoenix:~/sched-tests> ./pipe-test
| 18.26 usecs/loop.
| 14.70 usecs/loop.
| 14.38 usecs/loop.
| 10.55 usecs/loop. # +WAKE_AFFINE on domain0+domain1
| 8.63 usecs/loop.
| 8.59 usecs/loop.
| 9.03 usecs/loop.
| 8.94 usecs/loop.
| 8.96 usecs/loop.
| 8.63 usecs/loop.
Also:
- disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
- enable SD_WAKE_BALANCE on SMP domains
Sysbench+postgresql improves all around the board, quite significantly:
.28-rc3-11474e2c .28-rc3-11474e2c-tune
-------------------------------------------------
1: 571 688 +17.08%
2: 1236 1206 -2.55%
4: 2381 2642 +9.89%
8: 4958 5164 +3.99%
16: 9580 9574 -0.07%
32: 7128 8118 +12.20%
64: 7342 8266 +11.18%
128: 7342 8064 +8.95%
256: 7519 7884 +4.62%
512: 7350 7731 +4.93%
-------------------------------------------------
SUM: 55412 59341 +6.62%
So it's a win both for the runup portion, the peak area and the tail.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
arch/x86/include/asm/topology.h | 7 +++----
include/linux/topology.h | 4 ++--
2 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4850e4b..90ac771 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -154,7 +154,7 @@ extern unsigned long node_remap_size[];
#endif
-/* sched_domains SD_NODE_INIT for NUMA machines */
+/* sched_domains SD_NODE_INIT for NUMAQ machines */
#define SD_NODE_INIT (struct sched_domain) { \
.min_interval = 8, \
.max_interval = 32, \
@@ -169,9 +169,8 @@ extern unsigned long node_remap_size[];
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_EXEC \
| SD_BALANCE_FORK \
- | SD_WAKE_AFFINE \
- | SD_WAKE_BALANCE \
- | SD_SERIALIZE, \
+ | SD_SERIALIZE \
+ | SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \
}
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 117f1b7..2565f4a 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -146,10 +146,10 @@ void arch_update_cpu_topology(void);
.wake_idx = 1, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
- | SD_BALANCE_EXEC \
+ | SD_BALANCE_NEWIDLE \
| SD_BALANCE_FORK \
+ | SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
- | SD_WAKE_BALANCE \
| BALANCE_FOR_PKG_POWER,\
.last_balance = jiffies, \
.balance_interval = 1, \
commit 14800984706bf6936bbec5187f736e928be5c218
Author: Mike Galbraith <[email protected]>
Date: Fri Nov 7 15:26:50 2008 +0100
Reverting this patch for now since this affects
consolidation and power savings. Will rework
this tuning in the next iteration.
Tune SD_MC_INIT the same way as SD_CPU_INIT:
unset SD_BALANCE_NEWIDLE, and set SD_WAKE_BALANCE.
This improves vmark by 5%:
vmark 132102 125968 125497 messages/sec avg 127855.66 .984
vmark 139404 131719 131272 messages/sec avg 134131.66 1.033
Signed-off-by: Mike Galbraith <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
include/linux/topology.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 2565f4a..b1551d4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -120,10 +120,10 @@ void arch_update_cpu_topology(void);
.wake_idx = 1, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
+ | SD_BALANCE_NEWIDLE \
| SD_BALANCE_FORK \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
- | SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
| BALANCE_FOR_MC_POWER, \
.last_balance = jiffies, \
From: Gautham R Shenoy <[email protected]>
*** RFC patch of work in progress and not for inclusion. ***
Currently the sched_mc/smt_power_savings variable is a boolean, which
either enables or disables topology-based power savings. This patch
extends the variable from a boolean to a multivalued setting, so that
its value decides how aggressively we perform topology-based
power-savings balancing.

Variable levels of the power savings tunable let the end user match
the required power savings vs performance trade-off to the system
configuration and workloads.

This initial version makes the sched_mc_power_savings global variable
take more values (0, 1, 2).

A later version is expected to add a new member,
sd->powersavings_level, to the multi-core CPU level sched_domain, and
to turn every sd->flags check for SD_POWERSAVINGS_BALANCE into a
macro that checks powersavings_level instead. The power savings level
setting should live in one place: either in the
sched_mc_power_savings global variable or within the appropriate
sched_domain structure.
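As a rough sketch of that plan (an editor's illustration; neither the
sd->powersavings_level field nor this macro exists in this series),
the flag test could become a level test on the domain:

/* Hypothetical replacement for (sd->flags & SD_POWERSAVINGS_BALANCE) */
#define sd_power_saving_level_ge(sd, level)	\
	((sd)->powersavings_level >= (level))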
Signed-off-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
include/linux/sched.h | 11 +++++++++++
kernel/sched.c | 16 +++++++++++++---
2 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 644ffbd..d862837 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -760,6 +760,17 @@ enum cpu_idle_type {
#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
+enum powersavings_balance_level {
+ POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
+ POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
+ * first for long running threads
+ */
+ POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
+ * cpu package for power savings
+ */
+ MAX_POWERSAVINGS_BALANCE_LEVELS
+};
+
#define BALANCE_FOR_MC_POWER \
(sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
diff --git a/kernel/sched.c b/kernel/sched.c
index 9b1e793..ea33446 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7876,14 +7876,24 @@ int arch_reinit_sched_domains(void)
static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
{
int ret;
+ unsigned int level = 0;
- if (buf[0] != '0' && buf[0] != '1')
+ sscanf(buf, "%u", &level);
+
+ /*
+ * level is always be positive so don't check for
+ * level < POWERSAVINGS_BALANCE_NONE which is 0
+ * What happens on 0 or 1 byte write,
+ * need to check for count as well?
+ */
+
+ if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
return -EINVAL;
if (smt)
- sched_smt_power_savings = (buf[0] == '1');
+ sched_smt_power_savings = level;
else
- sched_mc_power_savings = (buf[0] == '1');
+ sched_mc_power_savings = level;
ret = arch_reinit_sched_domains();
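The comment in the hunk above leaves input validation open. A
stricter variant (an editor's sketch, not part of the patch) would
check the sscanf() return value and the write length, so empty or
non-numeric writes are rejected instead of silently selecting level 0:

/* Sketch: validate a power savings level written through sysfs */
static int parse_power_savings_level(const char *buf, size_t count,
				     unsigned int *level)
{
	if (!count || sscanf(buf, "%u", level) != 1)
		return -EINVAL;
	if (*level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
		return -EINVAL;
	return 0;
}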
Active load balancing is a process by which the migration thread is
woken up on the target CPU in order to pull the currently running
task on another package into this newly idle package.

This method is already in use with normal load_balance(); this patch
introduces the method for newly idle cpus when sched_mc is set to
POWERSAVINGS_BALANCE_WAKEUP.

This logic provides effective consolidation of short-running daemon
jobs in an almost idle system.

A side effect of this patch may be ping-ponging of tasks if the
system is moderately utilised; the number of balance-failure
iterations before triggering may need adjustment.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 54 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index d28cd98..e4aff6c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3694,10 +3694,64 @@ redo:
}
if (!ld_moved) {
+ int active_balance;
+
schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;
+
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ return -1;
+
+ if (sd->nr_balance_failed++ < 2)
+ return -1;
+
+ /*
+ * The only task running in a non-idle cpu can be moved to this
+ * cpu in an attempt to completely freeup the other CPU
+ * package. The same method used to move task in load_balance()
+ * have been extended for load_balance_newidle() to speedup
+ * consolidation at sched_mc=POWERSAVINGS_BALANCE_WAKEUP (2)
+ *
+ * The package power saving logic comes from
+ * find_busiest_group(). If there are no imbalance, then
+ * f_b_g() will return NULL. However when sched_mc={1,2} then
+ * f_b_g() will select a group from which a running task may be
+ * pulled to this cpu in order to make the other package idle.
+ * If there is no opportunity to make a package idle and if
+ * there are no imbalance, then f_b_g() will return NULL and no
+ * action will be taken in load_balance_newidle().
+ *
+ * Under normal task pull operation due to imbalance, there
+ * will be more than one task in the source run queue and
+ * move_tasks() will succeed. ld_moved will be true and this
+ * active balance code will not be triggered.
+ */
+
+ /* Lock busiest in correct order while this_rq is held */
+ double_lock_balance(this_rq, busiest);
+
+ /*
+ * don't kick the migration_thread, if the curr
+ * task on busiest cpu can't be moved to this_cpu
+ */
+ if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+ double_unlock_balance(this_rq, busiest);
+ all_pinned = 1;
+ return ld_moved;
+ }
+
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->push_cpu = this_cpu;
+ active_balance = 1;
+ }
+
+ double_unlock_balance(this_rq, busiest);
+ if (active_balance)
+ wake_up_process(busiest->migration_thread);
+
} else
sd->nr_balance_failed = 0;
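For readers unfamiliar with double_lock_balance(), used in the hunk
above, here is a simplified sketch of the 2.6.28-era helper (editor's
annotations; see kernel/sched.c for the real code, which also carries
lock annotations and sanity checks). It takes busiest->lock while
this_rq->lock is held, honouring the convention that the
lower-addressed rq lock is taken first to avoid AB-BA deadlock:

static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
	int ret = 0;

	if (unlikely(!spin_trylock(&busiest->lock))) {
		if (busiest < this_rq) {
			/* Drop and retake this_rq->lock in the safe order */
			spin_unlock(&this_rq->lock);
			spin_lock(&busiest->lock);
			spin_lock_nested(&this_rq->lock,
					 SINGLE_DEPTH_NESTING);
			/* this_rq was unlocked: caller must revalidate */
			ret = 1;
		} else
			spin_lock_nested(&busiest->lock,
					 SINGLE_DEPTH_NESTING);
	}
	return ret;
}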
When system utilisation is low and more cpus are idle, a process
waking up from sleep should prefer an idle cpu in a semi-idle cpu
package (multi-core package) rather than a cpu in a completely idle
package, which would waste power.

Use the sched_mc balance logic in find_busiest_group() to nominate
a preferred wakeup cpu.

This info could be stored in the appropriate sched_domain, but
updating it in all copies of the sched_domain is not practical.
Hence it is stored in the root_domain struct, of which there is one
copy per partitioned sched domain, accessible from each cpu's
runqueue.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 15 +++++++++++++++
1 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 79b71f3..d28cd98 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -493,6 +493,17 @@ struct root_domain {
#ifdef CONFIG_SMP
struct cpupri cpupri;
#endif
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+
+ /*
+ * Preferred wake up cpu nominated by sched_mc balance that will be
+ * used when most cpus are idle in the system indicating overall very
+ * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+ */
+ unsigned int sched_mc_preferred_wakeup_cpu;
+
+#endif
+
};
/*
@@ -3090,6 +3101,7 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
return 0;
}
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
@@ -3406,6 +3418,9 @@ out_balanced:
if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
+ cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+ first_cpu(group_leader->cpumask);
return group_min;
}
#endif
A preferred wakeup cpu (from a semi-idle package) was nominated in
find_busiest_group() in the previous patch. Use this information,
via sched_mc_preferred_wakeup_cpu, in wake_idle() to bias task
wakeups if the following conditions are satisfied:

    - The present cpu that is trying to wake up the process is
      idle, and waking the target process on this cpu could
      potentially wake up a completely idle package
    - The previous cpu on which the target process ran is also
      idle, and hence selecting the previous cpu may wake up a
      semi-idle cpu package
    - The task being woken up is allowed to run on the nominated
      cpu (cpu affinity and restrictions)

Basically, if both the current cpu and the previous cpu on which the
task ran are idle, select the nominated cpu from a semi-idle cpu
package for running the waking task.

Cache hotness is respected, since the actual biasing happens in
wake_idle() only if the application is cache cold.

This technique effectively consolidates short-running bursty jobs on
a mostly idle system.

Wakeup biasing for power savings gets disabled automatically as
system utilisation increases, because the probability of finding
both this_cpu and prev_cpu idle decreases.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched_fair.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 98345e4..939f2a1 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p)
cpumask_t tmp;
struct sched_domain *sd;
int i;
+ unsigned int chosen_wakeup_cpu;
+ int this_cpu;
+
+ /*
+ * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+ * are idle and this is not a kernel thread and this task's affinity
+ * allows it to be moved to preferred cpu, then just move!
+ */
+
+ this_cpu = smp_processor_id();
+ chosen_wakeup_cpu =
+ cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+ idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
+ cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
+ return chosen_wakeup_cpu;
/*
* If it is idle, then it is the best cpu to run this task.
In case two groups have identical load, prefer to move load to the
lower logical cpu number rather than, as the present logic does, to
the higher logical number.

find_busiest_group() looks for a group_leader that has spare capacity
to take more tasks and free up an appropriate least-loaded group. In
case of a tie with equal load, the group with the higher logical
number is currently favoured. This conflicts with the user-space
irqbalance daemon, which moves interrupts to lower logical numbers
when system utilisation is very low.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index ea33446..79b71f3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3263,7 +3263,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
*/
if ((sum_nr_running < min_nr_running) ||
(sum_nr_running == min_nr_running &&
- first_cpu(group->cpumask) <
+ first_cpu(group->cpumask) >
first_cpu(group_min->cpumask))) {
group_min = group;
min_nr_running = sum_nr_running;
@@ -3279,7 +3279,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (sum_nr_running <= group_capacity - 1) {
if (sum_nr_running > leader_nr_running ||
(sum_nr_running == leader_nr_running &&
- first_cpu(group->cpumask) >
+ first_cpu(group->cpumask) <
first_cpu(group_leader->cpumask))) {
group_leader = group;
leader_nr_running = sum_nr_running;
Vaidyanathan Srinivasan wrote:
> From: Gautham R Shenoy <[email protected]>
>
> *** RFC patch of work in progress and not for inclusion. ***
>
> Currently the sched_mc/smt_power_savings variable is a boolean, which either
> enables or disables topology based power savings. This extends the behaviour of
> the variable from boolean to multivalued, such that based on the value, we
> decide how aggressively do we want to perform topology based powersavings
> balance.
>
> Variable levels of power saving tunable would benefit end user to match the
> required level of power savings vs performance trade off depending on the
> system configuration and workloads.
>
> This initial version makes the sched_mc_power_savings global variable to take
> more values (0,1,2).
Might I suggest a dimensioned number rather than a relative one?
One might say that 100 represents the full power of a system, meaning
that all chips/cores are running at full speed, whereas 50 means that
the power system would attempt to halve the resources available, and
would return the value it believes it has actually achieved. For
example, if it could only reduce the clock speed by 10% on an old
uniprocessor, it would return 90.

An additional, second value it might return could be the power
reduction it believed it had achieved.

These, by the way, are what my Tadpole GUI shows (;-)) so I'm just
following someone else's lead.
--dave
>
> Later version is expected to add new member sd->powersavings_level at the multi
> core CPU level sched_domain. This make all sd->flags check for
> SD_POWERSAVINGS_BALANCE into a different macro that will check for
> powersavings_level.
>
> The power savings level setting should be in one place either in the
> sched_mc_power_savings global variable or contained within the appropriate
> sched_domain structure.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> include/linux/sched.h | 11 +++++++++++
> kernel/sched.c | 16 +++++++++++++---
> 2 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 644ffbd..d862837 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -760,6 +760,17 @@ enum cpu_idle_type {
> #define SD_SERIALIZE 1024 /* Only a single load balancing instance */
> #define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
>
> +enum powersavings_balance_level {
> + POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
> + POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
> + * first for long running threads
> + */
> + POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
> + * cpu package for power savings
> + */
> + MAX_POWERSAVINGS_BALANCE_LEVELS
> +};
> +
> #define BALANCE_FOR_MC_POWER \
> (sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 9b1e793..ea33446 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -7876,14 +7876,24 @@ int arch_reinit_sched_domains(void)
> static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
> {
> int ret;
> + unsigned int level = 0;
>
> - if (buf[0] != '0' && buf[0] != '1')
> + sscanf(buf, "%u", &level);
> +
> + /*
> + * level is always be positive so don't check for
> + * level < POWERSAVINGS_BALANCE_NONE which is 0
> + * What happens on 0 or 1 byte write,
> + * need to check for count as well?
> + */
> +
> + if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
> return -EINVAL;
>
> if (smt)
> - sched_smt_power_savings = (buf[0] == '1');
> + sched_smt_power_savings = level;
> else
> - sched_mc_power_savings = (buf[0] == '1');
> + sched_mc_power_savings = level;
>
> ret = arch_reinit_sched_domains();
>
>
>
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
* David Collier-Brown <[email protected]> [2008-11-21 19:45:55]:
> Vaidyanathan Srinivasan wrote:
> > From: Gautham R Shenoy <[email protected]>
> >
> > *** RFC patch of work in progress and not for inclusion. ***
> >
> > Currently the sched_mc/smt_power_savings variable is a boolean, which either
> > enables or disables topology based power savings. This extends the behaviour of
> > the variable from boolean to multivalued, such that based on the value, we
> > decide how aggressively do we want to perform topology based powersavings
> > balance.
> >
> > Variable levels of power saving tunable would benefit end user to match the
> > required level of power savings vs performance trade off depending on the
> > system configuration and workloads.
> >
> > This initial version makes the sched_mc_power_savings global variable to take
> > more values (0,1,2).
>
> Might I suggest a dimensioned number rather than a relative one?
> One might say that 100 represents the full power of a system, meaning
> that all chips/cores are running at full speed, whereas 50 means that
> the power system would attempt to halve the resources available, and
> would return a value that represents the value that the power system
> believes it has achieved. For example, if it could only reduce the
> clock speed by 10%, on an old uniprocessor, it would return 90.
Ideally we would like to have such a metric :) However, in practice
the power savings vs performance tradeoff depends on:

1) System configuration -- topology, cpu type, cpu features
2) Workload -- memory bound, IO bound, cpu bound
3) Environment -- system temperature

and many more dimensions that we have not considered yet!

What you are asking for might have been possible if CPUs were very
simple and performance and power corresponded directly to operating
frequency. Modern CPUs have widely varying operating characteristics
that depend greatly on the workload type. Hence a derived metric for
the power vs performance tradeoff would be very inaccurate, and thus
useless.

We may be able to design the framework in such a way that each level
of the setting provides increasing power savings with little or no
performance impact (depending on the workload). Power consumption at
sched_mc=0 > sched_mc=1 > sched_mc=2, but not on a linear scale.

Also, this is only one component of the power saving tunables. There
are various governor settings and platform settings that may affect
power consumption.
> An additional, second value it might return might be the power
> reduction it believed it had achieved.
Measuring the actual power consumed will be useful for sysadmins
choosing the correct settings. However, this is platform dependent
and best left to platform management tools rather than the generic
scheduler.

We would expect the end user to use platform management tools to
collect and trend power consumption data, and to correlate it with
the power saving tunables to decide the best power vs performance
tradeoff.
> These, by the way, are what my Tadpole GUI shows (;-)) so I'm just
> following someone else's lead.
Can you please provide more details? I could not find a Google hit on
a Tadpole GUI relevant to this discussion.
Thanks,
Vaidy
On Fri, 2008-11-21 at 14:01 +0530, Vaidyanathan Srinivasan wrote:
> When the system utilisation is low and more cpus are idle,
> then the process waking up from sleep should prefer to
> wakeup an idle cpu from semi-idle cpu package (multi core
> package) rather than a completely idle cpu package which
> would waste power.
>
> Use the sched_mc balance logic in find_busiest_group() to
> nominate a preferred wakeup cpu.
>
> This info can be stored in appropriate sched_domain, but
> updating this info in all copies of sched_domain is not
> practical. Hence this information is stored in root_domain
> struct which is one copy per partitioned sched domain.
> The root_domain can be accessed from each cpu's runqueue
> and there is one copy per partitioned sched domain.
>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched.c | 15 +++++++++++++++
> 1 files changed, 15 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 79b71f3..d28cd98 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -493,6 +493,17 @@ struct root_domain {
> #ifdef CONFIG_SMP
> struct cpupri cpupri;
> #endif
> +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> +
> + /*
> + * Preferred wake up cpu nominated by sched_mc balance that will be
> + * used when most cpus are idle in the system indicating overall very
> + * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
> + */
> + unsigned int sched_mc_preferred_wakeup_cpu;
> +
> +#endif
> +
> };
>
> /*
Do we really need that extra whitespace?
> @@ -3090,6 +3101,7 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
> return 0;
> }
>
> +
> /*
> * find_busiest_group finds and returns the busiest CPU group within the
> * domain. It calculates and returns the amount of weighted load which
Ditto?
> @@ -3406,6 +3418,9 @@ out_balanced:
>
> if (this == group_leader && group_leader != group_min) {
> *imbalance = min_load_per_task;
> + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
> + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> + first_cpu(group_leader->cpumask);
While not strictly needed, I prefer braces around multi-line statements.
It's easier on the eyes.
> return group_min;
> }
> #endif
>
On Fri, 2008-11-21 at 14:00 +0530, Vaidyanathan Srinivasan wrote:
> * Reverted sched MC level and CPU level fine tuning in v2.6.28-rc4 for
> now. These affect consolidation since SD_BALANCE_NEWIDLE is
> removed. I will rework the tuning in the next iteration to
> selectively enable them at sched_mc=2
Looking forward to the next version.
* Peter Zijlstra <[email protected]> [2008-11-23 03:03:58]:
> On Fri, 2008-11-21 at 14:01 +0530, Vaidyanathan Srinivasan wrote:
> > When the system utilisation is low and more cpus are idle,
> > then the process waking up from sleep should prefer to
> > wakeup an idle cpu from semi-idle cpu package (multi core
> > package) rather than a completely idle cpu package which
> > would waste power.
> >
> > Use the sched_mc balance logic in find_busiest_group() to
> > nominate a preferred wakeup cpu.
> >
> > > This info can be stored in appropriate sched_domain, but
> > updating this info in all copies of sched_domain is not
> > practical. Hence this information is stored in root_domain
> > struct which is one copy per partitioned sched domain.
> > The root_domain can be accessed from each cpu's runqueue
> > and there is one copy per partitioned sched domain.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> > ---
> >
> > kernel/sched.c | 15 +++++++++++++++
> > 1 files changed, 15 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 79b71f3..d28cd98 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -493,6 +493,17 @@ struct root_domain {
> > #ifdef CONFIG_SMP
> > struct cpupri cpupri;
> > #endif
> > +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> > +
> > + /*
> > + * Preferred wake up cpu nominated by sched_mc balance that will be
> > + * used when most cpus are idle in the system indicating overall very
> > + * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
> > + */
> > + unsigned int sched_mc_preferred_wakeup_cpu;
> > +
> > +#endif
> > +
> > };
> >
> > /*
>
> Do we really need that extra whitespace?
>
> > @@ -3090,6 +3101,7 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
> > return 0;
> > }
> >
> > +
> > /*
> > * find_busiest_group finds and returns the busiest CPU group within the
> > * domain. It calculates and returns the amount of weighted load which
>
> Ditto?
>
> > @@ -3406,6 +3418,9 @@ out_balanced:
> >
> > if (this == group_leader && group_leader != group_min) {
> > *imbalance = min_load_per_task;
> > + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
> > + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> > + first_cpu(group_leader->cpumask);
>
> While not strictly needed, I prefer braces around multi-line statements.
> It's easier on the eyes.
Hi Peter,
Thanks for the detailed review. I will fix these issues in the next
iteration.
--Vaidy