Hi,
The existing power saving load balancer, CONFIG_SCHED_MC, attempts to run
the workload in the system on a minimum number of CPU packages and tries
to keep the rest of the CPU packages idle for longer durations. Thus
consolidating workloads onto fewer packages helps the other packages stay
idle and save power. The current implementation is very
conservative and does not work effectively across different workloads.
The initial idea of a tunable sched_mc_power_savings=n was proposed to
enable tuning of the power saving load balancer based on the system
configuration, workload characteristics and end user requirements.
The power savings and performance of a given workload in an
underutilised system can be controlled by writing 0, 1 or 2 to
/sys/devices/system/cpu/sched_mc_power_savings, with 0 selecting highest
performance and least power savings, and level 2 selecting maximum
power savings even at the cost of a slight performance degradation.
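For example, a small user space helper to select level 2 could look like
this (illustrative snippet only; any program or shell that writes the value
works just as well):

#include <stdio.h>

int main(void)
{
	/* Select the most aggressive power savings level (2) */
	FILE *f = fopen("/sys/devices/system/cpu/sched_mc_power_savings", "w");

	if (!f) {
		perror("sched_mc_power_savings");
		return 1;
	}
	fprintf(f, "2\n");
	fclose(f);
	return 0;
}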
Please refer to the following discussions and article for details.
[1]Making power policy just work
http://lwn.net/Articles/287924/
[2][RFC v1] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/287882/
[3][RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/297306/
[4][RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n
http://lkml.org/lkml/2008/11/10/260
[5][RFC PATCH v4 0/7] Tunable sched_mc_power_savings=n
http://lkml.org/lkml/2008/11/21/47
The following patch series demonstrates the basic framework for
tunable sched_mc_power_savings.
This version of the patch incorporates comments and feedback received
on the previous post. Thanks to Peter Zijlstra and others for the
review and comments.
Changes from v4:
----------------
* Conditionally added the SD_BALANCE_NEWIDLE flag to the MC and CPU level
domains to help task consolidation when sched_mc > 0.
Removing this flag significantly reduces the number of load_balance
calls and considerably slows down task consolidation and, in effect,
power savings.
Ingo and Mike Galbraith removed this flag to improve performance for
select benchmarks. I have added the flag only when power savings
is requested and restored full performance mode when sched_mc=0
* Fixed a few whitespace issues
* Patch series against 2.6.28-rc8 kernel
Changes from v3:
----------------
* Fixed the locking code with double_lock_balance() in
active-balance-newidle.patch
* Moved sched_mc_preferred_wakeup_cpu to the root_domain structure so that
each partitioned sched domain gets an independently nominated cpu
* More comments in active-balance-newidle.patch
* Reverted sched MC level and CPU level fine tuning in v2.6.28-rc4 for
now. These affect consolidation since SD_BALANCE_NEWIDLE is
removed. I will rework the tuning in the next iteration to
selectively enable them at sched_mc=2
* Patch series on 2.6.28-rc6 kernel
Changes from v2:
----------------
* Fixed locking order issue in active-balance new-idle
* Moved the wakeup biasing code to the wake_idle() function and preserved
the wake_affine function. The previous version would break wake affine in
order to aggressively consolidate tasks
* Removed the sched_mc_preferred_wakeup_cpu global variable, moved it to
doms_cur/dattr_cur, and added a per_cpu pointer to the appropriate
storage in the partitioned sched domain. This change is needed to
preserve functionality in case of partitioned sched domains
* Patch on 2.6.28-rc3 kernel
Results:
--------
Basic functionality of the code has not changed and the power vs
performance benefits for kernbench are similar to the ones posted
earlier.
KERNBENCH Runs: make -j4 on an x86 8-core, dual socket quad core cpu
package system
SchedMC  Run Time  Package Idle      Energy   Power
0        81.28     52.43%  53.53%    1.00x J  1.00y W
1        80.71     37.35%  68.91%    0.96x J  0.97y W
2        76.05     23.81%  82.65%    0.92x J  0.98y W
*** This is RFC code and not for inclusion ***
Please feel free to test, and let me know your comments and feedback.
Thanks,
Vaidy
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
Gautham R Shenoy (1):
sched: Framework for sched_mc/smt_power_savings=N
Vaidyanathan Srinivasan (6):
sched: idle_balance() does not call load_balance_newidle()
sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
include/linux/sched.h | 21 +++++++++++
include/linux/topology.h | 6 ++-
kernel/sched.c | 88 +++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 17 +++++++++
4 files changed, 124 insertions(+), 8 deletions(-)
--
From: Gautham R Shenoy <[email protected]>
*** RFC patch of work in progress and not for inclusion. ***
Currently the sched_mc/smt_power_savings variable is a boolean, which either
enables or disables topology based power savings. This extends the behaviour
of the variable from boolean to multivalued, such that based on the value we
decide how aggressively we want to perform topology based power savings
balancing.
Variable levels of the power savings tunable would allow the end user to
match the required power savings vs performance trade-off depending on the
system configuration and workloads.
This initial version makes the sched_mc_power_savings global variable take
more values (0, 1, 2).
A later version is expected to add a new member, sd->powersavings_level, to
the multi-core CPU level sched_domain. This will turn all sd->flags checks
for SD_POWERSAVINGS_BALANCE into a different macro that checks
powersavings_level.
The power savings level setting should live in one place: either in the
sched_mc_power_savings global variable or within the appropriate
sched_domain structure.
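Purely as an illustration of that direction (not part of this patch; the
member and macro names below are made up), such a per-domain check could
eventually take a shape like:

	/* Hypothetical sketch only: per-domain power savings check */
	#define sd_power_savings_enabled(sd) \
		((sd)->powersavings_level > POWERSAVINGS_BALANCE_NONE)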
Signed-off-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
include/linux/sched.h | 11 +++++++++++
kernel/sched.c | 16 +++++++++++++---
2 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..888f2b2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -764,6 +764,17 @@ enum cpu_idle_type {
#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
+enum powersavings_balance_level {
+ POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
+ POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
+ * first for long running threads
+ */
+ POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
+ * cpu package for power savings
+ */
+ MAX_POWERSAVINGS_BALANCE_LEVELS
+};
+
#define BALANCE_FOR_MC_POWER \
(sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
diff --git a/kernel/sched.c b/kernel/sched.c
index e4bb1dd..322cd2a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7879,14 +7879,24 @@ int arch_reinit_sched_domains(void)
static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
{
int ret;
+ unsigned int level = 0;
- if (buf[0] != '0' && buf[0] != '1')
+ sscanf(buf, "%u", &level);
+
+ /*
+ * level is always be positive so don't check for
+ * level < POWERSAVINGS_BALANCE_NONE which is 0
+ * What happens on 0 or 1 byte write,
+ * need to check for count as well?
+ */
+
+ if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
return -EINVAL;
if (smt)
- sched_smt_power_savings = (buf[0] == '1');
+ sched_smt_power_savings = level;
else
- sched_mc_power_savings = (buf[0] == '1');
+ sched_mc_power_savings = level;
ret = arch_reinit_sched_domains();
When the system utilisation is low and more cpus are idle,
then the process waking up from sleep should prefer to
wakeup an idle cpu from semi-idle cpu package (multi core
package) rather than a completely idle cpu package which
would waste power.
Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.
This info could be stored in the appropriate sched_domain, but
updating this info in all copies of sched_domain is not
practical. Hence this information is stored in the root_domain
struct, which can be accessed from each cpu's runqueue and of
which there is one copy per partitioned sched domain.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 6bea99b..0918677 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -493,6 +493,14 @@ struct root_domain {
#ifdef CONFIG_SMP
struct cpupri cpupri;
#endif
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+ /*
+ * Preferred wake up cpu nominated by sched_mc balance that will be
+ * used when most cpus are idle in the system indicating overall very
+ * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+ */
+ unsigned int sched_mc_preferred_wakeup_cpu;
+#endif
};
/*
@@ -3407,6 +3415,10 @@ out_balanced:
if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
+ cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+ first_cpu(group_leader->cpumask);
+ }
return group_min;
}
#endif
Just in case two groups have identical load, prefer to move load to the
lower logical cpu number rather than the present logic of moving to the
higher logical number.
find_busiest_group() tries to look for a group_leader that has spare capacity
to take more tasks and free up an appropriate least loaded group. In case
there is a tie and the load is equal, the group with the higher logical
number is favoured. This conflicts with the user space irqbalance daemon,
which will move interrupts to the lower logical number if the system
utilisation is very low.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 322cd2a..6bea99b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3264,7 +3264,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
*/
if ((sum_nr_running < min_nr_running) ||
(sum_nr_running == min_nr_running &&
- first_cpu(group->cpumask) <
+ first_cpu(group->cpumask) >
first_cpu(group_min->cpumask))) {
group_min = group;
min_nr_running = sum_nr_running;
@@ -3280,7 +3280,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (sum_nr_running <= group_capacity - 1) {
if (sum_nr_running > leader_nr_running ||
(sum_nr_running == leader_nr_running &&
- first_cpu(group->cpumask) >
+ first_cpu(group->cpumask) <
first_cpu(group_leader->cpumask))) {
group_leader = group;
leader_nr_running = sum_nr_running;
Preferred wakeup cpu (from a semi idle package) has been
nominated in find_busiest_group() in the previous patch. Use
this information in sched_mc_preferred_wakeup_cpu in function
wake_idle() to bias task wakeups if the following conditions
are satisfied:
- The present cpu that is trying to wakeup the process is
idle and waking the target process on this cpu will
potentially wakeup a completely idle package
- The previous cpu on which the target process ran is
also idle and hence selecting the previous cpu may
wakeup a semi idle cpu package
- The task being woken up is allowed to run in the
nominated cpu (cpu affinity and restrictions)
Basically, if both the current cpu and the previous cpu on
which the task ran are idle, select the nominated cpu from the semi-idle
cpu package for running the new task that is waking up.
Cache hotness is considered, since the actual biasing happens
in wake_idle() only if the application is cache cold.
This technique will effectively consolidate short running bursty jobs in
a mostly idle system.
Wakeup biasing for power savings gets automatically disabled if
system utilisation increases, because the probability
of finding both this_cpu and prev_cpu idle decreases.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched_fair.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 98345e4..939f2a1 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p)
cpumask_t tmp;
struct sched_domain *sd;
int i;
+ unsigned int chosen_wakeup_cpu;
+ int this_cpu;
+
+ /*
+ * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+ * are idle and this is not a kernel thread and this task's affinity
+ * allows it to be moved to preferred cpu, then just move!
+ */
+
+ this_cpu = smp_processor_id();
+ chosen_wakeup_cpu =
+ cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+ idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
+ cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
+ return chosen_wakeup_cpu;
/*
* If it is idle, then it is the best cpu to run this task.
Active load balancing is a process by which the migration thread
is woken up on the target CPU in order to pull a currently
running task on another package into this newly idle
package.
This method is already in use with normal load_balance();
this patch introduces this method for newly idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
This logic provides effective consolidation of short running
daemon jobs in an almost idle system.
A side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised. The number of failed balance
iterations before triggering may need to be adjusted.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 54 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 0918677..d60f191 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3692,10 +3692,64 @@ redo:
}
if (!ld_moved) {
+ int active_balance = 0;
+
schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;
+
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ return -1;
+
+ if (sd->nr_balance_failed++ < 2)
+ return -1;
+
+ /*
+ * The only task running in a non-idle cpu can be moved to this
+ * cpu in an attempt to completely freeup the other CPU
+ * package. The same method used to move task in load_balance()
+ * have been extended for load_balance_newidle() to speedup
+ * consolidation at sched_mc=POWERSAVINGS_BALANCE_WAKEUP (2)
+ *
+ * The package power saving logic comes from
+ * find_busiest_group(). If there is no imbalance, then
+ * f_b_g() will return NULL. However when sched_mc={1,2} then
+ * f_b_g() will select a group from which a running task may be
+ * pulled to this cpu in order to make the other package idle.
+ * If there is no opportunity to make a package idle and if
+ * there is no imbalance, then f_b_g() will return NULL and no
+ * action will be taken in load_balance_newidle().
+ *
+ * Under normal task pull operation due to imbalance, there
+ * will be more than one task in the source run queue and
+ * move_tasks() will succeed. ld_moved will be true and this
+ * active balance code will not be triggered.
+ */
+
+ /* Lock busiest in correct order while this_rq is held */
+ double_lock_balance(this_rq, busiest);
+
+ /*
+ * don't kick the migration_thread, if the curr
+ * task on busiest cpu can't be moved to this_cpu
+ */
+ if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+ double_unlock_balance(this_rq, busiest);
+ all_pinned = 1;
+ return ld_moved;
+ }
+
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->push_cpu = this_cpu;
+ active_balance = 1;
+ }
+
+ double_unlock_balance(this_rq, busiest);
+ if (active_balance)
+ wake_up_process(busiest->migration_thread);
+
} else
sd->nr_balance_failed = 0;
Add the SD_BALANCE_NEWIDLE flag at the MC and CPU levels
if sched_mc is set. This helps power savings and
will not affect performance when sched_mc=0.
Ingo and Mike Galbraith have optimised the SD flags by
removing SD_BALANCE_NEWIDLE at MC and CPU level. This
helps performance but hurts power savings since this
slows down task consolidation by reducing the number
of times load_balance is run.
sched: fine-tune SD_MC_INIT
commit 14800984706bf6936bbec5187f736e928be5c218
Author: Mike Galbraith <[email protected]>
Date: Fri Nov 7 15:26:50 2008 +0100
sched: re-tune balancing -- revert
commit 9fcd18c9e63e325dbd2b4c726623f760788d5aa8
Author: Ingo Molnar <[email protected]>
Date: Wed Nov 5 16:52:08 2008 +0100
This patch selectively enables the SD_BALANCE_NEWIDLE flag
only when sched_mc is set to 1 or 2. This helps power savings
through task consolidation and also does not hurt performance at
sched_mc=0, where all power saving optimisations are turned off.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
include/linux/sched.h | 10 ++++++++++
include/linux/topology.h | 6 ++++--
2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 888f2b2..e894892 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -782,6 +782,16 @@ enum powersavings_balance_level {
((sched_mc_power_savings || sched_smt_power_savings) ? \
SD_POWERSAVINGS_BALANCE : 0)
+/*
+ * Optimise SD flags for power savings:
+ * SD_BALANCE_NEWIDLE helps aggressive task consolidation and power savings.
+ * Keep default SD flags if sched_{smt,mc}_power_saving=0
+ */
+
+#define POWERSAVING_SD_FLAGS \
+ ((sched_mc_power_savings || sched_smt_power_savings) ? \
+ SD_BALANCE_NEWIDLE : 0)
+
#define test_sd_parent(sd, flag) ((sd->parent && \
(sd->parent->flags & flag)) ? 1 : 0)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 117f1b7..26c608c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,7 +125,8 @@ void arch_update_cpu_topology(void);
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
- | BALANCE_FOR_MC_POWER, \
+ | BALANCE_FOR_MC_POWER \
+ | POWERSAVING_SD_FLAGS, \
.last_balance = jiffies, \
.balance_interval = 1, \
}
@@ -150,7 +151,8 @@ void arch_update_cpu_topology(void);
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
- | BALANCE_FOR_PKG_POWER,\
+ | BALANCE_FOR_PKG_POWER \
+ | POWERSAVING_SD_FLAGS, \
.last_balance = jiffies, \
.balance_interval = 1, \
}
load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE
is set at the higher level domain (3-CPU) but not in the low level domain
(2-MC).
pulled_task is initialised to -1 and checked for non-zero, which
is always true if the lowest level sched_domain does not have the
SD_BALANCE_NEWIDLE flag set.
Trivial fix: initialise pulled_task to zero.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index d60f191..36500cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3773,7 +3773,7 @@ out_balanced:
static void idle_balance(int this_cpu, struct rq *this_rq)
{
struct sched_domain *sd;
- int pulled_task = -1;
+ int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;
cpumask_t tmpmask;
* Vaidyanathan Srinivasan <[email protected]> [2008-12-11 23:12:37]:
> From: Gautham R Shenoy <[email protected]>
>
> *** RFC patch of work in progress and not for inclusion. ***
>
> Currently the sched_mc/smt_power_savings variable is a boolean, which either
> enables or disables topology based power savings. This extends the behaviour of
> the variable from boolean to multivalued, such that based on the value, we
> decide how aggressively do we want to perform topology based powersavings
> balance.
>
> Variable levels of power saving tunable would benefit end user to match the
> required level of power savings vs performance trade off depending on the
> system configuration and workloads.
>
> This initial version makes the sched_mc_power_savings global variable to take
> more values (0,1,2).
>
> Later version is expected to add new member sd->powersavings_level at the multi
> core CPU level sched_domain. This make all sd->flags check for
> SD_POWERSAVINGS_BALANCE into a different macro that will check for
> powersavings_level.
>
> The power savings level setting should be in one place either in the
> sched_mc_power_savings global variable or contained within the appropriate
> sched_domain structure.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> include/linux/sched.h | 11 +++++++++++
> kernel/sched.c | 16 +++++++++++++---
> 2 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 55e30d1..888f2b2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -764,6 +764,17 @@ enum cpu_idle_type {
> #define SD_SERIALIZE 1024 /* Only a single load balancing instance */
> #define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
>
> +enum powersavings_balance_level {
> + POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
> + POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
> + * first for long running threads
> + */
> + POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
> + * cpu package for power savings
> + */
> + MAX_POWERSAVINGS_BALANCE_LEVELS
> +};
> +
> #define BALANCE_FOR_MC_POWER \
> (sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index e4bb1dd..322cd2a 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -7879,14 +7879,24 @@ int arch_reinit_sched_domains(void)
> static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
> {
> int ret;
> + unsigned int level = 0;
>
> - if (buf[0] != '0' && buf[0] != '1')
> + sscanf(buf, "%u", &level);
Don't we need to check what sscanf returns? Does an invalid value push
the power savings to 0?
> +
> + /*
> + * level is always be positive so don't check for
> + * level < POWERSAVINGS_BALANCE_NONE which is 0
> + * What happens on 0 or 1 byte write,
> + * need to check for count as well?
> + */
See above
> +
> + if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
> return -EINVAL;
>
> if (smt)
> - sched_smt_power_savings = (buf[0] == '1');
> + sched_smt_power_savings = level;
> else
> - sched_mc_power_savings = (buf[0] == '1');
> + sched_mc_power_savings = level;
>
> ret = arch_reinit_sched_domains();
>
>
>
--
Balbir
* Balbir Singh <[email protected]> [2008-12-12 00:25:29]:
> * Vaidyanathan Srinivasan <[email protected]> [2008-12-11 23:12:37]:
>
> > From: Gautham R Shenoy <[email protected]>
> >
> > *** RFC patch of work in progress and not for inclusion. ***
> >
> > Currently the sched_mc/smt_power_savings variable is a boolean, which either
> > enables or disables topology based power savings. This extends the behaviour of
> > the variable from boolean to multivalued, such that based on the value, we
> > decide how aggressively do we want to perform topology based powersavings
> > balance.
> >
> > Variable levels of power saving tunable would benefit end user to match the
> > required level of power savings vs performance trade off depending on the
> > system configuration and workloads.
> >
> > This initial version makes the sched_mc_power_savings global variable to take
> > more values (0,1,2).
> >
> > Later version is expected to add new member sd->powersavings_level at the multi
> > core CPU level sched_domain. This make all sd->flags check for
> > SD_POWERSAVINGS_BALANCE into a different macro that will check for
> > powersavings_level.
> >
> > The power savings level setting should be in one place either in the
> > sched_mc_power_savings global variable or contained within the appropriate
> > sched_domain structure.
> >
> > Signed-off-by: Gautham R Shenoy <[email protected]>
> > Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> > ---
> >
> > include/linux/sched.h | 11 +++++++++++
> > kernel/sched.c | 16 +++++++++++++---
> > 2 files changed, 24 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 55e30d1..888f2b2 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -764,6 +764,17 @@ enum cpu_idle_type {
> > #define SD_SERIALIZE 1024 /* Only a single load balancing instance */
> > #define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
> >
> > +enum powersavings_balance_level {
> > + POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
> > + POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
> > + * first for long running threads
> > + */
> > + POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
> > + * cpu package for power savings
> > + */
> > + MAX_POWERSAVINGS_BALANCE_LEVELS
> > +};
> > +
> > #define BALANCE_FOR_MC_POWER \
> > (sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index e4bb1dd..322cd2a 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -7879,14 +7879,24 @@ int arch_reinit_sched_domains(void)
> > static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
> > {
> > int ret;
> > + unsigned int level = 0;
> >
> > - if (buf[0] != '0' && buf[0] != '1')
> > + sscanf(buf, "%u", &level);
>
> Don't we need to check what sscanf returns? Does a invalid value push
> the power savings to 0
Hi Balbir,
Good catch. I have always been providing a correct value ;)
An incorrect input will make sched_mc=0; I just verified that. I will
fix it.
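For reference, the fix will probably be along these lines (sketch only,
not the final patch): check the sscanf() return value together with the
range check:

	unsigned int level = 0;

	/* Reject input that does not parse as a number or is out of range */
	if (sscanf(buf, "%u", &level) != 1 ||
	    level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
		return -EINVAL;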
--Vaidy
* Pavel Machek <[email protected]> [1970-01-01 01:13:43]:
>
> > Results:
> > --------
> >
> > Basic functionality of the code has not changed and the power vs
> > performance benefits for kernbench are similar to the ones posted
> > earlier.
> >
> > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > package system
> >
> > SchedMC Run Time Package Idle Energy Power
> > 0 81.28 52.43% 53.53% 1.00x J 1.00y W
> > 1 80.71 37.35% 68.91% 0.96x J 0.97y W
> > 2 76.05 23.81% 82.65% 0.92x J 0.98y W
> >
> > *** This is RFC code and not for inclusion ***
>
> Hmm, so it makes it compile faster _and_ it saves power? Why keep
> it tunable at all if it is win-win? Or are there other benchmarks?
Hi Pavel,
The performance and power gains depend on the nature of the
application. If the processor cache is large enough, then
consolidation improves cache sharing and makes the system finish the
job faster. I would expect applications with a larger cache footprint
and no data sharing to experience a degradation in performance.
I am open to suggestion on any interesting workload that should be
tested. I have tested ebizzy and SPECjbb for power savings. I am
also looking forward to test reports from others having different
system topology and processor power saving features.
The idea behind the tunable is to give power savings a higher priority
in scheduling relative to performance. At the higher sched_mc=2 setting
(aggressive power save mode), we should save power even if there is
a slight degradation in application performance. If we save power and
improve performance, then it is a best case scenario. However I would
expect certain workloads to sacrifice performance for lower power
consumption.
At sched_mc=0 the scheduler does not optimise for power savings
and provides maximum performance for any application.
Kernel compile is a representative workload where my additional
optimisations for sched_mc=2 improve over the default sched_mc=1
implementation in the kernel.
--Vaidy
* Vaidyanathan Srinivasan <[email protected]> [2008-12-11 23:12:48]:
> Just in case two groups have identical load, prefer to move load to lower
> logical cpu number rather than the present logic of moving to higher logical
> number.
>
> find_busiest_group() tries to look for a group_leader that has spare capacity
> to take more tasks and freeup an appropriate least loaded group. Just in case
> there is a tie and the load is equal, then the group with higher logical number
> is favoured. This conflicts with user space irqbalance daemon that will move
> interrupts to lower logical number if the system utilisation is very low.
>
This patch will work well with irqbalance only when irqbalance decides
to switch to power mode. If the interrupt rate is high, irqbalance is in
performance mode and sched_mc > 1, what is the impact
of this patch?
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 322cd2a..6bea99b 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3264,7 +3264,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> */
> if ((sum_nr_running < min_nr_running) ||
> (sum_nr_running == min_nr_running &&
> - first_cpu(group->cpumask) <
> + first_cpu(group->cpumask) >
> first_cpu(group_min->cpumask))) {
The first_cpu logic worries me a bit. This has existed for a while
already, but with the topology I see on my system, I find the cpu
numbers interleaved: even numbers (0, 2, 4 and 6) belong to one core and
odd numbers to the other.
So for a topology like (assume dual core, dual socket)
        0-3
       /   \
    0-1     2-3
    / \     / \
   0   1   2   3
If group_min is the domain with (2-3) and we are looking at
group(0-1). first_cpu of (0-1) is 0 and (2-3) is 2, how does changing
"<" to ">" help push the tasks to the lower ordered group? In the case
described above group_min continues to be (2-3).
Shouldn't the check be if (first_cpu(group->cpumask) <=
first_cpu(group_min->cpumask))?
> group_min = group;
> min_nr_running = sum_nr_running;
> @@ -3280,7 +3280,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> if (sum_nr_running <= group_capacity - 1) {
> if (sum_nr_running > leader_nr_running ||
> (sum_nr_running == leader_nr_running &&
> - first_cpu(group->cpumask) >
> + first_cpu(group->cpumask) <
> first_cpu(group_leader->cpumask))) {
> group_leader = group;
> leader_nr_running = sum_nr_running;
>
>
All these changes are good. I would like to see additional statistics
that show how many decisions were taken due to the new power aware
balancing logic, so that I can spot the bad and corner cases based on the
statistics I see.
--
Balbir
* Vaidyanathan Srinivasan <[email protected]> [2008-12-11 23:12:57]:
> When the system utilisation is low and more cpus are idle,
> then the process waking up from sleep should prefer to
> wakeup an idle cpu from semi-idle cpu package (multi core
> package) rather than a completely idle cpu package which
> would waste power.
>
> Use the sched_mc balance logic in find_busiest_group() to
> nominate a preferred wakeup cpu.
>
> This info can be sored in appropriate sched_domain, but
> updating this info in all copies of sched_domain is not
> practical. Hence this information is stored in root_domain
> struct which is one copy per partitioned sched domain.
> The root_domain can be accessed from each cpu's runqueue
> and there is one copy per partitioned sched domain.
>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched.c | 12 ++++++++++++
> 1 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 6bea99b..0918677 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -493,6 +493,14 @@ struct root_domain {
> #ifdef CONFIG_SMP
> struct cpupri cpupri;
> #endif
> +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> + /*
> + * Preferred wake up cpu nominated by sched_mc balance that will be
> + * used when most cpus are idle in the system indicating overall very
> + * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
Is the root domain good enough?
What is POWERSAVINGS_BALANCE_WAKEUP(2), is it sched_mc == 2?
> + */
> + unsigned int sched_mc_preferred_wakeup_cpu;
> +#endif
> };
>
> /*
> @@ -3407,6 +3415,10 @@ out_balanced:
>
> if (this == group_leader && group_leader != group_min) {
> *imbalance = min_load_per_task;
> + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
OK, it is :) (for the question above). Where do we utilize the set
sched_mc_preferred_wakeup_cpu?
> + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> + first_cpu(group_leader->cpumask);
Every time we balance, we keep replacing rd->sched_mc_preferred_wake_up
with group_lead->cpumask? My big concern is that we do this without
checking if the group_leader has sufficient capacity (after it will
pull in tasks since we made the checks for nr_running and capacity).
> + }
> return group_min;
> }
> #endif
>
>
--
Balbir
* Vaidyanathan Srinivasan <[email protected]> [2008-12-11 23:13:23]:
> load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE
> is set at higher level domain (3-CPU) and not in low level domain
> (2-MC).
>
> pulled_task is initialised to -1 and checked for non-zero which
> is always true if the lowest level sched_domain does not have
> SD_BALANCE_NEWIDLE flag set.
>
> Trivial fix to initialise pulled_task to zero.
>
This is already pulled in for 2.6.29, so I am not reviewing it. We can
drop it from the next patch series.
--
Balbir
* Vaidyanathan Srinivasan <[email protected]> [2008-12-11 23:13:04]:
> Preferred wakeup cpu (from a semi idle package) has been
> nominated in find_busiest_group() in the previous patch. Use
> this information in sched_mc_preferred_wakeup_cpu in function
> wake_idle() to bias task wakeups if the following conditions
> are satisfied:
> - The present cpu that is trying to wakeup the process is
> idle and waking the target process on this cpu will
> potentially wakeup a completely idle package
> - The previous cpu on which the target process ran is
> also idle and hence selecting the previous cpu may
> wakeup a semi idle cpu package
> - The task being woken up is allowed to run in the
> nominated cpu (cpu affinity and restrictions)
>
> Basically if both the current cpu and the previous cpu on
> which the task ran is idle, select the nominated cpu from semi
> idle cpu package for running the new task that is waking up.
>
> Cache hotness is considered since the actual biasing happens
> in wake_idle() only if the application is cache cold.
>
> This technique will effectively move short running bursty jobs in
> a mostly idle system.
>
> Wakeup biasing for power savings gets automatically disabled if
> system utilisation increases due to the fact that the probability
> of finding both this_cpu and prev_cpu idle decreases.
>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched_fair.c | 17 +++++++++++++++++
> 1 files changed, 17 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 98345e4..939f2a1 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p)
> cpumask_t tmp;
> struct sched_domain *sd;
> int i;
> + unsigned int chosen_wakeup_cpu;
> + int this_cpu;
> +
> + /*
> + * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> + * are idle and this is not a kernel thread and this task's affinity
> + * allows it to be moved to preferred cpu, then just move!
> + */
> +
> + this_cpu = smp_processor_id();
> + chosen_wakeup_cpu =
> + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> +
> + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> + idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
The p->mm check is racy, it needs to be done under task_lock(). The
best way to check for a kernel thread is get_task_mm(), followed by
put_task_mm() if the mm is not NULL. We also need to check to see if
the task is _hot_ on cpu. We should negate this optimization in case
chosen_wakeup_cpu is idle, so check for that as well.
> + cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
> + return chosen_wakeup_cpu;
>
> /*
> * If it is idle, then it is the best cpu to run this task.
>
>
--
Balbir
On Mon, 2008-12-15 at 01:38 +0530, Vaidyanathan Srinivasan wrote:
> I am open to suggestion on any interesting workload that should be
> tested. I have tested ebizzy and SPECjbb for power savings.
Try some of the typical fortran stuff like lapack?
On Mon, 2008-12-15 at 12:31 +0530, Balbir Singh wrote:
> > kernel/sched_fair.c | 17 +++++++++++++++++
> > 1 files changed, 17 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index 98345e4..939f2a1 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p)
> > cpumask_t tmp;
> > struct sched_domain *sd;
> > int i;
> > + unsigned int chosen_wakeup_cpu;
> > + int this_cpu;
> > +
> > + /*
> > + * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> > + * are idle and this is not a kernel thread and this task's affinity
> > + * allows it to be moved to preferred cpu, then just move!
> > + */
> > +
> > + this_cpu = smp_processor_id();
> > + chosen_wakeup_cpu =
> > + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> > +
> > + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> > + idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
>
> The p->mm check is racy, it needs to be done under task_lock(). The
> best way to check for a kernel thread is get_task_mm(), followed by
> put_task_mm() is the mm is not NULL. We also need to check to see if
> the task is _hot_ on cpu. We should negate this optimization in case
> chosen_wakeup_cpu is idle, so check for that as well.
Sure it's racy, but so what?
The worst I can see is that we exclude a dying task from this logic,
which isn't a problem at all, since it's dying anyway.
Also, I don't think you can grab task_lock() from under rq->lock...
> > + cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
> > + return chosen_wakeup_cpu;
> >
> > /*
> > * If it is idle, then it is the best cpu to run this task.
> >
> >
>
On Mon, 2008-12-15 at 09:25 +0100, Peter Zijlstra wrote:
> On Mon, 2008-12-15 at 12:31 +0530, Balbir Singh wrote:
>
> > > kernel/sched_fair.c | 17 +++++++++++++++++
> > > 1 files changed, 17 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > > index 98345e4..939f2a1 100644
> > > --- a/kernel/sched_fair.c
> > > +++ b/kernel/sched_fair.c
> > > @@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p)
> > > cpumask_t tmp;
> > > struct sched_domain *sd;
> > > int i;
> > > + unsigned int chosen_wakeup_cpu;
> > > + int this_cpu;
> > > +
> > > + /*
> > > + * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> > > + * are idle and this is not a kernel thread and this task's affinity
> > > + * allows it to be moved to preferred cpu, then just move!
> > > + */
> > > +
> > > + this_cpu = smp_processor_id();
> > > + chosen_wakeup_cpu =
> > > + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> > > +
> > > + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> > > + idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
> >
> > The p->mm check is racy, it needs to be done under task_lock(). The
> > best way to check for a kernel thread is get_task_mm(), followed by
> > put_task_mm() is the mm is not NULL. We also need to check to see if
> > the task is _hot_ on cpu. We should negate this optimization in case
> > chosen_wakeup_cpu is idle, so check for that as well.
>
> Sure its racy, but so what?
>
> The worst I can see it that we exclude a dying task from this logic,
> which isn't a problem at all, since its dying anyway.
At which point I seriously doubt it'd still be on the rq anyway.
> Also, I don't think you can grab task_lock() from under rq->lock...
>
> > > + cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
> > > + return chosen_wakeup_cpu;
> > >
> > > /*
> > > * If it is idle, then it is the best cpu to run this task.
> > >
> > >
> >
* Peter Zijlstra <[email protected]> [2008-12-15 09:25:21]:
> On Mon, 2008-12-15 at 12:31 +0530, Balbir Singh wrote:
>
> > > kernel/sched_fair.c | 17 +++++++++++++++++
> > > 1 files changed, 17 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > > index 98345e4..939f2a1 100644
> > > --- a/kernel/sched_fair.c
> > > +++ b/kernel/sched_fair.c
> > > @@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p)
> > > cpumask_t tmp;
> > > struct sched_domain *sd;
> > > int i;
> > > + unsigned int chosen_wakeup_cpu;
> > > + int this_cpu;
> > > +
> > > + /*
> > > + * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> > > + * are idle and this is not a kernel thread and this task's affinity
> > > + * allows it to be moved to preferred cpu, then just move!
> > > + */
> > > +
> > > + this_cpu = smp_processor_id();
> > > + chosen_wakeup_cpu =
> > > + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> > > +
> > > + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> > > + idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
> >
> > The p->mm check is racy, it needs to be done under task_lock(). The
> > best way to check for a kernel thread is get_task_mm(), followed by
> > put_task_mm() is the mm is not NULL. We also need to check to see if
> > the task is _hot_ on cpu. We should negate this optimization in case
> > chosen_wakeup_cpu is idle, so check for that as well.
>
> Sure its racy, but so what?
>
> The worst I can see it that we exclude a dying task from this logic,
> which isn't a problem at all, since its dying anyway.
Fair enough... except that the dying task will wake up a potentially
idle CPU and die.
>
> Also, I don't think you can grab task_lock() from under rq->lock...
I've not looked at how task_lock() nests under rq->lock. I'll look
>
> > > + cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
> > > + return chosen_wakeup_cpu;
> > >
> > > /*
> > > * If it is idle, then it is the best cpu to run this task.
> > >
> > >
> >
>
>
--
Balbir
* Peter Zijlstra <[email protected]> [2008-12-15 09:33:04]:
> On Mon, 2008-12-15 at 09:25 +0100, Peter Zijlstra wrote:
> > On Mon, 2008-12-15 at 12:31 +0530, Balbir Singh wrote:
> >
> > > > kernel/sched_fair.c | 17 +++++++++++++++++
> > > > 1 files changed, 17 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > > > index 98345e4..939f2a1 100644
> > > > --- a/kernel/sched_fair.c
> > > > +++ b/kernel/sched_fair.c
> > > > @@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p)
> > > > cpumask_t tmp;
> > > > struct sched_domain *sd;
> > > > int i;
> > > > + unsigned int chosen_wakeup_cpu;
> > > > + int this_cpu;
> > > > +
> > > > + /*
> > > > + * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> > > > + * are idle and this is not a kernel thread and this task's affinity
> > > > + * allows it to be moved to preferred cpu, then just move!
> > > > + */
> > > > +
> > > > + this_cpu = smp_processor_id();
> > > > + chosen_wakeup_cpu =
> > > > + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> > > > +
> > > > + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> > > > + idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
> > >
> > > The p->mm check is racy, it needs to be done under task_lock(). The
> > > best way to check for a kernel thread is get_task_mm(), followed by
> > > put_task_mm() is the mm is not NULL. We also need to check to see if
> > > the task is _hot_ on cpu. We should negate this optimization in case
> > > chosen_wakeup_cpu is idle, so check for that as well.
> >
> > Sure its racy, but so what?
> >
> > The worst I can see it that we exclude a dying task from this logic,
> > which isn't a problem at all, since its dying anyway.
>
> At which point I seriously doubt it'd still be on the rq anyway.
>
I forgot to mention that, the check should be (p->mm && !(p->flags & PF_KTHREAD))
--
Balbir
* Balbir Singh <[email protected]> [2008-12-15 11:42:09]:
> * Vaidyanathan Srinivasan <[email protected]> [2008-12-11 23:12:48]:
>
> > Just in case two groups have identical load, prefer to move load to lower
> > logical cpu number rather than the present logic of moving to higher logical
> > number.
> >
> > find_busiest_group() tries to look for a group_leader that has spare capacity
> > to take more tasks and freeup an appropriate least loaded group. Just in case
> > there is a tie and the load is equal, then the group with higher logical number
> > is favoured. This conflicts with user space irqbalance daemon that will move
> > interrupts to lower logical number if the system utilisation is very low.
> >
>
> This patch will work well with irqbalance only when irqbalance decides
> to switch to power mode and if the interrupt rate is high and
In power save mode irqbalance will move interrupts to cpu0. At this
point the system utilisation is very low. When cpu0's
utilisation increases with all other packages being idle, the
sched_mc kernel side logic will also consolidate tasks to cpu0.
> irqbalance is in performance mode and sched_mc > 1, what is the impact
> of this patch?
This is a different workload characteristic. If the interrupt
rate is high but the overall system utilisation is low, then
interrupts will get routed to whichever cpu the irqbalancer decides, while
the kernel will independently choose a package for tasks to run. However,
if the network traffic is very high, then that will drive up the
overall system utilisation as well, and hence the opportunity to save
power will also reduce.
At this point we have an approximate convergence of the user space
irqbalancer and the kernel side decision to the same CPU package. This
can go wrong for some corner cases or network intensive workloads.
Having the user space and kernel communicate on the cpu to consolidate
tasks and interrupts would improve the power savings. Timers and
interrupts need to be consolidated and coordinated with the sched_mc
based task consolidation.
> > Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> > ---
> >
> > kernel/sched.c | 4 ++--
> > 1 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 322cd2a..6bea99b 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -3264,7 +3264,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> > */
> > if ((sum_nr_running < min_nr_running) ||
> > (sum_nr_running == min_nr_running &&
> > - first_cpu(group->cpumask) <
> > + first_cpu(group->cpumask) >
> > first_cpu(group_min->cpumask))) {
>
> The first_cpu logic worries me a bit. This has existed for a while
> already, but with the topology I see on my system, I find the cpu
> numbers interleaved on my system (0,2,4 and 6) belong to one core and
> odd numbers to the other.
>
> So for a topology like (assume dual core, dual socket)
>
> 0-3
> / \
> 0-1 2-3
> / \ / \
> 0 1 2 3
>
>
> If group_min is the domain with (2-3) and we are looking at
> group(0-1). first_cpu of (0-1) is 0 and (2-3) is 2, how does changing
> "<" to ">" help push the tasks to the lower ordered group? In the case
> described above group_min continues to be (2-3).
>
> Shouldn't the check be if (first_cpu(group->cpumask) <=
> first_cpu(group_min->cpumask)?
Cpu numbers are set in the cpumask starting from the right most bit. The
cpu mask for group 0-1 will be 0x03 and for group 2-3 it will be 0x0C.
first_cpu() should get us the index of the first (least significant) bit
set. As you have explained, if the 'load' is equal in both groups, one
task each, then group_min will be group 2-3 with my change. group_min is
the group from which tasks are pulled and group_leader is the group to
which the tasks should be moved for power savings.
I have reversed the comparison in the group_leader selection as well.
Without this patch, at equal load, the recommendation from
find_busiest_group will be 0-1 ==> 2-3, while with this change the
group leader will become 0-1 and group_min will be 2-3 and the
direction will be 2-3 ==> 0-1.
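To illustrate (sketch only, using the dual core, dual socket numbering
above):

	cpumask_t grp01 = CPU_MASK_NONE, grp23 = CPU_MASK_NONE;

	cpu_set(0, grp01);
	cpu_set(1, grp01);	/* mask 0x03, first_cpu(grp01) == 0 */
	cpu_set(2, grp23);
	cpu_set(3, grp23);	/* mask 0x0C, first_cpu(grp23) == 2 */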
>
> > group_min = group;
> > min_nr_running = sum_nr_running;
> > @@ -3280,7 +3280,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> > if (sum_nr_running <= group_capacity - 1) {
> > if (sum_nr_running > leader_nr_running ||
> > (sum_nr_running == leader_nr_running &&
> > - first_cpu(group->cpumask) >
> > + first_cpu(group->cpumask) <
> > first_cpu(group_leader->cpumask))) {
> > group_leader = group;
> > leader_nr_running = sum_nr_running;
> >
> >
>
> All these changes are good, I would like to see additional statistics
> that show how many decisions were taken due to new power aware
> balancing logic, so that I spot the bad and corner cases based on the
> statistics I see.
I had traced all these decisions during initial development. This
situation happens more often when we have exactly half load of long
running threads in the system. Like ebizzy with 2 threads in a dual
socket dual core system that you just described. This change in
decision will help consolidate the workload to group 0-1 as compared
to 2-3 in most cases. The consolidation decision can be favoring 2-3
also if there were other small bursty tasks running in the system that
will make the find_busiest group suggest consolidation of 0-1 ==> 2-3.
This exact condition of first_cpu() > xxx is important only when
nr_running is equal in both groups; after one or two load balance
iterations, the system would have consolidated the tasks and this
decision point will not be hit.
Thanks for the detailed review.
--Vaidy
* Balbir Singh <[email protected]> [2008-12-15 12:10:56]:
> * Vaidyanathan Srinivasan <[email protected]> [2008-12-11 23:12:57]:
>
> > When the system utilisation is low and more cpus are idle,
> > then the process waking up from sleep should prefer to
> > wakeup an idle cpu from semi-idle cpu package (multi core
> > package) rather than a completely idle cpu package which
> > would waste power.
> >
> > Use the sched_mc balance logic in find_busiest_group() to
> > nominate a preferred wakeup cpu.
> >
> > This info can be sored in appropriate sched_domain, but
> > updating this info in all copies of sched_domain is not
> > practical. Hence this information is stored in root_domain
> > struct which is one copy per partitioned sched domain.
> > The root_domain can be accessed from each cpu's runqueue
> > and there is one copy per partitioned sched domain.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> > ---
> >
> > kernel/sched.c | 12 ++++++++++++
> > 1 files changed, 12 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 6bea99b..0918677 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -493,6 +493,14 @@ struct root_domain {
> > #ifdef CONFIG_SMP
> > struct cpupri cpupri;
> > #endif
> > +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> > + /*
> > + * Preferred wake up cpu nominated by sched_mc balance that will be
> > + * used when most cpus are idle in the system indicating overall very
> > + * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
>
> Is the root domain good enough?
>
> What is POWERSAVINGS_BALANCE_WAKEUP(2), is it sched_mc == 2?
Yes
>
> > + */
> > + unsigned int sched_mc_preferred_wakeup_cpu;
> > +#endif
> > };
> >
> > /*
> > @@ -3407,6 +3415,10 @@ out_balanced:
> >
> > if (this == group_leader && group_leader != group_min) {
> > *imbalance = min_load_per_task;
> > + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
>
> OK, it is :) (for the question above). Where do we utilize the set
> sched_mc_preferred_wakeup_cpu?
We use this nominated cpu in wake_idle() in sched_fair.c
> > + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> > + first_cpu(group_leader->cpumask);
>
> Everytime we balance, we keep replacing rd->sched_mc_preferred_wake_up
> with group_lead->cpumask? My big concern is that we do this without
The first_cpu in the group_leader's mask. The nomination is a cpu
number.
> checking if the group_leader has sufficient capacity (after it will
> pull in tasks since we made the checks for nr_running and capacity).
You are correct. But if we are running find_busiest_group(), then we
are in the load_balance() code on this cpu and the exit from this function
should recommend a task pull. The cpu evaluating the load on
group_leader will be the nominated load_balancer cpu for this
group/domain. Nobody would have pushed tasks to our group while we are
in this function. However interrupts and other preemptible corner
cases may change the load with RT tasks etc. Generally the
computed load on _this_ group (group_leader) will not change.
What you are pointing out is valid for other group loads like
group_min etc.
--Vaidy
* Balbir Singh <[email protected]> [2008-12-15 14:16:42]:
> * Peter Zijlstra <[email protected]> [2008-12-15 09:33:04]:
>
> > On Mon, 2008-12-15 at 09:25 +0100, Peter Zijlstra wrote:
> > > On Mon, 2008-12-15 at 12:31 +0530, Balbir Singh wrote:
> > >
> > > > > kernel/sched_fair.c | 17 +++++++++++++++++
> > > > > 1 files changed, 17 insertions(+), 0 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > > > > index 98345e4..939f2a1 100644
> > > > > --- a/kernel/sched_fair.c
> > > > > +++ b/kernel/sched_fair.c
> > > > > @@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p)
> > > > > cpumask_t tmp;
> > > > > struct sched_domain *sd;
> > > > > int i;
> > > > > + unsigned int chosen_wakeup_cpu;
> > > > > + int this_cpu;
> > > > > +
> > > > > + /*
> > > > > + * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> > > > > + * are idle and this is not a kernel thread and this task's affinity
> > > > > + * allows it to be moved to preferred cpu, then just move!
> > > > > + */
> > > > > +
> > > > > + this_cpu = smp_processor_id();
> > > > > + chosen_wakeup_cpu =
> > > > > + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> > > > > +
> > > > > + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> > > > > + idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
> > > >
> > > > The p->mm check is racy, it needs to be done under task_lock(). The
> > > > best way to check for a kernel thread is get_task_mm(), followed by
> > > > put_task_mm() is the mm is not NULL. We also need to check to see if
> > > > the task is _hot_ on cpu. We should negate this optimization in case
> > > > chosen_wakeup_cpu is idle, so check for that as well.
> > >
> > > Sure its racy, but so what?
> > >
> > > The worst I can see it that we exclude a dying task from this logic,
> > > which isn't a problem at all, since its dying anyway.
> >
> > At which point I seriously doubt it'd still be on the rq anyway.
> >
>
> I forgot to mention that, the check should be (p->mm && !(p->flags & PF_KTHREAD))
I can check for PF_KTHREAD for now. However, I should reduce the
number of checks since this may slow down wake_idle for sched_mc=2.
We can tolerate p->mm check on a dying process as Peter has suggested,
hence we don't need to protect it. We are not going to access any
contents of the mm struct.
If PF_KTHREAD is only being used by AIO, then I feel we can drop the
check since the threads will not have affinity and they can be moved
to other cpus anyway.
The main reason for skipping kthreads is that they may be using per-cpu
variables while sleeping/preempted. I did not want the wake_idle() logic
to move them around forcefully. This is not the general case and this
situation should not happen.
The second reason is to optimise the affinity check, since most of the
kthreads have affinity and cannot be moved.
This condition check needs optimisation after getting the framework
functionally correct and useful.
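For illustration, the combined check being discussed would be roughly as
follows (sketch only; the final ordering and set of conditions is still
open):

	if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
	    idle_cpu(cpu) && idle_cpu(this_cpu) &&
	    p->mm && !(p->flags & PF_KTHREAD) &&
	    cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
		return chosen_wakeup_cpu;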
--Vaidy
>> > > Sure its racy, but so what?
>> > >
>> > > The worst I can see it that we exclude a dying task from this logic,
>> > > which isn't a problem at all, since its dying anyway.
>> >
>> > At which point I seriously doubt it'd still be on the rq anyway.
>> >
>>
>> I forgot to mention that, the check should be (p->mm && !(p->flags & PF_KTHREAD))
>
> I can check for PF_KTHREAD for now. However, I should reduce the
> number of checks since this may slow down wake_idle for sched_mc=2.
>
> We can tolerate p->mm check on a dying process as Peter has suggested,
> hence we don't need to protect it. We are not going to access any
> contents of the mm struct.
>
> If PF_KTHREAD is only being used by AIO, then I feel we can drop the
> check since the threads will not have affinity and they can be moved
> to other cpus anyway.
>
> The main reason for skipping kthread is that they may be using per-cpu
> variables and sleep/preempted. I did not want the wake_idle() logic
> to move them around forcefully. This is not the general case and this
> situation should not happen.
>
> Second reason is to optimise on the affinity check since most of the
> kthreads have affinity and cannot be moved.
>
> This condition check needs optimisation after getting the framework
> functionally correct and useful.
Vaidy, you (or your mailer) seem to have dropped me off of the to/cc
list while replying and this seems to be the case for all replies.
Balbir
* Balbir Singh <[email protected]> [2008-12-15 23:32:36]:
> >> > > Sure its racy, but so what?
> >> > >
> >> > > The worst I can see it that we exclude a dying task from this logic,
> >> > > which isn't a problem at all, since its dying anyway.
> >> >
> >> > At which point I seriously doubt it'd still be on the rq anyway.
> >> >
> >>
> >> I forgot to mention that, the check should be (p->mm && !(p->flags & PF_KTHREAD))
> >
> > I can check for PF_KTHREAD for now. However, I should reduce the
> > number of checks since this may slow down wake_idle for sched_mc=2.
> >
> > We can tolerate p->mm check on a dying process as Peter has suggested,
> > hence we don't need to protect it. We are not going to access any
> > contents of the mm struct.
> >
> > If PF_KTHREAD is only being used by AIO, then I feel we can drop the
> > check since the threads will not have affinity and they can be moved
> > to other cpus anyway.
> >
> > The main reason for skipping kthread is that they may be using per-cpu
> > variables and sleep/preempted. I did not want the wake_idle() logic
> > to move them around forcefully. This is not the general case and this
> > situation should not happen.
> >
> > Second reason is to optimise on the affinity check since most of the
> > kthreads have affinity and cannot be moved.
> >
> > This condition check needs optimisation after getting the framework
> > functionally correct and useful.
>
>
> Vaidy, you (or your mailer) seem to have dropped me off of the to/cc
> list while replying and this seems to be the case for all replies.
Hi Balbir,
Thanks for pointing that out. I will review my mutt setup. This is
strange... since a group-reply to this message puts you in the to list
as expected, but not the ones that I sent earlier in reply to your
messages. The header seems to be correct for others except for your
message :(
--Vaidy
> Results:
> --------
>
> Basic functionality of the code has not changed and the power vs
> performance benefits for kernbench are similar to the ones posted
> earlier.
>
> KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> package system
>
> SchedMC Run Time Package Idle Energy Power
> 0 81.28 52.43% 53.53% 1.00x J 1.00y W
> 1 80.71 37.35% 68.91% 0.96x J 0.97y W
> 2 76.05 23.81% 82.65% 0.92x J 0.98y W
>
> *** This is RFC code and not for inclusion ***
Hmm, so it makes it compile faster _and_ it saves power? Why keep
it tunable at all if it is win-win? Or are there other benchmarks?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html