Subject: [PATCH v3 0/6] sched: Extend sched_mc/smt_framework

Hi,

I am reposting the iteration 3 of the patch series that extends the existing
sched_smt_/mc_power_savings framework to work on platforms
that have on-chip memory controllers making each of the cpu-package
a 'node'. I have rebased this patch series against 2.6.29-rc8.

Changes from V2: (Found here: --> http://lkml.org/lkml/2009/3/3/109)
- Patches have been split up in an incremental manner for easy review.
- Fixed comments for some variables.
- Renamed some variables to better reflect their usage.

Changes from V1: (Found here: --> http://lkml.org/lkml/2009/2/16/221)
- Added comments to explain power-saving part in find_busiest_group()
- Added comments for the different sched_domain levels.

Background
------------------------------------------------------------------
On machines with on-chip memory controller, each physical CPU
package forms a NUMA node and the CPU level sched_domain will have
only one group. This prevents any form of power saving balance across
these nodes. Enabling the sched_mc_power_savings tunable to work as
designed on these new single CPU NUMA node machines will help task
consolidation and save power as we did in other multi core multi
socket platforms.

Consolidation across NODES have implications of cross-node memory
access and other NUMA locality issues. Even under such constraints
there could be scope for power savings vs performance tradeoffs and
hence making the sched_mc_powersavings work as expected on these
platform is justified.

sched_mc/smt_power_savings is still a tunable and power savings benefits
and performance would vary depending on the workload and the system
topology and hardware features.

The patch series has been tested on a 2-Socket Quad-core Dual threaded
box with kernbench as the workload, varying the number of threads.


+------------------------------------------------------------------------+
|Test: make -j4 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 35.34 |Core4: 62.01 |
| | | | +-------------+----------------+
| | | | |Core1: 58.34 |Core5: 17.41 |
| 0 | 0 | 100 | 107.84 +-------------+----------------+
| | | | |Core2: 63.97 |Core6: 60.29 |
| | | | +-------------+----------------+
| | | | |Core3: 68.64 |Core7: 61.46 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 34.28 |Core4: 18.26 |
| | | | +-------------+----------------+
| | | | |Core1: 99.19 |Core5: 18.54 |
| 0 | 2 | 99.89 | 109.91 +-------------+----------------+
| | | | |Core2: 99.89 |Core6: 21.54 |
| | | | +-------------+----------------+
| | | | |Core3: 99.91 |Core7: 23.21 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 20.17 |Core4: 69.30 |
| | | | +-------------+----------------+
| | | | |Core1: 50.22 |Core5: 55.97 |
| 2 | 2 | 95.03 | 139.95 +-------------+----------------+
| | | | |Core2: 83.95 |Core6: 92.70 |
| | | | +-------------+----------------+
| | | | |Core3: 88.95 |Core7: 95.58 |
+-----------+----------+--------+---------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j6 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 25.35 |Core4: 41.07 |
| | | | +-------------+----------------+
| | | | |Core1: 43.84 |Core5: 19.95 |
| 0 | 0 | 100 | 77.67 +-------------+----------------+
| | | | |Core2: 43.23 |Core6: 42.82 |
| | | | +-------------+----------------+
| | | | |Core3: 47.66 |Core7: 45.96 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 24.13 |Core4: 41.80 |
| | | | +-------------+----------------+
| | | | |Core1: 51.51 |Core5: 23.61 |
| 0 | 2 | 99.41 | 81.50 +-------------+----------------+
| | | | |Core2: 55.43 |Core6: 38.67 |
| | | | +-------------+----------------+
| | | | |Core3: 57.79 |Core7: 38.84 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 7.75 |Core4: 94.45 |
| | | | +-------------+----------------+
| | | | |Core1: 19.04 |Core5: 67.42 |
| 2 | 2 | 93.32 | 100.39 +-------------+----------------+
| | | | |Core2: 28.29 |Core6: 96.90 |
| | | | +-------------+----------------+
| | | | |Core3: 66.63 |Core7: 99.86 |
+-----------+----------+--------+---------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j8 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 18.17 |Core4: 33.38 |
| | | | +-------------+----------------+
| | | | |Core1: 34.62 |Core5: 19.58 |
| 0 | 0 | 100 | 63.82 +-------------+----------------+
| | | | |Core2: 31.99 |Core6: 32.35 |
| | | | +-------------+----------------+
| | | | |Core3: 34.59 |Core7: 29.99 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 15.20 |Core4: 41.41 |
| | | | +-------------+----------------+
| | | | |Core1: 28.45 |Core5: 21.32 |
| 0 | 2 | 99.17 | 65.65 +-------------+----------------+
| | | | |Core2: 31.14 |Core6: 41.26 |
| | | | +-------------+----------------+
| | | | |Core3: 30.52 |Core7: 42.95 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 16.65 |Core4: 79.04 |
| | | | +-------------+----------------+
| | | | |Core1: 26.74 |Core5: 50.98 |
| 2 | 2 | 89.58 | 82.83 +-------------+----------------+
| | | | |Core2: 30.42 |Core6: 81.33 |
| | | | +-------------+----------------+
| | | | |Core3: 35.57 |Core7: 90.03 |
+-----------+----------+--------+---------+-------------+----------------+


---

Gautham R Shenoy (6):
sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
sched: Arbitrate the nomination of preferred_wakeup_cpu
sched: Rename the variable sched_mc_preferred_wakeup_cpu
sched: Add Comments at the beginning of find_busiest_group.
sched: Record the current active power savings level
sched: code cleanup - sd_power_saving_flags(), sd_balance_for_*_power()


include/linux/sched.h | 66 +++++++++++++++++----------------
include/linux/topology.h | 6 +--
kernel/sched.c | 92 +++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 4 +-
4 files changed, 123 insertions(+), 45 deletions(-)

--
Thanks and Regards
gautham.


Subject: [PATCH 3 1/6] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_*_power()

This is a code cleanup patch which combines the functions of
sd_balance_for_mc/package_power() and sd_power_saving_flags() into a single
function.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 65 +++++++++++++++++++++++-----------------------
include/linux/topology.h | 6 +---
2 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 011db2f..37fecf7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -794,34 +794,45 @@ enum powersavings_balance_level {

extern int sched_mc_power_savings, sched_smt_power_savings;

-static inline int sd_balance_for_mc_power(void)
-{
- if (sched_smt_power_savings)
- return SD_POWERSAVINGS_BALANCE;
-
- return 0;
-}
-
-static inline int sd_balance_for_package_power(void)
-{
- if (sched_mc_power_savings | sched_smt_power_savings)
- return SD_POWERSAVINGS_BALANCE;
+enum sched_domain_level {
+ SD_LV_NONE = 0,
+ SD_LV_SIBLING, /* Represents the THREADS domain */
+ SD_LV_MC, /* Represents the CORES domain */
+ SD_LV_CPU, /* Represents the PACKAGE domain */
+ SD_LV_NODE, /* Represents the NODES domain */
+ SD_LV_ALLNODES,
+ SD_LV_MAX
+};

- return 0;
-}

-/*
- * Optimise SD flags for power savings:
- * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
- * Keep default SD flags if sched_{smt,mc}_power_saving=0
+/**
+ * sd_power_savings_flags: Returns the flags specific to power-aware-load
+ * balancing for a given sched_domain level
+ *
+ * @level: The sched_domain level for which the the power-aware-load balancing
+ * flags need to be set.
+ *
+ * This function helps in setting the flags for power-aware load balancing for
+ * a given sched_domain.
+ * - SD_POWERSAVINGS_BALANCE tells the load-balancer that power-aware
+ * load balancing is applicable at this domain.
+ *
+ * - SD_BALANCE_NEWIDLE helps aggressive task consolidation and
+ * power-savings.
+ *
+ * For more information on power aware scheduling, see the comment before
+ * find_busiest_group() in kernel/sched.c
*/

-static inline int sd_power_saving_flags(void)
+static inline int sd_power_saving_flags(enum sched_domain_level level)
{
- if (sched_mc_power_savings | sched_smt_power_savings)
- return SD_BALANCE_NEWIDLE;
+ if (level == SD_LV_MC && !sched_smt_power_savings)
+ return 0;
+ if (level == SD_LV_CPU &&
+ !(sched_mc_power_savings || sched_smt_power_savings))
+ return 0;

- return 0;
+ return SD_POWERSAVINGS_BALANCE | SD_BALANCE_NEWIDLE;
}

struct sched_group {
@@ -847,16 +858,6 @@ static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
return to_cpumask(sg->cpumask);
}

-enum sched_domain_level {
- SD_LV_NONE = 0,
- SD_LV_SIBLING,
- SD_LV_MC,
- SD_LV_CPU,
- SD_LV_NODE,
- SD_LV_ALLNODES,
- SD_LV_MAX
-};
-
struct sched_domain_attr {
int relax_domain_level;
};
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e632d29..8097dce 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,8 +125,7 @@ int arch_update_cpu_topology(void);
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
- | sd_balance_for_mc_power()\
- | sd_power_saving_flags(),\
+ | sd_power_saving_flags(SD_LV_MC),\
.last_balance = jiffies, \
.balance_interval = 1, \
}
@@ -151,8 +150,7 @@ int arch_update_cpu_topology(void);
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
- | sd_balance_for_package_power()\
- | sd_power_saving_flags(),\
+ | sd_power_saving_flags(SD_LV_CPU),\
.last_balance = jiffies, \
.balance_interval = 1, \
}

Subject: [PATCH 3 3/6] sched: Add Comments at the beginning of find_busiest_group.

Currently there are no comments pertaining to power-savings balance in
the function find_busiest_group. Add appropriate comments.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---

kernel/sched.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 407ee03..864c6ca 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3090,6 +3090,23 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
* should be moved to restore balance via the imbalance parameter.
+ *
+ * Power-savings-balance: Through the sysfs tunables sched_mc/smt_power_savings
+ * he user can opt for aggressive task consolidation as a means to save power.
+ * When this is activated, we would have the SD_POWERSAVINGS_BALANCE flag
+ * set for appropriate sched_domains,
+ *
+ * Within such sched_domains, find_busiest_group would try to identify
+ * a sched_group which can be freed-up and whose tasks can be migrated to
+ * a sibling group which has the capacity to accomodate the former's tasks.
+ * If such a "can-go-idle" sched_group does exist, then the sibling group
+ * which can accomodate it's tasks is returned as the busiest group.
+ *
+ * Furthermore, if the user opts for more aggressive power-aware load
+ * balancing through sched_smt/mc_power_savings = 2, i.e when the
+ * active_power_savings_level greater or equal to POWERSAVINGS_BALANCE_WAKEUP,
+ * find_busiest_group will also nominate the preferred CPU, on which the tasks
+ * should hence forth be woken up on, instead of bothering an idle-cpu.
*/
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,

Subject: [PATCH 3 2/6] sched: Record the current active power savings level

The current active power savings level of a system is defined as the
maximum of the sched_mc_power_savings and the sched_smt_power_savings.

The decisions during power-aware loadbalancing, depend on this value.

Record this value in a read mostly global variable instead of having to
compute it everytime.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---

include/linux/sched.h | 1 +
kernel/sched.c | 8 ++++++--
kernel/sched_fair.c | 2 +-
3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 37fecf7..7dc8aea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,6 +793,7 @@ enum powersavings_balance_level {
};

extern int sched_mc_power_savings, sched_smt_power_savings;
+extern enum powersavings_balance_level active_power_savings_level;

enum sched_domain_level {
SD_LV_NONE = 0,
diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..407ee03 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3398,7 +3398,7 @@ out_balanced:

if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
+ if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
cpumask_first(sched_group_cpus(group_leader));
}
@@ -3683,7 +3683,7 @@ redo:
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;

- if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
return -1;

if (sd->nr_balance_failed++ < 2)
@@ -7206,6 +7206,8 @@ static void sched_domain_node_span(int node, struct cpumask *span)
#endif /* CONFIG_NUMA */

int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
+/* Records the currently active power savings level */
+enum powersavings_balance_level __read_mostly active_power_savings_level;

/*
* The cpus mask in sched_group and sched_domain hangs off the end.
@@ -8040,6 +8042,8 @@ static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
sched_smt_power_savings = level;
else
sched_mc_power_savings = level;
+ active_power_savings_level = max(sched_smt_power_savings,
+ sched_mc_power_savings);

arch_reinit_sched_domains();

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0566f2a..a3583c6 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1054,7 +1054,7 @@ static int wake_idle(int cpu, struct task_struct *p)
chosen_wakeup_cpu =
cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;

- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+ if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
idle_cpu(cpu) && idle_cpu(this_cpu) &&
p->mm && !(p->flags & PF_KTHREAD) &&
cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))

Subject: [PATCH 3 4/6] sched: Rename the variable sched_mc_preferred_wakeup_cpu

sched_mc_ preferred_wakeup_cpu is currently used when the user seeks
power savings through aggressive task consolidation.

This is applicable for both sched_mc_power_savings as well as
sched_smt_power_savings. So rename sched_mc_preferred_wakeup_cpu to
preferred_wakeup_cpu.

Also fix the comment for preferred_wakeup_cpu.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 12 +++++++-----
kernel/sched_fair.c | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 864c6ca..16d7655 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -515,11 +515,13 @@ struct root_domain {
#endif
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
/*
- * Preferred wake up cpu nominated by sched_mc balance that will be
- * used when most cpus are idle in the system indicating overall very
- * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+ * Preferred wake up cpu which is nominated by load balancer,
+ * is the CPU on which the tasks would be woken up, which
+ * otherwise would have woken up on an idle CPU even on a system
+ * with low-cpu-utilization.
+ * This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
*/
- unsigned int sched_mc_preferred_wakeup_cpu;
+ unsigned int preferred_wakeup_cpu;
#endif
};

@@ -3416,7 +3418,7 @@ out_balanced:
if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+ cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
cpumask_first(sched_group_cpus(group_leader));
}
return group_min;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a3583c6..03b1e3c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1052,7 +1052,7 @@ static int wake_idle(int cpu, struct task_struct *p)

this_cpu = smp_processor_id();
chosen_wakeup_cpu =
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+ cpu_rq(this_cpu)->rd->preferred_wakeup_cpu;

if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
idle_cpu(cpu) && idle_cpu(this_cpu) &&

Subject: [PATCH 3 5/6] sched: Arbitrate the nomination of preferred_wakeup_cpu

Currently for sched_mc/smt_power_savings = 2, we consolidate tasks
by having a preferred_wakeup_cpu which will be used for all the
further wake ups.

This preferred_wakeup_cpu is currently nominated by find_busiest_group()
while loadbalancing for sched_domains which has SD_POWERSAVINGS_BALANCE flag
set.

However, on systems which are multi-threaded and multi-core, we can
have multiple sched_domains in the same hierarchy with
SD_POWERSAVINGS_BALANCE flag set.

Currently we don't have any arbitration mechanism as to while load balancing
for which sched_domain in the hierarchy should find_busiest_group(sd)
nominate the preferred_wakeup_cpu. Hence can overwrite valid nominations
made previously thereby causing the preferred_wakup_cpu to ping-pong
thereby preventing us from effectively consolidating tasks.

Fix this by means of an arbitration algorithm, where in we nominate the
preferred_wakeup_cpu sched_domain in find_busiest_group() for a particular
sched_domain if the sched_domain:
- is the topmost power aware sched_domain.
OR
- contains the previously nominated preferred wake up cpu in it's span.

This will help to further fine tune the wake-up biasing logic by
identifying a partially busy core within a CPU package instead of
potentially waking up a completely idle core.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 16d7655..651550c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -522,6 +522,14 @@ struct root_domain {
* This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
*/
unsigned int preferred_wakeup_cpu;
+
+ /*
+ * top_powersavings_sd_lvl records the level of the highest
+ * sched_domain that has the SD_POWERSAVINGS_BALANCE flag set.
+ *
+ * Used to arbitrate nomination of the preferred_wakeup_cpu.
+ */
+ enum sched_domain_level top_powersavings_sd_lvl;
#endif
};

@@ -3416,9 +3424,27 @@ out_balanced:
goto ret;

if (this == group_leader && group_leader != group_min) {
+ struct root_domain *my_rd = cpu_rq(this_cpu)->rd;
*imbalance = min_load_per_task;
- if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
- cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
+ /*
+ * To avoid overwriting of preferred_wakeup_cpu nominations
+ * while calling find_busiest_group() at various sched_domain
+ * levels, we define an arbitration mechanism wherein
+ * find_busiest_group() nominates a preferred_wakeup_cpu at
+ * the sched_domain sd if:
+ *
+ * - sd is the highest sched_domain in the hierarchy having the
+ * SD_POWERSAVINGS_BALANCE flag set.
+ *
+ * OR
+ *
+ * - sd contains the previously nominated preferred_wakeup_cpu
+ * in it's span.
+ */
+ if (sd->level == my_rd->top_powersavings_sd_lvl ||
+ cpu_isset(my_rd->preferred_wakeup_cpu,
+ *sched_domain_span(sd))) {
+ my_rd->preferred_wakeup_cpu =
cpumask_first(sched_group_cpus(group_leader));
}
return group_min;
@@ -7541,6 +7567,8 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
struct root_domain *rd;
cpumask_var_t nodemask, this_sibling_map, this_core_map, send_covered,
tmpmask;
+ struct sched_domain *sd;
+
#ifdef CONFIG_NUMA
cpumask_var_t domainspan, covered, notcovered;
struct sched_group **sched_group_nodes = NULL;
@@ -7816,6 +7844,19 @@ static int __build_sched_domains(const struct cpumask *cpu_map,

err = 0;

+ rd->preferred_wakeup_cpu = UINT_MAX;
+ rd->top_powersavings_sd_lvl = SD_LV_NONE;
+
+ if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
+ goto free_tmpmask;
+
+ /* Record the level of the highest power-aware sched_domain */
+ for_each_domain(first_cpu(*cpu_map), sd) {
+ if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
+ continue;
+ rd->top_powersavings_sd_lvl = sd->level;
+ }
+
free_tmpmask:
free_cpumask_var(tmpmask);
free_send_covered:

Subject: [PATCH 3 6/6] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.

Currently a sched_domain having a single group can be prevented from getting
degenerated if it contains a SD_POWERSAVINGS_BALANCE flag. But since it has
only one group, it won't have any scope for performing powersavings balance as
it does not have a sibling group to pull from.

Apart from not provide any powersavings, it also fails to participate
in normal load-balancing.

Fix this by allowing such a sched_domain to degenerate and pass on the
responsibility of performing the POWERSAVINGS_BALANCE to it's parent domain.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 651550c..0e7882a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6970,6 +6970,20 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_PKG_RESOURCES);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
+
+ /*
+ * If the only flag that is preventing us from degenerating
+ * a domain with a single group is SD_POWERSAVINGS_BALANCE
+ * check if it can be transferred to the new parent,
+ * and degenerate this domain. With a single
+ * group, it anyway can't contribute to power-aware load
+ * balancing.
+ */
+ if (pflags & SD_POWERSAVINGS_BALANCE && parent->parent) {
+ pflags &= ~SD_POWERSAVINGS_BALANCE;
+ parent->parent->flags |=
+ sd_power_saving_flags(parent->level);
+ }
}
if (~cflags & pflags)
return 0;

2009-03-19 15:16:29

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] sched: Extend sched_mc/smt_framework

* Gautham R Shenoy <[email protected]> [2009-03-18 14:52:17]:

> Hi,
>
> I am reposting the iteration 3 of the patch series that extends the existing
> sched_smt_/mc_power_savings framework to work on platforms
> that have on-chip memory controllers making each of the cpu-package
> a 'node'. I have rebased this patch series against 2.6.29-rc8.
>
> Changes from V2: (Found here: --> http://lkml.org/lkml/2009/3/3/109)
> - Patches have been split up in an incremental manner for easy review.
> - Fixed comments for some variables.
> - Renamed some variables to better reflect their usage.
>
> Changes from V1: (Found here: --> http://lkml.org/lkml/2009/2/16/221)
> - Added comments to explain power-saving part in find_busiest_group()
> - Added comments for the different sched_domain levels.
>
> Background
> ------------------------------------------------------------------
> On machines with on-chip memory controller, each physical CPU
> package forms a NUMA node and the CPU level sched_domain will have
> only one group. This prevents any form of power saving balance across
> these nodes. Enabling the sched_mc_power_savings tunable to work as
> designed on these new single CPU NUMA node machines will help task
> consolidation and save power as we did in other multi core multi
> socket platforms.
>
> Consolidation across NODES have implications of cross-node memory
> access and other NUMA locality issues. Even under such constraints
> there could be scope for power savings vs performance tradeoffs and
> hence making the sched_mc_powersavings work as expected on these
> platform is justified.

If the workload threads share lots of data from cache, then
consolidating them will improve cache sharing at the last level cache
in the package. If most of the working set fits the on chip cache,
then the cross-node reference latencies will be effectively hidden.

> sched_mc/smt_power_savings is still a tunable and power savings benefits
> and performance would vary depending on the workload and the system
> topology and hardware features.

In your results we can see significant performance degradation for
marginal power savings when sibling threads are used to run workloads.
Kernbench is perhaps cpu intensive and did not leave many stall cycles
in the processor for the sibling thread to benefit.

Some other workloads that experience stalls due to memory references
may not see such degradation when run on sibling threads.

--Vaidy

2009-03-19 16:21:54

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 3 1/6] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_*_power()

* Gautham R Shenoy <[email protected]> [2009-03-18 14:52:22]:

> This is a code cleanup patch which combines the functions of
> sd_balance_for_mc/package_power() and sd_power_saving_flags() into a single
> function.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Cc: Vaidyanathan Srinivasan <[email protected]>

Acked-by: Vaidyanathan Srinivasan <[email protected]>

> ---
>
> include/linux/sched.h | 65 +++++++++++++++++++++++-----------------------
> include/linux/topology.h | 6 +---
> 2 files changed, 35 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 011db2f..37fecf7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -794,34 +794,45 @@ enum powersavings_balance_level {
>
> extern int sched_mc_power_savings, sched_smt_power_savings;
>
> -static inline int sd_balance_for_mc_power(void)
> -{
> - if (sched_smt_power_savings)
> - return SD_POWERSAVINGS_BALANCE;
> -
> - return 0;
> -}
> -
> -static inline int sd_balance_for_package_power(void)
> -{
> - if (sched_mc_power_savings | sched_smt_power_savings)
> - return SD_POWERSAVINGS_BALANCE;
> +enum sched_domain_level {
> + SD_LV_NONE = 0,
> + SD_LV_SIBLING, /* Represents the THREADS domain */
/* Sched domain of sibling THREADS in a CORE */
> + SD_LV_MC, /* Represents the CORES domain */
/* Sched domain of sibling CORES in a PACKAGE */
> + SD_LV_CPU, /* Represents the PACKAGE domain */
/* Sched domain of PACKAGES in a NODE */
> + SD_LV_NODE, /* Represents the NODES domain */
/* Sched domain of NUMA NODEs in the system */

Just making the description/comments more elaborate.

> + SD_LV_ALLNODES,
> + SD_LV_MAX
> +};
>
> - return 0;
> -}
>
> -/*
> - * Optimise SD flags for power savings:
> - * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
> - * Keep default SD flags if sched_{smt,mc}_power_saving=0
> +/**
> + * sd_power_savings_flags: Returns the flags specific to power-aware-load
> + * balancing for a given sched_domain level
> + *
> + * @level: The sched_domain level for which the the power-aware-load balancing
> + * flags need to be set.
> + *
> + * This function helps in setting the flags for power-aware load balancing for
> + * a given sched_domain.
> + * - SD_POWERSAVINGS_BALANCE tells the load-balancer that power-aware
> + * load balancing is applicable at this domain.
> + *
> + * - SD_BALANCE_NEWIDLE helps aggressive task consolidation and
> + * power-savings.
> + *
> + * For more information on power aware scheduling, see the comment before
> + * find_busiest_group() in kernel/sched.c
> */
>
> -static inline int sd_power_saving_flags(void)
> +static inline int sd_power_saving_flags(enum sched_domain_level level)
> {
> - if (sched_mc_power_savings | sched_smt_power_savings)
> - return SD_BALANCE_NEWIDLE;
> + if (level == SD_LV_MC && !sched_smt_power_savings)
> + return 0;
> + if (level == SD_LV_CPU &&
> + !(sched_mc_power_savings || sched_smt_power_savings))
> + return 0;
>
> - return 0;
> + return SD_POWERSAVINGS_BALANCE | SD_BALANCE_NEWIDLE;
> }
>
> struct sched_group {
> @@ -847,16 +858,6 @@ static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
> return to_cpumask(sg->cpumask);
> }
>
> -enum sched_domain_level {
> - SD_LV_NONE = 0,
> - SD_LV_SIBLING,
> - SD_LV_MC,
> - SD_LV_CPU,
> - SD_LV_NODE,
> - SD_LV_ALLNODES,
> - SD_LV_MAX
> -};
> -
> struct sched_domain_attr {
> int relax_domain_level;
> };
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index e632d29..8097dce 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -125,8 +125,7 @@ int arch_update_cpu_topology(void);
> | SD_WAKE_AFFINE \
> | SD_WAKE_BALANCE \
> | SD_SHARE_PKG_RESOURCES\
> - | sd_balance_for_mc_power()\
> - | sd_power_saving_flags(),\
> + | sd_power_saving_flags(SD_LV_MC),\
> .last_balance = jiffies, \
> .balance_interval = 1, \
> }
> @@ -151,8 +150,7 @@ int arch_update_cpu_topology(void);
> | SD_BALANCE_FORK \
> | SD_WAKE_AFFINE \
> | SD_WAKE_BALANCE \
> - | sd_balance_for_package_power()\
> - | sd_power_saving_flags(),\
> + | sd_power_saving_flags(SD_LV_CPU),\
> .last_balance = jiffies, \
> .balance_interval = 1, \
> }

--Vaidy

2009-03-19 16:31:06

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 3 2/6] sched: Record the current active power savings level

* Gautham R Shenoy <[email protected]> [2009-03-18 14:52:28]:

> The current active power savings level of a system is defined as the
> maximum of the sched_mc_power_savings and the sched_smt_power_savings.
>
> The decisions during power-aware loadbalancing, depend on this value.
>
> Record this value in a read mostly global variable instead of having to
> compute it everytime.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
>
> include/linux/sched.h | 1 +
> kernel/sched.c | 8 ++++++--
> kernel/sched_fair.c | 2 +-
> 3 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 37fecf7..7dc8aea 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -793,6 +793,7 @@ enum powersavings_balance_level {
> };
>
> extern int sched_mc_power_savings, sched_smt_power_savings;

Now we will need sched_mc_power_savings and sched_smt_power_savings
only until we rebuild the sched domains. These can be static
variables and we can perhaps remove the extern for them? Better still
if we can capture this information elsewhere until sched domain is
built and SD_POWERSAVINGS_BALANCE flags are set so as to not
have a need for these global variables.

> +extern enum powersavings_balance_level active_power_savings_level;
> enum sched_domain_level {
> SD_LV_NONE = 0,
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 8e2558c..407ee03 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3398,7 +3398,7 @@ out_balanced:
>
> if (this == group_leader && group_leader != group_min) {
> *imbalance = min_load_per_task;
> - if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
> + if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> cpumask_first(sched_group_cpus(group_leader));
> }
> @@ -3683,7 +3683,7 @@ redo:
> !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> return -1;
>
> - if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
> + if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
> return -1;
>
> if (sd->nr_balance_failed++ < 2)
> @@ -7206,6 +7206,8 @@ static void sched_domain_node_span(int node, struct cpumask *span)
> #endif /* CONFIG_NUMA */
>
> int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
> +/* Records the currently active power savings level */
> +enum powersavings_balance_level __read_mostly active_power_savings_level;
>
> /*
> * The cpus mask in sched_group and sched_domain hangs off the end.
> @@ -8040,6 +8042,8 @@ static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
> sched_smt_power_savings = level;
> else
> sched_mc_power_savings = level;
> + active_power_savings_level = max(sched_smt_power_savings,
> + sched_mc_power_savings);
>
> arch_reinit_sched_domains();
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 0566f2a..a3583c6 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1054,7 +1054,7 @@ static int wake_idle(int cpu, struct task_struct *p)
> chosen_wakeup_cpu =
> cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
>
> - if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> + if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
> idle_cpu(cpu) && idle_cpu(this_cpu) &&
> p->mm && !(p->flags & PF_KTHREAD) &&
> cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
>


Acked-by: Vaidyanathan Srinivasan <[email protected]>

2009-03-19 16:37:51

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 3 3/6] sched: Add Comments at the beginning of find_busiest_group.

* Gautham R Shenoy <[email protected]> [2009-03-18 14:52:33]:

> Currently there are no comments pertaining to power-savings balance in
> the function find_busiest_group. Add appropriate comments.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
>
> kernel/sched.c | 17 +++++++++++++++++
> 1 files changed, 17 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 407ee03..864c6ca 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3090,6 +3090,23 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
> * find_busiest_group finds and returns the busiest CPU group within the
> * domain. It calculates and returns the amount of weighted load which
> * should be moved to restore balance via the imbalance parameter.
> + *
> + * Power-savings-balance: Through the sysfs tunables sched_mc/smt_power_savings
> + * he user can opt for aggressive task consolidation as a means to save power.
^ the
> + * When this is activated, we would have the SD_POWERSAVINGS_BALANCE flag
^^^^ When sched_{mc,smt}_powersavings is activated, then
SD_POWERSAVINGS_BALANCE...
> + * set for appropriate sched_domains,
> + *
> + * Within such sched_domains, find_busiest_group would try to identify
> + * a sched_group which can be freed-up and whose tasks can be migrated to
> + * a sibling group which has the capacity to accomodate the former's tasks.
^^^ remaining capacity
> + * If such a "can-go-idle" sched_group does exist, then the sibling group
^^^^^^^^
this group is returned
as busiest_group
> + * which can accomodate it's tasks is returned as the busiest group.
^^^^^^^^^^^^^^^^ This is the
group_leader and busiest_group is
returned if the current cpu is
a member of group leader.
> + *
> + * Furthermore, if the user opts for more aggressive power-aware load
> + * balancing through sched_smt/mc_power_savings = 2, i.e when the
> + * active_power_savings_level greater or equal to POWERSAVINGS_BALANCE_WAKEUP,
> + * find_busiest_group will also nominate the preferred CPU, on which the tasks
> + * should hence forth be woken up on, instead of bothering an idle-cpu.
> */
> static struct sched_group *
> find_busiest_group(struct sched_domain *sd, int this_cpu,


Acked-by: Vaidyanathan Srinivasan <[email protected]>

2009-03-19 16:40:28

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 3 4/6] sched: Rename the variable sched_mc_preferred_wakeup_cpu

* Gautham R Shenoy <[email protected]> [2009-03-18 14:52:38]:

> sched_mc_ preferred_wakeup_cpu is currently used when the user seeks
> power savings through aggressive task consolidation.
>
> This is applicable for both sched_mc_power_savings as well as
> sched_smt_power_savings. So rename sched_mc_preferred_wakeup_cpu to
> preferred_wakeup_cpu.
>
> Also fix the comment for preferred_wakeup_cpu.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> ---
>
> kernel/sched.c | 12 +++++++-----
> kernel/sched_fair.c | 2 +-
> 2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 864c6ca..16d7655 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -515,11 +515,13 @@ struct root_domain {
> #endif
> #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> /*
> - * Preferred wake up cpu nominated by sched_mc balance that will be
> - * used when most cpus are idle in the system indicating overall very
> - * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
> + * Preferred wake up cpu which is nominated by load balancer,
> + * is the CPU on which the tasks would be woken up, which
> + * otherwise would have woken up on an idle CPU even on a system
> + * with low-cpu-utilization.
> + * This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
> */
> - unsigned int sched_mc_preferred_wakeup_cpu;
> + unsigned int preferred_wakeup_cpu;
Good short name :)
> #endif
> };
>
> @@ -3416,7 +3418,7 @@ out_balanced:
> if (this == group_leader && group_leader != group_min) {
> *imbalance = min_load_per_task;
> if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> - cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> + cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
> cpumask_first(sched_group_cpus(group_leader));
> }
> return group_min;
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index a3583c6..03b1e3c 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1052,7 +1052,7 @@ static int wake_idle(int cpu, struct task_struct *p)
>
> this_cpu = smp_processor_id();
> chosen_wakeup_cpu =
> - cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> + cpu_rq(this_cpu)->rd->preferred_wakeup_cpu;
>
> if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
> idle_cpu(cpu) && idle_cpu(this_cpu) &&


Acked-by: Vaidyanathan Srinivasan <[email protected]>

2009-03-19 16:54:20

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 3 6/6] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.

* Gautham R Shenoy <[email protected]> [2009-03-18 14:52:48]:

> Currently a sched_domain having a single group can be prevented from getting
> degenerated if it contains a SD_POWERSAVINGS_BALANCE flag. But since it has
> only one group, it won't have any scope for performing powersavings balance as
> it does not have a sibling group to pull from.
>
> Apart from not provide any powersavings, it also fails to participate
> in normal load-balancing.
>
> Fix this by allowing such a sched_domain to degenerate and pass on the
> responsibility of performing the POWERSAVINGS_BALANCE to it's parent domain.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Cc: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched.c | 14 ++++++++++++++
> 1 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 651550c..0e7882a 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6970,6 +6970,20 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
> SD_SHARE_PKG_RESOURCES);
> if (nr_node_ids == 1)
> pflags &= ~SD_SERIALIZE;
> +
> + /*
> + * If the only flag that is preventing us from degenerating
> + * a domain with a single group is SD_POWERSAVINGS_BALANCE
> + * check if it can be transferred to the new parent,
> + * and degenerate this domain. With a single
> + * group, it anyway can't contribute to power-aware load
> + * balancing.
> + */
> + if (pflags & SD_POWERSAVINGS_BALANCE && parent->parent) {
> + pflags &= ~SD_POWERSAVINGS_BALANCE;
> + parent->parent->flags |=
> + sd_power_saving_flags(parent->level);
> + }
> }
> if (~cflags & pflags)
> return 0;
>

Nice concise patch that meets the objective.

So you are not able to add SD_POWERSAVINGS_BALANCE to the list of
flags needing more than 1 group because you need to pass the flag to
parent (if exist) while the rest of the flags are just cleared since
they are irrelevant.

Acked-by: Vaidyanathan Srinivasan <[email protected]>

2009-03-19 16:56:33

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 3 6/6] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.

* Gautham R Shenoy <[email protected]> [2009-03-18 14:52:48]:

> Currently a sched_domain having a single group can be prevented from getting
> degenerated if it contains a SD_POWERSAVINGS_BALANCE flag. But since it has
> only one group, it won't have any scope for performing powersavings balance as
> it does not have a sibling group to pull from.
>
> Apart from not provide any powersavings, it also fails to participate
> in normal load-balancing.
>
> Fix this by allowing such a sched_domain to degenerate and pass on the
> responsibility of performing the POWERSAVINGS_BALANCE to it's parent domain.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Cc: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched.c | 14 ++++++++++++++
> 1 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 651550c..0e7882a 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6970,6 +6970,20 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
> SD_SHARE_PKG_RESOURCES);
> if (nr_node_ids == 1)
> pflags &= ~SD_SERIALIZE;
> +
> + /*
> + * If the only flag that is preventing us from degenerating
> + * a domain with a single group is SD_POWERSAVINGS_BALANCE
> + * check if it can be transferred to the new parent,
> + * and degenerate this domain. With a single
> + * group, it anyway can't contribute to power-aware load
> + * balancing.

can't contribute to any form of load balancing or
power savings.

(SD_POWERSAVINGS_BALANCE flags has not being
listed in the flags needing more than 1 group)

> + */
> + if (pflags & SD_POWERSAVINGS_BALANCE && parent->parent) {
> + pflags &= ~SD_POWERSAVINGS_BALANCE;
> + parent->parent->flags |=
> + sd_power_saving_flags(parent->level);
> + }
> }
> if (~cflags & pflags)
> return 0;

--Vaidy

2009-03-19 17:22:50

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 3 5/6] sched: Arbitrate the nomination of preferred_wakeup_cpu

* Gautham R Shenoy <[email protected]> [2009-03-18 14:52:43]:

> Currently for sched_mc/smt_power_savings = 2, we consolidate tasks
> by having a preferred_wakeup_cpu which will be used for all the
> further wake ups.
>
> This preferred_wakeup_cpu is currently nominated by find_busiest_group()
> while loadbalancing for sched_domains which has SD_POWERSAVINGS_BALANCE flag
> set.
>
> However, on systems which are multi-threaded and multi-core, we can
> have multiple sched_domains in the same hierarchy with
> SD_POWERSAVINGS_BALANCE flag set.
>
> Currently we don't have any arbitration mechanism as to while load balancing
> for which sched_domain in the hierarchy should find_busiest_group(sd)
> nominate the preferred_wakeup_cpu. Hence can overwrite valid nominations
> made previously thereby causing the preferred_wakup_cpu to ping-pong
> thereby preventing us from effectively consolidating tasks.
>
> Fix this by means of an arbitration algorithm, where in we nominate the
> preferred_wakeup_cpu sched_domain in find_busiest_group() for a particular
> sched_domain if the sched_domain:
> - is the topmost power aware sched_domain.
> OR
> - contains the previously nominated preferred wake up cpu in it's span.
>
> This will help to further fine tune the wake-up biasing logic by
> identifying a partially busy core within a CPU package instead of
> potentially waking up a completely idle core.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> ---
>
> kernel/sched.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
> 1 files changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 16d7655..651550c 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -522,6 +522,14 @@ struct root_domain {
> * This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
> */
> unsigned int preferred_wakeup_cpu;
> +
> + /*
> + * top_powersavings_sd_lvl records the level of the highest
> + * sched_domain that has the SD_POWERSAVINGS_BALANCE flag set.
> + *
> + * Used to arbitrate nomination of the preferred_wakeup_cpu.
> + */
> + enum sched_domain_level top_powersavings_sd_lvl;
> #endif
> };
>
> @@ -3416,9 +3424,27 @@ out_balanced:
> goto ret;
>
> if (this == group_leader && group_leader != group_min) {
> + struct root_domain *my_rd = cpu_rq(this_cpu)->rd;
> *imbalance = min_load_per_task;
> - if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> - cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
> + /*
> + * To avoid overwriting of preferred_wakeup_cpu nominations
> + * while calling find_busiest_group() at various sched_domain
> + * levels, we define an arbitration mechanism wherein
> + * find_busiest_group() nominates a preferred_wakeup_cpu at
> + * the sched_domain sd if:
> + *
> + * - sd is the highest sched_domain in the hierarchy having the
> + * SD_POWERSAVINGS_BALANCE flag set.
> + *
> + * OR
> + *
> + * - sd contains the previously nominated preferred_wakeup_cpu
> + * in it's span.
> + */
> + if (sd->level == my_rd->top_powersavings_sd_lvl ||
> + cpu_isset(my_rd->preferred_wakeup_cpu,
> + *sched_domain_span(sd))) {
> + my_rd->preferred_wakeup_cpu =
> cpumask_first(sched_group_cpus(group_leader));
> }
> return group_min;
> @@ -7541,6 +7567,8 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
> struct root_domain *rd;
> cpumask_var_t nodemask, this_sibling_map, this_core_map, send_covered,
> tmpmask;
> + struct sched_domain *sd;
> +
> #ifdef CONFIG_NUMA
> cpumask_var_t domainspan, covered, notcovered;
> struct sched_group **sched_group_nodes = NULL;
> @@ -7816,6 +7844,19 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
>
> err = 0;
>
> + rd->preferred_wakeup_cpu = UINT_MAX;
> + rd->top_powersavings_sd_lvl = SD_LV_NONE;
> +
> + if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
> + goto free_tmpmask;
> +
> + /* Record the level of the highest power-aware sched_domain */
> + for_each_domain(first_cpu(*cpu_map), sd) {
> + if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
> + continue;
> + rd->top_powersavings_sd_lvl = sd->level;
> + }
> +
> free_tmpmask:
> free_cpumask_var(tmpmask);
> free_send_covered:
>

Acked-by: Vaidyanathan Srinivasan <[email protected]>

Subject: Re: [PATCH 3 2/6] sched: Record the current active power savings level

On Thu, Mar 19, 2009 at 10:02:05PM +0530, Vaidyanathan Srinivasan wrote:
> * Gautham R Shenoy <[email protected]> [2009-03-18 14:52:28]:
>
> > The current active power savings level of a system is defined as the
> > maximum of the sched_mc_power_savings and the sched_smt_power_savings.
> >
> > The decisions during power-aware loadbalancing, depend on this value.
> >
> > Record this value in a read mostly global variable instead of having to
> > compute it everytime.
> >
> > Signed-off-by: Gautham R Shenoy <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > ---
> >
> > include/linux/sched.h | 1 +
> > kernel/sched.c | 8 ++++++--
> > kernel/sched_fair.c | 2 +-
> > 3 files changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 37fecf7..7dc8aea 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -793,6 +793,7 @@ enum powersavings_balance_level {
> > };
> >
> > extern int sched_mc_power_savings, sched_smt_power_savings;
>
> Now we will need sched_mc_power_savings and sched_smt_power_savings
> only until we rebuild the sched domains. These can be static
> variables and we can perhaps remove the extern for them? Better still
> if we can capture this information elsewhere until sched domain is
> built and SD_POWERSAVINGS_BALANCE flags are set so as to not
> have a need for these global variables.

Right now we need these variables outside sched.c only
while rebuilding the sched_domains, as you rightly pointed out. So, yes
these variables should be _read_only outside sched.c, but they are
required nevertheless.

However, like we had discussed in one of the earlier posts,
when we can have a single tunable that can capture in
essence what these two variables seek to achieve, we can get rid of
these variables. Till then I think we'll have to retain them.

>
> > +extern enum powersavings_balance_level active_power_savings_level;
> > enum sched_domain_level {
> > SD_LV_NONE = 0,
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 8e2558c..407ee03 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -3398,7 +3398,7 @@ out_balanced:
> >
> > if (this == group_leader && group_leader != group_min) {
> > *imbalance = min_load_per_task;
> > - if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
> > + if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> > cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> > cpumask_first(sched_group_cpus(group_leader));
> > }
> > @@ -3683,7 +3683,7 @@ redo:
> > !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> > return -1;
> >
> > - if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
> > + if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
> > return -1;
> >
> > if (sd->nr_balance_failed++ < 2)
> > @@ -7206,6 +7206,8 @@ static void sched_domain_node_span(int node, struct cpumask *span)
> > #endif /* CONFIG_NUMA */
> >
> > int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
> > +/* Records the currently active power savings level */
> > +enum powersavings_balance_level __read_mostly active_power_savings_level;
> >
> > /*
> > * The cpus mask in sched_group and sched_domain hangs off the end.
> > @@ -8040,6 +8042,8 @@ static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
> > sched_smt_power_savings = level;
> > else
> > sched_mc_power_savings = level;
> > + active_power_savings_level = max(sched_smt_power_savings,
> > + sched_mc_power_savings);
> >
> > arch_reinit_sched_domains();
> >
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index 0566f2a..a3583c6 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -1054,7 +1054,7 @@ static int wake_idle(int cpu, struct task_struct *p)
> > chosen_wakeup_cpu =
> > cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> >
> > - if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> > + if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
> > idle_cpu(cpu) && idle_cpu(this_cpu) &&
> > p->mm && !(p->flags & PF_KTHREAD) &&
> > cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
> >
>
>
> Acked-by: Vaidyanathan Srinivasan <[email protected]>

--
Thanks and Regards
gautham