Subject: [PATCH V3 0/6] sched: Extend sched_mc/smt_power_savings framework

Hi,

This is version 3 of the patch series that extends the existing
sched_smt/mc_power_savings framework to work on platforms that have
on-chip memory controllers, making each CPU package a NUMA node. I have
incorporated the review comments from my previous posting in this
series.

I am also working on reviving the find_busiest_group() cleanup that was
proposed by Vaidy earlier. But since it involves quite a bit of code
movement, it requires a wider review. Hence I will post it as a
separate thread.

Changes from V2 (found here: http://lkml.org/lkml/2009/3/3/109):
- Patches have been split up in an incremental manner for easy review.
- Fixed comments for some variables.
- Renamed some variables to better reflect their usage.

Changes from V1 (found here: http://lkml.org/lkml/2009/2/16/221):
- Added comments to explain power-saving part in find_busiest_group()
- Added comments for the different sched_domain levels.

Background
------------------------------------------------------------------
On machines with on-chip memory controllers, each physical CPU
package forms a NUMA node and the CPU-level sched_domain has
only one group. This prevents any form of power-saving balance across
these nodes. Enabling the sched_mc_power_savings tunable to work as
designed on these new single-package NUMA node machines will help
task consolidation and save power, as it does on other multi-core,
multi-socket platforms.

Consolidation across nodes has implications of cross-node memory
access and other NUMA locality issues. Even under such constraints
there could be scope for power-savings vs. performance tradeoffs, and
hence making sched_mc_power_savings work as expected on these
platforms is justified.

sched_mc/smt_power_savings is still a tunable; the power-savings benefit
and the performance impact will vary depending on the workload, the
system topology, and the hardware features.
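For reference, the tunables are exposed through sysfs; a typical way to
enable aggressive consolidation at both levels is shown below (the exact
paths are assumed to match the standard sysfs layout of this kernel
series):

```shell
# Enable aggressive power-aware balancing (level 2) at both the
# core (mc) and thread (smt) levels; writing 0 disables it again.
echo 2 > /sys/devices/system/cpu/sched_mc_power_savings
echo 2 > /sys/devices/system/cpu/sched_smt_power_savings

# Verify the currently active settings.
cat /sys/devices/system/cpu/sched_mc_power_savings
cat /sys/devices/system/cpu/sched_smt_power_savings
```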

The patch series has been tested on a 2-socket, quad-core, dual-threaded
box with kernbench as the workload, varying the number of threads.


+------------------------------------------------------------------------+
|Test: make -j4 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 35.34 |Core4: 62.01 |
| | | | +-------------+----------------+
| | | | |Core1: 58.34 |Core5: 17.41 |
| 0 | 0 | 100 | 107.84 +-------------+----------------+
| | | | |Core2: 63.97 |Core6: 60.29 |
| | | | +-------------+----------------+
| | | | |Core3: 68.64 |Core7: 61.46 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 34.28 |Core4: 18.26 |
| | | | +-------------+----------------+
| | | | |Core1: 99.19 |Core5: 18.54 |
| 0 | 2 | 99.89 | 109.91 +-------------+----------------+
| | | | |Core2: 99.89 |Core6: 21.54 |
| | | | +-------------+----------------+
| | | | |Core3: 99.91 |Core7: 23.21 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 20.17 |Core4: 69.30 |
| | | | +-------------+----------------+
| | | | |Core1: 50.22 |Core5: 55.97 |
| 2 | 2 | 95.03 | 139.95 +-------------+----------------+
| | | | |Core2: 83.95 |Core6: 92.70 |
| | | | +-------------+----------------+
| | | | |Core3: 88.95 |Core7: 95.58 |
+-----------+----------+--------+---------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j6 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 25.35 |Core4: 41.07 |
| | | | +-------------+----------------+
| | | | |Core1: 43.84 |Core5: 19.95 |
| 0 | 0 | 100 | 77.67 +-------------+----------------+
| | | | |Core2: 43.23 |Core6: 42.82 |
| | | | +-------------+----------------+
| | | | |Core3: 47.66 |Core7: 45.96 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 24.13 |Core4: 41.80 |
| | | | +-------------+----------------+
| | | | |Core1: 51.51 |Core5: 23.61 |
| 0 | 2 | 99.41 | 81.50 +-------------+----------------+
| | | | |Core2: 55.43 |Core6: 38.67 |
| | | | +-------------+----------------+
| | | | |Core3: 57.79 |Core7: 38.84 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 7.75 |Core4: 94.45 |
| | | | +-------------+----------------+
| | | | |Core1: 19.04 |Core5: 67.42 |
| 2 | 2 | 93.32 | 100.39 +-------------+----------------+
| | | | |Core2: 28.29 |Core6: 96.90 |
| | | | +-------------+----------------+
| | | | |Core3: 66.63 |Core7: 99.86 |
+-----------+----------+--------+---------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j8 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 18.17 |Core4: 33.38 |
| | | | +-------------+----------------+
| | | | |Core1: 34.62 |Core5: 19.58 |
| 0 | 0 | 100 | 63.82 +-------------+----------------+
| | | | |Core2: 31.99 |Core6: 32.35 |
| | | | +-------------+----------------+
| | | | |Core3: 34.59 |Core7: 29.99 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 15.20 |Core4: 41.41 |
| | | | +-------------+----------------+
| | | | |Core1: 28.45 |Core5: 21.32 |
| 0 | 2 | 99.17 | 65.65 +-------------+----------------+
| | | | |Core2: 31.14 |Core6: 41.26 |
| | | | +-------------+----------------+
| | | | |Core3: 30.52 |Core7: 42.95 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 16.65 |Core4: 79.04 |
| | | | +-------------+----------------+
| | | | |Core1: 26.74 |Core5: 50.98 |
| 2 | 2 | 89.58 | 82.83 +-------------+----------------+
| | | | |Core2: 30.42 |Core6: 81.33 |
| | | | +-------------+----------------+
| | | | |Core3: 35.57 |Core7: 90.03 |
+-----------+----------+--------+---------+-------------+----------------+

---

Gautham R Shenoy (6):
sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
sched: Arbitrate the nomination of preferred_wakeup_cpu
sched: Rename the variable sched_mc_preferred_wakeup_cpu
sched: Add Comments at the beginning of find_busiest_group.
sched: Record the current active power savings level
sched: code cleanup - sd_power_saving_flags(), sd_balance_for_*_power()


include/linux/sched.h | 66 +++++++++++++++++----------------
include/linux/topology.h | 6 +--
kernel/sched.c | 92 +++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 4 +-
4 files changed, 123 insertions(+), 45 deletions(-)

--
Thanks and Regards
gautham.


Subject: [PATCH V3 1/6] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_*_power()

This is a code cleanup patch which combines the functions of
sd_balance_for_mc/package_power() and sd_power_saving_flags() into a single
function.
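As a sanity check of the combined logic, here is a minimal userspace
sketch (the kernel flag values and the two tunables are stubbed out as
plain integers purely for illustration; the real values live in
include/linux/sched.h):

```c
#include <assert.h>

/* Stand-ins for the kernel flags; the values are illustrative only. */
#define SD_POWERSAVINGS_BALANCE 0x1
#define SD_BALANCE_NEWIDLE      0x2

enum sched_domain_level { SD_LV_NONE, SD_LV_SIBLING, SD_LV_MC,
			  SD_LV_CPU, SD_LV_NODE, SD_LV_ALLNODES, SD_LV_MAX };

static int sched_mc_power_savings, sched_smt_power_savings;

/* Mirrors the combined sd_power_saving_flags() from the patch. */
static int sd_power_saving_flags(enum sched_domain_level level)
{
	/* MC-level power balance only makes sense with SMT savings on. */
	if (level == SD_LV_MC && !sched_smt_power_savings)
		return 0;
	/* CPU-level power balance needs at least one tunable enabled. */
	if (level == SD_LV_CPU &&
	    !(sched_mc_power_savings || sched_smt_power_savings))
		return 0;

	return SD_POWERSAVINGS_BALANCE | SD_BALANCE_NEWIDLE;
}
```

With both tunables at 0 the function returns 0 for both SD_LV_MC and
SD_LV_CPU, preserving the semantics of the old
sd_balance_for_mc/package_power() pair.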

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 65 +++++++++++++++++++++++-----------------------
include/linux/topology.h | 6 +---
2 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8981e52..73aea7b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -794,34 +794,45 @@ enum powersavings_balance_level {

extern int sched_mc_power_savings, sched_smt_power_savings;

-static inline int sd_balance_for_mc_power(void)
-{
- if (sched_smt_power_savings)
- return SD_POWERSAVINGS_BALANCE;
-
- return 0;
-}
-
-static inline int sd_balance_for_package_power(void)
-{
- if (sched_mc_power_savings | sched_smt_power_savings)
- return SD_POWERSAVINGS_BALANCE;
+enum sched_domain_level {
+ SD_LV_NONE = 0,
+ SD_LV_SIBLING, /* Represents the THREADS domain */
+ SD_LV_MC, /* Represents the CORES domain */
+ SD_LV_CPU, /* Represents the PACKAGE domain */
+ SD_LV_NODE, /* Represents the NODES domain */
+ SD_LV_ALLNODES,
+ SD_LV_MAX
+};

- return 0;
-}

-/*
- * Optimise SD flags for power savings:
- * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
- * Keep default SD flags if sched_{smt,mc}_power_saving=0
+/**
+ * sd_power_saving_flags - Return the flags specific to power-aware load
+ * balancing for a given sched_domain level
+ *
+ * @level: The sched_domain level for which the power-aware load balancing
+ * flags need to be set.
+ *
+ * This function helps in setting the flags for power-aware load balancing for
+ * a given sched_domain.
+ * - SD_POWERSAVINGS_BALANCE tells the load-balancer that power-aware
+ * load balancing is applicable at this domain.
+ *
+ * - SD_BALANCE_NEWIDLE helps aggressive task consolidation and
+ * power-savings.
+ *
+ * For more information on power aware scheduling, see the comment before
+ * find_busiest_group() in kernel/sched.c
*/

-static inline int sd_power_saving_flags(void)
+static inline int sd_power_saving_flags(enum sched_domain_level level)
{
- if (sched_mc_power_savings | sched_smt_power_savings)
- return SD_BALANCE_NEWIDLE;
+ if (level == SD_LV_MC && !sched_smt_power_savings)
+ return 0;
+ if (level == SD_LV_CPU &&
+ !(sched_mc_power_savings || sched_smt_power_savings))
+ return 0;

- return 0;
+ return SD_POWERSAVINGS_BALANCE | SD_BALANCE_NEWIDLE;
}

struct sched_group {
@@ -847,16 +858,6 @@ static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
return to_cpumask(sg->cpumask);
}

-enum sched_domain_level {
- SD_LV_NONE = 0,
- SD_LV_SIBLING,
- SD_LV_MC,
- SD_LV_CPU,
- SD_LV_NODE,
- SD_LV_ALLNODES,
- SD_LV_MAX
-};
-
struct sched_domain_attr {
int relax_domain_level;
};
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e632d29..8097dce 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,8 +125,7 @@ int arch_update_cpu_topology(void);
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
- | sd_balance_for_mc_power()\
- | sd_power_saving_flags(),\
+ | sd_power_saving_flags(SD_LV_MC),\
.last_balance = jiffies, \
.balance_interval = 1, \
}
@@ -151,8 +150,7 @@ int arch_update_cpu_topology(void);
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
- | sd_balance_for_package_power()\
- | sd_power_saving_flags(),\
+ | sd_power_saving_flags(SD_LV_CPU),\
.last_balance = jiffies, \
.balance_interval = 1, \
}

Subject: [PATCH V3 2/6] sched: Record the current active power savings level

The current active power savings level of a system is defined as the
maximum of sched_mc_power_savings and sched_smt_power_savings.

The decisions made during power-aware load balancing depend on this
value.

Record this value in a read-mostly global variable instead of having to
compute it every time.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---

include/linux/sched.h | 1 +
kernel/sched.c | 8 ++++++--
kernel/sched_fair.c | 2 +-
3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73aea7b..8563698 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,6 +793,7 @@ enum powersavings_balance_level {
};

extern int sched_mc_power_savings, sched_smt_power_savings;
+extern enum powersavings_balance_level active_power_savings_level;

enum sched_domain_level {
SD_LV_NONE = 0,
diff --git a/kernel/sched.c b/kernel/sched.c
index 410eec4..4629bb1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3398,7 +3398,7 @@ out_balanced:

if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
+ if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
cpumask_first(sched_group_cpus(group_leader));
}
@@ -3683,7 +3683,7 @@ redo:
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;

- if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
return -1;

if (sd->nr_balance_failed++ < 2)
@@ -7206,6 +7206,8 @@ static void sched_domain_node_span(int node, struct cpumask *span)
#endif /* CONFIG_NUMA */

int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
+/* Records the currently active power savings level */
+enum powersavings_balance_level __read_mostly active_power_savings_level;

/*
* The cpus mask in sched_group and sched_domain hangs off the end.
@@ -8040,6 +8042,8 @@ static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
sched_smt_power_savings = level;
else
sched_mc_power_savings = level;
+ active_power_savings_level = max(sched_smt_power_savings,
+ sched_mc_power_savings);

arch_reinit_sched_domains();

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0566f2a..a3583c6 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1054,7 +1054,7 @@ static int wake_idle(int cpu, struct task_struct *p)
chosen_wakeup_cpu =
cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;

- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+ if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
idle_cpu(cpu) && idle_cpu(this_cpu) &&
p->mm && !(p->flags & PF_KTHREAD) &&
cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))

Subject: [PATCH V3 3/6] sched: Add Comments at the beginning of find_busiest_group.

Currently there are no comments pertaining to power-savings balance in
the function find_busiest_group. Add appropriate comments.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---

kernel/sched.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4629bb1..8648eb0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3090,6 +3090,23 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
* should be moved to restore balance via the imbalance parameter.
+ *
+ * Power-savings-balance: Through the sysfs tunables sched_mc/smt_power_savings
+ * the user can opt for aggressive task consolidation as a means to save power.
+ * When this is activated, the SD_POWERSAVINGS_BALANCE flag is set for the
+ * appropriate sched_domains.
+ *
+ * Within such sched_domains, find_busiest_group() tries to identify
+ * a sched_group which can be freed up and whose tasks can be migrated to
+ * a sibling group which has the capacity to accommodate them.
+ * If such a "can-go-idle" sched_group exists, then the sibling group
+ * which can accommodate its tasks is returned as the busiest group.
+ *
+ * Furthermore, if the user opts for more aggressive power-aware load
+ * balancing through sched_smt/mc_power_savings = 2, i.e. when the
+ * active_power_savings_level is >= POWERSAVINGS_BALANCE_WAKEUP,
+ * find_busiest_group() will also nominate the preferred CPU on which
+ * tasks should henceforth be woken up, instead of waking an idle CPU.
*/
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,

Subject: [PATCH V3 4/6] sched: Rename the variable sched_mc_preferred_wakeup_cpu

sched_mc_preferred_wakeup_cpu is currently used when the user seeks
power savings through aggressive task consolidation.

This is applicable to both sched_mc_power_savings and
sched_smt_power_savings. So rename sched_mc_preferred_wakeup_cpu to
preferred_wakeup_cpu.

Also fix the comment for preferred_wakeup_cpu.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 12 +++++++-----
kernel/sched_fair.c | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8648eb0..d57a882 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -515,11 +515,13 @@ struct root_domain {
#endif
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
/*
- * Preferred wake up cpu nominated by sched_mc balance that will be
- * used when most cpus are idle in the system indicating overall very
- * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+ * The preferred wake-up CPU is nominated by the load balancer.
+ * When nominated, tasks are woken up on this CPU instead of on
+ * an idle CPU, even on a system with otherwise low CPU
+ * utilization.
+ * This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
*/
- unsigned int sched_mc_preferred_wakeup_cpu;
+ unsigned int preferred_wakeup_cpu;
#endif
};

@@ -3416,7 +3418,7 @@ out_balanced:
if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+ cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
cpumask_first(sched_group_cpus(group_leader));
}
return group_min;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a3583c6..03b1e3c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1052,7 +1052,7 @@ static int wake_idle(int cpu, struct task_struct *p)

this_cpu = smp_processor_id();
chosen_wakeup_cpu =
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+ cpu_rq(this_cpu)->rd->preferred_wakeup_cpu;

if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
idle_cpu(cpu) && idle_cpu(this_cpu) &&

Subject: [PATCH V3 5/6] sched: Arbitrate the nomination of preferred_wakeup_cpu

Currently, for sched_mc/smt_power_savings = 2, we consolidate tasks
by having a preferred_wakeup_cpu which is used for all further
wake-ups.

This preferred_wakeup_cpu is currently nominated by find_busiest_group()
while load balancing sched_domains which have the SD_POWERSAVINGS_BALANCE
flag set.

However, on systems which are multi-threaded and multi-core, we can
have multiple sched_domains in the same hierarchy with
SD_POWERSAVINGS_BALANCE flag set.

Currently there is no arbitration mechanism to decide at which
sched_domain level in the hierarchy find_busiest_group() should nominate
the preferred_wakeup_cpu. Hence a nomination made while balancing one
domain can overwrite a valid nomination made at another, causing the
preferred_wakeup_cpu to ping-pong and preventing us from effectively
consolidating tasks.

Fix this by means of an arbitration algorithm, wherein
find_busiest_group() nominates the preferred_wakeup_cpu while balancing
a particular sched_domain only if that sched_domain:
 - is the topmost power-aware sched_domain.
 OR
 - contains the previously nominated preferred wake-up CPU in its span.

This will help further fine-tune the wake-up biasing logic by
identifying a partially busy core within a CPU package instead of
potentially waking up a completely idle core.
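The arbitration rule itself can be demonstrated with a small userspace
sketch (the struct definitions and the single-word span bitmask are
simplified stand-ins for the kernel's root_domain, sched_domain, and
cpumask types):

```c
#include <assert.h>
#include <stdbool.h>

enum sched_domain_level { SD_LV_NONE, SD_LV_SIBLING, SD_LV_MC,
			  SD_LV_CPU, SD_LV_NODE, SD_LV_ALLNODES, SD_LV_MAX };

/* Simplified stand-in for the kernel's root_domain. */
struct root_domain {
	unsigned int preferred_wakeup_cpu;	/* UINT_MAX = none yet */
	enum sched_domain_level top_powersavings_sd_lvl;
};

/* Simplified sched_domain: span is a bitmask of member CPUs. */
struct sched_domain {
	enum sched_domain_level level;
	unsigned long span;
};

/*
 * Mirrors the arbitration condition from the patch: nominate only at
 * the topmost power-aware domain, or at a domain whose span already
 * contains the previously nominated CPU.
 */
static bool may_nominate(const struct root_domain *rd,
			 const struct sched_domain *sd)
{
	return sd->level == rd->top_powersavings_sd_lvl ||
	       (rd->preferred_wakeup_cpu < sizeof(sd->span) * 8 &&
		(sd->span & (1UL << rd->preferred_wakeup_cpu)));
}
```

Before any nomination has been made, only the topmost power-aware domain
may nominate; afterwards, lower domains that contain the nominated CPU
may refine the choice, which is what keeps the nomination from
ping-ponging between packages.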

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d57a882..35e1651 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -522,6 +522,14 @@ struct root_domain {
* This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
*/
unsigned int preferred_wakeup_cpu;
+
+ /*
+ * top_powersavings_sd_lvl records the level of the highest
+ * sched_domain that has the SD_POWERSAVINGS_BALANCE flag set.
+ *
+ * Used to arbitrate nomination of the preferred_wakeup_cpu.
+ */
+ enum sched_domain_level top_powersavings_sd_lvl;
#endif
};

@@ -3416,9 +3424,27 @@ out_balanced:
goto ret;

if (this == group_leader && group_leader != group_min) {
+ struct root_domain *my_rd = cpu_rq(this_cpu)->rd;
*imbalance = min_load_per_task;
- if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
- cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
+ /*
+ * To avoid overwriting of preferred_wakeup_cpu nominations
+ * while calling find_busiest_group() at various sched_domain
+ * levels, we define an arbitration mechanism wherein
+ * find_busiest_group() nominates a preferred_wakeup_cpu at
+ * the sched_domain sd if:
+ *
+ * - sd is the highest sched_domain in the hierarchy having the
+ * SD_POWERSAVINGS_BALANCE flag set.
+ *
+ * OR
+ *
+ * - sd contains the previously nominated preferred_wakeup_cpu
+ * in its span.
+ */
+ if (sd->level == my_rd->top_powersavings_sd_lvl ||
+ cpu_isset(my_rd->preferred_wakeup_cpu,
+ *sched_domain_span(sd))) {
+ my_rd->preferred_wakeup_cpu =
cpumask_first(sched_group_cpus(group_leader));
}
return group_min;
@@ -7541,6 +7567,8 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
struct root_domain *rd;
cpumask_var_t nodemask, this_sibling_map, this_core_map, send_covered,
tmpmask;
+ struct sched_domain *sd;
+
#ifdef CONFIG_NUMA
cpumask_var_t domainspan, covered, notcovered;
struct sched_group **sched_group_nodes = NULL;
@@ -7816,6 +7844,19 @@ static int __build_sched_domains(const struct cpumask *cpu_map,

err = 0;

+ rd->preferred_wakeup_cpu = UINT_MAX;
+ rd->top_powersavings_sd_lvl = SD_LV_NONE;
+
+ if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
+ goto free_tmpmask;
+
+ /* Record the level of the highest power-aware sched_domain */
+ for_each_domain(first_cpu(*cpu_map), sd) {
+ if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
+ continue;
+ rd->top_powersavings_sd_lvl = sd->level;
+ }
+
free_tmpmask:
free_cpumask_var(tmpmask);
free_send_covered:

Subject: [PATCH V3 6/6] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.

Currently a sched_domain having a single group can be prevented from
being degenerated if it has the SD_POWERSAVINGS_BALANCE flag set. But
since it has only one group, it has no scope for performing power-savings
balance, as it does not have a sibling group to pull from.

Apart from not providing any power savings, it also fails to participate
in normal load balancing.

Fix this by allowing such a sched_domain to degenerate and by passing on
the responsibility of performing POWERSAVINGS_BALANCE to its parent
domain.
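A minimal sketch of the flag hand-off follows (the flag value and the
bare two-level parent chain are simplified assumptions; the real patch
derives the transferred flags via sd_power_saving_flags(parent->level)):

```c
#include <assert.h>

#define SD_POWERSAVINGS_BALANCE 0x1

struct sched_domain {
	unsigned long flags;
	struct sched_domain *parent;
};

/*
 * Sketch of the fix: if SD_POWERSAVINGS_BALANCE is the only thing
 * keeping a single-group parent domain alive, move the responsibility
 * up to the grandparent and let the parent degenerate.
 */
static void transfer_powersavings(struct sched_domain *parent,
				  unsigned long *pflags)
{
	if ((*pflags & SD_POWERSAVINGS_BALANCE) && parent->parent) {
		*pflags &= ~SD_POWERSAVINGS_BALANCE;
		parent->parent->flags |= SD_POWERSAVINGS_BALANCE;
	}
}
```

Note that the transfer only happens when a grandparent exists; the
topmost power-aware domain keeps its flag, since there is nowhere to
pass the responsibility to.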

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 35e1651..d27b8e3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6970,6 +6970,20 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_PKG_RESOURCES);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
+
+ /*
+ * If SD_POWERSAVINGS_BALANCE is the only flag preventing us
+ * from degenerating a domain with a single group, transfer
+ * the responsibility to the grandparent domain and
+ * degenerate this one. With a single group, the domain
+ * cannot contribute to power-aware load balancing
+ * anyway.
+ */
+ if (pflags & SD_POWERSAVINGS_BALANCE && parent->parent) {
+ pflags &= ~SD_POWERSAVINGS_BALANCE;
+ parent->parent->flags |=
+ sd_power_saving_flags(parent->level);
+ }
}
if (~cflags & pflags)
return 0;