Subject: [PATCH v4 0/5] sched: Extend sched_mc/smt_power_savings framework.

Hi,

This is the fourth iteration of the patchset which extends the
sched_mc_power_savings enhancements to benefit sched_smt_power_savings
as well. This is intended to work on platforms that have on-chip
memory controllers, making each CPU package a 'node'.

The patch series is against linux-2.6-tip master, as of March 30th.

In addition to providing power savings on such platforms,
this patchset fixes the inconsistent behavior of
sched_smt_power_savings when running an odd number of pairs of tasks.

Ideally, when sched_smt_power_savings is enabled, we would like the
tasks to run on sibling threads to take advantage of the
cache-sharing. However, when there are only 2 threads running and
sched_smt_power_savings is enabled, the current load balancer doesn't
pull them across packages onto a single core.

Changes from V3 (Found here: --> http://lkml.org/lkml/2009/3/6/23)
- Rebased and retested against linux-2.6-tip master as of Mar 30th.
- Dropped the patch which added comments for find_busiest_group. That
has been sent as a separate series.

Changes from V2: (Found here: --> http://lkml.org/lkml/2009/3/3/109)
- Patches have been split up in an incremental manner for easy review.
- Fixed comments for some variables.
- Renamed some variables to better reflect their usage.

Changes from V1: (Found here: --> http://lkml.org/lkml/2009/2/16/221)
- Added comments to explain power-saving part in find_busiest_group()
- Added comments for the different sched_domain levels.

Background
------------------------------------------------------------------
On machines with on-chip memory controllers, each physical CPU
package forms a NUMA node and the CPU-level sched_domain will have
only one group. This prevents any form of power-saving balance across
these nodes. Enabling the sched_mc_power_savings tunable to work as
designed on these new single-CPU-package NUMA node machines will help
task consolidation and save power, as it does on other multi-core,
multi-socket platforms.

Consolidation across nodes has implications of cross-node memory
access and other NUMA locality issues. Even under such constraints
there could be scope for power-savings vs. performance tradeoffs, and
hence making sched_mc_power_savings work as expected on these
platforms is justified.

sched_mc/smt_power_savings is still a tunable; the power-savings
benefit and performance impact will vary depending on the workload,
the system topology, and hardware features.

The results of this patch series, tested with kernbench on a
2-socket, quad-core, dual-threaded box while varying the number of
build threads, are as follows:

+------------------------------------------------------------------------+
|Test: make -j4 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 37.44 |Core4: 70.50 |
| | | | +-------------+----------------+
| | | | |Core1: 59.89 |Core5: 20.08 |
| 0 | 0 | 100 | 107.45 +-------------+----------------+
| | | | |Core2: 57.63 |Core6: 62.80 |
| | | | +-------------+----------------+
| | | | |Core3: 64.78 |Core7: 65.07 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 13.41 |Core4: 98.06 |
| | | | +-------------+----------------+
| | | | |Core1: 28.56 |Core5: 60.64 |
| 0 | 2 | 99.89 | 109.95 +-------------+----------------+
| | | | |Core2: 28.49 |Core6: 98.30 |
| | | | +-------------+----------------+
| | | | |Core3: 31.49 |Core7: 99.77 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 35.05 |Core4: 41.78 |
| | | | +-------------+----------------+
| | | | |Core1: 78.28 |Core5: 32.15 |
| 2 | 2 | 95.84 | 137.73 +-------------+----------------+
| | | | |Core2: 93.45 |Core6: 87.49 |
| | | | +-------------+----------------+
| | | | |Core3: 97.70 |Core7: 90.47 |
+-----------+----------+--------+---------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j6 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 25.50 |Core4: 39.22 |
| | | | +-------------+----------------+
| | | | |Core1: 46.47 |Core5: 20.71 |
| 0 | 0 | 100 | 76.06 +-------------+----------------+
| | | | |Core2: 45.20 |Core6: 42.30 |
| | | | +-------------+----------------+
| | | | |Core3: 46.50 |Core7: 47.29 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 17.00 |Core4: 52.38 |
| | | | +-------------+----------------+
| | | | |Core1: 42.08 |Core5: 26.21 |
| 0 | 2 | 98.99 | 79.53 +-------------+----------------+
| | | | |Core2: 46.47 |Core6: 58.16 |
| | | | +-------------+----------------+
| | | | |Core3: 43.39 |Core7: 54.36 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 63.85 |Core4: 21.55 |
| | | | +-------------+----------------+
| | | | |Core1: 93.35 |Core5: 18.76 |
| 2 | 2 | 92.16 | 100.22 +-------------+----------------+
| | | | |Core2: 96.02 |Core6: 36.76 |
| | | | +-------------+----------------+
| | | | |Core3: 99.01 |Core7: 64.32 |
+-----------+----------+--------+---------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j8 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 19.34 |Core4: 34.01 |
| | | | +-------------+----------------+
| | | | |Core1: 36.06 |Core5: 21.02 |
| 0 | 0 | 100 | 62.67 +-------------+----------------+
| | | | |Core2: 31.60 |Core6: 32.32 |
| | | | +-------------+----------------+
| | | | |Core3: 34.89 |Core7: 36.48 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 17.53 |Core4: 35.30 |
| | | | +-------------+----------------+
| | | | |Core1: 37.05 |Core5: 22.93 |
| 0 | 2 | 99.20 | 64.08 +-------------+----------------+
| | | | |Core2: 36.96 |Core6: 35.07 |
| | | | +-------------+----------------+
| | | | |Core3: 36.09 |Core7: 37.38 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 11.58 |Core4: 91.99 |
| | | | +-------------+----------------+
| | | | |Core1: 18.51 |Core5: 58.37 |
| 2 | 2 | 90.20 | 80.87 +-------------+----------------+
| | | | |Core2: 22.62 |Core6: 97.68 |
| | | | +-------------+----------------+
| | | | |Core3: 21.83 |Core7: 99.80 |
+-----------+----------+--------+---------+-------------+----------------+

---

Gautham R Shenoy (5):
sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
sched: Arbitrate the nomination of preferred_wakeup_cpu
sched: Rename the variable sched_mc_preferred_wakeup_cpu
sched: Record the current active power savings level
sched: code cleanup - sd_power_saving_flags(), sd_balance_for_*_power()


include/linux/sched.h | 66 ++++++++++++++++++-----------------
include/linux/topology.h | 6 +--
kernel/sched.c | 86 ++++++++++++++++++++++++++++++++++++++++------
kernel/sched_fair.c | 4 +-
4 files changed, 112 insertions(+), 50 deletions(-)

--
Thanks and Regards
gautham.


Subject: [PATCH v4 1/5] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_*_power()

Combine the functions of sd_balance_for_mc/package_power() and
sd_power_saving_flags() into a single function.

Also add comments for the various sched_domain levels.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

include/linux/sched.h | 65 +++++++++++++++++++++++-----------------------
include/linux/topology.h | 6 +---
2 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1df4745..8aaf276 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -803,34 +803,45 @@ enum powersavings_balance_level {

extern int sched_mc_power_savings, sched_smt_power_savings;

-static inline int sd_balance_for_mc_power(void)
-{
- if (sched_smt_power_savings)
- return SD_POWERSAVINGS_BALANCE;
-
- return 0;
-}
-
-static inline int sd_balance_for_package_power(void)
-{
- if (sched_mc_power_savings | sched_smt_power_savings)
- return SD_POWERSAVINGS_BALANCE;
+enum sched_domain_level {
+ SD_LV_NONE = 0,
+ SD_LV_SIBLING, /* Represents the THREADS domain */
+ SD_LV_MC, /* Represents the CORES domain */
+ SD_LV_CPU, /* Represents the PACKAGE domain */
+ SD_LV_NODE, /* Represents the NODES domain */
+ SD_LV_ALLNODES,
+ SD_LV_MAX
+};

- return 0;
-}

-/*
- * Optimise SD flags for power savings:
- * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
- * Keep default SD flags if sched_{smt,mc}_power_saving=0
+/**
+ * sd_power_saving_flags: Returns the flags specific to power-aware load
+ * balancing for a given sched_domain level
+ *
+ * @level: The sched_domain level for which the power-aware load-balancing
+ * flags need to be set.
+ *
+ * This function helps in setting the flags for power-aware load balancing for
+ * a given sched_domain.
+ * - SD_POWERSAVINGS_BALANCE tells the load-balancer that power-aware
+ * load balancing is applicable at this domain.
+ *
+ * - SD_BALANCE_NEWIDLE helps aggressive task consolidation and
+ * power-savings.
+ *
+ * For more information on power aware scheduling, see the comment before
+ * find_busiest_group() in kernel/sched.c
*/

-static inline int sd_power_saving_flags(void)
+static inline int sd_power_saving_flags(enum sched_domain_level level)
{
- if (sched_mc_power_savings | sched_smt_power_savings)
- return SD_BALANCE_NEWIDLE;
+ if (level == SD_LV_MC && !sched_smt_power_savings)
+ return 0;
+ if (level == SD_LV_CPU &&
+ !(sched_mc_power_savings || sched_smt_power_savings))
+ return 0;

- return 0;
+ return SD_POWERSAVINGS_BALANCE | SD_BALANCE_NEWIDLE;
}

struct sched_group {
@@ -856,16 +867,6 @@ static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
return to_cpumask(sg->cpumask);
}

-enum sched_domain_level {
- SD_LV_NONE = 0,
- SD_LV_SIBLING,
- SD_LV_MC,
- SD_LV_CPU,
- SD_LV_NODE,
- SD_LV_ALLNODES,
- SD_LV_MAX
-};
-
struct sched_domain_attr {
int relax_domain_level;
};
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7402c1a..2338388 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -121,8 +121,7 @@ int arch_update_cpu_topology(void);
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
- | sd_balance_for_mc_power()\
- | sd_power_saving_flags(),\
+ | sd_power_saving_flags(SD_LV_MC),\
.last_balance = jiffies, \
.balance_interval = 1, \
}
@@ -147,8 +146,7 @@ int arch_update_cpu_topology(void);
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
- | sd_balance_for_package_power()\
- | sd_power_saving_flags(),\
+ | sd_power_saving_flags(SD_LV_CPU),\
.last_balance = jiffies, \
.balance_interval = 1, \
}

Subject: [PATCH v4 2/5] sched: Record the current active power savings level

The existing load balancer code is dependent on the sched_mc_power_savings
variable. However, on multi-core + multi-threaded machines, these decisions
need to be dependent on the values of both sched_mc_power_savings and
sched_smt_power_savings.

Create a new variable named active_power_savings_level, which is the
maximum of sched_mc_power_savings and sched_smt_power_savings.

Record this value in a read-mostly global variable at the time the user
changes the value of the sched_mc/smt_power_savings tunables, and use it
for load-balancing decisions instead of computing it every time.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

include/linux/sched.h | 1 +
kernel/sched.c | 8 ++++++--
kernel/sched_fair.c | 2 +-
3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8aaf276..1b1cab4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -802,6 +802,7 @@ enum powersavings_balance_level {
};

extern int sched_mc_power_savings, sched_smt_power_savings;
+extern enum powersavings_balance_level active_power_savings_level;

enum sched_domain_level {
SD_LV_NONE = 0,
diff --git a/kernel/sched.c b/kernel/sched.c
index 706517c..07b774e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3450,7 +3450,7 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
*imbalance = sds->min_load_per_task;
sds->busiest = sds->group_min;

- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
+ if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
group_first_cpu(sds->group_leader);
}
@@ -4120,7 +4120,7 @@ redo:
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;

- if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
return -1;

if (sd->nr_balance_failed++ < 2)
@@ -7741,6 +7741,8 @@ static void sched_domain_node_span(int node, struct cpumask *span)
#endif /* CONFIG_NUMA */

int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
+/* Records the currently active power savings level */
+enum powersavings_balance_level __read_mostly active_power_savings_level;

/*
* The cpus mask in sched_group and sched_domain hangs off the end.
@@ -8575,6 +8577,8 @@ static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
sched_smt_power_savings = level;
else
sched_mc_power_savings = level;
+ active_power_savings_level = max(sched_smt_power_savings,
+ sched_mc_power_savings);

arch_reinit_sched_domains();

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3816f21..02324d2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1054,7 +1054,7 @@ static int wake_idle(int cpu, struct task_struct *p)
chosen_wakeup_cpu =
cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;

- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+ if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
idle_cpu(cpu) && idle_cpu(this_cpu) &&
p->mm && !(p->flags & PF_KTHREAD) &&
cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))

Subject: [PATCH v4 3/5] sched: Rename the variable sched_mc_preferred_wakeup_cpu

sched_mc_preferred_wakeup_cpu is currently used when the user seeks
power savings through aggressive task consolidation.

But this functionality is applicable to both sched_mc_power_savings
and sched_smt_power_savings. So rename
sched_mc_preferred_wakeup_cpu to preferred_wakeup_cpu.

Also fix the comment for preferred_wakeup_cpu.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 12 +++++++-----
kernel/sched_fair.c | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 07b774e..36d116b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -542,11 +542,13 @@ struct root_domain {
#endif
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
/*
- * Preferred wake up cpu nominated by sched_mc balance that will be
- * used when most cpus are idle in the system indicating overall very
- * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+ * The preferred wake-up cpu, nominated by the load balancer, is
+ * the CPU on which tasks will be woken up; such tasks would
+ * otherwise have woken up an idle CPU, even on a system with low
+ * CPU utilization.
+ * This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
*/
- unsigned int sched_mc_preferred_wakeup_cpu;
+ unsigned int preferred_wakeup_cpu;
#endif
};

@@ -3451,7 +3453,7 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
sds->busiest = sds->group_min;

if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+ cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
group_first_cpu(sds->group_leader);
}

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 02324d2..dacb0d8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1052,7 +1052,7 @@ static int wake_idle(int cpu, struct task_struct *p)

this_cpu = smp_processor_id();
chosen_wakeup_cpu =
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+ cpu_rq(this_cpu)->rd->preferred_wakeup_cpu;

if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
idle_cpu(cpu) && idle_cpu(this_cpu) &&

Subject: [PATCH v4 5/5] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.

Currently a sched_domain having a single group can be prevented from getting
degenerated if it contains a SD_POWERSAVINGS_BALANCE flag. But since it has
only one group, it won't have any scope for performing powersavings balance as
it does not have a sibling group to pull from.

Apart from not providing any power savings, it also fails to participate
in normal load-balancing.

So, fix this by allowing such a sched_domain to degenerate and pass on the
responsibility of performing POWERSAVINGS_BALANCE to its parent domain.

This patch also fixes the inconsistent behavior of
sched_smt_power_savings while running odd number of pairs of tasks.

Ideally, when sched_smt_power_savings is enabled, we would like the
tasks to run on sibling threads to take advantage of the
cache-sharing. However, when there are only 2 threads running and
sched_smt_power_savings is enabled, the current load balancer doesn't
pull them across packages onto a single core. This is because of the
way the sched_domains are degenerated today, where the degenerating
domain doesn't pass on the power-savings-balance related flag to the
new parent.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 193bb67..5f3d16a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7489,6 +7489,20 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_PKG_RESOURCES);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
+
+ /*
+ * If the only flag that is preventing us from degenerating
+ * a domain with a single group is SD_POWERSAVINGS_BALANCE
+ * check if it can be transferred to the new parent,
+ * and degenerate this domain. With only a single
+ * group, it cannot contribute to power-aware load
+ * balancing anyway.
+ */
+ if (pflags & SD_POWERSAVINGS_BALANCE && parent->parent) {
+ pflags &= ~SD_POWERSAVINGS_BALANCE;
+ parent->parent->flags |=
+ sd_power_saving_flags(parent->level);
+ }
}
if (~cflags & pflags)
return 0;

Subject: [PATCH v4 4/5] sched: Arbitrate the nomination of preferred_wakeup_cpu

Currently, for sched_mc/smt_power_savings = 2, we consolidate tasks
by nominating a preferred_wakeup_cpu which is then used for all
further wake-ups.

This preferred_wakeup_cpu is currently nominated by find_busiest_group()
when we perform load-balancing at sched_domains which has
SD_POWERSAVINGS_BALANCE flag set.

However, on systems which are multi-threaded and multi-core, we can
have multiple sched_domains in the same hierarchy with
SD_POWERSAVINGS_BALANCE flag set.

Currently we don't have an arbitration mechanism to decide at which
sched_domain in the hierarchy find_busiest_group() should nominate the
preferred_wakeup_cpu while performing load balancing. Hence a nomination
made at one level can overwrite a valid nomination made previously at
another, causing the preferred_wakeup_cpu to ping-pong and preventing
us from effectively consolidating tasks.

Fix this by means of an arbitration algorithm, wherein we nominate the
preferred_wakeup_cpu while performing load balancing at a particular
sched_domain only if that sched_domain:
- is the topmost power-aware sched_domain.
OR
- contains the previously nominated preferred_wakeup_cpu in its span.

This will help to further fine tune the wake-up biasing logic by
identifying a partially busy core within a CPU package instead of
potentially waking up a completely idle core.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 36d116b..193bb67 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -549,6 +549,14 @@ struct root_domain {
* This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
*/
unsigned int preferred_wakeup_cpu;
+
+ /*
+ * top_powersavings_sd_lvl records the level of the highest
+ * sched_domain that has the SD_POWERSAVINGS_BALANCE flag set.
+ *
+ * Used to arbitrate nomination of the preferred_wakeup_cpu.
+ */
+ enum sched_domain_level top_powersavings_sd_lvl;
#endif
};

@@ -3439,9 +3447,11 @@ static inline void update_sd_power_savings_stats(struct sched_group *group,
* Returns 1 if there is potential to perform power-savings balance.
* Else returns 0.
*/
-static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
- int this_cpu, unsigned long *imbalance)
+static inline int check_power_save_busiest_group(struct sched_domain *sd,
+ struct sd_lb_stats *sds, int this_cpu, unsigned long *imbalance)
{
+ struct root_domain *my_rd = cpu_rq(this_cpu)->rd;
+
if (!sds->power_savings_balance)
return 0;

@@ -3452,8 +3462,25 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
*imbalance = sds->min_load_per_task;
sds->busiest = sds->group_min;

- if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
- cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
+ /*
+ * To avoid overwriting of preferred_wakeup_cpu nominations
+ * while performing load-balancing at various sched_domain
+ * levels, we define an arbitration mechanism wherein
+ * we nominate a preferred_wakeup_cpu while load balancing
+ * at a particular sched_domain sd if:
+ *
+ * - sd is the highest sched_domain in the hierarchy having the
+ * SD_POWERSAVINGS_BALANCE flag set.
+ *
+ * OR
+ *
+ * - sd contains the previously nominated preferred_wakeup_cpu
+ * in its span.
+ */
+ if (sd->level == my_rd->top_powersavings_sd_lvl ||
+ cpumask_test_cpu(my_rd->preferred_wakeup_cpu,
+ sched_domain_span(sd))) {
+ my_rd->preferred_wakeup_cpu =
group_first_cpu(sds->group_leader);
}

@@ -3473,8 +3500,8 @@ static inline void update_sd_power_savings_stats(struct sched_group *group,
return;
}

-static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
- int this_cpu, unsigned long *imbalance)
+static inline int check_power_save_busiest_group(struct sched_domain *sd,
+ struct sd_lb_stats *sds, int this_cpu, unsigned long *imbalance)
{
return 0;
}
@@ -3838,7 +3865,7 @@ out_balanced:
* There is no obvious imbalance. But check if we can do some balancing
* to save power.
*/
- if (check_power_save_busiest_group(&sds, this_cpu, imbalance))
+ if (check_power_save_busiest_group(sd, &sds, this_cpu, imbalance))
return sds.busiest;
ret:
*imbalance = 0;
@@ -8059,6 +8086,8 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
struct root_domain *rd;
cpumask_var_t nodemask, this_sibling_map, this_core_map, send_covered,
tmpmask;
+ struct sched_domain *sd;
+
#ifdef CONFIG_NUMA
cpumask_var_t domainspan, covered, notcovered;
struct sched_group **sched_group_nodes = NULL;
@@ -8334,6 +8363,19 @@ static int __build_sched_domains(const struct cpumask *cpu_map,

err = 0;

+ rd->preferred_wakeup_cpu = UINT_MAX;
+ rd->top_powersavings_sd_lvl = SD_LV_NONE;
+
+ if (active_power_savings_level < POWERSAVINGS_BALANCE_WAKEUP)
+ goto free_tmpmask;
+
+ /* Record the level of the highest power-aware sched_domain */
+ for_each_domain(first_cpu(*cpu_map), sd) {
+ if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
+ continue;
+ rd->top_powersavings_sd_lvl = sd->level;
+ }
+
free_tmpmask:
free_cpumask_var(tmpmask);
free_send_covered: