Subject: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework

Hi,

This is the second iteration of the patch series that extends the existing
sched_smt_/mc_power_savings framework to work on platforms
that have on-chip memory controllers making each of the cpu-package
a 'node'.

Original posting can be found here --> http://lkml.org/lkml/2009/2/16/221

Based on Peter's review comments, I have added additional comments at
appropriate places that help us understand the reason behind doing
certain things in the name of power-aware-load balancing.

Background
------------------------------------------------------------------
On machines with on-chip memory controller, each physical CPU
package forms a NUMA node and the CPU level sched_domain will have
only one group. This prevents any form of power saving balance across
these nodes. Enabling the sched_mc_power_savings tunable to work as
designed on these new single CPU NUMA node machines will help task
consolidation and save power as we did in other multi core multi
socket platforms.

Consolidation across NODES have implications of cross-node memory
access and other NUMA locality issues. Even under such constraints
there could be scope for power savings vs performance tradeoffs and
hence making the sched_mc_powersavings work as expected on these
platform is justified.

sched_mc/smt_power_savings is still a tunable and power savings benefits
and performance would vary depending on the workload and the system
topology and hardware features.

The patch series has been tested on a 2-Socket Quad-core Dual threaded
box with kernbench as the workload, varying the number of threads.


+------------------------------------------------------------------------+
|Test: make -j4 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | (s) | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 35.34 |Core4: 62.01 |
| | | | +-------------+----------------+
| | | | |Core1: 58.34 |Core5: 17.41 |
| 0 | 0 | 100 | 107.84 +-------------+----------------+
| | | | |Core2: 63.97 |Core6: 60.29 |
| | | | +-------------+----------------+
| | | | |Core3: 68.64 |Core7: 61.46 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 34.28 |Core4: 18.26 |
| | | | +-------------+----------------+
| | | | |Core1: 99.19 |Core5: 18.54 |
| 0 | 2 | 99.89 | 109.91 +-------------+----------------+
| | | | |Core2: 99.89 |Core6: 21.54 |
| | | | +-------------+----------------+
| | | | |Core3: 99.91 |Core7: 23.21 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 20.17 |Core4: 69.30 |
| | | | +-------------+----------------+
| | | | |Core1: 50.22 |Core5: 55.97 |
| 2 | 2 | 95.03 | 139.95 +-------------+----------------+
| | | | |Core2: 83.95 |Core6: 92.70 |
| | | | +-------------+----------------+
| | | | |Core3: 88.95 |Core7: 95.58 |
+-----------+----------+--------+ --------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j6 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 25.35 |Core4: 41.07 |
| | | | +-------------+----------------+
| | | | |Core1: 43.84 |Core5: 19.95 |
| 0 | 0 | 100 | 77.67 +-------------+----------------+
| | | | |Core2: 43.23 |Core6: 42.82 |
| | | | +-------------+----------------+
| | | | |Core3: 47.66 |Core7: 45.96 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 24.13 |Core4: 41.80 |
| | | | +-------------+----------------+
| | | | |Core1: 51.51 |Core5: 23.61 |
| 0 | 2 | 99.41 | 81.50 +-------------+----------------+
| | | | |Core2: 55.43 |Core6: 38.67 |
| | | | +-------------+----------------+
| | | | |Core3: 57.79 |Core7: 38.84 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 7.75 |Core4: 94.45 |
| | | | +-------------+----------------+
| | | | |Core1: 19.04 |Core5: 67.42 |
| 2 | 2 | 93.32 | 100.39 +-------------+----------------+
| | | | |Core2: 28.29 |Core6: 96.90 |
| | | | +-------------+----------------+
| | | | |Core3: 66.63 |Core7: 99.86 |
+-----------+----------+--------+---------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j8 |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
| | | | | idle | idle |
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 18.17 |Core4: 33.38 |
| | | | +-------------+----------------+
| | | | |Core1: 34.62 |Core5: 19.58 |
| 0 | 0 | 100 | 63.82 +-------------+----------------+
| | | | |Core2: 31.99 |Core6: 32.35 |
| | | | +-------------+----------------+
| | | | |Core3: 34.59 |Core7: 29.99 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 15.20 |Core4: 41.41 |
| | | | +-------------+----------------+
| | | | |Core1: 28.45 |Core5: 21.32 |
| 0 | 2 | 99.17 | 65.65 +-------------+----------------+
| | | | |Core2: 31.14 |Core6: 41.26 |
| | | | +-------------+----------------+
| | | | |Core3: 30.52 |Core7: 42.95 |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
| | | | |Core0: 16.65 |Core4: 79.04 |
| | | | +-------------+----------------+
| | | | |Core1: 26.74 |Core5: 50.98 |
| 2 | 2 | 89.58 | 82.83 +-------------+----------------+
| | | | |Core2: 30.42 |Core6: 81.33 |
| | | | +-------------+----------------+
| | | | |Core3: 35.57 |Core7: 90.03 |
+-----------+----------+--------+---------+-------------+----------------+

---

Gautham R Shenoy (3):
sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power()


include/linux/sched.h | 66 ++++++++++++++++++-----------------
include/linux/topology.h | 6 +--
kernel/sched.c | 88 +++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 2 +
4 files changed, 119 insertions(+), 43 deletions(-)


Thoughts ?
--
Thanks and Regards
gautham.


Subject: [PATCH 2 1/3] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power()

This is a code cleanup patch which combines the functions of
sd_balance_for_mc/package_power() and sd_power_saving_flags() into a single
function.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 65 +++++++++++++++++++++++-----------------------
include/linux/topology.h | 6 +---
2 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 02e16d2..362807a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -777,34 +777,45 @@ enum powersavings_balance_level {

extern int sched_mc_power_savings, sched_smt_power_savings;

-static inline int sd_balance_for_mc_power(void)
-{
- if (sched_smt_power_savings)
- return SD_POWERSAVINGS_BALANCE;
-
- return 0;
-}
-
-static inline int sd_balance_for_package_power(void)
-{
- if (sched_mc_power_savings | sched_smt_power_savings)
- return SD_POWERSAVINGS_BALANCE;
+enum sched_domain_level {
+ SD_LV_NONE = 0,
+ SD_LV_SIBLING, /* Represents the THREADS domain */
+ SD_LV_MC, /* Represents the CORES domain */
+ SD_LV_CPU, /* Represents the PACKAGE domain */
+ SD_LV_NODE, /* Represents the NODES domain */
+ SD_LV_ALLNODES,
+ SD_LV_MAX
+};

- return 0;
-}

-/*
- * Optimise SD flags for power savings:
- * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
- * Keep default SD flags if sched_{smt,mc}_power_saving=0
+/**
+ * sd_power_savings_flags: Returns the flags specific to power-aware-load
+ * balancing for a given sched_domain level
+ *
+ * @level: The sched_domain level for which the the power-aware-load balancing
+ * flags need to be set.
+ *
+ * This function helps in setting the flags for power-aware load balancing for
+ * a given sched_domain.
+ * - SD_POWERSAVINGS_BALANCE tells the load-balancer that power-aware
+ * load balancing is applicable at this domain.
+ *
+ * - SD_BALANCE_NEWIDLE helps aggressive task consolidation and
+ * power-savings.
+ *
+ * For more information on power aware scheduling, see the comment before
+ * find_busiest_group() in kernel/sched.c
*/

-static inline int sd_power_saving_flags(void)
+static inline int sd_power_saving_flags(enum sched_domain_level level)
{
- if (sched_mc_power_savings | sched_smt_power_savings)
- return SD_BALANCE_NEWIDLE;
+ if (level == SD_LV_MC && !sched_smt_power_savings)
+ return 0;
+ if (level == SD_LV_CPU &&
+ !(sched_mc_power_savings || sched_smt_power_savings))
+ return 0;

- return 0;
+ return SD_POWERSAVINGS_BALANCE | SD_BALANCE_NEWIDLE;
}

struct sched_group {
@@ -830,16 +841,6 @@ static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
return to_cpumask(sg->cpumask);
}

-enum sched_domain_level {
- SD_LV_NONE = 0,
- SD_LV_SIBLING,
- SD_LV_MC,
- SD_LV_CPU,
- SD_LV_NODE,
- SD_LV_ALLNODES,
- SD_LV_MAX
-};
-
struct sched_domain_attr {
int relax_domain_level;
};
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e632d29..8097dce 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,8 +125,7 @@ int arch_update_cpu_topology(void);
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
- | sd_balance_for_mc_power()\
- | sd_power_saving_flags(),\
+ | sd_power_saving_flags(SD_LV_MC),\
.last_balance = jiffies, \
.balance_interval = 1, \
}
@@ -151,8 +150,7 @@ int arch_update_cpu_topology(void);
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
- | sd_balance_for_package_power()\
- | sd_power_saving_flags(),\
+ | sd_power_saving_flags(SD_LV_CPU),\
.last_balance = jiffies, \
.balance_interval = 1, \
}

Subject: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.

The existing algorithm to nominate a preferred wake up cpu would not
work on a machine which has both sched_mc_power_savings and
sched_smt_power_savings enabled. On such machines, the nomination at a lower
level would keep overwriting the nominations by it's peer-level as well as
higher level sched_domains. This would lead to the ping-ponging of the
nominated wake-up cpu, thereby preventing us from effectively consolidating
tasks.

Correct this by defining the authorized nomination sched_domain level, which
is either the highest sched_domain level containing the
SD_POWERSAVINGS_BALANCE flag or a lower level which contains the previously
nominated wake-up cpu in it's span.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 1 +
kernel/sched.c | 74 +++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched_fair.c | 2 +
3 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 362807a..7f66595 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -776,6 +776,7 @@ enum powersavings_balance_level {
};

extern int sched_mc_power_savings, sched_smt_power_savings;
+extern enum powersavings_balance_level active_power_savings_level;

enum sched_domain_level {
SD_LV_NONE = 0,
diff --git a/kernel/sched.c b/kernel/sched.c
index 52bbf1c..7d22a96 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -515,11 +515,22 @@ struct root_domain {
#endif
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
/*
- * Preferred wake up cpu nominated by sched_mc balance that will be
- * used when most cpus are idle in the system indicating overall very
- * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+ * Preferred wake up cpu which is nominated by load balancer,
+ * is the CPU on which the tasks would be woken up, which
+ * otherwise would have woken up on an idle CPU even on a system
+ * with low-cpu-utilization.
+ * This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
*/
unsigned int sched_mc_preferred_wakeup_cpu;
+ /*
+ * authorized_nomination_level records the sched-domain level, which can
+ * in the process of load-balancing nominate the
+ * sched_mc_preferred_wakeup_cpu.
+ *
+ * This helps in serializing the nominations thereby preventing
+ * multiple sched-domain levels overwriting each others' nominations.
+ */
+ enum sched_domain_level authorized_nomination_level;
#endif
};

@@ -3090,6 +3101,22 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
* should be moved to restore balance via the imbalance parameter.
+ *
+ * Power-savings-balance: If the user has enabled the option to save power
+ * by means of task consolidation, then at the corresponding sched_domains,
+ * the SD_POWERSAVINGS_BALANCE flag will be set.
+ *
+ * Within such sched_domains, find_busiest_group would try to identify
+ * a sched_group which can be freed-up and it's tasks can be migrated to
+ * another group which has the capacity to accomodate the former's tasks.
+ * If such a "can-go-idle" sched_groups does exist, then the group which can
+ * accomodate it's tasks is returned as the busiest group.
+ *
+ * Furthermore, if the user opts for more aggressive power-aware load
+ * balancing, i.e when the active_power_savings_level greater or equal to
+ * POWERSAVINGS_BALANCE_WAKEUP, find_busiest_group will also nominate
+ * the preferred CPU, on which the tasks should hence forth
+ * be woken up on, instead of bothering an idle-cpu.
*/
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
@@ -3397,9 +3424,18 @@ out_balanced:
goto ret;

if (this == group_leader && group_leader != group_min) {
+ struct root_domain *my_rd = cpu_rq(this_cpu)->rd;
*imbalance = min_load_per_task;
- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+ /*
+ * Nominate the the preferred wakeup cpu only if this
+ * sched_domain is authorized to do so, or if this sched_domain
+ * contains the previously nominated cpu.
+ */
+ if (sd->level == my_rd->authorized_nomination_level ||
+ (sd->level < my_rd->authorized_nomination_level &&
+ cpu_isset(my_rd->sched_mc_preferred_wakeup_cpu,
+ *sched_domain_span(sd)))) {
+ my_rd->sched_mc_preferred_wakeup_cpu =
cpumask_first(sched_group_cpus(group_leader));
}
return group_min;
@@ -3683,7 +3719,8 @@ redo:
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;

- if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ if (active_power_savings_level <
+ POWERSAVINGS_BALANCE_WAKEUP)
return -1;

if (sd->nr_balance_failed++ < 2)
@@ -7193,6 +7230,9 @@ static void sched_domain_node_span(int node, struct cpumask *span)

int sched_smt_power_savings = 0, sched_mc_power_savings = 0;

+/* Records the currently active power saving level. */
+enum powersavings_balance_level active_power_savings_level;
+
/*
* The cpus mask in sched_group and sched_domain hangs off the end.
* FIXME: use cpumask_var_t or dynamic percpu alloc to avoid wasting space
@@ -7781,6 +7821,25 @@ static int __build_sched_domains(const struct cpumask *cpu_map,

err = 0;

+/* Assign the sched-domain level which can nominate preferred wake-up cpu */
+ rd->sched_mc_preferred_wakeup_cpu = UINT_MAX;
+ rd->authorized_nomination_level = SD_LV_NONE;
+
+ if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
+ struct sched_domain *sd;
+ enum sched_domain_level authorized_nomination_level =
+ SD_LV_NONE;
+
+ for_each_domain(first_cpu(*cpu_map), sd) {
+ if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
+ continue;
+ authorized_nomination_level = sd->level;
+ }
+
+ rd->authorized_nomination_level = authorized_nomination_level;
+ }
+
+
free_tmpmask:
free_cpumask_var(tmpmask);
free_send_covered:
@@ -8027,6 +8086,9 @@ static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
else
sched_mc_power_savings = level;

+ active_power_savings_level = max(sched_smt_power_savings,
+ sched_mc_power_savings);
+
arch_reinit_sched_domains();

return count;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5cc1c16..bddee3e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1042,7 +1042,7 @@ static int wake_idle(int cpu, struct task_struct *p)
chosen_wakeup_cpu =
cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;

- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+ if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
idle_cpu(cpu) && idle_cpu(this_cpu) &&
p->mm && !(p->flags & PF_KTHREAD) &&
cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))

Subject: [PATCH 2 3/3] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.

Currently a sched_domain having a single group can be prevented from getting
degenerated if it contains a SD_POWERSAVINGS_BALANCE flag. But since it has
only one group, it won't have any scope for performing powersavings balance as
it does not have a sibling group to pull from.

Apart from not provide any powersavings, it also fails to participate
in normal load-balancing.

Fix this by allowing such a sched_domain to degenerate and pass on the
responsibility of performing the POWERSAVINGS_BALANCE to it's parent domain.

Signed-off-by: Gautham R Shenoy <[email protected]>
Cc: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7d22a96..765e3e9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6957,6 +6957,20 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_PKG_RESOURCES);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
+
+ /*
+ * If the only flag that is preventing us from degenerating
+ * a domain with a single group is SD_POWERSAVINGS_BALANCE
+ * check if it can be transferred to the new parent,
+ * and degenerate this domain. With a single
+ * group, it anyway can't contribute to power-aware load
+ * balancing.
+ */
+ if (pflags & SD_POWERSAVINGS_BALANCE && parent->parent) {
+ pflags &= ~SD_POWERSAVINGS_BALANCE;
+ parent->parent->flags |=
+ sd_power_saving_flags(parent->level);
+ }
}
if (~cflags & pflags)
return 0;

2009-03-03 12:22:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework

On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:

> Background
> ------------------------------------------------------------------
> On machines with on-chip memory controller, each physical CPU
> package forms a NUMA node and the CPU level sched_domain will have
> only one group. This prevents any form of power saving balance across
> these nodes. Enabling the sched_mc_power_savings tunable to work as
> designed on these new single CPU NUMA node machines will help task
> consolidation and save power as we did in other multi core multi
> socket platforms.
>
> Consolidation across NODES have implications of cross-node memory
> access and other NUMA locality issues. Even under such constraints
> there could be scope for power savings vs performance tradeoffs and
> hence making the sched_mc_powersavings work as expected on these
> platform is justified.
>
> sched_mc/smt_power_savings is still a tunable and power savings benefits
> and performance would vary depending on the workload and the system
> topology and hardware features.
>
> The patch series has been tested on a 2-Socket Quad-core Dual threaded
> box with kernbench as the workload, varying the number of threads.
>

> +------------------------------------------------------------------------+
> |Test: make -j8 |
> +-----------+----------+--------+---------+-------------+----------------+
> | sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
> | | | | | idle | idle |
> +-----------+----------+--------+---------+-------------+----------------+
> | | | | |Core0: 18.17 |Core4: 33.38 |
> | | | | +-------------+----------------+
> | | | | |Core1: 34.62 |Core5: 19.58 |
> | 0 | 0 | 100 | 63.82 +-------------+----------------+
> | | | | |Core2: 31.99 |Core6: 32.35 |
> | | | | +-------------+----------------+
> | | | | |Core3: 34.59 |Core7: 29.99 |
> +-----------+----------+--------+---------+-------------+----------------+

> +-----------+----------+--------+---------+-------------+----------------+
> | | | | |Core0: 16.65 |Core4: 79.04 |
> | | | | +-------------+----------------+
> | | | | |Core1: 26.74 |Core5: 50.98 |
> | 2 | 2 | 89.58 | 82.83 +-------------+----------------+
> | | | | |Core2: 30.42 |Core6: 81.33 |
> | | | | +-------------+----------------+
> | | | | |Core3: 35.57 |Core7: 90.03 |
> +-----------+----------+--------+---------+-------------+----------------+

So while we take longer (~20s) we save about 10% in power?

It would be good to mention something about how power usage is measured.

Furthermore, do we really need those separate mc/smt power savings
settings? -- It appears to me we ought to consolidate some of that and
provide a single knob to save power.

> ---
>
> Gautham R Shenoy (3):
> sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
> sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
> sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power()

Acked-by: Peter Zijlstra <[email protected]>

A few nits on patch #2, please follow up with incremental cleanups.

2009-03-03 12:24:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.

On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:
> +/* Assign the sched-domain level which can nominate preferred wake-up cpu */
> + rd->sched_mc_preferred_wakeup_cpu = UINT_MAX;
> + rd->authorized_nomination_level = SD_LV_NONE;
> +
> + if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> + struct sched_domain *sd;
> + enum sched_domain_level authorized_nomination_level =
> + SD_LV_NONE;
> +
> + for_each_domain(first_cpu(*cpu_map), sd) {
> + if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
> + continue;
> + authorized_nomination_level = sd->level;
> + }
> +
> + rd->authorized_nomination_level = authorized_nomination_level;
> + }

Very odd looking comments there, and that enum init wrapping looks
weird. Either exceed 80 chars, or write it in a second line like:

enum sched_domain_level authorized_nomination_level;

authorized_nomination_level = SD_LV_NONE;

2009-03-03 13:59:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.


* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:
> > +/* Assign the sched-domain level which can nominate preferred wake-up cpu */
> > + rd->sched_mc_preferred_wakeup_cpu = UINT_MAX;
> > + rd->authorized_nomination_level = SD_LV_NONE;
> > +
> > + if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> > + struct sched_domain *sd;
> > + enum sched_domain_level authorized_nomination_level =
> > + SD_LV_NONE;
> > +
> > + for_each_domain(first_cpu(*cpu_map), sd) {
> > + if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
> > + continue;
> > + authorized_nomination_level = sd->level;
> > + }
> > +
> > + rd->authorized_nomination_level = authorized_nomination_level;
> > + }
>
> Very odd looking comments there, and that enum init wrapping looks
> weird. Either exceed 80 chars, or write it in a second line like:
>
> enum sched_domain_level authorized_nomination_level;
>
> authorized_nomination_level = SD_LV_NONE;

I think find_busiest_group() needs to be split up into several
helper functions first, before we add more to it. I.e. first a
couple of cleanup patches that factor it out into 3-4 helper
functions plus a really easy-to-read find_busiest_group() main
function.

Then can we do the above change - and i bet we'll win at least
one indentation level as well so that weird line break goes
away.

Ingo

2009-03-03 15:23:55

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework

* Peter Zijlstra <[email protected]> [2009-03-03 13:21:57]:

> On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:
>
> > Background
> > ------------------------------------------------------------------
> > On machines with on-chip memory controller, each physical CPU
> > package forms a NUMA node and the CPU level sched_domain will have
> > only one group. This prevents any form of power saving balance across
> > these nodes. Enabling the sched_mc_power_savings tunable to work as
> > designed on these new single CPU NUMA node machines will help task
> > consolidation and save power as we did in other multi core multi
> > socket platforms.
> >
> > Consolidation across NODES have implications of cross-node memory
> > access and other NUMA locality issues. Even under such constraints
> > there could be scope for power savings vs performance tradeoffs and
> > hence making the sched_mc_powersavings work as expected on these
> > platform is justified.
> >
> > sched_mc/smt_power_savings is still a tunable and power savings benefits
> > and performance would vary depending on the workload and the system
> > topology and hardware features.
> >
> > The patch series has been tested on a 2-Socket Quad-core Dual threaded
> > box with kernbench as the workload, varying the number of threads.
> >
>
> > +------------------------------------------------------------------------+
> > |Test: make -j8 |
> > +-----------+----------+--------+---------+-------------+----------------+
> > | sched_smt | sched_mc | %Power | Time | % Package 0 | % Package 1 |
> > | | | | | idle | idle |
> > +-----------+----------+--------+---------+-------------+----------------+
> > | | | | |Core0: 18.17 |Core4: 33.38 |
> > | | | | +-------------+----------------+
> > | | | | |Core1: 34.62 |Core5: 19.58 |
> > | 0 | 0 | 100 | 63.82 +-------------+----------------+
> > | | | | |Core2: 31.99 |Core6: 32.35 |
> > | | | | +-------------+----------------+
> > | | | | |Core3: 34.59 |Core7: 29.99 |
> > +-----------+----------+--------+---------+-------------+----------------+
>
> > +-----------+----------+--------+---------+-------------+----------------+
> > | | | | |Core0: 16.65 |Core4: 79.04 |
> > | | | | +-------------+----------------+
> > | | | | |Core1: 26.74 |Core5: 50.98 |
> > | 2 | 2 | 89.58 | 82.83 +-------------+----------------+
> > | | | | |Core2: 30.42 |Core6: 81.33 |
> > | | | | +-------------+----------------+
> > | | | | |Core3: 35.57 |Core7: 90.03 |
> > +-----------+----------+--------+---------+-------------+----------------+
>
> So while we take longer (~20s) we save about 10% in power?

Yes that is correct. Since we are consolidating on sibling threads the
performance goes down. Also this degradation is very much workload
dependent. If the workloads can benefit a lot from sibling threads,
then we will be able to save power with modest performance
degradation.

This tunable is mainly focusing on power savings. If performance
improves, then it is a bonus :)

> It would be good to mention something about how power usage is measured.

Power usage is measured by computing the energy consumed over the
benchmark duration and then finding average power by dividing
energy/time. The relative power consumption is for the entire system.

> Furthermore, do we really need those separate mc/smt power savings
> settings? -- It appears to me we ought to consolidate some of that and
> provide a single knob to save power.

Yes, having one sched_power_savings will definitely help. However,
mapping the various combination of settings to a single knob that will
provide consistent behavior across workloads and system configuration
is a challenge.

> > ---
> >
> > Gautham R Shenoy (3):
> > sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
> > sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
> > sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power()
>
> Acked-by: Peter Zijlstra <[email protected]>
>
> A few nits on patch #2, please follow up with incremental cleanups.

Thanks for the review comments and ack.

--Vaidy

2009-03-03 15:29:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework

On Tue, 2009-03-03 at 20:55 +0530, Vaidyanathan Srinivasan wrote:
>
> > It would be good to mention something about how power usage is measured.
>
> Power usage is measured by computing the energy consumed over the
> benchmark duration and then finding average power by dividing
> energy/time. The relative power consumption is for the entire system.

Is this measured using an external (indepedent) power monitor, or
looking at system internal power usage stats?

Have you compared these two methods, if so, do they give comparable
results?

> > Furthermore, do we really need those separate mc/smt power savings
> > settings? -- It appears to me we ought to consolidate some of that and
> > provide a single knob to save power.
>
> Yes, having one sched_power_savings will definitely help. However,
> mapping the various combination of settings to a single knob that will
> provide consistent behavior across workloads and system configuration
> is a challenge.

Would it be an option to provide a single knob for users and have some
fine grained feature set for developing/debugging, much like we have
sched_features?

2009-03-03 15:44:14

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.

* Ingo Molnar <[email protected]> [2009-03-03 14:59:07]:

>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:
> > > +/* Assign the sched-domain level which can nominate preferred wake-up cpu */
> > > + rd->sched_mc_preferred_wakeup_cpu = UINT_MAX;
> > > + rd->authorized_nomination_level = SD_LV_NONE;
> > > +
> > > + if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> > > + struct sched_domain *sd;
> > > + enum sched_domain_level authorized_nomination_level =
> > > + SD_LV_NONE;
> > > +
> > > + for_each_domain(first_cpu(*cpu_map), sd) {
> > > + if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
> > > + continue;
> > > + authorized_nomination_level = sd->level;
> > > + }
> > > +
> > > + rd->authorized_nomination_level = authorized_nomination_level;
> > > + }
> >
> > Very odd looking comments there, and that enum init wrapping looks
> > weird. Either exceed 80 chars, or write it in a second line like:
> >
> > enum sched_domain_level authorized_nomination_level;
> >
> > authorized_nomination_level = SD_LV_NONE;
>
> I think find_busiest_group() needs to be split up into several
> helper functions first, before we add more to it. I.e. first a
> couple of cleanup patches that factor it out into 3-4 helper
> functions plus a really easy-to-read find_busiest_group() main
> function.

Hi Ingo,

I had earlier posted cleanup for find_busiest_group(), but Peter
mentioned that he is changing the cpu_power infrastructure and I can
rework the cleanup after that.

[RFC PATCH v1 0/5] Modular find_busiest_group()
V1: http://lkml.org/lkml/2008/9/24/201
V2: http://lkml.org/lkml/2008/10/9/176

Peter did post an RFD for his new proposal:
http://lkml.org/lkml/2008/10/14/148

I think we can revive the discussion thread and start cleaning up.

> Then can we do the above change - and i bet we'll win at least
> one indentation level as well so that weird line break goes
> away.

The above code is in __build_sched_domains() and I think we can
shorten the variable names and make the code look better. Indentation
level here is not as bad as in find_busiest_group().

Thanks,
Vaidy

2009-03-03 15:48:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.

On Tue, 2009-03-03 at 21:15 +0530, Vaidyanathan Srinivasan wrote:

> I had earlier posted cleanup for find_busiest_group(), but Peter
> mentioned that he is changing the cpu_power infrastructure and I can
> rework the cleanup after that.

Right, I need to get back to pondering that problem. The whole 1/x thing
doesn't really work.

2009-03-03 16:26:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.


* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2009-03-03 at 21:15 +0530, Vaidyanathan Srinivasan wrote:
>
> > I had earlier posted cleanup for find_busiest_group(), but
> > Peter mentioned that he is changing the cpu_power
> > infrastructure and I can rework the cleanup after that.
>
> Right, I need to get back to pondering that problem. The whole
> 1/x thing doesn't really work.

well, the cleanups are needed in any case before we add more
changes to it - that function is a monster.

Ingo

2009-03-04 11:05:25

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework

* Peter Zijlstra <[email protected]> [2009-03-03 16:28:30]:

> On Tue, 2009-03-03 at 20:55 +0530, Vaidyanathan Srinivasan wrote:
> >
> > > It would be good to mention something about how power usage is measured.
> >
> > Power usage is measured by computing the energy consumed over the
> > benchmark duration and then finding average power by dividing
> > energy/time. The relative power consumption is for the entire system.
>
> Is this measured using an external (indepedent) power monitor, or
> looking at system internal power usage stats?
>
> Have you compared these two methods, if so, do they give comparable
> results?

Since we are doing relative power comparison, the absolute power value
and the method would not affect the results.

Your question is more relevant for SPECPower type of benchmarks where
the score depends on the average power consumed and accuracy or method
of measurement can affect the score.

> > > Furthermore, do we really need those separate mc/smt power savings
> > > settings? -- It appears to me we ought to consolidate some of that and
> > > provide a single knob to save power.
> >
> > Yes, having one sched_power_savings will definitely help. However,
> > mapping the various combination of settings to a single knob that will
> > provide consistent behavior across workloads and system configuration
> > is a challenge.
>
> Would it be an option to provide a single knob for users and have some
> fine grained feature set for developing/debugging, much like we have
> sched_features?

Yes, that will be very good. Let me explore if sched_features can
include the power saving attributes while the sched_power_savings can
be a simple and coarse control for end-users.

--Vaidy

Subject: Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework

On Wed, Mar 04, 2009 at 04:36:43PM +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2009-03-03 16:28:30]:
>
> > On Tue, 2009-03-03 at 20:55 +0530, Vaidyanathan Srinivasan wrote:
> > >
> > > > It would be good to mention something about how power usage is measured.
> > >
> > > Power usage is measured by computing the energy consumed over the
> > > benchmark duration and then finding average power by dividing
> > > energy/time. The relative power consumption is for the entire system.
> >
> > Is this measured using an external (indepedent) power monitor, or
> > looking at system internal power usage stats?
> >
> > Have you compared these two methods, if so, do they give comparable
> > results?
>
> Since we are doing relative power comparison, the absolute power value
> and the method would not affect the results.
>
> Your question is more relevant for SPECPower type of benchmarks where
> the score depends on the average power consumed and accuracy or method
> of measurement can affect the score.
>
> > > > Furthermore, do we really need those separate mc/smt power savings
> > > > settings? -- It appears to me we ought to consolidate some of that and
> > > > provide a single knob to save power.
> > >
> > > Yes, having one sched_power_savings will definitely help. However,
> > > mapping the various combination of settings to a single knob that will
> > > provide consistent behavior across workloads and system configuration
> > > is a challenge.
> >
> > Would it be an option to provide a single knob for users and have some
> > fine grained feature set for developing/debugging, much like we have
> > sched_features?
>
> Yes, that will be very good. Let me explore if sched_features can
> include the power saving attributes while the sched_power_savings can
> be a simple and coarse control for end-users.

Currently each of sched_smt_power_savings and sched_mc_power_savings
tunables can take the values (0, 1, 2).

However, based on the combinations of
(sched_smt_power_savings, sched_mc_power_savings) values,
the system can be in one of the 5 (and not 9) states mentioned below:
[in the increasing order of aggressiveness of the task consolidation]

0. { (0, 0) }
1. { (0, 1) }
2. { (0, 2) }
3. { (1, 0), (1, 1) }
4. { (1, 2), (2, 0), (2, 1), (2, 2)}

4 and 5 are because sched_smt_power_savings implies sched_mc_power_savings.

So, on a system that has multiple cores and multiple threads, the single
knob can take values (0, 1, 2, 3, 4).

On systems that have only multiple cores, the knob can take values
(0, 1, 2).

> --Vaidy

--
Thanks and Regards
gautham