2008-11-10 18:33:17

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n

Hi,

The existing power saving load balancer CONFIG_SCHED_MC attempts to run
the system's workload on a minimum number of CPU packages and tries
to keep the rest of the CPU packages idle for longer durations. Thus,
consolidating workloads onto fewer packages helps the other packages stay
in the idle state and save power. The current implementation is very
conservative and does not work effectively across different workloads.
The initial idea of a tunable sched_mc_power_savings=n was proposed to
enable tuning of the power saving load balancer based on the system
configuration, workload characteristics and end user requirements.

The power savings and performance of a given workload in an
underutilised system can be controlled by writing 0, 1 or 2 to
/sys/devices/system/cpu/sched_mc_power_savings, with 0 selecting highest
performance (least power savings) and level 2 selecting maximum
power savings, even at the cost of slight performance degradation.
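
For example, selecting level 2 from user space is a single write to the
sysfs file. A minimal C equivalent (illustration only, not part of the
patch set):

/* Select maximum power savings; same effect as
 * echo 2 > /sys/devices/system/cpu/sched_mc_power_savings
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/devices/system/cpu/sched_mc_power_savings", "w");

	if (!f) {
		perror("sched_mc_power_savings");
		return 1;
	}
	fprintf(f, "2\n");	/* 0 = performance, 1 = basic, 2 = aggressive */
	fclose(f);
	return 0;
}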

Please refer to the following discussions and article for details.

[1]Making power policy just work
http://lwn.net/Articles/287924/

[2][RFC v1] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/287882/

[3][RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/297306/

The following series of patches demonstrates the basic framework for a
tunable sched_mc_power_savings.

This version of the patch incorporates comments and feedback received
on the previous post. Thanks to Peter Zijlstra for the review and
comments.

Changes from v2:
----------------

* Fixed a locking order issue in active-balance-new-idle
* Moved the wakeup biasing code to the wake_idle() function and preserved
the wake_affine() behaviour. The previous version would break wake_affine
in order to aggressively consolidate tasks
* Removed the sched_mc_preferred_wakeup_cpu global variable, moved it to
doms_cur/dattr_cur, and added a per_cpu pointer to the appropriate
storage in the partitioned sched domain. This change is needed to
preserve functionality in the case of partitioned sched domains
* Patches are against the 2.6.28-rc3 kernel

Notes:
------

* The patches have been tested on x86 with basic cpusets and kernbench.
Correct functionality in the case of partitioned sched domains still
needs to be analysed.

Results:
--------

Basic functionality of the code has not changed and the power vs
performance benefits for kernbench are similar to the ones posted
earlier.

KERNBENCH runs: make -j4 on an x86 8-core system (dual socket,
quad-core CPU packages)

SchedMC    Run Time    Package Idle       Energy     Power
0          81.61s      53.07%  52.81%     1.00x J    1.00y W
1          81.52s      40.83%  65.45%     0.96x J    0.96y W
2          74.66s      22.20%  83.94%     0.90x J    0.98y W

*** This is RFC code and not for inclusion ***

Please feel free to test, and let me know your comments and feedback.

Thanks,
Vaidy

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

---

Gautham R Shenoy (1):
sched: Framework for sched_mc/smt_power_savings=N

Vaidyanathan Srinivasan (4):
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance


include/linux/sched.h | 12 +++++++
kernel/sched.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 17 +++++++++
3 files changed, 112 insertions(+), 6 deletions(-)

--


2008-11-10 18:29:44

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v3 2/5] sched: favour lower logical cpu number for sched_mc balance

When two groups have identical load, prefer to move load to the lower
logical cpu number rather than, as the present logic does, to the higher
logical number.

find_busiest_group() looks for a group_leader that has spare capacity
to take more tasks and for an appropriate least loaded group to free up.
When there is a tie and the load is equal, the group with the higher
logical number is currently favoured. This conflicts with the user space
irqbalance daemon, which moves interrupts to lower logical numbers when
system utilisation is very low.
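
As a toy illustration of the flipped tie-break (plain user-space C; the
helper and group layout are made up for this example, the kernel change
is in the hunk below):

/* Among candidate leader groups with equal nr_running, prefer the group
 * whose first logical cpu number is lower, matching irqbalance.
 */
#include <stdio.h>

struct group {
	int first_cpu;		/* lowest logical cpu in the group */
	int nr_running;		/* tasks currently running in the group */
};

static const struct group *pick_leader(const struct group *cur,
				       const struct group *cand)
{
	if (cand->nr_running > cur->nr_running)
		return cand;
	if (cand->nr_running == cur->nr_running &&
	    cand->first_cpu < cur->first_cpu)
		return cand;	/* tie: favour the lower logical cpu */
	return cur;
}

int main(void)
{
	struct group g0 = { .first_cpu = 0, .nr_running = 2 };
	struct group g1 = { .first_cpu = 4, .nr_running = 2 };

	/* prints 0: with equal load the lower-numbered group wins */
	printf("leader group starts at cpu %d\n",
	       pick_leader(&g1, &g0)->first_cpu);
	return 0;
}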

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0dd6cf6..d910496 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3251,7 +3251,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
*/
if ((sum_nr_running < min_nr_running) ||
(sum_nr_running == min_nr_running &&
- first_cpu(group->cpumask) <
+ first_cpu(group->cpumask) >
first_cpu(group_min->cpumask))) {
group_min = group;
min_nr_running = sum_nr_running;
@@ -3267,7 +3267,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (sum_nr_running <= group_capacity - 1) {
if (sum_nr_running > leader_nr_running ||
(sum_nr_running == leader_nr_running &&
- first_cpu(group->cpumask) >
+ first_cpu(group->cpumask) <
first_cpu(group_leader->cpumask))) {
group_leader = group;
leader_nr_running = sum_nr_running;

2008-11-10 18:29:58

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

When system utilisation is low and more cpus are idle,
a process waking up from sleep should prefer to
wake up an idle cpu in a semi-idle cpu package (multi-core
package) rather than in a completely idle cpu package, which
would waste power.

Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.

This info can be stored in the appropriate sched_domain, but
updating it in all copies of the sched_domain is not
practical. For now, let's try a per-cpu variable
pointing to common storage in the partitioned sched domain
attribute. A global variable may not work in the partitioned
sched domain case.
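
The scheme, as a user-space analogue (illustration only; four cpus in
one partition assumed):

/* All per-cpu slots of a partition alias one storage word, so the
 * nomination is a single store in find_busiest_group(), yet every cpu
 * can read it through its own slot in wake_idle().
 */
#include <stdio.h>

#define NR_CPUS 4

static unsigned int partition_preferred_wakeup_cpu;	/* one per partition */
static unsigned int *pcpu_slot[NR_CPUS];	/* stands in for the per-cpu var */

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)	/* __build_sched_domains() */
		pcpu_slot[cpu] = &partition_preferred_wakeup_cpu;

	*pcpu_slot[2] = 1;	/* one store, as in find_busiest_group() */

	for (cpu = 0; cpu < NR_CPUS; cpu++)	/* every cpu sees the update */
		printf("cpu %d: preferred wakeup cpu %u\n", cpu, *pcpu_slot[cpu]);
	return 0;
}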

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 1 +
kernel/sched.c | 34 +++++++++++++++++++++++++++++++++-
2 files changed, 34 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 715028a..8363d02 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -810,6 +810,7 @@ enum sched_domain_level {

struct sched_domain_attr {
int relax_domain_level;
+ unsigned int preferred_wakeup_cpu;
};

#define SD_ATTR_INIT (struct sched_domain_attr) { \
diff --git a/kernel/sched.c b/kernel/sched.c
index d910496..16c5e1f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1612,6 +1612,21 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif

+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+
+/*
+ * Preferred wake up cpu nominated by sched_mc balance that will be used when
+ * most cpus are idle in the system indicating overall very low system
+ * utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP (2).
+ */
+
+DEFINE_PER_CPU(unsigned int *, sched_mc_preferred_wakeup_cpu);
+
+/* Default storage allocation for non-partitioned sched domains */
+unsigned int fallback_preferred_wakeup_cpu;
+
+#endif
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -3078,6 +3093,7 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
return 0;
}

+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
@@ -3394,6 +3410,10 @@ out_balanced:

if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
+ *per_cpu(sched_mc_preferred_wakeup_cpu,
+ smp_processor_id()) =
+ first_cpu(group_leader->cpumask);
return group_min;
}
#endif
@@ -7372,7 +7392,7 @@ static void set_domain_attribute(struct sched_domain *sd,
static int __build_sched_domains(const cpumask_t *cpu_map,
struct sched_domain_attr *attr)
{
- int i;
+ int i, cpu;
struct root_domain *rd;
SCHED_CPUMASK_DECLARE(allmasks);
cpumask_t *tmpmask;
@@ -7472,6 +7492,18 @@ static int __build_sched_domains(const cpumask_t *cpu_map,
sd->parent = p;
p->child = sd;
cpu_to_core_group(i, cpu_map, &sd->groups, tmpmask);
+ /* Set the preferred wake up CPU */
+ if (attr) {
+ for_each_cpu_mask_nr(cpu, sd->span) {
+ per_cpu(sched_mc_preferred_wakeup_cpu, cpu) =
+ &attr->preferred_wakeup_cpu;
+ }
+ } else {
+ for_each_cpu_mask_nr(cpu, sd->span) {
+ per_cpu(sched_mc_preferred_wakeup_cpu, cpu) =
+ &fallback_preferred_wakeup_cpu;
+ }
+ }
#endif

#ifdef CONFIG_SCHED_SMT

2008-11-10 18:33:30

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v3 1/5] sched: Framework for sched_mc/smt_power_savings=N

From: Gautham R Shenoy <[email protected]>

*** RFC patch of work in progress and not for inclusion. ***

Currently the sched_mc/smt_power_savings variable is a boolean, which either
enables or disables topology based power savings. This patch extends the
behaviour of the variable from boolean to multivalued, such that, based on
the value, we decide how aggressively we want to perform topology based
power savings balancing.

Variable levels of the power savings tunable would help the end user match
the required level of the power savings vs performance trade-off, depending
on the system configuration and workloads.

This initial version extends the sched_mc_power_savings global variable to
take more values (0, 1, 2).

A later version is expected to add a new member, sd->powersavings_level, at
the multi-core CPU level sched_domain. This will turn every sd->flags check
for SD_POWERSAVINGS_BALANCE into a different macro that checks
powersavings_level.

The power savings level setting should live in one place: either in the
sched_mc_power_savings global variable or within the appropriate
sched_domain structure.

Signed-off-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 11 +++++++++++
kernel/sched.c | 16 +++++++++++++---
2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b483f39..715028a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -759,6 +759,17 @@ enum cpu_idle_type {
#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */

+enum powersavings_balance_level {
+ POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
+ POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
+ * first for long running threads
+ */
+ POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
+ * cpu package for power savings
+ */
+ MAX_POWERSAVINGS_BALANCE_LEVELS
+};
+
#define BALANCE_FOR_MC_POWER \
(sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)

diff --git a/kernel/sched.c b/kernel/sched.c
index e8819bc..0dd6cf6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7859,14 +7859,24 @@ int arch_reinit_sched_domains(void)
static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
{
int ret;
+ unsigned int level = 0;

- if (buf[0] != '0' && buf[0] != '1')
+ sscanf(buf, "%u", &level);
+
+ /*
+ * level is always positive, so don't check for
+ * level < POWERSAVINGS_BALANCE_NONE, which is 0.
+ * What happens on a 0 or 1 byte write? Do we
+ * need to check count as well?
+ */
+
+ if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
return -EINVAL;

if (smt)
- sched_smt_power_savings = (buf[0] == '1');
+ sched_smt_power_savings = level;
else
- sched_mc_power_savings = (buf[0] == '1');
+ sched_mc_power_savings = level;

ret = arch_reinit_sched_domains();
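
One possible answer to the question raised in the comment above, as a
sketch against this function (not part of the patch): check the return
value of sscanf() so that an empty or non-numeric write fails cleanly
instead of being parsed as level 0.

	unsigned int level;

	/* sscanf() returns the number of successful conversions, so a
	 * 0/1-byte or non-numeric buffer is rejected here rather than
	 * silently leaving level = 0 */
	if (sscanf(buf, "%u", &level) != 1)
		return -EINVAL;

	if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
		return -EINVAL;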

2008-11-10 18:50:27

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v3 4/5] sched: bias task wakeups to preferred semi-idle packages

A preferred wakeup cpu (from a semi-idle package) has been
nominated in find_busiest_group() in the previous patch. Use
this information, via sched_mc_preferred_wakeup_cpu, in
wake_idle() to bias task wakeups if the following conditions
are satisfied:
- The present cpu that is trying to wake up the process is
idle and waking the target process on this cpu would
potentially wake up a completely idle package
- The previous cpu on which the target process ran is
also idle, and hence selecting the previous cpu may
wake up a semi-idle cpu package
- The task being woken up is allowed to run on the
nominated cpu (cpu affinity and restrictions)

Basically, if both the current cpu and the previous cpu on
which the task ran are idle, select the nominated cpu from a
semi-idle cpu package for running the new task that is waking up.

Cache hotness is considered, since the actual biasing happens
in wake_idle() only if the application is cache cold.

This technique will effectively consolidate short running bursty
jobs in a mostly idle system.

Wakeup biasing for power savings gets automatically disabled as
system utilisation increases, because the probability
of finding both this_cpu and prev_cpu idle decreases.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched_fair.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ce514af..ad5269a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1026,6 +1026,23 @@ static int wake_idle(int cpu, struct task_struct *p)
cpumask_t tmp;
struct sched_domain *sd;
int i;
+ int this_cpu;
+ unsigned int *chosen_wakeup_cpu;
+
+ /*
+ * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+ * are idle and this is not a kernel thread and this task's affinity
+ * allows it to be moved to preferred cpu, then just move!
+ */
+
+ this_cpu = smp_processor_id();
+ chosen_wakeup_cpu = per_cpu(sched_mc_preferred_wakeup_cpu, this_cpu);
+
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+ chosen_wakeup_cpu &&
+ idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
+ cpu_isset(*chosen_wakeup_cpu, p->cpus_allowed))
+ return *chosen_wakeup_cpu;

/*
* If it is idle, then it is the best cpu to run this task.

2008-11-10 18:50:47

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n


a quick response, I'll read them more carefully tomorrow:

- why are the preferred cpu things pointers? afaict using just the cpu
number is both smaller and clearer to the reader.

- in patch 5/5 you do:

+ spin_unlock(&this_rq->lock);
+ double_rq_lock(this_rq, busiest);

we call that double_lock_balance()

- comments go like:

/*
* this is a multi-
* line comment
*/


2008-11-10 18:53:30

by Vaidyanathan Srinivasan

Subject: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

Active load balancing is a process by which migration thread
is woken up on the target CPU in order to pull current
running task on another package into this newly idle
package.

This method is already in use with normal load_balance(),
this patch introduces this method to new idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.

This logic provides effective consolidation of short running
daemon jobs in a almost idle system

The side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised. May need to adjust the
iterations before triggering.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 35 +++++++++++++++++++++++++++++++++++
1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 16c5e1f..4d99509 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3687,10 +3687,45 @@ redo:
}

if (!ld_moved) {
+ int active_balance = 0;
+ unsigned long flags;
+
schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;
+
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ return -1;
+
+ if (sd->nr_balance_failed++ < 2)
+ return -1;
+
+ /* Release this_rq lock and take in correct order */
+ spin_unlock(&this_rq->lock);
+ double_rq_lock(this_rq, busiest);
+
+ /* don't kick the migration_thread, if the curr
+ * task on busiest cpu can't be moved to this_cpu
+ */
+ if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+ double_rq_unlock(this_rq, busiest);
+ spin_lock(&this_rq->lock);
+ all_pinned = 1;
+ return ld_moved;
+ }
+
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->push_cpu = this_cpu;
+ active_balance = 1;
+ }
+
+ double_rq_unlock(this_rq, busiest);
+ if (active_balance)
+ wake_up_process(busiest->migration_thread);
+
+ spin_lock(&this_rq->lock);
} else
sd->nr_balance_failed = 0;

2008-11-11 04:49:20

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n

* Peter Zijlstra <[email protected]> [2008-11-10 19:50:16]:

>
> a quick response, I'll read them more carefully tomorrow:

Hi Peter,

Thanks for the quick review.

>
> - why are the preferred cpu things pointers? afaict using just the cpu
> number is both smaller and clearer to the reader.

I would need each cpu within a partitioned sched domain to point to
the _same_ preferred wakeup cpu. The preferred CPU will be updated in
one place in find_busiest_group() and used by wake_idle().

If I have a per-cpu value, then updating it for each cpu in the
partitioned sched domain will be slow.

The actual number of preferred_wakeup_cpu variables will be equal to the
number of partitions. If there are no partitions in the sched domains,
then all per-cpu pointers will point to the same variable.

> - in patch 5/5 you do:
>
> + spin_unlock(&this_rq->lock);
> + double_rq_lock(this_rq, busiest);
>
> we call that double_lock_balance()

Will fix this. Did not look for such a routine :)
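
For reference, the existing helper looks roughly like this (paraphrased
from 2.6.28-era kernel/sched.c; lockdep annotations omitted):

static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
	int ret = 0;

	if (unlikely(!spin_trylock(&busiest->lock))) {
		if (busiest < this_rq) {
			/* lock ordering: lower-addressed rq first */
			spin_unlock(&this_rq->lock);
			spin_lock(&busiest->lock);
			spin_lock(&this_rq->lock);
			ret = 1;	/* this_rq->lock was dropped meanwhile */
		} else
			spin_lock(&busiest->lock);
	}
	return ret;
}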

> - comments go like:
>
> /*
> * this is a multi-
> * line comment
> */

Will fix this too.

Thanks,
Vaidy

2008-11-11 13:44:07

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
> When system utilisation is low and more cpus are idle,
> a process waking up from sleep should prefer to
> wake up an idle cpu in a semi-idle cpu package (multi-core
> package) rather than in a completely idle cpu package, which
> would waste power.
>
> Use the sched_mc balance logic in find_busiest_group() to
> nominate a preferred wakeup cpu.
>
> This info can be stored in the appropriate sched_domain, but
> updating it in all copies of the sched_domain is not
> practical. For now, let's try a per-cpu variable
> pointing to common storage in the partitioned sched domain
> attribute. A global variable may not work in the partitioned
> sched domain case.

Would it make sense to place the preferred_wakeup_cpu stuff in the
root_domain structure we already have?

2008-11-11 13:47:29

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
> Active load balancing is a process by which the migration thread
> is woken up on the target CPU in order to pull the currently
> running task on another package into this newly idle
> package.
>
> This method is already in use with normal load_balance();
> this patch introduces it to newly idle cpus when
> sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
>
> This logic provides effective consolidation of short running
> daemon jobs in an almost idle system.
>
> A side effect of this patch may be ping-ponging of tasks
> if the system is moderately utilised. The number of failed
> balance iterations before triggering may need adjustment.

OK, I'm so not getting this patch..

if normal newly idle balancing fails that means the other runqueue has
only a single task on it (or some other really stubborn stuff), so then
you go move that one task that is already running, from one cpu to
another.

_why_?

The only answer I can come up with is that you prefer one cpu's
idle-ness over another - which makes sense, as you try to get whole
packages idle.

But I'm not seeing where that package logic is hidden..

2008-11-11 14:04:29

by Gregory Haskins

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

Peter Zijlstra wrote:
> On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
>
>> When system utilisation is low and more cpus are idle,
>> a process waking up from sleep should prefer to
>> wake up an idle cpu in a semi-idle cpu package (multi-core
>> package) rather than in a completely idle cpu package, which
>> would waste power.
>>
>> Use the sched_mc balance logic in find_busiest_group() to
>> nominate a preferred wakeup cpu.
>>
>> This info can be stored in the appropriate sched_domain, but
>> updating it in all copies of the sched_domain is not
>> practical. For now, let's try a per-cpu variable
>> pointing to common storage in the partitioned sched domain
>> attribute. A global variable may not work in the partitioned
>> sched domain case.
>>
>
> Would it make sense to place the preferred_wakeup_cpu stuff in the
> root_domain structure we already have?
>

From the description, this is exactly what the root-domains were created
to solve.

Vaidyanathan, just declare your object in "struct root_domain" and
initialize it in init_rootdomain() in kernel/sched.c, and then access it
via rq->rd to take advantage of this infrastructure. It will
automatically follow any partitioning that happens to be configured.
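
A rough sketch of that suggestion against 2.6.28-era kernel/sched.c (the
preferred_wakeup_cpu member is this patchset's, not mainline's):

struct root_domain {
	atomic_t refcount;
	cpumask_t span;
	cpumask_t online;
	/* ... existing members ... */
	unsigned int preferred_wakeup_cpu;	/* one nomination per partition */
};

static void init_rootdomain(struct root_domain *rd)
{
	memset(rd, 0, sizeof(*rd));
	cpus_clear(rd->span);
	cpus_clear(rd->online);
	rd->preferred_wakeup_cpu = 0;	/* default until first nomination */
	/* ... existing init ... */
}

/* writer, e.g. find_busiest_group():
 *	cpu_rq(this_cpu)->rd->preferred_wakeup_cpu =
 *		first_cpu(group_leader->cpumask);
 * reader, e.g. wake_idle():
 *	unsigned int cpu = cpu_rq(this_cpu)->rd->preferred_wakeup_cpu;
 */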

-Greg




2008-11-11 15:22:19

by Srivatsa Vaddagiri

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

On Tue, Nov 11, 2008 at 09:07:58AM -0500, Gregory Haskins wrote:
> > Would it make sense to place the preferred_wakeup_cpu stuff in the
> > root_domain structure we already have?
> >
>
> From the description, this is exactly what the root-domains were created
> to solve.
>
> Vaidyanathan, just declare your object in "struct root_domain" and
> initialize it in init_rootdomain() in kernel/sched.c, and then access it
> via rq->rd to take advantage of this infrastructure. It will
> automatically follow any partitioning that happens to be configured.

If I understand correctly, we may want to have more than one preferred
cpu in a given sched domain, taking into account node topology, i.e. if a
given sched domain encompasses two nodes, then we may like to designate
2 preferred wakeup_cpu's, one per node. If that is the case, then
root_domain may not be of use here?

- vatsa

2008-11-11 15:26:38

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

On Tue, 2008-11-11 at 20:51 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Nov 11, 2008 at 09:07:58AM -0500, Gregory Haskins wrote:
> > > Would it make sense to place the preferred_wakeup_cpu stuff in the
> > > root_domain structure we already have?
> > >
> >
> > From the description, this is exactly what the root-domains were created
> > to solve.
> >
> > Vaidyanathan, just declare your object in "struct root_domain" and
> > initialize it in init_rootdomain() in kernel/sched.c, and then access it
> > via rq->rd to take advantage of this infrastructure. It will
> > automatically follow any partitioning that happens to be configured.
>
> If I understand correctly, we may want to have more than one preferred
> cpu in a given sched domain, taking into account node topology, i.e. if a
> given sched domain encompasses two nodes, then we may like to designate
> 2 preferred wakeup_cpu's, one per node. If that is the case, then
> root_domain may not be of use here?

Agreed, in which case this sched_domain_attr stuff might work out better
- but I'm not sure I fully get that.. will stare at that a bit more.

2008-11-11 16:51:21

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

* Peter Zijlstra <[email protected]> [2008-11-11 14:43:39]:

> On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
> > When system utilisation is low and more cpus are idle,
> > a process waking up from sleep should prefer to
> > wake up an idle cpu in a semi-idle cpu package (multi-core
> > package) rather than in a completely idle cpu package, which
> > would waste power.
> >
> > Use the sched_mc balance logic in find_busiest_group() to
> > nominate a preferred wakeup cpu.
> >
> > This info can be stored in the appropriate sched_domain, but
> > updating it in all copies of the sched_domain is not
> > practical. For now, let's try a per-cpu variable
> > pointing to common storage in the partitioned sched domain
> > attribute. A global variable may not work in the partitioned
> > sched domain case.
>
> Would it make sense to place the preferred_wakeup_cpu stuff in the
> root_domain structure we already have?

Yep, that will be a good idea. We can get to root_domain from each
CPU's rq and we can get rid of the per-cpu pointers for
preferred_wakeup_cpu as well. I will change the implementation and
re-post.

Thanks,
Vaidy

2008-11-11 16:56:50

by Balbir Singh

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2008-11-11 14:43:39]:
>
>> On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
>>> When system utilisation is low and more cpus are idle,
>>> a process waking up from sleep should prefer to
>>> wake up an idle cpu in a semi-idle cpu package (multi-core
>>> package) rather than in a completely idle cpu package, which
>>> would waste power.
>>>
>>> Use the sched_mc balance logic in find_busiest_group() to
>>> nominate a preferred wakeup cpu.
>>>
>>> This info can be stored in the appropriate sched_domain, but
>>> updating it in all copies of the sched_domain is not
>>> practical. For now, let's try a per-cpu variable
>>> pointing to common storage in the partitioned sched domain
>>> attribute. A global variable may not work in the partitioned
>>> sched domain case.
>> Would it make sense to place the preferred_wakeup_cpu stuff in the
>> root_domain structure we already have?
>
> Yep, that will be a good idea. We can get to root_domain from each
> CPU's rq and we can get rid of the per-cpu pointers for
> preferred_wakeup_cpu as well. I will change the implementation and
> re-post.

Did you see Vatsa's comments? root_domain will not work if you have more than one
preferred_wakeup_cpu per domain.

--
Balbir

2008-11-11 17:08:31

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

* Peter Zijlstra <[email protected]> [2008-11-11 14:47:15]:

> On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
> > Active load balancing is a process by which the migration thread
> > is woken up on the target CPU in order to pull the currently
> > running task on another package into this newly idle
> > package.
> >
> > This method is already in use with normal load_balance();
> > this patch introduces it to newly idle cpus when
> > sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
> >
> > This logic provides effective consolidation of short running
> > daemon jobs in an almost idle system.
> >
> > A side effect of this patch may be ping-ponging of tasks
> > if the system is moderately utilised. The number of failed
> > balance iterations before triggering may need adjustment.
>
> OK, I'm so not getting this patch..
>
> if normal newly idle balancing fails that means the other runqueue has
> only a single task on it (or some other really stubborn stuff), so then
> you go move that one task that is already running, from one cpu to
> another.
>
> _why_?
>
> The only answer I can come up with is that you prefer one cpu's
> idle-ness over another - which makes sense, as you try to get whole
> packages idle.

Your answer is correct. We want to move that one task from a non-idle
cpu to this cpu that is just going to be idle.

This is the same method used to move tasks in load_balance(); I have
extended it to load_balance_newidle() to make the consolidation
faster at sched_mc=2.


> But I'm not seeing where that package logic is hidden..


The package logic comes from find_busiest_group(). If there is no
imbalance, then find_busiest_group() will return NULL. However, when
sched_mc={1,2}, find_busiest_group() will select a group
from which a running task may be pulled to this cpu in order to make
the other package idle. If there is no opportunity to make a package
idle and there is no imbalance, then find_busiest_group() will
return NULL and no action will be taken in load_balance_newidle().

Under normal task pull operation due to imbalance, there will be more
than one task in the source run queue and move_tasks() will succeed.
ld_moved will be true and the active balance code will not be
triggered.

If we enter a scenario where we are moving the only running task from
another cpu, then this should have been suggested by
find_busiest_group's sched_mc balance logic, and thus moving that task
will potentially free up the source package.

Thanks for the careful review.

--Vaidy

2008-11-11 17:14:24

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

* Gregory Haskins <[email protected]> [2008-11-11 09:07:58]:

> Peter Zijlstra wrote:
> > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
> >
> >> When system utilisation is low and more cpus are idle,
> >> a process waking up from sleep should prefer to
> >> wake up an idle cpu in a semi-idle cpu package (multi-core
> >> package) rather than in a completely idle cpu package, which
> >> would waste power.
> >>
> >> Use the sched_mc balance logic in find_busiest_group() to
> >> nominate a preferred wakeup cpu.
> >>
> >> This info can be stored in the appropriate sched_domain, but
> >> updating it in all copies of the sched_domain is not
> >> practical. For now, let's try a per-cpu variable
> >> pointing to common storage in the partitioned sched domain
> >> attribute. A global variable may not work in the partitioned
> >> sched domain case.
> >>
> >
> > Would it make sense to place the preferred_wakeup_cpu stuff in the
> > root_domain structure we already have?
> >
>
> From the description, this is exactly what the root-domains were created
> to solve.
>
> Vaidyanathan, just declare your object in "struct root_domain" and
> initialize it in init_rootdomain() in kernel/sched.c, and then access it
> via rq->rd to take advantage of this infrastructure. It will
> automatically follow any partitioning that happens to be configured.

Yep, I agree. I will use root_domain for this purpose in the next
revision.

Thanks,
Vaidy

2008-11-11 17:18:45

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

* Peter Zijlstra <[email protected]> [2008-11-11 16:26:14]:

> On Tue, 2008-11-11 at 20:51 +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Nov 11, 2008 at 09:07:58AM -0500, Gregory Haskins wrote:
> > > > Would it make sense to place the preferred_wakeup_cpu stuff in the
> > > > root_domain structure we already have?
> > > >
> > >
> > > From the description, this is exactly what the root-domains were created
> > > to solve.
> > >
> > > Vaidyanathan, just declare your object in "struct root_domain" and
> > > initialize it in init_rootdomain() in kernel/sched.c, and then access it
> > > via rq->rd to take advantage of this infrastructure. It will
> > > automatically follow any partitioning that happens to be configured.
> >
> > If I understand correctly, we may want to have more than one preferred
> > cpu in a given sched domain, taking into account node topology, i.e. if a
> > given sched domain encompasses two nodes, then we may like to designate
> > 2 preferred wakeup_cpu's, one per node. If that is the case, then
> > root_domain may not be of use here?
>
> Agreed, in which case this sched_domain_attr stuff might work out better
> - but I'm not sure I fully get that.. will stare at that a bit more.

The current code that I posted assumes one preferred_wakeup_cpu per
partitioned domain. Moving the variable to root_domain is a good idea
for this implementation.

In the future, when we need one preferred_wakeup_cpu per node per
partitioned domain, we will need an array for each partitioned domain.
Having the array in root_domain is better than having it in dattr.

Depending upon experimental results, we may choose to have only one
preferred_wakeup_cpu per partitioned domain. When the system
utilisation is quite low, it is better to move all movable tasks from
each node to a selected node (0). This will free up all CPUs in other
nodes. We just need to consider cache hotness and cross-node
memory access more carefully before crossing a node boundary for
consolidation.
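
The per-node variant could hypothetically take the following shape in
root_domain (shape assumed, not implemented here):

struct root_domain {
	/* ... existing members ... */
	unsigned int preferred_wakeup_cpu[MAX_NUMNODES];	/* one per node */
};

/* readers would index by the waking cpu's node:
 *	rd->preferred_wakeup_cpu[cpu_to_node(this_cpu)]
 */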

--Vaidy

2008-11-11 17:22:19

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

On Tue, 2008-11-11 at 22:34 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2008-11-11 14:47:15]:
>
> > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
> > > Active load balancing is a process by which the migration thread
> > > is woken up on the target CPU in order to pull the currently
> > > running task on another package into this newly idle
> > > package.
> > >
> > > This method is already in use with normal load_balance();
> > > this patch introduces it to newly idle cpus when
> > > sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
> > >
> > > This logic provides effective consolidation of short running
> > > daemon jobs in an almost idle system.
> > >
> > > A side effect of this patch may be ping-ponging of tasks
> > > if the system is moderately utilised. The number of failed
> > > balance iterations before triggering may need adjustment.
> >
> > OK, I'm so not getting this patch..
> >
> > if normal newly idle balancing fails that means the other runqueue has
> > only a single task on it (or some other really stubborn stuff), so then
> > you go move that one task that is already running, from one cpu to
> > another.
> >
> > _why_?
> >
> > The only answer I can come up with is that you prefer one cpu's
> > idle-ness over another - which makes sense, as you try to get whole
> > packages idle.
>
> Your answer is correct. We want to move that one task from a non-idle
> cpu to this cpu that is just going to be idle.
>
> This is the same method used to move tasks in load_balance(); I have
> extended it to load_balance_newidle() to make the consolidation
> faster at sched_mc=2.
>
>
> > But I'm not seeing where that package logic is hidden..
>
>
> The package logic comes from find_busiest_group(). If there is no
> imbalance, then find_busiest_group() will return NULL. However, when
> sched_mc={1,2}, find_busiest_group() will select a group
> from which a running task may be pulled to this cpu in order to make
> the other package idle. If there is no opportunity to make a package
> idle and there is no imbalance, then find_busiest_group() will
> return NULL and no action will be taken in load_balance_newidle().
>
> Under normal task pull operation due to imbalance, there will be more
> than one task in the source run queue and move_tasks() will succeed.
> ld_moved will be true and the active balance code will not be
> triggered.
>
> If we enter a scenario where we are moving the only running task from
> another cpu, then this should have been suggested by
> find_busiest_group's sched_mc balance logic, and thus moving that task
> will potentially free up the source package.
>
> Thanks for the careful review.

Ah, right, thanks!

Could you clarify this by adding a comment to this effect right before
the added code?

2008-11-11 17:24:25

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

* Balbir Singh <[email protected]> [2008-11-11 22:19:46]:

> Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <[email protected]> [2008-11-11 14:43:39]:
> >
> >> On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
> >>> When system utilisation is low and more cpus are idle,
> >>> a process waking up from sleep should prefer to
> >>> wake up an idle cpu in a semi-idle cpu package (multi-core
> >>> package) rather than in a completely idle cpu package, which
> >>> would waste power.
> >>>
> >>> Use the sched_mc balance logic in find_busiest_group() to
> >>> nominate a preferred wakeup cpu.
> >>>
> >>> This info can be stored in the appropriate sched_domain, but
> >>> updating it in all copies of the sched_domain is not
> >>> practical. For now, let's try a per-cpu variable
> >>> pointing to common storage in the partitioned sched domain
> >>> attribute. A global variable may not work in the partitioned
> >>> sched domain case.
> >> Would it make sense to place the preferred_wakeup_cpu stuff in the
> >> root_domain structure we already have?
> >
> > Yep, that will be a good idea. We can get to root_domain from each
> > CPU's rq and we can get rid of the per-cpu pointers for
> > preferred_wakeup_cpu as well. I will change the implementation and
> > re-post.
>
> Did you see Vatsa's comments? root_domain will not work if you have more than one
> preferred_wakeup_cpu per domain.

Hi Balbir,

I just saw Vatsa's comments. We have a similar limitation with the
current implementation as well: sched_domain_attr (dattr) is also per
partitioned domain and not per NUMA node.

In the current implementation we can get rid of the per-cpu variables
and use root_domain. Later we can have an array in root_domain and
index it based on the cpu's node.

Thanks,
Vaidy

2008-11-11 17:28:23

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

* Peter Zijlstra <[email protected]> [2008-11-11 18:21:50]:

> On Tue, 2008-11-11 at 22:34 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <[email protected]> [2008-11-11 14:47:15]:
> >
> > > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
> > > > Active load balancing is a process by which the migration thread
> > > > is woken up on the target CPU in order to pull the currently
> > > > running task on another package into this newly idle
> > > > package.
> > > >
> > > > This method is already in use with normal load_balance();
> > > > this patch introduces it to newly idle cpus when
> > > > sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
> > > >
> > > > This logic provides effective consolidation of short running
> > > > daemon jobs in an almost idle system.
> > > >
> > > > A side effect of this patch may be ping-ponging of tasks
> > > > if the system is moderately utilised. The number of failed
> > > > balance iterations before triggering may need adjustment.
> > >
> > > OK, I'm so not getting this patch..
> > >
> > > if normal newly idle balancing fails that means the other runqueue has
> > > only a single task on it (or some other really stubborn stuff), so then
> > > you go move that one task that is already running, from one cpu to
> > > another.
> > >
> > > _why_?
> > >
> > > The only answer I can come up with is that you prefer one cpu's
> > > idle-ness over another - which makes sense, as you try to get whole
> > > packages idle.
> >
> > Your answer is correct. We want to move that one task from a non-idle
> > cpu to this cpu that is just going to be idle.
> >
> > This is the same method used to move tasks in load_balance(); I have
> > extended it to load_balance_newidle() to make the consolidation
> > faster at sched_mc=2.
> >
> >
> > > But I'm not seeing where that package logic is hidden..
> >
> >
> > The package logic comes from find_busiest_group(). If there is no
> > imbalance, then find_busiest_group() will return NULL. However, when
> > sched_mc={1,2}, find_busiest_group() will select a group
> > from which a running task may be pulled to this cpu in order to make
> > the other package idle. If there is no opportunity to make a package
> > idle and there is no imbalance, then find_busiest_group() will
> > return NULL and no action will be taken in load_balance_newidle().
> >
> > Under normal task pull operation due to imbalance, there will be more
> > than one task in the source run queue and move_tasks() will succeed.
> > ld_moved will be true and the active balance code will not be
> > triggered.
> >
> > If we enter a scenario where we are moving the only running task from
> > another cpu, then this should have been suggested by
> > find_busiest_group's sched_mc balance logic, and thus moving that task
> > will potentially free up the source package.
> >
> > Thanks for the careful review.
>
> Ah, right, thanks!
>
> Could you clarify this by adding a comment to this effect right before
> the added code?

Sure. Will add detailed comments.
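
For instance, the comment above the active-balance block in
load_balance_newidle() could read along these lines (wording assumed):

/*
 * load_balance_newidle() found no load to pull, which normally means
 * the busiest runqueue is running a single task. At sched_mc=2,
 * find_busiest_group() can still return such a queue when pulling its
 * one running task would let an entire package go idle. Kick the
 * migration thread on the busiest cpu to actively push that task over
 * to this soon-to-be-idle cpu, consolidating load on fewer packages.
 */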

--Vaidy