Subject: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB

Hi,

This is the second iteration of the patchset which aims at improving
the idle-load balancer nomination logic, by taking the system topology
into consideration.

Changes from v1 (found here: http://lkml.org/lkml/2009/4/2/246)
o Fixed the kernel-doc style comments.
o Renamed a variable to better reflect it's usage.

Background
======================================
An idle-load balancer is an idle-cpu which does not turn off it's sched_ticks
and performs load-balancing on behalf of the other idle CPUs. Currently,
this idle load balancer is nominated as the first_cpu(nohz.cpu_mask)

The drawback of the current method is that the CPU numbering in the
cores/packages need not necessarily be sequential. For example, on a
two-socket, Quad core system, the CPU numbering can be as follows:

|-------------------------------| |-------------------------------|
| | | | | |
| 0 | 2 | | 1 | 3 |
|-------------------------------| |-------------------------------|
| | | | | |
| 4 | 6 | | 5 | 7 |
|-------------------------------| |-------------------------------|

Now, the other power-savings settings such as the sched_mc/smt_power_savings
and the power-aware IRQ balancer try to balance tasks/IRQs by taking
the system topology into consideration, with the intention of keeping
as many "power-domains" (cores/packages) in the low-power state.

The current idle-load-balancer nomination does not necessarily align towards
this policy. For eg, we could be having tasks and interrupts largely running
on the first package with the intention of keeping the second package idle.
Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
which in-turn becomes nominated as the idle-load balancer. CPU1 being from
the 2nd package, would in turn prevent the 2nd package from going into a
deeper sleep state.

Instead the role of the idle-load balancer could have been assumed by an
idle CPU from the first package, thereby helping the second package go
completely idle.

This patchset has been tested with 2.6.30-rc1 on a Two-Socket
Quad core system with the topology as mentioned above.

|----------------------------------------------------------------------------|
| With Patchset + sched_mc_power_savings = 1 |
|----------------------------------------------------------------------------|
|make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
| | | on Package 0 | on Package 1 |
|----------------------------------------------------------------------------|
|taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
| | 227.234s | 56969 | 57080 | 1003 | 588 |
| | |----------------------------------------------|
| | | CPU4 | CPU6 | CPU5 | CPU7 |
| | | 55995 | 703 | 583 | 600 |
|----------------------------------------------------------------------------|
|taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
| | 227.136s | 1109 | 611 | 57074 | 57091 |
| | |----------------------------------------------|
| | | CPU4 | CPU6 | CPU5 | CPU7 |
| | | 709 | 637 | 56133 | 587 |
|----------------------------------------------------------------------------|

We see here that the idle load balancer is chosen from the package which is
busy. In the first case, it's CPU4 and in the second case it's CPU5.

|----------------------------------------------------------------------------|
| With Patchset + sched_mc_power_savings = 1 |
|----------------------------------------------------------------------------|
|make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
| | | on Package 0 | on Package 1 |
|----------------------------------------------------------------------------|
|taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
| | 228.786s | 59094 | 61994 | 13984 | 43652 |
| | |----------------------------------------------|
| | | CPU4 | CPU6 | CPU5 | CPU7 |
| | | 1827 | 734 | 748 | 760 |
|----------------------------------------------------------------------------|
|taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
| | 228.435s | 57013 | 876 | 58596 | 61633 |
| | |----------------------------------------------|
| | | CPU4 | CPU6 | CPU5 | CPU7 |
| | | 772 | 1133 | 850 | 910 |
|----------------------------------------------------------------------------|

Here, we see that the idle load balancer is chosen from the other package,
despite choosing sched_mc_power_savings = 1. In the first case, we have
CPU1 and CPU3 sharing the responsibility among themselves. In the second case,
it's CPU0 and CPU6, which assume that role.

---

Gautham R Shenoy (2):
sched: Nominate a power-efficient ilb in select_nohz_balancer()
sched: Nominate idle load balancer from a semi-idle package.


kernel/sched.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 135 insertions(+), 10 deletions(-)

--
Thanks and Regards
gautham.


Subject: [RFC PATCH 2 2/2] sched: Nominate a power-efficient ilb in select_nohz_balancer()

The CPU that first goes idle becomes the idle-load-balancer and remains
that until either it picks up a task or till all the CPUs of the system
goes idle.

Optimize this further to allow it to relinquish it's post
once all it's siblings in the power-aware sched_domain go idle, thereby
allowing the whole package-core to go idle. While relinquising the post,
nominate another an idle-load balancer from a semi-idle core/package.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d702be7..bbf367d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4402,8 +4402,24 @@ int select_nohz_load_balancer(int stop_tick)
/* make me the ilb owner */
if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
return 1;
- } else if (atomic_read(&nohz.load_balancer) == cpu)
+ } else if (atomic_read(&nohz.load_balancer) == cpu) {
+ int new_ilb;
+
+ if (!(sched_smt_power_savings ||
+ sched_mc_power_savings))
+ return 1;
+ /*
+ * Check to see if there is a more power-efficient
+ * ilb.
+ */
+ new_ilb = find_new_ilb(cpu);
+ if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
+ atomic_set(&nohz.load_balancer, -1);
+ resched_cpu(new_ilb);
+ return 0;
+ }
return 1;
+ }
} else {
if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
return 0;

Subject: [RFC PATCH 2 1/2] sched: Nominate idle load balancer from a semi-idle package.

Currently the nomination of idle-load balancer is done by choosing the first
idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
such an idle cpu could come from a completely idle core/package thereby
preventing the whole core/package from being in a low-power state.

For eg, consider a quad-core dual package system. The cpu numbering need
not be sequential and can something like [0, 2, 4, 6] and [1, 3, 5, 7].
With sched_mc/smt_power_savings and the power-aware IRQ balance, we try to keep
as fewer Packages/Cores active. But the current idle load balancer logic
goes against this by choosing the first_cpu in the nohz.cpu_mask and not
taking the system topology into consideration.

Improve the algorithm to nominate the idle load balancer from a semi idle
cores/packages thereby increasing the probability of the cores/packages being
in deeper sleep states for longer duration.

The algorithm is activated only when sched_mc/smt_power_savings != 0.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 118 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6cc1fd5..d702be7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4228,10 +4228,126 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
static struct {
atomic_t load_balancer;
cpumask_var_t cpu_mask;
+ cpumask_var_t ilb_grp_nohz_mask;
} nohz ____cacheline_aligned = {
.load_balancer = ATOMIC_INIT(-1),
};

+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+/**
+ * lowest_flag_domain - Return lowest sched_domain containing flag.
+ * @cpu: The cpu whose lowest level of sched domain is to
+ * be returned.
+ * @flag: The flag to check for the lowest sched_domain
+ * for the given cpu.
+ *
+ * Returns the lowest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd;
+
+ for_each_domain(cpu, sd)
+ if (sd && (sd->flags & flag))
+ break;
+
+ return sd;
+}
+
+/**
+ * for_each_flag_domain - Iterates over sched_domains containing the flag.
+ * @cpu: The cpu whose domains we're iterating over.
+ * @sd: variable holding the value of the power_savings_sd
+ * for cpu.
+ * @flag: The flag to filter the sched_domains to be iterated.
+ *
+ * Iterates over all the scheduler domains for a given cpu that has the 'flag'
+ * set, starting from the lowest sched_domain to the highest.
+ */
+#define for_each_flag_domain(cpu, sd, flag) \
+ for (sd = lowest_flag_domain(cpu, flag); \
+ (sd && (sd->flags & flag)); sd = sd->parent)
+
+/**
+ * is_semi_idle_group - Checks if the given sched_group is semi-idle.
+ * @ilb_group: group to be checked for semi-idleness
+ *
+ * Returns: 1 if the group is semi-idle. 0 otherwise.
+ *
+ * We define a sched_group to be semi idle if it has atleast one idle-CPU
+ * and atleast one non-idle CPU. This helper function checks if the given
+ * sched_group is semi-idle or not.
+ */
+static inline int is_semi_idle_group(struct sched_group *ilb_group)
+{
+ cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+ sched_group_cpus(ilb_group));
+
+ /*
+ * A sched_group is semi-idle when it has atleast one busy cpu
+ * and atleast one idle cpu.
+ */
+ if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+ return 0;
+
+ if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+ return 0;
+
+ return 1;
+}
+/**
+ * find_new_ilb - Finds the optimum idle load balancer for nomination.
+ * @cpu: The cpu which is nominating a new idle_load_balancer.
+ *
+ * Returns: Returns the id of the idle load balancer if it exists,
+ * Else, returns >= nr_cpu_ids.
+ *
+ * This algorithm picks the idle load balancer such that it belongs to a
+ * semi-idle powersavings sched_domain. The idea is to try and avoid
+ * completely idle packages/cores just for the purpose of idle load balancing
+ * when there are other idle cpu's which are better suited for that job.
+ */
+static int find_new_ilb(int cpu)
+{
+ struct sched_domain *sd;
+ struct sched_group *ilb_group;
+
+ /*
+ * Have idle load balancer selection from semi-idle packages only
+ * when power-aware load balancing is enabled
+ */
+ if (!(sched_smt_power_savings || sched_mc_power_savings))
+ goto out_done;
+
+ /*
+ * Optimize for the case when we have no idle CPUs or only one
+ * idle CPU. Don't walk the sched_domain hierarchy in such cases
+ */
+ if (cpumask_weight(nohz.cpu_mask) < 2)
+ goto out_done;
+
+ for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
+ ilb_group = sd->groups;
+
+ do {
+ if (is_semi_idle_group(ilb_group))
+ return cpumask_first(nohz.ilb_grp_nohz_mask);
+
+ ilb_group = ilb_group->next;
+
+ } while (ilb_group != sd->groups);
+ }
+
+out_done:
+ return cpumask_first(nohz.cpu_mask);
+}
+#else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
+static inline int find_new_ilb(int call_cpu)
+{
+ return first_cpu(nohz.cpu_mask);
+}
+#endif
+
/*
* This routine will try to nominate the ilb (idle load balancing)
* owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -4456,15 +4572,7 @@ static inline void trigger_load_balance(struct rq *rq, int cpu)
}

if (atomic_read(&nohz.load_balancer) == -1) {
- /*
- * simple selection for now: Nominate the
- * first cpu in the nohz list to be the next
- * ilb owner.
- *
- * TBD: Traverse the sched domains and nominate
- * the nearest cpu in the nohz.cpu_mask.
- */
- int ilb = cpumask_first(nohz.cpu_mask);
+ int ilb = find_new_ilb(cpu);

if (ilb < nr_cpu_ids)
resched_cpu(ilb);
@@ -8985,6 +9093,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
#ifdef CONFIG_NO_HZ
alloc_bootmem_cpumask_var(&nohz.cpu_mask);
+ alloc_bootmem_cpumask_var(&nohz.ilb_grp_nohz_mask);
#endif
alloc_bootmem_cpumask_var(&cpu_isolated_map);
#endif /* SMP */

2009-04-14 09:45:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB

On Tue, 2009-04-14 at 10:25 +0530, Gautham R Shenoy wrote:
> Hi,
>
> This is the second iteration of the patchset which aims at improving
> the idle-load balancer nomination logic, by taking the system topology
> into consideration.
>
> Changes from v1 (found here: http://lkml.org/lkml/2009/4/2/246)
> o Fixed the kernel-doc style comments.
> o Renamed a variable to better reflect it's usage.
>
> Background
> ======================================
> An idle-load balancer is an idle-cpu which does not turn off it's sched_ticks
> and performs load-balancing on behalf of the other idle CPUs. Currently,
> this idle load balancer is nominated as the first_cpu(nohz.cpu_mask)
>
> The drawback of the current method is that the CPU numbering in the
> cores/packages need not necessarily be sequential. For example, on a
> two-socket, Quad core system, the CPU numbering can be as follows:
>
> |-------------------------------| |-------------------------------|
> | | | | | |
> | 0 | 2 | | 1 | 3 |
> |-------------------------------| |-------------------------------|
> | | | | | |
> | 4 | 6 | | 5 | 7 |
> |-------------------------------| |-------------------------------|
>
> Now, the other power-savings settings such as the sched_mc/smt_power_savings
> and the power-aware IRQ balancer try to balance tasks/IRQs by taking
> the system topology into consideration, with the intention of keeping
> as many "power-domains" (cores/packages) in the low-power state.
>
> The current idle-load-balancer nomination does not necessarily align towards
> this policy. For eg, we could be having tasks and interrupts largely running
> on the first package with the intention of keeping the second package idle.
> Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
> which in-turn becomes nominated as the idle-load balancer. CPU1 being from
> the 2nd package, would in turn prevent the 2nd package from going into a
> deeper sleep state.
>
> Instead the role of the idle-load balancer could have been assumed by an
> idle CPU from the first package, thereby helping the second package go
> completely idle.
>
> This patchset has been tested with 2.6.30-rc1 on a Two-Socket
> Quad core system with the topology as mentioned above.
>
> |----------------------------------------------------------------------------|
> | With Patchset + sched_mc_power_savings = 1 |
> |----------------------------------------------------------------------------|
> |make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
> | | | on Package 0 | on Package 1 |
> |----------------------------------------------------------------------------|
> |taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
> | | 227.234s | 56969 | 57080 | 1003 | 588 |
> | | |----------------------------------------------|
> | | | CPU4 | CPU6 | CPU5 | CPU7 |
> | | | 55995 | 703 | 583 | 600 |
> |----------------------------------------------------------------------------|
> |taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
> | | 227.136s | 1109 | 611 | 57074 | 57091 |
> | | |----------------------------------------------|
> | | | CPU4 | CPU6 | CPU5 | CPU7 |
> | | | 709 | 637 | 56133 | 587 |
> |----------------------------------------------------------------------------|
>
> We see here that the idle load balancer is chosen from the package which is
> busy. In the first case, it's CPU4 and in the second case it's CPU5.
>
> |----------------------------------------------------------------------------|
> | With Patchset + sched_mc_power_savings = 1 |
> |----------------------------------------------------------------------------|
> |make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
> | | | on Package 0 | on Package 1 |
> |----------------------------------------------------------------------------|
> |taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
> | | 228.786s | 59094 | 61994 | 13984 | 43652 |
> | | |----------------------------------------------|
> | | | CPU4 | CPU6 | CPU5 | CPU7 |
> | | | 1827 | 734 | 748 | 760 |
> |----------------------------------------------------------------------------|
> |taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
> | | 228.435s | 57013 | 876 | 58596 | 61633 |
> | | |----------------------------------------------|
> | | | CPU4 | CPU6 | CPU5 | CPU7 |
> | | | 772 | 1133 | 850 | 910 |
> |----------------------------------------------------------------------------|
>
> Here, we see that the idle load balancer is chosen from the other package,
> despite choosing sched_mc_power_savings = 1. In the first case, we have
> CPU1 and CPU3 sharing the responsibility among themselves. In the second case,
> it's CPU0 and CPU6, which assume that role.

Both tables above claim to be _with_ the pathes :-), from the
accompanying text one can deduce its the bottom one that is without.

Patches look straight-forward enough, seems good stuff.

Thanks!

Subject: Re: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB

On Tue, Apr 14, 2009 at 11:48:04AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-04-14 at 10:25 +0530, Gautham R Shenoy wrote:
> > Hi,
> >
> > This is the second iteration of the patchset which aims at improving
> > the idle-load balancer nomination logic, by taking the system topology
> > into consideration.
> >
> > Changes from v1 (found here: http://lkml.org/lkml/2009/4/2/246)
> > o Fixed the kernel-doc style comments.
> > o Renamed a variable to better reflect it's usage.
> >
> > Background
> > ======================================
> > An idle-load balancer is an idle-cpu which does not turn off it's sched_ticks
> > and performs load-balancing on behalf of the other idle CPUs. Currently,
> > this idle load balancer is nominated as the first_cpu(nohz.cpu_mask)
> >
> > The drawback of the current method is that the CPU numbering in the
> > cores/packages need not necessarily be sequential. For example, on a
> > two-socket, Quad core system, the CPU numbering can be as follows:
> >
> > |-------------------------------| |-------------------------------|
> > | | | | | |
> > | 0 | 2 | | 1 | 3 |
> > |-------------------------------| |-------------------------------|
> > | | | | | |
> > | 4 | 6 | | 5 | 7 |
> > |-------------------------------| |-------------------------------|
> >
> > Now, the other power-savings settings such as the sched_mc/smt_power_savings
> > and the power-aware IRQ balancer try to balance tasks/IRQs by taking
> > the system topology into consideration, with the intention of keeping
> > as many "power-domains" (cores/packages) in the low-power state.
> >
> > The current idle-load-balancer nomination does not necessarily align towards
> > this policy. For eg, we could be having tasks and interrupts largely running
> > on the first package with the intention of keeping the second package idle.
> > Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
> > which in-turn becomes nominated as the idle-load balancer. CPU1 being from
> > the 2nd package, would in turn prevent the 2nd package from going into a
> > deeper sleep state.
> >
> > Instead the role of the idle-load balancer could have been assumed by an
> > idle CPU from the first package, thereby helping the second package go
> > completely idle.
> >
> > This patchset has been tested with 2.6.30-rc1 on a Two-Socket
> > Quad core system with the topology as mentioned above.
> >
> > |----------------------------------------------------------------------------|
> > | With Patchset + sched_mc_power_savings = 1 |
> > |----------------------------------------------------------------------------|
> > |make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
> > | | | on Package 0 | on Package 1 |
> > |----------------------------------------------------------------------------|
> > |taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
> > | | 227.234s | 56969 | 57080 | 1003 | 588 |
> > | | |----------------------------------------------|
> > | | | CPU4 | CPU6 | CPU5 | CPU7 |
> > | | | 55995 | 703 | 583 | 600 |
> > |----------------------------------------------------------------------------|
> > |taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
> > | | 227.136s | 1109 | 611 | 57074 | 57091 |
> > | | |----------------------------------------------|
> > | | | CPU4 | CPU6 | CPU5 | CPU7 |
> > | | | 709 | 637 | 56133 | 587 |
> > |----------------------------------------------------------------------------|
> >
> > We see here that the idle load balancer is chosen from the package which is
> > busy. In the first case, it's CPU4 and in the second case it's CPU5.
> >
> > |----------------------------------------------------------------------------|
> > | With Patchset + sched_mc_power_savings = 1 |
^^^^
Without
> > |----------------------------------------------------------------------------|
> > |make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
> > | | | on Package 0 | on Package 1 |
> > |----------------------------------------------------------------------------|
> > |taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
> > | | 228.786s | 59094 | 61994 | 13984 | 43652 |
> > | | |----------------------------------------------|
> > | | | CPU4 | CPU6 | CPU5 | CPU7 |
> > | | | 1827 | 734 | 748 | 760 |
> > |----------------------------------------------------------------------------|
> > |taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
> > | | 228.435s | 57013 | 876 | 58596 | 61633 |
> > | | |----------------------------------------------|
> > | | | CPU4 | CPU6 | CPU5 | CPU7 |
> > | | | 772 | 1133 | 850 | 910 |
> > |----------------------------------------------------------------------------|
> >
> > Here, we see that the idle load balancer is chosen from the other package,
> > despite choosing sched_mc_power_savings = 1. In the first case, we have
> > CPU1 and CPU3 sharing the responsibility among themselves. In the second case,
> > it's CPU0 and CPU6, which assume that role.
>
> Both tables above claim to be _with_ the pathes :-), from the
> accompanying text one can deduce its the bottom one that is without.

Sorry, copy pasted the 2nd table from the first, and updated only the
values.


>
> Patches look straight-forward enough, seems good stuff.

Thanks for the review!
>
> Thanks!

--
Thanks and Regards
gautham

Subject: [tip:sched/core] sched: Nominate idle load balancer from a semi-idle package.

Commit-ID: f711f6090a81cbd396b63de90f415d33f563af9b
Gitweb: http://git.kernel.org/tip/f711f6090a81cbd396b63de90f415d33f563af9b
Author: Gautham R Shenoy <[email protected]>
AuthorDate: Tue, 14 Apr 2009 10:25:30 +0530
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 14 Apr 2009 11:49:19 +0200

sched: Nominate idle load balancer from a semi-idle package.

Currently the nomination of idle-load balancer is done by choosing the first
idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
such an idle cpu could come from a completely idle core/package thereby
preventing the whole core/package from being in a low-power state.

For eg, consider a quad-core dual package system. The cpu numbering need
not be sequential and can something like [0, 2, 4, 6] and [1, 3, 5, 7].
With sched_mc/smt_power_savings and the power-aware IRQ balance, we try to keep
as fewer Packages/Cores active. But the current idle load balancer logic
goes against this by choosing the first_cpu in the nohz.cpu_mask and not
taking the system topology into consideration.

Improve the algorithm to nominate the idle load balancer from a semi idle
cores/packages thereby increasing the probability of the cores/packages being
in deeper sleep states for longer duration.

The algorithm is activated only when sched_mc/smt_power_savings != 0.

Signed-off-by: Gautham R Shenoy <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 118 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 5724508..b0fefa3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4240,10 +4240,126 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
static struct {
atomic_t load_balancer;
cpumask_var_t cpu_mask;
+ cpumask_var_t ilb_grp_nohz_mask;
} nohz ____cacheline_aligned = {
.load_balancer = ATOMIC_INIT(-1),
};

+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+/**
+ * lowest_flag_domain - Return lowest sched_domain containing flag.
+ * @cpu: The cpu whose lowest level of sched domain is to
+ * be returned.
+ * @flag: The flag to check for the lowest sched_domain
+ * for the given cpu.
+ *
+ * Returns the lowest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd;
+
+ for_each_domain(cpu, sd)
+ if (sd && (sd->flags & flag))
+ break;
+
+ return sd;
+}
+
+/**
+ * for_each_flag_domain - Iterates over sched_domains containing the flag.
+ * @cpu: The cpu whose domains we're iterating over.
+ * @sd: variable holding the value of the power_savings_sd
+ * for cpu.
+ * @flag: The flag to filter the sched_domains to be iterated.
+ *
+ * Iterates over all the scheduler domains for a given cpu that has the 'flag'
+ * set, starting from the lowest sched_domain to the highest.
+ */
+#define for_each_flag_domain(cpu, sd, flag) \
+ for (sd = lowest_flag_domain(cpu, flag); \
+ (sd && (sd->flags & flag)); sd = sd->parent)
+
+/**
+ * is_semi_idle_group - Checks if the given sched_group is semi-idle.
+ * @ilb_group: group to be checked for semi-idleness
+ *
+ * Returns: 1 if the group is semi-idle. 0 otherwise.
+ *
+ * We define a sched_group to be semi idle if it has atleast one idle-CPU
+ * and atleast one non-idle CPU. This helper function checks if the given
+ * sched_group is semi-idle or not.
+ */
+static inline int is_semi_idle_group(struct sched_group *ilb_group)
+{
+ cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+ sched_group_cpus(ilb_group));
+
+ /*
+ * A sched_group is semi-idle when it has atleast one busy cpu
+ * and atleast one idle cpu.
+ */
+ if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+ return 0;
+
+ if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+ return 0;
+
+ return 1;
+}
+/**
+ * find_new_ilb - Finds the optimum idle load balancer for nomination.
+ * @cpu: The cpu which is nominating a new idle_load_balancer.
+ *
+ * Returns: Returns the id of the idle load balancer if it exists,
+ * Else, returns >= nr_cpu_ids.
+ *
+ * This algorithm picks the idle load balancer such that it belongs to a
+ * semi-idle powersavings sched_domain. The idea is to try and avoid
+ * completely idle packages/cores just for the purpose of idle load balancing
+ * when there are other idle cpu's which are better suited for that job.
+ */
+static int find_new_ilb(int cpu)
+{
+ struct sched_domain *sd;
+ struct sched_group *ilb_group;
+
+ /*
+ * Have idle load balancer selection from semi-idle packages only
+ * when power-aware load balancing is enabled
+ */
+ if (!(sched_smt_power_savings || sched_mc_power_savings))
+ goto out_done;
+
+ /*
+ * Optimize for the case when we have no idle CPUs or only one
+ * idle CPU. Don't walk the sched_domain hierarchy in such cases
+ */
+ if (cpumask_weight(nohz.cpu_mask) < 2)
+ goto out_done;
+
+ for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
+ ilb_group = sd->groups;
+
+ do {
+ if (is_semi_idle_group(ilb_group))
+ return cpumask_first(nohz.ilb_grp_nohz_mask);
+
+ ilb_group = ilb_group->next;
+
+ } while (ilb_group != sd->groups);
+ }
+
+out_done:
+ return cpumask_first(nohz.cpu_mask);
+}
+#else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
+static inline int find_new_ilb(int call_cpu)
+{
+ return first_cpu(nohz.cpu_mask);
+}
+#endif
+
/*
* This routine will try to nominate the ilb (idle load balancing)
* owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -4468,15 +4584,7 @@ static inline void trigger_load_balance(struct rq *rq, int cpu)
}

if (atomic_read(&nohz.load_balancer) == -1) {
- /*
- * simple selection for now: Nominate the
- * first cpu in the nohz list to be the next
- * ilb owner.
- *
- * TBD: Traverse the sched domains and nominate
- * the nearest cpu in the nohz.cpu_mask.
- */
- int ilb = cpumask_first(nohz.cpu_mask);
+ int ilb = find_new_ilb(cpu);

if (ilb < nr_cpu_ids)
resched_cpu(ilb);
@@ -9051,6 +9159,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
#ifdef CONFIG_NO_HZ
alloc_bootmem_cpumask_var(&nohz.cpu_mask);
+ alloc_bootmem_cpumask_var(&nohz.ilb_grp_nohz_mask);
#endif
alloc_bootmem_cpumask_var(&cpu_isolated_map);
#endif /* SMP */

Subject: [tip:sched/core] sched: Nominate a power-efficient ilb in select_nohz_balancer()

Commit-ID: e790fb0ba64bfec158e1219d899cb588275d12ab
Gitweb: http://git.kernel.org/tip/e790fb0ba64bfec158e1219d899cb588275d12ab
Author: Gautham R Shenoy <[email protected]>
AuthorDate: Tue, 14 Apr 2009 10:25:35 +0530
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 14 Apr 2009 11:49:19 +0200

sched: Nominate a power-efficient ilb in select_nohz_balancer()

The CPU that first goes idle becomes the idle-load-balancer and remains
that until either it picks up a task or till all the CPUs of the system
goes idle.

Optimize this further to allow it to relinquish it's post
once all it's siblings in the power-aware sched_domain go idle, thereby
allowing the whole package-core to go idle. While relinquising the post,
nominate another an idle-load balancer from a semi-idle core/package.

Signed-off-by: Gautham R Shenoy <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/sched.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b0fefa3..36d213b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4414,8 +4414,24 @@ int select_nohz_load_balancer(int stop_tick)
/* make me the ilb owner */
if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
return 1;
- } else if (atomic_read(&nohz.load_balancer) == cpu)
+ } else if (atomic_read(&nohz.load_balancer) == cpu) {
+ int new_ilb;
+
+ if (!(sched_smt_power_savings ||
+ sched_mc_power_savings))
+ return 1;
+ /*
+ * Check to see if there is a more power-efficient
+ * ilb.
+ */
+ new_ilb = find_new_ilb(cpu);
+ if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
+ atomic_set(&nohz.load_balancer, -1);
+ resched_cpu(new_ilb);
+ return 0;
+ }
return 1;
+ }
} else {
if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
return 0;

2009-04-14 17:32:21

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC PATCH 2 1/2] sched: Nominate idle load balancer from a semi-idle package.

Gautham R Shenoy wrote:
> kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 files changed, 118 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 6cc1fd5..d702be7 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
>
> +/**
> + * is_semi_idle_group - Checks if the given sched_group is semi-idle.
> + * @ilb_group: group to be checked for semi-idleness
> + *
> + * Returns: 1 if the group is semi-idle. 0 otherwise.
> + *
> + * We define a sched_group to be semi idle if it has atleast one idle-CPU
> + * and atleast one non-idle CPU. This helper function checks if the given

"at least" (2x)

> + * sched_group is semi-idle or not.
> + */
> +static inline int is_semi_idle_group(struct sched_group *ilb_group)
> +{
> + cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
> + sched_group_cpus(ilb_group));
> +
> + /*
> + * A sched_group is semi-idle when it has atleast one busy cpu
> + * and atleast one idle cpu.

"at least" (2x)

> + */
> + if (cpumask_empty(nohz.ilb_grp_nohz_mask))
> + return 0;
> +
> + if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
> + return 0;
> +
> + return 1;
> +}


--
~Randy

2009-04-22 01:06:54

by Suresh Siddha

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB

On Mon, 2009-04-13 at 21:55 -0700, Gautham R Shenoy wrote:
> Now, the other power-savings settings such as the sched_mc/smt_power_savings
> and the power-aware IRQ balancer try to balance tasks/IRQs by taking
> the system topology into consideration, with the intention of keeping
> as many "power-domains" (cores/packages) in the low-power state.
>
> The current idle-load-balancer nomination does not necessarily align towards
> this policy. For eg, we could be having tasks and interrupts largely running
> on the first package with the intention of keeping the second package idle.
> Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
> which in-turn becomes nominated as the idle-load balancer. CPU1 being from
> the 2nd package, would in turn prevent the 2nd package from going into a
> deeper sleep state.
>
> Instead the role of the idle-load balancer could have been assumed by an
> idle CPU from the first package, thereby helping the second package go
> completely idle.

Can we also do this by default? i.e., even when no power-savings policy
is selected.

I don't see anything wrong by enabling this logic for all the cases.

thanks,
suresh

2009-04-26 16:56:55

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB

* Suresh Siddha <[email protected]> [2009-04-21 18:05:22]:

> On Mon, 2009-04-13 at 21:55 -0700, Gautham R Shenoy wrote:
> > Now, the other power-savings settings such as the sched_mc/smt_power_savings
> > and the power-aware IRQ balancer try to balance tasks/IRQs by taking
> > the system topology into consideration, with the intention of keeping
> > as many "power-domains" (cores/packages) in the low-power state.
> >
> > The current idle-load-balancer nomination does not necessarily align towards
> > this policy. For eg, we could be having tasks and interrupts largely running
> > on the first package with the intention of keeping the second package idle.
> > Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
> > which in-turn becomes nominated as the idle-load balancer. CPU1 being from
> > the 2nd package, would in turn prevent the 2nd package from going into a
> > deeper sleep state.
> >
> > Instead the role of the idle-load balancer could have been assumed by an
> > idle CPU from the first package, thereby helping the second package go
> > completely idle.
>
> Can we also do this by default? i.e., even when no power-savings policy
> is selected.
>
> I don't see anything wrong by enabling this logic for all the cases.

Hi Suresh,

Thanks for the review. The selection of the idle-load-balancer
depends on the sched domain level with SD_POWERSAVINGS_BALANCE flag
set that indicates the cpu-package boundary. Hence this scheme
depends on sched_mc_powersavings being non zero so that the sched
domains (cpu level) are build based on cpu-package information.

The logical-cpu to cpu-package grouping information is not readily
available (in the sched domain) when power-saving policies are off.

We may need to look at selection of idle-load-balancer for maximum
performance (bandwidth utilisation) when power save policies are off.

--Vaidy