Subject: [PATCH v2 0/2] sched: Nominate a power-efficient ILB.

Hi,

This patchset improves the idle-load-balancer nomination logic by taking
the system topology into consideration.

An idle load balancer is an idle CPU that does not turn off its scheduler
tick and performs load balancing on behalf of the other idle CPUs. Currently,
the idle load balancer is nominated as first_cpu(nohz.cpu_mask).

The drawback of the current method is that the CPU numbering in the
cores/packages need not be sequential. For example, on a
two-socket quad-core system, the CPU numbering can be as follows:

|-------------------------------| |-------------------------------|
| | | | | |
| 0 | 2 | | 1 | 3 |
|-------------------------------| |-------------------------------|
| | | | | |
| 4 | 6 | | 5 | 7 |
|-------------------------------| |-------------------------------|

Now, the other power-savings settings, such as sched_mc/smt_power_savings
and the power-aware IRQ balancer, try to balance tasks/IRQs by taking
the system topology into consideration, with the intention of keeping
as many "power-domains" (cores/packages) as possible in a low-power state.

The current idle-load-balancer nomination does not necessarily align with
this policy. For example, tasks and interrupts could be running largely
on the first package with the intention of keeping the second package idle,
so CPU 0 may be busy. The first_cpu in nohz.cpu_mask then happens to be CPU 1,
which in turn gets nominated as the idle load balancer. CPU 1, being from
the second package, would prevent the second package from going into a
deeper sleep state.

Instead the role of the idle-load balancer could have been assumed by an
idle CPU from the first package, thereby helping the second package go
completely idle.

This patchset has been tested with 2.6.29-tip-master on a two-socket
quad-core system with the topology mentioned above.

|----------------------------------------------------------------------------|
| With Patchset + sched_mc_power_savings = 1 |
|----------------------------------------------------------------------------|
|make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
| | | on Package 0 | on Package 1 |
|----------------------------------------------------------------------------|
|taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
| | 221.675s | 55421 | 55530 | 587 | 579 |
| | |----------------------------------------------|
| | | CPU4 | CPU6 | CPU5 | CPU7 |
| | | 54335 | 642 | 734 | 533 |
|----------------------------------------------------------------------------|
|taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
| | 221.806s | 1241 | 553 | 55566 | 55555 |
| | |----------------------------------------------|
| | | CPU4 | CPU6 | CPU5 | CPU7 |
| | | 567 | 632 | 54332 | 561 |
|----------------------------------------------------------------------------|

We see here that the idle load balancer is chosen from the package that is
busy: CPU4 in the first case and CPU5 in the second.

|----------------------------------------------------------------------------|
| Without Patchset + sched_mc_power_savings = 1 |
|----------------------------------------------------------------------------|
|make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
| | | on Package 0 | on Package 1 |
|----------------------------------------------------------------------------|
|taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
| | 221.727s | 55100 | 55134 | 55134 | 14591 |
| | |----------------------------------------------|
| | | CPU4 | CPU6 | CPU5 | CPU7 |
| | | 1590 | 856 | 613 | 598 |
|----------------------------------------------------------------------------|
|taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
| | 221.500s | 43444 | 12918 | 54766 | 55170 |
| | |----------------------------------------------|
| | | CPU4 | CPU6 | CPU5 | CPU7 |
| | | 653 | 777 | 1008 | 585 |
|----------------------------------------------------------------------------|

Here, we see that the idle load balancer is chosen from the other package,
despite sched_mc_power_savings being set to 1. In the first case, CPU1 and
CPU3 share the responsibility between themselves. In the second case,
CPU0 and CPU2 assume that role.

Thoughts?

---

Gautham R Shenoy (2):
sched: Nominate a power-efficient ilb in select_nohz_balancer()
sched: Nominate idle load balancer from a semi-idle package.


kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 117 insertions(+), 10 deletions(-)

--
Thanks and Regards
gautham.


Subject: [PATCH v2 2/2] sched: Nominate a power-efficient ilb in select_nohz_balancer()

The CPU that first goes idle becomes the idle load balancer and remains
in that role until either it picks up a task or all the CPUs of the system
go idle.

Optimize this further by allowing it to relinquish its post
once all its siblings in the power-aware sched_domain go idle, thereby
allowing the whole package/core to go idle. While relinquishing the post,
it nominates another idle load balancer from a semi-idle core/package.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4fc1ec0..c249473 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4439,8 +4439,24 @@ int select_nohz_load_balancer(int stop_tick)
/* make me the ilb owner */
if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
return 1;
- } else if (atomic_read(&nohz.load_balancer) == cpu)
+ } else if (atomic_read(&nohz.load_balancer) == cpu) {
+ int new_ilb;
+
+ if (!(sched_smt_power_savings ||
+ sched_mc_power_savings))
+ return 1;
+ /*
+ * Check to see if there is a more power-efficient
+ * ilb.
+ */
+ new_ilb = find_new_ilb(cpu);
+ if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
+ atomic_set(&nohz.load_balancer, -1);
+ resched_cpu(new_ilb);
+ return 0;
+ }
return 1;
+ }
} else {
if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
return 0;

Subject: [PATCH v2 1/2] sched: Nominate idle load balancer from a semi-idle package.

Currently, the nomination of the idle load balancer is done by choosing the first
idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
such an idle cpu could come from a completely idle core/package, thereby
preventing the whole core/package from being in a low-power state.

For example, consider a quad-core dual-package system. The cpu numbering need
not be sequential and can be something like [0, 2, 4, 6] and [1, 3, 5, 7].
With sched_mc/smt_power_savings and the power-aware IRQ balancer, we try to keep
as few packages/cores active as possible. But the current idle-load-balancer logic
goes against this by choosing the first_cpu in the nohz.cpu_mask and not
taking the system topology into consideration.

Improve the algorithm to nominate the idle load balancer from a semi-idle
core/package, thereby increasing the probability of the cores/packages being
in deeper sleep states for a longer duration.

The algorithm is activated only when sched_mc/smt_power_savings != 0.

Signed-off-by: Gautham R Shenoy <[email protected]>
---

kernel/sched.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 100 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 706517c..4fc1ec0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4283,10 +4283,108 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
static struct {
atomic_t load_balancer;
cpumask_var_t cpu_mask;
+ cpumask_var_t tmpmask;
} nohz ____cacheline_aligned = {
.load_balancer = ATOMIC_INIT(-1),
};

+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+/**
+ * lowest_flag_domain: Returns the lowest sched_domain
+ * that has the given flag set for a particular cpu.
+ * @cpu: The cpu whose lowest level of sched domain is to
+ * be returned.
+ *
+ * @flag: The flag to check for the lowest sched_domain
+ * for the given cpu
+ */
+static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd;
+
+ for_each_domain(cpu, sd)
+ if (sd && (sd->flags & flag))
+ break;
+
+ return sd;
+}
+
+/**
+ * for_each_flag_domain: Iterates over all the scheduler domains
+ * for a given cpu that has the 'flag' set, starting from
+ * the lowest to the highest.
+ * @cpu: The cpu whose domains we're iterating over.
+ * @sd: variable holding the value of the power_savings_sd
+ * for cpu
+ */
+#define for_each_flag_domain(cpu, sd, flag) \
+ for (sd = lowest_flag_domain(cpu, flag); \
+ (sd && (sd->flags & flag)); sd = sd->parent)
+
+static inline int is_semi_idle_group(struct sched_group *ilb_group)
+{
+ cpumask_and(nohz.tmpmask, nohz.cpu_mask, sched_group_cpus(ilb_group));
+
+ /*
+ * A sched_group is semi-idle when it has at least one busy cpu
+ * and at least one idle cpu.
+ */
+ if (!(cpumask_empty(nohz.tmpmask) ||
+ cpumask_equal(nohz.tmpmask, sched_group_cpus(ilb_group))))
+ return 1;
+
+ return 0;
+}
+/**
+ * find_new_ilb: Finds or nominates a new idle load balancer.
+ * @cpu: The cpu which is nominating a new idle_load_balancer.
+ *
+ * This algorithm picks the idle load balancer such that it belongs to a
+ * semi-idle powersavings sched_domain. The idea is to try and avoid
+ * completely idle packages/cores just for the purpose of idle load balancing
+ * when there are other idle cpu's which are better suited for that job.
+ */
+static int find_new_ilb(int cpu)
+{
+ struct sched_domain *sd;
+ struct sched_group *ilb_group;
+
+ /*
+ * Optimization for the case when there is no idle cpu or
+ * only 1 idle cpu to choose from.
+ */
+ if (cpumask_weight(nohz.cpu_mask) < 2)
+ goto out_done;
+
+ /*
+ * Have idle load balancer selection from semi-idle packages only
+ * when power-aware load balancing is enabled
+ */
+ if (!(sched_smt_power_savings || sched_mc_power_savings))
+ goto out_done;
+
+ for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
+ ilb_group = sd->groups;
+
+ do {
+ if (is_semi_idle_group(ilb_group))
+ return cpumask_first(nohz.tmpmask);
+
+ ilb_group = ilb_group->next;
+
+ } while (ilb_group != sd->groups);
+ }
+
+out_done:
+ return cpumask_first(nohz.cpu_mask);
+}
+#else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
+static inline int find_new_ilb(int call_cpu)
+{
+ return cpumask_first(nohz.cpu_mask);
+}
+#endif
+
/*
* This routine will try to nominate the ilb (idle load balancing)
* owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -4511,15 +4609,7 @@ static inline void trigger_load_balance(struct rq *rq, int cpu)
}

if (atomic_read(&nohz.load_balancer) == -1) {
- /*
- * simple selection for now: Nominate the
- * first cpu in the nohz list to be the next
- * ilb owner.
- *
- * TBD: Traverse the sched domains and nominate
- * the nearest cpu in the nohz.cpu_mask.
- */
- int ilb = cpumask_first(nohz.cpu_mask);
+ int ilb = find_new_ilb(cpu);

if (ilb < nr_cpu_ids)
resched_cpu(ilb);
@@ -9046,6 +9136,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
#ifdef CONFIG_NO_HZ
alloc_bootmem_cpumask_var(&nohz.cpu_mask);
+ alloc_bootmem_cpumask_var(&nohz.tmpmask);
#endif
alloc_bootmem_cpumask_var(&cpu_isolated_map);
#endif /* SMP */

2009-04-02 15:52:59

by Jaswinder Singh Rajput

Subject: Re: [PATCH v2 1/2] sched: Nominate idle load balancer from a semi-idle package.

Here are some minor issues:

On Thu, 2009-04-02 at 18:08 +0530, Gautham R Shenoy wrote:
> Currently the nomination of idle-load balancer is done by choosing the first
> idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
> such an idle cpu could come from a completely idle core/package thereby
> preventing the whole core/package from being in a low-power state.
>
> For eg, consider a quad-core dual package system. The cpu numbering need
> not be sequential and can something like [0, 2, 4, 6] and [1, 3, 5, 7].
> With sched_mc/smt_power_savings and the power-aware IRQ balance, we try to keep
> as fewer Packages/Cores active. But the current idle load balancer logic
> goes against this by choosing the first_cpu in the nohz.cpu_mask and not
> taking the system topology into consideration.
>
> Improve the algorithm to nominate the idle load balancer from a semi idle
> cores/packages thereby increasing the probability of the cores/packages being
> in deeper sleep states for longer duration.
>
> The algorithm is activated only when sched_mc/smt_power_savings != 0.
>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> ---
>
> kernel/sched.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 100 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 706517c..4fc1ec0 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4283,10 +4283,108 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
> static struct {
> atomic_t load_balancer;
> cpumask_var_t cpu_mask;
> + cpumask_var_t tmpmask;

Can you find some better name than tmpmask.

> } nohz ____cacheline_aligned = {
> .load_balancer = ATOMIC_INIT(-1),
> };
>
> +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> +/**

^^^^^^
This comment is not valid and even Randy send patches to fix these
comments and also shared the error messages because of these comments by
your earlier patches. Replace it with /*

> + * lowest_flag_domain: Returns the lowest sched_domain
> + * that has the given flag set for a particular cpu.
> + * @cpu: The cpu whose lowest level of sched domain is to
> + * be returned.
> + *
> + * @flag: The flag to check for the lowest sched_domain
> + * for the given cpu
> + */
> +static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> +{
> + struct sched_domain *sd;
> +
> + for_each_domain(cpu, sd)
> + if (sd && (sd->flags & flag))
> + break;
> +
> + return sd;
> +}
> +
> +/**

Ditto.

> + * for_each_flag_domain: Iterates over all the scheduler domains
> + * for a given cpu that has the 'flag' set, starting from
> + * the lowest to the highest.
> + * @cpu: The cpu whose domains we're iterating over.
> + * @sd: variable holding the value of the power_savings_sd
> + * for cpu

This can become one line:

+ * @sd: variable holding the value of the power_savings_sd for cpu

> + */
> +#define for_each_flag_domain(cpu, sd, flag) \
> + for (sd = lowest_flag_domain(cpu, flag); \
> + (sd && (sd->flags & flag)); sd = sd->parent)
> +
> +static inline int is_semi_idle_group(struct sched_group *ilb_group)
> +{
> + cpumask_and(nohz.tmpmask, nohz.cpu_mask, sched_group_cpus(ilb_group));
> +
> + /*
> + * A sched_group is semi-idle when it has atleast one busy cpu
> + * and atleast one idle cpu.
> + */
> + if (!(cpumask_empty(nohz.tmpmask) ||
> + cpumask_equal(nohz.tmpmask, sched_group_cpus(ilb_group))))
> + return 1;
> +
> + return 0;
> +}
> +/**

Ditto.

> + * find_new_ilb: Finds or nominates a new idle load balancer.
> + * @cpu: The cpu which is nominating a new idle_load_balancer.
> + *
> + * This algorithm picks the idle load balancer such that it belongs to a
> + * semi-idle powersavings sched_domain. The idea is to try and avoid
> + * completely idle packages/cores just for the purpose of idle load balancing
> + * when there are other idle cpu's which are better suited for that job.
> + */
> +static int find_new_ilb(int cpu)
> +{
> + struct sched_domain *sd;
> + struct sched_group *ilb_group;
> +
> + /*
> + * Optimization for the case when there is no idle cpu or
> + * only 1 idle cpu to choose from.
> + */
> + if (cpumask_weight(nohz.cpu_mask) < 2)
> + goto out_done;
> +

We can simply avoid these gotos.

--
JSR

2009-04-03 07:04:56

by Andi Kleen

Subject: Re: [PATCH v2 1/2] sched: Nominate idle load balancer from a semi-idle package.

Gautham R Shenoy <[email protected]> writes:
>
> Improve the algorithm to nominate the idle load balancer from a semi idle
> cores/packages thereby increasing the probability of the cores/packages being
> in deeper sleep states for longer duration.

The basic patch looks good.

In theory you could also look for a nearby nohz balancer in the end
to optimize traffic on the interconnect of a larger NUMA system,
but it's probably not worth it.

>
> The algorithm is activated only when sched_mc/smt_power_savings != 0.

But it seems to me that this check could be dropped and doing it
unconditionally, because idle balancing doesn't need much memory
bandwith or cpu power, so always putting it nearby is good.

-Andi

--
[email protected] -- Speaking for myself only.

Subject: Re: [PATCH v2 1/2] sched: Nominate idle load balancer from a semi-idle package.

Hi Jaswinder,
Thanks for the review. Comments interspersed.


On Thu, Apr 02, 2009 at 09:22:11PM +0530, Jaswinder Singh Rajput wrote:
> Here are some minor issues:
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 706517c..4fc1ec0 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4283,10 +4283,108 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
> > static struct {
> > atomic_t load_balancer;
> > cpumask_var_t cpu_mask;
> > + cpumask_var_t tmpmask;
>
> Can you find some better name than tmpmask.

Sure, I'll think of it. The cpumask is required to store some
intermediate results when we compute the idle load balancer. Any
suggestions?

>
> > } nohz ____cacheline_aligned = {
> > .load_balancer = ATOMIC_INIT(-1),
> > };
> >
> > +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> > +/**
>
> ^^^^^^
> This comment is not valid and even Randy send patches to fix these
> comments and also shared the error messages because of these comments by
> your earlier patches. Replace it with /*

I think you're referring to this: http://lkml.org/lkml/2009/3/29/7
I had missed this patch. Thanks for pointing it out.

I think Randy's patch was addressing the issue of structs
not requiring /** style comments. But in this case,
it's a function, and hence needs kernel-doc style notation. Or am I
missing something here? The point about the single-line
short function description is well taken.

>
> > + * lowest_flag_domain: Returns the lowest sched_domain
> > + * that has the given flag set for a particular cpu.
> > + * @cpu: The cpu whose lowest level of sched domain is to
> > + * be returned.
> > + *
> > + * @flag: The flag to check for the lowest sched_domain
> > + * for the given cpu
> > + */
> > +static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> > +{
> > + struct sched_domain *sd;
> > +
> > + for_each_domain(cpu, sd)
> > + if (sd && (sd->flags & flag))
> > + break;
> > +
> > + return sd;
> > +}
> > +
> > +/**
>
> Ditto.
>

Ditto :-)

> > + * for_each_flag_domain: Iterates over all the scheduler domains
> > + * for a given cpu that has the 'flag' set, starting from
> > + * the lowest to the highest.
> > + * @cpu: The cpu whose domains we're iterating over.
> > + * @sd: variable holding the value of the power_savings_sd
> > + * for cpu
>
> This can be come in one line:
Agreed. Will correct this.

>
> + * @sd: variable holding the value of the power_savings_sd for cpu
>
> > + */
> > +#define for_each_flag_domain(cpu, sd, flag) \
> > + for (sd = lowest_flag_domain(cpu, flag); \
> > + (sd && (sd->flags & flag)); sd = sd->parent)
> > +
> > +static inline int is_semi_idle_group(struct sched_group *ilb_group)
> > +{
> > + cpumask_and(nohz.tmpmask, nohz.cpu_mask, sched_group_cpus(ilb_group));
> > +
> > + /*
> > + * A sched_group is semi-idle when it has atleast one busy cpu
> > + * and atleast one idle cpu.
> > + */
> > + if (!(cpumask_empty(nohz.tmpmask) ||
> > + cpumask_equal(nohz.tmpmask, sched_group_cpus(ilb_group))))
> > + return 1;
> > +
> > + return 0;
> > +}
> > +/**
>
> Ditto.
>

Ditto!
> > + * find_new_ilb: Finds or nominates a new idle load balancer.
> > + * @cpu: The cpu which is nominating a new idle_load_balancer.
> > + *
> > + * This algorithm picks the idle load balancer such that it belongs to a
> > + * semi-idle powersavings sched_domain. The idea is to try and avoid
> > + * completely idle packages/cores just for the purpose of idle load balancing
> > + * when there are other idle cpu's which are better suited for that job.
> > + */
> > +static int find_new_ilb(int cpu)
> > +{
> > + struct sched_domain *sd;
> > + struct sched_group *ilb_group;
> > +
> > + /*
> > + * Optimization for the case when there is no idle cpu or
> > + * only 1 idle cpu to choose from.
> > + */
> > + if (cpumask_weight(nohz.cpu_mask) < 2)
> > + goto out_done;
> > +
>
> We can simply avoid these gotos.

Not really! When we don't have any idle cpu, or have only one idle cpu,
it doesn't make sense to walk the domain hierarchy to find the idle load
balancer. In the former case, there is none. In the latter case, there's
only one choice.

But to verify what effect this optimization might have,
I ran kernbench on a large box (112 CPUs) and here
are the results.

------------------------------------------------
make -j$i    patch            patch
             without gotos    with gotos
------------------------------------------------
  1          1230.87 s        1195.95 s
  2           850.51 s         596.45 s
  4           368.91 s         310.49 s
  6           255.61 s         216.31 s
  8           201.94 s         167.89 s
 10           168.95 s         138.30 s
 12           151.64 s         123.14 s
 14           135.98 s         108.72 s
 28            86.09 s          70.92 s
 56            61.11 s          55.52 s
112            49.30 s          47.23 s
------------------------------------------------

Clearly, the patch with gotos gives us a better score than the one
without.

>
> --
> JSR

--
Thanks and Regards
gautham

Subject: Re: [PATCH v2 1/2] sched: Nominate idle load balancer from a semi-idle package.

Hi Andi,

Thanks for the review.
On Fri, Apr 03, 2009 at 09:04:42AM +0200, Andi Kleen wrote:
> Gautham R Shenoy <[email protected]> writes:
> >
> > Improve the algorithm to nominate the idle load balancer from a semi idle
> > cores/packages thereby increasing the probability of the cores/packages being
> > in deeper sleep states for longer duration.
>
> The basic patch looks good.
>
> In theory you could also look for a nearby nohz balancer in the end
> to optimize traffic on the interconnect of a larger NUMA system,
> but it's probably not worth it.

The algorithm does this already, since it starts off with its own
sched_group in the power-aware sched_domain and then moves on to its
sibling groups. The sibling groups are linked in order of
their proximity.
>
> >
> > The algorithm is activated only when sched_mc/smt_power_savings != 0.
>
> But it seems to me that this check could be dropped and doing it
> unconditionally, because idle balancing doesn't need much memory
> bandwith or cpu power, so always putting it nearby is good.

Well, right now, a new idle load balancer is nominated when the current
idle load balancer picks up a task. At this point, if the user is
concerned about performance as opposed to energy savings, we wouldn't
want to iterate over the domain hierarchy to find the best idle load
balancer, would we? That might cause latency in running the job
that is queued on our runqueue.

Actually, this can be optimized. We can have the current idle load
balancer nominate the ilb as first_cpu(nohz.cpu_mask), and this
idle load balancer, at the end of its sched_tick, can then see if there's a
more power-efficient idle load balancer.

Let me see if this gives any benefit over the patches that I've posted.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.

--
Thanks and Regards
gautham

2009-04-03 15:17:09

by Randy Dunlap

Subject: Re: [PATCH v2 1/2] sched: Nominate idle load balancer from a semi-idle package.

Gautham R Shenoy wrote:
> Hi Jaswinder,
> Thanks for the review. Comments interspersed.
>
>
> On Thu, Apr 02, 2009 at 09:22:11PM +0530, Jaswinder Singh Rajput wrote:
>> Here are some minor issues:
>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>> index 706517c..4fc1ec0 100644
>>> --- a/kernel/sched.c
>>> +++ b/kernel/sched.c
>
>>> } nohz ____cacheline_aligned = {
>>> .load_balancer = ATOMIC_INIT(-1),
>>> };
>>>
>>> +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
>>> +/**
>> ^^^^^^
>> This comment is not valid and even Randy send patches to fix these
>> comments and also shared the error messages because of these comments by
>> your earlier patches. Replace it with /*
>
> I think you're referring to this: http://lkml.org/lkml/2009/3/29/7
> I had missed this patch. Thanks for pointing it out.
>
> I think Randy's patch was addressing the issue of structs
> not requiring a /** style comments. But in this case,
> it's a function, and hence needs kernel-doc style notation. Or am I
> missing something here ? Point about the single-line
> short function description is well taken.

If you want to use kernel-doc for functions or structs or unions
or enums, do so. Just use the correct syntax, please.

If it's not kernel-doc, don't use /** to begin the comment block.


>>> + * lowest_flag_domain: Returns the lowest sched_domain
>>> + * that has the given flag set for a particular cpu.
>>> + * @cpu: The cpu whose lowest level of sched domain is to
>>> + * be returned.
>>> + *
>>> + * @flag: The flag to check for the lowest sched_domain
>>> + * for the given cpu
>>> + */
>>> +static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>>> +{
>>> + struct sched_domain *sd;
>>> +
>>> + for_each_domain(cpu, sd)
>>> + if (sd && (sd->flags & flag))
>>> + break;
>>> +
>>> + return sd;
>>> +}
>>> +
>>> +/**
>> Ditto.
>>
>
> Ditto :-)
>
>>> + * for_each_flag_domain: Iterates over all the scheduler domains
>>> + * for a given cpu that has the 'flag' set, starting from
>>> + * the lowest to the highest.
>>> + * @cpu: The cpu whose domains we're iterating over.
>>> + * @sd: variable holding the value of the power_savings_sd
>>> + * for cpu
>> This can be come in one line:
> Agreed. Will correct this.
>
>> + * @sd: variable holding the value of the power_savings_sd for cpu
>>
>>> + */
>>> +#define for_each_flag_domain(cpu, sd, flag) \
>>> + for (sd = lowest_flag_domain(cpu, flag); \
>>> + (sd && (sd->flags & flag)); sd = sd->parent)
>>> +
>>> +static inline int is_semi_idle_group(struct sched_group *ilb_group)
>>> +{
>>> + cpumask_and(nohz.tmpmask, nohz.cpu_mask, sched_group_cpus(ilb_group));
>>> +
>>> + /*
>>> + * A sched_group is semi-idle when it has atleast one busy cpu
>>> + * and atleast one idle cpu.
>>> + */
>>> + if (!(cpumask_empty(nohz.tmpmask) ||
>>> + cpumask_equal(nohz.tmpmask, sched_group_cpus(ilb_group))))
>>> + return 1;
>>> +
>>> + return 0;
>>> +}
>>> +/**
>> Ditto.
>>
>
> Ditto!
>>> + * find_new_ilb: Finds or nominates a new idle load balancer.
>>> + * @cpu: The cpu which is nominating a new idle_load_balancer.
>>> + *
>>> + * This algorithm picks the idle load balancer such that it belongs to a
>>> + * semi-idle powersavings sched_domain. The idea is to try and avoid
>>> + * completely idle packages/cores just for the purpose of idle load balancing
>>> + * when there are other idle cpu's which are better suited for that job.
>>> + */
>>> +static int find_new_ilb(int cpu)


--
~Randy