2010-02-13 01:15:36

by Suresh Siddha

Subject: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

PeterZ/Ingo,

Ling Ma and Yanmin reported this SMT scheduler regression, which leads
to a condition where both SMT threads on a core are busy while the
other cores in the same socket are completely idle, causing a major
performance regression. I have appended a fix for this. It is a
relatively low-risk fix, and if you agree with both the fix and the
risk assessment, can we please push this to Linus so that we can
address this in 2.6.33?

thanks,
suresh
---

From: Suresh Siddha <[email protected]>
Subject: sched: fix SMT scheduler regression in find_busiest_queue()

Fix an SMT scheduler performance regression that leads to a scenario
where the SMT threads in one core are completely idle while both SMT
threads in another core (on the same socket) are busy.

This is caused by this commit (with the problematic code highlighted)

commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
Author: Peter Zijlstra <[email protected]>
Date: Tue Sep 1 10:34:38 2009 +0200

sched: Try to deal with low capacity

@@ -4203,15 +4223,18 @@ find_busiest_queue()
...
for_each_cpu(i, sched_group_cpus(group)) {
+ unsigned long power = power_of(i);

...

- wl = weighted_cpuload(i);
+ wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
+ wl /= power;

- if (rq->nr_running == 1 && wl > imbalance)
+ if (capacity && rq->nr_running == 1 && wl > imbalance)
continue;

On an SMT system, the power of an HT logical cpu will be 589 and
the scheduler load imbalance (for scenarios like the one mentioned above)
can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
the weighted load with the power results in "wl > imbalance", ultimately
making find_busiest_queue() return NULL and load_balance() think that
the load is well balanced. But in fact one of the tasks can be moved to
the idle core for optimal performance.

We don't need to use the weighted load (wl) scaled by the cpu power to
compare with the imbalance. In that condition, we already know there is
only a single task ("rq->nr_running == 1") and the comparison between
imbalance and wl is to make sure that we select the correct priority
thread which matches the imbalance. So we really need to compare the
imbalance with the original weighted load of the cpu and not the scaled
load.

But in other conditions, where we want the most hammered (busiest) cpu,
we can use the scaled load to ensure that we consider the cpu power in
addition to the actual load on that cpu, so that we can move the load
away from the cpu that is getting hammered the most relative to its
actual capacity, as compared with the rest of the cpus in that busiest
group.

Fix it.

Reported-by: Ma Ling <[email protected]>
Initial-Analysis-by: Zhang, Yanmin <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Cc: [email protected] [2.6.32.x]
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..bef5369 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
continue;

rq = cpu_rq(i);
- wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
- wl /= power;
+ wl = weighted_cpuload(i);

+ /*
+ * When comparing with imbalance, use weighted_cpuload()
+ * which is not scaled with the cpu power.
+ */
if (capacity && rq->nr_running == 1 && wl > imbalance)
continue;

+ /*
+ * For the load comparisons with the other cpu's, consider
+ * the weighted_cpuload() scaled with the cpu power, so that
+ * the load can be moved away from the cpu that is potentially
+ * running at a lower capacity.
+ */
+ wl = (wl * SCHED_LOAD_SCALE) / power;
+
if (wl > max_load) {
max_load = wl;
busiest = rq;


2010-02-13 01:33:25

by Suresh Siddha

Subject: change in sched cpu_power causing regressions with SCHED_MC

Peterz,

We have one more problem that Yanmin and Ling Ma reported. On dual
socket quad-core platforms (for example platforms based on NHM-EP), we
are seeing scenarios where one socket is completely busy (with all 4
cores running 4 tasks) and the other socket is completely idle.

This causes performance issues as those 4 tasks share the memory
controller, last-level cache bandwidth, etc. Also we won't be taking
advantage of turbo mode as much as we would like. We get all these
benefits if we move two of those tasks to the other socket: both
sockets can then potentially go into turbo etc. and improve performance.

In short, your recent change (shown below) broke this behavior. At the
kernel summit you mentioned you made this change without affecting the
behavior of SMT/MC, and my testing immediately after the kernel summit
also didn't show the problem (perhaps my test didn't hit this specific
change). But apparently we are having performance issues with this patch
(Ling Ma's bisect pointed to it). I will look into this in more detail
after the long weekend (to see if we can catch this scenario in
fix_small_imbalance() etc.), but wanted to give you a quick heads up.
Thanks.

commit f93e65c186ab3c05ce2068733ca10e34fd00125e
Author: Peter Zijlstra <[email protected]>
Date: Tue Sep 1 10:34:32 2009 +0200

sched: Restore __cpu_power to a straight sum of power

cpu_power is supposed to be a representation of the process
capacity of the cpu, not a value to randomly tweak in order to
affect placement.

Remove the placement hacks.

Signed-off-by: Peter Zijlstra <[email protected]>
Tested-by: Andreas Herrmann <[email protected]>
Acked-by: Andreas Herrmann <[email protected]>
Acked-by: Gautham R Shenoy <[email protected]>
Cc: Balbir Singh <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

diff --git a/kernel/sched.c b/kernel/sched.c
index da1edc8..584a122 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8464,15 +8464,13 @@ static void free_sched_groups(const struct cpumask *cpu_map,
* there are asymmetries in the topology. If there are asymmetries, group
* having more cpu_power will pickup more load compared to the group having
* less cpu_power.
- *
- * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
- * the maximum number of tasks a group can handle in the presence of other idle
- * or lightly loaded groups in the same sched domain.
*/
static void init_sched_groups_power(int cpu, struct sched_domain *sd)
{
struct sched_domain *child;
struct sched_group *group;
+ long power;
+ int weight;

WARN_ON(!sd || !sd->groups);

@@ -8483,22 +8481,20 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)

sd->groups->__cpu_power = 0;

- /*
- * For perf policy, if the groups in child domain share resources
- * (for example cores sharing some portions of the cache hierarchy
- * or SMT), then set this domain groups cpu_power such that each group
- * can handle only one task, when there are other idle groups in the
- * same sched domain.
- */
- if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
- (child->flags &
- (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
- sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
+ if (!child) {
+ power = SCHED_LOAD_SCALE;
+ weight = cpumask_weight(sched_domain_span(sd));
+ /*
+ * SMT siblings share the power of a single core.
+ */
+ if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
+ power /= weight;
+ sg_inc_cpu_power(sd->groups, power);
return;
}

/*
- * add cpu_power of each child group to this groups cpu_power
+ * Add cpu_power of each child group to this groups cpu_power.
*/
group = child->groups;
do {






2010-02-13 10:36:13

by Peter Zijlstra

Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-12 at 17:31 -0800, Suresh Siddha wrote:
>
> We have one more problem that Yanmin and Ling Ma reported. On a dual
> socket quad-core platforms (for example platforms based on NHM-EP), we
> are seeing scenarios where one socket is completely busy (with all the 4
> cores running with 4 tasks) and another socket is completely idle.
>
> This causes performance issues as those 4 tasks share the memory
> controller, last-level cache bandwidth etc. Also we won't be taking
> advantage of turbo-mode as much as we like. We will have all these
> benefits if we move two of those tasks to the other socket. Now both the
> sockets can potentially go to turbo etc and improve performance.
>
> In short, your recent change (shown below) broke this behavior. In the
> kernel summit you mentioned you made this change with out affecting the
> behavior of SMT/MC. And my testing immediately after kernel-summit also
> didn't show the problem (perhaps my test didn't hit this specific
> change). But apparently we are having performance issues with this patch
> (Ling Ma's bisect pointed to this patch). I will look more detailed into
> this after the long weekend (to see if we can catch this scenario in
> fix_small_imbalance() etc). But wanted to give you a quick heads up.
> Thanks.

Right, so the behaviour we want should be provided by SD_PREFER_SIBLING;
it provides the capacity==1 thing the cpu_power games used to provide.

Not saying it's not broken, but that's where we should be looking to
fix it.
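
For illustration, a rough user-space sketch of that capacity==1 clamp,
reusing the cpu_power numbers quoted elsewhere in this thread (the real
logic lives in update_sg_lb_stats()/update_sd_lb_stats(); this is only
a toy model):

/* Toy model, not kernel code: with SD_PREFER_SIBLING the group capacity is
 * clamped to 1, so a group that already runs one task looks "over capacity"
 * and the balancer tries to spread, mimicking the old cpu_power tricks. */
#include <stdio.h>

#define SCHED_LOAD_SCALE        1024UL
#define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))

static unsigned long group_capacity(unsigned long cpu_power, int prefer_sibling)
{
	unsigned long cap = DIV_ROUND_CLOSEST(cpu_power, SCHED_LOAD_SCALE);

	if (prefer_sibling && cap > 1)
		cap = 1;	/* min(sgs.group_capacity, 1UL) in the kernel */
	return cap;
}

int main(void)
{
	unsigned long socket = 4 * 1178;	/* quad-core socket, HT siblings at 589 each */

	printf("capacity without SD_PREFER_SIBLING: %lu\n", group_capacity(socket, 0)); /* 5 */
	printf("capacity with    SD_PREFER_SIBLING: %lu\n", group_capacity(socket, 1)); /* 1 */
	return 0;
}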

2010-02-13 10:41:56

by Peter Zijlstra

Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Sat, 2010-02-13 at 11:36 +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-12 at 17:31 -0800, Suresh Siddha wrote:
> >
> > We have one more problem that Yanmin and Ling Ma reported. On a dual
> > socket quad-core platforms (for example platforms based on NHM-EP), we
> > are seeing scenarios where one socket is completely busy (with all the 4
> > cores running with 4 tasks) and another socket is completely idle.
> >
> > This causes performance issues as those 4 tasks share the memory
> > controller, last-level cache bandwidth etc. Also we won't be taking
> > advantage of turbo-mode as much as we like. We will have all these
> > benefits if we move two of those tasks to the other socket. Now both the
> > sockets can potentially go to turbo etc and improve performance.
> >
> > In short, your recent change (shown below) broke this behavior. In the
> > kernel summit you mentioned you made this change with out affecting the
> > behavior of SMT/MC. And my testing immediately after kernel-summit also
> > didn't show the problem (perhaps my test didn't hit this specific
> > change). But apparently we are having performance issues with this patch
> > (Ling Ma's bisect pointed to this patch). I will look more detailed into
> > this after the long weekend (to see if we can catch this scenario in
> > fix_small_imbalance() etc). But wanted to give you a quick heads up.
> > Thanks.
>
> Right, so the behaviour we want should be provided by SD_PREFER_SIBLING,
> it provides the capacity==1 thing the cpu_power games used to provide.
>
> Not saying it's not broken, but that's where the we should be looking to
> fix it.


BTW, do you think it's possible to automate such test cases and put
them in a test suite?

2010-02-13 18:28:00

by Vaidyanathan Srinivasan

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

* Suresh Siddha <[email protected]> [2010-02-12 17:14:22]:

> PeterZ/Ingo,
>
> Ling Ma and Yanmin reported this SMT scheduler regression which lead to
> the condition where both the SMT threads on a core are busy while the
> other cores in the same socket are completely idle, causing major
> performance regression. I have appended a fix for this. This is
> relatively low risk fix and if you agree with both the fix and
> risk-assessment, can we please push this to Linus so that we can address
> this in 2.6.33.

Hi Suresh,

I have been looking at this issue in order to make
sched_smt_powersavings work. In my simple tests I find that the
default behavior is to place one task per core first, since the total
cpu power of the core will be 1178 (589*2), which is not sufficient to
keep two tasks balanced in the group.

In the scenario that you have described, even though the group has
been identified as busiest, find_busiest_queue() will return NULL
since wl will be 1780 (load(1024) * SCHED_LOAD_SCALE / power(589)),
which is greater than the imbalance.

The fix that you have posted will solve the problem described.
However we need to make sched_smt_powersavings also work by increasing
the group capacity and allowing two tasks to run in a core.

As Peter mentioned, the SD_PREFER_SIBLING flag is meant to spread the
work across groups at any sched domain, so that the solution will also
work for pre-Nehalem quad cores. But it still needs some work to get it
right. Please refer to my earlier bug report at:

http://lkml.org/lkml/2010/2/8/80

The solution you have posted will not work for non-HT quad cores,
where we want the tasks to be spread across cache domains for best
performance, though there the regression is not as severe as in the
Nehalem case.

I will test your solution in different scenarios and post updates.

Thanks,
Vaidy


> thanks,
> suresh
> ---
>
> From: Suresh Siddha <[email protected]>
> Subject: sched: fix SMT scheduler regression in find_busiest_queue()
>
> Fix a SMT scheduler performance regression that is leading to a scenario
> where SMT threads in one core are completely idle while both the SMT threads
> in another core (on the same socket) are busy.
>
> This is caused by this commit (with the problematic code highlighted)
>
> commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
> Author: Peter Zijlstra <[email protected]>
> Date: Tue Sep 1 10:34:38 2009 +0200
>
> sched: Try to deal with low capacity
>
> @@ -4203,15 +4223,18 @@ find_busiest_queue()
> ...
> for_each_cpu(i, sched_group_cpus(group)) {
> + unsigned long power = power_of(i);
>
> ...
>
> - wl = weighted_cpuload(i);
> + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> + wl /= power;
>
> - if (rq->nr_running == 1 && wl > imbalance)
> + if (capacity && rq->nr_running == 1 && wl > imbalance)
> continue;
>
> On a SMT system, power of the HT logical cpu will be 589 and
> the scheduler load imbalance (for scenarios like the one mentioned above)
> can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
> the weighted load with the power will result in "wl > imbalance" and
> ultimately resulting in find_busiest_queue() return NULL, causing
> load_balance() to think that the load is well balanced. But infact
> one of the tasks can be moved to the idle core for optimal performance.
>
> We don't need to use the weighted load (wl) scaled by the cpu power to
> compare with imabalance. In that condition, we already know there is only a
> single task "rq->nr_running == 1" and the comparison between imbalance,
> wl is to make sure that we select the correct priority thread which matches
> imbalance. So we really need to compare the imabalnce with the original
> weighted load of the cpu and not the scaled load.
>
> But in other conditions where we want the most hammered(busiest) cpu, we can
> use scaled load to ensure that we consider the cpu power in addition to the
> actual load on that cpu, so that we can move the load away from the
> guy that is getting most hammered with respect to the actual capacity,
> as compared with the rest of the cpu's in that busiest group.
>
> Fix it.
>
> Reported-by: Ma Ling <[email protected]>
> Initial-Analysis-by: Zhang, Yanmin <[email protected]>
> Signed-off-by: Suresh Siddha <[email protected]>
> Cc: [email protected] [2.6.32.x]
> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..bef5369 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> continue;
>
> rq = cpu_rq(i);
> - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> - wl /= power;
> + wl = weighted_cpuload(i);
>
> + /*
> + * When comparing with imbalance, use weighted_cpuload()
> + * which is not scaled with the cpu power.
> + */
> if (capacity && rq->nr_running == 1 && wl > imbalance)
> continue;
>
> + /*
> + * For the load comparisons with the other cpu's, consider
> + * the weighted_cpuload() scaled with the cpu power, so that
> + * the load can be moved away from the cpu that is potentially
> + * running at a lower capacity.
> + */
> + wl = (wl * SCHED_LOAD_SCALE) / power;
> +
> if (wl > max_load) {
> max_load = wl;
> busiest = rq;
>
>

2010-02-13 18:34:10

by Vaidyanathan Srinivasan

Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

* Suresh Siddha <[email protected]> [2010-02-12 17:31:19]:

> Peterz,
>
> We have one more problem that Yanmin and Ling Ma reported. On a dual
> socket quad-core platforms (for example platforms based on NHM-EP), we
> are seeing scenarios where one socket is completely busy (with all the 4
> cores running with 4 tasks) and another socket is completely idle.
>
> This causes performance issues as those 4 tasks share the memory
> controller, last-level cache bandwidth etc. Also we won't be taking
> advantage of turbo-mode as much as we like. We will have all these
> benefits if we move two of those tasks to the other socket. Now both the
> sockets can potentially go to turbo etc and improve performance.
>
> In short, your recent change (shown below) broke this behavior. In the
> kernel summit you mentioned you made this change with out affecting the
> behavior of SMT/MC. And my testing immediately after kernel-summit also
> didn't show the problem (perhaps my test didn't hit this specific
> change). But apparently we are having performance issues with this patch
> (Ling Ma's bisect pointed to this patch). I will look more detailed into
> this after the long weekend (to see if we can catch this scenario in
> fix_small_imbalance() etc). But wanted to give you a quick heads up.
> Thanks.
>
> commit f93e65c186ab3c05ce2068733ca10e34fd00125e
> Author: Peter Zijlstra <[email protected]>
> Date: Tue Sep 1 10:34:32 2009 +0200
>
> sched: Restore __cpu_power to a straight sum of power
>
> cpu_power is supposed to be a representation of the process
> capacity of the cpu, not a value to randomly tweak in order to
> affect placement.
>
> Remove the placement hacks.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Tested-by: Andreas Herrmann <[email protected]>
> Acked-by: Andreas Herrmann <[email protected]>
> Acked-by: Gautham R Shenoy <[email protected]>
> Cc: Balbir Singh <[email protected]>
> LKML-Reference: <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index da1edc8..584a122 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8464,15 +8464,13 @@ static void free_sched_groups(const struct cpumask *cpu_map,
> * there are asymmetries in the topology. If there are asymmetries, group
> * having more cpu_power will pickup more load compared to the group having
> * less cpu_power.
> - *
> - * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
> - * the maximum number of tasks a group can handle in the presence of other idle
> - * or lightly loaded groups in the same sched domain.
> */
> static void init_sched_groups_power(int cpu, struct sched_domain *sd)
> {
> struct sched_domain *child;
> struct sched_group *group;
> + long power;
> + int weight;
>
> WARN_ON(!sd || !sd->groups);
>
> @@ -8483,22 +8481,20 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
>
> sd->groups->__cpu_power = 0;
>
> - /*
> - * For perf policy, if the groups in child domain share resources
> - * (for example cores sharing some portions of the cache hierarchy
> - * or SMT), then set this domain groups cpu_power such that each group
> - * can handle only one task, when there are other idle groups in the
> - * same sched domain.
> - */
> - if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
> - (child->flags &
> - (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
> - sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
> + if (!child) {
> + power = SCHED_LOAD_SCALE;
> + weight = cpumask_weight(sched_domain_span(sd));
> + /*
> + * SMT siblings share the power of a single core.
> + */
> + if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
> + power /= weight;
> + sg_inc_cpu_power(sd->groups, power);
> return;
> }
>
> /*
> - * add cpu_power of each child group to this groups cpu_power
> + * Add cpu_power of each child group to this groups cpu_power.
> */
> group = child->groups;
> do {
>

I have hit the same problem in older non-HT quad cores also.
(http://lkml.org/lkml/2010/2/8/80)

The following condition in find_busiest_group()
sds.max_load <= sds.busiest_load_per_task

treats unequally loaded groups as balanced as long as they are below
capacity.

We need to change the above condition before we hit the
fix_small_imbalance() step.
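
For illustration, a toy calculation of that comparison with the numbers
used in this thread (one quad-core socket running 4 nice-0 tasks, the
other socket idle). The statistics are simplified, so treat this as a
sketch of the check rather than the real find_busiest_group() math:

/* Toy numbers for the check quoted above:
 * sds.max_load <= sds.busiest_load_per_task */
#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

int main(void)
{
	unsigned long nr_tasks    = 4;
	unsigned long task_load   = 1024;			/* nice-0 weight */
	unsigned long group_load  = nr_tasks * task_load;	/* busiest socket */
	unsigned long group_power = 4 * SCHED_LOAD_SCALE;	/* 4 non-HT cores */

	unsigned long max_load = group_load * SCHED_LOAD_SCALE / group_power;	/* 1024 */
	unsigned long load_per_task = group_load / nr_tasks;			/* 1024 */

	printf("max_load=%lu busiest_load_per_task=%lu -> %s\n",
	       max_load, load_per_task,
	       max_load <= load_per_task ? "treated as balanced"
					 : "imbalance detected");
	return 0;
}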

--Vaidy

2010-02-13 18:37:45

by Vaidyanathan Srinivasan

Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

* Peter Zijlstra <[email protected]> [2010-02-13 11:42:04]:

> On Sat, 2010-02-13 at 11:36 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-02-12 at 17:31 -0800, Suresh Siddha wrote:
> > >
> > > We have one more problem that Yanmin and Ling Ma reported. On a dual
> > > socket quad-core platforms (for example platforms based on NHM-EP), we
> > > are seeing scenarios where one socket is completely busy (with all the 4
> > > cores running with 4 tasks) and another socket is completely idle.
> > >
> > > This causes performance issues as those 4 tasks share the memory
> > > controller, last-level cache bandwidth etc. Also we won't be taking
> > > advantage of turbo-mode as much as we like. We will have all these
> > > benefits if we move two of those tasks to the other socket. Now both the
> > > sockets can potentially go to turbo etc and improve performance.
> > >
> > > In short, your recent change (shown below) broke this behavior. In the
> > > kernel summit you mentioned you made this change with out affecting the
> > > behavior of SMT/MC. And my testing immediately after kernel-summit also
> > > didn't show the problem (perhaps my test didn't hit this specific
> > > change). But apparently we are having performance issues with this patch
> > > (Ling Ma's bisect pointed to this patch). I will look more detailed into
> > > this after the long weekend (to see if we can catch this scenario in
> > > fix_small_imbalance() etc). But wanted to give you a quick heads up.
> > > Thanks.
> >
> > Right, so the behaviour we want should be provided by SD_PREFER_SIBLING,
> > it provides the capacity==1 thing the cpu_power games used to provide.
> >
> > Not saying it's not broken, but that's where the we should be looking to
> > fix it.
>
>
> BTW, do you think its possible to automate such test cases and put them
> in a test-suite?

Linux Test Project (LTP) has some test cases for sched_mc/smt
balancing. They can be improved and generalized to include all basic
scheduler task placement checks.

--Vaidy

2010-02-13 18:40:10

by Vaidyanathan Srinivasan

Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

* Peter Zijlstra <[email protected]> [2010-02-13 11:36:28]:

> On Fri, 2010-02-12 at 17:31 -0800, Suresh Siddha wrote:
> >
> > We have one more problem that Yanmin and Ling Ma reported. On a dual
> > socket quad-core platforms (for example platforms based on NHM-EP), we
> > are seeing scenarios where one socket is completely busy (with all the 4
> > cores running with 4 tasks) and another socket is completely idle.
> >
> > This causes performance issues as those 4 tasks share the memory
> > controller, last-level cache bandwidth etc. Also we won't be taking
> > advantage of turbo-mode as much as we like. We will have all these
> > benefits if we move two of those tasks to the other socket. Now both the
> > sockets can potentially go to turbo etc and improve performance.
> >
> > In short, your recent change (shown below) broke this behavior. In the
> > kernel summit you mentioned you made this change with out affecting the
> > behavior of SMT/MC. And my testing immediately after kernel-summit also
> > didn't show the problem (perhaps my test didn't hit this specific
> > change). But apparently we are having performance issues with this patch
> > (Ling Ma's bisect pointed to this patch). I will look more detailed into
> > this after the long weekend (to see if we can catch this scenario in
> > fix_small_imbalance() etc). But wanted to give you a quick heads up.
> > Thanks.
>
> Right, so the behaviour we want should be provided by SD_PREFER_SIBLING,
> it provides the capacity==1 thing the cpu_power games used to provide.
>
> Not saying it's not broken, but that's where the we should be looking to
> fix it.

Yes, making SD_PREFER_SIBLING work as expected will help spread tasks
at all sched_domain levels and will integrate well with powersavings
balance, where we want to negate this effect.

--Vaidy

2010-02-13 18:40:50

by Suresh Siddha

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

On Sat, 2010-02-13 at 11:27 -0700, Vaidyanathan Srinivasan wrote:
> The fix that you have posted will solve the problem described.

Thanks. This SMT scheduler regression is critical for performance, and
I would like Ingo/PeterZ to push this to Linus as soon as possible. We
can fix the other known issues when we have patches ready and
acceptable to everyone. Agree?

> However we need to make sched_smt_powersavings also work by increasing
> the group capacity and allowing two tasks to run in a core.

I don't think you are saying that this patch breaks
sched_smt_powersavings? If so, we need to address the power-saving
aspect differently. At least that is not as critical, as we don't have
any customer who is using the smt/mc powersavings tunables.

> As Peter mentioned, SD_PREFER_SIBLING flag is meant to spread the work
> across group at any sched domain so that the solution will work for
> pre-Nehalem quad cores also. But it still needs some work to get it
> right.

Agree.

> The solution you have posted will not work for non-HT quad cores where
> we want the tasks to be spread across cache domains for best
> performance though not a severe performance regression as in the case
> of Nehalem.

This is a completely different issue from this patch, and I have
started another thread for it.

thanks
suresh

2010-02-13 18:50:45

by Suresh Siddha

Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Sat, 2010-02-13 at 10:37 -0800, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2010-02-13 11:42:04]:
> > BTW, do you think its possible to automate such test cases and put them
> > in a test-suite?
>
> Linux Test Project (LTP) has some test cases for sched_mc/smt
> balancing.

It is very unfortunate that neither this nor the Linux kernel
performance tests that Yanmin and co. track caught these problems much
earlier. Ling Ma's focused tests/analysis for something else started
showing these basic issues.

> They can be improved and generalized to include all basic
> scheduler task placement checks.

Probably it is best if we can include the very basic checks/tests in the
kernel itself. More focused and extensive tests can be done outside the
kernel.

thanks,
suresh

2010-02-13 18:56:18

by Vaidyanathan Srinivasan

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

* Suresh Siddha <[email protected]> [2010-02-13 10:39:36]:

> On Sat, 2010-02-13 at 11:27 -0700, Vaidyanathan Srinivasan wrote:
> > The fix that you have posted will solve the problem described.
>
> Thanks. This SMT scheduler regression is critical for performance and
> would like Ingo/Peterz to push this to Linus as soon as possible. We can
> fix other known issues when we have patches ready and acceptable to
> everyone. Agree?

Yes, Agreed.

> > However we need to make sched_smt_powersavings also work by increasing
> > the group capacity and allowing two tasks to run in a core.
>
> I don't think you saying that this patch breaks sched_smt_powersavings?
> If so, We need to address power-saving aspect differently. Atleast this
> is not as critical, as we don't have any customer who is using the
> smt/mc powersavings tunables.

Correct. This patch does not break sched_smt_powersavings; an
additional change in group capacity is needed. More work is needed,
but nothing to hold against this fix.

We would want customers to start using powersavings tunables and make
them work reliably. However, I agree that performance comes first :)

> > As Peter mentioned, SD_PREFER_SIBLING flag is meant to spread the work
> > across group at any sched domain so that the solution will work for
> > pre-Nehalem quad cores also. But it still needs some work to get it
> > right.
>
> Agree.
>
> > The solution you have posted will not work for non-HT quad cores where
> > we want the tasks to be spread across cache domains for best
> > performance though not a severe performance regression as in the case
> > of Nehalem.
>
> This is completely different issue from this patch and I started another
> thread for this.

Correct. We can incrementally solve the balancing for different scenarios.

--Vaidy

2010-02-13 20:26:08

by Vaidyanathan Srinivasan

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

* Vaidyanathan Srinivasan <[email protected]> [2010-02-13 23:57:48]:

> * Suresh Siddha <[email protected]> [2010-02-12 17:14:22]:
>
> > PeterZ/Ingo,
> >
> > Ling Ma and Yanmin reported this SMT scheduler regression which lead to
> > the condition where both the SMT threads on a core are busy while the
> > other cores in the same socket are completely idle, causing major
> > performance regression. I have appended a fix for this. This is
> > relatively low risk fix and if you agree with both the fix and
> > risk-assessment, can we please push this to Linus so that we can address
> > this in 2.6.33.
>
> Hi Suresh,
>
> I have been looking at this issue in order to make
> sched_smt_powersavings work. In my simple tests I find that the
> default behavior is to have one task per core first since the total
> cpu power of the core will be 1178 (589*2) that is not sufficient to
> keep two tasks balanced in the group.
>
> In the scenario that you have described, even though the group has
> been identified as busiest, the find_busiest_queue() will return null
> since wl will be 1780 {load(1024)*SCHED_LOAD_SCALE/power(589)} leading
> to wl being greater than imbalance.
>
> The fix that you have posted will solve the problem described.
> However we need to make sched_smt_powersavings also work by increasing
> the group capacity and allowing two tasks to run in a core.
>
> As Peter mentioned, SD_PREFER_SIBLING flag is meant to spread the work
> across group at any sched domain so that the solution will work for
> pre-Nehalem quad cores also. But it still needs some work to get it
> right. Please refer to my earlier bug report at:
>
> http://lkml.org/lkml/2010/2/8/80
>
> The solution you have posted will not work for non-HT quad cores where
> we want the tasks to be spread across cache domains for best
> performance though not a severe performance regression as in the case
> of Nehalem.
>
> I will test your solution in different scenarios and post updates.

I have tested this patch on Nehalem and it provides the desired result
when sched_smt/mc_powersavings=0. One task per core is placed before
sibling threads are used. However, core usage is mostly, but not
always, balanced across the different packages. Incorrect consolidation
onto SMT threads when free cores are available no longer happens once
this fix is applied.

> Thanks,
> Vaidy
>
>
> > thanks,
> > suresh
> > ---
> >
> > From: Suresh Siddha <[email protected]>
> > Subject: sched: fix SMT scheduler regression in find_busiest_queue()
> >
> > Fix a SMT scheduler performance regression that is leading to a scenario
> > where SMT threads in one core are completely idle while both the SMT threads
> > in another core (on the same socket) are busy.
> >
> > This is caused by this commit (with the problematic code highlighted)
> >
> > commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
> > Author: Peter Zijlstra <[email protected]>
> > Date: Tue Sep 1 10:34:38 2009 +0200
> >
> > sched: Try to deal with low capacity
> >
> > @@ -4203,15 +4223,18 @@ find_busiest_queue()
> > ...
> > for_each_cpu(i, sched_group_cpus(group)) {
> > + unsigned long power = power_of(i);
> >
> > ...
> >
> > - wl = weighted_cpuload(i);
> > + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > + wl /= power;
> >
> > - if (rq->nr_running == 1 && wl > imbalance)
> > + if (capacity && rq->nr_running == 1 && wl > imbalance)
> > continue;
> >
> > On a SMT system, power of the HT logical cpu will be 589 and
> > the scheduler load imbalance (for scenarios like the one mentioned above)
> > can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
> > the weighted load with the power will result in "wl > imbalance" and
> > ultimately resulting in find_busiest_queue() return NULL, causing
> > load_balance() to think that the load is well balanced. But infact
> > one of the tasks can be moved to the idle core for optimal performance.
> >
> > We don't need to use the weighted load (wl) scaled by the cpu power to
> > compare with imabalance. In that condition, we already know there is only a
> > single task "rq->nr_running == 1" and the comparison between imbalance,
> > wl is to make sure that we select the correct priority thread which matches
> > imbalance. So we really need to compare the imabalnce with the original
> > weighted load of the cpu and not the scaled load.
> >
> > But in other conditions where we want the most hammered(busiest) cpu, we can
> > use scaled load to ensure that we consider the cpu power in addition to the
> > actual load on that cpu, so that we can move the load away from the
> > guy that is getting most hammered with respect to the actual capacity,
> > as compared with the rest of the cpu's in that busiest group.
> >
> > Fix it.
> >
> > Reported-by: Ma Ling <[email protected]>
> > Initial-Analysis-by: Zhang, Yanmin <[email protected]>
> > Signed-off-by: Suresh Siddha <[email protected]>

Acked-by: Vaidyanathan Srinivasan <[email protected]>

> > Cc: [email protected] [2.6.32.x]
> > ---
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3a8fb30..bef5369 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > continue;
> >
> > rq = cpu_rq(i);
> > - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > - wl /= power;
> > + wl = weighted_cpuload(i);
> >
> > + /*
> > + * When comparing with imbalance, use weighted_cpuload()
> > + * which is not scaled with the cpu power.
> > + */
> > if (capacity && rq->nr_running == 1 && wl > imbalance)
> > continue;
> >
> > + /*
> > + * For the load comparisons with the other cpu's, consider
> > + * the weighted_cpuload() scaled with the cpu power, so that
> > + * the load can be moved away from the cpu that is potentially
> > + * running at a lower capacity.
> > + */
> > + wl = (wl * SCHED_LOAD_SCALE) / power;
> > +
> > if (wl > max_load) {
> > max_load = wl;
> > busiest = rq;
> >
> >

2010-02-13 20:36:22

by Vaidyanathan Srinivasan

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

* Vaidyanathan Srinivasan <[email protected]> [2010-02-14 01:55:52]:

> * Vaidyanathan Srinivasan <[email protected]> [2010-02-13 23:57:48]:
>
> > * Suresh Siddha <[email protected]> [2010-02-12 17:14:22]:
> >
> > > PeterZ/Ingo,
> > >
> > > Ling Ma and Yanmin reported this SMT scheduler regression which lead to
> > > the condition where both the SMT threads on a core are busy while the
> > > other cores in the same socket are completely idle, causing major
> > > performance regression. I have appended a fix for this. This is
> > > relatively low risk fix and if you agree with both the fix and
> > > risk-assessment, can we please push this to Linus so that we can address
> > > this in 2.6.33.
> >
> > Hi Suresh,
> >
> > I have been looking at this issue in order to make
> > sched_smt_powersavings work. In my simple tests I find that the
> > default behavior is to have one task per core first since the total
> > cpu power of the core will be 1178 (589*2) that is not sufficient to
> > keep two tasks balanced in the group.
> >
> > In the scenario that you have described, even though the group has
> > been identified as busiest, the find_busiest_queue() will return null
> > since wl will be 1780 {load(1024)*SCHED_LOAD_SCALE/power(589)} leading
> > to wl being greater than imbalance.
> >
> > The fix that you have posted will solve the problem described.
> > However we need to make sched_smt_powersavings also work by increasing
> > the group capacity and allowing two tasks to run in a core.
> >
> > As Peter mentioned, SD_PREFER_SIBLING flag is meant to spread the work
> > across group at any sched domain so that the solution will work for
> > pre-Nehalem quad cores also. But it still needs some work to get it
> > right. Please refer to my earlier bug report at:
> >
> > http://lkml.org/lkml/2010/2/8/80
> >
> > The solution you have posted will not work for non-HT quad cores where
> > we want the tasks to be spread across cache domains for best
> > performance though not a severe performance regression as in the case
> > of Nehalem.
> >
> > I will test your solution in different scenarios and post updates.
>
> I have tested this patch on Nehalem and it provides the desired result
> when sched_smt/mc_powersavings=0. One task per core is placed before
> using sibling threads. However the core usage is mostly balanced across
> different packages but not always. Incorrect consolidation to SMT
> threads when free cores are available does not happen once this fix is
> applied.
>
> > Thanks,
> > Vaidy
> >
> >
> > > thanks,
> > > suresh
> > > ---
> > >
> > > From: Suresh Siddha <[email protected]>
> > > Subject: sched: fix SMT scheduler regression in find_busiest_queue()
> > >
> > > Fix a SMT scheduler performance regression that is leading to a scenario
> > > where SMT threads in one core are completely idle while both the SMT threads
> > > in another core (on the same socket) are busy.
> > >
> > > This is caused by this commit (with the problematic code highlighted)
> > >
> > > commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
> > > Author: Peter Zijlstra <[email protected]>
> > > Date: Tue Sep 1 10:34:38 2009 +0200
> > >
> > > sched: Try to deal with low capacity
> > >
> > > @@ -4203,15 +4223,18 @@ find_busiest_queue()
> > > ...
> > > for_each_cpu(i, sched_group_cpus(group)) {
> > > + unsigned long power = power_of(i);
> > >
> > > ...
> > >
> > > - wl = weighted_cpuload(i);
> > > + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > + wl /= power;
> > >
> > > - if (rq->nr_running == 1 && wl > imbalance)
> > > + if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > continue;
> > >
> > > On a SMT system, power of the HT logical cpu will be 589 and
> > > the scheduler load imbalance (for scenarios like the one mentioned above)
> > > can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
> > > the weighted load with the power will result in "wl > imbalance" and
> > > ultimately resulting in find_busiest_queue() return NULL, causing
> > > load_balance() to think that the load is well balanced. But infact
> > > one of the tasks can be moved to the idle core for optimal performance.
> > >
> > > We don't need to use the weighted load (wl) scaled by the cpu power to
> > > compare with imabalance. In that condition, we already know there is only a
> > > single task "rq->nr_running == 1" and the comparison between imbalance,
> > > wl is to make sure that we select the correct priority thread which matches
> > > imbalance. So we really need to compare the imabalnce with the original
> > > weighted load of the cpu and not the scaled load.
> > >
> > > But in other conditions where we want the most hammered(busiest) cpu, we can
> > > use scaled load to ensure that we consider the cpu power in addition to the
> > > actual load on that cpu, so that we can move the load away from the
> > > guy that is getting most hammered with respect to the actual capacity,
> > > as compared with the rest of the cpu's in that busiest group.
> > >
> > > Fix it.
> > >
> > > Reported-by: Ma Ling <[email protected]>
> > > Initial-Analysis-by: Zhang, Yanmin <[email protected]>
> > > Signed-off-by: Suresh Siddha <[email protected]>
>
> Acked-by: Vaidyanathan Srinivasan <[email protected]>
>
> > > Cc: [email protected] [2.6.32.x]
> > > ---
> > >
> > > diff --git a/kernel/sched.c b/kernel/sched.c
> > > index 3a8fb30..bef5369 100644
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > continue;
> > >
> > > rq = cpu_rq(i);
> > > - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > - wl /= power;
> > > + wl = weighted_cpuload(i);
> > >
> > > + /*
> > > + * When comparing with imbalance, use weighted_cpuload()
> > > + * which is not scaled with the cpu power.
> > > + */
> > > if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > continue;
> > >
> > > + /*
> > > + * For the load comparisons with the other cpu's, consider
> > > + * the weighted_cpuload() scaled with the cpu power, so that
> > > + * the load can be moved away from the cpu that is potentially
> > > + * running at a lower capacity.
> > > + */
> > > + wl = (wl * SCHED_LOAD_SCALE) / power;
> > > +
> > > if (wl > max_load) {
> > > max_load = wl;
> > > busiest = rq;
> > >
> > >

In addition to the above fix, for sched_smt_powersavings to work, the
group capacity of the core (MC level) should be made 2 in
update_sg_lb_stats() by changing DIV_ROUND_CLOSEST to DIV_ROUND_UP():

sgs->group_capacity =
DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);

Ideally we can change this to DIV_ROUND_UP and let the
SD_PREFER_SIBLING flag force the capacity to 1. We need to see if there
are any side effects of setting SD_PREFER_SIBLING at the SIBLING-level
sched domain based on the sched_smt_powersavings flag.
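
For illustration, a quick user-space check of the two rounding choices
with the cpu_power figures from this thread (589 per SMT sibling,
SCHED_LOAD_SCALE = 1024); this is a sketch, not kernel code:

/* Illustration only: DIV_ROUND_CLOSEST vs DIV_ROUND_UP for an SMT core
 * whose two siblings contribute 589 cpu_power each (1178 total). */
#include <stdio.h>

#define SCHED_LOAD_SCALE        1024UL
#define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))
#define DIV_ROUND_UP(x, d)      (((x) + (d) - 1) / (d))

int main(void)
{
	unsigned long core_power = 2 * 589;	/* one core, two SMT siblings */

	printf("DIV_ROUND_CLOSEST: capacity = %lu\n",
	       DIV_ROUND_CLOSEST(core_power, SCHED_LOAD_SCALE));	/* 1 */
	printf("DIV_ROUND_UP:      capacity = %lu\n",
	       DIV_ROUND_UP(core_power, SCHED_LOAD_SCALE));		/* 2 */
	return 0;
}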

Will post a patch after more tests.

--Vaidy

2010-02-14 10:12:37

by Peter Zijlstra

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > > continue;
> > > >
> > > > rq = cpu_rq(i);
> > > > - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > - wl /= power;
> > > > + wl = weighted_cpuload(i);
> > > >
> > > > + /*
> > > > + * When comparing with imbalance, use weighted_cpuload()
> > > > + * which is not scaled with the cpu power.
> > > > + */
> > > > if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > > continue;
> > > >
> > > > + /*
> > > > + * For the load comparisons with the other cpu's, consider
> > > > + * the weighted_cpuload() scaled with the cpu power, so that
> > > > + * the load can be moved away from the cpu that is potentially
> > > > + * running at a lower capacity.
> > > > + */
> > > > + wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > +
> > > > if (wl > max_load) {
> > > > max_load = wl;
> > > > busiest = rq;
> > > >
> > > >
>
> In addition to the above fix, for sched_smt_powersavings to work, the
> group capacity of the core (mc level) should be made 2 in
> update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> DIV_RPUND_UP()
>
> sgs->group_capacity =
> DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
>
> Ideally we can change this to DIV_ROUND_UP and let SD_PREFER_SIBLING
> flag to force capacity to 1. Need to see if there are any side
> effects of setting SD_PREFER_SIBLING at SIBLING level sched domain
> based on sched_smt_powersavings flag.

OK, so while I think that Suresh's patch can make sense (I haven't had
time to think it through), the above really sounds wrong. Things should
not rely on the cpu_power value like that.

2010-02-15 12:35:47

by Vaidyanathan Srinivasan

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

* Peter Zijlstra <[email protected]> [2010-02-14 11:11:58]:

> On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > > > continue;
> > > > >
> > > > > rq = cpu_rq(i);
> > > > > - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > > - wl /= power;
> > > > > + wl = weighted_cpuload(i);
> > > > >
> > > > > + /*
> > > > > + * When comparing with imbalance, use weighted_cpuload()
> > > > > + * which is not scaled with the cpu power.
> > > > > + */
> > > > > if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > > > continue;
> > > > >
> > > > > + /*
> > > > > + * For the load comparisons with the other cpu's, consider
> > > > > + * the weighted_cpuload() scaled with the cpu power, so that
> > > > > + * the load can be moved away from the cpu that is potentially
> > > > > + * running at a lower capacity.
> > > > > + */
> > > > > + wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > > +
> > > > > if (wl > max_load) {
> > > > > max_load = wl;
> > > > > busiest = rq;
> > > > >
> > > > >
> >
> > In addition to the above fix, for sched_smt_powersavings to work, the
> > group capacity of the core (mc level) should be made 2 in
> > update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> > DIV_RPUND_UP()
> >
> > sgs->group_capacity =
> > DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
> >
> > Ideally we can change this to DIV_ROUND_UP and let SD_PREFER_SIBLING
> > flag to force capacity to 1. Need to see if there are any side
> > effects of setting SD_PREFER_SIBLING at SIBLING level sched domain
> > based on sched_smt_powersavings flag.
>
> OK, so while I think that Suresh' patch can make sense (haven't had time
> to think it through), the above really sounds wrong. Things should not
> rely on the cpu_power value like that.

Hi Peter,

The reason rounding is a problem is that threads have fractional
cpu_power and we lose some power in DIV_ROUND_CLOSEST(). At the MC
level a group has 2*589 = 1178, and group_capacity will always be 1 if
DIV_ROUND_CLOSEST() is used, irrespective of the SD_PREFER_SIBLING
flag.

We are reducing the group capacity here to 1 even though we have 2
sibling threads in the group. In the sched_smt_powersavings > 0 case,
the group_capacity should be 2 to allow task consolidation in this
group while leaving other groups completely idle.

DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE) will ensure any spare
capacity is rounded up and counted.

While, if SD_PREFER_SIBLING is set,

update_sd_lb_stats():
if (prefer_sibling)
sgs.group_capacity = min(sgs.group_capacity, 1UL);

will ensure the group_capacity is 1 and allows spreading of tasks.

--Vaidy

2010-02-15 13:02:09

by Peter Zijlstra

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

On Mon, 2010-02-15 at 18:05 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2010-02-14 11:11:58]:
>
> > On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > > > > continue;
> > > > > >
> > > > > > rq = cpu_rq(i);
> > > > > > - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > > > - wl /= power;
> > > > > > + wl = weighted_cpuload(i);
> > > > > >
> > > > > > + /*
> > > > > > + * When comparing with imbalance, use weighted_cpuload()
> > > > > > + * which is not scaled with the cpu power.
> > > > > > + */
> > > > > > if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > > > > continue;
> > > > > >
> > > > > > + /*
> > > > > > + * For the load comparisons with the other cpu's, consider
> > > > > > + * the weighted_cpuload() scaled with the cpu power, so that
> > > > > > + * the load can be moved away from the cpu that is potentially
> > > > > > + * running at a lower capacity.
> > > > > > + */
> > > > > > + wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > > > +
> > > > > > if (wl > max_load) {
> > > > > > max_load = wl;
> > > > > > busiest = rq;
> > > > > >
> > > > > >
> > >
> > > In addition to the above fix, for sched_smt_powersavings to work, the
> > > group capacity of the core (mc level) should be made 2 in
> > > update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> > > DIV_RPUND_UP()
> > >
> > > sgs->group_capacity =
> > > DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
> > >
> > > Ideally we can change this to DIV_ROUND_UP and let SD_PREFER_SIBLING
> > > flag to force capacity to 1. Need to see if there are any side
> > > effects of setting SD_PREFER_SIBLING at SIBLING level sched domain
> > > based on sched_smt_powersavings flag.
> >
> > OK, so while I think that Suresh' patch can make sense (haven't had time
> > to think it through), the above really sounds wrong. Things should not
> > rely on the cpu_power value like that.
>
> Hi Peter,
>
> The reason rounding is a problem is because threads have fractional
> cpu_power and we lose some power in DIV_ROUND_CLOSEST(). At MC level
> a group has 2*589=1178 and group_capacity will be 1 always if
> DIV_ROUND_CLOSEST() is used irrespective of the SD_PREFER_SIBLING
> flag.
>
> We are reducing group capacity here to 1 even though we have 2 sibling
> threads in the group. In the sched_smt_powassavings>0 case, the
> group_capacity should be 2 to allow task consolidation to this group
> while leaving other groups completely idle.
>
> DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE) will ensure any spare
> capacity is rounded up and counted.
>
> While, if SD_REFER_SIBLING is set,
>
> update_sd_lb_stats():
> if (prefer_sibling)
> sgs.group_capacity = min(sgs.group_capacity, 1UL);
>
> will ensure the group_capacity is 1 and allows spreading of tasks.

We should be weakening this link between cpu_power and capacity, not
strengthening it. What I think you want is to use
cpumask_weight(sched_group_cpus(group)) or something like that as the
capacity.

The setup for cpu_power is that it can reflect the actual capacity for
work; especially with today's asymmetric cpus, where a socket can run
at a different frequency, we need to make sure this is so.

So no, that DIV_ROUND_UP is utterly broken, as there might be many ways
for cpu_power of multiple threads/cpus to be less than the number of
cpus.

Furthermore, for powersavings it makes sense to make the capacity a
function of an overload argument/tunable, so that you can specify the
threshold of packing.

So really, cpu_power is a normalization factor to equally distribute
load across cpus that have asymmetric work capacity, if you need any
placement constraints outside of that, do _NOT_ touch cpu_power.
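
One possible reading of this suggestion, as a hedged sketch only (the
helper name and the overload knob are illustrative and do not exist in
the kernel):

/* Sketch: derive capacity from the number of cpus in the group rather than
 * from cpu_power, with an "overload" tunable deciding how far to pack.
 * Purely illustrative. */
#include <stdio.h>

static unsigned int sketch_group_capacity(unsigned int group_weight,
					  unsigned int overload_pct)
{
	/* 100% means one task per cpu; 200% allows packing two per cpu, etc. */
	unsigned int cap = group_weight * overload_pct / 100;

	return cap ? cap : 1;
}

int main(void)
{
	/* A core with two SMT siblings. */
	printf("performance (100%%): capacity = %u\n", sketch_group_capacity(2, 100));
	printf("powersave   (200%%): capacity = %u\n", sketch_group_capacity(2, 200));
	return 0;
}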

2010-02-15 22:29:50

by Peter Zijlstra

Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

On Fri, 2010-02-12 at 17:14 -0800, Suresh Siddha wrote:

> From: Suresh Siddha <[email protected]>
> Subject: sched: fix SMT scheduler regression in find_busiest_queue()
>
> Fix a SMT scheduler performance regression that is leading to a scenario
> where SMT threads in one core are completely idle while both the SMT threads
> in another core (on the same socket) are busy.
>
> This is caused by this commit (with the problematic code highlighted)
>
> commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
> Author: Peter Zijlstra <[email protected]>
> Date: Tue Sep 1 10:34:38 2009 +0200
>
> sched: Try to deal with low capacity
>
> @@ -4203,15 +4223,18 @@ find_busiest_queue()
> ...
> for_each_cpu(i, sched_group_cpus(group)) {
> + unsigned long power = power_of(i);
>
> ...
>
> - wl = weighted_cpuload(i);
> + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> + wl /= power;
>
> - if (rq->nr_running == 1 && wl > imbalance)
> + if (capacity && rq->nr_running == 1 && wl > imbalance)
> continue;
>
> On a SMT system, power of the HT logical cpu will be 589 and
> the scheduler load imbalance (for scenarios like the one mentioned above)
> can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
> the weighted load with the power will result in "wl > imbalance" and
> ultimately resulting in find_busiest_queue() return NULL, causing
> load_balance() to think that the load is well balanced. But infact
> one of the tasks can be moved to the idle core for optimal performance.
>
> We don't need to use the weighted load (wl) scaled by the cpu power to
> compare with imabalance. In that condition, we already know there is only a
> single task "rq->nr_running == 1" and the comparison between imbalance,
> wl is to make sure that we select the correct priority thread which matches
> imbalance. So we really need to compare the imabalnce with the original
> weighted load of the cpu and not the scaled load.
>
> But in other conditions where we want the most hammered(busiest) cpu, we can
> use scaled load to ensure that we consider the cpu power in addition to the
> actual load on that cpu, so that we can move the load away from the
> guy that is getting most hammered with respect to the actual capacity,
> as compared with the rest of the cpu's in that busiest group.
>
> Fix it.
>
> Reported-by: Ma Ling <[email protected]>
> Initial-Analysis-by: Zhang, Yanmin <[email protected]>
> Signed-off-by: Suresh Siddha <[email protected]>
> Cc: [email protected] [2.6.32.x]

A reproduction case would have been nice; I've been playing with busy
loops and plotting the cpus on paper, but I didn't manage to reproduce
it.
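
For reference, a minimal busy-loop reproducer sketch of the kind
mentioned here (assumes a Linux box; start it with the number of hogs
you want and watch on which logical cpus they settle):

/* Minimal busy-loop reproducer sketch: fork a few cpu hogs and periodically
 * report which logical cpu each one is running on (via sched_getcpu()). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int nr = argc > 1 ? atoi(argv[1]) : 2;	/* number of busy tasks */
	int i;

	for (i = 0; i < nr; i++) {
		if (fork() == 0) {
			unsigned long n;

			for (n = 0; ; n++)
				if ((n & 0xfffffff) == 0)
					fprintf(stderr, "task %d on cpu %d\n",
						i, sched_getcpu());
		}
	}
	for (i = 0; i < nr; i++)
		wait(NULL);
	return 0;
}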

Still, I went through the logic and it seems to make sense, so:

Acked-by: Peter Zijlstra <[email protected]>

Ingo, sed -e 's/sched\.c/sched_fair.c/g' makes it apply to tip/master
and should provide a means of resolving the rebase/merge conflict.

> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..bef5369 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> continue;
>
> rq = cpu_rq(i);
> - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> - wl /= power;
> + wl = weighted_cpuload(i);
>
> + /*
> + * When comparing with imbalance, use weighted_cpuload()
> + * which is not scaled with the cpu power.
> + */
> if (capacity && rq->nr_running == 1 && wl > imbalance)
> continue;
>
> + /*
> + * For the load comparisons with the other cpu's, consider
> + * the weighted_cpuload() scaled with the cpu power, so that
> + * the load can be moved away from the cpu that is potentially
> + * running at a lower capacity.
> + */
> + wl = (wl * SCHED_LOAD_SCALE) / power;
> +
> if (wl > max_load) {
> max_load = wl;
> busiest = rq;
>
>

2010-02-16 14:16:31

by Suresh Siddha

Subject: [tip:sched/urgent] sched: Fix SMT scheduler regression in find_busiest_queue()

Commit-ID: 9000f05c6d1607f79c0deacf42b09693be673f4c
Gitweb: http://git.kernel.org/tip/9000f05c6d1607f79c0deacf42b09693be673f4c
Author: Suresh Siddha <[email protected]>
AuthorDate: Fri, 12 Feb 2010 17:14:22 -0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Tue, 16 Feb 2010 15:13:59 +0100

sched: Fix SMT scheduler regression in find_busiest_queue()

Fix an SMT scheduler performance regression that leads to a scenario
where the SMT threads in one core are completely idle while both SMT
threads in another core (on the same socket) are busy.

This is caused by this commit (with the problematic code highlighted)

commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
Author: Peter Zijlstra <[email protected]>
Date: Tue Sep 1 10:34:38 2009 +0200

sched: Try to deal with low capacity

@@ -4203,15 +4223,18 @@ find_busiest_queue()
...
for_each_cpu(i, sched_group_cpus(group)) {
+ unsigned long power = power_of(i);

...

- wl = weighted_cpuload(i);
+ wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
+ wl /= power;

- if (rq->nr_running == 1 && wl > imbalance)
+ if (capacity && rq->nr_running == 1 && wl > imbalance)
continue;

On an SMT system, the power of an HT logical cpu will be 589 and
the scheduler load imbalance (for scenarios like the one mentioned above)
can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
the weighted load with the power results in "wl > imbalance", ultimately
making find_busiest_queue() return NULL and load_balance() think that
the load is well balanced. But in fact one of the tasks can be moved to
the idle core for optimal performance.

We don't need to use the weighted load (wl) scaled by the cpu power to
compare with the imbalance. In that condition, we already know there is
only a single task ("rq->nr_running == 1") and the comparison between
imbalance and wl is to make sure that we select the correct priority
thread which matches the imbalance. So we really need to compare the
imbalance with the original weighted load of the cpu and not the scaled
load.

But in other conditions where we want the most hammered(busiest) cpu, we can
use scaled load to ensure that we consider the cpu power in addition to the
actual load on that cpu, so that we can move the load away from the
guy that is getting most hammered with respect to the actual capacity,
as compared with the rest of the cpu's in that busiest group.

Fix it.

Reported-by: Ma Ling <[email protected]>
Initial-Analysis-by: Zhang, Yanmin <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Cc: [email protected] [2.6.32.x]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/sched.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e3199df..4d78aef 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
continue;

rq = cpu_rq(i);
- wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
- wl /= power;
+ wl = weighted_cpuload(i);

+ /*
+ * When comparing with imbalance, use weighted_cpuload()
+ * which is not scaled with the cpu power.
+ */
if (capacity && rq->nr_running == 1 && wl > imbalance)
continue;

+ /*
+ * For the load comparisons with the other cpu's, consider
+ * the weighted_cpuload() scaled with the cpu power, so that
+ * the load can be moved away from the cpu that is potentially
+ * running at a lower capacity.
+ */
+ wl = (wl * SCHED_LOAD_SCALE) / power;
+
if (wl > max_load) {
max_load = wl;
busiest = rq;
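
As a quick sanity check of the arithmetic above: with the 589 and 1024
figures from the changelog, and assuming a single nice-0 task whose
weighted load equals SCHED_LOAD_SCALE, a small user-space sketch
(illustration only, not kernel code) reproduces the comparison:

/* scale_check.c: illustration of the "wl > imbalance" arithmetic above.
 * Assumed figures: cpu_power 589 for an HT sibling, imbalance ~1024,
 * and a single nice-0 task whose weighted load equals SCHED_LOAD_SCALE. */
#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

int main(void)
{
	unsigned long power = 589;	/* HT logical cpu power quoted above */
	unsigned long imbalance = 1024;	/* ~SCHED_LOAD_SCALE */
	unsigned long wl = 1024;	/* one nice-0 task */
	unsigned long scaled = wl * SCHED_LOAD_SCALE / power;

	/* Old code compared the scaled load: 1780 > 1024, so the rq was skipped. */
	printf("scaled   wl = %4lu -> wl > imbalance: %d\n",
	       scaled, scaled > imbalance);
	/* Fixed code compares the unscaled load: 1024 > 1024 is false. */
	printf("unscaled wl = %4lu -> wl > imbalance: %d\n",
	       wl, wl > imbalance);
	return 0;
}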

2010-02-16 15:59:18

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

* Peter Zijlstra <[email protected]> [2010-02-15 14:00:43]:

> On Mon, 2010-02-15 at 18:05 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <[email protected]> [2010-02-14 11:11:58]:
> >
> > > On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > > > > > continue;
> > > > > > >
> > > > > > > rq = cpu_rq(i);
> > > > > > > - wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > > > > - wl /= power;
> > > > > > > + wl = weighted_cpuload(i);
> > > > > > >
> > > > > > > + /*
> > > > > > > + * When comparing with imbalance, use weighted_cpuload()
> > > > > > > + * which is not scaled with the cpu power.
> > > > > > > + */
> > > > > > > if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > > > > > continue;
> > > > > > >
> > > > > > > + /*
> > > > > > > + * For the load comparisons with the other cpu's, consider
> > > > > > > + * the weighted_cpuload() scaled with the cpu power, so that
> > > > > > > + * the load can be moved away from the cpu that is potentially
> > > > > > > + * running at a lower capacity.
> > > > > > > + */
> > > > > > > + wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > > > > +
> > > > > > > if (wl > max_load) {
> > > > > > > max_load = wl;
> > > > > > > busiest = rq;
> > > > > > >
> > > > > > >
> > > >
> > > > In addition to the above fix, for sched_smt_powersavings to work, the
> > > > group capacity of the core (mc level) should be made 2 in
> > > > update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> > > > DIV_ROUND_UP()
> > > >
> > > > sgs->group_capacity =
> > > > DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
> > > >
> > > > Ideally we can change this to DIV_ROUND_UP and let SD_PREFER_SIBLING
> > > > flag to force capacity to 1. Need to see if there are any side
> > > > effects of setting SD_PREFER_SIBLING at SIBLING level sched domain
> > > > based on sched_smt_powersavings flag.
> > >
> > > OK, so while I think that Suresh' patch can make sense (haven't had time
> > > to think it through), the above really sounds wrong. Things should not
> > > rely on the cpu_power value like that.
> >
> > Hi Peter,
> >
> > The reason rounding is a problem is because threads have fractional
> > cpu_power and we lose some power in DIV_ROUND_CLOSEST(). At MC level
> > a group has 2*589=1178 and group_capacity will be 1 always if
> > DIV_ROUND_CLOSEST() is used irrespective of the SD_PREFER_SIBLING
> > flag.
> >
> > We are reducing group capacity here to 1 even though we have 2 sibling
> > threads in the group. In the sched_smt_powersavings>0 case, the
> > group_capacity should be 2 to allow task consolidation to this group
> > while leaving other groups completely idle.
> >
> > DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE) will ensure any spare
> > capacity is rounded up and counted.
> >
> > While, if SD_PREFER_SIBLING is set,
> >
> > update_sd_lb_stats():
> > if (prefer_sibling)
> > sgs.group_capacity = min(sgs.group_capacity, 1UL);
> >
> > will ensure the group_capacity is 1 and allows spreading of tasks.
>
> We should be weakening this link between cpu_power and capacity, not
> strengthening it. What I think you want is to use
> cpumask_weight(sched_group_cpus(group)) or something as capacity.

Yes, this is a good suggestion and elegant solution to the problem.
I have a patch attached using this approach.

> The setup for cpu_power is that it can reflect the actual capacity for
> work, esp with todays asymmetric cpus where a socket can run on a
> different frequency we need to make sure this is so.
>
> So no, that DIV_ROUND_UP is utterly broken, as there might be many ways
> for cpu_power of multiple threads/cpus to be less than the number of
> cpus.
>
> Furthermore, for powersavings it makes sense to make the capacity a
> function of an overload argument/tunable, so that you can specify the
> threshold of packing.
>
> So really, cpu_power is a normalization factor to equally distribute
> load across cpus that have asymmetric work capacity, if you need any
> placement constraints outside of that, do _NOT_ touch cpu_power.

Agreed. Placement control should be handled by SD_PREFER_SIBLING
and SD_POWER_SAVINGS flags.

--Vaidy

---

sched_smt_powersavings for threaded systems needs this fix for
consolidation onto sibling threads to work. Since threads have
fractional capacity, group_capacity will always turn out to be one
and will not accommodate another task on the sibling thread.

This fix makes group_capacity a function of cpumask_weight, which
will enable the power saving load balancer to pack tasks among
sibling threads and keep more cores idle.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 522cf0e..ec3a5c5 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2538,9 +2538,17 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
* In case the child domain prefers tasks go to siblings
* first, lower the group capacity to one so that we'll try
* and move all the excess tasks away.
+ * If power savings balance is set at this domain, then
+ * make capacity equal to number of hardware threads to
+ * accomodate more tasks until capacity is reached. The
+ * default is fractional capacity for sibling hardware
+ * threads for fair use of available hardware resources.
*/
if (prefer_sibling)
sgs.group_capacity = min(sgs.group_capacity, 1UL);
+ else if (sd->flags & SD_POWERSAVINGS_BALANCE)
+ sgs.group_capacity =
+ cpumask_weight(sched_group_cpus(group));

if (local_group) {
sds->this_load = sgs.avg_load;
@@ -2855,7 +2863,8 @@ static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle)
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return 0;

- if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP &&
+ sched_smt_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
return 0;
}
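
To put numbers on the fractional-capacity rounding discussed above: with
two SMT siblings at the cpu_power of 589 quoted earlier in the thread, a
small user-space sketch shows how the capacity collapses to 1 (the macros
just mirror the kernel's rounding helpers; illustration only, not kernel
code):

/* capacity_rounding.c: illustration of group_capacity rounding with
 * fractional SMT cpu_power (assumed value 589 per sibling, as quoted
 * earlier in the thread). */
#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))
#define DIV_ROUND_UP(x, d)	(((x) + (d) - 1) / (d))

int main(void)
{
	unsigned long group_power = 2 * 589;	/* two HT siblings */

	/* Current rounding: capacity collapses to 1 despite two threads. */
	printf("DIV_ROUND_CLOSEST(%lu, %lu) = %lu\n", group_power,
	       SCHED_LOAD_SCALE, DIV_ROUND_CLOSEST(group_power, SCHED_LOAD_SCALE));
	/* Rounding up would give 2 ... */
	printf("DIV_ROUND_UP(%lu, %lu)      = %lu\n", group_power,
	       SCHED_LOAD_SCALE, DIV_ROUND_UP(group_power, SCHED_LOAD_SCALE));
	/* ... as does counting the hardware threads, which is what the
	 * patch above does via cpumask_weight(sched_group_cpus(group)). */
	printf("cpumask_weight(siblings)    = 2\n");
	return 0;
}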

2010-02-16 17:29:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

On Tue, 2010-02-16 at 21:29 +0530, Vaidyanathan Srinivasan wrote:
> Agreed. Placement control should be handled by SD_PREFER_SIBLING
> and SD_POWER_SAVINGS flags.
>
> --Vaidy
>
> ---
>
> sched_smt_powersavings for threaded systems need this fix for
> consolidation to sibling threads to work. Since threads have
> fractional capacity, group_capacity will turn out to be one
> always and not accommodate another task in the sibling thread.
>
> This fix makes group_capacity a function of cpumask_weight that
> will enable the power saving load balancer to pack tasks among
> sibling threads and keep more cores idle.
>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 522cf0e..ec3a5c5 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2538,9 +2538,17 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> * In case the child domain prefers tasks go to siblings
> * first, lower the group capacity to one so that we'll try
> * and move all the excess tasks away.

I prefer a blank line in between two paragraphs, but even better would
be to place this comment at the else if site.

> + * If power savings balance is set at this domain, then
> + * make capacity equal to number of hardware threads to
> + * accomodate more tasks until capacity is reached. The

my spell checker seems to prefer: accommodate

> + * default is fractional capacity for sibling hardware
> + * threads for fair use of available hardware resources.
> */
> if (prefer_sibling)
> sgs.group_capacity = min(sgs.group_capacity, 1UL);
> + else if (sd->flags & SD_POWERSAVINGS_BALANCE)
> + sgs.group_capacity =
> + cpumask_weight(sched_group_cpus(group));

I guess we should apply cpu_active_mask so that we properly deal with
offline siblings, except with cpumasks being the beasts they are I see
no cheap way to do that.

> if (local_group) {
> sds->this_load = sgs.avg_load;
> @@ -2855,7 +2863,8 @@ static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle)
> !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> return 0;
>
> - if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
> + if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP &&
> + sched_smt_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
> return 0;
> }

/me still hopes for that unification patch.. :-)

2010-02-16 18:25:42

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

* Peter Zijlstra <[email protected]> [2010-02-16 18:28:44]:

> On Tue, 2010-02-16 at 21:29 +0530, Vaidyanathan Srinivasan wrote:
> > Agreed. Placement control should be handled by SD_PREFER_SIBLING
> > and SD_POWER_SAVINGS flags.
> >
> > --Vaidy
> >
> > ---
> >
> > sched_smt_powersavings for threaded systems need this fix for
> > consolidation to sibling threads to work. Since threads have
> > fractional capacity, group_capacity will turn out to be one
> > always and not accommodate another task in the sibling thread.
> >
> > This fix makes group_capacity a function of cpumask_weight that
> > will enable the power saving load balancer to pack tasks among
> > sibling threads and keep more cores idle.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> >
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index 522cf0e..ec3a5c5 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -2538,9 +2538,17 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> > * In case the child domain prefers tasks go to siblings
> > * first, lower the group capacity to one so that we'll try
> > * and move all the excess tasks away.
>
> I prefer a blank line in between two paragraphs, but even better would
> be to place this comment at the else if site.
>
> > + * If power savings balance is set at this domain, then
> > + * make capacity equal to number of hardware threads to
> > + * accomodate more tasks until capacity is reached. The
>
> my spell checker seems to prefer: accommodate

ok, will fix the comment.

> > + * default is fractional capacity for sibling hardware
> > + * threads for fair use of available hardware resources.
> > */
> > if (prefer_sibling)
> > sgs.group_capacity = min(sgs.group_capacity, 1UL);
> > + else if (sd->flags & SD_POWERSAVINGS_BALANCE)
> > + sgs.group_capacity =
> > + cpumask_weight(sched_group_cpus(group));
>
> I guess we should apply cpu_active_mask so that we properly deal with
> offline siblings, except with cpumasks being the beasts they are I see
> no cheap way to do that.

The sched_domain will be rebuilt with the sched_group_cpus()
representing only online siblings, right? sched_group_cpus(group) will
always be a subset of cpu_active_mask. Can you please explain your
comment?

> > if (local_group) {
> > sds->this_load = sgs.avg_load;
> > @@ -2855,7 +2863,8 @@ static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle)
> > !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> > return 0;
> >
> > - if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
> > + if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP &&
> > + sched_smt_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
> > return 0;
> > }
>
> /me still hopes for that unification patch.. :-)

I will post an RFC soon. The main challenge has been the order
in which we should place the SD_POWER_SAVINGS flag at the MC and CPU/NODE
levels depending on the system topology and the sched_powersavings settings.

--Vaidy

2010-02-16 18:46:59

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

* Vaidyanathan Srinivasan <[email protected]> [2010-02-16 23:55:30]:

> * Peter Zijlstra <[email protected]> [2010-02-16 18:28:44]:
>
> > On Tue, 2010-02-16 at 21:29 +0530, Vaidyanathan Srinivasan wrote:
> > > Agreed. Placement control should be handled by SD_PREFER_SIBLING
> > > and SD_POWER_SAVINGS flags.
> > >
> > > --Vaidy
> > >
> > > ---
> > >
> > > sched_smt_powersavings for threaded systems need this fix for
> > > consolidation to sibling threads to work. Since threads have
> > > fractional capacity, group_capacity will turn out to be one
> > > always and not accommodate another task in the sibling thread.
> > >
> > > This fix makes group_capacity a function of cpumask_weight that
> > > will enable the power saving load balancer to pack tasks among
> > > sibling threads and keep more cores idle.
> > >
> > > Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> > >
> > > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > > index 522cf0e..ec3a5c5 100644
> > > --- a/kernel/sched_fair.c
> > > +++ b/kernel/sched_fair.c
> > > @@ -2538,9 +2538,17 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> > > * In case the child domain prefers tasks go to siblings
> > > * first, lower the group capacity to one so that we'll try
> > > * and move all the excess tasks away.
> >
> > I prefer a blank line in between two paragraphs, but even better would
> > be to place this comment at the else if site.
> >
> > > + * If power savings balance is set at this domain, then
> > > + * make capacity equal to number of hardware threads to
> > > + * accomodate more tasks until capacity is reached. The
> >
> > my spell checker seems to prefer: accommodate
>
> ok, will fix the comment.

Thanks for the review, here is the updated patch:
---
sched: Fix group_capacity for sched_smt_powersavings

sched_smt_powersavings for threaded systems needs this fix for
consolidation onto sibling threads to work. Since threads have
fractional capacity, group_capacity will always turn out to be one
and will not accommodate another task on the sibling thread.

This fix makes group_capacity a function of cpumask_weight, which
will enable the power saving load balancer to pack tasks among
sibling threads and keep more cores idle.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 522cf0e..4466144 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2541,6 +2541,21 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
*/
if (prefer_sibling)
sgs.group_capacity = min(sgs.group_capacity, 1UL);
+ /*
+ * If power savings balance is set at this domain, then
+ * make capacity equal to number of hardware threads to
+ * accommodate more tasks until capacity is reached.
+ */
+ else if (sd->flags & SD_POWERSAVINGS_BALANCE)
+ sgs.group_capacity =
+ cpumask_weight(sched_group_cpus(group));
+
+ /*
+ * The default group_capacity is rounded from sum of
+ * fractional cpu_powers of sibling hardware threads
+ * in order to enable fair use of available hardware
+ * resources.
+ */

if (local_group) {
sds->this_load = sgs.avg_load;
@@ -2855,7 +2870,8 @@ static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle)
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return 0;

- if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP &&
+ sched_smt_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
return 0;
}

2010-02-16 18:49:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: fix SMT scheduler regression in find_busiest_queue()

On Tue, 2010-02-16 at 23:55 +0530, Vaidyanathan Srinivasan wrote:
> The sched_domain will be rebuilt with the sched_group_cpus()
> representing only online siblings, right? sched_group_cpus(group) will
> always be a subset of cpu_active_mask. Can you please explain your
> comment?

__build_*_sched_domain() seems to only rebuild the sd->span, not the
sched_group's mask; cpu_to_*_group() only picks an existing group based
on the cpumask passed in and doesn't change sg->cpumask, afaict.

That is also the reason we drag load_balance_tmpmask all through
load_balance() afaict.



2010-02-19 02:17:52

by Suresh Siddha

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Sat, 2010-02-13 at 02:36 -0800, Peter Zijlstra wrote:
> On Fri, 2010-02-12 at 17:31 -0800, Suresh Siddha wrote:
> >
> > We have one more problem that Yanmin and Ling Ma reported. On a dual
> > socket quad-core platforms (for example platforms based on NHM-EP), we
> > are seeing scenarios where one socket is completely busy (with all the 4
> > cores running with 4 tasks) and another socket is completely idle.
> >
> > This causes performance issues as those 4 tasks share the memory
> > controller, last-level cache bandwidth etc. Also we won't be taking
> > advantage of turbo-mode as much as we like. We will have all these
> > benefits if we move two of those tasks to the other socket. Now both the
> > sockets can potentially go to turbo etc and improve performance.
> >
> > In short, your recent change (shown below) broke this behavior. In the
> > kernel summit you mentioned you made this change without affecting the
> > behavior of SMT/MC. And my testing immediately after kernel-summit also
> > didn't show the problem (perhaps my test didn't hit this specific
> > change). But apparently we are having performance issues with this patch
> > (Ling Ma's bisect pointed to this patch). I will look more detailed into
> > this after the long weekend (to see if we can catch this scenario in
> > fix_small_imbalance() etc). But wanted to give you a quick heads up.
> > Thanks.
>
> Right, so the behaviour we want should be provided by SD_PREFER_SIBLING,
> it provides the capacity==1 thing the cpu_power games used to provide.
>
> Not saying it's not broken, but that's where the we should be looking to
> fix it.

Peter, some portions of the code in fix_small_imbalance() and
calculate_imbalance() are comparing max_load and busiest_load_per_task.
Some of these comparisons are ok but some of them are broken. The broken
comparisons assume that the cpu_power is SCHED_LOAD_SCALE. Also
there is one check which still assumes that the world is balanced when
max_load <= busiest_load_per_task. This is wrong with the recent changes
(as cpu power no longer reflects the group capacity that is needed to
implement SCHED_MC/SCHED_SMT).

The appended patch works for me and fixes the SCHED_MC performance
behavior. I am sending this patch out for a quick review; I will do a
bit more testing tomorrow. If you don't follow what I am doing in
this patch and why, then stay tuned for a patch with a complete changelog
that I will send tomorrow. Good night. Thanks.
---

Signed-off-by: Suresh Siddha <[email protected]>

diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..2f4cac0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3423,6 +3423,7 @@ struct sd_lb_stats {
unsigned long max_load;
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;

int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -3880,6 +3881,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
sds->max_load = sgs.avg_load;
sds->busiest = group;
sds->busiest_nr_running = sgs.sum_nr_running;
+ sds->busiest_group_capacity = sgs.group_capacity;
sds->busiest_load_per_task = sgs.sum_weighted_load;
sds->group_imb = sgs.group_imb;
}
@@ -3902,6 +3904,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
{
unsigned long tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
+ unsigned long scaled_busy_load_per_task;

if (sds->this_nr_running) {
sds->this_load_per_task /= sds->this_nr_running;
@@ -3912,8 +3915,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
sds->this_load_per_task =
cpu_avg_load_per_task(this_cpu);

- if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
- sds->busiest_load_per_task * imbn) {
+ scaled_busy_load_per_task = sds->busiest_load_per_task
+ * SCHED_LOAD_SCALE;
+ scaled_busy_load_per_task /= sds->busiest->cpu_power;
+
+ if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
+ scaled_busy_load_per_task * imbn) {
*imbalance = sds->busiest_load_per_task;
return;
}
@@ -3964,7 +3971,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
unsigned long *imbalance)
{
- unsigned long max_pull;
+ unsigned long max_pull, load_above_capacity = ~0UL;
/*
* In the presence of smp nice balancing, certain scenarios can have
* max load less than avg load(as we skip the groups at or below
@@ -3975,9 +3982,30 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
return fix_small_imbalance(sds, this_cpu, imbalance);
}

- /* Don't want to pull so many tasks that a group would go idle */
- max_pull = min(sds->max_load - sds->avg_load,
- sds->max_load - sds->busiest_load_per_task);
+ if (!sds->group_imb) {
+ /*
+ * Don't want to pull so many tasks that a group would go idle.
+ */
+ load_above_capacity = (sds->busiest_nr_running -
+ sds->busiest_group_capacity);
+
+ load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+
+ load_above_capacity /= sds->busiest->cpu_power;
+ }
+
+ /*
+ * We're trying to get all the cpus to the average_load, so we don't
+ * want to push ourselves above the average load, nor do we wish to
+ * reduce the max loaded cpu below the average load, as either of these
+ * actions would just result in more rebalancing later, and ping-pong
+ * tasks around. Thus we look for the minimum possible imbalance.
+ * Negative imbalances (*we* are more loaded than anyone else) will
+ * be counted as no imbalance for these purposes -- we can't fix that
+ * by pulling tasks to us. Be careful of negative numbers as they'll
+ * appear as very large values with unsigned longs.
+ */
+ max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);

/* How much load to actually move to equalise the imbalance */
*imbalance = min(max_pull * sds->busiest->cpu_power,
@@ -4069,19 +4097,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
sds.busiest_load_per_task =
min(sds.busiest_load_per_task, sds.avg_load);

- /*
- * We're trying to get all the cpus to the average_load, so we don't
- * want to push ourselves above the average load, nor do we wish to
- * reduce the max loaded cpu below the average load, as either of these
- * actions would just result in more rebalancing later, and ping-pong
- * tasks around. Thus we look for the minimum possible imbalance.
- * Negative imbalances (*we* are more loaded than anyone else) will
- * be counted as no imbalance for these purposes -- we can't fix that
- * by pulling tasks to us. Be careful of negative numbers as they'll
- * appear as very large values with unsigned longs.
- */
- if (sds.max_load <= sds.busiest_load_per_task)
- goto out_balanced;

/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);
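
To get a feel for the new expression, here is a plain evaluation of
load_above_capacity with made-up sample numbers (one task above capacity
in a busiest group of two SMT siblings at cpu_power 589 each); this is
purely illustrative, not taken from a real run:

/* load_above_capacity.c: plain evaluation of the expression introduced
 * in the patch above, using made-up sample numbers. */
#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

int main(void)
{
	unsigned long busiest_nr_running = 3;		/* sample value */
	unsigned long busiest_group_capacity = 2;	/* sample value */
	unsigned long busiest_cpu_power = 2 * 589;	/* two SMT siblings */

	unsigned long load_above_capacity =
		(busiest_nr_running - busiest_group_capacity)
		* SCHED_LOAD_SCALE * SCHED_LOAD_SCALE
		/ busiest_cpu_power;

	/* max_pull then becomes min(max_load - avg_load, load_above_capacity)
	 * instead of the old max_load - busiest_load_per_task term. */
	printf("load_above_capacity = %lu\n", load_above_capacity);	/* ~890 */
	return 0;
}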

2010-02-19 12:32:18

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

* Suresh Siddha <[email protected]> [2010-02-18 18:16:47]:

> On Sat, 2010-02-13 at 02:36 -0800, Peter Zijlstra wrote:
> > On Fri, 2010-02-12 at 17:31 -0800, Suresh Siddha wrote:
> > >
> > > We have one more problem that Yanmin and Ling Ma reported. On a dual
> > > socket quad-core platforms (for example platforms based on NHM-EP), we
> > > are seeing scenarios where one socket is completely busy (with all the 4
> > > cores running with 4 tasks) and another socket is completely idle.
> > >
> > > This causes performance issues as those 4 tasks share the memory
> > > controller, last-level cache bandwidth etc. Also we won't be taking
> > > advantage of turbo-mode as much as we like. We will have all these
> > > benefits if we move two of those tasks to the other socket. Now both the
> > > sockets can potentially go to turbo etc and improve performance.
> > >
> > > In short, your recent change (shown below) broke this behavior. In the
> > > kernel summit you mentioned you made this change without affecting the
> > > behavior of SMT/MC. And my testing immediately after kernel-summit also
> > > didn't show the problem (perhaps my test didn't hit this specific
> > > change). But apparently we are having performance issues with this patch
> > > (Ling Ma's bisect pointed to this patch). I will look more detailed into
> > > this after the long weekend (to see if we can catch this scenario in
> > > fix_small_imbalance() etc). But wanted to give you a quick heads up.
> > > Thanks.
> >
> > Right, so the behaviour we want should be provided by SD_PREFER_SIBLING,
> > it provides the capacity==1 thing the cpu_power games used to provide.
> >
> > Not saying it's not broken, but that's where the we should be looking to
> > fix it.
>
> Peter, Some portions of code in fix_small_imbalance() and
> calculate_imbalance() are comparing max_load and busiest_load_per_task.
> Some of these comparisons are ok but some of them are broken. Broken
> comparisons are assuming that the cpu_power is SCHED_LOAD_SCALE. Also
> there is one check which still assumes that the world is balanced when
> max_load <= busiest_load_per_task. This is wrong with the recent changes
> (as cpu power no longer reflects the group capacity that is needed to
> implement SCHED_MC/SCHED_SMT).
>
> The appended patch works for me and fixes the SCHED_MC performance
> behavior. I am sending this patch out for a quick review and I will do
> bit more testing tomorrow and If you don't follow what I am doing in
> this patch and why, then stay tuned for a patch with complete changelog
> that I will send tomorrow. Good night. Thanks.
> ---
>


Hi,

I tested Suresh's patch with ebizzy on a machine with 2 packages, each
with 4 cores, with sched_mc_power_savings set to 0.

I found that, on applying the patch, performance improves
significantly when running ebizzy with 5 and 7 threads. In the rest of the
cases, there is no significant performance improvement over the baseline
kernel, but there is no degradation either.

And the task placement seems to be coming out correct on the
underloaded system, although I feel that we might be ping-ponging a
bit. Will test more when I get time.

arun

2010-02-19 13:03:32

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

* Suresh Siddha <[email protected]> [2010-02-18 18:16:47]:

> On Sat, 2010-02-13 at 02:36 -0800, Peter Zijlstra wrote:
> > On Fri, 2010-02-12 at 17:31 -0800, Suresh Siddha wrote:
> > >
> > > We have one more problem that Yanmin and Ling Ma reported. On a dual
> > > socket quad-core platforms (for example platforms based on NHM-EP), we
> > > are seeing scenarios where one socket is completely busy (with all the 4
> > > cores running with 4 tasks) and another socket is completely idle.
> > >
> > > This causes performance issues as those 4 tasks share the memory
> > > controller, last-level cache bandwidth etc. Also we won't be taking
> > > advantage of turbo-mode as much as we like. We will have all these
> > > benefits if we move two of those tasks to the other socket. Now both the
> > > sockets can potentially go to turbo etc and improve performance.
> > >
> > > In short, your recent change (shown below) broke this behavior. In the
> > > kernel summit you mentioned you made this change without affecting the
> > > behavior of SMT/MC. And my testing immediately after kernel-summit also
> > > didn't show the problem (perhaps my test didn't hit this specific
> > > change). But apparently we are having performance issues with this patch
> > > (Ling Ma's bisect pointed to this patch). I will look more detailed into
> > > this after the long weekend (to see if we can catch this scenario in
> > > fix_small_imbalance() etc). But wanted to give you a quick heads up.
> > > Thanks.
> >
> > Right, so the behaviour we want should be provided by SD_PREFER_SIBLING,
> > it provides the capacity==1 thing the cpu_power games used to provide.
> >
> > Not saying it's not broken, but that's where the we should be looking to
> > fix it.
>
> Peter, Some portions of code in fix_small_imbalance() and
> calculate_imbalance() are comparing max_load and busiest_load_per_task.
> Some of these comparisons are ok but some of them are broken. Broken
> comparisons are assuming that the cpu_power is SCHED_LOAD_SCALE. Also
> there is one check which still assumes that the world is balanced when
> max_load <= busiest_load_per_task. This is wrong with the recent changes
> (as cpu power no longer reflects the group capacity that is needed to
> implement SCHED_MC/SCHED_SMT).
>
> The appended patch works for me and fixes the SCHED_MC performance
> behavior. I am sending this patch out for a quick review and I will do
> bit more testing tomorrow and If you don't follow what I am doing in
> this patch and why, then stay tuned for a patch with complete changelog
> that I will send tomorrow. Good night. Thanks.

Hi Suresh,

Thanks for sharing the patch.

> ---
>
> Signed-off-by: Suresh Siddha <[email protected]>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..2f4cac0 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3423,6 +3423,7 @@ struct sd_lb_stats {
> unsigned long max_load;
> unsigned long busiest_load_per_task;
> unsigned long busiest_nr_running;
> + unsigned long busiest_group_capacity;
>
> int group_imb; /* Is there imbalance in this sd */
> #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> @@ -3880,6 +3881,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> sds->max_load = sgs.avg_load;
> sds->busiest = group;
> sds->busiest_nr_running = sgs.sum_nr_running;
> + sds->busiest_group_capacity = sgs.group_capacity;
> sds->busiest_load_per_task = sgs.sum_weighted_load;
> sds->group_imb = sgs.group_imb;
> }
> @@ -3902,6 +3904,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> {
> unsigned long tmp, pwr_now = 0, pwr_move = 0;
> unsigned int imbn = 2;
> + unsigned long scaled_busy_load_per_task;
>
> if (sds->this_nr_running) {
> sds->this_load_per_task /= sds->this_nr_running;
> @@ -3912,8 +3915,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> sds->this_load_per_task =
> cpu_avg_load_per_task(this_cpu);
>
> - if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
> - sds->busiest_load_per_task * imbn) {
> + scaled_busy_load_per_task = sds->busiest_load_per_task
> + * SCHED_LOAD_SCALE;
> + scaled_busy_load_per_task /= sds->busiest->cpu_power;
> +
> + if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> + scaled_busy_load_per_task * imbn) {
> *imbalance = sds->busiest_load_per_task;
> return;

This change looks good.

> }
> @@ -3964,7 +3971,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> unsigned long *imbalance)
> {
> - unsigned long max_pull;
> + unsigned long max_pull, load_above_capacity = ~0UL;
> /*
> * In the presence of smp nice balancing, certain scenarios can have
> * max load less than avg load(as we skip the groups at or below
> @@ -3975,9 +3982,30 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> return fix_small_imbalance(sds, this_cpu, imbalance);
> }
>
> - /* Don't want to pull so many tasks that a group would go idle */
> - max_pull = min(sds->max_load - sds->avg_load,
> - sds->max_load - sds->busiest_load_per_task);
> + if (!sds->group_imb) {
> + /*
> + * Don't want to pull so many tasks that a group would go idle.
> + */
> + load_above_capacity = (sds->busiest_nr_running -
> + sds->busiest_group_capacity);
> +
> + load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
> +
> + load_above_capacity /= sds->busiest->cpu_power;
> + }

This seems tricky. max_load - avg_load will be less than
load_above_capacity most of the time. How does this expression
increase the max_pull from the previous expression?

> + /*
> + * We're trying to get all the cpus to the average_load, so we don't
> + * want to push ourselves above the average load, nor do we wish to
> + * reduce the max loaded cpu below the average load, as either of these
> + * actions would just result in more rebalancing later, and ping-pong
> + * tasks around. Thus we look for the minimum possible imbalance.
> + * Negative imbalances (*we* are more loaded than anyone else) will
> + * be counted as no imbalance for these purposes -- we can't fix that
> + * by pulling tasks to us. Be careful of negative numbers as they'll
> + * appear as very large values with unsigned longs.
> + */
> + max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);

Does this increase or decrease the value of max_pull from the previous
expression?

> /* How much load to actually move to equalise the imbalance */
> *imbalance = min(max_pull * sds->busiest->cpu_power,
> @@ -4069,19 +4097,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> sds.busiest_load_per_task =
> min(sds.busiest_load_per_task, sds.avg_load);
>
> - /*
> - * We're trying to get all the cpus to the average_load, so we don't
> - * want to push ourselves above the average load, nor do we wish to
> - * reduce the max loaded cpu below the average load, as either of these
> - * actions would just result in more rebalancing later, and ping-pong
> - * tasks around. Thus we look for the minimum possible imbalance.
> - * Negative imbalances (*we* are more loaded than anyone else) will
> - * be counted as no imbalance for these purposes -- we can't fix that
> - * by pulling tasks to us. Be careful of negative numbers as they'll
> - * appear as very large values with unsigned longs.
> - */
> - if (sds.max_load <= sds.busiest_load_per_task)
> - goto out_balanced;

This is right. This condition was treating most cases as balanced and
exiting right here. However, if this check is removed, we will have to
execute more code to detect/ascertain the balanced case.

--Vaidy

2010-02-19 15:29:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Thu, 2010-02-18 at 18:16 -0800, Suresh Siddha wrote:
> On Sat, 2010-02-13 at 02:36 -0800, Peter Zijlstra wrote:
> > On Fri, 2010-02-12 at 17:31 -0800, Suresh Siddha wrote:
> > >
> > > We have one more problem that Yanmin and Ling Ma reported. On a dual
> > > socket quad-core platforms (for example platforms based on NHM-EP), we
> > > are seeing scenarios where one socket is completely busy (with all the 4
> > > cores running with 4 tasks) and another socket is completely idle.
> > >
> > > This causes performance issues as those 4 tasks share the memory
> > > controller, last-level cache bandwidth etc. Also we won't be taking
> > > advantage of turbo-mode as much as we like. We will have all these
> > > benefits if we move two of those tasks to the other socket. Now both the
> > > sockets can potentially go to turbo etc and improve performance.
> > >
> > > In short, your recent change (shown below) broke this behavior. In the
> > > kernel summit you mentioned you made this change without affecting the
> > > behavior of SMT/MC. And my testing immediately after kernel-summit also
> > > didn't show the problem (perhaps my test didn't hit this specific
> > > change). But apparently we are having performance issues with this patch
> > > (Ling Ma's bisect pointed to this patch). I will look more detailed into
> > > this after the long weekend (to see if we can catch this scenario in
> > > fix_small_imbalance() etc). But wanted to give you a quick heads up.
> > > Thanks.
> >
> > Right, so the behaviour we want should be provided by SD_PREFER_SIBLING,
> > it provides the capacity==1 thing the cpu_power games used to provide.
> >
> > Not saying it's not broken, but that's where the we should be looking to
> > fix it.
>
> Peter, Some portions of code in fix_small_imbalance() and
> calculate_imbalance() are comparing max_load and busiest_load_per_task.
> Some of these comparisons are ok but some of them are broken. Broken
> comparisons are assuming that the cpu_power is SCHED_LOAD_SCALE. Also
> there is one check which still assumes that the world is balanced when
> max_load <= busiest_load_per_task. This is wrong with the recent changes
> (as cpu power no longer reflects the group capacity that is needed to
> implement SCHED_MC/SCHED_SMT).

The whole notion of load_per_task is utterly broken for cgroup stuff
(but I can fully understand you not wanting to care about that).

A smaller breakage is that it's constructed using rq->nr_running, not
cfs_rq->nr_running; the problem is that the former includes !fair tasks
and the latter (for the cgroup case) not all tasks.

It would be good to come up with something that does not rely on this
metric at all, although I admit not quite seeing how that would work.

> The appended patch works for me and fixes the SCHED_MC performance
> behavior. I am sending this patch out for a quick review and I will do
> bit more testing tomorrow and If you don't follow what I am doing in
> this patch and why, then stay tuned for a patch with complete changelog
> that I will send tomorrow. Good night. Thanks.

I can mostly follow the code, but I can't seem to quickly see what
particular problem you're trying to solve :-)

In particular, using the below script:

$ cat show-loop
#!/bin/bash
NR=$1; shift

for ((i=0; i<NR; i++)) ; do
./loop &
done

for ((i=0; i<3; i++)) ; do
sleep 1

( ps -deo pid,sgi_p,cmd | grep loop | grep bash | while read pid cpu cmd; do

SOCKET=`cat /sys/devices/system/cpu/cpu${cpu}/topology/physical_package_id`
CORE=`cat /sys/devices/system/cpu/cpu${cpu}/topology/core_id`
SIBLINGS=`cat /sys/devices/system/cpu/cpu${cpu}/topology/thread_siblings_list`

printf "loop-%05d on CPU: %02d SOCKET: %02d CORE: %02d THREADS: ${SIBLINGS}\n" $pid $cpu $SOCKET $CORE

done ) | sort | awk '{socket[$6]++; th[$8 + (256*$6)]++; print $0} END { for (i in socket) { print "socket-" i ": " socket[i]; } for (i in th) { if (th[i] > 1) { print "thread-" int(i/256)"/"(i%256) ": " th[i]; } } }'

echo ""
echo "-------------------"
echo ""

done

killall loop
---
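
The ./loop binary used by the script is not shown in the thread; a
trivial busy-loop stand-in (hypothetical) would be:

/* loop.c: hypothetical stand-in for the ./loop binary referenced by the
 * script; any program that spins on the cpu forever will do. */
int main(void)
{
	for (;;)
		;	/* burn cpu until killed by the script */
	return 0;
}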

I can find no difference between tip/master +/- patch for the following
scenarios:

./show-loop $nr_sockets

mostly it's 1 loop per socket; sometimes I see 2 on a single socket

./show-loop $nr_cores_in_machine

mostly it shows 1 loop per core; sometimes one core is idle and one core
uses both threads.

Anything in between also mostly shows an equal distribution of loops per
socket.

> Signed-off-by: Suresh Siddha <[email protected]>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..2f4cac0 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c

FWIW, this code no longer lives in sched.c ;-)

> @@ -3423,6 +3423,7 @@ struct sd_lb_stats {
> unsigned long max_load;
> unsigned long busiest_load_per_task;
> unsigned long busiest_nr_running;
> + unsigned long busiest_group_capacity;
>
> int group_imb; /* Is there imbalance in this sd */
> #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> @@ -3880,6 +3881,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> sds->max_load = sgs.avg_load;
> sds->busiest = group;
> sds->busiest_nr_running = sgs.sum_nr_running;
> + sds->busiest_group_capacity = sgs.group_capacity;
> sds->busiest_load_per_task = sgs.sum_weighted_load;
> sds->group_imb = sgs.group_imb;
> }
> @@ -3902,6 +3904,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> {
> unsigned long tmp, pwr_now = 0, pwr_move = 0;
> unsigned int imbn = 2;
> + unsigned long scaled_busy_load_per_task;
>
> if (sds->this_nr_running) {
> sds->this_load_per_task /= sds->this_nr_running;
> @@ -3912,8 +3915,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> sds->this_load_per_task =
> cpu_avg_load_per_task(this_cpu);
>
> - if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
> - sds->busiest_load_per_task * imbn) {
> + scaled_busy_load_per_task = sds->busiest_load_per_task
> + * SCHED_LOAD_SCALE;
> + scaled_busy_load_per_task /= sds->busiest->cpu_power;
> +
> + if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> + scaled_busy_load_per_task * imbn) {
> *imbalance = sds->busiest_load_per_task;
> return;
> }

Right, makes sense, max_load/this_load (being samples of avg_load) are
scaled.

calculate_imbalance() seems to always have undone the power scaling,
*cpu_power / SCHED_LOAD_SCALE, as the imbalance is based on avg_load
derivatives.

fix_small_imbalance() also returns an unscaled value, based on the
unscaled load_per_task derivatives.

But what I'm not seeing is how this got introduced recently; looking
at .29 it's pretty much the same, comparing the unscaled load_per_task to
avg_load, so my recent change to make cpu_power more fluid may have
aggravated the situation, but it's not a new issue.

> @@ -3964,7 +3971,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> unsigned long *imbalance)
> {
> - unsigned long max_pull;
> + unsigned long max_pull, load_above_capacity = ~0UL;
> /*
> * In the presence of smp nice balancing, certain scenarios can have
> * max load less than avg load(as we skip the groups at or below
> @@ -3975,9 +3982,30 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> return fix_small_imbalance(sds, this_cpu, imbalance);
> }
>
> - /* Don't want to pull so many tasks that a group would go idle */
> - max_pull = min(sds->max_load - sds->avg_load,
> - sds->max_load - sds->busiest_load_per_task);
> + if (!sds->group_imb) {
> + /*
> + * Don't want to pull so many tasks that a group would go idle.
> + */
> + load_above_capacity = (sds->busiest_nr_running -
> + sds->busiest_group_capacity);
> +
> + load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
> +
> + load_above_capacity /= sds->busiest->cpu_power;
> + }
> +
> + /*
> + * We're trying to get all the cpus to the average_load, so we don't
> + * want to push ourselves above the average load, nor do we wish to
> + * reduce the max loaded cpu below the average load, as either of these
> + * actions would just result in more rebalancing later, and ping-pong
> + * tasks around.


About the ping-pong for the statically infeasible scenario, the problem
with that is that we actually want to ping-pong a little [*], so we
might want to look at maybe doing a sd->next_pingpong jiffy measure to
allow some of that.

[*] I've seen people pretty upset about the fact that when they start 6
similar loads on a quad cpu the tasks will not finish in roughly similar
times.

> Thus we look for the minimum possible imbalance.
> + * Negative imbalances (*we* are more loaded than anyone else) will
> + * be counted as no imbalance for these purposes -- we can't fix that
> + * by pulling tasks to us. Be careful of negative numbers as they'll
> + * appear as very large values with unsigned longs.
> + */
> + max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);

OK, so this bit actually changes behaviour and seems to make sense; the
load above capacity makes more sense than max - load_per_task.

> /* How much load to actually move to equalise the imbalance */
> *imbalance = min(max_pull * sds->busiest->cpu_power,
> @@ -4069,19 +4097,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> sds.busiest_load_per_task =
> min(sds.busiest_load_per_task, sds.avg_load);
>
> - /*
> - * We're trying to get all the cpus to the average_load, so we don't
> - * want to push ourselves above the average load, nor do we wish to
> - * reduce the max loaded cpu below the average load, as either of these
> - * actions would just result in more rebalancing later, and ping-pong
> - * tasks around. Thus we look for the minimum possible imbalance.
> - * Negative imbalances (*we* are more loaded than anyone else) will
> - * be counted as no imbalance for these purposes -- we can't fix that
> - * by pulling tasks to us. Be careful of negative numbers as they'll
> - * appear as very large values with unsigned longs.
> - */
> - if (sds.max_load <= sds.busiest_load_per_task)
> - goto out_balanced;

You'll have to remove item 6) on the comment further up. And we might as
well move the whole busiest_load_per_task computation along into
calculate_imbalance().

> /* Looks like there is an imbalance. Compute it */
> calculate_imbalance(&sds, this_cpu, imbalance);
>
>


2010-02-19 18:37:56

by Suresh Siddha

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-19 at 06:05 -0800, Peter Zijlstra wrote:
> The whole notion of load_per_task is utterly broken for cgroup stuff
> (but I can fully understand you not wanting to care about that).

Yep. I will leave that to cgroup folks :)

> A smaller breakage is that its constructed using rq->nr_running, not
> cfs_rq->nr_running, the problem is that the former includes !fair tasks
> and the latter (for the cgroup case) not all tasks.
>
> It would be good to come up with something that does not rely on this
> metric at all, although I admit not quite seeing how that would work.

I wouldn't mind further enhancements but am currently focused on solving
this regression first :(

>
> > The appended patch works for me and fixes the SCHED_MC performance
> > behavior. I am sending this patch out for a quick review and I will do
> > bit more testing tomorrow and If you don't follow what I am doing in
> > this patch and why, then stay tuned for a patch with complete changelog
> > that I will send tomorrow. Good night. Thanks.
>
> I can mostly follow the code, but I can't seem to quickly see what
> particular problem you're trying to solve :-)

Ok. Let me help you understand below.

>
> In particular, using the below script:
>
> $ cat show-loop
> #!/bin/bash
> NR=$1; shift
>
> for ((i=0; i<NR; i++)) ; do
> ./loop &
> done
>
> for ((i=0; i<3; i++)) ; do
> sleep 1
>
> ( ps -deo pid,sgi_p,cmd | grep loop | grep bash | while read pid cpu cmd; do
>
> SOCKET=`cat /sys/devices/system/cpu/cpu${cpu}/topology/physical_package_id`
> CORE=`cat /sys/devices/system/cpu/cpu${cpu}/topology/core_id`
> SIBLINGS=`cat /sys/devices/system/cpu/cpu${cpu}/topology/thread_siblings_list`
>
> printf "loop-%05d on CPU: %02d SOCKET: %02d CORE: %02d THREADS: ${SIBLINGS}\n" $pid $cpu $SOCKET $CORE
>
> done ) | sort | awk '{socket[$6]++; th[$8 + (256*$6)]++; print $0} END { for (i in socket) { print "socket-" i ": " socket[i]; } for (i in th) { if (th[i] > 1) { print "thread-" int(i/256)"/"(i%256) ": " th[i]; } } }'
>
> echo ""
> echo "-------------------"
> echo ""
>
> done
>
> killall loop
> ---
>
> I can find no difference between tip/master +/- patch for the following
> scenarios:
>
> ./show-loop $nr_sockets
>
> mostly its 1 loop per socket, sometimes I see 2 on a single socket
>
> ./show-loop $nr_cores_in_machine
>
> mostly it shows 1 loop per core, sometimes one core is idle and one core
> uses both threads.
>
> Anything in between also mostly shows an equal distribution of loops per
> socket.


exec/fork balance is not broken, i.e., during exec/fork we balance the
load equally among sockets/cores etc. What is broken is:

a) In the SMT case, once we end up in a situation where both the threads of
a core are busy, with another core completely idle, load balance does
not move one of the threads to the idle core. This unbalanced
situation can happen because of a previous wake-up decision and/or because
threads on the other core went to sleep/died etc. Once we end up in this
unbalanced situation, we continue in that state without fixing it.

b) Similar to "a", this is the MC case where we end up with four cores busy in
one socket and the other 4 cores in another socket completely idle. And
this is the situation which we are trying to solve in this patch.

In your above example, we mostly test fork/exec balance but not the
above sleep/wakeup scenarios.
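
A hypothetical way to exercise the sleep/died scenarios described in a)
and b): fully load the machine with busy loops, then kill half of them so
the survivors may be left stacked on sibling threads or on one socket, and
watch (e.g. with the show-loop script earlier in the thread) whether load
balancing spreads them out again. This is only a sketch of the idea, not a
test case that was posted:

/* repro_sketch.c: hypothetical sketch of the stale-placement scenario.
 * Usage: ./repro_sketch <nr_tasks_to_keep> */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int keep = argc > 1 ? atoi(argv[1]) : 4;
	int total = 2 * keep;
	pid_t pid[total];
	int i;

	/* Fully load the machine first so the tasks get spread out. */
	for (i = 0; i < total; i++) {
		pid[i] = fork();
		if (pid[i] == 0)
			for (;;)
				;	/* busy loop */
	}
	sleep(10);

	/* Kill every other task: the survivors may now sit on sibling
	 * threads (or one socket) while other cores are idle, which is
	 * the unbalanced state described above.  Watch whether load
	 * balancing spreads them out again. */
	for (i = 0; i < total; i += 2)
		kill(pid[i], SIGKILL);
	sleep(30);

	for (i = 1; i < total; i += 2)
		kill(pid[i], SIGKILL);
	while (wait(NULL) > 0)
		;
	return 0;
}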

>
> > Signed-off-by: Suresh Siddha <[email protected]>
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3a8fb30..2f4cac0 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
>
> FWIW, this code no longer lives in sched.c ;-)

Yep. I noticed that in the -tip tree but was working on Linus's tree to see
if I could address this regression. Given the remaining time, perhaps
we should first address this in -tip and then back-port the fix
to .33-stable.

>
> > @@ -3423,6 +3423,7 @@ struct sd_lb_stats {
> > unsigned long max_load;
> > unsigned long busiest_load_per_task;
> > unsigned long busiest_nr_running;
> > + unsigned long busiest_group_capacity;
> >
> > int group_imb; /* Is there imbalance in this sd */
> > #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> > @@ -3880,6 +3881,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> > sds->max_load = sgs.avg_load;
> > sds->busiest = group;
> > sds->busiest_nr_running = sgs.sum_nr_running;
> > + sds->busiest_group_capacity = sgs.group_capacity;
> > sds->busiest_load_per_task = sgs.sum_weighted_load;
> > sds->group_imb = sgs.group_imb;
> > }
> > @@ -3902,6 +3904,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> > {
> > unsigned long tmp, pwr_now = 0, pwr_move = 0;
> > unsigned int imbn = 2;
> > + unsigned long scaled_busy_load_per_task;
> >
> > if (sds->this_nr_running) {
> > sds->this_load_per_task /= sds->this_nr_running;
> > @@ -3912,8 +3915,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> > sds->this_load_per_task =
> > cpu_avg_load_per_task(this_cpu);
> >
> > - if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
> > - sds->busiest_load_per_task * imbn) {
> > + scaled_busy_load_per_task = sds->busiest_load_per_task
> > + * SCHED_LOAD_SCALE;
> > + scaled_busy_load_per_task /= sds->busiest->cpu_power;
> > +
> > + if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> > + scaled_busy_load_per_task * imbn) {
> > *imbalance = sds->busiest_load_per_task;
> > return;
> > }
>
> Right, makes sense, max_load/this_load (being samples of avg_load) are
> scaled.
>
> calculate_imbalance() seems to always have undone the power scaling,
> *cpu_power / SCHED_LOAD_SCALE, as the imbalance is based on avg_load
> derivatives.
>
> fix_small_imbalance() also returns an unscaled value, based on the
> unscaled load_per_task derivatives.
>
> But what I'm not seeing is how this got introduced recently; looking
> at .29 it's pretty much the same, comparing the unscaled load_per_task to
> avg_load, so my recent change to make cpu_power more fluid may have
> aggravated the situation, but it's not a new issue.

Your understanding is correct. We now have the issue for MC domains
(because of the recent cpu_power changes), whereas previously we probably
had the issue only for big NUMA systems (by big NUMA I mean two or more
sockets per NUMA domain), which I had never tested, so I did not observe
the issue before.

>
> > @@ -3964,7 +3971,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> > static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> > unsigned long *imbalance)
> > {
> > - unsigned long max_pull;
> > + unsigned long max_pull, load_above_capacity = ~0UL;
> > /*
> > * In the presence of smp nice balancing, certain scenarios can have
> > * max load less than avg load(as we skip the groups at or below
> > @@ -3975,9 +3982,30 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> > return fix_small_imbalance(sds, this_cpu, imbalance);
> > }
> >
> > - /* Don't want to pull so many tasks that a group would go idle */
> > - max_pull = min(sds->max_load - sds->avg_load,
> > - sds->max_load - sds->busiest_load_per_task);
> > + if (!sds->group_imb) {
> > + /*
> > + * Don't want to pull so many tasks that a group would go idle.
> > + */
> > + load_above_capacity = (sds->busiest_nr_running -
> > + sds->busiest_group_capacity);
> > +
> > + load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
> > +
> > + load_above_capacity /= sds->busiest->cpu_power;
> > + }
> > +
> > + /*
> > + * We're trying to get all the cpus to the average_load, so we don't
> > + * want to push ourselves above the average load, nor do we wish to
> > + * reduce the max loaded cpu below the average load, as either of these
> > + * actions would just result in more rebalancing later, and ping-pong
> > + * tasks around.
>
>
> About the ping-pong for the statically infeasible scenario, the problem
> with that is that we actually want to ping-pong a little [*], so we
> might want to look at maybe doing a sd->next_pingpong jiffy measure to
> allow some of that.
>
> [*] I've seen people pretty upset about the fact that when they start 6
> similar loads on a quad cpu the tasks will not finish in roughly similar
> times.

I thought we were doing this (ping-pong a little) already, unless
something broke here also. I thought fix_small_imbalance() takes care of
this too.

>
> > Thus we look for the minimum possible imbalance.
> > + * Negative imbalances (*we* are more loaded than anyone else) will
> > + * be counted as no imbalance for these purposes -- we can't fix that
> > + * by pulling tasks to us. Be careful of negative numbers as they'll
> > + * appear as very large values with unsigned longs.
> > + */
> > + max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
>
> OK, so this bit actually changes behaviour and seems to make sense, the
> load above capacity makes more sense than max - load_per_task.
>
> > /* How much load to actually move to equalise the imbalance */
> > *imbalance = min(max_pull * sds->busiest->cpu_power,
> > @@ -4069,19 +4097,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> > sds.busiest_load_per_task =
> > min(sds.busiest_load_per_task, sds.avg_load);
> >
> > - /*
> > - * We're trying to get all the cpus to the average_load, so we don't
> > - * want to push ourselves above the average load, nor do we wish to
> > - * reduce the max loaded cpu below the average load, as either of these
> > - * actions would just result in more rebalancing later, and ping-pong
> > - * tasks around. Thus we look for the minimum possible imbalance.
> > - * Negative imbalances (*we* are more loaded than anyone else) will
> > - * be counted as no imbalance for these purposes -- we can't fix that
> > - * by pulling tasks to us. Be careful of negative numbers as they'll
> > - * appear as very large values with unsigned longs.
> > - */
> > - if (sds.max_load <= sds.busiest_load_per_task)
> > - goto out_balanced;
>
> You'll have to remove item 6) on the comment further up. And we might as
> well move the whole busiest_load_per_task computation along into
> calculate_imbalance().

Ok. I am trying to do a simplified patch first, so that it is acceptable
to be backported into -stable (for 2.6.32 and 2.6.33), so that
distributions based on those versions won't have this regression. I will
do another rev of this patch today. We can do more cleanups following
that.

thanks,
suresh

2010-02-19 19:17:52

by Suresh Siddha

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-19 at 05:03 -0800, Vaidyanathan Srinivasan wrote:
> > - /* Don't want to pull so many tasks that a group would go idle */
> > - max_pull = min(sds->max_load - sds->avg_load,
> > - sds->max_load - sds->busiest_load_per_task);
> > + if (!sds->group_imb) {
> > + /*
> > + * Don't want to pull so many tasks that a group would go idle.
> > + */
> > + load_above_capacity = (sds->busiest_nr_running -
> > + sds->busiest_group_capacity);
> > +
> > + load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
> > +
> > + load_above_capacity /= sds->busiest->cpu_power;
> > + }
>
> This seems tricky. max_load - avg_load will be less than
> load_above_capacity most of the time. How does this expression
> increase the max_pull from previous expression?

I am not trying to increase or decrease it relative to the previous
expression. I am just trying to do the right thing (to ultimately address
smt/mc power-savings), as "max_load - busiest_load_per_task" no longer
represents the load above capacity.
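
For concreteness, here is a standalone user-space sketch (not kernel code;
the numbers are assumed: an SMT core group running one task per thread, with
a group cpu_power of ~1178 for the two threads) of what the new term
evaluates to:

	#include <stdio.h>

	#define SCHED_LOAD_SCALE 1024UL

	int main(void)
	{
		/* assumed busiest group: an SMT core, both threads busy */
		unsigned long busiest_nr_running = 2;
		unsigned long busiest_group_capacity = 1;	/* one core's worth */
		unsigned long busiest_cpu_power = 1178;		/* assumed SMT group power */

		unsigned long load_above_capacity =
			(busiest_nr_running - busiest_group_capacity) *
			SCHED_LOAD_SCALE * SCHED_LOAD_SCALE / busiest_cpu_power;

		/* prints ~890: one task's worth of load in the group's scaled units */
		printf("load_above_capacity = %lu\n", load_above_capacity);
		return 0;
	}

So whether max_pull goes up or down versus the old "max_load -
busiest_load_per_task" bound depends on the load mix; the point is that this
term is now expressed in the same power-scaled units as max_load - avg_load.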

>
> > + /*
> > + * We're trying to get all the cpus to the average_load, so we don't
> > + * want to push ourselves above the average load, nor do we wish to
> > + * reduce the max loaded cpu below the average load, as either of these
> > + * actions would just result in more rebalancing later, and ping-pong
> > + * tasks around. Thus we look for the minimum possible imbalance.
> > + * Negative imbalances (*we* are more loaded than anyone else) will
> > + * be counted as no imbalance for these purposes -- we can't fix that
> > + * by pulling tasks to us. Be careful of negative numbers as they'll
> > + * appear as very large values with unsigned longs.
> > + */
> > + max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
>
> Does this increase or decrease the value of max_pull from previous
> expression?

Does the above help answer your question, Vaidy?

>
> > /* How much load to actually move to equalise the imbalance */
> > *imbalance = min(max_pull * sds->busiest->cpu_power,
> > @@ -4069,19 +4097,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> > sds.busiest_load_per_task =
> > min(sds.busiest_load_per_task, sds.avg_load);
> >
> > - /*
> > - * We're trying to get all the cpus to the average_load, so we don't
> > - * want to push ourselves above the average load, nor do we wish to
> > - * reduce the max loaded cpu below the average load, as either of these
> > - * actions would just result in more rebalancing later, and ping-pong
> > - * tasks around. Thus we look for the minimum possible imbalance.
> > - * Negative imbalances (*we* are more loaded than anyone else) will
> > - * be counted as no imbalance for these purposes -- we can't fix that
> > - * by pulling tasks to us. Be careful of negative numbers as they'll
> > - * appear as very large values with unsigned longs.
> > - */
> > - if (sds.max_load <= sds.busiest_load_per_task)
> > - goto out_balanced;
>
> This is right. This condition was treating most cases as balanced and
> exit right here. However if this check is removed, we will have to
> execute more code to detect/ascertain balanced case.

To add, in update_sd_lb_stats() we are already doing this:

	} else if (sgs.avg_load > sds->max_load &&
		   (sgs.sum_nr_running > sgs.group_capacity ||
		    sgs.group_imb)) {

So we are already checking sum_nr_running > group_capacity when selecting the
busiest group, i.e. we are doing the equivalent of this balanced check much
earlier.
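
For illustration, a standalone sketch with assumed numbers for the SMT case
discussed earlier in the thread: a core group running 2 tasks with capacity
for 1 passes that check and can be selected as busiest, independent of the
removed max_load/busiest_load_per_task comparison.

	#include <stdio.h>

	int main(void)
	{
		/* assumed candidate group: SMT core, 2 tasks, capacity for 1 */
		unsigned long avg_load = 1780, max_load_so_far = 0;
		unsigned long sum_nr_running = 2, group_capacity = 1;
		int group_imb = 0;

		if (avg_load > max_load_so_far &&
		    (sum_nr_running > group_capacity || group_imb))
			printf("selected as busiest (runs more than its capacity)\n");
		return 0;
	}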

thanks,
suresh

2010-02-19 19:48:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-19 at 10:36 -0800, Suresh Siddha wrote:
> exec/fork balance is not broken. i.e., during exec/fork we balance the
> load equally among sockets/cores etc. What is broken is:
>
> a) In SMT case, once we end up in a situation where both the threads of
> the core are busy , with another core completely idle, load balance is
> not moving one of the threads to the idle core. This unbalanced
> situation can happen because of a previous wake-up decision and/or
> threads on other core went to sleep/died etc. Once we end up in this
> unbalanced situation, we continue in that state with out fixing it.
>
> b) Similar to "a", this is MC case where we end up four cores busy in
> one socket with other 4 cores in another socket completely idle. And
> this is the situation which we are trying to solve in this patch.
>
> In your above example, we test mostly fork/exec balance but not the
> above sleep/wakeup scenarios.

Ah, indeed. Let me extend my script to cover that.

The below script does indeed show a change, but the result still isn't
perfect: when I do ./show-loop 8, it starts 8 loops nicely spread over 2
sockets; the difference is that all 4 remaining would stay on socket 0,
whereas the patched kernel gets 1 over to socket 1.


---

NR=$1; shift

cleanup()
{
	killall loop
}

show_each_loop()
{
	KILL=$1

	ps -deo pid,sgi_p,cmd | grep loop | grep bash | while read pid cpu cmd; do

		SOCKET=`cat /sys/devices/system/cpu/cpu${cpu}/topology/physical_package_id`
		CORE=`cat /sys/devices/system/cpu/cpu${cpu}/topology/core_id`
		SIBLINGS=`cat /sys/devices/system/cpu/cpu${cpu}/topology/thread_siblings_list`

		printf "loop-%05d on CPU: %02d SOCKET: %02d CORE: %02d THREADS: ${SIBLINGS} " $pid $cpu $SOCKET $CORE

		if [ $SOCKET -eq $KILL ]; then
			kill $pid;
			printf "(killed)"
		fi

		printf "\n"
	done
}

trap cleanup SIGINT

echo "starting loops..."

for ((i=0; i<NR; i++)) ; do
	./loop &
done

sleep 1;

echo "killing those on socket 1..."
echo ""

show_each_loop 1

echo ""
echo "watching load-balance work..."
echo ""

while sleep 1 ; do

	show_each_loop -1 | sort | awk '{socket[$6]++; th[$8 + (256*$6)]++; print $0}
	END { for (i in socket) { print "socket-" i ": " socket[i]; }
	for (i in th) { if (th[i] > 1) { print "thread-" int(i/256)"/"(i%256) ": " th[i]; } } }'

	echo ""
	echo "-------------------"
	echo ""

done


2010-02-19 19:52:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-19 at 10:36 -0800, Suresh Siddha wrote:
> > About the ping-pong for the statically infeasible scenario, the problem
> > with that is that we actually want to ping-pong a little [*], so we
> > might want to look at maybe doing a sd->next_pingpong jiffy measure to
> > allow some of that.
> >
> > [*] I've seen people pretty upset about the fact that when they start 6
> > similar loads on a quad cpu the tasks will not finish in roughly similar
> > times.
>
> I thought we were doing this (ping-ping a little) already. Unless
> something broke here also. I thought fix_small_imbalance() takes care of
> this too.

I'm not sure we do; when I test on the patched kernel I see two distinct
groups of runtime appear (I haven't checked the unpatched one, but I doubt
it'll be better).


2010-02-19 19:53:41

by Suresh Siddha

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-19 at 11:47 -0800, Peter Zijlstra wrote:
> On Fri, 2010-02-19 at 10:36 -0800, Suresh Siddha wrote:
> > exec/fork balance is not broken. i.e., during exec/fork we balance the
> > load equally among sockets/cores etc. What is broken is:
> >
> > a) In SMT case, once we end up in a situation where both the threads of
> > the core are busy , with another core completely idle, load balance is
> > not moving one of the threads to the idle core. This unbalanced
> > situation can happen because of a previous wake-up decision and/or
> > threads on other core went to sleep/died etc. Once we end up in this
> > unbalanced situation, we continue in that state with out fixing it.
> >
> > b) Similar to "a", this is MC case where we end up four cores busy in
> > one socket with other 4 cores in another socket completely idle. And
> > this is the situation which we are trying to solve in this patch.
> >
> > In your above example, we test mostly fork/exec balance but not the
> > above sleep/wakeup scenarios.
>
> Ah, indeed. Let me extend my script to cover that.
>
> The below script does indeed show a change, but the result still isn't
> perfect, when I do ./show-loop 8, it starts 8 loops nicely spread over 2
> sockets, the difference is that all 4 remaining would stay on socket 0,
> the patched kernel gets 1 over to socket 1.

Peter, have you applied both my smt patch and mc patch?

thanks,
suresh

2010-02-19 20:02:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-19 at 11:50 -0800, Suresh Siddha wrote:
> On Fri, 2010-02-19 at 11:47 -0800, Peter Zijlstra wrote:
> > On Fri, 2010-02-19 at 10:36 -0800, Suresh Siddha wrote:
> > > exec/fork balance is not broken. i.e., during exec/fork we balance the
> > > load equally among sockets/cores etc. What is broken is:
> > >
> > > a) In SMT case, once we end up in a situation where both the threads of
> > > the core are busy , with another core completely idle, load balance is
> > > not moving one of the threads to the idle core. This unbalanced
> > > situation can happen because of a previous wake-up decision and/or
> > > threads on other core went to sleep/died etc. Once we end up in this
> > > unbalanced situation, we continue in that state with out fixing it.
> > >
> > > b) Similar to "a", this is MC case where we end up four cores busy in
> > > one socket with other 4 cores in another socket completely idle. And
> > > this is the situation which we are trying to solve in this patch.
> > >
> > > In your above example, we test mostly fork/exec balance but not the
> > > above sleep/wakeup scenarios.
> >
> > Ah, indeed. Let me extend my script to cover that.
> >
> > The below script does indeed show a change, but the result still isn't
> > perfect, when I do ./show-loop 8, it starts 8 loops nicely spread over 2
> > sockets, the difference is that all 4 remaining would stay on socket 0,
> > the patched kernel gets 1 over to socket 1.
>
> Peter, Have you applied both my smt patch and mc patch?

Yes, find_busiest_queue() has the wl fixup in (as per tip/master).

2010-02-20 01:14:47

by Suresh Siddha

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-19 at 12:02 -0800, Peter Zijlstra wrote:
> On Fri, 2010-02-19 at 11:50 -0800, Suresh Siddha wrote:
> > On Fri, 2010-02-19 at 11:47 -0800, Peter Zijlstra wrote:
> > > On Fri, 2010-02-19 at 10:36 -0800, Suresh Siddha wrote:
> > > > exec/fork balance is not broken. i.e., during exec/fork we balance the
> > > > load equally among sockets/cores etc. What is broken is:
> > > >
> > > > a) In SMT case, once we end up in a situation where both the threads of
> > > > the core are busy , with another core completely idle, load balance is
> > > > not moving one of the threads to the idle core. This unbalanced
> > > > situation can happen because of a previous wake-up decision and/or
> > > > threads on other core went to sleep/died etc. Once we end up in this
> > > > unbalanced situation, we continue in that state with out fixing it.
> > > >
> > > > b) Similar to "a", this is MC case where we end up four cores busy in
> > > > one socket with other 4 cores in another socket completely idle. And
> > > > this is the situation which we are trying to solve in this patch.
> > > >
> > > > In your above example, we test mostly fork/exec balance but not the
> > > > above sleep/wakeup scenarios.
> > >
> > > Ah, indeed. Let me extend my script to cover that.
> > >
> > > The below script does indeed show a change, but the result still isn't
> > > perfect, when I do ./show-loop 8, it starts 8 loops nicely spread over 2
> > > sockets, the difference is that all 4 remaining would stay on socket 0,
> > > the patched kernel gets 1 over to socket 1.
> >
> > Peter, Have you applied both my smt patch and mc patch?
>
> Yes, find_busiest_queue() has the wl fixup in (as per tip/master).

Ok, Peter. There is another place that is scaling load_per_task with
cpu_power but later comparing it with the difference of the max and min of
the actual (unscaled) cpu load. :(

	avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
				group->cpu_power;

	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
		sgs->group_imb = 1;

Fixing this seems to have fixed the problem you mentioned. Can you
please check out the appended patch? If everything seems ok, then I will
send the patch (against the -tip tree) on Monday morning with the detailed
changelog.
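
For illustration, here is a standalone user-space sketch (assumed numbers
for an SMT core group with cpu_power ~1178, two nice-0 tasks on one thread
and the sibling idle; not kernel code) of how the scaled and unscaled
per-task figures land on different sides of the (max_cpu_load -
min_cpu_load) comparison:

	#include <stdio.h>

	#define SCHED_LOAD_SCALE 1024UL

	int main(void)
	{
		/* assumed group: one thread carries two 1024-weight tasks, one idle */
		unsigned long cpu_power = 1178;
		unsigned long max_cpu_load = 2048, min_cpu_load = 0;
		unsigned long sum_weighted_load = 2048, sum_nr_running = 2;
		unsigned long sum_avg_load_per_task = 1024;	/* 1024 + 0 */

		/* old: per-task load scaled by the group's cpu_power */
		unsigned long scaled =
			sum_avg_load_per_task * SCHED_LOAD_SCALE / cpu_power;
		/* new: plain weighted load per task, same units as max/min_cpu_load */
		unsigned long unscaled = sum_weighted_load / sum_nr_running;

		printf("old: %lu > %lu -> group_imb = %d\n",
		       max_cpu_load - min_cpu_load, 2 * scaled,
		       (max_cpu_load - min_cpu_load) > 2 * scaled);
		printf("new: %lu > %lu -> group_imb = %d\n",
		       max_cpu_load - min_cpu_load, 2 * unscaled,
		       (max_cpu_load - min_cpu_load) > 2 * unscaled);
		return 0;
	}

With these assumed numbers the old expression flags the group as imbalanced
(2048 > 1780) while the new one does not (2048 > 2048 is false), i.e. both
sides of the comparison are finally in the same units.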

Signed-off-by: Suresh Siddha <[email protected]>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..213b445 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3423,6 +3423,7 @@ struct sd_lb_stats {
unsigned long max_load;
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;

int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -3742,8 +3743,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
unsigned long load, max_cpu_load, min_cpu_load;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
- unsigned long sum_avg_load_per_task;
- unsigned long avg_load_per_task;
+ unsigned long avg_load_per_task = 0;

if (local_group) {
balance_cpu = group_first_cpu(group);
@@ -3752,7 +3752,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
}

/* Tally up the load of all CPUs in the group */
- sum_avg_load_per_task = avg_load_per_task = 0;
max_cpu_load = 0;
min_cpu_load = ~0UL;

@@ -3782,7 +3781,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
sgs->sum_nr_running += rq->nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);

- sum_avg_load_per_task += cpu_avg_load_per_task(i);
}

/*
@@ -3801,6 +3799,9 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;


+ if (sgs->sum_nr_running)
+ avg_load_per_task =
+ sgs->sum_weighted_load / sgs->sum_nr_running;
/*
* Consider the group unbalanced when the imbalance is larger
* than the average weight of two tasks.
@@ -3810,9 +3811,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
* normalized nr_running number somewhere that negates
* the hierarchy?
*/
- avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
- group->cpu_power;
-
if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;

@@ -3880,6 +3878,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
sds->max_load = sgs.avg_load;
sds->busiest = group;
sds->busiest_nr_running = sgs.sum_nr_running;
+ sds->busiest_group_capacity = sgs.group_capacity;
sds->busiest_load_per_task = sgs.sum_weighted_load;
sds->group_imb = sgs.group_imb;
}
@@ -3902,6 +3901,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
{
unsigned long tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
+ unsigned long scaled_busy_load_per_task;

if (sds->this_nr_running) {
sds->this_load_per_task /= sds->this_nr_running;
@@ -3912,8 +3912,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
sds->this_load_per_task =
cpu_avg_load_per_task(this_cpu);

- if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
- sds->busiest_load_per_task * imbn) {
+ scaled_busy_load_per_task = sds->busiest_load_per_task
+ * SCHED_LOAD_SCALE;
+ scaled_busy_load_per_task /= sds->busiest->cpu_power;
+
+ if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
+ (scaled_busy_load_per_task * imbn)) {
*imbalance = sds->busiest_load_per_task;
return;
}
@@ -3964,7 +3968,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
unsigned long *imbalance)
{
- unsigned long max_pull;
+ unsigned long max_pull, load_above_capacity = ~0UL;
/*
* In the presence of smp nice balancing, certain scenarios can have
* max load less than avg load(as we skip the groups at or below
@@ -3975,9 +3979,30 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
return fix_small_imbalance(sds, this_cpu, imbalance);
}

- /* Don't want to pull so many tasks that a group would go idle */
- max_pull = min(sds->max_load - sds->avg_load,
- sds->max_load - sds->busiest_load_per_task);
+ if (!sds->group_imb) {
+ /*
+ * Don't want to pull so many tasks that a group would go idle.
+ */
+ load_above_capacity = (sds->busiest_nr_running -
+ sds->busiest_group_capacity);
+
+ load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+
+ load_above_capacity /= sds->busiest->cpu_power;
+ }
+
+ /*
+ * We're trying to get all the cpus to the average_load, so we don't
+ * want to push ourselves above the average load, nor do we wish to
+ * reduce the max loaded cpu below the average load, as either of these
+ * actions would just result in more rebalancing later, and ping-pong
+ * tasks around. Thus we look for the minimum possible imbalance.
+ * Negative imbalances (*we* are more loaded than anyone else) will
+ * be counted as no imbalance for these purposes -- we can't fix that
+ * by pulling tasks to us. Be careful of negative numbers as they'll
+ * appear as very large values with unsigned longs.
+ */
+ max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);

/* How much load to actually move to equalise the imbalance */
*imbalance = min(max_pull * sds->busiest->cpu_power,
@@ -4069,19 +4094,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
sds.busiest_load_per_task =
min(sds.busiest_load_per_task, sds.avg_load);

- /*
- * We're trying to get all the cpus to the average_load, so we don't
- * want to push ourselves above the average load, nor do we wish to
- * reduce the max loaded cpu below the average load, as either of these
- * actions would just result in more rebalancing later, and ping-pong
- * tasks around. Thus we look for the minimum possible imbalance.
- * Negative imbalances (*we* are more loaded than anyone else) will
- * be counted as no imbalance for these purposes -- we can't fix that
- * by pulling tasks to us. Be careful of negative numbers as they'll
- * appear as very large values with unsigned longs.
- */
- if (sds.max_load <= sds.busiest_load_per_task)
- goto out_balanced;

/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);



2010-02-22 18:50:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Fri, 2010-02-19 at 17:13 -0800, Suresh Siddha wrote:

> Ok Peter. There is another place that is scaling load_per_task with
> cpu_power but later comparing with the difference of max and min of the
> actual cpu load. :(
>
> avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
> group->cpu_power;
>
> if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
> sgs->group_imb = 1;
>
> Fixing this seems to have fixed the problem you mentioned. Can you
> please checkout the appended patch? If everything seems ok, then I will
> send the patch (against -tip tree) on monday morning with the detailed
> changelog.

Yes, this one does seem to generate the intended behaviour and does look
good (after cleaning up some of the now redundant comments).

Thanks!

> Signed-off-by: Suresh Siddha <[email protected]>
> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..213b445 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3423,6 +3423,7 @@ struct sd_lb_stats {
> unsigned long max_load;
> unsigned long busiest_load_per_task;
> unsigned long busiest_nr_running;
> + unsigned long busiest_group_capacity;
>
> int group_imb; /* Is there imbalance in this sd */
> #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> @@ -3742,8 +3743,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
> unsigned long load, max_cpu_load, min_cpu_load;
> int i;
> unsigned int balance_cpu = -1, first_idle_cpu = 0;
> - unsigned long sum_avg_load_per_task;
> - unsigned long avg_load_per_task;
> + unsigned long avg_load_per_task = 0;
>
> if (local_group) {
> balance_cpu = group_first_cpu(group);
> @@ -3752,7 +3752,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
> }
>
> /* Tally up the load of all CPUs in the group */
> - sum_avg_load_per_task = avg_load_per_task = 0;
> max_cpu_load = 0;
> min_cpu_load = ~0UL;
>
> @@ -3782,7 +3781,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
> sgs->sum_nr_running += rq->nr_running;
> sgs->sum_weighted_load += weighted_cpuload(i);
>
> - sum_avg_load_per_task += cpu_avg_load_per_task(i);
> }
>
> /*
> @@ -3801,6 +3799,9 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
> sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
>
>
> + if (sgs->sum_nr_running)
> + avg_load_per_task =
> + sgs->sum_weighted_load / sgs->sum_nr_running;
> /*
> * Consider the group unbalanced when the imbalance is larger
> * than the average weight of two tasks.
> @@ -3810,9 +3811,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
> * normalized nr_running number somewhere that negates
> * the hierarchy?
> */
> - avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
> - group->cpu_power;
> -
> if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
> sgs->group_imb = 1;
>
> @@ -3880,6 +3878,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> sds->max_load = sgs.avg_load;
> sds->busiest = group;
> sds->busiest_nr_running = sgs.sum_nr_running;
> + sds->busiest_group_capacity = sgs.group_capacity;
> sds->busiest_load_per_task = sgs.sum_weighted_load;
> sds->group_imb = sgs.group_imb;
> }
> @@ -3902,6 +3901,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> {
> unsigned long tmp, pwr_now = 0, pwr_move = 0;
> unsigned int imbn = 2;
> + unsigned long scaled_busy_load_per_task;
>
> if (sds->this_nr_running) {
> sds->this_load_per_task /= sds->this_nr_running;
> @@ -3912,8 +3912,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> sds->this_load_per_task =
> cpu_avg_load_per_task(this_cpu);
>
> - if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
> - sds->busiest_load_per_task * imbn) {
> + scaled_busy_load_per_task = sds->busiest_load_per_task
> + * SCHED_LOAD_SCALE;
> + scaled_busy_load_per_task /= sds->busiest->cpu_power;
> +
> + if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> + (scaled_busy_load_per_task * imbn)) {
> *imbalance = sds->busiest_load_per_task;
> return;
> }
> @@ -3964,7 +3968,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
> static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> unsigned long *imbalance)
> {
> - unsigned long max_pull;
> + unsigned long max_pull, load_above_capacity = ~0UL;
> /*
> * In the presence of smp nice balancing, certain scenarios can have
> * max load less than avg load(as we skip the groups at or below
> @@ -3975,9 +3979,30 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> return fix_small_imbalance(sds, this_cpu, imbalance);
> }
>
> - /* Don't want to pull so many tasks that a group would go idle */
> - max_pull = min(sds->max_load - sds->avg_load,
> - sds->max_load - sds->busiest_load_per_task);
> + if (!sds->group_imb) {
> + /*
> + * Don't want to pull so many tasks that a group would go idle.
> + */
> + load_above_capacity = (sds->busiest_nr_running -
> + sds->busiest_group_capacity);
> +
> + load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
> +
> + load_above_capacity /= sds->busiest->cpu_power;
> + }
> +
> + /*
> + * We're trying to get all the cpus to the average_load, so we don't
> + * want to push ourselves above the average load, nor do we wish to
> + * reduce the max loaded cpu below the average load, as either of these
> + * actions would just result in more rebalancing later, and ping-pong
> + * tasks around. Thus we look for the minimum possible imbalance.
> + * Negative imbalances (*we* are more loaded than anyone else) will
> + * be counted as no imbalance for these purposes -- we can't fix that
> + * by pulling tasks to us. Be careful of negative numbers as they'll
> + * appear as very large values with unsigned longs.
> + */
> + max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
>
> /* How much load to actually move to equalise the imbalance */
> *imbalance = min(max_pull * sds->busiest->cpu_power,
> @@ -4069,19 +4094,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> sds.busiest_load_per_task =
> min(sds.busiest_load_per_task, sds.avg_load);
>
> - /*
> - * We're trying to get all the cpus to the average_load, so we don't
> - * want to push ourselves above the average load, nor do we wish to
> - * reduce the max loaded cpu below the average load, as either of these
> - * actions would just result in more rebalancing later, and ping-pong
> - * tasks around. Thus we look for the minimum possible imbalance.
> - * Negative imbalances (*we* are more loaded than anyone else) will
> - * be counted as no imbalance for these purposes -- we can't fix that
> - * by pulling tasks to us. Be careful of negative numbers as they'll
> - * appear as very large values with unsigned longs.
> - */
> - if (sds.max_load <= sds.busiest_load_per_task)
> - goto out_balanced;
>
> /* Looks like there is an imbalance. Compute it */
> calculate_imbalance(&sds, this_cpu, imbalance);
>
>
>
>

2010-02-24 00:14:59

by Suresh Siddha

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Mon, 2010-02-22 at 10:50 -0800, Peter Zijlstra wrote:
> On Fri, 2010-02-19 at 17:13 -0800, Suresh Siddha wrote:
>
> > Ok Peter. There is another place that is scaling load_per_task with
> > cpu_power but later comparing with the difference of max and min of the
> > actual cpu load. :(
> >
> > avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
> > group->cpu_power;
> >
> > if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
> > sgs->group_imb = 1;
> >
> > Fixing this seems to have fixed the problem you mentioned. Can you
> > please checkout the appended patch? If everything seems ok, then I will
> > send the patch (against -tip tree) on monday morning with the detailed
> > changelog.
>
> Yes, this one does seem to generate the intended behaviour and does look
> good (after cleaning up some of the now redundant comments).
>
> Thanks!

Ok. Here is the patch with the complete changelog. I added the "Cc stable"
tag so that it can be picked up for 2.6.32 and 2.6.33, as we would like to
see this regression addressed in those kernels. Peter/Ingo: can you
please queue this patch in -tip for 2.6.34?

Thanks.
---

From: Suresh Siddha <[email protected]>
Subject: sched: fix imbalance calculations to address SCHED_MC regression

On platforms like a dual-socket quad-core platform, the scheduler load balancer
is not detecting the load imbalances in certain scenarios. This leads to
scenarios where one socket is completely busy (with all 4 cores running
4 tasks) while another socket is left completely idle. This causes
performance issues as those 4 tasks share the memory controller, last-level
cache bandwidth etc. Also we won't be taking advantage of turbo mode as much as
we would like, etc.

Some of the comparisons in the scheduler load balancing code are comparing
the "weighted cpu load that is scaled wrt sched_group's cpu_power" with the
"weighted average load per task that is not scaled wrt sched_group's cpu_power".
While this has probably been broken for a long time (for multi-socket NUMA
nodes etc.), the problem is aggravated by this recent change:

commit f93e65c186ab3c05ce2068733ca10e34fd00125e
Author: Peter Zijlstra <[email protected]>
Date: Tue Sep 1 10:34:32 2009 +0200

sched: Restore __cpu_power to a straight sum of power

Also with this change, the sched group cpu power alone no longer reflects the
group capacity that is needed to implement MC, MT performance (default) and
power-savings (user-selectable) policies. We need to use the computed
group capacity (sgs.group_capacity, that is computed using the
SD_PREFER_SIBLING logic in update_sd_lb_stats()) to find out if the group with
the max load is above its capacity and how much load to move etc.

Fix it.

Reported-by: Ma Ling <[email protected]>
Initial-Analysis-by: Zhang, Yanmin <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Cc: [email protected] [2.6.32.x, 2.6.33.x]
---

kernel/sched_fair.c | 63 ++++++++++++++++++++++++++++++---------------------
1 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ff7692c..7e0620e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2097,6 +2097,7 @@ struct sd_lb_stats {
unsigned long max_load;
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;

int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2416,14 +2417,12 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
unsigned long load, max_cpu_load, min_cpu_load;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
- unsigned long sum_avg_load_per_task;
- unsigned long avg_load_per_task;
+ unsigned long avg_load_per_task = 0;

if (local_group)
balance_cpu = group_first_cpu(group);

/* Tally up the load of all CPUs in the group */
- sum_avg_load_per_task = avg_load_per_task = 0;
max_cpu_load = 0;
min_cpu_load = ~0UL;

@@ -2453,7 +2452,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
sgs->sum_nr_running += rq->nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);

- sum_avg_load_per_task += cpu_avg_load_per_task(i);
}

/*
@@ -2474,6 +2472,9 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;


+ if (sgs->sum_nr_running)
+ avg_load_per_task =
+ sgs->sum_weighted_load / sgs->sum_nr_running;
/*
* Consider the group unbalanced when the imbalance is larger
* than the average weight of two tasks.
@@ -2483,9 +2484,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
* normalized nr_running number somewhere that negates
* the hierarchy?
*/
- avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
- group->cpu_power;
-
if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;

@@ -2553,6 +2551,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
sds->max_load = sgs.avg_load;
sds->busiest = group;
sds->busiest_nr_running = sgs.sum_nr_running;
+ sds->busiest_group_capacity = sgs.group_capacity;
sds->busiest_load_per_task = sgs.sum_weighted_load;
sds->group_imb = sgs.group_imb;
}
@@ -2575,6 +2574,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
{
unsigned long tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
+ unsigned long scaled_busy_load_per_task;

if (sds->this_nr_running) {
sds->this_load_per_task /= sds->this_nr_running;
@@ -2585,8 +2585,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
sds->this_load_per_task =
cpu_avg_load_per_task(this_cpu);

- if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
- sds->busiest_load_per_task * imbn) {
+ scaled_busy_load_per_task = sds->busiest_load_per_task
+ * SCHED_LOAD_SCALE;
+ scaled_busy_load_per_task /= sds->busiest->cpu_power;
+
+ if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
+ (scaled_busy_load_per_task * imbn)) {
*imbalance = sds->busiest_load_per_task;
return;
}
@@ -2637,7 +2641,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
unsigned long *imbalance)
{
- unsigned long max_pull;
+ unsigned long max_pull, load_above_capacity = ~0UL;
/*
* In the presence of smp nice balancing, certain scenarios can have
* max load less than avg load(as we skip the groups at or below
@@ -2648,9 +2652,29 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
return fix_small_imbalance(sds, this_cpu, imbalance);
}

- /* Don't want to pull so many tasks that a group would go idle */
- max_pull = min(sds->max_load - sds->avg_load,
- sds->max_load - sds->busiest_load_per_task);
+ if (!sds->group_imb) {
+ /*
+ * Don't want to pull so many tasks that a group would go idle.
+ */
+ load_above_capacity = (sds->busiest_nr_running -
+ sds->busiest_group_capacity);
+
+ load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+
+ load_above_capacity /= sds->busiest->cpu_power;
+ }
+
+ /*
+ * We're trying to get all the cpus to the average_load, so we don't
+ * want to push ourselves above the average load, nor do we wish to
+ * reduce the max loaded cpu below the average load. At the same time,
+ * we also don't want to reduce the group load below the group capacity
+ * (so that we can implement power-savings policies etc). Thus we look
+ * for the minimum possible imbalance.
+ * Be careful of negative numbers as they'll appear as very large values
+ * with unsigned longs.
+ */
+ max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);

/* How much load to actually move to equalise the imbalance */
*imbalance = min(max_pull * sds->busiest->cpu_power,
@@ -2742,19 +2766,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
sds.busiest_load_per_task =
min(sds.busiest_load_per_task, sds.avg_load);

- /*
- * We're trying to get all the cpus to the average_load, so we don't
- * want to push ourselves above the average load, nor do we wish to
- * reduce the max loaded cpu below the average load, as either of these
- * actions would just result in more rebalancing later, and ping-pong
- * tasks around. Thus we look for the minimum possible imbalance.
- * Negative imbalances (*we* are more loaded than anyone else) will
- * be counted as no imbalance for these purposes -- we can't fix that
- * by pulling tasks to us. Be careful of negative numbers as they'll
- * appear as very large values with unsigned longs.
- */
- if (sds.max_load <= sds.busiest_load_per_task)
- goto out_balanced;

/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);

2010-02-24 17:43:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Tue, 2010-02-23 at 16:13 -0800, Suresh Siddha wrote:
>
> Ok. Here is the patch with complete changelog. I added "Cc stable" tag
> so that it can be picked up for 2.6.32 and 2.6.33, as we would like to
> see this regression addressed in those kernels. Peter/Ingo: Can you
> please queue this patch for -tip for 2.6.34?
>

Have picked it up with the following changes, Thanks!


Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -2471,10 +2471,6 @@ static inline void update_sg_lb_stats(st
/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

-
- if (sgs->sum_nr_running)
- avg_load_per_task =
- sgs->sum_weighted_load / sgs->sum_nr_running;
/*
* Consider the group unbalanced when the imbalance is larger
* than the average weight of two tasks.
@@ -2484,6 +2480,9 @@ static inline void update_sg_lb_stats(st
* normalized nr_running number somewhere that negates
* the hierarchy?
*/
+ if (sgs->sum_nr_running)
+ avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+
if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;

@@ -2642,6 +2641,13 @@ static inline void calculate_imbalance(s
unsigned long *imbalance)
{
unsigned long max_pull, load_above_capacity = ~0UL;
+
+ sds.busiest_load_per_task /= sds.busiest_nr_running;
+ if (sds.group_imb) {
+ sds.busiest_load_per_task =
+ min(sds.busiest_load_per_task, sds.avg_load);
+ }
+
/*
* In the presence of smp nice balancing, certain scenarios can have
* max load less than avg load(as we skip the groups at or below
@@ -2742,7 +2748,6 @@ find_busiest_group(struct sched_domain *
* 4) This group is more busy than the avg busieness at this
* sched_domain.
* 5) The imbalance is within the specified limit.
- * 6) Any rebalance would lead to ping-pong
*/
if (!(*balance))
goto ret;
@@ -2761,12 +2766,6 @@ find_busiest_group(struct sched_domain *
if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
goto out_balanced;

- sds.busiest_load_per_task /= sds.busiest_nr_running;
- if (sds.group_imb)
- sds.busiest_load_per_task =
- min(sds.busiest_load_per_task, sds.avg_load);
-
-
/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);
return sds.busiest;

2010-02-24 19:33:14

by Suresh Siddha

[permalink] [raw]
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

On Wed, 2010-02-24 at 09:43 -0800, Peter Zijlstra wrote:
> On Tue, 2010-02-23 at 16:13 -0800, Suresh Siddha wrote:
> >
> > Ok. Here is the patch with complete changelog. I added "Cc stable" tag
> > so that it can be picked up for 2.6.32 and 2.6.33, as we would like to
> > see this regression addressed in those kernels. Peter/Ingo: Can you
> > please queue this patch for -tip for 2.6.34?
> >
>
> Have picked it up with the following changes, Thanks!

Looks good. Thanks.

2010-02-26 10:25:33

by Suresh Siddha

[permalink] [raw]
Subject: [tip:sched/core] sched: Fix SCHED_MC regression caused by change in sched cpu_power

Commit-ID: d8462eb81b72ebee011a8adbb9a9a5cd94213af9
Gitweb: http://git.kernel.org/tip/d8462eb81b72ebee011a8adbb9a9a5cd94213af9
Author: Suresh Siddha <[email protected]>
AuthorDate: Tue, 23 Feb 2010 16:13:52 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 26 Feb 2010 10:54:27 +0100

sched: Fix SCHED_MC regression caused by change in sched cpu_power

On platforms like dual socket quad-core platform, the scheduler load
balancer is not detecting the load imbalances in certain scenarios. This
is leading to scenarios where one socket is completely busy (with
all 4 cores running 4 tasks) and leaving another socket
completely idle. This causes performance issues as those 4 tasks share
the memory controller, last-level cache bandwidth etc. Also we won't be
taking advantage of turbo-mode as much as we would like, etc.

Some of the comparisons in the scheduler load balancing code are
comparing the "weighted cpu load that is scaled wrt sched_group's
cpu_power" with the "weighted average load per task that is not scaled
wrt sched_group's cpu_power". While this has probably been broken for a
longer time (for multi-socket NUMA nodes etc.), the problem got aggravated
via this recent change:

|
| commit f93e65c186ab3c05ce2068733ca10e34fd00125e
| Author: Peter Zijlstra <[email protected]>
| Date: Tue Sep 1 10:34:32 2009 +0200
|
| sched: Restore __cpu_power to a straight sum of power
|

Also with this change, the sched group cpu power alone no longer reflects
the group capacity that is needed to implement MC, MT performance
(default) and power-savings (user-selectable) policies.

We need to use the computed group capacity (sgs.group_capacity, that is
computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
find out if the group with the max load is above its capacity and how
much load to move etc.

Reported-by: Ma Ling <[email protected]>
Initial-Analysis-by: Zhang, Yanmin <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: <[email protected]> # [2.6.32.x, 2.6.33.x]
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 76 +++++++++++++++++++++++++++++----------------------
1 files changed, 43 insertions(+), 33 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ff7692c..a3591a7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2097,6 +2097,7 @@ struct sd_lb_stats {
unsigned long max_load;
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;

int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2416,14 +2417,12 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
unsigned long load, max_cpu_load, min_cpu_load;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
- unsigned long sum_avg_load_per_task;
- unsigned long avg_load_per_task;
+ unsigned long avg_load_per_task = 0;

if (local_group)
balance_cpu = group_first_cpu(group);

/* Tally up the load of all CPUs in the group */
- sum_avg_load_per_task = avg_load_per_task = 0;
max_cpu_load = 0;
min_cpu_load = ~0UL;

@@ -2453,7 +2452,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
sgs->sum_nr_running += rq->nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);

- sum_avg_load_per_task += cpu_avg_load_per_task(i);
}

/*
@@ -2473,7 +2471,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

-
/*
* Consider the group unbalanced when the imbalance is larger
* than the average weight of two tasks.
@@ -2483,8 +2480,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
* normalized nr_running number somewhere that negates
* the hierarchy?
*/
- avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
- group->cpu_power;
+ if (sgs->sum_nr_running)
+ avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;
@@ -2553,6 +2550,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
sds->max_load = sgs.avg_load;
sds->busiest = group;
sds->busiest_nr_running = sgs.sum_nr_running;
+ sds->busiest_group_capacity = sgs.group_capacity;
sds->busiest_load_per_task = sgs.sum_weighted_load;
sds->group_imb = sgs.group_imb;
}
@@ -2575,6 +2573,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
{
unsigned long tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
+ unsigned long scaled_busy_load_per_task;

if (sds->this_nr_running) {
sds->this_load_per_task /= sds->this_nr_running;
@@ -2585,8 +2584,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
sds->this_load_per_task =
cpu_avg_load_per_task(this_cpu);

- if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
- sds->busiest_load_per_task * imbn) {
+ scaled_busy_load_per_task = sds->busiest_load_per_task
+ * SCHED_LOAD_SCALE;
+ scaled_busy_load_per_task /= sds->busiest->cpu_power;
+
+ if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
+ (scaled_busy_load_per_task * imbn)) {
*imbalance = sds->busiest_load_per_task;
return;
}
@@ -2637,7 +2640,14 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
unsigned long *imbalance)
{
- unsigned long max_pull;
+ unsigned long max_pull, load_above_capacity = ~0UL;
+
+ sds.busiest_load_per_task /= sds.busiest_nr_running;
+ if (sds.group_imb) {
+ sds.busiest_load_per_task =
+ min(sds.busiest_load_per_task, sds.avg_load);
+ }
+
/*
* In the presence of smp nice balancing, certain scenarios can have
* max load less than avg load(as we skip the groups at or below
@@ -2648,9 +2658,29 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
return fix_small_imbalance(sds, this_cpu, imbalance);
}

- /* Don't want to pull so many tasks that a group would go idle */
- max_pull = min(sds->max_load - sds->avg_load,
- sds->max_load - sds->busiest_load_per_task);
+ if (!sds->group_imb) {
+ /*
+ * Don't want to pull so many tasks that a group would go idle.
+ */
+ load_above_capacity = (sds->busiest_nr_running -
+ sds->busiest_group_capacity);
+
+ load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+
+ load_above_capacity /= sds->busiest->cpu_power;
+ }
+
+ /*
+ * We're trying to get all the cpus to the average_load, so we don't
+ * want to push ourselves above the average load, nor do we wish to
+ * reduce the max loaded cpu below the average load. At the same time,
+ * we also don't want to reduce the group load below the group capacity
+ * (so that we can implement power-savings policies etc). Thus we look
+ * for the minimum possible imbalance.
+ * Be careful of negative numbers as they'll appear as very large values
+ * with unsigned longs.
+ */
+ max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);

/* How much load to actually move to equalise the imbalance */
*imbalance = min(max_pull * sds->busiest->cpu_power,
@@ -2718,7 +2748,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
* 4) This group is more busy than the avg busieness at this
* sched_domain.
* 5) The imbalance is within the specified limit.
- * 6) Any rebalance would lead to ping-pong
*/
if (!(*balance))
goto ret;
@@ -2737,25 +2766,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
goto out_balanced;

- sds.busiest_load_per_task /= sds.busiest_nr_running;
- if (sds.group_imb)
- sds.busiest_load_per_task =
- min(sds.busiest_load_per_task, sds.avg_load);
-
- /*
- * We're trying to get all the cpus to the average_load, so we don't
- * want to push ourselves above the average load, nor do we wish to
- * reduce the max loaded cpu below the average load, as either of these
- * actions would just result in more rebalancing later, and ping-pong
- * tasks around. Thus we look for the minimum possible imbalance.
- * Negative imbalances (*we* are more loaded than anyone else) will
- * be counted as no imbalance for these purposes -- we can't fix that
- * by pulling tasks to us. Be careful of negative numbers as they'll
- * appear as very large values with unsigned longs.
- */
- if (sds.max_load <= sds.busiest_load_per_task)
- goto out_balanced;
-
/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);
return sds.busiest;

2010-02-26 14:58:24

by Suresh Siddha

[permalink] [raw]
Subject: [tip:sched/core] sched: Fix SCHED_MC regression caused by change in sched cpu_power

Commit-ID: dd5feea14a7de4edbd9f36db1a2db785de91b88d
Gitweb: http://git.kernel.org/tip/dd5feea14a7de4edbd9f36db1a2db785de91b88d
Author: Suresh Siddha <[email protected]>
AuthorDate: Tue, 23 Feb 2010 16:13:52 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 26 Feb 2010 15:45:13 +0100

sched: Fix SCHED_MC regression caused by change in sched cpu_power

On platforms like dual socket quad-core platform, the scheduler load
balancer is not detecting the load imbalances in certain scenarios. This
is leading to scenarios where one socket is completely busy (with
all 4 cores running 4 tasks) and leaving another socket
completely idle. This causes performance issues as those 4 tasks share
the memory controller, last-level cache bandwidth etc. Also we won't be
taking advantage of turbo-mode as much as we would like, etc.

Some of the comparisons in the scheduler load balancing code are
comparing the "weighted cpu load that is scaled wrt sched_group's
cpu_power" with the "weighted average load per task that is not scaled
wrt sched_group's cpu_power". While this has probably been broken for a
longer time (for multi-socket NUMA nodes etc.), the problem got aggravated
via this recent change:

|
| commit f93e65c186ab3c05ce2068733ca10e34fd00125e
| Author: Peter Zijlstra <[email protected]>
| Date: Tue Sep 1 10:34:32 2009 +0200
|
| sched: Restore __cpu_power to a straight sum of power
|

Also with this change, the sched group cpu power alone no longer reflects
the group capacity that is needed to implement MC, MT performance
(default) and power-savings (user-selectable) policies.

We need to use the computed group capacity (sgs.group_capacity, that is
computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
find out if the group with the max load is above its capacity and how
much load to move etc.

Reported-by: Ma Ling <[email protected]>
Initial-Analysis-by: Zhang, Yanmin <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
[ -v2: build fix ]
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: <[email protected]> # [2.6.32.x, 2.6.33.x]
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 76 +++++++++++++++++++++++++++++----------------------
1 files changed, 43 insertions(+), 33 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ff7692c..3e1fd96 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2097,6 +2097,7 @@ struct sd_lb_stats {
unsigned long max_load;
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;

int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2416,14 +2417,12 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
unsigned long load, max_cpu_load, min_cpu_load;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
- unsigned long sum_avg_load_per_task;
- unsigned long avg_load_per_task;
+ unsigned long avg_load_per_task = 0;

if (local_group)
balance_cpu = group_first_cpu(group);

/* Tally up the load of all CPUs in the group */
- sum_avg_load_per_task = avg_load_per_task = 0;
max_cpu_load = 0;
min_cpu_load = ~0UL;

@@ -2453,7 +2452,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
sgs->sum_nr_running += rq->nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);

- sum_avg_load_per_task += cpu_avg_load_per_task(i);
}

/*
@@ -2473,7 +2471,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

-
/*
* Consider the group unbalanced when the imbalance is larger
* than the average weight of two tasks.
@@ -2483,8 +2480,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
* normalized nr_running number somewhere that negates
* the hierarchy?
*/
- avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
- group->cpu_power;
+ if (sgs->sum_nr_running)
+ avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;
@@ -2553,6 +2550,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
sds->max_load = sgs.avg_load;
sds->busiest = group;
sds->busiest_nr_running = sgs.sum_nr_running;
+ sds->busiest_group_capacity = sgs.group_capacity;
sds->busiest_load_per_task = sgs.sum_weighted_load;
sds->group_imb = sgs.group_imb;
}
@@ -2575,6 +2573,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
{
unsigned long tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
+ unsigned long scaled_busy_load_per_task;

if (sds->this_nr_running) {
sds->this_load_per_task /= sds->this_nr_running;
@@ -2585,8 +2584,12 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
sds->this_load_per_task =
cpu_avg_load_per_task(this_cpu);

- if (sds->max_load - sds->this_load + sds->busiest_load_per_task >=
- sds->busiest_load_per_task * imbn) {
+ scaled_busy_load_per_task = sds->busiest_load_per_task
+ * SCHED_LOAD_SCALE;
+ scaled_busy_load_per_task /= sds->busiest->cpu_power;
+
+ if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
+ (scaled_busy_load_per_task * imbn)) {
*imbalance = sds->busiest_load_per_task;
return;
}
@@ -2637,7 +2640,14 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds,
static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
unsigned long *imbalance)
{
- unsigned long max_pull;
+ unsigned long max_pull, load_above_capacity = ~0UL;
+
+ sds->busiest_load_per_task /= sds->busiest_nr_running;
+ if (sds->group_imb) {
+ sds->busiest_load_per_task =
+ min(sds->busiest_load_per_task, sds->avg_load);
+ }
+
/*
* In the presence of smp nice balancing, certain scenarios can have
* max load less than avg load(as we skip the groups at or below
@@ -2648,9 +2658,29 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
return fix_small_imbalance(sds, this_cpu, imbalance);
}

- /* Don't want to pull so many tasks that a group would go idle */
- max_pull = min(sds->max_load - sds->avg_load,
- sds->max_load - sds->busiest_load_per_task);
+ if (!sds->group_imb) {
+ /*
+ * Don't want to pull so many tasks that a group would go idle.
+ */
+ load_above_capacity = (sds->busiest_nr_running -
+ sds->busiest_group_capacity);
+
+ load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+
+ load_above_capacity /= sds->busiest->cpu_power;
+ }
+
+ /*
+ * We're trying to get all the cpus to the average_load, so we don't
+ * want to push ourselves above the average load, nor do we wish to
+ * reduce the max loaded cpu below the average load. At the same time,
+ * we also don't want to reduce the group load below the group capacity
+ * (so that we can implement power-savings policies etc). Thus we look
+ * for the minimum possible imbalance.
+ * Be careful of negative numbers as they'll appear as very large values
+ * with unsigned longs.
+ */
+ max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);

/* How much load to actually move to equalise the imbalance */
*imbalance = min(max_pull * sds->busiest->cpu_power,
@@ -2718,7 +2748,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
* 4) This group is more busy than the avg busieness at this
* sched_domain.
* 5) The imbalance is within the specified limit.
- * 6) Any rebalance would lead to ping-pong
*/
if (!(*balance))
goto ret;
@@ -2737,25 +2766,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
goto out_balanced;

- sds.busiest_load_per_task /= sds.busiest_nr_running;
- if (sds.group_imb)
- sds.busiest_load_per_task =
- min(sds.busiest_load_per_task, sds.avg_load);
-
- /*
- * We're trying to get all the cpus to the average_load, so we don't
- * want to push ourselves above the average load, nor do we wish to
- * reduce the max loaded cpu below the average load, as either of these
- * actions would just result in more rebalancing later, and ping-pong
- * tasks around. Thus we look for the minimum possible imbalance.
- * Negative imbalances (*we* are more loaded than anyone else) will
- * be counted as no imbalance for these purposes -- we can't fix that
- * by pulling tasks to us. Be careful of negative numbers as they'll
- * appear as very large values with unsigned longs.
- */
- if (sds.max_load <= sds.busiest_load_per_task)
- goto out_balanced;
-
/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);
return sds.busiest;