2011-02-04 20:51:56

by Venkatesh Pallipadi

Subject: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22

Consider a system with { [ (A B) (C D) ] [ (E F) (G H) ] },
() denoting SMT siblings, [] cores on same socket and {} system wide
Further, A, C and D are idle, B is busy and one of EFGH has excess load.

With sd_idle logic, a check in rebalance_domains() converts tick
based load balance requests from CPU A to busy load balance for core
and above domains (lower rate of balance and higher load_idx).

With first_idle_cpu logic, when CPU C or D tries to balance across domains
the logic finds CPU A as first idle CPU in the group and nominates CPU A to
idle balance across sockets.

But, sd_idle above would not allow CPU A to do cross socket idle balance
as CPU A switches its higher level balancing to busy balance.

So, this can result in no cross socket balancing for extended periods.
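
For reference, the two pieces that work against each other look roughly like
this (a paraphrased sketch, not the exact code; the real hunks appear in the
diff below and in the replies):

        /* rebalance_domains(), running on CPU A: a non-zero return from
         * load_balance() at the SIBLING level (sibling B is busy) demotes
         * A's higher-level balancing to busy balance */
        if (load_balance(cpu, rq, sd, idle, &balance))
                idle = CPU_NOT_IDLE;

        /* update_sg_lb_stats(), running on CPU C or D: A is the first idle
         * CPU in the local group, so A gets nominated for the cross-socket
         * pull that it will no longer do as an idle balancer */
        if (idle_cpu(i) && !first_idle_cpu) {
                first_idle_cpu = 1;
                balance_cpu = i;
        }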

The fix here adds an additional check to detect the sd_idle logic in the
first_idle_cpu code path. We will now nominate (in order of preference):
* First fully idle CPU
* First semi-idle CPU
* First CPU

Note that this solution works fine for the 2 SMT siblings case and won't be
perfect in picking a proper semi-idle CPU in the case of more than 2 SMT threads.

The problem was found by looking at the code and schedstat output. I don't
yet have any data to show impact of this on any workload.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 41 +++++++++++++++++++++++++++++++++++++++--
1 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 62723a4..1790cc2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2603,6 +2603,37 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
return 0;
}

+/*
+ * Find if there is any busy CPUs in SD_SHARE_CPUPOWER domain of
+ * requested CPU.
+ * Bypass the check in case of SD_POWERSAVINGS_BALANCE on
+ * parent domain. In that case requested CPU can still be nominated as
+ * balancer for higher domains.
+ */
+static int is_cpupower_sharing_domain_idle(int cpu)
+{
+ struct sched_domain *sd;
+ int i;
+
+ if (!(sysctl_sched_compat_yield & 0x4))
+ return 1;
+
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_SHARE_CPUPOWER))
+ break;
+
+ if (test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
+ return 1;
+
+ for_each_cpu(i, sched_domain_span(sd)) {
+ if (!idle_cpu(i))
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @sd: The sched_domain whose statistics are to be updated.
@@ -2625,6 +2656,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
+ unsigned int first_semiidle_cpu = 0;
unsigned long avg_load_per_task = 0;

if (local_group)
@@ -2644,8 +2676,13 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
/* Bias balancing toward cpus of our domain */
if (local_group) {
if (idle_cpu(i) && !first_idle_cpu) {
- first_idle_cpu = 1;
- balance_cpu = i;
+ if (is_cpupower_sharing_domain_idle(i)) {
+ first_idle_cpu = 1;
+ balance_cpu = i;
+ } else if (!first_semiidle_cpu) {
+ first_semiidle_cpu = 1;
+ balance_cpu = i;
+ }
}

load = target_load(i, load_idx);
--
1.7.3.1


2011-02-04 21:25:55

by Venkatesh Pallipadi

Subject: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

Consider a system with { [ (A B) (C D) ] [ (E F) (G H) ] },
() denoting SMT siblings, [] cores on same socket and {} system wide
Further, A, C and D are idle, B is busy and one of EFGH has excess load.

With sd_idle logic, a check in rebalance_domains() converts tick
based load balance requests from CPU A to busy load balance for core
and above domains (lower rate of balance and higher load_idx).

With first_idle_cpu logic, when CPU C or D tries to balance across domains
the logic finds CPU A as first idle CPU in the group and nominates CPU A to
idle balance across sockets.

But, sd_idle above would not allow CPU A to do cross socket idle balance
as CPU A switches its higher level balancing to busy balance.

So, this can result in no cross socket balancing for extended periods.

The fix here adds an additional check to detect the sd_idle logic in the
first_idle_cpu code path. We will now nominate (in order of preference):
* First fully idle CPU
* First semi-idle CPU
* First CPU

Note that this solution works fine for the 2 SMT siblings case and won't be
perfect in picking a proper semi-idle CPU in the case of more than 2 SMT threads.

The problem was found by looking at the code and schedstat output. I don't
yet have any data to show impact of this on any workload.

Changes from v0:
* Removed test code that I had left in the earlier version by mistake.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 38 ++++++++++++++++++++++++++++++++++++--
1 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 62723a4..e9dbfa8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2603,6 +2603,34 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
return 0;
}

+/*
+ * Find if there is any busy CPUs in SD_SHARE_CPUPOWER domain of
+ * requested CPU.
+ * Bypass the check in case of SD_POWERSAVINGS_BALANCE on
+ * parent domain. In that case requested CPU can still be nominated as
+ * balancer for higher domains.
+ */
+static int is_cpupower_sharing_domain_idle(int cpu)
+{
+ struct sched_domain *sd;
+ int i;
+
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_SHARE_CPUPOWER))
+ break;
+
+ if (test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
+ return 1;
+
+ for_each_cpu(i, sched_domain_span(sd)) {
+ if (!idle_cpu(i))
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @sd: The sched_domain whose statistics are to be updated.
@@ -2625,6 +2653,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
+ unsigned int first_semiidle_cpu = 0;
unsigned long avg_load_per_task = 0;

if (local_group)
@@ -2644,8 +2673,13 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
/* Bias balancing toward cpus of our domain */
if (local_group) {
if (idle_cpu(i) && !first_idle_cpu) {
- first_idle_cpu = 1;
- balance_cpu = i;
+ if (is_cpupower_sharing_domain_idle(i)) {
+ first_idle_cpu = 1;
+ balance_cpu = i;
+ } else if (!first_semiidle_cpu) {
+ first_semiidle_cpu = 1;
+ balance_cpu = i;
+ }
}

load = target_load(i, load_idx);
--
1.7.3.1

2011-02-07 13:49:41

by Peter Zijlstra

Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

On Fri, 2011-02-04 at 13:25 -0800, Venkatesh Pallipadi wrote:
> Consider a system with { [ (A B) (C D) ] [ (E F) (G H) ] },
> () denoting SMT siblings, [] cores on same socket and {} system wide
> Further, A, C and D are idle, B is busy and one of EFGH has excess load.
>
> With sd_idle logic, a check in rebalance_domains() converts tick
> based load balance requests from CPU A to busy load balance for core
> and above domains (lower rate of balance and higher load_idx).

the if (load_balance())
idle = CPU_NOT_IDLE;
bit, right?

> With first_idle_cpu logic, when CPU C or D tries to balance across domains
> the logic finds CPU A as first idle CPU in the group and nominates CPU A to
> idle balance across sockets.

Right..

> But, sd_idle above would not allow CPU A to do cross socket idle balance
> as CPU A switches its higher level balancing to busy balance.

Because it fails the sd->flags & SD_SHARE_CPUPOWER test at the beginning
of load_balance() and hence sd_idle will remain 0, right?

I'm just not quite sure how we then end up returning !0 for
load_balance(), both branches returning -1 seem conditional on
SD_SHARE_CPUPOWER but the [ (A B) (C D) ], domain doesn't have that set.

> So, this can result in no cross socket balancing for extended periods.

Which is bad

> The fix here adds an additional check to detect the sd_idle logic in the
> first_idle_cpu code path. We will now nominate (in order of preference):
> * First fully idle CPU
> * First semi-idle CPU
> * First CPU
>
> Note that this solution works fine for the 2 SMT siblings case and won't be
> perfect in picking a proper semi-idle CPU in the case of more than 2 SMT threads.

All these SMT exceptions make my head hurt, can't we clean that up
instead of making them worse?

Why is SMT treated differently from say a shared cache? In both cases we
want to spread the load as wide as possible to provide as much of the
resources to the few runnable tasks.

2011-02-07 18:22:06

by Venkatesh Pallipadi

Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

On Mon, Feb 7, 2011 at 5:50 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2011-02-04 at 13:25 -0800, Venkatesh Pallipadi wrote:
>> Consider a system with { [ (A B) (C D) ] [ (E F) (G H) ] },
>> () denoting SMT siblings, [] cores on same socket and {} system wide
>> Further, A, C and D are idle, B is busy and one of EFGH has excess load.
>>
>> With sd_idle logic, a check in rebalance_domains() converts tick
>> based load balance requests from CPU A to busy load balance for core
>> and above domains (lower rate of balance and higher load_idx).
>
> the if (load_balance())
> idle = CPU_NOT_IDLE;
> bit, right?
>
>> With first_idle_cpu logic, when CPU C or D tries to balance across domains
>> the logic finds CPU A as first idle CPU in the group and nominates CPU A to
>> idle balance across sockets.
>
> Right..
>
>> But, sd_idle above would not allow CPU A to do cross socket idle balance
>> as CPU A switches its higher level balancing to busy balance.
>
> Because it fails the sd->flags & SD_SHARE_CPUPOWER test at the beginning
> of load_balance() and hence sd_idle will remain 0, right?
>
> I'm just not quite sure how we then end up returning !0 for
> load_balance(), both branches returning -1 seem conditional on
> SD_SHARE_CPUPOWER but the [ (A B) (C D) ], domain doesn't have that set.
>

For (A B) domain, SD_SHARE_CPUPOWER is set and when A finds that B
is busy, it sets its sd_idle to 1 during its SMT sibling balance (once every
2-4 ticks) and load_balance() returns -1 in this case. And rebalance_domains()
looks at this -1 and makes load_balance calls for CORE and NUMA domains
as CPU_NOT_IDLE, thus increasing the load_balance period.
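
For clarity, the relevant bits inside load_balance() at that time (condensed
from the hunks that get removed later in this thread, with comments added)
were roughly:

        /* CPU A is idle-balancing its (A B) domain, which has
         * SD_SHARE_CPUPOWER, so the balance starts out with sd_idle = 1 */
        if (idle != CPU_NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
            !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
                sd_idle = 1;

        /* ...inside update_sg_lb_stats(): sibling B has tasks running... */
        if (*sd_idle && rq->nr_running)
                *sd_idle = 0;

        /* nothing was pulled and a sibling is busy: report -1, which
         * rebalance_domains() takes as the cue to run the CORE and NUMA
         * level balances with CPU_NOT_IDLE */
        if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
            !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
                ld_moved = -1;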

>> So, this can result in no cross socket balancing for extended periods.
>
> Which is bad
>
>> The fix here adds an additional check to detect the sd_idle logic in the
>> first_idle_cpu code path. We will now nominate (in order of preference):
>> * First fully idle CPU
>> * First semi-idle CPU
>> * First CPU
>>
>> Note that this solution works fine for the 2 SMT siblings case and won't be
>> perfect in picking a proper semi-idle CPU in the case of more than 2 SMT threads.
>
> All these SMT exceptions make my head hurt, can't we clean that up
> instead of making them worse?
>
> Why is SMT treated differently from say a shared cache? In both cases we
> want to spread the load as wide as possible to provide as much of the
> resources to the few runnable tasks.
>

IIRC, the reason for the whole sd_idle part was to have less aggressive
load balance when one SMT sibling is busy and the other is idle, in order not
to take CPU cycles away from the busy sibling.
Suresh will know the exact reasoning behind this and which CPUs and
which workload this helped..

Since this patch, I started looking at sd_idle more closely and I have
two other patches fixing problems related to sd_idle in its current form.
I agree that the sd_idle stuff is making the code a lot more complicated and
we can clean it up or remove it.

Thanks,
Venki

2011-02-07 19:53:17

by Suresh Siddha

Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

On Mon, 2011-02-07 at 10:21 -0800, Venkatesh Pallipadi wrote:
> On Mon, Feb 7, 2011 at 5:50 AM, Peter Zijlstra <[email protected]> wrote:
> > Why is SMT treated differently from say a shared cache? In both cases we
> > want to spread the load as wide as possible to provide as much of the
> > resources to the few runnable tasks.
> >
>
> IIRC, the reason for the whole sd_idle part was to have less aggressive
> load balance when one SMT sibling is busy and other is idle, in order not
> to take CPU cycles away from the busy sibling.
> Suresh will know the exact reasoning behind this and which CPUs and
> which workload this helped..

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5969fe06

Original code came from Nick in 2005.

[PATCH] sched: HT optimisation

If an idle sibling of an HT queue encounters a busy sibling, then make
higher level load balancing of the non-idle variety.

Performance of multiprocessor HT systems with low numbers of tasks
(generally < number of virtual CPUs) can be significantly worse than the
exact same workloads when running in non-HT mode. The reason is largely
due to poor scheduling behaviour.

This patch improves the situation, making the performance gap far less
significant on one problematic test case (tbench).


Peter, to answer your question of why SMT is treated differently from cores
sharing a cache: the performance improvement contributed by SMT is far less
compared to that of the cores, and any wrong decision in SMT load balancing
(especially in the presence of idle cores or packages) has a bigger
impact.

I think in the tbench case referred to by Nick, idle HT siblings in a busy
package picked up the load instead of the idle packages. And thus we
probably had to wait for active load balance to kick in to distribute
the load etc., by which time the damage would have been done. The performance
impact of this condition wouldn't be as severe with cores sharing the last
level cache and other resources.

Also, there have been a lot of changes in this area since 2005. So it would be
nice to revisit the tbench case and see if the logic of propagating busy
sibling status to the higher level load balances is still needed or not.

On the contrary, perhaps there might be some workloads which may benefit
in performance/latency if we completely do away with this less
aggressive SMT load balancing.

Venki, as you are looking into the fixes in this area, can you run your
workloads (as well as tbench) and compare the logic with your fixes vs
removing this logic?

thanks,
suresh

2011-02-08 17:37:23

by Venkatesh Pallipadi

Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

On Mon, Feb 7, 2011 at 11:53 AM, Suresh Siddha
<[email protected]> wrote:
> On Mon, 2011-02-07 at 10:21 -0800, Venkatesh Pallipadi wrote:
>> On Mon, Feb 7, 2011 at 5:50 AM, Peter Zijlstra <[email protected]> wrote:
>> > Why is SMT treated differently from say a shared cache? In both cases we
>> > want to spread the load as wide as possible to provide as much of the
>> > resources to the few runnable tasks.
>> >
>>
>> IIRC, the reason for the whole sd_idle part was to have less aggressive
>> load balance when one SMT sibling is busy and other is idle, in order not
>> to take CPU cycles away from the busy sibling.
>> Suresh will know the exact reasoning behind this and which CPUs and
>> which workload this helped..
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5969fe06
>
> Original code came from Nick in 2005.
>
> [PATCH] sched: HT optimisation
>
> If an idle sibling of an HT queue encounters a busy sibling, then make
> higher level load balancing of the non-idle variety.
>
> Performance of multiprocessor HT systems with low numbers of tasks
> (generally < number of virtual CPUs) can be significantly worse than the
> exact same workloads when running in non-HT mode. The reason is largely
> due to poor scheduling behaviour.
>
> This patch improves the situation, making the performance gap far less
> significant on one problematic test case (tbench).
>
>
> Peter, to answer your question of why SMT is treated different to cores
> sharing cache, performance improvements contributed by SMT is far less
> compared to the cores and any wrong decisions in SMT load balancing
> (especially in the presence of idle cores, packages) has a bigger
> impact.
>
> I think in the tbench case referred by Nick, idle HT siblings in a busy
> package picked the load instead of the idle packages. And thus we
> probably had to wait for active load balance to kick in to distribute
> the load etc by which the damage would have been. Performance impact of
> this condition wouldn't be as severe in the cores sharing last level
> cache and other resources.
>
> Also there are lot of changes in this area since 2005. So it would be
> nice to revisit the tbench case and see if the logic of propagating busy
> sibling status to the higher level load balances is still needed or not.
>
> On the contrary, perhaps there might be some workloads which may benefit
> in performance/latency if we completely do away with this less
> aggressive SMT load balancing.
>
> Venki, as you are looking into the fixes in this area, can you run your
> workloads (aswell as tbench) and compare the logic with your fixes vs
> removing this logic ?
>

I ran tbench (localhost) with various nprocs (4, 8, 12, 18, 24) on a 12
core, 24 SMT system, comparing current sd_idle, sd_idle with all the fixes
I have, and no sd_idle.
In general the tbench throughput numbers had a lot of variance and no noticeable
perf improvement with any of the kernels. The only observable result
was with 12 nprocs (with 11 repeat runs). But this again has high variance.

baseline
N Min Max Median Avg Stddev
x 11 1922.2 2165.48 1937.77 1983.4464 81.730586

sd_idle fixes
N Min Max Median Avg Stddev
x 11 1974.05 2142.77 1994.35 2022.6618 53.034822

no sd_idle
N Min Max Median Avg Stddev
x 11 1982.41 2145.88 2056.86 2043.2027 59.879523


We have at least one workload where both the sd_idle fixes and no sd_idle help
in terms of reducing the idle time and improving the workload latency response,
in comparison to the baseline.

Thanks,
Venki

2011-02-08 18:14:16

by Venkatesh Pallipadi

Subject: Misc sd_idle related fixes

Here are the 3 sd_idle related changes I tested with, for reference. Among
the three, the third patch is the one that helps us in reducing idle cycles
with one of our workloads and thus improves the latency response.

Thanks,
Venki

2011-02-08 18:14:23

by Venkatesh Pallipadi

Subject: [PATCH 1/3] sched: Resolve sd_idle and first_idle_cpu Catch-22

Consider a system with { [ (A B) (C D) ] [ (E F) (G H) ] },
() denoting SMT siblings, [] cores on same socket and {} system wide
Further, A, C and D are idle, B is busy and one of EFGH has excess load.

With sd_idle logic, a check in rebalance_domains() converts tick
based load balance requests from CPU A to busy load balance for core
and above domains (lower rate of balance and higher load_idx).

With first_idle_cpu logic, when CPU C or D tries to balance across domains
the logic finds CPU A as first idle CPU in the group and nominates CPU A to
idle balance across sockets.

But, sd_idle above would not allow CPU A to do cross socket idle balance
as CPU A switches its higher level balancing to busy balance.

So, this can result in no cross socket balancing for extended periods.

The fix here adds an additional check to detect the sd_idle logic in the
first_idle_cpu code path. We will now nominate (in order of preference):
* First fully idle CPU
* First semi-idle CPU
* First CPU

Note that this solution works fine for the 2 SMT siblings case and won't be
perfect in picking a proper semi-idle CPU in the case of more than 2 SMT threads.

The problem was found by looking at the code and schedstat output. I don't
yet have any data to show impact of this on any workload.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 38 ++++++++++++++++++++++++++++++++++++--
1 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0c26e2d..d7e6da3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2603,6 +2603,34 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
return 0;
}

+/*
+ * Find if there is any busy CPUs in SD_SHARE_CPUPOWER domain of
+ * requested CPU.
+ * Bypass the check in case of SD_POWERSAVINGS_BALANCE on
+ * parent domain. In that case requested CPU can still be nominated as
+ * balancer for higher domains.
+ */
+static int is_cpupower_sharing_domain_idle(int cpu)
+{
+ struct sched_domain *sd;
+ int i;
+
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_SHARE_CPUPOWER))
+ break;
+
+ if (test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
+ return 1;
+
+ for_each_cpu(i, sched_domain_span(sd)) {
+ if (!idle_cpu(i))
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @sd: The sched_domain whose statistics are to be updated.
@@ -2625,6 +2653,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
+ unsigned int first_semiidle_cpu = 0;
unsigned long avg_load_per_task = 0;

if (local_group)
@@ -2644,8 +2673,13 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
/* Bias balancing toward cpus of our domain */
if (local_group) {
if (idle_cpu(i) && !first_idle_cpu) {
- first_idle_cpu = 1;
- balance_cpu = i;
+ if (is_cpupower_sharing_domain_idle(i)) {
+ first_idle_cpu = 1;
+ balance_cpu = i;
+ } else if (!first_semiidle_cpu) {
+ first_semiidle_cpu = 1;
+ balance_cpu = i;
+ }
}

load = target_load(i, load_idx);
--
1.7.3.1

2011-02-08 18:14:30

by Venkatesh Pallipadi

Subject: [PATCH 2/3] sched: fix_up broken SMT load balance dilation

There is logic in rebalance_domains() that intends to change CPU_IDLE
load balancing from an SMT CPU to CPU_NOT_IDLE, in the presence of
a busy SMT sibling.

load_balance() at the SIBLING domain returns -1 when there is a busy
sibling in that domain; rebalance_domains() checks for a non-zero
return value, following which idle is changed to CPU_NOT_IDLE.

But this does not work as intended. It does reduce the number of
higher domain load balances on such SMT CPUs. But they end up
doing CPU_IDLE balance most of the time.

Here is a "10s diff" of CPU_IDLE and CPU_NOT_IDLE lb_count from
sched_stat (fields 2, 3, 11 from domain lines) on the particular
CPU of interest.

sd_cpus lb_count[CPU_IDLE] lb_count[CPU_NOT_IDLE]

00001001 4579 0
0003f03f 1200 0
00ffffff 310 0

00001001 4485 0
0003f03f 999 0
00ffffff 341 0

00001001 4593 0
0003f03f 1031 0
00ffffff 293 0

The reason for this is that we do successfully avoid load balancing of higher
domains when the SIBLING domain says one of the siblings is busy. But the next
CORE or NODE load balance can trigger (and is triggering) at a jiffy
when there is no SIBLING load balance pending, and thus those
load balances will not know about the SMT sibling being busy and go ahead
with CPU_IDLE.

One way to solve this is to remember the idle state from the last sibling
load balance and bubble it up the domain levels. With that, under the
same conditions as above, schedstat shows:

sd_cpus lb_count[CPU_IDLE] lb_count[CPU_NOT_IDLE]

00001001 4677 0
0003f03f 2 39
00ffffff 2 9

00001001 4684 0
0003f03f 3 37
00ffffff 3 12

00001001 4781 0
0003f03f 1 39
00ffffff 1 21

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched_fair.c | 4 ++++
2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..56194b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -937,6 +937,7 @@ struct sched_domain {
unsigned int nr_balance_failed; /* initialise to 0 */

u64 last_update;
+ enum cpu_idle_type bubble_up_idle;

#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d7e6da3..91227d9 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3871,6 +3871,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
idle = CPU_NOT_IDLE;
}
sd->last_balance = jiffies;
+ sd->bubble_up_idle = idle;
}
if (need_serialize)
spin_unlock(&balancing);
@@ -3887,6 +3888,9 @@ out:
*/
if (!balance)
break;
+
+ if (idle == CPU_IDLE)
+ idle = sd->bubble_up_idle;
}

/*
--
1.7.3.1

2011-02-08 18:14:34

by Venkatesh Pallipadi

Subject: [PATCH 3/3] sched: newidle balance set idle_timestamp only on successful pull

load_balance() could return a negative value in the case of an
SMT sibling CPU being busy. Code in idle_balance(), though, uses this
return value as an indicator of a successful task pull, ignoring the
-1 return value.

This has two problems:
1) It resets idle_stamp even when this return value is -1.
The specific case is on an SMT capable system: CPU A is idle and its sibling
CPU B is busy. In this case, CPU A's avg_idle will not depend on
a task sleeping/waking up on it. Instead it will continue to hold a stale
avg_idle value for an extended period of time.

In a simple test case of driving avg_idle on a CPU to a desired value by using
a usleep loop, then starting a 100% busy loop on its sibling and
changing the usleep rate on the original CPU (or removing it completely),
I see the avg_idle on this CPU not updating at all.
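
For reference, a rough userspace sketch of such a test driver; the CPU
numbers, rates and file name here are hypothetical and not from the original
report. One instance is pinned to the idle CPU with a non-zero usleep rate,
and a second instance is pinned to its SMT sibling with a rate of 0 to spin:

        /* avg_idle_test.c -- hypothetical helper, build with:
         *   gcc -O2 -o avg_idle_test avg_idle_test.c
         * e.g. "./avg_idle_test 0 500" (idle-ish loop on CPU 0),
         *      "./avg_idle_test 1 0"   (100% busy loop on sibling CPU 1) */
        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                cpu_set_t set;
                int cpu;
                unsigned int sleep_us;

                if (argc < 3) {
                        fprintf(stderr, "usage: %s <cpu> <usleep_us (0 = spin)>\n", argv[0]);
                        return 1;
                }
                cpu = atoi(argv[1]);
                sleep_us = atoi(argv[2]);

                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                if (sched_setaffinity(0, sizeof(set), &set)) {
                        perror("sched_setaffinity");
                        return 1;
                }

                for (;;) {
                        if (sleep_us)
                                usleep(sleep_us); /* periodic idling drives avg_idle */
                        /* else: spin, keeping this CPU 100% busy */
                }
                return 0;
        }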

2) It breaks out of idle_balance(), skipping all higher level domains.
A case can be made that breaking out here is a 'feature' and not a 'bug'.
Periodic balance uses this signal to drop down to busy balance for
higher level domains. But is a simple break out of balancing good for
the newidle case?

We do see in our workload that this break increases idle time
and measurably hurts both throughput and latency. Also, as newidle
balance itself is rate limited with avg_idle, would it be OK to continue
balancing upper domains in this case of a busy SMT sibling? Or maybe
reduce it to CPU_NOT_IDLE from CPU_NEWLY_IDLE? The change here goes with
the former option.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 91227d9..56ab3a6 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3483,7 +3483,7 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
interval = msecs_to_jiffies(sd->balance_interval);
if (time_after(next_balance, sd->last_balance + interval))
next_balance = sd->last_balance + interval;
- if (pulled_task) {
+ if (pulled_task > 0) {
this_rq->idle_stamp = 0;
break;
}
--
1.7.3.1

2011-02-09 03:37:27

by Mike Galbraith

Subject: Re: [PATCH 3/3] sched: newidle balance set idle_timestamp only on successful pull

On Tue, 2011-02-08 at 10:13 -0800, Venkatesh Pallipadi wrote:
> load_balance() could return a negative value in the case of
> SMT sibling CPU being busy. Code in idle_balance() though, uses this
> return value as an indicator of successful task pull, ignoring the
> -1 return value.

Yup, garden variety bug.

> This has two problems:
> 1) Resets idle_stamp even when this return value is -1.
> Specific case is on SMT capable system, CPU A is idle and its sibling
> CPU B is busy. In this case, CPU A avg_idle will not depend on
> a task sleeping/waking up on it. Instead it will continue to hold stale
> avg_idle value for extended period of time.

Not good.

Acked-by: Mike Galbraith <[email protected]>

-Mike

2011-02-09 09:28:39

by Peter Zijlstra

Subject: Re: Misc sd_idle related fixes

On Tue, 2011-02-08 at 10:13 -0800, Venkatesh Pallipadi wrote:
> Here are the 3 sd_idle related changes I tested with, for reference. Among
> the three, the third patch is the one that helps us in reducing idle cycles
> with one of our workloads and thus improves the latency response.

Have you tried what happens if you simply rip all that SMT stuff out and
simplify the code? Afaict much of the capacity stuff we have should have
a similar effect and is less confusing, no?

2011-02-09 15:54:24

by Peter Zijlstra

Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

On Mon, 2011-02-07 at 11:53 -0800, Suresh Siddha wrote:
>
> Peter, to answer your question of why SMT is treated different to cores
> sharing cache, performance improvements contributed by SMT is far less
> compared to the cores and any wrong decisions in SMT load balancing
> (especially in the presence of idle cores, packages) has a bigger
> impact.
>
> I think in the tbench case referred by Nick, idle HT siblings in a busy
> package picked the load instead of the idle packages. And thus we
> probably had to wait for active load balance to kick in to distribute
> the load etc by which the damage would have been. Performance impact of
> this condition wouldn't be as severe in the cores sharing last level
> cache and other resources.
>
> Also there are lot of changes in this area since 2005. So it would be
> nice to revisit the tbench case and see if the logic of propagating busy
> sibling status to the higher level load balances is still needed or not.
>
> On the contrary, perhaps there might be some workloads which may benefit
> in performance/latency if we completely do away with this less
> aggressive SMT load balancing.

Right, but our current capacity logic does exactly that and seems to
work for more than 2 smt siblings (it does the whole asymmetric power7
muck).

From a quick glance at the sched.c state at the time of Nick's patch,
the capacity logic wasn't around then.

So I see no reason what so ever to keep this SMT exception.

2011-02-10 17:24:18

by Venkatesh Pallipadi

Subject: Re: Misc sd_idle related fixes

On Wed, Feb 9, 2011 at 1:29 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2011-02-08 at 10:13 -0800, Venkatesh Pallipadi wrote:
>> Here are the 3 sd_idle related changes I tested with, for reference. Among
>> the three, the third patch is the one that helps us in reducing idle cycles
>> with one of our workloads and thus improves the latency response.
>
> Have you tried what happens if you simply rip all that SMT stuff out and
> simplify the code? Afaict much of the capacity stuff we have should have
> a similar effect and is less confusing, no?
>

Among the benchmarks I looked at (tbench and an internal workload that
showed benefit with these fixes), I see both no sd_idle and
sd_idle+fixes having a similar effect. So, I do not see any problems with
ripping out sd_idle altogether.

We may still need to change the first_idle_cpu logic a bit for SMT though.
It can prevent 2-hop migrations in cases like:
the { [ (A B) (C D) ] [ (E F) (G H) ] } grouping, if B is busy, EFGH are
busy and ACD are idle.
As A happens to be the first idle CPU, it will be the one bringing in the
load from socket EFGH, and then C or D has to pull the load from A.
Instead, if C or D is nominated to pull the task from the other socket, we
can save one hop.
I do not see the capacity logic handling this case. But this is more of a
micro-optimization and may affect workloads like SPECjbb at low
utilization, etc. I haven't seen it affecting the workloads we care
about.

Thanks,
Venki

2011-02-12 01:20:16

by Suresh Siddha

Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

On Wed, 2011-02-09 at 07:55 -0800, Peter Zijlstra wrote:
> On Mon, 2011-02-07 at 11:53 -0800, Suresh Siddha wrote:
> >
> > Peter, to answer your question of why SMT is treated different to cores
> > sharing cache, performance improvements contributed by SMT is far less
> > compared to the cores and any wrong decisions in SMT load balancing
> > (especially in the presence of idle cores, packages) has a bigger
> > impact.
> >
> > I think in the tbench case referred by Nick, idle HT siblings in a busy
> > package picked the load instead of the idle packages. And thus we
> > probably had to wait for active load balance to kick in to distribute
> > the load etc by which the damage would have been. Performance impact of
> > this condition wouldn't be as severe in the cores sharing last level
> > cache and other resources.
> >
> > Also there are lot of changes in this area since 2005. So it would be
> > nice to revisit the tbench case and see if the logic of propagating busy
> > sibling status to the higher level load balances is still needed or not.
> >
> > On the contrary, perhaps there might be some workloads which may benefit
> > in performance/latency if we completely do away with this less
> > aggressive SMT load balancing.
>
> Right, but our current capacity logic does exactly that and seems to
> work for more than 2 smt siblings (it does the whole asymmetric power7
> muck).
>
> From a quick glance at the sched.c state at the time of Nick's patch,
> the capacity logic wasn't around then.

Yes Peter. We have a lot more logic now which is trying to predict the
imbalance between the groups more accurately.

>
> So I see no reason what so ever to keep this SMT exception.

I am also ok with removing this code. But as Venki mentioned earlier
(http://marc.info/?l=linux-kernel&m=129735866732171&w=2), we need to
make sure an idle core gets priority over an idle SMT thread on a
busy core while pulling the load from the busiest socket.

I requested Venki to post these 2 patches: removing the propagation of
busy sibling status to an idle sibling, and prioritizing the idle core
while pulling the load. I will request Alex and Tim to run their
performance workloads to make sure that this doesn't show any
regressions.

thanks,
suresh

2011-02-14 22:39:41

by Venkatesh Pallipadi

Subject: [PATCH] sched: Wholesale removal of sd_idle logic

sd_idle logic was introduced way back in 2005 (commit 5969fe06),
as an HT optimization.

As per the discussion in the thread here
lkml subject - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1
https://patchwork.kernel.org/patch/532501/

the capacity based logic in the load balancer right now handles this
in a much cleaner way, handling more than 2 SMT siblings etc., and sd_idle
does not seem to bring any additional benefits. The sd_idle logic also has
some bugs that have a performance impact. Here is the patch that removes
the sd_idle logic altogether.

The initial patch here - https://patchwork.kernel.org/patch/532501/
applies cleanly over the below change and provides a micro-optimization
for a specific case, where an idle core can pull tasks instead of a
core with one thread idle and the other thread busy.
It will be good to get some data on whether this micro-optimization
matters or not.

Also, there was a dependency of sched_mc_power_savings == 2 on the sd_idle
logic. Copying Vaidy to find out the impact of this change there.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 53 ++++++++++----------------------------------------
1 files changed, 11 insertions(+), 42 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0c26e2d..932dc13 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2610,7 +2610,6 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
* @this_cpu: Cpu for which load balance is currently performed.
* @idle: Idle status of this_cpu
* @load_idx: Load index of sched_domain of this_cpu for load calc.
- * @sd_idle: Idle status of the sched_domain containing group.
* @local_group: Does group contain this_cpu.
* @cpus: Set of cpus considered for load balancing.
* @balance: Should we balance.
@@ -2618,7 +2617,7 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
*/
static inline void update_sg_lb_stats(struct sched_domain *sd,
struct sched_group *group, int this_cpu,
- enum cpu_idle_type idle, int load_idx, int *sd_idle,
+ enum cpu_idle_type idle, int load_idx,
int local_group, const struct cpumask *cpus,
int *balance, struct sg_lb_stats *sgs)
{
@@ -2638,9 +2637,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
for_each_cpu_and(i, sched_group_cpus(group), cpus) {
struct rq *rq = cpu_rq(i);

- if (*sd_idle && rq->nr_running)
- *sd_idle = 0;
-
/* Bias balancing toward cpus of our domain */
if (local_group) {
if (idle_cpu(i) && !first_idle_cpu) {
@@ -2755,15 +2751,13 @@ static bool update_sd_pick_busiest(struct sched_domain *sd,
* @sd: sched_domain whose statistics are to be updated.
* @this_cpu: Cpu for which load balance is currently performed.
* @idle: Idle status of this_cpu
- * @sd_idle: Idle status of the sched_domain containing sg.
* @cpus: Set of cpus considered for load balancing.
* @balance: Should we balance.
* @sds: variable to hold the statistics for this sched_domain.
*/
static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
- enum cpu_idle_type idle, int *sd_idle,
- const struct cpumask *cpus, int *balance,
- struct sd_lb_stats *sds)
+ enum cpu_idle_type idle, const struct cpumask *cpus,
+ int *balance, struct sd_lb_stats *sds)
{
struct sched_domain *child = sd->child;
struct sched_group *sg = sd->groups;
@@ -2781,7 +2775,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,

local_group = cpumask_test_cpu(this_cpu, sched_group_cpus(sg));
memset(&sgs, 0, sizeof(sgs));
- update_sg_lb_stats(sd, sg, this_cpu, idle, load_idx, sd_idle,
+ update_sg_lb_stats(sd, sg, this_cpu, idle, load_idx,
local_group, cpus, balance, &sgs);

if (local_group && !(*balance))
@@ -3033,7 +3027,6 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
* @imbalance: Variable which stores amount of weighted load which should
* be moved to restore balance/put a group to idle.
* @idle: The idle status of this_cpu.
- * @sd_idle: The idleness of sd
* @cpus: The set of CPUs under consideration for load-balancing.
* @balance: Pointer to a variable indicating if this_cpu
* is the appropriate cpu to perform load balancing at this_level.
@@ -3046,7 +3039,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
unsigned long *imbalance, enum cpu_idle_type idle,
- int *sd_idle, const struct cpumask *cpus, int *balance)
+ const struct cpumask *cpus, int *balance)
{
struct sd_lb_stats sds;

@@ -3056,8 +3049,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
* Compute the various statistics relavent for load balancing at
* this level.
*/
- update_sd_lb_stats(sd, this_cpu, idle, sd_idle, cpus,
- balance, &sds);
+ update_sd_lb_stats(sd, this_cpu, idle, cpus, balance, &sds);

/* Cases where imbalance does not exist from POV of this_cpu */
/* 1) this_cpu is not the appropriate cpu to perform load balancing
@@ -3193,7 +3185,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
/* Working cpumask for load_balance and load_balance_newidle. */
static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);

-static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
+static int need_active_balance(struct sched_domain *sd, int idle,
int busiest_cpu, int this_cpu)
{
if (idle == CPU_NEWLY_IDLE) {
@@ -3225,10 +3217,6 @@ static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
* move_tasks() will succeed. ld_moved will be true and this
* active balance code will not be triggered.
*/
- if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
- return 0;
-
if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
return 0;
}
@@ -3246,7 +3234,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum cpu_idle_type idle,
int *balance)
{
- int ld_moved, all_pinned = 0, active_balance = 0, sd_idle = 0;
+ int ld_moved, all_pinned = 0, active_balance = 0;
struct sched_group *group;
unsigned long imbalance;
struct rq *busiest;
@@ -3255,20 +3243,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,

cpumask_copy(cpus, cpu_active_mask);

- /*
- * When power savings policy is enabled for the parent domain, idle
- * sibling can pick up load irrespective of busy siblings. In this case,
- * let the state of idle sibling percolate up as CPU_IDLE, instead of
- * portraying it as CPU_NOT_IDLE.
- */
- if (idle != CPU_NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
- sd_idle = 1;
-
schedstat_inc(sd, lb_count[idle]);

redo:
- group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
+ group = find_busiest_group(sd, this_cpu, &imbalance, idle,
cpus, balance);

if (*balance == 0)
@@ -3330,8 +3308,7 @@ redo:
if (idle != CPU_NEWLY_IDLE)
sd->nr_balance_failed++;

- if (need_active_balance(sd, sd_idle, idle, cpu_of(busiest),
- this_cpu)) {
+ if (need_active_balance(sd, idle, cpu_of(busiest), this_cpu)) {
raw_spin_lock_irqsave(&busiest->lock, flags);

/* don't kick the active_load_balance_cpu_stop,
@@ -3386,10 +3363,6 @@ redo:
sd->balance_interval *= 2;
}

- if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
- ld_moved = -1;
-
goto out;

out_balanced:
@@ -3403,11 +3376,7 @@ out_one_pinned:
(sd->balance_interval < sd->max_interval))
sd->balance_interval *= 2;

- if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
- ld_moved = -1;
- else
- ld_moved = 0;
+ ld_moved = 0;
out:
return ld_moved;
}
--
1.7.3.1

2011-02-15 09:16:07

by Peter Zijlstra

Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

On Fri, 2011-02-11 at 17:20 -0800, Suresh Siddha wrote:
> I will request Alex and Tim to run their
> performance workloads to make sure that this doesn't show any
> regressions.

Are those workloads public? If so which are they?

It would be good if all of us working on this could use them.. :-)

2011-02-15 17:02:51

by Vaidyanathan Srinivasan

Subject: Re: [PATCH] sched: Wholesale removal of sd_idle logic

* Venkatesh Pallipadi <[email protected]> [2011-02-14 14:38:50]:

> sd_idle logic was introduced way back in 2005 (commit 5969fe06),
> as an HT optimization.
>
> As per the discussion in the thread here
> lkml subject - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1
> https://patchwork.kernel.org/patch/532501/
>
> the capacity based logic in the load balancer right now handles this
> in a much cleaner way, handling more than 2 SMT siblings etc, and sd_idle
> does not seem to bring any additional benefits. sd_idle logic also has
> some bugs that has performance impact. Here is the patch that removes
> the sd_idle logic altogether.
>
> The initial patch here - https://patchwork.kernel.org/patch/532501/
> applies cleanly over the below change and provides a micro-optimization
> for a specific case, where an idle core can pull tasks instead of a
> core with one thread being idle and other thread being busy.
> It will be good to get some data on whether this micro-optimization
> matters or not.
>
> Also, there was a dependency of sched_mc_power_savings == 2, with sd_idle
> logic. Copying Vaidy to know the impact of this change there.

Hi Venki,

The dependency is to avoid active balancing when there is a busy
sibling and power save balance is not set.

Another piece of logic would propagate/force sd_idle=1 to induce more frequent
balancing for the idle sibling in the case of power save balance. Removing
sd_idle will make this the default.

Your changes look good. I will test and report.

> Signed-off-by: Venkatesh Pallipadi <[email protected]>

Acked-by: Vaidyanathan Srinivasan <[email protected]>

> ---
> kernel/sched_fair.c | 53 ++++++++++----------------------------------------
> 1 files changed, 11 insertions(+), 42 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 0c26e2d..932dc13 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2610,7 +2610,6 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
> * @this_cpu: Cpu for which load balance is currently performed.
> * @idle: Idle status of this_cpu
> * @load_idx: Load index of sched_domain of this_cpu for load calc.
> - * @sd_idle: Idle status of the sched_domain containing group.
> * @local_group: Does group contain this_cpu.
> * @cpus: Set of cpus considered for load balancing.
> * @balance: Should we balance.
> @@ -2618,7 +2617,7 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
> */
> static inline void update_sg_lb_stats(struct sched_domain *sd,
> struct sched_group *group, int this_cpu,
> - enum cpu_idle_type idle, int load_idx, int *sd_idle,
> + enum cpu_idle_type idle, int load_idx,
> int local_group, const struct cpumask *cpus,
> int *balance, struct sg_lb_stats *sgs)
> {
> @@ -2638,9 +2637,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
> for_each_cpu_and(i, sched_group_cpus(group), cpus) {
> struct rq *rq = cpu_rq(i);
>
> - if (*sd_idle && rq->nr_running)
> - *sd_idle = 0;
> -
> /* Bias balancing toward cpus of our domain */
> if (local_group) {
> if (idle_cpu(i) && !first_idle_cpu) {
> @@ -2755,15 +2751,13 @@ static bool update_sd_pick_busiest(struct sched_domain *sd,
> * @sd: sched_domain whose statistics are to be updated.
> * @this_cpu: Cpu for which load balance is currently performed.
> * @idle: Idle status of this_cpu
> - * @sd_idle: Idle status of the sched_domain containing sg.
> * @cpus: Set of cpus considered for load balancing.
> * @balance: Should we balance.
> * @sds: variable to hold the statistics for this sched_domain.
> */
> static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
> - enum cpu_idle_type idle, int *sd_idle,
> - const struct cpumask *cpus, int *balance,
> - struct sd_lb_stats *sds)
> + enum cpu_idle_type idle, const struct cpumask *cpus,
> + int *balance, struct sd_lb_stats *sds)
> {
> struct sched_domain *child = sd->child;
> struct sched_group *sg = sd->groups;
> @@ -2781,7 +2775,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
>
> local_group = cpumask_test_cpu(this_cpu, sched_group_cpus(sg));
> memset(&sgs, 0, sizeof(sgs));
> - update_sg_lb_stats(sd, sg, this_cpu, idle, load_idx, sd_idle,
> + update_sg_lb_stats(sd, sg, this_cpu, idle, load_idx,
> local_group, cpus, balance, &sgs);
>
> if (local_group && !(*balance))
> @@ -3033,7 +3027,6 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> * @imbalance: Variable which stores amount of weighted load which should
> * be moved to restore balance/put a group to idle.
> * @idle: The idle status of this_cpu.
> - * @sd_idle: The idleness of sd
> * @cpus: The set of CPUs under consideration for load-balancing.
> * @balance: Pointer to a variable indicating if this_cpu
> * is the appropriate cpu to perform load balancing at this_level.
> @@ -3046,7 +3039,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
> static struct sched_group *
> find_busiest_group(struct sched_domain *sd, int this_cpu,
> unsigned long *imbalance, enum cpu_idle_type idle,
> - int *sd_idle, const struct cpumask *cpus, int *balance)
> + const struct cpumask *cpus, int *balance)
> {
> struct sd_lb_stats sds;
>
> @@ -3056,8 +3049,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
> * Compute the various statistics relavent for load balancing at
> * this level.
> */
> - update_sd_lb_stats(sd, this_cpu, idle, sd_idle, cpus,
> - balance, &sds);
> + update_sd_lb_stats(sd, this_cpu, idle, cpus, balance, &sds);
>
> /* Cases where imbalance does not exist from POV of this_cpu */
> /* 1) this_cpu is not the appropriate cpu to perform load balancing
> @@ -3193,7 +3185,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
> /* Working cpumask for load_balance and load_balance_newidle. */
> static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
>
> -static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
> +static int need_active_balance(struct sched_domain *sd, int idle,
> int busiest_cpu, int this_cpu)
> {
> if (idle == CPU_NEWLY_IDLE) {
> @@ -3225,10 +3217,6 @@ static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
> * move_tasks() will succeed. ld_moved will be true and this
> * active balance code will not be triggered.
> */
> - if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
> - !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> - return 0;
> -

This condition will nack active balancing for a semi-idle core when
sched_smt_powersavings is not set. f_b_g() itself should have
returned NULL if there is no power savings opportunity.

> if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
> return 0;
> }
> @@ -3246,7 +3234,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct sched_domain *sd, enum cpu_idle_type idle,
> int *balance)
> {
> - int ld_moved, all_pinned = 0, active_balance = 0, sd_idle = 0;
> + int ld_moved, all_pinned = 0, active_balance = 0;
> struct sched_group *group;
> unsigned long imbalance;
> struct rq *busiest;
> @@ -3255,20 +3243,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>
> cpumask_copy(cpus, cpu_active_mask);
>
> - /*
> - * When power savings policy is enabled for the parent domain, idle
> - * sibling can pick up load irrespective of busy siblings. In this case,
> - * let the state of idle sibling percolate up as CPU_IDLE, instead of
> - * portraying it as CPU_NOT_IDLE.
> - */
> - if (idle != CPU_NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
> - !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> - sd_idle = 1;

This is kind of becoming the default now that sd_idle is removed.
When powersave balance is set, we want to run the load balancer more
frequently.

> -
> schedstat_inc(sd, lb_count[idle]);
>
> redo:
> - group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
> + group = find_busiest_group(sd, this_cpu, &imbalance, idle,
> cpus, balance);
>
> if (*balance == 0)
> @@ -3330,8 +3308,7 @@ redo:
> if (idle != CPU_NEWLY_IDLE)
> sd->nr_balance_failed++;
>
> - if (need_active_balance(sd, sd_idle, idle, cpu_of(busiest),
> - this_cpu)) {
> + if (need_active_balance(sd, idle, cpu_of(busiest), this_cpu)) {
> raw_spin_lock_irqsave(&busiest->lock, flags);
>
> /* don't kick the active_load_balance_cpu_stop,
> @@ -3386,10 +3363,6 @@ redo:
> sd->balance_interval *= 2;
> }
>
> - if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
> - !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> - ld_moved = -1;

I have not figured out where ld_moved is checked for -1 and why we
need to treat this as a special case.

Your bug fix in idle_balance() for if (pulled_task) {...} is a good
catch.

> -
> goto out;
>
> out_balanced:
> @@ -3403,11 +3376,7 @@ out_one_pinned:
> (sd->balance_interval < sd->max_interval))
> sd->balance_interval *= 2;
>
> - if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
> - !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> - ld_moved = -1;
> - else
> - ld_moved = 0;

Ack. But why did we have to flag this case earlier?

> + ld_moved = 0;
> out:
> return ld_moved;
> }

--Vaidy

2011-02-15 18:26:06

by Venkatesh Pallipadi

Subject: Re: [PATCH] sched: Wholesale removal of sd_idle logic

On Tue, Feb 15, 2011 at 9:01 AM, Vaidyanathan Srinivasan
<[email protected]> wrote:
> * Venkatesh Pallipadi <[email protected]> [2011-02-14 14:38:50]:
>
>> sd_idle logic was introduced way back in 2005 (commit 5969fe06),
>> as an HT optimization.
>>
>> As per the discussion in the thread here
>> lkml subject - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1
>> https://patchwork.kernel.org/patch/532501/
>>
>> the capacity based logic in the load balancer right now handles this
>> in a much cleaner way, handling more than 2 SMT siblings etc, and sd_idle
>> does not seem to bring any additional benefits. sd_idle logic also has
>> some bugs that has performance impact. Here is the patch that removes
>> the sd_idle logic altogether.
>>
>> The initial patch here - https://patchwork.kernel.org/patch/532501/
>> applies cleanly over the below change and provides a micro-optimization
>> for a specific case, where an idle core can pull tasks instead of a
>> core with one thread being idle and other thread being busy.
>> It will be good to get some data on whether this micro-optimization
>> matters or not.
>>
>> Also, there was a dependency of sched_mc_power_savings == 2, with sd_idle
>> logic. Copying Vaidy to know the impact of this change there.
>
> Hi Venki,
>
> The dependency is to avoid active balancing when there is a busy
> sibling and power save balance is not set.
>
> Another logic would propagate/force sd_idle=1 to induce more frequent
> balancing for idle sibling in case of power save balance. Removing
> sd_idle will make this default.
>
> Your changes look good. I will test and report.
>
>> Signed-off-by: Venkatesh Pallipadi <[email protected]>
>
> Acked-by: Vaidyanathan Srinivasan <[email protected]>
>
>> ---
>> kernel/sched_fair.c | 53 ++++++++++----------------------------------
>> 1 files changed, 11 insertions(+), 42 deletions(-)
>>

<snip>

>> @@ -3386,10 +3363,6 @@ redo:
>> sd->balance_interval *= 2;
>> }
>>
>> - if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
>> - !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
>> - ld_moved = -1;
>
> I have not figured out where ld_moved is checked for -1 and why we
> need to treat this as a special case.
>

The return value of -1 was being consumed in the rebalance_domains() call
to load_balance(). Returning -1 (instead of 0 in this case) makes
rebalance_domains() call higher domain load balancing
with CPU_NOT_IDLE when the sibling is busy, even when there
was no load pulled in.

Thanks,
Venki

> Your bug fix in idle_balance() for if (pulled_task) {...} is a good
> catch.
>
>> -
>> goto out;
>>
>> out_balanced:
>> @@ -3403,11 +3376,7 @@ out_one_pinned:
>> (sd->balance_interval < sd->max_interval))
>> sd->balance_interval *= 2;
>>
>> - if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
>> - !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
>> - ld_moved = -1;
>> - else
>> - ld_moved = 0;
>
> Ack. But why did we have to flag this case earlier?
>
>> + ld_moved = 0;
>> out:
>> return ld_moved;
>> }
>
> --Vaidy
>
>

2011-02-15 19:11:34

by Suresh Siddha

Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1

On Tue, 2011-02-15 at 01:15 -0800, Peter Zijlstra wrote:
> On Fri, 2011-02-11 at 17:20 -0800, Suresh Siddha wrote:
> > I will request Alex and Tim to run their
> > performance workloads to make sure that this doesn't show any
> > regressions.
>
> Are those workloads public? If so which are they?
>
> It would be good if all of us working on this could use them.. :-)

http://kernel-perf.sourceforge.net/about_tests.php

thanks,
suresh

2011-02-16 08:54:17

by Vaidyanathan Srinivasan

Subject: Re: [PATCH] sched: Wholesale removal of sd_idle logic

* Venkatesh Pallipadi <[email protected]> [2011-02-15 10:26:01]:

<snip>

> >> @@ -3386,10 +3363,6 @@ redo:
> >> sd->balance_interval *= 2;
> >> }
> >>
> >> - if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
> >> - !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> >> - ld_moved = -1;
> >
> > I have not figured out where ld_moved is checked for -1 and why we
> > need to treat this as a special case.
> >
>
> Return value of -1 was being consumed in rebalance domains() call
> to load_balance(). Returning -1 (instead of 0 in this case) makes
> rebalance_domains() to call higher domain load balancing
> with CPU_NOT_IDLE, when sibling is busy and even when there
> was no load pulled in.

Ok, so in rebalance_domains() we need not distinguish between a 1 or -1
return code. We can set idle = CPU_NOT_IDLE as long as we pull
a task. With the removal of the sd_idle logic, the value of idle will remain
the same for all domains.

Now in idle_balance() you have to check for return code > 0 in order
to reset this_rq->idle_stamp = 0. But in the next check:

if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
/*
* We are going idle. next_balance may be set based on
* a busy processor. So reset next_balance.
*/
this_rq->next_balance = next_balance;
}

Earlier, we would push out the next_balance interval in the busy-sibling
case, since pulled_task would be set to -1. But now, with the removal of
the sd_idle logic, this_rq->next_balance will not be touched, leading to
an earlier rebalance.
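
As a rough before/after sketch of that check (not a verbatim diff, just
spelling out the two cases):

	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
		/*
		 * Old code: a busy sibling made load_balance() return
		 * -1, so pulled_task was non-zero and next_balance got
		 * reset even though nothing was actually pulled.
		 *
		 * New code: load_balance() returns 0 in that case, so
		 * pulled_task stays 0 and this_rq->next_balance is left
		 * untouched, allowing an earlier rebalance.
		 */
		this_rq->next_balance = next_balance;
	}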

In summary, load_balance() will now return 0 or 1, and the special
case of -1 is completely removed in the new code. Thanks for the
clarification.

--Vaidy

2011-02-16 11:44:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: Wholesale removal of sd_idle logic

On Mon, 2011-02-14 at 14:38 -0800, Venkatesh Pallipadi wrote:
>
> The initial patch here - https://patchwork.kernel.org/patch/532501/
> applies cleanly over the below change and provides a micro-optimization
> for a specific case, where an idle core can pull tasks instead of a
> core with one thread idle and the other thread busy.
> It will be good to get some data on whether this micro-optimization
> matters or not.

I would much rather see a solution that solves that problem for the more
generic situation where we lower capacity to 1.

The whole double-balance thing is fairly inherent in the whole design,
and making an exception for just one special case doesn't look right.

2011-02-16 13:51:05

by Venkatesh Pallipadi

[permalink] [raw]
Subject: [tip:sched/core] sched: Wholesale removal of sd_idle logic

Commit-ID: 46e49b3836c7cd2ae5b5fe76fa981d0d292a52fe
Gitweb: http://git.kernel.org/tip/46e49b3836c7cd2ae5b5fe76fa981d0d292a52fe
Author: Venkatesh Pallipadi <[email protected]>
AuthorDate: Mon, 14 Feb 2011 14:38:50 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Feb 2011 13:33:20 +0100

sched: Wholesale removal of sd_idle logic

sd_idle logic was introduced way back in 2005 (commit 5969fe06),
as an HT optimization.

As per the discussion in the thread here:

lkml - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1
https://patchwork.kernel.org/patch/532501/

The capacity-based logic in the load balancer now handles this in a much
cleaner way, covering more than 2 SMT siblings etc., and sd_idle does not
seem to bring any additional benefit. The sd_idle logic also has some bugs
that have a performance impact. Here is the patch that removes the sd_idle
logic altogether.

Also, there was a dependency between sched_mc_power_savings == 2 and the
sd_idle logic.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Acked-by: Vaidyanathan Srinivasan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 53 ++++++++++----------------------------------------
1 files changed, 11 insertions(+), 42 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0270246..d384e73 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2672,7 +2672,6 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
* @this_cpu: Cpu for which load balance is currently performed.
* @idle: Idle status of this_cpu
* @load_idx: Load index of sched_domain of this_cpu for load calc.
- * @sd_idle: Idle status of the sched_domain containing group.
* @local_group: Does group contain this_cpu.
* @cpus: Set of cpus considered for load balancing.
* @balance: Should we balance.
@@ -2680,7 +2679,7 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
*/
static inline void update_sg_lb_stats(struct sched_domain *sd,
struct sched_group *group, int this_cpu,
- enum cpu_idle_type idle, int load_idx, int *sd_idle,
+ enum cpu_idle_type idle, int load_idx,
int local_group, const struct cpumask *cpus,
int *balance, struct sg_lb_stats *sgs)
{
@@ -2700,9 +2699,6 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
for_each_cpu_and(i, sched_group_cpus(group), cpus) {
struct rq *rq = cpu_rq(i);

- if (*sd_idle && rq->nr_running)
- *sd_idle = 0;
-
/* Bias balancing toward cpus of our domain */
if (local_group) {
if (idle_cpu(i) && !first_idle_cpu) {
@@ -2817,15 +2813,13 @@ static bool update_sd_pick_busiest(struct sched_domain *sd,
* @sd: sched_domain whose statistics are to be updated.
* @this_cpu: Cpu for which load balance is currently performed.
* @idle: Idle status of this_cpu
- * @sd_idle: Idle status of the sched_domain containing sg.
* @cpus: Set of cpus considered for load balancing.
* @balance: Should we balance.
* @sds: variable to hold the statistics for this sched_domain.
*/
static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
- enum cpu_idle_type idle, int *sd_idle,
- const struct cpumask *cpus, int *balance,
- struct sd_lb_stats *sds)
+ enum cpu_idle_type idle, const struct cpumask *cpus,
+ int *balance, struct sd_lb_stats *sds)
{
struct sched_domain *child = sd->child;
struct sched_group *sg = sd->groups;
@@ -2843,7 +2837,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,

local_group = cpumask_test_cpu(this_cpu, sched_group_cpus(sg));
memset(&sgs, 0, sizeof(sgs));
- update_sg_lb_stats(sd, sg, this_cpu, idle, load_idx, sd_idle,
+ update_sg_lb_stats(sd, sg, this_cpu, idle, load_idx,
local_group, cpus, balance, &sgs);

if (local_group && !(*balance))
@@ -3095,7 +3089,6 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
* @imbalance: Variable which stores amount of weighted load which should
* be moved to restore balance/put a group to idle.
* @idle: The idle status of this_cpu.
- * @sd_idle: The idleness of sd
* @cpus: The set of CPUs under consideration for load-balancing.
* @balance: Pointer to a variable indicating if this_cpu
* is the appropriate cpu to perform load balancing at this_level.
@@ -3108,7 +3101,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
unsigned long *imbalance, enum cpu_idle_type idle,
- int *sd_idle, const struct cpumask *cpus, int *balance)
+ const struct cpumask *cpus, int *balance)
{
struct sd_lb_stats sds;

@@ -3118,8 +3111,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
* Compute the various statistics relavent for load balancing at
* this level.
*/
- update_sd_lb_stats(sd, this_cpu, idle, sd_idle, cpus,
- balance, &sds);
+ update_sd_lb_stats(sd, this_cpu, idle, cpus, balance, &sds);

/* Cases where imbalance does not exist from POV of this_cpu */
/* 1) this_cpu is not the appropriate cpu to perform load balancing
@@ -3255,7 +3247,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group,
/* Working cpumask for load_balance and load_balance_newidle. */
static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);

-static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
+static int need_active_balance(struct sched_domain *sd, int idle,
int busiest_cpu, int this_cpu)
{
if (idle == CPU_NEWLY_IDLE) {
@@ -3287,10 +3279,6 @@ static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
* move_tasks() will succeed. ld_moved will be true and this
* active balance code will not be triggered.
*/
- if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
- return 0;
-
if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
return 0;
}
@@ -3308,7 +3296,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum cpu_idle_type idle,
int *balance)
{
- int ld_moved, all_pinned = 0, active_balance = 0, sd_idle = 0;
+ int ld_moved, all_pinned = 0, active_balance = 0;
struct sched_group *group;
unsigned long imbalance;
struct rq *busiest;
@@ -3317,20 +3305,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,

cpumask_copy(cpus, cpu_active_mask);

- /*
- * When power savings policy is enabled for the parent domain, idle
- * sibling can pick up load irrespective of busy siblings. In this case,
- * let the state of idle sibling percolate up as CPU_IDLE, instead of
- * portraying it as CPU_NOT_IDLE.
- */
- if (idle != CPU_NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
- sd_idle = 1;
-
schedstat_inc(sd, lb_count[idle]);

redo:
- group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
+ group = find_busiest_group(sd, this_cpu, &imbalance, idle,
cpus, balance);

if (*balance == 0)
@@ -3392,8 +3370,7 @@ redo:
if (idle != CPU_NEWLY_IDLE)
sd->nr_balance_failed++;

- if (need_active_balance(sd, sd_idle, idle, cpu_of(busiest),
- this_cpu)) {
+ if (need_active_balance(sd, idle, cpu_of(busiest), this_cpu)) {
raw_spin_lock_irqsave(&busiest->lock, flags);

/* don't kick the active_load_balance_cpu_stop,
@@ -3448,10 +3425,6 @@ redo:
sd->balance_interval *= 2;
}

- if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
- ld_moved = -1;
-
goto out;

out_balanced:
@@ -3465,11 +3438,7 @@ out_one_pinned:
(sd->balance_interval < sd->max_interval))
sd->balance_interval *= 2;

- if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
- ld_moved = -1;
- else
- ld_moved = 0;
+ ld_moved = 0;
out:
return ld_moved;
}

2011-02-18 01:26:51

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH] sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1


> I am also ok with removing this code. But as Venki mentioned earlier
> (http://marc.info/?l=linux-kernel&m=129735866732171&w=2), we need to
> make sure idle core gets priority instead of an idle smt-thread on a
> busy core while pulling the load from the busiest socket.
>
> I requested Venki to post these 2 patches of removing the propagation of
> busy sibling status to an idle sibling and prioritizing the idle core
> while pulling the load. I will request Alex and Tim to run their
> performance workloads to make sure that this doesn't show any
> regressions.

I have the sd_idle deletion and the other patches. So I just tested this
v1 patch based on a 38-rc4 kernel on WSM-EP, NHM-EP, and Core2 machines,
and didn't find a clear performance regression or improvement, including
on hackbench/specjbb/volano etc.

>
> thanks,
> suresh
>