2009-09-01 08:51:45

by Peter Zijlstra

Subject: [RFC][PATCH 0/8] load-balancing and cpu_power -v2

A more complete version, one that compiles and mostly works on the
simple tests to which it was subjected.

It still lacks integration with APERF/MPERF because that stuff was
hiding in some acpi driver instead of being placed in arch code for
general consumption.. will fix.

Also, SD_SHARE_CPUPOWER seems redundant in the face of sd->level ==
SD_LV_SIBLING, should we remove the SD_flag or deprecate the level?

Anyway, have at it, poke holes and report issues.


Subject: Re: [RFC][PATCH 0/8] load-balancing and cpu_power -v2

On Tue, Sep 01, 2009 at 10:34:31AM +0200, Peter Zijlstra wrote:
> A more complete version, one that compiles and mostly works on the
> simple tests to which it was subjected.

Very nice series.

Will queue it up for testing to see if anything
obvious breaks.

>
> It still lacks integration with APERF/MPERF because that stuff was
> hiding in some acpi driver instead of being placed in arch code for
> general consumption.. will fix.
>
> Also, SD_SHARE_CPUPOWER seems redundant in the face of sd->level ==
> SD_LV_SIBLING, should we remove the SD_flag or deprecate the level?

IIRC, the only place where we used SD_SHARE_CPUPOWER was while assigning
the cpu_power. With the arch_smt_gain() coming into picture, I feel we
can remove the SD_flag.

Not sure what you meant by deprecating the level.
>
> Anyway, have at it, poke holes and report issues.


--
Thanks and Regards
gautham

Subject: Re: [RFC][PATCH 0/8] load-balancing and cpu_power -v2

On Tue, Sep 01, 2009 at 10:34:31AM +0200, Peter Zijlstra wrote:
> A more complete version, one that compiles and mostly works on the
> simple tests to which it was subjected.
>
> It still lacks integration with APERF/MPERF because that stuff was
> hiding in some acpi driver instead of being placed in arch code for
> general consumption.. will fix.
>
> Also, SD_SHARE_CPUPOWER seems redundant in the face of sd->level ==
> SD_LV_SIBLING, should we remove the SD_flag or deprecate the level?
>
> Anyway, have at it, poke holes and report issues.

Tested it (to a certain extent).
Found no performance degradation (on 1P, 2P, 4P systems). (One could
think performance might slightly degrade due to more frequent
__cpu_power updates).

Issue that I see is that switching between scheduling policies has no
effect on already running tasks:

- tasks that are already distributed among sockets are _not_
concentrated on one socket when switching from performance to
power_savings scheduling

- tasks utilizing a socket are _not_ distributed among sockets when
switching from power_savings to performance policy

This applies to modification of sched_mc_power_savings. And I think
one of the above scenarios is already broken in tip/master w/o your
patches.

Otherwise, especially wrt integration of APERF/MPERF, this seems to
be a good approach.


Regards,
Andreas

--
Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
(OSRC) | Registergericht München, HRB Nr. 43632

2009-09-03 13:39:04

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/8] load-balancing and cpu_power -v2

On Thu, 2009-09-03 at 14:10 +0200, Andreas Herrmann wrote:
> On Tue, Sep 01, 2009 at 10:34:31AM +0200, Peter Zijlstra wrote:

> > Anyway, have at it, poke holes and report issues.
>
> Tested it (to a certain extent).
> Found no performance degradation (on 1P, 2P, 4P systems). (One could
> think performance might slightly degrade due to more frequent
> __cpu_power updates).

Grand.

> This applies to modification of sched_mc_power_savings. And I think
> one of the above scenarios is already broken in tip/master w/o your
> patches.

I suspect so, but confirmation would be good, I don't have a single
machine large enough to actually test any of that power saving muck.

> Otherwise, especially wrt integration of APERF/MPERF, this seems to
> be a good approach.

On that, are there AMD systems supporting APERF/MPERF? I only know of
Intel machines doing so.

2009-09-04 07:19:12

by Ingo Molnar

Subject: Re: [RFC][PATCH 0/8] load-balancing and cpu_power -v2


* Andreas Herrmann <[email protected]> wrote:

> On Tue, Sep 01, 2009 at 10:34:31AM +0200, Peter Zijlstra wrote:
> > A more complete version, one that compiles and mostly works on the
> > simple tests to which it was subjected.
> >
> > It still lacks integration with APERF/MPERF because that stuff was
> > hiding in some acpi driver instead of being placed in arch code for
> > general consumption.. will fix.
> >
> > Also, SD_SHARE_CPUPOWER seems redundant in the face of sd->level ==
> > SD_LV_SIBLING, should we remove the SD_flag or deprecate the level?
> >
> > Anyway, have at it, poke holes and report issues.
>
> Tested it (to a certain extent).
> Found no performance degradation (on 1P, 2P, 4P systems). (One could
> think performance might slightly degrade due to more frequent
> __cpu_power updates).

Ok, thanks for doing that, it's really useful - this saved me a day
of testing and allows me to accelerate these patches and try to
queue them up in tip:sched today.

Note, we have regressed the load-balancer recently but its
inherent complexity makes it pretty hard to fix. We don't max out
kbuild performance for example - i see this in my distcc builds.
Would be nice to sort that out too for .32, on top of Peter's
power-balancing series.

We've got the sched-domains setup simplification suggestions from
Peter as well, those could be done separately (but are important as
well, to express more complex hierarchies like Magny-Cours).

Ingo

2009-09-04 09:28:10

by Ingo Molnar

Subject: [crash] Re: [RFC][PATCH 0/8] load-balancing and cpu_power -v2


i've queued up Peter's patches, with your and Gautham's fixes
embedded. It works mostly fine - except on two larger boxes, where
-tip stress-testing triggered this crash:

aldebaran login: [ 1774.088275] divide error: 0000 [#1] SMP
[ 1774.092293] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
[ 1774.100355] CPU 13
[ 1774.102498] Modules linked in:
[ 1774.105631] Pid: 30881, comm: hackbench Not tainted 2.6.31-rc8-tip-01308-g484d664-dirty #1629 X8DTN
[ 1774.114807] RIP: 0010:[<ffffffff81041c38>] [<ffffffff81041c38>] sched_balance_self+0x19b/0x2d4
[ 1774.123676] RSP: 0018:ffff880306c1fd58 EFLAGS: 00010246
[ 1774.129037] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1774.136287] RDX: 0000000000000000 RSI: 0000000000000040 RDI: 0000000000000040
[ 1774.143554] RBP: ffff880306c1fde8 R08: 0000000000000000 R09: ffffc9000140f4c8
[ 1774.150748] R10: ffff88031288c650 R11: ffff880306c1fe08 R12: ffffc90001a0f3a0
[ 1774.158007] R13: ffffc9000140f4b0 R14: 0000000000000000 R15: 0000000000014f00
[ 1774.165248] FS: 0000000000000000(0000) GS:ffffc90001a00000(0063) knlGS:00000000f7f156c0
[ 1774.173473] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1774.179320] CR2: 000000004822c0ac CR3: 000000031357b000 CR4: 00000000000006e0
[ 1774.186586] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1774.193826] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1774.201101] Process hackbench (pid: 30881, threadinfo ffff880306c1e000, task ffff88030b4aa710)
[ 1774.209836] Stack:
[ 1774.211861] 0000000000014f00 0000000000014f00 0000000000014f00 0000000000014f10
[ 1774.219300] <0> 0000000000014f10 0000000d00000008 ffff88030b4aa710 ffffc9000140f4c0
[ 1774.227129] <0> 0000000006c1fdd8 0000007d00000001 ffff88030b4aa900 ffffc9000100f4b0
[ 1774.235225] Call Trace:
[ 1774.237704] [<ffffffff810444c4>] sched_fork+0x2c/0x15f
[ 1774.243032] [<ffffffff8104942d>] copy_process+0x407/0xda7
[ 1774.248608] [<ffffffff81049f16>] do_fork+0x149/0x309
[ 1774.253809] [<ffffffff8120ff12>] ? __up_read+0x9e/0xa8
[ 1774.259101] [<ffffffff8106df32>] ? up_read+0xe/0x10
[ 1774.264129] [<ffffffff81565d66>] ? do_page_fault+0x291/0x2c3
[ 1774.269968] [<ffffffff81035028>] sys32_clone+0x2c/0x2e
[ 1774.275267] [<ffffffff81034d05>] ia32_ptregs_common+0x25/0x4c
[ 1774.281200] Code: cb 48 8b 7d a8 ff c2 be 40 00 00 00 48 63 d2 e8 73 9b 1c 00 3b 05 19 2e 8c 00 89 c2 7c 8d 41 8b 4d 08 48 c1 e3 0a 31 d2 48 89 d8 <48> f7 f1 83 7d b4 00 48 89 c1 75 16 4c 39 f0 73 0d 49 89 c6 48
[ 1774.301423] RIP [<ffffffff81041c38>] sched_balance_self+0x19b/0x2d4
[ 1774.307903] RSP <ffff880306c1fd58>
[ 1774.311474] ---[ end trace a56b661c1598b0fc ]---
[ 1774.316202] Kernel panic - not syncing: Fatal exception
[ 1774.321497] Pid: 30881, comm: hackbench Tainted: G D 2.6.31-rc8-tip-01308-g484d664-dirty #1629
[ 1774.330925] Call Trace:
[ 1774.333402] [<ffffffff81561aaa>] panic+0x7a/0x125
[ 1774.338269] [<ffffffff815645d2>] oops_end+0xaa/0xba
[ 1774.343321] [<ffffffff8100f4f1>] die+0x5a/0x63
[ 1774.347887] [<ffffffff81563ff6>] do_trap+0x110/0x11f
[ 1774.353052] [<ffffffff8100d8ab>] do_divide_error+0x90/0x99
[ 1774.358691] [<ffffffff81041c38>] ? sched_balance_self+0x19b/0x2d4
[ 1774.364966] [<ffffffff810d8021>] ? zone_statistics+0x65/0x6a
[ 1774.370831] [<ffffffff810cb2ef>] ? get_page_from_freelist+0x4a2/0x675
[ 1774.377487] [<ffffffff8100cad5>] divide_error+0x15/0x20
[ 1774.382894] [<ffffffff81041c38>] ? sched_balance_self+0x19b/0x2d4
[ 1774.389173] [<ffffffff81041c21>] ? sched_balance_self+0x184/0x2d4
[ 1774.395479] [<ffffffff810444c4>] sched_fork+0x2c/0x15f
[ 1774.400792] [<ffffffff8104942d>] copy_process+0x407/0xda7
[ 1774.406397] [<ffffffff81049f16>] do_fork+0x149/0x309
[ 1774.411562] [<ffffffff8120ff12>] ? __up_read+0x9e/0xa8
[ 1774.416897] [<ffffffff8106df32>] ? up_read+0xe/0x10
[ 1774.421957] [<ffffffff81565d66>] ? do_page_fault+0x291/0x2c3
[ 1774.427815] [<ffffffff81035028>] sys32_clone+0x2c/0x2e
[ 1774.433124] [<ffffffff81034d05>] ia32_ptregs_common+0x25/0x4c

config attached as well.

the domain setup is this:

SD flag: 4717
+ 1: SD_LOAD_BALANCE: Do load balancing on this domain
- 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
+ 4: SD_BALANCE_EXEC: Balance on exec
+ 8: SD_BALANCE_FORK: Balance on fork, clone
- 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
+ 32: SD_WAKE_AFFINE: Wake task to waking CPU
+ 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
+ 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
current val on cpu0/domain1:
SD flag: 1133
+ 1: SD_LOAD_BALANCE: Do load balancing on this domain
- 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
+ 4: SD_BALANCE_EXEC: Balance on exec
+ 8: SD_BALANCE_FORK: Balance on fork, clone
- 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
+ 32: SD_WAKE_AFFINE: Wake task to waking CPU
+ 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
+1024: SD_SERIALIZE: Only a single load balancing instance

it's a 4x4 Opteron testbox.
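
The SD flag values in the dump above are sums of the power-of-two bits
listed under them. A minimal sketch in C of that decoding, assuming only
the bit/name pairs printed in this message (the table and
decode_sd_flags() are hypothetical helpers for illustration, not kernel
code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Bit/name pairs copied from the dump in this message; illustrative only. */
struct sd_flag_name {
	unsigned int bit;
	const char *name;
};

static const struct sd_flag_name sd_flag_names[] = {
	{    1, "SD_LOAD_BALANCE" },
	{    2, "SD_BALANCE_NEWIDLE" },
	{    4, "SD_BALANCE_EXEC" },
	{    8, "SD_BALANCE_FORK" },
	{   16, "SD_WAKE_IDLE" },
	{   32, "SD_WAKE_AFFINE" },
	{   64, "SD_WAKE_BALANCE" },
	{  512, "SD_SHARE_PKG_RESOURCES" },
	{ 1024, "SD_SERIALIZE" },
};

/*
 * Print '+' or '-' for every known flag and return the bits of
 * 'flags' that have no entry in the table above.
 */
unsigned int decode_sd_flags(unsigned int flags)
{
	unsigned int unnamed = flags;
	size_t i;

	printf("SD flag: %u\n", flags);
	for (i = 0; i < sizeof(sd_flag_names) / sizeof(sd_flag_names[0]); i++) {
		const struct sd_flag_name *f = &sd_flag_names[i];

		/* every table entry must be exactly one bit */
		assert(f->bit && !(f->bit & (f->bit - 1)));

		printf("%c%4u: %s\n", (flags & f->bit) ? '+' : '-',
		       f->bit, f->name);
		unnamed &= ~f->bit;
	}
	return unnamed;
}
```

Decoded this way, 1133 is fully covered by the names listed, while 4717
leaves bit 4096 set, which has no name in the listing above.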

Ingo


Attachments:
(No filename) (5.12 kB)
config (63.68 kB)

2009-09-04 10:25:42

by Ingo Molnar

Subject: [tip:sched/balancing] sched: Fix dynamic power-balancing crash

Commit-ID: d7ea17a76916e456fcc78e45142c66f7fb875e3d
Gitweb: http://git.kernel.org/tip/d7ea17a76916e456fcc78e45142c66f7fb875e3d
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 4 Sep 2009 11:49:25 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 4 Sep 2009 11:52:52 +0200

sched: Fix dynamic power-balancing crash

This crash:

[ 1774.088275] divide error: 0000 [#1] SMP
[ 1774.100355] CPU 13
[ 1774.102498] Modules linked in:
[ 1774.105631] Pid: 30881, comm: hackbench Not tainted 2.6.31-rc8-tip-01308-g484d664-dirty #1629 X8DTN
[ 1774.114807] RIP: 0010:[<ffffffff81041c38>] [<ffffffff81041c38>]
sched_balance_self+0x19b/0x2d4

Triggers because update_group_power() modifies the sd tree and does
temporary calculations there - not considering that other CPUs
could observe intermediate values, such as the zero initial value.

Calculate it in a temporary variable instead. (we need no memory
barrier as these are all statistical values anyway)

Acked-by: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/sched.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b537853..796baf7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3765,19 +3765,22 @@ static void update_group_power(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
+	unsigned long power;

 	if (!child) {
 		update_cpu_power(sd, cpu);
 		return;
 	}

-	sdg->cpu_power = 0;
+	power = 0;

 	group = child->groups;
 	do {
-		sdg->cpu_power += group->cpu_power;
+		power += group->cpu_power;
 		group = group->next;
 	} while (group != child->groups);
+
+	sdg->cpu_power = power;
 }

/**
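
The difference between the two versions of update_group_power() in the
patch above can be sketched outside the kernel. The following is a
simplified, single-threaded illustration, not the kernel code: struct
group, observe_fn, sum_in_place(), sum_then_publish() and record_min()
are all hypothetical names, and the observer callback merely stands in
for another CPU reading the shared field in the middle of the update.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's sched_group. */
struct group {
	unsigned long cpu_power;
	struct group *next;	/* circular list of sibling groups */
};

/* Invoked at points where a concurrent reader could look at the parent. */
typedef void (*observe_fn)(const struct group *parent, void *ctx);

/* Pre-fix pattern: the shared field holds the partial sum while looping. */
void sum_in_place(struct group *parent, struct group *children,
		  observe_fn observe, void *ctx)
{
	struct group *g = children;

	assert(children != NULL);

	parent->cpu_power = 0;
	do {
		observe(parent, ctx);	/* reader may see 0 or a partial sum */
		parent->cpu_power += g->cpu_power;
		g = g->next;
	} while (g != children);
}

/* Post-fix pattern: sum in a local, store the final value once. */
void sum_then_publish(struct group *parent, struct group *children,
		      observe_fn observe, void *ctx)
{
	unsigned long power = 0;
	struct group *g = children;

	assert(children != NULL);

	do {
		observe(parent, ctx);	/* reader still sees the old value */
		power += g->cpu_power;
		g = g->next;
	} while (g != children);

	parent->cpu_power = power;
}

/* Example observer: remember the smallest cpu_power ever seen. */
void record_min(const struct group *parent, void *ctx)
{
	unsigned long *min = ctx;

	if (parent->cpu_power < *min)
		*min = parent->cpu_power;
}
```

With two child groups of power 1024 each and a parent that already
holds 2048, the observer records a minimum of 0 under sum_in_place()
and 2048 under sum_then_publish(). A reader that divides by the
observed value, as sched_balance_self() does in the oops above, can
therefore only hit the zero divisor with the first pattern.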