2008-08-29 13:19:07

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: sched_mc_power_savings broken with CGROUPS+CPUSETS

Hi,

sched_mc_power_savings seems to be broken with CGROUPS+CPUSETS.
When CONFIG_CPUSETS=y the attached BUG_ON() is being hit.

I added a BUG_ON to check if SD_POWERSAVINGS_BALANCE is set at
SD_LV_CPU whenever sched_mc_power_savings is set.

This BUG is hit when CONFIG_CPUSETS (which depends on CONFIG_CGROUPS)
is compiled in, while it is never hit when they are compiled
out. The fact that SD_POWERSAVINGS_BALANCE is cleared even when
sched_mc_power_savings = 1 completely breaks the
sched_mc_power_savings heuristics.

To recreate the problem:
Build with sched_mc power savings enabled (CONFIG_SCHED_MC=y).
Add the BUG_ON() below.

echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

Try these steps on a multi-core x86 box.

sched_mc_power_savings seems to be broken from 2.6.26-rc1, but
I do not have confirmation that the root cause is the same in all
successive versions. sched_mc_power_savings works perfectly in
2.6.25.

Please help me root-cause the issue, and point me to changes that
may potentially cause this bug.

Thanks,
Vaidy

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

Index: linux-2.6.27-rc5/kernel/sched.c
===================================================================
--- linux-2.6.27-rc5.orig/kernel/sched.c 2008-08-29 17:48:56.000000000 +0000
+++ linux-2.6.27-rc5/kernel/sched.c 2008-08-29 18:16:09.000000000 +0000
@@ -3375,6 +3375,8 @@

out_balanced:
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+	BUG_ON(sched_mc_power_savings && sd->level == SD_LV_CPU &&
+	       !(sd->flags & SD_POWERSAVINGS_BALANCE));
 	if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
 		goto ret;


2008-08-29 14:08:40

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: sched_mc_power_savings broken with CGROUPS+CPUSETS

* Peter Zijlstra <[email protected]> [2008-08-29 15:23:57]:

> On Fri, 2008-08-29 at 18:45 +0530, Vaidyanathan Srinivasan wrote:
> > Hi,
> >
> > sched_mc_power_savings seems to be broken with CGROUPS+CPUSETS.
> > When CONFIG_CPUSETS=y the attached BUG_ON() is being hit.
> >
> > I added a BUG_ON to check if SD_POWERSAVINGS_BALANCE is set at
> > SD_LV_CPU whenever sched_mc_power_savings is set.
> >
> > This BUG is hit when config CONFIG_CPUSETS (depends on CONFIG_CGROUPS)
> > is just compiled in while this is never hit when they are compiled
> > out. The fact that SD_POWERSAVINGS_BALANCE being cleared even when
> > sched_mc_power_savings = 1 completely breaks the
> > sched_mc_power_savings heuristics.
> >
> > To recreate the problem,
> > Have sched_mc power savings enabled CONFIG_SCHED_MC=y
> > Add this BUG_ON()
> >
> > echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
> >
> > Try these steps on a multi-core x86 box.
> >
> > sched_mc_power_savings seems to be broken from 2.6.26-rc1, but
> > I do not have confirmation that the root cause is the same in all
> > successive versions. sched_mc_power_savings works perfectly in
> > 2.6.25.
> >
> > Please help me root-cause the issue, and point me to changes that
> > may potentially cause this bug.
>
> I'm still greatly mystified by all that power savings code.
>
> It's hard to read and utterly hard to comprehend - I've been about to rip
> the whole thing out on several occasions. But so far I've tried to carefully
> tread around it, maintaining its operation even though it is not fully
> understood.
>
> Someone with clue - preferably the authors of the code in question -
> should enlighten us with a patch that adds some comments as to the
> intent of said lines of code.

I will try to add my notes around it. I have been playing around with
sched_mc to make it work for different workloads.

--Vaidy

2008-08-29 14:53:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: sched_mc_power_savings broken with CGROUPS+CPUSETS

On Fri, 2008-08-29 at 18:45 +0530, Vaidyanathan Srinivasan wrote:
> Hi,
>
> sched_mc_power_savings seems to be broken with CGROUPS+CPUSETS.
> When CONFIG_CPUSETS=y the attached BUG_ON() is being hit.
>
> I added a BUG_ON to check if SD_POWERSAVINGS_BALANCE is set at
> SD_LV_CPU whenever sched_mc_power_savings is set.
>
> This BUG is hit when config CONFIG_CPUSETS (depends on CONFIG_CGROUPS)
> is just compiled in while this is never hit when they are compiled
> out. The fact that SD_POWERSAVINGS_BALANCE being cleared even when
> sched_mc_power_savings = 1 completely breaks the
> sched_mc_power_savings heuristics.
>
> To recreate the problem,
> Have sched_mc power savings enabled CONFIG_SCHED_MC=y
> Add this BUG_ON()
>
> echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
>
> Try these steps on a multi-core x86 box.
>
> sched_mc_power_savings seems to be broken from 2.6.26-rc1, but
> I do not have confirmation that the root cause is the same in all
> successive versions. sched_mc_power_savings works perfectly in
> 2.6.25.
>
> Please help me root-cause the issue, and point me to changes that
> may potentially cause this bug.

I'm still greatly mystified by all that power savings code.

It's hard to read and utterly hard to comprehend - I've been about to rip
the whole thing out on several occasions. But so far I've tried to carefully
tread around it, maintaining its operation even though it is not fully
understood.

Someone with clue - preferably the authors of the code in question -
should enlighten us with a patch that adds some comments as to the
intent of said lines of code.

2008-08-29 20:17:18

by Max Krasnyansky

[permalink] [raw]
Subject: Re: sched_mc_power_savings broken with CGROUPS+CPUSETS

Vaidyanathan Srinivasan wrote:
> Hi,
>
> sched_mc_power_savings seems to be broken with CGROUPS+CPUSETS.
> When CONFIG_CPUSETS=y the attached BUG_ON() is being hit.
>
> I added a BUG_ON to check if SD_POWERSAVINGS_BALANCE is set at
> SD_LV_CPU whenever sched_mc_power_savings is set.
>
> This BUG is hit when config CONFIG_CPUSETS (depends on CONFIG_CGROUPS)
> is just compiled in while this is never hit when they are compiled
> out. The fact that SD_POWERSAVINGS_BALANCE being cleared even when
> sched_mc_power_savings = 1 completely breaks the
> sched_mc_power_savings heuristics.
>
> To recreate the problem,
> Have sched_mc power savings enabled CONFIG_SCHED_MC=y
> Add this BUG_ON()
>
> echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
>
> Try these steps on a multi-core x86 box.
>
> sched_mc_power_savings seems to be broken from 2.6.26-rc1, but
> I do not have confirmation that the root cause is the same in all
> successive versions. sched_mc_power_savings works perfectly in
> 2.6.25.
>
> Please help me root-cause the issue, and point me to changes that
> may potentially cause this bug.

That's my fault. I redid the domain rebuild/hotplug handling a while ago and missed
the fact that partition_sched_domains() tries to avoid unnecessary domain
rebuilds. The primary issue at the time was circular locking, and the
testing that I was doing then exercised a bunch of different scenarios
(i.e. cpu up/down, cpuset create/destroy, mcpowersave on/off), and I did not
notice that mcpowersave did not actually trigger domain rebuilds.

Anyway, I sent you guys a patch that should fix this issue.
Please confirm.

Max

2008-08-29 20:30:17

by Max Krasnyansky

[permalink] [raw]
Subject: Re: sched_mc_power_savings broken with CGROUPS+CPUSETS

Peter Zijlstra wrote:
> On Fri, 2008-08-29 at 18:45 +0530, Vaidyanathan Srinivasan wrote:
>> Hi,
>>
>> sched_mc_power_savings seems to be broken with CGROUPS+CPUSETS.
>> When CONFIG_CPUSETS=y the attached BUG_ON() is being hit.
>>
>> I added a BUG_ON to check if SD_POWERSAVINGS_BALANCE is set at
>> SD_LV_CPU whenever sched_mc_power_savings is set.
>>
>> This BUG is hit when config CONFIG_CPUSETS (depends on CONFIG_CGROUPS)
>> is just compiled in while this is never hit when they are compiled
>> out. The fact that SD_POWERSAVINGS_BALANCE being cleared even when
>> sched_mc_power_savings = 1 completely breaks the
>> sched_mc_power_savings heuristics.
>>
>> To recreate the problem,
>> Have sched_mc power savings enabled CONFIG_SCHED_MC=y
>> Add this BUG_ON()
>>
>> echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
>>
>> Try these steps on a multi-core x86 box.
>>
>> sched_mc_power_savings seems to be broken from 2.6.26-rc1, but
>> I do not have confirmation that the root cause is the same in all
>> successive versions. sched_mc_power_savings works perfectly in
>> 2.6.25.
>>
>> Please help me root-cause the issue, and point me to changes that
>> may potentially cause this bug.
>
> I'm still greatly mystified by all that power savings code.
>
> It's hard to read and utterly hard to comprehend - I've been about to rip
> the whole thing out on several occasions. But so far I've tried to carefully
> tread around it, maintaining its operation even though it is not fully
> understood.
>
> Someone with clue - preferably the authors of the code in question -
> should enlighten us with a patch that adds some comments as to the
> intent of said lines of code.

I do not fully understand how balancing is affected by the MC stuff, but I can
explain how the mc power saving settings are applied to the domains and the
overall mechanism for that.
Here is a quote from one of my emails to Paul:

> Max wrote:
> ...
> Those things (mc_power and topology updates) have to update domain flags based
> on the mc/smt power and current topology settings.
> This is done in the
> __rebuild_sched_domains()
> ...
> SD_INIT(sd, ALLNODES);
> ...
> SD_INIT(sd, MC);
> ...
>
> SD_INIT(sd,X) uses one of SD initializers defined in the include/linux/topology.h
> For example SD_CPU_INIT() includes BALANCE_FOR_PKG_POWER which expands to
>
> #define BALANCE_FOR_PKG_POWER \
>	((sched_mc_power_savings || sched_smt_power_savings) ? \
>	 SD_POWERSAVINGS_BALANCE : 0)
>
> Yes, it's kind of convoluted :). Anyway, the point is that we need to rebuild the
> domains when those settings change. We could probably write a simpler version
> that just iterates over existing domains and updates the flags. Maybe some other day :)

As I explained in the previous reply, I missed the fact that the logic that avoids
redundant rebuilds in partition_sched_domains() prevents
arch_reinit_sched_domains() from doing the actual rebuild, and hence
SD_POWERSAVINGS_BALANCE is not applied until something changes in the cpuset setup.

Btw, I can certainly attest to the fact that the power saving code is very hard to
read and comprehend :)

Max

2008-08-30 11:27:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: sched_mc_power_savings broken with CGROUPS+CPUSETS

On Fri, 2008-08-29 at 13:29 -0700, Max Krasnyansky wrote:
> Peter Zijlstra wrote:
> > On Fri, 2008-08-29 at 18:45 +0530, Vaidyanathan Srinivasan wrote:
> >> Hi,
> >>
> >> sched_mc_power_savings seems to be broken with CGROUPS+CPUSETS.
> >> When CONFIG_CPUSETS=y the attached BUG_ON() is being hit.
> >>
> >> I added a BUG_ON to check if SD_POWERSAVINGS_BALANCE is set at
> >> SD_LV_CPU whenever sched_mc_power_savings is set.
> >>
> >> This BUG is hit when config CONFIG_CPUSETS (depends on CONFIG_CGROUPS)
> >> is just compiled in while this is never hit when they are compiled
> >> out. The fact that SD_POWERSAVINGS_BALANCE being cleared even when
> >> sched_mc_power_savings = 1 completely breaks the
> >> sched_mc_power_savings heuristics.
> >>
> >> To recreate the problem,
> >> Have sched_mc power savings enabled CONFIG_SCHED_MC=y
> >> Add this BUG_ON()
> >>
> >> echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
> >>
> >> Try these steps on a multi-core x86 box.
> >>
> >> sched_mc_power_savings seems to be broken from 2.6.26-rc1, but
> >> I do not have confirmation that the root cause is the same in all
> >> successive versions. sched_mc_power_savings works perfectly in
> >> 2.6.25.
> >>
> >> Please help me root-cause the issue, and point me to changes that
> >> may potentially cause this bug.
> >
> > I'm still greatly mystified by all that power savings code.
> >
> > It's hard to read and utterly hard to comprehend - I've been about to rip
> > the whole thing out on several occasions. But so far I've tried to carefully
> > tread around it, maintaining its operation even though it is not fully
> > understood.
> >
> > Someone with clue - preferably the authors of the code in question -
> > should enlighten us with a patch that adds some comments as to the
> > intent of said lines of code.
>
> I do not fully understand how balancing is affected by the MC stuff, but I can
> explain how the mc power saving settings are applied to the domains and the
> overall mechanism for that.
> Here is a quote from one of my emails to Paul:
>
> > Max wrote:
> > ...
> > Those things (mc_power and topology updates) have to update domain flags based
> > on the mc/smt power and current topology settings.
> > This is done in the
> > __rebuild_sched_domains()
> > ...
> > SD_INIT(sd, ALLNODES);
> > ...
> > SD_INIT(sd, MC);
> > ...
> >
> > SD_INIT(sd,X) uses one of SD initializers defined in the include/linux/topology.h
> > For example SD_CPU_INIT() includes BALANCE_FOR_PKG_POWER which expands to
> >
> > #define BALANCE_FOR_PKG_POWER \
> >	((sched_mc_power_savings || sched_smt_power_savings) ? \
> >	 SD_POWERSAVINGS_BALANCE : 0)
> >
> > Yes, it's kind of convoluted :). Anyway, the point is that we need to rebuild the
> > domains when those settings change. We could probably write a simpler version
> > that just iterates over existing domains and updates the flags. Maybe some other day :)

I don't think iterating the domains and setting the flag is sufficient.
Look at this crap (found in arch/x86/kernel/smpboot.c):

cpumask_t cpu_coregroup_map(int cpu)
{
	struct cpuinfo_x86 *c = &cpu_data(cpu);
	/*
	 * For perf, we return last level cache shared map.
	 * And for power savings, we return cpu_core_map
	 */
	if (sched_mc_power_savings || sched_smt_power_savings)
		return per_cpu(cpu_core_map, cpu);
	else
		return c->llc_shared_map;
}

which means we'll actually end up building different domain/group
configurations depending on power savings settings.

> As I explained in the previous reply, I missed the fact that the logic that avoids
> redundant rebuilds in partition_sched_domains() prevents
> arch_reinit_sched_domains() from doing the actual rebuild, and hence
> SD_POWERSAVINGS_BALANCE is not applied until something changes in the cpuset setup.
>
> Btw, I can certainly attest to the fact that the power saving code is very hard to
> read and comprehend :)

Yeah - I was primarily hinting at the sched_group and find_*_group()
fudge; especially find_busiest_group() is an utter nightmare.

I'm still struggling to understand _why_ we need those group things to
begin with - why aren't the child domains good enough?


2008-08-30 19:59:18

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: sched_mc_power_savings broken with CGROUPS+CPUSETS

* Max Krasnyansky <[email protected]> [2008-08-29 13:17:02]:

> Vaidyanathan Srinivasan wrote:
> > Hi,
> >
> > sched_mc_power_savings seems to be broken with CGROUPS+CPUSETS.
> > When CONFIG_CPUSETS=y the attached BUG_ON() is being hit.
> >
> > I added a BUG_ON to check if SD_POWERSAVINGS_BALANCE is set at
> > SD_LV_CPU whenever sched_mc_power_savings is set.
> >
> > This BUG is hit when config CONFIG_CPUSETS (depends on CONFIG_CGROUPS)
> > is just compiled in while this is never hit when they are compiled
> > out. The fact that SD_POWERSAVINGS_BALANCE being cleared even when
> > sched_mc_power_savings = 1 completely breaks the
> > sched_mc_power_savings heuristics.
> >
> > To recreate the problem,
> > Have sched_mc power savings enabled CONFIG_SCHED_MC=y
> > Add this BUG_ON()
> >
> > echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
> >
> > Try these steps on a multi-core x86 box.
> >
> > sched_mc_power_savings seems to be broken from 2.6.26-rc1, but
> > I do not have confirmation that the root cause is the same in all
> > successive versions. sched_mc_power_savings works perfectly in
> > 2.6.25.
> >
> > Please help me root-cause the issue, and point me to changes that
> > may potentially cause this bug.
>
> That's my fault. I redid the domain rebuild/hotplug handling a while ago and missed
> the fact that partition_sched_domains() tries to avoid unnecessary domain
> rebuilds. The primary issue at the time was circular locking, and the
> testing that I was doing then exercised a bunch of different scenarios
> (i.e. cpu up/down, cpuset create/destroy, mcpowersave on/off), and I did not
> notice that mcpowersave did not actually trigger domain rebuilds.
>
> Anyway, I sent you guys a patch that should fix this issue.
> Please confirm.

Hi Max,

Thank you very much for identifying the source of the problem. I will
test your patch and report.

My BUG_ON() may still trigger during the window of change. I will watch
out for that.

--Vaidy

2008-08-30 20:39:25

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: sched_mc_power_savings broken with CGROUPS+CPUSETS

* Peter Zijlstra <[email protected]> [2008-08-30 13:26:53]:

[snipped]

>
> I don't think iterating the domains and setting the flag is sufficient.
> Look at this crap (found in arch/x86/kernel/smpboot.c):
>
> cpumask_t cpu_coregroup_map(int cpu)
> {
>	struct cpuinfo_x86 *c = &cpu_data(cpu);
>	/*
>	 * For perf, we return last level cache shared map.
>	 * And for power savings, we return cpu_core_map
>	 */
>	if (sched_mc_power_savings || sched_smt_power_savings)
>		return per_cpu(cpu_core_map, cpu);
>	else
>		return c->llc_shared_map;
> }
>
> which means we'll actually end up building different domain/group
> configurations depending on power savings settings.

The above code lets a quad-core CPU be treated as two dual-core CPUs
for performance when sched_mc_power_savings=0, and as one quad-core
package when sched_mc_power_savings=1, since the power control
(voltage control) is per quad-core socket.

On a dual socket machine with two quad core cpus,

sched_mc_power_savings=0 will build:

CPU0 attaching sched-domain:
 domain 0: span 0,2 level MC
  groups: 0 2
  domain 1: span 0-7 level CPU
   groups: 0,2 1,5 3-4 6-7

while sched_mc_power_savings=1 will build:

CPU0 attaching sched-domain:
 domain 0: span 0,2-4 level MC
  groups: 0 2 3 4
  domain 1: span 0-7 level CPU
   groups: 0,2-4 1,5-7

Last level cache (llc_shared_map) is used to build this map
differently based on power savings settings.

Do you think such detailed documentation around this code will help?

--Vaidy

[snipped]

2008-08-30 21:43:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: sched_mc_power_savings broken with CGROUPS+CPUSETS

On Sun, 2008-08-31 at 02:12 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2008-08-30 13:26:53]:
>
> [snipped]
>
> >
> > I don't think iterating the domains and setting the flag is sufficient.
> > Look at this crap (found in arch/x86/kernel/smpboot.c):
> >
> > cpumask_t cpu_coregroup_map(int cpu)
> > {
> >	struct cpuinfo_x86 *c = &cpu_data(cpu);
> >	/*
> >	 * For perf, we return last level cache shared map.
> >	 * And for power savings, we return cpu_core_map
> >	 */
> >	if (sched_mc_power_savings || sched_smt_power_savings)
> >		return per_cpu(cpu_core_map, cpu);
> >	else
> >		return c->llc_shared_map;
> > }
> >
> > which means we'll actually end up building different domain/group
> > configurations depending on power savings settings.
>
> The above code lets a quad-core CPU be treated as two dual-core CPUs
> for performance when sched_mc_power_savings=0, and as one quad-core
> package when sched_mc_power_savings=1, since the power control
> (voltage control) is per quad-core socket.
>
> On a dual socket machine with two quad core cpus,
>
> sched_mc_power_savings=0 will build:
>
> CPU0 attaching sched-domain:
> domain 0: span 0,2 level MC
> groups: 0 2
> domain 1: span 0-7 level CPU
> groups: 0,2 1,5 3-4 6-7
>
> while sched_mc_power_savings=1 will build:
>
> CPU0 attaching sched-domain:
> domain 0: span 0,2-4 level MC
> groups: 0 2 3 4
> domain 1: span 0-7 level CPU
> groups: 0,2-4 1,5-7
>
> Last level cache (llc_shared_map) is used to build this map
> differently based on power savings settings.

Same for my dual-core Opteron 12xx: due to this code it normally
generates CPU-level domains, because it does not share cache.

> Do you think such detailed documentation around this code will help?

I realized what it's good for; I'm just not sure I agree with it. I
feel there is something wrong with this, I just can't quite put my
finger on it.

I just feel the domain structure should be invariant to such things
- it's the same hardware after all, whether we schedule to optimize for
power or for performance.