2008-11-03 21:07:59

by Dimitri Sivanich

Subject: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

When load balancing gets switched off for a set of cpus via the
sched_load_balance flag in cpusets, those cpus wind up with the
globally defined def_root_domain attached. The def_root_domain is
attached when partition_sched_domains calls detach_destroy_domains().
A new root_domain is never allocated or attached as a sched domain
will never be attached by __build_sched_domains() for the non-load
balanced processors.

The problem with this scenario is that on systems with a large number
of processors with load balancing switched off, we start to see the
cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
This starts to become much more apparent above 8 waking RT threads
(with each RT thread running on its own cpu, blocking and waking up
continuously).

I'm wondering if this is, in fact, the way things were meant to work,
or should we have a root domain allocated for each cpu that is not to
be part of a sched domain? Note that the def_root_domain spans all of
the non-load-balanced cpus in this case. Having it attached to cpus
that should not be load balancing doesn't quite make sense to me.
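
For reference, a minimal sketch of this kind of configuration, assuming the
cpuset filesystem is mounted via cgroup at /dev/cpuset; the cpu range, the
cpuset name, and the ./rt_waker test program (each instance simply blocks and
wakes continuously) are illustrative stand-ins, not the actual test setup:

# put the cpus in question in their own cpuset and turn balancing off,
# both there and in the root cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset
mkdir /dev/cpuset/rt
echo 0-15 > /dev/cpuset/rt/cpuset.cpus
echo 0 > /dev/cpuset/rt/cpuset.mems
echo 0 > /dev/cpuset/rt/cpuset.sched_load_balance
echo 0 > /dev/cpuset/cpuset.sched_load_balance

# one SCHED_FIFO thread per cpu, each blocking and waking continuously
for i in $(seq 0 15); do
    chrt -f 50 taskset -c $i ./rt_waker &
done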


Here's where we've often seen this lock contention occur:

0xa0000001006df1e0 _spin_lock_irqsave+0x40
args (0xa000000101f8e1c8)
0xa00000010014b150 cpupri_set+0x290
args (0x16, 0x2c, 0x16, 0xa000000101f8e1c8, 0xa000000101f8b518, 0x1, 0x2c,
0xa000000100092ee0, 0x48c)
0xa000000100092ee0 __enqueue_rt_entity+0x300
args (0xe00000b4730401a0, 0xe0000b300316b510, 0xe0000b300316ba10, 0x500,
0xe0000b300316b518, 0x50, 0xa000000100093bc0, 0x286, 0x4f)
0xa000000100093bc0 enqueue_rt_entity+0xe0
args (0xe00000b4730401a0, 0x0, 0xa000000100093c50, 0x307, 0xe00000b4730401a0)
0xa000000100093c50 enqueue_task_rt+0x30
args (0xe0000b300316b400, 0xe00000b473040000, 0x1, 0xa0000001000848d0, 0x309,
0xa000000101122134)
0xa0000001000848d0 enqueue_task+0xd0
args (0xe0000b300316b400, 0xe00000b473040000, 0x1, 0xa000000100084ba0, 0x309,
0xa0000001013079b0)
0xa000000100084ba0 activate_task+0x60
args (0xe0000b300316b400, 0xe00000b473040000, 0x1, 0xa00000010009a270, 0x58e,
0xa000000100099ec0)
0xa00000010009a270 try_to_wake_up+0x530
args (0xe00000b473040000, 0x1, 0xe0000b300316b400, 0x49c6, 0xe0000b300316bc10,
0xe0000b300316bcac, 0xe00000b473040078, 0xe0000b300316bc38, 0xa00000010009a4d0)
0xa00000010009a4d0 wake_up_process+0x30


2008-11-03 22:32:58

by Peter Zijlstra

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
> When load balancing gets switched off for a set of cpus via the
> sched_load_balance flag in cpusets, those cpus wind up with the
> globally defined def_root_domain attached. The def_root_domain is
> attached when partition_sched_domains calls detach_destroy_domains().
> A new root_domain is never allocated or attached as a sched domain
> will never be attached by __build_sched_domains() for the non-load
> balanced processors.
>
> The problem with this scenario is that on systems with a large number
> of processors with load balancing switched off, we start to see the
> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
> This starts to become much more apparent above 8 waking RT threads
> (with each RT thread running on its own cpu, blocking and waking up
> continuously).
>
> I'm wondering if this is, in fact, the way things were meant to work,
> or should we have a root domain allocated for each cpu that is not to
> be part of a sched domain? Note that the def_root_domain spans all of
> the non-load-balanced cpus in this case. Having it attached to cpus
> that should not be load balancing doesn't quite make sense to me.

It shouldn't be like that, each load-balance domain (in your case a
single cpu) should get its own root domain. Gregory?

> Here's where we've often seen this lock contention occur:

what's this horrible output from?

2008-11-04 01:29:44

by Dimitri Sivanich

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Mon, Nov 03, 2008 at 11:33:23PM +0100, Peter Zijlstra wrote:
> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
> > When load balancing gets switched off for a set of cpus via the
> > sched_load_balance flag in cpusets, those cpus wind up with the
> > globally defined def_root_domain attached. The def_root_domain is
> > attached when partition_sched_domains calls detach_destroy_domains().
> > A new root_domain is never allocated or attached as a sched domain
> > will never be attached by __build_sched_domains() for the non-load
> > balanced processors.
> >
> > The problem with this scenario is that on systems with a large number
> > of processors with load balancing switched off, we start to see the
> > cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
> > This starts to become much more apparent above 8 waking RT threads
> > (with each RT thread running on its own cpu, blocking and waking up
> > continuously).
> >
> > I'm wondering if this is, in fact, the way things were meant to work,
> > or should we have a root domain allocated for each cpu that is not to
> > be part of a sched domain? Note that the def_root_domain spans all of
> > the non-load-balanced cpus in this case. Having it attached to cpus
> > that should not be load balancing doesn't quite make sense to me.
>
> It shouldn't be like that, each load-balance domain (in your case a
> single cpu) should get its own root domain. Gregory?
>
> > Here's where we've often seen this lock contention occur:
>
> what's this horrible output from?

This output is a stack backtrace from KDB. KDB entry is triggered when too much time elapses prior to thread wakeup. The traces pointed to this lock. To further test that theory, we hacked up a change to create a root_domain for each cpu, and the max thread wakeup times improved.
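
As an aside, a comparable per-cpu SCHED_FIFO wake-up load, with per-thread worst-case latency reporting, can be generated with cyclictest from the rt-tests suite; this is only an illustrative stand-in, not the test used above:

# 16 FIFO threads, thread N pinned to cpu N (-a), waking every 1 ms
cyclictest -m -n -a -t 16 -p 80 -i 1000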

2008-11-04 03:49:32

by Gregory Haskins

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Peter Zijlstra wrote:
> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
>
>> When load balancing gets switched off for a set of cpus via the
>> sched_load_balance flag in cpusets, those cpus wind up with the
>> globally defined def_root_domain attached. The def_root_domain is
>> attached when partition_sched_domains calls detach_destroy_domains().
>> A new root_domain is never allocated or attached as a sched domain
>> will never be attached by __build_sched_domains() for the non-load
>> balanced processors.
>>
>> The problem with this scenario is that on systems with a large number
>> of processors with load balancing switched off, we start to see the
>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
>> This starts to become much more apparent above 8 waking RT threads
>> (with each RT thread running on its own cpu, blocking and waking up
>> continuously).
>>
>> I'm wondering if this is, in fact, the way things were meant to work,
>> or should we have a root domain allocated for each cpu that is not to
>> be part of a sched domain? Note that the def_root_domain spans all of
>> the non-load-balanced cpus in this case. Having it attached to cpus
>> that should not be load balancing doesn't quite make sense to me.
>>
>
> It shouldn't be like that, each load-balance domain (in your case a
> single cpu) should get its own root domain. Gregory?
>

Yeah, this sounds broken. I know that the root-domain code was being
developed coincident to some upheaval with the cpuset code, so I suspect
something may have been broken from the original intent. I will take a
look.

-Greg

>
>> Here's where we've often seen this lock contention occur:
>>
>
> what's this horrible output from?




2008-11-04 14:30:53

by Gregory Haskins

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Gregory Haskins wrote:
> Peter Zijlstra wrote:
>
>> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
>>
>>
>>> When load balancing gets switched off for a set of cpus via the
>>> sched_load_balance flag in cpusets, those cpus wind up with the
>>> globally defined def_root_domain attached. The def_root_domain is
>>> attached when partition_sched_domains calls detach_destroy_domains().
>>> A new root_domain is never allocated or attached as a sched domain
>>> will never be attached by __build_sched_domains() for the non-load
>>> balanced processors.
>>>
>>> The problem with this scenario is that on systems with a large number
>>> of processors with load balancing switched off, we start to see the
>>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
>>> This starts to become much more apparent above 8 waking RT threads
>>> (with each RT thread running on its own cpu, blocking and waking up
>>> continuously).
>>>
>>> I'm wondering if this is, in fact, the way things were meant to work,
>>> or should we have a root domain allocated for each cpu that is not to
>>> be part of a sched domain? Note that the def_root_domain spans all of
>>> the non-load-balanced cpus in this case. Having it attached to cpus
>>> that should not be load balancing doesn't quite make sense to me.
>>>
>>>
>> It shouldn't be like that, each load-balance domain (in your case a
>> single cpu) should get its own root domain. Gregory?
>>
>>
>
> Yeah, this sounds broken. I know that the root-domain code was being
> developed coincident to some upheaval with the cpuset code, so I suspect
> something may have been broken from the original intent. I will take a
> look.
>
> -Greg
>
>

After thinking about it some more, I am not quite sure what to do here.
The root-domain code was really designed to be 1:1 with a disjoint
cpuset. In this case, it sounds like all the non-balanced cpus are
still in one default cpuset. In that case, the code is correct to place
all those cores in the singleton def_root_domain. The question really
is: How do we support the sched_load_balance flag better?

I suppose we could go through the scheduler code and have it check that
flag before consulting the root-domain. Another alternative is to have
the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?

-Greg



2008-11-04 14:36:15

by Peter Zijlstra

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote:
> Gregory Haskins wrote:
> > Peter Zijlstra wrote:
> >
> >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
> >>
> >>
> >>> When load balancing gets switched off for a set of cpus via the
> >>> sched_load_balance flag in cpusets, those cpus wind up with the
> >>> globally defined def_root_domain attached. The def_root_domain is
> >>> attached when partition_sched_domains calls detach_destroy_domains().
> >>> A new root_domain is never allocated or attached as a sched domain
> >>> will never be attached by __build_sched_domains() for the non-load
> >>> balanced processors.
> >>>
> >>> The problem with this scenario is that on systems with a large number
> >>> of processors with load balancing switched off, we start to see the
> >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
> >>> This starts to become much more apparent above 8 waking RT threads
> >>> (with each RT thread running on its own cpu, blocking and waking up
> >>> continuously).
> >>>
> >>> I'm wondering if this is, in fact, the way things were meant to work,
> >>> or should we have a root domain allocated for each cpu that is not to
> >>> be part of a sched domain? Note that the def_root_domain spans all of
> >>> the non-load-balanced cpus in this case. Having it attached to cpus
> >>> that should not be load balancing doesn't quite make sense to me.
> >>>
> >>>
> >> It shouldn't be like that, each load-balance domain (in your case a
> >> single cpu) should get its own root domain. Gregory?
> >>
> >>
> >
> > Yeah, this sounds broken. I know that the root-domain code was being
> > developed coincident to some upheaval with the cpuset code, so I suspect
> > something may have been broken from the original intent. I will take a
> > look.
> >
> > -Greg
> >
> >
>
> After thinking about it some more, I am not quite sure what to do here.
> The root-domain code was really designed to be 1:1 with a disjoint
> cpuset. In this case, it sounds like all the non-balanced cpus are
> still in one default cpuset. In that case, the code is correct to place
> all those cores in the singleton def_root_domain. The question really
> is: How do we support the sched_load_balance flag better?
>
> I suppose we could go through the scheduler code and have it check that
> flag before consulting the root-domain. Another alternative is to have
> the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?

Hmm, but you cannot disable load-balance on a cpu without placing it in
a cpuset first, right?

Or are folks disabling load-balance bottom-up, instead of top-down?

In that case, I think we should dis-allow that.

2008-11-04 14:40:29

by Dimitri Sivanich

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Tue, Nov 04, 2008 at 03:36:33PM +0100, Peter Zijlstra wrote:
> On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote:
> > Gregory Haskins wrote:
> > > Peter Zijlstra wrote:
> > >
> > >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
> > >>
> > >>
> > >>> When load balancing gets switched off for a set of cpus via the
> > >>> sched_load_balance flag in cpusets, those cpus wind up with the
> > >>> globally defined def_root_domain attached. The def_root_domain is
> > >>> attached when partition_sched_domains calls detach_destroy_domains().
> > >>> A new root_domain is never allocated or attached as a sched domain
> > >>> will never be attached by __build_sched_domains() for the non-load
> > >>> balanced processors.
> > >>>
> > >>> The problem with this scenario is that on systems with a large number
> > >>> of processors with load balancing switched off, we start to see the
> > >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
> > >>> This starts to become much more apparent above 8 waking RT threads
> > >>> (with each RT thread running on its own cpu, blocking and waking up
> > >>> continuously).
> > >>>
> > >>> I'm wondering if this is, in fact, the way things were meant to work,
> > >>> or should we have a root domain allocated for each cpu that is not to
> > >>> be part of a sched domain? Note that the def_root_domain spans all of
> > >>> the non-load-balanced cpus in this case. Having it attached to cpus
> > >>> that should not be load balancing doesn't quite make sense to me.
> > >>>
> > >>>
> > >> It shouldn't be like that, each load-balance domain (in your case a
> > >> single cpu) should get its own root domain. Gregory?
> > >>
> > >>
> > >
> > > Yeah, this sounds broken. I know that the root-domain code was being
> > > developed coincident to some upheaval with the cpuset code, so I suspect
> > > something may have been broken from the original intent. I will take a
> > > look.
> > >
> > > -Greg
> > >
> > >
> >
> > After thinking about it some more, I am not quite sure what to do here.
> > The root-domain code was really designed to be 1:1 with a disjoint
> > cpuset. In this case, it sounds like all the non-balanced cpus are
> > still in one default cpuset. In that case, the code is correct to place
> > all those cores in the singleton def_root_domain. The question really
> > is: How do we support the sched_load_balance flag better?
> >
> > I suppose we could go through the scheduler code and have it check that
> > flag before consulting the root-domain. Another alternative is to have
> > the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?
>
> Hmm, but you cannot disable load-balance on a cpu without placing it in
> a cpuset first, right?
>
> Or are folks disabling load-balance bottom-up, instead of top-down?
>
> In that case, I think we should dis-allow that.

When I see this behavior, I am creating cpusets containing these non-load-balancing cpus. Whether I create a single cpuset for each one, or one cpuset for all of them, the root domain ends up being the def_root_domain with no sched domain attached once I set the sched_load_balance flag to 0 in both the root cpuset and the created cpusets.
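
Spelled out, those two variants would look roughly as follows (mount point and cpuset names are illustrative); with CONFIG_SCHED_DEBUG enabled, the resulting sched-domain attachments can be checked in the kernel log:

# either one cpuset per non-balanced cpu ...
for i in 4 5 6 7; do
    mkdir /dev/cpuset/par$i
    echo $i > /dev/cpuset/par$i/cpuset.cpus
    echo 0 > /dev/cpuset/par$i/cpuset.sched_load_balance
done

# ... or a single cpuset holding all of them
mkdir /dev/cpuset/par
echo 4-7 > /dev/cpuset/par/cpuset.cpus
echo 0 > /dev/cpuset/par/cpuset.sched_load_balance

# in either case, balancing is also switched off in the root cpuset
echo 0 > /dev/cpuset/cpuset.sched_load_balance

# with CONFIG_SCHED_DEBUG, the kernel logs which sched-domain (if any)
# each cpu was attached to
dmesg | grep sched-domain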

2008-11-04 14:45:58

by Dimitri Sivanich

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Tue, Nov 04, 2008 at 03:36:33PM +0100, Peter Zijlstra wrote:
> Or are folks disabling load-balance bottom-up, instead of top-down?
>
> In that case, I think we should dis-allow that.

If what you mean by "disabling load-balance bottom-up" is disabling
load-balance in the root cpuset before disabling it in the leaves, then
in the end it does not matter which way you do it; the setup winds up
being the same.
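
Concretely, the two orderings being compared differ only in when the root cpuset's flag is cleared relative to the leaf cpusets (paths and names illustrative):

# root first, then the leaf cpusets
echo 0 > /dev/cpuset/cpuset.sched_load_balance
for d in /dev/cpuset/par*; do echo 0 > $d/cpuset.sched_load_balance; done

# or the leaf cpusets first, then the root
for d in /dev/cpuset/par*; do echo 0 > $d/cpuset.sched_load_balance; done
echo 0 > /dev/cpuset/cpuset.sched_load_balance

Both sequences leave sched_load_balance set to 0 in the root and in every leaf cpuset, which is why the resulting setup is identical.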

2008-11-04 14:58:42

by Gregory Haskins

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Dimitri Sivanich wrote:
> On Tue, Nov 04, 2008 at 03:36:33PM +0100, Peter Zijlstra wrote:
>
>> On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote:
>>
>>> Gregory Haskins wrote:
>>>
>>>> Peter Zijlstra wrote:
>>>>
>>>>
>>>>> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
>>>>>
>>>>>
>>>>>
>>>>>> When load balancing gets switched off for a set of cpus via the
>>>>>> sched_load_balance flag in cpusets, those cpus wind up with the
>>>>>> globally defined def_root_domain attached. The def_root_domain is
>>>>>> attached when partition_sched_domains calls detach_destroy_domains().
>>>>>> A new root_domain is never allocated or attached as a sched domain
>>>>>> will never be attached by __build_sched_domains() for the non-load
>>>>>> balanced processors.
>>>>>>
>>>>>> The problem with this scenario is that on systems with a large number
>>>>>> of processors with load balancing switched off, we start to see the
>>>>>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
>>>>>> This starts to become much more apparent above 8 waking RT threads
>>>>>> (with each RT thread running on its own cpu, blocking and waking up
>>>>>> continuously).
>>>>>>
>>>>>> I'm wondering if this is, in fact, the way things were meant to work,
>>>>>> or should we have a root domain allocated for each cpu that is not to
>>>>>> be part of a sched domain? Note that the def_root_domain spans all of
>>>>>> the non-load-balanced cpus in this case. Having it attached to cpus
>>>>>> that should not be load balancing doesn't quite make sense to me.
>>>>>>
>>>>>>
>>>>>>
>>>>> It shouldn't be like that, each load-balance domain (in your case a
>>>>> single cpu) should get its own root domain. Gregory?
>>>>>
>>>>>
>>>>>
>>>> Yeah, this sounds broken. I know that the root-domain code was being
>>>> developed coincident to some upheaval with the cpuset code, so I suspect
>>>> something may have been broken from the original intent. I will take a
>>>> look.
>>>>
>>>> -Greg
>>>>
>>>>
>>>>
>>> After thinking about it some more, I am not quite sure what to do here.
>>> The root-domain code was really designed to be 1:1 with a disjoint
>>> cpuset. In this case, it sounds like all the non-balanced cpus are
>>> still in one default cpuset. In that case, the code is correct to place
>>> all those cores in the singleton def_root_domain. The question really
>>> is: How do we support the sched_load_balance flag better?
>>>
>>> I suppose we could go through the scheduler code and have it check that
>>> flag before consulting the root-domain. Another alternative is to have
>>> the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?
>>>
>> Hmm, but you cannot disable load-balance on a cpu without placing it in
>> a cpuset first, right?
>>
>> Or are folks disabling load-balance bottom-up, instead of top-down?
>>
>> In that case, I think we should dis-allow that.
>>
>
> When I see this behavior, I am creating cpusets containing these non-load-balancing cpus. Whether I create a single cpuset for each one, or one cpuset for all of them, the root domain ends up being the def_root_domain with no sched domain attached once I set the sched_load_balance flag to 0 in both the root cpuset and the created cpusets.
>
>

If you tried creating different cpusets and it still had them all end up
in the def_root_domain, something is very broken indeed. I will take a
look.

-Greg





2008-11-06 09:14:16

by Nish Aravamudan

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Tue, Nov 4, 2008 at 6:36 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote:
>> Gregory Haskins wrote:
>> > Peter Zijlstra wrote:
>> >
>> >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
>> >>
>> >>
>> >>> When load balancing gets switched off for a set of cpus via the
>> >>> sched_load_balance flag in cpusets, those cpus wind up with the
>> >>> globally defined def_root_domain attached. The def_root_domain is
>> >>> attached when partition_sched_domains calls detach_destroy_domains().
>> >>> A new root_domain is never allocated or attached as a sched domain
>> >>> will never be attached by __build_sched_domains() for the non-load
>> >>> balanced processors.
>> >>>
>> >>> The problem with this scenario is that on systems with a large number
>> >>> of processors with load balancing switched off, we start to see the
>> >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
>> >>> This starts to become much more apparent above 8 waking RT threads
>> >>> (with each RT thread running on its own cpu, blocking and waking up
>> >>> continuously).
>> >>>
>> >>> I'm wondering if this is, in fact, the way things were meant to work,
>> >>> or should we have a root domain allocated for each cpu that is not to
>> >>> be part of a sched domain? Note that the def_root_domain spans all of
>> >>> the non-load-balanced cpus in this case. Having it attached to cpus
>> >>> that should not be load balancing doesn't quite make sense to me.
>> >>>
>> >>>
>> >> It shouldn't be like that, each load-balance domain (in your case a
>> >> single cpu) should get its own root domain. Gregory?
>> >>
>> >>
>> >
>> > Yeah, this sounds broken. I know that the root-domain code was being
>> > developed coincident to some upheaval with the cpuset code, so I suspect
>> > something may have been broken from the original intent. I will take a
>> > look.
>> >
>> > -Greg
>> >
>> >
>>
>> After thinking about it some more, I am not quite sure what to do here.
>> The root-domain code was really designed to be 1:1 with a disjoint
>> cpuset. In this case, it sounds like all the non-balanced cpus are
>> still in one default cpuset. In that case, the code is correct to place
>> all those cores in the singleton def_root_domain. The question really
>> is: How do we support the sched_load_balance flag better?
>>
>> I suppose we could go through the scheduler code and have it check that
>> flag before consulting the root-domain. Another alternative is to have
>> the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?
>
> Hmm, but you cannot disable load-balance on a cpu without placing it in
> a cpuset first, right?
>
> Or are folks disabling load-balance bottom-up, instead of top-down?
>
> In that case, I think we should dis-allow that.

I don't have a lot of insight into the technical discussion, but will
say that (if I understand you right), the "bottom-up" approach was
recommended on LKML by Max K. in the (long) thread from earlier this
year with Subject "Inquiry: Should we remove "isolcpus= kernel boot
option? (may have realtime uses)":

"Just to complete the example above. Lets say you want to isolate cpu2
(assuming that cpusets are already mounted).

# Bring cpu2 offline
echo 0 > /sys/devices/system/cpu/cpu2/online

# Disable system wide load balancing
echo 0 > /dev/cpuset/cpuset.sched_load_balance

# Bring cpu2 online
echo 1 > /sys/devices/system/cpu/cpu2/online

Now if you want to un-isolate cpu2 you do

# Re-enable system wide load balancing
echo 1 > /dev/cpuset/cpuset.sched_load_balance

Of course this is not a complete isolation. There are also irqs (see my
"default irq affinity" patch), workqueues and the stop machine. I'm working on
those too and will release .25 base cpuisol tree when I'm done."

Would you recommend instead, then, that a new cpuset be created with
only cpu 2 in it (should one set cpuset.cpu_exclusive then?) and then
disabling load balancing in that cpuset?

Thanks,
Nish
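
For reference, the setup being asked about would look roughly like this (mount point, cpuset name and $PID are illustrative; as Max notes later in the thread, cpu_exclusive has no effect on sched-domain construction, so sched_load_balance is the flag that actually matters):

mkdir /dev/cpuset/iso2
echo 2 > /dev/cpuset/iso2/cpuset.cpus
echo 0 > /dev/cpuset/iso2/cpuset.mems            # mems must be set before adding tasks
echo 1 > /dev/cpuset/iso2/cpuset.cpu_exclusive   # optional; does not affect sched domains
echo 0 > /dev/cpuset/iso2/cpuset.sched_load_balance
echo 0 > /dev/cpuset/cpuset.sched_load_balance   # balancing off in the root cpuset as well
echo $PID > /dev/cpuset/iso2/tasks               # move the RT task(s) onto cpu 2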

2008-11-06 13:32:23

by Dimitri Sivanich

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Thu, Nov 06, 2008 at 01:13:48AM -0800, Nish Aravamudan wrote:
> On Tue, Nov 4, 2008 at 6:36 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote:
> >> Gregory Haskins wrote:
> >> > Peter Zijlstra wrote:
> >> >
> >> >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
> >> >>
> >> >>
> >> >>> When load balancing gets switched off for a set of cpus via the
> >> >>> sched_load_balance flag in cpusets, those cpus wind up with the
> >> >>> globally defined def_root_domain attached. The def_root_domain is
> >> >>> attached when partition_sched_domains calls detach_destroy_domains().
> >> >>> A new root_domain is never allocated or attached as a sched domain
> >> >>> will never be attached by __build_sched_domains() for the non-load
> >> >>> balanced processors.
> >> >>>
> >> >>> The problem with this scenario is that on systems with a large number
> >> >>> of processors with load balancing switched off, we start to see the
> >> >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
> >> >>> This starts to become much more apparent above 8 waking RT threads
> >> >>> (with each RT thread running on its own cpu, blocking and waking up
> >> >>> continuously).
> >> >>>
> >> >>> I'm wondering if this is, in fact, the way things were meant to work,
> >> >>> or should we have a root domain allocated for each cpu that is not to
> >> >>> be part of a sched domain? Note that the def_root_domain spans all of
> >> >>> the non-load-balanced cpus in this case. Having it attached to cpus
> >> >>> that should not be load balancing doesn't quite make sense to me.
> >> >>>
> >> >>>
> >> >> It shouldn't be like that, each load-balance domain (in your case a
> >> >> single cpu) should get its own root domain. Gregory?
> >> >>
> >> >>
> >> >
> >> > Yeah, this sounds broken. I know that the root-domain code was being
> >> > developed coincident to some upheaval with the cpuset code, so I suspect
> >> > something may have been broken from the original intent. I will take a
> >> > look.
> >> >
> >> > -Greg
> >> >
> >> >
> >>
> >> After thinking about it some more, I am not quite sure what to do here.
> >> The root-domain code was really designed to be 1:1 with a disjoint
> >> cpuset. In this case, it sounds like all the non-balanced cpus are
> >> still in one default cpuset. In that case, the code is correct to place
> >> all those cores in the singleton def_root_domain. The question really
> >> is: How do we support the sched_load_balance flag better?
> >>
> >> I suppose we could go through the scheduler code and have it check that
> >> flag before consulting the root-domain. Another alternative is to have
> >> the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?
> >
> > Hmm, but you cannot disable load-balance on a cpu without placing it in
> > a cpuset first, right?
> >
> > Or are folks disabling load-balance bottom-up, instead of top-down?
> >
> > In that case, I think we should dis-allow that.
>
> I don't have a lot of insight into the technical discussion, but will
> say that (if I understand you right), the "bottom-up" approach was
> recommended on LKML by Max K. in the (long) thread from earlier this
> year with Subject "Inquiry: Should we remove "isolcpus= kernel boot
> option? (may have realtime uses)":
>
> "Just to complete the example above. Lets say you want to isolate cpu2
> (assuming that cpusets are already mounted).
>
> # Bring cpu2 offline
> echo 0 > /sys/devices/system/cpu/cpu2/online
>
> # Disable system wide load balancing
> echo 0 > /dev/cpuset/cpuset.sched_load_balance
>
> # Bring cpu2 online
> echo 1 > /sys/devices/system/cpu/cpu2/online
>
> Now if you want to un-isolate cpu2 you do
>
> # Re-enable system wide load balancing
> echo 1 > /dev/cpuset/cpuset.sched_load_balance
>
> Of course this is not a complete isolation. There are also irqs (see my
> "default irq affinity" patch), workqueues and the stop machine. I'm working on
> those too and will release .25 base cpuisol tree when I'm done."
>
> Would you recommend instead, then, that a new cpuset be created with
> only cpu 2 in it (should one set cpuset.cpu_exclusive then?) and then
> disabling load balancing in that cpuset?
>

This is exactly the primary scenario that I've been trying (as well as having multiple cpus in that cpuset). Regardless of the setup, the same problem occurs - the default root domain is what gets attached, and that spans all other cpus with load balancing switched off. The lock in the def_root_domain's cpupri_vec therefore becomes contended, and that slows down thread wakeup.

2008-11-19 19:49:47

by Max Krasnyansky

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Gregory Haskins wrote:
> If you tried creating different cpusets and it still had them all end up
> in the def_root_domain, something is very broken indeed. I will take a
> look.

I believe that's the intended behaviour. We always put cpus that are not
balanced into null sched domains. This was done since day one (ie when
cpuisol= option was introduced) and cpusets just followed the same convention.

I think the idea is that we want to make balancer a noop on those processors.
We could change cpusets code to create a root sched domain for each cpu I
guess. But can we maybe scale cpupri some other way ?

Max

2008-11-19 19:55:28

by Dimitri Sivanich

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Wed, Nov 19, 2008 at 11:49:36AM -0800, Max Krasnyansky wrote:
> I think the idea is that we want to make balancer a noop on those processors.

Ultimately, making the balancer a noop on processors with load balancing turned off would be the best solution.

> We could change cpusets code to create a root sched domain for each cpu I
> guess. But can we maybe scale cpupri some other way ?

It doesn't make sense to me that they'd have a root domain attached that spans more of the system than that cpu.

>
> Max

2008-11-19 20:17:50

by Max Krasnyansky

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance



Dimitri Sivanich wrote:
> On Wed, Nov 19, 2008 at 11:49:36AM -0800, Max Krasnyansky wrote:
>> I think the idea is that we want to make balancer a noop on those processors.
>
> Ultimately, making the balancer a noop on processors with load balancing turned off would be the best solution.
Yes. I forgot to point out that if we do change cpusets to generate sched
domain per cpu we want to make sure that balancer is still a noop just like it
is today with the null sched domain.

>> We could change cpusets code to create a root sched domain for each cpu I
>> guess. But can we maybe scale cpupri some other way ?
>
> It doesn't make sense to me that they'd have a root domain attached that spans more of the system than that cpu.
I think 'root' in this case is a bit of a misnomer. What I meant is that each
non-balanced cpu would be in a separate sched domain.

Max

2008-11-19 20:21:20

by Dimitri Sivanich

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Wed, Nov 19, 2008 at 12:17:38PM -0800, Max Krasnyansky wrote:
>
>
> Dimitri Sivanich wrote:
> > On Wed, Nov 19, 2008 at 11:49:36AM -0800, Max Krasnyansky wrote:
> >> I think the idea is that we want to make balancer a noop on those processors.
> >
> > Ultimately, making the balancer a noop on processors with load balancing turned off would be the best solution.
> Yes. I forgot to point out that if we do change cpusets to generate sched
> domain per cpu we want to make sure that balancer is still a noop just like it
> is today with the null sched domain.

Sorry, I meant root_domain per cpu, not sched domain. Having NULL sched domains for these cpus is fine.

>
> >> We could change cpusets code to create a root sched domain for each cpu I
> >> guess. But can we maybe scale cpupri some other way ?
> >
> > It doesn't make sense to me that they'd have a root domain attached that spans more of the system than that cpu.
> I think 'root' in this case is a bit of a misnomer. What I meant is that each
> non-balanced cpu would be in a separate sched domain.

I think a NULL sched domain, as it is now, is fine.

2008-11-19 20:21:40

by Gregory Haskins

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Max Krasnyansky wrote:
> Gregory Haskins wrote:
>
>> If you tried creating different cpusets and it still had them all end up
>> in the def_root_domain, something is very broken indeed. I will take a
>> look.
>>
>
> I believe that's the intended behaviour.
Heh...well, as the guy that wrote root-domains, I can definitively say
that is not the behavior that I personally intended ;)



> We always put cpus that are not
> balanced into null sched domains. This was done since day one (ie when
> cpuisol= option was introduced) and cpusets just followed the same convention.
>

It sounds like the problem with my code is that "null sched domain"
translates into "default root-domain" which is understandably unexpected
by Dimitri (and myself). Really I intended root-domains to become
associated with each exclusive/disjoint cpuset that is created. In a
way, non-balanced/isolated cpus could be modeled as an exclusive cpuset
with one member, but that is somewhat beyond the scope of the
root-domain code as it stands today. My primary concern was that
Dimitri reports that even creating a disjoint cpuset per cpu does not
yield an isolated root-domain per cpu. Rather they all end up in the
default root-domain, and this is not what I intended at all.

However, as a secondary goal it would be nice to somehow directly
support the "no-load-balance" option without requiring explicit
exclusive per-cpu cpusets to do it. The proper mechanism (IMHO) to
scope the scheduler to a subset of cpus (including only "self") is
root-domains so I would prefer to see the solution based on that.
However, today there is a rather tight coupling of root-domains and
cpusets, so this coupling would likely have to be relaxed a little bit
to get there.

There are certainly other ways to solve the problem as well. But seeing
as how I intended root-domains to represent the effective partition
scope of the scheduler, this seems like a natural fit in my mind until
it's proven to me otherwise.

Regards,
-Greg




2008-11-19 20:33:50

by Dimitri Sivanich

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Wed, Nov 19, 2008 at 03:25:15PM -0500, Gregory Haskins wrote:
> It sounds like the problem with my code is that "null sched domain"
> translates into "default root-domain" which is understandably unexpected
> by Dimitri (and myself). Really I intended root-domains to become
> associated with each exclusive/disjoint cpuset that is created. In a
> way, non-balanced/isolated cpus could be modeled as an exclusive cpuset
> with one member, but that is somewhat beyond the scope of the

Actually, at one time, that is how things were set up. Setting the
cpu_exclusive bit on a single cpu cpuset would isolate that cpu from
load balancing.

> root-domain code as it stands today. My primary concern was that
> Dimitri reports that even creating a disjoint cpuset per cpu does not
> yield an isolated root-domain per cpu. Rather they all end up in the
> default root-domain, and this is not what I intended at all.
>
> However, as a secondary goal it would be nice to somehow directly
> support the "no-load-balance" option without requiring explicit
> exclusive per-cpu cpusets to do it. The proper mechanism (IMHO) to
> scope the scheduler to a subset of cpus (including only "self") is
> root-domains so I would prefer to see the solution based on that.
> However, today there is a rather tight coupling of root-domains and
> cpusets, so this coupling would likely have to be relaxed a little bit
> to get there.
>
> There are certainly other ways to solve the problem as well. But seeing
> as how I intended root-domains to represent the effective partition
> scope of the scheduler, this seems like a natural fit in my mind until
> it's proven to me otherwise.
>

Agreed.

2008-11-19 21:27:33

by Gregory Haskins

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Dimitri Sivanich wrote:
> On Wed, Nov 19, 2008 at 03:25:15PM -0500, Gregory Haskins wrote:
>
>> It sounds like the problem with my code is that "null sched domain"
>> translates into "default root-domain" which is understandably unexpected
>> by Dimitri (and myself). Really I intended root-domains to become
>> associated with each exclusive/disjoint cpuset that is created. In a
>> way, non-balanced/isolated cpus could be modeled as an exclusive cpuset
>> with one member, but that is somewhat beyond the scope of the
>>
>
> Actually, at one time, that is how things were set up. Setting the
> cpu_exclusive bit on a single cpu cpuset would isolate that cpu from
> load balancing.
>
Do you know if this was pre or post the root-domain code? Here is a
reference to the commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=57d885fea0da0e9541d7730a9e1dcf734981a173

A bisection that shows when this last worked for you would be very
appreciated if you have the time, Dimitri.

Regards,
-Greg




2008-11-19 21:48:13

by Dimitri Sivanich

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Wed, Nov 19, 2008 at 04:30:08PM -0500, Gregory Haskins wrote:
> Dimitri Sivanich wrote:
> > On Wed, Nov 19, 2008 at 03:25:15PM -0500, Gregory Haskins wrote:
> >
> >> It sounds like the problem with my code is that "null sched domain"
> >> translates into "default root-domain" which is understandably unexpected
> >> by Dimitri (and myself). Really I intended root-domains to become
> >> associated with each exclusive/disjoint cpuset that is created. In a
> >> way, non-balanced/isolated cpus could be modeled as an exclusive cpuset
> >> with one member, but that is somewhat beyond the scope of the
> >>
> >
> > Actually, at one time, that is how things were set up. Setting the
> > cpu_exclusive bit on a single cpu cpuset would isolate that cpu from
> > load balancing.
> >
> Do you know if this was pre or post the root-domain code? Here is a
> reference to the commit:

It was pre root-domain. That behavior was replaced by the addition of the sched_load_balance flag with the following commit (though it was actually removed even earlier):

http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=029190c515f15f512ac85de8fc686d4dbd0ae731

>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=57d885fea0da0e9541d7730a9e1dcf734981a173
>
> A bisection that shows when this last worked for you would be very
> appreciated if you have the time, Dimitri.
>
> Regards,
> -Greg
>
>

2008-11-19 22:21:36

by Gregory Haskins

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Dimitri Sivanich wrote:
> On Wed, Nov 19, 2008 at 03:25:15PM -0500, Gregory Haskins wrote:
>
>> It sounds like the problem with my code is that "null sched domain"
>> translates into "default root-domain" which is understandably unexpected
>> by Dimitri (and myself). Really I intended root-domains to become
>> associated with each exclusive/disjoint cpuset that is created. In a
>> way, non-balanced/isolated cpus could be modeled as an exclusive cpuset
>> with one member, but that is somewhat beyond the scope of the
>>
>
> Actually, at one time, that is how things were set up. Setting the
> cpu_exclusive bit on a single cpu cpuset would isolate that cpu from
> load balancing.
>

Re-reading my post made me realize what I said above was confusing. The
"that" in "but that is somewhat beyond the scope" was meant to be
"explicit/direct support for the no-balance flag". However, it perhaps
sounded like I was talking about exclusive cpusets with singleton
membership. Exclusive cpusets are the original raison d'être for
root-domains. ;)

Therefore I agree that the exclusive cpuset portion should work (but
seems to be broken, thus the bug report). My primary goal is to fix
this issue. However, I would also like to *add* support for the
no-balance flag as a secondary goal. It's just that this is a new feature
from my perspective, so it may take some additional work to figure out
what needs to be done. HTH and sorry for the confusion.

-Greg
>
>> root-domain code as it stands today. My primary concern was that
>> Dimitri reports that even creating a disjoint cpuset per cpu does not
>> yield an isolated root-domain per cpu. Rather they all end up in the
>> default root-domain, and this is not what I intended at all.
>>
>> However, as a secondary goal it would be nice to somehow directly
>> support the "no-load-balance" option without requiring explicit
>> exclusive per-cpu cpusets to do it. The proper mechanism (IMHO) to
>> scope the scheduler to a subset of cpus (including only "self") is
>> root-domains so I would prefer to see the solution based on that.
>> However, today there is a rather tight coupling of root-domains and
>> cpusets, so this coupling would likely have to be relaxed a little bit
>> to get there.
>>
>> There are certainly other ways to solve the problem as well. But seeing
>> as how I intended root-domains to represent the effective partition
>> scope of the scheduler, this seems like a natural fit in my mind until
>> it's proven to me otherwise.
>>
>>
>
> Agreed.
>




2008-11-20 02:13:08

by Max Krasnyansky

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Gregory Haskins wrote:
> Max Krasnyansky wrote:
>> We always put cpus that are not
>> balanced into null sched domains. This was done since day one (ie when
>> cpuisol= option was introduced) and cpusets just followed the same convention.
>>
>
> It sounds like the problem with my code is that "null sched domain"
> translates into "default root-domain" which is understandably unexpected
> by Dimitri (and myself). Really I intended root-domains to become
> associated with each exclusive/disjoint cpuset that is created. In a
> way, non-balanced/isolated cpus could be modeled as an exclusive cpuset
> with one member, but that is somewhat beyond the scope of the
> root-domain code as it stands today. My primary concern was that
> Dimitri reports that even creating a disjoint cpuset per cpu does not
> yield an isolated root-domain per cpu. Rather they all end up in the
> default root-domain, and this is not what I intended at all.
>
> However, as a secondary goal it would be nice to somehow directly
> support the "no-load-balance" option without requiring explicit
> exclusive per-cpu cpusets to do it. The proper mechanism (IMHO) to
> scope the scheduler to a subset of cpus (including only "self") is
> root-domains so I would prefer to see the solution based on that.
> However, today there is a rather tight coupling of root-domains and
> cpusets, so this coupling would likely have to be relaxed a little bit
> to get there.
>
> There are certainly other ways to solve the problem as well. But seeing
> as how I intended root-domains to represent the effective partition
> scope of the scheduler, this seems like a natural fit in my mind until
> it's proven to me otherwise.

Since I was working on cpuisol updates I decided to stick some debug printks
around and test a few scenarios. I'm basically printing the cpumasks generated
for each cpuset and the address of the root domain.
My conclusion is that everything is working as expected. I do not think we
need to fix anything in this area.

btw the cpu_exclusive flag has no impact on the sched domains stuff. I'm not sure
why it was mentioned in this context.

Here comes a long text with a bunch of traces based on different cpuset
setups. This is an 8Core dual Xeon (L5410) box. 2.6.27.6 kernel.
All scenarios assume
mount -t cgroup -ocpusets /cpusets
cd /cpusets

----
Trace 1
$ echo 0 > cpuset.sched_load_balance

[ 1674.811610] cpusets: rebuild ndoms 0
[ 1674.811627] CPU0 root domain default
[ 1674.811629] CPU0 attaching NULL sched-domain.
[ 1674.811633] CPU1 root domain default
[ 1674.811635] CPU1 attaching NULL sched-domain.
[ 1674.811638] CPU2 root domain default
[ 1674.811639] CPU2 attaching NULL sched-domain.
[ 1674.811642] CPU3 root domain default
[ 1674.811643] CPU3 attaching NULL sched-domain.
[ 1674.811646] CPU4 root domain default
[ 1674.811647] CPU4 attaching NULL sched-domain.
[ 1674.811649] CPU5 root domain default
[ 1674.811651] CPU5 attaching NULL sched-domain.
[ 1674.811653] CPU6 root domain default
[ 1674.811655] CPU6 attaching NULL sched-domain.
[ 1674.811657] CPU7 root domain default
[ 1674.811659] CPU7 attaching NULL sched-domain.

Looks fine.

----
Trace 2
$ echo 1 > cpuset.sched_load_balance

[ 1748.260637] cpusets: rebuild ndoms 1
[ 1748.260648] cpuset: domain 0 cpumask ff
[ 1748.260650] CPU0 root domain ffff88025884a000
[ 1748.260652] CPU0 attaching sched-domain:
[ 1748.260654] domain 0: span 0-7 level CPU
[ 1748.260656] groups: 0 1 2 3 4 5 6 7
[ 1748.260665] CPU1 root domain ffff88025884a000
[ 1748.260666] CPU1 attaching sched-domain:
[ 1748.260668] domain 0: span 0-7 level CPU
[ 1748.260670] groups: 1 2 3 4 5 6 7 0
[ 1748.260677] CPU2 root domain ffff88025884a000
[ 1748.260679] CPU2 attaching sched-domain:
[ 1748.260681] domain 0: span 0-7 level CPU
[ 1748.260683] groups: 2 3 4 5 6 7 0 1
[ 1748.260690] CPU3 root domain ffff88025884a000
[ 1748.260692] CPU3 attaching sched-domain:
[ 1748.260693] domain 0: span 0-7 level CPU
[ 1748.260696] groups: 3 4 5 6 7 0 1 2
[ 1748.260703] CPU4 root domain ffff88025884a000
[ 1748.260705] CPU4 attaching sched-domain:
[ 1748.260706] domain 0: span 0-7 level CPU
[ 1748.260708] groups: 4 5 6 7 0 1 2 3
[ 1748.260715] CPU5 root domain ffff88025884a000
[ 1748.260717] CPU5 attaching sched-domain:
[ 1748.260718] domain 0: span 0-7 level CPU
[ 1748.260720] groups: 5 6 7 0 1 2 3 4
[ 1748.260727] CPU6 root domain ffff88025884a000
[ 1748.260729] CPU6 attaching sched-domain:
[ 1748.260731] domain 0: span 0-7 level CPU
[ 1748.260733] groups: 6 7 0 1 2 3 4 5
[ 1748.260740] CPU7 root domain ffff88025884a000
[ 1748.260742] CPU7 attaching sched-domain:
[ 1748.260743] domain 0: span 0-7 level CPU
[ 1748.260745] groups: 7 0 1 2 3 4 5 6

Looks perfect.

----
Trace 3
$ for i in 0 1 2 3 4 5 6 7; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
$ echo 0 > cpuset.sched_load_balance

[ 1803.485838] cpusets: rebuild ndoms 1
[ 1803.485843] cpuset: domain 0 cpumask ff
[ 1803.486953] cpusets: rebuild ndoms 1
[ 1803.486957] cpuset: domain 0 cpumask ff
[ 1803.488039] cpusets: rebuild ndoms 1
[ 1803.488044] cpuset: domain 0 cpumask ff
[ 1803.489046] cpusets: rebuild ndoms 1
[ 1803.489056] cpuset: domain 0 cpumask ff
[ 1803.490306] cpusets: rebuild ndoms 1
[ 1803.490312] cpuset: domain 0 cpumask ff
[ 1803.491464] cpusets: rebuild ndoms 1
[ 1803.491474] cpuset: domain 0 cpumask ff
[ 1803.492617] cpusets: rebuild ndoms 1
[ 1803.492622] cpuset: domain 0 cpumask ff
[ 1803.493758] cpusets: rebuild ndoms 1
[ 1803.493763] cpuset: domain 0 cpumask ff
[ 1835.135245] cpusets: rebuild ndoms 8
[ 1835.135249] cpuset: domain 0 cpumask 80
[ 1835.135251] cpuset: domain 1 cpumask 40
[ 1835.135253] cpuset: domain 2 cpumask 20
[ 1835.135254] cpuset: domain 3 cpumask 10
[ 1835.135256] cpuset: domain 4 cpumask 08
[ 1835.135259] cpuset: domain 5 cpumask 04
[ 1835.135261] cpuset: domain 6 cpumask 02
[ 1835.135263] cpuset: domain 7 cpumask 01
[ 1835.135279] CPU0 root domain default
[ 1835.135281] CPU0 attaching NULL sched-domain.
[ 1835.135286] CPU1 root domain default
[ 1835.135288] CPU1 attaching NULL sched-domain.
[ 1835.135291] CPU2 root domain default
[ 1835.135294] CPU2 attaching NULL sched-domain.
[ 1835.135297] CPU3 root domain default
[ 1835.135299] CPU3 attaching NULL sched-domain.
[ 1835.135303] CPU4 root domain default
[ 1835.135305] CPU4 attaching NULL sched-domain.
[ 1835.135308] CPU5 root domain default
[ 1835.135311] CPU5 attaching NULL sched-domain.
[ 1835.135314] CPU6 root domain default
[ 1835.135316] CPU6 attaching NULL sched-domain.
[ 1835.135319] CPU7 root domain default
[ 1835.135322] CPU7 attaching NULL sched-domain.
[ 1835.192509] CPU7 root domain ffff88025884a000
[ 1835.192512] CPU7 attaching NULL sched-domain.
[ 1835.192518] CPU6 root domain ffff880258849000
[ 1835.192521] CPU6 attaching NULL sched-domain.
[ 1835.192526] CPU5 root domain ffff880258848800
[ 1835.192530] CPU5 attaching NULL sched-domain.
[ 1835.192536] CPU4 root domain ffff88025884c000
[ 1835.192539] CPU4 attaching NULL sched-domain.
[ 1835.192544] CPU3 root domain ffff88025884c800
[ 1835.192547] CPU3 attaching NULL sched-domain.
[ 1835.192553] CPU2 root domain ffff88025884f000
[ 1835.192556] CPU2 attaching NULL sched-domain.
[ 1835.192561] CPU1 root domain ffff88025884d000
[ 1835.192565] CPU1 attaching NULL sched-domain.
[ 1835.192570] CPU0 root domain ffff88025884b000
[ 1835.192573] CPU0 attaching NULL sched-domain.

Looks perfectly fine too. Notice how each cpu ended up in a different root_domain.

----
Trace 4
$ rmdir par*
$ echo 1 > cpuset.sched_load_balance

This trace looks the same as #2. Again all is fine.

----
Trace 5
$ mkdir par0
$ echo 0-3 > par0/cpuset.cpus
$ echo 0 > cpuset.sched_load_balance

[ 2204.382352] cpusets: rebuild ndoms 1
[ 2204.382358] cpuset: domain 0 cpumask ff
[ 2213.142995] cpusets: rebuild ndoms 1
[ 2213.143000] cpuset: domain 0 cpumask 0f
[ 2213.143005] CPU0 root domain default
[ 2213.143006] CPU0 attaching NULL sched-domain.
[ 2213.143011] CPU1 root domain default
[ 2213.143013] CPU1 attaching NULL sched-domain.
[ 2213.143017] CPU2 root domain default
[ 2213.143021] CPU2 attaching NULL sched-domain.
[ 2213.143026] CPU3 root domain default
[ 2213.143030] CPU3 attaching NULL sched-domain.
[ 2213.143035] CPU4 root domain default
[ 2213.143039] CPU4 attaching NULL sched-domain.
[ 2213.143044] CPU5 root domain default
[ 2213.143048] CPU5 attaching NULL sched-domain.
[ 2213.143053] CPU6 root domain default
[ 2213.143057] CPU6 attaching NULL sched-domain.
[ 2213.143062] CPU7 root domain default
[ 2213.143066] CPU7 attaching NULL sched-domain.
[ 2213.181261] CPU0 root domain ffff8802589eb000
[ 2213.181265] CPU0 attaching sched-domain:
[ 2213.181267] domain 0: span 0-3 level CPU
[ 2213.181275] groups: 0 1 2 3
[ 2213.181293] CPU1 root domain ffff8802589eb000
[ 2213.181297] CPU1 attaching sched-domain:
[ 2213.181302] domain 0: span 0-3 level CPU
[ 2213.181309] groups: 1 2 3 0
[ 2213.181327] CPU2 root domain ffff8802589eb000
[ 2213.181332] CPU2 attaching sched-domain:
[ 2213.181336] domain 0: span 0-3 level CPU
[ 2213.181343] groups: 2 3 0 1
[ 2213.181366] CPU3 root domain ffff8802589eb000
[ 2213.181370] CPU3 attaching sched-domain:
[ 2213.181373] domain 0: span 0-3 level CPU
[ 2213.181384] groups: 3 0 1 2

Looks perfectly fine too. CPU0-3 are in root domain ffff8802589eb000. The rest
are in def_root_domain.

-----
Trace 6
$ mkdir par1
$ echo 4-5 > par1/cpuset.cpus

[ 2752.979008] cpusets: rebuild ndoms 2
[ 2752.979014] cpuset: domain 0 cpumask 30
[ 2752.979016] cpuset: domain 1 cpumask 0f
[ 2752.979024] CPU4 root domain ffff8802589ec800
[ 2752.979028] CPU4 attaching sched-domain:
[ 2752.979032] domain 0: span 4-5 level CPU
[ 2752.979039] groups: 4 5
[ 2752.979052] CPU5 root domain ffff8802589ec800
[ 2752.979056] CPU5 attaching sched-domain:
[ 2752.979060] domain 0: span 4-5 level CPU
[ 2752.979071] groups: 5 4

Looks correct too. CPUs 4 and 5 got added to a new root domain
ffff8802589ec800 and nothing else changed.

-----

So. I think the only action item is for me to update 'syspart' to create a
cpuset for each isolated cpu to avoid putting a bunch of cpus into the default
root domain. Everything else looks perfectly fine.

btw We should probably rename 'root_domain' to something else to avoid
confusion. ie Most people assume that there should be only one root_domain.
Maybe something like 'base_domain' ?

Also we should probably commit those prints that I added and enable them under
SCHED_DEBUG. Right now we're just printing sched_domains and it's not clear
which root_domain they belong to.

Max

2008-11-21 01:54:08

by Gregory Haskins

Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Hi Max,


Max Krasnyansky wrote:
> Here comes a long text with a bunch of traces based on different cpuset
> setups. This is an 8Core dual Xeon (L5410) box. 2.6.27.6 kernel.
> All scenarios assume
> mount -t cgroup -ocpusets /cpusets
> cd /cpusets
>

Thank you for doing this. Comments inline...


> ----
> Trace 1
> $ echo 0 > cpuset.sched_load_balance
>
> [ 1674.811610] cpusets: rebuild ndoms 0
> [ 1674.811627] CPU0 root domain default
> [ 1674.811629] CPU0 attaching NULL sched-domain.
> [ 1674.811633] CPU1 root domain default
> [ 1674.811635] CPU1 attaching NULL sched-domain.
> [ 1674.811638] CPU2 root domain default
> [ 1674.811639] CPU2 attaching NULL sched-domain.
> [ 1674.811642] CPU3 root domain default
> [ 1674.811643] CPU3 attaching NULL sched-domain.
> [ 1674.811646] CPU4 root domain default
> [ 1674.811647] CPU4 attaching NULL sched-domain.
> [ 1674.811649] CPU5 root domain default
> [ 1674.811651] CPU5 attaching NULL sched-domain.
> [ 1674.811653] CPU6 root domain default
> [ 1674.811655] CPU6 attaching NULL sched-domain.
> [ 1674.811657] CPU7 root domain default
> [ 1674.811659] CPU7 attaching NULL sched-domain.
>
> Looks fine.
>

I have to agree. The code is working "as designed" here since I do not
support the sched_load_balance=0 mode yet. While technically not a bug,
a new feature to add support for it would be nice :)

> ----
> Trace 2
> $ echo 1 > cpuset.sched_load_balance
>
> [ 1748.260637] cpusets: rebuild ndoms 1
> [ 1748.260648] cpuset: domain 0 cpumask ff
> [ 1748.260650] CPU0 root domain ffff88025884a000
> [ 1748.260652] CPU0 attaching sched-domain:
> [ 1748.260654] domain 0: span 0-7 level CPU
> [ 1748.260656] groups: 0 1 2 3 4 5 6 7
> [ 1748.260665] CPU1 root domain ffff88025884a000
> [ 1748.260666] CPU1 attaching sched-domain:
> [ 1748.260668] domain 0: span 0-7 level CPU
> [ 1748.260670] groups: 1 2 3 4 5 6 7 0
> [ 1748.260677] CPU2 root domain ffff88025884a000
> [ 1748.260679] CPU2 attaching sched-domain:
> [ 1748.260681] domain 0: span 0-7 level CPU
> [ 1748.260683] groups: 2 3 4 5 6 7 0 1
> [ 1748.260690] CPU3 root domain ffff88025884a000
> [ 1748.260692] CPU3 attaching sched-domain:
> [ 1748.260693] domain 0: span 0-7 level CPU
> [ 1748.260696] groups: 3 4 5 6 7 0 1 2
> [ 1748.260703] CPU4 root domain ffff88025884a000
> [ 1748.260705] CPU4 attaching sched-domain:
> [ 1748.260706] domain 0: span 0-7 level CPU
> [ 1748.260708] groups: 4 5 6 7 0 1 2 3
> [ 1748.260715] CPU5 root domain ffff88025884a000
> [ 1748.260717] CPU5 attaching sched-domain:
> [ 1748.260718] domain 0: span 0-7 level CPU
> [ 1748.260720] groups: 5 6 7 0 1 2 3 4
> [ 1748.260727] CPU6 root domain ffff88025884a000
> [ 1748.260729] CPU6 attaching sched-domain:
> [ 1748.260731] domain 0: span 0-7 level CPU
> [ 1748.260733] groups: 6 7 0 1 2 3 4 5
> [ 1748.260740] CPU7 root domain ffff88025884a000
> [ 1748.260742] CPU7 attaching sched-domain:
> [ 1748.260743] domain 0: span 0-7 level CPU
> [ 1748.260745] groups: 7 0 1 2 3 4 5 6
>
> Looks perfect.
>

Yep.

> ----
> Trace 3
> $ for i in 0 1 2 3 4 5 6 7; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
> $ echo 0 > cpuset.sched_load_balance
>
> [ 1803.485838] cpusets: rebuild ndoms 1
> [ 1803.485843] cpuset: domain 0 cpumask ff
> [ 1803.486953] cpusets: rebuild ndoms 1
> [ 1803.486957] cpuset: domain 0 cpumask ff
> [ 1803.488039] cpusets: rebuild ndoms 1
> [ 1803.488044] cpuset: domain 0 cpumask ff
> [ 1803.489046] cpusets: rebuild ndoms 1
> [ 1803.489056] cpuset: domain 0 cpumask ff
> [ 1803.490306] cpusets: rebuild ndoms 1
> [ 1803.490312] cpuset: domain 0 cpumask ff
> [ 1803.491464] cpusets: rebuild ndoms 1
> [ 1803.491474] cpuset: domain 0 cpumask ff
> [ 1803.492617] cpusets: rebuild ndoms 1
> [ 1803.492622] cpuset: domain 0 cpumask ff
> [ 1803.493758] cpusets: rebuild ndoms 1
> [ 1803.493763] cpuset: domain 0 cpumask ff
> [ 1835.135245] cpusets: rebuild ndoms 8
> [ 1835.135249] cpuset: domain 0 cpumask 80
> [ 1835.135251] cpuset: domain 1 cpumask 40
> [ 1835.135253] cpuset: domain 2 cpumask 20
> [ 1835.135254] cpuset: domain 3 cpumask 10
> [ 1835.135256] cpuset: domain 4 cpumask 08
> [ 1835.135259] cpuset: domain 5 cpumask 04
> [ 1835.135261] cpuset: domain 6 cpumask 02
> [ 1835.135263] cpuset: domain 7 cpumask 01
> [ 1835.135279] CPU0 root domain default
> [ 1835.135281] CPU0 attaching NULL sched-domain.
> [ 1835.135286] CPU1 root domain default
> [ 1835.135288] CPU1 attaching NULL sched-domain.
> [ 1835.135291] CPU2 root domain default
> [ 1835.135294] CPU2 attaching NULL sched-domain.
> [ 1835.135297] CPU3 root domain default
> [ 1835.135299] CPU3 attaching NULL sched-domain.
> [ 1835.135303] CPU4 root domain default
> [ 1835.135305] CPU4 attaching NULL sched-domain.
> [ 1835.135308] CPU5 root domain default
> [ 1835.135311] CPU5 attaching NULL sched-domain.
> [ 1835.135314] CPU6 root domain default
> [ 1835.135316] CPU6 attaching NULL sched-domain.
> [ 1835.135319] CPU7 root domain default
> [ 1835.135322] CPU7 attaching NULL sched-domain.
> [ 1835.192509] CPU7 root domain ffff88025884a000
> [ 1835.192512] CPU7 attaching NULL sched-domain.
> [ 1835.192518] CPU6 root domain ffff880258849000
> [ 1835.192521] CPU6 attaching NULL sched-domain.
> [ 1835.192526] CPU5 root domain ffff880258848800
> [ 1835.192530] CPU5 attaching NULL sched-domain.
> [ 1835.192536] CPU4 root domain ffff88025884c000
> [ 1835.192539] CPU4 attaching NULL sched-domain.
> [ 1835.192544] CPU3 root domain ffff88025884c800
> [ 1835.192547] CPU3 attaching NULL sched-domain.
> [ 1835.192553] CPU2 root domain ffff88025884f000
> [ 1835.192556] CPU2 attaching NULL sched-domain.
> [ 1835.192561] CPU1 root domain ffff88025884d000
> [ 1835.192565] CPU1 attaching NULL sched-domain.
> [ 1835.192570] CPU0 root domain ffff88025884b000
> [ 1835.192573] CPU0 attaching NULL sched-domain.
>
> Looks perfectly fine too. Notice how each cpu ended up in a different root_domain.
>

Yep, I concur. This is how I intended it to work. However, Dimitri
reports that this is not working for him and this is what piqued my
interest and drove the creation of a BZ report.

Dimitri, can you share your cpuset configuration with us, and also
re-run both it and Max's approach (assuming they differ) on your end to
confirm the problem still exists? Max, perhaps you can post the patch
with your debugging instrumentation so we can equally see what happens
on Dimitri's side?
> ----
> Trace 4
> $ rmdir par*
> $ echo 1 > cpuset.sched_load_balance
>
> This trace looks the same as #2. Again all is fine.
>
> ----
> Trace 5
> $ mkdir par0
> $ echo 0-3 > par0/cpuset.cpus
> $ echo 0 > cpuset.sched_load_balance
>
> [ 2204.382352] cpusets: rebuild ndoms 1
> [ 2204.382358] cpuset: domain 0 cpumask ff
> [ 2213.142995] cpusets: rebuild ndoms 1
> [ 2213.143000] cpuset: domain 0 cpumask 0f
> [ 2213.143005] CPU0 root domain default
> [ 2213.143006] CPU0 attaching NULL sched-domain.
> [ 2213.143011] CPU1 root domain default
> [ 2213.143013] CPU1 attaching NULL sched-domain.
> [ 2213.143017] CPU2 root domain default
> [ 2213.143021] CPU2 attaching NULL sched-domain.
> [ 2213.143026] CPU3 root domain default
> [ 2213.143030] CPU3 attaching NULL sched-domain.
> [ 2213.143035] CPU4 root domain default
> [ 2213.143039] CPU4 attaching NULL sched-domain.
> [ 2213.143044] CPU5 root domain default
> [ 2213.143048] CPU5 attaching NULL sched-domain.
> [ 2213.143053] CPU6 root domain default
> [ 2213.143057] CPU6 attaching NULL sched-domain.
> [ 2213.143062] CPU7 root domain default
> [ 2213.143066] CPU7 attaching NULL sched-domain.
> [ 2213.181261] CPU0 root domain ffff8802589eb000
> [ 2213.181265] CPU0 attaching sched-domain:
> [ 2213.181267] domain 0: span 0-3 level CPU
> [ 2213.181275] groups: 0 1 2 3
> [ 2213.181293] CPU1 root domain ffff8802589eb000
> [ 2213.181297] CPU1 attaching sched-domain:
> [ 2213.181302] domain 0: span 0-3 level CPU
> [ 2213.181309] groups: 1 2 3 0
> [ 2213.181327] CPU2 root domain ffff8802589eb000
> [ 2213.181332] CPU2 attaching sched-domain:
> [ 2213.181336] domain 0: span 0-3 level CPU
> [ 2213.181343] groups: 2 3 0 1
> [ 2213.181366] CPU3 root domain ffff8802589eb000
> [ 2213.181370] CPU3 attaching sched-domain:
> [ 2213.181373] domain 0: span 0-3 level CPU
> [ 2213.181384] groups: 3 0 1 2
>
> Looks perfectly fine too. CPU0-3 are in root domain ffff8802589eb000. The rest
> are in def_root_domain.
>
> -----
> Trace 6
> $ mkdir par1
> $ echo 4-5 > par1/cpuset.cpus
>
> [ 2752.979008] cpusets: rebuild ndoms 2
> [ 2752.979014] cpuset: domain 0 cpumask 30
> [ 2752.979016] cpuset: domain 1 cpumask 0f
> [ 2752.979024] CPU4 root domain ffff8802589ec800
> [ 2752.979028] CPU4 attaching sched-domain:
> [ 2752.979032] domain 0: span 4-5 level CPU
> [ 2752.979039] groups: 4 5
> [ 2752.979052] CPU5 root domain ffff8802589ec800
> [ 2752.979056] CPU5 attaching sched-domain:
> [ 2752.979060] domain 0: span 4-5 level CPU
> [ 2752.979071] groups: 5 4
>
> Looks correct too. CPUs 4 and 5 got added to a new root domain
> ffff8802589ec800 and nothing else changed.
>
> -----
>
> So. I think the only action item is for me to update 'syspart' to create a
> cpuset for each isolated cpu to avoid putting a bunch of cpus into the default
> root domain. Everything else looks perfectly fine.
>

I agree. We just need to make sure Dimitri can reproduce these findings
on his side, to rule out something like a different cpuset configuration
causing the problem. If you can, Max, could you also add the rd->span to
the instrumentation, just so we can verify that it is scoped appropriately?

> btw We should probably rename 'root_domain' to something else to avoid
> confusion. ie Most people assume that there should be only one root_romain.
>

Agreed, but that is already true (depending on your perspective ;) I
chose "root-domain" as short for root-sched-domain (meaning the top-most
sched-domain in the hierarchy). There is only one root-domain per
run-queue, but there can be multiple root-domains per system. The former
is how I intended it to be considered, and I think in this context
"root" is appropriate. It is just as every Linux box has a root
filesystem, yet multiple root filesystems can exist on, say, a single
HDD. It's simply a context to govern/scope the rq behavior.
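
To make that per-runqueue relationship concrete, here is a toy user-space
model (not actual kernel code; the structures and the bitmask representation
are simplified stand-ins): each runqueue points at exactly one root domain,
several runqueues may share it, and a system may hold many root domains at
once.

#include <stdio.h>

/* Simplified stand-ins for the kernel structures discussed above. */
struct root_domain { unsigned long span; };            /* bitmask of member CPUs */
struct rq          { int cpu; struct root_domain *rd; };

int main(void)
{
	struct root_domain rd_a = { .span = 0x0ful };   /* CPUs 0-3 */
	struct root_domain rd_b = { .span = 0x30ul };   /* CPUs 4-5 */
	struct rq rqs[6];
	int i;

	for (i = 0; i < 6; i++) {
		rqs[i].cpu = i;
		rqs[i].rd = (i < 4) ? &rd_a : &rd_b;    /* one rd per rq, possibly shared */
	}
	for (i = 0; i < 6; i++)
		printf("cpu%d -> root domain spanning %#lx\n",
		       rqs[i].cpu, rqs[i].rd->span);
	return 0;
}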

Early iterations of my patches had the rd pointer hanging off the top
sched-domain structure, which perhaps reinforced the concept of "root"
and made the reasoning behind the chosen name more apparent. However, I
quickly realized that there was no advantage to walking up the sd
hierarchy to find "root" and thus the rd pointer: you could effectively
hang the pointer on the rq directly for the same result and with less
overhead. So I moved it in the later patches, which were ultimately
accepted.

I don't feel strongly about the name either way, however. So if people
have a name they prefer and the consensus is that it's less confusing, I
am fine with that.

> Also we should probably commit those prints that I added and enable then under
> SCHED_DEBUG. Right now we're just printing sched_domains and it's not clear
> which root_domain they belong to.
>

Yes, please do! (and please add the rd->span as indicated earlier, if
you would be so kind ;)

If Dimitri can reproduce your findings, we can close out the bug as FAD
and create a new-feature request for the sched_load_balance flag. In
the meantime, the workaround is to use per-cpu exclusive cpusets, which
it sounds like your syspart tool can support.

Thanks Max,
-Greg



2008-11-21 20:04:35

by Max Krasnyansky

[permalink] [raw]
Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 827cd9a..b94a6de 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -760,6 +760,16 @@ static void do_rebuild_sched_domains(struct work_struct *unused)
 	ndoms = generate_sched_domains(&doms, &attr);
 	cgroup_unlock();
 
+	printk(KERN_INFO "cpusets: rebuild ndoms %u\n", ndoms);
+	if (doms) {
+		char str[128];
+		int i;
+		for (i = 0; i < ndoms; i++) {
+			cpumask_scnprintf(str, sizeof(str), *(doms + i));
+			printk(KERN_INFO "cpuset: domain %u cpumask %s\n", i, str);
+		}
+	}
+
 	/* Have scheduler rebuild the domains */
 	partition_sched_domains(ndoms, doms, attr);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index ad1962d..7833224 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6647,11 +6647,16 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 	return 0;
 }
 
-static void sched_domain_debug(struct sched_domain *sd, int cpu)
+static void sched_domain_debug(struct root_domain *rd, struct sched_domain *sd, int cpu)
 {
 	cpumask_t *groupmask;
 	int level = 0;
 
+	if (rd == &def_root_domain)
+		printk(KERN_DEBUG "CPU%d root domain default\n", cpu);
+	else
+		printk(KERN_DEBUG "CPU%d root domain %p\n", cpu, rd);
+
 	if (!sd) {
 		printk(KERN_DEBUG "CPU%d attaching NULL sched-domain.\n", cpu);
 		return;
@@ -6676,7 +6681,7 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
 	kfree(groupmask);
 }
 #else /* !CONFIG_SCHED_DEBUG */
-# define sched_domain_debug(sd, cpu) do { } while (0)
+# define sched_domain_debug(rd, sd, cpu) do { } while (0)
 #endif /* CONFIG_SCHED_DEBUG */
 
 static int sd_degenerate(struct sched_domain *sd)
@@ -6819,7 +6824,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 		sd->child = NULL;
 	}
 
-	sched_domain_debug(sd, cpu);
+	sched_domain_debug(rd, sd, cpu);
 
 	rq_attach_root(rq, rd);
 	rcu_assign_pointer(rq->sd, sd);


Attachments:
sched_domain_debug.patch (1.87 kB)
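
For what it's worth, a minimal sketch of the rd->span addition Greg asked
for, on top of the sched_domain_debug() hunk above (hypothetical, not part
of the attached patch; it assumes the 2.6.27-era cpumask API, where rd->span
is a cpumask_t and cpulist_scnprintf() takes the mask by value), could look
like:

	if (rd == &def_root_domain) {
		printk(KERN_DEBUG "CPU%d root domain default\n", cpu);
	} else {
		char span[64];

		cpulist_scnprintf(span, sizeof(span), rd->span);
		printk(KERN_DEBUG "CPU%d root domain %p span %s\n", cpu, rd, span);
	}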

2008-11-21 21:18:18

by Dimitri Sivanich

[permalink] [raw]
Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Hi Greg and Max,

On Fri, Nov 21, 2008 at 12:04:25PM -0800, Max Krasnyansky wrote:
> Hi Greg,
>
> I attached debug instrumentation patch for Dmitri to try. I'll clean it up and
> add things you requested and will resubmit properly some time next week.
>

We added Max's debug patch to our kernel and ran Max's Trace 3 scenario, but we do not see a NULL sched-domain remain attached; see my comments below.


mount -t cgroup cpuset -ocpuset /cpusets/

for i in 0 1 2 3; do mkdir par$i; echo $i > par$i/cpuset.cpus; done

kernel: cpusets: rebuild ndoms 1
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpusets: rebuild ndoms 1
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpusets: rebuild ndoms 1
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpusets: rebuild ndoms 1
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0

echo 0 > cpuset.sched_load_balance
kernel: cpusets: rebuild ndoms 4
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 1 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 2 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 3 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: CPU0 root domain default
kernel: CPU0 attaching NULL sched-domain.
kernel: CPU1 root domain default
kernel: CPU1 attaching NULL sched-domain.
kernel: CPU2 root domain default
kernel: CPU2 attaching NULL sched-domain.
kernel: CPU3 root domain default
kernel: CPU3 attaching NULL sched-domain.
kernel: CPU3 root domain e0000069ecb20000
kernel: CPU3 attaching sched-domain:
kernel: domain 0: span 3 level NODE
kernel: groups: 3
kernel: CPU2 root domain e000006884a00000
kernel: CPU2 attaching sched-domain:
kernel: domain 0: span 2 level NODE
kernel: groups: 2
kernel: CPU1 root domain e000006884a20000
kernel: CPU1 attaching sched-domain:
kernel: domain 0: span 1 level NODE
kernel: groups: 1
kernel: CPU0 root domain e000006884a40000
kernel: CPU0 attaching sched-domain:
kernel: domain 0: span 0 level NODE
kernel: groups: 0


This is the way sched_load_balance is supposed to work: you need to set sched_load_balance=0 for all cpusets containing any cpu you want to disable balancing on; otherwise some balancing will happen.

So in addition to the top (root) cpuset, we need to set it to '0' in the parX cpusets. That will turn off load balancing for the cpus in question (thereby attaching a NULL sched domain). So when we do that for just par3, we get the following:

echo 0 > par3/cpuset.sched_load_balance
kernel: cpusets: rebuild ndoms 3
kernel: cpuset: domain 0 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 1 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: cpuset: domain 2 cpumask
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
0000000,00000000,00000000,00000000,0
kernel: CPU3 root domain default
kernel: CPU3 attaching NULL sched-domain.

So the def_root_domain is now attached for CPU 3. And we do have a NULL sched-domain, which is what we expect for a cpu with load balancing turned off. If we turned sched_load_balance off ('0') on each of the other cpusets (par0-2), each of those cpus would also have a NULL sched-domain attached.

2008-11-22 07:03:54

by Max Krasnyansky

[permalink] [raw]
Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance



Dimitri Sivanich wrote:
> Hi Greg and Max,
>
> On Fri, Nov 21, 2008 at 12:04:25PM -0800, Max Krasnyansky wrote:
>> Hi Greg,
>>
>> I attached debug instrumentation patch for Dmitri to try. I'll clean it up and
>> add things you requested and will resubmit properly some time next week.
>>
>
> We added Max's debug patch to our kernel and have run Max's Trace 3 scenario, but we do not see a NULL sched-domain remain attached, see my comments below.
>
>
> mount -t cgroup cpuset -ocpuset /cpusets/
>
> for i in 0 1 2 3; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
>
> kernel: cpusets: rebuild ndoms 1
> kernel: cpuset: domain 0 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
Oops. I did not realize your NR_CPUS was so large. Unfortunately all your masks
got truncated.
I'll update the patch to print a cpu list instead of the masks.
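
A possible follow-up along those lines, again only a sketch under the same
2.6.27 cpumask API assumptions: swap the cpumask_scnprintf() call in the
cpuset.c hunk for cpulist_scnprintf(), which prints a compact list such as
"0-3,8" and stays readable even with a huge NR_CPUS:

	if (doms) {
		char str[128];
		int i;

		for (i = 0; i < ndoms; i++) {
			/* "0-3,8"-style list instead of a raw bitmask */
			cpulist_scnprintf(str, sizeof(str), *(doms + i));
			printk(KERN_INFO "cpuset: domain %u cpulist %s\n", i, str);
		}
	}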

> echo 0 > cpuset.sched_load_balance
> kernel: cpusets: rebuild ndoms 4
> kernel: cpuset: domain 0 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 1 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 2 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 3 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: CPU0 root domain default
> kernel: CPU0 attaching NULL sched-domain.
> kernel: CPU1 root domain default
> kernel: CPU1 attaching NULL sched-domain.
> kernel: CPU2 root domain default
> kernel: CPU2 attaching NULL sched-domain.
> kernel: CPU3 root domain default
> kernel: CPU3 attaching NULL sched-domain.

> kernel: CPU3 root domain e0000069ecb20000
> kernel: CPU3 attaching sched-domain:
> kernel: domain 0: span 3 level NODE
> kernel: groups: 3
> kernel: CPU2 root domain e000006884a00000
> kernel: CPU2 attaching sched-domain:
> kernel: domain 0: span 2 level NODE
> kernel: groups: 2
> kernel: CPU1 root domain e000006884a20000
> kernel: CPU1 attaching sched-domain:
> kernel: domain 0: span 1 level NODE
> kernel: groups: 1
> kernel: CPU0 root domain e000006884a40000
> kernel: CPU0 attaching sched-domain:
> kernel: domain 0: span 0 level NODE
> kernel: groups: 0
>
> Which is the way sched_load_balance is supposed to work. You need to set
> sched_load_balance=0 for all cpusets containing any cpu you want to disable
> balancing on, otherwise some balancing will happen.
It won't be much of a balancing in this case, because there is just one cpu
per domain.
In other words, no, that's not how it is supposed to work. There is code in
cpu_attach_domain() that is supposed to remove redundant levels (the
sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
btw The reason you got a different result than I did is that you have a
NUMA box whereas mine is UMA. I was able to reproduce the problem, though, by
enabling the multi-core scheduler, in which case I also get one redundant
domain level (CPU) with a single CPU in it.
So we definitely need to fix this. I'll try to poke around tomorrow and figure
out why the redundant level is not dropped.
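
For reference, the single-CPU check being referred to looks roughly like this
in 2.6.27's kernel/sched.c (paraphrased; treat the surrounding details as
approximate):

static int sd_degenerate(struct sched_domain *sd)
{
	/* A domain spanning only one CPU is redundant and should be
	 * dropped by cpu_attach_domain(). */
	if (cpus_weight(sd->span) == 1)
		return 1;

	/* ... further checks on sd->flags and the group list ... */
	return 0;
}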

> So in addition to the top (root) cpuset, we need to set it to '0' in the
> parX cpusets. That will turn off load balancing to the cpus in question
> (thereby attaching a NULL sched domain).
As I explained above we should not have to disable load balancing in cpusets
with a single CPU.

> So when we do that for just par3, we get the following:
> echo 0 > par3/cpuset.sched_load_balance
> kernel: cpusets: rebuild ndoms 3
> kernel: cpuset: domain 0 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 1 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: cpuset: domain 2 cpumask
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
> 0000000,00000000,00000000,00000000,0
> kernel: CPU3 root domain default
> kernel: CPU3 attaching NULL sched-domain.
>
> So the def_root_domain is now attached for CPU 3. And we do have a NULL
> sched-domain, which we expect for a cpu with load balancing turned off. If
> we turn sched_load_balance off ('0') on each of the other cpusets (par0-2),
> each of those cpus would also have a NULL sched-domain attached.
Ok. This one is a bug in cpuset.c:generate_sched_domains(). The sched domain
generator in cpusets should not drop domains with a single cpu in them when
sched_load_balance==0. I'll look at that tomorrow too.

Max

2008-11-22 08:18:47

by Li Zefan

[permalink] [raw]
Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Max Krasnyansky wrote:
>
> Dimitri Sivanich wrote:
>> Hi Greg and Max,
>>
>> On Fri, Nov 21, 2008 at 12:04:25PM -0800, Max Krasnyansky wrote:
>>> Hi Greg,
>>>
>>> I attached debug instrumentation patch for Dmitri to try. I'll clean it up and
>>> add things you requested and will resubmit properly some time next week.
>>>
>> We added Max's debug patch to our kernel and have run Max's Trace 3 scenario, but we do not see a NULL sched-domain remain attached, see my comments below.
>>
>>
>> mount -t cgroup cpuset -ocpuset /cpusets/
>>
>> for i in 0 1 2 3; do mkdir par$i; echo $i > par$i/cpuset.cpus; done
>>
>> kernel: cpusets: rebuild ndoms 1
>> kernel: cpuset: domain 0 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
> Oops. I did not realize your NR_CPUS is so large. Unfortunately all your masks
> got truncated.
> I'll update the patch to print cpu list instead of the masks.
>
>> echo 0 > cpuset.sched_load_balance
>> kernel: cpusets: rebuild ndoms 4
>> kernel: cpuset: domain 0 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 1 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 2 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 3 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: CPU0 root domain default
>> kernel: CPU0 attaching NULL sched-domain.
>> kernel: CPU1 root domain default
>> kernel: CPU1 attaching NULL sched-domain.
>> kernel: CPU2 root domain default
>> kernel: CPU2 attaching NULL sched-domain.
>> kernel: CPU3 root domain default
>> kernel: CPU3 attaching NULL sched-domain.
>
>> kernel: CPU3 root domain e0000069ecb20000
>> kernel: CPU3 attaching sched-domain:
>> kernel: domain 0: span 3 level NODE
>> kernel: groups: 3
>> kernel: CPU2 root domain e000006884a00000
>> kernel: CPU2 attaching sched-domain:
>> kernel: domain 0: span 2 level NODE
>> kernel: groups: 2
>> kernel: CPU1 root domain e000006884a20000
>> kernel: CPU1 attaching sched-domain:
>> kernel: domain 0: span 1 level NODE
>> kernel: groups: 1
>> kernel: CPU0 root domain e000006884a40000
>> kernel: CPU0 attaching sched-domain:
>> kernel: domain 0: span 0 level NODE
>> kernel: groups: 0
>>
>> Which is the way sched_load_balance is supposed to work. You need to set
>> sched_load_balance=0 for all cpusets containing any cpu you want to disable
>> balancing on, otherwise some balancing will happen.
> It won't be much of a balancing in this case because this just one cpu per
> domain.
> In other words no that's not how it supposed to work. There is code in
> cpu_attach_domain() that is supposed to remove redundant levels
> (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
> btw The reason you got a different result that I did is because you have a
> NUMA box where is mine is UMA. I was able to reproduce the problem though by
> enabling multi-core scheduler. In which case I also get one redundant domain
> level CPU, with a single CPU in it.
> So we definitely need to fix this. I'll try to poke around tomorrow and figure
> out why redundant level is not dropped.
>

You were not using the latest kernel, were you?

There was a bug in sd degenerate code, and it has already been fixed:
http://lkml.org/lkml/2008/11/8/10

>> So in addition to the top (root) cpuset, we need to set it to '0' in the
>> parX cpusets. That will turn off load balancing to the cpus in question
>> (thereby attaching a NULL sched domain).
> As I explained above we should not have to disable load balancing in cpusets
> with a single CPU.
>

Yes, and please try the latest kernel. ;)

>> So when we do that for just par3, we get the following:
>> echo 0 > par3/cpuset.sched_load_balance
>> kernel: cpusets: rebuild ndoms 3
>> kernel: cpuset: domain 0 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 1 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: cpuset: domain 2 cpumask
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>> 0000000,00000000,00000000,00000000,0
>> kernel: CPU3 root domain default
>> kernel: CPU3 attaching NULL sched-domain.
>>
>> So the def_root_domain is now attached for CPU 3. And we do have a NULL
>> sched-domain, which we expect for a cpu with load balancing turned off. If
>> we turn sched_load_balance off ('0') on each of the other cpusets (par0-2),
>> each of those cpus would also have a NULL sched-domain attached.
> Ok. This one is a bug in cpuset.c:generate_sched_domains(). Sched domain
> generator in cpusets should not drop domains with single cpu in them when
> sched_load_balance==0. I'll look at that tomorrow too.
>

Do you mean the correct behavior should be as follows?
kernel: cpusets: rebuild ndoms 4

But why do you think this is a bug? In generate_sched_domains(), cpusets with
sched_load_balance==0 will be skipped:

	list_add(&top_cpuset.stack_list, &q);
	while (!list_empty(&q)) {
		...
		if (is_sched_load_balance(cp)) {
			csa[csn++] = cp;
			continue;
		}
		...
	}

Correct me if I misunderstood your point.

2008-11-24 15:11:27

by Dimitri Sivanich

[permalink] [raw]
Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

On Sat, Nov 22, 2008 at 04:18:29PM +0800, Li Zefan wrote:
> Max Krasnyansky wrote:
> >
> > Dimitri Sivanich wrote:
> >>
> >> Which is the way sched_load_balance is supposed to work. You need to set
> >> sched_load_balance=0 for all cpusets containing any cpu you want to disable
> >> balancing on, otherwise some balancing will happen.
> > It won't be much of a balancing in this case because this just one cpu per
> > domain.
> > In other words no that's not how it supposed to work. There is code in
> > cpu_attach_domain() that is supposed to remove redundant levels
> > (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
> > btw The reason you got a different result that I did is because you have a
> > NUMA box where is mine is UMA. I was able to reproduce the problem though by
> > enabling multi-core scheduler. In which case I also get one redundant domain
> > level CPU, with a single CPU in it.
> > So we definitely need to fix this. I'll try to poke around tomorrow and figure
> > out why redundant level is not dropped.
> >
>
> You were not using latest kernel, were you?
>
> There was a bug in sd degenerate code, and it has already been fixed:
> http://lkml.org/lkml/2008/11/8/10

With the above patch applied, we now see the results Max is showing: when sched_load_balance is turned off, individual root domains are created, each spanning just its own cpu.

2008-11-24 21:46:25

by Max Krasnyansky

[permalink] [raw]
Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Li Zefan wrote:
> Max Krasnyansky wrote:
>> Dimitri Sivanich wrote:
>>> kernel: CPU3 root domain e0000069ecb20000
>>> kernel: CPU3 attaching sched-domain:
>>> kernel: domain 0: span 3 level NODE
>>> kernel: groups: 3
>>> kernel: CPU2 root domain e000006884a00000
>>> kernel: CPU2 attaching sched-domain:
>>> kernel: domain 0: span 2 level NODE
>>> kernel: groups: 2
>>> kernel: CPU1 root domain e000006884a20000
>>> kernel: CPU1 attaching sched-domain:
>>> kernel: domain 0: span 1 level NODE
>>> kernel: groups: 1
>>> kernel: CPU0 root domain e000006884a40000
>>> kernel: CPU0 attaching sched-domain:
>>> kernel: domain 0: span 0 level NODE
>>> kernel: groups: 0
>>>
>>> Which is the way sched_load_balance is supposed to work. You need to set
>>> sched_load_balance=0 for all cpusets containing any cpu you want to disable
>>> balancing on, otherwise some balancing will happen.
>> It won't be much of a balancing in this case because this just one cpu per
>> domain.
>> In other words no that's not how it supposed to work. There is code in
>> cpu_attach_domain() that is supposed to remove redundant levels
>> (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
>> btw The reason you got a different result that I did is because you have a
>> NUMA box where is mine is UMA. I was able to reproduce the problem though by
>> enabling multi-core scheduler. In which case I also get one redundant domain
>> level CPU, with a single CPU in it.
>> So we definitely need to fix this. I'll try to poke around tomorrow and figure
>> out why redundant level is not dropped.
>>
>
> You were not using latest kernel, were you?
>
> There was a bug in sd degenerate code, and it has already been fixed:
> http://lkml.org/lkml/2008/11/8/10
Ah, makes sense.
The funny part is that I did see the patch before but completely forgot
about it :).

>>> So when we do that for just par3, we get the following:
>>> echo 0 > par3/cpuset.sched_load_balance
>>> kernel: cpusets: rebuild ndoms 3
>>> kernel: cpuset: domain 0 cpumask
>>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> 0000000,00000000,00000000,00000000,0
>>> kernel: cpuset: domain 1 cpumask
>>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> 0000000,00000000,00000000,00000000,0
>>> kernel: cpuset: domain 2 cpumask
>>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> 0000000,00000000,00000000,00000000,0
>>> kernel: CPU3 root domain default
>>> kernel: CPU3 attaching NULL sched-domain.
>>>
>>> So the def_root_domain is now attached for CPU 3. And we do have a NULL
>>> sched-domain, which we expect for a cpu with load balancing turned off. If
>>> we turn sched_load_balance off ('0') on each of the other cpusets (par0-2),
>>> each of those cpus would also have a NULL sched-domain attached.
>> Ok. This one is a bug in cpuset.c:generate_sched_domains(). Sched domain
>> generator in cpusets should not drop domains with single cpu in them when
>> sched_load_balance==0. I'll look at that tomorrow too.
>>
>
> Do you mean the correct behavior should be as following?
> kernel: cpusets: rebuild ndoms 4
Yes.

> But why do you think this is a bug? In generate_sched_domains(), cpusets with
> sched_load_balance==0 will be skippped:
>
> list_add(&top_cpuset.stack_list, &q);
> while (!list_empty(&q)) {
> ...
> if (is_sched_load_balance(cp)) {
> csa[csn++] = cp;
> continue;
> }
> ...
> }
>
> Correct me if I misunderstood your point.
The problem is that all cpus in cpusets with sched_load_balance==0 end
up in the default root_domain, which causes lock contention.
We can fix it either in sched.c:partition_sched_domains() or in
cpusets.c:generate_sched_domains(). I'd rather fix cpusets, because a
fix in sched.c would be sub-optimal. See my answer to Greg on the same
thread: basically the scheduler code would have to allocate a
root_domain for each CPU even in transitional states. So I'd rather fix
cpusets to generate a domain for each non-overlapping cpuset regardless
of the sched_load_balance flag.
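
One illustrative way to realize that in the loop Li Zefan quoted (an
unverified sketch of the idea, not an actual patch): also keep single-cpu
cpusets, so each such cpu gets its own sched domain, and hence its own
root_domain, instead of collapsing into def_root_domain.

	list_add(&top_cpuset.stack_list, &q);
	while (!list_empty(&q)) {
		...
		if (is_sched_load_balance(cp) ||
		    cpus_weight(cp->cpus_allowed) == 1) {
			/* keep balanced cpusets and single-cpu cpusets */
			csa[csn++] = cp;
			continue;
		}
		...
	}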

Max

2008-11-24 21:47:26

by Max Krasnyansky

[permalink] [raw]
Subject: Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

Dimitri Sivanich wrote:
> On Sat, Nov 22, 2008 at 04:18:29PM +0800, Li Zefan wrote:
>> Max Krasnyansky wrote:
>>> Dimitri Sivanich wrote:
>>>> Which is the way sched_load_balance is supposed to work. You need to set
>>>> sched_load_balance=0 for all cpusets containing any cpu you want to disable
>>>> balancing on, otherwise some balancing will happen.
>>> It won't be much of a balancing in this case because this just one cpu per
>>> domain.
>>> In other words no that's not how it supposed to work. There is code in
>>> cpu_attach_domain() that is supposed to remove redundant levels
>>> (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
>>> btw The reason you got a different result that I did is because you have a
>>> NUMA box where is mine is UMA. I was able to reproduce the problem though by
>>> enabling multi-core scheduler. In which case I also get one redundant domain
>>> level CPU, with a single CPU in it.
>>> So we definitely need to fix this. I'll try to poke around tomorrow and figure
>>> out why redundant level is not dropped.
>>>
>> You were not using latest kernel, were you?
>>
>> There was a bug in sd degenerate code, and it has already been fixed:
>> http://lkml.org/lkml/2008/11/8/10
>
> With the above patch added, we now see the results that Max is
> showing as far as individual root domains being created with a span
> of just their own cpu when sched_load_balance is turned off.

Nice.

Max