2006-10-18 02:45:39

by Suresh Siddha

Subject: exclusive cpusets broken with cpu hotplug

Whenever a cpu hotplug happens, the current kernel calls build_sched_domains()
with cpu_online_map. That will destroy all the domain partitions (done by
partition_sched_domains()) set up so far by exclusive cpusets.

And it's not just cpu hotplug; this happens even if someone changes the multi-core
sched power savings policy.

Would anyone like to fix it up? In the presence of cpusets, we basically
need to traverse all the exclusive sets and set up the sched domains
accordingly.
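
Very roughly, the rebuild would have a shape like this (untested sketch only;
the two cpuset helpers named here do not exist, they just stand in for walking
the exclusive cpusets):

/*
 * Untested sketch, not a patch: instead of unconditionally calling
 * build_sched_domains() over cpu_online_map, walk the exclusive
 * cpusets and build one domain span per exclusive set.  The two
 * cpuset helpers below are hypothetical placeholders.
 */
static void rebuild_sched_domains_from_cpusets(void)
{
	cpumask_t span;

	if (!any_exclusive_cpusets()) {			/* placeholder */
		build_sched_domains(&cpu_online_map);
		return;
	}

	for_each_exclusive_cpuset_span(span) {		/* placeholder */
		cpus_and(span, span, cpu_online_map);
		if (!cpus_empty(span))
			build_sched_domains(&span);
	}
}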

If no one does :( then I will do that when I get some time...

thanks,
suresh


2006-10-18 07:14:39

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

> Would anyone like to fix it up?

Hotplug is not high on my priority list.

I do what I can in my spare time to avoid having cpusets or hotplug
break each other.

Besides, I'm not sure I'd be able to. I've gotten to the point where I am
confident I can make simple changes at the edges, such as mimicking the
sched domain side effects of the cpu_exclusive flag with my new
sched_domain flag. But that's near the current limit of my sched domain
writing skills.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-18 09:57:48

by Robin Holt

Subject: Re: exclusive cpusets broken with cpu hotplug

On Wed, Oct 18, 2006 at 12:14:24AM -0700, Paul Jackson wrote:
> > Would anyone like to fix it up?
>
> Hotplug is not high on my priority list.
>
> I do what I can in my spare time to avoid having cpusets or hotplug
> break each other.
>
> Besides, I'm not sure I'd be able to. I've gotten to the point where I am
> confident I can make simple changes at the edges, such as mimicking the
> sched domain side effects of the cpu_exclusive flag with my new
> sched_domain flag. But that's near the current limit of my sched domain
> writing skills.

Paul and Suresh,

Could this be as simple as a CPU_UP_PREPARE or CPU_DOWN_PREPARE handler
removing all the cpu_exclusive cpuset partitions, and a CPU_UP_CANCELLED,
CPU_DOWN_CANCELLED, CPU_ONLINE, or CPU_DEAD handler going through and
partitioning all the cpu_exclusive cpusets again?
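
Something with roughly this shape, say (illustrative only, not tested; the two
cpuset_* helpers are placeholders for "tear down the partitions" and
"re-partition from the exclusive cpusets", and the notifier constants are
actually spelled CPU_UP_CANCELED and CPU_DOWN_FAILED):

#include <linux/notifier.h>
#include <linux/cpu.h>

/* Sketch of the notifier described above; not real kernel code. */
static int cpuset_cpu_callback(struct notifier_block *nb,
			       unsigned long action, void *hcpu)
{
	switch (action) {
	case CPU_UP_PREPARE:
	case CPU_DOWN_PREPARE:
		cpuset_collapse_sched_domains();	/* placeholder */
		break;
	case CPU_UP_CANCELED:
	case CPU_DOWN_FAILED:
	case CPU_ONLINE:
	case CPU_DEAD:
		cpuset_repartition_sched_domains();	/* placeholder */
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block cpuset_cpu_nb = {
	.notifier_call = cpuset_cpu_callback,
};

/* registered once at init time: register_cpu_notifier(&cpuset_cpu_nb); */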

Thanks,
Robin

2006-10-18 10:10:35

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

Robin wrote:
> Could this be as simple as a CPU_UP_PREPARE or CPU_DOWN_PREPARE handler
> removing all the cpu_exclusive cpuset partitions, and a CPU_UP_CANCELLED,
> CPU_DOWN_CANCELLED, CPU_ONLINE, or CPU_DEAD handler going through and
> partitioning all the cpu_exclusive cpusets again?

Perhaps.

The somewhat related problems, in my book, are:

1) I don't know how to tell what sched domains/groups a system has, nor
how to tell my customers how to see what sched domains they have, and

2) I suspect that Mr. Cpusets doesn't understand sched domains and that
Mr. Sched Domain doesn't understand cpusets, and that we've ended
up with some inscrutable and likely unsuitable interactions between
the two as a result, which in particular don't result in cpusets
driving the sched domain configuration in the desired ways for some
of the less trivial configs.

Well ... at least the first suspicion above is a near certainty ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-18 10:54:18

by Robin Holt

Subject: Re: exclusive cpusets broken with cpu hotplug

On Wed, Oct 18, 2006 at 03:10:21AM -0700, Paul Jackson wrote:
> 2) I suspect that Mr. Cpusets doesn't understand sched domains and that
> Mr. Sched Domain doesn't understand cpusets, and that we've ended
> up with some inscrutable and likely unsuitable interactions between
> the two as a result, which in particular don't result in cpusets
> driving the sched domain configuration in the desired ways for some
> of the less trivial configs.

You do, however, hopefully have enough information to create the
calls you would make to partition_sched_domains() if each had its
cpu_exclusive flag cleared. Essentially, what I am proposing is
making all the calls as if the user had cleared each as the
remove/add starts, and then behaving as if each was set again.

Thanks,
Robin

2006-10-18 12:16:54

by Nick Piggin

Subject: Re: exclusive cpusets broken with cpu hotplug

Paul Jackson wrote:
> Robin wrote:
>
>>Could this be as simple as a CPU_UP_PREPARE or CPU_DOWN_PREPARE handler
>>removing all the cpu_exclusive cpuset partitions, and a CPU_UP_CANCELLED,
>>CPU_DOWN_CANCELLED, CPU_ONLINE, or CPU_DEAD handler going through and
>>partitioning all the cpu_exclusive cpusets again?
>
>
> Perhaps.
>
> The somewhat related problems, in my book, are:
>
> 1) I don't know how to tell what sched domains/groups a system has, nor
> how to tell my customers how to see what sched domains they have, and

I don't know if you want customers to know what domains they have. I think
you should avoid having explicit control over sched-domains in your cpusets
completely, and just have the cpusets create partitioned domains whenever
they can.

>
> 2) I suspect that Mr. Cpusets doesn't understand sched domains and that
> Mr. Sched Domain doesn't understand cpusets, and that we've ended
> up with some inscrutable and likely unsuitable interactions between
> the two as a result, which in particular don't result in cpusets
> driving the sched domain configuration in the desired ways for some
> of the less trivial configs.
>
> Well ... at least the first suspicion above is a near certainty ;).

cpusets is the only thing that messes with sched-domains (excluding the
isolcpus -- that seems to require a small change to partition_sched_domains,
but forget that for now).

And so you should know what partitioning to build at any point when asked.
So we could have a call to cpusets at the end of arch_init_sched_domains,
which asks for the domains to be partitioned, no?

--
SUSE Labs, Novell Inc.

2006-10-18 14:34:54

by Suresh Siddha

Subject: Re: exclusive cpusets broken with cpu hotplug

On Wed, Oct 18, 2006 at 10:16:50PM +1000, Nick Piggin wrote:
> Paul Jackson wrote:
> > 1) I don't know how to tell what sched domains/groups a system has, nor

Paul, at least for debugging, one can find that out by defining SCHED_DOMAIN_DEBUG.

> > how to tell my customers how to see what sched domains they have, and
>
> I don't know if you want customers to know what domains they have. I think

At first glance, I have to agree with Nick. All the customer wants is a
mechanism to say "group these cpus together for scheduling"...

But looking at how cpusets interact with sched-domains, and especially for
large systems, it will probably be useful if we export the topology through /sys.

> cpusets is the only thing that messes with sched-domains (excluding the
> isolcpus -- that seems to require a small change to partition_sched_domains,
> but forget that for now).
>
> And so you should know what partitioning to build at any point when asked.
> So we could have a call to cpusets at the end of arch_init_sched_domains,
> which asks for the domains to be partitioned, no?

yes.

Robin, right now everyone is calling arch_init_sched_domains() with
cpu_online_map. We can remove this argument and, in the presence of cpusets,
this routine can go through the exclusive cpusets and partition the domains
accordingly. Otherwise we can simply build one domain partition with
cpu_online_map.
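
i.e. something of this shape (sketch only, not a patch; the cpuset hook is a
made-up name standing in for the walk over the exclusive cpusets):

/*
 * Sketch of the refactoring described above -- no cpu_map argument;
 * cpusets decide the partitioning, otherwise fall back to one big
 * domain.  The cpuset hook is a placeholder, not an existing function.
 */
static void arch_init_sched_domains(void)
{
	cpumask_t map;

	/* sched domains never span isolated cpus */
	cpus_andnot(map, cpu_online_map, cpu_isolated_map);

#ifdef CONFIG_CPUSETS
	/* returns nonzero if it partitioned the domains along the
	 * exclusive cpusets (intersected with 'map') */
	if (cpuset_partition_sched_domains(&map))
		return;
#endif
	/* no exclusive cpusets: one domain partition over everything */
	build_sched_domains(&map);
}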

thanks,
suresh

2006-10-18 14:51:37

by Nick Piggin

Subject: Re: exclusive cpusets broken with cpu hotplug

Siddha, Suresh B wrote:
> On Wed, Oct 18, 2006 at 10:16:50PM +1000, Nick Piggin wrote:
>
>>Paul Jackson wrote:
>>
>>> 1) I don't know how to tell what sched domains/groups a system has, nor
>
>
> Paul, at least for debugging, one can find that out by defining SCHED_DOMAIN_DEBUG.

Yep. This is meant to be useful precisely for things like making cpusets
partition the domains properly or ensuring a system's topology is built
correctly.

>>> how to tell my customers how to see what sched domains they have, and
>>
>>I don't know if you want customers to know what domains they have. I think
>
>
> At first glance, I have to agree with Nick. All the customer wants is a
> mechanism to say "group these cpus together for scheduling"...
>
> But looking at how cpusets interact with sched-domains, and especially for
> large systems, it will probably be useful if we export the topology through /sys.

I'll concede that point. It would probably be useful for a sysadmin to be
able to look at how they can better make cpuset placements such that they
get the best partitioning.

I would still prefer not to say "use an exclusive domain for this cpuset".
cpusets should be able to do the optimal thing with the data it has, so
this is one less complication to deal with.

--
SUSE Labs, Novell Inc.

2006-10-18 17:55:05

by Dinakar Guniguntala

Subject: Re: exclusive cpusets broken with cpu hotplug

On Tue, Oct 17, 2006 at 07:25:48PM -0700, Siddha, Suresh B wrote:
> Whenever a cpu hotplug happens, the current kernel calls build_sched_domains()
> with cpu_online_map. That will destroy all the domain partitions (done by
> partition_sched_domains()) set up so far by exclusive cpusets.
>
> And it's not just cpu hotplug; this happens even if someone changes the multi-core
> sched power savings policy.
>
> Would anyone like to fix it up? In the presence of cpusets, we basically
> need to traverse all the exclusive sets and set up the sched domains
> accordingly.
>
> If no one does :( then I will do that when I get some time...

Suresh,

I have a patch (though a very old one...) for handling hotplug and cpusets.
However there were some ugly locking issues and nesting of locks that I
ran into and I never got the time to sort them out. Also there didn't
seem to be any users for it and so I had no motivation to further complicate
the cpusets code/sched domains code. However I can dust off the patches if
there is a need.

-Dinakar

2006-10-18 18:06:15

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

Dinakar wrote:
> I have a patch (though a very old one...) for handling hotplug and cpusets.
> However there were some ugly locking issues and nesting of locks ...

The interaction of cpusets and hotplug should be in good shape. Look
in kernel/cpuset.c for CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG,
and you will see the two routines to call for cpu and memory hotplug
events, cpuset_handle_cpuhp() and cpuset_track_online_nodes().

The problem area is the interaction of dynamic sched domains and
cpusets with hot plug events.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-18 21:07:58

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

> You do, however, hopefully have enough information to create the
> calls you would make to partition_sched_domains() if each had its
> cpu_exclusive flag cleared. Essentially, what I am proposing is
> making all the calls as if the user had cleared each as the
> remove/add starts, and then behaving as if each was set again.

Yes - hopefully we have enough information to rebuild the sched domains
each time, consistently. And your proposal is probably an improvement
for that reason.

However, I'm afraid that only solves half the problem. It makes the
sched domains more repeatable and predictable. But I'm worried that
the cpuset control over sched domains is still broken; see the
example below.

I've half a mind to prepare a patch to just rip out the sched domain
defining code from kernel/cpuset.c, completely uncoupling the
cpu_exclusive flag, and any other cpuset flags, from sched domains.

Example:

As best as I can tell (which is not very far ;), if some hapless
user does the following:

/dev/cpuset cpu_exclusive == 1; cpus == 0-7
/dev/cpuset/a cpu_exclusive == 1; cpus == 0-3
/dev/cpuset/b cpu_exclusive == 1; cpus == 4-7

and then runs a big job in the top cpuset (/dev/cpuset), then that
big job will not load balance correctly, with whatever threads
in the big job that got stuck on cpus 0-3 isolated from whatever
threads got stuck on cpus 4-7.

Is this correct?

If so, there is no practical way that I can see on a production system for
the system admin to realize they have messed up their system this way.

If we can't make this work properly automatically, then we either need
to provide users the visibility and control to make it work by explicit
manual control (meaning my 'sched_domain' flag patch, plus some way of
exporting the sched domain topology in /sys), or we need to stop doing
this.

If the above example is not correct, then I'm afraid my education in
sched domains is in need of another lesson.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 05:56:41

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

Earlier today I wrote:
> I've half a mind to prepare a patch to just rip out the sched domain
> defining code from kernel/cpuset.c, completely uncoupling the
> cpu_exclusive flag, and any other cpuset flags, from sched domains.
>
> Example:
>
> As best as I can tell (which is not very far ;), if some hapless
> user does the following:
>
> /dev/cpuset cpu_exclusive == 1; cpus == 0-7
> /dev/cpuset/a cpu_exclusive == 1; cpus == 0-3
> /dev/cpuset/b cpu_exclusive == 1; cpus == 4-7
>
> and then runs a big job in the top cpuset (/dev/cpuset), then that
> big job will not load balance correctly, with whatever threads
> in the big job that got stuck on cpus 0-3 isolated from whatever
> threads got stuck on cpus 4-7.
>
> Is this correct?
>
> If so, there is no practical way that I can see on a production system for
> the system admin to realize they have messed up their system this way.
>
> If we can't make this work properly automatically, then we either need
> to provide users the visibility and control to make it work by explicit
> manual control (meaning my 'sched_domain' flag patch, plus some way of
> exporting the sched domain topology in /sys), or we need to stop doing
> this.


I am now more certain - the above gives an example of serious breakage
with the current mechanism of connecting cpusets to sched domains via
the cpuset cpu_exclusive flag.

We should either fix it (perhaps with my patch to add sched_domain
flags to cpusets, plus a yet-to-be-written patch to make sched domains
visible via /sys or some such place), or we should nuke it.

I am now 90% certain we should nuke the entire mechanism connecting
cpusets to sched domains via the cpu_exclusive flag.

The only useful thing to be done, which is much simpler, is to provide
some way to manipulate the cpu_isolated_map at runtime.

I have a pair of patches ready to ship out that do this.

Coming soon to a mailing list near you ...

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 06:16:22

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

> I don't know if you want customers to know what domains they have. I think
> you should avoid having explicit control over sched-domains in your cpusets
> completely, and just have the cpusets create partitioned domains whenever
> they can.

We have a choice to make. I am increasingly convinced that the
current mechanism linking cpusets with sched domains is busted,
allowing people to easily and unsuspectingly set up broken sched domain
configs, without even being able to see what they are doing.
Certainly that linkage has been confusing to some of us who are
not kernel/sched.c experts. Certainly users on production systems
cannot see what sched domains they have ended up with.

We should either make this linkage explicit and understandable, giving
users direct means to construct sched domains and probe what they have
done, or we should remove this linkage.

My patch to add sched_domain flags to cpusets was an attempt to
make this control explicit.

I am now 90% convinced that this is the wrong direction, and that
the entire chunk of code linking cpu_exclusive cpusets to sched
domains should be nuked.

The one thing I found so far today that people actually needed from
this was that my real time people needed to be able to do something like
marking a cpu isolated. So I think we should have runtime support for
manipulating the cpu_isolated_map.
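
The runtime operation itself is tiny; roughly (sketch only, no locking, and
the rebuild helper is a placeholder -- the real patches have to detach the
cpu's existing sched domain and take the hotplug lock):

/*
 * Sketch only: mark 'cpu' isolated at runtime by adding it to
 * cpu_isolated_map and rebuilding the sched domains without it,
 * so it ends up with no sched domain and no load balancing.
 */
static int isolate_cpu(int cpu)
{
	if (!cpu_online(cpu))
		return -EINVAL;

	cpu_set(cpu, cpu_isolated_map);
	rebuild_sched_domains_minus_isolated();		/* placeholder */
	return 0;
}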

I will be sending in a pair of patches shortly to:
1) nuke the cpu_exclusive - sched_domain linkage, and
2) support runtime marking of isolated cpus.

Does that sound better to you?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 06:35:47

by Nick Piggin

Subject: Re: exclusive cpusets broken with cpu hotplug

Paul Jackson wrote:
>>I don't know if you want customers to know what domains they have. I think
>>you should avoid having explicit control over sched-domains in your cpusets
>>completely, and just have the cpusets create partitioned domains whenever
>>they can.
>
>
> We have a choice to make. I am increasingly convinced that the
> current mechanism linking cpusets with sched domains is busted,
> allowing people to easily and unsuspectingly set up broken sched domain
> configs, without even being able to see what they are doing.
> Certainly that linkage has been confusing to some of us who are
> not kernel/sched.c experts. Certainly users on production systems
> cannot see what sched domains they have ended up with.
>
> We should either make this linkage explicit and understandable, giving
> users direct means to construct sched domains and probe what they have
> done, or we should remove this linkage.
>
> My patch to add sched_domain flags to cpusets was an attempt to
> make this control explicit.
>
> I am now 90% convinced that this is the wrong direction, and that
> the entire chunk of code linking cpu_exclusive cpusets to sched
> domains should be nuked.
>
> The one thing I found so far today that people actually needed from
> this was that my real time people needed to be able to do something like
> marking a cpu isolated. So I think we should have runtime support for
> manipulating the cpu_isolated_map.
>
> I will be sending in a pair of patches shortly to:
> 1) nuke the cpu_exclusive - sched_domain linkage, and
> 2) support runtime marking of isolated cpus.
>
> Does that sound better to you?
>

I don't understand why you think the "implicit" (as in, not directly user
controlled?) linkage is wrong. If it is allowing people to set up busted
domains, then the cpusets code is asking for the wrong partitions.

Having them explicitly control it is wrong because it is really an
implementation detail that could change in the future.

--
SUSE Labs, Novell Inc.

2006-10-19 06:57:59

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

Nick wrote:
> I don't understand why you think the "implicit" (as in, not directly user
> controlled?) linkage is wrong.

Twice now I've given the following specific example. I am not yet
confident that I have it right, and welcome feedback.

However, Suresh has apparently agreed with my conclusion that one
can use the current linkage between cpu_exclusive cpusets and sched
domains to get unexpected and perhaps undesirable sched domain setups.

What's your take on this example:

> Example:
>
> As best as I can tell (which is not very far ;), if some hapless
> user does the following:
>
> /dev/cpuset cpu_exclusive == 1; cpus == 0-7
> /dev/cpuset/a cpu_exclusive == 1; cpus == 0-3
> /dev/cpuset/b cpu_exclusive == 1; cpus == 4-7
>
> and then runs a big job in the top cpuset (/dev/cpuset), then that
> big job will not load balance correctly, with whatever threads
> in the big job that got stuck on cpus 0-3 isolated from whatever
> threads got stuck on cpus 4-7.
>
> Is this correct?

If I have concluded incorrectly what happens in the above example
(good chance) then please educate me on how this stuff works.

I should warn you that I have demonstrated a remarkable resistance
to being educable on this subject ;).

If this interface has no material effect on users' programs, then
implicit may well be ok. But if it has a material effect on the
behaviour, such as CPU placement or scope of load balancing, of user
programs, then I am strongly in favor of making that effect explicit,
understandable, and visible at runtime, on production systems.

That, or getting rid of the effect, and replacing it with something
that is simple, understandable, explicit and visible ... my current
plan.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 07:04:51

by Nick Piggin

Subject: Re: exclusive cpusets broken with cpu hotplug

Paul Jackson wrote:
> Nick wrote:
>
>>I don't understand why you think the "implicit" (as in, not directly user
>>controlled?) linkage is wrong.
>
>
> Twice now I've given the following specific example. I am not yet
> confident that I have it right, and welcome feedback.

Sorry, I skimmed over that.

>
> However, Suresh has apparently agreed with my conclusion that one
> can use the current linkage between cpu_exclusive cpusets and sched
> domains to get unexpected and perhaps undesirable sched domain setups.
>
> What's your take on this example:
>
>
>>Example:
>>
>> As best as I can tell (which is not very far ;), if some hapless
>> user does the following:
>>
>> /dev/cpuset cpu_exclusive == 1; cpus == 0-7
>> /dev/cpuset/a cpu_exclusive == 1; cpus == 0-3
>> /dev/cpuset/b cpu_exclusive == 1; cpus == 4-7
>>
>> and then runs a big job in the top cpuset (/dev/cpuset), then that
>> big job will not load balance correctly, with whatever threads
>> in the big job that got stuck on cpus 0-3 isolated from whatever
>> threads got stuck on cpus 4-7.
>>
>>Is this correct?
>
>
> If I have concluded incorrectly what happens in the above example
> (good chance) then please educate me on how this stuff works.

So that depends on what cpusets asks for. If, when setting up a and
b, it asks to partition the domains, then yes, the parent
cpuset gets broken.

> I should warn you that I have demonstrated a remarkable resistance
> to being educable on this subject ;).

Don't worry about the whole sched-domains implementation; just
consider that partitioning the domains creates a hard partition
among the system's CPUs (but the upshot is that within the partitions,
balancing works pretty nicely).

So in your above example, cpusets should only ask for a partition of
the 0-7 CPUs.

If you wanted to get fancy and detect that there are no jobs in the
root cpuset, then you could make the two smaller partitions, and revert
back to the one bigger one if something gets assigned to it.

But that's all a matter of how you want cpusets to manage it; I really
don't think a user should control this (we simply shouldn't allow
situations where we put a partition in the middle of a cpuset).

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2006-10-19 07:34:13

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

Nick wrote:
> (we simply shouldn't allow
> situations where we put a partition in the middle of a cpuset).

Could you explain to me what you mean by "put a partition in the
middle of a cpuset?"

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 07:33:32

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

> So that depends on what cpusets asks for. If, when setting up a and
> b, it asks to partition the domains, then yes, the parent
> cpuset gets broken.

That probably makes good sense from the sched domain side of things.

It is insanely counterintuitive from the cpuset side of things.

Using hierarchical cpuset properties to drive this is the wrong
approach.

In the general case, looking at it (as best I can) from the sched
domain side of things, it seems that the sched domain could be
defined on a system as follows.

Partition the CPUs on the system - into one or more subsets
(partitions), non-overlapping, and covering.

Each of those partitions can either have a sched domain setup on
it, to support load balancing across the CPUs in that partition,
or can be isolated, with no load balancing occurring within that
partition.

No load balancing occurs across partitions.
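
In code terms, the property being asked for is just this (illustrative check
only, not existing kernel code):

/*
 * Illustrative only: returns true if the given cpumasks are pairwise
 * non-overlapping and together cover cpu_online_map -- i.e. they form
 * a partition in the sense described above.
 */
static int cpumasks_form_partition(const cpumask_t *parts, int nparts)
{
	cpumask_t seen, overlap;
	int i;

	cpus_clear(seen);
	for (i = 0; i < nparts; i++) {
		cpus_and(overlap, seen, parts[i]);
		if (!cpus_empty(overlap))
			return 0;		/* overlap: not a partition */
		cpus_or(seen, seen, parts[i]);
	}
	return cpus_equal(seen, cpu_online_map);	/* covering? */
}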

Using cpu_exclusive cpusets for this is next to impossible. It could
be approximated perhaps by having just the immediate children of the
root cpuset, /dev/cpuset/*, define the partition.

But if any lower level cpusets have any effect on the partitioning,
by setting their cpu_exclusive flag in the current implementation,
it is -always- the case, by the basic structure of the cpuset
hierarchy, that the lower level cpuset is a subset of its parent's
cpus, and that that parent also has cpu_exclusive set.

The resulting partitioning, even in such simple examples as above, is
not obvious. If you look back a couple days, when I first presented
essentially this example, I got the resulting sched domain partitioning
entirely wrong.

The essential detail in my big patch of yesterday, to add new specific
sched_domain flags to cpusets, is that it -removed- the requirement to
mark a parent as defining a sched domain anytime a child defined one.

That requirement is one of the defining properties of the cpu_exclusive
flag, and makes that flag -outrageously- unsuited for defining sched
domain partitions.

My new sched_domain flags at least had the right properties, defaults
and rules, such that they perhaps could have been used to sanely define sched
domain partitions. One could mark a few select cpusets, at any depth
in the hierarchy, as defining sched domain partitions, without being
forced to mark a whole bunch more ancestor cpusets the same way,
slicing and dicing the sched domain partitions into hamburger.

However, fortunately, ... so far as I can tell ... no one needs the
general case described above, of multiple sched domain partitions.

So far as I know, the only essential special case that user land
requires is the ability to isolate one partition (one subset of CPUs)
from any scheduler load balancing. Every CPU is either load
balanced however kernel/sched.c chooses to load balance it,
with potentially every other non-isolated CPU, or is in the isolated
partition (cpu_isolated_map) and not considered for load balancing.

Have I missed any case requiring explicit user intervention?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 08:07:38

by Nick Piggin

Subject: Re: exclusive cpusets broken with cpu hotplug

Paul Jackson wrote:
> Nick wrote:
>
>>(we simply shouldn't allow
>>situations where we put a partition in the middle of a cpuset).
>
>
> Could you explain to me what you mean by "put a partition in the
> middle of a cpuset?"
>

Your example, if a partition is created for each of the sub cpusets.

--
SUSE Labs, Novell Inc.

2006-10-19 08:12:04

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

> Paul Jackson wrote:
> > Nick wrote:
> >
> >>(we simply shouldn't allow
> >>situations where we put a partition in the middle of a cpuset).
> >
> >
> > Could you explain to me what you mean by "put a partition in the
> > middle of a cpuset?"
> >
>
> Your example, if a partition is created for each of the sub cpusets.

The thing "we simply shouldn't allow", then, is the bread and
butter of cpusets.

I am convinced that we are trying to pound nails with toothpicks.

The cpu_exclusive flag was the wrong flag to overload to define
sched domains.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 08:17:06

by Nick Piggin

Subject: Re: exclusive cpusets broken with cpu hotplug

Paul Jackson wrote:
>>So that depends on what cpusets asks for. If, when setting up a and
>>b, it asks to partition the domains, then yes, the parent
>>cpuset gets broken.
>
>
> That probably makes good sense from the sched domain side of things.
>
> It is insanely counterintuitive from the cpuset side of things.
>
> Using hierarchical cpuset properties to drive this is the wrong
> approach.
>
> In the general case, looking at it (as best I can) from the sched
> domain side of things, it seems that the sched domain could be
> defined on a system as follows.
>
> Partition the CPUs on the system - into one or more subsets
> (partitions), non-overlapping, and covering.
>
> Each of those partitions can either have a sched domain setup on
> it, to support load balancing across the CPUs in that partition,
> or can be isolated, with no load balancing occurring within that
> partition.
>
> No load balancing occurs across partitions.

Correct. But you don't have to treat isolated CPUs differently - they
are just the degenerate case of a partition of 1 CPU. I assume cpusets
could create similar "isolated" domains where no balancing takes place.

> Using cpu_exclusive cpusets for this is next to impossible. It could
> be approximated perhaps by having just the immediate children of the
> root cpuset, /dev/cpuset/*, define the partition.

Fine.

> But if any lower level cpusets have any effect on the partitioning,
> by setting their cpu_exclusive flag in the current implementation,
> it is -always- the case, by the basic structure of the cpuset
> hierarchy, that the lower level cpuset is a subset of its parent's
> cpus, and that that parent also has cpu_exclusive set.
>
> The resulting partitioning, even in such simple examples as above, is
> not obvious. If you look back a couple days, when I first presented
> essentially this example, I got the resulting sched domain partitioning
> entirely wrong.
>
> The essential detail in my big patch of yesterday, to add new specific
> sched_domain flags to cpusets, is that it -removed- the requirement to
> mark a parent as defining a sched domain anytime a child defined one.
>
> That requirement is one of the defining properties of the cpu_exclusive
> flag, and makes that flag -outrageously- unsuited for defining sched
> domain partitions.

So make the new rule "cpu_exclusive && direct-child-of-root-cpuset".
Your problems go away, and they haven't been pushed to userspace.

If a user wants to, for some crazy reason, have a set of cpu_exclusive
sets deep in the cpuset hierarchy, such that no load balancing happens
between them... just tell them they can't; they should just make those
cpusets children of the root.
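
In kernel/cpuset.c terms the test would be tiny; sketch (hypothetical function
name, but is_cpu_exclusive(), top_cpuset and cs->parent are the existing
cpuset internals):

/*
 * Sketch of the rule above: only an exclusive cpuset directly under
 * the root would define a sched-domain partition; deeper
 * cpu_exclusive cpusets would be left alone.
 */
static int cpuset_defines_sched_partition(const struct cpuset *cs)
{
	return is_cpu_exclusive(cs) && cs->parent == &top_cpuset;
}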

--
SUSE Labs, Novell Inc.

2006-10-19 08:22:58

by Nick Piggin

Subject: Re: exclusive cpusets broken with cpu hotplug

Paul Jackson wrote:
>>Paul Jackson wrote:
>>
>>>Nick wrote:
>>>
>>>
>>>>(we simply shouldn't allow
>>>>situations where we put a partition in the middle of a cpuset).
>>>
>>>
>>>Could you explain to me what you mean by "put a partition in the
>>>middle of a cpuset?"
>>>
>>
>>Your example, if a partition is created for each of the sub cpusets.
>
>
> The thing "we simply shouldn't allow", then, is the bread and
> butter of cpusets.

No. They can put a cpuset there all they like. But the cpuset code
should *not* put a partition there. That is all.

>
> I am convinced that we are trying to pound nails with toothpicks.
>
> The cpu_exclusive flag was the wrong flag to overload to define
> sched domains.


Well it is the correct flag if we only create the domain for the
oldest ancestor with the cpu_exclusive flag set. From the documentation:

"A cpuset may be marked exclusive, which ensures that no other
cpuset (except direct ancestors and descendents) may contain
any overlapping CPUs or Memory Nodes."

It is this non overlapping property that we can take advantage of, and
partition the scheduler. Obviously, the exception (from the POV of the
oldest ancestor) is its descendents, which can be overlapping. So just
don't create partitions for those guys.

--
SUSE Labs, Novell Inc.

2006-10-19 08:31:46

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

> So make the new rule "cpu_exclusive && direct-child-of-root-cpuset".
> Your problems go away, and they haven't been pushed to userspace.

I don't know of anyone that has need for this feature.

Do you? If you do - good - let's consider them anew.

If such needs arise, I doubt I would recommend meeting them with the
cpu_exclusive flag, in any way shape or form. That would probably not
be a particularly clear and intuitive interface for whatever it was we
needed.

> If a user wants to, for some crazy reason, have a set of cpu_exclusive
> sets deep in the cpuset hierarchy, such that no load balancing happens
> between them... just tell them they can't; they should just make those
> cpusets children of the root.

I have no problem telling users what the limits are on mechanisms.

I have serious problems trying to push mechanisms on them that I
couldn't understand until after repeated attempts over many months,
that are counterintuitive and dangerous (at least unless such odd
rules are imposed) to use, and that provide no useful feedback to the
user as to what they are doing.

It doesn't increase my sympathy for this code that, due to a couple
of serious bugs, it has been my biggest source of customer maintenance
costs in all of the cpuset code.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 08:42:35

by Paul Jackson

Subject: Re: exclusive cpusets broken with cpu hotplug

Nick wrote:
> It is this non overlapping property that we can take advantage of, and
> partition the scheduler.

You want non-overlapping versus all other CPUs on the system.

You want to partition the system's CPUs, in the mathematical sense of
the word 'partition', a non-overlapping cover. Fine. That's an
honorable goal.

But cpu_exclusive gives you non-overlapping versus sibling cpusets.

Wrong tool for the job. Close - sounded right - has that nice long
word 'exclusive' in there somewhere. Wrong one however. It made
good sense to anyone that came at this from the kernel/sched.c side,
as it was obvious to them what was needed. To myself and my cpuset
users, it made no bleeping sense whatsoever.

What actual needs do we have here? Let's figure that out, then if that
leads to adding mechanism of the right shape to fit the needs, fine.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401