2008-11-07 19:24:06

by Nish Aravamudan

Subject: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]

[Adding Max K. and Paul J. to the Cc]

On 11/6/08, Nish Aravamudan <[email protected]> wrote:
> On Tue, Nov 4, 2008 at 6:36 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2008-11-04 at 09:34 -0500, Gregory Haskins wrote:
> >> Gregory Haskins wrote:
> >> > Peter Zijlstra wrote:
> >> >
> >> >> On Mon, 2008-11-03 at 15:07 -0600, Dimitri Sivanich wrote:
> >> >>
> >> >>
> >> >>> When load balancing gets switched off for a set of cpus via the
> >> >>> sched_load_balance flag in cpusets, those cpus wind up with the
> >> >>> globally defined def_root_domain attached. The def_root_domain is
> >> >>> attached when partition_sched_domains calls detach_destroy_domains().
> >> >>> A new root_domain is never allocated or attached as a sched domain
> >> >>> will never be attached by __build_sched_domains() for the non-load
> >> >>> balanced processors.
> >> >>>
> >> >>> The problem with this scenario is that on systems with a large number
> >> >>> of processors with load balancing switched off, we start to see the
> >> >>> cpupri->pri_to_cpu->lock in the def_root_domain becoming contended.
> >> >>> This starts to become much more apparent above 8 waking RT threads
> >> >>> (with each RT thread running on its own cpu, blocking and waking up
> >> >>> continuously).
> >> >>>
> >> >>> I'm wondering if this is, in fact, the way things were meant to work,
> >> >>> or should we have a root domain allocated for each cpu that is not to
> >> >>> be part of a sched domain? Note that the def_root_domain spans all of
> >> >>> the non-load-balanced cpus in this case. Having it attached to cpus
> >> >>> that should not be load balancing doesn't quite make sense to me.
> >> >>>
> >> >>>
> >> >> It shouldn't be like that, each load-balance domain (in your case a
> >> >> single cpu) should get its own root domain. Gregory?
> >> >>
> >> >>
> >> >
> >> > Yeah, this sounds broken. I know that the root-domain code was being
> >> > developed coincident to some upheaval with the cpuset code, so I suspect
> >> > something may have been broken from the original intent. I will take a
> >> > look.
> >> >
> >> > -Greg
> >> >
> >> >
> >>
> >> After thinking about it some more, I am not quite sure what to do here.
> >> The root-domain code was really designed to be 1:1 with a disjoint
> >> cpuset. In this case, it sounds like all the non-balanced cpus are
> >> still in one default cpuset. In that case, the code is correct to place
> >> all those cores in the singleton def_root_domain. The question really
> >> is: How do we support the sched_load_balance flag better?
> >>
> >> I suppose we could go through the scheduler code and have it check that
> >> flag before consulting the root-domain. Another alternative is to have
> >> the sched_load_balance=false flag create a disjoint cpuset. Any thoughts?
> >
> > Hmm, but you cannot disable load-balance on a cpu without placing it in
> > a cpuset first, right?
> >
> > Or are folks disabling load-balance bottom-up, instead of top-down?
> >
> > In that case, I think we should dis-allow that.
>
>
> I don't have a lot of insight into the technical discussion, but will
> say that (if I understand you right), the "bottom-up" approach was
> recommended on LKML by Max K. in the (long) thread from earlier this
> year with Subject "Inquiry: Should we remove "isolcpus= kernel boot
> option? (may have realtime uses)":
>
> "Just to complete the example above. Lets say you want to isolate cpu2
> (assuming that cpusets are already mounted).
>
> # Bring cpu2 offline
> echo 0 > /sys/devices/system/cpu/cpu2/online
>
> # Disable system wide load balancing
> echo 0 > /dev/cpuset/cpuset.sched_load_balance
>
> # Bring cpu2 online
> echo 1 > /sys/devices/system/cpu/cpu2/online
>
> Now if you want to un-isolate cpu2 you do
>
> # Re-enable system wide load balancing
> echo 1 > /dev/cpuset/cpuset.sched_load_balance
>
> Of course this is not a complete isolation. There are also irqs (see my
> "default irq affinity" patch), workqueues and the stop machine. I'm working on
> those too and will release .25 base cpuisol tree when I'm done."
>
> Would you recommend instead, then, that a new cpuset be created with
> only cpu 2 in it (should one set cpuset.cpu_exclusive then?) and that
> load balancing then be disabled in that cpuset?

Perhaps this is not a welcome comment, but I have been wondering this
as I spent some time playing with CPU isolation. Are cpusets the right
interface for system configuration?

It seems to me, and the Documentation agrees, that cpusets are
designed around tasks and constraining, in various ways, what system
resources those tasks have. But they may not have been originally
designed around configuring the system resources themselves at the
system level. Now obviously these constraints will have interactions
with things like CPU hotplug, sched domains, etc. But it does not seem
obvious to me that cpusets *should* be the recommended way to achieve
isolation.

It *almost* makes sense to me to have a separate interface for system
configuration, perhaps in a system filesystem ... say sysfs :) ...
that could be used to indicate a given CPU should be isolated from the
remainder of the system. It could take the form of a file just like
"online", perhaps called "isolated". But rather than go all the way
through the hotplug sequence as writing to "online" does, it just goes
"through the motions" and then brings the CPU back up. In fact, we
could do more than we do with cpusets-based isolation, like removing
workqueues and stop machine. We would have an isolated_map (I guess)
that corresponds to those CPUs with isolated=1 and provide that list
in /sys/devices/system/cpu like the online file.
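
Concretely, I am imagining usage along these lines (purely hypothetical;
neither the "isolated" file nor the isolated list exists today):

# hypothetical: quiesce cpu2 with an offline-style teardown, then bring it
# back up outside the load balancer
echo 1 > /sys/devices/system/cpu/cpu2/isolated

# hypothetical: list of currently isolated CPUs, analogous to "online"
cat /sys/devices/system/cpu/isolated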

Or perhaps it makes more sense to present a filesystem *just* for
system partitioning (partfs?). The root directory would have all the
CPUs (for now, perhaps memory should be there too) and administrators
could create isolated groups of CPUs. But we wouldn't present a
transparent way to assign tasks to isolated CPUs (the tasks file) and
the root directory would automatically lose CPUs placed in its
subdirectories. Perhaps the latter is supported in cpusets by the
cpu_exclusive flag, but let me just say the Documentation is pretty
bad. The only reference to what this flag does:

" - cpu_exclusive flag: is cpu placement exclusive?"

I can't tell exactly what the author means by exclusive here.

This feels like something I read Max K. proposing a while ago, and I'm
sorry if it has already been Nak'd then. It just feels like we're
shoehorning system configuration into cpusets in a way that isn't the
most straightforward, when we have an existing system layout that
should work or could design one that is sane.

Thanks,
Nish


2008-11-19 02:00:15

by Max Krasnyansky

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]

Nish Aravamudan wrote:
> Perhaps this is not a welcome comment, but I have been wondering this
> as I spent some time playing with CPU isolation. Are cpusets the right
> interface for system configuration?
>
> It seems to me, and the Documentation agrees, that cpusets are
> designed around tasks and constraining, in various ways, what system
> resources those tasks have. But they may not have been originally
> designed around configuring the system resources themselves at the
> system level. Now obviously these constraints will have interactions
> with things like CPU hotplug, sched domains, etc. But it does not seem
> obvious to me that cpusets *should* be the recommended way to achieve
> isolation.
>
> It *almost* makes sense to me to have a separate interface for system
> configuration, perhaps in a system filesystem ... say sysfs :) ...
> that could be used to indicate a given CPU should be isolated from the
> remainder of the system. It could take the form of a file just like
> "online", perhaps called "isolated". But rather than go all the way
> through the hotplug sequence as writing to "online" does, it just goes
> "through the motions" and then brings the CPU back up. In fact, we
> could do more than we do with cpusets-based isolation, like removing
> workqueues and stop machine. We would have an isolated_map (I guess)
> that corresponds to those CPUs with isolated=1 and provide that list
> in /sys/devices/system/cpu like the online file.
>
> Or perhaps it makes more sense to present a filesystem *just* for
> system partitioning (partfs?). The root directory would have all the
> CPUs (for now, perhaps memory should be there too) and administrators
> could create isolated groups of CPUs. But we wouldn't present a
> transparent way to assign tasks to isolated CPUs (the tasks file) and
> the root directory would automatically lose CPUs placed in its
> subdirectories. Perhaps the latter is supported in cpusets by the
> cpu_exclusive flag, but let me just say the Documentation is pretty
> bad. The only reference to what this flag does:
>
> " - cpu_exclusive flag: is cpu placement exclusive?"
>
> I can't tell exactly what the author means by exclusive here.
>
> This feels like something I read Max K. proposing a while ago, and I'm
> sorry if it has already been Nak'd then. It just feels like we're
> shoehorning system configuration into cpusets in a way that isn't the
> most straightforward, when we have an existing system layout that
> should work or could design one that is sane.

What you described is almost exactly what I did in my original cpu
isolation patch, which did get NAKed :). Basically I used global
cpu_isolated_map and exposed 'isolated' bit, etc.

I do not see how 'partfs' that you described would be different from
'cpusets' that we have now. Just ignore 'tasks' files in the cpusets and
you already have your 'partfs'. You do _not_ have to use cpuset for
assigning tasks if you do not want to. Just use them to define sets of
cpus and keep all the tasks in the 'root' set. You can then explicitly
pin your threads down with pthread_set_affinity().
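
For example, roughly like this (a sketch only; the set name, cpu numbers and
application are made up, and I'm assuming cpusets are mounted on /dev/cpuset):

# define a cpu partition without moving any tasks into it
mkdir /dev/cpuset/myapp
echo 2-3 > /dev/cpuset/myapp/cpuset.cpus
echo 0 > /dev/cpuset/myapp/cpuset.mems

# leave every task in the root set and pin the ones you care about yourself, e.g.
taskset -c 2 ./my_rt_app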

Max

2008-11-19 02:11:42

by Nish Aravamudan

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]

Max,

[ Removing Paul's bouncing address... ]

On Tue, Nov 18, 2008 at 5:59 PM, Max Krasnyansky <[email protected]> wrote:
> Nish Aravamudan wrote:
>>
>> Perhaps this is not a welcome comment, but I have been wondering this
>> as I spent some time playing with CPU isolation. Are cpusets the right
>> interface for system configuration?
>>
>> It seems to me, and the Documentation agrees, that cpusets are
>> designed around tasks and constraining, in various ways, what system
>> resources those tasks have. But they may not have been originally
>> designed around configuring the system resources themselves at the
>> system level. Now obviously these constraints will have interactions
>> with things like CPU hotplug, sched domains, etc. But it does not seem
>> obvious to me that cpusets *should* be the recommended way to achieve
>> isolation.
>>
>> It *almost* makes sense to me to have a separate interface for system
>> configuration, perhaps in a system filesystem ... say sysfs :) ...
>> that could be used to indicate a given CPU should be isolated from the
>> remainder of the system. It could take the form of a file just like
>> "online", perhaps called "isolated". But rather than go all the way
>> through the hotplug sequence as writing to "online" does, it just goes
>> "through the motions" and then brings the CPU back up. In fact, we
>> could do more than we do with cpusets-based isolation, like removing
>> workqueues and stop machine. We would have an isolated_map (I guess)
>> that corresponds to those CPUs with isolated=1 and provide that list
>> in /sys/devices/system/cpu like the online file.
>>
>> Or perhaps it makes more sense to present a filesystem *just* for
>> system partitioning (partfs?). The root directory would have all the
>> CPUs (for now, perhaps memory should be there too) and administrators
>> could create isolated groups of CPUs. But we wouldn't present a
>> transparent way to assign tasks to isolated CPUs (the tasks file) and
>> the root directory would automatically lose CPUs placed in its
>> subdirectories. Perhaps the latter is supported in cpusets by the
>> cpu_exclusive flag, but let me just say the Documentation is pretty
>> bad. The only reference to what this flag does:
>>
>> " - cpu_exclusive flag: is cpu placement exclusive?"
>>
>> I can't tell exactly what the author means by exclusive here.
>>
>> This feels like something I read Max K. proposing a while ago, and I'm
>> sorry if it has already been Nak'd then. It just feels like we're
>> shoehorning system configuration into cpusets in a way that isn't the
>> most straightforward, when we have an existing system layout that
>> should work or could design one that is sane.
>
> What you described is almost exactly what I did in my original cpu isolation
> patch, which did get NAKed :). Basically I used global cpu_isolated_map and
> exposed 'isolated' bit, etc.

Ok, that was what I vaguely recalled from the discussion, thanks.

> I do not see how 'partfs' that you described would be different from
> 'cpusets' that we have now. Just ignore 'tasks' files in the cpusets and you
> already have your 'partfs'. You do _not_ have to use cpuset for assigning
> tasks if you do not want to. Just use them to define sets of cpus and keep
> all the tasks in the 'root' set. You can then explicitly pin your threads
> down with pthread_set_affinity().

I guess you're right. It still feels a bit kludgy, but that is probably just me.

I have wondered, though, if it makes sense to provide an "isolated"
file in /sys/devices/system/cpu/cpuX/ to do most of the offline
sequence, break sched_domains and remove a CPU from the load balancer
(rather than turning the load balancer off), rather than requiring a
user to explicitly do an offline/online. I guess it can all be rather
transparently masked via a userspace tool, but we don't have a common
one yet.

I do have a question, though: is your recommendation to just turn the
load balancer off in the cpuset you create that has the isolated CPUs?
I guess the conceptual issue I was having was that the root cpuset (I
think) always contains all CPUs and all memory nodes. So even if you
put some CPUs in a cpuset under the root one, and isolate them using
hotplug + disabling the load balancer in that cpuset, those CPUs are
still available to tasks in the root cpuset? Maybe I'm just missing a
step in the configuration, but it seems like as long as the global
(root cpuset) load balancer is on, a CPU can't be guaranteed to stay
isolated?

Thanks,
Nish

2008-11-19 05:14:30

by Max Krasnyansky

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]

Nish Aravamudan wrote:
> On Tue, Nov 18, 2008 at 5:59 PM, Max Krasnyansky <[email protected]> wrote:
>> I do not see how 'partfs' that you described would be different from
>> 'cpusets' that we have now. Just ignore 'tasks' files in the cpusets and you
>> already have your 'partfs'. You do _not_ have to use cpuset for assigning
>> tasks if you do not want to. Just use them to define sets of cpus and keep
>> all the tasks in the 'root' set. You can then explicitly pin your threads
>> down with pthread_set_affinity().
>
> I guess you're right. It still feels a bit kludgy, but that is probably just me.
>
> I have wondered, though, if it makes sense to provide an "isolated"
> file in /sys/devices/system/cpu/cpuX/ to do most of the offline
> sequence, break sched_domains and remove a CPU from the load balancer
> (rather than turning the load balancer off), rather than requiring a
> user to explicitly do an offline/online.
I do not see any benefits in exposing a special 'isolated' bit and having it do
the same thing that the cpu hotplug already does. As I explained in other
threads, cpu hotplug is a _perfect_ fit for isolation purposes. In order to
isolate a CPU dynamically (ie at runtime) we need to flush pending work, flush
caches, move tasks and timers, etc., which is _exactly_ what the cpu hotplug code
does when it brings a CPU down. There is no point in reimplementing it.

Btw, it sounds like you misunderstood the meaning of the
cpuset.sched_load_balance flag. It does not really turn the load balancer
off; it simply causes cpus in different cpusets to be put into separate sched
domains. In other words it already does exactly what you're asking for.

> I guess it can all be rather
> transparently masked via a userspace tool, but we don't have a common
> one yet.
I do :). It's called 'syspart'
http://git.kernel.org/?p=linux/kernel/git/maxk/syspart.git;a=summary
I'll push an updated version in a couple of days.

> I do have a question, though: is your recommendation to just turn the
> load balancer off in the cpuset you create that has the isolated CPUs?
> I guess the conceptual issue I was having was that the root cpuset (I
> think) always contains all CPUs and all memory nodes. So even if you
> put some CPUs in a cpuset under the root one, and isolate them using
> hotplug + disabling the load balancer in that cpuset, those CPUs are
> still available to tasks in the root cpuset? Maybe I'm just missing a
> step in the configuration, but it seems like as long as the global
> (root cpuset) load balancer is on, a CPU can't be guaranteed to stay
> isolated?
Take a look at what 'syspart' does. In short yes, of course we need to set
sched_load_balance flag in root cpuset to 0.
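
Roughly, the kind of setup that amounts to is (sketch only; the set name and
cpu numbers are made up):

# stop balancing across the whole machine (root cpuset)
echo 0 > /dev/cpuset/cpuset.sched_load_balance

# give cpus 2-3 their own set, and therefore their own sched domain
mkdir /dev/cpuset/rt
echo 2-3 > /dev/cpuset/rt/cpuset.cpus
echo 0 > /dev/cpuset/rt/cpuset.mems
echo 1 > /dev/cpuset/rt/cpuset.sched_load_balance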

Max




2008-11-19 12:27:25

by Gregory Haskins

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]

Max Krasnyansky wrote:
> Nish Aravamudan wrote:
>
>> On Tue, Nov 18, 2008 at 5:59 PM, Max Krasnyansky <[email protected]> wrote:
>>
>>> I do not see how 'partfs' that you described would be different from
>>> 'cpusets' that we have now. Just ignore 'tasks' files in the cpusets and you
>>> already have your 'partfs'. You do _not_ have to use cpuset for assigning
>>> tasks if you do not want to. Just use them to define sets of cpus and keep
>>> all the tasks in the 'root' set. You can then explicitly pin your threads
>>> down with pthread_set_affinity().
>>>
>> I guess you're right. It still feels a bit kludgy, but that is probably just me.
>>
>> I have wondered, though, if it makes sense to provide an "isolated"
>> file in /sys/devices/system/cpu/cpuX/ to do most of the offline
>> sequence, break sched_domains and remove a CPU from the load balancer
>> (rather than turning the load balancer off), rather than requiring a
>> user to explicitly do an offline/online.
>>
> I do not see any benefits in exposing a special 'isolated' bit and having it do
> the same thing that the cpu hotplug already does. As I explained in other
> threads, cpu hotplug is a _perfect_ fit for isolation purposes. In order to
> isolate a CPU dynamically (ie at runtime) we need to flush pending work, flush
> caches, move tasks and timers, etc., which is _exactly_ what the cpu hotplug code
> does when it brings a CPU down. There is no point in reimplementing it.
>
> Btw, it sounds like you misunderstood the meaning of the
> cpuset.sched_load_balance flag. It does not really turn the load balancer
> off; it simply causes cpus in different cpusets to be put into separate sched
> domains. In other words it already does exactly what you're asking for.
>

On a related note, please be advised I have a bug in this area:

http://bugzilla.kernel.org/show_bug.cgi?id=12054

-Greg


>
>> I guess it can all be rather
>> transparently masked via a userspace tool, but we don't have a common
>> one yet.
>>
> I do :). It's called 'syspart'
> http://git.kernel.org/?p=linux/kernel/git/maxk/syspart.git;a=summary
> I'll push an updated version in a couple of days.
>
>
>> I do have a question, though: is your recommendation to just turn the
>> load balancer off in the cpuset you create that has the isolated CPUs?
>> I guess the conceptual issue I was having was that the root cpuset (I
>> think) always contains all CPUs and all memory nodes. So even if you
>> put some CPUs in a cpuset under the root one, and isolate them using
>> hotplug + disabling the load balancer in that cpuset, those CPUs are
>> still available to tasks in the root cpuset? Maybe I'm just missing a
>> step in the configuration, but it seems like as long as the global
>> (root cpuset) load balancer is on, a CPU can't be guaranteed to stay
>> isolated?
>>
> Take a look at what 'syspart' does. In short yes, of course we need to set
> sched_load_balance flag in root cpuset to 0.
>
> Max




2008-11-19 12:51:58

by Ingo Molnar

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]


* Max Krasnyansky <[email protected]> wrote:

> What you described is almost exactly what I did in my original cpu
> isolation patch, which did get NAKed :). Basically I used global
> cpu_isolated_map and exposed 'isolated' bit, etc.

Please extend cpusets according to the plan outlined by PeterZ a few
months ago - that's the right place to do partitioning.

Ingo

2008-11-19 16:28:25

by Max Krasnyansky

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]



Gregory Haskins wrote:
> Max Krasnyansky wrote:
>> Nish Aravamudan wrote:
>>
>>> On Tue, Nov 18, 2008 at 5:59 PM, Max Krasnyansky <[email protected]> wrote:
>>>
>>>> I do not see how 'partfs' that you described would be different from
>>>> 'cpusets' that we have now. Just ignore 'tasks' files in the cpusets and you
>>>> already have your 'partfs'. You do _not_ have to use cpuset for assigning
>>>> tasks if you do not want to. Just use them to define sets of cpus and keep
>>>> all the tasks in the 'root' set. You can then explicitly pin your threads
>>>> down with pthread_set_affinity().
>>>>
>>> I guess you're right. It still feels a bit kludgy, but that is probably just me.
>>>
>>> I have wondered, though, if it makes sense to provide an "isolated"
>>> file in /sys/devices/system/cpu/cpuX/ to do most of the offline
>>> sequence, break sched_domains and remove a CPU from the load balancer
>>> (rather than turning the load balancer off), rather than requiring a
>>> user to explicitly do an offline/online.
>>>
>> I do not see any benefits in exposing a special 'isolated' bit and having it do
>> the same thing that the cpu hotplug already does. As I explained in other
>> threads, cpu hotplug is a _perfect_ fit for isolation purposes. In order to
>> isolate a CPU dynamically (ie at runtime) we need to flush pending work, flush
>> caches, move tasks and timers, etc., which is _exactly_ what the cpu hotplug code
>> does when it brings a CPU down. There is no point in reimplementing it.
>>
>> Btw, it sounds like you misunderstood the meaning of the
>> cpuset.sched_load_balance flag. It does not really turn the load balancer
>> off; it simply causes cpus in different cpusets to be put into separate sched
>> domains. In other words it already does exactly what you're asking for.
>>
>
> On a related note, please be advised I have a bug in this area:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=12054

Yes, I saw the original thread on this. I'll reply in there.

Max

2008-11-19 16:32:15

by Max Krasnyansky

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]



Ingo Molnar wrote:
> * Max Krasnyansky <[email protected]> wrote:
>
>> What you described is almost exactly what I did in my original cpu
>> isolation patch, which did get NAKed :). Basically I used global
>> cpu_isolated_map and exposed 'isolated' bit, etc.
>
> Please extend cpusets according to the plan outlined by PeterZ a few
> months ago - that's the right place to do partitioning.

Already did. It's all in mainline. The part you quoted was just pointing out
that the original approach was not correct.

Max

2008-11-19 17:45:20

by Ingo Molnar

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]


* Max Krasnyansky <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Max Krasnyansky <[email protected]> wrote:
> >
> >> What you described is almost exactly what I did in my original
> >> cpu isolation patch, which did get NAKed :). Basically I used
> >> global cpu_isolated_map and exposed 'isolated' bit, etc.
> >
> > Please extend cpusets according to the plan outlined by PeterZ a
> > few months ago - that's the right place to do partitioning.
>
> Already did. It's all in mainline. The part you quoted was just
> pointing out that the original approach was not correct.

Yeah, we have bits of it (i merged them, and i still remember them ;-)
- but we still dont have the "system set" concept suggested by Peter
though. We could go further and make it really easy to partition all
scheduling and irq aspects of the system via cpusets.

Ingo

2008-11-19 20:01:58

by Max Krasnyansky

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]

Ingo Molnar wrote:
> * Max Krasnyansky <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>> * Max Krasnyansky <[email protected]> wrote:
>>>
>>>> What you described is almost exactly what I did in my original
>>>> cpu isolation patch, which did get NAKed :). Basically I used
>>>> global cpu_isolated_map and exposed 'isolated' bit, etc.
>>> Please extend cpusets according to the plan outlined by PeterZ a
>>> few months ago - that's the right place to do partitioning.
>> Already did. It's all in mainline. The part you quoted was just
>> pointing out that the original approach was not correct.
>
> Yeah, we have bits of it (i merged them, and i still remember them ;-)
> - but we still dont have the "system set" concept suggested by Peter
> though. We could go further and make it really easy to partition all
> scheduling and irq aspects of the system via cpusets.

Actually we (or maybe just me) gave up on those for now. We went back and forth
on the 'system set' and what it was supposed to mean. Both Paul J. and Paul M.
were against the concept, especially because of backward compatibility with
existing user-space tools that use cpusets. Plus it's really, really easy to set
up the 'system' set from user-space, and I just ended up writing the 'syspart'
thing that I mentioned before.

A similar thing happened to the "managing irqs via cpusets" idea. Peter and I
wanted to represent them as tasks, and Paul J. was very vocally :) opposed to it.
What we settled on was that we would manage irqs via /proc for now. I added a
notion of 'default' irq affinity (ie /proc/irq/default_smp_affinity), which is
already in mainline.
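
For instance (the values are hex cpu masks and purely illustrative):

# steer newly requested irqs to cpus 0-1 only
echo 3 > /proc/irq/default_smp_affinity

# and move an already-allocated irq, say irq 20, off the isolated cpus
echo 3 > /proc/irq/20/smp_affinity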

We can probably revisit irq management and 'system' cpuset again. At this
point I'm swamped with other stuff though.

Max

2008-11-19 22:12:16

by Nish Aravamudan

Subject: Re: Using cpusets for configuration/isolation [Was Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance]

Max,

On Tue, Nov 18, 2008 at 9:14 PM, Max Krasnyansky <[email protected]> wrote:
> Nish Aravamudan wrote:
>> On Tue, Nov 18, 2008 at 5:59 PM, Max Krasnyansky <[email protected]> wrote:
>>> I do not see how 'partfs' that you described would be different from
>>> 'cpusets' that we have now. Just ignore 'tasks' files in the cpusets and you
>>> already have your 'partfs'. You do _not_ have to use cpuset for assigning
>>> tasks if you do not want to. Just use them to define sets of cpus and keep
>>> all the tasks in the 'root' set. You can then explicitly pin your threads
>>> down with pthread_set_affinity().
>>
>> I guess you're right. It still feels a bit kludgy, but that is probably just me.
>>
>> I have wondered, though, if it makes sense to provide an "isolated"
>> file in /sys/devices/system/cpu/cpuX/ to do most of the offline
>> sequence, break sched_domains and remove a CPU from the load balancer
>> (rather than turning the load balancer off), rather than requiring a
>> user to explicitly do an offline/online.
> I do not see any benefits in exposing a special 'isolated' bit and having it do
> the same thing that the cpu hotplug already does. As I explained in other
> threads, cpu hotplug is a _perfect_ fit for isolation purposes. In order to
> isolate a CPU dynamically (ie at runtime) we need to flush pending work, flush
> caches, move tasks and timers, etc., which is _exactly_ what the cpu hotplug code
> does when it brings a CPU down. There is no point in reimplementing it.

Ok, I guess I was just thinking of making the administrator's intent a bit
more explicit. But using syspart or even a simple script, it's easy enough to
alias the offline/online sequence, e.g. with something like the sketch below.
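
Something like this, I suppose (an untested sketch, assuming cpusets are
mounted on /dev/cpuset):

isolate_cpu() {             # usage: isolate_cpu <cpu number>
        echo 0 > /sys/devices/system/cpu/cpu$1/online
        echo 0 > /dev/cpuset/cpuset.sched_load_balance
        echo 1 > /sys/devices/system/cpu/cpu$1/online
}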

> Btw, it sounds like you misunderstood the meaning of the
> cpuset.sched_load_balance flag. It does not really turn the load balancer
> off; it simply causes cpus in different cpusets to be put into separate sched
> domains. In other words it already does exactly what you're asking for.

Ok, I'm re-reading the cpusets.txt section. Sorry for my confusion and
thanks for the clarification.

>> I guess it can all be rather
>> transparently masked via a userspace tool, but we don't have a common
>> one yet.
> I do :). It's called 'syspart'
> http://git.kernel.org/?p=linux/kernel/git/maxk/syspart.git;a=summary
> I'll push an updated version in a couple of days.

Has there been any effort to start driving this into distributions?

>> I do have a question, though: is your recommendation to just turn the
>> load balancer off in the cpuset you create that has the isolated CPUs?
>> I guess the conceptual issue I was having was that the root cpuset (I
>> think) always contains all CPUs and all memory nodes. So even if you
>> put some CPUs in a cpuset under the root one, and isolate them using
>> hotplug + disabling the load balancer in that cpuset, those CPUs are
>> still available to tasks in the root cpuset? Maybe I'm just missing a
>> step in the configuration, but it seems like as long as the global
>> (root cpuset) load balancer is on, a CPU can't be guaranteed to stay
>> isolated?
> Take a look at what 'syspart' does. In short yes, of course we need to set
> sched_load_balance flag in root cpuset to 0.

Will do, thanks,
Nish