2005-05-01 18:59:28

by Dinakar Guniguntala

Subject: [RFC PATCH] Dynamic sched domains (v0.5)


Ok, here is the minimal patchset that I had promised after the
last discussion.

What it does have
o The current patch enhances the meaning of exclusive cpusets by
attaching them (exclusive cpusets) to sched domains
o It does _not_ require any additional cpumask_t variable. It
just parses the cpus_allowed of the parent/sibling/children
cpusets for manipulating sched domains
o All existing operations on non-/exclusive cpusets are preserved as-is.
o The sched code has been modified to bring it up to 2.6.12-rc2-mm3

Usage
o On setting the cpu_exclusive flag of a cpuset X, it creates two
sched domains
a. One, All cpus from X's parent cpuset that don't belong to any
exclusive sibling cpuset of X
b. Two, All cpus in X's cpus_allowed
o On adding/deleting cpus to/from an exclusive cpuset X that has exclusive
children, it creates two sched domains
a. One, All cpus from X's parent cpuset that don't belong to any
exclusive sibling cpuset of X
b. Two, All cpus in X's cpus_allowed, after taking away any cpus that
belong to exclusive child cpusets of X
o On unsetting the cpu_exclusive flag of cpuset X or rmdir X, it creates a
single sched domain, containing all cpus from X's parent cpuset that
don't belong to any exclusive sibling of X and the cpus of X
o It does _not_ modify the cpus_allowed variable of the parent as in the
previous version. It relies on user space to move tasks to proper
cpusets for complete isolation/exclusion
o See function update_cpu_domains for more details
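
As a worked illustration of the rules above (hypothetical CPU numbers): if
X's parent owns CPUs 0-7 and X is an exclusive child holding CPUs 4-7,
setting X's cpu_exclusive flag yields domain (a) = {0-3} and domain
(b) = {4-7}. A minimal sketch of the span computation, using the 2.6-era
cpumask API; the parent/x variable names are assumed here, not taken from
the patch:

	cpumask_t span1, span2;
	struct cpuset *c;

	/*
	 * Domain (a): the parent's cpus minus those of every exclusive
	 * child of the parent (X itself included, once its flag is set).
	 */
	span1 = parent->cpus_allowed;           /* CPUs 0-7 in the example */
	list_for_each_entry(c, &parent->children, sibling)
		if (is_cpu_exclusive(c))
			cpus_andnot(span1, span1, c->cpus_allowed);
	/* span1 is now CPUs 0-3: domain (a) */

	/* Domain (b): all cpus in X's cpus_allowed */
	span2 = x->cpus_allowed;                /* CPUs 4-7: domain (b) */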

What it does not have
o It is still short on documentation
o Does not have hotplug support as yet
o Supports only x86 as of now
o No thoughts on "memory domains" (though I am not sure who
would use such a thing or what exactly the requirements are)


Here is the increase in binary sizes for different values of CONFIG_NR_CPUS
(32 and 255)

kernel/cpuset.o.orig-32 16728
kernel/cpuset.o-v0.5-32 17176 (~2.7% increase)

kernel/cpuset.o.orig-255 17916
kernel/cpuset.o-v0.5-255 19120 (~6.7% increase)

The code changes are also minimal

include/linux/init.h | 2
include/linux/sched.h | 1
kernel/cpuset.c | 73 +++++++++++++++++++++++++++------
kernel/sched.c | 109 +++++++++++++++++++++++++++++++++-----------------
4 files changed, 134 insertions(+), 51 deletions(-)


-Dinakar



Attachments:
dyn-sd-v0.5-1.patch (6.24 kB)
dyn-sd-v0.5-2.patch (3.20 kB)

2005-05-02 09:10:44

by Nick Piggin

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Dinakar Guniguntala wrote:
> Ok, here is the minimal patchset that I had promised after the
> last discussion.
>

The sched-domains part of it (kernel/sched.c) is looking much
better now. Haven't had a really good look, but it is definitely
on the right track now. Well done.

As I said before, I am not expert on the cpusets side of things,
but as far as sched-domains partitioning goes, we really just
want the absolute minimum support in kernel/sched.c which can be
managed by a higher layer.

I'll review it in detail when it gets into -mm, but I don't expect
to find any major problems.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2005-05-02 09:44:17

by Nick Piggin

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Dinakar Guniguntala wrote:

> +void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
> +{
> +        cpumask_t change_map;
> +
> +        cpus_or(change_map, span1, span2);
> +
> +        preempt_disable();

Oh, you can't do this here, attach_domains does a synchronize_kernel.
So take it out, it doesn't do anything anyway, does it?

I suggest you also use some sort of locking to prevent concurrent rebuilds
and rebuilds racing with cpu hotplug. You could probably have a static
semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here too.

Or alternatively take the semaphore in the cpu hotplug notifier as well...
Maybe both - neither are performance critical code.
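
A minimal sketch of the serialization being suggested (2.6-era semaphore
API; the lock name here is made up):

	static DECLARE_MUTEX(rebuild_sem);      /* hypothetical name */

	void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
	{
		down(&rebuild_sem);     /* exclude concurrent rebuilds */
		lock_cpu_hotplug();     /* exclude racing cpu hotplug  */

		/* ... detach and re-attach domains for span1/span2 ... */

		unlock_cpu_hotplug();
		up(&rebuild_sem);
	}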

--
SUSE Labs, Novell Inc.

2005-05-02 17:06:33

by Dinakar Guniguntala

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
> Dinakar Guniguntala wrote:
>
> >+void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
> >+{
> >+        cpumask_t change_map;
> >+
> >+        cpus_or(change_map, span1, span2);
> >+
> >+        preempt_disable();
>
> Oh, you can't do this here, attach_domains does a synchronize_kernel.
> So take it out, it doesn't do anything anyway, does it?

I put that in to prevent hangs with CONFIG_PREEMPT turned on, but
clearly didn't test it with preempt enabled. Looks like all I need to
do here is a local_irq_disable

>
> I suggest you also use some sort of locking to prevent concurrent rebuilds
> and rebuilds racing with cpu hotplug. You could probably have a static
> semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here too.

I already do a lock_cpu_hotplug() in cpuset.c before calling
rebuild_sched_domains and also am holding cpuset_sem, so that should take
care of both hotplug and concurrent rebuilds
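
For reference, the calling pattern described would look roughly like this
(a sketch using the names from the patch; cpuset_sem is taken higher up
in cpuset.c):

	down(&cpuset_sem);      /* excludes concurrent rebuilds */
	/* ... compute span1/span2 from the cpuset hierarchy ... */
	lock_cpu_hotplug();     /* excludes racing cpu hotplug  */
	rebuild_sched_domains(span1, span2);
	unlock_cpu_hotplug();
	/* ... */
	up(&cpuset_sem);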

-Dinakar

2005-05-02 17:10:40

by Dinakar Guniguntala

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

On Mon, May 02, 2005 at 07:10:35PM +1000, Nick Piggin wrote:
> Dinakar Guniguntala wrote:
> >Ok, here is the minimal patchset that I had promised after the
> >last discussion.
> >
>
> The sched-domains part of it (kernel/sched.c) is looking much
> better now. Haven't had a really good look, but it is definitely
> on the right track now. Well done.
>
Thank you for reviewing!

-Dinakar

2005-05-02 18:02:36

by Paul Jackson

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Thanks for the minimalist patch. I'll review it in more detail when I
get a chance. It looks promising, more promising than before.


My current concerns include:
o Having it work on ia64 would facilitate my testing.
o _Still_ no clear statement of the requirements - see below.
o Does this patch ensure that isolated sched domains form
a partition (disjoint cover) of a system's CPUs? Should it?
o Does this change any documented semantics of cpusets? I don't
see offhand that it does. Perhaps that's good. Perhaps
I missed something.


I am still in the frustrating position of playing guessing games with
what your essential requirements are, the one or two features that you
require. The closest we got was a six-step list, which was really more
like pseudocode for a prior implementation.

I see nothing on first glance in your latest patch that responds to my
previous attempt to ask this. I don't see how I can improve on the
wording of that question, so I repeat myself ...

On April 25, pj wrote:
======================

A few days ago, you provided a six-step list, under the introduction:
> Ok, Let me begin at the beginning and attempt to define what I am
> doing here

I suspect those six steps were not really your essential requirements,
but one possible procedure that accomplishes them.

So far I am guessing that your requirement(s) are one or both of the
following two items:

(1) avoid having the scheduler waste too much time trying to
load balance tasks that turn out not to be allowed on
the cpus the scheduler is considering, and/or
(2) provide improved administrative control of a system by being
able to construct a cpuset that is guaranteed to have no
overlap of allowed cpus with its parent or any other cpuset
not descended from it.

If (1) is one of your essential requirements, then I have described a
couple of implementations that mark some existing cpusets to form a
partition (in the mathematical sense of a disjoint covering of subsets)
of the system to define isolated scheduler domains. I did this without
adding any additional bitmasks to each cpuset.

If (2) is one of your essential requirements, then I believe this can be
done with the current cpuset kernel code, entirely with additional user
level code.

> I am working on a minimalistic design right now and will get back in
> a day or two

Good.

Hopefully, you will also be able to get through my thick head what your
essential requirement(s) is or are.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-02 23:23:55

by Nick Piggin

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Dinakar Guniguntala wrote:
> On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
>
>>Dinakar Guniguntala wrote:
>>
>>
>>>+void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
>>>+{
>>>+        cpumask_t change_map;
>>>+
>>>+        cpus_or(change_map, span1, span2);
>>>+
>>>+        preempt_disable();
>>
>>Oh, you can't do this here, attach_domains does a synchronize_kernel.
>>So take it out, it doesn't do anything anyway, does it?
>
>
> I put that in to prevent hangs with CONFIG_PREEMPT turned on, but
> clearly didn't test it with preempt enabled. Looks like all I need to
> do here is a local_irq_disable
>

What are you protecting against, though? synchronize_kernel can
sleep, so local_irq_disable is probably the wrong thing to do as well.

AFAIKS, you don't need anything here - so long as you have mutual
exclusion from other sched-domain building then this can take as long
as it wants / be preempted as many times as we like.

>
>>I suggest you also use some sort of locking to prevent concurrent rebuilds
>>and rebuilds racing with cpu hotplug. You could probably have a static
>>semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here too.
>
>
> I already do a lock_cpu_hotplug() in cpuset.c before calling
> rebuild_sched_domains and also am holding cpuset_sem, so that should take
> care of both hotplug and concurrent rebuilds
>

OK.

But if we want this to be a respectable interface (possibly for more than
just cpusets) then it should probably do some locking itself. It isn't
performance critical, so I think taking a semaphore wouldn't hurt.

--
SUSE Labs, Novell Inc.

2005-05-03 14:34:34

by Dinakar Guniguntala

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

On Mon, May 02, 2005 at 11:01:35AM -0700, Paul Jackson wrote:
> My current concerns include:
> o Having it work on ia64 would facilitate my testing.

I am working on making the changes for ia64; they should be ready pretty soon

> o Does this patch ensure that isolated sched domains form
> a partition (disjoint cover) of a systems CPUs? Should it?

With this patch the cpus of an exclusive cpuset form a sched domain.
Since only exclusive cpusets can form a sched domain, this ensures
that the cpus form a disjoint cover

> o Does this change any documented semantics of cpusets? I don't
> see offhand that it does. Perhaps that's good. Perhaps
> I missed something.

No, all semantics continue to be the same as before


I have trimmed the requirements to do only the absolute minimum in
kernel space

1. Partitioning of the system (both cpus and memory) to ensure
dedicated resources are available for application use.
This has to be done through user space with the help of the
existing cpuset infrastructure and no additional changes are
required to be done in the kernel

2. Remove unnecessary scheduler load balancing overhead in the
partitions mentioned above
a. Ensure that load balance code is aware of partitioning of cpus
and load balance happens within these partitions and not
across the entire system
b. Provide for complete removal of load-balancing on a given
partition of cpus

This is necessary for a variety of workloads, including real-time and
HPC, and any mix of these workloads.
In the current patch, only 2(a) has been addressed.
I intend to add support for 2(b) once the current patch is acceptable
to everyone. I think this should not be a major change

3. Supporting CPU hotplug is another requirement, though not a direct one


Hope this helps

-Dinakar

2005-05-03 14:52:13

by Dinakar Guniguntala

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

> >On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
>
> What are you protecting against, though? synchronize_kernel can
> sleep, so local_irq_disable is probably the wrong thing to do as well.

Paul, any reason why code marked "####" (fn cpuset_rmdir) is under
the dentry lock ??

	spin_lock(&cs->dentry->d_lock);
	parent = cs->parent;                            ####
	set_bit(CS_REMOVED, &cs->flags);                ####
	if (is_cpu_exclusive(cs))
		update_cpu_domains(cs);
	list_del(&cs->sibling); /* delete my sibling from parent->children */
	if (list_empty(&parent->children))
		check_for_release(parent);
	d = dget(cs->dentry);                           <----
	cs->dentry = NULL;                              <----
	spin_unlock(&d->d_lock);


As far as I can see only the ones marked "<----" should be under the
dentry lock, considering the fact that it already holds the cpuset_sem
all the while.

I saw that calling update_cpu_domains with the dentry lock held
causes it to oops with preempt turned on. (Scheduling while atomic)
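
A sketch of the reordering being suggested (not an actual patch), leaving
only the dentry manipulation under d_lock:

	parent = cs->parent;
	set_bit(CS_REMOVED, &cs->flags);
	if (is_cpu_exclusive(cs))
		update_cpu_domains(cs); /* may sleep; now outside d_lock */
	list_del(&cs->sibling); /* delete my sibling from parent->children */
	if (list_empty(&parent->children))
		check_for_release(parent);
	spin_lock(&cs->dentry->d_lock);
	d = dget(cs->dentry);
	cs->dentry = NULL;
	spin_unlock(&d->d_lock);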

-Dinakar

2005-05-03 15:22:24

by Paul Jackson

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Dinakar wrote:
> Since only exclusive cpusets can form a sched domain, this ensures
> that the cpus form a disjoint cover

I don't understand why this is so.

Certainly, the exclusive cpusets do not form a disjoint cover. Not
disjoint, because the top cpuset is exclusive, and overlaps with all
other cpusets. Nor do the other exclusive cpusets, excluding the top,
necessarily form a cover - there might not even be any other such
cpusets.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-03 15:35:37

by Paul Jackson

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Dinakar wrote:
> No, all semantics continue to be the same as before

Good.

> I have trimmed the requirements to do only the absolute minimal in
> kernel space

Good.

> Partitioning of the system ... done through user space

Ok.

> Remove unnecessary scheduler load balancing ...

Ok.

Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-03 15:31:43

by Paul Jackson

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Dinakar wrote:
> As far as I can see only the ones marked "<----" should be under the
> dentry lock, considering the fact that it already holds the cpuset_sem
> all the while.

It looks that way to me, too.

I doubt we had any particular reason for locking the dentry as early
as we do in that code.

It's ok by me if you move the dentry lock lower, as you suggest.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-03 22:03:37

by Matthew Dobson

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Dinakar Guniguntala wrote:
> Ok, here is the minimal patchset that I had promised after the
> last discussion.
>
> What it does have
> o The current patch enhances the meaning of exclusive cpusets by
> attaching them (exclusive cpusets) to sched domains
> o It does _not_ require any additional cpumask_t variable. It
> just parses the cpus_allowed of the parent/sibling/children
> cpusets for manipulating sched domains
> o All existing operations on non-/exclusive cpusets are preserved as-is.
> o The sched code has been modified to bring it up to 2.6.12-rc2-mm3
>
> Usage
> o On setting the cpu_exclusive flag of a cpuset X, it creates two
> sched domains
> a. One, All cpus from X's parent cpuset that don't belong to any
> exclusive sibling cpuset of X
> b. Two, All cpus in X's cpus_allowed
> o On adding/deleting cpus to/from an exclusive cpuset X that has exclusive
> children, it creates two sched domains
> a. One, All cpus from X's parent cpuset that don't belong to any
> exclusive sibling cpuset of X
> b. Two, All cpus in X's cpus_allowed, after taking away any cpus that
> belong to exclusive child cpusets of X
> o On unsetting the cpu_exclusive flag of cpuset X or rmdir X, it creates a
> single sched domain, containing all cpus from X's parent cpuset that
> don't belong to any exclusive sibling of X and the cpus of X
> o It does _not_ modify the cpus_allowed variable of the parent as in the
> previous version. It relies on user space to move tasks to proper
> cpusets for complete isolation/exclusion
> o See function update_cpu_domains for more details
>
> What it does not have
> o It is still short on documentation
> o Does not have hotplug support as yet
> o Supports only x86 as of now
> o No thoughts on "memory domains" (though I am not sure who
> would use such a thing or what exactly the requirements are)

An interesting feature. I tried a while ago to get cpusets and
sched_domains to play nice (nicer?) and didn't have much luck. It seems
you're taking a better approach, with smaller patches. Good luck!


> diff -Naurp linux-2.6.12-rc2.orig/include/linux/init.h linux-2.6.12-rc2/include/linux/init.h
> --- linux-2.6.12-rc2.orig/include/linux/init.h 2005-04-04 22:07:52.000000000 +0530
> +++ linux-2.6.12-rc2/include/linux/init.h 2005-05-01 22:07:56.000000000 +0530
> @@ -217,7 +217,7 @@ void __init parse_early_param(void);
> #define __initdata_or_module __initdata
> #endif /*CONFIG_MODULES*/
>
> -#ifdef CONFIG_HOTPLUG
> +#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS)
> #define __devinit
> #define __devinitdata
> #define __devexit

This looks just plain wrong. Why do you need this? It doesn't seem that
arch_init_sched_domains() and/or update_sched_domains() are called from
anywhere that is cpuset related, so why the #ifdef CONFIG_CPUSETS?


> diff -Naurp linux-2.6.12-rc2.orig/kernel/sched.c linux-2.6.12-rc2/kernel/sched.c
> --- linux-2.6.12-rc2.orig/kernel/sched.c 2005-04-28 18:24:11.000000000 +0530
> +++ linux-2.6.12-rc2/kernel/sched.c 2005-05-01 22:06:55.000000000 +0530
> @@ -4526,7 +4526,7 @@ int __init migration_init(void)
> #endif
>
> #ifdef CONFIG_SMP
> -#define SCHED_DOMAIN_DEBUG
> +#undef SCHED_DOMAIN_DEBUG
> #ifdef SCHED_DOMAIN_DEBUG
> static void sched_domain_debug(struct sched_domain *sd, int cpu)
> {

Is this just to quiet the boot log for your testing? Is there a better reason
you're turning this off? It seems unrelated to the rest of your patch.


> ------------------------------------------------------------------------
>
> diff -Naurp linux-2.6.12-rc2.orig/kernel/cpuset.c linux-2.6.12-rc2/kernel/cpuset.c
> --- linux-2.6.12-rc2.orig/kernel/cpuset.c 2005-04-28 18:24:11.000000000 +0530
> +++ linux-2.6.12-rc2/kernel/cpuset.c 2005-05-01 22:15:06.000000000 +0530
> @@ -602,12 +602,48 @@ static int validate_change(const struct
> return 0;
> }
>
> +static void update_cpu_domains(struct cpuset *cur)
> +{
> +        struct cpuset *c, *par = cur->parent;
> +        cpumask_t span1, span2;
> +
> +        if (par == NULL || cpus_empty(cur->cpus_allowed))
> +                return;
> +
> +        /* Get all non-exclusive cpus from parent domain */
> +        span1 = par->cpus_allowed;
> +        list_for_each_entry(c, &par->children, sibling) {
> +                if (is_cpu_exclusive(c))
> +                        cpus_andnot(span1, span1, c->cpus_allowed);
> +        }
> +        if (is_removed(cur) || !is_cpu_exclusive(cur)) {
> +                cpus_or(span1, span1, cur->cpus_allowed);
> +                if (cpus_equal(span1, cur->cpus_allowed))
> +                        return;
> +                span2 = CPU_MASK_NONE;
> +        }
> +        else {
> +                if (cpus_empty(span1))
> +                        return;
> +                span2 = cur->cpus_allowed;
> +                /* If current cpuset has exclusive children, exclude from domain */
> +                list_for_each_entry(c, &cur->children, sibling) {
> +                        if (is_cpu_exclusive(c))
> +                                cpus_andnot(span2, span2, c->cpus_allowed);
> +                }
> +        }
> +
> +        lock_cpu_hotplug();
> +        rebuild_sched_domains(span1, span2);
> +        unlock_cpu_hotplug();
> +}

Nitpicky, but span1 and span2 could do with better names.

Otherwise, the patch looks good to me.

-Matt

2005-05-04 00:09:17

by Nick Piggin

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Matthew Dobson wrote:
> Dinakar Guniguntala wrote:
>
>>+        lock_cpu_hotplug();
>>+        rebuild_sched_domains(span1, span2);
>>+        unlock_cpu_hotplug();
>>+}
>
>
> Nitpicky, but span1 and span2 could do with better names.
>

As could rebuild_sched_domains while we're at it.

partition_disjoint_sched_domains(partition1, partition2);
?

Dunno. That isn't really great, but maybe better? Pretty
long, but it'll only ever be called in one or two places.

--
SUSE Labs, Novell Inc.

2005-05-04 00:28:30

by Matthew Dobson

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

Nick Piggin wrote:
> Matthew Dobson wrote:
>
>> Dinakar Guniguntala wrote:
>>
>>> +        lock_cpu_hotplug();
>>> +        rebuild_sched_domains(span1, span2);
>>> +        unlock_cpu_hotplug();
>>> +}
>>
>>
>>
>> Nitpicky, but span1 and span2 could do with better names.
>>
>
> As could rebuild_sched_domains while we're at it.
>
> partition_disjoint_sched_domains(partition1, partition2);
> ?
>
> Dunno. That isn't really great, but maybe better? Pretty
> long, but it'll only ever be called in one or two places.

build_disjoint_sched_domains(partition1, partition2)? Or just
partition_sched_domains(partition1, partition2)? Partition and disjoint
seem mildly redundant to me, for varying definitions of partition... ;)

-Matt

2005-05-05 13:13:57

by Dinakar Guniguntala

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

On Tue, May 03, 2005 at 03:03:23PM -0700, Matthew Dobson wrote:
> An interesting feature. I tried a while ago to get cpusets and
> sched_domains to play nice (nicer?) and didn't have much luck. It seems
> you're taking a better approach, with smaller patches. Good luck!

Thanks! I would very much like to know your findings as far as
memory/node domains are concerned, or are you going to be working on it?
I don't have any thoughts on it right now

> > -#ifdef CONFIG_HOTPLUG
> > +#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS)
> > #define __devinit
> > #define __devinitdata
> > #define __devexit
>
> This looks just plain wrong. Why do you need this? It doesn't seem that
> arch_init_sched_domains() and/or update_sched_domains() are called from
> anywhere that is cpuset related, so why the #ifdef CONFIG_CPUSETS?

cpu_attach_domain is defined as __devinit; maybe I need to remove that
instead of the #ifdef
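
That alternative would be roughly a one-line change in kernel/sched.c
(a sketch, untested):

-static void __devinit cpu_attach_domain(struct sched_domain *sd, int cpu)
+static void cpu_attach_domain(struct sched_domain *sd, int cpu)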

> > #ifdef CONFIG_SMP
> > -#define SCHED_DOMAIN_DEBUG
> > +#undef SCHED_DOMAIN_DEBUG
> > #ifdef SCHED_DOMAIN_DEBUG
> > static void sched_domain_debug(struct sched_domain *sd, int cpu)
> > {
>
> Is this just to quiet boot for your testing? Is there are better reason
> you're turning this off? It seems unrelated to the rest of your patch.
>

This gets called from cpu_attach_domain, and so with my changes it runs
every time partitioning is done, not only during boot

Thanks for your review!

-Dinakar

2005-05-05 13:16:17

by Dinakar Guniguntala

Subject: Re: [RFC PATCH] Dynamic sched domains (v0.5)

On Tue, May 03, 2005 at 05:28:20PM -0700, Matthew Dobson wrote:
>
> build_disjoint_sched_domains(partition1, partition2)? Or just
> partition_sched_domains(partition1, partition2)? Partition and disjoint
> seem mildly redundant to me, for varying definitions of partition... ;)

partition_sched_domains sounds fine to me. Thanks
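
Under that name, the call site in cpuset.c would presumably end up as
(a sketch of the rename only, nothing else changed):

	/* in update_cpu_domains(), with cpuset_sem held by the caller */
	lock_cpu_hotplug();
	partition_sched_domains(span1, span2);
	unlock_cpu_hotplug();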

-Dinakar