2006-10-19 09:24:07

by Paul Jackson

Subject: [RFC] cpuset: remove sched domain hooks from cpusets

From: Paul Jackson <[email protected]>

Remove the cpuset hooks that defined sched domains depending on the
setting of the 'cpu_exclusive' flag.

The cpu_exclusive flag can only be set on a child if it is set on
the parent.

This made that flag painfully unsuitable for use as a flag defining
a partitioning of a system.

It was entirely unobvious to a cpuset user what partitioning of sched
domains they would be causing when they set that one cpu_exclusive bit
on one cpuset, because it depended on what CPUs were in the remainder
of that cpuset's siblings and child cpusets, after subtracting out
other cpu_exclusive cpusets.

Furthermore, there was no way on production systems to query the
result.

Using the cpu_exclusive flag for this was simply wrong from the get go.

Fortunately, it was sufficiently borked that so far as I know, no
one has made much use of this feature, past the simplest case of
isolating some CPUs from scheduler balancing. A future patch will
propose a simple mechanism for this simple case.

Furthermore, since there was no way on a running system to see what
one was doing with sched domains, this change will be invisible to
any code using it. Unless they have deep insight into the scheduler's load
balancing choices, users will be unable to detect that this change
has been made in the kernel's behaviour.

Signed-off-by: Paul Jackson <[email protected]>

---

Documentation/cpusets.txt | 17 ---------
include/linux/sched.h | 3 -
kernel/cpuset.c | 84 +---------------------------------------------
kernel/sched.c | 27 --------------
4 files changed, 2 insertions(+), 129 deletions(-)

--- 2.6.19-rc1-mm1.orig/kernel/cpuset.c 2006-10-19 01:47:50.000000000 -0700
+++ 2.6.19-rc1-mm1/kernel/cpuset.c 2006-10-19 01:48:10.000000000 -0700
@@ -754,68 +754,13 @@ static int validate_change(const struct
}

/*
- * For a given cpuset cur, partition the system as follows
- * a. All cpus in the parent cpuset's cpus_allowed that are not part of any
- * exclusive child cpusets
- * b. All cpus in the current cpuset's cpus_allowed that are not part of any
- * exclusive child cpusets
- * Build these two partitions by calling partition_sched_domains
- *
- * Call with manage_mutex held. May nest a call to the
- * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
- * Must not be called holding callback_mutex, because we must
- * not call lock_cpu_hotplug() while holding callback_mutex.
- */
-
-static void update_cpu_domains(struct cpuset *cur)
-{
- struct cpuset *c, *par = cur->parent;
- cpumask_t pspan, cspan;
-
- if (par == NULL || cpus_empty(cur->cpus_allowed))
- return;
-
- /*
- * Get all cpus from parent's cpus_allowed not part of exclusive
- * children
- */
- pspan = par->cpus_allowed;
- list_for_each_entry(c, &par->children, sibling) {
- if (is_cpu_exclusive(c))
- cpus_andnot(pspan, pspan, c->cpus_allowed);
- }
- if (!is_cpu_exclusive(cur)) {
- cpus_or(pspan, pspan, cur->cpus_allowed);
- if (cpus_equal(pspan, cur->cpus_allowed))
- return;
- cspan = CPU_MASK_NONE;
- } else {
- if (cpus_empty(pspan))
- return;
- cspan = cur->cpus_allowed;
- /*
- * Get all cpus from current cpuset's cpus_allowed not part
- * of exclusive children
- */
- list_for_each_entry(c, &cur->children, sibling) {
- if (is_cpu_exclusive(c))
- cpus_andnot(cspan, cspan, c->cpus_allowed);
- }
- }
-
- lock_cpu_hotplug();
- partition_sched_domains(&pspan, &cspan);
- unlock_cpu_hotplug();
-}
-
-/*
* Call with manage_mutex held. May take callback_mutex during call.
*/

static int update_cpumask(struct cpuset *cs, char *buf)
{
struct cpuset trialcs;
- int retval, cpus_unchanged;
+ int retval;

/* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
if (cs == &top_cpuset)
@@ -831,12 +776,9 @@ static int update_cpumask(struct cpuset
retval = validate_change(cs, &trialcs);
if (retval < 0)
return retval;
- cpus_unchanged = cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
mutex_lock(&callback_mutex);
cs->cpus_allowed = trialcs.cpus_allowed;
mutex_unlock(&callback_mutex);
- if (is_cpu_exclusive(cs) && !cpus_unchanged)
- update_cpu_domains(cs);
return 0;
}

@@ -1046,7 +988,7 @@ static int update_flag(cpuset_flagbits_t
{
int turning_on;
struct cpuset trialcs;
- int err, cpu_exclusive_changed;
+ int err;

turning_on = (simple_strtoul(buf, NULL, 10) != 0);

@@ -1059,14 +1001,10 @@ static int update_flag(cpuset_flagbits_t
err = validate_change(cs, &trialcs);
if (err < 0)
return err;
- cpu_exclusive_changed =
- (is_cpu_exclusive(cs) != is_cpu_exclusive(&trialcs));
mutex_lock(&callback_mutex);
cs->flags = trialcs.flags;
mutex_unlock(&callback_mutex);

- if (cpu_exclusive_changed)
- update_cpu_domains(cs);
return 0;
}

@@ -1930,17 +1868,6 @@ static int cpuset_mkdir(struct inode *di
return cpuset_create(c_parent, dentry->d_name.name, mode | S_IFDIR);
}

-/*
- * Locking note on the strange update_flag() call below:
- *
- * If the cpuset being removed is marked cpu_exclusive, then simulate
- * turning cpu_exclusive off, which will call update_cpu_domains().
- * The lock_cpu_hotplug() call in update_cpu_domains() must not be
- * made while holding callback_mutex. Elsewhere the kernel nests
- * callback_mutex inside lock_cpu_hotplug() calls. So the reverse
- * nesting would risk an ABBA deadlock.
- */
-
static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry)
{
struct cpuset *cs = dentry->d_fsdata;
@@ -1960,13 +1887,6 @@ static int cpuset_rmdir(struct inode *un
mutex_unlock(&manage_mutex);
return -EBUSY;
}
- if (is_cpu_exclusive(cs)) {
- int retval = update_flag(CS_CPU_EXCLUSIVE, cs, "0");
- if (retval < 0) {
- mutex_unlock(&manage_mutex);
- return retval;
- }
- }
parent = cs->parent;
mutex_lock(&callback_mutex);
set_bit(CS_REMOVED, &cs->flags);
--- 2.6.19-rc1-mm1.orig/Documentation/cpusets.txt 2006-10-19 01:47:09.000000000 -0700
+++ 2.6.19-rc1-mm1/Documentation/cpusets.txt 2006-10-19 01:48:10.000000000 -0700
@@ -86,9 +86,6 @@ This can be especially valuable on:
and a database), or
* NUMA systems running large HPC applications with demanding
performance characteristics.
- * Also cpu_exclusive cpusets are useful for servers running orthogonal
- workloads such as RT applications requiring low latency and HPC
- applications that are throughput sensitive

These subsets, or "soft partitions" must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
@@ -131,8 +128,6 @@ Cpusets extends these two mechanisms as
- A cpuset may be marked exclusive, which ensures that no other
cpuset (except direct ancestors and descendents) may contain
any overlapping CPUs or Memory Nodes.
- Also a cpu_exclusive cpuset would be associated with a sched
- domain.
- You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
@@ -144,9 +139,6 @@ into the rest of the kernel, none in per
allowed in that tasks cpuset.
- in sched.c migrate_all_tasks(), to keep migrating tasks within
the CPUs allowed by their cpuset, if possible.
- - in sched.c, a new API partition_sched_domains for handling
- sched domain changes associated with cpu_exclusive cpusets
- and related changes in both sched.c and arch/ia64/kernel/domain.c
- in the mbind and set_mempolicy system calls, to mask the requested
Memory Nodes by what's allowed in that tasks cpuset.
- in page_alloc.c, to restrict memory to allowed nodes.
@@ -231,15 +223,6 @@ If a cpuset is cpu or mem exclusive, no
a direct ancestor or descendent, may share any of the same CPUs or
Memory Nodes.

-A cpuset that is cpu_exclusive has a scheduler (sched) domain
-associated with it. The sched domain consists of all CPUs in the
-current cpuset that are not part of any exclusive child cpusets.
-This ensures that the scheduler load balancing code only balances
-against the CPUs that are in the sched domain as defined above and
-not all of the CPUs in the system. This removes any overhead due to
-load balancing code trying to pull tasks outside of the cpu_exclusive
-cpuset only to be prevented by the tasks' cpus_allowed mask.
-
A cpuset that is mem_exclusive restricts kernel allocations for
page, buffer and other data commonly shared by the kernel across
multiple users. All cpusets, whether mem_exclusive or not, restrict
--- 2.6.19-rc1-mm1.orig/include/linux/sched.h 2006-10-19 01:47:09.000000000 -0700
+++ 2.6.19-rc1-mm1/include/linux/sched.h 2006-10-19 01:48:10.000000000 -0700
@@ -715,9 +715,6 @@ struct sched_domain {
#endif
};

-extern int partition_sched_domains(cpumask_t *partition1,
- cpumask_t *partition2);
-
/*
* Maximum cache size the migration-costs auto-tuning code will
* search from:
--- 2.6.19-rc1-mm1.orig/kernel/sched.c 2006-10-19 01:47:09.000000000 -0700
+++ 2.6.19-rc1-mm1/kernel/sched.c 2006-10-19 01:48:10.000000000 -0700
@@ -6735,33 +6735,6 @@ static void detach_destroy_domains(const
arch_destroy_sched_domains(cpu_map);
}

-/*
- * Partition sched domains as specified by the cpumasks below.
- * This attaches all cpus from the cpumasks to the NULL domain,
- * waits for a RCU quiescent period, recalculates sched
- * domain information and then attaches them back to the
- * correct sched domains
- * Call with hotplug lock held
- */
-int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
-{
- cpumask_t change_map;
- int err = 0;
-
- cpus_and(*partition1, *partition1, cpu_online_map);
- cpus_and(*partition2, *partition2, cpu_online_map);
- cpus_or(change_map, *partition1, *partition2);
-
- /* Detach sched domains from all of the affected cpus */
- detach_destroy_domains(&change_map);
- if (!cpus_empty(*partition1))
- err = build_sched_domains(partition1);
- if (!err && !cpus_empty(*partition2))
- err = build_sched_domains(partition2);
-
- return err;
-}
-
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
int arch_reinit_sched_domains(void)
{

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401


2006-10-19 10:25:10

by Nick Piggin

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Fix sched-domains partitioning by cpusets. Walk the whole cpusets tree after
something interesting changes, and recreate all partitions.

Index: linux-2.6/kernel/cpuset.c
===================================================================
--- linux-2.6.orig/kernel/cpuset.c 2006-10-19 19:26:54.000000000 +1000
+++ linux-2.6/kernel/cpuset.c 2006-10-19 20:21:29.000000000 +1000
@@ -751,6 +751,24 @@ static int validate_change(const struct
return 0;
}

+static void update_cpu_domains_children(struct cpuset *par,
+ cpumask_t *non_partitioned)
+{
+ struct cpuset *c;
+
+ list_for_each_entry(c, &par->children, sibling) {
+ if (cpus_empty(c->cpus_allowed))
+ continue;
+ if (is_cpu_exclusive(c)) {
+ if (!partition_sched_domains(&c->cpus_allowed)) {
+ cpus_andnot(*non_partitioned,
+ *non_partitioned, c->cpus_allowed);
+ }
+ } else
+ update_cpu_domains_children(c, non_partitioned);
+ }
+}
+
/*
* For a given cpuset cur, partition the system as follows
* a. All cpus in the parent cpuset's cpus_allowed that are not part of any
@@ -760,53 +778,38 @@ static int validate_change(const struct
* Build these two partitions by calling partition_sched_domains
*
* Call with manage_mutex held. May nest a call to the
- * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
- * Must not be called holding callback_mutex, because we must
- * not call lock_cpu_hotplug() while holding callback_mutex.
+ * lock_cpu_hotplug()/unlock_cpu_hotplug() pair. Must not be called holding
+ * callback_mutex, because we must not call lock_cpu_hotplug() while holding
+ * callback_mutex.
*/

-static void update_cpu_domains(struct cpuset *cur)
+static void update_cpu_domains(void)
{
- struct cpuset *c, *par = cur->parent;
- cpumask_t pspan, cspan;
+ cpumask_t non_partitioned;

- if (par == NULL || cpus_empty(cur->cpus_allowed))
- return;
-
- /*
- * Get all cpus from parent's cpus_allowed not part of exclusive
- * children
- */
- pspan = par->cpus_allowed;
- list_for_each_entry(c, &par->children, sibling) {
- if (is_cpu_exclusive(c))
- cpus_andnot(pspan, pspan, c->cpus_allowed);
- }
- if (!is_cpu_exclusive(cur)) {
- cpus_or(pspan, pspan, cur->cpus_allowed);
- if (cpus_equal(pspan, cur->cpus_allowed))
- return;
- cspan = CPU_MASK_NONE;
- } else {
- if (cpus_empty(pspan))
- return;
- cspan = cur->cpus_allowed;
- /*
- * Get all cpus from current cpuset's cpus_allowed not part
- * of exclusive children
- */
- list_for_each_entry(c, &cur->children, sibling) {
- if (is_cpu_exclusive(c))
- cpus_andnot(cspan, cspan, c->cpus_allowed);
- }
- }
+ BUG_ON(!mutex_is_locked(&manage_mutex));

lock_cpu_hotplug();
- partition_sched_domains(&pspan, &cspan);
+ non_partitioned = top_cpuset.cpus_allowed;
+ update_cpu_domains_children(&top_cpuset, &non_partitioned);
+ partition_sched_domains(&non_partitioned);
unlock_cpu_hotplug();
}

/*
+ * Same as above except called with lock_cpu_hotplug and without manage_mutex.
+ */
+
+int cpuset_hotplug_update_sched_domains(void)
+{
+ cpumask_t non_partitioned;
+
+ non_partitioned = top_cpuset.cpus_allowed;
+ update_cpu_domains_children(&top_cpuset, &non_partitioned);
+ return partition_sched_domains(&non_partitioned);
+}
+
+/*
* Call with manage_mutex held. May take callback_mutex during call.
*/

@@ -833,8 +836,8 @@ static int update_cpumask(struct cpuset
mutex_lock(&callback_mutex);
cs->cpus_allowed = trialcs.cpus_allowed;
mutex_unlock(&callback_mutex);
- if (is_cpu_exclusive(cs) && !cpus_unchanged)
- update_cpu_domains(cs);
+ if (!cpus_unchanged)
+ update_cpu_domains();
return 0;
}

@@ -1067,7 +1070,7 @@ static int update_flag(cpuset_flagbits_t
mutex_unlock(&callback_mutex);

if (cpu_exclusive_changed)
- update_cpu_domains(cs);
+ update_cpu_domains();
return 0;
}

@@ -1931,19 +1934,9 @@ static int cpuset_mkdir(struct inode *di
return cpuset_create(c_parent, dentry->d_name.name, mode | S_IFDIR);
}

-/*
- * Locking note on the strange update_flag() call below:
- *
- * If the cpuset being removed is marked cpu_exclusive, then simulate
- * turning cpu_exclusive off, which will call update_cpu_domains().
- * The lock_cpu_hotplug() call in update_cpu_domains() must not be
- * made while holding callback_mutex. Elsewhere the kernel nests
- * callback_mutex inside lock_cpu_hotplug() calls. So the reverse
- * nesting would risk an ABBA deadlock.
- */
-
static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry)
{
+ int is_exclusive;
struct cpuset *cs = dentry->d_fsdata;
struct dentry *d;
struct cpuset *parent;
@@ -1961,13 +1954,8 @@ static int cpuset_rmdir(struct inode *un
mutex_unlock(&manage_mutex);
return -EBUSY;
}
- if (is_cpu_exclusive(cs)) {
- int retval = update_flag(CS_CPU_EXCLUSIVE, cs, "0");
- if (retval < 0) {
- mutex_unlock(&manage_mutex);
- return retval;
- }
- }
+ is_exclusive = is_cpu_exclusive(cs);
+
parent = cs->parent;
mutex_lock(&callback_mutex);
set_bit(CS_REMOVED, &cs->flags);
@@ -1982,8 +1970,13 @@ static int cpuset_rmdir(struct inode *un
mutex_unlock(&callback_mutex);
if (list_empty(&parent->children))
check_for_release(parent, &pathbuf);
+
+ if (is_exclusive)
+ update_cpu_domains();
+
mutex_unlock(&manage_mutex);
cpuset_release_agent(pathbuf);
+
return 0;
}

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c 2006-10-19 19:24:48.000000000 +1000
+++ linux-2.6/kernel/sched.c 2006-10-19 20:21:50.000000000 +1000
@@ -6586,6 +6586,9 @@ error:
*/
static int arch_init_sched_domains(const cpumask_t *cpu_map)
{
+#ifdef CONFIG_CPUSETS
+ return cpuset_hotplug_update_sched_domains();
+#else
cpumask_t cpu_default_map;
int err;

@@ -6599,6 +6602,7 @@ static int arch_init_sched_domains(const
err = build_sched_domains(&cpu_default_map);

return err;
+#endif
}

static void arch_destroy_sched_domains(const cpumask_t *cpu_map)
@@ -6622,29 +6626,26 @@ static void detach_destroy_domains(const

/*
* Partition sched domains as specified by the cpumasks below.
- * This attaches all cpus from the cpumasks to the NULL domain,
+ * This attaches all cpus from the partition to the NULL domain,
* waits for a RCU quiescent period, recalculates sched
- * domain information and then attaches them back to the
- * correct sched domains
- * Call with hotplug lock held
+ * domain information and then attaches them back to their own
+ * isolated partition.
+ *
+ * Called with hotplug lock held
+ *
+ * Returns 0 on success.
*/
-int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
+int partition_sched_domains(cpumask_t *partition)
{
+ cpumask_t non_isolated_cpus;
cpumask_t change_map;
- int err = 0;

- cpus_and(*partition1, *partition1, cpu_online_map);
- cpus_and(*partition2, *partition2, cpu_online_map);
- cpus_or(change_map, *partition1, *partition2);
+ cpus_andnot(non_isolated_cpus, cpu_online_map, cpu_isolated_map);
+ cpus_and(change_map, *partition, non_isolated_cpus);

/* Detach sched domains from all of the affected cpus */
detach_destroy_domains(&change_map);
- if (!cpus_empty(*partition1))
- err = build_sched_domains(partition1);
- if (!err && !cpus_empty(*partition2))
- err = build_sched_domains(partition2);
-
- return err;
+ return build_sched_domains(&change_map);
}

#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2006-10-19 20:02:24.000000000 +1000
+++ linux-2.6/include/linux/sched.h 2006-10-19 20:02:30.000000000 +1000
@@ -707,8 +707,7 @@ struct sched_domain {
#endif
};

-extern int partition_sched_domains(cpumask_t *partition1,
- cpumask_t *partition2);
+extern int partition_sched_domains(cpumask_t *partition);

/*
* Maximum cache size the migration-costs auto-tuning code will
Index: linux-2.6/include/linux/cpuset.h
===================================================================
--- linux-2.6.orig/include/linux/cpuset.h 2006-10-19 20:07:24.000000000 +1000
+++ linux-2.6/include/linux/cpuset.h 2006-10-19 20:21:08.000000000 +1000
@@ -14,6 +14,8 @@

#ifdef CONFIG_CPUSETS

+extern int cpuset_hotplug_update_sched_domains(void);
+
extern int number_of_cpusets; /* How many cpusets are defined in system? */

extern int cpuset_init_early(void);


Attachments:
sched-domains-cpusets-fixes.patch (8.27 kB)

2006-10-19 19:04:35

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Nick wrote:
> You shouldn't need to, assuming cpusets doesn't mess it up.

I'm guessing we're agreeing that the routines update_cpu_domains()
and related code in kernel/cpuset.c are messing things up.

I view that code as a failed intrusion of some sched domain code into
cpusets, and apparently you view that code as a failed attempt to
manage sched domains coming from cpusets.

Oh well ... finger pointing is such fun ;).

(Fortunately I've forgotten who wrote these routines ... best
I don't know. Whoever you are, don't take it personally. It
was nice clean code, caught between the rock and the flood.)


> + non_partitioned = top_cpuset.cpus_allowed;
> + update_cpu_domains_children(&top_cpuset, &non_partitioned);
> + partition_sched_domains(&non_partitioned);

So ... instead of throwing the baby out, you want to replace it
with a puppy. If one attempt to overload cpu_exclusive didn't
work, try another.

I have two problems with this.

1) I haven't found any need for this, past the need to mark some
CPUs as isolated from the scheduler balancing code, which we
seem to be agreeing on, more or less, on another patch.

Please explain why we need this or any such mechanism for user
space to affect sched domain partitioning.

2) I've had better luck with the cpuset API by adding new flags
when I needed some additional semantics, rather than overloading
existing flags. So once we figure out what's needed and why,
then odds are I will suggest a new flag, specific to that purpose.

This new flag might well logically depend on the cpu_exclusive
setting, if that's useful. But it would probably be a separate
flag or setting.

I dislike providing explicit mechanisms via implicit side effects.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 19:22:09

by Nick Piggin

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Paul Jackson wrote:
> Nick wrote:
>
>>You shouldn't need to, assuming cpusets doesn't mess it up.
>
>
> I'm guessing we're agreeing that the routines update_cpu_domains()
> and related code in kernel/cpuset.c are messing things up.

At the moment they are, yes.

> I view that code as a failed intrustion of some sched domain code into
> cpusets, and apparently you view that code as a failed attempt to
> manage sched domains coming from cpusets.
>
> Oh well ... finger pointing is such fun ;).

:)

I don't know about finger pointing, but the sched-domains partitioning
works. It does what you ask of it, which is to partition the
multiprocessor balancing.

>>+ non_partitioned = top_cpuset.cpus_allowed;
>>+ update_cpu_domains_children(&top_cpuset, &non_partitioned);
>>+ partition_sched_domains(&non_partitioned);
>
>
> So ... instead of throwing the baby out, you want to replace it
> with a puppy. If one attempt to overload cpu_exclusive didn't
> work, try another.

It isn't overloading anything. Your cpusets code has assigned a
particular semantic to cpu_exclusive. It so happens that we can
take advantage of this knowledge in order to do a more efficient
implementation.

It doesn't suddenly become a flag to manage sched-domains; its
semantics are completely unchanged (modulo bugs). The cpuset
interface semantics have no connection to sched-domains.

Put it this way: you don't think your code is currently
overloading the cpuset cpus_allowed setting in order to set the
task's cpus_allowed field, do you? You shouldn't need a flag to
tell it to set that, it is all just the mechanism behind the
policy.

> I have two problems with this.
>
> 1) I haven't found any need for this, past the need to mark some
> CPUs as isolated from the scheduler balancing code, which we
> seem to be agreeing on, more or less, on another patch.
>
> Please explain why we need this or any such mechanism for user
> space to affect sched domain partitioning.

Until very recently, the multiprocessor balancing could easily be very
stupid when faced with cpus_allowed restrictions. This is somewhat
fixed, but it is still suboptimal compared to a sched-domains partition
when you are dealing with disjoint cpusets.

It is mostly SGI who seem to be running into these balancing issues, so
I would have thought this would be helpful for your customers primarily.

I don't know of anyone else using cpusets, but I'd be interested to know.

> 2) I've had better luck with the cpuset API by adding new flags
> when I needed some additional semantics, rather than overloading
> existing flags. So once we figure out what's needed and why,
> then odds are I will suggest a new flag, specific to that purpose.

There is no new semantic beyond what is already specified by
cpu_exclusive.

>
> This new flag might well logically depend on the cpu_exclusive
> setting, if that's useful. But it would probably be a separate
> flag or setting.
>
> I dislike providing explicit mechanisms via implicit side effects.

This is more like providing a specific implementation for a given
semantic.

--
SUSE Labs, Novell Inc.

2006-10-19 19:50:26

by Martin Bligh

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets


> I don't know of anyone else using cpusets, but I'd be interested to know.

We (Google) are planning to use it to do some partitioning, albeit on
much smaller machines. I'd really like to NOT use cpus_allowed from
previous experience - if we can get it to partition using separated
sched domains, that would be much better.

From my dim recollections of previous discussions when cpusets was
added in the first place, we asked for exactly the same thing then.
I think some of the problem came from the fact that "exclusive"
to cpusets doesn't actually mean exclusive at all, and they're
shared in some fashion. Perhaps that issue is cleared up now?
/me crosses all fingers and toes and prays really hard.

M.

2006-10-20 00:15:03

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Martin wrote:
> We (Google) are planning to use it to do some partitioning, albeit on
> much smaller machines. I'd really like to NOT use cpus_allowed from
> previous experience - if we can get it to partition using separated
> sched domains, that would be much better.

Are you saying that you wished that cpusets was not implemented using
cpus_allowed, but -instead- implemented using sched domain partitioning?

Well, as you likely can guess by now, that's unlikely.

Cpusets provides hierarchically nested sets of CPU and Memory Nodes,
especially useful for managing nested allocation of processor and
memory resources on large systems. The essential mechanism at the core
of cpusets is manipulating the cpus_allowed and mems_allowed masks in
each task.

Cpusets have also been dabbling in the business of driving the sched
domain partitioning, but I am getting more inclined as time goes on to
think that was a mistake.


> From my dim recollections of previous discussions when cpusets was
> added in the first place, we asked for exactly the same thing then.

What are you asking for again? ;).

Are you asking for a decent interface to sched domain partitioning?

Perhaps cpusets are not the best way to get that.

I hear tell from my colleague Christoph Lameter that he is considering
trying to make some improvements, that would benefit us all, to the
sched domain partitioning code - smaller, faster, simpler, better and
all that good stuff. Perhaps you guys at Google should join in that
effort, and see to it that your needs are met as well. I would
recommend providing whatever kernel-user API's you need for this, if
any, separately from cpusets.

So far, the requirements that I am aware of on such an effort:
1) Somehow support isolated CPUs (no load balancing to or from them).
For example, at least one real-time project needs these.
2) Whatever you were talking about above that Google is planning, some
sort of partitioning.
3) Somehow, whether by magic or by implicit or explicit partitioning
of the system's CPUs, ensure that its load balancing scales to cover
my employer's (SGI) big CPU count systems.
4) Hopefully smaller, less #ifdef'y and easier to understand than the
current code.
5) Avoid poor-fit interactions with cpusets, which have a different
shape (naturally hierarchical), internal mechanism (allowed bitmasks
rather than scheduler balancing domains), scope (combined processor
plus memory) and natural API style (a full-fledged file system to
name these sets, rather than a few bitmasks and flags).

Good luck.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 16:03:29

by Nick Piggin

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Martin Bligh wrote:
>
>> I don't know of anyone else using cpusets, but I'd be interested to know.
>
>
> We (Google) are planning to use it to do some partitioning, albeit on
> much smaller machines. I'd really like to NOT use cpus_allowed from
> previous experience - if we can get it to partition using separated
> sched domains, that would be much better.
>
> From my dim recollections of previous discussions when cpusets was
> added in the first place, we asked for exactly the same thing then.
> I think some of the problem came from the fact that "exclusive"
> to cpusets doesn't actually mean exclusive at all, and they're
> shared in some fashion. Perhaps that issue is cleared up now?
> /me crosses all fingers and toes and prays really hard.

The issue, I believe, is that an exclusive cpuset can have an exclusive parent
and exclusive children, which obviously all overlap one another, and
thus you have to do the partition only at the top-most exclusive cpuset.

Currently, cpusets is creating partitions in cpu_exclusive children as
well, which breaks balancing for the parent.

The patch I posted previously should (modulo bugs) only do partitioning
in the top-most cpuset. I still need clarification from Paul as to why
this is unacceptable, though.

--
SUSE Labs, Novell Inc.

2006-10-20 17:50:09

by Suresh Siddha

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On Sat, Oct 21, 2006 at 02:03:22AM +1000, Nick Piggin wrote:
> Martin Bligh wrote:
> > We (Google) are planning to use it to do some partitioning, albeit on
> > much smaller machines. I'd really like to NOT use cpus_allowed from
> > previous experience - if we can get it to partition using separated
> > sched domains, that would be much better.
> >
> > From my dim recollections of previous discussions when cpusets was
> > added in the first place, we asked for exactly the same thing then.
> > I think some of the problem came from the fact that "exclusive"
> > to cpusets doesn't actually mean exclusive at all, and they're
> > shared in some fashion. Perhaps that issue is cleared up now?
> > /me crosses all fingers and toes and prays really hard.
>
> The issue, I believe, is that an exclusive cpuset can have an exclusive parent
> and exclusive children, which obviously all overlap one another, and
> thus you have to do the partition only at the top-most exclusive cpuset.
>
> Currently, cpusets is creating partitions in cpu_exclusive children as
> well, which breaks balancing for the parent.
>
> The patch I posted previously should (modulo bugs) only do partitioning
> in the top-most cpuset. I still need clarification from Paul as to why
> this is unacceptable, though.

I like the direction of Nick's patch, which does domain partitioning at the
top-most exclusive cpuset.

thanks,
suresh

2006-10-20 19:00:37

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

> The patch I posted previously should (modulo bugs) only do partitioning
> in the top-most cpuset. I still need clarification from Paul as to why
> this is unacceptable, though.

That patch partitioned on the children of the top cpuset, not the
top cpuset itself.

There is only one top cpuset - and that covers the entire system.

Consider the following example:

/dev/cpuset cpu_exclusive=1, cpus=0-7, task A
/dev/cpuset/a cpu_exclusive=1, cpus=0-3, task B
/dev/cpuset/b cpu_exclusive=1, cpus=4-7, task C

We have three cpusets - the top cpuset and two children, 'a' and 'b'.

We have three tasks, A, B and C. Task A is running in the top cpuset,
with access to all 8 cpus on the system. Tasks B and C are each in
a child cpuset, with access to just 4 cpus.
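
For concreteness, a layout like the one above is built through the cpuset
filesystem. The following is only a rough sketch in C, assuming the cpuset
filesystem is already mounted at /dev/cpuset, that memory node 0 is an
acceptable mems setting for both children, and that minimal error handling
is enough; the file names are as documented in Documentation/cpusets.txt.

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write a small string to a cpuset control file. */
static void put(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fprintf(f, "%s", val) < 0 || fclose(f) != 0) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	/* Top cpuset already holds cpus 0-7; mark it cpu_exclusive. */
	put("/dev/cpuset/cpu_exclusive", "1");

	/* Child 'a': cpus 0-3, cpu_exclusive. */
	mkdir("/dev/cpuset/a", 0755);
	put("/dev/cpuset/a/cpus", "0-3");
	put("/dev/cpuset/a/mems", "0");	/* mems must be set before tasks */
	put("/dev/cpuset/a/cpu_exclusive", "1");

	/* Child 'b': cpus 4-7, cpu_exclusive. */
	mkdir("/dev/cpuset/b", 0755);
	put("/dev/cpuset/b/cpus", "4-7");
	put("/dev/cpuset/b/mems", "0");
	put("/dev/cpuset/b/cpu_exclusive", "1");

	/*
	 * Task A simply stays in the top cpuset; tasks B and C would be
	 * attached by writing their pids to /dev/cpuset/a/tasks and
	 * /dev/cpuset/b/tasks respectively.
	 */
	return 0;
}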

By your patch, the cpu_exclusive cpusets 'a' and 'b' partition the
sched domains into two halves, each covering 4 of the system's 8 cpus.
(That, or I'm still a sched domain idiot - quite possible.)

As a result, task A is screwed. If it happens to be on any of cpus
0-3 when the above is set up and the sched domains become partitioned,
it will never be considered for load balancing on any of cpus 4-7.
Or vice versa, if it is on any of cpus 4-7, it has no chance of
subsequently running on cpus 0-3.

If your patch had been just an implicit optimization, benefiting
sched domains, by optimizing for smaller domains when it could do so
without any noticable harm, then it would at least be neutral, and
we could continue the discussion of that patch to ask if it provided
an optimization that helped enough to be worth doing.

But that's not the case, as the above example shows.

I do not see any way to harmlessly optimize sched domain partitioning
based on a system's cpuset configuration.

I am not aware of any possible cpuset configuration that defines a
partitioning of the system's cpus. In particular, the top cpuset
always covers all online cpus, and any task in that top cpuset can
run anywhere, so far as cpusets is concerned.

So ... what can we do.

What -would- be a useful partitioning of sched domains?

Not being a sched domain wizard, I can only hazard a guess, but I'd
guess it would be a partitioning that significantly reduced the typical
size of a sched domain below the full size of the system (apparently it
is quicker to balance several smaller domains than one big one), while
not cutting off any legitimate load balancing possibilities.

The static cpuset configuration doesn't tell us this (see the top
cpuset in the example above), but if one combined that with knowledge
of which cpusets had actively running jobs that should be load
balanced, then that could work.

I doubt we could detect this (which cpusets did or did not need to be
load balanced) automatically. We probably need to have user code tell
us this. That was the point of my patch that started this discussion
several days ago, adding explicit 'sched_domain' flag files to each
cpuset, so user code could mark the cpusets needing to be balanced.

Since proposing that patch, I've changed my recommendation. Instead
of using cpusets to drive sched domain partitioning, better to just
provide a separate API, specific to the needs of sched domains, by
which user code can partition sched domains. That, or make the
balancing fast enough, even on very large domains, that we don't need
to partition. If we do have to partition, it would basically be for
performance reasons, and since I don't see any automatic way to
correctly partition sched domains, I think it would require some
explicit kernel-user API by which user space code can define the
partitioning.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 19:19:31

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Suresh wrote:
> I like the direction of Nick's patch, which does domain partitioning
> at the top-most exclusive cpuset.

See the reply I just posted to Nick on this.

His patch didn't partition at the top cpuset, but at its children.
It could not have done any better than that.

The top cpuset covers all online cpus on the system, which is the
same as the default sched domain partition. Partitioning there
would be a no-op, producing the same one big partition we have now.

Partitioning at any lower level, even just the immediate children
of the root cpuset as Nick's patch does, breaks load balancing for
any tasks in the top cpuset.

And even if for some strange reason that weren't a problem, still
partitioning at the level of the immediate children of the root cpuset
doesn't help much on a decent proportion of big systems. Many of my
big systems run with just two cpusets right under the top cpuset, a
tiny cpuset (say 4 cpus) for classic Unix daemons, cron jobs and init,
and a huge (say 1020 out of 1024 cpus) cpuset for the batch scheduler
to slice and dice, to sub-divide into smaller cpusets for the various
jobs and other needs it has.

These systems would still suffer from any performance problems we had
balancing a huge sched domain. Presumably the pain of balancing a
1020 cpu partition is not much less than it is for a 1024 cpu partition.

So, regrettably, Nick's patch is both broken and useless ;).

Only a finer grain sched domain partitioning, that accurately reflects
the placement of active jobs and tasks needing load balancing, is of
much use here.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 20:30:17

by Dinakar Guniguntala

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Hi Paul,

This mail seems to be as good as any to reply to, so here goes

On Fri, Oct 20, 2006 at 12:00:05PM -0700, Paul Jackson wrote:
> > The patch I posted previously should (modulo bugs) only do partitioning
> > in the top-most cpuset. I still need clarification from Paul as to why
> > this is unacceptable, though.
>
> That patch partitioned on the children of the top cpuset, not the
> top cpuset itself.
>
> There is only one top cpuset - and that covers the entire system.
>
> Consider the following example:
>
> /dev/cpuset cpu_exclusive=1, cpus=0-7, task A
> /dev/cpuset/a cpu_exclusive=1, cpus=0-3, task B
> /dev/cpuset/b cpu_exclusive=1, cpus=4-7, task C
>
> We have three cpusets - the top cpuset and two children, 'a' and 'b'.
>
> We have three tasks, A, B and C. Task A is running in the top cpuset,
> with access to all 8 cpus on the system. Tasks B and C are each in
> a child cpuset, with access to just 4 cpus.
>
> By your patch, the cpu_exclusive cpusets 'a' and 'b' partition the
> sched domains in two halves, each covering 4 of the systems 8 cpus.
> (That, or I'm still a sched domain idiot - quite possible.)
>
> As a result, task A is screwed. If it happens to be on any of cpus
> 0-3 when the above is set up and the sched domains become partitioned,
> it will never be considered for load balancing on any of cpus 4-7.
> Or vice versa, if it is on any of cpus 4-7, it has no chance of
> subsequently running on cpus 0-3.

Ok, I see the issue here, although the above has been the case all along.
I think the main issue is that most users don't have to do more than one
level of partitioning (having to partition a system with no more than
16 - 32 cpus, mostly less), and it is fairly easy to keep track of
exclusive cpusets and task placements, so this is not such a big problem
at all. However, I can see that with 1024 cpus it is no longer trivial
to remember all of the partitioning, especially if the partitioning is
more than 2 levels deep, and that it gets unwieldy.

So I propose the following changes to cpusets

1. Have a new flag that takes care of sched domains (say sched_domain).
Although I still think that we can tag sched domains on the back of
exclusive cpusets, I think it best to separate the two and maybe even
add a separate CONFIG option for this. This way we can keep any
complexity arising out of this, such as hotplug/sched domain
interactions, all under the config option.
2. The main change is that we don't allow tasks to be added to a cpuset
if it has child cpusets that also have the sched_domain flag turned on
(maybe return -EINVAL if the user tries to do that; a rough sketch of
such a check follows below).
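
A minimal sketch of how rule 2 might look, written in the style of the
existing checks in kernel/cpuset.c; is_sched_domain() and the sched_domain
flag itself are assumptions of this illustration, not code that exists today:

/*
 * Hypothetical: reject attaching a task to a cpuset that has a child
 * marked with the proposed sched_domain flag.  Would be called from
 * attach_task(), with manage_mutex held, alongside the existing checks.
 */
static int no_sched_domain_children(struct cpuset *cs)
{
	struct cpuset *c;

	list_for_each_entry(c, &cs->children, sibling) {
		if (is_sched_domain(c))		/* assumed helper */
			return -EINVAL;
	}
	return 0;
}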

Clearly one issue remains: tasks that are already running in the top cpuset.
Unless these are manually moved down to the correct cpuset hierarchy, they
will continue to have the problem as before. I still don't have a simple
enough solution for this other than to document it. But I still think that
on smaller systems this should be a fairly easy task for the administrator,
if they really know what they are doing. And the fact that we have a
separate flag to indicate the sched domain partitioning should make it
harder for them to shoot themselves in the foot. Maybe there are other,
better ways to resolve this?

One point I would argue against is to completely decouple cpusets and
sched domains. We do need a way to partition sched domains and doing
it along the lines of cpusets seems to be the most logical. This is
also much simpler in terms of additional lines of code needed to support
this feature. (as compared to adding a whole new API just to do this)

-Dinakar

2006-10-20 21:42:18

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

> One point I would argue against is to completely decouple cpusets and
> sched domains. We do need a way to partition sched domains and doing
> it along the lines of cpusets seems to be the most logical. This is
> also much simpler in terms of additional lines of code needed to support
> this feature. (as compared to adding a whole new API just to do this)

The "simpler" (fewer code lines) point I can certainly agree with.

The "most logical" point I go back and forth on.

The flat partitions, forming a complete, non-overlapping cover, needed
by sched domains can be mapped to selected cpusets in their nested
hierarchy, if we impose the probably reasonable constraint that for
any cpuset across which we require load balancing, we would want that
cpuset's cpus to be entirely contained within a single sched domain
partition.
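
To illustrate that constraint, here is a hedged sketch; representing a
candidate sched domain partition as an array of cpumasks is purely an
assumption for this example, not something that exists in the code above:

/*
 * Hypothetical: does this cpuset's cpus_allowed fit entirely inside one
 * element of a candidate partition?  If not, the cpuset is cut by a
 * partition boundary and load balancing across it would be lost.
 */
static int cpuset_fits_partition(const cpumask_t *cs_cpus,
				 const cpumask_t *parts, int nparts)
{
	int i;

	for (i = 0; i < nparts; i++)
		if (cpus_subset(*cs_cpus, parts[i]))
			return 1;
	return 0;
}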

Earlier, such as last week and before, I had been operating under the
assumption that sched domain partitions were hierarchical too, so that a
partition boundary running right down the middle of my most active cpuset
didn't stop load balancing across that boundary, but just perhaps slowed
it down a bit, as it would only occur at some higher level in the
partition hierarchy, which presumably balanced less frequently.
Apparently this sched domain partition hierarchy was a figment of my
over-active imagination, along with the tooth fairy and Santa Claus.

Anyhow, if we consider that constraint (don't split or cut an active
cpuset across partitions) not only reasonable, but desirable to impose,
then integrating the sched domain partitioning with cpusets, as you
describe, would indeed seem "most logical."

> 2. The main change is that we don't allow tasks to be added to a cpuset
> if it has child cpusets that also have the sched_domain flag turned on
> (maybe return -EINVAL if the user tries to do that)

This I would not like. It's ok to have tasks in cpusets that are
cut by sched domain partitions (which is what I think you were getting
at), just so long as one doesn't mind that they don't load balance
across the partition boundaries.

For example, we -always- have several tasks per-cpu in the top cpuset.
These are the per-cpu kernel threads. They have zero interest in
load balancing, because they are pinned on a cpu, for their life.

Or, for a slightly more interesting example, one might have a sleeping
job (batch scheduler sent SIGPAUSE to all its threads) that is in a
cpuset cut by the current sched domain partitioning. Since that job is
not running, we don't care whether it gets good load balancing services
or not.

I still suspect we will just have to let the admin partition their
system as they will, and if they screw up their load balancing,
the best we can do is to make all this as transparent and simple
and obvious as we can, and wish them well.

One thing I'm sure of. The current (ab)use of the 'cpu_exclusive' flag
to define sched domain partitions is flunking the "transparent, simple
and obvious" test ;).

> I think the main issue is that most users don't have to
> do more than one level of partitioning (having to partition a system
> with no more than 16 - 32 cpus, mostly less)

Could you (or some sched domain wizard) explain to me why we would even
want sched domain partitions on such 'small' systems? I've been operating
under the (mis?)conception that these sched domain partitions were just
a performance band-aid for the humongous systems, where load balancing
across say 1024 CPUs was difficult to do efficiently.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 21:47:14

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Dinakar wrote:
> Clearly one issue remains, tasks that are already running at the top cpuset.
> Unless these are manually moved down to the correct cpuset heirarchy they
> will continue to have the problem as before.

I take it you are looking for some reasonable and acceptable
constraints to place on cpusets, sufficient to enable us to
make it impossible (or at least difficult) to botch the
load balancing.

You want to make it difficult to split an active cpuset, so as
to avoid the undesirable limiting of load balancing across such
partition boundaries.

I doubt we can find a way to do that. We'll have to let our
users make a botch of it.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 22:36:02

by Dinakar Guniguntala

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On Fri, Oct 20, 2006 at 02:41:53PM -0700, Paul Jackson wrote:
>
> > 2. The main change is that we don't allow tasks to be added to a cpuset
> > if it has child cpusets that also have the sched_domain flag turned on
> > (maybe return -EINVAL if the user tries to do that)
>
> This I would not like. It's ok to have tasks in cpusets that are
> cut by sched domain partitions (which is what I think you were getting
> at), just so long as one doesn't mind that they don't load balance
> across the partition boundaries.
>
> For example, we -always- have several tasks per-cpu in the top cpuset.
> These are the per-cpu kernel threads. They have zero interest in
> load balancing, because they are pinned on a cpu, for their life.

I cannot think of any reason why this change would affect per-cpu tasks.

>
> Or, for a slightly more interesting example, one might have a sleeping
> job (batch scheduler sent SIGPAUSE to all its threads) that is in a
> cpuset cut by the current sched domain partitioning. Since that job is
> not running, we don't care whether it gets good load balancing services
> or not.

Ok, here's when I think a system administrator would want to partition
sched domains. If there is an application that is very sensitive to
performance and latencies and would have very low tolerance for
interference from any other code running on the cpus, then the
admin would partition the sched domain and separate this application
from the rest of the system. (per-cpu threads obviously will
continue to run in the same domain as the app)

So in this example, clearly there is no sense in letting a batch job
run in the same sched domain as our application. Now let's say our
performance and latency sensitive application only runs during the
day, then the admin can turn off the sched domain flag and tear down
the sched domain for the night. This will then enable the batch job
running in the parent cpuset to get a chance to run on all the cpus.

Returning -EINVAL when trying to attach a job to the top cpuset, when it
has child cpusets with the sched_domain flag turned on, would mean that
the administrator knows that s/he does not have all of the cpus in that
cpuset for their use. However, attaching jobs (such as the batch job in
your example) to the top cpuset before doing any sched domain partitioning
would still let them make the best use of resources (sort of a backdoor).
If you feel that this puts too much of a restriction on the admin for
creating tasks such as the batch job, then we would have to do without it
(and just document the sched_domain flag and its effects).

>
> I still suspect we will just have to let the admin partition their
> system as they will, and if they screw up their load balancing,
> the best we can do is to make all this as transparent and simple
> and obvious as we can, and wish them well.
>
> One thing I'm sure of. The current (ab)use of the 'cpu_exclusive' flag
> to define sched domain partitions is flunking the "transparent, simple
> and obvious" test ;).

I think this is a case of one set of folks talking about <32 cpu systems
and another set talking about >512 cpu systems.

>
> > I think the main issue is that most users don't have to
> > do more than one level of partitioning (having to partition a system
> > with no more than 16 - 32 cpus, mostly less)
>
> Could you (or some sched domain wizard) explain to me why we would even
> want sched domain partitions on such 'small' systems? I've been operating
> under the (mis?)conception that these sched domain partitions were just
> a performance band-aid for the humongous systems, where load balancing
> across say 1024 CPUs was difficult to do efficiently.

Well, it makes a difference for applications that have an RT/performance
sensitive component that needs a sched domain of its own.

-Dinakar

2006-10-20 23:34:23

by Suresh Siddha

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

How about something like a use_cpus_exclusive flag in cpusets?

And whenever a child cpuset sets this use_cpus_exclusive flag, remove
that child cpuset's cpus from the parent cpuset and also from the
tasks which were running in the parent cpuset. We can probably allow this
to happen as long as the parent cpuset has at least one cpu left.

And if this use_cpus_exclusive flag is cleared in a cpuset, its pool of
cpus will be returned to the parent. We can perhaps have cpus_owned in
addition to cpus_allowed, to reflect what is being exclusively used and
owned (which combines all the exclusive cpus used by the parent and
children).

So effectively, a sched domain partition will get defined for each
cpuset having 'use_cpus_exclusive'.

And this is mostly in line with what anyone can expect from exclusive
cpu usage in a cpuset, right?
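
A rough sketch of the bookkeeping being proposed, using the cpumask
helpers seen elsewhere in this thread; 'use_cpus_exclusive' and a
'cpus_owned' field are assumptions of this illustration and do not exist
in the real cpuset code:

/*
 * Hypothetical: a child turns on use_cpus_exclusive, so the parent gives
 * up those cpus but remembers them in cpus_owned for when the flag is
 * later cleared.
 */
static void take_cpus_exclusive(struct cpuset *parent, struct cpuset *child)
{
	cpus_andnot(parent->cpus_allowed,
		    parent->cpus_allowed, child->cpus_allowed);
	cpus_or(parent->cpus_owned,		/* assumed new field */
		parent->cpus_owned, child->cpus_allowed);
	/*
	 * Tasks already running in the parent would also need their
	 * per-task cpus_allowed masks narrowed accordingly (not shown).
	 */
}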

Job manager/administrator/owner of the cpusets can set/reset the flags
depending on what cpusets/jobs are active.

Paul will this address your needs?

thanks,
suresh

2006-10-21 05:38:00

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

> And whenever a child cpuset sets this use_cpus_exclusive flag, remove
> that child cpuset's cpus from the parent cpuset and also from the
> tasks which were running in the parent cpuset. We can probably allow this
> to happen as long as the parent cpuset has at least one cpu left.

Why are you seeking out obfuscated means and gratuitous entanglements
with cpuset semantics, in order to accomplish a straightforward end -
defining the sched domain partitioning?

If we are going to add or modify the meaning of per-cpuset flags in
order to determine sched domain partitioning, then we should do so in
the most straight forward way possible, which by all accounts seems to
be adding a 'sched_domain' flag to each cpuset, indicating whether it
delineates a sched domain partition. The kernel would enforce a rule
that the CPUs in the cpusets so marked could not overlap. The kernel
in return would promise not to split the CPUs in any cpuset so marked
into more than one sched domain partition, with the consequence that
the kernel would be able to load balance across all the CPUs contained
within any such partition.
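
A hedged sketch of that non-overlap rule, in the style of validate_change()
in kernel/cpuset.c; is_sched_domain() and the flag itself are assumptions
of this illustration:

/*
 * Hypothetical: walk the cpuset tree from 'root' (e.g. &top_cpuset) and
 * report whether 'trial', marked with the proposed sched_domain flag,
 * overlaps any other cpuset so marked.  validate_change() could then
 * return -EINVAL on overlap.
 */
static int sched_domain_cpus_overlap(struct cpuset *root, struct cpuset *trial)
{
	struct cpuset *c;

	list_for_each_entry(c, &root->children, sibling) {
		if (c != trial && is_sched_domain(c) &&
		    cpus_intersects(trial->cpus_allowed, c->cpus_allowed))
			return 1;
		if (sched_domain_cpus_overlap(c, trial))
			return 1;
	}
	return 0;
}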

Why do something less straightforward than that?

Meanwhile ...

If the existing cpuset semantics implied a real and useful partitioning
of the systems CPUs, as Nick had been figuring it did, then yes it
might make good sense to implicitly and automatically leverage this
cpuset partitioning when partitioning the sched domains.

But cpuset semantics, quite deliberately on my part, don't imply any
such system wide partitioning of CPUs.

So one of:
1) we don't need sched domain partitioning after all, or
2) this sched domain partitioning takes on a hierarchical nested shape
that fits better with cpusets, or
3) we provide a "transparent, simple and obvious" API to user space,
so it can define the sched domain partitioning.

And in any case, we should first take a look at the rest of this sched
domain code, as I have been led to believe it provides some nice
opportunities for refinement, before we go fussing over the details of
any kernel-user API's it might need.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-21 18:23:20

by Paul Menage

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On 10/19/06, Martin Bligh <[email protected]> wrote:
>
> > I don't know of anyone else using cpusets, but I'd be interested to know.
>
> We (Google) are planning to use it to do some partitioning, albeit on
> much smaller machines. I'd really like to NOT use cpus_allowed from
> previous experience - if we can get it to partition using separated
> sched domains, that would be much better.

Actually, what we'd really like is to be able to set cpus_allowed in
arbitrary ways (we're already doing this via sched_setaffinity() -
doing it via cpusets would just be an optimization when changing cpu
masks) and have the scheduler automatically do balancing efficiently.
In some cases sched domains might be appropriate, but in most of the
cases we have today, we have a job that's running with a CPU reserved
for itself but also has access to a "public" CPU, and some CPUs are
not public, but shared amongst a set of jobs. I'm not very familiar
with the sched domains code but I guess it doesn't handle overlapping
cpu masks very well?

Paul

2006-10-21 20:55:42

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

> I'm not very familiar
> with the sched domains code but I guess it doesn't handle overlapping
> cpu masks very well?

As best as I can tell, the two motivations for explicitly setting
sched domain partitions are:
1) isolating cpus for real time uses very sensitive to any interference,
2) handling load balancing on huge CPU counts, where the worse than linear
algorithms start to hurt.

The load balancing algorithms apparently should be close to linear, but
in the presence of many disallowed cpus (0 bits in a task's cpus_allowed),
I guess they have to work harder.

I still have little confidence that I understand this. Maybe if I say
enough stupid things about the scheduler domains and load balancing,
someone will get annoyed and try to educate me ;). Best of luck to
them.

It doesn't sound to me like your situation is a real time, very low
latency or jitter sensitive application.

How many CPUs are you juggling? My utterly naive expectation would be
that dozens of CPUs should not need explicit sched domain partitioning,
but that hundreds of them would benefit from reduced time spent in
kernel/sched.c code if the sched domains were able to be partitioned
down to a significantly smaller size.

The only problem I can see that overlapping cpus_allowed masks presents
to this is that it inhibits partitioning down to smaller sched domains.
Apparently these partitions are system-wide hard partitions, such that
no load balancing occurs across partitions, so we should avoid creating
a partition that cuts across some tasks cpus_allowed mask.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-21 20:59:48

by Paul Menage

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On 10/21/06, Paul Jackson <[email protected]> wrote:
>
> How many CPUs are you juggling?

Not many by your standards - less than eight in general.

Paul

2006-10-21 23:05:50

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Suresh wrote:
> And whenever a child cpuset sets this use_cpus_exclusive flag, remove
> that child cpuset's cpus from the parent cpuset and also from the ..

That reminds me a little of Dinakar's first patch to partition sched
domains based on the cpuset configuration:

Subject: [Lse-tech] [RFC PATCH] Dynamic sched domains aka Isolated cpusets
From: Dinakar Guniguntala <[email protected]>
Date: Tue, 19 Apr 2005 01:56:44 +0530
http://lkml.org/lkml/2005/4/18/187

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-22 10:52:09

by Paul Jackson

Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Martin wrote:
> We (Google) are planning to use it to do some partitioning, albeit on
> much smaller machines. I'd really like to NOT use cpus_allowed from
> previous experience - if we can get it to partition using separated
> sched domains, that would be much better.

Why not use cpus_allowed for this, via cpusets and/or sched_setaffinity?

In the followup to this between Paul M. and myself, I wrote:
> As best as I can tell, the two motivations for explicitly setting
> sched domain partitions are:
> 1) isolating cpus for real time uses very sensitive to any interference,
> 2) handling load balancing on huge CPU counts, where the worse than linear
> algorithms start to hurt.
> ...
> How many CPUs are you juggling?

And Paul M. replied:
> Not many by your standards - less than eight in general.

So ... it would seem you have neither huge CPU counts nor real time
sensitivities.

So why not use cpus_allowed?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-22 12:03:11

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Dinikar wrote:
> On Fri, Oct 20, 2006 at 02:41:53PM -0700, Paul Jackson wrote:
> >
> > > 2. The main change is that we dont allow tasks to be added to a cpuset
> > > if it has child cpusets that also have the sched_domain flag turned on
> > > (Maybe return a EINVAL if the user tries to do that)
> >
> > This I would not like. It's ok to have tasks in cpusets that are
> > cut by sched domain partitions (which is what I think you were getting
> > at), just so long as one doesn't mind that they don't load balance
> > across the partition boundaries.
> >
> > For example, we -always- have several tasks per-cpu in the top cpuset.
> > These are the per-cpu kernel threads. They have zero interest in
> > load balancing, because they are pinned on a cpu, for their life.
>
> I cannot think of any reason why this change would affect per-cpu tasks.

You are correct that cpu pinned tasks don't mind not being load balanced.

I was reading too much into what you had suggested.

You suggested not adding tasks to cpusets that have child cpusets
defining sched domains.

But the fate of tasks that were already there was an open issue.

That kind of ordering dependency struck me as so odd that I
leapt to the false conclusion that you couldn't possibly have
meant it, and really meant to disallow any tasks in such a cpuset,
whether they were there before we set up the sched_domain or not.

So ... never mind that comment.


Popping the stack, yes, as a practical matter, when setting up
some nodes to be used by real time software, we offload everything
else we can from those nodes, to minimize the interference.

We don't need more error conditions from the kernel to handle that,
not even on big systems. We just need some way to isolate those
nodes (cpu and memory) from any scheduler load balancing efforts.

This is easy. It can be done now at boot time (isolcpus=), and at
runtime with the existing cpu_exclusive overloading. It essentially
just involves setting a single system-wide cpumask of isolated cpus.
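
(Rough sketch of that single-mask idea, modeled loosely on the existing
isolcpus= handling; build_balanced_domains is a made-up name and this
is not the actual kernel code:)

#include <linux/cpumask.h>

/* One system-wide mask of cpus excluded from load balancing. */
static cpumask_t cpu_isolated_map = CPU_MASK_NONE;

/*
 * Build sched domains only over the cpus that are not isolated, so
 * that isolated cpus end up in no balancing domain at all.
 */
static void build_balanced_domains(const cpumask_t *cpu_map)
{
        cpumask_t balanced;

        cpus_andnot(balanced, *cpu_map, cpu_isolated_map);
        /* ... construct sched domains spanning 'balanced' only ... */
}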


In addition, large systems (apparently for scheduler load balancing
performance reasons) need a way to partition sched domains down to
more reasonable sizes.

Partitioning sched domains interacts with cpusets in ways that are
still confounding us, and depends critically on information that really
only system services such as the batch scheduler can provide (such as
which jobs are active). And this, even at a mathematical minimum,
involves a partition of the system's cpus, which is a set of subsets,
or multiple cpumasks instead of just one. It's harder.
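
(In code terms such a partitioning is just an array of pairwise
disjoint cpumasks; a hypothetical validity check, illustration only and
not kernel code, might look like this:)

#include <linux/cpumask.h>

/*
 * Hypothetical check: the proposed partition masks must not share
 * any cpu.  'parts' is an array of 'nparts' cpumasks.
 */
static int partitions_are_disjoint(const cpumask_t *parts, int nparts)
{
        cpumask_t seen = CPU_MASK_NONE;
        cpumask_t overlap;
        int i;

        for (i = 0; i < nparts; i++) {
                cpus_and(overlap, seen, parts[i]);
                if (!cpus_empty(overlap))
                        return 0;       /* some cpu is in two partitions */
                cpus_or(seen, seen, parts[i]);
        }
        return 1;
}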

My sense is that your proposal, to get rid of (or not allow more of)
the tasks in the overlapping parent cpusets that were complicating
this effort, just makes life harder for such system services, forcing
them to work around our efforts to make life 'safe' for them.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 03:10:21

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Sorry Dinakar - I've been misspelling your name (Dinikar).

My bad.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 04:51:38

by Suresh Siddha

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On Fri, Oct 20, 2006 at 10:37:38PM -0700, Paul Jackson wrote:
> > And whenever a child cpuset sets this use_cpus_exclusive flag, remove
> > those set of child cpuset cpus from parent cpuset and also from the
> > tasks which were running in the parent cpuset. We can probably allow this
> > to happen as long as parent cpuset has atleast one cpu.
>
> Why are you seeking out obfuscated means and gratuitous entanglements
> with cpuset semantics, in order to accomplish a straight forward end -
> defining the sched domain partitioning?
>
> If we are going to add or modify the meaning of per-cpuset flags in
> order to determine sched domain partitioning, then we should do so in
> the most straight forward way possible, which by all accounts seems to
> be adding a 'sched_domain' flag to each cpuset, indicating whether it
> delineates a sched domain partition. The kernel would enforce a rule
> that the CPUs in the cpusets so marked could not overlap. The kernel
> in return would promise not to split the CPUs in any cpuset so marked
> into more than one sched domain partition, with the consequence that
> the kernel would be able to load balance across all the CPUs contained
> within any such partition.
>
> Why do something less straightforward than that?

Ok, I went into implementation details (and ended up less
straightforward), but my main intention was to say that we need to
retain some sort of hierarchical shape too while creating these domain
partitions.

For example, a big system can be divided into different groups of
cpus, one for each department in an organisation. Internally, based on
its needs, each department can divide its pool of cpus into sub-groups
and allocate them to much smaller groups. As sub-groups are created
and deleted, cpus move from the department to the sub-groups and vice
versa.

Users probably want both flat and hierarchical partitions. And in this
partitioning mechanism, we should never allow a cpu to be present in
more than one partition.
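
(For what it's worth, that hierarchical shape is what the cpuset
filesystem already expresses.  A toy user-space illustration, assuming
a cpuset filesystem mounted at /dev/cpuset; the directory names are
made up, and whether any sched-domain style flag belongs in there is
exactly the open question of this thread:)

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);
}

int main(void)
{
        /* Department gets cpus 0-7, one sub-group gets cpus 0-3 of
         * those.  Error handling omitted for brevity. */
        mkdir("/dev/cpuset/dept_a", 0755);
        write_file("/dev/cpuset/dept_a/cpus", "0-7");
        write_file("/dev/cpuset/dept_a/mems", "0");

        mkdir("/dev/cpuset/dept_a/subgroup_1", 0755);
        write_file("/dev/cpuset/dept_a/subgroup_1/cpus", "0-3");
        write_file("/dev/cpuset/dept_a/subgroup_1/mems", "0");

        return 0;
}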

thanks,
suresh

2006-10-23 05:47:21

by Suresh Siddha

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On Sun, Oct 22, 2006 at 03:51:35AM -0700, Paul Jackson wrote:
> Martin wrote:
> > We (Google) are planning to use it to do some partitioning, albeit on
> > much smaller machines. I'd really like to NOT use cpus_allowed from
> > previous experience - if we can get it to to partition using separated
> > sched domains, that would be much better.
>
> Why not use cpus_allowed for this, via cpusets and/or sched_setaffinity?

group of pinned tasks can completely skew the system load balancing..

2006-10-23 05:55:14

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Suresh wrote:
> group of pinned tasks can completely skew the system load balancing..

Ah - yes. That was a problem. If the load balancer couldn't offload
tasks from one or two of the most loaded CPUs (perhaps because they
were pinned.) it tended to give up.

I believe that Christoph is actively working that problem. Adding him
to the cc list, so he can explain the state of this work more
accurately.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 05:59:51

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Suresh wrote:
> Ok, I went into implementation details (and ended up less
> straightforward), but my main intention was to say that we need to
> retain some sort of hierarchical shape too while creating these
> domain partitions.

Good points.

Getting cpusets to work in a hierarchical organization managing a large
system is a key goal of mine.

That means shaping the APIs so that they fit the structure of various
users, so that the right person or program can make the right decision
at the right time, easily, and have it all work.

Take a look at my "no need to load balance" flag idea, in my post
a few minutes ago responding to Nick. That feels to me like it
might be an API that fits the users space, understanding and needs
well, while still giving us what we need to be able to reduce the
size of sched domain partitions on huge systems, where possible.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 06:02:38

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Paul Jackson wrote:
> Suresh wrote:
>
>>group of pinned tasks can completely skew the system load balancing..
>
>
> Ah - yes. That was a problem. If the load balancer couldn't offload
> tasks from one or two of the most loaded CPUs (perhaps because they
> were pinned.) it tended to give up.
>
> I believe that Christoph is actively working that problem. Adding him
> to the cc list, so he can explain the state of this work more
> accurately.

It is somewhat improved. The load balancing will now retry other CPUs,
but this is pretty costly in terms of latency and rq lock hold time.
And the algorithm itself still breaks down if you have lots of pinned
tasks, even if the load balancer is willing to try lesser loaded cpus.

--
SUSE Labs, Novell Inc.

2006-10-23 06:04:02

by Suresh Siddha

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On Sun, Oct 22, 2006 at 10:54:56PM -0700, Paul Jackson wrote:
> Suresh wrote:
> > group of pinned tasks can completely skew the system load balancing..
>
> Ah - yes. That was a problem. If the load balancer couldn't offload
> tasks from one or two of the most loaded CPUs (perhaps because they
> were pinned.) it tended to give up.
>
> I believe that Christoph is actively working that problem. Adding him
> to the cc list, so he can explain the state of this work more
> accurately.

Pinned tasks can cause a number of challenges for the scheduler.

Christoph has recently addressed one such issue, and even that only
partially.

It is very difficult to nicely and uniformly distribute load that is
pinned to a group of cpus.

2006-10-23 06:16:40

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Nick wrote:
> It is somewhat improved. The load balancing will now retry other CPUs,
> but this is pretty costly

Ah - ok. Sounds like a sticky problem.

I am beginning to appreciate Martin's preference for not using
cpus_allowed to pin tasks when load balancing is also needed.

For the big HPC apps that I worry about the most, with hundreds of
parallel, closely coupled threads, one per cpu, we pin all over the
place. But we make very little use of load balancing in that
situation, with one compute bound thread per cpu, humming along for
hours. The scheduler pretty quickly figures out that it has no
useful load balancing to do.

On the other hand, as someone already noted, one can't simulate pinning
to overlapping cpus_allowed masks using overlapping sched domains, as
tasks can just wander off onto someone else's cpu that way.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 16:01:50

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On Sun, 22 Oct 2006, Paul Jackson wrote:

> I believe that Christoph is actively working that problem. Adding him
> to the cc list, so he can explain the state of this work more
> accurately.

That issue was fixed by retrying load balancing without the cpu that has
all processes pinned.

2006-10-23 16:03:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

On Mon, 23 Oct 2006, Nick Piggin wrote:

> It is somewhat improved. The load balancing will now retry other CPUs,
> but this is pretty costly in terms of latency and rq lock hold time.
> And the algorithm itself still breaks down if you have lots of pinned
> tasks, even if the load balancer is willing to try lesser loaded cpus.

We would need a way of traversing the processors by load in order
to avoid that. John Hawkes solved this issue earlier, in 2.4.x, by
managing a list of processors ordered by their load.
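
(The idea, very loosely sketched as a toy user-space illustration, not
Hawkes' actual 2.4 code:)

#include <stdlib.h>

#define NR_CPUS_DEMO 8

static unsigned long cpu_load[NR_CPUS_DEMO];

/* qsort comparator: higher load sorts earlier (busiest first). */
static int busier_first(const void *a, const void *b)
{
        unsigned long la = cpu_load[*(const int *)a];
        unsigned long lb = cpu_load[*(const int *)b];

        return (la < lb) - (la > lb);
}

/* Fill 'order' with cpu ids, busiest first, so a balancer could walk
 * down the list instead of giving up when the busiest cpu only has
 * pinned tasks. */
static void order_cpus_by_load(int order[NR_CPUS_DEMO])
{
        int i;

        for (i = 0; i < NR_CPUS_DEMO; i++)
                order[i] = i;
        qsort(order, NR_CPUS_DEMO, sizeof(int), busier_first);
}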


2006-11-09 10:59:58

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets

Andrew,

This patch is currently residing in your *-mm stack, as:

cpuset-remove-sched-domain-hooks-from-cpusets.patch

If it's easy for you to keep track of, I'd like to ask that you not
push this to Linus until Dinakar and I (with the consent of various
mm experts) settle on the replacement mechanism for dealing with
sched domain partitioning (or whatever that turns into).

At the rate Dinakar and I are progressing, this likely means that this
"... remove ... hooks ..." patch will be sitting in *-mm through the
2.6.20 work, and go on to Linus for 2.6.21.

There are some folks actually depending on this mechanism, such as
some real time folks using this to inhibit load balancing on their
isolated CPUs. It would be polite not to yank out one mechanism
before its replacement is available.

If this sounds like a pain, or you'd just rather not babysit this in
*-mm that long, then I'd rather have you drop the patch than send it
to Linus without an accompanying replacement.

All else equal, I kind of like leaving this patch in *-mm for now, as
it puts a stake in the ground, indicating that the current connection
between the per-cpuset "cpu_exclusive" flag and sched domain partitions
is sitting on death row.

There seems to be a general consensus that death row is a good place
for that.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401