2006-10-19 09:26:13

by Paul Jackson

Subject: [RFC] cpuset: add interface to isolated cpus

From: Paul Jackson <[email protected]>

Enable user code to isolate CPUs on a system from the domains that
determine scheduler load balancing.

This is already doable using the boot parameter "isolcpus=". The folks
running realtime code on production systems, where some nodes are
isolated for realtime, and some not, and where it is unacceptable
to reboot to adjust this, need to be able to change which CPUs are
isolated from the scheduler balancing code on the fly.

This is done by exposing the kernel's cpu_isolated_map as a cpumask
in the root cpuset directory, in a file called 'isolated_cpus',
where it can be read and written by sufficiently privileged user code.
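
For example, a sufficiently privileged program could drive this file
along the following lines. This is only an illustrative userspace
sketch, not part of the patch; it assumes the cpuset file system is
mounted at the conventional /dev/cpuset, and it uses the cpu list
syntax ("2-3", "0,2,4-7", ...) implied by the cpulist_parse() and
cpulist_scnprintf() calls below:

	#include <stdio.h>

	int main(void)
	{
		char buf[256];
		FILE *f;

		/* report the current contents of cpu_isolated_map */
		f = fopen("/dev/cpuset/isolated_cpus", "r");
		if (!f) {
			perror("isolated_cpus (read)");
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("currently isolated: %s", buf);
		fclose(f);

		/* now isolate CPUs 2 and 3 */
		f = fopen("/dev/cpuset/isolated_cpus", "w");
		if (!f) {
			perror("isolated_cpus (write)");
			return 1;
		}
		fprintf(f, "2-3\n");
		if (fclose(f) != 0)
			perror("isolated_cpus (write)");
		return 0;
	}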

Signed-off-by: Paul Jackson <[email protected]>

---

I have built and booted this on one system, ia64/sn2, and
verified that I can set and query the cpu_isolated_map,
and that tasks are not scheduled on such isolated CPUs
unless I explicitly place them there.
-pj

Documentation/cpusets.txt | 26 +++++++++++++++++++++++---
Documentation/kernel-parameters.txt | 4 ++++
include/linux/sched.h | 3 +++
kernel/cpuset.c | 31 +++++++++++++++++++++++++++++++
kernel/sched.c | 21 ++++++++++++++++++++-
5 files changed, 81 insertions(+), 4 deletions(-)

--- 2.6.19-rc1-mm1.orig/Documentation/cpusets.txt 2006-10-18 21:24:22.000000000 -0700
+++ 2.6.19-rc1-mm1/Documentation/cpusets.txt 2006-10-18 21:24:22.000000000 -0700
@@ -19,7 +19,8 @@ CONTENTS:
1.5 What does notify_on_release do ?
1.6 What is memory_pressure ?
1.7 What is memory spread ?
- 1.8 How do I use cpusets ?
+ 1.8 What is isolated_cpus ?
+ 1.9 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
@@ -174,8 +175,9 @@ containing the following files describin
- notify_on_release flag: run /sbin/cpuset_release_agent on exit?
- memory_pressure: measure of how much paging pressure in cpuset

-In addition, the root cpuset only has the following file:
+In addition, the root cpuset only has the following files:
- memory_pressure_enabled flag: compute memory_pressure?
+ - isolated_cpus: cpus excluded from scheduler domains

New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed
@@ -378,7 +380,25 @@ data set, the memory allocation across t
can become very uneven.


-1.8 How do I use cpusets ?
+1.8 What is isolated_cpus ?
+---------------------------
+
+The cpumask isolated_cpus, visible in the top cpuset, provides direct
+access to the kernel's cpu_isolated_map (in kernel/sched.c), the set
+of CPUs excluded from scheduler domains. The scheduler will not
+load balance tasks on isolated cpus. The 'isolated_cpus' file in the
+root cpuset can be read to obtain the current setting of
+cpu_isolated_map, and written to install a new cpumask in
+cpu_isolated_map.
+
+The initial value of cpu_isolated_map can be set by the kernel
+boot-time argument "isolcpus=n1,n2,n3,..." where n1, n2, n3, etc.
+are the decimal numbers of the CPUs to be isolated. If not set
+on the kernel boot line, cpu_isolated_map defaults to the empty
+cpumask. See Documentation/kernel-parameters.txt for further
+isolcpus documentation.
+
+1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
--- 2.6.19-rc1-mm1.orig/Documentation/kernel-parameters.txt 2006-10-15 00:24:58.000000000 -0700
+++ 2.6.19-rc1-mm1/Documentation/kernel-parameters.txt 2006-10-18 21:24:22.000000000 -0700
@@ -731,6 +731,10 @@ and is between 256 and 4096 characters.
tasks in the system -- can cause problems and
suboptimal load balancer performance.

+ If CPUSETS is configured, then the set of isolated
+ CPUs can be manipulated after system boot using the
+ "isolated_cpus" cpumask in the top cpuset.
+
isp16= [HW,CD]
Format: <io>,<irq>,<dma>,<setup>

--- 2.6.19-rc1-mm1.orig/include/linux/sched.h 2006-10-18 21:24:22.000000000 -0700
+++ 2.6.19-rc1-mm1/include/linux/sched.h 2006-10-18 21:24:22.000000000 -0700
@@ -715,6 +715,9 @@ struct sched_domain {
#endif
};

+extern const cpumask_t *sched_get_isolated_cpus(void);
+extern int sched_set_isolated_cpus(const cpumask_t *isolated_cpus);
+
/*
* Maximum cache size the migration-costs auto-tuning code will
* search from:
--- 2.6.19-rc1-mm1.orig/kernel/cpuset.c 2006-10-18 21:24:22.000000000 -0700
+++ 2.6.19-rc1-mm1/kernel/cpuset.c 2006-10-18 21:24:22.000000000 -0700
@@ -974,6 +974,21 @@ static int update_memory_pressure_enable
}

/*
+ * Call with manage_mutex held.
+ */
+
+static int update_isolated_cpus(struct cpuset *cs, char *buf)
+{
+ cpumask_t isolated_cpus;
+ int retval;
+
+ retval = cpulist_parse(buf, isolated_cpus);
+ if (retval < 0)
+ return retval;
+ return sched_set_isolated_cpus(&isolated_cpus);
+}
+
+/*
* update_flag - read a 0 or a 1 in a file and update associated flag
* bit: the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
* CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
@@ -1215,6 +1230,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_ISOLATED_CPUS,
FILE_TASKLIST,
} cpuset_filetype_t;

@@ -1282,6 +1298,9 @@ static ssize_t cpuset_common_file_write(
retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
cs->mems_generation = cpuset_mems_generation++;
break;
+ case FILE_ISOLATED_CPUS:
+ retval = update_isolated_cpus(cs, buffer);
+ break;
case FILE_TASKLIST:
retval = attach_task(cs, buffer, &pathbuf);
break;
@@ -1397,6 +1416,10 @@ static ssize_t cpuset_common_file_read(s
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
+ case FILE_ISOLATED_CPUS:
+ s += cpulist_scnprintf(page, PAGE_SIZE,
+ *sched_get_isolated_cpus());
+ break;
default:
retval = -EINVAL;
goto out;
@@ -1770,6 +1793,11 @@ static struct cftype cft_spread_slab = {
.private = FILE_SPREAD_SLAB,
};

+static struct cftype cft_isolated_cpus = {
+ .name = "isolated_cpus",
+ .private = FILE_ISOLATED_CPUS,
+};
+
static int cpuset_populate_dir(struct dentry *cs_dentry)
{
int err;
@@ -1960,6 +1988,9 @@ int __init cpuset_init(void)
/* memory_pressure_enabled is in root cpuset only */
if (err == 0)
err = cpuset_add_file(root, &cft_memory_pressure_enabled);
+ /* isolated_cpus is in root cpuset only */
+ if (err == 0)
+ err = cpuset_add_file(root, &cft_isolated_cpus);
out:
return err;
}
--- 2.6.19-rc1-mm1.orig/kernel/sched.c 2006-10-18 21:24:22.000000000 -0700
+++ 2.6.19-rc1-mm1/kernel/sched.c 2006-10-18 21:49:41.000000000 -0700
@@ -5608,7 +5608,7 @@ static void cpu_attach_domain(struct sch
}

/* cpus with isolated domains */
-static cpumask_t __cpuinitdata cpu_isolated_map = CPU_MASK_NONE;
+static cpumask_t cpu_isolated_map = CPU_MASK_NONE;

/* Setup the mask of cpus configured for isolated domains */
static int __init isolated_cpu_setup(char *str)
@@ -6866,6 +6866,25 @@ void __init sched_init_smp(void)
if (set_cpus_allowed(current, non_isolated_cpus) < 0)
BUG();
}
+
+const cpumask_t *sched_get_isolated_cpus(void)
+{
+ return &cpu_isolated_map;
+}
+
+int sched_set_isolated_cpus(const cpumask_t *isolated_cpus)
+{
+ cpumask_t newly_isolated_cpus;
+ int err;
+
+ lock_cpu_hotplug();
+ cpus_and(newly_isolated_cpus, cpu_online_map, *isolated_cpus);
+ detach_destroy_domains(&newly_isolated_cpus);
+ cpu_isolated_map = *isolated_cpus;
+ err = arch_init_sched_domains(&cpu_online_map);
+ unlock_cpu_hotplug();
+ return err;
+}
#else
void __init sched_init_smp(void)
{

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401


2006-10-19 10:17:23

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
> From: Paul Jackson <[email protected]>
>
> Enable user code to isolate CPUs on a system from the domains that
> determine scheduler load balancing.
>
> This is already doable using the boot parameter "isolcpus=". The folks
> running realtime code on production systems, where some nodes are
> isolated for realtime, and some not, and where it is unacceptable
> to reboot to adjust this, need to be able to change which CPUs are
> isolated from the scheduler balancing code on the fly.
>
> This is done by exposing the kernel's cpu_isolated_map as a cpumask
> in the root cpuset directory, in a file called 'isolated_cpus',
> where it can be read and written by sufficiently privileged user code.

This should be done outside cpusets.

--
SUSE Labs, Novell Inc.

2006-10-19 17:55:47

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick wrote:
> This should be done outside cpusets.

So ... where should it be done?

And what would be better about that other place?

And by the way, I see in another patch you put some other
stuff for configuring sched domains in the root cpuset.
If the root cpuset was the right place for that, why isn't
it the right place for this? ... strange.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 18:07:30

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
> Nick wrote:
>
>>This should be done outside cpusets.
>
>
> So ... where should it be done?

sched.c I suppose.

>
> And what would be better about that other place?

Because it is not a cpuset specific feature.

--
SUSE Labs, Novell Inc.

2006-10-19 18:57:16

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

> > So ... where should it be done?
>
> sched.c I suppose.

Are we discussing where the implementing code should go,
or where the isolated cpu map special file should be
exposed to user space?

And you didn't answer my other questions, such as:
1) If your other patch to manipulate sched domains
has code that belongs in kernel/cpuset.c, and
special files that belong in /dev/cpuset, why
shouldn't this one naturally go in the same places?
2) Why ... why? What would be better about sched.c
and what's wrong with where it is (the code and
the exposed file)?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-19 19:04:09

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
>>>So ... where should it be done?
>>
>>sched.c I suppose.
>
>
> Are we discussing where the implementing code should go,
> or where the isolated cpu map special file should be
> exposed to user space?

Both?

> And you didn't answer my other questions, such as:
> 1) If your other patch to manipulate sched domains
> has code that belongs in kernel/cpuset.c, and
> special files that belong in /dev/cpuset, why
> shouldn't this one naturally go in the same places?

Because they are cpuset specific. This is not.

> 2) Why ... why? What would be better about sched.c
> and what's wrong with where it is (the code and
> the exposed file)?

Because it is not specific to CONFIG_CPUSETS. People who
don't configure CONFIG_CPUSETS may still want to change
isolcpus at runtime.

--
SUSE Labs, Novell Inc.

2006-10-20 03:38:09

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

> > 1) If your other patch to manipulate sched domains
> > has code that belongs in kernel/cpuset.c, and
> > special files that belong in /dev/cpuset, why
> > shouldn't this one naturally go in the same places?
>
> Because they are cpuset specific. This is not.

Bizarre. That other patch of yours, that had cpu_exclusive cpusets
only at the top level defining sched domain partitions, is cpuset
specific only because you're hijacking the cpu_exclusive flag to add
new semantics to it, and then claiming you're not adding new semantics
to it, and then claiming we should be doing this all implicitly without
additional kernel-user API apparatus, and then claiming that if your
new abuse of cpuset flags intrudes on other usage patterns of cpusets
or if they have need of a particular sched domain partitioning that
they have to explicitly set up top level cpusets to get a desired sched
domain partitioning ...

Stop already ... this is getting insane.

Where are we?

By now, I think everyone (that is still following this thread ;)
seems to agree that the current code linking cpu_exclusive and
sched domain partitions is borked.

I responded to that realization earlier this week by proposing
a patch to add new sched_domain flags to each cpuset.

Suresh responded to this realization by writing:
> ...I don't know much about how job manager interacts with
> cpusets but this behavior sounds bad to me.

You responded by proposing a patch to use just the cpu_exclusive
flags in the top level cpusets to define sched domain partitions.

The current code linking cpu_exclusive and sched domain partitions
is borked, and serving almost no purpose. One of its goals was to
provide a way to workaround the scaling issues with the scheduler
load balancer code on humongous CPU count systems. This current code
hasn't helped us a bit there. We're starting to address these scaling
issues in various other ways.

So far as I can tell, the only actual successful application of this
current code to manipulate sched domain partitions has been by a real
time group, who have managed to isolate CPUs from load balancing with
this.

Everyone rejects the current code, and so far, for every replacement
suggested, at least one person rejects it. You don't like my patches,
and I don't like yours.

Sounds to me like we don't have consensus yet ;). Agreed?

And it's not just the code that would try to define sched domains
via some implicit or explicit cpuset hooks that is in question here.

The sched domain related code in kernel/sched.c is overly complicated
and ifdef'd. I can't personally critique that code, because my brain
is too small. After repeated efforts to make sense of it, I still don't
understand it. But I have it on good authority from colleagues whose
brains are bigger than mine that this code could use an overhaul.

So ... not only do we not have consensus on the specific patch to control
the definition of sched domain partitions, this is moreover part of a larger
ball of yarn that should be unraveled.

We'd be fools to try to settle on the specifics of the (possibly cpuset
related) API to define sched domain partitions in isolation. We should
do that as part of such an overhaul of the rest of the sched domain code.

The current implicit hooks between cpu_exclusive cpusets and sched domain
partitions should be removed, as proposed in my patch:

[RFC] cpuset: remove sched domain hooks from cpusets

I will submit this patch to Andrew for inclusion in *-mm. It doesn't
change any explicit kernel API's - just removes the sched domain
side effects of the cpu_exclusive flag, which are currently borked
and more dangerous than useful. So long as any rework we do provides
some way to isolate CPUs from load balancing sometime in the next
several months, before whenever the one real time group doing this
needs to see it, no one will miss these current sched domain hooks in
cpusets.

I will agree to NAK your (Nick's) patch to overload the cpu_exclusive
flags in the top level cpusets with sched domain partitioning side
effects, if you agree to NAK my patches to create an 'isolated_cpus'
file in the top cpuset.

My colleague Christoph Lameter will be looking into overhauling some of
this sched domain code. I am firmly convinced that any kernel
API's, implicit or explicit, cpuset related or not, involving these
sched domains and their partitioning should arise naturally out of that
effort, and not be band-aided on ahead of time.

Let's agree to give Christoph a clean shot at this, and join him in
this effort where we can.

Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 08:03:10

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
>>> 1) If your other patch to manipulate sched domains
>>> has code that belongs in kernel/cpuset.c, and
>>> special files that belong in /dev/cpuset, why
>>> shouldn't this one naturally go in the same places?
>>
>>Because they are cpuset specific. This is not.
>
>
> Bizarre. That other patch of yours, that had cpu_exclusive cpusets
> only at the top level defining sched domain partitions, is cpuset

I'm talking about isolcpus. What do isolcpus have to do with cpuset.c?
You can turn off CONFIG_CPUSETS and still use isolcpus, can't you?

> specific only because you're hijacking the cpu_exclusive flag to add
> new semantics to it, and then claiming you're not adding new semantics
> to it, and then claiming we should be doing this all implicitly without
> additional kernel-user API apparatus, and then claiming that if your
> new abuse of cpuset flags intrudes on other usage patterns of cpusets
> or if they have need of a particular sched domain partitioning that
> they have to explicitly set up top level cpusets to get a desired sched
> domain partitioning ...
>
> Stop already ... this is getting insane.
>
> Where are we?

I'll start again...

set_cpus_allowed is a feature of the scheduler that allows you to
restrict one task to a subset of all cpus. Right?

And cpusets uses this interface as the mechanism to implement the
semantics which the user has asked for. Yes?

sched-domains partitioning is a feature of the scheduler that
allows you to restrict zero or more tasks to the partition, and
zero or more tasks to the complement of the partition. OK?

So if you have a particular policy you need to implement, which is
one cpu_exclusive cpuset off the root, covering half the cpus in
the system (as a simple example)... why is it good to implement
that with set_cpus_allowed and bad to implement it with partitions?

Or, another question, how does my patch hijack cpus_allowed? In
what way does it change the semantics of cpus_allowed?

--
SUSE Labs, Novell Inc.

2006-10-20 14:52:57

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick Piggin wrote:

> set_cpus_allowed is a feature of the scheduler that allows you to
> restrict one task to a subset of all cpus. Right?
>
> And cpusets uses this interface as the mechanism to implement the
> semantics which the user has asked for. Yes?
>
> sched-domains partitioning is a feature of the scheduler that
> allows you to restrict zero or more tasks to the partition, and
> zero or more tasks to the complement of the partition. OK?
>
> So if you have a particular policy you need to implement, which is
> one cpu_exclusive cpuset off the root, covering half the cpus in
> the system (as a simple example)... why is it good to implement
> that with set_cpus_allowed and bad to implement it with partitions?
>
> Or, another question, how does my patch hijack cpus_allowed? In
> what way does it change the semantics of cpus_allowed?

That should be, in what way does it change the semantics of cpusets
in any way?

IOW, how could a user possibly notice or care that partitions are
being used to implement a given policy? (apart from the fact that
the balancing will work better).

--
SUSE Labs, Novell Inc.

2006-10-20 20:00:00

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick wrote:
> I'm talking about isolcpus. What do isolcpus have to do with cpuset.c?
> You can turn off CONFIG_CPUSETS and still use isolcpus, can't you?

The connection of isolcpus with cpusets is about as strong (or weak)
as the connection of sched domain partitioning with cpusets.

Both isolcpus and sched domain partitions are tweaks to the scheduler
domain code, to lessen the impact of load balancing, by making sched
domains smaller, or by avoiding some cpus entirely.

Cpusets is concerned with the placement of tasks on cpus and nodes.

That's more related to sched domains (and hence to isolated cpus and
sched domain partitioning) than it is, say, to the crypto code for
generating random numbers from /dev/random.

But I will grant it is not a strong connection.

Apparently you were seeing the potential for a stronger connection
between cpusets and sched domain partitioning, because you thought
that the cpuset configuration naturally defined some partitions of
the system's cpus, which it would behoove the sched domain partitioning
code to take advantage of.

Unfortunately, cpusets define no such partitioning ... not system wide.

> So if you have a particular policy you need to implement, which is
> one cpu_exclusive cpuset off the root, covering half the cpus in
> the system (as a simple example)... why is it good to implement
> that with set_cpus_allowed and bad to implement it with partitions?

A cpu_exclusive cpuset does not implement the policy of partitioning
a system, such that no one else can use those cpus. It implements
a policy of limiting the sharing of cpus, just to one's cpuset ancestors
and descendants - no distant cousins. That makes it easier for system
admins and batch schedulers to manage the sharing of resources to meet
their needs. They have fewer places to worry about that might be
intruding.

It would be reasonable, for example, to do the following. One could
have a big job, that depended on scheduler load balancing, that ran
in the top cpuset, covering the entire system. And one could have
smaller jobs, in child cpusets. During the day one pauses the big
job and lets the smaller jobs run. During the night, one does the
reverse. Granted, this example is a little bit hypothetical. Things
like this are done, but usually a couple layers further down the
cpuset hierarchy. At the top level, one common configuration that
I am aware of puts almost the entire system into one cpuset, to be
managed by the batch scheduler, and just a handful of cpus into a
separate cpuset, to handle the classic Unix load (init, daemons, cron,
admin logins).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 20:01:59

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick wrote:
> Or, another question, how does my patch hijack cpus_allowed? In
> what way does it change the semantics of cpus_allowed?

It limits load balancing for tasks in cpusets containing
a superset of that cpuset's cpus.

There are always such cpusets - the top cpuset if no other.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 20:03:58

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick wrote:
> IOW, how could a user possibly notice or care that partitions are
> being used to implement a given policy? (apart from the fact that
> the balancing will work better).

Tasks in higher up cpusets (e.g., the top cpuset) wouldn't load balance
across these partitions.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 21:04:24

by Dinakar Guniguntala

Subject: Re: [RFC] cpuset: add interface to isolated cpus

On Thu, Oct 19, 2006 at 02:26:07AM -0700, Paul Jackson wrote:
> From: Paul Jackson <[email protected]>
>
> Enable user code to isolate CPUs on a system from the domains that
> determine scheduler load balancing.
>
> This is already doable using the boot parameter "isolcpus=". The folks
> running realtime code on production systems, where some nodes are
> isolated for realtime, and some not, and where it is unacceptable
> to reboot to adjust this, need to be able to change which CPUs are
> isolated from the scheduler balancing code on the fly.
>
> This is done by exposing the kernel's cpu_isolated_map as a cpumask
> in the root cpuset directory, in a file called 'isolated_cpus',
> where it can be read and written by sufficiently privileged user code.
>
> Signed-off-by: Paul Jackson <[email protected]>

IMO this patch addresses just one of the requirements for partitionable
sched domains

1. Partition the system into multiple sched domains such that rebalancing
happens within the group of each such partition
2. Ability to create partitioned sched domains that can further be
partitioned with exclusive and non exclusive cpusets (exclusive in this
context is not tied to sched domains anymore)
3. Ability to partition a system, such that the cpus in a partition don't
have any rebalancing at all

Hope this makes sense

-Dinakar


>
> ---
>
> I have built and booted this on one system, ia64/sn2, and
> verified that I can set and query the cpu_isolated_map,
> and that tasks are not scheduled on such isolated CPUs
> unless I explicitly place them there.
> -pj
>
> Documentation/cpusets.txt | 26 +++++++++++++++++++++++---
> Documentation/kernel-parameters.txt | 4 ++++
> include/linux/sched.h | 3 +++
> kernel/cpuset.c | 31 +++++++++++++++++++++++++++++++
> kernel/sched.c | 21 ++++++++++++++++++++-
> 5 files changed, 81 insertions(+), 4 deletions(-)
>
> --- 2.6.19-rc1-mm1.orig/Documentation/cpusets.txt 2006-10-18 21:24:22.000000000 -0700
> +++ 2.6.19-rc1-mm1/Documentation/cpusets.txt 2006-10-18 21:24:22.000000000 -0700
> @@ -19,7 +19,8 @@ CONTENTS:
> 1.5 What does notify_on_release do ?
> 1.6 What is memory_pressure ?
> 1.7 What is memory spread ?
> - 1.8 How do I use cpusets ?
> + 1.8 What is isolated_cpus ?
> + 1.9 How do I use cpusets ?
> 2. Usage Examples and Syntax
> 2.1 Basic Usage
> 2.2 Adding/removing cpus
> @@ -174,8 +175,9 @@ containing the following files describin
> - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
> - memory_pressure: measure of how much paging pressure in cpuset
>
> -In addition, the root cpuset only has the following file:
> +In addition, the root cpuset only has the following files:
> - memory_pressure_enabled flag: compute memory_pressure?
> + - isolated_cpus: cpus excluded from scheduler domains
>
> New cpusets are created using the mkdir system call or shell
> command. The properties of a cpuset, such as its flags, allowed
> @@ -378,7 +380,25 @@ data set, the memory allocation across t
> can become very uneven.
>
>
> -1.8 How do I use cpusets ?
> +1.8 What is isolated_cpus ?
> +---------------------------
> +
> +The cpumask isolated_cpus, visible in the top cpuset, provides direct
> +access to the kernel's cpu_isolated_map (in kernel/sched.c), the set
> +of CPUs excluded from scheduler domains. The scheduler will not
> +load balance tasks on isolated cpus. The 'isolated_cpus' file in the
> +root cpuset can be read to obtain the current setting of
> +cpu_isolated_map, and written to install a new cpumask in
> +cpu_isolated_map.
> +
> +The initial value of cpu_isolated_map can be set by the kernel
> +boot-time argument "isolcpus=n1,n2,n3,..." where n1, n2, n3, etc.
> +are the decimal numbers of the CPUs to be isolated. If not set
> +on the kernel boot line, cpu_isolated_map defaults to the empty
> +cpumask. See Documentation/kernel-parameters.txt for further
> +isolcpus documentation.
> +
> +1.9 How do I use cpusets ?
> --------------------------
>
> In order to minimize the impact of cpusets on critical kernel
> --- 2.6.19-rc1-mm1.orig/Documentation/kernel-parameters.txt 2006-10-15 00:24:58.000000000 -0700
> +++ 2.6.19-rc1-mm1/Documentation/kernel-parameters.txt 2006-10-18 21:24:22.000000000 -0700
> @@ -731,6 +731,10 @@ and is between 256 and 4096 characters.
> tasks in the system -- can cause problems and
> suboptimal load balancer performance.
>
> + If CPUSETS is configured, then the set of isolated
> + CPUs can be manipulated after system boot using the
> + "isolated_cpus" cpumask in the top cpuset.
> +
> isp16= [HW,CD]
> Format: <io>,<irq>,<dma>,<setup>
>
> --- 2.6.19-rc1-mm1.orig/include/linux/sched.h 2006-10-18 21:24:22.000000000 -0700
> +++ 2.6.19-rc1-mm1/include/linux/sched.h 2006-10-18 21:24:22.000000000 -0700
> @@ -715,6 +715,9 @@ struct sched_domain {
> #endif
> };
>
> +extern const cpumask_t *sched_get_isolated_cpus(void);
> +extern int sched_set_isolated_cpus(const cpumask_t *isolated_cpus);
> +
> /*
> * Maximum cache size the migration-costs auto-tuning code will
> * search from:
> --- 2.6.19-rc1-mm1.orig/kernel/cpuset.c 2006-10-18 21:24:22.000000000 -0700
> +++ 2.6.19-rc1-mm1/kernel/cpuset.c 2006-10-18 21:24:22.000000000 -0700
> @@ -974,6 +974,21 @@ static int update_memory_pressure_enable
> }
>
> /*
> + * Call with manage_mutex held.
> + */
> +
> +static int update_isolated_cpus(struct cpuset *cs, char *buf)
> +{
> + cpumask_t isolated_cpus;
> + int retval;
> +
> + retval = cpulist_parse(buf, isolated_cpus);
> + if (retval < 0)
> + return retval;
> + return sched_set_isolated_cpus(&isolated_cpus);
> +}
> +
> +/*
> * update_flag - read a 0 or a 1 in a file and update associated flag
> * bit: the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
> * CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
> @@ -1215,6 +1230,7 @@ typedef enum {
> FILE_MEMORY_PRESSURE,
> FILE_SPREAD_PAGE,
> FILE_SPREAD_SLAB,
> + FILE_ISOLATED_CPUS,
> FILE_TASKLIST,
> } cpuset_filetype_t;
>
> @@ -1282,6 +1298,9 @@ static ssize_t cpuset_common_file_write(
> retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
> cs->mems_generation = cpuset_mems_generation++;
> break;
> + case FILE_ISOLATED_CPUS:
> + retval = update_isolated_cpus(cs, buffer);
> + break;
> case FILE_TASKLIST:
> retval = attach_task(cs, buffer, &pathbuf);
> break;
> @@ -1397,6 +1416,10 @@ static ssize_t cpuset_common_file_read(s
> case FILE_SPREAD_SLAB:
> *s++ = is_spread_slab(cs) ? '1' : '0';
> break;
> + case FILE_ISOLATED_CPUS:
> + s += cpulist_scnprintf(page, PAGE_SIZE,
> + *sched_get_isolated_cpus());
> + break;
> default:
> retval = -EINVAL;
> goto out;
> @@ -1770,6 +1793,11 @@ static struct cftype cft_spread_slab = {
> .private = FILE_SPREAD_SLAB,
> };
>
> +static struct cftype cft_isolated_cpus = {
> + .name = "isolated_cpus",
> + .private = FILE_ISOLATED_CPUS,
> +};
> +
> static int cpuset_populate_dir(struct dentry *cs_dentry)
> {
> int err;
> @@ -1960,6 +1988,9 @@ int __init cpuset_init(void)
> /* memory_pressure_enabled is in root cpuset only */
> if (err == 0)
> err = cpuset_add_file(root, &cft_memory_pressure_enabled);
> + /* isolated_cpus is in root cpuset only */
> + if (err == 0)
> + err = cpuset_add_file(root, &cft_isolated_cpus);
> out:
> return err;
> }
> --- 2.6.19-rc1-mm1.orig/kernel/sched.c 2006-10-18 21:24:22.000000000 -0700
> +++ 2.6.19-rc1-mm1/kernel/sched.c 2006-10-18 21:49:41.000000000 -0700
> @@ -5608,7 +5608,7 @@ static void cpu_attach_domain(struct sch
> }
>
> /* cpus with isolated domains */
> -static cpumask_t __cpuinitdata cpu_isolated_map = CPU_MASK_NONE;
> +static cpumask_t cpu_isolated_map = CPU_MASK_NONE;
>
> /* Setup the mask of cpus configured for isolated domains */
> static int __init isolated_cpu_setup(char *str)
> @@ -6866,6 +6866,25 @@ void __init sched_init_smp(void)
> if (set_cpus_allowed(current, non_isolated_cpus) < 0)
> BUG();
> }
> +
> +const cpumask_t *sched_get_isolated_cpus(void)
> +{
> + return &cpu_isolated_map;
> +}
> +
> +int sched_set_isolated_cpus(const cpumask_t *isolated_cpus)
> +{
> + cpumask_t newly_isolated_cpus;
> + int err;
> +
> + lock_cpu_hotplug();
> + cpus_and(newly_isolated_cpus, cpu_online_map, *isolated_cpus);
> + detach_destroy_domains(&newly_isolated_cpus);
> + cpu_isolated_map = *isolated_cpus;
> + err = arch_init_sched_domains(&cpu_online_map);
> + unlock_cpu_hotplug();
> + return err;
> +}
> #else
> void __init sched_init_smp(void)
> {
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <[email protected]> 1.925.600.0401

2006-10-20 21:20:10

by Suresh Siddha

Subject: Re: [RFC] cpuset: add interface to isolated cpus

On Fri, Oct 20, 2006 at 01:01:41PM -0700, Paul Jackson wrote:
> Nick wrote:
> > Or, another question, how does my patch hijack cpus_allowed? In
> > what way does it change the semantics of cpus_allowed?
>
> It limits load balancing for tasks in cpusets containing
> a superset of that cpuset's cpus.
>
> There are always such cpusets - the top cpuset if no other.

It's just a corner case issue that Nick didn't consider while doing a quick
patch. Nick meant to partition the sched domain at the top
exclusive cpuset and he probably missed the case where the root cpuset is
marked as exclusive.

thanks,
suresh

2006-10-21 01:33:45

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Suresh wrote:
> It's just a corner case issue that Nick didn't consider while doing a quick
> patch. Nick meant to partition the sched domain at the top
> exclusive cpuset and he probably missed the case where the root cpuset is
> marked as exclusive.

This makes no sense.

If P is a partition of S, that means that P is a set of subsets of
S such that the intersection of any two members of P is empty, and
the union of the members of P equals S.

If P is a partition of S, then adding S itself to P as another member
makes P no longer a partition, for then every element of S is in two
elements of P, not one.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-21 06:14:25

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
> Nick wrote:
>
>>Or, another question, how does my patch hijack cpus_allowed? In
>>what way does it change the semantics of cpus_allowed?
>
>
> It limits load balancing for tasks in cpusets containing
> a superset of that cpuset's cpus.
>
> There are always such cpusets - the top cpuset if no other.

Ah OK, and there is my misunderstanding with cpusets. From the
documentation it appears as though cpu_exclusive cpusets are
made in order to do the partitioning thing.

If you always have other domains overlapping them (regardless
that it is a parent), then what actual use does cpu_exclusive
flag have?

--
SUSE Labs, Novell Inc.

2006-10-21 07:24:16

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

> If you always have other domains overlapping them (regardless
> that it is a parent), then what actual use does cpu_exclusive
> flag have?

Good question. To be a tad too honest, I don't consider it to be
all that essential.

It does turn out to be a bit useful, in that you can be confident that
if you are administering the batch scheduler and it has a cpu_exclusive
cpuset covering a big chunk of your system, then no other cpuset other
than the top cpuset right above, which you administer pretty tightly,
could have anything overlapping it. You don't have to actually look
at the cpumasks in all the parallel cpusets and cross check each one
against the batch scheduler's set of cpus for overlap.

But as you can see by grep'ing in kernel/cpuset.c, it is only used
to generate a couple of error returns, for things that you can do
just fine by turning off the cpu_exclusive flag.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-21 10:51:17

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
>>If you always have other domains overlapping them (regardless
>>that it is a parent), then what actual use does cpu_exclusive
>>flag have?
>
>
> Good question. To be a tad too honest, I don't consider it to be
> all that essential.
>
> It does turn out to be a bit useful, in that you can be confident that
> if you are administering the batch scheduler and it has a cpu_exclusive
> cpuset covering a big chunk of your system, then no other cpuset other
> than the top cpuset right above, which you administer pretty tightly,
> could have anything overlapping it. You don't have to actually look
> at the cpumasks in all the parallel cpusets and cross check each one
> against the batch scheduler's set of cpus for overlap.
>
> But as you can see by grep'ing in kernel/cpuset.c, it is only used
> to generate a couple of error returns, for things that you can do
> just fine by turning off the cpu_exclusive flag.

Well, it was supposed to be used for sched-domains partitioning, and
its uselessness for anything else I guess is what threw me.

But even the way cpu_exclusive semantics are defined makes it not
quite compatible with partitioning anyway, unfortunately.

--
SUSE Labs, Novell Inc.

2006-10-22 04:54:57

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick wrote:
> Well, it was supposed to be used for sched-domains partitioning, and
> its uselessness for anything else I guess is what threw me.

The use of cpu_exclusive for sched domain partitioning was added
later, by a patch from Dinakar, in April or May of 2005.

In hindsight, I think I made a mistake in agreeing to, and probably
encouraging, this particular overloading of cpu_exclusive. I had
difficulty adequately understanding what was going on.

Granted, as we've noted elsewhere on this thread, the cpu_exclusive
flag is underutilized. It gives a cpuset so marked a certain limited
exclusivity to its cpus, relative to its siblings, but it doesn't do
much else, other than this controversial partitioning of sched domains.

There may well be a useful role for the cpu_exclusive flag in managing
sched domains and partitioning. The current role is flawed, in my view.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 03:18:40

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Dinakar wrote:
> IMO this patch addresses just one of the requirements for partitionable
> sched domains

Correct - this particular patch was just addressing one of these.

Nick raised the reasonable concern that this patch was adding something
to cpusets that was not especially related to cpusets.

So I will not be sending this patch to Andrew for *-mm.

There are further opportunities for improvements in some of this code,
which my colleague Christoph Lameter may be taking an interest in.
Ideally kernel-user API's for isolating and partitioning sched domains
would arise from that work, though I don't know if we can wait that
long.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 05:07:54

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
> Dinakar wrote:
>
>>IMO this patch addresses just one of the requirements for partitionable
>>sched domains
>
>
> Correct - this particular patch was just addressing one of these.
>
> Nick raised the reasonable concern that this patch was adding something
> to cpusets that was not especially related to cpusets.

Did you resend the patch to remove sched-domain partitioning?
After clearing up my confusion, IMO that is needed and could probably
go into 2.6.19.

> So I will not be sending this patch to Andrew for *-mm.
>
> There are further opportunities for improvements in some of this code,
> which my colleague Christoph Lameter may be taking an interest in.
> Ideally kernel-user API's for isolating and partitioning sched domains
> would arise from that work, though I don't know if we can wait that
> long.

The sched-domains code is all there and just ready to be used. IMO
using the cpusets API (or a slight extension thereof) would be the
best idea if we're going to use any explicit interface at all.

A cool option would be to determine the partitions according to the
disjoint set of unions of cpus_allowed masks of all tasks. I see this
getting computationally expensive though, probably O(tasks*CPUs)... I
guess that isn't too bad.

Might be better than a userspace interface.
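
Something like the following toy model of the merging step (plain bit
words stand in for cpumask_t and the task masks are made up; the real
thing would walk the tasklist under tasklist_lock and use the cpumask
operators):

	#include <stdio.h>

	#define MAX_GROUPS 64

	static unsigned long long groups[MAX_GROUPS];
	static int ngroups;

	/*
	 * Fold one task's cpus_allowed mask into a set of disjoint groups:
	 * merge it with every existing group it overlaps, then store the
	 * union as a single group.  The surviving groups are the candidate
	 * sched-domain partitions.
	 */
	static void add_task_mask(unsigned long long mask)
	{
		int i = 0;

		while (i < ngroups) {
			if (groups[i] & mask) {
				mask |= groups[i];
				groups[i] = groups[--ngroups];	/* drop group i */
			} else {
				i++;
			}
		}
		if (ngroups < MAX_GROUPS)
			groups[ngroups++] = mask;
	}

	int main(void)
	{
		int i;

		/* e.g. tasks allowed on {0,1}, {1,2} and {4-7} give two groups */
		add_task_mask(0x3ULL);
		add_task_mask(0x6ULL);
		add_task_mask(0xf0ULL);

		for (i = 0; i < ngroups; i++)
			printf("partition %d: cpu mask 0x%llx\n", i, groups[i]);
		return 0;
	}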

--
SUSE Labs, Novell Inc.

2006-10-23 05:51:34

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick wrote:
> Did you resend the patch to remove sched-domain partitioning?
> After clearing up my confusion, IMO that is needed and could probably
> go into 2.6.19.

The patch titled
cpuset: remove sched domain hooks from cpusets
went into *-mm on Friday, 20 Oct.

Is that the patch you mean?

It's just the first step - unplugging the old.

Now we need the new:
1) Ability at runtime to isolate cpus for real-time and such.
2) Big systems perform better if we can avoid load balancing across
zillions of cpus.

> The sched-domains code is all there and just ready to be used. IMO
> using the cpusets API (or a slight extension thereof) would be the
> best idea if we're going to use any explicit interface at all.

Good. Thanks.

> A cool option would be to determine the partitions according to the
> disjoint set of unions of cpus_allowed masks of all tasks. I see this
> getting computationally expensive though, probably O(tasks*CPUs)... I
> guess that isn't too bad.

Yeah - if that would work, from the practical perspective of providing
us with a useful partitioning (get those humongous sched domains carved
down to a reasonable size) then that would be cool.

I'm guessing that in practice, it would be annoying to use. One would
end up with stray tasks that happened to be sitting in one of the bigger
cpusets and that did not have their cpus_allowed narrowed, stopping us
from getting a useful partitioning. Perhaps ancillary tasks associated
with the batch scheduler, or some paused tasks in an inactive job that,
were it active, would need load balancing across a big swath of cpus.
These would be tasks that we really didn't need to load balance, but they
would appear as if they needed it because of their fat cpus_allowed.

Users (admins) would have to hunt down these tasks that were getting in
the way of a nice partitioning and whack their cpus_allowed down to
size.

So essentially, one would end up with another userspace API, backdoor
again. Like those magic doors in the libraries of wealthy protagonists
in mystery novels, where you have to open a particular book and pull
the lamp cord to get the door to appear and open.

Automatic chokes and transmissions are great - if they work. If not,
give me a knob and a stick.

===

Another idea for a cpuset-based API to this ...

From our internal perspective, it's all about getting the sched domain
partitions cut down to a reasonable size, for performance reasons.

But from the users' perspective, the deal we are asking them to
consider is to trade in fully automatic, all tasks across all cpus,
load balancing, in turn for better performance.

Big system admins would often be quite happy to mark the top cpuset
as "no need to load balance tasks in this cpuset." They would
take responsibility for moving any non-trivial, unpinned tasks into
lower cpusets (or not be upset if something left behind wasn't load
balancing.)

And the batch scheduler would be quite happy to mark its top cpuset as
"no need to load balance". It could mark any cpusets holding inactive
jobs the same way.

This "no need to load balance" flag would be advisory. The kernel
might load balance anyway. For example if the batch scheduler were
running under a top cpuset that was -not- so marked, we'd still have
to load balance everyone. The batch scheduler wouldn't care. It would
have done its duty, to mark which of its cpusets didn't need balancing.

All we need from them is the ok to not load balance certain cpusets,
and the rest is easy enough. If they give us such ok on enough of the
big cpusets, we give back a nice performance improvement.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 06:00:38

by Suresh Siddha

Subject: Re: [RFC] cpuset: add interface to isolated cpus

On Sun, Oct 22, 2006 at 10:51:08PM -0700, Paul Jackson wrote:
> Nick wrote:
> > A cool option would be to determine the partitions according to the
> > disjoint set of unions of cpus_allowed masks of all tasks. I see this
> > getting computationally expensive though, probably O(tasks*CPUs)... I
> > guess that isn't too bad.
>
> Yeah - if that would work, from the practical perspective of providing
> us with a useful partitioning (get those humongous sched domains carved
> down to a reasonable size) then that would be cool.
>
> I'm guessing that in practice, it would be annoying to use. One would
> end up with stray tasks that happened to be sitting in one of the bigger
> cpusets and that did not have their cpus_allowed narrowed, stopping us
> from getting a useful partitioning. Perhaps ancillary tasks associated
> with the batch scheduler, or some paused tasks in an inactive job that,
> were it active, would need load balancing across a big swath of cpus.
> These would be tasks that we really didn't need to load balance, but they
> would appear as if they needed it because of their fat cpus_allowed.
>
> Users (admins) would have to hunt down these tasks that were getting in
> the way of a nice partitioning and whack their cpus_allowed down to
> size.
>
> So essentially, one would end up with another userspace API, backdoor
> again. Like those magic doors in the libraries of wealthy protagonists
> in mystery novels, where you have to open a particular book and pull
> the lamp cord to get the door to appear and open.

Also we need to be careful with malicious users partitioning the system wrongly
based on the cpus_allowed for their tasks.

>
> Automatic chokes and transmissions are great - if they work. If not,
> give me a knob and a stick.
>
> ===
>
> Another idea for a cpuset-based API to this ...
>
> From our internal perspective, it's all about getting the sched domain
> partitions cut down to a reasonable size, for performance reasons.

we need to consider resource partitioning too... and this is where
Google's interests are probably coming from?

>
> But from the users perspective, the deal we are asking them to
> consider is to trade in fully automatic, all tasks across all cpus,
> load balancing, in turn for better performance.
>
> Big system admins would often be quite happy to mark the top cpuset
> as "no need to load balance tasks in this cpuset." They would
> take responsibility for moving any non-trivial, unpinned tasks into
> lower cpusets (or not be upset if something left behind wasn't load
> balancing.)
>
> And the batch scheduler would be quite happy to mark its top cpuset as
> "no need to load balance". It could mark any cpusets holding inactive
> jobs the same way.
>
> This "no need to load balance" flag would be advisory. The kernel
> might load balance anyway. For example if the batch scheduler were
> running under a top cpuset that was -not- so marked, we'd still have
> to load balance everyone. The batch scheduler wouldn't care. It would
> have done its duty, to mark which of its cpusets didn't need balancing.
>
> All we need from them is the ok to not load balance certain cpusets,
> and the rest is easy enough. If they give us such ok on enough of the
> big cpusets, we give back a nice performance improvement.
>

2006-10-23 06:07:35

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Suresh wrote:
> Also we need to be careful with malicious users partitioning the system wrongly
> based on the cpus_allowed for their tasks.

Untrusted users can't do that. On big systems where you have
untrusted users, you stick them in smaller cpusets. Their
cpus_allowed is not allowed to exceed what's allowed in their
cpuset.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 06:08:16

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Suresh wrote:
> > From our internal perspective, it's all about getting the sched domain
> > partitions cut down to a reasonable size, for performance reasons.
>
> we need to consider resource partitioning too... and this is where
> Google's interests are probably coming from?

I dropped a bit - what's 'resource partitioning' ?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 06:17:36

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
> Nick wrote:
>
>>Did you resend the patch to remove sched-domain partitioning?
>>After clearing up my confusion, IMO that is needed and could probably
>>go into 2.6.19.
>
>
> The patch titled
> cpuset: remove sched domain hooks from cpusets
> went into *-mm on Friday, 20 Oct.
>
> Is that the patch you mean?

Yes.

> It's just the first step - unplugging the old.
>
> Now we need the new:
> 1) Ability at runtime to isolate cpus for real-time and such.
> 2) Big systems perform better if we can avoid load balancing across
> zillions of cpus.

These are both part of the same larger solution, which is to
partition domains. isolated CPUs are just the case of 1 CPU in
its own domain (and that's how they are implemented now).

So we need the interface or some driver to do this partitioning.

>>A cool option would be to determine the partitions according to the
>>disjoint set of unions of cpus_allowed masks of all tasks. I see this
>>getting computationally expensive though, probably O(tasks*CPUs)... I
>>guess that isn't too bad.
>
>
> Yeah - if that would work, from the practical perspective of providing
> us with a useful partitioning (get those humongous sched domains carved
> down to a reasonable size) then that would be cool.
>
> I'm guessing that in practice, it would be annoying to use. One would
> end up with stray tasks that happened to be sitting in one of the bigger
> cpusets and that did not have their cpus_allowed narrowed, stopping us
> from getting a useful partitioning. Perhaps ancillary tasks associated
> with the batch scheduler, or some paused tasks in an inactive job that,
> were it active, would need load balancing across a big swath of cpus.
> These would be tasks that we really didn't need to load balance, but they
> would appear as if they needed it because of their fat cpus_allowed.

But we simply can't make a partition for them because they have asked
to use all CPUs. We can't know if this is something that should be
partitioned or not, can we?

> Users (admins) would have to hunt down these tasks that were getting in
> the way of a nice partitioning and whack their cpus_allowed down to
> size.

In the sense that they get what they ask for, yes. The obvious route
for the big SGI systems is to put them in the right cpuset. The job
managers (or admins) on those things are surely up to the task.

This leaves unbound and transient kernel threads like pdflush as a
remaining problem. Not quite sure what to do about that yet. I see
you have a little hack in there...

> So essentially, one would end up with another userspace API, backdoor
> again. Like those magic doors in the libraries of wealthy protagonists
> in mystery novels, where you have to open a particular book and pull
> the lamp cord to get the door to appear and open.
>
> Automatic chokes and transmissions are great - if they work. If not,
> give me a knob and a stick.

But your knob is just going to be some mechanism to say that you don't
care about such and such a task, or you want to put task x into domain
y.

> ===
>
> Another idea for a cpuset-based API to this ...
>
> From our internal perspective, it's all about getting the sched domain
> partitions cut down to a reasonable size, for performance reasons.
>
> But from the users' perspective, the deal we are asking them to
> consider is to trade in fully automatic, all tasks across all cpus,
> load balancing, in turn for better performance.
>
> Big system admins would often be quite happy to mark the top cpuset
> as "no need to load balance tasks in this cpuset." They would
> take responsibility for moving any non-trivial, unpinned tasks into
> lower cpusets (or not be upset if something left behind wasn't load
> balancing.)
>
> And the batch scheduler would be quite happy to mark its top cpuset as
> "no need to load balance". It could mark any cpusets holding inactive
> jobs the same way.
>
> This "no need to load balance" flag would be advisory. The kernel
> might load balance anyway. For example if the batch scheduler were
> running under a top cpuset that was -not- so marked, we'd still have
> to load balance everyone. The batch scheduler wouldn't care. It would
> have done its duty, to mark which of its cpusets didn't need balancing.
>
> All we need from them is the ok to not load balance certain cpusets,
> and the rest is easy enough. If they give us such ok on enough of the
> big cpusets, we give back a nice performance improvement.

I think this is much more of an automatic behind your back thing. If
they don't want any load balancing to happen, they could pin the tasks
in that cpuset to the cpu they're currently on, for example.

It would be trivial to make such a script to parse the root cpuset and
do exactly this, wouldn't it?

--
SUSE Labs, Novell Inc.

2006-10-23 06:42:16

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick wrote:
> These are both part of the same larger solution, which is to
> partition domains. isolated CPUs are just the case of 1 CPU in
> its own domain (and that's how they are implemented now).

and later, he also wrote:
> I think this is much more of an automatic behind your back thing.

I got confused there.

I agree that if we can do a -good- job of it, then an implicit,
automatic solution is better for the problem of reducing sched domain
partition sizes on large systems than yet another manual knob.

But I thought that it was a good idea, with general agreement, to provide
an explicit control of isolated cpus for the real-time folks, even if
under the covers it uses sched domain partitions of size 1 to implement
it.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 06:48:20

by Paul Jackson

Subject: Re: [RFC] cpuset: add interface to isolated cpus

> It would be trivial to make such a script to parse the root cpuset and
> do exactly this, wouldn't it?

Ah - yes - that's doable. A certain company I work for ships pretty
much that exact script, to its customers. It works well to remove
all the unpinned tasks from the top level cpuset and put them in what
we call the 'boot' cpuset, where the classic Unix load (init, cron,
daemons, sysadmin login) is confined. This frees up the rest of the
system to run "real" work. It works quite well, if I do say so.
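
The guts of such a tool are small. Roughly this, though I should say
it is only an illustrative sketch, not the script we ship; it assumes
the cpuset file system is mounted at /dev/cpuset and that a
/dev/cpuset/boot child cpuset already exists with its cpus and mems
set up:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
		FILE *tasks = fopen("/dev/cpuset/tasks", "r");
		FILE *boot = fopen("/dev/cpuset/boot/tasks", "w");
		char line[64];

		if (!tasks || !boot) {
			perror("open cpuset files");
			return 1;
		}
		/* move every unpinned task in the top cpuset into 'boot' */
		while (fgets(line, sizeof(line), tasks)) {
			pid_t pid = (pid_t)atol(line);
			cpu_set_t mask;
			int cpu, count = 0;

			if (sched_getaffinity(pid, sizeof(mask), &mask) < 0)
				continue;	/* task already exited */
			for (cpu = 0; cpu < ncpus; cpu++)
				if (CPU_ISSET(cpu, &mask))
					count++;
			if (count == ncpus) {		/* not pinned anywhere */
				fprintf(boot, "%d\n", (int)pid);
				fflush(boot);	/* one pid per write */
			}
		}
		fclose(tasks);
		fclose(boot);
		return 0;
	}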

Perhaps I'm being overly pessimistic about the potential of driving
this partitioning off the cpus_allowed masks of the tasks. As you
noted, it would be a cute trick to avoid some combinatorial explosion
of the computational costs. But there are enough practical constraints
on this problem that it should be quite doable.

Hmmm ...

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 06:49:55

by Nick Piggin

Subject: Re: [RFC] cpuset: add interface to isolated cpus

Paul Jackson wrote:
> Nick wrote:
>
>>These are both part of the same larger solution, which is to
>>partition domains. isolated CPUs are just the case of 1 CPU in
>>its own domain (and that's how they are implemented now).
>
>
> and later, he also wrote:
>
>>I think this is much more of an automatic behind your back thing.
>
>
> I got confused there.
>
> I agree that if we can do a -good- job of it, then an implicit,
> automatic solution is better for the problem of reducing sched domain
> partition sizes on large systems than yet another manual knob.

OK, good.

> But I thought that it was a good idea, with general agreement, to provide
> an explicit control of isolated cpus for the real-time folks, even if
> under the covers it use sched domain partitions of size 1 to implement
> it.

If they isolate it by setting the cpus_allowed masks of processes
to reflect the way they'd like balancing to be carried out, then
the partition will be made for them.

But an explicit control might be required anyway, and I wouldn't
disagree with it. It might be required to do more than just sched
partitioning (eg. pdflush and other kernel threads should probably
be made to stay off isolated cpus as well, where possible).

--
SUSE Labs, Novell Inc.

2006-10-23 19:50:41

by Dinakar Guniguntala

[permalink] [raw]
Subject: Re: [RFC] cpuset: add interface to isolated cpus

On Mon, Oct 23, 2006 at 03:07:46PM +1000, Nick Piggin wrote:
> Paul Jackson wrote:
> >Dinakar wrote:
> >
> >>IMO this patch addresses just one of the requirements for partitionable
> >>sched domains
> >Correct - this particular patch was just addressing one of these.
> >
> >Nick raised the reasonable concern that this patch was adding something
> >to cpusets that was not especially related to cpusets.
>
> Did you resend the patch to remove sched-domain partitioning?
> After clearing up my confusion, IMO that is needed and could probably
> go into 2.6.19.
>
> >So I will not be sending this patch to Andrew for *-mm.
> >
> >There are further opportunities for improvements in some of this code,
> >which my colleague Christoph Lameter may be taking an interest in.
> >Ideally kernel-user API's for isolating and partitioning sched domains
> >would arise from that work, though I don't know if we can wait that
> >long.
>
> The sched-domains code is all there and just ready to be used. IMO
> using the cpusets API (or a slight extension thereof) would be the
> best idea if we're going to use any explicit interface at all.

Ok, I am getting lost in all of the mails here, so let me try to summarize.

The existing cpuset code, which partitions sched domains on the
back of an exclusive cpuset, has one major problem. Administrators
will find that tasks assigned to top level cpusets which contain
exclusive child cpusets can no longer be rebalanced across
their entire cpus_allowed mask.
This, as far as I can tell, is the only problem with the current code,
so I don't see why we need a major rewrite that involves a complete change
in the approach to the dynamic sched domain implementation.

I really think all we need is to have a new flag (say sched_domain)
that can be used by the admin to create a sched domain. Since this is
in addition to the cpu_exclusive flag, the admin realizes that the tasks
associated with the parent cpuset may need to be moved around to better
reflect the cpus they will actually run on. That's it.
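
(For illustration only, since the flag is still just a proposal: the
admin flow would look roughly like the following, where the
'sched_domain' file name is hypothetical and the cpu/node numbers are
invented; the rest is the usual cpuset filesystem idiom, with the
hierarchy mounted at /dev/cpuset.)

/*
 * Illustration only: create a child cpuset, make it exclusive as
 * today, and set the proposed 'sched_domain' flag on it.  The flag
 * file does not exist yet; the cpu and mem numbers are made up.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static void put(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (f) {
        fputs(val, f);
        fclose(f);
    }
}

int main(void)
{
    mkdir("/dev/cpuset/rt", 0755);                /* new child cpuset */
    put("/dev/cpuset/rt/cpus", "4-7");            /* its CPUs */
    put("/dev/cpuset/rt/mems", "1");              /* its memory node */
    put("/dev/cpuset/rt/cpu_exclusive", "1");     /* exclusive, as today */
    put("/dev/cpuset/rt/sched_domain", "1");      /* proposed new flag */
    return 0;
}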

Can somebody tell me why this approach is not good enough?

I am testing this patch currently and will post it shortly for review

-Dinakar

>
> A cool option would be to determine the partitions according to the
> disjoint set of unions of cpus_allowed masks of all tasks. I see this
> getting computationally expensive though, probably O(tasks*CPUs)... I
> guess that isn't too bad.
>
> Might be better than a userspace interface.
>
> --
> SUSE Labs, Novell Inc.

2006-10-23 20:47:53

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: add interface to isolated cpus

Dinakar wrote:
> This, as far as I can tell, is the only problem with the current code,
> so I don't see why we need a major rewrite that involves a complete change
> in the approach to the dynamic sched domain implementation.

Nick and I agree that if we can get an adequate automatic partition of
sched domains, without any such explicit 'sched_domain' API, then that
would be better.

Nick keeps hoping we can do this automatically, and I have been fading
in and out of agreement. I have doubts we can do an automatic
partition that is adequate.

Last night, Nick suggested we could do this by partitioning based on
the cpus_allowed masks of the tasks in the system, instead of based on
selected cpusets in the system. We could create a partition any place
that didn't cut across some task's cpus_allowed. This would seem to have
a better chance than basing it on the cpus masks in cpusets - for
example, a task in the top cpuset that was pinned to a single CPU (many
kernel threads fit this description) would no longer impede
partitioning.

Right now, I am working on a posting that spells out an algorithm
to compute such a partitioning, based on all task cpus_allowed masks
in the system. I think that's doable.

But I doubt it works. I'm afraid it will result in useful partitions
only in the cases we seem to need them least, on systems already using
cpusets to nicely carve up the system.

==

Why the heck are we doing this partitioning in the first place?

Putting aside for a moment the specialized needs of the real-time folks
who want to isolate some nodes from any source of jitter, aren't these
sched domain partitions just a workaround for performance issues that
arise from trying to load balance across big sched domains?

Granted, there may be no better way to address this performance issue.
And granted, it may well be my own employer's big honkin' NUMA boxes
that are most in need of this, and I just don't realize it.

But could someone whack me upside the head with a clear cut instance
where we know we need this partitioning?

In particular, if (extreme example) I have 1024 threads on a 1024 CPU
system, each compute bound and each pinned to a separate CPU, and
nothing else, then do I still need these sched partitions? Or does the
scheduler efficiently handle this case, quickly recognizing that it has
no useful balancing work worth doing?

==

As it stands right now, if I had to place my "final answer" in Jeopardy
on this, I'd vote for something like the patch you describe, which I
take it is much like the sched_domain patch with which I started this
scrum a week ago, minus the 'sched_domain_enabled' flag that I had in
for backwards compatibility. I suspect we agree that we can do without
that flag, and that a single clean long term API outweighs perfect
backward compatibility, in this case.

==

The only twist to your patch I would like you to consider - instead
of a 'sched_domain' flag marking where the partitions go, how about
a flag that tells the kernel it is ok not to load balance tasks in
a cpuset?

Then lower level cpusets could set such a flag, without immediate and
brutal effects on the partitioning of all their parent cpusets. But if
the big top level non-overlapping cpusets were so marked, then we could
partition all the way down to where we were no longer able to do so,
because we hit a cpuset that didn't have this flag set.

I think such an "ok not to load balance tasks in this cpuset" flag
better fits what the users see here. They are being asked to let us
turn off some automatic load balancing, in return for which they get
better performance. I doubt that the phrase "dynamic scheduler domain
partitions" is in the vocabulary of most of our users. More of them
will understand the concept of load balancing automatically moving
tasks to underutilized CPUs, and more of them would be prepared to
trade turning off load balancing in some top cpusets for better
kernel scheduler performance.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-23 20:58:43

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: add interface to isolated cpus

Nick wrote:
> I think this is much more of an automatic behind your back thing. If
> they don't want any load balancing to happen, they could pin the tasks
> in that cpuset to the cpu they're currently on, for example.
>
> It would be trivial to make such a script to parse the root cpuset and
> do exactly this, wouldn't it?

Yeah ... but ...

Such a script is trivial.

But I'm getting mixed messages on this "automatic behind your back"
thing.

We can't actually run such a script automatically, without the user
asking for it. We only run such a script - to move what tasks we can
out of the top cpuset into a smaller cpuset - when the user sets up
a configuration asking for it.

And saying "If they don't want any load balancing to happen, they could
pin the tasks ..." doesn't sound automatic to me. It sounds like a
backdoor API.

So you might say, and continue to hope, that this sched domain
partitioning is automatic. But your own words, and my (limited)
experience, don't match that expectation.

- We agree that automatic, truly automatic with no user
intervention, beats some manually invoked API knob.

- But a straightforward, upfront API knob (such as a flag indicating
"no need to load balance across this cpuset") beats a backdoor,
indirect API trying to pretend it's automatic when it's not.


> In the sense that they get what they ask for, yes. The obvious route
> for the big SGI systems is to put them in the right cpuset. The job
> managers (or admins) on those things are surely up to the task.

This is touching on a key issue.

Sure, if the admins feel motivated, this is within their capacity.
For something like the real-time nodes, the admins seem willing to
jump through flaming hoops and perform all manner of unnatural acts,
to get those nodes running tight, low latency, jitter free.

But admins don't wake up in the morning -wanting- to think about
the performance of balancing sched domains. Partitioning sched
domains is not a frequently requested feature.

Sched domain partitioning is a workaround (apparently) for some
(perhaps inevitable) scaling issues in the scheduler's load balancer -
right?

So, just how bad is this scaling issue with huge sched domains?
And why can't it be addressed at the source? Perhaps, say, by
ratelimiting the balancing efforts at higher domain sizes?

For instance, I can imagine a dynamically variable ratelimit
throttle on how often we try to balance across the higher domain
sizes. If such load balancing efforts are not yielding much fruit
(seldom yielding a worthwhile task movement), then do them less often.
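
(A toy sketch of what such a throttle could look like; this is not
actual scheduler code, and every name in it is made up.)

/*
 * Toy sketch, not actual scheduler code; all names are made up.
 * The balance interval for a large domain backs off when balancing
 * attempts keep finding nothing worth moving, and snaps back to the
 * normal rate when a pass actually moves tasks.
 */
struct domain_throttle {
    unsigned int interval;          /* current ms between attempts */
    unsigned int min_interval;      /* the usual balance rate */
    unsigned int max_interval;      /* cap on how lazy we may get */
};

static void update_throttle(struct domain_throttle *t, int nr_moved)
{
    if (nr_moved > 0) {
        /* Balancing paid off: return to the normal rate. */
        t->interval = t->min_interval;
    } else {
        /* Fruitless pass: wait exponentially longer next time. */
        t->interval *= 2;
        if (t->interval > t->max_interval)
            t->interval = t->max_interval;
    }
}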

Then if someone is truly running a mixed timeshare load, of
varying job sizes, wide open on a big system, with no serious
effort at confining the jobs into smaller cpusets, we would
continue to spend considerable effort load balancing the system.
Not the most efficient way to run a system, but it should work
as expected when you do that.

If, on the other hand, someone is using lots of small cpusets and
pinning to carefully manage thread placement on a large system,
we would end up backing off the higher levels of load balancing,
as it would seldom bear any fruit.

==

To my limited awareness, this sched domain balancing performance
problem (due to huge domains) isn't that big a problem.

I've been guessing that this is because the balancer is actually
reasonably efficient when presented with a carefully pinned load
that has one long living compute bound thread per CPU, even on
very high CPU count systems.

I'm afraid we are creating a non-automatic band-aid for a problem, that
would only work on systems (running nicely pinned workloads in small
cpusets) for which it is not actually a problem, designing the backdoor
API under the false assumption that it is really automatic, where
instead this API will be the least capable of automatically working for
exactly the workloads (poorly placed generic timeshare loads) that it
is most needed for.

And if you can parse that last sentence, you're a smarter man than
I am ;).

==

I can imagine an algorithm, along the lines Nick suggested yesterday,
to extract partitions from the cpus_allowed masks of all the tasks:

It would scan each byte of each cpus_allowed mask (it could do
bits, but bytes provide sufficient granularity and are faster.)

It would have an intermediate array of counters, one per cpumask_t
byte (that is, NR_CPUS/8 counters.)

For each task, it would note the lowest non-zero byte in that
task's cpus_allowed, and increment the corresponding counter.
Then it would also note the highest non-zero byte in that
cpus_allowed, and decrement the corresponding counter.

Then after this scan of all tasks' cpus_allowed, it would make
a single pass over the counter array, summing the counts in a
running total. Anytime during that pass that the running total
was zero, a partition could be placed.

This resembles the algorithm used to scan parenthesized expressions
(Lisp, anyone?) to be sure that the parentheses are balanced
(count doesn't go negative) and closed (count ends back at zero.)
During such a scan, one increments a count on each open parenthesis,
and decrements it on each close.
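
(To make that counting pass concrete, here is a rough userspace-style
sketch, illustration only, assuming we have already collected one
cpus_allowed mask per task as an array of bytes.)

/*
 * Sketch of the counting pass described above (illustration only).
 * 'masks' holds one cpus_allowed mask per task, at byte granularity
 * as in the description: +1 at each task's lowest non-zero byte,
 * -1 at its highest, then a running total over the byte positions.
 * Wherever the total returns to zero, no task's mask spans past that
 * byte, so a sched domain partition boundary could be placed there.
 */
#include <stdio.h>

#define NR_CPUS         1024
#define NR_BYTES        (NR_CPUS / 8)

void find_partitions(unsigned char masks[][NR_BYTES], int ntasks)
{
    int counter[NR_BYTES] = { 0 };
    int t, b, total = 0;

    for (t = 0; t < ntasks; t++) {
        int lo = -1, hi = -1;

        for (b = 0; b < NR_BYTES; b++) {
            if (!masks[t][b])
                continue;
            if (lo < 0)
                lo = b;
            hi = b;
        }
        if (lo < 0)
            continue;               /* empty mask - shouldn't happen */
        counter[lo]++;              /* open at lowest non-zero byte */
        counter[hi]--;              /* close at highest non-zero byte */
    }

    for (b = 0; b < NR_BYTES; b++) {
        total += counter[b];
        if (total == 0)             /* nothing spans past this byte */
            printf("partition boundary after cpu %d\n", b * 8 + 7);
    }
}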

But I'm going to be surprised if such an algorithm actually finds good
sched domain partitions, without at least manual intervention by the
sysadmin to remove tasks from all of the large cpusets.

And if it does find good partitions, that will most likely happen on
systems that are already administered using detailed job placement,
which systems might (?) not have been a problem in the first place.

So we've got an algorithm that is automatic when we least need it,
but requires administrative intervention when it was needed (making
it not so automatic ;).

And if that's what this leads to, I'd prefer a straightforward flag
that tells the system where to partition, or, perhaps better, where it
can skip load balancing if it wants to.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-24 15:45:18

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: add interface to isolated cpus

pj wrote to Dinakar:
> The only twist to your patch I would like you to consider - instead
> of a 'sched_domain' flag marking where the partitions go, how about
> a flag that tells the kernel it is ok not to load balance tasks in
> a cpuset?

Dinakar - one possibility that might work well:

Proceed with the 'sched_domain' patch you are working on, as you planned.

(If you like, stop reading here ... <grin>.)

Then I can propose a patch on top of that, to flip the kernel-user API
to the "ok not to load balance" style I'm proposing. This patch would:
- leave your internal logic in place, as is,
- remove your 'sched_domain' flag from the visible API (keep it internally),
- add a 'cpu_must_load_balance' (default 1) flag to the API, and
- add a bit of logic to set, top down, your internal sched_domain flag,
based on the cpu_must_load_balance and parent sched_domain settings.

Then we can all see these two alternative API styles, your sched_domain
style and my cpu_must_load_balance style, and pick one (just by keeping
or tossing my extra patch.)
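
(To make the last item in that list concrete, here is a sketch of one
plausible reading of the top-down logic; the structures and names are
made up, and this is not the actual kernel cpuset code.)

/*
 * Sketch only, not actual kernel code: walk the cpuset tree top down
 * and mark a cpuset as the root of its own sched domain partition
 * when it must be load balanced and no ancestor is already being
 * balanced as a whole.  Structure and field names are made up.
 */
struct my_cpuset {
    int cpu_must_load_balance;      /* visible API flag, default 1 */
    int sched_domain;               /* internal: partition root? */
    struct my_cpuset *child;        /* first child */
    struct my_cpuset *sibling;      /* next sibling */
};

static void propagate(struct my_cpuset *cs, int ancestor_balanced)
{
    struct my_cpuset *c;

    cs->sched_domain = cs->cpu_must_load_balance && !ancestor_balanced;

    for (c = cs->child; c; c = c->sibling)
        propagate(c, ancestor_balanced || cs->cpu_must_load_balance);
}

/* Called once as: propagate(root_cpuset, 0); */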

A couple of aspects of my cpu_must_load_balance style that I like:

* The batch scheduler can turn off requiring load balancing on its
inactive cpusets, without worrying about whether it has the right
(exclusive control) to do that or not.

* I figure that users will relate better to a choice of whether or not
they require cpu load balancing, than they will to the question of
where to partition scheduler domains.

Of course, if Nick succeeds on his mission to convince us that we can
do this automatically, then the above doesn't matter, and we'd need a
different patch altogether.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-25 19:41:26

by Paul Jackson

[permalink] [raw]
Subject: Re: [RFC] cpuset: add interface to isolated cpus

Dinakar wrote:
> I really think all we need is to have a new flag (say sched_domain)
> that can be used by the admin to create a sched domain. ...
>
> I am testing this patch currently and will post it shortly for review

Are you still working on this patch? I'm still figuring that this, or
my 'cpu_must_load_balance' variation, will be our best choice.

The other avenue of pursuit, Nick's suggestion to partition based on
the individual task cpus_allowed masks, is blowing up on me. Details
below, if it matters to anyone ...

===

A couple days ago, I wrote:
>
> I can imagine an algorithm, along the lines Nick suggested yesterday,
> to extract partitions from the cpus_allowed masks of all the tasks:
>
> It would scan each byte of each cpus_allowed mask (it could do
> bits, but bytes provide sufficient granularity and are faster.)
>
> It would have an intermediate array of counters, one per cpumask_t
> byte (that is, NR_CPUS/8 counters.)
>
> For each task, it would note the lowest non-zero byte in that
> task's cpus_allowed, and increment the corresponding counter.
> Then it would also note the highest non-zero byte in that
> cpus_allowed, and decrement the corresponding counter.
>
> Then after this scan of all tasks' cpus_allowed, it would make
> a single pass over the counter array, summing the counts in a
> running total. Anytime during that pass that the running total
> was zero, a partition could be placed.
>
> This resembles the algorithm used to scan parenthesized expressions
> (Lisp, anyone?) to be sure that the parentheses are balanced
> (count doesn't go negative) and closed (count ends back at zero.)
> During such a scan, one increments a count on each open parenthesis,
> and decrements it on each close.
>
> But I'm going to be surprised if such an algorithm actually finds good
> sched domain partitions, without at least manual intervention by the
> sysadmin to remove tasks from all of the large cpusets.


This is a terrible algorithm. It does a really poor job of finding
optimum (minimum size) sched domain partitions.

For example, if we pinned every task in the system so that it is
either running on the even numbered CPUs, or on the odd numbered CPUs,
this algorithm would miss the obvious partitioning, into even and
odd numbered CPUs.

Or for a more flagrant example, let's say we pinned every single task
but one to a single CPU, and then pinned that last task to a pair of
CPUs, the lowest numbered and the highest numbered CPUs. Ideally we
would only have a single sched domain partition in the entire system
that needed balancing - a two CPU partition for that last task.
The above algorithm would instead lump all the CPUs in the system into
a single sched domain partition covering the entire system, because
every CPU is numbered in the range covered by that last task.

If people only selected cpus_allowed masks allowing adjacent CPUs, then
this algorithm would work just fine. That's not a realistic constraint.

Computing the optimal partitioning might even be an NP-complete
problem, I don't know. Whether or not a useful approximation exists,
I don't know. And even if we found one, whether or not it would
adequately meet the needs of high CPU count systems anywhere near as
well as simply marking cpusets with a flag to indicate which ones
define sched domains, or which ones don't need to be load balanced
... I don't know, but I'm skeptical.



--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401