2004-09-10 08:54:48

by Simon Derr

Subject: [rfc][patch] 1/2 Additional cpuset features

Hi,

I'm sending a series of two cpuset-related patches.
I'm not asking for their inclusion in the current kernel.
These are just feature proposals.

Actually, these patches aim to restore the "original" behaviour of the
cpusets, as it was about a year ago. Since then, to ease the
acceptance of the cpusets, some 'controversial' features have been
removed.

This first patch adds automatic process migration upon cpuset
modification. Whenever the list of CPUs of a cpuset changes, the kernel
will move tasks so they stay "inside" their cpuset.

Applies against 2.6.9-rc1-mm4.


Signed-Off-By: Simon Derr <[email protected]>
Index: mm4/kernel/cpuset.c
===================================================================
--- mm4.orig/kernel/cpuset.c 2004-09-08 10:01:23.168151957 +0200
+++ mm4/kernel/cpuset.c 2004-09-08 10:13:30.335135237 +0200
@@ -534,6 +534,84 @@
 	return 0;
 }

+/**
+ * migrate_cpuset_processes - re-place processes into their cpuset's cpus
+ * @cs: the cpuset whose processes we have to migrate.
+ *
+ * When the list of CPUs of cpuset @cs changes, we have to update all the
+ * attached processes' masks, and maybe even migrate them.
+ * Should be called with the cpuset_sem held
+ */
+static void migrate_cpuset_processes(struct cpuset * cs)
+{
+	struct task_struct *g, *p;
+	/* This should be a RARE use of the cpusets.
+	 * therefore we'll prefer an inefficient operation here
+	 * (searching the whole process list)
+	 * than adding another list_head in task_t
+	 * and locks and list_add for each fork()
+	 */
+
+	/* we need to lock tasklist_lock for reading the processes list
+	 * BUT we cannot call set_cpus_allowed with any spinlock held
+	 * => we need to store the list of task struct in an array
+	 */
+	struct task_struct ** array;
+	int first = 1;
+	int nb = 0;
+	int sz;
+
+retry:
+	/* at most cs->count - 1 processes to migrate */
+	/* keep some room in case some processes fork() during kmalloc() */
+	sz = atomic_read(&cs->count) + 10;
+	array = (struct task_struct **) kmalloc(sz * sizeof(struct task_struct *), GFP_ATOMIC);
+	if (!array) {
+		printk("Error allocating array in migrate_cpuset_processes !\n");
+		return;
+	}
+	/* see linux/sched.h for this nested for/do-while loop */
+	read_lock(&tasklist_lock);
+	do_each_thread(g, p) {
+		if (p->cpuset == cs) {
+			if (nb == sz) {
+				printk("migrate_cpuset_processes: array full !\n");
+				read_unlock(&tasklist_lock);
+				kfree(array);
+				goto retry;
+			}
+			get_task_struct(p);
+			array[nb++] = p;
+		}
+	} while_each_thread(g, p);
+	read_unlock(&tasklist_lock);
+
+	while(nb) {
+		struct task_struct * p = array[--nb];
+		cpumask_t cpus;
+		/*
+		 * If the tasks current CPU placement overlaps with its new cpuset,
+		 * then let it run in that overlap. Otherwise fallback to simply
+		 * letting it have the run of the CPUs in the new cpuset.
+		 */
+		cpus_and(cpus, p->cpus_allowed, cs->cpus_allowed);
+		if (cpus_empty(cpus))
+			cpus = cs->cpus_allowed;
+		set_cpus_allowed(p, cpus);
+		put_task_struct(p);
+	}
+	kfree(array);
+	/* what happens if a task present in the array forks now ?
+	 * solution (ahem) -- do everything twice -- that way forked
+	 * tasks missed by the first pass will be taken by the second pass,
+	 * and the tasks missed by the second pass have their parent taken
+	 * by the first pass */
+	if (first) {
+		first = 0;
+		goto retry;
+	}
+}
+
 static int update_cpumask(struct cpuset *cs, char *buf)
 {
 	struct cpuset trialcs;
@@ -544,9 +622,13 @@
 	if (retval < 0)
 		return retval;
 	cpus_and(trialcs.cpus_allowed, trialcs.cpus_allowed, cpu_online_map);
-	retval = validate_change(cs, &trialcs);
-	if (retval == 0)
-		cs->cpus_allowed = trialcs.cpus_allowed;
+	if (!cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed)) {
+		retval = validate_change(cs, &trialcs);
+		if (retval == 0) {
+			cs->cpus_allowed = trialcs.cpus_allowed;
+			migrate_cpuset_processes(cs);
+		}
+	}
 	return retval;
 }
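
The mask selection done for each task in the patch above can be tried out
in user space. The following is only a sketch under stated assumptions:
cpu_mask and new_task_mask() are hypothetical names, and a 64-bit word
stands in for the kernel's cpumask_t. It mirrors the cpus_and() /
cpus_empty() fallback, nothing more.

```c
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_mask;

/*
 * Choose the mask a task gets after its cpuset's CPU list has changed.
 * If the task's old mask overlaps the new cpuset mask, keep the overlap;
 * otherwise fall back to every CPU of the cpuset.
 */
static cpu_mask new_task_mask(cpu_mask task_allowed, cpu_mask cpuset_allowed)
{
	cpu_mask overlap = task_allowed & cpuset_allowed;

	return overlap ? overlap : cpuset_allowed;
}
```

For example, a task allowed on CPUs 0-3 whose cpuset becomes CPUs 2-5 would
stay on CPUs 2-3, while a task allowed only on CPUs 0-1 would get all of
CPUs 2-5.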


2004-09-11 08:08:55

by Paul Jackson

Subject: Re: [rfc][patch] 1/2 Additional cpuset features

Good luck with these patches, Simon, though I do not support them.

For the record, I was the one most responsible for removing these two
patches:

1) auto task migration on cpuset change, and
2) cpuset relative CPU/Memory numbering.

I continue to think that these can be done just as well in user space.
A bit better in user space actually, as the locking for (1) is easier
from user space, and the opportunity for more flexible adaption to
different renumbering needs that (2) attempts is easier from user space.

But if others find these worth pursuing in kernel space, so be it.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-09-23 19:43:50

by Christoph Lameter

Subject: Re: [rfc][patch] 1/2 Additional cpuset features

The cpuset relative numbering may be essential for consistent
cpu numbering, e.g. in ia64's perfmon etc. This may affect multiple
subsystems of the kernel that enumerate CPUs.

Simon's 2nd patch provides a translation that we need at SGI for perfmon
support within a cpuset. Without the virtualization, some
means needs to exist in user space to translate a virtual CPU number
into a physical CPU number.

On Sat, 11 Sep 2004, Paul Jackson wrote:

> Good luck with these patches, Simon, though I do not support them.
>
> For the record, I was the one most responsible for removing these two
> patches:
>
> 1) auto task migration on cpuset change, and
> 2) cpuset relative CPU/Memory numbering.
>
> I continue to think that these can be done just as well in user space.
> A bit better in user space actually, as the locking for (1) is easier
> from user space, and the opportunity for more flexible adaption to
> different renumbering needs that (2) attempts is easier from user space.
>
> But if others find these worth pursuing in kernel space, so be it.

2004-09-23 23:49:15

by Paul Jackson

Subject: Re: [rfc][patch] 1/2 Additional cpuset features

> Simon's 2nd patch provides a translation that we need at SGI for perfmon
> support within a cpuset. Without the virtualization some
> means in user space needs to exist to translate a virtual CPU number
> into a physical CPU number.

In my opinion, user space is exactly the right place for this translation.

Those inside SGI can see more detail of this in SGI Incident 903969.

But the gist of the matter is simple. Just as we (SGI) did with
cpumemsets and perfmon on 2.4 kernels, so should we do with cpusets and
perfmon on 2.6 kernels. And that is to perform this translation in
perfmon code. Is it only SGI's dplace that requires the cpuset-relative
numbering?

The kernel-user boundary should stick to a single, system-wide, numbering
of CPUs.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-09-24 00:13:45

by Christoph Lameter

Subject: Re: [rfc][patch] 1/2 Additional cpuset features

On Thu, 23 Sep 2004, Paul Jackson wrote:

> > Simon's 2nd patch provides a translation that we need at SGI for perfmon
> > support within a cpuset. Without the virtualization some
> > means in user space needs to exist to translate a virtual CPU number
> > into a physical CPU number.
>
> In my opinion, user space is exactly the right place for this translation.

How do you do this translation? Search through /dev/cpusets?

> But the gist of the matter is simple. Just as we (SGI) did with
> cpumemsets and perfmon on 2.4 kernels, so should we do with cpusets and
> perfmon on 2.6 kernels. And that is to perform this translation in
> perfmon code. Is it only SGI's dplace that requires the cpuset-relative
> numbering?

pfmon, sched_setaffinity, dplace. And this is only what I saw today.
Might develop into a longer list. The 2.4 solutions were rather
complicated.

> The kernel-user boundary should stick to a single, system-wide, numbering
> of CPUs.

That leads to lots of complicated scripts doing logical -> physical
translation with the danger of access or attempting accesses to not
allowed CPUs. It may be easier to contain tasks into a range of cpus if
the CPUs in use are easily enumerable.

The view from inside a cpuset could simply be of a system with N cpus
(0..N-1) with N memory areas (0..N-1). No access to outside cpus or memory
is allowed. Kernel checks for valid cpu and memory area by simply checking
against an upper boundary on both and then maps these numbers dynamically
according to the CPU set.

That's what Simon's patch allows.
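
To make the mapping described here concrete, a rough user-space sketch
follows. It is not taken from Simon's patch: relative_to_system() is a
hypothetical name, and a 64-bit word is assumed as a stand-in for the
cpuset's CPU mask. A relative number outside 0..N-1 is rejected, which
plays the role of the upper-boundary check.

```c
#include <stdint.h>

/*
 * Map a cpuset-relative CPU number (0..N-1) to a system-wide CPU number.
 * cpuset_mask holds one bit per system CPU (at most 64 in this sketch).
 * Returns -1 when rel does not name a CPU of the cpuset.
 */
static int relative_to_system(uint64_t cpuset_mask, int rel)
{
	int cpu;

	if (rel < 0)
		return -1;
	for (cpu = 0; cpu < 64; cpu++) {
		if (!(cpuset_mask & (UINT64_C(1) << cpu)))
			continue;	/* system CPU not in this cpuset */
		if (rel-- == 0)
			return cpu;	/* found the rel-th member */
	}
	return -1;			/* rel >= N */
}
```

With a cpuset holding system CPUs 4-7 and 12 (mask 0x10F0), relative CPU 0
is system CPU 4 and relative CPU 4 is system CPU 12.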

The patch would allow the use of the existing tools as if the machine
only had N cpus (as you said a soft partitioning of the machine). If
scripts are to be used with the current approach then they need to know
about all the CPUs in the system and perform the mapping. It's going to be
a nightmare to develop scripts that partition off a 512 cpu cluster
appropriately and that track the physical cpu numbers instead of the cpu
number within the cpuset.


2004-09-24 01:17:10

by Paul Jackson

Subject: Re: [rfc][patch] 1/2 Additional cpuset features

Christoph wrote:
> How do you do this translation? Search through /dev/cpusets?

Map from pid to cpuset to cpus. No searching.

The file /proc/<pid>/cpuset names the cpuset to which that pid is
attached. Presuming the cpuset file system is mounted at /dev/cpuset,
then the file /dev/cpuset/xxx/cpus lists the cpus in the cpuset named
'xxx'.
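
A user-space helper for this pid -> cpuset -> cpus lookup might look
roughly like the sketch below. The function names are hypothetical. It also
assumes that /proc/<pid>/cpuset yields a name of the form "/xxx" and that
the cpus file uses a list format such as "0-3,8"; neither detail is spelled
out above, so treat both as assumptions. Reading the two files is left out.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Build the path of the "cpus" file for a cpuset name as read from
 * /proc/<pid>/cpuset (assumed form: "/xxx"), with the cpuset file
 * system mounted at /dev/cpuset. Returns 0 on success, -1 if buf is
 * too small.
 */
static int cpus_path(const char *cpuset_name, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/dev/cpuset%s/cpus", cpuset_name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Parse a CPU list such as "0-3,8" into a 64-bit mask of system-wide
 * CPU numbers. Returns 0 on success, -1 on malformed input or on a
 * CPU number of 64 or more.
 */
static int parse_cpu_list(const char *s, uint64_t *mask)
{
	*mask = 0;
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= 64)
			return -1;
		for (long c = lo; c <= hi; c++)
			*mask |= UINT64_C(1) << c;
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return 0;
}
```

A library could combine the two: read the cpuset name for a pid, build the
path, read the list, and hand the resulting system-wide mask to the caller.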

> pfmon, sched_setaffinity, dplace.

To the best of my current understanding, the only reason perfmon
wants relative numbering is because that's what dplace wants.

Sched_setaffinity uses system-wide numbering, no?

> That leads to lots of complicated scripts doing logical -> physical
> translation with the danger of access or attempting accesses to not
> allowed CPUs.

No -- it leads to more user level libraries and tools, encapsulating
the complexity, layering the abstractions.

And "danger" ... what's dangerous? An application in a cpuset won't
be able to use (if that's what you meant by 'access') CPUs outside
its cpuset. Nothing dangerous there that I see.

> The view from inside a cpuset could simply be of a system with N cpus
> (0..N-1) with N memory areas (0..N-1). No access to outside cpus or memory
> is allowed. Kernel checks for valid cpu and memory area by simply checking
> against an upper boundary on both and then maps these numbers dynamically
> according to the CPU set.
>
> That's what Simon's patch allows.

Regardless, that's the eventual view seen by some apps from inside the
cpuset. We're just discussing where the translation code goes. I see
nothing that requires kernel privilege or synchronization here.

> It's going to be a nightmare to develop scripts that partition off a 512
> cpu cluster appropriately and that track the physical cpu numbers
> instead of the cpu number within the cpuset.

No need for any nightmares.

Just because the meaning of CPU numbers at the kernel-user boundary is
system-wide doesn't mean that this view has to be imposed on all above.
We should write the higher level stuff as if the kernel could do what
you want with relative numbering, then arrange the tools and libraries
to convert.

Just because something is essential doesn't mean the kernel needs to do
it. And just because I oppose putting something in the kernel doesn't
mean I oppose doing it. Indeed, I'm doing quite a bit of work in this
very direction ... outside the kernel.

We have more reasons than just this issue of numbering to require a
robust set of user level libraries and tools. Pretty much everyone
working in this area seems to agree that a decent library layer is
needed on top of the raw kernel API's, which are difficult to code to
directly, and vary in "interesting" ways between the affinity, the numa
and the cpuset interfaces (e.g. three different forms for passing
bitmaps).

This is perhaps the biggest difference between what SGI does on Irix,
and what is happening in Linux 2.6. Quite a bit is moved outside the
kernel.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-09-24 01:21:46

by Anton Blanchard

Subject: Re: [rfc][patch] 1/2 Additional cpuset features


Hi,

> > But the gist of the matter is simple. Just as we (SGI) did with
> > cpumemsets and perfmon on 2.4 kernels, so should we do with cpusets and
> > perfmon on 2.6 kernels. And that is to perform this translation in
> > perfmon code. Is it only SGI's dplace that requires the cpuset-relative
> > numbering?
>
> pfmon, sched_setaffinity, dplace. And this is only what I saw today.
> Might develop into a longer list. The 2.4 solutions were rather
> complicated.

Are pfmon and dplace SGI specific? sched_affinity users already have to
deal with potentially discontiguous cpu maps. I've been teaching IBM
applications about this fact as I find problems.

> > The kernel-user boundary should stick to a single, system-wide, numbering
> > of CPUs.
>
> That leads to lots of complicated scripts doing logical -> physical
> translation with the danger of access or attempting accesses to not
> allowed CPUs. It may be easier to contain tasks into a range of cpus if
> the CPUs in use are easily enumerable.

I would think you could write this in your userspace library.

> The patch would allow the use of the existing tools as if the machine
> only had N cpus (as you said a soft partitioning of the machine). If
> scripts are to be used with the current approach then they need to know
> about all the CPUs in the system and perform the mapping. It's going to be
> a nightmare to develop scripts that partition off a 512 cpu cluster
> appropriately and that track the physical cpu numbers instead of the cpu
> number within the cpuset.

What happens when an application (or user) looks in /proc/cpuinfo?
And how does /sys/.../cpus match? Also what happens when you hotplug out
a cpu and your memory map becomes discontiguous?

Anton

2004-09-24 14:42:48

by Robin Holt

Subject: Re: [rfc][patch] 1/2 Additional cpuset features

On Fri, Sep 24, 2004 at 11:17:51AM +1000, Anton Blanchard wrote:
>
> Hi,
>
> > > But the gist of the matter is simple. Just as we (SGI) did with
> > > cpumemsets and perfmon on 2.4 kernels, so should we do with cpusets and
> > > perfmon on 2.6 kernels. And that is to perform this translation in
> > > perfmon code. Is it only SGI's dplace that requires the cpuset-relative
> > > numbering?
> >
> > pfmon, sched_setaffinity, dplace. And this is only what I saw today.
> > Might develop into a longer list. The 2.4 solutions were rather
> > complicated.
>
> Are pfmon and dplace SGI specific? sched_affinity users already have to
> deal with potentially discontiguous cpu maps. I've been teaching IBM
> applications about this fact as I find problems.
>

pfmon comes from HP's perfmon package. dplace is an SGI-specific tool that
is being open sourced. It allows very complex process placement within a
cpuset. It uses process aggregates to migrate processes based upon rules
like "the Nth invocation of this name goes to this relative cpu".

Paul, aren't we going to adjust dplace so it uses the user libraries to
interpret the relative placement information provided in the application's
configuration file into kernel logical cpus before passing that into the
kernel module?

> > > The kernel-user boundary should stick to a single, system-wide, numbering
> > > of CPUs.
> >
> > That leads to lots of complicated scripts doing logical -> physical
> > translation with the danger of access or attempting accesses to not
> > allowed CPUs. It may be easier to contain tasks into a range of cpus if
> > the CPUs in use are easily enumerable.
>
> I would think you could write this in your userspace library.
>
> > The patch would allow the use of the existing tools as if the machine
> > only had N cpus (as you said a soft partitioning of the machine). If
> > scripts are to be used with the current approach then they need to know
> > about all the CPUs in the system and perform the mapping. It's going to be
> > a nightmare to develop scripts that partition off a 512 cpu cluster
> > appropriately and that track the physical cpu numbers instead of the cpu
> > number within the cpuset.
>
> What happens when an application (or user) looks in /proc/cpuinfo?
> And how does /sys/.../cpus match? Also what happens when you hotplug out
> a cpu and your memory map becomes discontiguous?
>
> Anton

2004-09-24 16:09:08

by Paul Jackson

Subject: Re: [rfc][patch] 1/2 Additional cpuset features

Robin wrote:
> Paul, aren't we going to adjust dplace so it uses the user libraries to
> interpret the relative placement ...

Yup - either adjust dplace to accept system numbers, or adjust perfmon
to translate the system numbers that it wants to pass to dplace to
cpuset relative numbers first. Look at the SGI internal incidents
assigned to Christoph Lameter. I believe he's assigned this task, and
I'll bet that this is related to his response to Simon's relative cpuset
numbering patch, which started this subthread. We've come full circle.

We're wandering off the lkml ranch here ... into vendor specific,
user space stuff.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373