2001-11-27 03:31:34

by Robert Love

[permalink] [raw]
Subject: [PATCH] proc-based cpu affinity user interface

Attached is my procfs-based implementation of a user interface for
getting and setting a task's CPU affinity. Patch is against 2.4.16.

The kernel already respects affinity, which is stored in
task_struct->cpus_allowed.

Reading and writing /proc/<pid>/affinity will get and set the affinity.

Security is implemented: the writer must possess CAP_SYS_NICE or be the
same uid as the task in question. Anyone can read the data.

The read mask will be ANDed with cpu_online_map, so that only valid bits
are returned. The written data must have _some_ valid bits in it.
I.e., ffffffff is valid on a 2-way system but 01000000 probably is not.
Note you don't need to pass the full mask, e.g. "64" is legal. When a
new mask is set, a reschedule is forced to put the task on a legal CPU.

Note I had to implement a proc_write function for the procfs (pid)
code. This is generic and can be used by other, writable, entries.

This patch comes as an alternative to Ingo Molnar's syscall-implemented
variant. Ingo's code is good; however I and others expressed discontent
at yet another group of syscalls. Other benefits include the ease with
which to set the affinity of tasks that are unaware of the new interface
and that with this approach applications don't need to check for the
existence of a syscall.

Comments?

Robert Love


Attachments:
cpu-affinity-rml-2.4.16-1.patch (5.03 kB)

2001-11-27 03:42:53

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On 26 Nov 2001, Robert Love wrote:

> Attached is my procfs-based implementation of a user interface for
> getting and setting a task's CPU affinity. Patch is against 2.4.16.
>
> The kernel already respects affinity, which is stored in
> task_struct->cpus_allowed.
>
> Reading and writing /proc/<pid>/affinity will get and set the affinity.
>
> Security is implemented: the writer must possess CAP_SYS_NICE or be the
> same uid as the task in question. Anyone can read the data.
>
> The read mask will be ANDed with cpu_online_map, so that only valid bits
> are returned. The written data must have _some_ valid bits in it.
> I.e., ffffffff is valid on a 2-way system but 01000000 probably is not.
> Note you don't need to pass the full mask, e.g. "64" is legal. When a
> new mask is set, a reschedule is forced to put the task on a legal CPU.
>
> Note I had to implement a proc_write function for the procfs (pid)
> code. This is generic and can be used by other, writable, entries.
>
> This patch comes as an alternative to Ingo Molnar's syscall-implemented
> variant. Ingo's code is good; however I and others expressed discontent
> at yet another group of syscalls. Other benefits include the ease with
> which to set the affinity of tasks that are unaware of the new interface
> and that with this approach applications don't need to check for the
> existence of a syscall.
>
> Comments?

As I said in reply to Ingo patch, it'd be better to expose "number" cpu
masks not "logical" ( like cpus_allowed ).
In this way the users can use 0..N-1 ( N == number of cpus phisically
available ) w/out having to know the internal mapping between logical and
number ids.




- Davide


2001-11-27 04:14:17

by Robert Love

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On Mon, 2001-11-26 at 22:52, Davide Libenzi wrote:
> As I said in reply to Ingo patch, it'd be better to expose "number" cpu
> masks not "logical" ( like cpus_allowed ).
> In this way the users can use 0..N-1 ( N == number of cpus phisically
> available ) w/out having to know the internal mapping between logical and
> number ids.

Do you mean you don't like using a bitmask ?

00000001 = first CPU, its not logical, its physical.

Plus, how do you intend to set multiple non-contiguous CPUs? A fraction
of them? With only a 32-bit value?

Note also that my patch understands the underlying CPU nature, such that
"echo 0000ffff > /proc/123/affinity" will only affine task 123 to your
first 2 CPUs if you only have two. Thus, "cat /proc/123/affinity" will
return "00000003".

Robert Love

2001-11-27 04:29:17

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On 26 Nov 2001, Robert Love wrote:

> On Mon, 2001-11-26 at 22:52, Davide Libenzi wrote:
> > As I said in reply to Ingo patch, it'd be better to expose "number" cpu
> > masks not "logical" ( like cpus_allowed ).
> > In this way the users can use 0..N-1 ( N == number of cpus phisically
> > available ) w/out having to know the internal mapping between logical and
> > number ids.
>
> Do you mean you don't like using a bitmask ?

No no, I love binary math :)


> 00000001 = first CPU, its not logical, its physical.

If You set this directly to cpus_allowed You're going to set the logical
one.
This because cpus_allowed is logical, because it's used like, ie:

int this_cpu = p->processor;

if (p->cpus_allowed & (1 << this_cpu)) ...

and p->processor is logical.
Something like :

unsigned long num2log_mask(unsigned long cmsk) {
unsigned long msk = 0;
int ii;

for (ii = 0; ii < smp_num_cpus; ii++)
if (cmsk & (1 << ii))
msk |= (1 << cpu_logical_map(ii));
return msk;
}

unsigned long log2num_mask(unsigned long cmsk) {
unsigned long msk = 0;
int ii;

for (ii = 0; ii < smp_num_cpus; ii++)
if (cmsk & (1 << ii))
msk |= (1 << cpu_number_map(ii));
return msk;
}

You use num2log_mask() on the user input mask to set cpus_allowed and
log2num_mask() to return the map to the user.




- Davide


2001-11-27 04:38:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On Mon, 26 Nov 2001, Davide Libenzi wrote:

> Something like :
>
> unsigned long num2log_mask(unsigned long cmsk) {
> unsigned long msk = 0;
> int ii;
>
> for (ii = 0; ii < smp_num_cpus; ii++)
> if (cmsk & (1 << ii))
> msk |= (1 << cpu_logical_map(ii));
> return msk;
> }
>
> unsigned long log2num_mask(unsigned long cmsk) {
> unsigned long msk = 0;
> int ii;
>
> for (ii = 0; ii < smp_num_cpus; ii++)

for (ii = 0; ii < NR_CPUS; ii++)





- Davide


2001-11-27 04:47:07

by Anton Blanchard

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface


> Attached is my procfs-based implementation of a user interface for
> getting and setting a task's CPU affinity. Patch is against 2.4.16.

Have you seen Andrew Mortons cpus_allowed patch?

http://www.zipworld.com.au/~akpm/linux/cpus_allowed.patch

Anton

2001-11-27 05:08:19

by Robert Love

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On Mon, 2001-11-26 at 23:37, Anton Blanchard wrote:
> > Attached is my procfs-based implementation of a user interface for
> > getting and setting a task's CPU affinity. Patch is against 2.4.16.
>
> Have you seen Andrew Mortons cpus_allowed patch?

No, although Andrew mentioned he had worked on it to me.

Looking it over...its of course very similar. One thing to note is his
patch doesn't incorporate the bits from Ingo's patch that I borrowed.
I.e., use of cpu_online_map, the forcing of the reschedule, etc. I
favor some of the differences in my variant.

I am glad to see he wrote a proc_write function, because I was worried I
was reimplementing something there, but I suppose not.

Either way, I'd like to see one of these in the kernel.

Robert Love

2001-11-27 06:06:08

by Tim Hockin

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

> > Attached is my procfs-based implementation of a user interface for
> > getting and setting a task's CPU affinity. Patch is against 2.4.16.
>
> Have you seen Andrew Mortons cpus_allowed patch?

I really need to push pset - I guess I should finish the 2.4.x port, eh?

http://www.hockin.org/~thockin/pset

2001-11-27 06:25:49

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On Nov 26, 2001 22:31 -0500, Robert Love wrote:
> Reading and writing /proc/<pid>/affinity will get and set the affinity.
>
> Security is implemented: the writer must possess CAP_SYS_NICE or be the
> same uid as the task in question. Anyone can read the data.

Hmm, now that I think about it, anyone should be able to restrict the
CPUs that their processes should run on, but like "nice", you should
have CAP_SYS_NICE in order to increase the number of CPUs your process
can run on. This makes it possible to "throttle" a user so that they
can only max out a single CPU.

Why would you do that? Maybe because "ulimit" and friends only allow
you to set an absolute limit on the number of CPU seconds you can use
per process, but not a "percentage" of a processor or some equivalent
"cycles per second" unit.

Not that I see this as being hugely necessary, but it may as well be
consistent with current behaviour (c.f. nice, ulimit, etc, can all go
down, but not necessarily up).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-27 06:40:41

by Robert Love

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On Tue, 2001-11-27 at 01:25, Andreas Dilger wrote:
> On Nov 26, 2001 22:31 -0500, Robert Love wrote:
> > Reading and writing /proc/<pid>/affinity will get and set the affinity.
> >
> > Security is implemented: the writer must possess CAP_SYS_NICE or be the
> > same uid as the task in question. Anyone can read the data.
>
> Hmm, now that I think about it, anyone should be able to restrict the
> CPUs that their processes should run on, but like "nice", you should
> have CAP_SYS_NICE in order to increase the number of CPUs your process
> can run on. This makes it possible to "throttle" a user so that they
> can only max out a single CPU.
>
> Why would you do that? Maybe because "ulimit" and friends only allow
> you to set an absolute limit on the number of CPU seconds you can use
> per process, but not a "percentage" of a processor or some equivalent
> "cycles per second" unit.
>
> Not that I see this as being hugely necessary, but it may as well be
> consistent with current behaviour (c.f. nice, ulimit, etc, can all go
> down, but not necessarily up).

I thought about this too, and I realized that cpus_allowed is by default
0xffffffff (i.e., all procs in the system). That means this is only an
issue when the administrator explicitly changed affinity.

And that would occur in the situation you describe ... personally, I
think users should be able to set their masks but I completely see your
argument.

I suppose we could count the bits in cpus_allowed and make sure they are
less than or equal to the new_mask bits. This seems overkill, though.
We could just not allow users to change their own affinity without
CAP_SYS_NICE, and allow root to do anything (via CAP_SYS_ADMIN).

Anyone else have an opinion?

Robert Love

2001-11-27 09:56:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface


On 26 Nov 2001, Robert Love wrote:

> Attached is my procfs-based implementation of a user interface for
> getting and setting a task's CPU affinity. Patch is against 2.4.16.

two comments. First, this has already been done - Andrew Morton has
written such a patch.

Second, as i've repeatedly said it, it's a failure to do this over /proc.
What if /proc is not mounted? What if the process is in a chroot()
environment, should it not be able to set its own affinity? This is a
fundamental limitation of your approach, and *if* we want to export the
cpus_allowed affinity to user-space (which is up to discussion), then the
right way (TM) to do it is via a syscall.

Ingo

2001-11-27 10:02:55

by Alan

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

> fundamental limitation of your approach, and *if* we want to export the
> cpus_allowed affinity to user-space (which is up to discussion), then the
> right way (TM) to do it is via a syscall.

HP and others have already implemented chunks of this stuff via syscall
interfaces. There is a complete pset api.

Alan

2001-11-27 12:21:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface


On Mon, 26 Nov 2001, Davide Libenzi wrote:

> As I said in reply to Ingo patch, it'd be better to expose "number"
> cpu masks not "logical" ( like cpus_allowed ). In this way the users
> can use 0..N-1 ( N == number of cpus phisically available ) w/out
> having to know the internal mapping between logical and number ids.

yep, agreed. I've uploaded a new set-affinity syscall patch with your
improvement added:

http://redhat.com/~mingo/set-affinity-patches/set-affinity-2.4.16-A0

i've only tested it on x86 which has a 1:1 mapping between physical and
logical CPUs, but it should be fine on other architectures as well.

Ingo

2001-11-27 12:24:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface


On Tue, 27 Nov 2001, Alan Cox wrote:

> > fundamental limitation of your approach, and *if* we want to export the
> > cpus_allowed affinity to user-space (which is up to discussion), then the
> > right way (TM) to do it is via a syscall.
>
> HP and others have already implemented chunks of this stuff via
> syscall interfaces. There is a complete pset api.

the sched_set_affinity() syscall tries to be simple, and uses the existing
->cpus_allowed mechanizm. Most of the pset patches i've seen so far are
IMHO overdoing this issue a bit, interface-wise, and do not provide more
than this simple solution.

Ingo

2001-11-27 16:39:41

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On Tue, 27 Nov 2001, Ingo Molnar wrote:

>
> On Mon, 26 Nov 2001, Davide Libenzi wrote:
>
> > As I said in reply to Ingo patch, it'd be better to expose "number"
> > cpu masks not "logical" ( like cpus_allowed ). In this way the users
> > can use 0..N-1 ( N == number of cpus phisically available ) w/out
> > having to know the internal mapping between logical and number ids.
>
> yep, agreed. I've uploaded a new set-affinity syscall patch with your
> improvement added:
>
> http://redhat.com/~mingo/set-affinity-patches/set-affinity-2.4.16-A0
>
> i've only tested it on x86 which has a 1:1 mapping between physical and
> logical CPUs, but it should be fine on other architectures as well.

The snippet I sent yesterday should be corrected by checking if
cpu_{number,logical}_map(ii) is != -1 to avoid incorrect bit settings,
expecially from mask coming from user side.



- Davide


2001-11-27 16:48:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface


On Tue, 27 Nov 2001, Davide Libenzi wrote:

> The snippet I sent yesterday should be corrected by checking if
> cpu_{number,logical}_map(ii) is != -1 to avoid incorrect bit settings,
> expecially from mask coming from user side.

indeed. New patch is at:

http://redhat.com/~mingo/set-affinity-patches/set-affinity-2.4.16-A1

(the buggy function was not a security problem, it just produces
unexpected bitmaps.)

Ingo

2001-11-27 20:29:28

by Matthew Dobson

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

diff -Nur linux-2.4.10/fs/proc/array.c linux-2.4.10-launch_policy/fs/proc/array.c
--- linux-2.4.10/fs/proc/array.c Fri Oct 26 15:07:16 2001
+++ linux-2.4.10-launch_policy/fs/proc/array.c Thu Nov 15 13:38:57 2001
@@ -50,6 +50,10 @@
* Al Viro & Jeff Garzik : moved most of the thing into base.c and
* : proc_misc.c. The rest may eventually go into
* : base.c too.
+ *
+ * Andrew Morton : cpus_allowed
+ *
+ * Matthew Dobson : launch_policy (Thanks to Andrew Morton for inspiraton)
*/

#include <linux/config.h>
@@ -343,7 +347,7 @@
read_unlock(&tasklist_lock);
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %lu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
task->pid,
task->comm,
state,
@@ -386,7 +390,9 @@
task->nswap,
task->cnswap,
task->exit_signal,
- task->processor);
+ task->processor,
+ task->cpus_allowed,
+ task->launch_policy);
if(mm)
mmput(mm);
return res;
@@ -691,5 +697,62 @@
task->per_cpu_stime[cpu_logical_map(i)]);

return len;
+}
+
+static inline int proc_pid_cpu_bitmask_read(char * buffer, unsigned long *bitmask)
+{
+ int len;
+
+ len = sprintf(buffer, "%08lx\n", *bitmask);
+ return len;
+}
+
+/*
+ * FIXME: cpu_online_map should be used, but not all archs have it.
+ */
+
+static inline int proc_pid_cpu_bitmask_write(char * buffer, size_t nbytes, unsigned long *bitmask)
+{
+ unsigned long new_mask;
+ char *endp;
+ int ret;
+ unsigned long flags;
+
+ ret = -EPERM;
+ if (!capable(CAP_SYS_ADMIN))
+ goto out;
+
+ new_mask = simple_strtoul(buffer, &endp, 16);
+ ret = endp - buffer;
+
+ spin_lock_irqsave(&runqueue_lock, flags); /* token effort to not be racy */
+ if ((((1 << smp_num_cpus) - 1) & new_mask) == 0)
+ ret = -EINVAL;
+ else
+ *bitmask = new_mask;
+ spin_unlock_irqrestore(&runqueue_lock, flags);
+
+out:
+ return ret;
+}
+
+int proc_pid_cpus_allowed_read(struct task_struct *task, char * buffer)
+{
+ return proc_pid_cpu_bitmask_read(buffer, &task->cpus_allowed);
+}
+
+int proc_pid_cpus_allowed_write(struct task_struct *task, char * buffer, size_t nbytes)
+{
+ return proc_pid_cpu_bitmask_write(buffer, nbytes, &task->cpus_allowed);
+}
+
+int proc_pid_launch_policy_read(struct task_struct *task, char * buffer)
+{
+ return proc_pid_cpu_bitmask_read(buffer, &task->launch_policy);
+}
+
+int proc_pid_launch_policy_write(struct task_struct *task, char * buffer, size_t nbytes)
+{
+ return proc_pid_cpu_bitmask_write(buffer, nbytes, &task->launch_policy);
}
#endif
diff -Nur linux-2.4.10/fs/proc/base.c linux-2.4.10-launch_policy/fs/proc/base.c
--- linux-2.4.10/fs/proc/base.c Fri Oct 26 15:07:16 2001
+++ linux-2.4.10-launch_policy/fs/proc/base.c Fri Oct 19 15:42:15 2001
@@ -39,6 +39,10 @@
int proc_pid_status(struct task_struct*,char*);
int proc_pid_statm(struct task_struct*,char*);
int proc_pid_cpu(struct task_struct*,char*);
+int proc_pid_cpus_allowed_read(struct task_struct *task, char * buffer);
+int proc_pid_cpus_allowed_write(struct task_struct *task, char * buffer, size_t nbytes);
+int proc_pid_launch_policy_read(struct task_struct *task, char * buffer);
+int proc_pid_launch_policy_write(struct task_struct *task, char * buffer, size_t nbytes);

static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
{
@@ -305,8 +309,44 @@
return count;
}

+static ssize_t proc_info_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode * inode = file->f_dentry->d_inode;
+ unsigned long page;
+ ssize_t ret;
+ struct task_struct *task = inode->u.proc_i.task;
+
+ ret = -EINVAL;
+ if (inode->u.proc_i.op.proc_write == NULL)
+ goto out;
+ if (count > PAGE_SIZE - 1)
+ goto out;
+
+ ret = -ENOMEM;
+ if (!(page = __get_free_page(GFP_KERNEL)))
+ goto out;
+
+ ret = -EFAULT;
+ if (copy_from_user((char *)page, buf, count))
+ goto out_free_page;
+
+ ((char *)page)[count] = '\0';
+ ret = inode->u.proc_i.op.proc_write(task, (char*)page, count);
+ if (ret < 0)
+ goto out_free_page;
+
+ *ppos += ret;
+
+out_free_page:
+ free_page(page);
+out:
+ return ret;
+}
+
static struct file_operations proc_info_file_operations = {
read: proc_info_read,
+ write: proc_info_write,
};

#define MAY_PTRACE(p) \
@@ -520,25 +560,29 @@
PROC_PID_STATM,
PROC_PID_MAPS,
PROC_PID_CPU,
+ PROC_PID_CPUS_ALLOWED,
+ PROC_PID_LAUNCH_POLICY,
PROC_PID_FD_DIR = 0x8000, /* 0x8000-0xffff */
};

#define E(type,name,mode) {(type),sizeof(name)-1,(name),(mode)}
static struct pid_entry base_stuff[] = {
- E(PROC_PID_FD, "fd", S_IFDIR|S_IRUSR|S_IXUSR),
- E(PROC_PID_ENVIRON, "environ", S_IFREG|S_IRUSR),
- E(PROC_PID_STATUS, "status", S_IFREG|S_IRUGO),
- E(PROC_PID_CMDLINE, "cmdline", S_IFREG|S_IRUGO),
- E(PROC_PID_STAT, "stat", S_IFREG|S_IRUGO),
- E(PROC_PID_STATM, "statm", S_IFREG|S_IRUGO),
+ E(PROC_PID_FD, "fd", S_IFDIR|S_IRUSR|S_IXUSR),
+ E(PROC_PID_ENVIRON, "environ", S_IFREG|S_IRUSR),
+ E(PROC_PID_STATUS, "status", S_IFREG|S_IRUGO),
+ E(PROC_PID_CMDLINE, "cmdline", S_IFREG|S_IRUGO),
+ E(PROC_PID_STAT, "stat", S_IFREG|S_IRUGO),
+ E(PROC_PID_STATM, "statm", S_IFREG|S_IRUGO),
#ifdef CONFIG_SMP
- E(PROC_PID_CPU, "cpu", S_IFREG|S_IRUGO),
+ E(PROC_PID_CPU, "cpu", S_IFREG|S_IRUGO),
+ E(PROC_PID_CPUS_ALLOWED, "cpus_allowed", S_IFREG|S_IRUGO|S_IWUSR),
+ E(PROC_PID_LAUNCH_POLICY, "launch_policy",S_IFREG|S_IRUGO|S_IWUSR),
#endif
- E(PROC_PID_MAPS, "maps", S_IFREG|S_IRUGO),
- E(PROC_PID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
- E(PROC_PID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
- E(PROC_PID_ROOT, "root", S_IFLNK|S_IRWXUGO),
- E(PROC_PID_EXE, "exe", S_IFLNK|S_IRWXUGO),
+ E(PROC_PID_MAPS, "maps", S_IFREG|S_IRUGO),
+ E(PROC_PID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
+ E(PROC_PID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
+ E(PROC_PID_ROOT, "root", S_IFLNK|S_IRWXUGO),
+ E(PROC_PID_EXE, "exe", S_IFLNK|S_IRWXUGO),
{0,0,NULL,0}
};
#undef E
@@ -892,6 +936,16 @@
case PROC_PID_CPU:
inode->i_fop = &proc_info_file_operations;
inode->u.proc_i.op.proc_read = proc_pid_cpu;
+ break;
+ case PROC_PID_CPUS_ALLOWED:
+ inode->i_fop = &proc_info_file_operations;
+ inode->u.proc_i.op.proc_read = proc_pid_cpus_allowed_read;
+ inode->u.proc_i.op.proc_write = proc_pid_cpus_allowed_write;
+ break;
+ case PROC_PID_LAUNCH_POLICY:
+ inode->i_fop = &proc_info_file_operations;
+ inode->u.proc_i.op.proc_read = proc_pid_launch_policy_read;
+ inode->u.proc_i.op.proc_write = proc_pid_launch_policy_write;
break;
#endif
case PROC_PID_MEM:
diff -Nur linux-2.4.10/include/linux/capability.h linux-2.4.10-launch_policy/include/linux/capability.h
--- linux-2.4.10/include/linux/capability.h Fri Oct 26 15:07:16 2001
+++ linux-2.4.10-launch_policy/include/linux/capability.h Mon Nov 19 15:27:40 2001
@@ -231,6 +231,8 @@
/* Allow enabling/disabling tagged queuing on SCSI controllers and sending
arbitrary SCSI commands */
/* Allow setting encryption key on loopback filesystem */
+/* Allow bonding of tasks to CPUs */
+/* Allow setting of launch policies */

#define CAP_SYS_ADMIN 21

diff -Nur linux-2.4.10/include/linux/prctl.h linux-2.4.10-launch_policy/include/linux/prctl.h
--- linux-2.4.10/include/linux/prctl.h Thu Jul 19 20:39:57 2001
+++ linux-2.4.10-launch_policy/include/linux/prctl.h Mon Nov 19 15:24:10 2001
@@ -20,4 +20,12 @@
#define PR_GET_KEEPCAPS 7
#define PR_SET_KEEPCAPS 8

+/* Get/set cpus allowed */
+#define PR_GET_CPUS_ALLOWED 13
+#define PR_SET_CPUS_ALLOWED 14
+
+/* Get/set launch policies */
+#define PR_GET_LAUNCH_POLICY 15
+#define PR_SET_LAUNCH_POLICY 16
+
#endif /* _LINUX_PRCTL_H */
diff -Nur linux-2.4.10/include/linux/proc_fs_i.h linux-2.4.10-launch_policy/include/linux/proc_fs_i.h
--- linux-2.4.10/include/linux/proc_fs_i.h Fri Oct 26 15:07:16 2001
+++ linux-2.4.10-launch_policy/include/linux/proc_fs_i.h Wed Oct 17 14:18:53 2001
@@ -1,9 +1,10 @@
struct proc_inode_info {
struct task_struct *task;
int type;
- union {
+ struct {
int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
int (*proc_read)(struct task_struct *task, char *page);
+ int (*proc_write)(struct task_struct *task, char *page, size_t nbytes);
} op;
struct file *file;
};
diff -Nur linux-2.4.10/include/linux/sched.h linux-2.4.10-launch_policy/include/linux/sched.h
--- linux-2.4.10/include/linux/sched.h Sun Sep 23 10:31:02 2001
+++ linux-2.4.10-launch_policy/include/linux/sched.h Mon Nov 19 15:27:40 2001
@@ -344,6 +344,7 @@
struct task_struct *pidhash_next;
struct task_struct **pidhash_pprev;

+ unsigned long launch_policy; /* for *fork*() & exec() */
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
unsigned long rt_priority;
@@ -467,6 +468,7 @@
p_opptr: &tsk, \
p_pptr: &tsk, \
thread_group: LIST_HEAD_INIT(tsk.thread_group), \
+ launch_policy: -1, \
wait_chldexit: __WAIT_QUEUE_HEAD_INITIALIZER(tsk.wait_chldexit),\
real_timer: { \
function: it_real_fn \
diff -Nur linux-2.4.10/kernel/fork.c linux-2.4.10-launch_policy/kernel/fork.c
--- linux-2.4.10/kernel/fork.c Mon Sep 17 21:46:04 2001
+++ linux-2.4.10-launch_policy/kernel/fork.c Wed Oct 24 15:55:55 2001
@@ -646,6 +646,7 @@
spin_lock_init(&p->sigmask_lock);
}
#endif
+ p->cpus_allowed = p->launch_policy; /* launch_policy is inherited from parent */
p->lock_depth = -1; /* -1 = no lock */
p->start_time = jiffies;

diff -Nur linux-2.4.10/kernel/sys.c linux-2.4.10-launch_policy/kernel/sys.c
--- linux-2.4.10/kernel/sys.c Tue Sep 18 14:10:43 2001
+++ linux-2.4.10-launch_policy/kernel/sys.c Mon Nov 19 14:55:53 2001
@@ -1256,6 +1256,33 @@
}
current->keep_capabilities = arg2;
break;
+ case PR_GET_CPUS_ALLOWED:
+ error = put_user(current->cpus_allowed, (long *)arg2);
+ break;
+ case PR_SET_CPUS_ALLOWED:
+/*
+ * FIXME: cpu_online_map should be used, but not all archs have it.
+ */
+ if ((((1 << smp_num_cpus) - 1) & arg2) == 0)
+ error = -EINVAL;
+ else {
+ current->cpus_allowed = arg2;
+ if (!((1 << smp_processor_id()) & arg2))
+ current->need_resched = 1;
+ }
+ break;
+ case PR_GET_LAUNCH_POLICY:
+ error = put_user(current->launch_policy, (long *)arg2);
+ break;
+ case PR_SET_LAUNCH_POLICY:
+/*
+ * FIXME: cpu_online_map should be used, but not all archs have it.
+ */
+ if ((((1 << smp_num_cpus) - 1) & arg2) == 0)
+ error = -EINVAL;
+ else
+ current->launch_policy = arg2;
+ break;
default:
error = -EINVAL;
break;


Attachments:
launch_policy-2.4.10.patch (10.40 kB)

2001-11-27 20:44:48

by Robert Love

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On Tue, 2001-11-27 at 06:52, Ingo Molnar wrote:
> two comments. First, this has already been done - Andrew Morton has
> written such a patch.

I didn't know this until after I started, but it is irrelevant. Use
Andrew's if you want. However, I have incorporated some useful bits
from your patch and such that I think are superior.

> Second, as i've repeatedly said it, it's a failure to do this over /proc.
> What if /proc is not mounted? What if the process is in a chroot()
> environment, should it not be able to set its own affinity? This is a
> fundamental limitation of your approach, and *if* we want to export the
> cpus_allowed affinity to user-space (which is up to discussion), then the
> right way (TM) to do it is via a syscall.

OK OK OK ... we can argue all day over syscall vs. proc. Personally, I
don't find any of the arguments fruitful -- we make all sorts of
restrictions and "Don't do thats" in the kernel. Requiring procfs isn't
the end of the world.

When you posted your initial patch, I commented I liked it but was
interested in a proc variant. Some people were interested. Even you
said it wasn't a huge deal.

It doesn't matter to me, let's just expose _some_ interface to
userspace. Personally, I prefer procfs, but both implementations are
nicely done. I respect you too much to argue religion like this. I'll
push for either variant.

Robert Love

2001-12-10 08:33:36

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

> I've been working on a patch that is a roll up of Andrew Morton's patch, and
> also of some work by Nick Pollitt to use a prctl() interface to the
> cpus_allowed bitmask. It also includes a new bitmask 'launch_policy' which
> allows a process to set a launch policy which is used to determine it's
> childrens' cpus_allowed bitmasks prior to launch.

It looks like you are limiting the number of CPUs to sizeof(long).
Must you? Using "%lx" would be better in any case. Considering that
you may outgrow the format, maybe this info doesn't belong in the
/proc/*/stat files at all. For "ps" usage, a simple flag to indicate
if the process is locked to a CPU would be OK. There are 3 cases
of interest:

1. can run on all CPUs
2. can run on only the currently used CPU
3. can run on some subset of the available CPUs

(defined sanely for 1-CPU and 2-CPU systems of course)


> diff -Nur linux-2.4.10/fs/proc/array.c linux-2.4.10-launch_policy/fs/proc/array.c
> --- linux-2.4.10/fs/proc/array.c Fri Oct 26 15:07:16 2001
> +++ linux-2.4.10-launch_policy/fs/proc/array.c Thu Nov 15 13:38:57 2001
> @@ -50,6 +50,10 @@
> * Al Viro & Jeff Garzik : moved most of the thing into base.c and
> * : proc_misc.c. The rest may eventually go into
> * : base.c too.
> + *
> + * Andrew Morton : cpus_allowed
> + *
> + * Matthew Dobson : launch_policy (Thanks to Andrew Morton for inspiraton)
> */
>
> #include <linux/config.h>
> @@ -343,7 +347,7 @@
> read_unlock(&tasklist_lock);
> res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
> %lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %lu %lu %ld %lu %lu %lu %lu %lu \
> -%lu %lu %lu %lu %lu %lu %lu %lu %d %d\n",
> +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
> task->pid,
> task->comm,
> state,
> @@ -386,7 +390,9 @@
> task->nswap,
> task->cnswap,
> task->exit_signal,
> - task->processor);
> + task->processor,
> + task->cpus_allowed,
> + task->launch_policy);

2001-12-10 08:40:46

by Robert Love

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

On Mon, 2001-12-10 at 03:33, Albert D. Cahalan wrote:

> It looks like you are limiting the number of CPUs to sizeof(long).
> Must you? Using "%lx" would be better in any case. Considering that
> you may outgrow the format, maybe this info doesn't belong in the
> /proc/*/stat files at all. For "ps" usage, a simple flag to indicate
> if the process is locked to a CPU would be OK. There are 3 cases
> of interest:

We already limit it... we use cpus_allowed and cpus_runnable which are
unsigned long.

Robert Love

2001-12-10 09:38:15

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: [PATCH] proc-based cpu affinity user interface

Robert Love writes:
> On Mon, 2001-12-10 at 03:33, Albert D. Cahalan wrote:

>> It looks like you are limiting the number of CPUs to sizeof(long).
>> Must you? Using "%lx" would be better in any case. Considering that
>> you may outgrow the format, maybe this info doesn't belong in the
>> /proc/*/stat files at all. For "ps" usage, a simple flag to indicate
>> if the process is locked to a CPU would be OK. There are 3 cases
>> of interest:
>
> We already limit it... we use cpus_allowed and cpus_runnable which are
> unsigned long.

Those can be changed without screwing up user code. You need to
be very careful about what you put in /proc, because user apps
will come to rely on whatever you choose.

Look at what happened with signals in /proc/*/stat. They were
originally reported as "%lu", just like you did. Then the kernel
was extended to support more than 32 signals on 32-bit hardware.
The format was briefly changed to be 16 hex chars, but this
broke lots of stuff. So now /proc/*/stat just reports the low
numbered signals as it did before, and /proc/*/status has the
rest of the signals.

So let's predict:

Around 2.5.42, IBM will have 60-way x86 or 80-way ppc64.
Then the kernel will be hacked to report a hex string of
arbitrary length, and user apps will die left and right.
Somebody will then hack things back to decimal, reporting
only the low sizeof(long)*8 processors in /proc/*/stat.
Any additional processors may be found as a hex string in
/proc/*/status, with the /proc/*/stat data useless on a
system with very many processors.