the attached set-affinity-A1 patch is relative to the scheduler
fixes/cleanups in 2.4.15-pre9. It implements the following two
new system calls:
asmlinkage int sys_sched_set_affinity(pid_t pid, unsigned int mask_len,
	unsigned long *new_mask_ptr);
asmlinkage int sys_sched_get_affinity(pid_t pid,
	unsigned int *user_mask_len_ptr, unsigned long *user_mask_ptr);
As a testcase, softirq.c is updated to use this mechanism; also see the
attached loop_affine.c code.
the sched_set_affinity() syscall also ensures that the target process will
run on the right CPU (or CPUs).
I think this interface is the right way to expose user-selectable affinity
to user-space - there are more complex affinity interfaces in existence,
but I believe that the discovery of the actual caching hierarchy is and
should be up to a different mechanism; I don't think it should be mixed
into the affinity syscalls. Using a mask of linear CPU IDs is IMO
sufficient to express user-space affinity wishes.
There are no security issues wrt. cpus_allowed, so these syscalls are
available to every process. (there are permission restrictions of course,
similar to those of existing scheduler syscalls.)
sched_get_affinity(pid, &mask_len, NULL) can be used to query the kernel's
supported CPU bitmask length. This should help us in achieving a stable
libc interface once we get over the 32/64 CPUs limit.
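For illustration, user-space stubs for these calls could look like the
following minimal sketch (the __NR_ numbers are placeholders, not values
from the patch; the real ones would come from the patched <asm/unistd.h>):

#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define __NR_sched_set_affinity  250   /* placeholder number */
#define __NR_sched_get_affinity  251   /* placeholder number */

static int sched_set_affinity(pid_t pid, unsigned int len, unsigned long *mask)
{
	return syscall(__NR_sched_set_affinity, pid, len, mask);
}

static int sched_get_affinity(pid_t pid, unsigned int *len, unsigned long *mask)
{
	return syscall(__NR_sched_get_affinity, pid, len, mask);
}

int main(void)
{
	unsigned int len = 0;
	unsigned long mask = 0;

	sched_get_affinity(getpid(), &len, NULL);   /* query supported mask length */
	sched_get_affinity(getpid(), &len, &mask);  /* read current affinity */
	mask = 0x1;                                 /* bind to CPU 0 */
	return sched_set_affinity(getpid(), sizeof(mask), &mask) ? 1 : 0;
}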
The attached loop_affine.c code tests both syscalls:
mars:~> ./loop_affine
current process's affinity: 4 bytes mask, value 000000ff.
trying to set process: affinity to 00000001.
current process's affinity: 4 bytes mask, value 00000001.
speed: 2162052 loops.
speed: 2162078 loops.
[...]
i've tested the patch on both SMP and UP systems. On UP the syscalls are
pretty pointless, but they show that the internal state of the scheduler
folds nicely into the UP case as well:
mars:~> ./loop_affine
current process's affinity: 4 bytes mask, value 00000001.
trying to set process: affinity to 00000001.
current process's affinity: 4 bytes mask, value 00000001.
speed: 2160880 loops.
speed: 2160511 loops.
[...]
comments? Is there any reason to do a more complex interface than this?
Ingo
On Thu, 22 Nov 2001, Ingo Molnar wrote:
>
> the attached set-affinity-A1 patch is relative to the scheduler
> fixes/cleanups in 2.4.15-pre9. It implements the following two
> new system calls:
>
> asmlinkage int sys_sched_set_affinity(pid_t pid, unsigned int mask_len,
> unsigned long *new_mask_ptr);
>
> asmlinkage int sys_sched_get_affinity(pid_t pid, unsigned int
> *user_mask_len_ptr, unsigned long *user_mask_ptr);
I think that maybe it's better to have a new type :
typedef whatever-is-appropriate cpu_affinity_t;
with a set of macros :
CPU_AFFINITY_INIT(aptr)
CPU_AFFINITY_SET(aptr, n)
CPU_AFFINITY_ISSET(aptr, n)
so that we can simplify the interfaces:
asmlinkage int sys_sched_set_affinity(pid_t pid, cpu_affinity_t *aptr);
asmlinkage int sys_sched_get_affinity(pid_t pid, cpu_affinity_t *aptr);
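To make the suggestion concrete, such a type and macro set could be
modeled on fd_set/FD_SET, roughly as in this sketch (the width and the
helper internals are assumptions, not spelled out in the proposal):

#include <string.h>

#define CPU_AFFINITY_BITS   64	/* assumed upper bound on CPUs */
#define CPU_AFFINITY_LONGS \
	((CPU_AFFINITY_BITS + 8 * sizeof(unsigned long) - 1) / (8 * sizeof(unsigned long)))

typedef struct {
	unsigned long bits[CPU_AFFINITY_LONGS];
} cpu_affinity_t;

#define CPU_AFFINITY_INIT(aptr) \
	memset((aptr), 0, sizeof(cpu_affinity_t))
#define CPU_AFFINITY_SET(aptr, n) \
	((aptr)->bits[(n) / (8 * sizeof(unsigned long))] |= \
		1UL << ((n) % (8 * sizeof(unsigned long))))
#define CPU_AFFINITY_ISSET(aptr, n) \
	(((aptr)->bits[(n) / (8 * sizeof(unsigned long))] >> \
		((n) % (8 * sizeof(unsigned long)))) & 1UL)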
- Davide
On Thu, 2001-11-22 at 03:59, Ingo Molnar wrote:
> the attached set-affinity-A1 patch is relative to the scheduler
> fixes/cleanups in 2.4.15-pre9. It implements the following two
> new system calls: [...]
Ingo, I like your implementation, particularly the use of the
cpu_online_map, although I am not sure all archs implement it yet. I
am curious, however, what you would think of using a /proc interface
instead of a set of syscalls?
I.e., we would have a /proc/<pid>/cpu_affinity which is the same as your
`unsigned long * user_mask_ptr'. Reading and writing of the proc
interface would correspond to your get and set syscalls. Besides the
sort of relevancy and useful abstraction of putting the affinity in the
procfs, it eliminates any sizeof(cpus_allowed) problem since the read
string is the size in characters of cpus_allowed.
I would use your syscall code, though -- just reimplement it as a procfs
file. This would mean adding a proc_write function, since the _actual_
procfs (the proc part) only has a read method, but that is simple.
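As a sketch of what driving such a file from C could look like, assuming
it takes and returns a hex mask string the way /proc/irq/N/smp_affinity
does (names below are illustrative, not from a posted patch):

#include <stdio.h>

int read_affinity(int pid, unsigned long *mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/cpu_affinity", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;			/* no cpu_affinity file */
	if (fscanf(f, "%lx", mask) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

int write_affinity(int pid, unsigned long mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/cpu_affinity", pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%08lx\n", mask);
	return fclose(f) ? -1 : 0;
}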
Thoughts?
Robert Love
On November 22, 2001 15:45, Robert Love wrote:
>
> Ie, we would have a /proc/<pid>/cpu_affinity which is the same as your
> `unsigned long * user_mask_ptr'. Reading and writing of the proc
> interface would correspond to your get and set syscalls. Besides the
> sort of relevancy and useful abstraction of putting the affinity in the
> procfs, it eliminates any sizeof(cpus_allowed) problem since the read
> string is the size in characters of cpus_allowed.
>
> I would use your syscall code, though -- just reimplement it as a procfs
> file. This would mean adding a proc_write function, since the _actual_
> procfs (the proc part) only has a read method, but that is simple.
>
> Thoughts?
Hear, hear. I was just thinking "Well, I like the CPU affinity idea, but I
loathe syscall creep... I hope this Robert Love fellow says something about
that" as I read your email's header.
In addition to keeping the syscall table from being filled with very
specific, non-standard, and use-once syscalls, a /proc interface would allow
me to change the CPU affinity of processes that aren't {get, set}_affinity
aware (i.e., all Linux applications written up to this point). This isn't
very different from how it's possible to change a process's other scheduling
properties (priority, scheduler) from another process. Imagine if renice(8)
had to be implemented as attaching to a process and calling nice(2)... ick.
Also, as an application developer, I try to avoid conditionally compiled,
system-specific calls. I would have far fewer "cleanliness" objections
to testing for the /proc/<pid>/cpu_affinity file's existence and
conditionally writing to it. Compare this to the hacks some network servers
use to try to detect sendfile(2)'s presence at runtime, and you'll see what I
mean. Remember, everything is a file ;)
And one final thing... what sort of benefit does CPU affinity have if we
have the scheduler take CPU migration costs into account correctly? I can think
of a lot of corner cases, but in general, it seems to me that it's a lot more
sane to have the scheduler decide where processes belong. What if an
application with n threads, where n is less than the number of CPUs, has to
decide which CPUs to bind its threads to? What if a similar app, or another
instance of the same app, already decided to bind against the same set of
CPUs? The scheduler is stuck with an unfair scheduling load on those poor
CPUs, because the scheduling decision was moved away from where it really
should take place: the scheduler. I'm sure I'm missing something, though.
-Ryan
> CPUs, because the scheduling decision was moved away from where it really
> should take place: the scheduler. I'm sure I'm missing something, though.
only that it's nontrivial to estimate the migration costs, I think.
at one point, around 2.3.3*, there was some effort at doing this -
or something like it. specifically, the scheduler kept track of
how long a process ran on average, and was slightly more willing
to migrate a short-slice process than a long-slice. "short" was
defined relative to cache size and a WAG at dram bandwidth.
the rationale was that if you run for only 100 us, you probably
don't have a huge working set. that justification is pretty thin,
and perhaps that's why the code gradually disappeared.
hmm, you really want to monitor things like paging and cache misses,
but both might be tricky, and would be tricky to use sanely.
a really simple, and appealing heuristic is to migrate a process
that hasn't run for a long while - any cache state it may have had
is probably gone by now, so there *should* be no affinity.
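a sketch of that heuristic - the field name and threshold below are
assumptions, not existing 2.4 code:

/* a task that has been off the CPU for a while probably has no useful
 * cache state left, so it is cheap to migrate */
#define CACHE_DECAY_JIFFIES	(HZ / 100)	/* assumed: ~10ms */

static inline int cache_probably_cold(struct task_struct *p)
{
	return (jiffies - p->last_run) > CACHE_DECAY_JIFFIES;
}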
regards, mark hahn.
On Thu, 2001-11-22 at 19:20, Ryan Cumming wrote:
> Here here, I was just thinking "Well, I like the CPU affinity idea, but I
> loathe syscall creep... I hope this Robert Love fellow says something about
> that" as I read your email's header.
Ah, we think the same way. The reason I spoke up, though, is in
addition to disliking the syscall way and liking the proc way, I thought
Ingo's implementation was nicely done. In particular, the use of
cpu_online_map and forcing the reschedule were things I probably
wouldn't have thought of.
> In addition to keeping the syscall table from being filled with very
> specific, non-standard, and use-once syscalls, a /proc interface would allow
> me to change the CPU affinity of processes that aren't {get, set}_affinity
> aware (i.e., all Linux applications written up to this point). This isn't
> very different from how it's possible to change a processes other scheduling
> properties (priority, scheduler) from another process. Imagine if renice(8)
> had to be implemented as attaching to a process and calling nice(2)... ick.
Heh, this seems like the strongest argument yet, and I didn't even
mention it. Note, however, that there is a pid_t field in the syscall
and from glossing over the code it seems you can set the affinity of any
arbitrary task given you have the right permissions. Thus, we could
make a binary that took in a pid and a cpu mask, and set the affinity.
But I still think "echo 0xffffffff > /proc/768/cpu_affinity" is nicer.
This opens up the issue of permissions with my proc suggestion, and we
have some options:
Users can set the affinity of their own task, root can set anything.
One needs a CAP capability to set affinity (which root of course has).
Everyone can set anything, or only root can set affinity.
I would suggest letting users set their own affinity (since it only
lessens what they can do) and let a capability dictate whether non-root
users can set other users' tasks' affinities. CAP_SYS_ADMIN would do fine.
> Also, as an application developer, I try to avoid conditionally compiled,
> system-specific calls. I would have much less "cleanliness" objections
> towards testing for the /proc/<pid>/cpu_affinity files existance and
> conditionally writing to it. Compare this to the hacks some network servers
> use to try to detect sendfile(2)'s presence at runtime, and you'll see what I
> mean. Remember, everything is a file ;)
Agreed. This:
sprintf(p, "%s%d%s", "/proc/", pid(), "/cpu_affinity");
f = open(p, "rw");
if (!f) /* no cpu_affinity ... */
Is a very simple check vs. the sort of magic hackery that I see to find
out if a syscall is supported at run-time.
Again I mention how we can move cpus_allowed now to any size, and even
support old sizes, since it is a non-issue with a string.
> And one final thing... what sort of benifit does CPU affinity have if we
> have the scheduler take in account CPU migration costs correctly? I can think
> of a lot of corner cases, but in general, it seems to me that it's a lot more
> sane to have the scheduler decide where processes belong. What if an
> application with n threads, where n is less than the number of CPUs, has to
> decide which CPUs to bind its threads to? What if a similar app, or another
> instance of the same app, already decided to bind against the same set of
> CPUs? The scheduler is stuck with an unfair scheduling load on those poor
> CPUs, because the scheduling decision was moved away from where it really
> should take place: the scheduler. I'm sure I'm missing something, though.
It is typically preferred not to force a specific CPU affinity. Solaris
and NT both allow it, for varying reasons. One would be if you set
aside a processor for a set of tasks and disallowed all other tasks from
operating there. This is common with RT (and usually accompanied by
disabling interrupt processing on that CPU and using a fully preemptible
kernel -- i.e. you get complete freedom for the real-time task on the
affined CPU). This is what Solaris's processor sets try to accomplish,
but IMO they are too heavy and this is why I like Ingo's proposal. We
already have the cpus_allowed property which we respect, we just need to
let userspace set it. The question is how?
Robert Love
On Nov 22, 2001 19:51 -0500, Robert Love wrote:
> I would suggest letting users set their own affinity (since it only
> lessens what they can do) and let a capability dictate if non-root users
> can set other user's tasks affinities. CAP_SYS_ADMIN would do fine.
Rather use something else, like CAP_SYS_NICE. It ties in with the idea
of scheduling, and doesn't further abuse the CAP_SYS_ADMIN capability.
CAP_SYS_ADMIN, while it has a good name, has become the catch-all of
capabilities, and if you have it, it is nearly the keys to the kingdom,
just like root.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
On Thu, 2001-11-22 at 20:11, Andreas Dilger wrote:
> Rather use something else, like CAP_SYS_NICE. It ties in with the idea
> of scheduling, and doesn't further abuse the CAP_SYS_ADMIN capability.
> CAP_SYS_ADMIN, while it has a good name, has become the catch-all of
> capabilities, and if you have it, it is nearly the keys to the kingdom,
> just like root.
Ah, forgot about CAP_SYS_NICE ... indeed, a better idea. I suppose if
people want it a CAP_SYS_CPU_AFFINITY could do, but this is a simple and
rare enough task that we are better off sticking it under something
else.
Robert Love
On 22 Nov 2001, Robert Love wrote:
> > the attached set-affinity-A1 patch is relative to the scheduler
> > fixes/cleanups in 2.4.15-pre9. It implements the following two
> > new system calls: [...]
>
> Ingo, I like your implementation, particularly the use of the
> cpu_online_map, although I am not sure all arch's implement it yet.
> [...]
cpu_online_map is (or should be) a standard component of the kernel, eg.
generic SMP code in init/main.c uses it. But this area can be changed in
whatever direction is needed - we should always keep an eye on CPU
hot-swapping's architectural needs.
> [...] I am curious, however, what you would think of using a /proc
> interface instead of a set of syscalls ?
To compare this to a similar situation: I made
/proc/irq/N/smp_affinity a /proc thing because it appeared to be an
architecture-specific and nongeneric feature, seldom used by ordinary
processes and generally an admin thing. But I think setting affinity is a
natural extension of the existing sched_* class of system calls. It could
be used by userspace webservers for example.
One issue is that /proc does not necessarily have to be mounted. But I
don't have any strong feelings either way - the syscall variant simply
looks a bit more correct.
Ingo
On Thu, 22 Nov 2001, Ryan Cumming wrote:
> [...] a /proc interface would allow me to change the CPU affinity of
> processes that aren't {get, set}_affinity aware (i.e., all Linux
> applications written up to this point). [...]
Had you read my patch, you'd perhaps have noticed how easy it actually
is. I've attached a simple utility called 'chaff' (change affinity)
that allows changing the affinity of unaware processes:
mars:~> ./chaff 714 0xf0
pid 714's old affinity: 000000ff.
pid 714's new affinity: 000000f0.
> And one final thing... what sort of benifit does CPU affinity have if
> we have the scheduler take in account CPU migration costs correctly?
> [...]
While you are right that the scheduler can and should guess lots of
things, it cannot guess everything. E.g. it has no idea whether a
particular process's workload is related to any IRQ source or not. And if
we bind IRQ sources for performance reasons, then the scheduler has no
chance of finding the right CPU for the process. (I attempted to
implement such a generic mechanism a few months ago but quickly realized
that nothing like that will ever be accepted in the mainline kernel -
there is simply no way to establish any reliable link between IRQ load and
process activities.)
So I implemented the smp_affinity and ->cpus_allowed mechanisms to allow
specific applications (which know the kind of load they generate) to bind
to specific CPUs, and to bind IRQs to CPUs. Obviously we still want the
scheduler to make good decisions - but linking IRQ load and scheduling
activity is too expensive. (I have a scheduler improvement patch that does
some of this work at wakeup time, and which benefits Apache, but
this is still not enough to get the 'best' affinity.)
Ingo
On Thu, 22 Nov 2001, Mark Hahn wrote:
> only that it's nontrivial to estimate the migration costs, I think. at
> one point, around 2.3.3*, there was some effort at doing this - or
> something like it. specifically, the scheduler kept track of how long
> a process ran on average, and was slightly more willing to migrate a
> short-slice process than a long-slice. "short" was defined relative
> to cache size and a WAG at dram bandwidth.
Yes. I added the avg_slice code, and I removed it as well - it was
hopeless to get it right and it was causing bad performance for certain
application loads. Current CPUs simply do not support any good way of
tracking cache footprint of processes. There are methods that are an
approximation (eg. uninterrupted runtime and cache footprint are in a
monotonic relationship), but none of the methods (including cache traffic
machine counters) are good enough to cover all the important corner cases,
due to cache aliasing, MESI-invalidation and other effects.
> the rationale was that if you run for only 100 us, you probably don't
> have a huge working set. that justification is pretty thin, and
> perhaps that's why the code gradually disappeared.
yes.
> hmm, you really want to monitor things like paging and cache misses,
> but both might be tricky, and would be tricky to use sanely. a really
> simple, and appealing heuristic is to migrate a process that hasn't
> run for a long while - any cache state it may have had is probably
> gone by now, so there *should* be no affinity.
Well, it doesn't take much for a process to populate the whole L1 cache
with dirty cachelines (which then have to be cross-invalidated if this
process is moved to another CPU).
Ingo
On Fri, 23 Nov 2001, Ingo Molnar wrote:
[...]
Isn't it better to expose "number" cpu masks instead of "logical" ones?
Right now you set the raw cpus_allowed field, which is a "logical" cpu
bitmask. By using "number" maps, the user can use 0..N-1 without having to
know the internal cpu mapping.
- Davide
On Fri, 23 Nov 2001, Ingo Molnar wrote:
>
> On Thu, 22 Nov 2001, Mark Hahn wrote:
>
> > only that it's nontrivial to estimate the migration costs, I think. at
> > one point, around 2.3.3*, there was some effort at doing this - or
> > something like it. specifically, the scheduler kept track of how long
> > a process ran on average, and was slightly more willing to migrate a
> > short-slice process than a long-slice. "short" was defined relative
> > to cache size and a WAG at dram bandwidth.
>
> yes. I added the avg_slice code, and i removed it as well - it was
> hopeless to get it right and it was causing bad performance for certain
> application sloads. Current CPUs simply do not support any good way of
> tracking cache footprint of processes. There are methods that are an
> approximation (eg. uninterrupted runtime and cache footprint are in a
> monotonic relationship), but none of the methods (including cache traffic
> machine counters) are good enough to cover all the important corner cases,
> due to cache aliasing, MESI-invalidation and other effects.
Uninterrupted run-time is a good approximation of a task's cache footprint.
It's true, it's not 100% successful; processes like:
for (;;);
are incorrectly classified, but it's still way better than the method we're
currently using (PROC_CHANGE_PENALTY).
By taking the average of the run-time in jiffies:
AVG = (AVG + LAST) >> 1;
you get something that is 1) fast, 2) has a nice hysteresis property, and
3) gives a pretty good estimation of the "nature" of the task.
I'm currently using it as 1) a classification for load balancing between
CPUs and 2) the task's watermark value for your counter decay patch:
[kernel/timer.c]
	if (p->counter > p->avg_jrun)
		--p->counter;
	else if (++p->timer_ticks >= p->counter) {
		p->counter = 0;
		p->timer_ticks = 0;
		p->need_resched = 1;
	}
In this way I/O bound tasks have a counter decay behavior like the
standard scheduler, while CPU bound ones remain priority-inversion
proof.
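A sketch of the update itself, done when a task leaves the CPU (the
last_run_start field is hypothetical; only avg_jrun appears above):

/* maintain a running average of uninterrupted run-time, in jiffies */
static inline void update_avg_jrun(struct task_struct *p)
{
	unsigned long last = jiffies - p->last_run_start;	/* ticks just run */

	p->avg_jrun = (p->avg_jrun + last) >> 1;
}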
- Davide
On Thu, 2001-11-22 at 19:20, Ryan Cumming wrote:
> Here here, I was just thinking "Well, I like the CPU affinity idea, but I
> loathe syscall creep... I hope this Robert Love fellow says something about
> that" as I read your email's header.
I did a procfs-based implementation of a user interface for setting CPU
affinity. It implements various features like Ingo's with the change
that it is, obviously, a procfs entry and not a set of syscalls.
It is readable and writable via /proc/<pid>/affinity
I posted a patch to lkml moments ago, but it is also available at
ftp://ftp.kernel.org/pub/linux/kernel/people/rml/cpu-affinity
(please use a mirror).
Comments, suggestions, et cetera welcome -- if possible under the new thread.
Robert Love
Robert and Ingo,
A nohup-like interface to the cpu affinity service would be useful. It
could work like the following example:
$ cpuselect -c 1,3-5 gcc -c module.c
which would restrict this instantiation of gcc and all of its children to
cpus 1, 3, 4, and 5. This tool can be implemented in a few lines of C,
with either /proc or syscall as the underlying implementation.
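Such a tool really is only a few lines; a rough sketch against the /proc
file discussed in this thread, with the -c CPU-list parsing simplified to
a plain hex mask and the file name depending on which interface is merged:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64];
	unsigned long mask;
	FILE *f;

	if (argc < 3) {
		fprintf(stderr, "usage: cpuselect <hexmask> command [args...]\n");
		return 1;
	}
	mask = strtoul(argv[1], NULL, 16);
	/* set our own mask first; the exec'd command and its children inherit it */
	snprintf(path, sizeof(path), "/proc/%d/cpu_affinity", (int) getpid());
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%08lx\n", mask);
	fclose(f);
	execvp(argv[2], &argv[2]);
	perror("execvp");
	return 1;
}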
On another subject -- capabilities -- any process should be able to
reduce the number of cpus in its own cpu affinity mask without any special
permission. To add cpus to a reduced mask, or to change the cpu affinity
mask of other processes, should require the appropriate capability, be it
CAP_SYS_NICE, CAP_SYS_ADMIN, or whatever is decided.
Joe
On Mon, 2001-11-26 at 23:41, Linux maillist account wrote:
> Robert and Ingo,
> A nohup-like interface to the cpu affinity service would be useful. It
> could work like the following example:
>
> $ cpuselect -c 1,3-5 gcc -c module.c
>
> which would restrict this instantiation of gcc and all of its children to
> cpus 1,3,4, and 5. This tool can be implemented in a few lines of C, with
> either /proc or syscall as the underlying implementation.
I can see the use for this, but you can also just do `echo whatever >
/proc/123/affinity' once it is running ... not a big deal.
It is automatically inherited by its children.
> On another subject -- capabilities -- any process should be able to reduce
> the number of cpus in its own cpu affinity mask without any special
> permission. To add cpus to a reduced mask, or to change the cpu affinity
> mask of other processes, should require the appropriate capability, be it
> CAP_SYS_NICE, CAP_SYS_ADMIN, or whatever is decided.
My patch already does this. If the user writing the affinity entry is
the same as the user of the task in question, everything is fine. If
the user possesses the CAP_SYS_NICE bit, he can set any task's
affinity. See the patch.
Robert Love
On Mon, 26 Nov 2001, Linux maillist account wrote:
> A nohup-like interface to the cpu affinity service would be useful. It
> could work like the
> following example:
>
> $ cpuselect -c 1,3-5 gcc -c module.c
Yep, this can be done via the chaff utility I posted:
gcc -c module.c & ./chaff $! 0x6
or it can be done by changing the affinity of the current shell; every
new child process will inherit it:
./chaff $$ 0x6; gcc -c module.c
(or a cpuselect utility can be written.)
> On another subject -- capabilities -- any process should be able to
> reduce the number of cpus in its own cpu affinity mask without any
> special permission. To add cpus to a reduced mask, or to change the
> cpu affinity mask of other processes, should require the appropriate
> capability, be it CAP_SYS_NICE, CAP_SYS_ADMIN, or whatever is decided.
Yep, this is how sched_set_affinity() is working - it allows the setting of
affinities if either CAP_SYS_NICE is set, or the process's uid/gid matches
that of the target process's effective uid/gid.
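In kernel terms the check amounts to something like the following sketch,
in the style of the existing scheduler syscalls (not the literal patch
code):

/* CAP_SYS_NICE, or a uid/euid match with the target task, is enough */
static int affinity_change_allowed(struct task_struct *p)
{
	if (capable(CAP_SYS_NICE))
		return 1;
	return (current->euid == p->euid) || (current->euid == p->uid);
}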
Ingo
At 11:49 PM 11/26/01 -0500, Robert Love wrote:
>I can see the use for this, but you can also just do `echo whatever >
>/proc/123/affinity' once it is running ... not a big deal.
It isn't quite the same... the biggest difference is races. The cpuselect(1)
tool would change the affinity mask before the fork & exec of the first
child. To do this by hand via an `echo whatever >/proc/123/affinity' would
miss all the children spun off by 123 before the echo could be executed.
One could write cpuselect as a shell script I suppose, using within it an
echo on /proc/self/affinity, though even as a shell script it would be
better to have this tool be part of the standard Linux repertoire that
everyone could depend upon as being there in all Linux distributions, and
having a well-known and unchanging syntax and semantics, rather than have
it remain something that each user creates ad hoc as the need for the tool
arises.
Joe
On Tue, 2001-11-27 at 01:32, Linux maillist account wrote:
> It's isn't quite the same..the biggest difference is races.
Then you can set the affinity on your shell before executing the
process. :)
But, OK OK -- I see your point. It would make for a handy utility, and
it has some necessary uses (i.e., the race issue you brought up).
The question now is, what interface do we export to userspace?
Robert Love
your comments about syscall vs. procfs:
> This patch comes about as an alternative to Ingo Molnar's
> syscall-implemented version. Ingo's code is nice; however I and
> others expressed discontent as yet another syscall. [...]
I do express discontent over yet more procfs bloat. What if procfs is
not mounted in a high-security installation? Are affinities suddenly
unavailable? Such dependencies are unacceptable IMO - if we want
to export the setting of affinities to user-space, then it should be a
system call.
(Also, procfs is visibly slower than a system call - I can well imagine
this to be an issue in some sort of threaded environment that creates and
destroys threads at a high rate, and wants to have a different affinity
for every new thread.)
> [...] Other benefits include the ease with which to set the affinity
> of tasks that are unaware of the new interface [...]
this was a red herring - see chaff.c.
> [...] and that with this approach applications don't need to hackishly
> check for the existence of a syscall.
uhm, what check? A nonexistent system call does not have to be checked
for.
(so far no legitimate technical point has been made against the
syscall-based setting of affinities.)
Ingo
At 09:26 AM 11/27/01 +0100, Ingo Molnar wrote:
>On Mon, 26 Nov 2001, Linux maillist account wrote:
> > A nohup-like interface to the cpu affinity service would be useful. It
> > could work like the following example:
> >
> > $ cpuselect -c 1,3-5 gcc -c module.c
>
>yep, this can be done via the chaff utility i posted:
> gcc -c module.c & ./chaff $! 0x6
This of course is subject to a race -- the chaff may not execute before the
gcc has spun off a child or two.
>or, it can be done by changing the affinity of the current shell, every
>new child process will inherit it:
>
> ./chaff $$ 0x6; gcc -c module.c
I like this one *much* better; it is functionally equivalent to cpuselect,
if one puts parens around the whole thing to keep chaff from biasing
subsequent commands.
The ideal solution might be to add nohup-like capability to the existing
chaff command:
./chaff 0x6 $$ 1234 43213 ... lots of other pids ...
(note my proposed reversal of pid & bias)
./chaff 0x6 -c gcc -c module.c
Joe
At 09:40 AM 11/27/01 +0100, Ingo Molnar wrote:
> > This patch comes about as an alternative to Ingo Molnar's
> > syscall-implemented version. Ingo's code is nice; however I and
> > others expressed discontent as yet another syscall. [...]
>
>i do express discontent over yet another procfs bloat. What if procfs is
>not mounted in a high security installation? Are affinities suddenly
>unavailable? Such kind of dependencies are unacceptable IMO - if we want
>to export the setting of affinities to user-space, then it should be a
>system call.
...
> > [...] Other benefits include the ease with which to set the affinity
> > of tasks that are unaware of the new interface [...]
I have not yet seen the patch, but one nice feature that a system call
interface could provide is the ability to *atomically* change the cpu
affinities of sets of processes -- for example, all processes with a
certain uid or gid. All that would be required would be for the system
call to accept a command integer value which would define what the
argument integer value would mean -- a pid, a gid, or a uid.
Joe
Joe Korty <[email protected]> writes:
>
> I have not yet seen the patch, but one nice feature that a system call
> interface
> could provide is the ability to *atomically* change the cpu affinities of
> sets of
> processes
Could you quickly explain a use case where it makes a difference whether
CPU affinity settings for multiple processes are done atomically or not?
The only way to make CPU affinity settings of processes truly atomic
without a "consolidation window" is to do them before the process starts
up. This is easy when they're inherited -- just set them for the parent
before starting the other processes. This works with any interface,
proc-based or not, as long as it inherits.
-Andi
On Tue, Nov 27, 2001 at 01:39:50AM -0500, Robert Love wrote:
> On Tue, 2001-11-27 at 01:32, Linux maillist account wrote:
>
> > It's isn't quite the same..the biggest difference is races.
>
> Then you can set the affinity on your shell before executing the
> process. :)
Surely
#!/bin/sh
echo "$1" > /proc/self/affinity
shift
exec "$@"
...would be just fine 'n dandy as a wrapper.
Sean
On Tue, Nov 27, 2001 at 03:04:27AM -0500, Joe Korty wrote:
| I am not against a proc interface per se, I would like a proc interface,
| especially for the reading of affinity values. But in my view the system
| call interface should also exist, and it should be the dominant way of
| communicating affinity to processes.
IWSTM that the way you would justify this being a system call would also
suggest working with non-Linux kernel developers (both open source as well
as commercial) to determine a mutually agreed syntax/semantics for this
call, to further ensure the universality that ensures it will be one of
those "forever" facilities, and maybe even make it into a future
standard.
You opened the camel's mouth; do you want to check the teeth?
--
-----------------------------------------------------------------
| Phil Howard - KA9WGN | Dallas | http://linuxhomepage.com/ |
| [email protected] | Texas, USA | http://phil.ipal.org/ |
-----------------------------------------------------------------
> IWSTM that the way you would justify this being a system call would also
> suggest working with non-linux kernel developers (both open source as well
> as commercial) to determine a mutually agreed syntax/semantic for this
> call to further ensure the basis of the universality that ensure it will
> be one of those "forever" facilities, and maybe even make it into a future
> standard.
>
> You opened the camel's mouth; do you want to check the teeth?
once again - pset IS a candidate for this interface. (damn I need to
finish the 2.4 port). http://www.hockin.org/~thockin/pset
On Tue, 2001-11-27 at 02:13, Joe Korty wrote:
> I have not yet seen the patch, but one nice feature that a system call
> interface could provide is the ability to *atomically* change the cpu
> affinities of sets of processes -- for example, all processes with a
> certain uid or gid. All that would be required would be for the system
> call to accept a command integer value which would define what the
> argument integer value would mean -- a pid, a gid, or a uid.
Affecting all tasks matching a uid or some other filter is a little
beyond what either patch does. Note however that both interfaces have
atomicity.
You can open and write to proc from within a program ... very easily, in
fact.
Also, with some sed and grep magic, you can set the affinity of all
tasks via the proc interface pretty easily. Just a couple of lines.
Robert Love
On Tue, 2001-11-27 at 02:32, Andi Kleen wrote:
> Could you quickly explain an use case where it makes a difference if
> CPU affinity settings for multiple processes are done atomically or not ?
>
> The only way to make CPU affinity settings of processes really atomically
> without a "consolidation window" is to
> do them before the process starts up. This is easy when they're inherited --
> just set them for the parent before starting the other processes. This
> works with any interface; proc based or not as long as it inherits.
I assume he meant to prevent the case of setting affinity _after_ a
process forks. In other words, "atomically" in the sense that it occurs
prior to some action, in order to properly affect all children.
This could be done in a program by writing to the proc entry before
forking, or can be done in a wrapper script (set affinity of self, exec
the new task).
cpus_allowed is inherited by all children so this works fine.
Robert Love
On Tue, Nov 27, 2001 at 03:53:04PM -0500, Robert Love wrote:
> Effecting all tasks matching a uid or some other filter is a little
> beyond what either patch does. Note however that both interfaces have
> atomicity.
I don't see a need for that either; the inheritance and single-process
change are the major abilities needed.
> You can open and write to proc from within a program ... very easily, in
> fact.
>
> Also, with some sed and grep magic, you can set the affinity of all
> tasks via the proc interface pretty easy. Just a couple lines.
From the admin point of view, this last ability is a good one.
A read-only entry in proc wouldn't do much good by itself. The writable /proc
entry is the one that sounds interesting.
-Nathan
diff -Nur linux-2.4.10/fs/proc/array.c linux-2.4.10-launch_policy/fs/proc/array.c
--- linux-2.4.10/fs/proc/array.c Fri Oct 26 15:07:16 2001
+++ linux-2.4.10-launch_policy/fs/proc/array.c Wed Nov 28 13:59:58 2001
@@ -50,6 +50,10 @@
* Al Viro & Jeff Garzik : moved most of the thing into base.c and
* : proc_misc.c. The rest may eventually go into
* : base.c too.
+ *
+ * Andrew Morton : cpus_allowed
+ *
+ * Matthew Dobson : launch_policy (Thanks to Andrew Morton for inspiraton)
*/
#include <linux/config.h>
@@ -344,7 +348,7 @@
read_unlock(&tasklist_lock);
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %lu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
task->pid,
task->comm,
state,
@@ -387,7 +391,9 @@
task->nswap,
task->cnswap,
task->exit_signal,
- task->processor);
+ task->processor,
+ task->cpus_allowed,
+ task->launch_policy);
if(mm)
mmput(mm);
return res;
@@ -692,5 +698,59 @@
task->per_cpu_stime[cpu_logical_map(i)]);
return len;
+}
+
+static inline int proc_pid_cpu_bitmask_read(char * buffer, unsigned long *bitmask)
+{
+ int len;
+
+ len = sprintf(buffer, "%08lx\n", *bitmask);
+ return len;
+}
+
+static inline int proc_pid_cpu_bitmask_write(char * buffer, unsigned long *bitmask,
+ size_t nbytes, struct task_struct *task)
+{
+ unsigned long new_mask;
+ char *endp;
+ int ret;
+ unsigned long flags;
+
+ ret = -EPERM;
+ if ((current->euid != task->euid) && (current->euid != task->uid) &&
+ (!capable(CAP_SYS_NICE)))
+ goto out;
+
+ new_mask = simple_strtoul(buffer, &endp, 16);
+ ret = endp - buffer;
+
+ spin_lock_irqsave(&runqueue_lock, flags); /* token effort to not be racy */
+ if (!(cpu_online_map & new_mask))
+ ret = -EINVAL;
+ else
+ *bitmask = new_mask;
+ spin_unlock_irqrestore(&runqueue_lock, flags);
+out:
+ return ret;
+}
+
+int proc_pid_cpus_allowed_read(struct task_struct *task, char * buffer)
+{
+ return proc_pid_cpu_bitmask_read(buffer, &task->cpus_allowed);
+}
+
+int proc_pid_cpus_allowed_write(struct task_struct *task, char * buffer, size_t nbytes)
+{
+ return proc_pid_cpu_bitmask_write(buffer, &task->cpus_allowed, nbytes, task);
+}
+
+int proc_pid_launch_policy_read(struct task_struct *task, char * buffer)
+{
+ return proc_pid_cpu_bitmask_read(buffer, &task->launch_policy);
+}
+
+int proc_pid_launch_policy_write(struct task_struct *task, char * buffer, size_t nbytes)
+{
+ return proc_pid_cpu_bitmask_write(buffer, &task->launch_policy, nbytes, task);
}
#endif
diff -Nur linux-2.4.10/fs/proc/base.c linux-2.4.10-launch_policy/fs/proc/base.c
--- linux-2.4.10/fs/proc/base.c Fri Oct 26 15:07:16 2001
+++ linux-2.4.10-launch_policy/fs/proc/base.c Wed Nov 28 14:00:20 2001
@@ -39,6 +39,10 @@
int proc_pid_status(struct task_struct*,char*);
int proc_pid_statm(struct task_struct*,char*);
int proc_pid_cpu(struct task_struct*,char*);
+int proc_pid_cpus_allowed_read(struct task_struct*, char*);
+int proc_pid_cpus_allowed_write(struct task_struct*, char*, size_t);
+int proc_pid_launch_policy_read(struct task_struct*, char*);
+int proc_pid_launch_policy_write(struct task_struct*, char*, size_t);
static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
{
@@ -282,8 +286,44 @@
return count;
}
+static ssize_t proc_info_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode * inode = file->f_dentry->d_inode;
+ unsigned long page;
+ ssize_t ret;
+ struct task_struct *task = inode->u.proc_i.task;
+
+ ret = -EINVAL;
+ if (inode->u.proc_i.op.proc_write == NULL)
+ goto out;
+ if (count > PAGE_SIZE - 1)
+ goto out;
+
+ ret = -ENOMEM;
+ if (!(page = __get_free_page(GFP_KERNEL)))
+ goto out;
+
+ ret = -EFAULT;
+ if (copy_from_user((char *)page, buf, count))
+ goto out_free_page;
+
+ ((char *)page)[count] = '\0';
+ ret = inode->u.proc_i.op.proc_write(task, (char*)page, count);
+ if (ret < 0)
+ goto out_free_page;
+
+ *ppos += ret;
+
+out_free_page:
+ free_page(page);
+out:
+ return ret;
+}
+
static struct file_operations proc_info_file_operations = {
read: proc_info_read,
+ write: proc_info_write,
};
#define MAY_PTRACE(p) \
@@ -497,25 +537,29 @@
PROC_PID_STATM,
PROC_PID_MAPS,
PROC_PID_CPU,
+ PROC_PID_CPUS_ALLOWED,
+ PROC_PID_LAUNCH_POLICY,
PROC_PID_FD_DIR = 0x8000, /* 0x8000-0xffff */
};
#define E(type,name,mode) {(type),sizeof(name)-1,(name),(mode)}
static struct pid_entry base_stuff[] = {
- E(PROC_PID_FD, "fd", S_IFDIR|S_IRUSR|S_IXUSR),
- E(PROC_PID_ENVIRON, "environ", S_IFREG|S_IRUSR),
- E(PROC_PID_STATUS, "status", S_IFREG|S_IRUGO),
- E(PROC_PID_CMDLINE, "cmdline", S_IFREG|S_IRUGO),
- E(PROC_PID_STAT, "stat", S_IFREG|S_IRUGO),
- E(PROC_PID_STATM, "statm", S_IFREG|S_IRUGO),
+ E(PROC_PID_FD, "fd", S_IFDIR|S_IRUSR|S_IXUSR),
+ E(PROC_PID_ENVIRON, "environ", S_IFREG|S_IRUSR),
+ E(PROC_PID_STATUS, "status", S_IFREG|S_IRUGO),
+ E(PROC_PID_CMDLINE, "cmdline", S_IFREG|S_IRUGO),
+ E(PROC_PID_STAT, "stat", S_IFREG|S_IRUGO),
+ E(PROC_PID_STATM, "statm", S_IFREG|S_IRUGO),
#ifdef CONFIG_SMP
- E(PROC_PID_CPU, "cpu", S_IFREG|S_IRUGO),
+ E(PROC_PID_CPU, "cpu", S_IFREG|S_IRUGO),
+ E(PROC_PID_CPUS_ALLOWED, "cpus_allowed", S_IFREG|S_IRUGO|S_IWUSR),
+ E(PROC_PID_LAUNCH_POLICY, "launch_policy",S_IFREG|S_IRUGO|S_IWUSR),
#endif
- E(PROC_PID_MAPS, "maps", S_IFREG|S_IRUGO),
- E(PROC_PID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
- E(PROC_PID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
- E(PROC_PID_ROOT, "root", S_IFLNK|S_IRWXUGO),
- E(PROC_PID_EXE, "exe", S_IFLNK|S_IRWXUGO),
+ E(PROC_PID_MAPS, "maps", S_IFREG|S_IRUGO),
+ E(PROC_PID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
+ E(PROC_PID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
+ E(PROC_PID_ROOT, "root", S_IFLNK|S_IRWXUGO),
+ E(PROC_PID_EXE, "exe", S_IFLNK|S_IRWXUGO),
{0,0,NULL,0}
};
#undef E
@@ -869,6 +913,16 @@
case PROC_PID_CPU:
inode->i_fop = &proc_info_file_operations;
inode->u.proc_i.op.proc_read = proc_pid_cpu;
+ break;
+ case PROC_PID_CPUS_ALLOWED:
+ inode->i_fop = &proc_info_file_operations;
+ inode->u.proc_i.op.proc_read = proc_pid_cpus_allowed_read;
+ inode->u.proc_i.op.proc_write = proc_pid_cpus_allowed_write;
+ break;
+ case PROC_PID_LAUNCH_POLICY:
+ inode->i_fop = &proc_info_file_operations;
+ inode->u.proc_i.op.proc_read = proc_pid_launch_policy_read;
+ inode->u.proc_i.op.proc_write = proc_pid_launch_policy_write;
break;
#endif
case PROC_PID_MEM:
diff -Nur linux-2.4.10/include/linux/capability.h linux-2.4.10-launch_policy/include/linux/capability.h
--- linux-2.4.10/include/linux/capability.h Mon Nov 19 22:57:29 2001
+++ linux-2.4.10-launch_policy/include/linux/capability.h Wed Nov 28 13:49:59 2001
@@ -243,6 +243,8 @@
/* Allow use of FIFO and round-robin (realtime) scheduling on own
processes and setting the scheduling algorithm used by another
process. */
+/* Allow binding of tasks to CPUs */
+/* Allow setting of launch policies */
#define CAP_SYS_NICE 23
diff -Nur linux-2.4.10/include/linux/prctl.h linux-2.4.10-launch_policy/include/linux/prctl.h
--- linux-2.4.10/include/linux/prctl.h Thu Jul 19 20:39:57 2001
+++ linux-2.4.10-launch_policy/include/linux/prctl.h Mon Nov 19 15:24:10 2001
@@ -20,4 +20,12 @@
#define PR_GET_KEEPCAPS 7
#define PR_SET_KEEPCAPS 8
+/* Get/set cpus allowed */
+#define PR_GET_CPUS_ALLOWED 13
+#define PR_SET_CPUS_ALLOWED 14
+
+/* Get/set launch policies */
+#define PR_GET_LAUNCH_POLICY 15
+#define PR_SET_LAUNCH_POLICY 16
+
#endif /* _LINUX_PRCTL_H */
diff -Nur linux-2.4.10/include/linux/proc_fs_i.h linux-2.4.10-launch_policy/include/linux/proc_fs_i.h
--- linux-2.4.10/include/linux/proc_fs_i.h Fri Oct 26 15:07:16 2001
+++ linux-2.4.10-launch_policy/include/linux/proc_fs_i.h Wed Oct 17 14:18:53 2001
@@ -1,9 +1,10 @@
struct proc_inode_info {
struct task_struct *task;
int type;
- union {
+ struct {
int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
int (*proc_read)(struct task_struct *task, char *page);
+ int (*proc_write)(struct task_struct *task, char *page, size_t nbytes);
} op;
struct file *file;
};
diff -Nur linux-2.4.10/include/linux/sched.h linux-2.4.10-launch_policy/include/linux/sched.h
--- linux-2.4.10/include/linux/sched.h Mon Nov 19 22:57:29 2001
+++ linux-2.4.10-launch_policy/include/linux/sched.h Mon Nov 19 15:27:40 2001
@@ -352,6 +352,7 @@
struct task_struct *pidhash_next;
struct task_struct **pidhash_pprev;
+ unsigned long launch_policy; /* for *fork*() & exec() */
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
unsigned long rt_priority;
@@ -480,6 +481,7 @@
p_opptr: &tsk, \
p_pptr: &tsk, \
thread_group: LIST_HEAD_INIT(tsk.thread_group), \
+ launch_policy: -1, \
wait_chldexit: __WAIT_QUEUE_HEAD_INITIALIZER(tsk.wait_chldexit),\
real_timer: { \
function: it_real_fn \
diff -Nur linux-2.4.10/kernel/fork.c linux-2.4.10-launch_policy/kernel/fork.c
--- linux-2.4.10/kernel/fork.c Mon Sep 17 21:46:04 2001
+++ linux-2.4.10-launch_policy/kernel/fork.c Wed Oct 24 15:55:55 2001
@@ -646,6 +646,7 @@
spin_lock_init(&p->sigmask_lock);
}
#endif
+ p->cpus_allowed = p->launch_policy; /* launch_policy is inherited from parent */
p->lock_depth = -1; /* -1 = no lock */
p->start_time = jiffies;
diff -Nur linux-2.4.10/kernel/sys.c linux-2.4.10-launch_policy/kernel/sys.c
--- linux-2.4.10/kernel/sys.c Tue Sep 18 14:10:43 2001
+++ linux-2.4.10-launch_policy/kernel/sys.c Wed Nov 28 14:13:20 2001
@@ -1256,6 +1256,27 @@
}
current->keep_capabilities = arg2;
break;
+ case PR_GET_CPUS_ALLOWED:
+ error = put_user(current->cpus_allowed, (long *)arg2);
+ break;
+ case PR_SET_CPUS_ALLOWED:
+ if (!(cpu_online_map & arg2))
+ error = -EINVAL;
+ else {
+ current->cpus_allowed = arg2;
+ if (!((1 << smp_processor_id()) & arg2))
+ current->need_resched = 1;
+ }
+ break;
+ case PR_GET_LAUNCH_POLICY:
+ error = put_user(current->launch_policy, (long *)arg2);
+ break;
+ case PR_SET_LAUNCH_POLICY:
+ if (!(cpu_online_map & arg2))
+ error = -EINVAL;
+ else
+ current->launch_policy = arg2;
+ break;
default:
error = -EINVAL;
break;
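As a usage note, the prctl() side of the patch above could be exercised
from user-space roughly like this (the PR_* values are taken from the
prctl.h hunk; an unpatched <sys/prctl.h> does not define them, hence the
guarded defines):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_CPUS_ALLOWED
#define PR_GET_CPUS_ALLOWED	13
#define PR_SET_CPUS_ALLOWED	14
#define PR_GET_LAUNCH_POLICY	15
#define PR_SET_LAUNCH_POLICY	16
#endif

int main(void)
{
	unsigned long mask = 0;

	prctl(PR_GET_CPUS_ALLOWED, (unsigned long) &mask);
	printf("cpus_allowed: %08lx\n", mask);

	/* keep running anywhere ourselves, but launch children on CPU 0 only */
	prctl(PR_SET_LAUNCH_POLICY, 0x1UL);
	return 0;
}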
On Wed, 5 Dec 2001, Matthew Dobson wrote:
> In response (albeit a week plus late) to the recent hubbub about the cpu
> affinity patches, I'd like to throw a third contender in the ring.
>
> Attached is a patch (against 2.4.16) which implements a /proc and a
> prctl() interface to the cpus_allowed flag. The truly exciting (at least
> for me) part of this patch is the launch_policy flag that it also
> introduces. The launch_policy flag is used similarly to the cpus_allowed
> flag, but it controls the cpus_allowed flags of any subsequent children
> of the process, instead of the cpus_allowed of the process itself. Via
> this flag, there are no worries about processes being able to fork
> children before a 'chaff' or 'echo' or anything else for that matter can
> be executed. The child process is assigned the desired cpus_allowed at
> fork/exec time. All this without having to bounce the current process to
> different cpus to (hopefully) achieve the same results.
>
> The launch_policy flag can actually be quite powerful. It allows for
> children to be instantiated on the correct cpu/node with a minimum of
> memory footprint on the wrong cpu/node. This can be taken advantage of
> via the /proc interface (for smp/numa unaware programs) or through
> prctl() for more clueful programs.
What you probably want to do in real life is to move a process to a cpu
and have all its children spawned on that cpu; that is the actual behavior.
Can't You achieve the same by coding a:
	pid_t affine_fork(int cpumask) {
		pid_t pp = fork();
		if (pp == 0) {
			set_affinity(getpid(), cpumask);
			...
		}
		return pp;
	}
in your application, and having the default behavior propagate it to the
following fork()s.
> +int proc_pid_cpus_allowed_read(struct task_struct *task, char * buffer)
^^^^^^^^^^^^^^^^^^^^^^^^^^
You want Al Viro screaming, don't You ? :)
- Davide
Davide Libenzi wrote:
>
> On Wed, 5 Dec 2001, Matthew Dobson wrote:
>
> > In response (albeit a week plus late) to the recent hubbub about the cpu
> > affinity patches,
> > I'd like to throw a third contender in the ring.
> >
> > Attatched is a patch (against 2.4.16) which implements a /proc and a prctl()
> > interface to
> > the cpus_allowed flag. The truly exciting (at least for me) part of this patch
> > is the
> > launch_policy flag that it also introduces. The launch_policy flag is used
> > similarly to
> > the cpus_allowed flag, but it controls the cpus_allowed flags of any subsequent
> > children
> > of the process, instead of the cpus_allowed of the process itself. Via this
> > flag, there
> > are no worries about processes being able to fork children before a 'chaff' or
> > 'echo' or
> > anything else for that matter can be executed. The child process is assigned
> > the desired
> > cpus_allowed at fork/exec time. All this without having to bounce the current
> > process to
> > different cpus to (hopefully) acheive the same results.
> >
> > The launch_policy flag can acually be quite powerful. It allows for children
> > to be
> > instantiated on the correct cpu/node with a minimum of memory footprint on the
> > wrong
> > cpu/node. This can be taken advantage of via the /proc interface (for smp/numa
> > unaware
> > programs) or through prctl() for more clueful programs.
>
> What you probably want to do in real life is to move a process to a cpu
> and have all its child spawned on that cpu, that is the actual behavior.
If you want a process moved, you change cpus_allowed; if you want the
children spawned somewhere in particular, you change launch_policy; if you
really want both, you change both...
> Can't You achieve the same by coding a :
>
> pid_t affine_fork(int cpumask) {
> pid_t pp = fork();
> if (pp == 0) {
> set_affinity(getpid(), cpumask);
> ...
> }
> return pp;
> }
>
> in your application and having the default bahavior to propagate it to the
> following fork()s.
You could do that, but that means you have to keep track of the cpumask
somewhere.
I suppose you could force your children to:
	pid_t enforce_launch_policy_fork() {
		pid_t pp = fork();
		if (pp == 0) {
			set_affinity(getpid(), get_affinity());
			...
		}
		return pp;
	}
but, as soon as one of them exec()'s, they're no longer going to be using
your functions.
By making it a default part of fork's behavior, processes naturally end up
where they're supposed to be. And the default launch_policy is 0xffffffff,
so unless you purposely change launch_policy, the old default behavior
(run wherever you can) is preserved.
>
> > +int proc_pid_cpus_allowed_read(struct task_struct *task, char * buffer)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> You want Al Viro screaming, don't You ? :)
>
> - Davide
If that is the biggest complaint about the patch, then I'll be quite happy
with some yelling and screaming about descriptive function names! ;)
Cheers!
-matt
On Wed, 5 Dec 2001, Matthew Dobson wrote:
> pid_t enforce_launch_policy_fork() {
> pid_t pp = fork();
> if (pp == 0) {
> set_affinity(getpid(), get_affinity());
> ...
> }
> return pp;
> }
>
> but, as soon as one of them exec()'s their no longer going to be using your
> functions.
That's the point: cpus_allowed is automatically inherited by the child in
kernel/fork.c. So once you spawn a child with the proposed function, all
its descendants (if not explicitly changed) will have the same cpu
affinity.
- Davide
On Wed, 2001-12-05 at 21:17, Matthew Dobson wrote:
> but, as soon as one of them exec()'s their no longer going to be using your
> functions.
But cpus_allowed is inherited, so why does it matter?
The only benefit I see to having it part of the fork operation as
opposed to Ingo's or my own patch, is that the parent need not be given
the same affinity.
And honestly I don't see that as a need. You could always change it
back after the exec. If that is unacceptable (you point out the cost of
forcing a task on and off a certain CPU), you could just have a wrapper
you exec that changes its affinity and then it execs the children.
Robert Love
Robert Love wrote:
>
> On Wed, 2001-12-05 at 21:17, Matthew Dobson wrote:
>
> > but, as soon as one of them exec()'s their no longer going to be using your
> > functions.
>
> But cpus_allowed is inherited, so why does it matter?
You're right, cpus_allowed would inherit just fine on its own, but...
>
> The only benefit I see to having it part of the fork operation as
> opposed to Ingo's or my own patch, is that the parent need not be given
> the same affinity.
...this is the important part. As soon as you start a process executing
on a particular CPU/Node (more important on a NUMA box) it begins to develop
a memory footprint. Things start getting allocated to that CPU's or Node's
memory. When you push the process to a different node for no good reason
(just to fork() and then come back to the original node) it is inefficient.
You are going to be causing all kinds of remote memory accesses that don't need
to happen. As the kernel gets more NUMA-aware, this will be even more of a
discrepancy between the two methods.
>
> And honestly I don't see that as a need. You could always change it
> back after the exec. If that is unacceptable (you point out the cost of
> forcing a task on and off a certain CPU), you could just have a wrapper
> you exec that changes its affinity and then it execs the children.
This seems to me to be a bit of a kludgy (although perfectly valid) way of
doing something that could be done much more elegantly and efficiently with
a launch_policy. If you use the wrapper method, the task structure and
various other internal kernel data structures could be allocated on the
incorrect node. This data will eventually migrate to the correct place,
but why start the process out on the wrong foot, when the cost of doing
the correct allocation is so small (or nonexistent)?
Cheers!
-matt