the attached patch implements a new syscall, exit_free():
exit_free(error_code, addr, val);
this syscall is used as a performance optimization in glibc's threading
library.
upon exiting of child threads there is an ugly race condition that must be
solved: the freeing of the child stack. Old pthreads used to set up a
trampoline, jump to it, munmap() the child stack and call sys_exit(). It
does not have to be detailed how much this hurts performance: the
munmap(), besides being slow for such a lightweight thing as thread-exit,
also flushes the TLB, possibly across CPUs. Also, locality of reference of
thread stacks is lost as well.
the new thread library solves this performance by introducing a 'thread
stack cache' - a simple list of thread stacks. (with some more details to
handle different stack sizes.) The problem with the stack cache is that
there's a race condition: who releases it? We must not free the thread
stack up until the very last instruction the current thread executes,
because a signal handler might arrive and might use an alrady freed stack.
Other threads might pick this stack and overwrite it ... Disabling signals
upon thread-exit adds a syscall overhead, but it still doesnt solve the
fundamental problem of 'who frees the stack'. A global semaphore for a
trampoline stack would have to be introduced to be able to free the stack
safely - a messy, slow and unscalable solution. Or queueing the stack to a
helper thread is equally messy.
again the kernel can give the threading library a helping hand to solve
this catch-22 problem, surprisingly easily in this case as well.
exit_free() simply writes a user-provided word back to userspace. At that
point the user stack is not used anymore (nor will it ever be - sys_exit()
cannot fail), so freeing it is appropriate. The actual way glibc utilizes
exit_free() is that there's a "is this stack free" flag in the thread
stack control structure, and exit_free() sets this to '1'. So upon
thread-exit the thread frees the stack and puts it into the stack cache -
but other threads will skip over it because the 'free' flag is still 0.
with this syscall it was possible to implement single-syscall thread exit
in glibc. (well, actually not yet, the next patch i send will enable this
fully by solving the "who does the waitpid()?" problem.)
Ingo
--- linux/arch/i386/kernel/entry.S.orig Tue Aug 13 17:13:30 2002
+++ linux/arch/i386/kernel/entry.S Tue Aug 13 17:13:42 2002
@@ -755,6 +755,7 @@
.long sys_set_thread_area
.long sys_get_thread_area
.long sys_clone_startup /* 245 */
+ .long sys_exit_free
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
--- linux/include/asm-i386/unistd.h.orig Tue Aug 13 17:13:00 2002
+++ linux/include/asm-i386/unistd.h Tue Aug 13 17:13:17 2002
@@ -250,6 +250,7 @@
#define __NR_set_thread_area 243
#define __NR_get_thread_area 244
#define __NR_clone_startup 245
+#define __NR_exit_free 246
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
--- linux/kernel/exit.c.orig Tue Aug 13 17:12:06 2002
+++ linux/kernel/exit.c Tue Aug 13 17:12:45 2002
@@ -586,6 +586,12 @@
do_exit((error_code&0xff)<<8);
}
+asmlinkage long sys_exit_free(int error_code, unsigned long *addr, unsigned long val)
+{
+ put_user(val, addr);
+ do_exit((error_code&0xff)<<8);
+}
+
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru)
{
int flag, retval;
On Tue, 13 Aug 2002, Ingo Molnar wrote:
>
> the attached patch implements a new syscall, exit_free():
>
> exit_free(error_code, addr, val);
>
> this syscall is used as a performance optimization in glibc's threading
> library.
This looks like a total glibc braindamage hack.
It may be small, but it's crap, unless you can explain to me why glibc
cannot just cannot just catch the death signal in the master thread and be
done with it (and do all maintenance in the master).
Too ugly to live.
Linus
On Tue, 13 Aug 2002, Linus Torvalds wrote:
> It may be small, but it's crap, unless you can explain to me why glibc
> cannot just cannot just catch the death signal in the master thread and
> be done with it (and do all maintenance in the master).
we dont really want any signal overhead, and we also dont want any extra
context-switching to the 'master thread'. And there's no master thread
anymore either.
the pthreads API provides sensible ways to just get rid of a helper thread
without *any* handshaking or notification done after exit with any of the
other threads - the thread has finished its work and is gone forever.
the fundamental problem is getting rid of the stack atomically, it's a
catch-22. A thread can be interrupted by a signal on the last instruction
it executes, it can be ptrace debugged, etc. And something must notify
about completion once the stack is 100% unused.
(i'll add any other, userspace-only solution to the code if there's any
that has equivalent performance - i couldnt find any other solution so
far.)
Ingo
On Tue, 13 Aug 2002, Ingo Molnar wrote:
>
> we dont really want any signal overhead, and we also dont want any extra
> context-switching to the 'master thread'. And there's no master thread
> anymore either.
That still doesn't make it any les crap: because any thread that exits
without calling the "magic exit-flag interface" will then silently be
lost, with no information left around anywhere.
The whole interface is bogus.
If you want to do this, you can do it at _clone_ time, by extending on the
notion of "when I die, tell the parent using signal X" and making that
notion be a more generic "when I die, do X", where "X" migh include
updating some parent tables instead of sending a signal.
But the magic "exit_write()" has to die.
Linus
On Tue, 13 Aug 2002, Linus Torvalds wrote:
> If you want to do this, you can do it at _clone_ time, by extending on
> the notion of "when I die, tell the parent using signal X" and making
> that notion be a more generic "when I die, do X", where "X" migh include
> updating some parent tables instead of sending a signal.
>
> But the magic "exit_write()" has to die.
think about it - we have the *very same* problem in kernel-space, and we
had it for years. People wanted to get rid of parent notification in
helper processes for ages. A thread cannot free its own stack. We now can
do it only with very special care and atomicity. The same thing cannot be
done by user-space, because it has no 'atomic change and sys_exit()'
operation at its hands. This capability is that the syscall provides -
perhaps it should be called 'exit_atomic()' instead?
(we got rid of all signal passing in the main fabric of pthreads - and
that's done rightfully so. Futexes are used for message passing and
eventing.)
Ingo
I was actually surprised to see how much effort it takes on the glibc side
to solve this (admittedly conceptually hard) problem without any kernel
help - it's ugly and slow, and still not completely tight. By providing
this 'exit and free stack' capability we can help tremendously.
the thing that makes it special and hard is the completely shared VM.
There's just no way for a thread to 'get rid of itself' atomically and
also exit in the same round, without extensive locking and signal passing.
Linux actually has a very very fast clone()+exit() codepath, lets make it
possible for usespace to use it.
in essence the 'exit and send notification signal' thing now became a
simple word written into userspace. Should this be a more formal thing -
userspace mailboxes for the kernel to put events into? I think that might
be a bit overboard though.
Ingo
On Tue, 13 Aug 2002, Linus Torvalds wrote:
> > we dont really want any signal overhead, and we also dont want any extra
> > context-switching to the 'master thread'. And there's no master thread
> > anymore either.
>
> That still doesn't make it any les crap: because any thread that exits
> without calling the "magic exit-flag interface" will then silently be
> lost, with no information left around anywhere.
that should be a pretty rare occurance: with the upcoming signals patch
any segmentation fault zaps all threads and does a proper (and
deadlock-free) multithreaded coredump. Sysadmin doing a kill(1) will get
all threads killed as well. The only possible way for an uncontrolled exit
is for the thread to call sys_exit() explicitly (which is not possible
without the glibc cleanup handlers being called), or for someone to send a
SIGKILL via sys_tkill().
but even in this rare and malicious case, whatever resources a thread has,
they are lost if there's an uncontrolled exit anyway. There's tons of
other stuff that glibc might have to clean up on exit - mutexes,
malloc()s, etc. Thread exit needs to be cooperative, no matter what. The
stack cache does not change this situation the least.
Ingo
On Tue, 13 Aug 2002, Ingo Molnar wrote:
>
> I was actually surprised to see how much effort it takes on the glibc side
> to solve this (admittedly conceptually hard) problem without any kernel
> help - it's ugly and slow, and still not completely tight. By providing
> this 'exit and free stack' capability we can help tremendously.
Ingo, you're barking up the wrong tree.
I'm not against fixing it, but I'm very much against fixing it wrong.
I even told you how you can fix it right. You're arguing against the wrong
thing here.
Linus
On Tue, 13 Aug 2002, Ingo Molnar wrote:
> >
> > That still doesn't make it any les crap: because any thread that exits
> > without calling the "magic exit-flag interface" will then silently be
> > lost, with no information left around anywhere.
>
> that should be a pretty rare occurance: with the upcoming signals patch
> any segmentation fault zaps all threads and does a proper (and
> deadlock-free) multithreaded coredump.
That still doesn't change the fact that the interface is broken
_by_design_.
If the parent wants to get notified on child death, it should damn well
get notified on child death. Not "in case the child exists politely".
We don't depend on processes calling "exit()" to clean up all the stuff
they left behind. The VM gets cleaned up even for bad processes.
Linus
On Tue, 13 Aug 2002, Linus Torvalds wrote:
> If the parent wants to get notified on child death, it should damn well
> get notified on child death. Not "in case the child exists politely".
yes i agree. If the parent wants that, then it does not specify the
CLONE_DETACHED flag when creating the child thread. It is the parent that
specifies this flag and it has the freedom to decide whether it wants
signal based notification or not.
if CLONE_DETACHED is not specified upon creation then *no matter what the
child thread does* - both sys_exit() and sys_exit_free() notify the
parent. It's not a matter of politeness.
> We don't depend on processes calling "exit()" to clean up all the stuff
> they left behind. The VM gets cleaned up even for bad processes.
We'd be more than happy to do this cleanup in userspace, but how do you
free a stack which might as well be used by a debugger or a signal handler
right before executing the final "int $0x80" instruction?
should every signal handler start with code that tries to figure out
whether the stack is still valid (by calling gettid() and comparing it
with the TID written into a special offset on the stack)? Should the
exiting thread mask all signals before freeing the stack and calling
sys_exit()?
Ingo
On Tue, 13 Aug 2002, Ingo Molnar wrote:
>
> think about it - we have the *very same* problem in kernel-space, and we
> had it for years. People wanted to get rid of parent notification in
> helper processes for ages.
So add the capability to mark the child for proper exit semantics.
That's what I said: if you wan tto do this, you need to mark it at
_create_ time. Exactly so that the proper exit semantics can be done 100%
reliably, instead of just "sometimes". THAT is why _any_ interface that
depends on exit_xxxx() must die - because it is inherently broken for
accidental deaths, and does not leave the parent any way to recover
sanely.
Linus
On Tue, 13 Aug 2002, Linus Torvalds wrote:
>
> So add the capability to mark the child for proper exit semantics.
Actually, this has nothing at all to do with exit().
The same thing comes up when you want to do an execve() (yes, I know
pthreads doesn't support a thread starting another process, but the fact
that pthreads is broken is no excuse for broken interfaces).
If the parent needs to be notified that the stack slot is no longer in
use, it needs to happen for execve() too, not just exit().
In fact, I'd say that this thing is tied in to "mm_release()", not
"exit()".
The fact that the child doesn't want to send a signal to the parent on
exit is a totally different matter, and should already be supported by
just giving a zero signal number.
Linus
On Tue, 13 Aug 2002, Linus Torvalds wrote:
> The fact that the child doesn't want to send a signal to the parent on
> exit is a totally different matter, and should already be supported by
> just giving a zero signal number.
exit signal 0 is already being used and relied on by kmod - i originally
implemented it that way. In that case the child thread becomes a zombie
until the parent exits, and then it gets reparented to init. I did not
want to break any existing semantics (no matter how broken they appeared
to me) thus i introduced CLONE_DETACHED. But thinking about it, 'a zombie
staying around indefinitely' is not a semantics that it worth carrying too
far? But in case, if signal 0 is the preferred interface then i'm all for
it - this is not really a clone() property but an exit-signalling
property.
Ingo
On Tue, 13 Aug 2002, Linus Torvalds wrote:
> The same thing comes up when you want to do an execve() (yes, I know
> pthreads doesn't support a thread starting another process, but the fact
> that pthreads is broken is no excuse for broken interfaces).
fixing this too would be very nice indeed.
> If the parent needs to be notified that the stack slot is no longer in
> use, it needs to happen for execve() too, not just exit().
>
> In fact, I'd say that this thing is tied in to "mm_release()", not
> "exit()".
yes.
from the practical POV right now we have a dualness of APIs. exit() is a
way to exit a current thread and destroy it. execve() is a way to exit()
the current thread and bootstrapping a completely new thread from scratch
- while saving over some well-specified state into the new thread.
so for any threading library to handle execve() correctly, there needs to
be some way to specify an mm_release event. It's in essence the same
'exit' conceptual thing we do in both the sys_exit() and sys_execve()
case, but it's accessible via two external interfaces.
one solution would be a new syscall to set 'VM exit notification' address
and value in the released VM. But since it would always come in pair with
sys_exit() or sys_execve(), it would be nicer to have a composite syscalls
as well - ie. exit_release_user_mm() and execve_release_user_mm(). I know
this is a pain to look at, but i dont have any better ideas right now. The
composite syscalls also have the advantage that no additional per-thread
field has to be used, since user-space can be notified right at the
beginning.
hm, maybe there's an idea: perhaps the most elegant way would be to handle
this at clone() time: if the CLONE_NOTIFY_MM_RELEASE flag is specified
then the top of the user stack address is taken as the notification
address. (or a new parameter can be used.) And the notification can as
well be an implicit 'set to 0' rule. [so it's basically a VM lock extended
to userspace.] The user-space stack's address is known at clone() time
already, nothing wants to change that address until exit() time.
mm_release() then sees this address set in current->, and notifies the
userspace VM of the release. No need for new syscalls, and *all* 'exit'
variants in the future will automatically have this capability, without
having to create clumsy composite syscalls.
> The fact that the child doesn't want to send a signal to the parent on
> exit is a totally different matter, and should already be supported by
> just giving a zero signal number.
yes.
Ingo
On Tue, 13 Aug 2002, Ingo Molnar wrote:
>
> exit signal 0 is already being used and relied on by kmod - i originally
> implemented it that way. In that case the child thread becomes a zombie
> until the parent exits, and then it gets reparented to init. I did not
> want to break any existing semantics (no matter how broken they appeared
> to me) thus i introduced CLONE_DETACHED. But thinking about it, 'a zombie
> staying around indefinitely' is not a semantics that it worth carrying too
> far?
I think it makes more sense to say that since there was no notification of
the parent, we should just reparent at that point.
But in case, if signal 0 is the preferred interface then i'm all for
> it - this is not really a clone() property but an exit-signalling
> property.
Right. I think that it makes more sense to do it that way. Clearly the
parent doesn't care about the exit if the signal is zero.
Linus
On Tue, 13 Aug 2002, Linus Torvalds wrote:
> > it - this is not really a clone() property but an exit-signalling
> > property.
>
> Right. I think that it makes more sense to do it that way. Clearly the
> parent doesn't care about the exit if the signal is zero.
the attached patch implements this (it's ontop of the SETTLS/SETTID patch
for sched.h conflict reasons), i've tested it with signal 0 threads and it
appears to work fine. (I have not tested kmod yet - that comes next.)
and looks pretty straightforward - !current->exit_signal is a good and
natural test.
Ingo
--- linux/kernel/exit.c.orig Tue Aug 13 21:01:50 2002
+++ linux/kernel/exit.c Tue Aug 13 21:49:35 2002
@@ -56,10 +56,13 @@
static void release_task(struct task_struct * p)
{
- if (p == current)
+ if (p == current && p->exit_signal)
+ BUG();
+ if (p->state != TASK_ZOMBIE)
BUG();
#ifdef CONFIG_SMP
- wait_task_inactive(p);
+ if (p != current)
+ wait_task_inactive(p);
#endif
atomic_dec(&p->user->processes);
security_ops->task_free_security(p);
@@ -67,10 +70,12 @@
unhash_process(p);
release_thread(p);
- current->cmin_flt += p->min_flt + p->cmin_flt;
- current->cmaj_flt += p->maj_flt + p->cmaj_flt;
- current->cnswap += p->nswap + p->cnswap;
- sched_exit(p);
+ if (p != current) {
+ current->cmin_flt += p->min_flt + p->cmin_flt;
+ current->cmaj_flt += p->maj_flt + p->cmaj_flt;
+ current->cnswap += p->nswap + p->cnswap;
+ sched_exit(p);
+ }
put_task_struct(p);
}
@@ -479,14 +484,15 @@
write_lock_irq(&tasklist_lock);
current->state = TASK_ZOMBIE;
- do_notify_parent(current, current->exit_signal);
+ if (current->exit_signal)
+ do_notify_parent(current, current->exit_signal);
while ((p = eldest_child(current))) {
list_del_init(&p->sibling);
p->ptrace = 0;
p->parent = p->real_parent;
list_add_tail(&p->sibling,&p->parent->children);
- if (p->state == TASK_ZOMBIE)
+ if (p->state == TASK_ZOMBIE && p->exit_signal)
do_notify_parent(p, p->exit_signal);
/*
* process group orphan check
@@ -555,6 +561,9 @@
tsk->exit_code = code;
exit_notify();
+ preempt_disable();
+ if (!current->exit_signal)
+ release_task(current);
schedule();
BUG();
/*
--- linux/kernel/signal.c.orig Tue Aug 13 21:02:04 2002
+++ linux/kernel/signal.c Tue Aug 13 21:48:01 2002
@@ -768,12 +768,14 @@
/*
* Let a parent know about a status change of a child.
*/
-
void do_notify_parent(struct task_struct *tsk, int sig)
{
struct siginfo info;
int why, status;
+ if (!sig || !tsk->exit_signal)
+ BUG();
+
info.si_signo = sig;
info.si_errno = 0;
info.si_pid = tsk->pid;
@@ -823,9 +825,11 @@
void
notify_parent(struct task_struct *tsk, int sig)
{
- read_lock(&tasklist_lock);
- do_notify_parent(tsk, sig);
- read_unlock(&tasklist_lock);
+ if (sig != -1) {
+ read_lock(&tasklist_lock);
+ do_notify_parent(tsk, sig);
+ read_unlock(&tasklist_lock);
+ }
}
#ifndef HAVE_ARCH_GET_SIGNAL_TO_DELIVER
--- linux/kernel/fork.c.orig Tue Aug 13 21:02:04 2002
+++ linux/kernel/fork.c Tue Aug 13 21:45:14 2002
@@ -50,6 +50,31 @@
rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED; /* outer */
+/*
+ * A per-CPU task cache - this relies on the fact that
+ * the very last portion of sys_exit() is executed with
+ * preemption turned off.
+ */
+static task_t *task_cache[NR_CPUS] __cacheline_aligned;
+
+void __put_task_struct(struct task_struct *tsk)
+{
+ if (tsk != current) {
+ free_thread_info(tsk->thread_info);
+ kmem_cache_free(task_struct_cachep,tsk);
+ } else {
+ int cpu = smp_processor_id();
+
+ tsk = task_cache[cpu];
+ if (tsk) {
+ free_thread_info(tsk->thread_info);
+ kmem_cache_free(task_struct_cachep,tsk);
+ }
+ task_cache[cpu] = current;
+ }
+}
+
+/* Protects next_safe and last_pid. */
void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
{
unsigned long flags;
@@ -123,13 +148,6 @@
return tsk;
}
-void __put_task_struct(struct task_struct *tsk)
-{
- free_thread_info(tsk->thread_info);
- kmem_cache_free(task_struct_cachep,tsk);
-}
-
-/* Protects next_safe and last_pid. */
spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
static int get_pid(unsigned long flags)
On Tue, 13 Aug 2002, Ingo Molnar wrote:
>
> one solution would be a new syscall to set 'VM exit notification' address
> and value in the released VM. But since it would always come in pair with
> sys_exit() or sys_execve()
Why do you say that?
In fact, by looking at the current implementation of mm_release(), you
should realize that this is wrong.
We already have an interface where the parent wants to know about the
child doing a mm_release() - it's called vfork(). And the mm_release()
intention notification is done at _vfork_ time, not at exit or execve()
time.
This is my whole argument. It makes sense to say at clone time that "I
want to be notified when my thread exits" or "I want to be notified when
my thread no longer uses my address space". That makes 100% sense, and is
in fact something we're already doing with the signal mask, and with the
existing CLONE_VFORK case.
In contrast, it does _not_ make sense to say "I'm the child, and I'm now
exiting, so I want to notify my parent", because the child does not always
even _know_ when it is exiting.
For example, if you want to use the existing execvp() etc libc helper
routines, the final execve() is not going to be a magic system call. So
you cannot have a special execve_release() system call, because that makes
no sense within the libc.
I repeat: it is the _parent_ that knows whether it wants to be notified at
exit. Not the child.
Ergo, the "notify me on exit" should be a parent decision, and it might be
as simple as writing zero to the (same) "pid_t *" that it passed for the
pid information thing for startup.
Linus
> the attached patch implements this (it's ontop of the SETTLS/SETTID
> patch for sched.h conflict reasons), i've tested it with signal 0
> threads and it appears to work fine. (I have not tested kmod yet - that
> comes next.)
as expected, kmod breaks, because it relies on waitpid() working on a
forked child thread, while it blocks SIGCHLD and supresses SIGCHLD
delivery via specifying a 0 exit_signal. No userspace code is expected to
rely on this existing semantics of clone() though, and we should be able
to fix kmod.
a quick workaround is in the attached patch but it's an incorrect fix:
kmod does not want to receive any signals from the helper thread for a
reason - it can be executed from any process context, correct? I dont know
how this should be fixed.
Ingo
--- linux/kernel/exit.c.orig Tue Aug 13 21:01:50 2002
+++ linux/kernel/exit.c Tue Aug 13 21:49:35 2002
@@ -56,10 +56,13 @@
static void release_task(struct task_struct * p)
{
- if (p == current)
+ if (p == current && p->exit_signal)
+ BUG();
+ if (p->state != TASK_ZOMBIE)
BUG();
#ifdef CONFIG_SMP
- wait_task_inactive(p);
+ if (p != current)
+ wait_task_inactive(p);
#endif
atomic_dec(&p->user->processes);
security_ops->task_free_security(p);
@@ -67,10 +70,12 @@
unhash_process(p);
release_thread(p);
- current->cmin_flt += p->min_flt + p->cmin_flt;
- current->cmaj_flt += p->maj_flt + p->cmaj_flt;
- current->cnswap += p->nswap + p->cnswap;
- sched_exit(p);
+ if (p != current) {
+ current->cmin_flt += p->min_flt + p->cmin_flt;
+ current->cmaj_flt += p->maj_flt + p->cmaj_flt;
+ current->cnswap += p->nswap + p->cnswap;
+ sched_exit(p);
+ }
put_task_struct(p);
}
@@ -479,14 +484,15 @@
write_lock_irq(&tasklist_lock);
current->state = TASK_ZOMBIE;
- do_notify_parent(current, current->exit_signal);
+ if (current->exit_signal)
+ do_notify_parent(current, current->exit_signal);
while ((p = eldest_child(current))) {
list_del_init(&p->sibling);
p->ptrace = 0;
p->parent = p->real_parent;
list_add_tail(&p->sibling,&p->parent->children);
- if (p->state == TASK_ZOMBIE)
+ if (p->state == TASK_ZOMBIE && p->exit_signal)
do_notify_parent(p, p->exit_signal);
/*
* process group orphan check
@@ -555,6 +561,9 @@
tsk->exit_code = code;
exit_notify();
+ preempt_disable();
+ if (!current->exit_signal)
+ release_task(current);
schedule();
BUG();
/*
--- linux/kernel/signal.c.orig Tue Aug 13 21:02:04 2002
+++ linux/kernel/signal.c Tue Aug 13 21:48:01 2002
@@ -768,12 +768,14 @@
/*
* Let a parent know about a status change of a child.
*/
-
void do_notify_parent(struct task_struct *tsk, int sig)
{
struct siginfo info;
int why, status;
+ if (!sig || !tsk->exit_signal)
+ BUG();
+
info.si_signo = sig;
info.si_errno = 0;
info.si_pid = tsk->pid;
@@ -823,9 +825,11 @@
void
notify_parent(struct task_struct *tsk, int sig)
{
- read_lock(&tasklist_lock);
- do_notify_parent(tsk, sig);
- read_unlock(&tasklist_lock);
+ if (sig != -1) {
+ read_lock(&tasklist_lock);
+ do_notify_parent(tsk, sig);
+ read_unlock(&tasklist_lock);
+ }
}
#ifndef HAVE_ARCH_GET_SIGNAL_TO_DELIVER
--- linux/kernel/fork.c.orig Tue Aug 13 21:02:04 2002
+++ linux/kernel/fork.c Tue Aug 13 21:45:14 2002
@@ -50,6 +50,31 @@
rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED; /* outer */
+/*
+ * A per-CPU task cache - this relies on the fact that
+ * the very last portion of sys_exit() is executed with
+ * preemption turned off.
+ */
+static task_t *task_cache[NR_CPUS] __cacheline_aligned;
+
+void __put_task_struct(struct task_struct *tsk)
+{
+ if (tsk != current) {
+ free_thread_info(tsk->thread_info);
+ kmem_cache_free(task_struct_cachep,tsk);
+ } else {
+ int cpu = smp_processor_id();
+
+ tsk = task_cache[cpu];
+ if (tsk) {
+ free_thread_info(tsk->thread_info);
+ kmem_cache_free(task_struct_cachep,tsk);
+ }
+ task_cache[cpu] = current;
+ }
+}
+
+/* Protects next_safe and last_pid. */
void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
{
unsigned long flags;
@@ -123,13 +148,6 @@
return tsk;
}
-void __put_task_struct(struct task_struct *tsk)
-{
- free_thread_info(tsk->thread_info);
- kmem_cache_free(task_struct_cachep,tsk);
-}
-
-/* Protects next_safe and last_pid. */
spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
static int get_pid(unsigned long flags)
--- linux/kernel/kmod.c.orig Tue Aug 13 22:04:05 2002
+++ linux/kernel/kmod.c Tue Aug 13 22:04:18 2002
@@ -227,7 +227,7 @@
goto out;
}
- pid = kernel_thread(exec_modprobe, (void*) module_name, 0);
+ pid = kernel_thread(exec_modprobe, (void*) module_name, SIGCHLD);
if (pid < 0) {
printk(KERN_ERR "request_module[%s]: fork failed, errno %d\n", module_name, -pid);
atomic_dec(&kmod_concurrent);
On Tue, 13 Aug 2002, Ingo Molnar wrote:
>
> a quick workaround is in the attached patch but it's an incorrect fix:
> kmod does not want to receive any signals from the helper thread for a
> reason - it can be executed from any process context, correct? I dont know
> how this should be fixed.
Hmm.. The workarounds make me suspect that maybe your right interface was
the right one after all.
This interface has the advantage that the exit() path for this kind of
child has _zero_ context switching overhead etc, so I do like it. Although
I get this feeling that the release_task() issue would be much more
cleanly handled by just letting schedule_tail() do the last "put_task()",
the same way we handle the pending MM issue..
Maybe the best approach is to really mix the two approaches: a separate
clone flag to say that the parent really doesn't care about waiting for
the thing, but do it this way (so that init doesn't have to spend time
cleaning up either).
Linus
the attached patch implements CLONE_VM_RELEASE, which lets the child
release the 'user VM' at mm_release() time.
note that a quick testing did not show the desired result yet, so there
must be some thinko in it, but this is how i think it would roughly look
like. The copy_thread() code takes a pointer away from the user-stack -
userspace should put the lock there.
the patch also accelerates the common 'no funky CLONE flags' case in
copy_thread(), since the flags started mounting up.
(the patch is ontop the SETTID/SETTLS and DETACHED patches, for dependency
reasons.)
Ingo
--- linux/arch/i386/kernel/process.c.orig Tue Aug 13 22:26:39 2002
+++ linux/arch/i386/kernel/process.c Tue Aug 13 22:35:45 2002
@@ -566,6 +566,7 @@
struct_cpy(childregs, regs);
childregs->eax = 0;
childregs->esp = esp;
+ p->user_vm_lock = NULL;
p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -579,6 +580,19 @@
unlazy_fpu(tsk);
struct_cpy(&p->thread.i387, &tsk->thread.i387);
+ if (unlikely(NULL != tsk->thread.ts_io_bitmap)) {
+ p->thread.ts_io_bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
+ if (!p->thread.ts_io_bitmap)
+ return -ENOMEM;
+ memcpy(p->thread.ts_io_bitmap, tsk->thread.ts_io_bitmap,
+ IO_BITMAP_BYTES);
+ }
+
+ /*
+ * The common fastpath:
+ */
+ if (!(clone_flags & (CLONE_SETTLS | CLONE_SETTID | CLONE_RELEASE_VM)))
+ return 0;
/*
* Set a new TLS for the child thread?
*/
@@ -608,14 +622,13 @@
if (put_user(p->pid, (pid_t *)childregs->edx))
return -EFAULT;
- if (unlikely(NULL != tsk->thread.ts_io_bitmap)) {
- p->thread.ts_io_bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
- if (!p->thread.ts_io_bitmap)
- return -ENOMEM;
- memcpy(p->thread.ts_io_bitmap, tsk->thread.ts_io_bitmap,
- IO_BITMAP_BYTES);
+ /*
+ * Does the userspace VM want any unlock on mm_release()?
+ */
+ if (clone_flags & CLONE_RELEASE_VM) {
+ childregs->esp -= sizeof(0UL);
+ p->user_vm_lock = (long *) esp;
}
-
return 0;
}
--- linux/include/linux/sched.h.orig Tue Aug 13 22:20:41 2002
+++ linux/include/linux/sched.h Tue Aug 13 22:35:20 2002
@@ -47,6 +47,7 @@
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
+#define CLONE_RELEASE_VM 0x00200000 /* release the userspace stack */
#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)
@@ -305,6 +306,7 @@
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
+ long *user_vm_lock; /* for CLONE_RELEASE_VM */
unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
--- linux/kernel/fork.c.orig Tue Aug 13 22:21:52 2002
+++ linux/kernel/fork.c Tue Aug 13 22:35:27 2002
@@ -367,6 +367,12 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
+ if (tsk->user_vm_lock)
+ /*
+ * We dont check the error code - if userspace has
+ * not set up a proper pointer then tough luck.
+ */
+ put_user(0UL, tsk->user_vm_lock);
}
static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
> note that a quick testing did not show the desired result yet, so there
> must be some thinko in it, but this is how i think it would roughly look
> like. The copy_thread() code takes a pointer away from the user-stack -
> userspace should put the lock there.
it was a flaw in the testing code, after fixing it the user-vm-unlock
works as expected.
Ingo
On Tue, 13 Aug 2002, Linus Torvalds wrote:
> This interface has the advantage that the exit() path for this kind of
> child has _zero_ context switching overhead etc, so I do like it.
> Although I get this feeling that the release_task() issue would be much
> more cleanly handled by just letting schedule_tail() do the last
> "put_task()", the same way we handle the pending MM issue..
hm, release_task() is like release_mm(), and there's no real problem with
disabling preemption at this point: the codepath to the final schedule()
is very short at this point. Putting extra code into schedule_tail() hurts
every context-switch - while having this branch in release_task() only
involves exit()-ing tasks.
> Maybe the best approach is to really mix the two approaches: a separate
> clone flag to say that the parent really doesn't care about waiting for
> the thing, but do it this way (so that init doesn't have to spend time
> cleaning up either).
yes. Patch attached (against the SETTID/SETTLS patch). [ I'm a wimp so
reparent_to_init() does not yet use this mechanism - but it will be done.
It will be a good test, and a nice speedup as well. ]
Ingo
--- linux/include/linux/sched.h.orig Tue Aug 13 23:14:39 2002
+++ linux/include/linux/sched.h Tue Aug 13 23:14:51 2002
@@ -47,6 +47,7 @@
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
+#define CLONE_DETACHED 0x00200000 /* parent wants no child-exit signal */
#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)
--- linux/kernel/exit.c.orig Tue Aug 13 23:13:41 2002
+++ linux/kernel/exit.c Tue Aug 13 23:14:51 2002
@@ -56,10 +56,11 @@
static void release_task(struct task_struct * p)
{
- if (p == current)
+ if (p->state != TASK_ZOMBIE)
BUG();
#ifdef CONFIG_SMP
- wait_task_inactive(p);
+ if (p != current)
+ wait_task_inactive(p);
#endif
atomic_dec(&p->user->processes);
security_ops->task_free_security(p);
@@ -67,10 +68,12 @@
unhash_process(p);
release_thread(p);
- current->cmin_flt += p->min_flt + p->cmin_flt;
- current->cmaj_flt += p->maj_flt + p->cmaj_flt;
- current->cnswap += p->nswap + p->cnswap;
- sched_exit(p);
+ if (p != current) {
+ current->cmin_flt += p->min_flt + p->cmin_flt;
+ current->cmaj_flt += p->maj_flt + p->cmaj_flt;
+ current->cnswap += p->nswap + p->cnswap;
+ sched_exit(p);
+ }
put_task_struct(p);
}
@@ -479,14 +482,15 @@
write_lock_irq(&tasklist_lock);
current->state = TASK_ZOMBIE;
- do_notify_parent(current, current->exit_signal);
+ if (current->exit_signal != -1)
+ do_notify_parent(current, current->exit_signal);
while ((p = eldest_child(current))) {
list_del_init(&p->sibling);
p->ptrace = 0;
p->parent = p->real_parent;
list_add_tail(&p->sibling,&p->parent->children);
- if (p->state == TASK_ZOMBIE)
+ if (p->state == TASK_ZOMBIE && p->exit_signal != -1)
do_notify_parent(p, p->exit_signal);
/*
* process group orphan check
@@ -555,6 +559,9 @@
tsk->exit_code = code;
exit_notify();
+ preempt_disable();
+ if (current->exit_signal == -1)
+ release_task(current);
schedule();
BUG();
/*
--- linux/kernel/signal.c.orig Tue Aug 13 23:13:44 2002
+++ linux/kernel/signal.c Tue Aug 13 23:14:51 2002
@@ -768,12 +768,15 @@
/*
* Let a parent know about a status change of a child.
*/
-
void do_notify_parent(struct task_struct *tsk, int sig)
{
struct siginfo info;
int why, status;
+ /* is the thread detached? */
+ if (sig == -1 || tsk->exit_signal == -1)
+ BUG();
+
info.si_signo = sig;
info.si_errno = 0;
info.si_pid = tsk->pid;
@@ -823,9 +826,11 @@
void
notify_parent(struct task_struct *tsk, int sig)
{
- read_lock(&tasklist_lock);
- do_notify_parent(tsk, sig);
- read_unlock(&tasklist_lock);
+ if (sig != -1) {
+ read_lock(&tasklist_lock);
+ do_notify_parent(tsk, sig);
+ read_unlock(&tasklist_lock);
+ }
}
#ifndef HAVE_ARCH_GET_SIGNAL_TO_DELIVER
--- linux/kernel/fork.c.orig Tue Aug 13 23:13:44 2002
+++ linux/kernel/fork.c Tue Aug 13 23:14:51 2002
@@ -50,6 +50,31 @@
rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED; /* outer */
+/*
+ * A per-CPU task cache - this relies on the fact that
+ * the very last portion of sys_exit() is executed with
+ * preemption turned off.
+ */
+static task_t *task_cache[NR_CPUS] __cacheline_aligned;
+
+void __put_task_struct(struct task_struct *tsk)
+{
+ if (tsk != current) {
+ free_thread_info(tsk->thread_info);
+ kmem_cache_free(task_struct_cachep,tsk);
+ } else {
+ int cpu = smp_processor_id();
+
+ tsk = task_cache[cpu];
+ if (tsk) {
+ free_thread_info(tsk->thread_info);
+ kmem_cache_free(task_struct_cachep,tsk);
+ }
+ task_cache[cpu] = current;
+ }
+}
+
+/* Protects next_safe and last_pid. */
void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
{
unsigned long flags;
@@ -123,13 +148,6 @@
return tsk;
}
-void __put_task_struct(struct task_struct *tsk)
-{
- free_thread_info(tsk->thread_info);
- kmem_cache_free(task_struct_cachep,tsk);
-}
-
-/* Protects next_safe and last_pid. */
spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
static int get_pid(unsigned long flags)
@@ -737,7 +755,10 @@
/* ok, now we should be set up.. */
p->swappable = 1;
- p->exit_signal = clone_flags & CSIGNAL;
+ if (clone_flags & CLONE_DETACHED)
+ p->exit_signal = -1;
+ else
+ p->exit_signal = clone_flags & CSIGNAL;
p->pdeath_signal = 0;
/*
and here's a merged CLONE_VM_RELEASE patch against the previous
clone-detached patch. (the sched.h reorganization broke it.)
I've tested it and userspace VM-release still works (after fixing the test
tool to use the repositioned clone flag bit):
[mingo@a mingo]$ ./start-detached 1000 1 0
stack end: 0x12341234
started up 1000 threads!
stack end after all threads exited: 0x00000000.
Ingo
--- linux/arch/i386/kernel/process.c.orig Tue Aug 13 23:14:39 2002
+++ linux/arch/i386/kernel/process.c Tue Aug 13 23:20:48 2002
@@ -566,6 +566,7 @@
struct_cpy(childregs, regs);
childregs->eax = 0;
childregs->esp = esp;
+ p->user_vm_lock = NULL;
p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -579,6 +580,19 @@
unlazy_fpu(tsk);
struct_cpy(&p->thread.i387, &tsk->thread.i387);
+ if (unlikely(NULL != tsk->thread.ts_io_bitmap)) {
+ p->thread.ts_io_bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
+ if (!p->thread.ts_io_bitmap)
+ return -ENOMEM;
+ memcpy(p->thread.ts_io_bitmap, tsk->thread.ts_io_bitmap,
+ IO_BITMAP_BYTES);
+ }
+
+ /*
+ * The common fastpath:
+ */
+ if (!(clone_flags & (CLONE_SETTLS | CLONE_SETTID | CLONE_RELEASE_VM)))
+ return 0;
/*
* Set a new TLS for the child thread?
*/
@@ -608,14 +622,13 @@
if (put_user(p->pid, (pid_t *)childregs->edx))
return -EFAULT;
- if (unlikely(NULL != tsk->thread.ts_io_bitmap)) {
- p->thread.ts_io_bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
- if (!p->thread.ts_io_bitmap)
- return -ENOMEM;
- memcpy(p->thread.ts_io_bitmap, tsk->thread.ts_io_bitmap,
- IO_BITMAP_BYTES);
+ /*
+ * Does the userspace VM want any unlock on mm_release()?
+ */
+ if (clone_flags & CLONE_RELEASE_VM) {
+ childregs->esp -= sizeof(0UL);
+ p->user_vm_lock = (long *) esp;
}
-
return 0;
}
--- linux/include/linux/sched.h.orig Tue Aug 13 23:14:51 2002
+++ linux/include/linux/sched.h Tue Aug 13 23:21:21 2002
@@ -48,6 +48,7 @@
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
#define CLONE_DETACHED 0x00200000 /* parent wants no child-exit signal */
+#define CLONE_RELEASE_VM 0x00400000 /* release the userspace VM */
#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)
@@ -306,6 +307,7 @@
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
+ long *user_vm_lock; /* for CLONE_RELEASE_VM */
unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
--- linux/kernel/fork.c.orig Tue Aug 13 23:14:51 2002
+++ linux/kernel/fork.c Tue Aug 13 23:20:48 2002
@@ -367,6 +367,12 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
+ if (tsk->user_vm_lock)
+ /*
+ * We dont check the error code - if userspace has
+ * not set up a proper pointer then tough luck.
+ */
+ put_user(0UL, tsk->user_vm_lock);
}
static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
Ingo Molnar wrote:
> and here's a merged CLONE_VM_RELEASE patch against the previous
> clone-detached patch. (the sched.h reorganization broke it.)
I wonder if it makes more sense for the release word to be a futex --
then various ways of actually waiting for the stack are available.
It would be nice if the stored word were the exit() code, too. This
would remove the need for zombie threads even when an exit status is
desired.
> + /*
> + * Does the userspace VM want any unlock on mm_release()?
> + */
> + if (clone_flags & CLONE_RELEASE_VM) {
> + childregs->esp -= sizeof(0UL);
> + p->user_vm_lock = (long *) esp;
Is this correct? I would have expected this, given that stacks are
pre-decrement, and given that the value of `esp' is typically just after
the end of an mmaped region:
+ childregs->esp -= sizeof(0UL);
+ p->user_vm_lock = (long *) childregs->esp;
-- Jamie
On Thu, 15 Aug 2002, Jamie Lokier wrote:
> Is this correct? I would have expected this, given that stacks are
> pre-decrement, and given that the value of `esp' is typically just after
> the end of an mmaped region:
>
> + childregs->esp -= sizeof(0UL);
> + p->user_vm_lock = (long *) childregs->esp;
you are right. Fix against BK-curr attached.
Ingo
--- linux/arch/i386/kernel/process.c.orig Thu Aug 15 08:37:52 2002
+++ linux/arch/i386/kernel/process.c Thu Aug 15 08:37:59 2002
@@ -627,7 +627,7 @@
*/
if (clone_flags & CLONE_RELEASE_VM) {
childregs->esp -= sizeof(0UL);
- p->user_vm_lock = (long *) esp;
+ p->user_vm_lock = (long *) childregs->esp;
}
return 0;
}
On Thu, 15 Aug 2002, Jamie Lokier wrote:
> I wonder if it makes more sense for the release word to be a futex --
> then various ways of actually waiting for the stack are available.
the window for locking is really small (and will always be small), so it's
cheaper for the fastpath to implement this as a spinlock, with the
stack-user being the lock holder.
> It would be nice if the stored word were the exit() code, too. This
> would remove the need for zombie threads even when an exit status is
> desired.
this is an unrelated issue from the user-vm-unlock thing. pthread_join()
(the new, futex based variant) already uses a futex (well, internal
pthreads mutex that in turn uses a futex) which also stores the exit code,
so threads are always be created in a detached state. But there are cases
when this is unnecessery, eg. when a thread later on calls
pthread_detach().
Ingo
On Thu, 15 Aug 2002, Jamie Lokier wrote:
> Is this correct? I would have expected this, given that stacks are
> pre-decrement, and given that the value of `esp' is typically just after
> the end of an mmaped region:
>
> + childregs->esp -= sizeof(0UL);
> + p->user_vm_lock = (long *) childregs->esp;
btw., i backported all the recent threading improvements to 2.4, and
current pthreads sources already use the new APIs, which worked this
CLONE_VM_RELEASE API uncleanliness around - but it's of course cleaner to
have your fix in. In any case, all the new APIs are fully functional.
Ingo
Ingo Molnar wrote:
> > I wonder if it makes more sense for the release word to be a futex --
> > then various ways of actually waiting for the stack are available.
>
> the window for locking is really small (and will always be small), so it's
> cheaper for the fastpath to implement this as a spinlock, with the
> stack-user being the lock holder.
I'm thinking that any _clean_ threading library (pthreads or not)
should do these two things:
- intercept all the system calls that might call mmput(); that is,
exit() and execve()), just so it can move the thread-specific data
(including the stack) onto the "potentially free list".
- free the stack memory as soon as possible after a thread has died,
_without_ depending on garbage collection. What if all the other
threads are compute-bound? There's a lump of unnecessary stack
taking up memory for an indefinite time.
It seems that you're using a futex anyway -- so why not eliminate that
pesky system call _and_ make sure pthread_join() work if some library
you're linked to exits without calling pthread_exit()..
-- Jamie
> > + childregs->esp -= sizeof(0UL);
> > + p->user_vm_lock = (long *) childregs->esp;
>
> you are right. Fix against BK-curr attached.
in fact testing these changes in glibc revealed another thing - the top of
the thread's stack has to be 16-byte aligned (for SSE2 support), so the
attached patch ontop of BK-curr would be a better solution, it does not
modify the thread's stack alignment but simply writes to the top of the
stack. (which is the first word of the thread control block.) This removes
a few instructions both from glibc and from the kernel.
Ingo
--- arch/i386/kernel/process.c.orig Thu Aug 15 08:37:52 2002
+++ arch/i386/kernel/process.c Thu Aug 15 12:36:57 2002
@@ -625,10 +625,8 @@
/*
* Does the userspace VM want any unlock on mm_release()?
*/
- if (clone_flags & CLONE_RELEASE_VM) {
- childregs->esp -= sizeof(0UL);
- p->user_vm_lock = (long *) esp;
- }
+ if (clone_flags & CLONE_RELEASE_VM)
+ p->user_vm_lock = (long *) childregs->esp;
return 0;
}
On Thu, Aug 15, 2002 at 11:31:48AM +0100, Jamie Lokier wrote:
> - intercept all the system calls that might call mmput(); that is,
> exit() and execve()), just so it can move the thread-specific data
> (including the stack) onto the "potentially free list".
And the reason why i cannot intercept exit() or exec... again somewhere
in my threaded program is...?
On Thu, 15 Aug 2002, Jamie Lokier wrote:
> - intercept all the system calls that might call mmput(); that is,
> exit() and execve()), just so it can move the thread-specific data
> (including the stack) onto the "potentially free list".
(there's no need to intercept anything - glibc *is* the only legitimate
code that might call the raw sys_execve() & sys_exit() system-calls.)
> - free the stack memory as soon as possible after a thread has died,
> _without_ depending on garbage collection. What if all the other
> threads are compute-bound? There's a lump of unnecessary stack
> taking up memory for an indefinite time.
>
> It seems that you're using a futex anyway -- so why not eliminate that
> pesky system call _and_ make sure pthread_join() work if some library
> you're linked to exits without calling pthread_exit()..
so how would it work exactly? My prediction is that you wont be able to
suggest any better methods than what i outlined in the original email, so
the best (and fastest) solution is some (minimal) kernel help.
Ingo
doh, one more problem popped up while implementing support for this, which
needs a change in the interface again - this time it should be the final
solution. glibc does not pass in the true top of the stack to clone(), but
rather the address of the start function's parameter frame. So the
VM_RELEASE will overwrite the first parameter.
it is much cleaner anyway to pass in the address of the user-space VM lock
- this will also enable arbitrary implementations of the stack-unlock, as
the fifth clone() parameter. Patch against BK-curr attached.
with this it now works fine. (previously it only appeared to work fine -
but it leaked stackspace.)
Ingo
--- arch/i386/kernel/process.c.orig Thu Aug 15 08:37:52 2002
+++ arch/i386/kernel/process.c Thu Aug 15 12:36:57 2002
@@ -625,10 +625,8 @@
/*
* Does the userspace VM want any unlock on mm_release()?
*/
- if (clone_flags & CLONE_RELEASE_VM) {
- childregs->esp -= sizeof(0UL);
- p->user_vm_lock = (long *) esp;
- }
+ if (clone_flags & CLONE_RELEASE_VM)
+ p->user_vm_lock = (long *) childregs->edi;
return 0;
}
one of the debugging tests triggered a false-positive BUG() when a
detached thread was straced. Fix against BK-curr attached.
Ingo
--- kernel/signal.c.orig Thu Aug 15 17:12:02 2002
+++ kernel/signal.c Thu Aug 15 17:12:34 2002
@@ -774,7 +774,7 @@
int why, status;
/* is the thread detached? */
- if (sig == -1 || tsk->exit_signal == -1)
+ if (sig == -1)
BUG();
info.si_signo = sig;
On Thu, 15 Aug 2002, Ingo Molnar wrote:
>
> in fact testing these changes in glibc revealed another thing - the top of
> the thread's stack has to be 16-byte aligned (for SSE2 support), so the
> attached patch ontop of BK-curr would be a better solution, it does not
> modify the thread's stack alignment but simply writes to the top of the
> stack. (which is the first word of the thread control block.) This removes
> a few instructions both from glibc and from the kernel.
I do not understand why you want to link this to the stack at all. It
doesn't really make sense to me. The "thread exited" thing has nothing to
do with the stack, and you're only focused on that because you want to
free (or re-use) the stack when it exits.
I personally believe that it would make a lot more sense to either pass in
a totally independent pointer, or - my preferred approach - to just re-use
the TID pointer. Think about it: thread creation sets the TID (if
CLONE_SETTID is set) in the thread data structures, and thread exit clears
the TID (if CLONE_CLRTID is set). That sounds like a _sensible_ interface.
(CLONE_RELEASE_VM doesn't really make sense as a name. It doesn't describe
what the flag _means_, it really only describes an implementation detail
inside the kernel).
Eh?
Linus
Ingo Molnar wrote:
> (there's no need to intercept anything - glibc *is* the only legitimate
> code that might call the raw sys_execve() & sys_exit() system-calls.)
This is very glibc-centred.
I'm thinking of glibc (or other libc) + non-glibc (and non-pthreads)
thread library, for a language's run-time environment.
The less interception, and also signal wrappers (ugly things), that have
to be written in the _application-specific thread library_, which still
wishes to interface with unknown 3rd party libraries remember, the
better.
Linus Torvalds wrote:
> On Tue, 13 Aug 2002, Ingo Molnar wrote:
> > we dont really want any signal overhead, and we also dont want any extra
> > context-switching to the 'master thread'. And there's no master thread
> > anymore either.
>
> That still doesn't make it any les crap: because any thread that exits
> without calling the "magic exit-flag interface" will then silently be
> lost, with no information left around anywhere.
also:
> If the parent wants to get notified on child death, it should damn well
> get notified on child death. Not "in case the child exists politely".
Ingo, the reason I suggest a futex, and to store the exit status in it,
is precisely because of the above by Linus... if a thread exits without
calling "magic exit-flag interface" I would still like a parent thread
to know about it. (This is for _non-glibc-threads_ applications). And
I would like this while having the benefit of CLONE_DETACHED - because I
want to use this for high performance threading but still be robust - so
waitpid() is out.
So - following Linus' note on CLONE_CLRTID - store the tid when the
thread is created, store the exit status when the thread is destroyed,
and after storing the exit status, call sys_futex(address, FUTEX_WAKE,
1, 0) if the word value before storing was non-zero. The condition is
to avoid the slowness of sys_futex when it's not required. It's
probably most simple to use two consecutive words, for simplicity and to
avoid needing an atomic store (which is slower on some architectures).
This gives synchronous (FUTEX_WAIT), asynchronous (FUTEX_FD) and polling
(just read memory) interfaces to a parent monitoring its child (or
siblings monitoring each other). It gives access to the same
information as waitpid(), but with the advantages of CLONE_DETACHED.
All this for these few untested lines in the exit path :-)
if (tsk->user_vm_lock) {
long user_word;
if (likely(!get_user(user_word, tsk->user_vm_lock))) {
put_user(tsk->exit_code, tsk->user_vm_lock);
wmb();
if (user_word < 0)
sys_futex (tsk->user_vm_lock, FUTEX_WAKE, -user_word, 0);
}
}
-- Jamie
On Thu, 15 Aug 2002, Jamie Lokier wrote:
> > On Tue, 13 Aug 2002, Ingo Molnar wrote:
> > > we dont really want any signal overhead, and we also dont want any extra
> > > context-switching to the 'master thread'. And there's no master thread
> > > anymore either.
> >
> > That still doesn't make it any les crap: because any thread that exits
> > without calling the "magic exit-flag interface" will then silently be
> > lost, with no information left around anywhere.
i'm not quite sure why you repeated this part. I replied to this point,
read it.
> also:
>
> > If the parent wants to get notified on child death, it should damn well
> > get notified on child death. Not "in case the child exists politely".
and this point got replied to as well.
> Ingo, the reason I suggest a futex, and to store the exit status in it,
> is precisely because of the above by Linus... [...]
which was incorrect, and replied to. Please re-read the thread. There's no
'silent loss' of anything, and there is no 'unpolite exit'.
> [...] if a thread exits without calling "magic exit-flag interface" I
> would still like a parent thread to know about it. (This is for
> _non-glibc-threads_ applications). [...]
precisely what makes you think glibc does not want to know about about
child thead exit?
> [...] And I would like this while having the benefit of CLONE_DETACHED -
> because I want to use this for high performance threading but still be
> robust - so waitpid() is out.
like i said in the original email, the point of CLONE_DETACHED is to avoid
the waitpid() overhead. I also said that exit notification is done via
mutexes (futexes).
> So - following Linus' note on CLONE_CLRTID - store the tid when the
> thread is created, store the exit status when the thread is destroyed,
> and after storing the exit status, call sys_futex(address, FUTEX_WAKE,
> 1, 0) if the word value before storing was non-zero. The condition is
> to avoid the slowness of sys_futex when it's not required. It's
> probably most simple to use two consecutive words, for simplicity and to
> avoid needing an atomic store (which is slower on some architectures).
no, this is completely unnecessery. We already release the child-exit
mutes before calling the exit, that is a perfectly fine solution. No need
to add extra code to the kernel.
> This gives synchronous (FUTEX_WAIT), asynchronous (FUTEX_FD) and polling
> (just read memory) interfaces to a parent monitoring its child (or
> siblings monitoring each other). It gives access to the same
> information as waitpid(), but with the advantages of CLONE_DETACHED.
>
> All this for these few untested lines in the exit path :-)
this is a horrible hack that is completely unnecessery:
> if (tsk->user_vm_lock) {
> long user_word;
> if (likely(!get_user(user_word, tsk->user_vm_lock))) {
> put_user(tsk->exit_code, tsk->user_vm_lock);
> wmb();
> if (user_word < 0)
> sys_futex (tsk->user_vm_lock, FUTEX_WAKE, -user_word, 0);
> }
> }
the futex wake is being done before the exit, and it works just fine:
16265 clone(child_stack=0x567ffee0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|0x790000) = 16355
16355 _exit(0) = ?
16265 clone(child_stack=0x56fffee0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|0x790000) = 16356
16265 futex(0x56ffff38, FUTEX_WAIT, -1, NULL <unfinished ...>
16356 futex(0x56ffff38, FUTEX_WAKE, 1, NULL) = 1
16265 <... futex resumed> ) = 0
you are banging on open doors.
Ingo
On Thu, 15 Aug 2002, Linus Torvalds wrote:
> I do not understand why you want to link this to the stack at all. [...]
you are right, indeed 'releasing basic thread state' is the correct
concept. Any sane threading library merges this with the stackframe, this
is why i keep referring to it as the stack. Basic thread state has to be
intact for signal handlers to function even in the last moment.
> I personally believe that it would make a lot more sense to either pass
> in a totally independent pointer, or - my preferred approach - to just
> re-use the TID pointer. Think about it: thread creation sets the TID (if
> CLONE_SETTID is set) in the thread data structures, and thread exit
> clears the TID (if CLONE_CLRTID is set). That sounds like a _sensible_
> interface.
you have applied my independent-pointer patch already, but i think your
CLEARTID variant is the most elegant solution: it reuses a clone argument,
thus reduces the number of arguments and it's also a nice conceptual pair
to the existing SETTID call. And the TID field can be used as a 'usage'
field as well, because the TID (PID) can never be 0, reducing the number
of fields in the TCB. And we can change the userspace locking code to use
the TID field no problem.
personally i'd make it even more compact by merging the two clone flags as
well, something along: CLONE_MANAGE_TID. I cannot see any reason for a
thread library to use one of the bits only. This also reflects the fact
that it's thread state that the kernel helps managing, both in the
thread-create and in the thread-exit path. But this might be stretching
things a bit?
patch against BK-curr attached.
Ingo
--- linux/arch/i386/kernel/process.c.orig Fri Aug 16 00:16:52 2002
+++ linux/arch/i386/kernel/process.c Fri Aug 16 00:18:49 2002
@@ -566,7 +566,7 @@
struct_cpy(childregs, regs);
childregs->eax = 0;
childregs->esp = esp;
- p->user_vm_lock = NULL;
+ p->user_tid = NULL;
p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -591,7 +591,7 @@
/*
* The common fastpath:
*/
- if (!(clone_flags & (CLONE_SETTLS | CLONE_SETTID | CLONE_RELEASE_VM)))
+ if (!(clone_flags & (CLONE_SETTLS | CLONE_SETTID | CLONE_CLEARTID)))
return 0;
/*
* Set a new TLS for the child thread?
@@ -623,10 +623,10 @@
return -EFAULT;
/*
- * Does the userspace VM want any unlock on mm_release()?
+ * Does the userspace VM want the TID cleared on mm_release()?
*/
- if (clone_flags & CLONE_RELEASE_VM)
- p->user_vm_lock = (long *) childregs->edi;
+ if (clone_flags & CLONE_CLEARTID)
+ p->user_tid = (long *) childregs->edx;
return 0;
}
--- linux/include/linux/sched.h.orig Fri Aug 16 00:17:48 2002
+++ linux/include/linux/sched.h Fri Aug 16 00:26:05 2002
@@ -47,8 +47,9 @@
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_DETACHED 0x00200000 /* parent wants no child-exit signal */
-#define CLONE_RELEASE_VM 0x00400000 /* release the userspace VM */
+#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
+#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
+#define CLONE_RELEASE_VM 0x00800000 /* release the userspace VM */
#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)
@@ -307,7 +308,7 @@
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- long *user_vm_lock; /* for CLONE_RELEASE_VM */
+ long *user_tid; /* for CLONE_CLEARTID */
unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
--- linux/kernel/fork.c.orig Fri Aug 16 00:19:00 2002
+++ linux/kernel/fork.c Fri Aug 16 00:19:16 2002
@@ -368,12 +368,12 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->user_vm_lock)
+ if (tsk->user_tid)
/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
- put_user(0UL, tsk->user_vm_lock);
+ put_user(0UL, tsk->user_tid);
}
static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
> personally i'd make it even more compact by merging the two clone flags
> as well, something along: CLONE_MANAGE_TID. I cannot see any reason for
> a thread library to use one of the bits only. This also reflects the
> fact that it's thread state that the kernel helps managing, both in the
> thread-create and in the thread-exit path. But this might be stretching
> things a bit?
the attached patch implements this. (ontop the previous patch.)
Ingo
--- linux/arch/i386/kernel/process.c.orig Fri Aug 16 01:01:27 2002
+++ linux/arch/i386/kernel/process.c Fri Aug 16 01:02:03 2002
@@ -591,7 +591,7 @@
/*
* The common fastpath:
*/
- if (!(clone_flags & (CLONE_SETTLS | CLONE_SETTID | CLONE_CLEARTID)))
+ if (!(clone_flags & (CLONE_SETTLS | CLONE_MANAGE_TID)))
return 0;
/*
* Set a new TLS for the child thread?
@@ -616,17 +616,13 @@
}
/*
- * Notify the child of the TID?
+ * Notify the child of the TID, and clear it on exit?
*/
- if (clone_flags & CLONE_SETTID)
+ if (clone_flags & CLONE_MANAGE_TID) {
if (put_user(p->pid, (pid_t *)childregs->edx))
return -EFAULT;
-
- /*
- * Does the userspace VM want the TID cleared on mm_release()?
- */
- if (clone_flags & CLONE_CLEARTID)
p->user_tid = (long *) childregs->edx;
+ }
return 0;
}
--- linux/include/linux/sched.h.orig Fri Aug 16 01:00:33 2002
+++ linux/include/linux/sched.h Fri Aug 16 01:01:19 2002
@@ -46,10 +46,8 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_RELEASE_VM 0x00800000 /* release the userspace VM */
+#define CLONE_MANAGE_TID 0x00100000 /* set and clear the userspace TID */
+#define CLONE_DETACHED 0x00200000 /* parent wants no child-exit signal */
#define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD)
On Fri, 16 Aug 2002, Ingo Molnar wrote:
>
> personally i'd make it even more compact by merging the two clone flags as
> well, something along: CLONE_MANAGE_TID. I cannot see any reason for a
> thread library to use one of the bits only.
A thread library - maybe not. But the SETTID thing makes sense even for a
fork() user to avoid the fork/SIGCHLD race condition. In contrast, a
CLRTID does _not_ make sense in that situation, so I actually think they
are two separate issues (and should thus be two separate bits).
Linus
On Tuesday 13 August 2002 02:17 pm, Linus Torvalds wrote:
> On Tue, 13 Aug 2002, Ingo Molnar wrote:
> > > That still doesn't make it any les crap: because any thread that exits
> > > without calling the "magic exit-flag interface" will then silently be
> > > lost, with no information left around anywhere.
> >
> > that should be a pretty rare occurance: with the upcoming signals patch
> > any segmentation fault zaps all threads and does a proper (and
> > deadlock-free) multithreaded coredump.
>
> That still doesn't change the fact that the interface is broken
> _by_design_.
>
> If the parent wants to get notified on child death, it should damn well
> get notified on child death. Not "in case the child exists politely".
>
> We don't depend on processes calling "exit()" to clean up all the stuff
> they left behind. The VM gets cleaned up even for bad processes.
>
What's been missing is the there is no option for synchronization as part of
the parent notification or exit processing.
What's needed, for TCore, is something like "when a process dies, signal
parents to do X and then wait on optional semaphore" before losing all the
process data to do_exit.
The big problem I've been had with TCore is that that there is no polite way
of synchronization for the dumping process with the parent and sibling
processes while they are getting a signal storm as they all go down.
I've come up with a few brutish ways to suspend these processes (with some
risk taken with semaphore locks from a maintenance point of view ;).
However; these approaches still sometimes don't get to run before some of the
siblings exit.
There is no good way, especially on SMP setups with a large multi-threaded
applications, to guarantee the signals don't get where they are going before
the core dump data is gathered.
--mgross
On Fri, 16 Aug 2002, Ingo Molnar wrote:
>
> On Thu, 15 Aug 2002, Linus Torvalds wrote:
>
> > A thread library - maybe not. But the SETTID thing makes sense even for
> > a fork() user to avoid the fork/SIGCHLD race condition. In contrast, a
> > CLRTID does _not_ make sense in that situation, so I actually think they
> > are two separate issues (and should thus be two separate bits).
>
> we could skip the 'clear' bit if this is the last release of the mm.
Ahhah, but you miss the point.
The fork()'ed child may clone on its own, and then exit. In which case we
sure as heck don't want the original child to modify the VM that it now
shares with a subthread.
This is just more on my spiel about how only the _parent_ really knows
what it is it wants to maintain, and the child really cannot make any
assumptions on its own.
Linus
On Thu, 15 Aug 2002, Linus Torvalds wrote:
> A thread library - maybe not. But the SETTID thing makes sense even for
> a fork() user to avoid the fork/SIGCHLD race condition. In contrast, a
> CLRTID does _not_ make sense in that situation, so I actually think they
> are two separate issues (and should thus be two separate bits).
okay. And it also makes sense for a newly forked task to know (and cache)
its own PID, without having to call getpid() again.
Ingo
On Thu, 15 Aug 2002, Linus Torvalds wrote:
> A thread library - maybe not. But the SETTID thing makes sense even for
> a fork() user to avoid the fork/SIGCHLD race condition. In contrast, a
> CLRTID does _not_ make sense in that situation, so I actually think they
> are two separate issues (and should thus be two separate bits).
we could skip the 'clear' bit if this is the last release of the mm.
Ingo
On Fri, 16 Aug 2002, Ingo Molnar wrote:
>
> okay. And it also makes sense for a newly forked task to know (and cache)
> its own PID, without having to call getpid() again.
Well, it won't. The pid write is _after_ we've done the copy_mm(), so the
child will never see it.
That looks like a potential mistake, though - it causes extra COW-faults
and it also means that this particular optimization (which I kind of like)
won't work.
However, if you want to fix it, you'd need to either move the
clone_thread() earlier, or you'd need to move the CLONE_SETTID logic up to
the generic layer (that latter path may make more sense, since if glibc
starts using this interface, you obviously need to do this in all
architectures anyway)
Linus
On Thu, 15 Aug 2002, Linus Torvalds wrote:
> > okay. And it also makes sense for a newly forked task to know (and cache)
> > its own PID, without having to call getpid() again.
>
> Well, it won't. The pid write is _after_ we've done the copy_mm(), so
> the child will never see it.
hmm.
> That looks like a potential mistake, though - it causes extra COW-faults
> and it also means that this particular optimization (which I kind of
> like) won't work.
>
> However, if you want to fix it, you'd need to either move the
> clone_thread() earlier, or you'd need to move the CLONE_SETTID logic up
> to the generic layer (that latter path may make more sense, since if
> glibc starts using this interface, you obviously need to do this in all
> architectures anyway)
CLONE_SETTID indeed makes more sense in the do_fork() proper, there's
absolutely nothing x86-ish about it.
Ingo
On Thu, 15 Aug 2002, Linus Torvalds wrote:
> > we could skip the 'clear' bit if this is the last release of the mm.
>
> Ahhah, but you miss the point.
>
> The fork()'ed child may clone on its own, and then exit. [...]
i was actually thinking about exactly this scenario when suggesting this.
The fork()ed child might as well end up being a 'thread' that exits and
thus needs to clear up after itself, right?
> [...] In which case we sure as heck don't want the original child to
> modify the VM that it now shares with a subthread.
in what way is clone() utilized? if it's via any threading library then
the fork()-ed process has its own thread state, which must be freed when
exiting. So it's something like:
thread X
fork() ===============> thread Y
clone() ===========> thread Z
so we at this point have the original thread X, a new thread Y that was
created via the fork(), and thread Z. Thread Y and Z share the same V. If
now thread Y exits:
exit()
then we'd sure expect for Z's sake to free Y's thread state, right?
Otherwise there would be a resource leak.
[ but it's getting late here and i might miss something :) ]
Ingo
On Fri, 16 Aug 2002, Ingo Molnar wrote:
> > [...] In which case we sure as heck don't want the original child to
> > modify the VM that it now shares with a subthread.
>
> in what way is clone() utilized? if it's via any threading library then
> the fork()-ed process has its own thread state, which must be freed when
> exiting.
See this:
process X
fork()
-------> Process Y
clone()
----> thread Z
exit()
THIS MUST NOT
WRITE TO MEMORY
IN Z!!
Notice how the exit() in Y will never be able to write into the address
space of X - it would only write into the address space of Z, and Z is not
expecting that at all!
Linus
On Thu, 15 Aug 2002, Linus Torvalds wrote:
> process X
>
> fork()
> -------> Process Y
> clone()
> ----> thread Z
>
> exit()
> THIS MUST NOT
> WRITE TO MEMORY
> IN Z!!
i guess i'm just being difficult, but process (thread) Y and thread Z
share the same VM, right? So it's a threaded application, and as such i'd
expect it to free its state when exiting. Ie. it must write to memory in
Z. Now since the ->user_tid address is in thread Y's thread control block
(or any similar thread state descriptor), i cannot see any problem why
zeroing this TID value would be incorrect.
> Notice how the exit() in Y will never be able to write into the address
> space of X - it would only write into the address space of Z, and Z is
> not expecting that at all!
i think i see where the misunderstanding comes from: thread Y does not
want to get into the address space of X - this is how the current
CLEAR_TID code works and is expected to work. Threads always free their
*own* thread state descriptor upon exit (eg. they set a flag in their own
thread descriptor), not some field in the parent's domain. So thread Y
does not ever want to write into X's address space - it wants to write
into the VM that it's part of currently - if a fork() created a new VM
then so be it, it's not attached to X in any way.
Ingo
On Fri, 16 Aug 2002, Ingo Molnar wrote:
> i think i see where the misunderstanding comes from: thread Y does not
> want to get into the address space of X - this is how the current
> CLEAR_TID code works and is expected to work. Threads always free their
> *own* thread state descriptor upon exit (eg. they set a flag in their
> own thread descriptor), not some field in the parent's domain. So thread
> Y does not ever want to write into X's address space - it wants to write
> into the VM that it's part of currently - if a fork() created a new VM
> then so be it, it's not attached to X in any way.
and this is the reason why i named the clone flag CLONE_RELEASE_VM - upon
exit a thread wants to 'release its reference to the VM' - and free all
state it still holds. Stack or whatever other state it has.
Ingo
On Fri, 16 Aug 2002, Ingo Molnar wrote:
>
> On Thu, 15 Aug 2002, Linus Torvalds wrote:
>
> > process X
> >
> > fork()
> > -------> Process Y
> > clone()
> > ----> thread Z
> >
> > exit()
> > THIS MUST NOT
> > WRITE TO MEMORY
> > IN Z!!
>
> i guess i'm just being difficult, but process (thread) Y and thread Z
> share the same VM, right? So it's a threaded application, and as such i'd
> expect it to free its state when exiting.
You're being dense.
The problem spot is the _fork_ from process X. Which gives a address in
process' _X_ virtual address space - used for SETTID.
See? Process _X_ is not threaded, and is not maintaining any thread data
structures.
Linus
On Thu, 15 Aug 2002, Linus Torvalds wrote:
>
> See? Process _X_ is not threaded, and is not maintaining any thread data
> structures.
.. Just to clarify: CLONE_SETTID is useful even outside of threading, so X
wants to use CLONE_SETTID simply because it keeps track of it's children
that way - but it is otherwise completely traditional, and depends on
SIGCHLD to tell it when the children have died.
The child (Y) it forks off, however, may use thread (Z) for some subtask.
Not pthreads, it might be just a clone-by-hand. So it may be doing an
exit while it's address space is still actively used by another thread -
but just because (X) wanted to use CLONE_SETTID to get the child
information on (Y) does _not_ mean that it's address space (and thus
that of Z) would somehow be updated at its exit.
Linus
Jamie Lokier wrote:
> It's probably most simple to use two consecutive words, for simplicity
> and to avoid needing an atomic store (which is slower on some
> architectures).
Ignore the above bit, I missed that you are already using two
independent addresses (in edx and edi), so atomicity is not an issue.
It might be cleaner to use a single address pointing to two consecutive
words anyway (for CLONE_SETTID and CLONE_VM_RELEASE) just to save a
register for future use - register pressure around clone() being a
precious thing, occasionally.
-- Jamie
Ingo Molnar wrote:
> > [...] And I would like this while having the benefit of CLONE_DETACHED -
> > because I want to use this for high performance threading but still be
> > robust - so waitpid() is out.
>
> like i said in the original email, the point of CLONE_DETACHED is to avoid
> the waitpid() overhead. I also said that exit notification is done via
> mutexes (futexes).
How? Scenario:
1. a thread calls a 3rd party library which was _not_ compiled with
threading in mind. (It shouldn't have to be).
2. 3rd party library sends itself a SIGABRT; perhaps an assertion
failed. (Variants: SIGFPE, library does execve(), etc.)
3. thread exits.... but the mutex was _not_ released
4. I _want_ to report the death to other thread, without having
to poll all my children in my main event loop.
This is a very legitimate and useful kind of thread death, and it's
perfectly safe too. (Not pthreads-conformant, but clone() is useful for
more than just pthreads).
As things are at the moment, you've arranged things so that I can't use
CLONE_DETACHED if I want to catch threads which die unexpectedly.
Why can't I have both at the same time? They are both good.
-- Jamie
On Thu, 15 Aug 2002, Linus Torvalds wrote:
> The child (Y) it forks off, however, may use thread (Z) for some
> subtask. Not pthreads, it might be just a clone-by-hand. So it may be
> doing an exit while it's address space is still actively used by another
> thread - but just because (X) wanted to use CLONE_SETTID to get the
> child information on (Y) does _not_ mean that it's address space (and
> thus that of Z) would somehow be updated at its exit.
yes, now i understand. When using raw clone() then indeed a CLEARTID set
up in X's context does not necesserily have any meaning in Y's (and Z's)
context.
Ingo
On Thu, 15 Aug 2002, Linus Torvalds wrote:
> The problem spot is the _fork_ from process X. Which gives a address in
> process' _X_ virtual address space - used for SETTID.
>
> See? Process _X_ is not threaded, and is not maintaining any thread data
> structures.
okay, this is the misunderstanding then. If it fork()s and then uses some
threading (which uses clone()) then in all cases i know about it must be
linked against some threading library. Otherwise Y couldnt do a clone()
call and expect threading to work. In theory Y could 'become' a
threading-capable process, but right now no threading library i'm aware of
allows this - lots of standard C calls must be threading-aware and
threading-safe. So right now 'threading' is a property that comes with the
process image at exec() time. But this must not be so from a conceptual
angle, so i agree with you.
(but the question is mostly academic anyway, because it makes perfect
sense to use a pure SETTID for a completely unthreaded application, to get
the fork() PID return value in the child's context as well.)
Ingo
On Fri, 16 Aug 2002, Jamie Lokier wrote:
> > like i said in the original email, the point of CLONE_DETACHED is to avoid
> > the waitpid() overhead. I also said that exit notification is done via
> > mutexes (futexes).
>
> How? Scenario:
>
> 1. a thread calls a 3rd party library which was _not_ compiled with
> threading in mind. (It shouldn't have to be).
>
> 2. 3rd party library sends itself a SIGABRT; perhaps an assertion
> failed. (Variants: SIGFPE, library does execve(), etc.)
>
> 3. thread exits.... but the mutex was _not_ released
this is a simple and well-defined thing: if the thread exits by calling
the exit() function then libc calls the exit_thread_group() syscall [part
of my POSIX signals patch] and zaps all threads. This is a very clearly
defined thing in POSIX.
SIGABRT/SIGFPE if uncaught cause a segmentation fault that zaps all
threads (the kernel does this zapping). [note that for this you'll have to
use the POSIX signals patch i mentioned earlier.]
> 4. I _want_ to report the death to other thread, without having
> to poll all my children in my main event loop.
POSIX says that in this case (when eg. non-threaded POSIX code calls
exit()) all threads must exit and a status code is sent to the parent of
the threaded application. And this is precisely how it works.
> This is a very legitimate and useful kind of thread death, and it's
> perfectly safe too. (Not pthreads-conformant, but clone() is useful for
> more than just pthreads).
as long as you are using libc there are certain POSIX rules that must be
followed. NGPT has to do the same.
non-POSIX programming methods like JVMs can still implement *any*
semantics - but your whole example is based on POSIX issues like exit() or
default signal handlers, not Java.
Ingo
Ingo Molnar wrote:
> non-POSIX programming methods like JVMs can still implement *any*
> semantics - but your whole example is based on POSIX issues like exit() or
> default signal handlers, not Java.
Sorry if I was unclear. I'm specifically talking about non-POSIX
threading methods (normal C code though, not complicated JVMs).
Most uses of clone() that I've seen are not using any threading library
at all: some code that neads a helper thread calls clone(), and the
child does its own self-contained system calls (to avoid errno pollution).
It's conceptually fine that individual threads can die. _Conceptually_,
clone-by-hand threads are very similar to processes, and I have seen
this used in practice a few times.
And it all works fine: just use SIGCHLD and waitpid().
Now you have written this wonderful resource optimisation, which removes
zombies: CLONE_DETACHED. Unfortunately, catching invidual thread death
relies on the thread "exiting politely", as they say. So I still have
to use SIGCHLD and waitpid(), or a pipe(), for non-POSIX-model threads
that want to robustly detect "impolite" thread death.
I think that's an unfair penalty on non-POSIX-model threads.
-- JAmie
On Fri, 16 Aug 2002, Jamie Lokier wrote:
> Sorry if I was unclear. I'm specifically talking about non-POSIX
> threading methods (normal C code though, not complicated JVMs).
if you define your own threading method then it's your responsibility to
make it work wrt. the exit() method for example. Just one example:
application code can technically call sys_exit() in a 'raw' form anytime:
asm volatile ("movl $1, %eax; int $0x80;")
and if it's a clone()ed thread then all thread-local resources are
lost/leaked. malloc() space, private file descriptors, held futexes.
there's nothing about CLONE_DETACHED or libpthreads that changes this
fundamental situation.
the only guarantee the kernel can make is to clean up nicely when a
resource-group is released. You can use CLONE_VM but not CLONE_FILES, and
then the kernel will clean up per-thread file descriptors on exit. It's an
expensive but nice property for certain uses. Furthermore it can help a
bit to give a signal towards user-space that a thread has 'unused' a given
TID. But it *cannot* possibly clean up after shared resources completely -
that's the point in sharing resources between threads - the kernel cannot
know which ones are private and which not. Eg. if you use CLONE_VM then
your threads can leak memory upon sys_exit(). If you use CLONE_FILES only
then you can leak file descriptors upon sys_exit().
> Most uses of clone() that I've seen are not using any threading library
> at all: [...]
this is still possible, of course.
> It's conceptually fine that individual threads can die.
> _Conceptually_, clone-by-hand threads are very similar to processes, and
> I have seen this used in practice a few times.
>
> And it all works fine: just use SIGCHLD and waitpid().
huh? Nothing cleans up leaked memory if a CLONE_VM thread sys_exits, or
leaked file descriptors if a CLONE_FILES thread exits.
only a tiny segment of resource cleanup can be 'solved' by SIGCHLD. How do
you clean up a held futex via SIGCHLD notification? How do you clean up a
malloc() via SIGCHLD, if sys_exit() is called by application code
directly?
'polite exit' when threads hold shared resources simply *does not exist*.
the truth is that this is not possible and not desirable either - in
threaded C code there must be some harmony between application code and
exit methods, and applications still have the 'cooperative' responsibility
to not leak resources.
> Now you have written this wonderful resource optimisation, which removes
> zombies: CLONE_DETACHED. Unfortunately, catching invidual thread death
> relies on the thread "exiting politely", as they say. [...]
again, calling sys_exit directly is not 'polite' in any way - you can lose
malloc() and futex (and other) state, you can leak basically any resource
that can be used by a thread, and you can corrupt the threading library's
internal variables as well (except the really simple uses which do no
resource allocation at all). So a thread has to be careful how to exit no
matter what - CLONE_DETACHED does not change this in any way.
but if you dont like CLONE_DETACHED and want to use SIGCHLD notification
then you can do it. Lots of POSIX-conform code relies on SIGCHLD
notification so it's not like we want to remove it anytime soon.
> [...] So I still have to use SIGCHLD and waitpid(), or a pipe(), for
> non-POSIX-model threads that want to robustly detect "impolite" thread
> death.
well, i think this is ineffecitve and doesnt buy you anything - but it's
clearly up to you.
> I think that's an unfair penalty on non-POSIX-model threads.
there's nothing, i repeat, *NOTHING* POSIX about CLONE_DETACHED. POSIX
threading has cleanup needs as well, which are handled by intercepting all
exit() activity and doing the cleanups and notifications that are
necessery. The only difference here is that notification is not done via a
signal, but via faster futexes.
Why is CLONE_DETACHED such a big problem for you, why do you want to force
a more expensive notification method? If you want to spin your own
threading library which has a completely new API (good luck at that) then
dont use pthread_create() but raw clone() and you wont get any detached
threads - end of story. You have complete control over what kind of
notification method you use.
Ingo
Ingo Molnar wrote:
> Why is CLONE_DETACHED such a big problem for you, why do you want to force
> a more expensive notification method?
Eh? I _like_ CLONE_DETACHED, and I want notification cheaper, not
more expensive.
You've said that pthread_exit() _always_ notifies a sibling thread using
a futex.
Well, can we please move the futex wakeup into the kernel? That is all
I ask. It will make pthread_exit() _faster_, and me happy because
all exits are notified.
-- Jamie
On Fri, 16 Aug 2002, Jamie Lokier wrote:
> You've said that pthread_exit() _always_ notifies a sibling thread using
> a futex.
yes.
> Well, can we please move the futex wakeup into the kernel? That is all
> I ask. It will make pthread_exit() _faster_, and me happy because all
> exits are notified.
(well, now you have reduced your point to the question of pure a
performance optimisation, dropping allegations of "pthreadizm" or
"inability to support different threading libraries because there's no
polite exit", thus so far we are in agreement.)
there are some practical problems with making the notification a futex,
not a simple flag. Eg. futexes right now do not force any lock-counter
format upon userspace. Futexes can be used as mutexes, conditional
variables, read-write locks, all of which have different atomic counter
formats and uses. By doing the TID-release notification via a futex the
actual format of the lock is forced, which is a cleanliness problem. Just
writing $0 to the TID pointer is a robust thing on the other hand.
Ingo
ps. (you have not replied to 90% of the email i wrote. Does this mean
agreement or disagreement?)
Ingo Molnar wrote:
> there are some practical problems with making the notification a futex,
> not a simple flag. Eg. futexes right now do not force any lock-counter
> format upon userspace. Futexes can be used as mutexes, conditional
> variables, read-write locks, all of which have different atomic counter
> formats and uses.
Agreed; futexes are lovely because they are so generic.
> By doing the TID-release notification via a futex the actual format of
> the lock is forced, which is a cleanliness problem. Just writing $0 to
> the TID pointer is a robust thing on the other hand.
Quite. There is no lock; the the futex is used in its purest form, as
a wait queue.
TID = thread is alive
zero = thread is gone
There's no "lock-counter format", because this isn't a lock -- it's a
wakeup. There no need for atomicity either, because the listener only
reads, it doesn't write. I'm not sure if a PROT_SEM word is required
for cache coherency, but if it is, your current implementation requires
one too.
Here's a synchronous thread_join-style waiter; it is architecture-neutral:
while (tid = *tid_address) != 0)
retval = sys_futex (tid_address, FUTEX_WAIT, tid, 0);
-- Jamie
On Fri, 16 Aug 2002, Ingo Molnar wrote:
>
> okay, this is the misunderstanding then. If it fork()s and then uses some
> threading (which uses clone()) then in all cases i know about it must be
> linked against some threading library. Otherwise Y couldnt do a clone()
> call and expect threading to work.
But of course it would. You can make your own async-io-like things by just
using clone() directly, that's what all the original clone() users were.
> So right now 'threading' is a property that comes with the
> process image at exec() time. But this must not be so from a conceptual
> angle, so i agree with you.
Even from a practical angle, it's not a "global" property. Sure, the code
that does the clone() itself must have come in through the execve() (or
through a loadable library later on), so in that sense you can think of it
as a global thing - since the code must obviously be in the address space
that the clone thing shares.
But a lot of the clone() decisions can be local, without anythign else
really knowing about the fact that something started up a thread. The most
trivial example is simply something like the appended, which just does a
asynchronous read (yeah yeah, stupid example, but it's basically a
threaded "cat").
Notice how none of this depends on any global state, so a library could do
the clone() without the caller even knowing that it does part of its work
in a local thread (it obviously wouldn't be doing anything this stupid,
but an async writer thread for sound output etc is not impossible to
imagine in a game library or something like that).
In fact, inside libraries there may well be reasons _not_ to use a global
threading model like pthreads, because the library might want to take
advantage of things like separate file descriptor address spaces etc that
clone() can give it.
Linus
---
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
#define UNFINISHED (-1000)
struct iodesc {
int fd, len, status;
void *buffer;
};
int io_fn(void *_desc)
{
struct iodesc *desc = _desc;
desc->status = read(desc->fd, desc->buffer, desc->len);
_exit(0);
}
int main()
{
char buffer[4096];
struct iodesc desc = { 0, sizeof(buffer), UNFINISHED, buffer };
clone(io_fn, malloc(4096)+4096, CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_DETACHED, &desc);
while (desc.status == UNFINISHED)
sched_yield();
write(1, buffer, desc.status);
}
On Fri, 16 Aug 2002, Jamie Lokier wrote:
> There's no "lock-counter format", because this isn't a lock -- it's a
> wakeup. There no need for atomicity either, because the listener only
> reads, it doesn't write.
well, technically a 'wait until value is 0' thing is of course a counter
format ...
> Here's a synchronous thread_join-style waiter; it is architecture-neutral:
>
> while (tid = *tid_address) != 0)
> retval = sys_futex (tid_address, FUTEX_WAIT, tid, 0);
yes, this would work. And the current method of setting the counter to 0
is arbitrary (and has a 'format') already, so there's no reason we couldnt
say that the TID cannot be used as a thread-join futex that is zeroed out
by CLONE_CLEARTID, and any (potential) waiters are woken up.
i'll try this and send a patch, it's a nice optimization.
Ingo
On Fri, 16 Aug 2002, Jamie Lokier wrote:
>
> Quite. There is no lock; the the futex is used in its purest form, as
> a wait queue.
>
> TID = thread is alive
> zero = thread is gone
I don't disagree with using a futex, and the interface can be exactly the
same (ie "synchronize with thread exit" really becomes "futex wait for tid
address" as you point out). I think it's a wonderful interface, and for
people who don't use futexes, they'd never even _notice_ that they could
do so.
However, the thing that makes me slightly nervous is that it would make
the exit case a bit slower. Right now we just do a conditional write zero
to a virtual address (ie two instructions or so):
if (tsk->user_tid)
put_user(0UL, tsk->user_tid);
but if we were to use a futex, the code would need to be something
slightly more complex and slower (I'm adding the "only do this if
somebody else has the VM" test to avoid it for some cases):
if (tsk->user_tid && atomic_read(&tsk->mm->users) > 1) {
if (!put_user(0, tsk->user_tid))
sys_futex(FUTEX_WAKE, tsk->user_tid);
}
where that sys_futex will do a page table walk etc.
So yes, it's a lot more generic (and very pleasant to use when we want
to), but it does imply a fair amount of overhead.
So I'd like to know what the likelyhood of it being used as a futex is. If
90% of all users would like a futex, that's _wonderful_. No question that
we should do it. But if it's dynamically fairly rare to have this
"synchronize with thread exit", then we may be losing more than we win.
Linus
On Fri, 16 Aug 2002, Ingo Molnar wrote:
>
> having looked at threading libraries i can tell you that any library
> writer who cares about performance would use a futex for exit
> notification.
Oh, good. If it turns out that even pthreads wants the futex, let's just
do it that way. Pls send in a patch once you have something tested ready,
ok?
Much cleaner.
Linus
On Fri, 16 Aug 2002, Linus Torvalds wrote:
> So I'd like to know what the likelyhood of it being used as a futex is.
> If 90% of all users would like a futex, that's _wonderful_. No question
> that we should do it. But if it's dynamically fairly rare to have this
> "synchronize with thread exit", then we may be losing more than we win.
having looked at threading libraries i can tell you that any library
writer who cares about performance would use a futex for exit
notification. We've got a performance test that compares
signal-notification and futex-notification:
earth3:~/libc> ./p3 -s 100000 -t 1 -r 0 -T --sync-join
Runtime: 1.208586749 seconds
earth3:~/libc> ./p3 -s 100000 -t 1 -r 0 -T --sync-signal
Runtime: 1.336826839 seconds
the sync-join variant uses a futex, the sync-signal variant uses a signal
to notify completion. Note that the test does not even use SIGCHLD because
that is slower, it uses SIGUSR1 [in the application code] and sigwait(),
to remove the signal handler overhead. The futex variant is still faster,
and with futexes being used for CLONE_CLEARTID it would be faster by about
~0.1 seconds.
(The kernel used is an uptodate 2.4-threading kernel with yesterday's APIs
backported, and glibc fitted to those new interfaces.)
sure, this solution is even less generic and thus a bit more dangerous of
being a libpthread-specific optimization [ ;) ], but i cannot think of any
more complex threading library - things like JVMs tend to have less
semantical needs. And if someone wants signal-based notification then
it's still possible.
Ingo
On Fri, 16 Aug 2002, Linus Torvalds wrote:
> > having looked at threading libraries i can tell you that any library
> > writer who cares about performance would use a futex for exit
> > notification.
>
> Oh, good. If it turns out that even pthreads wants the futex, [...]
yes, pthreads used a futex for this ever since. This is what i was arguing
for all along, and this is why pthreads does not want any SIGCHLD
internally, ever. This is why i sent the patch that ended up being
CLONE_DETACHED. No signal notification is needed, everything can be done
via futexes. And all the 'unsafe exit' arguments are bogus...
merging futex release into exit() removes one more extra syscall, and
removes the need for having a thread-state-usage spinlock (of sorts). So
from the pthreads point of view the interface couldnt be nicer.
> [...] let's just do it that way. Pls send in a patch once you have
> something tested ready, ok?
okay.
Ingo
the only (small) overhead is associated to truly detached
(PTHREAD_CREATE_DETACHED) threads, but the more important usage is when a
thread is created and joined, with some status code passed along. And even
in the PTHREAD_CREATE_DETACHED case, most usages just never terminate the
thread. So the high-frequency thread creation/destruction always happens
via some sort of joining method.
Ingo
On Fri, 16 Aug 2002, Linus Torvalds wrote:
> Oh, good. If it turns out that even pthreads wants the futex, let's just
> do it that way. Pls send in a patch once you have something tested
> ready, ok?
tested patch (against BK-curr-ish) attached. glibc/pthreads had been
updated to use the TID-futex, this removes an extra system-call and it
also simplifies the pthread_join() code. The pthreads testcode works just
fine with the new kernel and does not work with a kernel that does not do
the futex wakeup, so it's working fine.
Ingo
--- linux/include/linux/futex.h.orig Sat Aug 17 20:45:01 2002
+++ linux/include/linux/futex.h Sat Aug 17 20:45:07 2002
@@ -6,4 +6,6 @@
#define FUTEX_WAKE (1)
#define FUTEX_FD (2)
+extern asmlinkage int sys_futex(void *uaddr, int op, int val, struct timespec *utime);
+
#endif
--- linux/kernel/fork.c.orig Sat Aug 17 20:45:17 2002
+++ linux/kernel/fork.c Sat Aug 17 20:46:30 2002
@@ -26,6 +26,7 @@
#include <linux/mman.h>
#include <linux/fs.h>
#include <linux/security.h>
+#include <linux/futex.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -370,12 +371,14 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->user_tid)
+ if (tsk->user_tid) {
/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
put_user(0UL, tsk->user_tid);
+ sys_futex(tsk->user_tid, FUTEX_WAKE, 1, NULL);
+ }
}
static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
On Sat, Aug 17, 2002 at 08:48:57PM +0200, Ingo Molnar wrote:
>
> On Fri, 16 Aug 2002, Linus Torvalds wrote:
>
> > Oh, good. If it turns out that even pthreads wants the futex, let's just
> > do it that way. Pls send in a patch once you have something tested
> > ready, ok?
>
> tested patch (against BK-curr-ish) attached. glibc/pthreads had been
> updated to use the TID-futex, this removes an extra system-call and it
> also simplifies the pthread_join() code. The pthreads testcode works just
Btw, where can I take a look at that pthread-NG code?
Yay! [ ;) ]
Just a note: I recall that Linus suggested the optimisation of checking
mm->mm_count > 1, but Ingo is right to not include this -- it's possible
for the tid word to be stored in shared memory.
-- Jamie
Hi!
> one of the debugging tests triggered a false-positive BUG() when a
> detached thread was straced. Fix against BK-curr attached.
>
> Ingo
>
> --- kernel/signal.c.orig Thu Aug 15 17:12:02 2002
> +++ kernel/signal.c Thu Aug 15 17:12:34 2002
> @@ -774,7 +774,7 @@
> int why, status;
>
> /* is the thread detached? */
> - if (sig == -1 || tsk->exit_signal == -1)
> + if (sig == -1)
> BUG();
>
> info.si_signo = sig;
Why not BUG_ON()?
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]
On Mon, 2002-08-26 at 08:35, Pavel Machek wrote:
> > --- kernel/signal.c.orig Thu Aug 15 17:12:02 2002
> > +++ kernel/signal.c Thu Aug 15 17:12:34 2002
> > @@ -774,7 +774,7 @@
> > int why, status;
> >
> > /* is the thread detached? */
> > - if (sig == -1 || tsk->exit_signal == -1)
> > + if (sig == -1)
> > BUG();
> >
> > info.si_signo = sig;
>
> Why not BUG_ON()?
Ingo has said previously that he does not like the syntax of BUG_ON.
I disagree, but it is his code - what we _should_ do here is mark the
branch unlikey() if nothing else.
Robert Love
> Ingo has said previously that he does not like the syntax of BUG_ON.
well, meanwhile BUG_ON() is being used quite extensively, so i'd rather
have global consistency than different coding styles. I'll fix these
places in an upcoming patch.
Ingo