2002-11-17 11:17:17

by Ingo Molnar

[permalink] [raw]
Subject: [patch] threading fix, tid-2.5.47-A3


the attached patch (against BK-curr) implements another threading related
detail, it changes the way TID setting/clearing works. These changes fix a
weakness of NPTL's handling of the "initial thread", noticed by Luca
Barbieri.

the problem is the following: the 'initial thread', ie. the 'process',
does not have any ->user_tid value set. But still it's a generic thread
that can be pthread_join()-ed upon. (but pthread_join() does not work, the
kernel does not do the futex wakeup because ->user_tid is NULL.)

the solution is to add a new syscall that sets the current->user_tid
address. This new syscall is used by glibc's exec() implementation.
Another change is to make CLONE_SETTID work even if CLONE_VM is not used.
This means that the TID must be set in the child's address space, not in
the parent's address space. I've also merged SETTID and CLEARTID, the two
should always be used together by any new-style threading abstraction.

the sys_set_tid_address() syscall returns the current TID, which is used
by glibc to set the TID address in the parent's context. (this is cheaper
than to do a put_user() in kernel-space.)

to implement the above semantics i've used the schedule_tail() callback to
do the TID setting in the child's context - doing it a'la
ptrace_writedata() / access_process_vm() would be way too expensive. It
looks a bit ugly to do the TID setting both in schedule_tail() and
do_fork(), and to do the CLONE_VM check, but the correct (and generic)
solution

In future glibc versions every process and thread will have a nonzero
user_tid address, so the callback is necessary.

Ulrich Drepper has changed glibc/NPTL to use these new semantics and the
initial thread now works fine. Also, i've compressed the CLONE flags to
remove the CLONE_CLEARTID hole, since NPTL is the only one using them
currently.

(the patch has no effect on old-style threading libraries.)

Ingo


2002-11-17 11:51:44

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> the attached patch
-> The patch that was meant to be attached :)

> the solution is to add a new syscall that sets the current->user_tid
> address. This new syscall is used by glibc's exec() implementation.
I don't understand this: why would glibc use it in exec()?

> Another change is to make CLONE_SETTID work even if CLONE_VM is not used.
> This means that the TID must be set in the child's address space, not in
> the parent's address space. I've also merged SETTID and CLEARTID, the two
> should always be used together by any new-style threading abstraction.
But this prevents using SETTID to get the tid in a
signal-handler-accessible place before a SIGCHLD can arrive, without
having to use sigprocmask.

How about renaming CLONE_SETTID to CLONE_SETTID_PARENT, leaving the
existing semantics alone, and adding a CLONE_SETTID (with a new value)
that sets the tid in the fork child?

This would require two separate tid pointers so that glibc could
implement a fork_get_pid(int* pid) setting pid in the parent vm and the
tid in struct pthread in the child.

Alternatively, if the fork child calls sys_set_tid_address on its own
right after creation, no modifications to clone are required (this is
what my sys_cleartid patch did).

BTW, user_tid needs to be cleared on exec, and I'm not sure if we are
doing this.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-17 12:13:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On 17 Nov 2002, Luca Barbieri wrote:

> > the attached patch
> -> The patch that was meant to be attached :)

yeah ...

--- linux/arch/i386/kernel/entry.S.orig 2002-11-17 11:22:18.000000000 +0100
+++ linux/arch/i386/kernel/entry.S 2002-11-17 11:31:32.000000000 +0100
@@ -193,10 +193,8 @@


ENTRY(ret_from_fork)
-#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
-#endif
GET_THREAD_INFO(%ebx)
jmp syscall_exit

@@ -767,6 +765,7 @@
.long sys_epoll_ctl /* 255 */
.long sys_epoll_wait
.long sys_remap_file_pages
+ .long sys_set_tid_address


.rept NR_syscalls-(.-sys_call_table)/4
--- linux/include/linux/sched.h.orig 2002-11-17 11:26:53.000000000 +0100
+++ linux/include/linux/sched.h 2002-11-17 12:58:20.000000000 +0100
@@ -46,10 +46,9 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_SETTID 0x00100000 /* set/clear the TID */
+#define CLONE_DETACHED 0x00200000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x00400000 /* set if the tracing process can't force CLONE_PTRACE on this clone */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +331,7 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *user_tid; /* for CLONE_SETTID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
--- linux/kernel/sched.c.orig 2002-11-17 11:22:48.000000000 +0100
+++ linux/kernel/sched.c 2002-11-17 11:30:58.000000000 +0100
@@ -503,12 +503,16 @@
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
-#if CONFIG_SMP || CONFIG_PREEMPT
+asmlinkage void FASTCALL(schedule_tail(task_t *prev));
asmlinkage void schedule_tail(task_t *prev)
{
finish_arch_switch(this_rq(), prev);
+ /*
+ * Does the child thread/process want to be notified of the TID/PID?
+ */
+ if (current->user_tid)
+ put_user(current->pid, current->user_tid);
}
-#endif

/*
* context_switch - switch to the new MM and the new
--- linux/kernel/fork.c.orig 2002-11-17 11:25:35.000000000 +0100
+++ linux/kernel/fork.c 2002-11-17 12:40:44.000000000 +0100
@@ -676,6 +676,13 @@
p->flags = new_flags;
}

+asmlinkage int sys_set_tid_address(int *user_tid)
+{
+ current->user_tid = user_tid;
+
+ return current->pid;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -813,18 +820,14 @@
if (retval)
goto bad_fork_cleanup_namespace;
/*
- * Notify the child of the TID?
+ * Does the userspace VM want the TID set in the child's
+ * address space and it cleared on mm_release()?
*/
- retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
- goto bad_fork_cleanup_namespace;
-
- /*
- * Does the userspace VM want the TID cleared on mm_release()?
- */
- if (clone_flags & CLONE_CLEARTID)
+ if (clone_flags & CLONE_SETTID) {
p->user_tid = user_tid;
+ if (clone_flags & CLONE_VM)
+ put_user(p->pid, user_tid);
+ }

/*
* Syscall tracing should be turned off in the child regardless

2002-11-17 12:25:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On 17 Nov 2002, Luca Barbieri wrote:

> I don't understand this: why would glibc use it in exec()?

i suspect the idea would be to always make every process a proper pthread
object as well. (but Ulrich will correct me if this is not the case.) This
means that across fork() we can set up the TID pointer via CLONE_SETTID,
and after exec() we need the new set_tid_address() syscall to initialize
it.

> > Another change is to make CLONE_SETTID work even if CLONE_VM is not used.
> > This means that the TID must be set in the child's address space, not in
> > the parent's address space. I've also merged SETTID and CLEARTID, the two
> > should always be used together by any new-style threading abstraction.
> But this prevents using SETTID to get the tid in a
> signal-handler-accessible place before a SIGCHLD can arrive, without
> having to use sigprocmask.

if CLONE_VM is set then the TID is set immediately, before sys_clone()
returns. Or are you worried about the fork() case?

> How about renaming CLONE_SETTID to CLONE_SETTID_PARENT, leaving the
> existing semantics alone, and adding a CLONE_SETTID (with a new value)
> that sets the tid in the fork child?

this would be fine to me, but i wanted to get away with a single pointer.

> Alternatively, if the fork child calls sys_set_tid_address on its own
> right after creation, no modifications to clone are required (this is
> what my sys_cleartid patch did).

we do not want to do yet another syscall. Also, this makes the TID value
nonatomic - debugging code would have to know whether the child has
already executed the syscall.

Ingo

2002-11-17 13:22:50

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> On 17 Nov 2002, Luca Barbieri wrote:
>
> > I don't understand this: why would glibc use it in exec()?
>
> i suspect the idea would be to always make every process a proper pthread
> object as well. (but Ulrich will correct me if this is not the case.) This
> means that across fork() we can set up the TID pointer via CLONE_SETTID,
> and after exec() we need the new set_tid_address() syscall to initialize
> it.
"after exec()" == "in the initialization code for the exec'ed program"?

> if CLONE_VM is set then the TID is set immediately, before sys_clone()
> returns. Or are you worried about the fork() case?
Yes.

> this would be fine to me, but i wanted to get away with a single pointer.
Using two pointers would allow to provide all the functionality
mentioned in the discussion on your first clone/tid patch
<http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.1/1409.html>.

Calling sys_set_tid_address after fork is equivalent (but non-atomic)
but requires an additional system call.

> Also, this makes the TID value
> nonatomic - debugging code would have to know whether the child has
> already executed the syscall.
Debugging tools already need to be aware of this for process
initialization, so this shouldn't be a serious problem.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-17 17:21:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Ingo Molnar wrote:
>
> the problem is the following: the 'initial thread', ie. the 'process',
> does not have any ->user_tid value set. But still it's a generic thread
> that can be pthread_join()-ed upon. (but pthread_join() does not work, the
> kernel does not do the futex wakeup because ->user_tid is NULL.)

There is no way in _hell_ that the correct way to handle this is by doing
magic things at execve() time. Stop that NOW!

First off, a program had better be correctly startable even if the
process that does the execve() is _not_ using the new glibc. If I have an
old "bash" binary, and that means that I cannot start up new binaries
correctly, that is BROKEN. It's so incredibly broken that it's scary.

Why not do it the _sane_ way, with a system call in crt0.S instead to set
up the user_tid if you want it?

This patch is crap, or at least needs a ton more explanation about why you
would do something that looks quite this horrible and stupid.

Linus

2002-11-17 17:55:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Linus Torvalds wrote:

> Why not do it the _sane_ way, with a system call in crt0.S instead to
> set up the user_tid if you want it?

the patch adds a syscall, which will indeed be used in the exec() case.
The patch does not add any magic execve() thing. Plus the patch changes
the TID interfaces of sys_clone() to work not only for pthread_create()
but also in the case of fork().

Ingo


2002-11-17 18:07:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Linus Torvalds wrote:

> First off, a program had better be correctly startable even if the
> process that does the execve() is _not_ using the new glibc. [...]

it most definitely is. Binary compatibility is taken very seriously.

the SETTID change only affects the fork() case. Ie. there are 4 major ways
a context can be started:

execve(): here we build a process image from scratch. NPTL changes
nothing here, except that if a new NPTL binary is started up,
it will call sys_set_tid_address() to get the 'initial thread'
set up properly. Old binaries continue to work.

fork(): here we build a new process image by copying the parent image.
NPTL applications are using sys_clone(CLONE_SETTID) internally,
to set up the initial thread of the new process image. [the
fork() code in glibc also does other cleanup work to get a true
initial thread going, even if a threaded application forks, but
this is nothing the kernel should worry about.]

pthread_create(): here we create a new thread that shares all sharable
state with the parent thread. LinuxThreads (old glibc)
does whatever it always did, NPTL uses all the new
flags.

raw clone(): share a subset of the parent thread's resources.

there is no change anywhere due to NPTL. You can start and old glibc
application from within a new glibc application and vice versa.

Ingo

2002-11-17 18:20:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Ingo Molnar wrote:
>
> there is no change anywhere due to NPTL. You can start and old glibc
> application from within a new glibc application and vice versa.

So then your explanation was extremely confused. Can you explain _where_
you're doing the ne wsystem call? If it is at execve() time in the parent,
there is no way in hell I will accept it.

In fact, I have committed the following to my tree to make sure that there
is no user_tid pointer left over when we release a memory space, the old
code was buggy and left the user_tid alone over an execve() which meant
that a subsequent clone(CLONE_VM) and exit of the old process would have
corrupted memory.

Feel free to re-send the "set_user_tid()" patch assuming it is done from
crt0.S, and not from glibc execve() like your original explanation
claimed. Also, please send the set_user_tid() thing _separate_ from
whatever else you did, since it has nothing to do with your other changes.

Linus

----
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.863 -> 1.864
# kernel/fork.c 1.84 -> 1.85
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/11/17 [email protected] 1.864
# Make sure we clean user_tid when we've released the
# memory space it was associated with.
# --------------------------------------------
#
diff -Nru a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c Sun Nov 17 10:21:59 2002
+++ b/kernel/fork.c Sun Nov 17 10:21:59 2002
@@ -408,12 +408,15 @@
complete(vfork_done);
}
if (tsk->user_tid) {
+ int * user_tid = tsk->user_tid;
+ tsk->user_tid = NULL;
+
/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
- put_user(0, tsk->user_tid);
- sys_futex((unsigned long)tsk->user_tid, FUTEX_WAKE, 1, NULL);
+ put_user(0, user_tid);
+ sys_futex((unsigned long)user_tid, FUTEX_WAKE, 1, NULL);
}
}


2002-11-17 18:49:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


this is the first patch, the introduction of the sys_set_thread_address()
syscall. It returns the PID so that the newly initialized 'initial thread'
does not have to do an additional sys_gettid() call.

Ingo

--- linux/arch/i386/kernel/entry.S.orig 2002-11-17 20:53:52.000000000 +0100
+++ linux/arch/i386/kernel/entry.S 2002-11-17 20:54:55.000000000 +0100
@@ -767,6 +767,7 @@
.long sys_epoll_ctl /* 255 */
.long sys_epoll_wait
.long sys_remap_file_pages
+ .long sys_set_tid_address


.rept NR_syscalls-(.-sys_call_table)/4
--- linux/kernel/fork.c.orig 2002-11-17 20:53:52.000000000 +0100
+++ linux/kernel/fork.c 2002-11-17 20:54:55.000000000 +0100
@@ -676,6 +676,13 @@
p->flags = new_flags;
}

+asmlinkage int sys_set_tid_address(int *user_tid)
+{
+ current->user_tid = user_tid;
+
+ return current->pid;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.

2002-11-17 18:47:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Ulrich Drepper wrote:
>
> It doesn't do this. Ingo's description simply wasn't right.
>
> The syscall is used in one place and this is when the thread library
> gets initialized. I never gets used unconditionally and in situations
> where the process is not prepared.

Ok, good. That means that "sys_set_userpid()" is fine with me.

That still leaves the other part of the patch. I do not think that SETTID
and CLEARTID should be mixed together. There are perfectly valid reasons
why a parent wants SETTID even when it _doesn't_ want CLEARTID.

In fact, SETTID is clearly useful even without threads, and exactly for
the case that Ingo apparently broke with his patch: the parent wants to
atomically save the TID of the child in the _parents_ address space (so
that a immediate SIGCHLD won't be racy with saving off the pid by the
parent).

So Ingo, please send me just the sys_set_userpid() parts, and revert your
broken code that made SETTID do bad things and only work for threads.

There's no reason to make SETTID/CLEARTID be one flag, since they are
clearly different things, and NPTL can just always set both bits if that
is the behaviour glibc wants (and I agree with that behaviour, of course.
I just disagree with not allowing others to do different things).

Linus


2002-11-17 18:54:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Linus Torvalds wrote:

> In fact, SETTID is clearly useful even without threads, and exactly for
> the case that Ingo apparently broke with his patch: the parent wants to
> atomically save the TID of the child in the _parents_ address space (so
> that a immediate SIGCHLD won't be racy with saving off the pid by the
> parent).

while it makes sense, the problem in this case is that the 'TID address'
is the parent's TID address. (ie. might be futex-waited upon by some other
context, for an exit() event.)

i think what makes most sense is what Luca suggested, to split up the
things and use two different TID address: one for race-less setting of the
TID in the parent's address space, and another for the race-less
initialization of the TID value in the child's context.

> There's no reason to make SETTID/CLEARTID be one flag, since they are
> clearly different things, and NPTL can just always set both bits if that
> is the behaviour glibc wants (and I agree with that behaviour, of
> course. I just disagree with not allowing others to do different
> things).

ok, agreed. Other libraries might choose to still do SIGCHLD based exit()
notification - exit notification and initial-TID setting are separate
things.

(i'll send a new patch in a few minutes so that we can see the full impact
on things.)

Ingo

2002-11-17 18:57:01

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:

> This patch is crap, or at least needs a ton more explanation about why you
> would do something that looks quite this horrible and stupid.

Let me explani the use. Sorry for all the confusion. I left Ingo last
night in a state where he didn't have all the info.

The facts:

- - the settid syscall isn't used and is not needed anywhere in libc.so

- - it is only used in the init code of the new libpthread

I.e., all applications but those which are using the new thread code are
unaffected. No special exec handling. Set time, when the setid call is
executed, is early enough: it is the very first thing a multi-threaded
application does. It is the first constructor which is run. Always and
guaranteed.

The *alternatives* to this is a special whacky way to tell exec to do
this. I've no idea how this could be implemented at user level since
the registered address is at least in my code not known at link time.
So this is one of these rare situations where Linus and I actually agree
on something. The settid syscall is not used for exec but instead to
prevent modifying exec.


On to the second part: the clone() change is necessary for situations
where the CLONE_VM bit isn't set (to implement fork()). Using the
CLONE_SETTID flag nevertheless would mean a data structure in the parent
process would be affected. This isn't correct. The data structure
which has to be modified is still in use in the parent. That's the
whole point of fork(), the child inherits the same environment and might
on its own create more threads. I.e., the internals of the thread
library must be intact.

With the proposed change the TID is only written to the child's VM.
This is the right semantics since the child is the one which gets
notified in the end and which might get a signal delivered before it had
the chance to set the TID on its own. This one change saved us from the
mess this could potentially create plus it saves a system call.

And please note: how many normal application (not MT) are calling
getpid() first thing after fork()? With this patch it can be avoided.
Not a big win but it rectifies an IMO not too well designed since
asymmetric interface.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE91+gb2ijCOnn/RHQRAluVAJ9nZB5/w3MImWbVKXS+nw05W+pfQACgzg/d
BsbD+2ff+BaRiDkbHhu1/9w=
=hnJQ
-----END PGP SIGNATURE-----

2002-11-17 19:14:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


here are the my current TID-setting changes. It's now 3 clone flags:

- CLONE_PARENT_SETTID

race-free setting of the new TID/PID in the parent's address space.
This is done in do_fork(), before sys_fork() returns.

- CLONE_CHILD_SETTID

race-free setting of the new TID/PID in the child's address space.
this uses the ->user_tid field to delay setting of the TID when the
child context first runs. It's guaranteed to be set before the child's
first userspace instruction is executed, but has no specific ordering
with sys_clone() itself.

- CLONE_CHILD_CLEARTID

exit() time notification, clearing of the TID and futex wakeup. It's
guaranteed to be done after the last instruction of a context is
executed.

i've extended sys_clone() with a new argument (in %esi), which is the
child_tidaddr pointer. %edx is now the parent_tidaddr.

Ingo

--- linux/arch/i386/kernel/entry.S.orig 2002-11-17 20:54:55.000000000 +0100
+++ linux/arch/i386/kernel/entry.S 2002-11-17 20:57:44.000000000 +0100
@@ -193,10 +193,8 @@


ENTRY(ret_from_fork)
-#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
-#endif
GET_THREAD_INFO(%ebx)
jmp syscall_exit

--- linux/arch/i386/kernel/smpboot.c.orig 2002-11-17 21:12:49.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2002-11-17 21:12:52.000000000 +0100
@@ -498,7 +498,7 @@
* don't care about the eip and regs settings since
* we'll never reschedule the forked task.
*/
- return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL);
+ return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL, NULL);
}

/* which physical APIC ID maps to which logical CPU number */
--- linux/arch/i386/kernel/process.c.orig 2002-11-17 21:03:01.000000000 +0100
+++ linux/arch/i386/kernel/process.c 2002-11-17 21:11:52.000000000 +0100
@@ -225,7 +225,7 @@
regs.eflags = 0x286;

/* Ok, create the new process.. */
- p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL);
+ p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -502,7 +502,7 @@
{
struct task_struct *p;

- p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -511,14 +511,15 @@
struct task_struct *p;
unsigned long clone_flags;
unsigned long newsp;
- int *user_tid;
+ int *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;
newsp = regs.ecx;
- user_tid = (int *)regs.edx;
+ parent_tidptr = (int *)regs.edx;
+ child_tidptr = (int *)regs.esi;
if (!newsp)
newsp = regs.esp;
- p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, user_tid);
+ p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, parent_tidptr, child_tidptr);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -536,7 +537,7 @@
{
struct task_struct *p;

- p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

--- linux/include/linux/sched.h.orig 2002-11-17 20:53:52.000000000 +0100
+++ linux/include/linux/sched.h 2002-11-17 21:15:27.000000000 +0100
@@ -46,10 +46,11 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
+#define CLONE_CHILD_SETTID 0x00200000 /* set the TID in the child */
+#define CLONE_CHILD_CLEARTID 0x00400000 /* clear the TID in the child */
+#define CLONE_DETACHED 0x00800000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x01000000 /* set if the tracing process can't force CLONE_PTRACE on this clone */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,7 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *user_tid; /* CLONE_CHILD_[SET|CLEAR]TID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -615,7 +616,7 @@
extern task_t *child_reaper;

extern int do_execve(char *, char **, char **, struct pt_regs *);
-extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *);
+extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *, int *);

#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
--- linux/kernel/sched.c.orig 2002-11-17 20:52:30.000000000 +0100
+++ linux/kernel/sched.c 2002-11-17 21:12:03.000000000 +0100
@@ -503,12 +503,16 @@
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
-#if CONFIG_SMP || CONFIG_PREEMPT
+asmlinkage void FASTCALL(schedule_tail(task_t *prev));
asmlinkage void schedule_tail(task_t *prev)
{
finish_arch_switch(this_rq(), prev);
+ /*
+ * Does the child thread/process want to be notified of the TID/PID?
+ */
+ if (current->user_tid)
+ put_user(current->pid, current->user_tid);
}
-#endif

/*
* context_switch - switch to the new MM and the new
--- linux/kernel/fork.c.orig 2002-11-17 20:54:55.000000000 +0100
+++ linux/kernel/fork.c 2002-11-17 21:09:49.000000000 +0100
@@ -695,7 +695,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
int retval;
struct task_struct *p = NULL;
@@ -819,19 +820,14 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
+ if (clone_flags & CLONE_PARENT_SETTID)
+ put_user(p->pid, parent_tidptr);
/*
- * Notify the child of the TID?
+ * Does the userspace VM want the TID set in the child's
+ * address space and it cleared on mm_release()?
*/
- retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
- goto bad_fork_cleanup_namespace;
-
- /*
- * Does the userspace VM want the TID cleared on mm_release()?
- */
- if (clone_flags & CLONE_CLEARTID)
- p->user_tid = user_tid;
+ if (clone_flags & CLONE_CHILD_SETTID)
+ p->user_tid = child_tidptr;

/*
* Syscall tracing should be turned off in the child regardless
@@ -1000,7 +996,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
struct task_struct *p;
int trace = 0;
@@ -1011,7 +1008,7 @@
clone_flags |= CLONE_PTRACE;
}

- p = copy_process(clone_flags, stack_start, regs, stack_size, user_tid);
+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
if (!IS_ERR(p)) {
struct completion vfork;


2002-11-17 19:48:15

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

The new definition of the clone flags is binary incompatible with older
2.5 kernels.
How about this instead:
#define CLONE_PARENT_SETTID 0x00100000
#define CLONE_CHILD_CLEARTID 0x00200000
#define CLONE_DETACHED 0x00400000
#define CLONE_UNTRACED 0x00800000
#define CLONE_CHILD_SETTID 0x01000000

> -#if CONFIG_SMP || CONFIG_PREEMPT
> +asmlinkage void FASTCALL(schedule_tail(task_t *prev));
> asmlinkage void schedule_tail(task_t *prev)
> {
> finish_arch_switch(this_rq(), prev);
Maybe finish_arch_switch should only be called if CONFIG_SMP ||
CONFIG_PREEMPT, like what happened without this patch?

> + if (clone_flags & CLONE_PARENT_SETTID)
> + put_user(p->pid, parent_tidptr);
How about failing if put_user fails?


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-17 19:54:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On 17 Nov 2002, Luca Barbieri wrote:

> The new definition of the clone flags is binary incompatible with older
> 2.5 kernels.

we broke binary compatibility several times, for the benefit of having a
cleaner interface. Ulrich has no problems with this approach and NPTL is
the only user of these interfaces currently. But i think you are one of
the few peoples who are running an NPTL system (ie. with the new
NPTL-glibc actually installed as the default system glibc) - is binary
compatibility important to you for this specific case?

> > -#if CONFIG_SMP || CONFIG_PREEMPT
> > +asmlinkage void FASTCALL(schedule_tail(task_t *prev));
> > asmlinkage void schedule_tail(task_t *prev)
> > {
> > finish_arch_switch(this_rq(), prev);
> Maybe finish_arch_switch should only be called if CONFIG_SMP ||
> CONFIG_PREEMPT, like what happened without this patch?

finish_arch_switch() is a no-op on UP. (well, almost, i'll fix that.)

> > + if (clone_flags & CLONE_PARENT_SETTID)
> > + put_user(p->pid, parent_tidptr);
> How about failing if put_user fails?

we could do that - although we cannot fail the CHILD_SETTID variant, and i
wanted to keep it symmetric.

Ingo

2002-11-17 19:55:24

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

Linus Torvalds wrote:
> There is no way in _hell_ that the correct way to handle this is by doing
> magic things at execve() time. Stop that NOW!
>
> First off, a program had better be correctly startable even if the
> process that does the execve() is _not_ using the new glibc. If I have an
> old "bash" binary, and that means that I cannot start up new binaries
> correctly, that is BROKEN. It's so incredibly broken that it's scary.
>
> Why not do it the _sane_ way, with a system call in crt0.S instead to set
> up the user_tid if you want it?

I agree with Linus: a set_tid_address() system call should be enough,
and it's clear and simple.

Not just because of old binaries. I want to be able to use the new
threading features _without_ using glibc's pthreads, but using the
non-threading portions of glibc if that is still possible.[*] - and
still be able to fork/exec standard glibc-using programs properly.

Btw, when CLONE_VM is clear, SETTID is useful without the CLEARTID
functionality, for the usual reason of preventing races with signal
handlers which need to know the pid. (It's also useful to use both
flags, of course). It might be better to keep the two flags.

Btw2, instead of a new system call, you could create a new clone flag:
CLONE_SELF. This mean "create all the clonable resources, but install
them in _this_ process instead of creating another process".

For example, clone(CLONE_SELF | CLONE_VM, ...) would cause the current
thread to get a new copy of the file descriptor table, new signal
handler table etc., but it would install those in the current process,
not a new one.

This makes it possible to converted a clone()'d thread into a
fork()'d process, for example.

When you have that, set_tid_address() is simply a combination of
flags: CLONE_SETTID | CLONE_VM | CLONE_FS (etc.). Admittedly, the
list of flags is rather whacky and doesn't work when new sharing flags
are added - and I'm not sure if it's actually useful. But it's a
curious idea, what do you think?

-- Jamie


[*] Perhaps it isn't any more. I recently wanted to call Glibc's
shm_open(), but for no apparent reason that forced pthreads to be
linked in which conflicted with the program's own clone() based
threading. So I ended up using a file in /tmp instead of proper
shared memory.

2002-11-17 20:10:33

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> we broke binary compatibility several times, for the benefit of having a
> cleaner interface. Ulrich has no problems with this approach and NPTL is
> the only user of these interfaces currently. But i think you are one of
> the few peoples who are running an NPTL system (ie. with the new
> NPTL-glibc actually installed as the default system glibc) - is binary
> compatibility important to you for this specific case?
No, but since I don't see any advantage in breaking it (other than a
more aesthetically pleasing header), why break it?
The problem is just the numbering of the flags, not the new
functionality.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-17 20:22:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On 17 Nov 2002, Luca Barbieri wrote:

> No, but since I don't see any advantage in breaking it (other than a
> more aesthetically pleasing header), why break it? The problem is just
> the numbering of the flags, not the new functionality.

tiny bits of aesthetics, like clean ordering of the flag values, is what
makes for a clean and easy to understand source tree. Lets do it while we
can do it.

Ingo

2002-11-17 20:37:59

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> Instead let the user initialize the memory location the clone call is
> supposed to write in with zero. if the value didn't change the
> user-level code can detect the error and handle appropriately.

Does that work? Zero is also read if the child was created, did
something, and exited with CLEARTID.

You can initialise it with -1, though.

-- Jamie

2002-11-17 20:28:16

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ingo Molnar wrote:
> But i think you are one of
> the few peoples who are running an NPTL system (ie. with the new
> NPTL-glibc actually installed as the default system glibc) - is binary
> compatibility important to you for this specific case?

It's all encapsulated in the libpthread which is used. No apps need to
be recompiled so it is OK to make this incompatible change.


> we could do that - although we cannot fail the CHILD_SETTID variant, and i
> wanted to keep it symmetric.

I cannot see any reasonable way out if any of the put_user calls fail?
Do you want the clone() call to fail if the parent's receiving address
is wrong? You'd have to go and kill the child again since it's already
created.

Instead let the user initialize the memory location the clone call is
supposed to write in with zero. if the value didn't change the
user-level code can detect the error and handle appropriately. So,
ignore the put_user errors. Maybe say so explicitly and add (void) in
front.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE91/2G2ijCOnn/RHQRAt34AKCrzkjdPfQ3D1VvEXPW5fwZxmCvWgCgmY0A
IYWZflwRcxusjo4fMPOx6jk=
=xjs0
-----END PGP SIGNATURE-----

2002-11-17 20:45:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


ok, the attached patch also does the return-check of the parent-tidptr
put_user() call:

--- linux/arch/i386/kernel/entry.S.orig 2002-11-17 20:54:55.000000000 +0100
+++ linux/arch/i386/kernel/entry.S 2002-11-17 20:57:44.000000000 +0100
@@ -193,10 +193,8 @@


ENTRY(ret_from_fork)
-#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
-#endif
GET_THREAD_INFO(%ebx)
jmp syscall_exit

--- linux/arch/i386/kernel/smpboot.c.orig 2002-11-17 21:12:49.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2002-11-17 21:12:52.000000000 +0100
@@ -498,7 +498,7 @@
* don't care about the eip and regs settings since
* we'll never reschedule the forked task.
*/
- return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL);
+ return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL, NULL);
}

/* which physical APIC ID maps to which logical CPU number */
--- linux/arch/i386/kernel/process.c.orig 2002-11-17 21:03:01.000000000 +0100
+++ linux/arch/i386/kernel/process.c 2002-11-17 21:11:52.000000000 +0100
@@ -225,7 +225,7 @@
regs.eflags = 0x286;

/* Ok, create the new process.. */
- p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL);
+ p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -502,7 +502,7 @@
{
struct task_struct *p;

- p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -511,14 +511,15 @@
struct task_struct *p;
unsigned long clone_flags;
unsigned long newsp;
- int *user_tid;
+ int *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;
newsp = regs.ecx;
- user_tid = (int *)regs.edx;
+ parent_tidptr = (int *)regs.edx;
+ child_tidptr = (int *)regs.esi;
if (!newsp)
newsp = regs.esp;
- p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, user_tid);
+ p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, parent_tidptr, child_tidptr);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -536,7 +537,7 @@
{
struct task_struct *p;

- p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

--- linux/include/linux/sched.h.orig 2002-11-17 20:53:52.000000000 +0100
+++ linux/include/linux/sched.h 2002-11-17 21:15:27.000000000 +0100
@@ -46,10 +46,11 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
+#define CLONE_CHILD_SETTID 0x00200000 /* set the TID in the child */
+#define CLONE_CHILD_CLEARTID 0x00400000 /* clear the TID in the child */
+#define CLONE_DETACHED 0x00800000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x01000000 /* set if the tracing process can't force CLONE_PTRACE on this clone */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,7 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *user_tid; /* CLONE_CHILD_[SET|CLEAR]TID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -615,7 +616,7 @@
extern task_t *child_reaper;

extern int do_execve(char *, char **, char **, struct pt_regs *);
-extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *);
+extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *, int *);

#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
--- linux/kernel/sched.c.orig 2002-11-17 20:52:30.000000000 +0100
+++ linux/kernel/sched.c 2002-11-17 21:12:03.000000000 +0100
@@ -503,12 +503,16 @@
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
-#if CONFIG_SMP || CONFIG_PREEMPT
+asmlinkage void FASTCALL(schedule_tail(task_t *prev));
asmlinkage void schedule_tail(task_t *prev)
{
finish_arch_switch(this_rq(), prev);
+ /*
+ * Does the child thread/process want to be notified of the TID/PID?
+ */
+ if (current->user_tid)
+ put_user(current->pid, current->user_tid);
}
-#endif

/*
* context_switch - switch to the new MM and the new
--- linux/kernel/fork.c.orig 2002-11-17 20:54:55.000000000 +0100
+++ linux/kernel/fork.c 2002-11-17 22:52:25.000000000 +0100
@@ -695,7 +695,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
int retval;
struct task_struct *p = NULL;
@@ -819,19 +820,18 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
- /*
- * Notify the child of the TID?
- */
- retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
- goto bad_fork_cleanup_namespace;

+ if (clone_flags & CLONE_PARENT_SETTID)
+ if (put_user(p->pid, parent_tidptr)) {
+ retval = -EFAULT;
+ goto bad_fork_cleanup_namespace;
+ }
/*
- * Does the userspace VM want the TID cleared on mm_release()?
+ * Does the userspace VM want the TID set in the child's
+ * address space and it cleared on mm_release()?
*/
- if (clone_flags & CLONE_CLEARTID)
- p->user_tid = user_tid;
+ if (clone_flags & CLONE_CHILD_SETTID)
+ p->user_tid = child_tidptr;

/*
* Syscall tracing should be turned off in the child regardless
@@ -1000,7 +1000,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
struct task_struct *p;
int trace = 0;
@@ -1011,7 +1012,7 @@
clone_flags |= CLONE_PTRACE;
}

- p = copy_process(clone_flags, stack_start, regs, stack_size, user_tid);
+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
if (!IS_ERR(p)) {
struct completion vfork;


2002-11-17 20:42:15

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> It's all encapsulated in the libpthread which is used. No apps need to
> be recompiled so it is OK to make this incompatible change.
My point was that aesthetical reasons do not justify breaking anything
(aka forcing people to figure out what happened and spend time
recompiling).
Anyway, I'll need to recompile anyway, so I don't really care.

> I cannot see any reasonable way out if any of the put_user calls fail?
> Do you want the clone() call to fail if the parent's receiving address
> is wrong? You'd have to go and kill the child again since it's already
> created.
>
> Instead let the user initialize the memory location the clone call is
> supposed to write in with zero. if the value didn't change the
> user-level code can detect the error and handle appropriately. So,
> ignore the put_user errors. Maybe say so explicitly and add (void) in
> front.
I proposed to fail in put_user for the parent tid simply because the old
code did it.
I'm not completely sure whether we should fail or not, but if put_user
fails something bad is happening, so it may be better to signal the
error rather than just continuing since it can be done easily (for the
parent tid).


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-17 22:53:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Ingo Molnar wrote:
>
> here are the my current TID-setting changes. It's now 3 clone flags:

Hmm.. I really ahve to say that I just prefer the current two flags, and I
don't see any advantage to the three flag thing, and I do see
disadvantages:

- no real new semantic behaviour.
- the async nature of CLONE_CHILD_SETTID is just bound to cause
interesting-to-debug behaviour

Basically, the current two-flag approach gives all the behaviour the
three-flag thing does, with no downsides:

- if the fork() is a CLONE_VM, the parent/child VM is the same, and the
two flags are really all you need, agreed?

- the the for is _not_ a CLONE_VM, the child is really an independent
VM thing, and we should not even _allow_ the parent to change the VM of
the child: the SETTID behaviour (where it changes the parent VM) makes
sense and is good, but we should probably disallow CLEARTID altogether
for that case (and if the independent child wants to clear its own
memory space on exit, it should just do a set_tid_address() itself)

In fact, from what I can tell, your new CLONE_CHILD_SETTID really is 100%
semantically equivalent to the child just doing a "set_tid_address()" on
its own.

In short, I really don't see the advantage of this patch. I don't think
it's "evil and wrong" in any way, I just think that the lack of reason for
it would argue _against_ making clone() a bit more complicated and
breaking existing behaviour..

Hmm? I _think_ NPTL is fine with the current semantics, right? It just
sets both of the current flags, and that's all it really wants? Uli?

Linus

2002-11-17 23:16:15

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:

> - the the for is _not_ a CLONE_VM, the child is really an independent
> VM thing, and we should not even _allow_ the parent to change the VM of
> the child: the SETTID behaviour (where it changes the parent VM) makes
> sense and is good, but we should probably disallow CLEARTID altogether
> for that case (and if the independent child wants to clear its own
> memory space on exit, it should just do a set_tid_address() itself)

Why? The parent controls exactly how the memory in the child looks like
after the fork. Calling fork() in an MT application means that the new
process looks like the old process with all threads but the one calling
fork() not there. But it does mean the new process uses the thread code
and it does use the memory handling mechanism of the thread library,
including the use of settid/cleartid.

If clone() cannot instruct the kernel to modify the new process image
and install the cleartid handler the first thing *all* new processes
have to do is to call set_tid_address and assign the TID to the
appropriate field. But there is a gap between process creation and the
return of the set_tid_address syscall and the subsequent assignment.
The memory location which contains the TID has a valid value (inherited
from the parent) so neither signal handlers nor external programs like
debuggers know without more testing whether the field was initialized or
not.


> Hmm? I _think_ NPTL is fine with the current semantics, right? It just
> sets both of the current flags, and that's all it really wants? Uli?

Not for the fork() implementation. With the patch I've replaced the
fork() syscall with an clone() syscall:

sys_clone(CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0,
NULL, NULL, &THREAD_SELF->tid)

I.e., the information is stored only in the child.


If you dislike the introduction of the additional flag you can use the
less obvious way of the first patch: check whether CLONE_VM is set.
Ingo said you'd dislike this (probably with good reason) but these since
CLONE_CHILD_SETTID and CLONE_PARENT_SETTID have exactly the same
semantics if CLONE_VM is set spending a flag bit might just be as ugly.


And one last thing. I am toying with the idea of having a function

int cfork (pid_t *);

(name can be discussed) which works like fork() but always returns the
PID to the caller (in the memory location pointed to by the parameter),
even in the child. This seems to be the interface fork() should have
gotten from the beginning. For this implementation something like the
CLONE_CHILD_SETTID flag should be available.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92CTl2ijCOnn/RHQRAvYPAKC0vQ+8YF4YtXIcY1WUZWNCqovwsgCeNrib
D2JP0bbF7KD+d/moYTfDb8Y=
=d2wd
-----END PGP SIGNATURE-----

2002-11-17 23:30:45

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ingo Molnar wrote:
> here are the my current TID-setting changes. It's now 3 clone flags:
>
> - CLONE_PARENT_SETTID
> [...]

BTW, this patch contains one little bug.

diff -u linux/arch/i386/kernel/process.c linux/arch/i386/kernel/process.c
- --- linux/arch/i386/kernel/process.c 2002-11-17 21:11:52.000000000 +0100
+++ linux/arch/i386/kernel/process.c 2002-11-17 21:11:52.000000000 +0100
@@ -516,7 +516,7 @@
clone_flags = regs.ebx;
newsp = regs.ecx;
parent_tidptr = (int *)regs.edx;
- - child_tidptr = (int *)regs.esi;
+ child_tidptr = (int *)regs.edi;
if (!newsp)
newsp = regs.esp;
p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0,
parent_tidptr, child_tidptr);


%esi is used for the TLS pointer.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92ChN2ijCOnn/RHQRAqUnAKCa4VHC6EtZCCtArHJ9qHcznA84kACgrGNG
0niSahEXQ5aGa0XrSLSwv7A=
=YpSY
-----END PGP SIGNATURE-----

2002-11-18 01:26:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Ulrich Drepper wrote:
>
> > Hmm? I _think_ NPTL is fine with the current semantics, right? It just
> > sets both of the current flags, and that's all it really wants? Uli?
>
> Not for the fork() implementation. With the patch I've replaced the
> fork() syscall with an clone() syscall:
>
> sys_clone(CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0,
> NULL, NULL, &THREAD_SELF->tid)
>
> I.e., the information is stored only in the child.

And the point of this is? The child has to re-initialize it's pthread
structures anyway after a fork, since the child certainly doesn't have the
threads that the fork()ing parent had. I don't see how the TID paths help
you there, you still have to make sure that if/when the child tries to
create new threads of its own, it re-initializes everything first.

> If you dislike the introduction of the additional flag you can use the
> less obvious way of the first patch: check whether CLONE_VM is set.
> Ingo said you'd dislike this (probably with good reason) but these since
> CLONE_CHILD_SETTID and CLONE_PARENT_SETTID have exactly the same
> semantics if CLONE_VM is set spending a flag bit might just be as ugly.

The thing is, I don't see the _point_. I refuse to add magic flags that
are just some obscure implementation issue inside of glibc. A flag should
have a _meaning_, and I seriously dislike the "meaning" behind
CLONE_CHILD_SETTID. I dislike in particular its asynchronous behaviour,
which is visible in the parent if CLONE_VM isn't set.

Asynchronous behaviour without good reason is _bad_. Clearly, CLEARTID
needs to be async, since it happens when the child exits, which is
fundamentally asynchronous for the parent. But SETTID is certainly _not_
an asynchronous thing.

> And one last thing. I am toying with the idea of having a function
>
> int cfork (pid_t *);
>
> (name can be discussed) which works like fork() but always returns the
> PID to the caller (in the memory location pointed to by the parameter),
> even in the child. This seems to be the interface fork() should have
> gotten from the beginning. For this implementation something like the
> CLONE_CHILD_SETTID flag should be available.

Actually, the above interface is most easily done by just havign
CLONE_SETTID take effect _before_ we start copying the VM space. Which may
well make sense (and avoids any extra page dirtying and COW breakage, as
well as all asynchronous behaviour).

So moving the CLONE_SETTID check to _before_ copy_mm() will give you
exactly the semantics you want. I wouldn't have any issues with that
approach (it's certainly synchronous, and in fact it has to be done there
if we want to use this for non-CLONE_VM anyway).

Linus

2002-11-18 01:39:26

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

Linus Torvalds wrote:
> - the async nature of CLONE_CHILD_SETTID is just bound to cause
> interesting-to-debug behaviour

Fun for debuggers and tracers, certainly. It could be made
synchronous by selecting the new tid value and storing it in the
SETTID address before the mm is cloned.

> Hmm.. I really ahve to say that I just prefer the current two flags,

Agreed, but for different reasons than you gave.

> - the the for is _not_ a CLONE_VM, the child is really an independent
> VM thing, and we should not even _allow_ the parent to change the VM of
> the child: the SETTID behaviour (where it changes the parent VM) makes
> sense and is good, but we should probably disallow CLEARTID altogether
> for that case (and if the independent child wants to clear its own
> memory space on exit, it should just do a set_tid_address() itself)
>
> In fact, from what I can tell, your new CLONE_CHILD_SETTID really is 100%
> semantically equivalent to the child just doing a "set_tid_address()" on
> its own.

Don't get this in a muddle. As far as I can tell, the only purpose of
set_tid_address() is to set the address for _CLEARTID_. (As a bonus,
it returns the tid so that the caller can save the value, but that's
just an optimisation).

So, CLONE_CHILD_SETTID is not at all similar to calling
set_tid_address() in the child.

That said, you're still right. There should be two flags, and these
are the simple, obvious semantics:

1. CLONE_SETTID sets the child's tid in the parent's memory.
In the child's memory, if CLONE_VM is not set, the tid word will
have whatever value was stored in it before the parent called
clone(). Unless I've misread the code, this is the behaviour we
have now.

2. CLONE_CLEARTID sets tid_address in the child (only the child).
It is equivalent to calling set_tid_address() first thing in the
child.

Note that these semantics make sense, and the implementation is very
simple. There is no problem with allowing CLEARTID always.

The race condition which Ulrich brought up, which requires
CLONE_CHILD_SETTID to be used without CLONE_PARENT_SETTID, only occurs
when using the same address to store the tid of the forked child as
the parent's tid address. In other words, it's a user space error.
(Please correct me if I'm mistaken, Ulrich).

Finally, it would be kinder to everyone if CLONE_SETTID stored the
child's tid in _both_ the parent and child address spaces, atomically
w.r.t. debuggers. The simplest way to do that is to store the child's
tid value in the parent's memory before cloning the mm. If thread
creation fails, that's fine because the parent knows.

-- Jamie

2002-11-18 03:26:34

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> On Sun, 17 Nov 2002, Ulrich Drepper wrote:

>> sys_clone(CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0,
>> NULL, NULL, &THREAD_SELF->tid)
>>
>>I.e., the information is stored only in the child.
>
>
> And the point of this is? The child has to re-initialize it's pthread
> structures anyway after a fork,

No, that's the whole point. The thread descriptor for the one thread in
the new process is fine with the one exception: the TID. There is not
one change to the thread descriptor in the fork code left if this change
is available.


> since the child certainly doesn't have the
> threads that the fork()ing parent had.

No, not all the threadsm only the one is available. But all the memory
is still available and deallocating it requires only two list
operations. Again, all thread descriptors are just fine.



> The thing is, I don't see the _point_. I refuse to add magic flags that
> are just some obscure implementation issue inside of glibc. A flag should
> have a _meaning_, and I seriously dislike the "meaning" behind
> CLONE_CHILD_SETTID. I dislike in particular its asynchronous behaviour,
> which is visible in the parent if CLONE_VM isn't set.

The parent isn't affected at all if CLONE_VM and CLONE_PARENT_SETTID are
not set.



> Actually, the above interface is most easily done by just havign
> CLONE_SETTID take effect _before_ we start copying the VM space. Which may
> well make sense (and avoids any extra page dirtying and COW breakage, as
> well as all asynchronous behaviour).
>
> So moving the CLONE_SETTID check to _before_ copy_mm() will give you
> exactly the semantics you want. I wouldn't have any issues with that
> approach (it's certainly synchronous, and in fact it has to be done there
> if we want to use this for non-CLONE_VM anyway).

But this is wrong. This would make it impossible to use CLONE_SETTID in
the fork() replacement and it would require the extra set_tid_address
call in the child.

The memory the clone() call used to implement fork() uses points to the
very same memory location which the currently running thread uses. It's
just in the other VM and it is therefore absolutely necessary that the
pid_t gets written into memory after the VM got cloned.

To be clear(er):

each thread descriptor contains a field tid

in the fork() implementation clone gets called with a parameter
pointing to the tid field of the parent's thread descriptor.

the parent's tid field must not be modified

the child must have the correct value from the very beginning

after the child starts the only things it does are these:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if (__fork_generation_pointer != NULL)
*__fork_generation_pointer += 4;

/* Reset the file list. These are recursive mutexes. */
fresetlockfiles ();

/* We execute this even if the 'fork' call failed. */
_IO_list_resetlock ();

/* Run the handlers registered for the child. */
list_for_each (runp, &__fork_child_list)
{
struct fork_handler *curp;

curp = list_entry (runp, struct fork_handler, list);

curp->handler ();
}

/* Initialize the fork lock. */
__fork_lock = (lll_lock_t) LLL_LOCK_INITIALIZER;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

i.e., it's bumping a magic generation counter needed to implement
pthread_once

it re-initialized all the FILE handles which could be locked

it executes the user-registered handlers (one of the is the cleanup
handling of the memory of all the other threads)

finally it re-initializes the fork mutex


That's all! Without the CLONE_CHILD_SETTID flag I'd have to add a call
to set_tid_address at the very beginning of the block. This would
basically work but it means there is a time when the TID field is not
set correctly and in this time a signal can arrive or an external
program can access the process. Especially the first can only be
handled by never trusting the TID field and always making gettid() syscalls.


I know that the CLONE_CHILD_SETTID interface is a bit awkward but I
cannot see a cleaner way (misusing CLONE_VM being even worse). Because
clone() can create a new VM or not and since all four possible
combinations of writing/not writing are useful I think the two flags are
an OK solution.
CLONE_PARENT_SETTID
no yes

no normal fork() new thread with CLONE_VM
CLONE_CHILD_SETTID
yes fork() in MT app cfork()


And again: not having the flag means making an additional syscall and
for a brief period the thread descriptor of the one thread in the new
process wouldn't be initialized.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92F+T2ijCOnn/RHQRArOfAJwOco6h27qkoPxB8W6NSNLMYs4FhwCfVchk
b5+5W2/3/eitAEYpRwLc9Ok=
=Iqpl
-----END PGP SIGNATURE-----

2002-11-18 03:33:59

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jamie Lokier wrote:

> The race condition which Ulrich brought up, which requires
> CLONE_CHILD_SETTID to be used without CLONE_PARENT_SETTID, only occurs
> when using the same address to store the tid of the forked child as
> the parent's tid address. In other words, it's a user space error.
> (Please correct me if I'm mistaken, Ulrich).

It's not an error. It the very basic concept of always having a correct
thread descriptor. A process created with fork() from an MT app is
different from one created with fork() from a single threaded app. In
the latter case the thread descriptors are not used and can contain
garbage. In the former case the descriptor better be correct all the
time. It basically the same problem as with setting the TLS "register".
clone() get the CLONE_TLS flag because the child and parent have
(potentially) different TLS address and it is not possible to set the
value before the fork() call (since the parent would have the wrong
value for some time) nor after the fork() (since then there would be a
window for a signal to arrive for an uninitialized thread).


> Finally, it would be kinder to everyone if CLONE_SETTID stored the
> child's tid in _both_ the parent and child address spaces, atomically
> w.r.t. debuggers.

Yes, but then clone still has to get two addresses. In case of the
fork() implementation the parent's TID value must not be overwritten
while the child's TID value is at exactly the same virtual address as
the value in the parent. I've no problem with requiring the value
always to be written (the pointer can be to a scratch object) but there
must be two pointer parameters for that.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92GFG2ijCOnn/RHQRAjQSAKCIRuhc1wgZy1PFS+waFapXRYqGjACgt6Bn
T2uKMpj7gHJedPzXxQ5EjoM=
=0C7c
-----END PGP SIGNATURE-----

2002-11-18 03:36:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Ulrich Drepper wrote:
> >
> > And the point of this is? The child has to re-initialize it's pthread
> > structures anyway after a fork,
>
> No, that's the whole point. The thread descriptor for the one thread in
> the new process is fine with the one exception: the TID. There is not
> one change to the thread descriptor in the fork code left if this change
> is available.

Oh, I realized that, but I was talking about the _other_ thread
descriptors. The ones that exist in the parent (because the parent
potentially has multiple threads before the fork()), but that will be
invalid in the child.

So when the child eventually creates a new thread, it needs to know to
ignore the other thread slots, even though they _look_ valid. They were
valid in the parent, they aren't valid in the child.

No?

I follow your argument, and I don't dislike it per se. I just think that
you end up having to do other setup for pthreads-after-fork _anyway_, no?

I still want to have those flags make _sense_ rather than just adding a
new flag without reason. For example, I've seen no real reason to
introduce a second pointer - as far as I can tell it makes sense in none
of the cases anybody has ever mentioned so far.

Basically, I don't see the patch and bits making sense. I see it
potentially _working_. But that's a different issue - and "working" to me
is sometimes a much less potent argument than "makes sense".

Linus

2002-11-18 03:51:46

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:

> So when the child eventually creates a new thread, it needs to know to
> ignore the other thread slots, even though they _look_ valid. They were
> valid in the parent, they aren't valid in the child.
>
> No?

They don't have to be as such. It's a simple list operation (move all
active threads except the active one on the free list). That's it. No
more work especially no resetting of the TID fields.


> I follow your argument, and I don't dislike it per se. I just think that
> you end up having to do other setup for pthreads-after-fork _anyway_, no?

I don't walk the thread descriptors. I don't write into them. I move
entire double-linked lists with a dozen or so instructions. Regardless
of how many threads were active in the parent.


> Basically, I don't see the patch and bits making sense. I see it
> potentially _working_. But that's a different issue - and "working" to me
> is sometimes a much less potent argument than "makes sense".

With the flag and/or two settid pointers available the thread descriptor
for the one thread in the new process is correct and complete from the
first instruction on. The address of the thread descriptor is the same
in parent and child and the only field which needs changing is the tid.
I don't know what else to say. For correct operation the thread
descriptor must be complete and correct and if the kernel doesn't help
setting the descriptor up correctly there always will be a time period
when it is not complete and effectively points to the parent thread in
the old process. If the clone() flag would not be available I would
probably ignore the problem for now but when it becomes a problem the
only way for user-level code to handle the problem is to disable signals
around fork calls in the parent and then re-enable them after fork() and
after setting up the descriptor in the new process. And this still
wouldn't help with debuggers etc.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92GV+2ijCOnn/RHQRAiWpAJ9xxRkqVFFXyMcGwy8Y6udvtpqrvgCfRqm4
msw9MdH4MZ/+57JS9+OKlJQ=
=MWxi
-----END PGP SIGNATURE-----

2002-11-18 04:04:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Ulrich Drepper wrote:
>
> They don't have to be as such. It's a simple list operation (move all
> active threads except the active one on the free list). That's it. No
> more work especially no resetting of the TID fields.

Ok. So you _do_ do more than just the clone(), but you depend on the clone
to at least set up the local thread descriptor for you.

Fair enough, and it sounds like a good implementation.

I'm convinced. However, I still want som elargely cosmetic changes to the
patch, Ingo:

- the existing CLONE_SETTID should be called CLONE_PARENT_SETTID, because
that is how it already works since it is done after the VM copy (this
is what your patch already does)
- the existing CLONE_CLEARTID should then be CLONE_CHILD_CLEARTID (your
existing patch re-numbers this) and using existing semantics
- the new flag should be CLONE_CHILD_SETTID, and should _not_ renumber
old existing bits (your existing patch renumbers everything, including
totally unrelated bits like CLONE_DETATCHED)
- please don't introduce a new pointer, just use the old one. There
appears to be no cases where you want to have different pointers
anyway.

With those changes, no actual behavioural changes should take place, and
the patch is _purely_ adding a new flag (CLONE_CHILD_SETTID) without
changing existing behaviour (yeah, it renames a few old flags too, but
that's purely syntactic).

Plus the patch should be smaller.

Linus

2002-11-18 04:24:03

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:

> - please don't introduce a new pointer, just use the old one. There
> appears to be no cases where you want to have different pointers
> anyway.

Although I could imagine a possible situation where this might be useful
I haven't had the need. And not adding a new pointer means that
clone() on x86 has a parameter register left which might be important in
future.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92G0I2ijCOnn/RHQRAqAeAKC6VJnPL77c6bsx5QuOgt9x4r5DEACfcKJu
XVteSqGucerkqMfbCpaxzEo=
=JD5n
-----END PGP SIGNATURE-----

2002-11-18 06:39:33

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

--- linux-2.5/arch/i386/kernel/entry.S.tid 2002-11-17 16:11:26.000000000 -0800
+++ linux-2.5/arch/i386/kernel/entry.S 2002-11-17 21:56:18.000000000 -0800
@@ -193,10 +193,8 @@


ENTRY(ret_from_fork)
-#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
-#endif
GET_THREAD_INFO(%ebx)
jmp syscall_exit

--- linux-2.5/include/linux/sched.h.tid 2002-11-17 16:11:26.000000000 -0800
+++ linux-2.5/include/linux/sched.h 2002-11-17 22:04:56.000000000 -0800
@@ -46,10 +46,11 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
+#define CLONE_PARENT_SETTID 0x00100000 /* write the TID back to userspace */
+#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the userspace TID */
#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,7 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *user_tid; /* for CLONE_[SET|CLEAR]TID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
--- linux-2.5/kernel/sched.c.tid 2002-11-17 16:11:26.000000000 -0800
+++ linux-2.5/kernel/sched.c 2002-11-17 21:59:58.000000000 -0800
@@ -503,12 +503,17 @@
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
-#if CONFIG_SMP || CONFIG_PREEMPT
+asmlinkage void FASTCALL(schedule_tail(task_t *prev));
asmlinkage void schedule_tail(task_t *prev)
{
finish_arch_switch(this_rq(), prev);
+
+ /*
+ * Does the child thread/process want to be notified of the TID/PID?
+ */
+ if (current->user_tid)
+ put_user(current->pid, current->user_tid);
}
-#endif

/*
* context_switch - switch to the new MM and the new
--- linux-2.5/kernel/fork.c.tid 2002-11-17 16:11:26.000000000 -0800
+++ linux-2.5/kernel/fork.c 2002-11-17 22:06:00.000000000 -0800
@@ -822,18 +822,15 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
- /*
- * Notify the child of the TID?
- */
retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
+ if (clone_flags & CLONE_PARENT_SETTID)
if (put_user(p->pid, user_tid))
goto bad_fork_cleanup_namespace;
-
/*
- * Does the userspace VM want the TID cleared on mm_release()?
+ * Does the userspace VM want the TID set in the child's
+ * address space and it cleared on mm_release()?
*/
- if (clone_flags & CLONE_CLEARTID)
+ if (clone_flags & (CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID))
p->user_tid = user_tid;

/*


Attachments:
d-tid (3.08 kB)

2002-11-18 08:00:24

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> - please don't introduce a new pointer, just use the old one. There
> appears to be no cases where you want to have different pointers
> anyway.
There are: suppose that you want to implement the int cfork(int* pid)
function, which returns the pid in *pid in the parent vm, in a
multithreaded application.

The child tid pointer must be set to the current thread tid field,
because the thread structure is going to be reused.

However, that field contains the tid of the forking thread in the parent
process and must not be modified. So a different pointer is needed.

You could avoid the additional pointer by letting
child_tidptr = (!(flags & CLONE_VM) && current->user_tid) ?
current->user_tid : parent_tidptr;

but this forces the thread library to reuse the thread structure when
forking.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-18 08:08:21

by Ingo Molnar

[permalink] [raw]
Subject: [patch] threading enhancements, tid-2.5.47-C0


On Sun, 17 Nov 2002, Linus Torvalds wrote:

> I'm convinced. However, I still want som elargely cosmetic changes to the
> patch, Ingo:
>
> - the existing CLONE_SETTID should be called CLONE_PARENT_SETTID, because
> that is how it already works since it is done after the VM copy (this
> is what your patch already does)
> - the existing CLONE_CLEARTID should then be CLONE_CHILD_CLEARTID (your
> existing patch re-numbers this) and using existing semantics
> - the new flag should be CLONE_CHILD_SETTID, and should _not_ renumber
> old existing bits (your existing patch renumbers everything, including
> totally unrelated bits like CLONE_DETATCHED)

ok, done.

> - please don't introduce a new pointer, just use the old one. There
> appears to be no cases where you want to have different pointers
> anyway.

i'm against this simplification, as it makes CLONE_CHILD_SETTID and
CLONE_PARENT_SETTID mutually exclusive. Ulrich does not use the two flags
at once currently, but there's no reason to restrict the API to match the
current usage pattern. This enables a safer fork() variant that guarantees
the setting of the child PID before starting the new context.

i've attached my current patch against BK-curr, which also solves the
problem Ulrich mentioned - it splits up ->user_tid into ->set_child_tid
and ->clear_child_tid pointers. This way the clearing and setting
functionality is cleanly separated.

(plus the new syscall # is now in unistd.h)

Ingo

--- linux/arch/i386/kernel/entry.S.orig 2002-11-17 20:54:55.000000000 +0100
+++ linux/arch/i386/kernel/entry.S 2002-11-17 20:57:44.000000000 +0100
@@ -193,10 +193,8 @@


ENTRY(ret_from_fork)
-#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
-#endif
GET_THREAD_INFO(%ebx)
jmp syscall_exit

--- linux/arch/i386/kernel/smpboot.c.orig 2002-11-17 21:12:49.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2002-11-17 21:12:52.000000000 +0100
@@ -498,7 +498,7 @@
* don't care about the eip and regs settings since
* we'll never reschedule the forked task.
*/
- return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL);
+ return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL, NULL);
}

/* which physical APIC ID maps to which logical CPU number */
--- linux/arch/i386/kernel/process.c.orig 2002-11-17 21:03:01.000000000 +0100
+++ linux/arch/i386/kernel/process.c 2002-11-18 10:04:56.000000000 +0100
@@ -225,7 +225,7 @@
regs.eflags = 0x286;

/* Ok, create the new process.. */
- p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL);
+ p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -287,7 +287,7 @@
struct_cpy(childregs, regs);
childregs->eax = 0;
childregs->esp = esp;
- p->user_tid = NULL;
+ p->set_child_tid = p->clear_child_tid = NULL;

p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -502,7 +502,7 @@
{
struct task_struct *p;

- p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -511,14 +511,15 @@
struct task_struct *p;
unsigned long clone_flags;
unsigned long newsp;
- int *user_tid;
+ int *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;
newsp = regs.ecx;
- user_tid = (int *)regs.edx;
+ parent_tidptr = (int *)regs.edx;
+ child_tidptr = (int *)regs.esi;
if (!newsp)
newsp = regs.esp;
- p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, user_tid);
+ p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, parent_tidptr, child_tidptr);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -536,7 +537,7 @@
{
struct task_struct *p;

- p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

--- linux/include/linux/sched.h.orig 2002-11-17 20:53:52.000000000 +0100
+++ linux/include/linux/sched.h 2002-11-18 10:04:09.000000000 +0100
@@ -46,10 +46,11 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
+#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
+#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,8 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *set_child_tid; /* CLONE_CHILD_SETTID */
+ int *clear_child_tid; /* CLONE_CHILD_CLEARTID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -615,7 +617,7 @@
extern task_t *child_reaper;

extern int do_execve(char *, char **, char **, struct pt_regs *);
-extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *);
+extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *, int *);

#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
--- linux/include/asm-i386/unistd.h.orig 2002-11-18 10:06:42.000000000 +0100
+++ linux/include/asm-i386/unistd.h 2002-11-18 10:07:00.000000000 +0100
@@ -262,6 +262,7 @@
#define __NR_sys_epoll_ctl 255
#define __NR_sys_epoll_wait 256
#define __NR_remap_file_pages 257
+#define __NR_set_tid_address 258


/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
--- linux/kernel/sched.c.orig 2002-11-17 20:52:30.000000000 +0100
+++ linux/kernel/sched.c 2002-11-18 10:05:30.000000000 +0100
@@ -503,12 +503,16 @@
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
-#if CONFIG_SMP || CONFIG_PREEMPT
+asmlinkage void FASTCALL(schedule_tail(task_t *prev));
asmlinkage void schedule_tail(task_t *prev)
{
finish_arch_switch(this_rq(), prev);
+ /*
+ * Does the child thread/process want to be notified of the TID/PID?
+ */
+ if (current->set_child_tid)
+ put_user(current->pid, current->set_child_tid);
}
-#endif

/*
* context_switch - switch to the new MM and the new
--- linux/kernel/fork.c.orig 2002-11-17 20:54:55.000000000 +0100
+++ linux/kernel/fork.c 2002-11-18 10:08:15.000000000 +0100
@@ -407,13 +407,13 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->user_tid) {
+ if (tsk->clear_child_tid) {
/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
- put_user(0, tsk->user_tid);
- sys_futex((unsigned long)tsk->user_tid, FUTEX_WAKE, 1, NULL);
+ put_user(0, tsk->clear_child_tid);
+ sys_futex((unsigned long)tsk->clear_child_tid, FUTEX_WAKE, 1, NULL);
}
}

@@ -676,9 +676,9 @@
p->flags = new_flags;
}

-asmlinkage int sys_set_tid_address(int *user_tid)
+asmlinkage int sys_set_tid_address(int *tidptr)
{
- current->user_tid = user_tid;
+ current->clear_child_tid = tidptr;

return current->pid;
}
@@ -695,7 +695,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
int retval;
struct task_struct *p = NULL;
@@ -819,19 +820,20 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
- /*
- * Notify the child of the TID?
- */
- retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
- goto bad_fork_cleanup_namespace;

+ if (clone_flags & CLONE_PARENT_SETTID)
+ if (put_user(p->pid, parent_tidptr)) {
+ retval = -EFAULT;
+ goto bad_fork_cleanup_namespace;
+ }
/*
- * Does the userspace VM want the TID cleared on mm_release()?
+ * Does the userspace VM want the TID set in the child's
+ * address space? And/or get it cleared on mm_release()?
*/
- if (clone_flags & CLONE_CLEARTID)
- p->user_tid = user_tid;
+ if (clone_flags & CLONE_CHILD_SETTID)
+ p->set_child_tid = child_tidptr;
+ if (clone_flags & CLONE_CHILD_CLEARTID)
+ p->clear_child_tid = child_tidptr;

/*
* Syscall tracing should be turned off in the child regardless
@@ -1000,7 +1002,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
struct task_struct *p;
int trace = 0;
@@ -1011,7 +1014,7 @@
clone_flags |= CLONE_PTRACE;
}

- p = copy_process(clone_flags, stack_start, regs, stack_size, user_tid);
+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
if (!IS_ERR(p)) {
struct completion vfork;


2002-11-18 08:14:48

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

Luca Barbieri wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> There are: suppose that you want to implement the int cfork(int* pid)
> function, which returns the pid in *pid in the parent vm, in a
> multithreaded application.

Correct, for a cfork() implementation two separate pointers would be
needed. Ingo was asking the same question.

Note that I haven't received a request for such a fork interface variant
yet. There obviously is a problem with fork() in MT apps but people are
either not hitting it or working around it. In any case, cfork() would
at least for the time being Linux-specific anyway.


If people are in fact interested in such a new interface then you should
start asking Linus to change his mind on the second parameter. I'm at
this point in time neutral since there is no urgent need from my
perspective to implement cfork(),

> You could avoid the additional pointer by letting
> child_tidptr = (!(flags & CLONE_VM) && current->user_tid) ?
> current->user_tid : parent_tidptr;

This doesn't work since it would overwrite the TID field in the calling
thread's descriptor.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92KMJ2ijCOnn/RHQRAqciAKCCZanzvzkgEND8YdMt9Q79V5DWlwCgikX/
W1RMEu4Vz8KQn0BlIZ7zlFo=
=Hkaz
-----END PGP SIGNATURE-----

2002-11-18 08:24:01

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> I don't walk the thread descriptors. I don't write into them. I move
> entire double-linked lists with a dozen or so instructions. Regardless
> of how many threads were active in the parent.
However this would cause a lot of copy-on-write faults on thread stacks
when other thread resume execution.

How about adding a MAP_DONTCOPY flag to mmap, using it for the thread
stacks and then adding yet another flag and pointer to the clone
syscall, pointing to a userspace array of addresses and flags, allowing
to specify whether vmas should be copied, ignored (or maybe shared, as a
future extension) so that userspace could specify that the current
thread stack should be copied anyway?


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-18 08:20:38

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> > You could avoid the additional pointer by letting
> > child_tidptr = (!(flags & CLONE_VM) && current->user_tid) ?
> > current->user_tid : parent_tidptr;
>
> This doesn't work since it would overwrite the TID field in the calling
> thread's descriptor.
No: for the pthread_create case you would pass the pointer to new struct
pthread tid in parent_tidptr, while for fork you would pass the
parameter of cfork as parent_tidptr and child_tidptr would be inherited.

However I don't think that this is a good interface, just a bit better
than having two flags and a single pointer.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-18 08:22:15

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0


> problem Ulrich mentioned - it splits up ->user_tid into ->set_child_tid
> and ->clear_child_tid pointers. This way the clearing and setting
> functionality is cleanly separated.
How about making ->set_child_tid a parameter for schedule_tail, or even
directly using it in the ret_from_fork assembly?
It doesn't make much sense to have a variable in task_struct which is
used only at task creation.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-18 10:50:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0


On 18 Nov 2002, Luca Barbieri wrote:

> How about making ->set_child_tid a parameter for schedule_tail, or even
> directly using it in the ret_from_fork assembly? It doesn't make much
> sense to have a variable in task_struct which is used only at task
> creation.

what we could do is in fact to make the child address synchronous, see the
attached patch.

this moves the complexity of setting the child TID from the kernel into
glibc. glibc's fork() implementation would have to set up a new thread
descriptor for the child (which would be served from the stack-cache most
of the time so it's not significant overhead), because the child TID is
also set in the parent's VM, but otherwise it's good enough.

the even simpler solution would be to keep things as-is in 2.5.48, this
would remove the ability of cfork()-alike functionality. glibc would still
have to allocate a new descriptor in the fork() case - but compared to the
generic overhead of fork() this should not be an issue.

Ingo

--- linux/arch/i386/kernel/smpboot.c.orig 2002-11-17 21:12:49.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2002-11-17 21:12:52.000000000 +0100
@@ -498,7 +498,7 @@
* don't care about the eip and regs settings since
* we'll never reschedule the forked task.
*/
- return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL);
+ return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL, NULL);
}

/* which physical APIC ID maps to which logical CPU number */
--- linux/arch/i386/kernel/process.c.orig 2002-11-17 21:03:01.000000000 +0100
+++ linux/arch/i386/kernel/process.c 2002-11-18 12:51:31.000000000 +0100
@@ -225,7 +225,7 @@
regs.eflags = 0x286;

/* Ok, create the new process.. */
- p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL);
+ p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -287,7 +287,7 @@
struct_cpy(childregs, regs);
childregs->eax = 0;
childregs->esp = esp;
- p->user_tid = NULL;
+ p->clear_child_tid = NULL;

p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -502,7 +502,7 @@
{
struct task_struct *p;

- p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -511,14 +511,15 @@
struct task_struct *p;
unsigned long clone_flags;
unsigned long newsp;
- int *user_tid;
+ int *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;
newsp = regs.ecx;
- user_tid = (int *)regs.edx;
+ parent_tidptr = (int *)regs.edx;
+ child_tidptr = (int *)regs.esi;
if (!newsp)
newsp = regs.esp;
- p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, user_tid);
+ p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, parent_tidptr, child_tidptr);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -536,7 +537,7 @@
{
struct task_struct *p;

- p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

--- linux/include/linux/sched.h.orig 2002-11-17 20:53:52.000000000 +0100
+++ linux/include/linux/sched.h 2002-11-18 12:48:26.000000000 +0100
@@ -46,10 +46,11 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
+#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
+#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,7 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *clear_child_tid; /* CLONE_CHILD_CLEARTID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -615,7 +616,7 @@
extern task_t *child_reaper;

extern int do_execve(char *, char **, char **, struct pt_regs *);
-extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *);
+extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *, int *);

#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
--- linux/kernel/fork.c.orig 2002-11-17 20:54:55.000000000 +0100
+++ linux/kernel/fork.c 2002-11-18 12:50:11.000000000 +0100
@@ -407,13 +407,13 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->user_tid) {
+ if (tsk->clear_child_tid) {
/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
- put_user(0, tsk->user_tid);
- sys_futex((unsigned long)tsk->user_tid, FUTEX_WAKE, 1, NULL);
+ put_user(0, tsk->clear_child_tid);
+ sys_futex((unsigned long)tsk->clear_child_tid, FUTEX_WAKE, 1, NULL);
}
}

@@ -676,9 +676,9 @@
p->flags = new_flags;
}

-asmlinkage int sys_set_tid_address(int *user_tid)
+asmlinkage int sys_set_tid_address(int *tidptr)
{
- current->user_tid = user_tid;
+ current->clear_child_tid = tidptr;

return current->pid;
}
@@ -695,7 +695,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
int retval;
struct task_struct *p = NULL;
@@ -819,19 +820,19 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
- /*
- * Notify the child of the TID?
- */
+
retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
+ if (clone_flags & CLONE_PARENT_SETTID)
+ if (put_user(p->pid, parent_tidptr))
+ goto bad_fork_cleanup_namespace;
+ if (clone_flags & CLONE_CHILD_SETTID)
+ if (put_user(p->pid, child_tidptr))
goto bad_fork_cleanup_namespace;
-
/*
- * Does the userspace VM want the TID cleared on mm_release()?
+ * Clear TID on mm_release()?
*/
- if (clone_flags & CLONE_CLEARTID)
- p->user_tid = user_tid;
+ if (clone_flags & CLONE_CHILD_CLEARTID)
+ p->clear_child_tid = child_tidptr;

/*
* Syscall tracing should be turned off in the child regardless
@@ -1000,7 +1001,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
struct task_struct *p;
int trace = 0;
@@ -1011,7 +1013,7 @@
clone_flags |= CLONE_PTRACE;
}

- p = copy_process(clone_flags, stack_start, regs, stack_size, user_tid);
+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
if (!IS_ERR(p)) {
struct completion vfork;


2002-11-18 10:59:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On 18 Nov 2002, Luca Barbieri wrote:

> How about adding a MAP_DONTCOPY flag to mmap, using it for the thread
> stacks and then adding yet another flag and pointer to the clone
> syscall, pointing to a userspace array of addresses and flags, allowing
> to specify whether vmas should be copied, ignored (or maybe shared, as a
> future extension) so that userspace could specify that the current
> thread stack should be copied anyway?

i'd just add MAP_DONTCOPY, and use a new non-MAP_DONTCOPY descriptor for
the forked process. It's clearly possible with SETTID and SETTLS, nothing
says that the new process must have the same TLS as the old one.

this means that you can define VM-private data structures upon allocation
- this reduces the overhead at fork() time, with your other method the
kernel would have to parse & interpret the userspace array.

plus it might make sense to expressly enable this via a clone flag, ie.
CLONE_VM_COPYABLE.

Ingo

2002-11-18 11:03:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


and we have VM_DONTCOPY already, which is used by the DRM code. So it
would only be a question of exporting this API to userspace. The attached
patch adds MAP_DONTCOPY. I made it unpriviledged, i doubt it has any
security impact.

Ingo

--- linux/include/asm-i386/mman.h.orig2 2002-11-18 13:07:16.000000000 +0100
+++ linux/include/asm-i386/mman.h 2002-11-18 13:07:44.000000000 +0100
@@ -20,6 +20,7 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_DONTCOPY 0x20000 /* do not block on IO */

#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_INVALIDATE 2 /* invalidate the caches */
--- linux/mm/mmap.c.orig2 2002-11-18 13:08:03.000000000 +0100
+++ linux/mm/mmap.c 2002-11-18 13:08:44.000000000 +0100
@@ -450,6 +450,11 @@
*/
vm_flags = calc_vm_flags(prot,flags) | mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

+ /*
+ * Copy the mapping on fork?
+ */
+ if (flags & MAP_DONTCOPY)
+ vm_flags |= VM_DONTCOPY;
if (flags & MAP_LOCKED) {
if (!capable(CAP_IPC_LOCK))
return -EPERM;

2002-11-18 12:04:58

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

But this way you throw away a lot of functionality, make the existence
of two pointers pointless, cause pthread_self() to change across fork
and force NPTL to copy thread state.

How about instead doing a verify_area in copy_process, putting the
child_settid address and the tid in two child registers and assigning it
in assembly in ret_from_fork?

Alternatively you could also manually call the copy-on-write handler
functions but this adds complexity for little gain.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-18 12:43:13

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

> i'd just add MAP_DONTCOPY, and use a new non-MAP_DONTCOPY descriptor for
> the forked process. It's clearly possible with SETTID and SETTLS, nothing
> says that the new process must have the same TLS as the old one.
But it must have the same stack, and since you don't know which thread
is going to fork, the only way is to set MAP_DONTCOPY on everything and
then tell the kernel the ignore it at fork time.

An additional extension could be a MAP_THREAD flag that causes the vma
to be put in a linked list of vmas that are freed when the thread exits.

One would then use MAP_DONTCOPY | MAP_THREAD for the thread stacks
(maybe not setting MAP_THREAD unless there are a lot of thread, so that
an mmap doesn't have to be done for each pthread_create), only
MAP_DONTCOPY for the thread descriptors if they are
PTHREAD_CREATE_JOINABLE and MAP_DONTCOPY | MAP_THREAD if they are
PTHREAD_CREATE_DETACHED and then tell clone to copy the stack and
descriptor of the current process ignoring the MAP_DONTCOPY.

To do this, we need to add support for MAP_DONTCOPY and MAP_THREAD in
mmap and also in mprotect (by, for example, shifting the MAP_ flags 8
bit to the left and or'ing the PROT_ flags) and add a way to override
them at clone time.

For clone, it seems reasonable to implement this using an userspace
array pointed to by %ebp.
Array entries would contain a start address, an end address, a set of
flags to AND in mprotect format and a set of flags to XOR in mprotect
format (96 bits per entry if the mprotect flags fit in 16 bits).

The kernel mode implementation can do a normal copy of the vmas, then
scan the array and either remove, add or modify vmas to correct the copy
done before and finally setup pagetables based on the final vma tree.

All this can be further extended by adding support for memory allocation
and sharing memory rather than copy-on-writing it.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-18 12:45:51

by Alan

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

On Mon, 2002-11-18 at 12:26, Ingo Molnar wrote:
>
> and we have VM_DONTCOPY already, which is used by the DRM code. So it
> would only be a question of exporting this API to userspace. The attached
> patch adds MAP_DONTCOPY. I made it unpriviledged, i doubt it has any
> security impact.

What is the behaviour of someone setting VM_DONTCOPY on memory that was
copy on write between a large number of processes (say an executable
image) ? Don't copy - but don't copy from what, from the original
mapping or from the COW mapping of the original mapping ?

Alan

2002-11-18 12:57:25

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

VM_DONTCOPY means that the vma is ignored when the mm is duplicated as a
result of fork or clone without CLONE_VM.
See kernel/fork.c:239


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-18 15:19:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On 18 Nov 2002, Alan Cox wrote:

> What is the behaviour of someone setting VM_DONTCOPY on memory that was
> copy on write between a large number of processes (say an executable
> image) ? Don't copy - but don't copy from what, from the original
> mapping or from the COW mapping of the original mapping ?

it would result in a VM 'hole' - completely unmapped virtual memory with
no vma backing it.

Ingo

2002-11-18 15:54:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On Sun, 17 Nov 2002, Ulrich Drepper wrote:
>
> which works for me. But since in schedule_tail() the code reads
>
> + if (current->user_tid)
> + put_user(current->pid, current->user_tid);
>
> this enables writing the TID even if CLONE_CHILD_SETTID isn't set. The
> question is: how to access the clone flag information in the child?

The alternate approach is to just say that using CLEARTID without SETTID
is a bug. I think that's the right approach.

Linus

2002-11-18 16:17:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3


On 18 Nov 2002, Alan Cox wrote:
>
> What is the behaviour of someone setting VM_DONTCOPY on memory that was
> copy on write between a large number of processes (say an executable
> image) ? Don't copy - but don't copy from what, from the original
> mapping or from the COW mapping of the original mapping ?

It literally means to not copy that VMA _at_all_. Basically, the mapping
simply won't show up in the child. It's kind of the mapping equivalent of
close-on-exec..

NOTE! Ingo - you might want to check with the shared-page-table people
about this, I remember us talking about discontinuing the feature because
it makes shared page tables harder to implement. Other than that it's a
trivial feature.

Linus

2002-11-18 22:19:05

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading fix, tid-2.5.47-A3

Ulrich Drepper wrote:
> [...] It the very basic concept of always having a correct
> thread descriptor. A process created with fork() from an MT app is
> different from one created with fork() from a single threaded app. In
> the latter case the thread descriptors are not used and can contain
> garbage. In the former case the descriptor better be correct all the
> time. It basically the same problem as with setting the TLS "register".
> clone() get the CLONE_TLS flag because the child and parent have
> (potentially) different TLS address and it is not possible to set the
> value before the fork() call (since the parent would have the wrong
> value for some time) nor after the fork() (since then there would be a
> window for a signal to arrive for an uninitialized thread).

Ok, I understand now.

1. You need the child to have a valid thread descriptor immediately after
fork(), and the parent's thread descriptor to be the same
before and after fork().

2. At all times, get_current_thread()->tid must return the current
thread's tid in both the parent and child.

That is fine. Just allocate a new TLS for the child, use
CLONE_SETTID|CLONE_CLEARTID|CLONE_SETTLS in your threaded fork(), and
pass the child's tid address (in the child's tls area).

It does require allocating a new TLS area on fork(). Is that a
problem?

cheers,
-- Jamie

2002-11-20 01:34:23

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus,

could you please make a decision regarding this? I'd like to make a new
nptl release which fixes the bug this patch helps to fix so that we get
testing (also of the kernel issues).

Ingo's last patch has two pointer, one for the parent and one for the
child. The is necessary (despite what Jamie tried to argue) if we want
to have a cfork() implementation which works in MT applications.
cfork() is IMO really necessary if you want to mix threads and fork().

Assume you have an application which forks children and assosicates
certain actions with the termination of a child. When SIGCHLD is
received one of the threads of the app searches, using the PID of the
terminated child, which action has to be performed. It wouldn't find
anything if the thread, which created the child, hasn't yet written the
PID of the child in the appropriate memory location. This can very well
happen and can only be fixed by the kernel writing the PID values.

If you don't agree with this and want to have exactly one pointer, do
you insist on making CHLID_CLEARTID include CHILD_SETTID? I don't see
how this can be useful since unless you also require the CHILD_SETTID
includes CHILD_CLEARTID you still need at least a flag saying the
CLEARTID is not enabled. The alternative is to merge the SET and CLEAR
flag for the child which would mean no functionality like cfork() could
ever be implemented.


Anyway, please let us know your current position. Thanks,

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92ugi2ijCOnn/RHQRAqDQAJ48o5cbgZMvSJYG7G5z9qTfPJ2WwwCfXlJc
jVgLoqqyeCfUrkzTKwWZrAc=
=dKh2
-----END PGP SIGNATURE-----

2002-11-20 01:51:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0


On Tue, 19 Nov 2002, Ulrich Drepper wrote:
>
> could you please make a decision regarding this? I'd like to make a new
> nptl release which fixes the bug this patch helps to fix so that we get
> testing (also of the kernel issues).
>
> Ingo's last patch has two pointer

Ingo's patch (a) doesn't apply any more and (b) was broken anyway. It
sets the child tidptr from the parent _after_ doing the copy_mm(), so the
child setting doesn't work at all for the non-CLONE_VM case. So I don't
have any decision to make.

Linus

2002-11-20 03:30:08

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

Ulrich Drepper wrote:
> Ingo's last patch has two pointer, one for the parent and one for the
> child. The is necessary (despite what Jamie tried to argue) if we want
> to have a cfork() implementation which works in MT applications.
> cfork() is IMO really necessary if you want to mix threads and fork().

Hi Ulrich,

This is "int cfork(pid_t * user_tid_ptr)", yes? I've searched google for
cfork and not found anything fruitful - just references to solaris
patches about a function of the same name.

I agree with you, if you need this functionality:

1. cfork(ptr) equivalent to { pid_t p = fork(); if (p > 0) *ptr = p; }
2. The pid is stored in the parent before any signals can be handled.
3. Don't want to block signals temporarily (of course).

Then yes, you need two pointers, one for the parent's cfork() argument
for SETTID in the parent, and one for the child's thread descriptor
for CLEARTID in the child. Strictly speaking, SETTID does not need to
affect the child (because the child can store the tid itself), but it
would make a lot of sense to do it.

> Assume you have an application which forks children and assosicates
> certain actions with the termination of a child. When SIGCHLD is
> received one of the threads of the app searches, using the PID of the
> terminated child, which action has to be performed. It wouldn't find
> anything if the thread, which created the child, hasn't yet written the
> PID of the child in the appropriate memory location. This can very well
> happen and can only be fixed by the kernel writing the PID values.

If the application is using cfork() and requires the pid stored
atomically at _its_ address, which is separate from the thread
libraries current_thread->tid address, then I agree with Ulrich: two
pointers are best.

It's possible to get by with one pointer, but then you're back to
blocking signals in the cfork() implementation, or making the thread
library horrendous in other ways (complicating ->tid reads everywhere).

(That said, I'm not entirely convinced that blocking signals in cfork()
is so bad, if we assume that cfork() is a relatively expensive
operation anyway...)

-- Jamie

2002-11-20 03:57:42

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jamie Lokier wrote:

> This is "int cfork(pid_t * user_tid_ptr)", yes? I've searched google for
> cfork and not found anything fruitful - just references to solaris
> patches about a function of the same name.

It's just a coincident. I made the name up and there is no function
like that so far, at least as far I know.


> Then yes, you need two pointers, one for the parent's cfork() argument
> for SETTID in the parent, and one for the child's thread descriptor
> for CLEARTID in the child. Strictly speaking, SETTID does not need to
> affect the child (because the child can store the tid itself), but it
> would make a lot of sense to do it.

The CHILD_SETTID is needed, too, since otherwise the PID isn't stored in
the child's thread descriptor.


> (That said, I'm not entirely convinced that blocking signals in cfork()
> is so bad, if we assume that cfork() is a relatively expensive
> operation anyway...)

It could mean a signal cannot be delivered and reacted on in time. The
other threads could have blocked the signal which arrives. Every time
signals have to be blocked to implement a function something is wrong,

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92wnC2ijCOnn/RHQRAqDqAJ9gfHvRN/Lz04EXd04h4VdcNlWjWgCghEjG
Cuf+eoUKcJ+9+BcyqpeY/sU=
=iW0/
-----END PGP SIGNATURE-----

2002-11-20 07:28:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0


On Tue, 19 Nov 2002, Ulrich Drepper wrote:

> Ingo's last patch has two pointer, one for the parent and one for the
> child. [...]

i was waiting for you to test that patch - there's no point in trying to
put in untested patches. The previous patch i'm sure would not have worked
without major surgery on the fork() side (like the creation of a new
thread descriptor for the child). The attached patch adds the following 3
clone flags:

CLONE_PARENT_SETTID - this sets the TID synchronously in the parent VM
only, parameter on x86 is in %esi.

CLONE_CHILD_SETTID - this sets the TID asynchronously in the child VM
only, parameter on x86 is in %edi.

CLONE_CHILD_CLEARTID - this clears the TID in the child VM upon exit()
or exec(), and wakes up futex waiters. Same
parameter as CLONE_CHILD_SETTID.

old CLONE_SETTID became CLONE_PARENT_SETTID, old CLONE_CLEARTID became
CLONE_CHILD_CLEARTID. CLONE_CHILD_SETTID is a new flag, 0x01000000.

please let me know whether this works, and it would also be nice to add
cfork() at the same time to make sure the interface and the semantics are
correct.

Ingo

--- linux/arch/i386/kernel/smpboot.c.orig 2002-11-20 09:19:38.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2002-11-20 09:19:59.000000000 +0100
@@ -499,7 +499,7 @@
* don't care about the eip and regs settings since
* we'll never reschedule the forked task.
*/
- return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL);
+ return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL, NULL);
}

/* which physical APIC ID maps to which logical CPU number */
--- linux/arch/i386/kernel/process.c.orig 2002-11-20 09:19:38.000000000 +0100
+++ linux/arch/i386/kernel/process.c 2002-11-20 09:27:42.000000000 +0100
@@ -226,7 +226,7 @@
regs.eflags = 0x286;

/* Ok, create the new process.. */
- p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL);
+ p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -288,7 +288,7 @@
struct_cpy(childregs, regs);
childregs->eax = 0;
childregs->esp = esp;
- p->user_tid = NULL;
+ p->set_child_tid = p->clear_child_tid = NULL;

p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -503,7 +503,7 @@
{
struct task_struct *p;

- p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -512,14 +512,15 @@
struct task_struct *p;
unsigned long clone_flags;
unsigned long newsp;
- int *user_tid;
+ int *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;
newsp = regs.ecx;
- user_tid = (int *)regs.edx;
+ parent_tidptr = (int *)regs.edx;
+ child_tidptr = (int *)regs.esi;
if (!newsp)
newsp = regs.esp;
- p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, user_tid);
+ p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, parent_tidptr, child_tidptr);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -537,7 +538,7 @@
{
struct task_struct *p;

- p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

--- linux/arch/i386/kernel/entry.S.orig 2002-11-20 09:25:32.000000000 +0100
+++ linux/arch/i386/kernel/entry.S 2002-11-20 09:25:37.000000000 +0100
@@ -193,10 +193,8 @@


ENTRY(ret_from_fork)
-#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
-#endif
GET_THREAD_INFO(%ebx)
jmp syscall_exit

--- linux/include/linux/sched.h.orig 2002-11-20 09:18:47.000000000 +0100
+++ linux/include/linux/sched.h 2002-11-20 09:25:19.000000000 +0100
@@ -46,10 +46,11 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
+#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
+#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,8 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *set_child_tid; /* CLONE_CHILD_SETTID */
+ int *clear_child_tid; /* CLONE_CHILD_CLEARTID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -615,7 +617,7 @@
extern task_t *child_reaper;

extern int do_execve(char *, char **, char **, struct pt_regs *);
-extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *);
+extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *, int *);

#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
--- linux/kernel/fork.c.orig 2002-11-20 09:19:38.000000000 +0100
+++ linux/kernel/fork.c 2002-11-20 09:27:12.000000000 +0100
@@ -408,16 +408,16 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->user_tid) {
- int * user_tid = tsk->user_tid;
- tsk->user_tid = NULL;
+ if (tsk->clear_child_tid) {
+ int * tidptr = tsk->clear_child_tid;
+ tsk->clear_child_tid = NULL;

/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
- put_user(0, user_tid);
- sys_futex((unsigned long)user_tid, FUTEX_WAKE, 1, NULL);
+ put_user(0, tidptr);
+ sys_futex((unsigned long)tidptr, FUTEX_WAKE, 1, NULL);
}
}

@@ -680,9 +680,9 @@
p->flags = new_flags;
}

-asmlinkage int sys_set_tid_address(int *user_tid)
+asmlinkage int sys_set_tid_address(int *tidptr)
{
- current->user_tid = user_tid;
+ current->clear_child_tid = tidptr;

return current->pid;
}
@@ -699,7 +699,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
int retval;
struct task_struct *p = NULL;
@@ -823,19 +824,18 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
- /*
- * Notify the child of the TID?
- */
+
retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
+ if (clone_flags & CLONE_PARENT_SETTID)
+ if (put_user(p->pid, parent_tidptr))
goto bad_fork_cleanup_namespace;
-
+ if (clone_flags & CLONE_CHILD_SETTID)
+ p->set_child_tid = child_tidptr;
/*
- * Does the userspace VM want the TID cleared on mm_release()?
+ * Clear TID on mm_release()?
*/
- if (clone_flags & CLONE_CLEARTID)
- p->user_tid = user_tid;
+ if (clone_flags & CLONE_CHILD_CLEARTID)
+ p->clear_child_tid = child_tidptr;

/*
* Syscall tracing should be turned off in the child regardless
@@ -1004,7 +1004,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
struct task_struct *p;
int trace = 0;
@@ -1015,7 +1016,7 @@
clone_flags |= CLONE_PTRACE;
}

- p = copy_process(clone_flags, stack_start, regs, stack_size, user_tid);
+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
if (!IS_ERR(p)) {
struct completion vfork;

--- linux/kernel/sched.c.orig 2002-11-20 09:25:45.000000000 +0100
+++ linux/kernel/sched.c 2002-11-20 09:26:49.000000000 +0100
@@ -503,12 +503,12 @@
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
-#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(task_t *prev)
{
finish_arch_switch(this_rq(), prev);
+ if (current->set_child_tid)
+ put_user(current->pid, current->set_child_tid);
}
-#endif

/*
* context_switch - switch to the new MM and the new

2002-11-20 08:29:01

by Ingo Molnar

[permalink] [raw]
Subject: [patch] threading enhancements, tid-2.5.48-A1


here's an update to the patch, Ulrich noticed that the x86 register
parameters were incorrect, the correct use is %edx for the parent pointer,
%edi for the child pointer.

Ingo

--- linux/arch/i386/kernel/smpboot.c.orig 2002-11-20 09:19:38.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2002-11-20 09:19:59.000000000 +0100
@@ -499,7 +499,7 @@
* don't care about the eip and regs settings since
* we'll never reschedule the forked task.
*/
- return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL);
+ return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL, NULL);
}

/* which physical APIC ID maps to which logical CPU number */
--- linux/arch/i386/kernel/process.c.orig 2002-11-20 09:19:38.000000000 +0100
+++ linux/arch/i386/kernel/process.c 2002-11-20 10:33:43.000000000 +0100
@@ -226,7 +226,7 @@
regs.eflags = 0x286;

/* Ok, create the new process.. */
- p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL);
+ p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -288,7 +288,7 @@
struct_cpy(childregs, regs);
childregs->eax = 0;
childregs->esp = esp;
- p->user_tid = NULL;
+ p->set_child_tid = p->clear_child_tid = NULL;

p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -503,7 +503,7 @@
{
struct task_struct *p;

- p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -512,14 +512,15 @@
struct task_struct *p;
unsigned long clone_flags;
unsigned long newsp;
- int *user_tid;
+ int *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;
newsp = regs.ecx;
- user_tid = (int *)regs.edx;
+ parent_tidptr = (int *)regs.edx;
+ child_tidptr = (int *)regs.edi;
if (!newsp)
newsp = regs.esp;
- p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, user_tid);
+ p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, parent_tidptr, child_tidptr);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -537,7 +538,7 @@
{
struct task_struct *p;

- p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

--- linux/arch/i386/kernel/entry.S.orig 2002-11-20 09:25:32.000000000 +0100
+++ linux/arch/i386/kernel/entry.S 2002-11-20 09:25:37.000000000 +0100
@@ -193,10 +193,8 @@


ENTRY(ret_from_fork)
-#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
-#endif
GET_THREAD_INFO(%ebx)
jmp syscall_exit

--- linux/include/linux/sched.h.orig 2002-11-20 09:18:47.000000000 +0100
+++ linux/include/linux/sched.h 2002-11-20 09:25:19.000000000 +0100
@@ -46,10 +46,11 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
+#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
+#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,8 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *set_child_tid; /* CLONE_CHILD_SETTID */
+ int *clear_child_tid; /* CLONE_CHILD_CLEARTID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -615,7 +617,7 @@
extern task_t *child_reaper;

extern int do_execve(char *, char **, char **, struct pt_regs *);
-extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *);
+extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *, int *);

#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
--- linux/kernel/fork.c.orig 2002-11-20 09:19:38.000000000 +0100
+++ linux/kernel/fork.c 2002-11-20 09:27:12.000000000 +0100
@@ -408,16 +408,16 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->user_tid) {
- int * user_tid = tsk->user_tid;
- tsk->user_tid = NULL;
+ if (tsk->clear_child_tid) {
+ int * tidptr = tsk->clear_child_tid;
+ tsk->clear_child_tid = NULL;

/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
- put_user(0, user_tid);
- sys_futex((unsigned long)user_tid, FUTEX_WAKE, 1, NULL);
+ put_user(0, tidptr);
+ sys_futex((unsigned long)tidptr, FUTEX_WAKE, 1, NULL);
}
}

@@ -680,9 +680,9 @@
p->flags = new_flags;
}

-asmlinkage int sys_set_tid_address(int *user_tid)
+asmlinkage int sys_set_tid_address(int *tidptr)
{
- current->user_tid = user_tid;
+ current->clear_child_tid = tidptr;

return current->pid;
}
@@ -699,7 +699,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
int retval;
struct task_struct *p = NULL;
@@ -823,19 +824,18 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
- /*
- * Notify the child of the TID?
- */
+
retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
+ if (clone_flags & CLONE_PARENT_SETTID)
+ if (put_user(p->pid, parent_tidptr))
goto bad_fork_cleanup_namespace;
-
+ if (clone_flags & CLONE_CHILD_SETTID)
+ p->set_child_tid = child_tidptr;
/*
- * Does the userspace VM want the TID cleared on mm_release()?
+ * Clear TID on mm_release()?
*/
- if (clone_flags & CLONE_CLEARTID)
- p->user_tid = user_tid;
+ if (clone_flags & CLONE_CHILD_CLEARTID)
+ p->clear_child_tid = child_tidptr;

/*
* Syscall tracing should be turned off in the child regardless
@@ -1004,7 +1004,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
struct task_struct *p;
int trace = 0;
@@ -1015,7 +1016,7 @@
clone_flags |= CLONE_PTRACE;
}

- p = copy_process(clone_flags, stack_start, regs, stack_size, user_tid);
+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
if (!IS_ERR(p)) {
struct completion vfork;

--- linux/kernel/sched.c.orig 2002-11-20 09:25:45.000000000 +0100
+++ linux/kernel/sched.c 2002-11-20 09:26:49.000000000 +0100
@@ -503,12 +503,12 @@
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
-#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(task_t *prev)
{
finish_arch_switch(this_rq(), prev);
+ if (current->set_child_tid)
+ put_user(current->pid, current->set_child_tid);
}
-#endif

/*
* context_switch - switch to the new MM and the new

2002-11-20 08:34:50

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.48-A1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ingo Molnar wrote:
> here's an update to the patch, Ulrich noticed that the x86 register
> parameters were incorrect, the correct use is %edx for the parent pointer,
> %edi for the child pointer.

This patch works just fine for the adequately adjusted nptl.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE920rQ2ijCOnn/RHQRAgb3AKCsAWcAJxixpO0iUyURrZXxD+ViCACfU5Qc
mmygn+orhoBq2ypStbgSxYI=
=YG+H
-----END PGP SIGNATURE-----

2002-11-20 20:13:23

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.48-A1

But why not pass the child tidptr to ret_from_fork in a child register
rather than in a task_struct field?

The following patch implements this idea, but I only tested successful
compilation, not correct execution.


diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.48_ldb_no_tid/arch/i386/kernel/entry.S linux-2.5.48_ldb/arch/i386/kernel/entry.S
--- linux-2.5.48_ldb_no_tid/arch/i386/kernel/entry.S 2002-11-18 15:30:37.000000000 +0100
+++ linux-2.5.48_ldb/arch/i386/kernel/entry.S 2002-11-20 21:18:49.000000000 +0100
@@ -192,7 +192,20 @@ ENTRY(lcall27)
jmp resume_userspace


+#define CLONE_CHILD_SETTID 0x01000000
ENTRY(ret_from_fork)
+ testl $CLONE_CHILD_SETTID, EBX(%esp)
+ je 2f
+ movl EDI(%esp), %ecx
+ movl EAX(%esp), %eax
+1: movl %eax, (%ecx)
+2:
+.section __ex_table,"a"
+ .align 4
+ .long 1b,2b
+.previous
+
+ movl $0, EAX(%esp)
#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.48_ldb_no_tid/arch/i386/kernel/process.c linux-2.5.48_ldb/arch/i386/kernel/process.c
--- linux-2.5.48_ldb_no_tid/arch/i386/kernel/process.c 2002-11-18 05:29:19.000000000 +0100
+++ linux-2.5.48_ldb/arch/i386/kernel/process.c 2002-11-20 15:26:24.000000000 +0100
@@ -225,7 +225,7 @@ int kernel_thread(int (*fn)(void *), voi
regs.eflags = 0x286;

/* Ok, create the new process.. */
- p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL);
+ p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -285,9 +285,8 @@ int copy_thread(int nr, unsigned long cl

childregs = ((struct pt_regs *) (THREAD_SIZE + (unsigned long) p->thread_info)) - 1;
struct_cpy(childregs, regs);
- childregs->eax = 0;
+ childregs->eax = p->pid;
childregs->esp = esp;
- p->user_tid = NULL;

p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -502,7 +501,7 @@ asmlinkage int sys_fork(struct pt_regs r
{
struct task_struct *p;

- p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -511,14 +510,15 @@ asmlinkage int sys_clone(struct pt_regs
struct task_struct *p;
unsigned long clone_flags;
unsigned long newsp;
- int *user_tid;
+ int *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;
newsp = regs.ecx;
- user_tid = (int *)regs.edx;
+ parent_tidptr = (int *)regs.edx;
+ child_tidptr = (int *)regs.edi;
if (!newsp)
newsp = regs.esp;
- p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, user_tid);
+ p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, parent_tidptr, child_tidptr);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -536,7 +536,7 @@ asmlinkage int sys_vfork(struct pt_regs
{
struct task_struct *p;

- p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.48_ldb_no_tid/arch/i386/kernel/smpboot.c linux-2.5.48_ldb/arch/i386/kernel/smpboot.c
--- linux-2.5.48_ldb_no_tid/arch/i386/kernel/smpboot.c 2002-11-18 05:29:45.000000000 +0100
+++ linux-2.5.48_ldb/arch/i386/kernel/smpboot.c 2002-11-20 14:55:19.000000000 +0100
@@ -498,7 +498,7 @@ static struct task_struct * __init fork_
* don't care about the eip and regs settings since
* we'll never reschedule the forked task.
*/
- return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL);
+ return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL, NULL);
}

/* which physical APIC ID maps to which logical CPU number */
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.48_ldb_no_tid/include/linux/sched.h linux-2.5.48_ldb/include/linux/sched.h
--- linux-2.5.48_ldb_no_tid/include/linux/sched.h 2002-11-18 05:29:22.000000000 +0100
+++ linux-2.5.48_ldb/include/linux/sched.h 2002-11-20 15:31:55.000000000 +0100
@@ -46,10 +46,11 @@ struct exec_domain;
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
+#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
+#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,7 @@ struct task_struct {

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *user_tid; /* CLONE_CHILD_CLEARTID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -615,7 +616,7 @@ extern void daemonize(void);
extern task_t *child_reaper;

extern int do_execve(char *, char **, char **, struct pt_regs *);
-extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *);
+extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *, int *);

#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
diff --exclude-from=/home/ldb/src/linux-exclude -urNdp linux-2.5.48_ldb_no_tid/kernel/fork.c linux-2.5.48_ldb/kernel/fork.c
--- linux-2.5.48_ldb_no_tid/kernel/fork.c 2002-11-18 17:08:55.000000000 +0100
+++ linux-2.5.48_ldb/kernel/fork.c 2002-11-20 15:55:10.000000000 +0100
@@ -409,15 +409,15 @@ void mm_release(void)
complete(vfork_done);
}
if (tsk->user_tid) {
- int * user_tid = tsk->user_tid;
+ int * tidptr = tsk->user_tid;
tsk->user_tid = NULL;

/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
- put_user(0, user_tid);
- sys_futex((unsigned long)user_tid, FUTEX_WAKE, 1, NULL);
+ __put_user(0, tidptr);
+ sys_futex((unsigned long)tidptr, FUTEX_WAKE, 1, NULL);
}
}

@@ -680,9 +680,9 @@ static inline void copy_flags(unsigned l
p->flags = new_flags;
}

-asmlinkage int sys_set_tid_address(int *user_tid)
+asmlinkage int sys_set_tid_address(int *tidptr)
{
- current->user_tid = user_tid;
+ current->user_tid = tidptr;

return current->pid;
}
@@ -699,7 +699,8 @@ static struct task_struct *copy_process(
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
int retval;
struct task_struct *p = NULL;
@@ -716,6 +717,10 @@ static struct task_struct *copy_process(
if ((clone_flags & CLONE_DETACHED) && !(clone_flags & CLONE_THREAD))
return ERR_PTR(-EINVAL);

+ if ((clone_flags & (CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID))
+ && (retval = verify_area(VERIFY_WRITE, child_tidptr, sizeof(int))))
+ return ERR_PTR(retval);
+
retval = security_ops->task_create(clone_flags);
if (retval)
goto fork_out;
@@ -823,19 +828,15 @@ static struct task_struct *copy_process(
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
- /*
- * Notify the child of the TID?
- */
+
retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
+ if (clone_flags & CLONE_PARENT_SETTID)
+ if (put_user(p->pid, parent_tidptr))
goto bad_fork_cleanup_namespace;
-
/*
- * Does the userspace VM want the TID cleared on mm_release()?
+ * Clear TID on mm_release()?
*/
- if (clone_flags & CLONE_CLEARTID)
- p->user_tid = user_tid;
+ p->user_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : 0;

/*
* Syscall tracing should be turned off in the child regardless
@@ -1004,7 +1005,8 @@ struct task_struct *do_fork(unsigned lon
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
struct task_struct *p;
int trace = 0;
@@ -1015,7 +1017,7 @@ struct task_struct *do_fork(unsigned lon
clone_flags |= CLONE_PTRACE;
}

- p = copy_process(clone_flags, stack_start, regs, stack_size, user_tid);
+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
if (!IS_ERR(p)) {
struct completion vfork;



Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2002-11-20 21:47:47

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

Ulrich Drepper wrote:
> > (That said, I'm not entirely convinced that blocking signals in cfork()
> > is so bad, if we assume that cfork() is a relatively expensive
> > operation anyway...)
>
> It could mean a signal cannot be delivered and reacted on in time. The
> other threads could have blocked the signal which arrives. Every time
> signals have to be blocked to implement a function something is wrong,

I don't buy this argument. You block signals, do something, unblock
signals. There may be a _tiny_ delay in delivering the signal - of
the order of a single system call time, i.e. not significant. (That
delay is much shorter than signal delivery time itself). No signals
are actually _lost_, which would be important if it could happen.

Blocking signals briefly is very similar to taking a spinlock. It has
a small overhead, which is probably not significant in the case of
cfork() and its likely applications.

Regarding whether clone() needs a separate child tid_address pointer -
I have no strong opinion (you can implement cfork() with or without),
but you might want to consider, from Glibc's perspective, that there
aren't many argument words left for future uses..

-- Jamie

2002-11-20 22:06:57

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jamie Lokier wrote:

> I don't buy this argument. You block signals, do something, unblock
> signals. There may be a _tiny_ delay in delivering the signal

Tiny? You said yourself that fork can be expensive.


> - of
> the order of a single system call time, i.e. not significant. (That
> delay is much shorter than signal delivery time itself). No signals
> are actually _lost_,

Of course they can get lost. Normal Unix signals are not queued.


- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE93Aiv2ijCOnn/RHQRAippAKCnwjE420nRHMJpGSm86CxNhkgtXwCgjAA3
gqpLLi1ytAanQWzIq+0+sWE=
=TRHu
-----END PGP SIGNATURE-----

2002-11-20 23:21:23

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jamie Lokier wrote:
>
> Erm, am I getting confused here? I'm assuming that you can block
> signals in _only_ the thread that is forking, leaving it unblocked in
> the others. I'm not very familiar with the current threaded signal
> model - is the blocked-signal mask shared between all of them?

Each thread has it's own mask but that also means that a signal can be
blocked by all threads except the one forking.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE93Bqb2ijCOnn/RHQRAnyXAKCRLSumP1tg1q0ZPkw9CLL+Cv87ggCePz0r
e/qB0dwl1DD/JDAAYdkIOSM=
=Y7EM
-----END PGP SIGNATURE-----

2002-11-20 23:18:39

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

Ulrich Drepper wrote:
> > I don't buy this argument. You block signals, do something, unblock
> > signals. There may be a _tiny_ delay in delivering the signal
>
> Tiny? You said yourself that fork can be expensive.

You can't receive a signal in a thread while it is forking anyway.

Erm, am I getting confused here? I'm assuming that you can block
signals in _only_ the thread that is forking, leaving it unblocked in
the others. I'm not very familiar with the current threaded signal
model - is the blocked-signal mask shared between all of them?

-- Jamie

2002-11-21 00:10:12

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

Ulrich Drepper wrote:
> > Erm, am I getting confused here? I'm assuming that you can block
> > signals in _only_ the thread that is forking, leaving it unblocked in
> > the others. I'm not very familiar with the current threaded signal
> > model - is the blocked-signal mask shared between all of them?
>
> Each thread has it's own mask but that also means that a signal can be
> blocked by all threads except the one forking.

Thread calls cfork(), which does this in the parent:

sigprocmask(...)
// Very short time during which signals aren't delivered.
clone(...)
// Very short time during which signals aren't delivered.
sigprocmask(...)

and the child does this:

current_thread->tid = *argument_to_cfork
sigprocmask(...)

The only time that a signal can be delayed due to the sigprocmask()
calls, if all the other threads have blocked it, is immediately before
and after the clone() call. I.e. zero instructions worth of time :)

It can also be delayed during the clone() call, but that is _already_
true; adding the sigprocmask() calls doesn't change that.

(Note that the _only_ reason for blocking signals is because of the
possibility of the child receiving SIGINT or something like that, and
wanting its own current_thread->tid value to be valid in the signal
handler.)

Here's an idea
--------------

I appreciate that the above method of implementing cfork() isn't as
fast as it could be. On the other hand, it works with a simpler
clone(), and an idea has just appeared in my head:

The above pattern is simpler if we add _another_ clone flag:
CLONE_BLOCKSIGS, meaning "set the child's signal mask to block all
signals". That removes the sigprocmask() calls from the parent.

That could be quite a useful flag in a number of situations, whenever
a child thread or process would like to do more complicated things
before re-allowing signals, such as a few dup() calls to set up
stdin/stdout, for example. (Setting current_thread->tid is just a
special case of that, which its possible to handle with an extra
argument to clone(). The general case cannot be handled in that way.)

-- Jamie

2002-11-21 00:29:29

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

Ulrich Drepper wrote:
> > - of
> > the order of a single system call time, i.e. not significant. (That
> > delay is much shorter than signal delivery time itself). No signals
> > are actually _lost_,
>
> Of course they can get lost. Normal Unix signals are not queued.

Oh yes they are. Libc info for "Why Block":

Temporary blocking of signals with `sigprocmask' gives you a way to
prevent interrupts during critical parts of your code. *If signals
arrive in that part of the program, they are delivered later, after you
unblock them.*

And sigpending(2):

The sigpending call allows the examination of pending signals (ones
which have been raised while blocked). The signal mask of pending
signals is stored in set".

-- Jamie

2002-11-21 07:49:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0


On Thu, 21 Nov 2002, Jamie Lokier wrote:

> Thread calls cfork(), which does this in the parent:
>
> sigprocmask(...)
> // Very short time during which signals aren't delivered.
> clone(...)
> // Very short time during which signals aren't delivered.
> sigprocmask(...)
>

Jamie, we've been there, done that. This is precisely the kind of signal
locking cruft we got rid of in LinuxThreads, and which cruft caused it to
be slow. My goal was and still is to isolate signals from the rest of the
kernel APIs as much as possible, while still keeping the traditional
semantics. Check out an strace of a LinuxThreads linked pthread
application and you'll see signal mask manipulation syscalls all around
the place. Check out an NPTL strace, and see all those straightforward
single-syscall operations.

sure, fork() has some overhead larger than signal manipulation costs, but
this does not make the approach right in any way. If all this userspace
cost can be dealt with by doing some simple things in kernel-space, why
not do it?

Ingo

2002-11-21 11:59:25

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.47-C0

Hi Ingo,

I'm just being anal and disagreeing with Ulrich over a technical
point, i.e. that the _only_ cost is a few syscalls, nothing
fundamental here :)

Ingo Molnar wrote:
> If all this userspace cost can be dealt with by doing some simple
> things in kernel-space, why not do it?

Agreed - separate child/parent tid pointers is fine.

Btw, what do you think of the CLONE_CHILD_BLOCKSIGS flag idea?

-- Jamie

2002-11-21 16:39:36

by Ingo Molnar

[permalink] [raw]
Subject: [patch] threading enhancements, tid-2.5.48-C0


On 20 Nov 2002, Luca Barbieri wrote:

> But why not pass the child tidptr to ret_from_fork in a child register
> rather than in a task_struct field?

this method is quite dangerous as the register usage is largely ad-hoc in
the x86 lowlevel code. Eg. your %ebx use clashes with that of kernel
threads, which also go through ret_from_fork. The schedule_tail obviously
needs to be done before writing to userspace, we are holding the runqueue
lock. There might be other problems as well.

Anyway, i found it simpler to have an explicit parameter passing
interface. I'd suggest for you to update your optimization relative to the
attached patch, and make it work on SMP and with kernel threads - what
matters now is the semantics and correctness of the API, we can do
non-obvious performance optimizations relative to them.

i dont like the verify_area()/__put_user() splitup you did, because it
does not bring many advantages. verify_area() only checks whether the
pointer is in userspace, it does not check other things like pagetable
validity, etc. So the overwhelming majority of -EFAULT cases are still
hidden asynchronously.

there's one more improvement in the interface: set the parent-TID prior
doing the copy_mm() - this helps cfork() to pass the TID to the child as
well. The attached patch (against BK-curr) was tested by Ulrich and it
works fine.

Ingo

--- linux/arch/i386/kernel/smpboot.c.orig 2002-11-21 11:51:26.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2002-11-21 11:51:55.000000000 +0100
@@ -499,7 +499,7 @@
* don't care about the eip and regs settings since
* we'll never reschedule the forked task.
*/
- return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL);
+ return do_fork(CLONE_VM|CLONE_IDLETASK, 0, &regs, 0, NULL, NULL);
}

/* which physical APIC ID maps to which logical CPU number */
--- linux/arch/i386/kernel/process.c.orig 2002-11-21 11:51:26.000000000 +0100
+++ linux/arch/i386/kernel/process.c 2002-11-21 11:51:55.000000000 +0100
@@ -226,7 +226,7 @@
regs.eflags = 0x286;

/* Ok, create the new process.. */
- p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL);
+ p = do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -288,7 +288,7 @@
struct_cpy(childregs, regs);
childregs->eax = 0;
childregs->esp = esp;
- p->user_tid = NULL;
+ p->set_child_tid = p->clear_child_tid = NULL;

p->thread.esp = (unsigned long) childregs;
p->thread.esp0 = (unsigned long) (childregs+1);
@@ -503,7 +503,7 @@
{
struct task_struct *p;

- p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -512,14 +512,15 @@
struct task_struct *p;
unsigned long clone_flags;
unsigned long newsp;
- int *user_tid;
+ int *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;
newsp = regs.ecx;
- user_tid = (int *)regs.edx;
+ parent_tidptr = (int *)regs.edx;
+ child_tidptr = (int *)regs.edi;
if (!newsp)
newsp = regs.esp;
- p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, user_tid);
+ p = do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0, parent_tidptr, child_tidptr);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

@@ -537,7 +538,7 @@
{
struct task_struct *p;

- p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL);
+ p = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
return IS_ERR(p) ? PTR_ERR(p) : p->pid;
}

--- linux/arch/i386/kernel/entry.S.orig 2002-11-21 11:51:26.000000000 +0100
+++ linux/arch/i386/kernel/entry.S 2002-11-21 11:51:55.000000000 +0100
@@ -170,10 +170,8 @@


ENTRY(ret_from_fork)
-#if CONFIG_SMP || CONFIG_PREEMPT
# NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
-#endif
GET_THREAD_INFO(%ebx)
jmp syscall_exit

--- linux/include/linux/sched.h.orig 2002-11-21 11:51:16.000000000 +0100
+++ linux/include/linux/sched.h 2002-11-21 11:51:55.000000000 +0100
@@ -46,10 +46,11 @@
#define CLONE_NEWNS 0x00020000 /* New namespace group? */
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
-#define CLONE_SETTID 0x00100000 /* write the TID back to userspace */
-#define CLONE_CLEARTID 0x00200000 /* clear the userspace TID */
-#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
-#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
+#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
+#define CLONE_DETACHED 0x00400000 /* parent wants no child-exit signal */
+#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
+#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */

/*
* List of flags we want to share for kernel threads,
@@ -332,7 +333,8 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
- int *user_tid; /* for CLONE_CLEARTID */
+ int *set_child_tid; /* CLONE_CHILD_SETTID */
+ int *clear_child_tid; /* CLONE_CHILD_CLEARTID */

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -615,7 +617,7 @@
extern task_t *child_reaper;

extern int do_execve(char *, char **, char **, struct pt_regs *);
-extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *);
+extern struct task_struct *do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int *, int *);

#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
--- linux/kernel/fork.c.orig 2002-11-21 11:51:27.000000000 +0100
+++ linux/kernel/fork.c 2002-11-21 11:52:45.000000000 +0100
@@ -408,16 +408,16 @@
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->user_tid) {
- int * user_tid = tsk->user_tid;
- tsk->user_tid = NULL;
+ if (tsk->clear_child_tid) {
+ int * tidptr = tsk->clear_child_tid;
+ tsk->clear_child_tid = NULL;

/*
* We dont check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
- put_user(0, user_tid);
- sys_futex((unsigned long)user_tid, FUTEX_WAKE, 1, NULL);
+ put_user(0, tidptr);
+ sys_futex((unsigned long)tidptr, FUTEX_WAKE, 1, NULL);
}
}

@@ -680,9 +680,9 @@
p->flags = new_flags;
}

-asmlinkage int sys_set_tid_address(int *user_tid)
+asmlinkage int sys_set_tid_address(int *tidptr)
{
- current->user_tid = user_tid;
+ current->clear_child_tid = tidptr;

return current->pid;
}
@@ -699,7 +699,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
int retval;
struct task_struct *p = NULL;
@@ -766,6 +767,11 @@
if (p->pid == -1)
goto bad_fork_cleanup;
}
+ retval = -EFAULT;
+ if (clone_flags & CLONE_PARENT_SETTID)
+ if (put_user(p->pid, parent_tidptr))
+ goto bad_fork_cleanup;
+
p->proc_dentry = NULL;

INIT_LIST_HEAD(&p->run_list);
@@ -823,19 +829,14 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
- /*
- * Notify the child of the TID?
- */
- retval = -EFAULT;
- if (clone_flags & CLONE_SETTID)
- if (put_user(p->pid, user_tid))
- goto bad_fork_cleanup_namespace;

+ if (clone_flags & CLONE_CHILD_SETTID)
+ p->set_child_tid = child_tidptr;
/*
- * Does the userspace VM want the TID cleared on mm_release()?
+ * Clear TID on mm_release()?
*/
- if (clone_flags & CLONE_CLEARTID)
- p->user_tid = user_tid;
+ if (clone_flags & CLONE_CHILD_CLEARTID)
+ p->clear_child_tid = child_tidptr;

/*
* Syscall tracing should be turned off in the child regardless
@@ -1004,7 +1005,8 @@
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
- int *user_tid)
+ int *parent_tidptr,
+ int *child_tidptr)
{
struct task_struct *p;
int trace = 0;
@@ -1015,7 +1017,7 @@
clone_flags |= CLONE_PTRACE;
}

- p = copy_process(clone_flags, stack_start, regs, stack_size, user_tid);
+ p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
if (!IS_ERR(p)) {
struct completion vfork;

--- linux/kernel/sched.c.orig 2002-11-21 11:51:15.000000000 +0100
+++ linux/kernel/sched.c 2002-11-21 11:51:55.000000000 +0100
@@ -503,12 +503,12 @@
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
-#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(task_t *prev)
{
finish_arch_switch(this_rq(), prev);
+ if (current->set_child_tid)
+ put_user(current->pid, current->set_child_tid);
}
-#endif

/*
* context_switch - switch to the new MM and the new

2002-11-21 19:23:52

by Luca Barbieri

[permalink] [raw]
Subject: Re: [patch] threading enhancements, tid-2.5.48-C0

> this method is quite dangerous as the register usage is largely ad-hoc in
> the x86 lowlevel code. Eg. your %ebx use clashes with that of kernel
> threads, which also go through ret_from_fork.
Yes, I realize that it was a bad idea, since bloat in task_struct can be
avoided by putting clear_tid in an union other temporary data (e.g.
*link_count), without using arch-specific code (and this is a obviously
a separate patch).


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part