LinuxLists.cc - [PATCH] fix for zeroed user-space tids in multi-threaded core dumps

2006-10-26 01:42:14

Subject: [PATCH] fix for zeroed user-space tids in multi-threaded core dumps

Hi, Andrew. Please consider the patch below for the next open 2.6 release.

The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
on behalf of pthread_create() library calls. This feature is used to
request that the kernel clear the thread-id in user space (at an address
provided in the syscall) when the thread disassociates itself from the
address space, which is done in mm_release().

Unfortunately, when a multi-threaded process incurs a core dump (such as
from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
the other threads, which then proceed to clear their user-space tids
before synchronizing in exit_mm() with the start of core dumping. This
misrepresents the state of process's address space at the time of the
SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
problems (misleading him/her to conclude that the threads had gone away
before the fault).

The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
core dump has been initiated.

Cheers. -ernie

--- linux-2.6.18/kernel/fork.c.orig
+++ linux-2.6.18/kernel/fork.c
@@ -439,7 +439,9 @@ void mm_release(struct task_struct *tsk,
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->clear_child_tid && atomic_read(&mm->mm_users) > 1) {
+ if (tsk->clear_child_tid &&
+ mm->core_waiters == 0 &&
+ atomic_read(&mm->mm_users) > 1) {
u32 __user * tidptr = tsk->clear_child_tid;
tsk->clear_child_tid = NULL;

2006-10-27 01:59:37

by Roland McGrath

[permalink] [raw]

Subject: Re: [PATCH] fix for zeroed user-space tids in multi-threaded core dumps

I concur with the analysis and the approach to avoiding the problem. Any
change half this subtle deserves some comments in the code. Other places
that look at mm->core_waiters hold mm->mmap_sem, though I'm not entirely
sure it matters in practice here.

I would actually go further and skip the clearing on any signal exit,
rather than looking specifically for the core dump issue. I've CC'd
Ulrich, the creator of the feature, who can verify that the intended
userland ABI specification has always been concerned only with the sys_exit
path. It's arguably incorrect by POSIX as it is now (and would still be
with just Ernie's change), in that a thread taking SIGTERM should not cause
the race where another thread's pthread_join on that thread might return
before the SIGTERM took effect to kill all the threads. (I'm not at all
sure such a race is actually possible, due to other interactions.)

This patch is race-free in that it depends only on the activities of the
current thread itself. If it had started exiting normally before the
process-wide core dump happened, then the core dump can show it that way.
The PF_SIGNALED check covers the core dump case, and also the normal
no-core fatal signal case (so a superset of what Ernie's change covers
modulo the race difference, though still not including the several obscure
cases of direct do_exit calls that aren't from the normal signal code).

Thanks,
Roland

---
[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit.

The CLONE_CHILD_CLEARTID flag is used by NPTL to have its threads
communicate via memory/futex when they exit, so pthread_join can
synchronize using a simple futex wait. The word of user memory where NPTL
stores a thread's own TID is what it passes; this gets reset to zero at
thread exit.

It is not desireable to touch this user memory when threads are dying due
to a fatal signal. A core dump is more usefully representative of the
dying program state if the threads live at the time of the crash have their
NPTL data structures unperturbed. The userland expectation of
CLONE_CHILD_CLEARTID has only ever been that it works for a thread making
an _exit system call.

This problem was identified by Ernie Petrides <[email protected]>.

Signed-off-by: Roland McGrath <[email protected]>
---
kernel/fork.c | 21 +++++++++++++++------
1 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 29ebb30..da3e2e1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -448,7 +448,16 @@ void mm_release(struct task_struct *tsk,
tsk->vfork_done = NULL;
complete(vfork_done);
}
- if (tsk->clear_child_tid && atomic_read(&mm->mm_users) > 1) {
+
+ /*
+ * If we're exiting normally, clear a user-space tid field if
+ * requested. We leave this alone when dying by signal, to leave
+ * the value intact in a core dump, and to save the unnecessary
+ * trouble otherwise. Userland only wants this done for a sys_exit.
+ */
+ if (tsk->clear_child_tid
+ && !(tsk->flags & PF_SIGNALED)
+ && atomic_read(&mm->mm_users) > 1) {
u32 __user * tidptr = tsk->clear_child_tid;
tsk->clear_child_tid = NULL;

2015-08-20 14:48:42

by Andreas Schwab

[permalink] [raw]

Subject: Re: [PATCH] fix for zeroed user-space tids in multi-threaded core dumps

Roland McGrath <[email protected]> writes:

> [PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit.
>
> The CLONE_CHILD_CLEARTID flag is used by NPTL to have its threads
> communicate via memory/futex when they exit, so pthread_join can
> synchronize using a simple futex wait. The word of user memory where NPTL
> stores a thread's own TID is what it passes; this gets reset to zero at
> thread exit.
>
> It is not desireable to touch this user memory when threads are dying due
> to a fatal signal. A core dump is more usefully representative of the
> dying program state if the threads live at the time of the crash have their
> NPTL data structures unperturbed. The userland expectation of
> CLONE_CHILD_CLEARTID has only ever been that it works for a thread making
> an _exit system call.

This breaks nscd. It uses CLONE_CHILD_CLEARTID to clear the
nscd_certainly_running flag in the shared databases, so that the clients
are notified when nscd is restarted. Now, when nscd uses a
non-persistent database, the clients that have it mapped keep thinking
the database is being updated by nscd, when in fact nscd has created a
new (anonymous) one (for non-persistent databases it uses an unlinked
file as backend).

Andreas.

--
Andreas Schwab, SUSE Labs, [email protected]
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."