2002-08-19 12:10:56

by Ingo Molnar

[permalink] [raw]
Subject: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


this patch is the next step in the journey to get top-notch threading
support implemented under Linux.

every Linux kernel in existence, including 2.5.31, has a fundamentally
unscalable sys_exit() implementation, with an overhead that goes up if the
number of tasks in the system goes up. It does not matter whether those
tasks are doing any work - just sleeping indefinitely causes sys_exit()
overhead to go up.

200 tasks is typical for a relatively idle server system. 1000 tasks or
more is not uncommon during usual server load on a midrange server.
There are servers that use 5000 tasks or more. It is not uncommon for
highly threaded code to use even more threads - client/server models are
often the easiest to implement by using a per-connection thread model.
[Eg. vsftpd, one of the fastest and most secure FTP servers under Linux,
uses 2 (often 3) threads per client connection [which, for security reason
is implemented via inside per-client isolated processes].]

Some numbers to back this up, i've tested 2.5.31-BK-curr on a 525 MHz
PIII, it produces the following lat_proc fork+exit latencies:

# of tasks in the system 200 1000 2000 4000
------------------------------------------------------------------
./lat_proc fork+exit (microseconds): 743.0 923.1 1153.4 1622.2

it can be seen that the fork()+exit() overhead more than doubles with
every 4000 tasks in the system.


for threaded applications the situation is even worse. A threading
benchmark that just tests the (linear) creation and exit of 100 thousand
threads using the old glibc libpthreads library, gives the following
results:

# of tasks in the system 200 1000 2000 4000
------------------------------------------------------------------
./perf -s 100000 -t 1 -r 0 -T
in seconds: 17.8 37.3 61.1 108.0

latency of single-thread create+exit
in microseconds: 178 373 611 1080

using the same test linked against new libpthreads:

# of tasks in the system 200 1000 2000 4000
------------------------------------------------------------------
./perf -s 100000 -t 1 -r 0 -T
in seconds: 6.8 25.6 48.7 95.1

latency of single-thread create+exit
in microseconds: 68 256 487 951

the same regression happens as in the old-pthreads case, but with a
(dramatically) lower baseline [which are due to the other optimizations].
With a couple of hundred threads created, the thread create+exit latency
becomes dominated by the hefty sys_exit() overhead.

all in one - sys_exit() is O(nr_tasks), and heavily so - even a reasonable
number of completely idle tasks increase the exit() overhead very visibly.


why is sys_exit() so expensive? The main overhead is in
forget_original_parent(), which must be called for every thread that
exits: the kernel has to find all children the exiting task has created
during its lifetime, and has to 'reparent' them to some other task. The
current implementation uses a for_each_task() over every task in the
system, and finds those tasks whose real parent is the now exiting task.
This is a fundamental work of the kernel that cannot be optimized away -
the child/parent tree must always stay coherent.

but the for_each_task() iteration is excessive. There is a subtle reason
for it though: ptrace reparents debugged tasks to the debugging task,
which detaches the child from the original parent. Thus
forget_original_parent() has to search the whole tasklist, to make sure
even debugged tasks are properly reparented.

the attached patch (against BK-curr) reimplements this logic in a scalable
way: the pthreads code also maintains a global list of debugged tasks -
which list is usually empty in a normal system. It has at most a few tasks
- those one which are debugged currently. This list can be maintained very
cheaply, in a number of strategic places.

forget_original_parent() then searched the exiting task's ->children list,
plus the global ptrace list. In the usual 'task has no children and there
are no debugged tasks around' case this becomes as cheap as two
list_empty() tests!

now on to the performance results, on the same 525 MHz PIII box, lat_proc:

# of tasks in the system 200 1000 2000 4000
----------------------------------------------------------------------
./lat_proc fork+exit (microseconds):
657.1 680.6 640.8 682.5

(unpatched kernel results): (743.0) (923.1) (1153.4) (1622.2)

process creation latency is essentially constant, it's independent of the
number of tasks in the system. Even the baseline results got improved by
more than 10%. For the 4000 tasks case the speedup is more than 130%.

the speedup is even bigger for threaded applications using the old
pthreads code:

# of tasks in the system 200 1000 2000 4000
--------------------------------------------------------------------
./perf -s 100000 -t 1 -r 0 -T
in seconds: 12.6 12.8 11.9 11.9
(unpatched results): (17.8) (37.3) (61.1) (108.0)

latency of single-thread create+exit
in microseconds: 126 128 119 119
(unpatched kernel): (178) (373) (611) (1080)

improvement: 41% 191% 413% 807%

ie. in the 4000 tasks case the improvement is almost 10-fold!

testing the new glibc libpthreads code shows dramatic improvements:

# of tasks in the system 200 1000 2000 4000
------------------------------------------------------------------
./perf -s 100000 -t 1 -r 0 -T
in seconds: 1.7 1.9 1.9 1.9
(unpatched results): (6.8) (25.6) (48.7) (95.1)

latency of single-thread create+exit
in microseconds: 17 19 19 19
(unpatched kernel): (68) (256) (487) (951)


improvement: 300% 1247% 2463% 4900%

in the 200 tasks case the speedup is 4-fold, in the 4000 tasks case the
speedup is 50-fold (!). Thread create+destroy latency is a steady 19
usecs. This also enables the head-to-head comparison of old pthreads and
new libpthreads: new libpthreads is more than 6 times faster. This is what
i'd finally call good threading performance.

the hardest part of the patch was to make sure ptrace() semantics are
still correct. I believe i have managed to at least test the typical
workload: i've tested a complex mix of high-load strace situations,
threaded and unthreaded code as well, SMP and UP, so i'm reasonably
certain that it works for the kind of load i use on my systems. [ But
ptrace() is complex beyond belief, so i might as well have missed some of
the subtler items. ]

- the patch also includes another optimization to make bash's wait4() job
easier as well: wait4() does not have to consider non-debugged
CLONE_DETACHED tasks, it wont get any exit code from them, and those
tasks can clean themselves up.

- the patch also cleans up the <linux/ptrace.h> includes, and moves the
pthread specific declarations from mm.h to ptrace.h - most of the
existing includes were bogus.

Comments?

Ingo

--- linux/arch/i386/kernel/irq.c.orig Mon Aug 19 12:10:27 2002
+++ linux/arch/i386/kernel/irq.c Mon Aug 19 12:11:21 2002
@@ -18,7 +18,6 @@
*/

#include <linux/config.h>
-#include <linux/ptrace.h>
#include <linux/errno.h>
#include <linux/signal.h>
#include <linux/sched.h>
--- linux/arch/i386/kernel/i8259.c.orig Mon Aug 19 12:11:00 2002
+++ linux/arch/i386/kernel/i8259.c Mon Aug 19 12:11:05 2002
@@ -1,5 +1,4 @@
#include <linux/config.h>
-#include <linux/ptrace.h>
#include <linux/errno.h>
#include <linux/signal.h>
#include <linux/sched.h>
--- linux/arch/i386/kernel/process.c.orig Mon Aug 19 12:38:47 2002
+++ linux/arch/i386/kernel/process.c Mon Aug 19 12:38:49 2002
@@ -23,7 +23,6 @@
#include <linux/smp_lock.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
-#include <linux/ptrace.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/user.h>
--- linux/arch/i386/kernel/signal.c.orig Mon Aug 19 12:38:56 2002
+++ linux/arch/i386/kernel/signal.c Mon Aug 19 12:38:58 2002
@@ -15,7 +15,6 @@
#include <linux/signal.h>
#include <linux/errno.h>
#include <linux/wait.h>
-#include <linux/ptrace.h>
#include <linux/unistd.h>
#include <linux/stddef.h>
#include <linux/personality.h>
--- linux/arch/i386/kernel/traps.c.orig Mon Aug 19 12:39:05 2002
+++ linux/arch/i386/kernel/traps.c Mon Aug 19 12:39:06 2002
@@ -16,7 +16,6 @@
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/errno.h>
-#include <linux/ptrace.h>
#include <linux/timer.h>
#include <linux/mm.h>
#include <linux/init.h>
--- linux/arch/i386/kernel/vm86.c.orig Mon Aug 19 12:39:17 2002
+++ linux/arch/i386/kernel/vm86.c Mon Aug 19 12:39:20 2002
@@ -8,7 +8,6 @@
#include <linux/kernel.h>
#include <linux/signal.h>
#include <linux/string.h>
-#include <linux/ptrace.h>
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/smp_lock.h>
--- linux/include/linux/sched.h.orig Mon Aug 19 11:57:00 2002
+++ linux/include/linux/sched.h Mon Aug 19 11:57:06 2002
@@ -270,6 +270,7 @@
unsigned int time_slice, first_time_slice;

struct list_head tasks;
+ struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
struct list_head local_pages;
--- linux/include/linux/mm.h.orig Mon Aug 19 11:57:25 2002
+++ linux/include/linux/mm.h Mon Aug 19 12:16:24 2002
@@ -354,12 +354,6 @@
extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
extern int make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
-extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char *dst, int len);
-extern int ptrace_writedata(struct task_struct *tsk, char * src, unsigned long dst, int len);
-extern int ptrace_attach(struct task_struct *tsk);
-extern int ptrace_detach(struct task_struct *, unsigned int);
-extern void ptrace_disable(struct task_struct *);
-extern int ptrace_check_attach(struct task_struct *task, int kill);

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
--- linux/include/linux/ptrace.h.orig Mon Aug 19 12:08:30 2002
+++ linux/include/linux/ptrace.h Mon Aug 19 12:37:12 2002
@@ -23,4 +23,28 @@

#include <asm/ptrace.h>

+#ifdef __KERNEL__
+extern list_t ptrace_tasks;
+
+extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char *dst, int len);
+extern int ptrace_writedata(struct task_struct *tsk, char * src, unsigned long dst, int len);
+extern int ptrace_attach(struct task_struct *tsk);
+extern int ptrace_detach(struct task_struct *, unsigned int);
+extern void ptrace_disable(struct task_struct *);
+extern int ptrace_check_attach(struct task_struct *task, int kill);
+extern void __ptrace_link(struct task_struct *child, struct task_struct *new_parent);
+extern void __ptrace_unlink(struct task_struct *child);
+
+static inline void ptrace_link(struct task_struct *child, struct task_struct *new_parent)
+{
+ if (child->ptrace)
+ __ptrace_link(child, new_parent);
+}
+static inline void ptrace_unlink(struct task_struct *child)
+{
+ if (child->ptrace)
+ __ptrace_unlink(child);
+}
+#endif
+
#endif
--- linux/include/linux/binfmts.h.orig Mon Aug 19 12:19:45 2002
+++ linux/include/linux/binfmts.h Mon Aug 19 12:20:30 2002
@@ -1,7 +1,6 @@
#ifndef _LINUX_BINFMTS_H
#define _LINUX_BINFMTS_H

-#include <linux/ptrace.h>
#include <linux/capability.h>

/*
--- linux/include/linux/elfcore.h.orig Mon Aug 19 12:22:38 2002
+++ linux/include/linux/elfcore.h Mon Aug 19 12:22:41 2002
@@ -4,7 +4,6 @@
#include <linux/types.h>
#include <linux/signal.h>
#include <linux/time.h>
-#include <linux/ptrace.h>
#include <linux/user.h>

struct elf_siginfo
--- linux/include/asm-i386/smp.h.orig Mon Aug 19 12:12:27 2002
+++ linux/include/asm-i386/smp.h Mon Aug 19 12:12:32 2002
@@ -7,7 +7,6 @@
#ifndef __ASSEMBLY__
#include <linux/config.h>
#include <linux/threads.h>
-#include <linux/ptrace.h>
#endif

#ifdef CONFIG_X86_LOCAL_APIC
--- linux/include/asm-i386/user.h.orig Mon Aug 19 12:22:19 2002
+++ linux/include/asm-i386/user.h Mon Aug 19 12:22:21 2002
@@ -2,7 +2,6 @@
#define _I386_USER_H

#include <asm/page.h>
-#include <linux/ptrace.h>
/* Core file format: The core file is written in such a way that gdb
can understand it and provide useful information to the user (under
linux we use the 'trad-core' bfd). There are quite a number of
--- linux/kernel/exit.c.orig Mon Aug 19 11:45:38 2002
+++ linux/kernel/exit.c Mon Aug 19 12:40:51 2002
@@ -18,6 +18,7 @@
#include <linux/acct.h>
#include <linux/file.h>
#include <linux/binfmts.h>
+#include <linux/ptrace.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -65,6 +66,8 @@
atomic_dec(&p->user->processes);
security_ops->task_free_security(p);
free_uid(p->user);
+ if (p->ptrace || !list_empty(&p->ptrace_list))
+ BUG();
unhash_process(p);

release_thread(p);
@@ -177,6 +180,7 @@
{
write_lock_irq(&tasklist_lock);

+ ptrace_unlink(current);
/* Reparent to init */
REMOVE_LINKS(current);
current->parent = child_reaper;
@@ -231,6 +235,22 @@
atomic_inc(&current->files->count);
}

+static void reparent_thread(task_t *p, task_t *reaper, task_t *child_reaper)
+{
+ /* We dont want people slaying init */
+ p->exit_signal = SIGCHLD;
+ p->self_exec_id++;
+
+ /* Make sure we're not reparenting to ourselves */
+ if (p == reaper)
+ p->real_parent = child_reaper;
+ else
+ p->real_parent = reaper;
+
+ if (p->pdeath_signal)
+ send_sig(p->pdeath_signal, p, 0);
+}
+
/*
* When we die, we re-parent all our children.
* Try to give them to another thread in our thread
@@ -239,7 +259,8 @@
*/
static inline void forget_original_parent(struct task_struct * father)
{
- struct task_struct * p, *reaper;
+ struct task_struct *p, *reaper;
+ list_t *_p;

read_lock(&tasklist_lock);

@@ -254,20 +275,22 @@
if (reaper == father)
reaper = child_reaper;

- for_each_task(p) {
- if (p->real_parent == father) {
- /* We dont want people slaying init */
- p->exit_signal = SIGCHLD;
- p->self_exec_id++;
-
- /* Make sure we're not reparenting to ourselves */
- if (p == reaper)
- p->real_parent = child_reaper;
- else
- p->real_parent = reaper;
-
- if (p->pdeath_signal) send_sig(p->pdeath_signal, p, 0);
- }
+ /*
+ * There are only two places where our children can be:
+ *
+ * - in our child list
+ * - in the global ptrace list
+ *
+ * Search them and reparent children.
+ */
+ list_for_each(_p, &father->children) {
+ p = list_entry(_p,struct task_struct,sibling);
+ reparent_thread(p, reaper, child_reaper);
+ }
+ list_for_each(_p, &ptrace_tasks) {
+ p = list_entry(_p,struct task_struct,ptrace_list);
+ if (p->real_parent == father)
+ reparent_thread(p, reaper, child_reaper);
}
read_unlock(&tasklist_lock);
}
@@ -485,11 +508,12 @@
if (current->exit_signal != -1)
do_notify_parent(current, current->exit_signal);
while ((p = eldest_child(current))) {
+ ptrace_unlink(p);
list_del_init(&p->sibling);
p->ptrace = 0;

p->parent = p->real_parent;
- list_add_tail(&p->sibling,&p->parent->children);
+ list_add_tail(&p->sibling, &p->parent->children);
if (p->state == TASK_ZOMBIE && p->exit_signal != -1)
do_notify_parent(p, p->exit_signal);
/*
@@ -623,6 +647,12 @@
if (p->pgrp != -pid)
continue;
}
+ /*
+ * Do not consider detached threads that are
+ * not ptraced:
+ */
+ if (p->exit_signal == -1 && !p->ptrace)
+ continue;
/* Wait for all children (clone and not) if __WALL is set;
* otherwise, wait for clone children *only* if __WCLONE is
* set; otherwise, wait for non-clone children *only*. (Note:
@@ -667,7 +697,7 @@
if (retval)
goto end_wait4;
retval = p->pid;
- if (p->real_parent != p->parent) {
+ if (p->real_parent != p->parent || p->ptrace) {
write_lock_irq(&tasklist_lock);
remove_parent(p);
p->parent = p->real_parent;
--- linux/kernel/fork.c.orig Mon Aug 19 11:53:22 2002
+++ linux/kernel/fork.c Mon Aug 19 12:20:54 2002
@@ -27,6 +27,7 @@
#include <linux/fs.h>
#include <linux/security.h>
#include <linux/futex.h>
+#include <linux/ptrace.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -808,6 +809,7 @@
*/
p->tgid = p->pid;
INIT_LIST_HEAD(&p->thread_group);
+ INIT_LIST_HEAD(&p->ptrace_list);

/* Need tasklist lock for parent etc handling! */
write_lock_irq(&tasklist_lock);
@@ -827,6 +829,7 @@
}

SET_LINKS(p);
+ ptrace_link(p, p->parent);
hash_pid(p);
nr_threads++;
write_unlock_irq(&tasklist_lock);
--- linux/kernel/ptrace.c.orig Mon Aug 19 11:58:06 2002
+++ linux/kernel/ptrace.c Mon Aug 19 13:29:53 2002
@@ -13,11 +13,56 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/smp_lock.h>
+#include <linux/ptrace.h>

#include <asm/pgtable.h>
#include <asm/uaccess.h>

/*
+ * This is the global list of ptraced threads.
+ *
+ * Any code that wants to iterate over its ->children list has to
+ * search this list as well, because sys_ptrace() temporarily reparents
+ * tasks to the debugging task.
+ */
+LIST_HEAD(ptrace_tasks);
+
+/*
+ * ptrace a task: make the debugger its new parent and
+ * move it to the ptrace list.
+ *
+ * Must be called with the tasklist lock write-held.
+ */
+void __ptrace_link(task_t *child, task_t *new_parent)
+{
+ if (!list_empty(&child->ptrace_list))
+ BUG();
+ list_add(&child->ptrace_list, &ptrace_tasks);
+ REMOVE_LINKS(child);
+ child->parent = new_parent;
+ SET_LINKS(child);
+}
+
+/*
+ * unptrace a task: move it back to its original parent and
+ * remove it from the ptrace list.
+ *
+ * Must be called with the tasklist lock write-held.
+ */
+void __ptrace_unlink(task_t *child)
+{
+ if (!child->ptrace)
+ BUG();
+ child->ptrace = 0;
+ if (list_empty(&child->ptrace_list))
+ return;
+ list_del_init(&child->ptrace_list);
+ REMOVE_LINKS(child);
+ child->parent = child->real_parent;
+ SET_LINKS(child);
+}
+
+/*
* Check that we have indeed attached to the thing..
*/
int ptrace_check_attach(struct task_struct *child, int kill)
@@ -75,11 +120,7 @@
task_unlock(task);

write_lock_irq(&tasklist_lock);
- if (task->parent != current) {
- REMOVE_LINKS(task);
- task->parent = current;
- SET_LINKS(task);
- }
+ __ptrace_link(task, current);
write_unlock_irq(&tasklist_lock);

send_sig(SIGSTOP, task, 1);
@@ -99,16 +140,15 @@
ptrace_disable(child);

/* .. re-parent .. */
- child->ptrace = 0;
child->exit_code = data;
+
write_lock_irq(&tasklist_lock);
- REMOVE_LINKS(child);
- child->parent = child->real_parent;
- SET_LINKS(child);
+ __ptrace_unlink(child);
+ /* .. and wake it up. */
+ if (child->state != TASK_ZOMBIE)
+ wake_up_process(child);
write_unlock_irq(&tasklist_lock);

- /* .. and wake it up. */
- wake_up_process(child);
return 0;
}



2002-08-19 16:10:56

by Ingo Molnar

[permalink] [raw]
Subject: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-B4


the attached patch updates a number of items:

- adds cleanups suggested by Christoph Hellwig: needed unlikely()
statements, a superfluous #define and line length problems.

- splits up the global ptrace list into per-task ptrace lists. This was
pretty straightforward, and this makes the worst-case exit() latency
O(nr_children).

the per-task ptrace lists unearthed a bug that the previous code did not
take care of: tasks on the ptrace list have to be correctly reparented as
well. This patch passed my stresstests as well.

Ingo

--- linux/arch/i386/kernel/irq.c.orig Mon Aug 19 12:10:27 2002
+++ linux/arch/i386/kernel/irq.c Mon Aug 19 18:12:27 2002
@@ -18,7 +18,6 @@
*/

#include <linux/config.h>
-#include <linux/ptrace.h>
#include <linux/errno.h>
#include <linux/signal.h>
#include <linux/sched.h>
--- linux/arch/i386/kernel/i8259.c.orig Mon Aug 19 12:11:00 2002
+++ linux/arch/i386/kernel/i8259.c Mon Aug 19 18:12:27 2002
@@ -1,5 +1,4 @@
#include <linux/config.h>
-#include <linux/ptrace.h>
#include <linux/errno.h>
#include <linux/signal.h>
#include <linux/sched.h>
--- linux/arch/i386/kernel/process.c.orig Mon Aug 19 12:38:47 2002
+++ linux/arch/i386/kernel/process.c Mon Aug 19 18:12:27 2002
@@ -23,7 +23,6 @@
#include <linux/smp_lock.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
-#include <linux/ptrace.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/user.h>
--- linux/arch/i386/kernel/signal.c.orig Mon Aug 19 12:38:56 2002
+++ linux/arch/i386/kernel/signal.c Mon Aug 19 18:12:27 2002
@@ -15,7 +15,6 @@
#include <linux/signal.h>
#include <linux/errno.h>
#include <linux/wait.h>
-#include <linux/ptrace.h>
#include <linux/unistd.h>
#include <linux/stddef.h>
#include <linux/personality.h>
--- linux/arch/i386/kernel/traps.c.orig Mon Aug 19 12:39:05 2002
+++ linux/arch/i386/kernel/traps.c Mon Aug 19 18:12:27 2002
@@ -16,7 +16,6 @@
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/errno.h>
-#include <linux/ptrace.h>
#include <linux/timer.h>
#include <linux/mm.h>
#include <linux/init.h>
--- linux/arch/i386/kernel/vm86.c.orig Mon Aug 19 12:39:17 2002
+++ linux/arch/i386/kernel/vm86.c Mon Aug 19 18:12:27 2002
@@ -8,7 +8,6 @@
#include <linux/kernel.h>
#include <linux/signal.h>
#include <linux/string.h>
-#include <linux/ptrace.h>
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/smp_lock.h>
--- linux/include/linux/sched.h.orig Mon Aug 19 11:57:00 2002
+++ linux/include/linux/sched.h Mon Aug 19 18:12:27 2002
@@ -270,6 +270,8 @@
unsigned int time_slice, first_time_slice;

struct list_head tasks;
+ struct list_head ptrace_children;
+ struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
struct list_head local_pages;
--- linux/include/linux/mm.h.orig Mon Aug 19 11:57:25 2002
+++ linux/include/linux/mm.h Mon Aug 19 18:12:27 2002
@@ -354,12 +354,6 @@
extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
extern int make_pages_present(unsigned long addr, unsigned long end);
extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
-extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char *dst, int len);
-extern int ptrace_writedata(struct task_struct *tsk, char * src, unsigned long dst, int len);
-extern int ptrace_attach(struct task_struct *tsk);
-extern int ptrace_detach(struct task_struct *, unsigned int);
-extern void ptrace_disable(struct task_struct *);
-extern int ptrace_check_attach(struct task_struct *task, int kill);

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
--- linux/include/linux/ptrace.h.orig Mon Aug 19 12:08:30 2002
+++ linux/include/linux/ptrace.h Mon Aug 19 18:12:27 2002
@@ -3,6 +3,8 @@
/* ptrace.h */
/* structs and defines to help the user use the ptrace system call. */

+#include <linux/compiler.h>
+
/* has the defines to get at the registers. */

#define PTRACE_TRACEME 0
@@ -22,5 +24,27 @@
#define PTRACE_SYSCALL 24

#include <asm/ptrace.h>
+
+extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char *dst, int len);
+extern int ptrace_writedata(struct task_struct *tsk, char * src, unsigned long dst, int len);
+extern int ptrace_attach(struct task_struct *tsk);
+extern int ptrace_detach(struct task_struct *, unsigned int);
+extern void ptrace_disable(struct task_struct *);
+extern int ptrace_check_attach(struct task_struct *task, int kill);
+extern void __ptrace_link(struct task_struct *child,
+ struct task_struct *new_parent);
+extern void __ptrace_unlink(struct task_struct *child);
+
+static inline void ptrace_link(struct task_struct *child,
+ struct task_struct *new_parent)
+{
+ if (unlikely(child->ptrace))
+ __ptrace_link(child, new_parent);
+}
+static inline void ptrace_unlink(struct task_struct *child)
+{
+ if (unlikely(child->ptrace))
+ __ptrace_unlink(child);
+}

#endif
--- linux/include/linux/binfmts.h.orig Mon Aug 19 12:19:45 2002
+++ linux/include/linux/binfmts.h Mon Aug 19 18:12:27 2002
@@ -1,7 +1,6 @@
#ifndef _LINUX_BINFMTS_H
#define _LINUX_BINFMTS_H

-#include <linux/ptrace.h>
#include <linux/capability.h>

/*
--- linux/include/linux/elfcore.h.orig Mon Aug 19 12:22:38 2002
+++ linux/include/linux/elfcore.h Mon Aug 19 18:12:27 2002
@@ -4,7 +4,6 @@
#include <linux/types.h>
#include <linux/signal.h>
#include <linux/time.h>
-#include <linux/ptrace.h>
#include <linux/user.h>

struct elf_siginfo
--- linux/include/linux/init_task.h.orig Mon Aug 19 17:39:30 2002
+++ linux/include/linux/init_task.h Mon Aug 19 18:12:27 2002
@@ -54,6 +54,8 @@
.run_list = LIST_HEAD_INIT(tsk.run_list), \
.time_slice = HZ, \
.tasks = LIST_HEAD_INIT(tsk.tasks), \
+ .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \
+ .ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \
.real_parent = &tsk, \
.parent = &tsk, \
.children = LIST_HEAD_INIT(tsk.children), \
--- linux/include/asm-i386/smp.h.orig Mon Aug 19 12:12:27 2002
+++ linux/include/asm-i386/smp.h Mon Aug 19 18:12:27 2002
@@ -7,7 +7,6 @@
#ifndef __ASSEMBLY__
#include <linux/config.h>
#include <linux/threads.h>
-#include <linux/ptrace.h>
#endif

#ifdef CONFIG_X86_LOCAL_APIC
--- linux/include/asm-i386/user.h.orig Mon Aug 19 12:22:19 2002
+++ linux/include/asm-i386/user.h Mon Aug 19 18:12:27 2002
@@ -2,7 +2,6 @@
#define _I386_USER_H

#include <asm/page.h>
-#include <linux/ptrace.h>
/* Core file format: The core file is written in such a way that gdb
can understand it and provide useful information to the user (under
linux we use the 'trad-core' bfd). There are quite a number of
--- linux/kernel/exit.c.orig Mon Aug 19 11:45:38 2002
+++ linux/kernel/exit.c Mon Aug 19 18:12:27 2002
@@ -18,6 +18,7 @@
#include <linux/acct.h>
#include <linux/file.h>
#include <linux/binfmts.h>
+#include <linux/ptrace.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -65,6 +66,8 @@
atomic_dec(&p->user->processes);
security_ops->task_free_security(p);
free_uid(p->user);
+ BUG_ON(p->ptrace || !list_empty(&p->ptrace_list) ||
+ !list_empty(&p->ptrace_children));
unhash_process(p);

release_thread(p);
@@ -177,6 +180,7 @@
{
write_lock_irq(&tasklist_lock);

+ ptrace_unlink(current);
/* Reparent to init */
REMOVE_LINKS(current);
current->parent = child_reaper;
@@ -231,45 +235,20 @@
atomic_inc(&current->files->count);
}

-/*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the global child reaper process (ie "init")
- */
-static inline void forget_original_parent(struct task_struct * father)
+static void reparent_thread(task_t *p, task_t *reaper, task_t *child_reaper)
{
- struct task_struct * p, *reaper;
-
- read_lock(&tasklist_lock);
+ /* We dont want people slaying init */
+ p->exit_signal = SIGCHLD;
+ p->self_exec_id++;
+
+ /* Make sure we're not reparenting to ourselves */
+ if (p == reaper)
+ p->real_parent = child_reaper;
+ else
+ p->real_parent = reaper;

- /* Next in our thread group, if they're not already exiting */
- reaper = father;
- do {
- reaper = next_thread(reaper);
- if (!(reaper->flags & PF_EXITING))
- break;
- } while (reaper != father);
-
- if (reaper == father)
- reaper = child_reaper;
-
- for_each_task(p) {
- if (p->real_parent == father) {
- /* We dont want people slaying init */
- p->exit_signal = SIGCHLD;
- p->self_exec_id++;
-
- /* Make sure we're not reparenting to ourselves */
- if (p == reaper)
- p->real_parent = child_reaper;
- else
- p->real_parent = reaper;
-
- if (p->pdeath_signal) send_sig(p->pdeath_signal, p, 0);
- }
- }
- read_unlock(&tasklist_lock);
+ if (p->pdeath_signal)
+ send_sig(p->pdeath_signal, p, 0);
}

static inline void close_files(struct files_struct * files)
@@ -420,12 +399,85 @@
}

/*
+ * When we die, we re-parent all our children.
+ * Try to give them to another thread in our thread
+ * group, and if no such member exists, give it to
+ * the global child reaper process (ie "init")
+ */
+static inline void forget_original_parent(struct task_struct * father)
+{
+ struct task_struct *p, *reaper;
+ list_t *_p;
+
+ read_lock(&tasklist_lock);
+
+ /* Next in our thread group, if they're not already exiting */
+ reaper = father;
+ do {
+ reaper = next_thread(reaper);
+ if (!(reaper->flags & PF_EXITING))
+ break;
+ } while (reaper != father);
+
+ if (reaper == father)
+ reaper = child_reaper;
+
+ /*
+ * There are only two places where our children can be:
+ *
+ * - in our child list
+ * - in the global ptrace list
+ *
+ * Search them and reparent children.
+ */
+ list_for_each(_p, &father->children) {
+ p = list_entry(_p,struct task_struct,sibling);
+ reparent_thread(p, reaper, child_reaper);
+ }
+ list_for_each(_p, &father->ptrace_children) {
+ p = list_entry(_p,struct task_struct,ptrace_list);
+ reparent_thread(p, reaper, child_reaper);
+ }
+ read_unlock(&tasklist_lock);
+}
+
+static inline void zap_thread(task_t *p, task_t *father)
+{
+ ptrace_unlink(p);
+ list_del_init(&p->sibling);
+ p->ptrace = 0;
+
+ p->parent = p->real_parent;
+ list_add_tail(&p->sibling, &p->parent->children);
+ if (p->state == TASK_ZOMBIE && p->exit_signal != -1)
+ do_notify_parent(p, p->exit_signal);
+ /*
+ * process group orphan check
+ * Case ii: Our child is in a different pgrp
+ * than we are, and it was the only connection
+ * outside, so the child pgrp is now orphaned.
+ */
+ if ((p->pgrp != current->pgrp) &&
+ (p->session == current->session)) {
+ int pgrp = p->pgrp;
+
+ write_unlock_irq(&tasklist_lock);
+ if (is_orphaned_pgrp(pgrp) && has_stopped_jobs(pgrp)) {
+ kill_pg(pgrp,SIGHUP,1);
+ kill_pg(pgrp,SIGCONT,1);
+ }
+ write_lock_irq(&tasklist_lock);
+ }
+}
+
+/*
* Send signals to all our closest relatives so that they know
* to properly mourn us..
*/
static void exit_notify(void)
{
- struct task_struct * p, *t;
+ struct task_struct *t;
+ list_t *_p, *_n;

forget_original_parent(current);
/*
@@ -484,33 +536,20 @@
current->state = TASK_ZOMBIE;
if (current->exit_signal != -1)
do_notify_parent(current, current->exit_signal);
- while ((p = eldest_child(current))) {
- list_del_init(&p->sibling);
- p->ptrace = 0;
-
- p->parent = p->real_parent;
- list_add_tail(&p->sibling,&p->parent->children);
- if (p->state == TASK_ZOMBIE && p->exit_signal != -1)
- do_notify_parent(p, p->exit_signal);
- /*
- * process group orphan check
- * Case ii: Our child is in a different pgrp
- * than we are, and it was the only connection
- * outside, so the child pgrp is now orphaned.
- */
- if ((p->pgrp != current->pgrp) &&
- (p->session == current->session)) {
- int pgrp = p->pgrp;
-
- write_unlock_irq(&tasklist_lock);
- if (is_orphaned_pgrp(pgrp) && has_stopped_jobs(pgrp)) {
- kill_pg(pgrp,SIGHUP,1);
- kill_pg(pgrp,SIGCONT,1);
- }
- write_lock_irq(&tasklist_lock);
- }
- }

+zap_again:
+ list_for_each_safe(_p, _n, &current->children)
+ zap_thread(list_entry(_p,struct task_struct,sibling), current);
+ list_for_each_safe(_p, _n, &current->ptrace_children)
+ zap_thread(list_entry(_p,struct task_struct,ptrace_list), current);
+ /*
+ * reparent_thread might drop the tasklist lock, thus we could
+ * have new children queued back from the ptrace list into the
+ * child list:
+ */
+ if (unlikely(!list_empty(&current->children) ||
+ !list_empty(&current->ptrace_children)))
+ goto zap_again;
/*
* No need to unlock IRQs, we'll schedule() immediately
* anyway. In the preemption case this also makes it
@@ -623,6 +662,12 @@
if (p->pgrp != -pid)
continue;
}
+ /*
+ * Do not consider detached threads that are
+ * not ptraced:
+ */
+ if (p->exit_signal == -1 && !p->ptrace)
+ continue;
/* Wait for all children (clone and not) if __WALL is set;
* otherwise, wait for clone children *only* if __WCLONE is
* set; otherwise, wait for non-clone children *only*. (Note:
@@ -667,7 +712,7 @@
if (retval)
goto end_wait4;
retval = p->pid;
- if (p->real_parent != p->parent) {
+ if (p->real_parent != p->parent || p->ptrace) {
write_lock_irq(&tasklist_lock);
remove_parent(p);
p->parent = p->real_parent;
--- linux/kernel/fork.c.orig Mon Aug 19 11:53:22 2002
+++ linux/kernel/fork.c Mon Aug 19 18:12:27 2002
@@ -27,6 +27,7 @@
#include <linux/fs.h>
#include <linux/security.h>
#include <linux/futex.h>
+#include <linux/ptrace.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -808,6 +809,8 @@
*/
p->tgid = p->pid;
INIT_LIST_HEAD(&p->thread_group);
+ INIT_LIST_HEAD(&p->ptrace_children);
+ INIT_LIST_HEAD(&p->ptrace_list);

/* Need tasklist lock for parent etc handling! */
write_lock_irq(&tasklist_lock);
@@ -827,6 +830,7 @@
}

SET_LINKS(p);
+ ptrace_link(p, p->parent);
hash_pid(p);
nr_threads++;
write_unlock_irq(&tasklist_lock);
--- linux/kernel/ptrace.c.orig Mon Aug 19 11:58:06 2002
+++ linux/kernel/ptrace.c Mon Aug 19 18:12:27 2002
@@ -13,11 +13,49 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/smp_lock.h>
+#include <linux/ptrace.h>

#include <asm/pgtable.h>
#include <asm/uaccess.h>

/*
+ * ptrace a task: make the debugger its new parent and
+ * move it to the ptrace list.
+ *
+ * Must be called with the tasklist lock write-held.
+ */
+void __ptrace_link(task_t *child, task_t *new_parent)
+{
+ if (!list_empty(&child->ptrace_list))
+ BUG();
+ if (child->parent == new_parent)
+ BUG();
+ list_add(&child->ptrace_list, &child->parent->ptrace_children);
+ REMOVE_LINKS(child);
+ child->parent = new_parent;
+ SET_LINKS(child);
+}
+
+/*
+ * unptrace a task: move it back to its original parent and
+ * remove it from the ptrace list.
+ *
+ * Must be called with the tasklist lock write-held.
+ */
+void __ptrace_unlink(task_t *child)
+{
+ if (!child->ptrace)
+ BUG();
+ child->ptrace = 0;
+ if (list_empty(&child->ptrace_list))
+ return;
+ list_del_init(&child->ptrace_list);
+ REMOVE_LINKS(child);
+ child->parent = child->real_parent;
+ SET_LINKS(child);
+}
+
+/*
* Check that we have indeed attached to the thing..
*/
int ptrace_check_attach(struct task_struct *child, int kill)
@@ -75,11 +113,7 @@
task_unlock(task);

write_lock_irq(&tasklist_lock);
- if (task->parent != current) {
- REMOVE_LINKS(task);
- task->parent = current;
- SET_LINKS(task);
- }
+ __ptrace_link(task, current);
write_unlock_irq(&tasklist_lock);

send_sig(SIGSTOP, task, 1);
@@ -99,16 +133,15 @@
ptrace_disable(child);

/* .. re-parent .. */
- child->ptrace = 0;
child->exit_code = data;
+
write_lock_irq(&tasklist_lock);
- REMOVE_LINKS(child);
- child->parent = child->real_parent;
- SET_LINKS(child);
+ __ptrace_unlink(child);
+ /* .. and wake it up. */
+ if (child->state != TASK_ZOMBIE)
+ wake_up_process(child);
write_unlock_irq(&tasklist_lock);

- /* .. and wake it up. */
- wake_up_process(child);
return 0;
}


2002-08-19 17:34:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


Hmm.. This looks good, but I wonder if the real problem isn't really that
our ptrace approach has always been kind of flaky.

Basically, we started with the notion that only parents can trace their
children, so no reparenting was ever needed. Then PTRACE_ATTACH came
along, and we did the reparenting, and I think _that_ is where we made our
big mistake.

We sh ould just have made a separate "tsk->tracer" pointer, instead of
continuing with the perverted "my parent is my tracer" logic. We shouldn't
really re-write the parent/child relationship just because we're being
traced.

I'd be happy to apply this patch (well, your fixed version), but I think
I'd prefer even more to make the whole reparenting go away, and keep the
child list valid all through the lifetime of a process. How painful could
that be?

Linus

2002-08-19 17:40:08

by Larry McVoy

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6

On Mon, Aug 19, 2002 at 10:42:06AM -0700, Linus Torvalds wrote:
> We sh ould just have made a separate "tsk->tracer" pointer, instead of
> continuing with the perverted "my parent is my tracer" logic. We shouldn't
> really re-write the parent/child relationship just because we're being
> traced.

I've always wondered if the process model shouldn't be virtualized much
like files are virtual. One application of this could be for ptraced
processes, they have a different ops vector than non-traced processes.
Any chance that could work?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-08-19 18:03:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


On Mon, 19 Aug 2002, Linus Torvalds wrote:

> I'd be happy to apply this patch (well, your fixed version), but I think
> I'd prefer even more to make the whole reparenting go away, and keep the
> child list valid all through the lifetime of a process. How painful
> could that be?

the problem is that the tracing task wants to do a wait4() on all traced
children, and the only way to get that is to have the traced tasks in the
child list. Eg. strace -f traces a random number of tasks, and after the
PTRACE_CONTINUE call, the wait4 done by strace must be able to 'get
events' from pretty much any of the traced tasks. So unless the ptrace
interface is reworked in an incompatible way, i cannot see how this would
work. wait4 could perhaps somehow search the whole tasklist, but that
could be a pretty big pain even for something like strace.

Ingo

2002-08-19 18:27:56

by Dave McCracken

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


--On Monday, August 19, 2002 08:08:10 PM +0200 Ingo Molnar <[email protected]>
wrote:

> the problem is that the tracing task wants to do a wait4() on all traced
> children, and the only way to get that is to have the traced tasks in the
> child list. Eg. strace -f traces a random number of tasks, and after the
> PTRACE_CONTINUE call, the wait4 done by strace must be able to 'get
> events' from pretty much any of the traced tasks. So unless the ptrace
> interface is reworked in an incompatible way, i cannot see how this would
> work. wait4 could perhaps somehow search the whole tasklist, but that
> could be a pretty big pain even for something like strace.

It seems to me it would be worth adding list_heads in the task struct for
ptraced tasks and ptraced siblings if it gets rid of the reparenting.

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-08-19 18:31:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


On Mon, 19 Aug 2002, Dave McCracken wrote:

> It seems to me it would be worth adding list_heads in the task struct
> for ptraced tasks and ptraced siblings if it gets rid of the
> reparenting.

well, this means that we'd still have to iterate through both lists in
wait4(), and we'd have to maintain the ptrace list(s) in all the relevant
codepaths - does this buy us anything relative to -B4?

Ingo

2002-08-19 18:47:55

by Dave McCracken

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


--On Monday, August 19, 2002 08:36:24 PM +0200 Ingo Molnar <[email protected]>
wrote:

> well, this means that we'd still have to iterate through both lists in
> wait4(), and we'd have to maintain the ptrace list(s) in all the relevant
> codepaths - does this buy us anything relative to -B4?

The lists would constitute the tasks that wait4() should consider, at
least. And maintaining the list wouldn't be any more work than the
current reparenting. I do admit that I don't see a significant win,
codewise, other than aesthetics.

In looking at the code I was wondering something. What happens to the real
parent of a ptraced task when it calls wait4()? If that's its only child,
won't it return ECHILD?

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-08-19 18:52:05

by George Anzinger

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6

Dave McCracken wrote:
>
> --On Monday, August 19, 2002 08:08:10 PM +0200 Ingo Molnar <[email protected]>
> wrote:
>
> > the problem is that the tracing task wants to do a wait4() on all traced
> > children, and the only way to get that is to have the traced tasks in the
> > child list. Eg. strace -f traces a random number of tasks, and after the
> > PTRACE_CONTINUE call, the wait4 done by strace must be able to 'get
> > events' from pretty much any of the traced tasks. So unless the ptrace
> > interface is reworked in an incompatible way, i cannot see how this would
> > work. wait4 could perhaps somehow search the whole tasklist, but that
> > could be a pretty big pain even for something like strace.
>
> It seems to me it would be worth adding list_heads in the task struct for
> ptraced tasks and ptraced siblings if it gets rid of the reparenting.
>
That and then each time a SIGCHILD is sent, it is also sent
to the tracer, if there is one. Some filtering may be
needed here as the real parent should not be disturbed by
trace ONLY things. It has been done on other systems. Not
super clean, but it does push Heisenberg out a ring or two.
The current way its done, a child can not get the pid of its
father and thus would NEED to know it was being traced in
order to do anything that required such. This is asking
Heisenberg to live somewhere more or less close by :)
--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-08-19 19:07:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6



On Mon, 19 Aug 2002, Ingo Molnar wrote:
>
> well, this means that we'd still have to iterate through both lists in
> wait4(), and we'd have to maintain the ptrace list(s) in all the relevant
> codepaths - does this buy us anything relative to -B4?

Ok, you've convinced me. The reparenting is fairly ugly, but it sounds
like other implementations would be fairly equivalent and it would be
mainly an issue of just which list we'd work on.

Linus

2002-08-19 19:10:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


On Mon, 19 Aug 2002, george anzinger wrote:

> The current way its done, a child can not get the pid of its father and
> thus would NEED to know it was being traced in order to do anything that
> required such. [...]

what do you mean by this? sys_getppid() uses ->real_parent so it gets the
proper PID.

Ingo

2002-08-19 19:31:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


On Mon, 19 Aug 2002, Dave McCracken wrote:

> In looking at the code I was wondering something. What happens to the
> real parent of a ptraced task when it calls wait4()? If that's its only
> child, won't it return ECHILD?

yes, this is ugly beyond belief.

eg. under bash start up some code that blocks, eg:

# cat

the shell will not display a prompt, it wait4()s for the 'cat' process to
exit.

strace the bash PID - it shows the wait4().

strace 'cat' PID (via strace -p) and keep the strace running - it will
show 'cat' blocked on console input. [some signal like this could happen
to a shell anyway, in a reasonably complex script.]

now do something that breaks bash out of its wait4 - eg. send a 'kill
-SIGCONT BASH_PID' signal - bash returns with a prompt! The 'cat' becomes
a 'background' task - and it gets majorly confused when doing the next
shell command in bash - it displays "cat: -: Input/output error" and goes
zombie.

ptrace is clearly broken - and i tested this with a stock 2.4 kernel.

Ingo

2002-08-19 19:59:36

by George Anzinger

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6

Ingo Molnar wrote:
>
> On Mon, 19 Aug 2002, george anzinger wrote:
>
> > The current way its done, a child can not get the pid of its father and
> > thus would NEED to know it was being traced in order to do anything that
> > required such. [...]
>
> what do you mean by this? sys_getppid() uses ->real_parent so it gets the
> proper PID.
>
Oops. Open mouth, insert foot :)
--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-08-19 20:54:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


On Mon, 19 Aug 2002, Dave McCracken wrote:

> In looking at the code I was wondering something. What happens to the
> real parent of a ptraced task when it calls wait4()? If that's its only
> child, won't it return ECHILD?

hm, so this could be fixed by iterating over the ptraced tasks as well
when doing a wait4.

the problem is that the debugger wants to do a wait4 as well, to receive
the SIGSTOP result. Now if the original parent 'steals' the wait4 result,
what will happen?

this whole mess can only be fixed by decoupling the ptrace() mechanism
from signals and wait4 completely, it's a nasty relationship that infests
both the kernel and userspace code [check out strace.c once to see the
kind of pain it has to go through to isolate ptrace events from other
signals.]

I'm not quite sure whether this is possible, how deeply do ptrace
applications depend on a real SIGSTOP signal interrupting the task? Would
it be equally good if it was a different interruption/signalling method
that did this? [with a few minor and straightforward cleanups to entry.S i
think we could use a task ornament flag for ptrace interruption. This
would result in a few orders better behavior on all fronts.]

Ingo

2002-08-19 20:57:24

by Dave McCracken

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


--On Monday, August 19, 2002 10:59:17 PM +0200 Ingo Molnar <[email protected]>
wrote:

> this whole mess can only be fixed by decoupling the ptrace() mechanism
> from signals and wait4 completely, it's a nasty relationship that infests
> both the kernel and userspace code [check out strace.c once to see the
> kind of pain it has to go through to isolate ptrace events from other
> signals.]

I guess this is why most versions of Unix have abandoned ptrace for a
debugging API based on /proc.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-08-19 21:23:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


On Mon, 19 Aug 2002, Ingo Molnar wrote:
>
> the problem is that the debugger wants to do a wait4 as well, to receive
> the SIGSTOP result. Now if the original parent 'steals' the wait4 result,
> what will happen?

If a child has a debugger, it clearly is never "stopped" or "zombie" as
far as the parent is concerned, so the parent should either block, or it
should return -EAGAIN.

> this whole mess can only be fixed by decoupling the ptrace() mechanism
> from signals and wait4 completely

No, you only need to make debugged children slightly pecial in wait4(), in
that the parent must never see their state, only the fact that they are
there (as if they were still running, in short, regardless of their _real_
state)

Linus

2002-08-19 21:38:27

by Dave McCracken

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


--On Monday, August 19, 2002 02:29:08 PM -0700 Linus Torvalds
<[email protected]> wrote:

> No, you only need to make debugged children slightly pecial in wait4(), in
> that the parent must never see their state, only the fact that they are
> there (as if they were still running, in short, regardless of their _real_
> state)

So does that mean we need something like a 'count of children stolen by
debuggers' in the task struct that wait4() can check?

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-08-19 22:31:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


On Mon, 19 Aug 2002, Dave McCracken wrote:

> > No, you only need to make debugged children slightly pecial in wait4(), in
> > that the parent must never see their state, only the fact that they are
> > there (as if they were still running, in short, regardless of their _real_
> > state)
>
> So does that mean we need something like a 'count of children stolen by
> debuggers' in the task struct that wait4() can check?

in fact we have this already, almost:

if (!list_empty(&current->ptrace_children))

then block (or return -EAGAIN). Instead of the current -ENOCHLD.

Ingo

2002-08-20 03:17:02

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6

On Mon, Aug 19, 2002 at 10:59:17PM +0200, Ingo Molnar wrote:
>
> On Mon, 19 Aug 2002, Dave McCracken wrote:
>
> > In looking at the code I was wondering something. What happens to the
> > real parent of a ptraced task when it calls wait4()? If that's its only
> > child, won't it return ECHILD?
>
> hm, so this could be fixed by iterating over the ptraced tasks as well
> when doing a wait4.
>
> the problem is that the debugger wants to do a wait4 as well, to receive
> the SIGSTOP result. Now if the original parent 'steals' the wait4 result,
> what will happen?
>
> this whole mess can only be fixed by decoupling the ptrace() mechanism
> from signals and wait4 completely, it's a nasty relationship that infests
> both the kernel and userspace code [check out strace.c once to see the
> kind of pain it has to go through to isolate ptrace events from other
> signals.]
>
> I'm not quite sure whether this is possible, how deeply do ptrace
> applications depend on a real SIGSTOP signal interrupting the task? Would
> it be equally good if it was a different interruption/signalling method
> that did this? [with a few minor and straightforward cleanups to entry.S i
> think we could use a task ornament flag for ptrace interruption. This
> would result in a few orders better behavior on all fronts.]

I have in my mailbox somewhere the beginnings of an implementation of
this, from David Howells (the ornaments approach); I think he ran out
of time to work on it. Rather than pursue it, if there's someone
interested in giving Linux a way to access at least some of the
Solaris/BSD procfs debug interface...

The biggest problem with ptrace is the way it overloads signals. It
makes figuring out what events are pretty tricky (see the patches I
just posted for some workarounds on this), and it makes debugging a
signal-intensive process extremely awkward. For instance, in a lot of
cases you'll lose or confuse the realtime payload entirely.


I don't think there is anything you could do here that would not break
strace/gdb's uses of ptrace, however; any incremental changes would
toast them.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2002-08-20 10:55:32

by Richard Z

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6

On Mon, Aug 19, 2002 at 08:08:10PM +0200, Ingo Molnar wrote:
>
> On Mon, 19 Aug 2002, Linus Torvalds wrote:
>
> > I'd be happy to apply this patch (well, your fixed version), but I think
> > I'd prefer even more to make the whole reparenting go away, and keep the
> > child list valid all through the lifetime of a process. How painful
> > could that be?
>
> the problem is that the tracing task wants to do a wait4() on all traced
> children, and the only way to get that is to have the traced tasks in the
> child list. Eg. strace -f traces a random number of tasks, and after the
> PTRACE_CONTINUE call, the wait4 done by strace must be able to 'get
> events' from pretty much any of the traced tasks. So unless the ptrace
> interface is reworked in an incompatible way, i cannot see how this would
> work. wait4 could perhaps somehow search the whole tasklist, but that
> could be a pretty big pain even for something like strace.

the whole signal driven approach of ptrace is imho not very elegant
and causes high overhead. So reworking the ptrace interface to avoid
signals would be a good idea. Instead of wait4 the tracer could eg
block or poll on read of /proc/#num/ptrace.

Richard

2002-08-20 14:28:30

by Dave McCracken

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


--On Tuesday, August 20, 2002 12:36:36 AM +0200 Ingo Molnar <[email protected]>
wrote:

> in fact we have this already, almost:
>
> if (!list_empty(&current->ptrace_children))
>
> then block (or return -EAGAIN). Instead of the current -ENOCHLD.

No fair adding new code in the middle of a discussion :)

Seriously, that looks like it solves the problem.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-08-20 14:31:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6


the attached patch ontop of BK-curr fixes the ptrace wait4() anomaly that
can be observed in any previous Linux kernel i could get my hands at. So
in fact ->ptrace_children, besides being a speedup, also fixed a bug that
couldnt be fixed in any satisfactory way before.

Ingo

--- linux/kernel/exit.c.orig Tue Aug 20 16:28:57 2002
+++ linux/kernel/exit.c Tue Aug 20 16:30:13 2002
@@ -731,7 +731,7 @@
tsk = next_thread(tsk);
} while (tsk != current);
read_unlock(&tasklist_lock);
- if (flag) {
+ if (flag || !list_empty(&current->ptrace_children)) {
retval = 0;
if (options & WNOHANG)
goto end_wait4;

2002-08-27 15:47:55

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6

On Tue, Aug 20, 2002 at 04:36:19PM +0200, Ingo Molnar wrote:
>
> the attached patch ontop of BK-curr fixes the ptrace wait4() anomaly that
> can be observed in any previous Linux kernel i could get my hands at. So
> in fact ->ptrace_children, besides being a speedup, also fixed a bug that
> couldnt be fixed in any satisfactory way before.
>
> Ingo
>
> --- linux/kernel/exit.c.orig Tue Aug 20 16:28:57 2002
> +++ linux/kernel/exit.c Tue Aug 20 16:30:13 2002
> @@ -731,7 +731,7 @@
> tsk = next_thread(tsk);
> } while (tsk != current);
> read_unlock(&tasklist_lock);
> - if (flag) {
> + if (flag || !list_empty(&current->ptrace_children)) {
> retval = 0;
> if (options & WNOHANG)
> goto end_wait4;

Ingo,

At this point your ptrace changes have completely broken both
_TRACEME/exec and _ATTACH debugging. If an attached process finishes
while a debugger is attached, its parent no longer gets the proper wait
result for it:

wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478

etc. It is never removed from the list. _TRACEME/exec debugging
appears to have the same problem but it's harder to tell, since one can
not strace GDB in 2.5 without the patch I posted here two weeks ago.
If you don't have a chance to look at this I'll investigate later
today.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2002-08-28 02:48:17

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6

On Tue, Aug 27, 2002 at 11:39:57AM -0400, Daniel Jacobowitz wrote:
> On Tue, Aug 20, 2002 at 04:36:19PM +0200, Ingo Molnar wrote:
> >
> > the attached patch ontop of BK-curr fixes the ptrace wait4() anomaly that
> > can be observed in any previous Linux kernel i could get my hands at. So
> > in fact ->ptrace_children, besides being a speedup, also fixed a bug that
> > couldnt be fixed in any satisfactory way before.
> >
> > Ingo
> >
> > --- linux/kernel/exit.c.orig Tue Aug 20 16:28:57 2002
> > +++ linux/kernel/exit.c Tue Aug 20 16:30:13 2002
> > @@ -731,7 +731,7 @@
> > tsk = next_thread(tsk);
> > } while (tsk != current);
> > read_unlock(&tasklist_lock);
> > - if (flag) {
> > + if (flag || !list_empty(&current->ptrace_children)) {
> > retval = 0;
> > if (options & WNOHANG)
> > goto end_wait4;
>
> Ingo,
>
> At this point your ptrace changes have completely broken both
> _TRACEME/exec and _ATTACH debugging. If an attached process finishes
> while a debugger is attached, its parent no longer gets the proper wait
> result for it:
>
> wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
> wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
> wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
> wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
> wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
> wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG|WUNTRACED, NULL) = 478
>
> etc. It is never removed from the list. _TRACEME/exec debugging
> appears to have the same problem but it's harder to tell, since one can
> not strace GDB in 2.5 without the patch I posted here two weeks ago.
> If you don't have a chance to look at this I'll investigate later
> today.

The problem is the line:
retval = p->pid;
!!! if (p->real_parent != p->parent || p->ptrace) {
write_lock_irq(&tasklist_lock);
remove_parent(p);
p->parent = p->real_parent;
add_parent(p, p->parent);
do_notify_parent(p, SIGCHLD);
write_unlock_irq(&tasklist_lock);
} else
release_task(p);

p->ptrace continues to be true, even if the real parent is waiting for
it. So we go through this code time after time once p->real_parent ==
p->parent, sending it extra SIGCHLDs and not releasing the task. I
don't think that change is correct but I'm not clear what you're doing
with the reparenting right now, so I don't know the right fix.

You might want to add:
gdb /bin/ls
run
run

to your stress testing in the future.


--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer