p->exit_state != 0 doesn't mean the process is dead; it may still have sub-threads.
However, the new "p->exit_state && thread_group_empty(p)" check is not correct
either; it is just a temporary hack. Perhaps we can just remove this check,
but I don't understand the orphaned process groups magic. At all. However, I think
exit_notify() is obviously and completely wrong wrt this helper.
Signed-off-by: Oleg Nesterov <[email protected]>
--- PT/kernel/exit.c~4_orphaned_pgrp 2007-12-06 18:06:09.000000000 +0300
+++ PT/kernel/exit.c 2007-12-07 20:25:40.000000000 +0300
@@ -219,9 +219,9 @@ static int will_become_orphaned_pgrp(str
int ret = 1;
do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
- if (p == ignored_task
- || p->exit_state
- || is_global_init(p->real_parent))
+ if ((p == ignored_task) ||
+ (p->exit_state && thread_group_empty(p)) ||
+ is_global_init(p->real_parent))
continue;
if (task_pgrp(p->real_parent) != pgrp &&
task_session(p->real_parent) == task_session(p)) {
Oleg Nesterov <[email protected]> writes:
> p->exit_state != 0 doesn't mean the process is dead; it may still have sub-threads.
>
> However, the new "p->exit_state && thread_group_empty(p)" check is not correct
> either; it is just a temporary hack. Perhaps we can just remove this check,
> but I don't understand the orphaned process groups magic. At all. However, I think
> exit_notify() is obviously and completely wrong wrt this helper.
The problem that orphaned process groups address is what happens if
an entire process group is stopped, and there is no process that
can wake them up.
The rule for an unprivileged process sending a signal to a process
group is that it must be in the same session as the process group.
The rule for sending a signal to a process group is that the signal sender
must be in the same session.
So we are testing for a process group that does not have a living
member with a parent outside of the process group that can send the
process group a signal.
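For reference, POSIX requires that when a process group becomes orphaned while some member is stopped, each member receives SIGHUP followed by SIGCONT. Below is a minimal userspace sketch of that rule (illustrative only, not from this thread; the handler name and timings are made up for the demo):

/* Illustrative sketch, not from this thread: POSIX says that when a
 * process group becomes orphaned while a member is stopped, every
 * member gets SIGHUP followed by SIGCONT.  Build: cc -o orphan orphan.c */
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_sighup(int sig)
{
	(void)sig;
	static const char msg[] = "member: SIGHUP, the group became orphaned\n";
	write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		setpgid(0, 0);		/* new pgrp; our parent is its only outside link */
		if (fork() == 0) {
			/* grandchild: its parent (the child) is inside the pgrp */
			signal(SIGHUP, on_sighup);
			raise(SIGSTOP);	/* stop; nobody in the group can wake us */
			_exit(0);	/* reached only after the kernel's SIGCONT */
		}
		sleep(1);		/* let the grandchild stop */
		_exit(0);		/* the pgrp loses its last outside link */
	}
	waitpid(child, NULL, 0);
	sleep(2);			/* give the SIGHUP handler time to run */
	return 0;
}

When the child exits, the grandchild is reparented to init (a different session), so the group has no living member with a parent outside it, and the stopped grandchild is woken with SIGHUP + SIGCONT.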
The test for init seems bogus. /sbin/init rarely if ever starts
processes in its own session, which is likely why this has not caused
problems. If we keep the test for init we need to make the test
is_container_init rather than is_global_init.
As for exit_notify, I agree. We need a thread_group_exit_notify
that is responsible for performing work when we know the entire
thread group has exited. Sending the exit_signal and performing
the thread group orphaned check look like two of those tasks
that need to be performed only at thread group exit.
Oleg, what do you see wrong with checking p->exit_state &&
thread_group_empty(p)? Since non-leader threads all self reap,
that seems to be a valid test for a dead thread group.
Eric
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- PT/kernel/exit.c~4_orphaned_pgrp 2007-12-06 18:06:09.000000000 +0300
> +++ PT/kernel/exit.c 2007-12-07 20:25:40.000000000 +0300
> @@ -219,9 +219,9 @@ static int will_become_orphaned_pgrp(str
> int ret = 1;
>
> do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
> - if (p == ignored_task
> - || p->exit_state
> - || is_global_init(p->real_parent))
> + if ((p == ignored_task) ||
> + (p->exit_state && thread_group_empty(p)) ||
> + is_global_init(p->real_parent))
> continue;
> if (task_pgrp(p->real_parent) != pgrp &&
> task_session(p->real_parent) == task_session(p)) {
On 12/08, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > p->exit_state != 0 doesn't mean the process is dead; it may still have sub-threads.
> >
> > However, the new "p->exit_state && thread_group_empty(p)" check is not correct
> > either; it is just a temporary hack. Perhaps we can just remove this check,
> > but I don't understand the orphaned process groups magic. At all. However, I think
> > exit_notify() is obviously and completely wrong wrt this helper.
>
> The problem that orphaned process groups address is what happens if
> an entire process group is stopped, and there is no process that
> can wake them up.
>
> The rule for an unprivileged process sending a signal to a process
> group is that it must be in the same session as the process group.
>
> The rule for sending a signal to a process group is that the signal sender
> must be in the same session.
>
> So we are testing for a process group that does not have a living
> member with a parent outside of the process group that can send the
> process group a signal.
Ah, thanks a lot Eric, I am _starting_ to understand this.
> Oleg, what do you see wrong with checking p->exit_state &&
> thread_group_empty(p)? Since non-leader threads all self reap,
> that seems to be a valid test for a dead thread group.
There is a window after exit_notify() drops tasklist_lock and before release_task().
Suppose the last (non-leader) thread exits. This means the entire group has exited,
but thread_group_empty() is not yet true.
Oleg.
Oleg Nesterov <[email protected]> writes:
> On 12/08, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > p->exit_state != 0 doesn't mean the process is dead; it may still have sub-threads.
>> >
>> > However, the new "p->exit_state && thread_group_empty(p)" check is not correct
>> > either; it is just a temporary hack. Perhaps we can just remove this check,
>> > but I don't understand the orphaned process groups magic. At all. However, I
>> > think exit_notify() is obviously and completely wrong wrt this helper.
>>
>> The problem that orphaned process groups address is what happens if
>> an entire process group is stopped, and there is no process that
>> can wake them up.
>>
>> The rule for an unprivileged process sending a signal to a process
>> group is that it must be in the same session as the process group.
>>
>> The rule for sending a signal to a process group is that the signal sender
>> must be in the same session.
>>
>> So we are testing for a process group that does not have a living
>> member with a parent outside of the process group that can send the
>> process group a signal.
>
> Ah, thanks a lot Eric, I am _starting_ to understand this.
>
>> Oleg, what do you see wrong with checking p->exit_state &&
>> thread_group_empty(p)? Since non-leader threads all self reap,
>> that seems to be a valid test for a dead thread group.
>
> There is a window after exit_notify() drops tasklist_lock and before release_task().
> Suppose the last (non-leader) thread exits. This means the entire group has exited,
> but thread_group_empty() is not yet true.
And if you are an observer this is important.
Equally messed up is our status in /proc at that point, which
says our sleeping process is a zombie.
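The symptom is easy to reproduce from userspace. A minimal sketch (illustrative only, file name assumed): the main thread exits while a worker lives on, and /proc/<pid>/stat then reports state 'Z' for a process that is still running.

/* Illustrative sketch: the main thread exits while a worker lives on,
 * and /proc/<pid>/stat then shows state 'Z' for a process that is
 * actually still running.  Build: cc -pthread -o zleader zleader.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
	char cmd[64];

	(void)arg;
	sleep(1);	/* give the leader time to exit */
	/* Field 3 of /proc/<pid>/stat is the group leader's state; it
	 * now reads 'Z' even though this thread is alive and running. */
	snprintf(cmd, sizeof(cmd), "cat /proc/%d/stat", (int)getpid());
	system(cmd);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	pthread_exit(NULL);	/* leader exits; the worker keeps the process alive */
}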
I'm thinking we need to do at least some of the thread group leadership
transfer in do_exit, instead of de_thread. Then p->group_leader->exit_state
would be sufficient to see if the entire thread group was alive,
as the group_leader would be whoever was left alive. The original
group_leader might still need to be kept around for its pid...
I think that would solve most of the problems you have with a dead
thread group leader and sending SIGSTOP as well.
Eric
On 12/09, Eric W. Biederman wrote:
>
> Equally messed up is our status in /proc at that point, which
> says our sleeping process is a zombie.
Yes, this is annoying.
> I'm thinking we need to do at least some of the thread group leadership
> transfer in do_exit, instead of de_thread. Then p->group_leader->exit_state
> would be sufficient to see if the entire thread group was alive,
> as the group_leader would be whoever was left alive. The original
> group_leader might still need to be kept around for its pid...
>
> I think that would solve most of the problems you have with a dead
> thread group leader and sending SIGSTOP as well.
Yes, I was thinking about that too, but I am not brave enough to even
try to think it through to the end ;)
As a minimal change, I tried to add "task_struct *leader_proxy" to
signal_struct, which points to the next live thread and is updated by
exit_notify(). eligible_child() checks it instead of ->exit_signal.
But this is so messy...
And in fact, if we are talking about group stop, it is a group operation,
so why does do_wait() use the per-thread ->exit_code and not ->group_exit_code?
But yes, [PATCH 3/3] adds a visible difference, and I don't know if
this difference is good or bad.
$ sleep 1000
^Z
[1]+ Stopped sleep 1000
$ strace -p `pidof sleep`
Process 432 attached - interrupt to quit
Now strace "hangs" in do_wait() because ->exit_code was eaten by the
shell. We need SIGCONT.
With the "[PATCH 3/3]" strace proceeds happily.
Oleg.
Oleg Nesterov <[email protected]> writes:
> On 12/09, Eric W. Biederman wrote:
>>
>> Equally messed up is a our status in /proc at that point. Which
>> says our sleeping process is a zombie.
>
> Yes, this is annoying.
>
>> I'm thinking we need to do at least some of the thread group leadership
>> transfer in do_exit, instead of de_thread. Then p->group_leader->exit_state
>> would be sufficient to see if the entire thread group was alive,
>> as the group_leader would be whoever was left alive. The original
>> group_leader might still need to be kept around for its pid...
>>
>> I think that would solve most of the problems you have with a dead
>> thread group leader and sending SIGSTOP as well.
>
> Yes, I was thinking about that too, but I am not brave enough to even
> try to think it through to the end ;)
>
> As a minimal change, I tried to add "task_struct *leader_proxy" to
> signal_struct, which points to the next live thread and is updated by
> exit_notify(). eligible_child() checks it instead of ->exit_signal.
> But this is so messy...
>
> And in fact, if we are talking about group stop, it is a group operation,
> so why does do_wait() use the per-thread ->exit_code and not ->group_exit_code?
Good question. We would need a fallback for the case where it isn't a group
operation, like in exit, but that might clean something up.
> But yes, [PATCH 3/3] adds a visible difference, and I don't know if
> this difference is good or bad.
>
> $ sleep 1000
> ^Z
>
> [1]+ Stopped sleep 1000
> $ strace -p `pidof sleep`
> Process 432 attached - interrupt to quit
>
> Now strace "hangs" in do_wait() because ->exit_code was eaten by the
> shell. We need SIGCONT.
>
> With [PATCH 3/3], strace proceeds happily.
>
> Oleg.
Well, I got to playing with the idea of actually moving group_leader,
and it turns out that while it is a pain it isn't actually that bad.
The worst part is not really changing the pid of the leader to the pid
of the entire thread group, as there are a few cases where we are
currently referencing task_pid where we really want task_tgid.
Oleg, below is my proof-of-concept patch, which really needs to be
broken up into a whole patch series so the changes are small
enough that we can do a thorough audit on them. Anyway, take a look
and see what you think.
This patch does fix your weird test case without any actual
change to the do_wait logic itself.
The key idea is that by not making PIDTYPE_PID a hash chain, we
can point two different struct pids at the same process, allowing
two different pids to return the same process from pid_task(pid,
PIDTYPE_PID).
This means most things continue to just work by working on
PIDTYPE_PID, although as mentioned previously there are a few
things, particularly do_notify_parent_cldstop and do_wait, that
need to be changed to return the tgid instead of the pid.
Oh, and in eligible_child the PIDTYPE_PID test is now sneaky,
essentially doing a task lookup and seeing if the result
is our target task, instead of comparing pids.
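A toy userspace model of those two points follows (stand-in types and made-up pid values, not kernel code): PIDTYPE_PID resolution becomes a single pid->tsk pointer, so two distinct struct pids can resolve to the same task, and the eligible_child-style test compares tasks rather than pid values.

/* Toy userspace model, with stand-in types -- not kernel code.  It only
 * shows the shape of the idea: PIDTYPE_PID resolution is a single
 * pid->tsk pointer, so two distinct struct pids can resolve to the same
 * task, and the eligible_child()-style test compares tasks, not pids. */
#include <assert.h>
#include <stdio.h>

struct task_struct { int dummy; };	/* stand-in */

struct pid {
	int nr;				/* the numeric id */
	struct task_struct *tsk;	/* what pid_task(pid, PIDTYPE_PID) returns */
};

static struct task_struct *pid_task(const struct pid *pid)
{
	return pid ? pid->tsk : NULL;
}

/* eligible_child()-style test: "does this pid name task p?" */
static int eligible(const struct pid *pid, const struct task_struct *p)
{
	return pid_task(pid) == p;
}

int main(void)
{
	struct task_struct survivor = { 0 };
	/* after the original leader dies, both its old tid and the tgid
	 * can point at the surviving thread: */
	struct pid tid  = { .nr = 1002, .tsk = &survivor };
	struct pid tgid = { .nr = 1000, .tsk = &survivor };

	assert(&tid != &tgid);			/* two different struct pids... */
	assert(eligible(&tid,  &survivor));	/* ...that name the same task */
	assert(eligible(&tgid, &survivor));
	printf("pids %d and %d resolve to one task\n", tid.nr, tgid.nr);
	return 0;
}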
The funny part is that "grep Pid /proc/<tgid>/status" no longer
always equals the tgid after the original leader exits. Still, that seems
better than making the entire thread group look like a zombie
just because the wrong thread exited.
Subject: [PATCH] Allow thread group leaders to exit
---
fs/exec.c | 81 ++---------------------
fs/fcntl.c | 20 ++++--
fs/proc/base.c | 6 +-
include/linux/init_task.h | 25 ++++----
include/linux/pid.h | 14 ++---
include/linux/sched.h | 43 ++++++------
kernel/exit.c | 157 +++++++++++++++++++++++++++-----------------
kernel/fork.c | 2 +-
kernel/itimer.c | 2 +-
kernel/pid.c | 60 +++++++++++------
kernel/signal.c | 23 ++-----
11 files changed, 204 insertions(+), 229 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 14a690d..1f69326 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -786,22 +786,6 @@ static int de_thread(struct task_struct *tsk)
* Account for the thread group leader hanging around:
*/
count = 1;
- if (!thread_group_leader(tsk)) {
- count = 2;
- /*
- * The SIGALRM timer survives the exec, but needs to point
- * at us as the new group leader now. We have a race with
- * a timer firing now getting the old leader, so we need to
- * synchronize with any firing (by calling del_timer_sync)
- * before we can safely let the old group leader die.
- */
- sig->tsk = tsk;
- spin_unlock_irq(lock);
- if (hrtimer_cancel(&sig->real_timer))
- hrtimer_restart(&sig->real_timer);
- spin_lock_irq(lock);
- }
-
sig->notify_count = count;
while (atomic_read(&sig->count) > count) {
__set_current_state(TASK_UNINTERRUPTIBLE);
@@ -811,68 +795,15 @@ static int de_thread(struct task_struct *tsk)
}
spin_unlock_irq(lock);
- /*
- * At this point all other threads have exited, all we have to
- * do is to wait for the thread group leader to become inactive,
- * and to assume its PID:
- */
- if (!thread_group_leader(tsk)) {
- leader = tsk->group_leader;
-
- sig->notify_count = -1;
- for (;;) {
- write_lock_irq(&tasklist_lock);
- if (likely(leader->exit_state))
- break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- write_unlock_irq(&tasklist_lock);
- schedule();
+ /* If it isn't already force gettid() == getpid() */
+ if (sig->tgid != tsk->tid) {
+ write_lock_irq(&tasklist_lock);
+ if (sig->tgid != tsk->tid) {
+ detach_pid(tsk, PIDTYPE_PID);
+ attach_pid(tsk, PIDTYPE_PID, sig->tgid);
}
-
- /*
- * The only record we have of the real-time age of a
- * process, regardless of execs it's done, is start_time.
- * All the past CPU time is accumulated in signal_struct
- * from sister threads now dead. But in this non-leader
- * exec, nothing survives from the original leader thread,
- * whose birth marks the true age of this process now.
- * When we take on its identity by switching to its PID, we
- * also take its birthdate (always earlier than our own).
- */
- tsk->start_time = leader->start_time;
-
- BUG_ON(!same_thread_group(leader, tsk));
- BUG_ON(has_group_leader_pid(tsk));
- /*
- * An exec() starts a new thread group with the
- * TGID of the previous thread group. Rehash the
- * two threads with a switched PID, and release
- * the former thread group leader:
- */
-
- /* Become a process group leader with the old leader's pid.
- * The old leader becomes a thread of the this thread group.
- * Note: The old leader also uses this pid until release_task
- * is called. Odd but simple and correct.
- */
- detach_pid(tsk, PIDTYPE_PID);
- tsk->pid = leader->pid;
- attach_pid(tsk, PIDTYPE_PID, task_pid(leader));
- transfer_pid(leader, tsk, PIDTYPE_PGID);
- transfer_pid(leader, tsk, PIDTYPE_SID);
- list_replace_rcu(&leader->tasks, &tsk->tasks);
-
- tsk->group_leader = tsk;
- leader->group_leader = tsk;
-
- tsk->exit_signal = SIGCHLD;
-
- BUG_ON(leader->exit_state != EXIT_ZOMBIE);
- leader->exit_state = EXIT_DEAD;
-
write_unlock_irq(&tasklist_lock);
}
-
sig->group_exit_task = NULL;
sig->notify_count = 0;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 8685263..bc0a125 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -516,9 +516,13 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
goto out_unlock_fown;
read_lock(&tasklist_lock);
- do_each_pid_task(pid, type, p) {
- send_sigio_to_task(p, fown, fd, band);
- } while_each_pid_task(pid, type, p);
+ if (type == PIDTYPE_PID)
+ send_sigio_to_task(pid_task(pid, type), fown, fd, band);
+ else {
+ do_each_pid_task(pid, type, p) {
+ send_sigio_to_task(p, fown, fd, band);
+ } while_each_pid_task(pid, type, p);
+ }
read_unlock(&tasklist_lock);
out_unlock_fown:
read_unlock(&fown->lock);
@@ -547,9 +551,13 @@ int send_sigurg(struct fown_struct *fown)
ret = 1;
read_lock(&tasklist_lock);
- do_each_pid_task(pid, type, p) {
- send_sigurg_to_task(p, fown);
- } while_each_pid_task(pid, type, p);
+ if (type == PIDTYPE_PID)
+ send_sigurg_to_task(pid_task(pid, type), fown);
+ else {
+ do_each_pid_task(pid, type, p) {
+ send_sigurg_to_task(p, fown);
+ } while_each_pid_task(pid, type, p);
+ }
read_unlock(&tasklist_lock);
out_unlock_fown:
read_unlock(&fown->lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d59708e..f7bd620 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2438,15 +2438,15 @@ retry:
* pid of a thread_group_leader. Testing for task
* being a thread_group_leader is the obvious thing
* todo but there is a window when it fails, due to
- * the pid transfer logic in de_thread.
+ * the pid transfer logic at group leader death.
*
* So we perform the straight forward test of seeing
- * if the pid we have found is the pid of a thread
+ * if the pid we have found is the pid of the thread
* group leader, and don't worry if the task we have
* found doesn't happen to be a thread group leader.
* As we don't care in the case of readdir.
*/
- if (!iter.task || !has_group_leader_pid(iter.task)) {
+ if (!iter.task || pid != task_tgid(iter.task)) {
iter.tgid += 1;
goto retry;
}
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 96be7d6..ddcd7c1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -67,6 +67,9 @@
.posix_timers = LIST_HEAD_INIT(sig.posix_timers), \
.cpu_timers = INIT_CPU_TIMERS(sig.cpu_timers), \
.rlim = INIT_RLIMITS, \
+ .tgid = &init_struct_pid, \
+ .pids[PIDTYPE_PGID] = &init_struct_pid, \
+ .pids[PIDTYPE_SID] = &init_struct_pid, \
}
extern struct nsproxy init_nsproxy;
@@ -91,10 +94,10 @@ extern struct group_info init_groups;
#define INIT_STRUCT_PID { \
.count = ATOMIC_INIT(1), \
+ .tsk = &init_task, \
.tasks = { \
- { .first = &init_task.pids[PIDTYPE_PID].node }, \
- { .first = &init_task.pids[PIDTYPE_PGID].node }, \
- { .first = &init_task.pids[PIDTYPE_SID].node }, \
+ { .first = &init_task.pids[PIDTYPE_PGID] }, \
+ { .first = &init_task.pids[PIDTYPE_SID] }, \
}, \
.rcu = RCU_HEAD_INIT, \
.level = 0, \
@@ -105,13 +108,10 @@ extern struct group_info init_groups;
}, } \
}
-#define INIT_PID_LINK(type) \
-{ \
- .node = { \
- .next = NULL, \
- .pprev = &init_struct_pid.tasks[type].first, \
- }, \
- .pid = &init_struct_pid, \
+#define INIT_PID_HLIST_NODE(type) \
+{ \
+ .next = NULL, \
+ .pprev = &init_struct_pid.tasks[type].first, \
}
#ifdef CONFIG_SECURITY_FILE_CAPABILITIES
@@ -179,9 +179,8 @@ extern struct group_info init_groups;
.fs_excl = ATOMIC_INIT(0), \
.pi_lock = __SPIN_LOCK_UNLOCKED(tsk.pi_lock), \
.pids = { \
- [PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \
- [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \
- [PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \
+ [PIDTYPE_PGID] = INIT_PID_HLIST_NODE(PIDTYPE_PGID), \
+ [PIDTYPE_SID] = INIT_PID_HLIST_NODE(PIDTYPE_SID), \
}, \
.dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
INIT_TRACE_IRQFLAGS \
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 061abb6..828355e 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -5,9 +5,10 @@
enum pid_type
{
- PIDTYPE_PID,
PIDTYPE_PGID,
PIDTYPE_SID,
+ PIDTYPE_PID,
+#define PIDTYPE_ARRAY_MAX PIDTYPE_PID
PIDTYPE_MAX
};
@@ -58,7 +59,8 @@ struct pid
{
atomic_t count;
/* lists of tasks that use this pid */
- struct hlist_head tasks[PIDTYPE_MAX];
+ struct task_struct *tsk;
+ struct hlist_head tasks[PIDTYPE_ARRAY_MAX];
struct rcu_head rcu;
int level;
struct upid numbers[1];
@@ -66,12 +68,6 @@ struct pid
extern struct pid init_struct_pid;
-struct pid_link
-{
- struct hlist_node node;
- struct pid *pid;
-};
-
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
@@ -158,7 +154,7 @@ static inline pid_t pid_vnr(struct pid *pid)
struct hlist_node *pos___; \
if (pid != NULL) \
hlist_for_each_entry_rcu((task), pos___, \
- &pid->tasks[type], pids[type].node) {
+ &pid->tasks[type], pids[type]) {
#define while_each_pid_task(pid, type, task) \
} \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b1e25b..496dfda 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -453,7 +453,6 @@ struct signal_struct {
/* ITIMER_REAL timer for the process */
struct hrtimer real_timer;
- struct task_struct *tsk;
ktime_t it_real_incr;
/* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
@@ -461,6 +460,8 @@ struct signal_struct {
cputime_t it_prof_incr, it_virt_incr;
/* job control IDs */
+ struct pid *tgid;
+ struct pid *pids[PIDTYPE_ARRAY_MAX];
/*
* pgrp and session fields are deprecated.
@@ -1034,8 +1035,9 @@ struct task_struct {
struct list_head sibling; /* linkage in my parent's children list */
struct task_struct *group_leader; /* threadgroup leader */
+ struct pid *tid;
/* PID/PID hash table linkage. */
- struct pid_link pids[PIDTYPE_MAX];
+ struct hlist_node pids[PIDTYPE_ARRAY_MAX];
struct list_head thread_group;
struct completion *vfork_done; /* for vfork() */
@@ -1261,22 +1263,34 @@ static inline void set_task_pgrp(struct task_struct *tsk, pid_t pgrp)
static inline struct pid *task_pid(struct task_struct *task)
{
- return task->pids[PIDTYPE_PID].pid;
+ return task->tid;
}
static inline struct pid *task_tgid(struct task_struct *task)
{
- return task->group_leader->pids[PIDTYPE_PID].pid;
+ struct signal_struct *sig = rcu_dereference(task->signal);
+ struct pid *pid = NULL;
+ if (sig)
+ pid = sig->tgid;
+ return pid;
}
static inline struct pid *task_pgrp(struct task_struct *task)
{
- return task->group_leader->pids[PIDTYPE_PGID].pid;
+ struct signal_struct *sig = rcu_dereference(task->signal);
+ struct pid *pid = NULL;
+ if (sig)
+ pid = sig->pids[PIDTYPE_PGID];
+ return pid;
}
static inline struct pid *task_session(struct task_struct *task)
{
- return task->group_leader->pids[PIDTYPE_SID].pid;
+ struct signal_struct *sig = rcu_dereference(task->signal);
+ struct pid *pid = NULL;
+ if (sig)
+ pid = sig->pids[PIDTYPE_SID];
+ return pid;
}
struct pid_namespace;
@@ -1371,7 +1385,7 @@ static inline pid_t task_ppid_nr_ns(struct task_struct *tsk,
*/
static inline int pid_alive(struct task_struct *p)
{
- return p->pids[PIDTYPE_PID].pid != NULL;
+ return p->signal != NULL;
}
/**
@@ -1652,7 +1666,6 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
extern void unblock_all_signals(void);
extern void release_task(struct task_struct * p);
extern int send_sig_info(int, struct siginfo *, struct task_struct *);
-extern int send_group_sig_info(int, struct siginfo *, struct task_struct *);
extern int force_sigsegv(int, struct task_struct *);
extern int force_sig_info(int, struct siginfo *, struct task_struct *);
extern int __kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp);
@@ -1772,17 +1785,6 @@ extern void wait_task_inactive(struct task_struct * p);
/* de_thread depends on thread_group_leader not being a pid based check */
#define thread_group_leader(p) (p == p->group_leader)
-/* Do to the insanities of de_thread it is possible for a process
- * to have the pid of the thread group leader without actually being
- * the thread group leader. For iteration through the pids in proc
- * all we care about is that we have a task with the appropriate
- * pid, we don't actually care if we have the right task.
- */
-static inline int has_group_leader_pid(struct task_struct *p)
-{
- return p->pid == p->tgid;
-}
-
static inline
int same_thread_group(struct task_struct *p1, struct task_struct *p2)
{
@@ -1800,9 +1802,6 @@ static inline int thread_group_empty(struct task_struct *p)
return list_empty(&p->thread_group);
}
-#define delay_group_leader(p) \
- (thread_group_leader(p) && !thread_group_empty(p))
-
/*
* Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
* subscriptions and synchronises with wait4(). Also used in procfs. Also
diff --git a/kernel/exit.c b/kernel/exit.c
index 1ab19f0..94552e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -57,7 +57,6 @@ static void exit_mm(struct task_struct * tsk);
static void __unhash_process(struct task_struct *p)
{
nr_threads--;
- detach_pid(p, PIDTYPE_PID);
if (thread_group_leader(p)) {
detach_pid(p, PIDTYPE_PGID);
detach_pid(p, PIDTYPE_SID);
@@ -65,6 +64,7 @@ static void __unhash_process(struct task_struct *p)
list_del_rcu(&p->tasks);
__get_cpu_var(process_counts)--;
}
+ detach_pid(p, PIDTYPE_PID);
list_del_rcu(&p->thread_group);
remove_parent(p);
}
@@ -144,44 +144,15 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
void release_task(struct task_struct * p)
{
- struct task_struct *leader;
- int zap_leader;
-repeat:
atomic_dec(&p->user->processes);
proc_flush_task(p);
write_lock_irq(&tasklist_lock);
ptrace_unlink(p);
BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
__exit_signal(p);
-
- /*
- * If we are the last non-leader member of the thread
- * group, and the leader is zombie, then notify the
- * group leader's parent process. (if it wants notification.)
- */
- zap_leader = 0;
- leader = p->group_leader;
- if (leader != p && thread_group_empty(leader) && leader->exit_state == EXIT_ZOMBIE) {
- BUG_ON(leader->exit_signal == -1);
- do_notify_parent(leader, leader->exit_signal);
- /*
- * If we were the last child thread and the leader has
- * exited already, and the leader's parent ignores SIGCHLD,
- * then we are the one who should release the leader.
- *
- * do_notify_parent() will have marked it self-reaping in
- * that case.
- */
- zap_leader = (leader->exit_signal == -1);
- }
-
write_unlock_irq(&tasklist_lock);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);
-
- p = leader;
- if (unlikely(zap_leader))
- goto repeat;
}
/*
@@ -633,8 +604,7 @@ reparent_thread(struct task_struct *p, struct task_struct *father, int traced)
/* If we'd notified the old parent about this child's death,
* also notify the new parent.
*/
- if (!traced && p->exit_state == EXIT_ZOMBIE &&
- p->exit_signal != -1 && thread_group_empty(p))
+ if (!traced && p->exit_state == EXIT_ZOMBIE && p->exit_signal != -1)
do_notify_parent(p, p->exit_signal);
/*
@@ -702,8 +672,7 @@ static void forget_original_parent(struct task_struct *father)
} else {
/* reparent ptraced task to its real parent */
__ptrace_unlink (p);
- if (p->exit_state == EXIT_ZOMBIE && p->exit_signal != -1 &&
- thread_group_empty(p))
+ if (p->exit_state == EXIT_ZOMBIE && p->exit_signal != -1)
do_notify_parent(p, p->exit_signal);
}
@@ -773,6 +742,11 @@ static void exit_notify(struct task_struct *tsk)
exit_task_namespaces(tsk);
write_lock_irq(&tasklist_lock);
+ /* If we haven't yet made gettid() == getpid() do so now */
+ if (thread_group_leader(tsk) && (tsk->tid != tsk->signal->tgid)) {
+ detach_pid(tsk, PIDTYPE_PID);
+ attach_pid(tsk, PIDTYPE_PID, tsk->signal->tgid);
+ }
/*
* Check to see if any process groups have become orphaned
* as a result of our exiting, and if they have any stopped
@@ -818,7 +792,7 @@ static void exit_notify(struct task_struct *tsk)
* send it a SIGCHLD instead of honoring exit_signal. exit_signal
* only has special meaning to our real parent.
*/
- if (tsk->exit_signal != -1 && thread_group_empty(tsk)) {
+ if (tsk->exit_signal != -1) {
int signal = tsk->parent == tsk->real_parent ? tsk->exit_signal : SIGCHLD;
do_notify_parent(tsk, signal);
} else if (tsk->ptrace) {
@@ -946,6 +920,48 @@ fastcall NORET_TYPE void do_exit(long code)
}
tsk->flags |= PF_EXITING;
+ /* Transfer thread group leadership */
+ if (thread_group_leader(tsk) && !thread_group_empty(tsk)) {
+ struct task_struct *new_leader, *t;
+ write_lock_irq(&tasklist_lock);
+ for (t = next_thread(tsk); t != tsk; t = next_thread(t)) {
+ if (!(t->flags & PF_EXITING))
+ break;
+ }
+ if (t != tsk) {
+ new_leader = t;
+
+ new_leader->start_time = tsk->start_time;
+ task_pid(tsk)->tsk = new_leader;
+ transfer_pid(tsk, new_leader, PIDTYPE_PGID);
+ transfer_pid(tsk, new_leader, PIDTYPE_SID);
+ list_replace_rcu(&tsk->tasks, &new_leader->tasks);
+
+ /* Update group_leader on all of the threads... */
+ new_leader->group_leader = new_leader;
+ tsk->group_leader = new_leader;
+ for (t = next_thread(tsk); t != tsk; t= next_thread(t)) {
+ t->group_leader = new_leader;
+ }
+
+ new_leader->exit_signal = tsk->exit_signal;
+ tsk->exit_signal = -1;
+
+ write_unlock_irq(&tasklist_lock);
+ } else {
+ write_unlock_irq(&tasklist_lock);
+ /* Wait for the other threads to exit before continuing */
+ for (;;) {
+ read_lock(&tasklist_lock);
+ if (thread_group_empty(tsk))
+ break;
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ read_unlock(&tasklist_lock);
+ schedule();
+ }
+ read_unlock(&tasklist_lock);
+ }
+ }
/*
* tsk->flags are checked in the futex code to protect against
* an exiting task cleaning up the robust pi futexes.
@@ -1106,20 +1122,18 @@ asmlinkage void sys_exit_group(int error_code)
do_group_exit((error_code & 0xff) << 8);
}
-static int eligible_child(pid_t pid, int options, struct task_struct *p)
+static int eligible_child(enum pid_type type, struct pid *pid, int options, struct task_struct *p)
{
int err;
- struct pid_namespace *ns;
- ns = current->nsproxy->pid_ns;
- if (pid > 0) {
- if (task_pid_nr_ns(p, ns) != pid)
+ if (type == PIDTYPE_PID) {
+ /* Match all pids pointing at task p */
+ if (pid_task(pid, PIDTYPE_PID) != p)
return 0;
- } else if (!pid) {
- if (task_pgrp_nr_ns(p, ns) != task_pgrp_vnr(current))
- return 0;
- } else if (pid != -1) {
- if (task_pgrp_nr_ns(p, ns) != -pid)
+ } else if (type < PIDTYPE_MAX) {
+ struct signal_struct *sig;
+ sig = rcu_dereference(p->signal);
+ if (sig && (sig->pids[type] != pid))
return 0;
}
@@ -1346,7 +1360,8 @@ static int wait_task_stopped(struct task_struct *p,
{
int retval, exit_code, why;
uid_t uid = 0; /* unneeded, required by compiler */
- pid_t pid;
+ struct pid *pid;
+ pid_t upid;
exit_code = 0;
spin_lock_irq(&p->sighand->siglock);
@@ -1382,12 +1397,16 @@ unlock_sig:
* possibly take page faults for user memory.
*/
get_task_struct(p);
- pid = task_pid_nr_ns(p, current->nsproxy->pid_ns);
+ if (p->ptrace && same_thread_group(current, p->parent))
+ pid = task_pid(p);
+ else
+ pid = task_tgid(p);
+ upid = pid_nr_ns(pid, current->nsproxy->pid_ns);
why = (p->ptrace & PT_PTRACED) ? CLD_TRAPPED : CLD_STOPPED;
read_unlock(&tasklist_lock);
if (unlikely(noreap))
- return wait_noreap_copyout(p, pid, uid,
+ return wait_noreap_copyout(p, upid, uid,
why, exit_code,
infop, ru);
@@ -1403,11 +1422,11 @@ unlock_sig:
if (!retval && infop)
retval = put_user(exit_code, &infop->si_status);
if (!retval && infop)
- retval = put_user(pid, &infop->si_pid);
+ retval = put_user(upid, &infop->si_pid);
if (!retval && infop)
retval = put_user(uid, &infop->si_uid);
if (!retval)
- retval = pid;
+ retval = upid;
put_task_struct(p);
BUG_ON(!retval);
@@ -1425,7 +1444,8 @@ static int wait_task_continued(struct task_struct *p, int noreap,
int __user *stat_addr, struct rusage __user *ru)
{
int retval;
- pid_t pid;
+ struct pid *pid;
+ pid_t upid;
uid_t uid;
if (!(p->signal->flags & SIGNAL_STOP_CONTINUED))
@@ -1440,8 +1460,11 @@ static int wait_task_continued(struct task_struct *p, int noreap,
if (!noreap)
p->signal->flags &= ~SIGNAL_STOP_CONTINUED;
spin_unlock_irq(&p->sighand->siglock);
-
- pid = task_pid_nr_ns(p, current->nsproxy->pid_ns);
+ if (p->ptrace && same_thread_group(current, p->parent))
+ pid = task_pid(p);
+ else
+ pid = task_tgid(p);
+ upid = pid_nr_ns(pid, current->nsproxy->pid_ns);
uid = p->uid;
get_task_struct(p);
read_unlock(&tasklist_lock);
@@ -1452,9 +1475,9 @@ static int wait_task_continued(struct task_struct *p, int noreap,
if (!retval && stat_addr)
retval = put_user(0xffff, stat_addr);
if (!retval)
- retval = pid;
+ retval = upid;
} else {
- retval = wait_noreap_copyout(p, pid, uid,
+ retval = wait_noreap_copyout(p, upid, uid,
CLD_CONTINUED, SIGCONT,
infop, ru);
BUG_ON(retval == 0);
@@ -1463,13 +1486,25 @@ static int wait_task_continued(struct task_struct *p, int noreap,
return retval;
}
-static long do_wait(pid_t pid, int options, struct siginfo __user *infop,
+static long do_wait(pid_t upid, int options, struct siginfo __user *infop,
int __user *stat_addr, struct rusage __user *ru)
{
DECLARE_WAITQUEUE(wait, current);
struct task_struct *tsk;
int flag, retval;
-
+ struct pid *pid = NULL;
+ enum pid_type type = PIDTYPE_MAX;
+
+ if (upid > 0) {
+ type = PIDTYPE_PID;
+ pid = find_get_pid(upid);
+ } else if (upid == 0) {
+ type = PIDTYPE_PGID;
+ pid = get_pid(task_pgrp(current));
+ } else if (upid < -1) {
+ type = PIDTYPE_PGID;
+ pid = find_get_pid(-upid);
+ }
add_wait_queue(&current->signal->wait_chldexit,&wait);
repeat:
/*
@@ -1484,7 +1519,7 @@ repeat:
struct task_struct *p;
list_for_each_entry(p, &tsk->children, sibling) {
- int ret = eligible_child(pid, options, p);
+ int ret = eligible_child(type, pid, options, p);
if (!ret)
continue;
@@ -1503,8 +1538,7 @@ repeat:
retval = wait_task_stopped(p,
(options & WNOWAIT), infop,
stat_addr, ru);
- } else if (p->exit_state == EXIT_ZOMBIE &&
- !delay_group_leader(p)) {
+ } else if (p->exit_state == EXIT_ZOMBIE) {
/*
* We don't reap group leaders with subthreads.
*/
@@ -1531,7 +1565,7 @@ repeat:
if (!flag) {
list_for_each_entry(p, &tsk->ptrace_children,
ptrace_list) {
- flag = eligible_child(pid, options, p);
+ flag = eligible_child(type, pid, options, p);
if (!flag)
continue;
if (likely(flag > 0))
@@ -1560,6 +1594,7 @@ repeat:
end:
current->state = TASK_RUNNING;
remove_wait_queue(&current->signal->wait_chldexit,&wait);
+ put_pid(pid);
if (infop) {
if (retval > 0)
retval = 0;
diff --git a/kernel/fork.c b/kernel/fork.c
index 7abb592..e986be9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -883,7 +883,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
sig->it_real_incr.tv64 = 0;
sig->real_timer.function = it_real_fn;
- sig->tsk = tsk;
sig->it_virt_expires = cputime_zero;
sig->it_virt_incr = cputime_zero;
@@ -1308,6 +1307,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_NEWPID)
p->nsproxy->pid_ns->child_reaper = p;
+ p->signal->tgid = pid;
p->signal->tty = current->signal->tty;
set_task_pgrp(p, task_pgrp_nr(current));
set_task_session(p, task_session_nr(current));
diff --git a/kernel/itimer.c b/kernel/itimer.c
index 2fab344..f40b589 100644
--- a/kernel/itimer.c
+++ b/kernel/itimer.c
@@ -132,7 +132,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
struct signal_struct *sig =
container_of(timer, struct signal_struct, real_timer);
- send_group_sig_info(SIGALRM, SEND_SIG_PRIV, sig->tsk);
+ kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->tgid);
return HRTIMER_NORESTART;
}
diff --git a/kernel/pid.c b/kernel/pid.c
index 21f027c..b45b53d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -319,28 +319,39 @@ EXPORT_SYMBOL_GPL(find_pid);
int fastcall attach_pid(struct task_struct *task, enum pid_type type,
struct pid *pid)
{
- struct pid_link *link;
-
- link = &task->pids[type];
- link->pid = pid;
- hlist_add_head_rcu(&link->node, &pid->tasks[type]);
+ if (type == PIDTYPE_PID) {
+ task->tid = pid;
+ pid->tsk = task;
+ }
+ else {
+ task->signal->pids[type] = pid;
+ hlist_add_head_rcu(&task->pids[type], &pid->tasks[type]);
+ }
return 0;
}
void fastcall detach_pid(struct task_struct *task, enum pid_type type)
{
- struct pid_link *link;
- struct pid *pid;
+ struct pid **ppid, *pid;
int tmp;
- link = &task->pids[type];
- pid = link->pid;
-
- hlist_del_rcu(&link->node);
- link->pid = NULL;
+ if (type == PIDTYPE_PID) {
+ ppid = &task->tid;
+ pid = *ppid;
+ if (pid->tsk == task)
+ pid->tsk = NULL;
+ }
+ else {
+ hlist_del_rcu(&task->pids[type]);
+ ppid = &task->signal->pids[type];
+ }
+ pid = *ppid;
+ *ppid = NULL;
- for (tmp = PIDTYPE_MAX; --tmp >= 0; )
+ if (pid->tsk)
+ return;
+ for (tmp = PIDTYPE_MAX -1; --tmp >= 0; )
if (!hlist_empty(&pid->tasks[tmp]))
return;
@@ -351,19 +362,22 @@ void fastcall detach_pid(struct task_struct *task, enum pid_type type)
void fastcall transfer_pid(struct task_struct *old, struct task_struct *new,
enum pid_type type)
{
- new->pids[type].pid = old->pids[type].pid;
- hlist_replace_rcu(&old->pids[type].node, &new->pids[type].node);
- old->pids[type].pid = NULL;
+ hlist_replace_rcu(&old->pids[type], &new->pids[type]);
}
struct task_struct * fastcall pid_task(struct pid *pid, enum pid_type type)
{
struct task_struct *result = NULL;
if (pid) {
- struct hlist_node *first;
- first = rcu_dereference(pid->tasks[type].first);
- if (first)
- result = hlist_entry(first, struct task_struct, pids[(type)].node);
+ if (type == PIDTYPE_PID)
+ result = rcu_dereference(pid->tsk);
+ else {
+ struct hlist_node *first;
+ first = rcu_dereference(pid->tasks[type].first);
+ if (first)
+ result = hlist_entry(first, struct task_struct,
+ pids[(type)]);
+ }
}
return result;
}
@@ -402,7 +416,11 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
{
struct pid *pid;
rcu_read_lock();
- pid = get_pid(task->pids[type].pid);
+ if (type == PIDTYPE_PID)
+ pid = task->tid;
+ else
+ pid = task->signal->pids[type];
+ get_pid(pid);
rcu_read_unlock();
return pid;
}
diff --git a/kernel/signal.c b/kernel/signal.c
index 06e663d..af8c49f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1195,20 +1195,6 @@ send_sig(int sig, struct task_struct *p, int priv)
return send_sig_info(sig, __si_special(priv), p);
}
-/*
- * This is the entry point for "process-wide" signals.
- * They will go to an appropriate thread in the thread group.
- */
-int
-send_group_sig_info(int sig, struct siginfo *info, struct task_struct *p)
-{
- int ret;
- read_lock(&tasklist_lock);
- ret = group_send_sig_info(sig, info, p);
- read_unlock(&tasklist_lock);
- return ret;
-}
-
void
force_sig(int sig, struct task_struct *p)
{
@@ -1501,12 +1487,15 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
unsigned long flags;
struct task_struct *parent;
struct sighand_struct *sighand;
+ struct pid *pid;
- if (tsk->ptrace & PT_PTRACED)
+ if (tsk->ptrace & PT_PTRACED) {
parent = tsk->parent;
- else {
+ pid = task_pid(tsk);
+ } else {
tsk = tsk->group_leader;
parent = tsk->real_parent;
+ pid = task_tgid(tsk);
}
info.si_signo = SIGCHLD;
@@ -1515,7 +1504,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
* see comment in do_notify_parent() abot the following 3 lines
*/
rcu_read_lock();
- info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
+ info.si_pid = pid_nr_ns(pid, parent->nsproxy->pid_ns);
rcu_read_unlock();
info.si_uid = tsk->uid;
--
1.5.3.rc6.17.g1911
On 12/09, Eric W. Biederman wrote:
>
> Oleg, below is my proof-of-concept patch, which really needs to be
> broken up into a whole patch series so the changes are small
> enough that we can do a thorough audit on them. Anyway, take a look
> and see what you think.
Amazing ;)
This patch certainly needs some time to understand; so far I have
read only a small subset. A couple of random questions.
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -5,9 +5,10 @@
>
> enum pid_type
> {
> - PIDTYPE_PID,
> PIDTYPE_PGID,
> PIDTYPE_SID,
> + PIDTYPE_PID,
> +#define PIDTYPE_ARRAY_MAX PIDTYPE_PID
> PIDTYPE_MAX
> };
>
> @@ -58,7 +59,8 @@ struct pid
> {
> atomic_t count;
> /* lists of tasks that use this pid */
> - struct hlist_head tasks[PIDTYPE_MAX];
> + struct task_struct *tsk;
> + struct hlist_head tasks[PIDTYPE_ARRAY_MAX];
> struct rcu_head rcu;
> int level;
> struct upid numbers[1];
>
> [...snip...]
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -453,7 +453,6 @@ struct signal_struct {
>
> /* ITIMER_REAL timer for the process */
> struct hrtimer real_timer;
> - struct task_struct *tsk;
> ktime_t it_real_incr;
>
> /* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
> @@ -461,6 +460,8 @@ struct signal_struct {
> cputime_t it_prof_incr, it_virt_incr;
>
> /* job control IDs */
> + struct pid *tgid;
> + struct pid *pids[PIDTYPE_ARRAY_MAX];
>
> /*
> * pgrp and session fields are deprecated.
> @@ -1034,8 +1035,9 @@ struct task_struct {
> struct list_head sibling; /* linkage in my parent's children list */
> struct task_struct *group_leader; /* threadgroup leader */
>
> + struct pid *tid;
> /* PID/PID hash table linkage. */
> - struct pid_link pids[PIDTYPE_MAX];
> + struct hlist_node pids[PIDTYPE_ARRAY_MAX];
OK. It certainly makes sense to move the PIDTYPE_PGID/SID pids from task_struct
to signal_struct.
But can't we go a bit further? With this patch pid->tasks[].first still "points"
to the leader's task_struct. Suppose we replace pid->tasks[] with pid->signals[],
so that pid->signals[].first points to a signal_struct. Then we can find the task
(group_leader) via signal->tgid.
This means we can remove task_struct->pids and kill transfer_pid().
> static inline struct pid *task_tgid(struct task_struct *task)
> {
> - return task->group_leader->pids[PIDTYPE_PID].pid;
> + struct signal_struct *sig = rcu_dereference(task->signal);
> + struct pid *pid = NULL;
> + if (sig)
> + pid = sig->tgid;
> + return pid;
> }
Hmm. This is fixable, but note that task->signal is not RCU protected,
only ->sighand.
> static inline int pid_alive(struct task_struct *p)
> {
> - return p->pids[PIDTYPE_PID].pid != NULL;
> + return p->signal != NULL;
> }
(this change btw is imho good regardless, because pid_alive() currently
means "the task is not unhashed yet" anyway).
> static void __unhash_process(struct task_struct *p)
> {
> nr_threads--;
> - detach_pid(p, PIDTYPE_PID);
> if (thread_group_leader(p)) {
> detach_pid(p, PIDTYPE_PGID);
> detach_pid(p, PIDTYPE_SID);
> @@ -65,6 +64,7 @@ static void __unhash_process(struct task_struct *p)
> list_del_rcu(&p->tasks);
> __get_cpu_var(process_counts)--;
> }
> + detach_pid(p, PIDTYPE_PID);
Not sure why this change is needed... To prevent a premature
detach_pid()->free_pid()? But this doesn't look possible: if
the task is the leader, p->tid->tsk == p, and detach_pid() does
if (pid->tsk) // still used, don't free.
return;
> @@ -946,6 +920,48 @@ fastcall NORET_TYPE void do_exit(long code)
> }
>
> tsk->flags |= PF_EXITING;
> + /* Transfer thread group leadership */
> + if (thread_group_leader(tsk) && !thread_group_empty(tsk)) {
Ah, this is racy without tasklist_lock. Suppose the current
->group_leader exits right now and elects us as the new leader.
> + struct task_struct *new_leader, *t;
> + write_lock_irq(&tasklist_lock);
> + for (t = next_thread(tsk); t != tsk; t = next_thread(t)) {
> + if (!(t->flags & PF_EXITING))
> + break;
> + }
> + if (t != tsk) {
> + new_leader = t;
> +
> + new_leader->start_time = tsk->start_time;
> + task_pid(tsk)->tsk = new_leader;
So this pid won't be freed when current does detach_pid(PIDTYPE_PID): from
now on current->tid->tsk != current, so detach_pid() doesn't clear pid->tsk.
But when will it be freed then?
> + transfer_pid(tsk, new_leader, PIDTYPE_PGID);
> + transfer_pid(tsk, new_leader, PIDTYPE_SID);
> + list_replace_rcu(&tsk->tasks, &new_leader->tasks);
> +
> + /* Update group_leader on all of the threads... */
> + new_leader->group_leader = new_leader;
> + tsk->group_leader = new_leader;
This is one of the most questionable changes. Now current can't "trust" its
->group_leader without tasklist_lock.
As a random example, let's look at sys_setpgid(). By the time we take tasklist_lock,
group_leader can point to already freed/reused memory.
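A userspace analog of that hazard (illustrative only, with a mutex standing in for tasklist_lock and made-up helper names): sampling the pointer before taking the lock leaves you with a value that may already be stale, or freed in the kernel case, by the time the lock is held; the safe variant re-reads it under the lock.

/* Userspace analog of the hazard, with a mutex standing in for
 * tasklist_lock -- illustrative only.  Build: cc -pthread -o race race.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tasklist_lock = PTHREAD_MUTEX_INITIALIZER;

struct task { struct task *group_leader; };

/* racy: like the sys_setpgid() pattern above, the leader is read
 * before the lock that protects it is taken */
static struct task *racy_get_leader(struct task *p)
{
	struct task *leader = p->group_leader;	/* may go stale right here */

	pthread_mutex_lock(&tasklist_lock);
	/* ... dereferencing 'leader' here is a potential use-after-free ... */
	pthread_mutex_unlock(&tasklist_lock);
	return leader;
}

/* safe: the pointer is only read while the lock is held */
static struct task *safe_get_leader(struct task *p)
{
	struct task *leader;

	pthread_mutex_lock(&tasklist_lock);
	leader = p->group_leader;
	pthread_mutex_unlock(&tasklist_lock);
	return leader;
}

int main(void)
{
	struct task t = { .group_leader = &t };

	printf("racy=%p safe=%p\n",
	       (void *)racy_get_leader(&t), (void *)safe_get_leader(&t));
	return 0;
}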
> + } else {
> + write_unlock_irq(&tasklist_lock);
> + /* Wait for the other threads to exit before continuing */
> + for (;;) {
> + read_lock(&tasklist_lock);
> + if (thread_group_empty(tsk))
> + break;
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> + read_unlock(&tasklist_lock);
> + schedule();
> + }
I don't get this... Who will wake us?
Hmm, a deadlock with coredump? The coredumping thread may be waiting for us,
but we haven't done exit_mm() yet.
> @@ -1440,8 +1460,11 @@ static int wait_task_continued(struct task_struct *p, int noreap,
> if (!noreap)
> p->signal->flags &= ~SIGNAL_STOP_CONTINUED;
> spin_unlock_irq(&p->sighand->siglock);
> -
> - pid = task_pid_nr_ns(p, current->nsproxy->pid_ns);
> + if (p->ptrace && same_thread_group(current, p->parent))
> + pid = task_pid(p);
> + else
> + pid = task_tgid(p);
Why do we need the "&& same_thread_group()" above? A ptracer needs the per-thread
pid in any case.
> @@ -1501,12 +1487,15 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
> unsigned long flags;
> struct task_struct *parent;
> struct sighand_struct *sighand;
> + struct pid *pid;
>
> - if (tsk->ptrace & PT_PTRACED)
> + if (tsk->ptrace & PT_PTRACED) {
> parent = tsk->parent;
> - else {
> + pid = task_pid(tsk);
> + } else {
> tsk = tsk->group_leader;
> parent = tsk->real_parent;
> + pid = task_tgid(tsk);
Hmm, yes... With this patch task_pid(p->leader) != task_tgid(p).
Will continue tomorrow.
Oleg.
Oleg Nesterov <[email protected]> writes:
> On 12/09, Eric W. Biederman wrote:
>>
>> Oleg, below is my proof-of-concept patch, which really needs to be
>> broken up into a whole patch series so the changes are small
>> enough that we can do a thorough audit on them. Anyway, take a look
>> and see what you think.
>
> Amazing ;)
>
> This patch certainly needs some time to understand; so far I have
> read only a small subset. A couple of random questions.
Well I think it succeeds as a proof of concept and totally fails
as a production patch at this point.
>> * pgrp and session fields are deprecated.
>> @@ -1034,8 +1035,9 @@ struct task_struct {
>> struct list_head sibling; /* linkage in my parent's children list */
>> struct task_struct *group_leader; /* threadgroup leader */
>>
>> + struct pid *tid;
>> /* PID/PID hash table linkage. */
>> - struct pid_link pids[PIDTYPE_MAX];
>> + struct hlist_node pids[PIDTYPE_ARRAY_MAX];
>
> OK. It certainly makes sense to move the PIDTYPE_PGID/SID pids from task_struct
> to signal_struct.
>
> But can't we go a bit further? With this patch pid->tasks[].first still "points"
> to the leader's task_struct. Suppose we replace pid->tasks[] with pid->signals[],
> so that pid->signals[].first points to a signal_struct. Then we can find the task
> (group_leader) via signal->tgid.
>
> This means we can remove task_struct->pids and kill transfer_pid().
We need a way to still implement do_each_pid_task, but otherwise that should
work and be a nice cleanup all on its own.
>> static inline struct pid *task_tgid(struct task_struct *task)
>> {
>> - return task->group_leader->pids[PIDTYPE_PID].pid;
>> + struct signal_struct *sig = rcu_dereference(task->signal);
>> + struct pid *pid = NULL;
>> + if (sig)
>> + pid = sig->tgid;
>> + return pid;
>> }
>
> Hmm. This is fixable, but note that task->signal is not RCU protected,
> only ->sighand.
Yes. I realized that after I had sent the patch out. We do run
those functions with just RCU protection sometimes, so something would
need to be resolved there.
>> static inline int pid_alive(struct task_struct *p)
>> {
>> - return p->pids[PIDTYPE_PID].pid != NULL;
>> + return p->signal != NULL;
>> }
>
> (this change btw is imho good regardless, because pid_alive() currently
> means "the task is not unhashed yet" anyway).
Yes.
>> static void __unhash_process(struct task_struct *p)
>> {
>> nr_threads--;
>> - detach_pid(p, PIDTYPE_PID);
>> if (thread_group_leader(p)) {
>> detach_pid(p, PIDTYPE_PGID);
>> detach_pid(p, PIDTYPE_SID);
>> @@ -65,6 +64,7 @@ static void __unhash_process(struct task_struct *p)
>> list_del_rcu(&p->tasks);
>> __get_cpu_var(process_counts)--;
>> }
>> + detach_pid(p, PIDTYPE_PID);
>
> Not sure why this change is needed... To prevent a premature
> detach_pid()->free_pid()? But this doesn't look possible: if
> the task is the leader, p->tid->tsk == p, and detach_pid() does
This is a bit of a relic of how my patch developed.
I had the "if (task->tid != tsk->signal->tgid)" check in there
and was assuming the thread group id as my pid so I could clean
things up properly. And it worked out nicer if the detach_pid
for PIDTYPE_PID came later, as I could reuse the same logic
as in de_thread.
>
> if (pid->tsk) // still used, don't free.
> return;
>
>> @@ -946,6 +920,48 @@ fastcall NORET_TYPE void do_exit(long code)
>> }
>>
>> tsk->flags |= PF_EXITING;
>> + /* Transfer thread group leadership */
>> + if (thread_group_leader(tsk) && !thread_group_empty(tsk)) {
>
> Ah, this is racy without tasklist_lock. Suppose the current
> ->group_leader exits right now and elects us as the new leader.
Hmm. I thought I was redoing that test inside the lock.
Anyway, this hunk probably needs the most work, as it is brand new code.
>> + struct task_struct *new_leader, *t;
>> + write_lock_irq(&tasklist_lock);
>> + for (t = next_thread(tsk); t != tsk; t = next_thread(t)) {
>> + if (!(t->flags & PF_EXITING))
>> + break;
>> + }
>> + if (t != tsk) {
>> + new_leader = t;
>> +
>> + new_leader->start_time = tsk->start_time;
>> + task_pid(tsk)->tsk = new_leader;
>
> So this pid won't be freed when current does detach_pid(PIDTYPE_PID): from
> now on current->tid->tsk != current, so detach_pid() doesn't clear pid->tsk.
>
> But when will it be freed then?
When new_leader does detach_pid on it.
>> + transfer_pid(tsk, new_leader, PIDTYPE_PGID);
>> + transfer_pid(tsk, new_leader, PIDTYPE_SID);
>> + list_replace_rcu(&tsk->tasks, &new_leader->tasks);
>> +
>> + /* Update group_leader on all of the threads... */
>> + new_leader->group_leader = new_leader;
>> + tsk->group_leader = new_leader;
>
> This is one of the most questionable changes. Now current can't "trust" its
> ->group_leader without tasklist_lock.
>
> As a random example, let's look at sys_setpgid(). By the time we take tasklist_lock,
> group_leader can point to already freed/reused memory.
>
>> + } else {
>> + write_unlock_irq(&tasklist_lock);
>> + /* Wait for the other threads to exit before continuing */
>> + for (;;) {
>> + read_lock(&tasklist_lock);
>> + if (thread_group_empty(tsk))
>> + break;
>> + __set_current_state(TASK_UNINTERRUPTIBLE);
>> + read_unlock(&tasklist_lock);
>> + schedule();
>> + }
>
> I don't get this... Who will wake us?
Oh. Duh. I totally spaced that you need a wakeup in cases like this.
> Hmm, a deadlock with coredump? The coredumping thread may be waiting for us,
> but we haven't done exit_mm() yet.
Yes. That part is pretty ugly.
>
>> @@ -1440,8 +1460,11 @@ static int wait_task_continued(struct task_struct *p, int noreap,
>> if (!noreap)
>> p->signal->flags &= ~SIGNAL_STOP_CONTINUED;
>> spin_unlock_irq(&p->sighand->siglock);
>> -
>> - pid = task_pid_nr_ns(p, current->nsproxy->pid_ns);
>> + if (p->ptrace && same_thread_group(current, p->parent))
>> + pid = task_pid(p);
>> + else
>> + pid = task_tgid(p);
>
> Why do we need the "&& same_thread_group()" above? A ptracer needs the per-thread
> pid in any case.
It isn't easy to tell if our real parent or the ptracer is waiting
for us. My thinking was to detect the ptrace waiter by testing
current == p->parent, and then generalizing that to same_thread_group.
I now see that has a problem if the real parent and the ptracer
are the same.
I guess I need to use the state information and simply not tell the
real parent that we have stopped for a ptrace stop, or vice versa.
>> @@ -1501,12 +1487,15 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
>> unsigned long flags;
>> struct task_struct *parent;
>> struct sighand_struct *sighand;
>> + struct pid *pid;
>>
>> - if (tsk->ptrace & PT_PTRACED)
>> + if (tsk->ptrace & PT_PTRACED) {
>> parent = tsk->parent;
>> - else {
>> + pid = task_pid(tsk);
>> + } else {
>> tsk = tsk->group_leader;
>> parent = tsk->real_parent;
>> + pid = task_tgid(tsk);
>
> Hmm, yes... With this patch task_pid(p->leader) != task_tgid(p).
Not necessarily, no, and that is a pain.
> Will continue tomorrow.
Thanks
Eric