de_thread() yields waiting for ->group_leader to be a zombie. This deadlocks
if an rt-prio execer shares the same cpu with ->group_leader. Change the code
to use ->group_exit_task/notify_count mechanics.
This patch certainly uglifies the code; perhaps someone can suggest something
better.
Signed-off-by: Oleg Nesterov <[email protected]>
--- t/kernel/exit.c~5_DEADLOCK	2007-08-17 18:56:31.000000000 +0400
+++ t/kernel/exit.c	2007-08-18 20:57:49.000000000 +0400
@@ -102,10 +102,9 @@ static void __exit_signal(struct task_st
 		 * If there is any task waiting for the group exit
 		 * then notify it:
 		 */
-		if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count) {
+		if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count)
 			wake_up_process(sig->group_exit_task);
-			sig->group_exit_task = NULL;
-		}
+
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
 		/*
@@ -850,6 +849,11 @@ static void exit_notify(struct task_stru
 		state = EXIT_DEAD;
 	tsk->exit_state = state;
 
+	if (thread_group_leader(tsk) &&
+	    tsk->signal->notify_count < 0 &&
+	    tsk->signal->group_exit_task)
+		wake_up_process(tsk->signal->group_exit_task);
+
 	write_unlock_irq(&tasklist_lock);
 
 	list_for_each_safe(_p, _n, &ptrace_dead) {

--- t/fs/exec.c~5_DEADLOCK	2007-08-18 19:34:12.000000000 +0400
+++ t/fs/exec.c	2007-08-18 20:43:59.000000000 +0400
@@ -828,16 +828,15 @@ static int de_thread(struct task_struct
 			hrtimer_restart(&sig->real_timer);
 		spin_lock_irq(lock);
 	}
+
+	sig->notify_count = count;
+	sig->group_exit_task = tsk;
 	while (atomic_read(&sig->count) > count) {
-		sig->group_exit_task = tsk;
-		sig->notify_count = count;
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(lock);
 		schedule();
 		spin_lock_irq(lock);
 	}
-	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
 	spin_unlock_irq(lock);
 
 	/*
@@ -846,14 +845,17 @@ static int de_thread(struct task_struct
 	 * and to assume its PID:
 	 */
 	if (!thread_group_leader(tsk)) {
-		/*
-		 * Wait for the thread group leader to be a zombie.
-		 * It should already be zombie at this point, most
-		 * of the time.
-		 */
 		leader = tsk->group_leader;
-		while (leader->exit_state != EXIT_ZOMBIE)
-			yield();
+
+		sig->notify_count = -1;
+		for (;;) {
+			write_lock_irq(&tasklist_lock);
+			if (likely(leader->exit_state))
+				break;
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			write_unlock_irq(&tasklist_lock);
+			schedule();
+		}
 
 		/*
 		 * The only record we have of the real-time age of a
@@ -867,8 +869,6 @@ static int de_thread(struct task_struct
 		 */
 		tsk->start_time = leader->start_time;
 
-		write_lock_irq(&tasklist_lock);
-
 		BUG_ON(leader->tgid != tsk->tgid);
 		BUG_ON(tsk->pid == tsk->tgid);
 		/*
@@ -901,6 +901,8 @@ static int de_thread(struct task_struct
 		write_unlock_irq(&tasklist_lock);
 	}
 
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
 	/*
 	 * There may be one thread left which is just exiting,
 	 * but it's safe to stop telling the group to kill themselves.
Maybe it can use wait_task_inactive, which IIUC is being changed to address
the same RT issue. OTOH, notify_count exists only for this. So maybe the
better way is to clean that whole mechanism up somehow. The exit.c changes
in your patch seem to be making it more mysterious rather than less so.
I haven't really thought much about the better solution.
Thanks,
Roland
On 08/18, Roland McGrath wrote:
>
> Maybe it can use wait_task_inactive, which IIUC is being changed to address
> the same RT issue.
The group_leader can sleep before it enters exit_notify(). In that case
wait_task_inactive() returns, and we still need some polling to wait for
EXIT_ZOMBIE.
> OTOH, notify_count exists only for this. So maybe the
> better way is to clean that whole mechanism up somehow.
Yes sure. But in any case I think we should avoid polling, we need some
notification from exit_notify().
> The exit.c changes
> in your patch seem to be making it more mysterious rather than less so.
I completely agree.
Oleg.
> The group_leader can sleep before it enters exit_notify(). In that case
> wait_task_inactive() returns, and we still need some polling to wait for
> EXIT_ZOMBIE.
It could be a loop as now with yield. It's still polling, but only one
poll per wakeup of the target.
> Yes sure. But in any case I think we should avoid polling, we need some
> notification from exit_notify().
Indeed.
Thanks,
Roland
On 08/19, Roland McGrath wrote:
>
> > The group_leader can sleep before it enters exit_notify(). In that case
> > wait_task_inactive() returns, and we still need some polling to wait for
> > EXIT_ZOMBIE.
>
> It could be a loop as now with yield. It's still polling, but only one
> poll per wakeup of the target.
I guess I misunderstand you. Do you mean
de_thread:
	/*
	 * Wait for the thread group leader to be a zombie.
	 * It should already be zombie at this point, most
	 * of the time.
	 */
	while (leader->exit_state != EXIT_ZOMBIE)
		wait_task_inactive(leader);
? This becomes a busy-wait loop if the leader sleeps, wait_task_inactive()
doesn't sleep/yield in this case. Not good.
> > Yes sure. But in any case I think we should avoid polling, we need some
> > notification from exit_notify().
>
> Indeed.
This means we should put something in exit_notify(), like this patch does.
It could be simplified a bit, we don't in fact need a negative ->notify_count,
we can tolerate a false wakeup. We can even skip the "thread_group_leader()"
check for the same reason.
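For illustration, a minimal sketch of that simplified exit_notify() hook (not
the patch above; it reuses the existing ->group_exit_task field and relies on
de_thread() re-checking leader->exit_state under tasklist_lock before it
sleeps again, so a spurious wakeup is harmless):

	/* exit_notify(), after tsk->exit_state has been set: */
	if (tsk->signal->group_exit_task)
		wake_up_process(tsk->signal->group_exit_task);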
(Of course, we can also add wait_queue_head_t to ->signal, but I don't think
you have this in mind).
Oleg.
> ? This becomes a busy-wait loop if the leader sleeps, wait_task_inactive()
> doesn't sleep/yield in this case. Not good.
Ah, I see. Yes, you're right, that will not fly. (I was thinking of the
apparently forthcoming wait_task_inactive change to avoid yield, but that
will still busy-wait in fact.)
> This means we should put something in exit_notify(), like this patch does.
> It could be simplified a bit, we don't in fact need a negative ->notify_count,
> we can tolerate a false wakeup. We can even skip the "thread_group_leader()"
> check for the same reason.
>
> (Of course, we can also add wait_queue_head_t to ->signal, but I don't think
> you have this in mind).
I had in mind unifying this need with what's now done by the notify_count
check that's in __exit_signal. Aside from those BUG_ON's, I'm not sure
de_thread really cares whether the other non-leader threads are finished
self reaping, or only if they are dead. Adding some field to signal_struct
for this new mechanism is certainly fine if it goes along with removing a
word or two from task_struct (notify_count, group_exit_task). (The single
other use overloaded on group_exit_task is for a similar need in the
pre-coredump synchronization. So maybe that can be done more cleanly too.)
Any new field could be kept to a single pointer; since it's only needed
while one given thread is blocking, it can point to something larger on its
stack if necessary.
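A rough sketch of that shape (hypothetical names, nothing like this exists in
the tree; the only point is that signal_struct carries a single pointer while
the real state lives on the execing thread's stack):

	/* on the stack of the thread doing de_thread(): */
	struct exec_wait {
		struct task_struct	*waiter;	/* the execing thread */
		int			count;		/* threads still to wait for */
	};

	/* in signal_struct, replacing ->group_exit_task/->notify_count: */
	struct exec_wait	*exec_wait;	/* non-NULL only during exec */

exit_notify() (or __exit_signal()) would then decrement ->count and wake
->waiter when it reaches zero, under the same lock de_thread() sleeps under.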
While we are considering cleaning up the exec synchronization, there is a
long-standing issue it would be nice to address. That is, the race of a
group fatal signal with exec. (I've mentioned this before, but never come
up with an adequate solution.) As things are, one thread can be processing
a fatal signal while another thread does exec (de_thread). If de_thread
gets the siglock first and sets SIGNAL_GROUP_EXIT, then the fatal-signal
thread will see an exit already in progress and just drop its signal on the
floor. But, it was not an exit at all, but really an exec. A fatal signal
from shared_pending should have killed the whole process, but was swallowed.
This is a POSIX violation, and potentially usable in a racy DoS exploit to
let a runaway process keep exec'ing and never get killed (though probably not).
An obvious path to a fix for that is to avoid overloading SIGNAL_GROUP_EXIT
in de_thread. In coming up with different synchronization method we might
find a way that cleans that up too.
Thanks,
Roland
On 08/19, Roland McGrath wrote:
>
> I had in mind unifying this need with what's now done by the notify_count
> check that's in __exit_signal. Aside from those BUG_ON's, I'm not sure
> de_thread really cares whether the other non-leader threads are finished
> self reaping, or only if they are dead. Adding some field to signal_struct
> for this new mechanism is certainly fine if it goes along with removing a
> word or two from task_struct (notify_count, group_exit_task). (The single
> other use overloaded on group_exit_task is for a similar need in the
> pre-coredump synchronization. So maybe that can be done more cleanly too.)
> Any new field could be kept to a single pointer; since it's only needed
> while one given thread is blocking, it can point to something larger on its
> stack if necessary.
I seem to understand what you mean... Yes, at first glance, we can consider
the sub-thread with ->exit_state != 0 as "dead enough" for de_thread. This
allows us to simplify the logic greatly.
But: we should somehow change the ->group_leader for all sub-threads which
didn't do release_task(self) yet, nasty. (Perhaps it makes sense to move
->group_leader into signal_struct. It is not safe to use it if tsk already
did __exit_signal() anyway).
Another problem, it is not easy to define when the ->group_exit_task (or
whatever) should be notified from exit_notify(), we need another atomic_t
for that (like signal_struct.live).
In fact, it is not necessary to put this counter into signal_struct,
de_thread() can count sub-threads (say, modify zap_other_threads() to
return "int") and put this info into the structure on stack, as you
pointed out.
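Roughly like this, just to illustrate (a sketch against the current
zap_other_threads(), which returns void; whether threads that already have
->exit_state set should be counted is exactly the open question above):

	int zap_other_threads(struct task_struct *p)
	{
		struct task_struct *t;
		int count = 0;

		p->signal->flags = SIGNAL_GROUP_EXIT;
		p->signal->group_stop_count = 0;

		for (t = next_thread(p); t != p; t = next_thread(t)) {
			/* don't bother with already dead threads */
			if (t->exit_state)
				continue;

			/* SIGKILL will be handled before any pending SIGSTOP */
			sigaddset(&t->pending.signal, SIGKILL);
			signal_wake_up(t, 1);
			count++;
		}

		return count;
	}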
Imho, this is definitely worth thinking about. See also below.
But what do you think about this patch? Should we go with it for now,
or should we ignore the problem until we can clean up the whole thing?
I do not claim this problem is very important, but this yield()
is really buggy (perhaps it is always buggy).
> While we are considering cleaning up the exec synchronization, there is a
> long-standing issue it would be nice to address. That is, the race of a
> group fatal signal with exec. (I've mentioned this before, but never come
> up with an adequate solution.)
>
> An obvious path to a fix for that is to avoid overloading SIGNAL_GROUP_EXIT
> in de_thread. In coming up with different synchronization method we might
> find a way that cleans that up too.
Yes, yes, yes, these problems are really connected, and I also thought about
that.
But can't we first cleanup some other oddities with signal->flags? For example,
it would be nice to kill SIGNAL_STOP_DEQUEUED, but we can't because
handle_stop_signal(SIGKILL) clears ->signal->flags. And this is done because
tkill() doesn't do what __group_complete_signal() does for sig_fatal() signals
(I mean this big "if (sig_fatal(p, sig) ...").
Why? can't we split __group_complete_signal() into 2 functions, the new one
is used both by do_tkill() and __group_send_sig_info() ?
Just one example. Suppose that an application does
signal(SIGTERM, SIG_DFL);
sigtimedwait( {SIGTERM} );
Now,
tkill(SIGTERM) => sigtimedwait() succeeds
kill(SIGTERM) => application is killed UNLESS it has TIF_SIGPENDING
This looks really strange for me.
While we are here, I'd like to ask another question (sorry for the long and
somewhat off-topic post :)
Suppose that we have sub-threads T1 and T2, both do not block SIGXXX.
kill(SIGXXX) choose T1 as a target and does signal_wake_up(T1).
T1 can enter sys_sigprocmask(SIG_BLOCK, {SIGXXX}) before it sees the signal.
Now SIGXXX is delayed until T2 does recalc_sigpending().
Is it ok? This is easy to fix, for example sys_sigprocmask() can check
signal_pending() and return ERESTARTNOINTR.
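Something along these lines, only to show the idea (a sketch, not a tested
patch; where exactly the check goes in sigprocmask()/sys_rt_sigprocmask() is
up for discussion, and the helper name is made up):

	/* hypothetical SIG_BLOCK path */
	static int block_signals_checked(const sigset_t *newset)
	{
		spin_lock_irq(&current->sighand->siglock);
		if (signal_pending(current)) {
			/* a signal (possibly a shared one that picked us as
			 * its target) is already pending; restart the syscall
			 * after it is delivered instead of hiding it behind
			 * the new mask */
			spin_unlock_irq(&current->sighand->siglock);
			return -ERESTARTNOINTR;
		}
		sigorsets(&current->blocked, &current->blocked, newset);
		recalc_sigpending();
		spin_unlock_irq(&current->sighand->siglock);
		return 0;
	}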
Oleg.
> I seem to understand what you mean... Yes, at first glance, we can consider
> the sub-thread with ->exit_state != 0 as "dead enough" for de_thread. This
> allows us to simplify the logic greatly.
I can't really think of any definite problem with that off hand. Aside
from stray BUG_ON's, that is. It occurs to me to worry about waiting
for the __exit_signal work of accumulating [us]time in signal_struct to
be finished before the exec fully completes and the new program can use
getrusage et al. But I don't think there really is any problem there.
For each thread whose __exit_signal isn't finished yet, the normal
summing in k_getrusage will work.
> But: we should somehow change the ->group_leader for all sub-threads which
> didn't do release_task(self) yet, nasty. (Perhaps it makes sense to move
> ->group_leader into signal_struct. It is not safe to use it if tsk already
> did __exit_signal() anyway).
It's at least still safe to use ->group_leader->{pid,tgid} under
rcu_read_lock if you checked for NULL. So I'm concerned there could be
some strange cases that make changing this a pain in ways we haven't
thought of. But I have long thought that between group_leader, signal,
pid, and tgid, we have some redundancy. ->signal->tsk is used only for
itimers, but is effectively ->group_leader already. Likewise ->tgid is
->group_leader->pid and it's hard to see a real reason for the field.
But there are probably some strange corners.
Cleanups aside, why does it matter if their ->group_leader pointer is
stale while they are on the way to dying and self-reaping? Off hand,
I think it only matters that the old leader is release_task'd last (as
now) in case an old pointer is used. But I'm not sure what in that
circumstance would really care about getting stale data from whatever
fields it examines in ->group_leader.
> Another problem, it is not easy to define when the ->group_exit_task (or
> whatever) should be notified from exit_notify(), we need another atomic_t
> for that (like signal_struct.live).
Off hand it seems to me that either signal->live or signal->count can serve.
> In fact, it is not necessary to put this counter into signal_struct,
> de_thread() can count sub-threads (say, modify zap_other_threads() to
> return "int") and put this info into the structure on stack, as you
> pointed out.
Could be.
> But what do you think about this patch? Should we go with it for now,
> or should we ignore the problem until we can clean up the whole thing?
> I do not claim this problem is very important, but this yield()
> is really buggy (perhaps it is always buggy).
I think we can probably come up with an incremental bit of cleanup
soon that is not too invasive and solves the problem more cleanly.
So I'd rather try to hash that out now than just use that patch.
> But can't we first cleanup some other oddities with signal->flags? For example,
> it would be nice to kill SIGNAL_STOP_DEQUEUED, but we can't because
> handle_stop_signal(SIGKILL) clears ->signal->flags. And this is done because
> tkill() doesn't do what __group_complete_signal() does for sig_fatal() signals
> (I mean this big "if (sig_fatal(p, sig) ...").
This is an odd way to describe it. Firstly, SIGNAL_STOP_DEQUEUED exists
for other reasons, as I've just posted about in another thread. SIGKILL
clears ->signal->flags by design, because it overrides previous states.
It's certainly not "because tkill doesn't do ...". The sig_fatal code in
__group_complete_signal was added as an optimization. The intent was
always that the behavior already be correct without it, just sometimes
subject to more scheduling delay than a user might like.
Perhaps with complex scheduling policies it is now the case that the
unoptimized behavior can sometimes select so wrong a thread to dequeue a
process signal that the quality of implementation of process signal
delivery drops to effectively incorrect because you can't make a thing die
any time soon.
> Why? can't we split __group_complete_signal() into 2 functions, the new one
> is used both by do_tkill() and __group_send_sig_info() ?
We can. The new function (complete_signal?) could be called by
__group_complete_signal on the selected target, and roughly replace the
signal_wake_up calls in specific_send_sig_info and send_sigqueue.
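In shape, perhaps something like this (a sketch of the split only, with the
fatal-signal body elided; that body would be the existing sig_fatal() logic
moved out of __group_complete_signal(), not new code, and the condition below
is paraphrased from memory):

	/* called with ->siglock held; t is the thread chosen for the signal */
	static void complete_signal(int sig, struct task_struct *p,
				    struct task_struct *t)
	{
		if (sig_fatal(p, sig) && !(p->signal->flags & SIGNAL_GROUP_EXIT) &&
		    !sigismember(&t->real_blocked, sig) &&
		    (sig == SIGKILL || !(t->ptrace & PT_PTRACED))) {
			/* the existing "turn this into a group-wide SIGKILL"
			 * code from __group_complete_signal() goes here */
		}
		signal_wake_up(t, sig == SIGKILL);
	}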
> Just one example. Suppose that an application does
>
> signal(SIGTERM, SIG_DFL);
> sigtimedwait( {SIGTERM} );
>
> Now,
> tkill(SIGTERM) => sigtimedwait() succeeds
>
> kill(SIGTERM) => application is killed UNLESS it has TIF_SIGPENDING
>
> This looks really strange for me.
If that happens, it is a bug. This check is supposed to prevent that:
!sigismember(&t->real_blocked, sig) &&
> While we are here, I'd like to ask another question (sorry for the long and
> somewhat off-topic post :)
I would prefer separate threads about each issue, but I'm certainly
glad for the discussion.
> Suppose that we have sub-threads T1 and T2, both do not block SIGXXX.
> kill(SIGXXX) choose T1 as a target and does signal_wake_up(T1).
> T1 can enter sys_sigprocmask(SIG_BLOCK, {SIGXXX}) before it sees the signal.
> Now SIGXXX is delayed until T2 does recalc_sigpending().
>
> Is it ok? This is easy to fix, for example sys_sigprocmask() can check
> signal_pending() and return ERESTARTNOINTR.
I think this is a bug and needs to be fixed. Scheduling delays in signal
delivery are fine, but process signals must not sit pending while threads
not blocking that signal run merrily along. I think the fix you suggest
will do fine for this case. We should consider any other places where
->blocked is affected, e.g. sigsuspend, pselect, ppoll. If it's possible
those can add signals to ->blocked before handling pending process signals,
then it might be necessary to have them find another thread to do
recalc_sigpending_and_wake on, as in exit_notify. (Hopefully any such can
just be made to handle pending signals before blocking any.)
Thanks,
Roland
On 08/28, Roland McGrath wrote:
>
> > I seem to understand what you mean... Yes, at first glance, we can consider
> > the sub-thread with ->exit_state != 0 as "dead enough" for de_thread. This
> > allows us to simplify the logic greatly.
>
> I can't really think of any definite problem with that off hand. Aside
> from stray BUG_ON's, that is.
Yes. The more I think about your idea, the more I like it.
> > But: we should somehow change the ->group_leader for all sub-threads which
> > didn't do release_task(self) yet, nasty. (Perhaps it makes sense to move
> > ->group_leader into signal_struct. It is not safe to use it if tsk already
> > did __exit_signal() anyway).
>
> It's at least still safe to use ->group_leader->{pid,tgid} under
> rcu_read_lock if you checked for NULL.
No, not if the changed de_thread() does release_task(old_leader). Now, if another
sub-thread didn't do release_task(self) yet, it has a valid sighand, can be
found by /proc, etc, but its ->group_leader points to nothing; this memory
could be freed. Note proc_task_readdir(), for example.
In fact, even the release_task(self) becomes unsafe.
> > Another problem, it is not easy to define when the ->group_exit_task (or
> > whatever) should be notified from exit_notify(), we need another atomic_t
> > for that (like signal_struct.live).
>
> Off hand it seems to me that either signal->live or signal->count can serve.
I don't think they can work; the first one is too early, the second is too
late...
> > In fact, it is not necessary to put this counter into signal_struct,
> > de_thread() can count sub-threads (say, modify zap_other_threads() to
> > return "int") and put this info into the structure on stack, as you
> > pointed out.
>
> Could be.
Yes, I think this can work.
> > But what do you think about this patch? Should we go with it for now,
> > or should we ignore the problem until we can clean up the whole thing?
> > I do not claim this problem is very important, but this yield()
> > is really buggy (perhaps it is always buggy).
>
> I think we can probably come up with an incremental bit of cleanup
> soon that is not too invasive and solves the problem more cleanly.
> So I'd rather try to hash that out now than just use that patch.
OK, we can drop it, I agree.
> > But can't we first cleanup some other oddities with signal->flags? For example,
> > it would be nice to kill SIGNAL_STOP_DEQUEUED, but we can't because
> > handle_stop_signal(SIGKILL) clears ->signal->flags. And this is done because
> > tkill() doesn't do what __group_complete_signal() does for sig_fatal() signals
> > (I mean this big "if (sig_fatal(p, sig) ...").
>
> This is an odd way to describe it. Firstly, SIGNAL_STOP_DEQUEUED exists
> for other reasons, as I've just posted about in another thread. SIGKILL
> clears ->signal->flags by design, because it overrides previous states.
Yes, yes, I meant that it would be nice to do it the other way; see below...
> > Why? can't we split __group_complete_signal() into 2 functions, the new one
> > is used both by do_tkill() and __group_send_sig_info() ?
>
> We can. The new function (complete_signal?) could be called by
> __group_complete_signal on the selected target, and roughly replace the
> signal_wake_up calls in specific_send_sig_info and send_sigqueue.
Yes, thanks, this is what I meant, and it implies that handle_stop_signal()
can do nothing for SIGKILL.
> > Just one example. Suppose that an application does
> >
> > signal(SIGTERM, SIG_DFL);
> > sigtimedwait( {SIGTERM} );
> >
> > Now,
> > tkill(SIGTERM) => sigtimedwait() succeeds
> >
> > kill(SIGTERM) => application is killed UNLESS it has TIF_SIGPENDING
> >
> > This looks really strange for me.
>
> If that happens, it is a bug. This check is supposed to prevent that:
>
> !sigismember(&t->real_blocked, sig) &&
Yes, but only if the signal was blocked before doing sigtimedwait().
Otherwise the behaviour is very different, and I don't know which
behaviour is correct. Perhaps we can ignore the difference between
kill/tkill; at least that difference is always the same.
But TIF_SIGPENDING is "random", and imho it is not good that it can make
a difference. (In case I was unclear, note that wants_signal() checks
signal_pending(), so __group_complete_signal() may return without
"converting" a sig_fatal() signal to SIGKILL).
> > Suppose that we have sub-threads T1 and T2, both do not block SIGXXX.
> > kill(SIGXXX) choose T1 as a target and does signal_wake_up(T1).
> > T1 can enter sys_sigprocmask(SIG_BLOCK, {SIGXXX}) before it sees the signal.
> > Now SIGXXX is delayed until T2 does recalc_sigpending().
> >
> > Is it ok? This is easy to fix, for example sys_sigprocmask() can check
> > signal_pending() and return ERESTARTNOINTR.
>
> I think this is a bug and needs to be fixed. Scheduling delays in signal
> delivery are fine, but process signals must not sit pending while threads
> not blocking that signal run merrily along. I think the fix you suggest
> will do fine for this case.
Great,
> We should consider any other places where
> ->blocked is affected, e.g. sigsuspend, pselect, ppoll. If it's possible
> those can add signals to ->blocked before handling pending process signals,
> then it might be necessary to have them find another thread to do
> recalc_sigpending_and_wake on, as in exit_notify. (Hopefully any such can
> just be made to handle pending signals before blocking any.)
Aha, I need to think about this, thanks!
Oleg.