(sorry to all, re-sending with the correct Andrew's email)
Convert do_tkill() to use rcu_read_lock() + lock_task_sighand() to avoid taking
tasklist lock.
Note that we don't return an error if lock_task_sighand() fails, we pretend the
task dies after receiving the signal. Otherwise, we should fight with the nasty
races with mt-exec without having any advantage.
Signed-off-by: Oleg Nesterov <[email protected]>
--- 25/kernel/signal.c~2_TKILL_NO_TASKLIST 2008-03-07 10:47:02.000000000 +0300
+++ 25/kernel/signal.c 2008-03-07 12:20:34.000000000 +0300
@@ -2194,6 +2194,7 @@ static int do_tkill(int tgid, int pid, i
int error;
struct siginfo info;
struct task_struct *p;
+ unsigned long flags;
error = -ESRCH;
info.si_signo = sig;
@@ -2202,7 +2203,7 @@ static int do_tkill(int tgid, int pid, i
info.si_pid = task_tgid_vnr(current);
info.si_uid = current->uid;
- read_lock(&tasklist_lock);
+ rcu_read_lock();
p = find_task_by_vpid(pid);
if (p && (tgid <= 0 || task_tgid_vnr(p) == tgid)) {
error = check_kill_permission(sig, &info, p);
@@ -2210,13 +2211,12 @@ static int do_tkill(int tgid, int pid, i
* The null signal is a permissions and process existence
* probe. No signal is actually delivered.
*/
- if (!error && sig && p->sighand) {
- spin_lock_irq(&p->sighand->siglock);
+ if (!error && sig && lock_task_sighand(p, &flags)) {
error = specific_send_sig_info(sig, &info, p);
- spin_unlock_irq(&p->sighand->siglock);
+ unlock_task_sighand(p, &flags);
}
}
- read_unlock(&tasklist_lock);
+ rcu_read_unlock();
return error;
}
On Fri, Mar 07, 2008 at 12:58:13PM +0300, Oleg Nesterov wrote:
> Note that we don't return an error if lock_task_sighand() fails, we pretend the
> task dies after receiving the signal. Otherwise, we should fight with the nasty
> races with mt-exec without having any advantage.
This should be mentioned in a comment in the code.
On 03/07, Christoph Hellwig wrote:
>
> On Fri, Mar 07, 2008 at 12:58:13PM +0300, Oleg Nesterov wrote:
> > Note that we don't return an error if lock_task_sighand() fails, we pretend the
> > task dies after receiving the signal. Otherwise, we should fight with the nasty
> > races with mt-exec without having any advantage.
>
> This should be mentioned in a comment in the code.
Signed-off-by: Oleg Nesterov <[email protected]>
--- 25/kernel/signal.c~2__COMMENT 2008-03-07 13:06:17.000000000 +0300
+++ 25/kernel/signal.c 2008-03-07 13:59:09.000000000 +0300
@@ -2201,6 +2201,10 @@ static int do_tkill(int tgid, int pid, i
/*
* The null signal is a permissions and process existence
* probe. No signal is actually delivered.
+ *
+ * If lock_task_sighand() fails we pretend the task dies
+ * after receiving the signal. The window is tiny, and the
+ * signal is private anyway.
*/
if (!error && sig && lock_task_sighand(p, &flags)) {
error = specific_send_sig_info(sig, &info, p);
> Convert do_tkill() to use rcu_read_lock() + lock_task_sighand() to avoid
> taking tasklist lock.
That part looks good.
> Note that we don't return an error if lock_task_sighand() fails, we
> pretend the task dies after receiving the signal. Otherwise, we should
> fight with the nasty races with mt-exec without having any advantage.
To clarify, this is not a change from the existing behavior.
So your change is fine regardless of this issue.
The case you have in mind is that p was the old group_leader
being replaced by another thread that exec'd, right?
It is the most obscure of nits, but I think it can be wrong to drop a
signal in this case. If it's a fatal signal (especially SIGKILL),
then either the thread group should be killed or the call should
return an error.
For the exec case, if p->sighand is cleared that means the
release_task(leader) call at the end of de_thread started. So by
now, the pid has been transferred to the exec'ing thread. If we just
restart the lookup, it will find the new thread (or not, and we can
return -ESRCH).
I'm inclined to do that, but it certainly should be a second patch
after this one.
Thanks,
Roland
On 03/07, Roland McGrath wrote:
>
> > Note that we don't return an error if lock_task_sighand() fails, we
> > pretend the task dies after receiving the signal. Otherwise, we should
> > fight with the nasty races with mt-exec without having any advantage.
>
> To clarify, this is not a change from the existing behavior.
> So your change is fine regardless of this issue.
>
> The case you have in mind is that p was the old group_leader
> being replaced by another thread that exec'd, right?
Yes.
> It is the most obscure of nits, but I think it can be wrong to drop a
> signal in this case. If it's a fatal signal (especially SIGKILL),
> then either the thread group should be killed or the call should
> return an error.
Yes, we are not consistent here with or without this patch.
Btw, a question: we are buggy or just "not perfect" ? After all, the
main thread actually exits although this is just linux's implementation
detail.
> For the exec case, if p->sighand is cleared that means the
> release_task(leader) call at the end of de_thread started. So by
> now, the pid has been transferred to the exec'ing thread. If we just
> restart the lookup, it will find the new thread (or not, and we can
> return -ESRCH).
Yep. kill_pid_info() does exactly this. Initially I was going to repeat
this logic,
> I'm inclined to do that, but it certainly should be a second patch
> after this one.
but actually it doesn't buy much here.
Suppose that the main thread is already dead (dequeued SIGKILL), but
not yet released. This window is not that small. In that window (before
de_thread() switches pids) any private signal (even SIGKILL) sent to the
main thread will be silently lost.
To do a proper fix, we should change de_thread() so that the new leader
also "inherits" old_leader->pending signals.
This is certainly possible as a separate change.
Oleg.
On 03/07, Oleg Nesterov wrote:
>
> Suppose that the main thread is already dead (dequeued SIGKILL), but
> not yet released. This window is not that small. In that window (before
> de_thread() switches pids) any private signal (even SIGKILL) sent to the
> main thread will be silently lost.
>
> To do a proper fix, we should change de_thread() so that the new leader
> also "inherits" old_leader->pending signals.
Ah, even this is not enough if we want a "perfect" fix. Because the private
SIGKILL will be "ignored" until the main thread dequeues the SIGKILL which
was sent by de_thread().
We can change __group_complete_signal/zap_other_threads so that they won't
do sigaddset(), just signal_wake_up(). But in that case dequeue_signal()
and recalc_signal() should take signal_group_exit into account...
Oleg.
> Btw, a question: we are buggy or just "not perfect" ? After all, the
> main thread actually exits although this is just linux's implementation
> detail.
I think it's buggy. The SIGKILL should kill the whole process.
> Yep. kill_pid_info() does exactly this. Initially I was going to repeat
> this logic,
I think that precedent is enough reason to follow its example as the first
change. We can consider more cleanups from there.
> Suppose that the main thread is already dead (dequeued SIGKILL), but
> not yet released. This window is not that small. In that window (before
> de_thread() switches pids) any private signal (even SIGKILL) sent to the
> main thread will be silently lost.
This is the big problem with exec that I've cited before. It can even
happen with group-wide signals that should be fatal, but avoided the
__group_complete_signal special fatal case. (e.g. the thread racing with
the exec thread just now unblocked the signal and dequeued it.) IIRC it
was the biggest reason we wanted to revisit the whole MT exec plan.
> We can change __group_complete_signal/zap_other_threads so that they won't
> do sigaddset(), just signal_wake_up(). But in that case dequeue_signal()
> and recalc_signal() should take signal_group_exit into account...
I'd like to revisit the use of "fake" SIGKILL for group exits. That goes
well with a rethink of MT exec. But let's not get into all of that right now.
Thanks,
Roland
On 03/07, Roland McGrath wrote:
>
> > Btw, a question: we are buggy or just "not perfect" ? After all, the
> > main thread actually exits although this is just linux's implementation
> > detail.
>
> I think it's buggy. The SIGKILL should kill the whole process.
OK.
> > Suppose that the main thread is already dead (dequeued SIGKILL), but
> > not yet released. This window is not that small. In that window (before
> > de_thread() switches pids) any private signal (even SIGKILL) sent to the
> > main thread will be silently lost.
>
> This is the big problem with exec that I've cited before. It can even
> happen with group-wide signals that should be fatal, but avoided the
> __group_complete_signal special fatal case. (e.g. the thread racing with
> the exec thread just now unblocked the signal and dequeued it.) IIRC it
> was the biggest reason we wanted to revisit the whole MT exec plan.
Oh. Could you clarify? Afaics, currently exec() can't miss the fatal group
signal?
> > We can change __group_complete_signal/zap_other_threads so that they won't
> > do sigaddset(), just signal_wake_up(). But in that case dequeue_signal()
> > and recalc_signal() should take signal_group_exit into account...
>
> I'd like to revisit the use of "fake" SIGKILL for group exits. That goes
> well with a rethink of MT exec. But let's not get into all of that right now.
Yes.
Oleg.
> > This is the big problem with exec that I've cited before. It can even
> > happen with group-wide signals that should be fatal, but avoided the
> > __group_complete_signal special fatal case. (e.g. the thread racing with
> > the exec thread just now unblocked the signal and dequeued it.) IIRC it
> > was the biggest reason we wanted to revisit the whole MT exec plan.
>
> Oh. Could you clarify? Afaics, currently exec() can't miss the fatal group
> signal?
It's a while since I thought hard about the exec stuff. Just now I was
thinking of one particular scenario. Perhaps it can't really happen.
Here it is:
Threads A and B both block SIGTERM. An outside process sends SIGTERM,
so it is queued in shared_pending but noone wakes.
A starts an exec.
B unblocks SIGTERM.
B enters get_signal_to_deliver, locks, dequeues SIGTERM, unlocks.
Now B is e.g. just before "current->flags |= PF_SIGNALED;".
A locks. No group-exit is in yet progress.
A does zap_other_threads, and unlocks.
B enters do_group_exit. A group-exit is in progress, so it just exits.
SIGTERM is lost.
So I think it really can happen. Anyway, this is just one example.
It's not so hard to think of ways to address this one (though it gets
nontrivial quick with the coredump case). But for me it is just an
example of why we still need to step back and think over the whole
exec picture.
As I said, let's not conflate all that with this thread about a
relatively conservative cleanup. We can discuss that separately.
(But I think it might need to wait for a breather and time for
other dust to settle.)
Thanks,
Roland
(trim CC list, change the subject)
On 03/07, Roland McGrath wrote:
>
> > Btw, a question: we are buggy or just "not perfect" ? After all, the
> > main thread actually exits although this is just linux's implementation
> > detail.
>
> I think it's buggy. The SIGKILL should kill the whole process.
OK, thanks... but the question above was only about the exiting leader in
the middle of exec...
Do we behave correctly when we send the specific "special" KILL/STOP/CONT
signal to the zombie leader when the thread group is alive?
For example, SIGKILL is silently (has subtle side effect) ignored, SIGSTOP
doesn't stop but has side effects, SIGCONT works.
Hmm? If this is not correct we can fix it.
> > > This is the big problem with exec that I've cited before. It can even
> > > happen with group-wide signals that should be fatal, but avoided the
> > > __group_complete_signal special fatal case. (e.g. the thread racing with
> > > the exec thread just now unblocked the signal and dequeued it.) IIRC it
> > > was the biggest reason we wanted to revisit the whole MT exec plan.
> >
> > Oh. Could you clarify? Afaics, currently exec() can't miss the fatal group
> > signal?
>
> Threads A and B both block SIGTERM. An outside process sends SIGTERM,
> so it is queued in shared_pending but noone wakes.
>
> A starts an exec.
>
> B unblocks SIGTERM.
> B enters get_signal_to_deliver, locks, dequeues SIGTERM, unlocks.
> Now B is e.g. just before "current->flags |= PF_SIGNALED;".
>
> A locks. No group-exit is in yet progress.
> A does zap_other_threads, and unlocks.
>
> B enters do_group_exit. A group-exit is in progress, so it just exits.
> SIGTERM is lost.
Ah, this. Yes I know, we can lost the non-fatal group-wide signal
on mt-exec. Note that we don't need to block the signal, it could
be stealed by sub-thread anyway.
> But for me it is just an
> example of why we still need to step back and think over the whole
> exec picture.
Yes sure.
But I already thought about this, and actually I was almost going
to send something like the (incomplete) patch below.
What do you think? Do you agree it should fix all problems with
the group signals vs exec?
Oleg.
--- kernel/signal.c 2008-03-08 19:15:00.000000000 +0300
+++ kernel/signal.c 2008-03-08 21:00:44.883423954 +0300
@@ -379,6 +379,9 @@ int dequeue_signal(struct task_struct *t
{
int signr;
+ if (signal_group_exit(tsk->signal))
+ return SIGKILL;
+
/* We only dequeue private signals from ourselves, we don't let
* signalfd steal them
*/
@@ -428,8 +431,7 @@ int dequeue_signal(struct task_struct *t
* is to alert stop-signal processing code when another
* processor has come along and cleared the flag.
*/
- if (!(tsk->signal->flags & SIGNAL_GROUP_EXIT))
- tsk->signal->flags |= SIGNAL_STOP_DEQUEUED;
+ tsk->signal->flags |= SIGNAL_STOP_DEQUEUED;
}
if ((info->si_code & __SI_MASK) == __SI_TIMER && info->si_sys_private) {
/*
Sorry, I'm pretty well full up on time I can allocate to figuring out this
cleanup stuff, for the next week or two anyway. I'm very glad you're doing
it, but I don't think we can resolve the exec area all in one push now.
I don't have anything to contradict your plan about cleaning up exec.
But I'm just not able to devote enough time right now to giving it the full
consideration it requires.
Thanks,
Roland