Hi Oleg,
Thanks for your recent patches reducing the use of tasklist_lock. Avoiding
tasklist_lock is very important for us. I have now found another place
where tasklist_lock can be avoided, and made a patch for it. Could you
please have a look?
This patch avoids taking tasklist_lock in kill_something_info() when
the pid is -1. That case can use rcu_read_lock() instead, because
group_send_sig_info() doesn't need tasklist_lock.
This patch is for 2.6.25-rc5-mm1.
Signed-off-by: Atsushi Tsuji <[email protected]>
---
diff --git a/kernel/signal.c b/kernel/signal.c
index 3edbfd4..a888c58 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1089,14 +1089,16 @@ static int kill_something_info(int sig, struct siginfo *info, int pid)
 		return ret;
 	}
 
-	read_lock(&tasklist_lock);
 	if (pid != -1) {
+		read_lock(&tasklist_lock);
 		ret = __kill_pgrp_info(sig, info,
 				pid ? find_vpid(-pid) : task_pgrp(current));
+		read_unlock(&tasklist_lock);
 	} else {
 		int retval = 0, count = 0;
 		struct task_struct * p;
 
+		rcu_read_lock();
 		for_each_process(p) {
 			if (p->pid > 1 && !same_thread_group(p, current)) {
 				int err = group_send_sig_info(sig, info, p);
@@ -1106,8 +1108,8 @@ static int kill_something_info(int sig, struct siginfo *info, int pid)
 			}
 		}
 		ret = count ? retval : -ESRCH;
+		rcu_read_unlock();
 	}
-	read_unlock(&tasklist_lock);
 
 	return ret;
 }
On 03/25, Atsushi Tsuji wrote:
>
> This patch avoids taking tasklist_lock in kill_something_info() when
> the pid is -1. That case can use rcu_read_lock() instead, because
> group_send_sig_info() doesn't need tasklist_lock.
>
> This patch is for 2.6.25-rc5-mm1.
>
> Signed-off-by: Atsushi Tsuji <[email protected]>
> ---
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 3edbfd4..a888c58 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1089,14 +1089,16 @@ static int kill_something_info(int sig, struct siginfo *info, int pid)
>  		return ret;
>  	}
>
> -	read_lock(&tasklist_lock);
>  	if (pid != -1) {
> +		read_lock(&tasklist_lock);
>  		ret = __kill_pgrp_info(sig, info,
>  				pid ? find_vpid(-pid) : task_pgrp(current));
> +		read_unlock(&tasklist_lock);
>  	} else {
>  		int retval = 0, count = 0;
>  		struct task_struct * p;
>
> +		rcu_read_lock();
>  		for_each_process(p) {
>  			if (p->pid > 1 && !same_thread_group(p, current)) {
>  				int err = group_send_sig_info(sig, info, p);
> @@ -1106,8 +1108,8 @@ static int kill_something_info(int sig, struct siginfo *info, int pid)
>  			}
>  		}
>  		ret = count ? retval : -ESRCH;
> +		rcu_read_unlock();
>  	}
> -	read_unlock(&tasklist_lock);
>
>  	return ret;
>  }
Hmm. Yes, group_send_sig_info() doesn't need tasklist_lock. But we
take tasklist_lock to "freeze" the tasks list, so that we can't miss
a new forked process.
Same for __kill_pgrp_info(), we take tasklist to kill the whole group
"atomically".
However. Is it really needed? copy_process() returns -ERESTARTNOINTR
if signal_pending(), and the new task is always placed at the tail
of the list. Looks like nobody can escape the signal, at least fatal
or SIGSTOP.
If the signal is blocked/ignored or has a handler, we can miss a forked
child, but this looks OK, we can pretend it was forked after we dropped
tasklist_lock.
Note also that copy_process() does list_add_tail_rcu(p->tasks) under
->siglock, this means kill_something_info() must see the new children
after group_send_sig_info() drops ->siglock.
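
The argument above can be sketched as a small single-threaded userspace model. This is only an illustration under stated assumptions, not kernel code: struct task, struct model, model_init(), model_fork() and model_kill_all() are hypothetical names, the task list is modelled as an array whose index order is the list order, and ERESTARTNOINTR is redefined locally because it is a kernel-internal value. Memory ordering is not modelled at all; the sketch only shows why "fork fails once a signal is pending" plus "children go to the tail" leaves nobody unsignalled.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 16
/* Kernel-internal errno, redefined here for the model only. */
#define ERESTARTNOINTR 513

struct task {
	bool sig_pending;	/* stands in for signal_pending() */
	bool has_forked;	/* one fork per task keeps the model finite */
};

struct model {
	struct task tasks[MAX_TASKS];	/* index order == list order */
	int ntasks;			/* tasks[ntasks - 1] is the tail */
};

static void model_init(struct model *m, int n)
{
	assert(n > 0 && n <= MAX_TASKS);
	m->ntasks = n;
	for (int i = 0; i < MAX_TASKS; i++)
		m->tasks[i] = (struct task){ false, false };
}

/* Models copy_process(): refuse if a signal is pending, else append at the tail. */
static int model_fork(struct model *m, int parent)
{
	if (m->tasks[parent].sig_pending)
		return -ERESTARTNOINTR;
	if (m->ntasks == MAX_TASKS || m->tasks[parent].has_forked)
		return -1;
	m->tasks[parent].has_forked = true;
	m->tasks[m->ntasks] = (struct task){ false, false };
	m->ntasks++;
	return 0;
}

/*
 * Models the kill(-1, SIGKILL) walk. Before every delivery, each task
 * tries to fork. Returns the number of tasks left without a pending signal.
 */
static int model_kill_all(struct model *m)
{
	int escaped = 0;

	for (int i = 0; i < m->ntasks; i++) {
		for (int j = 0; j < m->ntasks; j++)
			model_fork(m, j);
		m->tasks[i].sig_pending = true;
	}
	for (int i = 0; i < m->ntasks; i++)
		if (!m->tasks[i].sig_pending)
			escaped++;
	return escaped;
}
```

Starting from a few tasks, the walk ends with every task signalled even though the list grew during the walk, and any later fork attempt is refused. The skipped /sbin/init case is deliberately left out, since it depends on memory visibility that a sequential model cannot express.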
Except: We don't send the signal to /sbin/init. This means that (say)
kill(-1, SIGKILL) can miss the task forked by init. Note that this
task could be forked even before we start kill_something_info(), but
without tasklist there is no guarantee we will see it on the ->tasks
list.
I think this is the only problem with this change.
Eric, Roland?
(Unfortunately, attach_pid() adds the task to the head of hlist, this
means we can't avoid tasklist for __kill_pgrp_info).
Oleg.
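
The difference between tail and head insertion mentioned above can be shown with a minimal single-threaded sketch. It assumes a plain singly linked list rather than the kernel's list_head or hlist, and struct node, struct list, add_tail(), add_head(), walk_sees_insert() and demo() are hypothetical names used only for this illustration. It says nothing about RCU or memory barriers; it only shows that a walker already past the head cannot reach a node inserted there, while a node appended at the tail is still ahead of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct list {
	struct node *head;
	struct node *tail;
};

static void add_tail(struct list *l, struct node *n)
{
	n->next = NULL;
	if (l->tail)
		l->tail->next = n;
	else
		l->head = n;
	l->tail = n;
}

static void add_head(struct list *l, struct node *n)
{
	n->next = l->head;
	l->head = n;
	if (!l->tail)
		l->tail = n;
}

/*
 * Walk the list. While the walker stands on `at`, insert `child` at the
 * tail or at the head. Returns whether the walk ever visits `child`.
 */
static bool walk_sees_insert(struct list *l, struct node *at,
			     struct node *child, bool at_tail)
{
	bool seen = false;

	assert(at != child);
	for (struct node *p = l->head; p; p = p->next) {
		if (p == child)
			seen = true;
		if (p == at) {
			if (at_tail)
				add_tail(l, child);
			else
				add_head(l, child);
		}
	}
	return seen;
}

/* Three nodes a, b, c; the insertion happens while the walker is on b. */
static bool demo(bool at_tail)
{
	struct node a, b, c, child;
	struct list l = { NULL, NULL };

	add_tail(&l, &a);
	add_tail(&l, &b);
	add_tail(&l, &c);
	return walk_sees_insert(&l, &b, &child, at_tail);
}
```

With at_tail set, the walker reaches the new node after c; with head insertion it never does, which is the reason given above for keeping tasklist_lock in __kill_pgrp_info().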
On 05/31, Eric W. Biederman wrote:
>
> We can read old next values when walking the task list under the rcu
> lock. So I don't believe we are guaranteed to see additions that
> happen while we hold the rcu lock.
>
> If a new process spawns, passes the check for the parent having the
> signal, the signal is then delivered, and only then is the child
> appended to the task list, we might miss it. I'm not certain, but that
> feels right.
I don't think we can miss it. To simplify, let's consider kill(-1, SIGKILL)
and the forking process is P.
When P forks, copy_process() adds the child to the end of the init_task.tasks
list under ->siglock.
When kill_something_info()->group_send_sig_info(P) succeeds, we must see the
child, because we locked the same ->siglock and thus we have the necessary
barrier. (more precisely, we must see the new next values once we locked
->siglock). And P can't fork again.
Yes, kill(-1, /*say*/ SIGCONT) is different, we can miss the child which was
forked _after_ P has received/handled the signal, but probably this is OK,
we can pretend it was forked after kill(-1) has returned.
The more interesting case: P forks and exits _before_ we send the signal
to it. Can we miss the child? I don't think so, but I'm not sure.
fork() + exit() means list_add_rcu() + wmb() + list_del_rcu().
If we see the result of list_del_rcu() (ie, we don't see P on list), we
must see the result of list_add_rcu(), because of smp_read_barrier_depends()
in next_task().
But again, I'm not sure.
> > However, I think this patch adds another subtle race which I missed before.
> >
> > Let's suppose that the task has two threads, A (== main thread) and B. A has
> > already exited, B does exec.
> >
> > In that case it is possible that (without tasklist_lock) kill_something_info()
> > sends the signal to the old leader (A), but before group_send_sig_info(A)
> > takes ->siglock B switches the leader and does release_task(A). In that
> > case group_send_sig_info()->lock_task_sighand() fails and we miss the process.
>
> Hmm. Does that problem affect normal signal delivery? I seem to recall
> being careful and doing something to make that case work.
>
> Does that fix only apply when we have a specific pid, not when we have
> a task and are walking the task_list? Because kill_pid_info can retry
> if we receive -ESRCH?
Yes, kill_pid_info() is fine, and other users call group_send_sig_info()
under tasklist.
> To fix the pid namespace case we need to start walking the list of
> pids, not the task list for kill -1.
Or, we can make something like

	/* needs rcu lock */
	int kill_group(int sig, struct siginfo *info, struct task_struct *g)
	{
		struct task_struct *p = g;
		int ret;

		/* try each thread until one of them still has ->sighand */
		do {
			ret = group_send_sig_info(sig, info, p);
			if (ret != -ESRCH)
				break;
		} while_each_thread(g, p);

		return ret;
	}

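
The retry idea in kill_group() can be exercised in userspace with a minimal model. This is a sketch under assumptions, not the kernel implementation: struct thread, model_send() and model_kill_group() are hypothetical names, model_send() merely stands in for group_send_sig_info() failing with -ESRCH on a released thread, and the model assumes the thread ring is still walkable from the old leader, which the discussion above does not establish.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct thread {
	bool released;		/* models a task whose ->sighand is gone */
	int delivered;		/* signals delivered through this thread */
	struct thread *next;	/* circular list of the thread group */
};

/* Stand-in for group_send_sig_info(): -ESRCH on a released thread. */
static int model_send(struct thread *t)
{
	if (t->released)
		return -ESRCH;
	t->delivered++;
	return 0;
}

/*
 * Same loop shape as the proposed kill_group(): stop at the first
 * thread that reports anything other than -ESRCH.
 */
static int model_kill_group(struct thread *g)
{
	struct thread *p = g;
	int ret;

	assert(g != NULL);
	do {
		ret = model_send(p);
		if (ret != -ESRCH)
			break;
		p = p->next;
	} while (p != g);

	return ret;
}
```

With a released leader A and a live thread B in the ring, the signal is delivered once through B; only when every thread is released does the call report -ESRCH.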
> > Note the (broken) "p->pid > 1" check, kill_something_info() skips init.
> > Not that it matters though.
>
> Oh right. I had forgotten about that special case. Grr Special cases
> suck!
>
> We have a hole with init spawning new children, during kill -1.
Yes, I was wondering about this too.
> Ugh. Or are those tasks indistinguishable from children spawned by
> init just after the signal was sent?
Oh, I don't know what the intended semantics are. Perhaps this works in
practice during shutdown? We can change the state of /sbin/init so
that it won't spawn new tasks, and then we can do kill(-1).
I don't know.
> A practical question. I need to rework the signal delivery for
> the case of kill -1 to be based on find_ge_pid. So that it
> works with pid namespaces.
Hmm... not sure I understand why we need find_ge_pid(). In fact,
I can't see which problems we have with kill_something_info() wrt
pid namespaces. I mean it should be changed of course, but these
changes should be relatively simple/straightforward? However I didn't
really think about this, I may be wrong.
Oleg.