2008-07-06 16:00:16

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] ptrace: simplify ptrace_stop()->sigkill_pending() path

1. SIGKILL can't be blocked, remove this check from sigkill_pending().

2. When ptrace_stop() sees sigkill_pending() == T, it can just return.
Kill "int killed" and simplify the code. This also is more correct,
the tracer shouldn't see us in TASK_TRACED if we are not going to
stop.

I strongly believe this code needs further changes. We should do the
"was this task killed" check unconditionally, currently it depends on
arch_ptrace_stop_needed(). On the other hand, sigkill_pending() isn't
very clever. If the task was killed tkill(SIGKILL), the signal can be
already dequeued if the caller is do_exit().

Signed-off-by: Oleg Nesterov <[email protected]>

--- 26-rc2/kernel/signal.c~PT_KILL 2008-06-28 17:06:58.000000000 +0400
+++ 26-rc2/kernel/signal.c 2008-07-06 19:29:27.000000000 +0400
@@ -1498,9 +1498,8 @@ static inline int may_ptrace_stop(void)
*/
static int sigkill_pending(struct task_struct *tsk)
{
- return ((sigismember(&tsk->pending.signal, SIGKILL) ||
- sigismember(&tsk->signal->shared_pending.signal, SIGKILL)) &&
- !unlikely(sigismember(&tsk->blocked, SIGKILL)));
+ return sigismember(&tsk->pending.signal, SIGKILL) ||
+ sigismember(&tsk->signal->shared_pending.signal, SIGKILL);
}

/*
@@ -1516,8 +1515,6 @@ static int sigkill_pending(struct task_s
*/
static void ptrace_stop(int exit_code, int clear_code, siginfo_t *info)
{
- int killed = 0;
-
if (arch_ptrace_stop_needed(exit_code, info)) {
/*
* The arch code has something special to do before a
@@ -1533,7 +1530,8 @@ static void ptrace_stop(int exit_code, i
spin_unlock_irq(&current->sighand->siglock);
arch_ptrace_stop(exit_code, info);
spin_lock_irq(&current->sighand->siglock);
- killed = sigkill_pending(current);
+ if (sigkill_pending(current))
+ return;
}

/*
@@ -1550,7 +1548,7 @@ static void ptrace_stop(int exit_code, i
__set_current_state(TASK_TRACED);
spin_unlock_irq(&current->sighand->siglock);
read_lock(&tasklist_lock);
- if (!unlikely(killed) && may_ptrace_stop()) {
+ if (may_ptrace_stop()) {
do_notify_parent_cldstop(current, CLD_TRAPPED);
read_unlock(&tasklist_lock);
schedule();


2008-07-07 04:57:47

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH] ptrace: simplify ptrace_stop()->sigkill_pending() path

Sorry I haven't replied sooner to your related patches about this recently.

I would rather not futz around the margins with the arch_ptrace_stop_needed
case. That code path exists only on ia64. Let's not worry about the
details of its current form. We can just hash out the whole subject and
this bit will wind up irrelevant or quite different.

To give some perspective, the arch_ptrace_stop_needed path was added
(recently) for ia64 where there was a need to unlock the siglock where it
wasn't unlocked before. Unlike the pure signals races, this new ia64 case
potentially works/blocks for a relatively long time (doing page faults)
while leaving the siglock unlocked in the path that always before held it
throughout. The bad situation with a missed SIGKILL actually came up on
ia64 when someone tried adding the unlock+arch_ptrace_stop+lock sequence
without the extra check. So the "killed" logic was added with a narrow
focus to redress just that case, and keep ia64 no worse than the status quo
of many years. (The ia64 path can have real blocks, whereas any other
problems on this path have never been possible in uniprocessor no-preempt
kernels, for example.)

When you sent your earlier patches, something I didn't find immediately
clear was the exact list of problem scenarios you were considering. So,
let's look at the whole picture.

The two paths into ptrace_stop() are ptrace_signal() and ptrace_notify().

In the ptrace_signal() path, the siglock is held throughout. If someone
comes along later to post a SIGKILL, it will happen after ptrace_stop()
enters TASK_TRACED and signal_wake_up() will do the right thing. The
problem scenario is when the SIGKILL was already posted before we took the
siglock, but ptrace_signal() gets called on a different signal first, i.e.
one numbered < 9 is dequeued first.

In the ptrace_notify() path, we just took the siglock before ptrace_stop().
There can have been a SIGKILL pending before that, and we'll stop anyway.
The paths into ptrace_notify() are e.g. after fork, exec, or syscall exit,
where there very well may have been a long time for the signal to be posted
since the last time it was checked for.

Now let's look at what we do and don't want, to call things "right".

The prevailing default case should be that we stop. We want to let the
debugger do its job. Consider exit tracing (PTRACE_O_TRACEEXIT). If there
is a normal exit_group syscall, every thread should stop for exit tracing.
If there is a run of the mill signal that causes termination, every thread
should stop for exit tracing.

When there is a real SIGKILL from userland (sys_kill et al actually called
with SIGKILL), then we should never stop. A real SIGKILL trumps all else.
We must die as quickly as we can, end of story.

When there is a coredump_wait in progress, we should not stop. This is
something outside the debugger's power to prevent (now that it's started in
another thread) and the core-dump thread will block so it can't be forced
to finish quickly, until we die.

When there is an exec in progress, we should not stop. This is something
outside the debugger's power to prevent (now that it's started in another
thread) and the thread doing exec will block uninterruptibly until we die.

If you work out all the cases from those "rules", there are several where
we currently make the wrong choice (in one direction or the other), even
without considering races. The changes from your patch series also don't
get it right in several cases, as I've defined "right" above.

Long ago, I had wanted to introduce a signal_struct.flags bit for SIGKILL
and simplify some of this checking. I then talked myself back out of
thinking it was required. Now we have cleaned up the flags handling a lot,
and we have TASK_WAKEKILL. I'm inclined again to go this route, but now to
take it further.

Let's stop overloading SIGKILL for internal purposes. At the same time,
let's treat real SIGKILL as special up front with its own bookkeeping.

We can check t->signal->flags in recalc_sigpending_tsk(). Let's replace
"t->signal->group_stop_count > 0" with "signal_group_sigpending(t->signal)".
In there we can stick the logic we come up with. To start with, it can be
(signal->group_stop_count > 0 || signal_group_exit(signal)). That lets us
remove the SIGKILL from zap_other_threads(), complete_signal(), and
zap_process(). In get_signal_to_deliver(), first thing in the loop (before
the group_stop_count check), check signal_group_exit() and if set go
directly to do_group_exit() without dequeuing any signal.

Next, we can optimize signal_group_sigpending() by reducing it to a single
flags check. Add a flag (maybe called SIGNAL_GROUP_INTERRUPT) meaning "all
threads must get into get_signal_to_deliver() ASAP". Include this bit in
SIGNAL_GROUP_EXIT so it's set every place that sets it. Also make
de_thread() set this bit when it sets group_exit_task (and clear it after).
Also make do_signal_stop() set it when it sets group_stop_count, and clear
it when it sets SIGNAL_STOP_STOPPED (which existing code would implicitly).

Now, the ptrace issues. Add a flag SIGNAL_GROUP_KILLED meaning "must exit
ASAP, no waiting". de_thread() sets this for exec, complete_signal() sets
it for SIGKILL, and coredump_wait() sets it. de_thread() has to clear the
flag afterwards on success, which could hide that a real SIGKILL came in
the middle. So add another flag SIGNAL_GROUP_SIGKILL that is only set for
a real SIGKILL. de_thread() doesn't clear SIGNAL_GROUP_KILLED if that's set.

fatal_signal_pending() is only called on current, and only from places
that can't be on the exit path. It can just check ->signal->flags. Only
the schedule() case, i.e. signal_pending_state(), could conceivably be
called after exit_notify(). (At that point, release_task() can be racing
and clear/free ->signal so it's not safe to use the pointer.) I really
don't think there ought to be any "optional" sleeps after exit_notify(),
i.e. anything interruptible or killable--just preemption and maybe some
uninterruptible wait inside one of the teardown calls. So I doubt that
comes up, but it's easy to check for with ->exit_state.

What I have in mind is:

/*
* Only called on current.
*/
static inline int fatal_signal_pending(struct task_struct *p)
{
return (p->signal->flags & SIGNAL_GROUP_KILLED) != 0;
}

static inline int signal_pending_state(long state, struct task_struct *p)
{
if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
return 0;
if (!signal_pending(p))
return 0;
if (state & TASK_INTERRUPTIBLE)
return 1;

if (unlikely(p->exit_state)) {
WARN_ON_ONCE(1);
return 1;
}

return fatal_signal_pending(p);
}

(Btw, do_wait_for_common() should use signal_pending_state() too.)

Now we can rely on TASK_WAKEKILL to deal with all the races for us inside
schedule(). We can nix sigkill_pending() and simplify ptrace_stop()
without the core check as your patches did. That alone is fine for
correctness. It is fine to momentarily enter TASK_TRACED and then not
stop. Worst case, a ptracer sees a wait succeed and then the task is on
its way to death so ptrace() refuses to work on it--it's just as if the
SIGKILL (or other thread's exec) came between the wait and ptrace calls.

We can add a short-circuit check at the top of ptrace_stop() to optimize
out arch_ptrace_stop() and other work when we're not going to stop.

I think that covers everything. How do you like it?


Thanks,
Roland

2008-07-07 17:58:16

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] ptrace: simplify ptrace_stop()->sigkill_pending() path

On 07/06, Roland McGrath wrote:
>
> When you sent your earlier patches, something I didn't find immediately
> clear was the exact list of problem scenarios you were considering. So,
> let's look at the whole picture.
>
> The two paths into ptrace_stop() are ptrace_signal() and ptrace_notify().
>
> In the ptrace_signal() path, the siglock is held throughout. If someone
> comes along later to post a SIGKILL, it will happen after ptrace_stop()
> enters TASK_TRACED and signal_wake_up() will do the right thing. The
> problem scenario is when the SIGKILL was already posted before we took the
> siglock, but ptrace_signal() gets called on a different signal first, i.e.
> one numbered < 9 is dequeued first.
>
> In the ptrace_notify() path, we just took the siglock before ptrace_stop().
> There can have been a SIGKILL pending before that, and we'll stop anyway.
> The paths into ptrace_notify() are e.g. after fork, exec, or syscall exit,
> where there very well may have been a long time for the signal to be posted
> since the last time it was checked for.
>
> Now let's look at what we do and don't want, to call things "right".

(re-ordered)

> When there is a real SIGKILL from userland (sys_kill et al actually called
> with SIGKILL), then we should never stop. A real SIGKILL trumps all else.
> We must die as quickly as we can, end of story.

OK, thanks,

> The prevailing default case should be that we stop. We want to let the
> debugger do its job. Consider exit tracing (PTRACE_O_TRACEEXIT). If there
> is a normal exit_group syscall, every thread should stop for exit tracing.

OK,

> When there is an exec in progress, we should not stop. This is something
> outside the debugger's power to prevent (now that it's started in another
> thread) and the thread doing exec will block uninterruptibly until we die.

Hmm. Yes, the execing thread will block, but why this is wrong? It waits
for the traced thread which stops, this looks "normal".

> If there is a run of the mill signal that causes termination, every thread
> should stop for exit tracing.

Just to be sure I got this right, you mean the fatal signal but not SIGKILL,
yes?

> When there is a coredump_wait in progress, we should not stop. This is
> something outside the debugger's power to prevent (now that it's started in
> another thread) and the core-dump thread will block so it can't be forced
> to finish quickly, until we die.

This differs from the case above just because the signal is
sig_kernel_coredump(), looks a bit strange.

> If you work out all the cases from those "rules", there are several where
> we currently make the wrong choice (in one direction or the other), even
> without considering races. The changes from your patch series also don't
> get it right in several cases, as I've defined "right" above.

Yes.

> Let's stop overloading SIGKILL for internal purposes. At the same time,
> let's treat real SIGKILL as special up front with its own bookkeeping.
>
> We can check t->signal->flags in recalc_sigpending_tsk(). Let's replace
> "t->signal->group_stop_count > 0" with "signal_group_sigpending(t->signal)".
> In there we can stick the logic we come up with. To start with, it can be
> (signal->group_stop_count > 0 || signal_group_exit(signal)). That lets us
> remove the SIGKILL from zap_other_threads(), complete_signal(), and
> zap_process(). In get_signal_to_deliver(), first thing in the loop (before
> the group_stop_count check), check signal_group_exit() and if set go
> directly to do_group_exit() without dequeuing any signal.

Agreed, I also thought about that. Not that it matters, but we can safely
do this after the group_stop_count check, or we can change dequeue_signal()
to return SIGKILL if signal_group_exit().

> Next, we can optimize signal_group_sigpending() by reducing it to a single
> flags check. Add a flag (maybe called SIGNAL_GROUP_INTERRUPT) meaning "all
> threads must get into get_signal_to_deliver() ASAP". Include this bit in
> SIGNAL_GROUP_EXIT so it's set every place that sets it. Also make
> de_thread() set this bit when it sets group_exit_task (and clear it after).
> Also make do_signal_stop() set it when it sets group_stop_count, and clear
> it when it sets SIGNAL_STOP_STOPPED (which existing code would implicitly).

(this is only optimization, let's ignore this issue for now)

> Now, the ptrace issues. Add a flag SIGNAL_GROUP_KILLED meaning "must exit
> ASAP, no waiting". de_thread() sets this for exec, complete_signal() sets
> it for SIGKILL, and coredump_wait() sets it. de_thread() has to clear the
> flag afterwards on success, which could hide that a real SIGKILL came in
> the middle. So add another flag SIGNAL_GROUP_SIGKILL that is only set for
> a real SIGKILL. de_thread() doesn't clear SIGNAL_GROUP_KILLED if that's set.
>
[...]
>
> static inline int fatal_signal_pending(struct task_struct *p)
> {
> return (p->signal->flags & SIGNAL_GROUP_KILLED) != 0;
> }

Well, this is OK for ptrace (but see below), but doesn't look right
in general. complete_signal() sets this flag only for SIGKILL, what
about other fatal signals which cause SIGNAL_GROUP_EXIT? In that
case fatal_signal_pending() must be true as well.

The same for sys_exit_group(), it also must set this flag. So I can't
understand how SIGNAL_GROUP_KILLED can differ from signal_group_exit().
Apart from optimization, of course.

But signal_group_exit() doesn't suit for ptrace_stop() case.

Confused.

> fatal_signal_pending() is only called on current, and only from places
> that can't be on the exit path. It can just check ->signal->flags. Only
> the schedule() case, i.e. signal_pending_state(), could conceivably be
> called after exit_notify(). (At that point, release_task() can be racing
> and clear/free ->signal so it's not safe to use the pointer.) I really
> don't think there ought to be any "optional" sleeps after exit_notify(),

(off-topic: any sleep after exit_notify() is not good, CPU_DEAD can't
in general find this task for migration. However this is possible.)

> i.e. anything interruptible or killable--just preemption and maybe some
> uninterruptible wait inside one of the teardown calls. So I doubt that
> comes up, but it's easy to check for with ->exit_state.

Agreed. In fact I have already thought about changing fatal_signal_pending()
to use signal_group_exit(), see http://marc.info/?l=linux-kernel&m=121267953524149
This also fixes the issues with /sbin/init. But again, this is not good
for PT_TRACE_EXIT.

And I was worried about schedule in TASK_KILLABLE after exit_notify().
This doesn't look possible currently, probably not a problem.

> static inline int signal_pending_state(long state, struct task_struct *p)
> {
> if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
> return 0;
> if (!signal_pending(p))
> return 0;
> if (state & TASK_INTERRUPTIBLE)
> return 1;
>
> if (unlikely(p->exit_state)) {
> WARN_ON_ONCE(1);
> return 1;
> }
>
> return fatal_signal_pending(p);
> }
>
> Now we can rely on TASK_WAKEKILL to deal with all the races for us inside
> schedule(). We can nix sigkill_pending() and simplify ptrace_stop()
> without the core check as your patches did.

Well yes, but PT_TRACE_EXIT is special again. signal_pending() can be
false, so we can sleep even if the task was SIGKILL'ed.

> (Btw, do_wait_for_common() should use signal_pending_state() too.)

Completely agreed, I was going to do this from the very beginning.
And __down_common() too.

Oleg.