LinuxLists.cc - [PATCH 1/2] schedule: fix TASK

2008-06-04 17:09:13

Subject: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race

schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case,
this allows us to do

current->state = TASK_INTERRUPTIBLE;
schedule();

without fear to sleep with pending signal.

However, the code like

current->state = TASK_KILLABLE;
schedule();

is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
no effect).

Introduce the new helper, signal_pending_state(), and change schedule() to
use it.

Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
Probably it would be better to remove it, but this will change the current
behaviour and thus needs a separate discussion.

Note also that with or without this patch TASK_WAKEKILL is not exactly right
wrt /sbin/init, but this is another issue.

Signed-off-by: Oleg Nesterov <[email protected]>

include/linux/sched.h | 2 ++
kernel/signal.c | 14 ++++++++++++++
kernel/sched.c | 6 ++----
3 files changed, 18 insertions(+), 4 deletions(-)

--- 26-rc2/include/linux/sched.h~1_SCHED_KILLABLE 2008-06-01 16:44:39.000000000 +0400
+++ 26-rc2/include/linux/sched.h 2008-06-01 16:44:39.000000000 +0400
@@ -2020,6 +2020,8 @@ static inline int signal_pending(struct
return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

+extern int signal_pending_state(long state, struct task_struct *p);
+
extern int __fatal_signal_pending(struct task_struct *p);

static inline int fatal_signal_pending(struct task_struct *p)
--- 26-rc2/kernel/signal.c~1_SCHED_KILLABLE 2008-05-31 16:03:39.000000000 +0400
+++ 26-rc2/kernel/signal.c 2008-06-04 19:57:34.000000000 +0400
@@ -980,6 +980,20 @@ int __fatal_signal_pending(struct task_s
}
EXPORT_SYMBOL(__fatal_signal_pending);

+int signal_pending_state(long state, struct task_struct *p)
+{
+ if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
+ return 0;
+ if (!signal_pending(p))
+ return 0;
+
+ if (state & TASK_INTERRUPTIBLE)
+ return 1;
+ if (state & (__TASK_STOPPED | __TASK_TRACED))
+ return 0;
+ return __fatal_signal_pending(p);
+}
+
struct sighand_struct *lock_task_sighand(struct task_struct *tsk, unsigned long *flags)
{
struct sighand_struct *sighand;
--- 26-rc2/kernel/sched.c~1_SCHED_KILLABLE 2008-05-18 15:44:18.000000000 +0400
+++ 26-rc2/kernel/sched.c 2008-06-04 17:42:59.000000000 +0400
@@ -4510,12 +4510,10 @@ need_resched_nonpreemptible:
clear_tsk_need_resched(prev);

if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
- if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
- signal_pending(prev))) {
+ if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
- } else {
+ else
deactivate_task(rq, prev, 1);
- }
switch_count = &prev->nvcsw;
}

2008-06-04 17:33:41

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race

On Wed, Jun 04, 2008 at 09:09:05PM +0400, Oleg Nesterov wrote:
> Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
> Probably it would be better to remove it, but this will change the current
> behaviour and thus needs a separate discussion.

We're changing the behaviour anyway. Let's have the discussion and get
it right.

In my opinion, not checking for TASK_STOPPED or TASK_TRACED previously was
an oversight. This should be fixed.

> Note also that with or without this patch TASK_WAKEKILL is not exactly right
> wrt /sbin/init, but this is another issue.

That's certainly an interesting conversation to have.

> +int signal_pending_state(long state, struct task_struct *p)
> +{
> + if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
> + return 0;
> + if (!signal_pending(p))
> + return 0;
> +
> + if (state & TASK_INTERRUPTIBLE)
> + return 1;
> + if (state & (__TASK_STOPPED | __TASK_TRACED))
> + return 0;

Just deleting the above two lines should do it?

> + return __fatal_signal_pending(p);
> +}
> +
> struct sighand_struct *lock_task_sighand(struct task_struct *tsk, unsigned long *flags)
> {
> struct sighand_struct *sighand;
> --- 26-rc2/kernel/sched.c~1_SCHED_KILLABLE 2008-05-18 15:44:18.000000000 +0400
> +++ 26-rc2/kernel/sched.c 2008-06-04 17:42:59.000000000 +0400
> @@ -4510,12 +4510,10 @@ need_resched_nonpreemptible:
> clear_tsk_need_resched(prev);
>
> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
> - if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
> - signal_pending(prev))) {
> + if (unlikely(signal_pending_state(prev->state, prev)))
> prev->state = TASK_RUNNING;
> - } else {
> + else
> deactivate_task(rq, prev, 1);
> - }

Getting rid of the extra braces is against CodingStyle:

Do not unnecessarily use braces where a single statement will do.

if (condition)
action();

This does not apply if one branch of a conditional statement is a single
statement. Use braces in both branches.

if (condition) {
do_this();
do_that();
} else {
otherwise();
}

This patch is going to add quite a few cycles to schedule(). Has anyone
done any benchmarks with a schedule-heavy workload?

I don't think signal_pending_state() should be in signal.c, just put it
in sched.c along with its only caller. That way, gcc can choose to
inline it if that's more efficient.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-04 17:59:43

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race

On 06/04, Matthew Wilcox wrote:
>
> On Wed, Jun 04, 2008 at 09:09:05PM +0400, Oleg Nesterov wrote:
> > Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
> > Probably it would be better to remove it, but this will change the current
> > behaviour and thus needs a separate discussion.
>
> We're changing the behaviour anyway. Let's have the discussion and get
> it right.
>
> In my opinion, not checking for TASK_STOPPED or TASK_TRACED previously was
> an oversight. This should be fixed.

Perhaps, and the changelog has a special note. But imho we need another patch
for that, this is a user-visible change.

> > +int signal_pending_state(long state, struct task_struct *p)
> > +{
> > + if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
> > + return 0;
> > + if (!signal_pending(p))
> > + return 0;
> > +
> > + if (state & TASK_INTERRUPTIBLE)
> > + return 1;
> > + if (state & (__TASK_STOPPED | __TASK_TRACED))
> > + return 0;
>
> Just deleting the above two lines should do it?

Yes.

> > if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
> > - if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
> > - signal_pending(prev))) {
> > + if (unlikely(signal_pending_state(prev->state, prev)))
> > prev->state = TASK_RUNNING;
> > - } else {
> > + else
> > deactivate_task(rq, prev, 1);
> > - }
>
> Getting rid of the extra braces is against CodingStyle:
>
> Do not unnecessarily use braces where a single statement will do.
>
> if (condition)
> action();
>
> This does not apply if one branch of a conditional statement is a single
> statement. Use braces in both branches.
>
> if (condition) {
> do_this();
> do_that();
> } else {
> otherwise();
> }

With this patch the code is

if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
else
deactivate_task(rq, prev, 1);

> This patch is going to add quite a few cycles to schedule(). Has anyone
> done any benchmarks with a schedule-heavy workload?

No, I didn't. This patch is bugfix.

> I don't think signal_pending_state() should be in signal.c, just put it
> in sched.c along with its only caller. That way, gcc can choose to
> inline it if that's more efficient.

Perhaps you are right. In that case it doesn't need the "long state" argument.

However, I think the new helper can have other users. Not that I have a strong
opinion.

Oleg.

2008-06-04 18:51:49

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race

On Wed, 2008-06-04 at 11:33 -0600, Matthew Wilcox wrote:
> On Wed, Jun 04, 2008 at 09:09:05PM +0400, Oleg Nesterov wrote:

> > --- 26-rc2/kernel/sched.c~1_SCHED_KILLABLE 2008-05-18 15:44:18.000000000 +0400
> > +++ 26-rc2/kernel/sched.c 2008-06-04 17:42:59.000000000 +0400
> > @@ -4510,12 +4510,10 @@ need_resched_nonpreemptible:
> > clear_tsk_need_resched(prev);
> >
> > if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
> > - if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
> > - signal_pending(prev))) {
> > + if (unlikely(signal_pending_state(prev->state, prev)))
> > prev->state = TASK_RUNNING;
> > - } else {
> > + else
> > deactivate_task(rq, prev, 1);
> > - }
>
> Getting rid of the extra braces is against CodingStyle:

No it doesn't.

> Do not unnecessarily use braces where a single statement will do.
>
> if (condition)
> action();
>
> This does not apply if one branch of a conditional statement is a single
> statement. Use braces in both branches.
>
> if (condition) {
> do_this();
> do_that();
> } else {
> otherwise();
> }

Which does not say anything about

if (cond)
stmt1;
else
stmt2;

like the above.

2008-06-04 19:53:01

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race

On Wed, Jun 04, 2008 at 10:01:01PM +0400, Oleg Nesterov wrote:
> > In my opinion, not checking for TASK_STOPPED or TASK_TRACED previously was
> > an oversight. This should be fixed.
>
> Perhaps, and the changelog has a special note. But imho we need another patch
> for that, this is a user-visible change.

It is?

> > > if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
> > > - if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
> > > - signal_pending(prev))) {
> > > + if (unlikely(signal_pending_state(prev->state, prev)))
> > > prev->state = TASK_RUNNING;
> > > - } else {
> > > + else
> > > deactivate_task(rq, prev, 1);
> > > - }
> >
> > Getting rid of the extra braces is against CodingStyle:
> >
>
> With this patch the code is
>
> if (unlikely(signal_pending_state(prev->state, prev)))
> prev->state = TASK_RUNNING;
> else
> deactivate_task(rq, prev, 1);

Didn't notice that. Still, adjusting brace style is a bad idea. Just
leave it the way it is.

> > This patch is going to add quite a few cycles to schedule(). Has anyone
> > done any benchmarks with a schedule-heavy workload?
>
> No, I didn't. This patch is bugfix.

But there are other ways to fix the bug if this patch proves to be too
heavy-weight.

> However, I think the new helper can have other users. Not that I have a strong
> opinion.

I don't think so ...

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-05 15:22:11

by Oleg Nesterov

[permalink] [raw]

Subject: TASK_WAKEKILL && /sbin/init (was: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race)

Sorry Matthew, I left this part unanswered because I didn't have the
time yesterday...

On 06/04, Matthew Wilcox wrote:
>
> On Wed, Jun 04, 2008 at 09:09:05PM +0400, Oleg Nesterov wrote:
> > Note also that with or without this patch TASK_WAKEKILL is not exactly right
> > wrt /sbin/init, but this is another issue.
>
> That's certainly an interesting conversation to have.

If lock_page_killable() fails because the task was killed by SIGKILL or another
fatal signal, do_generic_file_read() returns -EIO.

This seems to be OK, because in fact the userspace won't see this error, the
task will dequeue SIGKILL and exit.

However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and be
confused by this bogus -EIO. Please note that while this bug is not likely,
it is _not_ theoretical. It does happen that user-space sends the unhandled
fatal signals to init.

Imho, this is 2.6.26 material. Unless I missed something, of course.

It is not clear to me what should we do. I'd like very much to avoid adding
more SIGNAL_UNKILLABLE checks, but perhaps we don't have another choice.
We can fix the bug with

--- kernel/signal.c
+++ kernel/signal.c
@@ -974,7 +974,7 @@ void zap_other_threads(struct task_struc

int fastcall __fatal_signal_pending(struct task_struct *tsk)
{
- return sigismember(&tsk->pending.signal, SIGKILL);
+ return signal_group_exit(tsk->signal);
}

, but this makes __fatal_signal_pending() slower, and because we use
tsk->signal, schedule() (in particular) can't use this helper.

Anyway. How about the (untested/uncompiled) patch for now? -EINTR or
-ERESTARTNOINTR looks "more correct" regardless.

Oleg.

--- mm/filemap.c
+++ mm/filemap.c
@@ -188,7 +188,7 @@ static int sync_page(void *word)
static int sync_page_killable(void *word)
{
sync_page(word);
- return fatal_signal_pending(current) ? -EINTR : 0;
+ return fatal_signal_pending(current) ? -ERESTARTNOINTR : 0;
}

/**
@@ -1000,8 +1000,9 @@ page_ok:

page_not_up_to_date:
/* Get exclusive access to the page ... */
- if (lock_page_killable(page))
- goto readpage_eio;
+ error = lock_page_killable(page);
+ if (error)
+ goto readpage_error;

/* Did it get truncated before we got the lock? */
if (!page->mapping) {
@@ -1029,8 +1030,9 @@ readpage:
}

if (!PageUptodate(page)) {
- if (lock_page_killable(page))
- goto readpage_eio;
+ error = lock_page_killable(page);
+ if (error)
+ goto readpage_error;
if (!PageUptodate(page)) {
if (page->mapping == NULL) {
/*

2008-06-05 15:22:43

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race

On 06/04, Matthew Wilcox wrote:
>
> On Wed, Jun 04, 2008 at 10:01:01PM +0400, Oleg Nesterov wrote:
> > > In my opinion, not checking for TASK_STOPPED or TASK_TRACED previously was
> > > an oversight. This should be fixed.
> >
> > Perhaps, and the changelog has a special note. But imho we need another patch
> > for that, this is a user-visible change.
>
> It is?

Think about ptrace_notify().

Don't get me wrong. As I said, I think this change would be nice (but I didn't
think thoroughly yet), and it also allows us to cleanup (or fix?) ptrace_stop().
But with another patch, please.

> > > This patch is going to add quite a few cycles to schedule(). Has anyone
> > > done any benchmarks with a schedule-heavy workload?
> >
> > No, I didn't. This patch is bugfix.
>
> But there are other ways to fix the bug if this patch proves to be too
> heavy-weight.

If I knew a better solution, I wouldn't have sent this patch ;)

Yes, we can change all users of TASK_KILLABLE. But personally I think this
would be wrong. I strongly believe this code

current->state = TASK_KILLABLE;
schedule();

should work "as expected".

Btw, I don't completely agree with "quite a few cycles". Let's look at the
code again:

int signal_pending_state(long state, struct task_struct *p)
{
if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
return 0;
if (!signal_pending(p))
return 0;

if (state & TASK_INTERRUPTIBLE)
return 1;
if (state & (__TASK_STOPPED | __TASK_TRACED))
return 0;
return __fatal_signal_pending(p);
}

The fast path is "(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)) + signal_pending(p)",
basically the same that we do now. And we can inline this helper to eliminate the
function call.

But yes sure, it does bloat schedule(), and I would be happy to see the better way.
That is why I didn't send this patch immediately, but started with
"Q: down_killable() is racy? or schedule() is not right?".

> > However, I think the new helper can have other users. Not that I have a strong
> > opinion.
>
> I don't think so ...

The only part I strongly disagree with. Imho, __down_common/__mutex_lock_common/etc
should use this helper. What if we add another "interesting" state? Or find another
bug? why should we copy-and-paste this code to yet another something_new_killable() ?

Btw, if we make it inline, __down_common() won't suffer (but otoh, I think that these
_common()'s shouldn't be inline).

So. I can re-send this patch unchanged or with signal_pending_state() inlined,
or I can wait for another solution.

What do you think?

Oleg.

2008-06-05 15:49:13

by Matthew Wilcox

[permalink] [raw]

Subject: Re: TASK_WAKEKILL && /sbin/init (was: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race)

On Thu, Jun 05, 2008 at 07:23:16PM +0400, Oleg Nesterov wrote:
> Sorry Matthew, I left this part unanswered because I didn't have the
> time yesterday...

That's OK, thanks for picking it up again.

> On 06/04, Matthew Wilcox wrote:
> >
> > On Wed, Jun 04, 2008 at 09:09:05PM +0400, Oleg Nesterov wrote:
> > > Note also that with or without this patch TASK_WAKEKILL is not exactly right
> > > wrt /sbin/init, but this is another issue.
> >
> > That's certainly an interesting conversation to have.
>
> If lock_page_killable() fails because the task was killed by SIGKILL or
> another fatal signal, do_generic_file_read() returns -EIO.
>
> This seems to be OK, because in fact the userspace won't see this error, the
> task will dequeue SIGKILL and exit.
>
> However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and be
> confused by this bogus -EIO. Please note that while this bug is not likely,
> it is _not_ theoretical. It does happen that user-space sends the unhandled
> fatal signals to init.

Have you actually tested this? I thought it was handled by:

/*
* Global init gets no signals it doesn't want.
*/
if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
!signal_group_exit(signal))
continue;

in get_signal_to_deliver().

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-05 16:03:20

by Oleg Nesterov

[permalink] [raw]

Subject: Re: TASK_WAKEKILL && /sbin/init (was: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race)

On 06/05, Matthew Wilcox wrote:
>
> On Thu, Jun 05, 2008 at 07:23:16PM +0400, Oleg Nesterov wrote:
> >
> > If lock_page_killable() fails because the task was killed by SIGKILL or
> > another fatal signal, do_generic_file_read() returns -EIO.
> >
> > This seems to be OK, because in fact the userspace won't see this error, the
> > task will dequeue SIGKILL and exit.
> >
> > However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and be
> > confused by this bogus -EIO. Please note that while this bug is not likely,
> > it is _not_ theoretical. It does happen that user-space sends the unhandled
> > fatal signals to init.
>
> Have you actually tested this?

No I didn't. And I would be happy to be wrong. But,

> I thought it was handled by:
>
> /*
> * Global init gets no signals it doesn't want.
> */
> if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
> !signal_group_exit(signal))
> continue;
>
> in get_signal_to_deliver().

This is what I am talking about. The SIGNAL_UNKILLABLE task (init) dequeues
the pending SIGKILL and just ignores it. Then it returns to the user space
with -EIO.

But when we send SIGKILL, the sender wakes up the TASK_KILLABLE task, and
after that fatal_signal_pending() is true. Once again, it is not hard to
fix this problem in kernel/signal.c, but _perhaps_ the change in filemap.c
makes sense anyway.

Oleg.

2008-06-05 16:15:09

by Oleg Nesterov

[permalink] [raw]

Subject: Re: TASK_WAKEKILL && /sbin/init (was: [PATCH 1/2] schedule: fix TASK_WAKEKILL vs SIGKILL race)

On 06/05, Oleg Nesterov wrote:
>
> Anyway. How about the (untested/uncompiled) patch for now? -EINTR or
> -ERESTARTNOINTR looks "more correct" regardless.
>
> Oleg.
>
> --- mm/filemap.c
> +++ mm/filemap.c
> @@ -188,7 +188,7 @@ static int sync_page(void *word)
> static int sync_page_killable(void *word)
> {
> sync_page(word);
> - return fatal_signal_pending(current) ? -EINTR : 0;
> + return fatal_signal_pending(current) ? -ERESTARTNOINTR : 0;

Forgot to mention, this part is questionable of course. And it can't
prevent the case when "ret = sys_read(count)" succeeds, but ret < count.

Oleg.