2007-05-28 02:32:12

by Satoru Takeuchi

[permalink] [raw]
Subject: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

Hi,

I found a bug on signal subsystem. If there is some multithread program
and one of the thread is blocking on the system call, it returns with
wrong errno on receiving SIGSTOP and following SIGCONT.

Arch dependency
===============

succeed to reproduce: i386, ia64
Unknown: any other arch

# I suspect that this problem occur on any arch...

How to reproduce
================

This process needs 2 terminals, term A and term B.

1. issue the test program (attached on this mail) from term A.

$ ./mt-wrong-errno

2. send SIGSTOP to test program from term B.

$ ps a
...
9685 pts/3 Sl+ 0:00 ./mt-wrong-errno
9688 pts/4 R+ 0:00 ps a
$ kill -STOP 9685

3. send SIGCONT to test program from term B.

$ kill -CONT 9685

Expected Result
===============

`./mt-wrong-errno' restarts read syscall (and stop again by receiving
SIGTTIN).

Actual Result
=============

`./mt-wrong-errno' returns with wrong errno 512 means ERESTARTSYS.
The comment on include/linux/errno.h says "this errno should never be
seen by user program." So it's definitely a bug.

output of term A
----------------

$ ./mt-wrong-errno

[1]+ Stopped ./mt-wrong-errno
$ errno = 512

output of term B
----------------

$ ps a
...
9685 pts/3 Sl+ 0:00 ./mt-wrong-errno
9688 pts/4 R+ 0:00 ps a
$ kill -STOP 9685
$ kill -CONT 9685


This problem can always reproduce, so it's no timing issue. I suspect
that group_stop_count counting mechanism has something wrong and
signal handling is canceled erroneously.

Any hints or patch itself are welcome.

Thanks,

Satoru

-------------------------------------------------------------------------------
/*
* mt-wrong-errno.c
*
* Copyright 2007 Satoru Takeuchi <[email protected]>
*
* This software may be used and distributed according to the terms
* of the GNU General Public License, incorporated herein by reference.
*
*/

#include <sys/types.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <err.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

void *thread_fn(void *arg)
{
char c;
int ret;

ret = read(STDIN_FILENO, &c, 1);
if (ret < 0)
printf("errno = %d\n", errno);

return NULL;
}

int main(int argc, char *argv[])
{
pthread_t t;

if (pthread_create(&t, NULL, thread_fn, NULL))
err(EXIT_FAILURE, "pthread_create() failed\n");

if (pthread_join(t, NULL))
warn("pthread_join() failed\n");
}


2007-05-28 07:08:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

Satoru Takeuchi wrote:
>
> I found a bug on signal subsystem. If there is some multithread program
> and one of the thread is blocking on the system call, it returns with
> wrong errno on receiving SIGSTOP and following SIGCONT.

It doesn't matter in this particular case, but in general it is good to
report the kernel version as well.

> 1. issue the test program (attached on this mail) from term A.
>
> $ ./mt-wrong-errno
>
> 2. send SIGSTOP to test program from term B.
>
> $ ps a
> ...
> 9685 pts/3 Sl+ 0:00 ./mt-wrong-errno
> 9688 pts/4 R+ 0:00 ps a
> $ kill -STOP 9685
>
> 3. send SIGCONT to test program from term B.
>
> $ kill -CONT 9685
>
> Expected Result
> ===============
>
> `./mt-wrong-errno' restarts read syscall (and stop again by receiving
> SIGTTIN).
>
> Actual Result
> =============
>
> `./mt-wrong-errno' returns with wrong errno 512 means ERESTARTSYS.
> The comment on include/linux/errno.h says "this errno should never be
> seen by user program." So it's definitely a bug.

Wild guess,

drivers/char/n_tty.c:job_control()

kill_pgrp(task_pgrp(current), SIGTTIN, 1);
return -ERESTARTSYS;

This is wrong. kill_pgrp()->__group_send_sig_info() sends SIGTTIN
to the thread group, but can choose any sub-thread as a target for
signal_wake_up().

This means we return ERESTARTSYS, but signal_pending() != true, bug.

Satoru, could you re-test with the untested patch below?

Oleg.

--- t/drivers/char/n_tty.c~ 2007-04-05 12:18:26.000000000 +0400
+++ t/drivers/char/n_tty.c 2007-05-28 10:57:58.000000000 +0400
@@ -1191,6 +1191,7 @@ static int job_control(struct tty_struct
is_current_pgrp_orphaned())
return -EIO;
kill_pgrp(task_pgrp(current), SIGTTIN, 1);
+ set_thread_flag(TIF_SIGPENDING);
return -ERESTARTSYS;
}
}

2007-05-28 09:59:24

by Satoru Takeuchi

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

Hi Oleg,

At Mon, 28 May 2007 11:07:53 +0400,
Oleg Nesterov wrote:
>
> Satoru Takeuchi wrote:
> >
> > I found a bug on signal subsystem. If there is some multithread program
> > and one of the thread is blocking on the system call, it returns with
> > wrong errno on receiving SIGSTOP and following SIGCONT.
>
> It doesn't matter in this particular case, but in general it is good to
> report the kernel version as well.
>
> > 1. issue the test program (attached on this mail) from term A.
> >
> > $ ./mt-wrong-errno
> >
> > 2. send SIGSTOP to test program from term B.
> >
> > $ ps a
> > ...
> > 9685 pts/3 Sl+ 0:00 ./mt-wrong-errno
> > 9688 pts/4 R+ 0:00 ps a
> > $ kill -STOP 9685
> >
> > 3. send SIGCONT to test program from term B.
> >
> > $ kill -CONT 9685
> >
> > Expected Result
> > ===============
> >
> > `./mt-wrong-errno' restarts read syscall (and stop again by receiving
> > SIGTTIN).
> >
> > Actual Result
> > =============
> >
> > `./mt-wrong-errno' returns with wrong errno 512 means ERESTARTSYS.
> > The comment on include/linux/errno.h says "this errno should never be
> > seen by user program." So it's definitely a bug.
>
> Wild guess,
>
> drivers/char/n_tty.c:job_control()
>
> kill_pgrp(task_pgrp(current), SIGTTIN, 1);
> return -ERESTARTSYS;
>
> This is wrong. kill_pgrp()->__group_send_sig_info() sends SIGTTIN
> to the thread group, but can choose any sub-thread as a target for
> signal_wake_up().
>
> This means we return ERESTARTSYS, but signal_pending() != true, bug.
>
> Satoru, could you re-test with the untested patch below?

I tested this patch on my ia64 box and it works fine.

# Since I don't know anything about tty driver, I can't determine its
# adequacy immediately, sorry.

Satoru

>
> Oleg.
>
> --- t/drivers/char/n_tty.c~ 2007-04-05 12:18:26.000000000 +0400
> +++ t/drivers/char/n_tty.c 2007-05-28 10:57:58.000000000 +0400
> @@ -1191,6 +1191,7 @@ static int job_control(struct tty_struct
> is_current_pgrp_orphaned())
> return -EIO;
> kill_pgrp(task_pgrp(current), SIGTTIN, 1);
> + set_thread_flag(TIF_SIGPENDING);
> return -ERESTARTSYS;
> }
> }
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2007-05-28 10:13:55

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

On 05/28, Satoru Takeuchi wrote:
>
> At Mon, 28 May 2007 11:07:53 +0400,
> Oleg Nesterov wrote:
> >
> > drivers/char/n_tty.c:job_control()
> >
> > kill_pgrp(task_pgrp(current), SIGTTIN, 1);
> > return -ERESTARTSYS;
> >
> > This is wrong. kill_pgrp()->__group_send_sig_info() sends SIGTTIN
> > to the thread group, but can choose any sub-thread as a target for
> > signal_wake_up().
> >
> > This means we return ERESTARTSYS, but signal_pending() != true, bug.
> >
> > Satoru, could you re-test with the untested patch below?
>
> I tested this patch on my ia64 box and it works fine.

Thanks! A quick grep shows that tty_check_change() may have the same problem.

> # Since I don't know anything about tty driver, I can't determine its
> # adequacy immediately, sorry.

/dev/tty is a black magic to me too. I'll send the patch later unless
maintainers have any objections.

> > --- t/drivers/char/n_tty.c~ 2007-04-05 12:18:26.000000000 +0400
> > +++ t/drivers/char/n_tty.c 2007-05-28 10:57:58.000000000 +0400
> > @@ -1191,6 +1191,7 @@ static int job_control(struct tty_struct
> > is_current_pgrp_orphaned())
> > return -EIO;
> > kill_pgrp(task_pgrp(current), SIGTTIN, 1);
> > + set_thread_flag(TIF_SIGPENDING);
> > return -ERESTARTSYS;
> > }
> > }
> >

Oleg.

2007-05-29 07:34:43

by Roland McGrath

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

Your fix seems like the only way to go. From skimming all the ERESTART*
uses, I think that in all cases (except for n_tty.c:job_control before your
patch), TIF_SIGPENDING is indeed set when a thread returns -ERESTART*.

But it makes me realize that there is a danger of leaking a -ERESTART*
return code to userland when TIF_SIGPENDING gets cleared by another thread
doing a recalc_sigpending_tsk. Because of -ERESTART* I think we must make
it a rule that no thread can clear another thread's TIF_SIGPENDING, only
set it (unless it's known to be stopped in the signal code or something).
>From our recent work on it, I think that do_sigaction is in fact the only
place this can happen. So that says we should err in the other direction
from what I said before in do_sigaction, and not have it do recalc at all.


Thanks,
Roland

2007-05-29 08:04:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

On Mon, 28 May 2007 11:32:00 +0900 Satoru Takeuchi <[email protected]> wrote:

> Hi,
>
> I found a bug on signal subsystem. If there is some multithread program
> and one of the thread is blocking on the system call, it returns with
> wrong errno on receiving SIGSTOP and following SIGCONT.
>
> Arch dependency
> ===============
>
> succeed to reproduce: i386, ia64
> Unknown: any other arch
>
> # I suspect that this problem occur on any arch...
>
> How to reproduce
> ================
>
> This process needs 2 terminals, term A and term B.
>
> 1. issue the test program (attached on this mail) from term A.
>
> $ ./mt-wrong-errno
>
> 2. send SIGSTOP to test program from term B.
>
> $ ps a
> ...
> 9685 pts/3 Sl+ 0:00 ./mt-wrong-errno
> 9688 pts/4 R+ 0:00 ps a
> $ kill -STOP 9685
>
> 3. send SIGCONT to test program from term B.
>
> $ kill -CONT 9685
>
> Expected Result
> ===============
>
> `./mt-wrong-errno' restarts read syscall (and stop again by receiving
> SIGTTIN).
>
> Actual Result
> =============
>
> `./mt-wrong-errno' returns with wrong errno 512 means ERESTARTSYS.
> The comment on include/linux/errno.h says "this errno should never be
> seen by user program." So it's definitely a bug.
>
> output of term A
> ----------------
>
> $ ./mt-wrong-errno
>
> [1]+ Stopped ./mt-wrong-errno
> $ errno = 512
>
> output of term B
> ----------------
>
> $ ps a
> ...
> 9685 pts/3 Sl+ 0:00 ./mt-wrong-errno
> 9688 pts/4 R+ 0:00 ps a
> $ kill -STOP 9685
> $ kill -CONT 9685
>
>
> This problem can always reproduce, so it's no timing issue. I suspect
> that group_stop_count counting mechanism has something wrong and
> signal handling is canceled erroneously.
>
> Any hints or patch itself are welcome.
>
> Thanks,
>
> Satoru
>
> -------------------------------------------------------------------------------
> /*
> * mt-wrong-errno.c
> *
> * Copyright 2007 Satoru Takeuchi <[email protected]>
> *
> * This software may be used and distributed according to the terms
> * of the GNU General Public License, incorporated herein by reference.
> *
> */
>
> #include <sys/types.h>
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <err.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <errno.h>
>
> void *thread_fn(void *arg)
> {
> char c;
> int ret;
>
> ret = read(STDIN_FILENO, &c, 1);
> if (ret < 0)
> printf("errno = %d\n", errno);
>
> return NULL;
> }
>
> int main(int argc, char *argv[])
> {
> pthread_t t;
>
> if (pthread_create(&t, NULL, thread_fn, NULL))
> err(EXIT_FAILURE, "pthread_create() failed\n");
>
> if (pthread_join(t, NULL))
> warn("pthread_join() failed\n");
> }

What kernel versions is this occurring on?

2007-05-29 08:05:43

by Roland McGrath

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

Off hand I think it can happen on all 2.6 kernels in theory.

2007-05-29 08:10:36

by Satoru Takeuchi

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

Hi Andrew,

> What kernel versions is this occurring on?

2.6.22-rc3 is.

Satoru

At Tue, 29 May 2007 01:03:15 -0700,
Andrew Morton wrote:
>
> On Mon, 28 May 2007 11:32:00 +0900 Satoru Takeuchi <[email protected]> wrote:
>
> > Hi,
> >
> > I found a bug on signal subsystem. If there is some multithread program
> > and one of the thread is blocking on the system call, it returns with
> > wrong errno on receiving SIGSTOP and following SIGCONT.
> >
> > Arch dependency
> > ===============
> >
> > succeed to reproduce: i386, ia64
> > Unknown: any other arch
> >
> > # I suspect that this problem occur on any arch...
> >
> > How to reproduce
> > ================
> >
> > This process needs 2 terminals, term A and term B.
> >
> > 1. issue the test program (attached on this mail) from term A.
> >
> > $ ./mt-wrong-errno
> >
> > 2. send SIGSTOP to test program from term B.
> >
> > $ ps a
> > ...
> > 9685 pts/3 Sl+ 0:00 ./mt-wrong-errno
> > 9688 pts/4 R+ 0:00 ps a
> > $ kill -STOP 9685
> >
> > 3. send SIGCONT to test program from term B.
> >
> > $ kill -CONT 9685
> >
> > Expected Result
> > ===============
> >
> > `./mt-wrong-errno' restarts read syscall (and stop again by receiving
> > SIGTTIN).
> >
> > Actual Result
> > =============
> >
> > `./mt-wrong-errno' returns with wrong errno 512 means ERESTARTSYS.
> > The comment on include/linux/errno.h says "this errno should never be
> > seen by user program." So it's definitely a bug.
> >
> > output of term A
> > ----------------
> >
> > $ ./mt-wrong-errno
> >
> > [1]+ Stopped ./mt-wrong-errno
> > $ errno = 512
> >
> > output of term B
> > ----------------
> >
> > $ ps a
> > ...
> > 9685 pts/3 Sl+ 0:00 ./mt-wrong-errno
> > 9688 pts/4 R+ 0:00 ps a
> > $ kill -STOP 9685
> > $ kill -CONT 9685
> >
> >
> > This problem can always reproduce, so it's no timing issue. I suspect
> > that group_stop_count counting mechanism has something wrong and
> > signal handling is canceled erroneously.
> >
> > Any hints or patch itself are welcome.
> >
> > Thanks,
> >
> > Satoru
> >
> > -------------------------------------------------------------------------------
> > /*
> > * mt-wrong-errno.c
> > *
> > * Copyright 2007 Satoru Takeuchi <[email protected]>
> > *
> > * This software may be used and distributed according to the terms
> > * of the GNU General Public License, incorporated herein by reference.
> > *
> > */
> >
> > #include <sys/types.h>
> > #include <pthread.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <err.h>
> > #include <unistd.h>
> > #include <stdio.h>
> > #include <errno.h>
> >
> > void *thread_fn(void *arg)
> > {
> > char c;
> > int ret;
> >
> > ret = read(STDIN_FILENO, &c, 1);
> > if (ret < 0)
> > printf("errno = %d\n", errno);
> >
> > return NULL;
> > }
> >
> > int main(int argc, char *argv[])
> > {
> > pthread_t t;
> >
> > if (pthread_create(&t, NULL, thread_fn, NULL))
> > err(EXIT_FAILURE, "pthread_create() failed\n");
> >
> > if (pthread_join(t, NULL))
> > warn("pthread_join() failed\n");
> > }
>
> What kernel versions is this occurring on?
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2007-05-29 19:23:14

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

On 05/29, Roland McGrath wrote:
>
> Your fix seems like the only way to go. From skimming all the ERESTART*
> uses, I think that in all cases (except for n_tty.c:job_control before your
> patch), TIF_SIGPENDING is indeed set when a thread returns -ERESTART*.
>
> But it makes me realize that there is a danger of leaking a -ERESTART*
> return code to userland when TIF_SIGPENDING gets cleared by another thread
> doing a recalc_sigpending_tsk. Because of -ERESTART* I think we must make
> it a rule that no thread can clear another thread's TIF_SIGPENDING, only
> set it (unless it's known to be stopped in the signal code or something).
> >From our recent work on it, I think that do_sigaction is in fact the only
> place this can happen. So that says we should err in the other direction
> from what I said before in do_sigaction, and not have it do recalc at all.

I think you are right.

But please note that cancel_freezing(p) is special. It is also called when
try_to_freeze_tasks() fails. So it should clear TIF_SIGPENDING if "p" is a
kernel thread, otherwise p may run with signal_pending() forever.

Unfortunately, it is not easy to detect the kernel thread, is_user_space()
is not reliable. Probably we should ignore this minor problem and do not
change cancel_freezing().

Oleg.

2007-05-29 19:28:20

by Roland McGrath

[permalink] [raw]
Subject: Re: [BUG] signal: multithread program returns with wrong errno on receiving SIGSTOP

Yeah, I knowingly glossed over freezer. I don't really want to think about
it. There are more problems than just this with how it goes about it, but
I'd rather leave that alone until it really bites us.


Thanks,
Roland