2003-11-26 16:53:07

by Richard B. Johnson

Subject: BUG (non-kernel), can hurt developers.


Note to hackers. Even though this is a lib-c bug, be
aware that many versions of the 'C' runtime library
have a rand() function that can (read will) segfault
in threads or signals.

glibc-2.1.3
libc.so.6

are two culprits. This "little" problem just took me
a week to find. rand() was used as a source of test
data in a system diagnostic. The diagnostic kept blowing
up. The following code tests for the problem.

//-----------------
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/time.h>

static int spare;
static int inside;

/* SIGALRM handler: calls rand() on purpose, then re-arms the timer. */
void handler(int unused)
{
    struct itimerval it;

    inside++;
    spare = rand();
    /* One-shot timer that fires again in 1 microsecond. */
    it.it_interval.tv_sec = 0L;
    it.it_interval.tv_usec = 0L;
    it.it_value.tv_sec = 0L;
    it.it_value.tv_usec = 1L;
    (void)signal(SIGALRM, handler);
    setitimer(ITIMER_REAL, &it, &it);
    inside--;
}

/* SIGSEGV handler: report where the fault hit, then exit. */
void bad(int sig)
{
    char *where;

    if(inside)
        where = "inside";
    else
        where = "outside";
    fprintf(stderr, "Failed %s handler on %d\n", where, spare);
    exit(EXIT_FAILURE);
}

int main(void);
int main()
{
    (void)signal(SIGSEGV, bad);
    handler(0);              /* arm the first timer */
    for(;;)
        (void)rand();        /* keep rand() busy so the signal lands inside it */
    return 0;
}
//---------------------


Run this for a few minutes.

Script started on Wed Nov 26 11:50:23 2003
$ gcc -Wall -o xxx -O2 xxx.c
$ ./xxx
Failed inside handler on 1735818301
$ ./xxx
Failed inside handler on 129960814
$ ./xxx
Failed inside handler on 1999426653
$ exit
exit
Script done on Wed Nov 26 11:50:52 2003


Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.



2003-11-26 17:21:15

by YOSHIFUJI Hideaki

Subject: Re: BUG (non-kernel), can hurt developers.

In article <Pine.LNX.4.53.0311261153050.10929@chaos> (at Wed, 26 Nov 2003 11:54:56 -0500 (EST)), "Richard B. Johnson" <[email protected]> says:

> Note to hackers. Even though this is a lib-c bug, be
> aware that many versions of the 'C' runtime library
> have a rand() function that can (read will) segfault
> in threads or signals.

How about rand_r(); reentrant version of rand()?

--yoshfuji

2003-11-26 18:30:12

by Linus Torvalds

Subject: Re: BUG (non-kernel), can hurt developers.


On Wed, 26 Nov 2003, Richard B. Johnson wrote:
>
> Note to hackers. Even though this is a lib-c bug

It's not.

It's a bug in your program.

You can't just randomly use library functions in signal handlers. You can
only use a very specific "signal-safe" set.

POSIX lists that set in 3.3.1.3 (3f), and says

"All POSIX.1 functions not in the preceding table and all
functions defined in the C standard {2} not stated to be callable
from a signal-catching function are considered to be /unsafe/
with respect to signals. .."

typos mine.

The thing is, they have internal state that makes them non-reentrant (and
note that even the re-entrant ones might not be signal-safe, since they
may have deadlock issues: being thread-safe is _not_ the same as being
signal-safe).

In other words, if you want to do complex things from signals, you should
really just set a flag (of type "sig_atomic_t") and have your main loop do
them. Or you have to be very very careful and only use stuff that is
defined to be signal-safe (mainly core system calls - stuff like <stdio.h>
is right out).

Linus

2003-11-26 18:52:58

by Richard B. Johnson

Subject: Re: BUG (non-kernel), can hurt developers.

On Wed, 26 Nov 2003, Linus Torvalds wrote:

>
> On Wed, 26 Nov 2003, Richard B. Johnson wrote:
> >
> > Note to hackers. Even though this is a lib-c bug
>
> It's not.
>
> It's a bug in your program.
>
> You can't just randomly use library functions in signal handlers. You can
> only use a very specific "signal-safe" set.
>
> POSIX lists that set in 3.3.1.3 (3f), and says
>
> "All POSIX.1 functions not in the preceding table and all
> functions defined in the C standard {2} not stated to be callable
> from a signal-catching function are considered to be /unsafe/
> with respect to signals. .."
>
> typos mine.
>
> The thing is, they have internal state that makes them non-reentrant (and
> note that even the re-entrant ones might not be signal-safe, since they
> may have deadlock issues: being thread-safe is _not_ the same as being
> signal-safe).
>
> In other words, if you want to do complex things from signals, you should
> really just set a flag (of type "sig_atomic_t") and have your main loop do
> them. Or you have to be very very careful and only use stuff that is
> defined to be signal-safe (mainly core system calls - stuff like <stdio.h>
> is right out).
>
> Linus
>

Well, again, I took a very complicated sequence of events and
minimized them to where they could be readily observed. The actual
problem in the production machine involves two absolutely independent
tasks that end up using the same shared 'C' runtime library. There
should be no interaction between them, none whatsoever. However, when
they both execute rand(), they interact in bad ways. This interaction
occurs on random days at monthly intervals. To find this bug, I
had to compress that time. So, I allowed rand() to be "interrupted"
just as it would be in a context-switch. I simply used a signal
handler, knowing quite well that the "interrupt" could occur at
any time. Now, I didn't give a damn about the value returned in
either function invocation. What I brought to light was a SIGSEGV
that can occur when the shared-library rand() function is
"interrupted". This is likely caused by the failure to use "-s"
in the compilation of a shared library function, fixed in subsequent
releases.

So don't pick on the code. It was designed to emphasize the
problem. It is not supposed to show how to write a signal
handler.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2003-11-26 18:59:39

by YOSHIFUJI Hideaki

Subject: Re: BUG (non-kernel), can hurt developers.

In article <[email protected]> (at Wed, 26 Nov 2003 10:29:54 -0800 (PST)), Linus Torvalds <[email protected]> says:

> You can't just randomly use library functions in signal handlers. You can
> only use a very specific "signal-safe" set.
>
> POSIX lists that set in 3.3.1.3 (3f), and says
>
> "All POSIX.1 functions not in the preceding table and all
> functions defined in the C standard {2} not stated to be callable
> from a signal-catching function are considered to be /unsafe/
> with respect to signals. .."
>
> typos mine.

Just FYI:
http://www.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_04.html#tag_02_04_03

--yoshfuji

2003-11-26 19:33:37

by Jamie Lokier

Subject: Re: BUG (non-kernel), can hurt developers.

Richard B. Johnson wrote:
> The actual problem in the production machine involves two absolutely
> independent tasks that end up using the same shared 'C' runtime
> library. There should be no interaction between them, none
> whatsoever. However, when they both execute rand(), they interact in
> bad ways. This interaction occurs on random days at monthly
> intervals.

On Linux (unlike Windows), there is _no_ interaction between the
libraries of different tasks. Neither of them sees changes to the
other's memory space.

If you are seeing a fault, then there might well be a bug, even a
kernel bug, but your test program does not illustrate the same problem.

What is the "bad interaction" that you observed at monthly intervals?
Also a SIGSEGV?

> This is likely caused by the failure to use "-s" in the compilation
> of a shared library function, fixed in subsequent releases.

No, this has nothing to do with it. Unlike Windows and some embedded
environments, Linux shared libraries do not have "shared writable data"
sections.

> So, I allowed rand() to be "interrupted" just as it would be in a
> context-switch. I simply used a signal handler, knowing quite well
> that the "interrupt" could occur at any time. [...] What I brought
> to light was a SIGSEGV that can occur when the shared-library rand()
> function is "interrupted".

You have made a mistake. Your program shows a different problem to the
one which you noticed every month or so.

Calling a function from a signal handler while it is being interrupted
by that handler is _very_ different from tasks context switching.
They are not similar at all! (Yes, signals can be used to simulate
context switches, but not like this!)

Your code interrupts one call to rand() and calls rand() _within_
the interrupt handler. The inner call and outer call interfere, in a
very similar way to calling it twice from two threads (note: threads
not tasks). The memory state becomes corrupted.

This is _very_ different from two independent tasks context switching.
Independent tasks do not share the same memory space, not even when
they share the same libraries, so this type of corruption isn't
possible.

Summary: your monthly "bad interaction" is not illustrated in this
test program. It's a different problem.

-- Jamie

2003-11-26 20:14:19

by Richard B. Johnson

Subject: Re: BUG (non-kernel), can hurt developers.

On Wed, 26 Nov 2003, Jamie Lokier wrote:

> Richard B. Johnson wrote:
> > The actual problem in the production machine involves two absolutely
> > independent tasks that end up using the same shared 'C' runtime
> > library. There should be no interaction between them, none
> > whatsoever. However, when they both execute rand(), they interact in
> > bad ways. This interaction occurs on random days at monthly
> > intervals.
>
> On Linux (unlike Windows), there is _no_ interaction between the
> libraries of different tasks. Neither of them sees changes to the
> other's memory space.
>
> If you are seeing a fault, then there might well be a bug, even a
> kernel bug, but your test program does not illustrate the same problem.
>
> What is the "bad interaction" that you observed at monthly intervals?
> Also a SIGSEGV?
>

Yes. When the call to rand() was replaced with a static-linked
clone it went away.

> > This is likely caused by the failure to use "-s" in the compilation
> > of a shared library function, fixed in subsequent releases.
>
> No, this has nothing to do with it. Unlike Windows and some embedded
> environments, Linux shared libraries do not have "shared writable data"
> sections.

Well the libc rand() does something that looks like that.

>
> > So, I allowed rand() to be "interrupted" just as it would be in a
> > context-switch. I simply used a signal handler, knowing quite well
> > that the "interrupt" could occur at any time. [...] What I brought
> > to light was a SIGSEGV that can occur when the shared-library rand()
> > function is "interrupted".
>
> You have made a mistake. Your program shows a different problem to the
> one which you noticed every month or so.
>

Calling rand() from a handler in a newer libc doesn't seg-fault.

> Calling a function from a signal handler while it is being interrupted
> by that handler is _very_ different from tasks context switching.
> They are not similar at all! (Yes, signals can be used to simulate
> context switches, but not like this!)
>

Not with the emulation. The problem is that rand() uses a thread-
specific pointer to find the seed (history variable), just like
'errno' which isn't really a static variable, but a function
that returns a pointer to a thread-specific integer. If this
is interrupted in a critical section, and that same pointer
is used, that pointer is left pointing to a variable in somebody
else's address space. That same problem is observed to happen when
the same shared runtime library was used by entirely different tasks.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2003-11-26 20:45:00

by Jamie Lokier

Subject: Re: BUG (non-kernel), can hurt developers.

Richard B. Johnson wrote:
> > What is the "bad interaction" that you observed at monthly intervals?
> > Also a SIGSEGV?
>
> Yes. When the call to rand() was replaced with a static-linked
> clone it went away.

> Calling rand() from a handler in a newer libc doesn't seg-fault.

In both cases, although it doesn't seg-fault, you can no longer trust
the results to be the same quality of random numbers.

It's an implementation detail that the other versions of rand() happen
not to segfault even though you are calling them incorrectly. Just
like you can call free() twice on a memory block and it will segfault
with Glibc, but is fine in some versions of BSD. It's still an error
to do it.

> Not with the emulation. The problem is that rand() uses a thread-
> specific pointer to find the seed (history variable), just like
> 'errno' which isn't really a static variable, but a function
> that returns a pointer to a thread-specific integer. If this
> is interrupted in a critical section, and that same pointer
> is used, that pointer is left pointing to a variable in somebody
> else's address space.

Yes that sounds reasonable. A newer libc would fix it because newer
libc uses a different method for looking up thread-specific pointers.

> That same problem is observed to happen when the same shared runtime
> library was used by entirely different tasks.

When you say "entirely different tasks", do you mean "different
threads in the same process" or "different processes"?

That same problem _can_ happen between different threads in a single
process, but it _cannot_ happen between different processes.

-- Jamie

2003-11-27 20:42:03

by Mikulas Patocka

Subject: Re: BUG (non-kernel), can hurt developers.



On Wed, 26 Nov 2003, Linus Torvalds wrote:

>
> On Wed, 26 Nov 2003, Richard B. Johnson wrote:
> >
> > Note to hackers. Even though this is a lib-c bug
>
> It's not.
>
> It's a bug in your program.
>
> You can't just randomly use library functions in signal handlers. You can
> only use a very specific "signal-safe" set.
>
> POSIX lists that set in 3.3.1.3 (3f), and says
>
> "All POSIX.1 functions not in the preceding table and all
> functions defined in the C standard {2} not stated to be callable
> from a signal-catching function are considered to be /unsafe/
> with respect to signals. .."
>
> typos mine.
>
> The thing is, they have internal state that makes them non-reentrant (and
> note that even the re-entrant ones might not be signal-safe, since they
> may have deadlock issues: being thread-safe is _not_ the same as being
> signal-safe).
>
> In other words, if you want to do complex things from signals, you should
> really just set a flag (of type "sig_atomic_t") and have your main loop do
> them. Or you have to be very very careful and only use stuff that is
> defined to be signal-safe (mainly core system calls - stuff like <stdio.h>
> is right out).

Just curious: what happens when one of these functions is interrupted
by a signal and the signal handler does a siglongjmp out of the handler?

According to this assumption, siglongjmp should not work. Does it handle
these situations specially? I don't understand why it is in the
specification if it doesn't work.

Mikulas

2003-11-28 07:18:35

by Tomas Szepe

Subject: Re: BUG (non-kernel), can hurt developers.

On Nov-26 2003, Wed, 10:29 -0800
Linus Torvalds <[email protected]> wrote:

> You can't just randomly use library functions in signal handlers. You can
> only use a very specific "signal-safe" set.
>
> POSIX lists that set in 3.3.1.3 (3f), and says
>
> "All POSIX.1 functions not in the preceding table and all
> functions defined in the C standard {2} not stated to be callable
> from a signal-catching function are considered to be /unsafe/
> with respect to signals. .."
>
> typos mine.
>
> The thing is, they have internal state that makes them non-reentrant (and
> note that even the re-entrant ones might not be signal-safe, since they
> may have deadlock issues: being thread-safe is _not_ the same as being
> signal-safe).

I believe it would be very useful to have this information included in
the standard Linux signal(2) manpage. (I've just verified it's not in
man-pages-1.60.)

What do you think, Andries?

--
Tomas Szepe <[email protected]>

2003-11-28 10:30:22

by Andries E. Brouwer

Subject: Re: BUG (non-kernel), can hurt developers.

> I believe it would be very useful to have this information included
> in the standard Linux signal(2) manpage.

OK. You might have included a patch. I made it say

The effects of this call in a multi-threaded process are
unspecified.

The routine handler must be very careful, since processing
elsewhere was interrupted at some arbitrary point. POSIX
has the concept of "safe function". If a signal interrupts
an unsafe function, and handler calls an unsafe
function, then the behavior is undefined. Safe functions
are listed explicitly in the various standards. The POSIX
1003.1-2003 list is

_Exit() _exit() abort() accept() access() aio_error()
aio_return() aio_suspend() alarm() bind() cfgetispeed()
cfgetospeed() cfsetispeed() cfsetospeed() chdir() chmod()
chown() clock_gettime() close() connect() creat() dup()
dup2() execle() execve() fchmod() fchown() fcntl()
fdatasync() fork() fpathconf() fstat() fsync() ftruncate()
getegid() geteuid() getgid() getgroups() getpeername()
getpgrp() getpid() getppid() getsockname() getsockopt()
getuid() kill() link() listen() lseek() lstat() mkdir()
mkfifo() open() pathconf() pause() pipe() poll()
posix_trace_event() pselect() raise() read() readlink()
recv() recvfrom() recvmsg() rename() rmdir() select()
sem_post() send() sendmsg() sendto() setgid() setpgid()
setsid() setsockopt() setuid() shutdown() sigaction()
sigaddset() sigdelset() sigemptyset() sigfillset()
sigismember() sleep() signal() sigpause() sigpending()
sigprocmask() sigqueue() sigset() sigsuspend() socket()
socketpair() stat() symlink() sysconf() tcdrain() tcflow()
tcflush() tcgetattr() tcgetpgrp() tcsendbreak()
tcsetattr() tcsetpgrp() time() timer_getoverrun()
timer_gettime() timer_settime() times() umask() uname()
unlink() utime() wait() waitpid() write().

Andries

2003-11-28 17:23:26

by Chris Friesen

Subject: Re: BUG (non-kernel), can hurt developers.

[email protected] wrote:
> The routine handler must be very careful, since processing
> elsewhere was interrupted at some arbitrary point. POSIX
> has the concept of "safe function". If a signal interrupts
> an unsafe function, and handler calls an unsafe
> function, then the behavior is undefined. Safe functions
> are listed explicitly in the various standards. The POSIX
> 1003.1-2003 list is

<snip>

You may also want to mention the SUS async-safe list as well, since
there are some additional functions there.

Chris


--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2003-11-28 21:21:42

by Andries E. Brouwer

Subject: Re: BUG (non-kernel), can hurt developers.

> You may also want to mention the SUS async-safe list as well,
> since there are some additional functions there.

Are you sure?
Which? In which SUS version?

2003-11-28 21:39:26

by Chris Friesen

Subject: Re: BUG (non-kernel), can hurt developers.

[email protected] wrote:
>>You may also want to mention the SUS async-safe list as well,
>>since there are some additional functions there.
>>
>
> Are you sure?
> Which? In which SUS version?

My bad. SUS references the posix list. I must have been looking at an
old function list. POSIX should be sufficient.

Chris



--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]