2009-11-22 21:14:51

by Michael Tokarev

Subject: Why do processes on Linux lose signals?

It's a very old issue, but I still don't have an answer.

In short, processes on Linux lose signals. It happens
rarely, but often enough to be annoying.

For example, I have a program that used alarm(2) to check
for something periodically. Nothing fancy is done in the
signal handler -- no long operations, just a plain signal(2)
handler that sets a global variable. Under heavy use (it's a
DNS nameserver), after about a week (sometimes a few hours,
sometimes a month) it stops checking for updates, apparently
because a SIGALRM got lost.
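
Schematically the pattern is like this (a stripped-down sketch,
not the real code; the 60-second period is arbitrary):

    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t check_flag;

    static void on_alarm(int sig)
    {
        (void)sig;
        check_flag = 1;   /* just note it; real work is done in the loop */
        alarm(60);        /* re-arm -- so if one SIGALRM is lost,
                             the whole chain stops for good */
    }

    int main(void)
    {
        signal(SIGALRM, on_alarm);
        alarm(60);
        for (;;) {
            pause();      /* the real server blocks serving requests here */
            if (check_flag) {
                check_flag = 0;
                /* check for updates */
            }
        }
    }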

For this program I had to replace alarm() with setitimer(),
but only on Linux. On all the other operating systems where
it is used (Solaris, FreeBSD, HP-UX, AIX), everything works
as expected.
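
The setitimer() variant differs only in who re-arms the timer --
the kernel does it via it_interval, so the handler has nothing
to re-arm (again just a sketch):

    #include <sys/time.h>
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t check_flag;

    static void on_alarm(int sig)
    {
        (void)sig;
        check_flag = 1;   /* no alarm() here: it_interval re-arms in-kernel */
    }

    int main(void)
    {
        struct itimerval it = { { 60, 0 }, { 60, 0 } };  /* interval, first shot */

        signal(SIGALRM, on_alarm);
        setitimer(ITIMER_REAL, &it, NULL);
        for (;;)
            pause();
    }

That difference alone would explain why setitimer() hides the
problem: even if one delivery goes missing, the timer fires again
a minute later, while a lost SIGALRM in the alarm() version breaks
the chain permanently.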

Another common case is a SIGIO-based event loop, in its
classical form, in a lightly loaded process. Quite often the
server loses a SIGIO, so even when I/O is possible the process
does not know it. The pending (stuck) I/O gets processed only
on receipt of the next SIGIO, indicating readiness of another
file descriptor -- since the process calls poll() after each
SIGIO, it then notices both.
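
In code, the classical form is roughly this (a sketch, error
handling omitted):

    #include <fcntl.h>
    #include <poll.h>
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t io_ready;

    static void on_sigio(int sig) { (void)sig; io_ready = 1; }

    /* put a descriptor into O_ASYNC mode so readiness raises SIGIO */
    static void arm_fd(int fd)
    {
        fcntl(fd, F_SETOWN, getpid());
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
    }

    int main(void)
    {
        struct pollfd fds[16];   /* filled in as descriptors are arm_fd()ed */
        nfds_t nfds = 0;

        signal(SIGIO, on_sigio);
        /* ... arm_fd() every socket of interest ... */

        for (;;) {
            pause();             /* sleep until some SIGIO arrives */
            if (io_ready) {
                io_ready = 0;
                if (poll(fds, nfds, 0) > 0) {  /* zero timeout: just ask */
                    /* ... service every descriptor poll() reported ... */
                }
            }
        }
    }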

A "classical" (for me) example of this is an Oracle database
version 8 (we've many of these in production still; in later
versions they rewrote the event loop to use different techniques).
There, there's a dispatcher process that does nothing but listens
on the network, receives requests and sends them to a set of
worker processes. Everything is non-blocking and the process
mostly does nothing. It is very annoying when trivial actions
in a user application causes loooong delays - when an app sent
some request to oracle db and that request stuck in the event
queue because the corresponding SIGIO was never delivered. It
helps immediately to make another connection to the same DB to
"unstuck" that request. It is done transparently when there are
many users are working with the database at the same time, each
making requests --- this way any stuck/lost I/O unstucks immediately
because new requests are coming from other users; but at evenings
or over periods of small activity it becomes real problem.

I have looked at the server's behaviour numerous times -- the
server (Oracle) behaves quite reasonably, and its strace
output looks sane. That is to say, one can't blame "stupid
closed-source programmers" for this.

There are other examples like this, all involving lost signals.
The two above are just the most "famous" for me.

The problem becomes much, much worse when the system has
multiple cores. On a single-CPU system the situation is rare
enough to go almost unnoticed. But with even a second core the
issue shows up almost immediately -- enough for many users to
start calling tech support because their applications are very
slow.

Last time I asked a similar question here, I was told that
signals are unreliable and should not be used. But what is
the reason for that unreliability, and why should signals be
unreliable only on Linux?

Thanks!

/mjt


2009-11-22 22:27:18

by Nikita V. Youshchenko

Subject: Re: Why do processes on Linux lose signals?

> In short, processes on Linux lose signals. It happens
> rarely, but often enough to be annoying.
>
> ...
>
> The problem becomes much, much worse when the system has
> multiple cores... But with even a second core the issue
> shows up almost immediately ...

Looks like a classical race description.
Double-check your user-space code for signal-related races.
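
The typical one, in the SIGIO case you describe (just a sketch;
ppoll() is the Linux spelling, pselect() the portable one):

    /* racy: a SIGIO arriving between the check and pause() sets the
       flag, but the process still sleeps until the *next* signal --
       which looks exactly like a lost signal */
    if (!io_ready)
        pause();

    /* race-free: keep SIGIO blocked; ppoll() unblocks it atomically
       only while sleeping (needs _GNU_SOURCE and <poll.h>) */
    sigset_t blk, orig;
    sigemptyset(&blk);
    sigaddset(&blk, SIGIO);
    sigprocmask(SIG_BLOCK, &blk, &orig);
    for (;;) {
        if (io_ready) {
            io_ready = 0;
            /* ... poll(fds, nfds, 0) and service ready fds ... */
        }
        ppoll(NULL, 0, NULL, &orig);   /* wakes on SIGIO, no window */
    }

More cores just widen the chance of hitting such a window, which
would fit your observation that a second CPU makes it much worse.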

2009-11-23 02:04:35

by Ray Lee

Subject: Re: Why do processes on Linux lose signals?

[ adding potentially interested parties to the CC list. Michael, please respond
with the latest kernel version you've tried that exhibits the problem, as well
as whether or not you've been able to create a test-case that shows the
signal loss. ]

On Sun, Nov 22, 2009 at 1:14 PM, Michael Tokarev <[email protected]> wrote:
> It's a very old issue, but I still don't have an answer.
>
> In short, processes on Linux lose signals. It happens
> rarely, but often enough to be annoying.
>
> For example, I have a program that used alarm(2) to check
> for something periodically. Nothing fancy is done in the
> signal handler -- no long operations, just a plain signal(2)
> handler that sets a global variable. Under heavy use (it's a
> DNS nameserver), after about a week (sometimes a few hours,
> sometimes a month) it stops checking for updates, apparently
> because a SIGALRM got lost.
>
> For this program I had to replace alarm() with setitimer(),
> but only on Linux. On all the other operating systems where
> it is used (Solaris, FreeBSD, HP-UX, AIX), everything works
> as expected.
>
> Another common case is a SIGIO-based event loop, in its
> classical form, in a lightly loaded process. Quite often the
> server loses a SIGIO, so even when I/O is possible the process
> does not know it. The pending (stuck) I/O gets processed only
> on receipt of the next SIGIO, indicating readiness of another
> file descriptor -- since the process calls poll() after each
> SIGIO, it then notices both.
>
> A "classical" (for me) example of this is Oracle database
> version 8 (we still have many of these in production; in later
> versions the event loop was rewritten to use different
> techniques). There, a dispatcher process does nothing but
> listen on the network, receive requests and hand them to a
> set of worker processes. Everything is non-blocking and the
> process is mostly idle. It is very annoying when trivial
> actions in a user application cause long delays -- an app
> sends a request to the Oracle DB, and the request gets stuck
> in the event queue because the corresponding SIGIO was never
> delivered. Making another connection to the same DB
> immediately "unsticks" the stuck request. This happens
> transparently when many users are working with the database
> at the same time, each making requests: any stuck/lost I/O
> unsticks immediately because new requests keep arriving from
> other users. But in the evenings, or in other periods of low
> activity, it becomes a real problem.
>
> I have looked at the server's behaviour numerous times -- the
> server (Oracle) behaves quite reasonably, and its strace
> output looks sane. That is to say, one can't blame "stupid
> closed-source programmers" for this.
>
> There are other examples like this, all involving lost signals.
> The two above are just the most "famous" for me.
>
> The problem becomes much, much worse when the system has
> multiple cores. On a single-CPU system the situation is rare
> enough to go almost unnoticed. But with even a second core the
> issue shows up almost immediately -- enough for many users to
> start calling tech support because their applications are very
> slow.
>
> Last time I asked a similar question here, I was told that
> signals are unreliable and should not be used. But what is
> the reason for that unreliability, and why should signals be
> unreliable only on Linux?
>
> Thanks!
>
> /mjt

2009-11-23 10:34:57

by Mikael Pettersson

Subject: Re: Why do processes on Linux lose signals?

Michael Tokarev writes:
> In short, processes on Linux lose signals.

You neglected to attach a self-contained test program for this alleged problem.

2009-11-23 14:45:44

by Oleg Nesterov

Subject: Re: Why do processes on Linux lose signals?

On 11/22, Ray Lee wrote:
>
> [ adding potentially interested parties to the CC list. Michael, please respond
> with the latest kernel version you've tried that exhibits the problem, as well
> as whether or not you've been able to create a test-case that shows the
> signal loss. ]

Yes, it would be nice to have a test-case.

> On Sun, Nov 22, 2009 at 1:14 PM, Michael Tokarev <[email protected]> wrote:
>
> > It's a very old issue, but I still don't have an answer.
> >
> > In short, processes on Linux lose signals. It happens
> > rarely, but often enough to be annoying.
> >
> > For example, I have a program that used alarm(2) to check
> > for something periodically. Nothing fancy is done in the
> > signal handler -- no long operations, just a plain signal(2)
> > handler that sets a global variable. Under heavy use (it's a
> > DNS nameserver), after about a week (sometimes a few hours,
> > sometimes a month) it stops checking for updates, apparently
> > because a SIGALRM got lost.

This shouldn't happen (assuming your application is correct ;)

If this happens again, could you look in /proc/pid/status? I don't
really think this will help, but still.
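
(The interesting lines there are SigPnd/ShdPnd -- per-thread and
shared pending signals -- plus SigBlk, SigIgn and SigCgt. A SIGALRM
sitting in the pending masks while blocked would point at the
application; clean masks would point back at the kernel.)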

> > Last time I asked a similar question here, I was told that
> > signals are unreliable

They should be reliable. If not, we have a kernel bug.

Oleg.