2004-04-09 09:12:52

by Nikita V. Youshchenko

Subject: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)

Hello.

Several days ago I posted to linux-kernel describing a "zombie problem"
related to sigqueue overflow.

Further exploration of the problem showed that the cause of the described
behaviour lies in user space: a process blocks a signal and later receives
tons of such signals, which effectively overflows the sigqueue.

The following program gives the same effect:

#include <signal.h>
#include <unistd.h>
#include <stdlib.h>

int main()
{
	sigset_t set;
	int i;
	pid_t pid;

	/* Block realtime signal 40 so every instance stays queued. */
	sigemptyset(&set);
	sigaddset(&set, 40);
	sigprocmask(SIG_BLOCK, &set, 0);

	/* Send enough blocked signals to exhaust the system-wide queue. */
	pid = getpid();
	for (i = 0; i < 1024; i++)
		kill(pid, 40);

	while (1)
		sleep(1);
}

Running this program on a 2.4 or 2.6 kernel with the
default /proc/sys/kernel/rtsig-max value will cause sigqueue overflow, and
all linuxthreads-based programs, INCLUDING DAEMONS RUNNING AS ROOT, will
stop receiving notifications about thread exits, so all completed threads
will become zombies. The exact reason why this is happening is described in
detail in my previous postings.
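
To watch the queue actually fill up, here is a minimal sketch (mine, for
illustration, not part of the original report); it assumes only that
/proc/sys/kernel/rtsig-nr reports the current number of queued signals
alongside the rtsig-max limit, as it does on these kernels:

#include <stdio.h>

/* Read one counter from /proc; returns -1 if unreadable. */
static long read_counter(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -1;

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	/* rtsig-nr is the current system-wide queue depth,
	 * rtsig-max the limit being exhausted above. */
	printf("queued: %ld / max: %ld\n",
	       read_counter("/proc/sys/kernel/rtsig-nr"),
	       read_counter("/proc/sys/kernel/rtsig-max"));
	return 0;
}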

This is a local DoS.

Affected system services include (but are not limited to) mysql and clamav.
In fact, any linuxthreads application will be affected.

The problem is not as bad on 2.6, since NPTL is used instead of
linuxthreads, so there are no zombies from system daemons. However, bad
things still happen: when the sigqueue is overflowed, all processes get
zeroed siginfo, which causes random application misbehaviour (like hangs
in pthread_cancel()).
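
One way to observe the degraded delivery (again my illustration, not from
the original report): install an SA_SIGINFO handler and print the sender
fields. Normally si_pid is the sending process; under overflow, per the
above, these fields arrive zeroed.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* printf() is not async-signal-safe; acceptable in a throwaway demo. */
static void handler(int sig, siginfo_t *info, void *ctx)
{
	printf("sig=%d si_pid=%d si_code=%d\n",
	       sig, (int)info->si_pid, info->si_code);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGRTMIN, &sa, 0);

	/* Normal case: si_pid is our own pid. Under queue overflow,
	 * the same fields show up as zero. */
	kill(getpid(), SIGRTMIN);
	return 0;
}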

I don't know what the correct solution for this issue is. Probably there
should be per-process or per-user (but not system-wide) limits on the
number of pending signals.


2004-04-09 14:45:57

by Denis Vlasenko

Subject: Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)

On Friday 09 April 2004 12:11, Nikita V. Youshchenko wrote:
> Hello.
>
> Several days ago I posted to linux-kernel describing a "zombie problem"
> related to sigqueue overflow.
>
> Further exploration of the problem showed that the cause of the described
> behaviour lies in user space: a process blocks a signal and later receives
> tons of such signals, which effectively overflows the sigqueue.

One solution would be to add watermarks to the sigqueue: upon reaching
the high watermark, find the process with the most signals queued and
drop those. This prevents one buggy process, even a root-launched one,
from interfering with non-buggy ones.

If that does not bring us back below the low watermark, find the _UID_
with the most signals pending and drop them all. This works against a
rogue user trying to DoS the box who is careful enough to do it from
multiple processes.
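
Roughly, in toy user-space form (names and numbers are invented for
illustration; real kernel code would walk task and user structs instead):

#include <stdio.h>

#define NPROC     4
#define HIGH_MARK 10	/* start shedding load here */
#define LOW_MARK  6	/* stop once total drops below this */

/* Toy model: queued[i] is how many signals process i has pending. */
static int queued[NPROC] = { 7, 2, 1, 1 };

static int total(void)
{
	int i, sum = 0;

	for (i = 0; i < NPROC; i++)
		sum += queued[i];
	return sum;
}

/* Drop the worst offender's signals, repeating (escalating to
 * whole UIDs in the real proposal) until below the low mark. */
static void shed_load(void)
{
	while (total() >= LOW_MARK) {
		int i, worst = 0;

		for (i = 1; i < NPROC; i++)
			if (queued[i] > queued[worst])
				worst = i;
		printf("dropping %d signals of proc %d\n",
		       queued[worst], worst);
		queued[worst] = 0;
	}
}

int main(void)
{
	if (total() >= HIGH_MARK)
		shed_load();
	return 0;
}
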
--
vda

2004-04-13 13:09:12

by Marcelo Tosatti

Subject: Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)

On Fri, Apr 09, 2004 at 01:11:50PM +0400, Nikita V. Youshchenko wrote:
> Hello.
>
> Several days ago I posted to linux-kernel describing a "zombie problem"
> related to sigqueue overflow.
>
> Further exploration of the problem showed that the cause of the described
> behaviour lies in user space: a process blocks a signal and later receives
> tons of such signals, which effectively overflows the sigqueue.
>
> The following program gives the same effect:
>
> #include <signal.h>
> #include <unistd.h>
> #include <stdlib.h>
>
> int main()
> {
> 	sigset_t set;
> 	int i;
> 	pid_t pid;
>
> 	sigemptyset(&set);
> 	sigaddset(&set, 40);
> 	sigprocmask(SIG_BLOCK, &set, 0);
>
> 	pid = getpid();
> 	for (i = 0; i < 1024; i++)
> 		kill(pid, 40);
>
> 	while (1)
> 		sleep(1);
> }
>
> Running this program on a 2.4 or 2.6 kernel with the
> default /proc/sys/kernel/rtsig-max value will cause sigqueue overflow, and
> all linuxthreads-based programs, INCLUDING DAEMONS RUNNING AS ROOT, will
> stop receiving notifications about thread exits, so all completed threads
> will become zombies. The exact reason why this is happening is described in
> detail in my previous postings.
>
> This is a local DoS.
>
> Affected system services include (but are not limited to) mysql and clamav.
> In fact, any linuxthreads application will be affected.
>
> The problem is not as bad on 2.6, since NPTL is used instead of
> linuxthreads, so there are no zombies from system daemons. However, bad
> things still happen: when the sigqueue is overflowed, all processes get
> zeroed siginfo, which causes random application misbehaviour (like hangs
> in pthread_cancel()).
>
> I don't know what the correct solution for this issue is. Probably there
> should be per-process or per-user (but not system-wide) limits on the
> number of pending signals.

Indeed, a per-user sigqueue limit is the way to fix this.

Anyone willing to implement it?
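
For anyone picking this up, the shape of such a fix might look like the
following toy model (my sketch; identifiers are invented, not actual
kernel code): account queued signals per user and refuse to queue past a
per-user cap instead of a global one.

#include <stdio.h>

/* Per-user accounting, as a toy model: each user may have at
 * most 'limit' signals queued across all of their processes. */
struct user_account {
	int uid;
	int sigpending;	/* signals currently queued for this uid */
	int limit;	/* per-user cap, instead of a global rtsig-max */
};

/* Send path: refuse to queue once the sender's user is at
 * their cap, so they cannot exhaust a global pool. */
static int sigqueue_alloc(struct user_account *u)
{
	if (u->sigpending >= u->limit)
		return -1;	/* EAGAIN in the real kernel */
	u->sigpending++;
	return 0;
}

static void sigqueue_free(struct user_account *u)
{
	u->sigpending--;
}

int main(void)
{
	struct user_account evil = { 1000, 0, 128 };
	int i, dropped = 0;

	for (i = 0; i < 1024; i++)
		if (sigqueue_alloc(&evil) < 0)
			dropped++;
	printf("queued %d, refused %d\n", evil.sigpending, dropped);

	/* Delivery side: releasing slots lets the user queue again. */
	while (evil.sigpending > 0)
		sigqueue_free(&evil);
	return 0;
}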

2004-06-14 17:02:08

by David Lang

Subject: Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)

I think I may be running into the same (or a similar) issue with a
workload that forks heavily (~3500 forks/sec). What can I do to let the
system survive this sort of load?

David Lang

On Tue, 13 Apr 2004, Marcelo Tosatti wrote:

> Date: Tue, 13 Apr 2004 10:10:17 -0300
> From: Marcelo Tosatti <[email protected]>
> To: Nikita V. Youshchenko <[email protected]>
> Cc: [email protected]
> Subject: Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)
>
> On Fri, Apr 09, 2004 at 01:11:50PM +0400, Nikita V. Youshchenko wrote:
>> Hello.
>>
>> Several days ago I posted to linux-kernel describing a "zombie problem"
>> related to sigqueue overflow.
>>
>> Further exploration of the problem showed that the cause of the described
>> behaviour lies in user space: a process blocks a signal and later receives
>> tons of such signals, which effectively overflows the sigqueue.
>>
>> The following program gives the same effect:
>>
>> #include <signal.h>
>> #include <unistd.h>
>> #include <stdlib.h>
>>
>> int main()
>> {
>> 	sigset_t set;
>> 	int i;
>> 	pid_t pid;
>>
>> 	sigemptyset(&set);
>> 	sigaddset(&set, 40);
>> 	sigprocmask(SIG_BLOCK, &set, 0);
>>
>> 	pid = getpid();
>> 	for (i = 0; i < 1024; i++)
>> 		kill(pid, 40);
>>
>> 	while (1)
>> 		sleep(1);
>> }
>>
>> Running this program on a 2.4 or 2.6 kernel with the
>> default /proc/sys/kernel/rtsig-max value will cause sigqueue overflow, and
>> all linuxthreads-based programs, INCLUDING DAEMONS RUNNING AS ROOT, will
>> stop receiving notifications about thread exits, so all completed threads
>> will become zombies. The exact reason why this is happening is described in
>> detail in my previous postings.
>>
>> This is a local DoS.
>>
>> Affected system services include (but are not limited to) mysql and clamav.
>> In fact, any linuxthreads application will be affected.
>>
>> The problem is not as bad on 2.6, since NPTL is used instead of
>> linuxthreads, so there are no zombies from system daemons. However, bad
>> things still happen: when the sigqueue is overflowed, all processes get
>> zeroed siginfo, which causes random application misbehaviour (like hangs
>> in pthread_cancel()).
>>
>> I don't know what the correct solution for this issue is. Probably there
>> should be per-process or per-user (but not system-wide) limits on the
>> number of pending signals.
>
> Indeed, a per-user sigqueue limit is the way to fix this.
>
> Anyone willing to implement it?
>

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-06-15 00:35:14

by Marcelo Tosatti

Subject: Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)

On Mon, Jun 14, 2004 at 10:01:53AM -0700, David Lang wrote:
> I think I may be running into the same (or a similar) issue with a
> workload that forks heavily (~3500 forks/sec). What can I do to let the
> system survive this sort of load?

Hi David,

The v2.6.7-mm tree contains a fix for this, adding an rlimit for
pending signals.
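
Assuming the interface that fix exposes is the usual rlimit one (it later
shipped as RLIMIT_SIGPENDING), a process can inspect or lower its cap like
any other limit:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	/* Cap on signals this user may have queued at once. */
	if (getrlimit(RLIMIT_SIGPENDING, &rl) != 0)
		return 1;
	printf("pending-signal limit: soft=%lu hard=%lu\n",
	       (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

	/* Lower the soft limit before running something risky;
	 * raising it past the hard limit needs privilege. */
	if (rl.rlim_cur > 128)
		rl.rlim_cur = 128;
	setrlimit(RLIMIT_SIGPENDING, &rl);
	return 0;
}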

Can you describe the problem you are seeing in more detail?


2004-06-15 01:32:19

by David Lang

Subject: Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)

On Mon, 14 Jun 2004, Marcelo Tosatti wrote:
> On Mon, Jun 14, 2004 at 10:01:53AM -0700, David Lang wrote:
>> I think I may be running into the same (or a similar) issue with a
>> workload that forks heavily (~3500 forks/sec). What can I do to let the
>> system survive this sort of load?
>
> Hi David,
>
> v2.6.7-mm tree contains a fix for this, adding a rlimit for
> pending signals.

I'll have to give this a try.

> Can you describe the problem you are seeing in more detail?

I have a stress test I am running on a dual Opteron 1.4GHz box that
receives a network connection, forks a new process, does a little bit of
network traffic, then the child exits. When I hammer this I get ~3500
connections/sec (with a significant amount of spare CPU; I'm limited by
my load boxes), but after a few seconds (8-10) something happens and the
parent stops receiving the SIGCHLD signals. If I attach strace to the
parent process the signals are re-enabled and everything works for a
little bit longer before the cycle repeats.
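
For context, a bounded sketch of the pattern involved (my code, not
David's): a parent that spawns short-lived children and reaps them from a
SIGCHLD handler in a waitpid() loop, since under load several exits can be
coalesced into a single signal.

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap in a loop: several child exits can be coalesced into one
 * SIGCHLD, so a single wait() per delivery is not enough. */
static void reap(int sig)
{
	int saved = errno;

	while (waitpid(-1, 0, WNOHANG) > 0)
		;
	errno = saved;
}

int main(void)
{
	struct sigaction sa;
	int i;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = reap;
	sa.sa_flags = SA_RESTART;
	sigaction(SIGCHLD, &sa, 0);

	/* Stand-in for the accept loop: spawn short-lived children
	 * as fast as possible, roughly as the stress test does. */
	for (i = 0; i < 1000; i++)
		if (fork() == 0)
			_exit(0);	/* child: "serve" one connection */

	sleep(1);	/* let the handler drain any stragglers */
	return 0;
}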

If I only hit it with ~10,000 connections and then pause, the box survives
indefinitely.

Running the same test on a dual Athlon MP 2200+ I get ~2500 connections a
second and it has no problems. I just compiled a 32-bit kernel for the
Opteron and get ~3300 connections/sec (with no idle CPU time) and the box
doesn't lock up.

I don't know if this is because it's just below the threshold of the
problem or if there is a bug in the 64-bit kernel (or both).

I'm currently trying to tweak the 32-bit Opteron kernel to get a smidge
more speed out of it, to see if getting back up to the same speed starts
triggering the problem again.

David Lang

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan