2005-09-19 17:40:19

by Badari Pulavarty

[permalink] [raw]
Subject: 2.6.14-rc1 wait()/SIG_CHILD bevahiour

Hi,

I am looking at a problem where the parent process doesn't seem
to cleanup the exited children (with a webserver). We narrowed it
down to a simple testcase. Seems more like a lost SIG_CHILD.

I can easily reproduce the problem on my AMD64 machine.
Any thoughts on why this is happening ? Any known issues/fixes ?

Thanks,
Badari

elm3b29:~ # ./proctest 10 30
Parent process id: 30007
Spawned 10 children
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Waiting for 10 children to exit
Child 0 exiting. Executed 3682 forks
Child 3 exiting. Executed 3677 forks
Waiting for 8 children to exit
Waiting for 8 children to exit
Waiting for 8 children to exit
Waiting for 8 children to exit
Waiting for 8 children to exit
Waiting for 8 children to exit
Waiting for 8 children to exit
Waiting for 8 children to exit
Waiting for 8 children to exit
Waiting for 8 children to exit
...

#ps -aef
....
root 30007 20480 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 30009 30007 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 30011 30007 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 30015 30007 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 30017 30007 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 30019 30007 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 30023 30007 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 30026 30007 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 30031 30007 0 02:35 pts/1 00:00:00 ./proctest 10 30
root 698 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 704 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 724 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 730 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 738 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 754 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 761 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 766 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 781 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 786 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 792 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 814 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 820 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 833 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 840 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 859 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 868 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 876 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 890 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 900 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 903 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 911 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 924 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 930 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 935 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 939 30023 0 02:35 pts/1 00:00:00 [child] <defunct>
root 2822 30017 0 02:35 pts/1 00:00:00 [child] <defunct>
root 2826 30017 0 02:35 pts/1 00:00:00 [child] <defunct>
root 2834 30017 0 02:35 pts/1 00:00:00 [child] <defunct>
root 2842 30017 0 02:35 pts/1 00:00:00 [child] <defunct>
...
...

elm3b29:~ # strace -p 30023
Process 30023 attached - interrupt to quit
futex(0x2aaaaaddf118, FUTEX_WAIT, 2, NULL


Attachments:
proctest.c (2.49 kB)
child.c (80.00 B)
Download all attachments

2005-09-19 18:01:41

by Kyle Moffett

[permalink] [raw]
Subject: Re: 2.6.14-rc1 wait()/SIG_CHILD bevahiour

On Sep 19, 2005, at 13:39:33, Badari Pulavarty wrote:
> Hi,
>
> I am looking at a problem where the parent process doesn't seem to
> cleanup the exited children (with a webserver). We narrowed it down
> to a simple testcase. Seems more like a lost SIG_CHILD.

You don't get one SIG_CHLD per child that quits. The kernel may and
probably will merge SIG_CHLD signals together if it has several
queued before it gets a chance to deliver them to your process. This
is true of _all_ signals. If you "kill -STOP 1234", then "kill -QUIT
1234", "kill -QUIT 1234", "kill -QUIT 1234", "kill -CONT 1234", the
PID 1234 will have 3 signals delivered: The original untrappable
SIGSTOP, the SIGCONT that causes it to resume, and a single SIGQUIT
immediately following it. The correct and portable way to handle
this is to put a loop in your SIGCHLD signal handler:

#include <sys/types.h>
#include <sys/wait.h>

void sigchld_handler(int signal) {
pid_t pid;
int status;

while( -1 != (child = waitpid(-1, &status, WNOHANG)) ) {
/*
* Now "status" is the exit status of the child and
* "pid" is its pid. See the waitpid() manpage for
* macros you can use to get information from the
* status variable.
*/
do_some_processing_of_exited_child(pid,status);
}
}

Cheers,
Kyle Moffett

--
There is no way to make Linux robust with unreliable memory
subsystems, sorry. It would be like trying to make a human more
robust with an unreliable O2 supply. Memory just has to work.
-- Andi Kleen


2005-09-19 18:21:19

by Badari Pulavarty

[permalink] [raw]
Subject: Re: 2.6.14-rc1 wait()/SIG_CHILD bevahiour

On Mon, 2005-09-19 at 14:01 -0400, Kyle Moffett wrote:
> On Sep 19, 2005, at 13:39:33, Badari Pulavarty wrote:
> > Hi,
> >
> > I am looking at a problem where the parent process doesn't seem to
> > cleanup the exited children (with a webserver). We narrowed it down
> > to a simple testcase. Seems more like a lost SIG_CHILD.
>
> You don't get one SIG_CHLD per child that quits. The kernel may and
> probably will merge SIG_CHLD signals together if it has several
> queued before it gets a chance to deliver them to your process. This
> is true of _all_ signals. If you "kill -STOP 1234", then "kill -QUIT
> 1234", "kill -QUIT 1234", "kill -QUIT 1234", "kill -CONT 1234", the
> PID 1234 will have 3 signals delivered: The original untrappable
> SIGSTOP, the SIGCONT that causes it to resume, and a single SIGQUIT

True.

> immediately following it. The correct and portable way to handle
> this is to put a loop in your SIGCHLD signal handler:
>
> #include <sys/types.h>
> #include <sys/wait.h>
>
> void sigchld_handler(int signal) {
> pid_t pid;
> int status;
>
> while( -1 != (child = waitpid(-1, &status, WNOHANG)) ) {
> /*
> * Now "status" is the exit status of the child and
> * "pid" is its pid. See the waitpid() manpage for
> * macros you can use to get information from the
> * status variable.
> */
> do_some_processing_of_exited_child(pid,status);
> }
> }


Thats what I already do, in my current testcase.

while(1) {
pid = wait4(-1, &status, WNOHANG, &ru);
if (pid <= 0) break;
// printf("SIGCHLD received for pid %d\n", pid);
pid_exited(pid);
}


Thanks,
Badari

2005-09-19 19:15:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.14-rc1 wait()/SIG_CHILD bevahiour



On Mon, 19 Sep 2005, Badari Pulavarty wrote:
>
> I can easily reproduce the problem on my AMD64 machine.
> Any thoughts on why this is happening ? Any known issues/fixes ?

Interesting.

I don't think it is SIGCHLD, because you can "strace" the waiter in one
window, and send it a SIGCHLD _by_hand_ in the other window, and it will
do a new "wait4()", but still not pick up its zombie children.

So it looks like wait() itself is broken, and doesn't pick up the children
for some reason. It just returns 0.

Ingo, Roland - Badari included a test-program in his post on lkml, and I
can trigger the behaviour on ppc64, even if I can't see what's wrong yet.
Mind taking a look?

Linus

2005-09-19 20:53:13

by Edgar Toernig

[permalink] [raw]
Subject: Re: 2.6.14-rc1 wait()/SIG_CHILD bevahiour

Badari Pulavarty wrote:
>
> I am looking at a problem where the parent process doesn't seem
> to cleanup the exited children (with a webserver). We narrowed it
> down to a simple testcase. Seems more like a lost SIG_CHILD.

You are aware that the defunct processes are all grandchildren
and the sigchild handler of the children is the one inherited from
the main process? That sighandler is pretty bogus for the children,
i.e. pid_exited never does anything useful. Further, the children
terminate unconditionally after some time - all still present grand-
children are inherited by init (pid 1) who does the clean up.

But I think the major problem is that the main-loop of the children
and the signal handler both call usleep(rand()) - I wouldn't be
surprised if this combo isn't reentrant, especially from signal
handlers ...

Ciao, ET.

2005-09-19 23:34:52

by Roland McGrath

[permalink] [raw]
Subject: Re: 2.6.14-rc1 wait()/SIG_CHILD bevahiour

The test program is buggy. Here is one clue:

elm3b29:~ # strace -p 30023
Process 30023 attached - interrupt to quit
futex(0x2aaaaaddf118, FUTEX_WAIT, 2, NULL

It's not anywhere near wait4. It's deadlocked in the rand() call inside
rand_delay, called from sigchld_handler. You cannot safely call rand
inside a signal handler, for exactly this reason. The signal came during
another rand call and attempted to reenter. If this sort of deadlock is
the failure mode of your real-world case, then it is probably an
application bug. If this deadlock is just a mistake in your test program
here, then you'll need to give us a corrected test program to pursue
whatever real kernel issue you may have.


Thanks,
Roland

2005-09-20 16:50:08

by Badari Pulavarty

[permalink] [raw]
Subject: Re: 2.6.14-rc1 wait()/SIG_CHILD bevahiour

On Mon, 2005-09-19 at 16:34 -0700, Roland McGrath wrote:
> The test program is buggy. Here is one clue:
>
> elm3b29:~ # strace -p 30023
> Process 30023 attached - interrupt to quit
> futex(0x2aaaaaddf118, FUTEX_WAIT, 2, NULL
>
> It's not anywhere near wait4. It's deadlocked in the rand() call inside
> rand_delay, called from sigchld_handler. You cannot safely call rand
> inside a signal handler, for exactly this reason. The signal came during
> another rand call and attempted to reenter. If this sort of deadlock is
> the failure mode of your real-world case, then it is probably an
> application bug. If this deadlock is just a mistake in your test program
> here, then you'll need to give us a corrected test program to pursue
> whatever real kernel issue you may have.

Thank you for pointing out the problem with the testcase.

While running webserving benchmark (Zeus), we noticed that process is
not cleaning up its zombie childern. We end up with thousands of them
and eventually fork fails. Interesting thing to note in that case was,
moment we do strace -p <parent>, it suddenly seem to cleanup all its
zombie childern. I don't have access to the webserver source code, so we
tried the behaviour simulate it with a simple testcase.

I will let you know, if I can come up with a *better* testcase to
reproduce the problem.

Thanks,
Badari