2004-03-11 16:02:19

by David Fort

[permalink] [raw]
Subject: Unkillable Zombie process under 2.6.3 and 2.6.4

Hi list,
i have some troubles with some totally unkillable zombie process:
Here's how i can get unkillable zombies debug multi-threaded program
using gdb and
in the execution my program popens a command, sometimes i get the
following gdb message

waiting for new child: No child processes.
(gdb)

And gdb give me back the prompt. I have the impression that the child
process has
been effectively launched.
If i ask gdb to continue the process goes on but the incriminated thread
looks freezed. When
in this state i can contact other threads, but gdb is stuck(Ctrl+C
doesn't work).

Killing -9 my program doesn't have any effect. But killing -9 gdb
effectivelly kills gdb
but not my program(which is a son of gdb). Shouldn't the kernel finish
the job with zombie
process when their father die ?(there's nobody to catch signals, or
return codes).

My big problem is that the faulty program keeps its binding sockets
opened, so i can't
launch anything on that ports.

--
Fort David, Projet IDsA
IRISA-INRIA, Campus de Beaulieu, 35042 Rennes cedex, France
T?l: +33 (0) 2 99 84 71 33



2004-03-12 13:55:09

by David Fort

[permalink] [raw]
Subject: Re: Unkillable Zombie process under 2.6.3 and 2.6.4

Andrew Morton wrote:

>Would you have time to prepare a little test app to demonstrate this?
>
>Thanks.
>
>
>
I've wrote this little app that do nothing complicated: it just launch a
thread that do popen in its body.
This programs sticks gdb completly, i don't know who is to blame gdb or
the kernel.
The fact is that there's something really strange here.
I'm trying to build a test app that can trigger the case where GDBed
process become unkillable zombies
(i have some still running on my box).

I've explored several ideas i had:
-related to TLS -> playing around with the tlsData var didn't show
anything
-SIGCHLD intercepted by the program and not caught by gdb -> even
without the signal handler i get the
bug

I'm gonna modify the app to test that the apps doesn't become unkillable
when it has a socket in WAIT_STATE.

Attached is a tarbal that contains: the program, the makefile and a
quite long report of the behaviour that i'm
seeing while debugging and my .config.

gcc version 3.3.2 (Mandrake Linux 10.0 3.3.2-6mdk)
glibc-2.3.3-10mdk
2.6.4 kernel

--
Fort David, Projet IDsA
IRISA-INRIA, Campus de Beaulieu, 35042 Rennes cedex, France
T?l: +33 (0) 2 99 84 71 33



Attachments:
kbug1.tar.bz2 (9.67 kB)

2004-03-12 14:06:25

by Christian Borntraeger

[permalink] [raw]
Subject: Re: Unkillable Zombie process under 2.6.3 and 2.6.4

David Fort wrote:
> I'm trying to build a test app that can trigger the case where GDBed
> process become unkillable zombies

Does it help to send a SIGCONT to all processes in T state?

cheers

Christian

2004-03-12 14:30:27

by David Fort

[permalink] [raw]
Subject: Re: Unkillable Zombie process under 2.6.3 and 2.6.4

Christian Borntraeger wrote:

>David Fort wrote:
>
>
>>I'm trying to build a test app that can trigger the case where GDBed
>>process become unkillable zombies
>>
>>
>
>Does it help to send a SIGCONT to all processes in T state?
>
>
>
No it doesn't, gdb loose its context when doing this(this triggers an
internal gdb error):

lin-lwp.c:653: internal-error: stop_wait_callback: Assertion `pid ==
GET_LWP (lp->ptid)' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.


Cheers

--
Fort David, Projet IDsA
IRISA-INRIA, Campus de Beaulieu, 35042 Rennes cedex, France
T?l: +33 (0) 2 99 84 71 33


2004-03-12 16:38:42

by David Fort

[permalink] [raw]
Subject: Re: Unkillable Zombie process under 2.6.3 and 2.6.4

David Fort wrote:
[...]

>>Does it help to send a SIGCONT to all processes in T state?
>>
>>
>>
> No it doesn't, gdb loose its context when doing this(this triggers an
> internal gdb error):
>
> lin-lwp.c:653: internal-error: stop_wait_callback: Assertion `pid ==
> GET_LWP (lp->ptid)' failed.
> A problem internal to GDB has been detected,
> further debugging may prove unreliable.
>
I was finally able to reproduce the unkillable zombie process. How to
reproduce it:
- launch gdb with the program in the attached tarball.
- once the program is running you have 5 seconds to telnet on
localhost on port 7899, and
sit on your keyboard.
-after 5 seconds the first popen occurs in one of the thread of kbug
which causes gdb or the kernel to bug(waiting for new
child: No child processes.)

[dfo@chiffre kbug]$ ~/cvs/gdb/gdb/gdb ./kbug
GNU gdb 2004-03-12-cvs
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...Using host libthread_db
library "/lib/tls/libthread_db.so.1".

(gdb) r 5 6
Starting program: /home/dfo/test/kbug/kbug 5 6
[Thread debugging using libthread_db enabled]
[New Thread 1073961248 (LWP 1882)]
[New Thread 1083698096 (LWP 1885)]
[New Thread 1092090800 (LWP 1886)]
1882(1083698096)do_task1: starting
1882(1073961248)main: starting
1882(1092090800)do_task1: starting
[New Thread 1100487600 (LWP 1888)]
1882(1100487600)remoteworker_thread_run: starting
remoteworker_thread_run(1100487600): readed 2 bytes [
]
[.. that's when i'm sending things via the telnet......]
]
remoteworker_thread_run(1100487600): readed 2 bytes [
]
waiting for new child: No child processes.
(gdb)



When here ask to quit, and gdb may hang:
(gdb) q
The program is running. Exit anyway? (y or n) y

Once this is done just kill gdb(killall gdb), here i get the following:
ps xwaf | grep kbug
5255 pts18 S 0:00 \_ grep kbug
1882 pts16 Z 0:00 [kbug] <defunct>

the kbug process became an unkillable zombie.
The provided trace is with the CVS of gdb but i have same behaviour with the
legacy gdb-6.0-2mdk

kbug:
kbug has a POPENFUNC which is a just call to popen(/bin/true)
threads of kbug:
-main thread creates things(other threads) and then call POPENFUNC
-task1 thread binds 7899, accepts the first connection and launch a
thead "remoteworker" to treat the incoming connection, once this
is done it calls POPENFUNC
-remoteworker "select" the incoming connection and prints what is send
-task2 does only POPENFUNC

I can provide any information that is needed.

--
Fort David, Projet IDsA
IRISA-INRIA, Campus de Beaulieu, 35042 Rennes cedex, France
T?l: +33 (0) 2 99 84 71 33



Attachments:
kbug2.tar.bz2 (10.32 kB)